Administering installed services

Below are recommendations on administering the Urbi Pro service installed as part of an On-Premise installation.

Backing up infrastructure databases

Backing up the PostgreSQL databases of the Urbi Pro service is recommended, as they contain unique data.
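
For example, a logical backup can be taken with pg_dump. The sketch below uses placeholder host, user, and database names, which depend on your installation:

    # Dump a single database in the compressed custom format (host, user, and database are placeholders)
    pg_dump -h <postgres-host> -U <user> -Fc -f /backups/urbi-pro-$(date +%F).dump <urbi_pro_database>

    # Restore the dump into an existing database if needed
    pg_restore -h <postgres-host> -U <user> -d <urbi_pro_database> /backups/<dump-file>

Run such dumps on a schedule (for example, via cron or a Kubernetes CronJob) and store the resulting files outside the database host.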

Collecting logs

Below are recommendations for collecting and sending logs from services within an On-Premise installation.

Log format

On-Premise services in Kubernetes write logs to stdout in JSON or Plaintext format. In some cases, a combination of both formats is used, allowing flexible log output depending on system needs.
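
Purely for illustration, a JSON record and a plaintext record might look like this; the field names and values are hypothetical and the actual structure depends on the service:

    {"time":"2024-01-01T12:00:00Z","level":"info","service":"api","message":"request processed","status":200}
    2024-01-01T12:00:00Z INFO request processed status=200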

Tools for collecting logs

To collect and send logs, installing one of the following agents is recommended:

  • Fluent Bit
  • Filebeat

Storing and analyzing logs

For centralized log collection, storage, and analysis, using the following tools is recommended:

  • Elasticsearch + Kibana (ELK stack). Suitable for complex queries and long-term log storage; provides the following features:

    • full-text search and log analytics
    • flexible dashboards and visualizations in Kibana

    How it works: agents (Fluent Bit, Filebeat) collect logs from nodes and send them to Elasticsearch. Then data is indexed and becomes available in Kibana via index patterns (see the configuration sketch after this list).

  • Grafana Loki. Optimized for working with Kubernetes logs; provides the following features:

    • cost-effective log storage (uses object storage, e.g., MinIO)
    • integration with Grafana for combined log and metric analysis.

    How it works: agents (Fluent Bit) send logs to Loki with labels for quick search. Then logs become available in Grafana via LogQL queries (see the query examples after this list).
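
As an illustration of the ELK scheme, a minimal Fluent Bit pipeline could look as follows. This is a sketch only: the log path, the Elasticsearch host, and the use of Logstash-style index names are assumptions that must be adjusted to your installation.

    [INPUT]
        # Read container logs from the node (path is an assumption)
        Name    tail
        Path    /var/log/containers/*.log
        Tag     kube.*

    [FILTER]
        # Enrich records with Kubernetes metadata (namespace, pod, labels)
        Name    kubernetes
        Match   kube.*

    [OUTPUT]
        # Ship logs to Elasticsearch; host and port are placeholders
        Name            es
        Match           kube.*
        Host            elasticsearch.logging.svc
        Port            9200
        Logstash_Format On

For the Loki scheme, logs can then be queried in Grafana with LogQL, for example as below; the namespace and app labels are assumptions and depend on how the agent labels the log streams.

    # All log lines from a namespace that contain the word "error"
    {namespace="urbi-pro"} |= "error"

    # Per-application rate of error lines over the last 5 minutes
    sum by (app) (rate({namespace="urbi-pro"} |= "error" [5m]))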

Monitoring services

On-Premise services expose metrics in the Prometheus format. The metrics endpoints can be served over HTTPS or otherwise protected, and the format integrates easily with most modern monitoring systems.
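
In the Prometheus text exposition format, a metric looks like this (the metric name, labels, and value below are illustrative):

    # HELP http_requests_total Total number of HTTP requests processed
    # TYPE http_requests_total counter
    http_requests_total{method="GET",code="200"} 1027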

Below are recommendations for configuring service monitoring.

Monitoring methods

It is recommended to use two main monitoring methods (example PromQL queries are sketched after the list):

  • For APIs and services in Kubernetes - the RED method:

    • Rate (Request rate) - the number of requests per second.
    • Errors - the number or share of failed requests.
    • Duration - the time it takes to process a request.
  • For databases and storage on virtual machines - the USE method:

    • Utilization (Resource usage) - how busy the resource is.
    • Saturation - the amount of extra work the resource cannot service yet (queues).
    • Errors - the number of error events.
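
The PromQL sketches below illustrate both methods; metric names such as http_requests_total, http_request_duration_seconds_bucket, and node_cpu_seconds_total are assumptions and depend on the instrumentation and exporters in your installation.

    # RED - Rate: requests per second for a service (metric name is an assumption)
    sum(rate(http_requests_total{service="api"}[5m]))

    # RED - Errors: share of 5xx responses
    sum(rate(http_requests_total{service="api", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="api"}[5m]))

    # RED - Duration: p99 latency from a histogram
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="api"}[5m])))

    # USE - Utilization: node CPU usage (assumes node_exporter)
    1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))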

Tools

For a complete monitoring cycle (collection, storage, visualization, and alerting), using the following tools is recommended:

  • Metrics collection and storage: Prometheus - provides flexible queries (PromQL language) and reliable data storage.
  • Visualization: Grafana - offers dashboards with customizable queries.
  • Alerting: AlertManager - allows sending notifications via e-mail, Slack, and other channels when thresholds are exceeded.
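
As an example of the alerting part, a Prometheus alerting rule such as the one below can be routed through AlertManager to e-mail or Slack. The metric name and the 5% threshold are assumptions; adjust them to your services.

    groups:
      - name: api-alerts
        rules:
          - alert: HighErrorRate
            # Fire when more than 5% of requests return 5xx for 10 minutes
            expr: |
              sum(rate(http_requests_total{code=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: High 5xx error rate on API services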

Collecting metrics

Collecting metrics directly from service pods is recommended. If your Kubernetes cluster uses Prometheus with Kubernetes SD (Service Discovery), add the following annotations to the podAnnotations parameter in the Helm charts of the services (a combined example follows the list):

  • prometheus.io/scrape: "true" - enable metrics collection from this pod.
  • prometheus.io/path: "/metrics" - the path where metrics are available.
  • prometheus.io/port: "80" - the pod port on which metrics are exposed.
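
Together, these annotations might look like this in a service's values.yaml; whether the chart exposes podAnnotations at the top level or under a subchart depends on the specific service:

    podAnnotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "80"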

The following tools are recommended for collecting metrics:

API service metrics in Kubernetes

When monitoring API services in Kubernetes, consider the following key indicators, which help assess system performance and stability and detect potential issues (example PromQL queries are sketched after the list):

  • Network indicators:

    • RPS (Requests Per Second) - the number of requests per second passing through Ingress or Service. It is important to collect metrics specifically from the Ingress controller, as it shows the total incoming traffic.

    • Latency - request processing latency:

      • p50 - median.
      • p90 - 90th percentile.
      • p99 - 99th percentile. It is important to monitor p99 growth, as it indicates the most critical delays.
    • HTTP response codes:

      • 2xx - successful requests.
      • 4xx - client errors. It is important to monitor increases in 429 errors (Too Many Requests).
      • 5xx - server errors. The number of these errors should be minimal.
  • Container resources (CPU and RAM):

    • CPU Usage (%) - actual CPU resource consumption by the container.
    • CPU Throttling - if the container exceeds CPU limits (limits.cpu), Kubernetes restricts its execution.
    • Memory Usage (RAM, MB/GB) - current memory usage.
    • OOMKills (Out of Memory Kills) - if the container exceeds the memory limit (limits.memory), Kubernetes may terminate it.
  • Node resources:

    • Total CPU and memory load on the node - if the node is overloaded, pods may experience issues.
    • Disk utilization - disk space usage and disk performance (IOPS, read/write speed).
  • Network errors and connection errors:

    • Connection Errors - errors in connections between services (e.g., Connection reset, Timeout).
    • DNS Latency & Failures - delays and errors in DNS requests (CoreDNS).
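
The PromQL sketches below show how several of these indicators can be computed. The ingress metric name (nginx_ingress_controller_requests) assumes the NGINX Ingress Controller, and the container metrics assume that cAdvisor/kubelet metrics are scraped; adjust them to your installation.

    # RPS through the Ingress controller (assumes ingress-nginx)
    sum(rate(nginx_ingress_controller_requests[5m]))

    # Share of 5xx responses at the Ingress
    sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m]))
      / sum(rate(nginx_ingress_controller_requests[5m]))

    # CPU throttling ratio per container (cAdvisor metrics)
    sum by (pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
      / sum by (pod, container) (rate(container_cpu_cfs_periods_total[5m]))

    # Current memory usage per container
    sum by (pod, container) (container_memory_working_set_bytes)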