Metrics and Observability

This guide explains how to expose Prometheus metrics, what metrics exist, and how to build dashboards and alerts for Pleiades GSLB.

Enable Prometheus endpoint

Add to /etc/gslb/config.yaml:

metrics:
  enablePrometheus: true
  listenAddr: "0.0.0.0"
  port: 9090

The endpoint is served at http://<listenAddr>:<port>/metrics in the Prometheus text exposition format.

Constant labels

- All metrics include the cluster and node constant labels when cluster.id and node.id, respectively, are set.
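
As a minimal sketch, the identifiers might be set as shown below. The nesting is an assumption read from the dotted names cluster.id and node.id, and the values eu-west-1 and gslb-01 are hypothetical placeholders.

```yaml
# Assumed layout in /etc/gslb/config.yaml: the dotted names cluster.id and
# node.id are interpreted here as nested keys. Values are placeholders.
cluster:
  id: "eu-west-1"
node:
  id: "gslb-01"
```

With both set, a series would be expected to appear as gslbd_dns_requests_total{cluster="eu-west-1",node="gslb-01"}.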

Key metrics (namespace: gslbd)

- DNS
  - gslbd_dns_requests_total
- GitOps
  - gslbd_gitops_fetch_total{result}
  - gslbd_gitops_verify_total{result}
  - gslbd_gitops_apply_total{result}
  - gslbd_gitops_last_apply_info{sha,signer} (value 1 for the last applied commit)
- State sync (NATS/JetStream)
  - gslbd_state_nats_connected (0/1)
  - gslbd_state_nats_published_total{type}
  - gslbd_state_nats_received_total{type}
  - gslbd_state_kv_put_total{bucket,result}
  - gslbd_state_kv_get_total{bucket,result}
  - gslbd_state_merge_lag_ms (histogram)
  - gslbd_state_active_members (gauge)
- Health
  - gslbd_health_endpoints_total{family}
  - gslbd_health_endpoints_healthy{family}

Scrape configuration example

scrape_configs:
  - job_name: 'gslbd'
    scrape_interval: 15s
    static_configs:
      - targets: ['gslbd-hostname:9090']
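
When several nodes are scraped, the same job can list every target. The sketch below assumes two hypothetical hosts; Prometheus adds job and instance labels on its own, so the cluster and node constant labels are preserved alongside them.

```yaml
scrape_configs:
  - job_name: 'gslbd'
    scrape_interval: 15s
    metrics_path: /metrics   # Prometheus default, shown for clarity
    static_configs:
      - targets:
          - 'gslbd-a.example.net:9090'   # hypothetical host
          - 'gslbd-b.example.net:9090'   # hypothetical host
```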

Grafana dashboard

Suggested panels:

- DNS RPS: rate of gslbd_dns_requests_total.
- NATS connectivity: gslbd_state_nats_connected.
- Merge lag p95: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le,cluster,node)).
- Health totals and healthy by family: gauges/graphs from gslbd_health_*.
- Active members over time: gslbd_state_active_members.
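
The panel queries can optionally be precomputed as Prometheus recording rules so dashboards read cheap, ready-made series. This is a sketch, not part of gslbd: the rule names are hypothetical, and the healthy ratio assumes both gslbd_health_* metrics are gauges carrying the same labels.

```yaml
groups:
- name: gslbd-recording
  rules:
  # DNS requests per second, per node.
  - record: gslbd:dns_requests:rate5m
    expr: sum by (cluster, node) (rate(gslbd_dns_requests_total[5m]))
  # p95 state merge lag in milliseconds, per node.
  - record: gslbd:state_merge_lag_ms:p95_5m
    expr: histogram_quantile(0.95, sum by (le, cluster, node) (rate(gslbd_state_merge_lag_ms_bucket[5m])))
  # Fraction of healthy endpoints per address family (assumes gauges).
  - record: gslbd:health_endpoints:healthy_ratio
    expr: sum by (cluster, node, family) (gslbd_health_endpoints_healthy) / sum by (cluster, node, family) (gslbd_health_endpoints_total)
```

A Grafana panel can then query gslbd:dns_requests:rate5m directly instead of re-evaluating the rate on every refresh.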

Example alerts

groups:
- name: gslbd
  rules:
  - alert: GslbdNATSDisconnected
    expr: gslbd_state_nats_connected == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: NATS connection down
  - alert: GslbdMergeLagHigh
    expr: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le)) > 2000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High merge lag (p95)
  - alert: GslbdActiveMembersZero
    expr: gslbd_state_active_members == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: No active members detected
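
Alert rules like these can be checked offline with promtool unit tests before they are deployed. The sketch below assumes the rules above are saved as gslbd-alerts.yml (a hypothetical file name) and that the series carries hypothetical cluster and node label values.

```yaml
# Run with: promtool test rules gslbd-alerts_test.yml
rule_files:
  - gslbd-alerts.yml   # hypothetical file holding the alert group above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # NATS reported as disconnected for ten minutes.
      - series: 'gslbd_state_nats_connected{cluster="c1",node="n1"}'
        values: '0+0x10'
    alert_rule_test:
      # After 1m the alert is only pending (for: 2m), so nothing fires.
      - eval_time: 1m
        alertname: GslbdNATSDisconnected
        exp_alerts: []
      # After 3m the 2m hold period has elapsed and the alert fires.
      - eval_time: 3m
        alertname: GslbdNATSDisconnected
        exp_alerts:
          - exp_labels:
              severity: warning
              cluster: c1
              node: n1
            exp_annotations:
              summary: NATS connection down
```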

Troubleshooting

- No metrics visible: check that the server log shows the metrics server started, then verify that the port is open and not blocked by a firewall.
- Missing labels: ensure cluster.id and node.id are set; metrics include them as constant labels.
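
An unreachable endpoint can also be caught from the Prometheus side. The rule below is a sketch that relies on the built-in up series and assumes the job name gslbd from the scrape configuration example; the alert name GslbdTargetDown is hypothetical.

```yaml
groups:
- name: gslbd-scrape
  rules:
  - alert: GslbdTargetDown
    # up is written by Prometheus itself: 0 when the last scrape of a target failed.
    expr: up{job="gslbd"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Prometheus cannot scrape the gslbd metrics endpoint
```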