Metrics and Observability

This guide explains how to expose Prometheus metrics, what metrics exist, and how to build dashboards and alerts for Nexus GSLB.

Enable Prometheus endpoint

Add to /etc/gslb/config.yaml:

metrics:
  enablePrometheus: true
  listenAddr: "0.0.0.0"
  port: 9090

The endpoint is served at http://<listenAddr>:<port>/metrics and exports Prometheus text format.

Constant labels

All metrics include cluster and node constant labels when cluster.id and/or node.id are set.
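As a minimal sketch, the two settings might look like this in /etc/gslb/config.yaml. This assumes that the dotted names cluster.id and node.id map to nested YAML keys in the same way the metrics block does; the values eu-west and gslb-01 are hypothetical placeholders, not defaults.

```yaml
# Assumption: cluster.id and node.id are written as nested keys.
# Both values below are illustrative placeholders.
cluster:
  id: "eu-west"   # would appear on every metric as the cluster label
node:
  id: "gslb-01"   # would appear on every metric as the node label
```

With both keys set, queries and alerts can group or filter by cluster and node without any relabeling on the Prometheus side.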

Key metrics (namespace: gslbd)

- DNS
  - gslbd_dns_requests_total
  - gslbd_dns_query_duration_seconds (histogram): end-to-end DNS handler latency
- GitOps
  - gslbd_gitops_fetch_total{result}
  - gslbd_gitops_verify_total{result}
  - gslbd_gitops_apply_total{result}
  - gslbd_gitops_last_apply_info{sha,signer}: value 1 for the last applied commit
- State sync (NATS/JetStream)
  - gslbd_state_nats_connected (0/1)
  - gslbd_state_nats_published_total{type}
  - gslbd_state_nats_received_total{type}
  - gslbd_state_kv_put_total{bucket,result}
  - gslbd_state_kv_get_total{bucket,result}
  - gslbd_state_merge_lag_ms (histogram)
  - gslbd_state_active_members (gauge)
- Health
  - gslbd_health_endpoints_total{family}: total endpoints by IP family (v4/v6)
  - gslbd_health_endpoints_healthy{family}: healthy endpoints by family
  - gslbd_pool_members_total{pool_id}: total tracked members per pool
  - gslbd_pool_members_healthy{pool_id}: healthy members per pool; alert when this drops to 0
  - gslbd_health_failover_detection_seconds{direction} (histogram): Layer 1 failover detection latency per state transition. direction="down" = healthy→unhealthy; direction="up" = recovery. Use this to verify that your checkInterval achieves the detection speed you expect.

Scrape configuration example

scrape_configs:
  - job_name: 'gslbd'
    scrape_interval: 15s
    static_configs:
      - targets: ['gslbd-hostname:9090']

Grafana dashboard

Suggested panels:

- DNS RPS: rate(gslbd_dns_requests_total[1m])
- DNS p99 latency: histogram_quantile(0.99, sum(rate(gslbd_dns_query_duration_seconds_bucket[5m])) by (le))
- NATS connectivity: gslbd_state_nats_connected
- Merge lag p95: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le,cluster,node))
- Pool health: gslbd_pool_members_healthy and gslbd_pool_members_total per pool, shown as a heatmap or bar chart
- Failover detection p95: histogram_quantile(0.95, sum(rate(gslbd_health_failover_detection_seconds_bucket[15m])) by (le, direction)), which shows how quickly the daemon is confirming state changes
- Active members over time: gslbd_state_active_members
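If the quantile panels are slow to render, the same expressions can be precomputed with Prometheus recording rules. The sketch below is one possible layout, not part of Nexus GSLB itself: the group name and the recorded series names (gslbd:...) are hypothetical and only follow the usual level:metric:operations naming convention.

```yaml
groups:
  - name: gslbd-dashboard        # hypothetical group name
    interval: 30s
    rules:
      # DNS requests per second, same expression as the "DNS RPS" panel.
      - record: gslbd:dns_requests:rate1m
        expr: rate(gslbd_dns_requests_total[1m])
      # DNS p99 latency, same expression as the "DNS p99 latency" panel.
      - record: gslbd:dns_query_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(gslbd_dns_query_duration_seconds_bucket[5m])) by (le))
      # Fraction of healthy members per pool (1 = fully healthy).
      # Evaluates to NaN for a pool whose total is 0.
      - record: gslbd:pool_members_healthy:ratio
        expr: gslbd_pool_members_healthy / gslbd_pool_members_total
```

A Grafana panel can then query the recorded series directly, for example gslbd:pool_members_healthy:ratio, instead of re-evaluating the full expression on every refresh.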

Example alerts

groups:
  - name: gslbd
    rules:
      - alert: GslbdNATSDisconnected
        expr: gslbd_state_nats_connected == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: NATS connection down
      - alert: GslbdMergeLagHigh
        expr: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le)) > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High merge lag (p95)
      - alert: GslbdActiveMembersZero
        expr: gslbd_state_active_members == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: No active members detected
      - alert: GslbdPoolDegraded
        expr: gslbd_pool_members_healthy < 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Pool {{ $labels.pool_id }} has no healthy members; DNS will return no records"
      - alert: GslbdFailoverDetectionSlow
        # Adjust the threshold to 2× your checkInterval. Default checkInterval=10s → threshold=20s.
        expr: histogram_quantile(0.95, sum(rate(gslbd_health_failover_detection_seconds_bucket[15m])) by (le, direction)) > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Failover detection p95 exceeds expected check interval ({{ $value }}s)"
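Alert rules like these can be checked before deployment with promtool's rule unit tests. The sketch below covers only GslbdNATSDisconnected and assumes the rules above are saved as alerts.yml next to the test file; both file names are hypothetical.

```yaml
# alerts_test.yml (hypothetical file name)
rule_files:
  - alerts.yml               # assumed location of the alert rules above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Simulate a node that reports a lost NATS connection for 5 minutes.
      - series: 'gslbd_state_nats_connected'
        values: '0+0x5'
    alert_rule_test:
      # After 1m the alert is still pending (for: 2m), so nothing fires yet.
      - eval_time: 1m
        alertname: GslbdNATSDisconnected
        exp_alerts: []
      # After 3m the 2m hold period has elapsed and the alert fires.
      - eval_time: 3m
        alertname: GslbdNATSDisconnected
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              summary: NATS connection down
```

Run it with promtool test rules alerts_test.yml. The same pattern extends to the other alerts by adding input series for their metrics.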

Troubleshooting

- No metrics visible: ensure the server log shows that the metrics server started; verify that the port is open and not blocked by a firewall.
- Missing labels: ensure cluster.id and node.id are set; metrics include them as constant labels.