Metrics and Observability
This guide explains how to expose Prometheus metrics, what metrics exist, and how to build dashboards and alerts for Nexus GSLB.
Enable Prometheus endpoint
Add to /etc/gslb/config.yaml:
Once enabled, the daemon serves metrics at http://<listenAddr>:<port>/metrics in the Prometheus text exposition format.
Constant labels
- All metrics carry a cluster constant label when cluster.id is set and a node constant label when node.id is set.
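To confirm that the endpoint responds and that the constant labels are attached, a small script can parse the exposition text and list any metric missing them. This is a minimal sketch using only the Python standard library. The sample values, the label values c1 and n1, and the helper names are illustrative assumptions, and the parser handles only the simple `name{labels} value` form.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces; escaped characters are kept as-is.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative exposition text (values and label values are made up).
SAMPLE = """\
# HELP gslbd_dns_requests_total Hypothetical help text.
# TYPE gslbd_dns_requests_total counter
gslbd_dns_requests_total{cluster="c1",node="n1"} 42
gslbd_state_active_members{cluster="c1",node="n1"} 3
gslbd_state_nats_connected 1
"""


def fetch_metrics(url):
    """Fetch the raw exposition text from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def missing_constant_labels(samples, required=("cluster", "node")):
    """Return the names of metrics that lack any of the required labels."""
    return sorted({name for name, labels, _ in samples
                   if any(key not in labels for key in required)})


if __name__ == "__main__":
    # Against a live daemon, replace SAMPLE with
    # fetch_metrics("http://<listenAddr>:<port>/metrics").
    print(missing_constant_labels(parse_metrics(SAMPLE)))
```

An empty list means every parsed sample carries both labels. Any names printed point at metrics exported while cluster.id or node.id was unset.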
Key metrics (namespace: gslbd)
- DNS
  - gslbd_dns_requests_total
  - gslbd_dns_query_duration_seconds (histogram): end-to-end DNS handler latency
- GitOps
  - gslbd_gitops_fetch_total{result}
  - gslbd_gitops_verify_total{result}
  - gslbd_gitops_apply_total{result}
  - gslbd_gitops_last_apply_info{sha,signer}: value 1 for the last applied commit
- State sync (NATS/JetStream)
  - gslbd_state_nats_connected (0/1)
  - gslbd_state_nats_published_total{type}
  - gslbd_state_nats_received_total{type}
  - gslbd_state_kv_put_total{bucket,result}
  - gslbd_state_kv_get_total{bucket,result}
  - gslbd_state_merge_lag_ms (histogram)
  - gslbd_state_active_members (gauge)
- Health
  - gslbd_health_endpoints_total{family}: total endpoints by IP family (v4/v6)
  - gslbd_health_endpoints_healthy{family}: healthy endpoints by family
  - gslbd_pool_members_total{pool_id}: total tracked members per pool
  - gslbd_pool_members_healthy{pool_id}: healthy members per pool; alert when this drops to 0
  - gslbd_health_failover_detection_seconds{direction} (histogram): Layer 1 failover detection latency per state transition. direction="down" is a healthy→unhealthy transition; direction="up" is a recovery. Use it to verify that your checkInterval achieves the detection speed you expect.
Scrape configuration example
scrape_configs:
  - job_name: 'gslbd'
    scrape_interval: 15s
    static_configs:
      - targets: ['gslbd-hostname:9090']
Grafana dashboard
- Suggested panels:
  - DNS RPS: rate(gslbd_dns_requests_total[1m])
  - DNS p99 latency: histogram_quantile(0.99, sum(rate(gslbd_dns_query_duration_seconds_bucket[5m])) by (le))
  - NATS connectivity: gslbd_state_nats_connected
  - Merge lag p95: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le,cluster,node))
  - Pool health: gslbd_pool_members_healthy and gslbd_pool_members_total per pool, as a heatmap or bar chart
  - Failover detection p95: histogram_quantile(0.95, sum(rate(gslbd_health_failover_detection_seconds_bucket[15m])) by (le, direction)), which shows how quickly the daemon confirms state changes
  - Active members over time: gslbd_state_active_members
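The p95 and p99 panels above rely on histogram_quantile(), which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that idea for classic cumulative buckets, not the Prometheus implementation itself. The bucket bounds and counts are invented for illustration, and it assumes positive bucket bounds such as latencies.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls beyond the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    # The lowest bucket is assumed to start at 0 (positive bounds only).
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Linear interpolation between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Illustrative cumulative buckets: (le, count).
    example = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
    print(histogram_quantile(0.5, example))
    print(histogram_quantile(0.99, example))
```

In the dashboard expressions the counts are per-second rates summed by le rather than raw counts, but the interpolation is the same. When the rank lands in the +Inf bucket the estimate is capped at the highest finite bound, which is why a p99 panel can flatten at the largest configured bucket.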
Example alerts
groups:
  - name: gslbd
    rules:
      - alert: GslbdNATSDisconnected
        expr: gslbd_state_nats_connected == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: NATS connection down
      - alert: GslbdMergeLagHigh
        expr: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le)) > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High merge lag (p95)
      - alert: GslbdActiveMembersZero
        expr: gslbd_state_active_members == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: No active members detected
      - alert: GslbdPoolDegraded
        expr: gslbd_pool_members_healthy < 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: Pool {{ $labels.pool_id }} has no healthy members; DNS will return no records
      - alert: GslbdFailoverDetectionSlow
        # Adjust the threshold to 2× your checkInterval. Default checkInterval=10s → threshold=20s.
        expr: histogram_quantile(0.95, sum(rate(gslbd_health_failover_detection_seconds_bucket[15m])) by (le, direction)) > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Failover detection p95 exceeds expected check interval ({{ $value }}s)
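Before loading rules like these, it can help to run an alert expression as an instant query and see which series would currently match. The sketch below uses the Prometheus HTTP API endpoint /api/v1/query with the Python standard library. The Prometheus address, the pool name web, and the sample response are assumptions for illustration only.

```python
import json
import urllib.parse
import urllib.request

# Illustrative response in the shape returned by /api/v1/query.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pool_id": "web"}, "value": [1700000000.0, "0"]},
        ],
    },
}


def instant_query(base_url, expr):
    """Run an instant query against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def matching_series(payload):
    """Return (labels, value) for each series in an instant-vector result."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each value is [timestamp, "string value"]; convert the value to float.
    return [(item["metric"], float(item["value"][1]))
            for item in data["result"]]


if __name__ == "__main__":
    # Against a live server (hypothetical address):
    # payload = instant_query("http://prometheus:9090",
    #                         "gslbd_pool_members_healthy < 1")
    for labels, value in matching_series(SAMPLE_RESPONSE):
        print(labels.get("pool_id"), value)
```

An empty result means no series currently satisfies the expression. Each returned series is one that would move the alert to pending once the rule is loaded.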
Troubleshooting
- No metrics visible: check that the server log shows the metrics server started, then verify that the port is reachable and not blocked by a firewall.
- Missing labels: ensure cluster.id and node.id are set; metrics include them as constant labels.