
Metrics and Observability

This guide explains how to expose Prometheus metrics, what metrics exist, and how to build dashboards and alerts for Nexus GSLB.

Enable Prometheus endpoint

Add to /etc/gslb/config.yaml:

metrics:
  enablePrometheus: true
  listenAddr: "0.0.0.0"
  port: 9090
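
The endpoint is served at http://<listenAddr>:<port>/metrics in the Prometheus text exposition format. With listenAddr set to "0.0.0.0" the server listens on all IPv4 interfaces, so scrape it using the host's own address or hostname.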

Constant labels

- All metrics include cluster and node constant labels when cluster.id and/or node.id are set.
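
As a minimal sketch, the two identifiers might be set as shown here. The nesting is an assumption inferred from the dotted names cluster.id and node.id, and both values are placeholders; check the configuration reference for the exact layout.

```yaml
# /etc/gslb/config.yaml (sketch; key nesting is assumed from the dotted names)
cluster:
  id: "example-cluster"   # placeholder; expected to surface as the cluster label
node:
  id: "example-node-1"    # placeholder; expected to surface as the node label
```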

Key metrics (namespace: gslbd)

- DNS
  - gslbd_dns_requests_total
  - gslbd_dns_query_duration_seconds (histogram): end-to-end DNS handler latency
- GitOps
  - gslbd_gitops_fetch_total{result}
  - gslbd_gitops_verify_total{result}
  - gslbd_gitops_apply_total{result}
  - gslbd_gitops_last_apply_info{sha,signer}: value 1 for the last applied commit
- State sync (NATS/JetStream)
  - gslbd_state_nats_connected (0/1)
  - gslbd_state_nats_published_total{type}
  - gslbd_state_nats_received_total{type}
  - gslbd_state_kv_put_total{bucket,result}
  - gslbd_state_kv_get_total{bucket,result}
  - gslbd_state_merge_lag_ms (histogram)
  - gslbd_state_active_members (gauge)
- Health
  - gslbd_health_endpoints_total{family}: total endpoints by IP family (v4/v6)
  - gslbd_health_endpoints_healthy{family}: healthy endpoints by family
  - gslbd_pool_members_total{pool_id}: total tracked members per pool
  - gslbd_pool_members_healthy{pool_id}: healthy members per pool; alert when this drops to 0
  - gslbd_health_failover_detection_seconds{direction} (histogram): Layer 1 failover detection latency per state transition. direction="down" is healthy→unhealthy; direction="up" is recovery. Use this to verify that your checkInterval achieves the detection speed you expect.
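
The paired total and healthy gauges lend themselves to ratio queries. The Prometheus recording rules below are a sketch of that idea; the group name and the recorded series names are illustrative and not part of Nexus GSLB.

```yaml
groups:
  - name: gslbd-health-ratios          # illustrative group name
    rules:
      # Fraction of healthy members per pool (0.0 to 1.0).
      # Evaluates to NaN for a pool whose total is 0.
      - record: gslbd:pool_members_healthy:ratio
        expr: gslbd_pool_members_healthy / gslbd_pool_members_total
      # Fraction of healthy endpoints per IP family (v4/v6).
      - record: gslbd:health_endpoints_healthy:ratio
        expr: gslbd_health_endpoints_healthy / gslbd_health_endpoints_total
```

Each division matches series one-to-one on their shared labels, so the result keeps pool_id (or family) along with any constant labels.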

Scrape configuration example

scrape_configs:
  - job_name: 'gslbd'
    scrape_interval: 15s
    static_configs:
      - targets: ['gslbd-hostname:9090']

Grafana dashboard

Suggested panels:

- DNS RPS: rate(gslbd_dns_requests_total[1m])
- DNS p99 latency: histogram_quantile(0.99, sum(rate(gslbd_dns_query_duration_seconds_bucket[5m])) by (le))
- NATS connectivity: gslbd_state_nats_connected
- Merge lag p95: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le,cluster,node))
- Pool health: gslbd_pool_members_healthy and gslbd_pool_members_total per pool, as a heatmap or bar chart
- Failover detection p95: histogram_quantile(0.95, sum(rate(gslbd_health_failover_detection_seconds_bucket[15m])) by (le, direction)), which shows how quickly the daemon is confirming state changes
- Active members over time: gslbd_state_active_members
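
If the quantile panels become slow to render, one option is to precompute them with Prometheus recording rules and point the panels at the recorded series. This is a sketch only: the group name, evaluation interval, and recorded series names are assumptions chosen for illustration.

```yaml
groups:
  - name: gslbd-dashboard              # illustrative group name
    interval: 30s                      # assumed evaluation interval
    rules:
      # Total DNS request rate across all scraped series.
      - record: gslbd:dns_requests:rate1m
        expr: sum(rate(gslbd_dns_requests_total[1m]))
      # p99 DNS handler latency over a 5m window.
      - record: gslbd:dns_query_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum(rate(gslbd_dns_query_duration_seconds_bucket[5m])) by (le))
      # p95 merge lag per cluster and node over a 5m window.
      - record: gslbd:state_merge_lag_ms:p95_5m
        expr: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le, cluster, node))
```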

Example alerts

groups:
- name: gslbd
  rules:
  - alert: GslbdNATSDisconnected
    expr: gslbd_state_nats_connected == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: NATS connection down
  - alert: GslbdMergeLagHigh
    expr: histogram_quantile(0.95, sum(rate(gslbd_state_merge_lag_ms_bucket[5m])) by (le)) > 2000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High merge lag (p95)
  - alert: GslbdActiveMembersZero
    expr: gslbd_state_active_members == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: No active members detected
  - alert: GslbdPoolDegraded
    expr: gslbd_pool_members_healthy < 1
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: Pool {{ $labels.pool_id }} has no healthy members; DNS will return no records
  - alert: GslbdFailoverDetectionSlow
    # Adjust the threshold to 2× your checkInterval. Default checkInterval=10s → threshold=20s.
    expr: histogram_quantile(0.95, sum(rate(gslbd_health_failover_detection_seconds_bucket[15m])) by (le, direction)) > 20
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Failover detection p95 ({{ $value }}s, direction {{ $labels.direction }}) exceeds the threshold of 2× checkInterval
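
Alert rules like these can be checked before deployment with Prometheus rule unit tests (promtool test rules). The test file below is a sketch that exercises only GslbdNATSDisconnected; the rule file name gslbd-alerts.yml and the label values c1 and n1 are hypothetical.

```yaml
# gslbd-alerts.test.yml (hypothetical file name)
rule_files:
  - gslbd-alerts.yml                   # assumed to contain the rules above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # NATS reported as disconnected for six consecutive samples (0m to 5m).
      - series: 'gslbd_state_nats_connected{cluster="c1", node="n1"}'
        values: '0+0x5'
    alert_rule_test:
      # After 1m the alert is still pending (for: 2m), so nothing should fire.
      - eval_time: 1m
        alertname: GslbdNATSDisconnected
      # After 3m the "for" duration has elapsed and the alert should fire.
      - eval_time: 3m
        alertname: GslbdNATSDisconnected
        exp_alerts:
          - exp_labels:
              severity: warning
              cluster: c1
              node: n1
            exp_annotations:
              summary: NATS connection down
```

Run it with promtool test rules gslbd-alerts.test.yml from the directory that holds both files.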

Troubleshooting

- No metrics visible: check that the server log shows the metrics server started, then verify that the port is open and not blocked by a firewall.
- Missing labels: ensure cluster.id and node.id are set; metrics include them as constant labels.