
Operations & Runbooks

This guide covers day-2 operations: monitoring, troubleshooting, routine tasks, and failure handling.

Monitoring

Prometheus metrics (see docs/Metrics.md). Suggested dashboards:

  • DNS throughput: gslbd_dns_requests_total rate.
  • Health status: gslbd_health_endpoints_healthy{family}.
  • NATS connectivity: gslbd_state_nats_connected.
  • Merge lag p95: histogram quantiles on gslbd_state_merge_lag_ms.
  • Active members: gslbd_state_active_members.
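The suggested dashboards can be sketched as PromQL expressions queried through the Prometheus HTTP API. This is a minimal sketch, not a shipped dashboard definition: the Prometheus address, the 5m rate window, the aggregations, and the _bucket suffix on the merge lag histogram (the usual Prometheus histogram naming) are assumptions to check against docs/Metrics.md.

```python
from urllib.parse import urlencode

# Assumed Prometheus address; replace with your own server.
PROMETHEUS_URL = "http://localhost:9090"

# Dashboard panel -> PromQL expression.
# Assumptions: 5m rate window, sum aggregations, and the conventional
# "_bucket" series name for the merge lag histogram.
DASHBOARD_QUERIES = {
    "dns_throughput": "sum(rate(gslbd_dns_requests_total[5m]))",
    "healthy_endpoints": "sum by (family) (gslbd_health_endpoints_healthy)",
    "nats_connected": "gslbd_state_nats_connected",
    "merge_lag_p95_ms": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(gslbd_state_merge_lag_ms_bucket[5m])))"
    ),
    "active_members": "gslbd_state_active_members",
}


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Print one URL per panel; fetch them with curl or a browser.
    for name, expr in DASHBOARD_QUERIES.items():
        print(f"{name}: {instant_query_url(expr)}")
```

The same expressions can be pasted directly into a Grafana panel or the Prometheus expression browser.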

Logs

  • GitOps reconcile logs: fetch/verify/apply outcomes and errors.
  • NATS errors: connection/disconnect and subscriber drain errors.
  • DNS errors: server start/stop issues.

Routine tasks

  • Rotate GPG keys: update trust store on nodes; update allowedSigners list; merge a new signed commit to test.
  • Rotate NATS creds/certs: generate new creds; restart gslbd or trigger reconnect.
  • Update endpoints: commit to Git repo; watcher applies within pollInterval.

Troubleshooting

Start with Metrics

When troubleshooting, check the Prometheus metrics first; they often reveal the root cause faster than logs.

  • No DNS answers:
      • Check the licensing limit: look for SERVFAIL responses and rate limit logs.
      • Verify that endpoints are configured and healthy (gslbd_health_endpoints_healthy).
      • Confirm the queried name falls under dns.domain and that the suffix matches exactly, including the trailing dot.
  • GitOps changes not applied:
      • Verify signatures (gslbd_gitops_verify_total{result="error"}) and the signer allowlist.
      • Check validation errors and logs.
      • Ensure pathPrefix and the repo URL are correct; test git access on the host.
  • State sync not working:
      • Check gslbd_state_nats_connected.
      • Verify NATS URLs and TLS/auth settings; test with the nats CLI.
      • Confirm JetStream is enabled and the KV buckets are created.
  • Global quorum policy seems off:
      • Ensure heartbeatInterval/heartbeatTTL reflect reality; active membership must be non-zero.
      • Validate quorumMinPercent and that enough nodes are reporting within the TTL.
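To reason about the quorum checks above, it can help to work through the arithmetic by hand. The sketch below assumes one plausible interpretation: a node counts as active if its last heartbeat is within heartbeatTTL, and quorum holds when the active share of expected members is at least quorumMinPercent. How gslbd actually evaluates the policy is not specified here, so the function names and the rule itself are hypothetical.

```python
import time


def active_members(last_heartbeat, ttl_seconds, now=None):
    """Return the sorted node IDs whose last heartbeat is within the TTL.

    last_heartbeat maps node ID -> Unix timestamp of the last heartbeat.
    """
    now = time.time() if now is None else now
    return sorted(
        node for node, ts in last_heartbeat.items() if now - ts <= ttl_seconds
    )


def quorum_met(active_count, expected_count, min_percent):
    """Assumed rule: at least one active member, and the active share of
    expected members is >= min_percent."""
    if active_count <= 0 or expected_count <= 0:
        return False
    # Integer comparison avoids floating-point rounding at the boundary.
    return 100 * active_count >= min_percent * expected_count


if __name__ == "__main__":
    now = 1000.0
    # Hypothetical heartbeat timestamps for three nodes.
    beats = {"node-a": 995.0, "node-b": 980.0, "node-c": 940.0}
    active = active_members(beats, ttl_seconds=30, now=now)
    print("active:", active)
    print("quorum:", quorum_met(len(active), len(beats), min_percent=60))
```

If heartbeatTTL is shorter than the real heartbeat interval plus network delay, nodes drop out of the active set between heartbeats and the policy flaps even though every node is up.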

Maintenance windows

  • Use GitOps to stage config changes with signed commits.
  • Consider "freeze windows" by temporarily pausing GitOps changes (set repoURL empty or pause at network layer).

Testing WAN partitions

  • Use tc/netem to inject latency/loss; observe merge lag and policy behavior.
  • Ensure local-only fallback serves queries when NATS is unreachable.
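The tc/netem step can be sketched as a small helper that builds the commands for injecting and removing impairment. The helper names, the interface name eth0, and the delay and loss values are illustrative assumptions. The script only prints the commands, which need root privileges to run.

```python
import shlex


def netem_add(dev, delay_ms=0, loss_pct=0):
    """Build the argv for 'tc qdisc add dev <dev> root netem ...'."""
    if delay_ms <= 0 and loss_pct <= 0:
        raise ValueError("specify a delay, a loss percentage, or both")
    argv = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms > 0:
        argv += ["delay", f"{delay_ms}ms"]
    if loss_pct > 0:
        argv += ["loss", f"{loss_pct}%"]
    return argv


def netem_clear(dev):
    """Build the argv that removes the root qdisc, restoring normal traffic."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


if __name__ == "__main__":
    # Assumed interface name; check yours with 'ip link'.
    dev = "eth0"
    print(shlex.join(netem_add(dev, delay_ms=200, loss_pct=5)))
    print(shlex.join(netem_clear(dev)))
```

While the impairment is active, watch gslbd_state_merge_lag_ms and gslbd_state_nats_connected, then clear the qdisc and confirm that both recover.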