Operations & Runbooks
This guide covers day-2 operations: monitoring, troubleshooting, routine tasks, and failure handling.
Monitoring
- Prometheus metrics (see docs/Metrics.md). Suggested dashboards:
  - DNS throughput: gslbd_dns_requests_total rate.
  - Health status: gslbd_health_endpoints_healthy{family}.
  - NATS connectivity: gslbd_state_nats_connected.
  - Merge lag p95: histogram quantiles on gslbd_state_merge_lag_ms.
  - Active members: gslbd_state_active_members.
- Logs:
  - GitOps reconcile logs: fetch/verify/apply outcomes and errors.
  - NATS errors: connection/disconnect and subscriber drain errors.
  - DNS errors: server start/stop issues.
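The merge lag p95 is normally computed by Prometheus itself with the PromQL histogram_quantile function. As a minimal sketch of what that estimate does, the Python below applies the same linear interpolation to a set of cumulative buckets, which can help when sanity-checking a dashboard value against a raw scrape. The bucket bounds and counts are made up for illustration and are not the real buckets of gslbd_state_merge_lag_ms.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Uses linear interpolation inside the bucket that contains the target
    rank, in the spirit of PromQL's histogram_quantile.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet

    rank = q * total
    prev_bound, prev_count = 0.0, 0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Target falls in the overflow bucket: report the highest
                # finite bound, since nothing more precise is known.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative buckets (le in ms, count) for illustration only.
SAMPLE_BUCKETS = [(10, 50), (50, 90), (100, 98), (250, 100), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, SAMPLE_BUCKETS))
```

Because the result is interpolated, its precision is bounded by the bucket layout: a p95 that lands in a wide bucket can move a lot without the underlying lag changing much.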
Routine tasks
- Rotate GPG keys: update the trust store on nodes; update the allowedSigners list; then merge a newly signed commit to test.
- Rotate NATS creds/certs: generate new creds; restart gslbd or trigger a reconnect.
- Update endpoints: commit to the Git repo; the watcher applies the change within pollInterval.
Troubleshooting
Start with Metrics
When troubleshooting, check the Prometheus metrics first — they often reveal the root cause faster than logs.
- No DNS answers:
  - Check licensing limit: look for SERVFAIL and rate limit logs.
  - Verify endpoints are configured and healthy (gslbd_health_endpoints_healthy).
  - Confirm the domain matches dns.domain and the query suffix is exact (trailing dot).
- GitOps changes not applied:
  - Verify signatures (gslbd_gitops_verify_total{result="error"}) and the signer allowlist.
  - Check validation errors and logs.
  - Ensure pathPrefix and the repo URL are correct; test git access on the host.
- State sync not working:
  - Check gslbd_state_nats_connected.
  - Verify NATS URLs and TLS/auth settings; test with the nats CLI.
  - Confirm JetStream is enabled and the KV buckets are created.
- Global quorum policy seems off:
  - Ensure heartbeatInterval/heartbeatTTL reflect reality; active membership must be non-zero.
  - Validate quorumMinPercent and that enough nodes are reporting within the TTL.
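When the quorum policy looks wrong, it can help to redo the arithmetic by hand from the last heartbeat seen per node. The sketch below is one plausible reading of the check, not gslbd's actual implementation: it assumes a node counts as active if its last heartbeat is no older than heartbeatTTL, and that quorum is met when the active share of an expected member count reaches quorumMinPercent. The function name, the expected_members parameter, and the input shape are all hypothetical.

```python
import time


def quorum_status(last_heartbeats, expected_members, heartbeat_ttl_s,
                  quorum_min_percent, now=None):
    """Evaluate an assumed quorum rule from per-node heartbeat timestamps.

    last_heartbeats maps node name -> Unix timestamp of its last heartbeat.
    The rule (active / expected >= quorum_min_percent) is an assumption made
    for this sketch; check the real policy before relying on it.
    """
    now = time.time() if now is None else now
    # A node is active if its heartbeat is still within the TTL window.
    active = sorted(
        node for node, ts in last_heartbeats.items()
        if now - ts <= heartbeat_ttl_s
    )
    percent = 100.0 * len(active) / expected_members if expected_members else 0.0
    return {
        "active": active,
        "percent": percent,
        # Active membership must be non-zero for quorum to hold.
        "quorum_met": bool(active) and percent >= quorum_min_percent,
    }


if __name__ == "__main__":
    beats = {"node-a": 995, "node-b": 960, "node-c": 999}
    print(quorum_status(beats, expected_members=3, heartbeat_ttl_s=30,
                        quorum_min_percent=60, now=1000))
```

A common cause of a surprising result is a heartbeatTTL that is too close to heartbeatInterval: a single delayed heartbeat then drops a node out of the active set and can tip the percentage below the threshold.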
Maintenance windows
- Use GitOps to stage config changes with signed commits.
- Consider "freeze windows": temporarily pause GitOps changes by setting repoURL to empty or by pausing at the network layer.
Testing WAN partitions
- Use tc/netem to inject latency/loss; observe merge lag and policy behavior.
- Ensure local-only fallback serves queries when NATS is unreachable.
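A partition test is easier to repeat when the tc invocations are generated rather than typed by hand. The sketch below only builds the argument lists for adding and removing a netem qdisc on the root of an interface; it does not run them. The interface name eth0 and the delay/loss values are assumptions for illustration, and the commands need root privileges on a Linux host when actually executed.

```python
import shlex


def netem_add_cmd(iface, delay_ms=0, jitter_ms=0, loss_pct=0.0):
    """Build a `tc qdisc add ... netem` command as an argv list."""
    if delay_ms <= 0 and loss_pct <= 0:
        raise ValueError("specify a delay, a loss percentage, or both")
    cmd = ["tc", "qdisc", "add", "dev", iface, "root", "netem"]
    if delay_ms > 0:
        cmd += ["delay", f"{delay_ms}ms"]
        if jitter_ms > 0:
            # Jitter is given as a second time value after the delay.
            cmd.append(f"{jitter_ms}ms")
    if loss_pct > 0:
        cmd += ["loss", f"{loss_pct:g}%"]
    return cmd


def netem_clear_cmd(iface):
    """Build the command that removes the root qdisc added above."""
    return ["tc", "qdisc", "del", "dev", iface, "root"]


if __name__ == "__main__":
    # Print the commands for review; pass the lists to subprocess.run
    # (as root) to apply them for real.
    print(shlex.join(netem_add_cmd("eth0", delay_ms=200, jitter_ms=20, loss_pct=5)))
    print(shlex.join(netem_clear_cmd("eth0")))
```

While the impairment is active, watch gslbd_state_merge_lag_ms and gslbd_state_nats_connected, and always run the clear command afterwards so the interface returns to normal.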