Operations & Troubleshooting

This guide collects day-2 operations, common issues, and quick runbooks for Pleiades GSLB.

Day-0/1 checklists

- Binary/service is installed and starts on boot (systemd or container orchestrator)
- Config file is present at /etc/gslb/config.yaml and readable
- Metrics endpoint is exposed (optional) and scraped by Prometheus
- NATS connectivity (if enabled) is confirmed; health policy is chosen
- GitOps repo is reachable and signatures are verified (if enabled)

Operational commands

- Check service status (systemd):

systemctl status gslbd
journalctl -u gslbd -f
- Check metrics endpoint:
curl -s http://127.0.0.1:9090/metrics | head
- DNS sanity check:
dig @127.0.0.1 -p 5353 A gslb.local

Common issues and fixes

1) No DNS answers returned
- Verify endpoints are valid IPs and reachable.
- Check health metrics: gslbd_health_endpoints_total{family} and ..._healthy{family}.
- Confirm policy: with prefer-local, global health won’t override local failures.
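To compare endpoint counts quickly, the metrics output can be summed per series. This is a minimal sketch, assuming the metrics endpoint exposes Prometheus text format and that label values contain no spaces; `metric_sum` is a hypothetical helper name, not part of gslbd.

```shell
# metric_sum NAME: read Prometheus text format on stdin and print the sum of
# all samples of metric NAME across its label sets (prints 0 if none match).
# Hypothetical helper; assumes integer-valued samples and no spaces in labels.
metric_sum() {
  awk -v name="$1" '
    /^#/ || NF < 2 { next }            # skip HELP/TYPE comments and blank lines
    {
      metric = $1
      brace = index(metric, "{")       # strip the label set, if present
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) sum += $2    # the sample value is the second field
    }
    END { printf "%d\n", sum }
  '
}
```

For example, `curl -s http://127.0.0.1:9090/metrics | metric_sum gslbd_health_endpoints_total` prints the total across address families; running the same helper on the matching healthy series gives the number to compare against.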

2) GitOps changes not applied
- Check metrics gslbd_gitops_* for result="error".
- Inspect logs for signature verification failures.
- Ensure gitops.pathPrefix points to a directory containing gslbd.yaml (or index.yaml).
- Validate YAML locally using yamllint and confirm against the Configuration Guide.
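The metrics check for GitOps errors can be narrowed with a small filter. This is a sketch under assumptions: it expects Prometheus text format without trailing timestamps, and `gitops_errors` is a hypothetical helper name.

```shell
# gitops_errors: read Prometheus text format on stdin and print every
# gslbd_gitops_* sample labelled result="error" with a non-zero value.
# Exit status is 0 when at least one such sample exists, 1 otherwise.
# Assumes the value is the last field on the line (no timestamps).
gitops_errors() {
  awk '
    /^gslbd_gitops_/ && /result="error"/ && $NF + 0 > 0 { print; found = 1 }
    END { exit !found }
  '
}
```

Typical use would be `curl -s http://127.0.0.1:9090/metrics | gitops_errors`; any printed series points at the reconcile stage to look for in the logs.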

3) NATS state sync inactive or flapping
- Check gslbd_state_nats_connected and reconnect logs.
- Verify TLS cert paths and that NATS JetStream is enabled.
- Ensure clocks are synchronized (NTP) to avoid excessive merge lag.
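To spot flapping, the connection gauge can be polled and turned into an exit status. The sketch below assumes gslbd_state_nats_connected is a gauge where 1 means connected (the usual convention, but not confirmed here); `nats_connected` is a hypothetical helper name.

```shell
# nats_connected: read Prometheus text format on stdin; succeed only if
# gslbd_state_nats_connected is present and every sample equals 1.
# Assumption: 1 means connected, and the value is the last field on the line.
nats_connected() {
  awk '
    /^gslbd_state_nats_connected([{ ]|$)/ {
      seen = 1
      if ($NF + 0 != 1) down = 1
    }
    END { exit !(seen && !down) }
  '
}
```

Polling it in a loop, for example `while sleep 5; do date; curl -s http://127.0.0.1:9090/metrics | nats_connected && echo up || echo down; done`, gives a rough timeline to line up against the reconnect logs.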

4) HTTPS health checks failing with certificate errors
- Ensure health.http.tls: true and the certificate matches health.http.host (SNI/Host header).
- Avoid insecureSkipVerify: true in production; if required in a lab, set it explicitly.

5) Running on port 53
- Either run as root (not recommended) or grant the capability:

sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/gslbd
- Then run as a non-root user.
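Whether the capability is actually set can be confirmed with `getcap`. This is a minimal sketch; `has_bind_cap` is a hypothetical helper name, and the pattern accepts both output styles printed by different libcap versions (`cap_net_bind_service+ep` and `cap_net_bind_service=ep`).

```shell
# has_bind_cap FILE: succeed if FILE carries cap_net_bind_service with the
# effective and permitted flags, as reported by getcap.
has_bind_cap() {
  getcap "$1" 2>/dev/null | grep -Eq 'cap_net_bind_service[+=]ep'
}
```

For example, `has_bind_cap /usr/local/bin/gslbd && echo "capability set"`. File capabilities are stored on the binary file itself, so they need to be reapplied whenever the binary is replaced.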

Upgrades

- Stop the service, replace the binary, and start it again.
- When using containers, roll out a new image gradually (blue/green or rolling update).
- Config changes via GitOps: use signed commits; test in a staging branch/cluster first.
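For systemd hosts, the stop/replace/start sequence might be scripted as below. This is a sketch under assumptions: the unit is named gslbd and the binary lives at /usr/local/bin/gslbd, as in the commands elsewhere in this guide; `upgrade_gslbd`, `run`, and the `DRY_RUN` variable are hypothetical names. The `setcap` step only matters when the service binds port 53 as a non-root user.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# upgrade_gslbd NEW_BINARY: stop the service, install the new binary,
# reapply the bind capability, and start the service again.
# Stops at the first failing step. Needs root privileges when not a dry run.
upgrade_gslbd() {
  local new_bin="$1" target="/usr/local/bin/gslbd"
  run systemctl stop gslbd &&
    run install -m 0755 "$new_bin" "$target" &&
    run setcap 'cap_net_bind_service=+ep' "$target" &&
    run systemctl start gslbd
}
```

Running `DRY_RUN=1 upgrade_gslbd ./gslbd.new` prints the four commands without executing them, which is a cheap way to review the plan before running it for real.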

Backup and restore

- GitOps: the Git repository is the source of truth; back it up using your standard VCS backups.
- NATS JetStream: snapshot metadata and streams per the NATS documentation (primarily for diagnostics; the system functions without historical state).
- Config files: back up /etc/gslb/config.yaml if not using GitOps.

Runbooks

- Incident: WAN partition / NATS down
  - Expect fallback to local-only behavior (prefer-local policy). No manual action is required to keep serving answers based on local health.
- Incident: all endpoints unhealthy
  - Validate the backend service; verify the health path/port and firewall rules; use curl from the node to test HTTP checks.
- Rollback a bad config
  - Revert with a signed Git commit; wait for the next poll (or restart gslbd to force a reconcile).
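The curl test mentioned in the "all endpoints unhealthy" runbook can be wrapped so that it returns a pass/fail status. This is a sketch, assuming a 2xx response within 5 seconds counts as healthy (the criteria gslbd itself applies may differ); `probe_http` is a hypothetical helper name.

```shell
# probe_http URL: succeed if URL answers with a 2xx status within 5 seconds.
# Connection errors, timeouts, and TLS verification failures all count as
# failure, because curl exits non-zero in those cases.
probe_http() {
  local code
  code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$1" 2>/dev/null)" || return 1
  case "$code" in
    2??) return 0 ;;
    *) return 1 ;;
  esac
}
```

Run it from the gslbd node against each backend's health URL, so the request crosses the same firewall rules as the real checks.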

Diagnostics

- Increase verbosity by running in the foreground and inspecting logs.
- Export metrics to a local Prometheus and view trends for health and membership.