Operations & Troubleshooting
This guide collects day-2 operations, common issues, and quick runbooks for Pleiades GSLB.
Day-0/1 checklists
- Binary/service is installed and starts on boot (systemd or container orchestrator)
- Config file present at /etc/gslb/config.yaml and readable
- Metrics endpoint exposed (optional) and scraped by Prometheus
- NATS connectivity (if enabled) confirmed; health policy chosen
- GitOps repo reachable, signatures verified (if enabled)
Operational commands
- Check service status (systemd):
- Check metrics endpoint:
- DNS sanity check:
Common issues and fixes
1) No DNS answers returned
- Verify endpoints are valid IPs and reachable.
- Check health metrics: gslbd_health_endpoints_total{family} and ..._healthy{family}.
- Confirm policy: with prefer-local, global health won’t override local failures.
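The first check above (endpoints must be valid IPs) can be scripted before digging into health metrics. The following is a minimal sketch using only the Python standard library; the function name `classify_endpoints` and the `ipv4`/`ipv6` keys are illustrative assumptions, not part of gslbd, and the sketch checks syntax only, not reachability.

```python
import ipaddress


def classify_endpoints(endpoints):
    """Split endpoint strings into valid addresses per family and invalid entries.

    The "ipv4"/"ipv6" keys are hypothetical labels chosen for this sketch.
    """
    result = {"ipv4": [], "ipv6": [], "invalid": []}
    for raw in endpoints:
        try:
            addr = ipaddress.ip_address(raw.strip())
        except ValueError:
            # Hostnames, out-of-range octets, and typos all end up here.
            result["invalid"].append(raw)
            continue
        key = "ipv4" if addr.version == 4 else "ipv6"
        result[key].append(str(addr))
    return result


if __name__ == "__main__":
    # Documentation-range addresses plus two deliberately bad entries.
    sample = ["192.0.2.10", "2001:db8::1", "10.0.0.256", "example.com"]
    print(classify_endpoints(sample))
```

Anything reported under `invalid` is worth fixing first, since a hostname or a mistyped octet in an endpoint list would fail the "valid IPs" check regardless of backend health.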
2) GitOps changes not applied
- Check metrics gslbd_gitops_* for result="error".
- Inspect logs for signature verification failure.
- Ensure gitops.pathPrefix points to a directory containing gslbd.yaml (or index.yaml).
- Validate YAML locally using yamllint and confirm against the Configuration Guide.
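To make the metrics check above concrete, the sketch below scans Prometheus text-format output for any `gslbd_gitops_*` sample labelled `result="error"` with a non-zero value. It assumes the metrics text has already been fetched (for example, saved from the metrics endpoint to a file); the function name `gitops_errors` and the sample metric name in the usage example are hypothetical, since this guide only names the `gslbd_gitops_*` prefix.

```python
import re

# Matches "name{labels} value"; an optional trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def gitops_errors(metrics_text):
    """Return (name, labels, value) for gslbd_gitops_* samples with
    result="error" and a value greater than zero."""
    hits = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        if not name.startswith("gslbd_gitops_"):
            continue
        labels = dict(_LABEL.findall(raw_labels or ""))
        try:
            value = float(raw_value)
        except ValueError:
            continue
        if labels.get("result") == "error" and value > 0:
            hits.append((name, labels, value))
    return hits


if __name__ == "__main__":
    # "gslbd_gitops_sync_total" is a made-up metric name for illustration.
    sample = 'gslbd_gitops_sync_total{result="ok"} 12\n' \
             'gslbd_gitops_sync_total{result="error"} 3\n'
    for name, labels, value in gitops_errors(sample):
        print(name, labels, value)
```

A non-empty result suggests that reconciliation is failing; the log inspection and path checks listed above are the next steps.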
3) NATS state sync inactive or flapping
- Check gslbd_state_nats_connected and reconnect logs.
- Verify TLS cert paths and that NATS JetStream is enabled.
- Ensure clocks are synchronized (NTP) to avoid excessive merge lag.
4) HTTPS health checks failing with certificate errors
- Ensure health.http.tls: true and the certificate matches health.http.host (SNI/Host header).
- Avoid insecureSkipVerify: true in production; if required in lab, set it explicitly.
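When a certificate error is suspected, it can help to reproduce the handshake from the node outside of gslbd. The sketch below uses Python's standard `ssl` module to connect to an endpoint address while sending a chosen host name as SNI, which mirrors the relationship between the endpoint IP and `health.http.host` described above. The function names and the `insecure_skip_verify` parameter are assumptions made for this example; they are not gslbd's implementation.

```python
import socket
import ssl


def build_tls_context(insecure_skip_verify=False):
    """Build a client TLS context; verification stays on unless explicitly disabled."""
    ctx = ssl.create_default_context()
    if insecure_skip_verify:
        # Lab use only: hostname checking must be turned off before CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe_certificate(address, host, port=443, timeout=5.0,
                      insecure_skip_verify=False):
    """Connect to address:port, present `host` as SNI, and return (ok, detail)."""
    ctx = build_tls_context(insecure_skip_verify)
    try:
        with socket.create_connection((address, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        return False, f"certificate verification failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        return False, f"connection failed: {exc}"


if __name__ == "__main__":
    # Placeholder address and host; substitute a real endpoint and its expected host name.
    print(probe_certificate("192.0.2.10", "app.example.com", timeout=2.0))
```

If the probe reports a verification failure with verification enabled but succeeds with `insecure_skip_verify=True`, the certificate itself (name mismatch, expiry, or untrusted issuer) is the likely cause rather than network reachability.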
5) Running on port 53
- Either run as root (not recommended) or grant capability:
- Then run as a non-root user.
Upgrades
- Stop the service, replace the binary, and start again.
- When using containers, roll out a new image gradually (blue/green or rolling update).
- Config changes via GitOps: use signed commits; test in a staging branch/cluster first.
Backup and restore
- GitOps: the Git repository is the source of truth; back it up using your standard VCS backups.
- NATS JetStream: snapshot metadata and streams per NATS documentation (primarily for diagnostics; the system functions without historical state).
- Config files: back up /etc/gslb/config.yaml if not using GitOps.
Runbooks
- Incident: WAN partition / NATS down
  - Expect fallback to local-only behavior (prefer-local policy). No manual action is required to keep serving based on local health.
- Incident: all endpoints unhealthy
  - Validate the backend service; verify the health path/port and firewall rules; use curl from the node to test HTTP checks.
- Rollback a bad config
  - Revert with a signed Git commit; wait for the next poll (or restart gslbd to force a reconcile).
Diagnostics
- Increase verbosity by running in foreground and inspecting logs.
- Export metrics to a local Prometheus and view trends for health and membership.