Health Checks
Pleiades performs active health checks against all configured endpoints and exposes the last-known status to the load balancer and (optionally) to the global state sync publisher.
Features
- TCP connect check to a configured port.
- HTTP/HTTPS GET to a configured path.
- Optional constraints: expected HTTP status and substring presence in response body (limited to 1 MB read).
- Configurable interval and timeout per check.
- Secure-by-default HTTPS: TLS verification is enabled when http.tls: true. You can opt out with http.insecureSkipVerify: true (not recommended for production).
Configuration
health:
  type: http                    # tcp | http
  port: 443
  checkInterval: 10s
  timeout: 2s
  http:
    path: "/healthz"
    host: "app.example.com"     # optional Host header override
    expectedStatus: 200         # 0 to ignore status check
    contains: "ok"              # empty to ignore body substring check
    tls: true
    insecureSkipVerify: false   # set true only for trusted/self-signed test envs
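To make the HTTP settings concrete, here is a minimal Go sketch of how such a probe could be written. It is not the code in internal/health/checker.go: the HTTPCheck struct, evaluate, and probe are hypothetical names, and building a fresh transport for every probe is a simplification made for brevity.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"io"
	"net/http"
	"time"
)

// maxBodyBytes caps how much of the response body is read (1 MB).
const maxBodyBytes = 1 << 20

// HTTPCheck is a hypothetical struct mirroring the health.http settings.
type HTTPCheck struct {
	URL                string        // full probe URL, scheme and port included
	Host               string        // optional Host header override
	ExpectedStatus     int           // 0 disables the status check
	Contains           string        // "" disables the body substring check
	Timeout            time.Duration // bounds the entire request
	InsecureSkipVerify bool          // test environments only
}

// evaluate applies the optional status and substring constraints.
func evaluate(status int, body []byte, expectedStatus int, contains string) bool {
	if expectedStatus != 0 && status != expectedStatus {
		return false
	}
	if contains != "" && !bytes.Contains(body, []byte(contains)) {
		return false
	}
	return true
}

// probe issues one GET and reports whether the endpoint passed the check.
func probe(c HTTPCheck) bool {
	client := &http.Client{
		Timeout: c.Timeout, // covers dial, TLS handshake, headers, and body read
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: c.InsecureSkipVerify},
			// A new transport is created per call in this sketch, so
			// keep-alives are disabled to avoid lingering idle connections.
			DisableKeepAlives: true,
		},
	}
	req, err := http.NewRequest(http.MethodGet, c.URL, nil)
	if err != nil {
		return false
	}
	if c.Host != "" {
		req.Host = c.Host
	}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()

	var body []byte
	if c.Contains != "" {
		// Read the body only when a substring check needs it, capped at 1 MB.
		body, err = io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
		if err != nil {
			return false
		}
	}
	return evaluate(resp.StatusCode, body, c.ExpectedStatus, c.Contains)
}
```

Keeping evaluate separate from the network call lets the status and substring rules be tested without a live server.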
Behavior
- Initial state is optimistic (healthy) until the first probe completes, preventing a brief blackout window at startup.
- On each run, the checker iterates over the current endpoint list and updates the in-memory status map atomically per IP.
- The load balancer queries IsHealthy(ip) before returning an endpoint.
- DB write deduplication: UpsertHealthStatus is only called when an endpoint's health state actually changes (unhealthy→healthy or healthy→unhealthy). Stable-state probe results (still healthy, still unhealthy) do not generate DB writes. This eliminates the steady-state write storm in large pools with short check intervals while still persisting every state transition immediately.
- On pool restart (health check updated via API), the dedup cache is cleared so the first post-restart probe always syncs the DB.
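The write-deduplication behaviour described above can be sketched in a few lines of Go. This is an illustration only, not the dispatchSink in internal/health/manager.go: dedupSink, Report, and Reset are hypothetical names, the upsert callback stands in for UpsertHealthStatus, and treating the first result for an unseen IP as a change is an assumption inferred from the post-restart rule.

```go
package main

import "sync"

// dedupSink forwards a health result to upsert only when it differs from the
// last value forwarded for that IP.
type dedupSink struct {
	mu     sync.Mutex
	last   map[string]bool
	upsert func(ip string, healthy bool) // stand-in for the DB write
}

func newDedupSink(upsert func(ip string, healthy bool)) *dedupSink {
	return &dedupSink{last: make(map[string]bool), upsert: upsert}
}

// Report records a probe result and returns true if it triggered a write.
func (s *dedupSink) Report(ip string, healthy bool) bool {
	s.mu.Lock()
	prev, seen := s.last[ip]
	changed := !seen || prev != healthy
	if changed {
		s.last[ip] = healthy
	}
	s.mu.Unlock()

	// Call the sink outside the lock so a slow write does not block probes.
	if changed {
		s.upsert(ip, healthy)
	}
	return changed
}

// Reset clears the cache so the next result for every IP is written again.
func (s *dedupSink) Reset() {
	s.mu.Lock()
	s.last = make(map[string]bool)
	s.mu.Unlock()
}
```

With this shape, a pool of stable endpoints produces one write per endpoint after start or Reset and none afterwards, regardless of how short the check interval is.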
Partial health scoring (ScoreWindow)
- Set scoreWindow: N in the health check config to enable a rolling success-rate score for each endpoint.
- N is the window size (number of recent probes). A value of 0 (the default) disables scoring entirely.
- Score: successes / N over the last N probes (0.0–1.0). Before N probes have been recorded the denominator is the actual count of recorded probes.
- IsHealthy gate: an endpoint with scoring enabled is considered healthy as long as its score is > 0 (at least one success in the window). This allows a degraded-but-recoverable backend to continue receiving traffic rather than being hard-cut at the first failure.
- DNS candidate ordering: before the round-robin, WRR, or geo-ip algorithm runs, candidates are sorted descending by score. Higher-scored backends are preferred; score acts as a tiebreaker within the geo-ip preference ordering.
- API exposure: GET /api/v1/pools/{id}/status and GET /api/v1/members/{id}/status include a score field (0.0–1.0) alongside healthy. The score is persisted in the health_status.score column and is served from the DB when no live checker is running.
- Backwards compatibility: scoreWindow: 0 (the default) leaves all existing behaviour unchanged: binary healthy/unhealthy, no sorting overhead.
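The scoring rules above can be sketched as a small ring buffer. This is an illustrative Go sketch rather than the actual checker code: scoreWindow (as a type), Record, Score, and isHealthy are hypothetical names, and returning 1.0 before any probe has been recorded is an assumption chosen to match the optimistic initial state.

```go
package main

// scoreWindow is a fixed-size ring buffer of recent probe outcomes.
// It must be created with n > 0; n == 0 means scoring is disabled and the
// caller should keep a nil *scoreWindow instead.
type scoreWindow struct {
	results []bool // ring buffer of length N
	next    int    // index the next result is written to
	count   int    // number of results recorded so far, capped at N
}

func newScoreWindow(n int) *scoreWindow {
	return &scoreWindow{results: make([]bool, n)}
}

// Record stores one probe outcome, overwriting the oldest once the window is full.
func (w *scoreWindow) Record(ok bool) {
	w.results[w.next] = ok
	w.next = (w.next + 1) % len(w.results)
	if w.count < len(w.results) {
		w.count++
	}
}

// Score returns successes divided by the number of recorded probes (at most N).
func (w *scoreWindow) Score() float64 {
	if w.count == 0 {
		return 1.0 // assumption: optimistic until the first probe completes
	}
	successes := 0
	for i := 0; i < w.count; i++ {
		if w.results[i] {
			successes++
		}
	}
	return float64(successes) / float64(w.count)
}

// isHealthy applies the gate: with scoring enabled, any success in the window
// keeps the endpoint healthy; with scoring disabled (nil window), the last
// probe result decides.
func isHealthy(w *scoreWindow, lastProbeOK bool) bool {
	if w == nil {
		return lastProbeOK
	}
	return w.Score() > 0
}
```

For example, with a window of 4, one success followed by one failure scores 0.5, and the endpoint only turns unhealthy once all four slots hold failures.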
Example config with scoring enabled:
health:
  type: http
  port: 443
  checkInterval: 10s
  timeout: 2s
  scoreWindow: 10   # score over the last 10 probes; higher-scored backends are preferred
  http:
    path: "/healthz"
    expectedStatus: 200
    tls: true
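With scoring enabled as in the config above, candidates are sorted by score before the selection algorithm runs. A minimal Go sketch of that ordering step follows. The candidate type and orderByScore are hypothetical names, and the use of a stable sort (so that equally scored candidates keep their prior relative order) is an assumption, not something the source confirms.

```go
package main

import "sort"

// candidate pairs an endpoint with its current rolling score.
type candidate struct {
	IP    string
	Score float64 // 0.0 to 1.0
}

// orderByScore returns a copy of cands sorted by descending score.
// The sort is stable, so ties keep the order they arrived in.
func orderByScore(cands []candidate) []candidate {
	out := make([]candidate, len(cands))
	copy(out, cands)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Score > out[j].Score
	})
	return out
}
```

Sorting a copy leaves the caller's slice untouched, which keeps the step safe to run on a shared endpoint list.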
Edge cases & timeouts
- Timeouts apply to the TCP dial and to the entire HTTP request via http.Client.Timeout.
- HTTP body is read only as needed for substring matching and capped at 1 MB.
- If an endpoint is removed by GitOps, it is removed from both the checker and the load balancer atomically.
- A log warning is emitted at startup when insecureSkipVerify: true so the setting is never silently active.
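For the TCP check, the timeout can be applied directly to the dial. The following Go sketch shows one way to do that. tcpCheck is a hypothetical name and the real checker may be structured differently.

```go
package main

import (
	"net"
	"time"
)

// tcpCheck reports whether a TCP connection to addr ("host:port") can be
// established within timeout. The connection is closed immediately.
func tcpCheck(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
```

A call such as tcpCheck("192.0.2.10:443", 2*time.Second) corresponds to a tcp check with port: 443 and timeout: 2s.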
Code references
- internal/health/checker.go: probe implementation, config types, optimistic initial state.
- internal/health/manager.go: per-pool checker lifecycle, dedup sink (dispatchSink).
- cmd/gslbd/main.go: wiring and lifecycle management.