Health Checks

Pleiades performs active health checks against all configured endpoints and exposes the last-known status to the load balancer and (optionally) to the global state sync publisher.

Features

- TCP connect check to a configured port.
- HTTP/HTTPS GET to a configured path.
- Optional constraints: expected HTTP status and substring presence in the response body (at most 1 MB of the body is read).
- Configurable interval and timeout per check.
- Secure-by-default HTTPS: TLS verification is enabled when http.tls: true. You can opt out with http.insecureSkipVerify: true (not recommended for production).

Configuration

health:
  type: http            # tcp | http
  port: 443
  checkInterval: 10s
  timeout: 2s
  http:
    path: "/healthz"
    host: "app.example.com"   # optional Host header override
    expectedStatus: 200        # 0 to ignore status check
    contains: "ok"            # empty to ignore body substring check
    tls: true
    insecureSkipVerify: false  # set true only for trusted/self-signed test envs
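
To show how these settings might combine into a single probe, here is a minimal sketch of building the GET request for one endpoint. The HTTPCheck struct, its field names, and newProbeRequest are illustrative assumptions rather than the actual Pleiades config types; only the standard library calls are real.

```go
package main

import (
	"net"
	"net/http"
	"strconv"
)

// HTTPCheck holds a subset of the http.* settings from the YAML above.
// The struct and its field names are illustrative assumptions.
type HTTPCheck struct {
	Path string
	Host string // optional Host header override
	TLS  bool
}

// newProbeRequest builds the GET request for one endpoint IP and port.
func newProbeRequest(ip string, port int, c HTTPCheck) (*http.Request, error) {
	scheme := "http"
	if c.TLS {
		scheme = "https"
	}
	// JoinHostPort brackets IPv6 literals correctly.
	url := scheme + "://" + net.JoinHostPort(ip, strconv.Itoa(port)) + c.Path
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if c.Host != "" {
		req.Host = c.Host // sent as the Host header instead of ip:port
	}
	return req, nil
}
```

For the configuration above and an endpoint at 192.0.2.10, this would produce a request to https://192.0.2.10:443/healthz carrying Host: app.example.com.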

Behavior

- Initial state is optimistic (healthy) until the first probe completes, preventing a brief blackout window at startup.
- Each run, the checker iterates the current endpoint list and updates the in-memory status map atomically per IP.
- The load balancer queries IsHealthy(ip) before returning an endpoint.
- DB write deduplication: UpsertHealthStatus is only called when an endpoint's health state actually changes (unhealthy→healthy or healthy→unhealthy). Stable-state probe results (still healthy, still unhealthy) do not generate DB writes. This eliminates the steady-state write storm in large pools with short check intervals while still persisting every state transition immediately.
- On pool restart (health check updated via API), the dedup cache is cleared so the first post-restart probe always syncs the DB.
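
The optimistic initial state and the write deduplication can be sketched together. This is an illustration under stated assumptions, not the code in internal/health: the tracker type, its methods Record and ResetDedup, and the upsert callback (standing in for UpsertHealthStatus) are hypothetical names, and the sketch assumes that an empty dedup cache always triggers a write.

```go
package main

import "sync"

// tracker is a hypothetical stand-in for the checker's status map plus the
// dedup sink; the real types in internal/health may differ.
type tracker struct {
	mu        sync.Mutex
	status    map[string]bool               // last probe result per IP
	persisted map[string]bool               // last state written to the DB (dedup cache)
	upsert    func(ip string, healthy bool) // stands in for UpsertHealthStatus
}

func newTracker(upsert func(ip string, healthy bool)) *tracker {
	return &tracker{
		status:    map[string]bool{},
		persisted: map[string]bool{},
		upsert:    upsert,
	}
}

// IsHealthy is optimistic: an IP with no recorded probe counts as healthy.
func (t *tracker) IsHealthy(ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	healthy, probed := t.status[ip]
	return !probed || healthy
}

// Record stores a probe result and calls upsert only when the state differs
// from the last persisted one (or nothing has been persisted yet).
func (t *tracker) Record(ip string, healthy bool) {
	t.mu.Lock()
	t.status[ip] = healthy
	last, seen := t.persisted[ip]
	changed := !seen || last != healthy
	if changed {
		t.persisted[ip] = healthy
	}
	t.mu.Unlock()
	if changed {
		t.upsert(ip, healthy)
	}
}

// ResetDedup clears the dedup cache, as on a pool restart, so the next
// probe result for every IP is written again.
func (t *tracker) ResetDedup() {
	t.mu.Lock()
	t.persisted = map[string]bool{}
	t.mu.Unlock()
}
```

In this sketch, a pool of 1,000 stable endpoints probed every 10 seconds produces no writes at all between transitions, while a single endpoint flipping state produces exactly one.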

Partial health scoring (ScoreWindow)

- Set scoreWindow: N in the health check config to enable a rolling success-rate score for each endpoint.
- N is the window size (number of recent probes). A value of 0 (the default) disables scoring entirely.
- Score: successes / N over the last N probes (0.0–1.0). Before N probes have been recorded, the denominator is the actual count of recorded probes.
- IsHealthy gate: an endpoint with scoring enabled is considered healthy as long as its score is > 0 (at least one success in the window). This allows a degraded-but-recoverable backend to continue receiving traffic rather than being hard-cut at the first failure.
- DNS candidate ordering: before the round-robin, weighted round-robin (WRR), or geo-ip algorithm runs, candidates are sorted descending by score. Higher-scored backends are preferred; score acts as a tiebreaker within the geo-ip preference ordering.
- API exposure: GET /api/v1/pools/{id}/status and GET /api/v1/members/{id}/status include a score field (0.0–1.0) alongside healthy. The score is persisted in the health_status.score column and served from the DB when no live checker is running.
- Backwards compatibility: scoreWindow: 0 (the default) leaves all existing behaviour unchanged: binary healthy/unhealthy, no sorting overhead.

Example config with scoring enabled:

health:
  type: http
  port: 443
  checkInterval: 10s
  timeout: 2s
  scoreWindow: 10   # score over the last 10 probes; higher-scored backends are preferred
  http:
    path: "/healthz"
    expectedStatus: 200
    tls: true
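
The scoring rules in this section can be sketched with a fixed-size ring buffer. This is a minimal illustration, not the Pleiades implementation: the scoreWindow and candidate types and the sortByScore helper are hypothetical, and returning 1.0 before any probe has been recorded is an assumption chosen to match the optimistic initial state.

```go
package main

import "sort"

// scoreWindow is a hypothetical ring buffer of the last N probe outcomes.
// N must be greater than 0; scoreWindow: 0 means scoring is disabled.
type scoreWindow struct {
	results []bool // ring buffer of recent outcomes
	next    int    // index the next outcome is written to
	count   int    // outcomes recorded so far, capped at N
}

func newScoreWindow(n int) *scoreWindow {
	return &scoreWindow{results: make([]bool, n)}
}

// Record adds one outcome, overwriting the oldest once the window is full.
func (w *scoreWindow) Record(success bool) {
	w.results[w.next] = success
	w.next = (w.next + 1) % len(w.results)
	if w.count < len(w.results) {
		w.count++
	}
}

// Score returns successes divided by the number of recorded probes (at most N).
func (w *scoreWindow) Score() float64 {
	if w.count == 0 {
		return 1.0 // assumption: no probes yet is treated optimistically
	}
	successes := 0
	for i := 0; i < w.count; i++ {
		if w.results[i] {
			successes++
		}
	}
	return float64(successes) / float64(w.count)
}

// Healthy applies the documented gate: any success in the window is enough.
func (w *scoreWindow) Healthy() bool { return w.Score() > 0 }

// candidate pairs an endpoint IP with its current score (illustrative type).
type candidate struct {
	IP    string
	Score float64
}

// sortByScore orders candidates by descending score; the stable sort keeps
// the existing relative order of equally scored candidates.
func sortByScore(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool { return cs[i].Score > cs[j].Score })
}
```

For example, with scoreWindow: 10 an endpoint that passed 3 of its last 10 probes scores 0.3: it still counts as healthy, but it is ordered behind an endpoint scoring 0.9.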

Edge cases & timeouts

- Timeouts apply to the TCP dial and to the entire HTTP request via http.Client.Timeout.
- HTTP body is read only as needed for substring matching and capped at 1 MB.
- If an endpoint is removed by GitOps, it is removed from both the checker and the load balancer atomically.
- A log warning is emitted at startup when insecureSkipVerify: true so the setting is never silently active.

Code references

- internal/health/checker.go: probe implementation, config types, optimistic initial state.
- internal/health/manager.go: per-pool checker lifecycle, dedup sink (dispatchSink).
- cmd/gslbd/main.go: wiring and lifecycle management.