Health Checks¶

Pleiades performs active health checks against all configured endpoints and exposes the last-known status to the load balancer and (optionally) to the global state sync publisher.

Check Types¶

| Type | How it works | Privilege required |
|---|---|---|
| `tcp` | TCP connect to `port`; success = connection established | none |
| `http` | HTTP/HTTPS GET to `port` + `httpPath`; success = status matches `httpExpectedStatus` | none |
| `icmp` | ICMP Echo Request (ping) to member IP; success = reply received | `CAP_NET_RAW` or root |
| `script` | Run executable at `scriptPath`; success = exit code 0 | file execute permission |
| `webhook` | HTTP call to `webhookURL`; success = 2xx response | none |

Configuration¶

TCP (default)¶

health:
  type: tcp
  port: 443
  checkInterval: 10s
  timeout: 2s
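
As a rough illustration of what a `tcp` check amounts to, the Go sketch below attempts a connection and treats a completed handshake as success. The names `probeTCP` and `demo` are hypothetical and are not taken from gslbd's source; the 2-second timeout mirrors the `timeout: 2s` setting above.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP reports whether a TCP connection to addr can be established
// within timeout. The connection is closed immediately; no data is sent.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demo probes a throwaway local listener while it is open and again after
// it has been closed, and returns both probe results.
func demo() (open, closed bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	addr := ln.Addr().String()
	open = probeTCP(addr, 2*time.Second)
	ln.Close()
	closed = probeTCP(addr, 2*time.Second)
	return open, closed, nil
}

func main() {
	open, closed, err := demo()
	if err != nil {
		fmt.Println("demo setup failed:", err)
		return
	}
	fmt.Println("probe while listening:", open)
	fmt.Println("probe after close:", closed)
}
```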

HTTP / HTTPS¶

health:
  type: http            # tcp | http | icmp | script | webhook
  port: 443
  checkInterval: 10s
  timeout: 2s
  http:
    path: "/healthz"
    host: "app.example.com"   # optional: FQDN used for SNI + HTTP Host header
    expectedStatus: 200        # 0 to ignore status code
    contains: "ok"            # empty to ignore body substring check
    tls: true
    insecureSkipVerify: false  # set true only for trusted/self-signed test envs

ICMP Ping¶

{
  "type": "icmp",
  "icmpCount": 3,
  "intervalMs": 10000,
  "timeoutMs": 2000
}

icmpCount (default 3): number of echo requests sent per probe interval. The check succeeds if at least one reply is received. The RTT recorded is the average of successful replies.
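
The success rule and RTT averaging described above can be sketched as a small pure function. The type `echoResult` and the function `summarize` are hypothetical names chosen for illustration; sending the actual echo requests is out of scope here.

```go
package main

import (
	"fmt"
	"time"
)

// echoResult is a hypothetical record of one echo request within a probe.
type echoResult struct {
	Replied bool
	RTT     time.Duration // only meaningful when Replied is true
}

// summarize applies the documented rule: the check is healthy if at least
// one reply arrived, and the recorded RTT is the mean over replies only.
func summarize(results []echoResult) (healthy bool, avgRTT time.Duration) {
	var total time.Duration
	replies := 0
	for _, r := range results {
		if r.Replied {
			total += r.RTT
			replies++
		}
	}
	if replies == 0 {
		return false, 0
	}
	return true, total / time.Duration(replies)
}

func main() {
	// icmpCount = 3: two replies and one loss.
	results := []echoResult{
		{Replied: true, RTT: 10 * time.Millisecond},
		{Replied: false},
		{Replied: true, RTT: 20 * time.Millisecond},
	}
	healthy, avg := summarize(results)
	fmt.Println("healthy:", healthy, "avg RTT:", avg)
}
```

With one lost request out of three, the probe still counts as healthy, and the lost request does not drag the average down.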

Privilege requirement: ICMP raw sockets require CAP_NET_RAW on Linux. If gslbd runs as root this is automatic. Otherwise grant the capability:

sudo setcap cap_net_raw+ep /usr/local/bin/gslbd

Port is ignored for ICMP checks.

Custom Script¶

{
  "type": "script",
  "scriptPath": "/usr/local/bin/check-myapp.sh",
  "port": 8080,
  "intervalMs": 30000,
  "timeoutMs": 5000
}

The script receives two environment variables:

| Variable | Value |
|---|---|
| `NEXUS_HC_IP` | Member IP address (e.g. `10.0.0.1`) |
| `NEXUS_HC_PORT` | Configured port (e.g. `8080`) |

Exit code 0 → healthy
Any other exit code → unhealthy
Script is killed after timeoutMs milliseconds
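
A minimal sketch of how such a script check could be run from Go, assuming the standard `os/exec` package is used. The function `runScript` and its `args` parameter are hypothetical (the documented check takes only a path); the environment variable names and the exit-code and timeout rules come from the text above.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"time"
)

// runScript executes path with the two documented environment variables set
// and reports healthy only when the process exits with code 0 before the
// timeout. args is a hypothetical extra that lets the demo use "sh -c".
func runScript(path string, args []string, ip string, port, timeoutMs int) bool {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	cmd := exec.CommandContext(ctx, path, args...) // process is killed when ctx expires
	cmd.Env = append(os.Environ(),
		"NEXUS_HC_IP="+ip,
		"NEXUS_HC_PORT="+strconv.Itoa(port),
	)
	// A non-zero exit code, a kill on timeout, or a failure to start all
	// surface as a non-nil error and therefore count as unhealthy.
	return cmd.Run() == nil
}

func main() {
	fast := runScript("sh", []string{"-c", `test "$NEXUS_HC_PORT" = 8080`}, "10.0.0.1", 8080, 5000)
	fmt.Println("healthy (fast script):", fast)

	slow := runScript("sh", []string{"-c", "sleep 5"}, "10.0.0.1", 8080, 200)
	fmt.Println("healthy (script exceeding timeout):", slow)
}
```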

Webhook¶

{
  "type": "webhook",
  "webhookURL": "https://monitor.example.com/check",
  "webhookMethod": "POST",
  "port": 80,
  "intervalMs": 10000,
  "timeoutMs": 3000
}

For POST (default), the daemon sends:

{"ip": "10.0.0.1", "port": 80}

For GET, no body is sent. HTTP 2xx response = healthy; any other status or network error = unhealthy.

The port field is included in the POST body only; GET requests carry no body.
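
The request side of a webhook check might look like the following Go sketch. `probeWebhook`, `webhookPayload`, and `demo` are hypothetical names, and the `Content-Type` header is an assumption; only the body shape and the 2xx rule are taken from the description above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// webhookPayload mirrors the documented POST body.
type webhookPayload struct {
	IP   string `json:"ip"`
	Port int    `json:"port"`
}

// probeWebhook sends the payload for POST and no body for GET, and treats
// any 2xx status as healthy. Network errors and timeouts are unhealthy.
func probeWebhook(client *http.Client, url, method, ip string, port int) bool {
	var body io.Reader // stays nil for GET
	if method == http.MethodPost {
		payload, err := json.Marshal(webhookPayload{IP: ip, Port: port})
		if err != nil {
			return false
		}
		body = bytes.NewReader(payload)
	}
	req, err := http.NewRequest(method, url, body)
	if err != nil {
		return false
	}
	if body != nil {
		req.Header.Set("Content-Type", "application/json") // assumption
	}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode <= 299
}

// demo runs three probes against an in-process test server that answers
// 204 on /check and 503 everywhere else.
func demo() (post, get, down bool) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/check" {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 3 * time.Second}
	post = probeWebhook(client, srv.URL+"/check", http.MethodPost, "10.0.0.1", 80)
	get = probeWebhook(client, srv.URL+"/check", http.MethodGet, "10.0.0.1", 80)
	down = probeWebhook(client, srv.URL+"/down", http.MethodPost, "10.0.0.1", 80)
	return post, get, down
}

func main() {
	post, get, down := demo()
	fmt.Println("POST /check healthy:", post)
	fmt.Println("GET /check healthy:", get)
	fmt.Println("POST /down healthy:", down)
}
```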

Kubernetes and Container Considerations¶

ICMP (ping) in containers¶

ICMP requires CAP_NET_RAW. The default K8s statefulset drops all capabilities. You must explicitly re-add NET_RAW:

# In the gslbd container securityContext:
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_RAW"]

In plain Docker:

docker run --cap-add NET_RAW ... gslbd

If the capability is absent, the ICMP check will log a warning on every probe and mark the endpoint unhealthy — it will never silently pass.

Custom script checks in containers¶

readOnlyRootFilesystem: true only locks the container's overlay filesystem; mounted volumes are still writable. Five approaches work, in order of recommendation:

Option A: scriptContent (recommended — works everywhere)¶

Store the script body in the health check config via the API, WebUI, or Terraform. gslbd writes it to a tmpfs temp file at probe time, executes it, and removes it. The script is stored in SQLite and replicated to every cluster node via NATS automatically — no filesystem setup required on any node.

{
  "type": "script",
  "scriptContent": "#!/bin/sh\ncurl -sf http://$NEXUS_HC_IP:$NEXUS_HC_PORT/health",
  "intervalMs": 30000,
  "timeoutMs": 5000
}

In Terraform:

resource "nexus_health_check" "app" {
  pool_id        = nexus_pool.app.id
  type           = "script"
  script_content = file("${path.module}/scripts/check-app.sh")
  interval_ms    = 30000
  timeout_ms     = 5000
}

Works identically on VMs, Docker, and Kubernetes. The K8s statefulset includes a tmpfs emptyDir at /tmp for this purpose.

Option B: data PVC (simplest, single-replica)¶

The StatefulSet already has a PVC mounted at /var/lib/gslbd for the SQLite database. Scripts can live in a subdirectory of that same volume with no manifest changes:

kubectl exec -n nexus-gslb gslbd-0 -- mkdir -p /var/lib/gslbd/hc-scripts
kubectl cp check.sh nexus-gslb/gslbd-0:/var/lib/gslbd/hc-scripts/check.sh
kubectl exec -n nexus-gslb gslbd-0 -- chmod +x /var/lib/gslbd/hc-scripts/check.sh

Configure the health check with scriptPath: /var/lib/gslbd/hc-scripts/check.sh. Scripts survive pod restarts.

Limitation: Each StatefulSet replica has its own PVC. You must copy scripts to every replica (gslbd-0, gslbd-1, …) separately, and new replicas added by scaling start with no scripts. Use this approach for single-replica deployments only.

Option C: ConfigMap (multi-replica, GitOps-friendly)¶

Declarative, propagates to all replicas automatically including new ones on scale-up. Create the ConfigMap and add it to the manifest (see commented example in deploy/kubernetes/statefulset.yaml):

kubectl create configmap gslbd-hc-scripts \
  --from-file=check.sh=/path/to/your/check.sh \
  -n nexus-gslb

# volumeMount (add alongside existing mounts):
- name: hc-scripts
  mountPath: /etc/gslb/hc-scripts
  readOnly: true

# volume (add alongside existing volumes):
- name: hc-scripts
  configMap:
    name: gslbd-hc-scripts
    defaultMode: 0755   # critical: K8s does not set execute bit by default

Configure with scriptPath: /etc/gslb/hc-scripts/check.sh. A ConfigMap is limited to 1 MiB of data in total; health check scripts should never approach this.

Option D: shared RWX PVC (multi-replica, imperative)¶

If your cluster has a ReadWriteMany storage class (NFS, CephFS, etc.), a single PVC can be mounted by all replicas simultaneously. Copy scripts once; all pods see them including new replicas:

volumes:
  - name: hc-scripts
    persistentVolumeClaim:
      claimName: gslbd-hc-scripts
      readOnly: false

Option E: scriptPath (VM / bare-metal only)¶

Set scriptPath to a filesystem path. The script must exist on every node independently. Not portable across container or K8s deployments. Use scriptContent instead unless you have an existing config management system (Ansible, Puppet) already distributing scripts to nodes.

Script contract (all options):

- Receives NEXUS_HC_IP and NEXUS_HC_PORT environment variables
- Exit 0 = healthy; any other exit code = unhealthy
- Killed after timeoutMs milliseconds
- Inherits the gslbd process environment

Webhook checks (recommended for containers)¶

If managing ConfigMaps or capabilities is undesirable, the webhook type is the container-native alternative to script checks. Run a sidecar or an external service that implements the health logic; gslbd POSTs the target IP and port to it.

# Sidecar example — lightweight health check service on :9191
containers:
  - name: gslbd
    ...
  - name: hc-webhook
    image: mycompany/health-checker:latest
    ports:
      - containerPort: 9191

{
  "type": "webhook",
  "webhookURL": "http://localhost:9191/check",
  "webhookMethod": "POST",
  "intervalMs": 15000,
  "timeoutMs": 3000
}

The webhook service receives: {"ip":"10.0.0.1","port":8080} and returns any 2xx status to indicate health.
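
A sidecar that honours this contract can be quite small. The Go sketch below is one possible shape, not a reference implementation: `checkRequest`, `newCheckHandler`, and the injected `check` function are hypothetical, and the choice of 204, 503, and 400 as response codes is an assumption (any 2xx signals health).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// checkRequest mirrors the documented payload {"ip": ..., "port": ...}.
type checkRequest struct {
	IP   string `json:"ip"`
	Port int    `json:"port"`
}

// newCheckHandler wraps a caller-supplied health test (a hypothetical hook).
// It answers 204 when the target is healthy, 503 when it is not, and 400
// when the body cannot be decoded.
func newCheckHandler(check func(ip string, port int) bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req checkRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid JSON body", http.StatusBadRequest)
			return
		}
		if check(req.IP, req.Port) {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
}

// demo sends body to the handler in-process and returns the status code.
// The stub health test calls a target healthy only when its port is 8080.
func demo(body string) int {
	h := newCheckHandler(func(ip string, port int) bool { return port == 8080 })
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/check", strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(demo(`{"ip":"10.0.0.1","port":8080}`))
	fmt.Println(demo(`{"ip":"10.0.0.1","port":81}`))
	// A real sidecar would serve the handler instead, for example:
	//   http.ListenAndServe(":9191", newCheckHandler(myCheck))
}
```

Injecting the health test as a function keeps the HTTP plumbing separate from the application-specific logic, which is where most of the real work in a sidecar would live.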

HTTP Host / SNI override (`http.host`)¶

Most web servers — including Caddy, nginx, and Apache — use SNI-based virtual hosting: the server decides which certificate and site to serve based on the TLS SNI extension in the ClientHello, which is derived from the URL hostname.

Without http.host, health checks connect to the pool member's IP address directly:

https://45.92.9.73/healthz   → no hostname in SNI (IP literals are not sent as SNI) → Caddy finds no matching site → 404 or TLS error

With http.host: "admin.gslb.cc", the TCP dial still goes to the member IP (e.g. 45.92.9.73:443) but the URL presented to the TLS stack — and therefore the SNI value in the ClientHello and the HTTP Host header — is admin.gslb.cc:

TCP connect → 45.92.9.73:443
TLS SNI     → "admin.gslb.cc"    ← Caddy matches this site block
HTTP Host   → "admin.gslb.cc"    ← Caddy checks this for routing

This is essential in any deployment where each node is fronted by a reverse proxy that serves multiple virtual hosts from the same IP.
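
One way to get this behaviour with Go's standard library is to keep the FQDN in the request URL and override only the dial target. The sketch below illustrates that technique; it is not gslbd's actual dialer logic, and `newPinnedClient` and `demo` are hypothetical names. The demo relies on the certificate built into `net/http/httptest`, which is valid for `example.com`.

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newPinnedClient returns a client whose connections always go to memberAddr
// (ip:port), whatever hostname appears in the request URL. Because the URL
// still carries the FQDN, net/http derives the TLS SNI and the Host header
// from it.
func newPinnedClient(memberAddr string, roots *x509.CertPool, timeout time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: timeout}
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: roots},
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, memberAddr)
			},
		},
	}
}

// demo starts a local TLS server that echoes the SNI and Host it observed,
// then probes it by IP while presenting example.com.
func demo() (sni, host string, status int, err error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Seen-SNI", r.TLS.ServerName)
		w.Header().Set("X-Seen-Host", r.Host)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	roots := x509.NewCertPool()
	roots.AddCert(srv.Certificate())

	client := newPinnedClient(srv.Listener.Addr().String(), roots, 2*time.Second)
	defer client.CloseIdleConnections()

	resp, err := client.Get("https://example.com/healthz")
	if err != nil {
		return "", "", 0, err
	}
	defer resp.Body.Close()
	return resp.Header.Get("X-Seen-SNI"), resp.Header.Get("X-Seen-Host"), resp.StatusCode, nil
}

func main() {
	sni, host, status, err := demo()
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("SNI:", sni, "Host:", host, "status:", status)
}
```

Certificate verification stays enabled in this sketch: the server certificate is checked against the FQDN, not against the member IP.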

Example — GSLB-managed service behind Caddy¶

Pool members: lon-01 (45.92.9.73), eu-01 (65.21.14.204), lab-01 (172.16.1.35). Public FQDN: admin.gslb.cc. Each node runs Caddy with:

admin.gslb.cc {
    reverse_proxy localhost:3000
}

Health check config:

{
  "type": "http",
  "port": 443,
  "httpPath": "/healthz",
  "httpHost": "admin.gslb.cc",
  "tls": true,
  "httpExpectedStatus": 200
}

Nexus probes each member IP (45.92.9.73:443, 65.21.14.204:443, 172.16.1.35:443) while presenting admin.gslb.cc as the SNI and Host. Caddy answers correctly on all three nodes. Without httpHost, probes would fail because Caddy has no site block for bare IPs.

Behavior

- Initial state is optimistic (healthy) until the first probe completes, preventing a brief blackout window at startup.
- Each run, the checker iterates the current endpoint list and updates the in-memory status map atomically per IP.
- The load balancer queries IsHealthy(ip) before returning an endpoint.
- DB write deduplication: UpsertHealthStatus is only called when an endpoint's health state actually changes (unhealthy→healthy or healthy→unhealthy). Stable-state probe results (still healthy, still unhealthy) do not generate DB writes. This eliminates the steady-state write storm in large pools with short check intervals while still persisting every state transition immediately.
- On pool restart (health check updated via API), the dedup cache is cleared so the first post-restart probe always syncs the DB.
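
The optimistic initial state and the write deduplication can be modelled in a few lines of Go. This is a sketch under stated assumptions, not the code in internal/health: `statusMap`, `Record`, and `ResetDedup` are hypothetical names, a counter stands in for `UpsertHealthStatus`, and treating the very first result for an IP as a write is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// statusMap is a hypothetical in-memory view of endpoint health.
type statusMap struct {
	mu      sync.Mutex
	healthy map[string]bool // latest probe result per IP
	synced  map[string]bool // last state written to the DB (dedup cache)
	writes  int             // stands in for UpsertHealthStatus calls
}

func newStatusMap() *statusMap {
	return &statusMap{healthy: map[string]bool{}, synced: map[string]bool{}}
}

// IsHealthy is optimistic: an IP with no probe result yet counts as healthy.
func (s *statusMap) IsHealthy(ip string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	h, seen := s.healthy[ip]
	return !seen || h
}

// Record stores a probe result and counts a DB write only when the IP has
// no cached state or the state differs from the cached one.
func (s *statusMap) Record(ip string, healthy bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy[ip] = healthy
	if last, ok := s.synced[ip]; !ok || last != healthy {
		s.synced[ip] = healthy
		s.writes++
	}
}

// ResetDedup clears the dedup cache so the next probe always syncs the DB.
func (s *statusMap) ResetDedup() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.synced = map[string]bool{}
}

func main() {
	s := newStatusMap()
	fmt.Println("before first probe:", s.IsHealthy("10.0.0.1"))
	for i := 0; i < 100; i++ {
		s.Record("10.0.0.1", true) // stable state: a single write
	}
	s.Record("10.0.0.1", false) // transition: one more write
	fmt.Println("writes after 101 probes:", s.writes)
}
```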

Partial health scoring (ScoreWindow)

- Set scoreWindow: N in the health check config to enable a rolling success-rate score for each endpoint.
- N is the window size (number of recent probes). A value of 0 (the default) disables scoring entirely.
- Score: successes / N over the last N probes (0.0–1.0). Before N probes have been recorded the denominator is the actual count of recorded probes.
- IsHealthy gate: an endpoint with scoring enabled is considered healthy as long as its score is > 0 (at least one success in the window). This allows a degraded-but-recoverable backend to continue receiving traffic rather than being hard-cut at the first failure.
- DNS candidate ordering: before the round-robin, WRR, or geo-ip algorithm runs, candidates are sorted descending by score. Higher-scored backends are preferred; score acts as a tiebreaker within the geo-ip preference ordering.
- API exposure: GET /api/v1/pools/{id}/status and GET /api/v1/members/{id}/status include a score field (0.0–1.0) alongside healthy. Persisted in the health_status.score column and served from the DB when no live checker is running.
- Backwards compatibility: ScoreWindow: 0 (the default) leaves all existing behaviour unchanged: binary healthy/unhealthy, no sorting overhead.
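
The rolling score, the health gate, and the candidate ordering can be sketched with a ring buffer. All names here (`scoreWindow`, `candidate`, `sortByScore`) are hypothetical, and returning 1.0 before any probe has been recorded is an assumption that mirrors the optimistic initial state.

```go
package main

import (
	"fmt"
	"sort"
)

// scoreWindow keeps the outcomes of the last n probes in a ring buffer.
// It must be created with n > 0; scoreWindow: 0 means no window at all.
type scoreWindow struct {
	results []bool
	next    int // slot to overwrite on the next Record
	count   int // probes recorded so far, capped at len(results)
}

func newScoreWindow(n int) *scoreWindow {
	return &scoreWindow{results: make([]bool, n)}
}

func (w *scoreWindow) Record(success bool) {
	w.results[w.next] = success
	w.next = (w.next + 1) % len(w.results)
	if w.count < len(w.results) {
		w.count++
	}
}

// Score is successes divided by the number of recorded probes (at most n).
func (w *scoreWindow) Score() float64 {
	if w.count == 0 {
		return 1.0 // assumption: optimistic until the first probe
	}
	successes := 0
	for i := 0; i < w.count; i++ {
		if w.results[i] {
			successes++
		}
	}
	return float64(successes) / float64(w.count)
}

// IsHealthy applies the documented gate: any success in the window suffices.
func (w *scoreWindow) IsHealthy() bool { return w.Score() > 0 }

type candidate struct {
	IP    string
	Score float64
}

// sortByScore orders candidates by descending score, keeping the existing
// order among equal scores.
func sortByScore(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool { return cs[i].Score > cs[j].Score })
}

func main() {
	w := newScoreWindow(10)
	for i := 0; i < 10; i++ {
		w.Record(i%5 != 0) // two failures out of ten probes
	}
	fmt.Println("score:", w.Score(), "healthy:", w.IsHealthy())

	cs := []candidate{{"10.0.0.1", 0.2}, {"10.0.0.2", 0.9}, {"10.0.0.3", 0.5}}
	sortByScore(cs)
	fmt.Println(cs)
}
```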

Example config with scoring enabled:

health:
  type: http
  port: 443
  checkInterval: 10s
  timeout: 2s
  scoreWindow: 10   # score over the last 10 probes; higher-scored backends are preferred
  http:
    path: "/healthz"
    expectedStatus: 200
    tls: true

Edge cases & timeouts

- Timeouts apply to the TCP dial and to the entire HTTP request via http.Client.Timeout.
- HTTP body is read only as needed for substring matching and capped at 1 MB.
- If an endpoint is removed by GitOps, it is removed from both the checker and the load balancer atomically.
- A log warning is emitted at startup when insecureSkipVerify: true so the setting is never silently active.
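
The timeout and body-cap rules can be illustrated with a short Go probe. This is a sketch rather than the actual checker: `probeHTTP` and `demo` are hypothetical names, and the cap is taken here as 1 << 20 bytes, which is an assumption about what "1 MB" means.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

const maxBodyBytes = 1 << 20 // assumed value of the 1 MB cap

// probeHTTP issues a GET and evaluates the response. expectedStatus 0
// ignores the status code; an empty contains skips the body entirely.
func probeHTTP(client *http.Client, url string, expectedStatus int, contains string) bool {
	resp, err := client.Get(url) // client.Timeout bounds the whole request
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	if expectedStatus != 0 && resp.StatusCode != expectedStatus {
		return false
	}
	if contains == "" {
		return true // body is not read when no substring is configured
	}
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
	if err != nil {
		return false
	}
	return bytes.Contains(body, []byte(contains))
}

// demo runs four probes against an in-process test server and returns the
// raw probe results.
func demo() (match, mismatch, ignored, slow bool) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/healthz":
			fmt.Fprint(w, "ok")
		case "/slow":
			time.Sleep(300 * time.Millisecond)
		default:
			http.NotFound(w, r)
		}
	}))
	defer srv.Close()

	normal := &http.Client{Timeout: 2 * time.Second}
	hasty := &http.Client{Timeout: 50 * time.Millisecond}

	match = probeHTTP(normal, srv.URL+"/healthz", 200, "ok")
	mismatch = probeHTTP(normal, srv.URL+"/healthz", 204, "")
	ignored = probeHTTP(normal, srv.URL+"/missing", 0, "")
	slow = probeHTTP(hasty, srv.URL+"/slow", 200, "")
	return match, mismatch, ignored, slow
}

func main() {
	match, mismatch, ignored, slow := demo()
	fmt.Println("status and substring match:", match)
	fmt.Println("wrong expected status:", mismatch)
	fmt.Println("status ignored (expectedStatus 0):", ignored)
	fmt.Println("response slower than timeout:", slow)
}
```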

Automated TLS Certificates (ACME DNS-01) — end-to-end guide for Caddy + GSLB, including the httpHost health check config
Configuration reference — health.* field definitions

Code references¶

internal/health/checker.go: probe implementation, config types, optimistic initial state, httpHost dialer logic.
internal/health/manager.go: per-pool checker lifecycle, dedup sink (dispatchSink).
cmd/gslbd/main.go: wiring and lifecycle management.