State Synchronization (NATS + JetStream)
Pleiades exchanges runtime state (endpoint health and cluster membership) over NATS, using JetStream for durability and rejoin resilience. Each node publishes the results of its local active health checks along with heartbeats, and subscribes to the global state to build a GlobalHealthView. Depending on the configured policy, the load balancer can also consult this global view.
Concepts
- NATS subjects
- Health events: gslb.<cluster>.health.<nodeId>.<family> where <family> ∈ {v4,v6}.
- Heartbeats: gslb.<cluster>.membership.heartbeat.
- JetStream Key-Value (KV)
- Health bucket: gslb_<cluster>_health (per-IP JSON with TTL).
- Membership bucket: gslb_<cluster>_membership (per-node JSON with TTL).
- GlobalHealthView
- Maintains per-IP, per-node health reports with timestamps.
- Tracks active members by heartbeat TTL.
- Provides IsHealthy (any active node healthy) and QuorumHealthy helpers.
Policies
- prefer-local (default): return local health; the global view is collected but serves only as a hint reserved for future use.
- local-only: ignore global view entirely.
- global-any-healthy: healthy if either local or any active global report is healthy.
- global-quorum: healthy if a minimum percentage of active members report healthy within staleness TTL.
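The four policies reduce to a small decision function. The sketch below is illustrative only: the Inputs struct and Decide function are hypothetical names, and two behaviors are assumptions rather than documented facts (unknown policy values fall back to local health, and global-quorum falls back to local health when no members are active).

```go
package main

import "fmt"

type Policy string

const (
	PreferLocal      Policy = "prefer-local"
	LocalOnly        Policy = "local-only"
	GlobalAnyHealthy Policy = "global-any-healthy"
	GlobalQuorum     Policy = "global-quorum"
)

// Inputs bundles what a policy decision needs (hypothetical type).
type Inputs struct {
	Local            bool // result of this node's own health check
	GlobalAny        bool // any active member reports healthy
	HealthyVotes     int  // active members with a fresh healthy report
	ActiveMembers    int  // members with an unexpired heartbeat
	QuorumMinPercent int  // e.g. 51
}

// Decide returns the effective health of an endpoint under policy p.
func Decide(p Policy, in Inputs) bool {
	switch p {
	case GlobalAnyHealthy:
		return in.Local || in.GlobalAny
	case GlobalQuorum:
		if in.ActiveMembers == 0 {
			return in.Local // assumption: no global data, so use local health
		}
		return in.HealthyVotes*100 >= in.QuorumMinPercent*in.ActiveMembers
	default:
		// prefer-local and local-only both return local health today;
		// treating unknown values the same way is an assumption.
		return in.Local
	}
}

func main() {
	in := Inputs{Local: false, GlobalAny: true, HealthyVotes: 2, ActiveMembers: 3, QuorumMinPercent: 51}
	for _, p := range []Policy{PreferLocal, LocalOnly, GlobalAnyHealthy, GlobalQuorum} {
		fmt.Printf("%-20s %v\n", p, Decide(p, in))
	}
}
```

With two healthy votes out of three active members (about 67%), global-quorum passes at quorumMinPercent 51 even though the local check failed.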
Configuration
    state:
      healthPolicy: "prefer-local"
      quorumMinPercent: 51
      heartbeatInterval: "10s"
      heartbeatTTL: "30s"
      nats:
        servers: ["nats://n1:4222","nats://n2:4222","nats://n3:4222"]
        tls:
          caFile: "/etc/gslb/pki/ca.crt"
          certFile: "/etc/gslb/pki/client.crt"
          keyFile: "/etc/gslb/pki/client.key"
        auth:
          credsFile: "/etc/gslb/nats/client.creds" # or user/password or nkey
        jetStream:
          domain: "gslb"
Behavior
- On start, the subscriber snapshots both KV buckets (health and membership), then subscribes to the subjects for live updates.
- The publisher sends a heartbeat at a fixed cadence and updates the node's KV entry; it publishes per-IP health at every checker interval.
- TTLs: the health KV TTL defaults to 2×health.interval; the membership TTL is configurable.
- During WAN partitions or NATS failures, the node operates on local health only.
Metrics
- gslbd_state_nats_connected (0/1)
- gslbd_state_nats_published_total{type} and ..._received_total{type} with type ∈ {health, heartbeat}
- gslbd_state_kv_put_total{bucket,result} and ...kv_get_total{bucket,result}
- gslbd_state_merge_lag_ms histogram
- gslbd_state_active_members gauge
Code references
- internal/state/*: NATS client, publisher, subscriber, subjects, types, view, provider.
- cmd/gslbd/main.go: wiring and policy selection.
Configuration Sync (JetStream)
Overview
- In addition to health/membership, Pleiades distributes desired configuration as YAML via JetStream.
- Canonical format: a single YAML document per cluster (v1) representing the full desired state (DNS + services/endpoints/weights + health + policies).
- Transport: JetStream Stream (events) and JetStream KV (latest snapshot).
JetStream resources
- Stream name (default): PLEIADES.cfg
- Subject: <subjectPrefix>.<cluster> (default prefix: pleiades.cfg)
- KV bucket (latest desired state): PLEIADES_CFG with key <cluster>
Message schema (headers + payload)
- Headers
- pleiades-version: monotonically increasing integer
- pleiades-commit: Git commit SHA (from GitOps pipeline)
- content-type: application/yaml
- Payload
- Raw YAML bytes of the ClusterConfig document
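The header set and subject can be assembled as follows. This is a standard-library sketch: the local Header type has the same shape as nats.Header (a map from string to a slice of strings), the header names use the lowercase spelling listed above, and ShouldApply, which accepts a message only when its version is strictly greater than the last applied one, is an assumed consumer-side use of the monotonic version rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Header mirrors the shape of nats.Header so the sketch needs no NATS dependency.
type Header map[string][]string

// ConfigSubject builds <subjectPrefix>.<cluster>.
func ConfigSubject(prefix, cluster string) string {
	return prefix + "." + cluster
}

// BuildHeaders returns the three headers listed in the message schema.
func BuildHeaders(version uint64, commit string) Header {
	return Header{
		"pleiades-version": {strconv.FormatUint(version, 10)},
		"pleiades-commit":  {commit},
		"content-type":     {"application/yaml"},
	}
}

// ShouldApply parses the version header and reports whether it is newer
// than lastApplied (hypothetical helper).
func ShouldApply(h Header, lastApplied uint64) (uint64, bool, error) {
	vals := h["pleiades-version"]
	if len(vals) == 0 {
		return 0, false, errors.New("missing pleiades-version header")
	}
	v, err := strconv.ParseUint(vals[0], 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("pleiades-version: %w", err)
	}
	return v, v > lastApplied, nil
}

func main() {
	h := BuildHeaders(7, "abcd123")
	fmt.Println(ConfigSubject("pleiades.cfg", "prod-global")) // pleiades.cfg.prod-global
	v, ok, err := ShouldApply(h, 6)
	fmt.Println(v, ok, err) // 7 true <nil>
}
```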
ClusterConfig YAML (example)
    apiVersion: pleiades.io/v1
    kind: ClusterConfig
    metadata:
      cluster: prod-global
      version: 7
      commitSHA: "abcd123..."
      timestamp: "2025-12-31T12:45:00Z"
    spec:
      dns:
        domain: example.gslb
        records:
          - name: "www"
            type: "A"
            ttl: 30
            values: ["203.0.113.10", "203.0.113.11"]
          - name: "www"
            type: "AAAA"
            ttl: 30
            values: ["2001:db8::10"]
      services:
        - name: www
          algorithm: weighted-round-robin
          endpoints: ["203.0.113.10", "203.0.113.11", "2001:db8::10"]
          weights: {"203.0.113.10": 5, "203.0.113.11": 1, "2001:db8::10": 3}
          health:
            type: http
            port: 443
            checkInterval: 10s
            timeout: 2s
            http:
              path: /healthz
              expectedStatus: 200
              tls: true
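Before publishing a ClusterConfig, a pipeline could run a consistency check over each service entry. The sketch below is hypothetical: the Service struct covers only the fields needed here, and the rule that every endpoint needs a positive weight is an assumption. It also splits endpoints by address family, matching the v4/v6 split used by the health subjects.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Service holds the subset of a service entry that this check needs.
type Service struct {
	Name      string
	Endpoints []string
	Weights   map[string]int
}

// Validate checks each endpoint and returns the endpoints grouped by family.
func (s Service) Validate() (v4, v6 []string, err error) {
	for _, ep := range s.Endpoints {
		addr, perr := netip.ParseAddr(ep)
		if perr != nil {
			return nil, nil, fmt.Errorf("service %s: endpoint %q: %w", s.Name, ep, perr)
		}
		// Assumed rule: every endpoint carries a positive weight.
		if w, ok := s.Weights[ep]; !ok || w <= 0 {
			return nil, nil, fmt.Errorf("service %s: endpoint %q has no positive weight", s.Name, ep)
		}
		if addr.Is4() {
			v4 = append(v4, ep)
		} else {
			v6 = append(v6, ep)
		}
	}
	return v4, v6, nil
}

func main() {
	svc := Service{
		Name:      "www",
		Endpoints: []string{"203.0.113.10", "203.0.113.11", "2001:db8::10"},
		Weights:   map[string]int{"203.0.113.10": 5, "203.0.113.11": 1, "2001:db8::10": 3},
	}
	v4, v6, err := svc.Validate()
	fmt.Println(v4, v6, err) // [203.0.113.10 203.0.113.11] [2001:db8::10] <nil>
}
```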
Node configuration to enable config sync
    state:
      enableConfigSync: true
      nats:
        servers: ["nats://n1.example.com:4222"]
        jetStream:
          domain: "gslb"
      config:
        mode: "jetstream"
        stream: "PLEIADES.cfg"
        subjectPrefix: "pleiades.cfg"
        kvBucket: "PLEIADES_CFG"
        applyTimeout: "5s"
Behavior
- On startup, NewNATSConfigSync eagerly creates the JetStream stream (EnsureConfigStream) and KV bucket (EnsureConfigKVBucket), then returns a ready-to-use ConfigSync.
- SnapshotCluster(clusterID): fetches the latest YAML from KV; returns nil YAML without error if the key does not exist yet.
- WatchCluster(ctx, clusterID): returns a buffered channel (capacity 16) of ConfigEvent. A nil entry from the KV watcher signals that the initial values have all been delivered. Non-Put operations emit type:"delete" events. The watch terminates when ctx is cancelled.
- PublishCluster(ctx, clusterID, yaml, meta): writes to KV (latest snapshot) and publishes to the JetStream stream with Pleiades-Version, Pleiades-Commit, and Content-Type: application/yaml headers.
- The NATS connection is drained on error shutdown.
Operational notes
- Ensure your NATS account permissions allow publishing to pleiades.cfg.* and reading the PLEIADES_CFG bucket.
- Use super-clusters or leafnodes for global distribution with low latency.