Performance Tuning Guide
Guidance on health check intervals, algorithm selection, SQLite tuning, backup strategy, and cluster topology. Each section describes the tradeoff and gives a concrete recommendation for common deployment sizes.
Failover latency
DNS failover latency is the sum of three independent layers. Understanding each is essential for setting realistic SLAs.
Layer 1: Detection (gslbd stops returning the IP)
Detection latency ≈ checkInterval / 2 + timeout on average, and checkInterval + timeout in the worst case (the p99 column below).
The health checker fires a ticker every checkInterval. On average a backend fails halfway through the current interval. After the probe result is recorded the in-memory status map is updated immediately. All DNS queries answered after that point exclude the failed IP — no DB round-trip, no NATS — purely in-memory.
| checkInterval | timeout | p99 detection |
|---|---|---|
| 1s | 500ms | ~1.5 s |
| 2s | 1s | ~3 s |
| 5s | 2s | ~7 s |
| 10s (default) | 2s | ~12 s |
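The detection formula can be sketched in a few lines of Python. The function name is hypothetical, and the worst-case figure assumes the backend fails just after a tick, so a full interval elapses before the next probe starts and that probe then runs to its timeout.

```python
def detection_latency(check_interval_s: float, timeout_s: float) -> tuple[float, float]:
    """Return (average, worst_case) Layer 1 detection latency in seconds."""
    # On average the backend fails halfway through the current interval.
    average = check_interval_s / 2 + timeout_s
    # Worst case: the backend fails right after a tick, so a full interval passes first.
    worst_case = check_interval_s + timeout_s
    return average, worst_case


# (checkInterval, timeout) pairs from the table above, in seconds.
for interval, timeout in [(1, 0.5), (2, 1), (5, 2), (10, 2)]:
    avg, worst = detection_latency(interval, timeout)
    print(f"checkInterval={interval}s timeout={timeout}s avg={avg:.1f}s worst={worst:.1f}s")
```

The worst-case values reproduce the p99 column; the average is what a long-running histogram would be expected to centre on.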
Layer 2: Cluster convergence (peer nodes agree)
When NATS state sync is enabled, every health state transition triggers an immediate NATS publish. Peer nodes receive the update and refresh their GlobalHealthView within 200 ms of the detection event (NATS transit + subscriber processing). A background 30 s catchup ticker handles any missed signals.
Without NATS (single-node), Layer 2 does not apply.
Layer 3: Client cache (clients stop hitting the dead backend)
Clients that already received a DNS response cache it for ttl seconds. The server does not reduce TTL dynamically — clients must wait for their cached entry to expire before re-querying.
| Per-service ttl | Client failover after Layer 1 resolves |
|---|---|
| 1 | ~1 s |
| 10 | ~10 s |
| 30 | ~30 s |
| 60 (default) | ~60 s |
Recommended configurations by SLA target
| Target | checkInterval | timeout | ttl | Notes |
|---|---|---|---|---|
| <2 s server-side | 1s | 500ms | any | Higher probe volume; only practical for small pools |
| <5 s client-side | 2s | 1s | 3 | Low TTL increases DNS query rate proportionally |
| <30 s client-side | 5s | 2s | 20 | Good balance for most production deployments |
| <70 s client-side | 10s | 2s | 60 | Default; lowest probe overhead |
Rule of thumb: timeout ≤ 50% of checkInterval. A probe that times out must complete before the next tick fires; a timeout exceeding the interval causes probes to pile up.
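To see how the layers combine into a client-side budget, here is a minimal Python sketch. It assumes the layers are strictly additive and that Layer 2 only applies when the answering node relies on a peer's probe result; both function names are hypothetical.

```python
def client_failover_worst_case(check_interval_s, timeout_s, ttl_s, convergence_s=0.0):
    """Worst-case seconds until a client with a cached answer stops using a dead backend."""
    detection = check_interval_s + timeout_s   # Layer 1 (worst case)
    return detection + convergence_s + ttl_s   # plus Layer 2 and Layer 3


def timeout_within_rule_of_thumb(check_interval_s, timeout_s):
    """True when timeout is at most 50% of checkInterval."""
    return timeout_s <= 0.5 * check_interval_s


# checkInterval 5s, timeout 2s, ttl 20, answered by the node that ran the probe.
print(client_failover_worst_case(5, 2, 20))
print(timeout_within_rule_of_thumb(5, 2))
```

Passing `convergence_s=0.2` models a client that queries a peer node, using the 200 ms convergence bound from Layer 2.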
BGP RHI: independent of all three layers
When BGP RHI is configured (bgp.*), the route withdrawal fires in the same status-sink callback as detection — it is not gated on NATS or TTL. BGP convergence latency (1–30 s depending on hold-timer and peer network) is the relevant SLA for anycast deployments. This is how NS1 and Akamai achieve their sub-second claims: they combine BGP anycast with a local resolver at each PoP, bypassing client DNS caching entirely.
Measuring actual failover latency
Use scripts/measure-failover.sh to measure Layer 1 + Layer 3 against the live cluster:
The daemon also records Layer 1 detection latency in a Prometheus histogram:
curl -s http://localhost:9090/metrics | grep failover_detection
# gslbd_health_failover_detection_seconds_bucket{direction="down",...}
# gslbd_health_failover_detection_seconds_bucket{direction="up",...}
direction="down" = healthy→unhealthy; direction="up" = unhealthy→healthy (recovery).
Health check intervals
Health check interval controls how quickly Pleiades detects a failing backend and stops sending traffic to it.
| Setting | Effect | Tradeoff |
|---|---|---|
| Short interval (≤ 5 s) | Fast failover; unhealthy backends removed within one interval | Higher probe traffic; more DB writes at state transitions |
| Long interval (≥ 30 s) | Less probe traffic | Slow failover; brief outages may go undetected |
Recommendation by scenario:
| Scenario | checkInterval | timeout |
|---|---|---|
| Customer-facing API (SLA < 10 s failover) | 5s | 3s |
| Internal service (SLA < 60 s failover) | 15s | 5s |
| Low-traffic or batch backend | 30s | 10s |
Rule of thumb: timeout should be ≤ 50% of checkInterval so a slow probe does not block the next scheduled probe.
Choosing a check type:
- Use type: tcp for non-HTTP services or when the HTTP check adds too much latency (e.g., backends with slow TLS handshakes).
- Use type: http when you want to validate application-level health (correct status code, not just port open). HTTPS checks with insecureSkipVerify: false add ~1–2 ms for cert validation but give full chain verification; insecureSkipVerify: true is for dev/staging only.
- Use type: icmp to detect host reachability at the network layer (not application layer). Requires CAP_NET_RAW on Linux; see the ICMP section of Health Checks. Probe overhead is negligible (~0.1 ms RTT on LAN) but gives no guarantee the application is listening.
- Use type: script for custom health logic not covered by tcp/http/icmp. Script probes run a subprocess on each check interval; budget timeoutMs ≥ 2× the expected script runtime to avoid false negatives from transient slowness. Prefer scriptContent (DB-stored, NATS-replicated) over scriptPath for portable deployments.
- Use type: webhook as the container-native alternative to script checks. The probe is an outbound HTTP call to a sidecar or external service; probe overhead depends on network RTT to the webhook endpoint.
State transition writes: Pleiades deduplicates DB writes — UpsertHealthStatus is only called when health flips between healthy and unhealthy. A stable pool generates zero health-check DB writes. Only transitions matter, so short intervals do not cause write amplification during steady-state.
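The deduplication idea can be illustrated with a short Python sketch. The class and callback are hypothetical stand-ins for the daemon's recorder and for UpsertHealthStatus, and counting the first observation of a backend as a write is an assumption.

```python
from typing import Callable, Dict


class TransitionRecorder:
    """Calls the upsert callback only when a backend's health flips."""

    def __init__(self, upsert: Callable[[str, bool], None]) -> None:
        self._upsert = upsert
        self._last: Dict[str, bool] = {}

    def record(self, backend: str, healthy: bool) -> bool:
        """Store a probe result; return True if it triggered a write."""
        if self._last.get(backend) == healthy:
            return False  # no transition, so no DB write
        self._last[backend] = healthy
        self._upsert(backend, healthy)
        return True


writes = []
recorder = TransitionRecorder(lambda backend, healthy: writes.append((backend, healthy)))
for result in [True, True, False, False, True]:
    recorder.record("192.0.2.10", result)
print(len(writes))  # number of writes for five probe results
```

Five probe results produce three writes here: the initial state and two flips. Halving the interval doubles the probes but leaves the write count unchanged as long as the backend's state does not change more often.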
Algorithm selection
Round-robin
Best for:
- Homogeneous backends (same capacity, same region).
- Lowest overhead: O(1) counter increment per query, no external state.

Avoid when:
- Backends have different capacities (use weighted round-robin).
- Clients care about geographic proximity (use geo-ip or map-file).
Weighted round-robin
Use when backends have different capacities. Assign weights proportional to capacity:
loadBalancer:
  algorithm: weighted-round-robin
  weights:
    "192.0.2.10": 4   # 4× the traffic of the 1-weight server
    "192.0.2.11": 1
Weight changes take effect immediately (hot-reload via API PUT /api/v1/members/<id>). No restart required.
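As an illustration of what a 4:1 weighting means for traffic, the sketch below implements one simple form of weighted round-robin in Python: a counter walked over cumulative weight ranges. It is not necessarily the selection order gslbd uses (the daemon may interleave backends differently); only the long-run ratio is the point.

```python
import bisect
from collections import Counter
from itertools import accumulate


class WeightedRoundRobin:
    def __init__(self, weights: dict[str, int]) -> None:
        self._ips = list(weights)
        # Cumulative upper bounds, e.g. weights 4 and 1 give [4, 5].
        self._bounds = list(accumulate(weights[ip] for ip in self._ips))
        self._counter = 0

    def next(self) -> str:
        slot = self._counter % self._bounds[-1]
        self._counter += 1
        # Pick the first backend whose cumulative bound is greater than the slot.
        return self._ips[bisect.bisect_right(self._bounds, slot)]


picker = WeightedRoundRobin({"192.0.2.10": 4, "192.0.2.11": 1})
print(Counter(picker.next() for _ in range(10)))
```

Over ten picks, the weight-4 backend is chosen eight times and the weight-1 backend twice.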
Geo-IP (MaxMind GeoLite2 / GeoIP2)
Best for:
- Routing clients to the nearest region (latency optimization).
- Multi-region deployments where data residency matters.
DB size matters:
| Database | Size | Lookup cost | Use when |
|---|---|---|---|
| GeoLite2-Country | ~6 MB | ~0.1 ms | Country-level routing only |
| GeoLite2-City | ~70 MB | ~0.5 ms | City or region-level routing |
| GeoIP2-City (commercial) | ~60 MB | ~0.5 ms | Higher accuracy, same cost |
City DB is loaded into memory on first open. Lookups are fast (~0.1–0.5 ms), but the initial memory allocation is ~3× the file size (~200 MB RSS increase for City). If memory is constrained, prefer Country DB or map-file.
Hot reload: gslbctl can trigger a reload, or geoipupdate (MaxMind's update tool) can write a new DB file in-place. The daemon watches the DB directory with fsnotify and reloads automatically on file write/rename. No traffic interruption.
Map-file (CIDR → endpoint)
Best for:
- Private network routing (corporate MPLS, VPNs) where geo-IP is inaccurate.
- Static routing tables that change infrequently.
- Guaranteed deterministic routing regardless of GeoIP DB quality.

loadBalancer:
  algorithm: map-file
  mapFile:
    rules:
      - cidr: "10.0.0.0/8"
        endpoint: "10.1.2.3"   # internal endpoint for RFC-1918 clients
      - cidr: "2001:db8::/32"
        endpoint: "2001:db8::1"
Matched endpoint is tried first; falls back to all endpoints if the matched one is unhealthy. Rules are evaluated longest-prefix-first.
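The matching behaviour described above can be sketched with Python's ipaddress module. The function name, the extra 10.1.0.0/16 rule, and the choice to return only healthy endpoints on fallback are assumptions made for illustration.

```python
import ipaddress


def select_endpoints(client_ip, rules, all_endpoints, healthy):
    """Return the endpoints to answer with for a client, map-file style."""
    addr = ipaddress.ip_address(client_ip)
    networks = [(ipaddress.ip_network(cidr), endpoint) for cidr, endpoint in rules]
    # Longest prefix first; an IPv4 address never matches an IPv6 network and vice versa.
    matches = sorted(
        ((net, ep) for net, ep in networks if addr in net),
        key=lambda pair: pair[0].prefixlen,
        reverse=True,
    )
    if matches and matches[0][1] in healthy:
        return [matches[0][1]]
    # Fallback: every healthy endpoint (filtering by health is an assumption).
    return [ep for ep in all_endpoints if ep in healthy]


rules = [
    ("10.0.0.0/8", "10.1.2.3"),
    ("10.1.0.0/16", "10.1.9.9"),       # hypothetical, more specific rule
    ("2001:db8::/32", "2001:db8::1"),
]
endpoints = ["10.1.2.3", "10.1.9.9", "2001:db8::1"]
print(select_endpoints("10.1.5.5", rules, endpoints, healthy=set(endpoints)))
```

A client at 10.1.5.5 matches both IPv4 rules, and the /16 wins because it is the longer prefix. If that endpoint is marked unhealthy, the call returns the remaining healthy endpoints instead.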
When to choose map-file over geo-ip:
- Your client subnets are known and stable (corporate offices, CDN PoPs).
- GeoIP gives wrong region for your enterprise clients (common for Anycast IPs).
- You need deterministic routing for compliance (data residency by subnet).
SQLite performance
Pleiades uses SQLite in WAL mode. WAL allows concurrent reads during writes, which is important because DNS queries read while health checks write.
Default pragmas (set at open time):
Optional pragmas for high-traffic deployments:
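Add these to the DB path in config so that gslbd applies them every time it opens the database. Apart from journal_mode, these pragmas are per-connection settings that are not stored in the database file, so issuing them from a separate sqlite3 session, as in the following command, changes only that session: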
sqlite3 /var/lib/gslbd/gslbd.db "
PRAGMA cache_size = -8000; -- 8 MB page cache (negative = KB)
PRAGMA busy_timeout = 5000; -- 5 s busy wait instead of immediate error
PRAGMA synchronous = NORMAL; -- safe with WAL; faster than FULL
PRAGMA temp_store = MEMORY; -- temp tables in RAM
"
Index on services.domain: The daemon creates idx_services_domain at startup. Verify it exists:
Without this index, every DNS query is a full table scan — unacceptable above ~50 services.
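One way to check for the index is to query sqlite_master, sketched here with Python's standard sqlite3 module. The helper name and the stand-in services schema are hypothetical; only the index name comes from this guide.

```python
import sqlite3


def index_exists(conn: sqlite3.Connection, index_name: str) -> bool:
    """Return True if an index with this name is defined in the database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (index_name,),
    ).fetchone()
    return row is not None


# Demo against an in-memory database with a stand-in schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (id INTEGER PRIMARY KEY, domain TEXT)")
conn.execute("CREATE INDEX idx_services_domain ON services (domain)")
print(index_exists(conn, "idx_services_domain"))

# The query plan should name the index for a lookup by domain.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM services WHERE domain = ?",
    ("app.example.com.",),
).fetchall()
print(plan)
```

Against a real deployment, open the file read-only, for example `sqlite3.connect("file:/var/lib/gslbd/gslbd.db?mode=ro", uri=True)`, so the check cannot interfere with the daemon's writes.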
DNS query timeout: Each DNS resolution that hits the DB has a 200 ms context deadline. If the DB is under heavy write load (e.g., bulk member import via API), DNS queries may fail their deadline and return SERVFAIL. Schedule bulk imports during low-traffic windows.
WAL checkpoint: SQLite auto-checkpoints after 1000 pages by default. At high write rates (many health transitions), the WAL grows. Force a checkpoint:
This is non-blocking but briefly serialises writes. The backup job runs VACUUM INTO, which also effectively checkpoints.
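SQLite exposes manual checkpoints through PRAGMA wal_checkpoint. The sketch below uses Python's sqlite3 module and PASSIVE mode, which checkpoints as much as it can without waiting for readers or writers; the mode and the helper name are assumptions rather than the daemon's own procedure.

```python
import os
import sqlite3
import tempfile


def passive_checkpoint(db_path: str) -> tuple[int, int, int]:
    """Run a PASSIVE WAL checkpoint; return (busy, wal_frames, checkpointed_frames)."""
    conn = sqlite3.connect(db_path)
    try:
        return tuple(conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone())
    finally:
        conn.close()


# Demo on a throwaway database rather than the live gslbd file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    writer = sqlite3.connect(path)
    writer.execute("PRAGMA journal_mode = WAL")
    writer.execute("CREATE TABLE t (x INTEGER)")
    writer.execute("INSERT INTO t VALUES (1)")
    writer.commit()
    result = passive_checkpoint(path)
    writer.close()
    print(result)
```

In the returned tuple, the second value is the number of frames in the WAL and the third is how many of them were written back to the main database file. A large gap between the two suggests a long-running reader is holding the checkpoint back.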
Backup strategy
| Strategy | When to use | Recovery time |
|---|---|---|
| Periodic VACUUM INTO (built-in) | Standard deployments | Minutes (restore file, restart) |
| NATS JetStream KV snapshot | Multi-region, need cross-site recovery | Seconds (apply snapshot on new node) |
| Both | Production with SLA | Seconds for health state; minutes for configuration data |
Configuring periodic backup:
backup:
  enabled: true
  dir: "/var/lib/gslbd/backups"
  interval: "1h"   # how often to write a backup
  keep: 24         # how many backups to retain (older are deleted)
VACUUM INTO creates a fully compacted, single-file copy of the DB while the daemon runs — safe at any time.
Interval guidance:
| Deployment | interval | keep | Disk per backup |
|---|---|---|---|
| Small (< 100 services) | 6h | 7 | < 1 MB |
| Medium (100–1000 services) | 1h | 24 | 1–10 MB |
| Large (1000+ services) | 30m | 48 | 10–100 MB |
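The interval and keep settings together determine how far back a restore can reach. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical and the parser only handles the "m" and "h" suffixes used in the table.

```python
from datetime import timedelta

UNITS = {"m": "minutes", "h": "hours"}


def parse_interval(text: str) -> timedelta:
    """Parse values such as "30m" or "6h"."""
    return timedelta(**{UNITS[text[-1]]: int(text[:-1])})


def retention_window(interval: str, keep: int) -> timedelta:
    """Approximate time span covered by the retained backups."""
    return parse_interval(interval) * keep


for name, interval, keep in [("small", "6h", 7), ("medium", "1h", 24), ("large", "30m", 48)]:
    print(name, retention_window(interval, keep))
```

The small profile covers about 42 hours and the other two about 24 hours. Multiplying keep by the per-backup size gives the disk budget, which for the large profile is up to roughly 48 × 100 MB.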
Offsite backup: Copy backup files to S3 or another host after each write. Example cron:
#!/bin/sh
# /etc/cron.hourly/gslbd-backup-sync
aws s3 sync /var/lib/gslbd/backups/ s3://my-bucket/gslbd-backups/
NATS JetStream KV snapshot: When state.nats is configured, health state is replicated to JetStream KV. A new node joining the cluster replays KV history and catches up without needing the SQLite file. Configuration data (pools, members, services) is not replicated to NATS — only health state.
Cluster topology
Single-node
No NATS needed. Use when:
- Single datacenter.
- Failover is handled at the infrastructure layer (e.g., a cloud load balancer in front of two gslbd nodes). Give each node its own local DB file rather than a shared NFS mount: SQLite's WAL mode does not work over network filesystems.
Two-node active-active (same region)
Both nodes serve DNS independently. Health state is shared via NATS. Use healthPolicy: global-any-healthy so each node incorporates the other's probe results — useful when backends are split across network segments.
Each node has its own SQLite DB. Configuration changes must be applied to both nodes or via GitOps.
Multi-region (recommended for production)
Region A Region B
[gslbd-A] ←→ [NATS-A] [NATS-B] ←→ [gslbd-B]
↕ super-cluster gateway ↕
[NATS-A-gw] [NATS-B-gw]
- 3+ NATS servers per region with JetStream enabled and super-cluster gateways configured.
- Each gslbd node connects to its local NATS cluster. State propagates via the super-cluster gateway.
- Set state.healthPolicy: prefer-local (default) so that each region responds using local probe results and falls back to global state if local probe data is missing.
- Set unique cluster.id per logical cluster and unique node.id per node.
Heartbeat TTL tuning:
| heartbeatInterval | heartbeatTTL | Effect |
|---|---|---|
| 5s | 15s | Fast membership detection; 3 missed heartbeats = evicted |
| 10s (default) | 30s | Balanced; tolerates short network partition |
| 30s | 90s | Tolerates flapping networks; slower failover |
Set heartbeatTTL to at least 3× heartbeatInterval to tolerate transient packet loss.
Quorum policy:
Set state.healthPolicy: global-quorum when you need strict consistency (e.g., financial applications where split-brain must be avoided).
A backend is considered healthy only if ≥ 51% of active cluster members report it as healthy. This prevents a partitioned node from serving stale "healthy" answers if it cannot reach the majority.
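The quorum rule can be expressed as a small predicate. This Python sketch uses integer arithmetic to avoid rounding surprises at the threshold; the function name and the treatment of an empty cluster as unhealthy are assumptions.

```python
def is_healthy_by_quorum(healthy_reports: int, active_members: int, threshold_pct: int = 51) -> bool:
    """True when at least threshold_pct percent of active members report healthy."""
    if active_members <= 0:
        return False  # no members to ask, so nothing can vouch for the backend
    return healthy_reports * 100 >= threshold_pct * active_members


for healthy, members in [(2, 3), (1, 2), (2, 4), (3, 5)]:
    print(f"{healthy}/{members} healthy -> {is_healthy_by_quorum(healthy, members)}")
```

With a 51% threshold, an even split such as 1 of 2 or 2 of 4 does not count as healthy, which is what prevents the two halves of a partitioned cluster from both returning the backend.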
GOMAXPROCS and concurrency
Go defaults GOMAXPROCS to the number of logical CPUs. For DNS workloads, this is usually correct. Exceptions:
- Container deployments: Set GOMAXPROCS equal to the CPU limit (not the host CPU count). Use automaxprocs or set via environment: GOMAXPROCS=2.
- Single-CPU hosts: GOMAXPROCS=1 serialises goroutines; DNS latency is fine but throughput is capped at ~10–20 k QPS per CPU.
- High QPS (> 50 k): Allocate 2–4 CPUs. The DNS handler goroutine pool scales with GOMAXPROCS. Above 4 CPUs, gains diminish because the bottleneck shifts to SQLite WAL lock contention.
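For containers, the CPU limit can be derived from the cgroup v2 cpu.max file, which holds either "max <period>" or "<quota> <period>" in microseconds. The Python sketch below turns that into a GOMAXPROCS value; the helper name is hypothetical, and rounding fractional limits down (with a floor of 1) is a choice made here, not a statement about what any particular tool does.

```python
import os


def cpus_from_cpu_max(cpu_max: str, host_cpus: int) -> int:
    """Translate cgroup v2 cpu.max contents into a whole number of CPUs."""
    quota, period = cpu_max.split()
    if quota == "max":
        return host_cpus  # no limit configured
    whole_cpus = int(quota) // int(period)  # round fractional limits down
    return max(1, min(host_cpus, whole_cpus))


# A 2-CPU limit expressed as 200 ms of CPU time per 100 ms period.
print(cpus_from_cpu_max("200000 100000", host_cpus=os.cpu_count() or 1))
```

Inside a container the file is typically found at /sys/fs/cgroup/cpu.max; the resulting number can be exported as the GOMAXPROCS environment variable before starting gslbd.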
Memory sizing:
| Component | Approximate RSS |
|---|---|
| Base daemon + Go runtime | ~30 MB |
| SQLite page cache (default) | ~2 MB |
| GeoIP City DB (when used) | ~200 MB |
| Per-pool health state (1000 members) | ~5 MB |
A typical production node without GeoIP fits comfortably in 64 MB. With GeoIP City DB, plan for 256 MB.
Benchmarking
DNS throughput:
# dnsperf (install from your package manager)
echo "app.example.com. A" > queries.txt
dnsperf -s 127.0.0.1 -p 5353 -d queries.txt -l 30 -Q 10000
API throughput:
DB latency:
# Run the built-in storage benchmark
go test -bench=BenchmarkGetServiceByDomain -benchtime=10s ./internal/storage/
If BenchmarkGetServiceByDomain p99 exceeds 1 ms, verify the idx_services_domain index exists and the page cache is warm (run the benchmark twice; first run cold-starts the cache).