Performance Tuning Guide

Guidance on health check intervals, algorithm selection, SQLite tuning, backup strategy, and cluster topology. Each section describes the tradeoff and gives a concrete recommendation for common deployment sizes.


Failover latency

DNS failover latency is the sum of three independent layers. Understanding each is essential for setting realistic SLAs.

Layer 1 — Detection (gslbd stops returning the IP)

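Average detection latency ≈ checkInterval / 2 + timeout
Worst-case detection latency ≈ checkInterval + timeout

The health checker fires a ticker every checkInterval. On average a backend fails halfway through the current interval, and the failing probe then takes up to timeout to report. The table below lists the worst case (labelled p99), where the backend fails just after a tick. After the probe result is recorded, the in-memory status map is updated immediately. All DNS queries answered after that point exclude the failed IP from memory alone, with no DB round-trip and no NATS hop.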

| checkInterval | timeout | p99 detection |
|---|---|---|
| 1s | 500ms | ~1.5 s |
| 2s | 1s | ~3 s |
| 5s | 2s | ~7 s |
| 10s (default) | 2s | ~12 s |
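
The arithmetic behind the table can be checked with a few lines of Python. This is a minimal sketch, assuming failures arrive uniformly within an interval and that the failing probe runs until its timeout expires; the function name `detection_latency` is illustrative and not part of gslbd.

```python
def detection_latency(check_interval_s: float, timeout_s: float) -> tuple[float, float]:
    """Return (average, worst_case) Layer 1 detection latency in seconds.

    Assumption: the backend fails at a uniformly random point in the
    interval and the probe that observes the failure takes the full timeout.
    """
    average = check_interval_s / 2 + timeout_s
    worst_case = check_interval_s + timeout_s
    return average, worst_case


# Rows from the table above: (checkInterval, timeout) in seconds.
for interval, timeout in [(1, 0.5), (2, 1), (5, 2), (10, 2)]:
    avg, worst = detection_latency(interval, timeout)
    print(f"checkInterval={interval}s timeout={timeout}s avg={avg:.1f}s worst={worst:.1f}s")
```

A probe that fails fast (for example a refused TCP connection) returns before the timeout, so real detection is often quicker than these figures.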

Layer 2 — Cluster convergence (peer nodes agree)

When NATS state sync is enabled, every health state transition triggers an immediate NATS publish. Peer nodes receive the update and refresh their GlobalHealthView within 200 ms of the detection event (NATS transit + subscriber processing). A background 30 s catchup ticker handles any missed signals.

Without NATS (single-node), Layer 2 does not apply.

Layer 3 — Client cache (clients stop hitting the dead backend)

Clients that already received a DNS response cache it for ttl seconds. The server does not reduce TTL dynamically — clients must wait for their cached entry to expire before re-querying.

| Per-service ttl | Client failover after Layer 1 resolves |
|---|---|
| 1 | ~1 s |
| 10 | ~10 s |
| 30 | ~30 s |
| 60 (default) | ~60 s |

Recommended settings by failover target:

| Target | checkInterval | timeout | ttl | Notes |
|---|---|---|---|---|
| <2 s server-side | 1s | 500ms | any | Higher probe volume; only practical for small pools |
| <5 s client-side | 2s | 1s | 3 | Low TTL increases DNS query rate proportionally |
| <30 s client-side | 5s | 2s | 20 | Good balance for most production deployments |
| <70 s client-side | 10s | 2s | 60 | Default; lowest probe overhead |

Rule of thumb: timeout ≤ 50% of checkInterval. A probe that times out must complete before the next tick fires; a timeout exceeding the interval causes probes to pile up.
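
To sanity-check a combination of settings before deploying it, the layers can be added up and the rule of thumb enforced in one place. The sketch below is illustrative only: `client_failover_budget` is a hypothetical helper, it uses the worst-case Layer 1 figure, and it ignores the sub-200 ms Layer 2 hop.

```python
def client_failover_budget(check_interval_s: float, timeout_s: float, ttl_s: float) -> float:
    """Worst-case seconds until a client holding a fresh cached answer re-resolves.

    Assumption: worst-case detection (checkInterval + timeout) plus a full
    TTL; cluster convergence (Layer 2) is treated as negligible.
    """
    if timeout_s > check_interval_s / 2:
        raise ValueError("timeout should be <= 50% of checkInterval")
    return check_interval_s + timeout_s + ttl_s


# The "<30 s client-side" row: 5 s interval, 2 s timeout, ttl 20.
print(client_failover_budget(5, 2, 20))
```

Calling it with a timeout above half the interval raises `ValueError`, which makes the rule hard to violate by accident in a deployment script.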

BGP RHI — independent of all three layers

When BGP RHI (route health injection) is configured (bgp.*), the route withdrawal fires in the same status-sink callback as detection; it is not gated on NATS or TTL. BGP convergence latency (1–30 s depending on hold-timer and peer network) is the relevant SLA for anycast deployments. This is how NS1 and Akamai achieve their sub-second claims: they combine BGP anycast with a local resolver at each PoP, bypassing client DNS caching entirely.

Measuring actual failover latency

Use scripts/measure-failover.sh to measure Layer 1 + Layer 3 against the live cluster:

./scripts/measure-failover.sh 127.0.0.1:5353 app.example.com 45.92.9.73

The daemon also records Layer 1 detection latency in a Prometheus histogram:

curl -s http://localhost:9090/metrics | grep failover_detection
# gslbd_health_failover_detection_seconds_bucket{direction="down",...}
# gslbd_health_failover_detection_seconds_bucket{direction="up",...}

direction="down" = healthy→unhealthy; direction="up" = unhealthy→healthy (recovery).
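
Histogram buckets are cumulative, so a percentile has to be estimated from them rather than read off directly. The following sketch shows one way to do that offline with linear interpolation, similar in spirit to PromQL's `histogram_quantile()`. The bucket bounds and counts in `SAMPLE_DOWN_BUCKETS` are made up for illustration; the real `le` boundaries exported by gslbd are not stated in this guide.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for direction="down"; (le, count).
SAMPLE_DOWN_BUCKETS = [(1, 0), (2.5, 4), (5, 10), (10, 18), (15, 20), (math.inf, 20)]

p50 = histogram_quantile(0.50, SAMPLE_DOWN_BUCKETS)
p99 = histogram_quantile(0.99, SAMPLE_DOWN_BUCKETS)
print(f"p50={p50:.2f}s p99={p99:.2f}s")
```

In a running Prometheus setup the same estimate would normally come from a PromQL query over the `_bucket` series rather than from a script.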


Health check intervals

Health check interval controls how quickly Pleiades detects a failing backend and stops sending traffic to it.

| Setting | Effect | Tradeoff |
|---|---|---|
| Short interval (≤ 5 s) | Fast failover; unhealthy backends removed within one interval | Higher probe traffic; more DB writes at state transitions |
| Long interval (≥ 30 s) | Less probe traffic | Slow failover; brief outages may go undetected |

Recommendation by scenario:

| Scenario | checkInterval | timeout |
|---|---|---|
| Customer-facing API (SLA < 10 s failover) | 5s | 2s |
| Internal service (SLA < 60 s failover) | 15s | 5s |
| Low-traffic or batch backend | 30s | 10s |

Rule of thumb: timeout should be ≤ 50% of checkInterval so a slow probe does not block the next scheduled probe.

Choosing a check type:

  • Use type: tcp for non-HTTP services or when the HTTP check adds too much latency (e.g., backends with slow TLS handshakes).
  • Use type: http when you want to validate application-level health (correct status code, not just port open). HTTPS checks with insecureSkipVerify: false add ~1–2 ms for cert validation but give full chain verification; insecureSkipVerify: true is for dev/staging only.
  • Use type: icmp to detect host reachability at the network layer (not application layer). Requires CAP_NET_RAW on Linux; see Health Checks — ICMP. Probe overhead is negligible (~0.1 ms RTT on LAN) but gives no guarantee the application is listening.
  • Use type: script for custom health logic not covered by tcp/http/icmp. Script probes run a subprocess on each check interval — budget timeoutMs ≥ 2× the expected script runtime to avoid false negatives from transient slowness. Prefer scriptContent (DB-stored, NATS-replicated) over scriptPath for portable deployments.
  • Use type: webhook as the container-native alternative to script checks. The probe is an outbound HTTP call to a sidecar or external service; probe overhead depends on network RTT to the webhook endpoint.

State transition writes: Pleiades deduplicates DB writes — UpsertHealthStatus is only called when health flips between healthy and unhealthy. A stable pool generates zero health-check DB writes. Only transitions matter, so short intervals do not cause write amplification during steady-state.
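
The deduplication idea can be illustrated with a small in-memory model. This is a sketch of the general technique, not the daemon's code: `HealthStatusStore` is a hypothetical class, the `writes` counter stands in for calls to UpsertHealthStatus, and it assumes the first observation of a backend counts as a transition.

```python
class HealthStatusStore:
    """Record probe results but count a DB write only when the state flips."""

    def __init__(self):
        self.last = {}    # ip -> last recorded healthy flag
        self.writes = 0   # stand-in for UpsertHealthStatus calls

    def record(self, ip: str, healthy: bool) -> bool:
        if self.last.get(ip) == healthy:
            return False  # steady state: nothing to persist
        self.last[ip] = healthy
        self.writes += 1
        return True


store = HealthStatusStore()
for _ in range(100):                 # 100 consecutive healthy probes
    store.record("192.0.2.10", True)
writes_steady = store.writes         # only the first observation was written
store.record("192.0.2.10", False)    # one flip to unhealthy
writes_after_flip = store.writes
print(writes_steady, writes_after_flip)
```

The number of writes depends on how often backends flip, not on how often they are probed, which is why a 1 s interval costs no more DB traffic than a 30 s one for a stable pool.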


Algorithm selection

Round-robin

Best for:

  • Homogeneous backends (same capacity, same region).
  • Lowest overhead: O(1) counter increment per query, no external state.

Avoid when:

  • Backends have different capacities (use weighted round-robin).
  • Clients care about geographic proximity (use geo-ip or map-file).

Weighted round-robin

Use when backends have different capacities. Assign weights proportional to capacity:

loadBalancer:
  algorithm: weighted-round-robin
  weights:
    "192.0.2.10": 4   # 4× the traffic of the 1-weight server
    "192.0.2.11": 1

Weight changes take effect immediately (hot-reload via API PUT /api/v1/members/<id>). No restart required.
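
To see what a 4:1 weighting does to the answer sequence, here is a sketch of smooth weighted round-robin, a well-known scheme that spreads the heavier backend's turns instead of bunching them. It is shown as one possible approach; this guide does not state which weighted algorithm Pleiades uses internally, and `smooth_wrr` is a hypothetical name.

```python
def smooth_wrr(weights: dict, n: int) -> list:
    """Return n picks using smooth weighted round-robin."""
    current = {ip: 0 for ip in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for ip, weight in weights.items():
            current[ip] += weight          # every backend earns its weight
        chosen = max(current, key=current.get)
        current[chosen] -= total           # the winner pays the total back
        picks.append(chosen)
    return picks


weights = {"192.0.2.10": 4, "192.0.2.11": 1}
picks = smooth_wrr(weights, 10)
print(picks.count("192.0.2.10"), picks.count("192.0.2.11"))
```

Over any window of five picks the weight-4 backend is chosen four times, and the weight-1 backend appears in the middle of the window rather than at the end.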

Geo-IP (MaxMind GeoLite2 / GeoIP2)

Best for:

  • Routing clients to the nearest region (latency optimization).
  • Multi-region deployments where data residency matters.

DB size matters:

| Database | Size | Lookup cost | Use when |
|---|---|---|---|
| GeoLite2-Country | ~6 MB | ~0.1 ms | Country-level routing only |
| GeoLite2-City | ~70 MB | ~0.5 ms | City or region-level routing |
| GeoIP2-City (commercial) | ~60 MB | ~0.5 ms | Higher accuracy, same cost |

City DB is loaded into memory on first open. Lookups are fast (~0.1–0.5 ms), but the initial memory allocation is ~3× the file size (~200 MB RSS increase for City). If memory is constrained, prefer Country DB or map-file.

Hot reload: gslbctl can trigger a reload, or geoipupdate (MaxMind's update tool) can write a new DB file in-place. The daemon watches the DB directory with fsnotify and reloads automatically on file write/rename. No traffic interruption.

Map-file (CIDR → endpoint)

Best for:

  • Private network routing (corporate MPLS, VPNs) where geo-IP is inaccurate.
  • Static routing tables that change infrequently.
  • Guaranteed deterministic routing regardless of GeoIP DB quality.

loadBalancer:
  algorithm: map-file
  mapFile:
    rules:
      - cidr: "10.0.0.0/8"
        endpoint: "10.1.2.3"     # internal endpoint for RFC-1918 clients
      - cidr: "2001:db8::/32"
        endpoint: "2001:db8::1"

Matched endpoint is tried first; falls back to all endpoints if the matched one is unhealthy. Rules are evaluated longest-prefix-first.
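
The matching behaviour described above can be sketched with Python's standard `ipaddress` module. This is an illustration of longest-prefix matching with a health fallback, not the daemon's implementation; the `10.20.0.0/16` rule, the endpoint `10.9.9.9`, and the function name `pick_endpoints` are hypothetical, and the fallback is assumed to return only healthy endpoints.

```python
import ipaddress

RULES = [
    ("10.0.0.0/8", "10.1.2.3"),
    ("10.20.0.0/16", "10.20.0.5"),      # hypothetical more-specific rule
    ("2001:db8::/32", "2001:db8::1"),
]
ALL_ENDPOINTS = ["10.1.2.3", "10.20.0.5", "2001:db8::1", "10.9.9.9"]


def pick_endpoints(client_ip, rules, all_endpoints, healthy):
    """Return the longest-prefix match if healthy, else every healthy endpoint."""
    addr = ipaddress.ip_address(client_ip)
    best = None  # (prefix_length, endpoint)
    for cidr, endpoint in rules:
        net = ipaddress.ip_network(cidr)
        if net.version == addr.version and addr in net:
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, endpoint)
    if best is not None and best[1] in healthy:
        return [best[1]]
    return [e for e in all_endpoints if e in healthy]


all_up = set(ALL_ENDPOINTS)
matched = pick_endpoints("10.20.1.1", RULES, ALL_ENDPOINTS, all_up)
fallback = pick_endpoints("10.20.1.1", RULES, ALL_ENDPOINTS, all_up - {"10.20.0.5"})
print(matched, fallback)
```

A client in 10.20.1.1 matches both the /8 and the /16 rule; the /16 wins because it is more specific, and when its endpoint is unhealthy the answer widens to the remaining healthy endpoints.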

When to choose map-file over geo-ip: - Your client subnets are known and stable (corporate offices, CDN PoPs). - GeoIP gives wrong region for your enterprise clients (common for Anycast IPs). - You need deterministic routing for compliance (data residency by subnet).


SQLite performance

Pleiades uses SQLite in WAL mode. WAL allows concurrent reads during writes, which is important because DNS queries read while health checks write.

Default pragmas (set at open time):

PRAGMA journal_mode=WAL;
PRAGMA foreign_keys=ON;
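
"Set at open time" matters because most SQLite pragmas apply to a single connection rather than to the database file. The sketch below, using Python's standard `sqlite3` module against a throwaway database, shows that journal_mode=WAL is stored in the file while foreign_keys and cache_size have to be reapplied by every connection. The helper names `open_tuned` and `read_pragma` are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_tuned(path):
    """Open a connection and apply pragmas, as a daemon would at open time."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # stored in the database file
    conn.execute("PRAGMA foreign_keys=ON")    # per-connection
    conn.execute("PRAGMA cache_size=-8000")   # per-connection, about 8 MB
    return conn


def read_pragma(conn, name):
    return conn.execute(f"PRAGMA {name}").fetchone()[0]


NAMES = ("journal_mode", "foreign_keys", "cache_size")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    tuned = open_tuned(path)
    tuned.execute("CREATE TABLE services(domain TEXT)")
    tuned.commit()
    plain = sqlite3.connect(path)             # a second, untuned connection
    tuned_values = {n: read_pragma(tuned, n) for n in NAMES}
    plain_values = {n: read_pragma(plain, n) for n in NAMES}
    plain.close()
    tuned.close()

print("tuned:", tuned_values)
print("plain:", plain_values)
```

The second connection sees WAL mode because it is recorded in the file, but it reports the library defaults for the other two settings.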

Optional pragmas for high-traffic deployments:

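Add these to the DB path in config so the daemon applies them to every connection it opens. Only journal_mode=WAL is stored in the database file; cache_size, busy_timeout, synchronous, and temp_store are per-connection settings and do not persist. Running them from the sqlite3 shell, as below, affects only that shell session, which is useful for trying values out but does not change the running daemon: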

sqlite3 /var/lib/gslbd/gslbd.db "
  PRAGMA cache_size = -8000;        -- 8 MB page cache (negative = KB)
  PRAGMA busy_timeout = 5000;       -- 5 s busy wait instead of immediate error
  PRAGMA synchronous = NORMAL;      -- safe with WAL; faster than FULL
  PRAGMA temp_store = MEMORY;       -- temp tables in RAM
"

Index on services.domain: The daemon creates idx_services_domain at startup. Verify it exists:

sqlite3 /var/lib/gslbd/gslbd.db ".indexes services"

Without this index, every DNS query is a full table scan — unacceptable above ~50 services.

DNS query timeout: Each DNS resolution that hits the DB has a 200 ms context deadline. If the DB is under heavy write load (e.g., bulk member import via API), DNS queries may fail their deadline and return SERVFAIL. Schedule bulk imports during low-traffic windows.

WAL checkpoint: SQLite auto-checkpoints after 1000 pages by default. At high write rates (many health transitions), the WAL grows. Force a checkpoint:

sqlite3 /var/lib/gslbd/gslbd.db "PRAGMA wal_checkpoint(TRUNCATE);"

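TRUNCATE is a blocking mode: it waits (up to busy_timeout) for the current writer and for readers that are still using the WAL, and it blocks new writers while it runs. Use wal_checkpoint(PASSIVE) when the checkpoint must never block. The backup job's VACUUM INTO writes a compacted copy that has no WAL of its own, but it only reads the live database, so do not rely on it to shrink the live WAL.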
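
The checkpoint pragma returns one row that says whether it was blocked, which is worth checking when the checkpoint runs from automation. The sketch below uses Python's standard `sqlite3` module on a throwaway database; `wal_checkpoint` is a hypothetical wrapper, and the table and rows are invented for the demonstration.

```python
import os
import sqlite3
import tempfile

CHECKPOINT_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}


def wal_checkpoint(conn, mode="PASSIVE"):
    """Run a WAL checkpoint and return SQLite's three result columns."""
    if mode not in CHECKPOINT_MODES:
        raise ValueError(f"unknown checkpoint mode: {mode}")
    busy, wal_frames, moved = conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
    return {"busy": bool(busy), "wal_frames": wal_frames, "checkpointed_frames": moved}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE health(ip TEXT, healthy INTEGER)")
    conn.executemany(
        "INSERT INTO health VALUES (?, ?)",
        [(f"192.0.2.{i}", 1) for i in range(200)],
    )
    conn.commit()
    wal_before = os.path.getsize(path + "-wal")   # WAL holds the new frames
    result = wal_checkpoint(conn, "TRUNCATE")
    wal_after = os.path.getsize(path + "-wal")    # TRUNCATE resets it to zero bytes
    conn.close()

print(wal_before, wal_after, result["busy"])
```

If `busy` comes back true, another connection prevented the checkpoint from completing and the WAL was not fully reset; retrying later or using PASSIVE on a schedule is usually enough.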


Backup strategy

| Strategy | When to use | Recovery time |
|---|---|---|
| Periodic VACUUM INTO (built-in) | Standard deployments | Minutes (restore file, restart) |
| NATS JetStream KV snapshot | Multi-region, need cross-site recovery | Seconds (apply snapshot on new node) |
| Both | Production with SLA | Best |

Configuring periodic backup:

backup:
  enabled: true
  dir: "/var/lib/gslbd/backups"
  interval: "1h"      # how often to write a backup
  keep: 24            # how many backups to retain (older are deleted)

VACUUM INTO creates a fully compacted, single-file copy of the DB while the daemon runs — safe at any time.
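
For readers who want to reproduce the backup-and-prune cycle outside the daemon, the sketch below does the same two steps with Python's standard `sqlite3` module: write a compacted copy with VACUUM INTO, then delete all but the newest `keep` files. It is not the built-in job; `write_backup`, the file naming scheme, and the demo table are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
import time


def write_backup(db_path, backup_dir, keep, stamp=None):
    """Write a compacted copy, then prune to the newest `keep` backups."""
    if keep < 1:
        raise ValueError("keep must be >= 1")
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.time_ns() if stamp is None else stamp
    # Hypothetical naming scheme: zero-padded so names sort chronologically.
    target = os.path.join(backup_dir, f"gslbd-{stamp:020d}.db")
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("VACUUM INTO ?", (target,))  # needs SQLite 3.27 or newer
    finally:
        conn.close()
    backups = sorted(f for f in os.listdir(backup_dir) if f.endswith(".db"))
    for old in backups[:-keep]:
        os.remove(os.path.join(backup_dir, old))
    return target


with tempfile.TemporaryDirectory() as tmp:
    live = os.path.join(tmp, "gslbd.db")
    conn = sqlite3.connect(live)
    conn.execute("CREATE TABLE services(domain TEXT)")
    conn.execute("INSERT INTO services VALUES ('app.example.com')")
    conn.commit()
    conn.close()

    backup_dir = os.path.join(tmp, "backups")
    for stamp in (1, 2, 3):                    # three backups, retain two
        newest = write_backup(live, backup_dir, keep=2, stamp=stamp)
    remaining = sorted(os.listdir(backup_dir))
    check = sqlite3.connect(newest)
    restored_rows = check.execute("SELECT domain FROM services").fetchall()
    check.close()

print(remaining, restored_rows)
```

VACUUM INTO refuses to overwrite an existing file, which is why each backup gets a unique name instead of reusing one path.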

Interval guidance:

| Deployment | interval | keep | Disk per backup |
|---|---|---|---|
| Small (< 100 services) | 6h | 7 | < 1 MB |
| Medium (100–1000 services) | 1h | 24 | 1–10 MB |
| Large (1000+ services) | 30m | 48 | 10–100 MB |

Offsite backup: Copy backup files to S3 or another host after each write. Example cron:

#!/bin/sh
# /etc/cron.hourly/gslbd-backup-sync (must be executable)
aws s3 sync /var/lib/gslbd/backups/ s3://my-bucket/gslbd-backups/

NATS JetStream KV snapshot: When state.nats is configured, health state is replicated to JetStream KV. A new node joining the cluster replays KV history and catches up without needing the SQLite file. Configuration data (pools, members, services) is not replicated to NATS — only health state.


Cluster topology

Single-node

[gslbd] → [SQLite]
       [DNS clients]

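No NATS needed. Use when:

  • Single datacenter.
  • Failover is handled at the infrastructure layer (e.g., a cloud load balancer in front of gslbd nodes that each keep their own local DB file). Do not share one DB file between nodes over NFS: SQLite's WAL mode relies on shared memory and requires all processes that access the database to run on the same host.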

Two-node active-active (same region)

[gslbd-1] ←→ [NATS cluster] ←→ [gslbd-2]
              (3-node NATS)

Both nodes serve DNS independently. Health state is shared via NATS. Use healthPolicy: global-any-healthy so each node incorporates the other's probe results — useful when backends are split across network segments.

Each node has its own SQLite DB. Configuration changes must be applied to both nodes or via GitOps.

Multi-region

Region A                  Region B
[gslbd-A] ←→ [NATS-A]   [NATS-B] ←→ [gslbd-B]
              ↕   super-cluster gateway   ↕
           [NATS-A-gw]  [NATS-B-gw]

  • 3+ NATS servers per region with JetStream enabled and super-cluster gateways configured.
  • Each gslbd node connects to its local NATS cluster. State propagates via the super-cluster gateway.
  • Set state.healthPolicy: prefer-local (default) — each region responds using local probe results, falls back to global state if local probe data is missing.
  • Set unique cluster.id per logical cluster and unique node.id per node.

Heartbeat TTL tuning:

| heartbeatInterval | heartbeatTTL | Effect |
|---|---|---|
| 5s | 15s | Fast membership detection; 3 missed heartbeats = evicted |
| 10s (default) | 30s | Balanced; tolerates short network partition |
| 30s | 90s | Tolerates flapping networks; slower failover |

Set heartbeatTTL to at least 3× heartbeatInterval to tolerate transient packet loss.
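
The relationship between the two settings is easiest to see as code. This is a simplified model of TTL-based membership, assuming a member is evicted once its last heartbeat is older than heartbeatTTL; the function names and node ids are hypothetical and the daemon's actual bookkeeping may differ.

```python
def heartbeat_settings_ok(interval_s: float, ttl_s: float) -> bool:
    """True when the TTL tolerates at least three missed heartbeats."""
    return ttl_s >= 3 * interval_s


def active_members(last_seen: dict, now: float, ttl_s: float) -> list:
    """Return node ids whose last heartbeat is within the TTL."""
    return sorted(node for node, ts in last_seen.items() if now - ts <= ttl_s)


# Seconds since an arbitrary epoch; node-c has been silent for 35 s.
last_seen = {"node-a": 100.0, "node-b": 95.0, "node-c": 70.0}
members = active_members(last_seen, now=105.0, ttl_s=30.0)
print(members, heartbeat_settings_ok(10, 30), heartbeat_settings_ok(10, 20))
```

With the default 10 s interval and 30 s TTL, node-c drops out of the active set only after missing three consecutive heartbeats.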

Quorum policy:

Use global-quorum when you need strict consistency (e.g., financial applications where split-brain must be avoided):

state:
  healthPolicy: global-quorum
  quorumMinPercent: 51

A backend is considered healthy only if ≥ 51% of active cluster members report it as healthy. This prevents a partitioned node from serving stale "healthy" answers if it cannot reach the majority.
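
The quorum rule reduces to a percentage comparison over the reports from active members. The sketch below illustrates that check under stated assumptions: `healthy_by_quorum` is a hypothetical function, an empty report set is treated as unhealthy, and how the daemon decides which members count as active is not modelled here.

```python
def healthy_by_quorum(reports: dict, quorum_min_percent: int = 51) -> bool:
    """reports maps node id -> True/False as reported by each active member."""
    if not reports:
        return False  # assumption: no reports means no quorum
    healthy = sum(1 for is_healthy in reports.values() if is_healthy)
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return healthy * 100 >= quorum_min_percent * len(reports)


two_of_three = healthy_by_quorum({"node-a": True, "node-b": True, "node-c": False})
one_of_two = healthy_by_quorum({"node-a": True, "node-b": False})
print(two_of_three, one_of_two)
```

Two of three members (about 67%) clears the 51% threshold, while an even split in a two-node cluster does not, so two-node clusters cannot tolerate a disagreement under this policy.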


GOMAXPROCS and concurrency

Go defaults GOMAXPROCS to the number of logical CPUs. For DNS workloads, this is usually correct. Exceptions:

  • Container deployments: Set GOMAXPROCS equal to the CPU limit (not the host CPU count). Use automaxprocs or set via environment: GOMAXPROCS=2.
  • Single-CPU hosts: GOMAXPROCS=1 serialises goroutines; DNS latency is fine but throughput is capped at ~10–20 k QPS per CPU.
  • High QPS (> 50 k): Allocate 2–4 CPUs. The DNS handler goroutine pool scales with GOMAXPROCS. Above 4 CPUs, gains diminish because the bottleneck shifts to SQLite WAL lock contention.

Memory sizing:

| Component | Approximate RSS |
|---|---|
| Base daemon + Go runtime | ~30 MB |
| SQLite page cache (default) | ~2 MB |
| GeoIP City DB (when used) | ~200 MB |
| Per-pool health state (1000 members) | ~5 MB |

A typical production node without GeoIP fits comfortably in 64 MB. With GeoIP City DB, plan for 256 MB.


Benchmarking

DNS throughput:

# dnsperf (install from your package manager)
echo "app.example.com. A" > queries.txt
dnsperf -s 127.0.0.1 -p 5353 -d queries.txt -l 30 -Q 10000

API throughput:

# wrk (https://github.com/wg/wrk)
wrk -t4 -c100 -d30s http://localhost:8080/api/v1/pools

DB latency:

# Run the built-in storage benchmark
go test -bench=BenchmarkGetServiceByDomain -benchtime=10s ./internal/storage/

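If BenchmarkGetServiceByDomain exceeds 1 ms per lookup (go test -bench reports mean ns/op by default, so 1 ms is 1,000,000 ns/op; use the p99 figure if the benchmark reports one), verify the idx_services_domain index exists and the page cache is warm (run the benchmark twice; first run cold-starts the cache).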