GSLB Failover
Configure DNS-based failover between sites
GSLB failover provides automated disaster recovery by detecting site outages and redirecting DNS responses to healthy alternate sites. When a primary data centre becomes unreachable, Tula's GSLB engine (gdnsd) stops advertising that site's IP addresses and begins returning the addresses of the configured secondary site, transparently redirecting all new client connections.
Active-Passive Failover
In an active-passive GSLB configuration, one site is designated as the primary and handles all traffic under normal conditions. One or more secondary sites stand by, ready to absorb traffic if the primary fails.
To configure active-passive failover in Tula:
- Navigate to GSLB > Failover Groups and create a new failover group.
- Add the primary site with its VIP address and assign it the highest priority.
- Add one or more secondary sites with lower priorities. These will only receive traffic when all higher-priority sites are marked down.
- Configure the health check parameters for each site (see below).
- Save and apply the configuration.
gdnsd evaluates site health continuously and adjusts DNS responses in real time. When the primary site recovers, traffic is automatically redirected back to it based on priority ordering.
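Under the hood, an active-passive failover group maps onto gdnsd's simplefo (simple failover) plugin. As a rough sketch of the shape of that configuration (the resource name, service type, and addresses below are illustrative; the exact file Tula generates may differ):

```
plugins => {
    simplefo => {
        service_types => up
        app_service => {
            primary => 192.0.2.10        # primary site VIP
            secondary => 198.51.100.10   # secondary site VIP
        }
    }
}
```

While the primary address passes its health checks, gdnsd answers queries for the `app_service` resource with the primary VIP only; when it is marked down, answers switch to the secondary.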
Health Monitoring of Remote Sites
Reliable failover depends on accurate health monitoring. Tula monitors remote sites using configurable health checks that probe each site's availability at regular intervals.
Supported health check types include:
- TCP connect -- Verifies that a TCP connection can be established to the specified port.
- HTTP/HTTPS -- Sends an HTTP request and validates the response status code and optionally the response body content.
Configure health check parameters for each site:
| Parameter | Description | Recommended Value |
|---|---|---|
| Interval | Time between health checks | 5-10 seconds |
| Timeout | Maximum wait for a response | 3-5 seconds |
| Down threshold | Consecutive failures before marking down | 2-3 checks |
| Up threshold | Consecutive successes before marking up | 2-3 checks |
Setting appropriate thresholds prevents flapping -- rapid alternation between up and down states caused by transient network issues. Requiring multiple consecutive failures before triggering failover ensures that brief network blips do not cause unnecessary site switches.
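The threshold behaviour amounts to simple hysteresis. The following is a minimal sketch of that state machine, not Tula's actual implementation, showing why a single transient failure never flips a site's state:

```python
# Threshold-based health state tracking (hysteresis).
# A site only changes state after N consecutive results
# that contradict its current state.

class SiteHealth:
    def __init__(self, down_threshold=3, up_threshold=2):
        self.down_threshold = down_threshold
        self.up_threshold = up_threshold
        self.state = "up"
        self.streak = 0  # consecutive results contradicting the current state

    def record(self, check_passed):
        """Feed one health-check result; return the (possibly updated) state."""
        if (self.state == "up") != check_passed:
            self.streak += 1      # result contradicts current state
        else:
            self.streak = 0       # streak broken: reset the counter
        if self.state == "up" and self.streak >= self.down_threshold:
            self.state, self.streak = "down", 0
        elif self.state == "down" and self.streak >= self.up_threshold:
            self.state, self.streak = "up", 0
        return self.state

site = SiteHealth(down_threshold=3, up_threshold=2)
site.record(False)               # one blip: still "up"
site.record(True)                # success resets the failure streak
```

A lone failed check followed by a success leaves the site up; only three failures in a row (with the values above) trigger failover.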
TTL Considerations
DNS Time-to-Live (TTL) is the single most important factor in GSLB failover timing. The TTL value controls how long recursive resolvers and clients cache DNS responses before requesting a fresh lookup.
Lower TTL values (30-60 seconds) provide faster failover because clients refresh their DNS records more frequently. However, they increase DNS query volume and may slightly increase connection setup latency for clients whose resolvers do not have the record cached.
Higher TTL values (300+ seconds) reduce DNS infrastructure load but delay failover. During a site outage, clients with cached DNS records will continue attempting to connect to the failed site until their cache expires.
Recommended approach: Set TTL to 60 seconds for services where rapid failover is critical, or 300 seconds for services where some delay is acceptable. Tula allows you to configure the TTL per GSLB record.
Failover Timing and DNS Propagation
The total time from site failure to full traffic redirection depends on several factors:
- Detection time = check interval × down threshold (e.g., 10s × 3 = 30 seconds).
- DNS propagation = up to the configured TTL for clients with cached records.
- Total worst-case failover time = detection time + TTL.
With an interval of 10 seconds, a down threshold of 3, and a TTL of 60 seconds, the worst-case failover time is approximately 90 seconds. Clients that happen to make a fresh DNS query during the detection window will be redirected even sooner.
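The arithmetic is worth making explicit when tuning these parameters. A quick back-of-the-envelope helper (a sketch, not part of Tula):

```python
def worst_case_failover(interval_s, down_threshold, ttl_s):
    """Worst-case seconds from site failure to full traffic redirection."""
    detection = interval_s * down_threshold  # consecutive failed health checks
    return detection + ttl_s                 # plus stale resolver caches

print(worst_case_failover(10, 3, 60))  # 30s detection + 60s TTL = 90
```

Halving the check interval to 5 seconds with the same thresholds brings the worst case down to 75 seconds, at the cost of doubling health-check traffic.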
Note that some recursive resolvers may not honour the TTL strictly, caching records for longer than specified. This is outside your control but affects a small minority of resolvers.
Configuring Primary and Secondary Sites
When planning your failover topology, consider:
- Capacity symmetry. Ensure the secondary site can handle the full traffic load of the primary. An undersized secondary that collapses under failover load defeats the purpose of disaster recovery.
- Data synchronisation. Application data must be replicated between sites so the secondary can serve requests meaningfully. GSLB handles traffic redirection, not data consistency.
- Failback policy. Decide whether traffic should automatically return to the primary when it recovers (automatic failback) or remain on the secondary until manually switched (manual failback). Automatic failback is the default in Tula.
Testing Failover
Regularly test your GSLB failover configuration to ensure it functions correctly under real conditions:
- Simulate a site failure by stopping the monitored service on the primary site or by blocking health check traffic with a firewall rule.
- Monitor the GSLB status in the Tula web interface under GSLB > Status to confirm that the primary site is marked down.
- Verify DNS responses using `dig` or `nslookup` to confirm that queries now return the secondary site's address.
- Test client connectivity by accessing the service and confirming it is served by the secondary site.
- Restore the primary and verify that traffic returns according to your failback policy.
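The DNS verification step can also be scripted. A minimal sketch using only the Python standard library (the hostname below is a placeholder for your GSLB record, and this queries your local resolver, so cached answers are subject to the TTL caveats above):

```python
import socket

def resolve_a_records(hostname):
    """Return the sorted set of IPv4 addresses the local resolver gives back."""
    return sorted({info[4][0]
                   for info in socket.getaddrinfo(hostname, None, socket.AF_INET)})

# Compare answers before and after simulating the failure, e.g.:
# print(resolve_a_records("app.example.com"))
```

Running this before triggering the simulated failure and again after the detection window should show the answer switching from the primary VIP to the secondary.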
Document your failover test results and repeat testing after any infrastructure changes to maintain confidence in your disaster recovery posture.