Route 53’s failover routing isn’t just about switching to a backup; it actively monitors your primary endpoints and, when a health check fails, changes the answers its authoritative name servers return for your records.
Let’s see it in action. Imagine you have a primary website hosted in us-east-1 and a failover site in eu-west-1.
```bash
# Simulate a health check failure for the primary endpoint
# (In reality, this would be a real failure of the EC2 instance or ALB)
echo "Simulating failure of primary endpoint..."
# Route 53 health check status will change from Healthy to Unhealthy
# DNS queries will automatically start resolving to the secondary endpoint
```
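When a health check fails, Route 53 doesn’t wait for anyone to edit a record. As soon as the primary is marked unhealthy, its authoritative name servers begin answering queries for the name with the secondary record. No record is edited, so there is no zone change to propagate the way a manual DNS update would require; the only delay after detection is the time resolvers take to expire answers they have already cached.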
The core of this system is the health check. You configure a health check to point at your primary resource: an IP address or a domain name, optionally with a port and a specific path on a web server. Route 53 then polls that endpoint from health checkers in multiple locations around the world. Each checker compares its consecutive results against the failure threshold you configure, and Route 53 aggregates the checkers’ verdicts to decide whether the endpoint is healthy or unhealthy.
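To make the threshold idea concrete, here is a minimal sketch of a single checker’s view, assuming a status only flips after `failure_threshold` consecutive results that disagree with it. `HealthTracker` is a hypothetical name used for illustration; this is a toy model, not Route 53’s actual algorithm, and it ignores the aggregation across checkers.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Toy model of one checker's status; not Route 53's real implementation."""

    failure_threshold: int = 3
    healthy: bool = True
    streak: int = 0  # consecutive results that disagree with the current status

    def observe(self, check_passed: bool) -> bool:
        """Record one check result and return the (possibly updated) status."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with the current status
        else:
            self.streak += 1  # result disagrees; count toward a flip
            if self.streak >= self.failure_threshold:
                self.healthy = check_passed
                self.streak = 0
        return self.healthy


tracker = HealthTracker(failure_threshold=3)
results = [True, False, False, True, False, False, False]
statuses = [tracker.observe(r) for r in results]
print(statuses)  # the status flips only on the third consecutive failure
```

Note that the two failures followed by a success do not trip the check: the streak resets, which is what keeps a single dropped request from triggering a failover.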
Once an endpoint is deemed unhealthy, Route 53’s Failover Routing Policy kicks in. When you set up a failover record, you associate it with a health check. You define a primary record (e.g., www.example.com pointing to your us-east-1 resource) and a secondary record (e.g., www.example.com pointing to your eu-west-1 resource). Crucially, both records share the same name, type, and routing policy. The failover policy tells Route 53: "If the health check associated with the primary record fails, serve the IP address/hostname from the secondary record instead."
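The decision the failover policy makes can be sketched as a small function. This is an illustrative model with a hypothetical name, `select_failover_record`; the both-unhealthy branch reflects my understanding of the documented behavior (Route 53 has to answer with something, and it falls back to the primary), so verify it against the current documentation before relying on it.

```python
def select_failover_record(primary_healthy: bool, secondary_healthy: bool = True) -> str:
    """Return which record a failover policy would answer with (toy model).

    A secondary record with no health check attached is treated as always
    healthy, hence the default value.
    """
    if primary_healthy:
        return "primary"
    if secondary_healthy:
        return "secondary"
    # Assumption: with every record unhealthy, the primary is returned anyway.
    return "primary"


for primary_ok, secondary_ok in [(True, True), (False, True), (False, False)]:
    print(primary_ok, secondary_ok, "->", select_failover_record(primary_ok, secondary_ok))
```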
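The speed of the switch comes down to two things: how quickly the health check detects the failure, and the TTL (Time To Live) on your DNS records. Route 53 does not override the TTL. Once the primary is marked unhealthy, its name servers answer new queries with the secondary record, but a resolver that has already cached the primary answer can keep serving it until the TTL expires. That is why failover records usually carry a short TTL (e.g., 60 seconds); for alias records that point to AWS resources, Route 53 uses the target resource’s TTL rather than one you set. Route 53’s Anycast network routes each query to a nearby name server, which keeps resolution fast, but it does not shorten resolver caching. In practice the switchover takes roughly the detection time plus the TTL, usually on the order of a few minutes and less with faster health checks.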
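Whatever happens on the authoritative side, a recursive resolver that already holds an answer is entitled to keep serving it until the TTL runs out. The sketch below models that with a hypothetical `CachingResolver` class and a fake authoritative function; the IP addresses come from the documentation ranges and the 60-second TTL is an assumed value.

```python
from typing import Callable, Dict, Tuple

# Shared state standing in for the primary's health status.
state = {"primary_healthy": True}


def authoritative(name: str) -> Tuple[str, int]:
    """Fake authoritative answer: (IP address, TTL in seconds)."""
    ip = "192.0.2.10" if state["primary_healthy"] else "198.51.100.20"
    return ip, 60


class CachingResolver:
    """Toy recursive resolver that honors the TTL of cached answers."""

    def __init__(self, upstream: Callable[[str], Tuple[str, int]]) -> None:
        self._upstream = upstream
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry time)

    def resolve(self, name: str, now: float) -> str:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: the authoritative server is not asked
        ip, ttl = self._upstream(name)
        self._cache[name] = (ip, now + ttl)
        return ip


resolver = CachingResolver(authoritative)
first = resolver.resolve("www.example.com", now=0)
state["primary_healthy"] = False  # the primary fails right after the first lookup
during_ttl = resolver.resolve("www.example.com", now=30)
after_ttl = resolver.resolve("www.example.com", now=61)
print(first, during_ttl, after_ttl)
```

The lookup at 30 seconds still returns the primary address even though the authoritative answer has changed; only the lookup after the cached entry expires sees the secondary.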
To set this up, you’d create two records in your hosted zone:
- Primary Record:
  - Name: www.example.com
  - Type: A (or AAAA)
  - Alias: Yes
  - Alias Target: Your primary ALB (or other supported alias target) in us-east-1
  - Routing Policy: Failover
  - Failover Record Type: Primary
  - Health Check ID: hc-xxxxxxxxxxxxxxxxx (the ID of your health check)
  - Evaluate Target Health: Yes
- Secondary Record:
  - Name: www.example.com
  - Type: A (or AAAA)
  - Alias: Yes
  - Alias Target: Your secondary ALB (or other supported alias target) in eu-west-1
  - Routing Policy: Failover
  - Failover Record Type: Secondary
  - Health Check ID: (None - this record is only served when the primary is unhealthy)
  - Evaluate Target Health: Yes
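The two records above could be expressed as a change batch for the Route 53 API. The sketch below only builds and prints the request body; the ALB DNS names, the alias hosted zone ID placeholders, and the `SetIdentifier` values are hypothetical and must be replaced with your own. The API requires a unique `SetIdentifier` on each failover record, which the console shows as the record ID. `failover_record` is a helper introduced here for illustration.

```python
import json
from typing import Any, Dict, Optional


def failover_record(
    role: str,
    set_id: str,
    alias_dns_name: str,
    alias_zone_id: str,
    health_check_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one failover alias record set (role is 'PRIMARY' or 'SECONDARY')."""
    record: Dict[str, Any] = {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": set_id,  # must be unique among records sharing name and type
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,  # the load balancer's zone ID, not your own zone's
            "DNSName": alias_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


change_batch = {
    "Comment": "Failover pair for www.example.com",
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": failover_record(
                "PRIMARY",
                "www-primary",  # hypothetical identifier
                "primary-alb.us-east-1.elb.amazonaws.com",  # hypothetical DNS name
                "<ALB hosted zone ID for us-east-1>",  # placeholder
                health_check_id="hc-xxxxxxxxxxxxxxxxx",  # placeholder from the text
            ),
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": failover_record(
                "SECONDARY",
                "www-secondary",  # hypothetical identifier
                "secondary-alb.eu-west-1.elb.amazonaws.com",  # hypothetical DNS name
                "<ALB hosted zone ID for eu-west-1>",  # placeholder
            ),
        },
    ],
}

print(json.dumps(change_batch, indent=2))
```

With boto3 installed and credentials configured, this dictionary could be passed as `ChangeBatch` to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=change_batch)`; that call is deliberately left out so the sketch runs offline.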
You’d also create a health check that points to your primary resource. For instance, an HTTP health check on http://your-primary-alb-dns.amazonaws.com/health with a request interval of 30 seconds and a failure threshold of 3.
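As a rough illustration, the health check described above maps onto the following request parameters. The hostname is the placeholder from the text, the `CallerReference` value is a made-up unique string, and `validate_health_check_config` is a hypothetical helper that checks only the two numeric settings against the ranges Route 53 accepts (an interval of 10 or 30 seconds, a threshold from 1 to 10).

```python
from typing import Any, Dict

health_check_request: Dict[str, Any] = {
    "CallerReference": "primary-www-health-check-001",  # any string unique per request
    "HealthCheckConfig": {
        "Type": "HTTP",
        "FullyQualifiedDomainName": "your-primary-alb-dns.amazonaws.com",  # placeholder
        "Port": 80,
        "ResourcePath": "/health",
        "RequestInterval": 30,  # seconds between requests from each checker
        "FailureThreshold": 3,  # consecutive results needed to change status
    },
}


def validate_health_check_config(config: Dict[str, Any]) -> None:
    """Raise ValueError if the interval or threshold is outside the accepted range."""
    if config["RequestInterval"] not in (10, 30):
        raise ValueError("RequestInterval must be 10 or 30 seconds")
    if not 1 <= config["FailureThreshold"] <= 10:
        raise ValueError("FailureThreshold must be between 1 and 10")


validate_health_check_config(health_check_request["HealthCheckConfig"])
print("health check config looks valid")
```

With boto3, the same dictionary could be unpacked into `boto3.client("route53").create_health_check(**health_check_request)`, and the ID in the response is what you attach to the primary record.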
When Route 53 detects that the primary resource is unresponsive (e.g., the health check fails 3 times in a row at a 30-second interval, so detection takes roughly 90 seconds), it stops returning the primary record. Instead, it starts answering www.example.com queries with the secondary record. On the name servers the change takes effect as soon as the health status flips; clients see it once their resolvers’ cached answers expire, so the end-to-end time depends on the health check configuration and the record’s TTL.
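A back-of-the-envelope estimate of the failover time follows from those numbers. The sketch assumes detection takes about one request interval per required failure and that a resolver may hold a cached answer for one full TTL; it ignores checker aggregation and network delays, so treat the result as a rough figure, not a guarantee. The function name is hypothetical.

```python
def estimated_failover_seconds(request_interval: int, failure_threshold: int, record_ttl: int) -> int:
    """Rough worst-case estimate: detection time plus one full TTL of caching."""
    detection = request_interval * failure_threshold
    return detection + record_ttl


# Settings from the example: 30-second interval, threshold of 3, 60-second TTL.
standard = estimated_failover_seconds(30, 3, 60)
# A faster configuration: 10-second interval, threshold of 2, same TTL.
fast = estimated_failover_seconds(10, 2, 60)
print(standard, fast)  # 150 80
```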
The "Evaluate Target Health" setting on an Alias record is critical. When set to Yes for a failover configuration, it means that if the target of the Alias record (e.g., an ALB) is itself unhealthy, Route 53 treats the record as unhealthy, even if the explicit health check you configured is still passing. This provides an extra layer of safety.
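How the two signals combine can be sketched as a simple conjunction: a record counts as healthy only if every check that applies to it passes. `record_is_healthy` is a hypothetical function for illustration, and the sketch assumes a record with no checks at all is treated as healthy.

```python
from typing import Optional


def record_is_healthy(
    health_check_passing: Optional[bool],
    evaluate_target_health: bool,
    target_healthy: bool,
) -> bool:
    """Combine an optional explicit health check with Evaluate Target Health.

    health_check_passing is None when no health check is attached to the record.
    """
    checks = []
    if health_check_passing is not None:
        checks.append(health_check_passing)
    if evaluate_target_health:
        checks.append(target_healthy)
    return all(checks)  # no applicable checks means the record counts as healthy


# Explicit check passes but the ALB itself is unhealthy: the record fails over.
print(record_is_healthy(True, True, False))   # False
# Same situation with Evaluate Target Health off: the unhealthy target goes unnoticed.
print(record_is_healthy(True, False, False))  # True
```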
What most people don’t realize is that the "switch back" is just as automatic. When the primary endpoint passes its health checks again (the same threshold of consecutive results applies to recovery), Route 53 resumes answering with the primary record. No manual step is involved, so traffic returns to your primary region as soon as the health check considers it healthy, subject to the same TTL caching delay as the original failover.
The next challenge you’ll face is managing the state of your secondary environment, ensuring it’s always ready to take over without data loss.