Route 53 can be slow to reflect health check changes, so your service can appear to be down when the real cause is only a propagation delay.

Let’s watch Route 53 health checks in action. Imagine you have a web application behind an Elastic Load Balancer (ELB). Route 53 is configured to evaluate the ELB’s health, and if the ELB is unhealthy, Route 53 stops routing DNS queries to it.

Here’s a simplified DNS record configuration:

{
  "Comment": "Health-checked ELB endpoint",
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z1ABCDEFGHIJ2KL",
          "DNSName": "my-elb-1234567890.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }
  ]
}

The key here is "EvaluateTargetHealth": true. This tells Route 53 to consult the health status of the underlying resource (the ELB in this case) when determining if the DNS record is healthy.
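As a rough sketch, the same change batch can be built in Python and handed to boto3. The helper name `build_alias_change` is hypothetical, and the hosted zone ID and DNS name are the placeholders from the configuration above. The record carries no TTL field, because Route 53 alias records take their TTL from the alias target.

```python
def build_alias_change(record_name, elb_dns_name, elb_hosted_zone_id):
    """Build a ChangeBatch dict for an alias A record that evaluates target health."""
    return {
        "Comment": "Health-checked ELB endpoint",
        "Changes": [
            {
                "Action": "CREATE",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    # Alias records have no TTL field of their own.
                    "AliasTarget": {
                        "HostedZoneId": elb_hosted_zone_id,  # the ELB's zone, not yours
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ],
    }


change_batch = build_alias_change(
    "app.example.com",
    "my-elb-1234567890.us-east-1.elb.amazonaws.com",
    "Z1ABCDEFGHIJ2KL",
)

# To apply it (needs AWS credentials; "ZEXAMPLE" stands in for your own hosted zone ID):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE", ChangeBatch=change_batch
# )
```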

Now, let’s look at the metrics. You’ll want to monitor these in CloudWatch:

  • InboundQueryVolume and OutboundQueryVolume (for Route 53 Resolver endpoints, if you’re using them): These count the DNS queries passing through your inbound and outbound endpoints. A sudden change might indicate a problem with upstream resolvers or network connectivity, but often it’s just normal traffic patterns.
  • HealthCheckStatus (for your Route 53 health checks): This is the most critical metric. It reports 1 for healthy and 0 for unhealthy. You’ll want to set up alarms on this.
  • ConnectionTime and TimeToFirstByte (for your Route 53 health checks): These measure how long Route 53’s health checkers take to connect to the endpoint and to receive the first byte of its response. Spikes here can indicate a struggling endpoint or general network issues, though they are less specific to your application’s availability.

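To monitor the health of your app.example.com record, you’d create a CloudWatch Alarm based on the HealthCheckStatus metric for the associated Route 53 health check. EvaluateTargetHealth on its own does not create a health check or emit this metric, so this assumes you have also created an explicit Route 53 health check that probes the application.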

Alarm Configuration Example:

  • Metric: HealthCheckStatus
  • Health Check ID: hc-abcdef1234567890
  • Statistic: Minimum
  • Period: 60 seconds
  • Threshold Type: Static
  • Comparison: Less than or equal to threshold
  • Threshold Value: 0 (Unhealthy)
  • Datapoints to Alarm: 3 out of 3

This alarm will trigger if the health check reports unhealthy for three consecutive 60-second periods, meaning the service has been unhealthy for at least 3 minutes.
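The alarm settings above can be sketched as the keyword arguments for CloudWatch’s `put_metric_alarm` call in boto3. The helper name, the alarm name, and the SNS topic ARN are hypothetical; the health check ID is the placeholder used above. Route 53 health check metrics are published in us-east-1, so the CloudWatch client would need to be created for that region.

```python
def build_health_check_alarm(health_check_id, sns_topic_arn):
    """Return put_metric_alarm kwargs matching the alarm configuration above."""
    return {
        "AlarmName": f"route53-health-{health_check_id}",  # hypothetical naming scheme
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Minimum",       # any unhealthy sample in the period counts
        "Period": 60,                 # seconds
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,       # "3 out of 3"
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm = build_health_check_alarm(
    "hc-abcdef1234567890",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # hypothetical topic
)

# Minimum time the check must stay unhealthy before the alarm fires.
seconds_to_alarm = alarm["Period"] * alarm["DatapointsToAlarm"]

# To create the alarm (needs AWS credentials):
# import boto3
# boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
```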

When the ELB becomes unhealthy (e.g., its instances are failing the ELB’s own health checks), EvaluateTargetHealth causes Route 53 to treat app.example.com as unhealthy, and an explicit Route 53 health check that probes the application will start reporting 0. Once the alarm threshold is met, the alarm’s actions run: typically an SNS notification to your operations team, and optionally an automated response such as an Auto Scaling policy.

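EvaluateTargetHealth works differently from an explicit Route 53 health check. An explicit health check probes its endpoint at a fixed request interval of either 30 seconds (standard) or 10 seconds (fast). With EvaluateTargetHealth on an alias to an ELB, Route 53 does not probe your application itself; it relies on the health status that the load balancer reports for its registered targets. An alias record cannot point directly at an EC2 instance, so to health check an instance you attach an explicit health check to a non-alias record that holds the instance’s IP address. If an alias record has both EvaluateTargetHealth enabled and a health check attached, the DNS record is only considered healthy if both the Route 53 health check itself passes and the target resource is healthy.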
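The rule in the last sentence can be written as a small predicate. This is a simplified model for reasoning about the behavior, not Route 53’s actual implementation, and the function and parameter names are made up for illustration.

```python
from typing import Optional


def alias_record_is_healthy(
    target_healthy: bool,
    evaluate_target_health: bool,
    health_check_passing: Optional[bool] = None,
) -> bool:
    """Model whether an alias record counts as healthy.

    health_check_passing is None when no explicit health check is attached.
    """
    # The target's health only matters when EvaluateTargetHealth is enabled.
    if evaluate_target_health and not target_healthy:
        return False
    # An attached health check must also pass.
    if health_check_passing is not None and not health_check_passing:
        return False
    return True
```

For example, `alias_record_is_healthy(target_healthy=True, evaluate_target_health=True, health_check_passing=False)` is false in this model: a healthy ELB does not rescue a record whose attached health check is failing.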

A common pitfall is assuming that Route 53 health checks are instantaneous. There’s a health check interval (e.g., 30 seconds), a failure threshold (the number of consecutive failed checks required before the status flips, 3 by default), a propagation delay for the health check status update across Route 53’s global network (up to 60 seconds), and then the DNS TTL for resolvers to pick up the change. This means a service could be down for 2-3 minutes before Route 53 stops returning the failed endpoint, and then at least one more TTL before clients stop receiving stale, incorrect DNS responses.
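The delays add up roughly as follows. This is back-of-the-envelope arithmetic, assuming a failure threshold of 3 consecutive failed checks (Route 53’s default), the 60-second propagation figure quoted above, and a 60-second TTL on the record; the function name is hypothetical.

```python
def worst_case_failover_seconds(interval_s, failure_threshold, propagation_s, ttl_s):
    """Rough upper bound on how long clients may keep resolving to a failed endpoint."""
    detection_s = interval_s * failure_threshold  # consecutive failed checks
    return detection_s + propagation_s + ttl_s


# Standard 30-second interval: 90 s detection + 60 s propagation + 60 s TTL.
standard = worst_case_failover_seconds(30, 3, 60, 60)

# Fast 10-second interval shortens detection only: 30 s + 60 s + 60 s.
fast = worst_case_failover_seconds(10, 3, 60, 60)
```

Under these assumptions the standard interval gives 210 seconds and the fast interval 150 seconds, which shows that a shorter check interval helps but does not remove the propagation and TTL components.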

With EvaluateTargetHealth, Route 53 does not determine the ELB’s health by probing the ELB’s DNS name as an ordinary client would. Instead, Route 53 internally maps your AliasTarget to the health information that the load balancer exposes, which is why you never configure a health check endpoint for the alias yourself.

The next thing you’ll likely grapple with is how to differentiate between a Route 53 health check failure and an ELB’s underlying instance health check failure.
