The Route 53 health check alarm failure means the health check is no longer reporting "Healthy" to Route 53, causing it to stop sending traffic to the associated resource.
A common culprit is an incorrect health check configuration, especially when dealing with custom health check endpoints or specific HTTP status codes.
Diagnosis: In the Route 53 console, select "Health checks" and find the health check that is failing. Examine its "Status," then open the "Health checkers" tab to see the status and last failure reason reported by each checker. The AWS CLI returns the same per-checker detail (list-health-checks only returns the configuration, not the current status):
aws route53 get-health-check-status --health-check-id your-health-check-id
aws route53 get-health-check-last-failure-reason --health-check-id your-health-check-id
Replace your-health-check-id with the ID of your health check. If you only know the caller reference, look up the ID first:
aws route53 list-health-checks --query "HealthChecks[?CallerReference=='your-caller-reference-id'].Id"
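When many checkers report at once, it can help to filter the output down to the ones that are failing. The following is a minimal sketch, assuming you have saved the JSON printed by `aws route53 get-health-check-status --health-check-id <id>` and loaded it as a dictionary. The function name and the sample values are hypothetical, and the assumption that a passing status text begins with "Success" should be checked against your own output.

```python
def failing_checkers(status_response: dict) -> list:
    """Return (region, checker IP, status text) for each checker not reporting success.

    Assumes the response shape of `aws route53 get-health-check-status`:
    {"HealthCheckObservations": [{"Region", "IPAddress", "StatusReport": {"Status"}}]}
    """
    failures = []
    for obs in status_response.get("HealthCheckObservations", []):
        status = obs.get("StatusReport", {}).get("Status", "")
        # Assumption: passing checkers report a status text that starts with "Success".
        if not status.startswith("Success"):
            failures.append((obs.get("Region", "?"), obs.get("IPAddress", "?"), status))
    return failures


# Illustrative sample only; regions, IPs, and messages are made up.
sample = {
    "HealthCheckObservations": [
        {
            "Region": "us-east-1",
            "IPAddress": "192.0.2.10",
            "StatusReport": {"Status": "Success: HTTP Status Code 200, OK"},
        },
        {
            "Region": "us-west-2",
            "IPAddress": "198.51.100.7",
            "StatusReport": {"Status": "Failure: Connection timed out"},
        },
    ]
}

for region, ip, status in failing_checkers(sample):
    print(f"{region} {ip}: {status}")
```

If only some regions fail, the cause is more likely network filtering or routing than the application itself.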
Cause 1: Incorrect Endpoint or Port
The health check is attempting to connect to the wrong IP address, hostname, or port. This is most frequent when an underlying infrastructure change (like an IP address update or a port change) hasn’t been reflected in the health check configuration.
- Diagnosis: In the Route 53 health check configuration, verify the "Domain name" and "Port" fields. If you’re using an IP address, ensure it’s still valid and reachable. If it’s a hostname, confirm DNS resolution is working correctly to the expected IP.
- Fix: Update the "Domain name" or "Port" in the health check configuration to the correct values. For example, if your service moved from port 80 to 8080, change the port in the health check to 8080.
- Why it works: Route 53 opens a TCP connection or sends an HTTP(S) request to the configured endpoint and port. If either value is wrong, the connection is refused or times out, and the checkers report the endpoint as unhealthy.
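A quick way to confirm that the configured host and port accept connections at all is a plain TCP connect. This is a minimal sketch using only the Python standard library. The function name is hypothetical, the 10-second default is an assumption chosen to resemble a TCP health check, and the test runs from your machine rather than from Route 53's network, so a success here does not rule out firewall problems.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and tries each resolved address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example (hypothetical host): compare the old and new port after a migration.
# print(can_connect("your-domain.com", 80), can_connect("your-domain.com", 8080))
```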
Cause 2: Unresponsive Application or Service
The application or service that the health check is monitoring is not responding to requests, or it’s responding with an error. This could be due to application crashes, resource exhaustion (CPU, memory), or bugs.
- Diagnosis: Manually request the health check endpoint from an external network or from a machine whose vantage point is similar to Route 53’s. For an HTTP health check, use curl:
curl -I http://your-domain.com/health
Check the HTTP status code returned. Route 53 treats a 2xx or 3xx status code as healthy. If you get a 4xx or 5xx, or a timeout, the application is the issue. A slow response can also fail the check: Route 53 expects the TCP connection to be established within four seconds and the HTTP response within two seconds after connecting.
- Fix: Investigate the application logs and server metrics (CPU, memory, network I/O) for the resource. Restart the application, scale up resources, or fix the underlying bug causing the error.
- Why it works: Route 53’s health check is a direct indicator of the application’s ability to serve requests. By fixing the application’s responsiveness, you restore the health check’s ability to receive a valid response.
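The manual check above can be scripted so that it is easy to repeat during an incident. This is a rough sketch with the standard library's `http.client`, which, like Route 53, does not follow redirects. The function names, the default path `/health`, and the single 4-second socket timeout are assumptions; the timeout only approximates Route 53's separate connect and response limits.

```python
import http.client


def is_healthy_status(code: int) -> bool:
    """Route 53 HTTP(S) health checks treat 2xx and 3xx status codes as healthy."""
    return 200 <= code < 400


def probe(host: str, port: int = 80, path: str = "/health", timeout: float = 4.0):
    """Send one GET request and return (status_code, healthy).

    Returns (None, False) if the connection fails or times out.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, is_healthy_status(resp.status)
    except (OSError, http.client.HTTPException):
        return None, False
    finally:
        conn.close()


# Example (hypothetical host):
# print(probe("your-domain.com", 80, "/health"))
```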
Cause 3: Network Connectivity Issues (Firewall, Security Groups, NACLs)
Network access to the health check endpoint is blocked. This is a very common cause, especially after infrastructure changes or new deployments.
- Diagnosis: Route 53 health checks originate from health checkers in multiple AWS Regions, not from inside your VPC. Ensure that AWS Security Groups, Network Access Control Lists (NACLs), and any host-based or on-premises firewalls allow inbound traffic on the health check port (usually 80 or 443) from Route 53’s health checker IP ranges. AWS publishes these ranges in ip-ranges.json under the service name ROUTE53_HEALTHCHECKS; you can also find them in the AWS documentation (search for "Route 53 health checker IP address ranges").
- Fix: Add inbound rules to your Security Groups or NACLs to permit traffic from the Route 53 health checker IP ranges to your instance on the health check port. NACLs are stateless, so they also need an outbound rule for the return traffic. For example, in a Security Group, you’d add one rule like this per health checker CIDR block:
  - Type: Custom TCP
  - Protocol: TCP
  - Port Range: 80 (or 443)
  - Source: a Route 53 health checker CIDR block (look up the current ranges; a private range such as 172.16.0.0/12 will not match them)
- Why it works: By allowing traffic from Route 53’s health check IPs, you enable the health check probes to reach your application, allowing them to report a healthy status.
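Rather than copying CIDR blocks by hand, you can extract them from the ip-ranges.json file that AWS publishes. The sketch below assumes the documented layout of that file (a `prefixes` list whose entries carry `ip_prefix` and `service`) and filters on the `ROUTE53_HEALTHCHECKS` service. The function names are hypothetical, and only IPv4 prefixes are handled.

```python
import json
import urllib.request

IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def health_checker_cidrs(ip_ranges: dict) -> list:
    """Return the sorted, de-duplicated IPv4 CIDR blocks used by Route 53 health checkers."""
    return sorted(
        {
            entry["ip_prefix"]
            for entry in ip_ranges.get("prefixes", [])
            if entry.get("service") == "ROUTE53_HEALTHCHECKS"
        }
    )


def fetch_ip_ranges(url: str = IP_RANGES_URL) -> dict:
    """Download and parse ip-ranges.json (requires network access)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Example: print one CIDR per line, ready to paste into Security Group rules.
# for cidr in health_checker_cidrs(fetch_ip_ranges()):
#     print(cidr)
```

The ranges change occasionally, so it is safer to regenerate the rules from the file than to hard-code them.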
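Cause 4: SSL/TLS Problems (for HTTPS health checks)
If you’re using HTTPS health checks, a failed SSL/TLS handshake with the target resource will cause the health check to fail. Note that, according to the AWS documentation, HTTPS health checks do not validate the certificate itself, so an expired or untrusted certificate alone should not fail the check. Look instead for handshake-level problems, such as an endpoint that requires SNI while "Enable SNI" is turned off on the health check, or a TLS configuration the checkers cannot negotiate.
- Diagnosis: Request the HTTPS endpoint with curl while ignoring certificate errors, which shows roughly what a non-validating client sees:
curl -I --insecure https://your-domain.com/health
If this works but a regular
curl -I https://your-domain.com/health
fails, the certificate is invalid for validating clients such as browsers, which is worth fixing but is unlikely to be what Route 53 is reporting. If both commands fail, check the TLS configuration on the endpoint and the "Enable SNI" setting on the health check.
- Fix: Enable SNI on the health check if the endpoint selects its certificate by hostname, and correct the endpoint’s TLS configuration if the handshake fails. Renew or replace an expired certificate in any case, ensure the certificate chain is complete, and confirm that the domain name in the certificate matches the domain name used in the health check.
- Why it works: Route 53’s HTTPS health check must complete an SSL/TLS handshake before it can send the request. If the handshake fails, Route 53 cannot establish a secure connection, and the check is reported as unhealthy.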
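To separate handshake problems from certificate trust problems, you can attempt the TLS handshake with certificate verification switched off, once with SNI and once without. This is a minimal sketch using the standard `ssl` module. The function name and the 4-second timeout are assumptions, and a successful handshake from your machine does not guarantee the same result from Route 53's checkers.

```python
import socket
import ssl


def tls_handshake(host: str, port: int = 443, send_sni: bool = True, timeout: float = 4.0) -> str:
    """Complete a TLS handshake and return the negotiated protocol version.

    Certificate verification is disabled on purpose so that only the handshake
    itself is tested. Raises ssl.SSLError or another OSError on failure.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE  # do not validate the certificate chain
    with socket.create_connection((host, port), timeout=timeout) as sock:
        server_hostname = host if send_sni else None
        with ctx.wrap_socket(sock, server_hostname=server_hostname) as tls:
            return tls.version()


# Example (hypothetical host): a failure only when send_sni=False suggests
# that the endpoint requires SNI.
# print(tls_handshake("your-domain.com", send_sni=True))
# print(tls_handshake("your-domain.com", send_sni=False))
```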
Cause 5: Incorrect Health Check Response (Custom Health Check)
For health checks that use string matching, Route 53 looks for a specific string in the response body. If the application returns a different body, or a status code outside the 2xx and 3xx range, the health check fails. Route 53 does not let you require one specific status code.
- Diagnosis: In the Route 53 health check configuration, review the "Request interval," "Failure threshold," and, most importantly, the "Search string." Manually query your health check endpoint and compare the status code and response body to what’s configured:
curl -i http://your-domain.com/health
If the search string is "OK" and the body says "Healthy," the health check will fail. The same happens if the app returns 204 No Content: the status code is acceptable, but there is no body to match against. Route 53 only searches the first 5,120 bytes of the response body.
- Fix: Adjust the application’s health check endpoint to return the exact string that the Route 53 health check is configured to expect, or update the search string to match the application. For instance, if Route 53 is looking for "OK" and your app returns 204 No Content, modify the app to return 200 OK with "OK" in the body.
- Why it works: Route 53 health checks are designed to be precise. If the application’s response doesn’t meet the exact criteria defined in the health check configuration, Route 53 will interpret it as an unhealthy state.
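The matching rule for a string-matching health check can be written down as a small predicate, which is handy in the application's own tests. This sketch assumes the documented behavior that the search string must appear within the first 5,120 bytes of the body and that the status code must be 2xx or 3xx. The function name is hypothetical, and the byte-wise comparison is an approximation of what the checkers do.

```python
SEARCH_WINDOW_BYTES = 5120  # Route 53 only searches this much of the response body


def string_match_passes(status_code: int, body: bytes, search_string: str) -> bool:
    """Approximate whether a string-matching health check would pass for this response."""
    if not 200 <= status_code < 400:
        return False
    # The string must appear entirely inside the search window.
    return search_string.encode("utf-8") in body[:SEARCH_WINDOW_BYTES]


print(string_match_passes(200, b'{"status": "OK"}', "OK"))  # True
print(string_match_passes(200, b"Healthy", "OK"))           # False: wrong string
print(string_match_passes(204, b"", "OK"))                  # False: no body to search
```

Keeping the health endpoint's body short avoids the case where the expected string exists but falls outside the searched window.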
Cause 6: Health Check Thresholds Too Strict
The health check is failing due to transient network issues or brief application slowdowns that are within acceptable limits for your service but exceed the health check’s "Failure threshold."
- Diagnosis: Review the "Request interval" and "Failure threshold" settings for your health check. If the interval is very short (e.g., 10 seconds) and the failure threshold is low (e.g., 2 failures), even a minor, temporary hiccup can trigger a failure.
- Fix: Increase the "Failure threshold." For example, change it from 2 to 3 or 4. This gives the health check more attempts to confirm a persistent failure before marking the resource as unhealthy.
- Why it works: A higher failure threshold means Route 53 must observe the unhealthy state for a longer duration or across more probes before it considers the resource truly unavailable, mitigating false positives from temporary network glitches.
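The effect of the threshold is easiest to see by replaying a sequence of probe results. The following is a simplified model of a single health checker, assuming the status flips only after `failure_threshold` consecutive probes disagree with the current status. The function name is hypothetical, and the model ignores how Route 53 aggregates results across many checkers.

```python
def status_timeline(probe_results, failure_threshold: int = 3, initially_healthy: bool = True):
    """Return the reported status after each probe for one simulated health checker."""
    healthy = initially_healthy
    streak = 0  # consecutive probes that disagree with the current status
    timeline = []
    for ok in probe_results:
        if ok != healthy:
            streak += 1
            if streak >= failure_threshold:
                healthy = ok  # enough consecutive disagreement: flip the status
                streak = 0
        else:
            streak = 0
        timeline.append(healthy)
    return timeline


# A two-probe blip: passes, fails twice, then recovers.
blip = [True, False, False, True, True]
print(status_timeline(blip, failure_threshold=2))  # [True, True, False, False, True]
print(status_timeline(blip, failure_threshold=3))  # [True, True, True, True, True]
```

With a threshold of 2 the blip marks the endpoint unhealthy and recovery takes two more passing probes, while a threshold of 3 absorbs it entirely, at the cost of reacting more slowly to a real outage.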
The next error you’ll likely encounter after fixing these is a DNS resolution failure for your custom domain name if the underlying DNS records themselves are misconfigured or pointing to an invalid target.