Test RDS Failover: Chaos Engineering for Multi-AZ (2026)

RDS Multi-AZ failover isn’t just a safety net; it’s an active participant in your application’s resilience, and you should be testing it.

Here’s what a forced Multi-AZ failover looks like from the application’s side. Imagine a typical application front-end connecting to an RDS PostgreSQL instance.

import psycopg2
import time

# Database connection parameters
db_params = {
    "host": "your-rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com",
    "database": "mydatabase",
    "user": "myuser",
    "password": "mypassword",
    "port": "5432",
    "connect_timeout": 5,  # seconds; keeps a connection attempt from hanging during failover
}

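def connect_and_query():
    """Open a connection, run one query, and return the open connection (None on failure)."""
    conn = None
    try:
        conn = psycopg2.connect(**db_params)
        cursor = conn.cursor()
        # A simple query to check connectivity and get the transaction ID (PostgreSQL 13+)
        cursor.execute("SELECT pg_current_xact_id();")
        result = cursor.fetchone()
        print(f"Successfully connected. Current XID: {result[0]}")
        return conn
    except psycopg2.Error as e:
        print(f"Database connection error: {e}")
        if conn is not None:
            conn.close()  # Don't leak the connection if the query failed after connecting
        return None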

# --- Main execution loop ---
if __name__ == "__main__":
    print("Starting RDS failover test simulation...")
    connection = None
    while True:
        connection = connect_and_query()
        if connection:
            time.sleep(5) # Keep connection somewhat alive, but not too long to avoid idle timeouts
            connection.close()
        else:
            print("Attempting to reconnect...")
            time.sleep(10) # Wait before retrying if connection failed

This script repeatedly connects to your RDS instance, runs a quick query, holds the connection for a few seconds, and then disconnects. pg_current_xact_id() is a good stand-in for a real application query because it is cheap and needs no tables; note that it requires PostgreSQL 13 or later (older versions offer txid_current()). If this script runs for minutes or hours without errors, you have a baseline for normal operation to compare the failover run against.

Now, let’s simulate a failover. You’d trigger this via the AWS Console or AWS CLI:

aws rds reboot-db-instance --db-instance-identifier your-rds-instance --force-failover

The --force-failover flag tells RDS to fail over to the standby as part of the reboot, promoting the standby replica to primary. This is the "chaos" part. Your application, represented by the Python script, will likely see a brief interruption.
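The same reboot-with-failover can also be requested from Python. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; force_failover is a hypothetical helper name, and the client is passed in as an argument so it can be replaced with a stub in tests.

```python
def force_failover(rds_client, instance_id):
    """Ask RDS to reboot the instance with a failover to the standby.

    rds_client is expected to be a boto3 RDS client, e.g. boto3.client("rds").
    Returns the instance status reported in the response.
    """
    response = rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=True,  # counterpart of --force-failover on the CLI
    )
    return response["DBInstance"]["DBInstanceStatus"]


# Example usage (assumption: boto3 available, credentials configured):
#   import boto3
#   print(force_failover(boto3.client("rds"), "your-rds-instance"))
```

Run this only against an instance you are prepared to interrupt; the call starts a real failover.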

When the failover happens, the DNS endpoint for your RDS instance stays the same, but the IP address it resolves to changes as the standby becomes the new primary. Any open TCP connection to the old primary breaks, and new connection attempts fail until the new primary is reachable. In the script, those failures surface as psycopg2.Error in the except block; connect_and_query() returns None, and the main loop falls into its else branch and retries.
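One way to watch the address change is to resolve the endpoint before and after the failover and compare the results. This is a rough sketch using only the standard library; resolve_ips and ip_changed are hypothetical helper names, and what you see also depends on how your resolver caches DNS answers.

```python
import socket


def resolve_ips(hostname, port=5432):
    """Return the sorted list of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def ip_changed(before, after):
    """True if the set of resolved addresses differs between two observations."""
    return set(before) != set(after)


# Example usage:
#   before = resolve_ips("your-rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com")
#   ... trigger the failover and wait ...
#   after = resolve_ips("your-rds-instance.abcdefghijk.us-east-1.rds.amazonaws.com")
#   print(ip_changed(before, after))
```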

The critical factor for your application is how quickly it can re-establish a connection. RDS typically takes 60-120 seconds to complete a Multi-AZ failover. During this time, the primary is unavailable. Your application needs to tolerate this outage.
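To put a number on the outage your application actually sees, you can record each probe as a (timestamp, succeeded) pair and compute the longest failing stretch afterwards. The sketch below assumes that recording format; longest_outage is a hypothetical helper, not part of the script above.

```python
def longest_outage(probes):
    """Return the longest outage, in seconds, seen in a list of probe results.

    probes: iterable of (timestamp_seconds, succeeded) pairs in time order.
    An outage runs from the first failed probe to the next successful one.
    An outage that is still open at the end of the data is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure of a new outage
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest
```

Because probes only run every few seconds, the result is an estimate whose error is bounded by the probe interval.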

Here’s how to think about the system:

Primary Instance: Your active database.
Standby Replica: A synchronous replica in a different Availability Zone (AZ). It’s always up-to-date.
DNS Endpoint: A single DNS name that always points to the current primary instance. This is what your application connects to.
Replication: Synchronous replication means a write is acknowledged only after it has also been written to the standby, so committed transactions are not lost during failover.
Failover Process: When the primary becomes unhealthy (detected by RDS health checks), RDS performs these steps:
- Stops the primary instance.
- Promotes the standby replica to become the new primary.
- Updates the DNS endpoint to point to the new primary.
- Re-establishes a standby so the deployment is Multi-AZ again (typically the old primary takes the standby role once it recovers).
Application Reconnection: Your application’s connection is broken. It needs to retry connecting to the same DNS endpoint. The DNS resolution will then return the IP of the new primary.
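To confirm that the roles actually swapped, you can compare the instance’s Availability Zone placement before and after the test. This is a minimal sketch, assuming a boto3 RDS client is passed in; az_placement is a hypothetical helper name.

```python
def az_placement(rds_client, instance_id):
    """Return where the primary and standby currently live, per describe_db_instances.

    rds_client is expected to be a boto3 RDS client, e.g. boto3.client("rds").
    """
    response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
    instance = response["DBInstances"][0]
    return {
        "primary_az": instance["AvailabilityZone"],
        # Only present for Multi-AZ deployments, hence .get()
        "standby_az": instance.get("SecondaryAvailabilityZone"),
        "multi_az": instance["MultiAZ"],
    }
```

After a successful failover, primary_az should report the zone that was previously the standby’s.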

The most surprising thing about RDS Multi-AZ failover is that your application ideally never needs to know it happened, because the DNS endpoint remains constant. The chaos comes from the window in which the standby is being promoted and the endpoint does not yet resolve to the new IP, a window that DNS caching on the client side can stretch further. Your application’s resilience is measured by its ability to gracefully handle that temporary network and database unavailability and then reconnect.

Consider how your application manages connection pooling and retries. A naive retry loop might hammer the endpoint during the failover window, potentially overwhelming the newly promoted instance. A more sophisticated approach uses exponential backoff with jitter. For example, instead of retrying every 10 seconds, you might retry after 10s, then 20s, then 40s, with a small random delay added to each interval. This prevents thundering herd problems.
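The 10s, 20s, 40s schedule described above can be sketched as a small generator. The parameter names and defaults here (base, factor, cap, jitter) are illustrative assumptions, not values recommended by AWS.

```python
import random


def backoff_delays(base=10.0, factor=2.0, cap=60.0, jitter=1.0, attempts=5, rng=random):
    """Yield wait times in seconds: base, base*factor, ... capped at `cap`,
    each with up to `jitter` seconds of random delay added."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap) + rng.uniform(0, jitter)
        delay *= factor


# Example usage:
#   for wait in backoff_delays():
#       time.sleep(wait)   # then retry the connection
```

The cap keeps the delay from growing past the point where the application would notice recovery too late, and the jitter spreads out clients that all lost their connections at the same moment.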

The script above handles this in the simplest possible way: each failed psycopg2.connect() call is followed by a fixed time.sleep(10), and a later call succeeds once DNS resolves to the new primary. That fixed delay is a retry interval rather than true backoff, because it never grows.

The single most important setting for your application’s resilience during a failover is how long it keeps trying to connect. If the application gives up after, say, 30 seconds and the failover takes 100 seconds, it declares the database dead before RDS has finished promoting the standby. Stretching a single connection timeout past the expected failover duration would work, but it leaves each attempt hanging on a dead address. The better approach is to keep the per-attempt timeout short (for psycopg2, the libpq connect_timeout parameter) and give the retry logic a total budget longer than the expected failover window.
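A retry loop with a total time budget might look like the sketch below. connect_with_budget is a hypothetical helper; it takes any zero-argument callable, so in practice you would pass something like a lambda wrapping psycopg2.connect with a short connect_timeout. The 180-second default is an assumption chosen to exceed the 60-120 second window mentioned earlier.

```python
import time


def connect_with_budget(connect, budget_s=180.0, wait_s=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or the total time budget runs out.

    connect: zero-argument callable that returns a connection or raises.
    budget_s: total time to keep trying; should exceed the expected failover window.
    wait_s: pause between attempts (fixed here to keep the sketch short).
    """
    deadline = clock() + budget_s
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except Exception as exc:
            # Give up only if another wait would take us past the deadline.
            if clock() + wait_s > deadline:
                raise TimeoutError(f"gave up after {attempt} attempts") from exc
            sleep(wait_s)
```

The clock and sleep parameters exist only so the loop can be tested without real waiting; the fixed wait could be swapped for a growing, jittered delay.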

If you’ve successfully tested a failover and your application reconnects, the next problem you’ll encounter is understanding the performance implications of synchronous replication on write latency.