The RDS instance is refusing new connections because it’s choked by a single, runaway SQL query that has locked essential system tables.
Common Causes and Fixes
1. Accidental Infinite Loop or Massive Data Scan
- Diagnosis: Connect to your RDS instance using a client that can show active queries (like `psql` for PostgreSQL or MySQL Workbench for MySQL). Look for queries that have been running for an unusually long time (minutes, hours, or days) and are consuming significant CPU or I/O. For PostgreSQL, `SELECT * FROM pg_stat_activity WHERE state = 'active' AND query NOT ILIKE '%pg_stat_activity%';` is your friend. For MySQL, `SHOW FULL PROCESSLIST;` is the command.
- Fix: Identify the `process_id` (MySQL) or `pid` (PostgreSQL) of the offending query.
  - PostgreSQL: `SELECT pg_terminate_backend(<pid>);`
  - MySQL: `KILL <process_id>;`
- Why it works: These commands send a signal to the database server to stop processing the specified query and release any locks it holds.
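The diagnosis and fix steps above can be strung together in a small script. The following is a minimal sketch for PostgreSQL only, not a definitive tool: the `runtime_seconds` alias, the `Session` class, the `terminate_statements` helper, and the 300-second threshold are all assumptions made for illustration, and the sample rows are made up. Fetching the rows with a database driver is deliberately left out.

```python
from dataclasses import dataclass

# PostgreSQL query listing active sessions and how long their current
# statement has been running. `runtime_seconds` is an alias chosen for this sketch.
LONG_RUNNING_SQL = """
SELECT pid,
       EXTRACT(EPOCH FROM (now() - query_start)) AS runtime_seconds,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
ORDER BY runtime_seconds DESC;
"""


@dataclass
class Session:
    """One row of the query above (hypothetical container for this sketch)."""
    pid: int
    runtime_seconds: float
    query: str


def terminate_statements(sessions, threshold_seconds=300):
    """Return one pg_terminate_backend call per session at or over the threshold.

    The longest-running session comes first. Nothing is executed here, so the
    statements can be reviewed before they are run.
    """
    offenders = [s for s in sessions if s.runtime_seconds >= threshold_seconds]
    offenders.sort(key=lambda s: s.runtime_seconds, reverse=True)
    return [f"SELECT pg_terminate_backend({int(s.pid)});" for s in offenders]


# Made-up rows, shaped like the result of LONG_RUNNING_SQL.
sample = [
    Session(pid=4242, runtime_seconds=7200.0, query="SELECT * FROM big_a, big_b"),
    Session(pid=4310, runtime_seconds=1.5, query="SELECT 1"),
]

for statement in terminate_statements(sample):
    print(statement)
```

Generating the statements rather than running them keeps a human in the loop, which matters because terminating the wrong backend drops a healthy session.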
2. Application Bug: Missing WHERE Clause in UPDATE or DELETE
- Diagnosis: As above, find the long-running query. If it’s an `UPDATE` or `DELETE` statement that appears to be processing an impossibly large number of rows (e.g., `UPDATE my_table SET status = 'processed'`), it’s likely missing a `WHERE` clause. The `pg_stat_activity` or `SHOW FULL PROCESSLIST` output will show the exact SQL statement.
- Fix:
  - PostgreSQL: `SELECT pg_terminate_backend(<pid>);`
  - MySQL: `KILL <process_id>;`
  - Then, immediately fix the application code to include the missing `WHERE` clause. For example, change `UPDATE my_table SET status = 'processed';` to `UPDATE my_table SET status = 'processed' WHERE id = 123;`.
- Why it works: Terminating the session stops the runaway operation and rolls back its uncommitted changes. Fixing the application code prevents the same issue from recurring.
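One way to make the application-side fix harder to undo is to refuse to commit a write that touches far more rows than expected. The sketch below illustrates the idea with Python's built-in `sqlite3` module as a stand-in for the real database; the `guarded_update` helper and its `max_rows` parameter are hypothetical names invented for this example.

```python
import sqlite3


def guarded_update(conn, sql, params=(), max_rows=1):
    """Run an UPDATE or DELETE and commit only if it touched at most max_rows rows.

    If more rows were affected, the change is rolled back and an error is raised.
    """
    cursor = conn.execute(sql, params)
    if cursor.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"refusing to commit: {cursor.rowcount} rows affected, "
            f"expected at most {max_rows}"
        )
    conn.commit()
    return cursor.rowcount


# In-memory SQLite database standing in for the production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, status) VALUES (?, ?)",
    [(i, "pending") for i in range(1, 6)],
)
conn.commit()

# The corrected statement touches exactly one row and is committed.
updated = guarded_update(
    conn, "UPDATE my_table SET status = 'processed' WHERE id = ?", (3,)
)

# The buggy statement (no WHERE clause) touches every row and is rolled back.
blocked = False
try:
    guarded_update(conn, "UPDATE my_table SET status = 'processed'")
except RuntimeError:
    blocked = True

processed = conn.execute(
    "SELECT COUNT(*) FROM my_table WHERE status = 'processed'"
).fetchone()[0]
print(updated, blocked, processed)
```

Note the limit of this approach: the statement still runs to completion before the row count is known, so the guard protects the data rather than preventing the long run itself.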
3. Deadlock on Critical Tables
- Diagnosis: Long-running queries can sometimes be the result of a deadlock, where two or more transactions are waiting for each other to release locks. Check your database logs for "deadlock detected" messages. In PostgreSQL, you might see this in `pg_stat_activity` as queries stuck waiting, with the `wait_event_type` and `wait_event` fields indicating lock contention (for example, `wait_event_type = 'Lock'`). For MySQL, `SHOW ENGINE INNODB STATUS;` will often reveal deadlock information in the `LATEST DETECTED DEADLOCK` section. Both PostgreSQL and InnoDB normally detect a true deadlock and abort one participant on their own, so a session that stays stuck is more often blocked behind a long-held lock; the fix is the same.
- Fix:
  - PostgreSQL: `SELECT pg_terminate_backend(<pid>);` (identify the PID involved in the deadlock).
  - MySQL: `KILL <process_id>;` (identify the process ID involved in the deadlock).
  - Application Level: Review transaction isolation levels and the order of operations in your application to avoid acquiring locks in conflicting orders.
- Why it works: Terminating one of the participants in a deadlock allows the other(s) to proceed. Addressing the application logic prevents future deadlocks.
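The advice about not acquiring locks in conflicting orders usually comes down to one rule: every transaction locks the rows it needs in the same, predictable order. Here is a minimal sketch of that rule; the `lock_plan` helper, the `accounts` table, and the row IDs are hypothetical, and the `%s` placeholder assumes a driver that uses that parameter style.

```python
def lock_plan(table, row_ids):
    """Return (sql, params) pairs that lock the given rows in ascending id order.

    Sorting means two transactions that need the same rows always request the
    locks in the same sequence, so neither can hold a lock the other is
    waiting for while itself waiting on the other.

    `table` is interpolated into the SQL text, so it must be a trusted
    constant, never user input.
    """
    ordered_ids = sorted(set(row_ids))
    sql = f"SELECT id FROM {table} WHERE id = %s FOR UPDATE"
    return [(sql, (row_id,)) for row_id in ordered_ids]


# Two transfers touching the same pair of rows in opposite directions.
transfer_a_to_b = lock_plan("accounts", [17, 4])
transfer_b_to_a = lock_plan("accounts", [4, 17])

for sql, params in transfer_a_to_b:
    print(sql, params)
```

Both plans lock row 4 before row 17, regardless of the order in which the application listed them. Each transaction would execute its plan first and only then perform its updates.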
4. Excessive Index Rebuilding or Table Rewriting
- Diagnosis: Sometimes, maintenance operations like `VACUUM FULL` (PostgreSQL) or `OPTIMIZE TABLE` (MySQL), or even certain types of index rebuilds, can manifest as very long-running queries that lock tables. These operations often rewrite entire tables or indexes, which can take a considerable amount of time and resources. Check `pg_stat_activity` for queries mentioning `VACUUM` or `REINDEX`, or `SHOW PROCESSLIST` for `OPTIMIZE TABLE`.
- Fix:
  - PostgreSQL: If it’s `VACUUM FULL` or `REINDEX`, use `SELECT pg_terminate_backend(<pid>);`. The interrupted operation is rolled back, so the table stays consistent, but the work done so far is discarded and the maintenance has to be run again later. An interrupted `REINDEX CONCURRENTLY` can leave an invalid index behind that must be dropped or rebuilt.
  - MySQL: `KILL <process_id>;` for `OPTIMIZE TABLE`.
  - Prevention: Schedule such maintenance during low-traffic periods or use alternative, non-blocking maintenance strategies (e.g., `VACUUM` without `FULL` in PostgreSQL, online DDL for some operations in MySQL).
- Why it works: Terminating the session stops the blocking operation and releases its table locks. Understanding and avoiding long-running, blocking maintenance is key.
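When scanning a long process list, it can help to flag the maintenance statements named above automatically. The following is a rough sketch, assuming the query text has already been fetched; `is_blocking_maintenance` is a hypothetical helper, and its patterns only cover the statements this section mentions, not every blocking operation.

```python
import re

# Patterns for the blocking maintenance statements discussed in this section.
_BLOCKING_PATTERNS = [
    # VACUUM FULL, including the parenthesized form: VACUUM (FULL, VERBOSE) t
    re.compile(r"^\s*VACUUM\b.*\bFULL\b", re.IGNORECASE | re.DOTALL),
    # REINDEX, unless the CONCURRENTLY option appears anywhere after it
    re.compile(r"^\s*REINDEX\b(?!.*\bCONCURRENTLY\b)", re.IGNORECASE | re.DOTALL),
    # MySQL table rebuild
    re.compile(r"^\s*OPTIMIZE\s+TABLE\b", re.IGNORECASE),
]


def is_blocking_maintenance(query):
    """Return True if the statement text starts with one of the patterns above."""
    return any(pattern.search(query) for pattern in _BLOCKING_PATTERNS)


# Made-up statements, as they might appear in the query column.
for text in [
    "VACUUM FULL my_table",
    "VACUUM (VERBOSE, ANALYZE) my_table",
    "REINDEX TABLE CONCURRENTLY my_table",
    "OPTIMIZE TABLE my_table",
]:
    print(is_blocking_maintenance(text), text)
```

This is a text match only: it does not inspect locks, so a flagged statement still needs to be confirmed as the actual blocker before it is terminated.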
5. Large Transaction Holding Locks
- Diagnosis: A transaction that started long ago and has yet to be committed or rolled back can hold locks on many rows or even entire tables. In `pg_stat_activity`, look for sessions with a `state` of `idle in transaction` or `idle in transaction (aborted)` and a `backend_start` time (or, more precisely, an `xact_start` time) from long ago. In MySQL’s `SHOW PROCESSLIST`, look for connections with a `Command` of `Sleep` but a high `Time` value, and check if they have active transactions (`SHOW ENGINE INNODB STATUS;`).
- Fix:
  - PostgreSQL: `SELECT pg_terminate_backend(<pid>);` for the `idle in transaction` session.
  - MySQL: `KILL <process_id>;` for the sleeping connection.
  - Application Level: Ensure your application code explicitly commits or rolls back transactions, and implement connection pooling with aggressive timeouts to prevent stale transactions from lingering.
- Why it works: Terminating the session forces the database to roll back any uncommitted transaction, releasing its locks.
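On the application side, the simplest way to guarantee that every transaction ends is to wrap it so that a commit or rollback always happens, even when an exception is raised. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the real database driver; the `transaction` helper and the `jobs` table are hypothetical names for this example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Commit if the block succeeds, roll back if it raises.

    Either way the transaction is closed when the block exits, so the
    connection never lingers with uncommitted work and held locks.
    """
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")

# Successful block: the insert is committed.
with transaction(conn):
    conn.execute("INSERT INTO jobs (status) VALUES ('queued')")

# Failing block: the update is rolled back and the error still propagates.
try:
    with transaction(conn):
        conn.execute("UPDATE jobs SET status = 'running'")
        raise ValueError("simulated application failure")
except ValueError:
    pass

status = conn.execute("SELECT status FROM jobs").fetchone()[0]
still_open = conn.in_transaction
print(status, still_open)
```

On PostgreSQL, the `idle_in_transaction_session_timeout` setting can serve as a server-side backstop for sessions that slip past this kind of application-level discipline.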
6. Resource Exhaustion (Less Common as a Direct Cause of a Single Long Query, but Contributory)
- Diagnosis: While not typically the root cause of a single query running forever, if your RDS instance is under extreme CPU, memory, or I/O pressure, even normal queries can slow to a crawl, appearing "long-running." Monitor CloudWatch metrics for `CPUUtilization`, `FreeableMemory`, `ReadIOPS`, and `WriteIOPS`.
- Fix:
  - Scale Up: Increase the instance class (e.g., from `db.t3.medium` to `db.m5.large`).
  - Scale Out: For read-heavy workloads, add read replicas.
  - Optimize Queries: Analyze query plans (`EXPLAIN`) for the long-running query and others to find missing indexes or inefficient operations.
- Why it works: More resources allow queries to complete faster. Optimized queries use resources more efficiently.
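To show what reading a query plan for a missing index looks like in practice, here is a small self-contained sketch. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's built-in `sqlite3` module purely as a stand-in: PostgreSQL and MySQL use `EXPLAIN` with different output formats. The `orders` table, the `idx_orders_customer` index, and the `plan` helper are hypothetical.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)
conn.commit()

query = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the plan is a scan of the whole table.
before = plan(conn, query, (42,))

# After adding the index, the plan switches to an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(conn, query, (42,))

print("before:", before)
print("after: ", after)
```

The exact plan wording varies between SQLite versions, but the shift from a full scan to a search that names the index is the signal to look for, and the same reasoning applies when reading `EXPLAIN` output on the RDS engine itself.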
After resolving the immediate long-running query, you’ll likely encounter `Connection timed out` or `Too many connections` errors if the initial problem caused a backlog of connection attempts that are now trying to get through simultaneously.