Your RDS instance is screaming under high CPU load because the database engine can’t keep up with the incoming requests, and it’s hitting its processing limits.
The most common culprit is inefficient queries. A query that scans millions of rows when it only needs a few can balloon CPU usage. You’ll often see this manifest as SELECT statements dominating the pg_stat_statements view (for PostgreSQL) or SHOW PROFILE output (for MySQL).
Diagnosis: Start with the RDS Performance Insights dashboard. Open the "Top SQL" tab and look for the statements whose load is dominated by the CPU wait; these are the queries consuming the most processor time. If you don’t have Performance Insights enabled, or you want a more granular look, connect to your database and run:
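For PostgreSQL (requires the pg_stat_statements extension; on versions before 13 the column is total_time rather than total_exec_time):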
SELECT
total_exec_time,
calls,
rows,
substring(query, 1, 60) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
For MySQL:
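SET profiling = 1;
-- Run the suspect query here, then list the profiled statements:
SHOW PROFILES;
SHOW PROFILE CPU FOR QUERY <Query_ID>;
(Replace <Query_ID> with the ID of the most time-consuming query from SHOW PROFILES. Profiling only covers statements run in the current session after it is enabled, and SHOW PROFILE is deprecated in recent MySQL versions in favor of the Performance Schema.)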
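Total execution time favors statements that run very often, so it helps to look at time per call as well. The following is a minimal Python sketch, assuming you have already fetched rows from pg_stat_statements; the `StatementStat` type, the `rank_statements` helper, and the sample numbers are all hypothetical and only illustrate the arithmetic.

```python
from __future__ import annotations

from typing import NamedTuple


class StatementStat(NamedTuple):
    query: str
    total_exec_time_ms: float  # cumulative time across all calls
    calls: int


def rank_statements(stats: list[StatementStat]) -> list[tuple[str, float, float]]:
    """Return (query, total_ms, mean_ms_per_call), sorted by total time descending."""
    ranked = []
    for s in stats:
        mean_ms = s.total_exec_time_ms / s.calls if s.calls else 0.0
        ranked.append((s.query, s.total_exec_time_ms, mean_ms))
    return sorted(ranked, key=lambda r: r[1], reverse=True)


# Hypothetical sample data, not real measurements.
sample = [
    StatementStat("SELECT * FROM users WHERE email = $1", 90_000.0, 300_000),
    StatementStat("SELECT ... FROM orders JOIN customers ...", 60_000.0, 40),
]

for query, total_ms, mean_ms in rank_statements(sample):
    print(f"{total_ms:>10.0f} ms total  {mean_ms:>8.1f} ms/call  {query}")
```

In this made-up sample the first statement leads on total time only because it is called constantly, while the second is slow on every call. Those two situations usually call for different fixes (caching or batching versus indexing or rewriting).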
Cause 1: Missing or Ineffective Indexes
A query might be doing a full table scan because the columns used in its WHERE or JOIN clauses aren’t indexed. This forces the database to read every single row to find the data.
Diagnosis Command:
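Prefix your slow query with EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL, which actually executes the query and reports real timings). In PostgreSQL, look for "Seq Scan" (sequential scan) on large tables; in MySQL, look for type: ALL, which indicates a full table scan.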
Fix:
Create an index on the relevant columns. For example, if a query SELECT * FROM users WHERE email = 'test@example.com'; is slow, and email isn’t indexed:
CREATE INDEX idx_users_email ON users (email);
Why it works: An index is like a sorted lookup table. Instead of scanning the entire users table, the database can quickly jump to the specific row(s) where email matches.
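The effect on the query plan can be reproduced locally. This is a small sketch that uses Python's built-in sqlite3 module as a stand-in, which is an assumption made purely so the example runs without an RDS instance; SQLite's EXPLAIN QUERY PLAN output differs from PostgreSQL's and MySQL's EXPLAIN, but the scan-versus-index-lookup distinction is the same idea.

```python
import sqlite3

# In-memory SQLite database as a local stand-in (not RDS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

QUERY = "SELECT * FROM users WHERE email = 'user42@example.com'"


def plan(sql: str) -> str:
    """Join the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[3] for row in rows)


plan_before = plan(QUERY)  # no index on email yet: the table is scanned
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(QUERY)   # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies between SQLite versions, so treat the printed text as illustrative and rely on your own engine's EXPLAIN output for real decisions.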
Cause 2: Outdated Table Statistics
The database’s query planner uses statistics about the data distribution in your tables to decide the most efficient way to execute a query. If these statistics are stale, the planner might choose a suboptimal execution plan.
Diagnosis Command:
For PostgreSQL, check the last_analyze and last_autoanalyze times in pg_stat_user_tables; a table that has changed heavily since those timestamps likely has stale statistics. For MySQL, examine the CARDINALITY column in INFORMATION_SCHEMA.STATISTICS and compare it to the actual data distribution if possible.
Fix: Manually update statistics.
For PostgreSQL:
ANALYZE users; -- Or ANALYZE VERBOSE users; for more detail
For MySQL:
ANALYZE TABLE users;
Why it works: This forces the database to re-evaluate the data distribution and update its internal statistics, leading to better query plan choices.
Cause 3: Inefficient Query Logic
Sometimes, the query itself is written poorly. This could involve SELECT * when only a few columns are needed, complex subqueries that could be rewritten as joins, or unnecessary OR conditions that prevent index usage.
Diagnosis Command:
Again, EXPLAIN ANALYZE (PostgreSQL) or SHOW PROFILE (MySQL) on the problematic query. Look for high costs, large row counts being processed, or specific operations known to be slow.
Fix:
Rewrite the query. For instance, replace SELECT * with specific column names, and refactor subqueries.
Example PostgreSQL rewrite: Instead of:
SELECT name FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE country = 'USA');
Use a JOIN:
SELECT o.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';
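Why it works: A join gives the planner more freedom to choose join order and use indexes than a subquery it cannot flatten, especially when appropriate indexes exist. Modern planners often rewrite a simple IN subquery into a semi-join on their own, so compare the EXPLAIN output of both forms before assuming the rewrite helps. The two forms return the same rows only when customers.id is unique; otherwise the join can produce duplicates.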
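Before swapping a rewritten query into production, it is worth confirming that it returns the same rows as the original. The sketch below does that for the two queries above, using an in-memory SQLite database and a few made-up rows as a stand-in (both are assumptions for illustration; the table contents are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, name TEXT);
    INSERT INTO customers (id, country) VALUES (1, 'USA'), (2, 'Canada'), (3, 'USA');
    INSERT INTO orders (customer_id, name) VALUES
        (1, 'order-a'), (1, 'order-b'), (2, 'order-c'), (3, 'order-d');
    """
)

IN_QUERY = """
    SELECT name FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE country = 'USA')
"""
JOIN_QUERY = """
    SELECT o.name FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.country = 'USA'
"""

# Sort both result sets so row order does not affect the comparison.
in_rows = sorted(conn.execute(IN_QUERY).fetchall())
join_rows = sorted(conn.execute(JOIN_QUERY).fetchall())

print("same rows:", in_rows == join_rows)
```

The results match here because customers.id is a primary key. A small fixture like this proves nothing about performance; it only guards against a rewrite that silently changes the result.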
Cause 4: High Connection Count / Idle Connections
A very large number of active or idle connections can consume significant memory and CPU, even if those connections aren’t actively running queries. Each connection has overhead.
Diagnosis Command:
For PostgreSQL:
SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;
For MySQL:
SHOW PROCESSLIST;
Look for a connection count that is approaching your max_connections setting (the server refuses new connections once the limit is reached), or a high number of connections in an idle or idle in transaction state.
Fix:
- Increase max_connections (carefully): If your application legitimately needs more connections, increase the max_connections parameter in your RDS parameter group. Warning: This also increases memory usage.
- Tune Application Connection Pooling: Implement or tune connection pooling on your application side. This reuses existing connections, drastically reducing the overhead of establishing new ones.
- Kill Idle Transactions: For long-running idle transactions in PostgreSQL:
SELECT pg_terminate_backend(<pid>);
(Replace <pid> with the process ID from pg_stat_activity).
Why it works: Reducing the sheer number of active processes or cleaning up hung/idle ones frees up the CPU and memory resources they were consuming.
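To make the pooling idea concrete, here is a deliberately small Python sketch of a fixed-size pool. `FakeConnection` and `SimplePool` are hypothetical names invented for this example, and the fake connection stands in for a real driver connection so the code runs on its own. In practice you would more likely use the pool built into your driver or framework than write one yourself.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a real database connection (hypothetical)."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1

    def execute(self, sql):
        return f"ran: {sql}"


class SimplePool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # open all connections up front

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # wait if every connection is busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand it back instead of closing it


pool = SimplePool(FakeConnection, size=3)
results = []
for i in range(100):
    with pool.connection() as conn:
        results.append(conn.execute(f"SELECT {i}"))

print("queries run:", len(results), "connections opened:", FakeConnection.opened)
```

The pool size also acts as a hard cap on how many connections this process can hold against the database, which is what keeps the total well below max_connections when many application instances are running.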
Cause 5: Insufficient Instance Size / IOPS
Sometimes, the workload genuinely exceeds the capacity of your current RDS instance type or provisioned IOPS. The CPU is maxed out because the hardware simply cannot process the data fast enough, or disk I/O is so slow it’s causing CPU to wait.
Diagnosis Command:
Monitor the "CPU Utilization" metric (CPUUtilization) in CloudWatch for the RDS instance. Also check "Read IOPS" and "Write IOPS" against your provisioned or baseline IOPS, and "Disk Queue Depth" for general I/O bottlenecks. If CPU is consistently above 80-90% for extended periods, and the above tuning steps haven’t helped, this is a strong possibility.
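"Consistently above 80-90%" is easier to act on once it is written down as a rule. The sketch below flags a series of CPU readings only when several consecutive datapoints exceed a threshold, so short spikes are ignored. The `sustained_above` helper, the threshold, the window length, and the sample readings are all assumptions for illustration; fetching the datapoints from CloudWatch is left out.

```python
def sustained_above(values, threshold=80.0, min_points=3):
    """Return True if at least `min_points` consecutive values exceed `threshold`."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0  # reset the streak on any dip
        if run >= min_points:
            return True
    return False


# Hypothetical 5-minute CPU utilization averages, in percent.
spiky = [35.0, 92.0, 40.0, 95.0, 38.0, 91.0]
saturated = [70.0, 85.0, 88.0, 93.0, 97.0, 96.0]

print("spiky:", sustained_above(spiky))
print("saturated:", sustained_above(saturated))
```

A spiky profile usually points back to individual queries or batch jobs, while a saturated one is a stronger hint that the instance itself is undersized.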
Fix:
- Modify RDS Instance Class: Upgrade to a larger instance class (e.g., db.r5.large to db.r5.xlarge).
- Increase Provisioned IOPS: If using io1 or gp3 storage, increase the allocated IOPS.
- Switch Storage Type: Consider migrating to gp3 or io1/io2 if you’re on gp2 and hitting I/O limits, as gp3 offers independent scaling of IOPS and throughput.
Why it works: A larger instance provides more CPU cores and RAM. Increased IOPS or a better storage type allows the database to read and write data faster, reducing wait times that contribute to CPU load.
Cause 6: Vacuuming/Autovacuum Issues (PostgreSQL Specific)
In PostgreSQL, VACUUM (and its autovacuum daemon) reclaims space from dead rows and prevents transaction ID wraparound. If autovacuum isn’t keeping up, tables can bloat, leading to slower queries and higher CPU.
Diagnosis Command:
Check pg_stat_user_tables for n_dead_tup (number of dead tuples) and last_autovacuum / last_vacuum times. High n_dead_tup counts and infrequent vacuuming indicate a problem.
Fix:
- Tune Autovacuum Parameters: Adjust autovacuum_max_workers, autovacuum_naptime, autovacuum_vacuum_threshold, and autovacuum_vacuum_scale_factor in your RDS parameter group.
- Manual VACUUM: If immediate relief is needed:
VACUUM (ANALYZE, VERBOSE) your_table_name;
Why it works: A well-tuned autovacuum process keeps tables lean, ensuring that queries scan fewer rows and indexes remain efficient, thereby reducing CPU load.
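When tuning the threshold and scale factor, it helps to see how many dead rows a table must accumulate before autovacuum considers it. PostgreSQL vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × the table's row count. The sketch below evaluates that formula; the `autovacuum_trigger` helper is hypothetical, and the defaults of 50 and 0.2 are the stock PostgreSQL values, which your RDS parameter group may override.

```python
def autovacuum_trigger(row_count, threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum will vacuum a table of `row_count` rows."""
    return threshold + scale_factor * row_count


# Compare the stock scale factor with a tighter, hypothetical setting of 0.01.
for rows in (10_000, 1_000_000, 100_000_000):
    default = autovacuum_trigger(rows)
    tuned = autovacuum_trigger(rows, scale_factor=0.01)
    print(f"{rows:>11,} rows: default {default:>12,.0f}  tuned {tuned:>11,.0f}")
```

Because the scale factor multiplies the table size, large tables can accumulate millions of dead rows before a vacuum starts under the stock values. That is why a lower scale factor, often set per table, is a common adjustment for big, frequently updated tables.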
After addressing these, the next problem you might encounter is memory pressure if you raised max_connections or took on more load after scaling up, or storage throttling if the larger workload starts hitting your IOPS limits.