RDS Enhanced Monitoring is not just a fancier version of the basic CloudWatch metrics; it streams detailed OS-level performance data from your RDS instances into CloudWatch Logs, giving you a much deeper look at what’s happening under the hood.
Let’s see it in action. Imagine you’ve got a PostgreSQL RDS instance and you’re seeing some performance degradation. You’ve enabled Enhanced Monitoring and set the granularity to 1 second.
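Here’s a simplified, illustrative snippet of what you might see in CloudWatch Logs. Enhanced Monitoring writes to the RDSOSMetrics log group, with one log stream per instance named after the instance’s resource ID. The real payload carries more fields than shown here and names some of them differently (the CPU section, for example, is cpuUtilization, and I/O wait appears as wait):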
{
  "logType": "OS",
  "instanceID": "your-db-instance-identifier",
  "timestamp": "2023-10-27T10:30:00Z",
  "metrics": {
    "cpu": {
      "user": 85.2,
      "system": 10.5,
      "idle": 3.1,
      "iowait": 1.2,
      "steal": 0.0
    },
    "memory": {
      "total": 32768,
      "free": 2048,
      "used": 30720,
      "buffers": 1024,
      "cached": 4096
    },
    "disk": {
      "/dev/xvda1": {
        "reads": 1500,
        "writes": 1200,
        "read_bytes": 62914560,
        "write_bytes": 50331648,
        "await": 25.5,
        "util": 88.9
      }
    },
    "network": {
      "eth0": {
        "rx_bytes": 1073741824,
        "tx_bytes": 858993459,
        "rx_packets": 1000000,
        "tx_packets": 800000,
        "rx_errors": 5,
        "tx_errors": 0
      }
    }
  }
}
This JSON object, emitted once per collection interval (every second at the granularity chosen here), is the raw data. It breaks CPU utilization down into user, system, idle, iowait, and steal; reports memory usage with buffers and cache listed separately; and includes per-device disk I/O statistics such as reads, writes, await (average wait time for I/O operations), and util, along with per-interface network traffic.
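To make the record structure concrete, here is a minimal Python sketch that reduces one such record to a few headline numbers. It assumes the simplified field names from the sample above; SAMPLE_RECORD and summarize are hypothetical names used only for illustration, and a real Enhanced Monitoring payload would need the field paths adjusted.

```python
import json

# One record in the simplified shape used in this article (an assumption,
# not the exact Enhanced Monitoring schema).
SAMPLE_RECORD = """
{"instanceID": "your-db-instance-identifier",
 "timestamp": "2023-10-27T10:30:00Z",
 "metrics": {
   "cpu": {"user": 85.2, "system": 10.5, "idle": 3.1, "iowait": 1.2, "steal": 0.0},
   "memory": {"total": 32768, "free": 2048, "used": 30720, "buffers": 1024, "cached": 4096},
   "disk": {"/dev/xvda1": {"await": 25.5, "util": 88.9}}
 }}
"""


def summarize(record_json: str) -> dict:
    """Reduce one record to a few headline numbers."""
    metrics = json.loads(record_json)["metrics"]
    cpu, mem, disks = metrics["cpu"], metrics["memory"], metrics["disk"]
    return {
        # Busy CPU is everything that is neither idle nor waiting on I/O.
        "cpu_busy_pct": round(100.0 - cpu["idle"] - cpu["iowait"], 1),
        "iowait_pct": cpu["iowait"],
        "memory_used_pct": round(100.0 * mem["used"] / mem["total"], 2),
        # Device with the highest utilization in this record.
        "busiest_disk": max(disks, key=lambda dev: disks[dev]["util"]),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_RECORD))
```

For the sample record, the CPU is almost entirely busy in user and system time, with very little I/O wait, which already suggests a compute-bound rather than storage-bound workload.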
The problem Enhanced Monitoring solves is the black-box nature of managed databases. Without it, you are largely limited to the standard CloudWatch metrics for RDS (CPUUtilization, FreeableMemory, DiskQueueDepth), which are collected from the hypervisor rather than from inside the instance and can mask underlying OS-level issues. If your RDS CPU is at 90%, is it the database process itself, or something else on the underlying OS consuming resources? Enhanced Monitoring gives you that granular visibility.
Internally, an agent runs on the host that runs your DB instance. It collects metrics from the OS (the same kind of data you would get from tools like top, iostat, vmstat, and netstat) and pushes them to CloudWatch Logs. You configure the granularity, that is, how often metrics are collected: 1, 5, 10, 15, 30, or 60 seconds. RDS also needs an IAM monitoring role that allows it to publish to CloudWatch Logs on your behalf.
The key levers you control are:
- Granularity: This dictates the frequency of metric collection. A 1-second granularity gives you near real-time insight but generates significantly more log data and incurs higher costs. A 60-second granularity is often a good balance for most troubleshooting. You set this when you enable Enhanced Monitoring for an instance, and you can change it later by modifying the instance.
- Log volume and retention: Enhanced Monitoring does not let you pick a subset of OS metrics; every record carries the full set. To reduce log volume, use a coarser granularity, and to control storage cost, set a retention period on the log group.
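The granularity trade-off is easy to quantify. The sketch below is a hedged illustration: monitoring_settings and records_per_day are hypothetical helpers, and the role ARN in the usage comment is a placeholder. It assumes the intervals Enhanced Monitoring accepts (1, 5, 10, 15, 30, or 60 seconds) and builds keyword arguments for boto3’s modify_db_instance call without actually calling AWS.

```python
# Collection intervals, in seconds, accepted when Enhanced Monitoring is on.
# (An interval of 0 turns it off; this helper only covers enabling it.)
VALID_INTERVALS = (1, 5, 10, 15, 30, 60)


def monitoring_settings(db_instance_id: str, interval_s: int, role_arn: str) -> dict:
    """Build keyword arguments for boto3's rds.modify_db_instance."""
    if interval_s not in VALID_INTERVALS:
        raise ValueError(f"interval must be one of {VALID_INTERVALS}, got {interval_s}")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MonitoringInterval": interval_s,
        "MonitoringRoleArn": role_arn,
        "ApplyImmediately": True,
    }


def records_per_day(interval_s: int) -> int:
    """Number of log events one instance emits per day at this interval."""
    if interval_s not in VALID_INTERVALS:
        raise ValueError(f"interval must be one of {VALID_INTERVALS}, got {interval_s}")
    return 86_400 // interval_s


# Usage sketch (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   boto3.client("rds").modify_db_instance(
#       **monitoring_settings("your-db-instance-identifier", 60,
#                             "arn:aws:iam::123456789012:role/your-monitoring-role"))
```

At 1 second an instance emits 86,400 records per day, against 1,440 at 60 seconds, so the finest setting produces 60 times the log volume of the coarsest.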
Most people don’t realize that the iowait value in the cpu section measures the share of time the CPU sat idle while at least one I/O request was still outstanding. If iowait is consistently high, it points to a storage bottleneck, even when user and system CPU time look moderate. This is distinct from the DiskQueueDepth metric in CloudWatch, which counts the I/O requests waiting to access the disk. High iowait tells you that work which could have used the CPU is stalled waiting for the disk subsystem to respond, so storage rather than compute is the limiting factor.
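Because a single spike in iowait means little, it helps to check for sustained elevation. The following is a minimal sketch under stated assumptions: sustained_iowait is a hypothetical helper, and the 20% threshold and run length of five samples are arbitrary illustrative defaults, not AWS recommendations.

```python
from typing import Iterable


def sustained_iowait(samples: Iterable[float],
                     threshold_pct: float = 20.0,
                     min_consecutive: int = 5) -> bool:
    """Return True if iowait stays at or above threshold_pct for at
    least min_consecutive samples in a row."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a sample drops below the threshold.
        run = run + 1 if value >= threshold_pct else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Five consecutive readings above 20% -> flagged.
    print(sustained_iowait([5, 25, 30, 28, 22, 35, 4]))
    # Isolated spikes only -> not flagged.
    print(sustained_iowait([25, 3, 30, 2, 40]))
```

Feeding this the iowait values extracted from consecutive records separates a genuine storage bottleneck from momentary bursts such as a checkpoint or a backup window.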
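Once you’ve mastered OS-level metrics, the next logical step is to learn how to create CloudWatch alarms on them. Because Enhanced Monitoring data arrives as log events rather than as CloudWatch metrics, that means defining metric filters on the log group first and then alarming on the resulting metrics, which allows for proactive problem detection.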