Linux perf can deeply profile block device I/O latency by tracing kernel events related to I/O submission, completion, and queueing.
Let’s see perf in action, profiling I/O latency for a specific block device, say /dev/sda.
First, we need to identify the kernel tracepoints related to block I/O. These are typically found under block:block_rq_insert (when a request is inserted into the I/O scheduler queue) and block:block_rq_complete (when a request is finished). We’ll also want to capture sched:sched_switch to understand what tasks are running and potentially blocking I/O.
Here’s a command to start collecting data:
perf record -e 'block:block_rq_insert,block:block_rq_complete,sched:sched_switch' -a -- sleep 10
This command records events from all CPUs (-a) for 10 seconds (sleep 10) using the specified tracepoints (-e). The -a flag is crucial for capturing system-wide activity.
After the recording is done, we can analyze the data with perf report.
perf report
In perf report, you’ll see a list of events. To focus on I/O latency, we’ll look for the time spent between block_rq_insert and block_rq_complete for requests associated with /dev/sda.
To get more granular information, we can use perf script to dump the raw trace events and then process them.
perf script > trace.txt
Now, we can use tools like awk to parse trace.txt and calculate latencies. We’re looking for pairs of block_rq_insert and block_rq_complete events for the same request. The kernel typically logs a unique rq_comm (command name) and rq_disk (device name) along with rq_sector and rq_bytes in these events.
Here’s a simplified awk script to find the latency for /dev/sda:
awk '
/block:block_rq_insert/ {
split($0, a, " ");
rq_insert_ts = $1;
for (i=1; i<=NF; i++) {
if (a[i] == "rq_disk=sda") {
disk_sda = 1;
}
if (a[i] ~ /^rq_seq=/) {
rq_seq = substr(a[i], 8);
}
}
if (disk_sda) {
insert_events[rq_seq] = rq_insert_ts;
disk_sda = 0;
}
next;
}
/block:block_rq_complete/ {
split($0, a, " ");
rq_complete_ts = $1;
for (i=1; i<=NF; i++) {
if (a[i] == "rq_disk=sda") {
disk_sda = 1;
}
if (a[i] ~ /^rq_seq=/) {
rq_seq = substr(a[i], 8);
}
}
if (disk_sda && insert_events[rq_seq]) {
latency = rq_complete_ts - insert_events[rq_seq];
print "Seq:", rq_seq, "Latency:", latency * 1000000, "ns";
delete insert_events[rq_seq];
disk_sda = 0;
}
}
' trace.txt > sda_latencies.txt
This awk script stores the timestamp of block_rq_insert for requests targeting sda in an associative array indexed by rq_seq. When a block_rq_complete event for the same sequence number and device arrives, it calculates the difference, which is the I/O latency. The * 1000000 converts seconds to nanoseconds, which is often more readable for I/O latencies.
The perf report view can also be filtered and sorted. After running perf report, you can press / to search for block:block_rq_complete and then navigate through the results. Sorting by overhead or count can reveal the most time-consuming I/O operations.
To understand why an I/O is slow, we can correlate the block_rq_insert and block_rq_complete timestamps with sched:sched_switch events. If a sched_switch event shows a process being descheduled and then rescheduled after the I/O was inserted but before it completed, it suggests CPU contention or a process voluntarily yielding the CPU.
For example, if you see a long gap between block_rq_insert and block_rq_complete and during that gap the process that initiated the I/O was in a TASK_UNINTERRUPTIBLE state (which you can infer from sched:sched_switch and other kernel tracepoints like sched:sched_stat_iostats), it means the process was waiting on the I/O completion and was not runnable. However, if the I/O itself is taking a long time in the kernel queue, that’s a different problem.
When perf report shows high latency, you’ll often see specific processes or kernel threads consuming the most time. This helps pinpoint whether the latency is due to user-space applications making slow requests or kernel-level I/O processing bottlenecks.
The block_rq_insert event itself contains the rq_depth (request depth) for the device at the time of insertion. High rq_depth values can indicate a busy I/O subsystem, potentially leading to increased latencies as requests wait in longer queues. Analyzing the distribution of rq_depth across block_rq_insert events can highlight periods of high load.
The disk_name and partno fields in the tracepoints are essential for distinguishing between different physical disks and partitions if you have multiple devices. You can modify the awk script to filter for specific devices or even specific partitions.
Looking at the comm field in the trace output can tell you which application is generating the I/O. If a specific application consistently shows up with high I/O latency, it’s a strong indicator that the application’s I/O patterns are the primary cause.
If you suspect the I/O scheduler is the bottleneck, you can examine the block:block_rq_merge and block:block_rq_issue tracepoints. block_rq_issue signifies when the I/O scheduler actually dispatches the request to the device driver, and the time between insertion and issue can highlight scheduler overhead or delays.
After you’ve resolved your I/O latency issues, the next problem you might encounter is CPU saturation, where the system is spending too much time processing requests, leading to overall sluggishness.