The perf tool in Linux isn’t just for finding bottlenecks; it’s a powerful way to quantitatively prove that your changes actually made a difference, or that they made things worse.
Let’s say you’ve optimized some code, tuned a kernel parameter, or changed a hardware configuration, and you want to show the impact. The core idea is to run perf before your change, capture the results, make the change, and then run perf again, comparing the two reports.
Here’s a typical workflow. First, identify what you want to measure. Are you interested in CPU cycles, cache misses, branch mispredictions, or maybe a specific hardware event? For a general performance overview, perf stat is your go-to.
Let’s assume we’re benchmarking a hypothetical my_application that performs some heavy computation.
Before the Change:
We’ll run perf stat with the -e flag to specify the events we care about. For a broad view, let’s look at cycles, instructions, cache misses, and context switches.
perf stat -e cycles,instructions,cache-misses,context-switches -- ./my_application
This command will execute my_application and, upon its completion, print a summary of the requested performance counters. You’ll get output like this:
Performance counter stats for './my_application':
1,234,567,890 cycles (83.33%)
987,654,321 instructions (83.33%)
12,345 cache-misses (83.33%)
100 context-switches (83.33%)
0.500000 task-clock (msec)
0.100000 wall-clock (msec)
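Note down these numbers precisely. The task-clock and wall-clock values are also crucial indicators of overall execution time. The output above is simplified: real perf stat output reports wall-clock time on a "seconds time elapsed" line and prints task-clock only when it is among the counted events. The percentage in parentheses is not CPU utilization. It is the share of the run during which that counter was actually counting; a value below 100% means perf had to multiplex the events onto a limited number of hardware counters and scaled the count to compensate.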
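Rather than copying the numbers by hand, you can have perf stat write its report to a file with the -o option. The following is a minimal sketch, not a prescribed workflow: the save_stats helper and the perf-stat.LABEL.txt naming scheme are assumptions made for illustration, and ./my_application is the placeholder used throughout this article.

```shell
# Events from the example command above.
EVENTS="cycles,instructions,cache-misses,context-switches"

# save_stats LABEL COMMAND [ARGS...]
# Runs COMMAND under perf stat and writes the report to perf-stat.LABEL.txt
# (a hypothetical naming scheme) instead of printing it to the terminal.
save_stats() {
    label=$1
    shift
    perf stat -o "perf-stat.$label.txt" -e "$EVENTS" -- "$@"
}
```

Calling save_stats before ./my_application now, and save_stats after ./my_application once the change is in place, would leave two reports that can be read side by side.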
Make Your Change:
Now, apply your optimization. This could be modifying source code and recompiling, changing a /proc/sys value, or reconfiguring a service.
After the Change:
Run the exact same perf stat command again.
perf stat -e cycles,instructions,cache-misses,context-switches -- ./my_application
You’ll get a new set of numbers. Let’s imagine they are:
Performance counter stats for './my_application':
1,000,000,000 cycles (83.33%)
900,000,000 instructions (83.33%)
8,000 cache-misses (83.33%)
80 context-switches (83.33%)
0.400000 task-clock (msec)
0.080000 wall-clock (msec)
Comparison:
Now you compare the two sets of results.
- Cycles: Decreased from 1,234,567,890 to 1,000,000,000 (about 19%). The program needed fewer CPU cycles in total to do the same work.
- Instructions: Decreased from 987,654,321 to 900,000,000 (about 9%). This suggests your optimization reduced redundant work. Improved instruction-level parallelism would show up as more instructions per cycle (IPC) rather than fewer instructions; here IPC rose from about 0.80 to 0.90.
- Cache Misses: Decreased from 12,345 to 8,000 (about 35%). A significant reduction, consistent with better data locality.
- Context Switches: Decreased from 100 to 80. The task was switched off the CPU less often, so less time went to scheduling overhead.
- Task-Clock: Decreased from 0.500000 msec to 0.400000 msec. This is the CPU time the task actually consumed, summed over all of its threads.
- Wall-Clock: Decreased from 0.100000 msec to 0.080000 msec. This is the real-world elapsed time, the most direct measure of perceived performance.
The change clearly improved performance across the board, with a ~20% reduction in wall-clock time.
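The percentages behind that summary are simple arithmetic, and it helps to compute them the same way every time. Here is a small sketch using awk. The pct_change and ipc helper names are made up for this example, and the inputs are the illustrative numbers from the two runs above.

```shell
# Percent change from a "before" value to an "after" value.
# Negative results mean the value went down.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# Instructions per cycle (IPC).
ipc() {
    awk -v instructions="$1" -v cycles="$2" \
        'BEGIN { printf "%.2f\n", instructions / cycles }'
}

pct_change 1234567890 1000000000   # cycles:     prints -19.0
pct_change 0.100000 0.080000       # wall-clock: prints -20.0
ipc 987654321 1234567890           # IPC before: prints 0.80
ipc 900000000 1000000000           # IPC after:  prints 0.90
```

Looking at IPC alongside the raw counts shows whether the program is merely doing less work or is also using each cycle more effectively.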
Beyond perf stat:
For deeper analysis, perf record and perf report are invaluable. perf record samples events over time, creating a perf.data file. perf report then visualizes this data, showing which functions or code regions are responsible for the most overhead.
To compare perf record outputs:
- Record before:
perf record --call-graph dwarf -o perf.data.before ./my_application
(The --call-graph dwarf option is crucial for detailed call stack information.)
- Make change.
- Record after:
perf record --call-graph dwarf -o perf.data.after ./my_application
- Analyze and compare: You can then use perf report -i perf.data.before and perf report -i perf.data.after to inspect the profiled data. For direct comparison, perf diff is the command you want:
perf diff perf.data.before perf.data.after
This will show you the difference in event counts per symbol (function), highlighting where performance improved or degraded. For instance, it might show that a specific function’s cycles count decreased significantly, or that a new function now consumes more cache-misses.
Important Considerations for Benchmarking:
- Reproducibility: Ensure your benchmark is repeatable. Run it multiple times and average results if there’s variance.
- Environment: Keep the system as idle as possible. Close unnecessary applications. Set the CPU frequency governor to performance (cpupower frequency-set -g performance) and disable Turbo Boost if you need absolute consistency.
- Workload: The benchmark workload should accurately reflect your real-world usage. A micro-benchmark might not capture system-wide effects.
- Event Selection: Choose events relevant to your suspected bottlenecks. Requesting more events than the CPU has hardware counters forces perf to multiplex them, so the reported counts become scaled estimates rather than exact values.
- perf Version: Ensure you’re using a recent version of perf for the best features and bug fixes.
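For the reproducibility point, perf stat can repeat the run itself: the -r option executes the command several times and reports the mean of each counter together with its run-to-run variation. Below is a minimal sketch. The repeat_stats and mean_stddev helpers and the choice of 10 runs are assumptions for illustration only.

```shell
# Number of repetitions; 10 is an arbitrary choice for this sketch.
RUNS=10

# repeat_stats COMMAND [ARGS...]
# Runs COMMAND RUNS times under perf stat, which then prints the mean and
# the variation ("+-" column) for every counted event.
repeat_stats() {
    perf stat -r "$RUNS" -e cycles,instructions,cache-misses,context-switches -- "$@"
}

# Mean and sample standard deviation of numbers read one per line from
# stdin, for cases where you collect per-run values yourself.
mean_stddev() {
    awk 'NF { sum += $1; sumsq += $1 * $1; n++ }
         END {
             if (n < 2) { print "need at least two values"; exit 1 }
             mean = sum / n
             var = (sumsq - n * mean * mean) / (n - 1)
             if (var < 0) var = 0   # guard against rounding error
             printf "mean=%.2f stddev=%.2f n=%d\n", mean, sqrt(var), n
         }'
}
```

A call such as repeat_stats ./my_application would replace the single-run command used earlier. If the difference between the before and after means is smaller than the reported variation, the change has probably not been shown to matter.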
The next step after identifying performance gains or losses is often to investigate the specific code paths or system interactions that perf report or perf diff point to.