The perf tool in Linux isn’t just for finding bottlenecks; it’s a powerful way to quantitatively prove that your changes actually made a difference, or that they made things worse.

Let’s say you’ve optimized some code, tuned a kernel parameter, or changed a hardware configuration, and you want to show the impact. The core idea is to run perf before your change, capture the results, make the change, and then run perf again, comparing the two reports.

Here’s a typical workflow. First, identify what you want to measure. Are you interested in CPU cycles, cache misses, branch mispredictions, or maybe a specific hardware event? For a general performance overview, perf stat is your go-to.

Let’s assume we’re benchmarking a hypothetical my_application that performs some heavy computation.

Before the Change:

We’ll run perf stat with the -e flag to specify the events we care about. For a broad view, let’s look at cycles, instructions, cache misses, and context switches.

perf stat -e cycles,instructions,cache-misses,context-switches -- ./my_application

This command will execute my_application and, upon its completion, print a summary of the requested performance counters. You’ll get output like this:

 Performance counter stats for './my_application':

     1,234,567,890      cycles                                                      (83.33%)
       987,654,321      instructions                                                (83.33%)
            12,345      cache-misses                                                (83.33%)
               100      context-switches                                            (83.33%)

          0.500000      task-clock (msec)
          0.100000      wall-clock (msec)

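Record these numbers exactly. The task-clock (CPU time consumed by the task) and the elapsed wall-clock time are also key indicators of overall execution time. The layout above is simplified for illustration: real perf stat output reports elapsed time as "seconds time elapsed", and when -e is given it lists task-clock only if you request it. The percentage in parentheses is not CPU utilization. It is the share of the run during which that counter was actually counting; a value below 100% means perf multiplexed more events than the hardware has counters for and scaled the count up to an estimate.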
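Copying counters by hand invites transcription errors. One way to keep them is a minimal shell sketch like the following, which relies on perf stat's -x (field separator) and -o (output file) options to write the counters as CSV. The helper names run_stat and counter_value and the file name before.csv are made up for illustration. The awk lookup assumes a plain run (no -I interval or per-CPU options), where the count is the first CSV field and the event name the third, and it assumes the name is printed exactly as requested, which may not hold on every perf version or CPU (some print names such as cycles:u).

```shell
#!/bin/sh
# Sketch: save perf stat counters as CSV and read one value back.
# EVENTS mirrors the event list used in the article's command.
EVENTS="cycles,instructions,cache-misses,context-switches"

# run_stat OUTFILE CMD...: run CMD under perf stat, writing CSV to OUTFILE.
run_stat() {
    out=$1
    shift
    perf stat -x, -o "$out" -e "$EVENTS" -- "$@"
}

# counter_value FILE EVENT: print the raw count recorded for EVENT.
# Assumes field 1 is the count and field 3 is the event name.
counter_value() {
    awk -F, -v ev="$2" '$3 == ev { print $1 }' "$1"
}

# Usage (assumes ./my_application exists and perf is installed):
#   run_stat before.csv ./my_application
#   counter_value before.csv cycles
```

Running run_stat again with after.csv once the change is applied gives a second file that can be compared field by field.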

Make Your Change:

Now, apply your optimization. This could be modifying source code and recompiling, changing a /proc/sys value, or reconfiguring a service.

After the Change:

Run the exact same perf stat command again.

perf stat -e cycles,instructions,cache-misses,context-switches -- ./my_application

You’ll get a new set of numbers. Let’s imagine they are:

 Performance counter stats for './my_application':

     1,000,000,000      cycles                                                      (83.33%)
       900,000,000      instructions                                                (83.33%)
             8,000      cache-misses                                                (83.33%)
                80      context-switches                                            (83.33%)

          0.400000      task-clock (msec)
          0.080000      wall-clock (msec)

Comparison:

Now you compare the two sets of results.

  • Cycles: Decreased from 1,234,567,890 to 1,000,000,000 (about 19%). The program needed fewer CPU cycles in total to do the same work.
  • Instructions: Decreased from 987,654,321 to 900,000,000 (about 9%). This suggests the optimization removed redundant work. Instructions per cycle (IPC) also rose from about 0.80 to 0.90, so the remaining instructions execute more efficiently.
  • Cache Misses: Decreased from 12,345 to 8,000 (about 35%). A significant reduction, consistent with better data locality.
  • Context Switches: Decreased from 100 to 80. The task was switched off the CPU less often, although a difference this small can also be scheduling noise.
  • Task-Clock: Decreased from 0.500000 msec to 0.400000 msec. This is the CPU time the task consumed, summed across all of its threads.
  • Wall-Clock: Decreased from 0.100000 msec to 0.080000 msec. This is the real-world elapsed time, the most direct measure of perceived performance.

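On these numbers the change improved every metric, including a 20% reduction in wall-clock time. Confirm the result over several runs before drawing conclusions, since a single pair of measurements can differ by noise alone.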
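The percentages in this comparison can be checked with a little arithmetic. The sketch below uses awk from a shell; pct_change and ipc are hypothetical helper names, and the inputs are the sample figures from the two runs above.

```shell
#!/bin/sh
# pct_change BEFORE AFTER: relative change in percent (negative means a decrease).
pct_change() {
    LC_ALL=C awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

# ipc INSTRUCTIONS CYCLES: instructions executed per cycle.
ipc() {
    LC_ALL=C awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

pct_change 1234567890 1000000000   # cycles:        -19.0
pct_change 987654321 900000000     # instructions:  -8.9
pct_change 12345 8000              # cache-misses:  -35.2
pct_change 0.100000 0.080000       # wall-clock:    -20.0
ipc 987654321 1234567890           # IPC before:    0.80
ipc 900000000 1000000000           # IPC after:     0.90
```

Looking at IPC alongside the raw counts helps separate "less work" (fewer instructions) from "the same work done faster" (more instructions per cycle).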

Beyond perf stat:

For deeper analysis, perf record and perf report are invaluable. perf record samples events over time, creating a perf.data file. perf report then visualizes this data, showing which functions or code regions are responsible for the most overhead.

To compare perf record outputs:

  1. Record before:

    perf record --call-graph dwarf -o perf.data.before ./my_application
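
    (The --call-graph dwarf option captures call stacks using DWARF unwind information, which works even when the binary was built without frame pointers, at the cost of larger perf.data files. --call-graph fp is a lighter alternative if frame pointers are available.)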

  2. Make change.

  3. Record after:

    perf record --call-graph dwarf -o perf.data.after ./my_application
    
  4. Analyze and compare: You can then use perf report -i perf.data.before and perf report -i perf.data.after to inspect the profiled data. For direct comparison, perf diff is the command you want:

    perf diff perf.data.before perf.data.after
    

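    This shows, per symbol (function), the baseline share of samples and how that share changed in the second profile, highlighting where cost moved. For instance, it might show that a specific function's share of cycles dropped significantly, or that a new function now accounts for a noticeable share. Note that perf record samples cycles by default; to compare another event such as cache-misses, record both profiles with -e cache-misses.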
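The record, change, record, diff sequence above can be wrapped in a small script so both profiles are captured with identical options. This is only a sketch: record_profile and the DRY_RUN switch are hypothetical conveniences, not part of perf, and the perf options are the ones already used in the steps above.

```shell
#!/bin/sh
# record_profile LABEL CMD...: profile CMD into perf.data.LABEL.
# With DRY_RUN=1 the perf command line is printed instead of executed,
# which is handy for checking the options before a long run.
record_profile() {
    label=$1
    shift
    set -- perf record --call-graph dwarf -o "perf.data.$label" -- "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage (assumes ./my_application exists and perf is installed):
#   record_profile before ./my_application
#   ... apply the change and rebuild ...
#   record_profile after ./my_application
#   perf diff perf.data.before perf.data.after
```

Keeping the options in one place avoids accidentally comparing profiles that were recorded with different settings.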

Important Considerations for Benchmarking:

  • Reproducibility: Ensure your benchmark is repeatable. Run it multiple times and average the results; perf stat -r N repeats the command N times and reports the mean and variation for each counter.
  • Environment: Keep the system as idle as possible. Close unnecessary applications. To reduce frequency-related variance, select the performance governor (cpupower frequency-set -g performance) and disable Turbo Boost if you need absolute consistency.
  • Workload: The benchmark workload should accurately reflect your real-world usage. A micro-benchmark might not capture system-wide effects.
  • Event Selection: Choose events relevant to your suspected bottlenecks. If you request more hardware events than the CPU has counters, perf multiplexes them and reports scaled estimates, which reduces accuracy.
  • perf Version: Ensure you’re using a recent version of perf for the best features and bug fixes.
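For the reproducibility point, a small helper can summarize the elapsed times collected from several runs. This is a sketch only: summarize is a hypothetical name, it expects one number per line on standard input, and it reports the sample standard deviation (n - 1 in the denominator). The three timings in the example are made up.

```shell
#!/bin/sh
# summarize: read one measurement per line and print count, mean,
# and sample standard deviation.
summarize() {
    LC_ALL=C awk '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n < 2) { print "need at least two samples"; exit 1 }
            mean = sum / n
            var = (sumsq - n * mean * mean) / (n - 1)
            if (var < 0) var = 0   # guard against floating-point rounding
            printf "n=%d mean=%.3f stddev=%.3f\n", n, mean, sqrt(var)
        }'
}

# Example with three made-up timings in seconds:
printf '%s\n' 0.102 0.098 0.100 | summarize
# prints: n=3 mean=0.100 stddev=0.002
```

If the difference between the before and after means is not clearly larger than the standard deviation, more runs or a quieter system are needed before claiming an improvement.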

The next step after identifying performance gains or losses is often to investigate the specific code paths or system interactions that perf report or perf diff point to.
