perf stat isn’t just about counting instructions; it’s your window into the CPU’s internal dance, revealing bottlenecks you’d never find with top.

Let’s see it in action. Imagine you’re running a simple C program that does a lot of floating-point math:

#include <stdio.h>
#include <math.h>

int main() {
    double sum = 0.0;
    for (int i = 0; i < 100000000; ++i) {
        sum += sin(i);
    }
    printf("Sum: %f\n", sum);
    return 0;
}

Compile it: gcc -o math_test math_test.c -lm

Now, run perf stat on it:

perf stat -e cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,L1-icache-loads,L1-icache-load-misses ./math_test

Here’s what you might see:

 Performance counter stats for './math_test':

        1,578,945,873      cycles                                                      (83.33%)
        1,475,315,123      instructions                                                (83.33%)
          110,592,100      branches                                                    (83.33%)
            3,456,789      branch-misses                                               (83.33%)
        1,789,123,456      L1-dcache-loads                                             (83.33%)
           12,345,678      L1-dcache-load-misses                                       (83.33%)
        1,500,000,000      L1-icache-loads                                             (83.33%)
              123,456      L1-icache-load-misses                                       (83.33%)

       1.578945678 seconds time elapsed

This output tells you that the CPU spent about 107 cycles for every 100 instructions. The ratio of instructions to cycles (IPC) is about 0.93 (1,475M / 1,579M). A higher IPC generally means your CPU is doing more useful work per clock tick, and modern cores can retire several instructions per cycle, so a value below 1 leaves headroom. You also see a low branch miss rate (about 3%, or 3.46M of 110.6M branches) and a very low L1 data cache miss rate (about 0.7%). Neither looks like the bottleneck, which suggests the time is going into the floating-point computation itself rather than into mispredicted branches or waiting on memory.

perf stat works by tapping into your CPU’s Performance Monitoring Units (PMUs). These are special hardware registers that can count events happening within the CPU core, like instruction fetches, data accesses, cache hits/misses, and branch predictions. The perf tool in Linux is a frontend that lets you access these PMUs without needing to write low-level assembly or directly manipulate hardware registers.

The -e flag is your primary lever. You specify the events you want to measure. Common ones include:

  • cycles: The number of CPU cycles the program ran for.
  • instructions: The total number of instructions retired (completed).
  • branches: The number of branch instructions encountered.
  • branch-misses: The number of times the CPU’s branch predictor guessed wrong.
  • L1-dcache-loads: Number of requests to the Level 1 data cache.
  • L1-dcache-load-misses: Number of times a Level 1 data cache load request missed.
  • L1-icache-loads: Number of requests to the Level 1 instruction cache.
  • L1-icache-load-misses: Number of times a Level 1 instruction cache load request missed.
  • LLC-loads (Last Level Cache loads) and LLC-load-misses: For misses further out in the cache hierarchy.

You can find a comprehensive list of available events on your system by running perf list. The event names can be architecture-specific, so what works on an Intel CPU might differ slightly on an AMD or ARM processor.

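The percentage in parentheses next to each count shows the fraction of the run during which that event was actually being counted. A PMU has only a limited number of hardware counters; when you request more events than fit at once, perf multiplexes them, rotating events on and off the counters, and scales each count up to estimate the total for the whole run. A value such as 83.33% means the count is an estimate based on 83.33% of the run. If no percentage appears, the event was counted the entire time. For exact counts, request fewer events per run.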

The real power comes from combining perf stat with the profiling commands perf record and perf report. perf stat gives you aggregate numbers for the entire program run. perf record instead samples the program while it runs: at intervals driven by a chosen event (cycles by default), it records where the program was executing and saves the samples to a perf.data file. perf report then analyzes this file to show which functions, and with debug information which lines of code, account for the most samples, pinpointing the hot spots.

Many developers overlook the significance of L1 cache misses. A miss in the L1 data cache means the CPU had to go to a slower level of cache (L2, L3) or even main memory to fetch the data. This stall can be incredibly costly, often far more than a few extra instructions. If your L1-dcache-load-misses count is high relative to L1-dcache-loads, it’s a strong signal that your data access patterns are inefficient. Reordering data structures, improving locality, or using techniques like prefetching can drastically reduce these misses and improve performance. Similarly, high branch-misses indicate that the CPU’s speculative execution is often wasted, which can be improved by restructuring code to have more predictable control flow or by using compiler hints.

Once your cache misses are under control, you should see IPC (instructions retired per cycle) rise, and IPC itself becomes the next metric to watch and optimize.
