Linux perf can read the hardware performance counters exposed by your CPU’s Performance Monitoring Unit (PMU), giving you a window into what the CPU is actually doing, beyond just instructions and cycles.
Here’s perf sampling a process and showing a few common PMU events:
sudo perf record -e cycles -e instructions -e branch-misses -e cache-references -e cache-misses -e page-faults --call-graph dwarf -- sleep 10
sudo perf report
This perf record command profiles the sleep 10 process for its 10-second lifetime, sampling cycles, instructions, branch-misses, cache-references, cache-misses, and page-faults. Because sleep is idle almost the whole time, put your own command after -- to get a meaningful profile. The --call-graph dwarf option adds call stack information, which is crucial for pinpointing where in the code these events are happening. perf report then lets you explore the collected data.
The core problem PMUs solve is that software-level metrics like CPU usage (from /proc/stat or top) are often too coarse. They tell you that the CPU is busy, but not why. Are you waiting on memory? Branching inefficiently? Hitting the TLB too often? PMUs provide a way to observe these hardware-level phenomena directly.
Internally, your CPU has dedicated hardware counters, some fixed to a single event and others programmable to count an event you select. Together with their control registers, these counters make up the Performance Monitoring Unit (PMU). The Linux kernel’s perf subsystem acts as a bridge, allowing userspace tools to access and interpret these hardware counters. Each CPU architecture (x86, ARM, etc.) has its own set of PMU events, though perf often provides common aliases.
The fundamental levers you control are the events you choose to monitor and the sampling configuration. Events can be broadly categorized:
- Architectural Events: These are general-purpose events available on most modern CPUs. Examples include cycles, instructions, branch-misses, LLC-loads, LLC-load-misses. These are great for broad performance analysis.
- Cache Events: Specific to memory hierarchy. Events like cache-references and cache-misses (often referring to the Last Level Cache or LLC) help diagnose memory bottlenecks.
- Microarchitectural Events: These are more CPU-specific and delve into internal pipeline behavior. Examples include names along the lines of uops_issued, br_retired, mem_loads; the exact spellings vary by vendor and CPU generation. These require deeper knowledge of the specific CPU microarchitecture.
- Software Events: Not PMU events at all, since the kernel counts them itself, but perf groups them with the others. Examples: page-faults, context-switches.
You can discover the available events on your system using perf list. The output is extensive and includes both generic names and architecture-specific ones. For example, on an Intel x86 system the generic cycles event (also listed as cpu-cycles) corresponds to the raw encoding cpu/event=0x3c/, and perf translates the human-readable name into that encoding for you.
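To see how a readable name maps to a raw encoding on a particular machine, you can look at the named events the kernel publishes for each PMU under sysfs. The following is a minimal sketch, assuming the core PMU is registered as cpu (common on non-hybrid x86; hybrid Intel systems use names such as cpu_core instead). The function name pmu_event_encoding and its optional third argument are hypothetical helpers for illustration, not part of perf.

```shell
# Print the raw perf event string for a named PMU event.
# $1 = PMU name (a directory under .../event_source/devices/)
# $2 = event name as published by the kernel
# $3 = sysfs root; defaults to /sys (hypothetical parameter, handy for testing)
pmu_event_encoding() {
    pmu=$1
    event=$2
    root=${3:-/sys}
    file="$root/bus/event_source/devices/$pmu/events/$event"
    if [ ! -r "$file" ]; then
        echo "no such event: $pmu/$event" >&2
        return 1
    fi
    # The file holds the encoding, e.g. "event=0x3c"; wrap it in perf's pmu/.../ syntax.
    printf '%s/%s/\n' "$pmu" "$(cat "$file")"
}
```

For example, pmu_event_encoding cpu cpu-cycles prints a string that can be passed to perf record -e or perf stat -e in place of the alias.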
When you ask perf to sample cache-misses, it programs a PMU counter to increment every time a cache miss occurs and to raise an interrupt when the counter overflows after a set number of events. On each interrupt, the kernel records the program’s instruction pointer (and call stack, if requested). By default perf record chooses the overflow threshold for you, adjusting it to hit a target sampling frequency; with -c you can fix it to an exact event count instead. By aggregating these samples, you build a statistical profile of where in your program these events occur. For the cycles event, that is approximately where the program spends its time.
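The arithmetic behind that statistical profile is simple when the sampling period is fixed: each sample stands for roughly one period’s worth of events. Here is a small sketch of that estimate, assuming a fixed period set with perf record -c; the function name estimate_events is made up for illustration.

```shell
# With a fixed period, e.g.
#   perf record -e cache-misses -c 10000 -- ./my_program
# one sample is taken every 10000 cache misses.

# Estimate the number of events represented by a set of samples.
# $1 = number of samples, $2 = sampling period (events per sample)
estimate_events() {
    samples=$1
    period=$2
    echo $(( samples * period ))
}
```

So 250 samples attributed to one function at a period of 10000 suggest about 2.5 million cache misses in that function. perf report prints a similar approximate event count in its header.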
The kernel uses a sysctl setting, kernel.perf_event_paranoid, to control what unprivileged users can measure. If you’re getting "permission denied" errors when running perf without sudo, you might need to adjust this setting. A common adjustment is:
sudo sysctl kernel.perf_event_paranoid=1
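With a value of 1, unprivileged users can profile their own processes, including the time those processes spend in the kernel, but cannot collect CPU-wide events. Setting it to 0 also allows CPU-wide events, and -1 removes nearly all restrictions. A value of 2, the mainline kernel default, limits unprivileged users to user-space measurements of their own processes, and some distributions add higher values that block unprivileged use altogether. Root (or, on recent kernels, a process with CAP_PERFMON) is not affected. The perf_event_paranoid setting is a security mechanism to prevent hardware events from leaking information about other processes or the kernel.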
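Before changing the setting, it helps to check what the current value permits. The sketch below reads the value from /proc and prints a one-line summary. The function name describe_paranoid and its wording are illustrative, and the summary for values above 2 is an assumption, since those levels come from distribution patches rather than the mainline kernel.

```shell
# Summarize what a perf_event_paranoid level allows for unprivileged users.
describe_paranoid() {
    level=$1
    if [ "$level" -le -1 ]; then
        echo "unrestricted"
    elif [ "$level" -eq 0 ]; then
        echo "cpu-wide and kernel profiling allowed"
    elif [ "$level" -eq 1 ]; then
        echo "own processes only, kernel samples allowed"
    elif [ "$level" -eq 2 ]; then
        echo "own processes only, user space only"
    else
        # Assumption: distribution-specific levels, not defined by mainline.
        echo "distribution-specific, typically no unprivileged access"
    fi
}

# Report the level currently in effect, if the file is readable.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    describe_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid)"
fi
```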
When you see cache-misses as a significant event in your perf report, it’s a strong indicator that your program is spending time waiting for data to be fetched from slower memory levels. This could be due to poor data locality (accessing data that’s far apart in memory), small cache sizes relative to your working set, or inefficient access patterns.
The real power of PMUs comes when you correlate them with code. If perf report shows a high concentration of branch-misses in a specific function, it means that function is frequently mispredicting the outcome of conditional branches. This often points to if statements or loop exits whose outcomes depend on data in ways the predictor cannot learn, such as a condition that flips unpredictably between true and false. Heavily biased branches, by contrast, are usually predicted well. Optimizing this might involve sorting or partitioning the input, replacing branches with lookup tables or branchless arithmetic, or flattening conditional logic where possible.
A subtle point is that the ratio of events to work done can be as important as the total count. perf stat can count several events together (e.g., perf stat -e '{instructions,branch-misses}' ./my_program); the braces put the events in one group, so they are counted over exactly the same stretches of execution. Dividing one count by the other gives you a ratio, like branch-misses/instructions, which is often a more stable performance metric than raw counts, as it normalizes for how much work the program is doing. For instance, a program might have a high number of branch misses, but if it’s also executing billions of instructions, the ratio might be acceptable. Conversely, a low total number of branch misses might still be a problem if the program is only executing a few thousand instructions.
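The normalization can be scripted from perf stat’s machine-readable output. The sketch below assumes the -x, CSV layout without interval or per-CPU columns, where the first field is the count and the third is the event name, and it assumes plain event names (hybrid CPUs prefix them with a PMU name). The function name branch_mpki is hypothetical. It reports branch misses per thousand instructions, which is one common way to express the ratio.

```shell
# Typical capture (writes CSV to stat.csv instead of stderr):
#   perf stat -x, -o stat.csv -e '{instructions,branch-misses}' -- ./my_program
#   branch_mpki < stat.csv

# Read `perf stat -x,` output on stdin; print branch misses per 1000 instructions.
branch_mpki() {
    awk -F, '
        # "+ 0" forces a numeric value, so "<not supported>" becomes 0.
        $3 ~ /^instructions/  { insns  = $1 + 0 }
        $3 ~ /^branch-misses/ { misses = $1 + 0 }
        END {
            if (insns > 0) printf "%.2f\n", 1000 * misses / insns
            else exit 1   # no usable instruction count in the input
        }'
}
```

Comparing this figure across runs or inputs is usually more informative than comparing raw miss counts, because it stays meaningful when the amount of work changes.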
The next step is understanding how to use these hardware events to tune your code, moving from observation to active optimization.