Linux perf can tell you how many instructions your CPU is executing per clock cycle, a key indicator of how efficiently your code is running.
Let’s see perf in action. Imagine we have a simple C program that does a lot of addition in a loop:
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        long long sum = 0;
        int i;
        for (i = 0; i < 1000000000; i++) {
            sum += i;
        }
        printf("Sum: %lld\n", sum);
        return 0;
    }
Compile it: gcc -O2 loop.c -o loop
Now, let’s run perf stat on it to get a general overview of performance counters, including Instructions Per Cycle (IPC).
perf stat ./loop
You’ll see output like this (numbers will vary based on your CPU):
Performance counter stats for './loop':
1,234,567,890 instructions # 1.50 insn per cycle
823,045,260 cycles
548,703,507 stalled cycles # 66.67% of all cycles
12,345,678,901 cpu-migrations
0 page-faults
...
The line 1,234,567,890 instructions # 1.50 insn per cycle is our IPC. It means for every clock cycle, the CPU was able to retire, on average, 1.5 instructions. A higher IPC generally means better performance.
This loop is compute-bound: it touches almost no memory, and the CPU spends its time churning through instructions. Even so, the IPC is modest, because the CPU spends a significant share of its cycles stalled (the stalled cycles line). That hints that the CPU might be waiting for data or other resources, even in a loop this simple.
The core problem perf IPC analysis addresses is understanding why your program isn’t running as fast as it could be. It’s not enough to know you have a slow program; you need to know if it’s slow because of CPU limitations, memory bottlenecks, I/O waits, or other factors. IPC is a high-level indicator that helps point you in the right direction.
How it Works Internally: The CPU’s Perspective
Modern CPUs are incredibly complex. They don’t just execute instructions one by one in order. They have sophisticated pipelines, out-of-order execution engines, branch predictors, and multiple execution units.
- Instructions: The basic commands your program tells the CPU to perform (e.g., add, move, load, store).
- Cycles: The fundamental clock ticks of the CPU. The CPU’s clock speed (e.g., 3 GHz) tells you how many cycles happen per second.
- IPC (Instructions Per Cycle): The ratio of instructions retired to clock cycles. If IPC is 1.0, it means on average, one instruction was completed every clock cycle. If it’s 2.0, two instructions were completed per cycle. The theoretical maximum IPC is limited by the CPU’s architecture (e.g., a 4-wide out-of-order machine might aim for an IPC of 4).
- Stalled Cycles: These are cycles where the CPU could have been doing work but wasn’t. This is the most crucial part perf helps diagnose. Stalls happen when the CPU pipeline is empty or waiting for something. Common reasons include:
  - Data Cache Misses: The CPU needs data that isn’t in its fast local cache (L1, L2, L3) and has to fetch it from slower main memory (RAM). This is a very common and expensive stall.
  - Branch Mispredictions: The CPU tries to guess which way a conditional branch (like an if statement) will go to keep its pipeline full. If it guesses wrong, it has to discard the work it did speculatively and start over.
  - Instruction Cache Misses: The CPU needs the next instruction, but it’s not in the instruction cache.
  - Resource Contention: The instruction being executed needs a specific execution unit (e.g., an integer adder, a floating-point multiplier), but that unit is already busy with another instruction.
  - Memory Bandwidth Limitations: Even if data is in RAM, the system might not be able to fetch it fast enough.
perf uses hardware performance counters (PMCs) built into the CPU. These counters are special registers that the CPU increments automatically when certain events occur. perf can read these counters and aggregate them for you.
For IPC analysis, perf typically uses two main events:
- instructions: Counts every instruction retired.
- cycles: Counts every CPU cycle.
The ratio gives you IPC. perf stat can also show stalled cycles. These are not derived from cycles and instructions; they come from separate hardware events (stalled-cycles-frontend and stalled-cycles-backend), which are only reported on CPUs that expose them.
Drilling Down: What’s Causing Stalls?
To get a deeper understanding, you can use perf record to collect detailed event data and then perf report to analyze it.
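Let’s try recording specific events that often cause stalls (the stalled-cycles-frontend and stalled-cycles-backend events are not available on every CPU; drop them if perf reports them as unsupported):
perf record -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,branch-misses ./loop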
perf report
After running perf record, you’ll get a perf.data file. Running perf report will open an interactive TUI where you can explore the collected data. Look for the breakdown by symbol or function. You’ll likely see the main function and potentially the loop itself.
The key is to look at the percentage of cycles spent stalled, and what specific events are contributing to those stalls.
Common Causes and Fixes:
- High Data Cache Misses:
  - Diagnosis: In perf report, look for cache-misses (or more specific events like L1-dcache-load-misses, LLC-load-misses). If the miss count is high relative to the instructions executed, this is a prime suspect.
  - Fix: Improve data locality. This often means restructuring your data structures (e.g., using arrays instead of linked lists for sequential access) or algorithms to ensure data is accessed in a predictable, sequential manner that fits within caches. For our loop example, sum is tiny and stays in a register or the L1 cache, but if you were processing a large array, cache misses would be significant.
  - Why it works: Keeping frequently accessed data in fast CPU caches (L1, L2, L3) dramatically reduces the time the CPU spends waiting for data from slow main memory.
- High Branch Mispredictions:
  - Diagnosis: Look for branch-misses in perf report. If this count is high relative to instructions, your code is frequently making wrong guesses about control flow.
  - Fix: Reduce unpredictable branches. For simple loops, compilers are usually good. But in complex if/else if chains or with data-dependent branches, try to make branches more predictable. Sometimes, loop unrolling or restructuring logic can help. For instance, if an if condition is true 99% of the time, the CPU can learn to predict "taken." If it oscillates unpredictably between true and false, it mispredicts often.
  - Why it works: By reducing mispredictions, the CPU doesn’t waste cycles speculatively executing instructions that will be discarded.
- Instruction Cache Misses:
  - Diagnosis: Look for L1-icache-load-misses or similar events. This is less common for small, tight loops but can occur with very large functions or code that jumps around a lot.
  - Fix: Reduce the working set size of your code. This might involve reorganizing functions, using function pointers less dynamically, or ensuring code is laid out contiguously.
  - Why it works: Ensures the CPU can fetch instructions quickly without waiting for them to be loaded from memory.
- Memory Bandwidth Saturation:
  - Diagnosis: If cache-misses are high and stalled cycles are high, but the CPU’s execution units are not the limit (e.g., LLC-load-misses are high and IPC stays low), you might be hitting memory bandwidth limits. perf might not directly show this as a "cause" but rather as a symptom of high memory latency.
  - Fix: Reduce the amount of data being moved. Use more efficient data structures, compress data, or perform computations in batches that minimize data movement.
  - Why it works: Less data transferred means less time waiting for the memory subsystem.
- Front-end Bound (Instruction Fetch/Decode):
  - Diagnosis: If stalled cycles are high, but cache misses and branch mispredictions are low, the CPU’s instruction fetch and decode logic might be the bottleneck. This is often indicated by a low IPC even when the execution units are relatively free. perf might show this as a general increase in stalled cycles without a single dominant event.
  - Fix: Simplify instruction sequences, reduce complex addressing modes, or consider compiler optimizations that generate simpler instruction streams.
  - Why it works: A faster, less complex path for fetching and decoding instructions allows the execution units to be fed more consistently.
- Back-end Bound (Execution Units):
  - Diagnosis: If stalled cycles are low, but IPC is still below the CPU’s theoretical maximum, it could be that the execution units are fully utilized, and the instruction mix is such that the CPU can’t issue more than it is. This is often a sign that the code is already quite efficient but might have opportunities for parallelism.
  - Fix: Introduce more parallelism (e.g., multithreading), use SIMD (Single Instruction, Multiple Data) instructions if applicable, or re-evaluate the algorithm for better computational density.
  - Why it works: More work is done in parallel or using more efficient instructions, increasing the effective IPC.
The next limit you’ll hit after fixing IPC issues is often raw CPU throughput: once the program is this efficient, it is bounded by the processing power of the core itself. You’ll see 100% CPU usage, and the runtime cannot shrink much further if that’s the only core available.