Linux perf can tell you exactly which instructions your CPU is spending time on, but most people stop at the list of hot symbols and never look at the instructions underneath.
Let’s see perf in action. Imagine we have a simple C program that does some heavy computation:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// A function that does a lot of work
unsigned long long process_data(int iterations) {
    // Unsigned 64-bit accumulator: for 50 million iterations the sum of
    // squares exceeds the 64-bit range, and unsigned wraparound is well defined.
    unsigned long long sum = 0;
    for (int i = 0; i < iterations; ++i) {
        // Widen before multiplying; i * i in int overflows once i > 46340
        sum += (unsigned long long)i * i;
        // Simulate some other work
        if (i % 100000 == 0) {
            volatile int temp = i; // volatile store the compiler must keep
            (void)temp;
        }
    }
    return sum;
}

int main(void) {
    srand(time(NULL));
    int num_iterations = 50000000; // 50 million iterations
    printf("Processing data...\n");
    unsigned long long result = process_data(num_iterations);
    printf("Result: %llu\n", result);
    return 0;
}
We compile this with debug symbols to make perf’s output more readable:
gcc -g -O2 -o compute compute.c
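Now, let’s profile it. We want to sample CPU cycles (the cycles event, which perf record uses by default when it is available) and see where they’re being spent. The -g flag turns on call-graph recording, and --call-graph dwarf selects DWARF-based stack unwinding; the latter implies the former, so -g is redundant here but harmless. -F 99 asks for 99 samples per second: the low rate keeps overhead small, and the odd number avoids sampling in lockstep with periodic activity such as timer ticks.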
perf record -g -F 99 --call-graph dwarf ./compute
This will run ./compute and collect profiling data. After it finishes, we’ll have a perf.data file.
To see the results, we use perf report.
perf report
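This opens an interactive TUI (add --stdio for plain-text output like the abridged, illustrative listing below). By default, it shows functions sorted by the share of samples attributed to them. Because we recorded call graphs, recent perf versions show Children and Self columns by default; run perf report --no-children to get the single Overhead column shown here.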
# Overhead Command Shared Object Symbol
# ........ ......... .................. ..................
#
99.98% compute compute [.] process_data
0.01% compute libc-2.31.so [.] __libc_write
0.00% compute libc-2.31.so [.] _start
This immediately tells us process_data is where almost all the time is spent. But that’s only the function level. Where inside the function does the time go?
We can drill down. Pressing Enter on process_data in perf report opens a menu from which you can choose to annotate the symbol; pressing a on the highlighted line goes there directly. The annotate view lists the function’s disassembly in address order, interleaved with source lines when debug info is available, and shows the percentage of samples that landed on each instruction in the left-hand column. Those percentages are relative to the function, so they add up to roughly 100%.
The exact instructions depend on your compiler version and flags. An illustrative excerpt of the loop, shown here in gdb-style address notation rather than perf’s own layout:
0x0000000000001144 <+44>: imul rax,rcx
0x0000000000001147 <+47>: add rax,rdx
0x000000000000114a <+50>: mov rdx,rax
0x000000000000114d <+53>: mov rax,QWORD PTR [rbp-0x10]
0x0000000000001151 <+57>: add rax,rdx
0x0000000000001154 <+60>: mov QWORD PTR [rbp-0x10],rax
0x0000000000001158 <+64>: add rbp,0x1
0x000000000000115c <+68>: cmp rbp,rsi
0x000000000000115f <+71>: jne 0x1144 <process_data+44>
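In a listing like this, most samples cluster around the imul and add instructions, which form the core of the computation. Read per-instruction numbers with some care: because of sampling skid, a sample is often charged to an instruction shortly after the one that actually caused the delay.
The key to efficient profiling with perf isn’t just running perf record and perf report. It’s understanding what events to sample and how to interpret the output. We used cycles because it’s a fundamental measure of CPU work. When many cycles samples land on an instruction or function, it’s a good indicator of a hotspot.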
For more complex scenarios, you might want to look at other events. For example, perf stat ./compute gives a summary of events without going into the call graph or assembly:
Performance counter stats for './compute':
51,898,193 cycles (83.33%)
47,479,246 instructions # 0.91 insn per cycle
10,350,000 branches (83.33%)
1,234 branch-misses (83.33%)
3.193845375 seconds time elapsed
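This shows the total cycles, instructions, and branch behavior (the figures here are illustrative). The percentages in parentheses show how much of the run each counter was actually active, which happens when perf has to multiplex more events than there are hardware counters. Instructions per cycle (IPC) is simply instructions divided by cycles, here 47,479,246 / 51,898,193 ≈ 0.91. A low IPC suggests the CPU is often stalled, for example waiting on memory, rather than retiring instructions. A high ratio of branch-misses to branches points to hard-to-predict control flow; here it is 1,234 / 10,350,000, about 0.01%, which is negligible.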
Beyond cycles, you can count or sample other events: cache misses (cache-misses), branch mispredictions (branch-misses), last-level cache traffic (LLC-loads, LLC-load-misses), and raw hardware counters specific to your CPU model. Pass them with -e to perf stat or perf record, and list what your machine supports with perf list.
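The real power comes when you combine perf record with perf report and understand how to navigate the TUI. You can filter by symbol name (/), expand or collapse one level of the call chain under the highlighted entry (+), expand or collapse all call chains (E and C), and annotate the highlighted symbol (a). The -g flag (or --call-graph) is crucial for understanding how you got to a hot function, not just that it’s hot. The default unwinding method, fp, is cheap but needs code built with frame pointers (-fno-omit-frame-pointer). dwarf works on binaries built without frame pointers as long as debug info is available, at the cost of larger perf.data files and slower post-processing.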
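perf report itself presents the data as a sorted table with expandable call chains, not as a flame graph. The same data can be rendered as one by feeding the output of perf script to an external tool such as Brendan Gregg’s FlameGraph scripts. In a flame graph, the wider a bar, the more samples include that function on the stack. This visual representation is a quick way to identify the most significant contributors to run time.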
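What’s often missed is that perf can also profile kernel code. If your application spends CPU time inside the kernel (system calls, page faults, the I/O and network stacks), those samples appear under [kernel.kallsyms], and you can drill down into kernel functions the same way. Seeing kernel samples usually requires root or a permissive kernel.perf_event_paranoid setting. Keep in mind that a cycles profile covers only time spent on the CPU: time a thread spends blocked waiting for disk or network generates no samples.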
One subtle aspect of perf is how sampling interacts with out-of-order and speculative execution. Modern CPUs execute instructions speculatively, meaning they guess what will happen next and run those instructions ahead of time. If the guess is wrong, the work is discarded. Because many instructions are in flight when the sampling interrupt fires, a sample can land on a neighboring instruction rather than the one responsible (skid), and some events count speculative work that is later thrown away. Either effect can make a section of code appear hotter or cooler than it really is. Where the hardware supports it, precise sampling (for example the cycles:pp event modifier) reduces skid. Keeping this in mind helps you avoid optimizing code that isn’t truly the bottleneck.
Once you’ve optimized process_data, you’ll likely start seeing libc functions or other parts of your program become more prominent in perf report.