The perf annotate command in Linux is a powerful tool for performance analysis, but its output can be cryptic without context. It shows assembly instructions and their associated performance counter data, but linking those instructions back to the original source code lines is key to understanding why a piece of code is hot.
Let’s see perf annotate in action. Imagine we have a simple C program that does some heavy computation:
// heavy_loop.c
#include <stdio.h>
#include <stdlib.h>
long long sum_squares(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (long long)i * i;
    }
    return sum;
}
int main() {
    int iterations = 100000000;
    long long result = sum_squares(iterations);
    printf("Sum of squares up to %d is %lld\n", iterations, result);
    return 0;
}
We compile this with debug information (-g), which perf annotate needs in order to map instructions back to source lines. No optimization flag is given, so GCC uses its default level, -O0:
gcc -g heavy_loop.c -o heavy_loop
Now, let’s run perf record to gather performance data. We’ll focus on CPU cycles as a metric:
perf record -e cycles:u ./heavy_loop
This command samples the cycles event in user space only (the :u modifier) while heavy_loop is executing. When the program exits, perf writes the collected samples to perf.data in the current directory.
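Next, we use perf annotate to inspect the collected data. If we run it without any arguments, it annotates the functions that received samples, starting with the hottest one. Note that "hottest" means the most samples, not the most calls: perf counts samples, and sum_squares is called only once here.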
perf annotate
The output will look something like this (simplified to show only the assembly; real output prefixes each instruction with its share of the samples):
Overhead Command Shared Object Symbol
--------------------------------------------------------------------------------
99.89% heavy_loop heavy_loop [.] sum_squares
... (assembly instructions with percentages) ...
0x00005555555540a0 <+0>: push %rbp
0x00005555555540a1 <+1>: mov %rsp,%rbp
0x00005555555540a4 <+4>: mov $0x0,%rax
0x00005555555540ab <+11>: mov %rax,%rdx
0x00005555555540ae <+14>: mov $0x0,%rsi
0x00005555555540b5 <+21>: mov %rsi,%rbx
0x00005555555540b8 <+24>: mov %rbp,%rdi
0x00005555555540bb <+27>: mov $0x0,%r8d
0x00005555555540c2 <+34>: jmp 0x5555555540e0 <sum_squares+52>
0x00005555555540c4 <+36>: mov %rbx,%rax
0x00005555555540c7 <+39>: imul %rbx,%rax
0x00005555555540cb <+43>: add %rax,%rdx
0x00005555555540ce <+46>: add $0x1,%rbx
0x00005555555540d2 <+50>: cmp %r8d,%rbx
0x00005555555540d5 <+53>: jl 0x5555555540c4 <sum_squares+36>
0x00005555555540d7 <+55>: mov %rdx,%rax
0x00005555555540da <+58>: pop %rbp
0x00005555555540db <+59>: retq
Notice the instructions that form the loop: mov %rbx,%rax (+36), imul %rbx,%rax (+39), add %rax,%rdx (+43), add $0x1,%rbx (+46), cmp %r8d,%rbx (+50) and jl (+53). In real perf annotate output each of them carries a percentage, and together they collect nearly all of the samples. But what C code do they correspond to?
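To get source code annotation, perf relies on the debug information in the binary. When that information is present, perf annotate interleaves source with assembly by default (this is the --source option, and it is on unless you pass --no-source). Adding --stdio prints plain text instead of opening the interactive viewer, and -l (--print-line) also reports which source lines collected the samples:
perf annotate --stdio -l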
Now, the output will look more like this:
Overhead Command Shared Object Symbol
--------------------------------------------------------------------------------
99.89% heavy_loop heavy_loop [.] sum_squares
...
+-----------------------------------------------------------------+
99.95% | for (int i = 0; i < n; ++i) { |
| sum += (long long)i * i; |
| } |
+-----------------------------------------------------------------+
99.89% | 0x00005555555540c4 <sum_squares+36> mov %rbx,%rax |
99.85% | 0x00005555555540c7 <sum_squares+39> imul %rbx,%rax |
99.79% | 0x00005555555540cb <sum_squares+43> add %rax,%rdx |
99.71% | 0x00005555555540ce <sum_squares+46> add $0x1,%rbx |
99.62% | 0x00005555555540d5 <sum_squares+53> jl 0x5555555540c4 |
| |
| sum += (long long)i * i; |
+-----------------------------------------------------------------+
| |
| } |
+-----------------------------------------------------------------+
0.01% | 0x00005555555540d7 <sum_squares+55> mov %rdx,%rax |
| |
| return sum; |
+-----------------------------------------------------------------+
Each block pairs C source lines with the assembly instructions generated from them. The percentage next to a source line is the share of samples attributed to that line. By default perf annotate computes percentages relative to the samples in the annotated function rather than the whole profile; the --percent-type option changes that.
In this example, the line sum += (long long)i * i; and the loop control around it account for nearly all the CPU cycles. That is expected: the loop body is the only real work in the function, and it runs 100 million times. No compiler optimization is involved, because the program was built without an -O flag, so GCC compiled it at -O0 with no unrolling. Rebuilding with -O2 would change both the generated instructions and how the samples are distributed across them.
The perf annotate command works by using the debug information (-g flag during compilation) to map addresses in the executable to source files and line numbers. When perf records events, it stores the instruction pointer at the time of the event. perf annotate then uses the debug symbols to translate these instruction pointer addresses back into source code locations.
The key to getting useful source-level annotations is ensuring your executable was compiled with debug symbols (-g). Without them, perf annotate can only show you assembly, and you’ll be stuck trying to map assembly back to source code yourself, which is significantly harder.
If you want to annotate a specific function, pass its name to perf annotate as the symbol argument:
perf annotate --stdio -l sum_squares
This will focus the annotation output solely on the sum_squares function.
A single source line usually compiles down to several assembly instructions. The percentage shown for a source line is therefore a sum: perf annotate adds up the samples of every instruction that the debug information attributes to that line.
The next problem you’ll likely encounter is understanding why a particular source line is hot. perf annotate tells you what is hot, but not necessarily why. This often leads to further investigation using tools like perf script to get a timeline of events or diving deeper into compiler optimizations and CPU microarchitectural details.