Linux perf can tell you exactly which instructions your CPU is spending time on, but most people stop at the list of hot symbols and never look at the instructions behind them.

Let’s see perf in action. Imagine we have a simple C program that does some heavy computation:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// A function that does a lot of work
unsigned long long process_data(int iterations) {
    // Unsigned on purpose: the running total exceeds 64 bits for 50 million
    // iterations, and unsigned arithmetic wraps instead of being undefined.
    unsigned long long sum = 0;
    for (int i = 0; i < iterations; ++i) {
        // Widen before multiplying: i * i overflows int once i > 46340
        sum += (unsigned long long)i * (unsigned long long)i;
        // Simulate some other work
        if (i % 100000 == 0) {
            volatile int temp = i; // Prevent optimization
        }
    }
    return sum;
}

int main(void) {
    srand(time(NULL));
    int num_iterations = 50000000; // 50 million iterations
    printf("Processing data...\n");
    unsigned long long result = process_data(num_iterations);
    printf("Result: %llu\n", result);
    return 0;
}

We compile this with debug symbols to make perf’s output more readable:

gcc -g -O2 -o compute compute.c

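Now, let’s profile it. We want to sample on CPU cycles (the cycles event, which perf record uses by default when hardware counters are available) and see where they’re being spent. --call-graph dwarf records a call chain with every sample, unwinding the stack via DWARF debug information; the short -g flag only turns on call-graph recording with the default method, so it is redundant once --call-graph is given. -F 99 asks for 99 samples per second, which is a good balance between detail and overhead; 99 rather than 100 avoids sampling in lockstep with periodic timer activity.

perf record -F 99 --call-graph dwarf ./compute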

This will run ./compute and collect profiling data. After it finishes, we’ll have a perf.data file.

To see the results, we use perf report.

perf report

This opens an interactive TUI. By default, it lists functions sorted by the share of samples (and so, approximately, of CPU cycles) attributed to each. Running perf report --stdio prints the same data as plain text:

# To display the perf.data header info, please use --header/--header-only options.
#
# Overhead  Command    Shared Object      Symbol
# ........  .........  .................. ..................
#
  99.98%    compute    compute            [.] process_data
   0.01%    compute    libc-2.31.so       [.] __libc_write
   0.00%    compute    compute            [.] _start

This immediately tells us process_data is where almost all the time is spent. But that’s only the function. Where within the function?

We can drill down. In perf report, select process_data and press a (or press Enter and choose the Annotate entry from the menu) to open the annotated view: the function’s disassembly, with the share of samples attributed to each instruction shown in the left-hand column.

You’ll see assembly instructions like:

    0x0000000000001144 <+44>:      imul   rax,rcx
    0x0000000000001147 <+47>:      add    rax,rdx
    0x000000000000114a <+50>:      mov    rdx,rax
    0x000000000000114d <+53>:      mov    rax,QWORD PTR [rbp-0x10]
    0x0000000000001151 <+57>:      add    rax,rdx
    0x0000000000001154 <+60>:      mov    QWORD PTR [rbp-0x10],rax
    0x0000000000001158 <+64>:      add    rbp,0x1
    0x000000000000115c <+68>:      cmp    rbp,rsi
    0x000000000000115f <+71>:      jne    0x1144 <process_data+44>

In the annotated view, the highest percentages would sit on the imul and add instructions that make up the loop body. This is the core of the computation.

The key to efficient profiling with perf isn’t just running perf record and perf report. It’s understanding what events to sample and how to interpret the output. We used cycles because it’s a fundamental measure of CPU work. When an instruction uses many cycles, it’s a good indicator of a hotspot.

For more complex scenarios, you might want to look at other events. For example, perf stat ./compute prints aggregate event counts for the whole run, without going into the call graph or assembly. The output looks something like this (the numbers are illustrative):

 Performance counter stats for './compute':

         51,898,193      cycles                                                      (83.33%)
         47,479,246      instructions              # 0.91 insn per cycle
         10,350,000      branches                                                      (83.33%)
              1,234      branch-misses                                                 (83.33%)

       3.193845375 seconds time elapsed

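This shows the total cycles, instructions, and branch behavior. A low instructions-per-cycle (IPC) value suggests the CPU is stalling, for example waiting on memory, rather than retiring instructions. A high branch-miss rate (branch-misses divided by branches) means the branch predictor is guessing wrong often, and each wrong guess costs cycles.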

Beyond just cycles, you can profile specific CPU events like cache misses (cache-misses), branch mispredictions (branch-misses), or even hardware performance counters specific to your CPU architecture (e.g., LLC-loads, LLC-load-misses for Last Level Cache). You can list available events with perf list.

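The real power comes when you combine perf record with perf report and learn to navigate the TUI. You can filter by symbol name (/), expand or collapse a call chain (+, or E and C for all of them at once), annotate the selected symbol (a), and list every key binding (h). Call-graph recording (-g or --call-graph) is crucial for understanding how you got to a hot function, not just that it’s hot. --call-graph dwarf is a good choice for optimized binaries built without frame pointers, since it unwinds using debug information, though it copies part of the stack with every sample and so produces larger perf.data files. The default method, fp, relies on frame pointers and works best when the program is built with -fno-omit-frame-pointer.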

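The call-chain tree in perf report holds the same information as a flame graph, presented as text. A flame graph, which you can generate from the same perf.data using perf script and external tools such as the FlameGraph scripts, draws each function as a bar whose width is proportional to its share of samples: the wider a bar, the more time that function consumes. This visual form is very effective for quickly identifying the most significant contributors to performance.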

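What’s often missed is that perf can also profile kernel code. If your application spends much of its time in system calls (e.g., disk I/O, network operations), perf can show the kernel functions executing on its behalf: you might see [kernel.kallsyms] appearing prominently, and then you’d drill down into kernel functions to understand where the time goes. Note that cycle sampling only sees on-CPU time; time the process spends blocked and asleep generates no samples.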

One of the more surprising aspects of perf is how sampling interacts with speculation. Modern CPUs execute instructions speculatively, meaning they guess what will happen next and run those instructions ahead of time. If the guess is wrong, the work is discarded. Some hardware events count work that was executed speculatively and then thrown away, and the sampled instruction pointer can "skid" to an instruction shortly after the one that triggered the sample. Either effect can make certain instructions appear hotter than they really are, so treat per-instruction percentages as approximate. Understanding this nuance is key to not over-optimizing code that isn’t truly the bottleneck.

Once you’ve optimized process_data, you’ll likely start seeing libc functions or other parts of your program become more prominent in perf report.
