Linux perf can use the CPU’s Last Branch Record (LBR) registers to reconstruct the call chain at each sample, giving insight into program execution flow without relying on frame pointers or debug-info unwinding.

Let’s see perf with LBR in action. Imagine you’re debugging a performance issue in a C program that recurses through a function several levels deep.

#include <stdio.h>
#include <stdlib.h>

void deep_function(int depth) {
    if (depth <= 0) {
        return;
    }
    // Simulate some work
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) {
        x += i;
    }
    deep_function(depth - 1);
}

void middle_function(int depth) {
    deep_function(depth);
}

void top_function(int depth) {
    middle_function(depth);
}

int main() {
    top_function(10);
    return 0;
}

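To profile this with perf and LBR, first make sure your CPU exposes LBR to the kernel. The perf-record man page describes the lbr call-graph method as available on newer Intel platforms such as Haswell; support on other vendors’ CPUs and inside virtual machines varies, so check before relying on it. Build the program without aggressive optimization (for example, with -O0), since an optimizing compiler may inline these small functions or turn the tail recursion into a loop. Then run perf like this: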

perf record -e branch-instructions -o perf.data --call-graph lbr ./your_program

Here’s what’s happening:

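  • -e branch-instructions: Selects the event that drives sampling. perf does not sample on every branch; it takes a sample each time the counter for this event overflows, at a rate you can change with -c (period) or -F (frequency). The LBR hardware records branches continuously on its own, and perf reads it whenever a sample is taken.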
  • -o perf.data: This saves the raw data to a file named perf.data.
  • --call-graph lbr: This is the crucial part. It instructs perf to reconstruct the call graph from the Last Branch Record hardware rather than from frame pointers (fp) or DWARF unwinding (dwarf). This method captures user-space call chains only.

After running your program, you’d analyze the data:

perf report -i perf.data

This command opens an interactive report. When you expand the entry for a specific function (e.g., deep_function), you can see the call chains that led to it. With LBR, those chains come from hardware state rather than from stack unwinding, so they do not depend on the binary having frame pointers or debug info.

The problem perf with LBR solves is getting accurate call traces, especially in scenarios where traditional stack walking might fail. This includes:

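  • Optimized code: Compilers often omit frame pointers (for example, with -fomit-frame-pointer) and inline or reorder code, which leaves frame-pointer-based stack walkers with truncated or bogus chains.
  • Indirect calls: Function pointers and virtual method calls are hard to follow by reading the code, but LBR records the actual target of each taken call, so the recorded chain reflects what really ran.
  • Samples at awkward moments: A sample that lands in a function prologue or epilogue, before the frame is set up or after it is torn down, can make a frame-pointer walker skip a caller.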

LBR works by having dedicated hardware registers on the CPU that record the source and destination addresses of the last few taken branches (calls, jumps, returns). When perf is configured to use LBR, it doesn’t just sample the current instruction pointer. Instead, it also reads these LBR registers, which tell it where the CPU came from and where it went. For call graphs, the hardware runs in call-stack mode: each call pushes a record and each return pops one, so the registers mirror the active call chain. By chaining these records together, perf can reconstruct the call stack at the time of the sample.

The main levers you control are the event you sample on (-e) and the method of call graph generation (--call-graph). Any sampling event can drive LBR call graphs; cycles (the default) and branch-instructions are common choices, and -c or -F sets how often a sample is taken. Filtering by branch type belongs to a separate mode, branch-stack sampling (perf record -b, or -j with filters such as any_call, any_ret, ind_call, u, and k), which the perf-record man page notes does not work together with --call-graph lbr.

The LBR stack is finite: the number of entries varies by CPU and is often 16 or 32. If your call depth exceeds that number, the oldest entries are overwritten, so the outermost callers (the frames nearest main) drop out of the chain while the innermost frames survive. The example above nests only about 14 frames deep, so it fits. For deeper chains, perf report and perf script offer --stitch-lbr, which tries to recover missing callers by stitching together the LBR call stacks of consecutive samples, but the hardware limit remains. For extremely deep, sustained call chains, frame-pointer or DWARF unwinding may give more complete traces than LBR.

Once you’re comfortable with perf and LBR for basic call tracing, the next step is branch-stack sampling (perf record -b), where you can filter the recorded branches by type, such as calls, returns, or indirect branches, and by privilege level (user space vs. kernel space). Keep in mind that LBR records only taken branches.
