The perf tool on Linux isn’t just for counting events; it’s a powerful profiler that can reconstruct the full call stack, showing you exactly which functions called which, all the way down to the kernel.

Let’s see perf in action. Imagine you have a Python script that’s unexpectedly slow. You want to know why.
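To have something concrete to profile, here is a minimal sketch of a deliberately slow script. The file name slow_script.py and both function names are hypothetical stand-ins for 'your_python_script.py'; any CPU-bound script would serve equally well.

```python
# slow_script.py: a hypothetical stand-in for 'your_python_script.py'.
import math


def slow_sum_of_roots(n: int) -> float:
    """Deliberately wasteful: copies the whole list on every iteration."""
    values = []
    total = 0.0
    for i in range(n):
        values = values + [math.sqrt(i)]  # O(n) copy each time, O(n^2) overall
        total += values[-1]
    return total


def fast_sum_of_roots(n: int) -> float:
    """Same value (up to floating-point rounding) without the list copies."""
    return sum(math.sqrt(i) for i in range(n))


def main() -> None:
    n = 20_000
    print(f"slow: {slow_sum_of_roots(n):.3f}")
    print(f"fast: {fast_sum_of_roots(n):.3f}")


if __name__ == "__main__":
    main()
```

In a profile of this script, most samples should land under the list concatenation in slow_sum_of_roots rather than under math.sqrt.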

# First, make sure you have perf installed. On Debian/Ubuntu:
sudo apt update
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Now, run perf to record a profile. -F 99 samples at 99 Hz, and perf keeps
# recording until the script exits. The '-g' flag enables call graphs, and
# '--call-graph dwarf' selects DWARF-based stack unwinding.
# Replace 'your_python_script.py' with your actual script.
sudo perf record -F 99 -g --call-graph dwarf python your_python_script.py

After perf record finishes, you’ll have a perf.data file. This is the raw data. To make sense of it, we use perf report.

# This opens an interactive TUI (Text User Interface)
sudo perf report

In perf report, you’ll see a list of functions sorted by their share of the total samples. Pressing Enter on a function expands its call graph, showing you the callers and callees. You can navigate this tree to pinpoint bottlenecks. The dwarf option in perf record tells perf to save a snapshot of the user stack with each sample and unwind it afterwards using DWARF call frame information (if available). This matters most for binaries compiled without frame pointers, where frame-pointer unwinding tends to produce truncated or broken stacks.
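When call graphs are recorded, perf report typically shows two percentages per function: Self (samples where the function itself was executing) and Children (samples where it appears anywhere on the stack). The sketch below illustrates that distinction on a handful of made-up stacks. The sample data and the function name self_and_children are assumptions for illustration, not perf's actual implementation.

```python
from collections import Counter

# Hypothetical samples: each tuple is one sampled stack, outermost caller first.
SAMPLES = [
    ("main", "parse", "tokenize"),
    ("main", "parse", "tokenize"),
    ("main", "parse"),
    ("main", "render"),
]


def self_and_children(stacks):
    """Return {function: (children_percent, self_percent)}."""
    self_hits = Counter()
    children_hits = Counter()
    for stack in stacks:
        self_hits[stack[-1]] += 1      # the innermost frame was on the CPU
        for func in set(stack):        # count once per sample, even if recursive
            children_hits[func] += 1
    total = len(stacks)
    return {
        func: (100.0 * children_hits[func] / total,
               100.0 * self_hits[func] / total)
        for func in children_hits
    }


if __name__ == "__main__":
    table = self_and_children(SAMPLES)
    for func, (children, own) in sorted(table.items(), key=lambda kv: -kv[1][0]):
        print(f"{children:6.1f}% {own:6.1f}%  {func}")
```

Here main appears in every stack but never as the innermost frame, so it has 100% Children and 0% Self. A function with a high Self value is where the CPU time is actually being spent.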

The real power comes from understanding the levers:

  • -F 99 (Frequency): This tells perf to sample at 99 Hertz (99 times per second). An odd value such as 99, rather than 100, avoids sampling in lockstep with periodic timers. Higher frequencies give more detail but increase overhead and file size. Lower frequencies cost less but can miss short-lived functions. Experiment to find a balance for your application.
  • -g (Call Graph): This enables call-graph recording, so every sample carries a stack trace. On its own, -g uses the default unwinding method, which is frame pointers (fp). Frame pointers are cheap but only work for code compiled with them (for example with -fno-omit-frame-pointer). The alternatives are dwarf, which works without frame pointers at a higher cost, and lbr (last branch record), which needs hardware support.
  • --call-graph dwarf: Selects DWARF unwinding explicitly (and implies -g). perf copies a chunk of the user stack with each sample (8192 bytes by default) and unwinds it afterwards using the call frame information in your binaries. Readable function names additionally require symbols, so build C/C++ code with -g and install debug symbol packages where possible. For Python, this covers the C frames of the interpreter and of extension modules.
  • Event Selection: By default, perf samples on CPU cycles. You can profile other events like cache misses (cache-misses), branch mispredictions (branch-misses), or specific hardware counters. For example, sudo perf record -F 99 -g --call-graph dwarf -e cycles,instructions,cache-misses python your_script.py.
  • perf script: This command can convert perf.data into a human-readable script format, useful for scripting further analysis or for input to tools like FlameGraph.
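The trade-off between sampling frequency and overhead can be estimated with simple arithmetic. The sketch below assumes one sample per tick per busy CPU and the default DWARF stack dump size of 8192 bytes. Both helper names are hypothetical, and the byte count is a rough upper bound on stack data, not a prediction of the final perf.data size.

```python
def estimate_samples(freq_hz: int, seconds: float, busy_cpus: int = 1) -> int:
    """Approximate sample count: one sample per tick on each busy CPU."""
    return int(freq_hz * seconds * busy_cpus)


def estimate_dwarf_stack_bytes(samples: int, stack_dump_bytes: int = 8192) -> int:
    """Upper bound on user-stack data copied when using --call-graph dwarf."""
    return samples * stack_dump_bytes


if __name__ == "__main__":
    samples = estimate_samples(freq_hz=99, seconds=30)
    stack_bytes = estimate_dwarf_stack_bytes(samples)
    print(f"samples: {samples}")
    print(f"stack data: up to {stack_bytes / 2**20:.1f} MiB")
```

For a single-threaded script that keeps one CPU busy for 30 seconds at 99 Hz, this gives 2970 samples and up to about 23 MiB of copied stack data. Raising the frequency tenfold scales both numbers by ten.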

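The surprising thing about perf’s call graph generation is how it handles interpreted languages like Python. perf samples native code, not Python bytecode, so what it sees are the C frames of the Python interpreter itself (such as its bytecode evaluation loop) and of any C extension modules. On its own, perf cannot tell which Python function was being executed inside those frames. Since Python 3.12, the interpreter can help: started with python -X perf, it routes Python calls through small per-function trampolines and writes a map file that lets perf label them with Python function names. This gives you a "full stack" profile that bridges the gap between your Python code and the underlying C implementation. How well this combines with a given unwinding method depends on the Python version and how the interpreter was built.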
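In CPython 3.12 and later, the perf trampoline can also be switched on from inside a program through the sys module. The following is a minimal sketch, assuming a Linux build with trampoline support. The helper enable_perf_trampoline and the workload hot_loop are hypothetical names, and the helper simply reports False where the feature is missing.

```python
import sys


def enable_perf_trampoline() -> bool:
    """Try to turn on CPython's perf trampoline; return True if it is active."""
    activate = getattr(sys, "activate_stack_trampoline", None)
    if activate is None:
        return False  # Python older than 3.12
    try:
        activate("perf")
    except (ValueError, OSError, RuntimeError):
        return False  # build or platform without trampoline support
    return bool(sys.is_stack_trampoline_active())


def hot_loop(n: int) -> int:
    """Hypothetical workload whose name should show up in perf's stacks."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(f"perf trampoline active: {enable_perf_trampoline()}")
    print(hot_loop(1_000_000))
```

Only functions called after activation go through the trampoline, so enabling it early in the program gives the most complete picture.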

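When analyzing perf report, you’ll often see [unknown] entries and kernel functions. They mean different things. [unknown] means perf could not map an address to a symbol, typically because of missing symbol information, JIT-generated code, or a failed unwind. Kernel functions mean the CPU was executing kernel code on behalf of your process, such as system calls, page faults, or lock handling. Expanding these kernel frames can reveal why the kernel is busy, pointing you towards system-level bottlenecks. Keep in mind that the default cycles event only samples while your process is running on a CPU, so time spent blocked on I/O or sleeping on a lock mostly does not show up in this kind of profile.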

The next step after mastering call graphs is often learning to generate and visualize flame graphs from perf data, which provide an intuitive, interactive way to explore your profiles.
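Flame graph tools usually consume "folded" stacks: one line per unique stack, frames joined by semicolons from outermost to innermost, followed by a count. The sketch below shows the idea of that folding step on perf script-style text. The embedded sample is hand-written to approximate the usual layout (a header line per sample, then indented frames listed innermost first) and is not captured output. Real output varies with perf version and options, and this parser ignores cases such as command or symbol names containing spaces.

```python
from collections import Counter

# Hand-written text approximating 'perf script' layout; NOT captured output.
SAMPLE_PERF_SCRIPT = (
    "python 4242 1000.000001:   10101010 cycles:\n"
    "\t    55d0c0a01234 inner+0x10 (/usr/bin/python3)\n"
    "\t    55d0c0a05678 outer+0x20 (/usr/bin/python3)\n"
    "\t    55d0c0a09abc main+0x30 (/usr/bin/python3)\n"
    "\n"
    "python 4242 1000.010102:   10101010 cycles:\n"
    "\t    55d0c0a01234 inner+0x14 (/usr/bin/python3)\n"
    "\t    55d0c0a05678 outer+0x20 (/usr/bin/python3)\n"
    "\t    55d0c0a09abc main+0x30 (/usr/bin/python3)\n"
    "\n"
    "python 4242 1000.020203:   10101010 cycles:\n"
    "\t    55d0c0a05678 outer+0x08 (/usr/bin/python3)\n"
    "\t    55d0c0a09abc main+0x30 (/usr/bin/python3)\n"
)


def fold_stacks(text: str) -> Counter:
    """Count identical stacks, keyed as 'comm;outermost;...;innermost'."""
    folded = Counter()
    comm, frames = None, []

    def flush():
        nonlocal comm, frames
        if comm is not None and frames:
            # Frames arrive innermost first, so reverse them for folding.
            folded[";".join([comm] + frames[::-1])] += 1
        comm, frames = None, []

    for line in text.splitlines():
        if not line.strip():            # blank line ends a sample
            flush()
        elif line[0].isspace():         # indented line: address, symbol, (dso)
            parts = line.split()
            if len(parts) >= 2:
                frames.append(parts[1].split("+0x")[0])
        else:                           # header line starts a new sample
            flush()
            comm = line.split()[0]
    flush()
    return folded


if __name__ == "__main__":
    for stack, count in fold_stacks(SAMPLE_PERF_SCRIPT).items():
        print(stack, count)
```

To try it on a real profile, the same function could be fed the text that perf script writes to standard output, for example by passing sys.stdin.read() instead of the embedded sample.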
