perf is the profiler developed in the Linux kernel source tree (most distributions package it separately, for example as linux-tools or perf). It can show you where a program spends its time, but its real strength is helping you understand why certain events are happening, not just that they are.

Let’s see perf in action. Imagine you have a Python script that’s unexpectedly slow.

import time

def slow_function():
    total = 0
    for i in range(10_000_000):
        total += i
    time.sleep(0.1)
    return total

def main():
    result = slow_function()
    print(f"Result: {result}")

if __name__ == "__main__":
    main()

We can profile this script to see where it’s spending its time. First, we need to record some performance data. We’ll focus on CPU cycles and page faults, which are common culprits for slowdowns.

perf record -e cycles,page-faults --call-graph dwarf python your_script.py
  • perf record: This is the command to start collecting performance data.
  • -e cycles,page-faults: This specifies the events to sample. cycles counts the CPU cycles consumed while the program runs, and page-faults counts accesses to pages that are not yet mapped into the process’s address space, whether the kernel only has to set up the mapping or has to read the data from disk.
  • --call-graph dwarf: This is crucial for understanding where within your code these events are happening. It tells perf to save part of the stack with each sample and unwind it afterwards using DWARF debugging information. Make sure debugging information is available for your Python interpreter and any compiled libraries, either built in or installed as separate debug packages.
  • python your_script.py: This is the command to run your application under perf’s watch.
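If you launch profiling runs from a script, the command above can be assembled programmatically. This is a minimal sketch, not an official wrapper: build_perf_record_cmd and record are hypothetical helper names, and the only flag used beyond those shown above is perf record’s -o option for naming the output file.

```python
import shutil
import subprocess
import sys


def build_perf_record_cmd(script, events=("cycles", "page-faults"),
                          call_graph="dwarf", output="perf.data"):
    """Assemble the argv for `perf record` (hypothetical helper)."""
    return [
        "perf", "record",
        "-e", ",".join(events),        # comma-separated event list
        "--call-graph", call_graph,    # stack unwinding mode
        "-o", output,                  # where perf writes its samples
        "--",                          # everything after this is the workload
        sys.executable, script,
    ]


def record(script):
    """Run perf record if perf is installed; return its exit code, else None."""
    if shutil.which("perf") is None:
        return None
    return subprocess.run(build_perf_record_cmd(script)).returncode


# Print the command instead of running it, so this works without perf installed.
print(" ".join(build_perf_record_cmd("your_script.py")))
```

Calling record("your_script.py") runs the same command and returns None on machines where perf is not on the PATH.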

After the script finishes, perf will create a perf.data file. Now, we analyze it.

perf report

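This will open an interactive TUI (text user interface). You’ll see a list of symbols and the percentage of samples attributed to each. You can navigate this list, expand call stacks, and drill down into the specifics. Keep in mind that perf profiles the native process, so for a stock CPython run the top entries are interpreter internals such as _PyEval_EvalFrameDefault rather than slow_function; Python 3.12 and later can expose Python function names to perf when started with python -X perf. Either way, the cycles samples cluster in the frames that execute the for loop. time.sleep barely shows up under cycles, because a sleeping process is off the CPU and consumes no cycles to sample. The page-faults samples come mostly from interpreter startup and memory allocation, not from the sleep.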
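When reading the report, it helps to know how much of the script’s run time was spent on the CPU at all. The sketch below is independent of perf: measure is a hypothetical helper, and the loop is shortened to keep the run quick. time.sleep adds to wall-clock time but almost nothing to CPU time, so it contributes very little to a cycles profile.

```python
import time


def slow_function():
    total = 0
    for i in range(1_000_000):   # shortened loop, for a quick demo
        total += i
    time.sleep(0.1)              # off-CPU: the process is not running here
    return total


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call to fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


wall, cpu = measure(slow_function)
print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s  off-CPU: {wall - cpu:.3f}s")
```

A large gap between wall and CPU time is a hint that a cycles profile alone will not explain the slowdown.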

The real power of perf comes from understanding the types of events and how they relate to your system’s architecture and your application’s behavior. cycles is a good general-purpose metric for CPU work. instructions tells you how many instructions were executed. cache-misses point to problems with data locality. branch-misses indicate control flow that the branch predictor handles poorly. page-faults show where the process touches memory that is not yet mapped: minor faults are normal when freshly allocated memory is first used, while major faults mean data had to be read from disk and often signal I/O or memory pressure.
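To get a feel for the page-faults event without perf, you can read the process’s own minor-fault counter from Python. This is a rough sketch for Unix-like systems only (the resource module is not available on Windows); minor_faults, faults_during and touch_fresh_memory are hypothetical helpers, and the 4 KiB page size is an assumption. The exact count varies with the allocator and kernel settings such as transparent huge pages.

```python
import resource


def minor_faults():
    """Minor page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def faults_during(fn):
    """Return (minor faults taken while fn ran, fn's result)."""
    before = minor_faults()
    result = fn()
    return minor_faults() - before, result


def touch_fresh_memory(size=32 * 1024 * 1024, page=4096):
    buf = bytearray(size)                 # freshly allocated, zero-filled
    for offset in range(0, size, page):   # assumes 4 KiB pages
        buf[offset] = 1                   # write one byte per page
    return len(buf)


delta, size = faults_during(touch_fresh_memory)
print(f"{delta} minor faults while touching {size // 2**20} MiB")
```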

To get a full mental model, think about the CPU pipeline. The cycles count includes both cycles in which the CPU completes instructions and cycles in which it is stalled. High cycles in a function might mean it’s doing a lot of work, or that it’s stalled waiting for memory (which would also show up in cache-misses). Comparing instructions with cycles helps separate the two cases: many instructions per cycle means the CPU is kept busy with useful work, while few instructions per cycle points to stalls from cache misses, branch mispredictions, or long-latency instructions.
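The relationship between these counters can be reduced to two ratios. The sketch below assumes you already have raw counts, for example from perf stat -e cycles,instructions,cache-references,cache-misses; summarize is a hypothetical helper and the numbers are made up for illustration. What counts as a "low" IPC depends on the CPU, so treat the ratios as something to compare between runs rather than against a fixed threshold.

```python
def summarize(cycles, instructions, cache_refs, cache_misses):
    """Return (instructions per cycle, cache miss rate) from raw counts."""
    ipc = instructions / cycles
    miss_rate = cache_misses / cache_refs
    return ipc, miss_rate


# Made-up counter values, for illustration only.
ipc, miss_rate = summarize(
    cycles=2_000_000,
    instructions=1_000_000,
    cache_refs=10_000,
    cache_misses=2_500,
)
print(f"IPC: {ipc:.2f}  cache miss rate: {miss_rate:.0%}")
```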

Let’s say you’re profiling a C++ application and perf report shows a lot of cache-misses in a tight loop.

# Example perf report output (illustrative, not from a real run)
# Children    Self  Command  Symbol
# -----------------------------------------------
#   70.00%  15.00%  my_app   process_data
#   30.00%   5.00%  my_app   <...page-faults...>

Expanding process_data might reveal:

# Children    Self  Command  Symbol
# -----------------------------------------------
#   70.00%  15.00%  my_app   process_data
#     60.00%  12.00%  my_app   [.] process_data
#     40.00%   3.00%  my_app   [.] do_calculation
#                              ^-- 40.00% of the samples under process_data

If do_calculation accounts for a high share of cache-misses, the fix isn’t necessarily to rewrite its logic, but to change how it accesses data. For instance, C++ stores built-in 2D arrays row by row (row-major), so if the inner loop walks down a column, every access jumps a whole row ahead in memory and you’ll get many cache misses. Swapping the loop order so the inner loop walks along a row (for row... for col...) or restructuring the data (e.g., using a struct of arrays instead of an array of structs when only a few fields are read) can dramatically improve performance.
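The effect of loop order on memory access can be illustrated without a profiler by computing which byte offsets each traversal touches. This is a simplified model, not a cache simulator: it assumes a row-major array of 8-byte elements and 64-byte cache lines, and access_offsets and line_switches are hypothetical helpers.

```python
def access_offsets(n_rows, n_cols, order, elem_size=8):
    """Byte offsets touched when traversing a row-major n_rows x n_cols array."""
    if order == "row":      # for row: for col  (inner loop walks along a row)
        pairs = ((r, c) for r in range(n_rows) for c in range(n_cols))
    elif order == "col":    # for col: for row  (inner loop walks down a column)
        pairs = ((r, c) for c in range(n_cols) for r in range(n_rows))
    else:
        raise ValueError(f"unknown order: {order!r}")
    return [(r * n_cols + c) * elem_size for r, c in pairs]


def line_switches(offsets, line_size=64):
    """Count consecutive accesses that land on a different cache line."""
    lines = [off // line_size for off in offsets]
    return sum(1 for a, b in zip(lines, lines[1:]) if a != b)


for order in ("row", "col"):
    switches = line_switches(access_offsets(4, 1024, order))
    print(f"{order}-order traversal: {switches} cache-line switches")
```

In this model the row-order traversal moves to a new cache line once every eight accesses, while the column-order traversal moves to a new line on every access, which is the pattern that shows up as cache-misses in a profile.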

perf also allows you to annotate specific functions with source code or assembly. After running perf record, you can use:

perf annotate -i perf.data [symbol_name]

For example, to annotate the do_calculation function:

perf annotate -i perf.data do_calculation

This will show you the assembly for the function, interleaved with source lines when the binary was built with debugging information (for example with -g), alongside the share of samples for each instruction. You can then see exactly which lines are contributing most to the measured events.

The most surprising thing perf tends to reveal is how often performance bottlenecks aren’t CPU-bound in the traditional sense, but rather memory-bound or I/O-bound, and how easily this can be masked. A function might appear to be consuming a lot of CPU cycles simply because it’s repeatedly trying to access data that isn’t in the CPU cache, leading to long stalls. Profiling cache-misses or page-faults alongside cycles is often more revealing than looking at cycles alone.

The next step in mastering perf is to explore more advanced event types and their kernel-level implications, such as understanding hardware performance counters versus software-defined events.
