perf is a low-level Linux profiling tool that can attribute CPU time and hardware events down to individual functions, in both user space and the kernel.

Let’s see perf in action. Imagine you’re trying to figure out why a specific application is slow. You can use perf to sample the CPU’s performance counters.

sudo perf record -e cpu-cycles -a -- sleep 10
sudo perf report

The first command samples CPU cycles on all cores for 10 seconds (sleep 10 is only there to set the duration). perf report then shows a breakdown of which functions the samples landed in. You might see something like this:

# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
   55.12%  my_app   my_app         [.] process_data
   20.50%  my_app   libc-2.31.so   [.] _IO_fgets
    8.75%  my_app   my_app         [.] calculate_sum
    ...

This tells you that about 55% of the sampled cycles landed in process_data in your my_app binary, which makes it the first place to look.

perf works by programming the Performance Monitoring Unit (PMU) on each CPU core. The PMU provides hardware counters that can count events like CPU cycles, cache misses, branch mispredictions, and more. perf can read these counters live or record samples to a file for later analysis.

The core idea behind perf is event-driven sampling. Instead of meticulously tracking every instruction (which would be too slow), perf periodically interrupts the CPU (based on a specific event, like a certain number of CPU cycles passing) and records the current program counter. By aggregating these samples, you build a statistical picture of where the CPU is spending its time.
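To make the sampling arithmetic concrete, here is a minimal sketch. expected_samples is a hypothetical helper, not part of perf; it only divides an event count by the sampling period. In perf record, the -c option sets that period (one sample every N events), while -F asks for a target sampling frequency instead. The 2 GHz clock speed is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rough number of samples taken when sampling
# once every <period> occurrences of an event.
expected_samples() {
    local total_events=$1 period=$2
    echo $(( total_events / period ))
}

# Assumption: one busy core at about 2 GHz runs ~2,000,000,000 cycles
# per second. Sampling every 1,000,000 cycles gives ~2000 samples.
expected_samples 2000000000 1000000

# The matching perf invocation (not run here) would look like:
#   sudo perf record -e cpu-cycles -c 1000000 -- ./my_slow_program
```

A shorter period gives a finer statistical picture but adds more interrupt overhead, so the period is a trade-off rather than a value to minimize.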

The most fundamental command is perf list, which shows you all the available hardware and software events your system supports.

perf list

You’ll see output like:

List of performance events:
  cpu-cycles                                          [Hardware event]
  instructions                                        [Hardware event]
  cache-references                                    [Hardware event]
  cache-misses                                        [Hardware event]
  branch-instructions                                 [Hardware event]
  branch-misses                                       [Hardware event]
  ...
  sched:sched_switch                                  [Tracepoint event]
  syscalls:sys_enter_read                             [Tracepoint event]
  ...
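The full list can run to thousands of lines, so it helps to filter it. perf list accepts a type argument for this (for example perf list hw, perf list sw, or perf list tracepoint). As a rough sketch of doing the same on saved output, the hypothetical helper below, events_of_type, keeps only the lines carrying a given bracketed label. It assumes the one-event-per-line layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `perf list`-style text on stdin and print
# the first name on each line whose bracketed type label matches $1.
events_of_type() {
    local label=$1
    awk -v label="[$label]" 'index($0, label) { print $1 }'
}

# Demo on two sample lines (no perf needed):
printf '%s\n' \
    '  cpu-cycles      [Hardware event]' \
    '  page-faults     [Software event]' |
    events_of_type 'Hardware event'
# prints: cpu-cycles

# With perf installed, the same filter could be applied as:
#   perf list | events_of_type 'Hardware event'
```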

To record events, you use perf record. The -e flag specifies the event(s), -a means record across all CPUs, and -g enables call graph (stack trace) recording, which is crucial for understanding why a function is being called.

sudo perf record -e cache-misses -g -a -- ./my_slow_program

After perf record finishes, perf report analyzes the generated perf.data file. You can navigate this report with arrow keys and press Enter to drill down into specific functions.
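For scripts or logs, perf report can also print plain text with --stdio instead of opening the interactive view. The sketch below pulls the top entries out of that text. top_symbols is a hypothetical helper, and it assumes the flat layout shown earlier (overhead first, symbol last, no spaces inside the symbol name), so C++ signatures or call-graph output would need a more careful parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from flat `perf report --stdio` text on stdin,
# print "<overhead> <symbol>" for the first N entries (default 5).
top_symbols() {
    local n=${1:-5}
    # Skip '#' header lines; keep rows whose first field is a percentage.
    awk '!/^#/ && $1 ~ /%$/ { print $1, $NF }' | head -n "$n"
}

# Demo on the sample report from this article (no perf needed):
printf '%s\n' \
    '# Overhead  Command  Shared Object  Symbol' \
    '   55.12%  my_app   my_app         [.] process_data' \
    '   20.50%  my_app   libc-2.31.so   [.] _IO_fgets' |
    top_symbols 1
# prints: 55.12% process_data

# With a real perf.data file, usage might look like:
#   sudo perf report --stdio --no-children | top_symbols 3
```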

perf stat is for a quick overview of total counts for a command.

perf stat -e cpu-cycles,instructions,cache-misses -- ./my_fast_program

This gives you aggregated numbers, not a breakdown by function.

 Performance counter stats for './my_fast_program':

         1,234,567      cpu-cycles                    #    2.00 GHz
         9,876,543      instructions                  #    8.00 insn per cycle
            54,321      cache-misses
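The "insn per cycle" figure is simply instructions divided by cycles. If you want to compute it yourself, perf stat can emit CSV with -x, and write it to a file with -o. The sketch below assumes the count is in field 1 and the event name in field 3 of that CSV; ipc_from_csv is a hypothetical helper, and event names may carry suffixes on some systems, so treat the matching as approximate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute instructions per cycle from
# `perf stat -x,` CSV text on stdin.
# Assumed layout: field 1 = count, field 3 = event name.
ipc_from_csv() {
    awk -F, '
        $3 ~ /^(cpu-)?cycles/ { cycles = $1 + 0 }
        $3 ~ /^instructions/  { insns  = $1 + 0 }
        END {
            if (cycles > 0) printf "%.2f\n", insns / cycles
            else            print "n/a"
        }'
}

# Demo using the counts from the sample output above (no perf needed):
printf '%s\n' \
    '1234567,,cpu-cycles,1000000,100.00,,' \
    '9876543,,instructions,1000000,100.00,,' |
    ipc_from_csv
# prints: 8.00

# With perf installed, usage might look like:
#   perf stat -x, -o stats.csv -e cpu-cycles,instructions -- ./my_fast_program
#   ipc_from_csv < stats.csv
```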

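For kernel-level analysis, perf needs to resolve kernel symbols. If perf report shows lots of [k] entries as raw addresses instead of function names, first check that /proc/kallsyms is readable (running perf as root, or relaxing the kernel.kptr_restrict sysctl, usually fixes this). For source-level detail you may also need your kernel debug symbols installed (e.g., linux-image-$(uname -r)-dbgsym on Debian/Ubuntu).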

You can also trace specific system calls with perf trace.

sudo perf trace -e 'syscalls:sys_enter_*' -- my_app

This will show every system call entry that my_app makes. It’s incredibly verbose but can pinpoint I/O or network issues.

The -p flag lets you attach perf to a running process.

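sudo perf record -e cpu-cycles -p $(pgrep my_daemon) -- sleep 30

This attaches to the process named my_daemon and records its CPU cycles for 30 seconds. Note that -p is used without -a here: -a asks for system-wide recording, which conflicts with targeting a single process. This form also assumes pgrep matches exactly one PID.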
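If several processes share the name, pgrep prints one PID per line, while perf record -p expects a comma-separated list. A minimal sketch for bridging the two follows; join_pids is a hypothetical helper (on systems with procps, pgrep -d, produces the same result directly).

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn one-PID-per-line input into the
# comma-separated list that `perf record -p` accepts.
join_pids() {
    paste -sd, -
}

# Demo with made-up PIDs (no perf needed):
printf '%s\n' 101 202 303 | join_pids
# prints: 101,202,303

# With perf installed, usage might look like:
#   sudo perf record -e cpu-cycles -p "$(pgrep my_daemon | join_pids)" -- sleep 30
```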

When dealing with complex systems, understanding the interaction between user space and kernel space is key, and perf can profile both in the same run. For instance, if perf report shows a lot of time spent in kernel functions like tcp_sendmsg, that suggests your application’s network sends account for a significant share of its CPU time.

One thing that trips many people up is the difference between hardware events, software events, and tracepoints, and how to interpret them. Hardware events (cycles, cache misses) are counted by the PMU and describe what the CPU itself is doing. Software events, such as context-switches and page-faults, are counted by the kernel. Tracepoints, such as sched:sched_switch or syscalls:sys_enter_read, are static instrumentation points in kernel code. A high rate of sched:sched_switch events means the scheduler is switching tasks very often. That can point to CPU contention, where tasks are preempted frequently, or to threads that run only briefly before blocking on I/O or locks. Either way, context-switching overhead goes up.
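A raw switch count is hard to judge without the time window, so it helps to turn it into a rate. The sketch below does that from perf stat CSV output. switch_rate is a hypothetical helper, the CSV layout (count in field 1, event name in field 3) is an assumption, and the numbers in the demo are made up. What counts as "high" depends on the workload, so compare against a known-good baseline rather than a fixed threshold.

```shell
#!/usr/bin/env bash
# Hypothetical helper: context switches per second, given
# `perf stat -x,` CSV text on stdin and the window length in seconds.
switch_rate() {
    local seconds=$1
    awk -F, -v secs="$seconds" '
        $3 ~ /^context-switches/ { printf "%.1f\n", ($1 + 0) / secs }'
}

# Demo with a made-up count of 45000 switches over 30 seconds:
printf '%s\n' '45000,,context-switches,30000000000,100.00,,' | switch_rate 30
# prints: 1500.0

# With perf installed, usage might look like:
#   sudo perf stat -x, -o cs.csv -e context-switches -p <pid> -- sleep 30
#   switch_rate 30 < cs.csv
```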

The next logical step after basic profiling is often to dive into the specifics of I/O or network performance using perf’s tracing capabilities or to correlate CPU profiling with memory access patterns using cache event profiling.
