Linux perf can sample your production system with surprisingly little overhead.
Let’s see perf in action. Imagine we have a web server experiencing intermittent slowdowns. We want to understand what’s consuming CPU without adding significant load.
# Sample CPU cycles system-wide (-a) for 10 seconds
sudo perf record -a -e cycles:ppp -o perf.data -- sleep 10
# Analyze the results as plain text, with a sample count per entry
sudo perf report -n --stdio
This perf record command starts a sampling process. The -e cycles:ppp part is crucial:
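cycles: This tells perf to sample on the hardware CPU-cycle counter. perf does not record every cycle; it takes one sample each time the counter has advanced by the sampling period, so the code that burns the most cycles collects the most samples.
ppp: This is the precision modifier. Each p (up to three) asks for less skid, meaning the recorded instruction pointer lands closer to the instruction that actually triggered the sample. ppp demands zero skid and is not available on every CPU. If perf rejects it, or if you want the cheapest possible sampling, use pp (less precise) or omit the modifier entirely.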
The -- sleep 10 part only bounds the run: perf records until the sleep command exits, about 10 seconds later, and sleep itself does almost no work. The samples are saved to perf.data.
perf report then takes that perf.data file and presents a human-readable summary. The -n flag adds a column showing the number of samples behind each entry, and --stdio forces it to print to standard output instead of opening the interactive TUI.
The output will look something like this:
# Overhead Command Shared Object Symbol
# .......... ........ ................ ........
85.10% sleep [kernel.kallsyms] [k] SyS_nanosleep
10.50% perf perf [.] 0x0000000000001234
2.00% [unknown] [unknown] [.] 0x0000000000005678
1.50% systemd [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath
This tells us that during our 10-second sleep period, 85.10% of the sampled CPU cycles were spent inside the SyS_nanosleep kernel function, called by the sleep process itself. The perf command itself accounted for 10.50% of the samples, which is expected as it’s actively running. systemd shows up briefly, indicating some background activity.
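If only the hottest entries matter, the text report can be trimmed with a small filter. The sketch below is a hypothetical helper (top_rows is not part of perf) and assumes the --stdio layout shown above, where header lines start with # and every data row sits on one line.

```shell
# Hypothetical helper, not part of perf: keep the first N data rows of
# `perf report --stdio` output read from stdin (default N = 10).
# Assumes header lines start with '#' and data rows are non-blank.
top_rows() {
    awk -v n="${1:-10}" '
        /^#/ || NF == 0 { next }            # skip header comments and blank lines
        { print; if (++shown >= n) exit }   # print a row, stop after N rows
    '
}
```

A typical use would be: sudo perf report -i perf.data --stdio | top_rows 10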
The problem this solves is understanding where CPU time is being spent without drastically altering the system’s behavior. Traditional profiling methods often involve instrumenting code or running the entire application under a profiler, which can introduce significant overhead and change the program’s timing characteristics, potentially masking or altering the very issues you’re trying to find. perf’s sampling approach, especially with carefully chosen events, minimizes this intrusion.
Internally, perf leverages the CPU’s performance monitoring unit (PMU) through the Linux kernel. The PMU can be programmed to raise an interrupt after a set number of occurrences of a chosen event (CPU cycles, cache misses, branch instructions). On each interrupt, the kernel captures the current instruction pointer (IP) and the process context and saves them to a buffer. The counting itself happens in hardware, and software runs only once per sample, hence the low overhead.
The exact levers you control are the sampling event (-e), the sampling rate (a target frequency in samples per second with -F <hz>, or a fixed period of one sample every <count> events with -c <count>; with neither, perf uses a default frequency), and the sampling duration (sleep <seconds>, or attaching to a running process with -p <pid>). For production, you’ll often use a lower sampling rate or a different event than cycles to further reduce overhead. Common choices include:
cpu-clock: A software event, driven by a kernel timer, that approximates CPU time. It needs no hardware counter, so it also works where the PMU is unavailable.
context-switches: To see how often processes are being swapped out.
page-faults: To identify memory pressure.
branch-misses: For CPU pipeline efficiency issues.
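The three levers (event, rate, duration) can be combined when attaching to a process that is already running. The following is a minimal sketch: profile_pid, its argument order, the 99 Hz default, and the output file name are all illustrative choices, not anything perf defines.

```shell
# Hypothetical wrapper (name and defaults are illustrative, not perf's):
# sample one already-running process at a modest rate for a fixed time.
profile_pid() {
    local pid="$1" secs="${2:-10}" freq="${3:-99}"
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: profile_pid PID [SECONDS] [HZ]" >&2
            return 2
            ;;
    esac
    # -p attaches to the PID, -F sets samples per second, and
    # `sleep` only bounds how long perf keeps recording.
    perf record -e cpu-clock -F "$freq" -p "$pid" -o "perf.$pid.data" -- sleep "$secs"
}
```

For example, profile_pid 1234 10 99 would sample PID 1234 at 99 Hz for 10 seconds. Depending on the system's perf permissions, this may need to run from a root shell.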
Let’s re-run the previous example, but use cpu-clock and sample 1000 times per second (-F 1000) for 10 seconds.
# Sample cpu-clock system-wide, 1000 times per second, for 10 seconds
sudo perf record -a -e cpu-clock -F 1000 -o perf.data -- sleep 10
# Analyze the results
sudo perf report -n --stdio
This command samples the cpu-clock event at 1000 Hz. Note the flag: -F 1000 requests a frequency of 1000 samples per second, whereas -c 1000 would request one sample every 1000 occurrences of the event. Because cpu-clock counts nanoseconds, -c 1000 would ask for a sample every microsecond, far more than intended. At 1000 Hz, a 10-second run yields up to about 10 seconds * 1000 samples/second = 10,000 samples per profiled CPU.
The key to perf’s low overhead is that the sampling is largely driven by hardware or efficient kernel mechanisms. The CPU hardware is configured to trigger an interrupt only when a certain number of events have passed. The kernel then quickly records the program counter (PC) and context, and returns to normal operation. The bulk of the work happens when you run perf report to analyze the collected data, which takes place after recording has stopped and can be moved off the production host.
When you use perf record -e cycles:ppp, you’re telling perf to use the most precise sampling mode for the cycles event. Precise modes depend on hardware support and, depending on the CPU and kernel, can cost slightly more per sample in exchange for an instruction pointer that sits exactly at the point of the event. For truly minimal overhead in a highly sensitive production environment, you might opt for perf record -e cycles -c <large_number>, where <large_number> is set so you get a reasonable number of samples over your observation period, or use software events like cpu-clock.
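Choosing <large_number> is simple arithmetic: divide the rate at which the event fires by the number of samples per second you want. A minimal sketch, assuming a hypothetical helper named period_for and a core that runs at roughly 3 GHz while busy (substitute your own clock speed):

```shell
# Hypothetical helper: number of events between samples (the value for -c)
# needed to get roughly `want_hz` samples per second from an event that
# fires `events_per_sec` times per second.
period_for() {
    local events_per_sec="$1" want_hz="$2"
    echo $(( events_per_sec / want_hz ))
}

# Assumption: a busy core executes about 3,000,000,000 cycles per second.
period=$(period_for 3000000000 1000)

# Print the resulting command instead of running it.
echo "perf record -a -e cycles -c $period -o perf.data -- sleep 10"
```

With these assumed numbers the period comes out at 3,000,000 cycles, which gives about 1000 samples per second on a fully busy core and fewer on an idle one, since an idle core executes fewer cycles.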
The next hurdle is often understanding how to correlate these samples back to specific lines of your application code, especially when dealing with optimized binaries or dynamically generated code.