Sampling profilers like perf are often more effective than instrumentation-based tools like Valgrind for finding real-world performance bottlenecks in production systems.

Let’s see perf in action. Imagine you have a C++ application, my_app, that you suspect is slow. To profile it with perf, you’d run:

sudo perf record -g -o perf.data ./my_app --some-arg

Then, to see the results, you’d use:

perf report

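This command will open an interactive TUI where you can navigate through your application’s call stack and see which functions are consuming the most CPU time. The -g flag enables call graph recording, which is crucial for understanding the context of where your time is being spent. By default perf unwinds stacks using frame pointers, so if the call graphs look truncated, building with -fno-omit-frame-pointer or recording with --call-graph dwarf usually helps.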

Contrast this with Valgrind’s callgrind tool. To get similar information, you’d run:

valgrind --tool=callgrind --dump-instr=yes --cache-sim=yes -v ./my_app --some-arg

Callgrind writes its results to a file named callgrind.out.<pid>, which you then analyze with callgrind_annotate.

The core difference lies in their methodology: perf uses sampling, while Valgrind uses instrumentation.

Sampling (perf) works by periodically interrupting the running program and recording the instruction pointer’s location (and, with -g, the call stack). Think of it like taking snapshots of your program’s execution at regular intervals. If a particular section of code is executed frequently, the instruction pointer will be caught in that section more often during these snapshots, thus appearing as a hotspot. Because the program runs at full speed between samples, the overhead is small, making this approach suitable for production environments.
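To make the snapshot idea concrete, here is a toy model in Python. It is only a sketch of the statistics behind sampling, not of how perf is implemented: the timeline, the function names (parse_input, compute, write_output) and the interval are all made up for illustration.

```python
from collections import Counter

def sample_timeline(timeline, interval):
    """Look at every `interval`-th tick and count which function was running."""
    samples = Counter()
    for tick in range(0, len(timeline), interval):
        samples[timeline[tick]] += 1
    return samples

def shares(samples):
    """Convert raw sample counts into fractions of all samples."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}

# Hypothetical workload (an assumption for illustration): one entry per tick
# of CPU time, naming the function that was executing during that tick.
timeline = ["parse_input"] * 100 + ["compute"] * 800 + ["write_output"] * 100

# Inspect only 1 tick in 10 instead of recording all 1000.
samples = sample_timeline(timeline, interval=10)
estimated = shares(samples)
print(samples)
print(estimated)
```

Even though only a tenth of the ticks are inspected, compute still accounts for about 80% of the samples, which matches its share of the timeline. A real profiler gets the same kind of estimate from timer or PMU interrupts rather than from a precomputed list.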

Instrumentation (Valgrind) works by translating the application’s machine code at run time. Valgrind runs the program on a synthetic CPU and, as each block of code is about to execute, inserts extra instructions to track every single instruction executed, every memory access, and so on. This provides extremely detailed information but comes at a significant performance cost. Running Valgrind can slow down your application by 10x to 100x, making it impractical for profiling long-running or performance-sensitive applications in realistic scenarios.
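The trade-off can be illustrated with a source-level analogy in Python. This is not how Valgrind works internally (Valgrind instruments machine code, not Python functions); the decorator and the function names below are hypothetical and only show the general pattern of paying a bookkeeping cost on every single event in exchange for exact counts.

```python
from collections import Counter
from functools import wraps

call_counts = Counter()

def instrumented(func):
    """Wrap `func` so that every call is recorded before it runs."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1  # extra work added to every call
        return func(*args, **kwargs)
    return wrapper

@instrumented
def square(x):
    return x * x

@instrumented
def sum_of_squares(n):
    return sum(square(i) for i in range(n))

result = sum_of_squares(1000)
print(call_counts)  # exact counts: square was called 1000 times
```

The counts are exact rather than estimated, but the bookkeeping runs on every call. Scaling that idea down to every machine instruction and memory access is what makes full instrumentation so much slower than sampling.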

Here’s why perf is often preferred for production:

  1. Overhead: perf’s sampling overhead is typically 1-5%, whereas Valgrind typically slows a program down by 10x to 100x, which corresponds to 900% to 9,900% overhead. This difference is massive. A 100x slowdown means a process that normally takes 1 minute will take 100 minutes. You simply can’t profile a complex, long-running system effectively under such conditions.
  2. Real-world Context: perf profiles the application as it runs on the actual hardware, including the effects of the OS scheduler, cache behavior, and other system-level events. Valgrind’s instrumentation can alter the program’s execution flow and timing characteristics in ways that might mask or even create artificial performance issues.
  3. Hardware Support: perf leverages sophisticated performance monitoring units (PMUs) built into modern CPUs. These hardware counters provide highly accurate and detailed metrics about instruction execution, cache misses, branch mispredictions, and more, with very little software intervention. Valgrind relies purely on software emulation.
  4. Ease of Use (for production): perf record and perf report are straightforward commands. While perf report can be dense, navigating it to find the top functions is quick. Valgrind requires more setup and its output analysis can be more complex, in addition to the extreme slowdown.
  5. System-wide Profiling: perf can also profile the entire system (perf record -a), not just a single process. This is invaluable for understanding interactions between different services or the impact of kernel activity. Valgrind is generally limited to profiling a single process.
  6. Event Granularity: perf can track a vast array of hardware and software events (e.g., cycles, instructions, cache-misses, context-switches, page-faults). Valgrind’s core focus is on instruction execution and memory errors, though it can simulate cache behavior.
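The overhead figures in item 1 mix two units, percent overhead and slowdown factor, so it can help to convert between them. The helper names below are made up for this sketch, and the inputs are the round numbers from the text rather than measurements.

```python
def slowdown_factor(overhead_percent):
    """An overhead of P percent means the run takes (1 + P/100) times as long."""
    return 1 + overhead_percent / 100

def profiled_minutes(baseline_minutes, overhead_percent):
    """Wall-clock time of a run that normally takes `baseline_minutes`."""
    return baseline_minutes * slowdown_factor(overhead_percent)

def overhead_percent_for(slowdown):
    """Inverse conversion: a slowdown of S times is (S - 1) * 100 percent overhead."""
    return (slowdown - 1) * 100

# A 1-minute job under a few overhead levels taken from the text.
for percent in (5, 1000, 9900):
    print(percent, "% overhead ->", profiled_minutes(1, percent), "minutes")
```

Under these definitions, 5% overhead turns a 1-minute job into roughly 1.05 minutes, 1000% overhead is an 11x slowdown, and the 100x slowdown from the example corresponds to 9,900% overhead.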

The one thing most people don’t realize about perf is that its ability to track hardware events like cache misses and branch mispredictions is not just about counting. The CPU’s PMU can be programmed to raise an interrupt after a chosen number of occurrences of an event, so perf can sample on cache misses or branch mispredictions instead of on elapsed cycles. Each sample still records the instruction pointer (and the call stack with -g), so the report shows which code triggers the event most often. Comparing that profile with the ordinary CPU-cycle profile lets you correlate hot functions with poor cache performance or excessive branch mispredictions, pointing to algorithmic inefficiencies or data structure issues that simple instruction counting wouldn’t reveal.

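If perf shows you a hotspot in a function, and you suspect memory access patterns are the issue, the next step is often to investigate cache miss events, either with perf stat (for example, perf stat -e cache-references,cache-misses ./my_app --some-arg) or by configuring perf record to sample on cache miss events (for example, perf record -e cache-misses -g ./my_app --some-arg). Which event names are available depends on the CPU and on whether you are running inside a virtual machine; perf list shows what your system supports.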
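Sampling on an event rather than on time changes which code stands out. The toy model below assumes a counter that records the running function every time a fixed number of cache misses has accumulated. That is a simplification of how event-based sampling behaves, and the trace, function names and period are invented for illustration.

```python
from collections import Counter

def sample_on_event(trace, period):
    """Record the running function each time `period` more events have occurred.

    `trace` is a list of (function_name, events_in_this_tick) pairs.
    """
    samples = Counter()
    pending = 0
    for function_name, events in trace:
        pending += events
        while pending >= period:
            samples[function_name] += 1
            pending -= period
    return samples

# Hypothetical workload: "compute" uses most of the ticks, but "lookup"
# causes most of the cache misses (1 miss per tick versus 20 per tick).
trace = [("compute", 1)] * 800 + [("lookup", 20)] * 200

time_profile = Counter(name for name, _ in trace)
miss_profile = sample_on_event(trace, period=100)
print(time_profile)  # compute dominates by time
print(miss_profile)  # lookup dominates by cache misses
```

In this made-up trace, compute accounts for 80% of the time but lookup collects 40 of the 48 miss samples, which is the kind of contrast you look for when comparing a cycles profile with a cache-miss profile.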
