The most surprising thing about perf versus gprof is that gprof was never designed with Linux in mind: it dates from early-1980s BSD Unix, whereas perf was developed as part of the Linux kernel itself.
Let’s see perf in action. Imagine you’re running a Python script, my_script.py, that you suspect is slow.
# First, generate some sample Python code
cat <<EOF > my_script.py
import time

def slow_function():
    time.sleep(0.1)

def medium_function():
    for _ in range(1000):
        pass
    slow_function()

def fast_function():
    for _ in range(100000):
        pass

if __name__ == "__main__":
    print("Starting...")
    medium_function()
    fast_function()
    print("Done.")
EOF
# Now, let's profile it with perf
perf record -o perf.data python my_script.py
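When you run perf report, you’ll see something along these lines (a simplified, illustrative report; actual output will vary based on your system, kernel, and Python version):
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  .............................................
#   50.10%  python   [kernel]       [k] SyS_nanosleep
#   25.00%  python   python         my_script.py:<module>
#   15.00%  python   python         my_script.py:medium_function
#   10.00%  python   python         my_script.py:fast_function
In this schematic, about half the samples land in the kernel’s SyS_nanosleep (the system call that time.sleep ends up in; the exact symbol name depends on the kernel version), and the rest are spread across your Python code. Two caveats apply to real output. First, perf’s default event counts CPU cycles, so time the process spends asleep is mostly off-CPU and tends to be underrepresented rather than dominant. Second, perf sees the interpreter’s native code: unless you run Python 3.12 or later with its perf support enabled (python -X perf), samples are attributed to CPython’s C functions such as _PyEval_EvalFrameDefault rather than to medium_function or fast_function.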
Now, contrast this with gprof. To use gprof, you’d typically compile your code with specific flags and then run the executable. For a C program like this:
// slow_program.c
#include <stdio.h>
#include <unistd.h>

void slow_function() {
    usleep(100000); // 0.1 seconds
}

void medium_function() {
    for (volatile int i = 0; i < 1000; ++i);
    slow_function();
}

void fast_function() {
    for (volatile int i = 0; i < 100000; ++i);
}

int main() {
    printf("Starting...\n");
    medium_function();
    fast_function();
    printf("Done.\n");
    return 0;
}
You’d compile it, run it, and generate the report like this:
gcc -pg slow_program.c -o slow_program
./slow_program
gprof slow_program gmon.out > gprof_report.txt
The gprof_report.txt would show you call counts and time spent in each function, but it relies on instrumentation added at compile time and a separate gmon.out file.
The core difference between the two tools is sampling versus instrumentation. gprof depends on instrumenting your code: compiling with -pg makes the compiler insert a call at the entry of every function, which is how gprof counts calls and builds its call graph (time per function is then estimated from coarse, timer-based samples of the program counter). The call counts are exact for the code that was instrumented, but the extra calls add overhead and can sometimes alter program behavior, and code that was not compiled with -pg is largely invisible. perf, on the other hand, is a sampling profiler through and through. It uses hardware performance counters and the kernel to periodically interrupt the running program and record what it’s doing. This is much less intrusive and can profile any running program, including interpreted languages like Python or even the kernel itself, without requiring recompilation. It’s like taking snapshots of what the CPU is doing rather than meticulously logging every step.
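To make the idea of sampling concrete, here is a toy sketch in Python. It is not how perf works internally (perf samples from the kernel using hardware counters); it only imitates the principle with the POSIX profiling timer exposed by Python’s signal module, so it assumes a Unix-like system. The names sample_calls and busy_loop are invented for this illustration.

```python
import collections
import signal
import time

# Function name -> number of timer ticks that landed in it.
samples = collections.Counter()

def _on_tick(signum, frame):
    # Record whichever Python function was executing when the timer fired.
    if frame is not None:
        samples[frame.f_code.co_name] += 1

def sample_calls(func, interval=0.005):
    """Run func() while sampling the interrupted frame every `interval`
    seconds of CPU time consumed by this process."""
    signal.signal(signal.SIGPROF, _on_tick)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        # Disarm the timer; the handler stays installed but will not fire.
        signal.setitimer(signal.ITIMER_PROF, 0)

def busy_loop():
    # Burn roughly 0.3 seconds of CPU time so the timer has something to hit.
    deadline = time.process_time() + 0.3
    while time.process_time() < deadline:
        pass

sample_calls(busy_loop)
print(samples.most_common(3))
```

Note that busy_loop itself is untouched: nothing was added to it to make it measurable, which is the property that lets a sampling profiler observe code it did not build.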
The real power of perf comes from its ability to leverage hardware performance counters. These are special CPU registers that count events such as cache misses, branch mispredictions, or plain cycles. By telling perf to sample on a specific event (e.g., perf record -e cycles python my_script.py), you can understand performance bottlenecks at a much lower level than per-function timings. For instance, if perf report shows a lot of time in a function and a second run with perf record -e cache-misses attributes a large share of the cache misses to that same function, the problem is likely the function’s memory access pattern rather than the amount of work it does.
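If you find yourself recording the same workload under several events, a small wrapper can keep the runs consistent. The sketch below only assembles the command lines; the helper names (perf_record_argv, record_if_available) and the output file names are assumptions made for this example, and the only perf options used are -e, -o, and the -- separator.

```python
import shutil
import subprocess

def perf_record_argv(event, command, output="perf.data"):
    """Build the argv for: perf record -e EVENT -o OUTPUT -- COMMAND..."""
    return ["perf", "record", "-e", event, "-o", output, "--", *command]

def record_if_available(event, command, output="perf.data"):
    """Run perf when it is installed; otherwise just print the command."""
    argv = perf_record_argv(event, command, output)
    if shutil.which("perf") is None:
        print("perf not found; would run:", " ".join(argv))
        return None
    return subprocess.run(argv, check=False).returncode

# One recording per event, each written to its own (hypothetical) data file.
workload = ["python", "my_script.py"]
cycles_cmd = perf_record_argv("cycles", workload, "cycles.data")
misses_cmd = perf_record_argv("cache-misses", workload, "misses.data")
print(" ".join(cycles_cmd))
print(" ".join(misses_cmd))
```

Keeping each event in its own data file makes it easy to open the two reports side by side with perf report -i.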
perf also offers a much richer set of data. gprof primarily gives you flat profiles (total time per function) and call graphs (who called whom). perf can provide both, but it can also attribute events to individual source lines or instructions (perf annotate), show kernel activity alongside user code, and let you browse call graphs interactively in perf report. The perf script command, for example, dumps the individual sampled events as text, which can then be fed into other tools for deeper analysis.
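One common downstream step is collapsing sampled call stacks into the "folded" one-line-per-stack form that flame graph tools consume. The sketch below shows only that collapsing step. It assumes the stacks have already been parsed into lists of function names; it does not parse perf script’s actual text format, and the sample data is invented.

```python
from collections import Counter

def fold_stacks(stacks):
    """Collapse call stacks (outermost frame first) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

# Hypothetical samples: one list of frames per sampled event.
samples = [
    ["main", "medium_function", "slow_function"],
    ["main", "fast_function"],
    ["main", "fast_function"],
    ["main", "medium_function"],
]

for line in fold_stacks(samples):
    print(line)
```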
A common misconception is that perf is only for C/C++ or kernel code. It’s incredibly versatile: you can profile Java applications, Python scripts, shell commands, and even system daemons. The key is understanding which events to sample. For CPU-bound tasks, cycles or instructions are usually good starting points. For I/O-bound tasks, the process spends most of its time off-CPU, so cycles tells you little; software events such as page-faults or context-switches might be more insightful.
The biggest leap in understanding perf is realizing it doesn’t need your source code to do meaningful profiling. It operates at the machine code level, correlating sampled instruction pointers back to symbols: function names come from the binary’s symbol tables, while source files and line numbers require DWARF debug information. This means you can profile binaries you didn’t compile yourself, as long as they haven’t been stripped of their symbols.
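The lookup from instruction pointer to function name boils down to a search over sorted address ranges. The following sketch illustrates the idea with a binary search; the symbol table, addresses, and sizes are made up for the example, and a real tool would read them from the binary rather than hard-code them.

```python
import bisect

# Hypothetical symbol table: (start address, size in bytes, name),
# sorted by start address.
SYMBOLS = [
    (0x1000, 0x40, "slow_function"),
    (0x1040, 0x80, "medium_function"),
    (0x10C0, 0x60, "fast_function"),
    (0x1120, 0x90, "main"),
]
STARTS = [start for start, _size, _name in SYMBOLS]

def resolve(ip):
    """Map a sampled instruction pointer to 'name+0xoffset' or '[unknown]'."""
    # Find the last symbol whose start address is <= ip.
    index = bisect.bisect_right(STARTS, ip) - 1
    if index < 0:
        return "[unknown]"
    start, size, name = SYMBOLS[index]
    if ip >= start + size:
        # The address falls in a gap or past the last symbol.
        return "[unknown]"
    return f"{name}+{ip - start:#x}"

for ip in (0x1044, 0x10C0, 0x2000):
    print(hex(ip), "->", resolve(ip))
```

When a binary has been stripped, the table is empty or sparse and every lookup ends up in the "[unknown]" branch, which is why a profile of such a binary shows raw addresses instead of names.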
When you’re done profiling and want to clean up any generated files, you’d typically remove perf.data and gmon.out (if you used gprof).
The next hurdle after mastering basic perf usage is often understanding and effectively using hardware event filtering for fine-grained analysis.