CPU profiling isn’t about finding slow code; it’s about finding code that’s unnecessarily consuming CPU, often in places you’d least expect.

Let’s see it in action. Imagine we have a simple Python web application that’s getting sluggish. We’ll use perf on Linux to get a bird’s-eye view.

First, we need the process ID (PID) of our application. A Flask app typically runs under a WSGI server such as Gunicorn, where a master process supervises the worker processes that actually handle requests, so the PID to profile is a worker’s. Let’s assume the worker we care about has PID 12345.

Now, we can start collecting CPU samples:

sudo perf record -p 12345 -g --call-graph dwarf
  • sudo: Profiling often requires elevated privileges to access kernel data.
  • -p 12345: This tells perf to attach to the process with PID 12345.
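  • -g: This flag enables call graph recording, which lets us see not just which functions are consuming CPU, but how they were called. On its own it uses the default unwinding method (frame pointers). It is implied by --call-graph, so it is redundant here but harmless.
  • --call-graph dwarf: This selects the method used to collect call graphs. dwarf unwinds the stack using the binary’s DWARF debug information, which helps when the interpreter and its C extensions were compiled without frame pointers. The default method, fp, relies on frame pointers being present.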

After letting this run for a minute or two while the application is under load, we stop the collection by pressing Ctrl+C. perf will create a perf.data file in the current directory.
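If stopping the collection by hand is inconvenient, a common pattern is to append a short-lived command such as sleep after "--", so that perf record stops when that command exits. The sketch below only builds and prints such a command line; the helper name perf_record_cmd and its default values are hypothetical, chosen for illustration.

```python
import shlex


def perf_record_cmd(pid, seconds, freq_hz=99, call_graph="dwarf", output="perf.data"):
    """Build a perf record command that profiles `pid` for `seconds` seconds.

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    return [
        "perf", "record",
        "-F", str(freq_hz),          # sampling frequency in Hz
        "-p", str(pid),              # attach to an existing process
        "--call-graph", call_graph,  # stack unwinding method
        "-o", output,                # where to write the samples
        "--", "sleep", str(seconds), # recording ends when sleep exits
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(["sudo"] + perf_record_cmd(12345, 60)))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once perf is known to be installed on the machine.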

Next, we analyze the collected data:

sudo perf report

This opens an interactive TUI listing functions sorted by their share of the CPU samples. Because the data was recorded with call graphs, there are two percentage columns: "Children" is the inclusive figure, covering samples in the function and in everything it called, while "Self" covers only samples taken in the function’s own code. The "Shared Object" column names the binary or library the function lives in, and the "Symbol" column shows the function name.
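To make the difference between a self figure and an inclusive (function plus callees) figure concrete, here is a small sketch that derives both from a list of sampled stacks. The stacks and function names (main, handle_request, render, query_db) are made up for illustration, and this shows the idea rather than perf’s actual implementation.

```python
from collections import Counter


def self_and_inclusive(samples):
    """Return {function: (inclusive_pct, self_pct)} for a list of stacks.

    Each stack is a tuple ordered from outermost caller to innermost frame.
    """
    self_counts = Counter()
    inclusive_counts = Counter()
    for stack in samples:
        self_counts[stack[-1]] += 1      # only the innermost frame was executing
        for name in set(stack):          # count each function once per sample
            inclusive_counts[name] += 1
    total = len(samples)
    return {
        name: (100 * inclusive_counts[name] / total, 100 * self_counts[name] / total)
        for name in inclusive_counts
    }


if __name__ == "__main__":
    # Hypothetical samples: 6 in render, 3 in query_db, 1 in handle_request itself.
    samples = (
        [("main", "handle_request", "render")] * 6
        + [("main", "handle_request", "query_db")] * 3
        + [("main", "handle_request")] * 1
    )
    for name, (inclusive, own) in sorted(self_and_inclusive(samples).items()):
        print(f"{name:15s} inclusive={inclusive:5.1f}%  self={own:5.1f}%")
```

In this made-up data, handle_request has a high inclusive figure but a low self figure, which is the typical signature of a function that is expensive only because of what it calls.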

The real magic happens when you navigate this report. Use the arrow keys to select a function and press Enter to expand its call chain. Depending on the call-graph order (caller or callee, which perf report -g lets you choose), the expanded entries are either the functions it called or the functions that called it. This is how you trace the execution path.

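Let’s say perf report shows a high percentage for _PyEval_EvalFrameDefault. This function is the heart of CPython’s bytecode evaluation loop, so a hotspot here means the interpreter is spending a lot of time executing Python bytecode. By default perf only sees native (C-level) frames, so this symbol alone does not say which Python function is responsible. On CPython 3.12 and later, starting the interpreter with -X perf makes Python function names visible to perf (this works best with an interpreter built with frame pointers), and the call graph can then reveal that a specific part of our application code is driving the interpreter loop.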

The problem this solves is identifying performance bottlenecks that are not obvious from static code analysis or simple timing. It allows us to pinpoint the exact lines of code, or more importantly, the paths through the code, that are consuming the most CPU cycles. This is essential for optimizing applications where even small percentage gains can have a significant impact on responsiveness and resource utilization.

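Internally, perf works by periodically interrupting the CPU, typically when a hardware counter such as the cycle counter overflows, and recording its current state: the program counter (PC), registers, and stack. Aggregating these samples over time gives a statistical profile of where the CPU spent its time. With call graphs enabled, each sample also captures the call stack: frame-pointer mode walks the stack at sample time, while dwarf mode copies a chunk of the user stack into perf.data and unwinds it later, during perf report. Either way, the result is the call chain leading to each sampled instruction.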
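The sampling idea can be imitated in pure Python. The sketch below is a toy analogy, not how perf is implemented: a background thread wakes up at a fixed interval, grabs the current stack of the target thread via sys._current_frames(), and counts how often each stack was seen. The names sample_stacks, busy_work, and profile are hypothetical.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval_s, stop_event, counts):
    """Periodically record the call stack of one thread."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:              # walk from innermost frame outward
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            counts[tuple(reversed(stack))] += 1   # key: outermost -> innermost
        time.sleep(interval_s)


def busy_work(duration_s):
    """Hypothetical hotspot: burn CPU for roughly duration_s seconds."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += sum(i * i for i in range(1000))
    return total


def profile(func, *args, interval_s=0.005):
    """Run func(*args) in the calling thread while a sampler thread watches it."""
    counts = collections.Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval_s, stop_event, counts),
        daemon=True,
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts


if __name__ == "__main__":
    counts = profile(busy_work, 0.5)
    total = sum(counts.values())
    for stack, n in counts.most_common(5):
        print(f"{100 * n / total:5.1f}%  {' -> '.join(stack)}")
```

Stacks that show up in many samples are where the thread spent most of its time. The exact percentages vary from run to run, which is the nature of a statistical profile.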

The levers you control are primarily the sampling frequency, the event being sampled (CPU cycles, cache misses, etc.), and the method of call graph collection. For example, perf record -e cycles:P -p 12345 -g --call-graph dwarf samples on CPU cycles, a common and effective choice for CPU-bound issues; the :P modifier asks for the most precise sampling the hardware supports. You can also sample on other events such as cache-misses or branch-misses to understand different performance characteristics.

One aspect often overlooked is scope and cost. perf can sample the entire system (perf record -a); attaching to a specific PID restricts sampling to that process. Either way, sampling has overhead of its own, especially with dwarf call graphs, where every sample carries a copy of part of the stack and perf.data grows quickly. It’s a trade-off: more data means more overhead. For extremely latency-sensitive applications, you might need to lower the sampling frequency (-F 99 for 99 Hz, for example) or focus on specific events.
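A quick back-of-the-envelope calculation shows how the sampling frequency drives data volume. The sketch below assumes that each dwarf sample carries a stack dump of up to 8192 bytes (to my knowledge the documented default for --call-graph dwarf, but treat it as an assumption) and that the process keeps busy_cpus cores fully busy. The function name estimate_recording is hypothetical, and the result is a rough upper bound for the stack dumps alone, not a prediction of the final perf.data size.

```python
def estimate_recording(freq_hz, seconds, busy_cpus=1.0, stack_dump_bytes=8192):
    """Return (sample_count, stack_dump_bytes_total) for a recording.

    Samples are only taken while the process is on a CPU, so busy_cpus is the
    average number of cores it keeps busy (an assumption supplied by the caller).
    """
    samples = int(freq_hz * seconds * busy_cpus)
    return samples, samples * stack_dump_bytes


if __name__ == "__main__":
    for freq in (99, 999, 4000):
        samples, size = estimate_recording(freq, 60)
        print(f"{freq:>5} Hz: ~{samples} samples, ~{size / 2**20:.0f} MiB of stack dumps")
```

At 99 Hz for one minute on a single busy core this gives 5940 samples, which is usually plenty to identify a dominant hotspot while keeping the recording small.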

The next step after identifying a CPU hotspot is usually to investigate the specific code path and consider algorithmic improvements or, if it’s in a library, whether a more optimized alternative exists.
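As an illustration of what an algorithmic improvement can look like, suppose the hot path turned out to be a de-duplication helper in a request handler. Both functions below are hypothetical and not taken from any real application: the first checks membership in a list, which scans the list on every lookup, and the second tracks seen items in a set while preserving the original order.

```python
def unique_slow(items):
    """Order-preserving de-duplication with a quadratic worst case."""
    seen = []
    for item in items:
        if item not in seen:      # linear scan of the list on every iteration
            seen.append(item)
    return seen


def unique_fast(items):
    """Same result, but membership checks use a set (average constant time)."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


if __name__ == "__main__":
    import timeit

    data = list(range(2000)) * 2
    assert unique_slow(data) == unique_fast(data)
    print("slow:", timeit.timeit(lambda: unique_slow(data), number=3))
    print("fast:", timeit.timeit(lambda: unique_fast(data), number=3))
```

After a change like this, recording a fresh profile under the same load confirms whether the hotspot actually shrank or merely moved elsewhere.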
