Flame graphs are one of the fastest ways to find CPU bottlenecks, but they are easy to misread if you don’t understand what the width of a frame actually means.
Let’s see what perf and Flame Graphs can do for you. Imagine you have a web server that’s suddenly slow. You suspect a CPU issue.
First, you need to collect data. This command samples your running process (let’s say it’s nginx with PID 12345) 100 times per second for 30 seconds:
sudo perf record -F 100 -p 12345 -g -- sleep 30
- -F 100: This sets the sampling frequency to 100 Hz. perf will interrupt the CPU 100 times every second to see what it’s doing.
- -p 12345: This targets the specific process ID you want to profile.
- -g: This is crucial! It tells perf to record the call stack (the sequence of functions that led to the current execution point). This is what allows for flame graphs.
- -- sleep 30: This makes perf record run for exactly 30 seconds.
After perf record finishes, you’ll have a perf.data file in the current directory. Now, you need to convert it into text that the FlameGraph scripts can process. The perf script command does this:
perf script > out.perf
This out.perf file is a text representation of the sampled events and their call stacks. To turn it into a flame graph, you’ll need the FlameGraph scripts from Brendan Gregg’s GitHub repository. flamegraph.pl does not read raw perf script output; it expects folded stacks, so run the file through stackcollapse-perf.pl first:
stackcollapse-perf.pl out.perf | flamegraph.pl > cpu-flame-graph.svg
This generates an SVG file that you can open in any web browser.
Here’s what you’re looking at:
- Each bar is a function.
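- The width of a bar is proportional to the number of samples in which that function was on the stack. This is the key takeaway. A wide bar means that function (and its children) consumed a lot of CPU.
- The x-axis spans the full set of samples, with stacks sorted alphabetically so that identical frames merge. It does not show the passage of time from left to right.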
- The y-axis represents the call stack. Functions higher up on the stack are called by functions below them.
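To make the meaning of width concrete, here is a minimal Python sketch that computes, for each function, the fraction of samples in which it appears anywhere on the stack. The folded-stack data and function names are invented for illustration, and the helper inclusive_widths is hypothetical, not part of the FlameGraph scripts.

```python
from collections import Counter

# Hypothetical folded stacks ("root;...;leaf count"), invented for illustration.
FOLDED = """\
main;handle_request;parse_json 30
main;handle_request;render 10
main;handle_request 5
main;idle_loop 5
"""

def inclusive_widths(folded_text):
    """Return {function: fraction of all samples with that function on the stack}."""
    inclusive = Counter()
    total = 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        # The sample count is the last space-separated field.
        stack, count = line.rsplit(" ", 1)
        count = int(count)
        total += count
        # set() so a recursive function is counted once per sample.
        for func in set(stack.split(";")):
            inclusive[func] += count
    return {func: n / total for func, n in inclusive.items()}

widths = inclusive_widths(FOLDED)
for func, share in sorted(widths.items(), key=lambda kv: -kv[1]):
    print(f"{func:16s} {share:6.1%}")
```

Note that a real flame graph draws one bar per call path, so a function reached through several paths shows up as several bars. This sketch sums them per function name.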
Let’s say you see a wide bar labeled my_app_process_request. This means that function itself, and any functions it calls, are responsible for a significant chunk of the CPU time observed. If directly above it is a json_parse_string bar that covers most of that width, then json_parse_string is a major contributor to the work done by my_app_process_request. Any part of the parent bar with nothing above it is time spent in my_app_process_request itself.
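The parent and child relationship can be quantified. The sketch below, using made-up sample counts and a hypothetical helper named child_share, computes what fraction of the parent’s samples pass directly through a given child.

```python
# Hypothetical folded stacks; the counts are invented for illustration.
FOLDED = """\
main;my_app_process_request;json_parse_string 60
main;my_app_process_request;write_response 15
main;my_app_process_request 5
main;accept_loop 20
"""

def child_share(folded_text, parent, child):
    """Fraction of parent's samples in which child is the frame directly above it."""
    parent_samples = 0
    child_samples = 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        count = int(count)
        if parent not in frames:
            continue
        parent_samples += count
        # True if parent is immediately followed by child anywhere in the stack.
        if any(a == parent and b == child for a, b in zip(frames, frames[1:])):
            child_samples += count
    return child_samples / parent_samples if parent_samples else 0.0

share = child_share(FOLDED, "my_app_process_request", "json_parse_string")
print(f"json_parse_string accounts for {share:.0%} of my_app_process_request")
```

With these invented numbers the parent has 80 samples, 60 of which go through json_parse_string, so the child bar would cover three quarters of the parent’s width.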
You can also skip the intermediate file and run the whole pipeline in one go. stackcollapse-perf.pl produces "folded" stacks, which are the input format for flamegraph.pl:
perf script | stackcollapse-perf.pl | flamegraph.pl --hash > cpu-flame-graph.svg
The stackcollapse-perf.pl script takes the raw perf script output and aggregates identical call stacks into one line per unique stack, followed by a sample count. The --hash option picks each bar’s color from a hash of the function name, so a given function gets the same color on every run, which makes graphs easier to compare. Without it, the warm colors are assigned randomly and carry no meaning.
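The folding step is simple enough to sketch in a few lines of Python. This is an illustration of the idea only, not a reimplementation of stackcollapse-perf.pl: it assumes the samples have already been parsed into lists of function names ordered root first, and both the data and the fold helper are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed samples: each is a call stack listed root first.
SAMPLES = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "render"],
    ["main", "idle_loop"],
    ["main", "handle_request", "parse_json"],
]

def fold(samples):
    """Collapse identical stacks into 'frame;frame;frame count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

folded_lines = fold(SAMPLES)
print("\n".join(folded_lines))
```

Five samples collapse into three lines here. On a real profile with thousands of samples the reduction is much larger, because hot code paths repeat constantly.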
The mental model is simple: you’re seeing a "stack" of CPU usage. If you zoom in on a wide section, you’re looking at a CPU hotspot. The width is king. The most common mistake is thinking the x-axis represents time directly, or that the height means "more important." It’s the width that screams "I’m using a lot of CPU."
If you’re profiling a multi-threaded application, you might notice more samples than a single core could produce: at -F 100 for 30 seconds, one fully busy core yields about 3,000 samples, and four busy cores yield about 12,000. This is normal. Flame graphs aggregate on-CPU samples from every thread, so the full width of the graph represents total CPU time consumed, not wall-clock time. A single core can only do one thing at a time, but the graph sums the work done across all of them.
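The relationship between sample counts, sampling frequency, and busy cores is plain arithmetic. The sketch below uses the 100 Hz and 30 second values from the perf record command above. The helper names are hypothetical, and the 4,500-sample figure is an invented example.

```python
def max_samples(freq_hz, seconds, busy_cpus):
    """Upper bound on samples when busy_cpus cores are fully occupied."""
    return freq_hz * seconds * busy_cpus

def cpu_seconds(samples, freq_hz):
    """Approximate on-CPU time represented by a number of samples."""
    return samples / freq_hz

single = max_samples(100, 30, 1)   # one fully busy core
four = max_samples(100, 30, 4)     # four fully busy cores
busy = cpu_seconds(4500, 100)      # CPU time behind 4,500 samples
print(single, four, busy)
```

If a 30 second profile contains 4,500 samples at 100 Hz, the process used roughly 45 CPU-seconds, which means about one and a half cores were busy on average.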
When you analyze a flame graph, you want to look for wide, flat-topped functions. These are your primary suspects for performance issues. If you see a wide function that you don’t recognize, it’s likely a library call that’s consuming CPU. If it’s a function within your own codebase, that’s where you’ll focus your optimization efforts.
A common pattern is to see a wide kernel function like do_sys_open or tcp_sendmsg. This indicates that your application is burning a lot of CPU time in the kernel on I/O-related system calls. Keep in mind that an on-CPU profile only shows time spent executing; time spent blocked waiting for a disk or the network does not appear. Your next step would then be to investigate your application’s I/O patterns, perhaps using tools like iotop or by examining your application’s file or network operations.
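One way to put a number on kernel time is to measure how many samples end in a kernel frame. The sketch below assumes the folded stacks were produced with stackcollapse-perf.pl --kernel, which annotates kernel frames with a _[k] suffix. The stacks, the counts, and the kernel_share helper are all invented for illustration.

```python
# Hypothetical folded stacks; the "_[k]" suffix marks kernel frames
# (assumed to come from running stackcollapse-perf.pl with --kernel).
FOLDED = """\
server;main;worker_loop;handle_request 50
server;main;worker_loop;send_response;entry_SYSCALL_64_[k];tcp_sendmsg_[k] 30
server;main;worker_loop;open_file;entry_SYSCALL_64_[k];do_sys_open_[k] 20
"""

def kernel_share(folded_text, marker="_[k]"):
    """Fraction of samples whose leaf (currently executing) frame is in the kernel."""
    kernel = 0
    total = 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        count = int(count)
        total += count
        # The leaf frame is the last one, so check the end of the stack string.
        if stack.endswith(marker):
            kernel += count
    return kernel / total if total else 0.0

share = kernel_share(FOLDED)
print(f"kernel share of on-CPU samples: {share:.0%}")
```

A high share suggests looking at system call patterns before tuning user-space code.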
The real power comes when you start correlating these flame graphs with specific events. For example, you can record block I/O tracepoints (perf record -e 'block:*' ...), kernel memory allocation (perf record -e 'kmem:*' ...), or even specific system calls. In those graphs the width counts occurrences of the event rather than CPU samples. This allows you to build a comprehensive picture of where your application is spending its time, not just on the CPU, but also waiting for or interacting with other system resources.
If you see a very wide bar at the top of a tower, with nothing stacked above it, that function is a leaf in the call stack and is directly consuming CPU time. This is often where you’ll find the most computationally intensive parts of your code. Wide bars at the bottom of the graph, by contrast, are entry points such as main, and are wide only because everything else runs beneath them.
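Leaf time, often called self time, can be read straight out of folded stacks by counting only the last frame of each stack. The following is a minimal sketch with invented data and a hypothetical self_samples helper.

```python
from collections import Counter

# Hypothetical folded stacks ("root;...;leaf count"), invented for illustration.
FOLDED = """\
main;handle_request;parse_json 30
main;handle_request;render 10
main;handle_request 5
main;idle_loop 5
"""

def self_samples(folded_text):
    """Count samples per function where that function was the leaf frame."""
    self_counts = Counter()
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        # The leaf is whatever follows the last ';' (or the whole stack).
        leaf = stack.rsplit(";", 1)[-1]
        self_counts[leaf] += int(count)
    return self_counts

counts = self_samples(FOLDED)
for func, n in counts.most_common():
    print(f"{func:16s} {n}")
```

Here main never appears as a leaf, so it has no self time even though every sample passes through it. parse_json has the most and would be the first place to look.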
The next step after identifying CPU hotspots is often to investigate why those functions are consuming so much CPU, which might involve looking at the specific arguments they are being called with or the data they are processing.