Flame graphs are a visualization technique that shows where your application is spending its time during execution, but the most surprising thing is how often they reveal that the problem isn’t where you’d expect it to be.

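Let’s see one in action. Imagine you’re profiling a web server. You run perf record -F 99 -a -g -- sleep 30 to sample every CPU for 30 seconds (without -a or -p <pid>, perf would profile only the sleep command itself), then perf script > out.perf to dump the samples as text, and finally stackcollapse-perf.pl out.perf | flamegraph.pl > flamegraph.svg to render them. The two .pl scripts come from Brendan Gregg’s FlameGraph repository.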

This generates an SVG file that you can open in a web browser. It looks like a stack of blocks.

<svg width="100%" height="100%" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <rect width="100%" height="100%" fill="#f0f0f0"/>
  <g transform="translate(40,40) scale(1)">
    <text x="0" y="-10" font-family="sans-serif" font-size="14">root</text>
    <g transform="translate(0,60)">
      <rect width="500" height="20" fill="#a0a0a0"/>
      <text x="5" y="15" font-family="sans-serif" font-size="10" fill="#ffffff">syscall_read</text>
    </g>
    <g transform="translate(0,40)">
      <rect width="300" height="20" fill="#b0b0a0"/>
      <text x="5" y="15" font-family="sans-serif" font-size="10" fill="#ffffff">kernel_context_switch</text>
    </g>
    <g transform="translate(0,20)">
      <rect width="100" height="20" fill="#c0c0b0"/>
      <text x="5" y="15" font-family="sans-serif" font-size="10" fill="#ffffff">my_app_request_handler</text>
    </g>
    <g transform="translate(0,0)">
      <rect width="50" height="20" fill="#d0d0c0"/>
      <text x="5" y="15" font-family="sans-serif" font-size="10" fill="#ffffff">my_app_parse_json</text>
    </g>
  </g>
</svg>

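In this simplified example, the widest block, syscall_read, accounts for the most samples. A flame graph stacks callees on top of callers: each block sits directly above the function that called it. Reading upward, syscall_read called kernel_context_switch, which called my_app_request_handler, which called my_app_parse_json. The width of each block is proportional to the number of samples in which that function was on the stack, which approximates the total time spent in the function and its descendants. The left-to-right order carries no timing information; flamegraph.pl sorts sibling frames alphabetically.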
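The width rule can be made concrete with a small Python sketch. The sample counts below are invented so that the totals match the block widths in the figure (500, 300, 100, and 50), and inclusive_counts is a hypothetical helper, not part of the FlameGraph tools.

```python
from collections import defaultdict

def inclusive_counts(folded):
    """Return samples per frame path, counting the frame and all its descendants.

    `folded` maps a semicolon-separated stack (root first) to its sample count.
    """
    totals = defaultdict(int)
    for stack, count in folded.items():
        frames = stack.split(";")
        # A sample counts toward every frame on its stack, i.e. every prefix.
        for depth in range(1, len(frames) + 1):
            totals[";".join(frames[:depth])] += count
    return dict(totals)

# Hypothetical data: each count is how often the stack ended exactly at that frame.
folded = {
    "syscall_read": 200,
    "syscall_read;kernel_context_switch": 200,
    "syscall_read;kernel_context_switch;my_app_request_handler": 50,
    "syscall_read;kernel_context_switch;my_app_request_handler;my_app_parse_json": 50,
}

widths = inclusive_counts(folded)
for path, samples in widths.items():
    depth = path.count(";")
    name = path.split(";")[-1]
    print(f"{'  ' * depth}{name}: {samples} samples")
```

A frame can never be wider than the frame beneath it, because every sample that includes the callee also includes its caller.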

The core problem flame graphs solve is that traditional profilers often give you a flat list of functions and their self-time. This tells you which functions are hot, but not how the program got there. You might see my_app_process_data taking 50% of CPU, but which callers are responsible, and is the work spread across many call sites or concentrated in one? Flame graphs, by showing the call stack context, immediately point you to the hot path: the sequence of calls that consumes the most resources. Note that a profile collected this way shows on-CPU time only. Time a thread spends blocked on I/O or waiting to be scheduled produces no samples, so it does not appear in the graph.
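To see what the flat view loses, consider a sketch in which the same function is reached through two different callers. The stacks and counts are hypothetical, and my_app_batch_job is an invented name used only for contrast.

```python
from collections import Counter

# Hypothetical folded stacks (root first) with sample counts.
folded = {
    "main;my_app_request_handler;my_app_process_data": 10,
    "main;my_app_batch_job;my_app_process_data": 40,
    "main;my_app_request_handler": 50,
}

def flat_profile(folded):
    """Self-time view: credit each sample only to the function on top of the stack."""
    flat = Counter()
    for stack, count in folded.items():
        flat[stack.split(";")[-1]] += count
    return flat

def callers_of(folded, function):
    """Stack-aware view: split a function's self samples by its immediate caller."""
    by_caller = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        if frames[-1] == function and len(frames) > 1:
            by_caller[frames[-2]] += count
    return by_caller

flat = flat_profile(folded)
callers = callers_of(folded, "my_app_process_data")
print(flat["my_app_process_data"])  # self samples, with no hint of who called it
print(dict(callers))                # the same samples broken down by caller
```

The flat profile reports my_app_process_data at half of all samples, while the stack-aware view shows that most of those samples arrive through one caller.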

Here’s how it works internally. The perf tool interrupts the profiled CPUs at a fixed rate (99 times per second here; the odd number avoids sampling in lockstep with timers that fire at round frequencies). Because of the -g flag, each sample records the entire call stack at that moment, not just the instruction pointer. The stackcollapse-perf.pl script processes these samples, aggregating them by their call stacks. For instance, if it sees 100 samples where the stack is syscall_read -> kernel_context_switch -> my_app_request_handler, it emits that stack once with a count of 100. The flamegraph.pl script then takes these aggregated stacks and draws them as rectangles, where width represents the count of samples.
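The aggregation step can be sketched in a few lines of Python. This is a minimal illustration, not the actual stackcollapse-perf.pl logic: it assumes each sample has already been parsed into a list of frame names ordered root first, and the sample data is made up. The output uses the semicolon-separated folded format that flamegraph.pl reads, one line per unique stack followed by its sample count.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse a list of call stacks into folded lines with sample counts.

    Each sample is a list of frame names ordered root first.
    """
    counts = Counter(";".join(stack) for stack in samples)
    # Sort for stable output: one "frame;frame;frame count" line per unique stack.
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]

# Hypothetical samples, reusing the frame names from the figure.
samples = (
    [["syscall_read", "kernel_context_switch", "my_app_request_handler"]] * 100
    + [["syscall_read", "kernel_context_switch"]] * 40
    + [["syscall_read"]] * 25
)

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Folding discards the order in which samples arrived, which is why a flame graph shows where time went but not when.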

The key levers you control are the sampling frequency (-F 99 in perf record) and the duration of the profiling run. A higher frequency gives finer detail but adds overhead and generates more data. A longer run captures more typical behavior, but because samples are aggregated over the whole run, a short-lived spike can be diluted until it is barely visible. You can also restrict perf to a specific process (-p <pid>) or sample on a different event (-e <event>), such as cache misses instead of CPU cycles.
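As a rough guide to how frequency and duration interact, the number of samples is at most the frequency times the duration times the number of busy CPUs. The sketch below is back-of-the-envelope arithmetic with hypothetical helper names; it assumes every profiled CPU is busy for the whole run, so real runs on partly idle machines will collect fewer samples.

```python
def expected_samples(frequency_hz, duration_s, busy_cpus=1):
    """Upper bound on the sample count for a timed profiling run."""
    return frequency_hz * duration_s * busy_cpus

def sample_resolution_pct(total_samples):
    """Smallest share of the profile that a single sample can represent, in percent."""
    return 100.0 / total_samples

total = expected_samples(frequency_hz=99, duration_s=30)
print(total)                                   # 2970
print(round(sample_resolution_pct(total), 3))  # 0.034
```

With about 3,000 samples per busy CPU, a function that accounts for a fraction of a percent of the run shows up as only a handful of samples, so very thin frames should be read with caution.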

What most people miss is that the "top" function in a flame graph isn’t necessarily the one you need to optimize. Every frame’s width includes its descendants, so the blocks at the bottom are always the widest; what matters is where that width goes as you move up the stack. If my_app_parse_json is a narrow sliver at the very top, but its parent my_app_request_handler is enormous, and its parent kernel_context_switch is even bigger, then optimizing my_app_parse_json will barely change the total, because most of the samples are spent in the frames below it or in their other callees. Always trace the widest path through the graph and look for the frame whose width is not explained by the frames above it.
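Tracing the widest path can be made mechanical, as in the sketch below. It is a hypothetical helper, not a feature of flamegraph.pl: starting from the root, it repeatedly steps into the callee with the most samples and reports, for each frame, how many samples it covers in total and how many are its own rather than passed on to callees. The counts are invented and the code assumes a single root frame.

```python
from collections import defaultdict

def widest_path(folded):
    """Follow the widest callee from the root; return (frame, inclusive, self) per step."""
    inclusive = defaultdict(int)   # samples in a frame path and everything it calls
    self_count = defaultdict(int)  # samples where the frame path was on top of the stack
    children = defaultdict(set)    # frame path -> paths of its direct callees
    for stack, count in folded.items():
        frames = stack.split(";")
        self_count[stack] += count
        for depth in range(1, len(frames) + 1):
            path = ";".join(frames[:depth])
            inclusive[path] += count
            if depth > 1:
                children[";".join(frames[:depth - 1])].add(path)

    # Assumes a single root frame, as in the example data below.
    roots = [p for p in inclusive if ";" not in p]
    path = max(roots, key=lambda p: inclusive[p])
    steps = []
    while True:
        steps.append((path.split(";")[-1], inclusive[path], self_count[path]))
        if not children[path]:
            return steps
        path = max(children[path], key=lambda p: inclusive[p])

# Hypothetical counts: my_app_parse_json is a sliver, its ancestors are wide.
folded = {
    "syscall_read;kernel_context_switch": 600,
    "syscall_read;kernel_context_switch;my_app_request_handler": 350,
    "syscall_read;kernel_context_switch;my_app_request_handler;my_app_parse_json": 50,
}

steps = widest_path(folded)
for name, total, own in steps:
    print(f"{name}: {total} samples total, {own} in the frame itself")
```

In this made-up data, my_app_parse_json holds 50 of 1,000 samples, so even removing it entirely would save at most 5%; the frames with large self counts are where the time is actually going.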

Once you’ve identified a hot path, the next step is understanding the underlying cause of that path’s slowness, which often leads to exploring kernel events or specific I/O patterns.
