Performance Profiling: Beyond CPU Usage

Flame graphs are a visualization technique that helps you understand where your CPU time is being spent in a program.

Let’s see one in action. Imagine you’re running a Java application and you want to see what’s hogging the CPU.

First, you’d attach async-profiler to your running Java process. Let’s say the process ID (PID) is 12345.

./profiler.sh start -e cpu 12345

This command starts sampling CPU usage in the JVM (cpu is also the default event), capturing stacks from whichever threads are running. After a minute or two, you’d stop it and generate a flame graph:

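./profiler.sh stop -f flamegraph.html 12345

This writes the results to flamegraph.html (the -f option names the output file, and recent async-profiler versions pick the flame graph format from the .html extension). When you open this in a web browser, you’ll see something like this: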

      +---------------------------+
      |           loop            |
      +---------------------------+---------+
      |               doWork                |
      +-------------------------------------+-----------+
      |                MyApplication.run                |
      +-------------------------------------------------+
        <-------------------- CPU Time ----------------->

Each bar represents a function (a stack frame), and its width shows the share of CPU samples in which that function was on the stack. The vertical stacking shows the call stack: a function sits on top of the function that called it. The wider the bar, the more CPU time that function and everything it calls is using. In this simplified example, MyApplication.run is the outermost caller, much of its time is spent in doWork, and doWork in turn spends most of its time in loop.
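To make the width rule concrete, here is a minimal Python sketch of how raw samples could be turned into bar widths. The sample data is made up for illustration (it reuses the function names from the example above), and the "a;b;c count" folded format is the one commonly fed to flame graph generators; this is not async-profiler’s actual code.

```python
from collections import Counter

# Hypothetical samples: each entry is the call stack seen at one sampling
# tick, listed from the outermost caller to the function on the CPU.
samples = [
    ["MyApplication.run", "doWork", "loop"],
    ["MyApplication.run", "doWork", "loop"],
    ["MyApplication.run", "doWork", "loop"],
    ["MyApplication.run", "doWork"],
    ["MyApplication.run"],
]

def fold(samples):
    """Count identical stacks; each key is one 'a;b;c' folded stack."""
    return Counter(";".join(stack) for stack in samples)

def frame_widths(samples):
    """Bar width = number of samples whose stack starts with that call path."""
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[tuple(stack[:depth])] += 1
    return widths

folded = fold(samples)
widths = frame_widths(samples)
total = len(samples)

for stack, count in sorted(folded.items()):
    print(stack, count)

# Print bars from the root upward, as a share of all samples.
for path, count in sorted(widths.items(), key=lambda kv: len(kv[0])):
    print(f"{path[-1]:<20} {count / total:.0%}")
```

With these five samples, MyApplication.run spans the full width, doWork covers four fifths of it, and loop three fifths, which is the shape of the diagram above.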

This visualization is useful because it maps CPU usage directly to code. You can quickly identify the "hot spots", the parts of your code that consume the most CPU resources. Instead of wading through raw logs or complex metrics, you see the problem visually.

The core problem flame graphs solve is making performance profiling accessible and intuitive. Traditional profilers often dump mountains of data, requiring significant effort to interpret. Flame graphs, by their visual nature, highlight the most impactful areas immediately.

Internally, sampling profilers work by periodically recording the call stack of running threads. perf uses the Linux kernel’s perf_events subsystem: it interrupts the CPU at regular intervals and records the current instruction pointer and the call stack leading to it. async-profiler does something similar for Java. It attaches to the JVM as a JVM TI (Tool Interface) agent, but it collects stack traces through HotSpot’s AsyncGetCallTrace API combined with perf_events, which keeps overhead low and avoids waiting for JVM safepoints.
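The sampling idea itself is simple enough to sketch in a few lines of Python. The example below is only an illustration of the principle: a background thread wakes up every few milliseconds and records the main thread’s stack using sys._current_frames(). It is not how perf or async-profiler are implemented (they rely on timer or hardware interrupts and signals), and the function names and the 5 ms interval are arbitrary choices.

```python
import sys
import threading
import time
import traceback
from collections import Counter

def sample_thread(thread_id, interval_s, stop_event, counts):
    """Record the target thread's call stack every interval_s seconds."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            # extract_stack lists frames from the outermost caller inward.
            stack = traceback.extract_stack(frame)
            counts[";".join(f.name for f in stack)] += 1
        time.sleep(interval_s)

def busy_loop(deadline):
    """Burn CPU until the deadline so the sampler has something to see."""
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total

def do_work(seconds):
    return busy_loop(time.perf_counter() + seconds)

def profile(func, *args, interval_s=0.005):
    """Run func(*args) while a sampler thread watches the calling thread."""
    counts = Counter()
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_thread,
        args=(threading.get_ident(), interval_s, stop_event, counts),
        daemon=True,
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop_event.set()
        sampler.join()
    return counts

stacks = profile(do_work, 0.3)
for stack, count in stacks.most_common(3):
    print(count, stack)
```

Nearly all samples should end in busy_loop, because that is where the thread spends its time. The exact counts vary from run to run, which is normal for a sampling profiler.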

The key levers you control are:

Sampling Interval: How often the profiler takes a snapshot. A shorter interval gives more detail but incurs more overhead.
Duration: How long you let the profiler run. Longer durations capture more representative behavior but also more data.
Filtering: Most profilers let you focus on specific threads or processes, and on user-space or kernel code. For example, with perf you might run perf record --call-graph dwarf -p <pid> to capture call graphs for a single process (--call-graph already enables stack collection, so a separate -g is not needed).
Profiling Type: Whether you’re profiling CPU (instructions executed, cycles) or other events like cache misses, page faults, or I/O operations.
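The trade-off between sampling interval and duration comes down to how many samples you end up with. The sketch below is a rough back-of-the-envelope calculation, not a feature of any profiler. The 99 Hz rate and 60 second duration are illustrative values, and the error estimate assumes samples are independent (a binomial approximation), which real workloads only roughly satisfy.

```python
import math

def expected_samples(frequency_hz, duration_s, busy_threads=1):
    """Samples collected if each busy thread is sampled at frequency_hz."""
    return frequency_hz * duration_s * busy_threads

def share_standard_error(share, n_samples):
    """Standard error of an observed share, assuming independent samples."""
    return math.sqrt(share * (1 - share) / n_samples)

n = expected_samples(frequency_hz=99, duration_s=60)
print("samples:", n)

for share in (0.50, 0.05, 0.01):
    err = share_standard_error(share, n)
    print(f"true share {share:.0%}: ~{share * n:.0f} samples, std. error {err:.2%}")
```

A function that uses 1% of the CPU only shows up in a few dozen samples here, so its bar width is noisy. Sampling faster or running longer both add samples, but only the first adds overhead per second.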

The "width" of a bar in a flame graph is proportional to the total CPU time spent in that function and all the functions it calls. This is a crucial distinction. A function may do very little work itself, yet its bar will be wide if the functions it calls consume a lot of CPU. The part of a bar with nothing stacked on top of it is the function’s own ("self") time. This means you’re not just looking for functions that are slow themselves, but also for functions that lead to slow execution further down the call stack.
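The difference between self time and total time can be computed directly from folded stacks. The following Python sketch uses invented sample counts, and parse is a hypothetical function added only to give doWork a second child.

```python
from collections import Counter

# Hypothetical folded stacks: "outermost;...;innermost" -> sample count.
folded = {
    "MyApplication.run;doWork;loop": 70,
    "MyApplication.run;doWork;parse": 20,
    "MyApplication.run;doWork": 5,
    "MyApplication.run": 5,
}

def self_and_total(folded):
    """Self = samples where the function was on the CPU; total = on the stack."""
    self_counts, total_counts = Counter(), Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        self_counts[frames[-1]] += count
        # set() so a recursive function is counted once per sample.
        for name in set(frames):
            total_counts[name] += count
    return self_counts, total_counts

self_counts, total_counts = self_and_total(folded)
all_samples = sum(folded.values())

for name in sorted(total_counts, key=total_counts.get, reverse=True):
    total_share = total_counts[name] / all_samples
    self_share = self_counts[name] / all_samples
    print(f"{name:<20} total {total_share:>4.0%}  self {self_share:>4.0%}")
```

Here doWork has a total share of 95% but a self share of only 5%: its bar is wide almost entirely because of loop and parse, so those are the functions worth inspecting first.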

Once you’ve identified a bottleneck with flame graphs, the next step is often to dive deeper into the specific functions that appear wide. This might involve using a more detailed profiler like perf with specific event counters or examining the code itself for algorithmic inefficiencies.