perf can tell you what your CPU is doing, but it can also tell you what your memory subsystem is doing. This matters because modern CPUs often stall while waiting for data to arrive from RAM. perf can show you how much waiting is happening and, more importantly, where in your code it comes from.

Let’s look at a concrete example. We’ll use perf to profile memory access patterns in a simple C program that repeatedly sums a large array.

#include <stdio.h>
#include <stdlib.h>

#define SIZE (1024 * 1024) // 1,048,576 elements; parenthesized so the macro expands safely
#define ITERATIONS 10

int main() {
    int *arr = (int *)malloc(SIZE * sizeof(int));
    if (arr == NULL) {
        perror("malloc failed");
        return 1;
    }

    // Initialize array (simple sequential access)
    for (long i = 0; i < SIZE; ++i) {
        arr[i] = i;
    }

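    // Sum the array repeatedly. The access is sequential, which is the
    // cache-friendly case; volatile keeps the compiler from removing the loop.
    volatile long long sum = 0; // 64-bit: the total does not fit in a 32-bit int
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (long i = 0; i < SIZE; ++i) {
            sum += arr[i];
        }
    }

    printf("Sum: %lld\n", sum);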

    free(arr);
    return 0;
}

Now, let’s profile this with perf. We’re interested in memory-related events. The most fundamental ones are cache misses.

perf record -e cache-misses,cache-references ./memory_profile
perf report

This will give us a breakdown of where the cache-misses and cache-references are occurring. You’ll see percentages of total events attributed to different functions. The key is to look for functions that have a high ratio of cache-misses to cache-references.

The output of perf report will show you something like this (simplified):

  20.00%  memory_profile        [.] main
  15.00%  memory_profile        [.] main
   5.00%  libc-2.31.so          [.] _int_malloc

perf report lists each recorded event separately. Within one event’s listing, a percentage is the share of that event’s samples attributed to the function. A high percentage in main for cache-misses means your application’s own code, rather than a library, is the primary source of misses.

The problem perf helps us diagnose here is that the CPU is spending a lot of time waiting for data from RAM because it’s not finding that data in its caches. This can happen for several reasons, but the most common ones relate to how data is laid out in memory and how it’s accessed.

Common Causes of High Cache Misses:

  1. Poor Data Locality / Strided Access: Accessing memory with large strides (e.g., arr[i * stride]) means you’re likely to fetch a cache line and only use one or two elements before needing the next one, potentially far away in memory. This leads to many cache lines being loaded only to be discarded quickly.

    • Diagnosis: perf record -e cache-misses,cache-references --call-graph dwarf ./your_program then analyze perf report. Look for loops with high miss rates.
    • Fix: Rearrange data structures or access patterns so that consecutive accesses touch adjacent memory. When iterating over a 2D array, make the innermost loop run over the dimension that is contiguous in memory: the column index for row-major layouts (as in C), the row index for column-major layouts (as in Fortran). Our example already accesses the array sequentially, so it does not have this problem; a strided or random access pattern over the same array would.
    • Why it works: Cache lines are blocks of memory (e.g., 64 bytes). When you access arr[i], the entire cache line containing arr[i] and its neighbors is loaded. If you then access arr[i+1], it is likely already in the cache. If you access arr[i + stride] and the stride spans a cache line or more (16 elements for 64-byte lines and 4-byte ints), every access touches a different line and most of each loaded line goes unused.
  2. Cache Thrashing: Frequently accessing more data than can fit into a cache level (L1, L2, L3). When new data is brought in, old data that might be needed soon is evicted.

    • Diagnosis: perf stat -e cache-misses,cache-references ./your_program. Compare the total misses to total references. If the miss rate (cache-misses / cache-references) is very high (e.g., > 20-30%), this is a strong indicator. Note that these generic events usually count last-level cache accesses and misses, though the exact mapping varies by CPU.
    • Fix: Reduce the working set size (the amount of data actively being used). This might involve processing data in smaller chunks, using more compact data structures, or changing algorithms so they touch less memory. Our example rereads the whole array on every iteration, so if the array were larger than the last-level cache, each pass would evict the data the next pass needs.
    • Why it works: By keeping the actively used data smaller than the cache size, you maximize the chances that data you need next is already present.
  3. False Sharing (Multi-threaded applications): When two different threads modify independent variables that happen to reside in the same cache line. The cache coherency protocol will bounce the cache line between processors, causing misses even though the threads aren’t logically contending for the same data.

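    • Diagnosis: perf record -e cache-misses,cache-references --call-graph dwarf ./your_multithreaded_program. Analyze perf report and look for high miss counts in code that writes to data structures shared between threads. Where the hardware supports it, perf c2c record followed by perf c2c report is built for this purpose and reports the contended cache lines directly.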
    • Fix: Pad data structures so that variables frequently modified by different threads are on separate cache lines. For example, if struct { int counter1; int counter2; } is used by two threads, one on counter1 and one on counter2, they might share a cache line. Add padding: struct { int counter1; char padding[64 - sizeof(int)]; int counter2; }.
    • Why it works: Ensures that modifications to one variable don’t invalidate the cache line containing another variable used by a different thread.
  4. Insufficient Memory Bandwidth: Every last-level cache miss has to be served from main memory. If the system’s memory bus is saturated, overall performance is limited by how quickly new data can be fetched rather than by how quickly the CPU can process it.

    • Diagnosis: Use perf stat -e cpu-cycles,instructions,cache-misses ./your_program. A high ratio of cache-misses to instructions combined with a low ratio of instructions to cpu-cycles suggests the CPU is stalled on memory. Note that sar -B and vmstat report paging activity, not DRAM bandwidth; measuring bandwidth directly requires the memory controller’s uncore counters, whose event names are platform-specific (check perf list).
    • Fix: Optimize memory access to reduce the number of fetches. This might involve using more compact data structures, algorithmic changes, or, in extreme cases, hardware upgrades (faster RAM or more memory channels; more capacity alone does not add bandwidth).
    • Why it works: Reducing the demand for memory bandwidth by improving cache hit rates or reducing the working set is the primary software solution.
  5. NUMA (Non-Uniform Memory Access) Issues: On multi-socket systems, memory access times vary depending on which CPU socket the memory is attached to. Accessing local memory is fast, while accessing remote memory is slow.

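    • Diagnosis: numactl -H shows the node topology, and numastat -p <pid> shows how a process’s memory is spread across nodes. To sample individual accesses, use perf mem record ./your_program followed by perf mem report, which shows where sampled loads were served from, including remote memory, on hardware that supports memory sampling. If you need physical addresses in the samples, the perf record option is --phys-data.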
    • Fix: Use numactl to bind your process to a specific NUMA node and ensure its memory allocations are on that node. For example, numactl --cpunodebind=0 --membind=0 ./your_program.
    • Why it works: Ensures that memory accesses are predominantly to the local memory of the CPU cores executing the code, minimizing latency.
  6. TLB (Translation Lookaside Buffer) Misses: The TLB caches virtual-to-physical address translations. If the TLB is too small or access patterns cause frequent misses, the CPU has to walk the page tables in memory, which is very slow.

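    • Diagnosis: perf stat -e dTLB-load-misses,iTLB-load-misses ./your_program for totals, or perf record with the same events followed by perf report to see which functions the misses come from. Available event names vary by CPU; perf list shows what your machine supports.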
    • Fix: For large data sets, consider using huge pages (e.g., 2MB or 1GB instead of 4KB). This reduces the number of page table entries and thus the pressure on the TLB. Configure this at the OS level (e.g., via /etc/sysctl.conf for vm.nr_hugepages).
    • Why it works: Larger pages mean fewer entries are needed in the TLB to cover the same amount of memory, significantly reducing TLB miss frequency.

After fixing these issues, the next bottleneck you’re likely to hit is the CPU itself, because the memory subsystem is no longer the primary limiter.
