The Linux perf tool can tell you exactly which CPU cache levels (L1, L2, L3) are missing the data your program needs, and it’s often a bottleneck you didn’t even realize was there.

Let’s see what a cache miss looks like in action. Imagine we have a simple C program that does a lot of sequential array access:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE (1024 * 1024) // 1,048,576 elements; parenthesized so the macro expands safely inside expressions
#define ITERATIONS 100

int main() {
    int *arr = (int *)malloc(ARRAY_SIZE * sizeof(int));
    if (arr == NULL) {
        perror("malloc failed");
        return 1;
    }

    // Initialize array with some values
    for (int i = 0; i < ARRAY_SIZE; i++) {
        arr[i] = i;
    }

    long long sum = 0;
    clock_t start = clock();

    // Access array elements repeatedly
    for (int iter = 0; iter < ITERATIONS; iter++) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            sum += arr[i]; // This is where the cache access happens
        }
    }

    clock_t end = clock();
    double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("Sum: %lld\n", sum);
    printf("Time taken: %f seconds\n", cpu_time_used);

    free(arr);
    return 0;
}

We’ll compile this with gcc -o cache_test cache_test.c and then run perf to profile it.

The core problem perf helps diagnose is the CPU having to fetch data from slower memory (ultimately main RAM) because the data isn’t in its fast, on-chip caches. Modern CPUs have multiple levels of cache (L1, L2, L3), with L1 being the smallest and fastest, and L3 being the largest and slowest (but still much faster than RAM). When the CPU requests data and it isn’t found at a given cache level, that’s a "cache miss" at that level, and the request moves on to the next level. Misses that go all the way to RAM can stall the CPU for hundreds of cycles, which drastically slows down execution.
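
To relate the cache hierarchy to your own machine, it helps to know how large each level is. The following is a minimal sketch, assuming a Linux system with glibc: the _SC_LEVEL*_CACHE_SIZE names passed to sysconf are a glibc extension and may report nothing on some systems. The helper names (cache_size_bytes, fits_in_cache, report_cache_fit) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Returns the size in bytes of the data cache at `level` (1, 2, or 3),
 * or -1 if the level is unknown or the system does not report a size.
 * Assumption: the _SC_LEVEL*_CACHE_SIZE names are a glibc extension. */
long cache_size_bytes(int level) {
    long size;
    switch (level) {
    case 1: size = sysconf(_SC_LEVEL1_DCACHE_SIZE); break;
    case 2: size = sysconf(_SC_LEVEL2_CACHE_SIZE); break;
    case 3: size = sysconf(_SC_LEVEL3_CACHE_SIZE); break;
    default: return -1;
    }
    return size > 0 ? size : -1;
}

/* Returns 1 if a working set of `bytes` fits in a cache of `cache_bytes`,
 * and 0 if it does not fit or the cache size is unknown. */
int fits_in_cache(size_t bytes, long cache_bytes) {
    return cache_bytes > 0 && bytes <= (size_t)cache_bytes;
}

/* Prints, for each cache level, whether `working_set` bytes would fit. */
void report_cache_fit(size_t working_set) {
    assert(working_set > 0);
    for (int level = 1; level <= 3; level++) {
        long size = cache_size_bytes(level);
        if (size < 0) {
            printf("L%d: size not reported\n", level);
            continue;
        }
        printf("L%d: %ld bytes, working set %s\n", level, size,
               fits_in_cache(working_set, size) ? "fits" : "does not fit");
    }
}
```

Calling report_cache_fit(ARRAY_SIZE * sizeof(int)) from the example program’s main would show which cache levels, if any, can hold the whole array.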

Here’s how you’d start investigating cache misses with perf:

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-load-misses,L2-cache-load-misses,LLC-load-misses ./cache_test

Let’s break down the perf stat command and what to look for.

  • cycles: Total number of CPU cycles.
  • instructions: Total number of instructions retired.
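  • cache-references: Cache accesses as counted by the generic hardware event. On most CPUs this means last-level cache accesses, not accesses at every level.
  • cache-misses: Cache misses as counted by the generic hardware event. On most CPUs this means last-level cache misses, not a total across all levels.
  • L1-dcache-load-misses: Loads that missed the L1 data cache. Every load checks L1 first, so this is the first place a miss shows up.
  • L2-cache-load-misses: Loads that missed the L2 cache, which the CPU checks after an L1 miss. This is not a generic event on every system; if perf reports it as unsupported, run perf list to find the L2 event names your CPU exposes.
  • LLC-load-misses: Loads that missed the Last Level Cache (LLC), which is usually L3. If L2 misses, the CPU checks L3, and an LLC miss goes to RAM.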

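When you run this, you’ll see output along these lines. The numbers here are illustrative rather than measured, and the exact counts, annotations, and supported events vary by CPU, kernel, and perf version: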

 Performance counter stats for './cache_test':

        1,234,567      cycles                                                      (100.00%)
        5,678,910      instructions                                                (100.00%)
       10,000,000      cache-references                                            (100.00%)
        8,000,000      cache-misses              #    80.00% of cache-references    (100.00%)
        7,000,000      L1-dcache-load-misses     #    87.50% of cache-misses        (100.00%)
          900,000      L2-cache-load-misses      #    11.25% of cache-misses        (100.00%)
            1,000      LLC-load-misses           #     0.01% of cache-misses        (100.00%)

        0.01234567 seconds time elapsed

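The key metrics here are the miss rates for each cache level, not the raw counts. A rate needs the matching access count, so add L1-dcache-loads to the event list and divide L1-dcache-load-misses by it to get the L1 miss rate. (The 87.50% next to L1-dcache-load-misses in the illustration above is a share of cache-misses, not an L1 miss rate.) A high L1 miss rate means your program is frequently requesting data that isn’t in the fastest cache. A high L2-cache-load-misses count means that data missing from L1 often isn’t in L2 either. LLC-load-misses counts loads that missed the final cache before RAM; these are the most expensive misses.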
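
A miss rate is simply misses divided by accesses. As a minimal sketch (miss_rate_percent is a hypothetical helper, not part of perf), the cache-misses line in the sample output works out as 8,000,000 / 10,000,000 = 80%:

```c
#include <assert.h>

/* Miss ratio as a percentage: misses / accesses * 100.
 * Returns 0.0 when there were no accesses, to avoid dividing by zero. */
double miss_rate_percent(unsigned long long misses,
                         unsigned long long accesses) {
    assert(misses <= accesses);
    if (accesses == 0) {
        return 0.0;
    }
    return 100.0 * (double)misses / (double)accesses;
}
```

The same helper gives an L1 miss rate when it is fed L1-dcache-load-misses and L1-dcache-loads from the same perf stat run.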

Common Causes and Fixes

  1. Poor Data Locality (Accessing Data Far Apart in Memory)

    • Diagnosis: High L1-dcache-load-misses and L2-cache-load-misses. The program iterates over a large data structure, jumping around in memory.
    • Check: Look at your code’s memory access patterns. Are you iterating linearly through arrays, or jumping between widely separated elements?
    • Fix: Restructure your data or access patterns to be more sequential. The example above already accesses memory linearly, so locality is not its problem; a large enough ARRAY_SIZE still exceeds cache capacity (see cause 2). A typical locality bug is traversing a 2D array arr[i][j] with i as the fastest-changing index: C stores arrays row-major, so consecutive accesses land a full row apart. Swapping the loops so that j changes fastest (or using structures that naturally group related data) fixes it.
    • Why it works: CPUs fetch data in whole cache lines, typically 64 bytes. When you access arr[i], the line that arrives also holds its neighbors (arr[i+1], arr[i+2], and so on), so linear access hits in cache for the rest of the line, and the hardware prefetcher can fetch the following lines ahead of time.
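
The effect of traversal order can be sketched with two functions that compute the same sum over a row-major matrix. This is an illustration under assumptions, not code from the example program: the function names are hypothetical, and the matrix is stored as a flat array of n * n ints.

```c
#include <assert.h>
#include <stddef.h>

/* Sums an n x n matrix stored row-major in a flat array, walking each row
 * left to right. Consecutive accesses are adjacent in memory. */
long long sum_row_order(const int *m, size_t n) {
    assert(m != NULL);
    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            sum += m[i * n + j];
        }
    }
    return sum;
}

/* Computes the same sum column by column. Consecutive accesses are
 * n * sizeof(int) bytes apart, so for large n each one touches a
 * different cache line. */
long long sum_column_order(const int *m, size_t n) {
    assert(m != NULL);
    long long sum = 0;
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            sum += m[i * n + j];
        }
    }
    return sum;
}
```

Both functions return the same value; timing each over a matrix much larger than the last-level cache, or counting L1-dcache-load-misses for each under perf stat, should show the difference in locality.
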
  2. Large Data Structures Exceeding Cache Size

    • Diagnosis: High miss rates across L1, L2, and potentially LLC. The total size of the data your program actively uses at any given time is larger than the available cache.
    • Check: Calculate the working set size of your application. For our example, ARRAY_SIZE * sizeof(int) is 1024 * 1024 * 4 bytes = 4MB. If your L2 cache is 2MB, this entire array won’t fit.
    • Fix:
      • Reduce data size: If possible, use smaller data types (e.g., short instead of int if the range allows).
      • Process data in chunks: Split the data into pieces that fit in cache and finish all the work on one piece before moving on to the next. This is called tiling or blocking.
      • Change algorithm: Sometimes a different algorithm with a smaller working set exists.
    • Why it works: By processing data in chunks that do fit into the cache, you maximize cache hits for that chunk before it’s evicted.
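
Tiling is easiest to see on an operation that touches data in two different orders at once. The sketch below uses a matrix transpose as the illustration (my choice of example, not taken from the program above). BLOCK is a hypothetical tile edge; 32 x 32 ints is 4 KiB per tile, an assumption you would tune against your own L1 size.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32 /* hypothetical tile edge, in elements; tune by measurement */

/* Straightforward transpose of an n x n row-major matrix. Reads walk src
 * row by row, but writes to dst stride through memory n elements apart. */
void transpose_naive(const int *src, int *dst, size_t n) {
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            dst[j * n + i] = src[i * n + j];
        }
    }
}

/* Tiled transpose: finishes one BLOCK x BLOCK tile before moving to the
 * next, so the source and destination tiles stay cached while in use. */
void transpose_blocked(const int *src, int *dst, size_t n) {
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = ii + BLOCK < n ? ii + BLOCK : n;
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

Both functions produce the same result. The blocked version only changes the order in which elements are visited, which is the whole point of tiling.
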
  3. False Sharing (Multi-threaded Applications)

    • Diagnosis: High cache miss rates, especially L1/L2, in multi-threaded applications where threads access different variables that happen to reside on the same cache line.
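    • Check: Analyze shared data structures accessed by multiple threads. If threads independently modify variables that are close together in memory, those variables might be on the same cache line. The perf c2c record and perf c2c report subcommands can help find cache lines that are contended between cores.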
    • Fix: Pad or align your data structures so that variables written by different threads sit on separate cache lines (typically 64 bytes). For example, if threadA writes to varX and threadB writes to varY, and varX and varY are on the same cache line, you have false sharing. Add unused bytes after varX (or align varY to a cache-line boundary) so they end up on different lines.
    • Why it works: Each CPU core has its own L1 cache, and the cache coherence protocol tracks ownership per cache line, not per variable. Before a core can write to a line, every other core’s copy of that line is invalidated. If a thread on another core then reads or writes any variable on that line, even an unrelated one, it misses and has to fetch the line again from the writing core’s cache or a shared cache level. Padding prevents unrelated variables from sharing a cache line.
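
The layout side of the padding fix can be sketched with C11 alignas. This is a sketch under assumptions: CACHE_LINE is set to 64 bytes, which is typical on x86-64 but not universal, and the struct and function names are hypothetical. The threads themselves are left out; the example only shows where the two counters end up in memory.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Two counters packed next to each other. They are 8 bytes apart on a
 * typical 64-bit system, so they usually share one cache line. */
struct packed_counters {
    long a; /* written by thread A */
    long b; /* written by thread B */
};

/* Each counter is aligned to its own cache line. alignas also rounds
 * the struct size up, so arrays of this struct stay separated too. */
struct padded_counters {
    alignas(CACHE_LINE) long a; /* written by thread A */
    alignas(CACHE_LINE) long b; /* written by thread B */
};

/* Returns 1 if both addresses fall within the same CACHE_LINE-sized,
 * CACHE_LINE-aligned block of memory, 0 otherwise. */
int same_cache_line(const void *p, const void *q) {
    assert(p != NULL && q != NULL);
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

The cost of the padded layout is memory: struct padded_counters occupies two full cache lines instead of 16 bytes.
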
  4. Inefficient Data Structures

    • Diagnosis: High miss rates, particularly when performing lookups or traversals.
    • Check: Consider data structures like linked lists, which have pointers scattered throughout memory. Traversing a linked list often results in cache misses because each node might be in a different memory location.
    • Fix: Use contiguous data structures like arrays, vectors, or specialized structures like B-trees or tries where nodes are often grouped together.
    • Why it works: Contiguous structures improve spatial locality. When one element is fetched into the cache, its neighbors on the same cache line arrive with it, increasing the chance of subsequent accesses being cache hits.
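
The difference between the two traversals can be sketched as follows. The names are hypothetical, and one caveat applies: nodes allocated back to back by malloc, as in this small example, often end up close together, so the scattered layout the text describes typically appears in long-running programs where allocations and frees interleave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Frees every node in the list. */
void list_free(struct node *head) {
    while (head != NULL) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}

/* Builds a list holding values[0..n-1] in order.
 * Returns NULL if n is 0 or an allocation fails. */
struct node *list_from_array(const int *values, size_t n) {
    assert(values != NULL || n == 0);
    struct node *head = NULL;
    /* Build back to front so each new node becomes the head. */
    for (size_t i = n; i > 0; i--) {
        struct node *nd = malloc(sizeof *nd);
        if (nd == NULL) {
            list_free(head);
            return NULL;
        }
        nd->value = values[i - 1];
        nd->next = head;
        head = nd;
    }
    return head;
}

/* Pointer chasing: the address of the next load is only known after
 * the current node has been loaded. */
long long sum_list(const struct node *head) {
    long long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        sum += p->value;
    }
    return sum;
}

/* Contiguous access: addresses are known in advance and adjacent. */
long long sum_array(const int *values, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum;
}
```

Note also that each list node carries a pointer next to its int, so the list moves more bytes through the cache than the array does for the same data.
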
  5. TLB (Translation Lookaside Buffer) Misses Mimicking Cache Misses

    • Diagnosis: High L1-dcache-load-misses or L2-cache-load-misses that don’t seem to correlate with data access patterns but rather with address translations. TLB misses cause similar stalls.
    • Check: Use perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,page-faults and examine dTLB-load-misses. A high number of TLB misses means the CPU had to walk the page tables in memory to resolve virtual addresses.
    • Fix:
      • Use huge pages: For very large, contiguous memory allocations, using huge pages (e.g., 2MB or 1GB instead of 4KB) reduces the number of page table entries needed, thus reducing TLB pressure. This requires huge pages to be made available in the kernel configuration and the program to be modified to use mmap with MAP_HUGETLB.
      • Reduce memory fragmentation: Data scattered across many pages needs more translations than the same data packed into a few pages, which increases the likelihood of TLB misses.
    • Why it works: The TLB is a small cache of virtual-to-physical address translations. When the working set spans more pages than the TLB has entries, translations miss and the CPU has to walk the full page tables, which live in main memory. Huge pages cover a larger address range per entry, so the same working set needs fewer TLB entries.
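
A minimal sketch of the mmap with MAP_HUGETLB approach, assuming Linux and a 2 MiB huge page size (HUGE_PAGE_SIZE and alloc_prefer_huge are hypothetical names). The huge page request fails when no huge pages are reserved on the system, so the sketch falls back to normal pages rather than treating that as an error.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes MAP_ANONYMOUS and MAP_HUGETLB on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024) /* assumption: 2 MiB huge pages */

/* Maps at least `bytes` of zero-filled memory, trying huge pages first.
 * On success, stores the mapped length in *mapped_len (pass it to munmap)
 * and sets *used_huge to 1 if huge pages were used, 0 otherwise.
 * Returns NULL if both attempts fail. */
void *alloc_prefer_huge(size_t bytes, size_t *mapped_len, int *used_huge) {
    assert(bytes > 0 && mapped_len != NULL && used_huge != NULL);

    /* Huge page mappings must be a whole number of huge pages. */
    size_t len = (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);

    if (p == MAP_FAILED) {
        /* No huge pages available or permitted: use normal pages. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            return NULL;
        }
    }
    *mapped_len = len;
    return p;
}
```

Checking used_huge matters when measuring: if the fallback path was taken, dTLB-load-misses will not improve, and the result says nothing about huge pages.
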
  6. Prefetcher Issues (Over-prefetching or Under-prefetching)

    • Diagnosis: High miss rates, especially when the pattern of access is regular but the CPU’s hardware prefetcher isn’t detecting it, or when it’s fetching too much irrelevant data.
    • Check: This is harder to diagnose directly with basic perf stat. It often manifests as unexpected miss rates for regular access patterns.
    • Fix: In some cases, you can disable or tune the hardware prefetcher via CPU MSRs (Model-Specific Registers) or BIOS settings. This is an advanced technique and often requires specific knowledge of your CPU architecture. A more common software fix is to ensure your access patterns are extremely regular and predictable to help the prefetcher.
    • Why it works: Hardware prefetchers try to predict future memory accesses and fetch data into the cache before it’s explicitly requested. If they work well, they reduce cache misses. If they work poorly, they can evict useful data or fetch useless data, increasing misses or wasting bandwidth.
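
When an access pattern is known to the program but invisible to the hardware prefetcher, a software prefetch hint is one more option beyond those listed above. The sketch below is an assumption-laden illustration, not a recommendation: __builtin_prefetch is a GCC and Clang builtin (not standard C), PREFETCH_DISTANCE is a hypothetical value that has to be tuned by measurement, and a badly chosen hint can make things slower.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16 /* hypothetical; tune by measurement */

/* Sums data[index[i]] for i in 0..n-1. The addresses depend on the
 * contents of index, so a hardware prefetcher cannot predict them from
 * the address stream alone, but the program can look ahead in index. */
long long gather_sum(const int *data, const size_t *index, size_t n) {
    assert(data != NULL && index != NULL);
    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Hint: this element will be read soon (0 = read access,
             * 1 = low temporal locality). It is only a hint and does
             * not change the result. */
            __builtin_prefetch(&data[index[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += data[index[i]];
    }
    return sum;
}
```

Comparing L1-dcache-load-misses and cycles with and without the hint under perf stat is the only reliable way to tell whether it helps on a given CPU.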

After fixing cache misses, you’ll likely encounter the next bottleneck, which could be instruction cache misses (L1-icache-load-misses), branch mispredictions (branch-misses), or simply the raw instruction execution rate.
