perf can tell you what your CPU is doing, but it can also tell you what your memory subsystem is doing. This is key because modern CPUs are incredibly fast, but they’re often waiting around for data to arrive from RAM. perf can show you just how much waiting is happening, and more importantly, why.
Let’s look at a real-world example. We’ll use perf to profile memory access patterns in a simple C program that just does a lot of array lookups.
#include <stdio.h>
#include <stdlib.h>

#define SIZE (1024 * 1024) // 1,048,576 elements (about 1 million)
#define ITERATIONS 10

int main(void) {
    int *arr = malloc(SIZE * sizeof(int));
    if (arr == NULL) {
        perror("malloc failed");
        return 1;
    }

    // Initialize array (simple sequential access)
    for (long i = 0; i < SIZE; ++i) {
        arr[i] = (int)i;
    }

    // Read the whole array repeatedly. volatile keeps the compiler from
    // optimizing the loop away; long long avoids signed overflow, since
    // the total exceeds INT_MAX.
    volatile long long sum = 0;
    for (int iter = 0; iter < ITERATIONS; ++iter) {
        for (long i = 0; i < SIZE; ++i) {
            sum += arr[i];
        }
    }

    printf("Sum: %lld\n", sum);
    free(arr);
    return 0;
}
Now, let’s profile this with perf, assuming the program has been compiled to a binary named memory_profile. We’re interested in memory-related events, and the most fundamental of these are cache misses.
perf record -e cache-misses,cache-references ./memory_profile
perf report
This will give us a breakdown of where the cache-misses and cache-references are occurring. You’ll see percentages of total events attributed to different functions. The key is to look for functions that have a high ratio of cache-misses to cache-references.
The output of perf report will show you something like this (simplified):
20.00% memory_profile [.] main
15.00% memory_profile [.] main
5.00% libc-2.31.so [.] _int_malloc
The percentages here represent the share of samples for each profiled event (in this case, cache-misses and cache-references) that landed in the specified function; perf report lists each event separately. A high percentage in main for cache-related events means your application’s code, rather than a library, is the primary driver of memory access issues.
The problem perf helps us diagnose here is that the CPU is spending a lot of time waiting for data from RAM because it’s not finding that data in its caches. This can happen for several reasons, but the most common ones relate to how data is laid out in memory and how it’s accessed.
Common Causes of High Cache Misses:
- Poor Data Locality / Strided Access: Accessing memory with large strides (e.g., arr[i * stride]) means you’re likely to fetch a cache line and only use one or two elements before needing the next one, potentially far away in memory. This leads to many cache lines being loaded only to be discarded quickly.
  - Diagnosis: perf record -e cache-misses,cache-references --call-graph dwarf ./your_program then analyze perf report. Look for loops with high miss rates.
  - Fix: Rearrange data structures or access patterns to be more contiguous. If iterating through a 2D array, iterate through rows first if it’s row-major, or columns if column-major, to match memory layout. For our example, the sequential access is good, but if SIZE was much larger or the access pattern was different, this would be a culprit.
  - Why it works: Cache lines are blocks of memory (e.g., 64 bytes). When you access arr[i], an entire cache line containing arr[i] and its neighbors is loaded. If you access arr[i+1], it’s likely already in the cache. If you access arr[i + stride] where stride is large, you might miss.
- Cache Thrashing: Frequently accessing more data than can fit into a cache level (L1, L2, L3). When new data is brought in, old data that might be needed soon is evicted.
  - Diagnosis: perf stat -e cache-misses,cache-references ./your_program. Compare the total misses to total references. If the miss rate (cache-misses / cache-references) is very high (e.g., > 20-30%), this is a strong indicator.
  - Fix: Reduce the working set size (the amount of data actively being used). This might involve processing data in smaller chunks, using more efficient data structures, or optimizing algorithms to require less memory. For our example, if SIZE was significantly larger than what fits in L3 cache, and the program constantly reread the same large array, thrashing could occur.
  - Why it works: By keeping the actively used data smaller than the cache size, you maximize the chances that data you need next is already present.
- False Sharing (Multi-threaded applications): Occurs when two threads modify independent variables that happen to reside in the same cache line. The cache coherency protocol bounces the cache line between cores, causing misses even though the threads aren’t logically contending for the same data.
  - Diagnosis: perf record -e cache-misses,cache-references -a --call-graph dwarf ./your_multithreaded_program. Analyze perf report and look for high miss rates in code that touches shared data structures from different threads. Miss counts alone cannot distinguish false sharing from other misses; on CPUs that support it, perf c2c (perf c2c record followed by perf c2c report) is designed to pinpoint contended cache lines.
  - Fix: Pad data structures so that variables frequently modified by different threads are on separate cache lines. For example, if struct { int counter1; int counter2; } is used by two threads, one on counter1 and one on counter2, the two counters will usually share a cache line. Add padding: struct { int counter1; char padding[64 - sizeof(int)]; int counter2; }. This assumes 64-byte cache lines and a struct that itself starts on a cache-line boundary.
  - Why it works: Ensures that modifications to one variable don’t invalidate the cache line containing another variable used by a different thread.
- Insufficient Memory Bandwidth: While perf primarily focuses on cache events, high cache miss rates ultimately lead to waiting for main memory. If the system’s memory bus is saturated, overall performance is limited by how quickly new data can be fetched, no matter how well the rest of the code is tuned.
  - Diagnosis: Use perf stat -e cpu-cycles,instructions,cache-misses ./your_program. A high ratio of cache-misses to instructions combined with a low ratio of instructions to cpu-cycles (low IPC) suggests the CPU is stalled on memory. System-level tools like sar -B or vmstat report paging and virtual-memory statistics, so they can rule out swapping as the cause, but they do not measure memory bandwidth directly; that generally requires platform-specific counters.
  - Fix: Optimize memory access to reduce the number of fetches. This might involve using more compact data structures, algorithmic changes, or, in extreme cases, hardware upgrades (more RAM, faster RAM, more memory channels).
  - Why it works: Reducing the demand for memory bandwidth by improving cache hit rates or reducing the working set is the primary software solution.
- NUMA (Non-Uniform Memory Access) Issues: On multi-socket systems, memory access times vary depending on which CPU socket the memory is attached to. Accessing local memory is fast, while accessing remote memory is slow.
  - Diagnosis: perf record -e cache-misses,cache-references -a --phys-data ./your_program. Analyze the output alongside the node topology reported by numactl -H. Tools like numastat can also show memory allocation across nodes.
  - Fix: Use numactl to bind your process to a specific NUMA node and ensure its memory allocations are on that node. For example, numactl --cpunodebind=0 --membind=0 ./your_program.
  - Why it works: Ensures that memory accesses are predominantly to the local memory of the CPU cores executing the code, minimizing latency.
- TLB (Translation Lookaside Buffer) Misses: The TLB caches virtual-to-physical address translations. If the working set spans more pages than the TLB can cover, or access patterns cause frequent misses, the CPU has to walk the page tables in memory, which is very slow.
  - Diagnosis: perf record -e dTLB-load-misses,iTLB-load-misses ./your_program. Analyze perf report for high miss rates in these events.
  - Fix: For large data sets, consider using huge pages (e.g., 2MB or 1GB instead of 4KB). This reduces the number of page table entries and thus the pressure on the TLB. Configure this at the OS level (e.g., by setting vm.nr_hugepages in /etc/sysctl.conf).
  - Why it works: Larger pages mean fewer entries are needed in the TLB to cover the same amount of memory, significantly reducing TLB miss frequency.
After fixing these issues, the next bottleneck you’ll likely hit is the CPU itself, because the memory subsystem is no longer the primary limiter.