NUMA nodes are physical groupings of CPUs and memory on a server. When a CPU on one NUMA node accesses memory attached to another NUMA node, it’s a "remote memory access," and it’s slower than a local access because the request has to cross the interconnect between nodes. perf can help you measure how much you’re paying for this.
Let’s see it in action. Imagine a simple C program that allocates a large array and then iterates through it. If this program is forced to run on a single NUMA node but the memory it’s using is primarily on another, we’ll see the penalty.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

#define SIZE (1024UL * 1024 * 1024) // 1 GiB

int main(void) {
    char *buffer;
    long long sum = 0;
    int node = 0; // Allocate on node 0

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        exit(1);
    }

    // Allocate memory on a specific NUMA node
    buffer = numa_alloc_onnode(SIZE, node);
    if (!buffer) {
        perror("numa_alloc_onnode");
        exit(1);
    }
    // Write to every page so the kernel actually backs the buffer with
    // memory on the requested node; pages are only placed on first touch.
    memset(buffer, 1, SIZE);
    printf("Allocated %lu bytes on NUMA node %d\n", SIZE, node);

    // The CPU pinning is done from outside the program and depends on your
    // system's topology. This example assumes node 0 has CPUs 0-7 and
    // node 1 has CPUs 8-15, and that the program is launched with
    // `numactl --physcpubind=8 --membind=0 ./remote_access`.
    // taskset or sched_setaffinity would also work for the CPU part.
    printf("Accessing memory...\n");
    for (size_t i = 0; i < SIZE; ++i) {
        sum += buffer[i];
    }
    printf("Sum: %lld\n", sum);

    numa_free(buffer, SIZE); // numa_free takes the allocation size
    return 0;
}
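Compile this with the library listed after the source file, since link order matters with many toolchains:
gcc -o remote_access remote_access.c -lnuma
Then run it using numactl to force the CPU to one node and the memory to another. If your system has two NUMA nodes, with CPU 0 on node 0 and CPU 8 on node 1 (numactl --hardware shows the actual layout), you’d run: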
numactl --physcpubind=8 --membind=0 ./remote_access
Now, let’s use perf to see the NUMA penalty. We’re interested in hardware performance counters for memory access, specifically the ones that separate local from remote loads.
First, get a baseline with CPU and memory on the same node. perf has to launch the program itself (or attach to it with -p <PID>) for the counts to belong to it; profiling sleep 10 system-wide after the program has already exited would not measure it.
# Baseline: CPU 0 on node 0, memory on node 0
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,LLC-load-misses,node-loads,node-load-misses -- numactl --physcpubind=0 --membind=0 ./remote_access
Now do the same but force the remote access scenario:
# Remote: CPU 8 on node 1, memory on node 0
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,LLC-load-misses,node-loads,node-load-misses -- numactl --physcpubind=8 --membind=0 ./remote_access
The key perf events here are node-loads and node-load-misses. Their availability and exact meaning depend on the CPU and kernel, so check perf list first.
- node-loads: on Intel CPUs where the kernel maps this event, it counts loads served from DRAM on any node, local or remote.
- node-load-misses: the subset of those loads served from a remote node. A high count here is a direct indicator of remote memory access.
The distance between nodes is not a perf event. It comes from the firmware’s distance table, which numactl --hardware prints as a matrix. The values are relative costs with 10 meaning local access, not latencies in nanoseconds.
perf stat prints the counts when the program exits. Simplified, with placeholders instead of real numbers:
...
<count>      node-loads
<count>      node-load-misses
...
The distance table from numactl --hardware on a two-node machine has this shape (values illustrative):
node distances:
node   0   1
  0:  10  21
  1:  21  10
If node-load-misses is a large share of node-loads, and the distance for that node pair is well above the local value (e.g., 21 versus 10), you’re looking at a substantial performance hit. Comparing cycles and elapsed time between the baseline and remote runs shows how large it is.
To address this, you need to align your CPU and memory. The numactl command is your primary tool.
Common Causes and Fixes:
- Application Memory Allocation Strategy: The application allocates memory on a node different from the one its threads run on.
  - Diagnosis: Run perf stat -e node-loads,node-load-misses -- <your_application> and, while it runs, numastat -p <PID>. If node-load-misses (remote loads, on CPUs that expose the event) is high and numastat shows most of the process’s memory on nodes other than the ones its CPUs belong to, this is the cause.
  - Fix: Change the application to allocate with numa_alloc_onnode or numa_alloc_local, or to call numa_set_preferred, so memory lands on the node of the CPU that will access it. (numa_preferred only reports the preferred node; it does not set it.) If you can’t change the code, launch it with numactl --membind=<node_id> or numactl --localalloc.
  - Why it works: With memory on the same NUMA node as the CPU, accesses no longer cross the node interconnect, which lowers access latency.
- Thread Affinity Incorrectly Set: Threads are pinned to CPUs on one NUMA node, but the memory they access is on another. This usually comes from manual pinning or affinity settings that ignore the NUMA topology.
  - Diagnosis: Run perf record -e node-load-misses,task-clock -a -- <your_application>, then go through the perf script output to see which PIDs/threads the remote-load samples belong to. Use taskset -p <PID> or the Cpus_allowed field in /proc/<PID>/status to see where threads are allowed to run.
  - Fix: Use numactl --physcpubind=<cpu_list> (or --cpunodebind=<node_id>) to bind the process to CPUs on the NUMA node where its data resides. If using OpenMP or pthreads, make sure their affinity settings are NUMA-aware. For OpenMP, OMP_PLACES and OMP_PROC_BIND (e.g., OMP_PLACES=cores with OMP_PROC_BIND=close) control thread placement.
  - Why it works: Aligning thread execution with the physical location of the data reduces or eliminates remote memory accesses.
- System-Wide Default Memory Policy: Linux normally places a page on the node of the CPU that first touches it, but a policy inherited from a launcher or service wrapper (e.g., interleaving) spreads allocations across all nodes, leading to remote access even if the application doesn’t explicitly specify a node.
  - Diagnosis: Run numactl --show to see the policy in effect, and numastat -p <PID> to see the application’s memory per node when it is started without numactl. If the memory is evenly distributed and the application is latency-sensitive, this could be an issue.
  - Fix: Launch with numactl --membind=<node_id> to bind the process’s memory to one node, or numactl --localalloc to allocate on the node of the running CPU. numactl --interleave=all does the opposite: it deliberately spreads pages across nodes, which helps some bandwidth-bound workloads and hurts latency-sensitive ones. System-wide, automatic NUMA balancing is controlled through /proc/sys/kernel/numa_balancing, though per-application numactl settings are usually the better tool.
  - Why it works: Explicitly telling the system or application where to place memory ensures it’s local to the CPUs that will consume it.
- Large Data Structures Straddling NUMA Boundaries: Very large data structures can end up spanning multiple nodes if their placement isn’t managed, so CPUs on one node access parts of the structure that live on another.
  - Diagnosis: This is harder to spot with perf events alone. Look for a high remote-load count, use a heap profiler (like Valgrind’s massif) to identify the large allocations, and check /proc/<PID>/numa_maps, which lists for each mapping how many pages sit on each node (N0=..., N1=...).
  - Fix: Implement NUMA-aware allocation for large structures. This might involve allocating chunks of the structure on specific nodes or using NUMA-aware memory pools.
  - Why it works: By controlling the placement of large objects, you can ensure that the portions a CPU is actively using are on its local NUMA node.
- I/O and DMA: Devices connected to specific NUMA nodes might perform Direct Memory Access (DMA) to memory on a different node, causing contention and latency.
  - Diagnosis: Check which NUMA node a device is attached to with lspci -vvv (the NUMA node field) or /sys/bus/pci/devices/<address>/numa_node, and compare it with where the threads and buffers doing the I/O live. There is no portable perf event for DMA latency, so monitor I/O throughput and latency alongside the remote-load counters.
  - Fix: If possible, run the threads that drive the device on the device’s node and place their buffers there (for example with numactl --cpunodebind=<node_id> --membind=<node_id>), or configure the driver to use memory local to the device’s node. The latter often involves kernel boot parameters or driver-specific options.
  - Why it works: Keeping DMA traffic local to a NUMA node reduces its impact on the interconnect and on CPUs on other nodes.
- Virtualization Overhead: In virtualized environments, the hypervisor might not be NUMA-aware, or guest OS NUMA policies might be misconfigured, leading to remote accesses within the guest or between guest and host.
  - Diagnosis: Use perf on the host and, if the hardware counters are exposed there, within the guest. Check the VM settings for the NUMA node affinity of virtual CPUs and memory.
  - Fix: Configure the hypervisor (e.g., VMware, KVM) to respect the NUMA topology, keeping each VM’s virtual CPUs and memory on the same physical node where possible. Ensure numactl policies are set appropriately within the guest.
  - Why it works: Proper NUMA awareness at the hypervisor and guest levels ensures virtual resources are mapped efficiently to physical NUMA nodes.
After fixing remote access issues, the next problem you might encounter is cache contention or insufficient memory bandwidth on the local node.