NUMA nodes are physical groupings of CPUs and memory on a server. When a CPU on one NUMA node accesses memory attached to another NUMA node, the request has to cross the interconnect between nodes. This "remote memory access" has higher latency and lower bandwidth than a local one. perf can show you how often it happens, and therefore how much you’re paying for it.

Let’s see it in action. Imagine a simple C program that allocates a large array and then iterates through it. If this program is forced to run on a single NUMA node but the memory it’s using is primarily on another, we’ll see the penalty.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

#define SIZE ((size_t)1024 * 1024 * 1024) // 1 GiB

int main(void) {
    char *buffer;
    long long sum = 0;
    int node = 0; // Allocate on node 0

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        exit(1);
    }

    // Allocate memory bound to a specific NUMA node
    buffer = numa_alloc_onnode(SIZE, node);
    if (!buffer) {
        perror("numa_alloc_onnode");
        exit(1);
    }
    // Write to every page so physical memory is actually placed on the node.
    // Pages that are only read may all map to a shared zero page instead.
    memset(buffer, 1, SIZE);
    printf("Allocated %zu bytes on NUMA node %d\n", SIZE, node);

    // Pin the process to a specific CPU on a different NUMA node.
    // This requires knowing your system's topology. For this example,
    // assume node 0 has CPUs 0-7 and node 1 has CPUs 8-15.
    // We'll pin to CPU 8, which is on node 1.
    // Pinning is system-dependent. You'd typically use taskset,
    // numactl, or sched_setaffinity. For demonstration, assume we're
    // running this with `numactl --physcpubind=8 --membind=0 ./my_program`.

    printf("Accessing memory...\n");
    for (size_t i = 0; i < SIZE; ++i) {
        sum += buffer[i];
    }
    printf("Sum: %lld\n", sum);

    // numa_free needs the size of the original allocation
    numa_free(buffer, SIZE);
    return 0;
}

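Compile this with gcc -o remote_access remote_access.c -lnuma. The -lnuma option goes after the source file; on many toolchains the linker cannot resolve the libnuma symbols if the library is listed first. Then, run it using numactl to force the CPU to one node and the memory to another. Check your topology with numactl --hardware first. If your system has two NUMA nodes, with CPU 0 on node 0 and CPU 8 on node 1, you’d run: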

numactl --physcpubind=8 --membind=0 ./remote_access
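The CPU numbers above are assumptions about the topology. On Linux, each node’s CPUs are listed in /sys/devices/system/node/node<N>/cpulist in a compact form such as 0-7,16-23. The helper below is a minimal sketch of how that format could be checked before choosing a CPU for --physcpubind. The name cpulist_contains is hypothetical (it is not part of libnuma), and the parser assumes plain decimal CPU numbers and ranges.

```c
#include <stdlib.h>

/* Returns 1 if `cpu` appears in a sysfs-style CPU list such as "0-7,16-23",
 * 0 if it does not, and -1 if malformed text is met before a match. */
int cpulist_contains(const char *list, int cpu)
{
    const char *p = list;

    while (*p != '\0' && *p != '\n') {
        char *end;
        long first = strtol(p, &end, 10);
        if (end == p)
            return -1;              /* expected a number */
        long last = first;
        p = end;
        if (*p == '-') {            /* a range "first-last" */
            p++;
            last = strtol(p, &end, 10);
            if (end == p)
                return -1;
            p = end;
        }
        if (cpu >= first && cpu <= last)
            return 1;
        if (*p == ',')
            p++;                    /* move on to the next entry */
        else if (*p != '\0' && *p != '\n')
            return -1;              /* unexpected character */
    }
    return 0;
}
```

To use it, read the single line from a node’s cpulist file with fgets and pass it in along with the CPU you intend to pin to; a result of 1 for node 1’s list confirms that the CPU really is remote from node 0.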

Now, let’s use perf to see the NUMA penalty. We’re interested in hardware performance counters related to memory access, specifically the ones that separate local from remote DRAM accesses. perf has to wrap the program so that the counters cover its run, and since we want totals rather than samples, perf stat is the simplest tool.

First, let’s get a baseline of what typical memory access looks like without NUMA issues (same node for CPU and memory).

# Run on a single node, e.g., CPU 0 on Node 0, memory on Node 0
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,LLC-load-misses,node-loads,node-load-misses -- numactl --physcpubind=0 --membind=0 ./remote_access

Now, let’s do the same but force the remote access scenario:

# Run with CPU on Node 1, memory on Node 0
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses,LLC-load-misses,node-loads,node-load-misses -- numactl --physcpubind=8 --membind=0 ./remote_access

The key perf events here are node-loads and node-load-misses. They are generic cache events, so their exact meaning depends on the CPU, and some processors don’t expose them at all. Run perf list to check what your hardware supports.

  • node-loads: On the common Intel mappings, this counts loads served from DRAM on any node, local or remote.
  • node-load-misses: On the same mappings, this counts the loads served from DRAM on a remote node. A high value relative to node-loads is a direct indicator of remote memory access.

perf has no event that reports the latency between nodes. The relative cost of crossing nodes comes from the firmware’s distance table, which you can read with numactl --hardware or from /sys/devices/system/node/node<N>/distance. The values are unitless: 10 means local, and larger numbers mean a more expensive hop. They are not nanoseconds.

perf stat prints one line per event with its count, and numactl --hardware prints the distance table. Schematically, with placeholders instead of real values:

...
     <count>      node-loads
     <count>      node-load-misses
...
node distances:
node   0   1
  0:  10  <d01>
  1:  <d10>  10

The node-load-misses count is the raw number of times a CPU fetched data from another node’s memory. The distance table tells you the relative cost of each such access. If node-load-misses is a large share of node-loads, and the distance between the two nodes is well above the local value of 10, you’re looking at a substantial performance hit. Comparing cycles between the baseline run and the remote run shows that cost directly.
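To compare runs, it helps to reduce the two node counters to a single ratio. The sketch below assumes that node-loads counts all loads served from DRAM and that node-load-misses counts the remote subset, which holds on common Intel mappings but should be checked for your CPU. The function name remote_load_fraction is hypothetical, and the counts are whatever perf stat reported for the two events.

```c
#include <stdint.h>

/* Fraction of DRAM loads served by a remote node, in the range [0, 1].
 * node_loads:       count for node-loads (assumed: all DRAM loads)
 * node_load_misses: count for node-load-misses (assumed: remote DRAM loads)
 * Returns 0.0 when no loads were counted. */
double remote_load_fraction(uint64_t node_loads, uint64_t node_load_misses)
{
    if (node_loads == 0)
        return 0.0;
    /* Counter multiplexing can skew the two counts; clamp to 100%. */
    if (node_load_misses > node_loads)
        node_load_misses = node_loads;
    return (double)node_load_misses / (double)node_loads;
}
```

A fraction near 0 in the baseline run and near 1 in the pinned remote run would be the expected signature of the experiment above, under the stated assumption about the event mapping.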

To address this, you need to align your CPU and memory. The numactl command is your primary tool.

Common Causes and Fixes:

  1. Application Memory Allocation Strategy: The application is allocating memory on a node different from the one its threads are running on.

    • Diagnosis: Use perf stat -e node-loads,node-load-misses -- <your_application> and numastat -p <PID>. If node-load-misses (remote loads, on common Intel mappings) is a large share of node-loads, and numastat -p shows significant memory for the process on nodes other than where its CPUs are pinned, this is the cause.
    • Fix: Change your application to use numa_alloc_onnode or numa_set_preferred so that memory is allocated on the same node as the CPU that will access it. If you can’t change the code, use numactl --membind=<node_id> when launching the application to force memory allocation on a specific node.
    • Why it works: By ensuring memory is allocated on the same NUMA node as the CPU, you eliminate the need for cross-node interconnect traffic, drastically reducing access latency.
  2. Thread Affinity Incorrectly Set: Threads are pinned to CPUs on one NUMA node, but the memory they access is on another. This is often due to misconfiguration in the OS scheduler or manual thread pinning.

    • Diagnosis: Run perf record -e node-load-misses -- <your_application>. Then analyze the perf script output, correlating the samples with specific PIDs/threads. Use taskset -p <PID> or check /proc/<PID>/status for Cpus_allowed to see where threads are allowed to run.
    • Fix: Use numactl --cpunodebind=<node_id> or numactl --physcpubind=<cpu_list> to bind threads to CPUs on the same NUMA node where their data resides. If using OpenMP or pthreads, ensure their affinity settings are NUMA-aware. For OpenMP, the standard OMP_PLACES and OMP_PROC_BIND environment variables (for example OMP_PLACES=cores with OMP_PROC_BIND=close) can help.
    • Why it works: Aligning thread execution with the physical location of their data reduces or eliminates remote memory accesses.
  3. System-Wide Default Memory Policy: Linux’s default policy places each page on the node of the CPU that first touches it. If one thread initializes the data and threads on other nodes consume it, or if the process inherits an interleave policy from its launcher, accesses end up remote even though the application never names a node.

    • Diagnosis: Run numactl --show to see the policy a process would inherit, and numastat -p <PID> to see the process’s memory usage per node. If the memory sits on a different node from the threads that use it, or is spread evenly across nodes, and your application is performance-sensitive, this could be an issue.
    • Fix: Use numactl --localalloc or numactl --membind=<node_id> to keep a process’s memory on the node where it runs. Use numactl --interleave=all only when threads on every node share the same data, since it trades locality for evenly spread bandwidth (good for some workloads, bad for others). System-wide, automatic NUMA balancing can migrate pages toward the threads that use them; it is toggled through /proc/sys/kernel/numa_balancing, though an explicit numactl policy per application is usually more predictable.
    • Why it works: Explicitly telling the system or application where to place memory ensures it’s local to the CPUs that will consume it.
  4. Large Data Structures Straddling NUMA Boundaries: Very large data structures might be allocated in a way that spans multiple nodes by default if not managed carefully, leading to CPUs on one node accessing parts of the structure on another.

    • Diagnosis: This is harder to spot with perf events alone. Look for a high node-load-misses count, then inspect /proc/<PID>/numa_maps, which lists each memory mapping together with the number of pages it has on each node, to see how large objects are laid out.
    • Fix: Implement NUMA-aware allocation for large structures. This might involve allocating chunks of the structure on specific nodes or using techniques like memory pools that are NUMA-aware.
    • Why it works: By controlling the placement of large objects, you can ensure that the active portions being used by a CPU are on its local NUMA node.
  5. I/O and DMA: Devices connected to specific NUMA nodes might perform Direct Memory Access (DMA) to memory on a different node, causing contention and latency.

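    • Diagnosis: perf has no generic DMA-latency event, so monitor I/O throughput and latency alongside the node-loads and node-load-misses counters of the threads that handle the I/O. Check lspci -vvv or /sys/bus/pci/devices/<address>/numa_node to see which NUMA node a device is associated with.
    • Fix: If possible, run the threads that service the device, and allocate their buffers, on the device’s own node (for example with numactl --cpunodebind=<node_id> --membind=<node_id>), and steer the device’s interrupts to CPUs on that node. Some drivers also offer options for this.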
    • Why it works: Keeping DMA traffic local to a NUMA node reduces its impact on the interconnect and CPUs on other nodes.
  6. Virtualization Overhead: In virtualized environments, the hypervisor might not be NUMA-aware, or guest OS NUMA policies might be misconfigured, leading to remote accesses within the guest or between guest and host.

    • Diagnosis: Use perf on the host and within the guest if possible. Check VM settings for NUMA node affinity for virtual CPUs and memory.
    • Fix: Configure the hypervisor (e.g., VMware, KVM) to respect NUMA topology. Ensure guest OS numactl policies are set appropriately within the VM.
    • Why it works: Proper NUMA awareness at the hypervisor and guest levels ensures virtual resources are mapped efficiently to physical NUMA nodes.
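Several of the diagnoses above come down to asking where a process’s pages actually live. On Linux, each line of /proc/<PID>/numa_maps describes one mapping and carries per-node page counts in fields of the form N<node>=<pages>. The helper below is a minimal sketch of pulling one node’s count out of such a line. The name numa_maps_pages_on_node is hypothetical, and the sample line in the usage note is illustrative rather than captured from a real system.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns the page count recorded for `node` in one numa_maps line,
 * i.e. the value of its "N<node>=<pages>" field, or 0 if the field
 * is absent (no pages of this mapping are on that node). */
unsigned long numa_maps_pages_on_node(const char *line, int node)
{
    char key[32];
    /* The leading space anchors the match to the start of a field,
     * and the '=' keeps "N1=" from matching "N10=". */
    int len = snprintf(key, sizeof key, " N%d=", node);
    if (len < 0 || (size_t)len >= sizeof key)
        return 0;

    const char *hit = strstr(line, key);
    if (hit == NULL)
        return 0;
    return strtoul(hit + len, NULL, 10);
}
```

For an illustrative line such as "7f2c40000000 bind:0 anon=262144 dirty=262144 N0=262000 N1=144 kernelpagesize_kB=4", the helper reports 262000 pages on node 0 and 144 on node 1. Summing the counts over all lines for each node gives the same kind of per-node picture that numastat -p presents.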

After fixing remote access issues, the next problem you might encounter is cache contention or insufficient memory bandwidth on the local node.
