NUMA, or Non-Uniform Memory Access, describes how modern multi-CPU systems access memory. RAM is not a single, monolithic pool; instead, each CPU (or "socket") forms a NUMA node with its own local memory. Accessing that local memory is fast, while accessing memory attached to another CPU has to cross the inter-socket interconnect and is noticeably slower, in both latency and bandwidth.
Here’s how to inspect your NUMA layout and run a model with Ollama so you can watch where it lands:
# First, let's pull a model if you don't have one
ollama pull llama3
# Before running it, look at the system's NUMA layout.
# On Linux, 'numactl --hardware' gives a good overview of your NUMA setup
# (install the 'numactl' package if the command is missing).
# 'htop' or a similar tool is useful later for seeing which CPUs are busy.
# Example output of numactl --hardware (abridged: the real output lists every CPU ID
# individually and also prints per-node free memory and a node distance table):
# available: 2 nodes (0-1)
# node 0 cpus: 0-11,24-35
# node 0 size: 66560 MB
# node 1 cpus: 12-23,36-47
# node 1 size: 66560 MB
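# Now, let's run Ollama.
# The prompt below keeps inference going long enough to observe behavior.
# Note that 'ollama run' is only a client: the model is loaded by the Ollama server
# ('ollama serve', or the ollama systemd service), so that is the process to watch.
# This command will run the model and stream output.
# Use Ctrl+C to stop it.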
ollama run llama3 "Tell me a short story about a cat who learned to fly."
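To relate CPU IDs to NUMA nodes in a script, the `numactl --hardware` output can be turned into a lookup table. This is a minimal sketch, not part of Ollama: the function name `cpu_to_node` is made up for illustration, and it assumes the "node N cpus:" lines look like the ones above. It accepts both space-separated CPU IDs (what numactl normally prints) and range notation such as 0-11,24-35.

```shell
# cpu_to_node: read `numactl --hardware` output on stdin and print
# one "cpu node" pair per line. (Hypothetical helper name.)
cpu_to_node() {
  awk '$1 == "node" && $3 == "cpus:" {
         # Each remaining field is a CPU ID or a comma-separated list of ranges.
         for (i = 4; i <= NF; i++) {
           n = split($i, parts, ",")
           for (j = 1; j <= n; j++) {
             m = split(parts[j], r, "-")
             lo = r[1]
             hi = (m == 2) ? r[2] : r[1]
             for (c = lo + 0; c <= hi + 0; c++) print c, $2
           }
         }
       }'
}
```

Typical use would be `numactl --hardware | cpu_to_node > cpu_node.map`, which gives a small file you can reuse when checking where threads are running.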
When Ollama starts, it needs to load the model weights into memory. On a NUMA system, Linux by default uses a local ("first-touch") allocation policy: each page is placed on the node of the CPU core that first touches it, falling back to other nodes if that node is full. This is usually a good default, but it’s not always optimal for sustained inference, because the threads that load the weights are not necessarily the ones that read them later.
The core problem is that inference, especially with large language models, involves massive amounts of data movement: generating each token streams a large share of the model’s weights from RAM. If the CPU cores doing the computation are mostly reading memory attached to a different NUMA node, you’ll see performance degrade because of the added latency and the limited bandwidth of the interconnect. Depending on the version and how it is launched, Ollama might not end up with a NUMA-friendly placement during model loading and execution.
The most common culprit is the default allocation policy. It places memory on the node where the thread that first touches it is running, rather than where the threads that use it will spend most of their execution time. For CPU inference, which is largely limited by memory bandwidth, this can concentrate the model on one NUMA node: that node’s memory controller becomes a hot spot, while cores on other nodes either sit idle or, worse, pull weights across the relatively slow inter-node links.
Diagnosis:
Use numastat -m to see system-wide memory usage per NUMA node, and numastat -p <pid> (with the PID of the Ollama process that holds the model) to see how that process’s memory is split across nodes. Look for a significant imbalance, where one node holds a much larger share of the model’s memory than the others. Also, use htop to see which CPU cores are active during inference: the PROCESSOR column (F2, then Columns) shows the CPU each thread last ran on, and numactl --hardware or lscpu tells you which node each CPU belongs to. If the active cores are predominantly on one node and memory is heavily skewed to that same node, it might be fine. If the active cores are on one node but memory is spread out or skewed to another, you have a NUMA issue.
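The "which node are the busy threads on" half of this check can be scripted. The sketch below is only an illustration: `threads_per_node` is a hypothetical helper, and it assumes you have a map file of "cpu node" pairs (for example from `lscpu -p=CPU,NODE | grep -v '^#' | tr ',' ' '`). It reads one CPU ID per thread on stdin, which is what `ps -L -o psr= -p <pid>` prints.

```shell
# threads_per_node MAPFILE
#   MAPFILE: lines of "cpu node" pairs
#   stdin:   one CPU ID per thread (e.g. from `ps -L -o psr= -p <pid>`)
# Prints how many threads last ran on each NUMA node. (Hypothetical helper.)
threads_per_node() {
  awk 'FILENAME == ARGV[1] { node[$1] = $2; next }    # first input: the CPU -> node map
       ($1 in node)        { count[node[$1]]++ }      # stdin: count threads per node
       END { for (n in count) print "node " n ": " count[n] " threads" }' "$1" - | sort
}
```

For example, `ps -L -o psr= -p <pid> | threads_per_node cpu_node.map`, run a few times during inference, shows whether the threads cluster on one node. The psr value is only the CPU a thread last ran on, so a single sample is a snapshot rather than a guarantee.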
Cause 1: Default Memory Allocation Imbalance
When Ollama loads a model, the OS may place most of the model’s weights on the NUMA node where the loading thread is running. If the inference threads later run predominantly on cores of a different NUMA node, they’ll constantly be fetching data across the inter-node interconnect.
Diagnosis Command: numastat -m
Exact Check: Compare MemUsed and MemFree per node (numastat -m reports them in MB rather than percentages). If one node is nearly full while another is mostly free, and your inference is slow, this is likely the issue.
Exact Fix: Start the Ollama server under numactl --interleave=all. This tells the kernel to spread the process’s memory pages round-robin across all NUMA nodes. The policy has to be applied to the server, because ollama run is only a client that talks to it; processes the server spawns inherit the policy.
numactl --interleave=all ollama serve   # stop any already-running Ollama server first
ollama run llama3 "Explain the concept of NUMA."   # from a second terminal
Why it works: Interleaving forces the model’s weights to be spread across all available NUMA nodes. It does not make every access local (on a two-node system roughly half of them are still remote), but it evens out latency and lets the inference threads draw on the memory bandwidth of every node instead of saturating one. It helps most when the inference threads themselves run on cores from all nodes.
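To confirm that interleaving actually took effect, you can look at where a process’s pages ended up. The sketch below sums the per-node page counts from /proc/<pid>/numa_maps, where each mapping line carries fields such as N0=256 N1=256 and kernelpagesize_kB=4. The helper name `mem_per_node` is made up, and it assumes a 4 kB page size when a line has no kernelpagesize_kB field.

```shell
# mem_per_node: read /proc/<pid>/numa_maps on stdin and print the
# resident memory per NUMA node in MB. (Hypothetical helper name.)
mem_per_node() {
  awk '{
         kb = 4                                   # assumed page size if not stated
         for (i = 1; i <= NF; i++)
           if ($i ~ /^kernelpagesize_kB=/) { split($i, p, "="); kb = p[2] }
         for (i = 1; i <= NF; i++)
           if ($i ~ /^N[0-9]+=/) {                # N<node>=<pages>
             split($i, kv, "=")
             mb[substr(kv[1], 2)] += kv[2] * kb / 1024
           }
       }
       END { for (n in mb) printf "node %s: %.0f MB\n", n, mb[n] }' | sort
}
```

Usage would be along the lines of `mem_per_node < /proc/<pid>/numa_maps`, with the PID of the process that holds the model (reading another user’s numa_maps needs root). Roughly equal numbers per node suggest the interleave policy was applied; a heavy skew suggests it was not.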
Cause 2: CPU Affinity Not Aligned with Memory
Even if memory is somewhat distributed, the inference threads might not be scheduled on the CPU cores closest to the memory they read.
Diagnosis Command: htop (with the PROCESSOR column) and numastat -p <pid>
Exact Check: Identify which CPU cores are running the Ollama server’s threads during inference, map them to nodes with numactl --hardware, and compare with the per-node memory split from numastat -p <pid>. If cores on node 0 are busy but most of the memory is on node 1, you have an affinity problem.
Exact Fix: Bind the Ollama server to a specific NUMA node’s CPUs and memory. This is more advanced and depends on your hardware: it only works well if the model fits in that node’s RAM, and it limits inference to that node’s cores.
# Example: run the Ollama server only on the CPUs and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 ollama serve
# Then, from another terminal: ollama run llama3 "Generate a poem about the sea."
Why it works: By explicitly telling numactl to use only the CPUs and memory of a specific NUMA node, you ensure that the process is entirely contained within that node, eliminating cross-node memory access for that instance. You might run multiple instances, each bound to a different node, for parallel processing.
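For the one-instance-per-node idea, each server needs its own listen address, which Ollama takes from the OLLAMA_HOST environment variable. The sketch below only prints the commands so you can review them before running anything; the function name and the choice of consecutive ports are assumptions for illustration, not an Ollama convention.

```shell
# print_per_node_commands NODES FIRST_PORT
# Prints one launch command per NUMA node, each bound to that node's
# CPUs and memory and listening on its own port. (Hypothetical helper;
# the port numbering is an arbitrary choice.)
print_per_node_commands() {
  nodes="$1"
  port="$2"
  n=0
  while [ "$n" -lt "$nodes" ]; do
    printf 'OLLAMA_HOST=127.0.0.1:%d numactl --cpunodebind=%d --membind=%d ollama serve\n' \
      "$((port + n))" "$n" "$n"
    n=$((n + 1))
  done
}
```

Calling `print_per_node_commands 2 11434` prints two commands, one for node 0 on port 11434 and one for node 1 on port 11435. A client picks an instance by setting the same variable, for example `OLLAMA_HOST=127.0.0.1:11435 ollama run llama3`. Each instance loads its own copy of the model, so every node needs enough RAM for it.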
Cause 3: Insufficient Local Memory on Active Node
Your system might have multiple NUMA nodes, but the most active CPU cores might be on a node that doesn’t have enough free RAM to hold the entire model, or even a significant portion of it.
Diagnosis Command: numastat -m and free -h
Exact Check: Check MemFree on the NUMA node(s) where your CPU cores are most active (free -h only shows the system-wide total). If it’s smaller than the model, the kernel places the overflow on other nodes under the default policy, and under a strict --membind it has to reclaim, swap, or fail the allocation. Either way, performance suffers.
Exact Fix: If possible, move the workload to a NUMA node with more available memory, or use --interleave=all to spread the model across nodes. numactl --preferred=<node> is a softer alternative to --membind: it tries the given node first and falls back to others instead of failing. If you must run on a node with limited memory, consider a smaller model or a more aggressive quantization.
# If node 1 has more free memory, run the server on node 1's cores and memory
numactl --cpunodebind=1 --membind=1 ollama serve
# Then, from another terminal: ollama run llama3 "Write a haiku about autumn."
Why it works: By directing the workload to a node with ample local memory, you ensure that the majority of the model’s data can reside in fast, local RAM, minimizing latency.
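Choosing the node can be automated from the "node N free: X MB" lines that `numactl --hardware` prints for each node. This is a rough sketch with a made-up function name; it looks only at free memory at that moment and ignores reclaimable page cache, so treat the result as a hint.

```shell
# node_with_most_free: read `numactl --hardware` output on stdin and
# print the ID of the node with the most free memory. (Hypothetical helper.)
node_with_most_free() {
  awk '$1 == "node" && $3 == "free:" {
         if (best == "" || $4 + 0 > max) { best = $2; max = $4 + 0 }
       }
       END { if (best != "") print best }'
}
```

One way to use it is `node="$(numactl --hardware | node_with_most_free)"` followed by `numactl --cpunodebind="$node" --membind="$node" ...`, after checking that the free figure really exceeds the model size.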
Cause 4: OS Scheduler Not NUMA-Aware Enough
The default Linux scheduler isn’t always optimal for NUMA workloads: it sometimes migrates threads between nodes too aggressively, or fails to keep them on the most advantageous node.
Diagnosis Command: htop (with the PROCESSOR column) and cat /proc/cmdline
Exact Check: Watch the CPU each Ollama thread runs on in htop. If threads keep hopping between CPUs that belong to different NUMA nodes, the scheduler might be working against you. Check your kernel boot parameters for NUMA-related settings such as numa_balancing= or numa=off.
Exact Fix: Tune the kernel’s automatic NUMA balancing (kernel.numa_balancing). This is an advanced topic. When you have not pinned the process yourself, enabling numa_balancing lets the kernel adjust thread and page placement automatically. If you have already pinned Ollama with numactl, balancing adds page-scanning overhead for little gain, so it is worth benchmarking with it both on and off.
# Check if numa_balancing is enabled (0 = off, 1 = on)
sysctl kernel.numa_balancing
# If not, enable it (requires root; the setting does not persist across reboots)
sudo sysctl -w kernel.numa_balancing=1
Why it works: numa_balancing is a kernel feature that periodically samples which node a task’s memory accesses go to, then moves the task toward its memory or migrates the pages toward the task. This reduces manual tuning and can improve performance for NUMA-sensitive applications.
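One way to judge whether automatic balancing is helping is to look at the NUMA hint-fault counters in /proc/vmstat. The sketch below computes the share of hint faults that were already local, using the numa_hint_faults and numa_hint_faults_local counters, which are present on kernels built with NUMA balancing. The function name is made up, and the counters are system-wide totals since boot, not specific to Ollama.

```shell
# hint_fault_locality: read /proc/vmstat-style "name value" lines on stdin
# and report what share of NUMA hint faults were local. (Hypothetical helper.)
hint_fault_locality() {
  awk '$1 == "numa_hint_faults"       { total = $2 }
       $1 == "numa_hint_faults_local" { localf = $2 }
       END {
         if (total > 0)
           printf "%.1f%% of NUMA hint faults were local\n", 100 * localf / total
         else
           print "no NUMA hint faults recorded"
       }'
}
```

Running `hint_fault_locality < /proc/vmstat` before and after an inference run gives a rough picture: a high and rising local share suggests tasks and their memory are converging on the same node.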
Cause 5: Docker/Containerization Isolation
If you’re running Ollama inside a Docker container, the container’s CPU and memory placement is not automatically aligned with the host’s NUMA topology: unless you restrict it, its threads and memory can land on any node.
Diagnosis Command: docker inspect <container_id> and numactl --hardware on the host.
Exact Check: In the docker inspect output, look at HostConfig.CpusetCpus and HostConfig.CpusetMems. If both are empty, the container may run on any CPU and allocate memory from any node. If they are set, compare them against the host topology from numactl --hardware.
Exact Fix: Run your Docker container with the --cpuset-cpus and --cpuset-mems flags to explicitly map container processes to specific CPUs and NUMA memory nodes.
docker run --gpus all -d --name ollama_server \
  --cpuset-cpus="0-11" \
  --cpuset-mems="0" \
  -p 11434:11434 \
  ollama/ollama
# Then run ollama commands inside the container, or use the ollama CLI on the host
# (the -p flag publishes Ollama's default port so the host CLI can reach the server).
# This example binds the container to CPUs 0-11, which are on node 0 in the example
# topology above, and restricts its memory to node 0.
Why it works: By explicitly defining which CPUs and memory nodes the container can access, you ensure that the containerized Ollama process respects the host’s NUMA topology and benefits from local memory access.
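Rather than typing the CPU list by hand, you can read it from sysfs, where each node directory under /sys/devices/system/node has a cpulist file in the same range notation that Docker’s cpuset flags accept. The sketch below is illustrative only: `docker_numa_flags` is a made-up helper, and the optional second argument exists just so the function can be pointed at a different directory.

```shell
# docker_numa_flags NODE [SYSFS_NODE_DIR]
# Prints the Docker flags that pin a container to one NUMA node's CPUs
# and memory. (Hypothetical helper.)
docker_numa_flags() {
  node="$1"
  sysroot="${2:-/sys/devices/system/node}"
  # cpulist holds the node's CPUs in range notation, e.g. 0-11,24-35
  cpulist="$(cat "$sysroot/node$node/cpulist")" || return 1
  printf '%s=%s %s=%s\n' --cpuset-cpus "$cpulist" --cpuset-mems "$node"
}
```

The output can be spliced into the run command, for example `docker run -d $(docker_numa_flags 0) ollama/ollama`. Unlike the hand-written range, this includes every CPU on the node, hyperthread siblings as well.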
The next common issue you’ll encounter after optimizing for NUMA is managing GPU memory allocation, especially when running multiple models or large models that exceed VRAM.