The surprising truth about Ollama’s GPU/CPU hybrid mode is that it never splits an individual layer across devices. Instead, it offloads whole layers to the GPU while the remaining layers run on the CPU.
Let’s see this in action. Imagine you have a model that’s too big to fit entirely into your GPU’s VRAM. Without partial offloading, the model would either fail to load or have to run entirely on the CPU. With hybrid mode, the GPU handles as many layers as it has room for, and the CPU handles the rest.
Here’s a common scenario: You’re running a large language model on a system with a decent CPU but a GPU with limited VRAM. You’ve tried to load the model, and it’s either failing or crawling at CPU speeds.
ollama run llama3:8b
If this command, or a similar one for a larger model, is slow or throws an out-of-memory error related to the GPU, it’s time to explore hybrid mode.
The core of this functionality lies in the OLLAMA_NUM_GPU environment variable. This variable doesn’t specify which layers go where, only how many layers Ollama should attempt to load onto the GPU. Placement is then handled internally: layers are assigned as a contiguous run, not hand-picked by size or computational cost. Note that upstream Ollama exposes the same setting as the num_gpu option (per request through the API, or as a Modelfile parameter), and support for the OLLAMA_NUM_GPU environment variable can vary by build, so confirm that your version honors it.
To enable this, set the environment variable before starting Ollama. For example, to confirm that offloading works at all on a GPU with 8GB of VRAM, you might start with OLLAMA_NUM_GPU=1, which asks Ollama to place a single layer on the GPU. Ollama reads its environment variables in the server process, so if a server is already running, restart it with the variable set.
export OLLAMA_NUM_GPU=1
ollama run llama3:8b
If you have more VRAM, you can experiment with higher values. For instance, OLLAMA_NUM_GPU=4 asks for four layers on the GPU. The key is that Ollama manages this placement automatically. It will load layers sequentially onto the GPU until it hits a limit (either the OLLAMA_NUM_GPU value or, more practically, the available VRAM). Once it reaches that limit, all remaining layers are processed by the CPU.
The benefit is a significant speedup for the layers that run on the GPU, while still letting you run models that are too large to fit entirely in VRAM. It’s a pragmatic compromise that makes use of whatever hardware you have. Ollama handles the data transfer between CPU and GPU at the boundary of the offloaded layers for you.
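The placement rule described above can be sketched in a few lines of Python. This is an illustrative model only, not Ollama’s actual code: the function name plan_offload, the uniform 150 MB layer size, and the 2,000 MB VRAM budget are all assumptions made for the example.

```python
def plan_offload(layer_sizes_mb, vram_budget_mb, num_gpu):
    """Return (layers_on_gpu, layers_on_cpu) for a simple greedy placement.

    Hypothetical model of the behavior described in the text: layers are
    taken in placement order until either the requested count (num_gpu)
    or the VRAM budget is reached. Everything else stays on the CPU.
    """
    used_mb = 0
    on_gpu = 0
    for size_mb in layer_sizes_mb:
        if on_gpu >= num_gpu:
            break  # reached the requested number of GPU layers
        if used_mb + size_mb > vram_budget_mb:
            break  # the next layer would not fit in VRAM
        used_mb += size_mb
        on_gpu += 1
    return on_gpu, len(layer_sizes_mb) - on_gpu


layers = [150] * 32  # assumed: 32 equally sized layers of 150 MB each

print(plan_offload(layers, vram_budget_mb=2000, num_gpu=4))   # (4, 28): capped by num_gpu
print(plan_offload(layers, vram_budget_mb=2000, num_gpu=32))  # (13, 19): capped by VRAM
```

The second call shows why a large requested value is harmless in this model: the VRAM budget becomes the effective limit long before the layer count does.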
The exact number of layers that can be offloaded is highly dependent on the specific model architecture, the size of each layer, and your GPU’s VRAM. For a Llama 3 8B model, OLLAMA_NUM_GPU=1 might offload a few hundred MB, while OLLAMA_NUM_GPU=10 might push it closer to using a significant portion of a 6GB or 8GB GPU. You’ll need to experiment to find the sweet spot for your hardware and model.
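As a rough back-of-the-envelope check on those figures, the weight memory for a given setting can be estimated by dividing the model size evenly across its layers. The sketch below assumes the 4-bit llama3:8b download is about 4,700 MB, that the model has 32 transformer layers, and that the layers are equally sized. The function name is hypothetical, and the estimate covers weights only: the KV cache, compute buffers, and the embedding and output tensors add to real VRAM usage.

```python
def estimate_offload_mb(model_size_mb, n_layers, num_gpu):
    """Approximate weight memory (MB) for num_gpu offloaded layers.

    Assumes equally sized layers; ignores KV cache and runtime buffers.
    """
    per_layer_mb = model_size_mb / n_layers
    # Requesting more layers than the model has cannot offload more than all of them.
    return per_layer_mb * min(num_gpu, n_layers)


MODEL_SIZE_MB = 4700  # assumed size of a 4-bit Llama 3 8B model
N_LAYERS = 32         # assumed number of transformer layers

for n in (1, 10, 32):
    print(f"num_gpu={n}: ~{estimate_offload_mb(MODEL_SIZE_MB, N_LAYERS, n):.0f} MB of weights")
```

Treat the result as a lower bound and compare it against what your GPU monitoring tool reports once the model is loaded.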
The underlying mechanism involves Ollama’s runtime detecting how much VRAM is available and then allocating a contiguous block of layers to the GPU. During inference, each layer is computed on the device that holds it. Because the offloaded layers form a single block, the intermediate results only need to cross between CPU and GPU at the edge of that block, not between every pair of layers.
A common misconception is that you can precisely control which specific layers go to the GPU. You can’t. You can only control the number of layers Ollama attempts to offload. Ollama’s inference engine then makes the final decision based on its internal heuristics and available memory.
One subtle point that most users overlook is that the performance gain isn’t linear with the number of offloaded layers. Beyond a certain point, the overhead of moving intermediate results between the CPU and GPU, plus the time spent waiting on the layers that remain on the CPU, can offset the benefit of adding more GPU layers. This is especially true if your system has a slow PCIe bus or if the CPU is heavily burdened. You might find that OLLAMA_NUM_GPU=5 is only marginally faster than OLLAMA_NUM_GPU=3, or even slower in some edge cases.
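The only reliable way to find that point is to measure it. The sketch below is one possible approach, under a few assumptions: it uses the per-request num_gpu option of Ollama’s REST API instead of the environment variable, it assumes a server listening on the default local port, and the helper names (build_request, tokens_per_second, benchmark) are made up for this example. Throughput is derived from the eval_count and eval_duration fields of the response, with the duration reported in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model, prompt, num_gpu):
    """Build a non-streaming generate request that asks for num_gpu GPU layers."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},
    }


def tokens_per_second(response):
    """Generation speed from the response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model, prompt, num_gpu):
    """Send one request to a running Ollama server and return tokens per second."""
    body = json.dumps(build_request(model, prompt, num_gpu)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))
```

With a server running, calling benchmark("llama3:8b", "Explain PCIe in one paragraph.", n) for a handful of values of n and comparing the results shows where the curve flattens on your hardware. Run each setting more than once, since the first request after a change also includes model load time.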
The next challenge you’ll likely encounter is optimizing the type of layers that are offloaded, which is a more advanced configuration not directly controlled by OLLAMA_NUM_GPU.