The most surprising thing about running LLMs locally with Ollama on Windows WSL2 is that you never install a Linux GPU driver: WSL2 reuses the NVIDIA driver already installed on the Windows host, and the result is a smooth, GPU-accelerated experience.
Let’s see it in action. Imagine you want to run a model like llama3. With the native Windows build of Ollama, you’d run ollama run llama3 from a Windows terminal. The same command works inside WSL2, where it can also use your NVIDIA GPU.
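First, ensure a current NVIDIA driver is installed on your Windows host. You don’t install a driver inside WSL2; the host driver is exposed to the Linux guest through a passthrough mechanism. Inside your WSL2 distribution (e.g., Ubuntu), you can also install the CUDA toolkit. Ollama’s Linux release typically bundles the CUDA runtime libraries it needs, so the toolkit matters mainly if you plan to build or run other CUDA software.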
# Inside your WSL2 Ubuntu distribution
sudo apt update && sudo apt upgrade -y
sudo apt install nvidia-cuda-toolkit -y
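This installs the CUDA compiler and libraries for CUDA-aware applications. If Ollama is not yet installed inside the distribution, install it with the official script (curl -fsSL https://ollama.com/install.sh | sh). Now, when you run Ollama within WSL2, it can detect and use your GPU.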
# Inside your WSL2 Ubuntu distribution
ollama pull llama3
ollama run llama3
The first pull downloads several gigabytes of weights, so its speed depends on your connection; after that the model is cached locally. Inference on the GPU will be significantly faster than CPU-only execution. The key here is that Ollama’s CUDA-enabled build, running within WSL2, communicates with the NVIDIA driver exposed by the Windows host.
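One way to confirm that a loaded model is actually on the GPU is to look at the PROCESSOR column of ollama ps. The sketch below assumes that column reads like "100% GPU", "100% CPU", or a CPU/GPU split; classify_processor is a hypothetical helper name, not part of Ollama.

```shell
# classify_processor is a hypothetical helper: given a line (or just the
# PROCESSOR field) from `ollama ps`, report where the model is running.
# Assumption: the column contains the upper-case words CPU and/or GPU.
classify_processor() {
  case "$1" in
    *CPU*GPU*|*GPU*CPU*) echo "split" ;;   # partly offloaded to the GPU
    *GPU*)               echo "gpu" ;;     # fully on the GPU
    *CPU*)               echo "cpu" ;;     # CPU fallback
    *)                   echo "unknown" ;;
  esac
}

# Only query Ollama when the CLI is installed; skip the header row.
if command -v ollama >/dev/null 2>&1; then
  ollama ps | tail -n +2 | while IFS= read -r line; do
    echo "$(classify_processor "$line"): $line"
  done
fi
```

Run it in a second terminal while ollama run llama3 is active; a "cpu" result suggests the GPU was not detected.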
The problem Ollama solves is making powerful, large language models accessible on personal hardware without needing a cloud subscription or a dedicated Linux machine. It abstracts away the complexity of model downloading, management, and serving, providing a simple command-line interface. Internally, Ollama downloads quantized model weights (stored in the GGUF file format) to reduce memory footprint and then runs them with an optimized inference engine (llama.cpp, which has a CUDA backend).
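To see why quantization matters for memory footprint, here is a rough back-of-the-envelope estimate. It assumes weight size is simply parameters × bits per weight ÷ 8; real quantized files run somewhat larger, and the KV cache and runtime overhead are ignored. estimate_weights_gib is a hypothetical helper name.

```shell
# estimate_weights_gib PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough size of the weights alone, in GiB (illustrative arithmetic only).
estimate_weights_gib() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / (1024 ^ 3) }'
}

estimate_weights_gib 8 16   # an 8B-parameter model at 16 bits per weight
estimate_weights_gib 8 4    # the same model at a nominal 4 bits per weight
```

Under these assumptions the 16-bit model needs about 14.9 GiB for weights alone, while the 4-bit version needs about 3.7 GiB, which is the difference between not fitting and fitting on many consumer GPUs.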
The exact levers you control are primarily the models themselves and their configurations. You can pull various models:
ollama pull mistral
ollama pull codellama:7b
ollama pull phi3
And then run them:
ollama run mistral
ollama run codellama:7b "Write a Python function to calculate Fibonacci numbers."
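Beyond choosing a model, its configuration can be adjusted with a Modelfile. The following is a minimal sketch: the model name mistral-concise, the parameter values, and the system prompt are all illustrative assumptions, not recommendations.

```shell
# Write a small Modelfile that derives a custom model from mistral.
# The values below are illustrative assumptions.
cat > Modelfile <<'EOF'
FROM mistral
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
SYSTEM You are a concise assistant that answers in plain English.
EOF

# Build the custom model only when the Ollama CLI is available.
# "mistral-concise" is a hypothetical name chosen for this example.
if command -v ollama >/dev/null 2>&1; then
  ollama create mistral-concise -f Modelfile \
    || echo "ollama create failed (is the server running and mistral pulled?)"
fi
```

Once created, the custom model runs like any other: ollama run mistral-concise.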
The performance difference between CPU and GPU is stark. On a CPU, generating a few hundred tokens can take tens of seconds to minutes. On a compatible GPU with enough VRAM to hold the model, this can drop to a few seconds. The critical component for this GPU acceleration within WSL2 is WSL2’s own GPU passthrough: the Windows driver exposes the GPU to the guest as the /dev/dxg device and provides its user-mode libraries (including libcuda.so) under /usr/lib/wsl/lib, so that the CUDA runtime inside WSL2 can see and interact with the host’s GPU hardware. The nvidia-container-toolkit is needed only if you run Ollama inside a Docker container, where it sets up the device mappings for the container. If the GPU is not visible, Ollama falls back to CPU inference, even if CUDA libraries are installed within WSL2.
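To put numbers on the difference, you can compute tokens per second yourself. This sketch assumes the eval_count and eval_duration (nanoseconds) fields that Ollama’s generate API reports; ollama run llama3 --verbose prints a comparable eval rate. tokens_per_second is a hypothetical helper, and the sample figures are illustrative, not measurements.

```shell
# tokens_per_second EVAL_COUNT EVAL_DURATION_NS
# Converts a token count and a duration in nanoseconds to tokens/second.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "n/a"; exit }
    printf "%.1f\n", n * 1e9 / ns
  }'
}

# Illustrative figures only (not benchmarks):
tokens_per_second 300 60000000000   # 300 tokens in 60 s
tokens_per_second 300 4000000000    # 300 tokens in 4 s
```

With these made-up inputs the rates come out to 5.0 and 75.0 tokens per second; substitute the values from your own runs to compare CPU and GPU on your hardware.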
A common point of confusion is where the drivers need to be installed. People often try to install NVIDIA’s Linux drivers directly inside WSL2, which is not how it works and can break GPU access by overriding the libraries WSL2 provides. The drivers are managed by the Windows host operating system, and WSL2’s integration with the Windows kernel allows it to access those drivers through a special interface. This means your Windows NVIDIA drivers need to be up to date for the best compatibility and performance.
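A quick way to check that the host driver is visible from inside WSL2 is sketched below. It assumes the usual WSL2 layout, where the GPU appears as /dev/dxg and the host-provided libraries sit under /usr/lib/wsl/lib; check_path is a hypothetical helper name.

```shell
# check_path LABEL PATH: report whether a path exists inside the guest.
check_path() {
  if [ -e "$2" ]; then
    echo "ok: $1 ($2)"
  else
    echo "missing: $1 ($2)"
  fi
}

# Assumed WSL2 locations for the GPU device and the host driver library.
check_path "WSL2 GPU device" /dev/dxg
check_path "host-provided CUDA driver library" /usr/lib/wsl/lib/libcuda.so.1

# nvidia-smi is supplied by the Windows driver; -L lists detected GPUs.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L || echo "nvidia-smi ran but could not reach the driver"
else
  echo "missing: nvidia-smi on PATH"
fi
```

If any of these report "missing", updating the Windows NVIDIA driver and running wsl --update from Windows is a reasonable first step before touching anything inside the distribution.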
The next concept you’ll likely explore is fine-tuning models locally or serving Ollama models via its built-in API for integration with other applications.