Ollama, when properly configured, uses your NVIDIA GPU for massive speedups on AI model inference, but sometimes it just doesn’t seem to be picking it up.

Start with the driver. Make sure a compatible NVIDIA driver is installed by running nvidia-smi. The output should look similar to this, showing your driver version and GPU details:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8     7W /  N/A |    123MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

If nvidia-smi doesn’t run or shows an error, you need to install or update your NVIDIA drivers. For Ubuntu, this often involves sudo apt update && sudo apt install nvidia-driver-535 (replace 535 with the recommended version for your hardware).
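The driver check can be wrapped in a small shell function for reuse. This is a minimal sketch, not an official tool: the name check_driver is made up for illustration, and it relies only on nvidia-smi's --query-gpu and --format options.

```shell
# Hypothetical helper: report whether the NVIDIA driver is usable.
check_driver() {
    # nvidia-smi ships with the driver; if it is missing, the driver is
    # most likely not installed (or its tools are not on PATH).
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: install or update the NVIDIA driver" >&2
        return 1
    fi
    # Print one CSV line per GPU: name, driver version, total memory.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
}
```

Calling check_driver prints one line per GPU on a working system and returns a non-zero status otherwise, which makes it easy to use as a guard in a larger setup script.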

Next, consider the CUDA Toolkit. Ollama’s official builds typically bundle their own CUDA libraries, so a system-wide toolkit is usually optional, but if you have one installed it should be compatible with both your NVIDIA driver and Ollama. The CUDA Version field in the nvidia-smi output above shows the highest CUDA version your driver supports. You can check your installed toolkit version with nvcc --version. If you need the toolkit and it’s not installed, download it from the NVIDIA developer website.

Now, let’s check Ollama’s configuration. Ollama usually detects CUDA automatically if the drivers and libraries are correctly set up. The key is that the libcudart.so library needs to be discoverable by Ollama.

The most common reason Ollama doesn’t use the GPU is that it can’t find the necessary CUDA libraries. This often happens if your system’s library path (LD_LIBRARY_PATH) isn’t configured correctly, or if Ollama is installed in a way that isolates it from system libraries.

Here’s how to diagnose and fix it:

1. Verify Ollama is running and check its logs. Run ollama serve in your terminal. If it’s already running in the background, you might need to stop and restart it for new configurations to take effect. Look for any error messages related to CUDA or NVIDIA.
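Server logs can be long, so it helps to filter them down to GPU-related lines. The sketch below assumes a hypothetical helper name (gpu_log_lines) and, for the journalctl example, that Ollama was installed as a systemd service named ollama, which is what the Linux install script normally sets up.

```shell
# Hypothetical helper: keep only log lines that mention the GPU stack.
# Reads log text on stdin and matches case-insensitively.
gpu_log_lines() {
    grep -iE 'cuda|nvidia|gpu'
}

# On a systemd-based Linux install, the service logs live in the journal:
#   journalctl -u ollama --no-pager | gpu_log_lines
# When running the server in the foreground, capture its output instead:
#   ollama serve 2>&1 | tee ollama.log
#   gpu_log_lines < ollama.log
```

If the filter prints nothing at all, that is itself a hint that the server never attempted GPU discovery or that you are looking at the wrong log.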

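2. Check if Ollama sees the GPU. Run a model with ollama run <any-model-name> and send it a prompt; GPU inference is usually noticeably faster than CPU-only inference. (ollama show only prints model details and doesn’t run inference, so it can’t tell you which device is in use.) A more direct check is to run ollama ps while the model is loaded. Its PROCESSOR column shows where the model is running, for example 100% GPU, 100% CPU, or a CPU/GPU split.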
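The ollama ps check can be scripted. The following is a sketch under stated assumptions: model_placement is a hypothetical helper, and it assumes the output has a header row followed by one row per loaded model, with a processor value such as "100% GPU" or "100% CPU" somewhere on the row, as recent Ollama releases print. Adjust the patterns if your version formats this differently.

```shell
# Hypothetical helper: summarize where each loaded model is running.
# Reads `ollama ps` output on stdin; skips the header row and blank lines.
model_placement() {
    awk 'NR > 1 && NF > 0 {
        if ($0 ~ /100% GPU/)      where = "fully on GPU"
        else if ($0 ~ /100% CPU/) where = "CPU only"
        else                      where = "split between CPU and GPU"
        print $1 ": " where
    }'
}

# Usage: load a model first (for example with `ollama run <model>`), then:
#   ollama ps | model_placement
```

A "CPU only" result for a model that should fit in VRAM is the symptom the remaining steps try to explain.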

3. Ensure CUDA libraries are in the path. Ollama needs to find libcudart.so. The typical location for this is within your CUDA installation, e.g., /usr/local/cuda/lib64/. You can test if it’s discoverable by running:

ldconfig -p | grep libcudart.so

You should see output like:

	libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/lib64/libcudart.so.12

If this command doesn’t show the library, Ollama won’t find it.
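To see exactly which files the loader cache resolves, the grep can be replaced with a small filter that prints only the paths. This is an illustrative sketch; lib_paths is a hypothetical name, and it assumes the usual ldconfig -p line layout shown above (library name first, resolved path last).

```shell
# Hypothetical helper: print the resolved path of every cached library
# whose name starts with the given prefix. Reads `ldconfig -p` on stdin.
lib_paths() {
    awk -v prefix="$1" 'index($1, prefix) == 1 { print $NF }'
}

# Usage:
#   ldconfig -p | lib_paths libcudart.so   # CUDA runtime library
#   ldconfig -p | lib_paths libcuda.so     # library installed with the NVIDIA driver
```

More than one path for the same library usually means several CUDA installations are registered, which is worth keeping in mind for the later step on conflicting installations.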

4. Set LD_LIBRARY_PATH (if necessary). If ldconfig -p shows the library but Ollama still isn’t using the GPU, you might need to explicitly tell Ollama where to look. The most robust way is to ensure your system’s LD_LIBRARY_PATH is set correctly before starting Ollama. Edit your ~/.bashrc or ~/.zshrc file and add:

export LD_LIBRARY_PATH="/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

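Replace /usr/local/cuda/lib64 with the actual path to your CUDA lib64 directory if it’s different. After saving the file, run source ~/.bashrc (or source ~/.zshrc) and then ollama serve from that same shell. Note that shell exports only affect processes started from that shell. If Ollama runs as a systemd service (the default for the Linux install script), set the variable on the service instead: run sudo systemctl edit ollama.service, add Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64" under [Service], and restart with sudo systemctl restart ollama.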
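Because shell startup files are often sourced more than once, a plain prepend can pile up duplicate entries, and prepending to an empty variable can leave an empty entry that the loader treats as the current directory. The sketch below shows one way to avoid both; prepend_path is a hypothetical helper, not part of Ollama or CUDA.

```shell
# Hypothetical helper: print DIR prepended to a colon-separated LIST.
# Leaves LIST unchanged if DIR is already in it, and never emits an
# empty entry when LIST is empty.
prepend_path() {
    local dir=$1 list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;            # already present
        *)          printf '%s\n' "$dir${list:+:$list}" ;;  # prepend
    esac
}

# Usage (the CUDA path is the example location used in this article):
#   export LD_LIBRARY_PATH="$(prepend_path /usr/local/cuda/lib64 "$LD_LIBRARY_PATH")"
```

The function only computes the new value; exporting it is left to the caller so the same helper can be reused for PATH or other list-style variables.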

5. Check Ollama installation method. If you installed Ollama via Docker, you need to ensure the Docker container has access to your NVIDIA GPU. This requires the NVIDIA Container Toolkit. You’ll need to run your Ollama container with the --gpus all flag:

docker run --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

If you’re using docker-compose, add the following to your service definition:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
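Once the container is running, it is worth confirming that the GPU is actually visible from inside it. The following sketch assumes the container is named ollama, as in the docker run example above, and that the NVIDIA Container Toolkit makes nvidia-smi available inside the container; container_sees_gpu is a hypothetical helper name.

```shell
# Hypothetical helper: check that a running container can see the GPU.
# Assumes the NVIDIA Container Toolkit exposes nvidia-smi in the container.
container_sees_gpu() {
    local name=${1:-ollama}   # container name; defaults to "ollama"
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found on PATH" >&2
        return 2
    fi
    # `nvidia-smi -L` lists the GPUs visible to the calling process.
    if docker exec "$name" nvidia-smi -L; then
        echo "container '$name' can see the GPU"
    else
        echo "container '$name' cannot see a GPU" >&2
        return 1
    fi
}
```

If the check fails even though nvidia-smi works on the host, the usual suspects are a missing --gpus flag or a missing or misconfigured NVIDIA Container Toolkit.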

6. Ensure Ollama is built with CUDA support. If you’re building Ollama from source, you must ensure the build process is configured to use CUDA. This typically involves build settings such as LLAMA_CUDA=1 and having the CUDA toolkit development headers available; the exact flags differ between releases, so confirm them against the development guide for the version you are building. Official pre-compiled binaries are normally built with CUDA support, so this step rarely applies unless you are using a third-party or custom build.

7. Verify NVIDIA driver compatibility. Sometimes, a very new or very old NVIDIA driver might have compatibility issues with the CUDA version Ollama expects. Check the Ollama documentation or GitHub issues for known driver version compatibilities. A common fix is to move to a more stable, widely tested driver version, for example by rolling back from a beta driver to a production branch.

8. Check for conflicting CUDA installations. If you have multiple CUDA versions installed, or if your system has CUDA libraries in unexpected locations, it can confuse Ollama. Use which nvcc and ldconfig -p | grep cuda to identify all CUDA-related paths. You might need to clean up or explicitly set LD_LIBRARY_PATH to prioritize the correct CUDA installation.

9. Check for resource exhaustion. This is less a detection problem than a performance one: if your GPU is already heavily used by other processes (e.g., other AI tasks, gaming), Ollama might struggle to allocate VRAM. Use nvidia-smi to monitor GPU memory usage. If Memory-Usage is close to your total VRAM (8192MiB in the example output above), you’ll need to free up resources.
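Reading memory figures out of the full nvidia-smi table gets tedious, so the query mode can be combined with a small formatter. This is a sketch: vram_report is a hypothetical helper, and it assumes the CSV rows come from the index, memory.used, and memory.total query fields with units suppressed, so the values are in MiB.

```shell
# Hypothetical helper: turn nvidia-smi CSV rows ("index, used, total",
# values in MiB) into a per-GPU memory summary. Reads the rows on stdin.
vram_report() {
    awk -F', *' 'NF == 3 && $3 > 0 {
        # Percentage is truncated to a whole number by %d.
        printf "GPU %d: %d of %d MiB used (%d%%)\n", $1, $2, $3, ($2 * 100) / $3
    }'
}

# Usage:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#              --format=csv,noheader,nounits | vram_report
```

Running it before and after loading a model shows roughly how much VRAM the model takes and whether other processes are crowding it out.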

Once your GPU is correctly detected and utilized, the next thing you’ll likely encounter is optimizing model loading times, especially for larger models.
