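Ollama doesn’t have a built-in, configurable hard cap on memory in the way you might expect from a traditional application. The server settings it does expose, such as OLLAMA_MAX_LOADED_MODELS and OLLAMA_KEEP_ALIVE, control how many models stay loaded and for how long, not how many bytes they may use.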
Let’s see Ollama in action. Imagine you’ve just installed Ollama and want to run a small model, say Phi-3 Mini (phi3:mini in the Ollama library).
# Pull the model
ollama pull phi3:mini
# Run the model
ollama run phi3:mini
>>> What is the capital of France?
Paris.
This seems straightforward. You pull a model, you run it. But what happens under the hood, especially when you start running multiple models or larger ones? Ollama leverages the operating system’s memory management and, for GPU acceleration, the CUDA or Metal drivers. The "limits" you perceive are less about Ollama’s internal settings and more about your system’s overall capacity and how Ollama requests resources.
When you ollama pull a model, it’s downloaded to disk. When you ollama run a model, the Ollama server loads it into memory. If a GPU is available, it tries to place the model weights in GPU VRAM. If there is no GPU, the model runs from system RAM; if VRAM is too small for the whole model, Ollama offloads as many layers as fit to the GPU and keeps the rest in system RAM.
The problem is that, by default, Ollama tries to hold all of a model’s weights in RAM or VRAM while it runs. It doesn’t stream parts of the model in and out on demand the way some more sophisticated inference engines might. This means if you try to run a 70B-parameter model on a system with 16GB of RAM, it’s going to fail: even at 4-bit quantization, the weights alone take roughly 35GB.
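A rough back-of-the-envelope check makes the sizing question concrete. The sketch below assumes the weights dominate memory use and estimates their size as parameters × bits per weight ÷ 8. The function name estimate_weights_gb is made up for illustration, and real model files run somewhat larger because of mixed-precision layers, the KV cache, and runtime overhead.

```shell
# Rough size of a model's weights in GB (decimal), ignoring KV cache and overhead.
# $1 = parameter count in billions, $2 = bits per weight after quantization.
estimate_weights_gb() {
    LC_ALL=C awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

# Usage:
#   estimate_weights_gb 70 4    # 70B model at 4-bit
#   estimate_weights_gb 8 16    # 8B model at 16-bit
```

Treat the result as a lower bound when deciding whether a model has any chance of fitting on a given machine.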
So, how do you "cap" these resources? You’re not really capping Ollama directly, but rather managing the environment in which it runs and the requests it makes.
The primary lever you have is environment variables. Ollama respects certain environment variables that influence its behavior, particularly regarding GPU usage and model loading.
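One detail that is easy to miss: most of these variables are read by the server process (ollama serve), so they need to be set where the server starts, not just in the shell where you type ollama run. As a hedged sketch, the helper below (show_ollama_env is a hypothetical name) prints the values your current shell would hand to a server started from it. The "(server default)" text is only this script’s label for an unset variable, not a value Ollama reports.

```shell
# Print the Ollama-related variables a server started from this shell would inherit.
# Unset variables are shown with a placeholder rather than a guessed value.
show_ollama_env() {
    for name in OLLAMA_HOST OLLAMA_MAX_LOADED_MODELS OLLAMA_KEEP_ALIVE CUDA_VISIBLE_DEVICES; do
        # Indirect variable lookup that also works in plain POSIX sh.
        eval "value=\${$name-}"
        printf '%s=%s\n' "$name" "${value:-(server default)}"
    done
}

# Usage:
#   show_ollama_env
#   OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_KEEP_ALIVE=1m ollama serve
```

The second usage line is an example of setting values for the server itself; the numbers are arbitrary.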
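1. GPU Acceleration (CUDA_VISIBLE_DEVICES, num_gpu)
This is the most significant factor in resource consumption. If you have a capable GPU, Ollama will try to use it.
- Diagnosis: Check if Ollama is using your GPU.
# Any platform: the PROCESSOR column shows where each loaded model is running
ollama ps
# Linux with an NVIDIA GPU
nvidia-smi
# macOS with Apple Silicon: there is no built-in nvidia-smi equivalent;
# Activity Monitor's GPU History window (Window > GPU History) is the simplest check.
If nvidia-smi shows ollama processes consuming VRAM, it’s using the GPU.
- Fix: To disable GPU acceleration and force Ollama to use only system RAM on an NVIDIA system, start the server with CUDA_VISIBLE_DEVICES set to an invalid device ID such as -1. The variable has to be set for the server process (ollama serve), not for the ollama run client.
# Start the server with no visible NVIDIA GPUs
CUDA_VISIBLE_DEVICES=-1 ollama serve
# Then, in another terminal, run your model
ollama run llama3
Alternatively, set the num_gpu parameter to 0 for a single model (for example, /set parameter num_gpu 0 inside an ollama run session) so that no layers are offloaded to the GPU. Either way, the model is loaded entirely into system RAM, which is often much more plentiful but significantly slower for inference. This is your most direct way to "cap" GPU VRAM usage.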
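To make the GPU check less of a guess, recent Ollama releases provide ollama ps, which lists loaded models along with a PROCESSOR column. The sketch below classifies that column. The function name where_loaded is illustrative, and the exact column wording it matches (100% GPU, 100% CPU, or a split such as 48%/52% CPU/GPU) is an assumption that may differ between versions.

```shell
# Classify where models are running, based on `ollama ps`-style text on stdin.
# Prints one word per model line: gpu, cpu, split, or unknown.
where_loaded() {
    # Skip the header row, then inspect each model line.
    tail -n +2 | while IFS= read -r line; do
        case "$line" in
            *CPU/GPU*) echo split ;;    # layers divided between CPU and GPU
            *"% GPU"*) echo gpu ;;      # fully resident in VRAM
            *"% CPU"*) echo cpu ;;      # running from system RAM only
            *)         echo unknown ;;
        esac
    done
}

# Usage (requires a running Ollama server):
#   ollama ps | where_loaded
```

A result of split usually means the model did not fit in VRAM, which is a hint to try a smaller or more heavily quantized model.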
2. Model Quantization
While not an Ollama setting, the model file itself dictates its resource footprint. Quantization reduces the precision of the model’s weights (e.g., from 16-bit floating point to 4-bit integers), drastically shrinking its size and RAM/VRAM requirements.
- Diagnosis: Look at the size of the model you’re pulling.
ollama list
# Example output:
# NAME          ID     SIZE      MODIFIED
# llama3:8b     ...    4.7 GB    2 days ago
# phi3:mini     ...    1.9 GB    1 week ago
The 8b in llama3:8b is the parameter count (8 billion), not the bit width; at 4.7 GB, the default tag is already quantized to roughly 4 bits per weight. Larger models (e.g., 70B) will be much larger even when quantized.
- Fix: Choose smaller, more aggressively quantized models. For example, instead of llama3:70b, try llama3:8b or phi3:mini. You can also find models on platforms like Hugging Face that are specifically quantized to 4-bit (e.g., Q4_K_M) and then import them into Ollama using the ollama create command.
# Example of creating a model from a GGUF file
ollama create my-quantized-model -f ./Modelfile
# Where Modelfile might contain:
# FROM ./path/to/your/gguf/model.gguf
This directly reduces the memory required to load the model.
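As a small sketch of the import step, the helper below writes a minimal Modelfile for a local GGUF file. The function name write_modelfile is hypothetical. The optional num_ctx line is an addition not discussed above: lowering the context length shrinks the KV cache and trims memory use further, and the 2048 in the usage line is an arbitrary example value.

```shell
# Write a minimal Modelfile that imports a local GGUF file.
# $1 = path to the .gguf file, $2 = output Modelfile path,
# $3 = optional context length (smaller values reduce KV-cache memory).
write_modelfile() {
    printf 'FROM %s\n' "$1" > "$2"
    if [ -n "${3:-}" ]; then
        printf 'PARAMETER num_ctx %s\n' "$3" >> "$2"
    fi
}

# Usage:
#   write_modelfile ./path/to/your/gguf/model.gguf ./Modelfile 2048
#   ollama create my-quantized-model -f ./Modelfile
```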
3. System-Level Memory Limits (cgroups/Docker)
If you’re running Ollama within a container or on a system with cgroups configured, you can impose hard limits on the memory available to the Ollama process.
- Diagnosis: Check what limit, if any, your container runtime or service manager currently applies.
# Docker: the MEM USAGE / LIMIT column shows the cap for the container
docker stats --no-stream ollama
# systemd: shows the configured MemoryMax for the service
systemctl show ollama --property=MemoryMax
- Fix: Configure your container or service to limit memory.
  - Docker: Start or run your container with memory limits.
docker run -d --memory=4g --name ollama ollama/ollama
# Or for an existing container:
docker update --memory=4g <container_id_or_name>
This prevents Ollama (and any models it loads) from using more than 4GB of RAM. A model that needs more than that will fail to load or be killed by the kernel’s out-of-memory handler instead of starving the rest of the host.
  - Systemd: Edit the service file and add memory limits.
[Service]
# ... other directives
MemoryMax=4G
# ...
Then reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
This ensures the Ollama service, and thus its loaded models, cannot consume more than 4GB of RAM. Both limits cover system RAM only; they do not cap GPU VRAM.
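To confirm that a limit actually took effect, you can read it back from the cgroup filesystem inside the container or service. The sketch below assumes cgroup v2, where the limit lives in /sys/fs/cgroup/memory.max; on cgroup v1 hosts the file is elsewhere, so pass its path as the argument. The function name cgroup_memory_limit is illustrative.

```shell
# Report the memory limit the current cgroup imposes, in MiB.
# $1 = limit file (defaults to the cgroup v2 location).
cgroup_memory_limit() {
    limit_file="${1:-/sys/fs/cgroup/memory.max}"
    [ -r "$limit_file" ] || { echo "unknown"; return 1; }
    limit="$(cat "$limit_file")"
    if [ "$limit" = "max" ]; then
        # cgroup v2 writes the literal word "max" when no cap is set.
        echo "unlimited"
    else
        echo "$((limit / 1048576)) MiB"
    fi
}

# Usage (inside the container or service environment):
#   cgroup_memory_limit
```

With the 4g limit from the Docker example, the limit file would contain 4294967296 and the function would print 4096 MiB.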
4. Running Multiple Models Concurrently
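Models are loaded by the Ollama server, not by the ollama run client. Each distinct model you request gets loaded, and by default it stays in memory for about five minutes after its last use. If you use several models in quick succession, their memory requirements stack.
- Diagnosis: Use ollama ps to list the models currently loaded, and system monitoring tools to see total RAM/VRAM usage.
# Models currently loaded by the server
ollama ps
# Linux
htop
# macOS: open Activity Monitor
# GPU specific (NVIDIA)
nvidia-smi
Observe the combined RAM or VRAM usage of ollama processes.
- Fix: The "fix" here is to avoid keeping too many models loaded at once, or to ensure your system has sufficient resources for the number of models you intend to run concurrently. Exiting a session with /bye (or Ctrl+D) ends the chat but does not unload the model: if you run ollama run modelA and then ollama run modelB, modelA may still be loaded until its keep-alive expires. To free the memory immediately, stop the model explicitly.
# Start model A
ollama run modelA
>>> Hello!
# Exit the session
>>> /bye
# Unload model A from memory
ollama stop modelA
# Now start model B
ollama run modelB
You can also set OLLAMA_MAX_LOADED_MODELS=1 on the server so that it unloads the previous model, once idle, before loading the next one. Either approach ensures that only one model is loaded at a time, rather than two or more simultaneously.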
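If you switch models often, unloading them by hand gets tedious. The sketch below assumes an Ollama version that provides the ollama ps and ollama stop subcommands, and that the first column of ollama ps output is the model name. The function names loaded_models and unload_all are made up for this example.

```shell
# Extract model names from `ollama ps`-style text on stdin (skips the header row).
loaded_models() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Ask the server to unload every currently loaded model.
unload_all() {
    ollama ps | loaded_models | while IFS= read -r model; do
        ollama stop "$model"
    done
}

# Usage (requires a running Ollama server):
#   unload_all && ollama run modelB
```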
5. OLLAMA_HOST and Port Binding
While not a direct resource cap, how Ollama is exposed can indirectly affect resource management if you’re running multiple instances or want to isolate them.
- Diagnosis: Check running processes and network listeners.
ps aux | grep ollama
ss -tulnp | grep 11434
- Fix: By default, Ollama binds to 127.0.0.1:11434. If you need to run multiple Ollama instances on the same machine (e.g., for different projects or resource profiles), you can bind them to different IP addresses or ports. The addresses below are examples; each one must be an address actually assigned to the host.
# Run an instance on a specific IP/port
OLLAMA_HOST=192.168.1.100:11435 ollama serve &
# And another on a different IP/port
OLLAMA_HOST=192.168.1.101:11436 ollama serve &
Each of these instances would then load models independently, allowing you to manage their resource consumption separately, although you’re still limited by the host system’s total resources. Client commands need the same OLLAMA_HOST value to reach a particular instance.
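When several instances are running, it helps to have a shorthand for pointing the client at the right one. The wrapper below is a minimal sketch that assumes the instances listen on 127.0.0.1 and differ only by port; ollama_on is a hypothetical helper name, and you would adjust the address for instances bound elsewhere.

```shell
# Run an ollama client command against the instance listening on a given local port.
# $1 = port; the remaining arguments are passed to `ollama` unchanged.
ollama_on() {
    port="$1"
    shift
    OLLAMA_HOST="127.0.0.1:${port}" ollama "$@"
}

# Usage:
#   ollama_on 11435 ps
#   ollama_on 11436 run llama3
```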
The key takeaway is that Ollama’s "resource limits" are primarily managed by controlling its environment, choosing the right models, and leveraging OS-level or containerization tools.
The next hurdle you’ll likely encounter is managing model versions and ensuring compatibility when you start to use more advanced features like the OpenAI API compatibility layer.