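Ollama has no built-in setting that caps how much memory it may use, at least not in the way you might expect from a traditional application. It does expose a few server-side knobs, such as OLLAMA_MAX_LOADED_MODELS and OLLAMA_KEEP_ALIVE, but most of what feels like a "limit" comes from the environment it runs in.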

Let’s start with Ollama in action. Imagine you’ve just installed Ollama and want to run a small model, say phi3:mini.

# Pull the model
ollama pull phi3:mini

# Run the model
ollama run phi3:mini
>>> What is the capital of France?
Paris.

This seems straightforward. You pull a model, you run it. But what happens under the hood, especially when you start running multiple models or larger ones? Ollama leverages the operating system’s memory management and, for GPU acceleration, the CUDA or Metal drivers. The "limits" you perceive are less about Ollama’s internal settings and more about your system’s overall capacity and how Ollama requests resources.

When you ollama pull a model, it’s downloaded to disk. When you ollama run a model, the Ollama server loads it into memory. If a GPU is available, it tries to place the model weights in VRAM. If there is no GPU, the model lives entirely in system RAM; if VRAM is too small for the whole model, Ollama offloads as many layers as fit onto the GPU and keeps the rest in system RAM.

The problem is, Ollama, by default, will try to load all of a model’s weights into RAM or VRAM when you run it. It doesn’t dynamically swap parts of the model in and out like some more sophisticated inference engines might. This means if you try to run a 70B parameter model on a system with 16GB of RAM, it’s going to fail.
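The arithmetic behind that failure is easy to sketch: weight size is roughly the parameter count times the bits per weight, divided by eight. The helper below is a hypothetical illustration (estimate_weights_gb is not part of Ollama), and it ignores the KV cache and runtime overhead, so real usage will be higher.

```shell
# Rough size of model weights in GB:
#   parameters (billions) x bits per weight / 8
# Hypothetical helper; ignores KV cache and runtime overhead.
estimate_weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

estimate_weights_gb 70 4    # 70B model at 4-bit  -> 35.0
estimate_weights_gb 70 16   # 70B model at 16-bit -> 140.0
estimate_weights_gb 8 4     # 8B model at 4-bit   -> 4.0
```

Even at 4-bit precision, a 70B model needs about 35 GB for its weights alone, more than double the 16GB in the example.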

So, how do you "cap" these resources? You’re not really capping Ollama directly, but rather managing the environment in which it runs and the requests it makes.

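The primary lever you have is environment variables. Ollama respects certain environment variables that influence its behavior, particularly regarding GPU usage and model loading. These are read by the server process (ollama serve), so they must be set where the server starts, not just in the shell where you type ollama run.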

1. GPU Acceleration

This is the most significant factor in resource consumption. If you have a capable GPU, Ollama will try to use it.

  • Diagnosis: Check if Ollama is using your GPU.

    # On Linux with an NVIDIA GPU
    nvidia-smi

    # On any platform, including macOS with Apple Silicon:
    # the PROCESSOR column shows the CPU/GPU split for each loaded model
    ollama ps

    If nvidia-smi lists an ollama process holding VRAM, or ollama ps reports a GPU share, Ollama is using the GPU.

  • Fix: To disable GPU acceleration on an NVIDIA system and force Ollama to use only system RAM, hide the GPUs from the server by setting CUDA_VISIBLE_DEVICES to an invalid ID such as -1.

    # Start the server with no visible GPUs
    CUDA_VISIBLE_DEVICES=-1 ollama serve
    # Then, in another terminal, run your ollama command
    ollama run llama3

    If Ollama already runs as a background service, set the variable in that service’s environment and restart it instead. This forces Ollama to load the model entirely into system RAM, which is often more plentiful but makes inference significantly slower. This is your most direct way to "cap" GPU VRAM usage.
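GPU offload can also be controlled per request through the API's num_gpu option, which sets how many layers are placed on the GPU. The sketch below assumes a server on the default address and a model that is already pulled; both function names are hypothetical, and the payload builder does no JSON escaping.

```shell
# Build a JSON body for Ollama's /api/generate endpoint with GPU offload disabled.
# "num_gpu" is the number of layers to place on the GPU; 0 keeps them all on the CPU.
# Hypothetical helper: it does no JSON escaping, so keep the prompt free of quotes.
cpu_only_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_gpu":0}}' "$1" "$2"
}

# Send the request to a server assumed to listen on the default address.
generate_cpu_only() {
  curl -s http://127.0.0.1:11434/api/generate -d "$(cpu_only_payload "$1" "$2")"
}
```

Calling generate_cpu_only llama3 Hello would then ask the server to answer that request without using VRAM, which is handy for testing CPU-only behavior without restarting anything.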

2. Model Quantization

While not an Ollama setting, the model file itself dictates its resource footprint. Quantization reduces the precision of the model’s weights (e.g., from 16-bit floating point to 4-bit integers), drastically shrinking its size and RAM/VRAM requirements.

  • Diagnosis: Look at the size of the model you’re pulling.

    ollama list
    # Example output:
    # NAME          ID          SIZE    MODIFIED
    # llama3:8b     ...         4.7 GB  2 days ago
    # phi3:mini     ...         1.9 GB  1 week ago
    

    The default llama3:8b tag is already quantized to roughly 4 bits per weight (the "8b" means 8 billion parameters, not 8-bit precision), which is why it fits in 4.7 GB. Larger models (e.g., 70B) will be much larger even when quantized.

  • Fix: Choose smaller, more aggressively quantized models. For example, instead of llama3:70b, try llama3:8b or phi3:mini. You can also find GGUF models on platforms like Hugging Face that are quantized to 4-bit (e.g., Q4_K_M) and import them into Ollama using the ollama create command.

    # Example of creating a model from a GGUF file
    ollama create my-quantized-model -f ./Modelfile
    # Where Modelfile might contain:
    # FROM ./path/to/your/gguf/model.gguf
    

    This directly reduces the memory required to load the model.
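A quick way to sanity-check how aggressively a model is quantized is to divide its listed size by its parameter count. The helper below is a hypothetical sketch, not an Ollama command, and because model files also carry some metadata the result is only approximate.

```shell
# Approximate bits per weight from a model's listed size:
#   size (GB) x 8 / parameters (billions)
# Hypothetical helper; file sizes include some metadata, so treat the result as rough.
bits_per_weight() {
  awk -v size_gb="$1" -v params="$2" 'BEGIN { printf "%.1f\n", size_gb * 8 / params }'
}

bits_per_weight 4.7 8   # the llama3:8b entry listed above -> 4.7
```

A result between 4 and 5 points to a 4-bit quantization; an unquantized 16-bit model of the same size would come out near 16.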

3. System-Level Memory Limits (cgroups/Docker)

If you’re running Ollama within a container or on a system with cgroups configured, you can impose hard limits on the memory available to the Ollama process.

  • Diagnosis: Check your container runtime or system configuration.

    • Docker: Inspect your container’s configuration.
      docker inspect <container_id_or_name>
      # Look for "Memory" and "MemorySwap" settings
      
    • Systemd/cgroups: Examine the service file for Ollama if you’re running it as a systemd service.
      systemctl cat ollama.service
      # Look for MemoryMax, MemoryHigh directives
      
  • Fix: Configure your container or service to limit memory.

    • Docker: Start or run your container with memory limits.
      docker run -d --memory=4g --name ollama ollama/ollama
      # Or for an existing container:
      docker update --memory=4g <container_id_or_name>
      
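      This prevents Ollama (and any models it loads) from exceeding 4 GB of RAM. The limit is hard: a model that needs more is not throttled, and the process is killed by the kernel’s out-of-memory handler instead.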
    • Systemd: Edit the service file and add memory limits.
      [Service]
      # ... other directives
      MemoryMax=4G
      # ...
      
      Then reload and restart the service:
      sudo systemctl daemon-reload
      sudo systemctl restart ollama
      
      This caps the whole Ollama service, meaning the server and every model it loads, at 4 GB of RAM. The limit applies to the service’s cgroup, not to the system as a whole.
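To confirm that a limit is actually in effect, you can read it back from the cgroup filesystem. This is a minimal sketch assuming cgroup v2, where the limit lives in a file named memory.max; the function name and its optional path argument are hypothetical conveniences.

```shell
# Print the cgroup v2 memory limit in GiB, or "unlimited" when none is set.
# The optional path argument defaults to the current cgroup's memory.max,
# which is what a process inside a container would see.
cgroup_mem_limit() {
  limit_file="${1:-/sys/fs/cgroup/memory.max}"
  [ -r "$limit_file" ] || { echo "unknown"; return 1; }
  raw=$(cat "$limit_file")
  if [ "$raw" = "max" ]; then
    echo "unlimited"
  else
    awk -v b="$raw" 'BEGIN { printf "%.1f GiB\n", b / (1024 * 1024 * 1024) }'
  fi
}
```

Inside a container started with --memory=4g, cgroup_mem_limit should report the 4 GiB cap. For a systemd service, systemctl show ollama -p MemoryMax reports the configured value directly.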

4. Running Multiple Models Concurrently

The Ollama server, not the ollama run client, holds models in memory. Each model you request is loaded by the server and, by default, stays loaded for about five minutes after its last request. If several models are loaded at the same time, their memory requirements add up.

  • Diagnosis: Use system monitoring tools to see total RAM/VRAM usage.

    # Linux
    htop
    
    # macOS
    open -a "Activity Monitor"
    
    # GPU specific (NVIDIA)
    nvidia-smi
    

    Observe the combined RAM or VRAM usage of ollama processes.

  • Fix: Avoid loading too many models at once, or make sure your system has enough resources for the number you intend to run concurrently. Exiting a session with /bye does not unload the model: after ollama run modelA and then ollama run modelB, modelA usually stays in memory until its keep-alive timer expires. Unload it explicitly with ollama stop, or cap concurrency on the server with OLLAMA_MAX_LOADED_MODELS.

    # Start model A
    ollama run modelA
    >>> Hello!
    # Exit the session (the model stays loaded for now)
    >>> /bye
    
    # Unload model A explicitly
    ollama stop modelA
    
    # Now start model B
    ollama run modelB
    

    Stopping model A first ensures that only one model is loaded at a time, rather than two or more simultaneously.
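To see how much memory the loaded models claim in total, you can add up the sizes that ollama ps reports. The sketch below is a hypothetical helper that assumes the column layout NAME, ID, SIZE, PROCESSOR, UNTIL with sizes given in GB; the sample rows are made up for illustration.

```shell
# Sum the SIZE column of `ollama ps` output, read from stdin.
# Assumes sizes appear as "<number> GB" in the third and fourth fields.
sum_loaded_gb() {
  awk 'NR > 1 && $4 == "GB" { total += $3 } END { printf "%.1f\n", total }'
}

# Made-up sample in the assumed format; in practice use: ollama ps | sum_loaded_gb
sum_loaded_gb <<'EOF'
NAME      ID              SIZE      PROCESSOR    UNTIL
modelA    aaaaaaaaaaaa    6.7 GB    100% GPU     4 minutes from now
modelB    bbbbbbbbbbbb    2.9 GB    100% GPU     2 minutes from now
EOF
# -> 9.6
```

If the total approaches your available RAM or VRAM, stop a model before starting another.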

5. OLLAMA_HOST and Port Binding

While not a direct resource cap, how Ollama is exposed can indirectly affect resource management if you’re running multiple instances or want to isolate them.

  • Diagnosis: Check running processes and network listeners.

    ps aux | grep ollama
    ss -tulnp | grep 11434
    
  • Fix: By default, Ollama binds to 127.0.0.1:11434. If you need to run multiple Ollama instances on the same machine (e.g., for different projects or resource profiles), you can bind them to different IP addresses or ports.

    # Run an instance on a specific IP/port (the address must belong to this machine)
    OLLAMA_HOST=192.168.1.100:11435 ollama serve &
    # And another on a different IP/port
    OLLAMA_HOST=192.168.1.101:11436 ollama serve &
    

    Each of these instances would then load models independently, allowing you to manage their resource consumption separately, although you’re still limited by the host system’s total resources.
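The ollama client reads OLLAMA_HOST as well, so the same variable selects which instance a command talks to. The wrapper below is a hypothetical convenience, not an Ollama feature, and it assumes the given address matches one an instance was started on.

```shell
# Run an ollama client command against a specific instance.
# Hypothetical wrapper: the first argument is host:port, the rest is the ollama command.
ollama_at() {
  host="$1"
  shift
  OLLAMA_HOST="$host" ollama "$@"
}
```

For example, ollama_at 192.168.1.100:11435 ps would list the models loaded by the first instance, and ollama_at 192.168.1.101:11436 ps those loaded by the second.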

The key takeaway is that Ollama’s "resource limits" are primarily managed by controlling its environment, choosing the right models, and leveraging OS-level or containerization tools.

The next hurdle you’ll likely encounter is managing model versions and ensuring compatibility when you start to use more advanced features like the OpenAI API compatibility layer.
