Ollama models don’t just use RAM; for practical purposes they live in it. A model’s weights must be loaded into memory before it can process a single token.

Let’s see Llama 2 7B in action.

# Download Llama 2 7B
ollama pull llama2

# Run Llama 2 7B and ask it a question
ollama run llama2 "What is the capital of France?"

You’ll notice that ollama pull takes a while, since it downloads several gigabytes of weights. The first ollama run also pauses while those weights are read from disk into memory; after that, responses start almost immediately because the weights are already resident. Inference itself does not go back to disk for every token.

The core problem Ollama solves is making it incredibly easy to run large language models locally. Instead of dealing with complex Python environments, CUDA setups, and model file conversions, you get a single binary and a simple command-line interface. It abstracts away the underlying complexity of loading and running these massive neural networks.

Internally, Ollama is written in Go and relies on llama.cpp, a C/C++ inference engine, for the heavy lifting. When you ollama pull, it downloads the model weights (usually quantized and packaged in the GGUF file format) and stores them in its model directory. When you ollama run, the backend loads these weights into memory. For inference, llama.cpp runs on the CPU and can offload layers to a GPU through backends such as CUDA or Metal. The key is that the model’s weights must reside in memory, whether that is system RAM, GPU VRAM, or a split between the two.

The size tag on Ollama models isn’t just a marketing number; it directly correlates to the number of parameters (billions) and, consequently, the RAM footprint. A 7B model has roughly 7 billion parameters. Each parameter is typically stored as a floating-point number.

Here’s a breakdown of RAM requirements by model size, assuming standard 16-bit floating-point (FP16) precision for a rough estimate. Quantized models will use less, but this gives you the upper bound.

  • 0.5B (e.g., Qwen 1.5 0.5B): ~1 GB RAM. These are small enough to run on most modern laptops with plenty of headroom.
  • 1B (e.g., TinyLlama 1.1B): ~2 GB RAM. Still very manageable.
  • 3B (e.g., Phi-2 at 2.7B): ~6 GB RAM. You’ll want at least 8GB total system RAM to comfortably run these.
  • 7B (e.g., Llama 2 7B, Mistral 7B): ~14 GB RAM. At FP16 this does not fit entirely on a 12GB VRAM card, so expect partial offload there; CPU-only inference needs at least 16GB system RAM.
  • 13B (e.g., Llama 2 13B): ~26 GB RAM. You’re definitely looking at 32GB system RAM or a high-end GPU with 24GB VRAM for significant offloading.
  • 30B/34B (e.g., CodeLlama 34B): ~60-70 GB RAM. This is firmly in workstation/server territory. You’ll need 64GB RAM or more. (Mixtral 8x7B, despite its name, has about 47B total parameters and lands closer to ~94 GB at FP16.)
  • 70B (e.g., Llama 2 70B): ~140 GB RAM. This requires substantial hardware, often multiple high-VRAM GPUs or a server with 128GB+ RAM.
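The figures in the list above come from one multiplication: parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic; the function name weights_gb is made up for illustration, and it assumes decimal gigabytes (10^9 bytes) and 2 bytes per parameter for FP16.

```python
def weights_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the weights alone, in decimal GB.

    Assumption: every parameter is stored at the same width
    (2 bytes for FP16). Overhead such as the KV cache is ignored.
    """
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    # Sizes taken from the list above.
    for label, size_b in [("0.5B", 0.5), ("7B", 7), ("13B", 13), ("34B", 34), ("70B", 70)]:
        print(f"{label}: ~{weights_gb(size_b):.0f} GB at FP16")
```

Passing bytes_per_param=4.0 gives the FP32 figure instead, which is why FP16 is only an upper bound relative to quantized files.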

These numbers are minimums for the model weights alone. You also need RAM for the operating system, Ollama itself, the model’s context (the KV cache, which grows with context length), and any other applications running. For smooth operation, it’s wise to add a buffer of at least 4-8GB on top of the model’s requirement, especially for larger models or if you plan to run multiple models concurrently.
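As a rough sanity check, the buffer rule can be turned into a one-line comparison. This is only a sketch: fits_in_ram is a hypothetical helper, and the 6 GB default is simply the midpoint of the 4-8GB range suggested above, not a measured value.

```python
def fits_in_ram(weights_gb: float, total_ram_gb: float, headroom_gb: float = 6.0) -> bool:
    """Return True if the weights plus a safety buffer fit in system RAM.

    Assumption: headroom_gb stands in for the OS, Ollama itself,
    the KV cache, and other applications.
    """
    return weights_gb + headroom_gb <= total_ram_gb


if __name__ == "__main__":
    # A 7B model at FP16 (~14 GB) on a 16 GB and a 32 GB machine.
    print(fits_in_ram(14, 16))  # 14 + 6 exceeds 16
    print(fits_in_ram(14, 32))  # 14 + 6 fits within 32
```

With this buffer, a 16GB machine is the bare minimum for an FP16 7B model rather than a comfortable fit, which is one reason quantized files are the usual choice.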

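The "size" listed in ollama list (e.g., about 3.8GB for the default llama2 tag, which is 4-bit quantized) is the size of the model file on disk. For a quantized model it is also a good first estimate of the memory the weights occupy once loaded, before the KV cache and other overhead. The ~14GB figure for a 7B model applies only to an unquantized FP16 copy, where each parameter takes up 2 bytes. Quantization (e.g., Q4_K_M) reduces this significantly, to roughly a third of the FP16 size, but the principle remains: the model’s parameters must be in memory.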
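To see how quantization changes the estimate, swap the 2 bytes per parameter for a bits-per-weight figure. The sketch below is illustrative only: the values for Q8_0 and Q4_0 are my understanding of the llama.cpp block layouts (32 weights plus a 16-bit scale per block), and K-quants such as Q4_K_M are left out because their effective width varies by tensor (they land somewhat above Q4_0).

```python
# Assumed effective bits per weight; treat these as approximations.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,   # 32 x 8-bit values + one 16-bit scale per block
    "Q4_0": 4.5,   # 32 x 4-bit values + one 16-bit scale per block
}


def quantized_weights_gb(params_billions: float, quant: str) -> float:
    """Approximate weight memory in decimal GB for a given quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{quantized_weights_gb(7, quant):.1f} GB")
```

For a nominal 7B model this gives about 3.9 GB at Q4_0, in the same range as the file size reported for a 4-bit 7B download.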

By default Ollama decides how many layers to offload to a GPU; you can override this with the num_gpu parameter, set with PARAMETER num_gpu in a Modelfile or in the options of an API request. Either way, Ollama doesn’t magically reduce the total memory needed. Instead, it distributes the model’s layers between your system RAM and your GPU’s VRAM. If you have a 7B model at FP16 (14GB needed) and a GPU with 8GB VRAM, Ollama can place up to roughly 8GB of the model on the GPU (a little less in practice, since VRAM also holds the KV cache and scratch buffers) and the remaining ~6GB in system RAM. The total memory pressure remains, just distributed.
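The split in that example can be approximated by dividing the model evenly across its layers. This is a simplified sketch, not how Ollama actually schedules layers: split_layers is a hypothetical helper, it assumes all layers are the same size, and it ignores embeddings, the KV cache, and scratch buffers unless you account for them through vram_reserve_gb.

```python
def split_layers(model_gb: float, n_layers: int, vram_gb: float,
                 vram_reserve_gb: float = 0.0) -> tuple[int, float, float]:
    """Estimate how many layers fit on the GPU.

    Returns (gpu_layers, gb_on_gpu, gb_in_system_ram).
    Assumption: every layer occupies model_gb / n_layers.
    """
    per_layer_gb = model_gb / n_layers
    usable_vram = max(vram_gb - vram_reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_vram // per_layer_gb))
    gb_on_gpu = gpu_layers * per_layer_gb
    return gpu_layers, gb_on_gpu, model_gb - gb_on_gpu


if __name__ == "__main__":
    # 7B model at FP16 (~14 GB, 32 transformer layers) on an 8 GB card.
    layers, on_gpu, in_ram = split_layers(14.0, 32, 8.0)
    print(f"{layers} layers on GPU (~{on_gpu:.1f} GB), ~{in_ram:.1f} GB in system RAM")
```

The resulting layer count is the kind of value you would pass as num_gpu, though a real setup should reserve some VRAM for the context.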

The most surprising thing about Ollama’s RAM usage is that "loading" a model isn’t a cost you pay once and forget; a loaded model stays resident in memory between requests. By default Ollama keeps a model loaded for five minutes after its last request (configurable through the keep_alive request parameter or the OLLAMA_KEEP_ALIVE environment variable) and then unloads it. If you’re running a 70B model at FP16, that ~140GB of RAM stays occupied for that whole window, even if you’re not actively sending prompts. This is why resource management is critical; keeping multiple large models loaded simultaneously will quickly exhaust even high-end systems.
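One way to release that memory explicitly is the keep_alive field of Ollama’s HTTP API: a request to /api/generate with keep_alive set to 0 asks the server to unload the model right away. The sketch below only builds the request; it assumes the default local endpoint on port 11434, and build_keep_alive_request is a name chosen for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_keep_alive_request(model: str, keep_alive) -> urllib.request.Request:
    """Build a POST request that sets how long a model stays loaded.

    keep_alive=0 requests an immediate unload; a duration string
    such as "30m" keeps the model resident longer.
    """
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_keep_alive_request("llama2", 0)
    print(req.full_url, req.data.decode("utf-8"))
    # To actually send it (requires a running Ollama server):
    # urllib.request.urlopen(req)
```

Running ollama ps before and after shows which models are currently loaded.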

The next hurdle is understanding how quantization affects performance and memory, and how to balance it with the num_gpu setting for optimal inference speed.
