A single "tokens per second" figure rarely tells you how fast Ollama actually generates text, which is why benchmarks can be misleading.
Here’s how Ollama handles model inference and what you’re really seeing when you run a benchmark.
First, the surprise: Ollama’s performance isn’t about raw LLM computation speed in isolation. It’s a complex dance between model loading, prompt processing, and the actual generation of output tokens. The "tokens per second" you see in many benchmarks is a calculated value, often derived from the total tokens generated divided by the time taken for the entire operation, from prompt submission to final output. This includes overheads that can significantly mask the underlying model’s true generation speed.
Let’s see this in action. Imagine you’re running a small model, Phi-3 Mini (phi3 in the Ollama library), on your machine.
```shell
# Start Ollama server (if not already running)
ollama serve &

# Run a prompt and time it
time ollama run phi3 "Tell me a short story about a brave knight."
```
When you run this, Ollama does several things:
- Model Download/Load: If phi-3-mini isn’t cached, it downloads and loads the model weights into memory. This can be a significant time sink on the first run.
- Prompt Processing: The input prompt is tokenized and fed through the model to establish the initial context. This is often much faster than generating new tokens but still contributes to the total time.
- Token Generation: The model then iteratively predicts the next token, appends it to the sequence, and uses the new sequence to predict the subsequent token, and so on. This is the core LLM inference loop.
- Output Formatting: The generated tokens are converted back into human-readable text and streamed back to your terminal.
The time command in the example above captures the duration of all these steps. Suppose a run takes 30 seconds in total: 5 seconds loading, 2 seconds on prompt processing, and 23 seconds generating 345 tokens. An end-to-end benchmark would report 345 tokens / 30 seconds = 11.5 tokens/sec, while the generation speed alone is 345 tokens / 23 seconds = 15 tokens/sec. The gap widens as the overheads grow relative to the generation time.
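The difference between end-to-end and generation-only throughput can be checked with a few lines of shell. This is a minimal sketch using the illustrative figures from the example (345 tokens, 30 seconds overall, 23 seconds generating); tps is a hypothetical helper name, not an Ollama command.

```shell
# tps TOKENS SECONDS: print tokens per second with one decimal place.
# (Hypothetical helper for illustration; not part of Ollama.)
tps() {
  awk -v tokens="$1" -v seconds="$2" 'BEGIN { printf "%.1f\n", tokens / seconds }'
}

tokens=345        # tokens generated
total_s=30        # wall-clock time: load + prompt processing + generation
generation_s=23   # time spent in the generation loop only

echo "end-to-end: $(tps "$tokens" "$total_s") tokens/s"
echo "generation: $(tps "$tokens" "$generation_s") tokens/s"
```

With these inputs the two lines report 11.5 and 15.0 tokens/s respectively, so the same run looks about a quarter slower when the overheads are counted.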
The core problem Ollama solves is making it incredibly easy to run any compatible LLM locally, abstracting away the complexities of model quantization, hardware acceleration (CPU/GPU), and API management. It provides a unified interface for interacting with diverse models, regardless of their underlying architecture or size.
Internally, Ollama relies on llama.cpp and the GGML tensor library underneath it to run models on the CPU or GPU. It manages the memory allocation for model weights and activations, handles prompt tokenization using model-specific tokenizers, and orchestrates the iterative generation process. The "server" aspect means it exposes an HTTP API, allowing other applications to interact with the models without needing to manage the inference engine directly.
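As an illustration of the HTTP API, the sketch below sends one non-streaming request with curl. It assumes a server listening on Ollama’s default address (localhost:11434), the /api/generate endpoint, and an already pulled phi3 model. The helper names build_payload and generate are hypothetical, and the payload builder does no JSON escaping, so it only suits prompts without double quotes or backslashes.

```shell
# build_payload MODEL PROMPT: print a JSON request body for /api/generate.
# Assumption: neither argument contains characters that need JSON escaping.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

# generate MODEL PROMPT: POST the request to a local Ollama server.
# Assumes the default address; adjust the URL if your server listens elsewhere.
generate() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# Example (requires a running server and a pulled model):
# generate phi3 "Tell me a short story about a brave knight."
```

Setting "stream" to false asks the server for a single JSON object instead of a stream of partial responses, which is easier to handle from a script.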
You have several levers to pull to influence Ollama’s perceived performance. The most impactful is the choice of model: larger models are generally slower but more capable. Quantization is also key: a Q4_K_M quantized model will be significantly faster and use less VRAM than an F16 version of the same model, with a trade-off in accuracy. Hardware acceleration matters just as much; a GPU with enough VRAM to hold the whole model will usually far outpace CPU inference. Finally, the length of the prompt and the number of tokens requested both affect the total time, and therefore any end-to-end "tokens per second" figure. A long prompt adds prompt-processing time before the first output token appears, while a short generation gives the fixed overheads (loading, prompt processing) little time to amortize, so a 100-token run can report a lower end-to-end rate than a 1000-token run on the same model and hardware.
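To put rough numbers on two of these levers, the sketch below estimates weight memory from parameter count and bits per weight, and shows how a fixed overhead weighs on end-to-end throughput at different generation lengths. All figures are assumptions for illustration: 3.8 billion parameters for Phi-3 Mini, roughly 4.85 bits per weight for Q4_K_M, and the same 7 seconds of overhead and steady 15 tokens/s generation rate for both runs. Actual memory use is higher because of the KV cache and runtime buffers.

```shell
# weights_gb PARAMS_BILLION BITS_PER_WEIGHT: approximate size of the weights in GB.
weights_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

# end_to_end_tps TOKENS GEN_RATE OVERHEAD_S: tokens/s including a fixed overhead.
end_to_end_tps() {
  awk -v n="$1" -v rate="$2" -v overhead="$3" \
    'BEGIN { printf "%.1f\n", n / (overhead + n / rate) }'
}

echo "F16 weights:    $(weights_gb 3.8 16) GB"
echo "Q4_K_M weights: $(weights_gb 3.8 4.85) GB"
echo "100 tokens:     $(end_to_end_tps 100 15 7) tokens/s end-to-end"
echo "1000 tokens:    $(end_to_end_tps 1000 15 7) tokens/s end-to-end"
```

Under these assumptions the quantized weights need about a third of the F16 footprint, and the 100-token run reports roughly half the end-to-end rate of the 1000-token run even though the generation speed is identical.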
The most common misconception is that an end-to-end "tokens per second" figure, computed from the wall-clock time around ollama run, is purely the model’s generation speed. In reality, it’s a holistic measure that includes prompt processing, model loading (on first run), and output formatting. To get a clearer picture of generation speed alone, you need the time spent in the iterative token prediction loop. Plain ollama run doesn’t print this, but ollama run --verbose reports the load duration, prompt eval rate, and eval rate separately, and the HTTP API returns the same timings in its final response. External benchmarking tools often offer even more granular control over what is being measured.
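One way to isolate generation speed is to use the timing fields in Ollama’s /api/generate response, which include total_duration, load_duration, prompt_eval_duration, eval_count, and eval_duration, with durations given in nanoseconds. The sketch below converts such values into rates. The rate_from_ns helper is hypothetical, and the sample values are illustrative numbers chosen to match a 30-second request with 23 seconds of generation, not measurements from a real run.

```shell
# rate_from_ns COUNT DURATION_NS: tokens per second from a token count and
# a duration in nanoseconds (the unit Ollama uses for its *_duration fields).
rate_from_ns() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Illustrative values, not measurements: 345 tokens generated in 23 s,
# within a request that took 30 s overall.
eval_count=345
eval_duration=23000000000
total_duration=30000000000

echo "eval rate:       $(rate_from_ns "$eval_count" "$eval_duration") tokens/s"
echo "end-to-end rate: $(rate_from_ns "$eval_count" "$total_duration") tokens/s"
```

Dividing eval_count by eval_duration gives the generation-only rate, while dividing by total_duration reproduces the lower end-to-end figure that a stopwatch-style benchmark would report.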
The next hurdle you’ll likely encounter is understanding how to effectively leverage GPU acceleration for different model sizes and types within Ollama.