Ollama’s batch inference capability doesn’t just speed up your LLM requests; it fundamentally changes how you think about parallel processing by intelligently grouping disparate requests.
Let’s see it in action. Imagine you have three separate requests, each with a different prompt and potentially different model parameters, all hitting your Ollama server at roughly the same time.
```sh
# Request 1
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'

# Request 2
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral",
    "prompt": "What are the benefits of meditation?",
    "stream": false
  }'

# Request 3
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Explain the concept of recursion in programming.",
    "stream": false
  }'
```
Without batching, each of these requests would likely be processed independently. If llama3 is busy with Request 1, Request 3 would have to wait. If you have many such requests, especially to the same model, you can end up with a lot of idle time on your GPU as it waits for one request to finish before starting the next.
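To reproduce the "roughly the same time" arrival pattern from a single client, a minimal Python sketch might look like the following. It assumes a local Ollama server at the default address used above. The helper names (`build_payload`, `send`, `run_concurrently`, `demo`) are made up for illustration and are not part of any Ollama client library.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl calls


def build_payload(model, prompt):
    """Build the JSON body used by the curl examples (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}


def send(payload, url=OLLAMA_URL):
    """POST one request and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_concurrently(payloads, send_fn=send, max_workers=4):
    """Submit all payloads to a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_fn, payloads))


def demo():
    """Fire the three example requests together (needs a running server)."""
    payloads = [
        build_payload("llama3", "Why is the sky blue?"),
        build_payload("mistral", "What are the benefits of meditation?"),
        build_payload("llama3", "Explain the concept of recursion in programming."),
    ]
    for payload, result in zip(payloads, run_concurrently(payloads)):
        print(payload["model"], "->", result.get("response", "")[:80])
```

Calling `demo()` against a running server sends all three requests from separate threads, so the server sees them arrive together rather than one after another.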
Ollama’s batch inference, configured through the environment of the ollama serve process, groups these incoming requests instead of handling each one from scratch. It identifies requests destined for the same model and, crucially, those that can share context. The model is loaded once. Then, as further requests arrive, any that start with tokens the server has already processed (for example, the next turn of an ongoing conversation, or a prompt that begins with the same system prompt) can reuse that work, and requests for the same model can join the current processing batch. The model then doesn’t have to re-process the common leading tokens for every request. It’s like having multiple students in a classroom: instead of explaining the same fundamental concept to each student individually, the teacher explains it once to the whole class and then addresses individual follow-up questions.
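One way to picture the saving is to count how many leading tokens a new prompt shares with one that was already processed. The sketch below is a toy model, not Ollama’s actual cache: it splits on whitespace instead of using a real tokenizer, and `toy_tokenize`, `shared_prefix_len`, and `tokens_to_evaluate` are hypothetical helper names.

```python
def toy_tokenize(text):
    """Stand-in tokenizer: real models use subword tokens, not whitespace."""
    return text.split()


def shared_prefix_len(cached, incoming):
    """Number of leading tokens that are identical in both sequences."""
    count = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        count += 1
    return count


def tokens_to_evaluate(cached_prompt, new_prompt):
    """Tokens of new_prompt that are not covered by the cached prefix."""
    cached = toy_tokenize(cached_prompt)
    incoming = toy_tokenize(new_prompt)
    return len(incoming) - shared_prefix_len(cached, incoming)


system = "You are a concise assistant. Answer in one sentence."
first = system + " Why is the sky blue?"
second = system + " Why is the sea salty?"
unrelated = "Explain the concept of recursion in programming."

print(tokens_to_evaluate(first, second))     # 2: only the differing tail
print(tokens_to_evaluate(first, unrelated))  # 7: nothing shared, whole prompt
```

The second prompt shares the system text and the words "Why is the" with the first, so only its last two toy tokens are new. The unrelated prompt shares nothing and has to be evaluated in full.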
The primary lever you control is how Ollama is configured to handle concurrency and batching. When you run ollama serve, it starts an HTTP server. The underlying logic for batching is largely internal, but its effectiveness is influenced by:
- Model Loading: Ollama keeps loaded models in memory. If multiple requests target the same model, they can be queued for the same loaded instance.
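- Context Sharing: The core of batching is context sharing. If requests begin with the same tokens or are sequential (like turns in a chat that resend the earlier history), Ollama can avoid re-processing those initial prompt tokens for each. Semantic similarity alone is not enough; the overlap has to be at the start of the prompt.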
- GPU Utilization: By grouping requests, Ollama aims to keep the GPU busy with actual inference work rather than waiting for data or finishing one-off tasks.
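Because only requests for the same model can share a loaded instance, a client can make the server’s job easier by bucketing its own work per model before sending it. A minimal sketch of that client-side step follows; `group_by_model` is a hypothetical helper, and nothing here is Ollama API.

```python
from collections import defaultdict


def group_by_model(requests):
    """Bucket request bodies by their "model" field, preserving arrival order."""
    groups = defaultdict(list)
    for req in requests:
        groups[req["model"]].append(req)
    return dict(groups)


requests = [
    {"model": "llama3", "prompt": "Why is the sky blue?"},
    {"model": "mistral", "prompt": "What are the benefits of meditation?"},
    {"model": "llama3", "prompt": "Explain the concept of recursion in programming."},
]

for model, batch in group_by_model(requests).items():
    print(model, len(batch))
# llama3 2
# mistral 1
```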
Ollama doesn’t expose an explicit batch size or batch timeout as command-line flags for ollama serve, the way some other inference servers do. Its batching is more dynamic and context-aware. However, the OLLAMA_NUM_PARALLEL environment variable controls how many requests each loaded model attempts to process concurrently. Setting OLLAMA_NUM_PARALLEL=4 (for example) tells Ollama to try to handle up to four requests per model simultaneously, subject to its internal batching logic. Higher values generally need more memory, because each parallel request gets its own share of context.
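For reference, here is a hedged sketch of starting the server from Python with an explicit concurrency setting. `OLLAMA_NUM_PARALLEL`, `OLLAMA_MAX_LOADED_MODELS`, and `OLLAMA_MAX_QUEUE` are the variable names given in Ollama’s FAQ; check them and their defaults against the version you run. `build_serve_env` and `start_server` are illustrative helpers, and the values are examples, not recommendations.

```python
import os
import subprocess


def build_serve_env(num_parallel=4, max_loaded_models=None, max_queue=None, base=None):
    """Return a copy of the environment with Ollama concurrency settings added."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    if max_loaded_models is not None:
        env["OLLAMA_MAX_LOADED_MODELS"] = str(max_loaded_models)
    if max_queue is not None:
        env["OLLAMA_MAX_QUEUE"] = str(max_queue)
    return env


def start_server(env):
    """Launch `ollama serve` with the given environment (needs ollama on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=env)
```

Calling `start_server(build_serve_env(num_parallel=4))` has the same effect as running `OLLAMA_NUM_PARALLEL=4 ollama serve` in a shell.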
The one thing most people don’t realize is that batching isn’t just about how many requests you send at once, but what kind of requests they are. Ollama looks for opportunities to reuse computation. If you send 10 requests that are completely unrelated and target different models, there is little to share: each model has to be loaded (or swapped in, if they don’t all fit in memory), and Ollama will likely run much of the work sequentially, even with a high OLLAMA_NUM_PARALLEL setting. But if you send 10 requests to llama3 that begin with the same text or belong to one conversational flow, Ollama can group them much more effectively, which can substantially reduce latency per request compared to processing them one by one.
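Claims about latency are easy to check, because a non-streaming `/api/generate` response carries timing fields. The sketch below assumes the response includes `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, and `total_duration`, with durations in nanoseconds, as described in Ollama’s API documentation. `summarize` is a hypothetical helper, and the sample numbers are invented purely to exercise it.

```python
NS_PER_S = 1_000_000_000


def summarize(response):
    """Convert Ollama's nanosecond timing fields into seconds and tokens/s."""

    def rate(count_key, duration_key):
        duration = response.get(duration_key, 0)
        if not duration:
            return None  # field missing or zero: no rate can be computed
        return response.get(count_key, 0) * NS_PER_S / duration

    return {
        "total_s": response.get("total_duration", 0) / NS_PER_S,
        "prompt_tokens_per_s": rate("prompt_eval_count", "prompt_eval_duration"),
        "gen_tokens_per_s": rate("eval_count", "eval_duration"),
    }


# Invented numbers, only to show the shape of the output.
sample = {
    "total_duration": 2_000_000_000,
    "prompt_eval_count": 20,
    "prompt_eval_duration": 100_000_000,
    "eval_count": 90,
    "eval_duration": 1_500_000_000,
}
print(summarize(sample))
```

Comparing `total_s` and `prompt_tokens_per_s` for the same prompts sent one at a time versus concurrently gives a concrete measure of what batching buys on your hardware.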
The next step in optimizing LLM throughput is understanding prompt engineering techniques that further enable context sharing and reduce token generation costs.