Ollama streaming lets you see LLM output as it’s generated, token by token, instead of waiting for the whole response.
Let’s see it in action. Imagine you have Ollama running and have pulled down a model, say llama3. You want to interact with it programmatically.
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": true
  }'
When you run this, you won’t get a single, massive JSON response. Instead, you’ll get a stream of small JSON objects, each representing a chunk of the generated text.
{"model":"llama3","created_at":"2024-05-15T10:00:00.123456Z","response":"The","done":false}
{"model":"llama3","created_at":"2024-05-15T10:00:00.456789Z","response":" Rayleigh","done":false}
{"model":"llama3","created_at":"2024-05-15T10:00:00.789012Z","response":" scattering","done":false}
...
{"model":"llama3","created_at":"2024-05-15T10:00:01.987654Z","response":"","done":true,"context":[...],"total_duration":123456789,"load_duration":12345678,"prompt_eval_count":10,"prompt_eval_duration":123456,"eval_count":50,"eval_duration":123456}
The key field here is "done": it is false on every intermediate object, indicating more tokens are coming, and true on the final one. The "response" field contains the actual text token(s) generated in that step. The final object has an empty "response" and instead carries the "context" array and the timing statistics for the whole request.
This streaming capability is crucial for interactive applications. It provides a responsive user experience, making it feel like the LLM is "thinking" and typing in real-time, rather than presenting a delayed, monolithic block of text. This is achieved by the Ollama server processing the prompt, feeding it to the LLM, and then sending back each generated token as soon as it’s available, rather than buffering the entire output.
The stream field in the API request is the primary lever. Streaming is the default: if you omit the field or set it to true, Ollama sends each chunk as soon as it is produced. When it is set to false, Ollama waits for the complete response and sends it back as a single JSON object. Internally, Ollama hands the prompt to its inference backend (originally built on llama.cpp), which yields tokens incrementally. Ollama’s API server then packages these tokens into discrete JSON payloads and writes them to the HTTP connection as newline-delimited JSON, one object per line. This is not Server-Sent Events: there are no "data:" prefixes or event framing, so the client just sees a continuous stream of lines.
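For comparison, a non-streaming call can be sketched with Python's standard library. This is a minimal sketch, not an official client: the helper names build_generate_request and generate_once are made up for illustration, and the URL is the one from the curl example above.

```python
import json
import urllib.request

# URL taken from the curl example; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build a POST request for /api/generate. Nothing is sent until urlopen() is called."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_once(model: str, prompt: str) -> str:
    """Send a non-streaming request and return the complete generated text."""
    request = build_generate_request(model, prompt, stream=False)
    with urllib.request.urlopen(request) as reply:
        # With "stream": false the body is a single JSON object.
        return json.loads(reply.read())["response"]
```

Calling generate_once("llama3", "Why is the sky blue?") requires a running Ollama server and blocks until the whole answer is ready, which is exactly the delay that streaming avoids.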
The context field in the JSON is important for maintaining conversational state. The final response object of a stream includes the updated context, which can be sent back in the next request to /api/generate so the LLM remembers the earlier parts of the conversation. This is fundamental for building chatbots on top of this endpoint.
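Carrying the context into the next turn might look like the sketch below. The function name next_turn_payload is hypothetical, and the code assumes you have kept the final parsed object (the one with "done": true) from the previous stream. Recent API documentation marks the context parameter as deprecated, so treat this as an illustration of the idea rather than a recommended pattern.

```python
def next_turn_payload(model: str, prompt: str, final_chunk: dict) -> dict:
    """Build the JSON body for a follow-up /api/generate request.

    final_chunk is the last object of the previous stream, parsed into a dict.
    """
    payload = {"model": model, "prompt": prompt, "stream": True}
    context = final_chunk.get("context")
    if context:
        # Only attach the field when the server actually returned one.
        payload["context"] = context
    return payload
```

The returned dict can be serialized with json.dumps and posted the same way as the first request.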
A common pitfall is not handling the stream correctly on the client side. If you expect a single JSON object and receive many small ones, your JSON parser will likely fail. You need to process each line as a distinct JSON object. Many HTTP client libraries can iterate over a response body line by line, which simplifies this.
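The failure mode is easy to reproduce offline. The sketch below uses a hard-coded two-line body (abbreviated sample data, not real server output) to show that parsing the whole body as one document fails, while parsing line by line works.

```python
import json

# Two chunks as they would arrive on the wire: one JSON object per line.
body = (
    '{"model":"llama3","response":"The","done":false}\n'
    '{"model":"llama3","response":" sky","done":false}\n'
)

try:
    json.loads(body)  # treats the whole body as ONE JSON document
    parsed_whole = True
except json.JSONDecodeError:
    # The parser reports extra data because a second object follows the first.
    parsed_whole = False

# Parsing each non-empty line separately succeeds.
chunks = [json.loads(line) for line in body.splitlines() if line.strip()]
```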
The "eval_duration" field, reported in nanoseconds in the final response object, shows how long the model spent generating the output tokens, and "eval_count" gives the number of tokens generated. Together with "prompt_eval_duration" and "load_duration", these give you a breakdown of the inference time.
The "total_duration" reported in the final object is the overall time, in nanoseconds, from receiving the request to completing the generation, including model loading time if the model wasn’t already in memory. This can be significantly higher on the first request after starting Ollama or pulling a new model.
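These fields can be turned into a readable summary with a few lines of arithmetic. The helper below is a sketch with a made-up name (summarize_timing); it assumes the durations are in nanoseconds and that you pass it the final parsed object of a response.

```python
NANOSECONDS_PER_SECOND = 1e9


def summarize_timing(final_chunk: dict) -> dict:
    """Convert the nanosecond timing fields to seconds and derive a generation rate."""
    eval_seconds = final_chunk["eval_duration"] / NANOSECONDS_PER_SECOND
    return {
        "load_s": final_chunk["load_duration"] / NANOSECONDS_PER_SECOND,
        "prompt_eval_s": final_chunk["prompt_eval_duration"] / NANOSECONDS_PER_SECOND,
        "eval_s": eval_seconds,
        "total_s": final_chunk["total_duration"] / NANOSECONDS_PER_SECOND,
        # Generated tokens per second; guard against a zero duration.
        "tokens_per_s": final_chunk["eval_count"] / eval_seconds if eval_seconds > 0 else 0.0,
    }
```

A large gap between total_s and the sum of the other values usually points at time spent outside inference, while a large load_s suggests the model had to be loaded for this request.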
When you’re building a client application, you’ll typically iterate over the response body, decoding each line as JSON. You’ll append the "response" field from each non-final message to your display, and when you encounter a message with "done": true, you know the generation is complete.
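The loop described above can be sketched with only the standard library. The names iter_text and stream_generate are hypothetical helpers, and the default URL and model are the ones from the curl example; the parsing is kept separate from the network call so it can be exercised without a server.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each streamed line until a chunk reports done."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # generation is complete


def stream_generate(
    prompt: str,
    model: str = "llama3",
    url: str = "http://localhost:11434/api/generate",
) -> Iterator[str]:
    """POST a streaming request and yield text fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        # The response object can be iterated line by line.
        yield from iter_text(reply)
```

With a server running, `for piece in stream_generate("Why is the sky blue?"): print(piece, end="", flush=True)` prints the answer as it is generated.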
The actual mechanism for how the LLM generates tokens is complex, involving probability distributions over the vocabulary. For a given input context, the model assigns a probability to every possible next token, and one token is chosen from that distribution (the most likely one, or a sampled one, depending on settings such as temperature). This token is then appended to the context, and the process repeats. Streaming exposes this step-by-step generation process.
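The generate-append-repeat loop can be illustrated with a toy stand-in for the model. Everything here is invented for illustration: TOY_MODEL is a tiny hand-written table, it looks only at the previous token (a real LLM conditions on the whole context), and it always picks the most probable token rather than sampling.

```python
from typing import Dict, Iterator, List

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<start>": {"The": 0.9, "A": 0.1},
    "The": {" sky": 0.8, " sea": 0.2},
    " sky": {" is": 0.7, " was": 0.3},
    " is": {" blue": 0.6, " vast": 0.4},
    " blue": {"<end>": 1.0},
}


def generate(max_tokens: int = 10) -> Iterator[str]:
    """Yield tokens one at a time, the way a streaming response delivers them."""
    context: List[str] = ["<start>"]
    for _ in range(max_tokens):
        distribution = TOY_MODEL[context[-1]]
        # Greedy choice: take the token with the highest probability.
        token = max(distribution, key=distribution.get)
        if token == "<end>":
            return
        context.append(token)  # the chosen token becomes part of the context
        yield token
```

Because generate is a generator, a caller can display each token as soon as it is yielded instead of waiting for the loop to finish, which is the same idea as streaming over HTTP.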
You might not realize that the context array returned in the final response is not a raw memory dump; it is the conversation so far encoded as a list of integer token IDs, a form the model can consume directly. Sending it back with the next request is how state is maintained between turns without your client having to resend the earlier text.
The next step in interacting with Ollama after mastering streaming is often managing multiple conversations or using the /api/chat endpoint for more structured dialogue management.