With streaming, you can start showing a response from an OpenAI model as soon as the first tokens are generated, instead of waiting for the entire output.
Here’s how it looks in practice. Imagine you’re asking for a story about a brave knight. Without streaming, you’d just see a blank screen until the whole story was ready. With stream=True in the request, the text starts appearing almost immediately:
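
    # Written for the legacy openai Python SDK (versions before 1.0);
    # openai.ChatCompletion was removed in 1.0 and later.
    import openai

    openai.api_key = "YOUR_API_KEY"

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a short story about a brave knight."},
        ],
        stream=True,  # This is the magic!
    )

    # Each chunk carries a small delta; print its text as soon as it arrives.
    for chunk in response:
        delta = chunk.choices[0].delta
        if delta.get("content"):
            print(delta["content"], end="", flush=True)
    print()
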
When you run this, you don’t get a single, large output. Instead, the story arrives in small fragments, typically a token or a few characters at a time, as the model generates them. It’s like watching the story being written in real time on your screen.
Sir Reginald, known throughout the land as the Lionheart, adjusted the worn leather of his gauntlets.
A chilling wind whipped through the jagged peaks of the Dragon's Tooth mountains, carrying with it the scent of pine and a faint, metallic tang.
He squinted, his gaze fixed on the shadowed entrance of the Whispering Caves, the reputed lair of the fearsome Gryphon.
His quest was simple, yet fraught with peril: retrieve the Sunstone, an artifact of immense power, before the encroaching darkness consumed the kingdom.
With a deep breath, he drew his ancestral blade, its polished surface reflecting the meager light.
The Gryphon's screeches echoed from within, a symphony of terror that only fueled his resolve.
He stepped into the darkness, the fate of his people resting on his valiant heart.
The core problem this solves is perceived latency. Users hate waiting. For conversational AI, or any generative task where the output can be long, waiting for the entire response to be ready before showing anything feels slow, even if the total generation time is the same. Streaming breaks that up, showing progress immediately and making the experience feel significantly more responsive.
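To make the difference between total time and perceived latency concrete, here is a minimal sketch that times a stream. It does not call the API: slow_story is a made-up generator that stands in for a streamed response, and measure_stream and its clock parameter are illustrative names, not part of any library.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds until the first chunk, seconds until the stream ends)."""
    start = clock()
    first = None
    for _ in chunks:
        if first is None:
            first = clock()  # moment the user would first see any text
    end = clock()
    if first is None:  # empty stream: nothing was ever shown
        first = end
    return first - start, end - start


def slow_story(delay: float = 0.01):
    """Hypothetical stand-in for a streamed reply: one fragment per delay."""
    for word in ["Sir ", "Reginald ", "drew ", "his ", "blade."]:
        time.sleep(delay)
        yield word


ttfc, total = measure_stream(slow_story())
print(f"first chunk after {ttfc:.3f}s, full reply after {total:.3f}s")
```

Without streaming, the user waits for the second number before seeing anything. With streaming, they wait only for the first.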
Internally, the OpenAI API, when stream=True is set, doesn’t buffer the entire response. Instead, it sends back small chunks of data as soon as they are generated by the model. Each chunk is a Server-Sent Event (SSE) that contains a small piece of the generated text. Your client code then iterates over these events, extracts the text content from each, and concatenates it to form the complete response, displaying it as it arrives.
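The official client library does this parsing for you, but a rough sketch of what happens on the wire can help. Each event arrives as a line starting with data: followed by a JSON payload, and the stream ends with a data: [DONE] line. The function name iter_sse_payloads and the sample lines below are illustrative assumptions; real chunks carry more fields (such as id and model) than shown here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data: line in an SSE stream."""
    for line in lines:
        line = line.strip()
        # Blank lines separate events; this simplified sketch ignores any
        # line that does not start with "data:".
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # sentinel that marks the end of the stream
            return
        yield json.loads(payload)


# Hand-written sample lines standing in for a real HTTP response body.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Sir"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {"content": " Reginald"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

for event in iter_sse_payloads(sample_lines):
    print(event["choices"][0]["delta"])
```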
The delta object within each chunk is key. It represents the change from the previous chunk. For text generation, this delta usually contains a content field with the newly generated tokens. You’re essentially subscribing to a stream of these deltas.
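As a small illustration of consuming deltas, the sketch below rebuilds a reply from a list of chunks. The chunks are plain dictionaries written by hand to mimic the shape described above, and stream_text is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def stream_text(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield each new text fragment found in a chunk's delta."""
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        fragment = delta.get("content")
        if fragment:  # skip deltas with no text (role-only or empty)
            yield fragment


# Hand-written chunks; a real stream would come from the API.
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Sir"}}]},
    {"choices": [{"delta": {"content": " Reginald"}}]},
    {"choices": [{"delta": {}}]},
]

parts = []
for fragment in stream_text(fake_chunks):
    print(fragment, end="", flush=True)  # show text as it arrives
    parts.append(fragment)
print()

full_text = "".join(parts)  # the complete reply, rebuilt from the deltas
```

Keeping the fragments in a list and joining once at the end avoids repeated string concatenation while still letting you display each piece immediately.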
The model doesn’t know where the response ends until it has finished generating it. When it is done, the API sends a final chunk whose delta is empty and whose finish_reason is set (for example, "stop"), which signals the end of the stream. Your loop then terminates naturally when the stream closes.
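Each choice in a chunk also carries a finish_reason field, which stays null until the last chunk. Reading it tells you why the stream ended, for example "stop" for a natural ending or "length" when the token limit cut the reply short. The sketch below assumes hand-written dictionary chunks, and collect_until_finished is an illustrative name.

```python
from typing import Iterable, Optional, Tuple


def collect_until_finished(chunks: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Join all text fragments and report the finish_reason, if one was sent."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# A made-up stream that ends because the token limit was reached.
truncated = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "Sir Reginald drew"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "length"}]},
]

text, reason = collect_until_finished(truncated)
print(text)
if reason == "length":
    print("[reply was cut off by the token limit]")
```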
It is tempting to look only at the content field, but a delta can carry other fields. The first chunk typically contains just the role (e.g., "assistant"), and when the model decides to call a function, the deltas carry function_call fragments instead of content. If you don’t handle these, your code may fail on a missing field, or the streamed output may appear empty or incomplete.
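One way to cope with these extra fields is to merge every delta into a single message dictionary. The sketch below assumes the legacy function_call delta shape, where the function name and the JSON arguments string arrive in fragments. The helper merge_deltas, the function name get_weather, and the sample chunks are all hypothetical.

```python
import json
from typing import Iterable


def merge_deltas(chunks: Iterable[dict]) -> dict:
    """Fold a stream of deltas into one message dictionary."""
    message = {"role": None, "content": "", "function_call": None}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if "role" in delta:
            message["role"] = delta["role"]
        if delta.get("content"):
            message["content"] += delta["content"]
        if delta.get("function_call"):
            call = message["function_call"] or {"name": "", "arguments": ""}
            # Both the name and the arguments can arrive as string fragments.
            call["name"] += delta["function_call"].get("name") or ""
            call["arguments"] += delta["function_call"].get("arguments") or ""
            message["function_call"] = call
    return message


# Hand-written chunks mimicking a streamed function call.
function_chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": None,
                            "function_call": {"name": "get_weather", "arguments": ""}}}]},
    {"choices": [{"delta": {"function_call": {"arguments": '{"city":'}}}]},
    {"choices": [{"delta": {"function_call": {"arguments": ' "Paris"}'}}}]},
    {"choices": [{"delta": {}}]},
]

merged = merge_deltas(function_chunks)
# The arguments are only valid JSON once the stream has finished.
arguments = json.loads(merged["function_call"]["arguments"])
print(merged["function_call"]["name"], arguments)
```

Note that the arguments string should be parsed only after the last chunk, since any intermediate state is an incomplete JSON document.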
The next step is handling function calls within a streaming response.