OpenAI API Migration: Completions to Chat API (2026)

The Chat Completions API is the successor to the Completions API, and it’s designed to be more powerful and flexible, especially for conversational use cases.

Let’s see it in action. Imagine we want to build a simple chatbot that can answer questions about a fictional product.

from openai import OpenAI

# Requires openai>=1.0; the older openai.ChatCompletion.create call was removed in that release.
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def ask_chat_api(prompt):
    # Send a system instruction plus the user's prompt and return the reply text.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

question = "What are the main features of the 'Quantum Leap' device?"
answer = ask_chat_api(question)
print(answer)

This code snippet demonstrates the core of the Chat Completions API. We define a system message to set the behavior of the AI, and then a user message containing the actual prompt. The model parameter specifies which AI model to use, with gpt-3.5-turbo being a popular and cost-effective choice.

The key difference from the older Completions API is the messages parameter. Instead of a single prompt string, you provide a list of message objects, each with a role and content. This allows for a more structured representation of a conversation, including system instructions, user queries, and AI responses. This structure is crucial for maintaining context across multiple turns of a dialogue.
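To make the shift from a prompt string to a message list concrete, here is a minimal sketch of wrapping a legacy prompt in the new format. The helper name prompt_to_messages and its default system text are assumptions made for illustration; they are not part of the OpenAI library.

```python
def prompt_to_messages(prompt, system="You are a helpful assistant."):
    """Wrap a legacy single-string prompt in the Chat Completions message format.

    Hypothetical helper for illustration; not part of the openai package.
    """
    messages = []
    if system:
        # The system message is optional; skip it when no instructions are given.
        messages.append({"role": "system", "content": system})
    # The old prompt string becomes the content of a single user message.
    messages.append({"role": "user", "content": prompt})
    return messages

legacy_prompt = "What are the main features of the 'Quantum Leap' device?"
messages = prompt_to_messages(legacy_prompt)
for m in messages:
    print(f"{m['role']}: {m['content']}")
```

The resulting list is what you would pass as the messages argument in place of the old prompt argument.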

Here’s how the request parameters break down:

model: This is your choice of the AI engine. gpt-3.5-turbo is fast and affordable. gpt-4 or gpt-4-turbo offer more advanced reasoning and creativity but at a higher cost. The choice depends on your application’s needs for sophistication versus budget.
messages: This is the heart of the interaction.
- role: "system": This message sets the overall behavior and persona of the AI. It’s like giving instructions to an actor before they go on stage. You can tell it to be funny, professional, or to act as a specific character.
- role: "user": These are the prompts or questions from the end-user.
- role: "assistant": These are the AI’s previous responses. Including these in subsequent calls allows the model to understand the flow of the conversation and respond contextually.
temperature: A value between 0 and 2. Higher values (e.g., 0.8) make the output more random and creative, while lower values (e.g., 0.2) make it more deterministic and focused. For a chatbot answering factual questions, you might keep this low. For creative writing, you’d raise it.
max_tokens: This caps the length of the AI’s response, measured in tokens. It’s a way to control costs and keep responses from running long; a response that reaches the cap is simply cut off, so leave some headroom.
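The parameters above can be gathered into a single request, sketched below. The helper build_chat_request, its default values, and the example prompts are assumptions for illustration only; the range check mirrors the 0 to 2 range given for temperature.

```python
def build_chat_request(messages, model="gpt-3.5-turbo", temperature=0.2, max_tokens=256):
    """Assemble keyword arguments for a Chat Completions call.

    Hypothetical helper for illustration; the defaults are arbitrary choices.
    """
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "model": model,
        "messages": list(messages),
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

# Low temperature and a short cap for a factual product question.
factual = build_chat_request(
    [
        {"role": "system", "content": "You answer questions about the Quantum Leap device."},
        {"role": "user", "content": "Does it support wireless charging?"},
    ],
    temperature=0.2,
    max_tokens=150,
)

# Higher temperature and more room for a creative task.
creative = build_chat_request(
    [{"role": "user", "content": "Write a playful tagline for the Quantum Leap device."}],
    temperature=1.0,
    max_tokens=400,
)

print(factual["temperature"], creative["temperature"])
```

With a version 1 or later openai client, a dictionary like this could be passed as client.chat.completions.create(**factual).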

The migration from the Completions API (which used a single prompt parameter) to Chat Completions is driven by the need for better conversational AI. The old API was essentially a text-in, text-out black box. The Chat Completions API, by explicitly defining roles, allows the model to better understand the turn-taking nature of dialogue. This means it can handle follow-up questions, remember previous parts of the conversation (when you include past messages), and adopt specific personas more reliably. The system message, in particular, is a powerful tool for guiding the AI’s behavior in ways that were difficult or impossible with just a single prompt.

When you provide a list of messages, they are not flattened into one undifferentiated string. Each message keeps its role, so the model can tell the system’s instructions apart from the user’s queries and its own prior outputs. The API itself is stateless, though: the model only sees what is in the current request, so any context it should remember has to be sent again in the messages list. This role-aware context is what makes the Chat Completions API so much more effective for building interactive applications.
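Keeping context across turns then comes down to resending the growing message list on every call. Below is a minimal sketch of that bookkeeping. The Conversation class and the fake_send stand-in are hypothetical names; in real use, send would wrap the actual API call and return the reply text.

```python
class Conversation:
    """Accumulates messages so each request carries the full history (illustrative sketch)."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text, send):
        # Record the user's turn, then pass the whole history to the model.
        self.messages.append({"role": "user", "content": user_text})
        reply = send(self.messages)
        # Store the reply so the next turn can refer back to it.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    # Stand-in for the real API call, so the sketch runs without network access.
    return f"(reply to: {messages[-1]['content']})"


chat = Conversation("You are a helpful assistant.")
chat.ask("What is the Quantum Leap device?", fake_send)
chat.ask("How much does it cost?", fake_send)
print([m["role"] for m in chat.messages])
# → ['system', 'user', 'assistant', 'user', 'assistant']
```

Because the second question is sent together with the first exchange, a real model has what it needs to resolve "it" to the device.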

The way temperature interacts with top_p is often misunderstood. Temperature rescales the probability distribution over the next token (flattening it when high, sharpening it when low), while top_p (nucleus sampling) truncates that distribution, keeping only the most likely tokens whose cumulative probability reaches the threshold. In typical sampling implementations, when both are set, temperature is applied first and top_p is then applied to the rescaled distribution. This means that if a high temperature flattens the distribution, top_p ends up admitting a larger set of tokens than intended; conversely, a low temperature can shrink the set to just one or two tokens. For most use cases, it’s best to adjust only one of them.
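The interaction can be seen in a small numeric sketch in plain Python. It assumes the temperature-then-top_p order described above and uses made-up logits for five candidate tokens; it illustrates the general technique and is not a description of how any particular provider implements sampling.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Indices of the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

# Made-up logits for five candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    kept = top_p_filter(probs, 0.9)
    print(f"temperature={t}: top_p=0.9 keeps {len(kept)} token(s)")
```

With these logits, the same top_p of 0.9 keeps 1 token at temperature 0.5, 2 tokens at 1.0, and 4 tokens at 2.0, which is why changing both settings at once makes the results harder to reason about.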

As you continue with conversational AI, you’ll soon encounter the need for fine-tuning models for highly specific tasks, or managing long conversations efficiently.