OpenAI Chat Completions API: Every Parameter Explained (2026)

The OpenAI Chat Completions API generates text by repeatedly predicting the next token, drawing on the statistical patterns the model learned during training.

Let’s see it in action. Imagine you want to simulate a conversation with a helpful assistant.

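import openai

# Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release.
client = openai.OpenAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ]
)

# The response is an object, so use attribute access rather than dictionary keys.
print(response.choices[0].message.content)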

This code snippet sends a request to the gpt-3.5-turbo model, instructing it to act as a helpful assistant and asking a question. The API then returns the model’s answer, typically something like "The capital of France is Paris." The exact wording can vary from one call to the next.

The core of this API revolves around the messages parameter, which structures the conversation. It’s a list of dictionaries, each representing a turn. The role can be system (setting the AI’s persona), user (the human’s input), or assistant (previous AI responses, crucial for maintaining context).

Beyond messages and model, several other parameters let you shape the output:

temperature: This controls randomness. A value of 0 makes the output nearly deterministic (the same prompt almost always yields the same text, though small variations can still occur), while higher values (up to 2) increase variety and surprise. For factual questions, you’d keep this low, like 0.2. For brainstorming, you might crank it up to 0.8.
top_p: An alternative to temperature, also called nucleus sampling. The model samples only from the smallest set of most likely tokens whose cumulative probability reaches top_p. A top_p of 0.1 means only the tokens making up the top 10% of probability mass are considered. It’s generally best to adjust either top_p or temperature rather than both, so top_p is often used with temperature left at 1.
n: The number of completions to generate. If n=3, you’ll get three different possible continuations of the conversation. This is useful for exploring multiple response options.
stream: If set to True, the response is sent back token by token as it’s generated, rather than waiting for the entire completion. This is essential for interactive chat applications to provide a real-time feel.
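stop: A sequence or list of sequences (up to four) that will cause the API to stop generating further tokens. The returned text ends just before the stop sequence, which is not included in the output. For example, if you’re generating a numbered list and want it to stop after the fifth item, you might use stop=["6."], or stop=["\n"] to end at the first newline.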
presence_penalty: Ranges from -2.0 to 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
frequency_penalty: Also ranges from -2.0 to 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.
max_tokens: The maximum number of tokens to generate in the completion. This prevents excessively long responses and helps manage costs. max_tokens counts only the generated tokens, not the prompt, but the prompt and the completion together must still fit within the model’s context window.
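Of the parameters above, top_p is the one most often misread, so here is a minimal sketch of the filtering step it describes. This is an illustration of nucleus sampling in general, not OpenAI’s internal implementation; the function name nucleus_filter and the probabilities are made up for the example.

```python
def nucleus_filter(probs, top_p):
    """Return the smallest set of tokens whose cumulative probability reaches top_p.

    probs is a hypothetical {token: probability} mapping for one decoding step.
    The surviving probabilities are renormalized so they sum to 1.
    """
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Visit tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; drop the remaining tail
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Made-up distribution over the next token, for illustration only.
next_token_probs = {"Paris": 0.70, "Lyon": 0.15, "France": 0.10, "Nice": 0.05}

print(nucleus_filter(next_token_probs, 0.1))  # only "Paris" survives
print(nucleus_filter(next_token_probs, 0.8))  # "Paris" and "Lyon" survive
```

With a low top_p the model can only pick from a handful of high-probability tokens, which is why small values behave much like a low temperature.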

When you set temperature to 0.0, the model is essentially picking the single most likely next token at each step. This near-deterministic behavior is useful for tasks where consistency matters most, like code generation or precise data extraction, although it does not by itself make the answers more accurate. It can also lead to repetitive or uninspired text if used for creative writing.
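To make the effect of temperature concrete, the sketch below applies it to a softmax over a few made-up token scores. It assumes the standard formulation, where scores are divided by the temperature before being turned into probabilities; the function name and the logits are hypothetical, and the real model works over a vocabulary of many thousands of tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as the greedy limit: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {token: (1.0 if token == best else 0.0) for token in logits}
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(s - peak) for token, s in scaled.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


# Made-up scores for three candidate next tokens.
logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.0}

for t in (0, 0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {token: round(p, 3) for token, p in probs.items()})
```

As the temperature rises, probability shifts away from the top token toward the alternatives, which is where the extra variety comes from.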

The messages parameter acts as the model’s short-term memory. For longer conversations, you must include the relevant history in this list. If you don’t, the model will treat each new request as if it’s the first turn of a conversation, losing all prior context. This means a conversation with a user asking follow-up questions would look like this:

# Reuses the openai import from the first example (requires openai>=1.0).
client = openai.OpenAI(api_key="YOUR_API_KEY")

# First turn
response1 = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about photosynthesis."}
  ]
)
first_reply = response1.choices[0].message.content
print(first_reply)

# Second turn (with context)
conversation_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about photosynthesis."},
    {"role": "assistant", "content": first_reply}  # Crucial!
]
conversation_history.append({"role": "user", "content": "What are its main inputs?"})

response2 = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=conversation_history
)
print(response2.choices[0].message.content)
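Rebuilding the history by hand gets tedious after a couple of turns. One way to keep it tidy is a small wrapper that appends each user message and each reply for you. The sketch below is an assumption about how you might structure this, not part of the OpenAI SDK: the Conversation class and the send callable are hypothetical, and fake_send stands in for a real API call so the example runs offline.

```python
class Conversation:
    """Keeps the running message list so every request carries the full history."""

    def __init__(self, system_prompt, send):
        # send is any callable that takes the message list and returns reply text.
        self.send = send
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(self.messages)
        # Store the reply so the next turn can refer back to it.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    """Stand-in for a real API call; reports how many messages it received."""
    return f"(reply after seeing {len(messages)} messages)"


chat = Conversation("You are a helpful assistant.", fake_send)
chat.ask("Tell me about photosynthesis.")
chat.ask("What are its main inputs?")
print([m["role"] for m in chat.messages])
```

In real use you would replace fake_send with a function that sends the message list to the Chat Completions endpoint and returns the content of the first choice.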

A common point of confusion with max_tokens is that it limits only the generated tokens, not the total tokens processed. So, if your prompt is 500 tokens long and you set max_tokens=100, the API will stop generating after at most 100 tokens, but the total token count for billing purposes can reach 600 (500 prompt + up to 100 completion). The prompt does not draw down your max_tokens budget. It does share the model’s context window with the completion, though, so a very long prompt leaves less room for new content.
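The arithmetic is simple enough to check before sending a request. The helper below is a rough sketch: token_budget and its field names are hypothetical, and the 4,096-token context window is an assumed figure, so look up the real limit for the model you use.

```python
def token_budget(prompt_tokens, max_tokens, context_window):
    """Work out the billed upper bound and whether the request fits the context window.

    context_window is an assumed figure; check the real limit for your model.
    """
    room_for_completion = context_window - prompt_tokens
    return {
        # Billing counts prompt tokens plus however many tokens are generated.
        "billed_upper_bound": prompt_tokens + max_tokens,
        # The completion can only use what the prompt leaves free.
        "fits": max_tokens <= room_for_completion,
        "room_for_completion": max(room_for_completion, 0),
    }


print(token_budget(prompt_tokens=500, max_tokens=100, context_window=4096))
# {'billed_upper_bound': 600, 'fits': True, 'room_for_completion': 3596}
print(token_budget(prompt_tokens=4050, max_tokens=100, context_window=4096))
# {'billed_upper_bound': 4150, 'fits': False, 'room_for_completion': 46}
```

The second call shows the case to watch for: the prompt leaves only 46 tokens free, so a max_tokens of 100 no longer fits.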

Understanding these parameters lets you use the Chat Completions API as a controllable conversational agent rather than a simple text generator.

The next hurdle is often managing long conversations and their associated token limits effectively.