OpenAI’s Usage API doesn’t just track your spending; it reports the token counts behind every request, giving you the per-call detail you need to manage costs.
Let’s see it in action. Imagine you’ve got a Python application making calls to the gpt-3.5-turbo model for summarization. You want to know how much you’re spending per request.
import openai
import os
import time
# Ensure you have your OpenAI API key set as an environment variable
# export OPENAI_API_KEY='your-api-key'
openai.api_key = os.getenv("OPENAI_API_KEY")
def summarize_text(text, model="gpt-3.5-turbo"):
    """Summarize text and print the token usage reported for the request."""
    start_time = time.time()
    try:
        # Legacy call style from the openai SDK before 1.0; on 1.x the
        # equivalent is OpenAI().chat.completions.create(...).
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": f"Summarize the following text: {text}"}
            ]
        )
        end_time = time.time()
        summary = response.choices[0].message.content
        # Token counts the API reports for this single request
        usage = response.usage
        print("--- Request Details ---")
        print(f"Model: {model}")
        print(f"Prompt Tokens: {usage.prompt_tokens}")
        print(f"Completion Tokens: {usage.completion_tokens}")
        print(f"Total Tokens: {usage.total_tokens}")
        print(f"Request Latency: {end_time - start_time:.2f} seconds")
        print(f"Summary: {summary}")
        print("-----------------------\n")
        return summary, usage
    except Exception as e:
        # Report the failure and return placeholders so callers can continue
        print(f"An error occurred: {e}")
        return None, None
# Example usage
long_text = """
The advancements in artificial intelligence have been nothing short of revolutionary.
Machine learning algorithms, particularly deep learning, have enabled systems to
perform tasks that were once considered exclusively within the domain of human
intelligence. Natural Language Processing (NLP) has seen significant breakthroughs,
allowing machines to understand, interpret, and generate human language with
increasing sophistication. Computer vision models are now capable of recognizing
objects and scenes with accuracy comparable to, and in some cases exceeding, human
perception. These technologies are finding applications across a vast array of
industries, from healthcare and finance to entertainment and transportation.
However, alongside these remarkable achievements come significant ethical
considerations and challenges that require careful navigation. The development
and deployment of AI systems must be guided by principles of fairness,
transparency, and accountability to ensure that these powerful tools benefit
humanity as a whole.
"""
summarize_text(long_text)
When you run this, you’ll see output like this:
--- Request Details ---
Model: gpt-3.5-turbo
Prompt Tokens: 96
Completion Tokens: 41
Total Tokens: 137
Request Latency: 2.51 seconds
Summary: Artificial intelligence, particularly deep learning and NLP, has made significant strides, enabling machines to perform complex tasks like language understanding and image recognition. These advancements are transforming various industries but also raise critical ethical concerns regarding fairness, transparency, and accountability. Responsible AI development is crucial for ensuring these technologies benefit society.
-----------------------
The key here is the response.usage object. It breaks down prompt_tokens, completion_tokens, and total_tokens for each API call. This is your primary lever for understanding and controlling costs.
The problem this solves is the "black box" of cloud AI spending. Without this granular detail, you’re flying blind, unable to pinpoint which features, user actions, or model configurations are driving up your bill. You might know your total spend is $1000, but you won’t know if that’s from 10,000 tiny, cheap requests or 100 very expensive ones.
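
One way to get out of the black box is to label each request with the feature that triggered it and keep running totals. The sketch below is a minimal in-memory version of that idea; `UsageLedger`, `FeatureTotals`, and the feature labels are hypothetical names invented for this example, and the token numbers are illustrative, not measurements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FeatureTotals:
    """Running totals for one feature label."""
    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @property
    def tokens_per_request(self) -> float:
        return self.total_tokens / self.requests if self.requests else 0.0


class UsageLedger:
    """Accumulates token counts per feature label (hypothetical helper)."""

    def __init__(self):
        self._totals = defaultdict(FeatureTotals)

    def record(self, feature: str, prompt_tokens: int, completion_tokens: int) -> None:
        entry = self._totals[feature]
        entry.requests += 1
        entry.prompt_tokens += prompt_tokens
        entry.completion_tokens += completion_tokens

    def report(self):
        """Return (feature, totals) pairs, heaviest token consumer first."""
        return sorted(self._totals.items(),
                      key=lambda kv: kv[1].total_tokens, reverse=True)


# Illustrative numbers only; in practice these come from response.usage.
ledger = UsageLedger()
ledger.record("summarize", prompt_tokens=96, completion_tokens=41)
ledger.record("summarize", prompt_tokens=120, completion_tokens=50)
ledger.record("chat", prompt_tokens=900, completion_tokens=400)

for feature, totals in ledger.report():
    print(f"{feature}: {totals.requests} requests, "
          f"{totals.total_tokens} tokens, "
          f"{totals.tokens_per_request:.1f} tokens/request")
```

Tracking the request count alongside the token totals is what separates many small requests from a few expensive ones.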
Internally, every time you send a request to an OpenAI model, the system tokenizes your input prompt and tokenizes the model’s generated output. The usage data returned with each response (response.usage) is a report of these token counts for that specific interaction. Pricing is then applied to these counts at the rates for the model used: gpt-3.5-turbo, for example, has different pricing for prompt and completion tokens, so cost follows the two counts separately rather than total_tokens alone, while older models might have a single rate.
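
To turn token counts into a dollar figure, multiply each count by its per-token rate. The following is a minimal sketch of that arithmetic; `estimate_cost` is a hypothetical helper, and the two rates are placeholders chosen for illustration, not actual prices, so look up the current rates for your model before relying on the result.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_rate_per_1k: float, completion_rate_per_1k: float) -> float:
    """Estimate the dollar cost of one request from its token counts.

    Rates are in dollars per 1,000 tokens and must be supplied by the caller.
    """
    prompt_cost = prompt_tokens / 1000 * prompt_rate_per_1k
    completion_cost = completion_tokens / 1000 * completion_rate_per_1k
    return prompt_cost + completion_cost


# Placeholder rates for illustration only (assumptions, not real prices).
EXAMPLE_PROMPT_RATE = 0.001       # assumed $ per 1K prompt tokens
EXAMPLE_COMPLETION_RATE = 0.002   # assumed $ per 1K completion tokens

# Token counts taken from the sample output above.
cost = estimate_cost(96, 41, EXAMPLE_PROMPT_RATE, EXAMPLE_COMPLETION_RATE)
print(f"Estimated cost: ${cost:.6f}")
```

Keeping the rates as parameters rather than hard-coding them makes the function reusable when you switch models or when prices change.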
Your control levers are:
- Model Selection: gpt-3.5-turbo is significantly cheaper than gpt-4. Switching models can yield massive savings.
- Prompt Length: Longer prompts mean more prompt tokens, which directly increases cost. Can you make your prompts more concise without sacrificing quality?
- Response Length (Max Tokens): You can set max_tokens in your request to cap how long a completion can get. You pay only for tokens actually generated, not for unused headroom below the cap, but it’s still good practice to set it realistically. The completion_tokens value in response.usage tells you how many tokens were actually generated.
- Batching (where applicable): For tasks that can be parallelized, consider whether batching requests is more efficient, though with token-based pricing the cost often nets out. The primary gain is usually latency.
- Caching: If the same prompt is sent repeatedly, cache the results to avoid redundant API calls and costs.
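
The caching lever can be sketched as a small wrapper that keys on the model name and the exact messages. This is an assumption-laden illustration: `CompletionCache` and `fake_call_model` are hypothetical names, the cache lives only in memory, and the stand-in function replaces the real API call so the example runs offline.

```python
import hashlib
import json


class CompletionCache:
    """In-memory cache keyed on (model, messages). Hypothetical helper."""

    def __init__(self, call_model):
        # call_model is any callable (model, messages) -> result; in a real
        # application it would wrap the API request.
        self._call_model = call_model
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, messages):
        # Serialize deterministically so identical requests hash identically.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, model, messages):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, messages)
        self._store[key] = result
        return result


# Stand-in for the real API call so the sketch runs without network access.
def fake_call_model(model, messages):
    return f"summary of {len(messages[-1]['content'])} characters"


cache = CompletionCache(fake_call_model)
messages = [{"role": "user", "content": "Summarize the following text: ..."}]
first = cache.get("gpt-3.5-turbo", messages)
second = cache.get("gpt-3.5-turbo", messages)  # served from the cache
print(f"hits={cache.hits}, misses={cache.misses}")
```

A cache like this only pays off when returning an identical answer for an identical prompt is acceptable; if you rely on varied outputs across calls, cached results will hide that variation.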
The most surprising thing about token usage is that the system message, the initial instruction telling the model how to behave (e.g., "You are a helpful assistant."), is also counted as part of the prompt tokens on every request. This means even your most basic system prompts contribute to your token count, and by extension, your cost.
When you start seeing total_tokens that seem unusually high for a seemingly simple request, first check which side is inflated. A high completion_tokens count usually means the model is being verbose, and with reasoning-capable models it can also include tokens spent on internal "thinking" before the final output is generated. You can sometimes influence this by being extremely precise and concise in your system and user prompts, guiding the model more directly towards the desired output and minimizing extraneous generation.
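
Spotting these unusually heavy requests is easier if you compare each one against what is typical for your workload. The sketch below flags any request whose total exceeds a multiple of the median; `flag_heavy_requests`, the record shape, and the factor of 3 are all assumptions made for this example, and the records are illustrative rather than measured.

```python
from statistics import median


def flag_heavy_requests(records, factor=3.0):
    """Return records whose total_tokens exceeds factor times the median.

    Each record is a dict with 'request_id' and 'total_tokens' keys
    (a hypothetical shape chosen for this sketch).
    """
    if not records:
        return []
    typical = median(r["total_tokens"] for r in records)
    return [r for r in records if r["total_tokens"] > factor * typical]


# Illustrative records; in practice total_tokens comes from response.usage.
records = [
    {"request_id": "a", "total_tokens": 137},
    {"request_id": "b", "total_tokens": 150},
    {"request_id": "c", "total_tokens": 142},
    {"request_id": "d", "total_tokens": 1900},
]
heavy = flag_heavy_requests(records)
print([r["request_id"] for r in heavy])
```

The median is used instead of the mean so that a single very large request does not raise the baseline it is being compared against.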
The next concept you’ll likely grapple with is how to proactively enforce these spending controls, moving beyond reactive monitoring to automated budget alerts and request throttling.