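Caching responses to OpenAI prompts is a simple optimization that can dramatically reduce both latency and cost: your application stores each response and reuses it whenever an identical prompt arrives. This is an application-level technique, distinct from the Prompt Caching feature built into the OpenAI API, which reuses the processing of repeated prompt prefixes on OpenAI’s servers and still generates a fresh response for every request.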
Let’s see it in action. Imagine you’re building a chatbot that needs to repeatedly answer the same FAQ. Without caching, each time a user asks "What are your business hours?", your application sends that exact prompt to OpenAI, waits for the response, and pays for the token usage. With caching, the first time the question is asked, the prompt and its response are stored. Subsequent identical questions are served directly from your cache, bypassing OpenAI entirely.
Here’s a simplified Python example illustrating the concept:

    import hashlib
    import json

    from openai import OpenAI

    # The client reads the OPENAI_API_KEY environment variable by default.
    client = OpenAI()

    # In-memory cache for demonstration. In production, use Redis, Memcached, etc.
    prompt_cache = {}


    def get_cached_or_openai_response(prompt_text, model="gpt-3.5-turbo", max_tokens=150):
        # Build the cache key from everything that affects the output here:
        # the model, the prompt text, and max_tokens.
        key_material = json.dumps(
            {"model": model, "prompt": prompt_text, "max_tokens": max_tokens},
            sort_keys=True,
        )
        cache_key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()

        if cache_key in prompt_cache:
            print("--- Cache HIT ---")
            return prompt_cache[cache_key]

        print("--- Cache MISS ---")
        try:
            # openai>=1.0 client interface (the older openai.ChatCompletion.create
            # call was removed in version 1.0 of the library).
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt_text}],
                max_tokens=max_tokens,
            )
        except Exception as e:
            print(f"Error calling OpenAI: {e}")
            return None

        cached_response = {
            "content": response.choices[0].message.content,
            "finish_reason": response.choices[0].finish_reason,
            "usage": response.usage,
        }
        # Store the response in the cache so identical requests hit next time
        prompt_cache[cache_key] = cached_response
        return cached_response


    # --- Example Usage ---
    faq_question = "What is your refund policy?"

    # First call - Cache MISS
    response1 = get_cached_or_openai_response(faq_question)
    if response1:
        print(f"Response 1: {response1['content']}\n")

    # Second call with the exact same prompt - Cache HIT
    response2 = get_cached_or_openai_response(faq_question)
    if response2:
        print(f"Response 2: {response2['content']}\n")

    # A slightly different prompt - Cache MISS
    different_question = "What is your refund policy for damaged items?"
    response3 = get_cached_or_openai_response(different_question)
    if response3:
        print(f"Response 3: {response3['content']}\n")

The core problem prompt caching solves is the inherent latency and cost associated with every single API call to a large language model. For repetitive tasks or when dealing with a large number of users asking similar questions, the cumulative cost and delay can become significant. Caching intercepts these redundant requests, serving pre-computed answers instantly from local storage.
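To make the cost and latency argument concrete, here is a rough back-of-the-envelope sketch. The function name `estimate_savings` and every number passed to it are illustrative assumptions, not measured figures or real prices; the only idea taken from the text is that each cache hit avoids one API call.

```python
def estimate_savings(requests: int, hit_rate: float,
                     cost_per_call: float, latency_per_call_s: float) -> dict:
    """Estimate what a response cache saves, given a hit rate between 0 and 1."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Every hit is one API call that never happens.
    hits = requests * hit_rate
    return {
        "api_calls_avoided": hits,
        "cost_saved": hits * cost_per_call,
        "latency_saved_s": hits * latency_per_call_s,
    }


# Hypothetical workload: 10,000 requests, half of them repeats.
savings = estimate_savings(
    requests=10_000,
    hit_rate=0.5,
    cost_per_call=0.002,      # assumed average cost per call, in dollars
    latency_per_call_s=1.5,   # assumed average round-trip time, in seconds
)
print(savings)
```

The savings scale linearly with the hit rate, so the benefit depends almost entirely on how repetitive your traffic really is.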
Internally, the process involves:
- Hashing the Prompt: A unique identifier (a hash) is generated for each incoming prompt. This hash typically includes the prompt text itself and the model being used, as variations in either would constitute a different request.
- Cache Lookup: The system checks if a response associated with this hash already exists in the cache.
- Cache Hit: If found, the stored response is returned immediately. This bypasses the network call to OpenAI, saving both time and money. The stored response often includes the model’s output, finish reason, and token usage. Note that the stored usage figures describe the original call; a cache hit consumes no new tokens.
- Cache Miss: If not found, the prompt is sent to OpenAI. Upon receiving a response, it’s stored in the cache using the generated hash before being returned to the user. This ensures future identical prompts will hit the cache.
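The four steps above can be sketched in a form that runs without network access. This is a minimal illustration, not a production design: `ResponseCache` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call so the hit and miss paths can be observed offline.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Exact-match response cache: hash, look up, then hit or miss."""

    def __init__(self, fetch: Callable[[str, str], str]):
        self._fetch = fetch              # invoked only on a cache miss
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Hash the model together with the prompt text.
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, model: str, prompt: str) -> str:
        key = self.make_key(model, prompt)
        if key in self._store:           # cache hit: no backend call
            self.hits += 1
            return self._store[key]
        self.misses += 1                 # cache miss: fetch, store, return
        response = self._fetch(model, prompt)
        self._store[key] = response
        return response


def fake_fetch(model: str, prompt: str) -> str:
    # Hypothetical stand-in for the real API call.
    return f"[{model}] answer to: {prompt}"


cache = ResponseCache(fake_fetch)
cache.get("gpt-3.5-turbo", "What are your business hours?")  # miss
cache.get("gpt-3.5-turbo", "What are your business hours?")  # hit
cache.get("gpt-3.5-turbo", "What is your refund policy?")    # miss
print(cache.hits, cache.misses)  # prints: 1 2
```

Passing the fetch function in as a parameter keeps the caching logic independent of any particular client library, which also makes it easy to test.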
The key levers you control are:
- Cache Key Generation: How you construct the unique identifier for each prompt. This needs to be robust enough to differentiate distinct requests while being consistent for identical ones. Including model name, system messages, and even temperature settings (if they vary per request) in the hash is crucial.
- Cache Storage: Where you store the cached data. An in-memory dictionary is fine for demos, but for production you’ll want a shared cache like Redis or Memcached so that every application instance sees the same entries. Memcached keeps data only in memory, while Redis can also persist entries to disk. A simple file-based cache can be enough for smaller-scale applications.
- Cache Invalidation/Expiration: How long a response stays in the cache. For static information like FAQs, long expiration times (or no expiration) make sense. For dynamic or time-sensitive information, you’ll need a strategy to update or remove stale cache entries.
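As a rough sketch of the first and third levers, the snippet below builds a key from the full request and expires entries after a fixed time. The names `make_cache_key` and `TTLCache` are hypothetical, the one-hour TTL is an arbitrary choice, and the in-memory dictionary is only a placeholder for whatever store you actually use.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


def make_cache_key(model: str, messages: List[Dict[str, str]], **params: Any) -> str:
    """Hash the model, every message, and any sampling parameters."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys gives a stable serialization regardless of argument order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory store whose entries expire after ttl_seconds (None = never)."""

    def __init__(self, ttl_seconds: Optional[float],
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._ttl is not None and self._clock() - stored_at >= self._ttl:
            del self._store[key]         # stale entry: drop it and report a miss
            return None
        return value


messages = [
    {"role": "system", "content": "You are a support assistant."},
    {"role": "user", "content": "What is your refund policy?"},
]
key_a = make_cache_key("gpt-3.5-turbo", messages, temperature=0, max_tokens=150)
key_b = make_cache_key("gpt-3.5-turbo", messages, max_tokens=150, temperature=0)
key_c = make_cache_key("gpt-3.5-turbo", messages, temperature=0.7, max_tokens=150)
print(key_a == key_b)  # True: parameter order does not matter
print(key_a == key_c)  # False: a different temperature is a different request

faq_cache = TTLCache(ttl_seconds=3600)
faq_cache.set(key_a, "placeholder answer")
print(faq_cache.get(key_a))  # placeholder answer
```

Serializing the whole request before hashing means a changed system message or sampling parameter can never silently reuse an answer generated under different settings.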
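The most subtle aspect of prompt caching is how it interacts with the inherent stochasticity of LLMs. With a non-zero temperature, the API can return a different response to the exact same prompt on every call. An exact-match cache still hits in this case, because the key is derived from the request rather than the response, but it freezes whichever sample happened to be generated first and replays it to every later caller. That is ideal for FAQ-style answers and undesirable where variety is the point, such as brainstorming or creative writing. Two practical consequences follow: (a) use temperature=0 for prompts you intend to cache, so the stored answer is close to what a fresh call would return (even then, outputs are not guaranteed to be perfectly deterministic), and include temperature in the cache key if it varies per request; (b) remember that exact matching only catches identical strings, so rephrasings of the same question will miss. Catching those requires a more sophisticated mechanism that recognizes semantic similarity rather than exact string matches, which is significantly more complex.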
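To give a feel for what matching on similarity rather than exact strings involves, here is a deliberately simplified sketch. It is not how a real semantic cache would be built: `toy_embed` uses plain word counts as a stand-in for an embedding model, the linear scan replaces a vector index, and the 0.9 threshold is an arbitrary assumption. All names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SimilarityCache:
    """Returns a stored response when a prompt is similar enough to a cached one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._entries: List[Tuple[Counter, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((toy_embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:   # linear scan: fine for a demo
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


semantic_cache = SimilarityCache(threshold=0.9)
semantic_cache.put("What is your refund policy?", "placeholder refund answer")

# Different casing and punctuation, same words: treated as a match.
print(semantic_cache.get("what is your refund policy"))
# Extra words lower the similarity below the threshold: treated as a miss.
print(semantic_cache.get("What is your refund policy for damaged items?"))
```

The threshold is the difficult part. Set it too low and users receive answers to questions they did not ask; set it too high and the cache behaves like exact matching again.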
The next step after implementing basic prompt caching is often exploring techniques for semantic caching and handling prompt variations gracefully.