The most surprising thing about RAG embedding caches is that they are not keyed by embeddings at all: they are keyed by the normalized text of the query that produced each embedding, and an entry is written only if that query is deemed "cacheable."

Let’s see this in action. Imagine a user asking about "the capital of France."

First, the RAG system needs to embed this question to find similar documents. It looks for an embedding of "the capital of France" in its cache.

# Hypothetical cache lookup
embedding_cache = {}  # In-memory dict: normalized query -> embedding vector

cache_key = "the capital of france"  # Lowercased and normalized query
if cache_key in embedding_cache:
    cached_embedding = embedding_cache[cache_key]
    print("Cache hit! Using cached embedding.")
    # Proceed with the cached embedding
else:
    print("Cache miss. Generating new embedding.")
    # Call the embedding API (call_embedding_api is defined in the next snippet)
    new_embedding = call_embedding_api(cache_key)
    # Store for future use (a real system would first check that the query is cacheable)
    embedding_cache[cache_key] = new_embedding

If it’s a cache miss, the system calls an external embedding API (like OpenAI’s text-embedding-ada-002). This costs money and takes time.

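# Hypothetical API call (openai>=1.0 client interface; older releases used openai.Embedding.create)
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

def call_embedding_api(text):
    # Request a single embedding and return it as a list of floats
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding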

# Example usage
embedding = call_embedding_api("the capital of france")

The problem this solves is the cost and latency of repeatedly embedding the same user queries. For a chatbot that might answer the same basic questions thousands of times a day, calling an API for each instance is wasteful. A cache removes the API call for every query whose normalized form has been seen before; queries that are merely similar still miss unless normalization maps them to the same key.
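To make the saving concrete, here is a minimal sketch that counts how many embedding calls a stream of queries would trigger with and without an exact-match cache. The traffic mix and the count_api_calls helper are invented for illustration; real hit rates depend entirely on your own query distribution.

```python
def count_api_calls(queries, use_cache):
    """Count how many embedding calls a stream of queries would trigger."""
    cache = {}
    calls = 0
    for q in queries:
        key = q.lower().strip()
        if use_cache and key in cache:
            continue  # Cache hit: no API call needed
        calls += 1  # Stand-in for a real embedding API call
        cache[key] = [0.0]  # Placeholder vector; a real cache stores the embedding
    return calls

# Hypothetical traffic: two FAQs repeated many times plus 20 one-off questions
traffic = (
    ["What are your opening hours?"] * 50
    + ["How do I reset my password?"] * 30
    + [f"one-off question {i}" for i in range(20)]
)

without_cache = count_api_calls(traffic, use_cache=False)
with_cache = count_api_calls(traffic, use_cache=True)
print(f"API calls without cache: {without_cache}, with cache: {with_cache}")
```

In this made-up stream the repeated questions cost one call each, so only the first occurrence of each distinct query reaches the API.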

Internally, the cache is typically a dictionary-like structure. The key is a normalized version of the user’s query (e.g., lowercased, punctuation removed, maybe stemmed). The value is the actual embedding vector returned by the API.

The "cacheable" decision is crucial. Not every query should be cached. Queries that are highly specific, contain timestamps, or are part of a session-specific context are poor candidates. A common strategy is to only cache queries that match a predefined set of "frequently asked questions" or have a high confidence score from a similarity search against a knowledge base of common queries.
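One way to express the "cacheable" decision is a small predicate that runs before anything is written to the cache. The sketch below is an assumption, not a standard recipe: the FAQ_QUERIES allow-list, the VOLATILE_PATTERNS regexes, and the is_cacheable name are all hypothetical and would need tuning for a real workload.

```python
import re

# Hypothetical allow-list of normalized FAQ-style queries
FAQ_QUERIES = {"what are your opening hours", "how do i reset my password"}

# Hypothetical patterns that suggest time-bound or session-bound text
VOLATILE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                 # ISO dates such as 2024-05-01
    re.compile(r"\b\d{1,2}:\d{2}\b"),                     # Clock times such as 14:30
    re.compile(r"\b(today|yesterday|tomorrow|now)\b"),    # Relative time words
    re.compile(r"\b(my|this) (order|ticket|session)\b"),  # Session-specific references
]

def normalize(query):
    """Lowercase and strip punctuation, mirroring the cache key normalization."""
    return re.sub(r"[^\w\s]", "", query.lower()).strip()

def is_cacheable(query):
    """Return True only for known FAQ queries that contain no volatile markers."""
    lowered = query.lower()
    # Check volatile patterns before punctuation is stripped, so dates and times survive
    if any(p.search(lowered) for p in VOLATILE_PATTERNS):
        return False
    return normalize(query) in FAQ_QUERIES

print(is_cacheable("What are your opening hours?"))
print(is_cacheable("What happened to my order on 2024-05-01?"))
```

A caller would check is_cacheable(query) on a cache miss and skip the write when it returns False.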

Here’s a look at a simplified cache implementation:

import re
import time

class EmbeddingCache:
    def __init__(self, max_size=10000):
        self.cache = {}  # normalized query -> {'embedding': ..., 'used_at': ...}
        self.max_size = max_size
        self.eviction_policy = "LRU" # Least Recently Used

    def _normalize_query(self, query):
        # Basic normalization: lowercase and remove punctuation
        query = query.lower()
        query = re.sub(r'[^\w\s]', '', query)
        return query.strip()

    def get(self, query):
        normalized_query = self._normalize_query(query)
        if normalized_query in self.cache:
            # For LRU, mark as recently used
            self.cache[normalized_query]['used_at'] = time.monotonic()
            return self.cache[normalized_query]['embedding']
        return None

    def put(self, query, embedding):
        normalized_query = self._normalize_query(query)
        # Evict only when adding a new key to a full cache, not when overwriting
        if normalized_query not in self.cache and len(self.cache) >= self.max_size:
            self._evict()
        self.cache[normalized_query] = {'embedding': embedding, 'used_at': time.monotonic()}

    def _evict(self):
        if self.eviction_policy == "LRU":
            # Linear scan for the least recently used entry
            lru_key = min(self.cache, key=lambda k: self.cache[k]['used_at'])
            del self.cache[lru_key]
        # Add other policies like FIFO, LFU if needed

# Example usage
embedding_cache = EmbeddingCache(max_size=5000)

# First time: Cache miss
query1 = "What is the largest planet in our solar system?"
embedding1 = embedding_cache.get(query1)
if embedding1 is None:
    embedding1 = call_embedding_api(query1)
    embedding_cache.put(query1, embedding1)

# Second time: Cache hit
query2 = "What is the largest planet in our solar system?"
embedding2 = embedding_cache.get(query2)
if embedding2 is None:
    # This block won't be hit if query2 is identical to query1
    embedding2 = call_embedding_api(query2)
    embedding_cache.put(query2, embedding2)

print(f"Embeddings are equal: {embedding1 == embedding2}")
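The min() scan in _evict is linear in the size of the cache. One alternative, sketched here as an assumption rather than as part of the implementation above, is to keep entries in a collections.OrderedDict so that both the "mark as recently used" step and eviction are constant time. OrderedLRUCache is a hypothetical name, and keys are assumed to be normalized already.

```python
from collections import OrderedDict

class OrderedLRUCache:
    """LRU cache sketch: the OrderedDict order doubles as the recency order."""

    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._entries = OrderedDict()  # Oldest entry first, newest last

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # Mark as most recently used
        return self._entries[key]

    def put(self, key, embedding):
        if key in self._entries:
            self._entries.move_to_end(key)  # Overwrite counts as a use
        elif len(self._entries) >= self.max_size:
            self._entries.popitem(last=False)  # Drop the least recently used entry
        self._entries[key] = embedding

cache = OrderedLRUCache(max_size=2)
cache.put("a", [1.0])
cache.put("b", [2.0])
cache.get("a")         # "a" becomes the most recently used entry
cache.put("c", [3.0])  # Cache is full, so "b" is evicted
```

The trade-off is that this version hard-codes LRU, whereas the dictionary-plus-timestamp layout leaves room for other policies.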

The levers you control are the max_size of the cache, the eviction_policy (only LRU is implemented in the class above; FIFO or LFU would each need their own branch in _evict), and the _normalize_query function. A more aggressive normalization might lead to more cache hits but could also incorrectly match dissimilar queries. A less aggressive normalization might result in fewer hits but higher accuracy when a hit does occur.
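The normalization trade-off is easy to see with two normalizers side by side. In this sketch, light_normalize mirrors the lowercase-and-strip-punctuation approach, while aggressive_normalize additionally drops stopwords; both function names and the STOPWORDS set are hypothetical choices made for illustration.

```python
import re

# Hypothetical stopword list; a real one would be larger and language-specific
STOPWORDS = {"the", "a", "an", "of", "to", "from", "is", "in", "what"}

def light_normalize(query):
    """Lowercase and remove punctuation only."""
    return re.sub(r"[^\w\s]", "", query.lower()).strip()

def aggressive_normalize(query):
    """Also drop stopwords, which merges more phrasings into one key."""
    tokens = light_normalize(query).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

# Helpful merge: two phrasings of the same question share a key
q1, q2 = "What is the capital of France?", "the capital of france"
print(light_normalize(q1) == light_normalize(q2))
print(aggressive_normalize(q1) == aggressive_normalize(q2))

# Harmful merge: opposite travel directions collapse to the same key
a, b = "Flights from Paris to Rome", "Flights to Paris from Rome?"
print(light_normalize(a) == light_normalize(b))
print(aggressive_normalize(a) == aggressive_normalize(b))
```

The second pair shows the risk: once "from" and "to" are removed, two queries with different meanings would be served the same cached embedding.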

A common mistake is to think that the cache stores the documents retrieved with the embedding. It doesn’t. It only stores the embedding for the query itself. The subsequent document retrieval and ranking steps still run on every request, once the embedding has been obtained, whether it came from the cache or from the API.
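The separation between the cached query embedding and the retrieval step can be sketched with a toy in-memory index. Everything here is illustrative: retrieve, fake_embed, and the two-dimensional document vectors are stand-ins, and a real system would query a vector store instead of sorting a dict.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def retrieve(query, cache, embed_fn, doc_vectors, top_k=1):
    """Get the query embedding (cache first), then always rank documents."""
    key = query.lower().strip()
    vector = cache.get(key)
    if vector is None:
        vector = embed_fn(query)  # Cache miss: pay for the embedding once
        cache[key] = vector
    # Retrieval runs on every call, regardless of hit or miss
    ranked = sorted(doc_vectors, key=lambda d: cosine(vector, doc_vectors[d]), reverse=True)
    return ranked[:top_k]

calls = []  # Records each time the stand-in embedding function is invoked

def fake_embed(text):
    calls.append(text)
    return [1.0, 0.0]  # Fixed toy vector in place of a real embedding

docs = {"doc_paris": [0.9, 0.1], "doc_jupiter": [0.1, 0.9]}
cache = {}

first = retrieve("capital of France", cache, fake_embed, docs)
docs["doc_france_guide"] = [1.0, 0.0]  # Knowledge base changes between requests
second = retrieve("capital of France", cache, fake_embed, docs)
print(first, second, len(calls))
```

The second request is a cache hit, so no new embedding is computed, yet its results reflect the newly added document because ranking is recomputed each time.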

The next problem to tackle is efficiently invalidating or updating cache entries when the underlying knowledge base changes.
