Caches for Retrieval-Augmented Generation (RAG) systems are often described as simply storing past queries and their results, but much of their performance benefit comes from a more nuanced concept: semantic deduplication.

Let’s see this in action. Imagine we have a RAG system that can answer questions about a company’s internal documentation.

Here’s a simplified representation of a query and its retrieval results:

{
  "query": "What is the process for requesting a new laptop?",
  "retrieved_docs": [
    {
      "id": "doc_123",
      "content": "To request a new laptop, employees must submit a ticket through the IT portal. The ticket should include your department, manager's approval, and the reason for the request. Approval typically takes 2-3 business days.",
      "score": 0.92
    },
    {
      "id": "doc_456",
      "content": "The IT department handles hardware requests. New laptops are issued after a formal request is approved by your department head. Please use the online ticketing system for all such requests.",
      "score": 0.88
    }
  ],
  "llm_response": "To request a new laptop, you need to submit a ticket via the IT portal, including your department, manager's approval, and the reason. Approval usually takes 2-3 business days."
}

Now, consider a slightly different query:

{
  "query": "How do I get a new company computer?",
  "retrieved_docs": [
    {
      "id": "doc_123",
      "content": "To request a new laptop, employees must submit a ticket through the IT portal. The ticket should include your department, manager's approval, and the reason for the request. Approval typically takes 2-3 business days.",
      "score": 0.89
    },
    {
      "id": "doc_789",
      "content": "Hardware procurement policies mandate online submission for all new equipment. This includes laptops and desktops. Ensure your request is approved by your direct supervisor.",
      "score": 0.85
    }
  ],
  "llm_response": "To get a new company computer, you should submit a request through the IT portal. Make sure to include your department, your manager's approval, and the reason for the request. This process typically takes about 2-3 business days for approval."
}

Notice how the second query, "How do I get a new company computer?", is semantically very similar to the first, "What is the process for requesting a new laptop?". The wording differs, and the retriever even returns a different second document (doc_789 instead of doc_456), but the underlying intent is the same and the two LLM responses convey the same information. A naive cache keyed on the exact query text would treat these as entirely separate, re-running both retrieval and LLM generation for the second query.
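To make the limitation concrete, here is a minimal sketch of an exact-match cache. The `normalize` helper and the `exact_cache` dictionary are hypothetical names used only for illustration; the queries and the cached response are taken from the example above.

```python
def normalize(query):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(query.lower().split())


# Hypothetical exact-match cache: normalized query text -> LLM response.
exact_cache = {}

first_query = "What is the process for requesting a new laptop?"
exact_cache[normalize(first_query)] = (
    "To request a new laptop, you need to submit a ticket via the IT portal, "
    "including your department, manager's approval, and the reason. "
    "Approval usually takes 2-3 business days."
)

# The same question with different casing and spacing still hits.
same_hit = exact_cache.get(normalize("what is the process for  requesting a new laptop?"))

# A paraphrase with the same intent misses, because the key is the text itself.
paraphrase_hit = exact_cache.get(normalize("How do I get a new company computer?"))

print(same_hit is not None)    # True
print(paraphrase_hit is None)  # True
```

Normalization helps with casing and spacing, but it cannot bridge "laptop" and "company computer". That gap is what semantic matching is meant to close.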

A RAG cache with semantic deduplication, however, recognizes this similarity. It uses techniques like vector embeddings to represent the meaning of the query. If the embedding of the new query is close enough (within a defined similarity threshold) to an embedding of a previously cached query, the system can bypass the retrieval and LLM generation steps entirely. Instead, it serves the cached LLM response from the similar, earlier query.
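One way to write "close enough" precisely, assuming cosine similarity as the metric (a common choice, though not the only one): let $q$ be the embedding of the new query, $c_1, \dots, c_n$ the embeddings of the cached queries, and $\tau$ the similarity threshold. These symbols are introduced here for illustration.

```latex
\[
\operatorname{sim}(q, c_i) = \frac{q \cdot c_i}{\lVert q \rVert \, \lVert c_i \rVert},
\qquad
\text{cache hit} \iff \max_{1 \le i \le n} \operatorname{sim}(q, c_i) > \tau .
\]
```

On a hit, the response stored with the maximizing $c_i$ is the one returned.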

The core problem RAG aims to solve is grounding LLM responses in factual, up-to-date, or domain-specific information, which reduces hallucination and makes answers more relevant. A cache, in general, speeds this up by avoiding redundant computation. Semantic deduplication goes a step further: semantically equivalent queries don't need to be recomputed, even if they're phrased differently. This matters because users rarely ask the same question twice in exactly the same words.

Internally, a semantic deduplication cache typically works like this:

  1. Query Embedding: When a new query arrives, its text is converted into a high-dimensional vector (an embedding) using a separate embedding model (e.g., all-MiniLM-L6-v2).
  2. Similarity Search: This new query embedding is compared against a database of embeddings from previously cached queries. This search is usually performed using approximate nearest neighbor (ANN) algorithms for efficiency.
  3. Thresholding: If the similarity score (e.g., cosine similarity) between the new query embedding and any existing cached query embedding exceeds a predefined threshold (e.g., 0.95), the system identifies it as a semantic duplicate.
  4. Cache Hit: The system then retrieves the LLM response associated with the closest matching cached query and returns it directly, without hitting the retriever or the LLM.
  5. Cache Miss: If no sufficiently similar cached query is found, the system proceeds with the standard RAG pipeline (retrieval, LLM generation). The new query's embedding is then stored in the cache alongside the query text, its retrieved documents, and the LLM response. Only the query needs to be embedded for matching, and its embedding from step 1 can be reused.
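The five steps above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not a production design: `SemanticCache`, `CacheEntry`, and `get_or_compute` are hypothetical names, the similarity search is a brute-force scan rather than an ANN index, and `FAKE_EMBEDDINGS` holds made-up 3-dimensional vectors standing in for the output of a real embedding model. The third query is also invented for the demo.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    query: str
    embedding: Sequence[float]
    response: str


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[CacheEntry] = []

    def get_or_compute(self, query: str, run_pipeline: Callable[[str], str]) -> str:
        vec = self.embed(query)                       # 1. query embedding
        best, best_score = None, -1.0
        for entry in self.entries:                    # 2. similarity search (linear scan)
            score = cosine_similarity(vec, entry.embedding)
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score > self.threshold:  # 3. thresholding
            return best.response                      # 4. cache hit
        response = run_pipeline(query)                # 5. cache miss: run the full pipeline
        self.entries.append(CacheEntry(query, vec, response))
        return response


# Made-up vectors for illustration only; a real system would call an embedding model.
FAKE_EMBEDDINGS = {
    "What is the process for requesting a new laptop?": [0.90, 0.10, 0.05],
    "How do I get a new company computer?": [0.88, 0.15, 0.04],
    "What is the vacation policy?": [0.05, 0.10, 0.95],
}

pipeline_calls = []


def fake_pipeline(query: str) -> str:
    """Stand-in for retrieval plus LLM generation; records each invocation."""
    pipeline_calls.append(query)
    return f"answer for: {query}"


cache = SemanticCache(embed=lambda q: FAKE_EMBEDDINGS[q], threshold=0.95)
r1 = cache.get_or_compute("What is the process for requesting a new laptop?", fake_pipeline)
r2 = cache.get_or_compute("How do I get a new company computer?", fake_pipeline)
r3 = cache.get_or_compute("What is the vacation policy?", fake_pipeline)
```

With these illustrative vectors, the second query is served from the first query's entry, so the pipeline runs only for the first and third queries. A larger cache would replace the linear scan with an ANN index, as step 2 describes.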

The key levers you control are:

  • Embedding Model: The choice of model significantly impacts how "semantic similarity" is defined.
  • Similarity Threshold: A higher threshold means stricter matching, reducing cache hits but increasing the likelihood that a cached response is truly relevant. A lower threshold increases cache hits but might serve less precisely matched responses.
  • Cache Eviction Policy: For large caches, deciding which entries to remove when space is limited (e.g., Least Recently Used - LRU, Least Frequently Used - LFU) is important.
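As an illustration of the eviction lever, here is a small sketch of an LRU-bounded store built on Python's standard `collections.OrderedDict`. The class name `LRUEntryStore` and its methods are hypothetical. In a semantic cache, `get` would be called with the key of the matched cached query after the similarity search, so that a hit refreshes that entry's recency.

```python
from collections import OrderedDict


class LRUEntryStore:
    """Keeps at most `capacity` entries, evicting the least recently used one."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, most recent last

    def put(self, query, response):
        if query in self._entries:
            self._entries.move_to_end(query)
        self._entries[query] = response
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry

    def get(self, query):
        if query not in self._entries:
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return self._entries[query]

    def keys(self):
        return list(self._entries)


store = LRUEntryStore(capacity=2)
store.put("laptop request", "response A")
store.put("vacation policy", "response B")
store.get("laptop request")               # refreshes "laptop request"
store.put("expense report", "response C")  # evicts "vacation policy"
remaining = store.keys()
```

After these calls, `remaining` holds "laptop request" and "expense report", in that order. An LFU policy would track hit counts instead of recency.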

Many caching setups rely on exact string matching or simple keyword overlap. A semantic RAG cache instead matches on meaning, treating "What is the process for requesting a new laptop?" and "How do I get a new company computer?" as the same question, which avoids a retrieval step and an LLM call on every hit and so reduces both inference cost and latency.

The next challenge is handling queries where the intent is similar, but the specific parameters or context require a fresh retrieval, even if the LLM response format might be reusable.
