RAG is usually described as adding context to a prompt, but contextual compression is about removing information: it strips the retrieved text down to what the question actually needs.

Let’s see it in action. Imagine we have a document about the history of the Roman Empire and we want to ask a question about the Punic Wars.

{
  "question": "What were the main causes of the Punic Wars?",
  "documents": [
    {
      "id": "roman_history_v1.txt",
      "content": "The Roman Republic, founded in 509 BC, grew rapidly through military conquest and political maneuvering. Early expansion saw conflicts with neighboring Italian tribes and the Etruscans.  \n\nHowever, Rome's true test came with Carthage, a powerful Phoenician city-state in North Africa.  Their rivalry, fueled by competing economic interests and territorial ambitions in the Mediterranean, led to a series of devastating conflicts known as the Punic Wars. \n\nThe First Punic War (264-241 BC) was primarily fought over Sicily. Rome, initially a land power, developed a strong navy to challenge Carthaginian dominance at sea. Key battles included Mylae and Aegates Islands. \n\nThe Second Punic War (218-201 BC) is famous for Hannibal's invasion of Italy, crossing the Alps with his army and elephants.  He inflicted crushing defeats on the Romans at Trebia, Trasimene, and Cannae. Despite these setbacks, Rome ultimately prevailed, largely due to Scipio Africanus's campaign in North Africa. \n\nThe Third Punic War (149-146 BC) was a more straightforward siege and destruction of Carthage. Cato the Elder's famous cry, 'Carthago delenda est' ('Carthage must be destroyed'), reflects the Roman sentiment. \n\nBeyond these major conflicts, Roman society underwent significant changes. The Gracchi brothers attempted land reforms in the late 2nd century BC, aiming to address growing inequality. The rise of powerful generals like Marius and Sulla led to civil wars and the eventual end of the Republic.  Later, the transition to the Roman Empire under Augustus marked a new era.  The construction of aqueducts and roads facilitated trade and communication across the vast empire.  The Pax Romana, a period of relative peace and prosperity, characterized the early centuries of the Empire."
    }
  ]
}

A naive RAG system might just concatenate all of this content and feed it to a large language model. But notice all the information about the Gracchi brothers, Marius, Sulla, aqueducts, and the Pax Romana? That’s noise for our specific question. Contextual compression aims to filter that out.

Contextual compression addresses two related problems: the context window limit of LLMs, and the dilution of relevant information when too much irrelevant context surrounds it. An LLM can process only a finite number of tokens at once. If you stuff the window with irrelevant material, the truly important bits get lost, and answers become less accurate or even nonsensical. Compression techniques let us pack more signal into that limited window.
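To make the "limited window" idea concrete, here is a minimal sketch of packing passages into a fixed token budget. Everything in it is an assumption for illustration: the function name pack_context is hypothetical, relevance scores are taken as given, and token cost is approximated by counting whitespace-separated words rather than using a real tokenizer.

```python
def pack_context(scored_passages, max_tokens):
    """Greedily keep the highest-scoring passages that fit in the budget.

    scored_passages: list of (score, passage) pairs; scores are assumed
    to come from some earlier relevance step.
    max_tokens: size of the budget, in the same crude units as `cost`.
    """
    selected = []
    used = 0
    ranked = sorted(scored_passages, key=lambda sp: sp[0], reverse=True)
    for score, passage in ranked:
        # Crude estimate: whitespace words, not real model tokens.
        cost = len(passage.split())
        if used + cost <= max_tokens:
            selected.append(passage)
            used += cost
    return selected, used
```

With a budget this tight, a low-scoring passage is dropped even though it was retrieved, which is exactly the trade-off compression is making.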

Here’s how it works conceptually:

  1. Initial Retrieval: A standard retrieval mechanism (like a vector database search) pulls a set of potentially relevant documents or passages.
  2. Re-ranking/Filtering: This is where compression happens. Rather than passing the raw retrieved chunks straight through, the system uses a lighter-weight model or a set of heuristics to analyze each chunk against the original query, then down-ranks or filters out passages that are unlikely to help.
  3. Final Context Assembly: Only the most relevant, compressed set of passages is passed to the LLM for synthesis.
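The three steps above can be sketched end to end in a few lines. This is a toy under stated assumptions, not a production pipeline: a bag-of-words cosine similarity stands in for a real embedding search, the stopword list is deliberately tiny, and the names retrieve, compress, and build_context are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to",
             "was", "were", "what", "with", "over"}

def term_counts(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def similarity(query, passage):
    """Bag-of-words cosine similarity (stand-in for embedding similarity)."""
    q, p = term_counts(query), term_counts(passage)
    dot = sum(q[t] * p[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, n):
    """Step 1: pull the n passages most similar to the query."""
    return sorted(corpus, key=lambda p: similarity(query, p), reverse=True)[:n]

def compress(query, passages, min_score):
    """Step 2: drop passages whose score falls below the threshold."""
    return [p for p in passages if similarity(query, p) >= min_score]

def build_context(query, corpus, n, min_score):
    """Step 3: join the surviving passages into the final context."""
    return "\n\n".join(compress(query, retrieve(query, corpus, n), min_score))

corpus = [
    "The Roman Republic, founded in 509 BC, grew rapidly through military conquest.",
    "Rivalry with Carthage over trade and territory in the Mediterranean led to the Punic Wars.",
    "The First Punic War was primarily fought over Sicily.",
    "Aqueducts and roads facilitated trade across the empire.",
]
query = "What were the main causes of the Punic Wars?"
context = build_context(query, corpus, n=3, min_score=0.1)
print(context)
```

Here retrieval and compression share one scoring function only to keep the sketch short. In practice the second step would typically use a different, more discriminating scorer than the first.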

Think of it like a librarian who, before giving you a stack of books on "Roman Wars," first quickly flips through each one, pulls out the chapters specifically about the Punic Wars, and discards the chapters about Roman plumbing or the daily lives of senators.

The levers you control are mostly the parameters of the re-ranking or filtering step. For example, you might adjust:

  • Similarity Thresholds: How close a passage needs to be to the query’s embedding to be considered.
  • Keyword Overlap Heuristics: Using simple keyword matching to boost or penalize passages.
  • LLM-based Re-rankers: Employing a smaller, faster LLM to score the relevance of each retrieved chunk to the original query. This is powerful but computationally more expensive.
  • Summarization/Extraction: Instead of passing full passages, you might pass compressed summaries of key information from those passages.
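Two of these levers, keyword overlap and extraction, combine naturally. The sketch below keeps only the sentences of a passage that share enough content words with the query. It is an illustration under assumptions: the function names and the min_overlap parameter are hypothetical, the stopword list is minimal, and the sentence splitter is a naive regular expression that will mis-split abbreviations.

```python
import re

# Minimal stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "as", "the", "of", "in", "to",
             "was", "were", "what", "over", "their"}

def content_terms(text):
    """Lowercased words of the text, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def extract_relevant_sentences(query, passage, min_overlap=1):
    """Keep sentences sharing at least min_overlap content terms with the query."""
    query_terms = content_terms(query)
    # Naive split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    kept = [s for s in sentences
            if len(query_terms & content_terms(s)) >= min_overlap]
    return " ".join(kept)

passage = (
    "Their rivalry led to a series of devastating conflicts known as the Punic Wars. "
    "The Gracchi brothers attempted land reforms. "
    "The First Punic War was primarily fought over Sicily."
)
query = "What were the main causes of the Punic Wars?"
print(extract_relevant_sentences(query, passage, min_overlap=1))
```

Raising min_overlap makes the filter stricter: at 2, only the first sentence survives, because "War" does not match "Wars" without stemming. That brittleness is the main weakness of pure keyword heuristics.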

One common and effective method uses a lightweight LLM to re-rank the retrieved documents. The process looks like this: retrieve N documents; for each document d and the original query q, ask a small LLM to predict a relevance score, score(d, q); then keep the top K documents by that score. This is more sophisticated than simple embedding similarity because it can pick up nuances of relevance that embeddings might miss. For example, it can recognize that a passage mentioning "Hannibal" is about his later life rather than the Punic Wars, and score it low.
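The retrieve-N, score, keep-top-K loop can be sketched with the scorer passed in as a function. The names rerank_top_k and stub_score are hypothetical, and stub_score is only a stand-in: in a real system it would wrap a call to a small LLM that returns a relevance score, which this sketch does not attempt.

```python
from typing import Callable, List

def rerank_top_k(query: str,
                 documents: List[str],
                 score_fn: Callable[[str, str], float],
                 k: int) -> List[str]:
    """Score every document against the query and keep the k best."""
    scored = [(score_fn(d, query), d) for d in documents]
    # Sort by score only, highest first; ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def stub_score(document: str, query: str) -> float:
    """Placeholder for score(d, q): fraction of query words found in d.

    An LLM-based scorer would replace this function entirely.
    """
    terms = set(query.lower().split())
    words = set(document.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

docs = [
    "hannibal spent his later life in exile",
    "economic rivalry and territory were causes of the punic wars",
    "the first punic war was fought over sicily",
]
top = rerank_top_k("punic wars causes", docs, stub_score, k=2)
print(top)
```

Keeping the scorer pluggable means the selection logic stays the same whether the scores come from a heuristic, a cross-encoder, or an LLM. Note that the word-overlap stub cannot capture the Hannibal nuance described above; that is precisely what a model-based scorer is for.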

The next logical step after effectively compressing your context is to consider how the LLM synthesizes this highly targeted information, which often leads to exploring different prompt engineering strategies for synthesis.
