Conversation history is what makes RAG feel like a dialogue rather than a Q&A bot that forgets everything between questions.
Let’s see this in action. Imagine a user asking about a specific product, then following up with a question about its compatibility.
User: "Tell me about the new Stellaris X1 laptop."
RAG system retrieves information about the Stellaris X1, including its specs and features.
Assistant: "The Stellaris X1 features a 15.6-inch QHD display, an Intel Core i9 processor, 32GB of RAM, and an NVIDIA RTX 4070 GPU. It’s designed for high-performance gaming and content creation."
User: "Is it compatible with the latest VR headsets?"
Here’s where conversation history becomes crucial. The RAG system needs to know that "it" refers to the Stellaris X1. It pulls in the previous turn’s context and then searches its knowledge base for VR headset compatibility specifically for the Stellaris X1.
Assistant: "Yes, the Stellaris X1 meets the recommended specifications for all major VR headsets, including the Valve Index and Meta Quest 3, due to its powerful GPU and ample RAM."
This multi-turn capability is built by augmenting the user’s current query with relevant past exchanges. The core problem this solves is maintaining context across multiple user inputs, allowing the AI to understand follow-up questions that rely on previous turns. Without it, each query is treated in isolation, and the AI would have no idea what "it" or "that" refers to.
Internally, this often involves a simple strategy: concatenating the last N turns of the conversation (both user and assistant messages) into a single, larger prompt that is then fed to the RAG system’s retrieval and generation components. N is a critical tunable parameter. Too small, and context is lost. Too large, and the prompt becomes unwieldy, potentially diluting the focus on the current query or exceeding the model’s token limit.
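The last-N-turns strategy can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `window_history`, the `(role, text)` tuple format, and the whitespace-based token estimate are all assumptions made for the example. A real system would count tokens with its own model's tokenizer.

```python
from typing import List, Tuple

# Hypothetical representation of one turn: (role, text).
Turn = Tuple[str, str]


def estimate_tokens(turns: List[Turn]) -> int:
    """Crude token estimate: count whitespace-separated words.

    This is an assumption for illustration only; substitute the
    tokenizer of the model you actually use.
    """
    return sum(len(text.split()) for _, text in turns)


def window_history(history: List[Turn], n_turns: int, max_tokens: int) -> List[Turn]:
    """Keep at most the last n_turns turns, then drop the oldest
    remaining turns until the estimate fits within max_tokens."""
    # Guard n_turns == 0 explicitly, because history[-0:] is the whole list.
    recent = history[-n_turns:] if n_turns > 0 else []
    while recent and estimate_tokens(recent) > max_tokens:
        recent = recent[1:]  # discard the oldest turn first
    return recent
```

Dropping the oldest turns first reflects the assumption that recent turns matter most for resolving references like "it". That assumption may not hold for every conversation.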
A common approach is to format the conversation history clearly. For instance, you might prefix each turn:
User: Tell me about the new Stellaris X1 laptop.
Assistant: The Stellaris X1 features a 15.6-inch QHD display, an Intel Core i9 processor, 32GB of RAM, and an NVIDIA RTX 4070 GPU. It's designed for high-performance gaming and content creation.
User: Is it compatible with the latest VR headsets?
This structured input helps the underlying language model parse the history and the current query effectively. The retrieval mechanism then uses this combined input to find the most relevant documents, and the generation model uses both the retrieved documents and the full conversation history to formulate a coherent response.
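As a rough sketch of that flow, the snippet below formats turns with role prefixes, passes the combined text to a retriever, and assembles a generation prompt. The names `format_history` and `build_generation_prompt`, the prompt layout, and the `retrieve` callable are hypothetical. The retriever is passed in as a plain function, so no particular library API is assumed.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of one turn: (role, text).
Turn = Tuple[str, str]


def format_history(turns: List[Turn]) -> str:
    """Render each turn on its own line as 'Role: text'."""
    return "\n".join(f"{role}: {text}" for role, text in turns)


def build_generation_prompt(
    turns: List[Turn],
    retrieve: Callable[[str], List[str]],
) -> str:
    """Use the formatted conversation as the retrieval query, then
    combine the retrieved documents and the conversation into one prompt."""
    conversation = format_history(turns)
    documents = retrieve(conversation)  # retriever is supplied by the caller
    context = "\n".join(f"- {doc}" for doc in documents)
    # The trailing 'Assistant:' cues the model to produce the next turn.
    return f"Context:\n{context}\n\nConversation:\n{conversation}\nAssistant:"
```

In use, `turns` would end with the current user question, so the last line of the conversation block is the query the model should answer.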
The complexity often lies in deciding what to include from the history. Simply concatenating raw text can lead to noise. More sophisticated methods involve summarizing previous turns, extracting key entities, or using a dedicated "context manager" to decide which parts of the history are most pertinent to the current query. For example, if the conversation veers off-topic and then returns, you might want to prune the irrelevant interlude.
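One very simple form of such a context manager is a word-overlap filter: keep a past turn only if it shares content words with the current query, and always keep the most recent turns so pronouns still resolve. The sketch below is illustrative only. The function names, the length-based word filter, and the thresholds are assumptions, and a production system would more likely compare embeddings or summarize instead.

```python
import re
from typing import List, Set, Tuple

# Hypothetical representation of one turn: (role, text).
Turn = Tuple[str, str]


def content_words(text: str) -> Set[str]:
    """Lowercase alphanumeric words longer than three characters.

    The length cutoff is a crude stand-in for stopword removal.
    """
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def select_relevant_turns(
    history: List[Turn],
    query: str,
    min_overlap: int = 1,
    always_keep_last: int = 2,
) -> List[Turn]:
    """Keep turns that overlap with the query, plus the most recent ones."""
    query_words = content_words(query)
    cutoff = len(history) - always_keep_last
    selected = []
    for index, (role, text) in enumerate(history):
        is_recent = index >= cutoff  # recent turns are kept unconditionally
        overlap = len(query_words & content_words(text))
        if is_recent or overlap >= min_overlap:
            selected.append((role, text))
    return selected
```

The `always_keep_last` safeguard matters because a follow-up such as "Is it compatible with the latest VR headsets?" shares no content words with the turn that introduced the laptop. Pure overlap filtering would discard exactly the turn that the pronoun depends on.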
One of the most overlooked aspects is how the retrieval system itself handles this augmented query. If your retriever is keyword-based, conversational filler carried over from the history can match irrelevant documents and degrade results. Embedding-based retrievers are generally more robust, as they can capture semantic similarity even with longer, more complex inputs. However, length still matters: embedding models have a maximum input length and may truncate text beyond it, a single vector for a long, multi-topic input can blur the focus on the current question, and longer inputs cost more to embed.
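For a keyword-based retriever, one possible mitigation is to strip filler words and role labels from the augmented text before it is used as a query. The sketch below assumes a tiny hand-written filler list and a hypothetical `keyword_query` helper. A real system would use a proper stopword list suited to its language and domain.

```python
import re
from typing import List

# Small illustrative filler list (an assumption, not a standard stopword set).
# Role labels are included so "User:" / "Assistant:" prefixes are dropped too.
FILLER = {
    "user", "assistant", "the", "a", "an", "is", "it", "me", "about",
    "tell", "with", "of", "and", "to", "for", "in", "new", "latest",
}


def keyword_query(augmented_text: str) -> List[str]:
    """Return unique non-filler keywords in order of first appearance."""
    tokens = re.findall(r"[a-z0-9]+", augmented_text.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token in FILLER or token in seen:
            continue
        seen.add(token)
        keywords.append(token)
    return keywords
```

Applied to the two user turns from the example conversation, this keeps terms such as "stellaris", "x1", and "headsets" and discards the filler around them.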
The next frontier is adaptive context management, where the system selects or summarizes history based on how far the current query has drifted semantically from earlier turns.