The most surprising thing about reducing RAG hallucinations is that the problem isn’t only finding more relevant documents. It is also how the retriever ranks and prioritizes the ones it does find.

Let’s see this in action. Imagine we have a RAG system that fetches documents for a query like "What are the common side effects of drug X?"

# Hypothetical RAG system setup
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load and chunk documents (simplified)
loader = TextLoader("drug_x_info.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # Fetch top 3 chunks

# LLM and prompt
# gpt-3.5-turbo is a chat model, so it is wrapped with ChatOpenAI.
# A low temperature keeps sampling closer to the retrieved context.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt_template)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

# Query
query = "What are the common side effects of drug X?"
result = qa_chain.invoke({"query": query})
print(result["result"])

The retriever here fetches the top k=3 chunks based on a similarity score. If the top 3 chunks, despite being the most similar, contain conflicting or incomplete information about drug X’s side effects, the LLM will struggle. It might synthesize a hallucinated side effect by combining partial mentions from different chunks, or it might confidently state something incorrect if one chunk is slightly misleading.

The core problem RAG aims to solve is providing the LLM with factual, contextually relevant information to prevent it from "making things up" (hallucinating) when it doesn’t have direct knowledge. A RAG system works by:

  1. Retrieval: When a user asks a question, the system first searches a knowledge base (like a vector database) for documents or text snippets that are semantically similar to the query. This is typically done using embeddings.
  2. Augmentation: The retrieved snippets are then combined with the original user query.
  3. Generation: This augmented prompt (query + retrieved context) is fed to a Large Language Model (LLM), which then generates an answer based on the provided context.
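
The three steps above can be sketched without any external services. This is a toy illustration, not how a production retriever works: `embed` is a bag-of-words stand-in for a real embedding model, and `stub_llm` is a hypothetical placeholder for the LLM call.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Step 1: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def augment(query: str, context: list[str]) -> str:
    """Step 2: combine the retrieved chunks with the user question."""
    joined = "\n\n".join(context)
    return f"Use the following context to answer the question.\n\n{joined}\n\nQuestion: {query}\nHelpful Answer:"


def stub_llm(prompt: str) -> str:
    """Step 3 placeholder: a real system would send the prompt to an LLM here."""
    return f"[stub answer based on {len(prompt)} prompt characters]"


chunks = [
    "Drug X commonly causes nausea and headache.",
    "Drug X should be stored below 25 degrees Celsius.",
    "Drug Y is used to treat high blood pressure.",
]
query = "What are the side effects of drug X?"
top_chunks = retrieve(query, chunks, k=2)
prompt = augment(query, top_chunks)
answer = stub_llm(prompt)
print(top_chunks)
```

Note that with k=2 the storage chunk is retrieved alongside the side-effects chunk simply because it also mentions "drug X". Similar is not the same as relevant.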

The "grounding" aspect means ensuring the LLM’s answer is firmly rooted in the retrieved documents. "Verification" implies a process to check if the generated answer actually aligns with the source material.

The value of RAG lies not just in which documents are retrieved, but in how they’re presented to the LLM. Suppose the retriever (k=3) returns three chunks: chunk 1 says "nausea is common," chunk 2 says "headache is a side effect," and chunk 3 mentions "occasional dizziness." The LLM should be able to synthesize these. But if chunk 1 actually says "nausea is not a common side effect, but a rare one," and the LLM latches onto "nausea" alone, it can report nausea as common. Both the ranking and the precise wording matter.

Consider the search_kwargs={"k": 3} in the retriever setup. This tells the system to fetch the 3 most similar chunks. But what if the 4th and 5th chunks are also highly relevant, and contain crucial disambiguating information or a more nuanced answer? A simple k=3 might miss this. This is where techniques like reranking come in. After initial retrieval, a more sophisticated model can re-evaluate the order of the fetched chunks or even fetch more chunks and then prune them down. For instance, using a cross-encoder model for reranking can significantly improve the relevance of the top-k documents passed to the LLM, reducing the chance of irrelevant or contradictory context polluting the prompt.

Another critical lever is the prompt engineering itself. The prompt template shown above includes "If you don’t know the answer, just say that you don’t know, don’t try to make up an answer." This is a direct instruction to the LLM to avoid hallucination. However, the effectiveness of this instruction depends on the LLM’s ability to truly identify when the context is insufficient. If the context is subtly misleading, the LLM might still try to answer.

A common pitfall is assuming that higher similarity scores always mean better context. Sometimes, a query might be very specific, and the most similar chunks might be slightly off-topic or only partially relevant, while a slightly less similar chunk might contain the exact answer. This is where techniques that go beyond simple vector similarity become powerful. For example, using a hybrid search that combines keyword-based search (like BM25) with vector similarity can capture both semantic meaning and exact term matches, leading to more robust retrieval. You might configure your retriever like this:

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Assuming you have your Chroma instance 'vectorstore' and chunk list 'texts' already created
vectorstore_retriever = vectorstore.as_retriever(search_kwargs={"k": 5}) # Fetch more initially

# Initialize BM25 retriever (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(texts)
bm25_retriever.k = 5

# Combine both retrievers; EnsembleRetriever merges the two ranked lists
# with weighted Reciprocal Rank Fusion. The equal weights are a starting point to tune.
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vectorstore_retriever],
    weights=[0.5, 0.5],
)

# Pass hybrid_retriever to RetrievalQA.from_chain_type(...) in place of the plain vector retriever.
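
To make the merging step less abstract, here is a minimal sketch of reciprocal rank fusion, one common way to combine a keyword ranking with a vector ranking. The chunk IDs and both rankings are made up for illustration, and `k=60` is the smoothing constant conventionally used with this method, not a value taken from the setup above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical rankings from a BM25 retriever and a vector retriever.
bm25_ranking = ["chunk_7", "chunk_2", "chunk_9"]
vector_ranking = ["chunk_2", "chunk_4", "chunk_7"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], top_n=3)
print(fused)
```

The chunks that both retrievers agree on (`chunk_2` and `chunk_7`) end up ahead of chunks that only one retriever found.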

Instead of just k=3, you might fetch k=5 or k=10 and then add a reranking step. A reranker takes the initial set of retrieved documents and reorders them using a more powerful model, often a cross-encoder, that scores the query and each document together as a pair. This can be crucial for surfacing the most pertinent information.
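
Here is a rough sketch of that fetch-more-then-prune pattern. The scorer is pluggable: `overlap_score` is a deliberately simple stand-in so the example runs anywhere, and the candidate texts are invented. In a real system you would likely swap in a cross-encoder (for example, the `CrossEncoder.predict` method from the sentence-transformers package, which scores query-document pairs); that substitution is an assumption, not something the setup above includes.

```python
import re
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score every (query, document) pair and keep the top_n documents."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query terms that also appear in the document."""
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    d_terms = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


# Pretend these five chunks came back from an initial k=5 retrieval.
candidates = [
    "Drug X is manufactured in tablets of 10 mg.",
    "Storage instructions for drug X.",
    "Common side effects of drug X include nausea and headache.",
    "Drug Y side effects are mild.",
    "Drug X was approved after three trials.",
]

top = rerank("common side effects of drug X", candidates, overlap_score, top_n=2)
print(top[0])
```

The chunk that was third in the initial list moves to first place after reranking, and only the pruned list is passed to the LLM.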

Another powerful technique is self-critique applied during generation, in the spirit of the critique-and-revise loop used in Constitutional AI. After the LLM generates an initial answer, a second LLM call (or a dedicated critique model) evaluates the answer against the retrieved context, checking for factual consistency and adherence to instructions. If the critique finds issues, the answer is regenerated. This adds a layer of verification.
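
A minimal sketch of that generate, critique, regenerate loop follows. Everything here is a simplified assumption: the `generate` callable is a hypothetical stand-in for the LLM call, and the critique is a crude lexical-overlap check rather than a second model. A lexical check cannot catch negation ("not common" vs. "common"), which is exactly why a real critique step would use an LLM or an entailment model.

```python
import re
from typing import Callable


def unsupported_sentences(answer: str, context: list[str], min_overlap: float = 0.6) -> list[str]:
    """Return answer sentences whose words are mostly absent from the context."""
    context_terms = set(re.findall(r"[a-z0-9]+", " ".join(context).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        terms = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if not terms:
            continue
        support = len(terms & context_terms) / len(terms)
        if support < min_overlap:
            flagged.append(sentence)
    return flagged


def answer_with_critique(
    generate: Callable[[list[str]], str],
    context: list[str],
    max_attempts: int = 2,
) -> str:
    """Generate an answer, critique it against the context, and retry if it fails."""
    for _ in range(max_attempts):
        answer = generate(context)
        if not unsupported_sentences(answer, context):
            return answer
    # Every attempt failed the critique, so abstain instead of guessing.
    return "I cannot find the answer in the provided context."


context = ["Drug X commonly causes nausea and headache."]

# Hypothetical drafts: the first invents side effects, the second stays grounded.
drafts = iter([
    "Drug X causes severe hair loss and chronic insomnia.",
    "Drug X commonly causes nausea.",
])
result = answer_with_critique(lambda ctx: next(drafts), context)
print(result)
```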

The prompt template is also key. Instead of just asking for an answer, you can instruct the LLM to explicitly cite its sources within the retrieved context. For example:

Use the following pieces of context to answer the question at the end. For each statement in your answer, cite the chunk number it came from. If you cannot find the answer in the context, state "I cannot find the answer in the provided context."

Context:
Chunk 1: {chunk1_content}
Chunk 2: {chunk2_content}
...

This forces the LLM to map its generated text back to specific source material. Hallucinations become much easier to detect, because an uncited or mis-cited statement stands out, and they often become less likely, because the model has to tie each claim to a specific chunk.
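
Once answers carry citations, you can check them mechanically. The sketch below builds a numbered-chunk prompt like the one above and then flags citations that point at chunks that were never supplied, plus sentences with no citation at all. The `[Chunk N]` citation format and the sample answer are assumptions made for this example; you would adapt the pattern to whatever format your prompt requests.

```python
import re


def build_cited_prompt(question: str, chunks: list[str]) -> str:
    """Number each chunk so the model can cite it as [Chunk N]."""
    numbered = "\n".join(f"Chunk {i}: {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Use the following pieces of context to answer the question at the end. "
        "For each statement in your answer, cite the chunk number it came from, e.g. [Chunk 2]. "
        'If you cannot find the answer in the context, state '
        '"I cannot find the answer in the provided context."\n\n'
        f"Context:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def check_citations(answer: str, num_chunks: int) -> dict:
    """Find citations to non-existent chunks and sentences without any citation."""
    cited = [int(n) for n in re.findall(r"\[Chunk (\d+)\]", answer)]
    invalid = sorted({n for n in cited if not 1 <= n <= num_chunks})
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not re.search(r"\[Chunk \d+\]", s)]
    return {"invalid": invalid, "uncited": uncited}


chunks = ["Nausea is a common side effect.", "Headache is a side effect."]
prompt = build_cited_prompt("What are the common side effects of drug X?", chunks)

# Hypothetical model output with one bad citation and one uncited claim.
sample_answer = "Nausea is common [Chunk 1]. Hair loss is frequent [Chunk 7]. It is safe for children."
report = check_citations(sample_answer, num_chunks=len(chunks))
print(report)
```

A non-empty `invalid` or `uncited` list is a cheap signal that the answer needs regeneration or human review.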

The final piece of the puzzle is fact-checking the knowledge base itself. Sometimes the documents behind your retriever are outdated or contain errors. Regularly updating your vector store, and ideally running a separate process to curate and validate the source documents, is essential. If the source material is flawed, RAG will faithfully reproduce those flaws, and the result will look like a hallucination even though the model stayed grounded.
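
One lightweight way to start is to track when each source document was last reviewed and flag anything past a cutoff before it is re-indexed. This is only a sketch: the `last_reviewed` field, the file names, and the one-year threshold are all hypothetical and would depend on how your own knowledge base stores metadata.

```python
from datetime import date, timedelta


def find_stale_documents(documents: list[dict], today: date, max_age_days: int = 365) -> list[str]:
    """Return the sources whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [doc["source"] for doc in documents if doc["last_reviewed"] < cutoff]


# Hypothetical metadata for two source documents.
docs = [
    {"source": "drug_x_label_2021.txt", "last_reviewed": date(2021, 3, 1)},
    {"source": "drug_x_label_2024.txt", "last_reviewed": date(2024, 5, 20)},
]

stale = find_stale_documents(docs, today=date(2024, 12, 1))
print(stale)
```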

The next challenge you’ll face is dealing with queries that require synthesizing information across many disparate documents, where the retriever might pull relevant snippets but the LLM struggles to weave them into a coherent, non-contradictory narrative.
