A/B testing a RAG system is usually less about comparing two language models and more about comparing how well two different retrieval mechanisms feed context to the same model.

Let’s see it in action. Imagine you’re building a customer support chatbot that uses RAG to answer questions about your product. You have a knowledge base of FAQs and product manuals. You want to test two different ways of finding the most relevant document chunks to answer a user’s query.

Scenario: A user asks, "How do I reset my Model X widget?"

Retrieval Strategy A (Keyword-based): This strategy simply looks for direct keyword matches between the user’s query and your documents. It might use TF-IDF or a simple BM25 implementation.

  • User Query: "How do I reset my Model X widget?"
  • Documents:
    • doc_1.txt: "The Model X widget can be reset by holding the power button for 10 seconds."
    • doc_2.txt: "Troubleshooting common issues with the Model Y device."
    • doc_3.txt: "User manual for Model X: general operation."
  • Retrieved Chunks: doc_1.txt (high relevance: matches "Model X", "widget", and "reset"), doc_3.txt (medium relevance: matches only "Model X"; the query never mentions "manual").
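
To make the keyword ranking above concrete, here is a minimal BM25 sketch over the three example documents. It is an illustration under stated assumptions, not a production implementation: the tokenizer is a simple regex, the IDF uses the smoothed Lucene-style variant, and k1=1.5 and b=0.75 are common defaults chosen here rather than values from this article.

```python
import math
import re
from collections import Counter

docs = {
    "doc_1.txt": "The Model X widget can be reset by holding the power button for 10 seconds.",
    "doc_2.txt": "Troubleshooting common issues with the Model Y device.",
    "doc_3.txt": "User manual for Model X: general operation.",
}

def tokenize(text):
    # Lowercase and keep alphanumeric runs; no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (always positive), as used by Lucene-style BM25.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores

scores = bm25_scores("How do I reset my Model X widget?", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that doc_2.txt still gets a small non-zero score because it shares the word "model" with the query; a real deployment would typically apply a score threshold or a top-k cutoff.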

Retrieval Strategy B (Semantic Search + Metadata Filtering): This strategy uses a dense vector embedding model (like all-MiniLM-L6-v2) to understand the meaning of the query and documents. It can also filter by metadata, like document type or product version.

  • User Query: "How do I reset my Model X widget?"
  • Documents:
    • doc_1.txt: "The Model X widget can be reset by holding the power button for 10 seconds." (metadata: product="Model X", type="FAQ")
    • doc_2.txt: "Troubleshooting common issues with the Model Y device." (metadata: product="Model Y", type="Troubleshooting")
    • doc_3.txt: "User manual for Model X: general operation." (metadata: product="Model X", type="Manual")
    • doc_4.txt: "Instructions for factory reset on Model X devices." (metadata: product="Model X", type="Instructions")
  • Retrieved Chunks:
    • Using semantic similarity: doc_1.txt (very high semantic match), doc_4.txt (high semantic match), doc_3.txt (medium semantic match).
    • With metadata filter product="Model X": doc_2.txt (Model Y) is excluded before ranking; doc_1.txt, doc_4.txt, and doc_3.txt all remain candidates.
    • If the query were "How to reset Model X software": Semantic search might rank doc_4.txt above doc_1.txt if doc_4.txt is semantically closer to a software reset.
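
The filter-then-rank behavior described above can be sketched as follows. The 3-dimensional vectors are hand-made, hypothetical stand-ins for real embeddings (a model such as all-MiniLM-L6-v2 would produce much longer vectors), and the search helper and its keyword-argument filters are illustrative names, not a specific vector database API.

```python
import math

# Hypothetical toy embeddings; in practice these come from an embedding model.
chunks = [
    {"id": "doc_1.txt", "product": "Model X", "type": "FAQ", "vec": [0.9, 0.1, 0.1]},
    {"id": "doc_2.txt", "product": "Model Y", "type": "Troubleshooting", "vec": [0.2, 0.2, 0.9]},
    {"id": "doc_3.txt", "product": "Model X", "type": "Manual", "vec": [0.3, 0.9, 0.1]},
    {"id": "doc_4.txt", "product": "Model X", "type": "Instructions", "vec": [0.8, 0.3, 0.2]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, chunks, k=3, **filters):
    # 1. Metadata filter: keep only chunks whose fields match every filter.
    candidates = [c for c in chunks if all(c.get(f) == v for f, v in filters.items())]
    # 2. Rank the survivors by cosine similarity to the query vector.
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]

# Hypothetical embedding of "How do I reset my Model X widget?"
query_vec = [1.0, 0.1, 0.1]
results = search(query_vec, chunks, k=3, product="Model X")
print(results)
```

Filtering before ranking keeps Model Y content out of the candidate set entirely, no matter how similar its embedding happens to be to the query.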

The A/B Test:

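You would randomly split a subset of your users between the two retrieval strategies, keeping the LLM, the prompt template, and everything else identical so that retrieval is the only variable.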

  • Group A: Receives answers generated using Retrieval Strategy A.
  • Group B: Receives answers generated using Retrieval Strategy B.
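
One common way to make the split stable, so that a returning user always sees the same strategy, is to hash the user ID into a bucket. The sketch below assumes string user IDs; the experiment name "retrieval-ab-v1" and the share_b parameter are hypothetical names introduced for illustration.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "retrieval-ab-v1", share_b: float = 0.5) -> str:
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < share_b else "A"

for uid in ["user-1", "user-2", "user-3"]:
    print(uid, assign_group(uid))
```

Because the assignment depends only on the user ID and the experiment name, no lookup table is needed, and changing the experiment name reshuffles users for the next test.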

You’d then measure metrics like:

  • Answer Relevance: Did the AI provide the correct information? (Human evaluation or automated scoring)
  • User Satisfaction: Did the user find the answer helpful? (e.g., thumbs up/down, CSAT score)
  • Task Completion Rate: Was the user able to achieve their goal? (e.g., successfully reset their widget)
  • Latency: How long did it take to get an answer?
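
For a binary metric such as thumbs up/down, one standard way to check whether the difference between groups is more than noise is a two-proportion z-test. The sketch below is one reasonable choice, not the only valid analysis, and the counts in the usage lines are made up purely for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups share one true rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: thumbs-up responses out of 1000 rated answers per group.
z, p_value = two_proportion_z_test(success_a=420, n_a=1000, success_b=465, n_b=1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Continuous metrics such as latency call for a different test (for example a t-test or a bootstrap comparison of medians), and the sample size should be decided before the experiment starts rather than after peeking at results.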

The Problem Solved:

A RAG system’s effectiveness depends heavily on the quality of the information it retrieves. If the retrieval system returns irrelevant or insufficient context, the LLM will generate a poor answer, even if the LLM itself is powerful. A/B testing retrieval strategies lets you determine empirically which method of finding relevant documents leads to better downstream outcomes for your specific knowledge base and user queries.

How it Works Internally (The Mental Model):

  1. Indexing: Your documents are split into smaller "chunks" for more granular retrieval, then processed and stored in a searchable format. This could be a vector database (for semantic search), a traditional search index (for keyword search), or a hybrid.
  2. Query Transformation: The user’s natural language query is processed. This might involve simple tokenization, or it could involve generating embeddings (vector representations) of the query.
  3. Retrieval: The core of the A/B test. The system queries the index using the transformed query.
    • Strategy A (Keyword): Looks for exact or near-exact word matches. It ranks documents based on term frequency and inverse document frequency (TF-IDF) or similar algorithms.
    • Strategy B (Semantic): Compares the vector embedding of the query with the vector embeddings of document chunks. It retrieves chunks whose embeddings are closest (e.g., using cosine similarity). Metadata filtering can prune results based on predefined criteria.
  4. Re-ranking (Optional but common): The initial set of retrieved chunks might be re-ranked using a more sophisticated model so that the most relevant ones are presented to the LLM first.
  5. Context Augmentation: The top k retrieved and re-ranked chunks are concatenated into a prompt for the LLM.
  6. LLM Generation: The LLM receives the augmented prompt and generates a natural language answer based on the provided context.
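
The steps above can be sketched as a small pipeline in which the retriever is the only component that differs between the two arms of the test. Everything here is a hypothetical stand-in: retriever_a, retriever_b, and stub_llm return canned values so the example runs without a search index or a model, and the prompt wording is just one plausible template.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]  # (query, k) -> top-k chunk texts
Generator = Callable[[str], str]             # prompt -> answer text

def build_prompt(question: str, chunks: List[str]) -> str:
    # Context augmentation: number the chunks and place them ahead of the question.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def answer(question: str, retriever: Retriever, generate: Generator, k: int = 3) -> str:
    chunks = retriever(question, k)[:k]
    return generate(build_prompt(question, chunks))

# Canned stand-ins for the two strategies and for the LLM call.
def retriever_a(query: str, k: int) -> List[str]:
    return ["User manual for Model X: general operation."][:k]

def retriever_b(query: str, k: int) -> List[str]:
    return ["The Model X widget can be reset by holding the power button for 10 seconds."][:k]

def stub_llm(prompt: str) -> str:
    return f"stub answer; prompt was {len(prompt)} characters"

question = "How do I reset my Model X widget?"
print(answer(question, retriever_a, stub_llm))
print(answer(question, retriever_b, stub_llm))
```

Passing the retriever in as an argument keeps the prompt template and the generator identical across groups, which is what makes the comparison a test of retrieval rather than of the whole stack.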

The Levers You Control:

  • Chunking Strategy: How you split your documents (e.g., by paragraph, fixed size, sentence boundaries) significantly impacts retrieval granularity.
  • Embedding Model: For semantic search, the choice of embedding model dictates its understanding of language nuances.
  • Similarity Metric: (e.g., Cosine Similarity, Dot Product) Affects how "closeness" is measured in vector space.
  • Retrieval Algorithm: BM25, sparse vector search, dense vector search, hybrid approaches.
  • Number of Retrieved Chunks (k): Too few, and you might miss crucial info; too many, and you introduce noise and increase LLM cost/latency.
  • Re-ranking Model: If used, its complexity and training data.
  • Metadata Schema & Filtering: Which fields you attach to each chunk (e.g., product, document type) and how you use them to narrow the candidate set before or after ranking.
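
As an example of the first lever, here is a minimal sketch of fixed-size chunking with overlap. It splits on whitespace and counts words; real systems often count model tokens instead, and the size and overlap defaults here are arbitrary illustrative values.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of up to `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, size=4, overlap=1):
    print(chunk)
```

Overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary, at the cost of storing and searching some duplicated text.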

The most counterintuitive part of RAG A/B testing is realizing that sometimes, a simpler, older retrieval method (like BM25) can outperform a complex neural semantic search for certain types of queries or knowledge bases. This is often because specialized terminology, acronyms, or very direct factual recall are better served by exact keyword matching than by semantic models that might generalize too much. It forces you to treat the retrieval component as a first-class citizen, not just a black box that feeds the LLM.

Once you’ve validated your retrieval strategies, the next challenge is optimizing the prompt engineering to best leverage the retrieved context for the LLM.
