The most surprising thing about evaluating Retrieval Augmented Generation (RAG) is that the metrics you think are about generation quality are actually retrieval quality in disguise.
Let’s see RAGAS in action. Imagine we have a RAG system that answers questions based on a set of documents.
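```python
from datasets import Dataset
from ragas import evaluate
# Ragas does not export metrics named `recall` or `answer_precision`.
# Assumed mapping: `context_recall` and `answer_relevancy` are the closest
# published metrics, aliased here to match the names used in this article.
from ragas.metrics import faithfulness
from ragas.metrics import context_recall as recall
from ragas.metrics import answer_relevancy as answer_precision

# Sample data: one question with its generated answer, the retrieved
# contexts, and a reference (ground-truth) answer
data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris is the capital of France."],
}

# Convert to a Hugging Face Dataset
dataset = Dataset.from_dict(data)

# Evaluate (the metrics call an LLM, so model credentials must be configured)
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        recall,
        answer_precision,
    ],
)
print(result)
```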
This looks simple, but the magic happens when Ragas inspects the contexts and answer to judge faithfulness and answer_precision. Recall is judged against the ground_truth and contexts.
Here’s the mental model: RAG has two major phases: Retrieval and Generation.
- Retrieval: This is where your retriever (e.g., a vector database search) finds relevant documents or chunks of text based on the user’s query. The quality of this step directly impacts everything that follows.
- Generation: The Large Language Model (LLM) takes the user’s query and the retrieved context to generate an answer.
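The two phases can be sketched as two plain functions. This is a minimal illustration, not how any particular framework does it: the word-overlap retriever stands in for a vector search, `llm` is a hypothetical callable that takes a prompt string and returns text, and all function names are made up for this example.

```python
import re
from typing import Callable, Dict, List


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Phase 1: rank documents by word overlap with the query, keep the top k.

    Assumption: word overlap is a toy stand-in for embedding similarity.
    """
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, contexts: List[str]) -> str:
    """Combine the retrieved context and the question into one prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}"
    )


def generate(query: str, contexts: List[str], llm: Callable[[str], str]) -> str:
    """Phase 2: ask the LLM to answer from the retrieved context."""
    return llm(build_prompt(query, contexts))


def rag_answer(query: str, documents: List[str],
               llm: Callable[[str], str], k: int = 2) -> Dict[str, object]:
    """Run both phases and return the fields an evaluator needs."""
    contexts = retrieve(query, documents, k)
    return {
        "question": query,
        "contexts": contexts,
        "answer": generate(query, contexts, llm),
    }
```

Returning the question, contexts, and answer together is a deliberate choice: those are the same fields the evaluation dataset above is built from, so each phase can be scored separately.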
RAGAS breaks down the evaluation into these core components:
- Faithfulness: Does the generated answer stick to the information present in the contexts? A faithful answer only uses information from the provided context. If the LLM hallucinates or brings in outside knowledge, faithfulness drops. Ragas checks this by asking an LLM to determine whether each claim in the answer is supported by the context.
- Recall: Do the retrieved contexts cover all the essential information present in the ground_truth? This metric checks whether the retrieved context was sufficient to answer the question completely. If the retriever missed crucial information that was in the source documents (and is thus missing from the context provided to the LLM), recall will suffer. Ragas checks the statements in the ground truth answer against the retrieved contexts.
- Answer Precision: Is every piece of information in the generated answer relevant and directly supported by the contexts? This is the flip side of faithfulness. While faithfulness asks "is the answer in the context?", answer precision asks "is everything in the answer also in the context, and is it directly supported?". Ragas also uses an LLM to assess this, focusing on the answer’s conciseness and direct relevance to the context.
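Both faithfulness and recall reduce to the same shape: split a text into statements, ask a judge whether each one is supported by the contexts, and report the supported fraction. The sketch below illustrates that ratio only. It assumes a sentence splitter and a word-overlap judge (`overlap_judge`, a hypothetical stand-in) where Ragas uses an LLM, so the numbers are not comparable to real Ragas scores.

```python
import re
from typing import Callable, List


def split_statements(text: str) -> List[str]:
    """Split text into sentences (an LLM would extract atomic claims instead)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_judge(statement: str, contexts: List[str],
                  threshold: float = 0.8) -> bool:
    """Toy judge: supported if >= `threshold` of the statement's words
    appear in a single context chunk. Assumption, not the Ragas method."""
    words = set(re.findall(r"[a-z0-9]+", statement.lower()))
    if not words:
        return False
    for ctx in contexts:
        ctx_words = set(re.findall(r"[a-z0-9]+", ctx.lower()))
        if len(words & ctx_words) / len(words) >= threshold:
            return True
    return False


def supported_fraction(text: str, contexts: List[str],
                       judge: Callable[[str, List[str]], bool] = overlap_judge
                       ) -> float:
    """Fraction of statements in `text` that the judge finds in `contexts`."""
    statements = split_statements(text)
    if not statements:
        return 0.0
    return sum(judge(s, contexts) for s in statements) / len(statements)


contexts = ["Paris is the capital and most populous city of France."]

# Faithfulness-style score: statements come from the generated answer.
faithful = supported_fraction("The capital of France is Paris.", contexts)
hallucinated = supported_fraction(
    "The capital of France is Paris. It was founded by Martians.", contexts
)

# Recall-style score: statements come from the ground truth instead.
recall_like = supported_fraction("Paris is the capital of France.", contexts)

print(faithful, hallucinated, recall_like)  # 1.0 0.5 1.0
```

The only difference between the two scores is which text supplies the statements: the generated answer for faithfulness, the ground truth for recall. That is why a recall problem says more about what was retrieved than about what was generated.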
The core problem RAG solves is grounding LLM responses in specific, verifiable information, moving beyond their general knowledge. RAGAS gives you concrete scores for how well your system achieves this grounding.
The levers you control are primarily in the retrieval phase:
- Chunking strategy: How you split your source documents into smaller pieces.
- Embedding model: The model used to convert text into vectors.
- Retriever algorithm: The method used to search for similar vectors (e.g., kNN, MMR).
- Number of retrieved chunks (k): How many pieces of context you pass to the LLM.
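To make two of these levers concrete, here is a small sketch of a chunking strategy and a top-k search. It rests on loud assumptions: chunks are fixed-size word windows with overlap, and `embed` is a toy bag-of-words count vector standing in for a real embedding model. The function names and default values are illustrative, not recommendations.

```python
import math
from collections import Counter
from typing import List


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of `chunk_size` words; consecutive windows
    share `overlap` words so facts at a boundary are not cut in half."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (assumption; real systems use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Raising `k` or `overlap` tends to help recall because more of the source text reaches the LLM, at the cost of noisier context, which is the trade-off the faithfulness and answer_precision scores are sensitive to.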
What most people don’t realize is that a low recall score often points to a retrieval problem, not a generation problem. If the LLM had the right information in its context window, it could likely generate a better answer. Similarly, low faithfulness or answer_precision can indicate that the LLM is being prompted in a way that encourages it to stray, or that the retrieved context is too noisy or contradictory.
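That diagnostic reasoning can be written down as a small triage helper. This is only a sketch of the heuristic described above: the `triage` function, the score keys, and the 0.7 threshold are all hypothetical choices for illustration, not values Ragas prescribes.

```python
from typing import Dict, List


def triage(scores: Dict[str, float], threshold: float = 0.7) -> List[str]:
    """Map low metric scores to the pipeline stage most likely at fault.

    `threshold` is an arbitrary illustrative cutoff; missing metrics are
    treated as passing.
    """
    hints = []
    if scores.get("recall", 1.0) < threshold:
        hints.append(
            "retrieval: needed information is missing from the contexts; "
            "revisit chunking, the embedding model, or k"
        )
    if scores.get("faithfulness", 1.0) < threshold:
        hints.append(
            "generation or context noise: the answer goes beyond the "
            "contexts; tighten the prompt or filter retrieved chunks"
        )
    if scores.get("answer_precision", 1.0) < threshold:
        hints.append(
            "generation or context noise: the answer contains unsupported "
            "or irrelevant material; tighten the prompt or reduce k"
        )
    return hints


for hint in triage({"faithfulness": 0.9, "recall": 0.4, "answer_precision": 0.95}):
    print(hint)
```

In this example only the recall check fires, so the first place to look is the retriever rather than the prompt.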
The next concept you’ll grapple with is how to improve these metrics, which often involves tuning your retrieval pipeline or adjusting your prompt engineering for the generation phase.