Retrieval-Augmented Generation (RAG) systems don’t magically know the answer; they retrieve relevant documents first and then generate. Measuring how well they retrieve is critical, and standard metrics like Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Hit Rate give us a precise way to do that.
Let’s see RAG in action. Imagine a user asking: "What are the side effects of aspirin?"
A RAG system might retrieve these documents (in order of perceived relevance):
- Document A: "Aspirin is a nonsteroidal anti-inflammatory drug (NSAID) used to reduce fever and inflammation. Common side effects include stomach upset, heartburn, and nausea. Less common but serious side effects can include bleeding and ringing in the ears." (This is the correct document containing the answer.)
- Document B: "Ibuprofen is another NSAID with similar uses. Its side effects are also generally mild, like stomach pain."
- Document C: "Paracetamol, also known as acetaminophen, is a pain reliever. It does not typically cause stomach upset."
- Document D: "Aspirin’s chemical structure and synthesis."
Now, let’s break down the metrics.
Hit Rate
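Hit Rate is the simplest. It tells you whether at least one of the retrieved documents is relevant. In our example, Document A is relevant. If we retrieve the top 4 documents and Document A is among them, the query counts as a hit, so the Hit Rate is 100% for this query. If Document A were ranked 5th and we retrieved only 4, the query would be a miss and the Hit Rate would be 0%. Over a set of queries, Hit Rate is the fraction of queries that score a hit, usually reported at a fixed cutoff (for example, Hit Rate@4).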
Why it works: It’s a binary "yes/no" for relevance within the retrieved set. It’s good for a quick check: is the system ever finding good stuff?
Diagnosis: To calculate Hit Rate, you need a ground truth list of relevant documents for each query and the list of documents your RAG system retrieved.
def calculate_hit_rate(retrieved_docs, relevant_docs_ids):
    for doc_id in retrieved_docs:
        if doc_id in relevant_docs_ids:
            return 1.0  # Hit!
    return 0.0  # Miss!

# Example usage:
retrieved_document_ids = ['doc_A', 'doc_B', 'doc_C', 'doc_D']
truly_relevant_document_ids = ['doc_A', 'doc_X']  # doc_X wasn't retrieved
hit_rate = calculate_hit_rate(retrieved_document_ids, truly_relevant_document_ids)
print(f"Hit Rate: {hit_rate}")  # Output: Hit Rate: 1.0
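In practice, Hit Rate is usually averaged over a set of queries at a fixed cutoff k. The following is a minimal sketch of that aggregation, not a definitive implementation: the function name `hit_rate_at_k`, the parameter `k`, and the query and document IDs are all illustrative assumptions.

```python
def hit_rate_at_k(results, ground_truth, k):
    """Fraction of queries with at least one relevant doc in the top k.

    results: dict mapping query id -> ranked list of retrieved doc ids
    ground_truth: dict mapping query id -> set of relevant doc ids
    """
    if not results:
        return 0.0
    hits = 0
    for query_id, ranked_ids in results.items():
        relevant = ground_truth.get(query_id, set())
        # Only the first k retrieved documents count toward a hit
        if any(doc_id in relevant for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(results)


# Hypothetical evaluation data
results = {
    "q1": ["doc_A", "doc_B", "doc_C", "doc_D"],  # relevant doc at rank 1
    "q2": ["doc_B", "doc_C", "doc_D", "doc_A"],  # relevant doc at rank 4
    "q3": ["doc_B", "doc_C", "doc_D", "doc_E"],  # relevant doc not retrieved
}
ground_truth = {"q1": {"doc_A"}, "q2": {"doc_A"}, "q3": {"doc_X"}}

hit_rate_at_4 = hit_rate_at_k(results, ground_truth, k=4)
hit_rate_at_2 = hit_rate_at_k(results, ground_truth, k=2)
print(f"Hit Rate@4: {hit_rate_at_4:.2f}")  # 2 of 3 queries hit
print(f"Hit Rate@2: {hit_rate_at_2:.2f}")  # 1 of 3 queries hit
```

Comparing the score at several cutoffs shows how much of the hit rate depends on documents that sit deep in the list.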
Fix: If your Hit Rate is low, it means your retriever is fundamentally missing relevant documents. This often points to issues with your embedding model’s ability to capture semantic meaning, poor document chunking (too large or too small chunks), or an inadequate index that doesn’t cover the breadth of your data. A common fix is to fine-tune the embedding model on your specific domain or to experiment with different chunking strategies.
Mean Reciprocal Rank (MRR)
MRR goes a step further than Hit Rate. It cares about where the first relevant document appears in the ranked list. The "reciprocal rank" is 1 divided by the rank of the first relevant document. For our example, Document A is the first relevant document and it’s at rank 1. So, the reciprocal rank is 1/1 = 1.
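If Document A were at rank 2, behind an irrelevant Document B, the reciprocal rank would be 1/2 = 0.5. If the first relevant document were at rank 3, it would be 1/3. If no relevant document is retrieved at all, the reciprocal rank is 0. MRR is the average of these reciprocal ranks across multiple queries.
Why it works: It rewards systems that place a relevant document near the top of the list, which is crucial for user experience, since nobody wants to scroll forever. Note that MRR only looks at the first relevant document; any relevant documents further down the list do not change the score.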
Diagnosis: You need the rank of the first relevant document for each query.
def calculate_mrr(retrieved_docs_with_ranks, relevant_docs_ids):
    # Returns the reciprocal rank for a single query.
    # MRR is the mean of this value over all queries.
    # Assumes the (rank, doc_id) tuples are sorted by rank, starting at 1.
    for rank, doc_id in retrieved_docs_with_ranks:
        if doc_id in relevant_docs_ids:
            return 1.0 / rank  # Reciprocal of the rank
    return 0.0  # No relevant document found

# Example usage:
# Documents are (rank, doc_id) tuples
retrieved_documents_ranked = [(1, 'doc_A'), (2, 'doc_B'), (3, 'doc_C'), (4, 'doc_D')]
truly_relevant_document_ids = ['doc_A', 'doc_X']
mrr_score = calculate_mrr(retrieved_documents_ranked, truly_relevant_document_ids)
print(f"MRR for this query: {mrr_score}")  # Output: MRR for this query: 1.0

# Another example:
retrieved_documents_ranked_2 = [(1, 'doc_B'), (2, 'doc_A'), (3, 'doc_C'), (4, 'doc_D')]
mrr_score_2 = calculate_mrr(retrieved_documents_ranked_2, truly_relevant_document_ids)
print(f"MRR for query 2: {mrr_score_2}")  # Output: MRR for query 2: 0.5
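The "mean" in MRR comes from averaging the per-query reciprocal ranks. Here is a minimal sketch of that step; the helper names `reciprocal_rank` and `mean_reciprocal_rank` and the sample queries are assumptions made for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc (ranks start at 1), or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, ground_truth):
    """Average the reciprocal rank over every query in `results`."""
    if not results:
        return 0.0
    total = sum(
        reciprocal_rank(ranked_ids, ground_truth.get(query_id, set()))
        for query_id, ranked_ids in results.items()
    )
    return total / len(results)


# Hypothetical evaluation data
results = {
    "q1": ["doc_A", "doc_B", "doc_C", "doc_D"],  # first relevant at rank 1 -> 1.0
    "q2": ["doc_B", "doc_A", "doc_C", "doc_D"],  # first relevant at rank 2 -> 0.5
    "q3": ["doc_B", "doc_C", "doc_D", "doc_E"],  # nothing relevant -> 0.0
}
ground_truth = {"q1": {"doc_A"}, "q2": {"doc_A"}, "q3": {"doc_X"}}

mrr = mean_reciprocal_rank(results, ground_truth)
print(f"MRR: {mrr}")  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Queries with no relevant document retrieved contribute 0, so a retriever that misses often will have a low MRR even if its hits are ranked first.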
Fix: A low MRR means your retriever isn’t consistently placing a relevant document at the top. This could be due to:
- Embedding quality: Your embeddings aren’t capturing the nuances needed to separate the most relevant document from near misses.
- Missing re-ranking layer: You might need a more sophisticated re-ranking step after initial retrieval to boost the truly best document.
- Underspecified queries: The initial query might not be specific enough, in which case query expansion can help.
Fine-tuning your retriever with a focus on ranking accuracy (e.g., using contrastive learning) or implementing a cross-encoder for re-ranking can significantly improve MRR.
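To show where a re-ranking step sits in the pipeline, here is a minimal sketch. The scoring function is a deliberately crude word-overlap stand-in so the example runs without any model; in a real system it would be replaced by a cross-encoder or a similar learned scorer. The names `overlap_score` and `rerank` and the shortened document texts are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def rerank(query, candidates, score_fn):
    """Reorder (doc_id, text) candidates by descending score.

    Python's sort is stable, so ties keep their original retrieval order.
    """
    return sorted(candidates, key=lambda c: score_fn(query, c[1]), reverse=True)


query = "side effects of aspirin"
# First-stage retrieval order, with the relevant doc_A stuck at rank 2
candidates = [
    ("doc_B", "ibuprofen side effects are generally mild"),
    ("doc_A", "common side effects of aspirin include stomach upset"),
    ("doc_D", "aspirin chemical structure and synthesis"),
]

reranked_ids = [doc_id for doc_id, _ in rerank(query, candidates, overlap_score)]
print(reranked_ids)  # ['doc_A', 'doc_B', 'doc_D']
```

In this toy case the reciprocal rank for the query moves from 0.5 to 1.0 after re-ranking, which is the kind of shift a stronger second-stage scorer is meant to produce.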
Normalized Discounted Cumulative Gain (NDCG)
NDCG is the most sophisticated. It considers all relevant documents and their positions, but it also discounts documents that appear lower in the list. It’s "normalized" so you can compare scores across different queries.
Here’s the breakdown:
- Gain: For each document, assign a "gain" value. If a document is perfectly relevant, gain = max_gain (e.g., 5). If it’s highly relevant, gain = 4, and so on, down to 0 for irrelevant.
- Cumulative Gain (CG): Sum the gains of the documents up to a certain rank.
- Discounted Cumulative Gain (DCG): Apply a logarithmic discount to the gain of each document based on its rank. Each document contributes gain / log2(rank + 1), and DCG is the sum of these contributions. Documents at higher ranks (lower numbers) have their gain discounted less.
- Ideal DCG (IDCG): Calculate the DCG if the documents were ranked in perfect order of relevance (highest gain first).
- NDCG: Divide the calculated DCG by the IDCG: NDCG = DCG / IDCG.
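Written out for a ranked list cut off at position k, the definitions above can be summarized as follows. The symbols are notation introduced here for compactness: rel_i is the gain of the document at rank i, and rel*_i is the gain at rank i after sorting the documents by decreasing gain.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i + 1)}, \qquad
\mathrm{IDCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}^{*}_i}{\log_2(i + 1)}, \qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Because the discount at rank 1 is log2(2) = 1, the top document keeps its full gain, and every later position is divided by a steadily growing factor.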
Let’s use our example. Assume:
- Document A: Highly relevant (gain = 3)
- Document B, C, D: Irrelevant (gain = 0)
Our retrieved list: [(1, 'doc_A', gain=3), (2, 'doc_B', gain=0), (3, 'doc_C', gain=0), (4, 'doc_D', gain=0)]
- DCG:
  - Rank 1 (doc_A): 3 / log2(1 + 1) = 3 / log2(2) = 3 / 1 = 3
  - Rank 2 (doc_B): 0 / log2(2 + 1) = 0 / log2(3) = 0
  - Rank 3 (doc_C): 0 / log2(3 + 1) = 0 / log2(4) = 0
  - Rank 4 (doc_D): 0 / log2(4 + 1) = 0 / log2(5) = 0
  - Total DCG = 3 + 0 + 0 + 0 = 3
- IDCG: The ideal ranking would have doc_A first.
  - Rank 1 (doc_A): 3 / log2(1 + 1) = 3
  - The rest are 0.
  - Total IDCG = 3
- NDCG: DCG / IDCG = 3 / 3 = 1.0
Now suppose Document B is instead judged moderately relevant (gain = 1) and is retrieved at rank 1, pushing Document A down to rank 2:
Retrieved: [(1, 'doc_B', gain=1), (2, 'doc_A', gain=3), (3, 'doc_C', gain=0), (4, 'doc_D', gain=0)]
- DCG:
  - Rank 1 (doc_B): 1 / log2(2) = 1
  - Rank 2 (doc_A): 3 / log2(3) ≈ 3 / 1.585 ≈ 1.893
  - Ranks 3 and 4 contribute 0.
  - Total DCG ≈ 1 + 1.893 = 2.893
- IDCG: Ideal order: doc_A (gain=3) then doc_B (gain=1).
  - Rank 1 (doc_A): 3 / log2(2) = 3
  - Rank 2 (doc_B): 1 / log2(3) ≈ 0.631
  - Total IDCG ≈ 3 + 0.631 = 3.631
- NDCG: DCG / IDCG ≈ 2.893 / 3.631 ≈ 0.797
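Why it works: NDCG is widely used when relevance is graded rather than binary, because it rewards systems that not only retrieve relevant documents but also rank them highly, and it accounts for partial relevance. A score of 1.0 means the retrieved documents are in the ideal order; lower scores mean that gain was lost to misordering or to missing documents.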
Diagnosis: You need graded relevance for documents (not just binary) and the ranked retrieval list.
import math

def calculate_dcg(retrieved_docs_with_grades):
    # Expects (rank, doc_id, grade) tuples with ranks starting at 1
    dcg = 0.0
    for rank, _doc_id, grade in retrieved_docs_with_grades:
        if rank == 0:
            continue  # Should not happen for ranks; also avoids dividing by log2(1) = 0
        dcg += grade / math.log2(rank + 1)
    return dcg

def calculate_idcg(ideal_docs_with_grades):
    # IDCG is simply the DCG of the ideal (highest grade first) ordering
    return calculate_dcg(ideal_docs_with_grades)

def calculate_ndcg(retrieved_docs_with_grades, ideal_docs_with_grades):
    dcg = calculate_dcg(retrieved_docs_with_grades)
    idcg = calculate_idcg(ideal_docs_with_grades)
    if idcg == 0:
        return 1.0 if dcg == 0 else 0.0  # Handle cases where no relevant docs exist
    return dcg / idcg

# Example from above:
# (rank, doc_id, grade)
retrieved_docs_graded = [(1, 'doc_A', 3), (2, 'doc_B', 0), (3, 'doc_C', 0), (4, 'doc_D', 0)]
# Ideal ranking: doc_A (grade 3) first, then others (grade 0)
ideal_docs_graded = [(1, 'doc_A', 3), (2, 'doc_B', 0), (3, 'doc_C', 0), (4, 'doc_D', 0)]
ndcg_score = calculate_ndcg(retrieved_docs_graded, ideal_docs_graded)
print(f"NDCG: {ndcg_score}")  # Output: NDCG: 1.0

# Second example:
retrieved_docs_graded_2 = [(1, 'doc_B', 1), (2, 'doc_A', 3), (3, 'doc_C', 0), (4, 'doc_D', 0)]
# Ideal ranking is still doc_A then doc_B
ideal_docs_graded_2 = [(1, 'doc_A', 3), (2, 'doc_B', 1), (3, 'doc_C', 0), (4, 'doc_D', 0)]
ndcg_score_2 = calculate_ndcg(retrieved_docs_graded_2, ideal_docs_graded_2)
print(f"NDCG 2: {ndcg_score_2}")  # Output: NDCG 2: 0.7967...
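Writing the ideal list by hand gets tedious. One possible shortcut, sketched below, is to derive it by sorting the judged grades in descending order, and to add a cutoff k. The names `dcg_at_k` and `ndcg_at_k` and the extra document doc_X are illustrative assumptions. This sketch returns 0.0 when nothing is judged relevant, whereas the function above returns 1.0 in that case; either convention works as long as it is applied consistently.

```python
import math


def dcg_at_k(grades, k):
    """DCG over the first k grades; grades[0] belongs to the document at rank 1."""
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(retrieved_grades, judged_grades, k):
    """NDCG@k with the ideal ordering derived by sorting all judged grades.

    retrieved_grades: grades of the retrieved docs, in retrieval order
    judged_grades: grades of every judged doc for the query, retrieved or not
    """
    ideal = sorted(judged_grades, reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # Nothing relevant was judged; convention chosen for this sketch
    return dcg_at_k(retrieved_grades, k) / idcg


# doc_B (grade 1) at rank 1, doc_A (grade 3) at rank 2, then two irrelevant docs
retrieved = [1, 3, 0, 0]

# Ideal built only from the retrieved documents, as in the second example above
ndcg_same_pool = ndcg_at_k(retrieved, judged_grades=[3, 1, 0, 0], k=4)

# Hypothetical doc_X (grade 2) was judged relevant but never retrieved
ndcg_with_missed_doc = ndcg_at_k(retrieved, judged_grades=[3, 2, 1, 0, 0], k=4)

print(f"NDCG@4, retrieved pool only: {ndcg_same_pool:.3f}")
print(f"NDCG@4, including missed doc_X: {ndcg_with_missed_doc:.3f}")
```

Including judged documents that were never retrieved lowers the score, which is what you want: otherwise a retriever is never penalized for relevant documents it failed to find.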
Fix: Low NDCG indicates that either relevant documents are missing, or they are poorly ranked. This is where you’d look at:
- Embedding quality: Are your embeddings truly capturing the semantic similarity needed for nuanced relevance?
- Re-ranking: A powerful re-ranker (like a cross-encoder) can drastically improve NDCG by looking at document-query pairs holistically.
- Data quality: Are your documents clean and well-structured?
- Grading scale: Ensure your relevance grading is consistent and meaningful.
Fine-tuning embedding models, implementing robust re-ranking layers, and curating high-quality graded relevance datasets are key to boosting NDCG.
These metrics help you understand not just if your RAG system retrieves information, but how well it prioritizes and presents it, which directly impacts the quality of the generated answers. The next step is often evaluating the generation quality itself.