Retrieval-Augmented Generation (RAG) often feels like a black box where latency just happens, but most of the P99 tail is a predictable, solvable problem that is usually rooted in the retrieval step, not the LLM itself.

Let’s see this in action. Imagine a simple RAG system serving product recommendations.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response.notebook_utils import display_response
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Initialize embedding model and LLM
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = OpenAI(model="gpt-3.5-turbo")

# Build index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

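# Configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,  # Number of nodes to retrieve
    # Hybrid search only works when the index is backed by a vector store that
    # supports it; the default in-memory store handles dense queries only.
    # Enable these two options once you are on such a store:
    # vector_store_query_mode="hybrid",  # Use hybrid search
    # alpha=0.5,  # Weight for hybrid search
)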

# Build query engine (from_args accepts the LLM and builds the response synthesizer from it)
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=llm,
    # response_synthesizer=..., # Optional: customize response synthesis
    # node_postprocessors=..., # Optional: further filter/rank nodes
)

# Query
query = "What are the best laptops for programming?"
response = query_engine.query(query)

display_response(response)

This looks straightforward, but the query_engine.query() call (which in turn invokes retriever.retrieve()) is where the magic (and the latency) happens. The query engine orchestrates the whole pipeline: it takes your natural language query, has the retriever convert it into an embedding, sends that embedding to the vector store to find the most similar document chunks (nodes), and then passes those nodes along with the original query to the LLM for synthesis.

The problem we’re trying to solve is that while the average query might be fast, the slowest 1% of queries (the P99 tail) take much longer. This is almost always due to bottlenecks in the retrieval phase. The LLM itself is generally fast and fairly predictable at typical prompt sizes, but fetching the context that goes into the prompt can be agonizingly slow if not optimized.
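
One way to check where the tail actually comes from is to time retrieval and synthesis separately and compare percentiles per stage. The sketch below is a minimal, library-agnostic illustration: `retrieve` and `synthesize` are hypothetical callables standing in for whatever your stack exposes, and the percentile helper uses a simple nearest-rank definition rather than any particular monitoring tool's formula.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def profile_stages(
    queries: Sequence[str],
    retrieve: Callable[[str], list],          # hypothetical: query -> retrieved nodes
    synthesize: Callable[[str, list], str],   # hypothetical: (query, nodes) -> answer
) -> Dict[str, Dict[str, float]]:
    """Time each stage for every query and summarize P50/P99 in milliseconds."""
    timings: Dict[str, List[float]] = {"retrieval": [], "synthesis": []}
    for query in queries:
        start = time.perf_counter()
        nodes = retrieve(query)
        after_retrieval = time.perf_counter()
        synthesize(query, nodes)
        end = time.perf_counter()
        timings["retrieval"].append((after_retrieval - start) * 1000)
        timings["synthesis"].append((end - after_retrieval) * 1000)
    return {
        stage: {"p50": percentile(values, 50), "p99": percentile(values, 99)}
        for stage, values in timings.items()
    }
```

If the retrieval P99 is many times its P50 while the synthesis numbers stay close together, the tail is coming from retrieval and the fixes that follow are the place to start.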

Here’s a breakdown of the common culprits and how to fix them:

1. Suboptimal Vector Store Configuration: Your vector database is the heart of retrieval. If it’s not tuned, everything else suffers.
   * Diagnosis: Check your vector store’s dashboard or logs for query times. Look at indexing times and memory usage. Are you using an instance type that’s too small? Is disk I/O a bottleneck? For managed services like Pinecone, Weaviate, or Qdrant, check their performance metrics. If self-hosting (e.g., FAISS, Milvus), monitor CPU, RAM, and disk.
   * Fix: For managed services, scale up your instance size or choose a higher-performance tier. If self-hosting, ensure you have sufficient RAM for your index (in-memory is fastest) and consider faster SSDs. For example, if using Milvus and observing high latency, you might need to increase max_memory_usage in your milvus.yaml or provision a larger instance.
   * Why it works: A faster vector store can process similarity searches and return results orders of magnitude quicker, directly impacting retrieval time.
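
To judge whether an index can realistically stay in RAM, a back-of-the-envelope estimate is often enough. The sketch below assumes a flat, uncompressed float32 index (4 bytes per dimension) and a hypothetical `overhead_factor` of 1.5 to account for index structures and metadata; real overhead depends heavily on the store and index type, so treat the result as a rough lower bound for planning. The 384 dimensions match BAAI/bge-small-en-v1.5.

```python
def estimate_index_gib(
    num_vectors: int,
    dim: int,
    bytes_per_value: int = 4,        # float32
    overhead_factor: float = 1.5,    # assumed allowance for index structures and metadata
) -> float:
    """Rough RAM estimate, in GiB, for holding the vectors in memory."""
    raw_bytes = num_vectors * dim * bytes_per_value
    return raw_bytes * overhead_factor / 1024**3


if __name__ == "__main__":
    # 10 million chunks embedded with a 384-dimensional model
    print(f"{estimate_index_gib(10_000_000, 384):.1f} GiB")
```

Under these assumptions, 10 million 384-dimensional vectors need a little over 21 GiB, which is a quick way to see whether the current instance size is plausible at all.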

2. Inefficient Indexing Strategy: How your data is chunked and indexed matters. Large, unwieldy chunks or a poorly chosen embedding model can lead to slower searches.
   * Diagnosis: Examine your document chunk sizes and overlap. Are your chunks too large (e.g., >1000 tokens)? Are you using a very large or slow embedding model?
   * Fix: Experiment with smaller chunk sizes (e.g., 256-512 tokens) and a small overlap (e.g., 10-20% of chunk size). Switch to a faster, smaller embedding model like BAAI/bge-small-en-v1.5 or all-MiniLM-L6-v2 if you’re currently using a larger one. In LlamaIndex, this means re-indexing: index = VectorStoreIndex.from_documents(documents, embed_model=new_embed_model).
   * Why it works: Smaller, more focused chunks allow the vector store to find more precise matches, reducing the number of irrelevant results to process. Faster embedding models reduce the time taken to encode the query itself.
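
The chunk size and overlap trade-off is easier to reason about with a concrete splitter in hand. The following is a minimal sketch of fixed-size chunking with overlap, not LlamaIndex's own splitter: it treats whitespace-separated words as tokens, which only approximates a model tokenizer, and the defaults (512 tokens, 64 overlap, roughly 12%) are illustrative. In LlamaIndex the equivalent knobs are the chunk_size and chunk_overlap arguments of its node parsers.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    words = ("lorem ipsum " * 1000).split()  # 2000 whitespace "tokens"
    pieces = chunk_tokens(words)
    print(len(pieces), [len(p) for p in pieces])
```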

3. Over-reliance on Pure Similarity Search: Pure vector similarity (dot_product, cosine) doesn’t always capture semantic intent perfectly. This can lead to retrieving many results that look similar but aren’t contextually relevant, requiring the LLM to sift through noise.
   * Diagnosis: Observe the retrieved nodes. Are they diverse and relevant, or are they repetitive and slightly off-topic?
   * Fix: Implement hybrid search. Combine vector search with keyword search (like BM25). Many vector stores support this natively. In LlamaIndex, set vector_store_query_mode="hybrid" and tune alpha (e.g., alpha=0.5 for an even split), provided the vector store behind your index supports hybrid queries. You can also add a keyword retriever such as KeywordTableSimpleRetriever alongside your VectorIndexRetriever and combine their results.
   * Why it works: Keyword search excels at exact term matching, while vector search handles semantic understanding. Combining them provides a more robust retrieval signal, reducing the need for the LLM to filter out irrelevant but semantically similar results.
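
To make the role of alpha concrete, here is a minimal sketch of alpha-weighted score fusion. It is an illustration of the idea, not how any particular vector store implements hybrid search: the function names are hypothetical, scores are min-max normalized so the two signals are comparable, and alpha=1.0 is assumed to mean pure vector search while alpha=0.0 means pure keyword search (check your store's documentation, since conventions differ).

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every document gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - lo) / (hi - lo) for doc_id, score in scores.items()}


def hybrid_scores(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend the two signals; a document missing from one list contributes 0 for that signal."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    return {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }


def top_k(fused: Dict[str, float], k: int) -> List[str]:
    """Return the k best document ids, breaking ties by id for stable output."""
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))[:k]


if __name__ == "__main__":
    vector = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
    keyword = {"b": 12.0, "d": 3.0}           # e.g. BM25 scores
    print(top_k(hybrid_scores(vector, keyword, alpha=0.5), k=2))
```

In this toy data, document "b" ranks first at alpha=0.5 because it scores reasonably on both signals, even though "a" is the best pure vector match.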

4. Too Many Retrieved Documents (similarity_top_k is too high): Fetching more documents than necessary increases the processing load on both the retriever and the LLM.
   * Diagnosis: Check the similarity_top_k parameter in your retriever configuration. Are you fetching 10, 20, or more documents when only a few are actually needed for context?
   * Fix: Lower similarity_top_k. Start with 3 or 5 and incrementally increase only if relevance degrades. For example, change similarity_top_k=10 to similarity_top_k=5.
   * Why it works: Fewer documents mean less text to fetch from the vector store and pack into the LLM’s context window, and less text for the LLM to process during synthesis, directly reducing LLM inference time.
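
The effect of similarity_top_k on prompt size is simple arithmetic, sketched below. The numbers are assumptions for illustration only: an average of 512 tokens per chunk, and hypothetical defaults of 20 tokens for the query and 150 for the prompt template.

```python
def estimate_prompt_tokens(
    top_k: int,
    avg_chunk_tokens: int,
    query_tokens: int = 20,       # assumed typical query length
    template_tokens: int = 150,   # assumed size of the synthesis prompt template
) -> int:
    """Approximate number of prompt tokens sent to the LLM for one query."""
    return top_k * avg_chunk_tokens + query_tokens + template_tokens


if __name__ == "__main__":
    for k in (3, 5, 10, 20):
        print(k, estimate_prompt_tokens(k, avg_chunk_tokens=512))
```

Under these assumptions, going from similarity_top_k=10 to similarity_top_k=5 cuts the prompt from 5290 to 2730 tokens, roughly halving what the LLM has to read before it can start answering.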

5. Inefficient Node Post-processing or Response Synthesis: While less common than retrieval issues, custom post-processing steps or complex response synthesis can add latency.
   * Diagnosis: If you’ve added custom node_postprocessors or a custom response_synthesizer in LlamaIndex, profile their execution time.
   * Fix: Optimize or simplify these custom components. For example, if a post-processor involves expensive re-ranking or filtering, see if it can be done more efficiently or if its scope can be narrowed. If using a complex response_synthesizer, evaluate if a simpler one (like CompactAndRefine) would suffice.
   * Why it works: Streamlining these later stages reduces the computational overhead after the core retrieval is complete.
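
Profiling these components does not require special tooling. The sketch below shows one possible approach: a small, hypothetical `StageTimer` helper that wraps any callable (for example a post-processor's node-processing method) and records how long each call takes. It is a generic Python pattern, not part of LlamaIndex.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Callable, Dict, List, Tuple


class StageTimer:
    """Collects wall-clock durations, in milliseconds, per named stage."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, List[float]] = defaultdict(list)

    def wrap(self, name: str, func: Callable) -> Callable:
        """Return a callable that behaves like func but records its duration under `name`."""
        @wraps(func)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if func raises, so failures still show up in the profile.
                self.durations_ms[name].append((time.perf_counter() - start) * 1000)
        return timed

    def slowest(self) -> List[Tuple[str, float]]:
        """Stages sorted by total time spent, slowest first."""
        totals = {name: sum(values) for name, values in self.durations_ms.items()}
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Wrapping each custom step once (for instance `rerank = timer.wrap("rerank", rerank)`) and printing `timer.slowest()` after a batch of queries shows which component deserves the optimization effort.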

6. Network Latency to the Vector Store: If your vector store is hosted remotely, network hops can add up.
   * Diagnosis: Use tools like ping or traceroute to your vector store’s endpoint. Check cloud provider metrics for inter-region or inter-AZ latency.
   * Fix: Deploy your application and your vector store in the same cloud region and availability zone. If using a managed service, check if they offer dedicated network connections or private endpoints.
   * Why it works: Reducing network round trips between your application and the data source drastically cuts down on I/O wait times.
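
When ping is blocked, which is common for managed endpoints, timing a TCP handshake from the application host gives a comparable estimate of one network round trip. The sketch below uses only the standard library; the host and port in the usage example are placeholders, not real endpoints.

```python
import socket
import statistics
import time
from typing import Dict, List


def tcp_connect_ms(host: str, port: int, samples: int = 5, timeout: float = 2.0) -> List[float]:
    """Measure how long it takes to open (and immediately close) a TCP connection."""
    results: List[float] = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # the handshake is complete once the connection is established
        results.append((time.perf_counter() - start) * 1000)
    return results


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Min, median, and max of the measured connect times."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


if __name__ == "__main__":
    # Placeholder endpoint: replace with your vector store's host and port.
    print(summarize(tcp_connect_ms("vector-store.example.internal", 6333)))
```

Connect times of a few tenths of a millisecond suggest the two services share an availability zone, while tens of milliseconds usually point to a cross-region hop. Note that the first sample may also include DNS resolution.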

The next hurdle you’ll face is often dealing with the cost implications of high-throughput, low-latency retrieval, especially if you’re using managed vector databases or LLM APIs.
