Build a Production RAG Pipeline End to End (2026)

Retrieval-Augmented Generation (RAG) pipelines are often described as "just connecting a retriever to a generator." In practice, the quality of the generated answer depends far more on the quality of the retrieved context than on the generator’s own capabilities.

Let’s walk through building a production-ready RAG pipeline. We’ll use LangChain for orchestration, OpenAI for the LLM, and ChromaDB as our vector store.

1. Data Loading and Preprocessing

First, we need to load our documents. For this example, let’s imagine we have a collection of PDF technical manuals.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("technical_manual.pdf")
documents = loader.load()

Next, we split these documents into smaller chunks. This is crucial because LLMs have context window limits, and smaller chunks allow for more precise retrieval.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(documents)

chunk_size sets the maximum length of each chunk, measured in characters by default (not tokens). chunk_overlap repeats up to 150 characters from the end of one chunk at the start of the next, so a sentence that straddles a boundary is less likely to be separated from its context.
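To make the effect of the overlap concrete, here is a minimal sketch of a fixed-size character window that advances by chunk_size minus chunk_overlap. The function name chunk_with_overlap is hypothetical, and this is a simplification: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and sentence boundaries.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that share chunk_overlap characters.

    Illustration only; not how RecursiveCharacterTextSplitter is implemented.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk after the first begins with the last three characters of the previous one, which is the continuity that chunk_overlap is meant to provide.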

2. Embedding and Indexing

Now, we need to convert these text chunks into numerical representations (embeddings) that our vector store can understand. We’ll use OpenAI’s text-embedding-ada-002 model.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create a Chroma vector store
# We'll persist this to disk for later use
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

The persist_directory argument tells Chroma where to save the index. This is vital for production; you don’t want to re-embed and re-index your entire knowledge base every time your application restarts.
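One way to act on this at startup is a "load if present, otherwise build" check. The sketch below is an assumption about how an application might be structured, not part of LangChain or Chroma: index_exists and load_or_build are hypothetical helpers, and the load and build callables are supplied by the caller.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def index_exists(persist_dir: str) -> bool:
    """Treat a non-empty directory as evidence of a previously saved index."""
    return os.path.isdir(persist_dir) and len(os.listdir(persist_dir)) > 0


def load_or_build(persist_dir: str, load: Callable[[], T], build: Callable[[], T]) -> T:
    """Reuse the saved index when present; otherwise pay the embedding cost once."""
    return load() if index_exists(persist_dir) else build()
```

With the setup above, build could wrap the Chroma.from_documents(...) call, and load could wrap Chroma(persist_directory="./chroma_db", embedding_function=embeddings), which reopens the saved collection without re-embedding.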

3. Retrieval

When a user asks a question, we’ll use the vector store to find the most relevant chunks of text.

retriever = vectorstore.as_retriever(
    search_type="similarity", # or "mmr" for Maximal Marginal Relevance
    search_kwargs={"k": 5} # Number of documents to retrieve
)

query = "How do I reset the device to factory settings?"
relevant_docs = retriever.invoke(query)

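# Documents returned by the retriever carry no similarity score. To inspect scores,
# query the vector store directly; Chroma reports a distance, so lower means closer.
for doc, distance in vectorstore.similarity_search_with_score(query, k=5):
    print(f"--- Distance: {distance:.4f} ---")
    print(doc.page_content)
    print("-" * 20)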

search_type="similarity" means we’re looking for chunks whose embeddings are closest (most similar) to the query’s embedding. k=5 means we want the top 5 most similar chunks.
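The two search types can be sketched in plain Python on toy vectors. This is an illustration under assumptions, not Chroma's implementation: cosine similarity is used for simplicity (the metric a real vector store applies depends on its configuration), the function names are hypothetical, and the 2-D vectors are made up.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Indices of the k vectors most similar to the query, best first."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    return ranked[:k]


def mmr(query_vec: list[float], vectors: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Greedy Maximal Marginal Relevance: trade relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(vectors)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, vectors[i])
        # Redundancy: similarity to the closest chunk already selected.
        redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": vectors 0 and 1 are near-duplicates, vector 2 points elsewhere.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query_vec = [1.0, 0.2]

print(top_k(query_vec, vectors, k=2))  # plain similarity keeps both near-duplicates
print(mmr(query_vec, vectors, k=2))    # MMR swaps one of them for a more diverse chunk
```

Plain similarity search can fill all k slots with near-identical chunks; MMR penalizes that overlap, which is why it is worth trying when a corpus contains repeated passages.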

4. Generation

Finally, we feed the retrieved documents along with the original query to an LLM to generate an answer.

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-3.5-turbo")

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # "stuff" puts all retrieved docs into one prompt
    retriever=retriever,
    return_source_documents=True # Optional: to see which docs were used
)

result = qa_chain.invoke({"query": query})

print("Answer:", result["result"])
# print("Source Documents:", result["source_documents"]) # Uncomment to see sources

chain_type="stuff" is the simplest option: it concatenates the contents of all retrieved documents into a single prompt. Other chain types, such as map_reduce or refine, handle larger numbers of documents by processing them in stages, at the cost of additional LLM calls.
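The difference between the strategies is easiest to see in how many LLM calls each one makes. The following sketch is not LangChain's implementation: the LLM type is a hypothetical stand-in for any prompt-to-text function, the prompt wording is invented, and fake_llm with its sample chunks exists only to count calls.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, completion out


def stuff_answer(llm: LLM, question: str, chunks: list[str]) -> str:
    """One call: every chunk is concatenated into a single prompt."""
    context = "\n\n".join(chunks)
    return llm(f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}")


def map_reduce_answer(llm: LLM, question: str, chunks: list[str]) -> str:
    """One call per chunk (map), then one call to merge the partial answers (reduce)."""
    partials = [llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    combined = "\n".join(partials)
    return llm(f"Combine these partial answers.\n\n{combined}\n\nQuestion: {question}")


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Records each prompt so the number of calls can be compared."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


# Made-up chunks for illustration only.
sample_chunks = ["Hold the reset button for 10 seconds.", "The LED blinks twice."]

stuff_answer(fake_llm, "How do I reset?", sample_chunks)
stuff_calls = len(calls)
map_reduce_answer(fake_llm, "How do I reset?", sample_chunks)
map_reduce_calls = len(calls) - stuff_calls

print("stuff calls:", stuff_calls)
print("map_reduce calls:", map_reduce_calls)
```

With n retrieved chunks, the stuff approach makes one call whose prompt grows with n, while this map-reduce sketch makes n + 1 calls with smaller prompts.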

The most important component of this setup is the retriever. If it returns irrelevant or incomplete information, even the most powerful LLM will produce a poor answer. The quality of the embeddings, the chunking strategy, and the retrieval parameters (search_type, k) are the primary levers for improving RAG performance. Many people focus on prompt engineering for the LLM, but tuning the retrieval stage usually yields larger gains.
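Tuning those levers is easier with a number to watch. A minimal sketch of one common retrieval metric, hit rate at k, is shown here; the function name, the chunk IDs, and the labelled examples are all hypothetical, and a real evaluation set would come from questions with human-marked relevant chunks.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk ID in the top k results."""
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(retrieved) if retrieved else 0.0


# Hypothetical labelled set: ranked chunk IDs per query, and the IDs marked relevant.
retrieved = [["c3", "c7", "c1"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
relevant = [{"c1"}, {"c2"}, {"c0"}]

for k in (1, 3):
    print(k, hit_rate_at_k(retrieved, relevant, k))
```

Re-running a check like this after changing chunk_size, the embedding model, or k shows whether retrieval actually improved before any prompt is touched.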

The next logical step in improving this pipeline is to explore advanced retrieval techniques, such as re-ranking or query expansion.