The most surprising thing about building local RAG applications with Ollama and LangChain is how little infrastructure you actually need to get started.
Let’s see it in action. Imagine you have a small PDF, say my_notes.pdf, containing your brilliant ideas. You want to ask it questions, and you want to do it all locally, without sending your data to the cloud.
First, you need Ollama running. If you don’t have it, download it from ollama.ai. Once installed, pull a model:
ollama pull llama3
This downloads the Llama 3 model, a powerful LLM, to your machine. Now, let’s get your PDF into a format Ollama can use. We’ll use LangChain for this. Install the necessary libraries:
pip install langchain-community langchain-chroma langchain-text-splitters pypdf
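Before ingesting anything, it can help to confirm that the Ollama server is reachable and that the model was actually pulled. The following is a minimal sketch using only the standard library; it assumes Ollama's default address (http://localhost:11434) and its /api/tags endpoint, and the function name list_ollama_models is made up for this example.

```python
import http.client
import json
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of locally pulled models, or None if the server is unreachable.

    Hypothetical helper: base_url assumes Ollama's default port.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a body that is not valid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama does not appear to be running on the default port.")
else:
    print("Models available locally:", models)
```

If the list comes back without an entry starting with llama3, the pull step has not finished or targeted a different tag.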
Now, write a Python script to ingest your document:
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the document (PyPDFLoader yields one Document per page)
loader = PyPDFLoader("my_notes.pdf")
documents = loader.load()

# Split the document into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# Create embeddings using the local Ollama instance
embeddings = OllamaEmbeddings(model="llama3")

# Embed the chunks and store them in a Chroma vector store on disk
vector_store = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

print("Document loaded, split, embedded, and stored in ./chroma_db")
This script does a few key things:
- Loads the PDF: PyPDFLoader reads your document page by page.
- Splits into chunks: RecursiveCharacterTextSplitter breaks down large documents into smaller, manageable pieces. This is crucial because LLMs have context window limits, and you want to retrieve only the most relevant pieces. chunk_size=1000 means each piece will be at most 1000 characters, and chunk_overlap=200 means 200 characters will be shared between consecutive chunks to maintain context.
- Generates embeddings: OllamaEmbeddings sends your text chunks to the locally running Ollama instance (using the llama3 model) to generate numerical representations (embeddings) of their meaning.
- Stores in vector DB: Chroma.from_documents takes these embeddings and the original text chunks and stores them in a Chroma database directory (./chroma_db). Chroma is an efficient, in-memory (or disk-persisted) vector database, perfect for local applications.
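To make the interplay between chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch. It is not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunks are often shorter than the limit. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-width sliding window over a string (illustration only).

    Each chunk starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2600))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

With a 2600-character input, the windows start at 0, 800, and 1600, and the last 200 characters of each chunk reappear at the start of the next.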
After running this script, you’ll have a chroma_db directory containing your data, ready for querying.
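Querying a vector store boils down to comparing the query's embedding with the stored ones and returning the closest chunks. The sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity. Both are assumptions made for illustration: real embeddings have far more dimensions, and the distance metric Chroma uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (chunk text, made-up embedding)
toy_store = [
    ("notes about the project budget", [0.9, 0.1, 0.0]),
    ("a recipe for sourdough bread", [0.0, 0.2, 0.9]),
    ("project timeline and milestones", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a project-related question
results = top_k(query_vec, toy_store, k=2)
print(results)
```

The two project-related chunks rank above the unrelated one because their vectors point in nearly the same direction as the query vector.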
Now, let’s build the RAG part. This involves retrieving relevant chunks from your vector store and feeding them, along with your question, to the LLM.
from langchain_chroma import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Initialize the LLM
llm = Ollama(model="llama3")

# Load the vector store (use the same embedding model as during ingestion)
embeddings = OllamaEmbeddings(model="llama3")
vector_store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vector_store.as_retriever()

# Define the RAG prompt template
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Define the RAG chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
question = "What are my main ideas for the new project?"
answer = rag_chain.invoke(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
This second script sets up the query mechanism:
- Initialize LLM: Ollama(model="llama3") connects to your local Ollama instance.
- Load vector store: We load the Chroma database we created earlier, passing the same embedding model so that questions are embedded the same way as the stored chunks.
- Create retriever: vector_store.as_retriever() creates an object that can fetch relevant document chunks from the vector store based on a query.
- Define prompt: The template is crucial. It tells the LLM how to behave: use only the provided context to answer the question. {context} and {question} are placeholders that will be filled in dynamically.
- Build RAG chain: This is the core of LangChain’s orchestration.
  - {"context": retriever, "question": RunnablePassthrough()}: Both entries receive the same input, the question. The retriever fetches relevant documents based on it, while RunnablePassthrough() forwards the question unchanged. The output is a dictionary with keys "context" (the retrieved documents) and "question" (the original question).
  - | prompt: This passes the dictionary to our prompt template, filling in the {context} and {question} placeholders.
  - | llm: The formatted prompt is then sent to the llm (Ollama).
  - | StrOutputParser(): This takes the LLM’s response and converts it into a simple string.
- Invoke chain: rag_chain.invoke(question) sends your actual question through this entire pipeline, and you get the answer.
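To see what the model ends up receiving, the placeholder substitution can be reproduced with plain string formatting. This is only a sketch of the idea: ChatPromptTemplate additionally wraps the result in a chat message, and the context string and the helper name build_prompt below are invented for the example.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}
Question: {question}
"""


def build_prompt(context, question):
    """Fill the two placeholders the same way the prompt step does conceptually."""
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    context="The new project is a local note-search tool.",  # made-up retrieved text
    question="What is the new project?",
)
print(prompt_text)
```

Printing the filled prompt like this is a quick way to debug poor answers, since it shows whether the retrieved context actually contains the information being asked about.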
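This setup gives you a fully functional RAG system running entirely on your machine. One detail deserves attention: what the "context" actually contains. The retriever returns a list of Document objects, and the model is best served by a single block of text in which their page_content values are joined together, typically separated by newlines. As written, the chain hands that list straight to the prompt template, which renders it with Python’s default string conversion, so the prompt also carries each Document’s metadata and repr formatting. Piping the retriever’s output through a small joining function before it reaches the prompt ensures the LLM receives a coherent block of text derived from the most relevant parts of your original document.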
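A minimal sketch of that joining step follows. The Doc dataclass is a stand-in for LangChain's Document so the example runs without any dependencies, and format_docs is a conventional but hypothetical helper name, not something defined in the scripts above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for LangChain's Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    Doc("Idea one: index my notes locally.", {"page": 0}),
    Doc("Idea two: query them with a local LLM.", {"page": 3}),
]
context = format_docs(retrieved)
print(context)

# In the chain, the helper would sit right after the retriever, e.g.:
#   {"context": retriever | format_docs, "question": RunnablePassthrough()}
```

Because the helper only reads page_content, the page numbers and other metadata stay out of the prompt unless you choose to add them to the joined text.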
The real power of this local setup is its privacy and cost-effectiveness. You can iterate rapidly on your RAG application without worrying about API costs or data privacy concerns.
Most people don’t realize that the dictionary at the start of the chain is itself a step in the pipeline. LangChain’s LCEL (LangChain Expression Language) coerces it into a RunnableParallel, which invokes each value, here the retriever and RunnablePassthrough(), with the same input question. You never call the retriever’s invoke method yourself, which makes the chain definition cleaner than manual function calls.
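The fan-out behaviour can be imitated in a few lines of plain Python. This is a conceptual sketch, not LangChain's implementation: the real RunnableParallel may run its branches concurrently, and run_parallel, fake_retriever, and passthrough are hypothetical names used only here.

```python
def run_parallel(steps, value):
    """Call every step with the same input and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}


def fake_retriever(question):
    """Pretend retriever: returns a list of chunks for the question."""
    return [f"chunk relevant to: {question}"]


def passthrough(value):
    """Return the input unchanged, like RunnablePassthrough()."""
    return value


steps = {"context": fake_retriever, "question": passthrough}
result = run_parallel(steps, "What are my main ideas?")
print(result)
```

The resulting dictionary has exactly the two keys the prompt template expects, which is why the dictionary can be piped directly into the prompt step.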
The next step is to integrate this RAG chain into a web application or a more complex agentic workflow.