The most surprising thing about building local RAG applications with Ollama and LangChain is how little infrastructure you actually need to get started.

Let’s see it in action. Imagine you have a small PDF, say my_notes.pdf, containing your brilliant ideas. You want to ask it questions, and you want to do it all locally, without sending your data to the cloud.

First, you need Ollama running. If you don’t have it, download it from ollama.ai. Once installed, pull a model:

ollama pull llama3

This downloads the Llama 3 model, a powerful LLM, to your machine. Now, let’s get your PDF into a format Ollama can use. We’ll use LangChain for this. Install the necessary libraries:

pip install langchain-community langchain-chroma langchain-text-splitters pypdf
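Before running anything that depends on Ollama, it can help to confirm the server is reachable and the model has actually been pulled. The sketch below assumes Ollama is listening on its default local address (http://localhost:11434) and uses its /api/tags endpoint, which lists locally available models. The helper names (fetch_tags, model_names, has_model) are made up for this illustration.

```python
import json
import urllib.request
from typing import List, Optional

# Assumption: Ollama is running on its default local address.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches a model name, with or without its tag suffix."""
    return any(
        name == wanted or name.split(":")[0] == wanted
        for name in model_names(tags_payload)
    )


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[dict]:
    """Return the parsed /api/tags response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    payload = fetch_tags()
    if payload is None:
        print("Ollama does not appear to be running.")
    elif has_model(payload, "llama3"):
        print("llama3 is available locally.")
    else:
        print("Ollama is running, but llama3 has not been pulled yet.")
```

If the check reports that the model is missing, running ollama pull llama3 again should fix it.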

Now, write a Python script to ingest your document:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.embeddings import OllamaEmbeddings

# Load the document
loader = PyPDFLoader("my_notes.pdf")
documents = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# Create embeddings using Ollama
embeddings = OllamaEmbeddings(model="llama3")

# Create a Chroma vector store
vector_store = Chroma.from_documents(
    texts,
    embeddings,
    persist_directory="./chroma_db"
)

print("Document loaded, split, embedded, and stored in ./chroma_db")

This script does a few key things:

  1. Loads the PDF: PyPDFLoader reads your document page by page.
  2. Splits into Chunks: RecursiveCharacterTextSplitter breaks large documents into smaller pieces, preferring to split at paragraph and line boundaries before falling back to words and characters. This matters because LLMs have context window limits, and you want to retrieve only the most relevant pieces. chunk_size=1000 means each piece will be at most 1000 characters, and chunk_overlap=200 means up to 200 characters will be shared between consecutive chunks so that context is not lost at the boundaries.
  3. Generates Embeddings: OllamaEmbeddings sends your text chunks to the locally running Ollama instance (using the llama3 model) to generate numerical representations (embeddings) of their meaning.
  4. Stores in Vector DB: Chroma.from_documents takes these embeddings and the original text chunks and stores them in a Chroma database directory (./chroma_db). Chroma is an efficient, in-memory (or disk-persisted) vector database, perfect for local applications.
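To make chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch of overlapping chunking. It is not how RecursiveCharacterTextSplitter works internally (the real splitter looks for natural separators first); the sliding_chunks function is a hypothetical helper that only illustrates how the two numbers interact.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where neighbors share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in sliding_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=3):
        print(repr(chunk))
```

With chunk_size=10 and chunk_overlap=3, each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next.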

After running this script, you’ll have a chroma_db directory containing your data, ready for querying.
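Conceptually, querying that directory means comparing the embedding of your question against the stored chunk embeddings and returning the closest matches. The sketch below shows the idea with cosine similarity over tiny hand-made vectors. The vectors, texts, and the top_k helper are invented for illustration; Chroma's actual index and distance metric may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          stored: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
    stored = [
        ("project ideas", [0.9, 0.1, 0.0]),
        ("grocery list", [0.0, 0.2, 0.9]),
        ("project timeline", [0.8, 0.3, 0.1]),
    ]
    print(top_k([1.0, 0.2, 0.0], stored, k=2))
```

The two project-related entries rank above the grocery list because their vectors point in nearly the same direction as the query.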

Now, let’s build the RAG part. This involves retrieving relevant chunks from your vector store and feeding them, along with your question, to the LLM.

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Initialize the LLM
llm = Ollama(model="llama3")

# Load the vector store
embeddings = OllamaEmbeddings(model="llama3")
vector_store = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vector_store.as_retriever()

# Define the RAG prompt template
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# Define the RAG chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a question
question = "What are my main ideas for the new project?"
answer = rag_chain.invoke(question)
print(f"Question: {question}")
print(f"Answer: {answer}")

This second script sets up the query mechanism:

  1. Initialize LLM: Ollama(model="llama3") connects to your local Ollama instance.
  2. Load Vector Store: We load the Chroma database we created earlier.
  3. Create Retriever: vector_store.as_retriever() creates an object that can fetch relevant document chunks from the vector store based on a query.
  4. Define Prompt: The template is crucial. It tells the LLM how to behave: use only the provided context to answer the question. {context} and {question} are placeholders that will be filled in dynamically.
  5. Build RAG Chain: This is the core of LangChain’s orchestration.
    • {"context": retriever, "question": RunnablePassthrough()}: LCEL treats this dictionary as a parallel step and sends the input question to both entries. The retriever uses it to fetch relevant documents, while RunnablePassthrough() forwards it unchanged. The output is a dictionary with keys "context" (the retrieved documents) and "question" (the original question).
    • | prompt: This passes the dictionary to our prompt template, filling in the {context} and {question} placeholders.
    • | llm: The formatted prompt is then sent to the llm (Ollama).
    • | StrOutputParser(): This takes the LLM’s response and converts it into a simple string.
  6. Invoke Chain: rag_chain.invoke(question) sends your actual question to this entire pipeline, and you get the answer.
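The data flow described in these steps can be imitated in plain Python. The sketch below is not LangChain code and does not reflect its internals; fake_retriever, fake_llm, run_parallel, and the sample notes are all stand-ins invented to show what each stage receives and returns.

```python
from typing import Callable, Dict, List

TEMPLATE = (
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {question}\n"
)


def _words(text: str) -> set:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip("?.,:!").lower() for w in text.split()}


def fake_retriever(question: str) -> List[str]:
    """Stand-in for the retriever: returns notes sharing a word with the question."""
    notes = [
        "Project idea: a local note search tool.",
        "Grocery list: eggs, milk, bread.",
    ]
    return [note for note in notes if _words(question) & _words(note)]


def passthrough(value):
    """Stand-in for RunnablePassthrough: returns its input unchanged."""
    return value


def run_parallel(steps: Dict[str, Callable], value) -> dict:
    """Give the same input to every step and collect the results by key."""
    return {key: step(value) for key, step in steps.items()}


def fake_llm(prompt: str) -> str:
    """Stand-in for the model call; a real LLM would generate an answer here."""
    return f"[stub reply to a {len(prompt)}-character prompt]"


def run_chain(question: str) -> str:
    inputs = run_parallel({"context": fake_retriever, "question": passthrough}, question)
    prompt = TEMPLATE.format(**inputs)  # fills {context} and {question}
    return fake_llm(prompt)


if __name__ == "__main__":
    print(run_chain("What are my project ideas?"))
```

The point of the imitation is the shape of the data: one string goes in, a dictionary with two keys comes out of the first stage, and every later stage works on a single formatted prompt.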

This setup gives you a fully functional RAG system running entirely on your machine. One caveat: as written, the "context" passed to the prompt is the raw list of Document objects returned by the retriever, so the template receives that list’s string representation, metadata and all. For a cleaner prompt, pipe the retriever through a small function that joins the page_content attribute of each retrieved Document, typically separated by newlines. That way the LLM receives a coherent block of text derived from the most relevant parts of your original document.
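A common way to hand the model clean text is a small formatting function placed between the retriever and the prompt. The sketch below uses a stand-in FakeDocument class so it runs without LangChain installed; the format_docs name and the blank-line separator are choices made for this example, not something the scripts above define.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class FakeDocument:
    """Stand-in for LangChain's Document: only the attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: Iterable) -> str:
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


if __name__ == "__main__":
    docs = [
        FakeDocument("First chunk.", {"page": 0}),
        FakeDocument("Second chunk.", {"page": 1}),
    ]
    print(format_docs(docs))

# Wiring it into the chain (assumes retriever, prompt, llm, and the imports
# from the query script are already defined):
#
# rag_chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
```

Piping the retriever into a plain function works because LCEL wraps callables as runnables, so the "context" key then holds a single string rather than a list of objects.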

The real power of this local setup is privacy and cost: nothing leaves your machine and there are no per-request API fees, so you can iterate rapidly on your RAG application.

Most people don’t realize that the dictionary at the start of the chain is itself turned into a runnable. LangChain’s LCEL (LangChain Expression Language) coerces it into a RunnableParallel, which invokes the retriever with the question and collects each result under its key. This implicit execution keeps the chain definition cleaner than manual function calls.

The next step is to integrate this RAG chain into a web application or a more complex agentic workflow.
