LlamaIndex doesn’t index your raw data directly; it indexes representations of that data, typically vector embeddings of text chunks, that are designed for efficient retrieval.

Here’s how you’d set up a basic RAG (Retrieval Augmented Generation) pipeline using Ollama for local LLM inference and LlamaIndex for data indexing and retrieval.

First, ensure you have Ollama running with a model downloaded. For this example, we’ll use llama3. You can download it with ollama pull llama3.

ollama serve &
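
Before moving on, it can help to confirm the server is actually answering. The snippet below is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default address, http://localhost:11434, and the helper name ollama_is_up is made up for this example.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url.

    Assumption: Ollama listens on its default port, 11434. Adjust base_url
    if you started the server on a different host or port.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, HTTP error, or timeout:
        # treat all of these as "server not reachable".
        return False


if __name__ == "__main__":
    if ollama_is_up():
        print("Ollama server is reachable.")
    else:
        print("Ollama server is not reachable; start it with `ollama serve`.")
```

If the check fails, start the server first; the rest of the pipeline cannot load a model or compute embeddings without it.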

Next, install the necessary Python libraries:

pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama

Now, let’s create a Python script to build the RAG pipeline.

import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Configure Ollama for LLM and embeddings
# Ensure Ollama is running in another terminal with `ollama serve`
Settings.llm = Ollama(model="llama3", request_timeout=120.0)
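# OllamaEmbedding takes `model_name` (not `model`). llama3 works here, though a
# dedicated embedding model may retrieve better.
Settings.embed_model = OllamaEmbedding(model_name="llama3")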

# Load documents
# Create a directory named 'data' and place your text files inside it.
# For example, create 'data/my_document.txt' with some content.
if not os.path.exists("data"):
    os.makedirs("data")
    with open("data/sample_doc.txt", "w") as f:
        f.write("The capital of France is Paris. Paris is known for the Eiffel Tower and the Louvre Museum.")

print("Loading documents from the 'data' directory...")
documents = SimpleDirectoryReader("data").load_data()
print(f"Loaded {len(documents)} document(s).")

# Build the index
print("Building the index...")
index = VectorStoreIndex.from_documents(documents)
print("Index built successfully.")

# Create a query engine
print("Creating query engine...")
query_engine = index.as_query_engine()
print("Query engine created.")

# Perform a query
query_text = "What is the capital of France?"
print(f"\nQuerying: {query_text}")
response = query_engine.query(query_text)

print("\n--- Response ---")
print(response)
print("----------------")

# Another query
query_text_2 = "Tell me about famous landmarks in Paris."
print(f"\nQuerying: {query_text_2}")
response_2 = query_engine.query(query_text_2)

print("\n--- Response ---")
print(response_2)
print("----------------")

To run this:

  1. Save the code as a Python file (e.g., rag_pipeline.py).
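  2. Optionally, create a directory named data in the same location as your Python script.
  3. Place one or more text files (e.g., sample_doc.txt) inside the data directory. If the data directory does not exist, the script creates it along with a sample file. An existing but empty directory is left as-is, so make sure it contains at least one file.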
  4. Make sure your Ollama server is running in a separate terminal: ollama serve.
  5. Run the script: python rag_pipeline.py.

When you run this script, LlamaIndex will:

  • Read the documents from the data directory using SimpleDirectoryReader.
  • For each document, it will split the text into smaller chunks.
  • It will then use the configured OllamaEmbedding model to generate vector embeddings for each chunk. These embeddings are numerical representations of the text’s meaning.
  • These embeddings are stored in a VectorStoreIndex. By default, LlamaIndex uses an in-memory vector store, but you can configure persistent ones.
  • When you query, LlamaIndex first generates an embedding for your query.
  • It then searches the VectorStoreIndex for the most semantically similar text chunks (based on their embeddings).
  • Finally, it takes your original query and the retrieved text chunks and sends them to the configured LLM (Ollama with llama3) as a prompt, instructing it to answer the query based on the provided context.
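
The steps above can be sketched without any model at all. The following is a toy illustration, not LlamaIndex’s implementation: word-count vectors stand in for the embeddings that OllamaEmbedding would produce, a sentence splitter stands in for the real chunker, and every function name is hypothetical. The third sentence of the sample text is added here so that retrieval has something to choose between.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Toy chunker: one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text: str) -> list[tuple[str, Counter]]:
    """Chunk the text and store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in split_sentences(text)]


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the query into one prompt for the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nAnswer using only the context above.\nQuestion: {query}"


DOC = (
    "The capital of France is Paris. "
    "Paris is known for the Eiffel Tower and the Louvre Museum. "
    "Berlin is the capital of Germany."  # extra sentence added for illustration
)

index = build_index(DOC)
query = "What is the capital of France?"
top_chunks = retrieve(query, index, top_k=1)
print(build_prompt(query, top_chunks))
```

Word counts only match exact words, so this sketch cannot relate "landmarks" to "Eiffel Tower" the way a learned embedding can. That gap is the reason the real pipeline uses an embedding model.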

Settings.llm and Settings.embed_model are the key configuration here: they tell LlamaIndex which models to use for generation and for embeddings. The Ollama class is LlamaIndex’s LLM integration for a local Ollama server, and OllamaEmbedding requests embeddings from the same server. The request_timeout value raises the time allowed for a response, since local models can be slow, especially while a model is first being loaded.

What stands out is how much LlamaIndex hides: chunking, embedding, and retrieval all happen behind a few calls. You point it at your data and tell it which models to use, and it builds a searchable index. The VectorStoreIndex.from_documents() call is where raw text becomes a set of embedded chunks that can be retrieved by similarity.

You’ll next want to explore different chunking strategies and vector store persistence options for larger datasets.
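
To get a feel for what a chunking strategy controls, here is a minimal sketch of fixed-size chunking with overlap, written in plain Python. It splits on whitespace rather than on tokens or sentence boundaries, and the function name and parameters are hypothetical, so treat it as an illustration of the size/overlap trade-off rather than a description of LlamaIndex’s own splitters.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into word windows of chunk_size, sharing `overlap` words.

    Larger chunks keep more context together; more overlap reduces the
    chance that a fact is cut in half at a chunk boundary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "Paris is known for the Eiffel Tower and the Louvre Museum"
    for piece in chunk_words(sample, chunk_size=5, overlap=2):
        print(piece)
```

Smaller chunks make retrieval more precise but give the LLM less surrounding context per hit; larger chunks do the opposite, so the right setting depends on your documents.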
