OpenAI + LlamaIndex: Build RAG with GPT Models (2026)

OpenAI models and LlamaIndex together make Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) straightforward to build. RAG is a technique that improves LLM responses by supplying relevant information retrieved from an external knowledge base. This lets the LLM draw on up-to-date or domain-specific information, leading to more accurate and contextually relevant answers.

Let’s see how this works in practice with a simple example.

import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Set your OpenAI API key as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Configure LlamaIndex to use OpenAI models
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Load documents from a directory
# Create a directory named 'data' and place some text files inside.
# For example, create 'data/my_document.txt' with some content.
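try:
    documents = SimpleDirectoryReader("data").load_data()
except (FileNotFoundError, ValueError):
    # Recent LlamaIndex versions raise ValueError when the directory is
    # missing or contains no files; FileNotFoundError is kept as a fallback.
    print("Please create a 'data' directory and add some text files to it.")
    raise SystemExit(1)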

# Build an index from the documents
index = VectorStoreIndex.from_documents(documents)

# Create a query engine
query_engine = index.as_query_engine()

# Query the index
response = query_engine.query("What is the main topic of the documents?")

print(response)

This code snippet demonstrates the core of a RAG system. First, we import the necessary components from llama_index and configure LlamaIndex to use OpenAI’s gpt-3.5-turbo for generation and text-embedding-ada-002 for embeddings. Then we load text documents from a local directory named data. LlamaIndex splits these documents into chunks and builds a VectorStoreIndex from them. The index stores each chunk as a numerical representation (an embedding) that captures its semantic meaning, which makes similarity search efficient. Finally, we create a query_engine from the index and ask a question. The query_engine embeds the query, retrieves the document chunks whose embeddings are closest to it, and passes those chunks along with the original query to the LLM to generate an answer.
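To make "similarity search" concrete, here is a minimal sketch of ranking chunks by cosine similarity, a common measure for comparing embeddings. The three-dimensional vectors and chunk names are made up for illustration; real embeddings come from the embedding model and have hundreds or thousands of dimensions, and this is not a description of how LlamaIndex implements the search internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical "embeddings" for three chunks (illustrative values only).
chunk_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "company history": [0.0, 0.2, 0.9],
}

# Stands in for the embedded user question.
query_vector = [0.8, 0.2, 0.1]

# Rank chunk names from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
best_match = ranked[0]
print(ranked)
```

The chunk whose vector points in nearly the same direction as the query vector ranks first, regardless of vector length, which is why cosine similarity is a popular choice for this step.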

The problem RAG solves is the inherent limitation of LLMs: their knowledge is static, based on their training data, and they can "hallucinate" information. By integrating external, up-to-date, or private data sources, RAG grounds the LLM’s responses in factual information, significantly improving reliability and relevance. The process involves several key steps:

1. Data Loading: Documents (PDFs, text files, web pages, etc.) are ingested.
2. Text Splitting: Large documents are broken down into smaller, manageable chunks. This is crucial because LLMs have context window limits, and smaller chunks allow for more precise retrieval.
3. Embedding Generation: Each chunk is converted into a numerical vector (embedding) using an embedding model. These embeddings capture the semantic meaning of the text.
4. Index Creation: The embeddings are stored in a vector database or index, enabling fast similarity searches.
5. Querying: When a user asks a question, the query is also embedded.
6. Retrieval: The index is searched for document chunks whose embeddings are most similar to the query embedding.
7. Augmentation: The retrieved chunks are combined with the original query to form a prompt for the LLM.
8. Generation: The LLM uses this augmented prompt to generate a final answer.
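The steps above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: the "embedding" is a simple word-count vector rather than a real embedding model, the splitter cuts on sentence boundaries, the index is an in-memory list, and the generation step is left out (the sketch stops at the augmented prompt that would be sent to an LLM). All function names here are hypothetical and are not part of LlamaIndex.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Text splitting: one chunk per sentence (a deliberately crude splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    """Embedding generation: a toy word-count vector, NOT a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(index, query, top_k=2):
    """Querying + retrieval: embed the query and return the top_k closest chunks."""
    query_vec = embed(query)
    scored = sorted(index, key=lambda item: similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]


def build_prompt(query, chunks):
    """Augmentation: combine the retrieved chunks with the original question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


# Data loading: two in-memory "documents" stand in for files on disk.
documents = [
    "The warranty covers parts for two years. Labor is covered for ninety days.",
    "Orders ship within three business days. Express shipping is available at checkout.",
]

# Index creation: store each chunk next to its vector.
index = [(chunk, embed(chunk)) for doc in documents for chunk in split_into_chunks(doc)]

query = "How long does the warranty cover parts?"
top_chunks = retrieve(index, query, top_k=1)
prompt = build_prompt(query, top_chunks)

# Generation would send `prompt` to an LLM; here we only print it.
print(prompt)
```

Word-count vectors only match on shared words, so this sketch would miss paraphrases that a learned embedding model can catch. The structure of the pipeline, however, is the same.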

You control the behavior of this system through several levers. Settings.llm and Settings.embed_model determine which LLM and embedding model are used, which affects cost, latency, and the quality of both retrieval and answers. SimpleDirectoryReader can be swapped for other readers to ingest data from other sources, such as web pages (SimpleWebPageReader), PDFs (PDFReader), or databases; these readers are typically installed as separate packages. Chunking is controlled by the node parser rather than by a text_splitter argument: VectorStoreIndex.from_documents accepts a transformations list, and Settings also lets you set a global default. For instance, passing transformations=[SentenceSplitter(chunk_size=1000, chunk_overlap=200)] produces chunks of up to 1000 tokens with an overlap of 200 tokens, so context isn’t lost at chunk boundaries. Note that RecursiveCharacterTextSplitter is a LangChain class that measures chunks in characters by default; to use it in LlamaIndex you would wrap it with LangchainNodeParser. The as_query_engine method also accepts parameters such as similarity_top_k, which controls how many document chunks are retrieved for augmentation.
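To show what chunk size and overlap actually do, here is a minimal sliding-window sketch in plain Python. It is an assumption-laden simplification: it counts characters and cuts at fixed positions, whereas production splitters usually count tokens and prefer sentence or paragraph boundaries. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last four characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough. Larger overlaps improve that coverage at the cost of more chunks to embed and store.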

The most surprising thing about RAG is how much of the "intelligence" is actually in the retrieval step, not the generation. An LLM is a powerful text predictor, but without the right context, it’s just guessing. The real magic happens when you can find the exact pieces of information that answer the user’s question and present them to the LLM in a coherent way. The LLM’s job becomes much simpler: it’s not trying to recall information from its training data or infer answers; it’s tasked with synthesizing an answer based on the precise facts you’ve provided. This shifts the focus from prompt engineering for recall to effective data indexing and retrieval.

The next step is to explore advanced retrieval strategies, such as query transformations and re-ranking, to further refine the relevance of retrieved documents.