Pinecone + LlamaIndex: Index Documents for RAG (2026)

Pinecone is a vector database, and LlamaIndex is a data framework for LLM applications. Together, they let you build RAG (Retrieval-Augmented Generation) systems that answer natural-language questions about your own documents.

Let’s say you have a bunch of PDFs containing your company’s internal documentation. You want to build a chatbot that can answer questions based on these PDFs. You can’t just feed the whole PDF to an LLM; they have context window limits and aren’t designed for efficient document retrieval. This is where Pinecone and LlamaIndex come in.

First, you need to get your documents into a format that an LLM can understand and query. This involves:

Loading Documents: Read your raw documents (PDFs, text files, web pages, etc.).
Chunking: Break down large documents into smaller, manageable pieces.
Embedding: Convert these chunks into numerical vectors using an embedding model. These vectors capture the semantic meaning of the text.
Indexing: Store these vectors in a vector database (Pinecone) so they can be efficiently searched.
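
To make the chunking step concrete, here is a minimal sketch of a fixed-size splitter with overlap. It is an illustration only: chunk_words and its parameters are hypothetical names, it counts whitespace-separated words rather than model tokens, and it is not how LlamaIndex splits text internally.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: twelve placeholder words, w0 through w11
sample = " ".join(f"w{i}" for i in range(12))
for piece in chunk_words(sample, chunk_size=5, overlap=2):
    print(piece)
```

The overlap trades a little extra storage for better recall at chunk boundaries. Real splitters usually also respect sentence or paragraph breaks.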

When a user asks a question:

Embed the Query: Convert the user’s question into a vector using the same embedding model.
Vector Search: Query Pinecone with the question vector to find the most semantically similar document chunks (the "context").
Augment Prompt: Combine the retrieved context with the original question and feed it to an LLM.
Generate Answer: The LLM uses the provided context to generate an accurate answer.
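
The "Augment Prompt" step is mostly string assembly. The sketch below shows one plausible way to do it. The build_prompt function and the template wording are assumptions for illustration, not the prompt LlamaIndex actually uses.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into a single prompt."""
    # Number the chunks so the model (and a human reader) can tell them apart
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Pretend these two chunks came back from the vector search
retrieved = [
    "This is the first document about apples. Apples are a type of fruit.",
    "Fruits are good for you.",
]
print(build_prompt("What kind of fruit is an apple?", retrieved))
```

Telling the model to rely only on the supplied context is what keeps answers grounded in your documents rather than in whatever the model memorized during training.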

Here’s how you’d set this up with LlamaIndex and Pinecone.

First, install the necessary libraries:

pip install llama-index llama-index-vector-stores-pinecone llama-index-embeddings-huggingface pinecone

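You’ll need a Pinecone API key, which you can get from the Pinecone console. Serverless indexes are addressed by cloud and region, so a separate environment value is only relevant for legacy pod-based indexes. The example below uses a local Hugging Face embedding model, so an OpenAI API key is only needed for the LLM that generates the final answers (LlamaIndex’s default).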
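
Rather than pasting keys into the script, you may prefer to export them in your shell and fail early if one is missing. A minimal sketch, assuming the variable names used in this article. require_env is a hypothetical helper, not part of either library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this script.")
    return value


# Example usage (uncomment once the variable is exported in your shell):
# pinecone_api_key = require_env("PINECONE_API_KEY")
```

Failing at startup is usually friendlier than a confusing authentication error halfway through indexing.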

Let’s create a simple example. Assume you have a file named my_docs.txt with some text:

This is the first document about apples. Apples are a type of fruit.
This is the second document about bananas. Bananas are also a fruit, and they are yellow.
This is the third document, discussing the benefits of healthy eating. Fruits are good for you.

Now, let’s load, chunk, embed, and index these documents into Pinecone using LlamaIndex.

import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec
from llama_index.embeddings.huggingface import HuggingFaceEmbedding # Using a local embedding model

# --- Configuration ---
# Set your environment variables for API keys
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY" # Needed at query time for LlamaIndex's default OpenAI LLM
os.environ["PINECONE_API_KEY"] = "YOUR_PINECONE_API_KEY"
os.environ["PINECONE_ENVIRONMENT"] = "YOUR_PINECONE_ENVIRONMENT" # Legacy pod-based setting; serverless indexes use cloud/region instead

# Pinecone specific configuration
PINECONE_INDEX_NAME = "llama-rag-index"
PINECONE_CLOUD = "aws" # e.g., "aws", "gcp", "azure"
PINECONE_REGION = "us-east-1" # e.g., "us-east-1", "us-west-2", "europe-west1"

# Embedding model configuration
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# --- 1. Load Documents ---
# Create a dummy document file
with open("my_docs.txt", "w") as f:
    f.write("This is the first document about apples. Apples are a type of fruit.\n")
    f.write("This is the second document about bananas. Bananas are also a fruit, and they are yellow.\n")
    f.write("This is the third document, discussing the benefits of healthy eating. Fruits are good for you.\n")

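# Load only the sample file. Pointing SimpleDirectoryReader at "./" would also
# ingest this script and every other readable file in the directory.
documents = SimpleDirectoryReader(input_files=["my_docs.txt"]).load_data()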

# --- 2. Initialize Pinecone ---
# Initialize Pinecone connection
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Define the index specification
# For serverless, use ServerlessSpec
spec = ServerlessSpec(cloud=PINECONE_CLOUD, region=PINECONE_REGION)

# Check if the index exists, create if not (names() is a method call)
if PINECONE_INDEX_NAME not in pc.list_indexes().names():
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        dimension=384,  # Dimension for 'all-MiniLM-L6-v2' is 384
        metric="cosine",
        spec=spec
    )
    print(f"Index '{PINECONE_INDEX_NAME}' created.")
else:
    print(f"Index '{PINECONE_INDEX_NAME}' already exists.")

# Connect to the Pinecone index and wrap it in a LlamaIndex vector store
pinecone_index = pc.Index(PINECONE_INDEX_NAME)
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# --- 3. Initialize Embedding Model ---
# Use HuggingFaceEmbedding for local embeddings
embed_model = HuggingFaceEmbedding(model_name=EMBEDDING_MODEL_NAME)

# --- 4. Create Index and Add Documents ---
# Create a storage context with the Pinecone vector store
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create an index. LlamaIndex will handle chunking and embedding
# The from_documents method automatically handles embedding and upserting to the vector store
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)

print("Documents indexed into Pinecone.")

# --- 5. Query the Index ---
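# Create a query engine. It uses LlamaIndex's default LLM (OpenAI), so
# OPENAI_API_KEY must be set unless you configure a different LLM.
query_engine = index.as_query_engine()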

# Perform a query
response = query_engine.query("What kind of fruit is an apple?")
print("\nQuery: What kind of fruit is an apple?")
print(f"Response: {response}")

response = query_engine.query("Tell me about yellow fruits.")
print("\nQuery: Tell me about yellow fruits.")
print(f"Response: {response}")

The key here is VectorStoreIndex.from_documents(). LlamaIndex takes your raw documents, splits them into chunks (in recent versions the default splitter targets roughly 1,024 tokens per chunk, with some overlap between neighbors), then uses the embed_model to generate a vector for each chunk. These vectors, along with the original text of the chunks stored as metadata, are then upserted to Pinecone.

The dimension parameter when creating the Pinecone index is crucial. It must match the output dimension of your chosen embedding model. all-MiniLM-L6-v2 outputs 384-dimensional vectors.

When you query, query_engine.query("What kind of fruit is an apple?") first embeds the question using the same embed_model, then sends this query vector to Pinecone. Pinecone runs an approximate nearest-neighbor search to find the vectors (and thus the document chunks) closest to the query vector under the index’s metric, cosine similarity in this example. The retrieved chunks are then passed to an LLM to generate the final answer. By default LlamaIndex uses an OpenAI chat model, which requires OPENAI_API_KEY to be set; the exact default model depends on the installed version.
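
To see what "closest to the query vector" means, here is a brute-force version of the search in plain Python. It is a sketch under stated assumptions: the three-dimensional vectors are made up by hand (real embeddings from all-MiniLM-L6-v2 have 384 dimensions), cosine_similarity and top_k are hypothetical helpers, and Pinecone itself uses approximate indexes rather than an exhaustive scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    for chunk_id, vec in vectors.items():
        if len(vec) != len(query):
            raise ValueError(f"dimension mismatch for {chunk_id!r}")
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made toy "embeddings" for the three sample sentences (illustrative only)
TOY_VECTORS = {
    "apples": [1.0, 0.0, 0.2],
    "bananas": [0.0, 1.0, 0.2],
    "healthy-eating": [0.3, 0.3, 1.0],
}

# A toy query vector that points mostly in the "apples" direction
for chunk_id, score in top_k([0.9, 0.1, 0.1], TOY_VECTORS, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

An exhaustive scan like this is fine for a handful of vectors. A vector database exists precisely because it stops being practical at millions of them.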

ServerlessSpec is Pinecone’s way to provision a serverless index without managing the underlying infrastructure. You specify the cloud and region, and Pinecone handles the rest.

The magic of this setup is that LlamaIndex abstracts away much of the complexity. You provide the documents and the embedding model, and LlamaIndex handles the chunking, embedding, and upserting into Pinecone. Similarly, it handles the retrieval and prompt augmentation for querying.

A detail that often trips people up is ensuring the dimension of your Pinecone index exactly matches the dimension of your chosen embedding model. If they don’t match, Pinecone rejects the upsert or query with a dimension error. A quieter failure is querying with a different embedding model than the one used for indexing: if the two models happen to share a dimension, nothing errors out, but the retrieved chunks are effectively arbitrary. Always check the documentation for your specific embedding model to find its vector dimension. For example, OpenAI’s text-embedding-ada-002 has a dimension of 1536.
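
One way to avoid guessing is to measure the dimension by embedding a short probe string before you create the index. The sketch below works with any callable that maps text to a list of floats. The function names and fake_embed are hypothetical. With LlamaIndex you could pass embed_model.get_text_embedding as the callable.

```python
from typing import Callable, Sequence

EmbedFn = Callable[[str], Sequence[float]]


def embedding_dimension(embed: EmbedFn) -> int:
    """Measure a model's output dimension by embedding a throwaway string."""
    return len(embed("dimension probe"))


def assert_dimension_matches(embed: EmbedFn, index_dimension: int) -> None:
    """Raise early if the model and the index disagree on vector size."""
    actual = embedding_dimension(embed)
    if actual != index_dimension:
        raise ValueError(
            f"Embedding model produces {actual}-dimensional vectors, "
            f"but the index expects {index_dimension}."
        )


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real model: always returns a 384-dimensional zero vector."""
    return [0.0] * 384


print(embedding_dimension(fake_embed))
assert_dimension_matches(fake_embed, 384)  # passes silently
```

Running a check like this right before create_index means a wrong number never makes it into an index you then have to delete and rebuild.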

The next step would be to explore different chunking strategies, experiment with various embedding models (local vs. hosted), and integrate a specific LLM for generation instead of relying on the default.