The core innovation of RAG isn’t just retrieving documents; it’s retrieving relevant snippets based on a query, and the indexing process is where that snippet-ability is forged.
Let’s watch this in action. Imagine we have a `product_reviews.jsonl` file with one review per line, and we want to index it into a ChromaDB vector store.
{"id": "review_1", "text": "This laptop is amazing! The battery life is incredible, lasts all day.", "metadata": {"product_id": "laptop_x100"}}
{"id": "review_2", "text": "Disappointed with the screen. Too dim for outdoor use.", "metadata": {"product_id": "laptop_x100"}}
{"id": "review_3", "text": "The keyboard feels cheap, but the performance is top-notch.", "metadata": {"product_id": "laptop_x100"}}
{"id": "review_4", "text": "Amazing sound quality from these headphones. Perfect for my commute.", "metadata": {"product_id": "headphones_pro"}}
{"id": "review_5", "text": "Battery dies way too fast on these headphones. Can't even get through a podcast.", "metadata": {"product_id": "headphones_pro"}}
We’ll use LangChain for this:
from langchain_community.document_loaders import JSONLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
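# Load documents (JSONLoader requires the `jq` package)
def extract_metadata(record: dict, metadata: dict) -> dict:
    # Merge each review's own "metadata" object into the loader's defaults
    metadata.update(record.get("metadata", {}))
    return metadata

loader = JSONLoader(
    file_path='product_reviews.jsonl',
    jq_schema='.',        # each line is already one complete record
    content_key="text",
    json_lines=True,      # parse the file line by line as JSONL
    metadata_func=extract_metadata,
)
documents = loader.load()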
# Split documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = text_splitter.split_documents(documents)
# Initialize embeddings (using a local Ollama model)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create ChromaDB client and collection
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="product_reviews_collection",
    persist_directory="./chroma_db"
)
print(f"Indexed {len(chunks)} chunks into ChromaDB.")
This code does a few key things:
- Loading: It reads raw data (`.jsonl` in this case) and transforms it into `Document` objects, which are LangChain’s standard representation. Crucially, it extracts both the `text` and associated `metadata`.
- Splitting: It breaks down long documents into smaller, manageable `chunks`. This is vital because vector embeddings work best on semantically coherent, relatively short pieces of text. `RecursiveCharacterTextSplitter` is a common choice: it tries to split on a list of separators in order (`\n\n`, `\n`, a space, and finally the empty string) until chunks are below the `chunk_size`, while respecting `chunk_overlap` to maintain context between adjacent chunks.
- Embedding: It converts each text chunk into a numerical vector using an embedding model. This vector captures the semantic meaning of the text. Here, we’re using `nomic-embed-text` via Ollama, a fast, locally runnable model.
- Indexing: It stores these vectors (along with the original text and metadata) in a vector database (ChromaDB). ChromaDB organizes these vectors for efficient similarity search. The `persist_directory` ensures the index is saved to disk.
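To make the loading step less opaque, here is a minimal standard-library sketch of what turning JSONL lines into document objects involves. `SimpleDocument` and `load_jsonl` are hypothetical names used only for illustration; they are not LangChain's classes, and the real loader does more (for example, `jq` filtering).

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Illustrative stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_jsonl(lines):
    """Parse an iterable of JSONL lines into SimpleDocument objects."""
    docs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Copy the record's own metadata and keep the id and line number with it
        metadata = dict(record.get("metadata", {}))
        metadata["id"] = record["id"]
        metadata["seq_num"] = line_number
        docs.append(SimpleDocument(page_content=record["text"], metadata=metadata))
    return docs


sample_lines = [
    '{"id": "review_1", "text": "This laptop is amazing! The battery life is incredible, lasts all day.", "metadata": {"product_id": "laptop_x100"}}',
    '{"id": "review_4", "text": "Amazing sound quality from these headphones. Perfect for my commute.", "metadata": {"product_id": "headphones_pro"}}',
]
docs = load_jsonl(sample_lines)
for doc in docs:
    print(doc.metadata["id"], doc.metadata["product_id"])
```

Keeping `product_id` on every document is what later makes it possible to tell a laptop review from a headphones review after retrieval.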
The problem RAG indexing solves is how to transform unstructured or semi-structured text into a format that a semantic search engine can query effectively. Before vector databases and embeddings, you’d rely on keyword matching (like TF-IDF), which misses synonyms and conceptual relationships. RAG indexing allows you to ask "How long does this laptop last on a charge?" and get back the snippet "This laptop is amazing! The battery life is incredible, lasts all day," even though the query never uses the words "battery life."
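The limitation of keyword matching can be seen in a small sketch. This uses plain term overlap rather than real TF-IDF, and the stopword list, the helper names, and the two queries are assumptions made up for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; real systems use larger ones)
STOPWORDS = {"a", "the", "is", "on", "how", "does", "this", "with", "for", "too", "but"}


def keywords(text: str) -> set[str]:
    """Lowercase the text and keep the words that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def keyword_overlap(query: str, document: str) -> int:
    """Count the distinct non-stopword terms shared by query and document."""
    return len(keywords(query) & keywords(document))


reviews = {
    "review_1": "This laptop is amazing! The battery life is incredible, lasts all day.",
    "review_2": "Disappointed with the screen. Too dim for outdoor use.",
    "review_3": "The keyboard feels cheap, but the performance is top-notch.",
}

paraphrased = "How long does the notebook run on a single charge?"
literal = "laptop battery life"

paraphrased_scores = {rid: keyword_overlap(paraphrased, text) for rid, text in reviews.items()}
literal_scores = {rid: keyword_overlap(literal, text) for rid, text in reviews.items()}

print(paraphrased_scores)
print(literal_scores)
```

The paraphrased query shares no terms with any review, so a pure keyword ranker cannot tell the battery review apart from the others, while the literal query matches it directly. An embedding model is meant to close that gap.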
The critical levers you control here are `chunk_size` and `chunk_overlap` in the `RecursiveCharacterTextSplitter`. A smaller `chunk_size` gives more granular retrieval but can break up sentences mid-thought. A larger `chunk_size` preserves more context but can dilute the specificity of any single sentence. `chunk_overlap` repeats the tail of one chunk at the start of the next, so a key piece of information that spans a chunk boundary is still retrievable in one piece.
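The trade-off is easier to see with a deliberately simplified chunker. The sketch below slides a fixed-width character window over the text; it is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers separator boundaries), and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


review = "The keyboard feels cheap, but the performance is top-notch."

small = chunk_text(review, chunk_size=20, chunk_overlap=5)
large = chunk_text(review, chunk_size=40, chunk_overlap=5)

for label, chunks in (("small", small), ("large", large)):
    print(label, len(chunks))
    for chunk in chunks:
        print("   ", repr(chunk))
```

With the small window, the sentence is cut into several fragments that each carry little meaning on their own; with the large window, most of the sentence stays together. In both cases the last few characters of one chunk reappear at the start of the next.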
The surprising thing is that the quality of the embedding model is often more impactful than the intricacies of the chunking strategy, especially when dealing with highly specialized or nuanced language. A mediocre embedding model will produce vectors that don’t accurately represent semantic similarity, regardless of how perfectly you’ve chunked your text. This means investing time in evaluating and selecting an embedding model that aligns with your domain is paramount.
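One way to compare embedding models is to measure how often the expected document is ranked first for a small set of labeled queries. The sketch below is a minimal version of that idea under stated assumptions: `hit_rate_at_k`, `toy_embed`, the vocabulary, and the labeled queries are all hypothetical, and `toy_embed` is a crude word-count stand-in for a real model. In practice you would pass in a real embedding function, such as the `embed_query` method of a LangChain embeddings object.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_rate_at_k(
    embed: Callable[[str], Sequence[float]],
    corpus: dict[str, str],
    labeled_queries: list[tuple[str, str]],
    k: int = 1,
) -> float:
    """Fraction of queries whose expected document id is in the top k results."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, expected_id in labeled_queries:
        query_vector = embed(query)
        ranked = sorted(
            doc_vectors,
            key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
            reverse=True,
        )
        if expected_id in ranked[:k]:
            hits += 1
    return hits / len(labeled_queries)


# Toy stand-in for an embedding model: counts of a few hand-picked words
VOCAB = ["battery", "screen", "keyboard", "sound"]


def toy_embed(text: str) -> list[float]:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


corpus = {
    "review_1": "This laptop is amazing! The battery life is incredible, lasts all day.",
    "review_2": "Disappointed with the screen. Too dim for outdoor use.",
    "review_3": "The keyboard feels cheap, but the performance is top-notch.",
}
labeled_queries = [
    ("Is the battery any good?", "review_1"),
    ("How bright is the screen?", "review_2"),
    ("Can I read the display in sunlight?", "review_2"),  # paraphrase, no shared vocabulary
]

score = hit_rate_at_k(toy_embed, corpus, labeled_queries, k=1)
print(f"hit rate@1: {score:.2f}")
```

The toy embedding misses the paraphrased query because it has no notion that "display" and "screen" are related, which is the kind of failure this measurement is meant to surface. Running the same labeled queries against two candidate models gives a rough, domain-specific basis for choosing between them.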
Once indexed, you can then perform a similarity search:
query = "How is the battery on the laptop?"
results = vectorstore.similarity_search(query, k=2)
for doc in results:
    print(f"Page Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")
    print("-" * 20)
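This would likely return review_1 and review_5 (or similar, depending on embedding model nuances), since those are the two reviews that discuss battery life. That demonstrates how the system can connect "battery" in the query to "battery life" in the text. Note that review_5 is about the headphones, not the laptop: similarity search alone does not restrict results to one product, so you would need the product_id metadata to narrow them.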
The next step in optimizing this pipeline often involves exploring different embedding models and understanding their performance characteristics for your specific data, or perhaps moving to a more distributed vector database for larger datasets.