The most surprising thing about RAG chunking is that bigger chunks aren’t always better. Much smaller chunks can sometimes improve retrieval accuracy dramatically, counterintuitive as that feels.

Let’s watch this in action. Imagine we have a document describing a complex scientific process. We’ll feed it into a RAG system and see how different chunk sizes affect our ability to retrieve specific details.

First, we need a document. Let’s use a snippet from a fictional research paper:

## The Photosynthesis Process in *Artemisia annua*

Photosynthesis in *Artemisia annua*, commonly known as sweet wormwood, is a critical process for its medicinal compound biosynthesis, particularly artemisinin. The overall reaction can be summarized as:

6CO₂ + 6H₂O + Light Energy → C₆H₁₂O₆ + 6O₂

This process occurs primarily in the chloroplasts, where light-dependent reactions and the Calvin cycle (light-independent reactions) take place. The light-dependent reactions, occurring in the thylakoid membranes, convert light energy into chemical energy in the form of ATP and NADPH. Key pigments like chlorophyll absorb photons, exciting electrons that flow through an electron transport chain.

The Calvin cycle, which takes place in the stroma, uses the ATP and NADPH generated to fix atmospheric carbon dioxide into organic molecules, ultimately producing glucose. This glucose then serves as a precursor for the biosynthesis of artemisinin and other terpenes. The efficiency of artemisinin production is directly linked to the plant's photosynthetic capacity and the availability of essential nutrients and light.

Now, let’s consider different chunking strategies.

Scenario 1: Large Chunks (e.g., 500 tokens)

If we chunk this document into very large pieces, say 500 tokens each, our RAG system might get this:

  • Chunk 1: The entire document above (assuming it’s less than 500 tokens).

When we ask a question like "What is the role of chlorophyll in photosynthesis?", the query is compared against the embedding of this single, large chunk. The chunk does contain the answer, but its embedding may be diluted: the signal for "chlorophyll" is spread thin across other concepts like "artemisinin biosynthesis" and the "Calvin cycle." In a corpus with many chunks like this, that can lead to less precise retrieval.

Scenario 2: Medium Chunks (e.g., 150 tokens)

Let’s try chunking into approximately 150 tokens:

  • Chunk 1: "## The Photosynthesis Process in Artemisia annua Photosynthesis in Artemisia annua, commonly known as sweet wormwood, is a critical process for its medicinal compound biosynthesis, particularly artemisinin. The overall reaction can be summarized as: 6CO₂ + 6H₂O + Light Energy → C₆H₁₂O₆ + 6O₂ This process occurs primarily in the chloroplasts, where light-dependent reactions and the Calvin cycle (light-independent reactions) take place."
  • Chunk 2: "The light-dependent reactions, occurring in the thylakoid membranes, convert light energy into chemical energy in the form of ATP and NADPH. Key pigments like chlorophyll absorb photons, exciting electrons that flow through an electron transport chain. The Calvin cycle, which takes place in the stroma, uses the ATP and NADPH generated to fix atmospheric carbon dioxide into organic molecules, ultimately producing glucose."
  • Chunk 3: "This glucose then serves as a precursor for the biosynthesis of artemisinin and other terpenes. The efficiency of artemisinin production is directly linked to the plant’s photosynthetic capacity and the availability of essential nutrients and light."

With 150-token chunks, the question "What is the role of chlorophyll in photosynthesis?" is more likely to retrieve Chunk 2. That chunk still covers the light-dependent reactions and the Calvin cycle, but chlorophyll’s function now makes up a much larger share of its text, so the embedding is more specific and the retrieval more accurate.

Scenario 3: Small Chunks (e.g., 50 tokens)

Now, let’s go even smaller, around 50 tokens:

  • Chunk 1: "## The Photosynthesis Process in Artemisia annua Photosynthesis in Artemisia annua, commonly known as sweet wormwood, is a critical process for its medicinal compound biosynthesis, particularly artemisinin."
  • Chunk 2: "The overall reaction can be summarized as: 6CO₂ + 6H₂O + Light Energy → C₆H₁₂O₆ + 6O₂ This process occurs primarily in the chloroplasts, where light-dependent reactions and the Calvin cycle (light-independent reactions) take place."
  • Chunk 3: "The light-dependent reactions, occurring in the thylakoid membranes, convert light energy into chemical energy in the form of ATP and NADPH."
  • Chunk 4: "Key pigments like chlorophyll absorb photons, exciting electrons that flow through an electron transport chain."
  • Chunk 5: "The Calvin cycle, which takes place in the stroma, uses the ATP and NADPH generated to fix atmospheric carbon dioxide into organic molecules, ultimately producing glucose."
  • Chunk 6: "This glucose then serves as a precursor for the biosynthesis of artemisinin and other terpenes. The efficiency of artemisinin production is directly linked to the plant’s photosynthetic capacity and the availability of essential nutrients and light."

When we ask "What is the role of chlorophyll in photosynthesis?", the system is very likely to retrieve Chunk 4, which is a single sentence about chlorophyll. Its embedding is dominated by that one idea, so it should match the query closely and rank first.

Smaller chunks address the "information dilution" problem. When a chunk contains too much information, the core concepts get averaged out in the embedding, which makes it harder for the vector search to pinpoint the exact piece of text that answers a user’s query. Smaller chunks, when they are semantically coherent, provide a sharper signal.
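The dilution effect can be illustrated with a toy example. The sketch below does not use a real embedding model: it assumes a bag-of-words count vector as a crude stand-in, plus a small hand-picked stop-word list, and scores the query against the small chlorophyll chunk and against the medium chunk that contains it. The names `bow`, `cosine`, and `STOPWORDS` are made up for this illustration.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, just enough for this example.
STOPWORDS = {"a", "an", "and", "in", "into", "is", "of", "that", "the", "to", "what", "which"}

def bow(text: str) -> Counter:
    """Toy 'embedding': counts of lowercase words, stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Chunk 4 from the 50-token scenario.
small_chunk = (
    "Key pigments like chlorophyll absorb photons, exciting electrons "
    "that flow through an electron transport chain."
)
# Chunk 2 from the 150-token scenario: the same sentence plus its neighbors.
large_chunk = (
    "The light-dependent reactions, occurring in the thylakoid membranes, "
    "convert light energy into chemical energy in the form of ATP and NADPH. "
    + small_chunk
    + " The Calvin cycle, which takes place in the stroma, uses the ATP and "
    "NADPH generated to fix atmospheric carbon dioxide into organic "
    "molecules, ultimately producing glucose."
)
query = "What is the role of chlorophyll in photosynthesis?"

small_score = cosine(bow(query), bow(small_chunk))
large_score = cosine(bow(query), bow(large_chunk))
print(f"small chunk: {small_score:.3f}")
print(f"large chunk: {large_score:.3f}")
```

Both chunks share the same query term, but the larger chunk’s vector has more unrelated mass, so its similarity to the query is lower. Real embedding models behave less mechanically than word counts, so treat this as intuition rather than a measurement.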

Internally, this works because vector embeddings capture the semantic meaning of text. A large chunk’s embedding represents, roughly, the average meaning of all the sentences within it. If a query is very specific, like a question about a particular pigment, that average may not align with the query as closely as the embedding of a smaller chunk dedicated solely to that pigment. The key is that each chunk must still be semantically meaningful and contain a complete thought or fact. You don’t want to split a single sentence across two chunks.
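One way to keep sentences intact is to pack whole sentences into a chunk until a size budget is reached. The following is a minimal sketch under stated assumptions: whitespace-separated words stand in for tokens, and sentences are split with a naive regular expression that would mishandle abbreviations such as "e.g.". The function names `split_sentences` and `chunk_by_sentence` are hypothetical, not part of any library.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_by_sentence(text: str, max_tokens: int) -> List[str]:
    """Pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens becomes its own oversized
    chunk rather than being split mid-sentence.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())  # word count as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

paragraph = (
    "This process occurs primarily in the chloroplasts, where light-dependent "
    "reactions and the Calvin cycle (light-independent reactions) take place. "
    "The light-dependent reactions, occurring in the thylakoid membranes, "
    "convert light energy into chemical energy in the form of ATP and NADPH. "
    "Key pigments like chlorophyll absorb photons, exciting electrons that "
    "flow through an electron transport chain."
)

for i, chunk in enumerate(chunk_by_sentence(paragraph, max_tokens=25), start=1):
    print(f"Chunk {i}: {chunk}")
```

With a budget of 25 words, each of the three sentences lands in its own chunk, so the chlorophyll sentence is isolated without ever being cut in half. In a real pipeline you would likely count tokens with the embedding model’s own tokenizer instead of counting words.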

The levers you control are the chunk_size (in tokens or characters) and chunk_overlap parameters in your document loading and splitting libraries (e.g., LangChain, LlamaIndex). A common starting point for chunk_size is 500-1000 tokens, but for many knowledge retrieval tasks, values between 50 and 250 tokens can yield better results. chunk_overlap matters too: it repeats some text on both sides of each boundary so that context isn’t lost where chunks meet. A typical overlap is 10-20% of the chunk size.
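To make the interaction between the two parameters concrete, here is a minimal sliding-window sketch. It is not how LangChain or LlamaIndex implement their splitters; it only borrows the parameter names chunk_size and chunk_overlap from the text, treats whitespace-separated words as tokens, and uses a hypothetical function name, `chunk_with_overlap`.

```python
from typing import List

def chunk_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

words = (
    "The light-dependent reactions, occurring in the thylakoid membranes, "
    "convert light energy into chemical energy in the form of ATP and NADPH. "
    "Key pigments like chlorophyll absorb photons, exciting electrons that "
    "flow through an electron transport chain."
).split()

# 20-word chunks with a 4-word (20%) overlap.
demo_chunks = chunk_with_overlap(words, chunk_size=20, chunk_overlap=4)
for i, chunk in enumerate(demo_chunks, start=1):
    print(f"Chunk {i} ({len(chunk)} words): {' '.join(chunk)}")
```

Each window starts chunk_size minus chunk_overlap tokens after the previous one, so the last few tokens of one chunk reappear at the start of the next. Note that a purely positional window like this can still cut sentences in half, which is why library splitters typically also try to break on separators such as paragraph or sentence boundaries.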

One often-overlooked point is that the optimal chunk size depends heavily on the kind of queries you expect. If your queries are broad and conceptual, larger chunks might be fine. But if your users ask very specific, fact-based questions, smaller, more focused chunks are often essential. The embedding model itself also plays a role; some models are better at distinguishing fine-grained semantic differences than others.

The next problem you’ll likely encounter is managing the explosion of chunks for large document sets and optimizing retrieval beyond just chunk size.
