PDFs are often treated as opaque blobs, but the real magic is how a RAG system can coax structured data out of them, even when the layout is a mess.

Let’s see what happens when we feed a complex PDF into a RAG pipeline. Imagine a research paper with multi-column text, embedded figures, and tables spanning multiple pages.

from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Assume 'complex_document.pdf' is in a 'data' directory
# and contains text, tables, and images.
reader = SimpleDirectoryReader("./data", required_exts=[".pdf"])
documents = reader.load_data()

# Configure the pipeline. Transformations run in the order they are listed.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        # ImageExtractor is a placeholder name here, not a class shipped in
        # llama_index.core.extractors (to our knowledge). A custom transformation
        # that captions images with a vision model such as gpt-4-vision-preview
        # would slot into this list.
        # Table parsing likewise needs a dedicated, table-aware reader or parser;
        # the default PDF reader returns plain page text.
    ]
)

nodes = pipeline.run(documents=documents)

# Inspect the nodes to see how they've been parsed
for node in nodes:
    print(f"Node ID: {node.id_}")
    print(f"Text: {node.get_content()[:200]}...")  # Print first 200 chars
    # 'image_captions' and 'table_data' are illustrative metadata keys: they only
    # exist if a custom extraction step wrote them into node.metadata.
    if node.metadata.get("image_captions"):
        print(f"Image Captions: {node.metadata['image_captions']}")
    if node.metadata.get("table_data"):
        print(f"Table Data (first row): {node.metadata['table_data'][0]}")
    print("-" * 20)

When SimpleDirectoryReader in LlamaIndex encounters a PDF, it hands the file to a PDF reader that extracts page text (by default one built on pypdf). As far as I can tell, it does not detect tables or images on its own; those need a dedicated parser or a custom step. For images, you have to explicitly tell the pipeline to extract them, usually by leveraging an LLM with vision capabilities like GPT-4V. ImageExtractor in the listing should be read as a placeholder for such a step rather than a class that ships with llama_index.core: it would use OCR or a multimodal model to interpret image content and generate captions, or even extract structured data if the image depicts a table or chart. The SentenceSplitter breaks text into manageable pieces for the embedding model. For image descriptions and table data to go through that splitting as well, the extraction steps have to run before it.
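To make the splitting step concrete, here is a minimal sketch of sentence-aware chunking with overlap. It is a simplified stand-in, not LlamaIndex's implementation: the function name `split_into_chunks` is hypothetical, the size limit is counted in characters rather than tokens, and overlap is measured in whole sentences.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 200, overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks (hypothetical helper, not a LlamaIndex API).

    chunk_size is a soft limit in characters: the sentences carried over as
    overlap can push a chunk slightly past it.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward so neighbouring chunks share context.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("A one. B two. C three. D four.", chunk_size=14):
        print(repr(chunk))
```

Because sentences are never cut in half, each chunk stays readable on its own, and the shared sentence between neighbours gives the retriever some context across chunk boundaries.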

The core problem RAG ingestion solves is transforming unstructured or semi-structured documents into a format that a vector database can effectively index and query. For PDFs, this means not just extracting raw text but also understanding the spatial relationships between text blocks, identifying tabular data, and interpreting visual information. Traditional text extraction often fails to preserve table structure or image context, leading to a loss of crucial information. Advanced RAG ingestion pipelines tackle this by employing a multi-pronged approach:

  1. Layout Analysis: Libraries like PyMuPDF (available in LlamaIndex through the optional PyMuPDFReader; the default PDF reader is built on pypdf) expose text blocks together with their positions on the page. A layout-analysis step uses those positions to work out how the text flows. This is critical for understanding multi-column layouts or text that wraps around figures.
  2. Table Extraction: Dedicated table parsers, often rule-based or LLM-assisted, are used. These tools analyze the visual cues of tables (lines, cell alignment) and the text within them to reconstruct them as structured data (e.g., lists of dictionaries or pandas DataFrames). SimpleDirectoryReader has no table-extraction switch of its own as far as I know; a table-aware reader can be plugged in through its file_extractor argument, or tables can be pulled out separately with a library such as pdfplumber or Camelot.
  3. Image Understanding: For images, OCR (Optical Character Recognition) can extract text from within images. For more complex interpretation, multimodal LLMs are employed to generate descriptive captions or even answer questions about the image content. An image-captioning transformation (the ImageExtractor placeholder in the listing above) is key here.
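The first two steps can be sketched without any PDF library. The example below assumes a layout tool has already produced text blocks with page coordinates and a table parser has already produced rows of cells. `TextBlock`, `reading_order`, and `rows_to_records` are hypothetical names, and the fixed equal-width column assumption is a simplification: real layout analysis infers column boundaries from the page.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    x0: float  # left edge of the block
    y0: float  # top edge of the block (assumed to grow downward)
    text: str


def reading_order(blocks: list[TextBlock], page_width: float, columns: int = 2) -> list[TextBlock]:
    """Order blocks column by column, top to bottom within each column."""
    column_width = page_width / columns

    def sort_key(block: TextBlock) -> tuple[int, float]:
        # Clamp so a block touching the right margin still lands in the last column.
        column = min(int(block.x0 // column_width), columns - 1)
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


def rows_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Turn a parsed table (header row first) into a list of dictionaries."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    blocks = [
        TextBlock(310, 50, "right top"),
        TextBlock(40, 50, "left top"),
        TextBlock(40, 400, "left bottom"),
        TextBlock(310, 400, "right bottom"),
    ]
    print([b.text for b in reading_order(blocks, page_width=600)])
    print(rows_to_records([["model", "score"], ["A", "0.9"]]))
```

Sorting naively by vertical position alone would interleave the two columns, which is exactly the failure mode that plain text extraction shows on multi-column papers.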

When you run the IngestionPipeline, each Node that’s created can carry rich metadata. What ends up there depends on the reader and the transformations you configured. For text nodes, this typically includes the file name and a page label, and bounding box information if the reader records it. For nodes derived from tables, a table-extraction step can store the parsed table, for example as a list of lists. For image-derived nodes, you’ll find generated captions or OCR text if an image step produced them. This metadata is crucial for the retrieval phase, allowing the system to retrieve not just text that matches a query, but also contextually relevant tables or image descriptions.
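Here is a small sketch of what such metadata-carrying nodes might look like and how a retrieval step could filter on them. `SketchNode` is a hypothetical stand-in for a LlamaIndex node, and the `kind`, `page`, and `table_data` keys are illustrative choices, not keys the library sets for you.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchNode:
    """Hypothetical stand-in for an index node: embeddable text plus metadata."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def table_node(rows: list[list[str]], page: int) -> SketchNode:
    # Embed a readable rendering of the table; keep the raw rows for exact lookups.
    rendered = "\n".join(" | ".join(row) for row in rows)
    return SketchNode(text=rendered, metadata={"kind": "table", "page": page, "table_data": rows})


def image_node(caption: str, page: int) -> SketchNode:
    # The caption is what gets embedded; the image itself stays outside the index.
    return SketchNode(text=caption, metadata={"kind": "image", "page": page})


def by_kind(nodes: list[SketchNode], kind: str) -> list[SketchNode]:
    """Return only the nodes whose metadata marks them as the given kind."""
    return [node for node in nodes if node.metadata.get("kind") == kind]


if __name__ == "__main__":
    nodes = [
        SketchNode("Intro text", {"kind": "text", "page": 1}),
        table_node([["model", "score"], ["A", "0.9"]], page=3),
        image_node("Bar chart of scores", page=4),
    ]
    for node in by_kind(nodes, "table"):
        print(node.metadata["page"], node.metadata["table_data"][0])
```

Keeping the raw rows next to a text rendering means the embedding model sees something it can match against a query, while the answer-generation step can still read exact cell values.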

What most people don’t realize is how much the order of transformations in the IngestionPipeline matters, especially when dealing with complex layouts and mixed media. If you split sentences before extracting tables or images, you might end up with fragmented table rows or incomplete image descriptions within your text chunks. Conversely, running an image-extraction step too late might mean its output isn’t properly associated with the surrounding text nodes. The pipeline does not reorder anything for you: it applies the transformations one after another, exactly in the order you list them. It is therefore up to you to put specialized extractors (an image-captioning step, a table parser) first, so that their structured outputs are in place before generalist transformers like SentenceSplitter process the rest.
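The effect of ordering is easy to reproduce with a toy pipeline. The sketch below is not LlamaIndex code: `run_pipeline`, `extract_tables`, and `split_text` are hypothetical helpers, tables are assumed to be runs of lines starting with `|`, and the splitter simply cuts every two lines.

```python
def run_pipeline(chunks: list[dict], transformations: list) -> list[dict]:
    """Apply each transformation in the order given, like a sequential pipeline."""
    for transform in transformations:
        chunks = transform(chunks)
    return chunks


def extract_tables(chunks: list[dict]) -> list[dict]:
    """Pull consecutive '|' lines out of text chunks into their own 'table' chunks."""
    out = []
    for chunk in chunks:
        if chunk["kind"] != "text":
            out.append(chunk)
            continue
        buffer, buffer_kind = [], None
        for line in chunk["text"].splitlines():
            kind = "table" if line.lstrip().startswith("|") else "text"
            if buffer and kind != buffer_kind:
                out.append({"kind": buffer_kind, "text": "\n".join(buffer)})
                buffer = []
            buffer.append(line)
            buffer_kind = kind
        if buffer:
            out.append({"kind": buffer_kind, "text": "\n".join(buffer)})
    return out


def split_text(chunks: list[dict], max_lines: int = 2) -> list[dict]:
    """Naive splitter: cut text chunks every max_lines lines, leave other kinds intact."""
    out = []
    for chunk in chunks:
        if chunk["kind"] != "text":
            out.append(chunk)
            continue
        lines = chunk["text"].splitlines()
        for i in range(0, len(lines), max_lines):
            out.append({"kind": "text", "text": "\n".join(lines[i:i + max_lines])})
    return out


DOC = "Results are below.\n| model | score |\n| A | 0.9 |\n| B | 0.8 |\nB is the baseline."
start = [{"kind": "text", "text": DOC}]

extract_first = run_pipeline(start, [extract_tables, split_text])
split_first = run_pipeline(start, [split_text, extract_tables])

if __name__ == "__main__":
    print([c["kind"] for c in extract_first])
    print([c["kind"] for c in split_first])
```

With extraction first, the table survives as a single chunk with its header and both rows. With splitting first, the header ends up in one fragment and the data rows in another, so neither fragment is interpretable on its own.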

The next challenge is optimizing retrieval when these rich, multimodal nodes are present in your index.
