A RAG system fails on non-text content when its pipeline, which is built around string manipulation, encounters binary data or structured formats it was never designed to handle.

Here’s why that happens and how to fix it:

Cause 1: The Document Loader Isn’t Designed for Non-Text

Diagnosis: You’re likely using a document loader that assumes plain text. If you’re loading a PDF, Word doc, or even an HTML file with embedded images, the loader might just be returning raw bytes or a malformed string.

Check: Try to print the raw output of your document loader before it hits the RAG pipeline.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("my_document.pdf")
pages = loader.load()
print(pages[0].page_content[:500]) # Print first 500 characters

If you see gibberish, escape sequences, or incomplete text, this is your problem.
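
Eyeballing the output works for one file. For a batch of documents, a rough automated check can flag suspicious pages for review. The sketch below is a heuristic and not part of LangChain: the name looks_garbled and the 0.85 threshold are assumptions you would tune on your own data.

```python
import unicodedata


def looks_garbled(text: str, min_clean_ratio: float = 0.85) -> bool:
    """Return True when text is empty or dominated by non-text characters.

    Hypothetical helper: the name and the threshold are illustrative.
    """
    if not text.strip():
        return True  # Nothing was extracted at all
    clean = 0
    for ch in text:
        if ch == "\ufffd":
            continue  # U+FFFD marks bytes that failed to decode
        # Count letters, numbers, punctuation, symbols, and separators as clean;
        # control and unassigned characters are not.
        if ch in "\n\r\t " or unicodedata.category(ch)[0] in "LNPSZ":
            clean += 1
    return clean / len(text) < min_clean_ratio
```

You could then collect the pages worth inspecting with something like `[p for p in pages if looks_garbled(p.page_content)]`.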

Fix: Use a loader designed for the file type. For PDFs that mix text and images, PyMuPDFLoader is often a better choice, since it can extract text while retaining some structural information. For complex documents with tables and images, you might need a more sophisticated approach.

from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("my_document.pdf")
pages = loader.load()
# Now process pages

Why it works: PyMuPDFLoader uses the MuPDF library, which is built to understand PDF structures, including text blocks, fonts, and their positions. It can differentiate between text and image data within the PDF.

Cause 2: Image Data is Being Treated as Text

Diagnosis: Even if your loader can identify an image, it might be dumping the raw binary data of the image directly into the page_content field, which the RAG model then tries to interpret as text.

Check: Look at the page_content of a page that you know contains an image.

# Assuming 'pages' is your loaded documents
for page in pages:
    if "binary data" in page.page_content.lower() or "<image" in page.page_content.lower():
        print(page.page_content[:500])

You might see representations of binary data or placeholders that indicate an image was present but not processed.

Fix: Implement a strategy to skip or extract image data separately. For simple cases, you can filter out pages or sections that are predominantly image data. For more advanced use, you’ll need an OCR tool or a multimodal model.

# Example of filtering out pages that seem to be only images
MIN_TEXT_RATIO = 0.2  # Threshold can be adjusted

processed_pages = []
for page in pages:
    content = page.page_content
    if not content:
        continue  # Empty page: nothing to index, and this avoids dividing by zero
    # Heuristic: if too few characters are alphanumeric, assume the page is mostly image data
    text_chars = sum(c.isalnum() for c in content)
    if text_chars / len(content) > MIN_TEXT_RATIO:
        processed_pages.append(page)
# Use processed_pages in your RAG pipeline

Why it works: This approach uses a simple heuristic to ignore content that is unlikely to be meaningful text for a standard RAG model. It prevents the model from being fed raw binary data.
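
Dropping a whole page also discards any real text on it. A gentler variant is to strip only the image payload and keep the rest. The sketch below assumes embedded images appear as base64 data URIs or as long unbroken base64 runs; that is an assumption about your loader’s output, so check a sample first. The function name, the placeholder string, and the 200-character cutoff are all illustrative.

```python
import re

# Assumption: embedded images show up as data URIs or long unbroken base64 runs.
DATA_URI = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+")
BASE64_RUN = re.compile(r"[A-Za-z0-9+/=]{200,}")


def strip_image_payloads(text: str, placeholder: str = "[image omitted]") -> str:
    """Replace inline image payloads with a short placeholder (hypothetical helper)."""
    text = DATA_URI.sub(placeholder, text)
    return BASE64_RUN.sub(placeholder, text)
```

The placeholder keeps a marker where the image was, which can help later if you decide to attach a caption or OCR output at that position.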

Cause 3: Table Data is Flattened Incorrectly

Diagnosis: Tables are structured data. When loaded as plain text, their rows and columns can become jumbled, making them nonsensical to a language model.

Check: Examine the page_content of a page with a table.

# Assuming 'pages' is your loaded documents
# "col1" and "col2" are placeholders: substitute header names from a table you know is in the document
for page in pages:
    if "col1" in page.page_content and "col2" in page.page_content:
        print(page.page_content[:500])

You’ll likely see data that should be in separate cells merged into single lines or paragraphs.

Fix: Use a document loader or an intermediate processing step that specifically parses tables. Libraries like unstructured or tabula-py are excellent for this. You can then convert tables into Markdown or CSV format before feeding them to the RAG system.

from unstructured.partition.auto import partition

# Load and partition the document
elements = partition(filename="my_document.pdf")

# Collect tables in a structured text form
table_texts = []
for element in elements:
    if element.category == "Table":  # unstructured capitalizes category names
        # text_as_html is only populated when table-structure inference ran;
        # fall back to the flattened text otherwise
        html = getattr(element.metadata, "text_as_html", None)
        table_texts.append(html or element.text)

# Join the tables and append the rest of the text content
body_text = "\n".join(el.text for el in elements if el.category != "Table")
full_text = "\n".join(table_texts) + "\n" + body_text
# Now use full_text in your RAG pipeline

Why it works: unstructured’s partition function splits a document into typed elements, including tables. Passing tables to the model in a structured text format such as Markdown or HTML preserves the row and column relationships that flattened text loses.
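
If your extraction step gives you table rows rather than ready-made markup, rendering them as Markdown takes only a few lines. The sketch below assumes each table arrives as a list of rows, with each row a list of cell strings and the first row acting as the header. How you obtain those rows (tabula-py, HTML parsing, or something else) is outside its scope, and rows_to_markdown is a hypothetical helper name.

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: list[str]) -> str:
        # Escape pipes so cell text cannot break the layout, then pad short rows.
        cells = [str(cell).replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator, *map(render, body)])
```

Padding short rows keeps every line the same width, so a ragged extraction still produces a table the model can read column by column.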

Cause 4: Insufficient Metadata or Context for Images/Tables

Diagnosis: The RAG model relies on the text content. If images or tables are present but their surrounding text doesn’t provide enough context, the model might struggle to understand their relevance even if they are extracted.

Check: Manually review pages with images and tables. Does the caption or the paragraph before/after clearly explain what the image/table represents?

Fix: Augment your document loading and chunking process to include relevant surrounding text with image or table data. If you extract tables as separate Markdown, prepend the text immediately preceding the table.

# Example: If you extract tables separately
# ... (previous code for extracting tables)
# Assuming 'text_content' is the regular text and 'elements' comes from the table-extraction step.
# split_text_into_chunks and find_tables_near_text are placeholders for functions you write yourself.
final_chunks = []
for i, text_chunk in enumerate(split_text_into_chunks(text_content)):
    # Find tables near this text chunk (requires mapping table locations to text)
    associated_tables = find_tables_near_text(i, text_chunk, elements)  # Custom function returning a list of table strings
    if associated_tables:
        final_chunks.append(text_chunk + "\n" + "\n".join(associated_tables))
    else:
        final_chunks.append(text_chunk)
# Now embed final_chunks

Why it works: By explicitly linking descriptive text with the non-textual elements, you provide the RAG model with the necessary context to interpret and utilize that information.
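
One simple way to do the linking is to walk the document in reading order and attach the nearest preceding text block to each table. The sketch below assumes you already have the content as an ordered list of blocks labeled "text" or "table". The Block class, its labels, and the function name are hypothetical, and using only the immediately preceding text is a simplification of real caption detection.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "text" or "table" (hypothetical labels)
    content: str


def chunks_with_table_context(blocks: list[Block]) -> list[str]:
    """Build chunks in which each table carries the text block just before it."""
    chunks = []
    previous_text = ""
    for block in blocks:
        if block.kind == "table":
            # Prepend the nearest preceding text so the table chunk is self-describing.
            context = previous_text + "\n\n" if previous_text else ""
            chunks.append(context + block.content)
        else:
            chunks.append(block.content)
            previous_text = block.content
    return chunks
```

The preceding text appears twice on purpose: once as its own chunk and once inside the table chunk, so either one can be retrieved independently.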

Cause 5: Using a Text-Only Embedding Model

Diagnosis: Your embedding model is trained only on text. It cannot embed raw images at all, and for content derived from images or tables it captures only what the extracted text says, so any visual information that never made it into the text is lost.

Check: Look at which embedding class your pipeline instantiates. Standard text embedding models such as OpenAIEmbeddings or HuggingFaceEmbeddings have no multimodal capabilities.

Fix: For truly multimodal RAG, you need a multimodal embedding model. Models like OpenAI’s CLIP or newer multimodal foundation models can embed both text and images into a shared vector space. Alternatively, use a two-step process: use a dedicated image analysis model (like an object detector or image captioner) to generate text descriptions of images, and then embed those descriptions.

from langchain_openai import OpenAIEmbeddings
# from langchain_vision import MultiModalEmbeddings  # Hypothetical import: substitute a real multimodal library

# Option 1: Multimodal Embeddings (if available and suitable)
# embeddings = MultiModalEmbeddings(...)

# Option 2: OCR/Image Captioning + Text Embeddings
# 1. Use an OCR tool (like Tesseract) or an image captioning API/model
# image_description = run_ocr_or_captioning("path/to/image.jpg")
# 2. Embed the description
# text_embeddings = OpenAIEmbeddings()
# embedded_description = text_embeddings.embed_query(image_description)

# For tables, ensure they are converted to text (e.g., Markdown) and then use text embeddings.

Why it works: Multimodal embeddings create a unified vector space where visual and textual concepts are represented coherently. This allows the RAG system to find relevant information across different modalities.
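
For the two-step alternative, the important detail is keeping a link from the generated description back to the image it came from. The sketch below shows one possible record shape. The describe argument stands in for whatever OCR or captioning call you use, and the function name and metadata keys are assumptions, not part of any library.

```python
from typing import Callable, Optional


def image_to_text_record(image_path: str, describe: Callable[[str], str]) -> Optional[dict]:
    """Turn an image into a text record that a text-only embedder can index.

    `describe` is a placeholder for your OCR or captioning function.
    Returns None when no description could be produced.
    """
    description = describe(image_path).strip()
    if not description:
        return None  # Nothing useful to embed; the caller can skip this image
    return {
        "text": description,
        "metadata": {"source_image": image_path, "modality": "image"},
    }
```

The "text" field is what you would pass to your text embedding model, while the metadata lets the final answer point back to the original image.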

Cause 6: The RAG Framework Itself Isn’t Built for Multimodality

Diagnosis: Many RAG frameworks are architected around text-only Document objects and text-based retrievers. They don’t have a built-in mechanism to handle or query non-textual data alongside text.

Check: Review the documentation or source code of your RAG framework. Does it mention support for images, tables, or multimodal inputs beyond simple text extraction?

Fix: You’ll need to adapt your framework or use a RAG library that explicitly supports multimodal data. This might involve:

  • Storing image/table data separately and linking it via metadata.
  • Creating custom retriever types that can query different data stores (e.g., a vector store for text embeddings and a separate database for image features).
  • Using a multimodal LLM as the final generation step, capable of processing combined text and image inputs.

# Example of a custom document structure for multimodal RAG
from typing import Optional

class MultiModalDocument:
    def __init__(self, text: str, image_path: Optional[str] = None, table_data: Optional[str] = None, metadata: Optional[dict] = None):
        self.text = text
        self.image_path = image_path
        self.table_data = table_data  # e.g., markdown string
        self.metadata = metadata if metadata is not None else {}  # Avoid a shared mutable default

# Your retriever would then need to handle these different fields.

Why it works: By designing your data structures and retrieval logic to accommodate different data types, you build a RAG system capable of indexing and retrieving information from diverse sources.
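
A custom retriever that spans several stores can start as a function that queries each one and merges the hits. The sketch below assumes every store is wrapped in a callable that returns (score, document id) pairs, and that scores are comparable across stores. In practice you may need to normalize them first. All names here are hypothetical.

```python
from typing import Callable

Hit = tuple[float, str]  # (score, document id); a higher score means more relevant


def merged_retrieve(
    query: str,
    searchers: list[Callable[[str], list[Hit]]],
    k: int = 5,
) -> list[Hit]:
    """Query every store and return the top-k hits overall."""
    best: dict[str, float] = {}
    for search in searchers:
        for score, doc_id in search(query):
            # Keep the best score when two stores return the same document.
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    ranked = sorted(((score, doc_id) for doc_id, score in best.items()), reverse=True)
    return ranked[:k]
```

Each returned document id can then be resolved to its text, image path, or table data before the results are handed to the generation step.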

The next error you’ll hit is likely related to the LLM’s context window limitations or its inability to "see" the images if you’re not using a truly multimodal LLM.
