ColPali RAG: Multimodal Document Retrieval with Visuals (2026)

ColPali RAG doesn’t just retrieve text; it also uses the images inside a document to decide what that text means and how relevant it is.

Let’s see it in action. Imagine you have a PDF manual for a complex piece of machinery, plus a plain-text maintenance log. A retrieval request over both might look like this:

{
  "query": "how to adjust the torque on the primary actuator",
  "documents": [
    {
      "id": "manual_v3.pdf",
      "content": "The primary actuator (see Figure 2.1) requires precise torque settings. For adjustments, refer to Section 4.3.",
      "images": [
        {
          "id": "figure_2_1.png",
          "caption": "Figure 2.1: Primary Actuator Assembly",
          "description": "Diagram showing the primary actuator with labels for the adjustment screw and torque wrench interface."
        }
      ]
    },
    {
      "id": "maintenance_log_2023.txt",
      "content": "Log entry 2023-10-26: Routine check of primary actuator. Torque setting nominal. No adjustments made.",
      "images": []
    }
  ]
}

When you query "how to adjust the torque on the primary actuator," ColPali RAG doesn’t just match the text that mentions "primary actuator" and "adjustments." It also recognizes, from the caption and description of Figure 2.1, that the figure shows what the primary actuator is and where its adjustment points are. That visual context, linked to the text that cites the figure, lets the system rank the manual above the maintenance log: the manual both shows the actuator and points to Section 4.3 for the adjustment procedure, while the log only records that no adjustment was made.

The core problem ColPali RAG solves is the "semantic gap" between visual information and textual understanding in document retrieval. Traditional RAG systems either treat text and images as separate entities or rely on bare image captions. ColPali RAG fuses the modalities by embedding both the text and the image descriptions into a single vector space, and every query is searched against that combined space. As a result, a query phrased in visual terms lands near text that describes or references the matching visuals, and a text-heavy query can still surface the relevant figures.

The request has two top-level fields:

- query: Your natural language question. This is the entry point.
- documents: An array of documents. Each document has:
  - id: A unique identifier (e.g., the filename).
  - content: The text extracted from the document.
  - images: An array of the images found in that document (empty if there are none). Each image has:
    - id: A unique image identifier.
    - caption: The text directly attached to the image (e.g., "Figure 2.1: Primary Actuator Assembly").
    - description: A detailed textual description of the image’s content, typically generated by a separate vision-language model (VLM) during indexing. Retrieval quality depends heavily on this field, because it is where the image’s visual content is spelled out in words.
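
The field layout above maps naturally onto a few typed records. The following is a minimal sketch, assuming the JSON shape shown in the example request; the class names (`Image`, `Document`, `RetrievalRequest`) and the `parse_request` helper are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Image:
    id: str
    caption: str = ""
    description: str = ""


@dataclass
class Document:
    id: str
    content: str
    images: list[Image] = field(default_factory=list)


@dataclass
class RetrievalRequest:
    query: str
    documents: list[Document]


def parse_request(raw: str) -> RetrievalRequest:
    """Parse a JSON payload shaped like the example request into typed objects."""
    data = json.loads(raw)
    documents = [
        Document(
            id=doc["id"],
            content=doc["content"],
            # A document without images simply gets an empty list.
            images=[Image(**img) for img in doc.get("images", [])],
        )
        for doc in data["documents"]
    ]
    return RetrievalRequest(query=data["query"], documents=documents)


# The example request from earlier in the article.
payload = """
{
  "query": "how to adjust the torque on the primary actuator",
  "documents": [
    {
      "id": "manual_v3.pdf",
      "content": "The primary actuator (see Figure 2.1) requires precise torque settings. For adjustments, refer to Section 4.3.",
      "images": [
        {
          "id": "figure_2_1.png",
          "caption": "Figure 2.1: Primary Actuator Assembly",
          "description": "Diagram showing the primary actuator with labels for the adjustment screw and torque wrench interface."
        }
      ]
    },
    {
      "id": "maintenance_log_2023.txt",
      "content": "Log entry 2023-10-26: Routine check of primary actuator. Torque setting nominal. No adjustments made.",
      "images": []
    }
  ]
}
"""

request = parse_request(payload)
print(request.documents[0].images[0].caption)  # Figure 2.1: Primary Actuator Assembly
```

Parsing into records like these up front means a missing `id` or `content` fails loudly at load time rather than somewhere inside the indexing step.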

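The heavy lifting happens in the indexing phase. In the pipeline described here, a VLM analyzes each image and generates a textual description, which is then embedded. That embedding is combined with the embeddings of the surrounding document text and the image’s caption, producing a dense vector that represents the multimodal meaning of that section of the document. (The original ColPali model skips the description step: it embeds page images directly as patch-level vectors and scores them against the query tokens with late interaction.) At query time, the query is embedded the same way and the system returns the closest vectors in the index.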
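
The indexing and search steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ColPali’s actual implementation: `embed` is a toy word-count stand-in for a real embedding model (so the vectors are sparse rather than dense), and the `fuse` helper, its `text_weight` parameter, and the weighted-average fusion are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (token -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def embed(text):
    """Toy stand-in for an embedding model: unit-length word counts."""
    return normalize(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def fuse(content, images, text_weight=0.5):
    """Assumed fusion rule: weighted average of the text vector and the image vectors."""
    text_vec = embed(content)
    if not images:
        return text_vec
    fused = defaultdict(float)
    for token, w in text_vec.items():
        fused[token] += text_weight * w
    image_weight = (1.0 - text_weight) / len(images)
    for image in images:
        image_vec = embed(image["caption"] + " " + image["description"])
        for token, w in image_vec.items():
            fused[token] += image_weight * w
    return normalize(fused)


def search(query, index):
    """Return (score, doc_id) pairs, best match first."""
    q = embed(query)
    return sorted(((dot(q, vec), doc_id) for doc_id, vec in index.items()), reverse=True)


documents = [
    {
        "id": "manual_v3.pdf",
        "content": "The primary actuator (see Figure 2.1) requires precise torque "
                   "settings. For adjustments, refer to Section 4.3.",
        "images": [{
            "id": "figure_2_1.png",
            "caption": "Figure 2.1: Primary Actuator Assembly",
            "description": "Diagram showing the primary actuator with labels for the "
                           "adjustment screw and torque wrench interface.",
        }],
    },
    {
        "id": "maintenance_log_2023.txt",
        "content": "Log entry 2023-10-26: Routine check of primary actuator. "
                   "Torque setting nominal. No adjustments made.",
        "images": [],
    },
]

# Indexing: one fused vector per document.
index = {doc["id"]: fuse(doc["content"], doc["images"]) for doc in documents}

# Search: embed the query and rank documents by similarity.
results = search("how to adjust the torque on the primary actuator", index)
for score, doc_id in results:
    print(f"{score:.3f}  {doc_id}")
```

With this toy setup the manual ranks above the maintenance log, partly because the figure’s caption and description repeat the query’s key terms. A real system would swap `embed` for a neural embedding model and would likely index sections or pages rather than whole documents.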

Perhaps the most surprising consequence is that a query made up only of visual descriptors, such as "show me the diagram of the cooling fins," can retrieve a document whose main text never mentions "diagram" or "cooling fins," as long as those terms appear in an image’s description. In effect, the system is searching what the document shows, not just what its body text says.
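
To make that concrete, here is a small sketch that compares what a text-only index and a description-aware index can match for the same visual query. It uses plain keyword overlap rather than embeddings, purely to keep the effect visible; the sample document, the `STOPWORDS` set, and the helper names are all hypothetical.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one (or none,
# if it relies on learned embeddings instead of keyword matching).
STOPWORDS = {"a", "me", "of", "on", "show", "the", "to"}


def keywords(text):
    """Lowercased word set with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def searchable_text(doc, include_images):
    """Body text alone, or body text plus every image caption and description."""
    parts = [doc["content"]]
    if include_images:
        for image in doc["images"]:
            parts += [image["caption"], image["description"]]
    return " ".join(parts)


def matched_keywords(query, doc, include_images):
    """Query keywords that also occur in the document's searchable text."""
    return keywords(query) & keywords(searchable_text(doc, include_images))


# Hypothetical document: the body text never says "diagram" or "cooling fins".
doc = {
    "id": "cooling_unit.pdf",
    "content": "Refer to Figure 5 before servicing the heat exchanger.",
    "images": [{
        "id": "figure_5.png",
        "caption": "Figure 5",
        "description": "Diagram of the cooling fins on the heat exchanger housing.",
    }],
}

query = "show me the diagram of the cooling fins"
text_only = matched_keywords(query, doc, include_images=False)
multimodal = matched_keywords(query, doc, include_images=True)

print("text only: ", sorted(text_only))   # []
print("with images:", sorted(multimodal))  # ['cooling', 'diagram', 'fins']
```

The text-only view finds nothing to match, while the view that includes the image description matches every content word in the query.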

The next hurdle is handling complex, multi-page diagrams where relationships between different visual elements are critical for understanding.