Pinecone Multimodal: Store Image and Text Embeddings (2026)

Pinecone lets you search across different data types, like images and text, using a single vector index, as long as the embeddings for every data type come from the same embedding space.

Let’s see it in action. Imagine you have a dataset of product listings, each with an image and a description.

[
  {
    "id": "product_1",
    "values": [0.1, 0.2, 0.3, ...], // Embedding for "Red running shoes"
    "metadata": {
      "text": "These are comfortable red running shoes.",
      "image_url": "http://example.com/images/red_shoes.jpg"
    }
  },
  {
    "id": "product_2",
    "values": [0.4, 0.5, 0.6, ...], // Embedding for "Blue t-shirt"
    "metadata": {
      "text": "A stylish blue t-shirt for everyday wear.",
      "image_url": "http://example.com/images/blue_tshirt.jpg"
    }
  }
]

To store these, you’d first generate embeddings for both the text and the images using a multimodal embedding model, one whose text encoder and image encoder were trained together, such as a CLIP-based model. A text-only model like OpenAI’s text-embedding-3-small is not a substitute for the text side: its vectors live in a different space from an image model’s, so distances between the two carry no meaning. The key is that both encoders map semantically similar items to nearby points in one shared vector space. This means the embedding for "red running shoes" will be close in the vector space to the embedding for an image of red running shoes.
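
As a minimal sketch of the embedding step, the helpers below assume the sentence-transformers package and its "clip-ViT-B-32" checkpoint, which accepts both strings and PIL images. The function names (load_clip_model, embed, to_record) are hypothetical and chosen for illustration; any model with a shared text and image space could be swapped in.

```python
from typing import Any, Dict, List, Sequence


def load_clip_model(name: str = "clip-ViT-B-32") -> Any:
    """Load a CLIP checkpoint via sentence-transformers (imported lazily)."""
    from sentence_transformers import SentenceTransformer  # third-party package

    return SentenceTransformer(name)


def embed(model: Any, items: Sequence[Any]) -> List[List[float]]:
    """Embed strings or PIL images with the same model; returns plain float lists."""
    return [[float(x) for x in vector] for vector in model.encode(list(items))]


def to_record(item_id: str, vector: List[float], text: str,
              image_url: str, modality: str) -> Dict[str, Any]:
    """Build one record in the id/values/metadata shape used in this article."""
    return {
        "id": item_id,
        "values": vector,
        "metadata": {"text": text, "image_url": image_url, "type": modality},
    }


# Usage (downloads the model on first run; "red_shoes.jpg" is a hypothetical local file):
# from PIL import Image
# model = load_clip_model()
# text_vec = embed(model, ["These are comfortable red running shoes."])[0]
# image_vec = embed(model, [Image.open("red_shoes.jpg")])[0]
```

Because one model produces both vectors, they have the same length and can be compared directly.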

When you upsert these vectors into Pinecone, you can store them with associated metadata. This metadata can include the original text, the image URL, or any other relevant information about the item.

from pinecone import Pinecone

# Initialize Pinecone connection
pc = Pinecone(api_key="YOUR_API_KEY")

# Assuming an index named "my-multimodal-index" already exists, created with your embedding model's dimension
index = pc.Index("my-multimodal-index")

# Example data with pre-computed embeddings: two vectors for the same product,
# one from its description and one from its photo. The "..." marks truncated
# values; a real vector has exactly as many floats as the index dimension.
data_to_upsert = [
    {
        "id": "product_1#text",
        "values": [0.1, 0.2, 0.3, ...], # Text embedding for "Red running shoes"
        "metadata": {
            "text": "These are comfortable red running shoes.",
            "image_url": "http://example.com/images/red_shoes.jpg",
            "type": "text" # Optional: helps distinguish data types if needed
        }
    },
    {
        "id": "product_1#image",
        "values": [0.4, 0.5, 0.6, ...], # Image embedding for red shoes
        "metadata": {
            "text": "These are comfortable red running shoes.",
            "image_url": "http://example.com/images/red_shoes.jpg",
            "type": "image" # Optional
        }
    }
]

# Upsert the data
index.upsert(vectors=data_to_upsert)

The magic happens during querying. You can query using a text embedding and find similar images, or query with an image embedding and find similar text descriptions.

# Example: Query with a text embedding for "blue athletic wear"
query_text_embedding = [0.7, 0.8, 0.9, ...] # Embedding for "blue athletic wear"

results = index.query(
    vector=query_text_embedding,
    top_k=5,
    include_metadata=True
)

# The results might include products with text descriptions like "blue t-shirt"
# and potentially images that visually represent blue athletic wear.

This allows for powerful cross-modal search applications. You can build a system where a user uploads an image and finds visually similar products, or where a text search returns not just text descriptions but also relevant images. The core problem this solves is bridging the semantic gap between different data modalities, enabling unified search across them.
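
When a search should return only one modality, for example images for a text query, one option is a metadata filter on the optional "type" field from the upsert example. The sketch below builds the keyword arguments for index.query; build_query is a hypothetical helper name, and it assumes every record carries a "type" value of "text" or "image".

```python
from typing import Any, Dict, List, Optional


def build_query(vector: List[float], modality: Optional[str] = None,
                top_k: int = 5) -> Dict[str, Any]:
    """Keyword arguments for index.query, optionally limited to one modality."""
    kwargs: Dict[str, Any] = {
        "vector": vector,
        "top_k": top_k,
        "include_metadata": True,
    }
    if modality is not None:
        # Pinecone metadata filter: keep only records whose "type" equals modality.
        kwargs["filter"] = {"type": {"$eq": modality}}
    return kwargs


# Usage with the index and query embedding from the snippets above:
# results = index.query(**build_query(query_text_embedding, modality="image"))
# for match in results.matches:
#     print(match.id, match.score, match.metadata["image_url"])
```

Leaving modality as None searches across both text and image vectors at once.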

The internal workings rely on a shared embedding space. Models are trained such that the vector representation of "a dog" (text) is close to the vector representation of an image of a dog. Pinecone then stores these vectors and performs approximate nearest neighbor searches. When you query with a vector, Pinecone efficiently finds the top_k vectors in its index that are closest to your query vector, according to the distance metric chosen when the index was created, such as cosine similarity or dot product.
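
To make the ranking concrete, here is a toy version of a top_k lookup using cosine similarity in plain Python. It is an exact brute-force scan over a three-vector dictionary, written only to illustrate the idea; it is not how Pinecone implements search, and the two-dimensional vectors and their ids are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: Dict[str, List[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors: the two "dog" entries point in nearly the same direction.
toy_index = {
    "dog_text": [0.9, 0.1],
    "dog_image": [0.8, 0.2],
    "car_image": [0.1, 0.9],
}

for vector_id, score in top_k([1.0, 0.0], toy_index, k=2):
    print(vector_id, round(score, 3))
```

The text and image vectors for the dog rank ahead of the car image because they point in almost the same direction as the query, regardless of which modality produced them.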

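The dimension of your vectors is critical. All embeddings stored in a single index must have the same dimension, because the dimension is fixed when the index is created and a distance can only be computed between vectors of equal length. A CLIP ViT-B/32 model, for example, outputs 512-dimensional vectors for both text and images, so the index would be created with dimension 512. OpenAI’s text-embedding-3-small, by contrast, outputs 1536 dimensions by default. Matching dimensions is necessary but not sufficient: vectors from two unrelated models can sit in the same index, yet the distances between them are meaningless unless the models share an embedding space or you train a projection that aligns them.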
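
A cheap safeguard is to check vector lengths before upserting. The helper below is a hypothetical sketch: check_dimensions is not part of the Pinecone client, the record ids are invented, and the value 512 is an assumption (the output size of a CLIP ViT-B/32 model), so substitute your own index dimension.

```python
from typing import Any, Dict, List


def check_dimensions(records: List[Dict[str, Any]], expected_dim: int) -> None:
    """Raise ValueError if any record's vector length differs from expected_dim."""
    bad_ids = [r["id"] for r in records if len(r["values"]) != expected_dim]
    if bad_ids:
        raise ValueError(
            f"expected dimension {expected_dim}, mismatched ids: {bad_ids}"
        )


# Placeholder vectors standing in for real text and image embeddings.
records = [
    {"id": "text_vector", "values": [0.1] * 512},
    {"id": "image_vector", "values": [0.2] * 512},
]

check_dimensions(records, expected_dim=512)  # passes silently when all lengths match
```

Note that this only catches length mismatches; it cannot tell whether two vectors of the same length came from compatible models.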

A common misconception is that you need separate indexes for text and image embeddings. You can store them together in a single index, as long as their dimensions match and they come from the same embedding space. This simplifies management and allows for cross-modal queries directly. The metadata field is where you store the original data or pointers to it, so you can retrieve rich context after a vector search.

The next step is optimizing your similarity search with metadata filtering.