Deleting by metadata in Pinecone, done by passing a filter to index.delete, is your scalpel for surgically removing vectors from an index, not just by their ID, but by the context they carry.
Let’s see it in action. Imagine you have a collection of product embeddings, each tagged with its product_id and category. You want to remove all embeddings related to a specific product that’s been discontinued.
```python
from pinecone import Pinecone

# Initialize the client (replace with your actual API key).
# Recent versions of the Python client take no environment argument here;
# the legacy pinecone.init(api_key=..., environment=...) call did.
pc = Pinecone(api_key="YOUR_API_KEY")

# Connect to your index
index = pc.Index("your-index-name")

# Define the metadata to filter by
metadata_to_delete = {"product_id": "prod-12345", "category": "electronics"}

# Delete vectors matching the metadata
response = index.delete(filter=metadata_to_delete)
print(f"Delete response: {response}")
```
This isn’t just a bulk delete; it’s a targeted operation. The filter parameter in the delete call is where the magic happens. Pinecone interprets this dictionary as a query against the metadata associated with each vector. Any vector whose metadata matches every key-value pair in the filter dictionary will be removed; extra metadata fields on a vector that the filter does not mention do not prevent a match.
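Because a filtered delete is destructive, it can help to route it through a small guard. The sketch below defines a hypothetical helper named delete_by_metadata (it is not part of the Pinecone client); it only assumes that the index object exposes delete(filter=..., namespace=...), as in the call above.

```python
from typing import Any, Mapping, Optional


def delete_by_metadata(index: Any, metadata_filter: Mapping[str, Any],
                       namespace: Optional[str] = None) -> Any:
    """Hypothetical wrapper around index.delete(filter=...).

    Assumes `index` exposes delete(filter=..., namespace=...).
    """
    if not metadata_filter:
        # Refuse an empty filter so an upstream bug cannot turn a
        # targeted delete into an unintended one.
        raise ValueError("metadata_filter must contain at least one condition")

    kwargs = {"filter": dict(metadata_filter)}
    if namespace is not None:
        # Only pass a namespace when the caller supplied one.
        kwargs["namespace"] = namespace
    return index.delete(**kwargs)
```

Usage would look like delete_by_metadata(index, {"product_id": "prod-12345"}); the empty-filter check is a design choice of this sketch, not behavior guaranteed by Pinecone.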
The core problem that deleting by metadata solves is managing evolving datasets without full re-indexing. Think about scenarios where:
- Data Drift: You’ve updated or corrected certain pieces of information associated with vectors. Instead of rebuilding the index, you can upsert the corrected versions and then target the old, incorrect ones by a tag such as a version number.
- User/Content Removal: A user requests their data be deleted, or a piece of content is removed from your platform. You can use user IDs or content identifiers as metadata to wipe associated embeddings.
- Data Pruning: As your index grows, you might want to remove older, less relevant data. If you’ve tagged vectors with timestamps or version numbers, you can target them for deletion.
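The scenarios above map onto small filter dictionaries. The following sketch builds one per scenario; the field names user_id, doc_id, version, and created_at are assumptions for illustration, and the pruning filter assumes timestamps were stored as numeric Unix epoch seconds so that range operators apply.

```python
from datetime import datetime, timedelta, timezone


def user_removal_filter(user_id: str) -> dict:
    """Match every vector tagged with the given (hypothetical) user_id."""
    return {"user_id": {"$eq": user_id}}


def stale_version_filter(doc_id: str, current_version: int) -> dict:
    """Match older versions of one document, keeping the current one."""
    return {
        "$and": [
            {"doc_id": {"$eq": doc_id}},
            {"version": {"$lt": current_version}},
        ]
    }


def older_than_filter(days: int, now: datetime) -> dict:
    """Match vectors whose created_at (epoch seconds) is before the cutoff."""
    cutoff = now - timedelta(days=days)
    return {"created_at": {"$lt": int(cutoff.timestamp())}}


if __name__ == "__main__":
    print(user_removal_filter("user-42"))
    print(stale_version_filter("doc-7", current_version=3))
    print(older_than_filter(90, now=datetime.now(timezone.utc)))
```

Each returned dictionary could then be passed as the filter argument of index.delete.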
Conceptually, when you call delete(filter=...), Pinecone does not need to inspect every vector. You can picture a separate metadata index that maps metadata keys and values to vector IDs: the filter is resolved against that structure to find the matching IDs, and only then is the vector data for those IDs deleted. The efficiency here is crucial; without it, deleting by metadata would require scanning every vector, making it prohibitively slow for large indexes.
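To make that idea concrete, here is a toy in-memory model of an inverted metadata index. It is only an illustration of the lookup-then-delete pattern, not Pinecone's actual implementation; the class ToyMetadataIndex and its methods are invented for this sketch, and it handles equality filters on hashable values only.

```python
from collections import defaultdict


class ToyMetadataIndex:
    """Toy model: maps (key, value) pairs to the set of vector IDs carrying them."""

    def __init__(self):
        self._postings = defaultdict(set)  # (key, value) -> {vector_id, ...}
        self._vectors = {}                 # vector_id -> (values, metadata)

    def upsert(self, vector_id, values, metadata):
        if vector_id in self._vectors:
            self._remove(vector_id)        # drop stale postings first
        self._vectors[vector_id] = (values, dict(metadata))
        for key, value in metadata.items():
            self._postings[(key, value)].add(vector_id)

    def _remove(self, vector_id):
        _, metadata = self._vectors.pop(vector_id)
        for key, value in metadata.items():
            self._postings[(key, value)].discard(vector_id)

    def find(self, equality_filter):
        """Intersect posting sets; no per-vector scan is needed."""
        id_sets = [self._postings.get(item, set()) for item in equality_filter.items()]
        return set.intersection(*id_sets) if id_sets else set()

    def delete(self, equality_filter):
        matched = self.find(equality_filter)  # a new set, safe to iterate
        for vector_id in matched:
            self._remove(vector_id)
        return len(matched)
```

The cost of find depends on the sizes of the posting sets involved rather than on the total number of vectors, which is the property the paragraph above relies on.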
The filter parameter supports more complex queries than simple equality. You can use operators like $eq (equals), $ne (not equals), $gt (greater than), $gte (greater than or equal to), $lt (less than), $lte (less than or equal to), $in (is in a list), and $nin (is not in a list). You can also combine conditions using $and and $or.
For example, to delete all vectors for a specific product ID except those tagged with a "premium" tier:
```python
# Both conditions must hold: same product, and tier is not "premium"
metadata_filter = {
    "$and": [
        {"product_id": {"$eq": "prod-12345"}},
        {"tier": {"$ne": "premium"}},
    ]
}
response = index.delete(filter=metadata_filter)
```
This demonstrates how you can construct sophisticated deletion rules. The system doesn’t just look for exact matches; it evaluates logical expressions against your metadata.
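Before running a destructive filter, it can be useful to dry-run it against a few sample metadata records locally. The evaluator below is a minimal sketch of the operator semantics listed above for scalar metadata values; the function name matches is hypothetical, and treating a missing field as a non-match (including for $ne and $nin) is an assumption of this sketch that may differ from Pinecone's behavior.

```python
import operator

# Comparison operators from the filter language, applied as op(value, operand).
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if `metadata` satisfies `flt` (top-level keys are ANDed)."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif key not in metadata:
            # Assumption of this sketch: a missing field never matches.
            return False
        elif isinstance(condition, dict):
            value = metadata[key]
            if not all(_COMPARATORS[op](value, operand)
                       for op, operand in condition.items()):
                return False
        elif metadata[key] != condition:
            # A bare value is shorthand for equality.
            return False
    return True
```

For instance, a list of sample records can be filtered with [m for m in samples if matches(m, metadata_filter)] to preview which ones the delete would target.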
A common point of confusion is how $and and $or operate. When you provide multiple key-value pairs at the top level of the filter dictionary, it implicitly acts as an $and. For instance, {"product_id": "prod-12345", "category": "electronics"} is equivalent to {"$and": [{"product_id": "prod-12345"}, {"category": "electronics"}]}. Explicitly using $and or $or becomes necessary when you need to combine conditions with different logical operators or when you have nested conditions.
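The equivalence can be written down mechanically. The helper below, to_explicit_and, is a hypothetical name used only for illustration; it rewrites top-level key-value pairs into an explicit $and list, and the second filter shows a case where explicit operators are needed because an $or is nested inside an $and (the tier value "clearance" is made up for the example).

```python
def to_explicit_and(flt: dict) -> dict:
    """Rewrite implicit top-level AND as an explicit $and list."""
    if len(flt) <= 1:
        return dict(flt)  # nothing to combine
    return {"$and": [{key: condition} for key, condition in flt.items()]}


implicit = {"product_id": "prod-12345", "category": "electronics"}
explicit = to_explicit_and(implicit)

# Mixed logic cannot be expressed with top-level pairs alone:
# product matches AND (category is electronics OR tier is clearance).
mixed = {
    "$and": [
        {"product_id": "prod-12345"},
        {"$or": [{"category": "electronics"}, {"tier": "clearance"}]},
    ]
}

if __name__ == "__main__":
    print(explicit)
    print(mixed)
```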
The next step after selective deletion is often managing the metadata itself, perhaps by updating it for existing vectors rather than deleting and re-inserting.