Pinecone’s vector indexing can feel like magic, but under the hood it is an approximate nearest-neighbor search, which keeps queries fast even across millions of text embeddings.
Let’s say we have a bunch of documents, and we want to find the ones most semantically similar to a given query. We’ll use OpenAI’s text-embedding-ada-002 model to turn our text into numerical vectors, and then we’ll store these vectors in Pinecone for efficient searching.
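First, grab your OpenAI API key and your Pinecone API key and environment. The code reads both keys from the OPENAI_API_KEY and PINECONE_API_KEY environment variables. Note that it uses the module-level client APIs (pinecone.init and openai.Embedding.create) from pinecone-client 2.x and openai releases before 1.0; later releases of both libraries replaced these with client objects.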
import openai
import pinecone
import os
# Set your API keys and environment
openai.api_key = os.environ.get("OPENAI_API_KEY")
pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"), environment="YOUR_PINECONE_ENVIRONMENT")
Now, let’s create an index in Pinecone. The dimension must match the output dimension of your embedding model. text-embedding-ada-002 outputs 1536-dimensional vectors.
index_name = "semantic-search-example"
dimension = 1536
metric = "cosine" # Cosine similarity is common for text embeddings
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=dimension,
        metric=metric,
        pod_type="p1" # Or "s1", "p2" depending on your needs and budget
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")
# Connect to the index
index = pinecone.Index(index_name)
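A mismatch between the embedding length and the index dimension is an easy mistake to make, so it can help to check vectors before sending them. The helper below is only a minimal sketch: check_dimension is a hypothetical name, not part of the Pinecone or OpenAI clients, and the vector is a stand-in rather than a real embedding.

```python
def check_dimension(vector, expected_dimension=1536):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected_dimension:
        raise ValueError(
            f"Expected {expected_dimension} values, got {len(vector)}"
        )
    return vector


# Stand-in vector; in this tutorial the real one would come from the embedding model
fake_embedding = [0.0] * 1536
checked = check_dimension(fake_embedding)
```

Failing early on the client side gives a clearer error than waiting for the upsert or query call to be rejected.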
Next, we need a function to get embeddings from OpenAI.
def get_embedding(text):
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']
Let’s imagine we have some documents. For this example, we’ll use a small list of strings. In a real-world scenario, you’d load these from files, a database, or a website.
documents = [
"The quick brown fox jumps over the lazy dog.",
"Artificial intelligence is transforming industries.",
"Natural language processing enables computers to understand human language.",
"The lazy dog slept soundly in the sun.",
"Machine learning algorithms can learn from data.",
"Pinecone is a vector database for AI applications.",
"OpenAI provides powerful language models."
]
Now, we’ll generate embeddings for each document and upsert them into our Pinecone index. Each vector needs a unique id, the values (the embedding itself), and optionally metadata.
batch_size = 100 # Pinecone recommends batching for efficiency
for i in range(0, len(documents), batch_size):
    batch_docs = documents[i:i+batch_size]
    batch_ids = [f"doc_{j}" for j in range(i, i + len(batch_docs))]
    batch_embeddings = [get_embedding(doc) for doc in batch_docs]
    upsert_data = zip(batch_ids, batch_embeddings, [{} for _ in batch_docs]) # Empty metadata for now
    index.upsert(vectors=list(upsert_data))
    print(f"Upserted batch {i//batch_size + 1}")
print(f"Total vectors in index: {index.describe_index_stats().total_vector_count}")
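If the batching logic is needed in more than one place, it can be factored into a small generator. This is a sketch under assumptions: chunked is a hypothetical helper name, and the tuples are stand-ins with made-up values rather than real embeddings.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in (id, values, metadata) tuples; real values would be embeddings
vectors = [(f"doc_{n}", [0.1, 0.2, 0.3], {}) for n in range(7)]

batch_sizes = [len(batch) for batch in chunked(vectors, 3)]
print(batch_sizes)  # [3, 3, 1]
```

In the upsert loop, each batch produced this way would be passed to index.upsert in place of the manually sliced list.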
Now for the fun part: searching. We’ll take a query, get its embedding, and ask Pinecone for the most similar vectors.
query_text = "What can computers do with language?"
query_embedding = get_embedding(query_text)
# Search Pinecone
# top_k is the number of nearest neighbors to return
results = index.query(
    vector=query_embedding,
    top_k=3,
    include_values=False, # We don't need the vector values back
    include_metadata=False # We didn't store any metadata yet
)
print(f"\nQuery: '{query_text}'")
print("Top 3 similar documents:")
for match in results['matches']:
    print(f"- ID: {match['id']}, Score: {match['score']:.4f}")
    # In a real app, you'd fetch the original document content using the ID
    # For this example, we'll just show the ID and score
The score is the cosine similarity between the query vector and each match. Cosine similarity ranges from -1 to 1, and a score closer to 1.0 means the two vectors point in nearly the same direction, which indicates semantic closeness.
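To make the score concrete, here is a small pure-Python sketch of cosine similarity on toy two-dimensional vectors. The function name and the vectors are illustrative only; Pinecone computes the score on the server, and this is not its implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # close to -1
```

Because only the angle matters, scaling a vector (as in the first pair) does not change the score.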
If you wanted to add metadata, like the original document text, you’d include it during the upsert step:
# Example of upserting with metadata
# metadata_batch = [{"text": doc} for doc in batch_docs]
# upsert_data_with_meta = zip(batch_ids, batch_embeddings, metadata_batch)
# index.upsert(vectors=list(upsert_data_with_meta))
# Then, to retrieve metadata:
# results_with_meta = index.query(vector=query_embedding, top_k=3, include_metadata=True)
# for match in results_with_meta['matches']:
#     print(f"- ID: {match['id']}, Score: {match['score']:.4f}, Text: {match['metadata']['text']}")
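If you would rather not store the text in metadata, another option is to keep your own mapping from IDs back to documents. The sketch below assumes the doc_0, doc_1, ... ID scheme from the upsert loop; id_to_text is a hypothetical name, and fake_results is a hand-written stand-in with made-up scores, not output from a real query.

```python
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Natural language processing enables computers to understand human language.",
]

# Same ID scheme as the upsert loop: doc_0, doc_1, ...
id_to_text = {f"doc_{n}": text for n, text in enumerate(documents)}

# Hand-written stand-in shaped like the 'matches' list; scores are made up
fake_results = {
    "matches": [
        {"id": "doc_1", "score": 0.9},
        {"id": "doc_0", "score": 0.7},
    ]
}

retrieved = [id_to_text[match["id"]] for match in fake_results["matches"]]
for text in retrieved:
    print(text)
```

In a real application the mapping would typically live in a database rather than in memory.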
The key to making this work efficiently at scale is Approximate Nearest Neighbor (ANN) search. Instead of comparing your query vector to every single vector in the index, which takes time proportional to the size of the index, an ANN index uses data structures such as Hierarchical Navigable Small World (HNSW) graphs to quickly find vectors that are likely to be close. Pinecone manages its index internals for you, and they may differ from textbook HNSW, but the trade-off is the same: queries stay fast even with millions or billions of vectors, at the cost of occasionally missing the absolute closest match.
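For contrast, the exhaustive comparison that ANN avoids can be written in a few lines. This is a sketch on toy data: exact_top_k and the four two-dimensional vectors are invented for illustration, and the function scores every stored vector, so its cost grows linearly with the collection.

```python
import heapq
import math


def exact_top_k(query, vectors, k):
    """Return the k (id, score) pairs with the highest cosine similarity.

    Compares the query against every stored vector, so the cost grows
    linearly with the size of `vectors`.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = ((vec_id, cosine(query, vec)) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top_two = exact_top_k([1.0, 0.1], toy_vectors, k=2)
top_ids = [vec_id for vec_id, _ in top_two]
print(top_ids)  # ['a', 'b']
```

This always returns the true nearest neighbors, which makes it a useful baseline for measuring how often an approximate index finds the same results.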
When you query a graph-based ANN index, the search traverses the graph, moving toward the nodes that look most promising given your query vector’s position. It doesn’t guarantee finding the absolute closest neighbor, but it returns a highly accurate approximation quickly. The top_k parameter sets how many of these approximate nearest neighbors are returned.
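The traversal idea can be illustrated with a greedy walk over a tiny hand-built neighbor graph. This is a deliberately simplified, single-layer sketch and not Pinecone’s implementation: it uses Euclidean distance instead of cosine similarity, returns one result rather than top_k, and all names and points are made up. Real HNSW indexes add multiple layers with long-range links so the walk needs far fewer hops.

```python
import math


def greedy_search(query, points, neighbors, entry):
    """Walk the graph from `entry`, always moving to the neighbor closest
    to `query`, and stop when no neighbor improves on the current node."""
    current = entry
    visited = [current]
    while True:
        best = min(neighbors[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current, visited
        current = best
        visited.append(current)


# Hand-built toy graph: five points on a line, each linked to its neighbors
points = {
    "p0": (0.0, 0.0),
    "p1": (1.0, 0.0),
    "p2": (2.0, 0.0),
    "p3": (3.0, 0.0),
    "p4": (4.0, 0.0),
}
neighbors = {
    "p0": ["p1"],
    "p1": ["p0", "p2"],
    "p2": ["p1", "p3"],
    "p3": ["p2", "p4"],
    "p4": ["p3"],
}

found, path = greedy_search((3.2, 0.5), points, neighbors, entry="p0")
print(found, path)  # p3 ['p0', 'p1', 'p2', 'p3']
```

The walk stops at the first node that none of its neighbors can beat, which is why a graph search of this kind can settle on a good match that is not the best one.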
The next step is often to integrate this into a retrieval-augmented generation (RAG) system, where the retrieved documents are used as context for a large language model to generate a more informed answer.
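As a rough sketch of that step, retrieved passages can be stitched into a prompt before calling a language model. The function name and the prompt wording here are assumptions chosen for illustration, not a prescribed format, and the model call itself is left out.

```python
def build_rag_prompt(question, retrieved_texts):
    """Combine retrieved passages and a question into one prompt string."""
    context = "\n".join(
        f"[{n}] {text}" for n, text in enumerate(retrieved_texts, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_rag_prompt(
    "What can computers do with language?",
    ["Natural language processing enables computers to understand human language."],
)
print(prompt)
```

Numbering the passages makes it easier to ask the model to cite which passage supports its answer.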