OpenAI’s embedding models are powerful, but their high dimensionality can turn simple similarity searches into computationally expensive operations.
Let’s see this in action. Imagine we have a small collection of documents and want to rank them by similarity to a query.
from openai import OpenAI
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
client = OpenAI(api_key="YOUR_API_KEY")

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast, agile fox leaps across a sluggish canine.",
    "The weather today is sunny with a chance of rain.",
    "Expect clear skies and warm temperatures tomorrow.",
    "Artificial intelligence is transforming industries.",
    "Machine learning algorithms are at the core of AI.",
]

# Get embeddings
def get_embedding(text):
    """Return the embedding vector for a single piece of text."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"  # Using a smaller model for demonstration
    )
    return response.data[0].embedding

document_embeddings = [get_embedding(doc) for doc in documents]
query_embedding = get_embedding("Show me documents about animals.")
# Calculate similarity with original embeddings
document_embeddings_np = np.array(document_embeddings)
query_embedding_np = np.array(query_embedding).reshape(1, -1)
similarities = cosine_similarity(query_embedding_np, document_embeddings_np)[0]
print("Similarities (original embeddings):")
for i, sim in enumerate(similarities):
    print(f"Doc {i+1}: {sim:.4f}")
The text-embedding-3-small model produces embeddings with 1536 dimensions. For a small set of documents, this is manageable. But scale that to millions of documents, and computing cosine similarity against every stored vector for every query becomes a bottleneck. Both the per-query cost of a brute-force search and the memory footprint grow linearly with the number of dimensions.
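To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes vectors are stored as 4-byte float32 values and searched by brute force; the function name `embedding_index_cost` and the one-million-document corpus are hypothetical, chosen only for illustration.

```python
def embedding_index_cost(n_vectors, n_dims, bytes_per_value=4):
    """Estimate storage and per-query work for a brute-force similarity search."""
    memory_bytes = n_vectors * n_dims * bytes_per_value
    # One dot product per stored vector, with n_dims multiply-adds each.
    multiply_adds_per_query = n_vectors * n_dims
    return {
        "memory_gb": memory_bytes / 1e9,
        "multiply_adds_per_query": multiply_adds_per_query,
    }

full = embedding_index_cost(1_000_000, 1536)
reduced = embedding_index_cost(1_000_000, 128)

for label, cost in (("1536 dims", full), ("128 dims", reduced)):
    print(f"{label}: {cost['memory_gb']:.2f} GB, "
          f"{cost['multiply_adds_per_query']:.2e} multiply-adds per query")
```

Index overhead, metadata, and approximate-search structures are ignored here, so treat the figures as a lower bound rather than a sizing guide.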
Dimensionality reduction techniques, like Principal Component Analysis (PCA), can significantly compress these embeddings while retaining most of the essential information for similarity comparisons. The goal is to find a lower-dimensional representation that preserves the relative distances between embeddings.
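For intuition about what PCA computes, the following is a minimal NumPy sketch, not scikit-learn's implementation. It centers the data, takes the top right singular vectors as the projection, and checks pairwise distances before and after. The data is synthetic and deliberately constructed to have rank 16, which is an assumption that real embeddings will only approximately satisfy; the helper names are made up for this example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean and the top principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the principal directions."""
    return (X - mean) @ components.T

def pairwise_distances(A):
    """Euclidean distance between every pair of rows in A."""
    diffs = A[:, None, :] - A[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
# Synthetic data: 200 points that live in a 16-dimensional subspace of 64 dims.
X = rng.normal(size=(200, 16)) @ rng.normal(size=(16, 64))

mean, components = pca_fit(X, n_components=16)
Z = pca_transform(X, mean, components)

max_error = np.abs(pairwise_distances(X) - pairwise_distances(Z)).max()
print(f"Shape before: {X.shape}, after: {Z.shape}")
print(f"Largest change in any pairwise distance: {max_error:.2e}")
```

Because this toy data has no variance outside the 16 retained directions, distances survive essentially unchanged. With real embeddings the discarded directions carry some variance, so distances are preserved only approximately.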
Let’s apply PCA to our example.
# Apply PCA for dimensionality reduction
target_components = 128  # Target number of dimensions on a realistically sized corpus
# PCA cannot return more components than min(n_samples, n_features). With only
# six sample documents, PCA(n_components=128) raises a ValueError, so cap it here.
n_components = min(target_components, *document_embeddings_np.shape)
pca = PCA(n_components=n_components)
reduced_embeddings_np = pca.fit_transform(document_embeddings_np)
reduced_query_embedding_np = pca.transform(query_embedding_np)

# Calculate similarity with reduced embeddings
reduced_similarities = cosine_similarity(reduced_query_embedding_np, reduced_embeddings_np)[0]
print("\nSimilarities (reduced embeddings):")
for i, sim in enumerate(reduced_similarities):
    print(f"Doc {i+1}: {sim:.4f}")
print(f"\nOriginal embedding dimension: {document_embeddings_np.shape[1]}")
print(f"Reduced embedding dimension: {reduced_embeddings_np.shape[1]}")
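The similarity scores, and even the ranking of the documents, may shift slightly after reduction, but the overall pattern should hold: the documents related to "animals" should still stand out for this query. What matters is that the relative distances, which drive similarity, are largely preserved. Keep in mind that PCA cannot return more components than min(n_samples, n_features), so a six-document demo cannot actually reach 128 dimensions; in practice you fit PCA on a much larger, representative sample of your embeddings. On such a corpus, going from 1536 to 128 dimensions cuts storage and brute-force search cost by a factor of 12.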
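One way to check how much the rankings actually move is to compare the top-k results from the full and reduced vectors. The sketch below does this on synthetic data, since the six sample documents are too few to fit a meaningful PCA. The data generator (32 latent factors mixed into 256 dimensions plus noise), the helper names, and every size are assumptions made for illustration; real embeddings may behave differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings: 32 latent factors mixed into 256 dims, plus noise.
n_docs, n_queries, latent_dim, full_dim = 2000, 20, 32, 256
mixing = rng.normal(size=(latent_dim, full_dim))

def make_vectors(n):
    latent = rng.normal(size=(n, latent_dim))
    return latent @ mixing + 0.1 * rng.normal(size=(n, full_dim))

docs = make_vectors(n_docs)
queries = make_vectors(n_queries)

# Fit on the documents only, then apply the same projection to the queries.
pca = PCA(n_components=32, svd_solver="full").fit(docs)
docs_reduced = pca.transform(docs)
queries_reduced = pca.transform(queries)

def top_k(query_vectors, doc_vectors, k):
    sims = cosine_similarity(query_vectors, doc_vectors)
    # argsort is ascending, so the last k columns hold the k most similar docs.
    return np.argsort(sims, axis=1)[:, -k:]

k = 10
full_top = top_k(queries, docs, k)
reduced_top = top_k(queries_reduced, docs_reduced, k)

# Fraction of the full-dimension top-k that the reduced search also returns.
overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(full_top, reduced_top)])
print(f"Mean top-{k} overlap between full and reduced search: {overlap:.2f}")
```

Running the same comparison on a held-out sample of your own embeddings gives a direct measure of how much search quality a given `n_components` costs.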
The n_components parameter in PCA is crucial, and it is a hyperparameter you tune. A common approach is to look at the explained variance ratio: fit PCA with as many components as the data allows (up to all original dimensions, given at least that many samples), then plot the cumulative explained variance. Choose n_components so that it captures a high percentage (e.g., 95% or 99%) of the total variance. For embeddings, especially in similarity search, you can often get away with lower percentages (e.g., 80-90%) without significant degradation in search quality, because the goal is to preserve relative distances rather than to reconstruct the original vectors perfectly. Treat those thresholds as starting points and confirm them against your own retrieval results.
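The selection procedure can be sketched as follows. The embeddings here are synthetic, with variance that decays across dimensions so the curve has a visible knee; the shapes, the decay rate, and the helper `components_for` are assumptions for illustration, not properties of OpenAI embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic embeddings whose per-dimension scale decays as 1/k.
n_samples, n_dims = 1000, 64
scales = 1.0 / np.arange(1, n_dims + 1)
embeddings = rng.normal(size=(n_samples, n_dims)) * scales

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca_full = PCA().fit(embeddings)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

def components_for(threshold):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    # searchsorted returns the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

for threshold in (0.80, 0.90, 0.95, 0.99):
    print(f"{threshold:.0%} of variance: {components_for(threshold)} components")
```

Plotting `cumulative` against the component index gives the usual elbow curve; the loop above simply reads a few thresholds off it.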
Part of the appeal of dimensionality reduction for embeddings is that it can "denoise" the data: the low-variance directions it discards often correspond to noise or to subtle distinctions that don’t significantly affect semantic similarity in many practical use cases. Some information is lost, but ideally it is the information least useful for the task at hand.
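The denoising effect is easy to see on a toy problem where the clean signal is known. The sketch below builds a rank-8 signal in 128 dimensions, adds Gaussian noise, and compares the error against the clean signal before and after a PCA round trip. The signal rank, the noise level, and the assumption that noise is spread evenly across dimensions are all choices made for this example; real embedding "noise" is not this tidy.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A clean low-rank signal, observed through additive Gaussian noise.
n_samples, signal_rank, n_dims = 1000, 8, 128
clean = rng.normal(size=(n_samples, signal_rank)) @ rng.normal(size=(signal_rank, n_dims))
noisy = clean + 0.5 * rng.normal(size=clean.shape)

# Keep only the top components, then map back to the original space.
pca = PCA(n_components=signal_rank, svd_solver="full").fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

def mean_squared_error(a, b):
    return float(np.mean((a - b) ** 2))

error_before = mean_squared_error(noisy, clean)
error_after = mean_squared_error(denoised, clean)
print(f"Error vs. clean signal before PCA: {error_before:.4f}")
print(f"Error vs. clean signal after PCA:  {error_after:.4f}")
```

The noise that falls outside the retained subspace is removed, while the noise inside it remains, which is why the error drops but does not reach zero.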
The next step after reducing dimensions is often integrating these compressed embeddings into a dedicated vector database solution like Pinecone, Weaviate, or Milvus, which are optimized for efficient nearest neighbor search in high-dimensional (and now, lower-dimensional) spaces.