Pinecone’s vector similarity search doesn’t just pick the "closest" vectors; it uses different mathematical "distances" to define closeness, and understanding these differences is crucial for getting the results you expect.
Let’s see how this plays out with a simple example. Imagine we have three vectors:
v1 = [1, 0]
v2 = [0.9, 0.1]
v3 = [0.5, 0.5]
If we query with v1, we want to see how v2 and v3 compare.
Cosine Similarity: This measures the angle between two vectors. It’s all about direction, not magnitude. A cosine similarity of 1 means the vectors point in the exact same direction.
- Calculation: cosine_similarity(a, b) = dot_product(a, b) / (magnitude(a) * magnitude(b))
- Example:
  - Cosine similarity between v1 and v2:
    - dot_product([1,0], [0.9,0.1]) = 1*0.9 + 0*0.1 = 0.9
    - magnitude([1,0]) = sqrt(1^2 + 0^2) = 1
    - magnitude([0.9,0.1]) = sqrt(0.9^2 + 0.1^2) = sqrt(0.81 + 0.01) = sqrt(0.82) ≈ 0.9055
    - cosine_similarity(v1, v2) = 0.9 / (1 * 0.9055) ≈ 0.9939
  - Cosine similarity between v1 and v3:
    - dot_product([1,0], [0.5,0.5]) = 1*0.5 + 0*0.5 = 0.5
    - magnitude([0.5,0.5]) = sqrt(0.5^2 + 0.5^2) = sqrt(0.25 + 0.25) = sqrt(0.5) ≈ 0.7071
    - cosine_similarity(v1, v3) = 0.5 / (1 * 0.7071) ≈ 0.7071
v2 is much closer to v1 in direction than v3.
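The cosine arithmetic above can be reproduced in a few lines of plain Python. This is a minimal sketch for checking the numbers by hand, not a description of how Pinecone computes scores internally; the function name simply mirrors the formula.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (undefined for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1, 0]
v2 = [0.9, 0.1]
v3 = [0.5, 0.5]

print(round(cosine_similarity(v1, v2), 4))  # 0.9939
print(round(cosine_similarity(v1, v3), 4))  # 0.7071
```

A zero vector has no direction, so this sketch would raise ZeroDivisionError for one; real pipelines usually filter such vectors out before indexing.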
Dot Product: This is the sum of the products of corresponding elements. It’s influenced by both direction and magnitude. If vectors are normalized (magnitude of 1), it’s identical to cosine similarity.
- Calculation: dot_product(a, b) = sum(a[i] * b[i] for i in range(len(a)))
- Example:
  - Dot product between v1 and v2: 0.9 (calculated above)
  - Dot product between v1 and v3: 0.5 (calculated above)
Again, v2 has a higher dot product with v1.
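As a quick check of the dot product values, and of the claim that normalized vectors give the same score as cosine similarity, here is a small sketch in plain Python. The normalize helper is introduced only for this illustration.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length (illustrative helper; fails on a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

v1 = [1, 0]
v2 = [0.9, 0.1]
v3 = [0.5, 0.5]

print(dot_product(v1, v2))  # 0.9
print(dot_product(v1, v3))  # 0.5

# After normalization the dot product matches the cosine similarity of v1 and v2.
print(round(dot_product(normalize(v1), normalize(v2)), 4))  # 0.9939
```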
Euclidean Distance: This is the straight-line distance between the tips of two vectors in space. It’s sensitive to both magnitude and direction. Smaller values mean closer.
- Calculation: euclidean_distance(a, b) = sqrt(sum((a[i] - b[i])^2 for i in range(len(a))))
- Example:
  - Euclidean distance between v1 and v2: sqrt((1 - 0.9)^2 + (0 - 0.1)^2) = sqrt(0.1^2 + (-0.1)^2) = sqrt(0.01 + 0.01) = sqrt(0.02) ≈ 0.1414
  - Euclidean distance between v1 and v3: sqrt((1 - 0.5)^2 + (0 - 0.5)^2) = sqrt(0.5^2 + (-0.5)^2) = sqrt(0.25 + 0.25) = sqrt(0.5) ≈ 0.7071
v2 is much closer to v1 in Euclidean space than v3.
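The same distances can be verified with a short Python sketch. Note that Python's exponent operator is `**`; the `^` in the formula above is only mathematical notation.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v1 = [1, 0]
v2 = [0.9, 0.1]
v3 = [0.5, 0.5]

print(round(euclidean_distance(v1, v2), 4))  # 0.1414
print(round(euclidean_distance(v1, v3), 4))  # 0.7071
```

On Python 3.8 and later, the standard library's math.dist(a, b) computes the same quantity.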
How Pinecone Uses These:
When you create an index in Pinecone, you specify a metric. The common options are cosine, dotproduct, and euclidean.
- cosine: Ideal when you care about the meaning or topic of vectors, regardless of their length. Think document embeddings where a longer document might have a larger magnitude but represent the same core topic as a shorter, denser one.
- dotproduct: Often used when vectors are not normalized. It can implicitly boost results from vectors with larger magnitudes, which might be desirable if magnitude represents importance or confidence. If your vectors are normalized (magnitude 1), dotproduct is mathematically equivalent to cosine.
- euclidean: Best when the absolute difference in vector components matters. This is common in recommendation systems or anomaly detection where the "distance" in feature space directly correlates with dissimilarity.
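To make the metric choice concrete, the sketch below shows where the value might be supplied at index creation. It assumes a recent version of the Pinecone Python client with serverless indexes; the helper names (index_settings, create_index_sketch), the index name, and the cloud and region values are placeholders invented for this illustration, so check the client documentation for your version before relying on it.

```python
# The three metric names discussed above.
SUPPORTED_METRICS = {"cosine", "dotproduct", "euclidean"}

def index_settings(name, dimension, metric):
    """Build keyword arguments for index creation (hypothetical helper)."""
    if metric not in SUPPORTED_METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    return {"name": name, "dimension": dimension, "metric": metric}

def create_index_sketch(api_key, settings):
    """Create the index; needs the pinecone package and a valid API key."""
    # Imported lazily so index_settings works without the client installed.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key=api_key)
    # cloud and region are placeholder values, not recommendations.
    pc.create_index(spec=ServerlessSpec(cloud="aws", region="us-east-1"), **settings)

settings = index_settings("metric-demo", 2, "cosine")
print(settings)
```

Validating the metric string locally catches typos such as "dot_product" before any request is sent.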
The Counterintuitive Nuance:
Many users assume that if their vectors are normalized (e.g., to have a magnitude of 1), then using dotproduct is the same as cosine. That is mathematically true, although the two scores may still differ in their last few digits because of floating-point rounding or implementation-specific optimizations. More importantly, if you are not explicitly normalizing your vectors to a magnitude of 1 before ingestion, dotproduct will favor vectors with larger magnitudes, while cosine will ignore magnitude entirely and focus only on direction. As a result, a long vector can rank higher under dotproduct than under cosine, even when the competing vectors point in nearly the same direction.
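A small experiment shows the magnitude effect. In this sketch, v3 from the earlier example is multiplied by 4 (an arbitrary factor chosen for illustration), which leaves its direction unchanged but makes it much longer. The ranking is computed locally in plain Python rather than by querying Pinecone.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

query = [1, 0]
candidates = {
    "v2": [0.9, 0.1],
    "v3_scaled": [2.0, 2.0],  # v3 = [0.5, 0.5] multiplied by 4
}

# Rank candidate names from best to worst under each metric.
by_dot = sorted(candidates, key=lambda k: dot_product(query, candidates[k]), reverse=True)
by_cos = sorted(candidates, key=lambda k: cosine_similarity(query, candidates[k]), reverse=True)

print(by_dot)  # ['v3_scaled', 'v2']
print(by_cos)  # ['v2', 'v3_scaled']
```

The scaled vector wins under the dot product (2.0 versus 0.9) even though its angle to the query is unchanged, while cosine still prefers v2.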
The next step is understanding how to tune the metric parameter in your Pinecone index configuration based on your data and desired search behavior.