Pinecone can handle millions of queries per second, but only if you’re willing to pay for the hardware.

Imagine you’ve got an application that needs to find similar items in a massive dataset – think recommending products, identifying duplicate images, or detecting anomalies. You’ve chosen Pinecone, a vector database, because it’s designed for this kind of similarity search. You’re expecting lightning-fast results, even as your user base explodes. Then, you hit a wall. Your queries per second (QPS) aren’t scaling as you’d hoped, and you’re staring at a bill that’s climbing faster than your user engagement.

Let’s look at a real-world scenario. You’ve deployed a Pinecone index with a few thousand vectors and are seeing decent performance. As you ramp up to tens of thousands, then millions of vectors, and your QPS demands increase, you start noticing latency creeping in. Your application, which previously felt instantaneous, now has a noticeable delay. You check your Pinecone dashboard and see your query latency spiking, and your QPS is plateauing, far below the millions you envisioned.

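The core of Pinecone’s scaling story for pod-based indexes lies in its architecture: shards and replicas. An index is composed of one or more shards, each holding a slice of the vectors, and each shard can have multiple replicas. A query fans out to every shard, and the replicas of a shard share the query load. Adding shards mainly adds capacity and shrinks the slice each pod has to search; adding replicas is the main lever for raising QPS.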

Here’s how you’d actually see this in action. Let’s say you’re creating an index for a new recommendation engine.

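# Uses the legacy pinecone-client (v2.x) interface; newer client releases
# replace pinecone.init with a Pinecone class.
import pinecone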

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

index_name = "my-recommendation-index"

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,  # e.g., for OpenAI embeddings
        metric="cosine",
        pod_type="p1.x1",  # Pod type and size
        pods=1,          # Total pods in the index (shards x replicas)
        replicas=1,      # Start with 1 replica per shard
        shards=1         # Start with 1 shard
    )

index = pinecone.Index(index_name)
print(index.describe_index_stats())

A freshly created index is empty. Once about a thousand vectors have been upserted, the output might look something like this:

{
    "dimension": 1536,
    "index_fullness": 0.0001,
    "namespaces": {
        "": {
            "vector_count": 1000
        }
    },
    "total_vector_count": 1000
}

Index stats cover vector counts and fullness only. The shard, replica, and pod-type settings are reported by pinecone.describe_index(index_name) instead.

This setup may be fine for up to a few hundred QPS, depending on the workload. But if you need to hit millions, you need more horsepower. Note that the pods parameter in create_index is the total number of pods in the index (shards × replicas), not the pod type; the type and size are set with pod_type (for example, "p1.x1"). What truly drives QPS is how many replicas serve queries and how capable each pod is.

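Let’s say your QPS demands are growing, and you’re hitting limits. You’ll need to scale up. For pod-based indexes, the replica count and pod size can be changed on a live index with pinecone.configure_index, but the shard count and base pod type cannot be changed in place; for those, you typically create a new index with the desired configuration and migrate your data.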
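One way to frame the choice between resizing in place and migrating is a small planning helper. This is a hypothetical sketch: PodConfig and plan_scaling are illustrative names, not part of the Pinecone client. It assumes that replica count and pod size can change in place (with size only growing, never shrinking), while shard count and base pod type cannot. Check the current Pinecone documentation before relying on those rules.

```python
from dataclasses import dataclass

# Pod sizes from smallest to largest (assumed ordering for this sketch).
SIZES = ["x1", "x2", "x4", "x8"]


@dataclass(frozen=True)
class PodConfig:
    """Hypothetical description of a pod-based index layout."""
    pod_type: str   # base type plus size, e.g. "p1.x1"
    shards: int
    replicas: int


def plan_scaling(current: PodConfig, target: PodConfig) -> str:
    """Return 'no_change', 'configure_in_place', or 'new_index'."""
    cur_base, cur_size = current.pod_type.split(".")
    tgt_base, tgt_size = target.pod_type.split(".")

    # Assumption: shard count and base pod type are fixed at creation time.
    if cur_base != tgt_base or current.shards != target.shards:
        return "new_index"
    # Assumption: pod size can be increased in place but not decreased.
    if SIZES.index(tgt_size) < SIZES.index(cur_size):
        return "new_index"
    if current == target:
        return "no_change"
    # Only replicas and/or a larger pod size differ.
    return "configure_in_place"


if __name__ == "__main__":
    now = PodConfig("p1.x1", shards=1, replicas=1)
    print(plan_scaling(now, PodConfig("p1.x1", shards=1, replicas=4)))
    print(plan_scaling(now, PodConfig("p1.x1", shards=8, replicas=2)))
```

When the helper returns "configure_in_place", the change maps onto a configure_index call; when it returns "new_index", you are in migration territory.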

Consider scaling toward a much higher target, say 100,000 QPS. You’d create a new index with more shards and replicas. The values below are illustrative, not a sizing recommendation.

index_name_scaled = "my-recommendation-index-scaled"

if index_name_scaled not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name_scaled,
        dimension=1536,
        metric="cosine",
        pod_type="p1.x1",  # Pod type and size are set here, not via `pods`
        # Total pods = shards x replicas (8 x 2 = 16 here):
        shards=8,       # Split the data into more partitions
        replicas=2      # Duplicate each shard for read throughput and availability
    )

index_scaled = pinecone.Index(index_name_scaled)
print(index_scaled.describe_index_stats())

In this scaled configuration, shards=8 means your index is now divided into 8 independent partitions. Each query is sent to all of them, and each partition searches its smaller slice in parallel. replicas=2 means that for each of those 8 shards, there are now 2 copies, for 16 pods in total. This increases read throughput because queries can be spread across replicas, and it also provides high availability: if one replica goes down, the other can take over.

The pod type, together with replicas and shards, determines the underlying hardware. A p1.x1 pod is a basic unit. Larger sizes of the same type (x2, x4, x8) provide more resources per pod, and the p2 type is designed for higher throughput and lower latency, allowing each shard/replica to handle more load. For millions of QPS, you’re looking at a significant number of pods, often in the tens or hundreds, with a carefully balanced ratio of shards and replicas tailored to your specific query patterns and data volume.

The actual QPS you get from a given configuration depends on many factors: the size of your vectors, the complexity of your queries (e.g., top_k value), the density of your data, and the specific pod type. Pinecone’s pricing is directly tied to the number and type of pods provisioned. To achieve millions of QPS, you are essentially provisioning a large cluster of compute resources.
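Because the bill follows the pod count, a back-of-the-envelope estimate is easy to sketch. The function below is a hypothetical helper, and the hourly rate passed to it is a placeholder, not a real Pinecone price; substitute the rate for your pod type and region from the current pricing page.

```python
def monthly_pod_cost(
    shards: int,
    replicas: int,
    hourly_rate_per_pod: float,
    hours_per_month: float = 730.0,
) -> float:
    """Estimate the monthly cost of a pod-based index.

    Assumes every pod is billed at the same hourly rate and that the
    index runs for the whole month.
    """
    if shards < 1 or replicas < 1:
        raise ValueError("shards and replicas must both be at least 1")
    if hourly_rate_per_pod < 0:
        raise ValueError("hourly_rate_per_pod must not be negative")
    pods = shards * replicas  # total pods provisioned
    return pods * hourly_rate_per_pod * hours_per_month


if __name__ == "__main__":
    # 0.10 is a made-up placeholder rate, used only to show the arithmetic.
    estimate = monthly_pod_cost(shards=8, replicas=2, hourly_rate_per_pod=0.10)
    print(f"Estimated monthly cost: {estimate:,.2f}")
```

Doubling the replicas doubles the estimate, which is why it pays to squeeze more QPS out of each pod before adding more of them.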

What most people don’t realize is that the top_k parameter in your query call significantly impacts QPS. A higher top_k requires more computation per query, as Pinecone needs to retrieve and rank more potential matches. If you’re querying for top_k=1000, you’ll get substantially lower QPS than if you query for top_k=10 on the same index and hardware. It’s not just about the index hardware; it’s about the work each query demands from that hardware.
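You can check this on your own workload with a small timing harness. The sketch below is an assumption-laden illustration: measure_qps and compare_top_k are hypothetical helpers, and fake_query is a local stand-in so the script runs without an API key. With a real index you would pass something like lambda k: index.query(vector=some_vector, top_k=k) instead. It also issues queries one at a time from a single client, so it reflects per-query latency rather than the maximum throughput the index can sustain.

```python
import time
from typing import Callable, Dict, Iterable


def measure_qps(query_fn: Callable[[int], object], top_k: int, n_queries: int = 200) -> float:
    """Run n_queries sequential queries and return the observed queries/sec."""
    start = time.perf_counter()
    for _ in range(n_queries):
        query_fn(top_k)
    elapsed = time.perf_counter() - start
    return n_queries / elapsed if elapsed > 0 else float("inf")


def compare_top_k(
    query_fn: Callable[[int], object],
    top_k_values: Iterable[int] = (10, 100, 1000),
    n_queries: int = 200,
) -> Dict[int, float]:
    """Measure sequential QPS for each top_k value."""
    return {k: measure_qps(query_fn, k, n_queries) for k in top_k_values}


def fake_query(top_k: int) -> list:
    """Stand-in for a real query: does work that grows with top_k."""
    candidates = range(top_k * 50)
    return sorted(candidates, reverse=True)[:top_k]


if __name__ == "__main__":
    for k, qps in compare_top_k(fake_query).items():
        print(f"top_k={k}: {qps:,.0f} queries/sec")
```

Run it against your real index at a few top_k values before you size anything; the numbers from the stand-in function mean nothing beyond showing the harness works.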

To truly hit millions of QPS, you’ll be looking at indexes with configurations on the order of shards=32 and replicas=3, which is 96 pods, possibly on the p2 pod type or a larger pod size, and this would be just a starting point for a single index. You might also need multiple independent indexes, each scaled appropriately, to distribute your overall QPS load.
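A rough way to turn a QPS target into a replica count is to benchmark one replica and divide. The helper below is a hypothetical sketch, not a Pinecone sizing formula. It assumes throughput grows roughly linearly with replicas, which you should verify under load, and it reserves some headroom so pods are not run at their measured limit.

```python
import math


def replicas_needed(
    target_qps: float,
    measured_qps_per_replica: float,
    headroom: float = 0.3,
) -> int:
    """Estimate replicas for a target QPS, assuming linear scaling.

    headroom is the fraction of measured capacity left unused (0.3 = 30%).
    """
    if target_qps <= 0 or measured_qps_per_replica <= 0:
        raise ValueError("QPS values must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_qps = measured_qps_per_replica * (1 - headroom)
    return math.ceil(target_qps / usable_qps)


if __name__ == "__main__":
    # 150 QPS per replica is a made-up measurement for illustration.
    print(replicas_needed(target_qps=100_000, measured_qps_per_replica=150))
```

Multiply the result by your shard count to get the total pods, and remember that the per-replica figure has to come from your own benchmark at your real top_k and vector dimension.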

The next hurdle you’ll face after successfully scaling your QPS is managing the cost associated with that much provisioned hardware.
