Pinecone indexes don’t actually have a fixed "capacity" in the way you might think; instead, your cost and performance are determined by two independent, configurable dimensions: storage and query-per-second (QPS) throughput.
Let’s see this in action. Imagine you’re building a recommendation engine for a small e-commerce site. You’ve got 10,000 product embeddings, each 1536 dimensions.
import random

import pinecone

# Initialize Pinecone (pinecone-client 2.x style interface)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# Define index parameters
index_name = "my-product-recs"
dimension = 1536
metric = "cosine"

# Check if index exists, create if not
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=dimension,
        metric=metric,
        pods=1,  # Start with 1 pod for minimal storage/QPS
        replicas=1,
        shards=1,
        pod_type="s1.x1",  # Smallest size in the storage-optimized s1 family
    )
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")

# Connect to the index
index = pinecone.Index(index_name)

# Upsert some dummy data (10,000 vectors) in small batches: one request
# carrying all 10,000 vectors would exceed the per-request size limit
vectors_to_upsert = [(f"product_{i}", [random.random() for _ in range(dimension)]) for i in range(10000)]
batch_size = 50
for start in range(0, len(vectors_to_upsert), batch_size):
    index.upsert(vectors=vectors_to_upsert[start:start + batch_size])
print("Upserted 10,000 vectors.")

# Describe index stats to see current usage
stats = index.describe_index_stats()
print(stats)
When you run this, you’ll get output similar to this:
Index 'my-product-recs' created.
Upserted 10,000 vectors.
IndexStats(
    dimension=1536,
    index_fullness=0.00002,
    namespaces={
        '': NamespaceSummary(vector_count=10000)
    },
    total_vector_count=10000
)
Notice index_fullness. This is a key indicator. It is reported as a fraction between 0 and 1, and it is not a percentage of disk space but rather a measure of how much of the capacity of your allocated pods is being used to store your vectors. A low index_fullness means you’re paying for pods that aren’t being fully utilized for storage.
The actual storage consumed by your vectors is a function of the dimension and the number of vectors. Each vector of dimension d with float values typically takes d * 4 bytes (for 32-bit floats) plus some overhead for the vector ID and metadata. So, 10,000 vectors of 1536 dimensions would be roughly 10000 * 1536 * 4 bytes, which is about 61.44 MB. This is incredibly small.
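The arithmetic above can be wrapped in a small helper for reuse. This is a minimal sketch: the function name and the optional per-vector metadata allowance are illustrative assumptions, and it counts only raw float32 values, not Pinecone's internal index overhead.

```python
BYTES_PER_FLOAT32 = 4


def estimate_vector_bytes(num_vectors: int, dimension: int,
                          metadata_bytes_per_vector: int = 0) -> int:
    """Raw size of the float32 values plus an optional per-vector allowance.

    metadata_bytes_per_vector is a hypothetical knob for IDs and metadata;
    pick a value that matches your own records.
    """
    return num_vectors * (dimension * BYTES_PER_FLOAT32 + metadata_bytes_per_vector)


raw_bytes = estimate_vector_bytes(10_000, 1536)
print(f"{raw_bytes / 1e6:.2f} MB")  # 61.44 MB
```

Passing a non-zero metadata allowance gives a more conservative figure when your records carry large metadata payloads.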
This is where the "capacity" myth comes in. You could cram millions of these small vectors into a single pod if that’s all you needed for storage. But that single pod has a finite limit on how many queries it can handle per second (QPS).
Pinecone abstracts these limits into "pods." A pod is a unit of compute and memory. You choose a pod_type (e.g., s1.x1, p1.x1, p2.x1) which dictates the CPU, RAM, and network bandwidth available per pod. You then configure the number of pods, replicas, and shards to create your index.
- Pods: The fundamental unit of compute. More pods mean more capacity for both storage and QPS.
- Replicas: Copies of your index. They don’t increase storage but do increase QPS by distributing read traffic. If one replica goes down, others keep the index available.
- Shards: Data partitioning. For very large indexes, sharding splits your data across multiple sets of pods. This allows for higher QPS and can help manage very large datasets that exceed the memory of a single pod.
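The total number of pods in your index is shards * replicas. QPS capacity is roughly replicas * QPS_per_pod_type, because each replica serves queries independently. Storage capacity is tied to the memory of the pods in one replica, roughly shards * capacity_per_pod_type, because replicas hold copies of the data rather than new data.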
How to Plan:
- Estimate Storage Needs: Calculate number_of_vectors * dimension * 4 bytes (for float32). Add a buffer for metadata. This tells you the minimum storage you need.
- Estimate QPS Needs: Determine your peak query load. How many search requests per second do you anticipate?
- Choose Pod Type: Select a pod_type that offers a good balance of cost and performance for your expected load. s1 pods are storage-optimized and cost-effective for workloads with modest query rates, while p1 and p2 offer higher query performance.
- Configure Index:
  - Start with a minimum number of pods (e.g., 1) and replicas (e.g., 1) of your chosen pod_type.
  - Upsert your data.
  - Monitor index_fullness and QPS.
  - If index_fullness is high (e.g., above 0.7-0.8, i.e. 70-80%) and you’re hitting storage limits: Increase the number of pods or shards. Each pod contributes to both storage and QPS. Sharding is primarily for scaling beyond what a single set of pods can hold or process.
  - If index_fullness is low but you’re hitting QPS limits: Increase the number of replicas. This is the most cost-effective way to boost QPS without significantly increasing storage costs.
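The monitoring rules in the last planning step can be sketched as a small decision function. This is an illustration only: recommend_scaling is a hypothetical helper, the 0.75 threshold is an assumption taken from the middle of the 70-80% range mentioned above, and deciding whether you are "QPS saturated" is left to your own load metrics.

```python
def recommend_scaling(index_fullness: float, qps_saturated: bool,
                      fullness_threshold: float = 0.75) -> str:
    """Map the two monitored signals to the scaling action described above."""
    if not 0.0 <= index_fullness <= 1.0:
        raise ValueError("index_fullness is expected as a fraction between 0 and 1")
    if index_fullness >= fullness_threshold:
        # Storage-bound: more pods or shards add capacity.
        return "increase pods or shards"
    if qps_saturated:
        # Throughput-bound with spare storage: replicas add read capacity.
        return "increase replicas"
    return "no change"


print(recommend_scaling(0.85, qps_saturated=False))  # increase pods or shards
print(recommend_scaling(0.10, qps_saturated=True))   # increase replicas
```

The storage check comes first on purpose: when an index is nearly full, adding replicas alone would not create room for more vectors.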
Example Scenario:
You have 1 million vectors, 768 dimensions each, and need to handle 500 QPS.
- Storage: 1,000,000 * 768 * 4 bytes = ~3.07 GB.
- QPS: 500 QPS.
Let’s pick p1.x1 pods, which offer a decent QPS baseline.
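Pinecone’s documentation has put the capacity of a p1.x1 pod at roughly 1M vectors of 768 dimensions, so this dataset about fills a single pod and leaves little headroom. The QPS limit for a p1.x1 pod might be around 100 QPS.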
To reach 500 QPS, you’d need 500 QPS / 100 QPS/replica = 5 replicas.
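The replica calculation generalizes to any target load by rounding up. The sketch below assumes you supply the per-replica QPS yourself; the 100 QPS figure is the rough guess used in this scenario, not a published limit, and replicas_needed is a hypothetical helper name.

```python
import math


def replicas_needed(target_qps: float, qps_per_replica: float) -> int:
    """Smallest whole number of replicas whose combined QPS covers the target."""
    if target_qps < 0 or qps_per_replica <= 0:
        raise ValueError("target_qps must be >= 0 and qps_per_replica must be > 0")
    return max(1, math.ceil(target_qps / qps_per_replica))


print(replicas_needed(500, 100))  # 5
print(replicas_needed(520, 100))  # 6
```

Rounding up matters as soon as the target is not an exact multiple: 520 QPS already needs a sixth replica under the same assumption.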
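If you start with 1 shard and 5 replicas: pinecone.create_index("my-index", dimension=768, pods=5, replicas=5, shards=1, pod_type="p1.x1"). The pods argument is the total pod count (shards * replicas), so it is 5 here, and create_index returns nothing, so connect afterwards with pinecone.Index("my-index"). This gives you 5 pods in total, each holding a full copy of the index. Storage is set by the single shard, and queries are distributed across the 5 replicas.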
If you later find that 1 pod isn’t enough to store your data efficiently (e.g., index_fullness is constantly high, or you exceed the memory of a single pod), you might shard. For example, 2 shards, each with 3 replicas: pinecone.create_index("my-index", dimension=768, pods=6, replicas=3, shards=2, pod_type="p1.x1"). This gives you 6 pods in total (2 shards * 3 replicas/shard): twice the storage of a single shard, with queries spread across 3 replicas.
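The pod arithmetic behind these two configurations can be kept consistent with a small helper that derives the total from shards and replicas. This is a sketch under one assumption worth verifying against your client version: that the pods argument is the total pod count and should equal shards * replicas, as Pinecone's pod spec documentation describes it. The helper name pod_index_kwargs is hypothetical.

```python
def pod_index_kwargs(dimension: int, shards: int, replicas: int,
                     pod_type: str, metric: str = "cosine") -> dict:
    """Build keyword arguments for create_index with a consistent pod count."""
    if shards < 1 or replicas < 1:
        raise ValueError("shards and replicas must both be at least 1")
    return {
        "dimension": dimension,
        "metric": metric,
        "pods": shards * replicas,  # assumption: total pods, including replicas
        "replicas": replicas,
        "shards": shards,
        "pod_type": pod_type,
    }


kwargs = pod_index_kwargs(dimension=768, shards=2, replicas=3, pod_type="p1.x1")
print(kwargs["pods"])  # 6
# Usage, with the client already initialized:
# pinecone.create_index("my-index", **kwargs)
```

Deriving pods instead of typing it by hand avoids a mismatch between the three numbers when you change only one of them.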
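The pod_type is a critical lever. s1.x1 is storage-optimized: it holds more vectors per pod, which makes it cost-effective per stored vector, but it serves fewer queries per second. p1.x1 holds fewer vectors per pod but offers much higher QPS and lower latency. p2.x1 offers even more query performance. You often start with s1 or p1 and scale up.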
The most surprising thing about Pinecone’s capacity model is that storage and QPS are not intrinsically linked by a single knob; you adjust them independently through the number of pods, replicas, shards, and the chosen pod type, allowing for highly tailored cost and performance configurations.
Once you’ve dialed in your storage and QPS, the next challenge you’ll face is optimizing query latency for specific use cases.