A balanced shard key design is less about distributing data evenly and more about distributing workload evenly.

Let’s watch this unfold. Imagine we have a MongoDB sharded cluster with two shards, shard0001 and shard0002. We’re sharding a users collection based on userId.

// Initializing sharding for the 'mydatabase' database
sh.enableSharding("mydatabase")

// Sharding the 'users' collection by 'userId'
sh.shardCollection("mydatabase.users", { userId: 1 })

If userId is a monotonically increasing integer (like a sequential ID generated by an application), every new document falls into the chunk that covers the highest userId values. The shard that owns that chunk absorbs all inserts, while the other shard sits idle. This is a "hotspot."
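To make the hotspot concrete, here is a small simulation in plain JavaScript. It is only a sketch: the chunk boundaries, the ID values, and the function names are made up for illustration, and real chunk ranges are managed by MongoDB itself.

```javascript
// Hypothetical chunk layout for a range-sharded collection keyed on userId.
// Each chunk covers [min, max) and is owned by one shard.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shard0001" },
  { min: 1000, max: Infinity, shard: "shard0002" },
];

// Return the shard that owns the chunk containing the given key.
function routeByRange(key) {
  return chunks.find((c) => key >= c.min && key < c.max).shard;
}

// Count how many inserts each shard receives for a list of keys.
function countInsertsPerShard(keys) {
  const counts = {};
  for (const key of keys) {
    const shard = routeByRange(key);
    counts[shard] = (counts[shard] || 0) + 1;
  }
  return counts;
}

// New users get sequential IDs above everything already stored.
const newIds = Array.from({ length: 500 }, (_, i) => 2000 + i);
console.log(countInsertsPerShard(newIds)); // { shard0002: 500 }
```

Every new ID is larger than the last split point, so the top chunk receives all 500 inserts.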

The core problem is that a shard key determines which shard a document lives on. When that shard key is also heavily involved in query patterns or write patterns, one shard can end up doing disproportionately more work.

Consider a transactions collection sharded by accountId. If most of your application’s activity involves a small subset of accountIds (e.g., "hot accounts" that are frequently accessed), all transactions for a given hot account land on the same shard, because they share one shard key value. Reads and writes for those accounts hammer whichever shards hold them. Even if the less active accounts are spread evenly across shards, the aggregate workload is skewed.

Here’s how we can visualize this. If we query for the number of documents on each shard:

// In mongosh, connected to a mongos, with mydatabase selected
db.users.getShardDistribution()

If the shard key is well-distributed, the document counts will be roughly equal across shards. But document counts only describe data distribution. A skewed workload shows up in per-shard operation counts (reads/writes) or in CPU and network usage, where the imbalance can be stark even when the data looks balanced.
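One way to compare the two views is to compute the busiest shard's share of each metric. The sketch below uses invented per-shard figures and a hypothetical helper, maxShare. In practice the numbers would come from your own monitoring.

```javascript
// Hypothetical per-shard figures: documents stored and operations served.
const shardStats = {
  shard0001: { docs: 500000, ops: 90000 },
  shard0002: { docs: 510000, ops: 10000 },
};

// Share of the cluster-wide total held by the busiest shard for one metric.
function maxShare(stats, metric) {
  const values = Object.values(stats).map((s) => s[metric]);
  const total = values.reduce((a, b) => a + b, 0);
  return Math.max(...values) / total;
}

console.log(maxShare(shardStats, "docs").toFixed(2)); // 0.50
console.log(maxShare(shardStats, "ops").toFixed(2)); // 0.90
```

With two shards, a share near 0.50 means balance. Here the data is balanced, but one shard serves 90% of the operations.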

The goal is to choose a shard key that distributes documents and operations across shards. This means the shard key should have:

  1. High Cardinality: The shard key should have a large number of distinct values. This allows for more potential distribution points.
  2. Even Distribution: Values should be spread relatively evenly across the range of possible values.
  3. Query Pattern Alignment: Queries should include the shard key so that mongos can route them to a single shard, and the key values they target should themselves be spread across shards rather than concentrated in one range.
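The first two properties can be checked roughly against a sample of real values before committing to a key. This is a minimal sketch: summarizeKey and the sample array are hypothetical, and a real check would read a much larger sample from the collection.

```javascript
// Summarize a sample of candidate shard key values:
// how many distinct values exist, and how dominant the most common one is.
function summarizeKey(values) {
  const freq = new Map();
  for (const v of values) {
    freq.set(v, (freq.get(v) || 0) + 1);
  }
  const top = Math.max(...freq.values());
  return { distinct: freq.size, topShare: top / values.length };
}

// Made-up sample of a low-cardinality, skewed field.
const sample = ["US", "US", "US", "US", "US", "US", "DE", "FR"];
console.log(summarizeKey(sample)); // { distinct: 3, topShare: 0.75 }
```

A field with only three distinct values, one of which covers 75% of the sample, fails both the cardinality and the even-distribution test.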

For example, sharding a logs collection by timestamp is a bad idea. Timestamps increase monotonically, so all new writes, and all queries for recent logs, land on the same shard.

A better approach for logs might be a compound shard key like { eventType: 1, timestamp: 1 } if eventType has many distinct values and is queried frequently, or a hashed shard key if there’s no natural query pattern to align with.
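To see why the compound key helps, the sketch below routes log entries that share the same recent timestamp but differ in eventType. The split point, the event types, and the function names are assumptions for illustration, not how MongoDB would necessarily split the data.

```javascript
// Hypothetical split point for a collection sharded on { eventType: 1, timestamp: 1 }.
// Keys below the split live on shard0001, keys at or above it on shard0002.
const splitPoint = ["m", 0];

// Compare two compound keys [eventType, timestamp] field by field.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Pick the shard for one log entry based on its compound key.
function shardForLog(eventType, timestamp) {
  return compareKeys([eventType, timestamp], splitPoint) < 0
    ? "shard0001"
    : "shard0002";
}

const now = 1700000000000; // the same recent timestamp for every event
for (const type of ["error", "login", "purchase", "signup"]) {
  console.log(type, shardForLog(type, now));
}
// error shard0001
// login shard0001
// purchase shard0002
// signup shard0002
```

Because eventType is compared first, recent entries are spread by type instead of piling onto whichever shard holds the newest timestamps.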

Let’s say we have a products collection and we want to shard it. A common mistake is to shard by productId if productIds are sequential. A better approach might be to shard by a field that has more entropy and is frequently used in queries, like { storeId: 1, productId: 1 } if storeId has high cardinality.

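If you’ve already sharded a collection and observe hotspots, you can change the shard key, but it’s a complex operation involving resharding. On MongoDB 5.0 and later, the reshardCollection command handles this on the server. On older versions, you’d typically: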

  1. Create a new collection with the desired shard key.
  2. Use db.collection.aggregate() with $merge to copy data into the new collection ($out cannot write to a sharded collection), so that each document is routed by the new shard key.
  3. Perform a controlled switchover, perhaps by renaming collections and updating application logic to point to the new sharded collection.
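A minimal sketch of the built-in route on MongoDB 5.0 or later: the reshardCollection admin command takes the namespace and the new key. The helper name reshardCommand is hypothetical, and the choice of a hashed userId key is only an example. Check the server documentation for storage and availability requirements before running it.

```javascript
// Build the admin command document for resharding a collection.
// reshardCommand is an illustrative helper, not part of any MongoDB API.
function reshardCommand(namespace, newKey) {
  return { reshardCollection: namespace, key: newKey };
}

const cmd = reshardCommand("mydatabase.users", { userId: "hashed" });
console.log(JSON.stringify(cmd));
// {"reshardCollection":"mydatabase.users","key":{"userId":"hashed"}}

// In mongosh, connected to a mongos, the command would be sent with:
//   db.adminCommand(cmd)
```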

The most surprising thing about shard key design is that a key that perfectly distributes your current data might become a hotspot later if your access patterns change or if your data grows in a skewed way. The key is a moving target.

If you shard a collection using a hashed shard key (e.g., { userId: "hashed" }), MongoDB automatically hashes the userId value to determine the shard. This ensures a more even distribution of data across shards, regardless of whether userId is sequential or not. However, it makes range queries inefficient: documents with sequential userIds are scattered across shards, so a range query on userId has to be sent to every shard.
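The effect of hashing can be simulated outside the database. The sketch below uses the first 32 bits of an MD5 digest as a stand-in hash. This is an assumption for illustration only and not MongoDB's actual hashing function, and the even split of the hash space between two shards is likewise made up.

```javascript
const crypto = require("node:crypto");

const SHARDS = ["shard0001", "shard0002"];

// Stand-in hash: first 32 bits of MD5 over the decimal string of the ID.
// MongoDB's hashed indexes use their own hash of the field value.
function hash32(id) {
  return crypto.createHash("md5").update(String(id)).digest().readUInt32BE(0);
}

// Split the 32-bit hash space into equal contiguous ranges, one per shard.
function shardForHashed(id) {
  const rangeSize = 2 ** 32 / SHARDS.length;
  return SHARDS[Math.floor(hash32(id) / rangeSize)];
}

// Route 10,000 sequential IDs and count how many land on each shard.
const hashedCounts = {};
for (let id = 1; id <= 10000; id++) {
  const shard = shardForHashed(id);
  hashedCounts[shard] = (hashedCounts[shard] || 0) + 1;
}
console.log(hashedCounts);
```

Sequential IDs end up split roughly in half, which is the write-distribution benefit. The cost is that neighbouring IDs no longer share a shard.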

The next challenge you’ll face is optimizing multi-shard queries.
