Sharding often fails not because the sharding strategy is inherently bad, but because the application’s access patterns are fundamentally at odds with the sharded data.

Let’s see what that looks like in practice. Imagine a multi-tenant application storing user data. A common, but often disastrous, sharding strategy is to shard by tenant_id.

-- Example table structure
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    tenant_id INT NOT NULL,
    username VARCHAR(255),
    -- ... other user data
);

-- Sharding key: tenant_id
-- If we have 100 tenants and 4 shards,
-- tenants 1-25 go to shard 1,
-- tenants 26-50 go to shard 2, etc.

This seems logical: all data for a single tenant lives on one shard. But what happens when an administrator needs to query all users across all tenants, perhaps for a global user audit or a system-wide search?

-- The query that kills performance
SELECT COUNT(*) FROM users WHERE status = 'inactive';

If your sharding key is tenant_id, this query becomes a nightmare. The database has to send this request to every single shard. Each shard then scans its local users table, finds inactive users, and returns a count. Finally, the application has to aggregate these counts from all shards. This is a distributed COUNT(*) across the entire dataset, and it’s agonizingly slow.

The problem here is that the sharding strategy optimizes for queries within a shard (e.g., "get all users for tenant X") but penalizes queries that span shards (e.g., "get all inactive users across all tenants").

Here’s how to build a mental model for avoiding these pitfalls. Sharding is about distributing data to improve performance, but it’s a zero-sum game: you improve performance for some access patterns at the expense of others. The key is to identify your most critical access patterns and shard in a way that optimizes for them.

Consider the tenant_id sharding example again. If your primary access pattern is indeed "operate within a single tenant," then sharding by tenant_id might be fine. But if you frequently need global operations, this strategy will choke.

A better approach for mixed workloads might be to shard by user_id and use a consistent hashing algorithm. This distributes users across shards more evenly, regardless of tenant.

-- Sharding key: user_id
-- Using consistent hashing, a user_id might land on any shard,
-- regardless of their tenant_id.

With user_id sharding, a query like SELECT * FROM users WHERE user_id = 12345 goes to a single shard. But now, the global COUNT(*) query is still problematic. It still requires hitting every shard. This highlights the core trade-off: there’s no single sharding key that optimizes for all possible queries.

The real power comes from understanding your application’s read and write patterns and choosing a sharding key that aligns with the most frequent and performance-critical ones. If you have a truly global query that’s critical, you might need a separate, denormalized table optimized for that specific query, or a different sharding strategy altogether, perhaps based on a composite key that includes a global dimension.

Many systems allow for a "scatter-gather" or "broadcast" query capability. This is what happens when you run a query that doesn’t have the sharding key in its WHERE clause. The database coordinator (or your application logic) sends the query to all shards, and each shard executes it independently. The results are then aggregated. This is fine for occasional, non-critical queries, but it becomes a massive bottleneck if these queries are frequent or performance-sensitive. The performance penalty is directly proportional to the number of shards and the amount of data each shard must scan.

The one thing most people don’t realize is that sharding by a key that looks like it distributes data well (like tenant_id for multi-tenant apps) can be a trap if your application also performs global aggregations or cross-tenant lookups. You’re essentially forcing a distributed system to act like a monolithic one for those specific, high-impact operations, negating the benefits of sharding.

The next pitfall you’ll likely encounter is hot spots, where a single shard becomes disproportionately loaded due to uneven data distribution or exceptionally high traffic for a specific subset of your data.

Want structured learning?

Take the full Sharding course →