Joins across shards are often a performance bottleneck, but the real surprise is how frequently they’re completely avoidable with a bit of upfront design.

Let’s look at a common scenario: a users table sharded by user_id and an orders table sharded by order_id. If we need to find all orders for a specific user, we’d typically query the users shard for the user’s ID and then query the orders shards. If the orders table is sharded differently, this becomes a scatter-gather operation across all orders shards, which is slow.

Here’s how we can see this in action. Imagine we have two tables: users and user_purchases.

users table, sharded by user_id:

CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(100),
    email VARCHAR(255)
);
-- Sharding: HASH(user_id)

user_purchases table, sharded by purchase_id:

CREATE TABLE user_purchases (
    purchase_id BIGINT PRIMARY KEY,
    user_id BIGINT, -- This is the key for our join problem
    product_name VARCHAR(100),
    purchase_date DATE,
    amount DECIMAL(10, 2)
);
-- Sharding: HASH(purchase_id)

If we want to get all purchases for a specific user, say user_id = 12345, a naive approach might involve:

  1. Finding the shard containing user_id = 12345 in the users table.
  2. Then, querying all shards of the user_purchases table for records where user_id = 12345.

This second step is the problem. If user_purchases is sharded by purchase_id and we have millions of purchases spread across hundreds of shards, we’re sending a request to every single shard, collecting results, and then filtering. This is an expensive cross-shard join.

The solution lies in how we shard and structure our data, specifically through co-location and denormalization.

Co-location means placing related data on the same physical shard. If we shard user_purchases by user_id instead of purchase_id, then all purchases for a given user will reside on the same shard as that user’s record.

Let’s redefine user_purchases with co-location in mind:

user_purchases table, sharded by user_id:

CREATE TABLE user_purchases (
    purchase_id BIGINT PRIMARY KEY,
    user_id BIGINT, -- This is now our sharding key
    product_name VARCHAR(100),
    purchase_date DATE,
    amount DECIMAL(10, 2)
);
-- Sharding: HASH(user_id)

Now, if we query for user_id = 12345:

  1. The database determines the shard for user_id = 12345 in the users table.
  2. Because user_purchases is also sharded by user_id, the database knows to look only on that same shard for user_id = 12345 in the user_purchases table.

This transforms the scatter-gather operation into a single-shard query, drastically improving performance.

Denormalization is a related but distinct technique. Instead of co-locating by shard key, we duplicate data. In our example, we could add user details directly into the user_purchases table.

user_purchases_denormalized table:

CREATE TABLE user_purchases_denormalized (
    purchase_id BIGINT PRIMARY KEY,
    user_id BIGINT,
    username VARCHAR(100), -- Denormalized field
    email VARCHAR(255),    -- Denormalized field
    product_name VARCHAR(100),
    purchase_date DATE,
    amount DECIMAL(10, 2)
);
-- Sharding: HASH(purchase_id) or even HASH(user_id) if we also want co-location

With this denormalized table, if we want to display a user’s purchases and their username, we only need to query the user_purchases_denormalized table. There’s no join required at all. The trade-off is increased storage and the complexity of keeping the denormalized data consistent during updates.

The mental model for choosing between these strategies hinges on query patterns. If you frequently join users and user_purchases to get purchases for a specific user, co-locating user_purchases by user_id is ideal. If you often need to display user details alongside each purchase and rarely need to access user details independently, denormalization might be simpler and faster for that specific read pattern.

The power of co-location comes from the fact that distributed databases can often identify when a join condition matches the sharding key. When this happens, the database intelligently routes the query to the specific shard that holds both pieces of data, avoiding the network overhead of talking to multiple shards. This is why choosing the right sharding key is paramount.

One thing most people don’t realize is that co-location isn’t just about joining two tables. It’s also about enabling efficient subqueries and aggregations that span related data. If you have a products table and a sales table, and you want to find the total sales for each product on the shard where the product resides, you’d shard both by product_id. This allows a local aggregation on each shard without needing to pull all sales data to a central point.

The next step in mastering distributed data is understanding how to handle transactions that span multiple shards when co-location or denormalization isn’t feasible.

Want structured learning?

Take the full Sharding course →