Cross-shard queries are a necessary evil when your data is distributed across multiple database shards, and you need to retrieve information that spans those shards. The fundamental challenge is that no single shard holds all the data, so you have to ask multiple shards and then combine their results. This is where scatter-gather and fan-out patterns come in.

Let’s see this in action. Imagine you have a users table sharded by user_id across two shards, shard1 and shard2. You want to find all users whose name starts with "Alice".

On shard1:

SELECT user_id, name FROM users WHERE name LIKE 'Alice%';

This might return:

user_id | name
------- | ------
1001    | Alice Smith
1005    | Alice Johnson

On shard2:

SELECT user_id, name FROM users WHERE name LIKE 'Alice%';

This might return:

user_id | name
------- | ------
2003    | Alice Williams
2010    | Alice Brown

The scatter-gather pattern then involves a coordinating process (often the application itself or a dedicated query router) that sends the same query to both shards. It then waits for results from each and merges them into a single, complete result set. In our example, the coordinator would collect the rows from shard1 and shard2 to present:

user_id | name
------- | ------
1001    | Alice Smith
1005    | Alice Johnson
2003    | Alice Williams
2010    | Alice Brown

The fan-out pattern is conceptually similar but often implies sending different queries or requests to different shards, or distributing the workload of a single query. For instance, if you wanted to find users named "Alice" and also users named "Bob", you might fan out one query for "Alice" to shard1 and another for "Bob" to shard2 (if your sharding scheme allowed for that distribution of logic), or more commonly, fan out the entire query to all shards and then filter the results locally on each shard. The term "fan-out" is sometimes used interchangeably with scatter-gather, but it can also refer to a more distributed processing model where tasks are sent to many workers. In the context of databases, scatter-gather is the more precise term for querying across shards.

The core problem these patterns solve is data locality. When data is sharded, a query that could be answered by a single shard (e.g., "get user 1001") is trivial. But a query that requires scanning or filtering across a dataset that happens to be distributed (e.g., "get all users named Alice") becomes complex. Without scatter-gather, you’d have to manually query each shard and combine results in your application code, which is error-prone and inefficient. Scatter-gather abstracts this complexity.

Internally, a scatter-gather query execution typically looks like this:

  1. Query Decomposition/Routing: The query is analyzed. If it’s a cross-shard query, the system determines which shards might contain relevant data. This is based on the query’s filters and the sharding key.
  2. Distribution: The query (or a modified version of it) is sent to each relevant shard. This is the "scatter" part.
  3. Execution on Shards: Each shard executes its portion of the query independently.
  4. Result Aggregation: The results from each shard are sent back to a coordinator.
  5. Finalization: The coordinator merges, sorts, or filters the results as needed to produce the final output. This is the "gather" part.

The key levers you control are how your data is sharded and how your queries are written. A good sharding strategy minimizes the need for cross-shard queries. For example, if you frequently query users by their last name, sharding by user_id is problematic, but sharding by last_name (with potential replication for uniqueness) might be better. For queries that must cross shards, optimizing the query itself is crucial. Adding indexes on columns used in WHERE clauses on all shards can dramatically speed up the execution on each individual shard, thus improving the overall scatter-gather performance.

Consider a scenario where you need to find users in a specific geographic region, and your user data is sharded by user_id. If your users table also has a city column, a query like SELECT * FROM users WHERE city = 'New York'; will have to be executed against every shard. Each shard will scan its local users table, find users in 'New York', and return them. The coordinator then aggregates these results. If you had a secondary index on city on each shard, the scan on each shard would be much faster, directly impacting the scatter-gather performance.

A common pitfall is assuming that the coordinator is infinitely powerful or that network latency between shards and the coordinator is negligible. In reality, many cross-shard queries become bottlenecks because the coordinator becomes overloaded or the time spent transferring partial results from many shards is substantial. This is why query planning and efficient data distribution are so critical. Some systems allow for "partial aggregation" where some level of processing happens on the shards before sending data back, reducing the data volume for the final gather step. For example, if you were counting users by city, each shard could count its local users by city and return only those counts, rather than all the user records. The coordinator then sums these counts.

The next problem you’ll encounter is handling transactions that need to span multiple shards.

Want structured learning?

Take the full Sharding course →