Sharding your database isn’t just about splitting data; it’s about giving each shard its own independent destiny, allowing them to operate and scale as if they were entirely separate databases.

Imagine a massive user table. Instead of one giant table holding millions of rows, sharding breaks it into smaller, more manageable pieces. Let’s say we shard by user_id.

Here’s a simplified PostgreSQL setup:

-- Original table
CREATE TABLE users (
    user_id BIGSERIAL PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create sharded tables (example for 4 shards)
CREATE TABLE users_shard_0 (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE users_shard_1 (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE users_shard_2 (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE users_shard_3 (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

Now, how do we decide which shard a user_id goes to? This is the "sharding key" and the "sharding function." A common approach is modulo: user_id % num_shards.

If num_shards is 4:

  • user_id 100 goes to users_shard_0 (100 % 4 = 0)
  • user_id 101 goes to users_shard_1 (101 % 4 = 1)
  • user_id 102 goes to users_shard_2 (102 % 4 = 2)
  • user_id 103 goes to users_shard_3 (103 % 4 = 3)
  • user_id 104 goes to users_shard_0 (104 % 4 = 0)

This distribution ensures data is spread out. When you query for user_id = 102, your application (or a sharding proxy) knows to only look in users_shard_2. This dramatically reduces the amount of data the database needs to scan.

The core problem sharding solves is scaling beyond a single machine’s capacity. As your dataset grows or your read/write load increases, a single database server eventually hits its limits for CPU, memory, disk I/O, and network bandwidth. Sharding allows you to distribute this load across multiple servers, each hosting one or more shards.

Internally, sharding typically involves:

  1. Sharding Key Selection: Choosing the column(s) that will determine which shard a row belongs to (e.g., user_id, customer_id, tenant_id). This key must have high cardinality and be present in most queries.
  2. Sharding Function: The algorithm that maps the sharding key value to a specific shard (e.g., modulo, hash, range-based).
  3. Sharding Strategy: How you manage the shards. This can be:
    • Client-side sharding: Your application code or a dedicated client library directs queries to the correct shard. This puts complexity on the application.
    • Proxy-based sharding: A middleware proxy (like ProxySQL, Vitess, or Citus Data’s coordinator node) intercepts queries, determines the target shard(s), and forwards the query. This abstracts sharding from the application.
    • Database-native sharding: Some databases (like Citus Data for PostgreSQL, or native sharding in MongoDB) have built-in support for distributing data and routing queries.

Let’s look at a proxy-based example using a hypothetical sharding_proxy. When your application sends SELECT * FROM users WHERE user_id = 102;:

  1. The sharding_proxy receives the query.
  2. It identifies user_id as the sharding key.
  3. It applies the sharding function: 102 % 4 = 2.
  4. It determines that users_shard_2 is the target.
  5. It rewrites the query to SELECT * FROM users_shard_2 WHERE user_id = 102; and sends it to the database server hosting users_shard_2.

This means that if you need to find user 102, you only ever touch one physical database server and a small subset of the users table. This is the fundamental performance gain.

The most surprising aspect of sharding is how it breaks the ACID guarantees of traditional single-database transactions. While individual shards can maintain ACID properties, transactions that span multiple shards (e.g., transferring funds between two users on different shards) become distributed transactions. These are notoriously complex, often sacrificing strong consistency for availability, and typically require two-phase commit (2PC) or eventual consistency models to manage. If you’re not careful, your application might end up with partial updates across shards, leading to data inconsistencies that are very hard to reconcile.

The next hurdle you’ll encounter is handling cross-shard queries, especially aggregations or joins that involve data from multiple shards.

Want structured learning?

Take the full Sharding course →