Sharding isn’t just about splitting data; it’s about fundamentally changing how your database interacts with its workload.

Let’s see this in action with a simple example. Imagine a social media platform with a massive users table.

-- Original, un-sharded table
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    username VARCHAR(255) UNIQUE,
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP
);

If we decide to shard this table, we’re essentially deciding how to distribute these rows across multiple database instances.

Horizontal Sharding: The "Row Splitter"

Horizontal sharding, or "sharding by rows," means you take your table and split its rows across different database servers. Each server will hold a subset of the total rows.

  • How it works: You define a "shard key" (e.g., user_id). Then, you use a sharding algorithm (like modulo, range, or hash) to determine which server a particular row belongs to. For instance, with modulo sharding on user_id across 4 shards:

    • user_id 1-1000 might go to Shard 1.
    • user_id 1001-2000 might go to Shard 2.
    • And so on.
  • What problem it solves: This is your go-to when a single table is growing too large, causing performance bottlenecks for reads and writes. By distributing the rows, you distribute the load. A query for a specific user (WHERE user_id = 12345) will only hit the shard containing that user, making it much faster than scanning a massive table. Writes for new users are also spread out.

  • Levers you control:

    • Shard Key: The choice of shard key is critical. It should be frequently used in queries and distribute data evenly. A poor shard key can lead to "hot spots" where one shard gets disproportionately more traffic.
    • Sharding Algorithm: Modulo is simple but can lead to uneven distribution if user_id isn’t perfectly sequential or if there are bursts of activity. Range sharding can be better for time-series data. Hash sharding offers good distribution but can make range queries harder.
    • Number of Shards: More shards mean more distribution, but also more complexity in managing the cluster.

Vertical Sharding: The "Column Splitter"

Vertical sharding, or "sharding by columns," means you take a single table and split its columns across different database servers. Different rows will still exist on different servers, but each server will hold a different set of columns for those rows.

  • How it works: You identify columns that are frequently accessed together and columns that are accessed less often or are very large (like BLOBs or large TEXT fields). You then move the less-frequently accessed or large columns to a separate table (or a separate database server) that shares a common primary key with the original table.

    Imagine our users table has a profile_bio field that’s often long and only updated infrequently.

    -- Original table
    CREATE TABLE users (
        user_id BIGINT PRIMARY KEY,
        username VARCHAR(255),
        email VARCHAR(255),
        profile_bio TEXT -- Large, infrequently accessed
    );
    
    -- Vertically sharded tables
    CREATE TABLE users_core (
        user_id BIGINT PRIMARY KEY,
        username VARCHAR(255),
        email VARCHAR(255)
    );
    
    CREATE TABLE users_profile (
        user_id BIGINT PRIMARY KEY, -- Foreign key to users_core.user_id
        profile_bio TEXT
    );
    

    users_core might live on Server A, and users_profile on Server B. Queries for username and email only need to hit Server A. Queries for profile_bio only need to hit Server B.

  • What problem it solves: This is great for optimizing read performance when you have wide tables where different queries access different subsets of columns. It reduces the amount of data read from disk for common queries. It also helps manage tables with very large data types (like images or long text) that can bloat your primary data store.

  • Levers you control:

    • Column Selection: The key is to group columns that are frequently accessed together. If you split columns that are often queried in the same SELECT statement, you’ll end up performing joins across servers, which can be slower than querying a single wider table.
    • Data Type Size: Columns with very large data types (TEXT, BLOB, JSONB) are prime candidates for moving to a separate store.
    • Access Frequency: Columns that are rarely updated or read can be moved out to reduce contention on the primary table.

The Hybrid Approach

Often, the best strategy is a combination. You might horizontally shard your users table by user_id across multiple servers. Then, on each of those horizontally sharded servers, you might further apply vertical sharding to split out infrequently accessed or large columns into separate tables.

The most surprising thing about sharding is that it doesn’t inherently make your individual queries faster if they still need to touch data spread across many shards. Instead, its power comes from distributing the total workload so that no single database instance becomes a bottleneck. A single query that needs to aggregate data from all users will still be slow, but queries for individual users or small groups of users will be lightning fast.

The next challenge you’ll face is managing distributed transactions and ensuring data consistency across your shards.

Want structured learning?

Take the full Sharding course →