Migrating database shards without downtime is less about a magical trick and more about a carefully orchestrated, multi-stage dance where data is mirrored, then switched, then cleaned up.

Let’s watch a shard migration in progress. Imagine we have a sharded PostgreSQL cluster. We want to move shard_01 from node_a to node_b without interrupting any application writes or reads.

Here’s a simplified view of the data flow during the live cutover:

Initial State:

  • Application writes to shard_01 on node_a.
  • node_a is the primary for shard_01.

Stage 1: Replication Setup

  • A new PostgreSQL instance is provisioned on node_b.
  • node_b is configured as a streaming replica of node_a for shard_01.
  • node_a sends its Write-Ahead Log (WAL) records to node_b.
  • node_b applies these WAL records, bringing it nearly in sync with node_a.
-- On node_a (the primary)
SELECT pg_create_physical_replication_slot('migration_slot');
-- On node_b (the new replica)
-- Assuming pg_hba.conf allows replication from node_a
-- Assuming postgresql.conf has wal_level = replica and hot_standby = on
SELECT pg_basebackup -D /var/lib/postgresql/14/main_shard_01 -h node_a -p 5432 -U replicator -v -P -R -S migration_slot;
-- The -R option in pg_basebackup automatically creates the recovery.conf (or standby.signal and postgresql.auto.conf in newer versions)
-- with primary_conninfo and recovery_target_slot.

Stage 2: Application Awareness & Dual Writes (Optional but Recommended)

  • The application is modified to be aware of the migration. It can now write to either node_a or node_b.
  • A small percentage of writes might be sent to node_b to test its write capacity and ensure consistency. This is often called "dual writes" or "shadow writes."

Stage 3: The Cutover (The Moment of Truth)

  • Stop Application Writes: A brief pause is introduced where the application stops sending any new writes to shard_01. This is the only point of actual quiescence.
  • Final WAL Sync: Ensure node_b has received all WAL records from node_a up to the point writes were stopped.
  • Promote node_b: Make node_b the new primary for shard_01.
  • Resume Application Writes: The application is now directed to send all writes to shard_01 on node_b.
-- On node_b (the replica)
SELECT pg_promote();

Stage 4: Verification & Switchback Readiness

  • The application performs read and write tests against node_b to confirm its health.
  • The old primary (node_a) is kept running as a replica of the new primary (node_b) for a short period, allowing for an immediate rollback if necessary.
-- On node_a (the old primary, now a replica)
-- Stop the old replication process
SELECT pg_stop_replication();

-- Reconfigure node_a to replicate from node_b
-- This involves updating postgresql.conf or postgresql.auto.conf and standby.signal
-- Example using postgresql.auto.conf:
-- primary_conninfo = 'host=node_b port=5432 user=replicator password=...'
-- recovery_target_slot = 'rollback_slot' -- a new slot on node_b

Stage 5: Cleanup

  • Once confidence in node_b is high, the replication slot on node_b (if created for rollback) and the old primary (node_a) can be decommissioned.
-- On node_b (the new primary)
SELECT pg_drop_replication_slot('rollback_slot');

-- Then decommission node_a

The core problem this solves is the classic CAP theorem trade-off: during a migration, you can’t simultaneously have strong consistency, high availability, and zero downtime for writes if you only have one source of truth. By setting up replication, you create a temporary state where you have two potential sources of truth, allowing you to switch over with minimal interruption. The key is that the application must be able to gracefully pause writes for a very short interval.

The most surprising true thing about this process is that the "zero downtime" is a bit of a misnomer for the writes. There’s a very brief window, typically milliseconds to seconds, where writes are paused. The "zero downtime" applies to the application’s availability to serve requests, as reads can continue uninterrupted and the write pause is so short it’s usually imperceptible to end-users.

The exact levers you control are the replication slot names, the WAL sender/receiver parameters on both primary and replica, the recovery_target_time or recovery_target_lsn if you need to stop replication at a specific point in time before promoting, and the application’s connection pooling logic to seamlessly switch endpoints.

The next concept you’ll grapple with is how to handle schema changes during this migration, which requires a different set of tools and strategies, often involving online schema migration utilities.

Want structured learning?

Take the full Sharding course →