Migrating database shards without downtime is less about a magical trick and more about a carefully orchestrated, multi-stage dance where data is mirrored, then switched, then cleaned up.
Let’s watch a shard migration in progress. Imagine we have a sharded PostgreSQL cluster. We want to move shard_01 from node_a to node_b without interrupting any application writes or reads.
Here’s a simplified view of the data flow during the live cutover:
Initial State:
- Application writes to
shard_01onnode_a. node_ais the primary forshard_01.
Stage 1: Replication Setup
- A new PostgreSQL instance is provisioned on
node_b. node_bis configured as a streaming replica ofnode_aforshard_01.node_asends its Write-Ahead Log (WAL) records tonode_b.node_bapplies these WAL records, bringing it nearly in sync withnode_a.
-- On node_a (the primary)
SELECT pg_create_physical_replication_slot('migration_slot');
-- On node_b (the new replica)
-- Assuming pg_hba.conf allows replication from node_a
-- Assuming postgresql.conf has wal_level = replica and hot_standby = on
SELECT pg_basebackup -D /var/lib/postgresql/14/main_shard_01 -h node_a -p 5432 -U replicator -v -P -R -S migration_slot;
-- The -R option in pg_basebackup automatically creates the recovery.conf (or standby.signal and postgresql.auto.conf in newer versions)
-- with primary_conninfo and recovery_target_slot.
Stage 2: Application Awareness & Dual Writes (Optional but Recommended)
- The application is modified to be aware of the migration. It can now write to either
node_aornode_b. - A small percentage of writes might be sent to
node_bto test its write capacity and ensure consistency. This is often called "dual writes" or "shadow writes."
Stage 3: The Cutover (The Moment of Truth)
- Stop Application Writes: A brief pause is introduced where the application stops sending any new writes to
shard_01. This is the only point of actual quiescence. - Final WAL Sync: Ensure
node_bhas received all WAL records fromnode_aup to the point writes were stopped. - Promote
node_b: Makenode_bthe new primary forshard_01. - Resume Application Writes: The application is now directed to send all writes to
shard_01onnode_b.
-- On node_b (the replica)
SELECT pg_promote();
Stage 4: Verification & Switchback Readiness
- The application performs read and write tests against
node_bto confirm its health. - The old primary (
node_a) is kept running as a replica of the new primary (node_b) for a short period, allowing for an immediate rollback if necessary.
-- On node_a (the old primary, now a replica)
-- Stop the old replication process
SELECT pg_stop_replication();
-- Reconfigure node_a to replicate from node_b
-- This involves updating postgresql.conf or postgresql.auto.conf and standby.signal
-- Example using postgresql.auto.conf:
-- primary_conninfo = 'host=node_b port=5432 user=replicator password=...'
-- recovery_target_slot = 'rollback_slot' -- a new slot on node_b
Stage 5: Cleanup
- Once confidence in
node_bis high, the replication slot onnode_b(if created for rollback) and the old primary (node_a) can be decommissioned.
-- On node_b (the new primary)
SELECT pg_drop_replication_slot('rollback_slot');
-- Then decommission node_a
The core problem this solves is the classic CAP theorem trade-off: during a migration, you can’t simultaneously have strong consistency, high availability, and zero downtime for writes if you only have one source of truth. By setting up replication, you create a temporary state where you have two potential sources of truth, allowing you to switch over with minimal interruption. The key is that the application must be able to gracefully pause writes for a very short interval.
The most surprising true thing about this process is that the "zero downtime" is a bit of a misnomer for the writes. There’s a very brief window, typically milliseconds to seconds, where writes are paused. The "zero downtime" applies to the application’s availability to serve requests, as reads can continue uninterrupted and the write pause is so short it’s usually imperceptible to end-users.
The exact levers you control are the replication slot names, the WAL sender/receiver parameters on both primary and replica, the recovery_target_time or recovery_target_lsn if you need to stop replication at a specific point in time before promoting, and the application’s connection pooling logic to seamlessly switch endpoints.
The next concept you’ll grapple with is how to handle schema changes during this migration, which requires a different set of tools and strategies, often involving online schema migration utilities.