The most surprising truth about online resharding is that it doesn’t actually move data in the way you’re probably imagining. Instead, it orchestrates a controlled, phased transition where a new set of shards takes over responsibility for serving read and write requests, while the old shards are gracefully decommissioned.
Let’s see this in action with a conceptual redis-cli session. Imagine we have a Redis cluster with three shards, and we want to add a fourth shard to distribute the load.
# Initial state: 3 shards
redis-cli --cluster check 192.168.1.10:6379
# ... output shows 3 shards, 1 master per shard, keyspace distribution ...
# Add a new node (which will become the 4th shard)
redis-cli --cluster add-node /path/to/new_redis.conf 192.168.1.10:6379
# ... output shows new node added, but not yet assigned slots ...
# Initiate the resharding process
redis-cli --cluster reshard 192.168.1.10:6379
# CLUSTER RESHARD> :1000 # Target slot range to move, e.g., 0-1000
# CLUSTER RESHARD> 192.168.1.15:6379 # IP of the new node to receive slots
# CLUSTER RESHARD> 1000 # Number of slots to move
# ... Redis CLI prompts for confirmation, then starts moving slots ...
# CLUSTER RESHARD> :2000 # Next range, e.g., 1001-2000
# CLUSTER RESHARD> 192.168.1.15:6379
# CLUSTER RESHARD> 1000
# ... and so on for all desired slots ...
# After all slots are moved and rebalanced
redis-cli --cluster check 192.168.1.10:6379
# ... output now shows 4 shards, keyspace more evenly distributed ...
The problem this solves is scaling a distributed data store like Redis Cluster when your existing shards become overloaded or when you need to add more capacity. Without online resharding, you’d typically have to stop all writes, reconfigure the cluster, and potentially rewrite data, leading to significant downtime.
Internally, redis-cli --cluster reshard orchestrates a series of commands. When you tell it to move a range of slots (e.g., 0-1000) to a new node, it doesn’t copy all the data for those slots immediately. Instead, it performs the following steps for each slot:
SET-SLOT IMPORTING <slot> <source-node-id>: The receiving node (the new shard) is instructed to prepare for incoming data for a specific slot. It tells other nodes, "Hey, I’m about to start receiving data for slot X. If you have any writes for it, send them to me."REPLICAOF NO ONE(if applicable): The source node (the old shard) temporarily stops replicating to its replicas for the slots being moved. This ensures that only the "master" data is considered for migration.MIGRATE <slot> <destination-ip> <destination-port> <timeout> <keys-to-migrate>: The source node then initiates aMIGRATEcommand for each key within the specified slot.MIGRATEatomically transfers a single key-value pair from the source to the destination. Crucially, if theMIGRATEcommand fails (e.g., the key was written to after the migration started), it retries.SET-SLOT MIGRATING <slot> <destination-node-id>: Once all keys for a slot have been successfully migrated, the source node tells other nodes, "I’m done with slot X. All traffic for it should now go to the new node."SET-SLOT NODE <slot> <destination-node-id>: Finally, the receiving node announces that it has taken full ownership of the slot.
This process is repeated for all slots, allowing clients to seamlessly switch their requests to the new shard as slots are migrated. The CLUSTER RESHARD command manages this entire dance, handling the coordination and retries.
The actual data transfer happens via the MIGRATE command, which is designed for atomic key transfer. It locks the key on the source, copies it to the destination, and then removes it from the source. If a write occurs on the source after the key has been selected for migration but before it’s fully migrated, MIGRATE will fail, and the resharding process will retry, ensuring consistency. The critical element is that the client’s request routing table (managed by the cluster bus) is updated after the data is confirmed to be on the new node, and before the old node relinquishes responsibility.
The most counterintuitive part of this process is how Redis handles writes that occur during a MIGRATE operation for a specific key. If a client attempts to write to a key that is currently being migrated, the MIGRATE command on the source node will fail. The resharding process then doesn’t just give up; it enters a state where it will periodically retry the MIGRATE for that specific key until it succeeds. Meanwhile, other clients that are correctly routed to the new node will be able to write to the key once it’s fully migrated and the cluster configuration is updated. This retry mechanism, combined with the client’s ability to discover the correct shard for a key, is what enables the "zero downtime" aspect for writes.
The next challenge after resharding is often rebalancing the keys across the new set of shards to ensure an even distribution.