A Redis Cluster reports CLUSTERDOWN when the node you are talking to cannot reach a majority of master nodes, or when some hash slots are no longer served by any reachable master. In this state the cluster stops serving requests until quorum and slot coverage are restored.
Cause 1: Network Partition Between Master Nodes
Diagnosis:
Check connectivity between master nodes using redis-cli from one master to another.
redis-cli -c -h <master1_ip> -p <master1_port> cluster nodes
redis-cli -c -h <master2_ip> -p <master2_port> cluster nodes
Look for nodes that are missing from the output, that carry the fail or fail? flag, or whose link state is disconnected. Also, verify firewall rules and network ACLs on the servers hosting your Redis masters.
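The connectivity check above can be scripted. The following is a minimal sketch, not a definitive tool: the function names bus_port and check_node are hypothetical, it assumes a netcat build that supports -z and -w, and it assumes the cluster bus port is the client port plus 10000 (the Redis default unless cluster-port or cluster-announce-bus-port overrides it).

```shell
#!/bin/sh
# Sketch: verify that both the client port and the cluster bus port of a
# node accept TCP connections from this host.

bus_port() {
    # Print the default cluster bus port for a given client port.
    # Assumption: default offset of 10000, no custom bus port configured.
    echo $(( $1 + 10000 ))
}

check_node() {
    # Usage: check_node <ip> <client_port>
    # Assumption: `nc` supports -z (scan only) and -w (timeout in seconds).
    host=$1
    port=$2
    for p in "$port" "$(bus_port "$port")"; do
        if nc -z -w 2 "$host" "$p" 2>/dev/null; then
            echo "$host:$p reachable"
        else
            echo "$host:$p UNREACHABLE"
        fi
    done
}
```

Run check_node <master2_ip> <master2_port> from each master host in turn. A node whose client port is reachable but whose bus port is not can still answer redis-cli while being invisible to its peers.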
Fix:
Identify the specific network path or firewall rule blocking communication on port 6379 (the default Redis client port) and port 16379 (the default Redis Cluster bus port, which is the client port plus 10000). Open both ports between all cluster nodes.
# Example firewall rule (ufw)
sudo ufw allow from <master_ip_address> to any port 6379,16379 proto tcp
This fix works by restoring the necessary TCP connections for the Redis Cluster bus, allowing nodes to exchange heartbeats and maintain cluster state.
Cause 2: Master Node Crashed or Unreachable
Diagnosis:
Attempt to connect to the suspected master node using redis-cli.
redis-cli -h <crashed_master_ip> -p <crashed_master_port> ping
If the ping command times out or returns Connection refused, the node is down. Check the server’s status and the Redis process.
Fix: Restart the Redis service on the affected master node.
# For systemd
sudo systemctl restart redis-server
# For init.d
sudo service redis-server restart
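This fix works by bringing the downed Redis process back online, allowing it to rejoin the cluster. If one of its replicas was promoted in the meantime, the restarted node rejoins as a replica of the new master rather than reclaiming its old role.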
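After a restart it helps to confirm that the cluster has actually recovered rather than assuming it. A minimal sketch, assuming redis-cli is on the PATH; the function names cluster_state and wait_for_cluster_ok are hypothetical helpers, and the 30-attempt default is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: poll `cluster info` until cluster_state becomes ok.

cluster_state() {
    # Read `cluster info` output on stdin and print the cluster_state value.
    # The reply uses CRLF line endings, so carriage returns are stripped first.
    tr -d '\r' | awk -F: '$1 == "cluster_state" { print $2 }'
}

wait_for_cluster_ok() {
    # Usage: wait_for_cluster_ok <ip> <port> [attempts]
    # Returns 0 once the node reports cluster_state:ok, 1 if it never does.
    attempts=${3:-30}
    while [ "$attempts" -gt 0 ]; do
        state=$(redis-cli -h "$1" -p "$2" cluster info 2>/dev/null | cluster_state)
        [ "$state" = "ok" ] && return 0
        attempts=$(( attempts - 1 ))
        sleep 1
    done
    return 1
}
```

For example, wait_for_cluster_ok <any_master_ip> <any_master_port> 60 waits up to roughly a minute before giving up.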
Cause 3: Insufficient Master Nodes for Quorum
Diagnosis: Count the number of master nodes that are currently reachable and participating in the cluster.
redis-cli -h <any_master_ip> -p <any_master_port> cluster nodes | grep master | grep -v fail | wc -l
A cluster needs a majority of master nodes to be available to form a quorum. For example, a 6-master cluster needs at least 4 masters online to maintain quorum. If you have fewer than the required number of masters, the cluster will enter CLUSTERDOWN.
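The majority arithmetic can be checked against real cluster nodes output. This is a sketch under stated assumptions: the function names are hypothetical, a master counts only if it owns at least one slot (a ninth field is present), and any flag containing fail, noaddr, or handshake is treated as unhealthy.

```shell
#!/bin/sh
# Sketch: compare the number of healthy masters with the required majority.
# All functions read `cluster nodes` output on stdin.

count_masters() {
    # Masters that own at least one slot (fields 9 and up list the slots).
    awk '$3 ~ /master/ && NF >= 9 { n++ } END { print n + 0 }'
}

count_healthy_masters() {
    # Same as above, minus masters flagged fail, fail?, noaddr or handshake.
    awk '$3 ~ /master/ && NF >= 9 && $3 !~ /fail|noaddr|handshake/ { n++ }
         END { print n + 0 }'
}

majority() {
    # Smallest count that is more than half of $1.
    echo $(( $1 / 2 + 1 ))
}
```

Save the output of redis-cli -h <any_master_ip> -p <any_master_port> cluster nodes to a file, pipe it into count_masters and count_healthy_masters, and compare the healthy count with majority applied to the total.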
Fix:
Bring the failed masters back online first if you can. If that is not possible, add new master nodes: ensure they are configured correctly, then join them with redis-cli --cluster add-node. Only masters that own at least one hash slot count toward the majority, so a new node helps only once slots have been assigned to it.
redis-cli --cluster add-node <new_master_ip>:<new_master_port> <existing_master_ip>:<existing_master_port>
After adding nodes, you might need to rebalance slots.
redis-cli --cluster rebalance <existing_master_ip>:<existing_master_port> --cluster-use-empty-masters
This fix works by increasing the number of voting members, allowing the cluster to reach the minimum required count for consensus on cluster state.
Cause 4: Configuration Mismatch on Master Nodes
Diagnosis:
Compare the cluster-config-file contents on multiple master nodes.
cat /var/lib/redis/nodes.conf # or wherever your cluster-config-file is
Look for discrepancies in node IDs, slot assignments, or peer information that might indicate a node is operating with stale or incorrect cluster topology data.
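Comparing these files by eye is error-prone because the line order, the myself flag, and the ping/pong timestamps differ on every node. A minimal sketch of a normalizer (the name normalize_view is hypothetical) that keeps only the node ID, flags, master ID, and slot ranges, so two views can be compared with diff. It accepts either a copy of nodes.conf or live cluster nodes output.

```shell
#!/bin/sh
# Sketch: reduce a cluster view to its stable fields so that views taken
# from different nodes can be compared directly.

normalize_view() {
    awk '
        $1 == "vars" { next }            # skip the epoch line found in nodes.conf
        NF < 8       { next }            # skip blank or malformed lines
        {
            flags = $3
            gsub(/myself,?/, "", flags)  # "myself" differs on every node
            sub(/,$/, "", flags)
            line = $1 " " flags " " $4
            for (i = 9; i <= NF; i++) line = line " " $i   # slot ranges
            print line
        }' | sort
}
```

For example, run normalize_view < nodes.conf > view1.txt on each master, copy the results to one host, and diff them. Any remaining difference points at a node with a stale or divergent topology.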
Fix:
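If a node's nodes.conf is significantly different or corrupted, you might need to reset it. Stop Redis, delete the nodes.conf file, and restart Redis. The node comes back with a new node ID and no knowledge of the cluster, so it does not rejoin on its own: re-add it from a healthy node with redis-cli --cluster add-node (or CLUSTER MEET). Its previous slot assignments are discarded, so do this only on a node whose slots are already served elsewhere.
sudo systemctl stop redis-server
sudo rm /var/lib/redis/nodes.conf
sudo systemctl start redis-server
This fix works by forcing the node to forget its old cluster identity. Once it has been re-added, it learns the current topology from the other nodes over the cluster bus.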
Cause 5: High Load or Resource Starvation on Master Nodes
Diagnosis: Monitor CPU, memory, and network I/O on your master nodes. High utilization can lead to slow responses, causing heartbeats to be missed and nodes to appear offline.
top -H -p "$(pgrep -d, redis-server)"
# Or use Redis `INFO` command for memory and client stats
redis-cli -h <master_ip> -p <master_port> info memory
redis-cli -h <master_ip> -p <master_port> info clients
Look for sustained high CPU usage, memory exhaustion, or an overwhelming number of client connections.
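To turn the info memory output into a single number you can alert on, the following sketch computes used_memory as a percentage of maxmemory. The function name memory_usage_percent is hypothetical, and the result is truncated to a whole number.

```shell
#!/bin/sh
# Sketch: read `info memory` output on stdin and report how full the node is.

memory_usage_percent() {
    # INFO replies use CRLF line endings, so strip carriage returns first.
    tr -d '\r' | awk -F: '
        $1 == "used_memory" { used = $2 }
        $1 == "maxmemory"   { max = $2 }
        END {
            if (max + 0 == 0) print "unlimited"   # maxmemory 0 means no limit
            else printf "%d\n", used * 100 / max
        }'
}
```

Usage: redis-cli -h <master_ip> -p <master_port> info memory | memory_usage_percent. A value that stays close to 100 suggests the node is evicting or rejecting writes, depending on maxmemory-policy.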
Fix:
Optimize your Redis usage (e.g., use pipelining, reduce large keys, tune maxmemory), scale up the server resources (CPU, RAM), or add more master nodes to distribute the load.
# Example: increase maxmemory if memory is the bottleneck
# In redis.conf:
# maxmemory 8gb
# maxmemory-policy allkeys-lru
This fix works by ensuring that master nodes have sufficient resources to process commands and send timely heartbeats, preventing them from being marked as failed by other cluster members.
Cause 6: Incorrect cluster-announce-ip or cluster-announce-port
Diagnosis:
Verify the cluster-announce-ip and cluster-announce-port settings in redis.conf for each master node. These should reflect the IP address and port that other nodes can use to reach this node, not necessarily the IP the node is listening on locally if it’s different (e.g., behind NAT).
grep -E "cluster-announce-(ip|port|bus-port)" /etc/redis/redis.conf
If these are set incorrectly, nodes will try to connect to unreachable addresses.
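The configuration file shows what a node is told to announce; the cluster nodes output shows what it actually announces. A minimal sketch that extracts the endpoint from the line flagged myself. The function name announced_endpoint is hypothetical, and the optional hostname suffix that newer Redis versions append after a comma is dropped.

```shell
#!/bin/sh
# Sketch: print the ip:port@busport that a node advertises for itself.
# Reads `cluster nodes` output on stdin.

announced_endpoint() {
    awk '$3 ~ /myself/ {
        addr = $2
        sub(/,.*/, "", addr)   # drop an optional ",hostname" suffix
        print addr
    }'
}
```

Usage: redis-cli -h <master_ip> -p <master_port> cluster nodes | announced_endpoint. If the printed address is a private or container-internal IP that the other nodes cannot route to, the announce settings need correcting.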
Fix:
Correct the cluster-announce-ip and cluster-announce-port in redis.conf to the IP address and port that are routable by other cluster members. Restart the Redis service after making changes.
# Example redis.conf
cluster-announce-ip 192.168.1.100
cluster-announce-port 6379
cluster-announce-bus-port 16379
This fix works by ensuring that every node advertises an address and port that the other cluster members can actually reach, which resolves communication failures caused by incorrect addressing.
After fixing CLUSTERDOWN, the next problem you will likely encounter is replication trouble: if you had to replace a master, its replicas may be out of sync or unable to find their new master.