Redis is refusing writes with a `NOREPLICAS` error because the master can’t see enough healthy, in-sync replicas to satisfy its `min-replicas-to-write` setting.
Here’s what’s actually broken: When a Redis master is configured with `min-replicas-to-write` (and its companion `min-replicas-max-lag`), it protects data integrity by accepting writes only while at least that many replicas are connected and have acknowledged the replication stream within the allowed lag. If too few replicas qualify, the master stops accepting writes rather than risk losing data that exists only on the master. The `NOREPLICAS` error specifically means the master has lost its connection to, or stopped receiving timely acknowledgments from, enough of its replicas.
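The rule described above can be sketched in a few lines. This is a simplified model and not Redis source code: the `ReplicaState` class and both function names are hypothetical, and the treatment of `0` as "check disabled" follows the comments in the stock `redis.conf`.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    """What the master knows about one attached replica (hypothetical model)."""
    name: str
    online: bool
    seconds_since_ack: int


def count_good_replicas(replicas, max_lag):
    """Count replicas that are online and acknowledged within max_lag seconds."""
    return sum(1 for r in replicas if r.online and r.seconds_since_ack <= max_lag)


def write_allowed(replicas, min_replicas_to_write, max_lag):
    """Return True if a write would be accepted under the modeled rule."""
    # Assumption: setting either option to 0 disables the check entirely.
    if min_replicas_to_write == 0 or max_lag == 0:
        return True
    return count_good_replicas(replicas, max_lag) >= min_replicas_to_write
```

With one replica that acknowledged a second ago and another that has been silent for 30 seconds, a `max_lag` of 10 leaves one good replica, so a threshold of 2 would reject writes in this model.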
Here are the common causes and how to fix them:
1. Network Connectivity Issues Between Master and Replicas
- Diagnosis: Check basic network reachability. From the master node, try `ping <replica_ip>` and `telnet <replica_ip> 6379`. From a replica node, try `ping <master_ip>` and `telnet <master_ip> 6379`. Also check firewall rules on both master and replica machines and on any network devices in between. Ensure port 6379 (or your configured Redis port) is open.
- Fix: Resolve network issues. This might involve updating firewall rules (`ufw allow 6379/tcp` on Debian/Ubuntu, or `firewall-cmd --zone=public --add-port=6379/tcp --permanent && firewall-cmd --reload` on RHEL/CentOS), configuring security groups in cloud environments, or fixing routing problems.
- Why it works: Redis replicas need to maintain a persistent TCP connection to the master to receive the replication stream. If this connection is broken or blocked, the replica falls out of sync and the master no longer counts it as a good replica.
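Where `telnet` is not installed, the same TCP reachability check can be scripted. The helper below is a minimal sketch that uses only the Python standard library; the function name `port_reachable` is an illustrative choice, and a successful connection only shows that the port accepts TCP connections, not that replication is healthy.

```python
import socket


def port_reachable(host, port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run it from the master against each replica address and from each replica against the master, since a firewall may block only one direction.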
2. High Load or Slow I/O on Replicas
- Diagnosis: Monitor the replicas. Run `redis-cli -h <master_ip> INFO replication` and `redis-cli -h <replica_ip> INFO replication`. If the master’s `master_repl_offset` is significantly higher than the replica’s `slave_repl_offset`, the replica is lagging. The master’s output also lists each attached replica with its own `offset` and `lag` (seconds since its last acknowledgment). Also check the system load on the replica machines (`top`, `htop`) and disk I/O performance (`iostat`). High CPU usage or a slow disk can prevent replicas from processing commands and acknowledging data in a timely manner.
- Fix: Optimize replica performance. This could mean upgrading the replica hardware (more CPU, faster SSDs), reducing the workload on the replica (e.g., by offloading read queries to other replicas or dedicated read replicas), or tuning Redis configuration on the replicas (e.g., `maxmemory` to prevent swapping).
- Why it works: Replicas must process commands from the master and persist them to their own data store. If they are too busy or their disks are too slow, they can’t keep up with the replication stream, causing them to lag behind the master’s offset.
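The offset comparison can be automated by parsing the text the master returns for `INFO replication`. The sketch below assumes the usual master-side layout, where each replica appears as a `slaveN:ip=...,port=...,state=...,offset=...,lag=...` line; the function name and the report fields are hypothetical.

```python
def replica_lag_report(info_text):
    """Parse master-side INFO replication text into per-replica lag figures."""
    master_offset = None
    replicas = []
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
        elif line.startswith("slave") and ":" in line and "=" in line:
            # Per-replica lines look like "slave0:ip=...,offset=...,lag=...".
            name, fields = line.split(":", 1)
            replicas.append((name, dict(f.split("=", 1) for f in fields.split(","))))
    if master_offset is None:
        raise ValueError("master_repl_offset not found; is this a master's output?")
    return [
        {
            "replica": name,
            "state": kv.get("state"),
            "bytes_behind": master_offset - int(kv["offset"]),
            "lag_seconds": int(kv["lag"]),
        }
        for name, kv in replicas
    ]
```

A replica whose `bytes_behind` keeps growing, or whose `lag_seconds` exceeds `min-replicas-max-lag`, is the one to investigate first.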
3. Network Latency or Bandwidth Saturation
- Diagnosis: Measure network latency between the master and replicas using `ping`. If latency is consistently high (e.g., > 100ms), it can slow down replication. Also, monitor network bandwidth utilization on the interfaces connecting the master and replicas. If the link is saturated, replication data can’t be sent quickly enough.
- Fix: Improve network infrastructure. This might involve moving master and replicas to the same availability zone or region, increasing network bandwidth, or optimizing network routes.
- Why it works: Replication relies on sending a continuous stream of commands over the network. High latency or insufficient bandwidth means this stream is delayed, causing replicas to fall behind the master’s current state.
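The effect of a saturated link can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement tool: all three inputs are figures you would supply yourself, and `buffer_limit_bytes` stands in for whatever bound applies in your setup (for example, the replica class of `client-output-buffer-limit` on the master).

```python
def seconds_until_buffer_full(write_bytes_per_sec, link_bytes_per_sec, buffer_limit_bytes):
    """Estimate how long pending replication data takes to reach a buffer limit.

    Returns None when the link keeps up and the pending data does not grow.
    """
    growth = write_bytes_per_sec - link_bytes_per_sec
    if growth <= 0:
        return None
    return buffer_limit_bytes / growth
```

For instance, if the master generates 50 MB/s of replication traffic over a link that carries 40 MB/s, a 256 MiB limit is reached in under half a minute in this model.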
4. Master Node Overload
- Diagnosis: Monitor the master node’s performance. Use `redis-cli -h <master_ip> INFO stats` and look at `total_commands_processed`, `instantaneous_ops_per_sec`, and `rejected_connections`. High command throughput, especially with slow commands or network issues on the master itself, can prevent it from efficiently sending data to replicas. Also, check the master’s system load.
- Fix: Scale up or optimize the master. This could involve upgrading the master’s hardware, optimizing queries that are slow on the master, or sharding your data if you’re hitting Redis’s throughput limits.
- Why it works: If the master is overwhelmed with processing its own requests, it may not have enough resources to efficiently send replication data to all its replicas, causing them to lag.
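The `INFO stats` check above can be turned into a small alerting helper. This is a sketch under assumptions: the 50,000 ops/sec default is an arbitrary placeholder rather than a recommended limit, and both function names are hypothetical.

```python
def parse_info(info_text):
    """Turn INFO output (key:value lines) into a dict of strings."""
    result = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        result[key] = value
    return result


def master_warnings(info_text, max_ops_per_sec=50000):
    """Return warning strings; max_ops_per_sec is a hypothetical threshold."""
    stats = parse_info(info_text)
    warnings = []
    ops = int(stats.get("instantaneous_ops_per_sec", 0))
    if ops > max_ops_per_sec:
        warnings.append(f"ops/sec {ops} exceeds threshold {max_ops_per_sec}")
    rejected = int(stats.get("rejected_connections", 0))
    if rejected > 0:
        warnings.append(f"{rejected} connections rejected")
    return warnings
```

Pick the threshold from your own baseline; a master that is comfortable at one throughput on one machine may struggle at the same figure on another.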
5. Master-Replica Configuration Mismatch (e.g., `repl-timeout`)
- Diagnosis: Review `redis.conf` on both master and replicas. Pay attention to settings related to replication and persistence. For example, if `repl-timeout` is too short for a replica whose disk is a bottleneck, the replication link can be dropped before the replica finishes syncing, and it never catches up.
- Fix: Adjust replication-related configuration. On replicas experiencing slow disk I/O, experiment with raising `repl-timeout`, or evaluate diskless replication (`repl-diskless-sync` on the master, `repl-diskless-load` on the replica) on Redis versions that support these options. Ensure `repl-ping-replica-period` (named `repl-ping-slave-period` before Redis 5) is not too high and stays well below `repl-timeout`, as otherwise the link can be dropped prematurely when traffic is low or the network is unstable.
- Why it works: Specific configuration parameters directly impact how efficiently a replica can receive and persist replicated data. Tuning these can unblock slow replicas.
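One configuration pitfall is easy to check mechanically: the comments in the stock `redis.conf` advise keeping `repl-timeout` greater than the replica ping period. The sketch below applies that check to a plain dictionary of settings, such as one built from `CONFIG GET` output; the function name is hypothetical.

```python
def replication_config_issues(config):
    """Check a dict of config values (strings or ints) for a known pitfall."""
    issues = []
    # Older servers use the "slave" spelling; accept either name.
    ping = config.get("repl-ping-replica-period", config.get("repl-ping-slave-period"))
    timeout = config.get("repl-timeout")
    if ping is not None and timeout is not None and int(timeout) <= int(ping):
        issues.append(
            f"repl-timeout ({timeout}) should be greater than the ping period ({ping})"
        )
    return issues
```

An empty list means only that this one relationship holds; it does not validate the rest of the replication settings.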
6. Redis Version Bugs or Known Issues
- Diagnosis: Check the Redis release notes for your specific version and known issues related to replication or `min-replicas-to-write`. Sometimes, specific network conditions or command patterns can trigger bugs.
- Fix: Upgrade to a stable, newer version of Redis. If a bug is identified, upgrading to a patched version is often the most reliable solution.
- Why it works: Software bugs can cause unexpected behavior, including incorrect replication status reporting or failure to maintain connections, which can be resolved by applying fixes in later versions.
7. Incorrect `min-replicas-to-write` Value or `replica-serve-stale-data`
- Diagnosis: Examine your Redis master’s `redis.conf` for the `min-replicas-to-write` and `min-replicas-max-lag` directives. If `min-replicas-to-write` is set to a value (e.g., 3) that is higher than the number of currently connected and synchronized replicas, writes will be blocked. Also check `replica-serve-stale-data` on replicas; if it’s `no`, a replica that has lost its master or is still syncing refuses most commands. That isn’t directly related to writes, but it can be a symptom of the same replication issues.
- Fix: Adjust `min-replicas-to-write` and `min-replicas-max-lag` on the master. If you have, say, only 2 replicas and `min-replicas-to-write` is 3, lower it to 1 or 2. Or, if you need 3 replicas, you must ensure 3 are healthy and connected. `min-replicas-max-lag` (in seconds) can also block writes if replicas are slow to acknowledge; raising this value weakens the durability guarantee, but it can unblock writes if the lag is transient.
- Why it works: These settings directly control the master’s decision to accept writes based on replica health. Misconfiguration here is a direct cause of the error.
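The mismatch between the configured threshold and the replicas actually attached can be detected from a client. The sketch below is deliberately duck-typed: `client` is any object exposing `config_get(pattern)` and `info(section)` methods that return dictionaries, which is how redis-py's `redis.Redis` behaves as far as this example relies on it; the function name is hypothetical.

```python
def min_replicas_problem(client):
    """Describe a min-replicas-to-write misconfiguration, or return None.

    Assumes client.config_get and client.info return dicts, as redis-py does.
    """
    cfg = client.config_get("min-replicas-*")
    required = int(cfg.get("min-replicas-to-write", 0))
    connected = int(client.info("replication").get("connected_slaves", 0))
    if required > connected:
        return (
            f"min-replicas-to-write is {required} but only "
            f"{connected} replica(s) are connected"
        )
    return None
```

Note that `connected_slaves` counts attached replicas regardless of lag, so a `None` result here does not rule out replicas that are connected but exceed `min-replicas-max-lag`.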
After resolving these issues, the next error you might encounter is related to OOM (Out Of Memory) if the master or replicas are not adequately provisioned for their workload.