The RabbitMQ cluster has experienced a network partition, leading to a split-brain scenario in which nodes can no longer communicate with each other and each side operates independently, risking data inconsistencies.
Cause 1: Firewall Blocking Ports
- Diagnosis: Check firewall rules on all nodes. For example, on systems using iptables:

      sudo iptables -L -n | grep -E '4369|5672|15672|25672'

  You should see ACCEPT rules for TCP ports 5672 (AMQP), 15672 (Management UI), and 25672 (inter-node communication). Port 4369 (epmd, the Erlang port mapper used for peer discovery) must also be reachable between nodes.
- Fix: If a port is blocked, add an ACCEPT rule. For example, to allow inter-node communication on port 25672:

      sudo iptables -A INPUT -p tcp --dport 25672 -j ACCEPT
      sudo iptables -A OUTPUT -p tcp --dport 25672 -j ACCEPT
      sudo service iptables save

  This opens the necessary communication channel, allowing nodes to rejoin the cluster. How rules are persisted varies by distribution; the save command shown applies to RHEL-style systems.
- Why it works: RabbitMQ nodes use port 25672 for Erlang distribution and cluster membership. Blocking this prevents nodes from seeing each other.
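
The firewall check above can be sketched as a small helper that reads an `iptables -L -n` listing and reports which ports have no ACCEPT rule. This is a rough heuristic, not a definitive audit: the function name `missing_ports` is hypothetical, the port list (including 4369 for epmd) is an assumption to adjust for your setup, and it will not recognize multiport rules or a chain whose default policy is already ACCEPT.

```shell
#!/bin/sh
# Hypothetical helper: list RabbitMQ ports that have no ACCEPT rule.
# Reads the output of `iptables -L -n` on stdin.
missing_ports() {
    rules=$(cat)
    # Assumed port list: epmd, AMQP, management UI, inter-node distribution.
    for port in 4369 5672 15672 25672; do
        # Require "dpt:<port>" followed by a space or end of line, so that
        # 5672 does not match inside 15672 or 25672.
        if ! printf '%s\n' "$rules" | grep -Eq "^ACCEPT.*dpt:${port}( |\$)"; then
            printf '%s\n' "$port"
        fi
    done
}
```

A possible invocation on a node would be `sudo iptables -L -n | missing_ports`; an empty result suggests every listed port has at least one ACCEPT rule.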
Cause 2: Incorrect Erlang Cookie
- Diagnosis: On each node, cat the Erlang cookie file:

      sudo cat /var/lib/rabbitmq/.erlang.cookie

  Ensure the content (a seemingly random string of characters) is identical on all nodes in the cluster.
- Fix: If cookies differ, copy the correct cookie from one node to all others, ensuring the file has proper permissions:

      # On the node with the correct cookie, copy it:
      scp /var/lib/rabbitmq/.erlang.cookie user@other_node:/tmp/correct_cookie
      # On the other nodes, replace the cookie:
      sudo mv /tmp/correct_cookie /var/lib/rabbitmq/.erlang.cookie
      sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
      sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie

  Restart RabbitMQ on all nodes after updating the cookie. This allows Erlang runtimes on different machines to authenticate each other for cluster communication.
- Why it works: The Erlang cookie is a shared secret that all nodes in an Erlang cluster must possess to communicate. Mismatched cookies prevent nodes from recognizing each other.
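
Comparing cookies by eye is error-prone, and printing the secret to a terminal is not ideal. One possible approach, sketched here, is to compare hashes instead. The function name `cookie_mismatches` and the "node hash" input format are assumptions made for this example; the first line is treated as the reference.

```shell
#!/bin/sh
# Hypothetical helper: read "node hash" pairs on stdin and print every node
# whose hash differs from the first (reference) line.
cookie_mismatches() {
    awk 'NR == 1 { ref = $2; next } $2 != ref { print $1 }'
}

# Example of producing the input (node names are placeholders):
#   for n in rabbitmq-node-1 rabbitmq-node-2; do
#       printf '%s %s\n' "$n" \
#           "$(ssh "$n" sudo sha256sum /var/lib/rabbitmq/.erlang.cookie | cut -d' ' -f1)"
#   done | cookie_mismatches
```

No output means every node reported the same hash as the reference node.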
Cause 3: DNS Resolution Issues or Static IP Mismatches
- Diagnosis: On each node, try to ping and telnet to the hostnames/IPs of all other nodes in the cluster.

      ping rabbitmq-node-2
      telnet rabbitmq-node-2 25672

  Verify that the IP address ping reports matches the IP addresses configured in RabbitMQ's cluster configuration (if static IPs are used) or that DNS consistently resolves to the correct IPs.
- Fix: Correct DNS records, or update the NODENAME environment variable in /etc/rabbitmq/rabbitmq-env.conf on each node if static IPs are configured incorrectly. For example, if rabbitmq-node-1 should be rabbit@192.168.1.10:

      # In /etc/rabbitmq/rabbitmq-env.conf
      NODENAME=rabbit@192.168.1.10

  Restart RabbitMQ on all nodes. This ensures nodes are addressing each other using consistent and resolvable network identifiers. Note that a node name whose host part contains dots, such as an IP address, generally requires long node names to be enabled (USE_LONGNAME=true in the same file).
- Why it works: RabbitMQ nodes identify each other by their NODENAME, which typically includes their hostname or IP. Incorrect resolution or configuration means nodes are trying to connect to the wrong network endpoints.
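
The resolution check can be scripted so that it is repeatable across nodes. The sketch below assumes a Linux system where `getent` is available; the function names `resolve_host` and `check_node`, the `RESOLVER` override, and the example address are hypothetical and not part of RabbitMQ.

```shell
#!/bin/sh
# Resolve a hostname to its first address using the system resolver.
resolve_host() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Hypothetical helper: compare the resolved address of a host with the
# address you expect. RESOLVER may name another function or command
# (useful for testing); it defaults to resolve_host.
check_node() {
    host=$1
    expected=$2
    actual=$("${RESOLVER:-resolve_host}" "$host")
    if [ "$actual" = "$expected" ]; then
        printf 'OK %s -> %s\n' "$host" "$actual"
    else
        printf 'MISMATCH %s -> %s (expected %s)\n' \
            "$host" "${actual:-unresolved}" "$expected"
        return 1
    fi
}
```

Running something like `check_node rabbitmq-node-2 192.168.1.11` on every node (with your real addresses) shows quickly whether any node sees a different address for a peer.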
Cause 4: Network Latency or Unstable Connectivity
- Diagnosis: Use ping with a higher count and traceroute to check for packet loss and high latency between nodes.

      ping -c 100 rabbitmq-node-2
      traceroute rabbitmq-node-2

  Sustained packet loss (e.g., >1%) or high RTT (>100ms consistently) can cause cluster instability.
- Fix: Investigate and resolve underlying network issues. This might involve checking physical network cables, switches, routers, or working with your network team to ensure stable, low-latency connectivity. For temporary workarounds or specific cloud environments, consider raising the Erlang net tick time, which governs the heartbeats exchanged between nodes (though this is generally discouraged for long-term fixes):

      # In rabbitmq-env.conf
      NODENAME=rabbit@your_node_name
      RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_timeout 30000 -kernel net_ticktime 120"

  The net_ticktime value (in seconds, 60 by default) controls how long Erlang waits for a heartbeat before treating a peer as unreachable. Raising it above the default tolerates higher latency but masks underlying problems. Restart RabbitMQ after changes.
- Why it works: Erlang distribution relies on timely acknowledgments. High latency or dropped packets cause heartbeats to time out, leading nodes to believe others have failed and thus form partitions.
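
To apply the thresholds mentioned above consistently, the ping summary can be parsed rather than read by eye. This is a minimal sketch assuming the summary format printed by common Linux and BSD `ping` implementations; the function name `ping_health` is hypothetical, and the 1% and 100 ms limits are the illustrative values from the text, not RabbitMQ-defined limits.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and flag the link as
# UNSTABLE when loss exceeds 1% or the average RTT exceeds 100 ms.
ping_health() {
    awk '
        /packet loss/ {
            # The loss figure is the field ending in "%".
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^(rtt|round-trip)/ {
            # Line looks like: rtt min/avg/max/mdev = a/b/c/d ms
            split($0, halves, " = ")
            split(halves[2], stats, "/")
            avg = stats[2]
        }
        END {
            status = (loss + 0 > 1 || avg + 0 > 100) ? "UNSTABLE" : "OK"
            printf "%s loss=%s%% avg_rtt=%sms\n", status, loss, avg
        }
    '
}
```

For example, `ping -c 100 rabbitmq-node-2 | ping_health` prints a single summary line per peer, which is convenient to collect from several nodes.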
Cause 5: Resource Exhaustion (CPU, Memory, Disk I/O)
- Diagnosis: Monitor system resources on each node using top, htop, vmstat, and iostat.

      top -c
      vmstat 5 10
      iostat -xz 5

  Consistently high CPU usage (>90%), low free memory, or constant disk activity can make nodes unresponsive to cluster communication.
- Fix: Optimize RabbitMQ configurations, scale up the nodes (more CPU/RAM), or address the underlying application logic that is causing the resource strain. Ensure disk I/O is not a bottleneck, especially for message persistence. This might involve upgrading storage or tuning OS-level disk settings.
- Why it works: When a node is overwhelmed, it cannot process network messages or respond to heartbeats in a timely manner, leading other nodes to perceive it as down.
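
As a rough way to turn the vmstat samples into a yes/no signal, the idle column can be averaged and compared against the 90% usage figure mentioned above (that is, less than 10% idle). The function name `cpu_idle_check` is hypothetical, the column is located by its `id` header as printed by procps vmstat on Linux, and for simplicity the first row (averages since boot) is included in the mean.

```shell
#!/bin/sh
# Hypothetical helper: read `vmstat` output on stdin and report BUSY when
# the average idle percentage across all sample rows is below 10.
cpu_idle_check() {
    awk '
        # Header row: remember which column is labelled "id" (CPU idle).
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
        # Sample rows start with a number once the header has been seen.
        col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            avg = sum / n
            printf "%s avg_idle=%d%%\n", (avg < 10 ? "BUSY" : "OK"), avg
        }
    '
}
```

A possible use is `vmstat 5 10 | cpu_idle_check`, run on each node while the cluster is under its normal load.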
Cause 6: RabbitMQ Node Crash or Unclean Shutdown
- Diagnosis: Check RabbitMQ logs (/var/log/rabbitmq/rabbit@<nodename>.log) and system logs (/var/log/syslog or journalctl) for crash reports, segmentation faults, or explicit shutdown messages.
- Fix: If a node crashed, investigate the root cause (e.g., OOM killer, bug). Once resolved, restart the node and rejoin it to the cluster. If it was an unclean shutdown, ensure the data directory (/var/lib/rabbitmq/mnesia/) is intact. You might need to reset the node if it cannot recover its cluster state:

      # On the problematic node, with the RabbitMQ service running
      rabbitmqctl stop_app
      rabbitmqctl reset
      # Still on the problematic node: join by naming a healthy cluster member
      rabbitmqctl join_cluster rabbit@<other_node_name>
      rabbitmqctl start_app

  This forces the node to forget its previous cluster membership and rejoin as new, allowing it to synchronize state. Be aware that reset also deletes the node's local data, so use it only when the node cannot recover on its own.
- Why it works: A crashed or improperly shut down node can lose its cluster state and become unable to communicate, forcing a partition. Rejoining or resetting allows it to re-establish communication and sync data.
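
When checking system logs for the OOM killer, a quick filter can help separate kernel kill events from routine messages. The sketch below counts lines containing phrases the Linux kernel commonly logs in that situation; the function name `oom_events` is hypothetical and the pattern list is an assumption that may need extending for your distribution.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print the number of
# lines that look like kernel OOM-killer activity (case-insensitive).
oom_events() {
    grep -Eic 'out of memory|oom-killer|killed process'
}

# Example invocations (not run here):
#   journalctl -k | oom_events
#   oom_events < /var/log/syslog
```

A non-zero count is a hint to look at the matching lines in full and check whether the killed process was the Erlang VM (typically `beam.smp`).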
After the partition is resolved, queues may be left with a large backlog if messages were not consumed during the downtime, which can require manual intervention.