A RabbitMQ node’s inability to rejoin a cluster after a network partition is usually because it’s holding onto stale cluster membership information that conflicts with the current state of the cluster.
Here’s how to fix it, covering the most common scenarios:
1. The Node is Still Discoverable but Won’t Join
This happens when the node can see the other nodes, but they don’t agree on who’s "in" and who’s "out" due to the partition.
- Diagnosis: Check the RabbitMQ log file on the node that failed to rejoin. Look for messages indicating it’s trying to connect but failing to establish a quorum or that it sees a different set of nodes. On the other nodes in the cluster, check their logs for messages about node discovery or cluster membership changes that don’t include the problematic node.
- Cause: The node you’re trying to rejoin has an outdated `.erlang.cookie` file, or the clocks on the nodes are significantly out of sync. Erlang nodes authenticate each other with the cookie, so a mismatch makes every connection attempt fail. Clock drift does not affect the cookie check itself, but it can break time-dependent checks such as inter-node TLS certificate validation, and it makes logs from different nodes hard to correlate.
  - Diagnosis Command:
    - On the problematic node: `sudo cat /var/lib/rabbitmq/.erlang.cookie`
    - On a healthy node: `sudo cat /var/lib/rabbitmq/.erlang.cookie`
    - On both nodes: `date`
  - Fix: Ensure the `.erlang.cookie` file is identical on all nodes. If it’s not, copy the correct one to the problematic node, keep it readable only by the `rabbitmq` user, and restart RabbitMQ. Synchronize clocks using NTP:
    - `sudo apt update && sudo apt install ntp -y` (or the `yum` equivalent)
    - `sudo systemctl start ntp`
    - `sudo systemctl enable ntp`
  - Why it works: A matching cookie is essential for nodes to authenticate each other as part of the same cluster. Synchronized clocks rule out time-dependent failures and keep log timestamps comparable across nodes.
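If you’d rather not print the cookie itself to a terminal, comparing digests gives the same answer. The following is a minimal sketch, assuming `sha256sum` is available; the function names `cookie_digest` and `cookies_match` are made up for illustration and are not part of RabbitMQ.

```sh
# Print the SHA-256 digest of a cookie file, or fail if it cannot be read.
cookie_digest() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    sha256sum "$1" | cut -d ' ' -f 1
}

# Exit 0 when both files have identical contents, 1 when they differ,
# and 2 when either file is unreadable.
cookies_match() {
    a=$(cookie_digest "$1") || return 2
    b=$(cookie_digest "$2") || return 2
    [ "$a" = "$b" ]
}
```

One way to use it is to copy the healthy node’s cookie to a temporary file on the problematic node over a secure channel, run `cookies_match /var/lib/rabbitmq/.erlang.cookie /tmp/healthy.cookie` as root, and delete the temporary copy afterwards (the `/tmp/healthy.cookie` path is only an example).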
- Cause: The problematic node’s disk is full, preventing it from writing necessary state information to disk.
  - Diagnosis Command: `df -h /var/lib/rabbitmq`
  - Fix: Free up disk space. This might involve purging queues whose messages are no longer needed, removing old rotated logs from `/var/log/rabbitmq/`, or increasing the disk size.
  - Why it works: RabbitMQ needs disk space to store its internal state, including cluster membership, message queues, and durable message data. If this space is exhausted, it can’t operate correctly or rejoin the cluster.
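The disk check is easy to script so it can run before every rejoin attempt. This is a rough sketch that parses `df -P` output; the function names and the default threshold of 90 percent are assumptions for illustration, not RabbitMQ settings.

```sh
# Print the used-space percentage (integer, no % sign) of the
# filesystem that holds the given path.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Exit 0 when usage is below the threshold (second argument, default 90).
disk_has_headroom() {
    used=$(disk_used_percent "$1") || return 2
    [ -n "$used" ] && [ "$used" -lt "${2:-90}" ]
}
```

For example, `disk_has_headroom /var/lib/rabbitmq 85 || echo "free up space first"` would warn when the data directory’s filesystem is 85 percent full or more.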
- Cause: The `mnesia` database on the problematic node is corrupted or out of sync. This database stores cluster state.
  - Diagnosis Command: Examine the RabbitMQ logs for `mnesia` errors. You might see messages like "failed to commit" or "corrupted table."
  - Fix: This is often best addressed by resetting the node’s state and rejoining. A reset discards the node’s local definitions and messages. Run all of the following on the problematic node; `join_cluster` has to be run there, while the RabbitMQ application is stopped:
    - `sudo rabbitmqctl stop_app`
    - `sudo cp -a /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia.bak` (optional backup of the old state)
    - `sudo rabbitmqctl reset` (use `force_reset` if `reset` fails because of the damaged database)
    - `sudo rabbitmqctl join_cluster rabbit@<healthy_node_name>`
    - `sudo rabbitmqctl start_app`
    - If `join_cluster` complains that the cluster still considers this node a member, run `sudo rabbitmqctl forget_cluster_node rabbit@<problematic_node_name>` on a healthy node first, then retry.
  - Why it works: By resetting the node, you force it to discard its potentially corrupted local state and re-establish its membership from scratch by joining the cluster as a new member.
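Because the order of the rejoin commands matters, it can help to generate the sequence and review it before running anything. The sketch below only prints commands and executes nothing. It assumes the usual `rabbitmqctl` order, where `join_cluster` is run on the node being rejoined while its application is stopped; `rejoin_plan` and the node name `rabbit@node1` are hypothetical.

```sh
# Print the rejoin sequence for the problematic node. Nothing is executed.
# $1 is the name of a healthy cluster member, e.g. rabbit@node1.
rejoin_plan() {
    target="$1"
    [ -n "$target" ] || { echo "usage: rejoin_plan rabbit@<healthy_node_name>" >&2; return 1; }
    cat <<EOF
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster $target
rabbitmqctl start_app
EOF
}
```

Run `rejoin_plan rabbit@node1` on the problematic node to read the plan. Once you’re happy with it, `rejoin_plan rabbit@node1 | sudo sh -e` would execute it and stop at the first failing command.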
2. The Node is Completely Unreachable or Refuses to Connect
This typically means the network partition was severe, and the node has been effectively isolated for too long, leading to it being evicted or seeing itself as the sole survivor.
- Diagnosis: From the problematic node, try to `ping` the IP addresses of the other nodes in the cluster. Also, test Erlang distribution directly: start a throwaway shell with `erl -sname probe -setcookie <your_cookie>` (the name `probe` is arbitrary) and run `net_adm:ping('rabbit@<other_node_name>').` in it. A result of `pong` means the nodes can talk; `pang` means they cannot. If these fail, it’s a network or firewall issue.
- Cause: Network connectivity issues (firewalls, routing, DNS) between the nodes. Erlang uses specific ports for node communication: epmd listens on 4369, and inter-node traffic uses the distribution port, which is 25672 by default (the AMQP port plus 20000).
  - Diagnosis Command:
    - On the problematic node: `telnet <IP_of_other_node> 4369` and `telnet <IP_of_other_node> 25672`
    - On the other nodes: `telnet <IP_of_problematic_node> 4369` and `telnet <IP_of_problematic_node> 25672`
    - Check `ufw status` or `firewall-cmd --list-all` for any blocked ports.
  - Fix: Open the necessary ports in your firewall. For a standard cluster, this means port 4369 for epmd and port 25672 for inter-node communication (the distribution port can be configured, so check your settings if you have changed it). If the nodes are on different subnets, also make sure the routers between them pass this traffic.
  - Why it works: Erlang nodes must be able to discover and communicate with each other using specific TCP ports. If these are blocked, they cannot form or rejoin a cluster.
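The port checks can be scripted so you can run them against every peer in one go. This is a sketch under a few assumptions: a default install using 4369 (epmd) and 25672 (distribution), and an `nc` binary that supports `-z` and `-w`. The function names are illustrative only.

```sh
# Ports assumed for a default install: 4369 (epmd), 25672 (distribution).
required_ports() {
    echo "4369 25672"
}

# Exit 0 if a TCP connection to host $1, port $2 opens within 3 seconds.
port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Report every required port on a peer as open or blocked.
check_peer() {
    command -v nc >/dev/null 2>&1 || { echo "nc not found" >&2; return 2; }
    for port in $(required_ports); do
        if port_open "$1" "$port"; then
            echo "$1:$port open"
        else
            echo "$1:$port blocked"
        fi
    done
}
```

Calling `check_peer <IP_of_other_node>` from the problematic node, and the reverse from a healthy node, covers both directions. A port reported as blocked may also simply have nothing listening on it, so check that RabbitMQ is running on the peer before blaming the firewall.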
- Cause: The problematic node was forced to run in `nonode@nohost` mode (the name an Erlang VM reports when distribution is not running) due to prolonged isolation, and its cluster state became stale.
  - Diagnosis: Check the RabbitMQ logs. You might see messages about the node running in `nonode@nohost` mode.
  - Fix: This usually requires a full cluster reset and re-joining. Every node that is reset loses its definitions and messages, so treat this as a last resort.
    - Stop the RabbitMQ application on all nodes: `sudo rabbitmqctl stop_app`.
    - On the problematic node, clear its state: `sudo rabbitmqctl reset` (or `sudo rabbitmqctl force_reset` if `reset` refuses to run). Deleting `/var/lib/rabbitmq/mnesia` by hand achieves the same thing, but only do that while the whole RabbitMQ service is stopped.
    - On one healthy node, clear its state the same way (this will make it the "seed" for the new cluster). If that node’s state is intact and you want to keep its queues and definitions, skip the reset and use it as the seed as it is.
    - Start RabbitMQ on the node you designated as the seed: `sudo rabbitmqctl start_app`.
    - On all other nodes (including the one that was problematic), reset them if you have not already, join them to the seed while the application is still stopped, and then start it: `sudo rabbitmqctl join_cluster rabbit@<seed_node_name>`, followed by `sudo rabbitmqctl start_app`.
  - Why it works: When a node is isolated for too long, it can’t participate in quorum decisions and might revert to a standalone state. Resetting all nodes and then re-forming the cluster from a single seed ensures a clean slate and proper membership establishment.
- Cause: The node was explicitly stopped or crashed, and the remaining cluster members have since removed it from their membership lists, considering it permanently gone.
  - Diagnosis: Look at the cluster status on a healthy node: `rabbitmqctl cluster_status`. If the problematic node is not listed at all, it has been removed from the cluster. If it is listed under `disc_nodes` but not under `running_nodes`, it is still a member and is simply down.
  - Fix: Reset the problematic node and rejoin it as if it were a new node, running `stop_app`, `reset`, `join_cluster`, and `start_app` on that node in that order. A reset discards the node’s local queues and messages, so if you need that data, drain or export it while the node is still running on its own, before you reset it.
  - Why it works: A node disappears from the membership list when an operator runs `rabbitmqctl forget_cluster_node` or when automatic cleanup of unreachable nodes is enabled. Once that has happened, the node’s own view of the cluster no longer matches reality, so it has to be reset and re-establish its cluster identity before the other members accept it again.
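The membership logic is simple enough to write down as a small helper. In this sketch you pass in the member list and the running list yourself, as space-separated node names copied from `rabbitmqctl cluster_status`. The function name `node_state` and the node names in the example are hypothetical.

```sh
# Classify a node from two space-separated lists:
#   $1 = node to look up, $2 = cluster members, $3 = running nodes.
# Prints "removed", "down", or "running".
node_state() {
    node="$1"; members="$2"; running="$3"
    case " $members " in
        *" $node "*) ;;                      # still a member, keep checking
        *) echo "removed"; return 0 ;;       # absent from the membership list
    esac
    case " $running " in
        *" $node "*) echo "running" ;;
        *) echo "down" ;;
    esac
}
```

For instance, `node_state rabbit@c "rabbit@a rabbit@b" "rabbit@a rabbit@b"` reports `removed`, which is the case where a reset and rejoin is needed. A result of `down` means the node is still a member and usually only has to be started again.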
After successfully rejoining the node, you might encounter issues with message replication or queue synchronization if the partition lasted a significant amount of time. Expect message delivery delays, and queues may appear empty on the rejoined node until replication catches up.