A RabbitMQ node that cannot rejoin its cluster after a network partition is usually holding stale cluster membership information that conflicts with the cluster’s current state.

Here’s how to fix it, covering the most common scenarios:

1. The Node is Still Discoverable but Won’t Join

This happens when the node can see the other nodes, but they don’t agree on who’s "in" and who’s "out" due to the partition.

  • Diagnosis: Check the RabbitMQ log file on the node that failed to rejoin. Look for messages indicating it’s trying to connect but failing to establish a quorum or that it sees a different set of nodes. On the other nodes in the cluster, check their logs for messages about node discovery or cluster membership changes that don’t include the problematic node.
  • Cause: The Erlang cookie on the node does not match the rest of the cluster. Erlang nodes authenticate each other with this shared secret, so a mismatch makes every inter-node connection fail. Clock drift is a less likely culprit, since the cookie handshake does not depend on wall-clock time, but badly skewed clocks can break TLS certificate validation on inter-node links and make logs hard to correlate.
    • Diagnosis Command:
      • On the problematic node: sudo cat /var/lib/rabbitmq/.erlang.cookie
      • On a healthy node: sudo cat /var/lib/rabbitmq/.erlang.cookie
      • On both nodes: date -u
    • Fix: Ensure the .erlang.cookie file is identical on all nodes. If it’s not, copy the correct one to the problematic node, keep it owned by the rabbitmq user and readable only by that user, and restart RabbitMQ. Synchronize clocks using NTP:
      • sudo apt update && sudo apt install ntp -y (or yum equivalent)
      • sudo systemctl start ntp
      • sudo systemctl enable ntp
    • Why it works: A matching cookie is essential for nodes to authenticate each other as part of the same cluster. Synchronized clocks rule out time-related TLS failures and keep log timestamps comparable across nodes.
  • Cause: The problematic node’s disk is full, preventing it from writing necessary state information to disk.
    • Diagnosis Command: df -h /var/lib/rabbitmq
    • Fix: Free up disk space by clearing old logs (/var/log/rabbitmq/), purging queues you no longer need, or increasing the disk size, then restart RabbitMQ on the node.
    • Why it works: RabbitMQ needs disk space to store its internal state, including cluster membership, message queues, and durable message data. If this space is exhausted, it can’t operate correctly or rejoin the cluster.
  • Cause: The mnesia database on the problematic node is corrupted or out of sync. This database stores cluster state.
    • Diagnosis Command: Examine the RabbitMQ logs for mnesia errors. You might see messages like "failed to commit" or "corrupted table."
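    • Fix: This is often best addressed by resetting the node’s state and rejoining. A reset discards everything stored on that node, so keep a copy of the data directory first. Run these on the problematic node:
      • sudo rabbitmqctl stop_app
      • sudo cp -a /var/lib/rabbitmq/mnesia /var/lib/rabbitmq/mnesia.bak
      • sudo rabbitmqctl reset (use force_reset if reset fails because the node cannot reach its old cluster)
      • sudo rabbitmqctl join_cluster rabbit@<healthy_node_name>
      • sudo rabbitmqctl start_app
      • If join_cluster is rejected because the cluster still lists the old node, run rabbitmqctl forget_cluster_node rabbit@<problematic_node_name> on a healthy node, then retry.
    • Why it works: By resetting the node, you force it to discard its potentially corrupted local state and re-establish its membership from scratch by joining the cluster as a new member. join_cluster has to run while the application is stopped, which is why start_app comes last.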
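
The checks and the rejoin order from this section can be sketched as a few shell functions. This is a minimal sketch, assuming a Linux host with sha256sum available and a package install of RabbitMQ. The function names (cookie_fingerprint, disk_used_pct, rejoin_cluster) and the REJOIN_RUN variable are made up for illustration and are not part of RabbitMQ. rejoin_cluster only prints the commands unless REJOIN_RUN=1 is set.

```shell
# Illustrative helpers (hypothetical names, POSIX sh).

# Print a SHA-256 fingerprint of a cookie file, so two nodes can be
# compared without pasting the cookie itself into a terminal or ticket.
cookie_fingerprint() {
    sha256sum "$1" | cut -d' ' -f1
}

# Print the used-space percentage (digits only) of the filesystem
# that holds the given path, e.g. /var/lib/rabbitmq.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the rejoin sequence for the node this runs on.
# $1 is the peer to join, e.g. rabbit@<healthy_node_name>.
# With REJOIN_RUN=1 the commands are executed instead of printed.
rejoin_cluster() {
    peer="$1"
    for step in "stop_app" "reset" "join_cluster $peer" "start_app"; do
        if [ "${REJOIN_RUN:-0}" = "1" ]; then
            # Unquoted on purpose: "join_cluster <peer>" must split into two words.
            rabbitmqctl $step || return 1
        else
            echo "rabbitmqctl $step"
        fi
    done
}
```

A possible way to use it: run cookie_fingerprint /var/lib/rabbitmq/.erlang.cookie as root on each node and compare the output, check disk_used_pct /var/lib/rabbitmq, then review the plan printed by rejoin_cluster rabbit@<healthy_node_name> before running it for real.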

2. The Node is Completely Unreachable or Refuses to Connect

This typically means the network partition was severe, and the node has been effectively isolated for too long, leading to it being evicted or seeing itself as the sole survivor.

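  • Diagnosis: From the problematic node, try to ping the IP addresses of the other nodes in the cluster. Also, attempt an Erlang-level connection: start a throwaway shell with erl -sname probe -setcookie <your_cookie> and run net_adm:ping('rabbit@<other_node_name>'). in it. A pong reply means the nodes can talk; pang means they cannot. Do not give the probe shell the name of an existing node. If both checks fail, it’s a network or firewall issue; if only the Erlang ping fails, suspect the cookie or the Erlang ports.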
  • Cause: Network connectivity issues (firewalls, routing, DNS) between the nodes. Erlang uses specific ports for node communication: epmd, the port mapper used for discovery, listens on 4369, and inter-node traffic uses 25672 by default (the AMQP port plus 20000).
    • Diagnosis Command:
      • On the problematic node: telnet <IP_of_other_node> 4369 and telnet <IP_of_other_node> 25672
      • On the other nodes: telnet <IP_of_problematic_node> 4369 and telnet <IP_of_problematic_node> 25672
      • Check ufw status or firewall-cmd --list-all for any blocked ports.
    • Fix: Open the necessary ports in your firewall. For a standard cluster, this means port 4369 for epmd and port 25672 for inter-node communication; CLI tools run from another host additionally use 35672-35682. These ports can be reconfigured, so confirm the actual values if the defaults were overridden. If nodes are on different subnets, verify that routing works in both directions. net.ipv4.ip_forward matters only on a Linux host acting as the router between them, not on the RabbitMQ nodes themselves.
    • Why it works: Erlang nodes must be able to discover and communicate with each other using specific TCP ports. If these are blocked, they cannot form or rejoin a cluster.
  • Cause: The problematic node was isolated for so long that its cluster state became stale. In some cases its logs mention nonode@nohost, the name an Erlang node has when distribution is not running at all.
    • Diagnosis: Check the RabbitMQ logs. You might see messages that mention nonode@nohost, or a list of cluster members that no longer matches what the healthy nodes report.
    • Fix: This usually requires resetting the stale node and rejoining it. Rebuild the whole cluster only as a last resort, because a reset erases the queues, messages, users, and policies stored on a node.
      1. Choose a healthy node as the seed and leave it running. Do not reset it unless you intend to discard the cluster’s data.
      2. On the problematic node, wipe its cluster state:
        • sudo rabbitmqctl stop_app
        • sudo rabbitmqctl force_reset
      3. On the seed, drop the stale membership record if the node is still listed: sudo rabbitmqctl forget_cluster_node rabbit@<problematic_node_name>.
      4. On the problematic node, join the seed and then start the application:
        • sudo rabbitmqctl join_cluster rabbit@<seed_node_name>
        • sudo rabbitmqctl start_app
      5. If other nodes are also inconsistent, repeat steps 2 to 4 on each of them, one at a time.
    • Why it works: When a node is isolated for too long, it can’t participate in quorum decisions and its view of the cluster diverges. Resetting it and re-forming membership from a single seed gives it a clean slate while the seed keeps the cluster’s data.
  • Cause: The node was explicitly stopped or crashed, and it has since been removed from the cluster’s membership list, for example by an operator running forget_cluster_node or by automatic node cleanup in a peer discovery backend.
    • Diagnosis: Look at the cluster status on a healthy node: rabbitmqctl cluster_status. If the problematic node is not listed at all, the cluster has forgotten it. If it is listed as a member but not among the running nodes, the cluster still expects it back.
    • Fix: If the node is still listed as a member, bring it back up and let it reconnect. If it has been removed, reset it and rejoin it as if it were a new node: stop_app, reset (or force_reset), join_cluster, then start_app. Anything stored only on that node is lost in the reset, so back up its data directory first if it may hold queues or messages you need.
    • Why it works: A node that the cluster has forgotten still believes it is a member, and the two views cannot be reconciled automatically. Resetting the node discards its old identity so it can be accepted as a valid member again.
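
The port and membership checks in this section can be scripted along these lines. This is a rough sketch, not RabbitMQ tooling: the function names (dist_port, cluster_ports, probe_node, is_listed) are hypothetical, the defaults assume an unmodified configuration (epmd on 4369, distribution port equal to the AMQP port plus 20000), and probe_node assumes a netcat build that supports -z and -w.

```shell
# Illustrative helpers (hypothetical names, POSIX sh).

# Default inter-node distribution port for a given AMQP port.
dist_port() {
    echo $(( $1 + 20000 ))
}

# Print the ports a peer must be able to reach, one per line.
# $1 is the AMQP port (defaults to 5672).
cluster_ports() {
    amqp_port="${1:-5672}"
    echo 4369                # epmd
    dist_port "$amqp_port"   # inter-node distribution
}

# Probe each cluster port on a host. Assumes nc (netcat) is installed.
probe_node() {
    host="$1"
    for port in $(cluster_ports "${2:-5672}"); do
        if nc -z -w 3 "$host" "$port" 2>/dev/null; then
            echo "$host:$port open"
        else
            echo "$host:$port unreachable"
        fi
    done
}

# Succeed if the node name $1 appears as a whole word in the text on
# stdin, e.g. the output of rabbitmqctl cluster_status.
is_listed() {
    grep -qwF -- "$1"
}
```

For example, probe_node <IP_of_other_node> reports both ports for one peer, and rabbitmqctl cluster_status | is_listed rabbit@<problematic_node_name> shows whether a healthy node still mentions the node anywhere in its status output. It does not tell you whether the node is listed as running.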

After the node rejoins, you might see issues with message replication or queue synchronization if the partition lasted a long time. Expect delivery delays, or queues that appear empty on the rejoined node, until replication catches up.
