A RabbitMQ cluster node is unavailable because the Erlang distribution protocol, which RabbitMQ uses for inter-node communication, is failing to establish a connection between the nodes.
- Firewall Blocking Erlang Distribution Ports: The most common culprit is a firewall on one or more nodes preventing the Erlang VM from communicating with other nodes. Two listeners are involved: the Erlang port mapper daemon (`epmd`), which peers query to learn a node's distribution port, and the distribution listener itself.
  - Diagnosis: From each node, test whether the distribution-related ports on every other node are reachable. `epmd` listens on TCP port 4369 by default, and RabbitMQ's distribution listener defaults to 25672 (the AMQP port plus 20000). A quick test is `telnet <other_node_ip> 4369`, followed by `telnet <other_node_ip> 25672`, from one node to another. If either fails, check your firewall rules.
  - Fix: Open the necessary ports in your firewall. Allow TCP traffic from your other cluster nodes on port 4369 (`epmd`) and port 25672 (inter-node distribution), plus 35672-35682 if CLI tools connect from other hosts. These are the defaults and will differ if the distribution port is overridden in your configuration. On `ufw` (Ubuntu), this would look like:
    ```shell
    sudo ufw allow from <other_node_ip> to any port 4369 proto tcp
    sudo ufw allow from <other_node_ip> to any port 25672 proto tcp
    sudo ufw allow from <other_node_ip> to any port 35672:35682 proto tcp
    ```
    Repeat on every node for each of its peers.
  - Why it works: This ensures the Erlang VMs on different nodes can find each other via `epmd` and then establish a direct, authenticated communication channel for cluster operations.
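The port reachability test described above can be scripted. The following is a minimal sketch, assuming bash built with `/dev/tcp` support and the coreutils `timeout` command. `check_port` and `check_peer` are hypothetical helper names, and 4369 and 25672 are the default `epmd` and RabbitMQ distribution ports, which your configuration may override.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for testing reachability of distribution-related ports.

# check_port HOST PORT
# Returns 0 if a TCP connection to HOST:PORT opens within 3 seconds.
check_port() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# check_peer HOST
# Prints one line per port (assumed defaults: 4369 for epmd, 25672 for
# distribution) and returns non-zero if any of them is unreachable.
check_peer() {
  local host="$1" port status=0
  for port in 4369 25672; do
    if check_port "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
      status=1
    fi
  done
  return "$status"
}
```

Running `check_peer <other_node_ip>` from each node against each peer gives a quick picture of which direction, if any, is blocked.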
- Incorrect AWS Peer Discovery Addressing (`cluster_formation.aws.use_private_ip`): If you're using AWS peer discovery and the addresses it hands out don't match how your nodes actually reach each other, nodes might not be able to find each other.
  - Diagnosis: Check your `rabbitmq.conf` file for the `cluster_formation.aws` settings. By default the AWS backend builds node names from private DNS hostnames, so confirm that those hostnames resolve between instances and match the names your nodes are running under.
  - Fix: If private DNS hostnames are not resolvable in your network, set `cluster_formation.aws.use_private_ip = true` in `rabbitmq.conf` so that private IP addresses are used instead. Restart RabbitMQ on all nodes.
  - Why it works: This tells RabbitMQ how to address the other nodes in the cluster within the AWS environment.
- `NODENAME` Mismatch or Incorrectly Set: Each node's `NODENAME` must follow a consistent scheme and be resolvable across all nodes. If nodes are trying to connect using names that don't match what the peer is running under, or names whose host part doesn't resolve, the connection will fail.
  - Diagnosis: On each node, run `rabbitmqctl eval 'node().'`. Compare the output. Each node should report its own distinct name of the form `rabbit@<host>`, and the `<host>` part must resolve from every other node. Check `/etc/rabbitmq/rabbitmq-env.conf` for the `NODENAME` setting. Ensure it uses a hostname, fully qualified domain name (FQDN), or IP address that is reachable by all other nodes. An FQDN or IP address also requires `USE_LONGNAME=true` in the same file.
  - Fix: Set `NODENAME` in `/etc/rabbitmq/rabbitmq-env.conf` on all nodes using one scheme, e.g., `NODENAME=rabbit@my-cluster.example.com` or `NODENAME=rabbit@192.168.1.100`, with the host part specific to each node. Ensure that `my-cluster.example.com` or `192.168.1.100` resolves correctly to the node's IP address from all other nodes (use `ping` against the host part of the name to test). Restart RabbitMQ on all nodes.
  - Why it works: Erlang uses the node name to address its peers: the host part is what gets resolved and connected to. A consistent, resolvable `NODENAME` is crucial for the Erlang distribution protocol to succeed.
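As a small illustration of how a node name breaks down, here is a sketch in bash. `node_host` and `needs_longname` are hypothetical helper names. The rule encoded in `needs_longname` is an assumption based on Erlang's naming modes: a host part containing a dot (an FQDN or an IP address) requires long names, which RabbitMQ enables with `USE_LONGNAME=true` in `rabbitmq-env.conf`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting an Erlang/RabbitMQ node name.

# node_host NODENAME
# Prints the host part of a node name, i.e. everything after the first '@'.
node_host() {
  printf '%s\n' "${1#*@}"
}

# needs_longname NODENAME
# Returns 0 if the host part contains a dot (FQDN or IP address), which is
# taken here to mean that long names must be enabled.
needs_longname() {
  case "$(node_host "$1")" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}
```

For example, `ping -c 1 "$(node_host rabbit@my-cluster.example.com)"` tests resolution of exactly the part of the name that peers will try to reach.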
- Erlang Cookie Mismatch: The Erlang cookie is a shared secret used to authenticate nodes attempting to join the same cluster. If the cookies don't match, nodes will reject each other.
  - Diagnosis: On each node, check the Erlang cookie file, typically located at `/var/lib/rabbitmq/.erlang.cookie` or `~/.erlang.cookie`. Compare the contents of this file across all nodes.
  - Fix: Copy the content of the `.erlang.cookie` file from one node to the `.erlang.cookie` file on all other nodes. Ensure the file permissions are set to `0600` (read/write for owner only). Restart RabbitMQ on all nodes.
  - Why it works: The cookie acts as a password for Erlang nodes. All nodes in a cluster must share the same cookie to trust each other and allow communication.
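Comparing cookies by digest avoids printing the secret itself to a terminal or log. The sketch below assumes the coreutils `sha256sum` command is available. `cookie_fingerprint` and `same_cookie` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing Erlang cookie files without revealing them.

# cookie_fingerprint FILE
# Prints the SHA-256 hex digest of FILE's contents.
cookie_fingerprint() {
  sha256sum < "$1" | cut -d' ' -f1
}

# same_cookie FILE_A FILE_B
# Returns 0 if both files have byte-for-byte identical contents
# (judged by comparing their digests).
same_cookie() {
  [ "$(cookie_fingerprint "$1")" = "$(cookie_fingerprint "$2")" ]
}
```

Running `cookie_fingerprint /var/lib/rabbitmq/.erlang.cookie` on each node (with sufficient privileges to read the file) and comparing the printed digests shows whether any node holds a different cookie.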
- DNS Resolution Issues: Nodes cannot find each other if their hostnames are not resolvable or resolve to the wrong IP addresses.
  - Diagnosis: On each node, try to `ping` the hostname of every other node in the cluster. If `ping` fails or resolves to an incorrect IP, you have a DNS problem.
  - Fix: Ensure that all nodes have correct entries in their `/etc/hosts` file or that your DNS server is properly configured to resolve all cluster node hostnames to their correct IP addresses. For example, on each node, ensure `/etc/hosts` contains entries like:
    ```
    192.168.1.101 node1.example.com node1
    192.168.1.102 node2.example.com node2
    ```
    Restart RabbitMQ on all nodes after making changes.
  - Why it works: RabbitMQ (and Erlang) relies on being able to resolve hostnames to IP addresses to establish network connections between nodes.
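The resolution check can also be made explicit by comparing what a hostname resolves to against the address you expect. This is a minimal sketch, assuming a glibc-based Linux system where `getent ahosts` is available. `resolve_first` and `resolves_to` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking hostname resolution on a node.

# resolve_first HOST
# Prints the first address the system resolver returns for HOST
# (consults /etc/hosts and DNS according to the system's resolver config).
# Prints nothing if HOST does not resolve.
resolve_first() {
  getent ahosts "$1" | awk 'NR == 1 { print $1 }'
}

# resolves_to HOST EXPECTED_IP
# Returns 0 if the first resolved address equals EXPECTED_IP.
resolves_to() {
  [ "$(resolve_first "$1")" = "$2" ]
}
```

For instance, `resolves_to node1.example.com 192.168.1.101` run on every node confirms that all of them agree on where `node1.example.com` lives.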
- Resource Exhaustion (High CPU/Memory/Disk I/O): A node that is overloaded might be too slow to respond to cluster communication requests within the timeout period, leading to it being marked as unavailable.
  - Diagnosis: Monitor system resources on all nodes using tools like `top`, `htop`, `vmstat`, `iostat`, and `free`. Look for consistently high CPU usage, low available memory, or excessive disk I/O. Check RabbitMQ's own metrics for high queue depths or a large number of connections/channels.
  - Fix: Optimize RabbitMQ configuration (e.g., memory limits, disk thresholds), scale your hardware, offload queues, or tune application producers/consumers to reduce load. For example, adjust `vm_memory_high_watermark.relative` in `rabbitmq.conf` if memory is the issue.
  - Why it works: By ensuring nodes have sufficient resources, they can process incoming network requests and maintain cluster state reliably, preventing timeouts.
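For the memory part of that check, a single number is often easier to track than the full output of `free`. The sketch below assumes a Linux host that exposes `MemTotal` and `MemAvailable` in `/proc/meminfo`. `mem_available_pct` is a hypothetical helper name, and any alert threshold you apply to its output is your own choice rather than a RabbitMQ default.

```shell
#!/usr/bin/env bash
# Hypothetical helper for summarizing available memory on a node.

# mem_available_pct [MEMINFO_FILE]
# Prints available memory as a whole-number percentage of total memory,
# truncated toward zero. Reads /proc/meminfo unless another file is given.
# Returns non-zero if MemTotal is missing or zero.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END {
         if (total > 0) printf "%d\n", avail * 100 / total
         else exit 1
       }' "${1:-/proc/meminfo}"
}
```

A persistently low value on one node, compared with its peers, is a hint that memory pressure rather than networking is behind the missed cluster heartbeats.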
- Incorrect `cluster_formation.k8s.host` or `cluster_formation.k8s.token_path`: In Kubernetes, if the API server host or the service account token path is incorrect, nodes cannot discover each other.
  - Diagnosis: Verify `cluster_formation.k8s.host` (e.g., `kubernetes.default.svc`) and the service account token used by RabbitMQ within your Kubernetes cluster. Ensure that `rabbitmq.conf` reflects these correct values and that the token is readable at the configured path inside the pod.
  - Fix: Correct `cluster_formation.k8s.host` and ensure the service account has the necessary RBAC permissions to read the endpoints in its namespace. Restart the RabbitMQ pods.
  - Why it works: This allows RabbitMQ to query the Kubernetes API to discover other RabbitMQ pods in the same namespace for cluster formation.
The next error you'll likely encounter is `channel_error` or `connection_closed` on clients trying to connect to the unavailable node, since they can no longer reach it.