A RabbitMQ cluster node is unavailable because the Erlang distribution protocol, which RabbitMQ uses for inter-node communication, is failing to establish a connection between the nodes.

  • Firewall Blocking Erlang Distribution Ports: The most common culprit is a firewall on one or more nodes preventing the Erlang VM from communicating with other nodes. A node first contacts the Erlang port mapper daemon (epmd) on its peer to look up the peer’s distribution port, then connects to that port directly, so both ports must be reachable.

    • Diagnosis: From each node, check that you can open a TCP connection to epmd (port 4369 by default) and to the distribution port of every other node. RabbitMQ’s default distribution port is 25672 (the AMQP port plus 20000); if it has been reconfigured, use the configured port or range instead. A quick test is to run telnet <other_node_ip> 4369 and telnet <other_node_ip> 25672 from one node to another. If either fails, check your firewall rules.
    • Fix: Open the necessary ports in your firewall. Allow TCP traffic from your other cluster nodes on port 4369 (epmd) and on the distribution port (25672 by default, or whatever port or range your configuration sets). On ufw (Ubuntu), this would look like:
      sudo ufw allow from <other_node_ip> to any port 4369 proto tcp
      sudo ufw allow from <other_node_ip> to any port 25672 proto tcp
      
      Repeat on every node, once for each of the other nodes.
    • Why it works: This ensures the Erlang VMs on different nodes can find each other via epmd and then establish a direct, authenticated communication channel for cluster operations.
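The port check described above can be scripted. The following is a minimal sketch, assuming the default epmd port (4369) and RabbitMQ’s default distribution port (25672); the function names check_port and check_cluster_ports are hypothetical helpers, not part of RabbitMQ.

```shell
#!/usr/bin/env bash
# Sketch: probe the ports that inter-node communication depends on.
# Assumption: default ports 4369 (epmd) and 25672 (RabbitMQ distribution).

# check_port HOST PORT: exit 0 if a TCP connection opens within 3 seconds.
check_port() {
  local host="$1" port="$2"
  # /dev/tcp is a bash feature, so the probe runs in an explicit bash process.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_cluster_ports HOST: report each port; exit non-zero if any is closed.
check_cluster_ports() {
  local host="$1" port status=0
  for port in 4369 25672; do
    if check_port "$host" "$port"; then
      echo "open   ${host}:${port}"
    else
      echo "closed ${host}:${port}"
      status=1
    fi
  done
  return "$status"
}
```

Run check_cluster_ports <other_node_ip> from each node against every other node. A "closed" line means the connection was refused, filtered, or timed out, which points at the firewall or at a service that is not listening.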
  • Incorrect AWS peer discovery settings (cluster_formation.aws.*): If you’re using AWS peer discovery and nodes are discovered under hostnames or addresses that the other nodes cannot resolve or reach, nodes might not be able to find each other.

    • Diagnosis: Check your rabbitmq.conf file for cluster_formation.peer_discovery_backend and the cluster_formation.aws.* settings. Ensure the region and the autoscaling group or instance tags match your nodes, and that the kind of address being discovered (private DNS name or private IP, controlled by cluster_formation.aws.use_private_ip) is one your nodes actually use for inter-node communication.
    • Fix: Set cluster_formation.aws.region, then either cluster_formation.aws.use_autoscaling_group = true or the matching cluster_formation.aws.instance_tags.* entries. Set cluster_formation.aws.use_private_ip = true if the private DNS names do not resolve between nodes. Restart RabbitMQ on all nodes.
    • Why it works: This tells RabbitMQ which instances belong to the cluster and which address to use when contacting them within the AWS environment.
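To review what a node is actually configured with, it helps to list only the active peer discovery lines. This is a small sketch; show_discovery_settings is a hypothetical helper, and the default path /etc/rabbitmq/rabbitmq.conf is an assumption that may differ on your installation.

```shell
#!/usr/bin/env bash
# show_discovery_settings [CONF]: print the uncommented cluster_formation.*
# lines from a rabbitmq.conf file.
# Assumption: the config file lives at /etc/rabbitmq/rabbitmq.conf.
show_discovery_settings() {
  local conf="${1:-/etc/rabbitmq/rabbitmq.conf}"
  # Commented lines start with '#', so anchoring on the key name skips them.
  grep -E '^[[:space:]]*cluster_formation\.' "$conf"
}
```

Running show_discovery_settings on each node and comparing the output makes a missing or inconsistent key easy to spot. rabbitmqctl environment can then be used to confirm what the running node loaded.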
  • NODENAME Mismatch or Incorrectly Set: Each node’s name has the form name@host. It must be unique within the cluster, and its host part must be resolvable from all other nodes. If peers try to reach a node under a name other than the one it is running with, or under a host that doesn’t resolve, the connection will fail.

    • Diagnosis: On each node, run rabbitmqctl eval 'node().'. Compare the output with the name the other nodes use for that node (for example, as listed by rabbitmqctl cluster_status). The names must match exactly, and the host part must be resolvable from every other node. Check /etc/rabbitmq/rabbitmq-env.conf for the NODENAME setting (it is an environment variable, not a rabbitmq.conf key). If the host part is a fully qualified domain name (FQDN) or an IP address, USE_LONGNAME=true must be set as well.
    • Fix: Set NODENAME in /etc/rabbitmq/rabbitmq-env.conf on each node, using that node’s own host, e.g., NODENAME=rabbit@node1.example.com or NODENAME=rabbit@192.168.1.100, together with USE_LONGNAME=true. Ensure that node1.example.com or 192.168.1.100 resolves correctly to the node’s IP address from all other nodes (ping the host part of the name, e.g., ping node1.example.com, to test). Restart RabbitMQ on all nodes.
    • Why it works: Erlang uses the node name to address its peers: the host part tells it which machine’s epmd to contact. A stable, resolvable node name is crucial for the Erlang distribution protocol to succeed.
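The node name checks can be sketched as a few shell functions. This assumes node names of the form name@host and relies on getent being available (as on most Linux systems). The helper names node_host, needs_longname, and check_node_name are hypothetical. The long-name rule reflects Erlang’s restriction that short names cannot contain a dot in the host part, which is why RabbitMQ needs USE_LONGNAME=true for FQDNs and IP addresses.

```shell
#!/usr/bin/env bash
# node_host NODENAME: print the host part of an Erlang node name (name@host).
node_host() {
  printf '%s\n' "${1#*@}"
}

# needs_longname NODENAME: exit 0 if the host part contains a dot
# (an FQDN or an IPv4 address), which requires long node names.
needs_longname() {
  case "$(node_host "$1")" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# check_node_name NODENAME: report the name type and whether the host
# part resolves from the machine this is run on.
check_node_name() {
  local host
  host="$(node_host "$1")"
  if needs_longname "$1"; then
    echo "${1}: long name (expects USE_LONGNAME=true)"
  else
    echo "${1}: short name"
  fi
  if getent hosts "$host" >/dev/null 2>&1; then
    echo "${host}: resolves"
  else
    echo "${host}: does NOT resolve"
  fi
}
```

For example, check_node_name rabbit@node1.example.com run on every other node shows whether that node’s host part resolves there.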
  • Erlang Cookie Mismatch: The Erlang cookie is a shared secret used to authenticate nodes attempting to join the same cluster. If the cookies don’t match, nodes will reject each other.

    • Diagnosis: On each node, check the Erlang cookie file, typically located at /var/lib/rabbitmq/.erlang.cookie for the server and at ~/.erlang.cookie (in the invoking user’s home directory) for CLI tools such as rabbitmqctl. Compare the contents of this file across all nodes.
    • Fix: Copy the content of the .erlang.cookie file from one node to the .erlang.cookie file on all other nodes. Ensure the file is owned by the user that runs RabbitMQ (usually rabbitmq) and that its permissions are set to 0600 or stricter (accessible by the owner only). Restart RabbitMQ on all nodes.
    • Why it works: The cookie acts as a password for Erlang nodes. All nodes in a cluster must share the same cookie to trust each other and allow communication.
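Comparing cookies by hash avoids printing the secret itself. The sketch below assumes sha256sum is installed and that the server cookie is at /var/lib/rabbitmq/.erlang.cookie; cookie_hash and cookies_match are hypothetical helper names.

```shell
#!/usr/bin/env bash
# cookie_hash [FILE]: print the SHA-256 digest of a cookie file.
# Assumption: the server cookie lives at /var/lib/rabbitmq/.erlang.cookie.
cookie_hash() {
  local file="${1:-/var/lib/rabbitmq/.erlang.cookie}"
  sha256sum "$file" | cut -d' ' -f1
}

# cookies_match FILE_A FILE_B: exit 0 if both files hash to the same value.
# The comparison is byte-for-byte, so a stray trailing newline in one
# file counts as a difference.
cookies_match() {
  [ "$(cookie_hash "$1")" = "$(cookie_hash "$2")" ]
}
```

Run cookie_hash on each node (reading the file usually requires root or the rabbitmq user) and compare the digests. Recent RabbitMQ releases also offer rabbitmq-diagnostics erlang_cookie_hash, which prints a hash of the cookie the CLI tools are using.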
  • DNS Resolution Issues: Nodes cannot find each other if their hostnames are not resolvable or resolve to the wrong IP addresses.

    • Diagnosis: On each node, try to ping the hostname of every other node in the cluster. If ping fails or resolves to an incorrect IP, you have a DNS problem.
    • Fix: Ensure that all nodes have correct entries in their /etc/hosts file or that your DNS server is properly configured to resolve all cluster node hostnames to their correct IP addresses. For example, on each node, ensure /etc/hosts contains entries like:
      192.168.1.101    node1.example.com node1
      192.168.1.102    node2.example.com node2
      
      Restart RabbitMQ on all nodes after making changes.
    • Why it works: RabbitMQ (and Erlang) relies on being able to resolve hostnames to IP addresses to establish network connections between nodes.
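The resolution check can be automated by comparing what each hostname resolves to against the address you expect. This is a sketch that assumes getent is available; verify_hosts is a hypothetical helper that reads lines in the same "IP name" layout as /etc/hosts.

```shell
#!/usr/bin/env bash
# verify_hosts: read "IP NAME" pairs from stdin and compare each NAME's
# resolved address (first result from getent) with the expected IP.
# Exits non-zero if any name is missing or resolves elsewhere.
verify_hosts() {
  local ip name rest actual status=0
  while read -r ip name rest; do
    # Skip blank lines and comments.
    case "$ip" in ''|'#'*) continue ;; esac
    actual="$(getent hosts "$name" 2>/dev/null | awk '{print $1; exit}')"
    if [ "$actual" = "$ip" ]; then
      echo "ok       ${name} -> ${ip}"
    else
      echo "MISMATCH ${name}: expected ${ip}, got ${actual:-nothing}"
      status=1
    fi
  done
  return "$status"
}
```

Feed it the expected mappings, for instance the two example lines above via a here-document or a file redirect, and run it on every node. Any MISMATCH line identifies the node and hostname whose resolution needs fixing.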
  • Resource Exhaustion (High CPU/Memory/Disk I/O): A node that is overloaded might be too slow to answer inter-node heartbeats (Erlang "ticks") within the timeout period, which is governed by net_ticktime (60 seconds by default), leading to it being marked as unavailable by its peers.

    • Diagnosis: Monitor system resources on all nodes using tools like top, htop, vmstat, iostat, and free. Look for consistently high CPU usage, low available memory, or excessive disk I/O. Check RabbitMQ’s own metrics for high queue depths or a large number of connections/channels.
    • Fix: Optimize RabbitMQ configuration (e.g., memory limits, disk thresholds), scale your hardware, offload queues, or tune application producers/consumers to reduce load. For example, adjust vm_memory_high_watermark.relative (or vm_memory_high_watermark.absolute) in rabbitmq.conf if memory is the issue.
    • Why it works: By ensuring nodes have sufficient resources, they can process incoming network requests and maintain cluster state reliably, preventing timeouts.
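As a quick memory check alongside top and free, the following sketch reads /proc/meminfo (Linux only) and flags low headroom. mem_available_pct and warn_if_low are hypothetical helpers, and the threshold is whatever you consider safe for your workload.

```shell
#!/usr/bin/env bash
# mem_available_pct [MEMINFO]: print available memory as a whole-number
# percentage of total memory (truncated, not rounded).
mem_available_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
       END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' "$meminfo"
}

# warn_if_low THRESHOLD [MEMINFO]: exit 1 and print a warning if the
# available percentage is below THRESHOLD; exit 2 if it cannot be read.
warn_if_low() {
  local threshold="$1" pct
  pct="$(mem_available_pct "${2:-/proc/meminfo}")" || return 2
  if [ "$pct" -lt "$threshold" ]; then
    echo "LOW: ${pct}% memory available (threshold ${threshold}%)"
    return 1
  fi
  echo "ok: ${pct}% memory available"
}
```

For example, warn_if_low 15 could run from cron on each node. For RabbitMQ’s own view of memory use, rabbitmq-diagnostics memory_breakdown shows where the node’s memory is going.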
  • Incorrect cluster_formation.k8s.host or service account token: In Kubernetes, if the API server host or the service account token is incorrect, nodes cannot discover each other.

    • Diagnosis: Verify cluster_formation.k8s.host (e.g., kubernetes.default.svc.cluster.local) and the service account token that RabbitMQ reads inside the pod (cluster_formation.k8s.token_path, by default the token mounted under /var/run/secrets/kubernetes.io/serviceaccount). Ensure rabbitmq.conf reflects the correct values.
    • Fix: Correct cluster_formation.k8s.host and ensure the service account has the RBAC permissions needed to read the endpoints of the RabbitMQ service. Restart the RabbitMQ pods.
    • Why it works: This allows RabbitMQ to query the Kubernetes API to discover other RabbitMQ pods in the same namespace for cluster formation.
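Before digging into RBAC, it is worth confirming that the service account credentials are mounted in the pod at all. The sketch below assumes the standard Kubernetes mount path /var/run/secrets/kubernetes.io/serviceaccount; check_sa_files is a hypothetical helper meant to be run in a shell inside the RabbitMQ pod.

```shell
#!/usr/bin/env bash
# check_sa_files [DIR]: verify that the service account token, CA
# certificate, and namespace files exist and are non-empty.
# Assumption: the standard mount path is used unless DIR is given.
check_sa_files() {
  local dir="${1:-/var/run/secrets/kubernetes.io/serviceaccount}" f status=0
  for f in token ca.crt namespace; do
    if [ -s "${dir}/${f}" ]; then
      echo "ok      ${f}"
    else
      echo "missing ${f}"
      status=1
    fi
  done
  return "$status"
}
```

If the files are present, permissions can be checked from outside the pod with kubectl auth can-i get endpoints --as=system:serviceaccount:<namespace>:<service_account> -n <namespace>, where the namespace and service account names are placeholders for your own values.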

The next error you’ll likely encounter is channel_error or connection_closed on clients trying to connect to the unavailable node, as they won’t be able to reach it for operations.
