The Erlang distribution protocol, which RabbitMQ uses for inter-node communication, failed because nodes couldn’t authenticate each other due to mismatched secret keys.
The root cause is that all RabbitMQ nodes in a cluster must share the exact same Erlang cookie. This cookie is a shared secret that the Erlang runtime uses to verify that nodes are allowed to communicate with each other. If the cookies don’t match, nodes will reject connection attempts from each other, leading to cluster instability or complete failure to form a cluster.
Here are the common reasons for mismatched cookies and how to fix them:
- New Node Added with Default Cookie: When RabbitMQ first starts on a new server and no cookie file exists, the Erlang runtime generates a random one (the `.erlang.cookie` file in the home directory of the user running the node, typically `/var/lib/rabbitmq/.erlang.cookie` on Linux package installs). Because this generated cookie differs from the one on existing cluster nodes, the new node won’t be able to join.
  - Diagnosis: On each node, check the cookie file: `sudo cat /var/lib/rabbitmq/.erlang.cookie`. Compare the output between all nodes.
  - Fix: Ensure all nodes have the identical cookie. The easiest way is to copy the cookie file from an existing, healthy node to the new node. For example, on the new node: `sudo cp /path/to/source/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie`. Then set the correct permissions: `sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie` and `sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie`. Restart the RabbitMQ service on the new node: `sudo systemctl restart rabbitmq-server`.
  - Why it works: This ensures the Erlang VM on the new node uses the same secret key as the existing nodes, allowing them to authenticate.
- Manual Cookie Modification on One Node: An administrator might have manually edited the `.erlang.cookie` file on one node without synchronizing it to the others.
  - Diagnosis: Same as above: `sudo cat /var/lib/rabbitmq/.erlang.cookie` on all nodes.
  - Fix: Identify the correct cookie value (usually from a node that is part of the cluster, or the intended shared secret). On all other nodes, overwrite their `.erlang.cookie` file with the correct content. Then restart RabbitMQ on those nodes: `sudo systemctl restart rabbitmq-server`.
  - Why it works: Restoring consistency across all nodes allows the Erlang distribution to function as intended.
- Automated Deployment/Provisioning Errors: In automated environments (like Ansible, Chef, Terraform), the cookie might not be correctly distributed or might be generated independently on each node.
  - Diagnosis: Inspect the deployment scripts or configuration management for how the `.erlang.cookie` file is managed. Use the `cat` command on the nodes to verify.
  - Fix: Correct the deployment playbook/script to ensure the same cookie content is written to `/var/lib/rabbitmq/.erlang.cookie` on all provisioned RabbitMQ nodes. Restart the RabbitMQ services after the deployment.
  - Why it works: Automating the correct distribution of the secret key eliminates manual errors and ensures uniformity.
- File Permissions Incorrect: If the `.erlang.cookie` file exists but has incorrect permissions, the `rabbitmq` user might not be able to read it. If other users can read it, that is a security risk, and the Erlang runtime can refuse to start the node because it expects the cookie file to be accessible by its owner only.
  - Diagnosis: Run `sudo ls -l /var/lib/rabbitmq/.erlang.cookie`. The owner should be `rabbitmq` and the permissions `rw-------` (600).
  - Fix: Correct permissions and ownership: `sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie` and `sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie`. Restart RabbitMQ: `sudo systemctl restart rabbitmq-server`.
  - Why it works: Ensures the RabbitMQ process, running as the `rabbitmq` user, has exclusive read access to the secret.
- Erlang Runtime Not Restarted After Cookie Change: Sometimes the cookie file is updated, but the RabbitMQ process (which runs on the Erlang VM) isn’t restarted, so it continues to use the old cookie value it loaded at startup.
  - Diagnosis: Verify the cookie file content (`cat`), then check whether the RabbitMQ service is running and whether its processes were started after the cookie file was last modified.
  - Fix: Always restart the RabbitMQ service after modifying the `.erlang.cookie` file: `sudo systemctl restart rabbitmq-server`.
  - Why it works: A service restart starts a fresh Erlang VM, which reads the cookie file again at startup.
- Multiple RabbitMQ Instances on the Same Host (Uncommon but Possible): If you’re running multiple, separate RabbitMQ instances on a single machine (e.g., for testing, or under different users), instances that are intended to be clustered need the same cookie, while instances that must stay isolated from each other should use different cookies. Misconfiguration here can lead to confusion.
  - Diagnosis: By default, each instance reads `.erlang.cookie` from the home directory of the OS user it runs as, so identify that user and home directory for every instance. Checking each instance’s environment (for example `RABBITMQ_BASE` or `RABBITMQ_CONFIG_FILE`) helps tell the instances apart.
  - Fix: Ensure each instance’s cookie file is either unique (if not clustered) or identical (if clustered). Restart the specific instance’s RabbitMQ service.
  - Why it works: Isolates or connects instances based on their intended configuration by managing their respective distribution secrets.
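The diagnosis steps above print the cookie itself to the terminal. As a hedged alternative, the sketch below compares a fingerprint instead and checks mode and ownership in one go. The function names `cookie_fingerprint` and `check_cookie_file` are hypothetical helpers, the default path assumes a Linux package install, and `sha256sum` and `stat -c` assume GNU coreutils.

```bash
#!/usr/bin/env bash
# Sketch: helpers for comparing and validating an Erlang cookie file.
# Function names are hypothetical; adjust the default path to your install.

COOKIE_FILE="${COOKIE_FILE:-/var/lib/rabbitmq/.erlang.cookie}"

# Print a SHA-256 fingerprint of the cookie so the secret itself never
# lands in terminal scrollback or logs. The hash covers the raw file
# bytes, so whitespace differences also show up as a mismatch.
cookie_fingerprint() {
    local file="${1:-$COOKIE_FILE}"
    sha256sum "$file" | awk '{print $1}'
}

# Check that the file is owner-only (600 or 400) and owned by the
# expected user. Prints "ok" or a short description of the problem.
check_cookie_file() {
    local file="${1:-$COOKIE_FILE}" expected_owner="${2:-rabbitmq}"
    local mode owner
    mode=$(stat -c '%a' "$file") || return 2
    owner=$(stat -c '%U' "$file") || return 2
    if [ "$mode" != "600" ] && [ "$mode" != "400" ]; then
        echo "bad mode: $mode"
        return 1
    fi
    if [ "$owner" != "$expected_owner" ]; then
        echo "bad owner: $owner"
        return 1
    fi
    echo "ok"
}
```

Run as root (or with `sudo bash`), calling `cookie_fingerprint` on each node and comparing the resulting strings tells you whether the cookies match without revealing them.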
After ensuring all nodes have the identical, correct Erlang cookie and restarting the RabbitMQ service on all of them, the nodes should be able to authenticate and form or rejoin the cluster. If you were seeing errors like `{not_authorized, 'rabbit@othernode'}` or nodes failing to list each other in `rabbitmqctl cluster_status`, these should resolve.
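Before relying on the cluster again, it can help to confirm that every node really ended up with the same cookie. The sketch below assumes you have already collected one "node fingerprint" pair per line (how you gather them is up to you); `find_mismatched_nodes` is a hypothetical helper, not a RabbitMQ command.

```bash
#!/usr/bin/env bash
# Sketch: spot nodes whose cookie fingerprint differs from the rest.
# find_mismatched_nodes is a hypothetical helper, not part of RabbitMQ.

# Reads "node fingerprint" pairs on stdin, one per line, and prints every
# node whose fingerprint differs from the one on the first line.
# No output means all fingerprints agree.
find_mismatched_nodes() {
    awk 'NR == 1 { ref = $2; next } $2 != ref { print $1 }'
}

# Example input, e.g. built by running
#   sudo sha256sum /var/lib/rabbitmq/.erlang.cookie
# on each host and pairing the hash with the node name:
#   printf '%s\n' 'rabbit@node1 ab12' 'rabbit@node2 ab12' 'rabbit@node3 ff00' \
#       | find_mismatched_nodes
#   # prints: rabbit@node3
```

The first line is treated as the reference, so put a node you trust (one that is already a healthy cluster member) at the top of the input.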
If network access or firewall rules aren’t configured correctly, the next error you’ll likely encounter is a connection refused or timeout error when nodes attempt to communicate over the Erlang distribution port (typically 25672).
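A quick reachability check from one node to another can separate network problems from cookie problems. This is a minimal sketch, assuming bash with `/dev/tcp` support and the `timeout` utility. The function names are hypothetical, and the port list assumes defaults: 25672 for distribution plus 4369 for `epmd`, the Erlang port mapper that nodes contact first.

```bash
#!/usr/bin/env bash
# Sketch: test TCP reachability of the ports used for inter-node traffic.
# Function names are hypothetical; ports assume default settings.

# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Report reachability of epmd (4369) and the distribution port (25672).
check_distribution_ports() {
    local host="$1" port
    for port in 4369 25672; do
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port unreachable"
        fi
    done
}

# Usage (hostname is a placeholder):
#   check_distribution_ports othernode
```

An "unreachable" result points at firewalls, security groups, or a node that isn’t running, while a "reachable" result combined with authentication errors points back at the cookie.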