The RabbitMQ federation link is down because the upstream broker is not responding to heartbeats, indicating a network issue or an overloaded upstream.

Common Causes and Fixes

  1. Network Connectivity Issues:

    • Diagnosis: On the node running the federation consumer (the one initiating the federation link), try to ping the upstream broker’s hostname or IP address. If ping fails, check network firewalls between the two nodes.
      ping <upstream_broker_hostname_or_ip>
      
    • Fix: Ensure that port 5672 (AMQP) and, if TLS is used, port 5671 (AMQPS) are open on the upstream broker’s firewall, and that no network device between the consumer and the upstream blocks traffic on them.
    • Why it works: RabbitMQ uses TCP connections for communication. If the TCP connection cannot be established due to firewalls or routing problems, heartbeats will fail, and the link will drop.
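Because ping only exercises ICMP, it can succeed while the AMQP port itself is still filtered. A minimal sketch of a TCP-level check follows, assuming it is run from the consumer node; the helper name can_connect is illustrative and not part of any RabbitMQ tooling.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, can_connect("upstream.example.com", 5672) returning False while ping succeeds would point at a firewall rule or a broker that is not listening, rather than at routing.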
  2. Upstream Broker Overload/Unresponsiveness:

    • Diagnosis: Log into the upstream RabbitMQ broker and check its health. In the management UI, look at the node statistics on the Overview page and at the Connections tab. High CPU, memory, or disk I/O load, or a very large number of connections, can cause the broker to stop responding to heartbeats. Also check the RabbitMQ logs on the upstream for errors or warnings.
      # On the upstream broker, check logs
      sudo tail -f /var/log/rabbitmq/rabbit@<hostname>.log
      
    • Fix:
      • Resource Allocation: Increase CPU, RAM, or disk I/O for the upstream broker.
      • Connection Load: If the upstream is overwhelmed by too many connections, identify the clients opening them and reduce or pool those connections. If a memory or disk alarm has been triggered, free up resources or optimize the clients; adjusting vm_memory_high_watermark or disk_free_limit only helps when the host genuinely has headroom to spare.
      • Queue/Exchange Load: If specific queues or exchanges on the upstream are experiencing extreme message rates, consider offloading or optimizing producers.
    • Why it works: When a broker is overloaded, it may not have enough resources (CPU, memory) to process incoming network traffic, including heartbeat requests from its federated peers. This leads to timeouts.
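The same health signals can be read from the management HTTP API instead of the UI. The sketch below assumes the management plugin is enabled on the upstream (port 15672 by default) and that the /api/nodes response carries the mem_alarm, disk_free_alarm, mem_used, mem_limit, fd_used, and fd_total fields; the function names and the 80% threshold are illustrative choices, not RabbitMQ defaults.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """Fetch node statistics from the management API (e.g. http://host:15672)."""
    req = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def pressure_signs(node, threshold=0.8):
    """Return human-readable signs of resource pressure for one node entry."""
    signs = []
    if node.get("mem_alarm"):
        signs.append("memory alarm")
    if node.get("disk_free_alarm"):
        signs.append("disk alarm")
    ratios = (
        ("memory", "mem_used", "mem_limit"),
        ("file descriptors", "fd_used", "fd_total"),
    )
    for label, used_key, total_key in ratios:
        used, total = node.get(used_key), node.get(total_key)
        # Skip fields that are missing or zero to avoid division errors.
        if used is not None and total and used / total >= threshold:
            signs.append(f"{label} at {used / total:.0%} of limit")
    return signs
```

Calling pressure_signs on each entry returned by fetch_nodes gives a quick list of reasons the upstream might be too busy to answer heartbeats.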
  3. Incorrect Upstream URI in Federation Policy:

    • Diagnosis: Verify the uri parameter of the federation upstream on the consumer node (the policy only names the upstream or upstream set; the uri is defined on the upstream itself). Ensure it exactly matches the reachable AMQP URI of the upstream broker, including the correct scheme (amqp:// or amqps://), hostname/IP, port, and any necessary credentials.
      # Example policy configuration snippet
      {
        "vhost": "/",
        "pattern": "federated-exchange.*",
        "definition": {
          "federation-upstream-set": "my-upstream-set"
        }
      }
      
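      And the upstream and upstream set it refers to, shown as runtime parameters (my-upstream is an illustrative name; the set only lists upstreams by name, while the uri lives on the upstream):
      {
        "component": "federation-upstream",
        "name": "my-upstream",
        "value": {
          "uri": "amqp://user:password@upstream.example.com:5672/",
          "prefetch-count": 1000,
          "reconnect-delay": 5,
          "ack-mode": "on-publish"
        }
      }
      {
        "component": "federation-upstream-set",
        "name": "my-upstream-set",
        "value": [
          { "upstream": "my-upstream" }
        ]
      }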
      
    • Fix: Correct the uri in the federation upstream definition so that it points to the correct, reachable upstream broker. Ensure the credentials are valid if authentication is required.
    • Why it works: The uri is how the consumer node finds and connects to the upstream. With an incorrect URI, the consumer tries to connect to the wrong place or with the wrong credentials, so the connection fails and the federation link stays down.
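A quick way to catch typos in the uri before applying it is to parse it and look at the pieces. This is a minimal sketch using only the Python standard library; check_upstream_uri is a hypothetical helper, and it assumes a remote upstream, so it rejects a URI without a host.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes a federation upstream URI can use.
DEFAULT_PORTS = {"amqp": 5672, "amqps": 5671}


def check_upstream_uri(uri):
    """Split an upstream URI into the parts worth verifying by eye."""
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        # Assumption: the upstream is remote, so a host must be present.
        raise ValueError("URI has no host")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "has_credentials": parts.username is not None,
        "vhost_path": parts.path,
    }
```

Running it on the example URI above should report host upstream.example.com, port 5672, and has_credentials set to True; a missing user:password@ part shows up immediately as has_credentials being False.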
  4. SSL/TLS Certificate Issues (if using AMQPS):

    • Diagnosis: If your federation link uses amqps:// (port 5671 by default), check the TLS certificates on both the upstream and consumer nodes. Ensure the upstream’s certificate is valid and trusted by the consumer, and that the hostname in the URI matches the Subject Alternative Name (SAN) or Common Name (CN) of the upstream’s certificate. Check the RabbitMQ logs on both sides for TLS-related errors.
      # On the consumer, check logs for SSL errors
      sudo tail -f /var/log/rabbitmq/rabbit@<consumer_hostname>.log
      
    • Fix:
      • Trust Store: Ensure the CA certificate that signed the upstream’s certificate is present in the consumer’s trust store (e.g., /etc/rabbitmq/certs/ca_certificate.pem).
      • Hostname Mismatch: Update the uri in the federation upstream definition to match the CN/SAN of the upstream’s certificate, or re-issue the upstream’s certificate with the correct hostname.
      • Certificate Expiry: Renew any expired certificates.
    • Why it works: SSL/TLS handshake failures prevent the establishment of a secure connection. If the handshake fails due to trust issues, hostname mismatches, or expired certificates, the AMQP connection cannot be formed, and the federation link will not come up.
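The trust, hostname, and expiry checks can be reproduced outside RabbitMQ with a plain TLS handshake from the consumer host. The following is a sketch under a few assumptions: the upstream listens for AMQPS on port 5671, it does not require a client certificate, and the function names are illustrative.

```python
import socket
import ssl
import time


def fetch_peer_cert(host, port=5671, cafile=None, timeout=5.0):
    """Handshake with host:port and return the verified peer certificate.

    Raises ssl.SSLCertVerificationError if the certificate is untrusted,
    expired, or does not match the hostname.
    """
    # The default context enables certificate and hostname verification.
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_until_expiry(cert, now=None):
    """Return the number of days until the certificate's notAfter date."""
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    now = time.time() if now is None else now
    return (expires - now) / 86400
```

If fetch_peer_cert raises a verification error when pointed at the same hostname that appears in the uri, the federation link is likely failing for the same reason; a small or negative value from days_until_expiry points at certificate expiry.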
  5. Erlang Cookie Mismatch (clustered upstreams only):

    • Diagnosis: The Erlang cookie is not involved in federation itself. A federation link is an ordinary AMQP connection, so the consumer and the upstream do not need to share a cookie. The cookie matters only between nodes of the same cluster and for CLI tools such as rabbitmqctl. If the upstream is a cluster, a cookie mismatch among its own nodes can leave it partially unavailable, which in turn can drop the link. Compare the cookie file (/var/lib/rabbitmq/.erlang.cookie or similar) on each node of that cluster.
      # On each node of the upstream cluster
      sudo cat /var/lib/rabbitmq/.erlang.cookie
      
    • Fix: Ensure the Erlang cookie is identical on all nodes of the same cluster. Stop RabbitMQ on the affected nodes, synchronize the cookie file, and restart RabbitMQ. There is no need to copy the cookie between the consumer and the upstream.
    • Why it works: The Erlang cookie is an authentication mechanism for inter-node communication within an Erlang cluster. If the cookies don’t match, cluster nodes cannot authenticate with each other and the cluster degrades. The federation link, which authenticates with the credentials in the uri, is affected only indirectly.
  6. Firewall Blocking Inter-Node Communication (Erlang Port Mapper):

    • Diagnosis: Like the cookie, epmd is used for clustering and CLI tools, not for federation; the federation link itself needs only the AMQP port. This check applies when the upstream is a cluster whose nodes run on different machines. epmd listens on port 4369 by default, and inter-node traffic uses the distribution port (25672 by default). Check that these ports are reachable between the cluster nodes.
      # From one upstream cluster node, test epmd on a peer node
      telnet <peer_node_hostname_or_ip> 4369
      
    • Fix: Open port 4369 and the distribution port on the firewalls between the nodes of the same cluster. These ports do not need to be open between the consumer and the upstream.
    • Why it works: Cluster nodes use epmd to discover each other and the distribution port each one listens on. If these ports are blocked, the nodes cannot reach each other and the cluster may become partitioned or partially unavailable, which can take down federation links that terminate on it.

Once all of the above are fixed, the next error you are likely to hit is an authentication failure on the upstream, which occurs when the upstream requires credentials that the URI does not explicitly provide.
