The RabbitMQ Health Check API doesn’t just tell you if RabbitMQ is up; it tells you if your cluster is actually healthy and ready to process messages.

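Let’s see it in action. Imagine a simple cluster with two nodes, rabbit1 and rabbit2. We’ll use curl to poke at the health check endpoint. The management HTTP API requires basic authentication, so credentials go in with -u (user:password here is a placeholder).

curl -u user:password http://rabbit1:15672/api/health/node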

This might return something like:

{
  "status": "ok",
  "message": "Node is healthy and running.",
  "details": {
    "node": "rabbit@rabbit1",
    "status": "running",
    "disk_free": 10000000000,
    "mem_used": 500000000,
    "mem_total": 2000000000,
    "disk_free_alarm": false,
    "mem_alarm": false,
    "disk_free_threshold": 500000000,
    "mem_threshold": 1000000000
  }
}
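
A monitoring script usually wants a verdict and some headroom figures rather than the raw payload. The following is a minimal sketch that assumes the response has exactly the shape shown above; the field names come from that sample, and summarize_node_health is a hypothetical helper, not part of any RabbitMQ client library.

```python
import json

# Sample payload trimmed from the node response above. The field names are
# the article's and are assumed here, not verified against a live broker.
SAMPLE_NODE_RESPONSE = """
{
  "status": "ok",
  "details": {
    "node": "rabbit@rabbit1",
    "disk_free": 10000000000,
    "mem_used": 500000000,
    "disk_free_alarm": false,
    "mem_alarm": false,
    "disk_free_threshold": 500000000,
    "mem_threshold": 1000000000
  }
}
"""


def summarize_node_health(payload: dict) -> dict:
    """Reduce a node health payload to a verdict plus headroom in bytes."""
    details = payload["details"]
    healthy = (
        payload.get("status") == "ok"
        and not details["disk_free_alarm"]
        and not details["mem_alarm"]
    )
    return {
        "node": details["node"],
        "healthy": healthy,
        # How far each resource is from tripping its alarm.
        "disk_headroom_bytes": details["disk_free"] - details["disk_free_threshold"],
        "mem_headroom_bytes": details["mem_threshold"] - details["mem_used"],
    }


if __name__ == "__main__":
    print(summarize_node_health(json.loads(SAMPLE_NODE_RESPONSE)))
```

Tracking headroom rather than only the boolean alarm flags lets an alert fire before the broker itself raises an alarm.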

Now, let’s simulate a problem. Suppose rabbit2 has run out of disk space. Querying the cluster endpoint on rabbit1 reports the overall cluster status, including the health of every other node rabbit1 can reach.

curl -u user:password http://rabbit1:15672/api/health/cluster

If rabbit2 is out of disk space, rabbit1 might report:

{
  "status": "error",
  "message": "Cluster is unhealthy.",
  "details": {
    "rabbit@rabbit1": {
      "status": "ok",
      "disk_free": 15000000000,
      "mem_used": 700000000,
      "mem_total": 2000000000,
      "disk_free_alarm": false,
      "mem_alarm": false
    },
    "rabbit@rabbit2": {
      "status": "error",
      "message": "Disk alarm is active.",
      "disk_free": 400000000,
      "mem_used": 600000000,
      "mem_total": 2000000000,
      "disk_free_alarm": true,
      "mem_alarm": false
    }
  }
}
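
To turn a cluster response like this into an alert, it helps to pull out only the nodes that have a problem and the reasons why. This is a rough sketch assuming the payload shape shown above; unhealthy_nodes is a hypothetical helper name.

```python
# Cluster payload trimmed from the sample response above (assumed shape).
SAMPLE_CLUSTER_RESPONSE = {
    "status": "error",
    "details": {
        "rabbit@rabbit1": {
            "status": "ok",
            "disk_free_alarm": False,
            "mem_alarm": False,
        },
        "rabbit@rabbit2": {
            "status": "error",
            "message": "Disk alarm is active.",
            "disk_free_alarm": True,
            "mem_alarm": False,
        },
    },
}


def unhealthy_nodes(payload: dict) -> dict:
    """Map each problematic node name to a list of reasons."""
    problems = {}
    for node, info in payload.get("details", {}).items():
        reasons = []
        if info.get("status") != "ok":
            reasons.append(info.get("message", "status is not ok"))
        # Report active alarms by flag name so alerts can be routed per resource.
        for alarm in ("disk_free_alarm", "mem_alarm"):
            if info.get(alarm):
                reasons.append(alarm)
        if reasons:
            problems[node] = reasons
    return problems


if __name__ == "__main__":
    for node, reasons in unhealthy_nodes(SAMPLE_CLUSTER_RESPONSE).items():
        print(node, reasons)
```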

This API is your first line of defense against subtle cluster failures. It goes beyond a simple "is the process running?" check by evaluating critical resource thresholds and inter-node communication. For a single node, /api/health/node gives you the direct status. For a cluster, /api/health/cluster aggregates the health of all visible nodes from the perspective of the queried node.
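
From a script, the same two endpoints can be queried with Python’s standard library. This is a sketch under assumptions: the /api/health/node and /api/health/cluster paths are the ones used in this article, the port is the 15672 used above, and build_health_request and fetch_health are hypothetical helper names. The management HTTP API uses HTTP basic authentication, so the credentials are sent in an Authorization header.

```python
import base64
import json
import urllib.request


def build_health_request(host: str, scope: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for /api/health/<scope> with HTTP basic auth."""
    if scope not in ("node", "cluster"):
        raise ValueError("scope must be 'node' or 'cluster'")
    url = f"http://{host}:15672/api/health/{scope}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_health(host: str, scope: str, user: str, password: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body.

    urlopen raises urllib.error.HTTPError for non-2xx responses, so callers
    that want to inspect an error body need to catch that exception.
    """
    request = build_health_request(host, scope, user, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the URL and header logic easy to check without a running broker.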

The health check is powered by RabbitMQ’s internal metrics collection. When you query the health API, RabbitMQ checks several things:

  • Node Status: Is the Erlang VM running and responsive?
  • Resource Alarms: Are disk space or memory usage exceeding configured thresholds? These are crucial because RabbitMQ blocks publishing connections while an alarm is active, so producers stall even though the node is still running.
  • Inter-node Connectivity: For cluster health, it checks if the node can communicate with other nodes in the cluster. This ensures that cluster operations like queue mirroring or distributed exchanges can function correctly.
  • File Descriptors: Are there enough open file descriptors available for the RabbitMQ process? Running out of these can cause various connection and internal processing failures.

The disk_free_alarm and mem_alarm are particularly important. RabbitMQ sets these alarms internally when disk_free drops below the disk_free_threshold (default 50MB) or when mem_used exceeds the memory high watermark, a configurable fraction of mem_total (0.4, i.e. 40%, by default in the 3.x series). While an alarm is active, RabbitMQ blocks connections that publish messages until the alarm clears, which prevents further resource exhaustion. The health API surfaces these critical states.
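
The alarm rule above reduces to two comparisons. The sketch below mirrors that rule for illustration only; it is not RabbitMQ’s internal implementation, evaluate_alarms is a hypothetical name, and the watermark is a required argument so that no default is assumed.

```python
def evaluate_alarms(
    disk_free: int,
    disk_free_threshold: int,
    mem_used: int,
    mem_total: int,
    mem_watermark: float,
) -> dict:
    """Return which alarms would be active for the given byte counts.

    mem_watermark is the fraction of total memory (between 0 and 1) above
    which the memory alarm fires.
    """
    if not 0 < mem_watermark <= 1:
        raise ValueError("mem_watermark must be in the range (0, 1]")
    return {
        "disk_free_alarm": disk_free < disk_free_threshold,
        "mem_alarm": mem_used > mem_total * mem_watermark,
    }


if __name__ == "__main__":
    # rabbit2 from the cluster example: 400 MB free against a 500 MB threshold.
    print(evaluate_alarms(400_000_000, 500_000_000, 600_000_000, 2_000_000_000, 0.4))
```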

The health check is not a static liveness probe; it reflects the dynamic state of the cluster. A node might be running, but if it’s experiencing high resource usage or a network partition, the health API will reveal it. This makes it invaluable for automated monitoring and alerting systems. The disk_free and mem_used values are reported in bytes, giving you precise figures for your resource monitoring.

The exact thresholds for disk and memory alarms are configurable, often via rabbitmq.conf or environment variables. For instance, you can set vm_memory_high_watermark.relative to 0.8 to trigger a memory alarm when 80% of total memory is used. The health API will reflect these custom configurations.

The details section in the cluster health check shows the status of each individual node as seen by the node you queried. If you query rabbit1 and it can’t see rabbit2, rabbit2 might appear as error or be missing from the details entirely, indicating a network partition or a node failure.
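
Because a partitioned node can simply vanish from details, a check that only inspects the nodes present in the response can miss it. One way around this, sketched here under the assumption that you keep your own list of expected node names, is to compare that list against the response. classify_nodes is a hypothetical helper.

```python
def classify_nodes(expected_nodes: list, payload: dict) -> dict:
    """Label every expected node as 'ok', 'error', or 'missing'."""
    details = payload.get("details", {})
    result = {}
    for node in expected_nodes:
        if node not in details:
            # Absent from the response: possible partition or node failure.
            result[node] = "missing"
        elif details[node].get("status") != "ok":
            result[node] = "error"
        else:
            result[node] = "ok"
    return result


if __name__ == "__main__":
    # rabbit1 answers for itself but cannot see rabbit2.
    response = {"status": "error", "details": {"rabbit@rabbit1": {"status": "ok"}}}
    print(classify_nodes(["rabbit@rabbit1", "rabbit@rabbit2"], response))
```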

Once you’ve resolved disk space issues or memory pressure, the alarms will clear automatically, and the health API will return status: "ok".
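
After a cleanup, a script can confirm recovery by polling until the status returns to "ok". The sketch below assumes only the top-level status field shown in this article; wait_until_ok is a hypothetical helper, and fetch is any zero-argument callable you supply that returns the decoded health payload.

```python
import time
from typing import Callable


def wait_until_ok(
    fetch: Callable[[], dict],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch() until it reports status 'ok' or attempts run out."""
    for attempt in range(attempts):
        if fetch().get("status") == "ok":
            return True
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Simulated sequence: the alarm clears on the third poll.
    responses = iter([{"status": "error"}, {"status": "error"}, {"status": "ok"}])
    print(wait_until_ok(lambda: next(responses), delay_seconds=0))
```

Passing sleep in as a parameter keeps the loop testable without real delays.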

The next thing you’ll want to monitor is the file_descriptors count and the fd_used metric, as running out of file descriptors is a common cause of unexpected connection failures.
