RabbitMQ’s disk alarm is preventing new messages from being published because the broker believes it’s running out of disk space.

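Here’s what’s actually broken: RabbitMQ raises the disk alarm when free space on the partition holding its data directory falls below the configured disk_free_limit (50 MB by default). That data directory is usually the mnesia directory, which holds metadata about queues, exchanges, and bindings alongside the on-disk message stores, so a data directory that has grown too large is a common trigger. If disk_free_limit is set high, the alarm can fire even when the disk still looks far from full. While the alarm is active, publishers are blocked, but existing messages can still be consumed.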
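The comparison behind the alarm can be reproduced by hand. The following is a minimal sketch, not RabbitMQ’s own check: free_bytes and would_alarm are hypothetical helper names, and the 50000000-byte limit in the usage comment assumes the default disk_free_limit has not been changed.

```shell
# free_bytes DIR: print the free space, in bytes, on the filesystem holding DIR.
free_bytes() {
    # -P gives one POSIX-format line per filesystem, -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# would_alarm FREE LIMIT: succeed (exit status 0) when FREE is below LIMIT.
would_alarm() {
    [ "$1" -lt "$2" ]
}

# Usage (assumed data directory and assumed default limit):
#   would_alarm "$(free_bytes /var/lib/rabbitmq)" 50000000 && echo "below limit"
```

The broker’s own view of the limit and of any active alarms is still the authoritative one; this only shows why a partition can trip the alarm.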

Common Causes and Fixes:

  1. Excessive mnesia Database Growth:

    • Diagnosis: Check the size of the mnesia directory. On a typical Linux install, this is /var/lib/rabbitmq/mnesia/rabbit@<hostname>/. The message stores (directories named msg_store_persistent and msg_store_transient, nested under msg_stores/ on recent versions) usually account for most of the size.
      sudo du -sh /var/lib/rabbitmq/mnesia/rabbit@<hostname>/
      
    • Fix: Restarting RabbitMQ discards transient messages and non-durable queues, so the transient store starts out empty. Durable queues and persistent messages survive, so a restart will not shrink data the broker still has to keep. This is often the quickest fix.
      sudo systemctl restart rabbitmq-server
      
    • Why it works: Transient data is not meant to survive a broker restart, so restarting gives that data a clean slate and drops old, unreferenced entries that were contributing to the directory’s size.
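To see which part of the data directory is actually taking the space, the du check above can be broken down per subdirectory. This is a small sketch; largest_entries is a hypothetical helper name, and reading /var/lib/rabbitmq usually requires root.

```shell
# largest_entries DIR N: list the N largest entries directly under DIR,
# biggest first, as "<size in KiB><TAB><path>".
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n "$2"
}

# Usage (path as in the diagnosis step, <hostname> left as a placeholder):
#   sudo sh -c '. ./helpers.sh; largest_entries "/var/lib/rabbitmq/mnesia/rabbit@<hostname>" 5'
```

Here helpers.sh is just an assumed file name for wherever the function is saved.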
  2. Unacknowledged Messages:

    • Diagnosis: Unacknowledged messages don’t inflate the mnesia metadata tables themselves, but the broker must keep every unacknowledged message until the consumer acks it or its channel closes, so a backlog keeps message data on disk. A symptom is a persistently high message count on queues that should be empty.
      rabbitmqctl list_queues name messages_ready messages_unacknowledged
      
    • Fix: Ensure your consumers acknowledge messages properly. If consumers are stuck, restart them; closing their channels returns their unacknowledged messages to the queue. As a last resort, purge the queue. Purging is destructive: the messages are gone for good.
      # To clear a specific queue (use with extreme caution!)
      rabbitmqctl purge_queue <queue_name>
      
    • Why it works: Fewer messages means less message data and less internal state for RabbitMQ to track, which lets the broker reclaim the disk space those messages occupied.
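On a broker with many queues, the list_queues output is easier to read once it is filtered down to the queues with a real backlog. A minimal sketch, assuming the tab-separated column order name, messages_ready, messages_unacknowledged used above; stuck_queues is a hypothetical helper name and the threshold is arbitrary.

```shell
# stuck_queues MIN: read list_queues output on stdin and print
# "<queue name><TAB><unacknowledged count>" for queues with at least MIN
# unacknowledged messages. Lines whose third column is not a number
# (headers, informational lines) are skipped.
stuck_queues() {
    awk -F '\t' -v min="$1" \
        '$3 ~ /^[0-9]+$/ && $3 + 0 >= min + 0 { print $1 "\t" $3 }'
}

# Usage:
#   rabbitmqctl list_queues name messages_ready messages_unacknowledged | stuck_queues 1000
```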
  3. Large Number of Queues/Exchanges/Bindings:

    • Diagnosis: A very high count of these objects can also bloat mnesia.
      rabbitmqctl list_queues | wc -l
      rabbitmqctl list_exchanges | wc -l
      
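      (The output includes header and informational lines, so the raw counts are slightly high; passing --quiet and --no-table-headers, where your rabbitmqctl version supports them, removes those lines. Both commands only cover the default virtual host unless you pass -p <vhost>.)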
    • Fix: Review your application’s queue/exchange creation patterns. If dynamic creation is happening excessively, implement a strategy to reuse queues or clean them up when no longer needed. A restart will temporarily alleviate the mnesia size but the underlying issue will return if not addressed.
    • Why it works: Each queue, exchange, and binding has metadata stored in mnesia. Reducing the total number of these objects directly reduces the size of the mnesia database.
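A raw count does not show where the queues come from. Grouping queue names by prefix can point at the application that keeps creating them. This is only a sketch under an assumed naming convention (a dot-separated prefix such as rpc.<id>); queue_prefix_counts is a hypothetical helper name.

```shell
# queue_prefix_counts: read one queue name per line on stdin and print
# "<count><TAB><prefix>", most common prefix first. The prefix is the
# part of the name before the first dot (an assumed convention).
queue_prefix_counts() {
    cut -d . -f 1 | sort | uniq -c | sort -rn | awk '{ print $1 "\t" $2 }'
}

# Usage:
#   rabbitmqctl list_queues name | queue_prefix_counts | head
```

A prefix with thousands of queues is a reasonable first place to look for missing cleanup.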
  4. Node Disk Full (The Obvious One):

    • Diagnosis: Even if mnesia is the trigger, the underlying disk can be full, which would indeed cause the alarm.
      df -h /var/lib/rabbitmq
      
    • Fix: Free up disk space. This might involve deleting old logs, clearing out old message data (if persisted elsewhere), or increasing disk capacity.
      # Example: remove old log files
      sudo find /var/log/rabbitmq/ -type f -name "*.gz" -delete
      
    • Why it works: The disk alarm is a direct indicator that the filesystem where RabbitMQ stores its data (including mnesia) is critically low on space. Removing files gives the broker room again, provided they sit on the same filesystem as the data directory, and the alarm clears once free space rises back above the limit.
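Before deleting anything, it can help to preview what a cleanup would remove. A small sketch, where old_archives is a hypothetical helper name and the 30-day cutoff in the usage comment is an arbitrary assumption:

```shell
# old_archives DIR DAYS: print compressed log files under DIR that were
# last modified more than DAYS days ago. Nothing is deleted.
old_archives() {
    find "$1" -type f -name '*.gz' -mtime +"$2" -print
}

# Usage:
#   old_archives /var/log/rabbitmq 30
```

Once the list looks right, the same find expression can be rerun with -delete in place of -print.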
  5. mnesia Table Fragmentation:

    • Diagnosis: This is less common but possible. mnesia tables can become fragmented over time.
    • Fix: A full cluster restart (all nodes) can sometimes help mnesia reorganize its internal data structures.
      # Restart every node in the cluster, one at a time (<node1> etc. are placeholder hostnames; assumes SSH access)
      for host in <node1> <node2> <node3>; do ssh "$host" sudo systemctl restart rabbitmq-server; done
      
    • Why it works: A cluster-wide restart gives mnesia on each node an opportunity to rebuild its internal data structures and optimize its storage layout, which may defragment them.
  6. Configuration Issues / Policy Misconfiguration:

    • Diagnosis: While rare, a policy that prevents message expiration or queue deletion could lead to unbounded growth. Check your policies.
      rabbitmqctl list_policies
      
    • Fix: Review and adjust any policies that might be preventing message TTL or queue auto-deletion. For example, if queues that are not meant to be permanent are not matched by any policy that sets message-ttl or expires, add one that does.
    • Why it works: Policies dictate how RabbitMQ manages queues and messages. Incorrect policies can prevent automatic cleanup, leading to unbounded data growth.
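As an illustration of such a policy, the sketch below builds a definition that sets both a message TTL and a queue expiry, in milliseconds. The helper name policy_definition, the policy name temp-queue-ttl, the ^tmp\. pattern, and the values are all assumptions to adapt to your own naming scheme.

```shell
# policy_definition TTL_MS EXPIRES_MS: print a JSON policy definition that
# expires messages after TTL_MS and deletes unused queues after EXPIRES_MS.
policy_definition() {
    printf '{"message-ttl":%d,"expires":%d}\n' "$1" "$2"
}

# Usage (hypothetical policy name and queue-name pattern):
#   rabbitmqctl set_policy --apply-to queues temp-queue-ttl '^tmp\.' \
#       "$(policy_definition 60000 1800000)"
```

Run rabbitmqctl list_policies afterwards to confirm the policy matches the queues you expect.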

After resolving the disk alarm, you might hit a channel error such as 404 NOT_FOUND if you restarted a node and a consumer tries to consume from a queue that no longer exists (e.g., because the queue was non-durable and did not survive the restart).
