RabbitMQ’s disk alarm is blocking publishers because the broker believes it’s running out of disk space.
Here’s what’s actually broken: the node’s data directory (still named `mnesia` on most installs), which holds the metadata about queues, exchanges, and bindings along with the message stores, has grown too large. RabbitMQ raises the disk alarm when free space on the partition holding that directory falls below `disk_free_limit` (50 MB by default), so the alarm can fire even when other partitions on the host have plenty of room. While the alarm is in effect, publishing connections are blocked, but existing messages can still be consumed.
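The alarm condition can be checked by hand before digging into causes. The following is a minimal sketch, not an official tool: `DATA_DIR` and `LIMIT_KB` are assumptions (a default Linux data path and the default 50 MB limit expressed in KiB), and the helper names `free_kb` and `below_limit` are made up for illustration.

```shell
#!/bin/sh
# Sketch: warn when free space on the data partition is below the alarm limit.
# DATA_DIR and LIMIT_KB are assumptions; adjust them to your installation.
DATA_DIR="${DATA_DIR:-/var/lib/rabbitmq}"
LIMIT_KB="${LIMIT_KB:-51200}"   # 50 MB in KiB, assumed to match disk_free_limit

# free_kb DIR: print the available space, in KiB, of the filesystem holding DIR.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# below_limit FREE LIMIT: succeed (exit 0) when FREE is less than LIMIT.
below_limit() {
    [ "$1" -lt "$2" ]
}

free=$(free_kb "$DATA_DIR" 2>/dev/null)
if [ -z "$free" ]; then
    echo "could not read free space for $DATA_DIR"
elif below_limit "$free" "$LIMIT_KB"; then
    echo "ALARM LIKELY: ${free} KiB free, limit ${LIMIT_KB} KiB"
else
    echo "ok: ${free} KiB free, limit ${LIMIT_KB} KiB"
fi
```

If your configuration sets a different `disk_free_limit`, pass the matching value, for example `LIMIT_KB=1048576 sh check_disk.sh` for a 1 GiB limit (the script name is hypothetical).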
Common Causes and Fixes:
- Excessive `mnesia` Database Growth:
  - Diagnosis: Check the size of the `mnesia` directory. On a typical Linux install, this is `/var/lib/rabbitmq/mnesia/rabbit@<hostname>/`. Look at the message store directories (`msg_store_transient` and `msg_store_persistent`), which usually account for most of the size.

        sudo du -sh /var/lib/rabbitmq/mnesia/rabbit@<hostname>/

  - Fix: Restarting RabbitMQ clears the transient store and can shrink the durable store if there are no long-running transactions. This is often the quickest fix, but non-persistent messages are lost in the process.

        sudo systemctl restart rabbitmq-server

  - Why it works: Transient data lives in memory or in temporary files that are rebuilt on restart. A restart gives that temporary data a clean slate and clears out old, unreferenced data that was contributing to the directory’s size.
- Unacknowledged Messages:
  - Diagnosis: Unacknowledged messages don’t inflate the metadata tables in `mnesia` directly, but the broker has to keep every one of them until the consumer acknowledges it, and the state that piles up around stuck channels and queues can make the directory grow. This is harder to diagnose directly. A symptom is a persistently high message count on queues that should be empty.

        rabbitmqctl list_queues name messages_ready messages_unacknowledged

  - Fix: Ensure your consumers are properly acknowledging messages. If there are stuck consumers, restart them or, as a last resort, clear the queue. Clearing a queue is destructive: every message that is ready for delivery is discarded.

        # To clear a specific queue (use with extreme caution!)
        rabbitmqctl purge_queue <queue_name>

  - Why it works: Removing the messages reduces the number of messages and associated internal states that RabbitMQ needs to track, which frees up resources and lets the broker clean up the associated data.
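To make the unacknowledged-message check repeatable, the `list_queues` output can be filtered for queues with a large backlog. This is a rough sketch: `THRESHOLD` and the helper `stuck_queues` are hypothetical names, and it assumes queue names contain no whitespace.

```shell
#!/bin/sh
# Sketch: list queues whose unacknowledged count exceeds a threshold.
# THRESHOLD is a hypothetical cut-off, not a RabbitMQ setting.
THRESHOLD="${THRESHOLD:-1000}"

# stuck_queues LIMIT: read "name ready unacked" rows on stdin and print the
# name and unacked count of rows whose third column is a number above LIMIT.
# Banner and header lines are skipped because their third field is not numeric.
stuck_queues() {
    awk -v limit="$1" '$3 ~ /^[0-9]+$/ && $3 + 0 > limit + 0 { print $1, $3 }'
}

# Only query the broker when the CLI is actually installed.
if command -v rabbitmqctl >/dev/null 2>&1; then
    rabbitmqctl list_queues name messages_ready messages_unacknowledged |
        stuck_queues "$THRESHOLD"
fi
```

A queue that shows up here run after run is a good candidate for a consumer that never acknowledges.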
- Large Number of Queues/Exchanges/Bindings:
  - Diagnosis: A very high count of these objects can also bloat `mnesia`.

        rabbitmqctl list_queues | wc -l
        rabbitmqctl list_exchanges | wc -l

    (Subtract the header lines that `rabbitmqctl` prints from each count.)
  - Fix: Review your application’s queue/exchange creation patterns. If dynamic creation is happening excessively, implement a strategy to reuse queues or clean them up when they are no longer needed. A restart will temporarily reduce the `mnesia` size, but the underlying issue will return if it is not addressed.
  - Why it works: Each queue, exchange, and binding has metadata stored in `mnesia`. Reducing the total number of these objects directly reduces the size of the `mnesia` database.
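Counting with `wc -l` requires subtracting header lines by hand. The sketch below sidesteps that, assuming a RabbitMQ release whose CLI accepts the `--quiet` and `--no-table-headers` options; the helper names `count_rows` and `object_count` are invented for this example.

```shell
#!/bin/sh
# Sketch: count queues, exchanges, and bindings without header lines.

# count_rows: count the non-empty lines on stdin (prints 0 for empty input).
count_rows() {
    awk 'NF { n++ } END { print n + 0 }'
}

# object_count KIND: print how many rows "rabbitmqctl list_KIND" reports.
# Assumes the CLI supports --quiet and --no-table-headers.
object_count() {
    rabbitmqctl list_"$1" --quiet --no-table-headers | count_rows
}

# Only query the broker when the CLI is actually installed.
if command -v rabbitmqctl >/dev/null 2>&1; then
    for kind in queues exchanges bindings; do
        echo "$kind: $(object_count "$kind")"
    done
fi
```

Tracking these numbers over time shows whether an application is creating objects faster than it removes them.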
- Node Disk Full (The Obvious One):
  - Diagnosis: Even if `mnesia` growth is the trigger, the underlying disk can simply be full, which would indeed cause the alarm.

        df -h /var/lib/rabbitmq

  - Fix: Free up disk space. This might involve deleting old logs, clearing out old message data (if persisted elsewhere), or increasing disk capacity.

        # Example: remove old log files
        sudo find /var/log/rabbitmq/ -type f -name "*.gz" -delete

  - Why it works: Deleting files makes more space available on the filesystem, provided the files live on the same partition as RabbitMQ’s data. The disk alarm is a direct indicator that the filesystem where RabbitMQ stores its data (including `mnesia`) is critically low on space.
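Because `find ... -delete` is irreversible, it can help to preview what would be removed first. A small sketch follows; `LOG_DIR` is an assumed default path and `rotated_logs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: preview which rotated logs a cleanup would delete. Nothing is removed.
LOG_DIR="${LOG_DIR:-/var/log/rabbitmq}"   # assumed log location

# rotated_logs DIR: print the compressed log files under DIR, one per line.
rotated_logs() {
    find "$1" -type f -name '*.gz' 2>/dev/null
}

# Print the size in KiB and the path of every candidate file.
rotated_logs "$LOG_DIR" | while IFS= read -r f; do
    du -k "$f"
done
```

Once the list looks right, the same `find` expression can be rerun with `-delete`.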
- `mnesia` Table Fragmentation:
  - Diagnosis: This is less common but possible. `mnesia` tables can become fragmented over time.
  - Fix: A full cluster restart (all nodes) can sometimes help `mnesia` reorganize its internal data structures.

        # Restart every node in the cluster, one at a time.
        # <node1> <node2> are placeholders for your hosts; assumes SSH access and sudo rights.
        for host in <node1> <node2>; do ssh "$host" sudo systemctl restart rabbitmq-server; done

  - Why it works: A cluster-wide restart forces `mnesia` on each node to rebuild its internal data structures, giving it an opportunity to optimize its storage layout and potentially defragment it.
- Configuration Issues / Policy Misconfiguration:
  - Diagnosis: While rare, a policy that prevents message expiration or queue deletion could lead to unbounded growth. Check your policies.

        rabbitmqctl list_policies

  - Fix: Review and adjust any policies that might be preventing message TTL or queue auto-deletion. For example, make sure a higher-priority policy is not overriding the one that sets `message-ttl` or `expires` on queues that are not meant to be permanent.
  - Why it works: Policies dictate how RabbitMQ manages queues and messages. Incorrect policies can prevent automatic cleanup, leading to unbounded data growth.
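When there are many policies, it is easier to look only at the ones that affect cleanup. The sketch below assumes that `rabbitmqctl list_policies` prints each policy definition as JSON with quoted keys; `cleanup_policies` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: show only the policies that set message-ttl or expires.

# cleanup_policies: keep the stdin rows whose definition mentions
# "message-ttl" or "expires"; succeeds even when nothing matches.
cleanup_policies() {
    grep -E '"(message-ttl|expires)"' || true
}

# Only query the broker when the CLI is actually installed.
if command -v rabbitmqctl >/dev/null 2>&1; then
    rabbitmqctl list_policies | cleanup_policies
fi
```

An empty result suggests that no policy is expiring messages or idle queues, which is worth comparing against what the application expects.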
After resolving the disk alarm, you might hit a channel error such as `404 NOT_FOUND` if you restarted a node and a consumer tries to use a queue that no longer exists (for example, a non-durable queue, which does not survive a broker restart).