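RabbitMQ’s disk alarm isn’t just a warning; it’s a hard stop. While the alarm is active the broker blocks publishing connections, so no new messages are accepted and the producing side of your application effectively freezes. Consumers can keep draining queues.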

Imagine your RabbitMQ broker is a busy post office. When the "free space" alarm triggers, it’s like the postmaster suddenly declaring, "No more mail can be accepted until we clear out some of the backlog!" This isn’t about the total disk size, but the available space. RabbitMQ needs this buffer for internal operations, like writing messages to disk, managing queues, and handling unsynced data.

Here’s how the disk alarm mechanism works and how to get it right:

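RabbitMQ monitors free disk space on the partition that holds the node’s data directory. When free space falls below the configured threshold (disk_free_limit), the node raises the disk alarm. The default threshold is an absolute 50 MB, which is too low for most production systems. It can also be set relative to the amount of RAM on the host, but it is not a percentage of the total disk size.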
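The comparison behind the alarm can be pictured in a few lines of Python. This is only a sketch of the idea, not RabbitMQ’s implementation; the DATA_DIR path and the 50 MB LIMIT_BYTES value are assumptions for illustration, so substitute the values from your own node.

```python
import shutil

# Assumed values for illustration only; adjust to your node.
DATA_DIR = "/var/lib/rabbitmq"
LIMIT_BYTES = 50 * 1000 * 1000  # 50 MB


def below_limit(free_bytes: int, limit_bytes: int) -> bool:
    """Return True when free space has dropped under the limit."""
    return free_bytes < limit_bytes


def check_path(path: str, limit_bytes: int = LIMIT_BYTES) -> bool:
    """Read free space for the filesystem holding `path` and compare it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return below_limit(usage.free, limit_bytes)
```

Calling check_path(DATA_DIR) on the broker host gives a quick yes/no answer that does not depend on the broker being responsive.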

Common Causes and Fixes for Disk Alarms

  1. Low Free Disk Space (The Obvious, But Critical)

    • Diagnosis:
      df -h /var/lib/rabbitmq
      
      (Adjust /var/lib/rabbitmq if your data directory is elsewhere). Look for the Avail column.
    • Fix: Free up disk space. This could involve:
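      • Draining backlogged queues: If producers publish faster than consumers can process, messages accumulate in the queues. Get consumers working through the backlog, or purge queues whose contents you can afford to lose. Note that purging removes only messages that are ready for delivery; messages that were delivered but not yet acknowledged stay until the consumer acks them, and return to the queue if its channel closes.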
      • Archiving or deleting old logs: RabbitMQ and Erlang generate logs. Ensure they are being rotated and purged.
      • Removing old unreferenced files: Check for temporary files or old data that is no longer needed.
    • Why it works: RabbitMQ requires a minimum amount of free space to operate reliably. By freeing up space, you bring the available disk space above the alarm threshold.
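To decide what to clean up, it helps to see which files are taking the space. The helper below is a minimal sketch, not a RabbitMQ tool; the function name largest_files is hypothetical, and the directory you pass to it is your choice.

```python
import heapq
import os


def largest_files(root: str, count: int = 10):
    """Return up to `count` (size_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are measured, not followed
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return heapq.nlargest(count, sizes)
```

For example, largest_files("/var/lib/rabbitmq") lists the biggest files in the data directory. Use the result only to locate candidates such as old logs; never delete files from the broker’s message store by hand.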
  2. Incorrectly Configured vm_memory_high_watermark

    • Diagnosis: This is a memory setting, not a disk setting, but memory pressure can indirectly cause disk pressure. Check vm_memory_high_watermark in your RabbitMQ configuration file (rabbitmq.conf, or advanced.config on older setups); it is not stored in definitions.json. If it’s set too high for the host, the node can push the system into swapping, and as memory fills up RabbitMQ may page message contents out to disk. Both increase disk I/O and disk usage, which can then trigger the disk alarm.
    • Fix: Lower vm_memory_high_watermark. A common starting point is 0.4 (40% of the RAM RabbitMQ detects) or even lower, depending on your workload.
      # rabbitmq.conf example: use either the relative or the absolute form
      vm_memory_high_watermark.relative = 0.4
      # vm_memory_high_watermark.absolute = 1GB
      
      Restart RabbitMQ after changing the configuration file, or apply the value to a running node with rabbitmqctl set_vm_memory_high_watermark 0.4. A runtime change is lost on restart unless the file is updated as well.
    • Why it works: This setting controls the fraction of detected system memory at which RabbitMQ raises its memory alarm and starts throttling publishers. Lowering it keeps the node from consuming excessive memory, which reduces the likelihood of swapping and of the disk pressure that follows.
  3. Persistent Messages Filling Up Disk

    • Diagnosis: If you use persistent messages (delivery_mode=2), and messages are not being acknowledged or consumers are offline for extended periods, these messages will be written to disk. Check queue depths and message counts.
    • Fix: Ensure your consumers are running and acknowledging messages promptly. Implement dead-letter queues for messages that cannot be processed. Manually purge queues if necessary, but understand this is a last resort and will result in message loss.
      rabbitmqadmin purge queue name=your_queue_name
      # Or, with the CLI that ships with the broker:
      rabbitmqctl purge_queue your_queue_name
      
    • Why it works: Persistent messages are stored on disk until they are acknowledged and removed from the queue. By processing and acknowledging them, you free up disk space.
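Queue depths can also be checked programmatically. The sketch below assumes the management plugin is enabled and reachable at http://localhost:15672 with the guest/guest account, which only holds for a default local install; the function names are hypothetical. It reads the messages, messages_ready and messages_unacknowledged fields of each queue record returned by the management HTTP API.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url="http://localhost:15672", user="guest", password="guest"):
    """Fetch queue records from the management HTTP API (assumed defaults)."""
    request = urllib.request.Request(base_url + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def deepest_queues(queues, count=5):
    """Return (name, ready, unacked) for the queues holding the most messages."""
    ranked = sorted(queues, key=lambda q: q.get("messages", 0), reverse=True)
    return [
        (q["name"], q.get("messages_ready", 0), q.get("messages_unacknowledged", 0))
        for q in ranked[:count]
    ]
```

A queue with a large ready count usually points to missing or slow consumers, while a large unacknowledged count points to consumers that receive messages but never ack them.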
  4. Unreachable or Slow Network Storage (for Clustered/Distributed Setups)

    • Diagnosis: If your RabbitMQ data directory is on a network file system (NFS, GlusterFS, etc.), and that storage becomes unreachable or experiences high latency, RabbitMQ can’t write to it, and the disk alarm might trigger even if the underlying storage has space. Check dmesg and system logs for network or storage errors.
    • Fix: Resolve network connectivity issues or storage performance problems. Ensure the network file system is mounted correctly and is responsive.
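    • Why it works: RabbitMQ expects its data directory to behave like fast local storage. When writes to the mount stall or fail, or the node cannot get a reliable free-space reading for it, the broker can behave much as it does when the disk is actually full.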
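A quick way to confirm whether the data directory really lives on network storage is to look it up in the mount table. This is a Linux-only sketch that parses text in the /proc/mounts format; the function names are hypothetical and the NETWORK_FS_TYPES set is an assumed, non-exhaustive list of filesystem type names.

```python
# Assumed, non-exhaustive list of network filesystem type names.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "fuse.glusterfs", "ceph"}


def mount_for(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing `path`."""
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Later entries win ties because they shadow earlier mounts.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


def on_network_fs(path, mounts_text):
    """Return True when `path` sits on one of the assumed network filesystems."""
    return mount_for(path, mounts_text)[1] in NETWORK_FS_TYPES
```

On the broker host you could call on_network_fs("/var/lib/rabbitmq", open("/proc/mounts").read()), adjusting the path if your data directory is elsewhere.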
  5. Misconfigured disk_free_limit in RabbitMQ Configuration

    • Diagnosis: RabbitMQ allows you to configure disk_free_limit explicitly, either in rabbitmq.conf or at runtime with rabbitmqctl set_disk_free_limit. If the limit is set too high for the partition, the alarm triggers while plenty of space is still available.
    • Fix: Adjust disk_free_limit. If it was set manually, it might be the culprit.
      # rabbitmq.conf example
      # 1 GB, given in bytes (the form 1GB is also accepted)
      disk_free_limit.absolute = 1000000000
      # Or relative to the host's RAM (1.0 means as much free disk as there is RAM)
      # disk_free_limit.relative = 1.0
      
      Restart RabbitMQ after changing the file, or apply the value to a running node with rabbitmqctl set_disk_free_limit 1GB. A runtime change is lost on restart unless the file is updated as well.
    • Why it works: This directly sets the minimum free disk space required. Choosing a value that fits the partition prevents premature alarms. Don’t set it too low, though: the limit exists so that the node blocks publishers before the disk is actually full, and the 50 MB default leaves very little margin.
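To see how close a node is to its limit, compare the free space it reports with the limit it has in effect. The sketch below works on a node record as returned by the management HTTP API’s /api/nodes endpoint, using its disk_free, disk_free_limit and disk_free_alarm fields; fetching the record is left out, and the function name disk_headroom is hypothetical.

```python
def disk_headroom(node):
    """Summarise how far a node is from its disk alarm threshold.

    `node` is one parsed record from the management API's /api/nodes list.
    A negative headroom means free space is already below the limit.
    """
    free = node["disk_free"]
    limit = node["disk_free_limit"]
    return {
        "name": node.get("name"),
        "alarm": node.get("disk_free_alarm", False),
        "headroom_bytes": free - limit,
    }
```

A headroom that is small compared with your normal message backlog suggests either too little disk or a limit that was set too high for the partition.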
  6. Underlying Disk/Filesystem Issues

    • Diagnosis: Sometimes, the issue isn’t RabbitMQ’s configuration but a problem with the disk itself or the filesystem. Check dmesg for I/O errors, read/write errors, or filesystem corruption.
    • Fix: Stop RabbitMQ, unmount the filesystem, and run a filesystem check (fsck). Replace the disk if it is failing.
    • Why it works: A corrupted or failing disk can report incorrect free space or simply stop accepting writes, triggering alarms.

While the alarm is active, publishing connections are blocked, so clients may report timeouts, blocked connections, or channel and connection errors. These are usually transient: once free space rises above the limit the alarm clears and publishers resume, although clients that gave up in the meantime may need to reconnect.
