The mirror-sync-in-progress error means a mirrored queue is stuck in a state where it’s trying to synchronize its data from the master node to its mirrors, and this process is preventing new messages from being written. This typically happens when the network connection between the master and mirror nodes is unstable, or when one of the nodes is under heavy load and can’t keep up with the synchronization.

Here are the most common reasons this occurs and how to fix them:

Network Partition or Latency

Diagnosis: Check network connectivity and latency between your RabbitMQ nodes. Use ping and traceroute from each node to every other node in the cluster. Look for high latency (consistently over 50ms), packet loss, or intermittent connection drops.
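
As a rough sketch of that check, the helpers below pull the average round-trip time out of ping's summary line and compare it with the 50ms guideline. The function names are made up for illustration, and the parsing assumes the Linux or macOS summary format (min/avg/max/...).

```shell
# Print the average RTT in ms from ping output read on stdin.
# Assumes a summary line such as:
#   rtt min/avg/max/mdev = 48.1/62.5/90.3/12.0 ms
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Succeed (exit 0) when RTT $1 is greater than limit $2, both in ms.
rtt_exceeds() {
  awk -v rtt="$1" -v limit="$2" 'BEGIN { exit !(rtt > limit) }'
}

# Ping one node five times and warn if the average RTT is above 50ms
# or if no summary line came back (for example, 100% packet loss).
check_node() {
  avg=$(ping -c 5 "$1" 2>/dev/null | avg_rtt_ms)
  if [ -z "$avg" ]; then
    echo "WARN $1: no RTT summary (unreachable or all packets lost)"
  elif rtt_exceeds "$avg" 50; then
    echo "WARN $1: average RTT ${avg}ms"
  else
    echo "OK   $1: average RTT ${avg}ms"
  fi
}
```

Run it from every node against every other node, for example: for n in rabbit1 rabbit2 rabbit3; do check_node "$n"; done (the hostnames are placeholders). A single run only gives a snapshot, so repeat it over time to catch intermittent drops.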

Fix:

  • Reduce Latency: If nodes are geographically distributed, consider moving them closer together or using a dedicated, low-latency network. For instance, if nodes are spread across AWS availability zones and cross-zone latency is the problem, place them in the same zone where your availability requirements allow it.
  • Increase Network Bandwidth: If packet loss or high latency is due to network saturation, upgrade your network infrastructure or reduce other network traffic on the path between RabbitMQ nodes.
  • Firewall/Security Group Issues: Ensure firewalls or security groups aren’t intermittently dropping connections. Check logs on firewalls and network devices for blocked packets between RabbitMQ nodes on ports 25672 (inter-node communication) and 4369 (epmd), and between clients and nodes on port 5672 (AMQP).
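
One way to spot-check those ports is a small connect test. This is only a sketch: it assumes nc (netcat) is installed and supports -z and -w, and the function names are hypothetical.

```shell
# Ports named above: AMQP clients, inter-node traffic, epmd.
PORTS="5672 25672 4369"

# Print one "host port" pair per line for every host given as an argument.
port_pairs() {
  for host in "$@"; do
    for port in $PORTS; do
      echo "$host $port"
    done
  done
}

# Try a TCP connect with a 2-second timeout; requires nc (netcat).
port_open() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# Report every pair that cannot be reached from this machine.
report_blocked() {
  port_pairs "$@" | while read -r host port; do
    port_open "$host" "$port" || echo "BLOCKED $host:$port"
  done
}
```

Calling report_blocked rabbit1 rabbit2 (placeholder hostnames) from each node lists the unreachable ports. A successful connect does not rule out intermittent drops, so the firewall logs remain the better evidence.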

Why it works: RabbitMQ mirroring relies on a constant, low-latency connection to replicate messages. High latency or packet loss disrupts this flow, causing the sync to stall and potentially leading to the mirror-sync-in-progress state.

Insufficient Disk Space on Mirror Nodes

Diagnosis: Check disk space on all RabbitMQ nodes, especially the mirror nodes.

df -h /var/lib/rabbitmq

Look for partitions that are 90% or more full.
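
To automate the 90% check, a filter along these lines could be run over df output on each node. It is a minimal sketch that assumes POSIX-format output (df -P) and mount points without spaces; full_mounts is an illustrative name.

```shell
# Read `df -P` output on stdin and print mount points whose use% is at or
# above the limit given as $1 (default 90).
full_mounts() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6 " " use "%"
  }'
}
```

For example: df -P /var/lib/rabbitmq | full_mounts 90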

Fix:

  • Free up disk space: Delete old logs, unused data, or move data to a different partition.
  • Increase disk size: If you’re consistently running out of space, provision larger disks for your RabbitMQ nodes. For example, if /var/lib/rabbitmq is on a 100GB disk and is full, consider upgrading to a 200GB disk.
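  • Configure resource limits: Set RabbitMQ’s disk_free_limit so that publishers are blocked before the disk actually fills up, and vm_memory_high_watermark to do the same for memory. Both raise an alarm and throttle publishing when the threshold is crossed, which gives you time to react.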
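
As an illustration of the two limit settings, the sketch below writes them to a configuration snippet in the rabbitmq.conf key = value format. The values are examples rather than recommendations, and write_limits_conf is a made-up helper name.

```shell
# Write an example limits snippet to the path given as $1.
# The values are illustrative, not recommendations.
write_limits_conf() {
  cat > "$1" <<'EOF'
# Block publishers once RabbitMQ uses 40% of system RAM.
vm_memory_high_watermark.relative = 0.4
# Block publishers once free disk space drops below 2GB.
disk_free_limit.absolute = 2GB
EOF
}
```

The target path is up to you: the same two lines can be merged into your existing rabbitmq.conf, or, assuming your RabbitMQ version loads a conf.d directory, written to a file such as /etc/rabbitmq/conf.d/90-limits.conf (a hypothetical name). The node typically needs a restart to pick up the change.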

Why it works: When disks are full, RabbitMQ cannot write replicated messages to disk on the mirror nodes, halting the synchronization process.

High Message Influx or Throughput

Diagnosis: Monitor the backlog on your queues using rabbitmqctl list_queues name messages_ready messages_unacknowledged --formatter=pretty_table. If the messages_ready or messages_unacknowledged counts are consistently high and growing, or if the client publishing rate is extremely high, the mirrors can be overwhelmed.
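
To pick out the queues with the largest backlog, the list_queues output can be filtered. The sketch below assumes the default tab-separated output with three columns (name, messages_ready, messages_unacknowledged); the function name and the threshold are illustrative.

```shell
# Read tab-separated "name ready unacked" rows on stdin and print queues
# whose combined backlog is above the limit in $1 (default 10000).
backlogged_queues() {
  awk -F'\t' -v limit="${1:-10000}" '
    ($2 + $3) > (limit + 0) { print $1 "\t" ($2 + $3) }'
}
```

For example: rabbitmqctl list_queues --quiet --no-table-headers name messages_ready messages_unacknowledged | backlogged_queues 50000. Running it a few minutes apart shows whether the backlog is growing or merely large.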

Fix:

  • Scale Consumers: Add more consumers to process messages faster. If you have 10 consumers and a backlog is building, try scaling to 20.
  • Scale Publishers: If possible, distribute the publishing load across more publishers or to different queues.
  • Optimize Message Handling: Ensure your consumers are acknowledging messages promptly and efficiently. Avoid long-running operations within consumer handlers.
  • Increase Node Resources: If the nodes themselves are bottlenecked (CPU, I/O), upgrade their hardware or instance types.

Why it works: If the rate at which messages are being published and replicated exceeds the rate at which they can be processed and acknowledged by the mirrors, the synchronization queue will grow indefinitely, leading to the stalled state.

RabbitMQ Node Resource Exhaustion (CPU/Memory)

Diagnosis: Monitor CPU and memory usage on your RabbitMQ nodes. Use top, htop, or cloud provider monitoring tools. Look for sustained high CPU usage (above 80%) or memory consumption nearing the system’s limit.

Fix:

  • Increase Node Resources: Upgrade CPU cores or RAM for your RabbitMQ servers. For example, move from a t3.medium to a t3.xlarge instance.
  • Optimize Erlang VM Settings: Tune Erlang VM parameters related to garbage collection and process limits. This is advanced and requires careful testing.
  • Reduce Unnecessary Processes: Ensure no other applications are consuming significant resources on the RabbitMQ nodes.

Why it works: If a RabbitMQ node is starved for CPU or memory, its ability to perform background tasks like message replication and synchronization is severely degraded.

RabbitMQ Erlang VM Issues or Crashes

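Diagnosis: Check the RabbitMQ logs (/var/log/rabbitmq/) and any Erlang crash dump for errors, warnings, or signs of Erlang VM crashes or restarts. The crash dump is a file named erl_crash.dump; its location depends on the installation and on the ERL_CRASH_DUMP environment variable, and it is commonly written to the node’s working directory or log directory.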
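
A quick first pass over the logs is to count error-level lines per file. This sketch assumes the bracketed level tag of RabbitMQ's default text log format, which has appeared as [error] or [erro] depending on the release; adjust the pattern if your logs look different. The function name is illustrative.

```shell
# Count lines logged at error level in each file given as an argument.
# Matches both "[error]" and the abbreviated "[erro]" level tag.
count_error_lines() {
  for f in "$@"; do
    printf '%s %s\n' "$(grep -c -E '\[error?\]' "$f")" "$f"
  done
}
```

For example: count_error_lines /var/log/rabbitmq/*.log. A file with a sudden jump in the count is a reasonable place to start reading.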

Fix:

  • Restart RabbitMQ: A simple restart can sometimes resolve transient Erlang VM issues.
    sudo systemctl restart rabbitmq-server
    
  • Update RabbitMQ and Erlang: Ensure you are running supported and stable versions of RabbitMQ and the Erlang/OTP platform. Older versions may have known bugs.
  • Investigate Crash Dumps: If Erlang crash dumps are frequent, analyze them for specific error patterns or memory leaks. This often requires deeper Erlang/OTP expertise.

Why it works: The Erlang VM is the runtime environment for RabbitMQ. If it encounters critical errors or becomes unstable, it can prevent all operations, including message synchronization.

Stale or Corrupted Mirroring State

Diagnosis: Sometimes, the internal state of mirroring can become corrupted. This is harder to diagnose directly but often manifests as persistent mirror-sync-in-progress errors that don’t resolve with the above steps.

Fix:

  • Force Mirror Re-sync: You can force a fresh synchronization of a specific queue by deleting it and recreating it under a mirroring policy. Caution: Deleting the queue discards any messages still in it and causes a brief outage for that queue, so drain or republish anything you need first.
    1. On any cluster node, make sure a mirroring policy matches the queue, then delete the queue:
      rabbitmqctl set_policy --priority 0 --apply-to queues ha-all "^my_stuck_queue$" '{"ha-mode":"all"}'
      rabbitmqctl delete_queue my_stuck_queue
      
    2. Then recreate the queue, for example by redeclaring it from your application. The ha-all policy applies to it automatically because the queue name matches the pattern.
  • Cluster Restart (Last Resort): In extreme cases, a rolling restart of the entire RabbitMQ cluster might clear corrupted states. Ensure you understand the implications of this on availability.
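
Before resorting to either of these, it may be worth asking RabbitMQ to synchronize the queue explicitly: rabbitmqctl provides sync_queue and cancel_sync_queue for classic mirrored queues. The wrappers below are a minimal sketch with hypothetical names; the RABBITMQCTL variable is an added convenience so the commands can be previewed with echo.

```shell
# Allow the CLI to be overridden (for example with "echo" for a dry run).
: "${RABBITMQCTL:=rabbitmqctl}"

# Ask RabbitMQ to synchronise the mirrors of queue $1 in vhost $2 (default "/").
resync_queue() {
  "$RABBITMQCTL" sync_queue -p "${2:-/}" "$1"
}

# Cancel an in-progress synchronisation of queue $1 in vhost $2 (default "/").
cancel_resync() {
  "$RABBITMQCTL" cancel_sync_queue -p "${2:-/}" "$1"
}
```

For example, cancel_resync my_stuck_queue followed by resync_queue my_stuck_queue restarts a stalled sync without losing messages. The queue is unavailable to publishers and consumers while it synchronizes, so expect a pause on large queues.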

Why it works: Recreating the queue or performing a cluster-wide restart can reset the mirroring state and force a fresh synchronization from scratch, bypassing any lingering corruption.

After resolving the underlying issue, the queue should finish synchronizing. You can confirm this with rabbitmqctl list_queues name slave_pids synchronised_slave_pids: every mirror listed under slave_pids should also appear under synchronised_slave_pids. The next potential issue you might encounter is a backlog of messages if the consumers were also struggling, leading to high messages_unacknowledged counts.
