If it is not properly tuned, the AOF rewrite process can stall your Redis instance, increasing client latency and, in the worst case, putting recent writes at risk.

Redis has two primary persistence mechanisms: RDB snapshots and the Append-Only File (AOF). The AOF logs every write operation the server receives. Over time the file can grow very large and fill up with redundant commands: after SET key value1 followed by SET key value2, both commands are still in the file even though only the last one matters. The AOF rewrite is a background operation that creates a new, more compact AOF containing only the commands needed to reconstruct the current dataset, which keeps both restart time and disk usage under control.
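To make the idea of redundancy concrete, here is a toy sketch that compacts a log of SET commands down to the last SET per key. It is only an illustration of why the compacted file is smaller: the function name compact_sets is made up for this example, it ignores every command other than SET, and Redis itself does not scan the old log like this.

```shell
# Toy illustration only: keep the last SET for each key, in order of the
# key's first appearance. Reads one command per line on stdin.
compact_sets() {
    awk '
        toupper($1) == "SET" {
            if (!($2 in last)) { order[++n] = $2 }   # remember first-seen order
            last[$2] = $0                            # later SETs overwrite earlier ones
        }
        END { for (i = 1; i <= n; i++) print last[order[i]] }'
}
```

For example, feeding it three commands where key a is set twice leaves two lines:

    printf 'SET a 1\nSET b 2\nSET a 3\n' | compact_sets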

Here’s how it works and how to tune it:

Why Rewrite?

  • Size Reduction: Eliminates redundant commands, shrinking the AOF file size significantly.
  • Performance: A smaller AOF file means faster restarts, because Redis has fewer commands to replay when it loads the file.
  • Disk Headroom: Prevents the AOF file from growing without bound, which could eventually fill the disk and cause writes to fail.

The Problem: Stalling

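The AOF rewrite runs in the background, but it’s not entirely non-blocking. Redis forks a child process that writes the current in-memory dataset to a new AOF file (it does not read the old AOF), while the parent keeps serving clients and records the writes that arrive in the meantime so the new file ends up complete. The fork itself briefly pauses the main thread, copy-on-write increases memory use while the child runs, and the child’s heavy disk writes compete with the parent’s own AOF writes and fsync calls. If the system is under heavy load, or if the rewrite is configured poorly, this contention can block the main Redis thread and lead to high latency for client operations.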
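One quick way to see whether the rewrite itself is a likely source of stalls is to look at how long the last fork took. The sketch below reads INFO-style "key:value" text on stdin and reports on the latest_fork_usec field. The function name check_fork_latency and the 100000 microsecond default are assumptions made for this example, not Redis defaults.

```shell
# Read INFO-style "key:value" lines on stdin and print a verdict about the
# most recent fork. The threshold is an illustrative cut-off, not a Redis default.
check_fork_latency() {
    threshold_usec="${1:-100000}"
    awk -F: -v limit="$threshold_usec" '
        $1 == "latest_fork_usec" {
            usec = $2
            sub(/\r$/, "", usec)          # redis-cli output may end lines with CR
            seen = 1
        }
        END {
            if (!seen)            { print "unknown"; exit }
            if (usec + 0 > limit) { print "slow fork: " usec " usec" }
            else                  { print "ok: " usec " usec" }
        }'
}
```

Typical use, assuming redis-cli can reach your instance:

    redis-cli INFO stats | check_fork_latency 100000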

Common Causes and Fixes for Stalling

  1. Insufficient vm.dirty_ratio or vm.dirty_background_ratio:

    • Diagnosis: Check your system’s dirty page ratios:
      cat /proc/sys/vm/dirty_ratio
      cat /proc/sys/vm/dirty_background_ratio
      
      These control when the kernel writes dirty pages to disk: at dirty_background_ratio it starts flushing in the background, and at dirty_ratio it makes processes that are writing wait while pages are flushed. The AOF rewrite produces a lot of dirty pages in a short time, so on a system with other write activity the higher threshold can be reached, and Redis’s own writes then stall behind the flush.
    • Fix: Increase these values. For example, to set dirty_ratio to 30% and dirty_background_ratio to 15%:
      sudo sysctl vm.dirty_ratio=30
      sudo sysctl vm.dirty_background_ratio=15
      # To make permanent, edit /etc/sysctl.conf
      
    • Why it works: Higher thresholds let the kernel accumulate more dirty pages before it throttles writers, reducing contention between the AOF rewrite and kernel writeback. The trade-off is that more data is flushed at once when a threshold is finally hit, and more unwritten data sits in memory, so measure latency before and after the change.
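It can help to translate the percentages into an absolute amount of memory before changing them. The following is a rough sketch: the function name dirty_threshold_mib is made up for this example, and it assumes you pass in a memory figure in kB such as MemAvailable from /proc/meminfo, which only approximates the "available memory" the kernel actually uses for this calculation.

```shell
# Approximate dirty-page threshold in MiB.
# $1 = ratio in percent (e.g. vm.dirty_ratio)
# $2 = memory in kB (assumption: MemAvailable is used as an approximation)
dirty_threshold_mib() {
    echo $(( $2 * $1 / 100 / 1024 ))
}
```

For example, on the running machine:

    dirty_threshold_mib "$(cat /proc/sys/vm/dirty_ratio)" "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"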
  2. Frequent Evictions Under maxmemory:

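    • Diagnosis: If maxmemory is set and Redis is frequently evicting keys (e.g., under allkeys-lru), each eviction adds write load during the rewrite. Redis does not log individual evictions, so watch the evicted_keys counter in INFO stats and see whether it keeps climbing.
    • Fix: Temporarily increase maxmemory so fewer keys are evicted, or if possible, avoid rewrites during peak client traffic. Changing maxmemory-policy only helps if the new policy actually evicts less often for your workload. Note that volatile-lru evicts only keys that have a TTL, so writes can start failing with out-of-memory errors if few keys have one.
      redis-cli CONFIG SET maxmemory 10gb  # Example: temporarily increase
      redis-cli CONFIG SET maxmemory-policy volatile-lru  # Example: change policy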
      
    • Why it works: Each eviction deletes a key and records that deletion in the AOF, so it competes directly with the rewrite for I/O and CPU. Reducing eviction pressure frees up both.
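To judge whether evictions are frequent enough to matter, you can sample the evicted_keys counter twice and compute a rate. This is a minimal sketch; the helper names info_field and eviction_rate are made up for this example, and the sampling interval is up to you.

```shell
# Extract one field (e.g. evicted_keys) from INFO-style "key:value" text on stdin.
info_field() {
    awk -F: -v key="$1" '$1 == key { v = $2; sub(/\r$/, "", v); print v; exit }'
}

# Whole evictions per second between two counter samples.
# $1 = first sample, $2 = second sample, $3 = seconds between them
eviction_rate() {
    echo $(( ($2 - $1) / $3 ))
}
```

A possible way to use it, assuming redis-cli can reach your instance:

    before="$(redis-cli INFO stats | info_field evicted_keys)"; sleep 60
    after="$(redis-cli INFO stats | info_field evicted_keys)"
    eviction_rate "$before" "$after" 60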
  3. Disk I/O Saturation (Slow Disks or High I/O Load):

    • Diagnosis: Monitor disk I/O using tools like iostat or iotop. Look for high %util or high await times (shown as r_await and w_await in recent sysstat versions) on the disk where your AOF file resides.
      iostat -xz 5
      iotop -o
      
    • Fix:
      • Faster Storage: Migrate Redis data to faster storage (e.g., SSDs, NVMe).
      • Dedicated Disk: If possible, place the AOF file on a separate, faster disk from the operating system or other high-I/O applications.
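      • Reduce auto-aof-rewrite-percentage: Lower this value so rewrites happen more frequently, before the AOF has grown far beyond the size of the dataset. Each rewrite still writes out the whole dataset, so this limits how large the old file gets rather than shrinking the rewrite itself; weigh it against the cost of forking more often.
        redis-cli CONFIG SET auto-aof-rewrite-percentage 50  # Example: rewrite when the AOF is 50% larger than after the last rewrite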
        
    • Why it works: The AOF rewrite is inherently I/O-bound. Slow disks or heavy concurrent I/O will naturally slow down the rewrite process, increasing the time it runs and the chance of impacting foreground operations. Faster storage or better I/O isolation directly addresses this bottleneck.
  4. auto-aof-rewrite-min-size Set Too Low:

    • Diagnosis: The auto-aof-rewrite-min-size setting prevents automatic rewrites while the AOF file is still small. If it is set too low and your AOF file is small but growing rapidly due to many ephemeral keys, rewrites can be triggered too often.
    • Fix: Increase auto-aof-rewrite-min-size. The default is 64mb; 128mb is a common choice for busier instances. CONFIG SET does not survive a restart, so also set the value in redis.conf or run CONFIG REWRITE.
      redis-cli CONFIG SET auto-aof-rewrite-min-size 128mb
      
    • Why it works: This ensures that rewrites only occur when the AOF file has reached a substantial size, meaning the potential savings are significant enough to warrant the I/O cost. It prevents a flurry of small, disruptive rewrites.
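The two settings auto-aof-rewrite-percentage and auto-aof-rewrite-min-size work together, and a small sketch makes the interaction easier to reason about. The function below follows the trigger rule as I understand it (growth measured against the size after the last rewrite, and no automatic rewrite until the file exceeds the minimum size); the exact check inside the server may differ between versions, and the name would_auto_rewrite is made up for this example.

```shell
# Return 0 (true) if an automatic rewrite would be expected to trigger.
# $1 = current AOF size in bytes
# $2 = AOF size after the last rewrite (base size) in bytes
# $3 = auto-aof-rewrite-percentage
# $4 = auto-aof-rewrite-min-size in bytes
would_auto_rewrite() {
    current=$1; base=$2; pct=$3; min_size=$4
    [ "$pct" -gt 0 ] || return 1                 # a percentage of 0 disables automatic rewrites
    [ "$current" -gt "$min_size" ] || return 1   # file is still below the minimum size
    [ "$base" -gt 0 ] || base=1                  # avoid dividing by zero
    growth=$(( current * 100 / base - 100 ))     # growth in percent since the last rewrite
    [ "$growth" -ge "$pct" ]
}
```

With a 64 MB base and a 96 MB current file, a percentage of 100 would not trigger yet, while 50 would:

    would_auto_rewrite 100663296 67108864 50 67108864 && echo "rewrite expected"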
  5. High Number of Writes During Rewrite:

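    • Diagnosis: If your application is experiencing a massive spike in write operations (e.g., during a batch import or a traffic surge) while an AOF rewrite is in progress, Redis must also capture every one of those writes so the new AOF ends up complete. Older versions buffer them in memory and flush them when the child finishes; Redis 7.0 and later write them to a separate incremental file. Either way, the spike adds memory or I/O pressure on top of the rewrite.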
    • Fix:
      • Schedule Rewrites: Manually trigger AOF rewrites during off-peak hours using BGREWRITEAOF.
      • Rate Limit Writes: If possible, temporarily throttle incoming write traffic during scheduled rewrite periods.
      • Increase vm.dirty_ratio: As mentioned in point 1, this gives the system more buffer.
    • Why it works: By managing the rate of incoming writes or scheduling the rewrite when write volume is naturally lower, you reduce the contention for I/O and CPU resources required by both the rewrite and the incoming data.
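Scheduling a manual rewrite usually comes down to two checks: is it currently off-peak, and is a rewrite already running? The sketch below keeps that decision in a small function so it can be called from a cron job. The function name should_rewrite_now and the default window of 02:00 to 05:00 are assumptions for this example; pick a window that matches your own traffic.

```shell
# Return 0 (true) if a manual rewrite should be started now.
# $1 = current hour (0-23, a leading zero as in "03" is accepted)
# $2 = aof_rewrite_in_progress from INFO persistence (0 or 1)
# $3/$4 = start (inclusive) and end (exclusive) hour of the off-peak window;
#         the defaults 2 and 5 are hypothetical
should_rewrite_now() {
    hour=${1#0}; hour=${hour:-0}       # strip one leading zero, keep "0" for midnight
    in_progress=$2; start=${3:-2}; end=${4:-5}
    [ "$in_progress" -eq 0 ] || return 1
    [ "$hour" -ge "$start" ] && [ "$hour" -lt "$end" ]
}
```

A cron wrapper might then look like this, assuming redis-cli can reach your instance:

    in_progress="$(redis-cli INFO persistence | tr -d '\r' | awk -F: '$1 == "aof_rewrite_in_progress" {print $2}')"
    should_rewrite_now "$(date +%H)" "$in_progress" && redis-cli BGREWRITEAOF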
  6. vm.dirty_expire_centisecs Set Too Low:

    • Diagnosis: This kernel parameter (/proc/sys/vm/dirty_expire_centisecs) controls how long dirty data can stay in the page cache before it becomes due to be written to disk. If it’s too low, dirty data is flushed more aggressively, potentially interfering with Redis’s AOF rewrite.
    • Fix: Check the current value first. 3000 (30 seconds) is the usual Linux default, so this only applies if the value has been lowered; in that case, restore it to 3000 or higher.
      sudo sysctl vm.dirty_expire_centisecs=3000
      # To make permanent, edit /etc/sysctl.conf
      
    • Why it works: A longer expiration time allows the kernel to batch writes more effectively, giving Redis’s AOF rewrite more time to complete its operations without the kernel forcibly flushing data that Redis is actively managing.
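To make the kernel settings from points 1 and 6 permanent in one place, you can generate a small sysctl fragment rather than editing values by hand. This is a sketch under assumptions: the function name render_writeback_sysctl is made up, and the check that the background ratio sits below the hard ratio is a sanity check added for this example.

```shell
# Print a sysctl configuration fragment for the three writeback settings.
# $1 = vm.dirty_ratio, $2 = vm.dirty_background_ratio, $3 = vm.dirty_expire_centisecs
render_writeback_sysctl() {
    # Sanity check (added for this example): background flushing should
    # start before the hard limit is reached.
    if [ "$2" -ge "$1" ]; then
        echo "dirty_background_ratio ($2) should be below dirty_ratio ($1)" >&2
        return 1
    fi
    printf 'vm.dirty_ratio = %s\n' "$1"
    printf 'vm.dirty_background_ratio = %s\n' "$2"
    printf 'vm.dirty_expire_centisecs = %s\n' "$3"
}
```

For example, using the values from this article (the file name below is a hypothetical choice):

    render_writeback_sysctl 30 15 3000 | sudo tee /etc/sysctl.d/99-redis-writeback.conf
    sudo sysctl --system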

After addressing these, you might encounter issues with RDB persistence if it’s also enabled and configured poorly, or potentially memory fragmentation if your dataset is highly dynamic.
