Redpanda’s storage threshold alerts warn that a broker’s data disk is close to full. This is more serious than simply "running out of space": once the log directories, where all your topic data is stored and replicated, can no longer accept writes, the affected broker stops accepting new data, and if several brokers reach that point the whole cluster can effectively halt.

Common Causes and Fixes for Redpanda Storage Threshold Alerts

  1. Log Directory Full Due to Unmanaged Topic Retention:

    • Diagnosis: Check the size of your topic data directories. By default Redpanda stores data under /var/lib/redpanda/data/, with Kafka topics in the kafka/ subdirectory. Use du -sh /var/lib/redpanda/data/kafka/* to see individual topic sizes. Look for topics that are unusually large or growing rapidly.
    • Fix: Configure appropriate retention policies for your topics. Cluster-wide defaults are cluster properties, set with rpk cluster config set; per-topic overrides are set with rpk topic alter-config.
      # Cluster-wide defaults (cluster properties)
      rpk cluster config set log_cleanup_policy delete
      rpk cluster config set log_retention_ms 604800000 # 7 days in ms (delete_retention_ms on older releases)
      rpk cluster config set retention_bytes 10737418240 # 10GB per partition
      
      Or per topic:
      rpk topic alter-config my-large-topic --set retention.ms=604800000 --set retention.bytes=10737418240
      
    • Why it works: With retention.ms (time-based retention) or retention.bytes (size-based retention, applied per partition) set, Redpanda automatically deletes closed log segments once a partition exceeds these limits, freeing up disk space.
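The retention values above are plain unit conversions, and they are easy to get wrong by a factor of 1000 or 1024. The following is a minimal sketch of that arithmetic; the function names days_to_ms, gib_to_bytes, and topic_max_bytes are hypothetical helpers written for this article, not part of rpk. The last helper assumes size-based retention applies per partition and that every replica keeps its own copy.

```shell
# Hypothetical helpers (not part of rpk) for computing retention values.

# Convert whole days to milliseconds, the unit retention.ms expects.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Convert whole GiB to bytes, the unit retention.bytes expects.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Rough upper bound on cluster-wide disk used by one topic under
# size-based retention.
# Arguments: retention bytes per partition, partition count, replication factor.
topic_max_bytes() {
  echo $(( $1 * $2 * $3 ))
}

days_to_ms 7     # prints 604800000
gib_to_bytes 10  # prints 10737418240
```

For example, topic_max_bytes "$(gib_to_bytes 10)" 12 3 estimates the worst case for a 12-partition topic with three replicas. Treat it as an approximation: the active segment is not removed until it is closed, so real usage can run somewhat higher.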
  2. Log Directory Full Due to Uncompacted Topics:

    • Diagnosis: Topics with cleanup.policy set to compact can grow indefinitely if keys are constantly updated but the log segments are never compacted. Check the topic's cleanup.policy with rpk topic describe <topic-name> -c.
    • Fix: Ensure that compacted topics have a reasonable segment.ms or segment.bytes configuration, because only closed segments are eligible for compaction, and confirm that compaction is actually running by checking that the topic's on-disk size stops growing or shrinks over time. If compaction is stuck, you might need to investigate individual partition issues. As a last resort on a specific topic, you can temporarily switch it to delete-based cleanup (use with caution: this permanently discards records that fall outside the topic's retention limits, which can include the only copy of some keys, and it can cause temporary high I/O):
      rpk topic alter-config my-compacted-topic --set cleanup.policy=delete
      # Wait for cleanup to run, then re-enable compaction if desired
      rpk topic alter-config my-compacted-topic --set cleanup.policy=compact
      
    • Why it works: Compaction reclaims space by discarding older versions of records with the same key, keeping only the latest. If compaction is not running or is inefficient, logs grow. Switching the policy to delete does not compact anything; it frees space by letting retention remove whole segments, which is why it trades data for disk.
  3. Insufficient Disk Space Allocation:

    • Diagnosis: The most straightforward cause is that the underlying disk partition where Redpanda stores its data is simply too small for your workload. Check available space with df -h /var/lib/redpanda/data/ (or your configured data directory).
    • Fix: Increase the size of the partition or move Redpanda’s data directory to a larger disk. This often involves resizing the disk in your cloud provider’s console or on-premise hardware, growing the partition if needed, and then growing the filesystem (e.g., resize2fs for ext4, xfs_growfs for XFS).
      # Example for ext4
      sudo resize2fs /dev/sda1 # Replace /dev/sda1 with your partition
      # Example for XFS
      sudo xfs_growfs /var/lib/redpanda/data # Replace with your XFS mount point
      
    • Why it works: Providing more physical space directly resolves the "disk full" condition, allowing Redpanda to write new data and segment files.
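Checking df by hand works for a one-off diagnosis, but the same check can be scripted so it runs before the alert fires. The sketch below is one way to do that with standard tools; the function names disk_used_pct and disk_over_threshold, the DATA_DIR variable, and the 80% threshold are all assumptions chosen for illustration, not Redpanda defaults.

```shell
# Print the used-space percentage (integer, no % sign) of the
# filesystem that holds the given path, using POSIX df output.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed (exit status 0) when usage is at or above the threshold percentage.
disk_over_threshold() {
  [ "$(disk_used_pct "$1")" -ge "$2" ]
}

# Assumed data directory; override DATA_DIR if yours differs.
DATA_DIR="${DATA_DIR:-/var/lib/redpanda/data}"
if [ -d "$DATA_DIR" ] && disk_over_threshold "$DATA_DIR" 80; then
  echo "WARNING: $DATA_DIR is at $(disk_used_pct "$DATA_DIR")% capacity"
fi
```

Run from cron or a monitoring agent, this gives an early warning that is independent of the broker itself, which is useful when the broker is the component that has become unresponsive.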
  4. Write-Ahead Log (Raft Log) Growth:

    • Diagnosis: Redpanda relies on a write-ahead log for durability, but it is not kept in a separate directory: each partition’s Raft log is the write-ahead log, and it lives in the data directory (/var/lib/redpanda/data/ by default) together with internal logs such as the controller log under the redpanda/ subdirectory. Check how much space the internal logs use with du -sh /var/lib/redpanda/data/redpanda/.
    • Fix: Because the write-ahead log and the topic data are the same files, there is no separate WAL setting to tune. Apply the same remedies as for the data directory: tighten retention, make sure compaction runs, or add disk. If disk usage is high while topics look small, treat it as a strong indicator of a replication issue or another problem preventing old segments from being removed.
    • Why it works: Redpanda appends each write to the partition’s Raft log and replicates it from there, so there is no temporary buffer to flush and clear. Space is released only when retention or compaction removes closed segments, which means a log that keeps growing points to a bottleneck or failure in that path.
  5. Replication Lag or Partition Leader Issues:

    • Diagnosis: If a partition leader is consistently unable to replicate data to its followers due to network issues, follower unavailability, or slow disks on followers, the leader might accumulate a large amount of unacknowledged data. This can lead to the leader’s disk filling up. Check for down nodes and for leaderless or under-replicated partitions with rpk cluster health.
    • Fix: Identify and resolve the underlying replication issues. This might involve restarting stalled followers, improving network connectivity, or addressing disk I/O bottlenecks on follower nodes.
    • Why it works: When data is successfully replicated and acknowledged by a quorum of followers, the leader can safely advance its log and delete old segments. Fixing replication lag ensures this progress can be made.
  6. Large Segment Files Due to Frequent Restarts or Configuration:

    • Diagnosis: If Redpanda is restarted frequently, or if segment.bytes is set very high, segment files can become excessively large, leading to quicker disk exhaustion, especially if retention policies are not aggressive enough. Check segment file sizes within /var/lib/redpanda/data/kafka/<topic>/<partition>_<revision>/.
    • Fix: Adjust the cluster-wide default (the log_segment_size cluster property) or the per-topic segment.bytes (via rpk topic alter-config) to a more manageable size (e.g., 1GB or 2GB). A smaller segment size means more frequent segment file creation and deletion cycles, which can help with space management if retention policies are also tuned.
      # Cluster-wide default (cluster property)
      rpk cluster config set log_segment_size 1073741824 # 1GB
      
    • Why it works: Retention only removes closed segments, never the segment currently being written. Smaller segments close sooner and become eligible for deletion more quickly, providing a more granular release of disk space.
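To find the segment files worth investigating, it helps to rank them by size rather than browsing partition directories one at a time. The following sketch does that with find and sort. It assumes GNU find (for -printf) and assumes that segment files end in .log; the function name largest_segments is a hypothetical helper, not an rpk command.

```shell
# List the N largest *.log files under a directory, largest first,
# one per line as "<size-in-bytes><TAB><path>". N defaults to 10.
# Assumes GNU find, which provides -printf.
largest_segments() {
  find "$1" -type f -name '*.log' -printf '%s\t%p\n' \
    | sort -rn \
    | head -n "${2:-10}"
}

# Example: show the five largest segments in the assumed default data directory.
if [ -d /var/lib/redpanda/data ]; then
  largest_segments /var/lib/redpanda/data 5
fi
```

If the top entries all belong to one topic, that topic's segment.bytes and retention settings are the first place to look.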

After resolving these issues, you might encounter alerts related to RAFT_UNAVAILABLE or RAFT_UNSYNCED if replication was severely impacted, or METRIC_UNAVAILABLE if monitoring agents can’t reach Redpanda due to it being unresponsive.
