Redpanda’s storage threshold alerts tell you that a broker is about to become unavailable because the disk backing its data directory is nearly full. This isn’t just about "running out of space": once the log directories, where all your topic data is stored and replicated, can no longer accept writes, producers are rejected and the cluster effectively stops making progress.
Common Causes and Fixes for Redpanda Storage Threshold Alerts
- Log Directory Full Due to Unmanaged Topic Retention:
  - Diagnosis: Check the size of your topic data directories. On a typical Redpanda setup, the data directory is /var/lib/redpanda/data/ by default, with Kafka topics stored under its kafka/ subdirectory. You can use du -sh /var/lib/redpanda/data/kafka/* to see individual topic sizes. Look for topics with unusually large sizes or a rapid growth rate.
  - Fix: Configure appropriate retention policies for your topics. Cluster-wide defaults are cluster configuration properties (older releases set them in redpanda.yaml); per-topic overrides are applied with rpk topic alter-config.

        # Cluster-wide defaults (verify property names against your Redpanda version)
        rpk cluster config set log_cleanup_policy delete
        rpk cluster config set log_retention_ms 604800000    # 7 days in ms
        rpk cluster config set retention_bytes 10737418240   # 10 GiB per partition

    Or, for a single topic, using rpk:

        rpk topic alter-config my-large-topic --set retention.ms=604800000 --set retention.bytes=10737418240

  - Why it works: By setting retention.ms (time-based retention) or retention.bytes (size-based retention, applied per partition), Redpanda will automatically delete old log segments for topics that exceed these limits, freeing up disk space.
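Retention values in milliseconds and bytes are easy to mistype, so it can help to derive them instead of hard-coding them. The following is a minimal sketch: the helper names days_to_ms and gib_to_bytes are hypothetical (they are not part of rpk), the topic name is a placeholder, and the final command is only printed for review rather than executed.

```shell
#!/bin/sh
# Hypothetical helpers for deriving retention values; not part of rpk.

# Convert whole days to milliseconds (for retention.ms).
days_to_ms() {
    echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Convert whole GiB to bytes (for retention.bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

retention_ms=$(days_to_ms 7)        # 604800000
retention_bytes=$(gib_to_bytes 10)  # 10737418240

# Print the command for review instead of running it against a cluster.
echo "rpk topic alter-config my-large-topic --set retention.ms=${retention_ms} --set retention.bytes=${retention_bytes}"
```

Printing the command first makes it easy to check the numbers before they are applied to a production topic.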
- Log Directory Full Due to Uncompacted Topics:
  - Diagnosis: Some topics, particularly those with cleanup.policy set to compact, can grow indefinitely if keys are constantly updated without the log segments being compacted. Check cleanup.policy with rpk topic describe <topic-name> -c.
  - Fix: Ensure that compacted topics have a reasonable segment.ms or segment.bytes configuration, since only closed segments are eligible for compaction, and that compaction is actually running. Track the topic’s on-disk size over time to see whether compaction is keeping up. If compaction is stuck, you might need to investigate individual partition issues. For a quick fix on a specific topic, you can temporarily switch it to deletion (use with caution: this can cause temporary high I/O, and under the delete policy records past the retention limits are removed permanently, including the latest value for a key):

        rpk topic alter-config my-compacted-topic --set cleanup.policy=delete
        # Wait for cleanup to run, then re-enable compaction if desired
        rpk topic alter-config my-compacted-topic --set cleanup.policy=compact

  - Why it works: Compaction reclaims space by discarding older versions of records with the same key, keeping only the latest. If compaction is not running or is inefficient, logs grow. Switching the policy to delete lets retention remove old segments outright; switching back resumes compaction on whatever remains.
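To make the space savings from compaction concrete, here is a toy model that keeps only the newest record per key in a tab-separated file. It only illustrates the idea and is not how Redpanda implements compaction; the function name compact_log and the sample records are made up for this sketch.

```shell
#!/bin/sh
# Toy model of log compaction: keep only the newest record per key.
# Input: lines of "key<TAB>value" in append order.
compact_log() {
    awk -F '\t' '
        { latest[$1] = $0; pos[$1] = NR }   # later records overwrite earlier ones
        END {
            # Emit each surviving record tagged with its original position
            for (k in latest) print pos[k] "\t" latest[k]
        }
    ' "$1" | sort -n | cut -f 2-            # restore append order, drop the tag
}

log=$(mktemp)
printf 'user1\tv1\nuser2\tv1\nuser1\tv2\nuser1\tv3\n' > "$log"
compact_log "$log"
rm -f "$log"
```

Four input records collapse to two: user2 keeps its only value and user1 keeps v3. A topic whose keys are rewritten often shrinks a lot once compaction runs, and grows without bound when it does not.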
- Insufficient Disk Space Allocation:
  - Diagnosis: The most straightforward cause is that the underlying disk partition where Redpanda stores its data is simply too small for your workload. Check available space with df -h /var/lib/redpanda/data/ (or your configured data directory).
  - Fix: Increase the size of the partition or move Redpanda’s data directory to a larger disk. This often involves resizing the disk in your cloud provider’s console or on on-premises hardware, followed by a filesystem resize command (e.g., resize2fs for ext4, xfs_growfs for XFS).

        # Example for ext4
        sudo resize2fs /dev/sda1   # Replace /dev/sda1 with your partition

  - Why it works: Providing more physical space directly resolves the "disk full" condition, allowing Redpanda to write new data and segment files.
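The df check can be turned into a simple threshold test that runs from cron or a monitoring agent. This is a minimal sketch: the function name disk_used_pct, the DATA_DIR and THRESHOLD_PCT variables, and the 80% default are assumptions for illustration, not Redpanda settings.

```shell
#!/bin/sh
# Print the used-space percentage (integer, no % sign) of the filesystem
# holding the given path, parsed from POSIX-format df output.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Assumed values for illustration; adjust to your deployment.
DATA_DIR="${DATA_DIR:-/var/lib/redpanda/data}"
THRESHOLD_PCT="${THRESHOLD_PCT:-80}"

if [ -d "$DATA_DIR" ]; then
    used=$(disk_used_pct "$DATA_DIR")
    if [ "$used" -ge "$THRESHOLD_PCT" ]; then
        echo "WARNING: $DATA_DIR is ${used}% full (threshold ${THRESHOLD_PCT}%)"
    else
        echo "OK: $DATA_DIR is ${used}% full"
    fi
else
    echo "Data directory $DATA_DIR not found; set DATA_DIR to your configured path"
fi
```

Alerting well before the disk is full leaves time to adjust retention or grow the volume while the broker is still accepting writes.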
- WAL (Write-Ahead Log) Directory Full:
  - Diagnosis: Redpanda also relies on a write-ahead log for durability. In a default install this is the Raft log of each partition, kept under the data directory next to topic data rather than in a separate /var/lib/redpanda/wal/ directory, so check the internal namespaces as well as the topic directories with du -sh /var/lib/redpanda/data/*. If your deployment places the log on its own mount, check that mount with du -sh and df -h too.
  - Fix: Ensure that log segments are being properly flushed, replicated, and removed. This is usually tied to successful data segment writes and replication. If log space keeps growing regardless of retention settings, it’s a strong indicator that the primary log directories are also under pressure or there’s a replication issue preventing segment advancement.
  - Why it works: Log entries can only be reclaimed once the data has been written and replicated successfully. A log that keeps filling suggests a bottleneck or failure in the primary data path.
- Replication Lag or Partition Leader Issues:
  - Diagnosis: If a partition leader is consistently unable to replicate data to its followers due to network issues, follower unavailability, or slow disks on followers, the leader might accumulate a large amount of unacknowledged data. This can lead to the leader’s disk filling up. Check cluster and replication status with rpk cluster health, which reports down nodes, leaderless partitions, and under-replicated partitions.
  - Fix: Identify and resolve the underlying replication issues. This might involve restarting stalled followers, improving network connectivity, or addressing disk I/O bottlenecks on follower nodes.
  - Why it works: When data is successfully replicated and acknowledged by a quorum of followers, the leader can safely advance its log and delete old segments. Fixing replication lag ensures this progress can be made.
- Large Segment Files Due to Frequent Restarts or Configuration:
  - Diagnosis: If Redpanda is restarted frequently, or if segment.bytes is set very high, segment files can become excessively large, leading to quicker disk exhaustion, especially if retention policies are not aggressive enough. Retention only removes closed segments, so an oversized active segment holds its space until it rolls. Check segment file sizes within /var/lib/redpanda/data/kafka/<topic>/<partition>/ (partition directory names may carry a revision suffix).
  - Fix: Adjust the segment size, either cluster-wide or per topic via rpk topic alter-config <topic> --set segment.bytes=<bytes>, to a more manageable size (e.g., 1GB or 2GB). A smaller segment size means more frequent segment file creation and deletion cycles, which can help with space management if retention policies are also tuned.

        # Cluster-wide default segment size (verify the property name against your version)
        rpk cluster config set log_segment_size 1073741824   # 1 GiB

  - Why it works: Smaller segment files are closed sooner and so become eligible for deletion by the retention policies more quickly, providing a more granular release of disk space.
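To find the segments that hold the most space, a short find and du pipeline is usually enough. This sketch assumes segment files end in .log, which should be verified against the on-disk layout of your version; the function name largest_segments and the DATA_DIR variable are hypothetical.

```shell
#!/bin/sh
# List the N largest segment files under a directory, biggest first.
# Output format: size in KiB, a tab, then the file path.
# Assumption: segment files use a ".log" suffix.
largest_segments() {
    dir=$1
    count=${2:-10}
    find "$dir" -type f -name '*.log' -exec du -k {} + 2>/dev/null |
        sort -rn | head -n "$count"
}

DATA_DIR="${DATA_DIR:-/var/lib/redpanda/data}"
if [ -d "$DATA_DIR" ]; then
    largest_segments "$DATA_DIR" 10
fi
```

If the top entries are all close to the configured segment size and belong to a few partitions, those topics are the first candidates for a smaller segment size or tighter retention.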
After resolving these issues, you might encounter alerts related to RAFT_UNAVAILABLE or RAFT_UNSYNCED if replication was severely impacted, or METRIC_UNAVAILABLE if monitoring agents can’t reach Redpanda due to it being unresponsive.