The Pulsar BookKeeper nodes (bookies) are failing because their local storage partitions are full, preventing them from writing new journal or ledger data.

The most common culprit is that the bookies aren’t cleaning up old data fast enough, especially under heavy write load or with a high max_in_flight_journal_ops setting. Journal and ledger files then accumulate faster than compaction and deletion can reclaim their space.

Diagnosis: Check disk usage on the affected bookie:

df -h /var/lib/bookkeeper

Look for partitions at or near 100% capacity. If the journal and ledger directories live on separate mounts, check each of them.
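For checking many bookies from a script rather than by eye, the same test can be sketched in Python. This is a minimal illustration, not part of Pulsar or BookKeeper: the function names and the 90% warning level are assumptions chosen for the example, and the percentage is computed as used / (used + available), which approximates the Use% column of df.

```python
import shutil


def used_percent(used_bytes, avail_bytes):
    """Percentage of usable space consumed, approximating df's Use% column."""
    usable = used_bytes + avail_bytes
    if usable <= 0:
        raise ValueError("used + available must be positive")
    return 100.0 * used_bytes / usable


def check_mount(path, warn_at=90.0):
    """Return (percent_used, is_near_full) for the filesystem holding `path`.

    warn_at is an arbitrary example level, not a BookKeeper default.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    pct = used_percent(usage.used, usage.free)
    return pct, pct >= warn_at
```

Calling check_mount("/var/lib/bookkeeper") on a bookie returns the usage percentage and a flag that is true once the mount crosses the chosen level.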

Cause 1: Insufficient max_per_segment_size or segment_size

If segments are too small, the bookies spread the same amount of data across many small files, which adds per-file overhead and leaves more files for cleanup to process.

  • Diagnosis: Examine bookkeeper.conf for max_per_segment_size and segment_size. Both default to 32MB. Check Pulsar admin tools for actual segment sizes.
  • Fix: Increase max_per_segment_size and segment_size in bookkeeper.conf on all bookies. A common starting point is 64MB or 128MB.
    # bookkeeper.conf
    # 128MB for both settings, expressed in bytes
    max_per_segment_size=134217728
    segment_size=134217728
    
  • Why it works: Larger segments mean fewer individual files on disk for the same amount of data, and fewer file handles to manage. This can improve performance and reduce overhead. The trade-off is that each segment is larger, so it takes longer before everything in it is obsolete and its space can be reclaimed.
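The file-count effect is simple arithmetic, sketched below. The helper name and the 100 GiB data volume are assumptions picked for illustration; only the 32MB and 128MB segment sizes come from the text above.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def segments_needed(data_bytes, segment_bytes):
    """Number of segment files needed to hold data_bytes (ceiling division)."""
    if segment_bytes <= 0:
        raise ValueError("segment_bytes must be positive")
    return -(-data_bytes // segment_bytes)


# Hypothetical volume of 100 GiB on one bookie.
small = segments_needed(100 * GiB, 32 * MiB)   # files at the quoted 32MB size
large = segments_needed(100 * GiB, 128 * MiB)  # files at the suggested 128MB size
```

Quadrupling the segment size cuts the number of files to a quarter, which is the overhead reduction the fix relies on.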

Cause 2: High max_in_flight_journal_ops or max_in_flight_entries_per_op

These settings control how many writes can be pending. If they’re too high, a surge of writes can overwhelm the disk’s ability to flush data, leading to a journal backlog.

  • Diagnosis: Check bookkeeper.conf for max_in_flight_journal_ops and max_in_flight_entries_per_op. Default max_in_flight_journal_ops is 5000, max_in_flight_entries_per_op is 1000.
  • Fix: Reduce these values in bookkeeper.conf on all bookies. Try max_in_flight_journal_ops=2000 and max_in_flight_entries_per_op=500.
    # bookkeeper.conf
    max_in_flight_journal_ops=2000
    max_in_flight_entries_per_op=500
    
  • Why it works: This throttles incoming writes, giving the bookie more time to flush its journal and ledger data to disk before the journal buffer fills up.

Cause 3: Inadequate journal_flush_interval_ms

The journal flush interval dictates how often the bookie flushes its in-memory journal to stable storage. If this is too long, the journal can grow very large.

  • Diagnosis: Check bookkeeper.conf for journal_flush_interval_ms. Default is 1000 (1 second).
  • Fix: Decrease journal_flush_interval_ms in bookkeeper.conf on all bookies. Try 500 (0.5 seconds).
    # bookkeeper.conf
    journal_flush_interval_ms=500
    
  • Why it works: More frequent flushes ensure that data is written to stable storage more often, preventing the journal from accumulating excessive amounts of unflushed data.
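A rough way to see the effect of the interval: at a steady write rate, the data waiting for the next flush is bounded by rate times interval. The sketch below assumes a constant rate, and the 50 MiB/s figure is a made-up example, not a measurement.

```python
MiB = 1024 * 1024


def max_unflushed_bytes(write_bytes_per_s, flush_interval_ms):
    """Upper bound on data accumulated between two flushes at a steady rate."""
    if write_bytes_per_s < 0 or flush_interval_ms < 0:
        raise ValueError("rate and interval must be non-negative")
    return write_bytes_per_s * flush_interval_ms // 1000


# Hypothetical steady ingest of 50 MiB/s on one bookie.
at_1000_ms = max_unflushed_bytes(50 * MiB, 1000)
at_500_ms = max_unflushed_bytes(50 * MiB, 500)
```

Halving the interval halves the bound, at the cost of more frequent flush calls to the disk.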

Cause 4: Slow Disk I/O or High Latency

The underlying storage might simply not be fast enough to keep up with the write load, especially during peak times.

  • Diagnosis: Use iostat -xm 5 on the bookie to monitor disk I/O utilization (%util), read/write speeds, and average wait times (await). High %util and await indicate a bottleneck.
  • Fix: Upgrade to faster storage (SSDs are highly recommended for bookies), or offload some topics/partitions to less busy bookies if using topic-based balancing. For immediate relief, consider reducing the write load on the cluster if possible.
  • Why it works: Faster disks can process write requests more quickly, reducing the chance of I/O becoming a bottleneck and causing data to pile up.
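To watch for this condition automatically, the output of iostat -x can be parsed by column name. The sketch below makes several assumptions: column names differ between sysstat versions (older releases print a single await column, newer ones r_await and w_await), numbers are assumed to use a decimal point, and the 90% and 20 ms limits are arbitrary example values rather than recommendations.

```python
def parse_iostat(text):
    """Parse `iostat -x` device tables into {device: {column: float}}.

    If the text holds several reports, the latest sample per device wins.
    """
    devices, header = {}, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            header = None                      # a blank line ends the table
        elif fields[0].startswith("Device"):
            header = fields[1:]                # column names after "Device"
        elif header and len(fields) == len(header) + 1:
            devices[fields[0]] = dict(zip(header, map(float, fields[1:])))
    return devices


def saturated_devices(stats, util_limit=90.0, await_limit_ms=20.0):
    """Return devices whose %util or write wait time exceeds the example limits."""
    flagged = []
    for device, cols in stats.items():
        wait = cols.get("w_await", cols.get("await", 0.0))
        if cols.get("%util", 0.0) >= util_limit or wait >= await_limit_ms:
            flagged.append(device)
    return flagged
```

Feeding it the captured output of iostat -xm 5 and alerting on a non-empty result gives an early warning before the backlog turns into a full disk.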

Cause 5: Mistuned max_write_throughput_limit

This is a rate-limiting mechanism for writes per bookie. If it’s set too low, it artificially slows down writes and causes backlog even though the disk could handle more. Conversely, if it’s too high and the disk can’t keep up, it contributes to the problem.

  • Diagnosis: Check bookkeeper.conf for max_write_throughput_limit. Default is -1 (unlimited).
  • Fix: If this is set to a low value, increase it. If it’s unlimited and disk is slow, consider setting a sensible limit to prevent overwhelming the disk. A value like 100MB/s or 200MB/s might be appropriate depending on your hardware.
    # bookkeeper.conf
    # 200MB/s, expressed in bytes per second
    max_write_throughput_limit=209715200
    
  • Why it works: Properly tuning this limit ensures that write operations don’t exceed the disk’s sustained write capabilities, preventing an overload condition.
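The byte values used for this limit, and how long a disk lasts at a given rate, can be worked out with a few lines of arithmetic. This is an illustrative sketch only: the helper names and the 500 GiB of free space are assumptions, and it ignores any space reclaimed by cleanup while writes continue.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def mib_per_s_to_bytes(mib_per_s):
    """Convert a rate in MiB/s to the bytes-per-second value used in the config."""
    return mib_per_s * MiB


def seconds_until_full(free_bytes, write_bytes_per_s):
    """Time until the disk fills at a sustained write rate with no cleanup."""
    if write_bytes_per_s <= 0:
        return float("inf")
    return free_bytes / write_bytes_per_s


limit = mib_per_s_to_bytes(200)                     # the 200MB/s example above
time_left = seconds_until_full(500 * GiB, limit)    # hypothetical 500 GiB free
```

Comparing the result with how quickly old data is normally reclaimed shows whether a given limit leaves any headroom.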

Cause 6: Ledger/Segment Compaction Lag

BookKeeper compacts its on-disk log files to reclaim space still held by data from deleted ledgers. If compaction is not keeping up, or if compaction_max_pending_requests is too low, old data segments can linger.

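  • Diagnosis: Monitor bookie logs for compaction-related warnings or errors. Check bookkeeper.conf for compaction_max_pending_requests. Default is 10. Also review the garbage collection and compaction settings that ship in bookkeeper.conf, such as gcWaitTime, minorCompactionThreshold, and majorCompactionThreshold.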
  • Fix: Increase compaction_max_pending_requests in bookkeeper.conf. Try 20 or 30. Ensure your segment_size and max_per_segment_size are reasonable.
    # bookkeeper.conf
    compaction_max_pending_requests=30
    
  • Why it works: Allowing more pending compaction requests means the system can process the backlog of segment merging and cleanup more aggressively.

Cause 7: Unbalanced Topic Distribution

If a few topics are extremely hot and are exclusively written to a subset of bookies, those bookies can become overloaded even if the cluster as a whole has free space.

  • Diagnosis: Use pulsar-admin topics list <your_tenant>/<your_namespace> and then pulsar-admin topics stats <topic_name> to identify high-volume topics. Check which bookies are serving these topics via the topic stats or by looking at the bookkeeper logs for ledger assignments.
  • Fix: Rebalance topics to distribute the load more evenly across bookies. This might involve using Pulsar’s placement policies. Restarting bookies is disruptive and does not move data that is already written: existing ledgers stay on the bookies that hold them, and only newly created ledgers are placed on a different set of bookies.
  • Why it works: Spreading the workload across more bookies prevents any single bookie from becoming a disk I/O bottleneck.
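Once the stats for each topic have been collected as JSON, ranking them by inbound throughput is a short script. The sketch below assumes each topic’s stats were already parsed into a dict and that they carry a msgThroughputIn field (bytes per second); the function name and the top_n default are made up for the example, so verify the field name against the output of your Pulsar version.

```python
def hottest_topics(stats_by_topic, top_n=5):
    """Rank topics by inbound throughput, highest first.

    stats_by_topic maps a topic name to its parsed stats dict. Topics
    without a msgThroughputIn field are treated as having zero throughput.
    """
    ranked = sorted(
        stats_by_topic.items(),
        key=lambda item: item[1].get("msgThroughputIn", 0.0),
        reverse=True,
    )
    return [(topic, stats.get("msgThroughputIn", 0.0)) for topic, stats in ranked[:top_n]]
```

The topics at the head of the list are the first candidates to examine when deciding how to spread the write load.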

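Even after tuning these settings, the bookies will likely keep returning BK_NO_RESERVABLE_STORAGE errors until disk usage drops back below the configured threshold. Free up space, or raise the disk usage threshold in bookkeeper.conf (max_disk_usage_threshold as named here; stock BookKeeper calls this setting diskUsageThreshold).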
