The Pulsar managed ledger is failing to acknowledge writes, so producers time out and unacknowledged messages may be lost. Acknowledgements stall when the BookKeeper bookies behind the ledger cannot persist entries to disk.
Cause 1: Insufficient Disk Space
Diagnosis: Check disk usage on the BookKeeper bookies:
df -h /pulsar/data
Look for partitions with usage at or near 100%.
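Fix: Free up disk space on the affected bookie(s). Prefer removing non-ledger data (old logs, heap dumps), lowering topic retention so BookKeeper can garbage-collect ledgers that are no longer needed, or moving data to a larger partition.
# Example: list ledger-directory files untouched for 30+ days for review.
# Do not delete them by age alone: an old file can still hold entries for live ledgers.
find /pulsar/data/bookkeeper/ledgers -mtime +30 -type f -print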
This works because BookKeeper cannot write new ledger data when its storage is full, preventing acknowledgements.
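The df check above can be turned into a filter that reports only the filesystems past a chosen threshold. This is a minimal sketch: full_filesystems is a hypothetical helper name, the 90% default is an arbitrary assumption, and the parsing assumes POSIX-format df output (df -P) with mount points that contain no spaces.

```shell
# Print "<mount point> <usage>%" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin so the parsing can be tried on saved output.
# full_filesystems is a hypothetical helper, not a Pulsar or BookKeeper tool.
full_filesystems() {
    threshold="${1:-90}"
    # df -P columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
    awk -v limit="$threshold" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6 " " pct "%"
    }'
}

# Usage on a bookie (path taken from the example above):
#   df -P /pulsar/data | full_filesystems 90
```

Reading from stdin keeps the function independent of any particular data directory, so the same filter can be pointed at the journal and ledger partitions separately.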
Cause 2: BookKeeper Write Quota Exceeded
Diagnosis: Check BookKeeper bookie logs for messages like "Write quota exceeded" or "Storage limit reached." You can also check the BookKeeper admin tool:
/opt/bookkeeper/bin/bookkeeper shell quota
This will show current usage against configured quotas.
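Fix: Increase the write quota for the affected bookie(s) or topic. This is configured in bookkeeper.conf on each bookie. Key names vary between BookKeeper versions, so confirm the setting against the configuration reference for your release; stock BookKeeper, for example, stops accepting writes once disk usage passes diskUsageThreshold, which is a fraction of disk capacity rather than a byte count.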
# Example: Increase total storage quota to 500GB
storageQuotaBytes=536870912000
Restart the BookKeeper bookie service after changing the configuration. This allows writes to proceed once the storage limit is raised.
Cause 3: Network Connectivity Issues Between Bookie and ZooKeeper
Diagnosis: From the bookie machine, attempt to connect to ZooKeeper:
telnet <zookeeper_host> 2181
If the connection times out or is refused, there’s a network issue. Also, check bookie logs for ZooKeeper connection errors.
Fix: Ensure firewalls are not blocking port 2181 between the bookies and ZooKeeper, and verify that ZooKeeper is running and reachable.
# Example: on each ZooKeeper host that uses firewalld, open the client port
sudo firewall-cmd --zone=public --add-port=2181/tcp --permanent
sudo firewall-cmd --reload
This is crucial because BookKeeper relies on ZooKeeper for metadata management, including ledger and segment registration.
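When the ensemble has several members, it helps to test every endpoint rather than one. The sketch below splits a ZooKeeper connection string into host and port pairs; zk_endpoints is a hypothetical helper name, the default port of 2181 follows the examples above, and IPv6 literals are not handled.

```shell
# Split a ZooKeeper connection string such as "zk1:2181,zk2:2181/chroot"
# into one "host port" line per server.
# zk_endpoints is a hypothetical helper, not part of any ZooKeeper tooling.
zk_endpoints() {
    # Drop an optional chroot suffix, then split the server list on commas.
    printf '%s\n' "${1%%/*}" | tr ',' '\n' | while IFS=: read -r host port; do
        if [ -n "$host" ]; then
            printf '%s %s\n' "$host" "${port:-2181}"
        fi
    done
}

# Usage from a bookie, assuming a netcat build that supports -z and -w:
#   zk_endpoints "zk1:2181,zk2:2181" | while read -r host port; do
#       if nc -z -w 5 "$host" "$port"; then
#           echo "$host:$port reachable"
#       else
#           echo "$host:$port UNREACHABLE"
#       fi
#   done
```

The connection string is typically the value the bookie itself uses to reach ZooKeeper, so testing those exact endpoints avoids checking a host the bookie never contacts.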
Cause 4: BookKeeper Bookie Process Crashing or Unresponsive
Diagnosis: Check the status of the BookKeeper bookie process:
sudo systemctl status bookkeeper
Look for recent restarts, error messages, or a "dead" status. Examine /var/log/bookkeeper/bookkeeper.log for crash dumps or critical errors.
Fix: Identify the root cause of the bookie crash (e.g., OOM errors, disk I/O issues, unhandled exceptions) and address it. If it’s a transient issue, restarting the service might suffice:
sudo systemctl restart bookkeeper
A healthy bookie process is fundamental for it to accept and acknowledge writes.
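Of the crash causes listed above, out-of-memory errors are usually the quickest to confirm from the log. A minimal sketch, assuming the bookie runs on a JVM that logs the standard java.lang.OutOfMemoryError class name; count_oom is a hypothetical helper name.

```shell
# Count log lines that mention a JVM out-of-memory error.
# Reads the log on stdin; count_oom is a hypothetical helper name.
count_oom() {
    # grep -c exits non-zero when nothing matches, so keep the status clean.
    grep -cF 'java.lang.OutOfMemoryError' || true
}

# Usage (log path taken from the diagnosis step above):
#   count_oom < /var/log/bookkeeper/bookkeeper.log
#   journalctl -u bookkeeper --since "1 hour ago" | count_oom
```

A non-zero count points toward heap or direct-memory sizing rather than disk or network trouble, which narrows the root-cause search before any restart.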
Cause 5: Persistent Segment Write Errors
Diagnosis: Inspect the Pulsar broker logs (server.log) for errors reading from or writing to specific ledgers, often mentioning I/O errors, timeouts, or corrupted data. A read-side example:
2023-10-27 10:30:00,123 ERROR org.apache.bookkeeper.client.ReadHandle - Could not read entry from ledger <ledger_id> entry <entry_id>
Also, check BookKeeper bookie logs for similar I/O or disk-related errors.
Fix: This is often indicative of underlying hardware issues (disk failure) or severe filesystem corruption on the bookie nodes hosting the affected ledger segments.
- Identify affected bookies: The error messages in broker logs usually specify the bookie hostname or IP.
- Run disk diagnostics: On the affected bookie, run smartctl -a /dev/sdX (replace sdX with the actual disk device) to check for hardware errors.
- Repair filesystem: If filesystem corruption is suspected, stop the bookie, unmount the affected partition, and run fsck on it.
- Replace faulty hardware: If disk failure is confirmed, replace the failing drive.
- Re-route traffic: Once the faulty bookie is identified, take it out of service in a controlled manner and let BookKeeper re-replicate the ledgers it hosted onto healthy bookies. Do not delete ledgers to force this; deletion discards the data instead of re-replicating it.
This fixes the problem by ensuring the underlying storage is reliable and can actually persist data.
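The disk-diagnostics step can be reduced to a quick pass/fail check before reading the full smartctl -a report. This is a sketch under one assumption: the drive is an ATA/SATA device, for which smartctl -H prints a "self-assessment test result" line; other device types may word the result differently. smart_verdict is a hypothetical helper name.

```shell
# Classify `smartctl -H` output (on stdin) as healthy or suspect.
# smart_verdict is a hypothetical helper, not part of smartmontools.
smart_verdict() {
    if grep -q 'self-assessment test result: PASSED'; then
        echo "healthy"
    else
        # Covers a failed result as well as output this sketch does not recognise.
        echo "suspect"
    fi
}

# Usage (replace sdX with the actual disk device):
#   sudo smartctl -H /dev/sdX | smart_verdict
```

Anything other than a recognised pass is treated as suspect on purpose, so an unreadable or unexpected report leads to a closer look instead of a false all-clear.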
Cause 6: ZooKeeper Session Expiration or Instability
Diagnosis: Check Pulsar broker logs for ZooKeeper connection loss or session expiration errors. Also, check ZooKeeper server logs for client disconnects or errors related to specific bookie registrations.
2023-10-27 10:35:00,456 WARN org.apache.zookeeper.ClientCnxn - Session 0x1234567890 expired at [...]
Fix: Ensure the ZooKeeper ensemble is healthy, stable, and has sufficient resources. If ZooKeeper is overloaded, consider increasing tickTime (milliseconds per tick) and syncLimit (the number of ticks a follower may lag behind the leader) in zoo.cfg, with caution and an understanding of the implications, or scale up the ZooKeeper cluster.
# In zoo.cfg
tickTime=2000
syncLimit=10
Restarting the ZooKeeper ensemble or affected nodes can sometimes resolve transient issues. This is critical because Pulsar brokers and BookKeeper bookies use ZooKeeper to coordinate and discover each other; unstable ZooKeeper leads to unreliable cluster operations.
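Because both settings are expressed relative to the tick, it is worth working out what a given pair of values means in milliseconds. The sketch below assumes ZooKeeper's default session bounds (minimum 2 ticks, maximum 20 ticks), which apply only when minSessionTimeout and maxSessionTimeout are not set explicitly; zk_windows is a hypothetical helper name.

```shell
# Derive timing windows, in milliseconds, from zoo.cfg values.
# Arguments: tickTime (ms) and syncLimit (ticks).
# zk_windows is a hypothetical helper, not part of ZooKeeper.
zk_windows() {
    tick="$1"
    sync_limit="$2"
    # How long a follower may lag behind the leader before it is dropped.
    echo "follower_sync_window_ms=$((tick * sync_limit))"
    # Default bounds ZooKeeper applies to a client's requested session timeout.
    echo "min_session_timeout_ms=$((tick * 2))"
    echo "max_session_timeout_ms=$((tick * 20))"
}

# zk_windows 2000 10 prints:
#   follower_sync_window_ms=20000
#   min_session_timeout_ms=4000
#   max_session_timeout_ms=40000
```

With the values shown above, a broker or bookie that requests a session timeout longer than 40 seconds is capped at 40 seconds, so raising the client-side timeout alone has no effect beyond that bound.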
The next error you’ll likely encounter is BrokerServiceException: Failed to send message to topic <topic_name> as the brokers struggle to propagate acknowledged data.