The Pulsar broker cannot find the entries it needs on its BookKeeper bookies, so it cannot serve read requests: the data it needs is missing from storage.
Cause 1: BookKeeper Ledger is Incomplete (Most Common)
A BookKeeper ledger is an append-only sequence of entries, and a Pulsar topic partition is stored as a series of such ledgers. A ledger might be marked as closed in its metadata while some of its entries are missing from the bookies. This can happen during network partitions or broker/bookie crashes.
- Diagnosis: Check the ledger's metadata and entry counts.
    bin/bookkeeper shell listledgers
    bin/bookkeeper shell ledgermetadata -ledgerid <ledger_id>
  Compare the lastEntryId and length in the ledger metadata with the entries and size that Pulsar records for that ledger (pulsar-admin topics stats-internal <topic> lists them per ledger).
- Fix: If entries are truly missing and the ledger is closed, the only recourse is to have Pulsar acknowledge the loss and, where possible, re-ingest the data from its source.
  - Identify the affected topic and ledger: Search the Pulsar broker logs (their location is set in the broker's log4j configuration) for the ledger ID reported alongside the "Entry log not found" error.
  - Treat the ledger as unrecoverable: This is a drastic step. First confirm that no bookie can still serve the entries:
      bin/bookkeeper shell readledger -ledgerid <ledger_id>
    If the read fails on every replica, set autoSkipNonRecoverableData=true in broker.conf so the broker skips ledgers it cannot read.
  - Restart the Pulsar broker(s) serving the affected topic. With the setting in place, the broker skips the unreadable ledger instead of failing on it. The messages that ledger held are lost to consumers.
- Why it works: Once the ledger is treated as unrecoverable, neither BookKeeper nor Pulsar keeps trying to serve entries that no longer exist. After the restart, the broker adjusts its view of the topic and stops issuing reads for the missing data.
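The entry-count comparison from the diagnosis step can be scripted. This is a minimal sketch, assuming you have already read the ledger's lastEntryId from BookKeeper's ledger metadata (for example with bin/bookkeeper shell ledgermetadata -ledgerid <ledger_id>) and the entry count Pulsar records for the same ledger (for example from pulsar-admin topics stats-internal). The function name ledger_is_complete is hypothetical. It compares metadata only and does not prove that each entry is readable on a bookie.

```shell
# ledger_is_complete LAST_ENTRY_ID EXPECTED_ENTRIES
# Entry IDs start at 0, so a ledger whose last entry ID is N holds N+1 entries.
# An empty ledger is assumed to report a last entry ID of -1.
# Succeeds (exit 0) when the counts agree; otherwise prints the gap and fails.
ledger_is_complete() {
  last_entry_id=$1
  expected=$2
  actual=$(( $last_entry_id + 1 ))
  if [ "$actual" -eq "$expected" ]; then
    return 0
  fi
  echo "mismatch: BookKeeper has $actual entries, Pulsar expects $expected" >&2
  return 1
}

# Example call with made-up numbers:
#   ledger_is_complete 99 100 && echo "counts agree"
```

A mismatch is a reason to look closer, not a verdict: the authoritative test is whether the entries can actually be read back from the bookies.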
Cause 2: BookKeeper Ensemble Size Mismatch or Lost Bookies
When a ledger is created, a set of bookies (the ensemble) is chosen to store its data. If bookie failures leave fewer bookies available than the ensemble size or the write and ack quorums require, new writes can fail. Reads of existing entries fail when no surviving bookie holds a copy of the requested entry.
- Diagnosis: Check the persistence settings the broker uses when it creates ledgers: managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, and managedLedgerDefaultAckQuorum in broker.conf. Then, check the health of your bookies.
    bin/bookkeeper shell listbookies -rw
    bin/bookkeeper shell listbookies -ro
  Look for bookies that are missing from the read-write list or that have dropped to read-only.
- Fix:
  - Bring failed bookies back online: If bookies are down, start them. Ensure they can connect to ZooKeeper and that the brokers can reach them.
  - Adjust quorum settings (if necessary): If bookies were permanently removed, you might need to lower managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, and managedLedgerDefaultAckQuorum in broker.conf so that none exceeds the number of available bookies. This is a significant operational change and requires careful consideration of data durability. The new values apply only to ledgers created after the change.
  - Restart Pulsar brokers.
- Why it works: Restoring bookie availability allows BookKeeper to meet its quorum requirements for reads and writes. Adjusting quorum settings (with caution) can allow operations to continue with a reduced set of bookies, assuming data is still durably stored across the remaining ones.
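Before lowering any quorum setting, it helps to check that the intended values are valid together and fit the bookies you actually have. The sketch below assumes you supply the ensemble size, write quorum, ack quorum, and the number of writable bookies yourself; quorum_ok is a hypothetical helper, not part of Pulsar or BookKeeper.

```shell
# quorum_ok ENSEMBLE WRITE_QUORUM ACK_QUORUM WRITABLE_BOOKIES
# BookKeeper requires ensemble >= write quorum >= ack quorum, and a new ledger
# can only be created when at least ENSEMBLE writable bookies are available.
quorum_ok() {
  e=$1; qw=$2; qa=$3; bookies=$4
  if [ "$e" -lt "$qw" ] || [ "$qw" -lt "$qa" ]; then
    echo "invalid: need ensemble >= write quorum >= ack quorum" >&2
    return 1
  fi
  if [ "$bookies" -lt "$e" ]; then
    echo "only $bookies writable bookies for ensemble size $e" >&2
    return 1
  fi
  return 0
}

# Example call with made-up numbers (3/2/2 on a cluster with 3 writable bookies):
#   quorum_ok 3 2 2 3 && echo "settings fit the cluster"
```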
Cause 3: ZooKeeper Connectivity Issues for BookKeeper
BookKeeper relies heavily on ZooKeeper for coordination, metadata storage, and service discovery. If bookies lose connectivity to ZooKeeper, they can become isolated, stop serving requests, and their ledger metadata might become stale or inaccessible to Pulsar.
- Diagnosis: Check the network connectivity between your bookies and your ZooKeeper ensemble. Also, check the ZooKeeper logs for errors related to bookie connections.
    # On a bookie node, test the connection to ZooKeeper
    telnet <zookeeper_host> 2181
  Examine bookkeeper.log and zookeeper.log for connection refused, session expired, or similar errors.
- Fix:
  - Resolve network issues: Ensure firewalls are not blocking traffic between bookies and ZooKeeper on port 2181.
  - Restart ZooKeeper ensemble: If ZooKeeper itself is unhealthy, restart its nodes one at a time, followers first and the leader last, so the ensemble keeps its quorum throughout.
  - Restart BookKeeper bookies: Once ZooKeeper is stable, restart the bookies so they can re-establish their sessions.
  - Restart Pulsar brokers.
- Why it works: A stable ZooKeeper connection is critical for BookKeeper to maintain its cluster state and for Pulsar to discover and communicate with healthy bookies. Resolving connectivity ensures the distributed system can coordinate properly.
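Scanning the logs for ZooKeeper trouble can be reduced to a one-line helper. This is a rough sketch: zk_error_count is a hypothetical name, and the three patterns (session expired, ConnectionLoss, connection refused) are only a starting set that you would extend for your own log format.

```shell
# zk_error_count LOG_FILE
# Prints how many lines in LOG_FILE mention a ZooKeeper session or connection
# problem. The match is case-insensitive; grep exits non-zero on zero matches,
# so "|| true" keeps the helper usable in scripts that stop on errors.
zk_error_count() {
  grep -ciE 'session expired|connectionloss|connection refused' "$1" || true
}

# Example call (the log path is a placeholder):
#   zk_error_count /path/to/bookkeeper.log
```

A count that keeps growing between two runs suggests the bookie is still failing to hold a ZooKeeper session, rather than showing leftovers from an earlier outage.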
Cause 4: Pulsar Broker Metadata Inconsistency with BookKeeper
The Pulsar broker maintains its own metadata about ledgers and their corresponding BookKeeper ledger IDs. If this metadata becomes out of sync with BookKeeper (e.g., due to a broker crash during a critical metadata update), the broker might try to read from a non-existent or incorrect ledger.
- Diagnosis: Compare the ledger IDs the broker has recorded with those known to BookKeeper.
  - Broker logs will show the specific ledger ID it's looking for.
  - Use bin/bookkeeper shell listledgers to see the ledgers BookKeeper knows about.
- Fix: This often requires manually cleaning up stale metadata.
  - Identify the stale ledger ID from the Pulsar broker logs.
  - Find the Pulsar topic/partition associated with that ledger ID. pulsar-admin topics stats-internal <topic> lists the ledger IDs that belong to a topic, so you can check candidate topics against the ID from the logs.
  - Use the Pulsar admin tool to offload or delete the topic partition: Deleting the partition removes the metadata Pulsar keeps for it in ZooKeeper.
    Caution: Deleting a topic partition is destructive and will lose data. Offloading is preferred if possible.
      # Example: offload a partition (requires appropriate permissions and tiered storage setup)
      pulsar-admin topics offload --size-threshold <size> persistent://public/default/my-topic-partition-0
      # Or, if the topic is truly corrupt and needs deletion:
      pulsar-admin topics delete persistent://public/default/my-topic-partition-0
  - Restart the Pulsar broker.
- Why it works: By forcing Pulsar to remove or re-initialize its metadata for the affected partition, it will no longer attempt to access the incorrect or missing BookKeeper ledger. It will then attempt to create a new ledger if writes resume for that partition.
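Matching a ledger ID from the broker logs against a topic can be done by filtering the JSON that pulsar-admin topics stats-internal prints. The sketch below assumes that output contains "ledgerId" fields with numeric values; list_ledger_ids and topic_has_ledger are hypothetical helper names. It is deliberately crude text matching rather than real JSON parsing, and it picks up every numeric ledgerId in the document.

```shell
# list_ledger_ids: read JSON on stdin, print each numeric ledgerId once, sorted.
list_ledger_ids() {
  grep -oE '"ledgerId"[[:space:]]*:[[:space:]]*[0-9]+' | grep -oE '[0-9]+$' | sort -un
}

# topic_has_ledger LEDGER_ID: succeed if LEDGER_ID appears in the JSON on stdin.
# -x forces a whole-line match, so ID 1 does not match ledger 12.
topic_has_ledger() {
  list_ledger_ids | grep -qx "$1"
}

# Example call (topic name and ledger ID are placeholders):
#   pulsar-admin topics stats-internal persistent://public/default/my-topic-partition-0 \
#     | topic_has_ledger 12345 && echo "ledger belongs to this topic"
```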
Cause 5: BookKeeper Bookie Disk Issues or Corruption
Physical disk problems on a BookKeeper bookie can lead to data corruption or the inability to read existing entries, even if the ledger is marked as complete in BookKeeper’s coordination layer.
- Diagnosis: Check the operating system logs (dmesg, /var/log/syslog) on the affected bookie nodes for disk I/O errors, read errors, or filesystem corruption. Also, examine the BookKeeper logs for specific I/O exceptions.
    # On the bookie node
    sudo dmesg -T | grep -iE 'error|fail|corrupt'
- Fix:
  - Take the affected bookie offline: Stop the BookKeeper service on the node.
  - Perform disk checks and repairs: Run fsck on the relevant partition, or if the disk is failing, replace it.
  - Re-sync data or rebuild: If the disk is repaired, you might need to rejoin the bookie to the cluster and allow BookKeeper to re-replicate data. If the disk was replaced, the bookie will need to be reprovisioned and potentially have data re-ingested or rebuilt from other replicas.
  - Restart the Pulsar broker.
- Why it works: Ensuring the underlying storage is healthy and the data is readable on the BookKeeper nodes is fundamental. Without healthy disks, BookKeeper cannot reliably store or retrieve data.
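To narrow a noisy kernel log down to the disks that are actually failing, the device names can be pulled out of the I/O error lines. This sketch assumes the kernel writes messages containing "I/O error, dev <name>", which is common on Linux but varies between kernel versions; failing_devices is a hypothetical helper name.

```shell
# failing_devices: read kernel log text on stdin and print each block device
# that logged an I/O error, one name per line, without duplicates.
failing_devices() {
  grep -oE 'I/O error, dev [[:alnum:]]+' | awk '{ print $NF }' | sort -u
}

# Example call on a bookie node:
#   sudo dmesg -T | failing_devices
```

Any device this prints should be checked against the mount points backing the bookie's journal and ledger directories.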
Cause 6: Incorrect BookKeeper Entry Log Directory Configuration
Each BookKeeper bookie stores its segment data in a configured directory. If this directory is full, inaccessible, or if the configuration points to the wrong location, the bookie will fail to read entries.
- Diagnosis: Check the ledgerDirectories and journalDirectory (or journalDirectories) settings in bookkeeper.conf on the affected bookie. Verify that each specified directory exists, has correct permissions, and has sufficient free space.
    # On the bookie node
    grep -E '^(ledgerDirectories|journalDirector)' /path/to/bookkeeper.conf
    df -h /path/to/data/directory
    ls -ld /path/to/data/directory
- Fix:
  - Free up disk space: If the directory is full, remove old data or expand the storage.
  - Correct directory permissions: Ensure the user that runs the bookie has read/write access to the directory.
  - Update bookkeeper.conf: If the directory path is incorrect, update it and restart the bookie.
  - Restart the Pulsar broker.
- Why it works: BookKeeper needs a healthy, accessible, and sufficiently large storage location to read and write ledger entries. Correcting configuration and ensuring space resolves direct I/O access problems.
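The directory checks above can be combined into one pass over every configured ledger directory. This is a sketch under stated assumptions: the directories are listed comma-separated under the ledgerDirectories key in bookkeeper.conf, the script runs as the same user as the bookie, and the free-space threshold is yours to choose. The names ledger_dirs and check_dir are hypothetical.

```shell
# ledger_dirs CONF_FILE: print each directory from ledgerDirectories, one per line.
ledger_dirs() {
  sed -n 's/^ledgerDirectories=//p' "$1" | tr ',' '\n'
}

# check_dir DIR MIN_FREE_PCT: fail if DIR is missing, not writable by the
# current user, or has less than MIN_FREE_PCT percent of its filesystem free.
check_dir() {
  dir=$1; min_free=$2
  [ -d "$dir" ] || { echo "$dir: not a directory" >&2; return 1; }
  [ -w "$dir" ] || { echo "$dir: not writable by $(id -un)" >&2; return 1; }
  # df -P prints one header line and one data line; column 5 is "NN%" used.
  used=$(df -P "$dir" | awk 'NR==2 { sub("%", "", $5); print $5 }')
  free=$((100 - used))
  [ "$free" -ge "$min_free" ] || { echo "$dir: only ${free}% free" >&2; return 1; }
}

# Example: require at least 10% free space in every ledger directory.
#   ledger_dirs /path/to/bookkeeper.conf | while read -r d; do check_dir "$d" 10; done
```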
The next error you'll likely encounter after fixing "Entry log not found" is "Broker is not ready to serve" or "Topic is not yet partitioned", as Pulsar re-initializes its state.