The Pulsar broker failed to acknowledge messages, indicating a failure in its internal state management or network communication, preventing reliable message delivery to consumers.
Common Causes and Fixes for Pulsar Message Deduplication Failures
A common symptom of this issue is `MessageTooBigException` or `TooManyRequestsException` being thrown by the broker when attempting to write to the transaction log, or consumers reporting duplicate messages and errors like `NotAllowedError` when trying to acknowledge them.
- Transaction Log Disk Full/IO Errors:
  - Diagnosis: Check the disk space on the broker nodes where the transaction log is stored. Look for I/O errors in the broker logs (`ERROR` or `WARN` level messages related to `bookkeeper` or `TransactionLog`) and in the kernel log.
    ```bash
    df -h /path/to/pulsar/transaction/log
    # Kernel messages usually spell it "I/O error", so match the slash as well
    dmesg | grep -iE 'i/o error|io error|i_o error|dm-crypt'
    ```
  - Fix:
    - Increase Disk Space: If the disk is full, extend the partition or add more storage.
    - Check BookKeeper Health: Ensure the BookKeeper ensemble is healthy. If disks are full on BookKeeper nodes, address that first.
    - Restart Broker: After freeing up space or resolving disk issues, restart the Pulsar broker.
      ```bash
      systemctl restart pulsar-broker
      ```
  - Why it works: Deduplication relies on writing to a transaction log (managed by BookKeeper) to track message acknowledgments and prevent duplicates. If this log becomes inaccessible due to disk space or I/O issues, the broker cannot reliably record these acknowledgments, leading to failures.
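The disk-space check can be turned into a small guard that runs before a broker restart or from cron. This is a minimal sketch, not part of Pulsar: the function names `parse_df_pct` and `disk_below_threshold` are hypothetical, and the path and the 85% limit in the usage note are placeholders to adapt.

```bash
#!/usr/bin/env bash

# Extract the "Capacity" percentage from `df -P` output read on stdin.
# `df -P` prints one header line, then one line per filesystem.
parse_df_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds while the filesystem holding path $1 is below $2 percent full.
# Returns 1 (with a warning) at or above the limit, 2 if usage cannot be read.
disk_below_threshold() {
  local path="$1" limit="$2" used
  used="$(df -P "$path" 2>/dev/null | parse_df_pct)"
  if [ -z "$used" ]; then
    echo "ERROR: cannot read disk usage for $path" >&2
    return 2
  fi
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $path is ${used}% full (limit ${limit}%)" >&2
    return 1
  fi
}
```

For example, `disk_below_threshold /path/to/pulsar/transaction/log 85 && systemctl restart pulsar-broker` would restart the broker only once usage has dropped below 85%.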
- BookKeeper Ensemble Issues:
  - Diagnosis: Verify the health of the BookKeeper ensemble. Check BookKeeper logs for errors, especially related to `Ledger`, `Write`, or `Ack`. Ensure all BookKeeper nodes are running and reachable.
    ```bash
    # On a BookKeeper node
    tail -f /var/log/bookkeeper/bookkeeper.log
    # On a Pulsar broker node, check connectivity to BookKeeper
    telnet <bookkeeper_node_ip> 3181
    ```
  - Fix:
    - Restart BookKeeper Nodes: If BookKeeper nodes are down or unresponsive, restart them.
    - Rebalance/Repair BookKeeper: If ledgers are under-replicated or corrupted, use BookKeeper's recovery tools (e.g., `bin/bookkeeper shell listunderreplicated` to find affected ledgers and `bin/bookkeeper shell recover <bookie_address>` to re-replicate them).
    - Check Network: Ensure network connectivity between brokers and BookKeeper nodes.
  - Why it works: Pulsar uses BookKeeper as its persistent storage for message data and transaction logs. If BookKeeper is unhealthy, Pulsar cannot reliably store the state needed for deduplication.
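When the ensemble has more than a couple of nodes, the connectivity check is easier to run as a loop than with interactive `telnet`. The sketch below assumes bash and the coreutils `timeout` command are available on the broker node. The function names are hypothetical, and 3181 is used as the default only because it is the bookie port shown above.

```bash
#!/usr/bin/env bash

# Succeeds if a TCP connection to host $1, port $2 (default 3181)
# can be opened within 3 seconds. Uses bash's /dev/tcp pseudo-device.
bookie_reachable() {
  local host="$1" port="${2:-3181}"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Check every host given as an argument on the default port.
# Prints one status line per host and fails if any host is unreachable.
check_bookies() {
  local host failed=0
  for host in "$@"; do
    if bookie_reachable "$host"; then
      echo "OK   $host"
    else
      echo "DOWN $host"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `check_bookies bk1.example.internal bk2.example.internal` (placeholder hostnames) only proves that the port accepts connections. It does not show that the bookie is writable or that its ledgers are fully replicated.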
- Transaction Log Size Limit Reached:
  - Diagnosis: Pulsar has a configurable limit for transaction log entries per ledger. If this limit is hit frequently, it can cause issues. Check broker logs for messages like `TransactionLog is full` or `Ledger is full`.
    ```bash
    # Check broker configuration for transaction log settings
    grep -i "transactionLog" /etc/pulsar/broker.conf
    ```
  - Fix: Increase the `transactionLogMaxEntriesPerLedger` setting in `broker.conf`, then restart the Pulsar broker. Key names vary between Pulsar releases, so confirm this one against the `broker.conf` shipped with your version; the general managed-ledger rollover limit is `managedLedgerMaxEntriesPerLedger`.
    ```properties
    # Example: Increase from default 10000 to 20000
    transactionLogMaxEntriesPerLedger=20000
    ```
  - Why it works: Each acknowledgment or state change related to deduplication is written to a transaction log ledger. When a ledger reaches its maximum entry count, a new one is created. If this limit is too low for the message volume, the system spends too much time creating new ledgers, potentially leading to timeouts and failures.
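A plain `grep` also matches commented-out lines, which makes it easy to misread a default as an active override. The helper below is a hedged sketch for reading the effective value of one key from a properties-style file such as `broker.conf`. `conf_get` is a hypothetical name, and the key in the usage note is simply the one discussed above.

```bash
#!/usr/bin/env bash

# Print the value of key $1 from properties-style file $2.
# Comment lines and lines without "=" are ignored; if the key appears
# more than once, the last occurrence wins. Fails if the key is absent.
conf_get() {
  local key="$1" file="$2"
  awk -F= -v k="$key" '
    /^[ \t]*#/          { next }
    index($0, "=") == 0 { next }
    {
      name = $1
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      if (name == k) {
        val = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = 1
      }
    }
    END { if (found) print val; else exit 1 }
  ' "$file"
}
```

For example, `conf_get transactionLogMaxEntriesPerLedger /etc/pulsar/broker.conf || echo "not set, broker default applies"` distinguishes an explicit setting from a missing one.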
- Network Latency/Packet Loss between Broker and BookKeeper:
  - Diagnosis: High network latency or packet loss between Pulsar brokers and BookKeeper nodes can cause timeouts during writes to the transaction log. Use `ping` and `mtr` to check network health.
    ```bash
    ping -c 10 <bookkeeper_node_ip>
    mtr --report <bookkeeper_node_ip>
    ```
  - Fix:
    - Improve Network Infrastructure: Address any network congestion, faulty cables, or misconfigured network devices.
    - Colocate Services: Ensure brokers and BookKeeper nodes are in the same data center or availability zone to minimize latency.
    - Increase Timeouts: If network issues are intermittent, you might cautiously increase the BookKeeper client timeout in `broker.conf`. Restart the Pulsar broker after changes.
      ```properties
      # Example: Raise the BookKeeper client timeout from its 30-second default (use with caution)
      bookkeeperClientTimeoutInSeconds=60
      ```
  - Why it works: Deduplication operations involve synchronous writes to BookKeeper. If these writes take too long due to network issues, they will time out, preventing the acknowledgment from being processed correctly.
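To compare several bookies, or to alert on loss, it helps to pull the packet-loss figure out of the `ping` summary rather than reading it by eye. The sketch below assumes the usual `N% packet loss` summary line printed by common `ping` implementations. The function names and the 1% default threshold are illustrative choices, not Pulsar or BookKeeper recommendations.

```bash
#!/usr/bin/env bash

# Read `ping` output on stdin and print the packet-loss percentage
# found in its summary line (for example "10% packet loss").
ping_loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Succeeds if packet loss to host $1 exceeds $2 percent (default 1).
# Returns 2 if no summary line could be parsed (e.g. unknown host).
link_is_lossy() {
  local host="$1" max="${2:-1}" loss
  loss="$(ping -c 10 -q "$host" 2>/dev/null | ping_loss_pct)"
  [ -n "$loss" ] || return 2
  awk -v l="$loss" -v m="$max" 'BEGIN { exit !(l > m) }'
}
```

A check such as `link_is_lossy <bookkeeper_node_ip> && echo "investigate the network path"` flags lossy links. Ten probes are a small sample, so repeat the check before drawing conclusions.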
- Incorrectly Configured Deduplication Settings:
  - Diagnosis: While less common for outright failure, misconfiguration can lead to excessive load. Ensure deduplication is set correctly at the topic level or globally (`brokerDeduplicationEnabled` in `broker.conf`), and check for topics that have it enabled without needing it. Subcommand names have changed across Pulsar releases, so confirm them with `pulsar-admin topics --help`.
    ```bash
    # Check topic-level configuration
    pulsar-admin topics get-deduplication <your_topic_name>
    # Check broker-level configuration
    grep -i "deduplication" /etc/pulsar/broker.conf
    ```
  - Fix:
    - Enable Deduplication on Topic:
      ```bash
      pulsar-admin topics set-deduplication <your_topic_name> --enable
      ```
    - Disable if Unnecessary: If deduplication is not required for a specific topic, disable it.
      ```bash
      pulsar-admin topics set-deduplication <your_topic_name> --disable
      ```
  - Why it works: Deduplication adds overhead. Enabling it on topics with very high throughput or where duplicates are naturally handled by the application logic can strain the transaction log system more than anticipated, revealing underlying infrastructure limitations.
- Broker Resource Starvation (CPU/Memory):
  - Diagnosis: If the Pulsar broker itself is short on CPU or memory, it may not be able to process transaction log writes or acknowledgments in a timely manner. Monitor broker CPU and memory usage.
    ```bash
    # Per-thread view of the broker JVM (PulsarBrokerStarter is the broker's main class)
    top -H -p "$(pgrep -d, -f PulsarBrokerStarter)"
    # Or browse interactively
    htop
    ```
  - Fix:
    - Scale Brokers: Add more broker instances to distribute the load.
    - Optimize JVM: Tune Pulsar broker JVM settings (`PULSAR_MEM`, set in `conf/pulsar_env.sh` and read by the `bin/pulsar` script).
    - Reduce Topic Load: If specific topics are causing excessive load, consider partitioning them further or offloading processing.
  - Why it works: Deduplication is an active process on the broker. If the broker is struggling to keep up with basic message handling, it is likely to fall behind on more demanding tasks such as maintaining transaction logs for deduplication.
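For a quick scripted reading of broker memory without an interactive tool, the resident set size can be taken from `/proc` on Linux. This is a sketch under that assumption. The function names are hypothetical, and matching the broker process by the `PulsarBrokerStarter` class name assumes a standard `bin/pulsar broker` launch.

```bash
#!/usr/bin/env bash

# Read /proc/<pid>/status-style text on stdin and print the resident
# set size (VmRSS, reported in kB) converted to whole MiB.
rss_mib_from_status() {
  awk '/^VmRSS:/ { printf "%d\n", $2 / 1024 }'
}

# Print the resident set size in MiB of process $1 (Linux only).
proc_rss_mib() {
  rss_mib_from_status < "/proc/$1/status"
}
```

Usage could look like `proc_rss_mib "$(pgrep -f PulsarBrokerStarter | head -n 1)"`. Compare the result with the heap and direct-memory limits configured through `PULSAR_MEM`, keeping in mind that RSS also includes non-heap JVM memory.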
After resolving these issues, the next error you might encounter is `TopicNotFoundException`, which can appear if the underlying metadata for the topic was corrupted or lost during the earlier failures.