When a Pulsar broker fails to acknowledge messages, the cause is usually a failure in its internal state management or in network communication, and the result is unreliable message delivery to consumers.

Common Causes and Fixes for Pulsar Message Deduplication Failures

Common symptoms include the broker throwing MessageTooBigException or TooManyRequestsException when it writes to the transaction log, and consumers reporting duplicate messages or errors such as NotAllowedError when they try to acknowledge them.

  1. Transaction Log Disk Full/IO Errors:

    • Diagnosis: Check the disk space on the broker nodes where the transaction log is stored. Look for I/O errors in the broker logs (ERROR or WARN level messages related to bookkeeper or TransactionLog).
      df -h /path/to/pulsar/transaction/log
      dmesg | grep -iE 'i/o error|dm-crypt'
      
    • Fix:
      • Increase Disk Space: If the disk is full, extend the partition or add more storage.
      • Check BookKeeper Health: Ensure the BookKeeper ensemble is healthy. If disks are full on BookKeeper nodes, address that first.
      • Restart Broker: After freeing up space or resolving disk issues, restart the Pulsar broker.
      systemctl restart pulsar-broker
      
    • Why it works: Deduplication relies on writing to a transaction log (managed by BookKeeper) to track message acknowledgments and prevent duplicates. If this log becomes inaccessible due to disk space or I/O issues, the broker cannot reliably record these acknowledgments, leading to failures.
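
The disk check above can be scripted so that it runs on a schedule. The following is a minimal sketch, not a Pulsar tool: the function names and the 80 and 90 percent thresholds are assumptions chosen for illustration, and the path in the example is the same placeholder used in the commands above.

```shell
# Print the usage percentage (integer, without the % sign) of the
# filesystem that holds the given path.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Classify a usage percentage against warning and critical thresholds.
# The defaults (80 and 90) are illustrative, not Pulsar recommendations.
classify_usage() {
  local pct=$1 warn=${2:-80} crit=${3:-90}
  if [ "$pct" -ge "$crit" ]; then
    echo CRITICAL
  elif [ "$pct" -ge "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Example (placeholder path):
# classify_usage "$(disk_usage_pct /path/to/pulsar/transaction/log)"
```

Keeping the classification separate from the df call makes the thresholds easy to adjust per node.
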
  2. BookKeeper Ensemble Issues:

    • Diagnosis: Verify the health of the BookKeeper ensemble. Check BookKeeper logs for errors, especially related to Ledger, Write, or Ack. Ensure all BookKeeper nodes are running and reachable.
      # On a BookKeeper node
      tail -f /var/log/bookkeeper/bookkeeper.log
      # On a Pulsar broker node, check connectivity to BookKeeper
      telnet <bookkeeper_node_ip> 3181
      
    • Fix:
      • Restart BookKeeper Nodes: If BookKeeper nodes are down or unresponsive, restart them.
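      • Rebalance/Repair BookKeeper: If ledgers are under-replicated or corrupted, use BookKeeper’s recovery tooling: bin/bookkeeper shell listunderreplicated lists the affected ledgers, and the autorecovery service or bin/bookkeeper shell recover re-replicates them.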
      • Check Network: Ensure network connectivity between brokers and BookKeeper nodes.
    • Why it works: Pulsar uses BookKeeper as its persistent storage for message data, cursors, and transaction logs. If BookKeeper is unhealthy, Pulsar cannot reliably store the state needed for deduplication.
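
A reachability check across the whole ensemble can be sketched as below. The helper names are hypothetical, port 3181 is assumed as the bookie port (as in the telnet example), and the check only confirms that a TCP connection opens, not that the bookie is healthy. It relies on bash and the coreutils timeout command being available.

```shell
# Return 0 if a TCP connection to host:port opens within the timeout.
check_bookie() {
  local host=$1 port=${2:-3181} timeout_s=${3:-3}
  timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Report reachability for each bookie, given as host or host:port.
# Returns non-zero if any bookie could not be reached.
check_ensemble() {
  local bookie host port rc=0
  for bookie in "$@"; do
    host=${bookie%%:*}
    port=3181
    case $bookie in
      *:*) port=${bookie##*:} ;;
    esac
    if check_bookie "$host" "$port"; then
      echo "$bookie reachable"
    else
      echo "$bookie UNREACHABLE"
      rc=1
    fi
  done
  return $rc
}

# Example (placeholder addresses):
# check_ensemble <bookkeeper_node_ip_1> <bookkeeper_node_ip_2>:3181
```
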
  3. Transaction Log Size Limit Reached:

    • Diagnosis: Pulsar has a configurable limit on the number of entries per ledger; in broker.conf, managed ledgers are governed by managedLedgerMaxEntriesPerLedger. If this limit is hit frequently, it can cause issues. Check broker logs for messages like TransactionLog is full or Ledger is full.
      # Check broker configuration for ledger rollover settings
      grep -iE "managedLedgerMaxEntriesPerLedger|managedLedgerMinLedgerRolloverTimeMinutes" /etc/pulsar/broker.conf
      
    • Fix: Increase the managedLedgerMaxEntriesPerLedger setting in broker.conf. Confirm the setting name against the broker.conf shipped with your Pulsar version.
      # Example: Increase from the default of 50000 to 100000
      managedLedgerMaxEntriesPerLedger=100000
      
      Then restart the Pulsar broker.
    • Why it works: Each acknowledgment or state change related to deduplication is written to a transaction log ledger. When a ledger reaches its maximum entry count, a new one is created. If this limit is too low for the message volume, the system spends too much time creating new ledgers, potentially leading to timeouts and failures.
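
To judge whether a per-ledger entry limit is too low, it can help to estimate how often ledgers would roll over at a given write rate. The sketch below is back-of-the-envelope arithmetic that assumes a steady write rate and ignores any time-based rollover rules; the function name and the numbers in the usage note are illustrative.

```shell
# Estimate ledger rollovers per hour for a steady write rate.
#   $1: entries written per second
#   $2: maximum entries per ledger
rollovers_per_hour() {
  local entries_per_sec=$1 max_entries_per_ledger=$2
  echo $(( entries_per_sec * 3600 / max_entries_per_ledger ))
}
```

For example, `rollovers_per_hour 100 10000` prints 36, and doubling the limit to 20000 halves that to 18.
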
  4. Network Latency/Packet Loss between Broker and BookKeeper:

    • Diagnosis: High network latency or packet loss between Pulsar brokers and BookKeeper nodes can cause timeouts during writes to the transaction log. Use ping and mtr to check network health.
      ping <bookkeeper_node_ip>
      mtr <bookkeeper_node_ip>
      
    • Fix:
      • Improve Network Infrastructure: Address any network congestion, faulty cables, or misconfigured network devices.
      • Colocate Services: Ensure brokers and BookKeeper nodes are in the same data center or availability zone to minimize latency.
      • Increase Timeouts: If network issues are intermittent, you might cautiously increase the BookKeeper client timeout in broker.conf.
      # Example: Raise the BookKeeper client timeout from the default of 30 seconds (use with caution)
      bookkeeperClientTimeoutInSeconds=60
      
      Restart the Pulsar broker after changes.
    • Why it works: Deduplication operations involve synchronous writes to BookKeeper. If these writes take too long due to network issues, they will time out, preventing the acknowledgment from being processed correctly.
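
When checking several bookies, it can be convenient to reduce the ping output to the two numbers that matter here: packet loss and average round-trip time. The following is a sketch with a hypothetical helper name; it assumes the summary format printed by Linux iputils ping, and other ping implementations may format their summary differently.

```shell
# Read ping output on stdin and print packet loss and average RTT.
ping_summary() {
  awk '
    /packet loss/ {
      # The loss figure is the only field that ends in a percent sign.
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
    }
    /^(rtt|round-trip)/ {
      # Line looks like: rtt min/avg/max/mdev = a/b/c/d ms
      split($0, parts, "=")
      split(parts[2], stats, "/")
      gsub(/ /, "", stats[2])
      avg = stats[2]
    }
    END { printf "loss=%s avg_ms=%s\n", loss, avg }
  '
}

# Example:
# ping -c 20 <bookkeeper_node_ip> | ping_summary
```
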
  5. Incorrectly Configured Deduplication Settings:

    • Diagnosis: While less common for outright failure, misconfiguration can lead to excessive load. Deduplication can be enabled broker-wide (brokerDeduplicationEnabled in broker.conf), per namespace, or per topic. Check for topics that have it enabled but don’t need it.
      # Check topic-level configuration
      pulsar-admin topics get-deduplication <your_topic_name>
      # Check broker-level configuration
      grep -i "brokerDeduplication" /etc/pulsar/broker.conf
      
    • Fix:
      • Enable Deduplication on a Topic:
        pulsar-admin topics set-deduplication <your_topic_name> --enable
        
      • Disable if Unnecessary: If deduplication is not required for a specific topic, disable it.
        pulsar-admin topics set-deduplication <your_topic_name> --disable
        
    • Why it works: Deduplication adds overhead. Enabling it on topics with very high throughput or where duplicates are naturally handled by the application logic can strain the transaction log system more than anticipated, revealing underlying infrastructure limitations.
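
When auditing several brokers, reading the effective value of a setting is more reliable than eyeballing grep output, because grep also matches commented-out lines. The sketch below uses a hypothetical helper name and assumes broker.conf uses key=value lines and that the broker-wide deduplication key is brokerDeduplicationEnabled.

```shell
# Print the value of a key from a properties-style file such as broker.conf.
# Commented-out lines are ignored; if the key repeats, the last value wins.
# Exits non-zero if the key is not set.
conf_value() {
  local key=$1 file=$2
  awk -F= -v k="$key" '
    /^[ \t]*#/ { next }
    { name = $1; gsub(/[ \t]/, "", name) }
    name == k {
      sub(/^[^=]*=/, "")
      gsub(/^[ \t]+|[ \t]+$/, "")
      val = $0
      found = 1
    }
    END { if (found) print val; else exit 1 }
  ' "$file"
}

# Example:
# conf_value brokerDeduplicationEnabled /etc/pulsar/broker.conf
```
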
  6. Broker Resource Starvation (CPU/Memory):

    • Diagnosis: If the Pulsar broker itself is overloaded with CPU or memory, it may not be able to process transaction log writes or acknowledgments in a timely manner. Monitor broker CPU and memory usage.
      top -H -p "$(pgrep -d, -f PulsarBrokerStarter)"
      htop
      
    • Fix:
      • Scale Brokers: Add more broker instances to distribute the load.
      • Optimize JVM: Tune Pulsar broker JVM settings (PULSAR_MEM in conf/pulsar_env.sh).
      • Reduce Topic Load: If specific topics are causing excessive load, consider partitioning them further or offloading processing.
    • Why it works: Deduplication is an active process on the broker. If the broker is struggling to keep up with basic message handling, it is likely to fall behind on heavier tasks such as maintaining transaction logs for deduplication.

After resolving these issues, you may next encounter TopicNotFoundException if the topic’s underlying metadata was corrupted or lost during the earlier failures.
