The Pulsar producer is failing to send messages because the broker's acknowledgment does not arrive in time: either the network connection between them drops, or the broker responds too slowly, before receipt is confirmed.

Common Causes and Fixes:

  1. Under-provisioned Broker Network Bandwidth:

    • Diagnosis: Monitor network traffic on the Pulsar broker nodes. Use iftop -i <interface> or your cloud provider’s network monitoring tools. If RX or TX consistently hovers near the interface’s capacity, this is the likely culprit.
    • Fix: Increase the network bandwidth available to your Pulsar broker instances. For example, on AWS EC2, move to an instance type with higher sustained network performance. Note that m5.large and m5.xlarge are both rated "up to 10 Gigabit", so compare the baseline bandwidth of each size, or choose a network-optimized family such as m5n.
    • Why it works: Higher bandwidth allows the broker to receive and process incoming message data faster, preventing the producer’s send operation from timing out due to network congestion.
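As a rough complement to iftop, the sketch below estimates link utilization from two readings of the Linux /proc/net/dev counters. The interface name eth0, the 1 Gbit/s link speed, and the sample counter values are assumptions for illustration; counter wraparound is not handled.

```python
def parse_net_dev(text, interface):
    """Return (rx_bytes, tx_bytes) for one interface from /proc/net/dev content."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines carry no colon
        name, counters = line.split(":", 1)
        if name.strip() == interface:
            fields = counters.split()
            return int(fields[0]), int(fields[8])  # rx bytes, tx bytes
    raise KeyError(f"interface {interface!r} not found")


def utilization(bytes_before, bytes_after, interval_s, link_bps):
    """Fraction of link capacity used between two counter readings."""
    return (bytes_after - bytes_before) * 8 / interval_s / link_bps


# Illustrative readings taken 10 seconds apart (made-up numbers).
SAMPLE_T0 = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 5000000000 100 0 0 0 0 0 0 2000000000 50 0 0 0 0 0 0
"""
SAMPLE_T10 = SAMPLE_T0.replace("5000000000", "6100000000")

rx0, _ = parse_net_dev(SAMPLE_T0, "eth0")
rx1, _ = parse_net_dev(SAMPLE_T10, "eth0")
rx_util = utilization(rx0, rx1, interval_s=10, link_bps=1_000_000_000)
```

On a real broker you would read /proc/net/dev twice with a sleep in between; a sustained value close to 1.0 matches the "near capacity" symptom described above.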
  2. High Broker CPU Utilization:

    • Diagnosis: Check the CPU load on your Pulsar broker nodes using top or htop. If %CPU is consistently above 80-90% for extended periods, the broker is struggling to keep up.
    • Fix: Scale up your Pulsar broker cluster by adding more nodes or scale up existing nodes with more CPU cores. For example, add two more m5.xlarge instances to your existing cluster.
    • Why it works: A CPU-bound broker can’t process incoming requests (including message acknowledgments) quickly enough, leading to timeouts for producers waiting for those acknowledgments.
  3. Insufficient Broker Memory (leading to excessive Garbage Collection):

    • Diagnosis: Monitor JVM heap usage for the Pulsar broker process. Run jstat -gcutil <pid> 1000 and watch for YGC (young GC count) and FGC (full GC count) climbing rapidly while O (old-generation utilization, in percent) approaches 100. Frequent full GCs pause the broker.
    • Fix: Increase the JVM heap size allocated to the Pulsar broker. Edit the broker’s environment file (conf/pulsar_env.sh in a standard installation) and adjust PULSAR_MEM. For instance, change PULSAR_MEM=" -Xms1g -Xmx2g" to PULSAR_MEM=" -Xms4g -Xmx8g", leaving any other flags already set in that variable (such as -XX:MaxDirectMemorySize) in place. Restart the broker.
    • Why it works: More heap memory reduces the frequency and duration of garbage collection pauses, allowing the broker to remain responsive to producer requests.
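To make the jstat check repeatable, the following sketch parses captured jstat -gcutil output and reports how many young and full GCs occurred across the samples, plus the peak old-generation utilization (the column jstat -gcutil prints as O). The sample text is made up for illustration; the column names are read from the header line, so extra columns on newer JDKs are tolerated.

```python
def parse_gcutil(text):
    """Parse `jstat -gcutil` output into a list of {column: value} dicts."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]

    def to_number(token):
        try:
            return float(token)
        except ValueError:
            return None  # some JDKs print "-" for columns that do not apply

    return [dict(zip(header, map(to_number, row))) for row in samples]


def gc_summary(samples):
    """GC count growth and peak old-generation utilization across samples."""
    return {
        "full_gcs": samples[-1]["FGC"] - samples[0]["FGC"],
        "young_gcs": samples[-1]["YGC"] - samples[0]["YGC"],
        "old_peak_pct": max(s["O"] for s in samples),
    }


# Illustrative capture (made-up numbers), two samples one second apart.
SAMPLE = """\
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  96.88  41.27  93.10  95.31  91.42    125    1.234    12    4.567    5.801
  0.00  96.88  78.02  97.45  95.31  91.42    131    1.301    15    5.912    7.213
"""

summary = gc_summary(parse_gcutil(SAMPLE))
```

Several full GCs within a few seconds combined with an old generation that stays above 90% is the pattern that points to an undersized heap.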
  4. Network Latency or Packet Loss:

    • Diagnosis: Use ping -c 100 <broker_ip> and mtr <broker_ip> from the producer’s host to the Pulsar broker. Look for consistently high round-trip times (>50ms for typical setups) or a significant percentage of packet loss in mtr or ping results.
    • Fix: Address underlying network issues. This could involve optimizing routing, upgrading network hardware, or ensuring the producer and broker are located in the same low-latency network segment (e.g., same AWS region/VPC).
    • Why it works: High latency means packets take longer to travel, increasing the chance of timeouts. Packet loss forces retransmissions, further degrading performance and increasing latency.
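The ping check can be scripted. This sketch parses the summary lines that ping prints on Linux and macOS and flags the link using the 50 ms round-trip figure mentioned above; the 1% loss threshold, the IP address, and the sample output are assumptions for illustration.

```python
import re

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"min/avg/max/\w+ = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping(output):
    """Extract packet loss (percent) and average RTT (ms) from ping output."""
    loss = LOSS_RE.search(output)
    rtt = RTT_RE.search(output)
    if loss is None:
        raise ValueError("no packet-loss summary found")
    return {
        "loss_pct": float(loss.group(1)),
        # The RTT line is absent when every packet was lost.
        "avg_rtt_ms": float(rtt.group(2)) if rtt else None,
    }


def looks_unhealthy(stats, max_avg_rtt_ms=50.0, max_loss_pct=1.0):
    """Flag the link when RTT or loss exceeds the (assumed) thresholds."""
    too_slow = stats["avg_rtt_ms"] is None or stats["avg_rtt_ms"] > max_avg_rtt_ms
    return too_slow or stats["loss_pct"] > max_loss_pct


# Illustrative output of `ping -c 100 <broker_ip>` (made-up numbers).
SAMPLE = """\
--- 10.0.0.12 ping statistics ---
100 packets transmitted, 97 received, 3% packet loss, time 99143ms
rtt min/avg/max/mdev = 0.412/62.318/140.221/12.004 ms
"""

stats = parse_ping(SAMPLE)
```

In practice you would feed it the captured output of ping run from the producer host, for example via subprocess.run.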
  5. Producer sendTimeoutMs Configuration Too Low:

    • Diagnosis: Examine your Pulsar producer configuration. The sendTimeoutMs setting defines how long the producer will wait for an acknowledgment from the broker. If this value is less than the typical round-trip time plus processing time on the broker, timeouts are inevitable.
    • Fix: Increase the sendTimeoutMs value in your producer configuration. For example, if it’s set to 10000 (10 seconds), try increasing it to 30000 (30 seconds) or 60000 (60 seconds).
    • Why it works: A longer timeout allows the producer to tolerate temporary network glitches or brief periods of broker unresponsiveness without aborting the send operation.
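A minimal sketch of setting the timeout with the Pulsar Python client, where the option is called send_timeout_millis. The helper names, the safety factor of 3, and the 30-second floor are hypothetical choices for illustration, not recommendations from the Pulsar project; the pulsar-client package is imported lazily, so the sizing helper runs without it.

```python
def producer_settings(observed_p99_ack_ms, safety_factor=3.0, floor_ms=30_000):
    """Pick a send timeout with headroom over the observed p99 ack latency.

    safety_factor and floor_ms are assumed values; tune them for your workload.
    """
    timeout_ms = max(floor_ms, int(observed_p99_ack_ms * safety_factor))
    return {"send_timeout_millis": timeout_ms}


def create_producer(service_url, topic, settings):
    """Create a producer with the chosen settings (needs `pip install pulsar-client`)."""
    import pulsar  # imported lazily so the helper above has no dependency

    client = pulsar.Client(service_url)
    producer = client.create_producer(topic, **settings)
    return client, producer
```

For example, create_producer("pulsar://localhost:6650", "my-topic", producer_settings(15000)) would request a 45-second timeout; the service URL and topic name here are placeholders.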
  6. Broker maxMessageSize Too Small:

    • Diagnosis: Check maxMessageSize in the Pulsar broker configuration. If your producer sends messages larger than this limit, the broker rejects them, and the producer might surface this as a timeout if error handling isn’t robust.
    • Fix: Increase maxMessageSize in the broker’s broker.conf file. For example, change maxMessageSize=5242880 (5MB) to maxMessageSize=10485760 (10MB), keep BookKeeper’s nettyMaxFrameSizeBytes at least as large, and restart the broker. If you send large messages in batches, also review the producer’s batching settings (for example batchingMaxPublishDelay).
    • Why it works: Allows the broker to accept larger messages without immediately rejecting them, preventing premature failures.
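Before raising the limit, it can help to confirm that payloads really exceed it. This sketch checks payload sizes against the 5242880-byte value quoted above; the function name and the 10 KiB headroom reserved for message metadata are assumptions for illustration.

```python
DEFAULT_MAX_MESSAGE_SIZE = 5 * 1024 * 1024  # 5242880 bytes, the value quoted above


def oversized(payloads, max_message_size=DEFAULT_MAX_MESSAGE_SIZE, headroom=10 * 1024):
    """Return the indexes of payloads that would not fit under the broker limit.

    headroom is an assumed allowance for metadata sent alongside the payload.
    """
    limit = max_message_size - headroom
    return [i for i, payload in enumerate(payloads) if len(payload) > limit]


payloads = [b"x" * 1024, b"x" * (5 * 1024 * 1024)]
too_big = oversized(payloads)
```

Logging the sizes of flagged payloads on the producer side makes it easier to tell a size rejection apart from a genuine timeout.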
  7. ZooKeeper Performance Issues:

    • Diagnosis: Monitor ZooKeeper latency metrics. Run echo mntr | nc <zk_host> 2181 and check zk_avg_latency, zk_max_latency, and zk_outstanding_requests (on ZooKeeper 3.5+ the mntr command must be enabled via 4lw.commands.whitelist). High values here can indirectly affect broker responsiveness, because brokers rely on ZooKeeper for metadata operations.
    • Fix: Optimize your ZooKeeper ensemble. Ensure ZooKeeper nodes have dedicated, fast disks (SSDs), sufficient RAM, and are not overloaded with other tasks. Consider increasing tickTime or syncLimit if necessary, but be cautious: larger values lengthen session timeouts and slow down failure detection.
    • Why it works: A responsive ZooKeeper is crucial for Pulsar brokers to perform essential operations like topic lookups and metadata updates, which are necessary for successful message routing and acknowledgment.
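One way to read ZooKeeper latency programmatically is its mntr four-letter command, which returns tab-separated key/value lines over a plain TCP connection. The sketch below fetches and parses that output; the 10 ms threshold and the sample values are assumptions for illustration, and on ZooKeeper 3.5+ mntr has to be allowed through 4lw.commands.whitelist.

```python
import socket


def fetch_mntr(host, port=2181, timeout=5.0):
    """Send `mntr` to a ZooKeeper server and return the raw text response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


def parse_mntr(text):
    """Turn mntr output into a dict, converting numeric values to float."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        try:
            stats[key] = float(value)
        except ValueError:
            stats[key] = value
    return stats


def is_slow(stats, max_avg_latency_ms=10.0):
    """Flag the server when average latency exceeds the (assumed) threshold."""
    return stats.get("zk_avg_latency", 0.0) > max_avg_latency_ms


# Illustrative mntr output (made-up numbers).
SAMPLE = "zk_avg_latency\t42\nzk_max_latency\t913\nzk_outstanding_requests\t17\nzk_server_state\tleader\n"

stats = parse_mntr(SAMPLE)
```

Calling parse_mntr(fetch_mntr("<zk_host>")) against each ensemble member shows whether one slow node or the whole ensemble is responsible.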

After resolving these, you might encounter "Topic is partitioned and partition metadata not available" errors if topic partitioning is involved and the broker hasn’t fully initialized the partition metadata yet.
