Your Pulsar consumer is getting stuck in a "busy" state because the broker is unable to acknowledge messages to it fast enough, leading to a backlog and eventual consumer failure.

Common Causes and Fixes

  1. Broker Overload: The broker itself is struggling to keep up with the general load, not just your consumer. This could be due to too many topics, too many partitions, high message ingress, or insufficient broker resources (CPU, memory, network).

    • Diagnosis: Monitor broker metrics. Look for high CPU utilization (broker_process_cpu_user_seconds_total), high memory usage (jvm_memory_used_bytes), and network saturation (netty_in_bytes_total, netty_out_bytes_total) on the broker instances serving your topic. Also, check pulsar-admin topics stats <your-topic> for totalUnackedMessages and earliestUnackedMessagePublishTime. If totalUnackedMessages is consistently high and earliestUnackedMessagePublishTime is old, the broker is behind.
    • Fix: Scale up your Pulsar cluster. This might mean adding more broker nodes, upgrading existing nodes with more CPU/RAM, or improving network bandwidth. For specific topics, consider partitioning them further if they are a disproportionate load.
    • Why it works: By distributing the load across more resources or more brokers, the system can process acknowledgments and manage message states more efficiently, preventing the backlog.
  2. Consumer Acknowledgment Latency: Your consumer application is too slow to process messages and send acknowledgments back to the broker. The broker waits a certain amount of time (controlled by brokerService.nettyServer.messageAckTimeout and brokerService.messageExpiryTime) before considering a message "unacked" and potentially redelivering it or marking the subscription as busy.

    • Diagnosis: Enable detailed consumer logs. Look for long durations between receiving a message and sending an acknowledgment for it. On the broker side, observe pulsar-admin topics stats <your-topic> for totalUnackedMessages. If this number grows and the earliestUnackedMessagePublishTime is recent, your consumer is the bottleneck. You can also check the consumer’s processing throughput against the topic’s ingress rate.
    • Fix: Optimize your consumer application’s message processing logic. This could involve improving deserialization, database writes, external API calls, or parallelizing processing within the consumer instance. Ensure your ackTimeout in the consumer client configuration is set appropriately (e.g., ackTimeoutMs: 300000 or 5 minutes) and that your processing consistently finishes within this window.
    • Why it works: A faster consumer can acknowledge messages before the broker’s timeout, clearing the unacked message count and signaling to the broker that it’s keeping up.
  3. Network Issues Between Consumer and Broker: Intermittent network connectivity problems, high latency, or packet loss between your consumer instances and the broker can cause acknowledgment messages to be delayed or dropped.

    • Diagnosis: Use network diagnostic tools like ping, traceroute, and mtr from the consumer machine to the broker. Check network interface statistics on both consumer and broker for errors, dropped packets, and high utilization. Monitor Pulsar client-side metrics for connection errors or high latency.
    • Fix: Address network infrastructure problems. This might involve configuring Quality of Service (QoS) for traffic, increasing bandwidth, or resolving routing issues. Ensure your consumer instances are in the same network region/availability zone as your brokers if possible.
    • Why it works: Reliable and low-latency network communication is essential for timely acknowledgments, allowing the broker to correctly track message consumption.
  4. Broker’s messageAckTimeout Too Low: The broker’s configured messageAckTimeout (a setting like brokerService.nettyServer.messageAckTimeout in broker.conf, defaulting to 5 minutes) might be too aggressive for your consumer’s processing speed, even if the consumer is generally performing well.

    • Diagnosis: Check the broker.conf file on your brokers for the brokerService.nettyServer.messageAckTimeout setting. If it’s set to a value significantly less than your typical message processing time, this could be the culprit. The earliestUnackedMessagePublishTime on the topic stats will be a strong indicator if it’s consistently older than this timeout.
    • Fix: Increase the brokerService.nettyServer.messageAckTimeout in broker.conf and restart the brokers. For example, set it to 600000 (10 minutes).
    • Why it works: Giving the broker a longer window to wait for acknowledgments reduces the chance of it prematurely marking messages as unacked due to temporary consumer processing delays.
  5. Exclusive Subscription Misconfiguration/Design: If you are using an "Exclusive" subscription type and the single consumer assigned to it is experiencing any of the above issues (slow processing, network problems), the entire subscription will appear "busy" and fail. For high-throughput scenarios, Exclusive subscriptions are often not the right choice.

    • Diagnosis: Verify the subscription type in your consumer code or via pulsar-admin subs list <your-topic>. If it’s Exclusive and you expect high throughput or have multiple consumer instances, this is likely the problem.
    • Fix: Change the subscription type to Shared or Failover if your application logic can handle it. For Shared, ensure your consumer code correctly handles message ordering if required (e.g., by using sticky partitions). For Failover, ensure you have multiple consumer instances configured to take over if one fails.
    • Why it works: Shared subscriptions allow multiple consumers to receive messages concurrently, distributing the load and preventing a single slow consumer from blocking the entire subscription. Failover provides redundancy.
  6. Large Message Sizes: If your messages are very large, processing them takes longer, and sending acknowledgments might also consume more bandwidth, indirectly impacting the consumer’s ability to keep up.

    • Diagnosis: Check the average and maximum message sizes for your topic using pulsar-admin topics stats <your-topic>. Compare this to your consumer’s processing capabilities and network bandwidth.
    • Fix: Optimize message payloads. If possible, send references to larger data instead of the data itself. Consider compression if not already used. Ensure your network infrastructure can handle the increased bandwidth demands of large messages.
    • Why it works: Smaller message payloads reduce processing time and network overhead, allowing consumers to acknowledge messages more quickly.

The next error you’ll likely encounter if all these are fixed is a TopicNotFoundException if the topic was deleted, or a BrokerNotAvailableException if the Pulsar cluster itself is experiencing broader availability issues.

Want structured learning?

Take the full Pulsar course →