Your Pulsar consumer is getting stuck in a "busy" state because the broker is unable to acknowledge messages to it fast enough, leading to a backlog and eventual consumer failure.
Common Causes and Fixes
-
Broker Overload: The broker itself is struggling to keep up with the general load, not just your consumer. This could be due to too many topics, too many partitions, high message ingress, or insufficient broker resources (CPU, memory, network).
- Diagnosis: Monitor broker metrics. Look for high CPU utilization (
broker_process_cpu_user_seconds_total), high memory usage (jvm_memory_used_bytes), and network saturation (netty_in_bytes_total,netty_out_bytes_total) on the broker instances serving your topic. Also, checkpulsar-admin topics stats <your-topic>fortotalUnackedMessagesandearliestUnackedMessagePublishTime. IftotalUnackedMessagesis consistently high andearliestUnackedMessagePublishTimeis old, the broker is behind. - Fix: Scale up your Pulsar cluster. This might mean adding more broker nodes, upgrading existing nodes with more CPU/RAM, or improving network bandwidth. For specific topics, consider partitioning them further if they are a disproportionate load.
- Why it works: By distributing the load across more resources or more brokers, the system can process acknowledgments and manage message states more efficiently, preventing the backlog.
- Diagnosis: Monitor broker metrics. Look for high CPU utilization (
-
Consumer Acknowledgment Latency: Your consumer application is too slow to process messages and send acknowledgments back to the broker. The broker waits a certain amount of time (controlled by
brokerService.nettyServer.messageAckTimeoutandbrokerService.messageExpiryTime) before considering a message "unacked" and potentially redelivering it or marking the subscription as busy.- Diagnosis: Enable detailed consumer logs. Look for long durations between receiving a message and sending an acknowledgment for it. On the broker side, observe
pulsar-admin topics stats <your-topic>fortotalUnackedMessages. If this number grows and theearliestUnackedMessagePublishTimeis recent, your consumer is the bottleneck. You can also check the consumer’s processing throughput against the topic’s ingress rate. - Fix: Optimize your consumer application’s message processing logic. This could involve improving deserialization, database writes, external API calls, or parallelizing processing within the consumer instance. Ensure your
ackTimeoutin the consumer client configuration is set appropriately (e.g.,ackTimeoutMs: 300000or 5 minutes) and that your processing consistently finishes within this window. - Why it works: A faster consumer can acknowledge messages before the broker’s timeout, clearing the unacked message count and signaling to the broker that it’s keeping up.
- Diagnosis: Enable detailed consumer logs. Look for long durations between receiving a message and sending an acknowledgment for it. On the broker side, observe
-
Network Issues Between Consumer and Broker: Intermittent network connectivity problems, high latency, or packet loss between your consumer instances and the broker can cause acknowledgment messages to be delayed or dropped.
- Diagnosis: Use network diagnostic tools like
ping,traceroute, andmtrfrom the consumer machine to the broker. Check network interface statistics on both consumer and broker for errors, dropped packets, and high utilization. Monitor Pulsar client-side metrics for connection errors or high latency. - Fix: Address network infrastructure problems. This might involve configuring Quality of Service (QoS) for traffic, increasing bandwidth, or resolving routing issues. Ensure your consumer instances are in the same network region/availability zone as your brokers if possible.
- Why it works: Reliable and low-latency network communication is essential for timely acknowledgments, allowing the broker to correctly track message consumption.
- Diagnosis: Use network diagnostic tools like
-
Broker’s
messageAckTimeoutToo Low: The broker’s configuredmessageAckTimeout(a setting likebrokerService.nettyServer.messageAckTimeoutinbroker.conf, defaulting to 5 minutes) might be too aggressive for your consumer’s processing speed, even if the consumer is generally performing well.- Diagnosis: Check the
broker.conffile on your brokers for thebrokerService.nettyServer.messageAckTimeoutsetting. If it’s set to a value significantly less than your typical message processing time, this could be the culprit. TheearliestUnackedMessagePublishTimeon the topic stats will be a strong indicator if it’s consistently older than this timeout. - Fix: Increase the
brokerService.nettyServer.messageAckTimeoutinbroker.confand restart the brokers. For example, set it to600000(10 minutes). - Why it works: Giving the broker a longer window to wait for acknowledgments reduces the chance of it prematurely marking messages as unacked due to temporary consumer processing delays.
- Diagnosis: Check the
-
Exclusive Subscription Misconfiguration/Design: If you are using an "Exclusive" subscription type and the single consumer assigned to it is experiencing any of the above issues (slow processing, network problems), the entire subscription will appear "busy" and fail. For high-throughput scenarios, Exclusive subscriptions are often not the right choice.
- Diagnosis: Verify the subscription type in your consumer code or via
pulsar-admin subs list <your-topic>. If it’sExclusiveand you expect high throughput or have multiple consumer instances, this is likely the problem. - Fix: Change the subscription type to
SharedorFailoverif your application logic can handle it. ForShared, ensure your consumer code correctly handles message ordering if required (e.g., by using sticky partitions). ForFailover, ensure you have multiple consumer instances configured to take over if one fails. - Why it works:
Sharedsubscriptions allow multiple consumers to receive messages concurrently, distributing the load and preventing a single slow consumer from blocking the entire subscription.Failoverprovides redundancy.
- Diagnosis: Verify the subscription type in your consumer code or via
-
Large Message Sizes: If your messages are very large, processing them takes longer, and sending acknowledgments might also consume more bandwidth, indirectly impacting the consumer’s ability to keep up.
- Diagnosis: Check the average and maximum message sizes for your topic using
pulsar-admin topics stats <your-topic>. Compare this to your consumer’s processing capabilities and network bandwidth. - Fix: Optimize message payloads. If possible, send references to larger data instead of the data itself. Consider compression if not already used. Ensure your network infrastructure can handle the increased bandwidth demands of large messages.
- Why it works: Smaller message payloads reduce processing time and network overhead, allowing consumers to acknowledge messages more quickly.
- Diagnosis: Check the average and maximum message sizes for your topic using
The next error you’ll likely encounter if all these are fixed is a TopicNotFoundException if the topic was deleted, or a BrokerNotAvailableException if the Pulsar cluster itself is experiencing broader availability issues.