A Pulsar topic reports "busy" when the broker can’t keep up with the rate of incoming writes or outgoing reads, which leads to rejected requests and client errors.

Here’s a breakdown of why this happens and how to fix it:

Cause 1: Insufficient Broker Network Bandwidth

The most common culprit is that your brokers simply don’t have enough network capacity to handle the traffic. If the write rate exceeds the available bandwidth, incoming messages will be buffered on the broker, eventually causing it to signal "busy."

Diagnosis: On the broker experiencing the issue, run iftop -i <interface_name> to see real-time network bandwidth usage per connection. Look for sustained high utilization (e.g., >80%) on the interface connected to your producers and consumers.
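To put a number on "sustained high utilization", two readings of the interface byte counters can be turned into a percentage of link capacity. The sketch below is a minimal illustration, not a monitoring tool: it assumes you sample the counters yourself (on Linux, for example, from /sys/class/net/<interface_name>/statistics/rx_bytes and tx_bytes) and that you know the link speed. The function name and the figures are made up for the example.

```python
def link_utilization_pct(bytes_start: int, bytes_end: int,
                         interval_s: float, link_mbps: float) -> float:
    """Percent of link capacity used between two byte-counter samples."""
    if interval_s <= 0 or link_mbps <= 0:
        raise ValueError("interval_s and link_mbps must be positive")
    # Counters are in bytes; link speeds are quoted in bits per second.
    bits_per_s = (bytes_end - bytes_start) * 8 / interval_s
    return 100.0 * bits_per_s / (link_mbps * 1_000_000)


# Illustrative numbers: 5.5 GB moved in 5 seconds on a 10 Gbit/s link.
print(round(link_utilization_pct(0, 5_500_000_000, 5.0, 10_000), 1))  # 88.0
```

Run it separately for the receive and transmit counters, since a broker can saturate one direction long before the other.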

Fix: Increase the network bandwidth of your broker instances. This might involve upgrading your EC2 instance types, choosing a different cloud provider offering higher network throughput, or ensuring your on-premises network hardware can support the load. For example, on AWS EC2 an m5.large, m5.xlarge, and c5.xlarge are all rated "up to 10 Gigabit", so compare their sustained baseline bandwidth rather than the burst figure, or move to a larger size or a network-optimized family such as c5n or m5n.

Why it works: More bandwidth allows the broker to send and receive data faster, reducing the backlog of messages and preventing it from becoming overwhelmed.

Cause 2: Broker CPU Saturation

If your broker’s CPU is maxed out, it won’t be able to process incoming requests or manage its internal queues efficiently. This can manifest as busy errors, especially if you have many topics or partitions on a single broker.

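Diagnosis: Use top or htop on the affected broker and watch the %CPU column for the java process (or whatever process Pulsar is running as). By default top reports %CPU relative to a single core, so a fully busy 8-core broker shows close to 800%, not 100%. Look for sustained utilization above roughly 90% of the machine’s total capacity.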
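Because top reports a process’s %CPU relative to one core by default, the raw figure needs to be divided by the core count before comparing it to a threshold. A tiny sketch of that arithmetic, with a hypothetical function name and made-up readings:

```python
def cpu_saturation_pct(top_cpu_pct: float, cores: int) -> float:
    """Convert top's per-core %CPU for a process into percent of the whole machine."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return top_cpu_pct / cores


# Illustrative reading: the java process shows 720% in top on an 8-core broker.
print(cpu_saturation_pct(720.0, 8))  # 90.0
```

On the broker itself, os.cpu_count() or nproc gives the value to pass as cores.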

Fix: Scale up your broker instances by assigning them more CPU cores. For example, if you have brokers with 4 vCPUs, consider upgrading to instances with 8 or 16 vCPUs. Alternatively, if you have a very high number of topics, consider distributing them across more brokers to reduce the load on any single machine.

Why it works: More CPU power allows the broker to handle more concurrent operations, process message acknowledgements faster, and manage its internal state more effectively.

Cause 3: Under-provisioned BookKeeper Ensemble

Apache BookKeeper is Pulsar’s distributed log storage. If your BookKeeper ensemble (the collection of BookKeeper servers, or "bookies") can’t keep up with the write IOPS or bandwidth required by Pulsar, the brokers will eventually be starved of storage resources and report busy.

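Diagnosis: Check the BookKeeper server logs for I/O errors, slow write operations, or network issues between bookies and brokers. Bookies also export metrics, in Pulsar’s default bookie configuration in Prometheus format on port 8000 at /metrics. Look for consistently high journal and add-entry latency or a growing journal queue, for example bookie_journal_JOURNAL_ADD_ENTRY, bookkeeper_server_ADD_ENTRY_REQUEST, and bookie_journal_JOURNAL_QUEUE_SIZE. Exact metric names vary by version, so confirm them against your bookie’s output.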
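If your bookies expose metrics in Prometheus text format, a few lines of parsing are enough to pull out one metric and watch it over time. The sketch below is an illustration under stated assumptions: the metric name bookie_journal_JOURNAL_QUEUE_SIZE, the journalIndex label, and every sample value are placeholders for the example, so substitute whatever names your BookKeeper version actually emits.

```python
def read_metric(exposition_text: str, metric_name: str) -> list:
    """Return every sample value for metric_name from Prometheus text output."""
    values = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{label="..."} value [timestamp]
            name = line.split("{", 1)[0]
            fields = line[line.rfind("}") + 1:].split()
        else:
            # name value [timestamp]
            name, *fields = line.split()
        if name == metric_name and fields:
            values.append(float(fields[0]))
    return values


# Invented sample; real bookie output will differ.
SAMPLE = """\
# TYPE bookie_journal_JOURNAL_QUEUE_SIZE gauge
bookie_journal_JOURNAL_QUEUE_SIZE 1450
bookie_journal_JOURNAL_QUEUE_SIZE{journalIndex="1"} 12
bookie_SERVER_STATUS 1
"""

print(max(read_metric(SAMPLE, "bookie_journal_JOURNAL_QUEUE_SIZE")))  # 1450.0
```

A single reading says little; what matters is whether the value stays high or keeps growing across samples.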

Fix: Scale out your BookKeeper ensemble by adding more bookie nodes, for example from 3 bookies to 5 or 7. If your current bookies are also under-provisioned in terms of CPU, RAM, or network, upgrade those instances as well. Ensure each bookie has fast SSDs for storage.

Why it works: A larger and more performant BookKeeper ensemble can absorb the write load from Pulsar brokers more effectively, reducing the latency of writes and preventing brokers from waiting on storage.

Cause 4: Too Many Topics/Partitions Per Broker

While Pulsar is designed for massive scale, there’s a limit to how many topics and partitions a single broker can efficiently manage. Each topic/partition consumes resources (memory, file handles, network connections) on the broker.

Diagnosis: Use Pulsar’s admin tools to see what each broker owns. pulsar-admin brokers namespaces <cluster> --url <broker_host:port> lists the namespace bundles a broker is serving, and pulsar-admin broker-stats topics, run against a given broker, dumps statistics for every topic it has loaded. If a single broker is responsible for thousands of topics or partitions while its peers carry far fewer, it may be overloaded.
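One way to spot a skewed assignment is to record which broker owns each topic (pulsar-admin topics lookup <topic_name> prints the owning broker’s URL) and then count topics per broker. The sketch below assumes you have already collected that mapping. The topic names, broker URLs, and the max_topics threshold are hypothetical.

```python
from collections import Counter


def topics_per_broker(owner_by_topic: dict) -> Counter:
    """Count how many topics each broker URL owns."""
    return Counter(owner_by_topic.values())


def overloaded_brokers(owner_by_topic: dict, max_topics: int) -> list:
    """Return brokers that own more than max_topics topics, sorted by URL."""
    counts = topics_per_broker(owner_by_topic)
    return sorted(broker for broker, n in counts.items() if n > max_topics)


# Hypothetical lookup results.
owners = {
    "persistent://public/default/orders-partition-0": "pulsar://broker-1:6650",
    "persistent://public/default/orders-partition-1": "pulsar://broker-1:6650",
    "persistent://public/default/orders-partition-2": "pulsar://broker-1:6650",
    "persistent://public/default/audit": "pulsar://broker-2:6650",
}

print(topics_per_broker(owners).most_common())
# [('pulsar://broker-1:6650', 3), ('pulsar://broker-2:6650', 1)]
print(overloaded_brokers(owners, max_topics=2))
# ['pulsar://broker-1:6650']
```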

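Fix: Distribute your topics more evenly across your brokers. Pulsar assigns topics to brokers in groups called namespace bundles, so rebalancing means moving bundles. Add brokers to take on some of the load, then either let the load manager shed and reassign bundles automatically (loadBalancerEnabled=true in broker.conf) or unload a hot bundle manually with pulsar-admin namespaces unload <tenant/namespace> --bundle <bundle_range> so that it is reassigned.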

Why it works: Spreading the load across more brokers reduces the per-broker overhead, allowing each broker to handle its assigned topics more gracefully.

Cause 5: Inefficient Consumer/Producer Logic

Sometimes, the "busy" error isn’t strictly a broker problem but a symptom of clients overwhelming the system. If producers are sending messages much faster than consumers can process them, or if consumers are acknowledging messages slowly, it can lead to backpressure and busy errors.

Diagnosis: Monitor consumer lag using pulsar-admin topics stats <topic_name> (or pulsar-admin topics partitioned-stats for a partitioned topic). Under each subscription, a high or growing msgBacklog or unackedMessages value indicates consumers are falling behind. For producers, check client-side metrics for send queue depths and error rates.
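Lag checks like this are easy to script. The following is a minimal sketch that assumes you feed it the JSON printed by pulsar-admin topics stats <topic_name>, where each entry under subscriptions carries msgBacklog and unackedMessages counters. The thresholds, subscription names, and sample figures are invented for the example.

```python
import json


def lagging_subscriptions(stats_json: str, max_backlog: int, max_unacked: int) -> dict:
    """Return subscriptions whose backlog or unacked count exceeds the given limits."""
    stats = json.loads(stats_json)
    flagged = {}
    for name, sub in stats.get("subscriptions", {}).items():
        backlog = sub.get("msgBacklog", 0)
        unacked = sub.get("unackedMessages", 0)
        if backlog > max_backlog or unacked > max_unacked:
            flagged[name] = {"msgBacklog": backlog, "unackedMessages": unacked}
    return flagged


# Invented sample, trimmed to the fields the function reads.
sample = json.dumps({
    "subscriptions": {
        "fast-sub": {"msgBacklog": 12, "unackedMessages": 3},
        "slow-sub": {"msgBacklog": 120000, "unackedMessages": 5000},
    }
})

print(lagging_subscriptions(sample, max_backlog=10000, max_unacked=1000))
# {'slow-sub': {'msgBacklog': 120000, 'unackedMessages': 5000}}
```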

Fix: Optimize your consumers to process messages faster. This could involve increasing the number of consumer instances, improving the processing logic within your consumers, or increasing the receiverQueueSize on the consumer side (though be cautious, as this increases memory usage). For producers, ensure they are not sending messages in bursts that exceed broker capacity, or implement client-side rate limiting.
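Client-side rate limiting can be as simple as a token bucket placed in front of the producer’s send call. The sketch below is one possible approach, not something Pulsar ships. The class name, rate, and burst values are illustrative, and the clock is injectable only so the behavior can be checked without sleeping.

```python
import time


class TokenBucket:
    """Allow up to rate_per_s operations per second, with bursts of up to `burst`."""

    def __init__(self, rate_per_s: float, burst: int, clock=time.monotonic):
        if rate_per_s <= 0 or burst < 1:
            raise ValueError("rate_per_s must be positive and burst at least 1")
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_s=500, burst=100)
print(bucket.try_acquire())  # True: the bucket starts full
```

A producer loop would call try_acquire() before each send and sleep briefly or queue the message locally when it returns False, which smooths bursts instead of pushing them onto the broker.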

Why it works: Ensuring consumers can keep up with the producer rate, or that producers respect the broker’s capacity, prevents message queues from building up excessively on the broker.

Cause 6: Under-provisioned Broker Memory

Brokers use memory for caching, connection management, and internal queues. If a broker runs out of available memory, it can lead to increased garbage collection pauses and general performance degradation, eventually causing it to become unresponsive and report busy.

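Diagnosis: Use free -h or vmstat on the broker to check memory. Read the available column rather than free, since Linux uses otherwise idle memory for page cache. If available memory is consistently low (e.g., <10% of total), or if vmstat shows sustained swapping (non-zero si/so), memory is likely an issue. The Pulsar java process will also show a large resident size (RES) in top or htop.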

Fix: Increase the RAM on your broker instances. For example, if you’re using instances with 8GB of RAM, consider upgrading to 16GB or 32GB. Then size the JVM to match: heap (-Xms, -Xmx) and direct memory (-XX:MaxDirectMemorySize) are set through PULSAR_MEM in conf/pulsar_env.sh rather than in broker.conf, and their sum must leave headroom below the available system memory.
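A quick sanity check on JVM sizing is to add the heap and direct memory limits and compare the total against system RAM with some headroom for the OS and page cache. The sketch below assumes sizes written in the JVM’s k/m/g style. The 20% headroom default is an arbitrary illustrative choice, not a Pulsar recommendation, and the function names are hypothetical.

```python
import re

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Parse a JVM-style size such as '512m' or '8g' into bytes."""
    match = re.fullmatch(r"(\d+)([kmg])", text.strip().lower())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def jvm_fits(heap: str, direct: str, system_ram: str,
             headroom_fraction: float = 0.2) -> bool:
    """True if heap + direct memory fit in system RAM minus the reserved headroom."""
    budget = parse_size(system_ram) * (1 - headroom_fraction)
    return parse_size(heap) + parse_size(direct) <= budget


# -Xmx8g with -XX:MaxDirectMemorySize=8g on a 16 GB machine leaves no headroom.
print(jvm_fits("8g", "8g", "16g"))  # False
print(jvm_fits("6g", "6g", "16g"))  # True
```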

Why it works: Sufficient memory allows the broker to cache data effectively, manage connections without swapping, and avoid long garbage collection pauses that disrupt request processing.

The Next Error You’ll Hit

After fixing busy topic errors, the next failures you may run into are metadata and lookup errors (for example, PulsarClientException.BrokerMetadataException in the Java client) if your ZooKeeper ensemble is unhealthy or unreachable by the brokers.
