The Pulsar dispatcher blocks when a broker is unable to send data to another broker, which leads to a growing backlog of messages and, eventually, timeouts.

Cause 1: Network Congestion or Latency

Diagnosis: Use ping and traceroute (or mtr) between the affected brokers to check for high latency or packet loss.

ping <other_broker_ip>
traceroute <other_broker_ip>

Fix: Identify the network segment causing the problem and work with your network team to resolve it. This might involve increasing bandwidth, optimizing routing, or addressing hardware problems on switches or routers.

Why it works: Reduces the time it takes for packets to travel between brokers, allowing the dispatcher to send and receive data within its expected timeouts.
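To compare links between brokers quickly, the ping summary can be reduced to two numbers. The following is a minimal sketch that assumes the Linux iputils output format; parse_ping_summary is a hypothetical helper name, not part of any Pulsar tooling.

```shell
# Read the output of `ping -c N <host>` on stdin and print
# "<packet_loss_percent> <avg_rtt_ms>" (NA when a value is not found).
# Assumes the Linux iputils summary format.
parse_ping_summary() {
  awk '
    /packet loss/ {
      # The loss figure is the field that ends in "%", e.g. "0%".
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) { sub(/%/, "", $i); loss = $i }
    }
    /^(rtt|round-trip)/ {
      # Line looks like: rtt min/avg/max/mdev = 0.4/0.5/0.9/0.1 ms
      split($0, halves, " = ")
      split(halves[2], parts, "/")
      avg = parts[2]
    }
    END { print (loss == "" ? "NA" : loss), (avg == "" ? "NA" : avg) }
  '
}

# Usage sketch (placeholder address, not run here):
#   ping -c 20 <other_broker_ip> | parse_ping_summary
```

Any non-zero loss, or an average round-trip time well above what you normally see inside the same data center, points back at the network as the likely cause.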

Cause 2: Insufficient Broker Resources (CPU/Memory)

Diagnosis: Monitor broker CPU and memory usage using tools like top, htop, or Prometheus/Grafana. Look for consistently high CPU utilization (>80%) or memory pressure leading to excessive swapping.

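# The broker JVM is started with the main class PulsarBrokerStarter; -d, joins multiple PIDs for top
top -H -p "$(pgrep -d, -f org.apache.pulsar.PulsarBrokerStarter)"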

Fix: Increase the resources allocated to the broker pods or VMs. This could involve raising the CPU and memory requests and limits in Kubernetes, or upgrading the underlying hardware.

Why it works: Provides the broker’s Java Virtual Machine (JVM) and its threads, including those in the dispatcher, with enough processing power and memory to operate efficiently without being throttled by the operating system.
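When top -H shows one or two threads consuming most of the CPU, it helps to find out which JVM threads they are. A minimal sketch, assuming a JDK whose thread dumps print the native thread ID as hexadecimal nid=0x... (newer JDKs may print it in decimal, in which case no conversion is needed); tid_to_nid is a hypothetical helper name.

```shell
# Convert a decimal Linux thread ID (the PID column of `top -H`) into the
# hexadecimal "nid" form used in JVM thread dumps.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Usage sketch (placeholders, requires jstack from a JDK, not run here):
#   jstack <broker_pid> | grep -A 15 "nid=$(tid_to_nid <hot_thread_id>)"
```

If the hot threads turn out to be garbage-collection threads rather than dispatcher or I/O threads, the bottleneck is more likely heap size than CPU allocation.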

Cause 3: Slow Disk I/O on Broker (for BookKeeper)

Diagnosis: Monitor disk I/O performance on the nodes where BookKeeper is running (often co-located with the brokers). Tools like iostat can reveal high await times or high %util.

iostat -xz 5

Fix: Upgrade to faster storage (e.g., SSDs), ensure sufficient IOPS are provisioned for the underlying storage, or distribute the load across more BookKeeper nodes.

Why it works: BookKeeper relies on fast disk writes for durability. If the disks are slow, write operations take longer, causing backpressure that propagates up to the Pulsar broker’s dispatcher.
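The iostat output can be filtered down to only the saturated devices. This is a minimal sketch: busy_devices is a hypothetical helper, the 80% default threshold is an assumption to tune for your hardware, and the %util column is located from the header line because its position differs between sysstat versions.

```shell
# Read `iostat -x` output on stdin and print "<device> <%util>" for every
# device whose %util is above the threshold (first argument, default 80).
busy_devices() {
  awk -v limit="${1:-80}" '
    NF == 0   { col = 0; next }   # blank line ends the current device table
    /^Device/ {                   # header line: find the %util column
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}

# Usage sketch (not run here; LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C iostat -xz 5 3 | busy_devices 80
```

A device that shows up in every sample, rather than in a single burst, is the one worth moving journals or ledgers away from.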

Cause 4: BookKeeper Ensemble Size Too Small or Unhealthy

Diagnosis: Check the BookKeeper cluster status and the ensemble size configured for topics. Use bookkeeper shell to inspect ledger details.

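# On a BookKeeper node: list the bookies that are writable, then those that are read-only
/usr/bin/bookkeeper shell listbookies -rw
/usr/bin/bookkeeper shell listbookies -ro
# Then inspect a specific ledger if you have the ID
/usr/bin/bookkeeper shell ledgermetadata -ledgerid <ledger_id>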

Look for BookKeeper nodes that are offline (missing from the writable list) or in RO (read-only) state. Also check the replication settings: the ensemble size, write quorum (writeQuorum), and ack quorum (ackQuorum) might be too low to tolerate node failures.

Fix: Ensure all BookKeeper nodes are healthy and writable rather than read-only. If the replication settings are too small, raise the ensemble size, write quorum, and ack quorum in the namespace persistence policy. The new values typically take effect for ledgers created after the change, so topics do not need to be re-created. For example, if writeQuorum is 3 and ackQuorum is 2, each entry is sent to three BookKeeper nodes and acknowledged once two of them have persisted it. One slow or failed node therefore need not stall writes, but if two of the three are unavailable the ack quorum cannot be met.

Why it works: A healthy and sufficiently sized BookKeeper ensemble ensures that writes can be acknowledged quickly even in the presence of transient node failures, preventing write operations from blocking.
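The quorum arithmetic above can be checked before changing any policy. This is a minimal sketch: check_quorum is a hypothetical helper, and the pulsar-admin flags in the usage comment are the ones I believe the namespaces set-persistence command accepts, so verify them against your Pulsar version.

```shell
# check_quorum <ensemble> <write_quorum> <ack_quorum>
# Validates ensemble >= write quorum >= ack quorum >= 1 and reports how many
# replicas of an entry may be missing while the ack quorum can still be met.
check_quorum() {
  if [ "$3" -lt 1 ] || [ "$2" -lt "$3" ] || [ "$1" -lt "$2" ]; then
    echo "invalid: need ensemble >= write quorum >= ack quorum >= 1" >&2
    return 1
  fi
  echo "ok: acks can proceed with $(( $2 - $3 )) of $2 replicas missing"
}

# Usage sketch (placeholder namespace, not run here):
#   check_quorum 3 3 2 && pulsar-admin namespaces set-persistence \
#     --bookkeeper-ensemble 3 --bookkeeper-write-quorum 3 \
#     --bookkeeper-ack-quorum 2 --ml-mark-delete-max-rate 0 \
#     <tenant>/<namespace>
```

Setting the ack quorum equal to the write quorum leaves no slack: every replica must respond before a write is acknowledged, so a single slow node delays every entry.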

Cause 5: ZooKeeper Performance Issues

Diagnosis: Monitor ZooKeeper ensemble health and performance. Check the ZooKeeper logs for slow transactions or connection issues. Use netcat to send ZooKeeper’s four-letter-word commands (e.g., mntr, stat). On ZooKeeper 3.5 and later, these commands must be enabled through the 4lw.commands.whitelist setting.

echo mntr | nc <zookeeper_ip> 2181
echo stat | nc <zookeeper_ip> 2181

Look for high zk_avg_latency or zk_outstanding_requests.

Fix: Optimize ZooKeeper performance by ensuring it runs on dedicated, fast storage, has sufficient memory, and that its client connections (from Pulsar brokers and BookKeeper) are healthy. Avoid running ZooKeeper on the same nodes as Pulsar or BookKeeper if possible.

Why it works: Pulsar brokers and BookKeeper nodes constantly communicate with ZooKeeper for metadata management and coordination. Slow ZooKeeper operations can stall these critical background tasks, leading to dispatcher blocking.
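The two mntr values mentioned above can be checked automatically. A minimal sketch: zk_health is a hypothetical helper, and the default thresholds (10 ms average latency, 10 outstanding requests) are assumptions, not recommended limits.

```shell
# Read `mntr` output on stdin and warn when zk_avg_latency or
# zk_outstanding_requests exceed the given thresholds.
# zk_health [max_avg_latency_ms] [max_outstanding_requests]
zk_health() {
  awk -v lat="${1:-10}" -v out="${2:-10}" '
    $1 == "zk_avg_latency" && ($2 + 0) > (lat + 0) {
      print "WARN avg latency " $2 " ms"; bad = 1
    }
    $1 == "zk_outstanding_requests" && ($2 + 0) > (out + 0) {
      print "WARN outstanding requests " $2; bad = 1
    }
    END { if (!bad) print "OK" }
  '
}

# Usage sketch (placeholder address, not run here):
#   echo mntr | nc <zookeeper_ip> 2181 | zk_health 10 10
```

Run it against every member of the ensemble, since a single slow follower or an overloaded leader can look healthy from the other nodes.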

Cause 6: Backlog on Producer or Consumer Side

Diagnosis: Check topic backlog metrics in Pulsar. A continuously growing backlog indicates that producers are sending messages faster than consumers can process them, or that the dispatcher is unable to deliver messages to consumers due to consumer-side issues.

# Using Pulsar Admin CLI
pulsar-admin topics stats <persistent://tenant/namespace/topic>

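Look at backlogSize for the topic and, for each subscription, at msgBacklog, unackedMessages, and blockedSubscriptionOnUnackedMsgs. Exact field names can vary between Pulsar versions, so check them against the output of your own cluster.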

Fix: Scale up consumers to handle the message rate, optimize consumer processing logic, or investigate why consumers are slow. If the issue is on the producer side, ensure producers are not overwhelming the system and that their network connectivity is stable.

Why it works: If consumers cannot keep up, messages accumulate in the brokers’ queues. The dispatcher, trying to push these messages, will eventually become saturated and block.
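Whether a backlog is "continuously growing" is easier to judge from two samples than from one. A minimal sketch: backlog_trend is a hypothetical helper, and the jq filter in the usage comment assumes jq is installed and that the subscription's backlog is reported as msgBacklog in your version's stats output.

```shell
# backlog_trend <backlog_before> <backlog_after> <seconds_between_samples>
# Classifies the change between two backlog samples (integer division).
backlog_trend() {
  delta=$(( $2 - $1 ))
  if [ "$delta" -gt 0 ]; then
    echo "growing by $(( delta / $3 )) msg/s"
  elif [ "$delta" -lt 0 ]; then
    echo "draining by $(( -delta / $3 )) msg/s"
  else
    echo "steady"
  fi
}

# Usage sketch (placeholders, not run here): take two samples 30 s apart with
#   pulsar-admin topics stats <topic> | jq '.subscriptions["<subscription>"].msgBacklog'
# and pass both values: backlog_trend <first> <second> 30
```

A backlog that grows at roughly the publish rate suggests consumers are not receiving anything at all, while slower growth suggests they are simply under-provisioned.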

The next error you’ll likely hit is a "Topic is unavailable" error, as brokers start to fail health checks once the dispatcher becomes completely unresponsive.
