Pulsar’s ZooKeeper ensemble is choking on metadata operations, preventing the system from scaling.
Here’s why and how to fix it.
The Problem: Pulsar relies on ZooKeeper for critical metadata management: broker registration, topic discovery, topic metadata, schema information, and more. When these operations become too frequent or too heavy, ZooKeeper becomes the bottleneck, leading to high latency, missed heartbeats, and ultimately, Pulsar service degradation.
Common Causes and Fixes:
Excessive Topic Creation/Deletion: Rapidly creating and deleting topics, especially in large numbers, bombards ZooKeeper with `create` and `delete` operations.
- Diagnosis: Monitor the output of ZooKeeper’s `mntr` four-letter-word command, specifically `zk_num_alive_connections` and `zk_outstanding_requests`. High values that track topic churn point to this issue. Also check the Pulsar broker logs for errors during topic creation or deletion.
- Fix: Implement a topic lifecycle management strategy. Avoid creating ephemeral topics for short-lived tasks. Use the `pulsar-admin` CLI (for example, `pulsar-admin topics list <tenant>/<namespace>`) or the Pulsar admin client to list topics and observe creation/deletion rates. Where possible, reduce the frequency of topic lifecycle operations. Consider using partitioned topics if many similar logical topics are needed, as this reduces the metadata overhead per topic.
- Why it works: Fewer distinct metadata entries mean fewer ZooKeeper operations.
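One way to observe creation/deletion rates is to take two topic listings a fixed interval apart and diff them. The sketch below assumes `pulsar-admin topics list <tenant>/<namespace>` prints one topic name per line; the helper names (`snapshot_topics`, `parse_topic_list`, `topic_churn`) are hypothetical. A topic created and deleted between two snapshots is invisible to this method, so treat the result as a lower bound.

```python
import subprocess


def snapshot_topics(namespace: str) -> set:
    """Run `pulsar-admin topics list <tenant>/<namespace>` and return the topic names.

    Assumes pulsar-admin is on PATH and prints one topic per line.
    """
    result = subprocess.run(
        ["pulsar-admin", "topics", "list", namespace],
        capture_output=True, text=True, check=True,
    )
    return parse_topic_list(result.stdout)


def parse_topic_list(text: str) -> set:
    """Parse a one-topic-per-line listing, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def topic_churn(before: set, after: set, interval_seconds: float) -> dict:
    """Compare two listings taken interval_seconds apart and report per-minute rates."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    created = after - before
    deleted = before - after
    return {
        "created": sorted(created),
        "deleted": sorted(deleted),
        "created_per_min": len(created) * 60.0 / interval_seconds,
        "deleted_per_min": len(deleted) * 60.0 / interval_seconds,
    }
```

In practice you would call `snapshot_topics("my-tenant/my-namespace")`, wait (for example, 60 seconds), take a second snapshot, and pass both to `topic_churn`. Correlating the resulting rates with `zk_outstanding_requests` over the same window is what links churn to ZooKeeper pressure.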
Large Numbers of Topics/Partitions: Even a static but large set of topics and partitions can strain ZooKeeper. Each topic and each partition is a node (znode) in ZooKeeper’s hierarchy, and Pulsar needs to track all of them.
- Diagnosis: Count the topic znodes with `zkCli.sh`, for example `ls /managed-ledgers/<tenant>/<namespace>/persistent` (substitute your tenant and namespace). A very large count (tens of thousands or more) is a red flag.
- Fix: Review your topic naming conventions and partitioning strategy, and consolidate where possible. For very large topic counts, check whether every topic is actually in use and necessary. For partitioned topics, make sure you aren’t over-partitioning.
- Why it works: It reduces the total number of nodes ZooKeeper must manage and traverse.
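To see where the count comes from, it helps to group partition-level names back into their logical topics. The sketch below relies on Pulsar’s `-partition-<N>` suffix for individual partitions. The thresholds (`max_total`, `max_partitions`) are hypothetical placeholders to tune for your cluster, not Pulsar recommendations.

```python
import re
from collections import Counter

# Individual partitions are named "<topic>-partition-<N>".
PARTITION_SUFFIX = re.compile(r"^(?P<base>.+)-partition-(?P<index>\d+)$")


def partitions_per_topic(topic_names) -> Counter:
    """Map each logical topic to the number of topic names (partitions) it contributes."""
    counts = Counter()
    for name in topic_names:
        match = PARTITION_SUFFIX.match(name)
        base = match.group("base") if match else name  # non-partitioned topic counts once
        counts[base] += 1
    return counts


def summarize(topic_names, max_total: int = 10_000, max_partitions: int = 64) -> dict:
    """Flag a namespace whose total topic count or per-topic partition count looks too high.

    Both thresholds are illustrative assumptions.
    """
    counts = partitions_per_topic(topic_names)
    total = sum(counts.values())
    return {
        "logical_topics": len(counts),
        "total_topics": total,
        "exceeds_total": total > max_total,
        "over_partitioned": {t: n for t, n in counts.items() if n > max_partitions},
    }
```

Feeding this a listing of one namespace shows whether the pressure comes from many distinct topics or from a few heavily partitioned ones, which points to different fixes (consolidation versus fewer partitions).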
ZooKeeper Network Latency/Instability: High latency or packet loss between Pulsar brokers/clients and ZooKeeper, or among the ZooKeeper nodes themselves, drastically increases operation times.
- Diagnosis: Run `ping` and `mtr` from the brokers to each ZooKeeper node. On every ZooKeeper node, check `zk_server_state` in the `mntr` output to confirm there is exactly one `leader` and the expected `follower`s; on the leader, `zk_synced_followers` shows how many followers are in sync. Comparing the `Zxid` reported by `srvr` across nodes helps spot a lagging follower. Check network interface error counters (`ifconfig` or `ip -s link`).
- Fix: Place ZooKeeper nodes and Pulsar brokers in the same low-latency network segment. Tune network configuration (for example, MTU sizes). On cloud providers, make sure your VPC/subnet routing is efficient.
- Why it works: Faster network communication means ZooKeeper operations complete sooner, reducing queue build-up.
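The `srvr` comparison can be scripted. A zxid is a 64-bit value whose high 32 bits are the leader epoch and whose low 32 bits are a per-epoch transaction counter, so counters are only comparable within the same epoch. This sketch assumes you have already collected each node’s `srvr` reply as text (for example, `echo srvr | nc <host> 2181`, with `srvr` allowed by `4lw.commands.whitelist`); the function names are hypothetical. Replies are taken at slightly different moments, so small lags are normal.

```python
import re

ZXID_RE = re.compile(r"^Zxid:\s*(0x[0-9a-fA-F]+)", re.MULTILINE)
MODE_RE = re.compile(r"^Mode:\s*(\S+)", re.MULTILINE)


def parse_srvr(text: str) -> dict:
    """Extract the server mode and split the zxid into epoch and counter."""
    zxid_match = ZXID_RE.search(text)
    mode_match = MODE_RE.search(text)
    if not zxid_match or not mode_match:
        raise ValueError("unexpected srvr output")
    zxid = int(zxid_match.group(1), 16)
    return {"mode": mode_match.group(1), "epoch": zxid >> 32, "counter": zxid & 0xFFFFFFFF}


def follower_lag(replies: dict) -> dict:
    """Map each follower host to how many transactions it trails the leader.

    None means the follower is in a different epoch, so counters are not comparable.
    """
    parsed = {host: parse_srvr(text) for host, text in replies.items()}
    leaders = [p for p in parsed.values() if p["mode"] == "leader"]
    if len(leaders) != 1:
        raise ValueError(f"expected exactly one leader, found {len(leaders)}")
    leader = leaders[0]
    lag = {}
    for host, p in parsed.items():
        if p["mode"] != "follower":
            continue  # skip the leader and any observers
        if p["epoch"] != leader["epoch"]:
            lag[host] = None
        else:
            lag[host] = leader["counter"] - p["counter"]
    return lag
```

A follower whose lag keeps growing across repeated samples, or that stays in an old epoch, is the node whose network path (or disk) deserves attention first.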
ZooKeeper Configuration Issues (Java Heap, File Descriptors): An undersized Java heap or too low a file descriptor limit can cause ZooKeeper to slow down or crash.
- Diagnosis: Check ZooKeeper’s garbage collection logs for frequent or long pauses. Use `jstat -gcutil <pid>` to observe heap usage and GC activity. Check the file descriptor limit that actually applies to the running ZooKeeper process (on Linux, the "Max open files" row of `/proc/<pid>/limits`); `ulimit -n` only reports the limit of your current shell.
- Fix: Increase the ZooKeeper Java heap size (for example, `JVMFLAGS="-Xms4g -Xmx8g"` in `conf/java.env` for a standalone ZooKeeper install; the exact file depends on how ZooKeeper is packaged and launched). Raise the open file descriptor limit for the ZooKeeper user (for example, `ulimit -n 65536`, or the equivalent setting in your service manager).
- Why it works: ZooKeeper keeps its entire data tree in memory, so the heap must hold it with room to spare, and that headroom lets GC run less often. More file descriptors let it handle more concurrent connections and open files (such as transaction logs).
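Checking the descriptor limit of the live process, rather than of your shell, can be done by reading `/proc` directly. This is a Linux-only sketch; `fd_usage` and `parse_max_open_files` are hypothetical helper names, and inspecting another user’s process typically requires running as that user or as root.

```python
import os


def _to_limit(value: str):
    """Convert a /proc limits value to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text: str):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row looks like: "Max open files   1024   4096   files"
            fields = line.split()
            return _to_limit(fields[3]), _to_limit(fields[4])
    raise ValueError("no 'Max open files' row found")


def fd_usage(pid: int) -> dict:
    """Count the process's open descriptors and compare them with its soft limit."""
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_open_files(f.read())
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    fraction = None if soft is None else open_fds / soft
    return {"open_fds": open_fds, "soft_limit": soft, "hard_limit": hard, "fraction_used": fraction}
```

Run against the ZooKeeper PID, a `fraction_used` creeping toward 1.0 during connection spikes is a sign the limit, not ZooKeeper itself, is what fails first.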
ZooKeeper Transaction Log Disk I/O: Slow disk I/O on the device hosting ZooKeeper’s transaction log (`dataLogDir`, or `dataDir` when `dataLogDir` is not set) directly hurts write performance. This matters because ZooKeeper syncs each write to the log before acknowledging it, which is how it upholds its consistency guarantees.
- Diagnosis: Monitor disk I/O metrics (IOPS, throughput, latency) on the ZooKeeper data directories using tools like `iostat` or your cloud provider’s monitoring. High latency or low IOPS indicates a problem.
- Fix: Use fast SSDs or NVMe drives for the transaction log. Configure the filesystem for performance (for example, the `noatime` mount option). If possible, put `dataDir` (snapshots) and `dataLogDir` (transaction log) on separate physical devices so snapshot writes don’t compete with log appends.
- Why it works: Faster writes to the transaction log allow ZooKeeper to acknowledge operations more quickly.
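Because every acknowledged write waits on a sync of the log, fsync latency on the log device is the number that matters most. The sketch below approximates that access pattern by appending small records to a temporary file and timing `os.fsync` after each one; it is not how ZooKeeper itself writes its log (ZooKeeper preallocates log files, for instance), and the record size and sample count are arbitrary assumptions. Point it at a directory on the same device as `dataLogDir`, ideally outside peak hours, since it adds I/O of its own.

```python
import math
import os
import statistics
import tempfile
import time


def fsync_latencies_ms(directory: str, samples: int = 50, record_size: int = 512) -> list:
    """Append small records to a temp file in `directory`, fsync after each, and time the fsync."""
    payload = b"x" * record_size
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    latencies = []
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)  # leave nothing behind in the data directory
    return latencies


def summarize(latencies: list) -> dict:
    """Median, nearest-rank p99, and max of the measured fsync latencies (ms)."""
    ordered = sorted(latencies)
    p99_index = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "p50_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }
```

Comparing `summarize(fsync_latencies_ms("/path/on/log/device"))` before and after moving the log to a dedicated SSD gives a direct, if rough, view of the improvement.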
Too Many Watchers/Subscriptions: This is less common than the causes above, but if clients or Pulsar components set an excessive number of watches on ZooKeeper nodes, those watches consume ZooKeeper resources.
- Diagnosis: Monitor `zk_watch_count` and `zk_avg_latency` in the `mntr` output. A high watch count combined with high latency can indicate watcher overhead. The `wchs` command summarizes how many connections are watching how many paths.
- Fix: Review Pulsar’s configuration and any custom clients interacting with ZooKeeper. Make sure watches are used only where necessary and are properly cleaned up.
- Why it works: It reduces the work ZooKeeper does to track watches and notify clients about changes.
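The `mntr` checks used throughout this section can be automated with a raw socket, since ZooKeeper answers a four-letter word on the client port and then closes the connection. On ZooKeeper 3.5 and later, `mntr` must be allowed via `4lw.commands.whitelist`. The helper names and the thresholds in `watch_report` are hypothetical; `zk_avg_latency` is reported in milliseconds.

```python
import socket


def send_four_letter_word(host: str, port: int, cmd: str, timeout: float = 5.0) -> str:
    """Send a four-letter-word command (e.g. 'mntr') and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text: str) -> dict:
    """Parse tab-separated 'key<TAB>value' lines, converting numeric values."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue
        value = value.strip()
        try:
            metrics[key] = int(value)
        except ValueError:
            try:
                metrics[key] = float(value)
            except ValueError:
                metrics[key] = value  # e.g. zk_version, zk_server_state
    return metrics


def watch_report(metrics: dict, max_watches: int = 100_000, max_avg_latency_ms: float = 50.0) -> dict:
    """Flag possible watcher overhead: many watches AND elevated average latency."""
    watches = metrics.get("zk_watch_count", 0)
    latency = metrics.get("zk_avg_latency", 0)
    return {
        "watch_count": watches,
        "avg_latency_ms": latency,
        "suspect_watch_overhead": watches > max_watches and latency > max_avg_latency_ms,
    }
```

A typical call is `watch_report(parse_mntr(send_four_letter_word("zk1.example.internal", 2181, "mntr")))`; the same parsed dict also exposes `zk_outstanding_requests`, `zk_num_alive_connections`, and `zk_server_state` for the earlier checks.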
The Next Hurdle: Once ZooKeeper is stable, you might run into resource contention on the Pulsar brokers, since they can now push through more requests.