The Pulsar broker is failing to write metadata to ZooKeeper because ZooKeeper itself is no longer accepting new connections from the broker, indicating a fundamental connectivity or capacity issue on the ZooKeeper ensemble.
Common Causes and Fixes for Pulsar Broker Metadata Errors
This error typically surfaces when your Pulsar broker, broker-1, attempts to persist critical operational data (like topic ownership, partition assignments, or configuration changes) to your ZooKeeper ensemble, but ZooKeeper rejects the connection. This isn’t just a transient network blip; it’s ZooKeeper signaling it’s either overloaded, misconfigured, or unreachable from the broker’s perspective.
Here’s a breakdown of the most common culprits, ordered by likelihood, along with how to diagnose and fix them:
- ZooKeeper Ensemble Overload (Too Many Connections/Requests):
  - Diagnosis: Check the ZooKeeper server logs (usually `/var/log/zookeeper/zookeeper.log` or similar) on each ZooKeeper node for messages like `Too many connections` or `Exceeded limits`. You can also check the open file descriptor limit on the ZooKeeper servers using `ulimit -n` (check both the soft and hard limits) and compare it to the actual number of open connections using `netstat -anp | grep <zookeeper_port> | wc -l`.
  - Fix:
    - Increase `maxClientCnxns`: Edit your `zoo.cfg` file on each ZooKeeper node. Find or add the line `maxClientCnxns=2000` (the default is 60 concurrent connections per client IP address, which can be too low for busy Pulsar clusters) and restart the ZooKeeper service. This allows the ZooKeeper server to accept more concurrent connections from each client host.
    - Tune `tickTime`, `syncLimit`, `initLimit`: If the ensemble is struggling with internal communication, adjust these parameters in `zoo.cfg`. For example, a `tickTime` of `2000` (milliseconds) combined with a `syncLimit` of `10` (ticks) gives follower nodes up to 20 seconds to sync with the leader. Restart ZooKeeper after changes.
    - Scale ZooKeeper Ensemble: If the load is consistently high, you might need to add more ZooKeeper nodes to your ensemble. Ensembles are normally sized at an odd number of nodes, such as 3, 5, or 7, so that a clear majority can always be formed.
  - Why it works: ZooKeeper caps the number of concurrent connections it accepts from a single client IP address to prevent resource exhaustion. Raising this cap directly addresses the "too many connections" issue. Adjusting timing parameters helps the ensemble maintain quorum and stability under load.
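The timing parameters above are expressed in ticks, so their effect is easier to judge once converted to milliseconds. The following is a minimal sketch of that arithmetic, assuming a `zoo.cfg` made of simple `key=value` lines; the helper names and the sample values (including `initLimit=10`) are illustrative assumptions, not taken from a real deployment.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg-style text, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def timing_summary(settings):
    """Convert the tick-based limits into milliseconds."""
    tick_ms = int(settings["tickTime"])
    return {
        # Longest a follower may lag behind the leader before being dropped.
        "sync_timeout_ms": int(settings["syncLimit"]) * tick_ms,
        # Longest a follower may take to connect and sync at startup.
        "init_timeout_ms": int(settings["initLimit"]) * tick_ms,
    }


# Hypothetical zoo.cfg contents, for illustration only.
SAMPLE_CFG = """
# timing settings
tickTime=2000
initLimit=10
syncLimit=10
"""

summary = timing_summary(parse_zoo_cfg(SAMPLE_CFG))
```

With the sample values, both limits work out to 20,000 ms, which is the "breathing room" followers get before the leader gives up on them.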
- Network Connectivity Issues Between Broker and ZooKeeper:
  - Diagnosis: From the Pulsar broker server (`broker-1`), attempt to connect to each ZooKeeper node on its client port (default is `2181`). Use `telnet <zookeeper_host> 2181` or `nc -vz <zookeeper_host> 2181`. If these fail, you have a network problem. Check firewalls (e.g., `iptables -L -n -v`), security groups, and routing tables between the broker and ZooKeeper nodes. Also, ensure DNS resolution is working correctly for ZooKeeper hostnames.
  - Fix:
    - Open Firewall Ports: On your firewall, allow traffic from the Pulsar broker’s IP address to the ZooKeeper nodes’ IP addresses on port `2181`. For `iptables`, this might look like: `iptables -I INPUT -s <broker_ip> -p tcp --dport 2181 -j ACCEPT`.
    - Correct DNS: If DNS is the issue, update your `/etc/hosts` file on the broker or fix your DNS server configuration.
    - Adjust Network ACLs/Security Groups: If using cloud providers, ensure your network access control lists or security group rules permit traffic from the broker to the ZooKeeper nodes.
  - Why it works: ZooKeeper relies on stable network connections. If the broker cannot reach the ZooKeeper ensemble due to network misconfigurations, it cannot send its metadata updates.
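When several ZooKeeper nodes need to be checked, the `telnet`/`nc` test above can be scripted. The following is a rough sketch using only Python's standard `socket` module; the function names are illustrative assumptions. It only confirms that a TCP connection can be opened, not that ZooKeeper is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_ensemble(servers, timeout=3.0):
    """Map each 'host:port' entry to whether it accepted a TCP connection."""
    results = {}
    for entry in servers:
        host, _, port = entry.rpartition(":")
        results[entry] = can_connect(host, int(port), timeout)
    return results


# Example call, to be run from the broker host (hostnames are placeholders):
# check_ensemble(["zk1.example.com:2181", "zk2.example.com:2181"])
```

A `False` result for a node that is known to be running points toward the firewall, routing, or DNS causes listed above.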
- ZooKeeper Ensemble Not Healthy (No Quorum):
  - Diagnosis: Check the status of each ZooKeeper node. On each ZooKeeper server, run `echo stat | nc localhost 2181` (or your configured client port; on ZooKeeper 3.5 and later, `stat` must be permitted by `4lw.commands.whitelist`). Look for `Mode: follower` or `Mode: leader`. If a majority of your ZooKeeper nodes are not in `leader` or `follower` mode (e.g., they are in `standalone` mode unexpectedly, or stuck in `looking`), the ensemble cannot function. Also, examine ZooKeeper logs for messages indicating nodes are not syncing or have lost connection to the leader.
  - Fix:
    - Restart Unhealthy Nodes: If a specific node is consistently problematic, try restarting its ZooKeeper service.
    - Check `myid` File: Ensure the `myid` file in the ZooKeeper data directory (e.g., `/var/lib/zookeeper/myid`) on each node contains a unique integer matching that node's `X` in the `server.X=...` list in `zoo.cfg`.
    - Verify `zoo.cfg`: Double-check that the `server.X` entries in `zoo.cfg` on all nodes correctly list all ensemble members and their respective ports.
    - Restart Ensemble Gracefully: If the entire ensemble is unhealthy, restart the nodes one at a time and use the `stat` check to confirm that a leader has been elected once a majority is back up. The leader is chosen by election, not by start order.
  - Why it works: ZooKeeper requires a majority (a quorum) of nodes to be operational and in sync to serve requests. If the ensemble is fractured or nodes are not communicating, it cannot maintain its state or accept new writes.
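The quorum rule described above can be written down in a few lines. The following is a minimal sketch, assuming you have already captured the text output of the `stat` command from each node; `parse_mode` and `has_quorum` are illustrative helper names, and the sample output in the comment is abbreviated and hypothetical.

```python
def parse_mode(stat_output):
    """Extract the value of the 'Mode:' line from `stat` output, or None if absent."""
    for line in stat_output.splitlines():
        if line.strip().startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def has_quorum(modes, ensemble_size):
    """True when exactly one leader exists and leader plus followers form a strict majority."""
    leaders = sum(1 for mode in modes if mode == "leader")
    followers = sum(1 for mode in modes if mode == "follower")
    return leaders == 1 and (leaders + followers) > ensemble_size // 2


# Hypothetical, abbreviated `stat` output from one node:
# "Zookeeper version: ...\nMode: follower\nNode count: 5"
```

For a three-node ensemble, one leader plus one follower is enough (2 of 3), while a lone responding node is not, which is why losing two of three nodes stops all writes.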
- ZooKeeper Disk Space or I/O Issues:
  - Diagnosis: Check the available disk space on each ZooKeeper server, especially on the partition where the data directory is located (e.g., `/var/lib/zookeeper`). Use `df -h`. Also, monitor disk I/O performance using tools like `iostat -xz 1` to see if disks are saturated. ZooKeeper writes transaction logs and snapshots to disk, and slow or full disks can cause it to become unresponsive.
  - Fix:
    - Free Up Disk Space: Delete old snapshots or transaction logs if they are no longer needed (ZooKeeper can purge these automatically when `autopurge.snapRetainCount` and `autopurge.purgeInterval` are set in `zoo.cfg`). Alternatively, expand the disk or move the data directory to a larger partition.
    - Improve Disk Performance: Migrate the ZooKeeper data directory to faster storage (e.g., SSDs) or optimize disk I/O settings.
  - Why it works: ZooKeeper’s durability and performance rely heavily on its ability to quickly write transaction logs and snapshots to disk. A full or slow disk can halt its operations.
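The disk space check above can also be automated as a simple threshold alert. The following is a small sketch using Python's standard `shutil.disk_usage`; the 85% threshold and the function names are assumptions chosen for illustration, and the percentage may differ slightly from what `df` reports because of filesystem-reserved blocks.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path, threshold_percent=85.0):
    """True when usage on the filesystem holding `path` meets or exceeds the threshold."""
    return disk_usage_percent(path) >= threshold_percent


# Example call on a ZooKeeper node (the path is the example data directory from the text):
# is_disk_nearly_full("/var/lib/zookeeper")
```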
- Incorrect ZooKeeper Connection String in Pulsar Configuration:
  - Diagnosis: Verify the `zookeeperServers` or `globalZookeeperServers` setting in your Pulsar broker’s configuration file (`broker.conf` or `standalone.conf`). Ensure it lists the correct hostnames/IPs and client ports for all ZooKeeper ensemble members. For example, `zookeeperServers=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181`.
  - Fix: Correct the `zookeeperServers` string in `broker.conf` to accurately reflect your ZooKeeper ensemble. Restart the Pulsar broker after making changes.
  - Why it works: If the broker is trying to connect to the wrong ZooKeeper instances or missing some, it won’t be able to establish a connection to the ensemble needed for metadata operations.
- ZooKeeper Client Port Blocked by Broker’s Firewall:
  - Diagnosis: This is the inverse of the network connectivity item above. Check the firewall rules on the Pulsar broker server itself. Ensure that the broker process has outbound access to the ZooKeeper client port (`2181`) on the ZooKeeper nodes.
  - Fix: Adjust the broker’s firewall rules to allow outbound connections to the ZooKeeper ensemble on port `2181`. For `iptables` on the broker: `iptables -I OUTPUT -d <zookeeper_host> -p tcp --dport 2181 -j ACCEPT`.
  - Why it works: Even if ZooKeeper is configured correctly and network paths are open, the broker’s own outbound firewall rules might prevent it from initiating the connection.
After resolving these issues, you may next encounter a `NoNodeException` as Pulsar attempts to read configuration or topic metadata that was left in an inconsistent state by the prior ZooKeeper unavailability.