The Pulsar broker is failing to write metadata to ZooKeeper because ZooKeeper itself is no longer accepting new connections from the broker, indicating a fundamental connectivity or capacity issue on the ZooKeeper ensemble.

Common Causes and Fixes for Pulsar Broker Metadata Errors

This error typically surfaces when your Pulsar broker, broker-1, attempts to persist critical operational data (like topic ownership, partition assignments, or configuration changes) to your ZooKeeper ensemble, but ZooKeeper rejects the connection. This isn’t just a transient network blip; it’s ZooKeeper signaling it’s either overloaded, misconfigured, or unreachable from the broker’s perspective.

Here’s a breakdown of the most common culprits, ordered by likelihood, along with how to diagnose and fix them:

  1. ZooKeeper Ensemble Overload (Too Many Connections/Requests):

    • Diagnosis: Check the ZooKeeper server logs (usually /var/log/zookeeper/zookeeper.log or similar) on each ZooKeeper node for warnings such as Too many connections from <client_ip>, which ZooKeeper logs when a client exceeds its connection limit. Also compare the file descriptor limit of the running ZooKeeper process with the actual number of open connections. cat /proc/<zookeeper_pid>/limits shows both the soft and hard limits of the process (ulimit -n only reports the limit of your current shell), and netstat -anp | grep <zookeeper_port> | wc -l counts the connections.
    • Fix:
      • Increase maxClientCnxns: Edit your zoo.cfg file (conf/zookeeper.conf in a Pulsar distribution) on each ZooKeeper node. Find or add the line maxClientCnxns=2000 and restart the ZooKeeper service. The default is 60, and the limit applies per client IP address rather than to the server as a whole, so it is easy to hit when many processes connect from the same host or from behind one NAT address.
      • Tune tickTime, syncLimit, initLimit: If the ensemble is struggling with internal communication, adjust these parameters in zoo.cfg. tickTime is the base time unit in milliseconds (commonly 2000), while syncLimit and initLimit are expressed in ticks. For example, raising syncLimit from 5 to 10 gives follower nodes more breathing room to stay in sync with the leader, and raising initLimit gives them longer to connect and sync at startup. Restart ZooKeeper after changes.
      • Scale ZooKeeper Ensemble: If the load is consistently high, you might need more ZooKeeper capacity. Keep the ensemble size odd; 3, 5, or 7 nodes are typical. Extra voting members add read and connection capacity but make every write wait on a larger quorum, so for write-heavy load, faster hardware or non-voting observer nodes may help more.
    • Why it works: ZooKeeper caps the number of concurrent connections it accepts from each client IP address to prevent resource exhaustion. Raising that cap directly addresses the "too many connections" issue. Adjusting timing parameters helps the ensemble maintain quorum and stability under load.
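Because the connection limit is enforced per client IP, a per-IP breakdown can be more telling than a single total. The following is a minimal sketch, assuming Linux netstat -an style output passed in as text. The function names and the 80% warning threshold are illustrative choices, not part of ZooKeeper or Pulsar.

```python
from collections import Counter


def connections_per_client(netstat_output, zk_port=2181):
    """Count ESTABLISHED connections to the ZooKeeper client port, grouped by remote IP.

    Assumes Linux `netstat -an` style lines:
    proto recv-q send-q local_address foreign_address state
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # Split on the last colon so the port is separated from the address.
        if local.rsplit(":", 1)[-1] != str(zk_port):
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts


def clients_near_limit(counts, max_client_cnxns=60, threshold=0.8):
    """Return client IPs whose count is at or above threshold * max_client_cnxns."""
    return sorted(ip for ip, n in counts.items() if n >= max_client_cnxns * threshold)
```

On a ZooKeeper node you could feed it the output of subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout and pass the maxClientCnxns value from your own zoo.cfg.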
  2. Network Connectivity Issues Between Broker and ZooKeeper:

    • Diagnosis: From the Pulsar broker server (broker-1), attempt to connect to each ZooKeeper node on its client port (default is 2181). Use telnet <zookeeper_host> 2181 or nc -vz <zookeeper_host> 2181. If these fail, you have a network problem. Check firewalls (e.g., iptables -L -n -v), security groups, and routing tables between the broker and ZooKeeper nodes. Also, ensure DNS resolution is working correctly for ZooKeeper hostnames.
    • Fix:
      • Open Firewall Ports: On the ZooKeeper nodes (and on any firewall in between), allow traffic from the Pulsar broker’s IP address on port 2181. For iptables on a ZooKeeper node, this might look like: iptables -I INPUT -s <broker_ip> -p tcp --dport 2181 -j ACCEPT.
      • Correct DNS: If DNS is the issue, update your /etc/hosts file on the broker or fix your DNS server configuration.
      • Adjust Network ACLs/Security Groups: If using cloud providers, ensure your network access control lists or security group rules permit traffic from the broker to the ZooKeeper nodes.
    • Why it works: ZooKeeper relies on stable network connections. If the broker cannot reach the ZooKeeper ensemble due to network misconfigurations, it cannot send its metadata updates.
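The telnet/nc check above can be repeated across every ensemble member in one pass. This is a rough sketch using only Python’s standard socket module; the function name is made up for illustration, and it only proves that a TCP connection can be opened, not that ZooKeeper is healthy behind the port.

```python
import socket


def check_zk_reachable(servers, timeout=3.0):
    """Try a TCP connect to each (host, port) pair.

    Returns a dict mapping "host:port" to None on success,
    or to a short error description on failure.
    """
    results = {}
    for host, port in servers:
        key = f"{host}:{port}"
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[key] = None
        except OSError as exc:
            # OSError covers DNS failures, refused connections, and timeouts.
            results[key] = f"{type(exc).__name__}: {exc}"
    return results
```

Run from the broker host, a call such as check_zk_reachable([("zk1.example.com", 2181), ("zk2.example.com", 2181)]) separates DNS failures (a gaierror in the message) from refused or timed-out connections, which usually point at firewalls or a stopped service.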
  3. ZooKeeper Ensemble Not Healthy (No Quorum):

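    • Diagnosis: Check the status of each ZooKeeper node. On each ZooKeeper server, run echo srvr | nc localhost 2181 (or your configured client port). On recent ZooKeeper versions, other four-letter commands such as stat must first be enabled through 4lw.commands.whitelist in zoo.cfg, while srvr is allowed by default. Look for Mode: follower or Mode: leader; a healthy ensemble has exactly one leader. If a majority of your ZooKeeper nodes are not in leader or follower mode (e.g., they report standalone unexpectedly, or answer This ZooKeeper instance is not currently serving requests because they are still looking for a leader), the ensemble cannot function. Also, examine ZooKeeper logs for messages indicating nodes are not syncing or have lost connection to the leader.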
    • Fix:
      • Restart Unhealthy Nodes: If a specific node is consistently problematic, try restarting its ZooKeeper service.
      • Check myid File: Ensure the myid file in the ZooKeeper data directory (e.g., /var/lib/zookeeper/myid) on each node contains a unique integer that matches the X in that node’s server.X=... line in zoo.cfg.
      • Verify zoo.cfg: Double-check that the server.X entries in zoo.cfg on all nodes list every ensemble member with its peer and leader-election ports (for example, server.1=zk1.example.com:2888:3888).
      • Restart Ensemble Gracefully: If the ensemble still has a leader, restart the nodes one at a time, followers first and the leader last, waiting for each node to rejoin before moving on. If no node is leading, the start order does not matter: a leader is elected as soon as a majority of nodes are up.
    • Why it works: ZooKeeper requires a majority (a quorum) of nodes to be operational and in sync to serve requests. If the ensemble is fractured or nodes are not communicating, it cannot maintain its state or accept new writes.
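To make the quorum check concrete, here is a small sketch that sends a four-letter command to a node, extracts the Mode: line, and decides whether a set of voting members forms a working quorum. It assumes the srvr command is permitted on your servers; the helper names are hypothetical, and observer nodes should be left out of the input because they do not vote.

```python
import re
import socket


def four_letter_word(host, port=2181, command="srvr", timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line, or None if the node is not serving."""
    match = re.search(r"^Mode:\s*(\S+)", reply, flags=re.MULTILINE)
    return match.group(1) if match else None


def assess_ensemble(modes):
    """Summarize quorum state from a {host: mode_or_None} dict of voting members."""
    size = len(modes)
    needed = size // 2 + 1  # strict majority
    serving = [h for h, m in modes.items() if m in ("leader", "follower")]
    leaders = [h for h, m in modes.items() if m == "leader"]
    return {
        "size": size,
        "serving": len(serving),
        "majority_needed": needed,
        "leaders": leaders,
        "has_quorum": len(serving) >= needed and len(leaders) == 1,
    }
```

A typical use would build the dict with {host: parse_mode(four_letter_word(host)) for host in hosts}, catching OSError for nodes that are down and recording None for them.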
  4. ZooKeeper Disk Space or I/O Issues:

    • Diagnosis: Check the available disk space on each ZooKeeper server, especially on the partition where the data directory is located (e.g., /var/lib/zookeeper). Use df -h. Also, monitor disk I/O performance using tools like iostat -xz 1 to see if disks are saturated. ZooKeeper writes transaction logs and snapshots to disk, and slow or full disks can cause it to become unresponsive.
    • Fix:
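      • Free Up Disk Space: Delete old snapshots or transaction logs that are no longer needed, preferably with ZooKeeper’s own tooling (the zkCleanup.sh script, or the autopurge.snapRetainCount and autopurge.purgeInterval settings in zoo.cfg) rather than by hand. This cleanup is ZooKeeper’s job, not Pulsar’s, and autopurge only runs when purgeInterval is set to a positive number of hours. Alternatively, expand the disk or move the data directory to a larger partition.
      • Improve Disk Performance: Migrate the ZooKeeper data directory to faster storage (e.g., SSDs), or place the transaction log on its own device by setting dataLogDir so that log writes do not compete with snapshots.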
    • Why it works: ZooKeeper’s durability and performance are heavily reliant on its ability to quickly write transaction logs and snapshots to disk. Disk full or slow conditions can halt its operations.
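If you want the df -h check in a form that can run from a cron job or health probe, something along these lines may do. It is only a sketch: the 20% free-space threshold is an arbitrary assumption, and the function name is illustrative.

```python
import shutil


def disk_usage_report(path, warn_free_fraction=0.2):
    """Report free space for the filesystem that holds `path`.

    `low` is True when the free fraction drops below warn_free_fraction.
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a zero total size.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    gib = 1024 ** 3
    return {
        "path": path,
        "total_gib": round(usage.total / gib, 2),
        "free_gib": round(usage.free / gib, 2),
        "free_fraction": free_fraction,
        "low": free_fraction < warn_free_fraction,
    }
```

Point it at the ZooKeeper data directory (e.g., /var/lib/zookeeper) and, if you use a separate transaction log device, at that directory as well.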
  5. Incorrect ZooKeeper Connection String in Pulsar Configuration:

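    • Diagnosis: Verify the ZooKeeper connection settings in your Pulsar broker’s configuration file (broker.conf or standalone.conf): zookeeperServers for the local cluster and configurationStoreServers (formerly globalZookeeperServers) for the configuration store. Newer Pulsar releases use metadataStoreUrl and configurationMetadataStoreUrl instead. Ensure the value lists the correct hostnames/IPs and client ports for all ZooKeeper ensemble members. For example, zookeeperServers=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181.
    • Fix: Correct the zookeeperServers string (or the equivalent setting for your Pulsar version) in broker.conf to accurately reflect your ZooKeeper ensemble. Restart the Pulsar broker after making changes.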
    • Why it works: If the broker is trying to connect to the wrong ZooKeeper instances or missing some, it won’t be able to establish a connection to the ensemble needed for metadata operations.
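Typos in the connection string (a stray comma, a missing port, a non-numeric port) are easy to miss by eye. The sketch below parses a plain host:port list of the kind shown for zookeeperServers, with an optional /chroot suffix. It is an illustrative helper, not Pulsar’s own parser, and it does not handle IPv6 literals or a scheme prefix.

```python
def parse_zk_connect_string(value, default_port=2181):
    """Split 'host1:2181,host2:2181/chroot' into ([(host, port), ...], chroot).

    Raises ValueError for empty entries or invalid ports.
    """
    servers_part, slash, chroot = value.strip().partition("/")
    chroot = "/" + chroot if slash else ""
    servers = []
    for entry in servers_part.split(","):
        entry = entry.strip()
        if not entry:
            raise ValueError("empty server entry in connection string")
        host, sep, port_text = entry.partition(":")
        if not host:
            raise ValueError(f"missing host in entry {entry!r}")
        if sep:
            if not port_text.isdigit() or not 0 < int(port_text) < 65536:
                raise ValueError(f"invalid port in entry {entry!r}")
            port = int(port_text)
        else:
            port = default_port  # ZooKeeper's conventional client port
        servers.append((host, port))
    return servers, chroot
```

The resulting list of (host, port) pairs can be compared against the server.X entries in zoo.cfg, or fed to a connectivity check, to confirm the broker points at every ensemble member.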
  6. ZooKeeper Client Port Blocked by Broker’s Firewall:

    • Diagnosis: This is the inverse of point 2. Check the firewall rules on the Pulsar broker server itself. Ensure that the broker process has outbound access to the ZooKeeper client port (2181) on the ZooKeeper nodes.
    • Fix: Adjust the broker’s firewall rules to allow outbound connections to the ZooKeeper ensemble on port 2181. For iptables on the broker: iptables -I OUTPUT -d <zookeeper_host> -p tcp --dport 2181 -j ACCEPT.
    • Why it works: Even if ZooKeeper is configured correctly and network paths are open, the broker’s own outbound firewall rules might prevent it from initiating the connection.

After resolving these issues, you may encounter a NoNodeException if Pulsar attempts to read configuration or topic metadata that was left in an inconsistent state by the prior ZooKeeper unavailability.
