The Pulsar service unit is reporting NotReady because the underlying Zookeeper ensemble is unhealthy and unable to serve the quorum required by Pulsar brokers.
Common Causes and Fixes
1. Zookeeper Node Unreachable or Crashing
- Diagnosis: Check Zookeeper logs for repeated `Connection refused`, `Connection reset by peer`, or crash loop messages. On each Zookeeper node, run `sudo systemctl status zookeeper` to see if the service is active and check its recent logs.
- Cause: Network issues, insufficient system resources (CPU, RAM, disk I/O), or configuration errors causing individual Zookeeper nodes to fail.
- Fix:
  - Network: Ensure all Zookeeper nodes can reach each other on the configured client and peer ports (default 2181 and 2888/3888). Use `telnet <zookeeper-ip> 2181` from other nodes.
  - Resources: Monitor resource usage on Zookeeper nodes. If high, allocate more CPU/RAM or upgrade hardware. A common minimum is 2GB RAM per node.
  - Configuration: Verify `zoo.cfg` on each node. Ensure `dataDir`, `clientPort`, `tickTime`, `initLimit`, and `syncLimit` are consistent across the ensemble, and that `server.X=<hostname>:2888:3888` entries correctly map to each node.
- Why it works: Zookeeper requires a majority of nodes (a quorum) to be available and communicating to function. Fixing connectivity or resource issues allows nodes to rejoin the ensemble and establish quorum.
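The connectivity check described above can be scripted rather than run by hand with `telnet`. The following is a minimal sketch, assuming the default ports 2181, 2888, and 3888; the function names `port_open` and `check_ensemble` are illustrative and not part of any Zookeeper or Pulsar tooling.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ensemble(hosts, ports=(2181, 2888, 3888)):
    """Map each (host, port) pair to whether it accepted a TCP connection.

    The port tuple is an assumption (Zookeeper defaults); pass the values
    from your own zoo.cfg if they differ.
    """
    return {(host, port): port_open(host, port) for host in hosts for port in ports}
```

Run it from each ensemble member and from a broker host, for example `check_ensemble(["zk1", "zk2", "zk3"])` with your own hostnames. A successful connection only shows that the port is reachable, not that the node is healthy, and port 2888 is normally bound only on the current leader, so a `False` for that port on followers is expected.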
2. Incorrect Zookeeper Ensemble Size or Quorum Configuration
- Diagnosis: Examine the `zoo.cfg` file on all Zookeeper nodes and count the `server.X` entries. If the number of active nodes is less than `floor(N/2) + 1`, where N is the total number of configured servers, quorum is lost.
- Cause: Nodes have been decommissioned or have failed without `zoo.cfg` being updated on the remaining nodes, or the initial configuration was for an insufficient number of nodes.
- Fix: Ensure `zoo.cfg` on all remaining active Zookeeper nodes lists the correct, current set of active servers. If a node is permanently gone, remove its `server.X` entry from all `zoo.cfg` files and restart the Zookeeper ensemble. For example, if you had 5 nodes and one permanently failed, you’d reconfigure the remaining 4 nodes to reflect only those 4 servers. Note that a 4-node ensemble still needs 3 nodes for quorum, so it tolerates only one further failure, the same as a 3-node ensemble; adding a replacement node to return to an odd size is usually preferable.
- Why it works: Zookeeper’s quorum mechanism depends on knowing the total number of expected participants to determine a majority. Correcting the `server.X` list ensures the quorum calculation is accurate for the active ensemble.
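The majority rule can be made concrete with a few lines of code. This is a small sketch, assuming `server.X=<hostname>:<port>:<port>` lines as in the example above; `parse_servers`, `quorum_size`, and `has_quorum` are hypothetical helper names, not Zookeeper APIs.

```python
import re

# Matches lines such as "server.1=zk1:2888:3888"; anything after the
# second port (for example a role suffix) is ignored.
SERVER_LINE = re.compile(r"^server\.(\d+)\s*=\s*([^:\s]+):(\d+):(\d+)")


def parse_servers(cfg_text: str) -> dict:
    """Return {server_id: hostname} for every server.X entry in zoo.cfg text."""
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def quorum_size(configured: int) -> int:
    """Smallest strict majority of the configured ensemble: floor(N/2) + 1."""
    return configured // 2 + 1


def has_quorum(configured: int, alive: int) -> bool:
    """True if the number of live nodes is at least a strict majority."""
    return alive >= quorum_size(configured)
```

For a 5-node configuration `quorum_size(5)` is 3, so two live nodes are not enough, while a 4-node configuration also needs 3.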
3. Disk Full or I/O Bottleneck on Zookeeper Data Directory
- Diagnosis: Check disk space on Zookeeper nodes using `df -h`. Monitor I/O wait times using `iostat -xz 1`. Zookeeper writes transaction logs and snapshots frequently.
- Cause: The `dataDir` specified in `zoo.cfg` is full, preventing Zookeeper from writing new transaction logs or snapshots. High disk I/O can also make Zookeeper unresponsive.
- Fix:
  - Disk Space: Free up space by deleting old snapshots (`snapshot.*`) and transaction logs (`log.*`) from the `version-2` subdirectory of the `dataDir`, or expand the disk volume. Crucially, do NOT delete the latest snapshot and its corresponding transaction log if the ensemble is still attempting to run, as this can lead to data loss. A safer approach is to move old, non-essential logs/snapshots to archival storage.
  - I/O: Migrate the `dataDir` to faster storage (SSD) or optimize disk configurations.
- Why it works: Zookeeper relies on persistent storage for its transaction log and snapshots to maintain state and recover. Full disks or slow I/O prevent these critical operations, leading to node failure.
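Before deleting anything, it helps to see how full the volume is and which snapshots are old enough to be candidates for archiving. The sketch below only lists files and never deletes them. It assumes snapshot files are named `snapshot.<hex zxid>` inside the data directory's `version-2` subdirectory; the helper names and the retention count of 3 are illustrative choices, not Zookeeper defaults you should rely on.

```python
import shutil
from pathlib import Path


def disk_used_fraction(path) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def purge_candidates(version_dir, keep: int = 3):
    """Return snapshot files older than the newest `keep`, oldest first.

    Assumes names of the form snapshot.<hex zxid>; the hex suffix orders
    the snapshots. Nothing is deleted here.
    """
    snapshots = sorted(
        Path(version_dir).glob("snapshot.*"),
        key=lambda p: int(p.name.split(".")[1], 16),
    )
    return snapshots[:-keep] if keep > 0 else snapshots
```

Transaction logs are deliberately left out: which `log.*` files are safe to remove depends on the snapshots that are retained, so that decision is better left to Zookeeper's own cleanup tooling or made by hand after archiving.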
4. Incorrect Zookeeper Client Port or Peer Port Configuration
- Diagnosis: Inspect `zoo.cfg` for `clientPort` and the peer-to-peer ports (e.g., `2888` and `3888`). Verify these ports are open in firewalls between Zookeeper nodes and between Pulsar brokers and Zookeeper nodes. Use `netstat -tulnp | grep <port>` on Zookeeper nodes to confirm the ports are listening (port `2888` is bound only on the current leader).
- Cause: Firewalls blocking necessary ports, or incorrect port numbers specified in `zoo.cfg` preventing communication.
- Fix:
  - Firewall: Open `clientPort` (default 2181) for Pulsar brokers and clients to connect to Zookeeper. Open `2888` (follower-to-leader communication) and `3888` (leader election) between Zookeeper ensemble members.
  - Configuration: Ensure the `clientPort` in `zoo.cfg` matches the port Pulsar is configured to use for Zookeeper (`zookeeperServers` in `broker.conf` or `standalone.conf`).
- Why it works: Zookeeper nodes communicate with each other on peer ports and with clients (like Pulsar brokers) on the client port. Blocking these ports prevents the ensemble from forming and Pulsar from connecting.
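A port mismatch between the two configuration files is easy to miss by eye. The following sketch compares the `clientPort` in `zoo.cfg` with the ports listed in the broker's `zookeeperServers` setting. It assumes both files use simple `key=value` lines and treats 2181 as the port when none is given; the function names are hypothetical.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def broker_zk_ports(broker_conf_text: str) -> set:
    """Collect the ports from zookeeperServers=host:port[,host:port...]."""
    servers = parse_properties(broker_conf_text).get("zookeeperServers", "")
    ports = set()
    for entry in servers.split(","):
        entry = entry.strip()
        if not entry:
            continue
        hostport = entry.split("/", 1)[0]  # drop an optional chroot path
        host, _, port = hostport.rpartition(":")
        # Assumption: an entry without an explicit port means 2181.
        ports.add(int(port) if host else 2181)
    return ports


def ports_match(zoo_cfg_text: str, broker_conf_text: str) -> bool:
    """True if every broker-side port equals the clientPort in zoo.cfg."""
    client_port = int(parse_properties(zoo_cfg_text).get("clientPort", 2181))
    return broker_zk_ports(broker_conf_text) == {client_port}
```

Reading both files with `Path(...).read_text()` and passing the contents to `ports_match` gives a quick yes/no answer before you start inspecting firewall rules.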
5. Zookeeper Session Timeout Issues
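- Diagnosis: Look for Zookeeper logs indicating "session expired" or "connection lost" messages from Pulsar brokers. Check `tickTime`, `initLimit`, and `syncLimit` in `zoo.cfg` and the Zookeeper session timeout in `broker.conf` (`zooKeeperSessionTimeoutMillis`, or `metadataStoreSessionTimeoutMillis` in newer Pulsar releases).
- Cause: Network latency or instability causing Pulsar brokers to miss Zookeeper heartbeats, or Zookeeper nodes being too slow to respond within their configured limits.
- Fix:
  - Network: Improve network stability and reduce latency between Pulsar brokers and Zookeeper nodes.
  - Zookeeper Tuning: If the network is stable, slightly increase `tickTime` (e.g., from 2000ms to 3000ms) in `zoo.cfg`. `initLimit` and `syncLimit` are expressed in ticks (e.g., `initLimit=10`, `syncLimit=5`), so they scale with `tickTime` automatically; raise them only if followers need more time to connect to and sync with the leader.
  - Pulsar Tuning: Increase the session timeout in `broker.conf` (e.g., from 60000ms to 90000ms). Zookeeper caps the negotiated session timeout at `maxSessionTimeout` (default 20 × `tickTime`, i.e. 40000ms at `tickTime=2000`), so raise that setting in `zoo.cfg` as well if the requested value exceeds it.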
- Why it works: Zookeeper uses a session timeout mechanism. If a client (Pulsar broker) doesn’t send a heartbeat within this timeout, Zookeeper considers the session expired. Adjusting timeouts provides more tolerance for transient network delays.
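One detail worth checking when tuning timeouts is that the Zookeeper server does not accept whatever timeout a client requests: it clamps the value between `minSessionTimeout` (default 2 × `tickTime`) and `maxSessionTimeout` (default 20 × `tickTime`). The sketch below models that clamping so you can see what a broker would actually get; `negotiated_session_timeout` is an illustrative name, not a Zookeeper or Pulsar function.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Session timeout the server would grant for a requested value.

    min_ms and max_ms stand for minSessionTimeout and maxSessionTimeout in
    zoo.cfg; when they are not set, the defaults of 2x and 20x tickTime apply.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))
```

With `tickTime=2000` and no explicit limits, a broker requesting 60000ms is granted 40000ms, so raising the broker-side value alone has no effect until `maxSessionTimeout` or `tickTime` is raised too.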
6. Zookeeper Data Integrity Corruption
- Diagnosis: Zookeeper logs might show errors like "snapshot corrupted," "transaction log inconsistent," or exceptions during startup related to reading data files.
- Cause: Abrupt power loss, disk errors, or bugs leading to corrupted data files in the `dataDir`.
- Fix: This is the most severe case. If only a minority of nodes are corrupted while the rest still hold quorum, it is usually enough to stop each affected node, clear the `version-2` subdirectory of its `dataDir` (leaving the `myid` file in place), and restart it so that it resyncs from the leader. If the whole ensemble is affected, the safest (though disruptive) fix is to stop the entire Zookeeper ensemble, clear the `dataDir` on all nodes (again keeping each node's `myid` file), and restart Zookeeper from scratch. This will cause a complete loss of Zookeeper state, including Pulsar topic metadata, configuration, and ownership. You will need to re-create topics and re-configure Pulsar. If you have backups of Zookeeper data, you might be able to restore from a consistent snapshot and log pair, but this is complex and error-prone.
- Why it works: Corrupted data files prevent Zookeeper from initializing its state correctly, making it impossible to serve requests or maintain quorum. Starting fresh eliminates the corrupted data.
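Because clearing the data directory is irreversible, it is prudent to archive it first, even if the contents appear corrupted. The following is a minimal sketch using only the Python standard library; `archive_data_dir` and the archive naming scheme are assumptions for illustration, and the Zookeeper service should be stopped on the node before running it.

```python
import shutil
import time
from pathlib import Path


def archive_data_dir(data_dir, backup_root) -> Path:
    """Create a timestamped .tar.gz of data_dir under backup_root.

    backup_root must be outside data_dir, otherwise the archive would try
    to include itself. Returns the path of the archive that was written.
    """
    data_dir = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = Path(backup_root) / f"zk-{data_dir.name}-{stamp}"
    archive = shutil.make_archive(str(base), "gztar", root_dir=str(data_dir))
    return Path(archive)
```

The archive includes the `myid` file along with the snapshots and logs, so a node can be put back exactly as it was if the reset does not help.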
After resolving these Zookeeper issues, the Pulsar brokers should be able to connect and register themselves, and the NotReady status should clear. The next error you might encounter is a BrokerNotAvailableException if Pulsar’s internal metadata or configuration in Zookeeper is also in an inconsistent state due to the earlier Zookeeper problems.