The Pulsar service unit is reporting NotReady because the underlying Zookeeper ensemble is unhealthy and unable to serve the quorum required by Pulsar brokers.

Common Causes and Fixes

1. Zookeeper Node Unreachable or Crashing

  • Diagnosis: Check Zookeeper logs for repeated Connection refused, Connection reset by peer, or crash loop messages. On each Zookeeper node, run sudo systemctl status zookeeper to see if the service is active and check its recent logs.
  • Cause: Network issues, insufficient system resources (CPU, RAM, disk I/O), or configuration errors causing individual Zookeeper nodes to fail.
  • Fix:
    • Network: Ensure all Zookeeper nodes can reach each other on the configured client and peer ports (default 2181 and 2888/3888). Use telnet <zookeeper-ip> 2181 from other nodes.
    • Resources: Monitor resource usage on Zookeeper nodes. If high, allocate more CPU/RAM or upgrade hardware. A common minimum is 2GB RAM per node.
    • Configuration: Verify zoo.cfg on each node. Ensure dataDir, clientPort, tickTime, initLimit, and syncLimit are consistent across the ensemble, that the server.X=<hostname>:2888:3888 entries are identical on every node, and that the myid file in each node's dataDir contains that node's own X.
  • Why it works: Zookeeper requires a majority of nodes (a quorum) to be available and communicating to function. Fixing connectivity or resource issues allows nodes to rejoin the ensemble and establish quorum.
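A quick way to see whether a node is actually serving, beyond systemctl status, is to send it a four-letter-word command over the client port. The sketch below is a minimal, hedged example: the function name is hypothetical, it assumes the srvr command is allowed by 4lw.commands.whitelist in zoo.cfg (recent Zookeeper releases restrict these commands), and it assumes the default client port 2181.

```python
import socket


def four_letter_word(host: str, port: int = 2181, command: str = "srvr",
                     timeout: float = 3.0) -> str:
    """Send a Zookeeper four-letter-word command and return the raw reply.

    Assumes the command is whitelisted on the server; otherwise the
    server replies with a short "not in the whitelist" message.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical host name; replace with one of your Zookeeper nodes.
    print(four_letter_word("zk1.example.internal"))
```

A healthy node's srvr reply includes a "Mode:" line (leader, follower, or standalone). A node that is running but outside the quorum typically answers that it is not currently serving requests.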

2. Incorrect Zookeeper Ensemble Size or Quorum Configuration

  • Diagnosis: Examine the zoo.cfg file on all Zookeeper nodes. Count the number of server.X entries. If the number of running nodes is less than floor(N/2) + 1, where N is the total number of configured servers, quorum is lost. For N = 5 the quorum is 3; for N = 4 it is also 3.
  • Cause: Nodes have been decommissioned or failed without updating the zoo.cfg on the remaining nodes, or the initial configuration was for an insufficient number of nodes.
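  • Fix: Ensure zoo.cfg on all remaining active Zookeeper nodes lists the correct, current set of active servers. If a node is permanently gone, remove its server.X entry from all zoo.cfg files and restart the Zookeeper ensemble. For example, if you had 5 nodes and one permanently failed, you’d reconfigure the remaining 4 nodes to list only those 4 servers. Note that a 4-node ensemble still needs 3 nodes for quorum, so it tolerates only one failure, the same as a 3-node ensemble; where possible, replace the failed node to return to an odd-sized ensemble.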
  • Why it works: Zookeeper’s quorum mechanism depends on knowing the total number of expected participants to determine a majority. Correcting the server.X list ensures the quorum calculation is accurate for the active ensemble.
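The quorum arithmetic can be sketched in a few lines. This is an illustrative helper, not part of Zookeeper or Pulsar: the function names are hypothetical, and it assumes every server.X entry is a voting member (no observers).

```python
import re


def count_servers(zoo_cfg_text: str) -> int:
    """Count server.X entries in the text of a zoo.cfg file.

    Commented-out lines are ignored because the pattern must match
    at the start of the line (after optional whitespace).
    """
    pattern = re.compile(r"^\s*server\.\d+\s*=", re.MULTILINE)
    return len(pattern.findall(zoo_cfg_text))


def quorum_size(total_servers: int) -> int:
    """Smallest majority of the configured ensemble: floor(N/2) + 1."""
    return total_servers // 2 + 1


def tolerated_failures(total_servers: int) -> int:
    """How many nodes can be down while a quorum is still possible."""
    return total_servers - quorum_size(total_servers)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Running it for 3, 4, and 5 servers shows why odd sizes are preferred: 3 and 4 servers both tolerate a single failure, while 5 tolerates two.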

3. Disk Full or I/O Bottleneck on Zookeeper Data Directory

  • Diagnosis: Check disk space on Zookeeper nodes using df -h. Monitor I/O wait times using iostat -xz 1. Zookeeper writes transaction logs and snapshots frequently.
  • Cause: The dataDir specified in zoo.cfg is full, preventing Zookeeper from writing new transaction logs or snapshots. High disk I/O can also make Zookeeper unresponsive.
  • Fix:
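    • Disk Space: Free up space by removing old snapshots (snapshot.<zxid>) and transaction logs (log.<zxid>), which live in the version-2 subdirectory of dataDir (transaction logs are under dataLogDir if that is set), or expand the disk volume. Do NOT delete the most recent snapshot or the transaction logs written since it; Zookeeper needs them to recover its state, and removing them can lead to data loss. Rather than deleting files by hand, prefer the bundled bin/zkCleanup.sh script, or set autopurge.snapRetainCount and autopurge.purgeInterval in zoo.cfg so old files are purged automatically. If in doubt, move old files to archival storage instead of deleting them.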
    • I/O: Migrate the dataDir to faster storage (SSD) or optimize disk configurations.
  • Why it works: Zookeeper relies on persistent storage for its transaction log and snapshots to maintain state and recover. Full disks or slow I/O prevent these critical operations, leading to node failure.
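The disk checks above can be scripted. The following is a rough sketch with hypothetical function names; it assumes snapshots use the standard snapshot.<hex zxid> naming and sit in the version-2 subdirectory of dataDir. It only reports; it deliberately deletes nothing.

```python
import shutil
from pathlib import Path
from typing import List


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def snapshots_oldest_first(version_dir: str) -> List[str]:
    """Return snapshot file names in `version_dir`, sorted by zxid.

    The zxid is the hexadecimal field after "snapshot." in the name,
    so a plain alphabetical sort would order the files incorrectly.
    """
    def zxid(name: str) -> int:
        return int(name.split(".")[1], 16)

    names = [p.name for p in Path(version_dir).iterdir()
             if p.name.startswith("snapshot.")]
    return sorted(names, key=zxid)


if __name__ == "__main__":
    # Hypothetical path; use the dataDir from your zoo.cfg.
    data_dir = "/var/lib/zookeeper"
    print(f"{disk_usage_percent(data_dir):.1f}% used")
    print(snapshots_oldest_first(f"{data_dir}/version-2"))
```

The last entry in the returned list is the newest snapshot, which is the one that must never be removed.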

4. Incorrect Zookeeper Client Port or Peer Port Configuration

  • Diagnosis: Inspect zoo.cfg for clientPort and the peer-to-peer ports (e.g., 2888 and 3888). Verify these ports are open in firewalls between Zookeeper nodes and between Pulsar brokers and Zookeeper nodes. Use netstat -tulnp | grep <port> on Zookeeper nodes to confirm the ports are listening.
  • Cause: Firewalls blocking necessary ports, or incorrect port numbers specified in zoo.cfg preventing communication.
  • Fix:
    • Firewall: Open clientPort (default 2181) for Pulsar brokers and clients to connect to Zookeeper. Open 2888 (follower connections to the leader) and 3888 (leader election) between Zookeeper ensemble members.
    • Configuration: Ensure the clientPort in zoo.cfg matches the port Pulsar is configured to use for Zookeeper (zookeeperServers in broker.conf or standalone.conf).
  • Why it works: Zookeeper nodes communicate with each other on peer ports and with clients (like Pulsar brokers) on the client port. Blocking these ports prevents the ensemble from forming and Pulsar from connecting.
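Where telnet is not installed, the same reachability test can be done with a short script. This is a minimal sketch with hypothetical function names, assuming the default ports 2181, 2888, and 3888. Keep in mind that 2888 is normally only listened on by the current leader, so a refused connection on that port from a follower is expected.

```python
import socket
from typing import Dict, Iterable


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False


def check_ports(host: str,
                ports: Iterable[int] = (2181, 2888, 3888)) -> Dict[int, bool]:
    """Check each port on `host` and map port number to reachability."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    # Hypothetical host names; replace with your ensemble members.
    for node in ("zk1.example.internal", "zk2.example.internal"):
        print(node, check_ports(node))
```

Run it from a broker host to test the client port, and from each Zookeeper node to test the peer ports, since firewall rules often differ by source.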

5. Zookeeper Session Timeout Issues

  • Diagnosis: Look for Zookeeper logs indicating "session expired" or "connection lost" messages from Pulsar brokers. Check tickTime, initLimit, and syncLimit in zoo.cfg and the session timeout in broker.conf (zooKeeperSessionTimeoutMillis, or metadataStoreSessionTimeoutMillis in newer Pulsar releases).
  • Cause: Network latency or instability causing Pulsar brokers to miss Zookeeper heartbeats, or Zookeeper nodes being too slow to respond within their configured limits.
  • Fix:
    • Network: Improve network stability and reduce latency between Pulsar brokers and Zookeeper nodes.
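    • Zookeeper Tuning: If the network is stable, slightly increase tickTime (e.g., from 2000ms to 3000ms) in zoo.cfg. initLimit and syncLimit are expressed in ticks (e.g., initLimit=10, syncLimit=5), so their effective durations scale with tickTime automatically; raise them only if followers time out while syncing with the leader.
    • Pulsar Tuning: Increase the session timeout in broker.conf (zooKeeperSessionTimeoutMillis, or metadataStoreSessionTimeoutMillis in newer Pulsar releases), e.g., from 60000ms to 90000ms. Zookeeper caps the negotiated session timeout at maxSessionTimeout, which defaults to 20 × tickTime (40000ms at tickTime=2000), so set maxSessionTimeout in zoo.cfg or raise tickTime if you need a larger value.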
  • Why it works: Zookeeper uses a session timeout mechanism. If a client (Pulsar broker) doesn’t send a heartbeat within this timeout, Zookeeper considers the session expired. Adjusting timeouts provides more tolerance for transient network delays.
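One detail worth checking before raising the broker's timeout: the Zookeeper server clamps the timeout a client requests to the range between minSessionTimeout and maxSessionTimeout, which default to 2 × tickTime and 20 × tickTime. The sketch below models that clamping so you can see what a given setting would actually negotiate; the function name is hypothetical and the code is an illustration of the rule, not Zookeeper's own implementation.

```python
from typing import Optional


def negotiated_session_timeout(requested_ms: int,
                               tick_time_ms: int = 2000,
                               min_ms: Optional[int] = None,
                               max_ms: Optional[int] = None) -> int:
    """Clamp a requested session timeout to the server's allowed range.

    min_ms and max_ms stand in for minSessionTimeout and
    maxSessionTimeout in zoo.cfg; when unset they default to
    2 * tickTime and 20 * tickTime respectively.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # A broker asking for 90 s against default server settings.
    print(negotiated_session_timeout(90000, tick_time_ms=2000))
    # The same request after raising tickTime to 3000 ms.
    print(negotiated_session_timeout(90000, tick_time_ms=3000))
```

With tickTime=2000 a 90000ms request is reduced to 40000ms, and with tickTime=3000 to 60000ms, so a larger broker setting has no effect unless the server-side limit is raised too.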

6. Zookeeper Data Integrity Corruption

  • Diagnosis: Zookeeper logs might show errors like "snapshot corrupted," "transaction log inconsistent," or exceptions during startup related to reading data files.
  • Cause: Abrupt power loss, disk errors, or bugs leading to corrupted data files in the dataDir.
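  • Fix: This is the most severe case. If only a minority of nodes are affected and the rest still hold quorum, stop each corrupted node, move the contents of its dataDir/version-2 directory aside (keep the myid file), and restart it; the node then resynchronizes its state from the leader. If every node is corrupted, the last resort is to stop the entire Zookeeper ensemble, clear the dataDir on all nodes (again keeping myid), and restart Zookeeper from scratch. This causes a complete loss of Zookeeper state, including Pulsar topic metadata, configuration, and ownership, so you will need to re-initialize the cluster metadata, re-create topics, and re-configure Pulsar. If you have backups of Zookeeper data, you might be able to restore from a consistent snapshot and log pair, but this is complex and error-prone.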
  • Why it works: Corrupted data files prevent Zookeeper from initializing its state correctly, making it impossible to serve requests or maintain quorum. Starting fresh eliminates the corrupted data.
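Before clearing anything, it may be worth setting the existing files aside rather than deleting them, so they remain available for later inspection or a restore attempt. The sketch below is one hedged way to do that: the function name and paths are hypothetical, it assumes Zookeeper is stopped on the node, that snapshots and logs live in dataDir/version-2, and that no separate dataLogDir is configured.

```python
import shutil
import time
from pathlib import Path


def set_aside_data(data_dir: str, backup_root: str) -> Path:
    """Move dataDir/version-2 into a timestamped folder under backup_root.

    The myid file in data_dir is left in place so the node keeps its
    identity when it restarts. Raises FileNotFoundError if there is
    no version-2 directory to move.
    """
    src = Path(data_dir) / "version-2"
    if not src.is_dir():
        raise FileNotFoundError(f"no version-2 directory in {data_dir}")
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(backup_root) / f"version-2.{stamp}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest


if __name__ == "__main__":
    # Hypothetical paths; use the dataDir from your zoo.cfg.
    print(set_aside_data("/var/lib/zookeeper", "/var/backups/zookeeper"))
```

Keeping the backup on a different volume from dataDir avoids making a disk-space problem worse.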

After resolving these Zookeeper issues, the Pulsar brokers should be able to connect and register themselves, and the NotReady status should clear. The next error you might encounter is a BrokerNotAvailableException if Pulsar’s internal metadata or configuration in Zookeeper is also in an inconsistent state due to the earlier Zookeeper problems.
