Your Pulsar topic is terminating because the broker is unable to manage the topic’s state, specifically its metadata and ongoing operations, due to a persistent communication failure with the BookKeeper ensemble. This isn’t a simple network blip; it’s a fundamental breakdown in the coordination layer that keeps your topic alive.
Cause 1: BookKeeper Ensemble Unreachable
Diagnosis:
Check broker logs for io.netty.channel.ConnectTimeoutException or java.net.ConnectException: Connection refused entries that name the bookies’ IP addresses and ports.
Verify connectivity from the broker to each BookKeeper node:
telnet <bookie_ip> 3181
If telnet is refused or times out, either the bookie process is down or the network path to it is blocked.
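To run the same check against several bookies at once, a small loop is enough. This is a minimal sketch that assumes bash and the coreutils timeout command are available on the broker host; the function names port_open and check_bookies and the host names are placeholders, not part of Pulsar or BookKeeper.

```shell
# port_open HOST PORT: succeeds if a TCP connection can be opened within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so it works where telnet and nc are not installed.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# check_bookies HOST...: prints one status line per bookie (default port 3181 assumed).
check_bookies() {
    for host in "$@"; do
        if port_open "$host" 3181; then
            echo "$host:3181 reachable"
        else
            echo "$host:3181 UNREACHABLE"
        fi
    done
}

# Example (placeholder host names):
# check_bookies bookie1.example.com bookie2.example.com bookie3.example.com
```

Run it from the broker host itself, since a check from your workstation says nothing about the broker-to-bookie path.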
Fix:
Confirm that the bookies are running and reachable on port 3181 (the default bookiePort), and that bookies and brokers point at the same metadata store. Bookies read it from bookkeeper.conf (zkServers or metadataServiceUri); brokers discover bookies through that metadata store rather than through a static list. Restart any bookies that are down.
# Example bookkeeper.conf entry
zkServers=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
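# Or metadataServiceUri on newer BookKeeper releases (hosts separated by semicolons, followed by the ledgers root path)
metadataServiceUri=zk+hierarchical://zk1.example.com:2181;zk2.example.com:2181;zk3.example.com:2181/ledgers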
This fixes the problem by re-establishing the communication channel between the Pulsar broker and its durable storage layer, so the broker can open ledgers again and resume topic operations.
Cause 2: BookKeeper Ensemble Unhealthy (Insufficient Quorum)
Diagnosis:
Use the BookKeeper shell to check the ensemble’s health:
bin/bookkeeper shell listbookies -rw
bin/bookkeeper shell listunderreplicated
The first command lists the bookies that currently accept writes; the second lists ledgers with missing replicas. Also check BookKeeper logs for messages indicating bookie failures or an inability to satisfy the write quorum.
Fix:
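Ensure the number of writable bookies is at least the ensemble size configured for the topic’s ledgers (managedLedgerDefaultEnsembleSize in broker.conf, unless a namespace policy overrides it). BookKeeper does not use a majority rule: creating a ledger needs a full ensemble of writable bookies, and each entry must be acknowledged by the ack quorum (managedLedgerDefaultAckQuorum). If bookies are down, bring them back online. If a bookie is permanently lost, decommission it (for example with bin/bookkeeper shell decommissionbookie) so that its ledgers are re-replicated onto the remaining bookies.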
# List the bookies registered in ZooKeeper (assumes the default /ledgers root; replace with your ZK details)
zkCli.sh -server zk1.example.com:2181 ls /ledgers/available
This ensures that BookKeeper has enough operational nodes to guarantee data durability and availability, allowing brokers to reliably write and read topic data.
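BookKeeper availability depends on three numbers rather than a simple majority: the ensemble size E, the write quorum Qw, and the ack quorum Qa (in Pulsar, managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, and managedLedgerDefaultAckQuorum in broker.conf). The sketch below only illustrates the arithmetic; the function name is hypothetical, and the writable-bookie count is something you supply yourself, for example by counting the output of listbookies -rw.

```shell
# ledger_write_status WRITABLE E QW QA
# Prints whether new ledgers can be created with WRITABLE bookies available.
ledger_write_status() {
    writable=$1 e=$2 qw=$3 qa=$4
    if [ "$e" -lt "$qw" ] || [ "$qw" -lt "$qa" ]; then
        echo "invalid: need E >= Qw >= Qa"
    elif [ "$writable" -ge "$e" ]; then
        echo "ok: $writable writable bookies, $((writable - e)) spare"
    else
        echo "blocked: need $e writable bookies, only $writable available"
    fi
}

ledger_write_status 3 3 2 2   # ok: 3 writable bookies, 0 spare
ledger_write_status 2 3 2 2   # blocked: need 3 writable bookies, only 2 available
```

Zero spare bookies means the next bookie failure blocks ledger creation, so an ensemble size equal to the total bookie count leaves no headroom.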
Cause 3: ZooKeeper Ensemble Unreachable or Unhealthy
Diagnosis:
Check broker logs for org.apache.zookeeper.KeeperException errors, particularly ConnectionLossException, SessionExpiredException, or NoNodeException.
Verify connectivity from the broker to each ZooKeeper node:
telnet <zk_ip> 2181
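Check ZooKeeper status on each node (srvr is allowed by default; on ZooKeeper 3.5 and later, stat works only if it is listed in 4lw.commands.whitelist):
echo "srvr" | nc <zk_ip> 2181 | grep "Mode:"
Ensure exactly one node reports leader and the others report follower (a single-node setup reports standalone).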
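Checking each node by hand gets tedious on larger ensembles. The following sketch collects the Mode line from every node and summarizes the result; the function names zk_mode and summarize_modes and the host names are placeholders, and it assumes nc is installed and that the four-letter-word command you send is permitted by the server.

```shell
# zk_mode: reads four-letter-word output (srvr or stat) on stdin, prints the Mode value.
zk_mode() {
    awk -F': ' '/^Mode:/ { print $2 }'
}

# summarize_modes: reads one mode per line and reports whether a leader is present.
summarize_modes() {
    awk '
        $1 == "leader"     { leaders++ }
        $1 == "follower"   { followers++ }
        $1 == "standalone" { standalone++ }
        END {
            if (leaders == 1 || (standalone == 1 && NR == 1))
                printf "healthy: %d leader, %d follower, %d standalone\n", leaders, followers, standalone
            else
                printf "unhealthy: %d leader, %d follower, %d standalone\n", leaders, followers, standalone
        }'
}

# Example (placeholder host names):
# for h in zk1.example.com zk2.example.com zk3.example.com; do
#     echo srvr | nc "$h" 2181 | zk_mode
# done | summarize_modes
```

A node that does not answer contributes no line at all, so compare the printed counts with the number of nodes you expect.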
Fix:
Ensure the ZooKeeper connection string is correct in both bookkeeper.conf (zkServers) and broker.conf (zookeeperServers, or metadataStoreUrl on newer Pulsar releases), and that the ZooKeeper nodes are running and accessible on port 2181. Restart any ZooKeeper nodes that are down or in a problematic state.
# Example broker.conf entry
zookeeperServers=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
This is critical because Pulsar uses ZooKeeper for cluster metadata, topic ownership, and coordination. If brokers can’t reach ZooKeeper, they can’t discover or manage topics.
Cause 4: Insufficient File Descriptors or Memory on Brokers/Bookies
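Diagnosis: On brokers and BookKeeper nodes, check system limits:
ulimit -n # File descriptor limit of the current shell
cat /proc/<pid>/limits # Limits actually applied to the running broker or bookie process
free -m # Memory usage
Look for Too many open files errors in broker/BookKeeper logs, or for java.lang.OutOfMemoryError and kernel OOM-killer messages that indicate memory exhaustion.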
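The limit that matters is the one applied to the running process, which can differ from what ulimit -n prints in your login shell. A minimal sketch for comparing open descriptors against that limit on Linux follows; the function names are hypothetical, and the pgrep pattern in the example is an assumption about how the broker process appears on your host.

```shell
# soft_nofile: reads /proc/<pid>/limits content on stdin, prints the soft "Max open files" limit.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# fd_report PID: prints the open descriptor count against the soft limit (Linux only).
fd_report() {
    pid=$1
    limit=$(soft_nofile < "/proc/$pid/limits")
    open=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    echo "pid $pid: $open open of $limit allowed"
}

# Example (run as the user that owns the process, or as root):
# fd_report "$(pgrep -f PulsarBrokerStarter | head -n 1)"
```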
Fix:
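Increase the nofile limit (ulimit -n) for the user running the Pulsar broker and BookKeeper processes. For processes started from a login session, edit /etc/security/limits.conf; for systemd services, set LimitNOFILE in the unit file, since limits.conf does not apply to them. Restart the processes so the new limit takes effect.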
# Example /etc/security/limits.conf entry
* soft nofile 65536
* hard nofile 65536
Also, ensure sufficient RAM is available. If memory is consistently high, investigate the specific processes consuming it or consider increasing system memory. This provides the necessary resources for the network sockets, file handles, and internal data structures that Pulsar and BookKeeper rely on.
Cause 5: Network Partition within the BookKeeper Ensemble or between Brokers and Bookies
Diagnosis:
Use ping and traceroute between affected brokers and BookKeeper nodes. Check network interface statistics for errors or dropped packets. If you have a sophisticated network monitoring tool, look for signs of traffic loss or latency spikes between the relevant subnets.
Fix: Address underlying network configuration issues, firewall rules, or hardware problems. Ensure that all necessary ports (3181 for BookKeeper, 2181 for ZooKeeper) are open and that traffic is not being unexpectedly routed or dropped. This directly resolves the communication breakdown by restoring reliable packet delivery.
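For the interface statistics mentioned in the diagnosis, Linux exposes per-interface error and drop counters in /proc/net/dev. The sketch below prints only the interfaces with non-zero counters; the function name nic_errors is hypothetical, and the column positions follow the standard /proc/net/dev layout.

```shell
# nic_errors: reads /proc/net/dev content on stdin and prints interfaces
# whose receive/transmit error or drop counters are non-zero.
nic_errors() {
    # Drop the two header lines, then separate "iface:" from the counters.
    sed -e '1,2d' -e 's/:/ /' | awk '
        { rx_err = $4; rx_drop = $5; tx_err = $12; tx_drop = $13 }
        rx_err + rx_drop + tx_err + tx_drop > 0 {
            printf "%s rx_errs=%d rx_drop=%d tx_errs=%d tx_drop=%d\n", $1, rx_err, rx_drop, tx_err, tx_drop
        }'
}

# Example (Linux only):
# nic_errors < /proc/net/dev
```

The counters are cumulative since boot, so run it twice a few minutes apart and compare; only counters that keep rising point to an ongoing problem.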
Cause 6: Incorrect Pulsar Broker Configuration for Metadata Store
Diagnosis:
Review conf/broker.conf (or conf/standalone.conf for a standalone deployment) for the metadata store settings. A wrong metadataStoreUrl or configurationMetadataStoreUrl (zookeeperServers and configurationStoreServers on older versions) can prevent brokers from initializing their metadata store client.
Fix:
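Ensure the metadata store settings in broker.conf accurately point to your ZooKeeper ensemble. For example:
# Example broker.conf entry (properties syntax: key=value)
metadataStoreUrl=zk:zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# Older versions use zookeeperServers instead
zookeeperServers=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181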
Whichever key your version uses, make sure it names the same ZooKeeper ensemble (and chroot path, if any) that the bookies use. This allows the broker to bootstrap its internal state management correctly.
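Before restarting a broker, it helps to see which metadata store keys are actually set, since commented-out or empty entries are easy to miss. This is a small sketch, assuming a properties-style file; the function name metadata_settings is hypothetical, and the key list covers both the newer and the older names.

```shell
# metadata_settings FILE: prints the uncommented metadata store keys
# from a properties-style config file, normalized to key=value.
metadata_settings() {
    grep -E '^[[:space:]]*(metadataStoreUrl|configurationMetadataStoreUrl|zookeeperServers|configurationStoreServers)[[:space:]]*=' "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*=[[:space:]]*/=/'
}

# Example:
# metadata_settings conf/broker.conf
```

An empty value after the equals sign, or no output at all, means the broker has nothing to connect to for that key.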
Once you’ve resolved these issues, watch for follow-on problems such as backlog quota limits or consumer lag, which are likely if the topic was unavailable for an extended period.