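Your Pulsar topic is fenced because another broker has taken over ownership of a topic or partition that the original broker still believes it is serving. Fencing is the safeguard that stops two brokers from writing to the same topic at once. Treat it as a critical event: without it, writes to the topic could be lost or duplicated, and while it is in effect the fenced broker turns away producers and consumers on that topic.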
Here are the common causes and how to fix them:
1. ZooKeeper Session Expiration
- Diagnosis: Check the Pulsar broker logs for messages like `ZooKeeper session expired` or `Connection to ZooKeeper lost`. Also check the ZooKeeper server logs for client disconnections.
- Cause: The broker lost its connection to ZooKeeper, its source of truth for ownership and metadata. If the session expires before the broker reconnects, ownership of its topics can pass to a different broker, which fences the original one.
- Fix:
  - Raise the timeout the ZooKeeper servers will grant: A server clamps every requested session timeout to the range `minSessionTimeout` to `maxSessionTimeout`, which default to 2 × `tickTime` and 20 × `tickTime`. With `tickTime=2000` the ceiling is 40 seconds, so in `zoo.cfg` either increase `tickTime` (for example, from `tickTime=2000` to `tickTime=4000`) or set `maxSessionTimeout` explicitly. `initLimit` and `syncLimit` are counted in ticks, so revisit them whenever you change `tickTime`. Restart the ZooKeeper servers.
  - Raise the timeout the broker requests: In `broker.conf`, set the session timeout (`zooKeeperSessionTimeoutMillis`, or `metadataStoreSessionTimeoutMillis` in newer releases) to a value inside that range, e.g., `60000` (60 seconds). Restart the Pulsar brokers.
- Why it works: A longer session timeout gives the broker more time to re-establish its connection before ZooKeeper declares the session expired, so the broker keeps ownership of its topics.
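The interaction between the broker's requested timeout and the server-side limits is easy to get wrong, so here is a minimal sketch of the clamping rule. It assumes the documented ZooKeeper defaults (minimum of 2 ticks, maximum of 20 ticks when `minSessionTimeout` and `maxSessionTimeout` are unset); the function name and parameters are made up for illustration.

```python
def effective_session_timeout(requested_ms, tick_time_ms=2000,
                              min_session_timeout_ms=None,
                              max_session_timeout_ms=None):
    """Estimate the session timeout a ZooKeeper server would grant.

    Assumption: unset limits fall back to 2 * tickTime and 20 * tickTime.
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # The server never grants less than the lower bound or more than the upper bound.
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # A broker asking for 60 s against tickTime=2000 is capped at 40 s.
    print(effective_session_timeout(60000))                     # 40000
    # Doubling tickTime lifts the ceiling to 80 s, so 60 s is granted.
    print(effective_session_timeout(60000, tick_time_ms=4000))  # 60000
```

If the first number is lower than what you configured on the broker, the broker-side change alone will not have the effect you expect.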
2. Network Instability Between Brokers and ZooKeeper
- Diagnosis: Monitor network latency and packet loss between your Pulsar brokers and ZooKeeper nodes. Tools like `ping` and `mtr` can help. Look for intermittent connectivity drops in the broker logs.
- Cause: Unstable network conditions can cause the ZooKeeper session timeouts described above, or make the broker believe it lost a connection that was only briefly delayed. Either way, ownership can be reassigned while the original broker is still healthy.
- Fix:
  - Improve network reliability: Address the underlying network issues. This might involve checking physical cabling, network switch configurations, or routing.
  - Configure the ZooKeeper client retry policy: In `broker.conf`, adjust `zookeeperRetryWaitMs` and `zookeeperMaxRetries`, for example `zookeeperRetryWaitMs=1000` and `zookeeperMaxRetries=50`. Property names differ between Pulsar releases, so confirm them against the `broker.conf` shipped with your version. Restart the brokers.
- Why it works: Better network stability reduces spurious disconnections, and retries let the broker ride out short glitches, provided it reconnects before the session timeout elapses.
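`ping` measures ICMP round trips, which is not always what the broker experiences. As a complement, this rough sketch times TCP connects to a ZooKeeper client port from the broker host. The function names are illustrative, and the host and port you pass in are your own; nothing here is part of Pulsar or ZooKeeper.

```python
import socket
import time


def probe_tcp(host, port, attempts=5, timeout_s=2.0):
    """Time TCP connects; returns latencies in ms, with None for a failed attempt."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # Open and immediately close a connection to the target port.
            with socket.create_connection((host, port), timeout=timeout_s):
                results.append((time.monotonic() - start) * 1000.0)
        except OSError:
            # Covers timeouts, refused connections, and DNS failures.
            results.append(None)
    return results


def summarize(results):
    """Reduce probe results to a few numbers worth alerting on."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failures": len(results) - len(ok),
        "max_ms": max(ok) if ok else None,
    }
```

Calling `summarize(probe_tcp("zk1.example.internal", 2181))` (a placeholder hostname) from each broker on a schedule gives you a record of failed or slow connects to line up against the timestamps of fencing events.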
3. High Broker CPU or Memory Utilization
- Diagnosis: Monitor CPU and memory usage on your Pulsar brokers. High utilization can cause the JVM to pause for garbage collection for extended periods, making the broker unresponsive to ZooKeeper heartbeats.
- Cause: When a broker is overloaded, its Java Virtual Machine (JVM) might pause for garbage collection (GC). If these pauses exceed the ZooKeeper session timeout, the broker’s session with ZooKeeper will expire, leading to fencing.
- Fix:
  - Increase broker resources: Add more CPU cores or RAM to the affected broker machines.
  - Tune JVM garbage collection: Adjust the JVM flags that the `pulsar` script picks up (e.g., `-XX:+UseG1GC`, `-XX:MaxGCPauseMillis=100`). These are normally set through `PULSAR_MEM` and `PULSAR_GC` in `conf/pulsar_env.sh`, not in `broker.conf`.
  - Scale out Pulsar brokers: Add more broker instances to distribute the load.
- Why it works: Reducing resource contention and GC pause times ensures the broker remains responsive to ZooKeeper, preventing session expiration.
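To check whether GC pauses are actually long enough to threaten the session, you can scan the broker's GC log. The sketch below assumes the JDK unified logging format (`-Xlog:gc`), where pause lines end in a duration such as `35.120ms`; older JVMs log differently, so treat the regular expression as a starting point. The function name and the `warn_fraction` parameter are illustrative.

```python
import re

# Assumed line shape (JDK 9+ unified logging), for example:
# [12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->64M(1024M) 35.120ms
PAUSE_RE = re.compile(r"\bPause\b.*?(\d+(?:\.\d+)?)ms\s*$")


def long_pauses(log_lines, session_timeout_ms, warn_fraction=0.5):
    """Return (pause_ms, line) pairs for pauses at or above a fraction of the session timeout."""
    threshold = session_timeout_ms * warn_fraction
    hits = []
    for line in log_lines:
        match = PAUSE_RE.search(line)
        if match and float(match.group(1)) >= threshold:
            hits.append((float(match.group(1)), line.strip()))
    return hits
```

Any hit at the default `warn_fraction` of 0.5 means a single pause consumed at least half the session timeout, which leaves little room for network delay on top.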
4. ZooKeeper Ensemble Misconfiguration or Overload
- Diagnosis: Check the ZooKeeper server logs for `Too many connections` or `Insufficient memory` errors. Monitor ZooKeeper node CPU, memory, and network I/O.
- Cause: If ZooKeeper itself is overloaded or misconfigured (e.g., `maxClientCnxns` set too low), it might drop connections from brokers even when the network is stable and the brokers are healthy.
- Fix:
  - Increase `maxClientCnxns` in `zoo.cfg`: This limit applies per client IP address. Set it to a higher value on all ZooKeeper nodes, e.g., `maxClientCnxns=1000`. Restart the ZooKeeper servers.
  - Add more ZooKeeper nodes: If the ensemble is undersized, add more ZooKeeper instances to distribute the load.
  - Ensure ZooKeeper disks are fast: ZooKeeper performance is sensitive to disk I/O. Use SSDs for the ZooKeeper data directories.
- Why it works: A more robust ZooKeeper ensemble can handle more concurrent connections and requests without dropping clients.
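Because `maxClientCnxns` is enforced per client IP, it helps to know which hosts are close to the limit before raising it. This is a small sketch that counts connections per IP from a list of `ip:port` remote addresses, which you would collect yourself (for example from `ss` or `netstat` output on a ZooKeeper node). The function names and the 80% headroom value are illustrative; the default limit of 60 is ZooKeeper's documented default.

```python
from collections import Counter


def connections_per_ip(remote_addrs):
    """Count connections per client IP from 'ip:port' strings."""
    # rsplit keeps everything before the last colon as the address part.
    return Counter(addr.rsplit(":", 1)[0] for addr in remote_addrs)


def near_limit(remote_addrs, max_client_cnxns=60, headroom=0.8):
    """Return the IPs whose connection count is at or above headroom * limit."""
    cutoff = max_client_cnxns * headroom
    counts = connections_per_ip(remote_addrs)
    return {ip: n for ip, n in counts.items() if n >= cutoff}
```

An IP that shows up in `near_limit` is the one most likely to have connections refused, which from the broker's side looks the same as a flaky network.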
5. Pulsar Broker Configuration Issues (e.g., session timeout too low)
- Diagnosis: Review `broker.conf` on all brokers. Specifically, check the ZooKeeper session timeout (`zooKeeperSessionTimeoutMillis`, or `metadataStoreSessionTimeoutMillis` in newer releases) and, where your version exposes it, the connection timeout (`zookeeperConnectionTimeoutMs`).
- Cause: If the session timeout is set too low, sessions expire prematurely, especially during brief network hiccups or high broker load.
- Fix:
  - Increase the session timeout: Set it to a value like `60000` (60 seconds) or `120000` (120 seconds) in `broker.conf`, and make sure the ZooKeeper servers' `maxSessionTimeout` allows a value that high. Restart the brokers.
  - Ensure the connection timeout is reasonable: This controls the initial connection attempt time. A value of `15000` (15 seconds) is usually sufficient.
- Why it works: A longer session timeout provides a wider buffer for the broker to maintain its connection to ZooKeeper. The trade-off is slower failover when a broker really has died.
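Reviewing `broker.conf` by hand on every broker gets tedious, so here is a minimal sketch of an audit helper. It parses the file as simple `key=value` lines and reports session timeout settings below a chosen minimum. The key names in `SESSION_KEYS` are assumptions (the name used in this article plus the names I believe Pulsar releases use), so adjust the tuple to match your version.

```python
# Assumed key names; edit to match the broker.conf shipped with your release.
SESSION_KEYS = (
    "zookeeperSessionTimeoutMs",
    "zooKeeperSessionTimeoutMillis",
    "metadataStoreSessionTimeoutMillis",
)


def read_properties(text):
    """Parse key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def low_session_timeouts(text, minimum_ms=30000, keys=SESSION_KEYS):
    """Return {key: value_ms} for session timeout keys set below minimum_ms."""
    props = read_properties(text)
    findings = {}
    for key in keys:
        value = props.get(key, "")
        if value.isdigit() and int(value) < minimum_ms:
            findings[key] = int(value)
    return findings
```

Feeding it the contents of each broker's `broker.conf` (for example `low_session_timeouts(open(path).read())`) should return an empty dict everywhere; any entry points at a broker that will lose its session sooner than its peers.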
6. ZooKeeper Leader Election Issues
- Diagnosis: Examine the ZooKeeper logs for messages related to leader election, such as `New leader elected` or `Follower received a NEW VOTE`.
- Cause: If the ZooKeeper ensemble experiences frequent leader elections (e.g., due to network partitions within the ensemble itself, or a ZooKeeper node crashing), clients are disconnected while the quorum re-forms, and the resulting session expirations lead to fencing.
- Fix:
  - Stabilize the ZooKeeper network: Ensure reliable network connectivity between ZooKeeper nodes.
  - Increase ZooKeeper `electionTimeout`: In `zoo.cfg`, a higher `electionTimeout` (e.g., `10000` instead of `5000`) can prevent premature leader elections during transient network issues. Not every ZooKeeper release recognizes this key; `tickTime` and `syncLimit` are the standard settings that control how long a follower may lag before it is dropped. Restart the ZooKeeper servers.
- Why it works: A more stable ZooKeeper quorum reduces the frequency of disruptive leader elections, preserving broker connections.
If you have not fully resolved the root cause, expect the same `Broker is being fenced` error to return after these changes, possibly on a different partition or topic. That pattern indicates the problem is systemic rather than tied to one topic.