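Your Pulsar topic is fenced because the broker serving it can no longer safely write to it, most often because that broker lost ownership of the topic (or another broker took over) while it was still handling traffic. Fencing is a safety mechanism: it blocks writes from the stale owner so that messages are not lost or duplicated. It is still a serious symptom, because producers see errors until ownership is re-established.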

Here are the common causes and how to fix them:

1. ZooKeeper Session Expiration

  • Diagnosis: Check the Pulsar broker logs for messages like ZooKeeper session expired or Connection to ZooKeeper lost. Also, check ZooKeeper server logs for client disconnections.
  • Cause: The broker lost its connection to ZooKeeper, its source of truth for ownership and metadata, for longer than the session timeout. When a session expires, ZooKeeper deletes the ephemeral nodes that record what the broker owns, so another broker can claim those topics and the original broker is fenced.
  • Fix:
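    • Raise the ZooKeeper server's session timeout ceiling: A ZooKeeper server only grants session timeouts between minSessionTimeout (default 2 × tickTime) and maxSessionTimeout (default 20 × tickTime). With the default tickTime=2000 the ceiling is 40 seconds, so a broker that requests more still gets 40 seconds. To allow longer sessions, set maxSessionTimeout in zoo.cfg (e.g., maxSessionTimeout=60000) or raise tickTime (e.g., from tickTime=2000 to tickTime=4000). initLimit and syncLimit are counted in ticks and govern follower synchronization, not client sessions; raising tickTime lengthens them too. Restart ZooKeeper servers.
    • Increase Pulsar broker ZooKeeper session timeout: In broker.conf, set zooKeeperSessionTimeoutMillis (metadataStoreSessionTimeoutMillis in newer Pulsar releases) to a value within the range the ZooKeeper servers allow, e.g., 60000 (60 seconds). Restart Pulsar brokers.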
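    • Why it works: A longer session timeout gives the broker more time to re-establish its connection to ZooKeeper before its session is considered expired, preventing ownership loss. The trade-off is slower failover: a broker that has really died keeps its topics until the longer timeout elapses.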
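The session timeout a broker asks for is not necessarily the one it gets, because the ZooKeeper server clamps the request to its own bounds. A minimal sketch of that rule, assuming the documented defaults of 2 × tickTime and 20 × tickTime when minSessionTimeout and maxSessionTimeout are not set (the function name is illustrative, not part of any ZooKeeper or Pulsar API):

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Return the session timeout a ZooKeeper server would grant.

    Bounds default to 2x and 20x tickTime when not set explicitly.
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    # Clamp the client's request into [lower, upper].
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # A broker asking for 60 s against tickTime=2000 is capped at 40 s.
    print(negotiated_session_timeout_ms(60000, 2000))  # 40000
    # With tickTime=4000 the ceiling is 80 s, so 60 s is granted as requested.
    print(negotiated_session_timeout_ms(60000, 4000))  # 60000
```

Running the same check with your own zoo.cfg values shows whether a broker-side increase will actually take effect.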

2. Network Instability Between Brokers and ZooKeeper

  • Diagnosis: Monitor network latency and packet loss between your Pulsar brokers and ZooKeeper nodes. Tools like ping and mtr can help. Look for intermittent connectivity drops in broker logs.
  • Cause: Packet loss or latency spikes delay the heartbeats between the broker and ZooKeeper. If no heartbeat gets through within the session timeout, ZooKeeper expires the session (as described above) even though the broker process itself is healthy, and topic ownership can move to another broker.
  • Fix:
    • Improve network reliability: Address underlying network issues. This might involve checking physical cabling, network switch configurations, or routing.
    • Configure ZooKeeper client retry policy: In broker.conf, adjust zookeeperRetryWaitMs and zookeeperMaxRetries. For example, zookeeperRetryWaitMs=1000 and zookeeperMaxRetries=50. These key names may not exist in every Pulsar release, so verify them against the broker.conf shipped with your version before relying on them. Restart brokers.
    • Why it works: Better network stability reduces spurious disconnections. Retries let the broker recover from temporary network glitches, provided it reconnects before the session timeout elapses; retries do not extend the session itself.
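Alongside ping and mtr, it can help to time TCP connections to the ZooKeeper client port from the broker host itself, since that is the path the broker uses. The following is a rough sketch, not a monitoring tool; the function names, the hostname in the usage note, and the one-second interval are illustrative assumptions.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout_s=2.0):
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None


def summarize(samples):
    """Summarize probe results; None entries count as failed attempts."""
    ok = sorted(s for s in samples if s is not None)
    return {
        "attempts": len(samples),
        "failed": len(samples) - len(ok),
        "max_ms": ok[-1] if ok else None,
    }


def probe(host, port, attempts=10, interval_s=1.0):
    """Probe the port repeatedly and return a summary dictionary."""
    samples = []
    for _ in range(attempts):
        samples.append(tcp_connect_ms(host, port))
        time.sleep(interval_s)
    return summarize(samples)
```

For example, probe("zk1.example.internal", 2181) (a hypothetical hostname) run from each broker gives a quick view of failed attempts and worst-case connect time; any failures or multi-second maxima are worth correlating with the broker logs.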

3. High Broker CPU or Memory Utilization

  • Diagnosis: Monitor CPU and memory usage on your Pulsar brokers, and check the broker GC logs for pauses that approach the ZooKeeper session timeout. A broker that is paused cannot send ZooKeeper heartbeats.
  • Cause: When a broker is overloaded, its Java Virtual Machine (JVM) might pause for garbage collection (GC). If these pauses exceed the ZooKeeper session timeout, the broker’s session with ZooKeeper will expire, leading to fencing.
  • Fix:
    • Increase broker resources: Add more CPU cores or RAM to the affected broker machines.
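    • Tune JVM garbage collection: Set JVM flags through the PULSAR_GC variable (and heap size through PULSAR_MEM) in conf/pulsar_env.sh rather than in broker.conf, e.g., -XX:+UseG1GC -XX:MaxGCPauseMillis=100. Note that MaxGCPauseMillis is a target the collector aims for, not a guarantee.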
    • Scale out Pulsar brokers: Add more broker instances to distribute the load.
    • Why it works: Reducing resource contention and GC pause times ensures the broker remains responsive to ZooKeeper, preventing session expiration.
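To check whether GC pauses are actually long enough to threaten the session, you can scan the broker's GC log for pauses above a threshold. This sketch assumes the JDK unified logging format (-Xlog:gc), where pause lines end with a duration such as 35.123ms; the sample lines in the usage note are illustrative, not taken from a real broker, and older JDK log formats would need a different pattern.

```python
import re

# Matches unified-logging pause lines, capturing the trailing duration in ms.
PAUSE_RE = re.compile(r"\bPause\b.*?(\d+(?:\.\d+)?)ms\s*$")


def long_pauses(lines, threshold_ms):
    """Return (duration_ms, line) pairs for GC pauses at or above threshold_ms."""
    result = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match is None:
            continue  # Not a pause line (e.g., a concurrent phase).
        duration_ms = float(match.group(1))
        if duration_ms >= threshold_ms:
            result.append((duration_ms, line.rstrip()))
    return result


def scan_file(path, threshold_ms):
    """Convenience wrapper that reads a GC log file from disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return long_pauses(handle, threshold_ms)
```

A reasonable threshold is some fraction of the session timeout, for instance 10000 for a 30-second session. Given the illustrative lines "[1.0s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 35.123ms" and "[9.0s][info][gc] GC(2) Pause Full (G1 Compaction Pause) 1000M->900M(1024M) 31250.500ms", only the second is reported at that threshold.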

4. ZooKeeper Ensemble Misconfiguration or Overload

  • Diagnosis: Check ZooKeeper server logs for Too many connections warnings or OutOfMemoryError entries. Monitor ZooKeeper node CPU, memory, disk latency, and network I/O.
  • Cause: If ZooKeeper itself is overloaded or misconfigured (e.g., insufficient maxClientCnxns), it might drop connections from brokers, even if the network is stable and brokers are healthy.
  • Fix:
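    • Increase maxClientCnxns in zoo.cfg: This setting limits concurrent connections from a single client IP address (default 60), so it matters most when several brokers or other clients share one address. Set it to a higher value on all ZooKeeper nodes, e.g., maxClientCnxns=1000. Restart ZooKeeper servers.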
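    • Add more ZooKeeper nodes: If the ensemble is undersized, grow it to an odd number of servers (3 or 5 is typical). Extra servers spread client connections and reads, but every write must still be acknowledged by a quorum, so a larger ensemble does not increase write throughput.
    • Ensure ZooKeeper disks are fast: ZooKeeper performance is sensitive to disk I/O. Use SSDs for ZooKeeper data directories, and put the transaction log (dataLogDir) on a dedicated device where possible.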
    • Why it works: A more robust ZooKeeper ensemble can handle more concurrent connections and requests without dropping clients.
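ZooKeeper's own counters are a quick way to confirm overload. The sketch below sends the mntr four-letter command and parses its tab-separated output. It assumes mntr is enabled through 4lw.commands.whitelist in zoo.cfg (recent ZooKeeper versions allow only srvr by default); the function names are illustrative.

```python
import socket


def four_letter_word(host, port, command, timeout_s=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break  # Server closes the connection after replying.
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse mntr output (one 'key<TAB>value' pair per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats
```

For example, parse_mntr(four_letter_word("localhost", 2181, "mntr")) returns a dictionary in which zk_num_alive_connections, zk_outstanding_requests, and zk_avg_latency are the values most relevant here; sustained growth in outstanding requests or latency suggests the ensemble, not the brokers, is the bottleneck.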

5. Pulsar Broker Configuration Issues (e.g., zooKeeperSessionTimeoutMillis too low)

  • Diagnosis: Review broker.conf on all brokers. Specifically, check zooKeeperSessionTimeoutMillis and zooKeeperOperationTimeoutSeconds (metadataStoreSessionTimeoutMillis and metadataStoreOperationTimeoutSeconds in newer Pulsar releases), and confirm the values are the same on every broker.
  • Cause: If the session timeout is set too low, sessions expire prematurely, especially during brief network hiccups or high broker load.
  • Fix:
    • Increase the session timeout: Set it to a value like 60000 (60 seconds) or 120000 (120 seconds) in broker.conf, and confirm that the ZooKeeper servers' maxSessionTimeout allows a value that high. Restart brokers.
    • Ensure the operation timeout is reasonable: This bounds how long the broker waits for a single metadata operation. The default of 30 seconds is usually sufficient.
    • Why it works: A longer session timeout provides a wider buffer for the broker to maintain its connection to ZooKeeper.
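Reviewing broker.conf by hand across many brokers is error-prone, so a small script can pull out the session timeout and flag low values. This is a sketch under assumptions: it looks for the key names zooKeeperSessionTimeoutMillis and metadataStoreSessionTimeoutMillis, treats negative values as unset, and uses an arbitrary 30-second floor; adjust all three to your Pulsar version and environment.

```python
# Key names assumed here; check them against your Pulsar version.
SESSION_KEYS = ("zooKeeperSessionTimeoutMillis", "metadataStoreSessionTimeoutMillis")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def session_timeouts(props):
    """Return the session timeout keys that are present with integer values."""
    found = {}
    for key in SESSION_KEYS:
        try:
            found[key] = int(props[key])
        except (KeyError, ValueError):
            continue
    return found


def too_low(props, floor_ms=30000):
    """Return configured timeouts below floor_ms (negative means unset)."""
    return {k: v for k, v in session_timeouts(props).items() if 0 <= v < floor_ms}
```

Calling too_low(parse_properties(open("conf/broker.conf").read())) on each broker and comparing the results also reveals brokers whose settings have drifted apart.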

6. ZooKeeper Leader Election Issues

  • Diagnosis: Examine ZooKeeper logs for signs of repeated leader election, such as servers entering the LOOKING state and then settling into LEADING or FOLLOWING. Exact log wording varies by ZooKeeper version.
  • Cause: If the ZooKeeper ensemble experiences frequent leader elections (e.g., due to network partitions within the ZK ensemble itself, or a ZK node crashing), this can disrupt broker connections and cause leadership fencing.
  • Fix:
    • Stabilize ZooKeeper network: Ensure reliable network connectivity between ZooKeeper nodes.
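    • Give followers more slack: zoo.cfg has no electionTimeout setting; quorum timing is governed by tickTime, initLimit, and syncLimit. A follower that falls more than syncLimit ticks behind the leader is dropped, and a leader that loses its quorum of followers triggers a new election. A higher syncLimit (e.g., 10 instead of 5) can prevent premature leader elections during transient network issues. Restart ZooKeeper servers.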
    • Why it works: A more stable ZooKeeper quorum reduces the frequency of disruptive leader elections, preserving broker connections.
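One way to measure how often leadership moves is to capture each server's srvr output at regular intervals (for example with echo srvr | nc host 2181) and count how often the node reporting Mode: leader changes. The sketch below works on already captured text; the function names and the shape of the polls structure are illustrative assumptions.

```python
def parse_mode(srvr_text):
    """Extract the value of the 'Mode:' line from srvr output, or None."""
    for line in srvr_text.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def current_leader(poll):
    """poll maps node name -> captured srvr text; return the single leader."""
    leaders = [node for node, text in poll.items() if parse_mode(text) == "leader"]
    # No leader (election in progress or node unreachable) yields None.
    return leaders[0] if len(leaders) == 1 else None


def count_leader_changes(polls):
    """Count how often the observed leader differs from the previous one."""
    changes = 0
    last = None
    for poll in polls:
        leader = current_leader(poll)
        if leader is None:
            continue  # Skip polls taken while no leader was visible.
        if last is not None and leader != last:
            changes += 1
        last = leader
    return changes
```

In a healthy ensemble the count should stay at zero for days at a time; several changes within an hour point to instability inside the ZooKeeper quorum rather than on the Pulsar side.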

If the root cause is not fully resolved after these changes, expect the Broker is being fenced error to return, possibly on a different partition or topic. That pattern indicates the problem is systemic rather than tied to a single topic.
