The Pulsar broker failed to assign a topic to a bundle because the bundle’s metadata was out of sync with the topic’s current assignment, which points to metadata corruption or a race condition.
The most common culprit is a race condition during topic creation or re-assignment where a broker attempts to update bundle metadata while another broker is simultaneously trying to modify it. This can leave the bundle in an inconsistent state.
Cause 1: Stale Zookeeper Metadata
- Diagnosis: Check how Zookeeper records the topic’s bundle. With Zookeeper-based bundle ownership, an owned bundle typically appears as an ephemeral node at `/namespace/<tenant>/<namespace>/<bundle-range>` (the exact path can vary with the Pulsar version and the load manager in use). Read it with `zkCli.sh`, for example `get /namespace/<tenant>/<namespace>/<bundle-range>`, and compare the owner recorded there with what the broker reports.
- Fix: If ZK shows an incorrect or missing bundle owner, force the topic to be re-assigned. The usual way is to unload it with `pulsar-admin topics unload <topic>`, or to unload the whole bundle with `pulsar-admin namespaces unload <tenant>/<namespace> --bundle <bundle-range>`; the next lookup then assigns ownership again. As a last resort, you can delete the stale ZK ownership entry and let a broker re-create it (use with extreme caution, and only after backing up ZK).
- Why it works: This forces the broker to re-evaluate the topic’s position and re-establish a consistent bundle assignment in Zookeeper.
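The inspect-then-reset flow for a single topic can be sketched with `pulsar-admin`. This is a minimal sketch, not a definitive procedure: it assumes `pulsar-admin` is on the `PATH` and already configured for your cluster, the function names are hypothetical, and the topic name in the example is a placeholder.

```shell
# Assumption: pulsar-admin is on PATH; override PULSAR_ADMIN to point elsewhere.
PULSAR_ADMIN="${PULSAR_ADMIN:-pulsar-admin}"

# Print the bundle range the topic falls into, then the broker that
# currently serves it (hypothetical helper name).
show_topic_assignment() {
  topic="$1"
  "$PULSAR_ADMIN" topics bundle-range "$topic"
  "$PULSAR_ADMIN" topics lookup "$topic"
}

# Ask the owning broker to release the topic so that the next lookup
# assigns it again (hypothetical helper name).
reset_topic_assignment() {
  topic="$1"
  "$PULSAR_ADMIN" topics unload "$topic"
}

# Example with a placeholder topic name:
# show_topic_assignment persistent://my-tenant/my-ns/my-topic
# reset_topic_assignment persistent://my-tenant/my-ns/my-topic
# show_topic_assignment persistent://my-tenant/my-ns/my-topic
```

Running the inspection before and after the unload makes it easy to see whether ownership moved or was simply re-established on the same broker.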
Cause 2: Broker Cache Inconsistency
- Diagnosis: Brokers maintain internal caches of topic metadata. If a cache becomes stale, the broker can act on an assignment that no longer matches Zookeeper, which leads to conflicts. Suspect this cause when the error is confined to a single broker.
- Fix: Restart the Pulsar broker experiencing the error. For example, if the broker runs under systemd as a unit named `pulsar` (the unit name depends on your installation): `sudo systemctl restart pulsar`.
- Why it works: A broker restart forces it to re-read the authoritative metadata from Zookeeper, discarding any stale cached information.
Cause 3: Concurrent Topic Operations
- Diagnosis: Look for a high rate of topic creation, deletion, or re-assignment operations occurring simultaneously on the same namespace or topics. Search the broker logs for "topic assignment" or "bundle update" messages with overlapping timestamps to find concurrent attempts.
- Fix: Implement throttling or sequential processing for topic management operations if possible. If this is an automated process, add retry logic with exponential backoff.
- Why it works: By serializing or delaying conflicting operations, you prevent the race condition that corrupts bundle metadata.
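For automated topic management, the retry-with-exponential-backoff idea can be sketched as a small shell wrapper. The function name `retry_with_backoff` and its argument order are illustrative assumptions, not part of Pulsar; the wrapper simply re-runs any command whose exit status is non-zero and doubles the wait between attempts.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after every failed attempt. Hypothetical helper, not a Pulsar tool.
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example with a placeholder topic name: up to 5 attempts, waiting 1s, 2s, 4s, 8s.
# retry_with_backoff 5 1 pulsar-admin topics create persistent://my-tenant/my-ns/my-topic
```

Running all topic operations for a namespace through one such wrapper, from a single worker, also serializes them, which addresses the concurrency itself and not only its symptoms.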
Cause 4: Zookeeper Session Expiration/Loss
- Diagnosis: Check Zookeeper logs and the Pulsar broker logs for Zookeeper session expiration or connection loss events. A broker losing its Zookeeper session can lead to it operating with stale information.
- Fix: Ensure Zookeeper is stable and has sufficient resources. Configure the Zookeeper `tickTime`, `initLimit`, and `syncLimit` settings appropriately for your network latency and cluster size. Ensure brokers have stable network connectivity to Zookeeper.
- Why it works: A stable Zookeeper connection ensures brokers always have access to the latest, authoritative metadata.
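To see what a given set of values means in wall-clock terms, the arithmetic can be sketched as follows. `initLimit` and `syncLimit` are expressed in ticks, and Zookeeper’s default session timeout bounds are 2 and 20 times `tickTime` unless `minSessionTimeout` or `maxSessionTimeout` are set explicitly; a session timeout requested by a broker outside those bounds is adjusted by the server. The function name `zk_timing` is a hypothetical helper, and the values in the example are only the ones commonly found in the sample `zoo.cfg`, not a recommendation.

```shell
# zk_timing TICK_TIME_MS INIT_LIMIT SYNC_LIMIT
# Converts tick-based Zookeeper settings into milliseconds.
# Hypothetical helper for reasoning about zoo.cfg values.
zk_timing() {
  tick="$1"
  init="$2"
  sync="$3"
  # Time followers get to connect and sync to the leader.
  echo "initLimit window (ms): $((tick * init))"
  # How far a follower may lag before it is dropped.
  echo "syncLimit window (ms): $((tick * sync))"
  # Default bounds applied to client (broker) session timeouts.
  echo "default min session timeout (ms): $((tick * 2))"
  echo "default max session timeout (ms): $((tick * 20))"
}

# Example: tickTime=2000, initLimit=10, syncLimit=5
# zk_timing 2000 10 5
```

If the brokers’ configured session timeout is larger than the maximum printed here, raising `tickTime` or setting `maxSessionTimeout` explicitly is what allows the longer timeout to take effect.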
Cause 5: Manual Zookeeper Tampering
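- Diagnosis: Review Zookeeper audit logs (if enabled) or perform a thorough manual inspection of the relevant Zookeeper paths for any unauthorized or accidental modifications. For bundles these are typically the ownership nodes under `/namespace/<tenant>/<namespace>/<bundle-range>` and the namespace’s bundle definitions under `/admin/local-policies/<tenant>/<namespace>`; verify the exact paths for your Pulsar version.
- Fix: Restore the Zookeeper data from a known good backup. Implement strict access control policies on Zookeeper.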
- Why it works: This corrects any manual errors by reverting to a known good state and prevents future accidental corruption.
Cause 6: Pulsar Version Bug
- Diagnosis: Check the Pulsar issue tracker for known bugs related to topic assignment, bundle management, or Zookeeper interaction in your specific Pulsar version.
- Fix: Upgrade to a stable, recommended Pulsar version. If a bug is confirmed, apply any provided patches or workarounds.
- Why it works: This resolves underlying code defects that might be causing the inconsistent state.
After resolving the bundle conflict, you’ll likely encounter Topic Not Found errors if the topic itself was in the process of being deleted or created when the conflict occurred, as its existence might be in an ambiguous state.