A "transaction not found" error means the Pulsar transaction coordinator cannot find metadata for a transaction it is asked about, which leads to transaction timeouts and failed commits or aborts.

Common Causes and Fixes

1. Insufficient Transaction Log Disk Space

  • Diagnosis: Check disk usage on the bookie nodes that hold the transaction log. The coordinator persists its log in BookKeeper ledgers, so the disks that matter are the bookies’ journal and ledger directories rather than the brokers’ local disks.
    df -h /path/to/bookkeeper/data
    
  • Fix: Free up disk space or expand the storage on the affected bookie nodes. For example, if df -h shows /dev/sda1 mounted on /data at 95% usage, shorten retention so BookKeeper can garbage-collect old ledgers, or add disks. Avoid deleting files from the ledger directories by hand, since that can destroy data the cluster still needs.
  • Why it works: The transaction log needs free space for new entries. When a bookie’s disk crosses its usage threshold the bookie stops accepting writes, so the coordinator can no longer persist or confirm transaction states.
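
The disk check is easy to script so it can run on each bookie before usage becomes critical. The sketch below assumes a POSIX df -P; the function names and the 85% default threshold are illustrative and not part of Pulsar or BookKeeper.

```shell
# Read `df -P` output on stdin and print the "Capacity" column
# (e.g. 95%) of the first filesystem, without the percent sign.
usage_pct_from_df() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem holding PATH is at or above THRESHOLD percent.
# Returns 0 when at/over the threshold, 1 when under, 2 when df gave no answer.
check_disk() {
  local path="$1" threshold="${2:-85}" pct
  pct="$(df -P "$path" 2>/dev/null | usage_pct_from_df)"
  [ -n "$pct" ] || return 2
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: $path is at ${pct}% (threshold ${threshold}%)"
    return 0
  fi
  return 1
}

# Example (the path is a placeholder for your bookie data directory):
#   check_disk /path/to/bookkeeper/data 85
```

Parsing is kept separate from the df call so the threshold logic can be exercised with canned df output.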

2. BookKeeper Ensemble Size Too Small for Transaction Load

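  • Diagnosis: Examine the ensemble and quorum settings the brokers use when they create ledgers. In a stock deployment these are broker-side settings (managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, managedLedgerDefaultAckQuorum) rather than entries in bookkeeper.conf. If the ensemble is small and your transaction volume is high, every metadata write lands on the same few bookies, and losing one of them can leave too few writable bookies to satisfy the quorum.
    # In broker.conf on a broker node (the path depends on your install)
    grep -E '^managedLedgerDefault(EnsembleSize|WriteQuorum|AckQuorum)=' /etc/pulsar/broker.conf
    
  • Fix: Add bookies if needed, raise managedLedgerDefaultEnsembleSize so writes are striped across more of them, and restart the brokers. Keep ensemble size ≥ write quorum ≥ ack quorum, and make sure the cluster has at least as many writable bookies as the ensemble size. Raise the write and ack quorums only if you need more copies of each entry, because every extra copy adds write load.
  • Why it works: A larger ensemble spreads entries over more bookies, and a cluster with spare bookies can replace a failed one without stalling writes, so transaction metadata can still be stored durably under heavy load.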
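
Whatever values you choose, BookKeeper expects ensemble size ≥ write quorum ≥ ack quorum. The following sketch reads three values from a properties-style config file and checks that ordering. The function names are hypothetical, and the key names in the usage comment are an assumption about a stock broker.conf; substitute the file and keys your deployment actually uses.

```shell
# Print the value of the last uncommented "key=value" line for KEY in FILE.
conf_value() {
  awk -F= -v key="$2" '$1 == key { val = $2 } END { if (val != "") print val }' "$1"
}

# Succeed only if ensemble >= write quorum >= ack quorum >= 1.
quorum_is_valid() {
  local ensemble="$1" write_quorum="$2" ack_quorum="$3"
  [ "$ensemble" -ge "$write_quorum" ] &&
    [ "$write_quorum" -ge "$ack_quorum" ] &&
    [ "$ack_quorum" -ge 1 ]
}

# Example (file path and key names are assumptions):
#   conf=/etc/pulsar/broker.conf
#   quorum_is_valid \
#     "$(conf_value "$conf" managedLedgerDefaultEnsembleSize)" \
#     "$(conf_value "$conf" managedLedgerDefaultWriteQuorum)" \
#     "$(conf_value "$conf" managedLedgerDefaultAckQuorum)" \
#     || echo "quorum settings are inconsistent"
```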

3. Transaction Coordinator Not Properly Initialized or Recovered

  • Diagnosis: Check the Pulsar broker logs for messages related to transaction coordinator initialization or recovery failures. Look for errors like TransactionCoordinatorImpl: Failed to initialize or TransactionMetadataStore: Failed to recover.
    # On a broker node, tailing logs
    tail -f /var/log/pulsar/pulsar-broker.log | grep "TransactionCoordinator"
    
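  • Fix: Restart the Pulsar brokers. If the issue persists, verify that transactions are enabled on every broker (transactionCoordinatorEnabled=true in broker.conf) and that the transaction coordinator metadata was initialized for the cluster (bin/pulsar initialize-transaction-coordinator-metadata). In severe cases, you may need to re-initialize the transaction metadata store; this is a destructive operation and requires careful planning.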
  • Why it works: The transaction coordinator relies on a persistent metadata store. If this store is unavailable or corrupted, the coordinator cannot function. A restart can sometimes resolve transient issues, while manual intervention is needed for deeper problems.
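
Tailing a live log only shows new lines, so it can help to scan recent history as well. This is a minimal sketch that assumes the broker log carries the level as a separate word (WARN or ERROR), as the default log4j pattern does; the function name and the 5000-line window are arbitrary choices.

```shell
# Print WARN/ERROR lines that mention transactions from the last
# LINES lines of LOG (default: 5000).
txn_problems() {
  local log="$1" lines="${2:-5000}"
  tail -n "$lines" "$log" | grep -wE 'WARN|ERROR' | grep -i 'transaction'
}

# Example, using the log path from the command above:
#   txn_problems /var/log/pulsar/pulsar-broker.log | tail -n 20
```

Matching on "transaction" case-insensitively is deliberately broad, because the exact class names and messages vary between Pulsar versions.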

4. Network Partition or Latency Between Brokers and BookKeeper

  • Diagnosis: Monitor network connectivity and latency between your Pulsar brokers and BookKeeper nodes. Use ping and traceroute from broker to bookie, and check network metrics for packet loss.
    # From a broker node to a bookie node
    ping <bookie_ip>
    traceroute <bookie_ip>
    
  • Fix: Address any network issues, such as firewall misconfigurations, faulty network hardware, or insufficient bandwidth. Ensure that brokers and bookies can communicate reliably with low latency.
  • Why it works: The transaction coordinator frequently interacts with BookKeeper to store and retrieve transaction state. High latency or lost packets can cause these operations to time out, leading to the "transaction not found" error.
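
ping confirms basic reachability but says nothing about the bookie port itself, which a firewall can block independently. The sketch below adds a packet-loss parser and a TCP port probe. It assumes bash (for /dev/tcp), the coreutils timeout command, and the default bookie port 3181; the function names are illustrative.

```shell
# Read `ping` summary output on stdin and print the packet-loss percentage.
packet_loss_pct() {
  sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Succeed if a TCP connection to HOST:PORT opens within 3 seconds.
# Host and port are passed to the inner bash as $0 and $1.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example, run from a broker node:
#   ping -c 5 <bookie_ip> | packet_loss_pct
#   port_open <bookie_ip> 3181 || echo "bookie port unreachable"
```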

5. Incorrect Transaction Timeout Configuration

  • Diagnosis: Review the timeout your application sets when it opens a transaction. The timeout is chosen by the client (in the Java client, withTransactionTimeout on the builder returned by PulsarClient.newTransaction()), not by a broker setting named transactionTimeoutMinutes. If the call is absent, the client default applies.
    # In your application source tree
    grep -rn "withTransactionTimeout" /path/to/your/app/src
    
  • Fix: Pass a larger timeout when creating transactions (e.g., 15 or 30 minutes, depending on your application’s needs) and redeploy the client.
  • Why it works: If your transactions legitimately take longer than the configured timeout, the coordinator aborts them, and later operations on the same transaction ID fail. Increasing the timeout allows longer-running transactions to complete.

6. Pulsar Version Bugs or Known Issues

  • Diagnosis: Check the Pulsar release notes and issue tracker for known bugs related to transactions in your current Pulsar version.
    # Example of checking GitHub issues
    # Search Pulsar GitHub issues for "transaction not found" and your version
    
  • Fix: Upgrade to a newer, stable Pulsar version that has addressed the relevant transaction bugs.
  • Why it works: Specific versions might have regressions or bugs in the transaction management logic that are resolved in later releases.
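
Once the release notes tell you which version contains a fix, a version comparison can flag brokers that still need the upgrade. This sketch relies on GNU sort -V; the function name is hypothetical, and the version numbers in the usage comment are placeholders, not real fix versions.

```shell
# Succeed when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (both version numbers are placeholders):
#   running="3.0.1"; fixed_in="3.0.2"
#   version_at_least "$running" "$fixed_in" || echo "upgrade needed"
```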

7. BookKeeper Bookie Node Failure During Transaction Commit/Abort

  • Diagnosis: Examine Pulsar broker logs and BookKeeper logs for any bookie nodes becoming unavailable or reporting errors during periods of transaction activity.
    # On a broker node, tailing logs
    tail -f /var/log/pulsar/pulsar-broker.log | grep "Failed to write ledger"
    # On a bookie node, tailing logs
    tail -f /var/log/bookkeeper/bookkeeper.log
    
  • Fix: Ensure your BookKeeper ensemble is healthy and has sufficient redundant nodes. If a bookie fails, it should be replaced or restarted, and BookKeeper’s recovery process should handle the metadata.
  • Why it works: Transactions are persisted in BookKeeper. If a bookie holding critical transaction metadata fails before the data is replicated, the transaction state can be lost, leading to "not found" errors on subsequent attempts to query it.

If the underlying cluster instability isn’t fully resolved, the next errors you’ll likely encounter are topic lookup failures or producer and consumer connection issues.
