A Pulsar cluster upgrade can be performed with zero downtime by carefully orchestrating a rolling restart of its components.

Let’s see this in action. Imagine we have a Pulsar cluster with ZooKeeper, BookKeeper, and the Pulsar brokers. We need to upgrade the Pulsar broker version from 2.8.0 to 2.9.0.

Here’s a simplified view of the components and their interaction during a rolling restart:

  1. ZooKeeper: The source of truth for cluster metadata, including broker registration and topic ownership. It’s usually upgraded first, and because it’s a distributed consensus system, it can tolerate individual node failures during an upgrade.
  2. BookKeeper: The distributed log storage system that Pulsar uses for message persistence. BookKeeper nodes (bookies) are stateful: they hold the actual message data, while the metadata describing that data lives in ZooKeeper.
  3. Pulsar Brokers: The stateless serving layer that handles client requests, topic management, and coordination with ZooKeeper and BookKeeper.

The strategy is to upgrade components one by one, ensuring that the cluster remains available to clients throughout the process.

The Rolling Restart Strategy

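The general principle is to upgrade the stateful components that everything else depends on first, followed by the stateless ones: ZooKeeper, then the bookies, then the brokers. Upgrading in this order keeps metadata and message data accessible to the brokers while they are being restarted.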

1. ZooKeeper Upgrade

While not strictly part of the "Pulsar" cluster, ZooKeeper is its backbone. A ZooKeeper cluster upgrade is a prerequisite. The process typically involves upgrading ZooKeeper nodes one by one, allowing each node to rejoin the ensemble.

Diagnosis: Check ZooKeeper ensemble health:

echo 'stat' | nc <zookeeper-ip> 2181

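Look for Mode: leader on exactly one node and Mode: follower on all the others. On ZooKeeper 3.5.3 and later, stat only answers if it is listed in 4lw.commands.whitelist; srvr is allowed by default and reports the same Mode line.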
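To script this check across nodes, the Mode line can be pulled out of the command output. The helper below is a minimal sketch: zk_mode is a name invented for this example, and it only parses text from stdin, so it makes no assumption about how the node is reached.

```shell
# Print the value of the "Mode:" line from ZooKeeper stat/srvr output on stdin.
# Exits non-zero when no Mode line is present (for example, the node is not serving).
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2; found = 1 } END { exit (found ? 0 : 1) }'
}

# Possible usage against a live node:
#   echo 'stat' | nc <zookeeper-ip> 2181 | zk_mode
```

A node that answers without a Mode line has not rejoined the ensemble yet, so the non-zero exit is a reasonable signal to wait before touching the next node.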

Fix: Follow the official ZooKeeper upgrade guide for your specific versions. Generally, it involves stopping a ZooKeeper node, upgrading its binaries, and restarting it. Repeat for each node, ensuring the ensemble remains available.

Why it works: ZooKeeper’s consensus protocol (Zab) tolerates the temporary absence of a minority of its nodes. As long as a quorum (a strict majority of the ensemble) is available, the ensemble remains functional, which is why nodes are restarted one at a time.
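The quorum arithmetic is worth making explicit before you take a node down. This is a small sketch of the majority rule; the function names are made up for illustration.

```shell
# Smallest number of nodes that forms a majority of an ensemble of size $1.
zk_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can be offline at the same time while a quorum survives.
zk_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

A 3-node ensemble tolerates one node being down, so a one-at-a-time restart leaves no spare capacity for an unrelated failure; a 5-node ensemble tolerates two.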

2. BookKeeper Upgrade

BookKeeper nodes (bookies) are stateful as they hold the actual message data. However, Pulsar brokers are designed to handle bookie failures gracefully.

Diagnosis: Check bookie health:

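./bookkeeper shell listbookies -rw

This command lists the bookies currently registered as read-write. A bookie that is down, or that has dropped to read-only mode, will be missing from the list. To exercise a single bookie, run ./bookkeeper shell bookiesanity on that node; it writes a small test ledger to the local bookie and reads it back. Also check the bookie logs for errors, in particular lost ZooKeeper sessions and network issues.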

Fix: Upgrade BookKeeper nodes one at a time.

  1. Stop the bookie cleanly. If it was started with the pulsar-daemon script:
    ./pulsar-daemon stop bookie
    
    A clean shutdown lets the bookie flush its data and drop its registration in ZooKeeper. Brokers that were writing to it see the failure and replace it with another bookie for their open ledgers.
  2. Upgrade BookKeeper binaries on the stopped node.
  3. Restart the bookie.
  4. Verify its health before proceeding to the next bookie.
  5. Repeat for all bookies.
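
The per-bookie loop can be sketched as a dry-run script. Several things here are assumptions, not part of the procedure above: the bookies are reachable over ssh, they are managed with the pulsar-daemon script, and upgrade_binaries is a hypothetical placeholder for however you install the new release. By default the script only prints what it would do.

```shell
# Print commands when DRY_RUN is 1 (the default); execute them otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Upgrade a single bookie on host $1; stops at the first failing step.
upgrade_bookie() {
  bk_host=$1
  run ssh "$bk_host" bin/pulsar-daemon stop bookie &&
    run ssh "$bk_host" upgrade_binaries &&   # hypothetical install step
    run ssh "$bk_host" bin/pulsar-daemon start bookie &&
    run ssh "$bk_host" bin/bookkeeper shell bookiesanity
}

# Walk the hosts one at a time and abort if any bookie fails its check.
rolling_bookie_upgrade() {
  for h in "$@"; do
    upgrade_bookie "$h" || return 1
  done
}

# Possible usage: rolling_bookie_upgrade bookie-1 bookie-2 bookie-3
```

Stopping on the first failure matters more than the exact commands: continuing past an unhealthy bookie is how a rolling upgrade turns into an outage.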

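Why it works: Ledger entries are replicated across several bookies, so data held on a restarting bookie can still be read from its replicas. Brokers detect the bookie’s unavailability and direct new writes to other bookies, provided enough bookies remain to satisfy the configured ensemble size and write quorum. If a bookie stays offline, BookKeeper’s AutoRecovery process re-replicates its ledger data to other available bookies; some operators temporarily disable AutoRecovery during planned restarts to avoid copying data for a bookie that will be back within minutes. Once the upgraded bookie rejoins, it serves reads for the data it holds and accepts new writes.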

3. Pulsar Broker Upgrade

Pulsar brokers are stateless and register themselves with ZooKeeper. This makes them the easiest to upgrade in a rolling fashion.

Diagnosis: Check broker registration in ZooKeeper:

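./pulsar zookeeper-shell -server <zookeeper-ip>:2181 ls /loadbalance/brokers

This lists all registered brokers. The same information is available through the admin API with ./pulsar-admin brokers list <cluster-name>. Also, check the broker logs for any connection errors to ZooKeeper or BookKeeper.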
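
To turn the listing into a pass/fail check for one broker, its output can be filtered. The helper below is a sketch that assumes the listing contains broker addresses either one per line (optionally quoted) or as a bracketed, comma-separated list; broker_registered is a name invented here.

```shell
# Succeed if broker address $1 appears in the broker listing read from stdin.
# Strips quotes, brackets, and spaces, then splits on commas before matching
# the whole address exactly.
broker_registered() {
  tr -d '"[] ' | tr ',' '\n' | grep -Fxq -- "$1"
}

# Possible usage:
#   ./pulsar-admin brokers list <cluster-name> | broker_registered <broker-host>:8080
```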

Fix: Upgrade Pulsar brokers one by one.

  1. Stop a broker instance.
  2. Upgrade its configuration files and binaries to the new version (2.9.0).
  3. Restart the broker. It will re-register with ZooKeeper and start serving traffic.
  4. Monitor its health by checking its registration in ZooKeeper and observing client traffic.
  5. Repeat for all broker instances.
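
Step 4 is the gate that keeps the restart rolling instead of simultaneous: do not stop the next broker until the previous one is healthy. A generic polling helper is one way to express that gate. This is a sketch; wait_until and its argument order are invented for this example, and the health check itself is whatever command you trust.

```shell
# Retry a command until it succeeds.
#   $1 = maximum number of attempts (the command always runs at least once)
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && return 0                      # healthy: stop polling
    [ "$i" -ge "$attempts" ] && return 1  # out of attempts: report failure
    i=$((i + 1))
    sleep "$delay"
  done
}

# Possible usage, with check_broker standing in for your own health check:
#   wait_until 30 10 check_broker <broker-host> || exit 1
```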

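Why it works: When a broker stops, it gives up ownership of its topics and its registration in ZooKeeper disappears. Clients that were connected to it reconnect, perform a new lookup, and are directed to whichever broker takes over each topic. When the upgraded broker restarts, it re-registers and becomes eligible for topic assignments again; the load balancer shifts load onto it over time, not instantly. Because message data lives in BookKeeper and not on the broker, no data is lost during this transition, although affected clients see a brief reconnect.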

Considerations for Zero Downtime

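  • Client Libraries: Ensure your client libraries are compatible with both the old and new broker versions, because clients will talk to a mix of both during the rolling upgrade. Clients and brokers negotiate a protocol version when connecting, so existing clients generally keep working against upgraded brokers, but check the release notes for the versions involved before you start.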
  • Configuration Management: Use a robust configuration management system (Ansible, Chef, Puppet) to ensure consistent upgrades across all nodes.
  • Monitoring: Keep a close eye on key metrics like latency, error rates, and ZooKeeper/BookKeeper health during the entire process.
  • Rollback Plan: Always have a clear rollback plan in case of unexpected issues.

After successfully upgrading all Pulsar brokers, remember the tooling around them. If your pulsar-admin client is still from the older version, commands that rely on features only present in the newer Pulsar version can fail, so upgrade pulsar-admin along with the brokers.
