Pulsar’s tiered storage is designed to be a transparent extension of the cluster’s BookKeeper storage: older data is offloaded to cheaper, long-term storage like S3, and applications keep reading it through the same topic without needing to know where it lives.

Let’s see it in action. Imagine a Pulsar topic my-tenant/my-namespace/my-topic with data accumulating over time. We’ve configured tiered storage to move data older than 1 hour to S3.

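# Produce some data: send the same message 10000 times at up to 1000 msg/s.
# Assumes the namespace retains data (a retention policy or an existing subscription).
pulsar-client produce persistent://my-tenant/my-namespace/my-topic \
  --messages "hello-tiered-storage" \
  --num-produce 10000 \
  --rate 1000 \
  --properties "timestamp=$(date +%s%3N)"

# Wait for the data to age past the 1-hour threshold and be offloaded (65 minutes)
sleep 3900

# Now consume the data from the start of the topic. Pulsar fetches it from S3 if needed.
pulsar-client consume persistent://my-tenant/my-namespace/my-topic \
  --subscription-name my-sub \
  --subscription-position Earliest \
  --num-messages 10000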

When you run the consume command, the output looks the same whether the messages still sit in the cluster or have already been offloaded. At most, the first reads of offloaded data take somewhat longer. Pulsar’s internal mechanisms handle fetching that data from S3 and serving it to the consumer as if it had never moved.

The core problem tiered storage solves is the ever-increasing cost and management overhead of holding all historical data on the fast SSDs or NVMe drives of your BookKeeper bookies, which is where a Pulsar cluster keeps topic data (the brokers themselves are stateless). As your data volume grows, so does the required storage capacity, leading to significant expense. Tiered storage lets you move older, less frequently accessed data to object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage, which is far cheaper per gigabyte.

Here’s how it works internally:

  1. Segment Creation: Brokers write topic data to BookKeeper as a sequence of segments, called ledgers. When the current ledger reaches a configured size, entry count, or age, the broker closes it and opens a new one. Only closed ledgers can be offloaded.
  2. Offloading Process: An offloader plugin running inside the broker that owns the topic handles closed ledgers. When a ledger meets the offload criteria (for example, it is older than the configured time threshold, or the topic’s data in BookKeeper exceeds the size threshold), the offloader uploads it to the configured tiered storage.
  3. Metadata Update: Once the data is successfully uploaded to S3, the ledger’s metadata is updated to record that it has been offloaded. After a configurable deletion lag, the BookKeeper copy is deleted to reclaim disk space on the bookies.
  4. Transparent Access: When a consumer requests data from an offloaded ledger, the broker checks the ledger’s metadata, sees that the data now lives in tiered storage, reads the entries from S3 through a read buffer, and serves them to the consumer. The client application uses the same topic and the same API throughout.
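The four steps above can be sketched as a small in-memory model. This is only an illustration of the ordering (upload, then metadata update, then local deletion, then metadata-driven reads); the `Segment` and `Topic` classes and their method names are hypothetical and do not correspond to Pulsar’s internal classes.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One closed segment (a BookKeeper ledger in Pulsar's terms). Hypothetical model."""
    segment_id: int
    entries: list
    on_local: bool = True     # a copy is still held in the cluster's own storage
    offloaded: bool = False   # metadata flag: a complete copy exists in the object store


@dataclass
class Topic:
    segments: dict = field(default_factory=dict)      # segment_id -> Segment
    object_store: dict = field(default_factory=dict)  # stands in for the S3 bucket

    def offload(self, segment_id):
        seg = self.segments[segment_id]
        # Step 2: upload the closed segment.
        self.object_store[segment_id] = list(seg.entries)
        # Step 3: flip the metadata flag only after the upload has succeeded.
        seg.offloaded = True

    def delete_local_copy(self, segment_id):
        seg = self.segments[segment_id]
        if not seg.offloaded:
            raise RuntimeError("refusing to delete a segment that was never offloaded")
        seg.entries = []
        seg.on_local = False

    def read(self, segment_id):
        seg = self.segments[segment_id]
        # Step 4: the caller never says where to read from; the metadata decides.
        if seg.on_local:
            return list(seg.entries)
        return list(self.object_store[segment_id])


topic = Topic()
topic.segments[1] = Segment(1, ["m1", "m2", "m3"])
before = topic.read(1)          # served from local storage
topic.offload(1)
topic.delete_local_copy(1)
after = topic.read(1)           # served from the object store
print(before == after)
```

The reader calls `read` the same way before and after the offload, which is the property the consumer relies on.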

The key configuration parameters for tiered storage live in conf/broker.conf (or conf/standalone.conf for a standalone instance). You’ll need to select an offload driver with managedLedgerOffloadDriver, point offloadersDirectory at the directory that contains the offloader packages, and specify the S3 settings.

Here’s a snippet of what you’d configure in broker.conf:

# Select the offload driver and the directory holding the offloader packages
managedLedgerOffloadDriver=aws-s3
offloadersDirectory=./offloaders

# S3 configuration
s3ManagedLedgerOffloadBucket=my-pulsar-tiered-bucket
s3ManagedLedgerOffloadRegion=us-east-1
# Only needed for custom or S3-compatible endpoints
# s3ManagedLedgerOffloadServiceEndpoint=https://s3.amazonaws.com

# Credentials are resolved through the AWS default credential chain, for example
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY exported in conf/pulsar_env.sh.
# Prefer IAM roles/instance profiles over static keys for better security.

# Cluster-wide default offload policy (namespace policies override it)
# Time-based threshold in seconds (available in recent Pulsar releases):
managedLedgerOffloadThresholdInSeconds=3600
# Size-based threshold: bytes of a topic kept in BookKeeper before offload starts
# managedLedgerOffloadAutoTriggerSizeThresholdBytes=10737418240
# How long the BookKeeper copy is kept after a successful offload (4 hours here)
managedLedgerOffloadDeletionLagMs=14400000

# Per-namespace override:
# pulsar-admin namespaces set-offload-threshold --size 10G my-tenant/my-namespace
# Manual trigger and status check for a single topic:
# pulsar-admin topics offload --size-threshold 10M persistent://my-tenant/my-namespace/my-topic
# pulsar-admin topics offload-status persistent://my-tenant/my-namespace/my-topic

managedLedgerOffloadThresholdInSeconds determines how long data stays in BookKeeper before it’s eligible for offload, and managedLedgerOffloadAutoTriggerSizeThresholdBytes does the same by size. managedLedgerOffloadDeletionLagMs controls how long the BookKeeper copy survives after a successful offload. Parameter names, and whether the time-based threshold is available at all, depend on your Pulsar version, so check the configuration reference for the release you run. Because only closed ledgers are offloaded, data can stay in BookKeeper longer than the threshold suggests if the current ledger has not rolled over yet.
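To make the threshold behavior concrete, here is a simplified model of how closed ledgers might be selected for offload. It is a sketch under assumptions, not the broker’s actual algorithm: `ClosedLedger` and `select_for_offload` are hypothetical names, the time rule picks ledgers closed longer ago than the threshold, and the size rule picks the oldest ledgers until what remains fits under the threshold.

```python
from typing import NamedTuple, Optional


class ClosedLedger(NamedTuple):
    ledger_id: int
    size_bytes: int
    age_seconds: int   # time since the ledger was closed


def select_for_offload(ledgers, size_threshold_bytes: Optional[int] = None,
                       time_threshold_seconds: Optional[int] = None):
    """Return the IDs of ledgers to offload. `ledgers` must be ordered oldest first."""
    selected = []
    remaining = sum(item.size_bytes for item in ledgers)
    for ledger in ledgers:
        too_old = (time_threshold_seconds is not None
                   and ledger.age_seconds > time_threshold_seconds)
        too_big = (size_threshold_bytes is not None
                   and remaining > size_threshold_bytes)
        if not (too_old or too_big):
            break  # every later ledger is newer, so stop at the first one that stays
        selected.append(ledger.ledger_id)
        remaining -= ledger.size_bytes
    return selected


ledgers = [
    ClosedLedger(1, 400, 7200),
    ClosedLedger(2, 400, 4000),
    ClosedLedger(3, 400, 1800),
]
by_time = select_for_offload(ledgers, time_threshold_seconds=3600)
by_size = select_for_offload(ledgers, size_threshold_bytes=900)
print(by_time, by_size)   # [1, 2] [1]
```

With a one-hour time threshold the two oldest ledgers qualify; with a 900-byte size threshold only the oldest one has to move before the remaining 800 bytes fit.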

The truly interesting part is that Pulsar doesn’t require you to manage the S3 bucket structure directly. It writes its own objects into the bucket, typically a data object and an index object per offloaded ledger, with keys built from a unique ID and the ledger ID rather than from the topic name. The mapping from topics to objects lives in Pulsar’s metadata, so the bucket contents are not meant to be browsed or edited by hand. Pointing multiple Pulsar clusters at the same S3 bucket is possible, but it is an advanced configuration that needs careful thought about data consistency and access control. The offload process itself is designed to be idempotent; if an upload fails and is retried, it won’t duplicate data.
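One way a retried upload can avoid duplicating data is sketched below. The assumptions are loud ones: the key layout (`<uuid>-ledger-<id>` plus an `-index` companion), the `FlakyBucket` class, and `offload_with_retry` are all illustrative, and the real offloader may differ in naming and cleanup details. The idea shown is that each attempt writes under a fresh ID and only a complete upload is recorded in the metadata.

```python
import uuid


def object_keys(ledger_id: int, attempt_id: uuid.UUID):
    """Assumed naming: one data object and one index object per attempt."""
    base = f"{attempt_id}-ledger-{ledger_id}"
    return base, base + "-index"


class FlakyBucket(dict):
    """In-memory stand-in for a bucket whose first N index uploads fail."""

    def __init__(self, failures: int):
        super().__init__()
        self.failures = failures

    def put(self, key, value):
        if key.endswith("-index") and self.failures > 0:
            self.failures -= 1
            raise IOError("simulated upload failure")
        self[key] = value


def offload_with_retry(bucket, metadata, ledger_id, payload, max_attempts=3):
    for _ in range(max_attempts):
        attempt_id = uuid.uuid4()
        data_key, index_key = object_keys(ledger_id, attempt_id)
        try:
            bucket.put(data_key, payload)
            bucket.put(index_key, b"index")
        except IOError:
            # Remove whatever the failed attempt left behind, then retry.
            bucket.pop(data_key, None)
            bucket.pop(index_key, None)
            continue
        # Only a complete upload is recorded; readers follow this pointer.
        metadata[ledger_id] = str(attempt_id)
        return True
    return False


bucket = FlakyBucket(failures=1)
metadata = {}
ok = offload_with_retry(bucket, metadata, ledger_id=42, payload=b"entries")
print(ok, len(bucket))
```

After one failed and one successful attempt, the bucket holds exactly one data object and one index object, both under the ID stored in the metadata.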

While S3 offers cost savings, retrieval latency is significantly higher than reading from the bookies’ SSDs. Pulsar softens this by reading offloaded ledgers in large blocks through a read buffer rather than entry by entry, and the offload block size and read buffer size are tuning parameters that affect performance. If your workload frequently re-reads very old data, you will hit S3 latency more often, and you’ll need to adjust those settings, raise the offload threshold so that more data stays in BookKeeper, or reconsider how much data you really need to keep in tiered storage versus re-ingesting.

The next challenge you’ll likely encounter is managing the lifecycle of data within S3, such as implementing S3’s own lifecycle policies to move data to Glacier for even deeper cost savings after a certain period.
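As a starting point, the sketch below builds an S3 lifecycle configuration document in the JSON shape the S3 API expects. The 180-day period, the rule ID, and the empty prefix are illustrative assumptions. The storage class is a parameter because the choice matters: objects in archival classes that require a restore before they can be read (for example GLACIER) cannot be fetched by an ordinary GET, so the broker could not read them transparently, whereas GLACIER_IR remains directly readable.

```python
import json


def lifecycle_rule(days: int, storage_class: str, prefix: str = ""):
    """Build a lifecycle configuration with a single transition rule."""
    return {
        "Rules": [
            {
                "ID": f"transition-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


config = lifecycle_rule(days=180, storage_class="GLACIER_IR")
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration --bucket my-pulsar-tiered-bucket --lifecycle-configuration file://lifecycle.json`.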
