Pulsar’s topic compaction does retain the latest value per key, but only as of the most recent compaction run, and only for consumers that ask for the compacted view. Anything written after that run is still delivered in full.

Let’s watch it happen. Imagine a Pulsar topic called persistent://public/default/my-topic. We’re going to write a sequence of key-value pairs to it.

# Write key "user-1", value "data-v1"
pulsar-client produce my-topic --key user-1 --messages data-v1

# Write key "user-2", value "data-v2"
pulsar-client produce my-topic --key user-2 --messages data-v2

# Write key "user-1", value "data-v3" (superseding "data-v1" for "user-1")
pulsar-client produce my-topic --key user-1 --messages data-v3

Before compaction, a consumer reading this topic from the beginning sees all three messages in order: data-v1 for user-1, data-v2 for user-2, and then data-v3 for user-1. Both values for user-1 are delivered.

Pulsar’s topic compaction is designed for topics where only the current value of each key matters, such as a changelog or a table of the latest state per entity. When compaction runs, the broker makes two passes over the topic’s backlog. In the first pass it records, for each key, the position of the latest message carrying that key. In the second pass it copies only those latest messages into a new BookKeeper ledger, the compacted ledger, and skips everything else. A message whose payload is empty is treated as a delete marker for its key and is skipped as well. The position of the last message covered by the run is stored as the topic’s compaction horizon. The original backlog is not modified; what gets smaller is the amount of data a consumer has to read to learn the current value of every key.
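As a toy model of what a compaction run computes, here is a short sketch in plain Python. It is not Pulsar’s API: the names compact and backlog are made up for illustration, messages are reduced to (key, payload) pairs, and an empty payload stands in for a delete marker.

```python
from typing import List, Optional, Tuple

# (key, payload); a None or empty payload models a delete marker (tombstone).
Message = Tuple[str, Optional[str]]


def compact(backlog: List[Message]) -> List[Message]:
    """Return the compacted view of a backlog without modifying the backlog."""
    # Pass 1: remember the position of the latest message for every key.
    latest_index = {}
    for index, (key, _payload) in enumerate(backlog):
        latest_index[key] = index

    # Pass 2: copy only the latest message per key, skipping delete markers.
    compacted = []
    for index, (key, payload) in enumerate(backlog):
        if latest_index[key] != index:
            continue  # superseded by a later message with the same key
        if not payload:
            continue  # latest message is a delete marker: the key disappears
        compacted.append((key, payload))
    return compacted


backlog = [
    ("user-1", "data-v1"),
    ("user-2", "data-v2"),
    ("user-1", "data-v3"),
]
compacted = compact(backlog)
print(compacted)  # [('user-2', 'data-v2'), ('user-1', 'data-v3')]
```

Note that data-v3, the last write for user-1, is the survivor, and that backlog itself still holds all three messages afterwards.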

Compaction is not switched on with a single policy flag. It either runs automatically once a topic’s backlog passes a size threshold, or you trigger it by hand. Both are done with the Pulsar admin tools.

# Compact automatically once a topic's backlog in this namespace exceeds 100 MB
pulsar-admin namespaces set-compaction-threshold --threshold 100M public/default

# Or trigger a compaction run on one topic right now
pulsar-admin topics compact persistent://public/default/my-topic

Here, --threshold 100M means that compaction is triggered for a topic in the public/default namespace once its backlog grows past roughly 100 MB. A threshold of 0 disables automatic compaction. topics compact starts a run immediately regardless of the threshold, and pulsar-admin topics compaction-status persistent://public/default/my-topic reports how that run is going.

Internally, compaction works by writing a new ledger that contains only the latest version of each key. It doesn’t modify existing ledgers. Once the new compacted ledger is safely written and closed, the broker records its ledger ID and the compaction horizon in the topic’s metadata, and from then on that ledger serves reads of the compacted view; a compacted ledger left over from an earlier run is no longer needed. This append-only approach means a failed or interrupted compaction cannot damage the topic. Compaction does not delete the original backlog either: those ledgers are removed according to the usual acknowledgment and retention rules.

The key levers you control are the compaction threshold, the manual trigger, and the consumer’s read-compacted setting. The compaction threshold (pulsar-admin namespaces set-compaction-threshold) decides how much backlog may accumulate before the broker compacts on its own. The manual trigger (pulsar-admin topics compact) lets you compact on your own schedule. The read-compacted setting on a consumer or reader, readCompacted(true) in the Java client, decides whether that client sees the compacted view or the full backlog; it is off by default. How long the uncompacted backlog stays on disk is a separate matter, governed by the namespace’s retention policy rather than by compaction.
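The automatic trigger can be pictured as a simple size check. The sketch below is plain Python, not broker code; should_compact and its parameter names are hypothetical, and it assumes the check compares the backlog that has not yet been compacted against a size threshold, with a threshold of 0 meaning automatic compaction is off.

```python
def should_compact(uncompacted_backlog_bytes: int, threshold_bytes: int) -> bool:
    """Toy version of a size-based compaction trigger."""
    if threshold_bytes <= 0:
        return False  # assumption: 0 disables automatic compaction
    return uncompacted_backlog_bytes > threshold_bytes


THRESHOLD = 100 * 1024 * 1024  # 100 MiB, chosen only for illustration

for size in (10 * 1024 * 1024, 150 * 1024 * 1024):
    print(size, should_compact(size, THRESHOLD))
```

A manual run corresponds to skipping this check entirely and compacting whatever is in the backlog right now.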

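A common misconception is that a compacted topic always hands every consumer exactly one message per key. This is not true, for two reasons. First, a consumer that has not enabled read-compacted still receives the full backlog, superseded values included. Second, the compacted view only covers messages up to the compaction horizon. Messages written after the last run are delivered as they are, so with a rapid write-compact-write cycle a consumer can see the value that survived compaction followed by every newer value for the same key, until the next compaction run folds them together. The crucial point is that the one-message-per-key guarantee holds only for the portion of the topic that has already been compacted.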
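To make the effect of the compaction horizon concrete, here is a small simulation in plain Python. It is a sketch, not Pulsar’s API: read_compacted and latest_per_key are hypothetical helpers, and the horizon is modeled as the number of backlog messages covered by the last compaction run.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (key, payload)


def latest_per_key(messages: List[Message]) -> List[Message]:
    """Keep only the latest message for each key, in backlog order."""
    latest_index = {key: i for i, (key, _payload) in enumerate(messages)}
    return [m for i, m in enumerate(messages) if latest_index[m[0]] == i]


def read_compacted(backlog: List[Message], horizon: int) -> List[Message]:
    """What a read-compacted consumer sees when starting from the beginning.

    Messages before the horizon come from the compacted view; messages at
    or after the horizon are delivered exactly as written.
    """
    return latest_per_key(backlog[:horizon]) + backlog[horizon:]


backlog = [
    ("user-1", "data-v1"),
    ("user-2", "data-v2"),
    ("user-1", "data-v3"),
]
horizon = len(backlog)  # a compaction run covers everything written so far

# Two more writes arrive after that run.
backlog += [("user-1", "data-v4"), ("user-1", "data-v5")]

for key, payload in read_compacted(backlog, horizon):
    print(key, payload)
```

In this model the consumer receives data-v2 for user-2, then data-v3, data-v4 and data-v5 for user-1: data-v1 has been compacted away, but the writes made after the horizon all come through until compaction runs again.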

The next challenge you’ll likely encounter is understanding the behavior of compaction with deleted messages.
