Redpanda’s topic compaction is less about keeping old data and more about forgetting it efficiently.
Let’s watch it in action. Imagine we have a Redpanda topic called sensor_data that’s getting hammered with updates for sensor IDs. We want to keep the latest reading for each sensor ID, discarding all previous ones.
Here’s how we’d configure it:
{
  "version": 2,
  "partitions": [
    {
      "topic": "sensor_data",
      "partition": 0,
      "replicas": [0, 1, 2],
      "configs": {
        "retention.ms": "-1",              // No time-based retention limit
        "retention.bytes": "-1",           // No size-based retention limit
        "cleanup.policy": "compact",       // This is the key for key-based compaction
        "compaction.strategy": "offset",   // Default: the highest offset per key wins
        "segment.bytes": "1073741824",     // 1 GiB segments
        "segment.ms": "604800000"          // Roll segments after 7 days
      }
    }
  ]
}
If we apply this configuration to our sensor_data topic:
rpk topic alter-config sensor_data --set cleanup.policy=compact
Now, when we send messages to sensor_data:
- Message 1: key="sensor-123", value="temp:25C", offset=0
- Message 2: key="sensor-456", value="temp:26C", offset=1
- Message 3: key="sensor-123", value="temp:25.5C", offset=2
Redpanda appends each of these to the log exactly as it would for any other topic; nothing is rewritten at produce time. The deduplication happens later, in the background. The cleanup.policy: "compact" setting tells Redpanda to periodically scan the closed segments of each partition, work out the latest offset for every key, and rewrite those segments so that only the newest record per key survives. In our example, once the segment holding offsets 0 through 2 has been closed and compacted, offset 0 (the older sensor-123 reading) is gone, while offsets 1 and 2 remain and keep their original offsets. The active segment, the one still receiving writes, is not compacted, which is why segment.ms and segment.bytes matter: they control how quickly a segment is rolled, and therefore how soon its records become eligible for compaction.
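The net effect of a compaction pass on our three messages can be sketched in a few lines of Python. This is a toy model rather than Redpanda's implementation: Record and compact are hypothetical names invented for the illustration, and the whole log is treated as one already-closed segment.

```python
from __future__ import annotations

from typing import NamedTuple, Optional


class Record(NamedTuple):
    offset: int
    key: str
    value: Optional[str]


def compact(records: list[Record]) -> list[Record]:
    """Keep only the highest-offset record for each key (hypothetical helper)."""
    # Pass 1: find the latest offset seen for every key.
    latest_offset: dict[str, int] = {}
    for rec in records:
        latest_offset[rec.key] = max(rec.offset, latest_offset.get(rec.key, -1))
    # Pass 2: keep a record only if it is the latest one for its key.
    # Survivors keep their original offsets, so gaps appear in the log.
    return [rec for rec in records if latest_offset[rec.key] == rec.offset]


log = [
    Record(0, "sensor-123", "temp:25C"),
    Record(1, "sensor-456", "temp:26C"),
    Record(2, "sensor-123", "temp:25.5C"),
]

compacted = compact(log)
for rec in compacted:
    print(rec.offset, rec.key, rec.value)
```

Running it leaves offsets 1 and 2: the older sensor-123 reading at offset 0 is dropped, and the remaining records are not renumbered.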
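The cleanup.policy: "compact" setting is the crucial one here. It tells Redpanda to use the message key as the identifier for deduplication. If you leave cleanup.policy at delete (the default), Redpanda simply retains all messages until retention.ms or retention.bytes is exceeded. The compaction.strategy property does not switch compaction on or off; it only selects how the winning record for a key is chosen, and its default, offset, keeps the record with the highest offset.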
The fundamental problem Redpanda compaction solves is the unbounded growth of topics when you only care about the current state of an entity identified by a key, not its entire history. Think of a user profile update topic, a configuration change log, or, as in our example, sensor readings. Without compaction, these topics would grow indefinitely, consuming disk space and slowing down reads.
Here’s the mental model: Imagine each topic partition is a ledger. Normally, you just keep adding entries. With compaction, you’re periodically going back and rewriting pages of the ledger, only keeping the latest entry for each item you care about. Redpanda does this by creating new, smaller "compacted" segments and eventually discarding the old, larger ones.
The compaction.strategy property rarely needs touching. Its default, offset, keeps the record with the highest offset for each key; the topic property reference for your Redpanda version lists any other accepted values. With cleanup.policy: "compact" active, Redpanda treats messages with the same key but different values as updates. A message with a null value (a tombstone message) tells Redpanda to delete all previous occurrences of that key. An empty but non-null value is an ordinary update, not a tombstone.
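Tombstone handling can be sketched the same way. The snippet below is a simplified model, not Redpanda's code: compact and its drop_tombstones flag are hypothetical, and whether and when the tombstone record itself is eventually removed depends on the Redpanda version and configuration, so the flag only shows both possible outcomes.

```python
from typing import Dict, List, Optional, Tuple

# (offset, key, value); a value of None models a tombstone.
Record = Tuple[int, str, Optional[str]]


def compact(records: List[Record], drop_tombstones: bool = False) -> List[Record]:
    """Keep the latest record per key; optionally drop surviving tombstones."""
    latest: Dict[str, Record] = {}
    for record in records:  # records arrive in offset order, so later ones win
        latest[record[1]] = record
    survivors = sorted(latest.values())  # offsets are unique, so this sorts by offset
    if drop_tombstones:
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


log: List[Record] = [
    (0, "sensor-123", "temp:25C"),
    (1, "sensor-456", "temp:26C"),
    (2, "sensor-123", None),  # tombstone: sensor-123 should disappear
    (3, "sensor-456", ""),    # empty string: a normal update, not a tombstone
]

with_tombstone = compact(log)
without_tombstone = compact(log, drop_tombstones=True)
print(with_tombstone)
print(without_tombstone)
```

In this model the tombstone at offset 2 first shadows the reading at offset 0, and only the second variant removes the key from the log entirely. The empty-string value at offset 3 survives in both cases because it is a regular record.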
A common pitfall is expecting compaction to immediately remove data. It’s a background process that works on closed segments: a superseded record stays readable until the segment holding it has been rolled and a compaction pass has rewritten it. While a segment is being rewritten, the old and new copies coexist for a while, so you might see disk usage temporarily increase before it decreases as the old segments are purged.
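The timing can be made concrete with a small simulation. It rests on simplifying assumptions that are not taken from Redpanda's source: a partition is a list of segments, the last segment is the active one, and a compaction pass deduplicates across all closed segments at once. The name compact_closed is hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str, str]  # (offset, key, value)
Segment = List[Record]


def compact_closed(segments: List[Segment]) -> List[Segment]:
    """Compact every segment except the last one, which models the active segment."""
    if not segments:
        return []
    closed, active = segments[:-1], segments[-1]
    # Latest offset per key, looking at closed segments only.
    latest: Dict[str, int] = {}
    for segment in closed:
        for offset, key, _value in segment:
            latest[key] = offset
    rewritten = [[r for r in segment if latest[r[1]] == r[0]] for segment in closed]
    return rewritten + [active]


segments: List[Segment] = [
    [(0, "sensor-123", "temp:25C"), (1, "sensor-456", "temp:26C")],      # closed
    [(2, "sensor-123", "temp:25.5C"), (3, "sensor-123", "temp:25.7C")],  # active
]

# While the second segment is still active, nothing is removed.
before_roll = compact_closed(segments)

# Once it rolls (modelled by appending a new, empty active segment),
# the older sensor-123 readings become eligible and are dropped.
after_roll = compact_closed(segments + [[]])

print(sum(len(s) for s in before_roll), "records before the roll")
print(sum(len(s) for s in after_roll), "records after the roll")
```

All four records are still present before the roll, and only two remain after it, which is the delay between writing a newer value and the older one actually disappearing from disk.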
The next thing you’ll likely grapple with is understanding how retention.ms and retention.bytes interact with cleanup.policy: "compact".