Pulsar’s sequence number deduplication is actually a feature that guarantees exactly-once processing by allowing duplicates to be sent, but then silently discarding them based on a unique identifier.

Let’s see this in action. Imagine a simple producer sending messages to a topic.

// Producer setup
String serviceHttpUrl = "http://localhost:8080";
String tenant = "my-tenant";
String namespace = "my-namespace";
String topic = "my-dedup-topic";

PulsarClient client = PulsarClient.builder()
    .serviceHttpUrl(serviceHttpUrl)
    .build();

Producer<String> producer = client.newProducer(Schema.STRING)
    .topic(tenant + "/" + namespace + "/" + topic)
    .enableBatching(true) // Crucial for dedup efficiency
    .enableIdempotentProducer(true) // Enables sequence number tracking
    .create();

// Sending messages with sequence IDs
for (int i = 0; i < 5; i++) {
    String messagePayload = "Message " + i;
    long sequenceId = i; // Simple sequence for demo, real apps use unique IDs
    producer.newMessage()
        .value(messagePayload)
        .sequenceId(sequenceId) // Assign a unique sequence ID
        .send();
    System.out.println("Sent message with sequence ID: " + sequenceId);
    Thread.sleep(100); // Simulate some delay
}

producer.close();
client.close();

When the producer is configured with enableIdempotentProducer(true), Pulsar tracks the sequence IDs sent for each topic partition. If a producer crashes and restarts, it can resume sending from where it left off. If a message with a sequence ID that has already been acknowledged by the broker is re-sent, Pulsar will detect the duplicate and discard it. This ensures that even if a producer retries sending a message due to network issues, the message is only processed once by consumers.

The core of this mechanism lies in the producer’s ability to generate unique, monotonically increasing sequence IDs for messages sent to a specific topic partition. The broker then maintains a cache of the highest sequence ID received for each producer on each partition. When a new message arrives, the broker checks if its sequence ID is greater than the last acknowledged ID for that producer. If it is, the message is accepted and forwarded to consumers. If it’s less than or equal to the last acknowledged ID, it’s considered a duplicate and dropped.

The enableBatching(true) setting is important because it allows the producer to group multiple messages into a single network request. This significantly improves the efficiency of idempotent production. Pulsar brokers can acknowledge an entire batch of messages with a single highest sequence ID, reducing the overhead of tracking individual message acknowledgments.

When a producer is enabled for idempotence, it establishes a persistent connection with the broker. On this connection, it sends not only the message data but also a sequence ID. The broker, upon receiving a message, checks its internal state for that specific producer and topic partition. It stores the highest sequence ID it has successfully processed. If the incoming message’s sequence ID is less than or equal to this stored value, the broker silently discards the message. If the sequence ID is greater, the broker processes the message, updates its stored highest sequence ID, and acknowledges the producer. This acknowledgment is what allows the producer to know it can safely "forget" about that sequence ID.

The mental model here is that the producer is "stamping" each message with a unique, ordered label. The broker acts as a diligent librarian, keeping a log of the highest-numbered stamp it has seen for each patron (producer). Any incoming message with a stamp number already logged is simply ignored. This prevents any message from being processed more than once, even if the network hiccups and the producer mistakenly sends it again.

The key to successful deduplication is ensuring that the sequence IDs are truly unique and monotonically increasing from the perspective of the broker. If a producer generates sequence IDs that are not strictly increasing or can wrap around, deduplication will fail. In practice, this means using a reliable mechanism to generate these IDs, often tied to epoch-based counters or distributed ID generators.

The next concept to explore is how Pulsar handles consumer acknowledgments in conjunction with idempotent producers, particularly how broker-level acknowledgments work to confirm receipt to the producer.

Want structured learning?

Take the full Pulsar course →