Consumer group rebalancing is actually a controlled traffic jam, not a graceful handover of work.

Let’s see it in action. Imagine a Redpanda topic my-topic with 3 partitions. Two consumers, consumer-1 and consumer-2, are in the same group my-group.

# Initially, consumer-1 is assigned all partitions
rpk topic consume my-topic --group my-group --consumer-id consumer-1
# Output will show consumer-1 processing messages from partitions 0, 1, 2

# Now, consumer-2 joins
rpk topic consume my-topic --group my-group --consumer-id consumer-2

When consumer-2 joins, Redpanda’s broker (or brokers) orchestrates a rebalance. This isn’t instantaneous. The brokers pause all consumption from my-topic for my-group, notify all active consumers in my-group that a rebalance is starting, wait for them to acknowledge, and then redistribute partitions. During this pause, no messages are processed.

Here’s what happens internally:

  1. Heartbeats: Consumers periodically send heartbeats to the brokers to signal they’re alive and part of the group.
  2. Session Timeout: Each consumer has a session.timeout.ms. If a broker doesn’t receive a heartbeat within this time, it assumes the consumer has died and initiates a rebalance.
  3. max.poll.interval.ms: This is the maximum time a consumer can go between poll() calls. If a consumer takes too long to process messages and call poll() again, it will be considered dead by the broker, even if it’s still alive and processing. This is a major cause of unexpected rebalances.
  4. Rebalance Protocol: When a rebalance is triggered (either by a new consumer joining, a consumer leaving, or a perceived failure), the broker designated as the group coordinator sends a LeaveGroup request to all members. Consumers respond, and the coordinator then assigns partitions based on a configurable strategy (default is range).
  5. partition.assignment.strategy: This determines how partitions are distributed. range assigns contiguous partitions to consumers, while roundrobin distributes them more evenly.

The Problem: Unexpected and frequent rebalances kill throughput. If your consumers are struggling to keep up, or if network glitches cause heartbeats to drop, you’ll see these pauses.

The Fixes:

  • Tune session.timeout.ms and heartbeat.interval.ms:

    • Diagnosis: Check consumer logs for "rebalance" messages. Monitor broker logs for session.timeout.ms expirations.
    • Command: No direct command, but you’d look at your consumer client configuration.
    • Fix: Increase session.timeout.ms (e.g., from default 10s to 30s) and decrease heartbeat.interval.ms (e.g., from default 3s to 1s).
    • Why it works: A longer session timeout gives consumers more grace period for network hiccups. A shorter heartbeat interval ensures brokers know the consumer is alive more frequently, preventing false positives for failures.
  • Tune max.poll.interval.ms:

    • Diagnosis: If consumers are logging "This server is losing 100% of the messages it is trying to send to you" or similar, and you see rebalances without new consumers joining, this is likely it.
    • Command: rpk topic describe my-topic --partitions to see partition count. rpk consumer group describe my-group to see lag.
    • Fix: Increase max.poll.interval.ms to a value greater than the maximum time your consumer takes to process a batch of records and call poll() again. For example, if processing a batch of 1000 records takes 15 seconds, set max.poll.interval.ms to 20000 (20 seconds).
    • Why it works: This tells the broker that your consumer is intentionally taking longer between polls, preventing it from being kicked out of the group prematurely due to slow processing.
  • Ensure consumers are actually processing:

    • Diagnosis: Use rpk consumer group describe my-group. If LAG is consistently high or growing, your consumers aren’t keeping up.
    • Fix: Optimize your consumer processing logic, increase the number of consumer instances (if partitions allow), or increase the fetch.max.bytes (though be careful not to overwhelm consumers).
    • Why it works: Consumers that can keep up with message production are less likely to hit max.poll.interval.ms limits.
  • Handle consumer crashes gracefully:

    • Diagnosis: If a consumer instance crashes unexpectedly, it will eventually time out its session.
    • Fix: Implement robust error handling in your consumer. Ensure it commits offsets only after successful processing. If a consumer must be restarted, ensure it’s done in a controlled manner.
    • Why it works: Graceful shutdowns allow the consumer to signal its departure, leading to a cleaner rebalance.
  • Use enable.auto.commit: false and manual offset commits:

    • Diagnosis: Consumers might reprocess messages after a rebalance if auto-commit happens too early.
    • Fix: Set enable.auto.commit: false and explicitly call commitSync() or commitAsync() after processing a batch of records.
    • Why it works: This guarantees that offsets are committed only for records that have been fully processed, preventing data loss or duplication when rebalances occur.
  • Choose the right partition.assignment.strategy:

    • Diagnosis: If you have uneven partition processing, it might be due to the assignment strategy.
    • Fix: For most use cases, range is fine. If you have a large number of partitions and consumers, roundrobin might distribute load more evenly, but can lead to more frequent reassignments if consumers join/leave. Test both.
    • Why it works: A better strategy ensures work is distributed more effectively among available consumers.

The next error you’ll likely see is FETCH_SESSION_ID_EXPIRED if your max.poll.interval.ms is still too low for your processing workload.

Want structured learning?

Take the full Redpanda course →