Consumer group rebalancing is actually a controlled traffic jam, not a graceful handover of work.
Let’s see it in action. Imagine a Redpanda topic my-topic with 3 partitions. Two consumers, consumer-1 and consumer-2, are in the same group my-group.
# Initially, consumer-1 is assigned all partitions
rpk topic consume my-topic --group my-group --consumer-id consumer-1
# Output will show consumer-1 processing messages from partitions 0, 1, 2
# Now, consumer-2 joins
rpk topic consume my-topic --group my-group --consumer-id consumer-2
When consumer-2 joins, Redpanda’s broker (or brokers) orchestrates a rebalance. This isn’t instantaneous. The brokers pause all consumption from my-topic for my-group, notify all active consumers in my-group that a rebalance is starting, wait for them to acknowledge, and then redistribute partitions. During this pause, no messages are processed.
Here’s what happens internally:
- Heartbeats: Consumers periodically send heartbeats to the brokers to signal they’re alive and part of the group.
- Session Timeout: Each consumer has a
session.timeout.ms. If a broker doesn’t receive a heartbeat within this time, it assumes the consumer has died and initiates a rebalance. max.poll.interval.ms: This is the maximum time a consumer can go betweenpoll()calls. If a consumer takes too long to process messages and callpoll()again, it will be considered dead by the broker, even if it’s still alive and processing. This is a major cause of unexpected rebalances.- Rebalance Protocol: When a rebalance is triggered (either by a new consumer joining, a consumer leaving, or a perceived failure), the broker designated as the group coordinator sends a
LeaveGrouprequest to all members. Consumers respond, and the coordinator then assigns partitions based on a configurable strategy (default isrange). partition.assignment.strategy: This determines how partitions are distributed.rangeassigns contiguous partitions to consumers, whileroundrobindistributes them more evenly.
The Problem: Unexpected and frequent rebalances kill throughput. If your consumers are struggling to keep up, or if network glitches cause heartbeats to drop, you’ll see these pauses.
The Fixes:
-
Tune
session.timeout.msandheartbeat.interval.ms:- Diagnosis: Check consumer logs for "rebalance" messages. Monitor broker logs for
session.timeout.msexpirations. - Command: No direct command, but you’d look at your consumer client configuration.
- Fix: Increase
session.timeout.ms(e.g., from default 10s to 30s) and decreaseheartbeat.interval.ms(e.g., from default 3s to 1s). - Why it works: A longer session timeout gives consumers more grace period for network hiccups. A shorter heartbeat interval ensures brokers know the consumer is alive more frequently, preventing false positives for failures.
- Diagnosis: Check consumer logs for "rebalance" messages. Monitor broker logs for
-
Tune
max.poll.interval.ms:- Diagnosis: If consumers are logging "This server is losing 100% of the messages it is trying to send to you" or similar, and you see rebalances without new consumers joining, this is likely it.
- Command:
rpk topic describe my-topic --partitionsto see partition count.rpk consumer group describe my-groupto see lag. - Fix: Increase
max.poll.interval.msto a value greater than the maximum time your consumer takes to process a batch of records and callpoll()again. For example, if processing a batch of 1000 records takes 15 seconds, setmax.poll.interval.msto 20000 (20 seconds). - Why it works: This tells the broker that your consumer is intentionally taking longer between polls, preventing it from being kicked out of the group prematurely due to slow processing.
-
Ensure consumers are actually processing:
- Diagnosis: Use
rpk consumer group describe my-group. IfLAGis consistently high or growing, your consumers aren’t keeping up. - Fix: Optimize your consumer processing logic, increase the number of consumer instances (if partitions allow), or increase the
fetch.max.bytes(though be careful not to overwhelm consumers). - Why it works: Consumers that can keep up with message production are less likely to hit
max.poll.interval.mslimits.
- Diagnosis: Use
-
Handle consumer crashes gracefully:
- Diagnosis: If a consumer instance crashes unexpectedly, it will eventually time out its session.
- Fix: Implement robust error handling in your consumer. Ensure it commits offsets only after successful processing. If a consumer must be restarted, ensure it’s done in a controlled manner.
- Why it works: Graceful shutdowns allow the consumer to signal its departure, leading to a cleaner rebalance.
-
Use
enable.auto.commit: falseand manual offset commits:- Diagnosis: Consumers might reprocess messages after a rebalance if auto-commit happens too early.
- Fix: Set
enable.auto.commit: falseand explicitly callcommitSync()orcommitAsync()after processing a batch of records. - Why it works: This guarantees that offsets are committed only for records that have been fully processed, preventing data loss or duplication when rebalances occur.
-
Choose the right
partition.assignment.strategy:- Diagnosis: If you have uneven partition processing, it might be due to the assignment strategy.
- Fix: For most use cases,
rangeis fine. If you have a large number of partitions and consumers,roundrobinmight distribute load more evenly, but can lead to more frequent reassignments if consumers join/leave. Test both. - Why it works: A better strategy ensures work is distributed more effectively among available consumers.
The next error you’ll likely see is FETCH_SESSION_ID_EXPIRED if your max.poll.interval.ms is still too low for your processing workload.