Redpanda’s Raft consensus isn’t just about picking a leader; it’s about ensuring that every node in the cluster agrees on who the leader is, even when nodes are failing, joining, or experiencing network partitions.
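Let’s watch a cluster of three Redpanda nodes (node1, node2, node3) elect a leader. Redpanda runs one Raft group per partition plus a controller group for cluster metadata; this walkthrough follows the controller group, and partition groups elect their leaders the same way. Imagine the nodes have just started up. No one is the leader yet.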
# On node1
rpk cluster health
# No controller ID can be reported yet; depending on the version, the command may error or time out
# On node2
rpk cluster health
# Same result: no controller yet
# On node3
rpk cluster health
# Same result: no controller yet
Each node waits for a randomized election timeout. The first node to time out without hearing from a leader, say node1, starts an election by becoming a "candidate." It increments its "term" number (a logical clock for the cluster), votes for itself, and sends "RequestVote" RPCs to all other nodes.
# On node1 (after it becomes a candidate)
# Illustrative log lines (not verbatim Redpanda output):
# "Sending RequestVote to node2 with term X"
# "Sending RequestVote to node3 with term X"
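The candidate step can be sketched in a few lines of Python. This is a minimal model of the generic Raft rule, not Redpanda's implementation; the `Node` and `RequestVote` classes and the `start_election` function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RequestVote:
    term: int            # candidate's new term
    candidate_id: str    # who is asking for the vote
    last_log_index: int  # index of the candidate's last log entry (0 if empty)
    last_log_term: int   # term of the candidate's last log entry (0 if empty)


@dataclass
class Node:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None
    state: str = "follower"
    log_terms: List[int] = field(default_factory=list)  # term of each log entry


def start_election(node: Node, peers: List[str]) -> Dict[str, RequestVote]:
    """Turn a follower into a candidate and build one RequestVote per peer."""
    node.current_term += 1         # new term for this election
    node.state = "candidate"
    node.voted_for = node.node_id  # a candidate always votes for itself
    last_index = len(node.log_terms)
    last_term = node.log_terms[-1] if node.log_terms else 0
    request = RequestVote(node.current_term, node.node_id, last_index, last_term)
    return {peer: request for peer in peers}


node1 = Node("node1")
outbox = start_election(node1, ["node2", "node3"])
print(node1.state, node1.current_term, sorted(outbox))
# candidate 1 ['node2', 'node3']
```

The request carries the candidate's last log position because voters use it to decide whether the candidate's log is at least as complete as their own.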
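When node2 and node3 receive this RequestVote RPC, each one checks three things: that the candidate’s term (X) is at least as large as its own current term, that it has not already voted for a different candidate in term X, and that the candidate’s log is at least as up to date as its own. If all three hold, the node grants its vote to node1 and replies with a "VoteResponse" indicating success; otherwise it replies with a rejection.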
# On node2 (receiving RequestVote from node1)
# Illustrative log lines (not verbatim Redpanda output):
# "Received RequestVote from node1, granting vote for term X"
# On node3 (receiving RequestVote from node1)
# Illustrative log lines (not verbatim Redpanda output):
# "Received RequestVote from node1, granting vote for term X"
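The voter's decision follows the standard Raft rule and can be written as a single predicate. The sketch below is a simplified model rather than Redpanda's code; the function name `should_grant_vote` and its parameters are assumptions chosen for readability.

```python
from typing import Optional


def should_grant_vote(
    current_term: int,
    voted_for: Optional[str],
    my_last_log_term: int,
    my_last_log_index: int,
    cand_term: int,
    cand_id: str,
    cand_last_log_term: int,
    cand_last_log_index: int,
) -> bool:
    """Decide whether a voter grants its vote, per the generic Raft rules."""
    # 1. Reject candidates from an older term.
    if cand_term < current_term:
        return False
    # A newer term starts a fresh election, so a vote cast earlier no longer counts.
    if cand_term > current_term:
        voted_for = None
    # 2. At most one vote per term (re-granting to the same candidate is fine).
    if voted_for not in (None, cand_id):
        return False
    # 3. The candidate's log must be at least as up to date as the voter's:
    #    compare last entry terms first, then last entry indexes.
    return (cand_last_log_term, cand_last_log_index) >= (
        my_last_log_term,
        my_last_log_index,
    )


# Fresh cluster: node2 has not voted and both logs are empty.
print(should_grant_vote(0, None, 0, 0, 1, "node1", 0, 0))      # True
# node2 already voted for node3 in term 1.
print(should_grant_vote(1, "node3", 0, 0, 1, "node1", 0, 0))   # False
# The candidate's log is behind the voter's log.
print(should_grant_vote(1, None, 1, 5, 2, "node1", 1, 3))      # False
```

The third check is what keeps a node with a stale log from winning an election and overwriting entries that were already committed.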
If node1 collects votes from a majority of the nodes in the cluster (here 2 out of 3, counting its own vote, so a single grant from node2 or node3 is enough), it transitions to the "leader" state. It then starts sending "AppendEntries" heartbeats to all other nodes to assert its leadership and replicate any new data.
# On node1 (after receiving enough votes)
# Illustrative log lines (not verbatim Redpanda output):
# "Became leader for term X"
# "Sending AppendEntries heartbeat to node2"
# "Sending AppendEntries heartbeat to node3"
# On node1
rpk cluster health
# The "Controller ID" field now shows the elected node's numeric node ID (node1's ID in this example)
# On node2
rpk cluster health
# Same controller ID as reported on node1
# On node3
rpk cluster health
# Same controller ID as reported on node1
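The vote tally behind that transition is simple arithmetic. Here is a minimal sketch, assuming votes are tracked as a set of node names; `quorum` and `wins_election` are hypothetical helper names, not Redpanda functions.

```python
from typing import Set


def quorum(cluster_size: int) -> int:
    """Smallest number of votes that is strictly more than half the cluster."""
    return cluster_size // 2 + 1


def wins_election(votes_granted: Set[str], cluster_members: Set[str]) -> bool:
    """A candidate wins once votes from current members reach the quorum."""
    valid_votes = votes_granted & cluster_members  # ignore non-members
    return len(valid_votes) >= quorum(len(cluster_members))


members = {"node1", "node2", "node3"}
print(wins_election({"node1"}, members))           # False: only its own vote
print(wins_election({"node1", "node2"}, members))  # True: 2 of 3 is a majority
```

Because the candidate's own vote counts, node1 does not need to wait for both replies in a three-node cluster.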
The key to Raft is that a node can only become leader if it has received votes from a majority, and each node votes at most once per term. Any two majorities overlap in at least one node, so there is never more than one leader in a given term. If a leader or candidate receives an RPC carrying a higher term than its own, for example an AppendEntries from a newer leader, it adopts that term and steps down to follower. RPCs carrying a lower term are rejected as stale.
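The term comparison on an incoming AppendEntries can be sketched as follows. This models only the term and role bookkeeping of generic Raft (log matching is left out), and `RaftState` and `on_append_entries` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftState:
    role: str                       # "follower", "candidate", or "leader"
    current_term: int
    voted_for: Optional[str] = None


def on_append_entries(state: RaftState, leader_term: int) -> bool:
    """Return True if the sender is accepted as leader, False if it is stale."""
    if leader_term < state.current_term:
        return False                # stale leader: reject, keep our state
    if leader_term > state.current_term:
        state.current_term = leader_term
        state.voted_for = None      # votes belong to a term; a new term resets them
    # A valid AppendEntries means a legitimate leader exists for this term,
    # so a leader from an older term or a competing candidate steps down.
    state.role = "follower"
    return True


old_leader = RaftState(role="leader", current_term=3)
print(on_append_entries(old_leader, 4), old_leader.role, old_leader.current_term)
# True follower 4

current_leader = RaftState(role="leader", current_term=3)
print(on_append_entries(current_leader, 2), current_leader.role)
# False leader
```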
A crucial part of Redpanda’s Raft implementation is how it handles configuration changes. When you add or remove nodes, the cluster doesn’t immediately re-elect. Instead, the leader orchestrates a joint consensus configuration change in two phases. First, it replicates a transitional configuration that contains both the old and the new membership; while this is in effect, elections and commits need a majority of the old set and a majority of the new set, so neither set can make decisions on its own. Once the transitional configuration is committed, the leader replicates the new configuration, which then takes over by itself. This prevents a split-brain scenario in which the old and new memberships each elect their own leader during the transition.
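The joint quorum rule from the Raft paper can be checked with a few lines of Python. This is a sketch of the rule itself, not of Redpanda's internals; `has_majority` and `joint_quorum` are hypothetical helpers, and node4 and node5 are made-up new members.

```python
from typing import Set


def has_majority(acks: Set[str], config: Set[str]) -> bool:
    """True if strictly more than half of `config` has acknowledged."""
    return len(acks & config) > len(config) // 2


def joint_quorum(acks: Set[str], old_config: Set[str], new_config: Set[str]) -> bool:
    """During joint consensus a decision needs a majority of BOTH configurations."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)


old = {"node1", "node2", "node3"}
new = {"node1", "node2", "node3", "node4", "node5"}

print(joint_quorum({"node1", "node2"}, old, new))           # False: 2/3 old, 2/5 new
print(joint_quorum({"node3", "node4", "node5"}, old, new))  # False: 1/3 old, 3/5 new
print(joint_quorum({"node1", "node2", "node4"}, old, new))  # True: 2/3 old, 3/5 new
```

The second case is the one that matters for safety: a group that is a majority of the new membership still cannot act without a majority of the old one.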
The rpk cluster health and rpk cluster info commands are your window into this. rpk cluster health reports the controller’s node ID, the list of nodes, and any leaderless partitions, while rpk cluster info lists the brokers in the cluster. Membership starts from each node’s redpanda.yaml: the seed_servers list (which rpk redpanda config bootstrap can write for you) tells a starting node which peers to contact, which helps it join the cluster, synchronize, and participate in consensus.
The Raft algorithm is designed to be fault-tolerant, but it relies on a consistent view of the cluster membership. If a node thinks it’s part of the cluster but the actual leader doesn’t, or if network partitions prevent communication, you can run into issues.
If rpk cluster health cannot report a controller, or keeps listing leaderless partitions, for an extended period, it usually means a leader hasn’t been successfully elected. This often happens if new nodes can’t reach any existing nodes to request votes, or if the initial bootstrap configuration was incorrect and no node has enough peers to form a majority. In such cases, you’d typically check Redpanda’s logs on each node for election-related messages (vote requests and replies) to see where the communication is breaking down.
Understanding the term numbers and the majority rule is paramount. A candidate only wins if it gets votes from more than half of the current cluster members. This is why a node that is partitioned away on the minority side cannot become a leader, and also cannot prevent the remaining nodes from electing a new leader as long as they still form a majority.
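To see the partition argument in numbers, here is a small sketch that splits a cluster into sides and asks which side can still elect a leader. The helpers `tolerated_failures` and `sides_with_quorum` are hypothetical names for illustration.

```python
from typing import List, Set


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return (cluster_size - 1) // 2


def sides_with_quorum(sides: List[Set[str]], members: Set[str]) -> List[Set[str]]:
    """Return the partition sides holding more than half of the members."""
    return [side for side in sides if len(side & members) > len(members) // 2]


members = {"node1", "node2", "node3"}
# node1 is cut off from node2 and node3.
sides = [{"node1"}, {"node2", "node3"}]

winners = sides_with_quorum(sides, members)
print([sorted(side) for side in winners])           # [['node2', 'node3']]
print([tolerated_failures(n) for n in (3, 4, 5)])   # [1, 1, 2]
```

Since two disjoint sides cannot both hold more than half of the members, at most one side can ever elect a leader. The second line of output also shows why even-sized clusters add no fault tolerance: four nodes survive one failure, the same as three.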
The next concept you’ll likely encounter is how Raft handles log replication, which is the mechanism by which the leader ensures all nodes have the same sequence of operations applied to their state machines.