A Rancher HA setup is more than several replicas behind a load balancer; it depends on a distributed consensus system (etcd) in which a majority of members must agree on the state of the cluster, so a network partition decides which nodes can keep serving.
Here’s a typical Rancher HA setup in action, managing a Kubernetes cluster. Imagine you have three Rancher nodes, rancher-01, rancher-02, and rancher-03, all part of a Kubernetes cluster themselves (often built with RKE or K3s). Rancher itself runs on that cluster as a Deployment with several replicas, typically installed with Helm, and its state is stored as Kubernetes resources in the cluster’s datastore, which is etcd in this example. The nodes communicate with each other over specific ports.
Let’s say you’re accessing the Rancher UI via a LoadBalancer IP. You create a new Kubernetes cluster through the Rancher UI. Rancher then orchestrates the deployment of Kubernetes nodes (your worker and control plane nodes), installs necessary agents, and configures networking. This involves Rancher nodes talking to each other, to the Kubernetes API server of the cluster they are managing, and to the nodes being provisioned.
The core problem Rancher HA solves is keeping the management plane operational, and its state intact, when a node fails. This is achieved primarily by storing state in etcd, which requires a quorum: a majority of members, floor(n/2) + 1, must be reachable. A three-node cluster therefore tolerates the loss of one node, not two; tolerating two failures takes five nodes.
Internally, the Rancher server pods run on a full Kubernetes cluster (shown in Rancher as the local cluster), including its etcd. This etcd cluster is the single source of truth for Rancher’s configuration, cluster definitions, user data, and more. The three nodes form the cluster that hosts Rancher, and Rancher in turn manages other Kubernetes clusters.
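The quorum arithmetic is easy to check by hand, but a small helper makes the pattern visible. This is a minimal sketch of the majority rule etcd uses; the function names are illustrative and not part of Rancher or etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Print quorum and tolerated failures for cluster sizes 1 through 7.
    for n in range(1, 8):
        print(f"{n} members: quorum {quorum_size(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

Running it shows that going from three to four members raises the quorum without raising the number of tolerated failures, which is why odd sizes are preferred.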
The key levers you control are:
- Node Count: Use an odd number of etcd nodes (3, 5, etc.); an even count adds a member without adding fault tolerance.
- Network Connectivity: All nodes must be able to reach each other on specific ports (etcd peer ports, API ports, etc.).
- Resource Allocation: Each Rancher node needs sufficient CPU, memory, and disk I/O, especially for etcd.
- Persistent Storage: Reliable and performant storage for etcd is critical.
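Network connectivity is the lever that is easiest to verify before installing anything. The sketch below tries a plain TCP connection from the machine it runs on to each node. It assumes the common defaults of 2379 (etcd client), 2380 (etcd peer), and 6443 (Kubernetes API); the hostnames are the example names used in this article, and your distribution or firewall layout may need a different port list.

```python
import socket

# Assumed default ports; adjust for your distribution and configuration.
DEFAULT_PORTS = {
    "etcd client": 2379,
    "etcd peer": 2380,
    "kube-apiserver": 6443,
}


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_nodes(hosts, ports=DEFAULT_PORTS, timeout=2.0):
    """Map each (host, port name) pair to a reachability boolean."""
    return {
        (host, name): can_connect(host, port, timeout)
        for host in hosts
        for name, port in ports.items()
    }


if __name__ == "__main__":
    # Example hostnames from this article; replace with your own nodes.
    nodes = ["rancher-01", "rancher-02", "rancher-03"]
    for (host, name), ok in check_nodes(nodes).items():
        print(f"{host:12} {name:15} {'ok' if ok else 'UNREACHABLE'}")
```

A successful connection only proves reachability from where the script runs, so it is worth running it from each node in turn to cover every node-to-node path.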
When you set up Rancher HA with RKE, you first deploy a Kubernetes cluster on the nodes that will host Rancher. The cluster.yml file defines your nodes and their roles, and RKE installs the Kubernetes components, including etcd, on them. You then install Rancher on that cluster (an RKE or K3s cluster, for example) with multiple replicas so that the Rancher server itself is highly available.
The rke config command generates a cluster.yml interactively, and rke up provisions the cluster it describes, including the etcd members that must form a quorum. The rancher/rancher Docker image runs the Rancher server application, and the rancher/rancher-agent image runs on the managed Kubernetes nodes to communicate back to the Rancher server.
A common pitfall is misunderstanding what Rancher HA covers: it refers to the high availability of the Rancher management plane itself, not necessarily of the Kubernetes clusters it manages (though it facilitates that too). In the three-node layout described here, each node runs an etcd member, and these members form one cluster. If rancher-03 loses network connectivity to the other two, its etcd member is in the minority and stops accepting writes, while rancher-01 and rancher-02 still hold a quorum of two and keep Rancher available. Only if all three nodes are isolated from one another, or two of them fail, does the cluster lose quorum and Rancher become unavailable.
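To make the shape of cluster.yml concrete, here is a small sketch that renders the nodes section for a set of hosts that each take all three RKE roles. The helper name render_cluster_yml, the SSH user, and the 192.0.2.x addresses (a range reserved for documentation) are placeholders rather than anything RKE provides; a real file usually carries more settings than this.

```python
def render_cluster_yml(addresses, ssh_user="ubuntu"):
    """Render a minimal RKE-style `nodes:` section as YAML text.

    Every node gets the controlplane, etcd, and worker roles, which is a
    common layout for a small cluster dedicated to hosting Rancher.
    """
    if not addresses:
        raise ValueError("at least one node address is required")
    lines = ["nodes:"]
    for address in addresses:
        lines += [
            f"  - address: {address}",
            f"    user: {ssh_user}",
            "    role: [controlplane, etcd, worker]",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder addresses from the documentation-only 192.0.2.0/24 range.
    print(render_cluster_yml(["192.0.2.11", "192.0.2.12", "192.0.2.13"]), end="")
```

Generating the file from a list keeps the node entries uniform; for a one-off install, writing cluster.yml by hand or answering the rke config prompts works just as well.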
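A partition is easier to reason about when you count members on each side. The following sketch takes groups of nodes that can still reach each other and reports which group, if any, holds a majority of the whole cluster. It models only the majority rule, not etcd's leader election or timing, and the function names are illustrative.

```python
def majority(total: int) -> int:
    """Number of members needed for quorum in a cluster of `total` members."""
    return total // 2 + 1


def partition_outcome(groups):
    """Report quorum per partition side.

    `groups` is a list of lists; each inner list holds the names of nodes
    that can still reach one another. Returns {tuple(group): has_quorum}.
    """
    total = sum(len(group) for group in groups)
    needed = majority(total)
    return {tuple(group): len(group) >= needed for group in groups}


if __name__ == "__main__":
    # rancher-03 is cut off from the other two nodes.
    outcome = partition_outcome([["rancher-01", "rancher-02"], ["rancher-03"]])
    for group, has_quorum in outcome.items():
        status = "keeps quorum" if has_quorum else "loses quorum"
        print(", ".join(group), "->", status)
    # rancher-01, rancher-02 -> keeps quorum
    # rancher-03 -> loses quorum
```

At most one side can ever hold a majority, which is what prevents two halves of a split cluster from accepting conflicting writes.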
The next logical step after setting up Rancher HA is to explore its multi-cluster management capabilities, including provisioning new Kubernetes clusters and migrating existing ones.