The Rancher agent most often fails to connect to the Rancher server because the agent’s TLS certificate is no longer trusted by the server, or the server’s certificate is no longer trusted by the agent.
Here are the most common reasons and how to fix them:
1. Agent’s TLS Certificate Expired or is Invalid
The Rancher agent uses a TLS certificate to authenticate with the Rancher server. If this certificate expires or becomes invalid (e.g., the hostname/IP in the certificate doesn’t match the agent’s hostname/IP), the server will reject the connection.
- Diagnosis:
  - On the Rancher server, check the agent logs for messages like `x509: certificate has expired or is not yet valid` or `x509: certificate signed by unknown authority`.
  - On the agent node, you can try to manually inspect the certificate using `openssl s_client -connect <rancher-server-ip>:443 -showcerts` and look at the output for the certificate details and expiry date.
- Fix:
  - The easiest fix is to restart the Rancher agent pod. This usually triggers a certificate renewal process. If you’re running the agent as a Deployment, `kubectl rollout restart deployment <agent-deployment-name> -n cattle-system` should work. If it’s a DaemonSet, `kubectl rollout restart daemonset <agent-daemonset-name> -n cattle-system` would be appropriate.
  - Alternatively, if certificate renewal fails, you might need to manually re-register the agent. This involves deleting the existing agent registration in Rancher and then running the `kubectl apply -f <generated-yaml-file>` command again on the agent node.
- Why it works: Restarting the agent pod often forces it to re-establish its connection and obtain a new, valid certificate from the Rancher server. Manual re-registration ensures a completely fresh certificate is issued.
2. Rancher Server’s TLS Certificate Changed or is Invalid
Similarly, if the Rancher server’s TLS certificate changes (e.g., after an upgrade or certificate rotation) and the agent doesn’t trust the new certificate, the connection will fail. This is especially common if you’re using custom CA certificates.
- Diagnosis:
  - On the agent node, check the agent logs for messages like `certificate signed by unknown authority` or `remote error: tls: bad certificate`.
  - You can use `openssl s_client -connect <rancher-server-ip>:443 -showcerts` on the agent node to inspect the server’s certificate and ensure it’s signed by a trusted CA.
- Fix:
  - If you’re using Rancher’s self-signed certificates and they’ve been rotated, restarting the Rancher server pods might resolve it: `kubectl rollout restart deployment rancher -n cattle-system` (assuming your Rancher deployment is named `rancher`).
  - If you’re using custom CA certificates, ensure the CA certificate used to sign the Rancher server’s certificate is present and trusted on the agent nodes. This typically involves ensuring the CA certificate is added to the system’s trust store on each agent node or configured within the agent’s deployment.
  - The most robust fix is often to re-register the agent, which will fetch the new server certificate.
- Why it works: The agent needs to trust the identity of the server it’s connecting to. Updating its trust store or re-establishing trust through re-registration ensures it accepts the server’s new certificate.
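The log messages quoted in sections 1 and 2 can be sorted into causes with a small helper. This is a minimal sketch, not part of Rancher: `classify_tls_error` is a hypothetical function name, the category labels are made up for illustration, and the usage comment assumes the agent runs as a Deployment named `cattle-cluster-agent`, which may differ in your cluster.

```shell
# classify_tls_error: hypothetical helper that maps one log line to the
# certificate problem it most likely indicates (labels are illustrative).
classify_tls_error() {
  case "$1" in
    *"certificate has expired or is not yet valid"*)
      echo "expired-or-not-yet-valid" ;;
    *"certificate signed by unknown authority"*)
      echo "untrusted-ca" ;;
    *"tls: bad certificate"*)
      echo "rejected-by-peer" ;;
    *)
      echo "no-tls-error" ;;
  esac
}

# Example usage (assumed Deployment name; adjust to your cluster):
#   kubectl logs -n cattle-system deployment/cattle-cluster-agent --tail=200 |
#     while IFS= read -r line; do classify_tls_error "$line"; done |
#     sort | uniq -c
```

An `expired-or-not-yet-valid` result points at section 1, while `untrusted-ca` and `rejected-by-peer` usually point at the trust problems described in section 2.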
3. Network Connectivity Issues / Firewall Blocking
The agent node might not be able to reach the Rancher server on the required ports (typically 443 for API communication, and potentially 80 if HTTP is used for initial bootstrapping).
- Diagnosis:
  - From the agent node, try to `curl -v https://<rancher-server-ip>:443`. Look for connection refused or timeout errors.
  - Check any network firewalls, security groups (AWS, Azure, GCP), or `iptables` rules on the agent node or intermediate network devices.
- Fix:
  - Ensure that port 443 (and any other necessary port) is open inbound on the Rancher server’s network and outbound from the agent node’s network.
  - If using `iptables` on the agent node, ensure rules allow outbound traffic to the Rancher server’s IP and port. For example, `sudo iptables -A OUTPUT -p tcp --dport 443 -d <rancher-server-ip> -j ACCEPT`. Because `-A` appends to the end of the chain, use `-I` instead if an earlier rule already drops the traffic.
- Why it works: This directly addresses network path issues, allowing the agent to establish a TCP connection to the server’s listening port.
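When scripting the `curl` check, the exit code already distinguishes DNS, network, and TLS failures. The sketch below maps curl's documented exit codes to the sections of this guide. `explain_curl_exit` is a hypothetical helper name, and the wording of each message is illustrative only.

```shell
# explain_curl_exit: hypothetical helper that translates a curl exit code
# into the kind of failure it indicates.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable: TCP and TLS succeeded" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "network: connection refused or host unreachable" ;;
    28) echo "network: timed out (often a firewall dropping packets)" ;;
    35) echo "tls: handshake failed" ;;
    60) echo "tls: server certificate not trusted" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}

# Example usage from the agent node:
#   curl -sS -o /dev/null --connect-timeout 5 "https://<rancher-server-ip>:443"
#   explain_curl_exit "$?"
```

A timeout (28) typically means packets are being silently dropped, while a refusal (7) means the host answered but nothing is listening on the port or a firewall actively rejected the connection.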
4. Incorrect Cluster Registration URL
The agent was configured with an incorrect URL for the Rancher server during its initial registration.
- Diagnosis:
  - Inspect the Rancher agent’s configuration. If it’s running as a Deployment/DaemonSet in `cattle-system`, look for the `CATTLE_SERVER_URL` environment variable in the agent’s pod spec. Run `kubectl get pods -n cattle-system -o yaml` and search for the agent’s pod.
  - Check the `kubeconfig` file used by the agent if it’s not running within the cluster itself.
- Fix:
  - Update the `CATTLE_SERVER_URL` environment variable in the agent’s Deployment or DaemonSet definition to the correct Rancher server URL. Then, restart the agent pods.
  - If the agent is registered via a `kubeconfig` file, update the `clusters` section of that file with the correct server URL and then restart the agent.
- Why it works: The agent needs to know the precise address of the Rancher server to attempt communication. Correcting this URL ensures it’s trying to connect to the right place.
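Searching a long YAML dump by eye is error-prone, so the lookup can be scripted. The following is a rough sketch: `env_value_from_yaml` is a hypothetical helper that scans a manifest on stdin for a named environment variable and prints its literal `value`. The variable name `CATTLE_SERVER_URL` is taken from the text above; if your agent manifest uses a different name, pass that name instead.

```shell
# env_value_from_yaml: hypothetical helper. Reads a pod or Deployment
# manifest on stdin and prints the literal value of the named env var.
# It handles only simple "- name: X" / "value: Y" pairs, not valueFrom.
env_value_from_yaml() {
  awk -v name="$1" '
    # Start of the env entry we are looking for.
    $1 == "-" && $2 == "name:" && $3 == name { found = 1; next }
    # First value line after the match: strip double quotes and print.
    found && $1 == "value:" { gsub(/"/, "", $2); print $2; exit }
    # A new list item began before any value line appeared: stop looking.
    found && $1 == "-" { found = 0 }
  '
}

# Example usage:
#   kubectl get deployment <agent-deployment-name> -n cattle-system -o yaml |
#     env_value_from_yaml CATTLE_SERVER_URL
```

Compare the printed URL with the address you use to reach the Rancher UI. An empty result means the variable is not set as a literal value in that manifest.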
5. DNS Resolution Issues
The agent node cannot resolve the hostname of the Rancher server.
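- Diagnosis:
  - From the agent node, try to `ping <rancher-server-hostname>` or `dig <rancher-server-hostname>`. If the hostname cannot be resolved, DNS is the problem. A `ping` that resolves the name but gets no reply may only mean ICMP is blocked, so judge by the resolution step.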
- Fix:
  - Ensure the agent node’s DNS configuration (`/etc/resolv.conf`) is correct and points to a working DNS server.
  - If the Rancher server is only accessible via a specific internal DNS, ensure that DNS server is reachable from the agent node.
  - If you’re using host aliases, verify they are correctly configured on the agent node.
- Why it works: DNS is the first step in resolving a hostname to an IP address. If this fails, the agent cannot even begin to establish a network connection.
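To see which resolver is failing, each server listed in `/etc/resolv.conf` can be queried on its own. A minimal sketch, assuming a standard resolver file with `nameserver` lines; `list_nameservers` is a hypothetical helper name.

```shell
# list_nameservers: hypothetical helper that prints the resolver IPs from a
# resolv.conf-style file (defaults to /etc/resolv.conf).
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example usage: ask every configured resolver for the Rancher hostname.
#   for ns in $(list_nameservers); do
#     echo "resolver $ns:"
#     dig +short "@$ns" <rancher-server-hostname>
#   done
```

A resolver that prints nothing for the hostname while others return an address is the one to fix or remove.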
6. Resource Constraints on the Agent Node
The agent pod might be experiencing resource starvation (CPU, memory) on the node it’s running on, preventing it from establishing or maintaining its connection.
- Diagnosis:
  - Check the agent pod’s resource usage using `kubectl top pod <agent-pod-name> -n cattle-system`.
  - Examine the node’s overall resource utilization (`kubectl top node <node-name>`).
  - Look for OOMKilled events for the agent pod: `kubectl get events -n cattle-system --field-selector involvedObject.name=<agent-pod-name>`.
- Fix:
  - Increase the resource requests and limits for the agent pod in its Deployment/DaemonSet definition. For example, change `resources: { requests: { cpu: "100m", memory: "256Mi" }, limits: { cpu: "500m", memory: "1Gi" } }` to higher values.
  - If the node itself is undersized, consider moving the agent to a node with more resources or scaling up the node.
- Why it works: Sufficient resources are required for the agent process to run correctly, including its networking and TLS operations. Lack of resources can lead to unresponsiveness and connection failures.
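To judge how close the agent is to its memory limit, the figure from `kubectl top pod` can be compared against the configured limit. This is only a sketch: `to_mib` and `memory_percent` are hypothetical helpers, and they handle just the `Ki`, `Mi`, and `Gi` suffixes with integer arithmetic.

```shell
# to_mib: hypothetical helper that converts a memory quantity with a
# Ki, Mi, or Gi suffix to whole MiB (integer arithmetic, rounds down).
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# memory_percent: prints usage as an integer percentage of the limit.
memory_percent() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# Example usage (the 1Gi limit is the example value from the text above):
#   used=$(kubectl top pod <agent-pod-name> -n cattle-system --no-headers |
#     awk '{ print $3 }')
#   memory_percent "$used" 1Gi
```

A value that stays near 100 suggests the pod is at risk of being OOMKilled and the limit should be raised.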
7. Rancher Server Internal Issues (Less Common)
Rarely, the Rancher server itself might be experiencing internal issues that prevent it from accepting new agent connections, even if its certificate is valid and network is open.
- Diagnosis:
- Check the Rancher server logs for errors related to API requests, database connectivity, or internal service communication.
- Verify the Rancher server pods are running and healthy.
- Fix:
  - Restarting the Rancher server pods can sometimes resolve transient internal issues: `kubectl rollout restart deployment rancher -n cattle-system`.
  - If the issue persists, investigate Rancher server logs and its dependencies (like the database).
- Why it works: This addresses potential bugs or resource exhaustion within the Rancher server application itself.
After addressing these, the next common issue you might encounter is the agent pod restarting due to a crash loop if the underlying problem wasn’t fully resolved, or if a new, unrelated issue arises.