When a Rancher agent fails to connect to the Rancher server, the most frequent cause is a TLS trust problem: either the server no longer trusts the agent’s certificate, or the agent no longer trusts the server’s.

Here are the most common reasons and how to fix them:

1. Agent’s TLS Certificate Expired or is Invalid

The Rancher agent uses a TLS certificate to authenticate with the Rancher server. If this certificate expires or becomes invalid (e.g., the hostname/IP in the certificate doesn’t match the agent’s hostname/IP), the server will reject the connection.

  • Diagnosis:
    • Check the agent logs on the cluster where the agent runs (for example, kubectl logs -n cattle-system <agent-pod-name>) for messages like x509: certificate has expired or is not yet valid or x509: certificate signed by unknown authority.
    • From the agent node, run openssl s_client -connect <rancher-server-ip>:443 -showcerts and check the validity dates and subject names in the output. Note that this shows the certificate chain the server presents, so it tells you what the agent sees during the TLS handshake.
  • Fix:
    • The easiest fix is to restart the Rancher agent pods, which forces the agent to reconnect and can clear a stale certificate. If the agent runs as a Deployment (typically cattle-cluster-agent), use kubectl rollout restart deployment <agent-deployment-name> -n cattle-system. If it runs as a DaemonSet (typically cattle-node-agent), use kubectl rollout restart daemonset <agent-daemonset-name> -n cattle-system.
    • If the restart doesn’t help, re-register the agent: delete the existing agent registration in Rancher, then run kubectl apply -f <generated-yaml-file> again against the cluster the agent runs in.
  • Why it works: Restarting the agent pod often forces it to re-establish its connection and obtain a new, valid certificate from the Rancher server. Manual re-registration ensures a completely fresh certificate is issued.
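
To make the expiry check less error-prone, the date can be computed rather than read by eye. The following is a minimal Python sketch, assuming you have the notAfter line that openssl x509 -noout -enddate prints for a certificate. The function name days_until_expiry is hypothetical and only for illustration.

```python
import ssl
import time


def days_until_expiry(enddate_line, now=None):
    """Return days left before the certificate's notAfter date (negative if expired)."""
    # Accept either the bare date or the "notAfter=..." line printed by openssl.
    not_after = enddate_line.strip().split("=", 1)[-1]
    # cert_time_to_seconds expects the "Mon DD HH:MM:SS YYYY GMT" format.
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400
```

A negative result means the certificate has already expired; a small positive one suggests renewing before the agent starts failing.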

2. Rancher Server’s TLS Certificate Changed or is Invalid

Similarly, if the Rancher server’s TLS certificate changes (e.g., after an upgrade or certificate rotation) and the agent doesn’t trust the new certificate, the connection will fail. This is especially common if you’re using custom CA certificates.

  • Diagnosis:
    • On the agent node, check the agent logs for messages like certificate signed by unknown authority or remote error: tls: bad certificate.
    • You can use openssl s_client -connect <rancher-server-ip>:443 -showcerts on the agent node to inspect the server’s certificate and ensure it’s signed by a trusted CA.
  • Fix:
    • If you’re using Rancher’s self-signed certificates and they’ve been rotated, restarting the Rancher server pods might resolve it: kubectl rollout restart deployment rancher -n cattle-system (assuming your Rancher deployment is named rancher).
    • If you’re using custom CA certificates, ensure the CA certificate used to sign the Rancher server’s certificate is present and trusted on the agent nodes. This typically involves ensuring the CA certificate is added to the system’s trust store on each agent node or configured within the agent’s deployment.
    • The most robust fix is often to re-register the agent, which will fetch the new server certificate.
  • Why it works: The agent needs to trust the identity of the server it’s connecting to. Updating its trust store or re-establishing trust through re-registration ensures it accepts the server’s new certificate.
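
When the agent logs are long, it can help to scan them for the TLS errors quoted above instead of searching by hand. This is a rough Python sketch, not part of Rancher: the function name and the cause labels are made up for illustration, and only the error substrings come from the messages mentioned in this article.

```python
# Error substrings quoted in the text, mapped to illustrative cause labels.
KNOWN_TLS_ERRORS = {
    "certificate has expired or is not yet valid": "expired or not-yet-valid certificate",
    "certificate signed by unknown authority": "untrusted CA",
    "tls: bad certificate": "peer rejected the certificate",
}


def classify_tls_errors(log_text):
    """Return a sorted list of likely causes found in agent log output."""
    causes = set()
    for line in log_text.splitlines():
        for needle, cause in KNOWN_TLS_ERRORS.items():
            if needle in line:
                causes.add(cause)
    return sorted(causes)
```

You could feed it the output of kubectl logs -n cattle-system <agent-pod-name>. An "untrusted CA" result points toward the custom CA fix, while "expired or not-yet-valid certificate" points toward rotation or a clock problem.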

3. Network Connectivity Issues / Firewall Blocking

The agent node might not be able to reach the Rancher server on the required ports (typically 443 for API communication, and potentially 80 if HTTP is used for initial bootstrapping).

  • Diagnosis:
    • From the agent node, run curl -v https://<rancher-server-ip>:443 and look for connection refused or timeout errors. A certificate error at this stage means the TCP connection itself succeeded.
    • Check any network firewalls, security groups (AWS, Azure, GCP), or iptables rules on the agent node or intermediate network devices.
  • Fix:
    • Ensure that port 443 (and any other necessary port) is open inbound on the Rancher server’s network and outbound from the agent node’s network.
    • If using iptables on the agent node, ensure rules allow outbound traffic to the Rancher server’s IP and port. For example, sudo iptables -I OUTPUT -p tcp -d <rancher-server-ip> --dport 443 -j ACCEPT. Using -I inserts the rule at the top of the chain so it is evaluated before any existing DROP or REJECT rule; -A would append it after them.
  • Why it works: This directly addresses network path issues, allowing the agent to establish a TCP connection to the server’s listening port.
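
If curl isn’t available on the node, the same reachability check can be sketched in Python with only the standard library. This is a minimal example under that assumption; can_connect is a hypothetical helper name, and it only tests the TCP connection, not TLS.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Try a plain TCP connection and report (success, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # Packets are likely being dropped silently (firewall, security group).
        return False, "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening or a rule rejects the port.
        return False, "refused"
    except OSError as exc:
        # Covers unreachable networks and name-resolution failures.
        return False, f"error: {exc}"
```

A timeout usually points to a firewall or security group dropping traffic, while a refusal means the host was reached but the port isn’t accepting connections.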

4. Incorrect Cluster Registration URL

The agent was configured with an incorrect URL for the Rancher server during its initial registration.

  • Diagnosis:
    • Inspect the Rancher agent’s configuration. If it’s running as a Deployment/DaemonSet in cattle-system, look for the server URL environment variable in the agent’s pod spec (CATTLE_SERVER in standard agent manifests). Run kubectl get pods -n cattle-system -o yaml and find the agent’s pod in the output.
    • Check the kubeconfig file used by the agent if it’s not running within the cluster itself.
  • Fix:
    • Update the server URL environment variable in the agent’s Deployment or DaemonSet definition to the correct Rancher server URL, then restart the agent pods.
    • If the agent is registered via a kubeconfig file, update the clusters section of that file with the correct server URL and then restart the agent.
  • Why it works: The agent needs to know the precise address of the Rancher server to attempt communication. Correcting this URL ensures it’s trying to connect to the right place.
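
Searching a large YAML dump by eye is easy to get wrong, so here is a small Python sketch that pulls the server URL out of a workload’s JSON instead. It assumes you saved the output of kubectl get deployment <agent-deployment-name> -n cattle-system -o json; the helper name and the list of candidate variable names are assumptions for illustration.

```python
import json

# Candidate names are assumptions; check which one your agent manifest uses.
SERVER_URL_VARS = ("CATTLE_SERVER", "CATTLE_SERVER_URL")


def find_server_url(workload_json):
    """Return (variable name, value) pairs found in a Deployment/DaemonSet JSON dump."""
    workload = json.loads(workload_json)
    containers = workload["spec"]["template"]["spec"]["containers"]
    found = []
    for container in containers:
        for env in container.get("env", []):
            if env.get("name") in SERVER_URL_VARS:
                found.append((env["name"], env.get("value")))
    return found
```

An empty result means none of the candidate variables is set on the workload, which is itself worth investigating.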

5. DNS Resolution Issues

The agent node cannot resolve the hostname of the Rancher server.

  • Diagnosis:
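    • From the agent node, run dig <rancher-server-hostname> or ping <rancher-server-hostname>. If dig returns no answer, or ping reports that the name is unknown, DNS is the problem. A ping that resolves the name but gets no replies only means ICMP is blocked, not that DNS is broken.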
  • Fix:
    • Ensure the agent node’s DNS configuration (/etc/resolv.conf) is correct and points to a working DNS server.
    • If the Rancher server is only accessible via a specific internal DNS, ensure that DNS server is reachable from the agent node.
    • If you’re using host aliases, verify they are correctly configured on the agent node.
  • Why it works: DNS is the first step in resolving a hostname to an IP address. If this fails, the agent cannot even begin to establish a network connection.
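
The resolution step can also be checked from Python, which uses the same system resolver configuration as most processes on the node. This is a minimal sketch; resolve is a hypothetical helper name.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The resolver could not find the name (or no DNS server answered).
        return []
    # Each entry's sockaddr starts with the IP address; deduplicate them.
    return sorted({info[4][0] for info in infos})
```

An empty list for the Rancher hostname confirms a DNS problem; a non-empty list that doesn’t match the server’s real IP suggests a stale record or a wrong host alias.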

6. Resource Constraints on the Agent Node

The agent pod might be experiencing resource starvation (CPU, memory) on the node it’s running on, preventing it from establishing or maintaining its connection.

  • Diagnosis:
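    • Check the agent pod’s resource usage using kubectl top pod <agent-pod-name> -n cattle-system (this requires metrics-server in the cluster).
    • Examine the node’s overall resource utilization (kubectl top node <node-name>).
    • Check whether the agent container was OOMKilled: kubectl describe pod <agent-pod-name> -n cattle-system shows the reason under Last State. Related events are available with kubectl get events -n cattle-system --field-selector involvedObject.name=<agent-pod-name>.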
  • Fix:
    • Increase the resource requests and limits for the agent pod in its Deployment/DaemonSet definition. For example, change resources: { requests: { cpu: "100m", memory: "256Mi" }, limits: { cpu: "500m", memory: "1Gi" } } to higher values.
    • If the node itself is undersized, consider moving the agent to a node with more resources or scaling up the node.
  • Why it works: Sufficient resources are required for the agent process to run correctly, including its networking and TLS operations. Lack of resources can lead to unresponsiveness and connection failures.
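
To judge whether the limits are actually too tight, it helps to compare the usage reported by kubectl top against the configured limit. The sketch below is a simplified Python illustration that only handles the binary suffixes (Ki, Mi, Gi, Ti) and plain byte counts, not the full Kubernetes quantity syntax; both function names are hypothetical.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_to_bytes(quantity):
    """Convert a memory quantity such as "256Mi" to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a byte count


def memory_headroom(usage, limit):
    """Fraction of the limit still free (0.0 means the pod is at its limit)."""
    return 1 - memory_to_bytes(usage) / memory_to_bytes(limit)
```

For example, memory_headroom("256Mi", "1Gi") gives 0.75, meaning three quarters of the limit is unused. Headroom close to zero alongside OOMKilled restarts is a strong hint that the limit should be raised.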

7. Rancher Server Internal Issues (Less Common)

Rarely, the Rancher server itself might be experiencing internal issues that prevent it from accepting new agent connections, even if its certificate is valid and network is open.

  • Diagnosis:
    • Check the Rancher server logs for errors related to API requests, database connectivity, or internal service communication.
    • Verify the Rancher server pods are running and healthy.
  • Fix:
    • Restarting the Rancher server pods can sometimes resolve transient internal issues: kubectl rollout restart deployment rancher -n cattle-system.
    • If the issue persists, investigate Rancher server logs and its dependencies (like the database).
  • Why it works: This addresses potential bugs or resource exhaustion within the Rancher server application itself.
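
Checking pod health across several replicas can be sketched in Python by filtering the JSON that kubectl returns. This example assumes the output of kubectl get pods -n cattle-system -o json (optionally narrowed with a label selector for the Rancher pods, which depends on your install); not_ready_pods is a hypothetical helper name.

```python
def not_ready_pods(pod_list):
    """Names of pods that are not Running with a Ready condition of "True"."""
    problems = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        conditions = status.get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if status.get("phase") != "Running" or not ready:
            problems.append(pod["metadata"]["name"])
    return problems
```

Load the kubectl output with json.load and pass the resulting dictionary in. Any names returned are the pods whose logs are worth reading first.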

Once the connection is restored, the next issue you’re likely to see is the agent pod entering a crash loop, either because the underlying problem wasn’t fully resolved or because a new, unrelated issue has appeared.
