Troubleshooting the SUSE® Rancher Prime Server Kubernetes Cluster

This section describes how to troubleshoot an installation of Rancher on a Kubernetes cluster.

Relevant Namespaces

Most of the troubleshooting is done on objects in these three namespaces.

  • cattle-system - Rancher deployment and pods.

  • ingress-nginx - Ingress controller pods and services.

  • cert-manager - cert-manager pods.
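
For a quick status overview across all three namespaces, you can filter the full pod list. This assumes the default namespace names used by the Rancher and cert-manager charts:

kubectl get pods --all-namespaces | grep -E 'cattle-system|ingress-nginx|cert-manager'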

"default backend - 404"

A number of things can cause the ingress controller not to forward traffic to your Rancher instance. Most of the time it is due to a bad SSL configuration.

Things to check

Check if Rancher is Running

Use kubectl to check the cattle-system namespace and see if the Rancher pods are in a Running state.

kubectl -n cattle-system get pods

NAME                           READY     STATUS    RESTARTS   AGE
pod/rancher-784d94f59b-vgqzh   1/1       Running   0          10m

If the state is not Running, run a describe on the pod and check the Events.

kubectl -n cattle-system describe pod

...
Events:
  Type     Reason                 Age   From                Message
  ----     ------                 ----  ----                -------
  Normal   Scheduled              11m   default-scheduler   Successfully assigned rancher-784d94f59b-vgqzh to localhost
  Normal   SuccessfulMountVolume  11m   kubelet, localhost  MountVolume.SetUp succeeded for volume "rancher-token-dj4mt"
  Normal   Pulling                11m   kubelet, localhost  pulling image "rancher/rancher:v2.0.4"
  Normal   Pulled                 11m   kubelet, localhost  Successfully pulled image "rancher/rancher:v2.0.4"
  Normal   Created                11m   kubelet, localhost  Created container
  Normal   Started                11m   kubelet, localhost  Started container

Check the Rancher Logs

Use kubectl to list the pods.

kubectl -n cattle-system get pods

NAME                           READY     STATUS    RESTARTS   AGE
pod/rancher-784d94f59b-vgqzh   1/1       Running   0          10m

Use kubectl and the pod name to list the logs from the pod.

kubectl -n cattle-system logs -f rancher-784d94f59b-vgqzh
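
If the pod name is not known in advance, you can also select the Rancher pods by label. This assumes the app=rancher label set by the Rancher Helm chart:

kubectl -n cattle-system logs -f -l app=rancher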

Cert CN is "Kubernetes Ingress Controller Fake Certificate"

Use your browser to check the certificate details. If it says the Common Name is "Kubernetes Ingress Controller Fake Certificate", something may have gone wrong with reading or issuing your SSL cert.

If you are using LetsEncrypt, it can sometimes take a few minutes for the cert to be issued.
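
You can also inspect the certificate that is actually being served from the command line. This is a sketch assuming rancher.example.com is your Rancher hostname:

openssl s_client -connect rancher.example.com:443 -servername rancher.example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates

If the subject still shows Kubernetes Ingress Controller Fake Certificate, the ingress has not picked up your certificate yet.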

Checking for issues with cert-manager issued certs (Rancher Generated or LetsEncrypt)

cert-manager has three parts:

  • cert-manager pod in the cert-manager namespace.

  • Issuer object in the cattle-system namespace.

  • Certificate object in the cattle-system namespace.

Work backwards, running kubectl describe on each object and checking its events, to track down what might be missing.
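
Before describing the Issuer and Certificate objects, you can confirm that the cert-manager pod itself is up:

kubectl -n cert-manager get pods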

In the example below, there is a problem with the Issuer:

kubectl -n cattle-system describe certificate
...
Events:
  Type     Reason          Age                 From          Message
  ----     ------          ----                ----          -------
  Warning  IssuerNotReady  18s (x23 over 19m)  cert-manager  Issuer rancher not ready
kubectl -n cattle-system describe issuer
...
Events:
  Type     Reason         Age                 From          Message
  ----     ------         ----                ----          -------
  Warning  ErrInitIssuer  19m (x12 over 19m)  cert-manager  Error initializing issuer: secret "tls-rancher" not found
  Warning  ErrGetKeyPair  9m (x16 over 19m)   cert-manager  Error getting keypair for CA issuer: secret "tls-rancher" not found

Checking for Issues with Your Own SSL Certs

Your certs get applied directly to the Ingress object in the cattle-system namespace.

Check the status of the Ingress object to see if it is ready.

kubectl -n cattle-system describe ingress

If it is ready and SSL is still not working, you may have a malformed cert or secret.

Check the nginx-ingress-controller logs. Because the nginx-ingress-controller pod has multiple containers, you will need to specify the name of the container.

kubectl -n ingress-nginx logs -f nginx-ingress-controller-rfjrq nginx-ingress-controller
...
W0705 23:04:58.240571       7 backend_ssl.go:49] error obtaining PEM from secret cattle-system/tls-rancher-ingress: error retrieving secret cattle-system/tls-rancher-ingress: secret cattle-system/tls-rancher-ingress was not found
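
If the controller reports a missing or unreadable secret, as in the log line above, verify that the secret exists and that the certificate and key in it match. This is a sketch that assumes your certificate and key are also available locally as tls.crt and tls.key:

kubectl -n cattle-system get secret tls-rancher-ingress

# The two checksums should be identical if the key matches the certificate
openssl x509 -noout -modulus -in tls.crt | openssl md5
openssl rsa -noout -modulus -in tls.key | openssl md5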

No matches for kind "Issuer"

The SSL configuration option you have chosen requires cert-manager to be installed before Rancher. Otherwise, the following error is shown:

Error: validation failed: unable to recognize "": no matches for kind "Issuer" in version "certmanager.k8s.io/v1alpha1"

Install cert-manager and try installing Rancher again.
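
A minimal cert-manager installation with Helm looks roughly like the following; chart values vary between cert-manager releases, so check the cert-manager documentation for the version you are installing:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true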

Canal Pods show READY 2/3

The most common cause of this issue is that port 8472/UDP is not open between the nodes. Check your local firewall, network routing, or security groups.

Once the network issue is resolved, the canal pods should time out and restart to establish their connections.
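
To confirm whether the overlay traffic is actually arriving, you can list the canal pods and watch for VXLAN packets on a node. This is a sketch that assumes the k8s-app=canal label used by RKE and eth0 as the inter-node interface:

kubectl -n kube-system get pods -l k8s-app=canal

# Run on a node: watch for VXLAN (overlay) traffic on 8472/UDP
tcpdump -ln -i eth0 udp port 8472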

nginx-ingress-controller Pods show RESTARTS

The most common cause of this issue is that the canal pods have failed to establish the overlay network. See Canal Pods show READY 2/3 for troubleshooting.
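
To see how often the controller has restarted and why, list the pods and check the logs of the previous container instance (pod names will differ in your cluster):

kubectl -n ingress-nginx get pods

kubectl -n ingress-nginx logs --previous nginx-ingress-controller-rfjrq nginx-ingress-controller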

Failed to dial to /var/run/docker.sock: ssh: rejected: administratively prohibited (open failed)

Some causes of this error include:

  • The user specified to connect with does not have permission to access the Docker socket. You can check this by logging into the host and running docker ps:

    $ ssh user@server
    user@server$ docker ps
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES

See Manage Docker as a non-root user for how to set this up properly; a minimal example is shown after this list.

  • When using RedHat/CentOS as the operating system, you cannot use the root user to connect to the nodes because of Bugzilla #1527565. You will need to add a separate user and configure it to access the Docker socket. See Manage Docker as a non-root user for how to set this up properly.

  • The SSH server version is not 6.7 or higher. This is needed for socket forwarding to work, which is used to connect to the Docker socket over SSH. You can check this by running sshd -V on the host you are connecting to, or by using netcat:

    $ nc xxx.xxx.xxx.xxx 22
    SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10
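
For the first two causes, granting the connecting user access to the Docker socket usually comes down to adding it to the docker group, as described in Manage Docker as a non-root user. A minimal sketch, assuming a user named user and the standard docker group:

sudo usermod -aG docker user

# Log out and back in (or restart the SSH session) for the group change to take effect
ssh user@server docker ps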

Failed to dial ssh using address [xxx.xxx.xxx.xxx:xx]: Error configuring SSH: ssh: no key found

The key file specified as ssh_key_path cannot be accessed. Make sure that you specified the private key file (not the public key, .pub), and that the user that is running the rke command can access the private key file.
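
For reference, ssh_key_path is set per node in the RKE cluster.yml. A sketch of what this might look like (addresses, user, and path are placeholders):

nodes:
  - address: xxx.xxx.xxx.xxx
    user: ubuntu
    role: [controlplane, etcd, worker]
    ssh_key_path: /home/user/.ssh/id_rsa   # the private key, not the .pub file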

Failed to dial ssh using address [xxx.xxx.xxx.xxx:xx]: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

The key file specified as ssh_key_path is not correct for accessing the node. Double-check if you specified the correct ssh_key_path for the node and if you specified the correct user to connect with.

Failed to dial ssh using address [xxx.xxx.xxx.xxx:xx]: Error configuring SSH: ssh: cannot decode encrypted private keys

If you want to use encrypted private keys, you should use ssh-agent to load your keys with your passphrase. If the SSH_AUTH_SOCK environment variable is found in the environment where the rke command is run, it will be used automatically to connect to the node.
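
A minimal sketch of loading an encrypted key into ssh-agent before running rke (the key path is an example):

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa   # prompts for the passphrase
rke up                  # picks up SSH_AUTH_SOCK automatically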

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

The node is not reachable on the configured address and port.
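
You can verify basic reachability and that the Docker daemon is running; the address and user below are placeholders:

# From the machine running rke: check that the node answers on the configured SSH port
nc -zv xxx.xxx.xxx.xxx 22

# On the node itself: check that the Docker daemon is running
sudo systemctl status docker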

Agent reports TLS errors

When using Rancher, you may encounter error messages from the fleet-agent, system-agent, or cluster-agent, such as the message below:

tls: failed to verify certificate: x509: failed to load system roots and no roots provided; readdirent /dev/null: not a directory

This occurs when Rancher was configured with agent-tls-mode set to strict, but no CA certificate could be found in the cacerts setting. To resolve the issue, set agent-tls-mode to system-store, or upload the CA for Rancher as described in Adding TLS Secrets.
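
You can check and, if needed, change the setting with kubectl. This is a sketch assuming Rancher exposes it as the agent-tls-mode setting (settings.management.cattle.io):

kubectl get settings.management.cattle.io agent-tls-mode

# Switch agent certificate verification back to the system trust store
kubectl patch settings.management.cattle.io agent-tls-mode --type merge -p '{"value":"system-store"}'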

New Cluster Deployment is stuck in "Waiting for Agent to check in"

When Rancher has agent-tls-mode set to strict, new clusters may fail to provision and report a generic "Waiting for Agent to check in" error message. The root cause of this is similar to the above case of TLS errors - Rancher’s agent can’t determine which CA Rancher is using (or can’t verify that Rancher’s cert is actually signed by the specified certificate authority).

To resolve the issue, set the agent-tls-mode to system-store or upload the CA for Rancher as described in Adding TLS Secrets.