Applies to SUSE CaaS Platform 4.5.2

15 Troubleshooting

This chapter summarizes frequent problems that can occur while using SUSE CaaS Platform and their solutions.

Additionally, SUSE support collects problems and their solutions online at https://www.suse.com/support/kb/?id=SUSE_CaaS_Platform .

15.1 The supportconfig Tool

As a first step in any troubleshooting or debugging effort, you need to locate the cause of the problem. For this purpose, SUSE CaaS Platform ships the supportconfig tool and plugin. With a single command, you can collect a variety of details about your cluster so that SUSE support can pinpoint the potential cause of an issue.

In case of problems, a detailed system report can be created with the supportconfig command line tool. It will collect information about the system, such as:

  • Current Kernel version

  • Hardware information

  • Installed packages

  • Partition setup

  • Cluster and node status


A full list of the data collected by supportconfig can be found at https://github.com/SUSE/supportutils-plugin-suse-caasp/blob/master/README.md.


To collect all the relevant logs, run the supportconfig command on all the master and worker nodes individually.

sudo supportconfig
sudo tar -xvJf /var/log/nts_*.txz
cd /var/log/nts*
sudo cat kubernetes.txt crio.txt

The result is a TAR archive of files. Each of the *.txz files should be given a name that can be used to identify which cluster node it was created on.
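
One way to give each archive an identifying name is to prefix it with the node name before uploading. The helper below is a minimal sketch: the function name tag_archive and the <node>_<archive> naming scheme are assumptions made for illustration and are not part of supportconfig.

```shell
# Hypothetical helper: copy a supportconfig archive to a file name that
# starts with the node name, so archives from different nodes stay apart.
# Usage: tag_archive <archive> [node-name] [destination-dir]
tag_archive() {
  local archive="$1"
  local node="${2:-$(uname -n)}"   # default: this machine's host name
  local dest_dir="${3:-.}"
  local target

  target="${dest_dir}/${node}_$(basename "$archive")"
  cp -- "$archive" "$target" || return 1
  printf '%s\n' "$target"          # print the new path for later upload
}
```

Run it on each node after supportconfig has finished, passing the path of the generated *.txz file. The original archive is left untouched.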

After opening a Service Request (SR), you can upload the TAR archives to SUSE Global Technical Support.

The data will help to debug the issue you reported and assist you in solving the problem. For details, see https://documentation.suse.com/sles/15-SP2/single-html/SLES-admin/#cha-adm-support.

15.2 Cluster definition directory

Apart from the logs provided by running the supportconfig tool, an additional set of data might be required for debugging purposes. This information is located on the management node, in your cluster definition directory. This directory contains important and sensitive information about your SUSE CaaS Platform cluster, and it is the directory from which you issue skuba commands.


If the problem you are facing is related to your production environment, do not upload the admin.conf as this would expose access to your cluster to anyone in possession of the collected information! The same precautions apply for the pki directory, since this also contains sensitive information (CA cert and key).

In this case add --exclude='./<CLUSTER_NAME>/admin.conf' --exclude='./<CLUSTER_NAME>/pki/' to the command in the following example. Make sure to replace ./<CLUSTER_NAME> with the actual path of your cluster definition folder.

If you need to debug issues with your private certificates, a separate call with SUSE support must be scheduled to help you.

Create a TAR archive by compressing the cluster definition directory.

# Read the TIP above
# Move the admin.conf and pki directory to another safe location or exclude from packaging
tar -czvf cluster.tar.gz /home/user/<CLUSTER_NAME>/
# If the error is related to Terraform, archive the Terraform configuration files as well.
# A different archive name is used so that the first archive is not overwritten.
tar -czvf terraform.tar.gz /home/user/my-terraform-configuration/

After opening a Service Request (SR), you can upload the TAR archive to SUSE Global Technical Support.
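
The exclusion of admin.conf and the pki directory described above can be wrapped in a small function. This is a sketch under assumptions: the function name pack_cluster_dir is hypothetical, and the exclude patterns rely on tar being run from the parent directory with a ./<CLUSTER_NAME> argument so that member names start with ./<CLUSTER_NAME>/.

```shell
# Hypothetical helper: archive a cluster definition directory while leaving
# out admin.conf and the pki directory.
# Usage: pack_cluster_dir <cluster-definition-dir> [output-archive]
pack_cluster_dir() {
  local cluster_dir="${1%/}"            # strip a trailing slash, if any
  local out="${2:-cluster.tar.gz}"
  local parent name

  parent="$(dirname "$cluster_dir")"
  name="$(basename "$cluster_dir")"

  # Excludes are given before the directory so that tar applies them to it.
  tar -czf "$out" -C "$parent" \
    --exclude="./${name}/admin.conf" \
    --exclude="./${name}/pki" \
    "./${name}"
}
```

Listing the result with tar -tzf before uploading is a cheap way to confirm that neither admin.conf nor anything under pki ended up in the archive.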

15.3 Log collection

Some of this information is required for debugging certain cases. In such cases, supportconfig collects the following data:

  • etcd.txt (master nodes)

    curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/health
    curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/members
    curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/leader
    curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/self
    curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/store
    curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/metrics
    etcdcontainer=$(crictl ps --label io.kubernetes.container.name=etcd --quiet)
    crictl exec $etcdcontainer sh -c "ETCDCTL_ENDPOINTS='' ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt' ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt' ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key' ETCDCTL_API=3 etcdctl check perf"
    crictl logs -t $etcdcontainer
    crictl stats --id $etcdcontainer
    etcdpod=$(crictl ps | grep etcd | awk -F ' ' '{ print $9 }')
    crictl inspectp $etcdpod

For more information about etcd, refer to Section 15.10, “ETCD Troubleshooting”.

  • kubernetes.txt (all nodes)

    export KUBECONFIG=/etc/kubernetes/admin.conf
    kubectl version
    kubectl api-versions
    kubectl config view
    kubectl -n kube-system get pods
    kubectl get events --sort-by=.metadata.creationTimestamp
    kubectl get nodes
    kubectl get all -A
    kubectl get nodes -o yaml
  • kubernetes-cluster-info.txt (all nodes)

    export KUBECONFIG=/etc/kubernetes/admin.conf
    # a copy of kubernetes logs /var/log/kubernetes
    kubectl cluster-info dump --output-directory="/var/log/kubernetes"
  • kubelet.txt (all nodes)

    systemctl status --full kubelet
    journalctl -u kubelet
    # a copy of kubernetes manifests /etc/kubernetes/manifests
    cat /var/lib/kubelet/config.yaml
  • oidc-gangway.txt (all nodes)

    container=$(crictl ps --label io.kubernetes.container.name="oidc-gangway" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "oidc-gangway" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • oidc-dex.txt (worker nodes)

    container=$(crictl ps --label io.kubernetes.container.name="oidc-dex" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "oidc-dex" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • cilium-agent.txt (all nodes)

    container=$(crictl ps --label io.kubernetes.container.name="cilium-agent" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "cilium-agent" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • cilium-operator.txt (only from the worker node it runs on)

    container=$(crictl ps --label io.kubernetes.container.name="cilium-operator" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "cilium-operator" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • kured.txt (all nodes)

    container=$(crictl ps --label io.kubernetes.container.name="kured" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "kured" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • coredns.txt (worker nodes)

    container=$(crictl ps --label io.kubernetes.container.name="coredns" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "coredns" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • kube-apiserver.txt (master nodes)

    container=$(crictl ps --label io.kubernetes.container.name="kube-apiserver" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "kube-apiserver" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • kube-proxy.txt (all nodes)

    container=$(crictl ps --label io.kubernetes.container.name="kube-proxy" --quiet)
    crictl logs -t $container
    crictl inspect $container
    # After skuba 4.2.2
    pod=$(crictl ps | grep "kube-proxy" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • kube-scheduler.txt (master nodes)

    container=$(crictl ps --label io.kubernetes.container.name="kube-scheduler" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "kube-scheduler" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • kube-controller-manager.txt (master nodes)

    container=$(crictl ps --label io.kubernetes.container.name="kube-controller-manager" --quiet)
    crictl logs -t $container
    crictl inspect $container
    pod=$(crictl ps | grep "kube-controller-manager" | awk -F ' ' '{ print $9 }')
    crictl inspectp $pod
  • kube-system.txt (all nodes)

    export KUBECONFIG=/etc/kubernetes/admin.conf
    kubectl get all -n kube-system -o yaml
  • crio.txt (all nodes)

    crictl version
    systemctl status --full crio.service
    crictl info
    crictl images
    crictl ps --all
    crictl stats --all
    journalctl -u crio
    # a copy of /etc/crictl.yaml
    # a copy of /etc/sysconfig/crio
    # a copy of every file under /etc/crio/
    # Run the following three commands for every container using this loop:
    for i in $(crictl ps -a 2>/dev/null | grep -v "CONTAINER" | awk '{print $1}'); do
        crictl stats --id $i
        crictl logs $i
        crictl inspect $i
    done
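
Most of the per-component blocks above repeat the same crictl calls with a different container name. To gather the same data by hand for a single component, the pattern can be sketched as one function. The function name collect_component is hypothetical, and the sketch covers only the logs and inspect steps.

```shell
# Hypothetical helper: print logs and inspection data for every running
# container whose Kubernetes container name matches the argument.
# Usage: collect_component <container-name>
collect_component() {
  local name="$1"
  local ids id

  ids="$(crictl ps --label "io.kubernetes.container.name=${name}" --quiet)"
  if [ -z "$ids" ]; then
    echo "no running container named '${name}' on this node" >&2
    return 1
  fi

  # A node can run more than one matching container; handle each in turn.
  for id in $ids; do
    echo "### ${name} ${id}"
    crictl logs -t "$id"
    crictl inspect "$id"
  done
}
```

For example, running collect_component kured as root on a node prints the kured container logs followed by its inspection data, or returns a non-zero status if no such container runs there.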

15.4 Debugging SLES Nodes provision

If Terraform fails to set up the required SLES infrastructure for your cluster, please provide the configuration you applied in the form of a TAR archive.

Create a TAR archive by compressing the Terraform configuration.

tar -czvf terraform.tar.gz /path/to/terraform/configuration

After opening a Service Request (SR), you can upload the TAR archive to Global Technical Support.

15.5 Debugging Cluster Deployment

If the cluster deployment fails, re-run the command with the verbosity level set to 5 (-v=5).

For example, if bootstrapping the first master node of the cluster fails, re-run the command like this:

skuba node bootstrap --user sles --sudo --target <IP/FQDN> <NODE_NAME> -v=5

However, if the join procedure fails at the final steps, re-running it might not help. To verify this, list the current member nodes of your cluster and look for the node that failed.

kubectl get nodes

If the node that failed to join is nevertheless listed in the output as part of your cluster, this is a bad sign. The node can no longer be reset to a clean state, and it is not safe to keep it online in this unknown state. Instead of trying to fix its configuration by hand or by re-running the join or bootstrap command, we highly recommend removing this node from your cluster completely and replacing it with a new one.

skuba node remove <NODE_NAME> --drain-timeout 5s
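
The check for whether a failed node is still registered can be scripted. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name node_is_listed is hypothetical.

```shell
# Hypothetical helper: succeed if the given node name appears in the
# cluster's node list. Assumes kubectl is configured for the cluster.
# Usage: node_is_listed <node-name>
node_is_listed() {
  local node="$1"
  # "-o name" prints one "node/<name>" entry per line; match the whole
  # line literally so that "worker" does not match "worker-0".
  kubectl get nodes -o name | grep -qxF "node/${node}"
}
```

If node_is_listed <NODE_NAME> succeeds after a failed join, the node is registered and should be removed and replaced as described above.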

15.6 Error x509: certificate signed by unknown authority

When interacting with Kubernetes, you might find that your existing authentication configuration no longer matches the cluster (for example, because the cluster has been rebuilt or certificates have been replaced). In such a case, you might see the following error message in the output of your CLI or Web browser:

x509: certificate signed by unknown authority

This message indicates that your current system does not know the Certificate Authority (CA) that signed the SSL certificates used for encrypting the communication to the cluster. You then need to add or update the Root CA certificate in your local trust store.

  1. Obtain the root CA certificate from one of the Kubernetes cluster nodes, at the location /etc/kubernetes/pki/ca.crt

  2. Copy the root CA certificate into the directory /etc/pki/trust/anchors/ on your local machine

  3. Update the cache of known CA certificates

    sudo update-ca-certificates
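
Step 2 can be guarded by a quick sanity check so that only a PEM certificate is copied into the trust store. This is a sketch, not part of the product: the function name install_root_ca and the target file name kubernetes-ca.crt are assumptions.

```shell
# Hypothetical helper: copy a PEM-encoded root CA certificate into the
# trust anchors directory. The target file name is an assumption.
# Usage: install_root_ca <ca.crt> [anchors-dir]
install_root_ca() {
  local cert="$1"
  local anchors="${2:-/etc/pki/trust/anchors}"

  # Refuse files that do not look like a PEM certificate.
  if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$cert"; then
    echo "${cert} does not look like a PEM certificate" >&2
    return 1
  fi
  cp -- "$cert" "${anchors}/kubernetes-ca.crt"
}
```

The function needs write access to the anchors directory, so it is typically run as root, followed by the update-ca-certificates command from step 3.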

15.7 Error Invalid client credentials

When using Dex & Gangway for authentication, you might see the following error message in the Web browser output:

oauth2: cannot fetch token: 401 Unauthorized
Response: {"error":"invalid_client","error_description":"Invalid client credentials."}

This message indicates that your Kubernetes cluster Dex & Gangway client secret is out of sync.

15.7.1 Versions before SUSE CaaS Platform 4.2.2


These steps apply to skuba ≤ 1.3.5

Please update the Dex & Gangway ConfigMap to use the same client secret.

kubectl -n kube-system get configmap oidc-dex-config -o yaml > oidc-dex-config.yaml
kubectl -n kube-system get configmap oidc-gangway-config -o yaml > oidc-gangway-config.yaml

Make sure the oidc’s secret in oidc-dex-config.yaml is the same as the clientSecret in oidc-gangway-config.yaml. Then, apply the updated ConfigMap.

kubectl replace -f oidc-dex-config.yaml
kubectl replace -f oidc-gangway-config.yaml
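
Before running the kubectl replace commands, the two exported files can be compared with a few lines of shell. This sketch assumes that the Dex client entry starts with a "- id: oidc" line, that both secrets are written unquoted on a single line, and that the first clientSecret in the Gangway file is the relevant one. All function names are hypothetical.

```shell
# Print the "secret" of the static client with id "oidc" in the exported
# Dex ConfigMap. Assumes an unquoted "secret: <value>" line.
dex_oidc_secret() {
  awk '
    /^ *- *id: *oidc *$/      { in_client = 1; next }   # start of the oidc client
    in_client && /^ *- *id:/  { in_client = 0 }         # next client entry begins
    in_client && /^ *secret:/ { print $2; exit }
  ' "$1"
}

# Print the first "clientSecret" value in the exported Gangway ConfigMap.
gangway_client_secret() {
  awk '/^ *clientSecret:/ { print $2; exit }' "$1"
}

# Succeed only if both secrets were found and are identical.
# Usage: secrets_match oidc-dex-config.yaml oidc-gangway-config.yaml
secrets_match() {
  local dex gangway
  dex="$(dex_oidc_secret "$1")"
  gangway="$(gangway_client_secret "$2")"
  [ -n "$dex" ] && [ "$dex" = "$gangway" ]
}
```

A non-zero exit status from secrets_match means the secrets differ or one of them could not be found with these simple patterns, in which case the files should be inspected by hand.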

15.7.2 Versions after SUSE CaaS Platform 4.2.2


These steps apply to skuba ≥ 1.4.1

If you have configured Dex via a kustomize patch, update your patch to use secretEnv: OIDC_GANGWAY_CLIENT_SECRET. Change your patch as follows, from:

- id: oidc
  name: 'OIDC'
  secret: <client-secret>
  trustedPeers:
  - oidc-cli

to:

- id: oidc
  name: 'OIDC'
  secretEnv: OIDC_GANGWAY_CLIENT_SECRET
  trustedPeers:
  - oidc-cli

Dex & Gangway will then use the same client secret.

15.8 Replacing a Lost Node

If your cluster loses a node, for example due to failed hardware, remove the node as explained in Section 2.4, “Removing Nodes”. Then add a new node as described in Section 2.3, “Adding Nodes”.

15.9 Rebooting an Undrained Node with RBD Volumes Mapped

Rebooting a cluster node always requires a preceding drain. In some cases, draining the node first might not be possible, and problems can occur during reboot if RBD volumes are mapped to the node.

In this situation, apply the following steps.

  1. Make sure kubelet and CRI-O are stopped:

    systemctl stop kubelet crio
  2. Unmount every RBD device /dev/rbd* before rebooting. For example:

    umount -vAf /dev/rbd0

If several devices are mounted, the following script can be used to avoid manual unmounting:

#!/usr/bin/env bash

# Keep unmounting until no RBD device is listed in /proc/mounts.
while grep "rbd" /proc/mounts > /dev/null 2>&1; do
  for dev in $(lsblk -p -o NAME | grep "rbd"); do
    if mountpoint -x "$dev" > /dev/null 2>&1; then
      echo ">>> umounting $dev"
      umount -vAf "$dev"
    fi
  done
done

15.10 ETCD Troubleshooting

15.10.1 Introduction

This document aims to describe debugging an etcd cluster.

The required etcd logs are part of supportconfig, a utility that collects all the information required for debugging a problem. The rest of this document describes how to obtain this information manually.

15.10.2 ETCD container

ETCD is a distributed, reliable key-value store for the most critical data of a distributed system. It runs only on the master nodes, in the form of a container application. For instance, in a cluster with 3 master nodes, 3 etcd instances are expected as well:

kubectl get pods -n kube-system -l component=etcd
NAME                            READY   STATUS    RESTARTS   AGE
etcd-vm072044.qa.prv.suse.net   1/1     Running   1          7d
etcd-vm072050.qa.prv.suse.net   1/1     Running   1          7d
etcd-vm073033.qa.prv.suse.net   1/1     Running   1          7d

etcd is started with the following configuration:

etcd \
      --advertise-client-urls=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2379 \
      --cert-file=/etc/kubernetes/pki/etcd/server.crt  \
      --client-cert-auth=true --data-dir=/var/lib/etcd \
      --initial-advertise-peer-urls=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2380 \
      --initial-cluster=vm072050.qa.prv.suse.net=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2380 \
      --key-file=/etc/kubernetes/pki/etcd/server.key \
      --listen-client-urls=,https://<YOUR_MASTER_NODE_IP_ADDRESS>:2379 \
      --listen-peer-urls=https://<YOUR_MASTER_NODE_IP_ADDRESS>:2380 \
      --name=vm072050.qa.prv.suse.net \
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
      --peer-client-cert-auth=true \
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
      --snapshot-count=10000 --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

For more information about ETCD, we highly recommend reading the ETCD FAQ page.

15.10.3 Logging

Since etcd runs in a container, it is not controlled by systemd, so systemd-related commands (for example, journalctl) will fail. Use container debugging tools instead.


To use the following commands, you need to connect (e.g. via SSH) to the master node where the etcd pod is running.

To see the etcd logs, connect to a Kubernetes master node and then run as root:

ssh sles@<MASTER_NODE>
sudo bash # connect as root
etcdcontainer=$(crictl ps --label io.kubernetes.container.name=etcd --quiet)
crictl logs -f $etcdcontainer

15.10.4 etcdctl

etcdctl is a command line client for etcd. The new version of SUSE CaaS Platform uses the v3 API, so make sure to set the environment variable ETCDCTL_API=3 before using it. You also need to provide the required keys and certificates for authentication and authorization via the ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY environment variables, and to specify the endpoint via the ETCDCTL_ENDPOINTS environment variable.

  • Example

    To find out if your network and disk latency are fast enough, you can benchmark your node using the etcdctl check perf command. To do this, first connect to a Kubernetes master node:

    ssh sles@<MASTER_NODE>
    sudo bash # login as root

    and then run as root:

    etcdcontainer=$(crictl ps --label io.kubernetes.container.name=etcd --quiet)
    crictl exec $etcdcontainer sh -c \
    "ETCDCTL_API=3 \
    ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt' \
    ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt' \
    ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key' \
    etcdctl check perf"
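
The environment variables described in this section can be set once in a wrapper so that any etcdctl subcommand runs with them. This is a sketch under assumptions: the wrapper name etcdctl_exec is hypothetical, and the default endpoint https://127.0.0.1:2379 is a guess that must be adjusted to the client URL your etcd member listens on.

```shell
# Assumed defaults; override them in the environment if your setup differs.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

# Hypothetical wrapper: run an etcdctl subcommand inside the etcd container
# with the v3 API and the certificate variables already set.
# Usage: etcdctl_exec <etcdctl arguments...>
etcdctl_exec() {
  local container
  container="$(crictl ps --label io.kubernetes.container.name=etcd --quiet)"
  if [ -z "$container" ]; then
    echo "no running etcd container on this node" >&2
    return 1
  fi
  # The whole command line is passed to sh as one string.
  crictl exec "$container" sh -c "ETCDCTL_API=3 \
ETCDCTL_ENDPOINTS='${ETCD_ENDPOINT}' \
ETCDCTL_CACERT='${ETCD_PKI}/ca.crt' \
ETCDCTL_CERT='${ETCD_PKI}/server.crt' \
ETCDCTL_KEY='${ETCD_PKI}/server.key' \
etcdctl $*"
}
```

Run as root on a master node, etcdctl_exec check perf corresponds to the example above, and etcdctl_exec member list or etcdctl_exec endpoint health use the same settings.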

15.10.5 curl as an alternative

For most of the etcdctl commands, there is an alternative way to fetch the same information via curl. First you need to connect to the master node and then issue a curl command against the ETCD endpoint. Here’s an example of the information which supportconfig is collecting:

  • Health check:

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/health
  • Member list

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/members
  • Leader information

# available only from the master node where ETCD **leader** runs
sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/leader
  • Current member information

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/self
  • Statistics

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/v2/stats/store
  • Metrics

sudo curl -Ls --cacert /etc/kubernetes/pki/etcd/ca.crt \
--key /etc/kubernetes/pki/etcd/server.key \
--cert /etc/kubernetes/pki/etcd/server.crt https://localhost:2379/metrics
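
Because the curl calls above differ only in the request path, they can be folded into one helper. The following sketch reuses the certificate paths and endpoint from the commands above; the function names etcd_get and etcd_report and the two variables are hypothetical.

```shell
# Endpoint and certificate directory as used in the commands above.
ETCD_URL="${ETCD_URL:-https://localhost:2379}"
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

# Hypothetical helper: fetch one path from the local etcd endpoint.
# Usage: etcd_get <path>, for example: etcd_get /health
etcd_get() {
  curl -Ls \
    --cacert "${ETCD_PKI}/ca.crt" \
    --key "${ETCD_PKI}/server.key" \
    --cert "${ETCD_PKI}/server.crt" \
    "${ETCD_URL}$1"
}

# Hypothetical helper: query every path listed above, one after another.
etcd_report() {
  local path
  for path in /health /v2/members /v2/stats/leader /v2/stats/self /v2/stats/store /metrics; do
    echo "### ${path}"
    etcd_get "$path"
    echo
  done
}
```

Both functions need read access to the etcd key and certificate, so they are typically run as root on a master node.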

15.12 AWS Deployment fails with cannot attach profile error

For SUSE CaaS Platform to be deployed properly, you need a suitable IAM role, role policy and instance profile set up in AWS. Under normal circumstances, Terraform is invoked by a user with suitable permissions during deployment and creates these profiles automatically. If your access permissions on the AWS account forbid Terraform from creating the profiles automatically, they must be created before attempting deployment.

15.12.1 Create IAM Role, Role Policy, and Instance Profile through AWS CLI

If you do not have permission to create the IAM role, role policy, and instance profile using Terraform, ask your DevOps team to create them for you using the instructions below:

  • STACK_NAME: Cluster Stack Name

    1. Install AWS CLI:

      sudo zypper --gpg-auto-import-keys install -y aws-cli
    2. Setup AWS credentials:

      aws configure
    3. Prepare role policy:

      cat <<EOF >"./<STACK_NAME>-trust-policy.json"
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Action": "sts:AssumeRole",
            "Principal": {
              "Service": "ec2.amazonaws.com"
            },
            "Effect": "Allow",
            "Sid": ""
          }
        ]
      }
      EOF
    4. Prepare master instance policy:

      cat <<EOF >"./<STACK_NAME>-master-role-trust-policy.json"
        "Version": "2012-10-17",
        "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
    5. Prepare worker instance policy:

      cat <<EOF >"./<STACK_NAME>-worker-role-trust-policy.json"
        "Version": "2012-10-17",
        "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
    6. Create roles:

      aws iam create-role --role-name <STACK_NAME>_cpi_master --assume-role-policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-trust-policy.json
      aws iam create-role --role-name <STACK_NAME>_cpi_worker --assume-role-policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-trust-policy.json
    7. Create instance role policies:

      aws iam put-role-policy --role-name <STACK_NAME>_cpi_master --policy-name <STACK_NAME>_cpi_master --policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-master-role-trust-policy.json
      aws iam put-role-policy --role-name <STACK_NAME>_cpi_worker --policy-name <STACK_NAME>_cpi_worker --policy-document file://<FILE_DIRECTORY>/<STACK_NAME>-worker-role-trust-policy.json
    8. Create instance profiles:

      aws iam create-instance-profile --instance-profile-name <STACK_NAME>_cpi_master
      aws iam create-instance-profile --instance-profile-name <STACK_NAME>_cpi_worker
    9. Add role to instance profiles:

      aws iam add-role-to-instance-profile --role-name <STACK_NAME>_cpi_master --instance-profile-name <STACK_NAME>_cpi_master
      aws iam add-role-to-instance-profile --role-name <STACK_NAME>_cpi_worker --instance-profile-name <STACK_NAME>_cpi_worker
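
Steps 6 to 9 repeat the same four commands for the master and the worker role. To avoid typing errors when substituting the placeholders, the commands can be generated for review first. This is a sketch: the function name print_iam_commands is hypothetical, and it only prints the commands, grouped per role rather than per step, without running them.

```shell
# Hypothetical helper: print the "aws iam" commands from steps 6 to 9 for a
# given stack name, without running them.
# Usage: print_iam_commands <STACK_NAME> <FILE_DIRECTORY>
print_iam_commands() {
  local stack="$1" dir="$2" role name

  for role in master worker; do
    name="${stack}_cpi_${role}"
    # Role first, then its policy; profile first, then the role attachment.
    echo "aws iam create-role --role-name ${name} --assume-role-policy-document file://${dir}/${stack}-trust-policy.json"
    echo "aws iam put-role-policy --role-name ${name} --policy-name ${name} --policy-document file://${dir}/${stack}-${role}-role-trust-policy.json"
    echo "aws iam create-instance-profile --instance-profile-name ${name}"
    echo "aws iam add-role-to-instance-profile --role-name ${name} --instance-profile-name ${name}"
  done
}
```

After reviewing the output, the printed lines can be run one by one, or all at once by piping them to sh.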