Applies to SUSE CaaS Platform 4.2.4

10 Cluster Disaster Recovery

etcd is a crucial component of Kubernetes: the etcd cluster stores the entire Kubernetes cluster state, including critical configuration data, specifications, and the statuses of running workloads. It also serves as the backend for service discovery. Chapter 11, Backup and Restore with Velero explains how to use Velero to back up, restore, and migrate data. However, the Kubernetes cluster must be accessible for Velero to operate. Since the cluster can become inaccessible for many reasons, for example when all of its master nodes are lost, it is important to back up the etcd cluster data periodically.

10.1 Backing Up etcd Cluster Data

This section describes how to back up the data of the etcd cluster running on the master nodes of SUSE CaaS Platform.

10.1.1 Data to Back Up

  1. Create backup directories on external storage.

    BACKUP_DIR=CaaSP_Backup_`date +%Y%m%d%H%M%S`
    mkdir /${BACKUP_DIR}
  2. Copy the following files/folders into the backup directory:

    • The skuba command-line binary: the binary version used for the running cluster. Used to replace nodes in the cluster.

    • The cluster definition folder: Directory created during bootstrap holding the cluster certificates and configuration.

    • The etcd cluster database: Holds the entire cluster state. Can be used to recover the master nodes. Refer to the next section for steps to create an etcd cluster database backup.

  3. (Optional) Compress the backup directory into an archive and remove the original directory. A consolidated sketch of these steps follows this list.

    tar czvf ${BACKUP_DIR}.tgz /${BACKUP_DIR}
    rm -rf /${BACKUP_DIR}
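
The three items above can be collected with a short shell sequence. This is only a minimal sketch, not part of the product: it assumes skuba is on the PATH and that the cluster definition folder created during bootstrap is ~/my-cluster; adjust both to your environment.

  # Minimal consolidated file-level backup (sketch; adjust paths to your setup).
  BACKUP_DIR=CaaSP_Backup_`date +%Y%m%d%H%M%S`
  mkdir /${BACKUP_DIR}

  # The skuba binary and the cluster definition folder.
  cp "$(command -v skuba)" /${BACKUP_DIR}/
  cp -r ~/my-cluster /${BACKUP_DIR}/

  # The etcd database snapshot is created separately (see the next section);
  # copy the resulting snapshot file into /${BACKUP_DIR}/ as well.

  # Optionally compress and clean up.
  tar czvf ${BACKUP_DIR}.tgz /${BACKUP_DIR}
  rm -rf /${BACKUP_DIR}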

10.1.2 Creating an etcd Cluster Database Backup

10.1.2.1 Procedure

  1. Mount the external storage device on all master nodes. This is only required if the following step uses a local hostPath as volume storage.

  2. Create backup.

    1. Find the size of the database to be backed up

      ls -sh /var/lib/etcd/member/snap/db
      Important

      The backup size depends on the cluster. Ensure that each backup location has sufficient space available; it must be larger than the database snapshot file.

      You should also have a rotation method to clean up unneeded snapshots over time.

      If there is insufficient space available during the backup, the pod will fail to reach the Running state and "no space left on device" errors will appear in the pod logs.

      The example manifest below binds to a local hostPath. We strongly recommend using other storage methods instead.

    2. Modify the script example

      Replace <STORAGE_MOUNT_POINT> with the directory in which to store the backup. The directory must exist on every node in the cluster.

      Replace <IN_CLUSTER_ETCD_IMAGE> with the etcd image used in the cluster. This can be retrieved by accessing any one of the nodes in the cluster and running:

      grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'
    3. Create a backup deployment

      Run the following script:

      ETCD_SNAPSHOT="<STORAGE_MOUNT_POINT>/etcd_snapshot"
      ETCD_IMAGE="<IN_CLUSTER_ETCD_IMAGE>"
      MANIFEST="etcd-backup.yaml"
      
      cat << *EOF* > ${MANIFEST}
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: etcd-backup
        namespace: kube-system
        labels:
          jobgroup: backup
      spec:
        template:
          metadata:
            name: etcd-backup
            labels:
              jobgroup: backup
          spec:
            containers:
            - name: etcd-backup
              image: ${ETCD_IMAGE}
              env:
              - name: ETCDCTL_API
                value: "3"
              command: ["/bin/sh"]
              args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-\$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
              volumeMounts:
              - mountPath: /etc/kubernetes/pki/etcd
                name: etcd-certs
                readOnly: true
              - mountPath: /backup
                name: etcd-backup
            restartPolicy: OnFailure
            nodeSelector:
              node-role.kubernetes.io/master: ""
            tolerations:
            - effect: NoSchedule
              operator: Exists
            hostNetwork: true
            volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
                type: DirectoryOrCreate
            - name: etcd-backup
              hostPath:
                path: ${ETCD_SNAPSHOT}
                type: Directory
      *EOF*
      
      kubectl create -f ${MANIFEST}

      If you are using a local hostPath and not a shared storage device, the etcd backup is created on one of the master nodes. To find the node associated with each etcd backup, run the command below; a sketch for verifying the job and retrieving the snapshot follows this procedure:

      kubectl get pods --namespace kube-system --selector=job-name=etcd-backup -o wide
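
To confirm that the backup job finished and to copy the snapshot off the node it ran on, something along the following lines can be used. This is a sketch: <MASTER_NODE> and the local destination directory are placeholders, and the snapshot path is the <STORAGE_MOUNT_POINT>/etcd_snapshot directory configured above.

  # Wait for the backup job to complete, then inspect its log output.
  kubectl wait --for=condition=complete --timeout=120s \
   --namespace kube-system job/etcd-backup
  kubectl logs --namespace kube-system --selector=job-name=etcd-backup

  # Copy the snapshot from the master node reported by the command above
  # (<MASTER_NODE> and the destination directory are placeholders).
  scp <MASTER_NODE>:<STORAGE_MOUNT_POINT>/etcd_snapshot/etcd-snapshot-*.db /local/backup/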

10.1.3 Scheduling etcd Cluster Backup

  1. Mount the external storage device on all master nodes. This is only required if the following step uses a local hostPath as volume storage.

  2. Create the CronJob.

    1. Find the size of the database to be backed up

      ls -sh /var/lib/etcd/member/snap/db
      Important

      The backup size depends on the cluster. Ensure that each backup location has sufficient space available; it must be larger than the database snapshot file.

      You should also have a rotation method to clean up unneeded snapshots over time.

      If there is insufficient space available during the backup, the pod will fail to reach the Running state and "no space left on device" errors will appear in the pod logs.

      The example manifest below binds to a local hostPath. We strongly recommend using other storage methods instead.
    2. Modify the script example

      Replace <STORAGE_MOUNT_POINT> with the directory in which to store the backup. The directory must exist on every node in the cluster.

      Replace <IN_CLUSTER_ETCD_IMAGE> with the etcd image used in the cluster. This can be retrieved by accessing any one of the nodes in the cluster and running:

      grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'
    3. Create a backup schedule deployment

      Run the following script:

      ETCD_SNAPSHOT="<STORAGE_MOUNT_POINT>/etcd_snapshot"
      ETCD_IMAGE="<IN_CLUSTER_ETCD_IMAGE>"
      
      # SCHEDULE in Cron format. https://crontab.guru/
      SCHEDULE="0 1 * * *"
      
      # *_HISTORY_LIMIT is the maximum number of finished jobs kept in the cluster history.
      SUCCESS_HISTORY_LIMIT="3"
      FAILED_HISTORY_LIMIT="3"
      
      MANIFEST="etcd-backup.yaml"
      
      cat << *EOF* > ${MANIFEST}
      apiVersion: batch/v1beta1
      kind: CronJob
      metadata:
        name: etcd-backup
        namespace: kube-system
      spec:
        startingDeadlineSeconds: 100
        schedule: "${SCHEDULE}"
        successfulJobsHistoryLimit: ${SUCCESS_HISTORY_LIMIT}
        failedJobsHistoryLimit: ${FAILED_HISTORY_LIMIT}
        jobTemplate:
          spec:
            template:
              spec:
                containers:
                - name: etcd-backup
                  image: ${ETCD_IMAGE}
                  env:
                  - name: ETCDCTL_API
                    value: "3"
                  command: ["/bin/sh"]
                  args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-\$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
                  volumeMounts:
                  - mountPath: /etc/kubernetes/pki/etcd
                    name: etcd-certs
                    readOnly: true
                  - mountPath: /backup
                    name: etcd-backup
                restartPolicy: OnFailure
                nodeSelector:
                  node-role.kubernetes.io/master: ""
                tolerations:
                - effect: NoSchedule
                  operator: Exists
                hostNetwork: true
                volumes:
                - name: etcd-certs
                  hostPath:
                    path: /etc/kubernetes/pki/etcd
                    type: DirectoryOrCreate
                - name: etcd-backup
                  # hostPath is only one of the supported volume types. We suggest setting up external storage instead.
                  hostPath:
                    path: ${ETCD_SNAPSHOT}
                    type: Directory
      *EOF*
      
      kubectl create -f ${MANIFEST}
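
Once the CronJob exists, you can check its schedule and, if desired, trigger a one-off run instead of waiting for the next scheduled time. A minimal sketch, assuming the names from the manifest above (the job name etcd-backup-manual is an arbitrary choice):

  # Show the CronJob, its schedule and the time of the last run.
  kubectl get cronjob etcd-backup --namespace kube-system

  # Trigger an ad-hoc backup from the CronJob template and watch where it runs.
  kubectl create job --from=cronjob/etcd-backup etcd-backup-manual --namespace kube-system
  kubectl get pods --namespace kube-system --selector=job-name=etcd-backup-manual -o wide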

10.2 Recovering Master Nodes

This section describes how to recover SUSE CaaS Platform master nodes.

10.2.1 Replacing a Single Master Node

  1. Remove the failed master node with skuba.

    Replace <NODE_NAME> with the name of the failed master node.

    skuba node remove <NODE_NAME>
  2. Delete the failed master node from known_hosts.

    Replace <NODE_IP> with the IP address of the failed master node.

    sed -i "/<NODE_IP>/d" known_hosts
  3. Prepare a new instance.

  4. Use skuba to join the new master node prepared in step 3. A verification sketch follows this list.

    Replace <NODE_IP> with the IP address of the new master node.

    Replace <NODE_NAME> with the new master node name.

    Replace <USER_NAME> with the user name used to log in to the new node.

    skuba node join --role=master --user=<USER_NAME> --sudo --target <NODE_IP> <NODE_NAME>
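
To verify the replacement, a quick sanity check can be run from a machine with kubectl access, plus any master node with etcdctl installed. This is only a sketch; it reuses the etcdctl invocation shown later in this chapter.

  # The new master should eventually report STATUS Ready.
  kubectl get nodes <NODE_NAME>

  # On a master node: the etcd member list should contain the new node.
  sudo ETCDCTL_API=3 etcdctl \
   --endpoints=https://127.0.0.1:2379 \
   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
   --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
   --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list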

10.2.2 Recovering All Master Nodes

Ensure that the cluster version at restore time is the same as the version at backup time. Restoring across versions is likely to run into data and API compatibility issues.
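
One way to record the versions at backup time, so they can be compared before a restore (a sketch; save the output alongside your backups):

  # Kubernetes server version and per-node versions.
  kubectl version --short
  kubectl get nodes -o wide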

10.2.2.1 Prerequisites

You only need to restore the database on one of the master nodes (master-0) to regain control plane access. etcd will sync the database to all master nodes in the cluster once restored. This does not mean, however, that the nodes will automatically be added back to the cluster. You must join one master node to the cluster, restore the database, and then continue adding your remaining master nodes (which will then sync automatically).

Do the following on master-0. Remote restore is not supported.

  1. Install one of the required software packages (etcdctl, Docker or Podman).

    • Etcdctl:

      sudo zypper install etcdctl
    • Docker:

      sudo zypper install docker
      sudo systemctl start docker
      
      ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
      
      sudo docker pull ${ETCD_IMAGE}
    • Podman:

      sudo zypper install podman
      
      ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
      
      sudo podman pull ${ETCD_IMAGE}
  2. Have access to the etcd snapshot from the backup device. A staging sketch follows this list.
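
If the snapshot is not on storage already mounted on master-0, it can be staged with a copy along these lines. The backup host and source path are hypothetical; the target directory matches the <SNAPSHOT_DIR> example used below.

  # Stage the snapshot on master-0 (hypothetical backup host and paths; adjust as needed).
  SNAPSHOT_DIR=/share/backup/etcd_snapshot
  sudo mkdir -p ${SNAPSHOT_DIR}
  scp <BACKUP_HOST>:/backups/etcd-snapshot-2019-11-08_05:19:20_GMT.db /tmp/
  sudo mv /tmp/etcd-snapshot-2019-11-08_05:19:20_GMT.db ${SNAPSHOT_DIR}/
  ls -l ${SNAPSHOT_DIR}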

10.2.2.2 Procedure

  1. Stop etcd on all master nodes.

    mv /etc/kubernetes/manifests/etcd.yaml /tmp/

    You can verify that the etcd container is no longer running with crictl ps | grep etcd

  2. Purge etcd data on all master nodes.

    sudo rm -rf /var/lib/etcd
  3. Log in to master-0 via SSH.

  4. Restore etcd data.

    Replace <SNAPSHOT_DIR> with the directory containing the etcd snapshot, for example: /share/backup/etcd_snapshot

    Replace <SNAPSHOT> with the name of the etcd snapshot, for example: etcd-snapshot-2019-11-08_05:19:20_GMT.db

    Replace <NODE_NAME> with master-0 cluster node name, for example: skuba-master-1

    Replace <NODE_IP> with master-0 cluster node IP address.

    Important

    The <NODE_IP> must be visible from inside the node.

    ip addr | grep <NODE_IP>
    Important

    The <NODE_NAME> and <NODE_IP> must be listed after --initial-cluster in /etc/kubernetes/manifests/etcd.yaml

    • Etcdctl:

      SNAPSHOT="<SNAPSHOT_DIR>/<SNAPSHOT>"
      NODE_NAME="<NODE_NAME>"
      NODE_IP="<NODE_IP>"
      
      sudo ETCDCTL_API=3 etcdctl snapshot restore ${SNAPSHOT}\
       --data-dir /var/lib/etcd\
       --name ${NODE_NAME}\
       --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380\
       --initial-advertise-peer-urls https://${NODE_IP}:2380
    • Docker:

      SNAPSHOT="<SNAPSHOT>"
      SNAPSHOT_DIR="<SNAPSHOT_DIR>"
      NODE_NAME="<NODE_NAME>"
      NODE_IP="<NODE_IP>"
      
      sudo docker run\
       -v ${SNAPSHOT_DIR}:/etcd_snapshot\
       -v /var/lib:/var/lib\
       --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
      ETCDCTL_API=3 etcdctl snapshot restore /etcd_snapshot/${SNAPSHOT}\
       --data-dir /var/lib/etcd\
       --name ${NODE_NAME}\
       --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380\
       --initial-advertise-peer-urls https://${NODE_IP}:2380"
    • Podman:

      SNAPSHOT="<SNAPSHOT>"
      SNAPSHOT_DIR="<SNAPSHOT_DIR>"
      NODE_NAME="<NODE_NAME>"
      NODE_IP="<NODE_IP>"
      
      sudo podman run\
       -v ${SNAPSHOT_DIR}:/etcd_snapshot\
       -v /var/lib:/var/lib\
       --network host\
       --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
      ETCDCTL_API=3 etcdctl snapshot restore /etcd_snapshot/${SNAPSHOT}\
       --data-dir /var/lib/etcd\
       --name ${NODE_NAME}\
       --initial-cluster ${NODE_NAME}=https://${NODE_IP}:2380\
       --initial-advertise-peer-urls https://${NODE_IP}:2380"
  5. Start etcd on master-0.

    mv /tmp/etcd.yaml /etc/kubernetes/manifests/
  6. Verify that master-0 appears in the etcd cluster member list.

    Replace <ENDPOINT_IP> with master-0 cluster node IP address.

    • Etcdctl:

      sudo ETCDCTL_API=3 etcdctl\
       --endpoints=https://127.0.0.1:2379\
       --cacert=/etc/kubernetes/pki/etcd/ca.crt\
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
    • Docker:

      ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
      ENDPOINT=<ENDPOINT_IP>
      
      sudo docker run\
       -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd\
       --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
      ETCDCTL_API=3 etcdctl\
       --endpoints=https://${ENDPOINT}:2379\
       --cacert=/etc/kubernetes/pki/etcd/ca.crt\
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
    • Podman:

      ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
      ENDPOINT=<ENDPOINT_IP>
      
      sudo podman run\
       -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd\
       --network host\
       --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
      ETCDCTL_API=3 etcdctl\
       --endpoints=https://${ENDPOINT}:2379\
       --cacert=/etc/kubernetes/pki/etcd/ca.crt\
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
  7. Add another master node to the etcd cluster member list.

    Replace <NODE_NAME> with the cluster node name, for example: skuba-master-1

    Replace <ENDPOINT_IP> with master-0 cluster node IP address.

    Replace <NODE_IP> with the cluster node IP address.

    Important

    The <NODE_IP> must be visible from inside the node.

    ip addr | grep <NODE_IP>
    Important

    The <NODE_NAME> and <NODE_IP> must be listed after --initial-cluster in /etc/kubernetes/manifests/etcd.yaml

    Important

    Nodes must be restored in sequence.

    • Etcdctl:

      NODE_NAME="<NODE_NAME>"
      NODE_IP="<NODE_IP>"
      
      sudo ETCDCTL_API=3 etcdctl\
       --endpoints=https://127.0.0.1:2379\
       --cacert=/etc/kubernetes/pki/etcd/ca.crt\
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key\
       member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380
    • Docker:

      ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
      ENDPOINT=<ENDPOINT_IP>
      NODE_NAME="<NODE_NAME>"
      NODE_IP="<NODE_IP>"
      
      sudo docker run\
       -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd\
       --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
      ETCDCTL_API=3 etcdctl\
       --endpoints=https://${ENDPOINT}:2379\
       --cacert=/etc/kubernetes/pki/etcd/ca.crt\
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key\
       member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380"
    • Podman:

      ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
      ENDPOINT=<ENDPOINT_IP>
      NODE_NAME="<NODE_NAME>"
      NODE_IP="<NODE_IP>"
      
      sudo podman run\
       -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd\
       --network host\
       --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
      ETCDCTL_API=3 etcdctl\
       --endpoints=https://${ENDPOINT}:2379\
       --cacert=/etc/kubernetes/pki/etcd/ca.crt\
       --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
       --key=/etc/kubernetes/pki/etcd/healthcheck-client.key\
       member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380"
  8. Log in to the node added in step 7 via SSH.

  9. Start etcd.

    cp /tmp/etcd.yaml /etc/kubernetes/manifests/
  10. Repeat steps 7, 8, and 9 to recover all remaining master nodes. A consolidated loop sketch follows this procedure.
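
For reference, steps 7 to 9 can be consolidated into a loop run on master-0. This is only a sketch of the documented sequence, not a supported tool: it assumes etcdctl is installed on master-0, SSH and sudo access to the remaining masters, and uses hypothetical node names and IP addresses. Each node is still added one at a time, in sequence.

  # Hypothetical remaining masters as "name=ip" pairs -- replace with your own.
  REMAINING_MASTERS="skuba-master-1=172.28.0.6 skuba-master-2=172.28.0.20"

  for ENTRY in ${REMAINING_MASTERS}; do
    NODE_NAME=${ENTRY%%=*}
    NODE_IP=${ENTRY##*=}

    # Step 7: add the member from master-0.
    sudo ETCDCTL_API=3 etcdctl \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
     member add ${NODE_NAME} --peer-urls=https://${NODE_IP}:2380

    # Steps 8 and 9: start etcd on that node by restoring its manifest.
    ssh ${NODE_IP} "sudo cp /tmp/etcd.yaml /etc/kubernetes/manifests/"

    # Give the new member time to start and sync before adding the next one.
    sleep 30
  done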

10.2.2.3 Confirming the Restoration

After restoring, execute the commands below to confirm the result. A successful restoration shows all master nodes as "started" in the etcd member list, and all Kubernetes nodes with STATUS Ready.

  • Etcdctl:

    sudo ETCDCTL_API=3 etcdctl\
     --endpoints=https://127.0.0.1:2379\
     --cacert=/etc/kubernetes/pki/etcd/ca.crt\
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list
    
    # EXAMPLE
    116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
    3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
    43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
  • Docker:

    ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
    
    # Replace <ENDPOINT_IP> with `master-0` cluster node IP address.
    ENDPOINT=<ENDPOINT_IP>
    
    sudo docker run\
     -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd\
     --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
    ETCDCTL_API=3 etcdctl\
     --endpoints=https://${ENDPOINT}:2379\
     --cacert=/etc/kubernetes/pki/etcd/ca.crt\
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
    
    # EXAMPLE
    116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
    3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
    43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
  • Podman:

    ETCD_IMAGE=`grep image: /etc/kubernetes/manifests/etcd.yaml | awk '{print $2}'`
    
    # Replace <ENDPOINT_IP> with `master-0` cluster node IP address.
    ENDPOINT=<ENDPOINT_IP>
    
    sudo podman run\
     -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd\
     --network host\
     --entrypoint "" ${ETCD_IMAGE} /bin/bash -c "\
    ETCDCTL_API=3 etcdctl\
     --endpoints=https://${ENDPOINT}:2379\
     --cacert=/etc/kubernetes/pki/etcd/ca.crt\
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt\
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member list"
    
    # EXAMPLE
    116c1458aef748bc, started, caasp-master-cluster-2, https://172.28.0.20:2380, https://172.28.0.20:2379
    3d124d6ad11cf3dd, started, caasp-master-cluster-0, https://172.28.0.26:2380, https://172.28.0.26:2379
    43d2c8b1d5179c01, started, caasp-master-cluster-1, https://172.28.0.6:2380, https://172.28.0.6:2379
  • Kubectl:

    kubectl get nodes
    
    # EXAMPLE
    NAME                          STATUS   ROLES    AGE      VERSION
    caasp-master-cluster-0        Ready    master   28m      v1.16.2
    caasp-master-cluster-1        Ready    master   20m      v1.16.2
    caasp-master-cluster-2        Ready    master   12m      v1.16.2
    caasp-worker-cluster-0        Ready    <none>   36m36s   v1.16.2