Restoring a Cluster from Backup

Etcd backup and recovery for Rancher launched Kubernetes clusters is straightforward. Snapshots of the etcd database are taken and saved either locally onto the etcd nodes or to an S3-compatible target. The advantage of configuring S3 is that if all etcd nodes are lost, your snapshot is saved remotely and can still be used to restore the cluster.

Rancher recommends configuring recurring snapshots of etcd, but you can also take one-time snapshots. Rancher can restore a cluster from any saved snapshot.

Clusters can also be restored to a prior Kubernetes version and cluster configuration.

Viewing Available Snapshots

You can view the list of all available snapshots for a cluster.

1. In the upper left corner, click ☰ > Cluster Management.
2. In the Clusters page, go to the cluster where you want to view the snapshots and click the name of the cluster.
3. Click the Snapshots tab. The listed snapshots include a timestamp of when they were created.

Restoring a Cluster from a Snapshot

If your Kubernetes cluster is broken, you can restore the cluster from a snapshot.

Snapshots are composed of the cluster data in etcd, the Kubernetes version, and the cluster configuration in the cluster.yml. These components allow you to select from the following options when restoring a cluster from a snapshot:

- Restore just the etcd contents: This restore is similar to restoring from snapshots in Rancher before v2.4.0.
- Restore etcd and Kubernetes version: This option should be used if a Kubernetes upgrade is the reason that your cluster is failing, and you haven’t made any cluster configuration changes.
- Restore etcd, Kubernetes version, and cluster configuration: This option should be used if you changed both the Kubernetes version and cluster configuration when upgrading.

When rolling back to a prior Kubernetes version, the upgrade strategy options are ignored. Worker nodes are not cordoned or drained before being reverted to the older Kubernetes version, so that an unhealthy cluster can be more quickly restored to a healthy state.

Prerequisite:

To restore snapshots from S3, the cluster needs to be configured to take recurring snapshots on S3.

1. In the upper left corner, click ☰ > Cluster Management.
2. In the Clusters page, go to the cluster where you want to view the snapshots and click the name of the cluster.
3. Click the Snapshots tab to view the list of saved snapshots.
4. Go to the snapshot you want to restore and click ⋮ > Restore.
5. Select a Restore Type.
6. Click Restore.

Result: The cluster goes into an updating state and begins restoring the etcd nodes from the snapshot. The cluster is restored when it returns to an active state.

Restoring a Cluster From a Snapshot When the controlplane/etcd Are Completely Unavailable

In a disaster recovery scenario, the control plane and etcd nodes managed by Rancher in a downstream cluster may no longer be available or functioning. The cluster can be rebuilt by adding control plane and etcd nodes again, followed by restoring from an available snapshot.

If you have a complete cluster failure, you must remove all etcd nodes/machines from your cluster before you can add a "new" etcd node for restore.

If you are using local snapshots, it is VERY important that you back up the snapshot you want to restore from. Copy it out of the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder on the etcd node you are about to remove. You can then copy the snapshot into the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder on your new node. Note also that when you use local snapshots and restore to a new node, the restore currently cannot be done through the UI.
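The backup of a local snapshot described above could be scripted roughly as follows. This is a minimal sketch, not an official Rancher tool: the function name backup_snapshot, its arguments, and the example backup directory are all assumptions made for illustration, and the snapshot file name is left as a placeholder.

```shell
#!/bin/sh
# Sketch (hypothetical helper): copy one local etcd snapshot to a safe
# place before the etcd node is removed.
# Usage: backup_snapshot <snapshot-dir> <snapshot-file> <backup-dir>
backup_snapshot() {
  src="$1/$2"
  # Refuse to continue if the named snapshot is not present on this node.
  if [ ! -f "$src" ]; then
    echo "snapshot not found: $src" >&2
    return 1
  fi
  # Create the backup location if needed, then copy while preserving
  # the file's mode and timestamps.
  mkdir -p "$3" && cp -p "$src" "$3/"
}

# Example call on an RKE2 node (file name and backup directory are placeholders):
# backup_snapshot /var/lib/rancher/rke2/server/db/snapshots <snapshot-file> /root/etcd-snapshot-backup
```

Because the node is about to be removed, the backup should end up somewhere off the node, for example by transferring the backup directory to another machine afterwards.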

Remove all etcd nodes from your cluster.
1. In the upper left corner, click ☰ > Cluster Management.
2. In the Clusters page, go to the cluster where you want to remove nodes.
3. In the Machines tab, click ⋮ > Delete on each node you want to delete. Initially, the nodes hang in a deleting state, but once all etcd nodes are deleting, they are removed together. This happens because Rancher sees that all etcd nodes are deleting and "short circuits" the etcd safe-removal logic.
After all etcd nodes are removed, add the new etcd node that you are planning to restore from. Assign the new node all three roles (etcd, controlplane, and worker).
- If the node was previously in a cluster, clean the node first.
- For custom clusters, go to the Registration tab and check the box for etcd, controlplane, and worker. Then copy and run the registration command on your node.
- For node driver clusters, a new node is provisioned automatically.
  
  At this point, Rancher will indicate that restoration from etcd snapshot is required.
Restore from an etcd snapshot.

As the etcd node is a clean node, you may need to manually create the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ path.
- For S3 snapshots, restore using the UI.
  1. Click the Snapshots tab to view the list of saved snapshots.
  2. Go to the snapshot you want to restore and click ⋮ > Restore.
  3. Select a Restore Type.
  4. Click Restore.
- For local snapshots, restoring through the UI is not available. Edit the cluster YAML instead.
  1. In the upper right corner, click ⋮ > Edit YAML.
  2. Add the following example YAML under your rkeConfig to configure the etcd restore:
    
    ...
    rkeConfig:
      etcdSnapshotRestore:
        name: <string> # This field is required. Refers to the filename of the associated etcdsnapshot object.
    ...
After restoration is successful, you can scale your etcd nodes back up to the desired redundancy.
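
The earlier note about manually creating the snapshots path on a clean etcd node could be handled with a small helper like the one below. This is a sketch under stated assumptions: the function name prepare_snapshot_dir and its optional root-prefix argument are hypothetical and exist only so the helper can be tried in a scratch directory first. Creating the real path normally requires root privileges.

```shell
#!/bin/sh
# Sketch (hypothetical helper): create the snapshots directory on a clean
# etcd node and print the resulting path.
# $1 = distribution, either k3s or rke2
# $2 = optional root prefix for dry runs (assumption; empty means the real /)
prepare_snapshot_dir() {
  # Only the two distributions named in the path placeholder are accepted.
  case "$1" in
    k3s|rke2) ;;
    *) echo "expected k3s or rke2, got: $1" >&2; return 1 ;;
  esac
  dir="${2:-}/var/lib/rancher/$1/server/db/snapshots"
  # mkdir -p creates missing parent directories and succeeds if the path exists.
  mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example on an RKE2 node, followed by copying in a previously saved snapshot
# (the snapshot file name and its source location are placeholders):
# dir=$(prepare_snapshot_dir rke2) && cp -p /root/etcd-snapshot-backup/<snapshot-file> "$dir/"
```

The helper rejects any distribution other than k3s or rke2 so that a typo does not silently create an unused directory tree.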