Upgrade from v1.4.1 or v1.4.2 to v1.4.3

General information

An Upgrade button appears on the Dashboard screen whenever a new SUSE Virtualization version that you can upgrade to becomes available. For more information, see Start an upgrade.

SUSE Virtualization v1.4.1, v1.4.2, and v1.4.3 all use the same minor version of SUSE® Rancher Prime: RKE2 (v1.31). This allows you to upgrade directly from v1.4.1 or v1.4.2 to v1.4.3.

For air-gapped environments, see Prepare an air-gapped upgrade.

Known issues

1. Air-gapped upgrade stuck with ImagePullBackOff error in Fluentd and Fluent Bit pods

The upgrade may become stuck at the very beginning of the process, as indicated by 0% progress and items marked Pending in the Upgrade dialog of the SUSE Virtualization UI.

[Screenshot: Upgrade dialog with empty status]

Specifically, Fluentd and Fluent Bit pods may become stuck in the ImagePullBackOff status. To check the status of the pods, run the following commands:

$ kubectl -n harvester-system get upgrades -l harvesterhci.io/latestUpgrade=true
NAME                 AGE
hvst-upgrade-x2hz8   7m14s

$ kubectl -n harvester-system get upgradelogs -l harvesterhci.io/upgrade=hvst-upgrade-x2hz8
NAME                            UPGRADE
hvst-upgrade-x2hz8-upgradelog   hvst-upgrade-x2hz8

$ kubectl -n harvester-system get pods -l harvesterhci.io/upgradeLog=hvst-upgrade-x2hz8-upgradelog
NAME                                                        READY   STATUS             RESTARTS   AGE
hvst-upgrade-x2hz8-upgradelog-downloader-6cdb864dd9-6bw98   1/1     Running            0          7m7s
hvst-upgrade-x2hz8-upgradelog-infra-fluentbit-2nq7q         0/1     ImagePullBackOff   0          7m42s
hvst-upgrade-x2hz8-upgradelog-infra-fluentbit-697wf         0/1     ImagePullBackOff   0          7m42s
hvst-upgrade-x2hz8-upgradelog-infra-fluentbit-kd8kl         0/1     ImagePullBackOff   0          7m42s
hvst-upgrade-x2hz8-upgradelog-infra-fluentd-0               0/2     ImagePullBackOff   0          7m42s

This occurs because the following container images are neither preloaded on the cluster nodes nor able to be pulled from the internet in an air-gapped environment:

  • ghcr.io/kube-logging/fluentd:v1.15-ruby3

  • ghcr.io/kube-logging/config-reloader:v0.0.5

  • fluent/fluent-bit:2.1.8

To fix the issue, perform one of the following actions:

  • Update the Logging CR to use the images that are already preloaded in the cluster nodes. To do this, run the following commands against the cluster:

    # Get the Logging CR names
    OPERATOR_LOGGING_NAME=$(kubectl get loggings -l app.kubernetes.io/name=rancher-logging -o jsonpath="{.items[0].metadata.name}")
    INFRA_LOGGING_NAME=$(kubectl get loggings -l harvesterhci.io/upgradeLogComponent=infra -o jsonpath="{.items[0].metadata.name}")
    
    # Gather image info from operator's Logging CR
    FLUENTD_IMAGE_REPO=$(kubectl get loggings $OPERATOR_LOGGING_NAME -o jsonpath="{.spec.fluentd.image.repository}")
    FLUENTD_IMAGE_TAG=$(kubectl get loggings $OPERATOR_LOGGING_NAME -o jsonpath="{.spec.fluentd.image.tag}")
    
    FLUENTBIT_IMAGE_REPO=$(kubectl get loggings $OPERATOR_LOGGING_NAME -o jsonpath="{.spec.fluentbit.image.repository}")
    FLUENTBIT_IMAGE_TAG=$(kubectl get loggings $OPERATOR_LOGGING_NAME -o jsonpath="{.spec.fluentbit.image.tag}")
    
    CONFIG_RELOADER_IMAGE_REPO=$(kubectl get loggings $OPERATOR_LOGGING_NAME -o jsonpath="{.spec.fluentd.configReloaderImage.repository}")
    CONFIG_RELOADER_IMAGE_TAG=$(kubectl get loggings $OPERATOR_LOGGING_NAME -o jsonpath="{.spec.fluentd.configReloaderImage.tag}")
    
    # Patch the Logging CR
    kubectl patch logging $INFRA_LOGGING_NAME --type=json -p="[{\"op\":\"replace\",\"path\":\"/spec/fluentbit/image\",\"value\":{\"repository\":\"$FLUENTBIT_IMAGE_REPO\",\"tag\":\"$FLUENTBIT_IMAGE_TAG\"}}]"
    kubectl patch logging $INFRA_LOGGING_NAME --type=json -p="[{\"op\":\"replace\",\"path\":\"/spec/fluentd/image\",\"value\":{\"repository\":\"$FLUENTD_IMAGE_REPO\",\"tag\":\"$FLUENTD_IMAGE_TAG\"}}]"
    kubectl patch logging $INFRA_LOGGING_NAME --type=json -p="[{\"op\":\"replace\",\"path\":\"/spec/fluentd/configReloaderImage\",\"value\":{\"repository\":\"$CONFIG_RELOADER_IMAGE_REPO\",\"tag\":\"$CONFIG_RELOADER_IMAGE_TAG\"}}]"

    After the Logging CR is updated, the status of the Fluentd and Fluent Bit pods should change to Running within a few moments and the upgrade process should continue. If the Fluentd pod remains in the ImagePullBackOff status, delete the pod to force it to restart:

    UPGRADE_NAME=$(kubectl -n harvester-system get upgrades -l harvesterhci.io/latestUpgrade=true -o jsonpath='{.items[0].metadata.name}')
    UPGRADELOG_NAME=$(kubectl -n harvester-system get upgradelogs -l harvesterhci.io/upgrade=$UPGRADE_NAME -o jsonpath='{.items[0].metadata.name}')
    
    kubectl -n harvester-system delete pods -l harvesterhci.io/upgradeLog=$UPGRADELOG_NAME,harvesterhci.io/upgradeLogComponent=aggregator

  • On a computer with internet access, pull the required container images and export them to a TAR file. Next, transfer the TAR file to the cluster nodes and import the images. The docker commands run on the internet-connected computer, while the ctr import command must be run on each cluster node:

    # Pull down the three container images
    docker pull ghcr.io/kube-logging/fluentd:v1.15-ruby3
    docker pull ghcr.io/kube-logging/config-reloader:v0.0.5
    docker pull fluent/fluent-bit:2.1.8
    
    # Export the images to a tar file
    docker save \
      ghcr.io/kube-logging/fluentd:v1.15-ruby3 \
      ghcr.io/kube-logging/config-reloader:v0.0.5 \
      fluent/fluent-bit:2.1.8 > upgradelog-images.tar
    
    # After transferring the tar file to the cluster nodes, import the images (must be run on each node)
    ctr -n k8s.io images import upgradelog-images.tar

    The upgrade process should continue after the images are preloaded.

  • (Not recommended) Restart the upgrade process with logging disabled. Ensure that the Enable Logging checkbox in the Upgrade dialog is not selected.
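
If you applied either of the first two workarounds, you can confirm that they took effect by re-checking the pod status. The following sketch simply reuses the label selectors from the commands above:

    # Resolve the current upgrade and upgrade log names
    UPGRADE_NAME=$(kubectl -n harvester-system get upgrades -l harvesterhci.io/latestUpgrade=true -o jsonpath='{.items[0].metadata.name}')
    UPGRADELOG_NAME=$(kubectl -n harvester-system get upgradelogs -l harvesterhci.io/upgrade=$UPGRADE_NAME -o jsonpath='{.items[0].metadata.name}')

    # The Fluentd and Fluent Bit pods should now report a Running status
    kubectl -n harvester-system get pods -l harvesterhci.io/upgradeLog=$UPGRADELOG_NAME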

Related issue: #7955

2. Oversized volumes

In SUSE Virtualization v1.4.3, which uses SUSE Storage v1.7.3, oversized volumes (for example, 999999 Gi in size) are marked Not Ready and cannot be deleted.

To resolve this issue, perform the following steps:

  1. Temporarily remove the PVC webhook rule.

    RULE_INDEX=$(kubectl get \
      validatingwebhookconfiguration longhorn-webhook-validator -o json \
      | jq '.webhooks[0].rules | map(.resources[0] == "persistentvolumeclaims") | index(true)')
    
    if [ -n "$RULE_INDEX" -a "$RULE_INDEX" != "null" ]; then
      kubectl patch validatingwebhookconfiguration longhorn-webhook-validator \
        --type='json' \
        -p="[{'op': 'remove', 'path': '/webhooks/0/rules/$RULE_INDEX'}]"
    fi
  2. Wait for the related PVC to be deleted.

  3. Restore the PVC webhook rule to re-enable validation.

    kubectl patch validatingwebhookconfiguration longhorn-webhook-validator \
      --type='json' \
      -p='[{"op": "add", "path": "/webhooks/0/rules/-", "value": {"apiGroups":[""],"apiVersions":["v1"],"operations":["UPDATE"],"resources":["persistentvolumeclaims"],"scope":"Namespaced"}}]'

The issue will be addressed in SUSE Storage v1.8.2, which will likely be included in SUSE Virtualization v1.5.1.

Related issues: #8096 and #10741

3. Non-root users on guest clusters unable to access RWX volumes

Non-root users on guest clusters encounter unexpected "Permission denied" errors when accessing RWX volumes. This is caused by a regression in nfs-ganesha v6.0 and later, which affects the v1.7.3 longhorn-share-manager image.

You can resolve the issue by replacing longhorn-share-manager:v1.7.3 with the hotfixed image longhorn-share-manager:v1.7.3-hotfix-1.

Do not use the hotfixed image if you are not affected by the issue.

  1. Edit the longhorn-manager DaemonSet by running the following command:

      kubectl -n longhorn-system edit daemonset/longhorn-manager
  2. In the spec.template.spec.containers[0].command field of the DaemonSet, change the --share-manager-image value to longhornio/longhorn-share-manager:v1.7.3-hotfix-1, as shown in the following example:

      ...
        spec:
          containers:
          - command:
            - longhorn-manager
            - -d
            - daemon
            - --engine-image
            - longhornio/longhorn-engine:v1.7.3
            - --instance-manager-image
            - longhornio/longhorn-instance-manager:v1.7.3
            - --share-manager-image
            - longhornio/longhorn-share-manager:v1.7.3-hotfix-1
            - --backing-image-manager-image
            - longhornio/backing-image-manager:v1.7.3
            - --support-bundle-manager-image
            - longhornio/support-bundle-kit:v0.0.51
            - --manager-image
            - longhornio/longhorn-manager:v1.7.3
            - --service-account
            - longhorn-service-account
            - --upgrade-version-check
      ...
  3. After the change is applied, restart the workloads that use RWX volumes.
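
After the DaemonSet change is saved, you can watch the rollout and spot-check the image used by the share manager pods. This is a minimal sketch; the share-manager pod name prefix assumed here follows the default Longhorn naming and may differ in your cluster:

    # Wait for the longhorn-manager DaemonSet to finish rolling out
    kubectl -n longhorn-system rollout status daemonset/longhorn-manager

    # List pod images and confirm that share-manager pods use the hotfixed tag
    kubectl -n longhorn-system get pods \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' \
      | grep share-manager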

If you are using the hotfixed image and want to upgrade to SUSE Virtualization v1.5.x, you must edit the longhorn-manager DaemonSet and revert to the longhorn-share-manager:v1.7.3 image before starting the upgrade.
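
Reverting is the same DaemonSet edit in reverse; a minimal sketch:

    # Re-open the longhorn-manager DaemonSet
    kubectl -n longhorn-system edit daemonset/longhorn-manager

    # In the container command, change the --share-manager-image argument back to
    # longhornio/longhorn-share-manager:v1.7.3, then let the DaemonSet roll out
    # before starting the upgrade to v1.5.x.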

Related issues: #8354 and #10621

4. Virtual machines that use migratable RWX volumes restart unexpectedly

Virtual machines that use migratable RWX volumes restart unexpectedly when the CSI plugin pods are restarted. This issue affects SUSE Virtualization v1.4.x, v1.5.0, and v1.5.1.

The workaround is to disable the Automatically Delete Workload Pod When The Volume Is Detached Unexpectedly setting in the SUSE Storage UI before starting the upgrade. You must enable the setting again after the upgrade is completed.
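
If you prefer the command line, the setting can also be toggled through the Longhorn Settings resource. The following sketch assumes the setting's resource name is auto-delete-pod-when-volume-detached-unexpectedly (the kebab-case form of the UI label):

    # Disable the setting before starting the upgrade
    kubectl -n longhorn-system patch settings.longhorn.io \
      auto-delete-pod-when-volume-detached-unexpectedly --type merge -p '{"value":"false"}'

    # Re-enable the setting after the upgrade is completed
    kubectl -n longhorn-system patch settings.longhorn.io \
      auto-delete-pod-when-volume-detached-unexpectedly --type merge -p '{"value":"true"}'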

The issue will be fixed in SUSE Storage v1.8.3, v1.9.1, and later versions. SUSE Virtualization v1.6.0 will include SUSE Storage v1.9.1.

Related issues: #8534 and #11158