Offline Replica Rebuilding

Starting with v1.9.0, SUSE Storage supports offline replica rebuilding. This feature allows degraded volumes to automatically rebuild replicas while the volumes are detached.

Global Setting `offline-replica-rebuilding`

When enabled, SUSE Storage automatically initiates offline rebuilding for eligible volumes.
For more information about this global setting, see settings.

Per-Volume Override

You can override the global `offline-replica-rebuilding` setting for individual volumes, either through the SUSE Storage UI or by editing the Volume custom resource. To use kubectl, run the following command and then modify the `spec.offlineRebuilding` field:
```
kubectl -n longhorn-system edit volume <volume-name>
```
Replace `<volume-name>` with the name of the volume.

When `spec.offlineRebuilding` is set to `enabled` or `disabled`, the per-volume value takes precedence over the global setting. The default value of `spec.offlineRebuilding` is `ignored`, which means the volume follows the global setting.

The following table shows how the global and per-volume settings interact:

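| Global Setting (`offline-replica-rebuilding`) | Per-Volume Setting (`spec.offlineRebuilding`) | Offline Rebuilding Enabled |
| --- | --- | --- |
| `true` | `ignored` | Yes |
| `false` | `ignored` | No |
| `true` | `enabled` | Yes |
| `false` | `enabled` | Yes |
| `true` | `disabled` | No |
| `false` | `disabled` | No |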
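
The precedence rule described in this section can be sketched as a small function. This is only an illustrative model, not SUSE Storage code: the function name `offline_rebuilding_enabled` and its parameters are hypothetical, and only the three `spec.offlineRebuilding` values named in this section are assumed to exist.

```python
def offline_rebuilding_enabled(global_setting: bool, per_volume: str = "ignored") -> bool:
    """Resolve whether offline rebuilding applies to one volume.

    global_setting: value of the global offline-replica-rebuilding setting.
    per_volume: value of spec.offlineRebuilding ("ignored", "enabled", "disabled").
    """
    if per_volume == "enabled":
        return True            # per-volume value overrides the global setting
    if per_volume == "disabled":
        return False           # per-volume value overrides the global setting
    if per_volume == "ignored":
        return global_setting  # default: follow the global setting
    raise ValueError(f"unexpected spec.offlineRebuilding value: {per_volume!r}")


# Example: the global setting is off, but one volume opts in.
print(offline_rebuilding_enabled(False, "enabled"))
```

Raising an error on an unknown value is a design choice of this sketch; it keeps a typo in the per-volume field from silently falling back to the global setting.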

Rebuilding Process

When offline replica rebuilding is triggered, SUSE Storage attaches the volume without activating the frontend, rebuilds any missing replicas, and then detaches the volume after the rebuilding process is complete.
This process can be interrupted if the associated workload scales up and requires the volume.

Rebuilding Not Started or Canceled

If offline rebuilding starts when the conditions needed to complete it are not met, a degraded volume can become stuck in the attached state. To prevent this, SUSE Storage does not start offline rebuilding, or cancels it if it is already in progress, whenever the required conditions are not satisfied.

Benefits:
- It ensures volumes don’t remain stuck in the attached state if rebuilding never finishes.
- It prevents wasteful rebuilding attempts.
- It reduces unnecessary volume attachment and detachment cycles.
- It provides predictable rebuilding behavior based on resource availability.

Required conditions:

Offline rebuilding starts automatically for a degraded volume once at least one of the following conditions is met:
- A reusable failed replica exists.
- A disk candidate exists that meets all of the following requirements:
  - The instance manager on the node hosting the disk is ready.
  - The node hosting the disk is schedulable.
  - The disk itself is schedulable.
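
The required conditions can be read as a boolean expression: a reusable failed replica, or at least one disk whose instance manager is ready and whose node and disk are both schedulable. The sketch below models that reading. The `DiskCandidate` class, its field names, and the function names are hypothetical and do not correspond to actual SUSE Storage types.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DiskCandidate:
    # Hypothetical fields mirroring the three requirements listed above.
    instance_manager_ready: bool
    node_schedulable: bool
    disk_schedulable: bool


def disk_is_usable(disk: DiskCandidate) -> bool:
    """A disk qualifies only if all three requirements hold."""
    return (
        disk.instance_manager_ready
        and disk.node_schedulable
        and disk.disk_schedulable
    )


def required_conditions_met(
    has_reusable_failed_replica: bool, disks: Iterable[DiskCandidate]
) -> bool:
    """True if a failed replica can be reused or any disk qualifies."""
    return has_reusable_failed_replica or any(disk_is_usable(d) for d in disks)


# Example: the only spare disk sits on a cordoned node, so rebuilding cannot start.
cordoned = DiskCandidate(instance_manager_ready=True, node_schedulable=False, disk_schedulable=True)
print(required_conditions_met(False, [cordoned]))
```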

Before offline rebuilding starts

When offline rebuilding is enabled, SUSE Storage determines whether it should start.

1. SUSE Storage detects a degraded, detached volume.
2. The system validates whether the required conditions are met before starting the rebuild.
3. If the conditions are met, rebuilding proceeds. Otherwise, the volume remains detached.

The required conditions are re-evaluated when a node is added, becomes ready, or becomes schedulable.
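
The pre-start check can be summarized as a small decision function plus a set of events that trigger re-evaluation. This is a rough sketch under stated assumptions: the event names and the returned action strings are invented for illustration, and the source only states that re-evaluation happens when a node is added, becomes ready, or becomes schedulable.

```python
# Hypothetical event names for the three re-evaluation triggers.
REEVALUATION_EVENTS = frozenset({"node-added", "node-ready", "node-schedulable"})


def should_reevaluate(event: str) -> bool:
    """Return True if the event should trigger a fresh condition check."""
    return event in REEVALUATION_EVENTS


def decide_before_start(
    enabled: bool, degraded: bool, detached: bool, conditions_met: bool
) -> str:
    """Decide what happens to a volume before offline rebuilding starts."""
    if not (enabled and degraded and detached):
        return "not-applicable"  # offline rebuilding does not apply to this volume
    if conditions_met:
        return "start-rebuilding"
    return "remain-detached"     # wait for a re-evaluation trigger


# Example: a degraded, detached volume with no usable disk stays detached.
print(decide_before_start(True, True, True, conditions_met=False))
```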

During offline rebuilding

SUSE Storage determines if a rebuilding process should be canceled while in progress.

1. Once offline rebuilding starts and the volume is attached, SUSE Storage monitors the volume’s status.
2. If the status of the volume’s Scheduled condition becomes False, offline rebuilding is canceled and the volume is detached.
3. If the required conditions are met again, offline rebuilding restarts. Otherwise, the volume remains detached.
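
The cancel-and-restart behavior can be pictured as a two-state machine. The sketch below is a simplified model, not the actual controller logic: the state names `"rebuilding"` and `"detached"`, the function name, and the boolean inputs are assumptions made for illustration.

```python
def next_state(state: str, scheduled: bool, conditions_met: bool) -> str:
    """Advance a simplified offline-rebuilding state machine by one step.

    state: "rebuilding" (volume attached for rebuilding) or "detached".
    scheduled: False when the volume's Scheduled condition status is False.
    conditions_met: whether the required conditions currently hold.
    """
    if state == "rebuilding":
        # Rebuilding is canceled and the volume detached once Scheduled is False.
        return "rebuilding" if scheduled else "detached"
    if state == "detached":
        # Rebuilding restarts only when the required conditions hold again.
        return "rebuilding" if conditions_met else "detached"
    raise ValueError(f"unknown state: {state!r}")


# Example: a node is drained mid-rebuild, then uncordoned later.
state = "rebuilding"
state = next_state(state, scheduled=False, conditions_met=False)  # canceled
state = next_state(state, scheduled=False, conditions_met=False)  # still waiting
state = next_state(state, scheduled=True, conditions_met=True)    # restarted
print(state)
```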

Examples

Successful offline rebuilding:
1. A volume is created with 3 replicas in a 3-worker-node cluster.
2. Offline rebuilding is enabled.
3. The volume is detached and then a replica of the volume is deleted.
4. Offline rebuilding begins, and the volume is attached.
5. After rebuilding finishes, the volume is detached.

Offline rebuilding doesn’t start even when it is enabled:
1. A volume is created with 3 replicas in a 3-worker-node (A, B, and C) cluster.
2. Offline rebuilding is enabled.
3. Worker node A is unschedulable.
4. The volume replica on worker node A is deleted.
5. Because only two schedulable worker nodes exist, offline rebuilding does not start.

A worker node is drained during offline rebuilding:
1. A volume is created with 3 replicas in a 3-worker-node (A, B, and C) cluster.
2. Offline rebuilding is enabled.
3. The volume is detached, and then the volume replica on worker node A is deleted.
4. Offline rebuilding begins, and the volume is attached to rebuild a replica on worker node A.
5. Worker node A is drained, making it unschedulable, and the volume replica on worker node A is deleted.
6. The volume remains attached until the volume’s Scheduled condition status becomes False.
7. The volume is then detached and remains detached until worker node A is uncordoned or a new schedulable node is added.

Limitations

Offline rebuilding is not supported for faulted volumes.
