This is unreleased documentation for SUSE® Storage 1.12 (Dev).
Graceful Node Removal
Removing a node from a Kubernetes cluster requires careful coordination with SUSE Storage to ensure data availability. Simply running kubectl delete node is insufficient because Kubernetes does not automatically notify CSI storage layers to migrate data replicas before the node is destroyed.
If a node is removed without following this procedure, the replicas stored on that node will be lost, potentially leaving your volumes in a Degraded or Faulted state.
Prerequisites
The cluster must have enough available space and schedulable nodes to receive the replicas being moved.
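To sanity-check this prerequisite before draining, you can inspect the Longhorn node list from the command line. The snippet below is a sketch, assuming the default `longhorn-system` namespace and that `jq` is installed; `storageAvailable` is reported per disk in bytes:

```shell
# Summarize each Longhorn node: whether it accepts new replicas and how much
# disk space is still free across all of its disks (values are in bytes)
kubectl -n longhorn-system get nodes.longhorn.io -o json \
  | jq -r '.items[] | "\(.metadata.name)\tallowScheduling=\(.spec.allowScheduling)\tavailableBytes=\([.status.diskStatus[]?.storageAvailable] | add)"'
```

At least one other node must show `allowScheduling=true` and enough free space to hold the replicas currently stored on the node you plan to remove.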
Step-by-Step Procedure
1. Cordon and Drain the Kubernetes Node
Prepare the node for removal by moving all running workloads (Pods) to other nodes. Using a timeout and force flag ensures that the drain completes even if some pods are slow to terminate or protected by Pod Disruption Budgets.
kubectl cordon <NODE_NAME>
kubectl drain <NODE_NAME> \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=-1 \
--timeout=300s
2. Disable Scheduling and Trigger Eviction
You must prevent new data from being scheduled on the node and move existing data to other healthy nodes. This can be done via the SUSE Storage UI or the command line.
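From the command line, this can be done by patching the Longhorn Node resource. The following is a sketch assuming the default `longhorn-system` namespace and the `allowScheduling` and `evictionRequested` fields of the `node.longhorn.io` custom resource:

```shell
# Stop new replicas from being scheduled on the node and request eviction of
# the replicas already stored there (assumes the longhorn-system namespace)
kubectl -n longhorn-system patch nodes.longhorn.io <NODE_NAME> \
  --type merge \
  -p '{"spec": {"allowScheduling": false, "evictionRequested": true}}'
```

Once the patch is applied, SUSE Storage begins rebuilding the node's replicas on other schedulable nodes in the background.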
3. Monitor Eviction Progress
Before proceeding to delete the node, you must ensure all replicas have successfully migrated.
- Via UI: In the Nodes list, watch the Replicas column for the node. Wait until the count reaches 0.
- Via CLI: Poll the node status to confirm all resources (Replicas and Backing Images) have migrated:
kubectl get node.longhorn.io <NODE_NAME> -n longhorn-system -o json \
  | jq '.status.diskStatus | to_entries[] | {disk: .key, scheduledReplicas: (.value.scheduledReplica | length), scheduledBackingImages: (.value.scheduledBackingImage | length)}'
Eviction is complete only when every disk reports 0 for both scheduledReplicas and scheduledBackingImages.
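The check above can be wrapped in a simple polling loop that blocks until eviction finishes. This is a sketch, assuming `jq` is installed and `<NODE_NAME>` is replaced with the node being evicted:

```shell
# Poll every 10 seconds until no replicas or backing images remain scheduled
# on any disk of the node being evicted
while true; do
  remaining=$(kubectl get node.longhorn.io <NODE_NAME> -n longhorn-system -o json \
    | jq '[.status.diskStatus[]? | (.scheduledReplica | length) + (.scheduledBackingImage | length)] | add // 0')
  echo "Resources still scheduled on node: ${remaining}"
  [ "${remaining}" -eq 0 ] && break
  sleep 10
done
```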
4. Delete the Kubernetes Node
Remove the node resource from the Kubernetes cluster. SUSE Storage requires the Kubernetes node to be removed before it will allow the deletion of the Longhorn Node resource.
kubectl delete node <NODE_NAME>
5. Delete the Longhorn Node
Once the Kubernetes node is deleted, the SUSE Storage UI will show the node in a Down state. You can now safely remove the metadata.
- Via UI: In the Nodes tab, the Delete button for the node will now be enabled. Click Delete.
- Via CLI: If you prefer using kubectl, you can delete the Longhorn Node resource directly using the following command:
kubectl -n longhorn-system delete nodes.longhorn.io <NODE_NAME>
Troubleshooting
Node "Delete" button is greyed out
The UI disables the Delete button if the corresponding Kubernetes node still exists. Ensure you have successfully executed kubectl delete node <NODE_NAME> first.
Eviction is stuck
If the replica count does not reach 0, check the Events log. Common causes include:
- Insufficient Space: No other node has enough disk space to house the replicas. To recover from this, refer to Manual Recovery of Nodes with Insufficient Space.
- Anti-Affinity Constraints: If Replica Node Level Soft Anti-Affinity is disabled, and all other nodes already host a replica of the same volume, the eviction will have no valid destination. To learn more about anti-affinity and how to resolve this, see Replica Scheduling and Anti-Affinity.
- Volume Health: Rebuilding cannot start if the volume is already in a Faulted state. To learn more about volumes and volume health, refer to the Volume documentation.