This is unreleased documentation for SUSE® Storage 1.10 (Dev).
# Scheduling

SUSE Storage schedules replicas based on multiple factors.

## Scheduling Policy
SUSE Storage’s scheduling policy has two stages. The scheduler proceeds to the next stage only if the previous stage is satisfied; otherwise, scheduling fails.

If tags have been set as selection criteria, a node or disk must have matching tags to be selected.

The first stage is the node and zone selection stage. SUSE Storage filters nodes and zones based on the Replica Node Level Soft Anti-Affinity and Replica Zone Level Soft Anti-Affinity settings.

The second stage is the disk selection stage. SUSE Storage filters the disks that satisfy the first stage based on the Replica Disk Level Soft Anti-Affinity, Storage Minimal Available Percentage, and Storage Over Provisioning Percentage settings, as well as other disk-related factors such as requested disk space.
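A high-level sketch of this two-stage flow in illustrative Python (the function names and placeholder filters below are ours, not SUSE Storage code; the real criteria for each stage are described in the sections that follow):

```python
def filter_nodes_and_zones(volume, nodes):
    # Stage 1 placeholder: the real check applies node and zone
    # anti-affinity rules and node eligibility criteria.
    return [n for n in nodes if n.get("schedulable", True)]

def filter_disks(volume, node):
    # Stage 2 placeholder: the real check applies disk tags, space
    # conditions, and disk anti-affinity rules.
    disks = node.get("disks", [])
    return disks[0] if disks else None

def schedule_replica(volume, nodes):
    # Stage 2 runs only on nodes that survived stage 1; if no
    # node/disk pair survives both stages, scheduling fails.
    for node in filter_nodes_and_zones(volume, nodes):
        disk = filter_disks(volume, node)
        if disk is not None:
            return node["name"], disk
    return None  # scheduling failed
```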
## The Node and Zone Selection Stage

SUSE Storage evaluates which nodes are suitable for scheduling a new replica based on a series of criteria. The decision-making process follows a specific order to ensure optimal placement for fault tolerance.

### 1. Node Tag Matching
SUSE Storage first checks for node selector tags on the volume.

- If the volume has node selector tags, only nodes with matching tags are eligible.
- If the volume has no node selector tags, the behavior depends on the Allow Empty Node Selector Volume setting:
  - `true` (default): Schedules on nodes with or without tags.
  - `false`: Schedules only on nodes without tags.
### 2. Cordoned Node Handling

The Disable Scheduling On Cordoned Node setting determines whether cordoned nodes are eligible for replica scheduling:

- `true` (default): Cordoned nodes are excluded.
- `false`: Cordoned nodes are eligible.
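A minimal sketch of these first two eligibility checks, assuming that "matching" means the node carries every tag in the volume's node selector (function and parameter names are illustrative, not SUSE Storage APIs):

```python
def node_eligible(node_tags, node_cordoned, volume_node_selector,
                  allow_empty_node_selector_volume=True,
                  disable_scheduling_on_cordoned_node=True):
    # Step 2: cordoned nodes are excluded unless the setting permits them.
    if node_cordoned and disable_scheduling_on_cordoned_node:
        return False
    # Step 1: with a node selector, every selector tag must be on the node.
    if volume_node_selector:
        return set(volume_node_selector) <= set(node_tags)
    # Without a selector, the Allow Empty Node Selector Volume setting
    # decides whether tagged nodes are still eligible.
    return allow_empty_node_selector_volume or not node_tags

# Example: a cordoned node is skipped under the default settings.
print(node_eligible(["ssd"], node_cordoned=True, volume_node_selector=["ssd"]))  # False
```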
### 3. Anti-Affinity Rules Across Nodes and Zones

SUSE Storage prioritizes spreading replicas across different nodes and zones to improve fault tolerance. A "new" node or zone is one that does not currently host any replica of the volume, while an "existing" node or zone already hosts a replica of the volume.

The scheduler attempts to place the new replica in the most "isolated" location possible, following this hierarchy of preference:

- New Node in a New Zone (most preferred)
- New Node in an Existing Zone
- Existing Node in an Existing Zone (least preferred)
The following table details the required settings for a replica to be scheduled in each scenario:

| Scenario | Replica Zone Level Soft Anti-Affinity | Replica Node Level Soft Anti-Affinity | Scheduler Action |
|---|---|---|---|
| New Node in a New Zone | `true` | `true` | Schedules the replica. |
| | Any other value | Any other value | Does not schedule the replica. |
| New Node in an Existing Zone | `true` | `true` | Schedules the replica if no new zone is available. |
| | Any other value | Any other value | Does not schedule the replica. |
| Existing Node in an Existing Zone | `true` | `true` | Schedules the replica if no other options are available. |
| | Any other value | Any other value | Does not schedule the replica. |
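The preference hierarchy itself can be expressed as a ranking function. The sketch below is illustrative Python, not scheduler source code; whether the lower-ranked fallbacks are permitted at all is governed by the soft anti-affinity settings in the table above.

```python
def isolation_rank(node_name, zone, replica_nodes, replica_zones):
    # Lower rank = more isolated = more preferred.
    new_node = node_name not in replica_nodes
    new_zone = zone not in replica_zones
    if new_node and new_zone:
        return 0  # New Node in a New Zone (most preferred)
    if new_node:
        return 1  # New Node in an Existing Zone
    return 2      # Existing Node in an Existing Zone (least preferred)

# Example: with a replica already on node-1 in zone-a, the candidate in a
# brand-new zone wins.
candidates = [("node-2", "zone-b"), ("node-3", "zone-a"), ("node-1", "zone-a")]
best = min(candidates,
           key=lambda c: isolation_rank(c[0], c[1], {"node-1"}, {"zone-a"}))
print(best)  # ('node-2', 'zone-b')
```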
## Disk Selection Stage

After the node and zone selection stage, SUSE Storage determines whether it can schedule the replica on any disk of the selected node. It checks the available disks based on matching tags, total disk space, and available disk space. It also considers whether another replica already exists and the anti-affinity settings.

SUSE Storage checks all available disks on the selected node to ensure they meet the following criteria:

- Disk Tag Matching:
  - If the volume has disk tags, the disk must have the specified tags to host the replica.
  - If the volume has no disk tags, the behavior depends on the Allow Empty Disk Selector Volume setting:
    - `true` (default): Allows scheduling on disks with or without tags.
    - `false`: Only allows scheduling on disks without tags.
- Available Space Check:
  - The disk must have sufficient available space based on the configured Storage Minimal Available Percentage.
- Anti-Affinity Settings:
  - Hard Anti-Affinity: Prevents scheduling a replica on a disk that already hosts another replica of the same volume.
  - Soft Anti-Affinity (when enabled): Prefers scheduling the replica on a disk without an existing replica, even if it is a less optimal choice in terms of space or other factors.
- Space Conditions: Two formulas determine whether a disk is schedulable (see the code sketch below):
  - Actual Space Usage Condition: Ensures sufficient usable storage remains after accounting for currently used space.
    - Formula: `(Storage Available - Actual Size) > (Storage Maximum × Minimal Available Percentage) / 100`
  - Scheduling Space Condition: Ensures the replica’s size (plus any scheduled but unwritten data) fits within the over-provisioning limit.
    - Formula: `(Size + Storage Scheduled) <= ((Storage Maximum - Storage Reserved) × Over Provisioning Percentage) / 100`
During disk evaluation, since no specific replica is being scheduled yet, `Actual Size` and `Size` are treated as `0` in these formulas.
If any of these conditions (disk tag matching, anti-affinity, or space requirements) fails, the disk is marked unschedulable, and SUSE Storage does not place the replica on that disk.
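The two space conditions translate directly into predicates. The following is a minimal sketch (the function and parameter names are ours, not SUSE Storage APIs); sizes can be in any consistent unit, and the percentages correspond to the Storage Minimal Available Percentage and Storage Over Provisioning Percentage settings:

```python
def actual_space_usage_ok(storage_available, actual_size,
                          storage_maximum, minimal_available_pct):
    # Actual Space Usage Condition: the space left after the replica's
    # actual usage must exceed the minimal-available floor.
    return (storage_available - actual_size) > \
        (storage_maximum * minimal_available_pct) / 100

def scheduling_space_ok(size, storage_scheduled, storage_maximum,
                        storage_reserved, over_provisioning_pct):
    # Scheduling Space Condition: the replica's size plus already
    # scheduled space must fit under the over-provisioning limit.
    return (size + storage_scheduled) <= \
        ((storage_maximum - storage_reserved) * over_provisioning_pct) / 100
```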
## Example Scenario

Consider a node (Node A) with two disks:

- Disk X: 1 GB available, 4 GB max space
- Disk Y: 2 GB available, 8 GB max space
### Stage 1: Initial Disk Evaluation

During the initial disk selection stage, SUSE Storage performs a basic check on all available disks. At this point, no specific replica has been selected, so `Actual Size` and `Size` are treated as `0`.

#### Disk X Evaluation
- Available Space: 1 GB
- Storage Minimal Available Percentage: 25% (default)
- Minimum required available space: `(4 GB × 25) / 100 = 1 GB`
- Result: Disk X fails the Actual Space Usage Condition because its available space (1 GB) is not greater than the minimum required (1 GB). Therefore, Disk X is not schedulable unless the Storage Minimal Available Percentage is set to `0`.
#### Disk Y Evaluation

- Available Space: 2 GB
- Storage Minimal Available Percentage: 10%
- Minimum required available space: `(8 GB × 10) / 100 = 0.8 GB`
- Result: Disk Y passes the Actual Space Usage Condition because its available space (2 GB) is greater than the minimum required (0.8 GB).

Next, we check the Scheduling Space Condition:

- Scheduled Space: 2 GB
- Storage Reserved: 1 GB
- Over Provisioning Percentage: 100% (default)
- Maximum Provisionable Storage: `(8 GB - 1 GB) × 100 / 100 = 7 GB`
- Result: Disk Y passes the Scheduling Space Condition because the currently scheduled space (2 GB) is less than the maximum provisionable storage (7 GB).

Since Disk Y passes all conditions, it is marked as a schedulable disk candidate.
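Plugging the example numbers into the `actual_space_usage_ok` and `scheduling_space_ok` sketches from the disk selection stage reproduces these results:

```python
GB = 1024 ** 3  # treat 1 GB as a binary gigabyte; any consistent unit works

# Disk X: 1 GB available, 4 GB max, 25% minimal available -> 1 GB is not > 1 GB
print(actual_space_usage_ok(1 * GB, 0, 4 * GB, 25))         # False

# Disk Y: 2 GB available, 8 GB max, 10% minimal available -> 2 GB > 0.8 GB
print(actual_space_usage_ok(2 * GB, 0, 8 * GB, 10))         # True

# Disk Y: 2 GB scheduled, 1 GB reserved, 100% over-provisioning -> 2 GB <= 7 GB
print(scheduling_space_ok(0, 2 * GB, 8 * GB, 1 * GB, 100))  # True
```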
### Stage 2: Anti-Affinity Rules

Assume both Disk X and Disk Y pass the initial space checks and Disk X already hosts a replica for the same volume.

#### Hard Anti-Affinity
- If hard anti-affinity is enabled, SUSE Storage will not schedule the new replica on Disk X. Instead, it will attempt to schedule it on Disk Y.
- If Disk Y is not suitable (e.g., mismatched disk tags), scheduling for this replica will fail.

#### Soft Anti-Affinity

- If soft anti-affinity is enabled, SUSE Storage prefers to schedule the replica on Disk Y to avoid co-locating replicas.
- However, if Disk Y is unsuitable for any reason, SUSE Storage may still schedule the replica on Disk X. This allows for sharing a disk as a fallback option when no other viable candidates are available.
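This fallback behavior can be sketched as a small selection function (illustrative only, assuming all other disk checks have already passed):

```python
def pick_disk(suitable_disks, disks_with_replica, soft_anti_affinity):
    # Prefer a disk that does not already host a replica of this volume.
    replica_free = [d for d in suitable_disks if d not in disks_with_replica]
    if replica_free:
        return replica_free[0]
    # No replica-free disk: soft anti-affinity allows sharing a disk as a
    # fallback; hard anti-affinity (soft disabled) fails instead.
    return suitable_disks[0] if soft_anti_affinity and suitable_disks else None

print(pick_disk(["disk-x", "disk-y"], {"disk-x"}, soft_anti_affinity=False))  # disk-y
print(pick_disk(["disk-x"], {"disk-x"}, soft_anti_affinity=False))            # None
print(pick_disk(["disk-x"], {"disk-x"}, soft_anti_affinity=True))             # disk-x
```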
## Settings

For more information on settings that are relevant to scheduling replicas on nodes and disks, refer to the settings reference:

- Replica Soft Anti-Affinity (also called Replica Node Level Soft Anti-Affinity)
## Notice

SUSE Storage relies on the label `topology.kubernetes.io/zone=<Zone name of the node>` or `topology.kubernetes.io/region=<Region name of the node>` in the Kubernetes node object to identify the zone and region, because these are well-known labels reserved and used by Kubernetes.
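For example, the labels can be inspected with the official Kubernetes Python client (assuming the `kubernetes` package is installed and a kubeconfig is available; this snippet is illustrative and not part of SUSE Storage):

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    print(node.metadata.name,
          labels.get("topology.kubernetes.io/zone", "<no zone>"),
          labels.get("topology.kubernetes.io/region", "<no region>"))
```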