Applies to SUSE Linux Enterprise High Performance Computing 15 SP4

5 Slurm: utility for HPC workload management

Slurm is a workload manager for managing compute jobs on High Performance Computing clusters. It can start multiple jobs on a single node, or a single job on multiple nodes. Additional components can be used for advanced scheduling and accounting.

The mandatory components of Slurm are the control daemon slurmctld, which handles job scheduling, and the Slurm daemon slurmd, responsible for launching compute jobs. Nodes running slurmctld are called management servers and nodes running slurmd are called compute nodes.

Additional components are a secondary slurmctld acting as a standby server for a failover, and the Slurm database daemon slurmdbd, which stores the job history and user hierarchy.

For further documentation, see the Quick Start Administrator Guide and Quick Start User Guide. More in-depth documentation is available on the Slurm documentation page.

5.1 Installing Slurm

These instructions describe a minimal installation of Slurm with one management server and multiple compute nodes.

5.1.1 Minimal installation

Important: Ensure consistent UIDs and GIDs for Slurm's accounts

For security reasons, Slurm does not run as the user root, but under its own user. It is important that the user slurm has the same UID/GID across all nodes of the cluster.

If this user/group does not exist, the package slurm creates this user and group when it is installed. However, this does not guarantee that the generated UIDs/GIDs will be identical on all systems.

Therefore, we strongly advise you to create the user/group slurm before installing slurm. If you are using a network directory service such as LDAP for user and group management, you can use it to provide the slurm user/group as well.

It is strongly recommended that all compute nodes share common user home directories. These should be provided through network storage.
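One way to confirm the UID/GID requirement above is to collect the numeric IDs of the slurm account from every node and compare them. The following is a minimal sketch, not part of the product: the function name check_slurm_ids, the host names, and the "HOST UID GID" input format are assumptions made for this illustration.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with Slurm): reads lines of the form
# "HOST UID GID" on stdin and reports every host whose IDs differ from
# those of the first host. Returns non-zero if any mismatch is found.
check_slurm_ids() {
  awk '
    NR == 1 { uid = $2; gid = $3 }
    $2 != uid || $3 != gid {
      printf "mismatch on %s: uid=%s gid=%s (expected %s/%s)\n", $1, $2, $3, uid, gid
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Example input collection (assumes SSH access to these hypothetical hosts):
#   for n in management node1 node2; do
#     echo "$n $(ssh "$n" id -u slurm) $(ssh "$n" id -g slurm)"
#   done | check_slurm_ids
```

If the function reports a mismatch, correct the IDs (or provide the account through the directory service) before starting any Slurm daemon.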

Procedure 5.1: Installing the Slurm packages

On the management server, install the slurm package with the command zypper in slurm.
On the compute nodes, install the slurm-node package with the command zypper in slurm-node.
On the management server and the compute nodes, the package munge is installed automatically. Configure, enable and start MUNGE on the management server and compute nodes as described in Section 3.4, “MUNGE authentication”. Ensure that the same munge key is shared across all nodes.

Note: Automatically opened ports

Installing the slurm package automatically opens the TCP ports 6817, 6818, and 6819.

Procedure 5.2: Configuring Slurm

On the management server, edit the main configuration file /etc/slurm/slurm.conf:
1. Configure the parameter SlurmctldHost=SLURMCTLD_HOST with the host name of the management server.
  To find the correct host name, run hostname -s on the management server.
2. Under the COMPUTE NODES section, add the following lines to define the compute nodes:
```
NodeName=NODE_LIST State=UNKNOWN
PartitionName=normal Nodes=NODE_LIST Default=YES MaxTime=24:00:00 State=UP
```
  Replace NODE_LIST with the host names of the compute nodes, either comma-separated or as a range (for example: node[1-100]).
  The NodeName line also allows specifying additional parameters for the nodes, such as Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, or CPUs. The actual values of these can be obtained by running the following command on the compute nodes:
```
node1# slurmd -C
```
Copy the modified configuration file /etc/slurm/slurm.conf from the management server to all compute nodes:
```
management# scp /etc/slurm/slurm.conf COMPUTE_NODE:/etc/slurm/
```
On the management server, start slurmctld and enable it so that it starts on every boot:
```
management# systemctl enable --now slurmctld.service
```
On each compute node, start slurmd and enable it so that it starts on every boot:
```
node1# systemctl enable --now slurmd.service
```
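The hardware values reported by slurmd -C in the procedure above can be turned into a NodeName line for slurm.conf. The following sketch assumes that slurmd -C prints the hardware description on a single line starting with NodeName=; the function name node_line is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads the output of `slurmd -C` on stdin and prints
# the NodeName line with State=UNKNOWN appended, ready to be pasted into
# slurm.conf. All other lines of the input are dropped.
node_line() {
  sed -n 's/^\(NodeName=.*[^ ]\) *$/\1 State=UNKNOWN/p'
}

# Example (run on the management server, assuming SSH access to node1):
#   ssh node1 slurmd -C | node_line
```

Review the generated line before adding it to the configuration file, because identical nodes are usually combined into one NodeName entry with a host name range.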

Procedure 5.3: Testing the Slurm installation

Check the status and availability of the compute nodes by running the sinfo command. You should see output similar to the following:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 1-00:00:00      2   idle node[01-02]
```
If the node state is not idle, see Section 5.4, “Frequently asked questions”.
Test the Slurm installation by running the following command:
```
management# srun sleep 30
```
This runs the sleep command on a free compute node for 30 seconds.
In another shell, run the squeue command during the 30 seconds that the compute node is asleep. You should see output similar to the following:
```
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    1    normal    sleep     root  R       0:05      1 node02
```
Create the following shell script and save it as sleeper.sh:
```
#!/bin/bash
echo "started at $(date)"
sleep 30
echo "finished at $(date)"
```
Run the shell script in the queue:
```
management# sbatch sleeper.sh
```
The shell script is executed when enough resources are available, and the output is stored in the file slurm-${JOBNR}.out.
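To locate the output file of a submitted job from a script, the job number can be taken from the confirmation message that sbatch prints ("Submitted batch job N"). This is a minimal sketch for non-array jobs that use the default output file name; the function name output_file_for is hypothetical.

```shell
#!/bin/bash
# Reads sbatch's confirmation message on stdin and prints the name of the
# default output file for that job (slurm-JOBNR.out).
output_file_for() {
  awk '/^Submitted batch job [0-9]+/ { printf "slurm-%s.out\n", $4 }'
}

# Example:
#   sbatch sleeper.sh | output_file_for
```

The file is created in the directory from which the job was submitted, once the job starts running.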

5.1.2 Installing the Slurm database

In a minimal installation, Slurm only stores pending and running jobs. To store finished and failed job data, the storage plugin must be installed and enabled. You can also enable completely fair scheduling, which replaces FIFO (first in, first out) scheduling with algorithms that calculate a job's priority in the queue based on the jobs that the user has run previously.

The Slurm database has two components: the slurmdbd daemon itself, and an SQL database. MariaDB is recommended. The database can be installed on the same node that runs slurmdbd, or on a separate node. For a minimal setup, all these services run on the management server.

Before you begin, make sure Slurm is installed as described in Section 5.1.1, “Minimal installation”.

Procedure 5.4: Install slurmdbd

Note: MariaDB

If you want to use an external SQL database (or you already have a database installed on the management server), you can skip Step 1 and Step 2.

Install the MariaDB SQL database:
```
management# zypper install mariadb
```

Start and enable MariaDB:
```
management# systemctl enable --now mariadb
```

Secure the database:
```
management# mysql_secure_installation
```
Connect to the SQL database:
```
management# mysql -u root -p
```
Create the Slurm database user and grant it permissions for the Slurm database, which will be created later:
```
mysql> create user 'slurm'@'localhost' identified by 'PASSWORD';
mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost';
```
You can choose to use a different user name or database name. In this case, you must also change the corresponding values in the /etc/slurm/slurmdbd.conf file later.
Exit the database:
```
mysql> exit
```
Install the slurmdbd package:
```
management# zypper in slurm-slurmdbd
```
Edit the /etc/slurm/slurmdbd.conf file so that the daemon can access the database. Change the following line to the password that you used in Step 5:
```
StoragePass=password
```
If the database is on a different node, or if you chose a different user name or database name, you must also modify the following lines:
```
StorageUser=slurm
StorageLoc=slurm_acct_db
DbdAddr=localhost
DbdHost=localhost
```
Start and enable slurmdbd:
```
management# systemctl enable --now slurmdbd
```
The first start of slurmdbd might take some time.
To enable accounting, edit the /etc/slurm/slurm.conf file to add the connection between slurmctld and the slurmdbd daemon. Ensure that the following lines appear as shown:
```
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
```
This example assumes that slurmdbd is running on the same node as slurmctld. If not, change localhost to the host name or IP address of the node where slurmdbd is running.
Make sure slurmdbd is running before you continue:
```
management# systemctl status slurmdbd
```
If you restart slurmctld before slurmdbd is running, slurmctld fails because it cannot connect to the database.
Restart slurmctld:
```
management# systemctl restart slurmctld
```
This creates the Slurm database and adds the cluster to the database (using the ClusterName from /etc/slurm/slurm.conf).
(Optional) By default, Slurm does not take any group membership into account, and the system groups cannot be mapped to Slurm. However, you can mimic system groups with accounts. In Slurm, accounts are usually entities billed for cluster usage, while users identify individual cluster users. Multiple users can be associated with a single account.
The following example creates an umbrella group bavaria for two subgroups called nuremberg and munich:
```
management# sacctmgr add account bavaria \
Description="umbrella group for subgroups" Organization=bavaria
```
```
management# sacctmgr add account nuremberg,munich parent=bavaria \
Description="subgroup" Organization=bavaria
```
The following example adds a user called tux to the subgroup nuremberg:
```
management# sacctmgr add user tux Account=nuremberg
```
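When several users belong to the same account, the sacctmgr call shown above can be wrapped in a loop. The following is a sketch under assumptions: the function name add_users and the user names are hypothetical, and the --immediate option is used so that sacctmgr commits each change without asking for confirmation.

```shell
#!/bin/bash
# Hypothetical wrapper: adds every user given after the account name to
# that account. Stops at the first sacctmgr call that fails.
add_users() {
  local account=$1 user
  shift
  for user in "$@"; do
    sacctmgr --immediate add user "$user" Account="$account" || return 1
  done
}

# Example:
#   add_users nuremberg tux wilber
```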

5.2 Slurm administration commands

This section lists some useful options for common Slurm commands. For more information and a full list of options, see the man page for each command. For more Slurm commands, see https://slurm.schedmd.com/man_index.html.

5.2.1 scontrol

The command scontrol is used to show and update the entities of Slurm, such as the state of the compute nodes or compute jobs. It can also be used to reboot compute nodes or to propagate configuration changes to them.

Useful options for this command are --details, which prints more verbose output, and --oneliner, which prints each record on a single line and is therefore more convenient in shell scripts.

For more information, see man scontrol.

scontrol show ENTITY

Displays the state of the specified ENTITY.

scontrol update SPECIFICATION

Updates the entity described by SPECIFICATION, such as a compute node or its state. Useful SPECIFICATION values include:

nodename=NODE state=down reason=REASON: Removes all jobs from the compute node, and aborts any jobs already running on the node.
nodename=NODE state=drain reason=REASON: Drains the compute node so that no new jobs can be scheduled on it, but does not end compute jobs already running on the compute node. REASON can be any string. The compute node stays in the drained state and must be returned to the idle state manually.
nodename=NODE state=resume: Marks the compute node as ready to return to the idle state.
jobid=JOBID REQUIREMENT=VALUE: Updates the given requirement, such as NumNodes, with a new value. This command can also be run as a non-privileged user.
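The drain and resume specifications listed above are often used together for maintenance. The following sketch wraps them in two functions; the function names and the example reason string are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrappers around `scontrol update` for node maintenance.

# Drain a node: no new jobs are scheduled, running jobs continue.
drain_node() {
  local node=$1 reason=$2
  if [ -z "$node" ] || [ -z "$reason" ]; then
    echo "usage: drain_node NODE REASON" >&2
    return 2
  fi
  scontrol update nodename="$node" state=drain reason="$reason"
}

# Return a drained or down node to service.
resume_node() {
  scontrol update nodename="$1" state=resume
}

# Example:
#   drain_node node02 "disk replacement"
#   ... perform the maintenance ...
#   resume_node node02
```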

scontrol reconfigure

Triggers a reload of the configuration file slurm.conf on all compute nodes.

scontrol reboot NODELIST

Reboots a compute node, or group of compute nodes, when the jobs on it finish. To use this command, the option RebootProgram="/sbin/reboot" must be set in slurm.conf. When the reboot of a compute node takes more than 60 seconds, you can set a higher value in slurm.conf, such as ResumeTimeout=300.

5.2.2 sinfo

The command sinfo retrieves information about the state of the compute nodes, and can be used for a fast overview of the cluster health. For more information, see man sinfo.

--dead: Displays information about unresponsive nodes.
--long: Shows more detailed information.
--reservation: Prints information about advanced reservations.
-R: Displays the reason a node is in the down, drained, or failing state.
--state=STATE: Limits the output only to nodes with the specified STATE.
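For a quick health check in a script, sinfo can print one line per node in a fixed format, which is then easy to filter. This sketch uses the --Node, --noheader, and --format options of sinfo with the %N (node name) and %T (extended state) fields; the function name non_idle_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: lists every node whose state is not "idle",
# together with that state. Nodes that belong to several partitions are
# reported once.
non_idle_nodes() {
  sinfo --Node --noheader --format='%N %T' |
    awk '$2 != "idle" { print $1, $2 }' |
    sort -u
}

# Example:
#   non_idle_nodes
```

Note that busy nodes (for example in the allocated or mixed state) are listed as well, so the output is an overview of the cluster rather than an error report.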

5.2.3 sacctmgr and sacct

These commands are used for managing accounting. For more information, see man sacctmgr and man sacct.

sacctmgr

Used for job accounting in Slurm. To use this command, the service slurmdbd must be set up. See Section 5.1.2, “Installing the Slurm database”.

sacct

Displays the accounting data if accounting is enabled.

--allusers: Shows accounting data for all users.
--accounts=NAME: Shows only jobs that ran under the specified account(s).
--starttime=MM/DD[/YY]-HH:MM[:SS]: Shows only jobs after the specified start time. You can use just MM/DD or HH:MM. If no time is given, the command defaults to 00:00, which means that only jobs from today are shown.
--endtime=MM/DD[/YY]-HH:MM[:SS]: Accepts the same options as --starttime. If no time is given, the time when the command was issued is used.
--name=NAME: Limits output to jobs with the given NAME.
--partition=PARTITION: Shows only jobs that run in the specified PARTITION.
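The options above can be combined with sacct's machine-readable output to summarize job results. The following sketch counts jobs per final state since a given start time. It assumes that accounting is enabled; the function name count_job_states is hypothetical, and the options --allocations (one line per job, without job steps), --noheader, --parsable2, and --format are taken from the sacct man page.

```shell
#!/bin/bash
# Hypothetical helper: prints "STATE COUNT" for all jobs of all users that
# started after the given time (same format as --starttime).
# Only the first word of the state is used, so "CANCELLED by UID" is
# counted as CANCELLED.
count_job_states() {
  local since=$1
  sacct --allusers --allocations --noheader --parsable2 \
        --starttime="$since" --format=State |
    awk '{ count[$1]++ } END { for (s in count) print s, count[s] }' |
    sort
}

# Example (jobs since January 15 of the current year):
#   count_job_states 01/15
```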

5.2.4 sbatch, salloc, and srun

These commands are used to schedule compute jobs, which means batch scripts for the sbatch command, interactive sessions for the salloc command, or binaries for the srun command. If the job cannot be scheduled immediately, only sbatch places it into the queue.

For more information, see man sbatch, man salloc, and man srun.

-n COUNT_THREADS: Specifies the number of tasks (processes or threads) needed by the job. The tasks can be allocated on different nodes.
-N MINCOUNT_NODES[-MAXCOUNT_NODES]: Sets the number of compute nodes required for a job. The MAXCOUNT_NODES number can be omitted.
--time TIME: Specifies the maximum clock time (runtime) after which a job is terminated. TIME is given either as a number of minutes or as [HH:]MM:SS. Not to be confused with walltime, which is clocktime × threads.
--signal [B:]NUMBER[@TIME]: Sends the signal specified by NUMBER 60 seconds before the end of the job, unless TIME is specified. The signal is sent to every process on every node. If a signal should only be sent to the controlling batch job, you must specify the B: flag.
--job-name NAME: Sets the name of the job to NAME in the queue.
--array=RANGEINDEX: Executes the given script via sbatch for indexes given by RANGEINDEX with the same parameters.
--dependency=STATE:JOBID: Defers the job until the specified STATE of the job JOBID is reached.
--gres=GRES: Runs a job only on nodes with the specified generic resource (GRes), for example a GPU, specified by the value of GRES.
--licenses=NAME[:COUNT]: The job must have the specified number (COUNT) of licenses with the name NAME. A license is the opposite of a generic resource: it is not tied to a computer, but is a cluster-wide variable.
--mem=MEMORY: Sets the real MEMORY required by a job per node. To use this option, memory control must be enabled. The default unit for the MEMORY value is megabytes, but you can also use K for kilobyte, M for megabyte, G for gigabyte, or T for terabyte.
--mem-per-cpu=MEMORY: This option takes the same values as --mem, but defines memory on a per-CPU basis rather than a per-node basis.
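For sbatch, the options listed above can also be placed in the batch script itself as #SBATCH comment lines at the top of the file. The following sketch combines several of them; all values, the handler name cleanup, and the variable WORK_SECONDS are illustrative assumptions rather than recommendations.

```shell
#!/bin/bash
#SBATCH --job-name=sleeper
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --signal=B:USR1@120

# --signal=B:USR1@120 asks Slurm to send SIGUSR1 to the batch shell only,
# 120 seconds before the time limit of 10 minutes is reached.

cleanup() {
  echo "received USR1, cleaning up before the time limit"
  exit 0
}
trap cleanup USR1

run_job() {
  echo "started at $(date)"
  # Placeholder for the real workload. It runs in the background because
  # bash executes a trap handler only after the foreground command
  # returns, whereas `wait` is interrupted as soon as the signal arrives.
  sleep "${WORK_SECONDS:-2}" &
  wait "$!"
  echo "finished at $(date)"
}

run_job
```

Options given on the sbatch command line take precedence over the #SBATCH lines, so the same script can be reused with a different time limit or job name.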

5.3 Upgrading Slurm

For existing products under general support, version upgrades of Slurm are provided regularly. Unlike maintenance updates, these upgrades are not installed automatically using zypper patch but require you to request their installation explicitly. This ensures that these upgrades are not installed unintentionally and gives you the opportunity to plan version upgrades beforehand.

Important: zypper up is not recommended

On systems running Slurm, updating packages with zypper up is not recommended. zypper up attempts to update all installed packages to the latest version, so it might install a new major version of Slurm outside of planned Slurm upgrades.

Use zypper patch instead, which only updates packages to the latest bug fix version.

5.3.1 Slurm upgrade workflow

Interoperability is guaranteed between three consecutive versions of Slurm, with the following restrictions:

The version of slurmdbd must be identical to or higher than the version of slurmctld.
The version of slurmctld must be identical to or higher than the version of slurmd.
The version of slurmd must be identical to or higher than the version of the slurm user applications.

Or in short: version(slurmdbd) >= version(slurmctld) >= version(slurmd) >= version(Slurm user CLIs).
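The ordering rule above can be checked mechanically before an upgrade. The following sketch compares version strings with the version sort of GNU sort (sort -V); the function names are hypothetical, and the check covers only the ordering, not the limit of three consecutive versions.

```shell
#!/bin/bash
# Succeeds if version $1 is identical to or higher than version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical helper: takes the versions of slurmdbd, slurmctld, slurmd,
# and the user commands (in that order) and succeeds only if
# slurmdbd >= slurmctld >= slurmd >= user commands.
check_slurm_versions() {
  local dbd=$1 ctld=$2 slurmd=$3 cli=$4
  version_ge "$dbd" "$ctld" &&
    version_ge "$ctld" "$slurmd" &&
    version_ge "$slurmd" "$cli"
}

# Example:
#   check_slurm_versions 23.02.5 23.02.5 22.05.9 22.05.9 && echo "order OK"
```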

Slurm uses a segmented version number: the first two segments denote the major version, and the final segment denotes the patch level. Upgrade packages (that is, packages that were not initially supplied with the module or service pack) have their major version encoded in the package name (with periods . replaced by underscores _). For example, for version 23.02, this would be slurm_23_02-*.rpm. To find out the latest version of Slurm, you can check My Tools › Packages in the SUSE Customer Center, or run zypper search -v slurm on a node.
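The naming rule for upgrade packages can be expressed as a small conversion. This sketch derives the package name prefix from a version string; the function name slurm_upgrade_pkg is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the upgrade package name prefix for a Slurm
# version. Only the first two segments (the major version) are used, and
# the period between them is replaced by an underscore.
slurm_upgrade_pkg() {
  local major minor _patch
  IFS=. read -r major minor _patch <<< "$1"
  printf 'slurm_%s_%s\n' "$major" "$minor"
}

# Example:
#   zypper search -v "$(slurm_upgrade_pkg 23.02.5)"
```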

With each version, configuration options for slurmctld, slurmd, or slurmdbd might be deprecated. While deprecated, they remain valid for this version and the two subsequent versions, but they might be removed later. Therefore, it is advisable to update the configuration files after the upgrade and replace deprecated configuration options before the final restart of a service.

A new major version of Slurm introduces a new version of libslurm. Older versions of this library might not work with an upgraded Slurm. An upgrade is provided for all SUSE Linux Enterprise software that depends on libslurm. It is strongly recommended to rebuild local applications using libslurm, such as MPI libraries with Slurm support, as early as possible. This might require updating the user applications, as new arguments might be introduced to existing functions.

Warning: Upgrade slurmdbd databases before other Slurm components

If slurmdbd is used, always upgrade the slurmdbd database before starting the upgrade of any other Slurm component. The same database can be connected to multiple clusters and must be upgraded before all of them.

Upgrading other Slurm components before the database can lead to data loss.

5.3.2 Upgrading the `slurmdbd` database daemon

When upgrading slurmdbd, the database is converted when the new version of slurmdbd starts for the first time. If the database is big, the conversion could take several tens of minutes. During this time, the database is inaccessible.

It is highly recommended to create a backup of the database in case an error occurs during or after the upgrade process. Without a backup, all accounting data collected in the database might be lost if an error occurs or the upgrade is rolled back. A database converted to a newer version cannot be converted back to an older version, and older versions of slurmdbd do not recognize the newer formats.

Note: Convert primary slurmdbd first

If you are using a backup slurmdbd, the conversion must be performed on the primary slurmdbd first. The backup slurmdbd only starts after the conversion is complete.

Procedure 5.5: Upgrading the slurmdbd database daemon

Stop the slurmdbd service:
```
DBnode# rcslurmdbd stop
```
Ensure that slurmdbd is not running anymore:
```
DBnode# rcslurmdbd status
```
slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent Queue size is limited, however, and should therefore be monitored with sdiag.

Create a backup of the slurm_acct_db database:
```
DBnode# mysqldump -p slurm_acct_db > slurm_acct_db.sql
```
If needed, this can be restored with the following command:
```
DBnode# mysql -p slurm_acct_db < slurm_acct_db.sql
```

During the database conversion, the variable innodb_buffer_pool_size must be set to a value of 128 MB or more. Check the current size:
```
DBnode# echo  'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
mysql --password --batch
```
If the value of innodb_buffer_pool_size is less than 128 MB, you can change it for the duration of the current session (on MariaDB):
```
DBnode# echo 'set GLOBAL innodb_buffer_pool_size = 134217728;' | \
mysql --password --batch
```
Alternatively, to permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:
```
DBnode# rcmysql restart
```
If you need to update MariaDB, run the following command:
```
DBnode# zypper update mariadb
```
Convert the database tables to the new version of MariaDB:
```
DBnode# mysql_upgrade --user=root --password=ROOT_DB_PASSWORD
```

Install the new version of slurmdbd:
```
DBnode# zypper install --force-resolution slurm_VERSION-slurmdbd
```

Rebuild the database. If you are using a backup slurmdbd, perform this step on the primary slurmdbd first.
Because a conversion might take a considerable amount of time, the systemd service might time out during the conversion. Therefore, we recommend performing the migration manually by running slurmdbd from the command line in the foreground:
```
DBnode# /usr/sbin/slurmdbd -D -v
```
When you see the following message, you can shut down slurmdbd by pressing Ctrl–C:
```
Conversion done:
success!
```
Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.
Restart slurmdbd:
```
DBnode# systemctl start slurmdbd
```
Note: No daemonization during rebuild
During the rebuild of the Slurm database, the database daemon does not daemonize.
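While slurmdbd is stopped during the procedure above, the DBD Agent Queue on the management server can be watched with sdiag. The following sketch extracts the current queue size. It assumes that sdiag prints a line labeled "DBD Agent queue size:", and the function name dbd_queue_size is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the number of requests that slurmctld has
# queued for slurmdbd, as reported by sdiag.
dbd_queue_size() {
  sdiag | awk -F': *' '/DBD Agent queue size/ { print $2 }'
}

# Example: print the queue size every 30 seconds while the database
# daemon is down (stop with Ctrl-C):
#   while sleep 30; do echo "$(date +%T) $(dbd_queue_size)"; done
```

A steadily growing value indicates that the upgrade should not be delayed further, because the queue size is limited.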

5.3.3 Upgrading `slurmctld` and `slurmd`

After the Slurm database is upgraded, the slurmctld and slurmd instances can be upgraded. It is recommended to update the management servers and compute nodes all at once. If this is not feasible, the compute nodes (slurmd) can be updated on a node-by-node basis. However, the management servers (slurmctld) must be updated first.

Prerequisites

The Slurm database is upgraded as described in Section 5.3.2, “Upgrading the slurmdbd database daemon”. Upgrading other Slurm components before the database can lead to data loss.
This procedure assumes that MUNGE authentication is used and that pdsh, the pdsh Slurm plugin, and mrsh can access all of the machines in the cluster. If this is not the case, install pdsh by running zypper in pdsh-slurm.
If mrsh is not used in the cluster, the ssh back-end for pdsh can be used instead. Replace the option -R mrsh with -R ssh in the pdsh commands below. This is less scalable and you might run out of usable ports.

Procedure 5.6: Upgrading slurmctld and slurmd

Back up the configuration file /etc/slurm/slurm.conf. Because this file should be identical across the entire cluster, it is sufficient to do so only on the main management server.
On the main management server, edit /etc/slurm/slurm.conf and set SlurmdTimeout and SlurmctldTimeout to sufficiently high values to avoid timeouts while slurmctld and slurmd are down:
```
SlurmctldTimeout=3600
SlurmdTimeout=3600
```
We recommend at least 60 minutes (3600), and more for larger clusters.

Copy the updated /etc/slurm/slurm.conf from the management server to all nodes:
1. Obtain the list of partitions in /etc/slurm/slurm.conf.
2. Copy the updated configuration to the compute nodes:
```
management# cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.update
management# sudo -u slurm /bin/bash -c 'cat /etc/slurm/slurm.conf.update | \
pdsh -R mrsh -P PARTITIONS "cat > /etc/slurm/slurm.conf"'
management# rm /etc/slurm/slurm.conf.update
```

Reload the configuration file on all compute nodes:
```
management# scontrol reconfigure
```

Verify that the reconfiguration took effect:
```
management# scontrol show config | grep Timeout
```

Shut down all running slurmctld instances, first on any backup management servers, and then on the main management server:
```
management# systemctl stop slurmctld
```
Back up the slurmctld state files. slurmctld maintains persistent state information. Almost every major version involves changes to the slurmctld state files. This state information is upgraded without loss of data if the upgrade remains within the supported version range.
However, if a downgrade is necessary, state information from newer versions is not recognized by an older version of slurmctld and is discarded, resulting in a loss of all running and pending jobs. Therefore, back up the old state in case an update needs to be rolled back.
1. Determine the StateSaveLocation directory:
```
management# scontrol show config | grep StateSaveLocation
```
2. Create a backup of the content of this directory. If a downgrade is required, restore the content of the StateSaveLocation directory from this backup.

Shut down slurmd on the compute nodes:
```
management# pdsh -R ssh -P PARTITIONS systemctl stop slurmd
```

Upgrade slurmctld on the main and backup management servers:
```
management# zypper install --force-resolution slurm_VERSION
```
Important: Upgrade all Slurm packages at the same time
If any additional Slurm packages are installed, you must upgrade those as well. This includes:
- slurm-pam_slurm
- slurm-sview
- perl-slurm
- slurm-lua
- slurm-torque
- slurm-config-man
- slurm-doc
- slurm-webdoc
- slurm-auth-none
- pdsh-slurm
All Slurm packages must be upgraded at the same time to avoid conflicts between packages of different versions. This can be done by adding them to the zypper install command line described above.
Upgrade slurmd on the compute nodes:
```
management# pdsh -R ssh -P PARTITIONS \
zypper install --force-resolution slurm_VERSION-node
```
Note: Memory size seen by slurmd might change on update
Under certain circumstances, the amount of memory seen by slurmd might change after an update. If this happens, slurmctld puts the nodes in a drained state. To check whether the amount of memory seen by slurmd changed after the update, run the following command on a single compute node:
```
node1# slurmd -C
```
Compare the output with the settings in slurm.conf. If required, correct the setting.
Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.
If you replace deprecated options in the configuration files, these configuration files can be distributed to all management servers and compute nodes in the cluster by using the method described in Step 3.

Restart slurmd on all compute nodes:
```
management# pdsh -R ssh -P PARTITIONS systemctl start slurmd
```

Restart slurmctld on the main and backup management servers:
```
management# systemctl start slurmctld
```
Check the status of the management servers. On the main and backup management servers, run the following command:
```
management# systemctl status slurmctld
```
Verify that the services are running without errors. Run the following command to check whether there are any down, drained, failing, or failed nodes:
```
management# sinfo -R
```
Restore the original values of SlurmdTimeout and SlurmctldTimeout in /etc/slurm/slurm.conf, then copy the restored configuration to all nodes by using the method described in Step 3.
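The backup of the slurmctld state files required in the procedure above can be scripted. The following sketch reads the StateSaveLocation directory from scontrol show config and packs it into a compressed archive. The function names and the archive path in the example are hypothetical, and the sketch assumes that scontrol prints the setting in the form "StateSaveLocation = DIRECTORY".

```shell
#!/bin/bash
# Prints the directory configured as StateSaveLocation.
state_save_dir() {
  scontrol show config |
    awk -F' *= *' '$1 == "StateSaveLocation" { print $2 }'
}

# Hypothetical helper: archives the StateSaveLocation directory into the
# tar file given as the first argument. Run it after slurmctld is stopped
# so that the state files no longer change.
backup_state() {
  local dest=$1 dir
  dir=$(state_save_dir)
  if [ ! -d "$dir" ]; then
    echo "StateSaveLocation not found: $dir" >&2
    return 1
  fi
  tar -czf "$dest" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Example:
#   backup_state /root/slurmctld-state-backup.tar.gz
```

Because scontrol needs a running slurmctld to answer, determine the directory with state_save_dir before shutting the daemon down if you run the two steps separately.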

5.4 Frequently asked questions

1. How do I change the state of a node from down to up?

If the slurmd daemon on a node does not come back within the time specified by the ResumeTimeout parameter after a reboot, or if ReturnToService was not changed in the configuration file slurm.conf, the compute node stays in the down state and must be set back to the up state manually. To do this for the node NODE, run the following command:
```
management# scontrol update state=resume NodeName=NODE
```

2. What is the difference between the states down and down*?

A * shown after a status code means that the node is not responding.

When a node is marked as down*, it means that the node is not reachable because of network issues, or that slurmd is not running on that node.

In the down state, the node is reachable, but either the node was rebooted unexpectedly, the hardware does not match the description in slurm.conf, or a health check configured with HealthCheckProgram failed.

3. How do I get the exact core count, socket number, and number of CPUs for a node?

To find the node values that go into the configuration file slurm.conf, run the following command:
```
node1# slurmd -C
```