5 Slurm — utility for HPC workload management #
Slurm is a workload manager for managing compute jobs on High Performance Computing clusters. It can start multiple jobs on a single node, or a single job on multiple nodes. Additional components can be used for advanced scheduling and accounting.
The mandatory components of Slurm are the control daemon
slurmctld, which handles job scheduling, and the
Slurm daemon slurmd, responsible for launching compute
jobs. Nodes running slurmctld
are called
management servers and nodes running
slurmd
are called compute nodes.
Additional components are a secondary slurmctld acting as a standby server for a failover, and the Slurm database daemon slurmdbd, which stores the job history and user hierarchy.
For further documentation, see the Quick Start Administrator Guide and Quick Start User Guide. There is further in-depth documentation on the Slurm documentation page.
5.1 Installing Slurm #
These instructions describe a minimal installation of Slurm with one management server and multiple compute nodes.
5.1.1 Minimal installation #
For security reasons, Slurm does not run as the user root, but under its own user. It is important that the user slurm has the same UID/GID across all nodes of the cluster.

If this user/group does not exist, the package slurm creates this user and group when it is installed. However, this does not guarantee that the generated UIDs/GIDs will be identical on all systems.

Therefore, we strongly advise you to create the user/group slurm before installing slurm. If you are using a network directory service such as LDAP for user and group management, you can use it to provide the slurm user/group as well.
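For example, the user and group can be created locally before installing the packages. The numeric ID 450 below is only an illustrative assumption; choose an ID that is free on all nodes and use the same value everywhere:

management# groupadd --system --gid 450 slurm
management# useradd --system --uid 450 --gid slurm --shell /bin/false \
   --comment "SLURM workload manager" slurm

Repeat the same commands on every compute node, or provide the identical user/group through your directory service.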
It is strongly recommended that all compute nodes share common user home directories. These should be provided through network storage.
1. On the management server, install the slurm package with the command zypper in slurm.

2. On the compute nodes, install the slurm-node package with the command zypper in slurm-node.

3. On the management server and the compute nodes, the package munge is installed automatically. Configure, enable and start MUNGE on the management server and compute nodes as described in Section 3.4, “MUNGE authentication”. Ensure that the same munge key is shared across all nodes.

4. On the management server, edit the main configuration file /etc/slurm/slurm.conf:

   Configure the parameter SlurmctldHost=SLURMCTLD_HOST with the host name of the management server. To find the correct host name, run hostname -s on the management server.

   Under the COMPUTE NODES section, add the following lines to define the compute nodes:

   NodeName=NODE_LIST State=UNKNOWN
   PartitionName=normal Nodes=NODE_LIST Default=YES MaxTime=24:00:00 State=UP
   Replace NODE_LIST with the host names of the compute nodes, either comma-separated or as a range (for example: node[1-100]).

   The NodeName line also allows specifying additional parameters for the nodes, such as Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, or CPUs. The actual values of these can be obtained by running the following command on the compute nodes:

   node1# slurmd -C
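   For illustration only, a node definition with explicit hardware values might look like the following line. The host names and numbers are assumptions, not values from a real cluster; use the output of slurmd -C instead:

   NodeName=node[01-02] CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN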
5. Copy the modified configuration file /etc/slurm/slurm.conf from the management server to all compute nodes:

   management# scp /etc/slurm/slurm.conf COMPUTE_NODE:/etc/slurm/

6. On the management server, start slurmctld and enable it so that it starts on every boot:

   management# systemctl enable --now slurmctld.service

7. On each compute node, start slurmd and enable it so that it starts on every boot:

   node1# systemctl enable --now slurmd.service

8. Check the status and availability of the compute nodes by running the sinfo command. You should see output similar to the following:

   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   normal*      up 1-00:00:00      2   idle node[01-02]
   If the node state is not idle, see Section 5.4, “Frequently asked questions”.

9. Test the Slurm installation by running the following command:

   management# srun sleep 30

   This runs the sleep command on a free compute node for 30 seconds.

10. In another shell, run the squeue command during the 30 seconds that the compute node is asleep. You should see output similar to the following:

    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        1    normal    sleep     root  R       0:05      1 node02
11. Create the following shell script and save it as sleeper.sh:

    #!/bin/bash
    echo "started at $(date)"
    sleep 30
    echo "finished at $(date)"

12. Run the shell script in the queue:

    management# sbatch sleeper.sh

    The shell script is executed when enough resources are available, and the output is stored in the file slurm-${JOBNR}.out.
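sbatch also reads job parameters from #SBATCH directives at the top of a script, so a more self-contained version of sleeper.sh could look like this (the resource values are only illustrative):

#!/bin/bash
#SBATCH --job-name=sleeper     # name shown by squeue
#SBATCH --time=00:02:00        # maximum runtime
#SBATCH -N 1                   # one node
#SBATCH -n 1                   # one task
echo "started at $(date)"
sleep 30
echo "finished at $(date)"

Submit it with sbatch sleeper.sh as before.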
5.1.2 Installing the Slurm database #
In a minimal installation, Slurm only stores pending and running jobs. To store finished and failed job data, the storage plugin must be installed and enabled. You can also enable fair-share scheduling, which replaces FIFO (first in, first out) scheduling with algorithms that calculate a job's priority in the queue based on the jobs a user has run in the past.
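Fair-share scheduling itself is not configured in this procedure. As a rough sketch only (the parameter values below are assumptions that need tuning per site), it is enabled in /etc/slurm/slurm.conf through the multifactor priority plugin once accounting data is available:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000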
The Slurm database has two components: the slurmdbd
daemon itself, and an SQL database. MariaDB is recommended. The
database can be installed on the same node that runs slurmdbd
,
or on a separate node. For a minimal setup, all these services run on the
management server.
Before you begin, make sure Slurm is installed as described in Section 5.1.1, “Minimal installation”.
If you want to use an external SQL database (or you already have a database installed on the management server), you can skip Step 1 and Step 2.
1. Install the MariaDB SQL database:

   management# zypper install mariadb

2. Start and enable MariaDB:

   management# systemctl enable --now mariadb

3. Secure the database:

   management# mysql_secure_installation

4. Connect to the SQL database:

   management# mysql -u root -p

5. Create the Slurm database user and grant it permissions for the Slurm database, which will be created later:

   mysql> create user 'slurm'@'localhost' identified by 'PASSWORD';
   mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost';

   You can choose to use a different user name or database name. In this case, you must also change the corresponding values in the /etc/slurm/slurmdbd.conf file later.

6. Exit the database:

   mysql> exit

7. Install the slurmdbd package:

   management# zypper in slurm-slurmdbd

8. Edit the /etc/slurm/slurmdbd.conf file so that the daemon can access the database. Change the following line to the password that you used in Step 5:

   StoragePass=password
   If the database is on a different node, or if you chose a different user name or database name, you must also modify the following lines:

   StorageUser=slurm
   StorageLoc=slurm_acct_db
   DbdAddr=localhost
   DbdHost=localhost
9. Start and enable slurmdbd:

   management# systemctl enable --now slurmdbd

   The first start of slurmdbd might take some time.

10. To enable accounting, edit the /etc/slurm/slurm.conf file to add the connection between slurmctld and the slurmdbd daemon. Ensure that the following lines appear as shown:

    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=localhost
    This example assumes that slurmdbd is running on the same node as slurmctld. If not, change localhost to the host name or IP address of the node where slurmdbd is running.

11. Make sure slurmdbd is running before you continue:

    management# systemctl status slurmdbd

    If you restart slurmctld before slurmdbd is running, slurmctld fails because it cannot connect to the database.

12. Restart slurmctld:

    management# systemctl restart slurmctld

    This creates the Slurm database and adds the cluster to the database (using the ClusterName from /etc/slurm/slurm.conf).
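    To verify that the cluster has been added, you can list the clusters registered in the accounting database:

    management# sacctmgr show cluster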
13. (Optional) By default, Slurm does not take any group membership into account, and the system groups cannot be mapped to Slurm. However, you can mimic system groups with accounts. In Slurm, accounts are usually entities billed for cluster usage, while users identify individual cluster users. Multiple users can be associated with a single account.
    The following example creates an umbrella group bavaria for two subgroups called nuremberg and munich:

    management# sacctmgr add account bavaria \
       Description="umbrella group for subgroups" Organization=bavaria
    management# sacctmgr add account nuremberg,munich parent=bavaria \
       Description="subgroup" Organization=bavaria

    The following example adds a user called tux to the subgroup nuremberg:

    management# sacctmgr add user tux Account=nuremberg
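    To review the resulting hierarchy, you can query the associations stored in the accounting database. The tree and format options shown here are an assumption about your sacctmgr version; a plain sacctmgr show associations also works:

    management# sacctmgr show associations tree format=cluster,account,user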
5.2 Slurm administration commands #
This section lists some useful options for common Slurm commands. For more
information and a full list of options, see the man
page
for each command. For more Slurm commands, see
https://slurm.schedmd.com/man_index.html.
5.2.1 scontrol #
The command scontrol is used to show and update the entities of Slurm, such as the state of the compute nodes or compute jobs. It can also be used to reboot compute nodes or to propagate configuration changes to them. Useful options for this command are --details, which prints more verbose output, and --oneliner, which forces the output onto a single line, making it more useful in shell scripts. For more information, see man scontrol.
scontrol show ENTITY
Displays the state of the specified ENTITY.

scontrol update SPECIFICATION
Updates the given SPECIFICATION, such as a compute node or a compute node state. Useful SPECIFICATION states that can be set for compute nodes include:

nodename=NODE state=down reason=REASON
Removes all jobs from the compute node, and aborts any jobs already running on the node.

nodename=NODE state=drain reason=REASON
Drains the compute node so that no new jobs can be scheduled on it, but does not end compute jobs already running on the compute node. REASON can be any string. The compute node stays in the drained state and must be returned to the idle state manually.

nodename=NODE state=resume
Marks the compute node as ready to return to the idle state.

jobid=JOBID REQUIREMENT=VALUE
Updates the given requirement, such as NumNodes, with a new value. This command can also be run as a non-privileged user.

scontrol reconfigure
Triggers a reload of the configuration file slurm.conf on all compute nodes.

scontrol reboot NODELIST
Reboots a compute node, or a group of compute nodes, when the jobs running on them finish. To use this command, the option RebootProgram="/sbin/reboot" must be set in slurm.conf. When the reboot of a compute node takes more than 60 seconds, you can set a higher value in slurm.conf, such as ResumeTimeout=300.
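For example, to take a compute node out of service for maintenance and return it afterwards (node02 is a placeholder host name):

management# scontrol update nodename=node02 state=drain reason="kernel update"
management# scontrol show node node02
management# scontrol update nodename=node02 state=resume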
5.2.2 sinfo #
The command sinfo
retrieves information about the state
of the compute nodes, and can be used for a fast overview of the cluster
health. For more information, see man sinfo
.
--dead
Displays information about unresponsive nodes.
--long
Shows more detailed information.
--reservation
Prints information about advanced reservations.
-R
Displays the reason a node is in the down, drained, or failing state.

--state=STATE
Limits the output only to nodes with the specified STATE.
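For example, to list only the idle nodes, or to show the reasons for nodes that are unavailable:

management# sinfo --state=idle
management# sinfo -R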
5.2.3 sacctmgr and sacct #
These commands are used for managing accounting. For more information, see
man sacctmgr
and man sacct
.
sacctmgr
Used for job accounting in Slurm. To use this command, the service slurmdbd must be set up. See Section 5.1.2, “Installing the Slurm database”.

sacct
Displays the accounting data if accounting is enabled.
--allusers
Shows accounting data for all users.

--accounts=NAME
Shows only jobs that belong to the specified account(s).

--starttime=MM/DD[/YY]-HH:MM[:SS]
Shows only jobs after the specified start time. You can use just MM/DD or HH:MM. If no time is given, the command defaults to 00:00, which means that only jobs from today are shown.

--endtime=MM/DD[/YY]-HH:MM[:SS]
Accepts the same options as --starttime. If no time is given, the time when the command was issued is used.

--name=NAME
Limits output to jobs with the given NAME.

--partition=PARTITION
Shows only jobs that run in the specified PARTITION.
5.2.4 sbatch, salloc, and srun #
These commands are used to schedule compute jobs: batch scripts for the sbatch command, interactive sessions for the salloc command, or binaries for the srun command. If a job cannot be scheduled immediately, only sbatch places it into the queue.
For more information, see man sbatch
,
man salloc
, and man srun
.
-n COUNT_THREADS
Specifies the number of threads needed by the job. The threads can be allocated on different nodes.
-N MINCOUNT_NODES[-MAXCOUNT_NODES]
Sets the number of compute nodes required for a job. The MAXCOUNT_NODES number can be omitted.
--time TIME
Specifies the maximum clock time (runtime) after which a job is terminated. The format of TIME is either seconds or [HH:]MM:SS. Not to be confused with CPU time, which is clock time × number of threads.

--signal [B:]NUMBER[@TIME]
Sends the signal specified by NUMBER 60 seconds before the end of the job, unless TIME is specified. The signal is sent to every process on every node. If a signal should only be sent to the controlling batch job, you must specify the B: flag.

--job-name NAME
Sets the name of the job to NAME in the queue.

--array=RANGEINDEX
Executes the given script via sbatch for the indexes given by RANGEINDEX, with the same parameters.

--dependency=STATE:JOBID
Defers the job until the specified STATE of the job JOBID is reached.

--gres=GRES
Runs a job only on nodes with the specified generic resource (GRes), for example a GPU, specified by the value of GRES.

--licenses=NAME[:COUNT]
The job must have the specified number (COUNT) of licenses with the name NAME. A license is the opposite of a generic resource: it is not tied to a node, but is a cluster-wide variable.

--mem=MEMORY
Sets the real MEMORY required by a job per node. To use this option, memory control must be enabled. The default unit for the MEMORY value is megabytes, but you can also use K for kilobytes, M for megabytes, G for gigabytes, or T for terabytes.

--mem-per-cpu=MEMORY
This option takes the same values as --mem, but defines memory on a per-CPU basis rather than a per-node basis.
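As an illustration of combining these options, the following batch script requests two nodes with eight tasks, a one-hour time limit, and 2 GB of memory per node (the values are only example assumptions):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH -N 2                  # two compute nodes
#SBATCH -n 8                  # eight tasks in total
#SBATCH --time=01:00:00       # terminate after one hour
#SBATCH --mem=2G              # 2 GB of memory per node
srun hostname                 # runs on the allocated resources

Submit the script with sbatch and monitor it with squeue.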
5.3 Upgrading Slurm #
For existing products under general support, version upgrades of Slurm are
provided regularly. Unlike maintenance updates, these upgrades are not
installed automatically using zypper patch
but require
you to request their installation explicitly. This ensures that these
upgrades are not installed unintentionally and gives you the opportunity
to plan version upgrades beforehand.
zypper up
is not recommended
On systems running Slurm, updating packages with zypper up is not recommended. zypper up attempts to update all installed packages to the latest version, so it might install a new major version of Slurm outside of planned Slurm upgrades.
Use zypper patch
instead, which only updates packages to the
latest bug fix version.
5.3.1 Slurm upgrade workflow #
Interoperability is guaranteed between three consecutive versions of Slurm, with the following restrictions:
The version of slurmdbd must be identical to or higher than the version of slurmctld.

The version of slurmctld must be identical to or higher than the version of slurmd.

The version of slurmd must be identical to or higher than the version of the Slurm user applications.

Or in short: version(slurmdbd) >= version(slurmctld) >= version(slurmd) >= version(Slurm user CLIs).
Slurm uses a segmented version number: the first two segments denote the
major version, and the final segment denotes the patch level.
Upgrade packages (that is, packages that were not initially supplied with
the module or service pack) have their major version encoded in the package
name (with periods .
replaced by underscores
_
). For example, for version 23.02, this would be
slurm_23_02-*.rpm
. To find out the latest version of Slurm, you can check the SUSE Customer Center, or run zypper search -v slurm on a node.
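For example, to see which Slurm version is installed and which upgrade packages are available (the exact output depends on your system and channels):

management# zypper search -v slurm
management# slurmctld -V
node1# slurmd -V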
With each version, configuration options for
slurmctld
, slurmd
, or
slurmdbd
might be deprecated. While deprecated, they
remain valid for this version and the two consecutive versions, but they might
be removed later. Therefore, it is advisable to update the configuration files
after the upgrade and replace deprecated configuration options before the
final restart of a service.
A new major version of Slurm introduces a new version of
libslurm
. Older versions of this library might not work
with an upgraded Slurm. An upgrade is provided for all SUSE Linux Enterprise software that
depends on libslurm
. It is strongly recommended to rebuild
local applications using libslurm
, such as MPI libraries
with Slurm support, as early as possible. This might require updating the
user applications, as new arguments might be introduced to existing functions.
Important: Upgrade the slurmdbd database before other Slurm components
If slurmdbd
is used, always upgrade the
slurmdbd
database before starting
the upgrade of any other Slurm component. The same database can be connected
to multiple clusters and must be upgraded before all of them.
Upgrading other Slurm components before the database can lead to data loss.
5.3.2 Upgrading the slurmdbd database daemon #
When upgrading slurmdbd
,
the database is converted when the new version of
slurmdbd
starts for the first time. If the
database is big, the conversion could take several tens of minutes. During
this time, the database is inaccessible.
It is highly recommended to create a backup of the database in case an
error occurs during or after the upgrade process. Without a backup,
all accounting data collected in the database might be lost if an error
occurs or the upgrade is rolled back. A database
converted to a newer version cannot be converted back to an older version,
and older versions of slurmdbd
do not recognize the
newer formats.
Note: Convert the primary slurmdbd first
If you are using a backup slurmdbd
, the conversion must
be performed on the primary slurmdbd
first. The backup
slurmdbd
only starts after the conversion is complete.
Procedure: Upgrading the slurmdbd database daemon #

1. Stop the slurmdbd service:

   DBnode# rcslurmdbd stop

2. Ensure that slurmdbd is not running anymore:

   DBnode# rcslurmdbd status

   slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent Queue size is limited, however, and should therefore be monitored with sdiag.

3. Create a backup of the slurm_acct_db database:

   DBnode# mysqldump -p slurm_acct_db > slurm_acct_db.sql

   If needed, this can be restored with the following command:

   DBnode# mysql -p slurm_acct_db < slurm_acct_db.sql

4. During the database conversion, the variable innodb_buffer_pool_size must be set to a value of 128 MB or more. Check the current size:

   DBnode# echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
      mysql --password --batch

5. If the value of innodb_buffer_pool_size is less than 128 MB, you can change it for the duration of the current session (on mariadb):

   DBnode# echo 'set GLOBAL innodb_buffer_pool_size = 134217728;' | \
      mysql --password --batch

   Alternatively, to permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:

   DBnode# rcmysql restart

6. If you need to update MariaDB, run the following command:

   DBnode# zypper update mariadb

7. Convert the database tables to the new version of MariaDB:

   DBnode# mysql_upgrade --user=root --password=ROOT_DB_PASSWORD

8. Install the new version of slurmdbd:

   DBnode# zypper install --force-resolution slurm_VERSION-slurmdbd

9. Rebuild the database. If you are using a backup slurmdbd, perform this step on the primary slurmdbd first.

   Because a conversion might take a considerable amount of time, the systemd service might time out during the conversion. Therefore, we recommend performing the migration manually by running slurmdbd from the command line in the foreground:

   DBnode# /usr/sbin/slurmdbd -D -v

   When you see the following message, you can shut down slurmdbd by pressing Ctrl–C:

   Conversion done: success!

10. Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.

11. Restart slurmdbd:

    DBnode# systemctl start slurmdbd

    Note: No daemonization during rebuild
    During the rebuild of the Slurm database, the database daemon does not daemonize.
5.3.3 Upgrading slurmctld and slurmd #
After the Slurm database is upgraded, the slurmctld
and
slurmd
instances can be upgraded. It is recommended to
update the management servers and compute nodes all at once.
If this is not feasible, the compute nodes (slurmd
) can
be updated on a node-by-node basis. However, the management servers
(slurmctld
) must be updated first.
Important: Upgrade the slurmdbd database first
Before you upgrade slurmctld and slurmd, upgrade the database daemon as described in Section 5.3.2, “Upgrading the slurmdbd database daemon”. Upgrading other Slurm components before the database can lead to data loss.

This procedure assumes that MUNGE authentication is used and that pdsh, the pdsh Slurm plugin, and mrsh can access all of the machines in the cluster. If this is not the case, install pdsh by running zypper in pdsh-slurm.

If mrsh is not used in the cluster, the ssh back-end for pdsh can be used instead. Replace the option -R mrsh with -R ssh in the pdsh commands below. This is less scalable, and you might run out of usable ports.
Procedure: Upgrading slurmctld and slurmd #

1. Back up the configuration file /etc/slurm/slurm.conf. Because this file should be identical across the entire cluster, it is sufficient to do so only on the main management server.

2. On the main management server, edit /etc/slurm/slurm.conf and set SlurmdTimeout and SlurmctldTimeout to sufficiently high values to avoid timeouts while slurmctld and slurmd are down:

   SlurmctldTimeout=3600
   SlurmdTimeout=3600

   We recommend at least 60 minutes (3600), and more for larger clusters.

3. Copy the updated /etc/slurm/slurm.conf from the management server to all nodes:

   Obtain the list of partitions in /etc/slurm/slurm.conf.

   Copy the updated configuration to the compute nodes:

   management# cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.update
   management# sudo -u slurm /bin/bash -c 'cat /etc/slurm/slurm.conf.update | \
      pdsh -R mrsh -P PARTITIONS "cat > /etc/slurm/slurm.conf"'
   management# rm /etc/slurm/slurm.conf.update

   Reload the configuration file on all compute nodes:

   management# scontrol reconfigure

   Verify that the reconfiguration took effect:

   management# scontrol show config | grep Timeout
4. Shut down all running slurmctld instances, first on any backup management servers, and then on the main management server:

   management# systemctl stop slurmctld

5. Back up the slurmctld state files. slurmctld maintains persistent state information. Almost every major version involves changes to the slurmctld state files. This state information is upgraded if the upgrade remains within the supported version range and no data is lost.

   However, if a downgrade is necessary, state information from newer versions is not recognized by an older version of slurmctld and is discarded, resulting in a loss of all running and pending jobs. Therefore, back up the old state in case an update needs to be rolled back.

   Determine the StateSaveLocation directory:

   management# scontrol show config | grep StateSaveLocation

   Create a backup of the content of this directory. If a downgrade is required, restore the content of the StateSaveLocation directory from this backup.
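   For example, if StateSaveLocation is /var/spool/slurmctld (an assumption; use the path reported by the command above), the directory can be archived like this:

   management# tar -czf /root/slurmctld-state-backup.tar.gz -C /var/spool/slurmctld .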
6. Shut down slurmd on the compute nodes:

   management# pdsh -R ssh -P PARTITIONS systemctl stop slurmd

7. Upgrade slurmctld on the main and backup management servers:

   management# zypper install --force-resolution slurm_VERSION

   Important: Upgrade all Slurm packages at the same time
   If any additional Slurm packages are installed, you must upgrade those as well. This includes:

   slurm-pam_slurm
   slurm-sview
   perl-slurm
   slurm-lua
   slurm-torque
   slurm-config-man
   slurm-doc
   slurm-webdoc
   slurm-auth-none
   pdsh-slurm

   All Slurm packages must be upgraded at the same time to avoid conflicts between packages of different versions. This can be done by adding them to the zypper install command line described above.

8. Upgrade slurmd on the compute nodes:

   management# pdsh -R ssh -P PARTITIONS \
      zypper install --force-resolution slurm_VERSION-node

   Note: Memory size seen by slurmd might change on update
   Under certain circumstances, the amount of memory seen by slurmd might change after an update. If this happens, slurmctld puts the nodes in a drained state. To check whether the amount of memory seen by slurmd changed after the update, run the following command on a single compute node:

   node1# slurmd -C

   Compare the output with the settings in slurm.conf. If required, correct the setting.

9. Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.
If you replace deprecated options in the configuration files, these configuration files can be distributed to all management servers and compute nodes in the cluster by using the method described in Step 3.
10. Restart slurmd on all compute nodes:

    management# pdsh -R ssh -P PARTITIONS systemctl start slurmd

11. Restart slurmctld on the main and backup management servers:

    management# systemctl start slurmctld

12. Check the status of the management servers. On the main and backup management servers, run the following command:

    management# systemctl status slurmctld

    Verify that the services are running without errors. Run the following command to check whether there are any down, drained, failing, or failed nodes:

    management# sinfo -R

13. Restore the original values of SlurmdTimeout and SlurmctldTimeout in /etc/slurm/slurm.conf, then copy the restored configuration to all nodes by using the method described in Step 3.
5.4 Frequently asked questions #
- 1. How do I change the state of a node from down to up?

When the slurmd daemon on a node does not reboot in the time specified in the ResumeTimeout parameter, or the ReturnToService parameter was not changed in the configuration file slurm.conf, compute nodes stay in the down state and must be set back to the up state manually. This can be done for the NODE with the following command:

management# scontrol update state=resume NodeName=NODE
- 2. What is the difference between the states down and down*?

A * shown after a status code means that the node is not responding.

When a node is marked as down*, it means that the node is not reachable because of network issues, or that slurmd is not running on that node.

In the down state, the node is reachable, but either the node was rebooted unexpectedly, the hardware does not match the description in slurm.conf, or a health check configured with the HealthCheckProgram parameter has failed.
- 3. How do I get the exact core count, socket number, and number of CPUs for a node?

To find the node values that go into the configuration file slurm.conf, run the following command:

node1# slurmd -C