6 Slurm — utility for HPC workload management #
Slurm is a workload manager for managing compute jobs on High Performance Computing clusters. It can start multiple jobs on a single node, or a single job on multiple nodes. Additional components can be used for advanced scheduling and accounting.
The mandatory components of Slurm are the control daemon slurmctld, which handles job scheduling, and the Slurm daemon slurmd, responsible for launching compute jobs. Nodes running slurmctld are called management servers, and nodes running slurmd are called compute nodes.
Additional components are a secondary slurmctld acting as a standby server for failover, and the Slurm database daemon slurmdbd, which stores the job history and user hierarchy.
For further documentation, see the Quick Start Administrator Guide and Quick Start User Guide. There is further in-depth documentation on the Slurm documentation page.
6.1 Installing Slurm #
These instructions describe a minimal installation of Slurm with one management server and multiple compute nodes.
6.1.1 Minimal installation #
For security reasons, Slurm does not run as the user root, but under its own user. It is important that the user slurm has the same UID/GID across all nodes of the cluster.
If this user/group does not exist, the package slurm creates this user and group when it is installed. However, this does not guarantee that the generated UIDs/GIDs will be identical on all systems.
Therefore, we strongly advise you to create the user/group slurm before installing slurm. If you are using a network directory service such as LDAP for user and group management, you can use it to provide the slurm user/group as well.
It is strongly recommended that all compute nodes share common user home directories. These should be provided through network storage.
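For example, the home directories could be mounted from an NFS server. The server name nfs-server and the export path /export/home below are placeholders for your site's values:

  node1# echo "nfs-server:/export/home /home nfs defaults 0 0" >> /etc/fstab
  node1# mount /home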
- On the management server, install the slurm package with the command zypper in slurm.
- On the compute nodes, install the slurm-node package with the command zypper in slurm-node.
- On the management server and the compute nodes, the package munge is installed automatically. Configure, enable and start MUNGE on the management server and compute nodes as described in Section 4.4, “MUNGE authentication”. Ensure that the same MUNGE key is shared across all nodes.
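  To verify that the key is identical everywhere, you can compare its checksum on each node. This assumes the default key location /etc/munge/munge.key:

  management# md5sum /etc/munge/munge.key
  node1# md5sum /etc/munge/munge.key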
Installing the slurm package automatically opens the TCP ports 6817, 6818, and 6819.
- On the management server, edit the main configuration file /etc/slurm/slurm.conf:
  Configure the parameter SlurmctldHost=SLURMCTLD_HOST with the host name of the management server. To find the correct host name, run hostname -s on the management server.
- Under the COMPUTE NODES section, add the following lines to define the compute nodes:

  NodeName=NODE_LIST State=UNKNOWN
  PartitionName=normal Nodes=NODE_LIST Default=YES MaxTime=24:00:00 State=UP

  Replace NODE_LIST with the host names of the compute nodes, either comma-separated or as a range (for example: node[1-100]).
  The NodeName line also allows specifying additional parameters for the nodes, such as Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, or CPUs. The actual values of these can be obtained by running the following command on the compute nodes:

  node1# slurmd -C
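  For example, for compute nodes with two sockets, eight cores per socket, and two threads per core, the node definition might look like the following sketch. The hardware values are illustrative and must match what slurmd -C reports on your nodes:

  NodeName=node[1-100] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN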
 
- Copy the modified configuration file /etc/slurm/slurm.conf from the management server to all compute nodes:

  management# scp /etc/slurm/slurm.conf COMPUTE_NODE:/etc/slurm/
- On the management server, start slurmctld and enable it so that it starts on every boot:

  management# systemctl enable --now slurmctld.service
- On each compute node, start slurmd and enable it so that it starts on every boot:

  node1# systemctl enable --now slurmd.service
- Check the status and availability of the compute nodes by running the sinfo command. You should see output similar to the following:

  PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
  normal*   up    1-00:00:00     2 idle  node[01-02]

  If the node state is not idle, see Section 6.4, “Frequently asked questions”.
- Test the Slurm installation by running the following command:

  management# srun sleep 30

  This runs the sleep command on a free compute node for 30 seconds.
  In another shell, run the squeue command during the 30 seconds that the compute node is asleep. You should see output similar to the following:

  JOBID PARTITION  NAME USER ST  TIME NODES NODELIST(REASON)
      1    normal sleep root  R  0:05     1 node02
- Create the following shell script and save it as sleeper.sh:

  #!/bin/bash
  echo "started at $(date)"
  sleep 30
  echo "finished at $(date)"

  Run the shell script in the queue:

  management# sbatch sleeper.sh

  The shell script is executed when enough resources are available, and the output is stored in the file slurm-${JOBNR}.out.
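  sbatch prints the job ID when the job is submitted, so you can check the output file once the job finishes. The job ID 2 below is illustrative:

  management# sbatch sleeper.sh
  Submitted batch job 2
  management# cat slurm-2.out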
6.1.2 Installing the Slurm database #
In a minimal installation, Slurm only stores pending and running jobs. To store information about finished and failed jobs, the storage plugin must be installed and enabled. This also allows you to enable fair-share scheduling, which replaces FIFO (first in, first out) scheduling with algorithms that calculate the priority of queued jobs based on the jobs a user has run in the past.
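Fair-share scheduling is activated later in slurm.conf through the multifactor priority plugin. The following lines are a sketch only, not part of the minimal setup described here; the half-life and weight values are arbitrary examples:

  PriorityType=priority/multifactor
  PriorityDecayHalfLife=14-0
  PriorityWeightFairshare=100000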
The Slurm database has two components: the slurmdbd daemon itself, and an SQL database. MariaDB is recommended. The database can be installed on the same node that runs slurmdbd, or on a separate node. For a minimal setup, all these services run on the management server.
Before you begin, make sure Slurm is installed as described in Section 6.1.1, “Minimal installation”.
If you want to use an external SQL database (or you already have a database installed on the management server), you can skip Step 1 and Step 2.
- Install the MariaDB SQL database:

  management# zypper install mariadb

- Start and enable MariaDB:

  management# systemctl enable --now mariadb

- Secure the database:

  management# mysql_secure_installation

- Connect to the SQL database:

  management# mysql -u root -p

- Create the Slurm database user and grant it permissions for the Slurm database, which will be created later:

  mysql> create user 'slurm'@'localhost' identified by 'PASSWORD';
  mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost';

  You can choose to use a different user name or database name. In this case, you must also change the corresponding values in the /etc/slurm/slurmdbd.conf file later.

- Exit the database:

  mysql> exit
- Install the slurmdbd package:

  management# zypper in slurm-slurmdbd
- Edit the /etc/slurm/slurmdbd.conf file so that the daemon can access the database. Change the following line to the password that you used in Step 5:

  StoragePass=password

  If the database is on a different node, or if you chose a different user name or database name, you must also modify the following lines:

  StorageUser=slurm
  StorageLoc=slurm_acct_db
  DbdAddr=localhost
  DbdHost=localhost
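  Taken together, a minimal slurmdbd.conf for the setup described here might look like the following sketch. Adjust the values if you chose different names, and see the comments in the packaged file for all options:

  AuthType=auth/munge
  DbdHost=localhost
  SlurmUser=slurm
  StorageType=accounting_storage/mysql
  StorageUser=slurm
  StoragePass=PASSWORD
  StorageLoc=slurm_acct_db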
- Start and enable slurmdbd:

  management# systemctl enable --now slurmdbd

  The first start of slurmdbd might take some time.
- To enable accounting, edit the /etc/slurm/slurm.conf file to add the connection between slurmctld and the slurmdbd daemon. Ensure that the following lines appear as shown:

  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherFrequency=30
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=localhost

  This example assumes that slurmdbd is running on the same node as slurmctld. If not, change localhost to the host name or IP address of the node where slurmdbd is running.
- Make sure slurmdbd is running before you continue:

  management# systemctl status slurmdbd

  If you restart slurmctld before slurmdbd is running, slurmctld fails because it cannot connect to the database.
- Restart slurmctld:

  management# systemctl restart slurmctld

  This creates the Slurm database and adds the cluster to the database (using the ClusterName from /etc/slurm/slurm.conf).
- (Optional) By default, Slurm does not take any group membership into account, and system groups cannot be mapped to Slurm. However, you can mimic system groups with accounts. In Slurm, accounts are usually entities billed for cluster usage, while users identify individual cluster users. Multiple users can be associated with a single account.

  The following example creates an umbrella group bavaria for two subgroups called nuremberg and munich:

  management# sacctmgr add account bavaria \
    Description="umbrella group for subgroups" Organization=bavaria
  management# sacctmgr add account nuremberg,munich parent=bavaria \
    Description="subgroup" Organization=bavaria

  The following example adds a user called tux to the subgroup nuremberg:

  management# sacctmgr add user tux Account=nuremberg
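  To verify the resulting hierarchy, you can list the associations in tree form:

  management# sacctmgr show associations tree format=account,user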
6.2 Slurm administration commands #
This section lists some useful options for common Slurm commands. For more information and a full list of options, see the man page for each command. For more Slurm commands, see https://slurm.schedmd.com/man_index.html.
  
6.2.1 scontrol #
The command scontrol is used to show and update the entities of Slurm, such as the state of the compute nodes or compute jobs. It can also be used to reboot compute nodes or to propagate configuration changes to them.

Useful options for this command are --details, which prints more verbose output, and --oneliner, which forces the output onto a single line, making it more useful in shell scripts.

For more information, see man scontrol.
   
- scontrol show ENTITY
- Displays the state of the specified ENTITY. 
- scontrol update SPECIFICATION
- Updates the SPECIFICATION, such as a compute node's state. Useful SPECIFICATION states that can be set for compute nodes include:
- nodename=NODE state=down reason=REASON
- Removes all jobs from the compute node, and aborts any jobs already running on the node.
- nodename=NODE state=drain reason=REASON
- Drains the compute node so that no new jobs can be scheduled on it, but does not end compute jobs already running on it. REASON can be any string. The compute node stays in the drained state and must be returned to the idle state manually.
- nodename=NODE state=resume
- Marks the compute node as ready to return to the idle state.
- jobid=JOBID REQUIREMENT=VALUE
- Updates the given requirement, such as NumNodes, with a new value. This command can also be run as a non-privileged user.
 
- scontrol reconfigure
- Triggers a reload of the configuration file slurm.conf on all compute nodes.
- scontrol reboot NODELIST
- Reboots a compute node, or a group of compute nodes, when the jobs running on them finish. To use this command, the option RebootProgram="/sbin/reboot" must be set in slurm.conf. If the reboot of a compute node takes more than 60 seconds, you can set a higher value in slurm.conf, such as ResumeTimeout=300.
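For example, the following sequence takes a node out of service for maintenance and returns it afterwards. The node name node01 and the reason text are placeholders:

  management# scontrol update nodename=node01 state=drain reason="disk replacement"
  management# scontrol update nodename=node01 state=resume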
6.2.2 sinfo #
The command sinfo retrieves information about the state of the compute nodes, and can be used for a fast overview of the cluster health. For more information, see man sinfo.
   
- --dead
- Displays information about unresponsive nodes. 
- --long
- Shows more detailed information. 
- --reservation
- Prints information about advanced reservations. 
- -R
- Displays the reason a node is in the down, drained, or failing state.
- --state=STATE
- Limits the output only to nodes with the specified STATE. 
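For example, the following commands show why nodes are unavailable, and list only the nodes that are currently idle:

  management# sinfo -R
  management# sinfo --state=idle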
6.2.3 sacctmgr and sacct #
These commands are used for managing accounting. For more information, see man sacctmgr and man sacct.
   
- sacctmgr
- Used for job accounting in Slurm. To use this command, the service slurmdbd must be set up. See Section 6.1.2, “Installing the Slurm database”.
- sacct
- Displays the accounting data if accounting is enabled.
- --allusers
- Shows accounting data for all users. 
- --accounts=NAME
- Shows only jobs run under the specified account(s).
- --starttime=MM/DD[/YY]-HH:MM[:SS]
- Shows only jobs after the specified start time. You can use just MM/DD or HH:MM. If no time is given, the command defaults to 00:00, which means that only jobs from today are shown.
- --endtime=MM/DD[/YY]-HH:MM[:SS]
- Accepts the same options as --starttime. If no time is given, the time when the command was issued is used.
- --name=NAME
- Limits output to jobs with the given NAME. 
- --partition=PARTITION
- Shows only jobs that run in the specified PARTITION. 
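For example, assuming accounting is enabled, the following command shows selected fields for all jobs of all users that started today:

  management# sacct --allusers --starttime=00:00 --format=jobid,jobname,user,state,elapsed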
 
6.2.4 sbatch, salloc, and srun #
These commands are used to schedule compute jobs: batch scripts with the sbatch command, interactive sessions with the salloc command, or binaries with the srun command. If a job cannot be scheduled immediately, only sbatch places it into the queue.

For more information, see man sbatch, man salloc, and man srun.
   
- -n COUNT_THREADS
- Specifies the number of threads needed by the job. The threads can be allocated on different nodes. 
- -N MINCOUNT_NODES[-MAXCOUNT_NODES]
- Sets the number of compute nodes required for a job. The MAXCOUNT_NODES number can be omitted. 
- --time TIME
- Specifies the maximum wall clock time (runtime) after which a job is terminated. The format of TIME is either seconds or [HH:]MM:SS. Not to be confused with CPU time, which is clock time multiplied by the number of threads.
- --signal [B:]NUMBER[@TIME]
- Sends the signal specified by NUMBER 60 seconds before the end of the job, unless TIME is specified. The signal is sent to every process on every node. If a signal should only be sent to the controlling batch job, you must specify the B: flag.
- --job-name NAME
- Sets the name of the job to NAME in the queue. 
- --array=RANGEINDEX
- Executes the given script via sbatch for the indexes given by RANGEINDEX, with the same parameters.
- --dependency=STATE:JOBID
- Defers the job until the specified STATE of the job JOBID is reached. 
- --gres=GRES
- Runs a job only on nodes with the specified generic resource (GRes), for example a GPU, specified by the value of GRES. 
- --licenses=NAME[:COUNT]
- The job must have the specified number (COUNT) of licenses with the name NAME. A license is the opposite of a generic resource: it is not tied to a computer, but is a cluster-wide variable. 
- --mem=MEMORY
- Sets the real MEMORY required by a job per node. To use this option, memory control must be enabled. The default unit for the MEMORY value is megabytes, but you can also use K for kilobytes, M for megabytes, G for gigabytes, or T for terabytes.
- --mem-per-cpu=MEMORY
- This option takes the same values as --mem, but defines memory on a per-CPU basis rather than a per-node basis.
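All of these options can also be embedded in a batch script as #SBATCH directives, which is often more maintainable than long sbatch command lines. A minimal sketch (the resource values are arbitrary examples, and --mem requires memory control to be enabled, as noted above):

  #!/bin/bash
  #SBATCH --job-name example-job
  #SBATCH -N 2
  #SBATCH -n 8
  #SBATCH --time 01:30:00
  #SBATCH --mem=2G
  srun hostname

Submit the script with sbatch, and the directives are applied as if they had been passed on the command line.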
6.3 Upgrading Slurm #
For existing products under general support, version upgrades of Slurm are provided regularly. Unlike maintenance updates, these upgrades are not installed automatically using zypper patch, but require you to request their installation explicitly. This ensures that these upgrades are not installed unintentionally, and gives you the opportunity to plan version upgrades beforehand.
   
zypper up is not recommended
On systems running Slurm, updating packages with zypper up is not recommended. zypper up attempts to update all installed packages to the latest version, and so might install a new major version of Slurm outside of planned Slurm upgrades.
Use zypper patch instead, which only updates packages to the latest bug fix version.
     
6.3.1 Slurm upgrade workflow #
Interoperability is guaranteed between three consecutive versions of Slurm, with the following restrictions:
- The version of slurmdbd must be identical to or higher than the version of slurmctld.
- The version of slurmctld must be identical to or higher than the version of slurmd.
- The version of slurmd must be identical to or higher than the version of the Slurm user applications.
Or in short: version(slurmdbd) >= version(slurmctld) >= version(slurmd) >= version(Slurm user CLIs).
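To see which version of each component is currently installed, for example before planning an upgrade, the daemons and the user commands print their version with the -V option:

  DBnode# slurmdbd -V
  management# slurmctld -V
  node1# slurmd -V
  node1# srun -V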
   
Slurm uses a segmented version number: the first two segments denote the major version, and the final segment denotes the patch level. Upgrade packages (that is, packages that were not initially supplied with the module or service pack) have their major version encoded in the package name (with periods . replaced by underscores _). For example, for version 23.02, this would be slurm_23_02-*.rpm. To find out the latest version of Slurm, check the SUSE Customer Center, or run zypper search -v slurm on a node.
   
With each version, configuration options for slurmctld, slurmd, or slurmdbd might be deprecated. While deprecated, they remain valid for that version and the two consecutive versions, but they might be removed later. Therefore, it is advisable to update the configuration files after the upgrade and replace deprecated configuration options before the final restart of a service.
   
A new major version of Slurm introduces a new version of libslurm. Older versions of this library might not work with an upgraded Slurm. An upgrade is provided for all SUSE Linux Enterprise software that depends on libslurm. It is strongly recommended to rebuild local applications using libslurm, such as MPI libraries with Slurm support, as early as possible. This might require updating the user applications, as new arguments might be introduced to existing functions.
   
Upgrade slurmdbd databases before other Slurm components
If slurmdbd is used, always upgrade the slurmdbd database before starting the upgrade of any other Slurm component. The same database can be connected to multiple clusters and must be upgraded before all of them.
Upgrading other Slurm components before the database can lead to data loss.
6.3.2 Upgrading the slurmdbd database daemon #
When upgrading slurmdbd, the database is converted when the new version of slurmdbd starts for the first time. If the database is big, the conversion can take several tens of minutes. During this time, the database is inaccessible.

It is highly recommended to create a backup of the database in case an error occurs during or after the upgrade process. Without a backup, all accounting data collected in the database might be lost if an error occurs or the upgrade is rolled back. A database converted to a newer version cannot be converted back to an older version, and older versions of slurmdbd do not recognize the newer formats.
   
Convert the primary slurmdbd first
If you are using a backup slurmdbd, the conversion must be performed on the primary slurmdbd first. The backup slurmdbd only starts after the conversion is complete.
    
Procedure: Upgrading the slurmdbd database daemon #

- Stop the slurmdbd service:

  DBnode# rcslurmdbd stop

  Ensure that slurmdbd is not running anymore:

  DBnode# rcslurmdbd status

  slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent Queue size is limited, however, and should therefore be monitored with sdiag.
- Create a backup of the slurm_acct_db database:

  DBnode# mysqldump -p slurm_acct_db > slurm_acct_db.sql

  If needed, this can be restored with the following command:

  DBnode# mysql -p slurm_acct_db < slurm_acct_db.sql
- During the database conversion, the variable innodb_buffer_pool_size must be set to a value of 128 MB or more. Check the current size:

  DBnode# echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
    mysql --password --batch
- If the value of innodb_buffer_pool_size is less than 128 MB, you can change it for the duration of the current session (on mariadb):

  DBnode# echo 'set GLOBAL innodb_buffer_pool_size = 134217728;' | \
    mysql --password --batch

  Alternatively, to permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:

  DBnode# rcmysql restart
- If you need to update MariaDB, run the following command:

  DBnode# zypper update mariadb

  Convert the database tables to the new version of MariaDB:

  DBnode# mysql_upgrade --user=root --password=ROOT_DB_PASSWORD
- Install the new version of slurmdbd:

  DBnode# zypper install --force-resolution slurm_VERSION-slurmdbd
- Rebuild the database. If you are using a backup slurmdbd, perform this step on the primary slurmdbd first.

  Because a conversion might take a considerable amount of time, the systemd service might time out during the conversion. Therefore, we recommend performing the migration manually by running slurmdbd from the command line in the foreground:

  DBnode# /usr/sbin/slurmdbd -D -v

  When you see the following message, you can shut down slurmdbd by pressing Ctrl–C:

  Conversion done: success!
- Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes. 
- Restart slurmdbd:

  DBnode# systemctl start slurmdbd

  Note: No daemonization during rebuild
  During the rebuild of the Slurm database, the database daemon does not daemonize.
6.3.3 Upgrading slurmctld and slurmd #
After the Slurm database is upgraded, the slurmctld and slurmd instances can be upgraded. It is recommended to update the management servers and compute nodes all at once. If this is not feasible, the compute nodes (slurmd) can be updated on a node-by-node basis. However, the management servers (slurmctld) must be updated first.
   
- Before you begin, upgrade the slurmdbd database as described in Section 6.3.2, “Upgrading the slurmdbd database daemon”. Upgrading other Slurm components before the database can lead to data loss.
- This procedure assumes that MUNGE authentication is used and that pdsh, the pdsh Slurm plugin, and mrsh can access all of the machines in the cluster. If this is not the case, install pdsh by running zypper in pdsh-slurm.

  If mrsh is not used in the cluster, the ssh back-end for pdsh can be used instead. Replace the option -R mrsh with -R ssh in the pdsh commands below. This is less scalable, and you might run out of usable ports.
Procedure: Upgrading slurmctld and slurmd #

- Back up the configuration file /etc/slurm/slurm.conf. Because this file should be identical across the entire cluster, it is sufficient to do so only on the main management server.
- On the main management server, edit /etc/slurm/slurm.conf and set SlurmdTimeout and SlurmctldTimeout to sufficiently high values to avoid timeouts while slurmctld and slurmd are down:

  SlurmctldTimeout=3600
  SlurmdTimeout=3600

  We recommend at least 60 minutes (3600), and more for larger clusters.
- Copy the updated /etc/slurm/slurm.conf from the management server to all nodes:
  Obtain the list of partitions in /etc/slurm/slurm.conf.
- Copy the updated configuration to the compute nodes:

  management# cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.update
  management# sudo -u slurm /bin/bash -c 'cat /etc/slurm/slurm.conf.update | \
    pdsh -R mrsh -P PARTITIONS "cat > /etc/slurm/slurm.conf"'
  management# rm /etc/slurm/slurm.conf.update
- Reload the configuration file on all compute nodes:

  management# scontrol reconfigure
- Verify that the reconfiguration took effect:

  management# scontrol show config | grep Timeout
 
- Shut down all running slurmctld instances, first on any backup management servers, and then on the main management server:

  management# systemctl stop slurmctld
- Back up the slurmctld state files. slurmctld maintains persistent state information. Almost every major version involves changes to the slurmctld state files. This state information is upgraded if the upgrade remains within the supported version range, and no data is lost.

  However, if a downgrade is necessary, state information from newer versions is not recognized by an older version of slurmctld and is discarded, resulting in the loss of all running and pending jobs. Therefore, back up the old state in case the update needs to be rolled back.

  Determine the StateSaveLocation directory:

  management# scontrol show config | grep StateSaveLocation
- Create a backup of the content of this directory. If a downgrade is required, restore the content of the StateSaveLocation directory from this backup.
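  For example, assuming scontrol reported /var/spool/slurm as the StateSaveLocation, the directory could be archived as follows (the paths are placeholders):

  management# tar -czf /root/slurm-statesave.tar.gz -C /var/spool/slurm .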
 
- Shut down slurmd on the compute nodes:

  management# pdsh -R ssh -P PARTITIONS systemctl stop slurmd
- Upgrade slurmctld on the main and backup management servers:

  management# zypper install --force-resolution slurm_VERSION

  Important: Upgrade all Slurm packages at the same time
  If any additional Slurm packages are installed, you must upgrade those as well. This includes:

- slurm-pam_slurm
- slurm-sview 
- perl-slurm 
- slurm-lua 
- slurm-torque 
- slurm-config-man 
- slurm-doc 
- slurm-webdoc 
- slurm-auth-none 
- pdsh-slurm 
  All Slurm packages must be upgraded at the same time to avoid conflicts between packages of different versions. This can be done by adding them to the zypper install command line described above.
- Upgrade slurmd on the compute nodes:

  management# pdsh -R ssh -P PARTITIONS \
    zypper install --force-resolution slurm_VERSION-node

  Note: Memory size seen by slurmd might change on update
  Under certain circumstances, the amount of memory seen by slurmd might change after an update. If this happens, slurmctld puts the nodes into a drained state. To check whether the amount of memory seen by slurmd changed after the update, run the following command on a single compute node:

  node1# slurmd -C

  Compare the output with the settings in slurm.conf. If required, correct the setting.
- Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes.

  If you replace deprecated options in the configuration files, the updated files can be distributed to all management servers and compute nodes in the cluster by using the method described in Step 3.
- Restart slurmd on all compute nodes:

  management# pdsh -R ssh -P PARTITIONS systemctl start slurmd
- Restart slurmctld on the main and backup management servers:

  management# systemctl start slurmctld
- Check the status of the management servers. On the main and backup management servers, run the following command:

  management# systemctl status slurmctld
- Verify that the services are running without errors. Run the following command to check whether there are any down, drained, failing, or failed nodes:

  management# sinfo -R
- Restore the original values of SlurmdTimeout and SlurmctldTimeout in /etc/slurm/slurm.conf, then copy the restored configuration to all nodes by using the method described in Step 3.
6.4 Frequently asked questions #
- 1. How do I change the state of a node from down to up?
- When the slurmd daemon on a node does not come back within the time specified in the ResumeTimeout parameter, or ReturnToService was not changed in the configuration file slurm.conf, compute nodes stay in the down state and must be set back to the up state manually. This can be done for NODE with the following command:

  management# scontrol update state=resume NodeName=NODE
- 2. What is the difference between the states down and down*?
- A * shown after a status code means that the node is not responding.

  When a node is marked as down*, the node is not reachable because of network issues, or slurmd is not running on that node.

  In the down state, the node is reachable, but either the node was rebooted unexpectedly, the hardware does not match the description in slurm.conf, or a health check configured with the HealthCheckProgram failed.
- 3. How do I get the exact core count, socket number, and number of CPUs for a node?
- To find the node values that go into the configuration file slurm.conf, run the following command:

  node1# slurmd -C