Applies to SUSE Linux Enterprise High Performance Computing 15 SP3

5 Slurm — utility for HPC workload management

Slurm is a workload manager for managing compute jobs on High Performance Computing clusters. It can start multiple jobs on a single node, or a single job on multiple nodes. Additional components can be used for advanced scheduling and accounting.

The mandatory components of Slurm are the control daemon slurmctld, which handles job scheduling, and the Slurm daemon slurmd, responsible for launching compute jobs. Nodes running slurmctld are called control nodes and nodes running slurmd are called compute nodes.

Additional components are a secondary slurmctld acting as a standby server for a failover, and the Slurm database daemon slurmdbd, which stores the job history and user hierarchy.

5.1 Installing Slurm

These instructions describe a minimal installation of Slurm with one control node and multiple compute nodes.

5.1.1 Minimal installation

Important
Important: Ensure consistent UIDs and GIDs for Slurm's accounts

For security reasons, Slurm does not run as the user root, but under its own user. It is important that the user slurm has the same UID/GID across all nodes of the cluster.

If this user/group does not exist, the package slurm creates this user and group when it is installed. However, this does not guarantee that the generated UIDs/GIDs will be identical on all systems.

Therefore, we strongly advise you to create the user/group slurm before installing slurm. If you are using a network directory service such as LDAP for user and group management, you can use it to provide the slurm user/group as well.
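
For example, the following commands create a system user and group with a fixed UID/GID before the packages are installed. The ID value 450 and the home directory are only examples; choose an ID that is free and identical on every node:

# groupadd -r -g 450 slurm
# useradd -r -u 450 -g slurm -d /var/lib/slurm -s /sbin/nologin slurm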

It is strongly recommended that all compute nodes share common user home directories. These should be provided through network storage.
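
For example, a shared /home exported from an NFS server could be mounted on every node with an /etc/fstab entry like the following (the server name nfs-server is a placeholder):

nfs-server:/home   /home   nfs   defaults   0 0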

Procedure 5.1: Installing the Slurm packages
  1. On the control node, install the slurm package with the command zypper in slurm.

  2. On the compute nodes, install the slurm-node package with the command zypper in slurm-node.

  3. On the control node and the compute nodes, the package munge is installed automatically. Configure, enable and start MUNGE on the control and compute nodes as described in Section 3.4, “MUNGE authentication”. Ensure that the same munge key is shared across all nodes.

Procedure 5.2: Configuring Slurm
  1. On the control node, edit the main configuration file /etc/slurm/slurm.conf:

    1. Configure the parameter ControlMachine=CONTROL_MACHINE with the host name of the control node.

      To find the correct host name, run hostname -s on the control node.

    2. Under the COMPUTE NODES section, add the following lines to define the compute nodes:

      NodeName=NODE_LIST State=UNKNOWN
      PartitionName=normal Nodes=NODE_LIST Default=YES MaxTime=24:00:00 State=UP

      Replace NODE_LIST with the host names of the compute nodes, either comma-separated or as a range (for example: node[1-100]).

      The NodeName line also allows specifying additional parameters for the nodes, such as Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, or CPUs. The actual values for a node can be obtained by running the following command on the compute nodes:

      # slurmd -C
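
      For example, if slurmd -C reports one board with two sockets, eight cores per socket, and two threads per core, the node definition could look like this (the values are illustrative):

      NodeName=node[1-100] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN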
  2. Copy the modified configuration file /etc/slurm/slurm.conf from the control node to all compute nodes:

    # scp /etc/slurm/slurm.conf COMPUTE_NODE:/etc/slurm/
  3. On the control node, start slurmctld and enable it so that it starts on every boot:

    # systemctl enable --now slurmctld.service
  4. On the compute nodes, start slurmd and enable it so that it starts on every boot:

    # systemctl enable --now slurmd.service
Procedure 5.3: Testing the Slurm installation
  1. Check the status and availability of the compute nodes by running the sinfo command. You should see output similar to the following:

    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    normal*      up 1-00:00:00      2   idle node[01-02]

    If the node state is not idle, see Section 5.4.1, “Frequently asked questions”.

  2. Test the Slurm installation by running the following command:

    # srun sleep 30

    This runs the sleep command on a free compute node for 30 seconds.

    In another shell, run the squeue command during the 30 seconds that the compute node is asleep. You should see output similar to the following:

    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        1    normal    sleep     root  R       0:05      1 node02
  3. Create the following shell script and save it as sleeper.sh:

    #!/bin/bash
    echo "started at $(date)"
    sleep 30
    echo "finished at $(date)"

    Run the shell script in the queue:

    # sbatch sleeper.sh

    The shell script is executed when enough resources are available, and the output is stored in the file slurm-${JOBNR}.out.
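
    For example, if the job was assigned ID 2, the output can be inspected after the job finishes (the job ID and time stamps are illustrative):

    # cat slurm-2.out
    started at Tue Sep 14 10:00:00 CEST 2021
    finished at Tue Sep 14 10:00:30 CEST 2021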

5.1.2 Installing the Slurm database

In a minimal installation, Slurm only stores pending and running jobs. To store finished and failed job data, the storage plugin must be installed and enabled. You can also enable fair-share scheduling, which replaces FIFO (first in, first out) scheduling with algorithms that calculate a job's priority in the queue based on the jobs the user has run in the past.
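
Fair-share scheduling is provided by the multifactor priority plugin. The following slurm.conf lines are a minimal sketch for enabling it; the decay half-life and weight shown are example values, not recommendations:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=100000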

The Slurm database has two components: the slurmdbd daemon itself, and an SQL database. MariaDB is recommended. The database can be installed on the same node that runs slurmdbd, or on a separate node. For a minimal setup, all these services run on the same node.

Procedure 5.4: Install slurmdbd
Note
Note: MariaDB

If you want to use an external SQL database (or you already have a database installed on the control node), you can skip the MariaDB steps.

  1. Install the MariaDB SQL database with zypper in mariadb.

  2. Start and enable MariaDB:

    # systemctl enable --now mariadb
  3. Secure the database with the command mysql_secure_installation.

  4. Connect to the SQL database, for example with the command mysql -u root -p.

  5. After a successful connection to the database and the creation of a secure password[1], create the Slurm user and the database with the following commands:

    create user 'slurm'@'localhost' identified by 'password';
    grant all on slurm_acct_db.* TO 'slurm'@'localhost';
    create database slurm_acct_db;

    After these steps are complete, exit the database.

  6. Install the slurmdbd package:

    # zypper in slurm-slurmdbd
  7. Edit the /etc/slurm/slurmdbd.conf file so that the daemon can access the database. Change the following line to the password that you used in Step 5:

    StoragePass=password

    If you chose another location or user for the SQL database, you must also modify the following lines:

    StorageUser=slurm
    DbdAddr=localhost
    DbdHost=localhost
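
    For reference, a minimal /etc/slurm/slurmdbd.conf for the setup described here might look like the following sketch (host names, user, and password reflect the assumptions made in this section):

    AuthType=auth/munge
    DbdHost=localhost
    SlurmUser=slurm
    StorageType=accounting_storage/mysql
    StorageHost=localhost
    StorageUser=slurm
    StoragePass=password
    StorageLoc=slurm_acct_db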
  8. Start and enable slurmdbd:

    # systemctl enable --now slurmdbd

    The first start of slurmdbd will take some time.

  9. To enable accounting, edit the /etc/slurm/slurm.conf file to add the connection between slurmctld and the slurmdbd daemon. Ensure that the following lines appear as shown:

    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=localhost
    Note

    This example assumes that slurmdbd is running on the same node as slurmctld. If not, change localhost to the host name or IP address of the node where slurmdbd is running.

  10. Ensure that a table for the cluster is added to the database. Otherwise, no accounting information can be written to the database. To add a cluster table, run sacctmgr -i add cluster CLUSTERNAME.

  11. Restart slurmctld:

    # systemctl restart slurmctld
  12. (Optional) By default, Slurm does not take any group membership into account, and the system groups cannot be mapped to Slurm. Group creation and membership must be managed via the command line tool sacctmgr. You can have a group hierarchy, and users can be part of several groups.

    The following example creates an umbrella group bavaria for two subgroups called nuremberg and munich:

    # sacctmgr add account bavaria \
    Description="umbrella group for subgroups" Organization=bavaria
    # sacctmgr add account nuremberg,munich parent=bavaria \
    Description="subgroup" Organization=bavaria

    The following example adds a user called tux to the subgroup nuremberg:

    # sacctmgr add user tux Account=nuremberg
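
    To verify the resulting hierarchy, you can list the accounts and associations, for example:

    # sacctmgr show account
    # sacctmgr show assoc format=account,user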

5.2 Slurm administration commands

5.2.1 scontrol

The command scontrol is used to show and update the entities of Slurm, such as the state of the compute nodes or compute jobs. It can also be used to reboot or to propagate configuration changes to the compute nodes.

Useful options for this command are --details, which prints more verbose output, and --oneliner, which forces the output onto a single line for easier use in shell scripts.

scontrol show ENTITY

Displays the state of the specified ENTITY.

scontrol update SPECIFICATION

Updates the given SPECIFICATION, such as the state of a compute node. Useful states that can be set for compute nodes include the following (examples are shown after this list):

nodename=NODE state=down reason=REASON

Takes the compute node out of service and aborts any jobs already running on it.

nodename=NODE state=drain reason=REASON

Drains the compute node so that no new jobs can be scheduled on it, but does not end compute jobs already running on the compute node. REASON can be any string. The compute node stays in the drained state and must be returned to the idle state manually.

nodename=NODE state=resume

Marks the compute node as ready to return to the idle state.

jobid=JOBID REQUIREMENT=VALUE

Updates the given requirement, such as NumNodes, with a new value. This command can also be run as a non-privileged user.
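
For example, to drain a compute node for maintenance, return it to service afterwards, and reduce a pending job to two nodes (the node name and job ID are illustrative):

# scontrol update nodename=node02 state=drain reason="kernel update"
# scontrol update nodename=node02 state=resume
# scontrol update jobid=42 NumNodes=2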

scontrol reconfigure

Triggers a reload of the configuration file slurm.conf on all compute nodes.

scontrol reboot NODELIST

Reboots a compute node or a group of compute nodes once the jobs running on them finish. To use this command, the option RebootProgram="/sbin/reboot" must be set in slurm.conf. If the reboot of a compute node takes more than 60 seconds, you can set a higher value in slurm.conf, such as ResumeTimeout=300.

sacctmgr

Used for job accounting in Slurm. To use this command, the service slurmdbd must be set up. See Section 5.1.2, “Installing the Slurm database”.

sinfo

Retrieves information about the state of the compute nodes, and can be used for a fast overview of the cluster health. The following command line switches are available:

--dead

Displays information about unresponsive nodes.

--long

Shows more detailed information.

--reservation

Prints information about advanced reservations.

-R

Displays the reason a node is in the down, drained, or failing state.

--state=STATE

Limits the output only to nodes with the specified STATE.
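
For example, to show detailed information about all nodes in the drained state:

# sinfo --long --state=drained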

sacct

Displays the accounting data if accounting is enabled. The following options are available:

--allusers

Shows accounting data for all users.

--accounts=NAME

Shows only jobs that ran under the specified account(s).

--starttime=MM/DD[/YY]-HH:MM[:SS]

Shows only jobs after the specified start time. You can use just MM/DD or HH:MM. If no time is given, the command defaults to 00:00, which means that only jobs from today are shown.

--endtime=MM/DD[/YY]-HH:MM[:SS]

Accepts the same options as --starttime. If no time is given, the time when the command was issued is used.

--name=NAME

Limits output to jobs with the given NAME.

--partition=PARTITION

Shows only jobs that run in the specified PARTITION.
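
For example, to show the jobs of all users that started today at 08:00 or later in the partition normal:

# sacct --allusers --starttime=08:00 --partition=normal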

sbatch, salloc and srun

These commands are used to schedule compute jobs: batch scripts with the sbatch command, interactive sessions with the salloc command, and binaries with the srun command. If the job cannot be scheduled immediately, only sbatch places it into the queue for later execution; salloc and srun block until the requested resources become available.

Frequently-used options for these commands include:

-n COUNT_TASKS

Specifies the number of tasks (parallel processes) required by the job. The tasks can be allocated on different nodes.

-N MINCOUNT_NODES[-MAXCOUNT_NODES]

Sets the number of compute nodes required for a job. The MAXCOUNT_NODES number can be omitted.

--time TIME

Specifies the maximum wall-clock runtime after which a job is terminated. The format of TIME is either minutes or [HH:]MM:SS. This is not to be confused with CPU time, which is the wall-clock time multiplied by the number of threads.

--signal [B:]NUMBER[@TIME]

Sends the signal specified by NUMBER 60 seconds before the end of the job, unless TIME is specified. The signal is sent to every process on every node. If a signal should only be sent to the controlling batch job, you must specify the B: flag.

--job-name NAME

Sets the name of the job to NAME in the queue.

--array=RANGEINDEX

Executes the given script via sbatch for indexes given by RANGEINDEX with the same parameters.

--dependency=STATE:JOBID

Defers the job until the specified STATE of the job JOBID is reached.

--gres=GRES

Runs a job only on nodes with the specified generic resource (GRes), for example a GPU, specified by the value of GRES.

--licenses=NAME[:COUNT]

The job must have the specified number (COUNT) of licenses with the name NAME. A license is the opposite of a generic resource: it is not tied to a computer, but is a cluster-wide variable.

--mem=MEMORY

Sets the real MEMORY required by a job per node. To use this option, memory control must be enabled. The default unit for the MEMORY value is megabytes, but you can also use K for kilobyte, M for megabyte, G for gigabyte, or T for terabyte.

--mem-per-cpu=MEMORY

This option takes the same values as --mem, but defines memory on a per-CPU basis rather than a per-node basis.
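
Most of these options can also be embedded in a batch script as #SBATCH directives. The following sketch requests two nodes, eight tasks, a one-hour time limit, and 2 GB of memory per CPU; all values are illustrative:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH -N 2
#SBATCH -n 8
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G
srun hostname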

5.3 Upgrading Slurm

For existing products under general support, version upgrades of Slurm are provided regularly. With some restrictions, interoperability is guaranteed between three consecutive versions.

Unlike maintenance updates, these upgrades are not installed automatically using zypper patch but require the administrator to request their installation explicitly. This ensures that these upgrades are not installed unintentionally and gives the administrator the opportunity to plan version upgrades beforehand.

In addition to the three major-version rule, obey the following rules regarding the order of updates:

  1. The version of slurmdbd must be identical to or higher than the version of slurmctld.

  2. The version of slurmctld must be identical to or higher than the version of slurmd.

  3. The version of slurmd must be identical to or higher than the version of the slurm user applications.

Or in short: version(slurmdbd) >= version(slurmctld) >= version(slurmd) >= version(Slurm user CLIs).

With each version, configuration options for slurmctld/slurmd or slurmdbd might be deprecated. While deprecated, they remain valid for this version and the two consecutive versions, but they might be removed later. Therefore, it is advisable to update the configuration files after the upgrade and replace deprecated configuration options before the final restart of a service.

Slurm uses a segmented version number: the first two segments denote the major version, and the final segment denotes the patch level. Upgrade packages (that is, packages that were not initially supplied with the module or service pack) have their major version encoded in the package name (with periods . replaced by underscores _). For example, for version 18.08, this would be slurm_18_08-*.rpm.

To upgrade the package slurm to 18.08, run the following command:

zypper install --force-resolution slurm_18_08

To upgrade Slurm subpackages, use the analogous commands.
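
For example, to upgrade the compute node package and slurmdbd to version 18.08 in the same transaction:

# zypper install --force-resolution slurm_18_08-node slurm_18_08-slurmdbd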

Important
Important

If any additional Slurm packages are installed, you must upgrade those as well. This includes:

  • slurm-pam_slurm

  • slurm-sview

  • perl-slurm

  • slurm-lua

  • slurm-torque

  • slurm-config-man

  • slurm-doc

  • slurm-webdoc

  • slurm-auth-none

  • pdsh-slurm

All Slurm packages must be upgraded at the same time to avoid conflicts between packages of different versions. This can be done by adding them to the zypper install command line described above.

Note
Note

A new major version of Slurm introduces a new version of libslurm. Older versions of this library might not work with an upgraded Slurm. An upgrade is provided for all SUSE Linux Enterprise software that depends on libslurm. Any locally-developed Slurm modules or tools might require modification or recompilation.

5.3.1 Upgrade workflow

This workflow assumes that MUNGE authentication is used and that pdsh, the pdsh Slurm plugin, and mrsh can access all of the machines in the cluster (that is, mrshd is running on all nodes in the cluster).

If this is not the case, install pdsh:

# zypper in pdsh-slurm

If mrsh is not used in the cluster, the ssh back-end for pdsh can be used instead. Replace the option -R mrsh with -R ssh in the pdsh commands below. Note that ssh is less scalable, and you might run out of usable ports.

Warning
Warning: Upgrade slurmdbd databases before other Slurm components

If slurmdbd is used, always upgrade the slurmdbd database before starting the upgrade of any other Slurm component. The same database can be connected to multiple clusters and must be upgraded before all of them.

Upgrading other Slurm components before the database can lead to data loss.

Procedure 5.5: Upgrading Slurm
  1. Upgrade the slurmdbd database daemon

    When upgrading slurmdbd, the database is converted when the new version of slurmdbd starts for the first time. If the database is big, the conversion could take several tens of minutes. During this time, the database is inaccessible.

    It is highly recommended to create a backup of the database in case an error occurs during or after the upgrade process. Without a backup, all accounting data collected in the database might be lost if an error occurs or the upgrade is rolled back. A database converted to a newer version cannot be converted back to an older version, and older versions of slurmdbd do not recognize the newer formats. To back up and upgrade slurmdbd, follow this procedure:

    1. Stop the slurmdbd service:

      # rcslurmdbd stop

      Ensure that slurmdbd is not running anymore:

      # rcslurmdbd status

      slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent Queue size is limited, however, and should therefore be monitored with sdiag.

    2. Create a backup of the slurm_acct_db database:

      # mysqldump -p slurm_acct_db > slurm_acct_db.sql

      If needed, this can be restored with the following command:

      # mysql -p slurm_acct_db < slurm_acct_db.sql
    3. In preparation for the conversion, ensure that the variable innodb_buffer_pool_size is set to a value of 128 MB or more:

      On the database server, run the following command:

      # echo  'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
      mysql --password --batch

      If the size is less than 128 MB, you can change it for the duration of the current session (in MariaDB):

      # echo 'set GLOBAL innodb_buffer_pool_size = 134217728;' | \
      mysql --password --batch

      To permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:

      # rcmysql restart
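
      The corresponding entry in /etc/my.cnf could look like this (a minimal sketch):

      [mysqld]
      innodb_buffer_pool_size = 128M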
    4. If you also need to update mariadb, run the following command:

      # zypper update mariadb

      Convert the database tables to the new version of MariaDB:

      # mysql_upgrade --user=root --password=root_db_password
    5. Install the new version of slurmdbd:

      # zypper install --force-resolution slurm_version-slurmdbd
    6. Rebuild database

      Because a conversion might take a considerable amount of time, the systemd service might time out during the conversion. Therefore, we recommend performing the migration manually by running slurmdbd from the command line in the foreground:

      # /usr/sbin/slurmdbd -D -v

      Once you see the following message, you can shut down slurmdbd by pressing Ctrl+C:

      Conversion done: success!
      Note
      Note: Convert primary slurmdbd first

      If you are using a backup slurmdbd, the conversion must be performed on the primary slurmdbd first. The backup slurmdbd only starts after the conversion is complete.

    7. Before restarting the service, remove or replace any deprecated configuration options. Check the deprecated options in the Release Notes. After this is complete, restart slurmdbd:

      # systemctl start slurmdbd
      Note
      Note: No daemonization during rebuild

      During the rebuild of the Slurm database, the database daemon does not daemonize.

  2. Update slurmctld and slurmd

    After the Slurm database is updated, the slurmctld and slurmd instances can be updated. It is recommended to update the control and compute nodes all in a single pass. If this is not feasible, the compute nodes (slurmd) can be updated on a node-by-node basis. However, the control nodes (slurmctld) must be updated first.

    1. Back up the slurmctld and slurmd configuration

      Create a backup copy of the Slurm configuration before starting the upgrade process. Since the configuration file /etc/slurm/slurm.conf should be identical across the entire cluster, it is sufficient to do so only on the main control node.

    2. Increase Timeouts

      Edit /etc/slurm/slurm.conf and set SlurmdTimeout and SlurmctldTimeout to sufficiently high values to avoid timeouts while slurmctld and slurmd are down. We recommend at least 60 minutes, and more on larger clusters.

      1. On the main control node, edit /etc/slurm/slurm.conf and set the values for the following variables to at least 3600 (one hour):

        SlurmctldTimeout=3600
        SlurmdTimeout=3600
      2. Copy /etc/slurm/slurm.conf to all nodes. The following steps require that MUNGE authentication is used in the cluster, as recommended.

        1. Obtain the list of partitions in /etc/slurm/slurm.conf.

        2. Run the following commands:

          # cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.update
          # sudo -u slurm /bin/bash -c 'cat /etc/slurm/slurm.conf.update | \
          pdsh -R mrsh -P partitions \
          "cat > /etc/slurm/slurm.conf"'
          # rm /etc/slurm/slurm.conf.update
          # scontrol reconfigure
        3. Verify that the reconfiguration took effect:

          # scontrol show config | grep Timeout
    3. Shut down any running slurmctld instances

      1. If applicable, shut down any backup controllers on the backup control nodes:

        backup: # systemctl stop slurmctld
      2. Shut down the main controller:

        master: # systemctl stop slurmctld
    4. Back up the slurmctld state files

      slurmctld maintains persistent state information. Almost every major version involves changes to the slurmctld state files. This state information is upgraded without data loss, provided that the upgrade remains within the supported version range.

      However, if a downgrade is necessary, state information from newer versions is not recognized by an older version of slurmctld and is discarded, resulting in a loss of all running and pending jobs. Therefore, back up the old state in case an update needs to be rolled back.

      1. Determine the StateSaveLocation directory:

        # scontrol show config | grep StateSaveLocation
      2. Create a backup of the content of this directory.

        If a downgrade is required, restore the content of the StateSaveLocation directory from this backup.
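
        For example, if StateSaveLocation is set to /var/lib/slurm (the path is illustrative), the backup could be created and, if necessary, restored like this:

        # tar -czf /root/slurmctld-state-backup.tar.gz -C /var/lib/slurm .
        # tar -xzf /root/slurmctld-state-backup.tar.gz -C /var/lib/slurm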

    5. Shut down slurmd on the nodes

      # pdsh -R ssh -P partitions systemctl stop slurmd
    6. Upgrade slurmctld on the control and backup nodes as well as slurmd on the compute nodes

      1. On the main and backup control nodes, run the following command:

        master: # zypper install \
        --force-resolution slurm_version
      2. On the control node, run the following command:

        master: # pdsh -R ssh -P partitions \
        zypper install --force-resolution \
        slurm_version-node
      Note
      Note: Memory size seen by slurmd might change on update

      Under certain circumstances, the amount of memory seen by slurmd might change after an update. If this happens, slurmctld puts the nodes in a drained state. To check whether the amount of memory seen by slurmd changed after the update, run the following command on a single compute node:

      node1: # slurmd -C

      Compare the output with the settings in slurm.conf. If required, correct the setting.

    7. Replace deprecated options

      If you replace deprecated options in the configuration files, these configuration files can be distributed to all controllers and nodes in the cluster by using the method described in Step 2.b.ii.B.

    8. Restart slurmd on all compute nodes

      On the main control node, run the following command:

      master: # pdsh -R ssh -P partitions \
      systemctl start slurmd

      On the main and backup control nodes, run the following command:

      master: # systemctl start slurmctld
    9. Verify that the system operates properly

      1. Check the status of the control nodes. On the main and backup control nodes, run the following command:

        # systemctl status slurmctld
      2. Verify that the services are running without errors. Run the following command to check whether there are any down, drained, failing, or failed nodes:

        # sinfo -R
    10. Clean up

      Restore the SlurmdTimeout and SlurmctldTimeout values in /etc/slurm/slurm.conf on all nodes, then run scontrol reconfigure (see Step 2.b).

A new major version of libslurm is provided with each service pack of SUSE Linux Enterprise High Performance Computing. The old version is not uninstalled on upgrade, and user-provided applications built with an old version should continue to work if the old library used is not older than the past two versions. It is strongly recommended to rebuild local applications using libslurm — such as MPI libraries with Slurm support — as early as possible. This might require updating the user applications, as new arguments might be introduced to existing functions.

5.4 Additional resources

5.4.1 Frequently asked questions

Q: 1. How do I change the state of a node from down to up?

When a compute node does not come back up within the time specified by the ResumeTimeout parameter, or ReturnToService has not been changed in the configuration file slurm.conf, the node stays in the down state and must be set back to the up state manually. This can be done for NODE with the following command:

# scontrol update state=resume NodeName=NODE
Q: 2. What is the difference between the states down and down*?

A * shown after a status code means that the node is not responding.

When a node is marked as down*, it means that the node is not reachable because of network issues, or that slurmd is not running on that node.

In the down state, the node is reachable, but either the node was rebooted unexpectedly, the hardware does not match the description in slurm.conf, or a health check configured with HealthCheckProgram has failed.

Q: 3. How do I get the exact core count, socket number, and number of CPUs for a node?

To find the node values that go into the configuration file slurm.conf, run the following command:

# slurmd -C
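
The output looks similar to the following; the values depend on the hardware:

NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=64000
UpTime=5-21:21:50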

5.4.2 External documentation

For further documentation, see the Quick Start Administrator Guide and Quick Start User Guide. There is further in-depth documentation on the Slurm documentation page.



[1] You can use the command openssl rand -base64 32 to create a secure random password.
