Jump to contentJump to page navigation: previous page [access key p]/next page [access key n]
documentation.suse.com / Administration Guide / Remote administration
Applies to SUSE Linux Enterprise High Performance Computing 15 SP4

3 Remote administration

High Performance Computing clusters usually consist of a small set of identical compute nodes. However, large clusters could consist of thousands of machines. This chapter describes tools to help manage the compute nodes in a cluster.

3.1 Genders — static cluster configuration database

Genders is a static cluster configuration database used for configuration management. It allows grouping and addressing sets of nodes by attributes, and is used by a variety of tools. The Genders database is a text file that is usually replicated on each node in a cluster.

Perl, Python, Lua, C, and C++ bindings are supplied with Genders. Each package provides man pages or other documentation which describes the APIs.

3.1.1 Genders database format

The Genders database in SUSE Linux Enterprise High Performance Computing is a plain-text file called /etc/genders. It contains a list of node names with their attributes. Each line of the database can have one of the following formats.

nodename                attr[=value],attr[=value],...
nodename1,nodename2,... attr[=value],attr[=value],...
nodenames[A-B]          attr[=value],attr[=value],...

Node names are listed without their domain, and are followed by any number of spaces or tabs, then the comma-separated list of attributes. Every attribute can optionally have a value. The substitution string %n can be used in an attribute value to represent the node name. Node names can be listed on multiple lines, so a node's attributes can be specified on multiple lines. However, no single node can have duplicate attributes.

The attribute list must not contain spaces, and there is no provision for continuation lines. Commas and equals characters (=) are special, and cannot appear in attribute names or values. Comments are prefixed with the hash character (#) and can appear anywhere in the file.

Ranges for node names can be specified in the form prefix[a-c,n-p,...] as an alternative to explicit lists of node names. For example, node[01-03,06] would specify node01, node02, node03, and node06.

3.1.2 Nodeattr usage

The command line utility nodeattr can be used to query data in the genders file. When the genders file is replicated on all nodes, a query can be done without network access. The genders file can be called as follows:

> nodeattr [-q | -n | -s] [-r] attr[=val]

-q is the default option and prints a list of nodes with attr[=val].

The -c or -s options give a comma-separated or space-separated list of nodes with attr[=val].

If none of the formatting options are specified, nodeattr returns a zero value if the local node has the specified attribute, and non-zero otherwise. The -v option causes any value associated with the attribute to go to stdout. If a node name is specified before the attribute, the specified node is queried instead of the local node.

To print all attributes for a particular node, run the following command:

> nodeattr -l [node]

If no node parameter is given, all attributes of the local node are printed.

To perform a syntax check of the genders database, run the following command:

> nodeattr [-f genders] -k

To specify an alternative database location, use the option -f.

3.2 pdsh — parallel remote shell program

pdsh is a parallel remote shell that can be used with multiple back-ends for remote connections. It can run a command on multiple machines in parallel.

To install pdsh, run the command zypper in pdsh.

In SUSE Linux Enterprise High Performance Computing, the back-ends ssh, mrsh, and exec are supported. The ssh back-end is the default. Non-default login methods can be used by setting the PDSH_RCMD_TYPE environment variable, or by using the -R command argument.

When using the ssh back-end, you must use a non-interactive (passwordless) login method.

The mrsh back-end requires the mrshd daemon to be running on the client. The mrsh back-end does not require the use of reserved sockets, so it does not suffer from port exhaustion when running commands on many machines in parallel. For information about setting up the system to use this back-end, see Section 3.5, “mrsh/mrlogin — remote login using MUNGE authentication”.

Remote machines can be specified on the command line, or pdsh can use a machines file (/etc/pdsh/machines), dsh (Dancer's shell)-style groups or netgroups. It can also target nodes based on the currently running Slurm jobs.

The different ways to select target hosts are realized by modules. Some of these modules provide identical options to pdsh. The module loaded first will win and handle the option. Therefore, we recommended using a single method and specifying this with the -M option.

The machines file lists all target hosts, one per line. The appropriate netgroup can be selected with the -g command line option.

The following host-list plugins for pdsh are supported: machines, slurm, netgroup and dshgroup. Each host-list plugin is provided in a separate package. This avoids conflicts between command line options for different plugins which happen to be identical, and helps to keep installations small and free of unneeded dependencies. Package dependencies have been set to prevent the installation of plugins with conflicting command options. To install one of the plugins, run:

> sudo zypper in pdsh-PLUGIN_NAME

For more information, see the man page pdsh.

3.3 PowerMan — centralized power control for clusters

PowerMan can control the following remote power control devices (RPC) from a central location:

  • local devices connected to a serial port

  • RPCs listening on a TCP socket

  • RPCs that are accessed through an external program

The communication to RPCs is controlled by expect-like scripts. For a list of currently supported devices, see the configuration file /etc/powerman/powerman.conf.

To install PowerMan, run zypper in powerman.

To configure PowerMan, include the appropriate device file for your RPC (/etc/powerman/*.dev) in /etc/powerman/powerman.conf and add devices and nodes. The device type needs to match the specification name in one of the included device files. The list of plugs used for nodes needs to match an entry in the plug name list.

After configuring PowerMan, start its service:

> sudo systemctl start powerman.service

To start PowerMan automatically after every boot, run the following command:

> sudo systemctl enable powerman.service

Optionally, PowerMan can connect to a remote PowerMan instance. To enable this, add the option listen to /etc/powerman/powerman.conf.

Important
Important: Unencrypted transfer

When connecting to a remote PowerMan instance, data is transferred unencrypted. Therefore, use this feature only if the network is appropriately secured.

3.4 MUNGE authentication

MUNGE allows for secure communications between different machines that share the same secret key. The most common use case is the Slurm workload manager, which uses MUNGE for the encryption of its messages. Another use case is authentication for the parallel shell mrsh.

3.4.1 Setting up MUNGE authentication

MUNGE uses UID/GID values to uniquely identify and authenticate users, so you must ensure that users who will authenticate across a network have matching UIDs and GIDs across all nodes.

MUNGE credentials have a limited time-to-live, so you must ensure that the time is synchronized across the entire cluster.

MUNGE is installed with the command zypper in munge. This also installs further required packages. A separate munge-devel package is available to build applications that require MUNGE authentication.

When installing the munge package, a new key is generated on every system. However, the entire cluster needs to use the same MUNGE key. Therefore, you must securely copy the MUNGE key from one system to all the other nodes in the cluster. You can accomplish this by using pdsh with SSH. Ensure that the key is only readable by the munge user (permissions mask 0400).

Procedure 3.1: Setting up MUNGE authentication
  1. On the server where MUNGE is installed, check the permissions, owner, and file type of the key file /etc/munge/munge.key:

    > sudo stat --format "%F %a %G %U %n" /etc/munge/munge.key

    The settings should be as follows:

    400 regular file munge munge /etc/munge/munge.key
  2. Calculate the MD5 sum of munge.key:

    > sudo md5sum /etc/munge/munge.key
  3. Copy the key to the listed nodes using pdcp:

    > pdcp -R ssh -w NODELIST /etc/munge/munge.key /etc/munge/munge.key
  4. Check the key settings on the remote nodes:

    > pdsh -R ssh -w HOSTLIST stat --format \"%F %a %G %U %n\" /etc/munge/munge.key
    > pdsh -R ssh -w HOSTLIST md5sum /etc/munge/munge.key

    Ensure that they match the settings on the MUNGE server.

3.4.2 Enabling and starting MUNGE

munged must be running on all nodes that use MUNGE authentication. If MUNGE is used for authentication across the network, it needs to run on each side of the communications link.

To start the service and ensure it is started after every reboot, run the following command on each node:

> sudo systemctl enable --now munge.service

You can also use pdsh to run this command on multiple nodes at once.

3.5 mrsh/mrlogin — remote login using MUNGE authentication

mrsh is a set of remote shell programs using the MUNGE authentication system instead of reserved ports for security.

It can be used as a drop-in replacement for rsh and rlogin.

To install mrsh, do the following:

  • If only the mrsh client is required (without allowing remote login to this machine), use: zypper in mrsh.

  • To allow logging in to a machine, the server must be installed: zypper in mrsh-server.

  • To get a drop-in replacement for rsh and rlogin, run: zypper in mrsh-rsh-server-compat or zypper in mrsh-rsh-compat.

To set up a cluster of machines allowing remote login from each other, first follow the instructions for setting up and starting MUNGE authentication in Section 3.4, “MUNGE authentication”. After the MUNGE service successfully starts, enable and start mrlogin on each machine on which the user will log in:

> sudo systemctl enable mrlogind.socket mrshd.socket
> sudo systemctl start mrlogind.socket mrshd.socket

To start mrsh support at boot, run the following command:

> sudo systemctl enable munge.service
> sudo systemctl enable mrlogin.service

We do not recommend using mrsh when logged in as the user root. This is disabled by default. To enable it anyway, run the following command:

> sudo echo "mrsh" >> /etc/securetty
> sudo echo "mrlogin" >> /etc/securetty