4 Remote administration #
High Performance Computing clusters usually consist of a small set of identical compute nodes. However, large clusters could consist of thousands of machines. This chapter describes tools to help manage the compute nodes in a cluster.
4.1 Genders — static cluster configuration database #
Genders is a static cluster configuration database used for configuration management. It allows grouping and addressing sets of nodes by attributes, and is used by a variety of tools. The Genders database is a text file that is usually replicated on each node in a cluster.
Perl, Python, Lua, C, and C++ bindings are supplied with Genders. Each package provides man pages or other documentation which describes the APIs.
4.1.1 Genders database format #
The Genders database is a plain-text file called /etc/genders. It contains a list of node names with their attributes. Each line of the database can have one of the following formats:
nodename attr[=value],attr[=value],...
nodename1,nodename2,... attr[=value],attr[=value],...
nodenames[A-B] attr[=value],attr[=value],...
Node names are listed without their domain, and are followed by any number of spaces or tabs, then the comma-separated list of attributes. Every attribute can optionally have a value. The substitution string %n can be used in an attribute value to represent the node name. Node names can be listed on multiple lines, so a node's attributes can be specified on multiple lines. However, no single node can have duplicate attributes.
The attribute list must not contain spaces, and there is no provision for continuation lines. Commas and equals characters (=) are special, and cannot appear in attribute names or values. Comments are prefixed with the hash character (#) and can appear anywhere in the file.
Ranges for node names can be specified in the form prefix[a-c,n-p,...] as an alternative to explicit lists of node names. For example, node[01-03,06] would specify node01, node02, node03, and node06.
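As a minimal sketch of the format described above, the following commands write a small Genders database to a temporary file instead of /etc/genders. The node names, attribute names, and values are hypothetical and chosen only for illustration. The syntax check at the end runs only if nodeattr is installed.

```shell
# Write a hypothetical example database to a temporary file
# (a real cluster would use /etc/genders).
GENDERS_FILE=$(mktemp)
cat > "$GENDERS_FILE" <<'EOF'
# Hypothetical example cluster
node[01-03,06] compute,rack=1
node01 scheduler
login1,login2 login,console=%n-mgmt
EOF

# Run the syntax check only if nodeattr is installed.
if command -v nodeattr >/dev/null 2>&1; then
    nodeattr -f "$GENDERS_FILE" -k
fi
```

In this sketch, node01 appears on two lines, which is allowed because the second line adds a different attribute. The %n in the console value stands for the name of each node on that line.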
4.1.2 Nodeattr usage #
The command line utility nodeattr can be used to query data in the genders file. When the genders file is replicated on all nodes, a query can be done without network access. nodeattr can be called as follows:
>
nodeattr [-q | -c | -n | -s] [-r] attr[=val]
The -q option prints the nodes with attr[=val] as a host range, for example node[01-03]. The -c, -n, or -s options give a comma-separated, newline-separated, or space-separated list of nodes with attr[=val].
If none of the formatting options are specified, nodeattr returns a zero value if the local node has the specified attribute, and non-zero otherwise. The -v option causes any value associated with the attribute to go to stdout. If a node name is specified before the attribute, the specified node is queried instead of the local node.
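The queries described above could be combined as in the following sketch. The attribute names compute and rack and the node name node01 are hypothetical. The GENDERS_FILE variable is an assumption of this example, not a nodeattr feature, and the queries are skipped if nodeattr is not installed or the file is not readable.

```shell
GENDERS_FILE=${GENDERS_FILE:-/etc/genders}
QUERIES_RAN=no

if command -v nodeattr >/dev/null 2>&1 && [ -r "$GENDERS_FILE" ]; then
    # Comma-separated list of all nodes with the attribute "compute"
    nodeattr -f "$GENDERS_FILE" -c compute

    # Space-separated list of all nodes with rack=1
    nodeattr -f "$GENDERS_FILE" -s rack=1

    # Test whether node01 has the attribute; only the exit status is used
    if nodeattr -f "$GENDERS_FILE" node01 compute; then
        echo "node01 is a compute node"
    fi

    # Print the value of the "rack" attribute of node01
    nodeattr -f "$GENDERS_FILE" -v node01 rack
    QUERIES_RAN=yes
fi
```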
To print all attributes for a particular node, run the following command:
>
nodeattr -l [node]
If no node parameter is given, all attributes of the local node are printed.
To perform a syntax check of the genders database, run the following command:
>
nodeattr [-f genders] -k
To specify an alternative database location, use the option -f.
4.2 pdsh — parallel remote shell program #
pdsh is a parallel remote shell that can be used with multiple back-ends for remote connections. It can run a command on multiple machines in parallel.
To install pdsh, run the command zypper in pdsh.
The HPC module for SUSE Linux Enterprise Server supports the back-ends ssh, mrsh, and exec. The ssh back-end is the default. Non-default login methods can be used by setting the PDSH_RCMD_TYPE environment variable, or by using the -R command argument.
When using the ssh back-end, you must use a non-interactive (passwordless) login method.
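As an illustration of selecting the back-end, the following sketch runs uptime on a set of nodes, first with the -R option and then with the PDSH_RCMD_TYPE environment variable. The host names are hypothetical. The helper function run is introduced only for this example: by default it prints each command instead of executing it, and it executes the command when DRY_RUN=0 is set.

```shell
# Hypothetical target nodes; quote the list so the shell does not
# interpret the square brackets.
TARGETS='node[01-03]'

# Helper for this example: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Select the back-end with the -R option ...
run pdsh -R ssh -w "$TARGETS" uptime

# ... or with the environment variable.
export PDSH_RCMD_TYPE=ssh
run pdsh -w "$TARGETS" uptime
```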
The mrsh back-end requires the mrshd daemon to be running on the client. The mrsh back-end does not require the use of reserved ports, so it does not suffer from port exhaustion when running commands on many machines in parallel. For information about setting up the system to use this back-end, see Section 4.5, “mrsh/mrlogin — remote login using MUNGE authentication”.
Remote machines can be specified on the command line, or pdsh can use a machines file (/etc/pdsh/machines), dsh (Dancer's shell)-style groups or netgroups. It can also target nodes based on the currently running Slurm jobs.
The different ways to select target hosts are implemented by modules. Some of these modules provide identical options to pdsh. The module loaded first wins and handles the option. Therefore, we recommend using a single method and specifying it with the -M option.
The machines file lists all target hosts, one per line. The appropriate netgroup can be selected with the -g command line option.
The following host-list plugins for pdsh are supported: machines, slurm, netgroup and dshgroup.
Each host-list plugin is provided in a separate package. This avoids
conflicts between command line options for different plugins which
happen to be identical, and helps to keep installations small and free
of unneeded dependencies. Package dependencies have been set to prevent
the installation of plugins with conflicting command options. To install one
of the plugins, run:
>
sudo zypper in pdsh-PLUGIN_NAME
For more information, see the man page pdsh.
4.3 PowerMan — centralized power control for clusters #
PowerMan can control the following remote power control devices (RPC) from a central location:
local devices connected to a serial port
RPCs listening on a TCP socket
RPCs that are accessed through an external program
The communication to RPCs is controlled by “expect”-like scripts. For a list of currently supported devices, see the configuration file /etc/powerman/powerman.conf.
To install PowerMan, run zypper in powerman.
To configure PowerMan, include the appropriate device file for your RPC (/etc/powerman/*.dev) in /etc/powerman/powerman.conf and add devices and nodes. The device “type” needs to match the “specification” name in one of the included device files. The list of “plugs” used for nodes needs to match an entry in the “plug name” list.
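A minimal sketch of such a configuration is shown below, written to a temporary file rather than to /etc/powerman/powerman.conf. It assumes a hypothetical device file /etc/powerman/example.dev whose specification is named example and whose plug name list contains 1 to 4. The device name, address, and node names are likewise placeholders. Check the directives against the powerman.conf man page before use.

```shell
# Write a hypothetical configuration to a temporary file
# (the real file is /etc/powerman/powerman.conf).
POWERMAN_CONF=$(mktemp)
cat > "$POWERMAN_CONF" <<'EOF'
# Hypothetical device file that contains: specification "example" { ... }
include "/etc/powerman/example.dev"

# device NAME TYPE ADDRESS: TYPE must match the specification name
device "rpc1" "example" "rpc1.example.com:23"

# node NODES DEVICE PLUGS: plugs must appear in the plug name list
node "node[01-04]" "rpc1" "[1-4]"
EOF
```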
After configuring PowerMan, start its service:
>
sudo systemctl start powerman.service
To start PowerMan automatically after every boot, run the following command:
>
sudo systemctl enable powerman.service
Optionally, PowerMan can connect to a remote PowerMan instance. To enable this, add the option listen to /etc/powerman/powerman.conf.
When connecting to a remote PowerMan instance, data is transferred unencrypted. Therefore, use this feature only if the network is appropriately secured.
4.4 MUNGE authentication #
MUNGE allows for secure communications between different machines that share the same secret key. The most common use case is the Slurm workload manager, which uses MUNGE to authenticate its messages. Another use case is authentication for the parallel shell mrsh.
4.4.1 Setting up MUNGE authentication #
MUNGE uses UID/GID values to uniquely identify and authenticate users, so you must ensure that users who will authenticate across a network have matching UIDs and GIDs across all nodes.
MUNGE credentials have a limited time-to-live, so you must ensure that the time is synchronized across the entire cluster.
MUNGE is installed with the command zypper in munge. This also installs further required packages. A separate munge-devel package is available to build applications that require MUNGE authentication.
When installing the munge package, a new key is generated on every system. However, the entire cluster needs to use the same MUNGE key. Therefore, you must securely copy the MUNGE key from one system to all the other nodes in the cluster. You can accomplish this by using pdsh with SSH. Ensure that the key is only readable by the munge user (permissions mask 0400).
On the server where MUNGE is installed, check the permissions, owner, and file type of the key file /etc/munge/munge.key:
>
sudo stat --format "%F %a %G %U %n" /etc/munge/munge.key
The settings should be as follows:
regular file 400 munge munge /etc/munge/munge.key
Calculate the MD5 sum of munge.key:
>
sudo md5sum /etc/munge/munge.key
Copy the key to the listed nodes using pdcp:
>
pdcp -R ssh -w NODELIST /etc/munge/munge.key /etc/munge/munge.key
Check the key settings on the remote nodes:
>
pdsh -R ssh -w NODELIST stat --format \"%F %a %G %U %n\" /etc/munge/munge.key
>
pdsh -R ssh -w NODELIST md5sum /etc/munge/munge.key
Ensure that they match the settings on the MUNGE server.
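Comparing checksums by eye becomes error-prone on larger clusters. The following sketch filters the pdsh output for nodes whose checksum differs from the local one. The function name find_key_mismatches and the NODELIST variable are assumptions of this example. It relies on pdsh prefixing each output line with the node name and a colon, and the comparison is skipped if pdsh is not installed or NODELIST is unset.

```shell
# Print the names of nodes whose checksum differs from the expected one.
# Reads pdsh output lines of the form "NODE: CHECKSUM  PATH" on stdin.
find_key_mismatches() {
    awk -v want="$1" '$2 != want { sub(/:$/, "", $1); print $1 }'
}

# NODELIST is a placeholder; the comparison is skipped if it is unset
# or if pdsh is not installed.
if command -v pdsh >/dev/null 2>&1 && [ -n "${NODELIST:-}" ]; then
    expected=$(sudo md5sum /etc/munge/munge.key | cut -d' ' -f1)
    pdsh -R ssh -w "$NODELIST" md5sum /etc/munge/munge.key |
        find_key_mismatches "$expected"
fi
```

If the command prints nothing, every node that answered reported the same checksum as the local key file.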
4.4.2 Enabling and starting MUNGE #
munged must be running on all nodes that use MUNGE authentication. If MUNGE is used for authentication across the network, it needs to run on each side of the communications link.
To start the service and ensure it is started after every reboot, run the following command on each node:
>
sudo systemctl enable --now munge.service
You can also use pdsh to run this command on multiple nodes at once.
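After the service is running, a quick local check is to create a credential and decode it again with the munge and unmunge tools that are part of MUNGE. The following is a sketch. The function name check_munge_local and its return codes are assumptions of this example.

```shell
# Return 0 if a credential can be created and decoded locally,
# 2 if the MUNGE tools are not installed, and 1 otherwise.
check_munge_local() {
    command -v munge >/dev/null 2>&1 || return 2
    command -v unmunge >/dev/null 2>&1 || return 2
    munge -n | unmunge >/dev/null 2>&1 || return 1
    return 0
}

rc=0
check_munge_local || rc=$?
case $rc in
    0) echo "MUNGE works on this node" ;;
    2) echo "MUNGE tools are not installed" ;;
    *) echo "MUNGE check failed; is munged running?" ;;
esac
```

To check that two nodes share the same key, the credential can be decoded on another node instead, for example by piping the output of munge -n to unmunge over SSH.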
4.5 mrsh/mrlogin — remote login using MUNGE authentication #
mrsh is a set of remote shell programs using the MUNGE authentication system instead of reserved ports for security.
It can be used as a drop-in replacement for rsh and rlogin.
To install mrsh, do the following:
If only the mrsh client is required (without allowing remote login to this machine), use: zypper in mrsh.
To allow logging in to a machine, the server must be installed: zypper in mrsh-server.
To get a drop-in replacement for rsh and rlogin, run: zypper in mrsh-rsh-server-compat or zypper in mrsh-rsh-compat.
To set up a cluster of machines allowing remote login from each other, first follow the instructions for setting up and starting MUNGE authentication in Section 4.4, “MUNGE authentication”. After the MUNGE service successfully starts, enable and start mrlogin on each machine on which the user will log in:
>
sudo systemctl enable mrlogind.socket mrshd.socket
>
sudo systemctl start mrlogind.socket mrshd.socket
To start mrsh support at boot, run the following commands:
>
sudo systemctl enable munge.service
>
sudo systemctl enable mrlogin.service
We do not recommend using mrsh when logged in as the user root. This is disabled by default. To enable it anyway, run the following commands:
>
echo "mrsh" | sudo tee -a /etc/securetty
>
echo "mrlogin" | sudo tee -a /etc/securetty