Applies to SUSE Linux Enterprise High Performance Computing 15 SP5

7 Monitoring and logging

Obtaining and maintaining an overview over the status and health of a cluster's compute nodes helps to ensure a smooth operation. This chapter describes tools that give an administrator an overview of the current cluster status, collect system logs, and gather information on certain system failure conditions.

7.1 ConMan — the console manager

ConMan is a serial console management program designed to support many console devices and simultaneous users. It supports:

  • local serial devices

  • remote terminal servers (via the telnet protocol)

  • IPMI Serial-Over-LAN (via FreeIPMI)

  • Unix domain sockets

  • external processes (for example, using expect scripts for telnet, ssh, or ipmi-sol connections)

ConMan can be used for monitoring, logging, and optionally timestamping console device output.

To install ConMan, run zypper in conman.

Important
Important: conmand sends unencrypted data

The daemon conmand sends unencrypted data over the network, and its connections are not authenticated. Therefore, it should only be used locally, with the daemon listening on localhost. However, the IPMI console does offer encryption, which makes ConMan a good tool for monitoring many such consoles.

ConMan provides expect-scripts in the directory /usr/lib/conman/exec.

Input to conman is not echoed in interactive mode. This can be changed by entering the escape sequence &E.

When pressing Enter in interactive mode, no line feed is generated. To generate a line feed, press Ctrl+L.

For more information about options, see the ConMan man page.
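
The following is a minimal sketch of a ConMan configuration for a single local serial console; the console name, device, and serial options are examples, not defaults, and must be adjusted to your hardware. The loopback option keeps conmand bound to the loopback interface, as recommended above (see the conman.conf man page for all options):

# /etc/conman.conf (example)
SERVER loopback=ON
CONSOLE name="node01" dev="/dev/ttyS0" seropts="115200,8n1"

Restart the conman service, then connect to the console with the conman client. Enter the escape sequence &. to close the connection:

# systemctl restart conman
# conman node01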

7.2 Monitoring HPC clusters with Prometheus and Grafana

Monitor the performance of HPC clusters using Prometheus and Grafana.

Prometheus collects metrics from exporters running on cluster nodes and stores the data in a time series database. Grafana provides data visualization dashboards for the metrics collected by Prometheus. Preconfigured dashboards are available on the Grafana website.

The following Prometheus exporters are useful for High Performance Computing:

Slurm exporter

Extracts job and job queue status metrics from the Slurm workload manager. Install this exporter on a node that has access to the Slurm command line interface.

Node exporter

Extracts hardware and kernel performance metrics directly from each compute node. Install this exporter on every compute node you want to monitor.

Important
Important: Restrict access to monitoring data

It is recommended that the monitoring data only be accessible from within a trusted environment (for example, using a login node or VPN). It should not be accessible from the internet without additional security hardening measures for access restriction, access control, and encryption.
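
For example, if the monitoring node runs firewalld and its cluster-facing interface is assigned to a zone named internal (an assumption for this sketch), the monitoring ports can be opened in that zone only, so they stay closed towards untrusted networks. Add further exporter ports, such as 8080 for the Slurm exporter on the management server, as needed:

# firewall-cmd --permanent --zone=internal --add-port=9090/tcp    # Prometheus
# firewall-cmd --permanent --zone=internal --add-port=3000/tcp    # Grafana
# firewall-cmd --permanent --zone=internal --add-port=9100/tcp    # node exporter
# firewall-cmd --reload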

7.2.1 Installing Prometheus and Grafana

Install Prometheus and Grafana on a management server, or on a separate monitoring node.

Prerequisites
  • You have an installation source for Prometheus and Grafana:

    • The packages are available from SUSE Package Hub. For information about how to enable SUSE Package Hub, see https://packagehub.suse.com/how-to-use/.

    • If you have a subscription for SUSE Manager, the packages are available from the SUSE Manager Client Tools repository.

Procedure 7.1: Installing Prometheus and Grafana

In this procedure, replace MNTRNODE with the host name or IP address of the server where Prometheus and Grafana are installed.

  1. Install the Prometheus and Grafana packages:

    monitor# zypper in golang-github-prometheus-prometheus grafana
  2. Enable and start Prometheus:

    monitor# systemctl enable --now prometheus
  3. Verify that Prometheus works:

    • In a browser, navigate to MNTRNODE:9090/config, or:

    • In a terminal, run the following command:

      > wget MNTRNODE:9090/config --output-document=-

    Either of these methods should show the default contents of the /etc/prometheus/prometheus.yml file.

  4. Enable and start Grafana:

    monitor# systemctl enable --now grafana-server
  5. Log in to the Grafana web server at MNTRNODE:3000.

    Use admin for both the user name and password, then change the password when prompted.

  6. On the left panel, select the gear icon (Configuration) and click Data Sources.

  7. Click Add data source.

  8. Find Prometheus and click Select.

  9. In the URL field, enter http://localhost:9090. The default settings for the other fields can remain unchanged.

    Important

    If Prometheus and Grafana are installed on different servers, replace localhost with the host name or IP address of the server where Prometheus is installed.

  10. Click Save & Test.

You can now configure Prometheus to collect metrics from the cluster, and add dashboards to Grafana to visualize those metrics.
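
As a quick check that Prometheus is scraping data, you can also query its HTTP API for the built-in up metric, which lists one entry per scrape target. This sketch uses the same wget pattern as above:

> wget 'http://MNTRNODE:9090/api/v1/query?query=up' --output-document=-

Each configured target should appear in the JSON response with a value of 1 when it is reachable.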

7.2.2 Monitoring cluster workloads

To monitor the status of the nodes and jobs in an HPC cluster, install the Prometheus Slurm exporter to collect workload data, then import a custom Slurm dashboard from the Grafana website to visualize the data. For more information about this dashboard, see https://grafana.com/grafana/dashboards/4323.

You must install the Slurm exporter on a node that has access to the Slurm command line interface. In the following procedure, the Slurm exporter will be installed on a management server.

Prerequisites
  • Prometheus and Grafana are installed, as described in Procedure 7.1, “Installing Prometheus and Grafana”.

  • The Slurm workload manager is installed and running, and its command line tools are available on the node where the Slurm exporter will be installed.

Procedure 7.2: Monitoring cluster workloads

In this procedure, replace MGMTSERVER with the host name or IP address of the server where the Slurm exporter is installed, and replace MNTRNODE with the host name or IP address of the server where Grafana is installed.

  1. Install the Slurm exporter:

    management# zypper in golang-github-vpenso-prometheus_slurm_exporter
  2. Enable and start the Slurm exporter:

    management# systemctl enable --now prometheus-slurm_exporter
    Important
    Important: Slurm exporter fails when GPU monitoring is enabled

    In Slurm 20.11, the Slurm exporter fails when GPU monitoring is enabled.

    This feature is disabled by default. Do not enable it for this version of Slurm.

  3. Verify that the Slurm exporter works:

    • In a browser, navigate to MGMTSERVER:8080/metrics, or:

    • In a terminal, run the following command:

      > wget MGMTSERVER:8080/metrics --output-document=-

    Either of these methods should show output similar to the following:

    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 1.9521e-05
    go_gc_duration_seconds{quantile="0.25"} 4.5717e-05
    go_gc_duration_seconds{quantile="0.5"} 7.8573e-05
    ...
  4. On the server where Prometheus is installed, edit the scrape_configs section of the /etc/prometheus/prometheus.yml file to add a job for the Slurm exporter:

      - job_name: slurm-exporter
        scrape_interval: 30s
        scrape_timeout: 30s
        static_configs:
          - targets: ['MGMTSERVER:8080']

    Set the scrape_interval and scrape_timeout to 30s to avoid overloading the server. A sketch for validating the edited file before restarting Prometheus is shown after this procedure.

  5. Restart the Prometheus service:

    monitor# systemctl restart prometheus
  6. Log in to the Grafana web server at MNTRNODE:3000.

  7. On the left panel, select the plus icon (Create) and click Import.

  8. In the Import via grafana.com field, enter the dashboard ID 4323, then click Load.

  9. From the Select a Prometheus data source drop-down box, select the Prometheus data source added in Procedure 7.1, “Installing Prometheus and Grafana”, then click Import.

  10. Review the Slurm dashboard. The data might take some time to appear.

  11. If you made any changes, click Save dashboard when prompted, optionally describe your changes, then click Save.

The Slurm dashboard is now available from the Home screen in Grafana.
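
Before restarting Prometheus after editing /etc/prometheus/prometheus.yml (step 4 above), you can optionally validate the file. This sketch assumes that the promtool utility is installed along with the Prometheus package:

monitor# promtool check config /etc/prometheus/prometheus.yml

This reports syntax errors in the configuration instead of letting the service fail on restart.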

7.2.3 Monitoring compute node performance

To monitor the performance and health of each compute node, install the Prometheus node exporter to collect performance data, then import a custom node dashboard from the Grafana website to visualize the data. For more information about this dashboard, see https://grafana.com/grafana/dashboards/405.

Prerequisites
  • Prometheus and Grafana are installed, as described in Procedure 7.1, “Installing Prometheus and Grafana”.

Procedure 7.3: Monitoring compute node performance

In this procedure, replace the example node names with the host names or IP addresses of the nodes, and replace MNTRNODE with the host name or IP address of the server where Grafana is installed.

  1. Install the node exporter on each compute node. You can do this on multiple nodes at once by running the following command:

    management# pdsh -R ssh -u root -w "NODE1,NODE2" \
    "zypper in -y golang-github-prometheus-node_exporter"
  2. Enable and start the node exporter. You can do this on multiple nodes at once by running the following command:

    management# pdsh -R ssh -u root -w "NODE1,NODE2" \
    "systemctl enable --now prometheus-node_exporter"
  3. Verify that the node exporter works:

    • In a browser, navigate to NODE1:9100/metrics, or:

    • In a terminal, run the following command:

      > wget NODE1:9100/metrics --output-document=-

    Either of these methods should show output similar to the following:

    # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 2.3937e-05
    go_gc_duration_seconds{quantile="0.25"} 3.5456e-05
    go_gc_duration_seconds{quantile="0.5"} 8.1436e-05
    ...
  4. On the server where Prometheus is installed, edit the scrape_configs section of the /etc/prometheus/prometheus.yml file to add a job for the node exporter:

      - job_name: node-exporter
        static_configs:
          - targets: ['NODE1:9100']
          - targets: ['NODE2:9100']

    Add a target for every node that has the node exporter installed.

  5. Restart the Prometheus service:

    monitor# systemctl restart prometheus
  6. Log in to the Grafana web server at MNTRNODE:3000.

  7. On the left panel, select the plus icon (Create) and click Import.

  8. In the Import via grafana.com field, enter the dashboard ID 405, then click Load.

  9. From the Select a Prometheus data source drop-down box, select the Prometheus data source added in Procedure 7.1, “Installing Prometheus and Grafana”, then click Import.

  10. Review the node dashboard. Click the node drop-down box to select the nodes you want to view. The data might take some time to appear.

  11. If you made any changes, click Save dashboard when prompted. To keep the currently selected nodes next time you access the dashboard, activate Save current variable values as dashboard default. Optionally describe your changes, then click Save.

The node dashboard is now available from the Home screen in Grafana.
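
After completing this procedure, you can confirm that the node exporter responds on all compute nodes at once by reusing the pdsh pattern from the installation steps. This sketch assumes that wget is available on the compute nodes:

management# pdsh -R ssh -u root -w "NODE1,NODE2" \
"wget -q -O - http://localhost:9100/metrics | head -n 3"

Each node should return the first few lines of its metrics output.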

7.3 rasdaemon — utility to log RAS error tracings

rasdaemon is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors using EDAC (Error Detection and Correction) tracing events. EDAC drivers in the Linux kernel handle detection of ECC (Error Correction Code) errors from memory controllers.

rasdaemon can be used on systems with large amounts of memory to track and record memory errors, and to observe how they evolve over time in order to detect hardware degradation. Furthermore, it can be used to localize a faulty DIMM on the mainboard.

To check whether the EDAC drivers are loaded, run the following command:

# ras-mc-ctl --status

The command should return ras-mc-ctl: drivers are loaded. If it indicates that the drivers are not loaded, EDAC may not be supported on your board.

To start rasdaemon, run systemctl start rasdaemon.service. To start rasdaemon automatically at boot time, run systemctl enable rasdaemon.service. The daemon logs information to /var/log/messages and to an internal database. A summary of the stored errors can be obtained with the following command:

# ras-mc-ctl --summary

The errors stored in the database can be viewed with:

# ras-mc-ctl --errors
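
On systems using systemd-journald, the messages logged by rasdaemon can also be inspected directly from the journal, for example:

# journalctl -u rasdaemon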

Optionally, you can load the DIMM labels silk-screened on the system board to more easily identify the faulty DIMM. To do so, before starting rasdaemon, run:

# systemctl start ras-mc-ctl

For this to work, you need to set up a layout description for the board. There are no descriptions supplied by default. To add a layout description, create a file with an arbitrary name in the directory /etc/ras/dimm_labels.d/. The format is:

Vendor: MOTHERBOARD-VENDOR-NAME
Model: MOTHERBOARD-MODEL-NAME
  LABEL: MC.TOP.MID.LOW
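
Here, LABEL is the DIMM label as printed on the board, MC is the memory controller number, and TOP, MID, and LOW describe the DIMM's position as reported by the EDAC driver. After creating a label description and starting the ras-mc-ctl service, you can check whether the layout and labels are recognized, for example with the following commands (see the ras-mc-ctl man page for details):

# ras-mc-ctl --layout
# ras-mc-ctl --print-labels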