Applies to SUSE Linux Enterprise High Performance Computing 15 SP3

6 Monitoring and logging

Maintaining an overview of the status and health of a cluster's compute nodes helps to ensure smooth operation. This chapter describes tools that give an administrator an overview of the current cluster status, collect system logs, and gather information on certain system failure conditions.

6.1 ConMan — the console manager

ConMan is a serial console management program designed to support many console devices and simultaneous users. It supports:

  • local serial devices

  • remote terminal servers (via the telnet protocol)

  • IPMI Serial-Over-LAN (via FreeIPMI)

  • Unix domain sockets

  • external processes (for example, using expect scripts for telnet, ssh, or ipmi-sol connections)

ConMan can be used for monitoring, logging, and optionally timestamping console device output.

To install ConMan, run zypper in conman.

Important: conmand sends unencrypted data

The conmand daemon sends unencrypted data over the network, and its connections are not authenticated. Therefore, use it only locally, with the daemon listening on localhost. However, the IPMI console does offer encryption. This makes ConMan a good tool for monitoring many such consoles.
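One way to verify the local-only setup is to inspect the listening sockets. The following is a minimal sketch and not part of ConMan. The function name check_loopback_only is hypothetical. The sketch assumes that conmand listens on TCP port 7890 (adjust this to the port configured on your system) and that the input has the column layout printed by ss -ltn.

```shell
# Hypothetical helper: reads `ss -ltn` style output on stdin and fails if any
# listener on the given port is bound to a non-loopback address.
# Note: it also succeeds when nothing listens on that port at all.
check_loopback_only() {
    port="$1"
    awk -v port="$port" '
        BEGIN { bad = 0 }
        $1 == "LISTEN" {
            addr = $4                       # local address:port column
            n = split(addr, parts, ":")
            if (parts[n] != port) next      # listener of some other service
            host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
            if (host != "127.0.0.1" && host != "[::1]") bad = 1
        }
        END { exit bad }
    '
}

# Usage (7890 is an assumed port):
#   ss -ltn | check_loopback_only 7890 || echo "conmand is reachable from the network" >&2
```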

ConMan provides expect-scripts in the directory /usr/lib/conman/exec.

Input to conman is not echoed in interactive mode. This can be changed by entering the escape sequence &E.

When pressing Enter in interactive mode, no line feed is generated. To generate a line feed, press Ctrl+L.

For more information about options, see the ConMan man page.

6.2 Ganglia — system monitoring

Ganglia is a scalable, distributed monitoring system for high-performance computing systems, such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters.

6.2.1 Using Ganglia

To use Ganglia, install ganglia-gmetad on the management server, then start the Ganglia meta-daemon: rcgmetad start. To make sure the service is started after a reboot, run: systemctl enable gmetad.

On each cluster node that you want to monitor, install ganglia-gmond, start the service with rcgmond start, and make sure it is enabled to start automatically after a reboot: systemctl enable gmond.

To test whether the gmond daemon has connected to the meta-daemon, run gstat -a and check that each node to be monitored is present in the output.
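The setup steps described above can be collected in a small script. This is a sketch under stated assumptions, not a supported tool. The helper names run, setup_gmetad, and setup_gmond are hypothetical. The services are started with systemctl start, which is assumed here to have the same effect as the rc shortcuts. By default, the script only prints the commands it would run.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0
# and run as root to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Management server: install, start, and enable the meta-daemon.
setup_gmetad() {
    run zypper --non-interactive install ganglia-gmetad
    run systemctl start gmetad     # assumed equivalent to the rc shortcut
    run systemctl enable gmetad
}

# Each monitored node: install, start, and enable gmond.
setup_gmond() {
    run zypper --non-interactive install ganglia-gmond
    run systemctl start gmond
    run systemctl enable gmond
}

# After both sides are set up, check that every node is listed:
#   gstat -a
```

Call setup_gmetad on the management server and setup_gmond on each node. Review the printed commands first, then repeat with DRY_RUN=0.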

6.2.2 Ganglia on Btrfs

When using the Btrfs file system, the monitoring data will be lost after a rollback of the service gmetad. To fix this issue, either install the package ganglia-gmetad-skip-bcheck or create the file /etc/ganglia/no_btrfs_check.
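Creating the marker file can be scripted as follows. This is a minimal sketch: the function name disable_btrfs_check is hypothetical, and it assumes that an empty file is sufficient. The optional directory argument exists only so that the function can be tried on a scratch directory first.

```shell
# Hypothetical helper: create the marker file named in the text above.
# Assumption: the file only needs to exist; its content is not used.
disable_btrfs_check() {
    confdir="${1:-/etc/ganglia}"
    mkdir -p "$confdir" && touch "$confdir/no_btrfs_check"
}

# Usage (as root, on the management server):
#   disable_btrfs_check
```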

6.2.3 Using the Ganglia Web interface

Install ganglia-web on the management server. Enable PHP in Apache2: a2enmod php7. Then start Apache2 on this machine with rcapache2 start, and make sure it is started automatically after a reboot: systemctl enable apache2. The Ganglia Web interface is accessible at http://MANAGEMENT_SERVER/ganglia.
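To confirm that the Web interface responds, you can request the page from the command line. The following sketch assumes that curl is installed. The function names ganglia_url and check_ganglia_web and the host name in the usage line are hypothetical.

```shell
# Build the Web interface URL for a given management server host name.
ganglia_url() {
    printf 'http://%s/ganglia\n' "$1"
}

# Succeeds if the page can be fetched: -f makes curl fail on HTTP errors,
# -L follows redirects, -sS is quiet but still reports errors.
check_ganglia_web() {
    curl -fsSL -o /dev/null "$(ganglia_url "$1")"
}

# Usage, with a hypothetical host name:
#   check_ganglia_web mgmt.example.com && echo "Ganglia Web is reachable"
```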

6.3 rasdaemon — utility to log RAS error tracings

rasdaemon is an RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors using EDAC (Error Detection and Correction) tracing events. EDAC drivers in the Linux kernel handle detection of ECC (Error Correction Code) errors from memory controllers.

rasdaemon can be used on large memory systems to track, record, and localize memory errors and how they evolve over time to detect hardware degradation. Furthermore, it can be used to localize a faulty DIMM on the mainboard.

To check whether the EDAC drivers are loaded, run the following command:

# ras-mc-ctl --status

The command should return ras-mc-ctl: drivers are loaded. If it indicates that the drivers are not loaded, EDAC may not be supported on your board.

To start rasdaemon, run systemctl start rasdaemon.service. To start rasdaemon automatically at boot time, run systemctl enable rasdaemon.service. The daemon logs information to /var/log/messages and to an internal database. A summary of the stored errors can be obtained with the following command:

# ras-mc-ctl --summary

The errors stored in the database can be viewed with:

# ras-mc-ctl --errors
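The status check and the service start can be combined so that rasdaemon is only started where EDAC is available. This is a sketch, not part of rasdaemon. The function names are hypothetical, and the check assumes that the status message contains the text "drivers are loaded" as quoted above.

```shell
# Succeeds if the text on stdin reports that the EDAC drivers are loaded.
edac_drivers_loaded() {
    grep -q 'drivers are loaded'
}

# Hypothetical wrapper: start and enable rasdaemon only when EDAC is available.
start_rasdaemon_if_supported() {
    if ras-mc-ctl --status 2>&1 | edac_drivers_loaded; then
        systemctl start rasdaemon.service
        systemctl enable rasdaemon.service
    else
        echo "EDAC drivers are not loaded; not starting rasdaemon" >&2
        return 1
    fi
}

# Usage (as root):
#   start_rasdaemon_if_supported && ras-mc-ctl --summary
```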

Optionally, you can load the DIMM labels silk-screened on the system board to more easily identify the faulty DIMM. To do so, before starting rasdaemon, run:

# systemctl start ras-mc-ctl.service

For this to work, you need to set up a layout description for the board. There are no descriptions supplied by default. To add a layout description, create a file with an arbitrary name in the directory /etc/ras/dimm_labels.d/. The format is:
