Jump to contentJump to page navigation: previous page [access key p]/next page [access key n]
documentation.suse.com / SUSE Linux Enterprise High Availability Documentation / Administration Guide / Configuration and administration / Monitoring clusters
Applies to SUSE Linux Enterprise High Availability 15 SP3

11 Monitoring clusters

This chapter describes how to monitor a cluster's health and view its history.

11.1 Monitoring cluster status

Hawk2 has different screens for monitoring single clusters and multiple clusters: the Status and the Dashboard screen.

11.1.1 Monitoring a single cluster

To monitor a single cluster, use the Status screen. After you have logged in to Hawk2, the Status screen is displayed by default. An icon in the upper right corner shows the cluster status at a glance. For further details, have a look at the following categories:

Errors

If errors have occurred, they are shown at the top of the page.

Resources

Shows the configured resources including their Status, Name (ID), Location (node on which they are running), and resource agent Type. From the Operations column, you can start or stop a resource, trigger several actions, or view details. Actions that can be triggered include setting the resource to maintenance mode (or removing maintenance mode), migrating it to a different node, cleaning up the resource, showing any recent events, or editing the resource.

Nodes

Shows the nodes belonging to the cluster site you are logged in to, including the nodes' Status and Name. In the Maintenance and Standby columns, you can set or remove the maintenance or standby flag for a node. The Operations column allows you to view recent events for the node or further details: for example, if a utilization, standby or maintenance attribute is set for the respective node.

Tickets

Only shown if tickets have been configured (for use with Geo clustering).

Hawk2—cluster status
Figure 11.1: Hawk2—cluster status

11.1.2 Monitoring multiple clusters

To monitor multiple clusters, use the Hawk2 Dashboard. The cluster information displayed in the Dashboard screen is stored on the server side. It is synchronized between the cluster nodes (if passwordless SSH access between the cluster nodes has been configured). For details, see Section D.2, “Configuring a passwordless SSH account”. However, the machine running Hawk2 does not even need to be part of any cluster for that purpose—it can be a separate, unrelated system.

In addition to the general Hawk2 requirements, the following prerequisites need to be fulfilled to monitor multiple clusters with Hawk2:

Prerequisites
  • All clusters to be monitored from Hawk2's Dashboard must be running SUSE Linux Enterprise High Availability 15 SP3.

  • If you did not replace the self-signed certificate for Hawk2 on every cluster node with your own certificate (or a certificate signed by an official Certificate Authority) yet, do the following: Log in to Hawk2 on every node in every cluster at least once. Verify the certificate (or add an exception in the browser to bypass the warning). Otherwise Hawk2 cannot connect to the cluster.

Procedure 11.1: Monitoring multiple clusters with the dashboard
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Dashboard.

    Hawk2 shows an overview of the resources and nodes on the current cluster site. In addition, it shows any Tickets that have been configured for use with a Geo cluster. If you need information about the icons used in this view, click Legend. To search for a resource ID, enter the name (ID) into the Search text box. To only show specific nodes, click the filter icon and select a filtering option.

    Hawk2 dashboard with one cluster site (amsterdam)
    Figure 11.2: Hawk2 dashboard with one cluster site (amsterdam)
  3. To add dashboards for multiple clusters:

    1. Click Add Cluster.

    2. Enter the Cluster name with which to identify the cluster in the Dashboard. For example, berlin.

    3. Enter the fully qualified host name of one of the nodes in the second cluster. For example, charlie.

      Image
    4. Click Add. Hawk2 will display a second tab for the newly added cluster site with an overview of its nodes and resources.

      Note
      Note: Connection error

      If instead you are prompted to log in to this node by entering a password, you probably did not connect to this node yet and have not replaced the self-signed certificate. In that case, even after entering the password, the connection will fail with the following message: Error connecting to server. Retrying every 5 seconds... '.

      To proceed, see Replacing the self-signed certificate.

  4. To view more details for a cluster site or to manage it, switch to the site's tab and click the chain icon.

    Hawk2 opens the Status view for this site in a new browser window or tab. From there, you can administer this part of the Geo cluster.

  5. To remove a cluster from the dashboard, click the x icon on the right-hand side of the cluster's details.

11.2 Verifying cluster health

You can check the health of a cluster using either Hawk2 or crmsh.

11.2.1 Verifying cluster health with Hawk2

Hawk2 provides a wizard which checks and detects issues with your cluster. After the analysis is complete, Hawk2 creates a cluster report with further details. To verify cluster health and generate the report, Hawk2 requires passwordless SSH access between the nodes. Otherwise it can only collect data from the current node. If you have set up your cluster with the bootstrap scripts, provided by the ha-cluster-bootstrap package, passwordless SSH access is already configured. In case you need to configure it manually, see Section D.2, “Configuring a passwordless SSH account”.

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Wizards.

  3. Expand the Basic category.

  4. Select the Verify health and configuration wizard.

  5. Confirm with Verify.

  6. Enter the root password for your cluster and click Apply. Hawk2 will generate the report.

11.2.2 Getting health status with crmsh

The health status of a cluster or node can be displayed with so called scripts. A script can perform different tasks—they are not targeted to health. However, for this subsection, we focus on how to get the health status.

To get all the details about the health command, use describe:

# crm script describe health

It shows a description and a list of all parameters and their default values. To execute a script, use run:

# crm script run health

If you prefer to run only one step from the suite, the describe command lists all available steps in the Steps category.

For example, the following command executes the first step of the health command. The output is stored in the health.json file for further investigation:

# crm script run health statefile='health.json'

It is also possible to run the above commands with crm cluster health.

For additional information regarding scripts, see http://crmsh.github.io/scripts/.

11.3 Viewing the cluster history

Hawk2 provides the following possibilities to view past events on the cluster (on different levels and in varying detail):

You can also view cluster history information using crmsh:

11.3.1 Viewing recent events of nodes or resources

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Status. It lists Resources and Nodes.

  3. To view recent events of a resource:

    1. Click Resources and select the respective resource.

    2. In the Operations column for the resource, click the arrow down button and select Recent events.

      Hawk2 opens a new window and displays a table view of the latest events.

  4. To view recent events of a node:

    1. Click Nodes and select the respective node.

    2. In the Operations column for the node, select Recent events.

      Hawk2 opens a new window and displays a table view of the latest events.

      Image

11.3.2 Using the history explorer for cluster reports

From the left navigation bar, select Troubleshooting › History to access the History Explorer. The History Explorer allows you to create detailed cluster reports and view transition information. It provides the following options:

Generate

Create a cluster report for a certain time. Hawk2 calls the crm report command to generate the report.

Upload

Allows you to upload crm report archives that have either been created with the crm shell directly or even on a different cluster.

After reports have been generated or uploaded, they are shown below Reports. From the list of reports, you can show a report's details, download or delete the report.

Hawk2—history explorer main view
Figure 11.3: Hawk2—history explorer main view
Procedure 11.2: Generating or uploading a cluster report
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Troubleshooting › History.

    The History Explorer screen opens in the Generate view. By default, the suggested time frame for a report is the last hour.

  3. To create a cluster report:

    1. To immediately start a report, click Generate.

    2. To modify the time frame for the report, click anywhere on the suggested time frame and select another option from the drop-down box. You can also enter a Custom start date, end date and hour, respectively. To start the report, click Generate.

      After the report has finished, it is shown below Reports.

  4. To upload a cluster report, the crm report archive must be located on a file system that you can access with Hawk2. Proceed as follows:

    1. Switch to the Upload tab.

    2. Browse for the cluster report archive and click Upload.

      After the report is uploaded, it is shown below Reports.

  5. To download or delete a report, click the respective icon next to the report in the Operations column.

  6. To view Report details in history explorer, click the report's name or select Show from the Operations column.

    Image
  7. Return to the list of reports by clicking the Reports button.

Report details in history explorer
  • Name of the report.

  • Start time of the report.

  • End time of the report.

  • Number of transitions plus time line of all transitions in the cluster that are covered by the report. To learn how to view more details for a transition, see Section 11.3.3.

  • Node events.

  • Resource events.

11.3.3 Viewing transition details in the history explorer

For each transition, the cluster saves a copy of the state which it provides as input to pacemaker-schedulerd. The path to this archive is logged. All pe-* files are generated on the Designated Coordinator (DC). As the DC can change in a cluster, there may be pe-* files from several nodes. Any pe-* files are saved snapshots of the CIB, used as input of calculations by pacemaker-schedulerd.

In Hawk2, you can display the name of each pe-* file plus the time and node on which it was created. In addition, the History Explorer can visualize the following details, based on the respective pe-* file:

Transition details in the history explorer
Details

Shows snippets of logging data that belongs to the transition. Displays the output of the following command (including the resource agents' log messages):

crm history transition peinput
Configuration

Shows the cluster configuration at the time that the pe-* file was created.

Diff

Shows the differences of configuration and status between the selected pe-* file and the following one.

Log

Shows snippets of logging data that belongs to the transition. Displays the output of the following command:

crm history transition log peinput

This includes details from the following daemons: pacemaker-schedulerd, pacemaker-controld, and pacemaker-execd.

Graph

Shows a graphical representation of the transition. If you click Graph, the calculation is simulated (exactly as done by pacemaker-schedulerd) and a graphical visualization is generated.

Procedure 11.3: Viewing transition details
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Troubleshooting › History.

    If reports have already been generated or uploaded, they are shown in the list of Reports. Otherwise generate or upload a report as described in Procedure 11.2.

  3. Click the report's name or select Show from the Operations column to open the Report details in history explorer.

  4. To access the transition details, you need to select a transition point in the transition time line that is shown below. Use the Previous and Next icons and the Zoom In and Zoom Out icons to find the transition that you are interested in.

  5. To display the name of a pe-input* file plus the time and node on which it was created, hover the mouse pointer over a transition point in the time line.

  6. To view the Transition details in the history explorer, click the transition point for which you want to know more.

  7. To show Details, Configuration, Diff, Logs or Graph, click the respective buttons to show the content described in Transition details in the history explorer.

  8. To return to the list of reports, click the Reports button.

11.3.4 Retrieving history information with crmsh

Investigating the cluster history is a complex task. To simplify this task, crmsh contains the history command with its subcommands. It is assumed SSH is configured correctly.

Each cluster moves states, migrates resources, or starts important processes. All these actions can be retrieved by subcommands of history.

By default, all history commands look at the events of the last hour. To change this time frame, use the limit subcommand. The syntax is:

# crm history
crm(live)history# limit FROM_TIME [TO_TIME]

Some valid examples include:

limit4:00pm , limit16:00

Both commands mean the same, today at 4pm.

limit2012/01/12 6pm

January 12th 2012 at 6pm

limit"Sun 5 20:46"

In the current year of the current month at Sunday the 5th at 8:46pm

Find more examples and how to create time frames at http://labix.org/python-dateutil.

The info subcommand shows all the parameters which are covered by the crm report:

crm(live)history# info
Source: live
Period: 2012-01-12 14:10:56 - end
Nodes: alice
Groups:
Resources:

To limit crm report to certain parameters view the available options with the subcommand help.

To narrow down the level of detail, use the subcommand detail with a level:

crm(live)history# detail 1

The higher the number, the more detailed your report will be. Default is 0 (zero).

After you have set above parameters, use log to show the log messages.

To display the last transition, use the following command:

crm(live)history# transition -1
INFO: fetching new logs, please wait ...

This command fetches the logs and runs dotty (from the graphviz package) to show the transition graph. The shell opens the log file which you can browse with the and cursor keys.

If you do not want to open the transition graph, use the nograph option:

crm(live)history# transition -1 nograph

11.4 Monitoring system health

To prevent a node from running out of disk space and thus being unable to manage any resources that have been assigned to it, SUSE Linux Enterprise High Availability provides a resource agent, ocf:pacemaker:SysInfo. Use it to monitor a node's health with regard to disk partitions. The SysInfo RA creates a node attribute named #health_disk which will be set to red if any of the monitored disks' free space is below a specified limit.

To define how the CRM should react in case a node's health reaches a critical state, use the global cluster option node-health-strategy.

Procedure 11.4: Configuring system health monitoring

To automatically move resources away from a node in case the node runs out of disk space, proceed as follows:

  1. Configure an ocf:pacemaker:SysInfo resource:

    primitive sysinfo ocf:pacemaker:SysInfo \
         params disks="/tmp /var"1 min_disk_free="100M"2 disk_unit="M"3 \
         op monitor interval="15s"

    1

    Which disk partitions to monitor. For example, /tmp, /usr, /var, and /dev. To specify multiple partitions as attribute values, separate them with a blank.

    Note
    Note: / file system always monitored

    You do not need to specify the root partition (/) in disks. It is always monitored by default.

    2

    The minimum free disk space required for those partitions. Optionally, you can specify the unit to use for measurement (in the example above, M for megabytes is used). If not specified, min_disk_free defaults to the unit defined in the disk_unit parameter.

    3

    The unit in which to report the disk space.

  2. To complete the resource configuration, create a clone of ocf:pacemaker:SysInfo and start it on each cluster node.

  3. Set the node-health-strategy to migrate-on-red:

    property node-health-strategy="migrate-on-red"

    In case of a #health_disk attribute set to red, the pacemaker-schedulerd adds -INF to the resources' score for that node. This will cause any resources to move away from this node. The STONITH resource will be the last one to be stopped but even if the STONITH resource is not running anymore, the node can still be fenced. Fencing has direct access to the CIB and will continue to work.

After a node's health status has turned to red, solve the issue that led to the problem. Then clear the red status to make the node eligible again for running resources. Log in to the cluster node and use one of the following methods:

  • Execute the following command:

    # crm node status-attr NODE delete #health_disk
  • Restart the cluster services on that node.

  • Reboot the node.

The node will be returned to service and can run resources again.