Applies to SUSE Linux Enterprise High Availability 15 SP4

28 Executing maintenance tasks

To perform maintenance tasks on a cluster node, you might need to stop the resources running on that node, to move them, or to shut down or reboot the node. It might also be necessary to temporarily take over the control of resources from the cluster, or even to stop the cluster services while resources remain running.

This chapter explains how to manually take down a cluster node without negative side effects. It also gives an overview of the different options the cluster stack provides for executing maintenance tasks.

28.1 Preparing and finishing maintenance work

Use the following commands to start, stop, or view the status of the cluster:

crm cluster start [--all]: Start the cluster services on one node or all nodes
crm cluster stop [--all]: Stop the cluster services on one node or all nodes
crm cluster restart [--all]: Restart the cluster services on one node or all nodes
crm cluster status: View the status of the cluster stack

Execute the above commands as user root, or as a user with the required privileges.

When you shut down or reboot a cluster node (or stop the cluster services on a node), the following processes will be triggered:

The resources that are running on the node will be stopped or moved off the node.
If stopping a resource fails or times out, the STONITH mechanism will fence the node and shut it down.

Warning: Risk of data loss

If you need to do testing or maintenance work, follow the general steps below.

Otherwise, you risk unwanted side effects, like resources not starting in an orderly fashion, unsynchronized CIBs across the cluster nodes, or even data loss.

Before you start, choose the appropriate option from Section 28.2, “Different options for maintenance tasks”.
Apply this option with Hawk2 or crmsh.
Execute your maintenance task or tests.
After you have finished, put the resource, node or cluster back to “normal” operation.

28.2 Different options for maintenance tasks

Pacemaker offers the following options for performing system maintenance:

Putting the cluster into maintenance mode

The global cluster property maintenance-mode puts all resources into maintenance state at once. The cluster stops monitoring them and becomes oblivious to their status. Note that only the resource management by Pacemaker is disabled. Corosync and SBD are still functional. Use maintenance mode for any tasks involving cluster resources. For any tasks involving infrastructure, such as storage or networking, the safest method is to stop the cluster services completely. See Stopping the cluster services for the whole cluster.

Stopping the cluster services for the whole cluster

Stopping the cluster services on all nodes at once allows you to shut down a cluster while avoiding the mass migration of resources that would happen if you shut down each node one by one. Because there are no nodes to migrate to, all resources will be stopped.

Putting a node into maintenance mode

This option allows you to put all resources running on a specific node into maintenance state at once. The cluster will cease monitoring them and thus become oblivious to their status.

Putting a node into standby mode

A node that is in standby mode can no longer run resources. Any resources running on the node will be moved away or stopped (if no other node is eligible to run the resource). Also, all monitoring operations will be stopped on the node (except for those with role="Stopped").

Use this option if you need to stop a node in a cluster while the services it was running continue to be provided by another node.

Stopping the cluster services on a node

This option stops all of the cluster services on a single node. Any resources running on the node will be moved away or stopped (if no other node is eligible to run the resource). If stopping a resource fails or times out, the node will be fenced.

Putting a resource into maintenance mode

When this mode is enabled for a resource, no monitoring operations will be triggered for the resource.

Use this option if you need to work manually on the service that is managed by this resource and do not want the cluster to run any monitoring operations for the resource during that time.

Putting a resource into unmanaged mode

The is-managed meta attribute allows you to temporarily “release” a resource from being managed by the cluster stack. This means you can manually touch the service that is managed by this resource (for example, to adjust any components). However, the cluster will continue to monitor the resource and to report any failures.

If you want the cluster to also cease monitoring the resource, use the per-resource maintenance mode instead (see Putting a resource into maintenance mode).
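
As a quick reference, the options above map to the crm shell commands described in the following sections. The sketch below only prints the matching pair of commands and does not run anything. The function name `maintenance_commands` and the option keywords are made up for this illustration; the printed commands are the ones documented in this chapter.

```shell
# Print the crm commands that enable and disable a maintenance option.
# The keywords (cluster, node, ...) are hypothetical labels for this sketch.
maintenance_commands() {
  case "$1" in
    cluster)
      echo "crm maintenance on"
      echo "crm maintenance off" ;;
    cluster-stop)
      echo "crm cluster stop --all"
      echo "crm cluster start --all" ;;
    node)
      echo "crm node maintenance NODENAME"
      echo "crm node ready NODENAME" ;;
    standby)
      echo "crm node standby NODENAME"
      echo "crm node online NODENAME" ;;
    node-stop)
      echo "crm cluster stop"
      echo "crm cluster start" ;;
    resource)
      echo "crm resource maintenance RESOURCE_ID true"
      echo "crm resource maintenance RESOURCE_ID false" ;;
    unmanaged)
      echo "crm resource unmanage RESOURCE_ID"
      echo "crm resource manage RESOURCE_ID" ;;
    *)
      echo "unknown option: $1" >&2
      return 1 ;;
  esac
}
```

For example, `maintenance_commands standby` prints the command for entering standby mode followed by the command for leaving it.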

28.3 Putting the cluster into maintenance mode

Warning: Maintenance mode only disables Pacemaker

When putting a cluster into maintenance mode, only the resource management by Pacemaker is disabled. Corosync and SBD are still functional. Depending on your maintenance tasks, this might lead to fence operations.

Use maintenance mode for any tasks involving cluster resources. For any tasks involving infrastructure, such as storage or networking, the safest method is to stop the cluster services completely. See Section 28.4, “Stopping the cluster services for the whole cluster”.

To put the cluster into maintenance mode on the crm shell, use the following command:

# crm maintenance on

To put the cluster back to normal mode after your maintenance work is done, use the following command:

# crm maintenance off
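
To avoid leaving the cluster in maintenance mode by accident, the two commands can be wrapped around the maintenance task. The following is a minimal sketch, not an official tool: the function name `with_cluster_maintenance` is hypothetical, and the task is whatever command you pass as arguments.

```shell
# Run a command while the whole cluster is in maintenance mode.
# Hypothetical helper; only the two crm calls come from this section.
with_cluster_maintenance() {
  crm maintenance on || return 1   # stop here if the mode cannot be set
  "$@"                             # the maintenance task
  task_rc=$?
  crm maintenance off || return 1  # hand control back to Pacemaker
  return "$task_rc"                # report the result of the task
}
```

A call such as `with_cluster_maintenance ./my-task.sh` (where `my-task.sh` is a placeholder) switches maintenance mode off again even if the task fails, and returns the task's exit status.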

Procedure 28.1: Putting the cluster into maintenance mode with Hawk2

Start a Web browser and log in to the cluster as described in Section 5.4.2, “Logging in”.
In the left navigation bar, select Configuration › Cluster Configuration.
Select the maintenance-mode attribute from the empty drop-down box.
From the maintenance-mode drop-down box, select Yes.
Click Apply.
After you have finished the maintenance task for the whole cluster, select No from the maintenance-mode drop-down box, then click Apply.
From this point on, High Availability will take over cluster management again.

28.4 Stopping the cluster services for the whole cluster

To stop the cluster services on all nodes at once, use the following command:

# crm cluster stop --all

To start the cluster services again after your maintenance work is done, use the following command:

# crm cluster start --all

Warning: Graceful shutdown not guaranteed

The --all option alone does not guarantee a graceful shutdown of the cluster. If a resource fails to stop at the application level, this might trigger unexpected fencing. If applications are critical, consider stopping them before stopping the cluster services for the whole cluster.
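
One way to follow this advice is to stop the critical resources through the cluster first and only then stop the cluster services. The sketch below assumes that the applications are configured as cluster resources and uses `crm resource stop` for them. The function name and the abort-on-failure behavior are choices made for this illustration.

```shell
# Stop the listed resources first, then stop the cluster services everywhere.
# Hypothetical helper; pass the IDs of the critical resources as arguments.
stop_cluster_after_resources() {
  for rsc in "$@"; do
    # Abort before touching the cluster services if a stop request fails.
    crm resource stop "$rsc" || return 1
  done
  crm cluster stop --all
}
```

The stop request may return before the application has fully stopped, so consider checking `crm status` before the final step. Resources stopped this way are expected to stay stopped after the cluster services start again, until you start them with `crm resource start`.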

28.5 Putting a node into maintenance mode

To put a node into maintenance mode on the crm shell, use the following command:

# crm node maintenance NODENAME

To put the node back to normal mode after your maintenance work is done, use the following command:

# crm node ready NODENAME
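
Note that the two commands are not a symmetric on/off pair: `maintenance` enters the mode and `ready` leaves it. The following sketch hides that behind a single switch and accepts several node names. The function name `set_node_maintenance` and its `on`/`off` keywords are hypothetical.

```shell
# Switch maintenance mode on or off for one or more nodes.
# Hypothetical helper: "on" maps to `crm node maintenance`,
# "off" maps to `crm node ready`.
set_node_maintenance() {
  case "$1" in
    on)  action=maintenance ;;
    off) action=ready ;;
    *)   echo "usage: set_node_maintenance on|off NODENAME..." >&2
         return 2 ;;
  esac
  shift
  for node in "$@"; do
    # Stop at the first node for which the command fails.
    crm node "$action" "$node" || return 1
  done
}
```

For example, `set_node_maintenance on alice bob` runs `crm node maintenance` for both nodes (the node names are placeholders).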

Procedure 28.2: Putting a node into maintenance mode with Hawk2

Start a Web browser and log in to the cluster as described in Section 5.4.2, “Logging in”.
In the left navigation bar, select Cluster Status.
In one of the individual nodes' views, click the wrench icon next to the node and select Maintenance.
After you have finished your maintenance task, click the wrench icon next to the node and select Ready.

28.6 Putting a node into standby mode

To put a node into standby mode on the crm shell, use the following command:

# crm node standby NODENAME

To bring the node back online after your maintenance work is done, use the following command:

# crm node online NODENAME
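
Because the remaining nodes keep providing the services, standby mode lends itself to working through the nodes one at a time. The following sketch shows that pattern under stated assumptions: the function name `rolling_standby` is hypothetical, and the task is a command of your own that receives the node name as its only argument.

```shell
# Put each node into standby, run a task for it, and bring it back online.
# Hypothetical helper: the first argument is the task command, the
# remaining arguments are node names.
rolling_standby() {
  task=$1
  shift
  for node in "$@"; do
    crm node standby "$node" || return 1
    if ! "$task" "$node"; then
      # Leave the node in standby so the problem can be inspected.
      echo "task failed on $node; node left in standby" >&2
      return 1
    fi
    crm node online "$node" || return 1
  done
}
```

The sketch does not wait for resources to finish moving off the node. If the task depends on that, check `crm status` at the start of the task.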

Procedure 28.3: Putting a node into standby mode with Hawk2

Start a Web browser and log in to the cluster as described in Section 5.4.2, “Logging in”.
In the left navigation bar, select Cluster Status.
In one of the individual nodes' views, click the wrench icon next to the node and select Standby.
Finish the maintenance task for the node.
To deactivate the standby mode, click the wrench icon next to the node and select Ready.

28.7 Stopping the cluster services on a node

You can move the services off the node in an orderly fashion before shutting down or rebooting the node. This allows services to migrate off the node without being limited by the shutdown timeout of the cluster services.

Procedure 28.4: Manually rebooting a cluster node

On the node you want to reboot or shut down, log in as root or equivalent.
Put the node into standby mode:
```
# crm node standby
```
By default, the node will remain in standby mode after rebooting. Alternatively, you can set the node to come back online automatically with crm node standby reboot.
Check the cluster status:
```
# crm status
```
The output shows the node in standby mode:
```
[...]
Node bob: standby
[...]
```
Stop the cluster services on that node:
```
# crm cluster stop
```
Reboot the node.

To check if the node joins the cluster again:

After the node reboots, log in to it again.
Check if the cluster services have started:
```
# crm cluster status
```
This might take some time. If the cluster services do not start again on their own, start them manually:
```
# crm cluster start
```
Check the cluster status:
```
# crm status
```
If the node is still in standby mode, bring it back online:
```
# crm node online
```
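
The two halves of this procedure can be sketched as a pair of shell functions to run on the node itself. Both function names are hypothetical, and the reboot is deliberately left out. The check with `systemctl is-active` rests on the assumption that an active `pacemaker` unit means the cluster services are running.

```shell
# Before the reboot: drain the local node and stop its cluster services.
# Hypothetical helper; the reboot itself is left to the administrator.
prepare_node_reboot() {
  crm node standby || return 1
  crm status            # review the output: the node should show as standby
  crm cluster stop
}

# After the reboot: make sure the services run, then leave standby mode.
rejoin_after_reboot() {
  # Assumption: an active pacemaker unit means the cluster services are up.
  if ! systemctl is-active --quiet pacemaker; then
    crm cluster start || return 1
  fi
  crm status
  # Assumed to be harmless if the node is already online.
  crm node online
}
```

Because the services might take some time to start on their own, `rejoin_after_reboot` can end up starting them manually a little early. Wait before calling it if you prefer the automatic start.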

28.8 Putting a resource into maintenance mode

To put a resource into maintenance mode on the crm shell, use the following command:

# crm resource maintenance RESOURCE_ID true

To put the resource back into normal mode after your maintenance work is done, use the following command:

# crm resource maintenance RESOURCE_ID false
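
When the manual work is scripted, an exit trap is one way to make sure the resource is handed back to the cluster whether the work succeeds or fails. This is a sketch under stated assumptions: the function name `with_resource_maintenance` is hypothetical, and the work is whatever command you pass after the resource ID.

```shell
# Run a command while one resource is in maintenance mode.
# Hypothetical helper. The function body is a subshell, so the EXIT trap
# fires when the subshell ends, whether the command succeeded or not.
with_resource_maintenance() (
  rsc=$1
  shift
  crm resource maintenance "$rsc" true || exit 1
  trap 'crm resource maintenance "$rsc" false' EXIT
  "$@"
)
```

The function returns the exit status of the command it ran, so a caller can still react to a failed task after maintenance mode has been switched off.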

Procedure 28.5: Putting a resource into maintenance mode with Hawk2

Start a Web browser and log in to the cluster as described in Section 5.4.2, “Logging in”.
In the left navigation bar, select Resources.
Select the resource you want to put in maintenance mode or unmanaged mode, click the wrench icon next to the resource and select Edit Resource.
Open the Meta Attributes category.
From the empty drop-down list, select the maintenance attribute and click the plus icon to add it.
Activate the check box next to maintenance to set the maintenance attribute to yes.
Confirm your changes.
After you have finished the maintenance task for that resource, deactivate the check box next to the maintenance attribute for that resource.
From this point on, the resource will be managed by the High Availability software again.

28.9 Putting a resource into unmanaged mode

To put a resource into unmanaged mode on the crm shell, use the following command:

# crm resource unmanage RESOURCE_ID

To put it into managed mode again after your maintenance work is done, use the following command:

# crm resource manage RESOURCE_ID
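
The choice between unmanaged mode and per-resource maintenance mode comes down to whether the cluster should keep monitoring the resource. The following sketch makes that choice explicit. The function names and the `monitored`/`unmonitored` keywords are hypothetical; the crm commands are the ones from this section and from Section 28.8.

```shell
# Release a resource from cluster control, with or without monitoring.
# Hypothetical helper: "monitored" uses unmanaged mode, "unmonitored"
# uses the per-resource maintenance mode instead.
release_resource() {
  case "$2" in
    monitored)   crm resource unmanage "$1" ;;
    unmonitored) crm resource maintenance "$1" true ;;
    *) echo "usage: release_resource RESOURCE_ID monitored|unmonitored" >&2
       return 2 ;;
  esac
}

# Give the resource back to the cluster, using the matching command.
reclaim_resource() {
  case "$2" in
    monitored)   crm resource manage "$1" ;;
    unmonitored) crm resource maintenance "$1" false ;;
    *) echo "usage: reclaim_resource RESOURCE_ID monitored|unmonitored" >&2
       return 2 ;;
  esac
}
```

Call `reclaim_resource` with the same keyword that was used for `release_resource`, so that the mode that was set is the one that gets cleared.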

Procedure 28.6: Putting a resource into unmanaged mode with Hawk2

Start a Web browser and log in to the cluster as described in Section 5.4.2, “Logging in”.
From the left navigation bar, select Status and go to the Resources list.
In the Operations column, click the arrow down icon next to the resource you want to modify and select Edit.
The resource configuration screen opens.
Below Meta Attributes, select the is-managed entry from the empty drop-down box.
Set its value to No and click Apply.
After you have finished your maintenance task, set is-managed to Yes (which is the default value) and apply your changes.
From this point on, the resource will be managed by the High Availability software again.

28.10 Rebooting a cluster node while in maintenance mode

Note: Implications

If the cluster or a node is in maintenance mode, you can use tools external to the cluster stack (for example, systemctl) to manually operate the components that are managed by the cluster as resources. The High Availability software will not monitor them or attempt to restart them.

If you stop the cluster services on a node while the cluster or node is in maintenance mode, all daemons and processes that were originally started as Pacemaker-managed cluster resources will continue to run.

If you attempt to start cluster services on a node while the cluster or node is in maintenance mode, Pacemaker will initiate a single one-shot monitor operation (a “probe”) for every resource to evaluate which resources are currently running on that node. However, it will take no further action other than determining the resources' status.

Procedure 28.7: Rebooting a cluster node while the cluster or node is in maintenance mode

On the node you want to reboot or shut down, log in as root or equivalent.
If you have a DLM resource (or other resources depending on DLM), make sure to explicitly stop those resources before stopping the cluster services:
```
crm(live)resource# stop RESOURCE_ID
```
The reason is that stopping Pacemaker also stops the Corosync service, and DLM depends on Corosync's membership and messaging services. If Corosync stops, the DLM resource will assume a split-brain scenario and trigger a fencing operation.
Stop the cluster services on that node:
```
# crm cluster stop
```
Shut down or reboot the node.