23 Executing Maintenance Tasks
To perform maintenance tasks on a cluster node, you might need to stop the resources running on that node, move them, or shut down or reboot the node. It might also be necessary to temporarily take over control of resources from the cluster, or even to stop the cluster service while resources remain running.
This chapter explains how to manually take down a cluster node without negative side effects. It also gives an overview of different options the cluster stack provides for executing maintenance tasks.
23.1 Implications of Taking Down a Cluster Node
When you shut down or reboot a cluster node (or stop the Pacemaker service on a node), the following processes will be triggered:
The resources that are running on the node will be stopped or moved off the node.
If stopping the resources should fail or time out, the STONITH mechanism will fence the node and shut it down.
If your aim is to move the services off the node in an orderly fashion before shutting down or rebooting the node, proceed as follows:
1. On the node you want to reboot or shut down, log in as root or equivalent.
2. Put the node into standby mode:
   # crm node standby
   That way, services can migrate off the node without being limited by the shutdown timeout of Pacemaker.
3. Check the cluster status with:
   # crm status
   It shows the respective node in standby mode:
   [...] Node bob: standby [...]
4. Stop the Pacemaker service on that node:
   # systemctl stop pacemaker.service
5. Reboot the node.
To check if the node joins the cluster again:
1. Log in to the node as root or equivalent.
2. Check if the Pacemaker service has started:
   # systemctl status pacemaker.service
   If not, start it:
   # systemctl start pacemaker.service
3. Check the cluster status with:
   # crm status
   It should show the node coming online again.
4. If the node is still in standby mode, bring it back online:
   # crm node online
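The status check before stopping Pacemaker can be scripted. The following is a minimal sketch, assuming the crm status output contains a line of the form "Node NAME: standby" as in the excerpt above. The function name node_is_standby is hypothetical and not part of crmsh.

```shell
#!/bin/sh
# Hypothetical helper (not part of crmsh): reads `crm status` output on
# stdin and succeeds if the given node is reported as standby.
# Assumption: the status text contains a line like "Node bob: standby".
node_is_standby() {
    # -F treats the node name literally, so dots in host names are safe.
    grep -Fq -- "Node $1: standby"
}

# Example usage on the node to be rebooted (run as root):
#   crm node standby
#   crm status | node_is_standby bob && systemctl stop pacemaker.service
```

The helper only checks that the node is reported as standby. It does not confirm that all resources have finished moving off the node, so the full crm status output should still be reviewed before stopping Pacemaker.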
23.2 Different Options for Maintenance Tasks
Pacemaker offers a variety of options for performing system maintenance:
- Putting the Cluster into Maintenance Mode
The global cluster property maintenance-mode puts all resources into maintenance state at once. The cluster stops monitoring them and becomes oblivious to their status. Note that only the resource management by Pacemaker is disabled. Corosync and SBD are still functional. Use maintenance mode for any tasks involving cluster resources. For any tasks involving infrastructure such as storage or networking, the safest method is to stop the cluster services completely. See Section 23.1, “Implications of Taking Down a Cluster Node”.
- Putting a Node into Maintenance Mode
This option allows you to put all resources running on a specific node into maintenance state at once. The cluster will cease monitoring them and thus become oblivious to their status.
- Putting a Node into Standby Mode
A node that is in standby mode can no longer run resources. Any resources running on the node will be moved away or stopped (in case no other node is eligible to run the resource). Also, all monitoring operations will be stopped on the node (except for those with role="Stopped").
You can use this option if you need to stop a node in a cluster while continuing to provide the services running on another node.
- Putting a Resource into Maintenance Mode
When this mode is enabled for a resource, no monitoring operations will be triggered for the resource.
Use this option if you need to manually touch the service that is managed by this resource and do not want the cluster to run any monitoring operations for the resource during that time.
- Putting a Resource into Unmanaged Mode
The is-managed meta attribute allows you to temporarily “release” a resource from being managed by the cluster stack. This means you can manually touch the service that is managed by this resource (for example, to adjust any components). However, the cluster will continue to monitor the resource and to report any failures.
If you want the cluster to also cease monitoring the resource, use the per-resource maintenance mode instead (see Putting a Resource into Maintenance Mode).
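The options above map to the crm shell commands described in the following sections. As a quick reference, the mapping can be sketched as a small lookup function. The function name maintenance_command and the option keywords (cluster, node-maintenance, and so on) are hypothetical and chosen for illustration only. The function prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical lookup helper: prints the crm shell command for a maintenance
# option. The option keywords are illustrative, not crmsh syntax.
# Usage: maintenance_command OPTION on|off [NODENAME|RESOURCE_ID]
maintenance_command() {
    option="${1:-}"; state="${2:-}"; target="${3:-}"

    # Every option except the cluster-wide one needs a node or resource.
    if [ "$option" != cluster ] && [ -z "$target" ]; then
        echo "a node name or resource ID is required for '$option'" >&2
        return 1
    fi

    case "$option:$state" in
        cluster:on)               echo "crm maintenance on" ;;
        cluster:off)              echo "crm maintenance off" ;;
        node-maintenance:on)      echo "crm node maintenance $target" ;;
        node-maintenance:off)     echo "crm node ready $target" ;;
        node-standby:on)          echo "crm node standby $target" ;;
        node-standby:off)         echo "crm node online $target" ;;
        resource-maintenance:on)  echo "crm resource maintenance $target true" ;;
        resource-maintenance:off) echo "crm resource maintenance $target false" ;;
        resource-unmanaged:on)    echo "crm resource unmanage $target" ;;
        resource-unmanaged:off)   echo "crm resource manage $target" ;;
        *) echo "unknown option: $option $state" >&2; return 1 ;;
    esac
}
```

Printing the command rather than executing it keeps the sketch safe to try on any machine. Review the printed command before running it as root on a cluster node.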
23.3 Preparing and Finishing Maintenance Work
If you need to do testing or maintenance work, follow the general steps below.
Otherwise you risk unwanted side effects, like resources not starting in an orderly fashion, unsynchronized CIBs across the cluster nodes, or even data loss.
1. Before you start, choose which of the options outlined in Section 23.2 is appropriate for your situation.
2. Apply this option with Hawk2 or crmsh.
3. Execute your maintenance task or tests.
4. After you have finished, put the resource, node or cluster back to “normal” operation.
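The general steps can be sketched as a small wrapper for the cluster-wide option, using the crm maintenance on and crm maintenance off commands described in Section 23.4. This is an illustration under stated assumptions, not a supported tool: the function name run_in_maintenance and the CRM override variable (used for dry runs) are hypothetical.

```shell
#!/bin/sh
# Sketch only: wraps a maintenance task in cluster-wide maintenance mode.
# CRM is a hypothetical override for dry runs (for example CRM="echo crm").
# By default, the real crm command is called.
run_in_maintenance() {
    # Word splitting of $CRM is intentional so that "echo crm" works.
    ${CRM:-crm} maintenance on || return 1

    # Run the task passed as arguments and remember its exit status.
    status=0
    "$@" || status=$?

    if [ "$status" -ne 0 ]; then
        # Leave maintenance mode enabled so that the failed task can be
        # inspected before the cluster resumes control of the resources.
        echo "task failed with status $status; cluster left in maintenance mode" >&2
        return "$status"
    fi

    ${CRM:-crm} maintenance off
}

# Example dry run (prints the crm commands instead of executing them):
#   CRM="echo crm" run_in_maintenance echo "maintenance task goes here"
```

The wrapper deliberately does not switch maintenance mode off when the task fails. Otherwise the cluster would resume managing resources that might be in an unknown state.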
23.4 Putting the Cluster into Maintenance Mode
When putting a cluster into maintenance mode, only the resource management by Pacemaker is disabled. Corosync and SBD are still functional. Depending on your maintenance tasks, this might lead to fence operations.
Use maintenance mode for any tasks involving cluster resources. For any tasks involving infrastructure such as storage or networking, the safest method is to stop the cluster services completely. See Section 23.1, “Implications of Taking Down a Cluster Node”.
To put the cluster into maintenance mode on the crm shell, use the following command:
# crm maintenance on
To put the cluster back to normal mode after your maintenance work is done, use the following command:
# crm maintenance off
To do the same with Hawk2:
1. Start a Web browser and log in to the cluster as described in Section 6.2, “Logging In”.
2. In the left navigation bar, select .
3. In the group, select the attribute from the empty drop-down box and click the plus icon to add it.
4. To set maintenance-mode=true, activate the check box next to maintenance-mode and confirm your changes.
5. After you have finished the maintenance task for the whole cluster, deactivate the check box next to the maintenance-mode attribute.
   From this point on, High Availability will take over cluster management again.
23.5 Putting a Node into Maintenance Mode
To put a node into maintenance mode on the crm shell, use the following command:
# crm node maintenance NODENAME
To put the node back into normal mode after your maintenance work is done, use the following command:
# crm node ready NODENAME
To do the same with Hawk2:
1. Start a Web browser and log in to the cluster as described in Section 6.2, “Logging In”.
2. In the left navigation bar, select .
3. In one of the individual nodes' views, click the wrench icon next to the node and select .
4. After you have finished your maintenance task, click the wrench icon next to the node and select .
23.6 Putting a Node into Standby Mode
To put a node into standby mode on the crm shell, use the following command:
# crm node standby NODENAME
To bring the node back online after your maintenance work is done, use the following command:
# crm node online NODENAME
To do the same with Hawk2:
1. Start a Web browser and log in to the cluster as described in Section 6.2, “Logging In”.
2. In the left navigation bar, select .
3. In one of the individual nodes' views, click the wrench icon next to the node and select .
4. Finish the maintenance task for the node.
5. To deactivate the standby mode, click the wrench icon next to the node and select .
23.7 Putting a Resource into Maintenance Mode
To put a resource into maintenance mode on the crm shell, use the following command:
# crm resource maintenance RESOURCE_ID true
To put the resource back into normal mode after your maintenance work is done, use the following command:
# crm resource maintenance RESOURCE_ID false
To do the same with Hawk2:
1. Start a Web browser and log in to the cluster as described in Section 6.2, “Logging In”.
2. In the left navigation bar, select .
3. Select the resource you want to put in maintenance mode or unmanaged mode, click the wrench icon next to the resource and select .
4. Open the category.
5. From the empty drop-down box, select the attribute and click the plus icon to add it.
6. Activate the check box next to maintenance to set the maintenance attribute to yes.
7. Confirm your changes.
8. After you have finished the maintenance task for that resource, deactivate the check box next to the maintenance attribute for that resource.
   From this point on, the resource will be managed by the High Availability software again.
23.8 Putting a Resource into Unmanaged Mode
To put a resource into unmanaged mode on the crm shell, use the following command:
# crm resource unmanage RESOURCE_ID
To put it into managed mode again after your maintenance work is done, use the following command:
# crm resource manage RESOURCE_ID
To do the same with Hawk2:
1. Start a Web browser and log in to the cluster as described in Section 6.2, “Logging In”.
2. From the left navigation bar, select and go to the list.
3. In the column, click the arrow down icon next to the resource you want to modify and select .
   The resource configuration screen opens.
4. Below , select the entry from the empty drop-down box.
5. Set its value to No and click .
6. After you have finished your maintenance task, set the value to Yes (which is the default value) and apply your changes.
   From this point on, the resource will be managed by the High Availability software again.
23.9 Rebooting a Cluster Node While In Maintenance Mode
If the cluster or a node is in maintenance mode, you can stop or restart cluster resources at will—the High Availability software will not attempt to restart them. If you stop the Pacemaker service on a node, all daemons and processes (originally started as Pacemaker-managed cluster resources) will continue to run.
If you attempt to start Pacemaker services on a node while the cluster or node is in maintenance mode, Pacemaker will initiate a single one-shot monitor operation (a “probe”) for every resource to evaluate which resources are currently running on that node. However, it will take no further action other than determining the resources' status.
If you want to take down a node while either the cluster or the node is in maintenance mode, proceed as follows:
1. On the node you want to reboot or shut down, log in as root or equivalent.
2. Check if you have resources of the type ocf:pacemaker:controld or any dependencies on this type of resource. Resources of the type ocf:pacemaker:controld are DLM resources.
   a. If yes, explicitly stop the DLM resources and any resources depending on them:
      crm(live)resource# stop RESOURCE_ID
      The reason is that stopping Pacemaker also stops the Corosync service, on whose membership and messaging services DLM depends. If Corosync stops, the DLM resource will assume a split brain scenario and trigger a fencing operation.
   b. If no, continue with Step 3.
3. Stop the Pacemaker service on that node:
   # systemctl stop pacemaker.service
4. Shut down or reboot the node.
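The DLM check can be partly automated. The following is a minimal sketch, assuming that the output of crm configure show lists each primitive on a line that starts with "primitive ID AGENT". The function name dlm_resource_ids is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the IDs of primitives that use the DLM
# resource agent (ocf:pacemaker:controld).
# Assumption: stdin is `crm configure show` output, in which a primitive
# starts with a line like "primitive ID ocf:pacemaker:controld ...".
dlm_resource_ids() {
    awk '$1 == "primitive" && $3 == "ocf:pacemaker:controld" { print $2 }'
}

# Example usage (run as root on a cluster node):
#   crm configure show | dlm_resource_ids
```

The helper only finds the DLM primitives themselves. Resources that depend on them through groups, clones or constraints still need to be identified manually and stopped first, as described in the procedure above.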