12 Fencing and STONITH
Fencing is a very important concept in computer clusters for HA (High Availability). A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource. Fencing may be defined as a method to bring an HA cluster to a known state.
Every resource in a cluster has a state attached. For example: “resource r1 is started on alice”. In an HA cluster, such a state implies that “resource r1 is stopped on all nodes except alice”, because the cluster must make sure that every resource may be started on only one node. Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.
When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.
12.1 Classes of fencing
There are two classes of fencing: resource level and node level fencing. The latter is the primary subject of this chapter.
- Resource level fencing
Resource level fencing ensures exclusive access to a given resource. Common examples of this are changing the zoning of the node from a SAN fiber channel switch (thus locking the node out of access to its disks) or methods like SCSI reserve. For examples, refer to Section 13.10, “Additional mechanisms for storage protection”.
- Node level fencing
Node level fencing prevents a failed node from accessing shared resources entirely. This is usually done in a simple and abrupt way: reset or power off the node.
Classes | Methods | Options | Examples |
---|---|---|---|
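Node fencing[a] (reboot or shutdown) | Remote management | In-node | iLO, DRAC, IPMI |
Node fencing[a] (reboot or shutdown) | Remote management | External | HMC, vCenter, EC2 |
Node fencing[a] (reboot or shutdown) | SBD and watchdog | Disk-based | hpwdt, iTCO_wdt, ipmi_wdt, softdog |
Node fencing[a] (reboot or shutdown) | SBD and watchdog | Diskless | hpwdt, iTCO_wdt, ipmi_wdt, softdog |
I/O fencing (locking or reservation) | Pure locking | In-cluster | SFEX |
I/O fencing (locking or reservation) | Pure locking | External | SCSI2 reservation, SCSI3 reservation |
I/O fencing (locking or reservation) | Built-in locking | Cluster-based | Cluster MD, LVM with lvmlockd and DLM |
I/O fencing (locking or reservation) | Built-in locking | Cluster-handled | MD-RAID |
[a] Mandatory in High Availability clusters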
12.2 Node level fencing
In a Pacemaker cluster, the implementation of node level fencing is STONITH (Shoot The Other Node in the Head). SUSE Linux Enterprise High Availability includes the stonith command line tool, an extensible interface for remotely powering down a node in the cluster. For an overview of the available options, run stonith --help or refer to the man page of stonith.
12.2.1 STONITH devices
To use node level fencing, you first need to have a fencing device. To get a list of STONITH devices which are supported by SUSE Linux Enterprise High Availability, run one of the following commands on any of the nodes:
# stonith -L
or
# crm ra list stonith
STONITH devices may be classified into the following categories:
- Power Distribution Units (PDU)
Power Distribution Units are an essential element in managing power capacity and functionality for critical network, server and data center equipment. They can provide remote load monitoring of connected equipment and individual outlet power control for remote power recycling.
- Uninterruptible Power Supplies (UPS)
A stable power supply provides emergency power to connected equipment by supplying power from a separate source if a utility power failure occurs.
- Blade power control devices
If you are running a cluster on a set of blades, then the power control device in the blade enclosure is the only candidate for fencing. Of course, this device must be capable of managing single blade computers.
- Lights-out devices
Lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular and may even become standard in off-the-shelf computers. However, they are inferior to UPS devices, because they share a power supply with their host (a cluster node). If a node stays without power, the device supposed to control it would be useless. In that case, the CRM would continue its attempts to fence the node indefinitely while all other resource operations would wait for the fencing/STONITH operation to complete.
- Testing devices
Testing devices are used exclusively for testing purposes. They are usually more gentle on the hardware. Before the cluster goes into production, they must be replaced with real fencing devices.
The choice of the STONITH device depends mainly on your budget and the kind of hardware you use.
12.2.2 STONITH implementation
The STONITH implementation of SUSE® Linux Enterprise High Availability consists of two components:
- pacemaker-fenced
pacemaker-fenced is a daemon which can be accessed by local processes or over the network. It accepts the commands which correspond to fencing operations: reset, power-off, and power-on. It can also check the status of the fencing device.
The pacemaker-fenced daemon runs on every node in the High Availability cluster. The pacemaker-fenced instance running on the DC node receives a fencing request from pacemaker-controld. It is up to this and other pacemaker-fenced programs to carry out the desired fencing operation.
- STONITH plug-ins
For every supported fencing device there is a STONITH plug-in which is capable of controlling said device. A STONITH plug-in is the interface to the fencing device. The STONITH plug-ins contained in the cluster-glue package reside in /usr/lib64/stonith/plugins on each node. (If you installed the fence-agents package, too, the plug-ins contained there are installed in /usr/sbin/fence_*.) All STONITH plug-ins look the same to pacemaker-fenced, but are quite different on the other side, reflecting the nature of the fencing device.
Some plug-ins support more than one device. A typical example is ipmilan (or external/ipmi) which implements the IPMI protocol and can control any device which supports this protocol.
12.3 STONITH resources and configuration
To set up fencing, you need to configure one or more STONITH resources. The pacemaker-fenced daemon requires no configuration. All configuration is stored in the CIB. A STONITH resource is a resource of class stonith (see Section 6.2, “Supported resource agent classes”). STONITH resources are a representation of STONITH plug-ins in the CIB. Apart from the fencing operations, the STONITH resources can be started, stopped and monitored, like any other resource. Starting or stopping STONITH resources means loading and unloading the STONITH device driver on a node. Starting and stopping are thus only administrative operations and do not translate to any operation on the fencing device itself. However, monitoring does translate to logging in to the device (to verify that the device will work in case it is needed). When a STONITH resource fails over to another node, it enables the current node to talk to the STONITH device by loading the respective driver.
STONITH resources can be configured like any other resource, using your preferred cluster management tool.
The list of parameters (attributes) depends on the respective STONITH type. To view a list of parameters for a specific device, use the stonith command:
# stonith -t stonith-device-type -n
For example, to view the parameters for the ibmhmc device type, enter the following:
# stonith -t ibmhmc -n
To get a short help text for the device, use the -h option:
# stonith -t stonith-device-type -h
12.3.1 Example STONITH resource configurations
In the following, find some example configurations written in the syntax of the crm command line tool. To apply them, put the sample in a text file (for example, sample.txt) and run:
# crm < sample.txt
For more information about configuring resources with the crm command line tool, refer to Section 5.5, “Introduction to crmsh”.
An IBM RSA lights-out device might be configured like this:
# crm configure
crm(live)configure# primitive st-ibmrsa-1 stonith:external/ibmrsa-telnet \
  params nodename=alice ip_address=192.168.0.101 \
  username=USERNAME password=PASSW0RD
crm(live)configure# primitive st-ibmrsa-2 stonith:external/ibmrsa-telnet \
  params nodename=bob ip_address=192.168.0.102 \
  username=USERNAME password=PASSW0RD
crm(live)configure# location l-st-alice st-ibmrsa-1 -inf: alice
crm(live)configure# location l-st-bob st-ibmrsa-2 -inf: bob
crm(live)configure# commit
In this example, location constraints are used for the following reason: There is always a certain probability that the STONITH operation is going to fail. Therefore, a STONITH operation that targets the very node executing it is not reliable. If the node is reset, it cannot send the notification about the fencing operation outcome. The only way to do that is to assume that the operation is going to succeed and send the notification beforehand. But if the operation fails, problems could arise. Therefore, by convention, pacemaker-fenced refuses to terminate its host.
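The pattern used in this example (one STONITH resource per node, plus a -inf location constraint that keeps the resource off the node it fences) repeats for every additional node. The following shell function is a minimal sketch that prints such a pair of statements for review. The function name st_ibmrsa, the third node charlie and its address 192.168.0.103 are hypothetical, and nothing is applied to the cluster.

```shell
#!/bin/sh
# Print crm configure statements for one IBM RSA STONITH resource and the
# location constraint that prevents it from running on the node it fences.
# Arguments: node name, IP address of the RSA device, resource index.
st_ibmrsa() {
    node=$1
    ip=$2
    idx=$3
    printf 'primitive st-ibmrsa-%s stonith:external/ibmrsa-telnet \\\n' "$idx"
    printf '  params nodename=%s ip_address=%s \\\n' "$node" "$ip"
    printf '  username=USERNAME password=PASSW0RD\n'
    printf 'location l-st-%s st-ibmrsa-%s -inf: %s\n' "$node" "$idx" "$node"
}

# Hypothetical third node; alice and bob are already covered by the example.
st_ibmrsa charlie 192.168.0.103 3
```

After reviewing the printed statements, enter them in the crm configure shell and finish with commit.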
The configuration of a UPS type fencing device is similar to the examples above. The details are not covered here. All UPS devices employ the same mechanics for fencing. How the device is accessed varies. Old UPS devices only had a serial port, usually connected at 1200 baud using a special serial cable. Many new ones still have a serial port, but often they also offer a USB or Ethernet interface. The kind of connection you can use depends on what the plug-in supports.
For example, compare the apcmaster with the apcsmart device by using the stonith -t stonith-device-type -h command:
# stonith -t apcmaster -h
returns the following information:
STONITH Device: apcmaster - APC MasterSwitch (via telnet)
NOTE: The APC MasterSwitch accepts only one (telnet) connection/session a time.
When one session is active, subsequent attempts to connect to the MasterSwitch will fail.
For more information see http://www.apc.com/
List of valid parameter names for apcmaster STONITH device:
  ipaddr
  login
  password
For Config info [-p] syntax, give each of the above parameters in order as the -p value.
Arguments are separated by white space.
Config file [-F] syntax is the same as -p, except # at the start of a line denotes a comment
With
# stonith -t apcsmart -h
you get the following output:
STONITH Device: apcsmart - APC Smart UPS (via serial port - NOT USB!).
Works with higher-end APC UPSes, like Back-UPS Pro, Smart-UPS, Matrix-UPS, etc.
(Smart-UPS may have to be >= Smart-UPS 700?).
See http://www.networkupstools.org/protocols/apcsmart.html for protocol compatibility details.
For more information see http://www.apc.com/
List of valid parameter names for apcsmart STONITH device:
  ttydev
  hostlist
The first plug-in supports APC UPS with a network port and telnet protocol. The second plug-in uses the APC SMART protocol over the serial line, which is supported by many APC UPS product lines.
Kdump belongs to the special fencing devices (see Section 12.5, “Special fencing devices”) and is in fact the opposite of a fencing device. The plug-in checks if a kernel dump is in progress on a node. If so, it returns true and acts as if the node has been fenced, because the node will reboot after the Kdump is complete. If not, it returns a failure and the next fencing device is triggered.
The Kdump plug-in must be used together with another, real STONITH device, for example, external/ipmi. It does not work with SBD as the STONITH device. For the fencing mechanism to work properly, you must specify the order of the fencing devices so that Kdump is checked before a real STONITH device is triggered, as shown in the following procedure.
1. Use the stonith:fence_kdump fence agent. A configuration example is shown below. For more information, see crm ra info stonith:fence_kdump.
# crm configure
crm(live)configure# primitive st-kdump stonith:fence_kdump \
  params nodename="alice" \
  pcmk_host_list="alice" \
  pcmk_host_check="static-list" \
  pcmk_reboot_action="off" \
  pcmk_monitor_action="metadata" \
  pcmk_reboot_retries="1" \
  timeout="60"
crm(live)configure# commit
nodename: Name of the node to listen for a message from fence_kdump_send. Configure more STONITH resources for other nodes if needed.
timeout: Defines how long to wait for a message from fence_kdump_send. If a message is received, then a Kdump is in progress and the fencing mechanism considers the node to be fenced. If no message is received, fence_kdump times out, which indicates that the fence operation failed. The next STONITH device in the fencing_topology eventually fences the node.
2. On each node, configure fence_kdump_send to send a message to all nodes when the Kdump process is finished. In /etc/sysconfig/kdump, edit the KDUMP_POSTSCRIPT line. For example:
KDUMP_POSTSCRIPT="/usr/lib/fence_kdump_send -i 10 -p 7410 -c 1 NODELIST"
Replace NODELIST with the host names of all the cluster nodes.
3. Run either systemctl restart kdump.service or mkdumprd. Either of these commands will detect that /etc/sysconfig/kdump was modified, and will regenerate the initrd to include the library fence_kdump_send with network enabled.
4. Open a port in the firewall for the fence_kdump resource. The default port is 7410.
5. To have Kdump checked before triggering a real fencing mechanism (like external/ipmi), use a configuration similar to the following:
crm(live)configure# fencing_topology \
  alice: kdump-node1 ipmi-node1 \
  bob: kdump-node2 ipmi-node2
crm(live)configure# commit
For more details on fencing_topology:
crm(live)configure# help fencing_topology
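The fencing topology refers to resources such as kdump-node1 and ipmi-node1, which must already be defined. As a hedged sketch, the following snippet writes matching definitions for node alice to a file for review. The external/ipmi parameter names (hostname, ipaddr, userid, passwd, interface) should be confirmed with stonith -t external/ipmi -n. The IP address 192.168.0.201, the credentials, the constraint name l-ipmi-node1 and the file name are placeholders chosen for illustration.

```shell
#!/bin/sh
# Write example definitions for the resources named in the fencing topology.
# All addresses, credentials and the file name are illustrative placeholders.
cfg="${TMPDIR:-/tmp}/kdump-topology-alice.txt"

cat > "$cfg" <<'EOF'
primitive kdump-node1 stonith:fence_kdump \
  params nodename="alice" pcmk_host_list="alice" \
  pcmk_host_check="static-list" pcmk_reboot_action="off" \
  pcmk_monitor_action="metadata" pcmk_reboot_retries="1" timeout="60"
primitive ipmi-node1 stonith:external/ipmi \
  params hostname="alice" ipaddr="192.168.0.201" \
  userid="USERNAME" passwd="PASSW0RD" interface="lan"
location l-ipmi-node1 ipmi-node1 -inf: alice
fencing_topology \
  alice: kdump-node1 ipmi-node1
EOF

# Show the result; nothing is applied to the cluster at this point.
cat "$cfg"
```

After review, such a file could be loaded with crm configure load update FILE. A complete configuration also needs the equivalent resources and topology entry for bob.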
12.4 Monitoring fencing devices
Like any other resource, the STONITH class agents also support the monitoring operation for checking status.
Fencing devices are an indispensable part of an HA cluster, but the less you need to use them, the better. Power management equipment is often affected by too much broadcast traffic. Some devices cannot handle more than ten or so connections per minute. Some get confused if two clients try to connect at the same time. Most cannot handle more than one session at a time.
The probability that a fencing operation needs to be performed and the fencing device fails is low. For most devices, a monitoring interval of at least 1800 seconds (30 minutes) should suffice. The exact value depends on the device and infrastructure. STONITH SBD resources do not need a monitor at all. See Section 12.5, “Special fencing devices” and Chapter 13, Storage protection and SBD.
For detailed information on how to configure monitor operations, refer to Section 6.10.2, “Configuring resource monitoring with crmsh” for the command line approach.
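As an illustration of the interval discussed in this section, the following sketch writes a STONITH resource definition with a monitor operation every 1800 seconds to a file. It reuses the IBM RSA resource from Section 12.3.1. The timeout of 120 seconds and the file name are assumptions made for the example, not recommended values.

```shell
#!/bin/sh
# Write a STONITH resource definition that includes a monitor operation.
# interval=1800s follows the text; timeout=120s is an assumed placeholder.
cfg="${TMPDIR:-/tmp}/st-monitor-sample.txt"

cat > "$cfg" <<'EOF'
primitive st-ibmrsa-1 stonith:external/ibmrsa-telnet \
  params nodename=alice ip_address=192.168.0.101 \
  username=USERNAME password=PASSW0RD \
  op monitor interval=1800s timeout=120s
EOF

# Print the definition for review; it is not applied to the cluster.
cat "$cfg"
```

Choose the interval and timeout to suit the device: a longer interval reduces the number of logins the power management equipment has to handle.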
12.5 Special fencing devices
In addition to plug-ins which handle real STONITH devices, there are special purpose STONITH plug-ins.
Some STONITH plug-ins mentioned below are for demonstration and testing purposes only. Do not use any of the following devices in real-life scenarios because this may lead to data corruption and unpredictable results:
external/ssh
ssh
- fence_kdump
This plug-in checks if a kernel dump is in progress on a node. If so, it returns true, and acts as if the node has been fenced. The node cannot run any resources during the dump anyway. This avoids fencing a node that is already down but doing a dump, which takes some time. The plug-in must be used in concert with another, real STONITH device.
For configuration details, see Example 12.3, “Configuration of a Kdump device”.
- external/sbd
This is a self-fencing device. It reacts to a so-called “poison pill” which can be inserted into a shared disk. On shared-storage connection loss, it stops the node from operating. Learn how to use this STONITH agent to implement storage-based fencing in Chapter 13, Procedure 13.7, “Configuring the cluster to use SBD”. See also http://www.linux-ha.org/wiki/SBD_Fencing for more details.
Important: external/sbd and DRBD
The external/sbd fencing mechanism requires that the SBD partition is readable directly from each node. Thus, a DRBD device must not be used for an SBD partition.
However, you can use the fencing mechanism for a DRBD cluster, provided the SBD partition is located on a shared disk that is not mirrored or replicated.
- external/ssh
Another software-based “fencing” mechanism. The nodes must be able to log in to each other as root without passwords. It takes a single parameter, hostlist, specifying the nodes that it will target. As it is not able to reset a truly failed node, it must not be used for real-life clusters, only for testing and demonstration purposes. Using it for shared storage would result in data corruption.
- meatware
meatware requires help from the user to operate. Whenever invoked, meatware logs a CRIT severity message which shows up on the node's console. The operator then confirms that the node is down and issues a meatclient(8) command. This tells meatware to inform the cluster that the node should be considered dead. See /usr/share/doc/packages/cluster-glue/README.meatware for more information.
- suicide
This is a software-only device, which can reboot a node it is running on, using the reboot command. This requires action by the node's operating system and can fail under certain circumstances. Therefore avoid using this device whenever possible. However, it is safe to use on one-node clusters.
- Diskless SBD
This configuration is useful if you want a fencing mechanism without shared storage. In this diskless mode, SBD fences nodes by using the hardware watchdog without relying on any shared device. However, diskless SBD cannot handle a split brain scenario for a two-node cluster. Use this option only for clusters with more than two nodes.
suicide is the only exception to the “I do not shoot my host” rule.
12.6 Basic recommendations
Check the following list of recommendations to avoid common mistakes:
- Do not configure several power switches in parallel.
- To test your STONITH devices and their configuration, pull the plug once from each node and verify that fencing the node does take place.
- Test your resources under load and verify the timeout values are appropriate. Setting timeout values too low can trigger (unnecessary) fencing operations. For details, refer to Section 6.3, “Timeout values”.
- Use appropriate fencing devices for your setup. For details, also refer to Section 12.5, “Special fencing devices”.
- Configure one or more STONITH resources. By default, the global cluster option stonith-enabled is set to true. If no STONITH resources have been defined, the cluster will refuse to start any resources.
- Do not set the global cluster option stonith-enabled to false for the following reasons:
  - Clusters without STONITH enabled are not supported.
  - DLM/OCFS2 will block forever waiting for a fencing operation that will never happen.
- Do not set the global cluster option startup-fencing to false. By default, it is set to true for the following reason: If a node is in an unknown state during cluster start-up, the node will be fenced once to clarify its status.
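To check the two global options named in these recommendations on a running cluster, they can be queried with the Pacemaker tool crm_attribute. The following is a minimal sketch. The function name report_fencing_options is hypothetical, and the fallback messages cover the cases where the tool is missing or the query fails (for example, because the option was never set explicitly and its default applies, or because the cluster is not running).

```shell
#!/bin/sh
# Report the values of the fencing-related global cluster options.
# Prints one line per option, either the query result or a notice.
report_fencing_options() {
    for opt in stonith-enabled startup-fencing; do
        if command -v crm_attribute >/dev/null 2>&1; then
            # Query the option in the cluster-wide crm_config section.
            crm_attribute --type crm_config --name "$opt" --query ||
                echo "$opt: could not be queried (unset or cluster not running)"
        else
            echo "$opt: crm_attribute not found, cannot query"
        fi
    done
}

report_fencing_options
```

To set an option explicitly, crmsh offers crm configure property, for example crm configure property stonith-enabled=true.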
12.7 For more information
- /usr/share/doc/packages/cluster-glue
In your installed system, this directory contains README files for many STONITH plug-ins and devices.
- http://www.clusterlabs.org/pacemaker/doc/
Pacemaker Explained: Explains the concepts used to configure Pacemaker. Contains comprehensive and detailed information for reference.
- http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
Article explaining the concepts of split brain, quorum and fencing in HA clusters.