13 Storage protection and SBD #
SBD (STONITH Block Device) provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). This isolates the fencing mechanism from changes in firmware version or dependencies on specific firmware controllers. SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. Under certain conditions, it is also possible to use SBD without shared storage, by running it in diskless mode.
The cluster bootstrap scripts provide an automated way to set up a cluster with the option of using SBD as the fencing mechanism. For details, see the Installation and Setup Quick Start. However, manually setting up SBD provides you with more options regarding the individual settings.
This chapter explains the concepts behind SBD. It guides you through configuring the components needed by SBD to protect your cluster from potential data corruption in case of a split brain scenario.
In addition to node level fencing, you can use additional mechanisms for storage protection, such as LVM exclusive activation or OCFS2 file locking support (resource level fencing). They protect your system against administrative or application faults.
13.1 Conceptual overview #
SBD expands to Storage-Based Death or STONITH Block Device.
The highest priority of the High Availability cluster stack is to protect the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage. The cluster stack takes care of this using several control mechanisms.
However, network partitioning or software malfunction could potentially cause scenarios where several DCs are elected in a cluster. If this so-called split brain scenario were allowed to unfold, data corruption might occur.
Node fencing via STONITH is the primary mechanism to prevent this. Using SBD as a node fencing mechanism is one way of shutting down nodes without using an external power off device in case of a split brain scenario.
- SBD partition
In an environment where all nodes have access to shared storage, a small partition of the device is formatted for use with SBD. The size of the partition depends on the block size of the used disk (for example, 1 MB for standard SCSI disks with 512 byte block size or 4 MB for DASD disks with 4 kB block size). The initialization process creates a message layout on the device with slots for up to 255 nodes.
- SBD daemon
After the respective SBD daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.
- Messages
The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.
Also, the daemon constantly monitors connectivity to the storage device, and terminates itself in case the partition becomes unreachable. This guarantees that it is not disconnected from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: The workload will terminate anyway if the storage connectivity has been lost.
- Watchdog
Whenever SBD is used, a correctly working watchdog is crucial. Modern systems support a hardware watchdog that needs to be “tickled” or “fed” by a software component. The software component (in this case, the SBD daemon) “feeds” the watchdog by regularly writing a service pulse to the watchdog. If the daemon stops feeding the watchdog, the hardware will enforce a system restart. This protects against failures of the SBD process itself, such as dying, or becoming stuck on an I/O error.
If Pacemaker integration is activated, loss of device majority alone does not trigger self-fencing. For example, your cluster contains three nodes: A, B, and C. Because of a network split, A can only see itself while B and C can still communicate. In this case, there are two cluster partitions: one with quorum because of being the majority (B, C), and one without (A). If this happens while the majority of fencing devices are unreachable, node A would self-fence, but nodes B and C would continue to run.
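The three-node example above can be expressed as a small decision rule. The following is a simplified sketch, not SBD's actual implementation: the function name sbd_decision and its arguments are hypothetical, and it models only the one-device and three-device cases, where "device majority" means that more than half of the configured devices are reachable.

```shell
# Simplified model of the self-fencing rule with Pacemaker integration.
# Hypothetical helper, not part of SBD.
#   $1  number of SBD devices this node can currently reach
#   $2  number of SBD devices configured
#   $3  "yes" if this node's cluster partition has quorum, otherwise "no"
# Prints "continue" or "self-fence".
sbd_decision() (
    reachable=$1
    total=$2
    quorate=$3
    if [ $((reachable * 2)) -gt "$total" ]; then
        # Device majority is intact: fencing messages can still be delivered.
        echo continue
    elif [ "$quorate" = yes ]; then
        # Device majority lost, but the partition has quorum (nodes B and C).
        echo continue
    else
        # Device majority lost and no quorum (node A).
        echo self-fence
    fi
)
```

In the example, node A corresponds to sbd_decision 1 3 no, whereas nodes B and C correspond to sbd_decision 1 3 yes.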
13.2 Overview of manually setting up SBD #
The following steps are necessary to manually set up storage-based protection. They must be executed as root. Before you start, check Section 13.3, “Requirements and restrictions”.
Depending on your scenario, either use SBD with one to three devices or in diskless mode. For an outline, see Section 13.4, “Number of SBD devices”. The detailed setup is described in:
- Section 13.6, “Setting up the watchdog”
- Section 13.7, “Setting up SBD with devices”
- Section 13.8, “Setting up diskless SBD”
- Section 13.9, “Testing SBD and fencing”
13.3 Requirements and restrictions #
You can use up to three SBD devices for storage-based fencing. When using one to three devices, the shared storage must be accessible from all nodes.
The path to the shared storage device must be persistent and consistent across all nodes in the cluster. Use stable device names such as /dev/disk/by-id/dm-uuid-part1-mpath-abcedf12345.
The shared storage can be connected via Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), or even iSCSI.
The shared storage segment must not use host-based RAID, LVM, or DRBD. DRBD can be split, which is problematic for SBD, as there cannot be two states in SBD. Cluster multi-device (Cluster MD) cannot be used for SBD.
However, using storage-based RAID and multipathing is recommended for increased reliability.
An SBD device can be shared between different clusters, as long as no more than 255 nodes share the device.
Fencing does not work with an asymmetric SBD setup. When using more than one SBD device, all nodes must have a slot in all SBD devices.
When using more than one SBD device, all devices must have the same configuration, for example, the same timeout values.
For clusters with more than two nodes, you can also use SBD in diskless mode.
13.4 Number of SBD devices #
SBD supports the use of up to three devices:
- One device
The simplest implementation. It is appropriate for clusters where all of your data is on the same shared storage.
- Two devices
This configuration is primarily useful for environments that use host-based mirroring but where no third storage device is available. SBD will not terminate itself if it loses access to one mirror leg, allowing the cluster to continue. However, since SBD does not have enough knowledge to detect an asymmetric split of the storage, it will not fence the other side while only one mirror leg is available. Thus, it cannot automatically tolerate a second failure while one of the storage arrays is down.
- Three devices
The most reliable configuration. It is resilient against outages of one device—be it because of failures or maintenance. SBD will terminate itself only if more than one device is lost and if required, depending on the status of the cluster partition or node. If at least two devices are still accessible, fencing messages can be successfully transmitted.
This configuration is suitable for more complex scenarios where storage is not restricted to a single array. Host-based mirroring solutions can have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI.
- Diskless
This configuration is useful if you want a fencing mechanism without shared storage. In this diskless mode, SBD fences nodes by using the hardware watchdog without relying on any shared device. However, diskless SBD cannot handle a split brain scenario for a two-node cluster. Use this option only for clusters with more than two nodes.
13.5 Calculation of timeouts #
When using SBD as a fencing mechanism, it is vital to consider the timeouts of all components, because they depend on each other. When using more than one SBD device, all devices must have the same timeout values.
- watchdog timeout
This timeout is set during initialization of the SBD device. It depends mostly on your storage latency. The majority of devices must be successfully read within this time. Otherwise, the node might self-fence.
Note: Multipath or iSCSI setup
If your SBD device(s) reside on a multipath setup or iSCSI, the timeout should be set to the time required to detect a path failure and switch to the next path.
This also means that in /etc/multipath.conf the value of max_polling_interval must be less than the watchdog timeout.
- msgwait timeout
This timeout is set during initialization of the SBD device. It defines the time after which a message written to a node's slot on the SBD device is considered delivered. The timeout should be long enough for the node to detect that it needs to self-fence.
However, if the msgwait timeout is relatively long, a fenced cluster node might rejoin before the fencing action returns. This can be mitigated by setting the SBD_DELAY_START parameter in the SBD configuration, as described in Procedure 13.4 in Step 3.
- stonith-timeout in the CIB
This timeout is set in the CIB as a global cluster property. It defines how long to wait for the STONITH action (reboot, on, off) to complete.
- stonith-watchdog-timeout in the CIB
This timeout is set in the CIB as a global cluster property. If not set explicitly, it defaults to 0, which is appropriate for using SBD with one to three devices. For SBD in diskless mode, this timeout must not be 0. For details, see Procedure 13.8, “Configuring diskless SBD”.
If you change the watchdog timeout, you need to adjust the other two timeouts as well. The following “formula” expresses the relationship between these three values:
Timeout (msgwait) >= (Timeout (watchdog) * 2)
stonith-timeout >= Timeout (msgwait) + 20%
For example, if you set the watchdog timeout to 120, set the msgwait timeout to at least 240 and the stonith-timeout to at least 288.
If you use the bootstrap scripts provided by the crm shell to set up a cluster and to initialize the SBD device, the relationship between these timeouts is automatically considered.
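If you set the timeouts manually, the relationship between the three values can be checked with a few lines of shell arithmetic. This is a minimal sketch: the function names are hypothetical, all values are whole seconds, and the 20% margin is rounded up.

```shell
# Minimum msgwait timeout for a given watchdog timeout (seconds).
sbd_min_msgwait() {
    echo $(( $1 * 2 ))
}

# Minimum stonith-timeout for a given msgwait timeout:
# msgwait plus 20%, rounded up to whole seconds.
sbd_min_stonith_timeout() {
    echo $(( ($1 * 120 + 99) / 100 ))
}

watchdog=120
msgwait=$(sbd_min_msgwait "$watchdog")
stonith=$(sbd_min_stonith_timeout "$msgwait")
echo "watchdog=$watchdog msgwait>=$msgwait stonith-timeout>=$stonith"
```

For a watchdog timeout of 120, this yields the minimum values 240 and 288 mentioned above. Both results are lower bounds. Larger values are allowed.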
13.6 Setting up the watchdog #
SUSE Linux Enterprise High Availability ships with several kernel modules that provide hardware-specific watchdog drivers. For a list of the most commonly used ones, see Commonly used watchdog drivers.
For clusters in production environments, we recommend using a hardware-specific watchdog driver. However, if no watchdog matches your hardware, softdog can be used as the kernel watchdog module.
SUSE Linux Enterprise High Availability uses the SBD daemon as the software component that “feeds” the watchdog.
13.6.1 Using a hardware watchdog #
Finding the right watchdog kernel module for a given system is not trivial. Automatic probing fails very often. As a result, lots of modules are already loaded before the right one gets a chance.
The following table lists some commonly used watchdog drivers. However, this is not a complete list of supported drivers. If your hardware is not listed here, you can also find a list of choices in the directories /lib/modules/KERNEL_VERSION/kernel/drivers/watchdog and /lib/modules/KERNEL_VERSION/kernel/drivers/ipmi. Alternatively, ask your hardware or system vendor for details on system-specific watchdog configuration.
Hardware | Driver |
---|---|
HP | hpwdt |
Dell, Lenovo (Intel TCO) | iTCO_wdt |
Fujitsu | ipmi_watchdog |
LPAR on IBM Power | pseries-wdt |
VM on IBM z/VM | vmwatchdog |
Xen VM (DomU) | xen_wdt |
VM on VMware vSphere | wdat_wdt |
Generic | softdog |
Some hardware vendors ship systems management software that uses the watchdog for system resets (for example, HP ASR daemon). If the watchdog is used by SBD, disable such software. No other software must access the watchdog timer.
To make sure the correct watchdog module is loaded, proceed as follows:
List the drivers that have been installed with your kernel version:
#
rpm -ql kernel-VERSION | grep watchdog
List any watchdog modules that are currently loaded in the kernel:
#
lsmod | egrep "(wd|dog)"
If you get a result, unload the wrong module:
#
rmmod WRONG_MODULE
Enable the watchdog module that matches your hardware:
#
echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf
#
systemctl restart systemd-modules-load
Test whether the watchdog module is loaded correctly:
#
lsmod | grep dog
Verify if the watchdog device is available and works:
#
ls -l /dev/watchdog*
#
sbd query-watchdog
If your watchdog device is not available, stop here and check the module name and options. If necessary, try another driver.
Verify if the watchdog device works:
#
sbd -w WATCHDOG_DEVICE test-watchdog
Reboot your machine to make sure there are no conflicting kernel modules. For example, if you find the message cannot register ... in your log, this would indicate such conflicting modules. To ignore such modules, refer to https://documentation.suse.com/sles/html/SLES-all/cha-mod.html#sec-mod-modprobe-blacklist.
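Because only one watchdog module should be active, it can help to script the lsmod check from the procedure above. The following sketch filters lsmod output with the same (wd|dog) pattern. The function names are hypothetical, and the pattern is only a heuristic that might also match unrelated modules.

```shell
# Print the names of loaded modules that look like watchdog drivers.
# Reads lsmod output on stdin and skips the header line.
loaded_watchdog_modules() {
    awk 'NR > 1 && $1 ~ /(wd|dog)/ { print $1 }'
}

# Succeed only if exactly one such module is loaded.
# Reads lsmod output on stdin.
single_watchdog_loaded() {
    awk 'NR > 1 && $1 ~ /(wd|dog)/ { n++ } END { exit (n == 1 ? 0 : 1) }'
}
```

To use the functions, pipe the output of lsmod into them. If single_watchdog_loaded fails, either no watchdog module is loaded or several modules are competing for the watchdog device.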
13.6.2 Using the software watchdog (softdog) #
For clusters in production environments, we recommend using a hardware-specific watchdog driver. However, if no watchdog matches your hardware, softdog can be used as the kernel watchdog module.
The softdog driver assumes that at least one CPU is still running. If all CPUs are stuck, the code in the softdog driver that should reboot the system will never be executed. In contrast, hardware watchdogs keep working even if all CPUs are stuck.
Enable the softdog watchdog:
#
echo softdog > /etc/modules-load.d/watchdog.conf
#
systemctl restart systemd-modules-load
Test whether the softdog watchdog module is loaded correctly:
#
lsmod | grep softdog
13.7 Setting up SBD with devices #
The following steps are necessary for setup:
Before you start, make sure the block device or devices you want to use for SBD meet the requirements specified in Section 13.3.
When setting up the SBD devices, you need to take several timeout values into account. For details, see Section 13.5, “Calculation of timeouts”.
The node will terminate itself if the SBD daemon running on it has not updated the watchdog timer fast enough. After having set the timeouts, test them in your specific environment.
To use SBD with shared storage, you must first create the messaging layout on one to three block devices. The sbd create command will write a metadata header to the specified device or devices. It will also initialize the messaging slots for up to 255 nodes. If executed without any further options, the command will use the default timeout settings.
Make sure the device or devices you want to use for SBD do not hold any important data. When you execute the sbd create command, roughly the first megabyte of the specified block devices will be overwritten without further requests or backup.
Decide which block device or block devices to use for SBD.
Initialize the SBD device with the following command:
#
sbd -d /dev/disk/by-id/DEVICE_ID create
To use more than one device for SBD, specify the -d option multiple times, for example:
#
sbd -d /dev/disk/by-id/DEVICE_ID1 -d /dev/disk/by-id/DEVICE_ID2 -d /dev/disk/by-id/DEVICE_ID3 create
If your SBD device resides on a multipath group, use the -1 and -4 options to adjust the timeouts to use for SBD. The -1 option sets the watchdog timeout, and the -4 option sets the msgwait timeout. If you initialized more than one device, you must set the same timeout values for all devices. For details, see Section 13.5, “Calculation of timeouts”. All timeouts are given in seconds:
#
sbd -d /dev/disk/by-id/DEVICE_ID -1 90 -4 180 create
Check what has been written to the device:
#
sbd -d /dev/disk/by-id/DEVICE_ID dump
Header version     : 2.1
UUID               : 619127f4-0e06-434c-84a0-ea82036e144c
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/disk/by-id/DEVICE_ID is dumped
As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.
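The timeouts stored in the header can also be checked by a script. The following sketch parses a saved copy of the dump output and verifies that msgwait is at least twice the watchdog timeout. The function names are hypothetical, and the parsing assumes the "Timeout (NAME) : VALUE" layout shown above.

```shell
# Extract one timeout value from sbd dump output read on stdin.
# $1 is the name in parentheses, for example watchdog or msgwait.
sbd_dump_timeout() {
    awk -F: -v name="$1" '
        index($1, "Timeout (" name ")") == 1 {
            gsub(/[^0-9]/, "", $2)   # keep only the digits of the value
            print $2
        }'
}

# Succeed if msgwait >= 2 * watchdog in the dump saved in file $1.
sbd_dump_timeouts_ok() {
    wd=$(sbd_dump_timeout watchdog < "$1")
    mw=$(sbd_dump_timeout msgwait < "$1")
    [ -n "$wd" ] && [ -n "$mw" ] && [ "$mw" -ge $((wd * 2)) ]
}
```

To use it, redirect the output of the dump command to a file and pass that file to sbd_dump_timeouts_ok. With the default values shown above (watchdog 5, msgwait 10), the check succeeds.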
After you have initialized the SBD devices, edit the SBD configuration file, then enable and start the respective services for the changes to take effect.
Open the file /etc/sysconfig/sbd.
Search for the following parameter: SBD_DEVICE.
It specifies the devices to monitor and to use for exchanging SBD messages.
Edit this line by replacing /dev/disk/by-id/DEVICE_ID with your SBD device:
SBD_DEVICE="/dev/disk/by-id/DEVICE_ID"
If you need to specify multiple devices in this line, separate them with semicolons (the order of the devices does not matter):
SBD_DEVICE="/dev/disk/by-id/DEVICE_ID1;/dev/disk/by-id/DEVICE_ID2;/dev/disk/by-id/DEVICE_ID3"
If the SBD device is not accessible, the daemon fails to start and inhibits cluster start-up.
Search for the following parameter: SBD_DELAY_START.
It enables or disables a delay. Set SBD_DELAY_START to yes if msgwait is relatively long, but your cluster nodes boot very fast. Setting this parameter to yes delays the start of SBD on boot. This is sometimes necessary with virtual machines.
The default delay length is the same as the msgwait timeout value. Alternatively, you can specify an integer, in seconds, instead of yes.
If you enable SBD_DELAY_START, you must also check the SBD service file to ensure that the value of TimeoutStartSec is greater than the value of SBD_DELAY_START. For more information, see https://www.suse.com/support/kb/doc/?id=000019356.
Copy the configuration file to all nodes by using csync2:
#
csync2 -xv
For more information, see Section 4.7, “Transferring the configuration to all nodes”.
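Before restarting the cluster, you can sanity-check the SBD_DEVICE entry with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the file is expected to contain a single SBD_DEVICE line in the format shown above, and requiring /dev/disk/by-id/ paths is only one way to enforce stable device names.

```shell
# Print each device from the SBD_DEVICE line of the file $1, one per line.
sbd_config_devices() {
    sed -n 's/^SBD_DEVICE="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1" |
        tr ';' '\n' | sed '/^$/d'
}

# Succeed if the file $1 lists one to three devices and every path
# starts with /dev/disk/by-id/ (assumption: by-id names are used).
sbd_config_devices_ok() {
    devices=$(sbd_config_devices "$1")
    [ -n "$devices" ] || return 1
    [ $(( $(printf '%s\n' "$devices" | wc -l) )) -le 3 ] || return 1
    ! printf '%s\n' "$devices" | grep -qv '^/dev/disk/by-id/'
}
```

For example, sbd_config_devices_ok /etc/sysconfig/sbd fails if the entry names a device such as /dev/sdb1, whose name might change between reboots.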
After you have added your SBD devices to the SBD configuration file, enable the SBD daemon. The SBD daemon is a critical piece of the cluster stack. It needs to be running when the cluster stack is running. Thus, the sbd service is started as a dependency whenever the cluster services are started.
On each node, enable the SBD service:
#
systemctl enable sbd
It will be started together with the Corosync service whenever the cluster services are started.
Restart the cluster services on all nodes at once by using the --all option:
#
crm cluster restart --all
This automatically triggers the start of the SBD daemon.
Important: Restart cluster services for SBD changes
If any SBD metadata changes, you must restart the cluster services again. To keep critical cluster resources running during the restart, consider putting the cluster into maintenance mode first. For more information, see Chapter 28, Executing maintenance tasks.
As a next step, test the SBD devices as described in Procedure 13.6.
The following command will dump the node slots and their current messages from the SBD device:
#
sbd -d /dev/disk/by-id/DEVICE_ID list
Now you should see all cluster nodes that have ever been started with SBD listed here. For example, if you have a two-node cluster, the message slot should show clear for both nodes:
0       alice   clear
1       bob     clear
Try sending a test message to one of the nodes:
#
sbd -d /dev/disk/by-id/DEVICE_ID message alice test
The node will acknowledge the receipt of the message in the system log files:
May 03 16:08:31 alice sbd[66139]: /dev/disk/by-id/DEVICE_ID: notice: servant: Received command test from bob on disk /dev/disk/by-id/DEVICE_ID
This confirms that SBD is indeed up and running on the node and that it is ready to receive messages.
As a final step, you need to adjust the cluster configuration as described in Procedure 13.7.
Start a shell and log in as root or equivalent.
Run crm configure.
Enter the following:
crm(live)configure#
property stonith-enabled="true"
crm(live)configure#
property stonith-watchdog-timeout=0
crm(live)configure#
property stonith-timeout="40s"
- stonith-enabled
This is the default configuration, because clusters without STONITH are not supported. But in case STONITH has been deactivated for testing purposes, make sure this parameter is set to true again.
- stonith-watchdog-timeout
If not explicitly set, this value defaults to 0, which is appropriate for use of SBD with one to three devices.
- stonith-timeout
To calculate the stonith-timeout, refer to Section 13.5, “Calculation of timeouts”. A stonith-timeout value of 40 would be appropriate if the msgwait timeout value for SBD was set to 30 seconds.
Configure the SBD STONITH resource. You do not need to clone this resource.
For a two-node cluster, in case of split brain, there will be fencing issued from each node to the other as expected. To prevent both nodes from being reset at practically the same time, it is recommended to apply the following fencing delays to help one of the nodes, or even the preferred node, win the fencing match. For clusters with more than two nodes, you do not need to apply these delays.
- Priority fencing delay
The priority-fencing-delay cluster property is disabled by default. If you configure a delay value and the other node is lost while it has the higher total resource priority, the fencing action targeting it is delayed for the specified amount of time. This means that in case of split-brain, the more important node wins the fencing match.
Resources that matter can be configured with the priority meta attribute. For the calculation, the priority values of the resources or instances that are running on each node are added up. A promoted resource instance takes the configured base priority plus one so that it receives a higher value than any unpromoted instance.
#
crm configure property priority-fencing-delay=30
Even if priority-fencing-delay is used, we still recommend also using pcmk_delay_base or pcmk_delay_max as described below to address any situations where the nodes happen to have equal priority. The value of priority-fencing-delay should be significantly greater than the maximum of pcmk_delay_base/pcmk_delay_max, and preferably twice the maximum.
- Predictable static delay
The pcmk_delay_base parameter adds a static delay before executing STONITH actions. To prevent the nodes from being reset at the same time under split-brain of a two-node cluster, configure separate fencing resources with different delay values. The preferred node can be marked with the parameter to be targeted with a longer fencing delay so that it wins any fencing match. To make this succeed, it is essential to create two primitive STONITH devices, one for each node. In the following configuration, alice will win and survive in case of a split brain scenario:
crm(live)configure#
primitive st-sbd-alice stonith:external/sbd params \
  pcmk_host_list=alice pcmk_delay_base=20
crm(live)configure#
primitive st-sbd-bob stonith:external/sbd params \
  pcmk_host_list=bob pcmk_delay_base=0
- Dynamic random delay
This parameter adds a random delay for STONITH actions on the fencing device. Rather than a static delay targeting a specific node, the parameter pcmk_delay_max adds a random delay for any fencing with the fencing resource to prevent double reset. Unlike pcmk_delay_base, this parameter can be specified for a unified fencing resource targeting multiple nodes.
crm(live)configure#
primitive stonith_sbd stonith:external/sbd \
  params pcmk_delay_max=30
Warning: pcmk_delay_max might not prevent double reset in a split-brain scenario
The lower the value of pcmk_delay_max, the higher the chance that a double reset might still occur.
If your aim is to have a predictable survivor, use a priority fencing delay or predictable static delay.
Review your changes with show.
Submit your changes with commit and leave the crm live configuration with quit.
After the resource has started, your cluster is successfully configured for use of SBD. It will use this method in case a node needs to be fenced.
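If you combine priority-fencing-delay with pcmk_delay_base or pcmk_delay_max in a two-node cluster, the recommendation "preferably twice the maximum" can be turned into a quick calculation. The helper below is a hypothetical sketch that returns the preferred value. It is a rule of thumb derived from that recommendation, not a value computed by Pacemaker.

```shell
# Suggest a priority-fencing-delay in seconds.
#   $1  pcmk_delay_base (use 0 if not set)
#   $2  pcmk_delay_max  (use 0 if not set)
# Prints twice the larger of the two values.
suggested_priority_fencing_delay() {
    if [ "$1" -gt "$2" ]; then
        echo $(( $1 * 2 ))
    else
        echo $(( $2 * 2 ))
    fi
}
```

For example, with pcmk_delay_base=20 and no pcmk_delay_max, suggested_priority_fencing_delay 20 0 prints 40.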
13.8 Setting up diskless SBD #
SBD can be operated in a diskless mode. In this mode, a watchdog device will be used to reset the node in the following cases: if it loses quorum, if any monitored daemon is lost and not recovered, or if Pacemaker decides that the node requires fencing. Diskless SBD is based on “self-fencing” of a node, depending on the status of the cluster, the quorum and some reasonable assumptions. No STONITH SBD resource primitive is needed in the CIB.
Diskless SBD relies on reformed membership and loss of quorum to achieve fencing. Corosync traffic must be able to pass through all network interfaces, including the loopback interface, and must not be blocked by a local firewall. Otherwise, Corosync cannot reform a new membership, which can cause a split-brain scenario that cannot be handled by diskless SBD fencing.
Do not use diskless SBD as a fencing mechanism for two-node clusters. Use diskless SBD only for clusters with three or more nodes. SBD in diskless mode cannot handle split brain scenarios for two-node clusters. If you want to use diskless SBD for two-node clusters, use QDevice as described in Chapter 14, QDevice and QNetd.
Open the file /etc/sysconfig/sbd and use the following entries:
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=no
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
The SBD_DEVICE entry is not needed as no shared disk is used. When this parameter is missing, the sbd service does not start any watcher process for SBD devices.
If you need to delay the start of SBD on boot, change SBD_DELAY_START to yes. The default delay length is double the value of SBD_WATCHDOG_TIMEOUT. Alternatively, you can specify an integer, in seconds, instead of yes.
Important: SBD_WATCHDOG_TIMEOUT for diskless SBD and QDevice
If you use QDevice with diskless SBD, the SBD_WATCHDOG_TIMEOUT value must be greater than QDevice's sync_timeout value, or SBD will time out and fail to start.
The default value for sync_timeout is 30 seconds. Therefore, set SBD_WATCHDOG_TIMEOUT to a greater value, such as 35.
On each node, enable the SBD service:
#
systemctl enable sbd
It will be started together with the Corosync service whenever the cluster services are started.
Restart the cluster services on all nodes at once by using the --all option:
#
crm cluster restart --all
This automatically triggers the start of the SBD daemon.
Important: Restart cluster services for SBD changes
If any SBD metadata changes, you must restart the cluster services again. To keep critical cluster resources running during the restart, consider putting the cluster into maintenance mode first. For more information, see Chapter 28, Executing maintenance tasks.
Check if the parameter have-watchdog=true has been automatically set:
#
crm configure show | grep have-watchdog
have-watchdog=true
Run crm configure and set the following cluster properties on the crm shell:
crm(live)configure#
property stonith-enabled="true"
crm(live)configure#
property stonith-watchdog-timeout=10
crm(live)configure#
property stonith-timeout=15
- stonith-enabled
This is the default configuration, because clusters without STONITH are not supported. But in case STONITH has been deactivated for testing purposes, make sure this parameter is set to true again.
- stonith-watchdog-timeout
For diskless SBD, this parameter must not equal zero. It defines after how long it is assumed that the fencing target has already self-fenced. Use the following formula to calculate this timeout:
stonith-watchdog-timeout >= (SBD_WATCHDOG_TIMEOUT * 2)
If you set stonith-watchdog-timeout to a negative value, Pacemaker automatically calculates this timeout and sets it to twice the value of SBD_WATCHDOG_TIMEOUT.
- stonith-timeout
This parameter must allow sufficient time for fencing to complete. For diskless SBD, use the following formula to calculate this timeout:
stonith-timeout >= stonith-watchdog-timeout + 20%
Important: Diskless SBD timeouts
With diskless SBD, if the stonith-timeout value is smaller than the stonith-watchdog-timeout value, failed nodes can become stuck in an UNCLEAN state and block failover of active resources.
Review your changes with show.
Submit your changes with commit and leave the crm live configuration with quit.
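The two diskless formulas can be chained to derive minimum values from SBD_WATCHDOG_TIMEOUT. The following is a minimal sketch with hypothetical function names. It works in whole seconds and rounds the 20% margin up.

```shell
# Minimum stonith-watchdog-timeout for a given SBD_WATCHDOG_TIMEOUT.
diskless_min_stonith_watchdog_timeout() {
    echo $(( $1 * 2 ))
}

# Minimum stonith-timeout for a given stonith-watchdog-timeout:
# the value plus 20%, rounded up to whole seconds.
diskless_min_stonith_timeout() {
    echo $(( ($1 * 120 + 99) / 100 ))
}

sbd_watchdog_timeout=5
swt=$(diskless_min_stonith_watchdog_timeout "$sbd_watchdog_timeout")
st=$(diskless_min_stonith_timeout "$swt")
echo "stonith-watchdog-timeout>=$swt stonith-timeout>=$st"
```

With SBD_WATCHDOG_TIMEOUT=5, the minimum values are 10 and 12, so the values 10 and 15 used in the procedure satisfy both formulas. With SBD_WATCHDOG_TIMEOUT=35 (the QDevice example), the minimum values are 70 and 84.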
13.9 Testing SBD and fencing #
To test whether SBD works as expected for node fencing purposes, use one or all of the following methods:
- Manually triggering fencing of a node
To trigger a fencing action for node NODENAME:
#
crm node fence NODENAME
Check if the node is fenced and if the other nodes consider the node as fenced after the stonith-watchdog-timeout.
- Simulating an SBD failure
Identify the process ID of the SBD inquisitor:
#
systemctl status sbd
● sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-04-17 15:24:51 CEST; 6 days ago
     Docs: man:sbd(8)
  Process: 1844 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
 Main PID: 1859 (sbd)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/sbd.service
           ├─1859 sbd: inquisitor
[...]
Simulate an SBD failure by terminating the SBD inquisitor process. In our example, the process ID of the SBD inquisitor is 1859:
#
kill -9 1859
The node proactively self-fences. The other nodes notice the loss of the node and consider it has self-fenced after the stonith-watchdog-timeout.
- Triggering fencing through a monitor operation failure
With a normal configuration, a failure of a resource stop operation will trigger fencing. To trigger fencing manually, you can produce a failure of a resource stop operation. Alternatively, you can temporarily change the configuration of a resource monitor operation and produce a monitor failure as described below:
Configure an on-fail=fence property for a resource monitor operation:
op monitor interval=10 on-fail=fence
Let the monitoring operation fail (for example, by terminating the respective daemon, if the resource relates to a service).
This failure triggers a fencing action.
13.10 Additional mechanisms for storage protection #
Apart from node fencing via STONITH, there are other methods to achieve storage protection at a resource level. For example, SCSI-3 and SCSI-4 use persistent reservations, whereas sfex provides a locking mechanism. Both methods are explained in the following subsections.
13.10.1 Configuring an sg_persist resource #
The SCSI specifications 3 and 4 define persistent reservations. These are SCSI protocol features and can be used for I/O fencing and failover. This feature is implemented in the sg_persist Linux command.
Any backing disks for sg_persist must be SCSI disk compatible. sg_persist only works for devices like SCSI disks or iSCSI LUNs. Do not use it for IDE, SATA, or any block devices which do not support the SCSI protocol.
Before you proceed, check if your disk supports persistent reservations. Use the following command (replace DEVICE_ID with your device name):
#
sg_persist -n --in --read-reservation -d /dev/disk/by-id/DEVICE_ID
The result shows whether your disk supports persistent reservations:
Supported disk:
PR generation=0x0, there is NO reservation held
Unsupported disk:
PR in (Read reservation): command not supported Illegal request, Invalid opcode
If you get an error message (like the one above), replace the old disk with an SCSI compatible disk. Otherwise proceed as follows:
Create the primitive resource sg_persist, using a stable device name for the disk:
#
crm configure
crm(live)configure#
primitive sg sg_persist \
  params devs="/dev/disk/by-id/DEVICE_ID" reservation_type=3 \
  op monitor interval=60 timeout=60
Create a promotable clone of the sg_persist primitive:
crm(live)configure#
clone clone-sg sg \
  meta promotable=true promoted-max=1 notify=true
Test the setup. When the resource is promoted, you can mount and write to the disk partitions on the cluster node where the primary instance is running, but you cannot write on the cluster node where the secondary instance is running.
Add a file system primitive for Ext4, using a stable device name for the disk partition:
crm(live)configure#
primitive ext4 Filesystem \ params device="/dev/disk/by-id/DEVICE_ID" directory="/mnt/ext4" fstype=ext4
Add the following order relationship plus a collocation between the
sg_persist
clone and the file system resource:crm(live)configure#
order o-clone-sg-before-ext4 Mandatory: clone-sg:promote ext4:start
crm(live)configure#
colocation col-ext4-with-sg-persist inf: ext4 clone-sg:Promoted
Check all your changes with the show changed command.
Commit your changes.
For more information, refer to the sg_persist man page.
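The support check that precedes the procedure above can be scripted. The following is a minimal sketch, not part of the product: the function name pr_supported is hypothetical, and it only matches the two sample outputs shown earlier in this section. Real sg_persist output may differ between devices and versions, so treat any unmatched output as unsupported and inspect it manually.

```shell
# pr_supported: read sg_persist output on stdin and succeed only if it
# looks like the "supported disk" sample from this section.
# Hypothetical helper; the matched strings are taken from the samples above.
pr_supported() {
    pr_output=$(cat)
    case $pr_output in
        # Error text from the "unsupported disk" sample
        *"command not supported"*|*"Illegal request"*) return 1 ;;
        # A PR generation line indicates the command was accepted
        *"PR generation="*) return 0 ;;
        # Anything else is unknown: fail so it gets checked manually
        *) return 1 ;;
    esac
}

# Example use on a cluster node (replace DEVICE_ID with your device name):
#   sg_persist -n --in --read-reservation -d /dev/disk/by-id/DEVICE_ID 2>&1 \
#     | pr_supported && echo "persistent reservations supported"
```

The function reads from standard input so that it can be tried against saved output before running it on a live device.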
13.10.2 Ensuring exclusive storage activation with sfex #
This section introduces sfex, an additional low-level mechanism to lock access to shared storage exclusively to one node. Note that sfex does not replace STONITH. As sfex requires shared storage, it is recommended that the SBD node fencing mechanism described above is used on another partition of the storage.
By design, sfex cannot be used with workloads that require concurrency (such as OCFS2). It serves as a layer of protection for classic failover style workloads. This is similar to an SCSI-2 reservation in effect, but more general.
13.10.2.1 Overview #
In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.
Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker. The sfex component ensures that even if Pacemaker were subject to a split brain situation, the lock will never be granted more than once.
These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.
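The idea of a lock that must be refreshed can be illustrated with a toy lease. The sketch below is only a conceptual model: it is not the sfex on-disk format or algorithm, the function lease_acquire and its variables are hypothetical, and the state lives in shell variables instead of a shared partition. Time is passed in explicitly so that the behavior is easy to follow.

```shell
# Toy lease lock, for illustration only. NOT how sfex stores its locks.
lease_owner=""      # node currently holding the lease ("" means free)
lease_expires=0     # time at which the current lease runs out

# lease_acquire NODE NOW TIMEOUT
# Succeeds if the lease is free, already held by NODE (a refresh),
# or if the previous holder failed to refresh before it expired.
lease_acquire() {
    la_node=$1
    la_now=$2
    la_timeout=$3
    if [ -z "$lease_owner" ] || [ "$lease_owner" = "$la_node" ] \
        || [ "$la_now" -ge "$lease_expires" ]; then
        lease_owner=$la_node
        lease_expires=$((la_now + la_timeout))
        return 0
    fi
    # Another node holds a lease that has not expired yet
    return 1
}
```

For example, if alice acquires the lease at time 0 with a timeout of 70, a request from bob at time 10 is refused. If alice stops refreshing, bob's request succeeds once the lease has expired.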
13.10.2.2 Setup #
In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks, and needs 1 KB of storage space allocated per lock. By default, sfex_init creates one lock on the partition.
The shared partition for sfex should be on the same logical unit as the data you want to protect.
The shared sfex partition must not use host-based RAID or DRBD.
Using an LVM logical volume is possible.
Create a shared partition for use with sfex. Note the name of this partition and use it as a substitute for /dev/sfex below.
Create the sfex metadata with the following command:
#
sfex_init -n 1 /dev/sfex
Verify that the metadata has been created correctly:
#
sfex_stat -i 1 /dev/sfex ; echo $?
This should return 2, since the lock is not currently held.
The sfex lock is represented via a resource in the CIB, configured as follows:
crm(live)configure#
primitive sfex_1 ocf:heartbeat:sfex \
  params device="/dev/sfex" index="1" collision_timeout="1" \
  lock_timeout="70" monitor_interval="10" \
  op monitor interval="10s" timeout="30s" on-fail="fence"
To protect resources via an sfex lock, create mandatory order and placement constraints between the resources to protect and the sfex resource. For example, if the resource to be protected has the ID filesystem1:
crm(live)configure#
order order-sfex-1 Mandatory: sfex_1 filesystem1
crm(live)configure#
colocation col-sfex-1 inf: filesystem1 sfex_1
If using group syntax, add the sfex resource as the first resource to the group:
crm(live)configure#
group LAMP sfex_1 filesystem1 apache ipaddr
13.11 Changing SBD configuration #
You might need to change the cluster's SBD configuration for various reasons. For example:
Changing disk-based SBD to diskless SBD
Changing diskless SBD to disk-based SBD
Replacing an SBD device with a new device
Changing timeout values and other settings
You can use crmsh to change the SBD configuration. This method uses crmsh's default settings, including timeout values.
If you need to change any settings, or if you use custom settings and need to retain them when replacing a device, you must manually edit the SBD configuration.
Use this procedure to change diskless SBD to disk-based SBD, or to replace an existing SBD device with a new device.
Put the cluster into maintenance mode:
#
crm maintenance on
In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even if the cluster services stop.
Configure the new device:
#
crm -F cluster init sbd -s /dev/disk/by-id/DEVICE_ID
The -F option allows crmsh to reconfigure SBD even if the SBD service is already running.
Check the status of the cluster:
#
crm status
Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.
Check the SBD configuration. First, check the device metadata:
#
sbd -d /dev/disk/by-id/DEVICE_ID dump
Then check that all nodes in the cluster are assigned to a slot in the device:
#
sbd -d /dev/disk/by-id/DEVICE_ID list
Check the status of the SBD service:
#
systemctl status sbd
If you changed diskless SBD to disk-based SBD, check that the following section includes a device ID:
CGroup: /system/.slice/sbd.service
        |—23314 "sbd: inquisitor"
        |—23315 "sbd: watcher: /dev/disk/by-id/DEVICE_ID - slot: 0 - uuid: DEVICE_UUID"
        |—23316 "sbd: watcher: Pacemaker"
        |—23317 "sbd: watcher: Cluster"
When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:
#
crm maintenance off
Use this procedure to change diskless SBD to disk-based SBD, to replace an existing SBD device with a new device, or to change the timeout settings for disk-based SBD.
Put the cluster into maintenance mode:
#
crm maintenance on
In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even when you stop the cluster services.
Stop the cluster services, including SBD, on all nodes:
#
crm cluster stop --all
Reinitialize the device metadata, specifying the new device ID and timeouts as required:
#
sbd -d /dev/disk/by-id/DEVICE_ID -1 VALUE -4 VALUE create
The -1 option specifies the watchdog timeout.
The -4 option specifies the msgwait timeout. This must be at least double the watchdog timeout.
For more information, see Procedure 13.3, “Initializing the SBD devices”.
Open the file /etc/sysconfig/sbd.
If you are changing diskless SBD to disk-based SBD, add the following line and specify the device ID. If you are replacing an SBD device, change the value of this line to the new device ID:
SBD_DEVICE="/dev/disk/by-id/DEVICE_ID"
Adjust the other settings as required. For more information, see Procedure 13.4, “Editing the SBD configuration file”.
Copy the configuration file to all nodes:
#
csync2 -xv
Start the cluster services on all nodes:
#
crm cluster start --all
Check the status of the cluster:
#
crm status
Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.
Check the SBD configuration. First, check the device metadata:
#
sbd -d /dev/disk/by-id/DEVICE_ID dump
Then check that all nodes in the cluster are assigned to a slot in the device:
#
sbd -d /dev/disk/by-id/DEVICE_ID list
Check the status of the SBD service:
#
systemctl status sbd
If you changed diskless SBD to disk-based SBD, check that the following section includes a device ID:
CGroup: /system/.slice/sbd.service
        |—23314 "sbd: inquisitor"
        |—23315 "sbd: watcher: /dev/disk/by-id/DEVICE_ID - slot: 0 - uuid: DEVICE_UUID"
        |—23316 "sbd: watcher: Pacemaker"
        |—23317 "sbd: watcher: Cluster"
If you changed any timeouts, or if you changed diskless SBD to disk-based SBD, you might also need to change the CIB properties stonith-timeout and stonith-watchdog-timeout. For disk-based SBD, stonith-watchdog-timeout should be 0 or left at its default. For more information, see Section 13.5, “Calculation of timeouts”.
To check the current values, run the following command:
#
crm configure show
If you need to change the values, use the following commands:
#
crm configure property stonith-watchdog-timeout=0
#
crm configure property stonith-timeout=VALUE
If you changed diskless SBD to disk-based SBD, you must configure a STONITH resource for SBD. For example:
#
crm configure primitive stonith-sbd stonith:external/sbd
For more information, see Step 4 in Procedure 13.7, “Configuring the cluster to use SBD”.
When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:
#
crm maintenance off
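The rule from the procedure above, that the msgwait timeout (-4) must be at least double the watchdog timeout (-1), can be checked before you reinitialize the device. The following is a minimal sketch: the function name check_sbd_timeouts is hypothetical, and it validates only this one rule, not the other timeouts discussed in Section 13.5, “Calculation of timeouts”.

```shell
# check_sbd_timeouts WATCHDOG MSGWAIT
# Hypothetical helper: succeeds if MSGWAIT is at least 2 x WATCHDOG.
# Both arguments are whole numbers of seconds.
check_sbd_timeouts() {
    cst_watchdog=$1
    cst_msgwait=$2
    if [ "$cst_msgwait" -ge $((cst_watchdog * 2)) ]; then
        echo "ok: msgwait=$cst_msgwait is at least 2 x watchdog=$cst_watchdog"
        return 0
    fi
    echo "error: msgwait=$cst_msgwait is less than 2 x watchdog=$cst_watchdog" >&2
    return 1
}

# Example use before running "sbd ... -1 15 -4 30 create":
#   check_sbd_timeouts 15 30 && echo "values are consistent"
```

The example values 15 and 30 are placeholders for illustration, not recommendations for your environment.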
Use this procedure to change disk-based SBD to diskless SBD.
Put the cluster into maintenance mode:
#
crm maintenance on
In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even if the cluster services stop.
Configure diskless SBD:
#
crm -F cluster init sbd -S
The -F option allows crmsh to reconfigure SBD even if the SBD service is already running.
Check the status of the cluster:
#
crm status
Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.
Check the status of the SBD service:
#
systemctl status sbd
The following section should not include a device ID:
CGroup: /system/.slice/sbd.service
        |—23314 "sbd: inquisitor"
        |—23315 "sbd: watcher: Pacemaker"
        |—23316 "sbd: watcher: Cluster"
When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:
#
crm maintenance off
Use this procedure to change disk-based SBD to diskless SBD, or to change the timeout values for diskless SBD.
Put the cluster into maintenance mode:
#
crm maintenance on
In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even when you stop the cluster services.
Stop the cluster services, including SBD, on all nodes:
#
crm cluster stop --all
Open the file /etc/sysconfig/sbd.
If you are changing disk-based SBD to diskless SBD, remove or comment out the SBD_DEVICE entry.
Adjust the other settings as required. For more information, see Section 13.8, “Setting up diskless SBD”.
Copy the configuration file to all nodes:
#
csync2 -xv
Start the cluster services on all nodes:
#
crm cluster start --all
Check the status of the cluster:
#
crm status
Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.
Check the status of the SBD service:
#
systemctl status sbd
If you changed disk-based SBD to diskless SBD, check that the following section does not include a device ID:
CGroup: /system/.slice/sbd.service
        |—23314 "sbd: inquisitor"
        |—23315 "sbd: watcher: Pacemaker"
        |—23316 "sbd: watcher: Cluster"
If you changed any timeouts, or if you changed disk-based SBD to diskless SBD, you might also need to change the CIB properties stonith-timeout and stonith-watchdog-timeout. For more information, see Step 5 of Procedure 13.8, “Configuring diskless SBD”.
To check the current values, run the following command:
#
crm configure show
If you need to change the values, use the following commands:
#
crm configure property stonith-watchdog-timeout=VALUE
#
crm configure property stonith-timeout=VALUE
When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:
#
crm maintenance off
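The procedures in this section ask you to check whether the CGroup section of the systemctl status sbd output includes a device ID. This check can be sketched as a small filter. The function name sbd_mode is hypothetical, and the match strings are taken only from the sample output shown in this section, so treat the result as a hint and confirm it against the full status output.

```shell
# sbd_mode: read "systemctl status sbd" output on stdin and print
# "disk-based", "diskless", or "unknown".
# Hypothetical helper based on the sample CGroup output in this section.
sbd_mode() {
    sm_status=$(cat)
    case $sm_status in
        # A watcher bound to a device path means a disk is configured
        *"sbd: watcher: /dev/"*) echo "disk-based" ;;
        # SBD is running, but no disk watcher is listed
        *"sbd: inquisitor"*)     echo "diskless" ;;
        # No SBD processes recognized in the input
        *)                       echo "unknown" ;;
    esac
}

# Example use on a cluster node:
#   systemctl status sbd | sbd_mode
```

After changing disk-based SBD to diskless SBD, the expected result is diskless. After changing diskless SBD to disk-based SBD, it is disk-based.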
13.12 For more information #
For more details, see man sbd.