Applies to SUSE Linux Enterprise High Availability Extension 11 SP4

17 Storage Protection

The High Availability cluster stack's highest priority is protecting the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage: for example, Ext3 file systems are only mounted once in the cluster, and OCFS2 volumes will not be mounted unless coordination with other cluster nodes is available. In a well-functioning cluster, Pacemaker will detect if resources are active beyond their concurrency limits and initiate recovery. Furthermore, its policy engine will never exceed these limits.

However, network partitioning or software malfunction could potentially cause scenarios where several coordinators are elected. If such a split brain scenario were allowed to unfold, data corruption might occur. Hence, several layers of protection have been added to the cluster stack to mitigate this risk.

The primary component contributing to this goal is IO fencing/STONITH, since it ensures that all other access is terminated prior to storage activation. Other mechanisms, such as cLVM2 exclusive activation or OCFS2 file locking support, protect your system against administrative or application faults. Combined appropriately for your setup, these mechanisms can reliably prevent split brain scenarios from causing harm.

This chapter describes an IO fencing mechanism that leverages the storage itself, followed by the description of an additional layer of protection to ensure exclusive storage access. These two mechanisms can be combined for higher levels of protection.

17.1 Storage-based Fencing

You can reliably avoid split brain scenarios by using one or more STONITH Block Devices (SBD), watchdog support and the external/sbd STONITH agent.

17.1.1 Overview

In an environment where all nodes have access to shared storage, a small partition of the device is formatted for use with SBD. The size of the partition depends on the block size of the used disk (1 MB for standard SCSI disks with 512 Byte block size; DASD disks with 4 kB block size need 4 MB). After the respective daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.
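
To check the block size of your device (and thus the required partition size), you can query it with blockdev (replace /dev/SBD with the path of your shared device). For a standard SCSI disk this reports 512, for example:

root # blockdev --getss /dev/SBD
512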

The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.

The daemon constantly monitors connectivity to the storage device, and terminates itself if the partition becomes unreachable. This guarantees that it is never disconnected from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: the workload will terminate anyway if storage connectivity is lost.

Increased protection is offered through watchdog support. Modern systems support a hardware watchdog that needs to be tickled or fed by a software component. The software component (usually a daemon) regularly writes a service pulse to the watchdog—if the daemon stops feeding the watchdog, the hardware will enforce a system restart. This protects against failures of the SBD process itself, such as dying, or becoming stuck on an IO error.

If Pacemaker integration is activated, SBD will not self-fence if device majority is lost. For example, suppose your cluster contains three nodes: A, B, and C. Because of a network split, A can only see itself, while B and C can still communicate. In this case, there are two cluster partitions: one with quorum because it holds the majority (B, C), and one without (A). If this happens while the majority of SBD devices are unreachable, node A would immediately commit suicide, but nodes B and C would continue to run.
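
How Pacemaker integration is switched on depends on how the SBD daemon is started. As a hedged sketch, assuming your sbd version supports the -P option (check man sbd), it can be passed to the daemon via SBD_OPTS in /etc/sysconfig/sbd:

# Sketch of /etc/sysconfig/sbd; verify that your sbd version supports -P
SBD_DEVICE="/dev/SBD"
SBD_OPTS="-P"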

17.1.2 Number of SBD Devices

SBD supports the use of 1-3 devices:

One Device

The simplest implementation. It is appropriate for clusters where all of your data is on the same shared storage.

Two Devices

This configuration is primarily useful for environments that use host-based mirroring but where no third storage device is available. SBD will not terminate itself if it loses access to one mirror leg, allowing the cluster to continue. However, since SBD does not have enough knowledge to detect an asymmetric split of the storage, it will not fence the other side while only one mirror leg is available. Thus, it cannot automatically tolerate a second failure while one of the storage arrays is down.

Three Devices

The most reliable configuration. It is resilient against outages of one device, whether because of failure or maintenance. SBD will only terminate itself if more than one device is lost. Fencing messages can be successfully transmitted as long as at least two devices are still accessible.

This configuration is suitable for more complex scenarios where storage is not restricted to a single array. Host-based mirroring solutions can have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI.

17.1.3 Setting Up Storage-based Protection

The following steps are necessary to set up storage-based protection: creating the SBD partition, setting up the software watchdog, starting the SBD daemon, testing SBD, and configuring the fencing resource. They are described in detail in the following subsections.

All of the following procedures must be executed as root. Before you start, make sure the following requirements are met:

Important
Important: Requirements
  • The environment must have shared storage reachable by all nodes.

  • The shared storage segment must not use host-based RAID, cLVM2, or DRBD.

  • However, using storage-based RAID and multipathing is recommended for increased reliability.

17.1.3.1 Creating the SBD Partition

It is recommended to create a 1 MB partition at the start of the device. If your SBD device resides on a multipath group, you need to adjust the timeouts SBD uses, as MPIO's path down detection can cause some latency. After the msgwait timeout, the message is assumed to have been delivered to the node. For multipath, this should be the time required for MPIO to detect a path failure and switch to the next path. The node will terminate itself if the SBD daemon running on it has not updated the watchdog timer fast enough. Test your chosen timeouts in your specific environment, and pay special attention to the failover delays incurred if you use multipath storage with just one SBD device.
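
As a minimal sketch, assuming the shared LUN appears as /dev/sdc (a hypothetical device name), such a partition could be created with parted. Note that mklabel destroys any existing partition table on that device:

root # parted -s /dev/sdc mklabel msdos
root # parted -s /dev/sdc mkpart primary 1MiB 2MiB
root # partprobe /dev/sdc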

Note
Note: Device Name for SBD Partition

In the following, this SBD partition is referred to as /dev/SBD. Replace it with your actual path name, for example: /dev/sdc1.

Important
Important: Overwriting Existing Data

Make sure the device you want to use for SBD does not hold any data. The sbd command will overwrite the device without further requests for confirmation.

  1. Initialize the SBD device with the following command:

    root # sbd -d /dev/SBD create

    This will write a header to the device, and create slots for up to 255 nodes sharing this device with default timings.

    If you want to use more than one device for SBD, provide the devices by specifying the -d option multiple times, for example:

    root # sbd -d /dev/SBD1 -d /dev/SBD2 -d /dev/SBD3 create
  2. If your SBD device resides on a multipath group, adjust the timeouts SBD uses. This can be specified when the SBD device is initialized (all timeouts are given in seconds):

    root # /usr/sbin/sbd -d /dev/SBD -4 180 -1 90 create

    The -4 option is used to specify the msgwait timeout. In the example above, it is set to 180 seconds.

    The -1 option is used to specify the watchdog timeout. In the example above, it is set to 90 seconds.

  3. With the following command, check what has been written to the device:

    root # sbd -d /dev/SBD dump 
    Header version     : 2
    Number of slots    : 255
    Sector size        : 512
    Timeout (watchdog) : 5
    Timeout (allocate) : 2
    Timeout (loop)     : 1
    Timeout (msgwait)  : 10

As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.

17.1.3.2 Setting Up the Software Watchdog

The watchdog will protect the system against SBD failures, provided no other software uses it.

Important
Important: Accessing the Watchdog Timer

No other software must access the watchdog timer. Some hardware vendors ship system management software that uses the watchdog for system resets (for example, the HP ASR daemon). Disable such software if the watchdog is to be used by SBD.
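
To check whether another process already holds the watchdog device, a quick check (assuming the lsof package is installed) is:

root # lsof /dev/watchdog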

In SUSE Linux Enterprise High Availability Extension, watchdog support in the Kernel is enabled by default: it ships with several different Kernel modules that provide hardware-specific watchdog drivers. The High Availability Extension uses the SBD daemon as the software component that feeds the watchdog. If configured as described in Section 17.1.3.3, “Starting the SBD Daemon”, the SBD daemon will start automatically when the respective node is brought online with rcopenais start.

Usually, the appropriate watchdog driver for your hardware is automatically loaded during system boot. softdog is the most generic driver, but it is recommended to use a driver with actual hardware integration. For example:

  • On HP hardware, this is the hpwdt driver.

  • For systems with an Intel TCO, the iTCO_wdt driver can be used.

For a list of choices, refer to /usr/src/KERNEL_VERSION/drivers/watchdog. Alternatively, list the drivers that have been installed with your Kernel version with the following command:

root # rpm -ql kernel-VERSION | grep watchdog

As most watchdog driver names contain strings like wd, wdt, or dog, use the following command to check which driver is currently loaded:

root # lsmod | egrep "(wd|dog)"

To automatically load the watchdog driver, create the file /etc/modules-load.d/watchdog.conf containing a line with the driver name. For more information refer to the man page modules-load.d.
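
For example, a minimal sketch using the generic softdog driver (replace it with your hardware-specific driver where available):

root # echo softdog > /etc/modules-load.d/watchdog.conf
root # modprobe softdog
root # ls -l /dev/watchdog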

If you change the watchdog timeout, the other two values (msgwait and stonith-timeout) must be changed as well. The watchdog timeout depends mostly on your storage latency: the majority of devices must successfully finish their read operation within this time frame. If they do not, the node will self-fence.

The following formulas roughly express the relationship between these three values:

Example 17.1: Cluster Timings with SBD as STONITH Device
Timeout (msgwait) = (Timeout (watchdog) * 2)
stonith-timeout = Timeout (msgwait) + 20%

For example, if you set the watchdog timeout to 120, you need to set the msgwait timeout to 240 and the stonith-timeout to 288. You can check the values written to the device with sbd:

root # sbd -d /dev/sdb dump
==Dumping header on disk /dev/sdb
Header version     : 2.1
UUID               : 619127f4-0e06-434c-84a0-ea82036e144c
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 20
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 40
==Header on disk /dev/sdb is dumped
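
As a sketch, the timings from the example above (watchdog 120, msgwait 240) could be written to the device with the options introduced earlier, and the matching cluster property set with crmsh. Note that re-running create overwrites the existing SBD header:

root # sbd -d /dev/SBD -1 120 -4 240 create
root # crm configure property stonith-timeout="288s"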

If you set up a new cluster, the sleha-init command takes the above considerations into account.

For more details about timing variables related to SBD, see the technical information document SBD Operation Guidelines for HAE Clusters, available at https://www.suse.com/support/kb/doc.php?id=7011346.

17.1.3.3 Starting the SBD Daemon

The SBD daemon is a critical piece of the cluster stack. It needs to be running whenever the cluster stack is running, and even when parts of it have crashed, so that a crashed node can still be fenced.

  1. Run sleha-init. This script ensures that SBD is correctly configured and that the configuration file /etc/sysconfig/sbd is added to the list of files that need to be synchronized with Csync2.

    If you want to configure SBD manually, perform the following step:

    To make the OpenAIS init script start and stop SBD, edit the file /etc/sysconfig/sbd and search for the following line, replacing SBD with your SBD device:

    SBD_DEVICE="/dev/SBD"

    If you need to specify multiple devices, separate them by semicolons in this line (the order of the devices does not matter):

    SBD_DEVICE="/dev/SBD1; /dev/SBD2; /dev/SBD3"

    If the SBD device is not accessible, the daemon will fail to start and inhibit OpenAIS startup.

    Note
    Note: Starting Services at Boot Time

    If the SBD device becomes inaccessible from a node, this could cause the node to enter an infinite reboot cycle. This is technically correct behavior, but depending on your administrative policies, most likely a nuisance. In such cases, it may be better not to start OpenAIS automatically on boot (see the example after this procedure).

  2. Before proceeding, ensure that SBD has started on all nodes by executing rcopenais restart.
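
As mentioned in the note above, automatic start-up at boot can be disabled on SUSE Linux Enterprise 11 with chkconfig, for example:

root # chkconfig openais off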

17.1.3.4 Testing SBD

  1. The following command will dump the node slots and their current messages from the SBD device:

    root # sbd -d /dev/SBD list

    You should now see all cluster nodes that have ever been started with SBD listed here; the message slot should show clear. An example of the expected output is shown after this procedure.

  2. Try sending a test message to one of the nodes:

    root # sbd -d /dev/SBD message alice test
  3. The node will acknowledge the receipt of the message in the system log files:

    Aug 29 14:10:00 alice sbd: [13412]: info: Received command test from bob

    This confirms that SBD is indeed up and running on the node and that it is ready to receive messages.
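
As a rough illustration of the list output from step 1 (the host names alice and bob are just examples), it may look similar to the following:

root # sbd -d /dev/SBD list
0       alice   clear
1       bob     clear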

17.1.3.5 Configuring the Fencing Resource

To complete the SBD setup, configure SBD as a STONITH/fencing mechanism in the CIB.

Tip
Tip: STONITH Configuration for 2-Node Clusters

In two-node clusters (and other clusters where no-quorum-policy is set to ignore), mistimed fencing occurs quite frequently, because both nodes will try to fence each other in case of a split-brain situation. To avoid this double fencing, add the pcmk_delay_max parameter to the configuration of the STONITH resource. This gives servers with a working network card a better chance to survive.

  1. Log in to one of the nodes and start the interactive crmsh with crm configure.

  2. Enter the following:

    crm(live)configure# property stonith-enabled="true"
    crm(live)configure# property stonith-timeout="40s"
    crm(live)configure# primitive stonith_sbd stonith:external/sbd \
       pcmk_delay_max="30"

    The resource does not need to be cloned. As node slots are allocated automatically, no manual host list needs to be defined.

    Which value to set for stonith-timeout depends on the msgwait timeout. The msgwait timeout should be longer than the maximum allowed timeout for the underlying IO system. For example, this is 30 seconds for plain SCSI disks. Provided you set the msgwait timeout value to 30 seconds, setting stonith-timeout to 40 seconds is appropriate.

    The pcmk_delay_max parameter enables a random delay for STONITH actions on the fencing device. Its value specifies the maximum amount of time to wait before the start operation of the STONITH device is executed. As it takes time to detect the ring failure, become the DC, and to start the STONITH resource, do not set this value too low (otherwise the prior DC will always start the fencing action first).

  3. Commit the changes:

    crm(live)configure# commit
  4. Disable any other fencing devices you might have configured before, since the SBD mechanism is used for this function now.

After the resource has started, your cluster is successfully configured for shared-storage fencing and will use this method in case a node needs to be fenced.
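
To verify that the fencing resource is running, a quick check with crm_mon might look similar to this (the node name alice is just an example):

root # crm_mon -1 | grep stonith_sbd
 stonith_sbd    (stonith:external/sbd): Started alice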

17.1.3.6 For More Information

http://www.linux-ha.org/wiki/SBD_Fencing

17.2 Ensuring Exclusive Storage Activation

This section introduces sfex, an additional low-level mechanism to lock access to shared storage exclusively to one node. Note that sfex does not replace STONITH. Since sfex requires shared storage, it is recommended that the external/sbd fencing mechanism described above is used on another partition of the storage.

By design, sfex cannot be used with workloads that require concurrency (such as OCFS2), but serves as a layer of protection for classic failover style workloads. This is similar to a SCSI-2 reservation in effect, but more general.

17.2.1 Overview

In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.

Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker, and the sfex component ensures that even if Pacemaker were subject to a split brain situation, the lock will never be granted more than once.

These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.

17.2.2 Setup

In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks (the default is one) and needs 1 KB of storage space allocated per lock.

Important
Important: Requirements
  • The shared partition for sfex should be on the same logical unit as the data you want to protect.

  • The shared sfex partition must not use host-based RAID or DRBD.

  • Using a cLVM2 logical volume is possible.

Procedure 17.1: Creating an sfex Partition
  1. Create a shared partition for use with sfex. Note the name of this partition and use it as a substitute for /dev/sfex below.

  2. Create the sfex meta data with the following command:

    root # sfex_init -n 1 /dev/sfex
  3. Verify that the meta data has been created correctly:

    root # sfex_stat -i 1 /dev/sfex ; echo $?

    This should return 2, since the lock is not currently held.

Procedure 17.2: Configuring a Resource for the sfex Lock
  1. The sfex lock is represented via a resource in the CIB, configured as follows:

    crm(live)configure# primitive sfex_1 ocf:heartbeat:sfex \
          params device="/dev/sfex" index="1" collision_timeout="1" \
          lock_timeout="70" monitor_interval="10" \
          op monitor interval="10s" timeout="30s" on_fail="fence"
  2. To protect resources via an sfex lock, create mandatory ordering and placement constraints between the protectees and the sfex resource. If the resource to be protected has the id filesystem1:

    crm(live)configure# order order-sfex-1 inf: sfex_1 filesystem1
    crm(live)configure# colocation colo-sfex-1 inf: filesystem1 sfex_1
  3. If using group syntax, add the sfex resource as the first resource to the group:

    crm(live)configure# group LAMP sfex_1 filesystem1 apache ipaddr

17.3 For More Information

See http://www.linux-ha.org/wiki/SBD_Fencing and man sbd.
