Applies to SUSE Linux Enterprise High Availability Extension 12 SP5

11 Storage Protection and SBD

Abstract

SBD (STONITH Block Device) provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). This isolates the fencing mechanism from changes in firmware version or dependencies on specific firmware controllers. SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. Under certain conditions, it is also possible to use SBD without shared storage, by running it in diskless mode.

The ha-cluster-bootstrap scripts provide an automated way to set up a cluster with the option of using SBD as fencing mechanism. For details, see the Installation and Setup Quick Start. However, manually setting up SBD provides you with more options regarding the individual settings.

This chapter explains the concepts behind SBD. It guides you through configuring the components needed by SBD to protect your cluster from potential data corruption in case of a split brain scenario.

In addition to node level fencing, you can use additional mechanisms for storage protection, such as LVM2 exclusive activation or OCFS2 file locking support (resource level fencing). They protect your system against administrative or application faults.

11.1 Conceptual Overview

SBD expands to Storage-Based Death or STONITH Block Device.

The highest priority of the High Availability cluster stack is to protect the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage. The cluster stack takes care of this using several control mechanisms.

However, network partitioning or software malfunction could potentially cause scenarios where several designated coordinators (DCs) are elected in a cluster. If this so-called split brain scenario were allowed to unfold, data corruption might occur.

Node fencing via STONITH is the primary mechanism to prevent this. Using SBD as a node fencing mechanism is one way of shutting down nodes without using an external power off device in case of a split brain scenario.

SBD Components and Mechanisms
SBD Partition

In an environment where all nodes have access to shared storage, a small partition of the device is formatted for use with SBD. The size of the partition depends on the block size of the used disk (for example, 1 MB for standard SCSI disks with 512 byte block size or 4 MB for DASD disks with 4 kB block size). The initialization process creates a message layout on the device with slots for up to 255 nodes.

SBD Daemon

After the respective SBD daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.

Messages

The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.

Also, the daemon constantly monitors connectivity to the storage device, and terminates itself in case the partition becomes unreachable. This guarantees that it is not disconnected from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: The workload will terminate anyway if the storage connectivity has been lost.

Watchdog

Whenever SBD is used, a correctly working watchdog is crucial. Modern systems support a hardware watchdog that needs to be tickled or fed by a software component. The software component (in this case, the SBD daemon) feeds the watchdog by regularly writing a service pulse to the watchdog. If the daemon stops feeding the watchdog, the hardware will enforce a system restart. This protects against failures of the SBD process itself, such as dying, or becoming stuck on an I/O error.

If Pacemaker integration is activated, SBD will not self-fence if device majority is lost. For example, your cluster contains three nodes: A, B, and C. Because of a network split, A can only see itself while B and C can still communicate. In this case, there are two cluster partitions: one with quorum because of being the majority (B, C), and one without (A). If this happens while the majority of fencing devices are unreachable, node A would immediately self-fence, but nodes B and C would continue to run.

11.2 Overview of Manually Setting Up SBD

The following steps are necessary to manually set up storage-based fencing. They must be executed as root. Before you start, check Section 11.3, “Requirements”.

  1. Setting Up the Watchdog

  2. Depending on your scenario, either use SBD with one to three devices or in diskless mode. For an outline, see Section 11.4, “Number of SBD Devices”. The detailed setup is described in Section 11.7, “Setting Up SBD with Devices” and Section 11.8, “Setting Up Diskless SBD”.

  3. Testing SBD and Fencing

11.3 Requirements

  • You can use up to three SBD devices for storage-based fencing. When using one to three devices, the shared storage must be accessible from all nodes.

  • The path to the shared storage device must be persistent and consistent across all nodes in the cluster. Use stable device names such as /dev/disk/by-id/dm-uuid-part1-mpath-abcedf12345.

  • The shared storage can be connected via Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), or even iSCSI. In virtualized environments, the hypervisor might provide shared block devices. In any case, content on that shared block device needs to be consistent for all cluster nodes. Make sure that caching does not break that consistency.

  • The shared storage segment must not use host-based RAID, LVM2, or DRBD. DRBD can be split, which is problematic for SBD, as there cannot be two states in SBD. Cluster multi-device (Cluster MD) cannot be used for SBD.

  • However, using storage-based RAID and multipathing is recommended for increased reliability.

  • An SBD device can be shared between different clusters, as long as no more than 255 nodes share the device.

  • For clusters with more than two nodes, you can also use SBD in diskless mode.

11.4 Number of SBD Devices

SBD supports the use of up to three devices:

One Device

The simplest implementation. It is appropriate for clusters where all of your data is on the same shared storage.

Two Devices

This configuration is primarily useful for environments that use host-based mirroring but where no third storage device is available. SBD will not terminate itself if it loses access to one mirror leg, allowing the cluster to continue. However, since SBD does not have enough knowledge to detect an asymmetric split of the storage, it will not fence the other side while only one mirror leg is available. Thus, it cannot automatically tolerate a second failure while one of the storage arrays is down.

Three Devices

The most reliable configuration. It is resilient against outages of one device—be it because of failures or maintenance. SBD will terminate itself only if more than one device is lost and if required, depending on the status of the cluster partition or node. If at least two devices are still accessible, fencing messages can be successfully transmitted.

This configuration is suitable for more complex scenarios where storage is not restricted to a single array. Host-based mirroring solutions can have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI.

Diskless

This configuration is useful if you want a fencing mechanism without shared storage. In this diskless mode, SBD fences nodes by using the hardware watchdog without relying on any shared device. However, diskless SBD cannot handle a split brain scenario for a two-node cluster. Therefore, three or more nodes are required for using diskless SBD.

11.5 Calculation of Timeouts

When using SBD as a fencing mechanism, it is vital to consider the timeouts of all components, because they depend on each other.

Watchdog Timeout

This timeout is set during initialization of the SBD device. It depends mostly on your storage latency. The majority of devices must be successfully read within this time. Otherwise, the node might self-fence.

Note: Multipath or iSCSI Setup

If your SBD device(s) reside on a multipath setup or iSCSI, the timeout should be set to the time required to detect a path failure and switch to the next path.

This also means that in /etc/multipath.conf the value of max_polling_interval must be less than the watchdog timeout.

msgwait Timeout

This timeout is set during initialization of the SBD device. It defines the time after which a message written to a node's slot on the SBD device is considered delivered. The timeout should be long enough for the node to detect that it needs to self-fence.

However, if the msgwait timeout is relatively long, a fenced cluster node might rejoin before the fencing action returns. This can be mitigated by setting the SBD_DELAY_START parameter in the SBD configuration, as described in Procedure 11.4 in Step 4.

stonith-timeout in the CIB

This timeout is set in the CIB as a global cluster property. It defines how long to wait for the STONITH action (reboot, on, off) to complete.

stonith-watchdog-timeout in the CIB

This timeout is set in the CIB as a global cluster property. If not set explicitly, it defaults to 0, which is appropriate for using SBD with one to three devices. For use of SBD in diskless mode, see Procedure 11.8, “Configuring Diskless SBD” for more details.

If you change the watchdog timeout, you need to adjust the other two timeouts as well. The following formula expresses the relationship between these three values:

Example 11.1: Formula for Timeout Calculation
Timeout (msgwait) >= (Timeout (watchdog) * 2)
stonith-timeout = Timeout (msgwait) + 20%

For example, if you set the watchdog timeout to 120, set the msgwait timeout to 240 and the stonith-timeout to 288.
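
The arithmetic behind Example 11.1 can be sketched as a small shell function. This is only an illustration: the function name sbd_timeouts is hypothetical, it picks the smallest msgwait the formula allows (exactly twice the watchdog timeout), and it rounds the 20% surcharge up to a whole second.

```shell
#!/bin/sh
# Derive msgwait and stonith-timeout from a watchdog timeout (all in seconds),
# following Example 11.1:
#   msgwait >= 2 * watchdog
#   stonith-timeout = msgwait + 20%
# sbd_timeouts is a hypothetical helper name, not part of any SBD tooling.
sbd_timeouts() {
    watchdog=$1
    # Smallest msgwait that satisfies the formula.
    msgwait=$((watchdog * 2))
    # Integer arithmetic: add 20% and round up so the value is never too small.
    stonith=$(( (msgwait * 120 + 99) / 100 ))
    echo "watchdog=$watchdog msgwait=$msgwait stonith-timeout=$stonith"
}

sbd_timeouts 120
# prints: watchdog=120 msgwait=240 stonith-timeout=288
```

The values printed for a watchdog timeout of 120 match the example above. Larger values for msgwait are also valid as long as stonith-timeout is raised accordingly.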

If you use the ha-cluster-bootstrap scripts to set up a cluster and to initialize the SBD device, the relationship between these timeouts is automatically considered.

11.6 Setting Up the Watchdog

SUSE Linux Enterprise High Availability Extension ships with several kernel modules that provide hardware-specific watchdog drivers. For a list of the most commonly used ones, see Table 11.1, “Commonly Used Watchdog Drivers”.

For clusters in production environments we recommend using a hardware-specific watchdog driver. However, if no watchdog matches your hardware, softdog can be used as kernel watchdog module.

The High Availability Extension uses the SBD daemon as the software component that feeds the watchdog.

11.6.1 Using a Hardware Watchdog

Finding the right watchdog kernel module for a given system is not trivial. Automatic probing fails very often. As a result, lots of modules are already loaded before the right one gets a chance.

Table 11.1 lists the most commonly used watchdog drivers. If your hardware is not listed there, the directory /lib/modules/KERNEL_VERSION/kernel/drivers/watchdog gives you a list of choices, too. Alternatively, ask your hardware or system vendor for details on system specific watchdog configuration.

Table 11.1: Commonly Used Watchdog Drivers

Hardware                        Driver
HP                              hpwdt
Dell, Supermicro, Lenovo        iTCO_wdt
Fujitsu                         ipmi_watchdog
VM on z/VM on IBM mainframe     vmwatchdog
Xen VM (DomU)                   xen_wdt
Generic                         softdog

Important: Accessing the Watchdog Timer

Some hardware vendors ship systems management software that uses the watchdog for system resets (for example, HP ASR daemon). If the watchdog is used by SBD, disable such software. No other software must access the watchdog timer.

Procedure 11.1: Loading the Correct Kernel Module

To make sure the correct watchdog module is loaded, proceed as follows:

  1. List the drivers that have been installed with your kernel version:

    root # rpm -ql kernel-VERSION | grep watchdog
  2. List any watchdog modules that are currently loaded in the kernel:

    root # lsmod | egrep "(wd|dog)"
  3. If you get a result, unload the wrong module:

    root # rmmod WRONG_MODULE
  4. Enable the watchdog module that matches your hardware:

    root # echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf
    root # systemctl restart systemd-modules-load
  5. Test whether the watchdog module is loaded correctly:

    root # lsmod | egrep "(wd|dog)"
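
As a rough sketch, the check from steps 2 and 5 can be narrowed so that only the module name column is matched, not the size or "Used by" columns. The function name loaded_watchdog_modules and the sample input are made up for illustration. The name pattern is the same heuristic as in the procedure and may miss drivers with unusual names.

```shell
#!/bin/sh
# Print loaded modules whose name looks like a watchdog driver.
# Reads lsmod-style text on standard input and skips the header line.
# loaded_watchdog_modules is a hypothetical helper name.
loaded_watchdog_modules() {
    awk 'NR > 1 && $1 ~ /(wd|dog)/ { print $1 }'
}

# Sample input, invented for illustration. On a real node, run:
#   lsmod | loaded_watchdog_modules
loaded_watchdog_modules <<'EOF'
Module                  Size  Used by
softdog                16384  0
ext4                  737280  1
iTCO_wdt               16384  0
EOF
```

If more than one module is printed, the procedure above applies: unload the one that does not match your hardware.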

11.6.2 Using the Software Watchdog (softdog)

For clusters in production environments we recommend using a hardware-specific watchdog driver. However, if no watchdog matches your hardware, softdog can be used as kernel watchdog module.

Important: Softdog Limitations

The softdog driver assumes that at least one CPU is still running. If all CPUs are stuck, the code in the softdog driver that should reboot the system will never be executed. In contrast, hardware watchdogs keep working even if all CPUs are stuck.

Procedure 11.2: Loading the Softdog Kernel Module
  1. Enable the softdog driver:

    root # echo softdog > /etc/modules-load.d/watchdog.conf
  2. Restart the systemd-modules-load service to load the module:

    root # systemctl restart systemd-modules-load
  3. Test whether the softdog watchdog module is loaded correctly:

    root # lsmod | grep softdog

11.7 Setting Up SBD with Devices

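The following steps are necessary for setup:

  1. Initializing the SBD Devices (Procedure 11.3)

  2. Editing the SBD Configuration File (Procedure 11.4)

  3. Enabling and Starting the SBD Service (Procedure 11.5)

  4. Testing the SBD Devices (Procedure 11.6)

  5. Configuring the Cluster to Use SBD (Procedure 11.7)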

Before you start, make sure the block device or devices you want to use for SBD meet the requirements specified in Section 11.3.

When setting up the SBD devices, you need to take several timeout values into account. For details, see Section 11.5, “Calculation of Timeouts”.

The node will terminate itself if the SBD daemon running on it has not updated the watchdog timer fast enough. After having set the timeouts, test them in your specific environment.

Procedure 11.3: Initializing the SBD Devices

To use SBD with shared storage, you must first create the messaging layout on one to three block devices. The sbd create command will write a metadata header to the specified device or devices. It will also initialize the messaging slots for up to 255 nodes. If executed without any further options, the command will use the default timeout settings.

Warning: Overwriting Existing Data

Make sure the device or devices you want to use for SBD do not hold any important data. When you execute the sbd create command, roughly the first megabyte of the specified block devices will be overwritten without further requests or backup.

  1. Decide which block device or block devices to use for SBD.

  2. Initialize the SBD device with the following command:

    root # sbd -d /dev/SBD create

    (Replace /dev/SBD with your actual path name, for example: /dev/disk/by-id/scsi-ST2000DM001-0123456_Wabcdefg.)

    To use more than one device for SBD, specify the -d option multiple times, for example:

    root # sbd -d /dev/SBD1 -d /dev/SBD2 -d /dev/SBD3 create
  3. If your SBD device resides on a multipath group, use the -1 and -4 options to adjust the timeouts to use for SBD. For details, see Section 11.5, “Calculation of Timeouts”. All timeouts are given in seconds:

    root # sbd -d /dev/SBD -4 180 -1 60 create

    The -4 option is used to specify the msgwait timeout. In the example above, it is set to 180 seconds.

    The -1 option is used to specify the watchdog timeout. In the example above, it is set to 60 seconds. The minimum allowed value for the emulated watchdog is 15 seconds.

  4. Check what has been written to the device:

    root # sbd -d /dev/SBD dump
    Header version     : 2.1
    UUID               : 619127f4-0e06-434c-84a0-ea82036e144c
    Number of slots    : 255
    Sector size        : 512
    Timeout (watchdog) : 60
    Timeout (allocate) : 2
    Timeout (loop)     : 1
    Timeout (msgwait)  : 180
    ==Header on disk /dev/SBD is dumped

    As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.
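
The dumped header can also be checked against the rule from Section 11.5 (msgwait at least twice the watchdog timeout). The following is a minimal sketch that assumes the dump format shown above. The function name check_sbd_dump is hypothetical.

```shell
#!/bin/sh
# Check the timeouts in "sbd dump" output against the rule
#   Timeout (msgwait) >= 2 * Timeout (watchdog)
# Reads the dump text on standard input.
# Returns 0 if the rule holds, 1 if it does not or if a value is missing.
# check_sbd_dump is a hypothetical helper name.
check_sbd_dump() {
    awk -F': *' '
        /^ *Timeout \(watchdog\)/ { wd = $2 }
        /^ *Timeout \(msgwait\)/  { mw = $2 }
        END {
            if (wd == "" || mw == "") { print "timeouts not found"; exit 1 }
            if (mw + 0 >= 2 * wd) { print "ok: msgwait=" mw " watchdog=" wd; exit 0 }
            print "too short: msgwait=" mw " watchdog=" wd
            exit 1
        }'
}

# Sample input taken from the dump above. On a real node, run:
#   sbd -d /dev/SBD dump | check_sbd_dump
check_sbd_dump <<'EOF'
Header version     : 2.1
Number of slots    : 255
Timeout (watchdog) : 60
Timeout (msgwait)  : 180
EOF
# prints: ok: msgwait=180 watchdog=60
```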

After you have initialized the SBD devices, edit the SBD configuration file, then enable and start the respective services for the changes to take effect.

Procedure 11.4: Editing the SBD Configuration File
  1. Open the file /etc/sysconfig/sbd.

  2. Search for the following parameter: SBD_DEVICE.

    It specifies the devices to monitor and to use for exchanging SBD messages.

  3. Edit this line by replacing SBD with your SBD device:

    SBD_DEVICE="/dev/SBD"

    If you need to specify multiple devices in the first line, separate them with semicolons (the order of the devices does not matter):

    SBD_DEVICE="/dev/SBD1; /dev/SBD2; /dev/SBD3"

    If the SBD device is not accessible, the daemon will fail to start and inhibit cluster start-up.

  4. Search for the following parameter: SBD_DELAY_START.

    Enables or disables a delay. Set SBD_DELAY_START to yes if msgwait is relatively long, but your cluster nodes boot very fast. Setting this parameter to yes delays the start of SBD on boot. This is sometimes necessary with virtual machines.

After you have added your SBD devices to the SBD configuration file, enable the SBD daemon. The SBD daemon is a critical piece of the cluster stack. It needs to be running when the cluster stack is running. Thus, the sbd service is started as a dependency whenever the pacemaker service is started.

Procedure 11.5: Enabling and Starting the SBD Service
  1. On each node, enable the SBD service:

    root # systemctl enable sbd

    It will be started together with the Corosync service whenever the Pacemaker service is started.

  2. Restart the cluster stack on each node:

    root # systemctl stop pacemaker
    root # systemctl start pacemaker

    This automatically triggers the start of the SBD daemon.

As a next step, test the SBD devices as described in Procedure 11.6.

Procedure 11.6: Testing the SBD Devices
  1. The following command will dump the node slots and their current messages from the SBD device:

    root # sbd -d /dev/SBD list

    Now you should see all cluster nodes that have ever been started with SBD listed here. For example, if you have a two-node cluster, the message slot should show clear for both nodes:

    0       alice        clear
    1       bob          clear
  2. Try sending a test message to one of the nodes:

    root # sbd -d /dev/SBD message alice test
  3. The node will acknowledge the receipt of the message in the system log files:

    May 03 16:08:31 alice sbd[66139]: /dev/SBD: notice: servant: Received command test from bob on disk /dev/SBD

    This confirms that SBD is indeed up and running on the node and that it is ready to receive messages.
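
To spot slots that still hold a message, the list output can be filtered. This sketch assumes the three-column layout shown in step 1 (slot number, node name, message). The function name pending_sbd_messages is hypothetical.

```shell
#!/bin/sh
# Report every node whose SBD slot holds a message other than "clear".
# Reads "sbd list" output (slot, node name, message) on standard input.
# Returns 0 if every slot is clear, 1 otherwise.
# pending_sbd_messages is a hypothetical helper name.
pending_sbd_messages() {
    awk 'NF >= 3 && $3 != "clear" { print $2 ": " $3; bad = 1 }
         END { exit bad }'
}

# Sample input taken from the listing above. On a real node, run:
#   sbd -d /dev/SBD list | pending_sbd_messages
pending_sbd_messages <<'EOF' && echo "all slots clear"
0       alice        clear
1       bob          clear
EOF
```

A node that still shows the test message from step 2 would be reported here until its slot returns to clear.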

As a final step, you need to adjust the cluster configuration as described in Procedure 11.7.

Procedure 11.7: Configuring the Cluster to Use SBD

To configure the use of SBD in the cluster, you need to do the following in the cluster configuration:

  • Set the stonith-timeout parameter to a value that matches your setting.

  • Configure the SBD STONITH resource.

For the calculation of the stonith-timeout refer to Section 11.5, “Calculation of Timeouts”.

  1. Start a shell and log in as root or equivalent.

  2. Run crm configure.

  3. Enter the following:

    crm(live)configure# property stonith-enabled="true"
    crm(live)configure# property stonith-watchdog-timeout=0
    crm(live)configure# property stonith-timeout="220s"

    stonith-enabled

    This is the default configuration, because clusters without STONITH are not supported. But in case STONITH has been deactivated for testing purposes, make sure this parameter is set to true again.

    stonith-watchdog-timeout

    If not explicitly set, this value defaults to 0, which is appropriate for use of SBD with one to three devices.

    stonith-timeout

    A stonith-timeout value of 220 would be appropriate if the msgwait timeout value for SBD was set to 180 seconds, as in the example in Procedure 11.3 (180 seconds plus 20% is 216 seconds, rounded up here).

  4. For a two-node cluster, decide if you want predictable or random delays. For other cluster setups you do not need to set this parameter.

    Predictable Static Delays

    This parameter enables a static delay before executing STONITH actions. It ensures that the nodes do not fence each other if separate fencing resources and different delay values are being used. The targeted node will lose in a fencing race. The parameter can be used to mark a specific node to survive in case of a split brain scenario in a two-node cluster. To make this succeed, it is essential to create two primitive STONITH devices, one for each node. In the following configuration, alice will win and survive in case of a split brain scenario:

    crm(live)configure# primitive st-sbd-alice stonith:external/sbd params \
           pcmk_host_list=alice pcmk_delay_base=20
    crm(live)configure# primitive st-sbd-bob stonith:external/sbd params \
           pcmk_host_list=bob pcmk_delay_base=0
    Dynamic Random Delays

    This parameter prevents double fencing when using slow devices such as SBD. It adds a random delay for STONITH actions on the fencing device. It is especially important for two-node clusters where otherwise both nodes might try to fence each other in case of a split brain scenario.

    crm(live)configure# primitive stonith_sbd stonith:external/sbd \
      params pcmk_delay_max=30
  5. Review your changes with show.

  6. Submit your changes with commit and leave the crm live configuration with exit.

After the resource has started, your cluster is successfully configured for use of SBD. It will use this method in case a node needs to be fenced.

11.8 Setting Up Diskless SBD

SBD can be operated in a diskless mode. In this mode, a watchdog device will be used to reset the node in the following cases: if it loses quorum, if any monitored daemon is lost and not recovered, or if Pacemaker decides that the node requires fencing. Diskless SBD is based on self-fencing of a node, depending on the status of the cluster, the quorum and some reasonable assumptions. No STONITH SBD resource primitive is needed in the CIB.

Important: Number of Cluster Nodes

Do not use diskless SBD as a fencing mechanism for two-node clusters. Use it only in clusters with three or more nodes. SBD in diskless mode cannot handle split brain scenarios for two-node clusters.

Procedure 11.8: Configuring Diskless SBD
  1. Open the file /etc/sysconfig/sbd and use the following entries:

    SBD_PACEMAKER=yes
    SBD_STARTMODE=always
    SBD_DELAY_START=no
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    The SBD_DEVICE entry is not needed as no shared disk is used. When this parameter is missing, the sbd service does not start any watcher process for SBD devices.

  2. On each node, enable the SBD service:

    root # systemctl enable sbd

    It will be started together with the Corosync service whenever the Pacemaker service is started.

  3. Restart the cluster stack on each node:

    root # systemctl stop pacemaker
    root # systemctl start pacemaker

    This automatically triggers the start of the SBD daemon.

  4. Check if the parameter have-watchdog=true has been automatically set:

    root # crm configure show | grep have-watchdog
             have-watchdog=true
  5. Run crm configure and set the following cluster properties on the crm shell:

    crm(live)configure# property stonith-enabled="true"
    crm(live)configure# property stonith-watchdog-timeout=10

    stonith-enabled

    This is the default configuration, because clusters without STONITH are not supported. But in case STONITH has been deactivated for testing purposes, make sure this parameter is set to true again.

    stonith-watchdog-timeout

    For diskless SBD, this parameter must not equal zero. It defines after how long it is assumed that the fencing target has already self-fenced. Therefore its value needs to be >= the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. Starting with SUSE Linux Enterprise High Availability Extension 15, if you set stonith-watchdog-timeout to a negative value, Pacemaker will automatically calculate this timeout and set it to twice the value of SBD_WATCHDOG_TIMEOUT.

  6. Review your changes with show.

  7. Submit your changes with commit and leave the crm live configuration with exit.
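
The relationship between SBD_WATCHDOG_TIMEOUT and stonith-watchdog-timeout can be sketched as follows. The function name suggest_stonith_watchdog_timeout is hypothetical. Doubling the watchdog timeout is a choice that matches the example values in this procedure (5 and 10); the only stated requirement is that the result is not smaller than SBD_WATCHDOG_TIMEOUT.

```shell
#!/bin/sh
# Suggest a stonith-watchdog-timeout for diskless SBD: twice the value of
# SBD_WATCHDOG_TIMEOUT found in sysconfig-style text on standard input.
# Accepts both SBD_WATCHDOG_TIMEOUT=5 and SBD_WATCHDOG_TIMEOUT="5".
# suggest_stonith_watchdog_timeout is a hypothetical helper name.
suggest_stonith_watchdog_timeout() {
    wd=$(sed -n 's/^SBD_WATCHDOG_TIMEOUT="\{0,1\}\([0-9][0-9]*\)"\{0,1\}.*$/\1/p' | tail -n 1)
    if [ -z "$wd" ]; then
        echo "SBD_WATCHDOG_TIMEOUT not set" >&2
        return 1
    fi
    echo $((wd * 2))
}

# Sample input based on step 1. On a real node, run:
#   suggest_stonith_watchdog_timeout < /etc/sysconfig/sbd
suggest_stonith_watchdog_timeout <<'EOF'
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_WATCHDOG_TIMEOUT=5
EOF
# prints: 10
```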

11.9 Testing SBD and Fencing

To test whether SBD works as expected for node fencing purposes, use one or all of the following methods:

Manually Triggering Fencing of a Node

To trigger a fencing action for node NODENAME:

root # crm node fence NODENAME

Check if the node is fenced and if the other nodes consider the node as fenced after the stonith-watchdog-timeout.

Simulating an SBD Failure
  1. Identify the process ID of the SBD inquisitor:

    root # systemctl status sbd
    ● sbd.service - Shared-storage based fencing daemon
    
       Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2018-04-17 15:24:51 CEST; 6 days ago
         Docs: man:sbd(8)
      Process: 1844 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
     Main PID: 1859 (sbd)
        Tasks: 4 (limit: 4915)
       CGroup: /system.slice/sbd.service
               ├─1859 sbd: inquisitor
    [...]
  2. Simulate an SBD failure by terminating the SBD inquisitor process (in our example, the process ID of the SBD inquisitor is 1859):

    root # kill -9 1859

    The node proactively self-fences. The other nodes notice the loss of the node and consider it has self-fenced after the stonith-watchdog-timeout.

Triggering Fencing through a Monitor Operation Failure

With a normal configuration, a failure of a resource stop operation will trigger fencing. To trigger fencing manually, you can produce a failure of a resource stop operation. Alternatively, you can temporarily change the configuration of a resource monitor operation and produce a monitor failure as described below:

  1. Configure an on-fail=fence property for a resource monitor operation:

    op monitor interval=10 on-fail=fence
  2. Let the monitoring operation fail (for example, by terminating the respective daemon, if the resource relates to a service).

    This failure triggers a fencing action.

11.10 Additional Mechanisms for Storage Protection

Apart from node fencing via STONITH there are other methods to achieve storage protection at a resource level. For example, SCSI-3 and SCSI-4 use persistent reservations, whereas sfex provides a locking mechanism. Both methods are explained in the following subsections.

11.10.1 Configuring an sg_persist Resource

The SCSI specifications 3 and 4 define persistent reservations. These are SCSI protocol features and can be used for I/O fencing and failover. This feature is implemented in the sg_persist Linux command.

Note: SCSI Disk Compatibility

Any backing disks for sg_persist must be SCSI disk compatible. sg_persist only works for devices like SCSI disks or iSCSI LUNs. Do not use it for IDE, SATA, or any block devices which do not support the SCSI protocol.

Before you proceed, check if your disk supports persistent reservations. Use the following command (replace DISK with your device name):

root # sg_persist -n --in --read-reservation -d /dev/DISK

The result shows whether your disk supports persistent reservations:

  • Supported disk:

    PR generation=0x0, there is NO reservation held
  • Unsupported disk:

    PR in (Read reservation): command not supported
    Illegal request, Invalid opcode
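
When checking several disks, the two results can be told apart in a script. This is a sketch that relies only on the "PR generation=" line shown for a supported disk; the function name supports_persistent_reservations is hypothetical.

```shell
#!/bin/sh
# Decide from sg_persist output whether a disk supports persistent
# reservations. Reads the command output on standard input and returns 0
# if a "PR generation=" line is present, 1 otherwise.
# supports_persistent_reservations is a hypothetical helper name.
supports_persistent_reservations() {
    grep -q '^ *PR generation='
}

# Sample input taken from the "Supported disk" result above. On a real node:
#   sg_persist -n --in --read-reservation -d /dev/DISK 2>&1 | supports_persistent_reservations
if supports_persistent_reservations <<'EOF'
  PR generation=0x0, there is NO reservation held
EOF
then
    echo "persistent reservations supported"
else
    echo "persistent reservations not supported"
fi
```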

If you get an error message (like the one above), replace the old disk with a SCSI-compatible disk. Otherwise proceed as follows:

  1. To create the primitive resource sg_persist, run the following commands as root:

    root # crm configure
    crm(live)configure# primitive sg sg_persist \
        params devs="/dev/sdc" reservation_type=3 \
        op monitor interval=60 timeout=60
  2. Add the sg_persist primitive to a master-slave group:

    crm(live)configure# ms ms-sg sg \
        meta master-max=1 notify=true
  3. Do some tests. When the resource is in master/slave status, you can mount and write on /dev/sdc1 on the cluster node where the master instance is running, while you cannot write on the cluster node where the slave instance is running.

  4. Add a file system primitive for Ext4:

    crm(live)configure# primitive ext4 ocf:heartbeat:Filesystem \
        params device="/dev/sdc1" directory="/mnt/ext4" fstype=ext4
  5. Add the following order relationship plus a colocation between the sg_persist master and the file system resource:

    crm(live)configure# order o-ms-sg-before-ext4 inf: ms-sg:promote ext4:start
    crm(live)configure# colocation col-ext4-with-sg-persist inf: ext4 ms-sg:Master
  6. Check all your changes with the show command.

  7. Commit your changes.

For more information, refer to the sg_persist man page.

11.10.2 Ensuring Exclusive Storage Activation with sfex

This section introduces sfex, an additional low-level mechanism to lock access to shared storage exclusively to one node. Note that sfex does not replace STONITH. As sfex requires shared storage, it is recommended that the SBD node fencing mechanism described above is used on another partition of the storage.

By design, sfex cannot be used with workloads that require concurrency (such as OCFS2). It serves as a layer of protection for classic failover style workloads. This is similar to an SCSI-2 reservation in effect, but more general.

11.10.2.1 Overview

In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.

Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker. The sfex component ensures that even if Pacemaker were subject to a split brain situation, the lock will never be granted more than once.

These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.

11.10.2.2 Setup

In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks, and needs 1 KB of storage space allocated per lock. By default, sfex_init creates one lock on the partition.

Important: Requirements
  • The shared partition for sfex should be on the same logical unit as the data you want to protect.

  • The shared sfex partition must not use host-based RAID, nor DRBD.

  • Using an LVM2 logical volume is possible.

Procedure 11.9: Creating an sfex Partition
  1. Create a shared partition for use with sfex. Note the name of this partition and use it as a substitute for /dev/sfex below.

  2. Create the sfex metadata with the following command:

    root # sfex_init -n 1 /dev/sfex
  3. Verify that the metadata has been created correctly:

    root # sfex_stat -i 1 /dev/sfex ; echo $?

    This should return 2, since the lock is not currently held.

Procedure 11.10: Configuring a Resource for the sfex Lock
  1. The sfex lock is represented via a resource in the CIB, configured as follows:

    crm(live)configure# primitive sfex_1 ocf:heartbeat:sfex \
          params device="/dev/sfex" index="1" collision_timeout="1" \
          lock_timeout="70" monitor_interval="10" \
          op monitor interval="10s" timeout="30s" on-fail="fence"
  2. To protect resources via an sfex lock, create mandatory ordering and placement constraints between the resources to protect and the sfex resource. If the resource to be protected has the ID filesystem1:

    crm(live)configure# order order-sfex-1 inf: sfex_1 filesystem1
    crm(live)configure# colocation col-sfex-1 inf: filesystem1 sfex_1
  3. If using group syntax, add the sfex resource as the first resource to the group:

    crm(live)configure# group LAMP sfex_1 filesystem1 apache ipaddr

11.11 For More Information
