Applies to SUSE Linux Enterprise High Availability 15 SP5

25 Cluster multi-device (Cluster MD)

The cluster multi-device (Cluster MD) is a software-based RAID storage solution for a cluster. Currently, Cluster MD provides the redundancy of RAID1 mirroring to the cluster. With SUSE Linux Enterprise High Availability 15 SP5, RAID10 is included as a technology preview. To try RAID10, replace mirror with 10 in the related mdadm command (that is, use --level=10 instead of --level=mirror). This chapter shows you how to create and use Cluster MD.

25.1 Conceptual overview

Cluster MD provides support for using RAID1 across a cluster environment. The disks or devices used by Cluster MD are accessed by each node. If one device of the Cluster MD fails, it can be replaced at runtime by another device, which is then re-synced to provide the same amount of redundancy. Cluster MD requires Corosync and the Distributed Lock Manager (DLM) for coordination and messaging.

Unlike regular MD devices, a Cluster MD device is not started automatically on boot. A clustered device needs to be started by a resource agent, which ensures that the DLM resource has been started first.

25.2 Creating a clustered MD RAID device

Requirements

A running cluster with Pacemaker.
A resource agent for DLM (see Section 20.2, “Configuring DLM cluster resources”).
At least two shared disk devices. You can use an additional device as a spare that fails over automatically in case of device failure.
The package cluster-md-kmp-default, installed on the cluster nodes.

Warning: Always use persistent device names

Always use cluster-wide persistent device names, such as /dev/disk/by-id/DEVICE_ID. Unstable device names like /dev/sdX or /dev/dm-X might become mismatched on different nodes, causing major problems across the cluster.
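
To see which persistent names point at a given kernel device on a node, a lookup along the following lines may help. This is a minimal sketch: the function name by_id_links and its optional directory argument are illustrative and not part of any SUSE tool.

```shell
# by_id_links DEVICE [DIR]   (hypothetical helper, not a SUSE tool)
# Print every symlink in DIR (default: /dev/disk/by-id) that resolves to DEVICE.
by_id_links() {
    target=$(readlink -f "$1") || return 1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue            # skip anything that is not a symlink
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Example:
#   by_id_links /dev/sdb
```

Running the lookup on every node and comparing the results is one way to confirm that the same /dev/disk/by-id name exists cluster-wide, even if the /dev/sdX name differs between nodes.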

Make sure the DLM resource is up and running on every node of the cluster and check the resource status with the command:
```
# crm_resource -r dlm -W
```
Create the Cluster MD device:
- If you do not have an existing normal RAID device, create the Cluster MD device on the node running the DLM resource with the following command:
```
# mdadm --create /dev/md0 --bitmap=clustered \
--metadata=1.2 --raid-devices=2 --level=mirror \
/dev/disk/by-id/DEVICE_ID1 /dev/disk/by-id/DEVICE_ID2
```
  As Cluster MD only works with version 1.2 of the metadata, it is recommended to specify the version using the --metadata option. For other useful options, refer to the man page of mdadm. Monitor the progress of the re-sync in /proc/mdstat.
- If you already have an existing normal RAID, first clear the existing bitmap and then create the clustered bitmap:
```
# mdadm --grow /dev/mdX --bitmap=none
# mdadm --grow /dev/mdX --bitmap=clustered
```
- Optionally, to create a Cluster MD device with a spare device for automatic failover, run the following command on one cluster node:
```
# mdadm --create /dev/md0 --bitmap=clustered --raid-devices=2 \
--level=mirror --spare-devices=1 --metadata=1.2 \
/dev/disk/by-id/DEVICE_ID1 /dev/disk/by-id/DEVICE_ID2 /dev/disk/by-id/DEVICE_ID3
```
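While the initial re-sync runs, its progress appears as a `resync = N%` (or `recovery = N%`) entry in the array's stanza of /proc/mdstat. The following is a minimal sketch for extracting that value. The function name md_sync_progress is illustrative, and the parsing assumes the usual /proc/mdstat layout.

```shell
# md_sync_progress MD   (hypothetical helper)
# Reads /proc/mdstat-formatted text on stdin and prints the resync or
# recovery percentage of array MD, or "idle" if no sync is reported.
md_sync_progress() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { inside = 1; next }   # header line of the array
        inside && !/^[ \t]/   { inside = 0 }         # stanza ends at a non-indented line
        inside && match($0, /(resync|recovery) *= *[0-9.]+%/) {
            s = substr($0, RSTART, RLENGTH)
            sub(/^.*= */, "", s)                     # keep only the percentage
            print s; found = 1; exit
        }
        END { if (!found) print "idle" }
    '
}

# Example:
#   md_sync_progress md0 < /proc/mdstat
```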
Get the UUID and the related md path:
```
# mdadm --detail --scan
```
The UUID must match the UUID stored in the superblock. For details on the UUID, refer to the mdadm.conf man page.
Open /etc/mdadm.conf and add the md device name and the devices associated with it. Use the UUID from the previous step:
```
DEVICE /dev/disk/by-id/DEVICE_ID1 /dev/disk/by-id/DEVICE_ID2
ARRAY /dev/md0 UUID=1d70f103:49740ef1:af2afce5:fcf6a489
```
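The scan output typically carries more fields than the ARRAY line shown above needs (for example, metadata= and name=). As a minimal sketch, the filter below reduces each scanned line to the device path and UUID. The function name array_lines is illustrative, and the DEVICE line still has to be written by hand.

```shell
# array_lines   (hypothetical helper)
# Reads `mdadm --detail --scan` output on stdin and prints
# "ARRAY <device> UUID=<uuid>" lines suitable for /etc/mdadm.conf.
array_lines() {
    awk '$1 == "ARRAY" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^UUID=/) print "ARRAY", $2, $i
    }'
}

# Example:
#   mdadm --detail --scan | array_lines
```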

Open Csync2's configuration file /etc/csync2/csync2.cfg and add /etc/mdadm.conf:

```
group ha_group
{
   # ... list of files pruned ...
   include /etc/mdadm.conf;
}
```

Copy the configuration file to all nodes:
```
# csync2 -xv
```

25.3 Configuring a resource agent

Configure a CRM resource as follows:

Create a Raid1 primitive for the Cluster MD device:

```
crm(live)configure# primitive raider Raid1 \
  params raidconf="/etc/mdadm.conf" raiddev=/dev/md0 \
  force_clones=true \
  op monitor timeout=20s interval=10 \
  op start timeout=20s interval=0 \
  op stop timeout=20s interval=0
```

Make sure the Raid1 primitive can only run on nodes where the DLM resource is already running:
- You can add a single Raid1 primitive to the g-storage group described in Procedure 20.1, “Configuring a base group for DLM”:
```
crm(live)configure# modgroup g-storage add raider
```
  This group already has internal colocation and order constraints.
- Do not add multiple Raid1 primitives to the group, because this creates a dependency between the Cluster MD devices. For multiple devices, clone the primitives and colocate them with the independent DLM resource described in Procedure 20.2, “Configuring an independent DLM resource”:
```
crm(live)configure# clone cl-raider1 raider1 meta interleave=true
crm(live)configure# clone cl-raider2 raider2 meta interleave=true
crm(live)configure# colocation col-cmd-with-dlm inf: ( cl-raider1 cl-raider2 ) cl-dlm
crm(live)configure# order o-dlm-before-cmd Mandatory: cl-dlm ( cl-raider1 cl-raider2 )
```
Review your changes with show.
If everything is correct, submit your changes with commit.
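
With more than two Cluster MD devices, the clone, colocation, and order statements follow the same pattern. The sketch below only prints the statements for a list of Raid1 primitive names so that they can be reviewed before being entered at the crm(live)configure# prompt. The function name cmd_constraints is illustrative, and the sketch assumes the DLM clone is named cl-dlm as in the procedure referenced above.

```shell
# cmd_constraints PRIMITIVE...   (hypothetical helper; prints, does not apply)
# Emit crm configure-level statements that clone each Raid1 primitive and
# tie all clones to the cl-dlm clone.
cmd_constraints() {
    clones=""
    for prim in "$@"; do
        printf 'clone cl-%s %s meta interleave=true\n' "$prim" "$prim"
        clones="$clones cl-$prim"
    done
    printf 'colocation col-cmd-with-dlm inf: (%s ) cl-dlm\n' "$clones"
    printf 'order o-dlm-before-cmd Mandatory: cl-dlm (%s )\n' "$clones"
}

# Example:
#   cmd_constraints raider1 raider2 raider3
```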

25.4 Adding a device

To add a device to an existing, active Cluster MD device, first ensure that the device is “visible” on each node with the command cat /proc/mdstat. If the device is not visible, the command fails.

Use the following command on one cluster node:

```
# mdadm --manage /dev/md0 --add /dev/disk/by-id/DEVICE_ID
```

The behavior of the new device added depends on the state of the Cluster MD device:

If only one of the mirrored devices is active, the new device becomes the second device of the mirrored devices and a recovery is initiated.
If both devices of the Cluster MD device are active, the newly added device becomes a spare device.
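
Which of the two cases applies can be read from the `[total/active]` counter in the array's /proc/mdstat stanza, for example `[2/1]` for a degraded mirror. The following is a minimal sketch of that check. The function name md_add_outcome is illustrative, and the parsing assumes the usual /proc/mdstat layout.

```shell
# md_add_outcome MD   (hypothetical helper)
# Reads /proc/mdstat-formatted text on stdin. Prints "recovery" if array MD
# has fewer active members than configured, otherwise "spare".
md_add_outcome() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { inside = 1; next }   # header line of the array
        inside && !/^[ \t]/   { inside = 0 }         # stanza ends at a non-indented line
        inside && match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets, then split "total/active".
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print "recovery"; else print "spare"
            exit
        }
    '
}

# Example:
#   md_add_outcome md0 < /proc/mdstat
```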

25.5 Re-adding a temporarily failed device

Failures are often transient and limited to a single node. However, if any node encounters a failure during an I/O operation, the device is marked as failed for the entire cluster.

This could happen, for example, because of a cable failure on one of the nodes. After correcting the problem, you can re-add the device. When you re-add a device, only the outdated parts are synchronized, whereas adding a new device synchronizes the entire device.

To re-add the device, run the following command on one cluster node:

```
# mdadm --manage /dev/md0 --re-add /dev/disk/by-id/DEVICE_ID
```

25.6 Removing a device

Before removing a device at runtime for replacement, do the following:

Make sure the device is failed by inspecting /proc/mdstat. Look for an (F) after the device name.
If the device is not yet marked as failed, run the following command on one cluster node to make it fail:
```
# mdadm --manage /dev/md0 --fail /dev/disk/by-id/DEVICE_ID
```
Remove the failed device by running the following command on one cluster node:
```
# mdadm --manage /dev/md0 --remove /dev/disk/by-id/DEVICE_ID
```
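
The check for failed members can also be scripted. The following is a minimal sketch that lists the members of one array that carry the (F) flag in /proc/mdstat. The function name md_failed_members is illustrative. Note that /proc/mdstat shows kernel device names such as sdb, not the /dev/disk/by-id names used in the mdadm commands.

```shell
# md_failed_members MD   (hypothetical helper)
# Reads /proc/mdstat-formatted text on stdin and prints the kernel names of
# all members of array MD that are flagged (F). Prints nothing if none are.
md_failed_members() {
    awk -v md="$1" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\(F\)/) {
                    sub(/\[.*/, "", $i)   # drop "[slot](F)", keep the device name
                    print $i
                }
        }
    '
}

# Example:
#   md_failed_members md0 < /proc/mdstat
```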

25.7 Assembling Cluster MD as normal RAID at the disaster recovery site

During disaster recovery, the disaster recovery site might not have a Pacemaker cluster stack in its infrastructure, while applications still need to access the data on the existing Cluster MD disks or from the backups.

You can convert a Cluster MD RAID to a normal RAID by using the --assemble operation with the -U no-bitmap option to change the metadata of the RAID disks accordingly.

The following example shows how to assemble all arrays on the disaster recovery site:

```
# Assemble every array reported by "mdadm -Es" as a normal RAID (no clustered bitmap)
while read -r i; do
   NAME=$(echo "$i" | sed 's/.*name=//' | awk '{print $1}' | sed 's/.*://')
   UUID=$(echo "$i" | sed 's/.*UUID=//' | awk '{print $1}')
   mdadm -AR "/dev/md/$NAME" -u "$UUID" -U no-bitmap
   echo "NAME = $NAME, UUID = $UUID, assembled."
done < <(mdadm -Es)
```