Applies to SUSE Linux Enterprise High Availability 15 SP3

23 Cluster logical volume manager (Cluster LVM)

When managing shared storage on a cluster, every node must be informed about changes to the storage subsystem. Logical Volume Manager (LVM) supports transparent management of volume groups across the whole cluster. Volume groups shared among multiple hosts can be managed using the same commands as local storage.

23.1 Conceptual overview

Cluster LVM is coordinated with different tools:

Distributed lock manager (DLM): Coordinates access to shared resources among multiple hosts through cluster-wide locking.
Logical Volume Manager (LVM): LVM provides a virtual pool of disk space and enables flexible distribution of one logical volume over several disks.
Cluster logical volume manager (Cluster LVM): The term Cluster LVM indicates that LVM is being used in a cluster environment. This needs some configuration adjustments to protect the LVM metadata on shared storage. From SUSE Linux Enterprise 15 onward, the cluster extension uses lvmlockd, which replaces clvmd. For more information about lvmlockd, see the man page of the lvmlockd command (man 8 lvmlockd).
lvmlockd with sanlock is not officially supported.
Volume group and logical volume: Volume groups (VGs) and logical volumes (LVs) are basic concepts of LVM. A volume group is a storage pool of multiple physical disks. A logical volume belongs to a volume group, and can be seen as an elastic volume on which you can create a file system. In a cluster environment, there is a concept of shared VGs, which consist of shared storage and can be used concurrently by multiple hosts.

23.2 Configuration of Cluster LVM

Make sure the following requirements are fulfilled:

A shared storage device is available, provided by a Fibre Channel, FCoE, SCSI, iSCSI SAN, or DRBD*, for example.
Make sure the following packages have been installed: lvm2 and lvm2-lockd.
From SUSE Linux Enterprise 15 onward, the cluster extension uses lvmlockd, which replaces clvmd. Make sure the clvmd daemon is not running, otherwise lvmlockd will fail to start.
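
The requirements above can be checked before you start. The following is a minimal sketch, assuming an RPM-based system on which pgrep is available. The function names clvmd_running and check_cluster_lvm_prereqs are illustrative only and are not part of any SUSE tooling.

```shell
# Succeeds only if a process named exactly "clvmd" is running.
# Returns 2 (and therefore reports "not running") if pgrep is unavailable.
clvmd_running() {
  command -v pgrep >/dev/null 2>&1 || { echo "pgrep not found" >&2; return 2; }
  pgrep -x clvmd >/dev/null 2>&1
}

# Print one line per unmet requirement; print nothing if all checks pass.
check_cluster_lvm_prereqs() {
  for pkg in lvm2 lvm2-lockd; do
    rpm -q "$pkg" >/dev/null 2>&1 || echo "package not installed: $pkg"
  done
  if clvmd_running; then
    echo "clvmd is running; stop it before starting lvmlockd"
  fi
}
```

Run check_cluster_lvm_prereqs as root on each node. Empty output means that none of the checked requirements is violated; the sketch does not verify the shared storage itself.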

23.2.1 Creating the cluster resources

Perform the following basic steps on one node to configure a shared VG in the cluster:

Procedure 23.1: Creating a DLM resource

Start a shell and log in as root.
Check the current configuration of the cluster resources:
```
# crm configure show
```
If you have already configured a DLM resource (and a corresponding base group and base clone), continue with Procedure 23.2, “Creating an lvmlockd resource”.
Otherwise, configure a DLM resource and a corresponding base group and base clone as described in Procedure 19.1, “Configuring a base group for DLM”.

Procedure 23.2: Creating an lvmlockd resource

Start a shell and log in as root.
Run the following command to see the usage of this resource:
```
# crm configure ra info lvmlockd
```

Configure an lvmlockd resource as follows:

# crm configure primitive lvmlockd lvmlockd \
  op start timeout="90" \
  op stop timeout="100" \
  op monitor interval="30" timeout="90"

To ensure the lvmlockd resource is started on every node, add the primitive resource to the base group for storage you have created in Procedure 23.1, “Creating a DLM resource”:
```
# crm configure modgroup g-storage add lvmlockd
```
Review your changes:
```
# crm configure show
```
Check if the resources are running well:
```
# crm status full
```

Procedure 23.3: Creating a shared VG and LV

Start a shell and log in as root.

Assuming you already have two shared disks, create a shared VG with them:

# vgcreate --shared vg1 /dev/disk/by-id/DEVICE_ID1 /dev/disk/by-id/DEVICE_ID2

Create an LV and do not activate it initially:
```
# lvcreate -an -L10G -n lv1 vg1
```

Procedure 23.4: Creating an LVM-activate resource

Start a shell and log in as root.
Run the following command to see the usage of this resource:
```
# crm configure ra info LVM-activate
```
This resource manages the activation of a VG. In a shared VG, LVs can be activated in one of two modes: exclusive or shared. Exclusive mode is the default and is the normal choice when a local file system such as ext4 uses the LV. Use shared mode only for cluster file systems such as OCFS2.

Configure a resource to manage the activation of your VG. Choose one of the following options according to your scenario:

Use exclusive activation mode for local file system usage:

# crm configure primitive vg1 LVM-activate \
  params vgname=vg1 vg_access_mode=lvmlockd \
  op start timeout=90s interval=0 \
  op stop timeout=90s interval=0 \
  op monitor interval=30s timeout=90s

Use shared activation mode for OCFS2:

# crm configure primitive vg1 LVM-activate \
   params vgname=vg1 vg_access_mode=lvmlockd activation_mode=shared \
   op start timeout=90s interval=0 \
   op stop timeout=90s interval=0 \
   op monitor interval=30s timeout=90s

Make sure the VG can only be activated on nodes where the DLM and lvmlockd resources are already running:
- Exclusive activation mode:
  Because this VG is only active on a single node, do not add it to the cloned g-storage group. Instead, add constraints directly to the resource:
```
# crm configure colocation col-vg-with-dlm inf: vg1 cl-storage
# crm configure order o-dlm-before-vg Mandatory: cl-storage vg1
```
  For multiple VGs, you can add constraints to multiple resources at once:
```
# crm configure colocation col-vg-with-dlm inf: ( vg1 vg2 ) cl-storage
# crm configure order o-dlm-before-vg Mandatory: cl-storage ( vg1 vg2 )
```
- Shared activation mode:
  Because this VG is active on multiple nodes, you can add it to the cloned g-storage group, which already has internal colocation and order constraints:
```
# crm configure modgroup g-storage add vg1
```
  Do not add multiple VGs to the group, because this creates a dependency between the VGs. For multiple VGs, clone the resources and add constraints to the clones:
```
# crm configure clone cl-vg1 vg1 meta interleave=true
# crm configure clone cl-vg2 vg2 meta interleave=true
# crm configure colocation col-vg-with-dlm inf: ( cl-vg1 cl-vg2 ) cl-storage
# crm configure order o-dlm-before-vg Mandatory: cl-storage ( cl-vg1 cl-vg2 )
```
Check if the resources are running well:
```
# crm status full
```
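
The choice between the two activation modes in Procedure 23.4 determines both the primitive and the way it is tied to the storage clone. The following sketch prints (but does not run) the corresponding crm commands for a single VG so that they can be reviewed first. It assumes the resource names g-storage and cl-storage used above. The function name and the constraint IDs derived from the VG name are illustrative assumptions.

```shell
# Print the crm commands for one VG, following Procedure 23.4.
# Usage: print_vg_activation_commands VG_NAME exclusive|shared
print_vg_activation_commands() (
  vg=$1
  mode=$2
  case $mode in
    exclusive) extra="" ;;
    shared)    extra=" activation_mode=shared" ;;
    *) echo "mode must be 'exclusive' or 'shared'" >&2; exit 2 ;;
  esac

  # The primitive is identical in both modes except for activation_mode.
  printf 'crm configure primitive %s LVM-activate params vgname=%s vg_access_mode=lvmlockd%s op start timeout=90s interval=0 op stop timeout=90s interval=0 op monitor interval=30s timeout=90s\n' "$vg" "$vg" "$extra"

  if [ "$mode" = exclusive ]; then
    # Active on one node only: add constraints against the storage clone.
    printf 'crm configure colocation col-%s-with-dlm inf: %s cl-storage\n' "$vg" "$vg"
    printf 'crm configure order o-dlm-before-%s Mandatory: cl-storage %s\n' "$vg" "$vg"
  else
    # Active on all nodes: join the cloned base group (single shared VG only).
    printf 'crm configure modgroup g-storage add %s\n' "$vg"
  fi
)
```

For example, print_vg_activation_commands vg1 exclusive prints one primitive, one colocation constraint, and one order constraint. The sketch covers a single VG only; for multiple VGs, use the grouped constraints shown in the procedure.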

23.2.2 Scenario: Cluster LVM with iSCSI on SANs

The following scenario uses two SAN boxes which export their iSCSI targets to several clients. The general idea is displayed in Figure 23.1, “Setup of a shared disk with Cluster LVM”.

Figure 23.1: Setup of a shared disk with Cluster LVM

Warning: Data loss

The following procedures will destroy any data on your disks!

Configure only one SAN box first. Each SAN box needs to export its own iSCSI target. Proceed as follows:

Procedure 23.5: Configuring iSCSI targets (SAN)

Run YaST and click Network Services › iSCSI LIO Target to start the iSCSI Server module.
If you want to start the iSCSI target whenever your computer is booted, choose When Booting, otherwise choose Manually.
If you have a firewall running, enable Open Port in Firewall.
Switch to the Global tab. If you need authentication, enable incoming or outgoing authentication or both. In this example, we select No Authentication.
Add a new iSCSI target:
1. Switch to the Targets tab.
2. Click Add.
3. Enter a target name. The name needs to be formatted like this:
```
iqn.DATE.DOMAIN
```
  For more information about the format, refer to Section 3.2.6.3.1, Type "iqn." (iSCSI Qualified Name), of RFC 3720 at http://www.ietf.org/rfc/rfc3720.txt.
4. If you want a more descriptive name, you can change it as long as your identifier is unique for your different targets.
5. Click Add.
6. Enter the device name in Path and use a Scsiid.
7. Click Next twice.
Confirm the warning box with Yes.
Open the configuration file /etc/iscsi/iscsid.conf and change the parameter node.startup to automatic.

Now set up your iSCSI initiators as follows:

Procedure 23.6: Configuring iSCSI initiators

Run YaST and click Network Services › iSCSI Initiator.
If you want to start the iSCSI initiator whenever your computer is booted, choose When Booting, otherwise set Manually.
Change to the Discovery tab and click the Discovery button.
Add the IP address and the port of your iSCSI target (see Procedure 23.5, “Configuring iSCSI targets (SAN)”). Normally, you can leave the port as it is and use the default value.
If you use authentication, insert the incoming and outgoing user name and password, otherwise activate No Authentication.
Select Next. The found connections are displayed in the list.
Proceed with Finish.
Open a shell and log in as root.

Test if the iSCSI initiator has been started successfully:

# iscsiadm -m discovery -t st -p 192.168.3.100
192.168.3.100:3260,1 iqn.2010-03.de.jupiter:san1

Establish a session:

# iscsiadm -m node -l -p 192.168.3.100 -T iqn.2010-03.de.jupiter:san1
Logging in to [iface: default, target: iqn.2010-03.de.jupiter:san1, portal: 192.168.3.100,3260]
Login to [iface: default, target: iqn.2010-03.de.jupiter:san1, portal: 192.168.3.100,3260]: successful

See the device names with lsscsi:

...
[4:0:0:2]    disk    IET      ...     0     /dev/sdd
[5:0:0:1]    disk    IET      ...     0     /dev/sde

Look for entries with IET in their third column. In this case, the devices are /dev/sdd and /dev/sde.
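
Picking the IET entries out of the lsscsi output can be scripted. The following is a minimal sketch: iet_devices filters on the third column as described above, and by_id_links lists the stable names under /dev/disk/by-id/ that point to a given device node. Both function names are illustrative, and by_id_links assumes a readlink that supports the -f option (as in GNU coreutils).

```shell
# Read lsscsi output on stdin and print the device node (last column)
# of every line whose third column (the vendor) is exactly "IET".
iet_devices() {
  awk '$3 == "IET" { print $NF }'
}

# Print every /dev/disk/by-id link that resolves to the given device node.
by_id_links() {
  for link in /dev/disk/by-id/*; do
    if [ "$(readlink -f "$link" 2>/dev/null)" = "$1" ]; then
      printf '%s\n' "$link"
    fi
  done
}
```

A typical call is lsscsi | iet_devices, followed by by_id_links /dev/sdd for each reported device. The stable names are useful because names such as /dev/sdd can change between reboots.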

Procedure 23.7: Creating the shared volume groups

Open a root shell on one of the nodes on which you ran the iSCSI initiator from Procedure 23.6, “Configuring iSCSI initiators”.
Create the shared volume group on disks /dev/sdd and /dev/sde, using their stable device names (for example, in /dev/disk/by-id/):
```
# vgcreate --shared testvg /dev/disk/by-id/DEVICE_ID1 /dev/disk/by-id/DEVICE_ID2
```

Create logical volumes as needed:

# lvcreate --name lv1 --size 500M testvg

Check the volume group with vgdisplay:

  --- Volume group ---
      VG Name               testvg
      System ID
      Format                lvm2
      Metadata Areas        2
      Metadata Sequence No  1
      VG Access             read/write
      VG Status             resizable
      MAX LV                0
      Cur LV                0
      Open LV               0
      Max PV                0
      Cur PV                2
      Act PV                2
      VG Size               1016,00 MB
      PE Size               4,00 MB
      Total PE              254
      Alloc PE / Size       0 / 0
      Free  PE / Size       254 / 1016,00 MB
      VG UUID               UCyWw8-2jqV-enuT-KH4d-NXQI-JhH3-J24anD

Check the shared state of the volume group with the command vgs:
```
# vgs
  VG       #PV #LV #SN Attr   VSize     VFree
  vgshared   1   1   0 wz--ns 1016.00m  1016.00m
```
The Attr column shows the volume group attributes. In this example, the volume group is writable (w), resizable (z), uses the normal allocation policy (n), and is a shared resource (s). See the man page of vgs for details.
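
The attribute string can also be decoded in a script, for example to verify that a VG is really shared before configuring cluster resources for it. The following sketch interprets only the positions discussed here (1, 2, 5, and 6), with the letter meanings taken from the vgs man page. The function names are illustrative.

```shell
# Succeeds if the sixth character of a vg_attr string is "s" (shared).
vg_attr_is_shared() {
  case $1 in
    ?????s) return 0 ;;
    *)      return 1 ;;
  esac
}

# Print a short description of a six-character vg_attr string.
describe_vg_attr() {
  case $1 in w?????) printf 'writable ' ;; r?????) printf 'read-only ' ;; esac
  case $1 in ?z????) printf 'resizable ' ;; esac
  case $1 in ????n?) printf 'alloc=normal ' ;; esac
  case $1 in ?????s) printf 'shared' ;; ?????c) printf 'clustered' ;; esac
  printf '\n'
}
```

On a live system, the string for a single VG can be obtained with vgs --noheadings -o vg_attr VG_NAME and stripped of its leading blanks (for example, with tr -d ' ') before it is passed to these functions.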

After you have created the volumes and started your resources, you should have new device names under /dev/testvg, for example /dev/testvg/lv1. This indicates that the LV has been activated for use.

23.2.3 Scenario: Cluster LVM with DRBD

The following scenario can be used if your data centers are located in different parts of your city, country, or continent.

Procedure 23.8: Creating a cluster-aware volume group with DRBD

Create a primary/primary DRBD resource:
1. First, set up a DRBD device as primary/secondary as described in Procedure 22.2, “Manually configuring DRBD”. Make sure the disk state is up-to-date on both nodes. Check this with drbdadm status.
2. Add the following options to your configuration file (usually something like /etc/drbd.d/r0.res):
```
resource r0 {
  net {
     allow-two-primaries;
  }
  ...
}
```
3. Copy the changed configuration file to the other node, for example:
```
# scp /etc/drbd.d/r0.res venus:/etc/drbd.d/
```
4. Run the following commands on both nodes:
```
# drbdadm disconnect r0
# drbdadm connect r0
# drbdadm primary r0
```
5. Check the status of your nodes:
```
# drbdadm status r0
```
Include the lvmlockd resource as a clone in the Pacemaker configuration, and make it depend on the DLM clone resource. See Procedure 23.1, “Creating a DLM resource” for detailed instructions. Before proceeding, confirm that these resources have started successfully on your cluster. Use crm status or the Web interface to check the running services.
Prepare the physical volume for LVM with the command pvcreate. For example, on the device /dev/drbd_r0 the command would look like this:
```
# pvcreate /dev/drbd_r0
```
Create a shared volume group:
```
# vgcreate --shared testvg /dev/drbd_r0
```
Create logical volumes as needed. You probably want to change the size of the logical volume. For example, create a 4 GB logical volume with the following command:
```
# lvcreate --name lv1 -L 4G testvg
```
The logical volumes within the VG are now available for file system mounts or raw usage. Ensure that services using them have proper dependencies to colocate them with the VG and to order them after the VG has been activated.

After finishing these configuration steps, the LVM configuration can be done as on any stand-alone workstation.

23.3 Configuring eligible LVM devices explicitly

When several devices seemingly share the same physical volume signature (as can be the case for multipath devices or DRBD), we recommend explicitly configuring the devices that LVM scans for PVs.

For example, if the command vgcreate uses the physical device instead of the mirrored block device, DRBD will be confused. This may result in a split-brain condition for DRBD.

To deactivate a single device for LVM, do the following:

Edit the file /etc/lvm/lvm.conf and search for the line starting with filter.
The patterns there are handled as regular expressions. A leading “a” accepts devices that match the pattern for scanning; a leading “r” rejects devices that match the pattern.
To remove a device named /dev/sdb1, add the following expression to the filter rule:
```
"r|^/dev/sdb1$|"
```
The complete filter line will look like the following:
```
filter = [ "r|^/dev/sdb1$|", "r|/dev/.*/by-path/.*|", "r|/dev/.*/by-id/.*|", "a/.*/" ]
```
A filter line that accepts DRBD and MPIO devices but rejects all other devices would look like this:
```
filter = [ "a|/dev/drbd.*|", "a|/dev/.*/by-id/dm-uuid-mpath-.*|", "r/.*/" ]
```
Write the configuration file and copy it to all cluster nodes.
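
To see how a filter list treats a particular device path before editing lvm.conf, the first-match rule can be emulated in a few lines of shell. This is a simplified sketch, not LVM's own implementation: it evaluates the patterns with grep -E instead of LVM's regular expression engine, it checks one path name only (LVM considers all path names of a device), and it follows the behavior documented in lvm.conf(5) that a path matching no pattern is accepted. The function name is illustrative.

```shell
# Decide whether one device path is accepted or rejected by a filter list.
# Usage: lvm_filter_decision DEVICE_PATH PATTERN...
# Each PATTERN has the lvm.conf form: action, delimiter, regex, delimiter.
lvm_filter_decision() (
  dev=$1
  shift
  for pat in "$@"; do
    action=${pat%"${pat#?}"}   # first character: a or r
    regex=${pat#??}            # drop the action and the opening delimiter
    regex=${regex%?}           # drop the closing delimiter
    if printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
      case $action in
        a) echo accept ;;
        r) echo reject ;;
        *) echo "bad pattern: $pat" >&2; exit 2 ;;
      esac
      exit 0   # the first matching pattern decides
    fi
  done
  echo accept   # no pattern matched
)
```

For example, lvm_filter_decision /dev/sdb1 'r|^/dev/sdb1$|' 'a/.*/' reports that the path is rejected, because the first pattern matches before the catch-all accept pattern is reached.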

23.4 Online migration from mirror LV to cluster MD

Starting with SUSE Linux Enterprise High Availability 15, cmirrord in Cluster LVM is deprecated. We highly recommend migrating the mirror logical volumes in your cluster to cluster MD. Cluster MD stands for cluster multi-device and is a software-based RAID storage solution for a cluster.

23.4.1 Example setup before migration

Let us assume you have the following example setup:

You have a two-node cluster consisting of the nodes alice and bob.
A mirror logical volume named test-lv was created from a volume group named cluster-vg2.
The volume group cluster-vg2 is composed of the disks /dev/vdb and /dev/vdc.

# lsblk
NAME                                  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                                   253:0    0   40G  0 disk
├─vda1                                253:1    0    4G  0 part [SWAP]
└─vda2                                253:2    0   36G  0 part /
vdb                                   253:16   0   20G  0 disk
├─cluster--vg2-test--lv_mlog_mimage_0 254:0    0    4M  0 lvm
│ └─cluster--vg2-test--lv_mlog        254:2    0    4M  0 lvm
│   └─cluster--vg2-test--lv           254:5    0   12G  0 lvm
└─cluster--vg2-test--lv_mimage_0      254:3    0   12G  0 lvm
  └─cluster--vg2-test--lv             254:5    0   12G  0 lvm
vdc                                   253:32   0   20G  0 disk
├─cluster--vg2-test--lv_mlog_mimage_1 254:1    0    4M  0 lvm
│ └─cluster--vg2-test--lv_mlog        254:2    0    4M  0 lvm
│   └─cluster--vg2-test--lv           254:5    0   12G  0 lvm
└─cluster--vg2-test--lv_mimage_1      254:4    0   12G  0 lvm
  └─cluster--vg2-test--lv             254:5    0   12G  0 lvm

Important: Avoiding migration failures

Before you start the migration procedure, check the capacity and degree of utilization of your logical and physical volumes. If the logical volume uses 100% of the physical volume capacity, the migration might fail with an insufficient free space error on the target volume. How to prevent this migration failure depends on the options used for mirror log:

Is the mirror log itself mirrored (mirrored option) and allocated on the same device as the mirror leg? (For example, this might be the case if you have created the logical volume for a cmirrord setup on SUSE Linux Enterprise High Availability 11 or 12 as described in the Administration Guide for those versions.)
By default, mdadm reserves a certain amount of space between the start of a device and the start of array data. During migration, you can check for the unused padding space and reduce it with the data-offset option as shown in Step 1.d and following.
The data-offset must leave enough space on the device for cluster MD to write its metadata to it. On the other hand, the offset must be small enough for the remaining capacity of the device to accommodate all physical volume extents of the migrated volume. Because the volume may have spanned the complete device minus the mirror log, the offset must be smaller than the size of the mirror log.
We recommend setting the data-offset to 128 kB. If no value is specified for the offset, its default value is 1 kB (1024 bytes).
Is the mirror log written to a different device (disk option) or kept in memory (core option)? Before starting the migration, either enlarge the size of the physical volume or reduce the size of the logical volume (to free more space for the physical volume).
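
The upper bound on the data offset for the mirrored-log case can be expressed as a simple comparison. The following sketch checks only that bound (offset smaller than the mirror log). It does not check the lower bound that cluster MD needs for its metadata, because no figure for it is given here. Both arguments are assumed to be in KiB, and the function name is illustrative.

```shell
# Succeeds if the data offset is smaller than the mirror log.
# Usage: check_data_offset OFFSET_KIB MIRROR_LOG_KIB
check_data_offset() {
  offset_kib=$1
  mirror_log_kib=$2
  if [ "$offset_kib" -lt "$mirror_log_kib" ]; then
    echo "ok: offset ${offset_kib}K is smaller than the ${mirror_log_kib}K mirror log"
  else
    echo "too large: offset ${offset_kib}K does not fit in the ${mirror_log_kib}K mirror log"
    return 1
  fi
}
```

With the example setup, where lsblk reports a 4M mirror log, check_data_offset 128 4096 succeeds for the recommended offset.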

23.4.2 Migrating a mirror LV to cluster MD

The following procedure is based on Section 23.4.1, “Example setup before migration”. Adjust the instructions to match your setup and replace the names for the LVs, VGs, disks and the cluster MD device accordingly.

The migration does not involve any downtime. The file system can still be mounted during the migration procedure.

On node alice, execute the following steps:

Convert the mirror logical volume test-lv to a linear logical volume:
```
# lvconvert -m0 cluster-vg2/test-lv /dev/vdc
```
Remove the physical volume /dev/vdc from the volume group cluster-vg2:
```
# vgreduce cluster-vg2 /dev/vdc
```

Remove this physical volume from LVM:

# pvremove /dev/vdc

When you run lsblk now, you get:

NAME                    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                     253:0    0   40G  0 disk
├─vda1                  253:1    0    4G  0 part [SWAP]
└─vda2                  253:2    0   36G  0 part /
vdb                     253:16   0   20G  0 disk
└─cluster--vg2-test--lv 254:5    0   12G  0 lvm
vdc                     253:32   0   20G  0 disk

Create a cluster MD device /dev/md0 with the disk /dev/vdc:
```
# mdadm --create /dev/md0 --bitmap=clustered \
--metadata=1.2 --raid-devices=1 --force --level=mirror \
/dev/vdc --data-offset=128
```
For details on why to use the data-offset option, see Important: Avoiding migration failures.

On node bob, assemble this MD device:
```
# mdadm --assemble md0 /dev/vdc
```
If your cluster consists of more than two nodes, execute this step on all remaining nodes in your cluster.
Back on node alice:
1. Initialize the MD device /dev/md0 as physical volume for use with LVM:
```
# pvcreate /dev/md0
```
2. Add the MD device /dev/md0 to the volume group cluster-vg2:
```
# vgextend cluster-vg2 /dev/md0
```
3. Move the data from the disk /dev/vdb to the /dev/md0 device:
```
# pvmove /dev/vdb /dev/md0
```
4. Remove the physical volume /dev/vdb from the volume group cluster-vg2:
```
# vgreduce cluster-vg2 /dev/vdb
```
5. Remove the label from the device so that LVM no longer recognizes it as physical volume:
```
# pvremove /dev/vdb
```
6. Add /dev/vdb to the MD device /dev/md0:
```
# mdadm --grow /dev/md0 --raid-devices=2 --add /dev/vdb
```
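
Because the order of the steps matters and the names differ per setup, it can help to print the complete command sequence for your own names and review it before running anything. The following sketch only prints the commands from the procedure above with the given names substituted; it does not execute them. The function name and its argument order are illustrative assumptions. Unlike the assemble step above, which uses the short name md0, the sketch prints whatever MD device name is passed in.

```shell
# Print the migration commands in order, without running them.
# Usage: print_migration_plan VG LV DISK_KEPT_FIRST DISK_MOVED_FIRST MD_DEVICE
print_migration_plan() (
  vg=$1 lv=$2 keep=$3 move=$4 md=$5
  cat <<EOF
# on the first node
lvconvert -m0 $vg/$lv $move
vgreduce $vg $move
pvremove $move
mdadm --create $md --bitmap=clustered --metadata=1.2 --raid-devices=1 --force --level=mirror $move --data-offset=128
# on every other node
mdadm --assemble $md $move
# back on the first node
pvcreate $md
vgextend $vg $md
pvmove $keep $md
vgreduce $vg $keep
pvremove $keep
mdadm --grow $md --raid-devices=2 --add $keep
EOF
)
```

For the example setup, the call is print_migration_plan cluster-vg2 test-lv /dev/vdb /dev/vdc /dev/md0. The lines starting with # in the output mark the node on which the following commands belong.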

23.4.3 Example setup after migration

When you run lsblk now, you get:

NAME                      MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
vda                       253:0    0   40G  0 disk
├─vda1                    253:1    0    4G  0 part  [SWAP]
└─vda2                    253:2    0   36G  0 part  /
vdb                       253:16   0   20G  0 disk
└─md0                       9:0    0   20G  0 raid1
  └─cluster--vg2-test--lv 254:5    0   12G  0 lvm
vdc                       253:32   0   20G  0 disk
└─md0                       9:0    0   20G  0 raid1
  └─cluster--vg2-test--lv 254:5    0   12G  0 lvm