Applies to SUSE Enterprise Storage 6

23 RADOS Block Device

A block is a sequence of bytes, for example a 4 MB block of data. Block-based storage interfaces are the most common way to store data on rotating media, such as hard disks, CDs, and floppy disks. The ubiquity of block device interfaces makes a virtual block device an ideal candidate to interact with a mass data storage system like Ceph.

Ceph block devices allow sharing of physical resources, and are resizable. They store data striped over multiple OSDs in a Ceph cluster. Ceph block devices leverage RADOS capabilities such as snapshotting, replication, and consistency. Ceph's RADOS Block Devices (RBD) interact with OSDs using kernel modules or the librbd library.

Figure 23.1: RADOS Protocol

Ceph's block devices deliver high performance with infinite scalability to kernel modules. They support virtualization solutions such as QEMU, or cloud-based computing systems such as OpenStack that rely on libvirt. You can use the same cluster to operate the Object Gateway, CephFS, and RADOS Block Devices simultaneously.

23.1 Block Device Commands

The rbd command enables you to create, list, introspect, and remove block device images. You can also use it, for example, to clone images, create snapshots, roll back an image to a snapshot, or view a snapshot.

23.1.1 Creating a Block Device Image in a Replicated Pool

Before you can add a block device to a client, you need to create a related image in an existing pool (see Chapter 22, Managing Storage Pools):

cephadm@adm > rbd create --size MEGABYTES POOL-NAME/IMAGE-NAME

For example, to create a 1 GB image named 'myimage' that stores information in a pool named 'mypool', execute the following:

cephadm@adm > rbd create --size 1024 mypool/myimage
Tip: Image Size Units

If you omit a size unit shortcut ('G' or 'T'), the image's size is in megabytes. Use 'G' or 'T' after the size number to specify gigabytes or terabytes.
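
For example, the following is equivalent to the 1 GB example above, using a unit suffix (a sketch re-using the 'mypool' pool and 'myimage' name):

cephadm@adm > rbd create --size 1G mypool/myimage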

23.1.2 Creating a Block Device Image in an Erasure Coded Pool

As of SUSE Enterprise Storage 5, it is possible to store the data of a block device image directly in erasure coded (EC) pools. A RADOS Block Device image consists of a data part and a metadata part. You can store only the 'data' part of a RADOS Block Device image in an EC pool. The pool needs to have the allow_ec_overwrites flag set to true, and that is only possible if all OSDs where the pool is stored use BlueStore.

You cannot store the image's 'metadata' part in an EC pool. You need to specify the replicated pool for storing the image's metadata with the --pool= option of the rbd create command.

Use the following steps to create an RBD image in a newly created EC pool:

cephadm@adm > ceph osd pool create POOL_NAME 12 12 erasure
cephadm@adm > ceph osd pool set POOL_NAME allow_ec_overwrites true

#Metadata will reside in pool "OTHER_POOL", and data in pool "POOL_NAME"
cephadm@adm > rbd create IMAGE_NAME --size=1G --data-pool POOL_NAME --pool=OTHER_POOL

23.1.3 Listing Block Device Images

To list block devices in a pool named 'mypool', execute the following:

cephadm@adm > rbd ls mypool

23.1.4 Retrieving Image Information

To retrieve information from an image 'myimage' within a pool named 'mypool', run the following:

cephadm@adm > rbd info mypool/myimage

23.1.5 Resizing a Block Device Image

RADOS Block Device images are thin provisioned—they do not actually use any physical storage until you begin saving data to them. However, they do have a maximum capacity that you set with the --size option. If you want to increase (or decrease) the maximum size of the image, run the following:

cephadm@adm > rbd resize --size 2048 POOL_NAME/IMAGE_NAME # to increase
cephadm@adm > rbd resize --size 2048 POOL_NAME/IMAGE_NAME --allow-shrink # to decrease

23.1.6 Removing a Block Device Image

To remove a block device that corresponds to an image 'myimage' in a pool named 'mypool', run the following:

cephadm@adm > rbd rm mypool/myimage

23.2 Mounting and Unmounting

After you create a RADOS Block Device, you can use it like any other disk device: format it, mount it to be able to exchange files, and unmount it when done.

  1. Make sure your Ceph cluster includes a pool with the disk image you want to map. Assume the pool is called mypool and the image is myimage.

    cephadm@adm > rbd list mypool
  2. Map the image to a new block device.

    cephadm@adm > rbd map --pool mypool myimage
    Tip: User Name and Authentication

    To specify a user name, use --id user-name. If you use cephx authentication, you also need to specify a secret. It may come from a keyring or a file containing the secret:

    cephadm@adm > rbd map --pool rbd myimage --id admin --keyring /path/to/keyring

    or

    cephadm@adm > rbd map --pool rbd myimage --id admin --keyfile /path/to/file
  3. List all mapped devices:

    cephadm@adm > rbd showmapped
     id pool   image   snap device
     0  mypool myimage -    /dev/rbd0

    The device we want to work on is /dev/rbd0.

    Tip: RBD Device Path

    Instead of /dev/rbdDEVICE_NUMBER, you can use /dev/rbd/POOL_NAME/IMAGE_NAME as a persistent device path. For example:

    /dev/rbd/mypool/myimage
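
    For example, you can verify where the persistent path points. This is a sketch; the rbd0 suffix depends on the order in which images were mapped on your system:

    root # readlink -f /dev/rbd/mypool/myimage
     /dev/rbd0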
  4. Make an XFS file system on the /dev/rbd0 device.

    root # mkfs.xfs /dev/rbd0
     log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
     log stripe unit adjusted to 32KiB
     meta-data=/dev/rbd0              isize=256    agcount=9, agsize=261120 blks
              =                       sectsz=512   attr=2, projid32bit=1
              =                       crc=0        finobt=0
     data     =                       bsize=4096   blocks=2097152, imaxpct=25
              =                       sunit=1024   swidth=1024 blks
     naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
     log      =internal log           bsize=4096   blocks=2560, version=2
              =                       sectsz=512   sunit=8 blks, lazy-count=1
     realtime =none                   extsz=4096   blocks=0, rtextents=0
  5. Mount the device and check it is correctly mounted. Replace /mnt with your mount point.

    root # mount /dev/rbd0 /mnt
    root # mount | grep rbd0
    /dev/rbd0 on /mnt type xfs (rw,relatime,attr2,inode64,sunit=8192,...

    Now you can move data to and from the device as if it was a local directory.

    Tip: Increasing the Size of RBD Device

    If you find that the size of the RBD device is no longer enough, you can easily increase it.

    1. Increase the size of the RBD image, for example up to 10 GB.

      cephadm@adm > rbd resize --size 10000 mypool/myimage
       Resizing image: 100% complete...done.
    2. Grow the file system to fill up the new size of the device.

      root # xfs_growfs /mnt
       [...]
       data blocks changed from 2097152 to 2560000
  6. After you finish accessing the device, you can unmap and unmount it.

    cephadm@adm > rbd unmap /dev/rbd0
    root # umount /mnt
Tip: Manual (Un)mounting

Since manually mapping and mounting RBD images after boot and unmounting and unmapping them before shutdown can be tedious, an rbdmap script and a systemd unit are provided. Refer to Section 23.2.1, “rbdmap: Map RBD Devices at Boot Time”.

23.2.1 rbdmap: Map RBD Devices at Boot Time

rbdmap is a shell script that automates rbd map and rbd unmap operations on one or more RBD images. Although you can run the script manually at any time, the main advantage is automatic mapping and mounting of RBD images at boot time (and unmounting and unmapping at shutdown), as triggered by the Init system. A systemd unit file, rbdmap.service, is included with the ceph-common package for this purpose.

The script takes a single argument, which can be either map or unmap. In either case, the script parses a configuration file. The configuration file defaults to /etc/ceph/rbdmap, but you can override it via the RBDMAPFILE environment variable. Each line of the configuration file corresponds to an RBD image that is to be mapped or unmapped.

The configuration file has the following format:

image_specification rbd_options
image_specification

Path to an image within a pool. Specify as pool_name/image_name.

rbd_options

An optional list of parameters to be passed to the underlying rbd map command. These parameters and their values should be specified as a comma-separated string, for example:

PARAM1=VAL1,PARAM2=VAL2,...

The example makes the rbdmap script run the following command:

cephadm@adm > rbd map POOL_NAME/IMAGE_NAME --PARAM1 VAL1 --PARAM2 VAL2

In the following example you can see how to specify a user name and a keyring with a corresponding secret:

cephadm@adm > rbdmap map mypool/myimage id=rbd_user,keyring=/etc/ceph/ceph.client.rbd.keyring

When run as rbdmap map, the script parses the configuration file, and for each specified RBD image, it attempts to first map the image (using the rbd map command) and then mount the image.

When run as rbdmap unmap, images listed in the configuration file will be unmounted and unmapped.

rbdmap unmap-all attempts to unmount and subsequently unmap all currently mapped RBD images, regardless of whether they are listed in the configuration file.

If successful, the rbd map operation maps the image to a /dev/rbdX device, at which point a udev rule is triggered to create a friendly device name symbolic link /dev/rbd/pool_name/image_name pointing to the real mapped device.

In order for mounting and unmounting to succeed, the 'friendly' device name needs to have a corresponding entry in /etc/fstab. When writing /etc/fstab entries for RBD images, specify the 'noauto' (or 'nofail') mount option. This prevents the Init system from trying to mount the device too early—before the device in question even exists, as rbdmap.service is typically triggered quite late in the boot sequence.
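
For example, a minimal /etc/fstab entry for the image used throughout this chapter could look as follows (a sketch, assuming the 'mypool' pool, the 'myimage' image, an XFS file system, and the /mnt mount point):

/dev/rbd/mypool/myimage /mnt xfs noauto 0 0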

For a complete list of rbd options, see the rbd manual page (man 8 rbd).

For examples of the rbdmap usage, see the rbdmap manual page (man 8 rbdmap).

23.2.2 Increasing the Size of RBD Device

If you find that the size of the RBD device is no longer enough, you can easily increase it.

  1. Increase the size of the RBD image, for example up to 10 GB.

    cephadm@adm > rbd resize --size 10000 mypool/myimage
     Resizing image: 100% complete...done.
  2. Grow the file system to fill up the new size of the device.

    root # xfs_growfs /mnt
     [...]
     data blocks changed from 2097152 to 2560000

23.3 Snapshots

An RBD snapshot is a snapshot of a RADOS Block Device image. With snapshots, you retain a history of the image's state. Ceph also supports snapshot layering, which allows you to clone VM images quickly and easily. Ceph supports block device snapshots using the rbd command and many higher-level interfaces, including QEMU, libvirt, OpenStack, and CloudStack.

Note

Stop input and output operations and flush all pending writes before snapshotting an image. If the image contains a file system, the file system must be in a consistent state at the time of snapshotting.
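
For example, if the image is mapped and mounted at /mnt as in Section 23.2, one way to quiesce an XFS file system around the snapshot is the fsfreeze command from util-linux (a sketch using the example pool and image names):

root # fsfreeze --freeze /mnt
cephadm@adm > rbd snap create mypool/myimage@snapshot1
root # fsfreeze --unfreeze /mnt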

23.3.1 Cephx Notes

When cephx is enabled, you must specify a user name or ID and a path to the keyring containing the corresponding key for the user. See Chapter 19, Authentication with cephx for more details. You can also set the CEPH_ARGS environment variable to avoid re-entering these parameters.

cephadm@adm > rbd --id user-ID --keyring=/path/to/secret commands
cephadm@adm > rbd --name username --keyring=/path/to/secret commands

For example:

cephadm@adm > rbd --id admin --keyring=/etc/ceph/ceph.keyring commands
cephadm@adm > rbd --name client.admin --keyring=/etc/ceph/ceph.keyring commands
Tip

Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.
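
For example, a sketch of exporting CEPH_ARGS in the current shell so that subsequent rbd calls pick up the credentials automatically:

cephadm@adm > export CEPH_ARGS="--id admin --keyring=/etc/ceph/ceph.keyring"
cephadm@adm > rbd snap ls mypool/myimage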

23.3.2 Snapshot Basics

The following procedures demonstrate how to create, list, and remove snapshots using the rbd command on the command line.

23.3.2.1 Create Snapshot

To create a snapshot with rbd, specify the snap create option, the pool name, and the image name.

cephadm@adm > rbd --pool pool-name snap create --snap snap-name image-name
cephadm@adm > rbd snap create pool-name/image-name@snap-name

For example:

cephadm@adm > rbd --pool rbd snap create --snap snapshot1 image1
cephadm@adm > rbd snap create rbd/image1@snapshot1

23.3.2.2 List Snapshots

To list snapshots of an image, specify the pool name and the image name.

cephadm@adm > rbd --pool pool-name snap ls image-name
cephadm@adm > rbd snap ls pool-name/image-name

For example:

cephadm@adm > rbd --pool rbd snap ls image1
cephadm@adm > rbd snap ls rbd/image1

23.3.2.3 Rollback Snapshot

To roll back to a snapshot with rbd, specify the snap rollback option, the pool name, the image name, and the snapshot name.

cephadm@adm > rbd --pool pool-name snap rollback --snap snap-name image-name
cephadm@adm > rbd snap rollback pool-name/image-name@snap-name

For example:

cephadm@adm > rbd --pool pool1 snap rollback --snap snapshot1 image1
cephadm@adm > rbd snap rollback pool1/image1@snapshot1
Note

Rolling back an image to a snapshot means overwriting the current version of the image with data from a snapshot. The time it takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to roll back an image to a snapshot, and it is the preferred method of returning to a pre-existing state.

23.3.2.4 Delete a Snapshot

To delete a snapshot with rbd, specify the snap rm option, the pool name, the image name, and the snapshot name.

cephadm@adm > rbd --pool pool-name snap rm --snap snap-name image-name
cephadm@adm > rbd snap rm pool-name/image-name@snap-name

For example:

cephadm@adm > rbd --pool pool1 snap rm --snap snapshot1 image1
cephadm@adm > rbd snap rm pool1/image1@snapshot1
Note

Ceph OSDs delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.

23.3.2.5 Purge Snapshots

To delete all snapshots for an image with rbd, specify the snap purge option and the image name.

cephadm@adm > rbd --pool pool-name snap purge image-name
cephadm@adm > rbd snap purge pool-name/image-name

For example:

cephadm@adm > rbd --pool pool1 snap purge image1
cephadm@adm > rbd snap purge pool1/image1

23.3.3 Layering

Ceph supports the ability to create multiple copy-on-write (COW) clones of a block device snapshot. Snapshot layering enables Ceph block device clients to create images very quickly. For example, you might create a block device image with a Linux VM written to it, then snapshot the image, protect the snapshot, and create as many copy-on-write clones as you like. A snapshot is read-only, so cloning a snapshot simplifies semantics, making it possible to create clones rapidly.

Note

The terms 'parent' and 'child' mentioned in the command line examples below mean a Ceph block device snapshot (parent) and the corresponding image cloned from the snapshot (child).

Each cloned image (child) stores a reference to its parent image, which enables the cloned image to open the parent snapshot and read it.

A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read from, write to, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of a snapshot refers to the snapshot, so you must protect the snapshot before you clone it.

Note: --image-format 1 Not Supported

You cannot create snapshots of images created with the deprecated rbd create --image-format 1 option. Ceph only supports cloning of the default format 2 images.

23.3.3.1 Getting Started with Layering

Ceph block device layering is a simple process: you need an existing image, you create a snapshot of it, and you protect the snapshot. After you have performed these steps, you can begin cloning the snapshot.

The cloned image has a reference to the parent snapshot, and includes the pool ID, image ID, and snapshot ID. The inclusion of the pool ID means that you may clone snapshots from one pool to images in another pool.
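
The following commands string these steps together using hypothetical names (a pool 'mypool', a master image 'base-image', and a clone 'clone1'); they are a sketch of the workflow detailed in the sections below:

cephadm@adm > rbd create --size 10G mypool/base-image
cephadm@adm > rbd snap create mypool/base-image@snap1
cephadm@adm > rbd snap protect mypool/base-image@snap1
cephadm@adm > rbd clone mypool/base-image@snap1 mypool/clone1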

  • Image Template: A common use case for block device layering is to create a master image and a snapshot that serves as a template for clones. For example, a user may create an image for a Linux distribution (for example, SUSE Linux Enterprise Server), and create a snapshot for it. Periodically, the user may update the image and create a new snapshot (for example, zypper ref && zypper patch followed by rbd snap create). As the image matures, the user can clone any one of the snapshots.

  • Extended Template: A more advanced use case includes extending a template image that provides more information than a base image. For example, a user may clone an image (a VM template) and install other software (for example, a database, a content management system, or an analytics system), and then snapshot the extended image, which itself may be updated in the same way as the base image.

  • Template Pool: One way to use block device layering is to create a pool that contains master images that act as templates, and snapshots of those templates. You may then extend read-only privileges to users so that they may clone the snapshots without the ability to write or execute within the pool.

  • Image Migration/Recovery: One way to use block device layering is to migrate or recover data from one pool into another pool.

23.3.3.2 Protecting a Snapshot

Clones access the parent snapshots. All clones would break if a user inadvertently deleted the parent snapshot. To prevent data loss, you need to protect the snapshot before you can clone it.

cephadm@adm > rbd --pool pool-name snap protect \
 --image image-name --snap snapshot-name
cephadm@adm > rbd snap protect pool-name/image-name@snapshot-name

For example:

cephadm@adm > rbd --pool pool1 snap protect --image image1 --snap snapshot1
cephadm@adm > rbd snap protect pool1/image1@snapshot1
Note

You cannot delete a protected snapshot.

23.3.3.3 Cloning a Snapshot

To clone a snapshot, you need to specify the parent pool, image, snapshot, the child pool, and the image name. You need to protect the snapshot before you can clone it.

cephadm@adm > rbd clone --pool pool-name --image parent-image \
 --snap snap-name --dest-pool pool-name \
 --dest child-image
cephadm@adm > rbd clone pool-name/parent-image@snap-name \
pool-name/child-image-name

For example:

cephadm@adm > rbd clone pool1/image1@snapshot1 pool1/image2
Note

You may clone a snapshot from one pool to an image in another pool. For example, you may maintain read-only images and snapshots as templates in one pool, and writable clones in another pool.

23.3.3.4 Unprotecting a Snapshot

Before you can delete a snapshot, you must first unprotect it. Additionally, you cannot delete snapshots that are referenced by clones. You need to flatten each clone of a snapshot before you can delete the snapshot.

cephadm@adm > rbd --pool pool-name snap unprotect --image image-name \
 --snap snapshot-name
cephadm@adm > rbd snap unprotect pool-name/image-name@snapshot-name

For example:

cephadm@adm > rbd --pool pool1 snap unprotect --image image1 --snap snapshot1
cephadm@adm > rbd snap unprotect pool1/image1@snapshot1

23.3.3.5 Listing Children of a Snapshot

To list the children of a snapshot, execute the following:

cephadm@adm > rbd --pool pool-name children --image image-name --snap snap-name
cephadm@adm > rbd children pool-name/image-name@snapshot-name

For example:

cephadm@adm > rbd --pool pool1 children --image image1 --snap snapshot1
cephadm@adm > rbd children pool1/image1@snapshot1

23.3.3.6 Flattening a Cloned Image

Cloned images retain a reference to the parent snapshot. When you remove the reference from the child clone to the parent snapshot, you effectively 'flatten' the image by copying the information from the snapshot to the clone. The time it takes to flatten a clone increases with the size of the snapshot. To delete a snapshot, you must flatten the child images first.

cephadm@adm > rbd --pool pool-name flatten --image image-name
cephadm@adm > rbd flatten pool-name/image-name

For example:

cephadm@adm > rbd --pool pool1 flatten --image image1
cephadm@adm > rbd flatten pool1/image1
Note

Since a flattened image contains all the information from the snapshot, it takes up more storage space than a layered clone.

23.4 Mirroring

RBD images can be asynchronously mirrored between two Ceph clusters. This capability uses the RBD journaling image feature to ensure crash-consistent replication between clusters. Mirroring is configured on a per-pool basis within peer clusters and can be configured to automatically mirror all images within a pool or only a specific subset of images. Mirroring is configured using the rbd command. The rbd-mirror daemon is responsible for pulling image updates from the remote peer cluster and applying them to the image within the local cluster.

Note: rbd-mirror Daemon

To use RBD mirroring, you need to have two Ceph clusters, each running the rbd-mirror daemon.

Important: RADOS Block Devices Exported via iSCSI

You cannot mirror RBD devices that are exported via iSCSI using kernel-based iSCSI Gateway.

Refer to Chapter 27, Ceph iSCSI Gateway for more details on iSCSI.

23.4.1 rbd-mirror Daemon

The two rbd-mirror daemons are responsible for watching image journals on the remote peer cluster and replaying the journal events against the local cluster. The RBD image journaling feature records all modifications to the image in the order they occur. This ensures that a crash-consistent mirror of the remote image is available locally.

The rbd-mirror daemon is available in the rbd-mirror package. You can install the package on OSD nodes, gateway nodes, or even on dedicated nodes. We do not recommend installing rbd-mirror on the Admin Node. Install, enable, and start rbd-mirror:

root@minion > zypper install rbd-mirror
root@minion > systemctl enable ceph-rbd-mirror@server_name.service
root@minion > systemctl start ceph-rbd-mirror@server_name.service
Important

Each rbd-mirror daemon requires the ability to connect to both clusters simultaneously.

23.4.2 Pool Configuration

The following procedures demonstrate how to perform the basic administrative tasks to configure mirroring using the rbd command. Mirroring is configured on a per-pool basis within the Ceph clusters.

You need to perform the pool configuration steps on both peer clusters. For clarity, these procedures assume that two clusters, named 'local' and 'remote', are accessible from a single host.

See the rbd manual page (man 8 rbd) for additional details on how to connect to different Ceph clusters.

Tip: Multiple Clusters

The cluster name in the following examples corresponds to a Ceph configuration file of the same name /etc/ceph/remote.conf.
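
For example, when you pass --cluster remote, the rbd command reads /etc/ceph/remote.conf and, by default, searches for a keyring named after the cluster. The file names below are illustrative:

/etc/ceph/remote.conf                    # configuration of the 'remote' cluster
/etc/ceph/remote.client.admin.keyring    # keyring searched for by default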

23.4.2.1 Enable Mirroring on a Pool

To enable mirroring on a pool, specify the mirror pool enable subcommand, the pool name, and the mirroring mode. The mirroring mode can either be pool or image:

pool

All images in the pool with the journaling feature enabled are mirrored.

image

Mirroring needs to be explicitly enabled on each image. See Section 23.4.3.2, “Enable Image Mirroring” for more information.

For example:

cephadm@adm > rbd --cluster local mirror pool enable POOL_NAME pool
cephadm@adm > rbd --cluster remote mirror pool enable POOL_NAME pool

23.4.2.2 Disable Mirroring

To disable mirroring on a pool, specify the mirror pool disable subcommand and the pool name. When mirroring is disabled on a pool in this way, mirroring will also be disabled on any images (within the pool) for which mirroring was enabled explicitly.

cephadm@adm > rbd --cluster local mirror pool disable POOL_NAME
cephadm@adm > rbd --cluster remote mirror pool disable POOL_NAME

23.4.2.3 Add Cluster Peer

In order for the rbd-mirror daemon to discover its peer cluster, the peer needs to be registered to the pool. To add a mirroring peer cluster, specify the mirror pool peer add subcommand, the pool name, and a cluster specification:

cephadm@adm > rbd --cluster local mirror pool peer add POOL_NAME client.remote@remote
cephadm@adm > rbd --cluster remote mirror pool peer add POOL_NAME client.local@local

23.4.2.4 Remove Cluster Peer

To remove a mirroring peer cluster, specify the mirror pool peer remove subcommand, the pool name, and the peer UUID (available from the rbd mirror pool info command):

cephadm@adm > rbd --cluster local mirror pool peer remove POOL_NAME \
 55672766-c02b-4729-8567-f13a66893445
cephadm@adm > rbd --cluster remote mirror pool peer remove POOL_NAME \
 60c0e299-b38f-4234-91f6-eed0a367be08

23.4.3 Image Configuration

Unlike pool configuration, image configuration only needs to be performed against a single mirroring peer Ceph cluster.

Mirrored RBD images are designated as either primary or non-primary. This is a property of the image and not the pool. Images that are designated as non-primary cannot be modified.

Images are automatically promoted to primary when mirroring is first enabled on an image (either implicitly if the pool mirror mode was 'pool' and the image has the journaling image feature enabled, or explicitly (see Section 23.4.3.2, “Enable Image Mirroring”) by the rbd command).

23.4.3.1 Image Journaling Support

RBD mirroring uses the RBD journaling feature to ensure that the replicated image always remains crash-consistent. Before an image can be mirrored to a peer cluster, the journaling feature must be enabled. The feature can be enabled at the time of image creation by providing the --image-feature exclusive-lock,journaling option to the rbd command.

Alternatively, the journaling feature can be dynamically enabled on pre-existing RBD images. To enable journaling, specify the feature enable subcommand, the pool and image name, and the feature name:

cephadm@adm > rbd --cluster local feature enable POOL_NAME/IMAGE_NAME journaling
Note: Option Dependency

The journaling feature is dependent on the exclusive-lock feature. If the exclusive-lock feature is not already enabled, you need to enable it prior to enabling the journaling feature.

Warning: Journaling on All New Images

You can enable journaling on all new images by default by appending the journaling value to the rbd default features option in the Ceph configuration file. For example:

rbd default features = layering,exclusive-lock,object-map,deep-flatten,journaling

Before applying such a change, carefully consider if enabling journaling on all new images is good for your deployment, because it can have a negative performance impact.

23.4.3.2 Enable Image Mirroring

If mirroring is configured in the 'image' mode, then it is necessary to explicitly enable mirroring for each image within the pool. To enable mirroring for a specific image, specify the mirror image enable subcommand along with the pool and image name:

cephadm@adm > rbd --cluster local mirror image enable POOL_NAME/IMAGE_NAME

23.4.3.3 Disable Image Mirroring

To disable mirroring for a specific image, specify the mirror image disable subcommand along with the pool and image name:

cephadm@adm > rbd --cluster local mirror image disable POOL_NAME/IMAGE_NAME

23.4.3.4 Image Promotion and Demotion

In a failover scenario where the primary designation needs to be moved to the image in the peer cluster, you need to stop access to the primary image, demote the current primary image, promote the new primary image, and resume access to the image on the alternate cluster.

Note: Forced Promotion

Promotion can be forced using the --force option. Forced promotion is needed when the demotion cannot be propagated to the peer cluster (for example, in case of cluster failure or communication outage). This will result in a split-brain scenario between the two peers, and the image will no longer be synchronized until a resync subcommand is issued.

To demote a specific image to non-primary, specify the mirror image demote subcommand along with the pool and image name:

cephadm@adm > rbd --cluster local mirror image demote POOL_NAME/IMAGE_NAME

To demote all primary images within a pool to non-primary, specify the mirror pool demote subcommand along with the pool name:

cephadm@adm > rbd --cluster local mirror pool demote POOL_NAME

To promote a specific image to primary, specify the mirror image promote subcommand along with the pool and image name:

cephadm@adm > rbd --cluster remote mirror image promote POOL_NAME/IMAGE_NAME

To promote all non-primary images within a pool to primary, specify the mirror pool promote subcommand along with the pool name:

cephadm@adm > rbd --cluster local mirror pool promote POOL_NAME
Tip: Split I/O Load

Since the primary or non-primary status is per-image, it is possible to have two clusters split the I/O load and stage failover or failback.

23.4.3.5 Force Image Resync

If a split-brain event is detected by the rbd-mirror daemon, it will not attempt to mirror the affected image until corrected. To resume mirroring for an image, first demote the image determined to be out of date and then request a resync to the primary image. To request an image resync, specify the mirror image resync subcommand along with the pool and image name:

cephadm@adm > rbd mirror image resync POOL_NAME/IMAGE_NAME
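
For example, assuming the copy in the 'local' cluster is the one that is out of date, a sketch of the demote-then-resync sequence looks as follows:

cephadm@adm > rbd --cluster local mirror image demote POOL_NAME/IMAGE_NAME
cephadm@adm > rbd --cluster local mirror image resync POOL_NAME/IMAGE_NAME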

23.4.4 Mirror Status

The peer cluster replication status is stored for every primary mirrored image. This status can be retrieved using the mirror image status and mirror pool status subcommands:

To request the mirror image status, specify the mirror image status subcommand along with the pool and image name:

cephadm@adm > rbd mirror image status POOL_NAME/IMAGE_NAME

To request the mirror pool summary status, specify the mirror pool status subcommand along with the pool name:

cephadm@adm > rbd mirror pool status POOL_NAME
Tip

Adding the --verbose option to the mirror pool status subcommand will additionally output status details for every mirroring image in the pool.

23.5 Cache Settings

The user space implementation of the Ceph block device (librbd) cannot take advantage of the Linux page cache. Therefore, it includes its own in-memory caching. RBD caching behaves similarly to hard disk caching. When the OS sends a barrier or a flush request, all 'dirty' data is written to the OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a VM that properly sends flushes. The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can merge adjacent requests for better throughput.

Ceph supports write-back caching for RBD. To enable it, add

[client]
...
rbd cache = true

to the [client] section of your ceph.conf file. By default, librbd does not perform any caching. Writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more unflushed bytes than set in the rbd cache max dirty option. In such a case, the write triggers writeback and blocks until enough bytes are flushed.

Ceph supports write-through caching for RBD. You can set the size of the cache, and you can set targets and limits to switch from write-back caching to write-through caching. To enable write-through mode, set

rbd cache max dirty = 0

This means writes return only when the data is on disk on all replicas, but reads may come from the cache. The cache is in memory on the client, and each RBD image has its own cache. Since the cache is local to the client, there is no coherency if there are others accessing the image. Running GFS or OCFS on top of RBD will not work with caching enabled.

The ceph.conf file settings for RBD should be set in the [client] section of your configuration file. The settings include:

rbd cache

Enable caching for RADOS Block Device (RBD). Default is 'true'.

rbd cache size

The RBD cache size in bytes. Default is 32 MB.

rbd cache max dirty

The 'dirty' limit in bytes at which the cache triggers write-back. rbd cache max dirty needs to be less than rbd cache size. If set to 0, uses write-through caching. Default is 24 MB.

rbd cache target dirty

The 'dirty target' before the cache begins writing data to the data storage. Does not block writes to the cache. Default is 16 MB.

rbd cache max dirty age

The number of seconds dirty data is in the cache before writeback starts. Default is 1.

rbd cache writethrough until flush

Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but safe setting in case virtual machines running on rbd are too old to send flushes (for example, the virtio driver in Linux before kernel 2.6.32). Default is 'true'.
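
For example, a sketch of a [client] section that enables write-back caching with a larger cache (the values are illustrative and need tuning for your workload; keep rbd cache max dirty below rbd cache size):

[client]
rbd cache = true
rbd cache size = 67108864
rbd cache max dirty = 50331648
rbd cache target dirty = 33554432
rbd cache max dirty age = 2
rbd cache writethrough until flush = true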

23.6 QoS Settings

Generally, Quality of Service (QoS) refers to methods of traffic prioritization and resource reservation. It is particularly important for transporting traffic with special requirements.

Important: Not Supported by iSCSI

The following QoS settings are used only by the userspace RBD implementation librbd and not used by the kRBD implementation. Because iSCSI uses kRBD, it does not use the QoS settings. However, for iSCSI you can configure QoS on the kernel block device layer using standard kernel facilities.

rbd qos iops limit

The desired limit of I/O operations per second. Default is 0 (no limit).

rbd qos bps limit

The desired limit of I/O bytes per second. Default is 0 (no limit).

rbd qos read iops limit

The desired limit of read operations per second. Default is 0 (no limit).

rbd qos write iops limit

The desired limit of write operations per second. Default is 0 (no limit).

rbd qos read bps limit

The desired limit of read bytes per second. Default is 0 (no limit).

rbd qos write bps limit

The desired limit of write bytes per second. Default is 0 (no limit).

rbd qos iops burst

The desired burst limit of I/O operations. Default is 0 (no limit).

rbd qos bps burst

The desired burst limit of I/O bytes. Default is 0 (no limit).

rbd qos read iops burst

The desired burst limit of read operations. Default is 0 (no limit).

rbd qos write iops burst

The desired burst limit of write operations. Default is 0 (no limit).

rbd qos read bps burst

The desired burst limit of read bytes. Default is 0 (no limit).

rbd qos write bps burst

The desired burst limit of write bytes. Default is 0 (no limit).

rbd qos schedule tick min

The minimum schedule tick (in milliseconds) for QoS. Default is 50.
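
For example, a sketch of throttling librbd clients via the [client] section of ceph.conf (the values are illustrative):

[client]
rbd qos iops limit = 1000
rbd qos iops burst = 2000
rbd qos bps limit = 52428800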

23.7 Read-ahead Settings

RADOS Block Device supports read-ahead/prefetching to optimize small, sequential reads. This should normally be handled by the guest OS in the case of a virtual machine, but boot loaders may not issue efficient reads. Read-ahead is automatically disabled if caching is disabled.

rbd readahead trigger requests

Number of sequential read requests necessary to trigger read-ahead. Default is 10.

rbd readahead max bytes

Maximum size of a read-ahead request. If set to 0, read-ahead is disabled. Default is 512 kB.

rbd readahead disable after bytes

After this many bytes have been read from an RBD image, read-ahead is disabled for that image until it is closed. This allows the guest OS to take over read-ahead when it is booted. If set to 0, read-ahead stays enabled. Default is 50 MB.

23.8 Advanced Features

RADOS Block Device supports advanced features that enhance the functionality of RBD images. You can specify the features either on the command line when creating an RBD image, or in the Ceph configuration file by using the rbd_default_features option.

You can specify the values of the rbd_default_features option in two ways:

  • As a sum of features' internal values. Each feature has its own internal value, for example 'layering' has 1 and 'fast-diff' has 16. Therefore, to activate these two features by default, include the following:

    rbd_default_features = 17
  • As a comma-separated list of features. The previous example will look as follows:

    rbd_default_features = layering,fast-diff
Note: Features Not Supported by iSCSI

RBD images with the following features will not be supported by iSCSI: deep-flatten, object-map, journaling, fast-diff, striping

A list of advanced RBD features follows:

layering

Layering enables you to use cloning.

Internal value is 1, default is 'yes'.

striping

Striping spreads data across multiple objects and helps with parallelism for sequential read/write workloads. It prevents single node bottlenecks for large or busy RADOS Block Devices.

Internal value is 2, default is 'yes'.

exclusive-lock

When enabled, it requires a client to get a lock on an object before making a write. Enable the exclusive lock only when a single client is accessing an image at the same time. Internal value is 4. Default is 'yes'.

object-map

Object map support depends on exclusive lock support. Block devices are thin provisioned, meaning that they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, importing and exporting a sparsely populated image, and deleting.

Internal value is 8, default is 'yes'.

fast-diff

Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image and the actual data usage of a snapshot.

Internal value is 16, default is 'yes'.

deep-flatten

Deep-flatten makes the rbd flatten (see Section 23.3.3.6, “Flattening a Cloned Image”) work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, so you will not be able to delete the parent image until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.

Internal value is 32, default is 'yes'.

journaling

Journaling support depends on exclusive lock support. Journaling records all modifications to an image in the order they occur. RBD mirroring (see Section 23.4, “Mirroring”) uses the journal to replicate a crash consistent image to a remote cluster.

Internal value is 64, default is 'no'.
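
For example, a sketch of creating an image with an explicit feature set on the command line instead of relying on rbd_default_features (the pool and image names are illustrative):

cephadm@adm > rbd create --size 1G --image-feature layering,exclusive-lock,object-map,fast-diff mypool/myimage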

23.9 Mapping RBD Using Old Kernel Clients

Old clients (for example, SLE11 SP4) may not be able to map RBD images because a cluster deployed with SUSE Enterprise Storage 6 forces some features (both RBD image level features and RADOS level features) that these old clients do not support. When this happens, the OSD logs will show messages similar to the following:

2019-05-17 16:11:33.739133 7fcb83a2e700  0 -- 192.168.122.221:0/1006830 >> \
192.168.122.152:6789/0 pipe(0x65d4e0 sd=3 :57323 s=1 pgs=0 cs=0 l=1 c=0x65d770).connect \
protocol feature mismatch, my 2fffffffffff < peer 4010ff8ffacffff missing 401000000000000
Warning: Changing CRUSH Map Bucket Types Causes Massive Rebalancing

If you intend to switch the CRUSH Map bucket types between 'straw' and 'straw2', do it in a planned manner. Expect a significant impact on the cluster load because changing bucket type will cause massive cluster rebalancing.

  1. Disable any RBD image features that are not supported. For example:

    cephadm@adm > rbd feature disable pool1/image1 object-map
    cephadm@adm > rbd feature disable pool1/image1 exclusive-lock
  2. Change the CRUSH Map bucket types from 'straw2' to 'straw':

    1. Save the CRUSH Map:

      cephadm@adm > ceph osd getcrushmap -o crushmap.original
    2. Decompile the CRUSH Map:

      cephadm@adm > crushtool -d crushmap.original -o crushmap.txt
    3. Edit the CRUSH Map and replace 'straw2' with 'straw'.

    4. Recompile the CRUSH Map:

      cephadm@adm > crushtool -c crushmap.txt -o crushmap.new
    5. Set the new CRUSH Map:

      cephadm@adm > ceph osd setcrushmap -i crushmap.new