Jump to contentJump to page navigation: previous page [access key p]/next page [access key n]
documentation.suse.com / SUSE Enterprise Storage 7.1 Documentation / Administration and Operations Guide / Storing Data in a Cluster / RADOS Block Device
Applies to SUSE Enterprise Storage 7.1

20 RADOS Block Device

A block is a sequence of bytes, for example a 4 MB block of data. Block-based storage interfaces are the most common way to store data with rotating media, such as hard disks, CDs, floppy disks. The ubiquity of block device interfaces makes a virtual block device an ideal candidate to interact with a mass data storage system like Ceph.

Ceph block devices allow sharing of physical resources, and are resizable. They store data striped over multiple OSDs in a Ceph cluster. Ceph block devices leverage RADOS capabilities such as snapshotting, replication, and consistency. Ceph's RADOS Block Devices (RBD) interact with OSDs using kernel modules or the librbd library.

RADOS protocol
Figure 20.1: RADOS protocol

Ceph's block devices deliver high performance with infinite scalability to kernel modules. They support virtualization solutions such as QEMU, or cloud-based computing systems such as OpenStack that rely on libvirt. You can use the same cluster to operate the Object Gateway, CephFS, and RADOS Block Devices simultaneously.

20.1 Block device commands

The rbd command enables you to create, list, introspect, and remove block device images. You can also use it, for example, to clone images, create snapshots, rollback an image to a snapshot, or view a snapshot.

20.1.1 Creating a block device image in a replicated pool

Before you can add a block device to a client, you need to create a related image in an existing pool (see Chapter 18, Manage storage pools):

cephuser@adm > rbd create --size MEGABYTES POOL-NAME/IMAGE-NAME

For example, to create a 1 GB image named 'myimage' that stores information in a pool named 'mypool', execute the following:

cephuser@adm > rbd create --size 1024 mypool/myimage
Tip
Tip: Image size units

If you omit a size unit shortcut ('G' or 'T'), the image's size is in megabytes. Use 'G' or 'T' after the size number to specify gigabytes or terabytes.

20.1.2 Creating a block device image in an erasure coded pool

It is possible to store data of a block device image directly in erasure coded (EC) pools. A RADOS Block Device image consists of data and metadata parts. You can store only the data part of a RADOS Block Device image in an EC pool. The pool needs to have the overwrite flag set to true, and that is only possible if all OSDs where the pool is stored use BlueStore.

You cannot store the image's metadata part in an EC pool. You can specify the replicated pool for storing the image's metadata with the --pool= option of the rbd create command or specify pool/ as a prefix to the image name.

Create an EC pool:

cephuser@adm > ceph osd pool create EC_POOL 12 12 erasure
cephuser@adm > ceph osd pool set EC_POOL allow_ec_overwrites true

Specify the replicated pool for storing metadata:

cephuser@adm > rbd create IMAGE_NAME --size=1G --data-pool EC_POOL --pool=POOL

Or:

cephuser@adm > rbd create POOL/IMAGE_NAME --size=1G --data-pool EC_POOL

20.1.3 Listing block device images

To list block devices in a pool named 'mypool', execute the following:

cephuser@adm > rbd ls mypool

20.1.4 Retrieving image information

To retrieve information from an image 'myimage' within a pool named 'mypool', run the following:

cephuser@adm > rbd info mypool/myimage

20.1.5 Resizing a block device image

RADOS Block Device images are thin provisioned—they do not actually use any physical storage until you begin saving data to them. However, they do have a maximum capacity that you set with the --size option. If you want to increase (or decrease) the maximum size of the image, run the following:

cephuser@adm > rbd resize --size 2048 POOL_NAME/IMAGE_NAME # to increase
cephuser@adm > rbd resize --size 2048 POOL_NAME/IMAGE_NAME --allow-shrink # to decrease

20.1.6 Removing a block device image

To remove a block device that corresponds to an image 'myimage' in a pool named 'mypool', run the following:

cephuser@adm > rbd rm mypool/myimage

20.2 Mounting and unmounting

After you create a RADOS Block Device, you can use it like any other disk device: format it, mount it to be able to exchange files, and unmount it when done.

The rbd command defaults to accessing the cluster using the Ceph admin user account. This account has full administrative access to the cluster. This runs the risk of accidentally causing damage, similarly to logging in to a Linux workstation as root. Thus, it is preferable to create user accounts with fewer privileges and use these accounts for normal read/write RADOS Block Device access.

20.2.1 Creating a Ceph user account

To create a new user account with Ceph Manager, Ceph Monitor, and Ceph OSD capabilities, use the ceph command with the auth get-or-create subcommand:

cephuser@adm > ceph auth get-or-create client.ID mon 'profile rbd' osd 'profile profile name \
  [pool=pool-name] [, profile ...]' mgr 'profile rbd [pool=pool-name]'

For example, to create a user called qemu with read-write access to the pool vms and read-only access to the pool images, execute the following:

ceph auth get-or-create client.qemu mon 'profile rbd' osd 'profile rbd pool=vms, profile rbd-read-only pool=images' \
  mgr 'profile rbd pool=images'

The output from the ceph auth get-or-create command will be the keyring for the specified user, which can be written to /etc/ceph/ceph.client.ID.keyring.

Note
Note

When using the rbd command, you can specify the user ID by providing the optional --id ID argument.

For more details on managing Ceph user accounts, refer to Chapter 30, Authentication with cephx.

20.2.2 User authentication

To specify a user name, use --id user-name. If you use cephx authentication, you also need to specify a secret. It may come from a keyring or a file containing the secret:

cephuser@adm > rbd device map --pool rbd myimage --id admin --keyring /path/to/keyring

or

cephuser@adm > rbd device map --pool rbd myimage --id admin --keyfile /path/to/file

20.2.3 Preparing a RADOS Block Device for use

  1. Make sure your Ceph cluster includes a pool with the disk image you want to map. Assume the pool is called mypool and the image is myimage.

    cephuser@adm > rbd list mypool
  2. Map the image to a new block device:

    cephuser@adm > rbd device map --pool mypool myimage
  3. List all mapped devices:

    cephuser@adm > rbd device list
    id pool   image   snap device
    0  mypool myimage -    /dev/rbd0

    The device we want to work on is /dev/rbd0.

    Tip
    Tip: RBD device path

    Instead of /dev/rbdDEVICE_NUMBER, you can use /dev/rbd/POOL_NAME/IMAGE_NAME as a persistent device path. For example:

           /dev/rbd/mypool/myimage
  4. Make an XFS file system on the /dev/rbd0 device:

    # mkfs.xfs /dev/rbd0
          log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
          log stripe unit adjusted to 32KiB
          meta-data=/dev/rbd0              isize=256    agcount=9, agsize=261120 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=0        finobt=0
          data     =                       bsize=4096   blocks=2097152, imaxpct=25
          =                       sunit=1024   swidth=1024 blks
          naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
          log      =internal log           bsize=4096   blocks=2560, version=2
          =                       sectsz=512   sunit=8 blks, lazy-count=1
          realtime =none                   extsz=4096   blocks=0, rtextents=0
  5. Replacing /mnt with your mount point, mount the device and check it is correctly mounted:

    # mount /dev/rbd0 /mnt
          # mount | grep rbd0
          /dev/rbd0 on /mnt type xfs (rw,relatime,attr2,inode64,sunit=8192,...

    Now you can move data to and from the device as if it was a local directory.

    Tip
    Tip: Increasing the size of RBD device

    If you find that the size of the RBD device is no longer enough, you can easily increase it.

    1. Increase the size of the RBD image, for example up to 10 GB.

      cephuser@adm > rbd resize --size 10000 mypool/myimage
               Resizing image: 100% complete...done.
    2. Grow the file system to fill up the new size of the device:

      # xfs_growfs /mnt
      [...]
      data blocks changed from 2097152 to 2560000
  6. After you finish accessing the device, you can unmap and unmount it.

    cephuser@adm > rbd device unmap /dev/rbd0
    # unmount /mnt
Tip
Tip: Manual mounting and unmounting

A rbdmap script and systemd unit is provided to make the process of mapping and mounting RBDs after boot, and unmounting them before shutdown, smoother. Refer to Section 20.2.4, “rbdmap Map RBD devices at boot time”.

20.2.4 rbdmap Map RBD devices at boot time

rbdmap is a shell script that automates rbd map and rbd device unmap operations on one or more RBD images. Although you can run the script manually at any time, the main advantage is automatic mapping and mounting of RBD images at boot time (and unmounting and unmapping at shutdown), as triggered by the Init system. A systemd unit file, rbdmap.service is included with the ceph-common package for this purpose.

The script takes a single argument, which can be either map or unmap. In either case, the script parses a configuration file. It defaults to /etc/ceph/rbdmap, but can be overridden via an environment variable RBDMAPFILE. Each line of the configuration file corresponds to an RBD image which is to be mapped, or unmapped.

The configuration file has the following format:

image_specification rbd_options
image_specification

Path to an image within a pool. Specify as pool_name/image_name.

rbd_options

An optional list of parameters to be passed to the underlying rbd device map command. These parameters and their values should be specified as a comma-separated string, for example:

PARAM1=VAL1,PARAM2=VAL2,...

The example makes the rbdmap script run the following command:

cephuser@adm > rbd device map POOL_NAME/IMAGE_NAME --PARAM1 VAL1 --PARAM2 VAL2

In the following example you can see how to specify a user name and a keyring with a corresponding secret:

cephuser@adm > rbdmap device map mypool/myimage id=rbd_user,keyring=/etc/ceph/ceph.client.rbd.keyring

When run as rbdmap map, the script parses the configuration file, and for each specified RBD image, it attempts to first map the image (using the rbd device map command) and then mount the image.

When run as rbdmap unmap, images listed in the configuration file will be unmounted and unmapped.

rbdmap unmap-all attempts to unmount and subsequently unmap all currently mapped RBD images, regardless of whether they are listed in the configuration file.

If successful, the rbd device map operation maps the image to a /dev/rbdX device, at which point a udev rule is triggered to create a friendly device name symbolic link /dev/rbd/pool_name/image_name pointing to the real mapped device.

In order for mounting and unmounting to succeed, the 'friendly' device name needs to have a corresponding entry in /etc/fstab. When writing /etc/fstab entries for RBD images, specify the 'noauto' (or 'nofail') mount option. This prevents the Init system from trying to mount the device too early—before the device in question even exists, as rbdmap.service is typically triggered quite late in the boot sequence.

For a complete list of rbd options, see the rbd manual page (man 8 rbd).

For examples of the rbdmap usage, see the rbdmap manual page (man 8 rbdmap).

20.2.5 Increasing the size of RBD devices

If you find that the size of the RBD device is no longer enough, you can easily increase it.

  1. Increase the size of the RBD image, for example up to 10GB.

    cephuser@adm > rbd resize --size 10000 mypool/myimage
     Resizing image: 100% complete...done.
  2. Grow the file system to fill up the new size of the device.

    # xfs_growfs /mnt
     [...]
     data blocks changed from 2097152 to 2560000

20.3 Snapshots

An RBD snapshot is a snapshot of a RADOS Block Device image. With snapshots, you retain a history of the image's state. Ceph also supports snapshot layering, which allows you to clone VM images quickly and easily. Ceph supports block device snapshots using the rbd command and many higher-level interfaces, including QEMU, libvirt, OpenStack, and CloudStack.

Note
Note

Stop input and output operations and flush all pending writes before snapshotting an image. If the image contains a file system, the file system must be in a consistent state at the time of snapshotting.

20.3.1 Enabling and configuring cephx

When cephx is enabled, you must specify a user name or ID and a path to the keyring containing the corresponding key for the user. See Chapter 30, Authentication with cephx for more details. You may also add the CEPH_ARGS environment variable to avoid re-entry of the following parameters.

cephuser@adm > rbd --id user-ID --keyring=/path/to/secret commands
cephuser@adm > rbd --name username --keyring=/path/to/secret commands

For example:

cephuser@adm > rbd --id admin --keyring=/etc/ceph/ceph.keyring commands
cephuser@adm > rbd --name client.admin --keyring=/etc/ceph/ceph.keyring commands
Tip
Tip

Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.

20.3.2 Snapshot basics

The following procedures demonstrate how to create, list, and remove snapshots using the rbd command on the command line.

20.3.2.1 Creating snapshots

To create a snapshot with rbd, specify the snap create option, the pool name, and the image name.

cephuser@adm > rbd --pool pool-name snap create --snap snap-name image-name
cephuser@adm > rbd snap create pool-name/image-name@snap-name

For example:

cephuser@adm > rbd --pool rbd snap create --snap snapshot1 image1
cephuser@adm > rbd snap create rbd/image1@snapshot1

20.3.2.2 Listing snapshots

To list snapshots of an image, specify the pool name and the image name.

cephuser@adm > rbd --pool pool-name snap ls image-name
cephuser@adm > rbd snap ls pool-name/image-name

For example:

cephuser@adm > rbd --pool rbd snap ls image1
cephuser@adm > rbd snap ls rbd/image1

20.3.2.3 Rolling back snapshots

To rollback to a snapshot with rbd, specify the snap rollback option, the pool name, the image name, and the snapshot name.

cephuser@adm > rbd --pool pool-name snap rollback --snap snap-name image-name
cephuser@adm > rbd snap rollback pool-name/image-name@snap-name

For example:

cephuser@adm > rbd --pool pool1 snap rollback --snap snapshot1 image1
cephuser@adm > rbd snap rollback pool1/image1@snapshot1
Note
Note

Rolling back an image to a snapshot means overwriting the current version of the image with data from a snapshot. The time it takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to rollback an image to a snapshot, and it is the preferred method of returning to a pre-existing state.

20.3.2.4 Deleting a snapshot

To delete a snapshot with rbd, specify the snap rm option, the pool name, the image name, and the user name.

cephuser@adm > rbd --pool pool-name snap rm --snap snap-name image-name
cephuser@adm > rbd snap rm pool-name/image-name@snap-name

For example:

cephuser@adm > rbd --pool pool1 snap rm --snap snapshot1 image1
cephuser@adm > rbd snap rm pool1/image1@snapshot1
Note
Note

Ceph OSDs delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.

20.3.2.5 Purging snapshots

To delete all snapshots for an image with rbd, specify the snap purge option and the image name.

cephuser@adm > rbd --pool pool-name snap purge image-name
cephuser@adm > rbd snap purge pool-name/image-name

For example:

cephuser@adm > rbd --pool pool1 snap purge image1
cephuser@adm > rbd snap purge pool1/image1

20.3.3 Snapshot layering

Ceph supports the ability to create multiple copy-on-write (COW) clones of a block device snapshot. Snapshot layering enables Ceph block device clients to create images very quickly. For example, you might create a block device image with a Linux VM written to it, then, snapshot the image, protect the snapshot, and create as many copy-on-write clones as you like. A snapshot is read-only, so cloning a snapshot simplifies semantics—making it possible to create clones rapidly.

Note
Note

The terms 'parent' and 'child' mentioned in the command line examples below mean a Ceph block device snapshot (parent) and the corresponding image cloned from the snapshot (child).

Each cloned image (child) stores a reference to its parent image, which enables the cloned image to open the parent snapshot and read it.

A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read to, write from, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of a snapshot refers to the snapshot, so you must protect the snapshot before you clone it.

Note
Note: --image-format 1 not supported

You cannot create snapshots of images created with the deprecated rbd create --image-format 1 option. Ceph only supports cloning of the default format 2 images.

20.3.3.1 Getting started with layering

Ceph block device layering is a simple process. You must have an image. You must create a snapshot of the image. You must protect the snapshot. After you have performed these steps, you can begin cloning the snapshot.

The cloned image has a reference to the parent snapshot, and includes the pool ID, image ID, and snapshot ID. The inclusion of the pool ID means that you may clone snapshots from one pool to images in another pool.

  • Image Template: A common use case for block device layering is to create a primary image and a snapshot that serves as a template for clones. For example, a user may create an image for a Linux distribution (for example, SUSE Linux Enterprise Server), and create a snapshot for it. Periodically, the user may update the image and create a new snapshot (for example, zypper ref && zypper patch followed by rbd snap create). As the image matures, the user can clone any one of the snapshots.

  • Extended Template: A more advanced use case includes extending a template image that provides more information than a base image. For example, a user may clone an image (a VM template) and install other software (for example, a database, a content management system, or an analytics system), and then snapshot the extended image, which itself may be updated in the same way as the base image.

  • Template Pool: One way to use block device layering is to create a pool that contains primary images that act as templates, and snapshots of those templates. You may then extend read-only privileges to users so that they may clone the snapshots without the ability to write or execute within the pool.

  • Image Migration/Recovery: One way to use block device layering is to migrate or recover data from one pool into another pool.

20.3.3.2 Protecting a snapshot

Clones access the parent snapshots. All clones would break if a user inadvertently deleted the parent snapshot. To prevent data loss, you need to protect the snapshot before you can clone it.

cephuser@adm > rbd --pool pool-name snap protect \
 --image image-name --snap snapshot-name
cephuser@adm > rbd snap protect pool-name/image-name@snapshot-name

For example:

cephuser@adm > rbd --pool pool1 snap protect --image image1 --snap snapshot1
cephuser@adm > rbd snap protect pool1/image1@snapshot1
Note
Note

You cannot delete a protected snapshot.

20.3.3.3 Cloning a snapshot

To clone a snapshot, you need to specify the parent pool, image, snapshot, the child pool, and the image name. You need to protect the snapshot before you can clone it.

cephuser@adm > rbd clone --pool pool-name --image parent-image \
 --snap snap-name --dest-pool pool-name \
 --dest child-image
cephuser@adm > rbd clone pool-name/parent-image@snap-name \
pool-name/child-image-name

For example:

cephuser@adm > rbd clone pool1/image1@snapshot1 pool1/image2
Note
Note

You may clone a snapshot from one pool to an image in another pool. For example, you may maintain read-only images and snapshots as templates in one pool, and writable clones in another pool.

20.3.3.4 Unprotecting a snapshot

Before you can delete a snapshot, you must unprotect it first. Additionally, you may not delete snapshots that have references from clones. You need to flatten each clone of a snapshot before you can delete the snapshot.

cephuser@adm > rbd --pool pool-name snap unprotect --image image-name \
 --snap snapshot-name
cephuser@adm > rbd snap unprotect pool-name/image-name@snapshot-name

For example:

cephuser@adm > rbd --pool pool1 snap unprotect --image image1 --snap snapshot1
cephuser@adm > rbd snap unprotect pool1/image1@snapshot1

20.3.3.5 Listing children of a snapshot

To list the children of a snapshot, execute the following:

cephuser@adm > rbd --pool pool-name children --image image-name --snap snap-name
cephuser@adm > rbd children pool-name/image-name@snapshot-name

For example:

cephuser@adm > rbd --pool pool1 children --image image1 --snap snapshot1
cephuser@adm > rbd children pool1/image1@snapshot1

20.3.3.6 Flattening a cloned image

Cloned images retain a reference to the parent snapshot. When you remove the reference from the child clone to the parent snapshot, you effectively 'flatten' the image by copying the information from the snapshot to the clone. The time it takes to flatten a clone increases with the size of the snapshot. To delete a snapshot, you must flatten the child images first.

cephuser@adm > rbd --pool pool-name flatten --image image-name
cephuser@adm > rbd flatten pool-name/image-name

For example:

cephuser@adm > rbd --pool pool1 flatten --image image1
cephuser@adm > rbd flatten pool1/image1
Note
Note

Since a flattened image contains all the information from the snapshot, a flattened image will take up more storage space than a layered clone.

20.4 RBD image mirrors

RBD images can be asynchronously mirrored between two Ceph clusters. This capability is available in two modes:

Journal-based

This mode uses the RBD journaling image feature to ensure point-in-time, crash-consistent replication between clusters. Every write to the RBD image is first recorded to the associated journal before modifying the actual image. The remote cluster will read from the journal and replay the updates to its local copy of the image. Since each write to the RBD image will result in two writes to the Ceph cluster, expect write latencies to nearly double when using the RBD journaling image feature.

Snapshot-based

This mode uses periodically-scheduled or manually-created RBD image mirror-snapshots to replicate crash-consistent RBD images between clusters. The remote cluster will determine any data or metadata updates between two mirror-snapshots, and copy the deltas to its local copy of the image. With the help of the RBD fast-diff image feature, updated data blocks can be quickly computed without the need to scan the full RBD image. Since this mode is not point-in-time consistent, the full snapshot delta will need to be synchronized prior to use during a failover scenario. Any partially-applied snapshot deltas will be rolled back to the last fully synchronized snapshot prior to use.

Mirroring is configured on a per-pool basis within peer clusters. This can be configured on a specific subset of images within the pool, or configured to automatically mirror all images within a pool when using journal-based mirroring only. Mirroring is configured using the rbd command. The rbd-mirror daemon is responsible for pulling image updates from the remote, peer cluster and applying them to the image within the local cluster.

Depending on the desired needs for replication, RBD mirroring can be configured for either one- or two-way replication:

One-way Replication

When data is only mirrored from a primary cluster to a secondary cluster, the rbd-mirror daemon runs only on the secondary cluster.

Two-way Replication

When data is mirrored from primary images on one cluster to non-primary images on another cluster (and vice-versa), the rbd-mirror daemon runs on both clusters.

Important
Important

Each instance of the rbd-mirror daemon needs to be able to connect to both the local and remote Ceph clusters simultaneously. For example, all monitor and OSD hosts. Additionally, the network needs to have sufficient bandwidth between the two data centers to handle mirroring workload.

20.4.1 Pool configuration

The following procedures demonstrate how to perform the basic administrative tasks to configure mirroring using the rbd command. Mirroring is configured on a per-pool basis within the Ceph clusters.

You need to perform the pool configuration steps on both peer clusters. These procedures assume two clusters, named local and remote, are accessible from a single host for clarity.

See the rbd manual page (man 8 rbd) for additional details on how to connect to different Ceph clusters.

Tip
Tip: Multiple clusters

The cluster name in the following examples corresponds to a Ceph configuration file of the same name /etc/ceph/remote.conf and Ceph keyring file of the same name /etc/ceph/remote.client.admin.keyring.

20.4.1.1 Enable mirroring on a pool

To enable mirroring on a pool, specify the mirror pool enable subcommand, the pool name, and the mirroring mode. The mirroring mode can either be pool or image:

pool

All images in the pool with the journaling feature enabled are mirrored.

image

Mirroring needs to be explicitly enabled on each image. See Section 20.4.2.1, “Enabling image mirroring” for more information.

For example:

cephuser@adm > rbd --cluster local mirror pool enable POOL_NAME pool
cephuser@adm > rbd --cluster remote mirror pool enable POOL_NAME pool

20.4.1.2 Disable mirroring

To disable mirroring on a pool, specify the mirror pool disable subcommand and the pool name. When mirroring is disabled on a pool in this way, mirroring will also be disabled on any images (within the pool) for which mirroring was enabled explicitly.

cephuser@adm > rbd --cluster local mirror pool disable POOL_NAME
cephuser@adm > rbd --cluster remote mirror pool disable POOL_NAME

20.4.1.3 Bootstrapping peers

In order for the rbd-mirror daemon to discover its peer cluster, the peer needs to be registered to the pool and a user account needs to be created. This process can be automated with rbd and the mirror pool peer bootstrap create and mirror pool peer bootstrap import commands.

To manually create a new bootstrap token with rbd, specify the mirror pool peer bootstrap create command, a pool name, along with an optional friendly site name to describe the local cluster:

cephuser@local > rbd mirror pool peer bootstrap create \
 [--site-name LOCAL_SITE_NAME] POOL_NAME

The output of mirror pool peer bootstrap create will be a token that should be provided to the mirror pool peer bootstrap import command. For example, on the local cluster:

cephuser@local > rbd --cluster local mirror pool peer bootstrap create --site-name local image-pool
eyJmc2lkIjoiOWY1MjgyZGItYjg5OS00NTk2LTgwOTgtMzIwYzFmYzM5NmYzIiwiY2xpZW50X2lkIjoicmJkLW1pcnJvci1wZWVyIiwia2V5I \
joiQVFBUnczOWQwdkhvQmhBQVlMM1I4RmR5dHNJQU50bkFTZ0lOTVE9PSIsIm1vbl9ob3N0IjoiW3YyOjE5Mi4xNjguMS4zOjY4MjAsdjE6MTkyLjE2OC4xLjM6NjgyMV0ifQ==

To manually import the bootstrap token created by another cluster with the rbd command, use the following syntax:

rbd mirror pool peer bootstrap import \
 [--site-name LOCAL_SITE_NAME] \
 [--direction DIRECTION \
 POOL_NAME TOKEN_PATH

Where:

LOCAL_SITE_NAME

An optional friendly site name to describe the local cluster.

DIRECTION

A mirroring direction. Defaults to rx-tx for bidirectional mirroring, but can also be set to rx-only for unidirectional mirroring.

POOL_NAME

Name of the pool.

TOKEN_PATH

A file path to the created token (or - to read it from the standard input).

For example, on the remote cluster:

cephuser@remote > cat <<EOF > token
eyJmc2lkIjoiOWY1MjgyZGItYjg5OS00NTk2LTgwOTgtMzIwYzFmYzM5NmYzIiwiY2xpZW50X2lkIjoicmJkLW1pcnJvci1wZWVyIiwia2V5IjoiQVFBUnczOWQwdkhvQmhBQVlMM1I4RmR5dHNJQU50bkFTZ0lOTVE9PSIsIm1vbl9ob3N0IjoiW3YyOjE5Mi4xNjguMS4zOjY4MjAsdjE6MTkyLjE2OC4xLjM6NjgyMV0ifQ==
EOF
cephuser@adm > rbd --cluster remote mirror pool peer bootstrap import \
 --site-name remote image-pool token

20.4.1.4 Adding a cluster peer manually

Alternatively to bootstrapping peers as described in Section 20.4.1.3, “Bootstrapping peers”, you can specify peers manually. The remote rbd-mirror daemon will need access to the local cluster to perform mirroring. Create a new local Ceph user that the remote rbd-mirror daemon will use, for example rbd-mirror-peer:

cephuser@adm > ceph auth get-or-create client.rbd-mirror-peer \
 mon 'profile rbd' osd 'profile rbd'

Use the following syntax to add a mirroring peer Ceph cluster with the rbd command:

rbd mirror pool peer add POOL_NAME CLIENT_NAME@CLUSTER_NAME

For example:

cephuser@adm > rbd --cluster site-a mirror pool peer add image-pool client.rbd-mirror-peer@site-b
cephuser@adm > rbd --cluster site-b mirror pool peer add image-pool client.rbd-mirror-peer@site-a

By default, the rbd-mirror daemon needs to have access to the Ceph configuration file located at /etc/ceph/.CLUSTER_NAME.conf. It provides IP addresses of the peer cluster’s MONs and a keyring for a client named CLIENT_NAME located in the default or custom keyring search paths, for example /etc/ceph/CLUSTER_NAME.CLIENT_NAME.keyring.

Alternatively, the peer cluster’s MON and/or client key can be securely stored within the local Ceph config-key store. To specify the peer cluster connection attributes when adding a mirroring peer, use the --remote-mon-host and --remote-key-file options. For example:

cephuser@adm > rbd --cluster site-a mirror pool peer add image-pool \
 client.rbd-mirror-peer@site-b --remote-mon-host 192.168.1.1,192.168.1.2 \
 --remote-key-file /PATH/TO/KEY_FILE
cephuser@adm > rbd --cluster site-a mirror pool info image-pool --all
Mode: pool
Peers:
  UUID        NAME   CLIENT                 MON_HOST                KEY
  587b08db... site-b client.rbd-mirror-peer 192.168.1.1,192.168.1.2 AQAeuZdb...

20.4.1.5 Remove cluster peer

To remove a mirroring peer cluster, specify the mirror pool peer remove subcommand, the pool name, and the peer UUID (available from the rbd mirror pool info command):

cephuser@adm > rbd --cluster local mirror pool peer remove POOL_NAME \
 55672766-c02b-4729-8567-f13a66893445
cephuser@adm > rbd --cluster remote mirror pool peer remove POOL_NAME \
 60c0e299-b38f-4234-91f6-eed0a367be08

20.4.1.6 Data pools

When creating images in the destination cluster, rbd-mirror selects a data pool as follows:

  • If the destination cluster has a default data pool configured (with the rbd_default_data_pool configuration option), it will be used.

  • Otherwise, if the source image uses a separate data pool, and a pool with the same name exists on the destination cluster, that pool will be used.

  • If neither of the above is true, no data pool will be set.

20.4.2 RBD Image configuration

Unlike pool configuration, image configuration only needs to be performed against a single mirroring peer Ceph cluster.

Mirrored RBD images are designated as either primary or non-primary. This is a property of the image and not the pool. Images that are designated as non-primary cannot be modified.

Images are automatically promoted to primary when mirroring is first enabled on an image (either implicitly if the pool mirror mode was 'pool' and the image has the journaling image feature enabled, or explicitly (see Section 20.4.2.1, “Enabling image mirroring”) by the rbd command).

20.4.2.1 Enabling image mirroring

If mirroring is configured in the image mode, then it is necessary to explicitly enable mirroring for each image within the pool. To enable mirroring for a specific image with rbd, specify the mirror image enable subcommand along with the pool and image name:

cephuser@adm > rbd --cluster local mirror image enable \
 POOL_NAME/IMAGE_NAME

The mirror image mode can either be journal or snapshot:

journal (default)

When configured in journal mode, mirroring will use the RBD journaling image feature to replicate the image contents. If the RBD journaling image feature is not yet enabled on the image, it will be automatically enabled.

snapshot

When configured in snapshot mode, mirroring will use RBD image mirror-snapshots to replicate the image contents. When enabled, an initial mirror-snapshot will automatically be created. Additional RBD image mirror-snapshots can be created by the rbd command.

For example:

cephuser@adm > rbd --cluster local mirror image enable image-pool/image-1 snapshot
cephuser@adm > rbd --cluster local mirror image enable image-pool/image-2 journal

20.4.2.2 Enabling the image journaling feature

RBD mirroring uses the RBD journaling feature to ensure that the replicated image always remains crash-consistent. When using the image mirroring mode, the journaling feature will be automatically enabled if mirroring is enabled on the image. When using the pool mirroring mode, before an image can be mirrored to a peer cluster, the RBD image journaling feature must be enabled. The feature can be enabled at image creation time by providing the --image-feature exclusive-lock,journaling option to the rbd command.

Alternatively, the journaling feature can be dynamically enabled on pre-existing RBD images. To enable journaling, specify the feature enable subcommand, the pool and image name, and the feature name:

cephuser@adm > rbd --cluster local feature enable POOL_NAME/IMAGE_NAME exclusive-lock
cephuser@adm > rbd --cluster local feature enable POOL_NAME/IMAGE_NAME journaling
Note
Note: Option dependency

The journaling feature is dependent on the exclusive-lock feature. If the exclusive-lock feature is not already enabled, you need to enable it prior to enabling the journaling feature.

Tip
Tip

You can enable journaling on all new images by default by adding rbd default features = layering,exclusive-lock,object-map,deep-flatten,journaling to your Ceph configuration file.

20.4.2.3 Creating image mirror-snapshots

When using snapshot-based mirroring, mirror-snapshots will need to be created whenever it is desired to mirror the changed contents of the RBD image. To create a mirror-snapshot manually with rbd, specify the mirror image snapshot command along with the pool and image name:

cephuser@adm > rbd mirror image snapshot POOL_NAME/IMAGE_NAME

For example:

cephuser@adm > rbd --cluster local mirror image snapshot image-pool/image-1

By default only three mirror-snapshots will be created per image. The most recent mirror-snapshot is automatically pruned if the limit is reached. The limit can be overridden via the rbd_mirroring_max_mirroring_snapshots configuration option if required. Additionally, mirror-snapshots are automatically deleted when the image is removed or when mirroring is disabled.

Mirror-snapshots can also be automatically created on a periodic basis if mirror-snapshot schedules are defined. The mirror-snapshot can be scheduled globally, per-pool, or per-image levels. Multiple mirror-snapshot schedules can be defined at any level, but only the most-specific snapshot schedules that match an individual mirrored image will run.

To create a mirror-snapshot schedule with rbd, specify the mirror snapshot schedule add command along with an optional pool or image name, interval, and optional start time.

The interval can be specified in days, hours, or minutes using the suffixes d, h, or m respectively. The optional start time can be specified using the ISO 8601 time format. For example:

cephuser@adm > rbd --cluster local mirror snapshot schedule add --pool image-pool 24h 14:00:00-05:00
cephuser@adm > rbd --cluster local mirror snapshot schedule add --pool image-pool --image image1 6h

To remove a mirror-snapshot schedule with rbd, specify the mirror snapshot schedule remove command with options that match the corresponding add schedule command.

To list all snapshot schedules for a specific level (global, pool, or image) with rbd, specify the mirror snapshot schedule ls command along with an optional pool or image name. Additionally, the --recursive option can be specified to list all schedules at the specified level and below. For example:

cephuser@adm > rbd --cluster local mirror schedule ls --pool image-pool --recursive
POOL        NAMESPACE IMAGE  SCHEDULE
image-pool  -         -      every 1d starting at 14:00:00-05:00
image-pool            image1 every 6h

To find out when the next snapshots will be created for snapshot-based mirroring RBD images with rbd, specify the mirror snapshot schedule status command along with an optional pool or image name. For example:

cephuser@adm > rbd --cluster local mirror schedule status
SCHEDULE TIME       IMAGE
2020-02-26 18:00:00 image-pool/image1

20.4.2.4 Disabling image mirroring

To disable mirroring for a specific image, specify the mirror image disable subcommand along with the pool and image name:

cephuser@adm > rbd --cluster local mirror image disable POOL_NAME/IMAGE_NAME

20.4.2.5 Promoting and demoting images

In a failover scenario where the primary designation needs to be moved to the image in the peer cluster, you need to stop access to the primary image, demote the current primary image, promote the new primary image, and resume access to the image on the alternate cluster.

Note
Note: Forced promotion

Promotion can be forced using the --force option. Forced promotion is needed when the demotion cannot be propagated to the peer cluster (for example, in case of cluster failure or communication outage). This will result in a split-brain scenario between the two peers, and the image will no longer be synchronized until a resync subcommand is issued.

To demote a specific image to non-primary, specify the mirror image demote subcommand along with the pool and image name:

cephuser@adm > rbd --cluster local mirror image demote POOL_NAME/IMAGE_NAME

To demote all primary images within a pool to non-primary, specify the mirror pool demote subcommand along with the pool name:

cephuser@adm > rbd --cluster local mirror pool demote POOL_NAME

To promote a specific image to primary, specify the mirror image promote subcommand along with the pool and image name:

cephuser@adm > rbd --cluster remote mirror image promote POOL_NAME/IMAGE_NAME

To promote all non-primary images within a pool to primary, specify the mirror pool promote subcommand along with the pool name:

cephuser@adm > rbd --cluster local mirror pool promote POOL_NAME
Tip
Tip: Split I/O load

Since the primary or non-primary status is per-image, it is possible to have two clusters split the I/O load and stage failover or failback.

20.4.2.6 Forcing image resync

If a split-brain event is detected by the rbd-mirror daemon, it will not attempt to mirror the affected image until corrected. To resume mirroring for an image, first demote the image determined to be out of date and then request a resync to the primary image. To request an image resync, specify the mirror image resync subcommand along with the pool and image name:

cephuser@adm > rbd mirror image resync POOL_NAME/IMAGE_NAME

20.4.3 Checking the mirror status

The peer cluster replication status is stored for every primary mirrored image. This status can be retrieved using the mirror image status and mirror pool status subcommands:

To request the mirror image status, specify the mirror image status subcommand along with the pool and image name:

cephuser@adm > rbd mirror image status POOL_NAME/IMAGE_NAME

To request the mirror pool summary status, specify the mirror pool status subcommand along with the pool name:

cephuser@adm > rbd mirror pool status POOL_NAME
Tip
Tip:

Adding the --verbose option to the mirror pool status subcommand will additionally output status details for every mirroring image in the pool.

20.5 Cache settings

The user space implementation of the Ceph block device (librbd) cannot take advantage of the Linux page cache. Therefore, it includes its own in-memory caching. RBD caching behaves similar to hard disk caching. When the OS sends a barrier or a flush request, all 'dirty' data is written to the OSDs. This means that using write-back caching is just as safe as using a well-behaved physical hard disk with a VM that properly sends flushes. The cache uses a Least Recently Used (LRU) algorithm, and in write-back mode it can merge adjacent requests for better throughput.

Ceph supports write-back caching for RBD. To enable it, run

cephuser@adm > ceph config set client rbd_cache true

By default, librbd does not perform any caching. Writes and reads go directly to the storage cluster, and writes return only when the data is on disk on all replicas. With caching enabled, writes return immediately, unless there are more unflushed bytes than set in the rbd cache max dirty option. In such a case, the write triggers writeback and blocks until enough bytes are flushed.

Ceph supports write-through caching for RBD. You can set the size of the cache, and you can set targets and limits to switch from write-back caching to write-through caching. To enable write-through mode, run

cephuser@adm > ceph config set client rbd_cache_max_dirty 0

This means writes return only when the data is on disk on all replicas, but reads may come from the cache. The cache is in memory on the client, and each RBD image has its own cache. Since the cache is local to the client, there is no coherency if there are others accessing the image. Running GFS or OCFS on top of RBD will not work with caching enabled.

The following parameters affect the behavior of RADOS Block Devices. To set them, use the client category:

cephuser@adm > ceph config set client PARAMETER VALUE
rbd cache

Enable caching for RADOS Block Device (RBD). Default is 'true'.

rbd cache size

The RBD cache size in bytes. Default is 32 MB.

rbd cache max dirty

The 'dirty' limit in bytes at which the cache triggers write-back. rbd cache max dirty needs to be less than rbd cache size. If set to 0, uses write-through caching. Default is 24 MB.

rbd cache target dirty

The 'dirty target' before the cache begins writing data to the data storage. Does not block writes to the cache. Default is 16 MB.

rbd cache max dirty age

The number of seconds dirty data is in the cache before writeback starts. Default is 1.

rbd cache writethrough until flush

Start out in write-through mode, and switch to write-back after the first flush request is received. Enabling this is a conservative but safe setting in case virtual machines running on rbd are too old to send flushes (for example, the virtio driver in Linux before kernel 2.6.32). Default is 'true'.

20.6 QoS settings

Generally, Quality of Service (QoS) refers to methods of traffic prioritization and resource reservation. It is particularly important for the transportation of traffic with special requirements.

Important
Important: Not supported by iSCSI

The following QoS settings are used only by the user space RBD implementation librbd and not used by the kRBD implementation. Because iSCSI uses kRBD, it does not use the QoS settings. However, for iSCSI you can configure QoS on the kernel block device layer using standard kernel facilities.

rbd qos iops limit

The desired limit of I/O operations per second. Default is 0 (no limit).

rbd qos bps limit

The desired limit of I/O bytes per second. Default is 0 (no limit).

rbd qos read iops limit

The desired limit of read operations per second. Default is 0 (no limit).

rbd qos write iops limit

The desired limit of write operations per second. Default is 0 (no limit).

rbd qos read bps limit

The desired limit of read bytes per second. Default is 0 (no limit).

rbd qos write bps limit

The desired limit of write bytes per second. Default is 0 (no limit).

rbd qos iops burst

The desired burst limit of I/O operations. Default is 0 (no limit).

rbd qos bps burst

The desired burst limit of I/O bytes. Default is 0 (no limit).

rbd qos read iops burst

The desired burst limit of read operations. Default is 0 (no limit).

rbd qos write iops burst

The desired burst limit of write operations. Default is 0 (no limit).

rbd qos read bps burst

The desired burst limit of read bytes. Default is 0 (no limit).

rbd qos write bps burst

The desired burst limit of write bytes. Default is 0 (no limit).

rbd qos schedule tick min

The minimum schedule tick (in milliseconds) for QoS. Default is 50.

20.7 Read-ahead settings

RADOS Block Device supports read-ahead/prefetching to optimize small, sequential reads. This should normally be handled by the guest OS in the case of a virtual machine, but boot loaders may not issue efficient reads. Read-ahead is automatically disabled if caching is disabled.

Important
Important: Not supported by iSCSI

The following read-ahead settings are used only by the user space RBD implementation librbd and not used by the kRBD implementation. Because iSCSI uses kRBD, it does not use the read-ahead settings. However, for iSCSI you can configure read-ahead on the kernel block device layer using standard kernel facilities.

rbd readahead trigger requests

Number of sequential read requests necessary to trigger read-ahead. Default is 10.

rbd readahead max bytes

Maximum size of a read-ahead request. If set to 0, read-ahead is disabled. Default is 512 kB.

rbd readahead disable after bytes

After this many bytes have been read from an RBD image, read-ahead is disabled for that image until it is closed. This allows the guest OS to take over read-ahead when it is booted. If set to 0, read-ahead stays enabled. Default is 50 MB.

20.8 Advanced features

RADOS Block Device supports advanced features that enhance the functionality of RBD images. You can specify the features either on the command line when creating an RBD image, or in the Ceph configuration file by using the rbd_default_features option.

You can specify the values of the rbd_default_features option in two ways:

  • As a sum of features' internal values. Each feature has its own internal value—for example 'layering' has 1 and 'fast-diff' has 16. Therefore to activate these two feature by default, include the following:

    rbd_default_features = 17
  • As a comma-separated list of features. The previous example will look as follows:

    rbd_default_features = layering,fast-diff
Note
Note: Features not supported by iSCSI

RBD images with the following features will not be supported by iSCSI: deep-flatten, object-map, journaling, fast-diff, striping

A list of advanced RBD features follows:

layering

Layering enables you to use cloning.

Internal value is 1, default is 'yes'.

striping

Striping spreads data across multiple objects and helps with parallelism for sequential read/write workloads. It prevents single node bottlenecks for large or busy RADOS Block Devices.

Internal value is 2, default is 'yes'.

exclusive-lock

When enabled, it requires a client to get a lock on an object before making a write. Enable the exclusive lock only when a single client is accessing an image at the same time. Internal value is 4. Default is 'yes'.

object-map

Object map support depends on exclusive lock support. Block devices are thin provisioned, meaning that they only store data that actually exists. Object map support helps track which objects actually exist (have data stored on a drive). Enabling object map support speeds up I/O operations for cloning, importing and exporting a sparsely populated image, and deleting.

Internal value is 8, default is 'yes'.

fast-diff

Fast-diff support depends on object map support and exclusive lock support. It adds another property to the object map, which makes it much faster to generate diffs between snapshots of an image and the actual data usage of a snapshot.

Internal value is 16, default is 'yes'.

deep-flatten

Deep-flatten makes the rbd flatten (see Section 20.3.3.6, “Flattening a cloned image”) work on all the snapshots of an image, in addition to the image itself. Without it, snapshots of an image will still rely on the parent, therefore you will not be able to delete the parent image until the snapshots are deleted. Deep-flatten makes a parent independent of its clones, even if they have snapshots.

Internal value is 32, default is 'yes'.

journaling

Journaling support depends on exclusive lock support. Journaling records all modifications to an image in the order they occur. RBD mirroring (see Section 20.4, “RBD image mirrors”) uses the journal to replicate a crash consistent image to a remote cluster.

Internal value is 64, default is 'no'.

20.9 Mapping RBD using old kernel clients

Old clients (for example, SLE11 SP4) may not be able to map RBD images because a cluster deployed with SUSE Enterprise Storage 7.1 forces some features (both RBD image level features and RADOS level features) that these old clients do not support. When this happens, the OSD logs will show messages similar to the following:

2019-05-17 16:11:33.739133 7fcb83a2e700  0 -- 192.168.122.221:0/1006830 >> \
192.168.122.152:6789/0 pipe(0x65d4e0 sd=3 :57323 s=1 pgs=0 cs=0 l=1 c=0x65d770).connect \
protocol feature mismatch, my 2fffffffffff < peer 4010ff8ffacffff missing 401000000000000
Warning
Warning: Changing CRUSH Map bucket types causes massive rebalancing

If you intend to switch the CRUSH Map bucket types between 'straw' and 'straw2', do it in a planned manner. Expect a significant impact on the cluster load because changing bucket type will cause massive cluster rebalancing.

  1. Disable any RBD image features that are not supported. For example:

    cephuser@adm > rbd feature disable pool1/image1 object-map
    cephuser@adm > rbd feature disable pool1/image1 exclusive-lock
  2. Change the CRUSH Map bucket types from 'straw2' to 'straw':

    1. Save the CRUSH Map:

      cephuser@adm > ceph osd getcrushmap -o crushmap.original
    2. Decompile the CRUSH Map:

      cephuser@adm > crushtool -d crushmap.original -o crushmap.txt
    3. Edit the CRUSH Map and replace 'straw2' with 'straw'.

    4. Recompile the CRUSH Map:

      cephuser@adm > crushtool -c crushmap.txt -o crushmap.new
    5. Set the new CRUSH Map:

      cephuser@adm > ceph osd setcrushmap -i crushmap.new

20.10 Enabling block devices and Kubernetes

You can use Ceph RBD with Kubernetes v1.13 and higher through the ceph-csi driver. This driver dynamically provisions RBD images to back Kubernetes volumes, and maps these RBD images as block devices (optionally mounting a file system contained within the image) on worker nodes running pods that reference an RBD-backed volume.

To use Ceph block devices with Kubernetes, you must install and configure ceph-csi within your Kubernetes environment.

Important
Important

ceph-csi uses the RBD kernel modules by default which may not support all Ceph CRUSH tunables or RBD image features.

  1. By default, Ceph block devices use the RBD pool. Create a pool for Kubernetes volume storage. Ensure your Ceph cluster is running, then create the pool:

    cephuser@adm > ceph osd pool create kubernetes
  2. Use the RBD tool to initialize the pool:

    cephuser@adm > rbd pool init kubernetes
  3. Create a new user for Kubernetes and ceph-csi. Execute the following and record the generated key:

    cephuser@adm > ceph auth get-or-create client.kubernetes mon 'profile rbd' osd 'profile rbd pool=kubernetes' mgr 'profile rbd pool=kubernetes'
    [client.kubernetes]
        key = AQD9o0Fd6hQRChAAt7fMaSZXduT3NWEqylNpmg==
  4. ceph-csi requires a ConfigMap object stored in Kubernetes to define the Ceph monitor addresses for the Ceph cluster. Collect both the Ceph cluster unique fsid and the monitor addresses:

    cephuser@adm > ceph mon dump
    <...>
    fsid b9127830-b0cc-4e34-aa47-9d1a2e9949a8
    <...>
    0: [v2:192.168.1.1:3300/0,v1:192.168.1.1:6789/0] mon.a
    1: [v2:192.168.1.2:3300/0,v1:192.168.1.2:6789/0] mon.b
    2: [v2:192.168.1.3:3300/0,v1:192.168.1.3:6789/0] mon.c
  5. Generate a csi-config-map.yaml file similar to the example below, substituting the FSID for clusterID, and the monitor addresses for monitors:

    kubectl@adm > cat <<EOF > csi-config-map.yaml
    ---
    apiVersion: v1
    kind: ConfigMap
    data:
      config.json: |-
        [
          {
            "clusterID": "b9127830-b0cc-4e34-aa47-9d1a2e9949a8",
            "monitors": [
              "192.168.1.1:6789",
              "192.168.1.2:6789",
              "192.168.1.3:6789"
            ]
          }
        ]
    metadata:
      name: ceph-csi-config
    EOF
  6. When generated, store the new ConfigMap object in Kubernetes:

    kubectl@adm > kubectl apply -f csi-config-map.yaml
  7. ceph-csi requires the cephx credentials for communicating with the Ceph cluster. Generate a csi-rbd-secret.yaml file similar to the example below, using the newly-created Kubernetes user ID and cephx key:

    kubectl@adm > cat <<EOF > csi-rbd-secret.yaml
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: csi-rbd-secret
      namespace: default
    stringData:
      userID: kubernetes
      userKey: AQD9o0Fd6hQRChAAt7fMaSZXduT3NWEqylNpmg==
    EOF
  8. When generated, store the new secret object in Kubernetes:

    kubectl@adm > kubectl apply -f csi-rbd-secret.yaml
  9. Create the required ServiceAccount and RBAC ClusterRole/ClusterRoleBinding Kubernetes objects. These objects do not necessarily need to be customized for your Kubernetes environment, and therefore can be used directly from the ceph-csi deployment YAML files:

    kubectl@adm > kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-provisioner-rbac.yaml
    kubectl@adm > kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-nodeplugin-rbac.yaml
  10. Create the ceph-csi provisioner and node plugins:

    kubectl@adm > wget https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-rbdplugin-provisioner.yaml
    kubectl@adm > kubectl apply -f csi-rbdplugin-provisioner.yaml
    kubectl@adm > wget https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-rbdplugin.yaml
    kubectl@adm > kubectl apply -f csi-rbdplugin.yaml
    Important
    Important

    By default, the provisioner and node plugin YAML files will pull the development release of the ceph-csi container. The YAML files should be updated to use a release version.

20.10.1 Using Ceph block devices in Kubernetes

The Kubernetes StorageClass defines a class of storage. Multiple StorageClass objects can be created to map to different quality-of-service levels and features. For example, NVMe versus HDD-based pools.

To create a ceph-csi StorageClass that maps to the Kubernetes pool created above, the following YAML file can be used, after ensuring that the clusterID property matches your Ceph cluster's FSID:

kubectl@adm > cat <<EOF > csi-rbd-sc.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
   clusterID: b9127830-b0cc-4e34-aa47-9d1a2e9949a8
   pool: kubernetes
   csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
   csi.storage.k8s.io/provisioner-secret-namespace: default
   csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
   csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete
mountOptions:
   - discard
EOF
kubectl@adm > kubectl apply -f csi-rbd-sc.yaml

A PersistentVolumeClaim is a request for abstract storage resources by a user. The PersistentVolumeClaim would then be associated to a pod resource to provision a PersistentVolume, which would be backed by a Ceph block image. An optional volumeMode can be included to select between a mounted file system (default) or raw block-device-based volume.

Using ceph-csi, specifying Filesystem for volumeMode can support both ReadWriteOnce and ReadOnlyMany accessMode claims, and specifying Block for volumeMode can support ReadWriteOnce, ReadWriteMany, and ReadOnlyMany accessMode claims.

For example, to create a block-based PersistentVolumeClaim that uses the ceph-csi-based StorageClass created above, the following YAML file can be used to request raw block storage from the csi-rbd-sc StorageClass:

kubectl@adm > cat <<EOF > raw-block-pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd-sc
EOF
kubectl@adm > kubectl apply -f raw-block-pvc.yaml

The following demonstrates and example of binding the above PersistentVolumeClaim to a pod resource as a raw block device:

kubectl@adm > cat <<EOF > raw-block-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
spec:
  containers:
    - name: fc-container
      image: fedora:26
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc
EOF
kubectl@adm > kubectl apply -f raw-block-pod.yaml

To create a file-system-based PersistentVolumeClaim that uses the ceph-csi-based StorageClass created above, the following YAML file can be used to request a mounted file system (backed by an RBD image) from the csi-rbd-sc StorageClass:

kubectl@adm > cat <<EOF > pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd-sc
EOF
kubectl@adm > kubectl apply -f pvc.yaml

The following demonstrates an example of binding the above PersistentVolumeClaim to a pod resource as a mounted file system:

kubectl@adm > cat <<EOF > pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
EOF
kubectl@adm > kubectl apply -f pod.yaml