Applies to SUSE Enterprise Storage 5.5 (SES 5 & SES 5.5)

10 Erasure Coded Pools

Ceph provides an alternative to the normal replication of data in pools, called an erasure coded pool (or erasure pool). Erasure pools do not provide all the functionality of replicated pools (for example, they cannot store metadata for RBD pools), but they require less raw storage. A default erasure pool capable of storing 1 TB of data requires 1.5 TB of raw storage and tolerates a single disk failure. This compares favorably to a replicated pool, which needs 2 TB of raw storage for the same purpose.

For background information on Erasure Code, see https://en.wikipedia.org/wiki/Erasure_code.

Note

When using FileStore, you cannot access erasure coded pools with the RBD interface unless you have a cache tier configured. Refer to Section 11.5, “Erasure Coded Pool and Cache Tiering” for more details, or use BlueStore.

10.1 Prerequisite for Erasure Coded Pools

To make use of erasure coding, you need to:

  • Define an erasure rule in the CRUSH Map.

  • Define an erasure code profile that specifies the coding algorithm to be used.

  • Create a pool using the previously mentioned rule and profile.

Keep in mind that the profile, and the details in the profile, cannot be changed after the pool has been created and contains data.

Ensure that the CRUSH rules for erasure pools use indep for step. For details see Section 7.3.2, “firstn and indep”.
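
You can check this on an existing erasure rule with ceph osd crush rule dump. The rule name ecpool below is only an assumption; substitute the name of your erasure rule (it typically matches the pool name). The steps in the output should use the chooseleaf_indep or choose_indep operations rather than their firstn variants:

cephadm > ceph osd crush rule dump ecpool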

10.2 Creating a Sample Erasure Coded Pool

The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts. This procedure describes how to create a pool for testing purposes.

  1. The command ceph osd pool create is used to create a pool of type erasure. The two values of 12 stand for the number of placement groups and the number of placement groups for placement. With the default parameters, the pool is able to handle the failure of one OSD.

    cephadm > ceph osd pool create ecpool 12 12 erasure
    pool 'ecpool' created
  2. The string ABCDEFGHI is written into an object called NYAN.

    cephadm > echo ABCDEFGHI | rados --pool ecpool put NYAN -
  3. For testing purposes, OSDs can now be disabled, for example by disconnecting them from the network. For how to identify which OSDs store the object's chunks, see the example after this procedure.

  4. To test whether the pool can handle the failure of devices, the content of the object can be retrieved with the rados command.

    cephadm > rados --pool ecpool get NYAN -
    ABCDEFGHI
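
To decide which OSDs to disconnect for this test, it can help to know where the chunks of the NYAN object are placed. The following is a minimal example using the pool and object names from the procedure above; the up and acting sets in the output list the OSDs that store the data and coding chunks:

cephadm > ceph osd map ecpool NYAN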

10.3 Erasure Code Profiles

When the ceph osd pool create command is invoked to create an erasure pool, the default profile is used, unless another profile is specified. Profiles define the redundancy of data. This is done by setting two parameters, arbitrarily named k and m. k and m define into how many chunks a piece of data is split and how many coding chunks are created. Redundant chunks are then stored on different OSDs.

Definitions required for erasure pool profiles:

chunk

When the encoding function is called, it returns chunks of the same size: data chunks, which can be concatenated to reconstruct the original object, and coding chunks, which can be used to rebuild a lost chunk.

k

The number of data chunks, that is, the number of chunks into which the original object is divided. For example, if k = 2, a 10 KB object will be divided into k chunks of 5 KB each.

m

The number of coding chunks, that is, the number of additional chunks computed by the encoding functions. If there are two coding chunks, two OSDs can be out without losing data.

crush-failure-domain

Defines to which devices the chunks are distributed. A bucket type needs to be set as the value. For all bucket types, see Section 7.2, “Buckets”. If the failure domain is rack, the chunks will be stored on different racks to increase the resilience in case of rack failures. Keep in mind that this requires k+m racks.

With the default erasure code profile used in Section 10.2, “Creating a Sample Erasure Coded Pool”, you will not lose cluster data if a single OSD or host fails. Storing 1 TB of data therefore needs an additional 0.5 TB of raw storage, which means that 1.5 TB of raw storage are required for 1 TB of data (because k=2, m=1). This is equivalent to a common RAID 5 configuration. For comparison: a replicated pool needs 2 TB of raw storage to store 1 TB of data.
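
To see which erasure code profiles already exist on the cluster, list them. On a cluster where no custom profiles have been created yet, the output typically contains only the default profile:

cephadm > ceph osd erasure-code-profile ls
default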

The settings of the default profile can be displayed with:

cephadm > ceph osd erasure-code-profile get default
directory=.libs
k=2
m=1
plugin=jerasure
crush-failure-domain=host
technique=reed_sol_van

Choosing the right profile is important because it cannot be modified after the pool is created. A new pool with a different profile needs to be created and all objects from the previous pool moved to the new one (see Section 8.3, “Pool Migration”).
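
To check which profile an existing erasure coded pool uses, query the pool. The pool name ecpool below follows the example in Section 10.2, “Creating a Sample Erasure Coded Pool”; a pool created without an explicit profile reports the default profile:

cephadm > ceph osd pool get ecpool erasure_code_profile
erasure_code_profile: default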

The most important parameters of the profile are k, m, and crush-failure-domain because they define the storage overhead and the data durability. For example, if the desired architecture must sustain the loss of two racks with a storage overhead of 66%, the following profile can be defined. Note that this is only valid with a CRUSH Map that has buckets of type 'rack':

cephadm > ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   crush-failure-domain=rack
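
Before creating a pool with the new profile, you can verify it with the same command used for the default profile above; the output should reflect the values just set, together with plugin defaults such as jerasure:

cephadm > ceph osd erasure-code-profile get myprofile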

The example from Section 10.2, “Creating a Sample Erasure Coded Pool” can be repeated with this new profile:

cephadm > ceph osd pool create ecpool 12 12 erasure myprofile
cephadm > echo ABCDEFGHI | rados --pool ecpool put NYAN -
cephadm > rados --pool ecpool get NYAN -
ABCDEFGHI

The NYAN object will be divided into three data chunks (k=3) and two additional coding chunks will be created (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any data. The crush-failure-domain=rack option creates a CRUSH rule that ensures no two chunks are stored in the same rack.

For more information about the erasure code profiles, see http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile.

10.4 Erasure Coded Pools with RADOS Block Device

To mark an EC pool as an RBD pool, tag it accordingly:

cephadm > ceph osd pool application enable ec_pool_name rbd
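
Depending on the Ceph release and the OSD back-end, writing RBD data to an erasure coded pool may additionally require partial overwrites to be enabled on that pool (this is supported on BlueStore only). Treat the following as a hedged example and verify that it applies to your setup:

cephadm > ceph osd pool set ec_pool_name allow_ec_overwrites true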

RBD can store image data in EC pools. However, the image header and metadata still need to be stored in a replicated pool. Assuming you have a pool named 'rbd' for this purpose:

cephadm > rbd create rbd/image_name --size 1T --data-pool ec_pool_name

You can use the image like any other image, except that all of the data will be stored in the ec_pool_name pool instead of the 'rbd' pool.
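
To confirm that the image data is directed to the erasure coded pool, you can inspect the image; recent rbd versions report the separate data pool in the output. This is a minimal check using the names from the example above:

cephadm > rbd info rbd/image_name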
