Applies to SUSE Enterprise Storage 5.5 (SES 5 & SES 5.5)

11 Cache Tiering

A cache tier is an additional storage layer implemented between the client and the standard storage. It is designed to speed up access to pools stored on slow hard disks and erasure coded pools.

Typically cache tiering involves creating a pool of relatively fast storage devices (for example SSD drives) configured to act as a cache tier, and a backing pool of slower and cheaper devices configured to act as a storage tier. The size of the cache pool is usually 10-20% of the storage pool.
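As a rough illustration of the 10-20% rule of thumb, the following shell sketch computes the suggested cache pool size range for a given storage pool size. The helper name cache_size_range and the 40960 GiB (40 TiB) figure are hypothetical and only serve as an example.

```shell
#!/bin/sh
# Print the 10% and 20% marks for a storage pool size given in GiB.
# cache_size_range is a hypothetical helper, not a Ceph command.
cache_size_range() {
    storage_gib=$1
    echo "$((storage_gib * 10 / 100)) $((storage_gib * 20 / 100))"
}

# Example: a 40 TiB (40960 GiB) storage pool.
cache_size_range 40960
```

For the example value, the sketch prints 4096 and 8192, that is, a cache pool of roughly 4 to 8 TiB. Treat the result as a starting point only; the working set of your workload matters more than the ratio.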

11.1 Tiered Storage Terminology

Cache tiering recognizes two types of pools: a cache pool and a storage pool.


For general information on pools, see Chapter 8, Managing Storage Pools.

storage pool

Either a standard replicated pool that stores several copies of an object in the Ceph storage cluster, or an erasure coded pool (see Chapter 10, Erasure Coded Pools).

The storage pool is sometimes referred to as 'backing' or 'cold' storage.

cache pool

A standard replicated pool stored on relatively small but fast storage devices with their own ruleset in a CRUSH Map.

The cache pool is also referred to as 'hot' storage.

11.2 Points to Consider

Cache tiering may degrade the cluster performance for specific workloads. The following points show some of its aspects that you need to consider:

  • Workload-dependent: Whether a cache will improve performance is dependent on the workload. Because there is a cost associated with moving objects into or out of the cache, it can be more effective when most of the requests touch a small number of objects. The cache pool should be large enough to capture the working set for your workload to avoid thrashing.

  • Difficult to benchmark: Most performance benchmarks may show low performance with cache tiering. The reason is that they request a big set of objects, and it takes a long time for the cache to 'warm up'.

  • Possibly low performance: For workloads that are not suitable for cache tiering, performance is often slower than a normal replicated pool without cache tiering enabled.

  • librados object enumeration: If your application is using librados directly and relies on object enumeration, cache tiering may not work as expected. (This is not a problem for Object Gateway, RBD, or CephFS.)

11.3 When to Use Cache Tiering

Consider using cache tiering in the following cases:

  • Your erasure coded pools are stored on FileStore and you need to access them via RADOS Block Device. For more information on RBD, see Chapter 9, RADOS Block Device.

  • Your erasure coded pools are stored on FileStore and you need to access them via iSCSI. For more information on iSCSI, refer to Chapter 14, Ceph iSCSI Gateway.

  • You have a limited number of high-performance storage devices and a large collection of low-performance storage devices, and need to access the stored data faster.

11.4 Cache Modes

The cache tiering agent handles the migration of data between the cache tier and the backing storage tier. Administrators have the ability to configure how this migration takes place. There are two main scenarios:

write-back mode

In write-back mode, Ceph clients write data to the cache tier and receive an ACK from the cache tier. In time, the data written to the cache tier migrates to the storage tier and gets flushed from the cache tier. Conceptually, the cache tier is overlaid 'in front' of the backing storage tier. When a Ceph client needs data that resides in the storage tier, the cache tiering agent migrates the data to the cache tier on read, and then it is sent to the Ceph client. Thereafter, the Ceph client can perform I/O using the cache tier until the data becomes inactive. This is ideal for mutable data, such as photo or video editing, or transactional data.

read-only mode

In read-only mode, Ceph clients write data directly to the backing tier. On read, Ceph copies the requested objects from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data such as presenting pictures or videos on a social network, DNA data, or X-ray imaging, because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use read-only mode for mutable data.

11.5 Erasure Coded Pool and Cache Tiering

Erasure coded pools require more resources than replicated pools. To overcome these limitations, we recommend setting a cache tier before the erasure coded pool. This is a requirement when using FileStore.

For example, if the hot-storage pool is made of fast storage, the ecpool created in Section 10.3, “Erasure Code Profiles” can be sped up with:

cephadm > ceph osd tier add ecpool hot-storage
cephadm > ceph osd tier cache-mode hot-storage writeback
cephadm > ceph osd tier set-overlay ecpool hot-storage

This will place the hot-storage pool as a tier of ecpool in write-back mode so that every write and read to the ecpool is actually using the hot-storage and benefits from its flexibility and speed.

cephadm > rbd --pool ecpool create --size 10 myvolume

For more information about cache tiering, see Chapter 11, Cache Tiering.

11.6 Setting Up an Example Tiered Storage

This section illustrates how to set up a fast SSD cache tier (hot storage) in front of a standard hard disk (cold storage).


The following example is for illustration purposes only and includes a setup with one root and one rule for the SSD part residing on a single Ceph node.

In the production environment, cluster setups typically include more root and rule entries for the hot storage, and also mixed nodes, with both SSDs and SATA disks.

  1. Create two additional CRUSH rules, 'replicated_ssd' for the fast SSD caching device class, and 'replicated_hdd' for the slower HDD device class:

    cephadm > ceph osd crush rule create-replicated replicated_ssd default host ssd
    cephadm > ceph osd crush rule create-replicated replicated_hdd default host hdd
  2. Switch all existing pools to the 'replicated_hdd' rule. This prevents Ceph from storing data to the newly added SSD devices:

    cephadm > ceph osd pool set POOL_NAME crush_rule replicated_hdd
  3. Turn the machine into a Ceph node using DeepSea. Install the software and configure the host machine as described in Section 1.1, “Adding New Cluster Nodes”. Let us assume that its name is node-4. In this example, the node has 4 OSD disks.

    host node-4 {
       id -5  # do not change unnecessarily
       # weight 0.012
       alg straw
       hash 0  # rjenkins1
       item osd.6 weight 0.003
       item osd.7 weight 0.003
       item osd.8 weight 0.003
       item osd.9 weight 0.003
    }
  4. Edit the CRUSH map for the hot storage pool mapped to the OSDs backed by the fast SSD drives. Define a second hierarchy with a root node for the SSDs (as 'root ssd'). Additionally, change the weight and a CRUSH rule for the SSDs. For more information on CRUSH map, see http://docs.ceph.com/docs/master/rados/operations/crush-map/.

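    You can edit the CRUSH map directly with command line tools such as getcrushmap and crushtool. Alternatively, because the rules created in step 1 select OSDs by device class, reassign the device class of the SSD-backed OSDs: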

    cephadm > ceph osd crush rm-device-class osd.6 osd.7 osd.8 osd.9
    cephadm > ceph osd crush set-device-class ssd osd.6 osd.7 osd.8 osd.9
  5. Create the hot storage pool to be used for cache tiering. Use the new 'replicated_ssd' rule for it:

    cephadm > ceph osd pool create hot-storage 100 100 replicated replicated_ssd
  6. Create the cold storage pool using the default 'replicated_ruleset' rule:

    cephadm > ceph osd pool create cold-storage 100 100 replicated replicated_ruleset
  7. Then, setting up a cache tier involves associating a backing storage pool with a cache pool, in this case, cold storage (= storage pool) with hot storage (= cache pool):

    cephadm > ceph osd tier add cold-storage hot-storage
  8. To set the cache mode to 'writeback', execute the following:

    cephadm > ceph osd tier cache-mode hot-storage writeback

    For more information about cache modes, see Section 11.4, “Cache Modes”.

    Writeback cache tiers overlay the backing storage tier, so they require one additional step: you must direct all client traffic from the storage pool to the cache pool. To direct client traffic directly to the cache pool, execute the following, for example:

    cephadm > ceph osd tier set-overlay cold-storage hot-storage
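The tiering commands from steps 7 and 8 can be collected in a small script. The following is a sketch only: attach_cache_tier is a hypothetical helper, and by default the script merely prints the commands (dry run) so that you can review them before changing anything on the cluster.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set CEPH=ceph in the environment to run them against a real cluster.
CEPH=${CEPH:-echo ceph}

# attach_cache_tier STORAGE_POOL CACHE_POOL (hypothetical helper)
# Associates the pools, selects write-back mode, and sets the overlay.
attach_cache_tier() {
    $CEPH osd tier add "$1" "$2" &&
    $CEPH osd tier cache-mode "$2" writeback &&
    $CEPH osd tier set-overlay "$1" "$2"
}

attach_cache_tier cold-storage hot-storage
```

Chaining the commands with && stops the sequence at the first failure, so the overlay is not set if the association or the cache mode could not be applied.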

11.7 Configuring a Cache Tier

There are several options you can use to configure cache tiers. Use the following syntax:

cephadm > ceph osd pool set cachepool key value

11.7.1 Hit Set

Hit set parameters allow for tuning of cache pools. Hit sets in Ceph are usually bloom filters and provide a memory-efficient way of tracking objects that are already in the cache pool.

The hit set is a bit array that is used to store the result of a set of hashing functions applied on object names. Initially, all bits are set to 0. When an object is added to the hit set, its name is hashed and the result is mapped on different positions in the hit set, where the value of the bit is then set to 1.

To find out whether an object exists in the cache, the object name is hashed again. If any bit is 0, the object is definitely not in the cache and needs to be retrieved from cold storage.

It is possible that the results of different objects are stored in the same location of the hit set. By chance, all bits can be 1 without the object being in the cache. Therefore, hit sets working with a bloom filter can only tell whether an object is definitely not in the cache and needs to be retrieved from cold storage.
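The bloom filter behavior described above can be tried out with a toy shell sketch. It is not Ceph's implementation: the 32-bit size, the three hash functions, the use of cksum with a salt as hash, and all function names are assumptions chosen to keep the example short.

```shell
#!/bin/sh
# Toy bloom filter: M bits held in one integer, K salted hashes.
# Illustrative assumptions only, not Ceph's hit set code.
M=32
K=3
hit_set=0    # initially, all bits are 0

# Print the K bit positions for object name $1.
bit_positions() {
    k=0
    while [ "$k" -lt "$K" ]; do
        # cksum prints "<crc> <byte count>"; keep only the CRC.
        crc=$(printf '%s:%s' "$k" "$1" | cksum | cut -d' ' -f1)
        echo $((crc % M))
        k=$((k + 1))
    done
}

# Set the K bits that belong to object name $1.
hit_set_add() {
    for pos in $(bit_positions "$1"); do
        hit_set=$((hit_set | (1 << pos)))
    done
}

# Return 0 if the object may be in the set, 1 if it is definitely not.
hit_set_maybe_contains() {
    for pos in $(bit_positions "$1"); do
        [ $(( (hit_set >> pos) & 1 )) -eq 1 ] || return 1
    done
    return 0
}

for name in object-a object-b object-c; do
    hit_set_add "$name"
done

for name in object-a object-x; do
    if hit_set_maybe_contains "$name"; then
        echo "$name: possibly in the cache"
    else
        echo "$name: definitely not in the cache"
    fi
done
```

An object that was added is always reported as possibly present. An object that was never added is usually reported as definitely absent, but it can be reported as possibly present if all of its bit positions happen to be set by other objects, which is the false positive case.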

A cache pool can have more than one hit set tracking object access over time. The setting hit_set_count defines how many hit sets are used, and hit_set_period defines how long each hit set is used (in seconds). After the period has expired, the next hit set is used. If the number of hit sets is exhausted, the memory from the oldest hit set is freed and a new hit set is created. The values of hit_set_count and hit_set_period multiplied by each other define the overall time frame in which access to objects is tracked.
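The relation between hit_set_count, hit_set_period, and the tracked time frame can be checked with a short calculation. In this sketch, hit_set_window is a hypothetical helper, the values 4 and 3600 are examples only, and the ceph commands are printed rather than executed unless CEPH=ceph is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
CEPH=${CEPH:-echo ceph}

# Time frame covered by all hit sets, in seconds.
# hit_set_window COUNT PERIOD (hypothetical helper)
hit_set_window() {
    echo $(($1 * $2))    # hit_set_count * hit_set_period
}

window=$(hit_set_window 4 3600)
echo "tracked time frame: ${window} s ($((window / 3600)) h)"

# Apply the example values to the cache pool 'hot-storage'.
$CEPH osd pool set hot-storage hit_set_count 4
$CEPH osd pool set hot-storage hit_set_period 3600
```

Four hit sets of one hour each cover the last 14400 seconds, that is, 4 hours of object access.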

Figure 11.1: Bloom Filter with 3 Stored Objects

Compared to the number of hashed objects, a hit set based on a bloom filter is very memory-efficient. Fewer than 10 bits per object are required to reduce the false positive probability below 1%. The false positive probability can be defined with hit_set_fpp. Based on the number of objects in a placement group and the false positive probability, Ceph automatically calculates the size of the hit set.

The required storage on the cache pool can be limited with min_write_recency_for_promote and min_read_recency_for_promote. If the value is set to 0, all objects are promoted to the cache pool as soon as they are read or written and this persists until they are evicted. Any value greater than 0 defines the number of hit sets ordered by age that are searched for the object. If the object is found in a hit set, it will be promoted to the cache pool. Keep in mind that backup of objects may also cause them to be promoted to the cache. A full backup with the value of '0' can cause all data to be promoted to the cache tier while active data gets removed from the cache tier. Therefore, changing this setting based on the backup strategy may be useful.


The longer the period and the higher the min_read_recency_for_promote and min_write_recency_for_promote values, the more RAM the ceph-osd daemon consumes. In particular, when the agent is active to flush or evict cache objects, all hit_set_count hit sets are loaded into RAM.

Use GMT for Hit Set

Cache tier setups have a bloom filter called hit set. The filter tests whether an object belongs to a set of either hot or cold objects. The objects are added to the hit set using time stamps appended to their names.

If cluster machines are placed in different time zones and the time stamps are derived from the local time, objects in a hit set can have misleading names consisting of future or past time stamps. In the worst case, objects may not exist in the hit set at all.

To prevent this, use_gmt_hitset defaults to '1' on newly created cache tier setups. This way, OSDs are forced to use GMT (Greenwich Mean Time) time stamps when creating the object names for the hit set.

Warning: Leave the Default Value

Do not touch the default value '1' of use_gmt_hitset. If errors related to this option are not caused by your cluster setup, never change it manually. Otherwise, the cluster behavior may become unpredictable.

11.7.2 Cache Sizing

The cache tiering agent performs two main functions:


  • The agent identifies modified (dirty) objects and forwards them to the storage pool for long-term storage.

  • The agent identifies objects that have not been modified (clean) and evicts the least recently used among them from the cache.

Absolute Sizing

The cache tiering agent can flush or evict objects based on the total number of bytes or the total number of objects. To specify a maximum number of bytes, execute the following:

cephadm > ceph osd pool set cachepool target_max_bytes num_of_bytes

To specify the maximum number of objects, execute the following:

cephadm > ceph osd pool set cachepool target_max_objects num_of_objects

Ceph is not able to determine the size of a cache pool automatically, so the configuration on the absolute size is required here. Otherwise, flush and evict will not work. If you specify both limits, the cache tiering agent will begin flushing or evicting when either threshold is triggered.
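As a sketch of how both limits might be set together, the following script converts a size in GiB to bytes and applies it to the 'hot-storage' cache pool from the example setup. The helper gib_to_bytes and the limits of 100 GiB and one million objects are assumptions for illustration. The commands are only printed unless CEPH=ceph is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
CEPH=${CEPH:-echo ceph}

# Convert GiB to bytes (1 GiB = 1024^3 bytes). Hypothetical helper.
gib_to_bytes() {
    echo $(($1 * 1024 * 1024 * 1024))
}

# Limit the cache pool to 100 GiB or one million objects,
# whichever threshold is triggered first.
$CEPH osd pool set hot-storage target_max_bytes "$(gib_to_bytes 100)"
$CEPH osd pool set hot-storage target_max_objects 1000000
```

With these example values, target_max_bytes is set to 107374182400.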


Client requests are blocked only when target_max_bytes or target_max_objects is reached.

Relative Sizing

The cache tiering agent can flush or evict objects relative to the size of the cache pool (specified by target_max_bytes or target_max_objects, see “Absolute Sizing”). When the cache pool consists of a certain percentage of modified (dirty) objects, the cache tiering agent will flush them to the storage pool. To set the cache_target_dirty_ratio, execute the following:

cephadm > ceph osd pool set cachepool cache_target_dirty_ratio 0.0..1.0

For example, setting the value to 0.4 will begin flushing modified (dirty) objects when they reach 40% of the cache pool's capacity:

cephadm > ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When the dirty objects reach a certain percentage of the capacity, flush them at a higher speed. Use cache_target_dirty_high_ratio:

cephadm > ceph osd pool set cachepool cache_target_dirty_high_ratio 0.0..1.0

When the cache pool reaches a certain percentage of its capacity, the cache tiering agent will evict objects to maintain free capacity. To set the cache_target_full_ratio, execute the following:

cephadm > ceph osd pool set cachepool cache_target_full_ratio 0.0..1.0
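To see what the ratios mean in absolute terms, the following sketch computes the byte thresholds for a cache pool limited to 100 GiB. It takes a simplified, pool-wide view. The helper threshold_bytes is hypothetical, and the ratios 0.6 and 0.8 for cache_target_dirty_high_ratio and cache_target_full_ratio are example values, not recommendations.

```shell
#!/bin/sh
# Byte threshold for a ratio given in percent (40 means 0.4).
# threshold_bytes TARGET_MAX_BYTES PERCENT (hypothetical helper)
threshold_bytes() {
    echo $(($1 * $2 / 100))
}

target_max_bytes=107374182400    # 100 GiB, example value

echo "cache_target_dirty_ratio 0.4:      $(threshold_bytes "$target_max_bytes" 40) bytes"
echo "cache_target_dirty_high_ratio 0.6: $(threshold_bytes "$target_max_bytes" 60) bytes"
echo "cache_target_full_ratio 0.8:       $(threshold_bytes "$target_max_bytes" 80) bytes"
```

In this example, flushing starts at 40 GiB of dirty data, speeds up at 60 GiB, and eviction starts when the pool holds 80 GiB.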

11.7.3 Cache Age

You can specify the minimum age of a recently modified (dirty) object before the cache tiering agent flushes it to the backing storage pool. Note that this will only apply if the cache actually needs to flush/evict objects:

cephadm > ceph osd pool set cachepool cache_min_flush_age num_of_seconds

You can specify the minimum age of an object before it will be evicted from the cache tier:

cephadm > ceph osd pool set cachepool cache_min_evict_age num_of_seconds
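For example, both age limits could be set from values given in minutes. This is a sketch under assumptions: set_cache_ages is a hypothetical helper, the 10 and 30 minute values are illustrative, and the commands are only printed unless CEPH=ceph is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
CEPH=${CEPH:-echo ceph}

# set_cache_ages POOL FLUSH_MINUTES EVICT_MINUTES (hypothetical helper)
# Both options expect seconds, so the minutes are converted first.
set_cache_ages() {
    $CEPH osd pool set "$1" cache_min_flush_age $(($2 * 60)) &&
    $CEPH osd pool set "$1" cache_min_evict_age $(($3 * 60))
}

# Flush dirty objects after 10 minutes at the earliest,
# evict objects after 30 minutes at the earliest.
set_cache_ages hot-storage 10 30
```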

11.7.4 Examples

Large Cache Pool and Small Memory

If lots of storage and only a small amount of RAM is available, all objects can be promoted to the cache pool as soon as they are accessed. The hit set is kept small. The following is a set of example configuration values:

hit_set_count = 1
hit_set_period = 3600
hit_set_fpp = 0.05
min_write_recency_for_promote = 0
min_read_recency_for_promote = 0

Small Cache Pool and Large Memory

If a small amount of storage but a comparably large amount of memory is available, the cache tier can be configured to promote a limited number of objects into the cache pool. Twelve hit sets, of which each is used over a period of 14,400 seconds, provide tracking for a total of 48 hours. If an object has been accessed in the last 8 hours, it is promoted to the cache pool. The set of example configuration values then is:

hit_set_count = 12
hit_set_period = 14400
hit_set_fpp = 0.01
min_write_recency_for_promote = 2
min_read_recency_for_promote = 2
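A set of values like the ones above can be applied with the syntax from Section 11.7. The following sketch assumes the cache pool is named 'hot-storage' as in the example setup; apply_settings is a hypothetical helper, and the commands are only printed unless CEPH=ceph is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
CEPH=${CEPH:-echo ceph}
pool=hot-storage    # cache pool from the example setup

# apply_settings POOL KEY=VALUE... (hypothetical helper)
# Runs one 'ceph osd pool set' per KEY=VALUE pair, stopping on failure.
apply_settings() {
    target=$1
    shift
    for setting in "$@"; do
        $CEPH osd pool set "$target" "${setting%%=*}" "${setting#*=}" || return 1
    done
}

apply_settings "$pool" \
    hit_set_count=12 \
    hit_set_period=14400 \
    hit_set_fpp=0.01 \
    min_write_recency_for_promote=2 \
    min_read_recency_for_promote=2

# Sanity check of the time frames quoted above.
echo "tracking window:  $((12 * 14400 / 3600)) hours"
echo "promotion window: $((2 * 14400 / 3600)) hours"
```

The last two lines confirm the arithmetic: 12 hit sets of 14400 seconds cover 48 hours, and the 2 most recent hit sets cover 8 hours.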