4 Multi-tier Caching for Block Device Operations #
A multi-tier cache is a replicated/distributed cache that consists of at least two tiers: one is represented by slower but cheaper rotational block devices (hard disks), while the other is more expensive but performs faster data operations (for example SSD flash disks).
SUSE Linux Enterprise Server implements two different solutions for caching between flash and rotational devices: bcache and lvmcache.
4.1 General Terminology #
This section explains several terms often used when describing cache-related features:
- Migration
Movement of the primary copy of a logical block from one device to the other.
- Promotion
Migration from the slow device to the fast device.
- Demotion
Migration from the fast device to the slow device.
- Origin device
The big and slower block device. It always contains a copy of the logical block, which may be out of date or kept in synchronization with the copy on the cache device (depending on policy).
- Cache device
The small and faster block device.
- Metadata device
A small device that records which blocks are in the cache, which are dirty, and extra hints for use by the policy object. This information could be put on the cache device as well, but having it separate allows the volume manager to configure it differently, for example as a mirror for extra robustness. The metadata device may only be used by a single cache device.
- Dirty block
If a process writes to a block of data that is held in the cache, the cached block is marked as dirty: it was overwritten in the cache and still needs to be written back to the origin device.
- Cache miss
An I/O request is first directed to the cache of the cached device. If the requested data is not found there, it is read from the device itself, which is slow. This is called a cache miss.
- Cache hit
When a requested value is found in the cached device's cache, it is served fast. This is called a cache hit.
- Cold cache
Cache that holds no values (is empty) and causes cache misses. As the cached block device operations progress, it gets filled with data and becomes warm.
- Warm cache
Cache that already holds some values and is likely to result in cache hits.
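The terms cache miss, cache hit, promotion, cold cache and warm cache can be illustrated with a toy model. The following is a minimal sketch that does not touch any real device: the cache is just a space-separated list of block numbers, and the names read_block, cache, hits and misses are made up for this illustration.

```shell
#!/bin/sh
# Toy model of a block cache. Nothing here talks to a real device.
cache=""     # block numbers currently held in the cache (starts cold)
hits=0
misses=0

read_block() {
    blk=$1
    case " $cache " in
        *" $blk "*)
            hits=$((hits + 1))        # cache hit: served from the fast device
            ;;
        *)
            misses=$((misses + 1))    # cache miss: read from the origin device
            cache="$cache $blk"       # promotion: keep a copy in the cache
            ;;
    esac
}

# The first pass runs against a cold cache, the second against a warm one.
for blk in 1 2 3 1 2 3; do
    read_block "$blk"
done

echo "hits=$hits misses=$misses"
```

The first three reads miss and warm the cache, the next three hit, so the script prints hits=3 misses=3.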
4.2 Caching Modes #
Following are the basic caching modes that multi-tier caches use: write-back, write-through, write-around and pass-through.
- write-back
Data written to a block that is cached go to the cache only, and the block is marked dirty. This is the default caching mode.
- write-through
Writing to a cached block will not complete until it has hit both the origin and cache devices. Clean blocks remain clean with write-through cache.
- write-around
A technique similar to write-through cache, but write I/O is written directly to permanent storage, bypassing the cache. This can prevent the cache from being flooded with write I/O that will not subsequently be re-read. The disadvantage is that a read request for recently written data causes a cache miss: the data must be read from the slower bulk storage, with higher latency.
- pass-through
To enable the pass-through mode, the cache needs to be clean. Reading is served from the origin device bypassing the cache. Writing is forwarded to the origin device and 'invalidates' the cache block. Pass-through allows a cache device activation without having to care about data coherency, which is maintained. The cache will gradually become cold as writing takes place. If you can verify the coherency of the cache later, or establish it by using the invalidate_cblocks message, you can switch the cache device to write-through or write-back mode while it is still warm. Otherwise, you can discard the cache contents before switching to the desired caching mode.
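The difference between write-back and write-through can be sketched by tracking one logical block. This is an illustrative model only, covering just those two modes. The variables origin_data, cache_data and dirty and the functions write_block and flush_block are assumptions of this sketch, not kernel interfaces.

```shell
#!/bin/sh
# State of a single logical block (illustrative variables only).
origin_data="v1"   # copy on the slow origin device
cache_data="v1"    # copy on the fast cache device
dirty=0            # 1 if the cache copy is newer than the origin copy

# usage: write_block MODE DATA
write_block() {
    mode=$1
    data=$2
    case "$mode" in
        write-back)
            cache_data=$data      # the write goes to the cache only
            dirty=1               # the origin copy is now out of date
            ;;
        write-through)
            cache_data=$data      # the write goes to the cache
            origin_data=$data     # and to the origin before it completes
            ;;
        *)
            echo "mode not modelled in this sketch: $mode" >&2
            return 1
            ;;
    esac
}

# Write a dirty block back to the origin device.
flush_block() {
    if [ "$dirty" -eq 1 ]; then
        origin_data=$cache_data
        dirty=0
    fi
}

write_block write-back v2
echo "after write-back:    cache=$cache_data origin=$origin_data dirty=$dirty"
flush_block
echo "after flush:         cache=$cache_data origin=$origin_data dirty=$dirty"
write_block write-through v3
echo "after write-through: cache=$cache_data origin=$origin_data dirty=$dirty"
```

After the write-back step the origin still holds v1 until the block is flushed. After the write-through step both copies hold v3 and the block stays clean.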
4.3 bcache #
bcache is a Linux kernel block layer cache. It allows one or more fast disk drives (such as SSDs) to act as a cache for one or more slower hard disks. bcache supports write-through and write-back, and is independent of the file system used. By default it caches random reads and writes only, which SSDs excel at. It is suitable for desktops, servers, and high end storage arrays as well.
4.3.1 Main Features #
A single cache device can be used to cache an arbitrary number of backing devices. Backing devices can be attached and detached at runtime, while mounted and in use.
Recovers from unclean shutdowns—writes are not completed until the cache is consistent with regard to the backing device.
Throttles traffic to the SSD if it becomes congested.
Highly efficient write-back implementation. Dirty data is always written out in sorted order.
Stable and reliable—in production use.
4.3.2 Setting Up a bcache Device #
This section describes steps to set up and manage a bcache device.
Install the bcache-tools package:
sudo zypper in bcache-tools
Create a backing device (typically a mechanical drive). The backing device can be a whole device, a partition, or any other standard block device.
sudo make-bcache -B /dev/sdb
Create a cache device (typically an SSD disk).
sudo make-bcache -C /dev/sdc
In this example, the default block and bucket sizes of 512 B and 128 KB are used. The block size should match the backing device's sector size which will usually be either 512 or 4k. The bucket size should match the erase block size of the caching device with the intention of reducing write amplification. For example, using a hard disk with 4k sectors and an SSD with an erase block size of 2 MB this command would look as follows:
sudo make-bcache --block 4k --bucket 2M -C /dev/sdc
Tip: Multi-Device Support
make-bcache can prepare and register multiple backing devices and a cache device at the same time. In this case you do not need to manually attach the cache device to the backing device afterward:
sudo make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
bcache devices show up as
/dev/bcacheN
and as
/dev/bcache/by-uuid/UUID
/dev/bcache/by-label/LABEL
You can normally format and mount bcache devices as usual:
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /mnt
You can control bcache devices through sysfs at /sys/block/bcacheN/bcache.
After both the cache and backing devices are registered, you need to attach the backing device to the related cache set to enable caching:
echo CACHE_SET_UUID > /sys/block/bcache0/bcache/attach
where CACHE_SET_UUID is found in /sys/fs/bcache.
By default bcache uses a pass-through caching mode. To change it to, for example, write-back, run
echo writeback > /sys/block/bcache0/bcache/cache_mode
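Because the attach and cache mode steps both go through sysfs, they can be scripted. The following is a minimal sketch and not part of bcache-tools. The function names and the optional sysfs root argument are assumptions made for illustration, and the argument exists only so that the helpers can be tried against a copy of the tree. The sketch assumes that cache sets appear as UUID-named directories under /sys/fs/bcache and that cache_mode marks the active mode with square brackets.

```shell
#!/bin/sh
# Hypothetical helpers; only the sysfs paths are taken from the text above.

# List the UUIDs of all registered cache sets.
# $1: optional sysfs root (defaults to /sys)
list_cache_sets() {
    for dir in "${1:-/sys}"/fs/bcache/*/; do
        [ -d "$dir" ] || continue     # no cache set registered
        basename "$dir"
    done
}

# Print the active caching mode of a bcache device, for example "writeback".
# The cache_mode file lists all modes and brackets the active one.
# $1: device name such as bcache0; $2: optional sysfs root (defaults to /sys)
active_cache_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${2:-/sys}/block/$1/bcache/cache_mode"
}
```

On a system with a registered cache set, list_cache_sets prints the value to use as CACHE_SET_UUID, and active_cache_mode bcache0 shows whether a mode change has taken effect.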
4.3.3 bcache Configuration Using sysfs #
bcache devices use the sysfs interface to store their runtime configuration values. This way you can change bcache backing and cache disks' behavior or see their usage statistics.
For the complete list of bcache sysfs parameters, see the contents of the /usr/src/linux/Documentation/bcache.txt file, mainly the SYSFS - BACKING DEVICE, SYSFS - BACKING DEVICE STATS, and SYSFS - CACHE DEVICE sections.
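As an example of reading usage statistics, the sketch below computes a hit percentage from the cache_hits and cache_misses counters in the stats_total directory of a bcache device. The function name bcache_hit_percent and the optional sysfs root argument are assumptions of this sketch. Check the documentation file named above for the authoritative list of statistics files.

```shell
#!/bin/sh
# Hypothetical helper: print the overall cache hit ratio of a bcache device
# as an integer percentage, computed from the stats_total counters.
# $1: device name such as bcache0; $2: optional sysfs root (defaults to /sys)
bcache_hit_percent() {
    stats="${2:-/sys}/block/$1/bcache/stats_total"
    hits=$(cat "$stats/cache_hits")
    misses=$(cat "$stats/cache_misses")
    total=$((hits + misses))
    if [ "$total" -eq 0 ]; then
        echo 0                        # nothing has been requested yet
    else
        echo $((hits * 100 / total))  # integer division, rounds down
    fi
}
```

A call such as bcache_hit_percent bcache0 gives a quick indication of whether the cache has warmed up.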
4.4 lvmcache #
lvmcache is a caching mechanism consisting of logical volumes (LVs). It uses the dm-cache kernel driver and supports write-through (default) and write-back caching modes. lvmcache improves performance of a large and slow LV by dynamically migrating some of its data to a faster and smaller LV. For more information on LVM, see Part II, “Logical Volumes (LVM)”.
LVM refers to the small, fast LV as a cache pool LV. The large, slow LV is called the origin LV. Because of requirements from dm-cache, LVM further splits the cache pool LV into two devices: the cache data LV and cache metadata LV. The cache data LV is where copies of data blocks are kept from the origin LV to increase speed. The cache metadata LV holds the accounting information that specifies where data blocks are stored.
4.4.1 Configuring lvmcache #
This section describes steps to create and configure LVM based caching.
Create the origin LV. Create a new LV or use an existing LV to become the origin LV:
lvcreate -n ORIGIN_LV -L 100G vg /dev/SLOW_DEV
Create the cache data LV. This LV will hold data blocks from the origin LV. The size of this LV is the size of the cache and will be reported as the size of the cache pool LV.
lvcreate -n CACHE_DATA_LV -L 10G vg /dev/FAST
Create the cache metadata LV. This LV will hold cache pool metadata. The size of this LV should be approximately 1000 times smaller than the cache data LV, with a minimum size of 8 MB.
lvcreate -n CACHE_METADATA_LV -L 12M vg /dev/FAST
List the volumes you have created so far:
lvs -a vg
  LV                VG Attr       LSize   Pool Origin
  cache_data_lv     vg -wi-a-----  10.00g
  cache_metadata_lv vg -wi-a-----  12.00m
  origin_lv         vg -wi-a----- 100.00g
Create a cache pool LV. Combine the data and metadata LVs into a cache pool LV. You can set the cache pool LV's behavior at the same time.
CACHE_POOL_LV takes the name of CACHE_DATA_LV.
CACHE_DATA_LV is renamed to CACHE_DATA_LV_cdata and becomes hidden.
CACHE_METADATA_LV is renamed to CACHE_DATA_LV_cmeta and becomes hidden.
lvconvert --type cache-pool --poolmetadata vg/cache_metadata_lv vg/cache_data_lv
lvs -a vg
  LV                    VG Attr       LSize   Pool Origin
  cache_data_lv         vg Cwi---C---  10.00g
  [cache_data_lv_cdata] vg Cwi-------  10.00g
  [cache_data_lv_cmeta] vg ewi-------  12.00m
  origin_lv             vg -wi-a----- 100.00g
Create a cache LV. Create a cache LV by linking the cache pool LV to the origin LV.
The user accessible cache LV takes the name of the origin LV, while the origin LV becomes a hidden LV renamed to ORIGIN_LV_corig.
The cache LV takes the name of ORIGIN_LV.
ORIGIN_LV is renamed to ORIGIN_LV_corig and becomes hidden.
lvconvert --type cache --cachepool vg/cache_data_lv vg/origin_lv
lvs -a vg
  LV                    VG Attr       LSize   Pool          Origin
  cache_data_lv         vg Cwi---C---  10.00g
  [cache_data_lv_cdata] vg Cwi-ao----  10.00g
  [cache_data_lv_cmeta] vg ewi-ao----  12.00m
  origin_lv             vg Cwi-a-C--- 100.00g cache_data_lv [origin_lv_corig]
  [origin_lv_corig]     vg -wi-ao---- 100.00g
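The sizing rule for the cache metadata LV used in the steps above (about one thousandth of the cache data LV, but at least 8 MB) can be sketched as a small calculation. The function name cache_meta_size_mib is hypothetical, and the sketch treats all sizes as MiB for simplicity.

```shell
#!/bin/sh
# Hypothetical helper: suggest a size in MiB for the cache metadata LV.
# Rule from the text: about 1/1000 of the cache data LV, with an 8 MB minimum.
# $1: size of the cache data LV in MiB
cache_meta_size_mib() {
    size=$(( ($1 + 999) / 1000 ))     # divide by 1000, rounding up
    if [ "$size" -lt 8 ]; then
        size=8                        # enforce the minimum size
    fi
    echo "$size"
}

# The 10 GiB cache data LV from the example above.
cache_meta_size_mib 10240
```

For the 10 GiB cache data LV this prints 11. The example above uses 12M, which is what LVM would allocate after rounding 11 MiB up to a multiple of its default 4 MiB extent size.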
4.4.2 Removing a Cache Pool #
There are several ways to turn off the LV cache.
4.4.2.1 Detach a Cache Pool LV from a Cache LV #
You can disconnect a cache pool LV from a cache LV, leaving an unused cache pool LV and an uncached origin LV. Data are written back from the cache pool to the origin LV when necessary.
lvconvert --splitcache vg/origin_lv
4.4.2.2 Removing a Cache Pool LV without Removing its Origin LV #
This writes back data from the cache pool to the origin LV when necessary, then removes the cache pool LV, leaving the uncached origin LV.
lvremove vg/cache_data_lv
An alternative command that also disconnects the cache pool from the cache LV, and deletes the cache pool:
lvconvert --uncache vg/origin_lv
4.4.2.3 Removing Both the Origin LV and the Cache Pool LV #
Removing a cache LV removes both the origin LV and the linked cache pool LV.
lvremove vg/origin_lv
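Before detaching or removing a cache pool, it can be useful to see how much data still has to be written back to the origin LV. The following is a sketch under assumptions: it relies on the cache_dirty_blocks report field of lvs, and the function names dirty_blocks and writeback_status are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers for checking a cache LV before removing its cache.

# Print the number of dirty cache blocks of a cache LV, for example vg/origin_lv.
# Requires root privileges and an existing cache LV.
dirty_blocks() {
    lvs --noheadings -o cache_dirty_blocks "$1" | tr -d ' '
}

# Describe what removing the cache involves for a given dirty block count.
# $1: number of dirty blocks
writeback_status() {
    if [ "$1" -gt 0 ]; then
        echo "$1 dirty blocks will be written back to the origin LV"
    else
        echo "cache is clean"
    fi
}
```

On a live system the two helpers can be combined as writeback_status "$(dirty_blocks vg/origin_lv)".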
4.4.2.4 For More Information #
You can find more lvmcache related topics, such as supported cache modes, redundant sub-logical volumes, cache policy, or converting existing LVs to cache types, in the lvmcache manual page (man 7 lvmcache).