Applies to SUSE Enterprise Storage 6

11 Installation of CephFS

The Ceph file system (CephFS) is a POSIX-compliant file system that uses a Ceph storage cluster to store its data. CephFS uses the same cluster system as Ceph block devices, Ceph object storage with its S3 and Swift APIs, or native bindings (librados).

To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.

11.1 Supported CephFS Scenarios and Guidance

With SUSE Enterprise Storage 6, SUSE introduces official support for many scenarios in which the scale-out and distributed component CephFS is used. This section describes hard limits and provides guidance for the suggested use cases.

A supported CephFS deployment must meet these requirements:

  • Clients are SUSE Linux Enterprise Server 12 SP3 or newer, or SUSE Linux Enterprise Server 15 or newer, using the cephfs kernel module driver. The FUSE module is not supported.

  • CephFS quotas are supported in SUSE Enterprise Storage 6 and can be set on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy. For more information, see Section 28.6, “Setting CephFS Quotas”.

  • CephFS supports file layout changes as documented in Section 11.3.4, “File Layouts”. However, while the file system is mounted by any client, new data pools may not be added to an existing CephFS file system (ceph fs add_data_pool). They may only be added while the file system is unmounted.

  • A minimum of one Metadata Server. SUSE recommends deploying several nodes with the MDS role. By default, additional MDS daemons start as standby daemons, acting as backups for the active MDS. Multiple active MDS daemons are also supported (refer to Section 11.3.2, “MDS Cluster Size”).
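As an illustration of the quota support mentioned above, quotas are set with virtual extended attributes on a directory of a mounted CephFS. The mount point /mnt/cephfs and the directory name below are examples only:

```
# Limit a directory subtree to 100 MB and 10,000 files:
root # setfattr -n ceph.quota.max_bytes -v 104857600 /mnt/cephfs/quota_dir
root # setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/quota_dir
# Setting a value of 0 removes the quota:
root # setfattr -n ceph.quota.max_bytes -v 0 /mnt/cephfs/quota_dir
```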

11.2 Ceph Metadata Server

The Ceph metadata server (MDS) stores metadata for CephFS. Ceph block devices and Ceph object storage do not use MDS. MDSs make it possible for POSIX file system users to execute basic commands—such as ls or find—without placing an enormous burden on the Ceph storage cluster.

11.2.1 Adding and Removing a Metadata Server

You can deploy MDS either during the initial cluster deployment process as described in Section 5.3, “Cluster Deployment”, or add it to an already deployed cluster as described in Section 2.1, “Adding New Cluster Nodes”.

After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall settings of the server where MDS is deployed: start yast, navigate to Security and Users › Firewall › Allowed Services, and in the Service to Allow drop-down menu select Ceph OSD/MDS. If the Ceph MDS node is not allowed full traffic, mounting a file system fails, even though other operations may work properly.

You can remove a metadata server in your cluster as described in Example 2.2, “Migrating Nodes”.

11.2.2 Configuring a Metadata Server

You can fine-tune the MDS behavior by inserting relevant options in the ceph.conf configuration file.
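For example, a minimal ceph.conf fragment tuning the MDS cache and enabling hot standby might look as follows. The values shown are illustrative, not recommendations:

```ini
[mds]
# Raise the cache memory limit from the 1 GB default to 2 GB:
mds cache memory limit = 2147483648
# Keep 5% of the cache as reservation (the default):
mds cache reservation = 0.05
# Poll and replay the journal of an active MDS ('hot standby'):
mds standby replay = true
```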

Metadata Server Settings
mon force standby active

If set to 'true' (default), monitors force standby-replay to be active. Set this option in the [mon] or [global] section.

mds cache memory limit

The soft memory limit (in bytes) that the MDS will enforce for its cache. Administrators should use this instead of the old mds cache size setting. Defaults to 1 GB.

mds cache reservation

The cache reservation (memory or inodes) for the MDS cache to maintain. When the MDS begins touching its reservation, it will recall client state until its cache size shrinks to restore the reservation. Defaults to 0.05.

mds cache size

The number of inodes to cache. A value of 0 (default) indicates an unlimited number. It is recommended to use mds cache memory limit to limit the amount of memory the MDS cache uses.

mds cache mid

The insertion point for new items in the cache LRU (from the top). Default is 0.7.

mds dir commit ratio

The fraction of a directory that is dirty before Ceph commits using a full update instead of a partial update. Default is 0.5.

mds dir max commit size

The maximum size of a directory update before Ceph breaks it into smaller transactions. Default is 90 MB.

mds decay halflife

The half-life of MDS cache temperature. Default is 5.

mds beacon interval

The frequency in seconds of beacon messages sent to the monitor. Default is 4.

mds beacon grace

The interval without beacons before Ceph declares an MDS laggy and possibly replaces it. Default is 15.

mds blacklist interval

The blacklist duration for failed MDSs in the OSD map. This setting controls how long failed MDS daemons will stay in the OSD map blacklist. It has no effect on how long something is blacklisted when the administrator blacklists it manually. For example, the ceph osd blacklist add command will still use the default blacklist time. Default is 24 * 60.

mds reconnect timeout

The interval in seconds to wait for clients to reconnect during MDS restart. Default is 45.

mds tick interval

How frequently the MDS performs internal periodic tasks. Default is 5.

mds dirstat min interval

The minimum interval in seconds to try to avoid propagating recursive stats up the tree. Default is 1.

mds scatter nudge interval

How quickly dirstat changes propagate up. Default is 5.

mds client prealloc inos

The number of inode numbers to preallocate per client session. Default is 1000.

mds early reply

Determines whether the MDS should allow clients to see request results before they commit to the journal. Default is 'true'.

mds use tmap

Use trivial map for directory updates. Default is 'true'.

mds default dir hash

The function to use for hashing files across directory fragments. Default is 2 (that is 'rjenkins').

mds log skip corrupt events

Determines whether the MDS should try to skip corrupt journal events during journal replay. Default is 'false'.

mds log max events

The maximum events in the journal before we initiate trimming. Set to -1 (default) to disable limits.

mds log max segments

The maximum number of segments (objects) in the journal before we initiate trimming. Set to -1 to disable limits. Default is 30.

mds log max expiring

The maximum number of segments to expire in parallel. Default is 20.

mds log eopen size

The maximum number of inodes in an EOpen event. Default is 100.

mds bal sample interval

Determines how frequently to sample directory temperature for fragmentation decisions. Default is 3.

mds bal replicate threshold

The maximum temperature before Ceph attempts to replicate metadata to other nodes. Default is 8000.

mds bal unreplicate threshold

The minimum temperature before Ceph stops replicating metadata to other nodes. Default is 0.

mds bal split size

The maximum directory size before the MDS will split a directory fragment into smaller bits. Default is 10000.

mds bal split rd

The maximum directory read temperature before Ceph splits a directory fragment. Default is 25000.

mds bal split wr

The maximum directory write temperature before Ceph splits a directory fragment. Default is 10000.

mds bal split bits

The number of bits by which to split a directory fragment. Default is 3.

mds bal merge size

The minimum directory size before Ceph tries to merge adjacent directory fragments. Default is 50.

mds bal interval

The frequency in seconds of workload exchanges between MDSs. Default is 10.

mds bal fragment interval

The delay in seconds between a fragment being capable of splitting or merging, and execution of the fragmentation change. Default is 5.

mds bal fragment fast factor

The ratio by which fragments may exceed the split size before a split is executed immediately, skipping the fragment interval. Default is 1.5.

mds bal fragment size max

The maximum size of a fragment before any new entries are rejected with ENOSPC. Default is 100000.

mds bal idle threshold

The minimum temperature before Ceph migrates a subtree back to its parent. Default is 0.

mds bal mode

The method for calculating MDS load:

  • 0 = Hybrid.

  • 1 = Request rate and latency.

  • 2 = CPU load.

Default is 0.

mds bal min rebalance

The minimum subtree temperature before Ceph migrates. Default is 0.1.

mds bal min start

The minimum subtree temperature before Ceph searches a subtree. Default is 0.2.

mds bal need min

The minimum fraction of target subtree size to accept. Default is 0.8.

mds bal need max

The maximum fraction of target subtree size to accept. Default is 1.2.

mds bal midchunk

Ceph will migrate any subtree that is larger than this fraction of the target subtree size. Default is 0.3.

mds bal minchunk

Ceph will ignore any subtree that is smaller than this fraction of the target subtree size. Default is 0.001.

mds bal target removal min

The minimum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 5.

mds bal target removal max

The maximum number of balancer iterations before Ceph removes an old MDS target from the MDS map. Default is 10.

mds replay interval

The journal poll interval when in standby-replay mode ('hot standby'). Default is 1.

mds shutdown check

The interval for polling the cache during MDS shutdown. Default is 0.

mds thrash fragments

Ceph will randomly fragment or merge directories. Default is 0.

mds dump cache on map

Ceph will dump the MDS cache contents to a file on each MDS map. Default is 'false'.

mds dump cache after rejoin

Ceph will dump MDS cache contents to a file after rejoining the cache during recovery. Default is 'false'.

mds standby for name

An MDS daemon will standby for another MDS daemon of the name specified in this setting.

mds standby for rank

An MDS daemon will standby for an MDS daemon of this rank. Default is -1.

mds standby replay

Determines whether a Ceph MDS daemon should poll and replay the log of an active MDS ('hot standby'). Default is 'false'.

mds min caps per client

Set the minimum number of capabilities a client may hold. Default is 100.

mds max ratio caps per client

Set the maximum ratio of current caps that may be recalled during MDS cache pressure. Default is 0.8.

Metadata Server Journaler Settings
journaler write head interval

How frequently to update the journal head object. Default is 15.

journaler prefetch periods

How many stripe periods to read ahead on journal replay. Default is 10.

journaler prezero periods

How many stripe periods to zero ahead of the write position. Default is 10.

journaler batch interval

Maximum additional latency in seconds we incur artificially. Default is 0.001.

journaler batch max

Maximum number of bytes by which we will delay flushing. Default is 0.
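To illustrate how the mds bal split size, mds bal fragment fast factor, mds bal fragment interval, and mds bal split bits settings interact, the following sketch models the split decision in plain Python. It is a simplification for illustration, not the actual Ceph balancer code:

```python
def split_decision(dir_entries,
                   split_size=10000,       # mds bal split size
                   fast_factor=1.5,        # mds bal fragment fast factor
                   fragment_interval=5,    # mds bal fragment interval (seconds)
                   split_bits=3):          # mds bal split bits
    """Describe how the MDS would treat a directory fragment of dir_entries entries."""
    if dir_entries > split_size * fast_factor:
        # Fragment exceeded the split size by the fast factor: split at once.
        return "split immediately into %d fragments" % (2 ** split_bits)
    if dir_entries > split_size:
        # Split, but only after the configured delay.
        return "split after %ds into %d fragments" % (fragment_interval, 2 ** split_bits)
    return "no split"

print(split_decision(16000))  # split immediately into 8 fragments
print(split_decision(12000))  # split after 5s into 8 fragments
print(split_decision(500))    # no split
```

Note how the default split bits of 3 means a split always produces 2^3 = 8 smaller fragments.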

11.3 CephFS

When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you can create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.

11.3.1 Creating CephFS

A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:

  • Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.

  • Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.

When you assign the role-mds in the policy.cfg, the required pools are created automatically. To tune performance manually, you can create the pools cephfs_data and cephfs_metadata yourself before setting up the Metadata Server. DeepSea will not create these pools if they already exist.

For more information on managing pools, see Chapter 22, Managing Storage Pools.

To create the two required pools—for example, 'cephfs_data' and 'cephfs_metadata'—with default settings for use with CephFS, run the following commands:

cephadm@adm > ceph osd pool create cephfs_data pg_num
cephadm@adm > ceph osd pool create cephfs_metadata pg_num

It is possible to use EC pools instead of replicated pools. We recommend using EC pools only for workloads with low performance requirements and infrequent random access, for example cold storage, backups, or archiving. CephFS on EC pools requires BlueStore to be enabled, and the pool must have the allow_ec_overwrites option set. This option can be set by running ceph osd pool set ec_pool allow_ec_overwrites true.
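A possible command sequence for creating an EC data pool with overwrites enabled might look as follows. The pool name and placement group counts are examples only:

```
cephadm@adm > ceph osd pool create cephfs_data_ec 64 64 erasure
cephadm@adm > ceph osd pool set cephfs_data_ec allow_ec_overwrites true
```

Note that only data pools may be erasure coded; the metadata pool must remain replicated.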

Erasure coding adds significant overhead to file system operations, especially small updates. This overhead is inherent to using erasure coding as a fault tolerance mechanism. This penalty is the trade off for significantly reduced storage space overhead.

When the pools are created, you may enable the file system with the ceph fs new command:

cephadm@adm > ceph fs new fs_name metadata_pool_name data_pool_name

For example:

cephadm@adm > ceph fs new cephfs cephfs_metadata cephfs_data

You can check that the file system was created by listing all available CephFSs:

cephadm@adm > ceph fs ls
 name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

When the file system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:

cephadm@adm > ceph mds stat
e5: 1/1/1 up
Tip: More Topics

You can find more information about specific tasks—for example mounting, unmounting, and advanced CephFS setup—in Chapter 28, Clustered File System.

11.3.2 MDS Cluster Size

A CephFS instance can be served by multiple active MDS daemons. All active MDS daemons that are assigned to a CephFS instance will distribute the file system's directory tree between themselves, and thus spread the load of concurrent clients. In order to add an active MDS daemon to a CephFS instance, a spare standby is needed. Either start an additional daemon or use an existing standby instance.

The following command displays the current number of active and standby MDS daemons.

cephadm@adm > ceph mds stat

The following command sets the number of active MDSs to two in a file system instance.

cephadm@adm > ceph fs set fs_name max_mds 2

In order to shrink the MDS cluster prior to an update, two steps are necessary. First, set max_mds so that only one instance remains:

cephadm@adm > ceph fs set fs_name max_mds 1

and after that, explicitly deactivate the other active MDS daemons:

cephadm@adm > ceph mds deactivate fs_name:rank

where rank is the number of an active MDS daemon of a file system instance, ranging from 0 to max_mds-1.

We recommend at least one MDS is left as a standby daemon.

11.3.3 MDS Cluster and Updates

During Ceph updates, the feature flags on a file system instance may change (usually by adding new features). Incompatible daemons (such as older versions) are not able to function with an incompatible feature set and will refuse to start. This means that updating and restarting one daemon can cause all other not yet updated daemons to stop and refuse to start. For this reason, we recommend shrinking the active MDS cluster to size one and stopping all standby daemons before updating Ceph. The manual steps for this update procedure are as follows:

  1. Update the Ceph related packages using zypper.

  2. Shrink the active MDS cluster as described above to one instance and stop all standby MDS daemons using their systemd units on all other nodes:

    cephadm@mds > systemctl stop ceph-mds\*.service ceph-mds.target
  3. Only then restart the single remaining MDS daemon, causing it to restart using the updated binary.

    cephadm@mds > systemctl restart ceph-mds\*.service ceph-mds.target
  4. Restart all other MDS daemons and reset the desired max_mds setting.

    cephadm@mds > systemctl start ceph-mds.target

If you use DeepSea, it follows this procedure automatically when the ceph package is updated during Stages 0 and 4. It is possible to perform this procedure while clients have the CephFS instance mounted and I/O is ongoing. Note, however, that there will be a very brief I/O pause while the active MDS restarts. Clients will recover automatically.

It is good practice to reduce the I/O load as much as possible before updating an MDS cluster. An idle MDS cluster will go through this update procedure quicker. Conversely, on a heavily loaded cluster with multiple MDS daemons it is essential to reduce the load in advance to prevent a single MDS daemon from being overwhelmed by ongoing I/O.

11.3.4 File Layouts

The layout of a file controls how its contents are mapped to Ceph RADOS objects. You can read and write a file’s layout using virtual extended attributes (xattrs for short).

The name of the layout xattrs depends on whether a file is a regular file or a directory. Regular files’ layout xattrs are called ceph.file.layout, while directories’ layout xattrs are called ceph.dir.layout. Where examples refer to ceph.file.layout, substitute the .dir. part as appropriate when dealing with directories.

Layout Fields

The following attribute fields are recognized:

pool

ID or name of a RADOS pool in which a file’s data objects will be stored.

pool_namespace

RADOS namespace within a data pool to which the objects will be written. It is empty by default, meaning the default namespace.

stripe_unit

The size in bytes of a block of data used in the RAID 0 distribution of a file. All stripe units for a file have equal size. The last stripe unit is typically incomplete—it represents the data at the end of the file as well as the unused 'space' beyond it up to the end of the fixed stripe unit size.

stripe_count

The number of consecutive stripe units that constitute a RAID 0 'stripe' of file data.

object_size

The size in bytes of RADOS objects into which the file data is chunked.
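The interaction of the stripe_unit, stripe_count, and object_size fields can be illustrated by computing which RADOS object a given file byte offset lands in. The following sketch mirrors the usual RAID 0 striping arithmetic; it is an illustration, not Ceph code:

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a file byte offset to (object number, offset within that object)."""
    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit        # which stripe unit, counted from file start
    stripe = block // stripe_count       # which stripe the unit belongs to
    stripe_pos = block % stripe_count    # position of the object within the stripe
    object_set = stripe // stripes_per_object
    objectno = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe % stripes_per_object) * stripe_unit + offset % stripe_unit
    return objectno, offset_in_object

# With the defaults (4 MB stripe unit, a single stripe, 4 MB objects), the
# file is simply chunked into consecutive 4 MB objects:
print(locate(4194304, 4194304, 1, 4194304))   # (1, 0)
# With 1 MB units striped across 8 objects of 4 MB each, offset 8 MB wraps
# back to object 0, at the second stripe unit within it:
print(locate(8388608, 1048576, 8, 4194304))   # (0, 1048576)
```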

Tip: Object Sizes

RADOS enforces a configurable limit on object sizes. If you increase CephFS object sizes beyond that limit, writes may not succeed. The OSD setting is osd_max_object_size, which is 128 MB by default. Very large RADOS objects may prevent smooth operation of the cluster, so increasing the object size limit past the default is not recommended.

Reading Layout with getfattr

Use the getfattr command to read the layout information of an example file file as a single string:

root # touch file
root # getfattr -n ceph.file.layout file
# file: file
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

Read individual layout fields:

root # getfattr -n ceph.file.layout.pool file
# file: file
ceph.file.layout.pool="cephfs_data"
root # getfattr -n ceph.file.layout.stripe_unit file
# file: file
ceph.file.layout.stripe_unit="4194304"
Tip: Pool ID or Name

When reading layouts, the pool will usually be indicated by name. However, in rare cases when pools have only just been created, the ID may be output instead.

Directories do not have an explicit layout until one is set. Attempts to read the layout fail if it has never been modified: in that case, the layout of the closest ancestor directory with an explicit layout is used.

root # mkdir dir
root # getfattr -n ceph.dir.layout dir
dir: ceph.dir.layout: No such attribute
root # setfattr -n ceph.dir.layout.stripe_count -v 2 dir
root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"

Writing Layouts with setfattr

Use the setfattr command to modify the layout fields of an example file file:

cephadm@adm > ceph osd lspools
0 rbd
1 cephfs_data
2 cephfs_metadata
root # setfattr -n ceph.file.layout.stripe_unit -v 1048576 file
root # setfattr -n ceph.file.layout.stripe_count -v 8 file
# Setting pool by ID:
root # setfattr -n ceph.file.layout.pool -v 1 file
# Setting pool by name:
root # setfattr -n ceph.file.layout.pool -v cephfs_data file
Note: Empty File

When the layout fields of a file are modified using setfattr, the file needs to be empty, otherwise an error occurs.

Clearing Layouts

If you want to remove an explicit layout from an example directory mydir and revert to inheriting the layout of its ancestor, run the following:

root # setfattr -x ceph.dir.layout mydir

Similarly, if you have set the 'pool_namespace' attribute and wish to modify the layout to use the default namespace instead, run:

# Create a directory and set a namespace on it
root # mkdir mydir
root # setfattr -n ceph.dir.layout.pool_namespace -v foons mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a pool_namespace=foons"

# Clear the namespace from the directory's layout
root # setfattr -x ceph.dir.layout.pool_namespace mydir
root # getfattr -n ceph.dir.layout mydir
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 \
 pool=cephfs_data_a"

Inheritance of Layouts

Files inherit the layout of their parent directory at creation time. However, subsequent changes to the parent directory’s layout do not affect children:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# file1 inherits its parent's layout
root # touch dir/file1
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# update the layout of the directory before creating a second file
root # setfattr -n ceph.dir.layout.stripe_count -v 4 dir
root # touch dir/file2

# file1's layout is unchanged
root # getfattr -n ceph.file.layout dir/file1
# file: dir/file1
ceph.file.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 \
 pool=cephfs_data"

# ...while file2 has the parent directory's new layout
root # getfattr -n ceph.file.layout dir/file2
# file: dir/file2
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"

Files created as descendants of the directory also inherit its layout if the intermediate directories do not have layouts set:

root # getfattr -n ceph.dir.layout dir
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"
root # mkdir dir/childdir
root # getfattr -n ceph.dir.layout dir/childdir
dir/childdir: ceph.dir.layout: No such attribute
root # touch dir/childdir/grandchild
root # getfattr -n ceph.file.layout dir/childdir/grandchild
# file: dir/childdir/grandchild
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 \
 pool=cephfs_data"

Adding a Data Pool to the Metadata Server

Before you can use a pool with CephFS, you need to add it to the Metadata Server:

cephadm@adm > ceph fs add_data_pool cephfs cephfs_data_ssd
cephadm@adm > ceph fs ls  # Pool should now show up
.... data pools: [cephfs_data cephfs_data_ssd ]
Tip: cephx Keys

Make sure that your cephx keys allow the client to access this new pool.

You can then update the layout on a directory in CephFS to use the pool you added:

root # mkdir /mnt/cephfs/myssddir
root # setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/myssddir

All new files created within that directory will now inherit its layout and place their data in the newly added pool. You may notice that the number of objects in your primary data pool continues to increase, even if files are being created in the newly added pool. This is normal: the file data is stored in the pool specified by the layout, but a small amount of metadata is kept in the primary data pool for all files.
