Applies to SUSE Enterprise Storage 7

23 Clustered file system #

This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Section 8.3.3, “Deploying Metadata Servers”.

23.1 Mounting CephFS #

When the file system is created and the MDS is active, you are ready to mount the file system from a client host.

23.1.1 Preparing the client #

If the client host is running SUSE Linux Enterprise 12 SP2 or later, the system is ready to mount CephFS 'out of the box'.

If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.

In any case, everything needed to mount CephFS is included in SUSE Linux Enterprise. The SUSE Enterprise Storage 7 product is not needed.

To support the full mount syntax, the ceph-common package (which is shipped with SUSE Linux Enterprise) should be installed before trying to mount CephFS.

Important

Without the ceph-common package (and thus without the mount.ceph helper), the monitors' IPs will need to be used instead of their names. This is because the kernel client will be unable to perform name resolution.

The basic mount syntax is:

# mount -t ceph MON1_IP[:PORT],MON2_IP[:PORT],...:CEPHFS_MOUNT_TARGET \
MOUNT_POINT -o name=CEPHX_USER_NAME,secret=SECRET_STRING

23.1.2 Creating a secret file #

The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:

Procedure 23.1: Creating a secret key #

View the key for the particular user in a keyring file:

cephuser@adm > cat /etc/ceph/ceph.client.admin.keyring

Copy the key of the user who will be using the mounted Ceph FS file system. Usually, the key looks similar to the following:
```
AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
```
Create a file with the user name as a file name part, for example /etc/ceph/admin.secret for the user admin.
Paste the key value to the file created in the previous step.
Set proper access rights to the file. The user should be the only one who can read the file—others may not have any access rights.

23.1.3 Mounting CephFS #

You can mount CephFS with the mount command. You need to specify the monitor host name or IP address. Because the cephx authentication is enabled by default in SUSE Enterprise Storage, you need to specify a user name and their related secret as well:

# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==

As the previous command remains in the shell history, a more secure approach is to read the secret from a file:

# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret

Note that the secret file should only contain the actual keyring secret. In our example, the file will then contain only the following line:

AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==

Tip: Specify multiple monitors

It is a good idea to specify multiple monitors separated by commas on the mount command line in case one monitor happens to be down at the time of mount. Each monitor address takes the form host[:port]. If the port is not specified, it defaults to 6789.

Create the mount point on the local host:

# mkdir /mnt/cephfs

Mount the CephFS:

# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret

A subdirectory subdir may be specified if a subset of the file system is to be mounted:

# mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret

You can specify more than one monitor host in the mount command:

# mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret

Important: Read access to the root directory

If clients with path restriction are used, the MDS capabilities need to include read access to the root directory. For example, a keyring may look as follows:

client.bar
 key: supersecretkey
 caps: [mds] allow rw path=/barjail, allow r path=/
 caps: [mon] allow r
 caps: [osd] allow rwx

The allow r path=/ part means that path-restricted clients are able to see the root volume, but cannot write to it. This may be an issue for use cases where complete isolation is a requirement.

23.2 Unmounting CephFS #

To unmount the CephFS, use the umount command:

# umount /mnt/cephfs

23.3 Mounting CephFS in `/etc/fstab` #

To mount CephFS automatically upon client start-up, insert the corresponding line in its file systems table /etc/fstab:

mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2

23.4 Multiple active MDS daemons (active-active MDS) #

CephFS is configured for a single active MDS daemon by default. To scale metadata performance for large-scale systems, you can enable multiple active MDS daemons, which will share the metadata workload with one another.

23.4.1 Using active-active MDS #

Consider using multiple active MDS daemons when your metadata performance is bottlenecked on the default single MDS.

Adding more daemons does not increase performance on all workload types. For example, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.

Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.

23.4.2 Increasing the MDS active cluster size #

Each CephFS file system has a max_mds setting, which controls how many ranks will be created. The actual number of ranks in the file system will only be increased if a spare daemon is available to take on the new rank. For example, if there is only one MDS daemon running and max_mds is set to two, no second rank will be created.

In the following example, we set the max_mds option to 2 to create a new rank apart from the default one. To see the changes, run ceph status before and after you set max_mds, and watch the line containing fsmap:

cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up  {0=node2=up:active}, 1 up:standby
    [...]
cephuser@adm > ceph fs set cephfs max_mds 2
cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up  {0=node2=up:active,1=node1=up:active}
    [...]

The newly created rank (1) passes through the 'creating' state and then enter its 'active' state.

Important: Standby daemons

Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.

Consequently, the practical maximum of max_mds for highly available systems is one less than the total number of MDS servers in your system. To remain available in the event of multiple server failures, increase the number of standby daemons in the system to match the number of server failures you need to survive.

23.4.3 Decreasing the number of ranks #

All ranks—including the ranks to be removed—must first be active. This means that you need to have at least max_mds MDS daemons available.

First, set max_mds to a lower number. For example, go back to having a single active MDS:

cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-2/2/2 up  {0=node2=up:active,1=node1=up:active}
    [...]
cephuser@adm > ceph fs set cephfs max_mds 1
cephuser@adm > ceph status
  [...]
  services:
    [...]
    mds: cephfs-1/1/1 up  {0=node2=up:active}, 1 up:standby
    [...]

23.4.4 Manually pinning directory trees to a rank #

In multiple active metadata server configurations, a balancer runs, which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users' metadata requests on the entire cluster.

The mechanism provided for this purpose is called an 'export pin'. It is an extended attribute of directories. The name of this extended attribute is ceph.dir.pin. Users can set this attribute using standard commands:

# setfattr -n ceph.dir.pin -v 2 /path/to/dir

The value (-v) of the extended attribute is the rank to assign the directory sub-tree to. A default value of -1 indicates that the directory is not pinned.

A directory export pin is inherited from its closest parent with a set export pin. Therefore, setting the export pin on a directory affects all of its children. However, the parent's pin can be overridden by setting the child directory export pin. For example:

# mkdir -p a/b                      # "a" and "a/b" start with no export pin set.
setfattr -n ceph.dir.pin -v 1 a/  # "a" and "b" are now pinned to rank 1.
setfattr -n ceph.dir.pin -v 0 a/b # "a/b" is now pinned to rank 0
                                  # and "a/" and the rest of its children
                                  # are still pinned to rank 1.

23.5 Managing failover #

If an MDS daemon stops communicating with the monitor, the monitor will wait mds_beacon_grace seconds (default 15 seconds) before marking the daemon as laggy. You can configure one or more 'standby' daemons that will take over during the MDS daemon failover.

23.5.1 Configuring standby replay #

Each CephFS file system may be configured to add standby-replay daemons. These standby daemons follow the active MDS's metadata journal to reduce failover time in the event that the active MDS becomes unavailable. Each active MDS may have only one standby-replay daemon following it.

Configure standby-replay on a file system with the following command:

cephuser@adm > ceph fs set FS-NAME allow_standby_replay BOOL

When set the monitors will assign available standby daemons to follow the active MDSs in that file system.

Once an MDS has entered the standby-replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby-replay daemon will not be used as a replacement, even if no other standbys are available. For this reason, it is advised that if standby-replay is used then every active MDS should have a standby-replay daemon.

23.6 Setting CephFS quotas #

You can set quotas on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy.

23.6.1 CephFS quota limitations #

Using quotas with CephFS has the following limitations:

Quotas are cooperative and non-competing.: Ceph quotas rely on the client that is mounting the file system to stop writing to it when a limit is reached. The server part cannot prevent a malicious client from writing as much data as it needs. Do not use quotas to prevent filling the file system in environments where the clients are fully untrusted.
Quotas are imprecise.: Processes that are writing to the file system will be stopped shortly after the quota limit is reached. They will inevitably be allowed to write some amount of data over the configured limit. Client writers will be stopped within tenths of seconds after crossing the configured limit.
Quotas are implemented in the kernel client from version 4.17.: Quotas are supported by the user space client (libcephfs, ceph-fuse). Linux kernel clients 4.17 and higher support CephFS quotas on SUSE Enterprise Storage 7 clusters. Kernel clients (even recent versions) will fail to handle quotas on older clusters, even if they are able to set the quotas extended attributes. SLE12-SP3 (and later) kernels already include the required backports to handle quotas.
Configure quotas carefully when used with path-based mount restrictions.: The client needs to have access to the directory inode on which quotas are configured in order to enforce them. If the client has restricted access to a specific path (for example /home/user) based on the MDS capability, and a quota is configured on an ancestor directory they do not have access to (/home), the client will not enforce it. When using path-based access restrictions, be sure to configure the quota on the directory that the client can access (for example /home/user or /home/user/quota_dir).

23.6.2 Configuring CephFS quotas #

You can configure CephFS quotas by using virtual extended attributes:

ceph.quota.max_files: Configures a file limit.
ceph.quota.max_bytes: Configures a byte limit.

If the attributes appear on a directory inode, a quota is configured there. If they are not present then no quota is set on that directory (although one may still be configured on a parent directory).

To set a 100 MB quota, run:

cephuser@mds > setfattr -n ceph.quota.max_bytes -v 100000000 /SOME/DIRECTORY

To set a 10,000 files quota, run:

cephuser@mds > setfattr -n ceph.quota.max_files -v 10000 /SOME/DIRECTORY

To view quota setting, run:

cephuser@mds > getfattr -n ceph.quota.max_bytes /SOME/DIRECTORY

cephuser@mds > getfattr -n ceph.quota.max_files /SOME/DIRECTORY

Note: Quota not set

If the value of the extended attribute is '0', the quota is not set.

To remove a quota, run:

cephuser@mds > setfattr -n ceph.quota.max_bytes -v 0 /SOME/DIRECTORY
cephuser@mds > setfattr -n ceph.quota.max_files -v 0 /SOME/DIRECTORY

23.7 Managing CephFS snapshots #

CephFS snapshots create a read-only view of the file system at the point in time they are taken. You can create a snapshot in any directory. The snapshot will cover all data in the file system under the specified directory. After creating a snapshot, the buffered data is flushed out asynchronously from various clients. As a result, creating a snapshot is very fast.

Important: Multiple file systems

If you have multiple CephFS file systems sharing a single pool (via name spaces), their snapshots will collide, and deleting one snapshot will result in missing file data for other snapshots sharing the same pool.

23.7.1 Creating snapshots #

The CephFS snapshot feature is enabled by default on new file systems. To enable it on existing file systems, run:

cephuser@adm > ceph fs set CEPHFS_NAME allow_new_snaps true

After you enable snapshots, all directories in the CephFS will have a special .snap subdirectory.

Note

This is a virtual subdirectory. It does not appear in the directory listing of the parent directory, but the name .snap cannot be used as a file or directory name. To access the .snap directory one needs to explicitly access it, for example:

> ls -la /CEPHFS_MOUNT/.snap/

Important: Kernel clients limitation

CephFS kernel clients have a limitation: they cannot handle more than 400 snapshots in a file system. The number of snapshots should always be kept below this limit, regardless of which client you are using. If using older CephFS clients, such as SLE12-SP3, keep in mind that going above 400 snapshots is harmful to operations as the client will crash.

Tip: Custom snapshot subdirectory name

You may configure a different name for the snapshots subdirectory by setting the client snapdir setting.

To create a snapshot, create a subdirectory under the .snap directory with a custom name. For example, to create a snapshot of the directory /CEPHFS_MOUNT/2/3/, run:

> mkdir /CEPHFS_MOUNT/2/3/.snap/CUSTOM_SNAPSHOT_NAME

23.7.2 Deleting snapshots #

To delete a snapshot, remove its subdirectory inside the .snap directory:

> rmdir /CEPHFS_MOUNT/2/3/.snap/CUSTOM_SNAPSHOT_NAME