23 Clustered file system #
This chapter describes administration tasks that are normally performed after the cluster is set up and CephFS exported. If you need more information on setting up CephFS, refer to Section 8.3.3, “Deploying Metadata Servers”.
23.1 Mounting CephFS #
When the file system is created and the MDS is active, you are ready to mount the file system from a client host.
23.1.1 Preparing the client #
If the client host is running SUSE Linux Enterprise 12 SP2 or later, the system is ready to mount CephFS 'out of the box'.
If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.
In any case, everything needed to mount CephFS is included in SUSE Linux Enterprise. The SUSE Enterprise Storage 7.1 product is not needed.
To support the full mount syntax, the ceph-common package (which is shipped with SUSE Linux Enterprise) should be installed before trying to mount CephFS. Without the ceph-common package (and thus without the mount.ceph helper), the monitors' IPs will need to be used instead of their names, because the kernel client will be unable to perform name resolution.
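If the package is not yet present on the client, it can be installed with zypper, for example (a minimal sketch assuming the standard SUSE Linux Enterprise repositories):
# zypper install ceph-common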
The basic mount syntax is:
# mount -t ceph MON1_IP[:PORT],MON2_IP[:PORT],...:CEPHFS_MOUNT_TARGET \
MOUNT_POINT -o name=CEPHX_USER_NAME,secret=SECRET_STRING
23.1.2 Creating a secret file #
The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:
View the key for the particular user in a keyring file:
cephuser@adm > cat /etc/ceph/ceph.client.admin.keyring
Copy the key of the user who will be using the mounted CephFS file system. Usually, the key looks similar to the following:
AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
Create a file with the user name as part of the file name, for example /etc/ceph/admin.secret for the user admin.
Paste the key value into the file created in the previous step.
Set proper access rights to the file. The user should be the only one who can read the file; others may not have any access rights.
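For example, assuming the file /etc/ceph/admin.secret created above, the following restricts read and write access to its owner (a minimal sketch):
# chmod 600 /etc/ceph/admin.secret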
23.1.3 Mounting CephFS #
You can mount CephFS with the mount command. You need to specify the monitor host name or IP address. Because cephx authentication is enabled by default in SUSE Enterprise Storage, you need to specify a user name and their related secret as well:
# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
As the previous command remains in the shell history, a more secure approach is to read the secret from a file:
# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
Note that the secret file should only contain the actual keyring secret. In our example, the file will then contain only the following line:
AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
It is a good idea to specify multiple monitors separated by commas on the mount command line in case one monitor happens to be down at the time of mount. Each monitor address takes the form host[:port]. If the port is not specified, it defaults to 6789.
Create the mount point on the local host:
# mkdir /mnt/cephfs
Mount the CephFS:
# mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
A subdirectory subdir may be specified if a subset of the file system is to be mounted:
# mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
You can specify more than one monitor host in the mount command:
# mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs \
-o name=admin,secretfile=/etc/ceph/admin.secret
If clients with path restriction are used, the MDS capabilities need to include read access to the root directory. For example, a keyring may look as follows:
client.bar
    key: supersecretkey
    caps: [mds] allow rw path=/barjail, allow r path=/
    caps: [mon] allow r
    caps: [osd] allow rwx
The allow r path=/ part means that path-restricted clients are able to see the root volume, but cannot write to it. This may be an issue for use cases where complete isolation is a requirement.
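One way to create a keyring with such capabilities is the ceph auth get-or-create command; the client name and paths below simply mirror the example keyring above and should be adapted to your setup:
cephuser@adm > ceph auth get-or-create client.bar \
  mds 'allow rw path=/barjail, allow r path=/' \
  mon 'allow r' osd 'allow rwx'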
23.2 Unmounting CephFS #
To unmount the CephFS, use the umount command:
# umount /mnt/cephfs
23.3 Mounting CephFS in /etc/fstab #
To mount CephFS automatically upon client start-up, insert the corresponding line in its file systems table /etc/fstab:
mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime,_netdev 0 2
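To verify the entry without rebooting, you can mount by referring to the mount point alone, letting mount pick up the options from /etc/fstab (a quick check assuming the example entry above):
# mount /mnt/cephfs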
23.4 Multiple active MDS daemons (active-active MDS) #
CephFS is configured for a single active MDS daemon by default. To scale metadata performance for large-scale systems, you can enable multiple active MDS daemons, which will share the metadata workload with one another.
23.4.1 Using active-active MDS #
Consider using multiple active MDS daemons when your metadata performance is bottlenecked on the default single MDS.
Adding more daemons does not increase performance on all workload types. For example, a single application running on a single client will not benefit from an increased number of MDS daemons unless the application is doing a lot of metadata operations in parallel.
Workloads that typically benefit from a larger number of active MDS daemons are those with many clients, perhaps working on many separate directories.
23.4.2 Increasing the MDS active cluster size #
Each CephFS file system has a max_mds setting, which controls how many ranks will be created. The actual number of ranks in the file system will only be increased if a spare daemon is available to take on the new rank. For example, if there is only one MDS daemon running and max_mds is set to two, no second rank will be created.
In the following example, we set the max_mds option to 2 to create a new rank apart from the default one. To see the changes, run ceph status before and after you set max_mds, and watch the line containing fsmap:
cephuser@adm > ceph status
[...]
  services:
  [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
[...]
cephuser@adm > ceph fs set cephfs max_mds 2
cephuser@adm > ceph status
[...]
  services:
  [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
[...]
The newly created rank (1) passes through the 'creating' state and then enters its 'active' state.
Even with multiple active MDS daemons, a highly available system still requires standby daemons to take over if any of the servers running an active daemon fail.
Consequently, the practical maximum of max_mds for highly available systems is one less than the total number of MDS servers in your system. To remain available in the event of multiple server failures, increase the number of standby daemons in the system to match the number of server failures you need to survive.
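As an illustration with hypothetical numbers: if you run five MDS daemons in total and need to survive two simultaneous server failures, keep two daemons as standbys and allow at most three active ranks:
cephuser@adm > ceph fs set cephfs max_mds 3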
23.4.3 Decreasing the number of ranks #
All ranks, including the ranks to be removed, must first be active. This means that you need to have at least max_mds MDS daemons available.
First, set max_mds to a lower number. For example, go back to having a single active MDS:
cephuser@adm > ceph status
[...]
  services:
  [...]
    mds: cephfs-2/2/2 up {0=node2=up:active,1=node1=up:active}
[...]
cephuser@adm > ceph fs set cephfs max_mds 1
cephuser@adm > ceph status
[...]
  services:
  [...]
    mds: cephfs-1/1/1 up {0=node2=up:active}, 1 up:standby
[...]
23.4.4 Manually pinning directory trees to a rank #
In multiple active metadata server configurations, a balancer runs, which works to spread metadata load evenly across the cluster. This usually works well enough for most users, but sometimes it is desirable to override the dynamic balancer with explicit mappings of metadata to particular ranks. This can allow the administrator or users to evenly spread application load or limit impact of users' metadata requests on the entire cluster.
The mechanism provided for this purpose is called an 'export pin'. It is an extended attribute of directories. The name of this extended attribute is ceph.dir.pin. Users can set this attribute using standard commands:
# setfattr -n ceph.dir.pin -v 2 /path/to/dir
The value (-v) of the extended attribute is the rank to assign the directory sub-tree to. A default value of -1 indicates that the directory is not pinned.
A directory export pin is inherited from its closest parent with a set export pin. Therefore, setting the export pin on a directory affects all of its children. However, the parent's pin can be overridden by setting the child directory export pin. For example:
# mkdir -p a/b                      # "a" and "a/b" start with no export pin set.
setfattr -n ceph.dir.pin -v 1 a/ # "a" and "b" are now pinned to rank 1.
setfattr -n ceph.dir.pin -v 0 a/b # "a/b" is now pinned to rank 0
# and "a/" and the rest of its children
# are still pinned to rank 1.
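If you want to verify a pin, the attribute can be read back with getfattr (a minimal check using the path from the first example; support for reading this virtual attribute depends on the client version):
# getfattr -n ceph.dir.pin /path/to/dir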
23.5 Managing failover #
If an MDS daemon stops communicating with the monitor, the monitor will wait mds_beacon_grace seconds (default 15 seconds) before marking the daemon as laggy. You can configure one or more 'standby' daemons that will take over during the MDS daemon failover.
23.5.1 Configuring standby replay #
Each CephFS file system may be configured to add standby-replay daemons. These standby daemons follow the active MDS's metadata journal to reduce failover time in the event that the active MDS becomes unavailable. Each active MDS may have only one standby-replay daemon following it.
Configure standby-replay on a file system with the following command:
cephuser@adm > ceph fs set FS-NAME allow_standby_replay BOOL
When set, the monitors will assign available standby daemons to follow the active MDSs in that file system.
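For example, to enable standby-replay on the file system named cephfs used in the earlier examples:
cephuser@adm > ceph fs set cephfs allow_standby_replay true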
When an MDS has entered the standby-replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby-replay daemon will not be used as a replacement, even if no other standbys are available. For this reason, it is advised that if standby-replay is used then every active MDS should have a standby-replay daemon.
23.6 Setting CephFS quotas #
You can set quotas on any subdirectory of the Ceph file system. The quota restricts either the number of bytes or files stored beneath the specified point in the directory hierarchy.
23.6.1 CephFS quota limitations #
Using quotas with CephFS has the following limitations:
- Quotas are cooperative and non-competing.
Ceph quotas rely on the client that is mounting the file system to stop writing to it when a limit is reached. The server part cannot prevent a malicious client from writing as much data as it needs. Do not use quotas to prevent filling the file system in environments where the clients are fully untrusted.
- Quotas are imprecise.
Processes that are writing to the file system will be stopped shortly after the quota limit is reached. They will inevitably be allowed to write some amount of data over the configured limit. Client writers will be stopped within tens of seconds after crossing the configured limit.
- Quotas are implemented in the kernel client from version 4.17.
Quotas are supported by the user space client (libcephfs, ceph-fuse). Linux kernel clients 4.17 and higher support CephFS quotas on SUSE Enterprise Storage 7.1 clusters. Kernel clients (even recent versions) will fail to handle quotas on older clusters, even if they are able to set the quotas extended attributes. SLE12-SP3 (and later) kernels already include the required backports to handle quotas.
- Configure quotas carefully when used with path-based mount restrictions.
The client needs to have access to the directory inode on which quotas are configured in order to enforce them. If the client has restricted access to a specific path (for example /home/user) based on the MDS capability, and a quota is configured on an ancestor directory they do not have access to (/home), the client will not enforce it. When using path-based access restrictions, be sure to configure the quota on the directory that the client can access (for example /home/user or /home/user/quota_dir).
23.6.2 Configuring CephFS quotas #
You can configure CephFS quotas by using virtual extended attributes:
ceph.quota.max_files
Configures a file limit.
ceph.quota.max_bytes
Configures a byte limit.
If the attributes appear on a directory inode, a quota is configured there. If they are not present then no quota is set on that directory (although one may still be configured on a parent directory).
To set a 100 MB quota, run:
cephuser@mds > setfattr -n ceph.quota.max_bytes -v 100000000 /SOME/DIRECTORY
To set a 10,000 files quota, run:
cephuser@mds > setfattr -n ceph.quota.max_files -v 10000 /SOME/DIRECTORY
To view the quota settings, run:
cephuser@mds > getfattr -n ceph.quota.max_bytes /SOME/DIRECTORY
cephuser@mds > getfattr -n ceph.quota.max_files /SOME/DIRECTORY
If the value of the extended attribute is '0', the quota is not set.
To remove a quota, run:
cephuser@mds > setfattr -n ceph.quota.max_bytes -v 0 /SOME/DIRECTORY
cephuser@mds > setfattr -n ceph.quota.max_files -v 0 /SOME/DIRECTORY
23.7 Managing CephFS snapshots #
CephFS snapshots create a read-only view of the file system at the point in time they are taken. You can create a snapshot in any directory. The snapshot will cover all data in the file system under the specified directory. After creating a snapshot, the buffered data is flushed out asynchronously from various clients. As a result, creating a snapshot is very fast.
If you have multiple CephFS file systems sharing a single pool (via name spaces), their snapshots will collide, and deleting one snapshot will result in missing file data for other snapshots sharing the same pool.
23.7.1 Creating snapshots #
The CephFS snapshot feature is enabled by default on new file systems. To enable it on existing file systems, run:
cephuser@adm > ceph fs set CEPHFS_NAME allow_new_snaps true
After you enable snapshots, all directories in the CephFS will have a special .snap subdirectory. This is a virtual subdirectory. It does not appear in the directory listing of the parent directory, but the name .snap cannot be used as a file or directory name. To access the .snap directory, one needs to explicitly access it, for example:
> ls -la /CEPHFS_MOUNT/.snap/
CephFS kernel clients have a limitation: they cannot handle more than 400 snapshots in a file system. The number of snapshots should always be kept below this limit, regardless of which client you are using. If using older CephFS clients, such as SLE12-SP3, keep in mind that going above 400 snapshots is harmful to operations as the client will crash.
You may configure a different name for the snapshots subdirectory by setting the client snapdir setting.
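For example, to have clients expose the virtual directory as .snapshots instead of .snap, the option could be set in the [client] section of the client's ceph.conf (a sketch; the name .snapshots is arbitrary):
[client]
client snapdir = .snapshots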
To create a snapshot, create a subdirectory under the .snap directory with a custom name. For example, to create a snapshot of the directory /CEPHFS_MOUNT/2/3/, run:
> mkdir /CEPHFS_MOUNT/2/3/.snap/CUSTOM_SNAPSHOT_NAME
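Existing snapshots of that directory can then be listed like any other subdirectory, for example:
> ls /CEPHFS_MOUNT/2/3/.snap/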
23.7.2 Deleting snapshots #
To delete a snapshot, remove its subdirectory inside the .snap directory:
> rmdir /CEPHFS_MOUNT/2/3/.snap/CUSTOM_SNAPSHOT_NAME