13 Operational tasks #
13.1 Modifying the cluster configuration #
To modify the configuration of an existing Ceph cluster, follow these steps:
Export the current configuration of the cluster to a file:
cephuser@adm > ceph orch ls --export --format yaml > cluster.yaml
Edit the file with the configuration and update the relevant lines. Find specification examples in Chapter 8, Deploying the remaining core services using cephadm and Section 13.4.3, “Adding OSDs using DriveGroups specification”.
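The exported cluster.yaml contains one specification block per service. A hypothetical excerpt (the service_id and host names are placeholders, not part of the original file) after adding a new host to an Object Gateway placement might look like this:
service_type: rgw
service_id: default.rgw
placement:
  hosts:
  - ses-min2
  - ses-min3    # newly added host
Applying the edited file is declarative: services whose specifications are unchanged are left as they are.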
Apply the new configuration:
cephuser@adm > ceph orch apply -i cluster.yaml
13.2 Adding nodes #
To add a new node to a Ceph cluster, follow these steps:
Install SUSE Linux Enterprise Server and SUSE Enterprise Storage on the new host. Refer to Chapter 5, Installing and configuring SUSE Linux Enterprise Server for more information.
Configure the host as a Salt Minion of an already existing Salt Master. Refer to Chapter 6, Deploying Salt for more information.
Add the new host to ceph-salt and make cephadm aware of it, for example:
root@master # ceph-salt config /ceph_cluster/minions add ses-min5.example.com
root@master # ceph-salt config /ceph_cluster/roles/cephadm add ses-min5.example.com
Refer to Section 7.2.2, “Adding Salt Minions” for more information.
Verify that the node was added to ceph-salt:
root@master # ceph-salt config /ceph_cluster/minions ls
o- minions ................................................. [Minions: 5]
  [...]
  o- ses-min5.example.com .................................... [no roles]
Apply the configuration to the new cluster host:
root@master # ceph-salt apply ses-min5.example.com
Verify that the newly added host now belongs to the cephadm environment:
cephuser@adm > ceph orch host ls
HOST                  ADDR                  LABELS   STATUS
[...]
ses-min5.example.com  ses-min5.example.com
13.3 Removing nodes #
If the node that you are going to remove runs OSDs, remove the OSDs from it first and check that no OSDs are running on that node. Refer to Section 13.4.4, “Removing OSDs” for more details on removing OSDs.
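To double-check that no OSD daemons remain on the host, you can list them explicitly. A hedged example, assuming the host to be removed is ses-min2 (as in the examples below):
cephuser@adm > ceph orch ps ses-min2 --daemon_type osd
If this command returns no OSD daemons, it is safe to continue.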
To remove a node from a cluster, do the following:
For all Ceph service types except for node-exporter and crash, remove the node's host name from the cluster placement specification file (for example, cluster.yml). Refer to Section 8.2, “Service and placement specification” for more details. For example, if you are removing the host named ses-min2, remove all occurrences of - ses-min2 from all placement: sections. Update
service_type: rgw
service_id: EXAMPLE_NFS
placement:
  hosts:
  - ses-min2
  - ses-min3
to
service_type: rgw
service_id: EXAMPLE_NFS
placement:
  hosts:
  - ses-min3
Apply your changes to the configuration file:
cephuser@adm > ceph orch apply -i rgw-example.yaml
Remove the node from cephadm's environment:
cephuser@adm > ceph orch host rm ses-min2
If the node is running crash.osd.1 and crash.osd.2 services, remove them by running the following command on the host:
root@minion > cephadm rm-daemon --fsid CLUSTER_ID --name SERVICE_NAME
For example:
root@minion > cephadm rm-daemon --fsid b4b30c6e... --name crash.osd.1
root@minion > cephadm rm-daemon --fsid b4b30c6e... --name crash.osd.2
Remove all the roles from the minion you want to delete:
cephuser@adm > ceph-salt config /ceph_cluster/roles/tuned/throughput remove ses-min2
cephuser@adm > ceph-salt config /ceph_cluster/roles/tuned/latency remove ses-min2
cephuser@adm > ceph-salt config /ceph_cluster/roles/cephadm remove ses-min2
cephuser@adm > ceph-salt config /ceph_cluster/roles/admin remove ses-min2
If the minion you want to remove is the bootstrap minion, you also need to remove the bootstrap role:
cephuser@adm > ceph-salt config /ceph_cluster/roles/bootstrap reset
After removing all OSDs on a single host, remove the host from the CRUSH map:
cephuser@adm > ceph osd crush remove bucket-name
Note: The bucket name should be the same as the host name.
You can now remove the minion from the cluster:
cephuser@adm > ceph-salt config /ceph_cluster/minions remove ses-min2
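To confirm that the host is no longer part of the cephadm environment, list the hosts again:
cephuser@adm > ceph orch host ls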
If the removal fails because the minion you are trying to remove is permanently powered off, you will need to remove the node from the Salt Master:
root@master # salt-key -d minion_id
Then, manually remove the node from pillar_root/ceph-salt.sls. This file is typically located in /srv/pillar/ceph-salt.sls.
13.4 OSD management #
This section describes how to add, erase, or remove OSDs in a Ceph cluster.
13.4.1 Listing disk devices #
To identify used and unused disk devices on all cluster nodes, list them by running the following command:
cephuser@adm > ceph orch device ls
HOST PATH TYPE SIZE DEVICE AVAIL REJECT REASONS
ses-master /dev/vda hdd 42.0G False locked
ses-min1 /dev/vda hdd 42.0G False locked
ses-min1 /dev/vdb hdd 8192M 387836 False locked, LVM detected, Insufficient space (<5GB) on vgs
ses-min2 /dev/vdc hdd 8192M 450575 True
13.4.2 Erasing disk devices #
To re-use a disk device, you need to erase (or zap) it first:
ceph orch device zap HOST_NAME DISK_DEVICE
For example:
cephuser@adm > ceph orch device zap ses-min2 /dev/vdc
If you previously deployed OSDs by using DriveGroups or the
--all-available-devices
option while the
unmanaged
flag was not set, cephadm will deploy these
OSDs automatically after you erase them.
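If you want to keep cephadm from re-creating OSDs on the zapped device, one option (a sketch based on upstream cephadm behavior; verify that the flag is available in your release) is to set the all-available-devices OSD service to unmanaged before zapping:
cephuser@adm > ceph orch apply osd --all-available-devices --unmanaged=true
Alternatively, add unmanaged: true to the relevant DriveGroups specification as described in Section 13.4.3.1, “Unmanaged OSDs”.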
13.4.3 Adding OSDs using DriveGroups specification #
DriveGroups specify the layouts of OSDs in the Ceph
cluster. They are defined in a single YAML file. In this section, we will
use drive_groups.yml
as an example.
An administrator should manually specify a group of OSDs that are
interrelated (hybrid OSDs that are deployed on a mixture of HDDs and SSDs)
or share identical deployment options (for example, the same object store,
same encryption option, stand-alone OSDs). To avoid explicitly listing
devices, DriveGroups use a list of filter items that correspond to a few
selected fields of ceph-volume
's inventory reports.
cephadm will provide code that translates these DriveGroups into actual
device lists for inspection by the user.
The command to apply the OSD specification to the cluster is:
cephuser@adm > ceph orch apply osd -i drive_groups.yml
To see a preview of actions and test your application, you can use the --dry-run option together with the ceph orch apply osd command. For example:
cephuser@adm > ceph orch apply osd -i drive_groups.yml --dry-run
...
+---------+------+------+----------+----+-----+
|SERVICE  |NAME  |HOST  |DATA      |DB  |WAL  |
+---------+------+------+----------+----+-----+
|osd      |test  |mgr0  |/dev/sda  |-   |-    |
|osd      |test  |mgr0  |/dev/sdb  |-   |-    |
+---------+------+------+----------+----+-----+
If the --dry-run
output matches your expectations, then
simply re-run the command without the --dry-run
option.
13.4.3.1 Unmanaged OSDs #
All available clean disk devices that match the DriveGroups specification will be used as OSDs automatically after you add them to the cluster. This behavior is called managed mode.
To disable the managed mode, add the
unmanaged: true
line to the relevant specifications,
for example:
service_type: osd
service_id: example_drvgrp_name
placement:
  hosts:
  - ses-min2
  - ses-min3
encrypted: true
unmanaged: true
To change already deployed OSDs from the managed to
unmanaged mode, add the unmanaged:
true
lines where applicable during the procedure described in
Section 13.1, “Modifying the cluster configuration”.
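A possible workflow (a sketch; the file name osd-specs.yaml is an arbitrary placeholder) is to export only the OSD specifications, add the flag, and re-apply the file:
cephuser@adm > ceph orch ls osd --export > osd-specs.yaml
# edit osd-specs.yaml and add 'unmanaged: true' to the relevant specifications
cephuser@adm > ceph orch apply -i osd-specs.yaml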
13.4.3.2 DriveGroups specification #
Following is an example DriveGroups specification file:
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  drive_spec: DEVICE_SPECIFICATION
db_devices:
  drive_spec: DEVICE_SPECIFICATION
wal_devices:
  drive_spec: DEVICE_SPECIFICATION
block_wal_size: '5G'  # (optional, unit suffixes permitted)
block_db_size: '5G'   # (optional, unit suffixes permitted)
encrypted: true       # 'True' or 'False' (defaults to 'False')
The option previously called "encryption" in DeepSea has been renamed
to "encrypted". When applying DriveGroups in SUSE Enterprise Storage 7, ensure you
use this new terminology in your service specification, otherwise the
ceph orch apply
operation will fail.
13.4.3.3 Matching disk devices #
You can describe the specification using the following filters:
By a disk model:
model: DISK_MODEL_STRING
By a disk vendor:
vendor: DISK_VENDOR_STRING
Tip: Always enter the DISK_VENDOR_STRING in lowercase.
To obtain details about disk model and vendor, examine the output of the following command:
cephuser@adm > ceph orch device ls
HOST      PATH      TYPE  SIZE   DEVICE_ID                  MODEL             VENDOR
ses-min1  /dev/sdb  ssd   29.8G  SATA_SSD_AF34075704240015  SATA SSD          ATA
ses-min2  /dev/sda  ssd   223G   Micron_5200_MTFDDAK240TDN  Micron_5200_MTFD  ATA
[...]
Whether a disk is rotational or not. SSDs and NVMe drives are not rotational:
rotational: 0
Deploy a node using all available drives for OSDs:
data_devices: all: true
Additionally, by limiting the number of matching disks:
limit: 10
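Filters can also be combined. For example, a hypothetical data_devices section that selects at most ten disks of a particular model could look like this:
data_devices:
  model: MC-55-44-XZ
  limit: 10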
13.4.3.4 Filtering devices by size #
You can filter disk devices by their size—either by an exact size,
or a size range. The size:
parameter accepts arguments in
the following form:
'10G' - Includes disks of an exact size.
'10G:40G' - Includes disks whose size is within the range.
':10G' - Includes disks less than or equal to 10 GB in size.
'40G:' - Includes disks equal to or greater than 40 GB in size.
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  size: '40TB:'
db_devices:
  size: ':2TB'
When using the ':' delimiter, you need to enclose the size in quotes, otherwise the ':' sign will be interpreted as a new configuration hash.
Instead of gigabytes (G), you can specify the sizes in megabytes (M) or terabytes (T).
13.4.3.5 DriveGroups examples #
This section includes examples of different OSD setups.
This example describes two nodes with the same setup:
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4 TB
2 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512 GB
The corresponding drive_groups.yml
file will be as
follows:
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  model: SSD-123-foo
db_devices:
  model: MC-55-44-XZ
Such a configuration is simple and valid. The problem is that an administrator may add disks from different vendors in the future, and these will not be included. You can improve it by reducing the filters on core properties of the drives:
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
In the previous example, all rotating devices are declared as 'data devices', and all non-rotating devices are used as 'shared devices' (wal, db).
If you know that drives with more than 2 TB will always be the slower data devices, you can filter by size:
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  size: '2TB:'
db_devices:
  size: ':2TB'
This example describes two distinct setups: 20 HDDs should share 2 SSDs, while 10 SSDs should share 2 NVMes.
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4 TB
12 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512 GB
2 NVMes
Vendor: Samsung
Model: NVME-QQQQ-987
Size: 256 GB
Such a setup can be defined with two layouts as follows:
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  rotational: 0
db_devices:
  model: MC-55-44-XZ
service_type: osd
service_id: example_drvgrp_name2
placement:
  host_pattern: '*'
data_devices:
  model: MC-55-44-XZ
db_devices:
  vendor: samsung
  size: 256GB
The previous examples assumed that all nodes have the same drives. However, that is not always the case:
Nodes 1-5:
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4 TB
2 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512 GB
Nodes 6-10:
5 NVMes
Vendor: Intel
Model: SSD-123-foo
Size: 4 TB
20 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512 GB
You can use the host_pattern key in the placement section to target specific nodes. Salt target notation helps to keep things simple:
service_type: osd
service_id: example_drvgrp_one2five
placement:
  host_pattern: 'node[1-5]'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
followed by
service_type: osd
service_id: example_drvgrp_rest
placement:
  host_pattern: 'node[6-10]'
data_devices:
  model: MC-55-44-XZ
db_devices:
  model: SSD-123-foo
All previous cases assumed that the WALs and DBs use the same device. It is however possible to deploy the WAL on a dedicated device as well:
20 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4 TB
2 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512 GB
2 NVMes
Vendor: Samsung
Model: NVME-QQQQ-987
Size: 256 GB
service_type: osd
service_id: example_drvgrp_name
placement:
  host_pattern: '*'
data_devices:
  model: MC-55-44-XZ
db_devices:
  model: SSD-123-foo
wal_devices:
  model: NVME-QQQQ-987
In the following setup, we are trying to define:
20 HDDs backed by 1 NVMe
2 HDDs backed by 1 SSD (db) and 1 NVMe (wal)
8 SSDs backed by 1 NVMe
2 SSDs stand-alone (encrypted)
1 HDD is spare and should not be deployed
The summary of used drives is as follows:
23 HDDs
Vendor: Intel
Model: SSD-123-foo
Size: 4 TB
10 SSDs
Vendor: Micron
Model: MC-55-44-ZX
Size: 512 GB
1 NVMe
Vendor: Samsung
Model: NVME-QQQQ-987
Size: 256 GB
The DriveGroups definition will be the following:
service_type: osd
service_id: example_drvgrp_hdd_nvme
placement:
  host_pattern: '*'
data_devices:
  rotational: 0
db_devices:
  model: NVME-QQQQ-987

service_type: osd
service_id: example_drvgrp_hdd_ssd_nvme
placement:
  host_pattern: '*'
data_devices:
  rotational: 0
db_devices:
  model: MC-55-44-XZ
wal_devices:
  model: NVME-QQQQ-987

service_type: osd
service_id: example_drvgrp_ssd_nvme
placement:
  host_pattern: '*'
data_devices:
  model: SSD-123-foo
db_devices:
  model: NVME-QQQQ-987

service_type: osd
service_id: example_drvgrp_standalone_encrypted
placement:
  host_pattern: '*'
data_devices:
  model: SSD-123-foo
encrypted: True
One HDD will remain unused, because the file is parsed from top to bottom.
13.4.4 Removing OSDs #
Before removing an OSD node from the cluster, verify that the cluster has more free disk space than the OSD disk you are going to remove. Be aware that removing an OSD results in rebalancing of the whole cluster.
Identify which OSD to remove by getting its ID:
cephuser@adm > ceph orch ps --daemon_type osd
NAME   HOST            STATUS        REFRESHED  AGE  VERSION
osd.0  target-ses-090  running (3h)  7m ago     3h   15.2.7.689 ...
osd.1  target-ses-090  running (3h)  7m ago     3h   15.2.7.689 ...
osd.2  target-ses-090  running (3h)  7m ago     3h   15.2.7.689 ...
osd.3  target-ses-090  running (3h)  7m ago     3h   15.2.7.689 ...
Remove one or more OSDs from the cluster:
cephuser@adm > ceph orch osd rm OSD1_ID OSD2_ID ...
For example:
cephuser@adm > ceph orch osd rm 1 2
You can query the state of the removal operation:
cephuser@adm > ceph orch osd rm status
OSD_ID  HOST         STATE                    PG_COUNT  REPLACE  FORCE  STARTED_AT
2       cephadm-dev  done, waiting for purge  0         True     False  2020-07-17 13:01:43.147684
3       cephadm-dev  draining                 17        False    True   2020-07-17 13:01:45.162158
4       cephadm-dev  started                  42        False    True   2020-07-17 13:01:45.162158
13.4.4.1 Stopping OSD removal #
After you have scheduled an OSD removal, you can stop the removal if needed. The following command will reset the initial state of the OSD and remove it from the queue:
cephuser@adm > ceph orch osd rm stop OSD_SERVICE_ID
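For example, to cancel the scheduled removal of the OSD with ID 4:
cephuser@adm > ceph orch osd rm stop 4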
13.4.5 Replacing OSDs #
There are several reasons why you may need to replace an OSD disk. For example:
The OSD disk failed or is soon going to fail based on SMART information, and can no longer be used to store data safely.
You need to upgrade the OSD disk, for example to increase its size.
You need to change the OSD disk layout.
You plan to move from a non-LVM to an LVM-based layout.
To replace an OSD while preserving its ID, run:
cephuser@adm > ceph orch osd rm OSD_SERVICE_ID --replace
For example:
cephuser@adm > ceph orch osd rm 4 --replace
Replacing an OSD is identical to removing an OSD (see
Section 13.4.4, “Removing OSDs” for more details) with the exception
that the OSD is not permanently removed from the CRUSH hierarchy and is
assigned a destroyed
flag instead.
The destroyed
flag is used to determine OSD IDs that
will be reused during the next OSD deployment. Newly added disks that match
the DriveGroups specification (see Section 13.4.3, “Adding OSDs using DriveGroups specification” for more
details) will be assigned OSD IDs of their replaced counterpart.
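To check which OSD IDs currently carry the destroyed flag, you can filter the CRUSH tree by state (a hedged example; the state filter is available in recent Ceph releases):
cephuser@adm > ceph osd tree destroyed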
Appending the --dry-run
option will not execute the
actual replacement, but will preview the steps that would normally happen.
In the case of replacing an OSD after a failure, we highly recommend triggering a deep scrub of the placement groups. See Section 17.6, “Scrubbing placement groups” for more details.
Run the following command to initiate a deep scrub:
cephuser@adm > ceph osd deep-scrub osd.OSD_NUMBER
If a shared device for DB/WAL fails, you will need to perform the replacement procedure for all OSDs that share the failed device.
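To identify which OSDs share a failed DB/WAL device, you can inspect the LVM metadata on the affected host (a hedged sketch; run it on the OSD host itself):
root@minion > cephadm ceph-volume lvm list
The output lists each OSD together with its data, DB, and WAL devices.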
13.5 Moving the Salt Master to a new node #
If you need to replace the Salt Master host with a new one, follow these steps:
Export the cluster configuration and back up the exported JSON file. Find more details in Section 7.2.14, “Exporting cluster configurations”.
If the old Salt Master is also the only administration node in the cluster, then manually move /etc/ceph/ceph.client.admin.keyring and /etc/ceph/ceph.conf to the new Salt Master.
Stop and disable the Salt Master systemd service on the old Salt Master node:
root@master # systemctl stop salt-master.service
root@master # systemctl disable salt-master.service
If the old Salt Master node is no longer in the cluster, also stop and disable the Salt Minion systemd service:
root@master # systemctl stop salt-minion.service
root@master # systemctl disable salt-minion.service
Warning: Do not stop or disable the salt-minion.service if the old Salt Master node has any Ceph daemons (MON, MGR, OSD, MDS, gateway, monitoring) running on it.
Install SUSE Linux Enterprise Server 15 SP2 on the new Salt Master following the procedure described in Chapter 5, Installing and configuring SUSE Linux Enterprise Server.
Tip: Transition of Salt Minions
To simplify the transition of Salt Minions to the new Salt Master, remove the original Salt Master's public key from each of them:
root@minion > rm /etc/salt/pki/minion/minion_master.pub
root@minion > systemctl restart salt-minion.service
Install the salt-master package and, if applicable, the salt-minion package on the new Salt Master.
Install ceph-salt on the new Salt Master node:
root@master # zypper install ceph-salt
root@master # systemctl restart salt-master.service
root@master # salt '*' saltutil.sync_all
Important: Make sure to run all three commands before continuing. The commands are idempotent; it does not matter if they get repeated.
Include the new Salt Master in the cluster as described in Section 7.1, “Installing ceph-salt”, Section 7.2.2, “Adding Salt Minions” and Section 7.2.4, “Specifying Admin Node”.
Import the backed-up cluster configuration and apply it:
root@master # ceph-salt import CLUSTER_CONFIG.json
root@master # ceph-salt apply
Important: Rename the Salt Master's minion id in the exported CLUSTER_CONFIG.json file before importing it.
13.6 Updating the cluster nodes #
Keep the Ceph cluster nodes up-to-date by applying rolling updates regularly.
13.6.1 Software repositories #
Before patching the cluster with the latest software packages, verify that all the cluster's nodes have access to the relevant repositories.
13.6.2 Repository staging #
If you use a staging tool—for example, SUSE Manager, Subscription Management Tool, or RMT—that serves software repositories to the cluster nodes, verify that stages for both 'Updates' repositories for SUSE Linux Enterprise Server and SUSE Enterprise Storage are created at the same point in time.
We strongly recommend using a staging tool to apply patches which have
frozen
or staged
patch levels. This
ensures that new nodes joining the cluster have the same patch level as the
nodes already running in the cluster. This way you avoid the need to apply
the latest patches to all the cluster's nodes before new nodes can join the
cluster.
13.6.3 Downtime of Ceph services #
Depending on the configuration, cluster nodes may be rebooted during the update. If there is a single point of failure for services such as Object Gateway, Samba Gateway, NFS Ganesha, or iSCSI, the client machines may be temporarily disconnected from services whose nodes are being rebooted.
13.6.4 Running the update #
To update the software packages on all cluster nodes to the latest version, run the following command:
root@master # ceph-salt update
13.7 Updating Ceph #
You can instruct cephadm to update Ceph from one bugfix release to another. The automated update of Ceph services respects the recommended order—it starts with Ceph Managers, Ceph Monitors, and then continues on to other services such as Ceph OSDs, Metadata Servers, and Object Gateways. Each daemon is restarted only after Ceph indicates that the cluster will remain available.
The following update procedure uses the ceph orch upgrade command. Keep in mind that the following instructions detail how to update your Ceph cluster within a product version (for example, a maintenance update), and do not provide instructions on how to upgrade your cluster from one product version to another.
13.7.1 Starting the update #
Before you start the update, verify that all nodes are currently online and your cluster is healthy:
cephuser@adm > cephadm shell -- ceph -s
To update to a specific Ceph release:
cephuser@adm > ceph orch upgrade start --image REGISTRY_URL
For example:
cephuser@adm > ceph orch upgrade start --image registry.suse.com/ses/7/ceph/ceph:latest
Upgrade packages on the hosts:
cephuser@adm > ceph-salt update
13.7.2 Monitoring the update #
Run the following command to determine whether an update is in progress:
cephuser@adm > ceph orch upgrade status
While the update is in progress, you will see a progress bar in the Ceph status output:
cephuser@adm > ceph -s
[...]
progress:
Upgrade to registry.suse.com/ses/7/ceph/ceph:latest (00h 20m 12s)
[=======.....................] (time remaining: 01h 43m 31s)
You can also watch the cephadm log:
cephuser@adm > ceph -W cephadm
13.7.3 Cancelling an update #
You can stop the update process at any time:
cephuser@adm > ceph orch upgrade stop
13.8 Halting or rebooting cluster #
In some cases it may be necessary to halt or reboot the whole cluster. We recommend carefully checking for dependencies of running services. The following steps provide an outline for stopping and starting the cluster:
Tell the Ceph cluster not to mark OSDs as out:
cephuser@adm > ceph osd set noout
Stop daemons and nodes in the following order:
Storage clients
Gateways, for example NFS Ganesha or Object Gateway
Metadata Server
Ceph OSD
Ceph Manager
Ceph Monitor
If required, perform maintenance tasks.
Start the nodes and servers in the reverse order of the shutdown process:
Ceph Monitor
Ceph Manager
Ceph OSD
Metadata Server
Gateways, for example NFS Ganesha or Object Gateway
Storage clients
Remove the noout flag:
cephuser@adm > ceph osd unset noout
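Note that the ceph CLI requires running Ceph Monitors, so the daemons themselves are stopped and started through systemd on each node. In a cephadm deployment, all Ceph daemons on a host are grouped under a per-cluster systemd target; the following is a minimal sketch, assuming CLUSTER_ID is a placeholder for the cluster FSID:
root@minion > systemctl stop ceph-CLUSTER_ID.target    # stops all Ceph daemons on this node
root@minion > systemctl start ceph-CLUSTER_ID.target   # starts them again after maintenance
Run the stop command on the nodes hosting the respective roles in the order listed above, and start them again in the reverse order.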
13.9 Removing an entire Ceph cluster #
The ceph-salt purge
command removes the entire Ceph
cluster. If multiple Ceph clusters are deployed, the one reported by
ceph -s
is purged. This way you can clean the cluster
environment when testing different setups.
To prevent accidental deletion, the orchestration checks if the safety is disengaged. You can disengage the safety measures and remove the Ceph cluster by running:
root@master # ceph-salt disengage-safety
root@master # ceph-salt purge