Applies to SUSE Enterprise Storage 7.1

13 Hints and tips

This chapter provides information to help you enhance the performance of your Ceph cluster, and tips on how to set the cluster up.

13.1 Identifying orphaned volumes

To identify possibly orphaned journal/WAL/DB volumes, follow these steps:

Get a list of OSD IDs for which LVs exist, but no OSD daemon is running on the node.

root@minion >  comm -3 <(ceph osd tree | grep up | awk '{print $4}') <(cephadm ceph-volume lvm list 2>/dev/null | grep ====== | awk '{print $2}')

If the command outputs one or more OSDs (for example osd.33), take the IDs (33 in the prior example) and zap the associated volumes.
```
root@minion >  ceph-volume lvm zap --destroy --osd-id ID
```
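
The comparison in the first step can also be done on two saved lists, which makes it easier to review the result before zapping anything. The following is a minimal sketch, not part of the product tooling: the function name orphaned_osds and the two input files are assumptions made for illustration. It prints every OSD name that appears in the LV list but not in the list of running OSDs.

```shell
# Hypothetical helper: list OSD names that have logical volumes on this node
# but do not appear among the running OSDs.
#   $1  file with one running OSD name per line (for example osd.1)
#   $2  file with one OSD name per line, taken from the LV listing
orphaned_osds() {
    # Lines of the first file are remembered as running; lines of the second
    # file are printed once each if they were not seen in the first file.
    awk 'FILENAME == ARGV[1] { running[$0] = 1; next }
         !($0 in running) && !seen[$0]++' "$1" "$2"
}
```

Unlike comm, this approach does not require the two lists to be sorted.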

13.2 Adjusting scrubbing

By default, Ceph performs light scrubbing daily (find more details in Section 17.6, “Scrubbing placement groups”) and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing compares an object’s content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.

The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.

If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as 11pm till 6am:

cephuser@adm > ceph config set osd osd_scrub_begin_hour 23
cephuser@adm > ceph config set osd osd_scrub_end_hour 6

If time restriction is not an effective method of determining a scrubbing schedule, consider using the osd_scrub_load_threshold option. The default value is 0.5, but it could be modified for low load conditions:

cephuser@adm > ceph config set osd osd_scrub_load_threshold 0.25
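
A scrubbing window such as 23 to 6 wraps past midnight, which is easy to get wrong when reasoning about it. The following sketch illustrates how such a window can be evaluated. The function name scrub_window is hypothetical, and two details are assumptions rather than statements from this guide: the end hour is treated as exclusive, and equal begin and end hours are treated as "no restriction".

```shell
# Hypothetical helper: report whether scrubbing may start at a given hour.
#   $1 current hour, $2 osd_scrub_begin_hour, $3 osd_scrub_end_hour (0-23)
# Assumption: the end hour is exclusive, and begin == end means no restriction.
scrub_window() {
    hour=$1 begin=$2 end=$3
    if [ "$begin" -eq "$end" ]; then
        echo allowed
    elif [ "$begin" -lt "$end" ]; then
        # Window lies within one day, for example 1 to 5.
        if [ "$hour" -ge "$begin" ] && [ "$hour" -lt "$end" ]; then
            echo allowed
        else
            echo blocked
        fi
    else
        # Window wraps past midnight, for example 23 to 6.
        if [ "$hour" -ge "$begin" ] || [ "$hour" -lt "$end" ]; then
            echo allowed
        else
            echo blocked
        fi
    fi
}
```

For example, scrub_window 2 23 6 prints allowed, while scrub_window 12 23 6 prints blocked.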

13.3 Stopping OSDs without rebalancing

You may need to stop OSDs for maintenance periodically. If you do not want CRUSH to automatically rebalance the cluster in order to avoid huge data transfers, set the cluster to noout first:

root@minion > ceph osd set noout

When the cluster is set to noout, you can begin stopping the OSDs within the failure domain that requires maintenance work. To identify the unique FSID of the cluster, run ceph fsid. To identify the OSD daemon name, run

cephuser@adm > ceph orch ps --hostname HOSTNAME

Then stop the daemon on its host:

# systemctl stop ceph-FSID@DAEMON_NAME

Find more information about operating Ceph services and identifying their names in Chapter 14, Operation of Ceph services.

After you complete the maintenance, start OSDs again:

# systemctl start ceph-FSID@osd.SERVICE_ID.service

After OSD services are started, unset the cluster from noout:

cephuser@adm > ceph osd unset noout
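
When several OSDs on a host need to be stopped, it can help to generate the command list first and review it before running anything. The following is a minimal sketch under that assumption. The function names osd_unit and print_stop_commands are hypothetical; the unit name pattern follows the systemctl start example above. The functions only print commands and do not run them.

```shell
# Hypothetical helper: build the systemd unit name for an OSD daemon.
#   $1 cluster FSID (as printed by `ceph fsid`), $2 numeric OSD ID
osd_unit() {
    printf 'ceph-%s@osd.%s.service\n' "$1" "$2"
}

# Hypothetical helper: print (not run) the commands that set noout and
# stop the given OSDs.
#   $1 cluster FSID, remaining arguments are OSD IDs
print_stop_commands() {
    fsid=$1
    shift
    echo "ceph osd set noout"
    for id in "$@"; do
        echo "systemctl stop $(osd_unit "$fsid" "$id")"
    done
}
```

For example, print_stop_commands FSID 3 7 prints the noout command followed by one systemctl stop line per OSD.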

13.4 Checking for unbalanced data writing

When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.

Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.

To avoid this, you should regularly check OSDs for the amount of data writing. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given ruleset, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written—you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:

cephuser@adm > ceph osd tree | grep osd.13
 13  hdd 3                osd.13  up  1.00000  1.00000

cephuser@adm > ceph osd crush reweight osd.13 3.05
 reweighted item id 13 name 'osd.13' to 3.05 in crush map

cephuser@adm > ceph osd tree | grep osd.13
 13  hdd 3.05                osd.13  up  1.00000  1.00000

Tip: OSD reweight by utilization

The ceph osd reweight-by-utilization threshold command automates the process of reducing the weight of OSDs which are heavily overused. By default it will adjust the weights downward on OSDs which reached 120% of the average usage, but if you include threshold it will use that percentage instead.
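
To see which OSDs such a threshold would affect before changing any weights, the same rule can be applied to a list of utilization figures. The following sketch is illustrative only: the function name overused_osds and the "NAME UTILIZATION" input format are assumptions, and the calculation is a plain comparison against the average, not a reproduction of what Ceph does internally.

```shell
# Hypothetical helper: print OSDs whose utilization exceeds a percentage of
# the average. Reads "NAME UTILIZATION" pairs from standard input.
#   $1 threshold in percent of the average (defaults to 120)
overused_osds() {
    awk -v threshold="${1:-120}" '
        NF >= 2 { name[++n] = $1; use[n] = $2 + 0; total += $2 }
        END {
            if (n == 0) exit
            # Utilization above this limit counts as overused.
            limit = (total / n) * threshold / 100
            for (i = 1; i <= n; i++)
                if (use[i] > limit) print name[i]
        }'
}
```

With three OSDs at 40, 40, and 70 percent utilization, the average is 50 and the default limit is 60, so only the third OSD is printed.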

13.5 Increasing file descriptors

For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.

To prevent OSDs from running out of file descriptors, you can override the default and specify an appropriate value, for example:

cephuser@adm > ceph config set global max_open_files 131072

After you change max_open_files, you need to restart the OSD service on the relevant Ceph node.

13.6 Integration with virtualization software

13.6.1 Storing KVM disks in Ceph cluster

You can create a disk image for a KVM-driven virtual machine, store it in a Ceph pool, optionally convert the content of an existing image to it, and then run the virtual machine with qemu-kvm making use of the disk image stored in the cluster. For more detailed information, see Chapter 27, Ceph as a back-end for QEMU KVM instance.

13.6.2 Storing `libvirt` disks in Ceph cluster

Similar to KVM (see Section 13.6.1, “Storing KVM disks in Ceph cluster”), you can use Ceph to store virtual machines driven by libvirt. The advantage is that you can run any libvirt-supported virtualization solution, such as KVM, Xen, or LXC. For more information, see Chapter 26, libvirt and Ceph.

13.6.3 Storing Xen disks in Ceph cluster

One way to use Ceph for storing Xen disks is to make use of libvirt as described in Chapter 26, libvirt and Ceph.

Another option is to make Xen talk to the rbd block device driver directly:

If you have no disk image prepared for Xen, create a new one:
```
cephuser@adm > rbd create myimage --size 8000 --pool mypool
```
List images in the pool mypool and check if your new image is there:
```
cephuser@adm > rbd list mypool
```
Create a new block device by mapping the myimage image to the rbd kernel module:
```
cephuser@adm > rbd map --pool mypool myimage
```
Tip: User name and authentication
To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:
```
cephuser@adm > rbd map --pool rbd myimage --id admin --keyring /path/to/keyring
```
or
```
cephuser@adm > rbd map --pool rbd myimage --id admin --keyfile /path/to/file
```

List all mapped devices:

cephuser@adm > rbd showmapped
 id pool   image   snap device
 0  mypool myimage -    /dev/rbd0

Now you can configure Xen to use this device as a disk for running a virtual machine. You can for example add the following line to the xl-style domain configuration file:
```
disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
```

13.7 Firewall settings for Ceph

We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting YaST › Security and Users › Firewall › Allowed Services.

Following is a list of Ceph-related services and numbers of the ports that they normally use:

Ceph Dashboard

The Ceph Dashboard binds to a specific TCP/IP address and TCP port. By default, the currently active Ceph Manager that hosts the dashboard binds to TCP port 8443 (or 8080 when SSL is disabled).

Ceph Monitor

Enable the Ceph MON service or ports 3300 and 6789 (TCP).

Ceph OSD or Metadata Server

Enable the Ceph OSD/MDS service or ports 6800-7300 (TCP).

This port range needs to be adjusted for dense nodes. See Section 7.4, “Reviewing final steps” for more information.

iSCSI Gateway

Open port 3260 (TCP).

Object Gateway

Open the port where Object Gateway communication occurs. To display it, run the following command:

cephuser@adm > ceph config get client.rgw.RGW_DAEMON_NAME rgw_frontends

Default is 80 for HTTP and 443 for HTTPS (TCP).

NFS Ganesha

By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP).

SSH

Open port 22 (TCP).

NTP

Open port 123 (UDP).

Salt

Open ports 4505 and 4506 (TCP).

Grafana

Open port 3000 (TCP).

Prometheus

Open port 9095 (TCP).
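
If you prefer to script the firewall configuration instead of using YaST, the default ports listed above can be collected in a small lookup function. The following is a sketch only. The function names and the role names (mon, osd, igw, and so on) are made up for illustration, and the use of firewall-cmd assumes that firewalld is the active firewall on the node. The Object Gateway entry uses the defaults mentioned above and needs adjusting if rgw_frontends specifies another port. The second function prints commands and does not run them.

```shell
# Hypothetical helper: print the default ports listed above for a role.
ceph_ports() {
    case "$1" in
        dashboard)  echo "8443/tcp" ;;
        mon)        printf '%s\n' "3300/tcp" "6789/tcp" ;;
        osd|mds)    echo "6800-7300/tcp" ;;
        igw)        echo "3260/tcp" ;;
        rgw)        printf '%s\n' "80/tcp" "443/tcp" ;;
        ganesha)    printf '%s\n' "2049/tcp" "875/tcp" ;;
        ssh)        echo "22/tcp" ;;
        ntp)        echo "123/udp" ;;
        salt)       printf '%s\n' "4505/tcp" "4506/tcp" ;;
        grafana)    echo "3000/tcp" ;;
        prometheus) echo "9095/tcp" ;;
        *)          echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (not run) firewalld commands for the given roles.
# Assumption: firewalld with its firewall-cmd front end is in use.
print_firewall_commands() {
    for role in "$@"; do
        for port in $(ceph_ports "$role"); do
            echo "firewall-cmd --permanent --add-port=$port"
        done
    done
}
```

For example, print_firewall_commands mon osd prints three firewall-cmd lines that you can review and then run as root, followed by firewall-cmd --reload to activate the permanent rules.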

13.8 Testing network performance

Both intermittent and complete network failures will impact a Ceph cluster. These two utilities can help in tracking down the cause and verifying expectations.

Tip: Sync Runners

Use the salt-run saltutil.sync_runners command if the Salt runner is reported as not available.

13.8.1 Performing basic diagnostics

Try the salt-run network.ping command to ping between cluster nodes to see whether an individual interface can reach a specific interface, and to measure the average response time. Any specific response time much slower than average will also be reported. For example:

root@master # salt-run network.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms

Or, for IPv6:

root@master # salt-run network.ping6
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms

Try validating all interfaces with jumbo frames enabled:

root@master # salt-run network.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms

13.8.2 Performing throughput benchmark

Try the salt-run network.iperf command to test network bandwidth between each pair of interfaces. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes will be used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:

root@master # salt-run network.iperf
Fastest 2 hosts:
    |_
      - 192.168.31.25
      - 11443 Mbits/sec
    |_
      - 172.16.31.25
      - 10363 Mbits/sec

Slowest 2 hosts:
    |_
      - 172.16.32.14
      - 10217 Mbits/sec
    |_
      - 192.168.121.164
      - 10113 Mbits/sec

13.8.3 Useful options

The output=full option will list the results of each interface rather than the summary of the two slowest and fastest.

root@master # salt-run network.iperf output=full
192.168.128.1:
    8644.0 Mbits/sec
192.168.128.2:
    10360.0 Mbits/sec
192.168.128.3:
    9336.0 Mbits/sec
192.168.128.4:
    9588.56 Mbits/sec
192.168.128.5:
    10187.0 Mbits/sec
192.168.128.6:
    10465.0 Mbits/sec

The remove=network option excludes subnets from the test, where network is a comma-delimited list of subnets that should not be included.

root@master # salt-run network.ping remove="192.168.121.0/24,192.168.1.0/24"
Succeeded: 20 addresses from 10 minions average rtt 14.16 ms
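
On larger clusters, the output=full listing can become long. The following sketch filters such a listing for interfaces below a minimum bandwidth. The function name slow_interfaces is hypothetical, and the parser assumes the two-line layout shown in the output=full example (an address followed by a colon, then an indented value in Mbits/sec).

```shell
# Hypothetical helper: list interfaces slower than a minimum bandwidth.
# Reads `salt-run network.iperf output=full` style text from standard input.
#   $1 minimum acceptable bandwidth in Mbits/sec
slow_interfaces() {
    awk -v min="$1" '
        # An address line ends with a colon; remember it without the colon.
        /:$/ { addr = $1; sub(/:$/, "", addr); next }
        # A result line looks like "8644.0 Mbits/sec".
        $2 == "Mbits/sec" && ($1 + 0) < (min + 0) { print addr, $1 }
    '
}
```

For example, piping the listing above into slow_interfaces 9000 prints only the line for 192.168.128.1.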

13.9 Locating physical disks using LED lights

Ceph tracks which daemons manage which hardware storage devices (HDDs, SSDs), and collects health metrics about those devices in order to provide tools to predict and automatically respond to hardware failure.

You can blink the drive LEDs on hardware enclosures to make the replacement of failed disks easy and less error-prone. Use the following command:

cephuser@adm > ceph device light --enable=on --devid=string --light_type=ident --force

The DEVID parameter is the device identification. You can obtain it by running the following command:

cephuser@adm > ceph device ls

13.10 Sending large objects with `rados` fails with full OSD

rados is a command line utility to manage RADOS object storage. For more information, see man 8 rados.

If you send a large object to a Ceph cluster with the rados utility, such as

cephuser@adm > rados -p mypool put myobject /file/to/send

it can fill up all the related OSD space and cause serious trouble to the cluster performance.

13.11 Managing the 'Too Many PGs per OSD' status message

If you receive a Too Many PGs per OSD message after running ceph status, it means that the mon_pg_warn_max_per_osd value (300 by default) was exceeded. This value is compared to the number of PGs per OSD ratio. This means that the cluster setup is not optimal.

The number of PGs cannot be reduced after the pool is created. Pools that do not yet contain any data can safely be deleted and then re-created with a lower number of PGs. Where pools already contain data, the only solution is to add OSDs to the cluster so that the ratio of PGs per OSD becomes lower.
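
To estimate the ratio before creating a pool or adding OSDs, you can calculate the average number of PGs per OSD by hand. The following sketch assumes replicated pools and estimates the ratio as the sum of pg_num multiplied by the replica count over all pools, divided by the number of OSDs. The function name pgs_per_osd and the input format are assumptions for illustration, and the result is an average, so individual OSDs may hold more or fewer PGs.

```shell
# Hypothetical helper: estimate the average number of PGs per OSD.
# Reads "PG_NUM REPLICA_SIZE" pairs (one pool per line) from standard input.
#   $1 number of OSDs in the cluster
pgs_per_osd() {
    awk -v osds="$1" '
        NF >= 2 { total += $1 * $2 }
        END {
            if (osds <= 0) {
                print "need at least one OSD" > "/dev/stderr"
                exit 1
            }
            # Truncate to a whole number of PGs.
            printf "%d\n", total / osds
        }'
}
```

For example, two pools with 1024 and 512 PGs and three replicas each give 4608 PG copies. On 12 OSDs this results in 384 PGs per OSD, which is above the limit of 300, while on 24 OSDs the result is 192.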

13.12 Managing the 'nn pg stuck inactive' status message

If you receive a stuck inactive status message after running ceph status, it means that Ceph does not know where to replicate the stored data to fulfill the replication rules. It can happen shortly after the initial Ceph setup and fix itself automatically. In other cases, this may require manual intervention, such as bringing up a broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing the replication level may help.

If the placement groups are stuck perpetually, you need to check the output of ceph osd tree. The output should look tree-structured, similar to the example in Section 4.6, “OSD is down”.

The output of ceph osd tree may instead be rather flat, as in the following example:

cephuser@adm > ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.02939  root default
-3         0.00980      host doc-ses-node2
 0    hdd  0.00980          osd.0              up   1.00000  1.00000
-5         0.00980      host doc-ses-node3
 1    hdd  0.00980          osd.1              up   1.00000  1.00000
-7         0.00980      host doc-ses-node4
 2    hdd  0.00980          osd.2              up   1.00000  1.00000

In that case, check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.

If the hierarchy is incorrect—for example the root contains hosts, but the OSDs are at the top level and are not themselves assigned to hosts—you will need to move the OSDs to the correct place in the hierarchy. This can be done using the ceph osd crush move and/or ceph osd crush set commands. For further details see Section 17.5, “CRUSH Map manipulation”.

13.13 Fixing clock skew warnings

As a general rule, time synchronization must be configured and running on all nodes. Once time synchronization is set up, the clocks should not get skewed. However, if a clock skew does occur, the likely cause is that chronyd.service is not running on one or more hosts.

Note

It is also possible that the battery on the motherboard has died, and the clock skew will be more pronounced. If this is the case, be aware that it will take quite some time for chronyd to re-synchronize the clocks.

If you receive a clock skew warning, confirm that the chronyd.service daemon is running on all cluster nodes. If not, restart the service and wait for chronyd to re-sync the clock.

Find more information on setting up time synchronization in https://documentation.suse.com/sles/15-SP3/html/SLES-all/cha-ntp.html#sec-ntp-yast.

13.14 Determining poor cluster performance caused by network problems

There may be other reasons why cluster performance becomes weak, such as network problems. In such a case, you may notice the cluster failing to reach quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.

To check whether cluster performance is degraded by network problems, inspect the Ceph log files under the /var/log/ceph directory.

To fix network issues on the cluster, focus on the following points:

Basic network diagnostics. Try running the network.ping diagnostics tool. This tool has cluster nodes send network pings from their network interfaces to the network interfaces of other cluster nodes, and measures the average response time. Any specific response time much slower than average will also be reported. See Section 13.8.1, “Performing basic diagnostics” for more information.
Check firewall settings on cluster nodes. Make sure they do not block ports or protocols required by Ceph operation. See Section 13.7, “Firewall settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.

Tip: Separate network

To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.

13.15 Managing `/var` running out of space

By default, the Salt Master saves every minion's result for every job in its job cache. The cache can then be used later to look up results from previous jobs. The cache directory defaults to /var/cache/salt/master/jobs/.

Each job return from every minion is saved in a single file. Over time this directory can grow very large, depending on the number of published jobs and the value of the keep_jobs option in the /etc/salt/master file. keep_jobs sets the number of hours (24 by default) to keep information about past minion jobs.

keep_jobs: 24

Important: Do not set keep_jobs: 0

Setting keep_jobs to '0' will cause the job cache cleaner to never run, possibly resulting in a full partition.

If you want to disable the job cache, set job_cache to 'False':

job_cache: False

Tip: Restoring partition full because of job cache

When the partition with job cache files gets full because of a wrong keep_jobs setting, follow these steps to free disk space and improve the job cache settings:

Stop the Salt Master service:

root@master # systemctl stop salt-master

Change the Salt Master configuration related to job cache by editing /etc/salt/master:
```
job_cache: False
keep_jobs: 1
```
Clear the Salt Master job cache:
```
# rm -rfv /var/cache/salt/master/jobs/*
```

Start the Salt Master service:

root@master # systemctl start salt-master
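
To notice a growing job cache before the partition fills up, you can check the number of cached files from time to time. The following is a minimal sketch. The function name job_cache_files is hypothetical, and the default path is the job cache directory named at the beginning of this section.

```shell
# Hypothetical helper: count the files stored below a job cache directory.
#   $1 directory to inspect (defaults to the Salt Master job cache)
job_cache_files() {
    # tr removes the padding that some wc implementations add.
    find "${1:-/var/cache/salt/master/jobs}" -type f | wc -l | tr -d ' '
}
```

Running job_cache_files without an argument on the Salt Master prints the current number of cached job files. A steadily rising number suggests that the keep_jobs value is too high for the job volume.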