13 Hints and tips #
This chapter provides information to help you enhance the performance of your Ceph cluster, and tips on how to set the cluster up.
13.1 Identifying orphaned volumes #
To identify possibly orphaned journal/WAL/DB volumes, follow these steps:
Get a list of OSD IDs for which LVs exist, but no OSD daemon is running on the node.
root@minion >
comm -3 <(ceph osd tree | grep up | awk '{print $4}') <(cephadm ceph-volume lvm list 2>/dev/null | grep ====== | awk '{print $2}')
If the command outputs one or more OSDs (for example osd.33), take the IDs (33 in the prior example) and zap the associated volumes.
root@minion >
ceph-volume lvm zap --destroy --osd-id ID
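The comparison in the first step is a set difference: OSD names that appear in the ceph-volume listing of the node but not among the OSDs reported as up. Because comm expects both inputs to be sorted, the same check can also be sketched without it. The function name orphaned_osds is hypothetical, and the data in the example call is made up.

```shell
#!/bin/sh
# Hypothetical helper: print every OSD name that has LVs on this node
# (second argument) but is not reported as up (first argument).
# Both arguments are newline-separated lists such as "osd.1\nosd.2".
orphaned_osds() {
    running=$1
    listed=$2
    printf '%s\n' "$listed" | while IFS= read -r name; do
        # Skip empty lines in the listing.
        [ -n "$name" ] || continue
        # -x matches whole lines, -F treats the name as a fixed string.
        if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example with made-up data: osd.33 has volumes but no running daemon.
orphaned_osds "$(printf 'osd.1\nosd.2')" "$(printf 'osd.1\nosd.2\nosd.33')"

# On a node, the two lists could come from the pipelines shown above:
# orphaned_osds "$(ceph osd tree | grep up | awk '{print $4}')" \
#     "$(cephadm ceph-volume lvm list 2>/dev/null | grep ====== | awk '{print $2}')"
```

Review the printed names before zapping anything, because the zap command destroys the volumes.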
13.2 Adjusting scrubbing #
By default, Ceph performs light scrubbing daily (find more details in Section 17.6, “Scrubbing placement groups”) and deep scrubbing weekly. Light scrubbing checks object sizes and checksums to ensure that placement groups are storing the same object data. Deep scrubbing compares an object’s content with that of its replicas to ensure that the actual contents are the same. The price for checking data integrity is increased I/O load on the cluster during the scrubbing procedure.
The default settings allow Ceph OSDs to initiate scrubbing at inappropriate times, such as during periods of heavy loads. Customers may experience latency and poor performance when scrubbing operations conflict with their operations. Ceph provides several scrubbing settings that can limit scrubbing to periods with lower loads or during off-peak hours.
If the cluster experiences high loads during the day and low loads late at night, consider restricting scrubbing to night time hours, such as from 11 PM to 6 AM:
cephuser@adm >
ceph config set osd osd_scrub_begin_hour 23
cephuser@adm >
ceph config set osd osd_scrub_end_hour 6
If time restriction is not an effective method of determining a scrubbing schedule, consider using the osd_scrub_load_threshold option. The default value is 0.5, but it could be modified for low load conditions:
cephuser@adm >
ceph config set osd osd_scrub_load_threshold 0.25
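A window such as 23 to 6 wraps around midnight, which is easy to misread. The following sketch shows one way to reason about which hours fall inside such a window. It assumes that the begin hour is inclusive, the end hour is exclusive, and that equal values mean scrubbing is allowed all day. The function name in_scrub_window is hypothetical and not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helper: succeed if HOUR (0-23) falls inside the window
# defined by BEGIN and END. Assumes BEGIN is inclusive, END is exclusive,
# and BEGIN equal to END means that scrubbing is allowed all day.
in_scrub_window() {
    hour=$1 begin=$2 end=$3
    if [ "$begin" -eq "$end" ]; then
        return 0
    elif [ "$begin" -lt "$end" ]; then
        # Window within a single day, for example 1 to 5.
        [ "$hour" -ge "$begin" ] && [ "$hour" -lt "$end" ]
    else
        # Window wraps around midnight, for example 23 to 6.
        [ "$hour" -ge "$begin" ] || [ "$hour" -lt "$end" ]
    fi
}

# Check a few hours against the 23 to 6 window configured above.
for h in 22 23 2 6; do
    if in_scrub_window "$h" 23 6; then
        echo "hour $h: scrubbing allowed"
    else
        echo "hour $h: scrubbing not allowed"
    fi
done
```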
13.3 Stopping OSDs without rebalancing #
You may need to stop OSDs for maintenance periodically. If you do not want CRUSH to automatically rebalance the cluster in order to avoid huge data transfers, set the cluster to noout first:
root@minion >
ceph osd set noout
When the cluster is set to noout, you can begin stopping the OSDs within the failure domain that requires maintenance work. To identify the unique FSID of the cluster, run ceph fsid. To identify the OSD daemon name, run:
cephuser@adm >
ceph orch ps --hostname HOSTNAME
Then stop the daemon:
#
systemctl stop ceph-FSID@DAEMON_NAME
Find more information about operating Ceph services and identifying their names in Chapter 14, Operation of Ceph services.
After you complete the maintenance, start OSDs again:
#
systemctl start ceph-FSID@osd.SERVICE_ID.service
After OSD services are started, unset the cluster from noout:
cephuser@adm >
ceph osd unset noout
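The order of the steps matters: noout is set before any OSD is stopped and unset only after all OSDs are running again. The following dry-run sketch prints the commands in that order for a given FSID and a list of OSD IDs without executing them. The function name maintenance_plan and the sample values are hypothetical. Note that the ceph commands run on the admin node, while the systemctl commands run on the node hosting the OSDs.

```shell
#!/bin/sh
# Hypothetical helper: print, but do not run, the maintenance commands
# for the given cluster FSID and OSD IDs in the order described above.
maintenance_plan() {
    fsid=$1
    shift
    echo "ceph osd set noout"
    for id in "$@"; do
        echo "systemctl stop ceph-${fsid}@osd.${id}.service"
    done
    echo "# ... perform the maintenance work ..."
    for id in "$@"; do
        echo "systemctl start ceph-${fsid}@osd.${id}.service"
    done
    echo "ceph osd unset noout"
}

# "FSID" stands for the value reported by "ceph fsid"; 3 and 7 are
# made-up OSD IDs.
maintenance_plan "FSID" 3 7
```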
13.4 Checking for unbalanced data writing #
When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned a weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.
Unbalanced clusters have poor performance, and if an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.
To avoid this, you should regularly check how much data is written to each OSD. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given ruleset, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written: you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:
cephuser@adm >
ceph osd tree | grep osd.13
 13 hdd 3 osd.13 up 1.00000 1.00000
cephuser@adm >
ceph osd crush reweight osd.13 3.05
reweighted item id 13 name 'osd.13' to 3.05 in crush map
cephuser@adm >
ceph osd tree | grep osd.13
 13 hdd 3.05 osd.13 up 1.00000 1.00000
The ceph osd reweight-by-utilization threshold command automates the process of reducing the weight of OSDs which are heavily overused. By default it will adjust the weights downward on OSDs which reached 120% of the average usage, but if you include threshold it will use that percentage instead.
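To illustrate the threshold arithmetic only, the following sketch reports which OSDs exceed a given percentage of the average utilization. It is not a reimplementation of the Ceph command, which may apply further limits when it adjusts weights. The function name overused_osds and the utilization figures are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "NAME UTILIZATION" pairs from standard input
# and print the names whose utilization exceeds THRESHOLD percent of the
# average utilization (120 if no threshold is given).
overused_osds() {
    awk -v threshold="${1:-120}" '
        { name[NR] = $1; util[NR] = $2; sum += $2 }
        END {
            if (NR == 0) exit
            limit = (sum / NR) * threshold / 100
            for (i = 1; i <= NR; i++)
                if (util[i] > limit) print name[i]
        }
    '
}

# Made-up utilization figures in percent. The average is 60, so the
# default limit is 72 and only osd.3 is reported.
printf 'osd.1 40\nosd.2 50\nosd.3 90\n' | overused_osds
```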
13.5 Increasing file descriptors #
For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.
To prevent OSDs from running out of file descriptors, you can override the default and specify an appropriate value, for example:
cephuser@adm >
ceph config set global max_open_files 131072
After you change max_open_files, you need to restart the OSD service on the relevant Ceph node.
13.6 Integration with virtualization software #
13.6.1 Storing KVM disks in Ceph cluster #
You can create a disk image for a KVM-driven virtual machine, store it in a Ceph pool, optionally convert the content of an existing image to it, and then run the virtual machine with qemu-kvm making use of the disk image stored in the cluster. For more detailed information, see Chapter 27, Ceph as a back-end for QEMU KVM instance.
13.6.2 Storing libvirt disks in Ceph cluster #
Similar to KVM (see Section 13.6.1, “Storing KVM disks in Ceph cluster”), you can use Ceph to store virtual machines driven by libvirt. The advantage is that you can run any libvirt-supported virtualization solution, such as KVM, Xen, or LXC. For more information, see Chapter 26, libvirt and Ceph.
13.6.3 Storing Xen disks in Ceph cluster #
One way to use Ceph for storing Xen disks is to make use of libvirt as described in Chapter 26, libvirt and Ceph.
Another option is to make Xen talk to the rbd block device driver directly:
If you have no disk image prepared for Xen, create a new one:
cephuser@adm >
rbd create myimage --size 8000 --pool mypool
List images in the pool mypool and check if your new image is there:
cephuser@adm >
rbd list mypool
Create a new block device by mapping the myimage image to the rbd kernel module:
cephuser@adm >
rbd map --pool mypool myimage
Tip: User name and authentication
To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:
cephuser@adm >
rbd map --pool rbd myimage --id admin --keyring /path/to/keyring
or
cephuser@adm >
rbd map --pool rbd myimage --id admin --keyfile /path/to/file
List all mapped devices:
cephuser@adm >
rbd showmapped
id pool   image   snap device
0  mypool myimage -    /dev/rbd0
Now you can configure Xen to use this device as a disk for running a virtual machine. You can for example add the following line to the xl-style domain configuration file:
disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]
13.7 Firewall settings for Ceph #
We recommend protecting the network cluster communication with SUSE Firewall.
The following is a list of Ceph-related services and the port numbers that they normally use:
- Ceph Dashboard
The Ceph Dashboard binds to a specific TCP/IP address and TCP port. By default, the currently active Ceph Manager that hosts the dashboard binds to TCP port 8443 (or 8080 when SSL is disabled).
- Ceph Monitor
Enable the corresponding firewall service or open ports 3300 and 6789 (TCP).
- Ceph OSD or Metadata Server
Enable the corresponding firewall service or open ports 6800-7300 (TCP).
This port range needs to be adjusted for dense nodes. See Section 7.4, “Reviewing final steps” for more information.
- iSCSI Gateway
Open port 3260 (TCP).
- Object Gateway
Open the port where Object Gateway communication occurs. To display it, run the following command:
cephuser@adm >
ceph config get client.rgw.RGW_DAEMON_NAME rgw_frontends
Default is 80 for HTTP and 443 for HTTPS (TCP).
- NFS Ganesha
By default, NFS Ganesha uses ports 2049 (NFS service, TCP) and 875 (rquota support, TCP).
- SSH
Open port 22 (TCP).
- NTP
Open port 123 (UDP).
- Salt
Open ports 4505 and 4506 (TCP).
- Grafana
Open port 3000 (TCP).
- Prometheus
Open port 9095 (TCP).
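The fixed default ports from the list above can be captured in a small lookup, for example to generate firewall rules per node role. The following is a minimal sketch. The function name ceph_ports and the role names are hypothetical, the Object Gateway is left out because its port depends on the rgw_frontends setting, and the dry-run loop assumes that firewalld with its firewall-cmd tool is the firewall in use.

```shell
#!/bin/sh
# Hypothetical helper: print the default ports from the list above for a
# given role, in PORT/PROTOCOL form. Role names are made up for this sketch.
ceph_ports() {
    case $1 in
        dashboard)  echo "8443/tcp" ;;
        monitor)    echo "3300/tcp 6789/tcp" ;;
        osd|mds)    echo "6800-7300/tcp" ;;
        iscsi)      echo "3260/tcp" ;;
        nfs)        echo "2049/tcp 875/tcp" ;;
        ssh)        echo "22/tcp" ;;
        ntp)        echo "123/udp" ;;
        salt)       echo "4505/tcp 4506/tcp" ;;
        grafana)    echo "3000/tcp" ;;
        prometheus) echo "9095/tcp" ;;
        *)          echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Dry run: print firewall-cmd calls for a monitor node instead of
# executing them. Assumes firewalld is the firewall in use.
for port in $(ceph_ports monitor); do
    echo "firewall-cmd --permanent --add-port=$port"
done
```

Printing the commands first makes it possible to review them before changing the firewall configuration of a node.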
13.8 Testing network performance #
Both intermittent and complete network failures will impact a Ceph cluster. These two utilities can help in tracking down the cause and verifying expectations.
Use the salt-run saltutil.sync_runners command if the Salt runner is reported as not available.
13.8.1 Performing basic diagnostics #
Try the salt-run network.ping command to ping between cluster nodes to see if an individual interface can reach a specific interface, and to measure the average response time. Any specific response time much slower than average will also be reported. For example:
root@master #
salt-run network.ping
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Or, for IPv6:
root@master #
salt-run network.ping6
Succeeded: 8 addresses from 7 minions average rtt 0.15 ms
Try validating all interfaces with JumboFrame enabled:
root@master #
salt-run network.jumbo_ping
Succeeded: 8 addresses from 7 minions average rtt 0.26 ms
13.8.2 Performing throughput benchmark #
Try the salt-run network.iperf command to test network bandwidth between each pair of interfaces. On a given cluster node, a number of iperf processes (according to the number of CPU cores) are started as servers. The remaining cluster nodes will be used as clients to generate network traffic. The accumulated bandwidth of all per-node iperf processes is reported. This should reflect the maximum achievable network throughput on all cluster nodes. For example:
root@master #
salt-run network.iperf
Fastest 2 hosts:
|_
- 192.168.31.25
- 11443 Mbits/sec
|_
- 172.16.31.25
- 10363 Mbits/sec
Slowest 2 hosts:
|_
- 172.16.32.14
- 10217 Mbits/sec
|_
- 192.168.121.164
- 10113 Mbits/sec
13.8.3 Useful options #
The output=full option will list the results of each interface rather than the summary of the two slowest and fastest.
root@master #
salt-run network.iperf output=full
192.168.128.1:
8644.0 Mbits/sec
192.168.128.2:
10360.0 Mbits/sec
192.168.128.3:
9336.0 Mbits/sec
192.168.128.4:
9588.56 Mbits/sec
192.168.128.5:
10187.0 Mbits/sec
192.168.128.6:
10465.0 Mbits/sec
The remove=network option, where network is a comma-delimited list of subnets, excludes those subnets from the test.
root@master #
salt-run network.ping remove="192.168.121.0/24,192.168.1.0/24"
Succeeded: 20 addresses from 10 minions average rtt 14.16 ms
13.9 Locating physical disks using LED lights #
Ceph tracks which daemons manage which hardware storage devices (HDDs, SSDs), and collects health metrics about those devices in order to provide tools to predict and automatically respond to hardware failure.
You can blink the drive LEDs on hardware enclosures to make the replacement of failed disks easy and less error-prone. Use the following command:
cephuser@adm >
ceph device light --enable=on --devid=DEVID --light_type=ident --force
The DEVID parameter is the device identification. You can obtain it by running the following command:
cephuser@adm >
ceph device ls
13.10 Sending large objects with rados fails with full OSD #
rados is a command line utility to manage RADOS object storage. For more information, see man 8 rados.
If you send a large object to a Ceph cluster with the rados utility, such as
cephuser@adm >
rados -p mypool put myobject /file/to/send
it can fill up all the related OSD space and cause serious trouble to the cluster performance.
13.11 Managing the 'Too Many PGs per OSD' status message #
If you receive a Too Many PGs per OSD message after running ceph status, it means that the mon_pg_warn_max_per_osd value (300 by default) was exceeded. This value is compared to the number of PGs per OSD ratio. This means that the cluster setup is not optimal.
The number of PGs cannot be reduced after the pool is created. Pools that do not yet contain any data can safely be deleted and then re-created with a lower number of PGs. Where pools already contain data, the only solution is to add OSDs to the cluster so that the ratio of PGs per OSD becomes lower.
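To get a rough idea of the ratio before creating or re-creating pools, the number of PGs per OSD can be estimated from the PG count and replica count of each pool. The following sketch assumes that every replica of a PG counts once and that PGs are spread evenly over all OSDs, which is a simplification. The function name pgs_per_osd and the sample figures are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate the number of PGs per OSD. Reads one
# "PG_NUM REPLICA_COUNT" pair per pool from standard input and takes the
# number of OSDs as its argument. Assumes every replica of a PG counts
# once and that PGs are spread evenly over all OSDs.
pgs_per_osd() {
    awk -v osds="$1" '
        { total += $1 * $2 }
        END { if (osds > 0) printf "%d\n", total / osds }
    '
}

# Made-up cluster: two pools with 3 replicas each on 10 OSDs.
# (128 * 3 + 64 * 3) / 10 = 57.6, printed as 57.
printf '128 3\n64 3\n' | pgs_per_osd 10
```

The result can then be compared with the warning limit mentioned above.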
13.12 Managing the 'nn pg stuck inactive' status message #
If you receive a stuck inactive status message after running ceph status, it means that Ceph does not know where to replicate the stored data to fulfill the replication rules. It can happen shortly after the initial Ceph setup and fix itself automatically. In other cases, this may require a manual interaction, such as bringing up a broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing the replication level may help.
If the placement groups are stuck perpetually, you need to check the output of ceph osd tree. The output should look tree-structured, similar to the example in Section 4.6, “OSD is down”.
If the output of ceph osd tree is rather flat as in the following example:
cephuser@adm >
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.02939 root default
-3 0.00980 host doc-ses-min2
0 hdd 0.00980 osd.0 up 1.00000 1.00000
-5 0.00980 host doc-ses-min3
1 hdd 0.00980 osd.1 up 1.00000 1.00000
-7 0.00980 host doc-ses-min4
2 hdd 0.00980 osd.2 up 1.00000 1.00000
You should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.
If the hierarchy is incorrect (for example, the root contains hosts, but the OSDs are at the top level and are not themselves assigned to hosts), you will need to move the OSDs to the correct place in the hierarchy. This can be done using the ceph osd crush move and/or ceph osd crush set commands. For further details see Section 17.5, “CRUSH Map manipulation”.
13.13 Fixing clock skew warnings #
As a general rule, time synchronization must be configured and running on all nodes. Once time synchronization is set up, the clocks should not get skewed. However, if a clock skew does occur, it is likely because chronyd.service is not running on one or more hosts.
It is also possible that the battery on the motherboard has died, in which case the clock skew will be more pronounced. If this is the case, be aware that it will take quite some time for chronyd to re-synchronize the clocks.
If you receive a clock skew warning, confirm that the chronyd.service daemon is running on all cluster nodes. If not, restart the service and wait for chronyd to re-sync the clock.
Find more information on setting up time synchronization in https://documentation.suse.com/sles/15-SP2/html/SLES-all/cha-ntp.html#sec-ntp-yast.
13.14 Determining poor cluster performance caused by network problems #
There may be other reasons why cluster performance becomes weak, such as network problems. In such case, you may notice the cluster reaching quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.
To check whether cluster performance is degraded by network problems, inspect the Ceph log files under the /var/log/ceph directory.
To fix network issues on the cluster, focus on the following points:
Basic network diagnostics. Try running the network.ping diagnostics tool. This tool has cluster nodes send network pings from their network interfaces to the network interfaces of other cluster nodes, and measures the average response time. Any specific response time much slower than average will also be reported. See Section 13.8.1, “Performing basic diagnostics” for more information.
Check firewall settings on cluster nodes. Make sure they do not block ports or protocols required by Ceph operation. See Section 13.7, “Firewall settings for Ceph” for more information on firewall settings.
Check the networking hardware, such as network cards, cables, or switches, for proper operation.
To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.
13.15 Managing /var running out of space #
By default, the Salt Master saves every minion's result for every job in its job cache. The cache can then be used later to look up results from previous jobs. The cache directory defaults to /var/cache/salt/master/jobs/.
Each job return from every minion is saved in a single file. Over time this directory can grow very large, depending on the number of published jobs and the value of the keep_jobs option in the /etc/salt/master file. keep_jobs sets the number of hours (24 by default) to keep information about past minion jobs.
keep_jobs: 24
Important: Do not set keep_jobs: 0
Setting keep_jobs to '0' will cause the job cache cleaner to never run, possibly resulting in a full partition.
If you want to disable the job cache, set job_cache to 'False':
job_cache: False
When the partition with job cache files gets full because of a wrong keep_jobs setting, follow these steps to free disk space and improve the job cache settings:
Stop the Salt Master service:
root@master #
systemctl stop salt-master
Change the Salt Master configuration related to job cache by editing /etc/salt/master:
job_cache: False
keep_jobs: 1
Clear the Salt Master job cache:
#
rm -rfv /var/cache/salt/master/jobs/*
Start the Salt Master service:
root@master #
systemctl start salt-master