16 Common issues #
16.1 Ceph common issues #
Many of these problem cases are hard to summarize in a short phrase that adequately describes the problem. Each problem starts with a bulleted list of symptoms. Keep in mind that not all symptoms may apply, depending on the configuration of Rook. If the majority of the symptoms are seen, there is a fair chance that you are experiencing that problem.
16.1.1 Troubleshooting techniques #
There are two main categories of information you will need to investigate issues in the cluster:
Kubernetes status and logs.
Ceph cluster status.
16.1.1.1 Running Ceph tools #
After you verify the basic health of the running pods, next you will want to run Ceph tools for status of the storage components. There are two ways to run the Ceph tools, either in the Rook toolbox or inside other Rook pods that are already running.
Logs on a specific node to find why a PVC is failing to mount. Look for Rook agent errors around the attach and detach:
kubectl@adm >
kubectl logs -n rook-ceph rook-ceph-agent-pod
See Section 12.1.3, “Collecting logs” for a script that will help you gather the logs.
Other artifacts:
The monitors that are expected to be in quorum:
kubectl@adm >
kubectl -n <cluster-namespace> get configmap rook-ceph-mon-endpoints -o yaml | grep data
16.1.1.1.1 Using tools in the Rook toolbox #
The rook-ceph-tools pod provides a simple environment to run Ceph tools. Once the pod is up and running, connect to the pod to execute Ceph commands and evaluate the current state of the cluster.
kubectl@adm >
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
16.1.1.1.2 Ceph commands #
Here are some common commands to troubleshoot a Ceph cluster:
ceph status
ceph osd status
ceph osd df
ceph osd utilization
ceph osd pool stats
ceph osd tree
ceph pg stat
The first two status commands provide the overall cluster health. The normal state for cluster operations is HEALTH_OK, but the cluster will still function when it is in a HEALTH_WARN state. If you are in a HEALTH_WARN state, the cluster is in a condition where it may enter the HEALTH_ERR state, at which point all disk I/O operations are halted. If a HEALTH_WARN state is observed, take action to prevent the cluster from halting when it enters the HEALTH_ERR state.
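The escalation described above can be turned into a small triage helper. The following is a minimal sketch and not part of Rook or Ceph: the function name ceph_health_action and its messages are hypothetical, and it only assumes that the health string begins with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR (the name Ceph uses for the error state).

```shell
# Hypothetical helper: map a Ceph health string to a suggested next step.
# The argument is expected to start with HEALTH_OK, HEALTH_WARN or HEALTH_ERR.
ceph_health_action() {
  case "$1" in
    HEALTH_OK*)   echo "ok: no action needed" ;;
    HEALTH_WARN*) echo "warn: investigate before the cluster reaches HEALTH_ERR" ;;
    HEALTH_ERR*)  echo "error: I/O may be halted, fix the reported problem" ;;
    *)            echo "unknown: could not read cluster health" ;;
  esac
}
```

From the toolbox, it could be called as ceph_health_action "$(ceph health)".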
16.1.2 Cluster failing to service requests #
16.1.2.1 Identifying symptoms #
Execution of the Ceph command hangs.
PersistentVolumes are not being created.
A large number of slow requests are blocking.
A large number of stuck requests are blocking.
One or more MONs are restarting periodically.
16.1.2.2 Investigating the current state of Ceph #
Create a rook-ceph-tools pod to investigate the current state of Ceph. The following is an example of the output. In this case, the ceph status command would just hang and the process would need to be killed.
kubectl@adm >
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
cephuser@adm >
ceph status
^CCluster connection interrupted or timed out
Another indication is when one or more of the MON pods restart frequently. Note the “mon107” that has only been up for 16 minutes in the following output.
kubectl@adm >
kubectl -n rook-ceph get all -o wide --show-all
NAME READY STATUS RESTARTS AGE IP NODE
po/rook-ceph-mgr0-2487684371-gzlbq 1/1 Running 0 17h 192.168.224.46 k8-host-0402
po/rook-ceph-mon107-p74rj 1/1 Running 0 16m 192.168.224.28 k8-host-0402
rook-ceph-mon1-56fgm 1/1 Running 0 2d 192.168.91.135 k8-host-0404
rook-ceph-mon2-rlxcd 1/1 Running 0 2d 192.168.123.33 k8-host-0403
rook-ceph-osd-bg2vj 1/1 Running 0 2d 192.168.91.177 k8-host-0404
rook-ceph-osd-mwxdm 1/1 Running 0 2d 192.168.123.31 k8-host-0403
16.1.2.3 Identifying the solution #
What is happening here is that the MON pods are restarting and one or more of the Ceph daemons are not getting configured with the proper cluster information. This is commonly the result of not specifying a value for dataDirHostPath in your Cluster CRD.
The dataDirHostPath setting specifies a path on the local host for the Ceph daemons to store configuration and data. Setting this to a path like /var/lib/rook, reapplying your cluster CRD, and restarting all the Ceph daemons (MON, MGR, OSD, RGW) should solve this problem. After the Object Gateway daemons have been restarted, it is advisable to restart the rook-ceph-tools pod.
16.1.3 Monitors are the only PODs running #
16.1.3.1 Identifying symptoms #
Rook operator is running.
Either a single MON starts, or the MONs skip letters and are specifically named a, d, and f.
No MGR, OSD, or other daemons are created.
16.1.3.2 Investigating MON health #
When the operator is starting a cluster, the operator will start one MON at a time and check that they are healthy before continuing to bring up all three MONs. If the first MON is not detected healthy, the operator will continue to check until it is healthy. If the first MON fails to start, a second and then a third MON may attempt to start. However, they will never form a quorum, and orchestration will be blocked from proceeding.
The likely causes for the MON health not being detected:
The operator pod does not have network connectivity to the MON pod.
The MON pod is failing to start.
One or more MON pods are in running state, but are not able to form a quorum.
16.1.3.2.1 Failing to connect to the MON #
First, look at the logs of the operator to confirm whether it is able to connect to the MONs.
kubectl@adm >
kubectl -n rook-ceph logs -l app=rook-ceph-operator
Likely you will see an error similar to the following, showing that the operator is timing out when connecting to the MON. The last command is ceph mon_status, followed by a timeout message five minutes later.
2018-01-21 21:47:32.375833 I | exec: Running command: ceph mon_status --cluster=rook --conf=/var/lib/rook/rook-ceph/rook.config --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json --out-file /tmp/442263890
2018-01-21 21:52:35.370533 I | exec: 2018-01-21 21:52:35.071462 7f96a3b82700 0 monclient(hunting): authenticate timed out after 300
2018-01-21 21:52:35.071462 7f96a3b82700 0 monclient(hunting): authenticate timed out after 300
2018-01-21 21:52:35.071524 7f96a3b82700 0 librados: client.admin authentication error (110) Connection timed out
2018-01-21 21:52:35.071524 7f96a3b82700 0 librados: client.admin authentication error (110) Connection timed out
[errno 110] error connecting to the cluster
The error would appear to be an authentication error, but it is misleading. The real issue is a timeout.
16.1.3.2.2 Identifying the solution #
If you see the timeout in the operator log, verify if the MON pod is running (see the next section). If the MON pod is running, check the network connectivity between the operator pod and the MON pod. A common issue is that the CNI is not configured correctly.
16.1.3.2.3 Failing MON pod #
We need to verify if the MON pod started successfully.
kubectl@adm >
kubectl -n rook-ceph get pod -l app=rook-ceph-mon
NAME READY STATUS RESTARTS AGE
rook-ceph-mon-a-69fb9c78cd-58szd 1/1 CrashLoopBackOff 2 47s
If the MON pod is failing as in this example, you will need to look at the MON pod status or logs to determine the cause. If the pod is in a crash loop backoff state, you should see the reason by describing the pod.
The pod shows a termination status that the keyring does not match the existing keyring.
kubectl@adm >
kubectl -n rook-ceph describe pod -l mon=rook-ceph-mon0
[...]
Last State: Terminated
Reason: Error
Message: The keyring does not match the existing keyring in /var/lib/rook/rook-ceph-mon0/data/keyring.
You may need to delete the contents of dataDirHostPath on the host from a previous deployment.
[...]
See the solution in the next section regarding cleaning up the dataDirHostPath on the nodes.
If you see the three MONs running with the names a, d, and f, they likely did not form quorum even though they are running.
NAME READY STATUS RESTARTS AGE
rook-ceph-mon-a-7d9fd97d9b-cdq7g 1/1 Running 0 10m
rook-ceph-mon-d-77df8454bd-r5jwr 1/1 Running 0 9m2s
rook-ceph-mon-f-58b4f8d9c7-89lgs 1/1 Running 0 7m38s
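The "skipped letters" symptom can be checked mechanically. The following is a rough sketch under stated assumptions: the function name check_mon_letters is hypothetical, and it assumes MON pod names of the form rook-ceph-mon-<letter>-<suffix>, as in the listing above. It is only meaningful on a freshly created cluster, because MON letters can also advance for other reasons later in a cluster's life.

```shell
# Hypothetical helper: read MON pod names (or `kubectl get pod` lines) on
# stdin and report whether the MON letters are consecutive starting at "a".
check_mon_letters() {
  # Extract the single letter that follows "rook-ceph-mon-" on each line.
  letters=$(sed -n 's/^.*rook-ceph-mon-\([a-z]\)-.*/\1/p' | sort | tr -d '\n')
  count=${#letters}
  if [ "$count" -eq 0 ]; then
    echo "no MON pods found"
    return 0
  fi
  # The first $count letters of the alphabet are what a healthy bring-up uses.
  expected=$(printf 'abcdefghijklmnopqrstuvwxyz' | cut -c "1-$count")
  if [ "$letters" = "$expected" ]; then
    echo "consecutive: $letters"
  else
    echo "skipped: $letters"
  fi
}
```

A possible invocation is kubectl -n rook-ceph get pod -l app=rook-ceph-mon | check_mon_letters.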
16.1.3.2.4 Identifying the solution #
This is a common problem when reinitializing the Rook cluster if the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD, and is typically set to /var/lib/rook. To fix the issue, you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then, when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.
Deleting the dataDirHostPath folder is destructive to the storage. Only delete the folder if you are trying to permanently purge the Rook cluster.
16.1.4 PVCs stay in pending state #
16.1.4.1 Identifying symptoms #
When you create a PVC based on a Rook storage class, it stays pending indefinitely.
For the WordPress example, you might see two PVCs in the pending state.
kubectl@adm >
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
mysql-pv-claim Pending rook-ceph-block 8s
wp-pv-claim Pending rook-ceph-block 16s
16.1.4.2 Investigating common causes #
There are two common causes for the PVCs staying in the pending state:
There are no OSDs in the cluster.
The CSI provisioner pod is not running or is not responding to the request to provision the storage.
16.1.4.2.1 Confirming if there are OSDs #
To confirm if you have OSDs in your cluster, connect to the Rook Toolbox and run the ceph status command. You should see that you have at least one OSD up and in. The minimum number of OSDs required depends on the replicated.size setting in the pool created for the storage class. In a “test” cluster, only one OSD is required (see storageclass-test.yaml). In the production storage class example (storageclass.yaml), three OSDs would be required.
cephuser@adm >
ceph status
cluster:
id: a0452c76-30d9-4c1a-a948-5d8405f19a7c
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 11m)
mgr: a(active, since 10m)
osd: 1 osds: 1 up (since 46s), 1 in (since 109m)
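The OSD check can be scripted against the osd: line of the ceph status output. This is a minimal sketch: the function names osd_up_in and have_enough_osds are hypothetical, and the parsing assumes the line format shown in the example above ("N osds: N up ..., N in ..."). The required count must come from your own pool settings.

```shell
# Hypothetical helper: read `ceph status` output on stdin and print the
# number of OSDs that are up and in, as "<up> <in>".
osd_up_in() {
  sed -n 's/^ *osd: *[0-9][0-9]* osds: *\([0-9][0-9]*\) up[^,]*, *\([0-9][0-9]*\) in.*/\1 \2/p'
}

# Hypothetical helper: succeed when at least $1 OSDs are both up and in.
# Reads `ceph status` output on stdin.
have_enough_osds() {
  required=$1
  # Word splitting is intended: osd_up_in prints two numbers.
  set -- $(osd_up_in)
  [ "${1:-0}" -ge "$required" ] && [ "${2:-0}" -ge "$required" ]
}
```

For example, ceph status | have_enough_osds 3 would fail for the single-OSD output above, which matches the expectation for the production storage class.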
16.1.4.2.2 Preparing OSD logs #
If you do not see the expected number of OSDs, investigate why they were not created. On each node where Rook looks for OSDs to configure, you will see an “osd prepare” pod.
kubectl@adm >
kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare
NAME ... READY STATUS RESTARTS AGE
rook-ceph-osd-prepare-minikube-9twvk 0/2 Completed 0 30m
See Section 16.1.6, “OSD pods are not created on my devices” to investigate the logs.
16.1.4.2.3 Checking CSI driver #
The CSI driver may not be responding to the requests. Look in the logs of the CSI provisioner pod to see if there are any errors during the provisioning.
There are two provisioner pods:
kubectl@adm >
kubectl -n rook-ceph get pod -l app=csi-rbdplugin-provisioner
Get the logs of each of the pods. One of them should be the leader and be responding to requests.
kubectl@adm >
kubectl -n rook-ceph logs csi-cephfsplugin-provisioner-d77bb49c6-q9hwq csi-provisioner
16.1.4.2.4 Restarting the operator #
Lastly, if you have OSDs up and in, the next step is to confirm the operator is responding to the requests. Look in the operator pod logs around the time when the PVC was created to confirm if the request is being raised. If the operator does not show requests to provision the block image, the operator may be stuck on some other operation. In this case, restart the operator pod to get things going again.
16.1.4.3 Identifying the solution #
If the OSD prepare logs did not give you enough clues about why the OSDs were not being created, review your cluster.yaml configuration. The common mistakes include:
If useAllDevices: true, Rook expects to find local devices attached to the nodes. If no devices are found, no OSDs will be created.
If useAllDevices: false, OSDs will only be created if deviceFilter is specified.
Only local devices attached to the nodes will be configurable by Rook. In other words, the devices must show up under /dev.
The devices must not have any partitions or file systems on them. Rook will only configure raw devices. Partitions are not yet supported.
16.1.5 OSD pods are failing to start #
16.1.5.1 Identifying symptoms #
OSD pods are failing to start.
You have started a cluster after tearing down another cluster.
16.1.5.2 Investigating configuration errors #
When an OSD starts, the device or directory will be configured for consumption. If there is an error with the configuration, the pod will crash and you will see the CrashLoopBackOff status for the pod. Look in the OSD pod logs for an indication of the failure.
kubectl@adm >
kubectl -n rook-ceph logs rook-ceph-osd-fl8fs
One common cause of failure is that you have redeployed a test cluster and some state remains from a previous deployment. If your cluster is larger than a few nodes, you may get lucky enough that the monitors were able to start and form a quorum. However, the OSD pods may now fail to start due to the old state. Looking at the OSD pod logs, you will see an error about the file already existing.
kubectl -n rook-ceph logs rook-ceph-osd-fl8fs
[...]
2017-10-31 20:13:11.187106 I | mkfs-osd0: 2017-10-31 20:13:11.186992 7f0059d62e00 -1 bluestore(/var/lib/rook/osd0) _read_fsid unparsable uuid
2017-10-31 20:13:11.187208 I | mkfs-osd0: 2017-10-31 20:13:11.187026 7f0059d62e00 -1 bluestore(/var/lib/rook/osd0) _setup_block_symlink_or_file failed to create block symlink to /dev/disk/by-partuuid/651153ba-2dfc-4231-ba06-94759e5ba273: (17) File exists
2017-10-31 20:13:11.187233 I | mkfs-osd0: 2017-10-31 20:13:11.187038 7f0059d62e00 -1 bluestore(/var/lib/rook/osd0) mkfs failed, (17) File exists
2017-10-31 20:13:11.187254 I | mkfs-osd0: 2017-10-31 20:13:11.187042 7f0059d62e00 -1 OSD::mkfs: ObjectStore::mkfs failed with error (17) File exists
2017-10-31 20:13:11.187275 I | mkfs-osd0: 2017-10-31 20:13:11.187121 7f0059d62e00 -1 ** ERROR: error creating empty object store in /var/lib/rook/osd0: (17) File exists
16.1.5.3 Solution #
If the error is about a file that already exists, this is a common problem when reinitializing the Rook cluster if the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue, you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then, when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.
16.1.6 OSD pods are not created on my devices #
16.1.6.1 Identifying symptoms #
No OSD pods are started in the cluster.
Devices are not configured with OSDs even though specified in the cluster CRD.
One OSD pod is started on each node instead of multiple pods for each device.
16.1.6.2 Investigating #
First, ensure that you have specified the devices correctly in the CRD. The cluster CRD has several ways to specify the devices that are to be consumed by the Rook storage:
useAllDevices: true: Rook will consume all devices it determines to be available.
deviceFilter: Consume all devices that match this regular expression.
devices: Explicit list of device names on each node to consume.
Second, if Rook determines that a device is not available (it has existing partitions or a formatted file system), Rook will skip consuming the device. If Rook is not starting OSDs on the devices you expect, Rook may have skipped them for this reason. To see if a device was skipped, view the OSD preparation log on the node where the device was skipped. Note that it is completely normal and expected for the OSD prepare pod to be in the completed state. After the job is complete, Rook leaves the pod around in case the logs need to be investigated.
Get the prepare pods in the cluster:
kubectl@adm >
kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-prepare-node1-fvmrp 0/1 Completed 0 18m
rook-ceph-osd-prepare-node2-w9xv9 0/1 Completed 0 22m
rook-ceph-osd-prepare-node3-7rgnv 0/1 Completed 0 22m
View the logs for the node of interest in the "provision" container:
kubectl@adm >
kubectl -n rook-ceph logs rook-ceph-osd-prepare-node1-fvmrp provision
Here are some key lines to look for in the log. A device will be skipped if Rook sees it has partitions or a file system:
2019-05-30 19:02:57.353171 W | cephosd: skipping device sda that is in use
2019-05-30 19:02:57.452168 W | skipping device "sdb5": ["Used by ceph-disk"]
Other messages about a disk being unusable by Ceph include:
Insufficient space (<5GB) on vgs
Insufficient space (<5GB)
LVM detected
Has BlueStore device label
locked
read-only
A device is going to be configured:
2019-05-30 19:02:57.535598 I | cephosd: device sdc to be configured by ceph-volume
For each device configured, you will see a report in the log:
2019-05-30 19:02:59.844642 I | Type Path LV Size % of device
2019-05-30 19:02:59.844651 I | ----------------------------------------------------------------------------------------------------
2019-05-30 19:02:59.844677 I | [data] /dev/sdc 7.00 GB 100%
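To avoid scanning a long prepare log by eye, the "skipping device" lines can be filtered out. The following is a small sketch: the function name skipped_devices is hypothetical, and the pattern assumes only the two message shapes quoted above (with and without quotes around the device name). Other Rook versions may word these messages differently.

```shell
# Hypothetical helper: read an OSD prepare log on stdin and print the name
# of each device that Rook reported as skipped, one per line.
skipped_devices() {
  # Matches both `skipping device sda that ...` and `skipping device "sdb5": ...`.
  sed -n 's/.*skipping device "\{0,1\}\([^" :]*\).*/\1/p' | sort -u
}
```

For example: kubectl -n rook-ceph logs rook-ceph-osd-prepare-node1-fvmrp provision | skipped_devices.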
16.1.6.3 Solution #
Either update the CR with the correct settings, or clean the partitions or file system from your devices.
After the settings are updated or the devices are cleaned, trigger the operator to analyze the devices again by restarting the operator. Each time the operator starts, it will ensure all the desired devices are configured. The operator does automatically deploy OSDs in most scenarios, but an operator restart will cover any scenarios that the operator does not detect automatically.
Restart the operator to ensure devices are configured. A new pod will automatically be started when the current operator pod is deleted.
kubectl@adm >
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
16.1.7 Rook agent modprobe exec format error #
16.1.7.1 Identifying symptoms #
PersistentVolumes from Ceph fail or time out when mounting.
Rook Agent logs contain modinfo: ERROR: could not get modinfo from 'rbd': Exec format error lines.
16.1.7.2 Solution #
If it is feasible to upgrade your kernel, upgrade to 4.x. Kernel 4.7 or above is even better, due to a feature for CephFS added to the kernel.
If you are unable to upgrade the kernel, you need to go to each host that will consume storage and run:
modprobe rbd
This command inserts the rbd module into the kernel.
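To persist this fix, add the rbd kernel module to /etc/modules-load.d/, which systemd reads at boot. Files in /etc/modprobe.d/ configure module options and aliases and do not load a module by themselves. Create a file called rbd.conf in /etc/modules-load.d/ with the following content: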
rbd
Now when a host is restarted, the module should be loaded automatically.
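The persistence step can be scripted for use on several hosts. This is only a sketch: the function name persist_rbd_module is hypothetical, and the default target directory assumes a systemd-based host that reads /etc/modules-load.d/ at boot. It must run with permission to write to the target directory.

```shell
# Hypothetical helper: create rbd.conf containing the module name "rbd".
# $1 is the target directory; it defaults to /etc/modules-load.d (assumed
# to be read at boot on systemd-based hosts).
persist_rbd_module() {
  dir=${1:-/etc/modules-load.d}
  printf 'rbd\n' > "$dir/rbd.conf"
}
```

On each host that consumes storage, running persist_rbd_module as root after modprobe rbd covers both the running kernel and the next reboot.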
16.1.9 Activating log to file for a particular Ceph daemon #
There are cases where looking at Kubernetes logs is not enough, for various reasons. To name just a few:
Not everyone is familiar with Kubernetes logging, and some expect to find logs in traditional directories.
Logs get dropped (buffer limit of the log engine) and thus cannot be requested from Kubernetes.
So for each daemon, dataDirHostPath is used to store logs, if logging is activated. Rook will bind-mount dataDirHostPath for every pod. As of Ceph Nautilus 14.2.1, it is possible to enable logging for a particular daemon on the fly. Let us say you want to enable logging for mon.a, but only for this daemon. Using the toolbox or from inside the operator, run:
cephuser@adm >
ceph config set mon.a log_to_file true
This will activate logging on the file system. You will be able to find logs in dataDirHostPath/$NAMESPACE/log, so typically this would mean /var/lib/rook/rook-ceph/log. You do not need to restart the pod. The effect is immediate.
To disable logging to file, set log_to_file to false.
For Ceph Luminous and Mimic releases, mon_cluster_log_file and cluster_log_file can be set to /var/log/ceph/XXXX in the config override ConfigMap to enable logging.
16.1.10 A worker node using RBD devices hangs up #
16.1.10.1 Identifying symptoms #
There is no progress on I/O from/to one of the RBD devices (/dev/rbd* or /dev/nbd*).
After that, the whole worker node hangs up.
16.1.10.2 Investigating #
This happens when the following conditions are satisfied.
The problematic RBD device and the corresponding OSDs are co-located.
There is an XFS file system on top of this device.
In addition, when this problem happens, you can see the following messages in dmesg.
dmesg
...
[51717.039319] INFO: task kworker/2:1:5938 blocked for more than 120 seconds.
[51717.039361] Not tainted 4.15.0-72-generic #81-Ubuntu
[51717.039388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
This is the so-called hung_task problem and means that there is a deadlock in the kernel.
16.1.10.3 Solution #
You can bypass this problem by using ext4 or any other file system instead of XFS. The file system type can be specified with csi.storage.k8s.io/fstype in the StorageClass resource.
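Before changing anything, it can help to confirm which file system a storage class currently requests. The following is a minimal sketch: the function name storageclass_fstype is hypothetical, and it assumes the csi.storage.k8s.io/fstype key appears on its own line in the YAML form of the StorageClass, which is how kubectl prints the parameters map.

```shell
# Hypothetical helper: read a StorageClass manifest (YAML) on stdin and
# print the value of the csi.storage.k8s.io/fstype parameter, if present.
storageclass_fstype() {
  sed -n 's|^ *csi\.storage\.k8s\.io/fstype: *"\{0,1\}\([A-Za-z0-9]*\).*|\1|p'
}
```

For example, kubectl get storageclass rook-ceph-block -o yaml | storageclass_fstype prints xfs for a storage class that is affected by this issue. The storage class name here is the one used in the PVC example earlier in this chapter.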
16.1.11 Too few PGs per OSD warning is shown #
16.1.11.1 Identifying symptoms #
ceph status shows a “too few PGs per OSD” warning as follows.
cephuser@adm >
ceph status
cluster:
id: fd06d7c3-5c5c-45ca-bdea-1cf26b783065
health: HEALTH_WARN
too few PGs per OSD (16 < min 30)
16.1.11.2 Solution #
See Chapter 5, Troubleshooting placement groups (PGs) for more information.
16.1.12 LVM metadata can be corrupted with OSD on LV-backed PVC #
16.1.12.1 Identifying symptoms #
There is a critical flaw in OSD on LV-backed PVC. LVM metadata can be corrupted if both the host and the OSD container modify it simultaneously. For example, the administrator might modify it on the host, while the OSD initialization process in a container could modify it too. In addition, if lvmetad is running, the problem is more likely to occur. In this case, the change of LVM metadata in the OSD container is not reflected in the LVM metadata cache on the host for a while.
If you still decide to configure an OSD on LVM, keep the following in mind to reduce the probability of this issue.
16.1.12.2 Solution #
Disable lvmetad.
Avoid configuration of LVs from the host. In addition, do not touch the VGs and physical volumes that back these LVs.
Avoid incrementing the count field of storageClassDeviceSets and creating a new LV that backs an OSD at the same time.
You can check whether the LV tags exist by running:
#
lvs -o lv_name,lv_tags
If the lv_tags field is empty for an LV corresponding to an OSD, this OSD encountered the problem. In this case, retire this OSD or replace it with a new OSD before restarting.
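The empty-tag check can be automated by filtering the lvs output. This is a rough sketch: the function name untagged_lvs is hypothetical, and it assumes the output of lvs --noheadings -o lv_name,lv_tags, where an LV without tags appears as a line with a single field. It lists every untagged LV, so the result still has to be matched against the LVs that actually back OSDs.

```shell
# Hypothetical helper: read `lvs --noheadings -o lv_name,lv_tags` output on
# stdin and print the names of LVs whose lv_tags column is empty.
untagged_lvs() {
  # A line with only one field has an LV name and no tags.
  awk 'NF == 1 { print $1 }'
}
```

For example, as root: lvs --noheadings -o lv_name,lv_tags | untagged_lvs.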