5 Troubleshooting placement groups (PGs) #
5.1 Identifying troubled placement groups #
As previously noted, a placement group is not necessarily problematic just because its state is not active+clean. Generally, when placement groups get stuck, Ceph's ability to self-repair may not be working. The stuck states include:
Unclean: Placement groups contain objects that are not replicated the required number of times. They should be recovering.
Inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come back up.
Stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to MONs in a while (as configured by the mon osd report timeout option).
To identify stuck placement groups, run the following:
cephuser@adm >
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
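You can also pass several stuck states in a single invocation. A sketch of such a combined query (the combination shown is only an example):
cephuser@adm >
ceph pg dump_stuck inactive unclean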
5.2 Placement groups never get clean #
When you create a cluster and your cluster remains in active, active+remapped, or active+degraded status and never achieves an active+clean status, you likely have a problem with the configuration. As a general rule, you should run your cluster with more than one OSD and a pool size greater than 1 object replica.
5.2.1 Experimenting with a one node cluster #
Ceph no longer provides documentation for operating on a single node. Mounting client kernel modules on a single node containing a Ceph daemon can cause a deadlock due to issues with the Linux kernel itself (unless you use VMs for the clients). However, we recommend experimenting with Ceph in a 1-node configuration regardless of the limitations.
If you are trying to create a cluster on a single node, change the default of the osd crush chooseleaf type setting from 1 (meaning host or node) to 0 (meaning osd) in your Ceph configuration file before you create your monitors and OSDs. This tells Ceph that an OSD can peer with another OSD on the same host. If you are trying to set up a 1-node cluster and osd crush chooseleaf type is greater than 0, Ceph tries to pair the PGs of one OSD with the PGs of another OSD on another node, chassis, rack, row, or even datacenter depending on the setting.
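As a sketch, the corresponding entry in the Ceph configuration file might look like the following (this assumes the option is placed in the [global] section):
[global]
        osd crush chooseleaf type = 0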
Do not mount kernel clients directly on the same node as your Ceph Storage Cluster, because kernel conflicts can arise. However, you can mount kernel clients within virtual machines (VMs) on a single node.
If you are creating OSDs using a single disk, you must create directories for the data manually first. For example:
cephuser@adm >
ceph-deploy osd create --data {disk} {host}
5.2.2 Fewer OSDs than replicas #
If you have brought up two OSDs to an up and in state, but you still do not see active+clean placement groups, you may have an osd pool default size set to greater than 2. There are a few ways to address this situation. If you want to operate your cluster in an active+degraded state with two replicas, you can set the osd pool default min size to 2 so that you can write objects in an active+degraded state. You may also set the osd pool default size setting to 2 so that you only have two stored replicas (the original and one replica), in which case the cluster should achieve an active+clean state.
You can make the changes at runtime. If you make the changes in your Ceph configuration file, you may need to restart your cluster.
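For example, you can change the default at runtime and also adjust an existing pool directly; a minimal sketch, assuming a hypothetical pool named mypool:
cephuser@adm >
ceph config set global osd_pool_default_size 2
cephuser@adm >
ceph osd pool set mypool size 2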
5.2.3 Forcing pool sizes #
If you have the osd pool default size set to 1, you only have one copy of the object. OSDs rely on other OSDs to tell them which objects they should have. If an OSD has a copy of an object and there is no second copy, then no second OSD can tell the first OSD that it should have that copy. For each placement group mapped to the first OSD (see ceph pg dump), you can force the first OSD to notice the placement groups it needs by running:
cephuser@adm >
ceph osd force-create-pg <pgid>
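To see which placement groups are mapped to a particular OSD without filtering the full dump, you can list them directly; a sketch, assuming the first OSD is osd.0:
cephuser@adm >
ceph pg ls-by-osd osd.0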
5.2.4 Identifying CRUSH map errors #
Another candidate for placement groups remaining unclean involves errors in your CRUSH map.
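To review the CRUSH map for such errors, you can extract it from the cluster and decompile it into readable text; for example (the file names are arbitrary):
cephuser@adm >
ceph osd getcrushmap -o crush.map
cephuser@adm >
crushtool -d crush.map -o crush.txt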
5.3 Stuck placement groups #
It is normal for placement groups to enter states such as degraded or peering following a failure. These states indicate the normal progression through the failure recovery process. However, if a placement group stays in one of these states for a long time, this may be an indication of a larger problem. For this reason, the monitor will warn when placement groups get stuck in a non-optimal state. Specifically, check for:
inactive
The placement group has not been active for too long. For example, it has not been able to service read/write requests.
unclean
The placement group has not been clean for too long. For example, it has not been able to completely recover from a previous failure.
stale
The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.
You can explicitly list stuck placement groups with one of:
cephuser@adm >
ceph pg dump_stuck stale
cephuser@adm >
ceph pg dump_stuck inactive
cephuser@adm >
ceph pg dump_stuck unclean
For stuck stale placement groups, ensure you have the right ceph-osd daemons running again. For stuck inactive placement groups, it can be a peering problem. For stuck unclean placement groups, there can be something preventing recovery from completing, like unfound objects.
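To check which OSD daemons are down and to bring them back, you can filter the OSD tree and restart the affected daemons; a sketch, assuming a cephadm-managed cluster and a hypothetical osd.1:
cephuser@adm >
ceph osd tree down
cephuser@adm >
ceph orch daemon restart osd.1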
5.4 Peering failure of placement groups #
In certain cases, the ceph-osd peering process can run into problems, preventing a PG from becoming active and usable. For example, ceph health may report:
cephuser@adm >
ceph health detail
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; \
6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
Query the cluster to determine exactly why the PG is marked down by executing the following:
cephuser@adm >
ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2012-03-06 14:40:16.169679",
"requested_info_from": []},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2012-03-06 14:40:16.169659",
"probing_osds": [
0,
1],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
1],
"peering_blocked_by": [
{ "osd": 1,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2012-03-06 14:40:16.169513"}
]
}
The recovery_state section shows that peering is blocked due to down ceph-osd daemons, specifically osd.1. In this case, restart the ceph-osd daemon to recover. Alternatively, if there is a catastrophic failure of osd.1, such as a disk failure, tell the cluster that it is lost and to cope as best it can.
The cluster cannot guarantee that the other copies of the data are consistent and up to date.
To instruct Ceph to continue anyway:
cephuser@adm >
ceph osd lost 1
Recovery will proceed.
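After the OSD is restarted or marked as lost, you can re-check the cluster health and re-query the affected placement group to confirm that peering and recovery proceed; for example:
cephuser@adm >
ceph health detail
cephuser@adm >
ceph pg 0.5 query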
5.5 Failing unfound objects #
Under certain combinations of failures Ceph may complain about unfound objects:
cephuser@adm >
ceph health detail
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
pg 2.4 is active+degraded, 78 unfound
This means that the storage cluster knows that some objects (or newer copies of existing objects) exist, but it has not found copies of them. One example of how this might come about for a PG whose data is on ceph-osds 1 and 2:
1 goes down
2 handles some writes, alone
1 comes up
1 and 2 repeer, and the objects missing on 1 are queued for recovery.
Before the new objects are copied, 2 goes down.
In this example, 1 is aware that these objects exist, but there is no live ceph-osd that has a copy. In this case, I/O to those objects blocks, and the cluster hopes that the failed node comes back soon. This is assumed to be preferable to returning an I/O error to the user.
Identify which objects are unfound by executing the following:
cephuser@adm >
ceph pg 2.4 list_unfound [starting offset, in json]
{ "offset": { "oid": "",
"key": "",
"snapid": 0,
"hash": 0,
"max": 0},
"num_missing": 0,
"num_unfound": 0,
"objects": [
{ "oid": "object 1",
"key": "",
"hash": 0,
"max": 0 },
...
],
"more": 0}
If there are too many objects to list in a single result, the more field is true and you can query for more.
Identify which OSDs have been probed or might contain data:
cephuser@adm >
ceph pg 2.4 query
"recovery_state": [
{ "name": "Started\/Primary\/Active",
"enter_time": "2012-03-06 15:15:46.713212",
"might_have_unfound": [
{ "osd": 1,
"status": "osd is down"}]},
In this case, for example, the cluster knows that osd.1 might have data, but it is down. The full range of possible states includes:
already probed
querying
OSD is down
not queried (yet)
Sometimes it takes some time for the cluster to query possible locations. It is possible that there are other locations where the object might exist that are not listed. For example, if a ceph-osd is stopped and taken out of the cluster, then the cluster fully recovers, and due to some future set of failures ends up with an unfound object, it will not consider the long-departed ceph-osd as a potential location.
If all possible locations have been queried and objects are still lost, you may have to give up on the lost objects. This, again, is possible given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves are recovered. To mark the unfound objects as lost:
cephuser@adm >
ceph pg 2.5 mark_unfound_lost revert|delete
The final argument specifies how the cluster should deal with lost objects. The delete option forgets about them entirely. The revert option (not available for erasure coded pools) either rolls back to a previous version of the object or (if it was a new object) forgets about it entirely. Use this with caution, as it may confuse applications that expected the object to exist.
5.6 Identifying homeless placement groups #
It is possible for all OSDs that had copies of a given placement group to fail. If that is the case, that subset of the object store is unavailable, and the monitor receives no status updates for those placement groups. To detect this situation, the monitor marks any placement group whose primary OSD has failed as stale. For example:
cephuser@adm >
ceph health
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
Identify which placement groups are stale, and what were the last OSDs to store them, by executing the following:
cephuser@adm >
ceph health detail
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
...
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
...
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
This output shows, for example, that placement group 2.5 was last managed by osd.0 and osd.2. To bring it back online, restart those ceph-osd daemons; this allows the cluster to recover that placement group.
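After restarting those daemons, you can verify the current mapping and state of the placement group; for example:
cephuser@adm >
ceph pg map 2.5
cephuser@adm >
ceph pg 2.5 query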
5.7 Only a few OSDs receive data #
If you have many nodes in your cluster and only a few of them receive data, check the number of placement groups in your pool. See Section 12.7, "Checking placement group states" for more information. Since placement groups get mapped to OSDs, a small number of placement groups will not distribute across the cluster. Create a pool with a placement group count that is a multiple of the number of OSDs. See Section 17.4, "Placement groups" for details.
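To see how data is currently spread across the OSDs and how many placement groups a pool has, the following commands can help; mypool is a hypothetical pool name:
cephuser@adm >
ceph osd df
cephuser@adm >
ceph osd pool get mypool pg_num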
5.8 Unable to write data #
If your cluster is up but some OSDs are down and you cannot write data, check to ensure that you have the minimum number of OSDs running for the placement group. If you do not have the minimum number of OSDs running, Ceph will not allow you to write data because there is no guarantee that Ceph can replicate your data.
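To check the minimum number of replicas a pool requires before it accepts writes, query its min_size; a sketch, assuming a hypothetical pool named mypool:
cephuser@adm >
ceph osd pool get mypool min_size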
5.9 Identifying inconsistent placement groups #
If you receive an active+clean+inconsistent state, this may happen due to an error during scrubbing. Identify the inconsistent placement group(s) by executing the following:
cephuser@adm >
ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
Or:
cephuser@adm >
rados list-inconsistent-pg rbd
["0.6"]
There is only one consistent state, but in the worst case there could be several different inconsistencies found in more than one object. If an object named foo in PG 0.6 is truncated, the output is:
cephuser@adm >
rados list-inconsistent-obj 0.6 --format=json-pretty
{ "epoch": 14, "inconsistents": [ { "object": { "name": "foo", "nspace": "", "locator": "", "snap": "head", "version": 1 }, "errors": [ "data_digest_mismatch", "size_mismatch" ], "union_shard_errors": [ "data_digest_mismatch_info", "size_mismatch_info" ], "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])", "shards": [ { "osd": 0, "errors": [], "size": 968, "omap_digest": "0xffffffff", "data_digest": "0xe978e67f" }, { "osd": 1, "errors": [], "size": 968, "omap_digest": "0xffffffff", "data_digest": "0xe978e67f" }, { "osd": 2, "errors": [ "data_digest_mismatch_info", "size_mismatch_info" ], "size": 0, "omap_digest": "0xffffffff", "data_digest": "0xffffffff" } ] } ] }
In this case, the output shows that the only inconsistent object is named foo, and its inconsistencies fall into two categories:
errors
These errors indicate inconsistencies between shards, without determining which shard(s) are bad. Check for the errors in the shards array, if available, to pinpoint the problem.
data_digest_mismatch
The digest of the replica read from OSD.2 is different from the ones of OSD.0 and OSD.1.
size_mismatch
The size of the replica read from OSD.2 is 0, while the size reported by OSD.0 and OSD.1 is 968.
union_shard_errors
The union of all shard-specific errors in the shards array. The errors are set for the given shard that has the problem. They include errors like read_error. The errors ending in _info indicate a comparison with selected_object_info. Look at the shards array to determine which shard has which error(s).
data_digest_mismatch_info
The digest stored in the object-info is not 0xffffffff, which is calculated from the shard read from OSD.2
size_mismatch_info
The size stored in the object-info is different from the one read from OSD.2. The latter is 0.
Repair the inconsistent placement group by executing:
cephuser@adm >
ceph pg repair placement-group-ID
This command overwrites the bad copies with the authoritative ones. In most cases, Ceph is able to choose authoritative copies from all available replicas using some predefined criteria but this does not always work. For example, the stored data digest could be missing, and the calculated digest will be ignored when choosing the authoritative copies. Use the above command with caution.
If read_error is listed in the errors attribute of a shard, the inconsistency is likely due to disk errors. You may want to check the disk used by that OSD.
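A sketch of how you might check the disk on the OSD node itself, assuming the OSD uses a hypothetical device /dev/sdb (run these as root on the node hosting the OSD, not on the Admin Node):
#
smartctl -a /dev/sdb
#
dmesg | grep sdb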
If you receive active+clean+inconsistent states periodically due to clock skew, you may consider configuring your NTP daemons on your monitor hosts to act as peers.
5.10 Identifying inactive erasure coded PGs #
When CRUSH fails to find enough OSDs to map to a PG, it will show as 2147483647, which is ITEM_NONE or no OSD found. For instance:
[2,1,6,0,5,8,2147483647,7,4]
5.10.1 Displaying not enough OSDs #
If the Ceph cluster has only 8 OSDs and the erasure coded pool needs 9, that is what it will show. You can either create another erasure coded pool that requires fewer OSDs:
cephuser@adm >
ceph osd erasure-code-profile set myprofile k=5 m=3
cephuser@adm >
ceph osd pool create erasurepool erasure myprofile
Or add new OSDs, and the PGs will automatically use them.
5.10.2 Satisfying CRUSH constraints #
If the cluster has enough OSDs, it is possible that the CRUSH rule imposes constraints that cannot be satisfied. If there are 10 OSDs on two hosts and the CRUSH rule requires that no two OSDs from the same host are used in the same PG, the mapping may fail because only two OSDs will be found. You can check the constraint by displaying the rule:
cephuser@adm >
ceph osd crush rule ls
[
"replicated_rule",
"erasurepool"]
$ ceph osd crush rule dump erasurepool
{ "rule_id": 1,
"rule_name": "erasurepool",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 20,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_indep",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
Resolve the problem by creating a new pool in which PGs are allowed to have OSDs residing on the same host with:
cephuser@adm >
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
cephuser@adm >
ceph osd pool create erasurepool erasure myprofile
5.10.3 Identifying when CRUSH gives up too soon #
If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster with a total of 9 OSDs and an erasure coded pool that requires 9 OSDs per PG), it is possible that CRUSH gives up before finding a mapping. It can be resolved by:
Lowering the erasure coded pool requirements to use fewer OSDs per PG (this requires the creation of another pool, as erasure code profiles cannot be dynamically modified).
Adding more OSDs to the cluster (this does not require the erasure coded pool to be modified; it will become clean automatically).
Using a handmade CRUSH rule that tries more times to find a good mapping. This can be done by setting set_choose_tries to a value greater than the default.
Verify the problem with crushtool after extracting the CRUSH map from the cluster, so that your experiments do not modify the Ceph cluster and only work on local files:
cephuser@adm >
ceph osd crush rule dump erasurepool
{ "rule_name": "erasurepool",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 20,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_indep",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
$ ceph osd getcrushmap > crush.map
got crush map from osdmap epoch 13
$ crushtool -i crush.map --test --show-bad-mappings \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
Where --num-rep is the number of OSDs the erasure code CRUSH rule needs, and --rule is the value of the ruleset field displayed by ceph osd crush rule dump. The test tries to map one million values (that is, the range defined by [--min-x,--max-x]) and must display at least one bad mapping. If it outputs nothing, it means all mappings are successful and the problem is elsewhere.
The CRUSH rule can be edited by decompiling the crush map:
#
crushtool --decompile crush.map > crush.txt
Add the following line to the rule:
step set_choose_tries 100
The relevant part of the crush.txt file should look something like:
rule erasurepool {
        ruleset 1
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}
It can then be compiled and tested again:
#
crushtool --compile crush.txt -o better-crush.map
When all mappings succeed, a histogram of the number of tries that were necessary to find all of them can be displayed with the --show-choose-tries option of crushtool:
#
crushtool -i better-crush.map --test --show-bad-mappings \
--show-choose-tries \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
...
11: 42
12: 44
13: 54
14: 45
15: 35
16: 34
17: 30
18: 25
19: 19
20: 22
21: 20
22: 17
23: 13
24: 16
25: 13
26: 11
27: 11
28: 13
29: 11
30: 10
31: 6
32: 5
33: 10
34: 3
35: 7
36: 5
37: 2
38: 5
39: 5
40: 2
41: 5
42: 4
43: 1
44: 2
45: 2
46: 3
47: 1
48: 0
...
102: 0
103: 1
104: 0
...
It takes 11 tries to map 42 PGs, 12 tries to map 44 PGs, and so on. The highest number of tries is the minimum value of set_choose_tries that prevents bad mappings (103 in the above output, because it did not take more than 103 tries for any PG to be mapped).