11 Troubleshooting Object Gateway
11.1 Running a basic health check
The most basic health check for a running Object Gateway process is to point your browser or client at the Object Gateway endpoint. This should return an empty bucket list for the anonymous user:
root@master # curl -I
If the request returns a 405 status, this indicates MethodNotAllowed. The output will contain a NoSuchBucket XML document. If the request returns a 404 status, the Object Gateway DNS name is misconfigured, or the requests were not made to an endpoint that resolves to the Object Gateway DNS name.
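The status codes above can be turned into a small scripted check. The following Python sketch is illustrative only: the function names are hypothetical, the endpoint URL is whatever your deployment uses, and the diagnoses simply restate the interpretations given in this section.

```python
import urllib.error
import urllib.request

# Interpretations taken from this section; other statuses are left undiagnosed.
STATUS_HINTS = {
    405: "MethodNotAllowed",
    404: "Object Gateway DNS name is misconfigured, or the request did not "
         "reach an endpoint resolving to it",
}


def explain_status(status):
    """Return a short diagnosis for an HTTP status returned by the gateway."""
    if 200 <= status < 300:
        return "gateway is answering requests"
    return STATUS_HINTS.get(status, "unexpected status %d" % status)


def check_gateway(url, timeout=5):
    """Send a HEAD request (like curl -I) and return (status, diagnosis)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx and 5xx responses are raised as exceptions
    return status, explain_status(status)
```

A call such as check_gateway("http://rgw.example.com") uses a placeholder host name. A refused connection is not an HTTP status, so it surfaces as urllib.error.URLError rather than as a return value.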
11.2 Identifying gateway issues
If the gateway is not running, restarting it (automatic under systemd) usually restores the service. To restart the gateway manually, run:
cephuser@adm > ceph orch daemon restart rgw-daemon-name
To increase the log level, raise the Object Gateway debug setting (debug_rgw) under the client.RGW_NAME section from 1 (the default) up to 20 (very verbose). For messages sent to the cluster itself, the messenger debug level can also be raised (it is off by default). This is controlled via debug ms. A level of 1 is usually sufficient to gather enough information. To inject the setting into a running process (a temporary change), run:
cephuser@adm > ceph-daemon CLIENT_RGW_NAME config set debug_rgw 20
For persistent changes, or for a process that is not running, set these options in the ceph.conf file under the client.rgw-name section or, alternatively, use the Ceph CLI. Because these debug levels are extremely verbose, enable them only for a short time to capture error logs. To set the log level using the Ceph CLI, run:
cephuser@adm > ceph config set CLIENT_RGW_NAME debug_rgw 20
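Because verbose logging should only be enabled briefly, it can help to script raising the level and lowering it again. The following Python sketch is one possible approach, not an official tool: the helper names are hypothetical, it restores the documented default of 1 (an assumption, since your cluster may have used a different level before), and the run parameter exists only so the commands can be inspected without a cluster.

```python
import subprocess
from contextlib import contextmanager

DEFAULT_DEBUG_RGW = 1  # default level stated in this section


def debug_rgw_command(daemon_name, level):
    """Build the 'ceph config set' argument list for a debug_rgw level."""
    # Range as described in this section: 1 (default) up to 20 (very verbose).
    if not 1 <= level <= 20:
        raise ValueError("debug_rgw level must be between 1 and 20")
    return ["ceph", "config", "set", daemon_name, "debug_rgw", str(level)]


@contextmanager
def verbose_rgw_logging(daemon_name, level=20, run=subprocess.check_call):
    """Raise debug_rgw for the duration of a 'with' block, then restore it."""
    run(debug_rgw_command(daemon_name, level))
    try:
        yield
    finally:
        # Always drop back to the default so verbose logs do not pile up.
        run(debug_rgw_command(daemon_name, DEFAULT_DEBUG_RGW))
```

Code placed inside a with verbose_rgw_logging(CLIENT_RGW_NAME): block runs while the higher level is active, and the level is reset even if that code raises an exception.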
11.3 Diagnosing crashed Object Gateway process
If the Object Gateway process dies, the client normally sees a connection refused error. In that situation, restarting Object Gateway restores the service.
To diagnose the cause of the crash, check the log in /var/log/ceph or the core file (if one was generated).
11.4 Identifying blocked Object Gateway requests
If some (or all) Object Gateway requests appear to be blocked, you can gain insight into the internal state of the Object Gateway daemon via its admin socket. By default, the socket resides in /var/run/ceph, and the daemon can be queried with:
cephuser@adm > ceph daemon /var/run/ceph/client.rgw-name help
help                 list available commands
objecter_requests    show in-progress osd requests
perfcounters_dump    dump perfcounters value
perfcounters_schema  dump perfcounters schema
version              get protocol version
Of particular interest:
cephuser@adm > ceph daemon /var/run/ceph/client.rgw objecter_requests
...
This dumps information about requests currently in progress with the RADOS cluster, which lets you identify whether any requests are blocked by a non-responsive OSD. For example, you might see:
{
  "ops": [
    {
      "tid": 1858,
      "pg": "2.d2041a48",
      "osd": 1,
      "last_sent": "2012-03-08 14:56:37.949872",
      "attempts": 1,
      "object_id": "fatty_25647_object1857",
      "object_locator": "@2",
      "snapid": "head",
      "snap_context": "0=[]",
      "mtime": "2012-03-08 14:56:37.949813",
      "osd_ops": ["write 0~4096"]
    },
    {
      "tid": 1873,
      "pg": "2.695e9f8e",
      "osd": 1,
      "last_sent": "2012-03-08 14:56:37.970615",
      "attempts": 1,
      "object_id": "fatty_25647_object1872",
      "object_locator": "@2",
      "snapid": "head",
      "snap_context": "0=[]",
      "mtime": "2012-03-08 14:56:37.970555",
      "osd_ops": ["write 0~4096"]
    }
  ],
  "linger_ops": [],
  "pool_ops": [],
  "pool_stat_ops": [],
  "statfs_ops": []
}
In this dump, two requests are in progress. The last_sent field is the time the RADOS request was sent. If this was a while ago, it suggests that the OSD is not responding. For example, for request 1858, you could check the OSD status with:
cephuser@adm > ceph pg map 2.d2041a48
osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]
This tells you to look at osd.1, the primary copy for this PG:
cephuser@adm > ceph daemon osd.1 ops
{
  "num_ops": 651,
  "ops": [
    {
      "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
      "received_at": "1331247573.344650",
      "age": "25.606449",
      "flag_point": "waiting for sub ops",
      "client_info": {
        "client": "client.4124",
        "tid": 1858
      }
    },
    ...
The flag_point field indicates that the OSD is currently waiting for replicas to respond, in this case osd.0.
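The first two steps of this procedure, spotting old requests in the objecter_requests dump and finding the primary OSD from the ceph pg map output, can be sketched in Python. This is a rough illustration under several assumptions: the function names are hypothetical, the 30-second cutoff is arbitrary, the timestamp format is the one shown in the example dump, and the first OSD in the acting set is taken to be the primary, as in the example output.

```python
import json
import re
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # format of last_sent in the example dump


def stale_ops(dump_text, now, max_age=timedelta(seconds=30)):
    """Return (tid, pg, osd, age) for every op sent longer ago than max_age."""
    stale = []
    for op in json.loads(dump_text)["ops"]:
        age = now - datetime.strptime(op["last_sent"], TIME_FORMAT)
        if age > max_age:
            stale.append((op["tid"], op["pg"], op["osd"], age))
    return stale


def acting_primary(pg_map_line):
    """Extract the first OSD of the acting set from 'ceph pg map' output."""
    match = re.search(r"acting \[(\d+)", pg_map_line)
    if match is None:
        raise ValueError("no acting set found in: %r" % pg_map_line)
    return int(match.group(1))
```

Each PG returned by stale_ops can then be passed to ceph pg map, and the OSD number from acting_primary identifies which OSD's ops output to inspect.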
11.5 Large OMAP issues
If you receive a cluster warning for a large OMAP issue, it means that an object has exceeded one of the OSD deep scrub large OMAP object thresholds, usually set to 200,000 keys and 1 GiB of storage. The relevant pool or PG can be found by grepping for large OMAP object, as mentioned in the warning message. Depending on the pool and the object key involved, the warning might indicate one of the following:
11.5.1 Resharding issues
If the large OMAP object name happens to be in the zone.rgw.buckets.index pool, this means that a bucket has more than 200,000 keys. Dynamic resharding is enabled by default in single-site clusters and would have resharded the bucket automatically. For multi-site clusters, however, dynamic resharding is not supported, and resharding must be done manually (see Section 21.10.1, “Bucket index resharding”). The bucket statistics show the total object count and the number of shards. This helps you determine whether the bucket has already been resharded, in which case the warning appears only because the deep scrub ran before the reshard process, and it will be cleared by the next deep scrub.
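The comparison between object count and shard count is simple arithmetic, sketched below in Python. The function names are hypothetical, and the estimate assumes one index key per object spread evenly across shards, which is only an approximation of how a real bucket index behaves.

```python
import math

LARGE_OMAP_KEY_THRESHOLD = 200_000  # key threshold quoted in section 11.5


def keys_per_shard(num_objects, num_shards):
    """Estimate index keys per shard, assuming an even spread across shards."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return math.ceil(num_objects / num_shards)


def min_shards_needed(num_objects, threshold=LARGE_OMAP_KEY_THRESHOLD):
    """Smallest shard count that keeps the estimate at or below the threshold."""
    return max(1, math.ceil(num_objects / threshold))
```

For example, a bucket with 1,000,000 objects in 4 shards is estimated at 250,000 keys per shard, which is above the threshold, whereas 5 shards bring the estimate down to 200,000.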
11.5.2 Reading usage statistics
Usage statistics are also stored as OMAP keys, in the zone.rgw.log pool. However, they are not trimmed automatically, so usage statistics must be trimmed manually. For example:
# radosgw-admin usage trim [--uid=user-id] --start-date=2019-01-01 --end-date=2019-03-03
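If trimming is run regularly, the command line can be generated from date values instead of being typed by hand. The Python sketch below only builds the argument list and does not execute anything. The function name is hypothetical, and it uses only the options shown in the example above.

```python
from datetime import date


def usage_trim_command(start, end, uid=None):
    """Build the radosgw-admin usage trim argument list for a date range."""
    if end < start:
        raise ValueError("end date must not precede start date")
    command = [
        "radosgw-admin", "usage", "trim",
        "--start-date=" + start.isoformat(),
        "--end-date=" + end.isoformat(),
    ]
    if uid is not None:
        # Optional user filter, placed first as in the example above.
        command.insert(3, "--uid=" + uid)
    return command
```

The resulting list can be passed to subprocess.run once you have reviewed the dates it contains.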