Applies to SUSE Enterprise Storage 7.1

11 Troubleshooting Object Gateway

11.1 Running a basic health check

The most basic health check for a running Object Gateway process is to point your browser or client at the Object Gateway endpoint. This should return an empty bucket list for the anonymous user:

root@master # curl -I

If the request returns a 405 status, this indicates MethodNotAllowed. The output will be a NoSuchBucket XML document. If the request returns a 404 status, either the Object Gateway DNS name is misconfigured or the request was not sent to an endpoint that resolves to the Object Gateway DNS name.
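The status codes above can be turned into a small helper for scripted checks. This is a minimal sketch, not part of the product: the function name rgw_status_hint and the RGW_ENDPOINT variable are hypothetical, the 200 case assumes that a healthy anonymous request succeeds, and 000 is what curl reports for %{http_code} when no HTTP response was received.

```shell
#!/bin/sh
# Map the HTTP status of an anonymous request to the hints given above.
# rgw_status_hint is a hypothetical helper name.
rgw_status_hint() {
    case "$1" in
        200) echo "200: gateway answered the anonymous request" ;;
        404) echo "404: check the Object Gateway DNS name and the endpoint used" ;;
        405) echo "405: MethodNotAllowed" ;;
        000) echo "000: no HTTP response, gateway unreachable or not running" ;;
        *)   echo "$1: unexpected status" ;;
    esac
}

# Usage (RGW_ENDPOINT is a placeholder that you must set yourself):
#   code=$(curl -s -I -o /dev/null -w '%{http_code}' "$RGW_ENDPOINT")
#   rgw_status_hint "$code"
```

The helper only classifies the status code, so it can be combined with whatever curl invocation matches your endpoint and TLS setup.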

11.2 Identifying gateway issues

If the gateway is not running, restarting it (automatic under systemd) usually restores the service. To restart it manually, run:

cephuser@adm > ceph orch daemon restart rgw-daemon-name

To increase the log level, raise the Object Gateway debug setting (debug_rgw) in the client.RGW_NAME section from 1 (default) up to 20 (very verbose). For messages sent to the cluster itself, the messenger debug level can also be raised (off by default). This is controlled via debug ms; a level of 1 is usually sufficient to gather enough information. To inject arguments into a running process:

Note

This is a temporary change.

cephuser@adm > ceph daemon CLIENT_RGW_NAME config set debug_rgw 20

For persistent changes, or for a process that is not running, set these options in the ceph.conf file under the client.rgw-name section or, alternatively, use the Ceph CLI. These debug levels are extremely verbose, so it is best to enable them only for a short time to capture error logs. To set the log level using the Ceph CLI, run:

cephuser@adm > ceph config set CLIENT_RGW_NAME debug_rgw 20
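Because these levels are so verbose, it can help to script a short capture window that raises the levels and then removes the overrides again. The following is a minimal sketch under stated assumptions: rgw_debug_capture is a hypothetical helper name, the configuration section name is a placeholder you supply, and the overrides are removed with ceph config rm so that the options fall back to their defaults. Setting CEPH=echo prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: raise the Object Gateway debug levels for a limited time, then revert.
# Override CEPH (for example CEPH=echo) to preview the commands as a dry run.
CEPH="${CEPH:-ceph}"

rgw_debug_capture() {
    who="$1"             # configuration section of the gateway (placeholder)
    seconds="${2:-60}"   # length of the capture window
    $CEPH config set "$who" debug_rgw 20
    $CEPH config set "$who" debug_ms 1
    sleep "$seconds"
    # Remove the overrides so that both options return to their defaults.
    $CEPH config rm "$who" debug_rgw
    $CEPH config rm "$who" debug_ms
}

# Usage: rgw_debug_capture CLIENT_RGW_NAME 120
```

Reproduce the failing request during the capture window, then collect the gateway log before it rotates.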

11.3 Diagnosing crashed Object Gateway process

If the Object Gateway process dies, you will normally see a connection refused error at the client. In that situation, restarting Object Gateway restores the service.

To diagnose the cause of the crash, check the log in /var/log/ceph or the core file (if one was generated).

11.4 Identifying blocked Object Gateway requests

If some (or all) Object Gateway requests appear to be blocked, you can get some insight into the internal state of the Object Gateway daemon via its admin socket. By default, there will be a socket configured to reside in /var/run/ceph, and the daemon can be queried with:

cephuser@adm > ceph daemon /var/run/ceph/client.rgw-name help
  help                list available commands
  objecter_requests   show in-progress osd requests
  perfcounters_dump   dump perfcounters value
  perfcounters_schema dump perfcounters schema
  version             get protocol version

Of particular interest:

cephuser@adm > ceph daemon /var/run/ceph/client.rgw objecter_requests
...

This dumps information about the requests currently in progress with the RADOS cluster, which allows you to identify whether any requests are blocked by a non-responsive OSD. For example, you might see:

{ "ops": [
      { "tid": 1858,
        "pg": "2.d2041a48",
        "osd": 1,
        "last_sent": "2012-03-08 14:56:37.949872",
        "attempts": 1,
        "object_id": "fatty_25647_object1857",
        "object_locator": "@2",
        "snapid": "head",
        "snap_context": "0=[]",
        "mtime": "2012-03-08 14:56:37.949813",
        "osd_ops": [
              "write 0~4096"]},
      { "tid": 1873,
        "pg": "2.695e9f8e",
        "osd": 1,
        "last_sent": "2012-03-08 14:56:37.970615",
        "attempts": 1,
        "object_id": "fatty_25647_object1872",
        "object_locator": "@2",
        "snapid": "head",
        "snap_context": "0=[]",
        "mtime": "2012-03-08 14:56:37.970555",
        "osd_ops": [
              "write 0~4096"]}],
"linger_ops": [],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": []}

In this dump, two requests are in progress. The last_sent field is the time the RADOS request was sent. If that time is well in the past, it suggests that the OSD is not responding. For example, for request 1858, you could check the OSD status with:

cephuser@adm > ceph pg map 2.d2041a48
osdmap e9 pg 2.d2041a48 (2.0) -> up [1,0] acting [1,0]

This tells you to look at osd.1, the primary copy for this PG:

  ceph daemon osd.1 ops
  { "num_ops": 651,
   "ops": [
         { "description": "osd_op(client.4124.0:1858 fatty_25647_object1857 [write 0~4096] 2.d2041a48)",
           "received_at": "1331247573.344650",
           "age": "25.606449",
           "flag_point": "waiting for sub ops",
           "client_info": { "client": "client.4124",
               "tid": 1858}},
  ...

The flag_point field indicates that the OSD is currently waiting for replicas to respond, in this case osd.0.
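The walk-through above (list the in-progress requests, then find the primary OSD for a PG) can be sketched with two small filters. This is an illustration only: objecter_summary and acting_primary are hypothetical helper names, the awk filter assumes the one-field-per-line layout shown in the example dump, and a proper JSON parser is the more robust choice for real use.

```shell
#!/bin/sh
# Print tid, PG, OSD and send time for each op in an objecter_requests dump
# read from stdin. Assumes the one-field-per-line layout shown above.
objecter_summary() {
    awk '
        function value(line, key) {
            sub(".*\"" key "\": *", "", line)   # drop everything before the value
            gsub(/[",]/, "", line)              # drop quotes and the trailing comma
            return line
        }
        /"tid":/       { tid = value($0, "tid") }
        /"pg":/        { pg  = value($0, "pg") }
        /"osd":/       { osd = value($0, "osd") }
        /"last_sent":/ { print "tid=" tid, "pg=" pg, "osd." osd, "last_sent=" value($0, "last_sent") }
    '
}

# Print the first OSD of the acting set from "ceph pg map" output on stdin.
acting_primary() {
    sed -n 's/.*acting \[\([0-9][0-9]*\).*/\1/p'
}

# Usage:
#   ceph daemon /var/run/ceph/client.rgw objecter_requests | objecter_summary
#   ceph pg map 2.d2041a48 | acting_primary
```

Comparing each last_sent value with the current time shows which requests have been waiting longest, and acting_primary gives the OSD to inspect next.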

11.5 Large OMAP issues

If you receive a cluster warning about a large OMAP object, an object has exceeded one of the OSD deep scrub large OMAP object thresholds, usually set to 200,000 keys and 1 GiB of storage. The relevant pool or PG can be found by grepping for large OMAP object, as mentioned in the warning message. Depending on the pool and the object key involved, it might indicate one of the following:

11.5.1 Resharding issues

If the large OMAP object is in the zone.rgw.buckets.index pool, this means that a bucket has more than 200,000 keys. Dynamic resharding is enabled by default in single-site clusters and would have resharded the bucket automatically. For multi-site clusters, however, dynamic resharding is not supported, and the bucket has to be resharded manually (see Section 21.10.1, “Bucket index resharding”). Looking at the bucket statistics reveals the total object count and the number of shards. This helps you understand whether the bucket has already been resharded, in which case the warning appears only because the deep scrub ran before the reshard process, and it will be cleared by the next deep scrub.
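The comparison of object count and shard count can be done with simple arithmetic. The sketch below is an assumption-laden illustration: keys_per_shard and shard_check are hypothetical helper names, the object count and shard count are values you read from the bucket statistics (for example from radosgw-admin bucket stats --bucket=BUCKET_NAME) and pass in by hand, and the object count is treated as a rough stand-in for the number of index keys.

```shell
#!/bin/sh
LARGE_OMAP_KEYS=200000   # key threshold quoted in the text above

# Average number of index keys per shard, rounded up.
keys_per_shard() {
    objects="$1"
    shards="$2"
    [ "$shards" -gt 0 ] || shards=1   # assumption: 0 shards means one index object
    echo $(( (objects + shards - 1) / shards ))
}

# Report whether the average shard is above the large OMAP key threshold.
shard_check() {
    per=$(keys_per_shard "$1" "$2")
    if [ "$per" -gt "$LARGE_OMAP_KEYS" ]; then
        echo "reshard needed: about $per keys per shard"
    else
        echo "ok: about $per keys per shard"
    fi
}

# Usage: shard_check OBJECT_COUNT SHARD_COUNT
```

If the check reports ok for a bucket that still triggers the warning, the bucket has probably been resharded already and the warning should clear after the next deep scrub.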

11.5.2 Reading usage statistics

Usage statistics are also stored as OMAP keys in the zone.rgw.log pool. However, these are not trimmed automatically, so usage statistics have to be trimmed manually. For example:

# radosgw-admin usage trim [--uid=user-id] --start-date=2019-01-01 --end-date=2019-03-03
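When several users need trimming for the same date range, the command above can be wrapped in a loop. This is a minimal sketch: trim_usage is a hypothetical helper name, the user IDs are placeholders, and only the options shown in the example above are used. Setting RADOSGW_ADMIN=echo prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: trim usage statistics for a list of users over one date range.
# Override RADOSGW_ADMIN (for example RADOSGW_ADMIN=echo) for a dry run.
RADOSGW_ADMIN="${RADOSGW_ADMIN:-radosgw-admin}"

trim_usage() {
    start="$1"
    end="$2"
    shift 2
    # Remaining arguments are user IDs; run one trim per user.
    for uid in "$@"; do
        $RADOSGW_ADMIN usage trim --uid="$uid" --start-date="$start" --end-date="$end"
    done
}

# Usage: trim_usage 2019-01-01 2019-03-03 USER_ID_1 USER_ID_2
```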