4 Troubleshooting OSDs #
  Before troubleshooting your OSDs, check your monitors and network first. If
  you execute ceph health or ceph -s on
  the command line and Ceph returns a health status, it means that the
  monitors have a quorum. If you do not have a monitor quorum or if there are
  errors with the monitor status, see
  Chapter 6, “Troubleshooting Ceph Monitors and Ceph Managers”. Check your networks to ensure
  they are running properly, because networks may have a significant impact on
  OSD operation and performance.
 
4.1 Obtain OSD data #
To begin troubleshooting the OSDs, gather all available information, including the data collected as described in Section 12.9, “Monitoring OSDs and placement groups”.
4.1.1 Finding Ceph logs #
    If you have not changed the default path, you can find Ceph log files at
    /var/log/ceph:
   
cephuser@adm > ls /var/log/ceph
If you do not get enough detail, change your logging level. See Chapter 2, “Troubleshooting logging and debugging” for more information.
4.1.2 Using the admin socket tool #
Use the admin socket tool to retrieve runtime information. First, list the sockets for your Ceph processes:
cephuser@adm > ls /var/run/ceph
Then execute the following:
cephuser@adm > ceph daemon DAEMON-NAME help
    Alternatively, specify a SOCKET-FILE:
   
cephuser@adm > ceph daemon SOCKET-FILE help
The admin socket, among other things, allows you to:
- List your configuration at runtime 
- Dump historic operations 
- Dump the operation priority queue state 
- Dump operations in flight 
- Dump perfcounters 
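As a rough sketch, the capabilities listed above map onto admin socket subcommands such as `config show`, `dump_historic_ops`, `dump_op_pq_state`, `dump_ops_in_flight` and `perf dump`. The `CEPH_CMD` variable and the `dump_osd_runtime_info` function are hypothetical helpers introduced for this example, not part of Ceph. Verify the subcommand names against the output of `ceph daemon DAEMON-NAME help` on your version.

```shell
#!/bin/sh
# Hypothetical override hook: set CEPH_CMD=echo to print the commands
# instead of running them (dry run).
CEPH_CMD=${CEPH_CMD:-ceph}

# dump_osd_runtime_info DAEMON-NAME
# Runs one admin socket subcommand per capability in the list above.
dump_osd_runtime_info() {
    daemon=$1
    for sub in "config show" dump_historic_ops dump_op_pq_state \
               dump_ops_in_flight "perf dump"; do
        echo "== $daemon: $sub =="
        # $sub is left unquoted on purpose so that "config show" and
        # "perf dump" are passed as two separate words.
        $CEPH_CMD daemon "$daemon" $sub
    done
}
```

Run the function on the host where the daemon's socket lives, for example `dump_osd_runtime_info osd.0`.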
4.1.3 Displaying free space #
    Filesystem issues may arise. To display your file system’s free space,
    execute df.
   
cephuser@adm > df -h
    Execute df --help for additional usage.
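To spot nearly full file systems quickly, the `df` output can be filtered by usage. The following is a minimal sketch: `fs_over_threshold` is a hypothetical helper name, and it assumes the POSIX output format of `df -P` and mount points without spaces.

```shell
#!/bin/sh
# fs_over_threshold PERCENT
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# file system whose usage is at or above PERCENT.
fs_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

For example, `df -P | fs_over_threshold 85` lists the file systems that are at least 85% full. The limit of 85 is only an illustrative value.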
   
4.1.4 Identifying I/O statistics #
Use iostat to identify I/O-related issues.
cephuser@adm > iostat -x
4.1.5 Retrieving diagnostic messages #
    To retrieve diagnostic messages, use dmesg with
    less, more, grep or
    tail. For example:
   
cephuser@adm > dmesg | grep scsi
4.2 Stopping without rebalancing #
   Periodically you may need to perform maintenance on a subset of your cluster
   or resolve a problem that affects a failure domain. If you do not want CRUSH
   to automatically rebalance the cluster as you stop OSDs for maintenance, set
   the cluster to noout first:
  
cephuser@adm > ceph osd set noout
   Once the cluster is set to noout, you can begin stopping
   the OSDs within the failure domain that requires maintenance work:
  
cephuser@adm > ceph orch daemon stop osd.ID
Placement groups within the OSDs you stop will become degraded while you are addressing issues within the failure domain.
Once you have completed your maintenance, restart the OSDs:
cephuser@adm > ceph orch daemon start osd.ID
   Finally, unset the cluster from noout:
  
cephuser@adm > ceph osd unset noout
4.3 OSDs not running #
   Under normal circumstances, restarting the ceph-osd
   daemon will allow it to rejoin the cluster and recover. If it does not,
   continue on to the next section.
  
4.3.1 An OSD will not start #
If an OSD will not start when you start your cluster and you are unable to determine the reason, we recommend generating a supportconfig and opening a support ticket. For more information, see “How do I submit a support case?”
4.3.2 Failing OSD #
    When a ceph-osd process dies, the monitor learns about
    the failure from surviving ceph-osd daemons and reports
    it via the ceph health command:
   
cephuser@adm > ceph health
HEALTH_WARN 1/3 in osds are down
    A warning displays whenever there are ceph-osd processes
    that are marked in and down. Identify
    which ceph-osds are down with:
   
cephuser@adm > ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
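Because `ceph health` prints the status word first, a monitoring script can turn it into an exit code. The sketch below is one way to do that. The function name `health_to_exit_code` and the exit code mapping are arbitrary choices made for this example, and `HEALTH_OK` is the status Ceph reports for a healthy cluster.

```shell
#!/bin/sh
# health_to_exit_code
# Reads `ceph health` output on stdin; returns 0 for HEALTH_OK,
# 1 for HEALTH_WARN, 2 for HEALTH_ERR and 3 for anything else.
health_to_exit_code() {
    read -r status _rest
    case $status in
        HEALTH_OK)   return 0 ;;
        HEALTH_WARN) return 1 ;;
        HEALTH_ERR)  return 2 ;;
        *)           return 3 ;;
    esac
}
```

Usage: `ceph health | health_to_exit_code`, then inspect `$?`.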
    If there is a disk failure or other fault preventing
    ceph-osd from functioning or restarting, an error
    message should be present in its log file. You can monitor the cephadm
    log in real time with the following command:
   
cephuser@adm > ceph -W cephadm
You can view the last few messages with:
cephuser@adm > ceph log last cephadm
    If the daemon stopped because of a heartbeat failure, the underlying kernel
    file system may be unresponsive. Check dmesg output for
    disk or other kernel errors.
   
4.3.3 Preventing write action to OSD #
    Ceph prevents writing to a full OSD so that you do not lose data. In an
    operational cluster, you should receive a warning when your cluster is
    getting near its full ratio. Three thresholds apply. The mon osd full
    ratio defaults to 0.95: at 95% of capacity, Ceph stops clients
    from writing data. The mon osd backfillfull ratio defaults to 0.90:
    at 90% of capacity, Ceph blocks backfills from starting. The OSD nearfull
    ratio defaults to 0.85: at 85% of capacity, Ceph generates a health
    warning. Change the nearfull ratio using the following command:
   
cephuser@adm > ceph osd set-nearfull-ratio <float[0.0-1.0]>
    Full cluster issues usually arise when testing how Ceph handles an OSD
    failure on a small cluster. When one node has a high percentage of the
    cluster’s data, the cluster can easily eclipse its nearfull and full
    ratio immediately. If you are testing how Ceph reacts to OSD failures on
    a small cluster, you should leave ample free disk space and consider
    temporarily lowering the OSD full ratio, OSD
    backfillfull ratio and OSD nearfull ratio
    using these commands:
   
cephuser@adm > ceph osd set-nearfull-ratio <float[0.0-1.0]>
cephuser@adm > ceph osd set-full-ratio <float[0.0-1.0]>
cephuser@adm > ceph osd set-backfillfull-ratio <float[0.0-1.0]>
    Full ceph-osds will be reported by ceph
    health:
   
cephuser@adm > ceph health
  HEALTH_WARN 1 nearfull osd(s)
Or:
cephuser@adm > ceph health detail
  HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
  osd.3 is full at 97%
  osd.4 is backfill full at 91%
  osd.2 is near full at 87%
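The relationship between the three default ratios and the states reported above can be illustrated with a small classifier. This is a sketch only: `classify_osd_usage` is a hypothetical helper, and treating a value exactly equal to a threshold as having reached it is an assumption of the example, not a statement about Ceph's internal comparison.

```shell
#!/bin/sh
# classify_osd_usage PERCENT [NEARFULL] [BACKFILLFULL] [FULL]
# Thresholds are percentages and default to the ratios quoted above
# (0.85, 0.90 and 0.95).
classify_osd_usage() {
    awk -v used="$1" -v near="${2:-85}" -v back="${3:-90}" -v full="${4:-95}" 'BEGIN {
        if      (used >= full) print "full"
        else if (used >= back) print "backfillfull"
        else if (used >= near) print "nearfull"
        else                   print "ok"
    }'
}
```

With the defaults, the utilization values from the listing above (97%, 91% and 87%) classify as `full`, `backfillfull` and `nearfull` respectively.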
    We recommend adding new ceph-osds to deal with a full
    cluster, allowing the cluster to redistribute data to the newly available
    storage. If you cannot start an OSD because it is full, you may delete some
    data by deleting some placement group directories in the full OSD.
   
If you choose to delete a placement group directory on a full OSD, do not delete the same placement group directory on another full OSD, or you may lose data. You must maintain at least one copy of your data on at least one OSD.
4.4 Unresponsive or slow OSDs #
A commonly recurring issue involves slow or unresponsive OSDs. Rule out other causes before delving into OSD performance issues. For example, ensure that your networks are working properly and that your OSDs are running. Check to see if OSDs are throttling recovery traffic.
4.4.1 Identifying bad sectors and fragmented disks #
Check your disks for bad sectors and fragmentation. Either can cause total throughput to drop substantially.
4.4.2 Co-resident monitors and OSDs #
Monitors are generally light-weight processes but often perform fsync(). This can interfere with other workloads, particularly if monitors run on the same drive as the OSDs. Additionally, if you run monitors on the same host as the OSDs, you may incur performance issues related to:
- Running an older kernel (pre-3.0) 
- Running a kernel with no syncfs(2) syscall. 
In these cases, multiple OSDs running on the same host can drag each other down by doing lots of commits. This often leads to bursty writes.
4.4.3 Co-resident processes #
Spinning up co-resident processes such as a cloud-based solution, virtual machines and other applications that write data to Ceph while operating on the same hardware as OSDs can introduce significant OSD latency. We recommend optimizing a host for use with Ceph and using other hosts for other processes. The practice of separating Ceph operations from other applications may help improve performance and may streamline troubleshooting and maintenance.
4.4.4 Logging levels #
    If you turned logging levels up to track an issue and then forgot to turn
    logging levels back down, the OSD may be putting a lot of logs onto the
    disk. If you intend to keep logging levels high, you may consider mounting
    a drive to the default path for logging. For example,
    /var/log/ceph/$cluster-$name.log.
   
4.4.5 Recovery throttling #
Depending upon your configuration, Ceph may reduce recovery rates to maintain performance or it may increase recovery rates to the point that recovery impacts OSD performance. Check to see if the OSD is recovering.
4.4.6 Kernel version #
Check the kernel version you are running. Older kernels may not receive new backports that Ceph depends upon for better performance.
4.4.7 Kernel issues with syncfs #
    Try running one OSD per host to see if performance improves. Old kernels
    might not have a recent enough version of glibc to support
    syncfs(2).
   
4.4.8 Filesystem issues #
Currently, we recommend deploying clusters with XFS.
4.4.9 Insufficient RAM #
We recommend 1.5 GB of RAM per TB of raw OSD capacity for each Object Storage Node. You may notice that during normal operations, the OSD only uses a fraction of that amount. Unused RAM makes it tempting to use the excess RAM for co-resident applications, VMs and so forth. However, when OSDs go into recovery mode, their memory utilization spikes. If there is no RAM available, the OSD performance will slow considerably.
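The guideline above is simple arithmetic, shown here as a worked sketch. The function name `recommended_ram_gb` and the example node (12 OSDs of 8 TB each) are hypothetical.

```shell
#!/bin/sh
# recommended_ram_gb RAW_CAPACITY_TB
# Applies the guideline of 1.5 GB of RAM per TB of raw OSD capacity.
recommended_ram_gb() {
    LC_ALL=C awk -v tb="$1" 'BEGIN { printf "%.1f\n", tb * 1.5 }'
}

# Hypothetical node: 12 OSDs of 8 TB each = 96 TB of raw capacity.
recommended_ram_gb $((12 * 8))    # prints 144.0
```

Treat the result as the amount to keep available for recovery, not as the amount the OSDs use during normal operation.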
4.4.10 Complaining about old or slow requests #
    If a ceph-osd daemon is slow to respond to a request,
    it generates log messages complaining about requests that are taking too
    long. The warning threshold defaults to 30 seconds and is configurable via
    the osd op complaint time option. When this happens, the
    cluster log receives messages. Legacy versions of Ceph complain about old
    requests:
   
osd.0 192.168.106.220:6800/18813 312 : [WRN] old request \ osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) \ v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
    Newer versions of Ceph complain about slow requests:
   
{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
{date} {osd.num}  [WRN] slow request 30.005692 seconds old, received at \
{date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 \
[write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
Possible causes include:
- A bad drive (check dmesg output)
- A bug in the kernel file system (check dmesg output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ceph-osd daemon
Possible solutions:
- Remove VMs from Ceph hosts 
- Upgrade kernel 
- Upgrade Ceph 
- Restart OSDs 
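To gauge how severe the slow requests are, the age of each request can be pulled out of log lines in the "slow request N seconds old" format shown above. This is a minimal sketch; `slow_request_ages` is a hypothetical helper, and it assumes the newer message format only.

```shell
#!/bin/sh
# slow_request_ages [MIN_SECONDS]
# Reads cluster log lines on stdin and prints the age in seconds of each
# individual "slow request" entry that is at least MIN_SECONDS old
# (default 0). The "N slow requests, ..." summary line is skipped.
slow_request_ages() {
    awk -v min="${1:-0}" '{
        for (i = 1; i < NF - 1; i++)
            if ($i == "slow" && $(i + 1) == "request") {
                if ($(i + 2) + 0 >= min + 0) print $(i + 2)
                break
            }
    }'
}
```

Pipe the cluster log, or a saved copy of it, into the function. Passing a value such as 60 limits the output to requests that have been blocked for at least a minute.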
4.4.11 Debugging slow requests #
    If you run ceph daemon osd.<id> dump_historic_ops or
    ceph daemon osd.<id> dump_ops_in_flight, you see a
    set of operations and a list of events each operation went through. These
    are briefly described below.
   
Events from the Messenger layer:
- header_read: When the messenger first started reading the message off the wire.
- throttled: When the messenger tried to acquire memory throttle space to read the message into memory.
- all_read: When the messenger finished reading the message off the wire.
- dispatched: When the messenger gave the message to the OSD.
- initiated: This is identical to header_read. The existence of both is a historical oddity.
Events from the OSD as it prepares operations:
- queued_for_pg: The op has been put into the queue for processing by its PG.
- reached_pg: The PG has started doing the op.
- waiting for *: The op is waiting for some other work to complete before it can proceed (e.g. a new OSDMap; for its object target to scrub; for the PG to finish peering; all as specified in the message).
- started: The op has been accepted as something the OSD should do and is now being performed.
- waiting for subops from: The op has been sent to replica OSDs.
Events from the FileStore:
- commit_queued_for_journal_write: The op has been given to the FileStore.
- write_thread_in_journal_buffer: The op is in the journal’s buffer and waiting to be persisted (as the next disk write).
- journaled_completion_queued: The op was journaled to disk and its callback queued for invocation.
Events from the OSD after data has been given to the local disk:
- op_commit: The op has been committed by the primary OSD.
- op_applied: The op has been write()’en to the backing FS on the primary.
- sub_op_applied: Same as op_applied, but for a replica’s sub-op.
- sub_op_committed: Same as op_commit, but for a replica’s sub-op (only for EC pools).
- sub_op_commit_rec/sub_op_apply_rec from <X>: The primary marks this when it hears about the above, but for a particular replica (i.e. <X>).
- commit_sent: We sent a reply back to the client (or primary OSD, for sub-ops).
Many of these events are seemingly redundant, but cross important boundaries in the internal code (such as passing data across locks into new threads).
4.5 OSD weight is 0 #
When an OSD starts, it is assigned a weight. The higher the weight, the greater the chance that the cluster writes data to the OSD. The weight is either specified in the cluster CRUSH Map or calculated by the OSD's start-up script.
4.6 OSD is down #
An OSD daemon is either running or stopped (down). There are three general reasons why an OSD is down:
- Hard disk failure. 
- The OSD crashed. 
- The server crashed. 
You can see the detailed status of OSDs by running:
cephuser@adm > ceph osd tree
# id  weight  type name up/down reweight
 -1    0.02998  root default
 -2    0.009995   host doc-ceph1
 0     0.009995      osd.0 up  1
 -3    0.009995   host doc-ceph2
 1     0.009995      osd.1 up  1
 -4    0.009995   host doc-ceph3
 2     0.009995      osd.2 down
   The example listing shows that osd.2 is down. Next,
   check whether the disk where the OSD is located is mounted:
  
root # lsblk -f
NAME                                                                                                  FSTYPE      LABEL UUID                                   FSAVAIL FSUSE% MOUNTPOINT
vda
├─vda1
├─vda2                                                                                                vfat        EFI   7BDD-40C3                                18.8M     6% /boot/efi
└─vda3                                                                                                ext4        ROOT  144d3eec-c193-4793-b6a9-1ac295259f4b     36.7G     5% /
vdb                                                                                                   LVM2_member       Fj8nlY-Dnmm-8Y0w-tJrO-8rAe-SUTH-lA2Ux2
└─ceph--6dc6f71f--ff02--4316--b145--c7804d5fb2f4-osd--block--2cb16f65--c87d--405d--a913--4e4ea6037ca1
vdc                                                                                                   LVM2_member       0101m1-mc83-K8ce-j4Dv-yxUO-5KPj-6FpuSu
└─ceph--1082eac0--4478--48cf--b8ab--4d3a62ca1c27-osd--data--185c9dd6--ce28--4b98--a9b6--9d5b1f444a4f
vdd                                                                                                   LVM2_member       wqE8sC-w5mx-J9t1-FoM1-vIRC-0xxq-5FPyuL
└─ceph--9e71d81a--e394--49b7--bef9--b639083be779-osd--data--eb94f8e6--af6d--4ef6--911b--7f44f7484c85
   You can track the reason why the OSD is down by inspecting its log file. See
   Section 4.3.2, “Failing OSD” for instructions on how to view
   the log files. After you find and fix the reason why the OSD is not running,
   start it with the following command. To identify the unique FSID of the
   cluster, run ceph fsid. To identify the OSD daemon
   name, run ceph orch ps --hostname
   HOSTNAME.
  
root # systemctl start ceph-FSID@osd.2
   Do not forget to replace 2 with the actual number of your
   stopped OSD.
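On clusters with many OSDs, the down entries can be picked out of `ceph osd tree` output and turned into start commands. The following sketch only prints the commands and does not run them. `down_osds` and `print_start_commands` are hypothetical helpers, and the parsing assumes that the status column directly follows the OSD name, as in the listing earlier in this section.

```shell
#!/bin/sh
# down_osds
# Reads `ceph osd tree` output on stdin and prints the name of every OSD
# whose status column is "down".
down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
    }'
}

# print_start_commands FSID
# Reads the same input and prints (does not run) one systemctl command
# per down OSD, using the unit name pattern shown above.
print_start_commands() {
    down_osds | while read -r osd; do
        printf 'systemctl start ceph-%s@%s\n' "$1" "$osd"
    done
}
```

For example, `ceph osd tree | print_start_commands "$(ceph fsid)"`. Run each printed command on the host that carries the OSD in question.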
  
4.7 Finding slow OSDs #
When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. The reason is that if the data is written to the slow(est) disk, the complete write operation slows down as it always waits until it is finished on all the related disks.
It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:
cephuser@adm > ceph tell osd.OSD_ID_NUMBER bench
For example:
cephuser@adm > ceph tell osd.0 bench
 { "bytes_written": 1073741824,
   "blocksize": 4194304,
   "bytes_per_sec": "19377779.000000"}
   Run this command on each OSD and compare the
   bytes_per_sec values to identify the slowest OSDs.
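Comparing the results by hand becomes tedious on larger clusters. The sketch below extracts `bytes_per_sec` from the benchmark output and ranks the OSDs, slowest first. The helper names are hypothetical, and the parsing assumes the JSON layout shown in the example above, with the value either quoted or unquoted.

```shell
#!/bin/sh
# bench_bytes_per_sec
# Extracts the bytes_per_sec value from `ceph tell osd.N bench` output
# on stdin, whether or not the number is quoted.
bench_bytes_per_sec() {
    sed -n 's/.*"bytes_per_sec": *"\{0,1\}\([0-9.]*\).*/\1/p'
}

# rank_osds_by_speed
# Reads "OSD BYTES_PER_SEC" pairs on stdin and prints them slowest first,
# converted to MB/s (1 MB = 1000000 bytes).
rank_osds_by_speed() {
    LC_ALL=C sort -k2,2n |
        LC_ALL=C awk '{ printf "%s %.1f MB/s\n", $1, $2 / 1000000 }'
}

# Example (needs a running cluster; `ceph osd ls` prints one OSD ID per line):
# for id in $(ceph osd ls); do
#     echo "osd.$id $(ceph tell "osd.$id" bench | bench_bytes_per_sec)"
# done | rank_osds_by_speed
```

Each benchmark writes data to the OSD, so consider running the loop outside peak hours.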
  
4.8 Flapping OSDs #
We recommend using both a public (front-end) network and a cluster (back-end) network so that you can better meet the capacity requirements of object replication. Another advantage is that you can run a cluster network such that it is not connected to the internet, thereby preventing some denial of service attacks. When OSDs peer and check heartbeats, they use the cluster (back-end) network when it’s available.
However, if the cluster (back-end) network fails or develops significant latency while the public (front-end) network operates optimally, OSDs currently do not handle this situation well. What happens is that OSDs mark each other down on the monitor, while marking themselves up. This is called flapping.
If something is causing OSDs to flap (repeatedly getting marked down and then up again), you can force the monitors to stop the flapping with:
cephuser@adm > ceph osd set noup # prevent OSDs from getting marked up
cephuser@adm > ceph osd set nodown # prevent OSDs from getting marked down
These flags are recorded in the osdmap structure:
cephuser@adm > ceph osd dump | grep flags
flags no-up,no-down
You can clear the flags with:
cephuser@adm > ceph osd unset noup
cephuser@adm > ceph osd unset nodown
   Two other flags are supported, noin and
   noout, which prevent booting OSDs from being marked in
   (allocated data) or protect OSDs from eventually being marked out
   (regardless of what the current value for mon osd down
   out interval is).
  
noup, noout, and nodown
    are temporary in the sense that once the flags are cleared, the action they
    were blocking should occur shortly after. The noin flag,
    on the other hand, prevents OSDs from being marked in on
    boot, and any daemons that started while the flag was set will remain that
    way.
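Because a forgotten flag changes cluster behavior silently, it can help to check the flags line of `ceph osd dump` from a script. The sketch below uses a hypothetical helper, `osd_flag_is_set`. The sample output above spells the flags with a hyphen (`no-up`), and the sketch strips hyphens so that an unhyphenated spelling matches as well; that some versions print `noup` is an assumption of this example.

```shell
#!/bin/sh
# osd_flag_is_set FLAG
# Reads `ceph osd dump` output on stdin and succeeds (exit 0) if FLAG
# appears on the "flags ..." line. Hyphens are removed on both sides so
# that "no-up" and "noup" are treated as the same flag.
osd_flag_is_set() {
    want=$(printf '%s' "$1" | tr -d '-')
    sed -n 's/^flags //p' | tr -d '-' | tr ',' '\n' | grep -qx "$want"
}
```

For example, `ceph osd dump | osd_flag_is_set noout && echo "noout is still set"` is a quick check after maintenance.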