Applies to SUSE Enterprise Storage 7

6 Troubleshooting Ceph Monitors and Ceph Managers

6.1 Initial troubleshooting

Q: Are the monitors running?

Ensure the monitors are running. This is especially important to check if you have performed an upgrade and have not manually restarted the monitors.

Q: Are you able to connect to the monitor’s servers?

Occasionally, iptables rules block access to monitor servers or monitor ports. Such rules are often left over from monitor stress testing and then forgotten. We recommend trying to connect to the server via SSH and, if that succeeds, connecting to the monitor's port using your tool of choice (such as telnet or netcat).
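The port check described above can also be scripted. The following is a minimal sketch, not a SUSE-provided tool: the function name can_connect is hypothetical, and the default port 6789 is the monitor default mentioned later in this chapter. It only tests whether a TCP connection can be opened, in the same way telnet or netcat would.

```python
import socket


def can_connect(host, port=6789, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, can_connect("MON_HOST") returns False when a firewall rule drops or rejects the connection, where MON_HOST stands for the address of one of your monitor servers.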

Q: Does ceph -s run and obtain a reply from the cluster?

If the answer is yes, then your cluster is up and running. The monitors only answer a status request if a quorum has formed. If ceph -s blocks without obtaining a reply from the cluster, or shows many fault messages, then your monitors are likely either all down or only partly up, with too few running to form a quorum (keep in mind that a quorum is formed by a majority of monitors).
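To make the majority rule concrete, here is a small sketch of the arithmetic. The function names are illustrative only and are not part of any Ceph tool.

```python
def quorum_size(total_monitors):
    """Smallest number of monitors that forms a strict majority."""
    if total_monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return total_monitors // 2 + 1


def can_form_quorum(total_monitors, running_monitors):
    """True if the running monitors are a majority of all monitors in the monmap."""
    return running_monitors >= quorum_size(total_monitors)
```

With three monitors in the monmap, two must be running to form a quorum; with five monitors, three are needed.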

Q: What if ceph -s does not finish?

Contact each monitor individually for its status, regardless of whether a quorum has formed. To do so, run ceph tell mon.ID mon_status, where ID is the monitor's identifier. Perform this for each monitor in the cluster. Section 6.3, “Understanding mon_status” explains how to interpret the output of this command.
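Querying every monitor in turn can be scripted. The sketch below is one possible approach, assuming the ceph command is available on the host where it runs. The helper names and the 10-second timeout are assumptions made for illustration. A monitor that does not answer in time is recorded as None instead of blocking the loop.

```python
import subprocess


def mon_status_command(mon_id):
    """Build the 'ceph tell mon.ID mon_status' command for one monitor."""
    return ["ceph", "tell", "mon.{}".format(mon_id), "mon_status"]


def query_monitors(mon_ids, timeout=10.0):
    """Return a dict mapping each monitor ID to its raw output, or None on failure."""
    results = {}
    for mon_id in mon_ids:
        try:
            proc = subprocess.run(
                mon_status_command(mon_id),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            results[mon_id] = proc.stdout if proc.returncode == 0 else None
        except (subprocess.TimeoutExpired, FileNotFoundError):
            # No reply in time, or the ceph command is not installed here.
            results[mon_id] = None
    return results
```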

6.2 Using the monitor's admin socket

The admin socket allows you to interact with a given daemon directly using a Unix socket file. This file can be found in your monitor's run directory. By default, the admin socket will be kept in /var/run/ceph/ceph-mon.ID.asok but this can vary if you defined it otherwise. If you are unable to find it there, check your ceph.conf for an alternative path or run:

cephuser@adm > ceph-conf --name mon.ID --show-config-value admin_socket

Keep in mind that the admin socket is only available while the monitor is running. When the monitor is shut down properly, the admin socket is removed. If the monitor is not running but the admin socket still exists, the monitor was likely shut down improperly. In either case, if the monitor is not running, you cannot use the admin socket, and ceph will likely return Error 111: Connection Refused. To access the admin socket, run ceph tell on the daemon you are interested in. For example:

cephuser@adm > ceph tell mon.ID mon_status

This passes the command mon_status to the running MON daemon ID via the admin socket, which is a file ending in .asok somewhere under /var/run/ceph. When you know the full path to the file, you can run the following:

cephuser@adm > ceph --admin-daemon PATH_TO_FILE COMMAND

Using help as the command to the ceph tool shows the supported commands available through the admin socket. Take a look at config get, config show, mon stat and quorum_status, as those can be enlightening when troubleshooting a monitor.
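As a rough illustration of the default location described in this section, the following sketch builds the default admin socket path for a monitor and checks whether a socket file exists there. The function names are hypothetical. Note that an existing socket file does not prove that the monitor is running, because a stale file can remain after an improper shutdown.

```python
import os
import stat


def default_admin_socket(mon_id, run_dir="/var/run/ceph"):
    """Default admin socket path for monitor mon_id, as described above."""
    return os.path.join(run_dir, "ceph-mon.{}.asok".format(mon_id))


def is_socket(path):
    """True if path exists and is a Unix socket file."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing file or insufficient permissions.
        return False
```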

6.3 Understanding mon_status

mon_status can be obtained via the admin socket. This command outputs a wealth of information about the monitor, including the same output you would get with quorum_status. For example, the following is sample output of ceph tell mon.c mon_status:

  { "name": "c",
  "rank": 2,
  "state": "peon",
  "election_epoch": 38,
  "quorum": [
        1,
        2],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 3,
      "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
      "modified": "2013-10-30 04:12:01.945629",
      "created": "2013-10-29 14:14:41.914786",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "127.0.0.1:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "127.0.0.1:6790\/0"},
            { "rank": 2,
              "name": "c",
              "addr": "127.0.0.1:6795\/0"}]}}

This example shows that there are three monitors in the monmap (a, b and c), that the quorum is formed by only two monitors, and that c is in the quorum as a peon. This means that monitor a is out of quorum. The reason is that the quorum set contains two entries: 1 and 2. These are not monitor names but monitor ranks, as established in the current monmap. The missing monitor is therefore the one with rank 0, and according to the monmap that is mon.a.
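The reasoning above (compare the ranks in quorum against the ranks in the monmap) can be sketched in a few lines. This is an illustrative helper, not part of Ceph. It assumes the JSON layout shown in the sample output, with a top-level quorum list and a monmap.mons list whose entries carry rank and name.

```python
import json


def monitors_outside_quorum(mon_status_json):
    """Return the names of monmap monitors whose rank is not in the quorum list."""
    status = json.loads(mon_status_json)
    ranks_in_quorum = set(status["quorum"])
    return [
        mon["name"]
        for mon in status["monmap"]["mons"]
        if mon["rank"] not in ranks_in_quorum
    ]
```

Applied to the sample output above, the helper reports monitor a, the only monitor whose rank (0) is missing from the quorum list.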

Ranks are (re)calculated whenever you add or remove monitors and follow this rule: the lower the IP:PORT combination, the lower the rank. In this case, because 127.0.0.1:6789 is lower than all the remaining IP:PORT combinations, mon.a has rank 0.
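The ordering in the example, where the lowest address 127.0.0.1:6789 receives rank 0, can be reproduced with a simple sort. The sketch below only mirrors that example. It is not Ceph's internal implementation, and the function names are hypothetical.

```python
import ipaddress


def address_key(addr):
    """Turn 'IP:PORT/NONCE' into a sortable (ip_version, ip_as_int, port) tuple."""
    endpoint = addr.split("/", 1)[0]          # drop the trailing '/0' part
    host, port = endpoint.rsplit(":", 1)
    ip = ipaddress.ip_address(host.strip("[]"))
    return (ip.version, int(ip), int(port))


def ranks_by_address(mons):
    """Map monitor name to rank, giving the lowest address rank 0."""
    ordered = sorted(mons, key=lambda name: address_key(mons[name]))
    return {name: rank for rank, name in enumerate(ordered)}
```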

6.4 Restoring the MONs quorum

If the Ceph Monitors cannot form a quorum, cephadm will not be able to manage the cluster until the quorum is restored. To restore the Ceph Monitor quorum, remove unhealthy Ceph Monitors from the monmap by following these steps:

  1. Stop all Ceph Monitors. Log in to each Ceph Monitor host via SSH and run the following command there:

    cephuser@adm > cephadm unit --name mon.`hostname` stop
  2. Identify a surviving monitor by logging in to that host via SSH and running:

    cephuser@adm > cephadm shell --name mon.`hostname`
  3. Extract a copy of the monmap to a file, for example /tmp/monmap. Note that MON_ID is usually identical to the string that the hostname command returns:

    ceph-mon -i MON_ID --extract-monmap /tmp/monmap
  4. Remove the non-surviving or problematic monitors. For example, if you have three monitors, mon.a, mon.b, and mon.c where only mon.a is surviving, follow this example:

    cephuser@adm > monmaptool /tmp/monmap --rm b
    cephuser@adm > monmaptool /tmp/monmap --rm c
  5. Inject the modified map, from which the failed monitors have been removed, into the surviving monitor(s). For example, to inject the map into the monitor mon.a, follow this example:

    cephuser@adm > ceph-mon -i a --inject-monmap /tmp/monmap
  6. Start only the surviving monitors, and verify that the monitors form a quorum with the ceph -s command.

Note

You may wish to archive the removed monitors' data directories from /var/lib/ceph/mon in a safe location, or delete them if you are confident that the remaining monitors are healthy and sufficiently redundant.
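Step 4 of the procedure above removes every monitor that did not survive. The following sketch derives the corresponding monmaptool invocations from a list of all monitors and a list of survivors. The helper name is hypothetical, and the commands are only built, not executed, so you can review them first.

```python
def removal_commands(all_mons, surviving, monmap_path="/tmp/monmap"):
    """Build one 'monmaptool PATH --rm NAME' command per non-surviving monitor."""
    survivors = set(surviving)
    if not survivors:
        raise ValueError("at least one surviving monitor is required")
    unknown = survivors - set(all_mons)
    if unknown:
        raise ValueError("survivors not in monmap: {}".format(sorted(unknown)))
    return [
        ["monmaptool", monmap_path, "--rm", name]
        for name in all_mons
        if name not in survivors
    ]
```

For the three-monitor example in step 4, where only mon.a survives, the helper yields the two removal commands for b and c.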

6.5 Most common monitor issues

6.5.1 Have quorum but at least one monitor is down

When this happens, depending on the version of Ceph you are running, you should be seeing something similar to:

cephuser@adm > ceph health detail
[snip]
mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

To troubleshoot, make sure that mon.a is running. After that, make sure you are able to connect to mon.a's server from the other monitors' servers. Check the ports as well. Check iptables on all your monitor nodes and make sure you are not dropping or rejecting connections. If this initial troubleshooting does not solve your problems, check the problematic monitor's mon_status via the admin socket. Considering the monitor is out of the quorum, its state should be one of probing, electing or synchronizing. If it happens to be either leader or peon, then the monitor believes itself to be in the quorum, while the remaining cluster is sure it is not; or maybe it got into the quorum while we were troubleshooting the monitor. Check using ceph -s again, just to make sure. Continue if the monitor is not yet in quorum.

1. What if the state is probing?

This means the monitor is still looking for the other monitors. Every time you start a monitor, the monitor stays in this state for some time while trying to find the rest of the monitors specified in the monmap. The time a monitor spends in this state can vary. For instance, when on a single-monitor cluster, the monitor passes through the probing state almost instantaneously, since there are no other monitors around. On a multi-monitor cluster, the monitors will stay in this state until they find enough monitors to form a quorum – this means that if you have 2 out of 3 monitors down, the one remaining monitor stays in this state indefinitely until one of the other monitors is brought up manually.

If there is a quorum, the monitor should be able to find the remaining monitors as long as they can be reached. If your monitor is stuck probing and you have gone through all the communication troubleshooting, then there is a chance that the monitor is trying to reach the other monitors on a wrong address. mon_status outputs the monmap known to the monitor; check whether the other monitors' locations match reality. If they do not, the problem may be a broken monmap. If they do, it may be related to severe clock skews among the monitor nodes, and you should refer to Section 6.5.2, “Fixing clock skews”.

2. What if the state is electing?

This means the monitor is in the middle of an election. Elections should complete quickly, but at times the monitors can get stuck electing. This is usually a sign of a clock skew among the monitor nodes. See Section 6.5.2, “Fixing clock skews” for more information. This state is unlikely to persist: aside from old bugs, there is no obvious reason other than clock skews for it to happen.

3. What if the state is synchronizing?

This means the monitor is synchronizing with the rest of the cluster in order to join the quorum. However, if you notice that the monitor jumps from synchronizing to electing and then back to synchronizing, then it can mean that the cluster state is advancing (i.e., generating new maps) way too fast for the synchronization process to keep up.

4. What if the state is leader or peon?

This should not happen. If it does, it is likely related to clock skews. See Section 6.5.2, “Fixing clock skews” for more information. If you see no issue with the clock skews, prepare your logs and reach out to your support representative.
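The four cases above amount to a lookup from the state reported by mon_status to a next step. The sketch below condenses them into a table. The wording of each hint is a summary of this section, and the names NEXT_STEP and next_step are illustrative.

```python
# Hints for a monitor that 'ceph health detail' reports as out of quorum.
NEXT_STEP = {
    "probing": "Check connectivity, then compare the monmap addresses in "
               "mon_status with reality; if they match, check for clock skews.",
    "electing": "Check for clock skews among the monitor nodes.",
    "synchronizing": "Wait for the sync; if the state alternates with electing, "
                     "the cluster state may be advancing too fast.",
    "leader": "Unexpected when out of quorum: check for clock skews, "
              "then prepare logs for support.",
    "peon": "Unexpected when out of quorum: check for clock skews, "
            "then prepare logs for support.",
}


def next_step(state):
    """Return the troubleshooting hint for a monitor state."""
    return NEXT_STEP.get(state, "Unrecognized state: {}".format(state))
```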

6.5.2 Fixing clock skews

Monitors can be severely affected by significant clock skews across the monitor nodes. This usually translates into weird behavior with no obvious cause. To avoid such issues, run a clock synchronization tool on your monitor nodes.

By default, the maximum tolerated clock skew allows clocks to drift up to 0.05 seconds. This value is configurable via the mon-clock-drift-allowed option; however, we do not recommend changing it. The clock skew mechanism is in place because a clock-skewed monitor may not behave properly. Changing this value without testing it first may have unforeseen effects on the stability of the monitors and on overall cluster health, although there is no risk of data loss.

The monitors will warn you if there is a clock skew by sending a HEALTH_WARN alert. Run the ceph health detail command to determine which monitor is flagging a clock skew. For example:

mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

If you have a clock skew, synchronize your clocks. Running an NTP client may help. If you are already using one and still hit this sort of issue, check whether you are using an NTP server remote to your network and consider hosting your own NTP server on your network. This last option tends to reduce the number of issues with monitor clock skews.
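When many monitors are involved, it can help to extract the figures from the clock skew message shown above. The following sketch parses a line in exactly that format. The regular expression is derived from the single sample line in this section, so other Ceph versions may word the message differently.

```python
import re

# Pattern derived from the sample 'ceph health detail' line above.
SKEW_RE = re.compile(
    r"^(?P<mon>mon\.\S+) addr (?P<addr>\S+) "
    r"clock skew (?P<skew>-?[\d.]+)s > max (?P<max>[\d.]+)s"
)


def parse_clock_skew(line):
    """Return monitor, address, skew and maximum from a clock skew line, or None."""
    match = SKEW_RE.match(line.strip())
    if match is None:
        return None
    return {
        "mon": match.group("mon"),
        "addr": match.group("addr"),
        "skew": float(match.group("skew")),
        "max": float(match.group("max")),
    }
```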

6.5.3 Connecting and mounting to the client

If you cannot connect or mount to the client, check your iptables. Some OS install utilities add a REJECT rule to iptables. The rule rejects all clients trying to connect to the host except for SSH. If your monitor host's iptables configuration has a REJECT rule in place, clients connecting from a separate node will fail to mount with a timeout error. You need to address iptables rules that reject clients trying to connect to Ceph daemons. For example, address rules that look similar to this:

REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

You may also need to add rules to iptables on your Ceph hosts to ensure that clients can access the ports associated with your Ceph monitors (for example, port 6789 by default) and Ceph OSDs (for example, 6800 through 7300 by default). For example:

iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT
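To avoid typing mistakes in the {ip-address}/{netmask} placeholder, the rule above can be generated from a validated network. This is a sketch only: the function name is hypothetical, it handles IPv4 networks, and it merely builds the argument list without running iptables.

```python
import ipaddress


def accept_rule(client_network, mon_port=6789, osd_ports=(6800, 7300)):
    """Build the iptables ACCEPT rule shown above for one IPv4 client network."""
    # IPv4Network raises ValueError for malformed input or set host bits.
    network = ipaddress.IPv4Network(client_network, strict=True)
    dports = "{},{}:{}".format(mon_port, osd_ports[0], osd_ports[1])
    return [
        "iptables", "-A", "INPUT", "-m", "multiport", "-p", "tcp",
        "-s", str(network), "--dports", dports, "-j", "ACCEPT",
    ]
```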

6.6 Monitor store failures

6.6.1 Identifying symptoms of store corruption

The Ceph Monitor stores the cluster map in a key/value store such as LevelDB. If a monitor fails because the key/value store is corrupted, the following error messages may appear in the monitor log:

Corruption: error in middle of record

Or:

Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb
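A quick way to look for these messages is to scan the monitor log for the Corruption: marker that both examples share. The helper below is an illustrative sketch. It works on any iterable of lines, such as an open log file.

```python
def find_corruption(log_lines):
    """Return the log lines that contain a key/value store corruption message."""
    return [line.rstrip("\n") for line in log_lines if "Corruption:" in line]
```

For example, pass it an open file object for the monitor log. The default log location is described in Section 6.7.1.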

6.6.2 Recovering using healthy monitors

If there are any survivors, replace the corrupted one with a new one. After booting up, the new joiner will sync up with a healthy peer, and once it is fully synchronized, it will be able to serve the clients.

6.6.3 Recovering using OSDs

But what if all monitors fail at the same time? Since users are encouraged to deploy at least three (and preferably five) monitors in a Ceph cluster, simultaneous failure is unlikely. However, an unplanned power-down in a data center with improperly configured disk or file system settings could damage the underlying file system and hence kill all the monitors. In this case, you can recover the monitor store with the information stored in the OSDs.

  ms=/root/mon-store
  mkdir $ms

  # collect the cluster map from stopped OSDs
  for host in $hosts; do
    rsync -avz $ms/. user@$host:$ms.remote
    rm -rf $ms
    ssh user@$host <<EOF
      for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
      done
  EOF
    rsync -avz user@$host:$ms.remote/. $ms
  done

  # rebuild the monitor store from the collected map, if the cluster does not
  # use cephx authentication, we can skip the following steps to update the
  # keyring with the caps, and there is no need to pass the "--keyring" option.
  # i.e. just use "ceph-monstore-tool $ms rebuild" instead
  ceph-authtool /path/to/admin.keyring -n mon. \
    --cap mon 'allow *'
  ceph-authtool /path/to/admin.keyring -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
  # add one or more ceph-mgr's key to the keyring. in this case, an encoded key
  # for mgr.x is added, you can find the encoded key in
  # /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
  # deployed
  ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
    --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
  # if your monitors' ids are not single characters like 'a', 'b', 'c', please
  # specify them in the command line by passing them as arguments of the "--mon-ids"
  # option. if you are not sure, please check your ceph.conf to see if there is any
  # sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are
  # using DNS SRV for looking up monitors.
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma

  # make a backup of the corrupted store.db just in case!  repeat for
  # all monitors.
  mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted

  # move the rebuilt store.db into place.  repeat for all monitors.
  mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
  chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

The script above performs the following steps:

  1. Collect the map from all OSD hosts.

  2. Fill the entities in the keyring file with appropriate caps.

  3. Rebuild the store.

  4. Replace the corrupted store on mon.foo with the recovered copy.

6.6.3.1 Known limitations

The following information is not recoverable using the steps above:

  • Some added keyrings: all the OSD keyrings added using the ceph auth add command are recovered from the OSD’s copy. The client.admin keyring is imported using ceph-monstore-tool. The MDS keyrings and other keyrings are missing in the recovered monitor store. You may need to re-add them manually.

  • Creating pools: If any RADOS pools were in the process of being created, that state is lost. The recovery tool assumes that all pools have been created. If there are PGs that are stuck in the unknown state after the recovery for a partially created pool, you can force creation of the empty PG with the ceph osd force-create-pg command. This will create an empty PG, so only do this if you know the pool is empty.

  • MDS Maps: the MDS maps are lost.

6.7 Next steps

6.7.1 Preparing your logs

Monitor logs are, by default, kept in /var/log/ceph/ceph-mon.FOO.log*. However, your logs may not have the necessary information. If you do not find your monitor logs at their default location, you can check where they are by running:

cephuser@adm > ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs depends on the debug levels enforced by your configuration files. If you have not enforced a specific debug level, then Ceph is using the default levels and your logs may not contain the information needed to track down your issue. A first step in getting relevant information into your logs is to raise debug levels. As with other components, different parts of the monitor output their debug information on different subsystems. You will have to raise the debug levels of the subsystems most closely related to your issue. For most situations, setting the following options on your monitors will be enough to pinpoint a potential source of the issue:

debug mon = 10
debug ms = 1

6.7.2 Adjusting debug levels

You do not need to restart a monitor to adjust debug levels. You may do it in one of two ways:

  • If you have a quorum, either inject the debug option into the monitor you want to debug:

    cephuser@adm > ceph tell mon.FOO config set debug_mon 10/10

    Or into all monitors at once:

    cephuser@adm > ceph tell mon.* config set debug_mon 10/10
  • If you have no quorum, use the monitor's admin socket and directly adjust the configuration options:

    cephuser@adm > ceph daemon mon.FOO config set debug_mon 10/10

Going back to default values is as easy as rerunning the above commands using the debug level 1/10 instead. You can check your current values using the admin socket and the following commands:

cephuser@adm > ceph daemon mon.FOO config show

Or:

cephuser@adm > ceph daemon mon.FOO config get 'OPTION_NAME'
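The choice between ceph tell (with a quorum) and ceph daemon (without one) can be captured in a small helper. This sketch only assembles the commands shown in this section. The function name and the level validation are assumptions added for illustration.

```python
import re


def debug_command(mon_id, level, have_quorum, option="debug_mon"):
    """Build the command that sets a debug level on one monitor."""
    # Accept levels such as '10' or '10/10', as used in this section.
    if re.fullmatch(r"\d+(/\d+)?", level) is None:
        raise ValueError("unexpected debug level: {}".format(level))
    # With a quorum use 'ceph tell'; without one use the admin socket.
    tool = ["ceph", "tell"] if have_quorum else ["ceph", "daemon"]
    return tool + ["mon.{}".format(mon_id), "config", "set", option, level]
```

Calling the helper with level "1/10" produces the command that restores the default mentioned above.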

6.8 Manually deploying a MGR daemon

cephadm requires a MGR daemon in order to manage the cluster. If the last MGR of a cluster was removed, follow these steps to deploy an example MGR daemon named mgr.hostname.smfvfd on a random host of your cluster manually:

  1. Disable the cephadm scheduler to prevent cephadm from removing the new MGR daemon:

    cephuser@adm > ceph config-key set mgr/cephadm/pause true
  2. Get or create the auth entry for the new MGR daemon:

    cephuser@adm > ceph auth get-or-create mgr.hostname.smfvfd \
    mon "profile mgr" osd "allow *" mds "allow *"
  3. Generate a minimal ceph.conf:

    cephuser@adm > ceph config generate-minimal-conf
  4. Find the name of the container image:

    cephuser@adm > ceph config get "mgr.hostname.smfvfd" container_image
  5. Create a file config-json.json which contains the information necessary to deploy the daemon, for example:

    {
      "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
      "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
    }
  6. Deploy the daemon:

    cephuser@adm > cephadm --image IMAGE_NAME \
     deploy --fsid CLUSTER_FSID \
     --name mgr.hostname.smfvfd --config-json config-json.json
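Writing config-json.json by hand is error-prone because the configuration and keyring must be embedded as single JSON strings with escaped newlines and tabs. The sketch below assembles the file content from the output of steps 2 and 3. The function name is hypothetical, and the keyring layout is copied from the example in step 5.

```python
import json


def build_config_json(minimal_conf, daemon_name, key):
    """Return the config-json.json content for 'cephadm deploy' as a string."""
    # Keyring layout as in the example: section header, then a tab-indented key.
    keyring = "[{}]\n\tkey = {}\n".format(daemon_name, key)
    return json.dumps({"config": minimal_conf, "keyring": keyring}, indent=2)
```

json.dumps takes care of escaping the newlines and tabs, so the resulting string can be written to config-json.json as is.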