Applies to SUSE Enterprise Storage 7.1

3 Troubleshooting cephadm

Because cephadm deploys daemons as containers, troubleshooting them is slightly different. Here are a few tools and commands to help you investigate issues.

3.1 Pausing or disabling cephadm

If something goes wrong and cephadm behaves in an unexpected way, you can pause most background activity by running:

cephuser@adm > ceph orch pause

This will stop any changes, but cephadm will still periodically check hosts to refresh its inventory of daemons and devices. You can disable cephadm completely with:

cephuser@adm > ceph orch set backend ''
cephuser@adm > ceph mgr module disable cephadm

This disables all ceph orch ... CLI commands, but the previously deployed daemon containers continue to exist and start as they did before.
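If cephadm only needs to stay quiet for the duration of a single maintenance step, the pause can be wrapped so that background activity is resumed afterwards. The following is a minimal sketch rather than a supported tool: the function name with_cephadm_paused is hypothetical, and it relies on ceph orch resume as the counterpart of ceph orch pause.

```shell
# Hypothetical helper: pause cephadm, run the given command, then resume.
# Usage: with_cephadm_paused COMMAND [ARG...]
with_cephadm_paused() {
    ceph orch pause || return 1
    "$@"
    rc=$?
    # Resume background activity even if the wrapped command failed.
    ceph orch resume
    return "$rc"
}
```

The function returns the exit status of the wrapped command, so it can be used in scripts that check for failure.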

3.2 Checking cephadm logs

You can monitor the cephadm log in real time:

cephuser@adm > ceph -W cephadm

To view the last few messages, execute:

cephuser@adm > ceph log last cephadm

3.3 Accessing Ceph daemon logs

SUSE Enterprise Storage 7.1 supports Ceph logging via systemd-journald. To access the logs of Ceph daemons in SUSE Enterprise Storage 7.1, follow the instructions below.

Use the ceph orch ps command (or ceph orch ps node_name or ceph orch ps --daemon-type daemon_type) to find the cephadm name of the daemon and the host where it is running.

cephuser@adm > ceph orch ps --daemon-type DAEMON_TYPE_NAME

For example, if you want to view the logs for Prometheus, run the following command to find the name of the daemon:

cephuser@adm > ceph orch ps --daemon-type prometheus
NAME               HOST    STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                                   IMAGE ID      CONTAINER ID
prometheus.main    main    running (65m)  7m ago     2h   2.32.1   registry.suse.com/ses/7.1/ceph/prometheus-server:2.32.1  e77db6e75e78  7b11062150ab

In this example, prometheus.main is the name and the host is main.

To view the daemon's logs, execute the following on the host where the daemon is running:
```
cephuser@adm > cephadm logs --name DAEMON_NAME
```
For example, to view the Prometheus logs from above:
```
cephuser@adm > cephadm logs --name prometheus.main
```
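When several daemons of one type exist, it can help to reduce the ceph orch ps output to the two columns needed for cephadm logs. The following sketch assumes the column layout shown in the example above (NAME first, HOST second); the function name daemon_names is hypothetical.

```shell
# Hypothetical helper: print "NAME HOST" for every daemon of the given type.
# Usage: daemon_names DAEMON_TYPE
daemon_names() {
    # Skip the header line of "ceph orch ps" and keep the first two columns.
    ceph orch ps --daemon-type "$1" | awk 'NR > 1 { print $1, $2 }'
}
```

For the Prometheus example, daemon_names prometheus would print prometheus.main main, which tells you to run cephadm logs --name prometheus.main on the host main.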

3.4 Collecting `systemd` status

To print the state of a systemd unit, run:

systemctl status "ceph-$(cephadm shell ceph fsid)@SERVICE_NAME.service"

To fetch the state of all daemons on a given host, run:

  fsid="$(cephadm shell ceph fsid)"
  for name in $(cephadm ls | jq -r '.[].name') ; do
    # Write the status of each daemon to a file named after the daemon.
    systemctl status "ceph-$fsid@$name.service" > "$name"
  done
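If you are only interested in daemons that are not running, the same unit naming scheme can be combined with systemctl is-active. This is a minimal sketch under that assumption; the function name inactive_ceph_units is hypothetical.

```shell
# Hypothetical helper: print the units of the given daemons that are not active.
# Usage: inactive_ceph_units FSID DAEMON_NAME...
inactive_ceph_units() {
    fsid=$1
    shift
    for name in "$@"; do
        unit="ceph-$fsid@$name.service"
        # "systemctl is-active --quiet" exits non-zero unless the unit is active.
        systemctl is-active --quiet "$unit" || printf '%s\n' "$unit"
    done
}
```

A possible invocation on a cluster host is inactive_ceph_units "$(cephadm shell ceph fsid)" $(cephadm ls | jq -r '.[].name'), where the second argument is deliberately left unquoted so that each daemon name becomes a separate argument.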

3.5 Listing configured container images

cephadm has a set of container images configured by default.

To get the value of a configured default container image, run:

cephuser@adm > ceph config get mgr mgr/cephadm/OPTION_NAME

Where OPTION_NAME is any of the following names:

container_image_base
container_image_prometheus
container_image_node_exporter
container_image_alertmanager
container_image_grafana
container_image_haproxy
container_image_keepalived

For example:

cephuser@adm > ceph config get mgr mgr/cephadm/container_image_base
registry.suse.com/ses/7.1/ceph/ceph

Note

mgr/cephadm/container_image_base is the default Ceph container image used by all services except the monitoring stack and the ingress stack.
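To review all of the options listed above at once, the ceph config get call can be run in a loop. The following is a minimal sketch; the function name show_cephadm_images is hypothetical, and the option names are taken from the list in this section.

```shell
# Hypothetical helper: print every container image option and its current value.
show_cephadm_images() {
    for opt in base prometheus node_exporter alertmanager grafana haproxy keepalived; do
        printf 'container_image_%s = %s\n' "$opt" \
            "$(ceph config get mgr "mgr/cephadm/container_image_$opt")"
    done
}
```

Each output line has the form OPTION_NAME = VALUE, which makes it easy to spot an option that still points to an unexpected registry.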

3.6 Listing all downloaded container images

To list all container images that are downloaded on a host:

  podman ps -a --format json | jq '.[].Image'
  "docker.io/library/centos:8"
  "registry.opensuse.org/opensuse/leap:15.2"

Note

Image may also be called ImageID.

3.7 Running containers manually

cephadm writes small wrappers that run a container. Refer to /var/lib/ceph/cluster-fsid/service-name/unit.run for the container execution command.

3.7.1 Assessing SSH errors

If you receive the following error:

xxxxxx.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-kbqvkrkw root@10.10.1.2
raise OrchestratorError('Failed to connect to %s (%s).  Check that the host is reachable and accepts  connections using the cephadm SSH key' % (host, addr)) from
orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2).  Check that the host is reachable and accepts connections using the cephadm SSH key

You can narrow down the cause by checking the following items.

Ensure cephadm has an SSH identity key:

root@master # cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > key
INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 obtained 'mgr/cephadm/ssh_identity_key'
root@master # chmod 0600 key

If this fails, cephadm does not have a key. Fix this by running the following command:

root@master # cephadm shell -- ceph cephadm generate-ssh-key

Or:

root@master # cat key | cephadm shell -- ceph cephadm set-priv-key -i -

Ensure that the SSH config is correct:

root@master # cephadm shell -- ceph cephadm get-ssh-config > config

Verify the connection to the host:

root@master # ssh -F config -i key root@mon1
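If more than one host is affected, the connection test can be repeated non-interactively for each of them. The following is a minimal sketch that assumes the config and key files created in the previous steps; the function name check_cephadm_ssh is hypothetical. BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Hypothetical helper: test the cephadm SSH configuration against several hosts.
# Usage: check_cephadm_ssh CONFIG_FILE KEY_FILE HOST...
check_cephadm_ssh() {
    config=$1
    key=$2
    shift 2
    failed=0
    for host in "$@"; do
        # BatchMode prevents password prompts, so a rejected key fails at once.
        if ssh -F "$config" -i "$key" -o BatchMode=yes -o ConnectTimeout=10 \
            "root@$host" true </dev/null; then
            printf '%s: OK\n' "$host"
        else
            printf '%s: FAILED\n' "$host"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, check_cephadm_ssh config key mon1 mon2 prints one line per host and returns a non-zero status if any connection failed.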

3.7.2 Verifying public key is in `authorized_keys`

To verify that the public key is in the authorized_keys file, run the following commands:

root@master # cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_pub > key.pub
root@master # grep -F "$(cat key.pub)" /root/.ssh/authorized_keys
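A whole-line match can miss a key whose trailing comment differs between the two files. As a hedged alternative, the following sketch compares only the key material, assuming the usual public key layout of type, base64 data, and optional comment; the function name key_is_authorized is hypothetical.

```shell
# Hypothetical helper: succeed if the key in PUBKEY_FILE appears in AUTH_FILE.
# Usage: key_is_authorized PUBKEY_FILE [AUTH_FILE]
key_is_authorized() {
    # The second field of a public key line is the base64 key material.
    blob=$(awk 'NR == 1 { print $2 }' "$1")
    # Return 2 if the file does not look like a public key.
    [ -n "$blob" ] || return 2
    grep -qF -- "$blob" "${2:-/root/.ssh/authorized_keys}"
}
```

Run on the target host, key_is_authorized key.pub returns 0 if the key is present in /root/.ssh/authorized_keys and a non-zero status otherwise.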

3.8 Failing to infer CIDR network error

If you see one of the following errors:

ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later

Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

You need to specify a subnet for Ceph Monitors:

cephuser@adm > ceph config set mon public_network MON_NETWORK
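MON_NETWORK is expected in CIDR notation, for example 192.168.2.0/24 (an illustrative value, not taken from a real cluster). The following sketch checks an IPv4 value for that form before it is passed to ceph config set. The function name is_ipv4_cidr is hypothetical, and the check is deliberately loose: it does not handle IPv6 and does not verify that the host bits are zero.

```shell
# Hypothetical helper: succeed if the argument looks like an IPv4 CIDR network.
# Usage: is_ipv4_cidr VALUE
is_ipv4_cidr() {
    printf '%s\n' "$1" | awk -F'[./]' '
        # Four octets plus the prefix length give exactly five fields.
        NF != 5 { exit 1 }
        {
            for (i = 1; i <= 5; i++) if ($i !~ /^[0-9]+$/) exit 1
            for (i = 1; i <= 4; i++) if ($i > 255) exit 1
            if ($5 > 32) exit 1
        }'
}
```

A script could then guard the setting with is_ipv4_cidr "$net" && ceph config set mon public_network "$net".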

3.9 Accessing the admin socket

Each Ceph daemon provides an admin socket that bypasses the MONs. To access the admin socket, enter the daemon container on the host:

root@master # cephadm enter --name daemon-name
root@master # ceph --admin-daemon /var/run/ceph/ceph-daemon-name.asok config show

3.10 Deploying a Ceph Manager manually

Use the following steps to deploy a new Ceph Manager on a host.

Disable the cephadm scheduler, in order to prevent cephadm from removing the new Ceph Manager:
```
cephuser@adm > ceph config set mgr mgr/cephadm/pause true
```

Create the auth entry:

cephuser@adm > ceph auth get-or-create DAEMON_NAME mon "profile mgr" osd "allow *" mds "allow *"

If you already have an auth entry, you can get the entry with the following command:

cephuser@adm > ceph auth get DAEMON_NAME

Generate the ceph.conf:

cephuser@adm > ceph config generate-minimal-conf

Get the container image:

cephuser@adm > ceph config get "DAEMON_NAME" container_image

Create the config json:

cat config-json.json
{
"config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.178.28:40483/0,v1:192.168.178.28:40484/0]\n",
"keyring": "[mgr.hostname.smfvfd]\n\tkey = AQDSb0ZfJqfDNxAAhgAVuH8Wg0yaRUAHwXdTQA==\n"
}
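Writing this file by hand is error-prone because newlines and tabs must be escaped as \n and \t. The following is a minimal sketch that assembles the file from two plain-text inputs, assuming the minimal ceph.conf and the keyring were each saved to a file beforehand. The function names json_escape_file and make_config_json are hypothetical, and only backslashes, double quotes, tabs, and newlines are escaped.

```shell
# Hypothetical helper: print a file as the body of a JSON string.
json_escape_file() {
    tab=$(printf '\t')
    # Escape backslashes, double quotes, and tabs, then join the lines with
    # a literal "\n" sequence.
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' -e "s/$tab/\\\\t/g" "$1" |
        awk '{ printf "%s\\n", $0 }'
}

# Hypothetical helper: print the config JSON for "cephadm deploy".
# Usage: make_config_json CEPH_CONF_FILE KEYRING_FILE
make_config_json() {
    printf '{\n"config": "%s",\n"keyring": "%s"\n}\n' \
        "$(json_escape_file "$1")" "$(json_escape_file "$2")"
}
```

With the output of ceph config generate-minimal-conf saved as ceph.conf and the auth entry saved as keyring, make_config_json ceph.conf keyring > config-json.json produces a file in the layout shown above.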

Deploy the daemon:

cephuser@adm > cephadm --image CONTAINER_IMAGE deploy --fsid FSID --name DAEMON_NAME --config-json config-json.json --tcp-ports 42483

3.11 Distributing a program temporary fix (PTF)

Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update (MU) containing a permanent fix has been released via the regular update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.

PTFs are published on the registry.suse.com server through a path similar to:

registry.suse.com/ptf/PTF_NUMBER/ses/7.1/ceph/ceph:PTF-PTF_NUMBER
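Because PTF_NUMBER appears twice in the path, it is easy to mistype one of the occurrences. The following sketch builds the image reference from a single argument, assuming the path pattern shown above; the function name ptf_image is hypothetical.

```shell
# Hypothetical helper: build the PTF image reference for a given PTF number.
# Usage: ptf_image PTF_NUMBER
ptf_image() {
    printf 'registry.suse.com/ptf/%s/ses/7.1/ceph/ceph:PTF-%s\n' "$1" "$1"
}
```

The result can be used wherever the full image path is required, for example podman pull "$(ptf_image PTF_NUMBER)".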

The following steps describe how to deploy PTF images on an existing Ceph cluster:

Note the PTF_NUMBER used in the L3 PTF process.

Determine the Ceph Monitor key that corresponds to the relevant service:
```
cephuser@adm > ceph auth ls | grep DAEMON_TYPE
```

Receive the PTF:

# podman pull \
registry.suse.com/ptf/PTF_NUMBER/ses/7.1/ceph/ceph:PTF-PTF_NUMBER

Verify that the PTF is listed:
```
# podman image ls
```

Set the PTF image for the relevant service, for example, iSCSI:

cephuser@adm > ceph config set client.iscsi container_image \
registry.suse.com/ptf/PTF_NUMBER/ses/7.1/ceph/ceph:PTF-PTF_NUMBER

Remove the specific iSCSI daemon that is causing problems, for example:
```
cephuser@adm > ceph orch daemon rm iscsi.igw.sesblade16.zkomeh
```
The removed daemon will be deployed automatically using the PTF.

Important

If you plan to run the ceph-salt update command after applying a PTF (as described in Section 13.6.4, “Running the update”), it will not affect the manual changes made by the temporary fix.

Conversely, running the ceph orch upgrade command (see Section 13.7.1, “Starting the update”) will upgrade daemons manually deployed by the PTF.

3.12 Failure when adding hosts with cephadm

When cephadm fails to add a new cluster host, whether by means of ceph-salt (see Section 7.2.2, “Adding Salt Minions”) or manually using the ceph orch host add command, you can diagnose the reason for the failure. The failure message has the following pattern:

Failed to connect to HOSTNAME (FQDN).

For example:

cephuser@adm > ceph orch host add node5.example.com
Failed to connect to node5 (node5.example.com).

If the message includes a fully qualified domain name (FQDN, node5.example.com in our example) as the HOSTNAME, try adding the host using its FQDN:

cephuser@adm > ceph orch host add node5.example.com

If the message includes a short host name (node5 in our example) as the HOSTNAME, try adding the host using both its short name and FQDN:

cephuser@adm > ceph orch host add node5 node5.example.com

If either of the previous two commands succeeds, ceph-salt is probably causing the problem. Otherwise, try adding the host using its short host name and IP address:

cephuser@adm > ceph orch host add node5 192.168.2.102

If this succeeds, the cluster does not have proper name resolution. Refer to Chapter 5, “Installing and configuring SUSE Linux Enterprise Server” for more details.
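The three attempts can also be scripted. The following is a minimal sketch that tries them in the order described in this section and reports which form succeeded. The function name try_host_add is hypothetical, and the sketch always tries all forms in sequence instead of choosing one based on the failure message.

```shell
# Hypothetical helper: try the host add variants from this section in order.
# Usage: try_host_add SHORT_NAME FQDN IP_ADDRESS
try_host_add() {
    short=$1
    fqdn=$2
    ip=$3
    if ceph orch host add "$fqdn"; then
        echo "added using FQDN"
    elif ceph orch host add "$short" "$fqdn"; then
        echo "added using short name and FQDN"
    elif ceph orch host add "$short" "$ip"; then
        echo "added using short name and IP address: check name resolution"
    else
        echo "all attempts failed" >&2
        return 1
    fi
}
```

For the example host, the call would be try_host_add node5 node5.example.com 192.168.2.102.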

Tip

For more information about resolving full and bare host names via DNS, refer to https://docs.ceph.com/en/octopus/cephadm/concepts/.

3.13 Disabling automatic deployment of daemons

By default, Ceph daemons are automatically deployed to new hosts after you add these hosts to the placement specification and apply it (refer to Chapter 8, “Deploying the remaining core services using cephadm” for more details).

If you need to disable the automated deployment of a service's daemons, add unmanaged: true to the specification file of that service and apply it, for example:

cat mgr.yaml
service_type: mgr
unmanaged: true
placement:
  label: mgr
[...]

cephuser@adm > ceph orch apply -i mgr.yaml

After applying this specification, cephadm will no longer deploy any new daemons on hosts that match the placement specification.

To manually deploy a daemon on a new host, run:

ceph orch daemon add DAEMON_TYPE --placement=PLACEMENT_SPEC

For example:

cephuser@adm > ceph orch daemon add mgr --placement=ses-node3

To manually remove a daemon, run:

ceph orch daemon rm DAEMON_NAME [--force]

For example:

cephuser@adm > ceph orch daemon rm mgr.ses-node3.xietsy
