3 Troubleshooting cephadm
As cephadm deploys daemons as containers, troubleshooting daemons is slightly different. Here are a few tools and commands to help you investigate issues.
3.1 Pausing or disabling cephadm
If something goes wrong and cephadm is behaving in a way you do not like, you can pause most background activity by executing the following:
cephuser@adm >
ceph orch pause
This will stop any changes, but cephadm will still periodically check hosts to refresh its inventory of daemons and devices. You can disable cephadm completely with:
cephuser@adm >
ceph orch set backend ''
cephuser@adm >
ceph mgr module disable cephadm
This disables all of the ceph orch ... CLI commands, but the previously deployed daemon containers will continue to exist and start as they did before.
3.2 Checking cephadm logs
You can monitor the cephadm log in real time:
cephuser@adm >
ceph -W cephadm
To view the last few messages, execute:
cephuser@adm >
ceph log last cephadm
3.3 Accessing Ceph daemon logs
SUSE Enterprise Storage 7.1 supports Ceph logging via systemd-journald. To access the logs of Ceph daemons in SUSE Enterprise Storage 7.1, follow the instructions below.
Use the ceph orch ps command (or ceph orch ps NODE_NAME or ceph orch ps --daemon-type DAEMON_TYPE_NAME) to find the cephadm name of the daemon and the host on which it is running.
cephuser@adm >
ceph orch ps --daemon-type DAEMON_TYPE_NAME
For example, if you want to view the logs for Prometheus, run the following command to find the name of the daemon:
cephuser@adm >
ceph orch ps --daemon-type prometheus
NAME             HOST  STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                               IMAGE ID      CONTAINER ID
prometheus.main  main  running (65m)  7m ago     2h   2.32.1   registry.suse.com/ses/7.1/ceph/prometheus-server:2.32.1  e77db6e75e78  7b11062150ab
In this example, prometheus.main is the name and the host is main.
To view the daemon's logs, execute the following on the host where the daemon is running:
cephuser@adm >
cephadm logs --name DAEMON_NAME
For example, to view the Prometheus logs from above:
cephuser@adm >
cephadm logs --name prometheus.main
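The lookup of daemon name and host can be scripted. The following is a minimal sketch, assuming the column layout of the example output above (NAME in the first column, HOST in the second). The function name orch_ps_name_host is hypothetical and not part of cephadm.

```shell
# Hypothetical helper: read `ceph orch ps` output on stdin and print
# one "NAME HOST" pair per daemon. Assumes NAME is column 1 and HOST
# is column 2, as in the example output above. The header line is skipped.
orch_ps_name_host() {
  awk 'NR > 1 && NF >= 2 { print $1, $2 }'
}

# Usage on a live cluster (not executed here):
#   ceph orch ps --daemon-type prometheus | orch_ps_name_host
```

The first field of each printed pair is the value to pass to cephadm logs --name, and the second field is the host on which to run that command.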
3.4 Collecting systemd status
To print the state of a systemd unit, run:
systemctl status "ceph-$(cephadm shell ceph fsid)@SERVICE_NAME.service"
To fetch the state of all daemons on a given host, run:
fsid="$(cephadm shell ceph fsid)"
for name in $(cephadm ls | jq -r '.[].name') ; do
  systemctl status "ceph-$fsid@$name.service" > "$name"
done
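The unit names used above follow the pattern ceph-FSID@DAEMON_NAME.service. As a small sketch of that naming rule, the hypothetical helper below (ceph_unit_name is not a cephadm command) builds the unit name from its two parts, so it can be reused with systemctl or journalctl.

```shell
# Hypothetical helper: build the systemd unit name of a cephadm-deployed
# daemon from the cluster FSID ($1) and the cephadm daemon name ($2).
ceph_unit_name() {
  printf 'ceph-%s@%s.service\n' "$1" "$2"
}

# Usage on a cluster host (not executed here):
#   fsid="$(cephadm shell ceph fsid)"
#   systemctl status "$(ceph_unit_name "$fsid" prometheus.main)"
#   journalctl -u "$(ceph_unit_name "$fsid" prometheus.main)"
```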
3.5 Listing configured container images
cephadm has a set of container images configured by default.
To get the value of the configured default container:
cephuser@adm >
ceph config get mgr mgr/cephadm/OPTION_NAME
Where OPTION_NAME is any of the following names:
container_image_base
container_image_prometheus
container_image_node_exporter
container_image_alertmanager
container_image_grafana
container_image_haproxy
container_image_keepalived
For example:
cephuser@adm >
ceph config get mgr mgr/cephadm/container_image_base
registry.suse.com/ses/7.1/ceph/ceph
mgr/cephadm/container_image_base is the default Ceph container image used by all services except the monitoring stack and the ingress stack.
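To check every configured image in one pass, the option names listed above can be expanded into full configuration keys. This is a minimal sketch; the variable CEPHADM_IMAGE_OPTIONS and the function cephadm_image_keys are hypothetical names introduced for illustration.

```shell
# Option names as listed above.
CEPHADM_IMAGE_OPTIONS="container_image_base container_image_prometheus
container_image_node_exporter container_image_alertmanager
container_image_grafana container_image_haproxy container_image_keepalived"

# Hypothetical helper: print the full config key for each option,
# one per line, in the form mgr/cephadm/OPTION_NAME.
cephadm_image_keys() {
  for opt in $CEPHADM_IMAGE_OPTIONS; do
    printf 'mgr/cephadm/%s\n' "$opt"
  done
}

# Usage on a live cluster (not executed here):
#   for key in $(cephadm_image_keys); do
#     printf '%s = %s\n' "$key" "$(ceph config get mgr "$key")"
#   done
```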
3.6 Listing all downloaded container images
To list all container images that are downloaded on a host, run the following command. Note that Image may also be called ImageID.
podman ps -a --format json | jq '.[].Image'
"docker.io/library/centos:8"
"registry.opensuse.org/opensuse/leap:15.2"
3.7 Running containers manually
cephadm writes small wrappers that run containers. Refer to /var/lib/ceph/cluster-fsid/service-name/unit.run for the container execution command.
3.7.1 Assessing SSH errors
If you receive the following error:
xxxxxx.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-kbqvkrkw root@10.10.1.2
raise OrchestratorError('Failed to connect to %s (%s). Check that the host is reachable and accepts connections using the cephadm SSH key' % (host, addr)) from
orchestrator._interface.OrchestratorError: Failed to connect to 10.10.1.2 (10.10.1.2). Check that the host is reachable and accepts connections using the cephadm SSH key
You can narrow down the cause by working through the following checks.
Ensure cephadm has an SSH identity key:
root@master #
cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > key
INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
obtained 'mgr/cephadm/ssh_identity_key'
root@master #
chmod 0600 key
If this fails, cephadm does not have a key. Fix this by running the following command:
root@master #
cephadm shell -- ceph cephadm generate-ssh-key
Or:
root@master #
cat key | cephadm shell -- ceph cephadm set-ssh-key -i -
Ensure that the SSH config is correct:
root@master #
cephadm shell -- ceph cephadm get-ssh-config > config
Verify the connection to the host:
root@master #
ssh -F config -i key root@mon1
3.7.2 Verifying public key is in authorized_keys
To verify that the public key is in the authorized_keys file, run the following commands:
root@master #
cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_pub > key.pub
root@master #
grep "$(cat key.pub)" /root/.ssh/authorized_keys
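The same check can be wrapped in a function that returns an exit status, which is convenient when testing several hosts. This is a sketch only; pubkey_authorized is a hypothetical name. It uses grep -F so that characters such as "+" and "." in the base64 key data are matched literally rather than as a regular expression.

```shell
# Hypothetical helper: succeed if the public key stored in file $1
# appears in the authorized_keys file $2, fail otherwise.
pubkey_authorized() {
  # An empty key file would match every line, so treat it as a failure.
  [ -s "$1" ] || return 1
  grep -qF -- "$(cat "$1")" "$2"
}

# Usage on the target host (not executed here):
#   pubkey_authorized key.pub /root/.ssh/authorized_keys && echo "key present"
```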
3.8 Failing to infer CIDR network error
If you see one of the following errors:
ERROR: Failed to infer CIDR network for mon ip ***; pass --skip-mon-network to configure it later
or
Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP
You need to specify a subnet for Ceph Monitors:
cephuser@adm >
ceph config set mon public_network MON_NETWORK
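MON_NETWORK is a subnet in CIDR notation. As a worked example of how such a value relates to a monitor's address, the sketch below derives the IPv4 network from a host address and a prefix length. The function name ipv4_network is hypothetical, and the /24 prefix in the usage line is an assumption about the example network, not a value taken from a real cluster.

```shell
# Hypothetical helper: print the IPv4 network in CIDR notation for a
# host address ($1, dotted quad) and a prefix length ($2, 0 to 32).
ipv4_network() {
  ip="$1"
  prefix="$2"
  # Split the dotted quad into the positional parameters $1..$4.
  old_ifs="$IFS"
  IFS=.
  set -- $ip
  IFS="$old_ifs"
  # Pack the four octets into one integer and mask off the host bits.
  addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  net=$(( addr & mask ))
  printf '%d.%d.%d.%d/%d\n' \
    $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) \
    $(( (net >> 8) & 255 )) $(( net & 255 )) "$prefix"
}

# Usage (not executed here; assumes the monitor is on a /24 network):
#   ceph config set mon public_network "$(ipv4_network 192.168.2.102 24)"
```

For the address 192.168.2.102 with a 24-bit prefix, the function prints 192.168.2.0/24.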
3.9 Accessing the admin socket
Each Ceph daemon provides an admin socket that bypasses the MONs. To access the admin socket, enter the daemon container on the host:
root@master #
cephadm enter --name daemon-name
root@master #
ceph --admin-daemon /var/run/ceph/ceph-daemon-name.asok config show
3.10 Deploying a Ceph Manager manually
Use the following steps to deploy a new Ceph Manager on a host.
Disable the cephadm scheduler, in order to prevent cephadm from removing the new Ceph Manager:
cephuser@adm >
ceph config set mgr mgr/cephadm/pause true
Create the auth entry:
cephuser@adm >
ceph auth get-or-create DAEMON_NAME mon "profile mgr" osd "allow *" mds "allow *"
If you already have an auth entry, you can get the entry with the following command:
cephuser@adm >
ceph auth get DAEMON_NAME
Generate the ceph.conf:
cephuser@adm >
ceph config generate-minimal-conf
Get the container image:
cephuser@adm >
ceph config get "DAEMON_NAME" container_image
Create the config JSON file:
cat config-json.json
{
  "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.178.28:40483/0,v1:192.168.178.28:40484/0]\n",
  "keyring": "[mgr.hostname.smfvfd]\n\tkey = AQDSb0ZfJqfDNxAAhgAVuH8Wg0yaRUAHwXdTQA==\n"
}
Deploy the daemon:
cephuser@adm >
cephadm --image container-image deploy --fsid fsid --name DAEMON_NAME --config-json config-json.json --tcp-ports 42483
3.11 Distributing a program temporary fix (PTF)
Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update (MU) containing a permanent fix has been released via the regular update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.
PTFs are published on the registry.suse.com server through a path similar to:
registry.suse.com/ptf/PTF_NUMBER/ses/7.1/ceph/ceph:PTF-PTF_NUMBER
The following steps describe how to deploy PTF images on an existing Ceph cluster:
Note the PTF_NUMBER used in the L3 PTF process.
Determine the Ceph Monitor key that corresponds to the relevant service:
cephuser@adm >
ceph auth ls | grep DAEMON_TYPE
Pull the PTF image:
#
podman pull \
registry.suse.com/ptf/PTF_NUMBER/ses/7.1/ceph/ceph:PTF-PTF_NUMBER
Verify that the PTF is listed:
#
podman image ls
Set the PTF image for the relevant service, for example, iSCSI:
cephuser@adm >
ceph config set client.iscsi container_image \
registry.suse.com/ptf/PTF_NUMBER/ses/7.1/ceph/ceph:PTF-PTF_NUMBER
Remove the specific iSCSI daemon that is causing problems, for example:
cephuser@adm >
ceph orch daemon rm iscsi.igw.sesblade16.zkomeh
The removed daemon will be deployed automatically using the PTF.
If you plan to run the ceph-salt update command after applying a PTF (as described in Section 13.6.4, “Running the update”), it will not affect the manual changes made by the temporary fix.
Conversely, running the ceph orch upgrade command (see Section 13.7.1, “Starting the update”) will upgrade daemons manually deployed by the PTF.
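Because the PTF number appears twice in the image path, it is easy to mistype one of the two occurrences. The following sketch builds the image reference from a single argument, following the path pattern shown in this section. The function name ptf_image is hypothetical, and 12345 is a made-up PTF number used only for illustration.

```shell
# Hypothetical helper: build the PTF image reference for the PTF
# number given as $1, using the registry path pattern shown above.
ptf_image() {
  printf 'registry.suse.com/ptf/%s/ses/7.1/ceph/ceph:PTF-%s\n' "$1" "$1"
}

# Usage (not executed here; 12345 is a made-up PTF number):
#   podman pull "$(ptf_image 12345)"
#   ceph config set client.iscsi container_image "$(ptf_image 12345)"
```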
3.12 Failure when adding hosts with cephadm
When cephadm fails to add a new cluster host, whether by means of ceph-salt (see Section 7.2.2, “Adding Salt Minions”) or manually using the ceph orch host add command, you can diagnose the reason for the failure. The failure message has the following pattern:
Failed to connect to HOSTNAME (FQDN).
For example:
cephuser@adm >
ceph orch host add node5.example.com
Failed to connect to node5 (node5.example.com).
If the message includes a fully qualified domain name (FQDN, node5.example.com in our example) as the HOSTNAME, try adding the host using its FQDN:
cephuser@adm >
ceph orch host add node5.example.com
If the message includes a short host name (node5 in our example) as the HOSTNAME, try adding the host using both its short name and FQDN:
cephuser@adm >
ceph orch host add node5 node5.example.com
If either of the previous two commands succeeds, ceph-salt is probably causing the problem. Otherwise, try adding the host using its short host name and IP address:
cephuser@adm >
ceph orch host add node5 192.168.2.102
If this succeeds, the cluster does not have proper name resolution. Refer to Chapter 5, “Installing and configuring SUSE Linux Enterprise Server” for more details.
For more information about resolving full and bare host names via DNS, refer to https://docs.ceph.com/en/octopus/cephadm/concepts/.
3.13 Disabling automatic deployment of daemons
By default, Ceph daemons are automatically deployed to new hosts after you add these hosts to the placement specification and apply it (refer to Chapter 8, “Deploying the remaining core services using cephadm” for more details).
If you need to disable the automated deployment of Ceph daemons, add unmanaged: true to its specification file and apply it, for example:
cat mgr.yaml
service_type: mgr
unmanaged: true
placement:
  label: mgr
[...]
cephuser@adm >
ceph orch apply -i mgr.yaml
After applying this specification, cephadm will no longer deploy any new daemons on hosts that match the placement specification.
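The minimal specification above can also be generated rather than edited by hand. The sketch below prints only the four fields shown in the example, so any additional settings an existing specification contains (the elided [...] part) are not reproduced. The function name unmanaged_spec is hypothetical.

```shell
# Hypothetical helper: print a minimal unmanaged service specification
# for the service type ($1) and placement label ($2).
unmanaged_spec() {
  cat <<EOF
service_type: $1
unmanaged: true
placement:
  label: $2
EOF
}

# Usage (not executed here):
#   unmanaged_spec mgr mgr > mgr.yaml
#   ceph orch apply -i mgr.yaml
```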
To manually deploy a daemon on a new host, run:
ceph orch daemon add DAEMON_TYPE --placement=PLACEMENT_SPEC
For example:
cephuser@adm >
ceph orch daemon add mgr --placement=ses-node3
To manually remove a daemon, run:
ceph orch daemon rm DAEMON_NAME [--force]
For example:
cephuser@adm >
ceph orch daemon rm mgr.ses-node3.xietsy