16 Monitoring and alerting #
In SUSE Enterprise Storage 7, cephadm deploys a monitoring and alerting stack. You either define the services you want to deploy (such as Prometheus, Alertmanager, and Grafana) in a YAML configuration file, or you use the CLI to deploy them. When multiple services of the same type are deployed, cephadm sets them up as a highly available configuration. The node exporter is an exception to this rule.
The following monitoring services can be deployed with cephadm:
Prometheus is the monitoring and alerting toolkit. It collects the data provided by Prometheus exporters and fires preconfigured alerts if predefined thresholds have been reached.
Alertmanager handles alerts sent by the Prometheus server. It deduplicates, groups, and routes the alerts to the correct receiver. By default, the Ceph Dashboard will automatically be configured as the receiver.
Grafana is the visualization and alerting software. However, the alerting functionality of Grafana is not used by this monitoring stack; for alerting, the Alertmanager is used.
Node exporter is an exporter for Prometheus which provides data about the node it is installed on. It is recommended to install the node exporter on all nodes.
The Prometheus Manager Module provides a Prometheus exporter to pass on Ceph performance counters from the collection point in ceph-mgr.
The Prometheus configuration, including scrape targets (metrics-providing daemons), is set up automatically by cephadm. cephadm also deploys a list of default alerts, for example health error, 10% OSDs down, or pgs inactive.
By default, traffic to Grafana is encrypted with TLS. You can either supply your own TLS certificate or use a self-signed one. If no custom certificate has been configured before Grafana has been deployed, then a self-signed certificate is automatically created and configured for Grafana.
You can configure custom certificates for Grafana by following these steps:
Configure certificate files:
cephuser@adm > ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
cephuser@adm > ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem
Restart the Ceph Manager service:
cephuser@adm > ceph orch restart mgr
Reconfigure the Grafana service to reflect the new certificate paths and set the right URL for the Ceph Dashboard:
cephuser@adm > ceph orch reconfig grafana
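If you want to verify that the certificate has been stored, you can optionally read the configuration key back. This check is only a suggestion and not a required part of the procedure:
cephuser@adm > ceph config-key get mgr/cephadm/grafana_crt   # optional check, prints the stored certificate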
The Alertmanager handles alerts sent by the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver. Alerts can be silenced using the Alertmanager, but silences can also be managed using the Ceph Dashboard.
We recommend that the Node exporter is deployed on all nodes. This can be done using the monitoring.yaml file with the node-exporter service type. See Section 8.3.8, “Deploying the monitoring stack” for more information on deploying services.
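For orientation only, a minimal monitoring.yaml could look like the following sketch. MONITORING_HOST is a placeholder host name, and the full format of the file is described in Section 8.3.8, “Deploying the monitoring stack”:
service_type: node-exporter
placement:
  host_pattern: '*'
---
service_type: prometheus
placement:
  hosts:
  - MONITORING_HOST   # placeholder: replace with a real host name
---
service_type: alertmanager
placement:
  hosts:
  - MONITORING_HOST
---
service_type: grafana
placement:
  hosts:
  - MONITORING_HOST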
16.1 Configuring custom or local images #
This section describes how to change the configuration of container images which are used when services are deployed or updated. It does not include the commands necessary to deploy or re-deploy services.
The recommended method to deploy the monitoring stack is by applying its specification as described in Section 8.3.8, “Deploying the monitoring stack”.
To deploy custom or local container images, the images need to be set in cephadm. To do so, you will need to run the following command:
cephuser@adm > ceph config set mgr mgr/cephadm/OPTION_NAME VALUE
Where OPTION_NAME is any of the following names:
container_image_prometheus
container_image_node_exporter
container_image_alertmanager
container_image_grafana
If no option is set or if the setting has been removed, the following images are used as VALUE:
registry.suse.com/ses/7/ceph/prometheus-server:2.27.1
registry.suse.com/ses/7/ceph/prometheus-node-exporter:1.1.2
registry.suse.com/ses/7/ceph/prometheus-alertmanager:0.21.0
registry.suse.com/ses/7/ceph/grafana:7.3.1
For example:
cephuser@adm > ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1
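Similarly, if you mirror the images to a local container registry, you can point a component at that registry. The registry host name in the following sketch is a placeholder:
cephuser@adm > ceph config set mgr mgr/cephadm/container_image_grafana LOCAL_REGISTRY_FQDN/ses/7/ceph/grafana:7.3.1   # LOCAL_REGISTRY_FQDN is a placeholder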
Setting a custom image overrides the default value (but does not overwrite it). The default value changes when updates become available. With a custom image set, the component you have configured it for can no longer be updated automatically. You will need to manually update the configuration (image name and tag) to be able to install updates.
If you choose to go back to the recommended images, you can reset the custom image you have set before. After that, the default value will be used again. Use ceph config rm to reset the configuration option:
cephuser@adm > ceph config rm mgr mgr/cephadm/OPTION_NAME
For example:
cephuser@adm > ceph config rm mgr mgr/cephadm/container_image_prometheus
16.2 Updating monitoring services #
As mentioned in Section 16.1, “Configuring custom or local images”, cephadm is shipped with the URLs of the recommended and tested container images, and they are used by default.
By updating the Ceph packages, new versions of these URLs may be shipped. This just updates where the container images are pulled from but does not update any services.
After the URLs to the new container images have been updated, either manually as described in Section 16.1, “Configuring custom or local images”, or automatically through an update of the Ceph package, the monitoring services can be updated.
To do so, use ceph orch reconfig like so:
cephuser@adm > ceph orch reconfig node-exporter
cephuser@adm > ceph orch reconfig prometheus
cephuser@adm > ceph orch reconfig alertmanager
cephuser@adm > ceph orch reconfig grafana
Currently no single command to update all monitoring services exists. The order in which these services are updated is not important.
If you use custom container images, for example from a local container registry, the URLs specified for the monitoring services will not change automatically when the Ceph packages are updated. In that case, you need to specify the URLs of the new container images manually. You can find the URLs of the recommended container images in Section 16.1, “Configuring custom or local images”.
16.3 Disabling monitoring #
To disable the monitoring stack, run the following commands:
cephuser@adm > ceph orch rm grafana
cephuser@adm > ceph orch rm prometheus --force   # this will delete metrics data collected so far
cephuser@adm > ceph orch rm node-exporter
cephuser@adm > ceph orch rm alertmanager
cephuser@adm > ceph mgr module disable prometheus
16.4 Configuring Grafana #
The Ceph Dashboard back-end requires the Grafana URL to be able to verify the existence of Grafana Dashboards before the front-end even loads them. Because of the nature of how Grafana is implemented in Ceph Dashboard, this means that two working connections are required in order to be able to see Grafana graphs in Ceph Dashboard:
The back-end (Ceph MGR module) needs to verify the existence of the requested graph. If this request succeeds, it lets the front-end know that it can safely access Grafana.
The front-end then requests the Grafana graphs directly from the user's browser using an
iframe
. The Grafana instance is accessed directly without any detour through Ceph Dashboard.
Your environment may make it difficult for the user's browser to directly access the URL configured in Ceph Dashboard. To solve this issue, a separate URL can be configured that is used solely to tell the front-end (the user's browser) which URL it should use to access Grafana.
To change the URL that is returned to the front-end, issue the following command:
cephuser@adm > ceph dashboard set-grafana-frontend-api-url GRAFANA-SERVER-URL
If no value is set for that option, it will simply fall back to the value of the GRAFANA_API_URL option, which is set automatically and periodically updated by cephadm. If set, it will instruct the browser to use this URL to access Grafana.
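For example, if the browsers in your environment reach Grafana through a dedicated host name, the command could look like the following; the URL is only an illustration:
cephuser@adm > ceph dashboard set-grafana-frontend-api-url https://grafana.example.com:3000   # example URL only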
16.5 Configuring the Prometheus Manager Module #
The Prometheus Manager Module is a module inside Ceph that extends Ceph's functionality. The module reads (meta-)data from Ceph about its state and health, providing the (scraped) data in a consumable format to Prometheus.
The Prometheus Manager Module needs to be restarted for the configuration changes to be applied.
16.5.1 Configuring the network interface #
By default, the Prometheus Manager Module accepts HTTP requests on port 9283 on all IPv4 and IPv6 addresses on the host. The port and listen address are both configurable with ceph config set, using the keys mgr/prometheus/server_addr and mgr/prometheus/server_port. This port is registered with Prometheus's registry.
To update the server_addr, execute the following command:
cephuser@adm > ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
To update the server_port, execute the following command:
cephuser@adm > ceph config set mgr mgr/prometheus/server_port 9283
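If you want to see the values that are currently in effect, you can, for example, query the configuration. This check is optional:
cephuser@adm > ceph config get mgr mgr/prometheus/server_addr   # optional check
cephuser@adm > ceph config get mgr mgr/prometheus/server_port   # optional check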
16.5.2 Configuring scrape_interval #
By default, the Prometheus Manager Module is configured with a scrape interval of 15 seconds. We do not recommend using a scrape interval below 10 seconds. To set a different scrape interval in the Prometheus module, set scrape_interval to the desired value:
To work properly and not cause any issues, the scrape_interval of this module should always be set to match the Prometheus scrape interval.
cephuser@adm > ceph config set mgr mgr/prometheus/scrape_interval 15
16.5.3 Configuring the cache #
On large clusters (more than 1000 OSDs), the time to fetch the metrics may become significant. Without the cache, the Prometheus Manager Module can overload the manager and lead to unresponsive or crashing Ceph Manager instances. As a result, the cache is enabled by default and cannot be disabled, but this does mean that the cache can become stale. The cache is considered stale when the time to fetch the metrics from Ceph exceeds the configured scrape_interval.
If this is the case, a warning will be logged and the module will either:
Respond with a 503 HTTP status code (service unavailable).
Return the content of the cache, even though it might be stale.
This behavior can be configured using the ceph config set command.
To tell the module to respond with possibly stale data, set it to return:
cephuser@adm > ceph config set mgr mgr/prometheus/stale_cache_strategy return
To tell the module to respond with service unavailable, set it to fail:
cephuser@adm > ceph config set mgr mgr/prometheus/stale_cache_strategy fail
16.5.4 Enabling RBD-image monitoring #
The Prometheus Manager Module can optionally collect RBD per-image IO statistics by enabling dynamic OSD performance counters. The statistics are gathered for all images in the pools that are specified in the mgr/prometheus/rbd_stats_pools configuration parameter. The parameter is a comma- or space-separated list of pool[/namespace] entries. If the namespace is not specified, the statistics are collected for all namespaces in the pool.
For example:
cephuser@adm > ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
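To restrict the statistics to a single namespace of a pool, use the pool/namespace form. The pool and namespace names in the following sketch are placeholders:
cephuser@adm > ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1/namespace1 pool2"   # placeholder pool and namespace names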
The module scans the specified pools and namespaces, makes a list of all available images, and refreshes it periodically. The interval is configurable via the mgr/prometheus/rbd_stats_pools_refresh_interval parameter (in seconds), and is 300 seconds (five minutes) by default.
For example, to change the synchronization interval to 10 minutes:
cephuser@adm > ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
16.6 Prometheus security model #
Prometheus' security model presumes that untrusted users have access to the Prometheus HTTP endpoint and logs. Untrusted users have access to all the (meta-)data Prometheus collects that is contained in the database, plus a variety of operational and debugging information.
However, Prometheus' HTTP API is limited to read-only operations. Configurations cannot be changed using the API, and secrets are not exposed. Moreover, Prometheus has some built-in measures to mitigate the impact of denial-of-service attacks.
16.7 Prometheus Alertmanager SNMP webhook #
If you want to get notified about Prometheus alerts via SNMP traps, then you can install the Prometheus Alertmanager SNMP webhook via cephadm. To do so, you need to create a service and placement specification file with the following content:
For more information on service and placement files, see Section 8.2, “Service and placement specification”.
service_type: container
service_id: prometheus-webhook-snmp
placement: ADD_PLACEMENT_HERE
image: registry.suse.com/ses/7/prometheus-webhook-snmp:latest
args:
  - "--publish 9099:9099"
envs:
  - ARGS="--debug --snmp-host=ADD_HOST_GATEWAY_HERE"
  - RUN_ARGS="--metrics"
Use this service specification to get the service running using its default settings.
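As an illustration, the ADD_PLACEMENT_HERE placeholder could be replaced with a placement that pins the container to a single host; the host name below is a placeholder:
placement:
  hosts:
  - HOSTNAME   # placeholder: replace with a real host name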
You need to publish the port the Prometheus receiver is listening on by using the command line argument --publish HOST_PORT:CONTAINER_PORT when running the service, because the port is not exposed automatically by the container. This can be done by adding the following lines to the specification:
args: - "--publish 9099:9099"
Alternatively, connect the container to the host network by using the command line argument --network=host.
args:
  - "--network=host"
If the SNMP trap receiver is not installed on the same host as the container, then you must also specify the FQDN of the SNMP host. Use the container's network gateway to be able to receive SNMP traps outside the container/host:
envs: - ARGS="--debug --snmp-host=CONTAINER_GATEWAY"
16.7.1 Configuring the prometheus-webhook-snmp service #
The container can be configured by environment variables or by using a configuration file.
For the environment variables, use ARGS to set global options and RUN_ARGS for the run command options. You need to adapt the service specification in the following way:
envs:
  - ARGS="--debug --snmp-host=CONTAINER_GATEWAY"
  - RUN_ARGS="--metrics --port=9101"
To use a configuration file, the service specification must be adapted in the following way:
files:
  etc/prometheus-webhook-snmp.conf:
    - "debug: True"
    - "snmp_host: ADD_HOST_GATEWAY_HERE"
    - "metrics: True"
volume_mounts:
  etc/prometheus-webhook-snmp.conf: /etc/prometheus-webhook-snmp.conf
To deploy, run the following command:
cephuser@adm > ceph orch apply -i SERVICE_SPEC_FILE
See Section 8.3, “Deploy Ceph services” for more information.
16.7.2 Configuring the Prometheus Alertmanager for SNMP #
Finally, the Prometheus Alertmanager needs to be configured specifically for SNMP traps. If this service has not been deployed already, create a service specification file. You need to replace IP_OR_FQDN with the IP address or FQDN of the host where the Prometheus Alertmanager SNMP webhook has been installed.
If you have already deployed this service, re-deploy it with the following settings to ensure the Alertmanager is set up correctly for SNMP.
For example:
service_type: alertmanager
placement:
  hosts:
  - HOSTNAME
user_data:
  default_webhook_urls:
  - 'http://IP_OR_FQDN:9099/'
Apply the service specification with the following command:
cephuser@adm > ceph orch apply -i SERVICE_SPEC_FILE