Applies to SUSE Enterprise Storage 7

16 Monitoring and alerting

In SUSE Enterprise Storage 7, cephadm deploys a monitoring and alerting stack. Users either define the services they want to deploy with cephadm (such as Prometheus, Alertmanager, and Grafana) in a YAML configuration file, or use the CLI to deploy them. Deploying multiple services of the same type results in a highly available setup. The node exporter is an exception to this rule.
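As a sketch of the YAML approach, a minimal service specification for Prometheus might look as follows; the placement count of 2 (for a highly available setup) is an illustrative assumption, not a shipped default:

```yaml
# Minimal sketch of a monitoring service specification for cephadm.
# "count: 2" is an illustrative assumption to obtain a highly available setup.
service_type: prometheus
placement:
  count: 2
```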

The following monitoring services can be deployed with cephadm:

  • Prometheus is the monitoring and alerting toolkit. It collects the data provided by Prometheus exporters and fires preconfigured alerts if predefined thresholds have been reached.

  • Alertmanager handles alerts sent by the Prometheus server. It deduplicates, groups, and routes the alerts to the correct receiver. By default, the Ceph Dashboard will automatically be configured as the receiver.

  • Grafana is the visualization and alerting software. The alerting functionality of Grafana is not used by this monitoring stack. For alerting, the Alertmanager is used.

  • Node exporter is an exporter for Prometheus which provides data about the node it is installed on. It is recommended to install the node exporter on all nodes.

The Prometheus Manager Module provides a Prometheus exporter to pass on Ceph performance counters from the collection point in ceph-mgr.

The Prometheus configuration, including scrape targets (metrics providing daemons), is set up automatically by cephadm. cephadm also deploys a list of default alerts, for example health error, 10% OSDs down, or pgs inactive.
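For illustration, a default alert of this kind has roughly the following Prometheus rule shape. The rule name, threshold, and duration below are assumptions rather than the exact shipped defaults; ceph_health_status is the health metric exposed by the manager module, where the value 2 indicates HEALTH_ERR:

```yaml
# Illustrative sketch of a health-error alert rule; names and values are assumptions.
groups:
  - name: ceph-health
    rules:
      - alert: CephHealthError
        expr: ceph_health_status == 2   # 2 indicates HEALTH_ERR
        for: 5m
        labels:
          severity: critical
```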

By default, traffic to Grafana is encrypted with TLS. You can either supply your own TLS certificate or use a self-signed one. If no custom certificate has been configured before Grafana has been deployed, then a self-signed certificate is automatically created and configured for Grafana.

You can configure custom certificates for Grafana by following these steps:

  1. Configure certificate files:

    cephuser@adm > ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
    cephuser@adm > ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem
  2. Restart the Ceph Manager service:

    cephuser@adm > ceph orch restart mgr
  3. Reconfigure the Grafana service to reflect the new certificate paths and set the right URL for the Ceph Dashboard:

    cephuser@adm > ceph orch reconfig grafana
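If no certificate is at hand, the key.pem and certificate.pem files used in step 1 can be generated as a self-signed pair for testing; the following openssl invocation is a sketch, and the CN value is a placeholder:

```shell
# Generate a throwaway self-signed certificate/key pair for Grafana.
# The CN is a placeholder; adjust it to your Grafana host name.
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout key.pem -out certificate.pem \
  -days 365 -subj "/CN=grafana.example.com"
```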

The Alertmanager handles alerts sent by the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver. Alerts can be silenced using the Alertmanager, but silences can also be managed using the Ceph Dashboard.

We recommend deploying the Node exporter on all nodes. This can be done using the monitoring.yaml file with the node-exporter service type. See Section 8.3.8, “Deploying the monitoring stack” for more information on deploying services.

16.1 Configuring custom or local images


This section describes how to change the configuration of container images which are used when services are deployed or updated. It does not include the commands necessary to deploy or re-deploy services.

The recommended method to deploy the monitoring stack is by applying its specification as described in Section 8.3.8, “Deploying the monitoring stack”.

To deploy custom or local container images, the images need to be set in cephadm. To do so, you will need to run the following command:

cephuser@adm > ceph config set mgr mgr/cephadm/OPTION_NAME VALUE

Where OPTION_NAME is any of the following names:

  • container_image_prometheus

  • container_image_node_exporter

  • container_image_alertmanager

  • container_image_grafana

If no option is set or if the setting has been removed, the following images are used as VALUE:

  • registry.suse.com/ses/7/ceph/prometheus-server:2.27.1

  • registry.suse.com/ses/7/ceph/prometheus-node-exporter:1.1.2

  • registry.suse.com/ses/7/ceph/prometheus-alertmanager:0.21.0

  • registry.suse.com/ses/7/ceph/grafana:7.3.1

For example:

cephuser@adm > ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1

Setting a custom image overrides the default value (but does not overwrite it); the default still changes when updates become available. However, the component for which you have set a custom image can no longer be updated automatically. To install updates, you need to update the configuration (image name and tag) manually.

To return to the recommended defaults, reset the custom image you set previously; the default value is then used again. Use ceph config rm to reset the configuration option:

cephuser@adm > ceph config rm mgr mgr/cephadm/OPTION_NAME

For example:

cephuser@adm > ceph config rm mgr mgr/cephadm/container_image_prometheus

16.2 Updating monitoring services

As mentioned in Section 16.1, “Configuring custom or local images”, cephadm is shipped with the URLs of the recommended and tested container images, and they are used by default.

By updating the Ceph packages, new versions of these URLs may be shipped. This just updates where the container images are pulled from but does not update any services.

After the URLs to the new container images have been updated, either manually as described in Section 16.1, “Configuring custom or local images”, or automatically through an update of the Ceph package, the monitoring services can be updated.

To do so, use ceph orch reconfig like so:

cephuser@adm > ceph orch reconfig node-exporter
cephuser@adm > ceph orch reconfig prometheus
cephuser@adm > ceph orch reconfig alertmanager
cephuser@adm > ceph orch reconfig grafana

Currently, no single command exists to update all monitoring services. The order in which the services are updated is not important.
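The four reconfig calls above can, for example, be issued in a simple loop. The following sketch prints the commands as a dry run; drop the echo to execute them against a live cluster:

```shell
# Dry run: print one "ceph orch reconfig" call per monitoring service.
# Remove the "echo" to actually run the commands on a live cluster.
for svc in node-exporter prometheus alertmanager grafana; do
  echo ceph orch reconfig "$svc"
done
```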


If you use custom container images, for example from a local container registry, the URLs specified for the monitoring services do not change automatically when the Ceph packages are updated. In that case, you need to specify the URLs of the new container images manually.

You can find the URLs of the recommended container images in Section 16.1, “Configuring custom or local images”.

16.3 Disabling monitoring

To disable the monitoring stack, run the following commands:

cephuser@adm > ceph orch rm grafana
cephuser@adm > ceph orch rm prometheus --force   # this will delete metrics data collected so far
cephuser@adm > ceph orch rm node-exporter
cephuser@adm > ceph orch rm alertmanager
cephuser@adm > ceph mgr module disable prometheus

16.4 Configuring Grafana

The Ceph Dashboard back-end requires the Grafana URL to be able to verify the existence of Grafana Dashboards before the front-end even loads them. Because of the nature of how Grafana is implemented in Ceph Dashboard, this means that two working connections are required in order to be able to see Grafana graphs in Ceph Dashboard:

  • The back-end (Ceph MGR module) needs to verify the existence of the requested graph. If this request succeeds, it lets the front-end know that it can safely access Grafana.

  • The front-end then requests the Grafana graphs directly from the user's browser using an iframe. The Grafana instance is accessed directly without any detour through Ceph Dashboard.

Your environment may make it difficult for the user's browser to directly access the URL configured in Ceph Dashboard. To solve this issue, a separate URL can be configured that is used solely to tell the front-end (the user's browser) which URL it should use to access Grafana.

To change the URL that is returned to the front-end, issue the following command:

cephuser@adm > ceph dashboard set-grafana-frontend-api-url GRAFANA-SERVER-URL

If no value is set for that option, it will simply fall back to the value of the GRAFANA_API_URL option, which is set automatically and periodically updated by cephadm. If set, it will instruct the browser to use this URL to access Grafana.

16.5 Configuring the Prometheus Manager Module

The Prometheus Manager Module is a module inside Ceph that extends Ceph's functionality. The module reads (meta-)data from Ceph about its state and health, providing the (scraped) data in a consumable format to Prometheus.


The Prometheus Manager Module needs to be restarted for the configuration changes to be applied.

16.5.1 Configuring the network interface

By default, the Prometheus Manager Module accepts HTTP requests on port 9283 on all IPv4 and IPv6 addresses on the host. The port and listen address are both configurable with ceph config set, using the keys mgr/prometheus/server_addr and mgr/prometheus/server_port. This port is registered with Prometheus's registry.

To update the server_addr, execute the following command:

cephuser@adm > ceph config set mgr mgr/prometheus/server_addr IP_ADDRESS

To update the server_port, execute the following command:

cephuser@adm > ceph config set mgr mgr/prometheus/server_port 9283

16.5.2 Configuring scrape_interval

By default, the Prometheus Manager Module is configured with a scrape interval of 15 seconds. We do not recommend using a scrape interval below 10 seconds. To set a different scrape interval in the Prometheus module, set scrape_interval to the desired value:


To work properly and not cause any issues, the scrape_interval of this module should always be set to match the Prometheus scrape interval.

cephuser@adm > ceph config set mgr mgr/prometheus/scrape_interval 15
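The matching Prometheus-side setting lives in the global section of prometheus.yml (normally generated and managed by cephadm); for example:

```yaml
# Prometheus-side counterpart of the module's scrape_interval
# (normally generated by cephadm; shown here for illustration only).
global:
  scrape_interval: 15s
```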

16.5.3 Configuring the cache

On large clusters (more than 1000 OSDs), the time to fetch the metrics may become significant. Without the cache, the Prometheus Manager Module can overload the manager and lead to unresponsive or crashing Ceph Manager instances. As a result, the cache is enabled by default and cannot be disabled, but this does mean that the cache can become stale. The cache is considered stale when the time to fetch the metrics from Ceph exceeds the configured scrape_interval.

If this is the case, a warning will be logged and the module will either:

  • Respond with a 503 HTTP status code (service unavailable).

  • Return the content of the cache, even though it might be stale.

This behavior can be configured using the ceph config set commands.

To tell the module to respond with possibly-stale data, set it to return:

cephuser@adm > ceph config set mgr mgr/prometheus/stale_cache_strategy return

To tell the module to respond with service unavailable, set it to fail:

cephuser@adm > ceph config set mgr mgr/prometheus/stale_cache_strategy fail

16.5.4 Enabling RBD-image monitoring

The Prometheus Manager Module can optionally collect RBD per-image IO statistics by enabling dynamic OSD performance counters. The statistics are gathered for all images in the pools that are specified in the mgr/prometheus/rbd_stats_pools configuration parameter.

The parameter is a comma- or space-separated list of pool[/namespace] entries. If the namespace is not specified, the statistics are collected for all namespaces in the pool.

For example:

cephuser@adm > ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"

The module scans the specified pools and namespaces, makes a list of all available images, and refreshes it periodically. The interval is configurable via the mgr/prometheus/rbd_stats_pools_refresh_interval parameter (in seconds), and is 300 seconds (five minutes) by default.

For example, if you changed the synchronization interval to 10 minutes:

cephuser@adm > ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600

16.6 Prometheus security model

Prometheus' security model presumes that untrusted users have access to the Prometheus HTTP endpoint and logs. Untrusted users have access to all the (meta-)data Prometheus collects that is contained in the database, plus a variety of operational and debugging information.

However, Prometheus' HTTP API is limited to read-only operations. Configurations cannot be changed using the API, and secrets are not exposed. Moreover, Prometheus has some built-in measures to mitigate the impact of denial-of-service attacks.

16.7 Prometheus Alertmanager SNMP webhook

If you want to get notified about Prometheus alerts via SNMP traps, then you can install the Prometheus Alertmanager SNMP webhook via cephadm. To do so, you need to create a service and placement specification file with the following content:


For more information on service and placement files, see Section 8.2, “Service and placement specification”.

service_type: container
service_id: prometheus-webhook-snmp
image: registry.suse.com/ses/7/prometheus-webhook-snmp:latest
args:
    - "--publish 9099:9099"
envs:
    - ARGS="--debug --snmp-host=ADD_HOST_GATEWAY_HERE"
    - RUN_ARGS="--metrics"

Use this service specification to get the service running with its default settings.

You need to publish the port the Prometheus receiver is listening on by using the command line argument --publish HOST_PORT:CONTAINER_PORT when running the service, because the port is not exposed automatically by the container. This can be done by adding the following lines to the specification:

    - "--publish 9099:9099"

Alternatively, connect the container to the host network by using the command line argument --network=host.

    - "--network=host"

If the SNMP trap receiver is not installed on the same host as the container, then you must also specify the FQDN of the SNMP host. Use the container's network gateway to be able to receive SNMP traps outside the container/host:

    - ARGS="--debug --snmp-host=CONTAINER_GATEWAY"

16.7.1 Configuring the prometheus-webhook-snmp service

The container can be configured by environment variables or by using a configuration file.

For the environment variables, use ARGS to set global options and RUN_ARGS for the run command options. Adapt the service specification as follows:

    envs:
        - ARGS="--debug --snmp-host=CONTAINER_GATEWAY"
        - RUN_ARGS="--metrics --port=9101"

To use a configuration file, adapt the service specification as follows:

    files:
        etc/prometheus-webhook-snmp.conf:
            - "debug: True"
            - "snmp_host: ADD_HOST_GATEWAY_HERE"
            - "metrics: True"
    volume_mounts:
        etc/prometheus-webhook-snmp.conf: /etc/prometheus-webhook-snmp.conf

To deploy, run the following command:

cephuser@adm > ceph orch apply -i SERVICE_SPEC_FILE

See Section 8.3, “Deploy Ceph services” for more information.

16.7.2 Configuring the Prometheus Alertmanager for SNMP

Finally, the Prometheus Alertmanager needs to be configured specifically for SNMP traps. If this service has not been deployed already, create a service specification file. You need to replace IP_OR_FQDN with the IP address or FQDN of the host where the Prometheus Alertmanager SNMP webhook has been installed. For example:


If you have already deployed this service, then to ensure the Alertmanager is set up correctly for SNMP, re-deploy with the following settings.

  service_type: alertmanager
  spec:
    webhook_urls:
      - 'http://IP_OR_FQDN:9099/'

Apply the service specification with the following command:

cephuser@adm > ceph orch apply -i SERVICE_SPEC_FILE