Applies to SUSE OpenStack Cloud Monitoring

4 Monitoring

SUSE OpenStack Cloud Monitoring offers various features which support you in proactively managing your cloud resources. A large number of metrics in combination with early warnings about problems and outages assists you in analyzing and troubleshooting any issue you encounter in your environment.

The monitoring features include:

  • A monitoring overview which allows you to access all monitoring information.

  • Metrics dashboards for visualizing your monitoring data.

  • Alerting features for monitoring.

In the following sections, you will find information on the monitoring overview and the metrics dashboards as well as details on how to define and handle alarms and notifications.

Accessing SUSE OpenStack Cloud Monitoring

To access SUSE OpenStack Cloud Monitoring and perform monitoring tasks, you must have access to the OpenStack platform as a user with the monasca-user or monasca-read-only-user role in the monasca tenant.

Log in to OpenStack Horizon with your user name and password. The functions you can use in OpenStack Horizon depend on your access permissions. To access logs and metrics, switch to the monasca tenant in Horizon. This allows you to access all monitoring data for SUSE OpenStack Cloud Monitoring.

Figure 4.1: SUSE OpenStack Cloud Horizon Dashboard—Monitoring

4.1 Overview

SUSE OpenStack Cloud Monitoring provides one convenient access point to your monitoring data. Use Monitoring > Overview to keep track of your services and servers and quickly check their status. The overview also indicates any irregularities in the log data of the system components you are monitoring.

On the Overview page, you can:

4.2 Working with Data Visualizations

The user interface for monitoring your services, servers, and log data integrates with Grafana, an open source application for visualizing large-scale monitoring data. Use the options at the top border of the Overview page to access Grafana.

SUSE OpenStack Cloud Monitoring ships with preconfigured metrics dashboards. You can instantly use them for monitoring your environment. You can also use them as a starting point for building your own dashboards.

Preconfigured Metrics Dashboard for SUSE OpenStack Cloud Monitoring

As a Monitoring Service operator, you use the Monasca Health option on the Overview page to view the metrics data on the Monitoring Service. The OpenStack operator uses the Dashboard option to view the metrics data on the OpenStack services.

Figure 4.2: Metrics Dashboard—Monitoring Service Operator's View

The preconfigured dashboard shows the following data on the Monitoring Service:

  • Status of the main SUSE OpenStack Cloud Monitoring components (UP or DOWN).

    In the upper part of the dashboard the components are grouped into metrics and log management components as well as common components that are used by metrics-based and log-based monitoring.

  • Information on system resources.

    The dashboard shows metrics data on CPU usage: the percentage of time the CPU is used in total (cpu.percent), at user level (cpu.user_perc), and at system level (cpu.system_perc), as well as the percentage of time the CPU is idle while at least one I/O request is in progress (cpu.wait_perc).

    The dashboard shows metrics data on memory usage: the number of megabytes of total memory (mem.total_mb), used memory (mem.used_mb), total swap memory (mem.swap_total_mb), and used swap memory (mem.swap_used_mb), as well as the number of megabytes used for the page cache (mem.used_cache).

    The dashboard visualizes metrics on the percentage of disk space that is being used on a device (disk.space_used_perc).

    The dashboard shows metrics data on the SUSE OpenStack Cloud Monitoring system load over different periods (load.avg_1_min, load.avg_5_min, and load.avg_15_min).

  • The network usage of SUSE OpenStack Cloud Monitoring.

    The dashboard shows the number of network bytes received and sent per second (net.in_bytes_sec and net.out_bytes_sec).

  • Metrics data on each SUSE OpenStack Cloud Monitoring component. The metrics that are visualized differ slightly from component to component. The dashboard shows, for example, the percentage of CPU that is consumed by a component (process.cpu_perc.value), the amount of physical memory that is allocated to the component (process.mem.rss_mbytes.value), or the number of processes that exist with the corresponding component name (process.pid_count.value).

Building Dashboards

Each metrics dashboard is composed of one or more panels that are arranged in one or more rows. A row serves as a logical divider within a dashboard. It organizes your panels in groups. The panel is the basic building block for visualizing your metrics data.

For building dashboards, you have two options:

  • Start from scratch and create a new dashboard.

  • Take the dashboard that is shipped with SUSE OpenStack Cloud Monitoring as a starting point and customize it.

The following sections provide introductory information on dashboards, rows, and panels, and make you familiar with the first steps involved in building a dashboard. For additional information, you can also refer to the Grafana documentation (http://docs.grafana.org/).

Creating a Dashboard

To create a new dashboard, use the Open icon in the top right corner of your dashboard window. The option provides access to various features for administrating dashboards. Click New to create an empty dashboard that serves as a starting point for adding rows and panels.

On the left side of an empty dashboard, a green rectangle is displayed. Hover over this rectangle to access the Row menu. To insert your first panel, use the options in the Add Panel submenu. See below for details on the available panel types.

As soon as you have inserted an empty panel, you can add additional rows. For this purpose, use the Add Row option on the right side of the dashboard.

Editing Rows

Features for editing rows can be accessed via the green rectangle that is displayed to the left of each row.

In addition to adding panels to a row, you can collapse or remove a row, move the position of the row within your dashboard, or set the row height. Row settings allows you, for example, to insert a row title or to hide the Row menu so that the row can no longer be edited.

Editing Panels

Grafana distinguishes between three panel types:

Panels of type Graph are used to visualize metrics data. A query editor is provided to define the data to be visualized. The editor allows you to combine multiple queries. This means that any number of metrics and data series can be visualized in one panel.

Figure 4.3: A Graph Panel

Panels of type Singlestat are also used to visualize metrics data, yet they reduce a single query to a single number. The single number can be, for example, the minimum, maximum, average, or sum of values of the data series. The single number can be translated into a text value, if required.

Panels of type Text are used to insert static text. The text may, for example, provide information for the dashboard users. Text panels are not connected to any metrics data.

As soon as you have added a panel to your dashboard, you can access the options for editing the panel content. For this purpose, click the panel title and use Edit:

  • For panels of type Text, a simple text editor is displayed for entering text. Plain text, HTML, and markdown format are supported.

    Figure 4.4: Editing a Text Panel

  • For panels of type Graph and Singlestat, a query editor is displayed to define which data is to be shown. You can add multiple metrics, and apply functions to the metrics. The query results will be visualized in your panel in real time.

A large number of display and formatting features are provided to customize how the content is presented in a panel. Click the panel title to access the corresponding options. The menu that is displayed also allows you to duplicate or remove a panel. To change the size of a panel, click the + and - icons.

You can move panels on your dashboard by simply dragging and dropping them within and between rows.

By default, the time range for panels is controlled by dashboard settings. Use the time picker in the top right corner of your dashboard window to define relative or absolute time ranges. You can also set an auto-refresh interval, or manually refresh the data that is displayed.

Saving and Sharing Dashboards

SUSE OpenStack Cloud Monitoring allows you to save a metrics dashboard and export it to a JSON file. The JSON file can be edited, it can be shared with other users, and it can be imported to SUSE OpenStack Cloud Monitoring again.

To save a dashboard, use Save in the top right corner of your dashboard window. You can enter a name for the dashboard and simply save it to your browser's local storage. Use Dashboard JSON to directly view the corresponding JSON syntax, or use Export dashboard to download the JSON file. The JSON file can be forwarded to other users, if required. To import a JSON file, use Open dashboard in the top left corner of the dashboard window.
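
Because an exported dashboard is plain JSON, it can be inspected with a short script before it is shared or imported again. The following Python sketch assumes the row-based layout used by older Grafana versions, that is, a top-level rows list whose entries each hold a panels list with title and type keys. The sample document and the function name summarize_dashboard are made up for illustration; the keys in your own export may differ.

```python
import json

# Hypothetical, heavily shortened export; a real file contains many more keys.
SAMPLE_EXPORT = """
{
  "title": "My Monitoring Dashboard",
  "rows": [
    {"title": "System", "panels": [
      {"title": "CPU", "type": "graph"},
      {"title": "Load (1 min)", "type": "singlestat"}
    ]},
    {"title": "Notes", "panels": [
      {"title": "Read me", "type": "text"}
    ]}
  ]
}
"""


def summarize_dashboard(raw_json):
    """Return (row title, panel title, panel type) for every panel."""
    dashboard = json.loads(raw_json)
    summary = []
    # .get() with defaults keeps the sketch tolerant of missing keys.
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            summary.append((row.get("title", ""),
                            panel.get("title", ""),
                            panel.get("type", "")))
    return summary


for row_title, panel_title, panel_type in summarize_dashboard(SAMPLE_EXPORT):
    print(f"{row_title}: {panel_title} ({panel_type})")
```

A listing like this is a quick way to check that a file you received contains the rows and panel types you expect.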

4.3 Defining Alarms

You have to define alarms to monitor your cloud resources. An alarm definition specifies the metrics to be collected and the threshold at which an alarm is to be triggered for a cloud resource. If the specified threshold is reached or exceeded, the alarm is triggered and notifications can be sent to inform users. By default, an alarm definition is evaluated every minute.

To handle a large variety of monitoring requirements, you can create either simple alarm definitions that refer to a single metric, or compound alarm definitions that combine multiple metrics and allow you to track and process more complex events.

Example for a simple alarm definition that checks whether the system-level load of the CPU exceeds a threshold of 90 percent:

cpu.system_perc{hostname=monasca} > 90

Example for a simple alarm definition that checks the average system-level load of the CPU over four consecutive periods of 120 seconds each (480 seconds in total). The alarm is triggered only if the average is greater than 95 percent in each of these periods:

avg(cpu.system_perc{hostname=monasca}, 120) > 95 times 4
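
To make the arithmetic explicit: the expression uses a period of 120 seconds and times 4, so 4 × 120 = 480 seconds are covered. The Python sketch below mimics that evaluation on made-up measurements. It only illustrates the semantics as understood here and is not the Monasca threshold engine; the function name and the layout of the sample data are assumptions.

```python
from statistics import mean


def exceeds_for_periods(samples, period, periods, threshold):
    """Check whether the per-period average is above threshold in every period.

    samples: list of (seconds_since_start, value) tuples (hypothetical layout).
    Returns None when a period contains no data, so nothing can be evaluated.
    """
    for index in range(periods):
        start, end = index * period, (index + 1) * period
        values = [value for second, value in samples if start <= second < end]
        if not values:
            return None
        if not mean(values) > threshold:
            return False
    return True


# One made-up measurement every 30 seconds over 480 seconds.
busy = [(second, 97.0) for second in range(0, 480, 30)]
# Same data, but the third period (seconds 240 to 359) drops to 80 percent.
dip = [(second, 80.0 if 240 <= second < 360 else 97.0)
       for second in range(0, 480, 30)]

print(exceeds_for_periods(busy, period=120, periods=4, threshold=95))  # True
print(exceeds_for_periods(dip, period=120, periods=4, threshold=95))   # False
```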

Example for a compound alarm definition that evaluates two metrics. The alarm is triggered if either the system-level load of the CPU exceeds a threshold of 90 percent, or if the disk space that is used by the specified service exceeds a threshold of 90 percent:

avg(cpu.system_perc{hostname=monasca}) > 90 OR
max(disk.space_used_perc{service=monitoring}) > 90
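
The OR logic of this compound expression can be tried out on made-up values. The following Python sketch mirrors only the two sub-expressions of the example; the function name compound_alarm and the sample numbers are hypothetical.

```python
from statistics import mean


def compound_alarm(cpu_system_perc, disk_space_used_perc):
    """Mirror avg(cpu.system_perc) > 90 OR max(disk.space_used_perc) > 90."""
    cpu_high = mean(cpu_system_perc) > 90
    disk_high = max(disk_space_used_perc) > 90
    # One sub-expression evaluating to true is enough to trigger the alarm.
    return cpu_high or disk_high


print(compound_alarm([50.0, 60.0], [40.0, 95.0]))  # True: disk usage is high
print(compound_alarm([92.0, 95.0], [10.0]))        # True: CPU load is high
print(compound_alarm([50.0], [50.0]))              # False: both are below 90
```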

To create, edit, and delete alarms, use Monitoring > Alarm Definitions.

The elements that define an alarm are grouped into Details, Expression, and Notifications. They are described in the following sections.

Details

For an alarm definition, you specify the following details:

  • Name. Mandatory identifier of the alarm. The name must be unique within the project for which you define the alarm.

  • Description. Optional. A short description that explains the purpose of the alarm.

  • Severity. The following severities for an alarm are supported: Low (default), Medium, High, or Critical.

    The severity affects the status information on the Overview page. If an alarm that is defined as Critical is triggered, the corresponding resource is displayed in a red box. If an alarm that is defined as Low, Medium, or High is triggered, the corresponding resource is displayed in a yellow box only.

    The severity level is subjective. Choose a level that is appropriate for prioritizing the alarms in your environment.

Figure 4.5: Creating an Alarm Definition

Expression

The expression defines how to evaluate a metric. The expression syntax is based on a simple expressive grammar. For details, refer to the Monasca API documentation (https://github.com/openstack/monasca-api/blob/stable/ocata/docs/monasca-api-spec.md).

To define an alarm expression, proceed as follows:

  1. Select the metrics to be evaluated.

  2. Select a statistical function for the metrics: min to monitor the minimum values, max to monitor the maximum values, sum to monitor the sum of the values, count for the monitored number, or avg for the arithmetic average.

  3. Enter one or multiple dimensions in the Add a dimension field to further qualify the metrics.

    Dimensions filter the data to be monitored. They narrow down the evaluation to specific entities. Each dimension consists of a key/value pair that allows for a flexible and concise description of the data to be monitored, for example, region, availability zone, service tier, or resource ID.

    The dimensions available for the selected metrics are displayed in the Matching Metrics section. Type the name of the key you want to associate with the metrics in the Add a dimension field. You are offered a select list for adding the required key/value pair.

  4. Enter the threshold value at which an alarm is to be triggered, and combine it with a relational operator <, >, <=, or >=.

    The unit of the threshold value is related to the metrics for which you define the threshold, for example, the unit is percentage for cpu.idle_perc or MB for disk.total_used_space_mb.

  5. Switch on the Deterministic option if you evaluate a metric for which data is received only sporadically. The option should be switched on, for example, for all log metrics. This ensures that the alarm status is OK and displayed as a green box on the Overview page even if metrics data has not yet been received.

    Do not switch on the option if you evaluate a metric for which data is received regularly. This ensures that you instantly notice, for example, that a host machine is offline and that there is no metrics data for the agent to collect. In that case, the alarm status on the Overview page changes from OK to UNDETERMINED and is displayed as a gray box.

  6. Enter one or multiple dimensions in the Match by field if you want these dimensions to be taken into account for triggering alarms.

    Example: If you enter hostname as dimension, individual alarms will be created for each host machine on which metrics data is collected. The expression you have defined is not evaluated as a whole but individually for each host machine in your environment.

    If Match by is set to a dimension, the number of alarms depends on the number of dimension values on which metrics data is received. An empty Match by field results in exactly one alarm.

    To enter a dimension, you can simply type the name of the dimension in the Match by field. The dimensions you enter cannot be changed once the alarm definition is saved.

  7. Build a compound alarm definition to combine multiple metrics in one expression. Using the logical operators AND or OR, any number of sub-expressions can be combined.

    Use the Add button to create a second expression, and choose either AND or OR as Operator to connect it to the one you have already defined. Proceed with the second expression as described in Step 1 to Step 6 above.

    The following options are provided for creating and organizing compound alarm definitions:

    • Create additional sub-expressions using the Add button.

    • Finish editing a sub-expression using the Submit button.

    • Delete a sub-expression using the Remove button.

    • Change the position of a sub-expression using the Up or Down button.
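
The steps above can be sketched as string building. The following Python code is an illustration only: it assembles expressions in the form used by the examples in this section and estimates how many alarms a Match by setting produces. All function names are hypothetical, and the authoritative grammar is the one in the Monasca API documentation.

```python
FUNCTIONS = {"min", "max", "sum", "count", "avg"}
RELATIONAL_OPERATORS = {"<", ">", "<=", ">="}
LOGICAL_OPERATORS = {"AND", "OR"}


def sub_expression(function, metric, operator, threshold, dimensions=None):
    """Steps 1 to 4: function, metric, optional dimensions, threshold."""
    if function not in FUNCTIONS:
        raise ValueError(f"unsupported function: {function}")
    if operator not in RELATIONAL_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    selector = ""
    if dimensions:
        # Sorted only to make the generated string reproducible.
        pairs = ",".join(f"{key}={value}"
                         for key, value in sorted(dimensions.items()))
        selector = "{" + pairs + "}"
    return f"{function}({metric}{selector}) {operator} {threshold}"


def compound(first, *more):
    """Step 7: append (logical operator, sub-expression) pairs."""
    parts = [first]
    for logical, expression in more:
        if logical not in LOGICAL_OPERATORS:
            raise ValueError(f"unsupported logical operator: {logical}")
        parts += [logical, expression]
    return " ".join(parts)


def expected_alarm_count(metric_dimensions, match_by):
    """Step 6: one alarm per distinct combination of Match by values."""
    if not match_by:
        return 1
    return len({tuple(dims.get(key) for key in match_by)
                for dims in metric_dimensions})


cpu = sub_expression("avg", "cpu.system_perc", ">", 90,
                     {"hostname": "monasca"})
disk = sub_expression("max", "disk.space_used_perc", ">", 90,
                      {"service": "monitoring"})
print(compound(cpu, ("OR", disk)))

# Hypothetical dimension sets of the metrics that currently report data.
reporting = [{"hostname": "host1"}, {"hostname": "host2"}, {"hostname": "host1"}]
print(expected_alarm_count(reporting, ["hostname"]))  # 2 distinct host names
print(expected_alarm_count(reporting, []))            # empty Match by: 1
```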

Note

You can also edit the expression syntax directly. For this purpose, save your alarm definition and update it using the Edit Alarm Definition option.

By default, an alarm definition is evaluated every minute. When updating the alarm definition, you can change this interval. For syntax details, refer to the Monasca API documentation on Alarm Definition Expressions (https://github.com/openstack/monasca-api/blob/stable/ocata/docs/monasca-api-spec.md#alarm-definition-expressions).

Notifications

You can enable notifications for an alarm definition. As soon as an alarm is triggered, the enabled notifications will be sent.

The Notifications tab allows you to select the notifications from the ones that are predefined in your environment. For a selected notification, you specify whether you want to send it for a status transition to Alarm, OK, and/or Undetermined.

For details on defining notifications, refer to Section 4.4, “Defining Notifications”. For details on alarm statuses, refer to Section 4.5, “Status of Services, Servers, and Log Data”.
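
For readers who script against the Monasca API instead of using the user interface, the three groups of elements map to a single JSON document. The following Python sketch assembles such a request body. The field names (name, description, severity, expression, match_by, alarm_actions, ok_actions, undetermined_actions) follow the Monasca API specification as understood here and should be checked against the API documentation for your release; the notification ID is a placeholder, and nothing is sent to a server.

```python
import json

SEVERITIES = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


def alarm_definition(name, expression, description="", severity="LOW",
                     match_by=(), alarm_actions=(), ok_actions=(),
                     undetermined_actions=()):
    """Assemble a request body from Details, Expression, and Notifications."""
    if not name:
        raise ValueError("the name is mandatory")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        # Details
        "name": name,
        "description": description,
        "severity": severity,
        # Expression
        "expression": expression,
        "match_by": list(match_by),
        # Notifications: IDs of notifications, one list per status transition
        "alarm_actions": list(alarm_actions),
        "ok_actions": list(ok_actions),
        "undetermined_actions": list(undetermined_actions),
    }


body = alarm_definition(
    name="High system CPU",
    expression="avg(cpu.system_perc) > 90",
    severity="HIGH",
    match_by=["hostname"],
    alarm_actions=["<notification-id>"],  # placeholder, not a real ID
)
print(json.dumps(body, indent=2))
```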

4.4 Defining Notifications

Notifications define how users are informed when a threshold value defined for an alarm is reached or exceeded. In the alarm definition, you can assign one or multiple notifications.

For a notification, you specify the following elements:

  • Name. A unique identifier of the notification. The name is offered for selection when defining an alarm.

  • Type. Email is the notification method supported by SUSE OpenStack Cloud Monitoring. If you want to use WebHook or PagerDuty, contact your SUSE OpenStack Cloud Monitoring support for further information.

  • Address. The email address to be notified when an alarm is triggered.

    Note

    Generic top-level domains such as business domain names are not supported in email addresses (for example, user@xyz.company).

To create, edit, and delete notifications, use Monitoring > Notifications.
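
The three elements of a notification can also be written down as a small data structure, for example when preparing a request to the Monasca API. The following Python sketch builds such a structure for an email notification. The field names and the type value EMAIL follow the Monasca API specification as understood here; the function name, the sample address, and the very basic address check are assumptions made for illustration.

```python
def email_notification(name, address):
    """Build the name/type/address triple for an email notification."""
    local_part, separator, domain = address.partition("@")
    # Deliberately minimal plausibility check, not full address validation.
    if not (local_part and separator and "." in domain):
        raise ValueError(f"not a plausible email address: {address}")
    return {"name": name, "type": "EMAIL", "address": address}


notification = email_notification("ops-team", "ops@example.com")
print(notification)
```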

4.5 Status of Services, Servers, and Log Data

An alarm definition for a service, server, or log data is evaluated over the interval specified in the alarm expression. The alarm definition is re-evaluated in each subsequent interval. The following alarm statuses are distinguished:

  • Alarm. The alarm expression has evaluated to true. An alarm has been triggered for the cloud resource.

  • OK. The alarm expression has evaluated to false. There is no need to trigger an alarm.

  • Undetermined. No metrics data has been received within the defined interval.
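
The three statuses, together with the Deterministic option described in Section 4.3, can be summarized in a few lines of Python. This is a simplified sketch that assumes an expression of the form max(metric) > threshold and a hypothetical function name; it is not the actual evaluation code of the Monitoring Service.

```python
def alarm_status(measurements, threshold, deterministic=False):
    """Return the alarm status for one evaluation interval."""
    if not measurements:
        # No data in the interval: a deterministic alarm stays OK,
        # any other alarm becomes UNDETERMINED.
        return "OK" if deterministic else "UNDETERMINED"
    return "ALARM" if max(measurements) > threshold else "OK"


print(alarm_status([42.0, 97.5], threshold=90))         # ALARM
print(alarm_status([42.0, 55.0], threshold=90))         # OK
print(alarm_status([], threshold=90))                   # UNDETERMINED
print(alarm_status([], threshold=90, deterministic=True))  # OK
```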

As soon as you have defined an alarm for a cloud resource, there is status information displayed for it on the Overview page:

The color of the boxes in the three sections indicates the status:

  • A green box for a service or server indicates that it is up and running. A green box for a log path indicates that a defined threshold for errors or warnings, for example, has not yet been reached or exceeded. There are alarms defined for the services, servers, or log paths, but no alarms have been triggered.

  • A red box for a service, server, or log path indicates that there is a severe problem that needs to be checked. One or multiple alarms defined for a service, a server, or log data have been triggered.

  • A yellow box indicates a problem. One or multiple alarms have been triggered, but their severity is Low, Medium, or High rather than Critical.

  • A gray box indicates that alarms have been defined but metrics data has not been received.

The status information on the Overview page results from one or multiple alarms that have been defined for the corresponding resource. If multiple alarms are defined, the severity of the individual alarms controls the status color.
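
How several alarms might combine into one box color can be sketched as follows. The rule that Critical alarms lead to red and all other triggered alarms lead to yellow is taken from Section 4.3; the precedence of gray over green when some alarms are OK and others are Undetermined is an assumption of this sketch, as are the function name and the data layout.

```python
def box_color(alarms):
    """Derive a box color from (status, severity) pairs of one resource."""
    if not alarms:
        return None  # no alarm defined, so no status to display
    triggered = [severity for status, severity in alarms if status == "ALARM"]
    if "CRITICAL" in triggered:
        return "red"
    if triggered:
        return "yellow"
    # Assumption: missing data is reported before an all-clear.
    if any(status == "UNDETERMINED" for status, _ in alarms):
        return "gray"
    return "green"


print(box_color([("OK", "LOW"), ("ALARM", "CRITICAL")]))   # red
print(box_color([("ALARM", "HIGH"), ("OK", "CRITICAL")]))  # yellow
print(box_color([("OK", "LOW"), ("UNDETERMINED", "LOW")])) # gray
print(box_color([("OK", "LOW"), ("OK", "HIGH")]))          # green
```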

You can click a resource on the Overview page to display details on the related alarms. The details include the status of each alarm and the expression that is evaluated. For each alarm, you can drill down on the alarm history. To narrow down the problem, the history presents detailed information on the status transitions.
