Applies to SUSE OpenStack Cloud 9

13 Managing Monitoring, Logging, and Usage Reporting

Information about the monitoring, logging, and metering services included with your SUSE OpenStack Cloud.

13.1 Monitoring

The SUSE OpenStack Cloud Monitoring service leverages OpenStack monasca, which is a multi-tenant, scalable, fault tolerant monitoring service.

13.1.1 Getting Started with Monitoring

You can use the SUSE OpenStack Cloud Monitoring service to monitor the health of your cloud and, if necessary, to troubleshoot issues.

monasca data can be extracted and used for a variety of legitimate purposes, and different purposes require different forms of data sanitization or encoding to protect against invalid or malicious data. Any data pulled from monasca should be considered untrusted data, so users are advised to apply appropriate encoding and/or sanitization techniques to ensure safe and correct usage and display of data in a web browser, database scan, or any other use of the data.

13.1.1.1 Monitoring Service Overview

13.1.1.1.1 Installation

The monitoring service is automatically installed as part of the SUSE OpenStack Cloud installation.

No specific configuration is required to use monasca. However, you can configure the database for storing metrics as explained in Section 13.1.2, “Configuring the Monitoring Service”.

13.1.1.1.2 Differences Between Upstream and SUSE OpenStack Cloud Implementations

In SUSE OpenStack Cloud, the OpenStack monitoring service, monasca, is included as the monitoring solution, with the exception of the following components, which are not included:

  • Transform Engine

  • Events Engine

  • Anomaly and Prediction Engine

Note

Icinga was supported in previous SUSE OpenStack Cloud versions but it has been deprecated in SUSE OpenStack Cloud 9.

13.1.1.1.3 Diagram of monasca Service
Image
13.1.1.1.4 For More Information

For more details on OpenStack monasca, see monasca.io

13.1.1.1.5 Back-end Database

The monitoring service default metrics database is Cassandra, which is a highly-scalable analytics database and the recommended database for SUSE OpenStack Cloud.

You can learn more about Cassandra at Apache Cassandra.

13.1.1.2 Working with Monasca

monasca-Agent

The monasca-agent is a Python program that runs on the control plane nodes. It runs the defined checks and then sends the resulting data to the monasca API. The checks that the agent runs include:

  • System Metrics: CPU utilization, memory usage, disk I/O, network I/O, and filesystem utilization on the control plane and resource nodes.

  • Service Metrics: the agent supports plugins such as MySQL, RabbitMQ, Kafka, and many others.

  • VM Metrics: CPU utilization, disk I/O, network I/O, and memory usage of hosted virtual machines on compute nodes. Full details of these can be found at https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md#per-instance-metrics.

For a full list of packaged plugins that are included in SUSE OpenStack Cloud, see monasca Plugins.

To further customize the monasca-agent to suit your needs, see Customizing the Agent.

13.1.1.3 Accessing the Monitoring Service

Access to the Monitoring service is available through a number of different interfaces.

13.1.1.3.1 Command-Line Interface

For users who prefer using the command line, there is the python-monascaclient, which is part of the default installation on your Cloud Lifecycle Manager node.

For details on the CLI, including installation instructions, see Python-monasca Client
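
For example, after sourcing the service credentials on the Cloud Lifecycle Manager (the exact credentials file depends on your deployment), you can query metrics, alarm definitions, and notification methods directly. This is a minimal sketch; the metric name used for filtering is illustrative only:

ardana > monasca metric-list --name cpu.idle_perc
ardana > monasca alarm-definition-list
ardana > monasca notification-list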

monasca API

If low-level access is desired, there is the monasca REST API.

Full details of the monasca API can be found on GitHub.
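
As a rough sketch of low-level access, you can call the REST API with a keystone token. The endpoint URL below is a placeholder; take the real one from your service catalog (for example, via openstack endpoint list):

ardana > TOKEN=$(openstack token issue -f value -c id)
ardana > curl -s -H "X-Auth-Token: $TOKEN" \
  "https://<monasca-api-endpoint>/v2.0/metrics?name=cpu.idle_perc"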

13.1.1.3.2 Operations Console GUI

You can use the Operations Console (Ops Console) for SUSE OpenStack Cloud to view data about your SUSE OpenStack Cloud infrastructure in a web-based graphical user interface (GUI) and ensure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can triage alarm notifications and manage alarm data in the following ways:

  • Alarm Definitions and notifications now have their own screens and are collected under the Alarm Explorer menu item which can be accessed from the Central Dashboard. Central Dashboard now allows you to customize the view in the following ways:

    • Rename or re-configure existing alarm cards to include services different from the defaults

    • Create a new alarm card with the services you want to select

    • Reorder alarm cards using drag and drop

    • View all alarms that have no service dimension now grouped in an Uncategorized Alarms card

    • View all alarms that have a service dimension that does not match any of the other cards, now grouped in an Other Alarms card

  • You can also easily access alarm data for a specific component. On the Summary page for the following components, a link is provided to an alarms screen specifically for that component.

13.1.1.3.3 Connecting to the Operations Console

To connect to Operations Console, perform the following:

  • Ensure your login has the required access credentials.

  • Connect through a browser.

  • Optionally use a Host name OR virtual IP address to access Operations Console.

Operations Console will always be accessed over port 9095.
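
For example, using the default port and a host name or virtual IP appropriate to your deployment, the URL takes a form similar to the following:

https://<hostname or virtual IP>:9095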

13.1.1.4 Service Alarm Definitions

SUSE OpenStack Cloud comes with some predefined monitoring alarms for the services installed.

Full details of all service alarms can be found here: Section 18.1.1, “Alarm Resolution Procedures”.

Each alarm will have one of the following statuses:

  • Critical - Open alarms, identified by red indicator.

  • Warning - Open alarms, identified by yellow indicator.

  • Unknown - Open alarms, identified by gray indicator. Unknown will be the status of an alarm that has stopped receiving a metric. This can be caused by the following conditions:

    • An alarm exists for a service or component that is not installed in the environment.

    • An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.

    • There is a gap between the last reported metric and the next metric.

  • Open - Complete list of open alarms.

  • Total - Complete list of alarms, may include Acknowledged and Resolved alarms.

When alarms are triggered it is helpful to review the service logs.
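
As a quick sketch of reviewing alarms from the command line (assuming the monasca CLI is configured with valid credentials), you can filter by the API state names OK, ALARM, and UNDETERMINED:

ardana > monasca alarm-list --state ALARM
ardana > monasca alarm-list --state UNDETERMINED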

13.1.2 Configuring the Monitoring Service

The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. You also have options for your alarm metrics database should you choose not to use the default option provided with the product.

In SUSE OpenStack Cloud you have the option to specify an SMTP server for email notifications and a database platform you want to use for the metrics database. These steps will assist in this process.

13.1.2.1 Configuring the Monitoring Email Notification Settings

The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. In SUSE OpenStack Cloud, you have the option to specify an SMTP server for email notifications. These steps will assist in this process.

If you are going to use the email notification feature of the monitoring service, you must set the configuration options with valid email settings, including an SMTP server and valid email addresses. The email server is not provided by SUSE OpenStack Cloud, but must be specified in the configuration file described below. The email server must support SMTP.
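
Before configuring the settings, it can be useful to confirm that the SMTP server is reachable from the Cloud Lifecycle Manager. A minimal check, assuming netcat is installed and using a placeholder host name:

ardana > nc -vz mailserver.examplecloud.com 25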

13.1.2.1.1 Configuring monitoring notification settings during initial installation
  1. Log in to the Cloud Lifecycle Manager.

  2. To change the SMTP server configuration settings edit the following file:

    ~/openstack/my_cloud/definition/cloudConfig.yml
    1. Enter your email server settings. Here is an example snippet showing the configuration file contents; uncomment these lines before entering your environment details.

          smtp-settings:
          #  server: mailserver.examplecloud.com
          #  port: 25
          #  timeout: 15
          # These are only needed if your server requires authentication
          #  user:
          #  password:

      This table explains each of these values:

      Value / Description
      Server (required)

      The server entry must be uncommented and set to a valid hostname or IP Address.

      Port (optional)

      If your SMTP server is running on a port other than the standard 25, then uncomment the port line and set it to your port.

      Timeout (optional)

      If your email server is heavily loaded, the timeout parameter can be uncommented and set to a larger value. 15 seconds is the default.

      User / Password (optional)

      If your SMTP server requires authentication, then you can configure user and password. Use double quotes around the password to avoid issues with special characters.

  3. To configure the sending email addresses, edit the following file:

    ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml

    Modify the following value to add your sending email address:

    email_from_addr
    Note

    The default value in the file is email_from_address: notification@exampleCloud.com which you should edit.

  4. [optional] To configure the receiving email addresses, edit the following file:

    ~/openstack/ardana/ansible/roles/monasca-default-alarms/defaults/main.yml

    Modify the following value to configure a receiving email address:

    notification_address
    Note

    You can also set the receiving email address via the Operations Console. Instructions for this are in Section 13.1.2.1.3, “Configuring monitoring notification settings after the initial installation”.

  5. If your environment requires a proxy address then you can add that in as well:

    # notification_environment can be used to configure proxies if needed.
    # Below is an example configuration. Note that all of the quotes are required.
    # notification_environment: '"http_proxy=http://<your_proxy>:<port>" "https_proxy=http://<your_proxy>:<port>"'
    notification_environment: ''
  6. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Updated monitoring service email notification settings"
  7. Continue with your installation.

13.1.2.1.2 Monasca and Apache Commons Validator

The monasca notification uses a standard Apache Commons validator to validate the configured SUSE OpenStack Cloud domain names before sending the notification over webhook. monasca notification supports some non-standard domain names, but not all. See the Domain Validator documentation for more information: https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/DomainValidator.html

You should ensure that any domains that you use are supported by IETF and IANA. As an example, .local is not listed by IANA and is invalid but .gov and .edu are valid.

Failure to use supported domains will generate an unprocessable exception in monasca notification create:

HTTPException code=422 message={"unprocessable_entity":
{"code":422,"message":"Address https://myopenstack.sample:8000/v1/signal/test is not of correct format","details":"","internal_code":"c6cf9d9eb79c3fc4"}
13.1.2.1.3 Configuring monitoring notification settings after the initial installation

If you need to make changes to the email notification settings after your initial deployment, you can change the "From" address using the configuration files but the "To" address will need to be changed in the Operations Console. The following section will describe both of these processes.

To change the sending email address:

  1. Log in to the Cloud Lifecycle Manager.

  2. To configure the sending email addresses, edit the following file:

    ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml

    Modify the following value to add your sending email address:

    email_from_addr
    Note

    The default value in the file is email_from_address: notification@exampleCloud.com which you should edit.

  3. Commit your configuration to the local Git repository (Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Updated monitoring service email notification settings"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the monasca reconfigure playbook to deploy the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
    Note

    You may need to use the --ask-vault-pass switch if you opted for encryption during the initial deployment.
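
    For example, the same playbook run with the vault password prompt (only needed if you opted for encryption):

    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification --ask-vault-pass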

To change the receiving ("To") email address via the Operations Console after installation:

  1. Connect to and log in to the Operations Console.

  2. On the Home screen, click the menu represented by 3 horizontal lines (Three-Line Icon).

  3. From the menu that slides in on the left side, click Home, and then Alarm Explorer.

  4. On the Alarm Explorer page, at the top, click the Notification Methods text.

  5. On the Notification Methods page, find the row with the Default Email notification.

  6. In the Default Email row, click the details icon (Ellipsis Icon), then click Edit.

  7. On the Edit Notification Method: Default Email page, in Name, Type, and Address/Key, type in the values you want to use.

  8. On the Edit Notification Method: Default Email page, click Update Notification.

Important

Once the notification has been added via the Operations Console, the procedures using the Ansible playbooks will not change it.

13.1.2.2 Managing Notification Methods for Alarms

13.1.2.2.1 Enabling a Proxy for Webhook or Pager Duty Notifications

If your environment requires a proxy in order for communications to function, then these steps will show you how to enable one. These steps are only needed if you are using the webhook or PagerDuty notification methods.

These steps require access to the Cloud Lifecycle Manager in your cloud deployment, so you may need to contact your administrator. You can make these changes during the initial configuration phase prior to the first installation, or you can modify an existing environment; the only difference is the additional steps required for an existing environment.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml file and edit the line below with your proxy address values:

    notification_environment: '"http_proxy=http://<proxy_address>:<port>" "https_proxy=<http://proxy_address>:<port>"'
    Note

    There are single quotation marks around the entire value of this entry and then double quotation marks around the individual proxy entries. This formatting must exist when you enter these values into your configuration file.

  3. If you are making these changes prior to your initial installation then you are done and can continue on with the installation. However, if you are modifying an existing environment, you will need to continue on with the remaining steps below.

  4. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Generate an updated deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the monasca reconfigure playbook to enable these changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
13.1.2.2.2 Creating a New Notification Method
  1. Log in to the Operations Console.

  2. Use the navigation menu to go to the Alarm Explorer page:

    Image
  3. Select the Notification Methods menu and then click the Create Notification Method button:

    Image
  4. On the Create Notification Method window you will select your options and then click the Create Notification button.

    Image

    A description of each of the fields you use for each notification method:

    Field / Description
    Name

    Enter a unique name value for the notification method you are creating.

    Type

    Choose a type. Available values are Webhook, Email, or Pager Duty.

    Address/Key

    Enter the value corresponding to the type you chose.
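
    The same notification methods can also be created from the command line; for example, an email notification method with a placeholder address:

    ardana > monasca notification-create MyEmailNotification EMAIL ops-team@example.com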
13.1.2.2.3 Applying a Notification Method to an Alarm Definition
  1. Log in to the Operations Console.

  2. Use the navigation menu to go to the Alarm Explorer page:

    Image
  3. Select the Alarm Definition menu which will give you a list of each of the alarm definitions in your environment.

    Image
  4. Locate the alarm you want to change the notification method for and click on its name to bring up the edit menu. You can use the sorting methods for assistance.

  5. In the edit menu, scroll down to the Notifications and Severity section where you will select one or more Notification Methods before selecting the Update Alarm Definition button:

    Image
  6. Repeat as needed until all of your alarms have the notification methods you desire.

13.1.2.3 Enabling the RabbitMQ Admin Console

The RabbitMQ Admin Console is off by default in SUSE OpenStack Cloud. You can turn on the console by following these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/config/rabbitmq/main.yml file. Under the rabbit_plugins: line, uncomment

    - rabbitmq_management
  3. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "Enabled RabbitMQ Admin Console"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the RabbitMQ reconfigure playbook to deploy the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-reconfigure.yml

To turn the RabbitMQ Admin Console off again, add the comment back and repeat steps 3 through 6.
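
Once the console is enabled, the management plugin also exposes an HTTP API that can be used for a quick health check. This is a sketch only: 15672 is the plugin's usual default port, and the host, user, and password depend on your deployment:

ardana > curl -u <rabbitmq_user>:<rabbitmq_password> http://<rabbitmq_host>:15672/api/overview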

13.1.2.4 Capacity Reporting and Monasca Transform

Capacity reporting is a new feature in SUSE OpenStack Cloud that provides cloud operators with overall capacity information (available, used, and remaining) via the Operations Console, so that they can ensure that cloud resource pools have sufficient capacity to meet the demands of users. Cloud operators can also set thresholds and alarms to be notified when the thresholds are reached.

For Compute

  • Host Capacity - CPU/Disk/Memory: Used, Available and Remaining Capacity - for the entire cloud installation or by host

  • VM Capacity - CPU/Disk/Memory: Allocated, Available and Remaining - for the entire cloud installation, by host or by project

For Object Storage

  • Disk Capacity - Used, Available and Remaining Capacity - for the entire cloud installation or by project

In addition to overall capacity, roll-up views with appropriate slices break the data down by a particular project or compute node. Graphs also show trends and the change in capacity over time.

13.1.2.4.1 monasca Transform Features
  • monasca Transform is a new component in monasca which transforms and aggregates metrics using Apache Spark

  • Aggregated metrics are published to Kafka, are available to other monasca components like monasca-threshold, and are stored in the monasca datastore

  • Cloud operators can set thresholds and set alarms to receive notifications when thresholds are met.

  • These aggregated metrics are made available to the cloud operators via Operations Console's new Capacity Summary (reporting) UI

  • Capacity reporting is a new feature in SUSE OpenStack Cloud which provides cloud operators with overall capacity (available, used and remaining) for Compute and Object Storage

  • Cloud operators can look at Capacity reporting via Operations Console's Compute Capacity Summary and Object Storage Capacity Summary UI

  • Capacity reporting allows cloud operators to ensure that cloud resource pools have sufficient capacity to meet the demands of users. See the table below for service and capacity types.

  • A list of aggregated metrics is provided in Section 13.1.2.4.4, “New Aggregated Metrics”.

  • Capacity reporting aggregated metrics are aggregated and published every hour

  • In addition to the overall capacity, there are graphs which show the capacity trends over a time range (1 day, 7 days, 30 days, or 45 days)

  • Graphs showing the capacity trends by a particular project or compute host are also provided.

  • monasca Transform is integrated with centralized monitoring (monasca) and centralized logging

  • Flexible Deployment

  • Upgrade & Patch Support

Service: Compute

  Type of Capacity: Host Capacity
  Description: CPU/Disk/Memory: Used, Available and Remaining Capacity - for entire cloud installation or by compute host

  Type of Capacity: VM Capacity
  Description: CPU/Disk/Memory: Allocated, Available and Remaining - for entire cloud installation, by host or by project

Service: Object Storage

  Type of Capacity: Disk Capacity
  Description: Used, Available and Remaining Disk Capacity - for entire cloud installation or by project

  Type of Capacity: Storage Capacity
  Description: Utilized Storage Capacity - for entire cloud installation or by project

13.1.2.4.2 Architecture for Monasca Transform and Spark

monasca Transform is a new component in monasca. monasca Transform uses Spark for data aggregation. Both monasca Transform and Spark are depicted in the example diagram below.

Image

You can see that the monasca components run on the Cloud Controller nodes, and the monasca agents run on all nodes in the Mid-scale Example configuration.

Image
13.1.2.4.3 Components for Capacity Reporting
13.1.2.4.3.1 monasca Transform: Data Aggregation Reporting

monasca-transform is a new component which provides a mechanism to aggregate or transform metrics and publish new aggregated metrics to monasca.

monasca Transform is a data driven Apache Spark based data aggregation engine which collects, groups and aggregates existing individual monasca metrics according to business requirements and publishes new transformed (derived) metrics to the monasca Kafka queue.

Since the new transformed metrics are published as any other metric in monasca, alarms can be set and triggered on the transformed metric, just like any other metric.
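
For example, an alarm definition can be created against an aggregated metric with the monasca CLI. The expression, threshold, and notification ID below are purely illustrative:

ardana > monasca alarm-definition-create "vcpus allocated high" \
  "max(vcpus_agg{host=all}) > 100" \
  --severity HIGH \
  --alarm-actions <notification-method-id>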

13.1.2.4.3.2 Object Storage and Compute Capacity Summary Operations Console UI

A new "Capacity Summary" tab for Compute and Object Storage will displays all the aggregated metrics under the "Compute" and "Object Storage" sections.

The Operations Console UI makes calls to the monasca API to retrieve and display the various tiles and graphs on the Capacity Summary tab of the Compute and Object Storage Summary UI pages.

13.1.2.4.3.3 Persist new metrics and Trigger Alarms

New aggregated metrics will be published to monasca's Kafka queue and will be ingested by monasca-persister. If thresholds and alarms have been set on the aggregated metrics, monasca will generate and trigger alarms as it currently does with any other metric. No additional changes are required to persist the new aggregated metrics or to set thresholds and alarms on them.

13.1.2.4.4 New Aggregated Metrics

Following is the list of aggregated metrics produced by monasca Transform in SUSE OpenStack Cloud.

Table 13.1: Aggregated Metrics

1. cpu.utilized_logical_cores_agg (compute summary)
   Description: utilized physical host cpu core capacity for one or all hosts by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all or <host name>; project_id: all
   Notes: Available as total or per host

2. cpu.total_logical_cores_agg (compute summary)
   Description: total physical host cpu core capacity for one or all hosts by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all or <host name>; project_id: all
   Notes: Available as total or per host

3. mem.total_mb_agg (compute summary)
   Description: total physical host memory capacity by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all

4. mem.usable_mb_agg (compute summary)
   Description: usable physical host memory capacity by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all

5. disk.total_used_space_mb_agg (compute summary)
   Description: utilized physical host disk capacity by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all

6. disk.total_space_mb_agg (compute summary)
   Description: total physical host disk capacity by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all

7. nova.vm.cpu.total_allocated_agg (compute summary)
   Description: cpus allocated across all VMs by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all

8. vcpus_agg (compute summary)
   Description: virtual cpus allocated capacity for VMs of one or all projects by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all or <project ID>
   Notes: Available as total or per project

9. nova.vm.mem.total_allocated_mb_agg (compute summary)
   Description: memory allocated to all VMs by time interval (defaults to an hour)
   Dimensions: aggregation_period: hourly; host: all; project_id: all

10. vm.mem.used_mb_agg (compute summary)
    Description: memory utilized by VMs of one or all projects by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: <project ID>
    Notes: Available as total or per project

11. vm.mem.total_mb_agg (compute summary)
    Description: memory allocated to VMs of one or all projects by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: <project ID>
    Notes: Available as total or per project

12. vm.cpu.utilization_perc_agg (compute summary)
    Description: cpu utilized by all VMs by project by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: <project ID>

13. nova.vm.disk.total_allocated_gb_agg (compute summary)
    Description: disk space allocated to all VMs by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: all

14. vm.disk.allocation_agg (compute summary)
    Description: disk allocation for VMs of one or all projects by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: all or <project ID>
    Notes: Available as total or per project

15. swiftlm.diskusage.val.size_agg (object storage summary)
    Description: total available object storage capacity by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all or <host name>; project_id: all
    Notes: Available as total or per host

16. swiftlm.diskusage.val.avail_agg (object storage summary)
    Description: remaining object storage capacity by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all or <host name>; project_id: all
    Notes: Available as total or per host

17. swiftlm.diskusage.rate_agg (object storage summary)
    Description: rate of change of object storage usage by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: all

18. storage.objects.size_agg (object storage summary)
    Description: used object storage capacity by time interval (defaults to an hour)
    Dimensions: aggregation_period: hourly; host: all; project_id: all
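
To verify that the hourly aggregated metrics are being produced, you can query them from the command line; a sketch, with a placeholder start time:

ardana > monasca metric-list --name cpu.total_logical_cores_agg
ardana > monasca measurement-list cpu.total_logical_cores_agg 2016-08-24T00:00:00Z \
  --dimensions aggregation_period=hourly,host=all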

 
13.1.2.4.5 Deployment

monasca Transform and Spark will be deployed on the same control plane nodes along with Logging and Monitoring Service (monasca).

Security Consideration during deployment of monasca Transform and Spark

The SUSE OpenStack Cloud Monitoring system connects internally to the Kafka and Spark technologies without authentication. If you choose to deploy Monitoring, configure it to use only trusted networks such as the Management network, as illustrated on the network diagrams below for Entry Scale Deployment and Mid Scale Deployment.

Entry Scale Deployment

In an Entry Scale Deployment, monasca Transform and Spark will be deployed on the Shared Control Plane along with the other OpenStack services, including Monitoring and Logging.

Image

Mid scale Deployment

In a Mid Scale Deployment, monasca Transform and Spark will be deployed on the dedicated Metering, Monitoring and Logging (MML) control plane along with other data-processing-intensive services like Metering, Monitoring and Logging.

Image

Multi Control Plane Deployment

In a Multi Control Plane Deployment, monasca Transform and Spark will be deployed on the Shared Control Plane along with the rest of the monasca components.

Start, Stop and Status for monasca Transform and Spark processes

The service management methods for monasca-transform and spark follow the convention for services in the OpenStack platform. When executing from the deployer node, the commands are as follows:

Status

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-status.yml

Start

As monasca-transform depends on Spark for the processing of the metrics, Spark will need to be started before monasca-transform.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-start.yml

Stop

As a precaution, stop the monasca-transform service before taking Spark down. Interrupting the Spark service altogether while monasca-transform is still running can result in a monasca-transform process that is unresponsive and needs to be tidied up.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-stop.yml
ardana > ansible-playbook -i hosts/verb_hosts spark-stop.yml
13.1.2.4.6 Reconfigure

The reconfigure process can be triggered again from the deployer. Provided that changes have been made to the variables in the appropriate places, running the respective Ansible playbooks is enough to update the configuration. The Spark reconfigure process alters the nodes serially, meaning that Spark is never down altogether: each node is stopped in turn and ZooKeeper manages the leaders accordingly. This means that monasca-transform may be left running even while Spark is upgraded.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.7 Adding monasca Transform and Spark to SUSE OpenStack Cloud Deployment

Since monasca Transform and Spark are optional components, you might have elected not to install these two components during the initial SUSE OpenStack Cloud installation. The following instructions describe how to add monasca Transform and Spark to an existing SUSE OpenStack Cloud deployment.

Steps

  1. Add monasca Transform and Spark to the input model. On an entry-level cloud, monasca Transform and Spark would be installed on the common control plane; for a mid-scale cloud, which has an MML (Metering, Monitoring and Logging) cluster, monasca Transform and Spark should be added to the MML cluster.

    ardana > cd ~/openstack/my_cloud/definition/data/

    Add spark and monasca-transform to the input model file, control_plane.yml:

    clusters
           - name: core
             cluster-prefix: c1
             server-role: CONTROLLER-ROLE
             member-count: 3
             allocation-policy: strict
             service-components:
    
               [...]
    
               - zookeeper
               - kafka
               - cassandra
               - storm
               - spark
               - monasca-api
               - monasca-persister
               - monasca-notifier
               - monasca-threshold
               - monasca-client
               - monasca-transform
    
               [...]
  2. Run the Configuration Processor

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Adding monasca Transform and Spark"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  3. Run Ready Deployment

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run Cloud Lifecycle Manager Deploy

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml

Verify Deployment

Log in to each controller node and run:

tux > sudo service monasca-transform status
tux > sudo service spark-master status
tux > sudo service spark-worker status
tux > sudo service monasca-transform status
● monasca-transform.service - monasca Transform Daemon
  Loaded: loaded (/etc/systemd/system/monasca-transform.service; disabled)
  Active: active (running) since Wed 2016-08-24 00:47:56 UTC; 2 days ago
Main PID: 7351 (bash)
  CGroup: /system.slice/monasca-transform.service
          ├─ 7351 bash /etc/monasca/transform/init/start-monasca-transform.sh
          ├─ 7352 /opt/stack/service/monasca-transform/venv//bin/python /opt/monasca/monasca-transform/lib/service_runner.py
          ├─27904 /bin/sh -c export SPARK_HOME=/opt/stack/service/spark/venv/bin/../current && spark-submit --supervise --master spark://omega-cp1-c1-m1-mgmt:7077,omega-cp1-c1-m2-mgmt:7077,omega-cp1-c1...
          ├─27905 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/lib/drizzle-jdbc-1.3.jar:/opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/v...
          └─28355 python /opt/monasca/monasca-transform/lib/driver.py
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.


tux > sudo service spark-worker status
● spark-worker.service - Spark Worker Daemon
  Loaded: loaded (/etc/systemd/system/spark-worker.service; disabled)
  Active: active (running) since Wed 2016-08-24 00:46:05 UTC; 2 days ago
Main PID: 63513 (bash)
  CGroup: /system.slice/spark-worker.service
          ├─ 7671 python -m pyspark.daemon
          ├─28948 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0...
          ├─63513 bash /etc/spark/init/start-spark-worker.sh &
          └─63514 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.



tux > sudo service spark-master status
● spark-master.service - Spark Master Daemon
  Loaded: loaded (/etc/systemd/system/spark-master.service; disabled)
  Active: active (running) since Wed 2016-08-24 00:44:24 UTC; 2 days ago
Main PID: 55572 (bash)
  CGroup: /system.slice/spark-master.service
          ├─55572 bash /etc/spark/init/start-spark-master.sh &
          └─55573 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
13.1.2.4.8 Increase monasca Transform Scale

monasca Transform in the default configuration can scale up to the estimated data volume of a 100-node cloud deployment. The estimated maximum rate of metrics from a 100-node cloud deployment is 120M/hour.

You can further increase the processing rate to 180M/hour. Making the Spark configuration change will increase the CPUs used by Spark and monasca Transform from an average of around 3.5 to 5.5 CPUs per control node over a 10-minute batch processing interval.

To increase the processing rate to 180M/hour, make the following Spark configuration changes.

Steps

  1. Edit /var/lib/ardana/openstack/my_cloud/config/spark/spark-defaults.conf.j2 and set spark.cores.max to 6 and spark.executor.cores to 2.

    Set spark.cores.max to 6

    spark.cores.max {{ spark_cores_max }}

    to

    spark.cores.max 6

    Set spark.executor.cores to 2

    spark.executor.cores {{ spark_executor_cores }}

    to

    spark.executor.cores 2
  2. Edit ~/openstack/my_cloud/config/spark/spark-env.sh.j2

    Set SPARK_WORKER_CORES to 2

    export SPARK_WORKER_CORES={{ spark_worker_cores }}

    to

    export SPARK_WORKER_CORES=2
  3. Edit ~/openstack/my_cloud/config/spark/spark-worker-env.sh.j2

    Set SPARK_WORKER_CORES to 2

    export SPARK_WORKER_CORES={{ spark_worker_cores }}

    to

    export SPARK_WORKER_CORES=2
  4. Run Configuration Processor

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Changing Spark Config increase scale"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Run Ready Deployment

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run spark-reconfigure.yml and monasca-transform-reconfigure.yml

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
    ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.9 Change Compute Host Pattern Filter in Monasca Transform

monasca Transform identifies compute host metrics by pattern matching on the hostname dimension in the incoming monasca metrics. The default pattern is of the form compNNN, for example comp001, comp002, and so on. To filter for it in the transformation specs, use the expression -comp[0-9]+-. If the compute host names follow a pattern other than the standard pattern above, the filter expression used when aggregating metrics will have to be changed.
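
Before editing, you can count how many transformation specs reference the default pattern; a sketch using a fixed-string search on the deployer:

ardana > grep -F -o -- '-comp[0-9]+' \
    ~/openstack/my_cloud/config/monasca-transform/transform_specs.json.j2 | wc -l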

Steps

  1. On the deployer: Edit ~/openstack/my_cloud/config/monasca-transform/transform_specs.json.j2

  2. Look for all references to -comp[0-9]+- and change the regular expression to the desired pattern, for example -compute[0-9]+-.

    {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data","insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"], "usage_fetch_operation": "avg", "filter_by_list": [{"field_to_filter": "host", "filter_expression": "-comp[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}

    to

    {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data", "insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"],"usage_fetch_operation": "avg","filter_by_list": [{"field_to_filter": "host","filter_expression": "-compute[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}
    Note

    The filter_expression has been changed to the new pattern.

  3. To change all host metric transformation specs in the same JSON file, repeat Step 2.

    Transformation specs will have to be changed for the following metric_ids: "mem_total_all", "mem_usable_all", "disk_total_all", "disk_usable_all", "cpu_total_all", "cpu_total_host", "cpu_util_all", "cpu_util_host"

  4. Run the Configuration Processor:

    ardana > cd ~/openstack/my_cloud/definition
    ardana > git add -A
    ardana > git commit -m "Changing monasca Transform specs"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Run Ready Deployment:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run monasca Transform Reconfigure:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml

13.1.2.5 Configuring Availability of Alarm Metrics

Using the monasca agent tuning knobs, you can choose which alarm metrics are available in your environment.

The addition of the libvirt and OVS plugins to the monasca agent provides a number of additional metrics that can be used. Most of these metrics are included by default, but others are not. You can use tuning knobs to add or remove these metrics in your environment based on your individual needs in your cloud.

These metrics are listed below along with the tuning knob names and instructions for how to adjust them.

13.1.2.5.1 Libvirt plugin metric tuning knobs

The following metrics are added as part of the libvirt plugin:

Note

For a description of each of these metrics, see Section 13.1.4.11, “Libvirt Metrics”.

Each tuning knob is listed below with its default setting and the metrics it controls, shown as Admin Metric Name / Project Metric Name.

vm_cpu_check_enable (default: True)

  vm.cpu.time_ns / cpu.time_ns
  vm.cpu.utilization_norm_perc / cpu.utilization_norm_perc
  vm.cpu.utilization_perc / cpu.utilization_perc

vm_disks_check_enable (default: True)

  Creates 20 disk metrics per disk device per virtual machine:

  vm.io.errors / io.errors
  vm.io.errors_sec / io.errors_sec
  vm.io.read_bytes / io.read_bytes
  vm.io.read_bytes_sec / io.read_bytes_sec
  vm.io.read_ops / io.read_ops
  vm.io.read_ops_sec / io.read_ops_sec
  vm.io.write_bytes / io.write_bytes
  vm.io.write_bytes_sec / io.write_bytes_sec
  vm.io.write_ops / io.write_ops
  vm.io.write_ops_sec / io.write_ops_sec

vm_network_check_enable (default: True)

  Creates 16 network metrics per NIC per virtual machine:

  vm.net.in_bytes / net.in_bytes
  vm.net.in_bytes_sec / net.in_bytes_sec
  vm.net.in_packets / net.in_packets
  vm.net.in_packets_sec / net.in_packets_sec
  vm.net.out_bytes / net.out_bytes
  vm.net.out_bytes_sec / net.out_bytes_sec
  vm.net.out_packets / net.out_packets
  vm.net.out_packets_sec / net.out_packets_sec

vm_ping_check_enable (default: True)

  vm.ping_status / ping_status

vm_extended_disks_check_enable (default: True)

  Creates 6 metrics per device per virtual machine:

  vm.disk.allocation / disk.allocation
  vm.disk.capacity / disk.capacity
  vm.disk.physical / disk.physical

  Also creates 6 aggregate metrics per virtual machine:

  vm.disk.allocation_total / disk.allocation_total
  vm.disk.capacity_total / disk.capacity.total
  vm.disk.physical_total / disk.physical_total

vm_disks_check_enable and vm_extended_disks_check_enable (default: True)

  Creates 20 aggregate metrics per virtual machine:

  vm.io.errors_total / io.errors_total
  vm.io.errors_total_sec / io.errors_total_sec
  vm.io.read_bytes_total / io.read_bytes_total
  vm.io.read_bytes_total_sec / io.read_bytes_total_sec
  vm.io.read_ops_total / io.read_ops_total
  vm.io.read_ops_total_sec / io.read_ops_total_sec
  vm.io.write_bytes_total / io.write_bytes_total
  vm.io.write_bytes_total_sec / io.write_bytes_total_sec
  vm.io.write_ops_total / io.write_ops_total
  vm.io.write_ops_total_sec / io.write_ops_total_sec
13.1.2.5.1.1 Configuring the libvirt metrics using the tuning knobs

Use the following steps to configure the tuning knobs for the libvirt plugin metrics.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the following file:

    ~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
  3. Change the value for each tuning knob to the desired setting, True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

    vm_cpu_check_enable: <true or false>
    vm_disks_check_enable: <true or false>
    vm_extended_disks_check_enable: <true or false>
    vm_network_check_enable: <true or false>
    vm_ping_check_enable: <true or false>
  4. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "configuring libvirt plugin tuning knobs"
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the nova reconfigure playbook to implement the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
Note

If you modify either of the following files, then the monasca tuning parameters should be adjusted to handle a higher load on the system.

~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2

Tuning parameters are located in ~/openstack/my_cloud/config/monasca/configuration.yml. The parameter monasca_tuning_selector_override should be changed to the extra-large setting.

13.1.2.5.2 OVS plugin metric tuning knobs

The following metrics are added as part of the OVS plugin:

Note

For a description of each of these metrics, see Section 13.1.4.16, “Open vSwitch (OVS) Metrics”.

Each tuning knob is listed below with its default setting and the metrics it controls, shown as Admin Metric Name / Project Metric Name.

use_rate_metrics (default: False)

  ovs.vrouter.in_bytes_sec / vrouter.in_bytes_sec
  ovs.vrouter.in_packets_sec / vrouter.in_packets_sec
  ovs.vrouter.out_bytes_sec / vrouter.out_bytes_sec
  ovs.vrouter.out_packets_sec / vrouter.out_packets_sec

use_absolute_metrics (default: True)

  ovs.vrouter.in_bytes / vrouter.in_bytes
  ovs.vrouter.in_packets / vrouter.in_packets
  ovs.vrouter.out_bytes / vrouter.out_bytes
  ovs.vrouter.out_packets / vrouter.out_packets

use_health_metrics with use_rate_metrics (default: False)

  ovs.vrouter.in_dropped_sec / vrouter.in_dropped_sec
  ovs.vrouter.in_errors_sec / vrouter.in_errors_sec
  ovs.vrouter.out_dropped_sec / vrouter.out_dropped_sec
  ovs.vrouter.out_errors_sec / vrouter.out_errors_sec

use_health_metrics with use_absolute_metrics (default: False)

  ovs.vrouter.in_dropped / vrouter.in_dropped
  ovs.vrouter.in_errors / vrouter.in_errors
  ovs.vrouter.out_dropped / vrouter.out_dropped
  ovs.vrouter.out_errors / vrouter.out_errors
13.1.2.5.2.1 Configuring the OVS metrics using the tuning knobs

Use the following steps to configure the tuning knobs for the OVS plugin metrics.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the following file:

    ~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2
  3. Change the value for each tuning knob to the desired setting, True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

    init_config:
       use_absolute_metrics: <true or false>
       use_rate_metrics: <true or false>
       use_health_metrics: <true or false>
  4. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "configuring OVS plugin tuning knobs"
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Run the neutron reconfigure playbook to implement the changes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

13.1.3 Integrating HipChat, Slack, and JIRA

monasca, the SUSE OpenStack Cloud monitoring and notification service, includes three default notification methods, email, PagerDuty, and webhook. monasca also supports three other notification plugins which allow you to send notifications to HipChat, Slack, and JIRA. Unlike the default notification methods, the additional notification plugins must be manually configured.

This guide details the steps to configure each of the three non-default notification plugins. This guide also assumes that your cloud is fully deployed and functional.

13.1.3.1 Configuring the HipChat Plugin

To configure the HipChat plugin you will need the following four pieces of information from your HipChat system.

  • The URL of your HipChat system.

  • A token providing permission to send notifications to your HipChat system.

  • The ID of the HipChat room you wish to send notifications to.

  • A HipChat user account. This account will be used to authenticate any incoming notifications from your SUSE OpenStack Cloud cloud.

Obtain a token

Use the following instructions to obtain a token from your Hipchat system.

  1. Log in to HipChat as the user account that will be used to authenticate the notifications.

  2. Navigate to the following URL: https://<your_hipchat_system>/account/api. Replace <your_hipchat_system> with the fully-qualified-domain-name of your HipChat system.

  3. Select the Create token option. Ensure that the token has the "SendNotification" attribute.

Obtain a room ID

Use the following instructions to obtain the ID of a HipChat room.

  1. Log in to HipChat as the user account that will be used to authenticate the notifications.

  2. Select My account from the application menu.

  3. Select the Rooms tab.

  4. Select the room that you want your notifications sent to.

  5. Look for the API ID field in the room information. This is the room ID.

Create HipChat notification type

Use the following instructions to create a HipChat notification type.

  1. Begin by obtaining the API URL for the HipChat room that you wish to send notifications to. The format for a URL used to send notifications to a room is as follows:

    /v2/room/{room_id_or_name}/notification

  2. Use the monasca API to create a new notification method. The following example demonstrates how to create a HipChat notification type named MyHipChatNotification, for room ID 13, using an example API URL and auth token.

    ardana > monasca notification-create  NAME TYPE ADDRESS
    ardana > monasca notification-create  MyHipChatNotification HIPCHAT https://hipchat.hpe.net/v2/room/13/notification?auth_token=1234567890

    The preceding example creates a notification type with the following characteristics

    • NAME: MyHipChatNotification

    • TYPE: HIPCHAT

    • ADDRESS: https://hipchat.hpe.net/v2/room/13/notification

    • auth_token: 1234567890

Note

The horizon dashboard can also be used to create a HipChat notification type.

13.1.3.2 Configuring the Slack Plugin

Configuring a Slack notification type requires four pieces of information from your Slack system.

  • Slack server URL

  • Authentication token

  • Slack channel

  • A Slack user account. This account will be used to authenticate incoming notifications to Slack.

Identify a Slack channel

  1. Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.

  2. In the left navigation panel, under the CHANNELS section locate the channel that you wish to receive the notifications. The instructions that follow will use the example channel #general.

Create a Slack token

  1. Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack

  2. Navigate to the following URL: https://api.slack.com/docs/oauth-test-tokens

  3. Select the Create token button.

Create a Slack notification type

  1. Begin by identifying the structure of the API call to be used by your notification method. The format for a call to the Slack Web API is as follows:

    https://slack.com/api/METHOD

    You can authenticate a Web API request by using the token that you created in the preceding Create a Slack token section. Doing so will result in an API call that looks like the following.

    https://slack.com/api/METHOD?token=auth_token

    You can further refine your call by specifying the channel that the message will be posted to. Doing so will result in an API call that looks like the following.

    https://slack.com/api/METHOD?token=AUTH_TOKEN&channel=#channel

    The following example uses the chat.postMessage method, the token 1234567890, and the channel #general.

    https://slack.com/api/chat.postMessage?token=1234567890&channel=#general

    Find more information on the Slack Web API here: https://api.slack.com/web

  2. Use the CLI on your Cloud Lifecycle Manager to create a new Slack notification type, using the API call that you created in the preceding step. The following example creates a notification type named MySlackNotification, using token 1234567890, and posting to channel #general.

    ardana > monasca notification-create  MySlackNotification SLACK https://slack.com/api/chat.postMessage?token=1234567890&channel=#general
Note

Notification types can also be created in the horizon dashboard.

13.1.3.3 Configuring the JIRA Plugin

Configuring the JIRA plugin requires three pieces of information from your JIRA system.

  • The URL of your JIRA system.

  • Username and password of a JIRA account that will be used to authenticate the notifications.

  • The name of the JIRA project that the notifications will be sent to.

Create JIRA notification type

You will configure the monasca service to send notifications to a particular JIRA project. You must also configure JIRA to create new issues for each notification it receives for this project; however, that configuration is outside the scope of this document.

The monasca JIRA notification plugin supports only the following two JIRA issue fields.

  • PROJECT. This is the only supported mandatory JIRA issue field.

  • COMPONENT. This is the only supported optional JIRA issue field.

The JIRA issue type that your notifications will create may only be configured with the "Project" field as mandatory. If your JIRA issue type has any other mandatory fields, the monasca plugin will not function correctly. Currently, the monasca plugin only supports the single optional "component" field.

Creating the JIRA notification type requires a few more steps than other notification types covered in this guide. Because the Python and YAML files for this notification type are not yet included in SUSE OpenStack Cloud 9, you must perform the following steps to manually retrieve and place them on your Cloud Lifecycle Manager.

  1. Configure the JIRA plugin by adding the following block to the /etc/monasca/notification.yaml file, under the notification_types section, and adding the username and password of the JIRA account used for the notifications to the respective sections.

        plugins:
    
         - monasca_notification.plugins.jira_notifier:JiraNotifier
    
        jira:
            user:
    
            password:
    
            timeout: 60

    After adding the necessary block, the notification_types section should look like the following example. Note that you must also add the username and password for the JIRA user related to the notification type.

    notification_types:
        plugins:
    
         - monasca_notification.plugins.jira_notifier:JiraNotifier
    
        jira:
            user:
    
            password:
    
            timeout: 60
    
        webhook:
            timeout: 5
    
        pagerduty:
            timeout: 5
    
            url: "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
  2. Create the JIRA notification type. The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO.

    ardana > monasca notification-create  MyJiraNotification JIRA https://jira.hpcloud.net/?project=HISO

    The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO, and adds the optional component field with a value of keystone.

    ardana > monasca notification-create MyJiraNotification JIRA "https://jira.hpcloud.net/?project=HISO&component=keystone"
    Note

    Note the slash (/) separating the base URL from the query string. The slash is required if you specify a query parameter without a path parameter.

    Note

    Notification types may also be created in the horizon dashboard.
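
    To confirm that the notification type was registered, list the notification types and display the new entry. NOTIFICATION_ID is the ID shown for MyJiraNotification in the list output.

    ardana > monasca notification-list
    ardana > monasca notification-show NOTIFICATION_ID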

13.1.4 Alarm Metrics

You can use the available metrics to create custom alarms to further monitor your cloud infrastructure and facilitate autoscaling features.

For details on how to create custom alarms using the Operations Console, see Section 16.2, “Alarm Definition”.
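
A quick way to see which metrics and dimensions are available before writing an alarm definition is to query them from the Cloud Lifecycle Manager. The metric name, timestamp, and HOSTNAME below are illustrative values; substitute the metric and dimensions you want to alarm on.

ardana > monasca metric-list --name cpu.idle_perc
ardana > monasca measurement-list cpu.idle_perc 2024-01-01T00:00:00Z --dimensions hostname=HOSTNAME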

13.1.4.1 Apache Metrics

A list of metrics associated with the Apache service.

Metric Name / Dimensions / Description
apache.net.hits
hostname
service=apache
component=apache
Total accesses
apache.net.kbytes_sec
hostname
service=apache
component=apache
Total Kbytes per second
apache.net.requests_sec
hostname
service=apache
component=apache
Total accesses per second
apache.net.total_kbytes
hostname
service=apache
component=apache
Total Kbytes
apache.performance.busy_worker_count
hostname
service=apache
component=apache
The number of workers serving requests
apache.performance.cpu_load_perc
hostname
service=apache
component=apache

The current percentage of CPU used by each worker and in total by all workers combined

apache.performance.idle_worker_count
hostname
service=apache
component=apache
The number of idle workers
apache.status
apache_port
hostname
service=apache
component=apache
Status of Apache port

13.1.4.2 ceilometer Metrics

A list of metrics associated with the ceilometer service.

Metric Name / Dimensions / Description
disk.total_space_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Total space of disk
disk.total_used_space_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Total used space of disk
swiftlm.diskusage.rate_agg
aggregation_period=hourly,
host=all,
project_id=all
 
swiftlm.diskusage.val.avail_agg
aggregation_period=hourly,
host,
project_id=all
 
swiftlm.diskusage.val.size_agg
aggregation_period=hourly,
host,
project_id=all
 
image
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Existence of the image
image.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Delete operation on this image
image.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=B,
source=openstack
Size of the uploaded image
image.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Update operation on this image
image.upload
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=image,
source=openstack
Upload operation on this image
instance
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=instance,
source=openstack
Existence of instance
disk.ephemeral.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of ephemeral disk on this instance
disk.root.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of root disk on this instance
memory
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=MB,
source=openstack
Size of memory on this instance
ip.floating
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=ip,
source=openstack
Existence of IP
ip.floating.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=ip,
source=openstack
Create operation on this fip
ip.floating.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=ip,
source=openstack
Update operation on this fip
mem.total_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Total space of memory
mem.usable_mb_agg
aggregation_period=hourly,
host=all,
project_id=all
Available space of memory
network
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=network,
source=openstack
Existence of network
network.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=network,
source=openstack
Create operation on this network
network.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=network,
source=openstack
Update operation on this network
network.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=network,
source=openstack
Delete operation on this network
port
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=port,
source=openstack
Existence of port
port.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=port,
source=openstack
Create operation on this port
port.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=port,
source=openstack
Delete operation on this port
port.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=port,
source=openstack
Update operation on this port
router
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=router,
source=openstack
Existence of router
router.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=router,
source=openstack
Create operation on this router
router.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=router,
source=openstack
Delete operation on this router
router.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=router,
source=openstack
Update operation on this router
snapshot
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=snapshot,
source=openstack
Existence of the snapshot
snapshot.create.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=snapshot,
source=openstack
Create operation on this snapshot
snapshot.delete.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=snapshot,
source=openstack
Delete operation on this snapshot
snapshot.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of this snapshot
subnet
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=subnet,
source=openstack
Existence of the subnet
subnet.create
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=subnet,
source=openstack
Create operation on this subnet
subnet.delete
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=subnet,
source=openstack
Delete operation on this subnet
subnet.update
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=subnet,
source=openstack
Update operation on this subnet
vcpus
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=vcpus,
source=openstack
Number of virtual CPUs allocated to the instance
vcpus_agg
aggregation_period=hourly,
host=all,
project_id
Number of vcpus used by a project
volume
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=volume,
source=openstack
Existence of the volume
volume.create.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Create operation on this volume
volume.delete.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Delete operation on this volume
volume.resize.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Resize operation on this volume
volume.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=GB,
source=openstack
Size of this volume
volume.update.end
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=delta,
unit=volume,
source=openstack
Update operation on this volume
storage.objects
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=object,
source=openstack
Number of objects
storage.objects.size
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=B,
source=openstack
Total size of stored objects
storage.objects.containers
user_id,
region,
resource_id,
datasource=ceilometer,
project_id,
type=gauge,
unit=container,
source=openstack
Number of containers

13.1.4.3 cinder Metrics

A list of metrics associated with the cinder service.

Metric Name / Dimensions / Description
cinderlm.cinder.backend.physical.list

service=block-storage, hostname, cluster, cloud_name, control_plane, component, backends

List of physical backends
cinderlm.cinder.backend.total.avail

service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname

Total available capacity metric per backend
cinderlm.cinder.backend.total.size

service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname

Total capacity metric per backend
cinderlm.cinder.cinder_services

service=block-storage, hostname, cluster, cloud_name, control_plane, component

Status of a cinder-volume service
cinderlm.hp_hardware.hpssacli.logical_drive

service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, logical_drive, controller_slot, array

The HPE Smart Storage Administrator (HPE SSA) CLI component will have to be installed for SSACLI status to be reported. To download and install the SSACLI utility to enable management of disk controllers, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

Status of a logical drive
cinderlm.hp_hardware.hpssacli.physical_drive

service=block-storage, hostname, cluster, cloud_name, control_plane, component, box, bay, controller_slot

Status of a physical drive
cinderlm.hp_hardware.hpssacli.smart_array

service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, model

Status of smart array
cinderlm.hp_hardware.hpssacli.smart_array.firmware

service=block-storage, hostname, cluster, cloud_name, control_plane, component, model

Checks firmware version

13.1.4.4 Compute Metrics

Note

Compute instance metrics are listed in Section 13.1.4.11, “Libvirt Metrics”.

A list of metrics associated with the Compute service.

Metric Name / Dimensions / Description
nova.heartbeat
service=compute
cloud_name
hostname
component
control_plane
cluster

Checks that all services are reporting heartbeats (uses the nova user to list services, then sets up a check for each one; for example, nova-scheduler, nova-conductor, nova-compute)

nova.vm.cpu.total_allocated
service=compute
hostname
component
control_plane
cluster
Total CPUs allocated across all VMs
nova.vm.disk.total_allocated_gb
service=compute
hostname
component
control_plane
cluster
Total Gbytes of disk space allocated to all VMs
nova.vm.mem.total_allocated_mb
service=compute
hostname
component
control_plane
cluster
Total Mbytes of memory allocated to all VMs
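
If nova.heartbeat indicates missing heartbeats, you can cross-check which compute services nova itself reports as up. This assumes the OpenStack client is available on the Cloud Lifecycle Manager and that you have sourced admin credentials first.

ardana > openstack compute service list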

13.1.4.5 Crash Metrics

A list of metrics associated with the Crash service.

Metric Name / Dimensions / Description
crash.dump_count
service=system
hostname
cluster
Number of crash dumps found

13.1.4.6 Directory Metrics

A list of metrics associated with the Directory service.

Metric Name / Dimensions / Description
directory.files_count
service
hostname
path
Total number of files under a specific directory path
directory.size_bytes
service
hostname
path
Total size of a specific directory path

13.1.4.7 Elasticsearch Metrics

A list of metrics associated with the Elasticsearch service.

Metric Name / Dimensions / Description
elasticsearch.active_primary_shards
service=logging
url
hostname

Indicates the number of primary shards in your cluster. This is an aggregate total across all indices.

elasticsearch.active_shards
service=logging
url
hostname

Aggregate total of all shards across all indices, which includes replica shards.

elasticsearch.cluster_status
service=logging
url
hostname

Cluster health status.

elasticsearch.initializing_shards
service=logging
url
hostname

The count of shards that are being freshly created.

elasticsearch.number_of_data_nodes
service=logging
url
hostname

Number of data nodes.

elasticsearch.number_of_nodes
service=logging
url
hostname

Number of nodes.

elasticsearch.relocating_shards
service=logging
url
hostname

Shows the number of shards that are currently moving from one node to another node.

elasticsearch.unassigned_shards
service=logging
url
hostname

The number of unassigned shards from the master node.
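
These metrics mirror the fields returned by the Elasticsearch cluster health API, so you can compare them against the cluster directly. The host and port below are assumptions (the Elasticsearch default); adjust them to match your logging nodes.

curl -s "http://localhost:9200/_cluster/health?pretty"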

13.1.4.8 HAProxy Metrics

A list of metrics associated with the HAProxy service.

Metric Name / Dimensions / Description
haproxy.backend.bytes.in_rate  
haproxy.backend.bytes.out_rate  
haproxy.backend.denied.req_rate  
haproxy.backend.denied.resp_rate  
haproxy.backend.errors.con_rate  
haproxy.backend.errors.resp_rate  
haproxy.backend.queue.current  
haproxy.backend.response.1xx  
haproxy.backend.response.2xx  
haproxy.backend.response.3xx  
haproxy.backend.response.4xx  
haproxy.backend.response.5xx  
haproxy.backend.response.other  
haproxy.backend.session.current  
haproxy.backend.session.limit  
haproxy.backend.session.pct  
haproxy.backend.session.rate  
haproxy.backend.warnings.redis_rate  
haproxy.backend.warnings.retr_rate  
haproxy.frontend.bytes.in_rate  
haproxy.frontend.bytes.out_rate  
haproxy.frontend.denied.req_rate  
haproxy.frontend.denied.resp_rate  
haproxy.frontend.errors.req_rate  
haproxy.frontend.requests.rate  
haproxy.frontend.response.1xx  
haproxy.frontend.response.2xx  
haproxy.frontend.response.3xx  
haproxy.frontend.response.4xx  
haproxy.frontend.response.5xx  
haproxy.frontend.response.other  
haproxy.frontend.session.current  
haproxy.frontend.session.limit  
haproxy.frontend.session.pct  
haproxy.frontend.session.rate  
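
The agent reads these counters from the HAProxy statistics interface. If you need to inspect the raw values, you can query the HAProxy stats socket directly on a control plane node; the socket path below is an assumption and may differ in your configuration.

echo "show stat" | socat UNIX-CONNECT:/var/run/haproxy.sock STDIO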

13.1.4.9 HTTP Check Metrics

A list of metrics associated with the HTTP Check service:

Table 13.2: HTTP Check Metrics
Metric Name / Dimensions / Description
http_response_time
url
hostname
service
component
The response time in seconds of the http endpoint call.
http_status
url
hostname
service
The status of the http endpoint call (0 = success, 1 = failure).

For each component and HTTP metric name there are two separate metrics reported, one for the local URL and another for the virtual IP (VIP) URL:

Table 13.3: HTTP Metric Components
Component / Dimensions / Description
account-server
service=object-storage
component=account-server
url
swift account-server http endpoint status and response time
barbican-api
service=key-manager
component=barbican-api
url
barbican-api http endpoint status and response time
cinder-api
service=block-storage
component=cinder-api
url
cinder-api http endpoint status and response time
container-server
service=object-storage
component=container-server
url
swift container-server http endpoint status and response time
designate-api
service=dns
component=designate-api
url
designate-api http endpoint status and response time
glance-api
service=image-service
component=glance-api
url
glance-api http endpoint status and response time
glance-registry
service=image-service
component=glance-registry
url
glance-registry http endpoint status and response time
heat-api
service=orchestration
component=heat-api
url
heat-api http endpoint status and response time
heat-api-cfn
service=orchestration
component=heat-api-cfn
url
heat-api-cfn http endpoint status and response time
heat-api-cloudwatch
service=orchestration
component=heat-api-cloudwatch
url
heat-api-cloudwatch http endpoint status and response time
ardana-ux-services
service=ardana-ux-services
component=ardana-ux-services
url
ardana-ux-services http endpoint status and response time
horizon
service=web-ui
component=horizon
url
horizon http endpoint status and response time
keystone-api
service=identity-service
component=keystone-api
url
keystone-api http endpoint status and response time
monasca-api
service=monitoring
component=monasca-api
url
monasca-api http endpoint status
monasca-persister
service=monitoring
component=monasca-persister
url
monasca-persister http endpoint status
neutron-server
service=networking
component=neutron-server
url
neutron-server http endpoint status and response time
neutron-server-vip
service=networking
component=neutron-server-vip
url
neutron-server-vip http endpoint status and response time
nova-api
service=compute
component=nova-api
url
nova-api http endpoint status and response time
nova-vnc
service=compute
component=nova-vnc
url
nova-vnc http endpoint status and response time
object-server
service=object-storage
component=object-server
url
object-server http endpoint status and response time
object-storage-vip
service=object-storage
component=object-storage-vip
url
object-storage-vip http endpoint status and response time
octavia-api
service=octavia
component=octavia-api
url
octavia-api http endpoint status and response time
ops-console-web
service=ops-console
component=ops-console-web
url
ops-console-web http endpoint status and response time
proxy-server
service=object-storage
component=proxy-server
url
proxy-server http endpoint status and response time
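
To see the local and VIP variants of a check side by side, filter the metric by component. The component value below is one of the entries from Table 13.3; any other component from the table works the same way.

ardana > monasca metric-list --name http_status --dimensions component=keystone-api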

13.1.4.10 Kafka Metrics

A list of metrics associated with the Kafka service.

Metric Name / Dimensions / Description
kafka.consumer_lag
topic
service
component=kafka
consumer_group
hostname
Consumer offset lag from the broker offset, reported per hostname

13.1.4.11 Libvirt Metrics

Note

For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.1, “Libvirt plugin metric tuning knobs”.

A list of metrics associated with the Libvirt service.

Table 13.4: Tunable Libvirt Metrics
Admin Metric Name / Project Metric Name / Dimensions / Description
vm.cpu.time_ns / cpu.time_ns
zone
service
resource_id
hostname
component
Cumulative CPU time (in ns)
vm.cpu.utilization_norm_perc / cpu.utilization_norm_perc
zone
service
resource_id
hostname
component
Normalized CPU utilization (percentage)
vm.cpu.utilization_perc / cpu.utilization_perc
zone
service
resource_id
hostname
component
Overall CPU utilization (percentage)
vm.io.errors / io.errors
zone
service
resource_id
hostname
component
Overall disk I/O errors
vm.io.errors_sec / io.errors_sec
zone
service
resource_id
hostname
component
Disk I/O errors per second
vm.io.read_bytes / io.read_bytes
zone
service
resource_id
hostname
component
Disk I/O read bytes value
vm.io.read_bytes_sec / io.read_bytes_sec
zone
service
resource_id
hostname
component
Disk I/O read bytes per second
vm.io.read_ops / io.read_ops
zone
service
resource_id
hostname
component
Disk I/O read operations value
vm.io.read_ops_sec / io.read_ops_sec
zone
service
resource_id
hostname
component
Disk I/O read operations per second
vm.io.write_bytes / io.write_bytes
zone
service
resource_id
hostname
component
Disk I/O write bytes value
vm.io.write_bytes_sec / io.write_bytes_sec
zone
service
resource_id
hostname
component
Disk I/O write bytes per second
vm.io.write_ops / io.write_ops
zone
service
resource_id
hostname
component
Disk I/O write operations value
vm.io.write_ops_sec / io.write_ops_sec
zone
service
resource_id
hostname
component
Disk I/O write operations per second
vm.net.in_bytes / net.in_bytes
zone
service
resource_id
hostname
component
device
port_id
Network received total bytes
vm.net.in_bytes_sec / net.in_bytes_sec
zone
service
resource_id
hostname
component
device
port_id
Network received bytes per second
vm.net.in_packets / net.in_packets
zone
service
resource_id
hostname
component
device
port_id
Network received total packets
vm.net.in_packets_sec / net.in_packets_sec
zone
service
resource_id
hostname
component
device
port_id
Network received packets per second
vm.net.out_bytes / net.out_bytes
zone
service
resource_id
hostname
component
device
port_id
Network transmitted total bytes
vm.net.out_bytes_sec / net.out_bytes_sec
zone
service
resource_id
hostname
component
device
port_id
Network transmitted bytes per second
vm.net.out_packets / net.out_packets
zone
service
resource_id
hostname
component
device
port_id
Network transmitted total packets
vm.net.out_packets_sec / net.out_packets_sec
zone
service
resource_id
hostname
component
device
port_id
Network transmitted packets per second
vm.ping_status / ping_status
zone
service
resource_id
hostname
component
0 for ping success, 1 for ping failure
vm.disk.allocation / disk.allocation
zone
service
resource_id
hostname
component
Total Disk allocation for a device
vm.disk.allocation_total / disk.allocation_total
zone
service
resource_id
hostname
component
Total Disk allocation across devices for instances
vm.disk.capacity / disk.capacity
zone
service
resource_id
hostname
component
Total Disk capacity for a device
vm.disk.capacity_total / disk.capacity_total
zone
service
resource_id
hostname
component
Total Disk capacity across devices for instances
vm.disk.physical / disk.physical
zone
service
resource_id
hostname
component
Total Disk usage for a device
vm.disk.physical_total / disk.physical_total
zone
service
resource_id
hostname
component
Total Disk usage across devices for instances
vm.io.errors_total / io.errors_total
zone
service
resource_id
hostname
component
Total Disk I/O errors across all devices
vm.io.errors_total_sec / io.errors_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O errors per second across all devices
vm.io.read_bytes_total / io.read_bytes_total
zone
service
resource_id
hostname
component
Total Disk I/O read bytes across all devices
vm.io.read_bytes_total_sec / io.read_bytes_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O read bytes per second across devices
vm.io.read_ops_total / io.read_ops_total
zone
service
resource_id
hostname
component
Total Disk I/O read operations across all devices
vm.io.read_ops_total_sec / io.read_ops_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O read operations across all devices per sec
vm.io.write_bytes_total / io.write_bytes_total
zone
service
resource_id
hostname
component
Total Disk I/O write bytes across all devices
vm.io.write_bytes_total_sec / io.write_bytes_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O Write bytes per second across devices
vm.io.write_ops_total / io.write_ops_total
zone
service
resource_id
hostname
component
Total Disk I/O write operations across all devices
vm.io.write_ops_total_sec / io.write_ops_total_sec
zone
service
resource_id
hostname
component
Total Disk I/O write operations across all devices per sec

The following libvirt metrics are always enabled and cannot be disabled using the tuning knobs.

Table 13.5: Untunable Libvirt Metrics
Admin Metric Name / Project Metric Name / Dimensions / Description
vm.host_alive_status / host_alive_status
zone
service
resource_id
hostname
component

-1 for no status, 0 for Running / OK, 1 for Idle / blocked, 2 for Paused, 3 for Shutting down, 4 for Shut off or nova suspend, 5 for Crashed, 6 for Power management suspend (S3 state)

vm.mem.free_mb / mem.free_mb
cluster
service
hostname
Free memory in Mbytes
vm.mem.free_perc / mem.free_perc
cluster
service
hostname
Percent of memory free
vm.mem.resident_mb 
cluster
service
hostname
Total memory used on host, an Operations-only metric
vm.mem.swap_used_mb / mem.swap_used_mb
cluster
service
hostname
Used swap space in Mbytes
vm.mem.total_mb / mem.total_mb
cluster
service
hostname
Total memory in Mbytes
vm.mem.used_mb / mem.used_mb
cluster
service
hostname
Used memory in Mbytes

13.1.4.12 Monitoring Metrics

A list of metrics associated with the Monitoring service.

Metric Name / Dimensions / Description
alarm-state-transitions-added-to-batch-counter
service=monitoring
url
hostname
component=monasca-persister
 
jvm.memory.total.max
service=monitoring
url
hostname
component
Maximum JVM overall memory
jvm.memory.total.used
service=monitoring
url
hostname
component
Used JVM overall memory
metrics-added-to-batch-counter
service=monitoring
url
hostname
component=monasca-persister
 
metrics.published
service=monitoring
url
hostname
component=monasca-api
Total number of published metrics
monasca.alarms_finished_count
hostname
component=monasca-notification
service=monitoring
Total number of alarms received
monasca.checks_running_too_long
hostname
component=monasca-agent
service=monitoring
cluster
Only emitted when collection time for a check is too long
monasca.collection_time_sec
hostname
component=monasca-agent
service=monitoring
cluster
Collection time in monasca-agent
monasca.config_db_time
hostname
component=monasca-notification
service=monitoring
 
monasca.created_count
hostname
component=monasca-notification
service=monitoring
Number of notifications created
monasca.invalid_type_count
hostname
component=monasca-notification
service=monitoring
Number of notifications with invalid type
monasca.log.in_bulks_rejected
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.in_logs
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.in_logs_bytes
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.in_logs_rejected
hostname
component=monasca-log-api
service=monitoring
version
 
monasca.log.out_logs
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.out_logs_lost
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.out_logs_truncated_bytes
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.processing_time_ms
hostname
component=monasca-log-api
service=monitoring
 
monasca.log.publish_time_ms
hostname
component=monasca-log-api
service=monitoring
 
monasca.thread_count
service=monitoring
process_name
hostname
component
Number of threads monasca is using
raw-sql.time.avg
service=monitoring
url
hostname
component
Average raw sql query time
raw-sql.time.max
service=monitoring
url
hostname
component
Max raw sql query time

13.1.4.13 Monasca Aggregated Metrics

A list of the aggregated metrics associated with the monasca Transform feature.

Metric Name / For / Dimensions / Description
cpu.utilized_logical_cores_agg / Compute summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Utilized physical host cpu core capacity for one or all hosts by time interval (defaults to an hour).

Available as total or per host

cpu.total_logical_cores_agg / Compute summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Total physical host cpu core capacity for one or all hosts by time interval (defaults to an hour)

Available as total or per host

mem.total_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all

Total physical host memory capacity by time interval (defaults to an hour)

mem.usable_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all
Usable physical host memory capacity by time interval (defaults to an hour)
disk.total_used_space_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all

Utilized physical host disk capacity by time interval (defaults to an hour)

disk.total_space_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all
Total physical host disk capacity by time interval (defaults to an hour)
nova.vm.cpu.total_allocated_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all

CPUs allocated across all virtual machines by time interval (defaults to an hour)

vcpus_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Virtual CPUs allocated capacity for virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

nova.vm.mem.total_allocated_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all

Memory allocated to all virtual machines by time interval (defaults to an hour)

vm.mem.used_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Memory utilized by virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

vm.mem.total_mb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Memory allocated to virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

vm.cpu.utilization_perc_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

CPU utilized by all virtual machines by project by time interval (defaults to an hour)

nova.vm.disk.total_allocated_gb_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all

Disk space allocated to all virtual machines by time interval (defaults to an hour)

vm.disk.allocation_agg / Compute summary
aggregation_period: hourly
host: all
project_id: all or <project ID>

Disk allocation for virtual machines of one or all projects by time interval (defaults to an hour)

Available as total or per host

swiftlm.diskusage.val.size_agg / Object Storage summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Total available object storage capacity by time interval (defaults to an hour)

Available as total or per host

swiftlm.diskusage.val.avail_agg / Object Storage summary
aggregation_period: hourly
host: all or <hostname>
project_id: all

Remaining object storage capacity by time interval (defaults to an hour)

Available as total or per host

swiftlm.diskusage.rate_agg / Object Storage summary
aggregation_period: hourly
host: all
project_id: all

Rate of change of object storage usage by time interval (defaults to an hour)

storage.objects.size_agg / Object Storage summary
aggregation_period: hourly
host: all
project_id: all

Used object storage capacity by time interval (defaults to an hour)

13.1.4.14 MySQL Metrics

A list of metrics associated with the MySQL service.

Metric Name / Dimensions / Description
mysql.innodb.buffer_pool_free
hostname
mode
service=mysql

The number of free pages, in bytes. This value is calculated by multiplying Innodb_buffer_pool_pages_free and Innodb_page_size of the server status variable.

mysql.innodb.buffer_pool_total
hostname
mode
service=mysql

The total size of buffer pool, in bytes. This value is calculated by multiplying Innodb_buffer_pool_pages_total and Innodb_page_size of the server status variable.

mysql.innodb.buffer_pool_used
hostname
mode
service=mysql

The number of used pages, in bytes. This value is calculated by subtracting Innodb_buffer_pool_pages_total away from Innodb_buffer_pool_pages_free of the server status variable.

mysql.innodb.current_row_locks
hostname
mode
service=mysql

Corresponding to current row locks of the server status variable.

mysql.innodb.data_reads
hostname
mode
service=mysql

Corresponding to Innodb_data_reads of the server status variable.

mysql.innodb.data_writes
hostname
mode
service=mysql

Corresponding to Innodb_data_writes of the server status variable.

mysql.innodb.mutex_os_waits
hostname
mode
service=mysql

Corresponding to the OS waits of the server status variable.

mysql.innodb.mutex_spin_rounds
hostname
mode
service=mysql

Corresponding to spinlock rounds of the server status variable.

mysql.innodb.mutex_spin_waits
hostname
mode
service=mysql

Corresponding to the spin waits of the server status variable.

mysql.innodb.os_log_fsyncs
hostname
mode
service=mysql

Corresponding to Innodb_os_log_fsyncs of the server status variable.

mysql.innodb.row_lock_time
hostname
mode
service=mysql

Corresponding to Innodb_row_lock_time of the server status variable.

mysql.innodb.row_lock_waits
hostname
mode
service=mysql

Corresponding to Innodb_row_lock_waits of the server status variable.

mysql.net.connections
hostname
mode
service=mysql

Corresponding to Connections of the server status variable.

mysql.net.max_connections
hostname
mode
service=mysql

Corresponding to Max_used_connections of the server status variable.

mysql.performance.com_delete
hostname
mode
service=mysql

Corresponding to Com_delete of the server status variable.

mysql.performance.com_delete_multi
hostname
mode
service=mysql

Corresponding to Com_delete_multi of the server status variable.

mysql.performance.com_insert
hostname
mode
service=mysql

Corresponding to Com_insert of the server status variable.

mysql.performance.com_insert_select
hostname
mode
service=mysql

Corresponding to Com_insert_select of the server status variable.

mysql.performance.com_replace_select
hostname
mode
service=mysql

Corresponding to Com_replace_select of the server status variable.

mysql.performance.com_select
hostname
mode
service=mysql

Corresponding to Com_select of the server status variable.

mysql.performance.com_update
hostname
mode
service=mysql

Corresponding to Com_update of the server status variable.

mysql.performance.com_update_multi
hostname
mode
service=mysql

Corresponding to Com_update_multi of the server status variable.

mysql.performance.created_tmp_disk_tables
hostname
mode
service=mysql

Corresponding to Created_tmp_disk_tables of the server status variable.

mysql.performance.created_tmp_files
hostname
mode
service=mysql

Corresponding to Created_tmp_files of the server status variable.

mysql.performance.created_tmp_tables
hostname
mode
service=mysql

Corresponding to Created_tmp_tables of the server status variable.

mysql.performance.kernel_time
hostname
mode
service=mysql

The kernel time for the databases performance, in seconds.

mysql.performance.open_files
hostname
mode
service=mysql

Corresponding to Open_files of the server status variable.

mysql.performance.qcache_hits
hostname
mode
service=mysql

Corresponding to Qcache_hits of the server status variable.

mysql.performance.queries
hostname
mode
service=mysql

Corresponding to Queries of the server status variable.

mysql.performance.questions
hostname
mode
service=mysql

Corresponding to Question of the server status variable.

mysql.performance.slow_queries
hostname
mode
service=mysql

Corresponding to Slow_queries of the server status variable.

mysql.performance.table_locks_waited
hostname
mode
service=mysql

Corresponding to Table_locks_waited of the server status variable.

mysql.performance.threads_connected
hostname
mode
service=mysql

Corresponding to Threads_connected of the server status variable.

mysql.performance.user_time
hostname
mode
service=mysql

The CPU user time for the databases performance, in seconds.
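
Most of these metrics map directly to MySQL server status variables, so you can cross-check the reported values against the database itself. This is a minimal sketch that assumes you can run the mysql client with sufficient privileges on the database node.

mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"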

13.1.4.15 NTP Metrics

A list of metrics associated with the NTP service.

Metric Name / Dimensions / Description
ntp.connection_status
hostname
ntp_server
Value of ntp server connection status (0=Healthy)
ntp.offset
hostname
ntp_server
Time offset in seconds
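
To verify a reported offset against the NTP daemon itself, query the peers on the affected node. This assumes ntpd and the ntpq utility are installed there.

ntpq -p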

13.1.4.16 Open vSwitch (OVS) Metrics

A list of metrics associated with the OVS service.

Note

For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.2, “OVS plugin metric tuning knobs”.

Table 13.6: Per-router metrics
Admin Metric Name / Project Metric Name / Dimensions / Description
ovs.vrouter.in_bytes_sec / vrouter.in_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Inbound bytes per second for the router (if network_use_bits is false)

ovs.vrouter.in_packets_sec / vrouter.in_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets per second for the router

ovs.vrouter.out_bytes_sec / vrouter.out_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing bytes per second for the router (if network_use_bits is false)

ovs.vrouter.out_packets_sec / vrouter.out_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets per second for the router

ovs.vrouter.in_bytes / vrouter.in_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Inbound bytes for the router (if network_use_bits is false)

ovs.vrouter.in_packets / vrouter.in_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets for the router

ovs.vrouter.out_bytes / vrouter.out_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing bytes for the router (if network_use_bits is false)

ovs.vrouter.out_packets / vrouter.out_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets for the router

ovs.vrouter.in_dropped_sec / vrouter.in_dropped_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets per second for the router

ovs.vrouter.in_errors_sec / vrouter.in_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Number of incoming errors per second for the router

ovs.vrouter.out_dropped_sec / vrouter.out_dropped_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets per second for the router

ovs.vrouter.out_errors_sec / vrouter.out_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Number of outgoing errors per second for the router

ovs.vrouter.in_dropped / vrouter.in_dropped
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets for the router

ovs.vrouter.in_errors / vrouter.in_errors
service=networking
resource_id
component=ovs
router_name
port_id

Number of incoming errors for the router

ovs.vrouter.out_dropped / vrouter.out_dropped
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets for the router

ovs.vrouter.out_errors / vrouter.out_errors
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Number of outgoing errors for the router

Table 13.7: Per-DHCP port and rate metrics
Admin Metric Name / Tenant Metric Name / Dimensions / Description
ovs.vswitch.in_bytes_sec / vswitch.in_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Incoming Bytes per second on DHCP port (if network_use_bits is false)

ovs.vswitch.in_packets_sec / vswitch.in_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets per second for the DHCP port

ovs.vswitch.out_bytes_sec / vswitch.out_bytes_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing Bytes per second on DHCP port (if network_use_bits is false)

ovs.vswitch.out_packets_sec / vswitch.out_packets_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets per second for the DHCP port

ovs.vswitch.in_bytes / vswitch.in_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Inbound bytes for the DHCP port (if network_use_bits is false)

ovs.vswitch.in_packets / vswitch.in_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming packets for the DHCP port

ovs.vswitch.out_bytes / vswitch.out_bytes
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing bytes for the DHCP port (if network_use_bits is false)

ovs.vswitch.out_packets / vswitch.out_packets
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Outgoing packets for the DHCP port

ovs.vswitch.in_dropped_sec / vswitch.in_dropped_sec
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets per second for the DHCP port

ovs.vswitch.in_errors_sec / vswitch.in_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Incoming errors per second for the DHCP port

ovs.vswitch.out_dropped_sec / vswitch.out_dropped_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets per second for the DHCP port

ovs.vswitch.out_errors_sec / vswitch.out_errors_sec
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing errors per second for the DHCP port

ovs.vswitch.in_dropped / vswitch.in_dropped
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Incoming dropped packets for the DHCP port

ovs.vswitch.in_errors / vswitch.in_errors
service=networking
resource_id
component=ovs
router_name
port_id

Errors received for the DHCP port

ovs.vswitch.out_dropped / vswitch.out_dropped
service=networking
resource_id
component=ovs
router_name
port_id

Outgoing dropped packets for the DHCP port

ovs.vswitch.out_errors / vswitch.out_errors
service=networking
resource_id
tenant_id
component=ovs
router_name
port_id

Errors transmitted for the DHCP port

13.1.4.17 Process Metrics

A list of metrics associated with processes.

Metric Name / Dimensions / Description
process.cpu_perc
hostname
service
process_name
component
Percentage of cpu being consumed by a process
process.io.read_count
hostname
service
process_name
component
Number of reads by a process
process.io.read_kbytes
hostname
service
process_name
component
Kbytes read by a process
process.io.write_count
hostname
service
process_name
component
Number of writes by a process
process.io.write_kbytes
hostname
service
process_name
component
Kbytes written by a process
process.mem.rss_mbytes
hostname
service
process_name
component
Amount of physical memory allocated to a process, including memory from shared libraries in Mbytes
process.open_file_descriptors
hostname
service
process_name
component
Number of files being used by a process
process.pid_count
hostname
service
process_name
component
Number of processes that exist with this process name
process.thread_count
hostname
service
process_name
component
Number of threads a process is using
13.1.4.17.1 process.cpu_perc, process.mem.rss_mbytes, process.pid_count and process.thread_count metrics
Component Name / Dimensions / Description
apache-storm
service=monitoring
process_name=monasca-thresh
process_user=storm
apache-storm process info: cpu percent, memory, pid count and thread count
barbican-api
service=key-manager
process_name=barbican-api
barbican-api process info: cpu percent, memory, pid count and thread count
ceilometer-agent-notification
service=telemetry
process_name=ceilometer-agent-notification
ceilometer-agent-notification process info: cpu percent, memory, pid count and thread count
ceilometer-polling
service=telemetry
process_name=ceilometer-polling
ceilometer-polling process info: cpu percent, memory, pid count and thread count
cinder-api
service=block-storage
process_name=cinder-api
cinder-api process info: cpu percent, memory, pid count and thread count
cinder-scheduler
service=block-storage
process_name=cinder-scheduler
cinder-scheduler process info: cpu percent, memory, pid count and thread count
designate-api
service=dns
process_name=designate-api
designate-api process info: cpu percent, memory, pid count and thread count
designate-central
service=dns
process_name=designate-central
designate-central process info: cpu percent, memory, pid count and thread count
designate-mdns
service=dns
process_name=designate-mdns
designate-mdns process cpu percent, memory, pid count and thread count
designate-pool-manager
service=dns
process_name=designate-pool-manager
designate-pool-manager process info: cpu percent, memory, pid count and thread count
heat-api
service=orchestration
process_name=heat-api
heat-api process cpu percent, memory, pid count and thread count
heat-api-cfn
service=orchestration
process_name=heat-api-cfn
heat-api-cfn process info: cpu percent, memory, pid count and thread count
heat-api-cloudwatch
service=orchestration
process_name=heat-api-cloudwatch
heat-api-cloudwatch process cpu percent, memory, pid count and thread count
heat-engine
service=orchestration
process_name=heat-engine
heat-engine process info: cpu percent, memory, pid count and thread count
ipsec/charon
service=networking
process_name=ipsec/charon
ipsec/charon process info: cpu percent, memory, pid count and thread count
keystone-admin
service=identity-service
process_name=keystone-admin
keystone-admin process info: cpu percent, memory, pid count and thread count
keystone-main
service=identity-service
process_name=keystone-main
keystone-main process info: cpu percent, memory, pid count and thread count
monasca-agent
service=monitoring
process_name=monasca-agent
monasca-agent process info: cpu percent, memory, pid count and thread count
monasca-api
service=monitoring
process_name=monasca-api
monasca-api process info: cpu percent, memory, pid count and thread count
monasca-notification
service=monitoring
process_name=monasca-notification
monasca-notification process info: cpu percent, memory, pid count and thread count
monasca-persister
service=monitoring
process_name=monasca-persister
monasca-persister process info: cpu percent, memory, pid count and thread count
monasca-transform
service=monasca-transform
process_name=monasca-transform
monasca-transform process info: cpu percent, memory, pid count and thread count
neutron-dhcp-agent
service=networking
process_name=neutron-dhcp-agent
neutron-dhcp-agent process info: cpu percent, memory, pid count and thread count
neutron-l3-agent
service=networking
process_name=neutron-l3-agent
neutron-l3-agent process info: cpu percent, memory, pid count and thread count
neutron-metadata-agent
service=networking
process_name=neutron-metadata-agent
neutron-metadata-agent process info: cpu percent, memory, pid count and thread count
neutron-openvswitch-agent
service=networking
process_name=neutron-openvswitch-agent
neutron-openvswitch-agent process info: cpu percent, memory, pid count and thread count
neutron-rootwrap
service=networking
process_name=neutron-rootwrap
neutron-rootwrap process info: cpu percent, memory, pid count and thread count
neutron-server
service=networking
process_name=neutron-server
neutron-server process info: cpu percent, memory, pid count and thread count
neutron-vpn-agent
service=networking
process_name=neutron-vpn-agent
neutron-vpn-agent process info: cpu percent, memory, pid count and thread count
nova-api
service=compute
process_name=nova-api
nova-api process info: cpu percent, memory, pid count and thread count
nova-compute
service=compute
process_name=nova-compute
nova-compute process info: cpu percent, memory, pid count and thread count
nova-conductor
service=compute
process_name=nova-conductor
nova-conductor process info: cpu percent, memory, pid count and thread count
nova-novncproxy
service=compute
process_name=nova-novncproxy
nova-novncproxy process info: cpu percent, memory, pid count and thread count
nova-scheduler
service=compute
process_name=nova-scheduler
nova-scheduler process info: cpu percent, memory, pid count and thread count
octavia-api
service=octavia
process_name=octavia-api
octavia-api process info: cpu percent, memory, pid count and thread count
octavia-health-manager
service=octavia
process_name=octavia-health-manager
octavia-health-manager process info: cpu percent, memory, pid count and thread count
octavia-housekeeping
service=octavia
process_name=octavia-housekeeping
octavia-housekeeping process info: cpu percent, memory, pid count and thread count
octavia-worker
service=octavia
process_name=octavia-worker
octavia-worker process info: cpu percent, memory, pid count and thread count
org.apache.spark.deploy.master.Master
service=spark
process_name=org.apache.spark.deploy.master.Master
org.apache.spark.deploy.master.Master process info: cpu percent, memory, pid count and thread count
org.apache.spark.executor.CoarseGrainedExecutorBackend
service=monasca-transform
process_name=org.apache.spark.executor.CoarseGrainedExecutorBackend
org.apache.spark.executor.CoarseGrainedExecutorBackend process info: cpu percent, memory, pid count and thread count
pyspark
service=monasca-transform
process_name=pyspark
pyspark process info: cpu percent, memory, pid count and thread count
transform/lib/driver
service=monasca-transform
process_name=transform/lib/driver
transform/lib/driver process info: cpu percent, memory, pid count and thread count
cassandra
service=cassandra
process_name=cassandra
cassandra process info: cpu percent, memory, pid count and thread count
13.1.4.17.2 process.io.*, process.open_file_descriptors metrics
Component Name / Dimensions / Description
monasca-agent
service=monitoring
process_name=monasca-agent
process_user=mon-agent
monasca-agent process info: number of reads, number of writes, number of files being used

13.1.4.18 RabbitMQ Metrics

A list of metrics associated with the RabbitMQ service.

Metric Name / Dimensions / Description
rabbitmq.exchange.messages.published_count
hostname
exchange
vhost
type
service=rabbitmq

Value of the "publish_out" field of "message_stats" object

rabbitmq.exchange.messages.published_rate
hostname
exchange
vhost
type
service=rabbitmq

Value of the "rate" field of "message_stats/publish_out_details" object

rabbitmq.exchange.messages.received_count
hostname
exchange
vhost
type
service=rabbitmq

Value of the "publish_in" field of "message_stats" object

rabbitmq.exchange.messages.received_rate
hostname
exchange
vhost
type
service=rabbitmq

Value of the "rate" field of "message_stats/publish_in_details" object

rabbitmq.node.fd_used
hostname
node
service=rabbitmq

Value of the "fd_used" field in the response of /api/nodes

rabbitmq.node.mem_used
hostname
node
service=rabbitmq

Value of the "mem_used" field in the response of /api/nodes

rabbitmq.node.run_queue
hostname
node
service=rabbitmq

Value of the "run_queue" field in the response of /api/nodes

rabbitmq.node.sockets_used
hostname
node
service=rabbitmq

Value of the "sockets_used" field in the response of /api/nodes

rabbitmq.queue.messages
hostname
queue
vhost
service=rabbitmq

Sum of ready and unacknowledged messages (queue depth)

rabbitmq.queue.messages.deliver_rate
hostname
queue
vhost
service=rabbitmq

Value of the "rate" field of "message_stats/deliver_details" object

rabbitmq.queue.messages.publish_rate
hostname
queue
vhost
service=rabbitmq

Value of the "rate" field of "message_stats/publish_details" object

rabbitmq.queue.messages.redeliver_rate
hostname
queue
vhost
service=rabbitmq

Value of the "rate" field of "message_stats/redeliver_details" object
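
The node-level values are read from the RabbitMQ management API. To inspect the raw fields, you can query the API directly; the port and credentials below are illustrative and must match your deployment.

curl -s -u RABBIT_USER:RABBIT_PASSWORD "http://localhost:15672/api/nodes" | python -m json.tool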

13.1.4.19 Swift Metrics

A list of metrics associated with the swift service.

Metric Name / Dimensions / Description
swiftlm.access.host.operation.get.bytes
service=object-storage

This metric is the number of bytes read from objects in GET requests processed by this host during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included.

swiftlm.access.host.operation.ops
service=object-storage

This metric is a count of all the API requests made to swift that were processed by this host during the last minute.

swiftlm.access.host.operation.project.get.bytes  
swiftlm.access.host.operation.project.ops  
swiftlm.access.host.operation.project.put.bytes  
swiftlm.access.host.operation.put.bytes
service=object-storage

This metric is the number of bytes written to objects in PUT or POST requests processed by this host during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included.

swiftlm.access.host.operation.status  
swiftlm.access.project.operation.status
service=object-storage

This metric reports whether the swiftlm-access-log-tailer program is running normally.

swiftlm.access.project.operation.ops
tenant_id
service=object-storage

This metric is a count of all the API requests made to swift for a given project ID that were processed by this host during the last minute.

swiftlm.access.project.operation.get.bytes
tenant_id
service=object-storage

This metric is the number of bytes read from objects in GET requests processed by this host for a given project during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included.

swiftlm.access.project.operation.put.bytes
tenant_id
service=object-storage

This metric is the number of bytes written to objects in PUT or POST requests processed by this host for a given project during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included.

swiftlm.async_pending.cp.total.queue_length
observer_host
service=object-storage

This metric reports the total length of all async pending queues in the system.

When a container update fails, the update is placed on the async pending queue. An update may fail because the container server is too busy or because the server is down or has failed. Later, the system will “replay” updates from the queue, so eventually the container listings will show all objects known to the system.

If you know that container servers are down, it is normal to see the value of async pending increase. Once the server is restored, the value should return to zero.

A non-zero value may also indicate that containers are too large. Look for “lock timeout” messages in /var/log/swift/swift.log. If you find such messages, consider reducing the container size or enabling rate limiting.

swiftlm.check.failure
check
error
component
service=object-storage

The total exception string is truncated if longer than 1919 characters and an ellipsis is prepended in the first three characters of the message. If there is more than one error reported, the list of errors is paired to the last reported error and the operator is expected to resolve failures until no more are reported. Where there are no further reported errors, the Value Class is emitted as ‘Ok’.

swiftlm.diskusage.cp.avg.usage
observer_host
service=object-storage

Is the average utilization of all drives in the system. The value is a percentage (example: 30.0 means 30% of the total space is used).

swiftlm.diskusage.cp.max.usage
observer_host
service=object-storage

Is the highest utilization of all drives in the system. The value is a percentage (example: 80.0 means at least one drive is 80% utilized). The value is just as important as swiftlm.diskusage.usage.avg. For example, if swiftlm.diskusage.usage.avg is 70% you might think that there is plenty of space available. However, if swiftlm.diskusage.usage.max is 100%, this means that some objects cannot be stored on that drive. swift will store replicas on other drives. However, this will create extra overhead.

swiftlm.diskusage.cp.min.usage
observer_host
service=object-storage

Is the lowest utilization of all drives in the system. The value is a percentage (example: 10.0 means at least one drive is 10% utilized)

swiftlm.diskusage.cp.total.avail
observer_host
service=object-storage

Is the size in bytes of available (unused) space of all drives in the system. Only drives used by swift are included.

swiftlm.diskusage.cp.total.size
observer_host
service=object-storage

Is the size in bytes of raw size of all drives in the system.

swiftlm.diskusage.cp.total.used
observer_host
service=object-storage

Is the size in bytes of used space of all drives in the system. Only drives used by swift are included.

swiftlm.diskusage.host.avg.usage
hostname
service=object-storage

This metric reports the average percent usage of all swift filesystems on a host.

swiftlm.diskusage.host.max.usage
hostname
service=object-storage

This metric reports the percent usage of a swift filesystem that is most used (full) on a host. The value is the max of the percentage used of all swift filesystems.

swiftlm.diskusage.host.min.usage
hostname
service=object-storage

This metric reports the percent usage of a swift filesystem that is least used (has free space) on a host. The value is the min of the percentage used of all swift filesystems.

swiftlm.diskusage.host.val.avail
hostname
service=object-storage
mount
device
label

This metric reports the number of bytes available (free) in a swift filesystem. The value is an integer (units: Bytes)

swiftlm.diskusage.host.val.size
hostname
service=object-storage
mount
device
label

This metric reports the size in bytes of a swift filesystem. The value is an integer (units: Bytes)

swiftlm.diskusage.host.val.usage
hostname
service=object-storage
mount
device
label

This metric reports the percent usage of a swift filesystem. The value is a floating point number in range 0.0 to 100.0

swiftlm.diskusage.host.val.used
hostname
service=object-storage
mount
device
label

This metric reports the number of used bytes in a swift filesystem. The value is an integer (units: Bytes)

swiftlm.load.cp.avg.five
observer_host
service=object-storage

This is the average of the five minute system load averages of all nodes in the swift system.

swiftlm.load.cp.max.five
observer_host
service=object-storage

This is the five minute load average of the busiest host in the swift system.

swiftlm.load.cp.min.five
observer_host
service=object-storage

This is the five minute load average of the least loaded host in the swift system.

swiftlm.load.host.val.five
hostname
service=object-storage

This metric reports the 5 minute load average of a host. The value is derived from /proc/loadavg.

swiftlm.md5sum.cp.check.ring_checksums
observer_host
service=object-storage

This check compares the checksums of the swift ring files across all hosts. If you are in the middle of deploying new rings, it is normal for this check to be in the failed state.

However, if you are not in the middle of a deployment, you need to investigate the cause. Use swift-recon --md5 -v to identify the problem hosts.

swiftlm.replication.cp.avg.account_duration
observer_host
service=object-storage

This is the average across all servers for the account replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds.

swiftlm.replication.cp.avg.container_duration
observer_host
service=object-storage

This is the average across all servers for the container replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds.

swiftlm.replication.cp.avg.object_duration
observer_host
service=object-storage

This is the average across all servers for the object replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds.

swiftlm.replication.cp.max.account_last
hostname
path
service=object-storage

This is the number of seconds since the account replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically, and hence this value decreases whenever a replicator completes a cycle. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle.

swiftlm.replication.cp.max.container_last
hostname
path
service=object-storage

This is the number of seconds since the container replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically, and hence this value decreases whenever a replicator completes a cycle. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle.

swiftlm.replication.cp.max.object_last
hostname
path
service=object-storage

This is the number of seconds since the object replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically, and hence this value decreases whenever a replicator completes a cycle. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle.

swiftlm.swift.drive_audit
hostname
service=object-storage
mount_point
kernel_device

If an unrecoverable read error (URE) occurs on a filesystem, the error is logged in the kernel log. The swift-drive-audit program scans the kernel log looking for patterns indicating possible UREs.

To get more information, log onto the node in question and run:

sudo swift-drive-audit /etc/swift/drive-audit.conf

UREs are common on large disk drives. They do not necessarily indicate that the drive has failed. You can use the xfs_repair command to attempt to repair the filesystem. Failing this, you may need to wipe the filesystem.
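
The following is a minimal sketch of such a repair, run on the affected object server; the mount point and device are placeholders, and the filesystem must be unmounted before running xfs_repair:

# Substitute the mount point and device reported by the alarm dimensions
ardana > sudo umount /srv/node/disk0
ardana > sudo xfs_repair /dev/sdc1
ardana > sudo mount /srv/node/disk0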

If UREs occur very often on a specific drive, this may indicate that the drive is about to fail and should be replaced.

swiftlm.swift.file_ownership.config
hostname
path
service

This metric reports whether a configuration directory or file has the appropriate owner. The check looks at the swift configuration directories and files.

swiftlm.swift.file_ownership.data
hostname
path
service

This metric reports whether a data directory has the appropriate owner. The check looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects).

swiftlm.swiftlm_check
hostname
service=object-storage

This indicates whether the swiftlm monasca-agent plug-in is running normally. If the status is failed, it is probable that some or all metrics are no longer being reported.

swiftlm.swift.replication.account.last_replication
hostname
service=object-storage

This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad.

swiftlm.swift.replication.container.last_replication
hostname
service=object-storage

This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad.

swiftlm.swift.replication.object.last_replication
hostname
service=object-storage

This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad.

swiftlm.swift.swift_services
hostname
service=object-storage

This metric reports whether the process named in the component dimension is running or not; the details are given in the msg value_meta.

Use the swift-start.yml playbook to attempt to restart the stopped process (it will start any process that has stopped; you do not need to name a specific process).
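
For example, from the Cloud Lifecycle Manager (this assumes the standard playbook location used elsewhere in this guide):

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-start.yml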

swiftlm.swift.swift_services.check_ip_port
hostname
service=object-storage
component

Reports if a service is listening to the correct IP address and port.

swiftlm.systems.check_mounts
hostname
service=object-storage
mount
device
label

This metric reports the mount state of each drive that should be mounted on this node.

swiftlm.systems.connectivity.connect_check
observer_host
url
target_port
service=object-storage

This metric reports if a server can connect to the VIPs. Currently the following VIPs are checked:

  • The keystone VIP used to validate tokens (normally port 5000)

swiftlm.systems.connectivity.memcache_check
observer_host
hostname
target_port
service=object-storage

This metric reports whether memcached on the host specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg values are used:

We successfully connected to <hostname> on port <target_port>:

{
  "dimensions": {
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "11211"
  },
  "metric": "swiftlm.systems.connectivity.memcache_check",
  "timestamp": 1449084058,
  "value": 0,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:11211 ok"
  }
}

We failed to connect to <hostname> on port <target_port>:

{
  "dimensions": {
    "fail_message": "[Errno 111] Connection refused",
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "11211"
  },
  "metric": "swiftlm.systems.connectivity.memcache_check",
  "timestamp": 1449084150,
  "value": 2,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:11211 [Errno 111] Connection refused"
  }
}

swiftlm.systems.connectivity.rsync_check
observer_host
hostname
target_port
service=object-storage

This metric reports whether rsyncd on the host specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg values are used:

We successfully connected to <hostname> on port <target_port>:

{
  "dimensions": {
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "873"
  },
  "metric": "swiftlm.systems.connectivity.rsync_check",
  "timestamp": 1449082663,
  "value": 0,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:873 ok"
  }
}

We failed to connect to <hostname> on port <target_port>:

{
  "dimensions": {
    "fail_message": "[Errno 111] Connection refused",
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "873"
  },
  "metric": "swiftlm.systems.connectivity.rsync_check",
  "timestamp": 1449082860,
  "value": 2,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:873 [Errno 111] Connection refused"
  }
}

swiftlm.umon.target.avg.latency_sec
component
hostname
observer_host
service=object-storage
url

Reports the average value of N-iterations of the latency values recorded for a component.

swiftlm.umon.target.check.state
component
hostname
observer_host
service=object-storage
url

This metric reports the state of each component after N-iterations of checks. If the initial check succeeds, the checks move on to the next component until all components are queried, then the checks sleep for ‘main_loop_interval’ seconds. If a check fails, it is retried every second, up to ‘retries’ times per component. If the check fails ‘retries’ times, it is reported as a failed instance.

A successful state will be reported in JSON:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.check.state",
    "timestamp": 1453111805,
    "value": 0
}

A failed state will report a “fail” value, and the value_meta will provide the HTTP response error.

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.check.state",
    "timestamp": 1453112841,
    "value": 2,
    "value_meta": {
        "msg": "HTTPConnectionPool(host='192.168.245.9', port=8080): Max retries exceeded with url: /v1/AUTH_76538ce683654a35983b62e333001b47 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd857d7f550>: Failed to establish a new connection: [Errno 110] Connection timed out',))"
    }
}

swiftlm.umon.target.max.latency_sec
component
hostname
observer_host
service=object-storage
url

This metric reports the maximum response time in seconds of a REST call from the observer to the component REST API listening on the reported host.

A response time query will be reported in JSON:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.max.latency_sec",
    "timestamp": 1453111805,
    "value": 0.2772650718688965
}

A failed query will have a much longer time value:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.max.latency_sec",
    "timestamp": 1453112841,
    "value": 127.288015127182
}

swiftlm.umon.target.min.latency_sec
component
hostname
observer_host
service=object-storage
url

This metric reports the minimum response time in seconds of a REST call from the observer to the component REST API listening on the reported host.

A response time query will be reported in JSON:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.min.latency_sec",
    "timestamp": 1453111805,
    "value": 0.10025882720947266
}

A failed query will have a much longer time value:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.min.latency_sec",
    "timestamp": 1453112841,
    "value": 127.25378203392029
}

swiftlm.umon.target.val.avail_day
component
hostname
observer_host
service=object-storage
url

This metric reports the average of all the collected records in the swiftlm.umon.target.val.avail_minute metric data. This is a walking average of these approximately per-minute states of the swift Object Store. The most basic case is a whole day of successful per-minute records, which averages to 100% availability. If there is any downtime throughout the day resulting in gaps of data that are two minutes or longer, the per-minute availability data will be “back filled” with an assumption of a down state for all the per-minute records that did not exist during the non-reported time. Because this is a walking average of approximately 24 hours' worth of data, any outage will take 24 hours to be purged from the dataset.

A 24-hour average availability report:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.val.avail_day",
    "timestamp": 1453645405,
    "value": 7.894736842105263
}
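
To inspect this walking average from the command line, you can, for example, query it with the monasca CLI; the time window and period below are illustrative only:

ardana > monasca metric-statistics swiftlm.umon.target.val.avail_day AVG 2016-01-01T00:00:00Z --period 3600 --dimensions service=object-storage
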
swiftlm.umon.target.val.avail_minute
component
hostname
observer_host
service=object-storage
url

A value of 100 indicates that swift-uptime-monitor was able to get a token from keystone and was able to perform operations against the swift API during the reported minute. A value of zero indicates that either keystone or swift failed to respond successfully. A metric is produced every minute that swift-uptime-monitor is running.

An “up” minute report value will report 100 [percent]:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.val.avail_minute",
    "timestamp": 1453645405,
    "value": 100.0
}

A “down” minute report value will report 0 [percent]:

{
    "dimensions": {
        "component": "rest-api",
        "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
        "observer_host": "ardana-ccp-c1-m1-mgmt",
        "service": "object-storage",
        "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
    },
    "metric": "swiftlm.umon.target.val.avail_minute",
    "timestamp": 1453649139,
    "value": 0.0
}

swiftlm.hp_hardware.hpssacli.smart_array.firmware
component
hostname
service=object-storage
model
controller_slot

This metric reports the firmware version of a component of a Smart Array controller.

swiftlm.hp_hardware.hpssacli.smart_array
component
hostname
service=object-storage
sub_component
model
controller_slot

This reports the status of various sub-components of a Smart Array Controller.

A failure is considered to have occurred if:

  • Controller is failed

  • Cache is not enabled or has failed

  • Battery or capacitor is not installed

  • Battery or capacitor has failed

swiftlm.hp_hardware.hpssacli.physical_drive
component
hostname
service=object-storage
controller_slot
box
bay

This reports the status of a disk drive attached to a Smart Array controller.

swiftlm.hp_hardware.hpssacli.logical_drive
component
hostname
observer_host
service=object-storage
controller_slot
array
logical_drive
sub_component

This reports the status of a LUN presented by a Smart Array controller.

A LUN is considered failed if the LUN has failed or if the LUN cache is not enabled and working.

Note
Note
  • The HPE Smart Storage Administrator (HPE SSA) CLI component must be installed on all control nodes that are swift nodes in order to generate the following swift metrics:

    • swiftlm.hp_hardware.hpssacli.smart_array

    • swiftlm.hp_hardware.hpssacli.logical_drive

    • swiftlm.hp_hardware.hpssacli.smart_array.firmware

    • swiftlm.hp_hardware.hpssacli.physical_drive

  • HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

  • After the HPE SSA CLI component is installed on the swift nodes, the metrics will be generated automatically during the next agent polling cycle. Manual reboot of the node is not required.
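
As a quick sanity check after installing the utility, you can confirm that the controller and its drives are visible. The commands below are a hedged example; the controller slot number is a placeholder and the hpssacli binary must be on the PATH of the swift node:

ardana > sudo hpssacli controller all show status
ardana > sudo hpssacli controller slot=1 physicaldrive all show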

13.1.4.20 System Metrics

A list of metrics associated with the System.

Table 13.8: CPU Metrics
Metric Name | Dimensions | Description
cpu.frequency_mhz
cluster
hostname
service=system

Maximum MHz value for the cpu frequency.

Note
Note

This value is dynamic and is driven by the CPU governor depending on current resource needs.

cpu.idle_perc
cluster
hostname
service=system

Percentage of time the CPU is idle when no I/O requests are in progress

cpu.idle_time
cluster
hostname
service=system

Time the CPU is idle when no I/O requests are in progress

cpu.percent
cluster
hostname
service=system

Percentage of time the CPU is used in total

cpu.stolen_perc
cluster
hostname
service=system

Percentage of stolen CPU time, that is, the time spent in other OS contexts when running in a virtualized environment

cpu.system_perc
cluster
hostname
service=system

Percentage of time the CPU is used at the system level

cpu.system_time
cluster
hostname
service=system

Time the CPU is used at the system level

cpu.time_ns
cluster
hostname
service=system

Time the CPU is used at the host level

cpu.total_logical_cores
cluster
hostname
service=system

Total number of logical cores available for an entire node (includes hyperthreading).

Note
Note

This is an optional metric that is only sent when send_rollup_stats is set to true.

cpu.user_perc
cluster
hostname
service=system

Percentage of time the CPU is used at the user level

cpu.user_time
cluster
hostname
service=system

Time the CPU is used at the user level

cpu.wait_perc
cluster
hostname
service=system

Percentage of time the CPU is idle AND there is at least one I/O request in progress

cpu.wait_time
cluster
hostname
service=system

Time the CPU is idle AND there is at least one I/O request in progress

Table 13.9: Disk Metrics
Metric Name | Dimensions | Description
disk.inode_used_perc
mount_point
service=system
hostname
cluster
device

The percentage of inodes that are used on a device

disk.space_used_perc
mount_point
service=system
hostname
cluster
device

The percentage of disk space that is being used on a device

disk.total_space_mb
mount_point
service=system
hostname
cluster
device

The total amount of disk space in Mbytes aggregated across all the disks on a particular node.

Note
Note

This is an optional metric that is only sent when send_rollup_stats is set to true.

disk.total_used_space_mb
mount_point
service=system
hostname
cluster
device

The total amount of used disk space in Mbytes aggregated across all the disks on a particular node.

Note
Note

This is an optional metric that is only sent when send_rollup_stats is set to true.

io.read_kbytes_sec
mount_point
service=system
hostname
cluster
device

Kbytes/sec read by an io device

io.read_req_sec
mount_point
service=system
hostname
cluster
device

Number of read requests/sec to an io device

io.read_time_sec
mount_point
service=system
hostname
cluster
device

Amount of read time in seconds to an io device

io.write_kbytes_sec
mount_point
service=system
hostname
cluster
device

Kbytes/sec written by an io device

io.write_req_sec
mount_point
service=system
hostname
cluster
device

Number of write requests/sec to an io device

io.write_time_sec
mount_point
service=system
hostname
cluster
device

Amount of write time in seconds to an io device

Table 13.10: Load Metrics
Metric Name | Dimensions | Description
load.avg_15_min
service=system
hostname
cluster

The normalized (by number of logical cores) average system load over a 15 minute period

load.avg_1_min
service=system
hostname
cluster

The normalized (by number of logical cores) average system load over a 1 minute period

load.avg_5_min
service=system
hostname
cluster

The normalized (by number of logical cores) average system load over a 5 minute period

Table 13.11: Memory Metrics
Metric Name | Dimensions | Description
mem.free_mb
service=system
hostname
cluster

Mbytes of free memory

mem.swap_free_mb
service=system
hostname
cluster

Mbytes of swap memory that is free

mem.swap_free_perc
service=system
hostname
cluster

Percentage of swap memory that is free

mem.swap_total_mb
service=system
hostname
cluster

Mbytes of total physical swap memory

mem.swap_used_mb
service=system
hostname
cluster

Mbytes of total swap memory used

mem.total_mb
service=system
hostname
cluster

Total Mbytes of memory

mem.usable_mb
service=system
hostname
cluster

Total Mbytes of usable memory

mem.usable_perc
service=system
hostname
cluster

Percentage of total memory that is usable

mem.used_buffers
service=system
hostname
cluster

Mbytes of buffers being used by the kernel for block io

mem.used_cache
service=system
hostname
cluster

Mbytes of memory used for the page cache

mem.used_mb
service=system
hostname
cluster

Total Mbytes of used memory

Table 13.12: Network Metrics
Metric Name | Dimensions | Description
net.in_bytes_sec
service=system
hostname
device

Number of network bytes received per second

net.in_errors_sec
service=system
hostname
device

Number of network errors on incoming network traffic per second

net.in_packets_dropped_sec
service=system
hostname
device

Number of inbound network packets dropped per second

net.in_packets_sec
service=system
hostname
device

Number of network packets received per second

net.out_bytes_sec
service=system
hostname
device

Number of network bytes sent per second

net.out_errors_sec
service=system
hostname
device

Number of network errors on outgoing network traffic per second

net.out_packets_dropped_sec
service=system
hostname
device

Number of outbound network packets dropped per second

net.out_packets_sec
service=system
hostname
device

Number of network packets sent per second
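
As a hedged example, you can discover which hosts report any of the system metrics listed above by filtering on their dimensions with the monasca CLI:

ardana > monasca metric-list --name cpu.idle_perc --dimensions service=system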

13.1.4.21 Zookeeper Metrics

A list of metrics associated with the Zookeeper service.

Metric Name | Dimensions | Description
zookeeper.avg_latency_sec
hostname
mode
service=zookeeper
Average latency in seconds
zookeeper.connections_count
hostname
mode
service=zookeeper
Number of connections
zookeeper.in_bytes
hostname
mode
service=zookeeper
Received bytes
zookeeper.max_latency_sec
hostname
mode
service=zookeeper
Maximum latency in seconds
zookeeper.min_latency_sec
hostname
mode
service=zookeeper
Minimum latency in seconds
zookeeper.node_count
hostname
mode
service=zookeeper
Number of nodes
zookeeper.out_bytes
hostname
mode
service=zookeeper
Sent bytes
zookeeper.outstanding_bytes
hostname
mode
service=zookeeper
Outstanding bytes
zookeeper.zxid_count
hostname
mode
service=zookeeper
Counter portion of the ZooKeeper transaction ID (zxid)
zookeeper.zxid_epoch
hostname
mode
service=zookeeper
Epoch portion of the ZooKeeper transaction ID (zxid)

13.2 Centralized Logging Service

You can use the Centralized Logging Service to evaluate and troubleshoot your distributed cloud environment from a single location.

13.2.1 Getting Started with Centralized Logging Service

A typical cloud consists of multiple servers which makes locating a specific log from a single server difficult. The Centralized Logging feature helps the administrator evaluate and troubleshoot the distributed cloud deployment from a single location.

The Logging API is a component in the centralized logging architecture. It works between log producers and log storage. In most cases it works by default after installation with no additional configuration. To use Logging API with logging-as-a-service, you must configure an end-point. This component adds flexibility and supportability for features in the future.

Do I need to configure monasca-log-api? If you are only using the Cloud Lifecycle Manager, then the default configuration is ready to use.

Important
Important

If you are using logging in any of the following deployments, then you will need to query keystone to get an end-point to use.

  • Logging as a Service

  • Platform as a Service

The Logging API is protected by keystone’s role-based access control. To ensure that logging is allowed and monasca alarms can be triggered, the user must have the monasca-user role. To get an end-point from keystone:

  1. Log on to Cloud Lifecycle Manager (deployer node).

  2. To list the Identity service catalog, run:

    ardana > source ./service.osrc
    ardana > openstack catalog list
  3. In the output, find Kronos. For example:

    Name: kronos
    Region: region0
    Endpoints:
      public: http://myardana.test:5607/v3.0
      admin: http://192.168.245.5:5607/v3.0
      internal: http://192.168.245.5:5607/v3.0

  4. Use the same port number as found in the output. In the example, you would use port 5607.
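
Alternatively, with the credentials sourced as in step 2 above, you can list just the Kronos endpoints with the OpenStack CLI. This is a minimal sketch that assumes the service is registered under the name shown in the catalog output:

ardana > openstack endpoint list --service kronos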

In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.

Important
Important

It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.

For more information, see Section 13.2.4, “Managing the Centralized Logging Feature”.

13.2.1.1 For More Information

For more information about the centralized logging components, see the following sites:

13.2.2 Understanding the Centralized Logging Service

The Centralized Logging feature collects logs on a central system, rather than leaving the logs scattered across the network. The administrator can use a single Kibana interface to view log information in charts, graphs, tables, histograms, and other forms.

13.2.2.1 What Components are Part of Centralized Logging?

Centralized logging consists of several components, detailed below:

  • Administrator's Browser: Operations Console can be used to access logging alarms or to access Kibana's dashboards to review logging data.

  • Apache Website for Kibana: A standard Apache website that proxies web/REST requests to the Kibana NodeJS server.

  • Beaver: A Python daemon that collects information in log files and sends it to the Logging API (monasca-log API) over a secure connection.

  • Cloud Auditing Data Federation (CADF): Defines a standard, full-event model anyone can use to fill in the essential data needed to certify, self-manage and self-audit application security in cloud environments.

  • Centralized Logging and Monitoring (CLM): Used to evaluate and troubleshoot your SUSE OpenStack Cloud distributed cloud environment from a single location.

  • Curator: a tool provided by Elasticsearch to manage indices.

  • Elasticsearch: A data store offering fast indexing and querying.

  • SUSE OpenStack Cloud: Provides public, private, and managed cloud solutions to get you moving on your cloud journey.

  • JavaScript Object Notation (JSON) log file: A file stored in the JSON format and used to exchange data. JSON uses JavaScript syntax, but the JSON format is text only. Text can be read and used as a data format by any programming language. This format is used by the Beaver and Logstash components.

  • Kafka: A messaging broker used for collection of SUSE OpenStack Cloud centralized logging data across nodes. It is highly available, scalable and performant. Kafka stores logs in disk instead of memory and is therefore more tolerant to consumer down times.

    Important
    Important

    Make sure not to undersize your Kafka partition, or the data retention period may be shorter than expected. If the Kafka partition runs low on free space (less than 15% free), the retention period is reduced to 30 minutes. Over time, Kafka also ejects old data.

  • Kibana: A client/server application with rich dashboards to visualize the data in Elasticsearch through a web browser. Kibana enables you to create charts and graphs using the log data.

  • Logging API (monasca-log-api): SUSE OpenStack Cloud API provides a standard REST interface to store logs. It uses keystone authentication and role-based access control support.

  • Logstash: A log processing system for receiving, processing and outputting logs. Logstash retrieves logs from Kafka, processes and enriches the data, then stores the data in Elasticsearch.

  • MML Service Node: Metering, Monitoring, and Logging (MML) service node. All services associated with metering, monitoring, and logging run on a dedicated three-node cluster. Three nodes are required for high availability with quorum.

  • Monasca: OpenStack monitoring at scale infrastructure for the cloud that supports alarms and reporting.

  • OpenStack Service: An OpenStack service process that requires logging services.

  • Oslo.log: An OpenStack library for log handling.

  • Platform as a Service (PaaS): Solutions that automate the configuration, deployment, and scaling of complete, ready-for-work application platforms. Some PaaS solutions, such as Cloud Foundry, combine operating systems, containers, and orchestrators with developer tools, operations utilities, metrics, and security to create a developer-rich solution.

  • Text log: A type of file used in the logging process that contains human-readable records.

These components are configured to work out-of-the-box and the admin should be able to view log data using the default configurations.

In addition to each of the services, Centralized Logging also processes logs for the following features:

  • HAProxy

  • Syslog

  • keepalived

The purpose of the logging service is to provide a common logging infrastructure with centralized user access. Because numerous services and applications run on each node of a SUSE OpenStack Cloud deployment, and there can be hundreds of nodes, these services and applications generate enough log files to make it very difficult to search for specific events across all of the nodes. Centralized Logging addresses this issue by sending log messages in real time to a central Elasticsearch, Logstash, and Kibana cluster, where they are indexed and organized for easier searching and visualization. The following illustration describes the architecture used to collect operational logs.

Image
Note
Note

The arrows come from the active (requesting) side to the passive (listening) side. The active side is always the one providing credentials, so the arrows may also be seen as coming from the credential holder to the application requiring authentication.

13.2.2.2 Steps 1 to 2

Services configured to generate log files record the data. Beaver listens for changes to the files and sends the log files to the Logging Service. The first step the Logging service takes is to re-format the original log file to a new log file with text only and to remove all network operations. In Step 1a, the Logging service uses the Oslo.log library to re-format the file to text-only. In Step 1b, the Logging service uses the Python-Logstash library to format the original audit log file to a JSON file.

Step 1a

Beaver watches configured service operational log files for changes and reads incremental log changes from the files.

Step 1b

Beaver watches configured service audit log files for changes and reads incremental log changes from the files.

Step 2a

The monascalog transport of Beaver makes a token request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.

Step 2b

The monascalog transport of Beaver batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection. Failure logs are written to the local Beaver log.

Step 2c

The REST API client for monasca-log-api makes a token-request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.

Step 2d

The REST API client for monasca-log-api batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection.

13.2.2.3 Steps 3a to 3b

The Logging API (monasca-log API) communicates with keystone to validate the incoming request, and then sends the logs to Kafka.

Step 3a

The monasca-log-api WSGI pipeline is configured to validate incoming request tokens with keystone. The keystone middleware used for this purpose is configured to use the monasca-log-api admin user, password and project that have the required keystone role to validate a token.

Step 3b

monasca-log-api sends log messages to Kafka using a language-agnostic TCP protocol.

13.2.2.4 Steps 4 to 8

Logstash pulls messages from Kafka, identifies the log type, and transforms the messages into either the audit log format or operational format. Then Logstash sends the messages to Elasticsearch, using either an audit or operational indices.

Step 4

Logstash input workers pull log messages from the Kafka-Logstash topic using TCP.

Step 5

This Logstash filter processes the log message in-memory in the request pipeline. Logstash identifies the log type from a field in the message.

Step 6

This Logstash filter processes the log message in-memory in the request pipeline. If the message is of audit-log type, Logstash transforms it from the monasca-log-api envelope format to the original CADF format.

Step 7

This Logstash filter determines which index should receive the log message. There are separate indices in Elasticsearch for operational versus audit logs.

Step 8

Logstash output workers write the messages read from Kafka to the daily index in the local Elasticsearch instance.

13.2.2.5 Steps 9 to 12

When an administrator who has access to the guest network accesses the Kibana client and makes a request, Apache forwards the request to the Kibana NodeJS server. Then the server uses the Elasticsearch REST API to service the client requests.

Step 9

An administrator who has access to the guest network accesses the Kibana client to view and search log data. The request can originate from the external network in the cloud through a tenant that has a pre-defined access route to the guest network.

Step 10

An administrator who has access to the guest network uses a web browser and points to the Kibana URL. This allows the user to search logs and view Dashboard reports.

Step 11

The authenticated request is forwarded to the Kibana NodeJS server to render the required dashboard, visualization, or search page.

Step 12

The Kibana NodeJS web server uses the Elasticsearch REST API in localhost to service the UI requests.

13.2.2.6 Steps 13 to 15

Log data is backed-up and deleted in the final steps.

Step 13

A daily cron job running in the ELK node runs curator to prune old Elasticsearch log indices.

Step 14

The curator configuration is done at the deployer node through the Ansible role logging-common. Curator is scripted to then prune or clone old indices based on this configuration.

Step 15

The audit logs must be backed up manually. For more information about Backup and Recovery, see Chapter 17, Backup and Restore.

13.2.2.7 How Long are Log Files Retained?

The logs that are centrally stored are saved to persistent storage as Elasticsearch indices. These indices are stored in the partition /var/lib/elasticsearch on each of the Elasticsearch cluster nodes. Out of the box, logs are stored in one Elasticsearch index per service. As more days go by, the number of indices stored in this disk partition grows. Eventually the partition fills up. If they are open, each of these indices takes up CPU and memory. If these indices are left unattended they will continue to consume system resources and eventually deplete them.

Elasticsearch, by itself, does not prevent this from happening.

SUSE OpenStack Cloud uses a tool called curator that is developed by the Elasticsearch community to handle these situations. SUSE OpenStack Cloud installs and uses a curator in conjunction with several configurable settings. This curator is called by cron and performs the following checks:

  • First Check. The hourly cron job checks to see if the currently used Elasticsearch partition size is over the value set in:

    curator_low_watermark_percent

    If it is higher than this value, the curator deletes old indices according to the value set in:

    curator_num_of_indices_to_keep
  • Second Check. Another check is made to verify if the partition size is below the high watermark percent. If it is still too high, curator will delete all indices except the current one that is over the size as set in:

    curator_max_index_size_in_gb
  • Third Check. A third check verifies if the partition size is still too high. If it is, curator will delete all indices except the current one.

  • Final Check. A final check verifies if the partition size is still high. If it is, an error message is written to the log file but the current index is NOT deleted.

In the case of an extreme network issue, log files can run out of disk space in under an hour. To avoid this, SUSE OpenStack Cloud uses a shell script called logrotate_if_needed.sh. The cron process runs this script every 5 minutes to see if the size of /var/log has exceeded the high_watermark_percent (95% of the disk, by default). If it is at or above this level, logrotate_if_needed.sh runs the logrotate script to rotate logs and free up extra space. This script helps to minimize the chance of running out of disk space on /var/log.
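
As a hedged example, you can check how close the log and Elasticsearch partitions are to their watermarks on a node before rotation or pruning is triggered:

ardana > df -h /var/log /var/lib/elasticsearch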

13.2.2.8 How Are Logs Rotated?

SUSE OpenStack Cloud uses the cron process which in turn calls Logrotate to provide rotation, compression, and removal of log files. Each log file can be rotated hourly, daily, weekly, or monthly. If no rotation period is set then the log file will only be rotated when it grows too large.

Rotating a file means that the Logrotate process creates a copy of the log file with a new extension, for example, the .1 extension, then empties the contents of the original file. If a .1 file already exists, then that file is first renamed with a .2 extension. If a .2 file already exists, it is renamed to .3, etc., up to the maximum number of rotated files specified in the settings file. When Logrotate reaches the last possible file extension, it will delete the last file first on the next rotation. By the time the Logrotate process needs to delete a file, the results will have been copied to Elasticsearch, the central logging database.

The log rotation setting files can be found in the following directory:

~/scratch/ansible/next/ardana/ansible/roles/logging-common/vars

These files allow you to set the following options:

Service

The name of the service that creates the log entries.

Rotated Log Files

List of log files to be rotated. These files are kept locally on the server and will continue to be rotated. If the file is also listed as Centrally Logged, it will also be copied to Elasticsearch.

Frequency

The timing of when the logs are rotated. Options include: hourly, daily, weekly, or monthly.

Max Size

The maximum file size the log can be before it is rotated out.

Rotation

The number of log files that are rotated.

Centrally Logged Files

These files will be indexed by Elasticsearch and will be available for searching in the Kibana user interface.

Only files that are listed in the Centrally Logged Files section are copied to Elasticsearch.

All of the variables for the Logrotate process are found in the following file:

~/scratch/ansible/next/ardana/ansible/roles/logging-ansible/logging-common/defaults/main.yml

Cron runs Logrotate hourly. Every 5 minutes another process called "logrotate_if_needed" is run, which uses a watermark value to determine whether the Logrotate process needs to be run. If the "high watermark" has been reached and the /var/log partition is more than 95% full (by default; this can be adjusted), then Logrotate will be run within 5 minutes.
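
If you want to confirm that these jobs are scheduled on a node, a minimal sketch is to search the cron configuration for them; the exact cron file locations may differ between releases:

ardana > sudo grep -r logrotate /etc/cron.d/ /etc/crontab 2>/dev/null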

13.2.2.9 Are Log Files Backed-Up To Elasticsearch?

While centralized logging is enabled out of the box, the backup of these logs is not. This is because Centralized Logging relies on the Elasticsearch FileSystem Repository plugin, which in turn requires shared disk partitions to be configured and accessible from each of the Elasticsearch nodes. Since there are multiple ways to set up a shared disk partition, SUSE OpenStack Cloud allows you to choose an approach that works best for your deployment before enabling the back-up of log files to Elasticsearch.

If you enable automatic back-up of centralized log files, then all the logs collected from the cloud nodes will be backed up to Elasticsearch. Every hour, on the management controller nodes where Elasticsearch is set up, a cron job runs to check whether Elasticsearch is running low on disk space. If it is, the job further checks whether the backup feature is enabled. If enabled, the cron job saves a snapshot of the Elasticsearch indices to the configured shared disk partition using curator. Next, the script starts deleting the oldest index and moves down from there, checking each time whether there is enough space for Elasticsearch. A check is also made to ensure that the backup runs only once a day.

For steps on how to enable automatic back-up, see Section 13.2.5, “Configuring Centralized Logging”.

13.2.3 Accessing Log Data

All logging data in SUSE OpenStack Cloud is managed by the Centralized Logging Service and can be viewed or analyzed by Kibana. Kibana is the only graphical interface provided with SUSE OpenStack Cloud to search or create a report from log data. Operations Console provides only a link to the Kibana Logging dashboard.

The two methods described below allow you to access the Kibana Logging dashboard to search log data.

To learn more about Kibana, read the Getting Started with Kibana guide.

13.2.3.1 Use the Operations Console Link

Operations Console allows you to access Kibana in the same tool that you use to manage the other SUSE OpenStack Cloud resources in your deployment. To use Operations Console, you must have the correct permissions.

To use Operations Console:

  1. In a browser, open the Operations Console.

  2. On the login page, enter the user name, and the Password, and then click LOG IN.

  3. On the Home/Central Dashboard page, click the menu represented by 3 horizontal lines (Three-Line Icon).

  4. From the menu that slides in on the left, select Home, and then select Logging.

  5. On the Home/Logging page, click View Logging Dashboard.

Important
Important

In SUSE OpenStack Cloud, Kibana usually runs on a different network than Operations Console. Due to this configuration, it is possible that using Operations Console to access Kibana will result in a “404 not found” error. This error only occurs if the user has access only to the public-facing network.

13.2.3.2 Using Kibana to Access Log Data

Kibana is an open-source, data-visualization plugin for Elasticsearch. Kibana provides visualization capabilities using the log content indexed on an Elasticsearch cluster. Users can create bar and pie charts, line and scatter plots, and maps using the data collected by SUSE OpenStack Cloud in the cloud log files.

Creating Kibana dashboards is beyond the scope of this document, but it is useful to know that the dashboards you create are JSON files, which you can modify, and that you can create new dashboards based on existing ones.

Note
Note

Kibana is client-server software. To operate properly, the browser must be able to access port 5601 on the control plane.

Field | Default Value | Description

user | kibana | Username that will be required for logging into the Kibana UI.

password | randomly generated | Password generated during installation that is used to log in to the Kibana UI.

13.2.3.3 Logging into Kibana

To log into Kibana to view data, you must make sure you have the required login configuration.

13.2.3.3.1 Verify Login Credentials

During the installation of Kibana, a password is automatically set and it is randomized. Therefore, unless an administrator has already changed it, you need to retrieve the default password from a file on the control plane node.

13.2.3.3.2 Find the Randomized Password
  1. To find the Kibana password, run:

    ardana > grep kibana ~/scratch/ansible/next/my_cloud/stage/internal/CloudModel.yaml

13.2.4 Managing the Centralized Logging Feature

No specific configuration tasks are required to use Centralized Logging, as it is enabled by default after installation. However, you can configure the individual components as needed for your environment.

13.2.4.1 How Do I Stop and Start the Logging Service?

Although you might not need to stop and start the logging service very often, you may need to if, for example, one of the logging services is not behaving as expected or not working.

You cannot enable or disable centralized logging across all services unless you stop all centralized logging. Instead, it is recommended that you enable or disable individual log files in the <service>-clr.yml files and then reconfigure logging. You would enable centralized logging for a file when you want to make sure you are able to monitor those logs in Kibana.
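
As a hedged example, you can list the per-service centralized logging definition files that contain these flags before editing them (this assumes the configuration layout used later in this section):

ardana > find ~/openstack/my_cloud/config/logging/vars/ -name "*-clr.yml"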

In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.

Important
Important

It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.

The steps in this section only impact centralized logging. Logrotate is an essential feature that keeps the service log files from filling the disk and will not be affected.

Important
Important

These playbooks must be run from the Cloud Lifecycle Manager.

To stop the Logging service:

  1. To change to the directory containing the ansible playbook, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To run the ansible playbook that will stop the logging service, run:

    ardana > ansible-playbook -i hosts/verb_hosts logging-stop.yml

To start the Logging service:

  1. To change to the directory containing the ansible playbook, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To run the ansible playbook that will start the logging service, run:

    ardana > ansible-playbook -i hosts/verb_hosts logging-start.yml

13.2.4.2 How Do I Enable or Disable Centralized Logging For a Service?

To enable or disable Centralized Logging for a service you need to modify the configuration for the service, set the enabled flag to true or false, and then reconfigure logging.

Important
Important

There are consequences if you enable too many logging files for a service. If there is not enough storage to support the increased logging, the retention period of logs in Elasticsearch is decreased. Alternatively, if you wanted to increase the retention period of log files or if you did not want those logs to show up in Kibana, you would disable centralized logging for a file.

To enable Centralized Logging for a service:

  1. Use the documentation provided with the service to ensure it is not configured for logging.

  2. To find the SUSE OpenStack Cloud file to edit, run:

    ardana > find ~/openstack/my_cloud/config/logging/vars/ -name "*service-name*"
  3. Edit the file for the service for which you want to enable logging.

  4. To enable Centralized Logging, find the following code and change the enabled flag to true; to disable it, change the enabled flag to false:

    logging_options:
     - centralized_logging:
            enabled: true
            format: json
  5. Save the changes to the file.

  6. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  7. To reconfigure logging, run:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml

13.2.5 Configuring Centralized Logging

You can adjust the settings for centralized logging when you are troubleshooting problems with a service or to decrease log size and retention to save on disk space. For steps on how to configure logging settings, refer to the following tasks:

13.2.5.1 Configuration Files

Centralized Logging settings are stored in the configuration files in the following directory on the Cloud Lifecycle Manager: ~/openstack/my_cloud/config/logging/

The configuration files and their use are described below:

File | Description

main.yml | Main configuration file for all centralized logging components.
elasticsearch.yml.j2 | Main configuration file for Elasticsearch.
elasticsearch-default.j2 | Default overrides for the Elasticsearch init script.
kibana.yml.j2 | Main configuration file for Kibana.
kibana-apache2.conf.j2 | Apache configuration file for Kibana.
logstash.conf.j2 | Logstash inputs/outputs configuration.
logstash-default.j2 | Default overrides for the Logstash init script.
beaver.conf.j2 | Main configuration file for Beaver.
vars | Path to the logrotate configuration files.

13.2.5.2 Planning Resource Requirements

The Centralized Logging service needs to have enough resources available to it to perform adequately for different scale environments. The base logging levels are tuned during installation according to the amount of RAM allocated to your control plane nodes to ensure optimum performance.

These values can be viewed and changed in the ~/openstack/my_cloud/config/logging/main.yml file, but you will need to run a reconfigure of the Centralized Logging service if changes are made.

Warning
Warning

The total process memory consumption for Elasticsearch will be the above allocated heap value (in ~/openstack/my_cloud/config/logging/main.yml) plus any Java Virtual Machine (JVM) overhead.

Setting Disk Size Requirements

In the entry-scale models, the disk partition sizes on your controller nodes for the logging and Elasticsearch data are set as a percentage of your total disk size. You can see these in the following file on the Cloud Lifecycle Manager (deployer): ~/openstack/my_cloud/definition/data/<controller_disk_files_used>

Sample file settings:

# Local Log files.
- name: log
  size: 13%
  mount: /var/log
  fstype: ext4
  mkfs-opts: -O large_file

# Data storage for centralized logging. This holds log entries from all
# servers in the cloud and hence can require a lot of disk space.
- name: elasticsearch
  size: 30%
  mount: /var/lib/elasticsearch
  fstype: ext4
Important
Important

The disk size is set automatically based on the hardware configuration. If you need to adjust it, you can set it manually with the following steps.

To set disk sizes:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/definition/data/disks.yml
  3. Make any desired changes.

  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the logging reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml

13.2.5.3 Backing Up Elasticsearch Log Indices

The log files that are centrally collected in SUSE OpenStack Cloud are stored by Elasticsearch on disk in the /var/lib/elasticsearch partition. However, this is distributed across each of the Elasticsearch cluster nodes as shards. A cron job runs periodically to see if the disk partition runs low on space, and, if so, it runs curator to delete the old log indices to make room for new logs. This deletion is permanent and the logs are lost forever. If you want to backup old logs, for example to comply with certain regulations, you can configure automatic backup of Elasticsearch indices.
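
Once backups are enabled, a minimal way to check what has been saved is to query the Elasticsearch snapshot API directly on a logging node; the port and repository name below are assumptions based on the default settings shown later in this section:

ardana > curl -s http://localhost:9200/_snapshot?pretty
ardana > curl -s http://localhost:9200/_snapshot/es_<cloud_name>/_all?pretty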

Important
Important

If you need to restore data that was archived prior to SUSE OpenStack Cloud 9 and used the older versions of Elasticsearch, then this data will need to be restored to a separate deployment of Elasticsearch.

This can be accomplished using the following steps:

  1. Deploy a separate distinct Elasticsearch instance version matching the version in SUSE OpenStack Cloud.

  2. Configure the backed-up data using NFS or some other share mechanism to be available to the Elasticsearch instance matching the version in SUSE OpenStack Cloud.

Before enabling automatic back-ups, make sure you understand how much disk space you will need, and configure the disks that will store the data. Use the following checklist to prepare your deployment for enabling automatic backups:

Add a shared disk partition to each of the Elasticsearch controller nodes.

The default partition name used for backup is

/var/lib/esbackup

You can change this by:

  1. Open the following file: my_cloud/config/logging/main.yml

  2. Edit the following variable: curator_es_backup_partition

Ensure the shared disk has enough storage to retain backups for the desired retention period.

To enable automatic back-up of centralized logs to Elasticsearch:

  1. Log in to the Cloud Lifecycle Manager (deployer node).

  2. Open the following file in a text editor:

    ~/openstack/my_cloud/config/logging/main.yml
  3. Find the following variables:

    curator_backup_repo_name: "es_{{host.my_dimensions.cloud_name}}"
    curator_es_backup_partition: /var/lib/esbackup
  4. To enable backup, change the curator_enable_backup value to true in the curator section:

    curator_enable_backup: true
  5. Save your changes and re-run the configuration processor:

    ardana > cd ~/openstack
    ardana > git add -A
    # Verify the added files
    ardana > git status
    ardana > git commit -m "Enabling Elasticsearch Backup"
    
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. To re-configure logging:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
  7. To verify that the indices are backed up, check the contents of the partition:

    ardana > ls /var/lib/esbackup

13.2.5.4 Restoring Logs From an Elasticsearch Backup

To restore logs from an Elasticsearch backup, see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-snapshots.html.

Note
Note

We do not recommend restoring to the original SUSE OpenStack Cloud Centralized Logging cluster, as it may cause storage and capacity issues. Instead, we recommend setting up a separate ELK cluster of the same version and restoring the logs there.
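
A minimal sketch of such a restore, run against the separate ELK cluster rather than the production logging cluster, uses the standard Elasticsearch snapshot API; the host, repository, and snapshot names are placeholders:

ardana > curl -XPOST 'http://separate-elk-host:9200/_snapshot/es_<cloud_name>/<snapshot_name>/_restore'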

13.2.5.5 Tuning Logging Parameters

When centralized logging is installed in SUSE OpenStack Cloud, the Elasticsearch and logstash heap sizes are automatically configured based on the amount of RAM on the system. These defaults are usually appropriate, but they may need to be adjusted if performance or disk space issues arise, or if hardware changes are made after installation.

These values are defined at the top of the file .../logging-common/defaults/main.yml. An example of the contents of the file is shown below:

# Select heap tunings based on system RAM
#-------------------------------------------------------------------------------
threshold_small_mb: 31000
threshold_medium_mb: 63000
threshold_large_mb: 127000
tuning_selector: " {% if ansible_memtotal_mb < threshold_small_mb|int %}
  demo
  {% elif ansible_memtotal_mb < threshold_medium_mb|int %}
  small
  {% elif ansible_memtotal_mb < threshold_large_mb|int %}
  medium
  {% else %}
  large
  {%endif %}
"

logging_possible_tunings:
  # RAM < 32GB
  demo:
    elasticsearch_heap_size: 512m
    logstash_heap_size: 512m
  # RAM < 64GB
  small:
    elasticsearch_heap_size: 8g
    logstash_heap_size: 2g
  # RAM < 128GB
  medium:
    elasticsearch_heap_size: 16g
    logstash_heap_size: 4g
  # RAM >= 128GB
  large:
    elasticsearch_heap_size: 31g
    logstash_heap_size: 8g
logging_tunings: "{{ logging_possible_tunings[tuning_selector] }}"

These thresholds define what counts as a small, medium, or large system in terms of memory. To determine which values will be used, check how much RAM your system has and which threshold range it falls into. To modify the values, you can either adjust the threshold values so that your system moves, for example, from the small configuration to the medium configuration, or keep the thresholds the same and modify the heap_size variables directly for the selector that applies to your system. For example, if your system uses the medium configuration, which sets the heap sizes to 16 GB for Elasticsearch and 4 GB for logstash, and you want twice as much memory set aside for logstash, increase the logstash value from 4 GB to 8 GB.
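To see which bracket applies to a given node, you can check its total RAM and compare it against the thresholds above; this is only a convenience check and does not change any configuration:

    # Print total system RAM in MB; compare against threshold_small_mb (31000),
    # threshold_medium_mb (63000), and threshold_large_mb (127000)
    ardana > awk '/MemTotal/ {printf "%d MB\n", $2/1024}' /proc/meminfo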

13.2.6 Configuring Settings for Other Services

When you configure settings for the Centralized Logging Service, those changes impact all services that are enabled for centralized logging. However, if you only need to change the logging configuration for one specific service, modify that service's files instead of changing the settings for the entire Centralized Logging service. The following topics describe how to do this.

13.2.6.1 Setting Logging Levels for Services

When it is necessary to increase the logging level for a specific service to troubleshoot an issue, or to decrease logging levels to save disk space, you can edit the service's config file and then reconfigure logging. All changes will be made to the service's files and not to the Centralized Logging service files.

Messages only appear in the log files if they are the same as, or more severe than, the log level you set. The DEBUG level logs everything. Most services default to the INFO logging level, which records informational events as well as warnings, errors, and critical errors. Some services provide additional logging options that narrow the focus, for example to help you debug an issue or to warn you when an operation fails or when there is a serious problem with the cloud.

For more information on logging levels, see the OpenStack Logging Guidelines documentation.

13.2.6.2 Configuring the Logging Level for a Service

If you want to increase or decrease the amount of details that are logged by a service, you can change the current logging level in the configuration files. Most services support, at a minimum, the DEBUG and INFO logging levels. For more information about what levels are supported by a service, check the documentation or Website for the specific service.

13.2.6.3 Barbican

Service: barbican
Sub-components: barbican-api, barbican-worker
Supported logging levels: INFO (default), DEBUG

To change the barbican logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/
    ardana > vi my_cloud/config/barbican/barbican_deploy_config.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    barbican_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    barbican_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts barbican-reconfigure.yml

13.2.6.4 Block Storage (cinder)

Service: cinder
Sub-components: cinder-api, cinder-scheduler, cinder-backup, cinder-volume
Supported logging levels: INFO (default), DEBUG

To manage cinder logging:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/ardana/ansible
    ardana > vi roles/_CND-CMN/defaults/main.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    cinder_loglevel: {{ ardana_loglevel | default('INFO') }}
    cinder_logstash_loglevel: {{ ardana_loglevel | default('INFO') }}
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml

13.2.6.5 Ceilometer

Service: ceilometer
Sub-components: ceilometer-collector, ceilometer-agent-notification, ceilometer-polling, ceilometer-expirer
Supported logging levels: INFO (default), DEBUG

To change the ceilometer logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/ardana/ansible
    ardana > vi roles/_CEI-CMN/defaults/main.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    ceilometer_loglevel:  INFO
    ceilometer_logstash_loglevel:  INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml

13.2.6.6 Compute (nova)

Service: nova
Supported logging levels: INFO (default), DEBUG

To change the nova logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. The nova service component logging can be changed by modifying the following files:

    ~/openstack/my_cloud/config/nova/novncproxy-logging.conf.j2
    ~/openstack/my_cloud/config/nova/api-logging.conf.j2
    ~/openstack/my_cloud/config/nova/compute-logging.conf.j2
    ~/openstack/my_cloud/config/nova/conductor-logging.conf.j2
    ~/openstack/my_cloud/config/nova/scheduler-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml

13.2.6.7 Designate (DNS)

Service: designate
Sub-components: designate-api, designate-central, designate-mdns, designate-producer, designate-worker, designate-pool-manager, designate-zone-manager
Supported logging levels: INFO (default), DEBUG

To change the designate logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/
    ardana > vi my_cloud/config/designate/designate.conf.j2
  3. To change the logging level, set the value of the following line (debug = True enables DEBUG-level logging; debug = False keeps the default INFO level):

    debug = False
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts designate-reconfigure.yml

13.2.6.8 Identity (keystone)

Service: keystone
Sub-component: keystone
Supported logging levels: INFO (default), DEBUG, WARN, ERROR

To change the keystone logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    keystone_loglevel: INFO
    keystone_logstash_loglevel: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

13.2.6.9 Image (glance)

Service: glance
Sub-component: glance-api
Supported logging levels: INFO (default), DEBUG

To change the glance logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/config/glance/glance-api-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml

13.2.6.10 Bare Metal (ironic)

Service: ironic
Sub-components: ironic-api, ironic-conductor
Supported logging levels: INFO (default), DEBUG

To change the ironic logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ardana > cd ~/openstack/ardana/ansible
    ardana > vi roles/ironic-common/defaults/main.yml
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    ironic_api_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    ironic_api_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    ironic_conductor_loglevel: "{{ ardana_loglevel | default('INFO') }}"
    ironic_conductor_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ironic-reconfigure.yml

13.2.6.11 Monitoring (monasca)

Service: monasca
Sub-components: monasca-persister, zookeeper, storm, monasca-notification, monasca-api, kafka, monasca-agent
Supported logging levels: WARN (default), INFO

To change the monasca logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Monitoring service component logging can be changed by modifying the following files:

    ~/openstack/ardana/ansible/roles/monasca-persister/defaults/main.yml
    ~/openstack/ardana/ansible/roles/storm/defaults/main.yml
    ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
    ~/openstack/ardana/ansible/roles/monasca-api/defaults/main.yml
    ~/openstack/ardana/ansible/roles/kafka/defaults/main.yml
    ~/openstack/ardana/ansible/roles/monasca-agent/defaults/main.yml (For this file, you will need to add the variable)
  3. To change the logging level, use ALL CAPS to set the desired level in the following line:

    monasca_log_level: WARN
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml

13.2.6.12 Networking (neutron)

Service: neutron
Sub-components: neutron-server, dhcp-agent, l3-agent, metadata-agent, openvswitch-agent, ovsvapp-agent, sriov-agent, infoblox-ipam-agent, l2gateway-agent
Supported logging levels: INFO (default), DEBUG

To change the neutron logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. The neutron service component logging can be changed by modifying the following files:

    ~/openstack/ardana/ansible/roles/neutron-common/templates/dhcp-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/infoblox-ipam-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/l2gateway-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/l3-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/metadata-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/openvswitch-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
    ~/openstack/ardana/ansible/roles/neutron-common/templates/sriov-agent-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml

13.2.6.13 Object Storage (swift)

Service: swift
Supported logging levels: INFO (default), DEBUG

Note

Currently it is not recommended to log at any level other than INFO.

13.2.6.14 Octavia

Service: octavia
Sub-components: octavia-api, octavia-worker, octavia-hk, octavia-hm
Supported logging levels: INFO (default), DEBUG

To change the Octavia logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. The Octavia service component logging can be changed by modifying the following files:

    ~/openstack/my_cloud/config/octavia/octavia-api-logging.conf.j2
    ~/openstack/my_cloud/config/octavia/octavia-worker-logging.conf.j2
    ~/openstack/my_cloud/config/octavia/octavia-hk-logging.conf.j2
    ~/openstack/my_cloud/config/octavia/octavia-hm-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml

13.2.6.15 Operations Console

Service: opsconsole
Sub-components: ops-web, ops-mon
Supported logging levels: INFO (default), DEBUG

To change the Operations Console logging level:

  1. Log in to the Cloud Lifecycle Manager.

  2. Open the following file:

    ~/openstack/ardana/ansible/roles/OPS-WEB/defaults/main.yml
  3. To change the logging level, use ALL CAPS to set the desired level in the following line:

    ops_console_loglevel: "{{ ardana_loglevel | default('INFO') }}"
  4. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  5. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ops-console-reconfigure.yml

13.2.6.16 Orchestration (heat)

Service: heat
Sub-components: api-cfn, api, engine
Supported logging levels: INFO (default), DEBUG

To change the heat logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following files:

    ~/openstack/my_cloud/config/heat/*-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml

13.2.6.17 Magnum

Service: magnum
Sub-components: api, conductor
Supported logging levels: INFO (default), DEBUG

To change the Magnum logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following files:

    ~/openstack/my_cloud/config/magnum/api-logging.conf.j2
    ~/openstack/my_cloud/config/magnum/conductor-logging.conf.j2
  3. The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:

    level: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts magnum-reconfigure.yml

13.2.6.18 File Storage (manila)

Service: manila
Sub-component: api
Supported logging levels: INFO (default), DEBUG

To change the manila logging level:

  1. Log in to the Cloud Lifecycle Manager (deployer).

  2. Open the following file:

    ~/openstack/my_cloud/config/manila/manila-logging.conf.j2
  3. To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

    manila_loglevel: INFO
    manila_logstash_loglevel: INFO
  4. Save the changes to the file.

  5. To commit the changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  6. To run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. To create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. To run the reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts manila-reconfigure.yml

13.2.6.19 Selecting Files for Centralized Logging

As you use SUSE OpenStack Cloud, you might find a need to redefine which log files are rotated on disk or transferred to centralized logging. These changes are all made in the centralized logging definition files.

SUSE OpenStack Cloud uses the logrotate service to provide rotation, compression, and removal of log files. All of the tunable variables for the logrotate process itself can be controlled in the following file: ~/openstack/ardana/ansible/roles/logging-common/defaults/main.yml

You can find the centralized logging definition files for each service in the following directory: ~/openstack/ardana/ansible/roles/logging-common/vars

You can change log settings for a service by following these steps.

  1. Log in to the Cloud Lifecycle Manager.

    Open the *.yml file for the service or sub-component that you want to modify.

    Using keystone, the Identity service as an example:

    ardana > vi ~/openstack/ardana/ansible/roles/logging-common/vars/keystone-clr.yml

    Consider the opening clause of the file:

    sub_service:
      hosts: KEY-API
      name: keystone
      service: keystone

    The hosts setting defines the role which will trigger this logrotate definition being applied to a particular host. It can use regular expressions for pattern matching, for example NEU-.*.

    The service setting identifies the high-level service name associated with this content, which will be used for determining log files' collective quotas for storage on disk.

  2. Verify logging is enabled by locating the following lines:

    centralized_logging:
      enabled: true
      format: rawjson
    Note

    When possible, centralized logging is most effective on log files generated using logstash-formatted JSON. These files should specify format: rawjson. When only plaintext log files are available, format: json is appropriate. (This will cause their plaintext log lines to be wrapped in a json envelope before being sent to centralized logging storage.)

  3. Observe log files selected for rotation:

    - files:
      - /var/log/keystone/keystone.log
      log_rotate:
      - daily
      - maxsize 300M
      - rotate 7
      - compress
      - missingok
      - notifempty
      - copytruncate
      - create 640 keystone adm
    Note

    With the introduction of dynamic log rotation, the frequency (that is, daily) and file size threshold (that is, maxsize) settings no longer have any effect. The rotate setting may be easily overridden on a service-by-service basis.

  4. Commit any changes to your local git repository:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Create a deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the logging reconfigure playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml

13.2.6.20 Controlling Disk Space Allocation and Retention of Log Files

Each service is assigned a weighted allocation of the /var/log filesystem's capacity. When all its log files' cumulative sizes exceed this allocation, a rotation is triggered for that service's log files according to the behavior specified in the /etc/logrotate.d/* specification.

These specification files are auto-generated based on YML sources delivered with the Cloud Lifecycle Manager codebase. The source files can be edited and reapplied to control the allocation of disk space across services or the behavior during a rotation.

Disk capacity is allocated as a percentage of the total weighted value of all services running on a particular node. For example, if 20 services run on the same node, each with the default weight of 100, each service is granted 1/20th of the log filesystem's capacity. If the configuration is updated to change one service's weight to 150, all allocations are recalculated against the new total weight (2050): that service then receives 150/2050 (about 7.3%) of the capacity, 1.5 times the share of each remaining weight-100 service (about 4.9%).

These policies are enforced by the script /opt/kronos/rotate_if_exceeded_quota.py, which will be executed every 5 minutes via a cron job and will rotate the log files of any services which have exceeded their respective quotas. When log rotation takes place for a service, logs are generated to describe the activity in /var/log/kronos/check_if_exceeded_quota.log.

When logrotate is performed on a service, its existing log files are compressed and archived to make space available for fresh log entries. Once the number of archived log files exceeds the service's retention threshold, the oldest files are deleted. Longer retention thresholds (for example, 10 to 15) therefore use more of the service's allocated log capacity for historic logs, while shorter retention thresholds (for example, 1 to 5) keep more space available for its active plaintext log files.

Use the following process to make adjustments to services' log capacity allocations or retention thresholds:

  1. Navigate to the following directory on your Cloud Lifecycle Manager:

    ~/stack/scratch/ansible/next/ardana/ansible
  2. Open and edit the service weights file:

    ardana > vi roles/kronos-logrotation/vars/rotation_config.yml
  3. Edit the service parameters to set the desired parameters. Example:

    cinder:
      weight: 300
      retention: 2
    Note

    A retention setting of default uses the recommended defaults for each service's log files.

  4. Run the kronos-logrotation-deploy playbook:

    ardana > ansible-playbook -i hosts/verb_hosts kronos-logrotation-deploy.yml
  5. Verify that the quota changes have taken effect:

    Log in to a node and check the contents of the file /opt/kronos/service_info.yml to see the active quotas for that node, and the specifications in /etc/logrotate.d/* for the rotation thresholds, as illustrated in the sketch below.
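    As a sketch of that verification (run on a controller node; the exact set of files under /etc/logrotate.d/ depends on the services deployed there):

    # Active per-service quotas computed for this node
    ardana > sudo cat /opt/kronos/service_info.yml
    # Generated logrotate specifications, including rotation thresholds
    ardana > ls /etc/logrotate.d/
    ardana > sudo cat /etc/logrotate.d/*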

13.2.6.21 Configuring Elasticsearch for Centralized Logging

Elasticsearch includes some tunable options exposed in its configuration. SUSE OpenStack Cloud uses these options in Elasticsearch to prioritize indexing speed over search speed. SUSE OpenStack Cloud also configures Elasticsearch for optimal performance in low RAM environments. The options that SUSE OpenStack Cloud modifies are listed below along with an explanation about why they were modified.

These configurations are defined in the ~/openstack/my_cloud/config/logging/main.yml file and are implemented in the Elasticsearch configuration file ~/openstack/my_cloud/config/logging/elasticsearch.yml.j2.

13.2.6.22 Safeguards for the Log Partitions Disk Capacity

Because the logging partitions are at high risk of filling up over time, which can cause many negative side effects for running services, it is important to safeguard against log files consuming 100% of the available capacity.

This protection is implemented by pairs of low/high watermark thresholds, with values established in ~/stack/scratch/ansible/next/ardana/ansible/roles/logging-common/defaults/main.yml and applied by the kronos-logrotation-deploy playbook. The thresholds are listed below, followed by a quick way to check current partition usage.

  • var_log_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/log partition beyond which alarms will be triggered (visible to administrators in monasca).

  • var_log_high_watermark_percent (default: 95) defines how much capacity of the /var/log partition to make available for log rotation (in calculating weighted service allocations).

  • var_audit_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/audit partition beyond which alarm notifications will be triggered.

  • var_audit_high_watermark_percent (default: 95) sets a capacity level for the contents of the /var/audit partition which will cause log rotation to be forced according to the specification in /etc/auditlogrotate.conf.
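To see how close these partitions currently are to the configured watermarks on a given node, a simple disk usage check is enough; this is only an observation aid and does not change the thresholds (/var/audit exists only if a separate audit volume has been configured):

    # Compare current utilization of the log and audit partitions with the
    # low/high watermark percentages listed above
    ardana > df -h /var/log /var/audit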

13.2.7 Audit Logging Overview

Existing OpenStack service logging varies widely across services. Generally, log messages do not contain enough detail about who is requesting the application programming interface (API), or enough context-specific detail about the action performed. Details are often not logged consistently across services, leading to inconsistent data formats. These issues make it difficult to integrate logging with existing audit tools and processes.

To help you monitor your workload and data in compliance with your corporate, industry or regional policies, SUSE OpenStack Cloud provides auditing support as a basic security feature. The audit logging can be integrated with customer Security Information and Event Management (SIEM) tools and support your efforts to correlate threat forensics.

The SUSE OpenStack Cloud audit logging feature uses Audit Middleware for Python services. This middleware works with OpenStack services that use the Paste Deploy system; most OpenStack services use the paste deploy mechanism to find and configure WSGI servers and applications. Using the paste deploy system allows auditing support to be added to services with minimal changes.

Audit logging is a post-installation feature and is disabled by default in the cloudConfig file on the Cloud Lifecycle Manager. It can only be enabled after SUSE OpenStack Cloud installation or upgrade.

The tasks in this section explain how to enable services for audit logging in your environment. SUSE OpenStack Cloud provides audit logging for the following services:

  • nova

  • barbican

  • keystone

  • cinder

  • ceilometer

  • neutron

  • glance

  • heat

For audit log backup information see Section 17.3.4, “Audit Log Backup and Restore”

13.2.7.1 Audit Logging Checklist

Before enabling audit logging, make sure you understand how much disk space you will need, and configure the disks that will store the logging data. Use the following sections to complete these tasks:

13.2.7.1.1 Frequently Asked Questions
How are audit logs generated?

The audit logs are created by services running in the cloud management controller nodes. The events that create auditing entries are formatted using a structure that is compliant with Cloud Auditing Data Federation (CADF) policies. The formatted audit entries are then saved to disk files. For more information, see the Cloud Auditing Data Federation Website.

Where are audit logs stored?

We strongly recommend adding a dedicated disk volume for /var/audit.

If the disk templates for the controllers are not updated to create a separate volume for /var/audit, the audit logs will still be created in the root partition under the folder /var/audit. This could be problematic if the root partition does not have adequate space to hold the audit logs.

Warning

We recommend that you do not store audit logs in the /var/log volume. The /var/log volume is used for storing operational logs, and log rotation and alarms have been preconfigured for various services based on the size of this volume. Adding audit logs there may affect these, causing undesired alarms, and would also impact the retention times for the operational logs.

Are audit logs centrally stored?

Yes. The existing operational log profiles have been configured to centrally log audit logs as well, once their generation has been enabled. The audit logs are stored in Elasticsearch indices separate from those used for the operational logs.

How long are audit log files retained?

By default, audit logs are configured to be retained for 7 days on disk. The audit logs are rotated each day and the rotated files are stored in a compressed format and retained up to 7 days (configurable). The backup service has been configured to back up the audit logs to a location outside of the controller nodes for much longer retention periods.

Do I lose audit data if a management controller node goes down?

Yes. For this reason, it is strongly recommended that you back up the audit partition in each of the management controller nodes for protection against any data loss.

13.2.7.1.2 Estimate Disk Size

The table below provides estimates from each service of audit log size generated per day. The estimates are provided for environments with 100 nodes, 300 nodes, and 500 nodes.

Service | Log File Size: 100 nodes | Log File Size: 300 nodes | Log File Size: 500 nodes
barbican | 2.6 MB | 4.2 MB | 5.6 MB
keystone | 96 - 131 MB | 288 - 394 MB | 480 - 657 MB
nova | 186 MB (with a margin of 46 MB) | 557 MB (with a margin of 139 MB) | 928 MB (with a margin of 232 MB)
ceilometer | 12 MB | 12 MB | 12 MB
cinder | 2 - 250 MB | 2 - 250 MB | 2 - 250 MB
neutron | 145 MB | 433 MB | 722 MB
glance | 20 MB (with a margin of 8 MB) | 60 MB (with a margin of 22 MB) | 100 MB (with a margin of 36 MB)
heat | 432 MB (1 transaction per second) | 432 MB (1 transaction per second) | 432 MB (1 transaction per second)
swift | 33 GB (700 transactions per second) | 102 GB (2100 transactions per second) | 172 GB (3500 transactions per second)
13.2.7.1.3 Add disks to the controller nodes

You need to add disks for the audit log partition to store the data in a secure manner. The steps to complete this task will vary depending on the type of server you are running. Please refer to the manufacturer’s instructions on how to add disks for the type of server node used by the management controller cluster. If you already have extra disks in the controller node, you can identify any unused one and use it for the audit log partition.
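For example, to identify an unused disk on a controller node before assigning it to the audit volume group, you can list the block devices and look for a disk with no partitions and no mount point; device names such as /dev/sdz are illustrative only:

    # A disk with no partitions and no mount point is a candidate for the
    # audit-vg physical volume described in the next section
    ardana > lsblk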

13.2.7.1.4 Update the disk template for the controller nodes

Since audit logging is disabled by default, the audit volume groups in the disk templates are commented out. If you want to turn on audit logging, update the template first; otherwise no audit volume group will be created. To update the disk template, copy the templates from the examples folder to the definition folder and then edit the disk controller settings. Changes to the disk template used for provisioning cloud nodes must be made before deploying the nodes.

To update the disk controller template:

  1. Log in to your Cloud Lifecycle Manager.

  2. To copy the example templates folder, run the following command:

    Important

    If you already have the required templates in the definition folder, you can skip this step.

    ardana > cp -r ~/openstack/examples/entry-scale-esx/* ~/openstack/my_cloud/definition/
  3. To change to the data folder, run:

    ardana > cd ~/openstack/my_cloud/definition/
  4. To edit the disks controller settings, open the file that matches your server model and disk model in a text editor:

    Model: entry-scale-kvm
    Files: disks_controller_1TB.yml, disks_controller_600GB.yml

    Model: mid-scale
    Files: disks_compute.yml, disks_control_common_600GB.yml, disks_dbmq_600GB.yml, disks_mtrmon_2TB.yml, disks_mtrmon_4.5TB.yml, disks_mtrmon_600GB.yml, disks_swobj.yml, disks_swpac.yml
  5. To update the settings and enable an audit log volume group, edit the appropriate file(s) listed above and remove the '#' comments from these lines, confirming that they are appropriate for your environment.

    - name: audit-vg
      physical-volumes:
        - /dev/sdz
      logical-volumes:
        - name: audit
          size: 95%
          mount: /var/audit
          fstype: ext4
          mkfs-opts: -O large_file
13.2.7.1.5 Save your changes

To save your changes, add the updated disk setup files to the git repository.

To save your changes:

  1. To change to the openstack directory, run:

    ardana > cd ~/openstack
  2. To add the new and updated files, run:

    ardana > git add -A
  3. To verify the files are added, run:

    ardana > git status
  4. To commit your changes, run:

    ardana > git commit -m "Setup disks for audit logging"

13.2.7.2 Enable Audit Logging

To enable audit logging you must edit your cloud configuration settings, save your changes and re-run the configuration processor. Then you can run the playbooks to create the volume groups and configure them.

In the ~/openstack/my_cloud/definition/cloudConfig.yml file, service names defined under enabled-services or disabled-services override the default setting.

The following is an example of your audit-settings section:

# Disc space needs to be allocated to the audit directory before enabling
# auditing.
# Default can be either "disabled" or "enabled". Services listed in
# "enabled-services" and "disabled-services" override the default setting.
audit-settings:
   default: disabled
   #enabled-services:
   #  - keystone
   #  - barbican
   disabled-services:
     - nova
     - barbican
     - keystone
     - cinder
     - ceilometer
     - neutron

In this example, although the default setting for all services is disabled, keystone and barbican can be explicitly enabled by removing the comments from those lines; that setting overrides the default.

13.2.7.2.1 To edit the configuration file:
  1. Log in to your Cloud Lifecycle Manager.

  2. To change to the cloud definition folder, run:

    ardana > cd ~/openstack/my_cloud/definition
  3. To edit the auditing settings, in a text editor, open the following file:

    cloudConfig.yml
  4. To enable audit logging, begin by uncommenting the enabled-services: block, then list under it any service you want to enable for audit logging.

    For example, keystone has been enabled in the following text:

    Default cloudConfig.yml file:

    audit-settings:
       default: disabled
       enabled-services:
       #  - keystone

    Enabling keystone audit logging:

    audit-settings:
       default: disabled
       enabled-services:
         - keystone
  5. To move the services you want to enable, comment out the service in the disabled section and add it to the enabled section. For example, barbican has been enabled in the following text:

    cloudConfig.yml file:

    audit-settings:
       default: disabled
       enabled-services:
         - keystone
       disabled-services:
         - nova
         # - keystone
         - barbican
         - cinder

    Enabling barbican audit logging:

    audit-settings:
       default: disabled
       enabled-services:
         - keystone
         - barbican
       disabled-services:
         - nova
         # - barbican
         # - keystone
         - cinder
13.2.7.2.2 To save your changes and run the configuration processor:
  1. To change to the openstack directory, run:

    ardana > cd ~/openstack
  2. To add the new and updated files, run:

    ardana > git add -A
  3. To verify the files are added, run:

    ardana > git status
  4. To commit your changes, run:

    ardana > git commit -m "Enable audit logging"
  5. To change to the directory with the ansible playbooks, run:

    ardana > cd ~/openstack/ardana/ansible
  6. To rerun the configuration processor, run:

    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
13.2.7.2.3 To create the volume group:
  1. To change to the directory containing the osconfig playbook, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To remove the stub file that osconfig uses to decide if the disks are already configured, run:

    ardana > ansible -i hosts/verb_hosts KEY-API -a 'sudo rm -f /etc/openstack/osconfig-ran'
    Important

    The osconfig playbook uses the stub file to mark already-configured disks as "idempotent." To stop osconfig from treating your new disk as already configured, you must remove the stub file /etc/openstack/osconfig-ran before re-running the osconfig playbook.

  3. To run the playbook that enables auditing for a service, run:

    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API
    Important

    The variable KEY-API is used as an example to cover the management controller cluster. To enable auditing for a service that is not run on the same cluster, add the service to the --limit flag in the above command. For example:

    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API:NEU-SVR
13.2.7.2.4 To Reconfigure services for audit logging:
  1. To change to the directory containing the service playbooks, run:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
  2. To run the playbook that reconfigures a service for audit logging, run:

    ardana > ansible-playbook -i hosts/verb_hosts SERVICE_NAME-reconfigure.yml

    For example, to reconfigure keystone for audit logging, run:

    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
  3. Repeat steps 1 and 2 for each service you need to reconfigure.

    Important

    You must reconfigure each service that you changed to be enabled or disabled in the cloudConfig.yml file.

13.2.8 Troubleshooting

For information on troubleshooting Central Logging, see Section 18.7.1, “Troubleshooting Centralized Logging”.

13.3 Metering Service (ceilometer) Overview

The SUSE OpenStack Cloud metering service collects and provides access to OpenStack usage data that can be used for billing reporting such as showback and chargeback. The metering service can also provide general usage reporting. ceilometer acts as the central collection and data access service to the meters provided by all the OpenStack services. The data collected is available through the monasca API. ceilometer V2 API was deprecated in the Pike release upstream.

13.3.1 Metering Service New Functionality

13.3.1.1 New Metering Functionality in SUSE OpenStack Cloud 9

  • ceilometer is now integrated with monasca, using it as the datastore.

  • The default meters and other items configured for ceilometer can now be modified and additional meters can be added. We recommend that users test overall SUSE OpenStack Cloud performance prior to deploying any ceilometer modifications to ensure the addition of new notifications or polling events does not negatively affect overall system performance.

  • ceilometer Central Agent (pollster) is now called Polling Agent and is configured to support HA (Active-Active).

  • Notification Agent has built-in HA (Active-Active) with support for pipeline transformers, but workload partitioning has been disabled in SUSE OpenStack Cloud.

  • swift poll-based account-level meters are enabled by default with an hourly collection cycle.

  • Integration with centralized monitoring (monasca) and centralized logging

  • Support for upgrade and reconfigure operations

13.3.1.2 Limitations

  • The number of metadata attributes that can be extracted from resource_metadata has a maximum of 16. This is the number of fields in the metadata section of the monasca_field_definitions.yaml file for any service, that is, the combined number of fields in the metadata.common and metadata.<service.meters> sections. The total number of these fields cannot be more than 16.

  • Several network-related attributes are accessible using a colon ":" but are returned as a period ".". For example, you can access a sample list using the following command:

    ardana > source ~/service.osrc
    ardana > ceilometer --debug sample-list network -q "resource_id=421d50a5-156e-4cb9-b404-d2ce5f32f18b;resource_metadata.provider.network_type=flat"

    However, in response you will see the following:

    provider.network_type

    instead of

    provider:network_type

    This limitation is known for the following attributes:

    provider:network_type
    provider:physical_network
    provider:segmentation_id
  • ceilometer Expirer is not supported. Data retention expiration is handled by monasca with a default retention period of 45 days.

  • ceilometer Collector is not supported.

13.3.2 Understanding the Metering Service Concepts

13.3.2.1 Ceilometer Introduction

Before configuring the ceilometer Metering Service, it is important to understand how it works.

13.3.2.1.1 Metering Architecture

SUSE OpenStack Cloud automatically configures ceilometer to use Logging and Monitoring Service (monasca) as its backend. ceilometer is deployed on the same control plane nodes as monasca.

The installation of ceilometer creates several management nodes running different metering components.

ceilometer Components on Controller nodes

This controller node is the first of the highly available (HA) cluster.

ceilometer Sample Polling

Sample Polling is part of the Polling Agent. Messages are posted by the Notification Agent directly to the monasca API.

ceilometer Polling Agent

The Polling Agent is responsible for coordinating the polling activity. It parses the pipeline.yml configuration file and identifies all the sources that need to be polled. The sources are then evaluated using a discovery mechanism and all the sources are translated to resources where a dedicated pollster can retrieve and publish data. At each identified interval the discovery mechanism is triggered, the resource list is composed, and the data is polled and sent to the queue.

ceilometer Collector No Longer Required

In previous versions, the collector was responsible for getting the samples/events from the RabbitMQ service and storing them in the main database. The ceilometer Collector is no longer enabled. Because the Notification Agent now posts the data directly to the monasca API, the collector is no longer required.

13.3.2.1.2 Meter Reference

ceilometer collects basic information grouped into categories known as meters. A meter is the unique resource-usage measurement of a particular OpenStack service. Each OpenStack service defines what type of data is exposed for metering.

Each meter has the following characteristics:

Attribute | Description
Name | Description of the meter
Unit of Measurement | The method by which the data is measured. For example: storage meters are defined in Gigabytes (GB) and network bandwidth is measured in Gigabits (Gb).
Type | The origin of the meter's data. OpenStack defines the following origins: Cumulative - increasing over time (instance hours); Gauge - a discrete value, for example the number of floating IP addresses or image uploads; Delta - changing over time (bandwidth)

A meter is defined for every measurable resource. A meter can exist beyond the actual existence of a particular resource, such as an active instance, to provision long-cycle use cases such as billing.

Important

For a list of meter types and default meters installed with SUSE OpenStack Cloud, see Section 13.3.3, “Ceilometer Metering Available Meter Types”

The most common meter submission method is notifications. With this method, each service sends the data from their respective meters on a periodic basis to a common notifications bus.

ceilometer, in turn, pulls all of the events from the bus and saves the notifications in a ceilometer-specific database. The period of time that the data is collected and saved is known as the ceilometer expiry and is configured during ceilometer installation. Each meter is collected from one or more samples, gathered from the messaging queue or polled by agents. The samples are represented by counter objects. Each counter has the following fields:

Attribute | Description
counter_name | Description of the counter
counter_unit | The method by which the data is measured. For example: data can be defined in Gigabytes (GB) or, for network bandwidth, measured in Gigabits (Gb).
counter_type | The origin of the counter's data. OpenStack defines the following origins: Cumulative - increasing over time (instance hours); Gauge - a discrete value, for example the number of floating IP addresses or image uploads; Delta - changing over time (bandwidth)
counter_volume | The volume of data measured (CPU ticks, bytes transmitted, etc.). Not used for gauge counters. Set to a default value such as 1.
resource_id | The identifier of the resource measured (UUID)
project_id | The project (tenant) ID to which the resource belongs.
user_id | The ID of the user who owns the resource.
resource_metadata | Other data transmitted in the metering notification payload.

13.3.3 Ceilometer Metering Available Meter Types

The Metering service contains three types of meters:

Cumulative

A cumulative meter measures data over time (for example, instance hours).

Gauge

A gauge measures discrete items (for example, floating IPs or image uploads) or fluctuating values (such as disk input or output).

Delta

A delta measures change over time, for example, monitoring bandwidth.

Each meter is populated from one or more samples, which are gathered from the messaging queue (listening agent), polling agents, or push agents. Samples are populated by counter objects.

Each counter contains the following fields:

name

the name of the meter

type

the type of meter (cumulative, gauge, or delta)

amount

the amount of data measured

unit

the unit of measure

resource

the resource being measured

project ID

the project the resource is assigned to

user

the user the resource is assigned to.

Note: The metering service shares the same high-availability proxy, messaging, and database clusters with the other Information services. To avoid unnecessarily high loads, see Section 13.3.8, “Optimizing the Ceilometer Metering Service”.

13.3.3.1 SUSE OpenStack Cloud Default Meters

These meters are installed and enabled by default during an SUSE OpenStack Cloud installation. More information about ceilometer can be found at OpenStack ceilometer.

13.3.3.2 Compute (nova) Meters

Meter | Type | Unit | Resource | Origin | Note
vcpus | Gauge | vcpu | Instance ID | Notification | Number of virtual CPUs allocated to the instance
memory | Gauge | MB | Instance ID | Notification | Volume of RAM allocated to the instance
memory.resident | Gauge | MB | Instance ID | Pollster | Volume of RAM used by the instance on the physical machine
memory.usage | Gauge | MB | Instance ID | Pollster | Volume of RAM used by the instance from the amount of its allocated memory
cpu | Cumulative | ns | Instance ID | Pollster | CPU time used
cpu_util | Gauge | % | Instance ID | Pollster | Average CPU utilization
disk.read.requests | Cumulative | request | Instance ID | Pollster | Number of read requests
disk.read.requests.rate | Gauge | request/s | Instance ID | Pollster | Average rate of read requests
disk.write.requests | Cumulative | request | Instance ID | Pollster | Number of write requests
disk.write.requests.rate | Gauge | request/s | Instance ID | Pollster | Average rate of write requests
disk.read.bytes | Cumulative | B | Instance ID | Pollster | Volume of reads
disk.read.bytes.rate | Gauge | B/s | Instance ID | Pollster | Average rate of reads
disk.write.bytes | Cumulative | B | Instance ID | Pollster | Volume of writes
disk.write.bytes.rate | Gauge | B/s | Instance ID | Pollster | Average rate of writes
disk.root.size | Gauge | GB | Instance ID | Notification | Size of root disk
disk.ephemeral.size | Gauge | GB | Instance ID | Notification | Size of ephemeral disk
disk.device.read.requests | Cumulative | request | Disk ID | Pollster | Number of read requests
disk.device.read.requests.rate | Gauge | request/s | Disk ID | Pollster | Average rate of read requests
disk.device.write.requests | Cumulative | request | Disk ID | Pollster | Number of write requests
disk.device.write.requests.rate | Gauge | request/s | Disk ID | Pollster | Average rate of write requests
disk.device.read.bytes | Cumulative | B | Disk ID | Pollster | Volume of reads
disk.device.read.bytes.rate | Gauge | B/s | Disk ID | Pollster | Average rate of reads
disk.device.write.bytes | Cumulative | B | Disk ID | Pollster | Volume of writes
disk.device.write.bytes.rate | Gauge | B/s | Disk ID | Pollster | Average rate of writes
disk.capacity | Gauge | B | Instance ID | Pollster | The amount of disk that the instance can see
disk.allocation | Gauge | B | Instance ID | Pollster | The amount of disk occupied by the instance on the host machine
disk.usage | Gauge | B | Instance ID | Pollster | The physical size in bytes of the image container on the host
disk.device.capacity | Gauge | B | Disk ID | Pollster | The amount of disk per device that the instance can see
disk.device.allocation | Gauge | B | Disk ID | Pollster | The amount of disk per device occupied by the instance on the host machine
disk.device.usage | Gauge | B | Disk ID | Pollster | The physical size in bytes of the image container on the host per device
network.incoming.bytes | Cumulative | B | Interface ID | Pollster | Number of incoming bytes
network.outgoing.bytes | Cumulative | B | Interface ID | Pollster | Number of outgoing bytes
network.incoming.packets | Cumulative | packet | Interface ID | Pollster | Number of incoming packets
network.outgoing.packets | Cumulative | packet | Interface ID | Pollster | Number of outgoing packets

13.3.3.3 Compute Host Meters

Meter | Type | Unit | Resource | Origin | Note
compute.node.cpu.frequency | Gauge | MHz | Host ID | Notification | CPU frequency
compute.node.cpu.kernel.time | Cumulative | ns | Host ID | Notification | CPU kernel time
compute.node.cpu.idle.time | Cumulative | ns | Host ID | Notification | CPU idle time
compute.node.cpu.user.time | Cumulative | ns | Host ID | Notification | CPU user mode time
compute.node.cpu.iowait.time | Cumulative | ns | Host ID | Notification | CPU I/O wait time
compute.node.cpu.kernel.percent | Gauge | % | Host ID | Notification | CPU kernel percentage
compute.node.cpu.idle.percent | Gauge | % | Host ID | Notification | CPU idle percentage
compute.node.cpu.user.percent | Gauge | % | Host ID | Notification | CPU user mode percentage
compute.node.cpu.iowait.percent | Gauge | % | Host ID | Notification | CPU I/O wait percentage
compute.node.cpu.percent | Gauge | % | Host ID | Notification | CPU utilization

13.3.3.4 Image (glance) Meters

Meter | Type | Unit | Resource | Origin | Note
image.size | Gauge | B | Image ID | Notification | Uploaded image size
image.update | Delta | Image | Image ID | Notification | Number of updates on the image
image.upload | Delta | Image | Image ID | Notification | Number of uploads of the image
image.delete | Delta | Image | Image ID | Notification | Number of deletes on the image

13.3.3.5 Volume (cinder) Meters

Meter | Type | Unit | Resource | Origin | Note
volume.size | Gauge | GB | Vol ID | Notification | Size of volume
snapshot.size | Gauge | GB | Snap ID | Notification | Size of snapshot's volume

13.3.3.6 Storage (swift) Meters

Meter | Type | Unit | Resource | Origin | Note
storage.objects | Gauge | Object | Storage ID | Pollster | Number of objects
storage.objects.size | Gauge | B | Storage ID | Pollster | Total size of stored objects
storage.objects.containers | Gauge | Container | Storage ID | Pollster | Number of containers

The resource_id for any ceilometer query is the tenant_id for the swift object because swift usage is rolled up at the tenant level.

13.3.4 Configure the Ceilometer Metering Service

SUSE OpenStack Cloud 9 automatically deploys ceilometer to use the monasca database. ceilometer is deployed on the same control plane nodes along with other OpenStack services such as keystone, nova, neutron, glance, and swift.

The Metering Service can be configured using one of the procedures described below.

13.3.4.1 Run the Upgrade Playbook

Follow the standard service upgrade mechanism available in the Cloud Lifecycle Manager distribution. For ceilometer, the playbook included with SUSE OpenStack Cloud is ceilometer-upgrade.yml.
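A minimal sketch of running it, assuming the standard playbook directory used elsewhere in this guide:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-upgrade.yml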

13.3.4.2 Enable Services for Messaging Notifications

After installation of SUSE OpenStack Cloud, the following services are enabled by default to send notifications:

  • nova

  • cinder

  • glance

  • neutron

  • swift

The list of meters for these services is specified in the Notification Agent or Polling Agent's pipeline configuration file.

For steps on how to edit the pipeline configuration files, see: Section 13.3.5, “Ceilometer Metering Service Notifications”

13.3.4.3 Restart the Polling Agent

The Polling Agent is responsible for coordinating the polling activity. It parses the pipeline.yml configuration file and identifies all the sources where data is collected. The sources are then evaluated and are translated to resources that a dedicated pollster can retrieve. The Polling Agent follows this process:

  1. At each identified interval, the pipeline.yml configuration file is parsed.

  2. The resource list is composed.

  3. The pollster collects the data.

  4. The pollster sends data to the queue.

Metering processes should normally be operating at all times. On SUSE OpenStack Cloud, this is handled by the systemd service manager, which starts and stops processes as required and restarts them if they stop unexpectedly. To stop or start the Polling Agent without conflicting with the service manager, use the following steps.

To restart the Polling Agent:

  1. To determine whether the process is running, run:

    tux > sudo systemctl status ceilometer-agent-notification
    #SAMPLE OUTPUT:
    ceilometer-agent-notification.service - ceilometer-agent-notification Service
       Loaded: loaded (/etc/systemd/system/ceilometer-agent-notification.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2018-06-12 05:07:14 UTC; 2 days ago
     Main PID: 31529 (ceilometer-agen)
        Tasks: 69
       CGroup: /system.slice/ceilometer-agent-notification.service
               ├─31529 ceilometer-agent-notification: master process [/opt/stack/service/ceilometer-agent-notification/venv/bin/ceilometer-agent-notification --config-file /opt/stack/service/ceilometer-agent-noti...
               └─31621 ceilometer-agent-notification: NotificationService worker(0)
    
    Jun 12 05:07:14 ardana-qe201-cp1-c1-m2-mgmt systemd[1]: Started ceilometer-agent-notification Service.
  2. To stop the process, run:

    tux > sudo systemctl stop ceilometer-agent-notification
  3. To start the process, run:

    tux > sudo systemctl start ceilometer-agent-notification

13.3.4.4 Replace a Logging, Monitoring, and Metering Controller

In a medium-scale environment, if a metering controller has to be replaced or rebuilt, use the following steps:

  1. Section 15.1.2.1, “Replacing a Controller Node”.

  2. If the ceilometer nodes are not on the shared control plane, you must reconfigure ceilometer to implement the changes and replace the controller. To do this, run the ceilometer-reconfigure.yml Ansible playbook without the limit option, for example as shown below.
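    The following sketch assumes the standard playbook directory used elsewhere in this guide:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml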

13.3.4.5 Configure Monitoring

The monasca HTTP process check monitors ceilometer's notification and polling agents. If these agents are down, monasca monitoring alarms are triggered. You can use the notification alarms to debug the issue and restart the notification agent. However, the alarms for the Central-Agent (polling) and the Collector need to be deleted: these two processes are not started after an upgrade, so when the monitoring process checks their alarms, they will be in the UNDETERMINED state. SUSE OpenStack Cloud no longer monitors these processes. To resolve this issue, manually delete alarms that are installed but no longer used.
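If you need to remove such stale alarms manually, a minimal sketch with the monasca CLI follows; the alarm ID is a placeholder, so confirm which alarms belong to the unused Central-Agent and Collector checks in your deployment before deleting anything:

ardana > source ~/service.osrc
ardana > monasca alarm-list
ardana > monasca alarm-delete <alarm-id>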

To resolve notification alarms, first check the ceilometer-agent-notification logs for errors in the /var/log/ceilometer directory. You can also use the Operations Console to access Kibana and check the logs. This will help you understand and debug the error.

To restart the service, run the ceilometer-start.yml playbook. This playbook starts any ceilometer processes that have stopped; otherwise, processes are only restarted during an install, upgrade, or reconfigure, which is what is needed in this case. Restarting the stopped process resolves the alarm, because this monasca alarm means that ceilometer-agent-notification is no longer running on certain nodes.
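For example, assuming the standard playbook directory used elsewhere in this guide:

ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-start.yml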

You can access ceilometer data through monasca. ceilometer publishes samples to monasca with credentials of the following accounts:

  • ceilometer user

  • services

Data collected by ceilometer can also be retrieved by the monasca REST API. Make sure you use the following guidelines when requesting data from the monasca REST API:

  • Verify you have the monasca-admin role. This role is configured in the monasca-api configuration file.

  • Specify the tenant id of the services project.

For more details, read the monasca API Specification.

To run monasca commands at the command line, you must have the admin role. You can use the ceilometer account credentials in place of the default admin account credentials defined in the service.osrc file. When you use the ceilometer account credentials, monasca commands only return data collected by ceilometer. At this time, the monasca command-line interface (CLI) does not support retrieving data for other tenants or projects.
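As a minimal sketch, after exporting the ceilometer account credentials in place of those from service.osrc, you can list the metrics and measurements that ceilometer has published; the metric name and start time below are placeholders:

ardana > monasca metric-list
ardana > monasca measurement-list <metric-name> <start-time>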

13.3.5 Ceilometer Metering Service Notifications

ceilometer uses the notification agent to listen to the message queue, convert notifications to Events and Samples, and apply pipeline actions.

13.3.5.1 Manage Whitelisting and Polling

SUSE OpenStack Cloud is designed to reduce the amount of data that is stored. Because SUSE OpenStack Cloud uses a SQL-based cluster, which is not recommended for big data, you must control the data that ceilometer collects. You can do this by filtering (whitelisting) the data or by using the configuration files for the ceilometer Polling Agent and the ceilometer Notification Agent.

Whitelisting is used in a rule specification as a positive filtering parameter. A whitelist is only included in rules that can be used in direct mappings, covering identity service concerns such as service discovery; provisioning of users, groups, roles, projects, and domains; and user authentication and authorization.

You can run tests against specific scenarios to see if filtering reduces the amount of data stored. You can create a test by editing or creating a run filter file (whitelist). For steps on how to do this, see: Section 38.1, “API Verification”.

ceilometer Polling Agent (polling agent) and ceilometer Notification Agent (notification agent) use different pipeline.yaml files to configure meters that are collected. This prevents accidentally polling for meters which can be retrieved by the polling agent as well as the notification agent. For example, glance image and image.size are meters which can be retrieved both by polling and notifications.

In both of the separate configuration files, there is a setting for interval. The interval attribute determines how often data is collected, in seconds. You can use this setting to control the amount of resources that are used for notifications and for polling. For example, if you want to use more resources for notifications and fewer for polling, set the interval in the polling configuration file to a large value, such as 604800 seconds, which polls only once a week. Then, in the notifications configuration file, set the interval to a small value, such as 30 seconds, which collects data every 30 seconds.

Important
Important

swift account data is collected using the polling mechanism at an hourly interval.

Setting this interval to manage both notifications and polling is the recommended procedure when using a SQL cluster back-end.

Sample ceilometer Polling Agent file:

#File: ~/opt/stack/service/ceilometer-polling/etc/pipeline-polling.yaml
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
         - notifier://

Sample ceilometer Notification Agent (notification agent) file:

#File:    ~/opt/stack/service/ceilometer-agent-notification/etc/pipeline-agent-notification.yaml
---
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
         - notifier://

Both of the pipeline files have two major sections:

Sources

represents the data that is collected either from notifications posted by services or through polling. In the Sources section there is a list of meters. These meters define what kind of data is collected. For a full list refer to the ceilometer documentation available at: Telemetry Measurements

Sinks

represents how the data is modified before it is published to the internal queue for collection and storage.

You will only need to change a setting in the Sources section to control the data collection interval.

For more information, see Telemetry Measurements

To change the ceilometer Polling Agent interval setting:

  1. To find the polling agent configuration file, run:

    cd ~/opt/stack/service/ceilometer-polling/etc
  2. In a text editor, open the following file:

    pipeline-polling.yaml
  3. In the following section, change the value of interval to the desired amount of time:

    ---
    sources:
        - name: swift_source
          interval: 3600
          meters:
              - "storage.objects"
              - "storage.objects.size"
              - "storage.objects.containers"
          resources:
          discovery:
          sinks:
              - meter_sink
    sinks:
        - name: meter_sink
          transformers:
          publishers:
             - notifier://

    In the sample code above, the polling agent will collect data every 3600 seconds, or once an hour.

To change the ceilometer Notification Agent (notification agent) interval setting:

  1. To find the notification agent configuration file, run:

    cd ~/opt/stack/service/ceilometer-agent-notification/etc
  2. In a text editor, open the following file:

    pipeline-agent-notification.yaml
  3. In the following section, change the value of interval to the desired amount of time:

    sources:
        - name: meter_source
          interval: 30
          meters:
              - "instance"
              - "image"
              - "image.size"
              - "image.upload"
              - "image.delete"
              - "volume"
              - "volume.size"
              - "snapshot"
              - "snapshot.size"
              - "ip.floating"
              - "network"
              - "network.create"
              - "network.update"

    In the sample code above, the notification agent will collect data every 30 seconds.

Note
Note

The pipeline-agent-notification.yaml file needs to be changed on all controller nodes to change the white-listing and polling strategy.
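After editing the file on a controller node, restart the notification agent there so that the change takes effect, for example:

tux > sudo systemctl restart ceilometer-agent-notification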

13.3.5.2 Edit the List of Meters

The number of enabled meters can be reduced or increased by editing the pipeline configuration of the notification and polling agents. To deploy these changes you must then restart the agent. If pollsters and notifications are both modified, then you will have to restart both the Polling Agent and the Notification Agent. ceilometer Collector will also need to be restarted. The following code is an example of a compute-only ceilometer Notification Agent (notification agent) pipeline-agent-notification.yaml file:

---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Important
Important

If you enable meters at the container level in this file, every time the polling interval triggers a collection, at least 5 messages per existing container in swift are collected.

The following table illustrates the amount of data produced hourly in different scenarios:

swift Containers | swift Objects per container | Samples per Hour | Samples stored per 24 hours
10 | 10 | 500 | 12000
10 | 100 | 5000 | 120000
100 | 100 | 50000 | 1200000
100 | 1000 | 500000 | 12000000

The data in the table shows that even a very small swift deployment with 10 containers and 100 objects per container will store 120,000 samples in 24 hours, or about 3.6 million samples over 30 days.

Important
Important

The size of each file does not have any impact on the number of samples collected. As shown in the table above, the fewest samples result from polling when there are few files and few containers; the more files and containers there are, the higher the number of samples.

13.3.5.3 Add Resource Fields to Meters

By default, not all the resource metadata fields for an event are recorded and stored in ceilometer. If you want to collect metadata fields for a consumer application, for example, it is easier to add a field to an existing meter rather than creating a new meter. If you create a new meter, you must also reconfigure ceilometer.

Important
Important

Consider the following information before you add or edit a meter:

  • You can add a maximum of 12 new fields.

  • Adding or editing a meter causes all non-default meters to STOP receiving notifications. You will need to restart ceilometer.

  • New meters added to the pipeline-polling.yaml.j2 file must also be added to the pipeline-agent-notification.yaml.j2 file. This is due to the fact that polling meters are drained by the notification agent and not by the collector.

  • After SUSE OpenStack Cloud is installed, services like compute, cinder, glance, and neutron are configured to publish ceilometer meters by default. Other meters can also be enabled after the services are configured to start publishing the meter. The only requirement for publishing a meter is that the origin must have a value of notification. For a complete list of meters, see the OpenStack documentation on Measurements.

  • Not all meters are supported. Meters collected by ceilometer Compute Agent or any agent other than ceilometer Polling are not supported or tested with SUSE OpenStack Cloud.

  • Identity meters are disabled by keystone.

  • To enable ceilometer to start collecting meters, some services require you enable the meters you need in the service first before enabling them in ceilometer. Refer to the documentation for the specific service before you add new meters or resource fields.

To add Resource Metadata fields:

  1. Log on to the Cloud Lifecycle Manager (deployer node).

  2. To change to the ceilometer directory, run:

    ardana > cd ~/openstack/my_cloud/config/ceilometer
  3. In a text editor, open the target configuration file (for example, monasca-field-definitions.yaml.j2).

  4. In the metadata section, either add a new meter or edit an existing one provided by SUSE OpenStack Cloud.

  5. Include the metadata fields you need. You can use the instance meter in the file as an example.

  6. Save and close the configuration file.

  7. To save your changes in SUSE OpenStack Cloud, run:

    ardana > cd ~/openstack
    ardana > git add -A
    ardana > git commit -m "My config"
  8. If you added a new meter, reconfigure ceilometer:

    ardana > cd ~/openstack/ardana/ansible/
    # To run the config-processor playbook:
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    # To run the ready-deployment playbook:
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml

13.3.5.4 Update the Polling Strategy and Swift Considerations

Polling can be very taxing on the system due to the sheer volume of data that the system may have to process. It also has a severe impact on queries, because the database has a very large amount of data to scan when responding, which consumes a great amount of CPU and memory. This can result in long wait times for query responses, and in extreme cases can result in timeouts.

There are 3 polling meters in swift:

  • storage.objects

  • storage.objects.size

  • storage.objects.containers

Here is an example of pipeline.yml in which swift polling is set to occur hourly.

---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://

With the configuration above, polling of container-based meters is not enabled and only 3 messages are collected for any given tenant, one for each meter listed in the configuration file. Because there are only 3 messages per tenant, this does not create a heavy load on the MySQL database as it would if container-based meters were enabled, and other APIs are not affected by this data collection configuration.

13.3.6 Ceilometer Metering Setting Role-based Access Control

Role-Based Access Control (RBAC) is a technique that limits access to resources based on a specific set of roles associated with each user's credentials.

keystone has a set of users that are associated with each project. Each user has at least one role. After a user has authenticated with keystone using a valid set of credentials, keystone will augment that request with the Roles that are associated with that user. These roles are added to the Request Header under the X-Roles attribute and are presented as a comma-separated list.

13.3.6.1 Displaying All Users

To discover the list of users available in the system, an administrator can run the following command using the keystone command-line interface:

ardana > source ~/service.osrc
ardana > openstack user list

The output should resemble this response, which is a list of all the users currently available in this system.

+----------------------------------+--------------------+---------+--------------------+
| id                               | name               | enabled | email              |
+----------------------------------+--------------------+---------+--------------------+
| 1c20d327c92a4ea8bb513894ce26f1f1 | admin              | True    | admin.example.com  |
| 0f48f3cc093c44b4ad969898713a0d65 | ceilometer         | True    | nobody@example.com |
| 85ba98d27b1c4c8f97993e34fcd14f48 | cinder             | True    | nobody@example.com |
| d2ff982a0b6547d0921b94957db714d6 | demo               | True    | demo@example.com   |
| b2d597e83664489ebd1d3c4742a04b7c | ec2                | True    | nobody@example.com |
| 2bd85070ceec4b608d9f1b06c6be22cb | glance             | True    | nobody@example.com |
| 0e9e2daebbd3464097557b87af4afa4c | heat               | True    | nobody@example.com |
| 0b466ddc2c0f478aa139d2a0be314467 | neutron            | True    | nobody@example.com |
| 5cda1a541dee4555aab88f36e5759268 | nova               | True    | nobody@example.com |
| 1cefd1361be8437d9684eb2add8bdbfa | swift              | True    | nobody@example.com |
| f05bac3532c44414a26c0086797dab23 | user20141203213957 | True    | nobody@example.com |
| 3db0588e140d4f88b0d4cc8b5ca86a0b | user20141205232231 | True    | nobody@example.com |
+----------------------------------+--------------------+---------+--------------------+

13.3.6.2 Displaying All Roles

To see all the roles that are currently available in the deployment, an administrator (someone with the admin role) can run the following command:

ardana > source ~/service.osrc
ardana > openstack role list

The output should resemble the following response:

+----------------------------------+-------------------------------------+
|                id                |                 name                |
+----------------------------------+-------------------------------------+
| 507bface531e4ac2b7019a1684df3370 |            ResellerAdmin            |
| 9fe2ff9ee4384b1894a90878d3e92bab |               member                |
| e00e9406b536470dbde2689ce1edb683 |                admin                |
| aa60501f1e664ddab72b0a9f27f96d2c |           heat_stack_user           |
| a082d27b033b4fdea37ebb2a5dc1a07b |               service               |
| 8f11f6761534407585feecb5e896922f |            swiftoperator            |
+----------------------------------+-------------------------------------+

13.3.6.3 Assigning a Role to a User

In this example, we want to add the role ResellerAdmin to the demo user who has the ID d2ff982a0b6547d0921b94957db714d6.

  1. Determine which Project/Tenant the user belongs to.

    ardana > source ~/service.osrc
    ardana > openstack user show d2ff982a0b6547d0921b94957db714d6

    The response should resemble the following output:

    +---------------------+----------------------------------+
    | Field               | Value                            |
    +---------------------+----------------------------------+
    | domain_id           | default                          |
    | enabled             | True                             |
    | id                  | d2ff982a0b6547d0921b94957db714d6 |
    | name                | demo                             |
    | options             | {}                               |
    | password_expires_at | None                             |
    +---------------------+----------------------------------+
  2. We need to link the ResellerAdmin Role to a Project/Tenant. To start, determine which tenants are available on this deployment.

    ardana > source ~/service.osrc
    ardana > openstack project list

    The response should resemble the following output:

    +----------------------------------+---------+---------+
    | id                               | name    | enabled |
    +----------------------------------+---------+---------+
    | 4a8f4207a13444089a18dc524f41b2cf | admin   | True    |
    | 00cbaf647bf24627b01b1a314e796138 | demo    | True    |
    | 8374761f28df43b09b20fcd3148c4a08 | gf1     | True    |
    | 0f8a9eef727f4011a7c709e3fbe435fa | gf2     | True    |
    | 6eff7b888f8e470a89a113acfcca87db | gf3     | True    |
    | f0b5d86c7769478da82cdeb180aba1b0 | jaq1    | True    |
    | a46f1127e78744e88d6bba20d2fc6e23 | jaq2    | True    |
    | 977b9b7f9a6b4f59aaa70e5a1f4ebf0b | jaq3    | True    |
    | 4055962ba9e44561ab495e8d4fafa41d | jaq4    | True    |
    | 33ec7f15476545d1980cf90b05e1b5a8 | jaq5    | True    |
    | 9550570f8bf147b3b9451a635a1024a1 | service | True    |
    +----------------------------------+---------+---------+
  3. Now that we have all the pieces, we can assign the ResellerAdmin role to this User on the Demo project.

    ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138 507bface531e4ac2b7019a1684df3370

    This will produce no response if everything is correct.

  4. Validate that the role has been assigned correctly. Pass in the user and tenant ID and request a list of roles assigned.

    ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138

    Note that all members have the member role as a default role in addition to any other roles that have been assigned.

    +----------------------------------+---------------+----------------------------------+----------------------------------+
    |                id                |      name     |             user_id              |            tenant_id             |
    +----------------------------------+---------------+----------------------------------+----------------------------------+
    | 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    | 9fe2ff9ee4384b1894a90878d3e92bab |    member     | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    +----------------------------------+---------------+----------------------------------+----------------------------------+

13.3.6.4 Creating a New Role

In this example, we will create a Level 3 Support role called L3Support.

  1. Add the new role to the list of roles.

    ardana > openstack role create L3Support

    The response should resemble the following output:

    +----------+----------------------------------+
    | Property |              Value               |
    +----------+----------------------------------+
    |    id    | 7e77946db05645c4ba56c6c82bf3f8d2 |
    |   name   |            L3Support             |
    +----------+----------------------------------+
  2. Now that we have the new role's ID, we can add that role to the Demo user from the previous example.

    ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6  --project 00cbaf647bf24627b01b1a314e796138 7e77946db05645c4ba56c6c82bf3f8d2

    This will produce no response if everything is correct.

  3. Verify that the user Demo has both the ResellerAdmin and L3Support roles.

    ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138
  4. The response should resemble the following output. Note that this user has the L3Support role, the ResellerAdmin role, and the default member role.

    +----------------------------------+---------------+----------------------------------+----------------------------------+
    |                id                |      name     |             user_id              |            tenant_id             |
    +----------------------------------+---------------+----------------------------------+----------------------------------+
    | 7e77946db05645c4ba56c6c82bf3f8d2 |   L3Support   | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    | 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    | 9fe2ff9ee4384b1894a90878d3e92bab |    member     | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
    +----------------------------------+---------------+----------------------------------+----------------------------------+

13.3.6.5 Access Policies

Before introducing RBAC, ceilometer had very simple access control. There were two types of user: admins and users. Admins were able to access any API and perform any operation. Users were only able to access non-admin APIs and perform operations only on the Project/Tenant to which they belonged.

13.3.7 Ceilometer Metering Failover HA Support

In the SUSE OpenStack Cloud environment, the ceilometer metering service supports native Active-Active high-availability (HA) for the notification and polling agents. Implementing HA support includes workload-balancing, workload-distribution and failover.

Tooz is the coordination engine that is used to coordinate workload among multiple active agent instances. It also tracks which instances are active, handling failover and group membership using heartbeats (pings).

Zookeeper is used as the coordination back-end. Tooz uses Zookeeper to expose the APIs that manage group membership and retrieve the workload specific to each agent.

The following section in the configuration file is used to implement high-availability (HA):

[coordination]
backend_url = <IP address of Zookeeper host: port> (port is usually 2181 as a zookeeper default)
heartbeat = 1.0
check_watchers = 10.0
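To confirm that the Zookeeper endpoint referenced in backend_url is reachable, a quick check such as the following can be used. This sketch assumes the default client port 2181 and that the nc utility is installed; a healthy Zookeeper node answers "imok":

tux > echo ruok | nc <IP address of Zookeeper host> 2181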

For the notification agent to be configured in HA mode, additional configuration is needed in the configuration file:

[notification]
workload_partitioning = true

The HA notification agent distributes workload among multiple queues that are created based on the number of unique source:sink combinations. The combinations are configured in the notification agent pipeline configuration file. If there are additional services to be metered using notifications, then the recommendation is to use a separate source for those events. This is recommended especially if the expected load of data from that source is considered high. Implementing HA support should lead to better workload balancing among multiple active notification agents.

ceilometer-expirer also runs Active-Active HA. Tooz is used to pick one expirer process when there are multiple contenders: the process that acquires the lock wins and runs. There is no failover support, as expirer is not a daemon and is scheduled to run at pre-determined intervals.

Important
Important

You must ensure that a single expirer process runs when multiple processes are scheduled to run at the same time. This must be done using cron-based scheduling on multiple controller nodes, as sketched below.
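The following crontab entry is only an illustrative sketch; the path to the ceilometer-expirer executable is an assumption based on the venv layout used by other ceilometer services in this release and must be verified for your deployment:

# Run ceilometer-expirer once a day at 02:00; with expirer HA enabled, Tooz ensures only one contender wins the lock.
0 2 * * * /opt/stack/service/ceilometer-expirer/venv/bin/ceilometer-expirer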

The following configuration is needed to enable expirer HA:

[coordination]
backend_url = <IP address of Zookeeper host: port> (port is usually 2181 as a zookeeper default)
heartbeat = 1.0
check_watchers = 10.0

The notification agent HA support is mainly designed to coordinate among notification agents so that correlated samples can be handled by the same agent. This happens when samples get transformed from other samples. The SUSE OpenStack Cloud ceilometer pipeline has no transformers, so this task of coordination and workload partitioning does not need to be enabled. The notification agent is deployed on multiple controller nodes and they distribute workload among themselves by randomly fetching the data from the queue.

To disable coordination and workload partitioning by OpenStack, set the following value in the configuration file:

[notification]
workload_partitioning = False
Important
Important

When a configuration change is made to an API running under the HA Proxy, that change needs to be replicated in all controllers.

13.3.8 Optimizing the Ceilometer Metering Service

You can improve ceilometer responsiveness by configuring metering to store only the data you require. This topic provides strategies for getting the most out of metering while not overloading your resources.

13.3.8.1 Change the List of Meters

The list of meters can be easily reduced or increased by editing the pipeline.yaml file and restarting the polling agent.

Sample compute-only pipeline.yaml file with the daily poll interval:

---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Note
Note

This change will cause all non-default meters to stop receiving notifications.

13.3.8.2 Enable Nova Notifications

You can configure nova to send notifications by enabling the setting in the configuration file. When enabled, nova will send information to ceilometer related to its usage and VM status. You must restart nova for these changes to take effect.

The OpenStack notification daemon (the notification agent) monitors the message bus for data being provided by other OpenStack components such as nova. The notification daemon loads one or more listener plugins, using the ceilometer.notification namespace. Each plugin can listen to any topic, but by default it listens to the notifications.info topic. The listeners grab messages off the defined topics and redistribute them to the appropriate plugins (endpoints) to be processed into Events and Samples. After the nova service is restarted, you should verify that the notification daemons are receiving traffic, for example as shown below.
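One hedged way to spot-check this is to look at recent activity in the notification agent log under /var/log/ceilometer (the exact log file name may differ in your deployment) and at the notification queues in RabbitMQ:

tux > sudo tail /var/log/ceilometer/ceilometer-agent-notification.log
tux > sudo rabbitmqctl list_queues | grep notifications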

For a more in-depth look at how information is sent over openstack.common.rpc, refer to the OpenStack ceilometer documentation.

nova can be configured to send the following data to ceilometer:

Name | Type | Unit | Resource | Note
instance | Gauge | instance | inst ID | Existence of instance
instance:<type> | Gauge | instance | inst ID | Existence of instance of type <type> (where <type> is a valid OpenStack type)
memory | Gauge | MB | inst ID | Amount of allocated RAM. Measured in MB.
vcpus | Gauge | vcpu | inst ID | Number of VCPUs
disk.root.size | Gauge | GB | inst ID | Size of root disk. Measured in GB.
disk.ephemeral.size | Gauge | GB | inst ID | Size of ephemeral disk. Measured in GB.

To enable nova to publish notifications:

  1. In a text editor, open the following file:

    nova.conf
  2. Compare the example of a working configuration file with the necessary changes to your configuration file. If there is anything missing in your file, add it, and then save the file.

    notification_driver=messaging
    notification_topics=notifications
    notify_on_state_change=vm_and_task_state
    instance_usage_audit=True
    instance_usage_audit_period=hour
    Important
    Important

    The instance_usage_audit_period interval can be set to check the instance's status every hour, once a day, once a week or once a month. Every time the audit period elapses, nova sends a notification to ceilometer to record whether or not the instance is alive and running. Metering this statistic is critical if billing depends on usage.

  3. To restart nova service, run:

    tux > sudo systemctl restart nova-api.service
    tux > sudo systemctl restart nova-conductor.service
    tux > sudo systemctl restart nova-scheduler.service
    tux > sudo systemctl restart nova-novncproxy.service
    Important
    Important

    Different platforms may use their own unique command to restart nova-compute services. If the above command does not work, please refer to the documentation for your specific platform.

  4. To verify successful launch of each process, list the service components:

    ardana > source ~/service.osrc
    ardana > openstack compute service list
    +----+------------------+------------+----------+---------+-------+----------------------------+-----------------+
    | Id | Binary           | Host       | Zone     | Status  | State | Updated_at                 | Disabled Reason |
    +----+------------------+------------+----------+---------+-------+----------------------------+-----------------+
    | 1  | nova-conductor   | controller | internal | enabled | up    | 2014-09-16T23:54:02.000000 | -               |
    | 3  | nova-scheduler   | controller | internal | enabled | up    | 2014-09-16T23:54:07.000000 | -               |
    | 4  | nova-cert        | controller | internal | enabled | up    | 2014-09-16T23:54:00.000000 | -               |
    | 5  | nova-compute     | compute1   | nova     | enabled | up    | 2014-09-16T23:54:06.000000 | -               |
    +----+------------------+------------+----------+---------+-------+----------------------------+-----------------+

13.3.8.3 Improve Reporting API Responsiveness

Reporting APIs are the main access to the metering data stored in ceilometer. These APIs are accessed by horizon to provide basic usage data and information.

SUSE OpenStack Cloud uses Apache2 Web Server to provide the API access. This topic provides some strategies to help you optimize the front-end and back-end databases.

To improve the responsiveness, you can increase the number of threads and processes in the ceilometer configuration file. Each process can have a certain number of threads managing the filters and applications which comprise the processing pipeline.

To configure Apache2 to increase the number of threads, use the steps in Section 13.3.4, “Configure the Ceilometer Metering Service”.

Warning
Warning

The resource usage panel could take some time to load depending on the number of metrics selected.

13.3.8.4 Update the Polling Strategy and Swift Considerations

Polling can put an excessive amount of strain on the system due to the amount of data the system may have to process. Polling also has a severe impact on queries, since the database can have a very large amount of data to scan before responding to the query. This usually consumes a large amount of CPU and memory to complete the requests. Clients can also experience long waits for queries to come back and, in extreme cases, even timeouts.

There are 3 polling meters in swift:

  • storage.objects

  • storage.objects.size

  • storage.objects.containers

Sample section of the pipeline.yaml configuration file with swift polling on an hourly interval:

---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://

Every time the polling interval occurs, at least 3 messages per existing object/container in swift are collected. The following table illustrates the amount of data produced hourly in different scenarios:

swift Containers | swift Objects per container | Samples per Hour | Samples stored per 24 hours
10 | 10 | 500 | 12000
10 | 100 | 5000 | 120000
100 | 100 | 50000 | 1200000
100 | 1000 | 500000 | 12000000

Looking at the data, we can see that even a very small swift deployment with 10 containers and 100 objects per container will store 120,000 samples in 24 hours, or about 3.6 million samples over 30 days.

Note
Note

The size of each file does not have any impact on the number of samples collected. The smaller the number of containers or files, the smaller the number of samples. When there are a large number of files and containers, the number of samples is at its highest and performance is at its worst.

13.3.9 Metering Service Samples

Samples are discrete collections of a particular meter, that is, the actual usage data defined by a meter description. Each sample is time-stamped and includes a variety of data that varies per meter, but usually includes the project ID and user ID of the entity that consumed the resource represented by the meter and sample.

In a typical deployment, the number of samples can be in the tens of thousands if not higher for a specific collection period depending on overall activity.

Sample collection and data storage expiry settings are configured in ceilometer. For use cases such as monthly billing cycles, samples are usually stored for a period of 45 days, which requires a large, scalable back-end database to support the large volume of samples generated by production OpenStack deployments.

Example configuration:

[database]
metering_time_to_live=-1
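
The metering_time_to_live value is expressed in seconds, and a value of 0 or less (such as -1 above) keeps samples indefinitely. As an illustration, a 45-day retention period would correspond to the following setting (45 x 86400 seconds):

[database]
metering_time_to_live=3888000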

In our example use case, to construct a complete billing record, an external billing application must collect all pertinent samples. The results must then be sorted, summarized, and combined with the results of other types of metered samples that are required. This function is known as aggregation and is external to the ceilometer service.

Meter data, or samples, can also be collected directly from the service APIs by individual ceilometer polling agents. These polling agents directly access service usage by calling the API of each service.

OpenStack services such as swift currently provide metered data only through this function, and some other OpenStack services provide specific metrics only through a polling action.