13 Managing Monitoring, Logging, and Usage Reporting #
Information about the monitoring, logging, and metering services included with your SUSE OpenStack Cloud.
13.1 Monitoring #
The SUSE OpenStack Cloud Monitoring service leverages OpenStack monasca, which is a multi-tenant, scalable, fault-tolerant monitoring service.
13.1.1 Getting Started with Monitoring #
You can use the SUSE OpenStack Cloud Monitoring service to monitor the health of your cloud and, if necessary, to troubleshoot issues.
monasca data can be extracted and used for a variety of legitimate purposes, and different purposes require different forms of data sanitization or encoding to protect against invalid or malicious data. Treat any data pulled from monasca as untrusted: apply appropriate encoding or sanitization before displaying it in a web browser, inserting it into a database, or using it in any other way.
13.1.1.1 Monitoring Service Overview #
13.1.1.1.1 Installation #
The monitoring service is automatically installed as part of the SUSE OpenStack Cloud installation.
No specific configuration is required to use monasca. However, you can configure the database for storing metrics as explained in Section 13.1.2, “Configuring the Monitoring Service”.
13.1.1.1.2 Differences Between Upstream and SUSE OpenStack Cloud Implementations #
SUSE OpenStack Cloud includes the OpenStack monitoring service, monasca, as its monitoring solution, with the exception of the following components, which are not included:
Transform Engine
Events Engine
Anomaly and Prediction Engine
Icinga was supported in previous SUSE OpenStack Cloud versions but it has been deprecated in SUSE OpenStack Cloud 9.
13.1.1.1.3 Diagram of monasca Service #
13.1.1.1.4 For More Information #
For more details on OpenStack monasca, see monasca.io.
13.1.1.1.5 Back-end Database #
The monitoring service default metrics database is Cassandra, which is a highly scalable analytics database and the recommended database for SUSE OpenStack Cloud.
You can learn more about Cassandra at Apache Cassandra.
13.1.1.2 Working with Monasca #
monasca-agent
The monasca-agent is a Python program that runs on the control plane nodes. It runs the defined checks and then sends the data on to the monasca API. The checks that the agent runs include:
System Metrics: CPU utilization, memory usage, disk I/O, network I/O, and filesystem utilization on the control plane and resource nodes.
Service Metrics: the agent supports plugins such as MySQL, RabbitMQ, Kafka, and many others.
VM Metrics: CPU utilization, disk I/O, network I/O, and memory usage of hosted virtual machines on compute nodes. Full details of these can be found at https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md#per-instance-metrics.
For a full list of packaged plugins that are included in SUSE OpenStack Cloud, see monasca Plugins.
To further customize the monasca-agent to suit your needs, see Customizing the Agent.
13.1.1.3 Accessing the Monitoring Service #
Access to the Monitoring service is available through a number of different interfaces.
13.1.1.3.1 Command-Line Interface #
For users who prefer using the command line, there is the python-monascaclient, which is part of the default installation on your Cloud Lifecycle Manager node.
For details on the CLI, including installation instructions, see Python-monasca Client.
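As a quick orientation, the following read-only queries illustrate typical CLI usage. The metric name, alarm state, and the service.osrc credentials file are assumptions based on a standard deployment; adjust them to your environment.

ardana > source ~/service.osrc                      # load admin credentials (path is an assumption)
ardana > monasca metric-list --name cpu.idle_perc   # list metrics matching a name
ardana > monasca alarm-list --state ALARM           # list alarms currently in the ALARM state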
monasca API
If low-level access is desired, there is the monasca REST API.
Full details of the monasca API can be found on GitHub.
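For example, a raw query against the metrics endpoint might look like the sketch below; the endpoint URL is a placeholder for your deployment's monasca API endpoint, and the token is obtained with the standard OpenStack CLI.

ardana > TOKEN=$(openstack token issue -f value -c id)
ardana > curl -H "X-Auth-Token: $TOKEN" \
  "https://<monasca-api-endpoint>/v2.0/metrics?name=cpu.idle_perc"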
13.1.1.3.2 Operations Console GUI #
You can use the Operations Console (Ops Console) for SUSE OpenStack Cloud to view data about your SUSE OpenStack Cloud infrastructure in a web-based graphical user interface (GUI) and ensure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can, for example, triage alarm notifications.
Alarm Definitions and notifications now have their own screens and are collected under the Alarm Explorer menu item, which can be accessed from the Central Dashboard. The Central Dashboard now allows you to customize the view in the following ways:
Rename or re-configure existing alarm cards to include services different from the defaults
Create a new alarm card with the services you want to select
Reorder alarm cards using drag and drop
View all alarms that have no service dimension now grouped in an Uncategorized Alarms card
View all alarms that have a service dimension that does not match any of the other cards, now grouped in an Other Alarms card
You can also easily access alarm data for a specific component. On the Summary page for the following components, a link is provided to an alarms screen specifically for that component.
13.1.1.3.3 Connecting to the Operations Console #
To connect to the Operations Console, perform the following:
Ensure your login has the required access credentials.
Connect through a browser.
Optionally use a host name or virtual IP address to access the Operations Console.
The Operations Console is always accessed over port 9095 (see the example below).
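For example, if the virtual IP address of your control plane were 192.168.10.5 (a placeholder value; substitute your own host name or VIP), you would point a browser at:

https://192.168.10.5:9095

Use http:// instead if TLS is not enabled for the console in your deployment.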
13.1.1.4 Service Alarm Definitions #
SUSE OpenStack Cloud comes with some predefined monitoring alarms for the services installed.
Full details of all service alarms can be found here: Section 18.1.1, “Alarm Resolution Procedures”.
Alarms can appear in an undetermined or stale state in situations such as the following:
An alarm exists for a service or component that is not installed in the environment.
An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.
There is a gap between the last reported metric and the next metric.
When alarms are triggered, it is helpful to review the service logs.
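A quick way to review current alarm states from the Cloud Lifecycle Manager before digging into logs is the monasca CLI; the state filters below are examples only.

ardana > monasca alarm-list --state ALARM
ardana > monasca alarm-list --state UNDETERMINED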
13.1.2 Configuring the Monitoring Service #
The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. You also have options for your alarm metrics database should you choose not to use the default option provided with the product.
In SUSE OpenStack Cloud, you have the option to specify an SMTP server for email notifications and a database platform to use for the metrics database. These steps will assist in this process.
13.1.2.1 Configuring the Monitoring Email Notification Settings #
The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. In SUSE OpenStack Cloud, you have the option to specify an SMTP server for email notifications. These steps will assist in this process.
If you are going to use the email notification feature of the monitoring service, you must set the configuration options with valid email settings, including an SMTP server and valid email addresses. The email server is not provided by SUSE OpenStack Cloud, but must be specified in the configuration file described below. The email server must support SMTP.
13.1.2.1.1 Configuring monitoring notification settings during initial installation #
Log in to the Cloud Lifecycle Manager.
To change the SMTP server configuration settings edit the following file:
~/openstack/my_cloud/definition/cloudConfig.yml
Enter your email server settings. Here is an example snippet showing the configuration file contents; uncomment these lines before entering your environment details.
smtp-settings:
#  server: mailserver.examplecloud.com
#  port: 25
#  timeout: 15
#  These are only needed if your server requires authentication
#  user:
#  password:
This table explains each of these values:
Value | Description
---|---
Server (required) | The server entry must be uncommented and set to a valid hostname or IP address.
Port (optional) | If your SMTP server is running on a port other than the standard 25, uncomment the port line and set it to your port.
Timeout (optional) | If your email server is heavily loaded, the timeout parameter can be uncommented and set to a larger value. 15 seconds is the default.
User / Password (optional) | If your SMTP server requires authentication, you can configure user and password. Use double quotes around the password to avoid issues with special characters.
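For reference, a fully uncommented block might look like the following sketch; the server, user, and password values are hypothetical and must be replaced with your own settings.

smtp-settings:
  server: mail.example.com
  port: 25
  timeout: 15
  user: monasca-mailer
  password: "ExamplePassword123"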
To configure the sending email addresses, edit the following file:
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
Modify the following value to add your sending email address:
email_from_addr
Note: The default value in the file is

email_from_address: notification@exampleCloud.com

which you should edit.

[Optional] To configure the receiving email addresses, edit the following file:
~/openstack/ardana/ansible/roles/monasca-default-alarms/defaults/main.yml
Modify the following value to configure a receiving email address:
notification_address
Note: You can also set the receiving email address via the Operations Console. Instructions for this are in the last section.
If your environment requires a proxy address then you can add that in as well:
# notification_environment can be used to configure proxies if needed.
# Below is an example configuration. Note that all of the quotes are required.
# notification_environment: '"http_proxy=http://<your_proxy>:<port>" "https_proxy=http://<your_proxy>:<port>"'
notification_environment: ''
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Updated monitoring service email notification settings"

Continue with your installation.
13.1.2.1.2 Monasca and Apache Commons Validator #
monasca notification uses a standard Apache Commons validator to validate the configured SUSE OpenStack Cloud domain names before sending notifications over webhook. monasca notification supports some non-standard domain names, but not all. See the Domain Validator documentation for more information: https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/DomainValidator.html
Ensure that any domains you use are supported by IETF and IANA. For example, .local is not listed by IANA and is invalid, but .gov and .edu are valid.
Internet Assigned Numbers Authority (IANA): https://www.iana.org/domains/root/db
Failure to use supported domains will generate an unprocessable exception in monasca notification create:
HTTPException code=422 message={"unprocessable_entity": {"code":422,"message":"Address https://myopenstack.sample:8000/v1/signal/test is not of correct format","details":"","internal_code":"c6cf9d9eb79c3fc4"}
13.1.2.1.3 Configuring monitoring notification settings after the initial installation #
If you need to make changes to the email notification settings after your initial deployment, you can change the "From" address using the configuration files but the "To" address will need to be changed in the Operations Console. The following section will describe both of these processes.
To change the sending email address:
Log in to the Cloud Lifecycle Manager.
To configure the sending email addresses, edit the following file:
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
Modify the following value to add your sending email address:
email_from_addr
Note: The default value in the file is

email_from_address: notification@exampleCloud.com

which you should edit.

Commit your configuration to the local Git repository (Chapter 22, Using Git for Configuration Management), as follows:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Updated monitoring service email notification settings"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the monasca reconfigure playbook to deploy the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification

Note: You may need to use the --ask-vault-pass switch if you opted for encryption during the initial deployment.
To change the receiving ("To") email address via the Operations Console after installation:
Connect to and log in to the Operations Console.
On the Home screen, click the menu represented by three horizontal lines.
From the menu that slides in on the left side, click Home, and then Alarm Explorer.
On the Alarm Explorer page, at the top, click the Notification Methods text.
On the Notification Methods page, find the row with the Default Email notification.
In the Default Email row, click the details icon, then click Edit.
On the Edit Notification Method: Default Email page, in Name, Type, and Address/Key, type in the values you want to use.
On the Edit Notification Method: Default Email page, click Update Notification.
Once the notification has been added in the Operations Console, the Ansible playbook procedures described above will not change it.
13.1.2.2 Managing Notification Methods for Alarms #
13.1.2.2.1 Enabling a Proxy for Webhook or Pager Duty Notifications #
If your environment requires a proxy for communications to function, these steps show you how to enable one. These steps are only needed if you are using the webhook or PagerDuty notification methods.
These steps require access to the Cloud Lifecycle Manager in your cloud deployment, so you may need to contact your Administrator. You can make these changes during the initial configuration phase prior to the first installation, or you can modify your existing environment; the only difference is the last step.
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml file and update the line below with your proxy address values:

notification_environment: '"http_proxy=http://<proxy_address>:<port>" "https_proxy=http://<proxy_address>:<port>"'

Note: There are single quotation marks around the entire value of this entry and double quotation marks around the individual proxy entries. This formatting must be preserved when you enter these values into your configuration file.
If you are making these changes prior to your initial installation then you are done and can continue on with the installation. However, if you are modifying an existing environment, you will need to continue on with the remaining steps below.
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Generate an updated deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the monasca reconfigure playbook to enable these changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
13.1.2.2.2 Creating a New Notification Method #
Log in to the Operations Console.
Use the navigation menu to go to the Alarm Explorer page.
Select the Notification Methods menu and then click the Create Notification Method button.
On the Create Notification Method window, select your options and then click the Create Notification button.
A description of each of the fields you use for each notification method:
Field | Description
---|---
Name | Enter a unique name value for the notification method you are creating.
Type | Choose a type. Available values are Webhook, Email, or Pager Duty.
Address/Key | Enter the value corresponding to the type you chose.
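Notification methods can also be created from the command line on the Cloud Lifecycle Manager using the same Name, Type, and Address/Key values; the webhook endpoint below is a hypothetical example.

ardana > monasca notification-create MyWebhookNotification WEBHOOK http://monitoring.example.com:8080/alarm-hook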
13.1.2.2.3 Applying a Notification Method to an Alarm Definition #
Log in to the Operations Console.
Use the navigation menu to go to the Alarm Explorer page.
Select the Alarm Definition menu, which will give you a list of each of the alarm definitions in your environment.
Locate the alarm you want to change the notification method for and click on its name to bring up the edit menu. You can use the sorting methods for assistance.
In the edit menu, scroll down to the Notifications and Severity section, where you select one or more Notification Methods before selecting the Update Alarm Definition button.
Repeat as needed until all of your alarms have the notification methods you desire.
13.1.2.3 Enabling the RabbitMQ Admin Console #
The RabbitMQ Admin Console is off by default in SUSE OpenStack Cloud. You can turn on the console by following these steps:
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/config/rabbitmq/main.yml file. Under the rabbit_plugins: line, uncomment the following entry:

- rabbitmq_management
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Enabled RabbitMQ Admin Console"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the RabbitMQ reconfigure playbook to deploy the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-reconfigure.yml
To turn the RabbitMQ Admin Console off again, add the comment back and repeat steps 3 through 6.
13.1.2.4 Capacity Reporting and Monasca Transform #
Capacity reporting is a new feature in SUSE OpenStack Cloud which provides cloud operators with overall capacity information (available, used, and remaining) via the Operations Console so that the cloud operator can ensure that cloud resource pools have sufficient capacity to meet the demands of users. The cloud operator can also set thresholds and alarms to be notified when the thresholds are reached.
For Compute
Host Capacity - CPU/Disk/Memory: Used, Available and Remaining Capacity - for the entire cloud installation or by host
VM Capacity - CPU/Disk/Memory: Allocated, Available and Remaining - for the entire cloud installation, by host or by project
For Object Storage
Disk Capacity - Used, Available and Remaining Capacity - for the entire cloud installation or by project
In addition to overall capacity, roll-up views with appropriate slices provide views by a particular project or compute node. Graphs also show trends and the change in capacity over time.
13.1.2.4.1 monasca Transform Features #
monasca Transform is a new component in monasca which transforms and aggregates metrics using Apache Spark.
Aggregated metrics are published to Kafka, are available to other monasca components such as monasca-threshold, and are stored in the monasca datastore.
Cloud operators can set thresholds and alarms to receive notifications when thresholds are met.
These aggregated metrics are made available to cloud operators via the Operations Console's new Capacity Summary (reporting) UI.
Capacity reporting is a new feature in SUSE OpenStack Cloud which provides cloud operators with overall capacity information (available, used, and remaining) for Compute and Object Storage.
Cloud operators can view capacity reporting via the Operations Console's Compute Capacity Summary and Object Storage Capacity Summary UI.
Capacity reporting allows cloud operators to ensure that cloud resource pools have sufficient capacity to meet the demands of users. See the table below for services and capacity types.
A list of aggregated metrics is provided in Section 13.1.2.4.4, “New Aggregated Metrics”.
Capacity reporting aggregated metrics are aggregated and published every hour.
In addition to the overall capacity, graphs show the capacity trends over a time range (1 day, 7 days, 30 days, or 45 days).
Graphs showing the capacity trends by a particular project or compute host are also provided.
monasca Transform is integrated with centralized monitoring (monasca) and centralized logging.
Flexible Deployment
Upgrade & Patch Support
Service | Type of Capacity | Description
---|---|---
Compute | Host Capacity | CPU/Disk/Memory: Used, Available and Remaining Capacity - for entire cloud installation or by compute host
Compute | VM Capacity | CPU/Disk/Memory: Allocated, Available and Remaining - for entire cloud installation, by host or by project
Object Storage | Disk Capacity | Used, Available and Remaining Disk Capacity - for entire cloud installation or by project
Object Storage | Storage Capacity | Utilized Storage Capacity - for entire cloud installation or by project
13.1.2.4.2 Architecture for Monasca Transform and Spark #
monasca Transform is a new component in monasca. monasca Transform uses Spark for data aggregation. Both monasca Transform and Spark are depicted in the example diagram below.
You can see that the monasca components run on the Cloud Controller nodes, and the monasca agents run on all nodes in the Mid-scale Example configuration.
13.1.2.4.3 Components for Capacity Reporting #
13.1.2.4.3.1 monasca Transform: Data Aggregation Reporting #
monasca-transform is a new component which provides a mechanism to aggregate or transform metrics and publish new aggregated metrics to monasca.
monasca Transform is a data-driven, Apache Spark-based data aggregation engine which collects, groups, and aggregates existing individual monasca metrics according to business requirements and publishes new transformed (derived) metrics to the monasca Kafka queue.
Since the new transformed metrics are published as any other metric in monasca, alarms can be set and triggered on the transformed metric, just like any other metric.
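For example, an alarm definition on one of the aggregated metrics could be created with the monasca CLI as sketched below; the threshold, severity, and notification ID are placeholders to adapt to your own capacity planning.

ardana > monasca alarm-definition-create "Utilized logical cores above threshold" \
  "avg(cpu.utilized_logical_cores_agg{host=all}) > 800" \
  --severity HIGH --alarm-actions <NOTIFICATION_METHOD_ID>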
13.1.2.4.3.2 Object Storage and Compute Capacity Summary Operations Console UI #
A new "Capacity Summary" tab for Compute and Object Storage will displays all the aggregated metrics under the "Compute" and "Object Storage" sections.
Operations Console UI makes calls to monasca API to retrieve and display various tiles and graphs on Capacity Summary tab in Compute and Object Storage Summary UI pages.
13.1.2.4.3.3 Persist new metrics and Trigger Alarms #
New aggregated metrics are published to monasca's Kafka queue and are ingested by monasca-persister. If thresholds and alarms have been set on the aggregated metrics, monasca generates and triggers alarms as it currently does with any other metric. No additional changes are needed to persist the new aggregated metrics or to set thresholds and alarms on them.
13.1.2.4.4 New Aggregated Metrics #
The following is the list of aggregated metrics produced by monasca Transform in SUSE OpenStack Cloud.
# | Metric Name | For | Description | Dimensions | Notes
---|---|---|---|---|---
1 | cpu.utilized_logical_cores_agg | compute summary | utilized physical host cpu core capacity for one or all hosts by time interval (defaults to an hour) | aggregation_period: hourly, host: all or <host name>, project_id: all | Available as total or per host
2 | cpu.total_logical_cores_agg | compute summary | total physical host cpu core capacity for one or all hosts by time interval (defaults to an hour) | aggregation_period: hourly, host: all or <host name>, project_id: all | Available as total or per host
3 | mem.total_mb_agg | compute summary | total physical host memory capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
4 | mem.usable_mb_agg | compute summary | usable physical host memory capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
5 | disk.total_used_space_mb_agg | compute summary | utilized physical host disk capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
6 | disk.total_space_mb_agg | compute summary | total physical host disk capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
7 | nova.vm.cpu.total_allocated_agg | compute summary | cpus allocated across all VMs by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
8 | vcpus_agg | compute summary | virtual cpus allocated capacity for VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all or <project ID> | Available as total or per project
9 | nova.vm.mem.total_allocated_mb_agg | compute summary | memory allocated to all VMs by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
10 | vm.mem.used_mb_agg | compute summary | memory utilized by VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: <project ID> | Available as total or per project
11 | vm.mem.total_mb_agg | compute summary | memory allocated to VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: <project ID> | Available as total or per project
12 | vm.cpu.utilization_perc_agg | compute summary | cpu utilized by all VMs by project by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: <project ID> |
13 | nova.vm.disk.total_allocated_gb_agg | compute summary | disk space allocated to all VMs by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
14 | vm.disk.allocation_agg | compute summary | disk allocation for VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all or <project ID> | Available as total or per project
15 | swiftlm.diskusage.val.size_agg | object storage summary | total available object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all or <host name>, project_id: all | Available as total or per host
16 | swiftlm.diskusage.val.avail_agg | object storage summary | remaining object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all or <host name>, project_id: all | Available as total or per host
17 | swiftlm.diskusage.rate_agg | object storage summary | rate of change of object storage usage by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
18 | storage.objects.size_agg | object storage summary | used object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly, host: all, project_id: all |
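To verify that the hourly aggregated metrics are being produced, you can query them with the monasca CLI; the start time below is a placeholder UTC timestamp that you would replace with a recent value.

ardana > monasca measurement-list cpu.total_logical_cores_agg 2018-01-01T00:00:00Z --dimensions host=all
ardana > monasca measurement-list swiftlm.diskusage.val.avail_agg 2018-01-01T00:00:00Z --dimensions host=all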
13.1.2.4.5 Deployment #
monasca Transform and Spark will be deployed on the same control plane nodes along with Logging and Monitoring Service (monasca).
Security Consideration during deployment of monasca Transform and Spark
The SUSE OpenStack Cloud Monitoring system connects internally to the Kafka and Spark technologies without authentication. If you choose to deploy Monitoring, configure it to use only trusted networks such as the Management network, as illustrated on the network diagrams below for Entry Scale Deployment and Mid Scale Deployment.
Entry Scale Deployment
In an Entry Scale Deployment, monasca Transform and Spark will be deployed on the shared control plane along with the other OpenStack services, including Monitoring and Logging.
Mid Scale Deployment
In a Mid Scale Deployment, monasca Transform and Spark will be deployed on the dedicated Metering, Monitoring and Logging (MML) control plane along with other data-processing-intensive services such as Metering, Monitoring, and Logging.
Multi Control Plane Deployment
In a Multi Control Plane Deployment, monasca Transform and Spark will be deployed on the shared control plane along with the rest of the monasca components.
Start, Stop and Status for monasca Transform and Spark processes
The service management methods for monasca-transform and spark follow the convention for services in the OpenStack platform. When executing from the deployer node, the commands are as follows:
Status
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-status.yml
Start
Because monasca-transform depends on Spark for the processing of the metrics, Spark must be started before monasca-transform.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-start.yml
Stop
As a precaution, stop the monasca-transform service before taking Spark down. Interrupting the Spark service altogether while monasca-transform is still running can result in a monasca-transform process that is unresponsive and needs to be cleaned up.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-stop.yml
ardana > ansible-playbook -i hosts/verb_hosts spark-stop.yml
13.1.2.4.6 Reconfigure #
The reconfigure process can be triggered again from the deployer. Assuming that the variables have been changed in the appropriate places, running the respective Ansible playbooks is enough to update the configuration. The Spark reconfigure process alters the nodes serially, meaning that Spark is never down altogether: each node is stopped in turn and ZooKeeper manages the leaders accordingly. This means that monasca-transform may be left running even while Spark is upgraded.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.7 Adding monasca Transform and Spark to SUSE OpenStack Cloud Deployment #
Since monasca Transform and Spark are optional components, you might elect not to install them during the initial SUSE OpenStack Cloud installation. The following instructions describe how to add monasca Transform and Spark to an existing SUSE OpenStack Cloud deployment.
Steps
Add monasca Transform and Spark to the input model. On an entry-level cloud, monasca Transform and Spark would be installed on the common control plane; on a mid-scale cloud, which has a Metering, Monitoring and Logging (MML) cluster, monasca Transform and Spark should be added to the MML cluster.
ardana > cd ~/openstack/my_cloud/definition/data/

Add spark and monasca-transform to the input model file, control_plane.yml:

clusters:
  - name: core
    cluster-prefix: c1
    server-role: CONTROLLER-ROLE
    member-count: 3
    allocation-policy: strict
    service-components:
      [...]
      - zookeeper
      - kafka
      - cassandra
      - storm
      - spark
      - monasca-api
      - monasca-persister
      - monasca-notifier
      - monasca-threshold
      - monasca-client
      - monasca-transform
      [...]
Run the Configuration Processor
ardana > cd ~/openstack/my_cloud/definition
ardana > git add -A
ardana > git commit -m "Adding monasca Transform and Spark"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run Ready Deployment:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run Cloud Lifecycle Manager Deploy:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
Verify Deployment
Log in to each controller node and run:

tux > sudo service monasca-transform status
tux > sudo service spark-master status
tux > sudo service spark-worker status
Example output:

tux > sudo service monasca-transform status
● monasca-transform.service - monasca Transform Daemon
   Loaded: loaded (/etc/systemd/system/monasca-transform.service; disabled)
   Active: active (running) since Wed 2016-08-24 00:47:56 UTC; 2 days ago
 Main PID: 7351 (bash)
   CGroup: /system.slice/monasca-transform.service
           ├─ 7351 bash /etc/monasca/transform/init/start-monasca-transform.sh
           ├─ 7352 /opt/stack/service/monasca-transform/venv//bin/python /opt/monasca/monasca-transform/lib/service_runner.py
           ├─27904 /bin/sh -c export SPARK_HOME=/opt/stack/service/spark/venv/bin/../current && spark-submit --supervise --master spark://omega-cp1-c1-m1-mgmt:7077,omega-cp1-c1-m2-mgmt:7077,omega-cp1-c1...
           ├─27905 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/lib/drizzle-jdbc-1.3.jar:/opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/v...
           └─28355 python /opt/monasca/monasca-transform/lib/driver.py
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

tux > sudo service spark-worker status
● spark-worker.service - Spark Worker Daemon
   Loaded: loaded (/etc/systemd/system/spark-worker.service; disabled)
   Active: active (running) since Wed 2016-08-24 00:46:05 UTC; 2 days ago
 Main PID: 63513 (bash)
   CGroup: /system.slice/spark-worker.service
           ├─ 7671 python -m pyspark.daemon
           ├─28948 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0...
           ├─63513 bash /etc/spark/init/start-spark-worker.sh &
           └─63514 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

tux > sudo service spark-master status
● spark-master.service - Spark Master Daemon
   Loaded: loaded (/etc/systemd/system/spark-master.service; disabled)
   Active: active (running) since Wed 2016-08-24 00:44:24 UTC; 2 days ago
 Main PID: 55572 (bash)
   CGroup: /system.slice/spark-master.service
           ├─55572 bash /etc/spark/init/start-spark-master.sh &
           └─55573 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
13.1.2.4.8 Increase monasca Transform Scale #
In the default configuration, monasca Transform can scale up to the estimated data volume of a 100-node cloud deployment. The estimated maximum rate of metrics from a 100-node cloud deployment is 120M/hour.
You can further increase the processing rate to 180M/hour. Making the Spark configuration change increases the CPUs used by Spark and monasca Transform from an average of around 3.5 to 5.5 CPUs per control node over a 10-minute batch processing interval.
To increase the processing rate to 180M/hour, make the following Spark configuration changes.
Steps
Edit /var/lib/ardana/openstack/my_cloud/config/spark/spark-defaults.conf.j2 and set spark.cores.max to 6 and spark.executor.cores to 2.
Set spark.cores.max to 6
spark.cores.max {{ spark_cores_max }}
to
spark.cores.max 6
Set spark.executor.cores to 2
spark.executor.cores {{ spark_executor_cores }}
to
spark.executor.cores 2
Edit ~/openstack/my_cloud/config/spark/spark-env.sh.j2
Set SPARK_WORKER_CORES to 2
export SPARK_WORKER_CORES={{ spark_worker_cores }}
to
export SPARK_WORKER_CORES=2
Edit ~/openstack/my_cloud/config/spark/spark-worker-env.sh.j2
Set SPARK_WORKER_CORES to 2
export SPARK_WORKER_CORES={{ spark_worker_cores }}
to
export SPARK_WORKER_CORES=2
Run Configuration Processor
ardana > cd ~/openstack/my_cloud/definition
ardana > git add -A
ardana > git commit -m "Changing Spark Config increase scale"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run Ready Deployment:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run spark-reconfigure.yml and monasca-transform-reconfigure.yml:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.9 Change Compute Host Pattern Filter in Monasca Transform #
monasca Transform identifies compute host metrics by pattern matching on the hostname dimension in the incoming monasca metrics. The default pattern is of the form compNNN, for example comp001, comp002, and so on. To filter for it in the transformation specs, use the expression -comp[0-9]+-. If the compute host names follow a pattern other than the standard pattern above, the filter-by expression used when aggregating metrics will have to be changed.
Steps
On the deployer, edit ~/openstack/my_cloud/config/monasca-transform/transform_specs.json.j2.

Look for all references of -comp[0-9]+- and change the regular expression to the desired pattern, for example -compute[0-9]+-. For example, change

{"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data","insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"], "usage_fetch_operation": "avg", "filter_by_list": [{"field_to_filter": "host", "filter_expression": "-comp[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}
to
{"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data", "insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"],"usage_fetch_operation": "avg","filter_by_list": [{"field_to_filter": "host","filter_expression": "-compute[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}
Note: The filter_expression has been changed to the new pattern.
To change all host metric transformation specs in the same JSON file, repeat Step 2.
Transformation specs must be changed for the following metric_ids: "mem_total_all", "mem_usable_all", "disk_total_all", "disk_usable_all", "cpu_total_all", "cpu_total_host", "cpu_util_all", and "cpu_util_host". If the pattern is identical in all of them, a single search-and-replace can update every spec at once, as shown in the sketch below.
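This is only a hedged sketch (the new pattern -compute[0-9]+ is an example); back up the file and review the result before committing.

ardana > cd ~/openstack/my_cloud/config/monasca-transform
ardana > cp transform_specs.json.j2 transform_specs.json.j2.bak    # keep a backup copy
ardana > sed -i 's/-comp\[0-9\]+/-compute[0-9]+/g' transform_specs.json.j2
ardana > grep -c 'compute\[0-9\]+' transform_specs.json.j2         # confirm the expected number of changed specs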
Run the Configuration Processor:
ardana > cd ~/openstack/my_cloud/definition
ardana > git add -A
ardana > git commit -m "Changing monasca Transform specs"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run Ready Deployment:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run monasca Transform Reconfigure:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.5 Configuring Availability of Alarm Metrics #
Using the monasca agent tuning knobs, you can choose which alarm metrics are available in your environment.
The addition of the libvirt and OVS plugins to the monasca agent provides a number of additional metrics that can be used. Most of these metrics are included by default, but others are not. You can use tuning knobs to add or remove these metrics from your environment based on your needs.
These metrics are listed below along with their tuning knob names and instructions for adjusting them.
13.1.2.5.1 Libvirt plugin metric tuning knobs #
The following metrics are added as part of the libvirt plugin:
For a description of each of these metrics, see Section 13.1.4.11, “Libvirt Metrics”.
Tuning Knob | Default Setting | Admin Metric Name | Project Metric Name
---|---|---|---
vm_cpu_check_enable | True | vm.cpu.time_ns | cpu.time_ns
 | | vm.cpu.utilization_norm_perc | cpu.utilization_norm_perc
 | | vm.cpu.utilization_perc | cpu.utilization_perc
vm_disks_check_enable | True (creates 20 disk metrics per disk device per virtual machine) | vm.io.errors | io.errors
 | | vm.io.errors_sec | io.errors_sec
 | | vm.io.read_bytes | io.read_bytes
 | | vm.io.read_bytes_sec | io.read_bytes_sec
 | | vm.io.read_ops | io.read_ops
 | | vm.io.read_ops_sec | io.read_ops_sec
 | | vm.io.write_bytes | io.write_bytes
 | | vm.io.write_bytes_sec | io.write_bytes_sec
 | | vm.io.write_ops | io.write_ops
 | | vm.io.write_ops_sec | io.write_ops_sec
vm_network_check_enable | True (creates 16 network metrics per NIC per virtual machine) | vm.net.in_bytes | net.in_bytes
 | | vm.net.in_bytes_sec | net.in_bytes_sec
 | | vm.net.in_packets | net.in_packets
 | | vm.net.in_packets_sec | net.in_packets_sec
 | | vm.net.out_bytes | net.out_bytes
 | | vm.net.out_bytes_sec | net.out_bytes_sec
 | | vm.net.out_packets | net.out_packets
 | | vm.net.out_packets_sec | net.out_packets_sec
vm_ping_check_enable | True | vm.ping_status | ping_status
vm_extended_disks_check_enable | True (creates 6 metrics per device per virtual machine) | vm.disk.allocation | disk.allocation
 | | vm.disk.capacity | disk.capacity
 | | vm.disk.physical | disk.physical
 | True (creates 6 aggregate metrics per virtual machine) | vm.disk.allocation_total | disk.allocation_total
 | | vm.disk.capacity_total | disk.capacity.total
 | | vm.disk.physical_total | disk.physical_total
vm_disks_check_enable vm_extended_disks_check_enable | True (creates 20 aggregate metrics per virtual machine) | vm.io.errors_total | io.errors_total
 | | vm.io.errors_total_sec | io.errors_total_sec
 | | vm.io.read_bytes_total | io.read_bytes_total
 | | vm.io.read_bytes_total_sec | io.read_bytes_total_sec
 | | vm.io.read_ops_total | io.read_ops_total
 | | vm.io.read_ops_total_sec | io.read_ops_total_sec
 | | vm.io.write_bytes_total | io.write_bytes_total
 | | vm.io.write_bytes_total_sec | io.write_bytes_total_sec
 | | vm.io.write_ops_total | io.write_ops_total
 | | vm.io.write_ops_total_sec | io.write_ops_total_sec
13.1.2.5.1.1 Configuring the libvirt metrics using the tuning knobs #
Use the following steps to configure the tuning knobs for the libvirt plugin metrics.
Log in to the Cloud Lifecycle Manager.
Edit the following file:
~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
Change the value for each tuning knob to the desired setting: True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

vm_cpu_check_enable: <true or false>
vm_disks_check_enable: <true or false>
vm_extended_disks_check_enable: <true or false>
vm_network_check_enable: <true or false>
vm_ping_check_enable: <true or false>
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "configuring libvirt plugin tuning knobs"

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the nova reconfigure playbook to implement the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
If you modify either of the following files, then the monasca tuning parameters should be adjusted to handle a higher load on the system.
~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2

Tuning parameters are located in ~/openstack/my_cloud/config/monasca/configuration.yml. The parameter monasca_tuning_selector_override should be changed to the extra-large setting.
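A hedged sketch of that change in ~/openstack/my_cloud/config/monasca/configuration.yml is shown below; check the existing keys in the file before editing, as the exact layout can differ between releases.

monasca_tuning_selector_override: extra-large

You would then commit the change and run the monasca reconfigure playbook, following the same commit, ready-deployment, and reconfigure pattern used elsewhere in this section.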
13.1.2.5.2 OVS plugin metric tuning knobs #
The following metrics are added as part of the OVS plugin:
For a description of each of these metrics, see Section 13.1.4.16, “Open vSwitch (OVS) Metrics”.
Tuning Knob | Default Setting | Admin Metric Name | Project Metric Name
---|---|---|---
use_rate_metrics | False | ovs.vrouter.in_bytes_sec | vrouter.in_bytes_sec
 | | ovs.vrouter.in_packets_sec | vrouter.in_packets_sec
 | | ovs.vrouter.out_bytes_sec | vrouter.out_bytes_sec
 | | ovs.vrouter.out_packets_sec | vrouter.out_packets_sec
use_absolute_metrics | True | ovs.vrouter.in_bytes | vrouter.in_bytes
 | | ovs.vrouter.in_packets | vrouter.in_packets
 | | ovs.vrouter.out_bytes | vrouter.out_bytes
 | | ovs.vrouter.out_packets | vrouter.out_packets
use_health_metrics with use_rate_metrics | False | ovs.vrouter.in_dropped_sec | vrouter.in_dropped_sec
 | | ovs.vrouter.in_errors_sec | vrouter.in_errors_sec
 | | ovs.vrouter.out_dropped_sec | vrouter.out_dropped_sec
 | | ovs.vrouter.out_errors_sec | vrouter.out_errors_sec
use_health_metrics with use_absolute_metrics | False | ovs.vrouter.in_dropped | vrouter.in_dropped
 | | ovs.vrouter.in_errors | vrouter.in_errors
 | | ovs.vrouter.out_dropped | vrouter.out_dropped
 | | ovs.vrouter.out_errors | vrouter.out_errors
13.1.2.5.2.1 Configuring the OVS metrics using the tuning knobs #
Use the following steps to configure the tuning knobs for the OVS plugin metrics.
Log in to the Cloud Lifecycle Manager.
Edit the following file:
~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2
Change the value for each tuning knob to the desired setting: True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

init_config:
  use_absolute_metrics: <true or false>
  use_rate_metrics: <true or false>
  use_health_metrics: <true or false>
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "configuring OVS plugin tuning knobs"

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the neutron reconfigure playbook to implement the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
13.1.3 Integrating HipChat, Slack, and JIRA #
monasca, the SUSE OpenStack Cloud monitoring and notification service, includes three default notification methods: email, PagerDuty, and webhook. monasca also supports three other notification plugins which allow you to send notifications to HipChat, Slack, and JIRA. Unlike the default notification methods, the additional notification plugins must be manually configured.
This guide details the steps to configure each of the three non-default notification plugins. This guide also assumes that your cloud is fully deployed and functional.
13.1.3.1 Configuring the HipChat Plugin #
To configure the HipChat plugin you will need the following four pieces of information from your HipChat system.
The URL of your HipChat system.
A token providing permission to send notifications to your HipChat system.
The ID of the HipChat room you wish to send notifications to.
A HipChat user account. This account will be used to authenticate any incoming notifications from your SUSE OpenStack Cloud cloud.
Obtain a token
Use the following instructions to obtain a token from your HipChat system.
Log in to HipChat as the user account that will be used to authenticate the notifications.
Navigate to the following URL: https://<your_hipchat_system>/account/api. Replace <your_hipchat_system> with the fully qualified domain name of your HipChat system.

Select the Create token option. Ensure that the token has the "SendNotification" attribute.
Obtain a room ID
Use the following instructions to obtain the ID of a HipChat room.
Log in to HipChat as the user account that will be used to authenticate the notifications.
Select My account from the application menu.
Select the Rooms tab.
Select the room that you want your notifications sent to.
Look for the API ID field in the room information. This is the room ID.
Create HipChat notification type
Use the following instructions to create a HipChat notification type.
Begin by obtaining the API URL for the HipChat room that you wish to send notifications to. The format for a URL used to send notifications to a room is as follows:
/v2/room/{room_id_or_name}/notification
Use the monasca API to create a new notification method. The following example demonstrates how to create a HipChat notification type named MyHipChatNotification, for room ID 13, using an example API URL and auth token.
ardana > monasca notification-create NAME TYPE ADDRESS
ardana > monasca notification-create MyHipChatNotification HIPCHAT https://hipchat.hpe.net/v2/room/13/notification?auth_token=1234567890

The preceding example creates a notification type with the following characteristics:
NAME: MyHipChatNotification
TYPE: HIPCHAT
ADDRESS: https://hipchat.hpe.net/v2/room/13/notification
auth_token: 1234567890
The horizon dashboard can also be used to create a HipChat notification type.
13.1.3.2 Configuring the Slack Plugin #
Configuring a Slack notification type requires four pieces of information from your Slack system.
Slack server URL
Authentication token
Slack channel
A Slack user account. This account will be used to authenticate incoming notifications to Slack.
Identify a Slack channel
Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.
In the left navigation panel, under the CHANNELS section, locate the channel that you wish to receive the notifications. The instructions that follow use the example channel #general.
Create a Slack token
Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.
Navigate to the following URL: https://api.slack.com/docs/oauth-test-tokens
Select the Create token button.
Create a Slack notification type
Begin by identifying the structure of the API call to be used by your notification method. The format for a call to the Slack Web API is as follows:
https://slack.com/api/METHOD
You can authenticate a Web API request by using the token that you created in the previous Create a Slack token section. Doing so will result in an API call that looks like the following.
https://slack.com/api/METHOD?token=auth_token
You can further refine your call by specifying the channel that the message will be posted to. Doing so will result in an API call that looks like the following.
https://slack.com/api/METHOD?token=AUTH_TOKEN&channel=#channel
The following example uses the chat.postMessage method, the token 1234567890, and the channel #general.

https://slack.com/api/chat.postMessage?token=1234567890&channel=#general
Find more information on the Slack Web API here: https://api.slack.com/web
Use the CLI on your Cloud Lifecycle Manager to create a new Slack notification type, using the API call that you created in the preceding step. The following example creates a notification type named MySlackNotification, using token 1234567890, and posting to channel #general.
ardana > monasca notification-create MySlackNotification SLACK https://slack.com/api/chat.postMessage?token=1234567890&channel=#general
Notification types can also be created in the horizon dashboard.
13.1.3.3 Configuring the JIRA Plugin #
Configuring the JIRA plugin requires three pieces of information from your JIRA system.
The URL of your JIRA system.
Username and password of a JIRA account that will be used to authenticate the notifications.
The name of the JIRA project that the notifications will be sent to.
Create JIRA notification type
You will configure the monasca service to send notifications to a particular JIRA project. You must also configure JIRA to create new issues for each notification it receives to this project, however, that configuration is outside the scope of this document.
The monasca JIRA notification plugin supports only the following two JIRA issue fields.
PROJECT. This is the only supported “mandatory” JIRA issue field.
COMPONENT. This is the only supported “optional” JIRA issue field.
The JIRA issue type that your notifications will create may only be configured with the "Project" field as mandatory. If your JIRA issue type has any other mandatory fields, the monasca plugin will not function correctly. Currently, the monasca plugin only supports the single optional "component" field.
Creating the JIRA notification type requires a few more steps than other notification types covered in this guide. Because the Python and YAML files for this notification type are not yet included in SUSE OpenStack Cloud 9, you must perform the following steps to manually retrieve and place them on your Cloud Lifecycle Manager.
Configure the JIRA plugin by adding the following block to the /etc/monasca/notification.yaml file, under the notification_types section, and adding the username and password of the JIRA account used for the notifications to the respective fields.

plugins:
  - monasca_notification.plugins.jira_notifier:JiraNotifier
jira:
  user:
  password:
  timeout: 60
After adding the necessary block, the notification_types section should look like the following example. Note that you must also add the username and password for the JIRA user related to the notification type.

notification_types:
  plugins:
    - monasca_notification.plugins.jira_notifier:JiraNotifier
  jira:
    user:
    password:
    timeout: 60
  webhook:
    timeout: 5
  pagerduty:
    timeout: 5
    url: "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
Create the JIRA notification type. The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO.

ardana > monasca notification-create MyJiraNotification JIRA https://jira.hpcloud.net/?project=HISO

The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO, and adds the optional component field with a value of keystone.

ardana > monasca notification-create MyJiraNotification JIRA https://jira.hpcloud.net/?project=HISO&component=keystone

Note: There is a slash (/) separating the URL path and the query string. The slash is required if you have a query parameter without a path parameter.

Note: Notification types may also be created in the horizon dashboard.
13.1.4 Alarm Metrics #
You can use the available metrics to create custom alarms to further monitor your cloud infrastructure and facilitate autoscaling features.
For details on how to create custom alarms using the Operations Console, see Section 16.2, “Alarm Definition”.
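Custom alarms can also be defined from the CLI. The sketch below alarms when a VM's CPU utilization stays above 90 percent for three consecutive evaluation periods; the threshold, period count, and notification ID are placeholder values, and the notification could point at a webhook consumed by an autoscaling system.

ardana > monasca alarm-definition-create "VM CPU sustained high" \
  "avg(vm.cpu.utilization_perc) > 90 times 3" \
  --severity HIGH --alarm-actions <NOTIFICATION_METHOD_ID>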
13.1.4.1 Apache Metrics #
A list of metrics associated with the Apache service.
Metric Name | Dimensions | Description
---|---|---
apache.net.hits | hostname, service=apache, component=apache | Total accesses
apache.net.kbytes_sec | hostname, service=apache, component=apache | Total Kbytes per second
apache.net.requests_sec | hostname, service=apache, component=apache | Total accesses per second
apache.net.total_kbytes | hostname, service=apache, component=apache | Total Kbytes
apache.performance.busy_worker_count | hostname, service=apache, component=apache | The number of workers serving requests
apache.performance.cpu_load_perc | hostname, service=apache, component=apache | The current percentage of CPU used by each worker and in total by all workers combined
apache.performance.idle_worker_count | hostname, service=apache, component=apache | The number of idle workers
apache.status | apache_port, hostname, service=apache, component=apache | Status of Apache port
13.1.4.2 ceilometer Metrics #
A list of metrics associated with the ceilometer service.
Metric Name | Dimensions | Description
---|---|---
disk.total_space_mb_agg | aggregation_period=hourly, host=all, project_id=all | Total space of disk
disk.total_used_space_mb_agg | aggregation_period=hourly, host=all, project_id=all | Total used space of disk
swiftlm.diskusage.rate_agg | aggregation_period=hourly, host=all, project_id=all |
swiftlm.diskusage.val.avail_agg | aggregation_period=hourly, host, project_id=all |
swiftlm.diskusage.val.size_agg | aggregation_period=hourly, host, project_id=all |
image | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Existence of the image
image.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Delete operation on this image
image.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=B, source=openstack | Size of the uploaded image
image.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Update operation on this image
image.upload | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Upload operation on this image
instance | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=instance, source=openstack | Existence of instance
disk.ephemeral.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of ephemeral disk on this instance
disk.root.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of root disk on this instance
memory | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=MB, source=openstack | Size of memory on this instance
ip.floating | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=ip, source=openstack | Existence of IP
ip.floating.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=ip, source=openstack | Create operation on this fip
ip.floating.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=ip, source=openstack | Update operation on this fip
mem.total_mb_agg | aggregation_period=hourly, host=all, project_id=all | Total space of memory
mem.usable_mb_agg | aggregation_period=hourly, host=all, project_id=all | Available space of memory
network | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=network, source=openstack | Existence of network
network.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=network, source=openstack | Create operation on this network
network.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=network, source=openstack | Update operation on this network
network.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=network, source=openstack | Delete operation on this network
port | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=port, source=openstack | Existence of port
port.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=port, source=openstack | Create operation on this port
port.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=port, source=openstack | Delete operation on this port
port.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=port, source=openstack | Update operation on this port
router | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=router, source=openstack | Existence of router
router.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Create operation on this router
router.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Delete operation on this router
router.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Update operation on this router
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Update operation on this router |
snapshot |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=snapshot, source=openstack | Existence of the snapshot |
snapshot.create.end |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=snapshot, source=openstack | Create operation on this snapshot |
snapshot.delete.end |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=snapshot, source=openstack | Delete operation on this snapshot |
snapshot.size |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of this snapshot |
subnet |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=subnet, source=openstack | Existence of the subnet |
subnet.create |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=subnet, source=openstack | Create operation on this subnet |
subnet.delete |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=subnet, source=openstack | Delete operation on this subnet |
subnet.update |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=subnet, source=openstack | Update operation on this subnet |
vcpus |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=vcpus, source=openstack | Number of virtual CPUs allocated to the instance |
vcpus_agg |
aggregation_period=hourly, host=all, project_id | Number of vcpus used by a project |
volume |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=volume, source=openstack | Existence of the volume |
volume.create.end |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Create operation on this volume |
volume.delete.end |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Delete operation on this volume |
volume.resize.end |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Resize operation on this volume |
volume.size |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of this volume |
volume.update.end |
user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Update operation on this volume |
storage.objects |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=object, source=openstack | Number of objects |
storage.objects.size |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=B, source=openstack | Total size of stored objects |
storage.objects.containers |
user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=container, source=openstack | Number of containers |
13.1.4.3 cinder Metrics #
A list of metrics associated with the cinder service.
Metric Name | Dimensions | Description |
---|---|---|
cinderlm.cinder.backend.physical.list |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, backends | List of physical backends |
cinderlm.cinder.backend.total.avail |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname | Total available capacity metric per backend |
cinderlm.cinder.backend.total.size |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname | Total capacity metric per backend |
cinderlm.cinder.cinder_services |
service=block-storage, hostname, cluster, cloud_name, control_plane, component | Status of a cinder-volume service |
cinderlm.hp_hardware.hpssacli.logical_drive |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, logical_drive, controller_slot, array | Status of a logical drive. The HPE Smart Storage Administrator (HPE SSA) CLI component must be installed for SSACLI status to be reported. To download and install the SSACLI utility to enable management of disk controllers, refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f |
cinderlm.hp_hardware.hpssacli.physical_drive |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, box, bay, controller_slot | Status of a physical drive |
cinderlm.hp_hardware.hpssacli.smart_array |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, model | Status of smart array |
cinderlm.hp_hardware.hpssacli.smart_array.firmware |
service=block-storage, hostname, cluster, cloud_name, control_plane, component, model | Checks firmware version |
13.1.4.4 Compute Metrics #
Compute instance metrics are listed in Section 13.1.4.11, “Libvirt Metrics”.
A list of metrics associated with the Compute service.
Metric Name | Dimensions | Description |
---|---|---|
nova.heartbeat |
service=compute cloud_name hostname component control_plane cluster |
Checks that all services are reporting heartbeats (uses the nova user to list services, then sets up a check for each one; for example, nova-scheduler, nova-conductor, nova-compute) |
nova.vm.cpu.total_allocated |
service=compute hostname component control_plane cluster | Total CPUs allocated across all VMs |
nova.vm.disk.total_allocated_gb |
service=compute hostname component control_plane cluster | Total Gbytes of disk space allocated to all VMs |
nova.vm.mem.total_allocated_mb |
service=compute hostname component control_plane cluster | Total Mbytes of memory allocated to all VMs |
13.1.4.5 Crash Metrics #
A list of metrics associated with the Crash service.
Metric Name | Dimensions | Description |
---|---|---|
crash.dump_count |
service=system hostname cluster | Number of crash dumps found |
13.1.4.6 Directory Metrics #
A list of metrics associated with the Directory service.
Metric Name | Dimensions | Description |
---|---|---|
directory.files_count |
service hostname path | Total number of files under a specific directory path |
directory.size_bytes |
service hostname path | Total size of a specific directory path |
13.1.4.7 Elasticsearch Metrics #
A list of metrics associated with the Elasticsearch service.
Metric Name | Dimensions | Description |
---|---|---|
elasticsearch.active_primary_shards |
service=logging url hostname |
Indicates the number of primary shards in your cluster. This is an aggregate total across all indices. |
elasticsearch.active_shards |
service=logging url hostname |
Aggregate total of all shards across all indices, which includes replica shards. |
elasticsearch.cluster_status |
service=logging url hostname |
Cluster health status. |
elasticsearch.initializing_shards |
service=logging url hostname |
The count of shards that are being freshly created. |
elasticsearch.number_of_data_nodes |
service=logging url hostname |
Number of data nodes. |
elasticsearch.number_of_nodes |
service=logging url hostname |
Number of nodes. |
elasticsearch.relocating_shards |
service=logging url hostname |
Shows the number of shards that are currently moving from one node to another node. |
elasticsearch.unassigned_shards |
service=logging url hostname |
The number of unassigned shards from the master node. |
13.1.4.8 HAProxy Metrics #
A list of metrics associated with the HAProxy service.
Metric Name | Dimensions | Description |
---|---|---|
haproxy.backend.bytes.in_rate | ||
haproxy.backend.bytes.out_rate | ||
haproxy.backend.denied.req_rate | ||
haproxy.backend.denied.resp_rate | ||
haproxy.backend.errors.con_rate | ||
haproxy.backend.errors.resp_rate | ||
haproxy.backend.queue.current | ||
haproxy.backend.response.1xx | ||
haproxy.backend.response.2xx | ||
haproxy.backend.response.3xx | ||
haproxy.backend.response.4xx | ||
haproxy.backend.response.5xx | ||
haproxy.backend.response.other | ||
haproxy.backend.session.current | ||
haproxy.backend.session.limit | ||
haproxy.backend.session.pct | ||
haproxy.backend.session.rate | ||
haproxy.backend.warnings.redis_rate | ||
haproxy.backend.warnings.retr_rate | ||
haproxy.frontend.bytes.in_rate | ||
haproxy.frontend.bytes.out_rate | ||
haproxy.frontend.denied.req_rate | ||
haproxy.frontend.denied.resp_rate | ||
haproxy.frontend.errors.req_rate | ||
haproxy.frontend.requests.rate | ||
haproxy.frontend.response.1xx | ||
haproxy.frontend.response.2xx | ||
haproxy.frontend.response.3xx | ||
haproxy.frontend.response.4xx | ||
haproxy.frontend.response.5xx | ||
haproxy.frontend.response.other | ||
haproxy.frontend.session.current | ||
haproxy.frontend.session.limit | ||
haproxy.frontend.session.pct | ||
haproxy.frontend.session.rate |
13.1.4.9 HTTP Check Metrics #
A list of metrics associated with the HTTP Check service:
Metric Name | Dimensions | Description |
---|---|---|
http_response_time |
url hostname service component | The response time in seconds of the http endpoint call. |
http_status |
url hostname service | The status of the http endpoint call (0 = success, 1 = failure). |
For each component and HTTP metric name there are two separate metrics reported, one for the local URL and another for the virtual IP (VIP) URL:
Component | Dimensions | Description |
---|---|---|
account-server |
service=object-storage component=account-server url | swift account-server http endpoint status and response time |
barbican-api |
service=key-manager component=barbican-api url | barbican-api http endpoint status and response time |
cinder-api |
service=block-storage component=cinder-api url | cinder-api http endpoint status and response time |
container-server |
service=object-storage component=container-server url | swift container-server http endpoint status and response time |
designate-api |
service=dns component=designate-api url | designate-api http endpoint status and response time |
glance-api |
service=image-service component=glance-api url | glance-api http endpoint status and response time |
glance-registry |
service=image-service component=glance-registry url | glance-registry http endpoint status and response time |
heat-api |
service=orchestration component=heat-api url | heat-api http endpoint status and response time |
heat-api-cfn |
service=orchestration component=heat-api-cfn url | heat-api-cfn http endpoint status and response time |
heat-api-cloudwatch |
service=orchestration component=heat-api-cloudwatch url | heat-api-cloudwatch http endpoint status and response time |
ardana-ux-services |
service=ardana-ux-services component=ardana-ux-services url | ardana-ux-services http endpoint status and response time |
horizon |
service=web-ui component=horizon url | horizon http endpoint status and response time |
keystone-api |
service=identity-service component=keystone-api url | keystone-api http endpoint status and response time |
monasca-api |
service=monitoring component=monasca-api url | monasca-api http endpoint status |
monasca-persister |
service=monitoring component=monasca-persister url | monasca-persister http endpoint status |
neutron-server |
service=networking component=neutron-server url | neutron-server http endpoint status and response time |
neutron-server-vip |
service=networking component=neutron-server-vip url | neutron-server-vip http endpoint status and response time |
nova-api |
service=compute component=nova-api url | nova-api http endpoint status and response time |
nova-vnc |
service=compute component=nova-vnc url | nova-vnc http endpoint status and response time |
object-server |
service=object-storage component=object-server url | object-server http endpoint status and response time |
object-storage-vip |
service=object-storage component=object-storage-vip url | object-storage-vip http endpoint status and response time |
octavia-api |
service=octavia component=octavia-api url | octavia-api http endpoint status and response time |
ops-console-web |
service=ops-console component=ops-console-web url | ops-console-web http endpoint status and response time |
proxy-server |
service=object-storage component=proxy-server url | proxy-server http endpoint status and response time |
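Because http_status reports 0 for success and 1 for failure, a simple availability alarm can trigger when an endpoint check fails for several consecutive periods. The following sketch is illustrative; the component dimension, alarm name, and the choice of three periods are assumptions to adapt to your environment:

ardana > monasca alarm-definition-create "nova-api endpoint failing" "max(http_status{component=nova-api}) > 0 times 3" --severity HIGH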
13.1.4.10 Kafka Metrics #
A list of metrics associated with the Kafka service.
Metric Name | Dimensions | Description |
---|---|---|
kafka.consumer_lag |
topic service component=kafka consumer_group hostname | Hostname consumer offset lag from broker offset |
13.1.4.11 Libvirt Metrics #
For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.1, “Libvirt plugin metric tuning knobs”.
A list of metrics associated with the Libvirt service.
Admin Metric Name | Project Metric Name | Dimensions | Description |
---|---|---|---|
vm.cpu.time_ns | cpu.time_ns |
zone service resource_id hostname component | Cumulative CPU time (in ns) |
vm.cpu.utilization_norm_perc | cpu.utilization_norm_perc |
zone service resource_id hostname component | Normalized CPU utilization (percentage) |
vm.cpu.utilization_perc | cpu.utilization_perc |
zone service resource_id hostname component | Overall CPU utilization (percentage) |
vm.io.errors | io.errors |
zone service resource_id hostname component | Overall disk I/O errors |
vm.io.errors_sec | io.errors_sec |
zone service resource_id hostname component | Disk I/O errors per second |
vm.io.read_bytes | io.read_bytes |
zone service resource_id hostname component | Disk I/O read bytes value |
vm.io.read_bytes_sec | io.read_bytes_sec |
zone service resource_id hostname component | Disk I/O read bytes per second |
vm.io.read_ops | io.read_ops |
zone service resource_id hostname component | Disk I/O read operations value |
vm.io.read_ops_sec | io.read_ops_sec |
zone service resource_id hostname component | Disk I/O read operations per second |
vm.io.write_bytes | io.write_bytes |
zone service resource_id hostname component | Disk I/O write bytes value |
vm.io.write_bytes_sec | io.write_bytes_sec |
zone service resource_id hostname component | Disk I/O write bytes per second |
vm.io.write_ops | io.write_ops |
zone service resource_id hostname component | Disk I/O write operations value |
vm.io.write_ops_sec | io.write_ops_sec |
zone service resource_id hostname component | Disk I/O write operations per second |
vm.net.in_bytes | net.in_bytes |
zone service resource_id hostname component device port_id | Network received total bytes |
vm.net.in_bytes_sec | net.in_bytes_sec |
zone service resource_id hostname component device port_id | Network received bytes per second |
vm.net.in_packets | net.in_packets |
zone service resource_id hostname component device port_id | Network received total packets |
vm.net.in_packets_sec | net.in_packets_sec |
zone service resource_id hostname component device port_id | Network received packets per second |
vm.net.out_bytes | net.out_bytes |
zone service resource_id hostname component device port_id | Network transmitted total bytes |
vm.net.out_bytes_sec | net.out_bytes_sec |
zone service resource_id hostname component device port_id | Network transmitted bytes per second |
vm.net.out_packets | net.out_packets |
zone service resource_id hostname component device port_id | Network transmitted total packets |
vm.net.out_packets_sec | net.out_packets_sec |
zone service resource_id hostname component device port_id | Network transmitted packets per second |
vm.ping_status | ping_status |
zone service resource_id hostname component | 0 for ping success, 1 for ping failure |
vm.disk.allocation | disk.allocation |
zone service resource_id hostname component | Total Disk allocation for a device |
vm.disk.allocation_total | disk.allocation_total |
zone service resource_id hostname component | Total Disk allocation across devices for instances |
vm.disk.capacity | disk.capacity |
zone service resource_id hostname component | Total Disk capacity for a device |
vm.disk.capacity_total | disk.capacity_total |
zone service resource_id hostname component | Total Disk capacity across devices for instances |
vm.disk.physical | disk.physical |
zone service resource_id hostname component | Total Disk usage for a device |
vm.disk.physical_total | disk.physical_total |
zone service resource_id hostname component | Total Disk usage across devices for instances |
vm.io.errors_total | io.errors_total |
zone service resource_id hostname component | Total Disk I/O errors across all devices |
vm.io.errors_total_sec | io.errors_total_sec |
zone service resource_id hostname component | Total Disk I/O errors per second across all devices |
vm.io.read_bytes_total | io.read_bytes_total |
zone service resource_id hostname component | Total Disk I/O read bytes across all devices |
vm.io.read_bytes_total_sec | io.read_bytes_total_sec |
zone service resource_id hostname component | Total Disk I/O read bytes per second across devices |
vm.io.read_ops_total | io.read_ops_total |
zone service resource_id hostname component | Total Disk I/O read operations across all devices |
vm.io.read_ops_total_sec | io.read_ops_total_sec |
zone service resource_id hostname component | Total Disk I/O read operations across all devices per sec |
vm.io.write_bytes_total | io.write_bytes_total |
zone service resource_id hostname component | Total Disk I/O write bytes across all devices |
vm.io.write_bytes_total_sec | io.write_bytes_total_sec |
zone service resource_id hostname component | Total Disk I/O Write bytes per second across devices |
vm.io.write_ops_total | io.write_ops_total |
zone service resource_id hostname component | Total Disk I/O write operations across all devices |
vm.io.write_ops_total_sec | io.write_ops_total_sec |
zone service resource_id hostname component | Total Disk I/O write operations across all devices per sec |
These metrics in libvirt are always enabled and cannot be disabled using the tuning knobs.
Admin Metric Name | Project Metric Name | Dimensions | Description |
---|---|---|---|
vm.host_alive_status | host_alive_status |
zone service resource_id hostname component |
-1 for no status, 0 for Running / OK, 1 for Idle / blocked, 2 for Paused, 3 for Shutting down, 4 for Shut off or nova suspend, 5 for Crashed, 6 for Power management suspend (S3 state) |
vm.mem.free_mb | mem.free_mb |
cluster service hostname | Free memory in Mbytes |
vm.mem.free_perc | mem.free_perc |
cluster service hostname | Percent of memory free |
vm.mem.resident_mb |
cluster service hostname | Total memory used on host, an Operations-only metric | |
vm.mem.swap_used_mb | mem.swap_used_mb |
cluster service hostname | Used swap space in Mbytes |
vm.mem.total_mb | mem.total_mb |
cluster service hostname | Total memory in Mbytes |
vm.mem.used_mb | mem.used_mb |
cluster service hostname | Used memory in Mbytes |
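To see how a particular instance behaves over time, you can query statistics for one of the libvirt metrics directly. The following is a minimal sketch; the instance UUID, start time (ISO 8601), and 300-second period are illustrative placeholders:

ardana > monasca metric-statistics vm.cpu.utilization_perc avg <START_TIME> --period 300 --dimensions resource_id=<INSTANCE_UUID>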
13.1.4.12 Monitoring Metrics #
A list of metrics associated with the Monitoring service.
Metric Name | Dimensions | Description |
---|---|---|
alarm-state-transitions-added-to-batch-counter |
service=monitoring url hostname component=monasca-persister | |
jvm.memory.total.max |
service=monitoring url hostname component | Maximum JVM overall memory |
jvm.memory.total.used |
service=monitoring url hostname component | Used JVM overall memory |
metrics-added-to-batch-counter |
service=monitoring url hostname component=monasca-persister | |
metrics.published |
service=monitoring url hostname component=monasca-api | Total number of published metrics |
monasca.alarms_finished_count |
hostname component=monasca-notification service=monitoring | Total number of alarms received |
monasca.checks_running_too_long |
hostname component=monasca-agent service=monitoring cluster | Only emitted when collection time for a check is too long |
monasca.collection_time_sec |
hostname component=monasca-agent service=monitoring cluster | Collection time in monasca-agent |
monasca.config_db_time |
hostname component=monasca-notification service=monitoring | |
monasca.created_count |
hostname component=monasca-notification service=monitoring | Number of notifications created |
monasca.invalid_type_count |
hostname component=monasca-notification service=monitoring | Number of notifications with invalid type |
monasca.log.in_bulks_rejected |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.in_logs |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.in_logs_bytes |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.in_logs_rejected |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.out_logs |
hostname component=monasca-log-api service=monitoring | |
monasca.log.out_logs_lost |
hostname component=monasca-log-api service=monitoring | |
monasca.log.out_logs_truncated_bytes |
hostname component=monasca-log-api service=monitoring | |
monasca.log.processing_time_ms |
hostname component=monasca-log-api service=monitoring | |
monasca.log.publish_time_ms |
hostname component=monasca-log-api service=monitoring | |
monasca.thread_count |
service=monitoring process_name hostname component | Number of threads monasca is using |
raw-sql.time.avg |
service=monitoring url hostname component | Average raw sql query time |
raw-sql.time.max |
service=monitoring url hostname component | Max raw sql query time |
13.1.4.13 Monasca Aggregated Metrics #
A list of the aggregated metrics associated with the monasca Transform feature.
Metric Name | For | Dimensions | Description |
---|---|---|---|
cpu.utilized_logical_cores_agg | Compute summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Utilized physical host cpu core capacity for one or all hosts by time interval (defaults to an hour). Available as total or per host |
cpu.total_logical_cores_agg | Compute summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Total physical host cpu core capacity for one or all hosts by time interval (defaults to an hour). Available as total or per host |
mem.total_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Total physical host memory capacity by time interval (defaults to an hour) |
mem.usable_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all | Usable physical host memory capacity by time interval (defaults to an hour) |
disk.total_used_space_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Utilized physical host disk capacity by time interval (defaults to an hour) |
disk.total_space_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all | Total physical host disk capacity by time interval (defaults to an hour) |
nova.vm.cpu.total_allocated_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
CPUs allocated across all virtual machines by time interval (defaults to an hour) |
vcpus_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Virtual CPUs allocated capacity for virtual machines of one or all projects by time interval (defaults to an hour). Available as total or per host |
nova.vm.mem.total_allocated_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Memory allocated to all virtual machines by time interval (defaults to an hour) |
vm.mem.used_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Memory utilized by virtual machines of one or all projects by time interval (defaults to an hour). Available as total or per host |
vm.mem.total_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Memory allocated to virtual machines of one or all projects by time interval (defaults to an hour). Available as total or per host |
vm.cpu.utilization_perc_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
CPU utilized by all virtual machines by project by time interval (defaults to an hour) |
nova.vm.disk.total_allocated_gb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Disk space allocated to all virtual machines by time interval (defaults to an hour) |
vm.disk.allocation_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Disk allocation for virtual machines of one or all projects by time interval (defaults to an hour). Available as total or per host |
swiftlm.diskusage.val.size_agg | Object Storage summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Total available object storage capacity by time interval (defaults to an hour). Available as total or per host |
swiftlm.diskusage.val.avail_agg | Object Storage summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Remaining object storage capacity by time interval (defaults to an hour). Available as total or per host |
swiftlm.diskusage.rate_agg | Object Storage summary |
aggregation_period: hourly host: all project_id: all |
Rate of change of object storage usage by time interval (defaults to an hour) |
storage.objects.size_agg | Object Storage summary |
aggregation_period: hourly host: all project_id: all |
Used object storage capacity by time interval (defaults to an hour) |
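Aggregated metrics are queried in the same way as any other metric; aggregation_period, host, and project_id are ordinary dimensions that can be used as filters. A minimal sketch, where the start time is an illustrative placeholder:

ardana > monasca measurement-list cpu.total_logical_cores_agg <START_TIME> --dimensions aggregation_period=hourly,host=all,project_id=all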
13.1.4.14 MySQL Metrics #
A list of metrics associated with the MySQL service.
Metric Name | Dimensions | Description |
---|---|---|
mysql.innodb.buffer_pool_free |
hostname mode service=mysql |
The number of free pages, in bytes. This value is calculated by
multiplying |
mysql.innodb.buffer_pool_total |
hostname mode service=mysql |
The total size of buffer pool, in bytes. This value is calculated by
multiplying |
mysql.innodb.buffer_pool_used |
hostname mode service=mysql |
The number of used pages, in bytes. This value is calculated by
subtracting |
mysql.innodb.current_row_locks |
hostname mode service=mysql |
Corresponding to current row locks of the server status variable. |
mysql.innodb.data_reads |
hostname mode service=mysql |
Corresponding to |
mysql.innodb.data_writes |
hostname mode service=mysql |
Corresponding to |
mysql.innodb.mutex_os_waits |
hostname mode service=mysql |
Corresponding to the OS waits of the server status variable. |
mysql.innodb.mutex_spin_rounds |
hostname mode service=mysql |
Corresponding to spinlock rounds of the server status variable. |
mysql.innodb.mutex_spin_waits |
hostname mode service=mysql |
Corresponding to the spin waits of the server status variable. |
mysql.innodb.os_log_fsyncs |
hostname mode service=mysql |
Corresponding to |
mysql.innodb.row_lock_time |
hostname mode service=mysql |
Corresponding to |
mysql.innodb.row_lock_waits |
hostname mode service=mysql |
Corresponding to |
mysql.net.connections |
hostname mode service=mysql |
Corresponding to |
mysql.net.max_connections |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_delete |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_delete_multi |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_insert |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_insert_select |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_replace_select |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_select |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_update |
hostname mode service=mysql |
Corresponding to |
mysql.performance.com_update_multi |
hostname mode service=mysql |
Corresponding to |
mysql.performance.created_tmp_disk_tables |
hostname mode service=mysql |
Corresponding to |
mysql.performance.created_tmp_files |
hostname mode service=mysql |
Corresponding to |
mysql.performance.created_tmp_tables |
hostname mode service=mysql |
Corresponding to |
mysql.performance.kernel_time |
hostname mode service=mysql |
The kernel time for the database's performance, in seconds. |
mysql.performance.open_files |
hostname mode service=mysql |
Corresponding to |
mysql.performance.qcache_hits |
hostname mode service=mysql |
Corresponding to |
mysql.performance.queries |
hostname mode service=mysql |
Corresponding to |
mysql.performance.questions |
hostname mode service=mysql |
Corresponding to |
mysql.performance.slow_queries |
hostname mode service=mysql |
Corresponding to |
mysql.performance.table_locks_waited |
hostname mode service=mysql |
Corresponding to |
mysql.performance.threads_connected |
hostname mode service=mysql |
Corresponding to |
mysql.performance.user_time |
hostname mode service=mysql |
The CPU user time for the database's performance, in seconds. |
13.1.4.15 NTP Metrics #
A list of metrics associated with the NTP service.
Metric Name | Dimensions | Description |
---|---|---|
ntp.connection_status |
hostname ntp_server | Value of ntp server connection status (0=Healthy) |
ntp.offset |
hostname ntp_server | Time offset in seconds |
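Because ntp.connection_status reports 0 when the connection to the NTP server is healthy, a basic NTP health alarm can be sketched as follows; the alarm name and severity are illustrative:

ardana > monasca alarm-definition-create "NTP server unreachable" "max(ntp.connection_status) > 0" --severity MEDIUM --match-by hostname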
13.1.4.16 Open vSwitch (OVS) Metrics #
A list of metrics associated with the OVS service.
For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.2, “OVS plugin metric tuning knobs”.
Admin Metric Name | Project Metric Name | Dimensions | Description |
---|---|---|---|
ovs.vrouter.in_bytes_sec | vrouter.in_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Inbound bytes per second for the router (if
|
ovs.vrouter.in_packets_sec | vrouter.in_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets per second for the router |
ovs.vrouter.out_bytes_sec | vrouter.out_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing bytes per second for the router (if
|
ovs.vrouter.out_packets_sec | vrouter.out_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets per second for the router |
ovs.vrouter.in_bytes | vrouter.in_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Inbound bytes for the router (if |
ovs.vrouter.in_packets | vrouter.in_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets for the router |
ovs.vrouter.out_bytes | vrouter.out_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing bytes for the router (if |
ovs.vrouter.out_packets | vrouter.out_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets for the router |
ovs.vrouter.in_dropped_sec | vrouter.in_dropped_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets per second for the router |
ovs.vrouter.in_errors_sec | vrouter.in_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Number of incoming errors per second for the router |
ovs.vrouter.out_dropped_sec | vrouter.out_dropped_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets per second for the router |
ovs.vrouter.out_errors_sec | vrouter.out_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Number of outgoing errors per second for the router |
ovs.vrouter.in_dropped | vrouter.in_dropped |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets for the router |
ovs.vrouter.in_errors | vrouter.in_errors |
service=networking resource_id component=ovs router_name port_id |
Number of incoming errors for the router |
ovs.vrouter.out_dropped | vrouter.out_dropped |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets for the router |
ovs.vrouter.out_errors | vrouter.out_errors |
service=networking resource_id tenant_id component=ovs router_name port_id |
Number of outgoing errors for the router |
Admin Metric Name | Tenant Metric Name | Dimensions | Description |
---|---|---|---|
ovs.vswitch.in_bytes_sec | vswitch.in_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Incoming Bytes per second on DHCP
port(if |
ovs.vswitch.in_packets_sec | vswitch.in_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets per second for the DHCP port |
ovs.vswitch.out_bytes_sec | vswitch.out_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing Bytes per second on DHCP
port(if |
ovs.vswitch.out_packets_sec | vswitch.out_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets per second for the DHCP port |
ovs.vswitch.in_bytes | vswitch.in_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Inbound bytes for the DHCP port (if |
ovs.vswitch.in_packets | vswitch.in_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets for the DHCP port |
ovs.vswitch.out_bytes | vswitch.out_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing bytes for the DHCP port (if |
ovs.vswitch.out_packets | vswitch.out_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets for the DHCP port |
ovs.vswitch.in_dropped_sec | vswitch.in_dropped_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped per second for the DHCP port |
ovs.vswitch.in_errors_sec | vswitch.in_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Incoming errors per second for the DHCP port |
ovs.vswitch.out_dropped_sec | vswitch.out_dropped_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets per second for the DHCP port |
ovs.vswitch.out_errors_sec | vswitch.out_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing errors per second for the DHCP port |
ovs.vswitch.in_dropped | vswitch.in_dropped |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets for the DHCP port |
ovs.vswitch.in_errors | vswitch.in_errors |
service=networking resource_id component=ovs router_name port_id |
Errors received for the DHCP port |
ovs.vswitch.out_dropped | vswitch.out_dropped |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets for the DHCP port |
ovs.vswitch.out_errors | vswitch.out_errors |
service=networking resource_id tenant_id component=ovs router_name port_id |
Errors transmitted for the DHCP port |
13.1.4.17 Process Metrics #
A list of metrics associated with processes.
Metric Name | Dimensions | Description |
---|---|---|
process.cpu_perc |
hostname service process_name component | Percentage of cpu being consumed by a process |
process.io.read_count |
hostname service process_name component | Number of reads by a process |
process.io.read_kbytes |
hostname service process_name component | Kbytes read by a process |
process.io.write_count |
hostname service process_name component | Number of writes by a process |
process.io.write_kbytes |
hostname service process_name component | Kbytes written by a process |
process.mem.rss_mbytes |
hostname service process_name component | Amount of physical memory allocated to a process, including memory from shared libraries in Mbytes |
process.open_file_descriptors |
hostname service process_name component | Number of files being used by a process |
process.pid_count |
hostname service process_name component | Number of processes that exist with this process name |
process.thread_count |
hostname service process_name component | Number of threads a process is using |
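Process metrics are commonly used to alarm on a service process disappearing from a node. The following sketch triggers when no nova-api process is found on a host; the alarm name and the choice of process are illustrative:

ardana > monasca alarm-definition-create "nova-api process missing" "min(process.pid_count{process_name=nova-api}) < 1" --match-by hostname --severity HIGH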
13.1.4.17.1 process.cpu_perc, process.mem.rss_mbytes, process.pid_count and process.thread_count metrics #
Component Name | Dimensions | Description |
---|---|---|
apache-storm |
service=monitoring process_name=monasca-thresh process_user=storm | apache-storm process info: cpu percent, memory, pid count and thread count |
barbican-api |
service=key-manager process_name=barbican-api | barbican-api process info: cpu percent, memory, pid count and thread count |
ceilometer-agent-notification |
service=telemetry process_name=ceilometer-agent-notification | ceilometer-agent-notification process info: cpu percent, memory, pid count and thread count |
ceilometer-polling |
service=telemetry process_name=ceilometer-polling | ceilometer-polling process info: cpu percent, memory, pid count and thread count |
cinder-api |
service=block-storage process_name=cinder-api | cinder-api process info: cpu percent, memory, pid count and thread count |
cinder-scheduler |
service=block-storage process_name=cinder-scheduler | cinder-scheduler process info: cpu percent, memory, pid count and thread count |
designate-api |
service=dns process_name=designate-api | designate-api process info: cpu percent, memory, pid count and thread count |
designate-central |
service=dns process_name=designate-central | designate-central process info: cpu percent, memory, pid count and thread count |
designate-mdns |
service=dns process_name=designate-mdns | designate-mdns process info: cpu percent, memory, pid count and thread count |
designate-pool-manager |
service=dns process_name=designate-pool-manager | designate-pool-manager process info: cpu percent, memory, pid count and thread count |
heat-api |
service=orchestration process_name=heat-api | heat-api process info: cpu percent, memory, pid count and thread count |
heat-api-cfn |
service=orchestration process_name=heat-api-cfn | heat-api-cfn process info: cpu percent, memory, pid count and thread count |
heat-api-cloudwatch |
service=orchestration process_name=heat-api-cloudwatch | heat-api-cloudwatch process info: cpu percent, memory, pid count and thread count |
heat-engine |
service=orchestration process_name=heat-engine | heat-engine process info: cpu percent, memory, pid count and thread count |
ipsec/charon |
service=networking process_name=ipsec/charon | ipsec/charon process info: cpu percent, memory, pid count and thread count |
keystone-admin |
service=identity-service process_name=keystone-admin | keystone-admin process info: cpu percent, memory, pid count and thread count |
keystone-main |
service=identity-service process_name=keystone-main | keystone-main process info: cpu percent, memory, pid count and thread count |
monasca-agent |
service=monitoring process_name=monasca-agent | monasca-agent process info: cpu percent, memory, pid count and thread count |
monasca-api |
service=monitoring process_name=monasca-api | monasca-api process info: cpu percent, memory, pid count and thread count |
monasca-notification |
service=monitoring process_name=monasca-notification | monasca-notification process info: cpu percent, memory, pid count and thread count |
monasca-persister |
service=monitoring process_name=monasca-persister | monasca-persister process info: cpu percent, memory, pid count and thread count |
monasca-transform |
service=monasca-transform process_name=monasca-transform | monasca-transform process info: cpu percent, memory, pid count and thread count |
neutron-dhcp-agent |
service=networking process_name=neutron-dhcp-agent | neutron-dhcp-agent process info: cpu percent, memory, pid count and thread count |
neutron-l3-agent |
service=networking process_name=neutron-l3-agent | neutron-l3-agent process info: cpu percent, memory, pid count and thread count |
neutron-metadata-agent |
service=networking process_name=neutron-metadata-agent | neutron-metadata-agent process info: cpu percent, memory, pid count and thread count |
neutron-openvswitch-agent |
service=networking process_name=neutron-openvswitch-agent | neutron-openvswitch-agent process info: cpu percent, memory, pid count and thread count |
neutron-rootwrap |
service=networking process_name=neutron-rootwrap | neutron-rootwrap process info: cpu percent, memory, pid count and thread count |
neutron-server |
service=networking process_name=neutron-server | neutron-server process info: cpu percent, memory, pid count and thread count |
neutron-vpn-agent |
service=networking process_name=neutron-vpn-agent | neutron-vpn-agent process info: cpu percent, memory, pid count and thread count |
nova-api |
service=compute process_name=nova-api | nova-api process info: cpu percent, memory, pid count and thread count |
nova-compute |
service=compute process_name=nova-compute | nova-compute process info: cpu percent, memory, pid count and thread count |
nova-conductor |
service=compute process_name=nova-conductor | nova-conductor process info: cpu percent, memory, pid count and thread count |
nova-novncproxy |
service=compute process_name=nova-novncproxy | nova-novncproxy process info: cpu percent, memory, pid count and thread count |
nova-scheduler |
service=compute process_name=nova-scheduler | nova-scheduler process info: cpu percent, memory, pid count and thread count |
octavia-api |
service=octavia process_name=octavia-api | octavia-api process info: cpu percent, memory, pid count and thread count |
octavia-health-manager |
service=octavia process_name=octavia-health-manager | octavia-health-manager process info: cpu percent, memory, pid count and thread count |
octavia-housekeeping |
service=octavia process_name=octavia-housekeeping | octavia-housekeeping process info: cpu percent, memory, pid count and thread count |
octavia-worker |
service=octavia process_name=octavia-worker | octavia-worker process info: cpu percent, memory, pid count and thread count |
org.apache.spark.deploy.master.Master |
service=spark process_name=org.apache.spark.deploy.master.Master | org.apache.spark.deploy.master.Master process info: cpu percent, memory, pid count and thread count |
org.apache.spark.executor.CoarseGrainedExecutorBackend |
service=monasca-transform process_name=org.apache.spark.executor.CoarseGrainedExecutorBackend | org.apache.spark.executor.CoarseGrainedExecutorBackend process info: cpu percent, memory, pid count and thread count |
pyspark |
service=monasca-transform process_name=pyspark | pyspark process info: cpu percent, memory, pid count and thread count |
transform/lib/driver |
service=monasca-transform process_name=transform/lib/driver | transform/lib/driver process info: cpu percent, memory, pid count and thread count |
cassandra |
service=cassandra process_name=cassandra | cassandra process info: cpu percent, memory, pid count and thread count |
13.1.4.17.2 process.io.*, process.open_file_descriptors metrics #
Component Name | Dimensions | Description |
---|---|---|
monasca-agent |
service=monitoring process_name=monasca-agent process_user=mon-agent | monasca-agent process info: number of reads, number of writes, number of files being used |
13.1.4.18 RabbitMQ Metrics #
A list of metrics associated with the RabbitMQ service.
Metric Name | Dimensions | Description |
---|---|---|
rabbitmq.exchange.messages.published_count |
hostname exchange vhost type service=rabbitmq |
Value of the "publish_out" field of "message_stats" object |
rabbitmq.exchange.messages.published_rate |
hostname exchange vhost type service=rabbitmq |
Value of the "rate" field of "message_stats/publish_out_details" object |
rabbitmq.exchange.messages.received_count |
hostname exchange vhost type service=rabbitmq |
Value of the "publish_in" field of "message_stats" object |
rabbitmq.exchange.messages.received_rate |
hostname exchange vhost type service=rabbitmq |
Value of the "rate" field of "message_stats/publish_in_details" object |
rabbitmq.node.fd_used |
hostname node service=rabbitmq |
Value of the "fd_used" field in the response of /api/nodes |
rabbitmq.node.mem_used |
hostname node service=rabbitmq |
Value of the "mem_used" field in the response of /api/nodes |
rabbitmq.node.run_queue |
hostname node service=rabbitmq |
Value of the "run_queue" field in the response of /api/nodes |
rabbitmq.node.sockets_used |
hostname node service=rabbitmq |
Value of the "sockets_used" field in the response of /api/nodes |
rabbitmq.queue.messages |
hostname queue vhost service=rabbitmq |
Sum of ready and unacknowledged messages (queue depth) |
rabbitmq.queue.messages.deliver_rate |
hostname queue vhost service=rabbitmq |
Value of the "rate" field of "message_stats/deliver_details" object |
rabbitmq.queue.messages.publish_rate |
hostname queue vhost service=rabbitmq |
Value of the "rate" field of "message_stats/publish_details" object |
rabbitmq.queue.messages.redeliver_rate |
hostname queue vhost service=rabbitmq |
Value of the "rate" field of "message_stats/redeliver_details" object |
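Queue depth is a common indicator of a consumer falling behind. The following sketch alarms when a RabbitMQ queue holds more than an illustrative threshold of 1000 ready and unacknowledged messages; the queue name and threshold are placeholders:

ardana > monasca alarm-definition-create "RabbitMQ queue backlog" "max(rabbitmq.queue.messages{queue=<QUEUE_NAME>}) > 1000" --match-by hostname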
13.1.4.19 Swift Metrics #
A list of metrics associated with the swift service.
Metric Name | Dimensions | Description |
---|---|---|
swiftlm.access.host.operation.get.bytes |
service=object-storage |
This metric is the number of bytes read from objects in GET requests processed by this host during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included. |
swiftlm.access.host.operation.ops |
service=object-storage |
This metric is a count of all the API requests made to swift that were processed by this host during the last minute. |
swiftlm.access.host.operation.project.get.bytes | ||
swiftlm.access.host.operation.project.ops | ||
swiftlm.access.host.operation.project.put.bytes | ||
swiftlm.access.host.operation.put.bytes |
service=object-storage |
This metric is the number of bytes written to objects in PUT or POST requests processed by this host during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included. |
swiftlm.access.host.operation.status | ||
swiftlm.access.project.operation.status |
service=object-storage |
This metric reports whether the swiftlm-access-log-tailer program is running normally. |
swiftlm.access.project.operation.ops |
tenant_id service=object-storage |
This metric is a count of all the API requests made to swift that were processed by this host during the last minute for a given project ID. |
swiftlm.access.project.operation.get.bytes |
tenant_id service=object-storage |
This metric is the number of bytes read from objects in GET requests processed by this host for a given project during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included. |
swiftlm.access.project.operation.put.bytes |
tenant_id service=object-storage |
This metric is the number of bytes written to objects in PUT or POST requests processed by this host for a given project during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included. |
swiftlm.async_pending.cp.total.queue_length |
observer_host service=object-storage |
This metric reports the total length of all async pending queues in the system. When a container update fails, the update is placed on the async pending queue. An update may fail because the container server is too busy or because the server is down or failed. Later the system will “replay” updates from the queue, so eventually the container listings will show all objects known to the system. If you know that container servers are down, it is normal to see the value of async pending increase. Once the server is restored, the value should return to zero. A non-zero value may also indicate that containers are too large. Look for “lock timeout” messages in /var/log/swift/swift.log. If you find such messages, consider reducing the container size or enabling rate limiting. |
swiftlm.check.failure |
check error component service=object-storage |
The total exception string is truncated if longer than 1919 characters and an ellipsis is prepended in the first three characters of the message. If there is more than one error reported, the list of errors is paired to the last reported error and the operator is expected to resolve failures until no more are reported. Where there are no further reported errors, the Value Class is emitted as ‘Ok’. |
swiftlm.diskusage.cp.avg.usage |
observer_host service=object-storage |
Is the average utilization of all drives in the system. The value is a percentage (example: 30.0 means 30% of the total space is used). |
swiftlm.diskusage.cp.max.usage |
observer_host service=object-storage |
Is the highest utilization of all drives in the system. The value is a percentage (example: 80.0 means at least one drive is 80% utilized). The value is just as important as swiftlm.diskusage.usage.avg. For example, if swiftlm.diskusage.usage.avg is 70% you might think that there is plenty of space available. However, if swiftlm.diskusage.usage.max is 100%, this means that some objects cannot be stored on that drive. swift will store replicas on other drives. However, this will create extra overhead. |
swiftlm.diskusage.cp.min.usage |
observer_host service=object-storage |
Is the lowest utilization of all drives in the system. The value is a percentage (example: 10.0 means at least one drive is 10% utilized) |
swiftlm.diskusage.cp.total.avail |
observer_host service=object-storage |
Is the size in bytes of available (unused) space of all drives in the system. Only drives used by swift are included. |
swiftlm.diskusage.cp.total.size |
observer_host service=object-storage |
Is the size in bytes of raw size of all drives in the system. |
swiftlm.diskusage.cp.total.used |
observer_host service=object-storage |
Is the size in bytes of used space of all drives in the system. Only drives used by swift are included. |
swiftlm.diskusage.host.avg.usage |
hostname service=object-storage |
This metric reports the average percent usage of all swift filesystems on a host. |
swiftlm.diskusage.host.max.usage |
hostname service=object-storage |
This metric reports the percent usage of a swift filesystem that is most used (full) on a host. The value is the max of the percentage used of all swift filesystems. |
swiftlm.diskusage.host.min.usage |
hostname service=object-storage |
This metric reports the percent usage of a swift filesystem that is least used (has free space) on a host. The value is the min of the percentage used of all swift filesystems. |
swiftlm.diskusage.host.val.avail |
hostname service=object-storage mount device label |
This metric reports the number of bytes available (free) in a swift filesystem. The value is an integer (units: Bytes) |
swiftlm.diskusage.host.val.size |
hostname service=object-storage mount device label |
This metric reports the size in bytes of a swift filesystem. The value is an integer (units: Bytes) |
swiftlm.diskusage.host.val.usage |
hostname service=object-storage mount device label |
This metric reports the percent usage of a swift filesystem. The value is a floating point number in range 0.0 to 100.0 |
swiftlm.diskusage.host.val.used |
hostname service=object-storage mount device label |
This metric reports the number of used bytes in a swift filesystem. The value is an integer (units: Bytes) |
swiftlm.load.cp.avg.five |
observer_host service=object-storage |
This is the averaged value of the five minutes system load average of all nodes in the swift system. |
swiftlm.load.cp.max.five |
observer_host service=object-storage |
This is the five minute load average of the busiest host in the swift system. |
swiftlm.load.cp.min.five |
observer_host service=object-storage |
This is the five minute load average of the least loaded host in the swift system. |
swiftlm.load.host.val.five |
hostname service=object-storage |
This metric reports the 5 minute load average of a host. The value is
derived from |
swiftlm.md5sum.cp.check.ring_checksums |
observer_host service=object-storage |
If you are in the middle of deploying new rings, it is normal for this to be in the failed state. However, if you are not in the middle of a deployment, you need to investigate the cause. Use “swift-recon --md5 -v” to identify the problem hosts. |
swiftlm.replication.cp.avg.account_duration |
observer_host service=object-storage |
This is the average across all servers for the account replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds. |
swiftlm.replication.cp.avg.container_duration |
observer_host service=object-storage |
This is the average across all servers for the container replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds. |
swiftlm.replication.cp.avg.object_duration |
observer_host service=object-storage |
This is the average across all servers for the object replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds. |
swiftlm.replication.cp.max.account_last |
hostname path service=object-storage |
This is the number of seconds since the account replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically, and hence this value will decrease whenever a replicator completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle. |
swiftlm.replication.cp.max.container_last |
hostname path service=object-storage |
This is the number of seconds since the container replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically and hence this value will decrease whenever a replicator completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle. |
swiftlm.replication.cp.max.object_last |
hostname path service=object-storage |
This is the number of seconds since the object replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically and hence this value will decrease whenever a replicator completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle. |
swiftlm.swift.drive_audit |
hostname service=object-storage mount_point kernel_device |
If an unrecoverable read error (URE) occurs on a filesystem, the error is logged in the kernel log. The swift-drive-audit program scans the kernel log looking for patterns indicating possible UREs. To get more information, log onto the node in question and run: sudo swift-drive-audit /etc/swift/drive-audit.conf UREs are common on large disk drives. They do not necessarily indicate that the drive has failed. You can use the xfs_repair command to attempt to repair the filesystem. Failing this, you may need to wipe the filesystem. If UREs occur very often on a specific drive, this may indicate that the drive is about to fail and should be replaced. |
swiftlm.swift.file_ownership.config |
hostname path service |
This metric reports if a directory or file has the appropriate owner. The check looks at swift configuration directories and files. It also looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects). |
swiftlm.swift.file_ownership.data |
hostname path service |
This metric reports if a directory or file has the appropriate owner. The check looks at swift configuration directories and files. It also looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects). |
swiftlm.swiftlm_check |
hostname service=object-storage |
This indicates whether the swiftlm monitoring checks themselves are reporting correctly on the host. |
swiftlm.swift.replication.account.last_replication |
hostname service=object-storage |
This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad. |
swiftlm.swift.replication.container.last_replication |
hostname service=object-storage |
This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad. |
swiftlm.swift.replication.object.last_replication |
hostname service=object-storage |
This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad. |
swiftlm.swift.swift_services |
hostname service=object-storage |
This metric reports whether the process named in the component dimension is running or not; the value_meta.msg indicates its state. |
swiftlm.swift.swift_services.check_ip_port |
hostname service=object-storage component | Reports if a service is listening on the correct IP address and port. |
swiftlm.systems.check_mounts |
hostname service=object-storage mount device label |
This metric reports the mount state of each drive that should be mounted on this node. |
swiftlm.systems.connectivity.connect_check |
observer_host url target_port service=object-storage |
This metric reports whether a server can connect to the VIPs used by the swift system. |
swiftlm.systems.connectivity.memcache_check |
observer_host hostname target_port service=object-storage |
This metric reports if memcached on the host as specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg are used: We successfully connected to <hostname> on port <target_port> { "dimensions": { "hostname": "ardana-ccp-c1-m1-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "target_port": "11211" }, "metric": "swiftlm.systems.connectivity.memcache_check", "timestamp": 1449084058, "value": 0, "value_meta": { "msg": "ardana-ccp-c1-m1-mgmt:11211 ok" } } We failed to connect to <hostname> on port <target_port> { "dimensions": { "fail_message": "[Errno 111] Connection refused", "hostname": "ardana-ccp-c1-m1-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "target_port": "11211" }, "metric": "swiftlm.systems.connectivity.memcache_check", "timestamp": 1449084150, "value": 2, "value_meta": { "msg": "ardana-ccp-c1-m1-mgmt:11211 [Errno 111] Connection refused" } } |
swiftlm.systems.connectivity.rsync_check |
observer_host hostname target_port service=object-storage |
This metric reports if rsyncd on the host as specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg are used: We successfully connected to <hostname> on port <target_port>: { "dimensions": { "hostname": "ardana-ccp-c1-m1-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "target_port": "873" }, "metric": "swiftlm.systems.connectivity.rsync_check", "timestamp": 1449082663, "value": 0, "value_meta": { "msg": "ardana-ccp-c1-m1-mgmt:873 ok" } } We failed to connect to <hostname> on port <target_port>: { "dimensions": { "fail_message": "[Errno 111] Connection refused", "hostname": "ardana-ccp-c1-m1-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "target_port": "873" }, "metric": "swiftlm.systems.connectivity.rsync_check", "timestamp": 1449082860, "value": 2, "value_meta": { "msg": "ardana-ccp-c1-m1-mgmt:873 [Errno 111] Connection refused" } } |
swiftlm.umon.target.avg.latency_sec |
component hostname observer_host service=object-storage url |
Reports the average value of N-iterations of the latency values recorded for a component. |
swiftlm.umon.target.check.state |
component hostname observer_host service=object-storage url |
This metric reports the state of each component after N-iterations of checks. If the initial check succeeds, the checks move onto the next component until all components are queried, then the checks sleep for ‘main_loop_interval’ seconds. If a check fails, it is retried every second for ‘retries’ number of times per component. If the check fails ‘retries’ times, it is reported as a fail instance. A successful state will be reported in JSON: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.check.state", "timestamp": 1453111805, "value": 0 }, A failed state will report a “fail” value and the value_meta will provide the http response error. { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.check.state", "timestamp": 1453112841, "value": 2, "value_meta": { "msg": "HTTPConnectionPool(host='192.168.245.9', port=8080): Max retries exceeded with url: /v1/AUTH_76538ce683654a35983b62e333001b47 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd857d7f550>: Failed to establish a new connection: [Errno 110] Connection timed out',))" } } |
swiftlm.umon.target.max.latency_sec |
component hostname observer_host service=object-storage url |
This metric reports the maximum response time in seconds of a REST call from the observer to the component REST API listening on the reported host. A response time query will be reported in JSON: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.max.latency_sec", "timestamp": 1453111805, "value": 0.2772650718688965 } A failed query will have a much longer time value: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.max.latency_sec", "timestamp": 1453112841, "value": 127.288015127182 } |
swiftlm.umon.target.min.latency_sec |
component hostname observer_host service=object-storage url |
This metric reports the minimum response time in seconds of a REST call from the observer to the component REST API listening on the reported host. A response time query will be reported in JSON: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.min.latency_sec", "timestamp": 1453111805, "value": 0.10025882720947266 } A failed query will have a much longer time value: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.min.latency_sec", "timestamp": 1453112841, "value": 127.25378203392029 } |
swiftlm.umon.target.val.avail_day |
component hostname observer_host service=object-storage url |
This metric reports the average of all the collected records in the swiftlm.umon.target.val.avail_minute metric data. This is a walking average data set of these approximately per-minute states of the swift Object Store. The most basic case is a whole day of successful per-minute records, which will average to 100% availability. If there is any downtime throughout the day resulting in gaps of data which are two minutes or longer, the per-minute availability data will be "back filled" with an assumption of a down state for all the per-minute records which did not exist during the non-reported time. Because this is a walking average of approximately 24 hours' worth of data, any outage will take 24 hours to be purged from the dataset. A 24-hour average availability report: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.val.avail_day", "timestamp": 1453645405, "value": 7.894736842105263 } |
swiftlm.umon.target.val.avail_minute |
component hostname observer_host service=object-storage url |
A value of 100 indicates that swift-uptime-monitor was able to get a token from keystone and was able to perform operations against the swift API during the reported minute. A value of zero indicates that either keystone or swift failed to respond successfully. A metric is produced every minute that swift-uptime-monitor is running. An “up” minute report value will report 100 [percent]: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.val.avail_minute", "timestamp": 1453645405, "value": 100.0 } A “down” minute report value will report 0 [percent]: { "dimensions": { "component": "rest-api", "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt", "observer_host": "ardana-ccp-c1-m1-mgmt", "service": "object-storage", "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080" }, "metric": "swiftlm.umon.target.val.avail_minute", "timestamp": 1453649139, "value": 0.0 } |
swiftlm.hp_hardware.hpssacli.smart_array.firmware |
hostname service=object-storage component model controller_slot |
This metric reports the firmware version of a component of a Smart Array controller. |
swiftlm.hp_hardware.hpssacli.smart_array |
hostname service=object-storage component sub_component model controller_slot |
This reports the status of various sub-components of a Smart Array Controller. A failure is considered to have occurred if any reported sub-component (for example, the controller, cache, or battery/capacitor) is not in its normal, OK state.
|
swiftlm.hp_hardware.hpssacli.physical_drive |
hostname service=object-storage component controller_slot box bay |
This reports the status of a disk drive attached to a Smart Array controller. |
swiftlm.hp_hardware.hpssacli.logical_drive |
component hostname observer_host service=object-storage controller_slot array logical_drive sub_component |
This reports the status of a LUN presented by a Smart Array controller. A LUN is considered failed if the LUN has failed or if the LUN cache is not enabled and working. |
The HPE Smart Storage Administrator (HPE SSA) CLI component must be installed on all control nodes that are swift nodes in order to generate the following swift metrics:
swiftlm.hp_hardware.hpssacli.smart_array
swiftlm.hp_hardware.hpssacli.logical_drive
swiftlm.hp_hardware.hpssacli.smart_array.firmware
swiftlm.hp_hardware.hpssacli.physical_drive
HPE-specific binaries that are not based on open source are distributed and supported directly by HPE. To download and install the SSACLI utility, refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f
After the HPE SSA CLI component is installed on the swift nodes, the metrics will be generated automatically during the next agent polling cycle. Manual reboot of the node is not required.
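Once the agent has picked these up, you can confirm that the Smart Array metrics are being reported. The following is only an illustrative check; it assumes the python-monascaclient is available on the Cloud Lifecycle Manager and that you have sourced credentials with the monasca-user role:
ardana >
monasca metric-list --name swiftlm.hp_hardware.hpssacli.smart_array
ardana >
monasca metric-list --name swiftlm.hp_hardware.hpssacli.physical_drive
If the metrics are listed with the expected hostname and controller_slot dimensions, the HPE SSA CLI integration is working.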
13.1.4.20 System Metrics #
A list of metrics associated with the System.
Metric Name | Dimensions | Description |
---|---|---|
cpu.frequency_mhz |
cluster hostname service=system |
Maximum MHz value of the CPU frequency. Note: this value is dynamic and driven by the CPU governor depending on current resource needs. |
cpu.idle_perc |
cluster hostname service=system |
Percentage of time the CPU is idle when no I/O requests are in progress |
cpu.idle_time |
cluster hostname service=system |
Time the CPU is idle when no I/O requests are in progress |
cpu.percent |
cluster hostname service=system |
Percentage of time the CPU is used in total |
cpu.stolen_perc |
cluster hostname service=system |
Percentage of stolen CPU time, that is, the time spent in other OS contexts when running in a virtualized environment |
cpu.system_perc |
cluster hostname service=system |
Percentage of time the CPU is used at the system level |
cpu.system_time |
cluster hostname service=system |
Time the CPU is used at the system level |
cpu.time_ns |
cluster hostname service=system |
Time the CPU is used at the host level |
cpu.total_logical_cores |
cluster hostname service=system |
Total number of logical cores available for an entire node (includes hyperthreading). Note: This is an optional metric that is only sent when send_rollup_stats is set to true. |
cpu.user_perc |
cluster hostname service=system |
Percentage of time the CPU is used at the user level |
cpu.user_time |
cluster hostname service=system |
Time the CPU is used at the user level |
cpu.wait_perc |
cluster hostname service=system |
Percentage of time the CPU is idle AND there is at least one I/O request in progress |
cpu.wait_time |
cluster hostname service=system |
Time the CPU is idle AND there is at least one I/O request in progress |
Metric Name | Dimensions | Description |
---|---|---|
disk.inode_used_perc |
mount_point service=system hostname cluster device |
The percentage of inodes that are used on a device |
disk.space_used_perc |
mount_point service=system hostname cluster device |
The percentage of disk space that is being used on a device |
disk.total_space_mb |
mount_point service=system hostname cluster device |
The total amount of disk space in Mbytes aggregated across all the disks on a particular node. Note: This is an optional metric that is only sent when send_rollup_stats is set to true. |
disk.total_used_space_mb |
mount_point service=system hostname cluster device |
The total amount of used disk space in Mbytes aggregated across all the disks on a particular node. Note: This is an optional metric that is only sent when send_rollup_stats is set to true. |
io.read_kbytes_sec |
mount_point service=system hostname cluster device |
Kbytes/sec read by an io device |
io.read_req_sec |
mount_point service=system hostname cluster device |
Number of read requests/sec to an io device |
io.read_time_sec |
mount_point service=system hostname cluster device |
Amount of read time in seconds to an io device |
io.write_kbytes_sec |
mount_point service=system hostname cluster device |
Kbytes/sec written by an io device |
io.write_req_sec |
mount_point service=system hostname cluster device |
Number of write requests/sec to an io device |
io.write_time_sec |
mount_point service=system hostname cluster device |
Amount of write time in seconds to an io device |
Metric Name | Dimensions | Description |
---|---|---|
load.avg_15_min |
service=system hostname cluster |
The normalized (by number of logical cores) average system load over a 15 minute period |
load.avg_1_min |
service=system hostname cluster |
The normalized (by number of logical cores) average system load over a 1 minute period |
load.avg_5_min |
service=system hostname cluster |
The normalized (by number of logical cores) average system load over a 5 minute period |
Metric Name | Dimensions | Description |
---|---|---|
mem.free_mb |
service=system hostname cluster |
Mbytes of free memory |
mem.swap_free_mb |
service=system hostname cluster |
Mbytes of swap memory that is free |
mem.swap_free_perc |
service=system hostname cluster |
Percentage of swap memory that is free |
mem.swap_total_mb |
service=system hostname cluster |
Mbytes of total physical swap memory |
mem.swap_used_mb |
service=system hostname cluster |
Mbytes of total swap memory used |
mem.total_mb |
service=system hostname cluster |
Total Mbytes of memory |
mem.usable_mb |
service=system hostname cluster |
Total Mbytes of usable memory |
mem.usable_perc |
service=system hostname cluster |
Percentage of total memory that is usable |
mem.used_buffers |
service=system hostname cluster |
Number of buffers in Mbytes being used by the kernel for block io |
mem.used_cache |
service=system hostname cluster |
Mbytes of memory used for the page cache |
mem.used_mb |
service=system hostname cluster |
Total Mbytes of used memory |
Metric Name | Dimensions | Description |
---|---|---|
net.in_bytes_sec |
service=system hostname device |
Number of network bytes received per second |
net.in_errors_sec |
service=system hostname device |
Number of network errors on incoming network traffic per second |
net.in_packets_dropped_sec |
service=system hostname device |
Number of inbound network packets dropped per second |
net.in_packets_sec |
service=system hostname device |
Number of network packets received per second |
net.out_bytes_sec |
service=system hostname device |
Number of network bytes sent per second |
net.out_errors_sec |
service=system hostname device |
Number of network errors on outgoing network traffic per second |
net.out_packets_dropped_sec |
service=system hostname device |
Number of outbound network packets dropped per second |
net.out_packets_sec |
service=system hostname device |
Number of network packets sent per second |
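To spot-check any of the system metrics above from the command line, you can query recent measurements with the monasca CLI. This is only a sketch; the host name and start time are placeholders you would replace with your own values:
ardana >
monasca measurement-list cpu.idle_perc 2023-01-01T00:00:00Z --dimensions hostname=ardana-ccp-c1-m1-mgmt
The same pattern works for any metric name listed in these tables, for example disk.space_used_perc or net.in_bytes_sec.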
13.1.4.21 Zookeeper Metrics #
A list of metrics associated with the Zookeeper service.
Metric Name | Dimensions | Description |
---|---|---|
zookeeper.avg_latency_sec |
hostname mode service=zookeeper | Average latency in seconds |
zookeeper.connections_count |
hostname mode service=zookeeper | Number of connections |
zookeeper.in_bytes |
hostname mode service=zookeeper | Received bytes |
zookeeper.max_latency_sec |
hostname mode service=zookeeper | Maximum latency in seconds |
zookeeper.min_latency_sec |
hostname mode service=zookeeper | Minimum latency in seconds |
zookeeper.node_count |
hostname mode service=zookeeper | Number of nodes |
zookeeper.out_bytes |
hostname mode service=zookeeper | Sent bytes |
zookeeper.outstanding_bytes |
hostname mode service=zookeeper | Outstanding bytes |
zookeeper.zxid_count |
hostname mode service=zookeeper | Count portion (lower 32 bits) of the current zxid |
zookeeper.zxid_epoch |
hostname mode service=zookeeper | Epoch portion (upper 32 bits) of the current zxid |
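The ZooKeeper values above are collected by the monasca agent from ZooKeeper's built-in statistics interface. If you want to compare them with the raw numbers, a quick sketch (assuming ZooKeeper is listening on its default client port 2181 on the node running the check) is:
ardana >
echo stat | nc localhost 2181
ardana >
monasca metric-list --name zookeeper.avg_latency_sec
The first command prints ZooKeeper's own latency and connection counters; the second confirms that the corresponding monasca metric is being published.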
13.2 Centralized Logging Service #
You can use the Centralized Logging Service to evaluate and troubleshoot your distributed cloud environment from a single location.
13.2.1 Getting Started with Centralized Logging Service #
A typical cloud consists of multiple servers which makes locating a specific log from a single server difficult. The Centralized Logging feature helps the administrator evaluate and troubleshoot the distributed cloud deployment from a single location.
The Logging API is a component in the centralized logging architecture. It works between log producers and log storage. In most cases it works by default after installation with no additional configuration. To use Logging API with logging-as-a-service, you must configure an end-point. This component adds flexibility and supportability for features in the future.
Do I need to configure monasca-log-api? If you are only using the Cloud Lifecycle Manager, then the default configuration is ready to use.
If you are using logging in any of the following deployments, then you will need to query keystone to get an end-point to use.
Logging as a Service
Platform as a Service
The Logging API is protected by keystone’s role-based access control. To ensure that logging is allowed and monasca alarms can be triggered, the user must have the monasca-user role. To get an end-point from keystone:
Log on to Cloud Lifecycle Manager (deployer node).
To list the Identity service catalog, run:
ardana >
source ./service.osrc
ardana >
openstack catalog list
In the output, find kronos. For example:
Name | Type | Endpoints |
---|---|---|
kronos | region0 | public: http://myardana.test:5607/v3.0, admin: http://192.168.245.5:5607/v3.0, internal: http://192.168.245.5:5607/v3.0 |
Use the same port number as found in the output. In the example, you would use port 5607.
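If you prefer to pull out just the kronos entry rather than scanning the full catalog, a sketch along the following lines can be used (it assumes the Logging API is registered under the service name kronos, as in the example above):
ardana >
openstack catalog show kronos
ardana >
openstack endpoint list --service kronos --interface internal
Either command shows the endpoint URLs, from which you can read off the port to use.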
In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.
It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.
For more information, see Section 13.2.4, “Managing the Centralized Logging Feature”.
13.2.1.1 For More Information #
For more information about the centralized logging components, see the following sites:
13.2.2 Understanding the Centralized Logging Service #
The Centralized Logging feature collects logs on a central system, rather than leaving the logs scattered across the network. The administrator can use a single Kibana interface to view log information in charts, graphs, tables, histograms, and other forms.
13.2.2.1 What Components are Part of Centralized Logging? #
Centralized logging consists of several components, detailed below:
Administrator's Browser: Operations Console can be used to access logging alarms or to access Kibana's dashboards to review logging data.
Apache Website for Kibana: A standard Apache website that proxies web/REST requests to the Kibana NodeJS server.
Beaver: A Python daemon that collects information in log files and sends it to the Logging API (monasca-log API) over a secure connection.
Cloud Auditing Data Federation (CADF): Defines a standard, full-event model anyone can use to fill in the essential data needed to certify, self-manage and self-audit application security in cloud environments.
Centralized Logging and Monitoring (CLM): Used to evaluate and troubleshoot your SUSE OpenStack Cloud distributed cloud environment from a single location.
Curator: a tool provided by Elasticsearch to manage indices.
Elasticsearch: A data store offering fast indexing and querying.
SUSE OpenStack Cloud: Provides public, private, and managed cloud solutions to get you moving on your cloud journey.
JavaScript Object Notation (JSON) log file: A file stored in the JSON format and used to exchange data. JSON uses JavaScript syntax, but the JSON format is text only. Text can be read and used as a data format by any programming language. This format is used by the Beaver and Logstash components.
Kafka: A messaging broker used for collection of SUSE OpenStack Cloud centralized logging data across nodes. It is highly available, scalable and performant. Kafka stores logs on disk instead of in memory and is therefore more tolerant to consumer downtimes.
Important: Make sure not to undersize your Kafka partition or the data retention period may be lower than expected. If the Kafka partition utilization goes above 85%, the retention period is reduced to 30 minutes. Over time Kafka will also eject old data.
Kibana: A client/server application with rich dashboards to visualize the data in Elasticsearch through a web browser. Kibana enables you to create charts and graphs using the log data.
Logging API (monasca-log-api): SUSE OpenStack Cloud API provides a standard REST interface to store logs. It uses keystone authentication and role-based access control support.
Logstash: A log processing system for receiving, processing and outputting logs. Logstash retrieves logs from Kafka, processes and enriches the data, then stores the data in Elasticsearch.
MML Service Node: Metering, Monitoring, and Logging (MML) service node. All services associated with metering, monitoring, and logging run on a dedicated three-node cluster. Three nodes are required for high availability with quorum.
Monasca: OpenStack monitoring at scale infrastructure for the cloud that supports alarms and reporting.
OpenStack Service: An OpenStack service process that requires logging services.
Oslo.log: An OpenStack library for standardized log handling used by the OpenStack services.
Platform as a Service (PaaS): Solutions whose library functions automate configuration, deployment and scaling of complete, ready-for-work application platforms. Some PaaS solutions, such as Cloud Foundry, combine operating systems, containers, and orchestrators with developer tools, operations utilities, metrics, and security to create a developer-rich solution.
Text log: A type of file used in the logging process that contains human-readable records.
These components are configured to work out-of-the-box and the admin should be able to view log data using the default configurations.
In addition to each of the services, Centralized Logging also processes logs for the following features:
HAProxy
Syslog
keepalived
The purpose of the logging service is to provide a common logging infrastructure with centralized user access. Since there are numerous services and applications running in each node of a SUSE OpenStack Cloud cloud, and there could be hundreds of nodes, all of these services and applications can generate enough log files to make it very difficult to search for specific events in log files across all of the nodes. Centralized Logging addresses this issue by sending log messages in real time to a central Elasticsearch, Logstash, and Kibana cluster. In this cluster they are indexed and organized for easier and visual searches. The following illustration describes the architecture used to collect operational logs.
The arrows come from the active (requesting) side to the passive (listening) side. The active side is always the one providing credentials, so the arrows may also be seen as coming from the credential holder to the application requiring authentication.
13.2.2.2 Steps 1- 2 #
Services configured to generate log files record the data. Beaver listens for changes to the files and sends the log files to the Logging Service. The first step the Logging service takes is to re-format the original log file to a new log file with text only and to remove all network operations. In Step 1a, the Logging service uses the Oslo.log library to re-format the file to text-only. In Step 1b, the Logging service uses the Python-Logstash library to format the original audit log file to a JSON file.
- Step 1a
Beaver watches configured service operational log files for changes and reads incremental log changes from the files.
- Step 1b
Beaver watches configured service operational log files for changes and reads incremental log changes from the files.
- Step 2a
The monascalog transport of Beaver makes a token request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.
- Step 2b
The monascalog transport of Beaver batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection. Failure logs are written to the local Beaver log.
- Step 2c
The REST API client for monasca-log-api makes a token-request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.
- Step 2d
The REST API client for monasca-log-api batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection.
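For troubleshooting, you can imitate what Beaver and the REST client do in steps 2a-2d by posting a test log entry directly to the Logging API. The sketch below is illustrative only: it assumes the kronos endpoint and port found in Section 13.2.1, credentials that carry the monasca-user role, and the v3.0 bulk log format.
ardana >
TOKEN=$(openstack token issue -f value -c id)
ardana >
curl -s -X POST http://192.168.245.5:5607/v3.0/logs \
  -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
  -d '{"dimensions": {"hostname": "ardana-ccp-c1-m1-mgmt", "service": "test"}, "logs": [{"message": "centralized logging smoke test"}]}'
A successful (2xx) response indicates the entry was accepted; it should then appear in Kibana after Logstash has processed it.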
13.2.2.3 Steps 3a- 3b #
The Logging API (monasca-log API) communicates with keystone to validate the incoming request, and then sends the logs to Kafka.
- Step 3a
The monasca-log-api WSGI pipeline is configured to validate incoming request tokens with keystone. The keystone middleware used for this purpose is configured to use the monasca-log-api admin user, password and project that have the required keystone role to validate a token.
- Step 3b
monasca-log-api sends log messages to Kafka using a language-agnostic TCP protocol.
13.2.2.4 Steps 4- 8 #
Logstash pulls messages from Kafka, identifies the log type, and transforms the messages into either the audit log format or operational format. Then Logstash sends the messages to Elasticsearch, using either an audit or operational indices.
- Step 4
Logstash input workers pull log messages from the Kafka-Logstash topic using TCP.
- Step 5
This Logstash filter processes the log message in-memory in the request pipeline. Logstash identifies the log type from this field.
- Step 6
This Logstash filter processes the log message in-memory in the request pipeline. If the message is of audit-log type, Logstash transforms it from the monasca-log-api envelope format to the original CADF format.
- Step 7
This Logstash filter determines which index should receive the log message. There are separate indices in Elasticsearch for operational versus audit logs.
- Step 8
Logstash output workers write the messages read from Kafka to the daily index in the local Elasticsearch instance.
13.2.2.5 Steps 9- 12 #
When an administrator who has access to the guest network accesses the Kibana client and makes a request, Apache forwards the request to the Kibana NodeJS server. Then the server uses the Elasticsearch REST API to service the client requests.
- Step 9
An administrator who has access to the guest network accesses the Kibana client to view and search log data. The request can originate from the external network in the cloud through a tenant that has a pre-defined access route to the guest network.
- Step 10
An administrator who has access to the guest network uses a web browser and points to the Kibana URL. This allows the user to search logs and view Dashboard reports.
- Step 11
The authenticated request is forwarded to the Kibana NodeJS server to render the required dashboard, visualization, or search page.
- Step 12
The Kibana NodeJS web server uses the Elasticsearch REST API in localhost to service the UI requests.
13.2.2.6 Steps 13- 15 #
Log data is backed-up and deleted in the final steps.
- Step 13
A daily cron job running in the ELK node runs curator to prune old Elasticsearch log indices.
- Step 14
The curator configuration is done at the deployer node through the Ansible role logging-common. Curator is scripted to then prune or clone old indices based on this configuration.
- Step 15
The audit logs must be backed up manually. For more information about Backup and Recovery, see Chapter 17, Backup and Restore.
13.2.2.7 How Long are Log Files Retained? #
The logs that are centrally stored are saved to persistent storage as Elasticsearch indices. These indices are stored in the partition /var/lib/elasticsearch on each of the Elasticsearch cluster nodes. Out of the box, logs are stored in one Elasticsearch index per service. As more days go by, the number of indices stored in this disk partition grows. Eventually the partition fills up. If they are open, each of these indices takes up CPU and memory. If these indices are left unattended they will continue to consume system resources and eventually deplete them. Elasticsearch, by itself, does not prevent this from happening.
SUSE OpenStack Cloud uses a tool called curator that is developed by the Elasticsearch community to handle these situations. SUSE OpenStack Cloud installs and uses a curator in conjunction with several configurable settings. This curator is called by cron and performs the following checks:
First Check. The hourly cron job checks to see if the currently used Elasticsearch partition size is over the value set in:
curator_low_watermark_percent
If it is higher than this value, the curator deletes old indices according to the value set in:
curator_num_of_indices_to_keep
Second Check. Another check is made to verify that the partition usage is below the high watermark percent. If it is still too high, curator deletes all indices, except the current one, that exceed the size set in:
curator_max_index_size_in_gb
Third Check. A third check verifies if the partition size is still too high. If it is, curator will delete all indices except the current one.
Final Check. A final check verifies if the partition size is still high. If it is, an error message is written to the log file but the current index is NOT deleted.
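To see how much disk the log indices are actually consuming before curator intervenes, you can ask Elasticsearch directly on one of the Elasticsearch nodes. This is a read-only sketch and assumes Elasticsearch is listening on its default port 9200 on localhost:
ardana >
curl -s 'http://localhost:9200/_cat/indices?v&h=index,docs.count,store.size'
ardana >
df -h /var/lib/elasticsearch
Comparing the per-index store sizes with the partition usage shows how close you are to the watermarks that trigger the checks above.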
In the case of an extreme network issue, log files can run out of disk space in under an hour. To avoid this, SUSE OpenStack Cloud uses a shell script called logrotate_if_needed.sh. The cron process runs this script every 5 minutes to see if the size of /var/log has exceeded the high_watermark_percent (95% of the disk, by default). If it is at or above this level, logrotate_if_needed.sh runs the logrotate script to rotate logs and to free up extra space. This script helps to minimize the chance of running out of disk space on /var/log.
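The sketch below illustrates the kind of watermark check that logrotate_if_needed.sh performs; it is not the actual script, and the 95% threshold and paths are taken from the defaults described above:
# Illustrative only: rotate logs if /var/log usage is at or above 95%.
USED=$(df --output=pcent /var/log | tail -1 | tr -dc '0-9')
if [ "$USED" -ge 95 ]; then
  sudo logrotate -f /etc/logrotate.conf   # force an immediate rotation
fi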
13.2.2.8 How Are Logs Rotated? #
SUSE OpenStack Cloud uses the cron process which in turn calls Logrotate to provide rotation, compression, and removal of log files. Each log file can be rotated hourly, daily, weekly, or monthly. If no rotation period is set then the log file will only be rotated when it grows too large.
Rotating a file means that the Logrotate process creates a copy of the log file with a new extension, for example, the .1 extension, then empties the contents of the original file. If a .1 file already exists, then that file is first renamed with a .2 extension. If a .2 file already exists, it is renamed to .3, etc., up to the maximum number of rotated files specified in the settings file. When Logrotate reaches the last possible file extension, it will delete the last file first on the next rotation. By the time the Logrotate process needs to delete a file, the results will have been copied to Elasticsearch, the central logging database.
The log rotation setting files can be found in the following directory:
~/scratch/ansible/next/ardana/ansible/roles/logging-common/vars
These files allow you to set the following options:
- Service
The name of the service that creates the log entries.
- Rotated Log Files
List of log files to be rotated. These files are kept locally on the server and will continue to be rotated. If the file is also listed as Centrally Logged, it will also be copied to Elasticsearch.
- Frequency
The timing of when the logs are rotated. Options include: hourly, daily, weekly, or monthly.
- Max Size
The maximum file size the log can be before it is rotated out.
- Rotation
The number of log files that are rotated.
- Centrally Logged Files
These files will be indexed by Elasticsearch and will be available for searching in the Kibana user interface.
Only files that are listed in the Centrally Logged Files section are copied to Elasticsearch.
All of the variables for the Logrotate process are found in the following file:
~/scratch/ansible/next/ardana/ansible/roles/logging-ansible/logging-common/defaults/main.yml
Cron runs Logrotate hourly. Every 5 minutes another process called "logrotate_if_needed" is run, which uses a watermark value to determine if the Logrotate process needs to be run. If the "high watermark" has been reached and the /var/log partition is more than 95% full (by default; this can be adjusted), then Logrotate will be run within 5 minutes.
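If you want to preview what a rotation pass would do on a node without touching any files, logrotate's debug mode provides a dry run. This is a generic logrotate feature, shown here against the system-wide configuration:
ardana >
sudo logrotate -d /etc/logrotate.conf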
13.2.2.9 Are Log Files Backed-Up To Elasticsearch? #
While centralized logging is enabled out of the box, the backup of these logs is not. The reason is because Centralized Logging relies on the Elasticsearch FileSystem Repository plugin, which in turn requires shared disk partitions to be configured and accessible from each of the Elasticsearch nodes. Since there are multiple ways to setup a shared disk partition, SUSE OpenStack Cloud allows you to choose an approach that works best for your deployment before enabling the back-up of log files to Elasticsearch.
If you enable automatic back-up of centralized log files, then all the logs collected from the cloud nodes will be backed up to Elasticsearch. Every hour, on the management controller nodes where Elasticsearch is set up, a cron job runs to check if Elasticsearch is running low on disk space. If the check succeeds, it further checks if the backup feature is enabled. If enabled, the cron job saves a snapshot of the Elasticsearch indices to the configured shared disk partition using curator. Next, the script starts deleting the oldest index and moves down from there, checking each time if there is enough space for Elasticsearch. A check is also made to ensure that the backup runs only once a day.
For steps on how to enable automatic back-up, see Section 13.2.5, “Configuring Centralized Logging”.
13.2.3 Accessing Log Data #
All logging data in SUSE OpenStack Cloud is managed by the Centralized Logging Service and can be viewed or analyzed by Kibana. Kibana is the only graphical interface provided with SUSE OpenStack Cloud to search or create a report from log data. Operations Console provides only a link to the Kibana Logging dashboard.
The following two methods allow you to access the Kibana Logging dashboard to search log data:
To learn more about Kibana, read the Getting Started with Kibana guide.
13.2.3.1 Use the Operations Console Link #
Operations Console allows you to access Kibana in the same tool that you use to manage the other SUSE OpenStack Cloud resources in your deployment. To use Operations Console, you must have the correct permissions.
To use Operations Console:
In a browser, open the Operations Console.
On the login page, enter the user name and the password, and then click LOG IN.
On the Home/Central Dashboard page, click the menu icon (three horizontal lines).
From the menu that slides in on the left, select Home, and then select Logging.
On the Home/Logging page, click View Logging Dashboard.
In SUSE OpenStack Cloud, Kibana usually runs on a different network than Operations Console. Due to this configuration, it is possible that using Operations Console to access Kibana will result in a “404 not found” error. This error only occurs if the user has access only to the public facing network.
13.2.3.2 Using Kibana to Access Log Data #
Kibana is an open-source, data-visualization plugin for Elasticsearch. Kibana provides visualization capabilities using the log content indexed on an Elasticsearch cluster. Users can create bar and pie charts, line and scatter plots, and maps using the data collected by SUSE OpenStack Cloud in the cloud log files.
While creating Kibana dashboards is beyond the scope of this document, it is important to know that the dashboards you create are JSON files, which you can modify or use as the basis for new dashboards.
Kibana is client-server software. To operate properly, the browser must be able to access port 5601 on the control plane.
Field | Default Value | Description |
---|---|---|
user | kibana |
Username that will be required for logging into the Kibana UI. |
password | random password is generated |
Password generated during installation that is used to login to the Kibana UI. |
13.2.3.3 Logging into Kibana #
To log into Kibana to view data, you must make sure you have the required login configuration.
Verify login credentials: Section 13.2.3.3.1, “Verify Login Credentials”
Find the randomized password: Section 13.2.3.3.2, “Find the Randomized Password”
Access Kibana using a direct link: Section 13.2.3.3.3, “Access Kibana Using a Direct Link:”
13.2.3.3.1 Verify Login Credentials #
During the installation of Kibana, a password is automatically set and it is randomized. Therefore, unless an administrator has already changed it, you need to retrieve the default password from a file on the control plane node.
13.2.3.3.2 Find the Randomized Password #
To find the Kibana password, run:
ardana >
grep kibana ~/scratch/ansible/next/my_cloud/stage/internal/CloudModel.yaml
13.2.3.3.3 Access Kibana Using a Direct Link: #
This section helps you verify the horizon virtual IP (VIP) address that you should use. To provide enhanced security, access to Kibana is not available on the External network.
To determine which IP address to use to access Kibana, from your Cloud Lifecycle Manager, run:
ardana >
grep HZN-WEB /etc/hosts
The output of the grep command should show you the virtual IP address for Kibana that you should use.
Important: If nothing is returned by the grep command, you can open the following file to look for the IP address manually:
/etc/hosts
Access to Kibana will be over port 5601 of that virtual IP address. Example:
https://VIP:5601
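To confirm that Kibana is answering on that VIP before you log in through a browser, a quick check from a host with access to that network might look like the following; VIP and the Kibana password are placeholders (see Section 13.2.3.3 for how to find the password):
ardana >
curl -k -u kibana:PASSWORD -o /dev/null -w '%{http_code}\n' https://VIP:5601/
An HTTP 200 (or redirect) response indicates that Kibana is reachable and the credentials are accepted.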
13.2.4 Managing the Centralized Logging Feature #
No specific configuration tasks are required to use Centralized Logging, as it is enabled by default after installation. However, you can configure the individual components as needed for your environment.
13.2.4.1 How Do I Stop and Start the Logging Service? #
Although you might not need to stop and start the logging service very often, you may need to if, for example, one of the logging services is not behaving as expected or not working.
You cannot enable or disable centralized logging across all services unless you stop all centralized logging. Instead, it is recommended that you enable or disable individual log files in the <service>-clr.yml files and then reconfigure logging. You would enable centralized logging for a file when you want to make sure you are able to monitor those logs in Kibana.
In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.
It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.
The steps in this section only impact centralized logging. Logrotate is an essential feature that keeps the service log files from filling the disk and will not be affected.
These playbooks must be run from the Cloud Lifecycle Manager.
To stop the Logging service:
To change to the directory containing the ansible playbook, run:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
To run the ansible playbook that will stop the logging service, run:
ardana >
ansible-playbook -i hosts/verb_hosts logging-stop.yml
To start the Logging service:
To change to the directory containing the ansible playbook, run:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
To run the ansible playbook that will start the logging service, run:
ardana >
ansible-playbook -i hosts/verb_hosts logging-start.yml
13.2.4.2 How Do I Enable or Disable Centralized Logging For a Service? #
To enable or disable Centralized Logging for a service you need to modify the configuration for the service, set the enabled flag to true or false, and then reconfigure logging.
There are consequences if you enable too many logging files for a service. If there is not enough storage to support the increased logging, the retention period of logs in Elasticsearch is decreased. Alternatively, if you wanted to increase the retention period of log files or if you did not want those logs to show up in Kibana, you would disable centralized logging for a file.
To enable Centralized Logging for a service:
Use the documentation provided with the service to ensure it is not configured for logging.
To find the SUSE OpenStack Cloud file to edit, run:
ardana >
find ~/openstack/my_cloud/config/logging/vars/ -name "*service-name*"
Edit the file for the service for which you want to enable logging.
To enable Centralized Logging, find the following code and change the enabled flag to true; to disable, change the enabled flag to false:
logging_options:
  - centralized_logging:
      enabled: true
      format: json
Save the changes to the file.
To commit the changes to your local git repository:
ardana >
cd ~/openstack/ardana/ansible
ardana >
git add -A
ardana >
git commit -m "My config or other commit message"
To reconfigure logging, run:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
ardana >
cd ~/openstack/ardana/ansible/
ardana >
ansible-playbook -i hosts/localhost config-processor-run.yml
ardana >
ansible-playbook -i hosts/localhost ready-deployment.yml
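To see at a glance which service log files currently have centralized logging enabled before you reconfigure, a convenience check (not part of the product tooling) is to search the same vars directory:
ardana >
grep -r -A 2 "centralized_logging" ~/openstack/my_cloud/config/logging/vars/ | grep enabled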
13.2.5 Configuring Centralized Logging #
You can adjust the settings for centralized logging when you are troubleshooting problems with a service or to decrease log size and retention to save on disk space. For steps on how to configure logging settings, refer to the following tasks:
13.2.5.1 Configuration Files #
Centralized Logging settings are stored in the configuration files in the
following directory on the Cloud Lifecycle Manager:
~/openstack/my_cloud/config/logging/
The configuration files and their use are described below:
File | Description |
---|---|
main.yml | Main configuration file for all centralized logging components. |
elasticsearch.yml.j2 | Main configuration file for Elasticsearch. |
elasticsearch-default.j2 | Default overrides for the Elasticsearch init script. |
kibana.yml.j2 | Main configuration file for Kibana. |
kibana-apache2.conf.j2 | Apache configuration file for Kibana. |
logstash.conf.j2 | Logstash inputs/outputs configuration. |
logstash-default.j2 | Default overrides for the Logstash init script. |
beaver.conf.j2 | Main configuration file for Beaver. |
vars | Path to logrotate configuration files. |
13.2.5.2 Planning Resource Requirements #
The Centralized Logging service needs to have enough resources available to it to perform adequately for different scale environments. The base logging levels are tuned during installation according to the amount of RAM allocated to your control plane nodes to ensure optimum performance.
These values can be viewed and changed in the ~/openstack/my_cloud/config/logging/main.yml file, but you will need to run a reconfigure of the Centralized Logging service if changes are made.
The total process memory consumption for Elasticsearch will be the allocated heap value (set in ~/openstack/my_cloud/config/logging/main.yml) plus any Java Virtual Machine (JVM) overhead.
Setting Disk Size Requirements
In the entry-scale models, the disk partition sizes on your controller nodes for the logging and Elasticsearch data are set as a percentage of your total disk size. You can see these in the following file on the Cloud Lifecycle Manager (deployer):
~/openstack/my_cloud/definition/data/<controller_disk_files_used>
Sample file settings:
# Local Log files.
- name: log
  size: 13%
  mount: /var/log
  fstype: ext4
  mkfs-opts: -O large_file
# Data storage for centralized logging. This holds log entries from all
# servers in the cloud and hence can require a lot of disk space.
- name: elasticsearch
  size: 30%
  mount: /var/lib/elasticsearch
  fstype: ext4
The disk size is set automatically based on the hardware configuration. If you need to adjust it, you can set it manually with the following steps.
To set disk sizes:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/definition/data/disks.yml
Make any desired changes.
Save the changes to the file.
To commit the changes to your local git repository:
ardana >
cd ~/openstack/ardana/ansible
ardana >
git add -A
ardana >
git commit -m "My config or other commit message"
To run the configuration processor:
ardana >
cd ~/openstack/ardana/ansible
ardana >
ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana >
cd ~/openstack/ardana/ansible
ardana >
ansible-playbook -i hosts/localhost ready-deployment.yml
To run the logging reconfigure playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
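After the reconfigure completes, you can confirm the resulting partition sizes and usage on a controller node. This is a plain OS-level check run on the node itself, not an Ardana playbook:
ardana >
df -h /var/log /var/lib/elasticsearch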
13.2.5.3 Backing Up Elasticsearch Log Indices #
The log files that are centrally collected in SUSE OpenStack Cloud are stored by Elasticsearch on disk in the /var/lib/elasticsearch partition. However, this is distributed across each of the Elasticsearch cluster nodes as shards. A cron job runs periodically to see if the disk partition runs low on space, and, if so, it runs curator to delete the old log indices to make room for new logs. This deletion is permanent and the logs are lost forever. If you want to back up old logs, for example to comply with certain regulations, you can configure automatic backup of Elasticsearch indices.
If you need to restore data that was archived prior to SUSE OpenStack Cloud 9 and used the older versions of Elasticsearch, then this data will need to be restored to a separate deployment of Elasticsearch.
This can be accomplished using the following steps:
Deploy a separate distinct Elasticsearch instance version matching the version in SUSE OpenStack Cloud.
Configure the backed-up data using NFS or some other share mechanism to be available to the Elasticsearch instance matching the version in SUSE OpenStack Cloud.
Before enabling automatic back-ups, make sure you understand how much disk space you will need, and configure the disks that will store the data. Use the following checklist to prepare your deployment for enabling automatic backups:
☐ | Item |
---|---|
☐ |
Add a shared disk partition to each of the Elasticsearch controller nodes. The default partition name used for backup is /var/lib/esbackup. You can change this by setting the curator_es_backup_partition variable in ~/openstack/my_cloud/config/logging/main.yml (see the steps below). |
☐ |
Ensure the shared disk has enough storage to retain backups for the desired retention period. |
To enable automatic back-up of centralized logs to Elasticsearch:
Log in to the Cloud Lifecycle Manager (deployer node).
Open the following file in a text editor:
~/openstack/my_cloud/config/logging/main.yml
Find the following variables:
curator_backup_repo_name: "es_{{host.my_dimensions.cloud_name}}"
curator_es_backup_partition: /var/lib/esbackup
To enable backup, change the curator_enable_backup value to true in the curator section:
curator_enable_backup: true
Save your changes and re-run the configuration processor:
ardana >
cd ~/openstack
ardana >
git add -A # Verify the added files
ardana >
git status
ardana >
git commit -m "Enabling Elasticsearch Backup"
ardana >
cd ~/openstack/ardana/ansible
ardana >
ansible-playbook -i hosts/localhost config-processor-run.yml
ardana >
ansible-playbook -i hosts/localhost ready-deployment.yml
To re-configure logging:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
To verify that the indices are backed up, check the contents of the partition:
ardana >
ls /var/lib/esbackup
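You can also ask Elasticsearch which snapshot repositories and snapshots it knows about. This sketch assumes Elasticsearch listens on its default port 9200 on the node and that the repository name follows the curator_backup_repo_name pattern shown above:
ardana >
curl -s 'http://localhost:9200/_snapshot?pretty'
ardana >
curl -s 'http://localhost:9200/_cat/snapshots/es_CLOUD_NAME?v'
Replace es_CLOUD_NAME with the repository name reported by the first command.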
13.2.5.4 Restoring Logs From an Elasticsearch Backup #
To restore logs from an Elasticsearch backup, see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-snapshots.html.
We do not recommend restoring to the original SUSE OpenStack Cloud Centralized Logging cluster as it may cause storage or capacity issues. Instead, we recommend setting up a separate ELK cluster of the same version and restoring the logs there.
13.2.5.5 Tuning Logging Parameters #
When centralized logging is installed in SUSE OpenStack Cloud, parameters for Elasticsearch heap size and logstash heap size are automatically configured based on the amount of RAM on the system. These values are typically the required values, but they may need to be adjusted if performance issues arise, or disk space issues are encountered. These values may also need to be adjusted if hardware changes are made after an installation.
These values are defined at the top of the following file: .../logging-common/defaults/main.yml. An example of the contents of the file is below:
# Select heap tunings based on system RAM
#-------------------------------------------------------------------------------
threshold_small_mb: 31000
threshold_medium_mb: 63000
threshold_large_mb: 127000
tuning_selector: " {% if ansible_memtotal_mb < threshold_small_mb|int %} demo {% elif ansible_memtotal_mb < threshold_medium_mb|int %} small {% elif ansible_memtotal_mb < threshold_large_mb|int %} medium {% else %} large {%endif %} "
logging_possible_tunings:
  # RAM < 32GB
  demo:
    elasticsearch_heap_size: 512m
    logstash_heap_size: 512m
  # RAM < 64GB
  small:
    elasticsearch_heap_size: 8g
    logstash_heap_size: 2g
  # RAM < 128GB
  medium:
    elasticsearch_heap_size: 16g
    logstash_heap_size: 4g
  # RAM >= 128GB
  large:
    elasticsearch_heap_size: 31g
    logstash_heap_size: 8g
logging_tunings: "{{ logging_possible_tunings[tuning_selector] }}"
This specifies thresholds for what a small, medium, or large system looks like in terms of memory. To see what values will be used, check how much RAM your system has and where it falls within the thresholds. To modify the values, you can either adjust the threshold values so that your system moves, for example, from a small configuration to a medium configuration, or keep the thresholds the same and modify the heap_size variables directly for the selector that matches your system. For example, if your system selects the medium configuration, which sets heap sizes to 16 GB for Elasticsearch and 4 GB for Logstash, and you want twice as much set aside for Logstash, you could increase the Logstash value from 4 GB to 8 GB.
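To check which tier a node will select, compare its total memory with the thresholds above. The simplest check is on the node itself; the Ansible fact ansible_memtotal_mb used in the selector corresponds approximately to the MemTotal value reported here:
ardana >
free -m
For example, a node reporting roughly 64000 MB total falls between threshold_medium_mb and threshold_large_mb and therefore selects the medium tuning.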
13.2.6 Configuring Settings for Other Services #
When you configure settings for the Centralized Logging Service, those changes impact all services that are enabled for centralized logging. However, if you only need to change the logging configuration for one specific service, you will want to modify the service's files instead of changing the settings for the entire Centralized Logging service. This topic helps you complete the following tasks:
13.2.6.1 Setting Logging Levels for Services #
When it is necessary to increase the logging level for a specific service to troubleshoot an issue, or to decrease logging levels to save disk space, you can edit the service's config file and then reconfigure logging. All changes will be made to the service's files and not to the Centralized Logging service files.
Messages only appear in the log files if they are as severe as or more severe than the log level you set. The DEBUG level logs everything. Most services default to the INFO logging level, which lists informational events plus warnings, errors, and critical errors. Some services provide additional logging options that narrow the focus, for example to help you debug an issue, warn you when an operation fails, or flag a serious issue with the cloud.
For more information on logging levels, see the OpenStack Logging Guidelines documentation.
13.2.6.2 Configuring the Logging Level for a Service #
If you want to increase or decrease the amount of details that are logged by a service, you can change the current logging level in the configuration files. Most services support, at a minimum, the DEBUG and INFO logging levels. For more information about what levels are supported by a service, check the documentation or Website for the specific service.
13.2.6.3 Barbican #
Service | Sub-component | Supported Logging Levels |
---|---|---|
barbican | barbican-api, barbican-worker | INFO (default), DEBUG |
To change the barbican logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/
ardana > vi my_cloud/config/barbican/barbican_deploy_config.yml

To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

barbican_loglevel: "{{ ardana_loglevel | default('INFO') }}"
barbican_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts barbican-reconfigure.yml
13.2.6.4 Block Storage (cinder) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
cinder | cinder-api, cinder-scheduler, cinder-backup, cinder-volume | INFO (default), DEBUG |
To manage cinder logging:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/ardana/ansible
ardana > vi roles/_CND-CMN/defaults/main.yml

To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

cinder_loglevel: {{ ardana_loglevel | default('INFO') }}
cinder_logstash_loglevel: {{ ardana_loglevel | default('INFO') }}
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml
13.2.6.5 Ceilometer #
Service | Sub-component | Supported Logging Levels |
---|---|---|
ceilometer | ceilometer-collector, ceilometer-agent-notification, ceilometer-polling, ceilometer-expirer | INFO (default), DEBUG |
To change the ceilometer logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/ardana/ansible
ardana > vi roles/_CEI-CMN/defaults/main.yml

To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

ceilometer_loglevel: INFO
ceilometer_logstash_loglevel: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml
13.2.6.6 Compute (nova) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
nova | | INFO (default), DEBUG |
To change the nova logging level:
Log in to the Cloud Lifecycle Manager.
The nova service component logging can be changed by modifying the following files:
~/openstack/my_cloud/config/nova/novncproxy-logging.conf.j2
~/openstack/my_cloud/config/nova/api-logging.conf.j2
~/openstack/my_cloud/config/nova/compute-logging.conf.j2
~/openstack/my_cloud/config/nova/conductor-logging.conf.j2
~/openstack/my_cloud/config/nova/scheduler-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
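For orientation, the handler sections in these templates look roughly like the sketch below. Only the level line is taken from the text above; the surrounding structure is illustrative, and the shipped templates contain additional handler options not shown here.

[handler_watchedfile]
# ... handler class and arguments as defined in the shipped template ...
level: DEBUG

[handler_logstash]
# ... handler class and arguments as defined in the shipped template ...
level: DEBUG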
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
13.2.6.7 Designate (DNS) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
designate | designate-api, designate-central, designate-mdns, designate-producer, designate-worker, designate-pool-manager, designate-zone-manager | INFO (default), DEBUG |
To change the designate logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/
ardana > vi my_cloud/config/designate/designate.conf.j2

To change the logging level, set the value of the following line:
debug = False
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts designate-reconfigure.yml
13.2.6.8 Identity (keystone) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
keystone | keystone | INFO (default), DEBUG, WARN, ERROR |
To change the keystone logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
keystone_loglevel: INFO
keystone_logstash_loglevel: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
13.2.6.9 Image (glance) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
glance | glance-api | INFO (default), DEBUG |
To change the glance logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/glance/glance-api-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml
13.2.6.10 Bare Metal (ironic) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
ironic | ironic-api-logging.conf.j2, ironic-conductor-logging.conf.j2 | INFO (default), DEBUG |
To change the ironic logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/ardana/ansible
ardana > vi roles/ironic-common/defaults/main.yml

To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.

ironic_api_loglevel: "{{ ardana_loglevel | default('INFO') }}"
ironic_api_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
ironic_conductor_loglevel: "{{ ardana_loglevel | default('INFO') }}"
ironic_conductor_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ironic-reconfigure.yml
13.2.6.11 Monitoring (monasca) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
monasca | monasca-persister, zookeeper, storm, monasca-notification, monasca-api, kafka, monasca-agent | WARN (default), INFO |
To change the monasca logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Monitoring service component logging can be changed by modifying the following files:
~/openstack/ardana/ansible/roles/monasca-persister/defaults/main.yml
~/openstack/ardana/ansible/roles/storm/defaults/main.yml
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
~/openstack/ardana/ansible/roles/monasca-api/defaults/main.yml
~/openstack/ardana/ansible/roles/kafka/defaults/main.yml
~/openstack/ardana/ansible/roles/monasca-agent/defaults/main.yml (for this file, you will need to add the variable)
To change the logging level, use ALL CAPS to set the desired level in the following line:
monasca_log_level: WARN
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml
13.2.6.12 Networking (neutron) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
neutron | neutron-server, dhcp-agent, l3-agent, metadata-agent, openvswitch-agent, ovsvapp-agent, sriov-agent, infoblox-ipam-agent, l2gateway-agent | INFO (default), DEBUG |
To change the neutron logging level:
Log in to the Cloud Lifecycle Manager.
The neutron service component logging can be changed by modifying the following files:
~/openstack/ardana/ansible/roles/neutron-common/templates/dhcp-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/infoblox-ipam-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/l2gateway-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/l3-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/metadata-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/openvswitch-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/sriov-agent-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
13.2.6.13 Object Storage (swift) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
swift | | INFO (default), DEBUG |
Currently it is not recommended to log at any level other than INFO.
13.2.6.14 Octavia #
Service | Sub-component | Supported Logging Levels |
---|---|---|
octavia | octavia-api, octavia-worker, octavia-hk, octavia-hm | INFO (default), DEBUG |
To change the Octavia logging level:
Log in to the Cloud Lifecycle Manager.
The Octavia service component logging can be changed by modifying the following files:
~/openstack/my_cloud/config/octavia/octavia-api-logging.conf.j2
~/openstack/my_cloud/config/octavia/octavia-worker-logging.conf.j2
~/openstack/my_cloud/config/octavia/octavia-hk-logging.conf.j2
~/openstack/my_cloud/config/octavia/octavia-hm-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml
13.2.6.15 Operations Console #
Service | Sub-component | Supported Logging Levels |
---|---|---|
opsconsole | ops-web, ops-mon | INFO (default), DEBUG |
To change the Operations Console logging level:
Log in to the Cloud Lifecycle Manager.
Open the following file:
~/openstack/ardana/ansible/roles/OPS-WEB/defaults/main.yml
To change the logging level, use ALL CAPS to set the desired level in the following line:
ops_console_loglevel: "{{ ardana_loglevel | default('INFO') }}"
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ops-console-reconfigure.yml
13.2.6.16 Orchestration (heat) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
heat | api-cfn, api, engine | INFO (default), DEBUG |
To change the heat logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/heat/*-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml
13.2.6.17 Magnum #
Service | Sub-component | Supported Logging Levels |
---|---|---|
magnum | api, conductor | INFO (default), DEBUG |
To change the Magnum logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following files:

~/openstack/my_cloud/config/magnum/api-logging.conf.j2
~/openstack/my_cloud/config/magnum/conductor-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts magnum-reconfigure.yml
13.2.6.18 File Storage (manila) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
manila | api | INFO (default), DEBUG |
To change the manila logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/manila/manila-logging.conf.j2
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
manila_loglevel: INFO
manila_logstash_loglevel: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts manila-reconfigure.yml
13.2.6.19 Selecting Files for Centralized Logging #
As you use SUSE OpenStack Cloud, you might find a need to redefine which log files are rotated on disk or transferred to centralized logging. These changes are all made in the centralized logging definition files.
SUSE OpenStack Cloud uses the logrotate service to provide rotation, compression, and
removal of log files. All of the tunable variables for the logrotate process
itself can be controlled in the following file:
~/openstack/ardana/ansible/roles/logging-common/defaults/main.yml
You can find the centralized logging definition files for each service in
the following directory:
~/openstack/ardana/ansible/roles/logging-common/vars
You can change log settings for a service by following these steps.
Log in to the Cloud Lifecycle Manager.
Open the *.yml file for the service or sub-component that you want to modify.
Using keystone, the Identity service as an example:
ardana > vi ~/openstack/ardana/ansible/roles/logging-common/vars/keystone-clr.yml

Consider the opening clause of the file:

sub_service:
  hosts: KEY-API
  name: keystone
  service: keystone
The hosts setting defines the role which will trigger this logrotate definition being applied to a particular host. It can use regular expressions for pattern matching, that is, NEU-.*.
The service setting identifies the high-level service name associated with this content, which will be used for determining log files' collective quotas for storage on disk.
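For example, a definition intended to apply to all neutron roles could use such a pattern in hosts. This fragment is illustrative only; it mirrors the keystone example above rather than reproducing a file shipped with the product:

sub_service:
  hosts: NEU-.*
  name: neutron
  service: neutron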
Verify logging is enabled by locating the following lines:
centralized_logging:
  enabled: true
  format: rawjson
Note: When possible, centralized logging is most effective on log files generated using logstash-formatted JSON. These files should specify format: rawjson. When only plaintext log files are available, format: json is appropriate. (This will cause their plaintext log lines to be wrapped in a json envelope before being sent to centralized logging storage.)
Observe log files selected for rotation:
- files:
    - /var/log/keystone/keystone.log
  log_rotate:
    - daily
    - maxsize 300M
    - rotate 7
    - compress
    - missingok
    - notifempty
    - copytruncate
    - create 640 keystone adm
Note: With the introduction of dynamic log rotation, the frequency (that is, daily) and file size threshold (that is, maxsize) settings no longer have any effect. The rotate setting may be easily overridden on a service-by-service basis.
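For example, to keep more archived files for keystone you could raise its rotate value in the definition shown above; the rest of the entry stays unchanged:

- files:
    - /var/log/keystone/keystone.log
  log_rotate:
    - daily
    - maxsize 300M
    - rotate 14   # keep 14 compressed archives instead of the default 7
    - compress
    - missingok
    - notifempty
    - copytruncate
    - create 640 keystone adm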
Commit any changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the logging reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
13.2.6.20 Controlling Disk Space Allocation and Retention of Log Files #
Each service is assigned a weighted allocation of the
/var/log
filesystem's capacity. When all its log files'
cumulative sizes exceed this allocation, a rotation is triggered for that
service's log files according to the behavior specified in the
/etc/logrotate.d/*
specification.
These specification files are auto-generated based on YML sources delivered with the Cloud Lifecycle Manager codebase. The source files can be edited and reapplied to control the allocation of disk space across services or the behavior during a rotation.
Disk capacity is allocated as a percentage of the total weighted value of all services running on a particular node. For example, if 20 services run on the same node, all with a default weight of 100, they will each be granted 1/20th of the log filesystem's capacity. If the configuration is updated to change one service's weight to 150, all the services' allocations will be adjusted to make it possible for that one service to consume 150% of the space available to other individual services.
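As a worked example of that weighting (the numbers below are illustrative only):

# 20 services share one node; 19 keep the default weight of 100, one is raised to 150:
#   total weight               = 19 * 100 + 150 = 2050
#   each default service gets  = 100 / 2050 ≈ 4.9 % of the log capacity
#   the re-weighted service    = 150 / 2050 ≈ 7.3 %, i.e. 1.5 times the others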
These policies are enforced by the script
/opt/kronos/rotate_if_exceeded_quota.py
, which will be
executed every 5 minutes via a cron job and will rotate the log files of any
services which have exceeded their respective quotas. When log rotation
takes place for a service, logs are generated to describe the activity in
/var/log/kronos/check_if_exceeded_quota.log
.
When logrotate is performed on a service, its existing log files are compressed and archived to make space available for fresh log entries. Once the number of archived log files exceeds that service's retention thresholds, the oldest files are deleted. Thus, longer retention thresholds (that is, 10 to 15) will result in more space in the service's allocated log capacity being used for historic logs, while shorter retention thresholds (that is, 1 to 5) will keep more space available for its active plaintext log files.
Use the following process to make adjustments to services' log capacity allocations or retention thresholds:
Navigate to the following directory on your Cloud Lifecycle Manager:
~/stack/scratch/ansible/next/ardana/ansible
Open and edit the service weights file:
ardana > vi roles/kronos-logrotation/vars/rotation_config.yml

Edit the service parameters to the desired values. Example:

cinder:
  weight: 300
  retention: 2
Note: A retention setting of default will use the recommended defaults for each service's log files.
Run the kronos-logrotation-deploy playbook:
ardana > ansible-playbook -i hosts/verb_hosts kronos-logrotation-deploy.yml

Verify that the quotas have been changed:

Log in to a node and check the contents of the file /opt/kronos/service_info.yml to see the active quotas for that node, and the specifications in /etc/logrotate.d/* for rotation thresholds.
13.2.6.21 Configuring Elasticsearch for Centralized Logging #
Elasticsearch includes some tunable options exposed in its configuration. SUSE OpenStack Cloud uses these options in Elasticsearch to prioritize indexing speed over search speed. SUSE OpenStack Cloud also configures Elasticsearch for optimal performance in low RAM environments. The options that SUSE OpenStack Cloud modifies are listed below along with an explanation about why they were modified.
These configurations are defined in the
~/openstack/my_cloud/config/logging/main.yml
file and are
implemented in the Elasticsearch configuration file
~/openstack/my_cloud/config/logging/elasticsearch.yml.j2
.
13.2.6.22 Safeguards for the Log Partitions Disk Capacity #
Because the logging partitions are at high risk of filling up over time, a condition that can cause many negative side effects on running services, it is important to safeguard against log files consuming 100% of the available capacity.
This protection is implemented by pairs of low/high
watermark thresholds, with values
established in
~/stack/scratch/ansible/next/ardana/ansible/roles/logging-common/defaults/main.yml
and applied by the kronos-logrotation-deploy
playbook.
var_log_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/log partition beyond which alarms will be triggered (visible to administrators in monasca).

var_log_high_watermark_percent (default: 95) defines how much capacity of the /var/log partition to make available for log rotation (in calculating weighted service allocations).

var_audit_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/audit partition beyond which alarm notifications will be triggered.

var_audit_high_watermark_percent (default: 95) sets a capacity level for the contents of the /var/audit partition which will cause log rotation to be forced according to the specification in /etc/auditlogrotate.conf.
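Expressed as variables in that defaults file, the default thresholds described above correspond to the following values (shown here only as a summary of the defaults quoted in the list):

var_log_low_watermark_percent: 80
var_log_high_watermark_percent: 95
var_audit_low_watermark_percent: 80
var_audit_high_watermark_percent: 95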
13.2.7 Audit Logging Overview #
Existing OpenStack service logging varies widely across services. Generally, log messages do not have enough detail about who is requesting the application program interface (API), or enough context-specific details about an action performed. Often details are not even consistently logged across various services, leading to inconsistent data formats being used across services. These issues make it difficult to integrate logging with existing audit tools and processes.
To help you monitor your workload and data in compliance with your corporate, industry or regional policies, SUSE OpenStack Cloud provides auditing support as a basic security feature. The audit logging can be integrated with customer Security Information and Event Management (SIEM) tools and support your efforts to correlate threat forensics.
The SUSE OpenStack Cloud audit logging feature uses Audit Middleware for Python services. This middleware works with OpenStack services that use the Paste Deploy system; most OpenStack services use the paste deploy mechanism to find and configure WSGI servers and applications. Using the paste deploy system allows auditing support to be added to services with minimal changes.

Audit logging is disabled by default in the cloudConfig file on the Cloud Lifecycle Manager, and it can only be enabled after SUSE OpenStack Cloud installation or upgrade.
The tasks in this section explain how to enable services for audit logging in your environment. SUSE OpenStack Cloud provides audit logging for the following services:
nova
barbican
keystone
cinder
ceilometer
neutron
glance
heat
For audit log backup information see Section 17.3.4, “Audit Log Backup and Restore”
13.2.7.1 Audit Logging Checklist #
Before enabling audit logging, make sure you understand how much disk space you will need, and configure the disks that will store the logging data. Use the following table to complete these tasks:
13.2.7.1.1 Frequently Asked Questions #
- How are audit logs generated?
The audit logs are created by services running in the cloud management controller nodes. The events that create auditing entries are formatted using a structure that is compliant with Cloud Auditing Data Federation (CADF) policies. The formatted audit entries are then saved to disk files. For more information, see the Cloud Auditing Data Federation Website.
- Where are audit logs stored?
We strongly recommend adding a dedicated disk volume for /var/audit.

If the disk templates for the controllers are not updated to create a separate volume for /var/audit, the audit logs will still be created in the root partition under the folder /var/audit. This could be problematic if the root partition does not have adequate space to hold the audit logs.

Warning: We recommend that you do not store audit logs in the /var/log volume. The /var/log volume is used for storing operational logs, and log rotation and alarms have been preconfigured for various services based on the size of this volume. Adding audit logs there may affect these, causing undesired alarms. It would also affect the retention times for the operational logs.

- Are audit logs centrally stored?
Yes. The existing operational log profiles have been configured to centrally log audit logs as well, once their generation has been enabled. The audit logs are stored in Elasticsearch indices separate from the operational logs.
- How long are audit log files retained?
By default, audit logs are configured to be retained for 7 days on disk. The audit logs are rotated each day and the rotated files are stored in a compressed format and retained up to 7 days (configurable). The backup service has been configured to back up the audit logs to a location outside of the controller nodes for much longer retention periods.
- Do I lose audit data if a management controller node goes down?
Yes. For this reason, it is strongly recommended that you back up the audit partition in each of the management controller nodes for protection against any data loss.
13.2.7.1.2 Estimate Disk Size #
The table below provides estimates from each service of audit log size generated per day. The estimates are provided for environments with 100 nodes, 300 nodes, and 500 nodes.
Service | Log File Size: 100 nodes | Log File Size: 300 nodes | Log File Size: 500 nodes |
---|---|---|---|
barbican | 2.6 MB | 4.2 MB | 5.6 MB |
keystone | 96 - 131 MB | 288 - 394 MB | 480 - 657 MB |
nova | 186 (with a margin of 46) MB | 557 (with a margin of 139) MB | 928 (with a margin of 232) MB |
ceilometer | 12 MB | 12 MB | 12 MB |
cinder | 2 - 250 MB | 2 - 250 MB | 2 - 250 MB |
neutron | 145 MB | 433 MB | 722 MB |
glance | 20 (with a margin of 8) MB | 60 (with a margin of 22) MB | 100 (with a margin of 36) MB |
heat | 432 MB (1 transaction per second) | 432 MB (1 transaction per second) | 432 MB (1 transaction per second) |
swift | 33 GB (700 transactions per second) | 102 GB (2100 transactions per second) | 172 GB (3500 transactions per second) |
13.2.7.1.3 Add disks to the controller nodes #
You need to add disks for the audit log partition to store the data in a secure manner. The steps to complete this task will vary depending on the type of server you are running. Please refer to the manufacturer’s instructions on how to add disks for the type of server node used by the management controller cluster. If you already have extra disks in the controller node, you can identify any unused one and use it for the audit log partition.
13.2.7.1.4 Update the disk template for the controller nodes #
Since audit logging is disabled by default, the audit volume groups in the disk templates are commented out. If you want to turn on audit logging, the template needs to be updated first. If it is not updated, there will be no back-up volume group. To update the disk template, you will need to copy templates from the examples folder to the definition folder and then edit the disk controller settings. Changes to the disk template used for provisioning cloud nodes must be made prior to deploying the nodes.
To update the disk controller template:
Log in to your Cloud Lifecycle Manager.
To copy the example templates folder, run the following command:
Important: If you already have the required templates in the definition folder, you can skip this step.

ardana > cp -r ~/openstack/examples/entry-scale-esx/* ~/openstack/my_cloud/definition/

To change to the data folder, run:

ardana > cd ~/openstack/my_cloud/definition/

To edit the disks controller settings, open the file that matches your server model and disk model in a text editor:
Model | File |
---|---|
entry-scale-kvm | disks_controller_1TB.yml, disks_controller_600GB.yml |
mid-scale | disks_compute.yml, disks_control_common_600GB.yml, disks_dbmq_600GB.yml, disks_mtrmon_2TB.yml, disks_mtrmon_4.5TB.yml, disks_mtrmon_600GB.yml, disks_swobj.yml, disks_swpac.yml |
To update the settings and enable an audit log volume group, edit the appropriate file(s) listed above and remove the '#' comments from these lines, confirming that they are appropriate for your environment.
- name: audit-vg
  physical-volumes:
    - /dev/sdz
  logical-volumes:
    - name: audit
      size: 95%
      mount: /var/audit
      fstype: ext4
      mkfs-opts: -O large_file
13.2.7.1.5 Save your changes #
To save your changes you will use the GIT repository to add the setup disk files.
To save your changes:
To change to the openstack directory, run:
ardana > cd ~/openstack

To add the new and updated files, run:

ardana > git add -A

To verify the files are added, run:

ardana > git status

To commit your changes, run:

ardana > git commit -m "Setup disks for audit logging"
13.2.7.2 Enable Audit Logging #
To enable audit logging you must edit your cloud configuration settings, save your changes and re-run the configuration processor. Then you can run the playbooks to create the volume groups and configure them.
In the ~/openstack/my_cloud/definition/cloudConfig.yml
file,
service names defined under enabled-services or disabled-services override
the default setting.
The following is an example of your audit-settings section:
# Disc space needs to be allocated to the audit directory before enabling
# auditing.
# Default can be either "disabled" or "enabled". Services listed in
# "enabled-services" and "disabled-services" override the default setting.
audit-settings:
  default: disabled
  #enabled-services:
  #  - keystone
  #  - barbican
  disabled-services:
    - nova
    - barbican
    - keystone
    - cinder
    - ceilometer
    - neutron
In this example, although the default setting for all services is disabled, keystone and barbican may be explicitly enabled by removing the comments from those lines; this setting overrides the default.
13.2.7.2.1 To edit the configuration file: #
Log in to your Cloud Lifecycle Manager.
To change to the cloud definition folder, run:
ardana > cd ~/openstack/my_cloud/definition

To edit the auditing settings, in a text editor, open the following file:

cloudConfig.yml

To enable audit logging, begin by uncommenting the enabled-services: block and, under enabled-services:, any service you want to enable for audit logging.
For example, keystone has been enabled in the following text:
Default cloudConfig.yml file:

audit-settings:
  default: disabled
  enabled-services:
  # - keystone

Enabling keystone audit logging:

audit-settings:
  default: disabled
  enabled-services:
    - keystone
To move the services you want to enable, comment out the service in the disabled section and add it to the enabled section. For example, barbican has been enabled in the following text:
cloudConfig.yml file:

audit-settings:
  default: disabled
  enabled-services:
    - keystone
  disabled-services:
    - nova
    # - keystone
    - barbican
    - cinder

Enabling barbican audit logging:

audit-settings:
  default: disabled
  enabled-services:
    - keystone
    - barbican
  disabled-services:
    - nova
    # - barbican
    # - keystone
    - cinder
13.2.7.2.2 To save your changes and run the configuration processor: #
To change to the openstack directory, run:
ardana > cd ~/openstack

To add the new and updated files, run:

ardana > git add -A

To verify the files are added, run:

ardana > git status

To commit your changes, run:

ardana > git commit -m "Enable audit logging"

To change to the directory with the ansible playbooks, run:

ardana > cd ~/openstack/ardana/ansible

To rerun the configuration processor, run:

ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
13.2.7.2.3 To create the volume group: #
To change to the directory containing the osconfig playbook, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible

To remove the stub file that osconfig uses to decide if the disks are already configured, run:

ardana > ansible -i hosts/verb_hosts KEY-API -a 'sudo rm -f /etc/openstack/osconfig-ran'

Important: The osconfig playbook uses the stub file to mark already configured disks as "idempotent." To stop osconfig from identifying your new disk as already configured, you must remove the stub file /etc/openstack/osconfig-ran before re-running the osconfig playbook.

To run the playbook that enables auditing for a service, run:

ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API

Important: The variable KEY-API is used as an example to cover the management controller cluster. To enable auditing for a service that is not run on the same cluster, add the service to the --limit flag in the above command. For example:

ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API:NEU-SVR
13.2.7.2.4 To Reconfigure services for audit logging: #
To change to the directory containing the service playbooks, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible

To run the playbook that reconfigures a service for audit logging, run:

ardana > ansible-playbook -i hosts/verb_hosts SERVICE_NAME-reconfigure.yml

For example, to reconfigure keystone for audit logging, run:

ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

Repeat steps 1 and 2 for each service you need to reconfigure.
Important: You must reconfigure each service that you enabled or disabled in the cloudConfig.yml file.
13.2.8 Troubleshooting #
For information on troubleshooting Central Logging, see Section 18.7.1, “Troubleshooting Centralized Logging”.
13.3 Metering Service (ceilometer) Overview #
The SUSE OpenStack Cloud metering service collects and provides access to OpenStack usage data that can be used for billing reporting such as showback and chargeback. The metering service can also provide general usage reporting. ceilometer acts as the central collection and data access service to the meters provided by all the OpenStack services. The data collected is available through the monasca API. ceilometer V2 API was deprecated in the Pike release upstream.
13.3.1 Metering Service New Functionality #
13.3.1.1 New Metering Functionality in SUSE OpenStack Cloud 9 #
ceilometer is now integrated with monasca, using it as the datastore.
The default meters and other items configured for ceilometer can now be modified and additional meters can be added. We recommend that users test overall SUSE OpenStack Cloud performance prior to deploying any ceilometer modifications to ensure the addition of new notifications or polling events does not negatively affect overall system performance.
ceilometer Central Agent (pollster) is now called Polling Agent and is configured to support HA (Active-Active).
Notification Agent has built-in HA (Active-Active) with support for pipeline transformers, but workload partitioning has been disabled in SUSE OpenStack Cloud
SWIFT Poll-based account level meters will be enabled by default with an hourly collection cycle.
Integration with centralized monitoring (monasca) and centralized logging
Support for upgrade and reconfigure operations
13.3.1.2 Limitations #
The number of metadata attributes that can be extracted from resource_metadata is limited to 16. This is the maximum number of fields in the metadata section of the monasca_field_definitions.yaml file for any service; the fields in the metadata.common and metadata.<service.meters> sections together cannot exceed 16.

Several network-related attributes are written with a colon (":") but are returned with a period ("."). For example, you can access a sample list using the following command:
ardana > source ~/service.osrc
ardana > ceilometer --debug sample-list network -q "resource_id=421d50a5-156e-4cb9-b404-d2ce5f32f18b;resource_metadata.provider.network_type=flat"

However, in the response you will see the following:
provider.network_type
instead of
provider:network_type
This limitation is known for the following attributes:
provider:network_type
provider:physical_network
provider:segmentation_id
ceilometer Expirer is not supported. Data retention expiration is handled by monasca with a default retention period of 45 days.
ceilometer Collector is not supported.
13.3.2 Understanding the Metering Service Concepts #
13.3.2.1 Ceilometer Introduction #
Before configuring the ceilometer Metering Service, it is important to understand how it works.
13.3.2.1.1 Metering Architecture #
SUSE OpenStack Cloud automatically configures ceilometer to use Logging and Monitoring Service (monasca) as its backend. ceilometer is deployed on the same control plane nodes as monasca.
The installation of ceilometer creates several management nodes running different metering components.
ceilometer Components on Controller nodes
This controller node is the first node of the highly available (HA) cluster.
ceilometer Sample Polling
Sample Polling is part of the Polling Agent. Messages are posted by the Notification Agent directly to the monasca API.
ceilometer Polling Agent
The Polling Agent is responsible for coordinating the polling activity. It
parses the pipeline.yml
configuration file and
identifies all the sources that need to be polled. The sources are then
evaluated using a discovery mechanism and all the sources are translated to
resources where a dedicated pollster can retrieve and publish data. At each
identified interval the discovery mechanism is triggered, the resource list
is composed, and the data is polled and sent to the queue.
ceilometer Collector No Longer Required
In previous versions, the collector was responsible for getting the samples/events from the RabbitMQ service and storing them in the main database. The ceilometer Collector is no longer enabled. Because the Notification Agent now posts the data directly to the monasca API, the collector is no longer required.
13.3.2.1.2 Meter Reference #
ceilometer collects basic information grouped into categories known as
meters
. A meter is the unique resource-usage measurement
of a particular OpenStack service. Each OpenStack service defines what type
of data is exposed for metering.
Each meter has the following characteristics:
Attribute | Description |
---|---|
Name | Description of the meter |
Unit of Measurement | The method by which the data is measured. For example: storage meters are defined in Gigabytes (GB) and network bandwidth is measured in Gigabits (Gb). |
Type | The type of the meter. OpenStack defines three types: cumulative, gauge, and delta. |
A meter is defined for every measurable resource. A meter can exist beyond the actual existence of a particular resource, such as an active instance, to provision long-cycle use cases such as billing.
For a list of meter types and default meters installed with SUSE OpenStack Cloud, see Section 13.3.3, “Ceilometer Metering Available Meter Types”
The most common meter submission method is notifications. With this method, each service sends the data from their respective meters on a periodic basis to a common notifications bus.
ceilometer, in turn, pulls all of the events from the bus and saves the notifications in a ceilometer-specific database. The period of time that the data is collected and saved is known as the ceilometer expiry and is configured during ceilometer installation. Each meter is collected from one or more samples, gathered from the messaging queue or polled by agents. The samples are represented by counter objects. Each counter has the following fields:
Attribute | Description |
---|---|
counter_name | Description of the counter |
counter_unit | The method by which the data is measured. For example: data can be defined in Gigabytes (GB) or for network bandwidth, measured in Gigabits (Gb). |
counter_type | The type of the counter. OpenStack defines three types: cumulative, gauge, and delta. |
counter_volume | The volume of data measured (CPU ticks, bytes transmitted, etc.). Not used for gauge counters. Set to a default value such as 1. |
resource_id | The identifier of the resource measured (UUID) |
project_id | The project (tenant) ID to which the resource belongs. |
user_id | The ID of the user who owns the resource. |
resource_metadata | Other data transmitted in the metering notification payload. |
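To make the field list concrete, a single hypothetical sample counter could be represented as follows. The values below are invented for illustration only; the field names come from the table above.

counter_name: cpu_util
counter_unit: "%"
counter_type: gauge
counter_volume: 12.5
resource_id: 00000000-0000-0000-0000-000000000000   # instance UUID (placeholder)
project_id: 11111111-2222-3333-4444-555555555555    # tenant that owns the resource
user_id: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee       # owner of the resource
resource_metadata:
  display_name: example-instance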
13.3.3 Ceilometer Metering Available Meter Types #
The Metering service contains three types of meters:
- Cumulative
A cumulative meter measures data over time (for example, instance hours).
- Gauge
A gauge measures discrete items (for example, floating IPs or image uploads) or fluctuating values (such as disk input or output).
- Delta
A delta measures change over time, for example, monitoring bandwidth.
Each meter is populated from one or more samples, which are gathered from the messaging queue (listening agent), polling agents, or push agents. Samples are populated by counter objects.
Each counter contains the following fields:
- name
the name of the meter
- type
the type of meter (cumulative, gauge, or delta)
- amount
the amount of data measured
- unit
the unit of measure
- resource
the resource being measured
- project ID
the project the resource is assigned to
- user
the user the resource is assigned to.
Note: The metering service shares the same high-availability proxy, messaging, and database clusters with the other Information services. To avoid unnecessarily high loads, see Section 13.3.8, “Optimizing the Ceilometer Metering Service”.
13.3.3.1 SUSE OpenStack Cloud Default Meters #
These meters are installed and enabled by default during a SUSE OpenStack Cloud installation. More information about ceilometer can be found at OpenStack ceilometer.
13.3.3.2 Compute (nova) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
vcpus | Gauge | vcpu | Instance ID | Notification | Number of virtual CPUs allocated to the instance |
memory | Gauge | MB | Instance ID | Notification | Volume of RAM allocated to the instance |
memory.resident | Gauge | MB | Instance ID | Pollster | Volume of RAM used by the instance on the physical machine |
memory.usage | Gauge | MB | Instance ID | Pollster | Volume of RAM used by the instance from the amount of its allocated memory |
cpu | Cumulative | ns | Instance ID | Pollster | CPU time used |
cpu_util | Gauge | % | Instance ID | Pollster | Average CPU utilization |
disk.read.requests | Cumulative | request | Instance ID | Pollster | Number of read requests |
disk.read.requests.rate | Gauge | request/s | Instance ID | Pollster | Average rate of read requests |
disk.write.requests | Cumulative | request | Instance ID | Pollster | Number of write requests |
disk.write.requests.rate | Gauge | request/s | Instance ID | Pollster | Average rate of write requests |
disk.read.bytes | Cumulative | B | Instance ID | Pollster | Volume of reads |
disk.read.bytes.rate | Gauge | B/s | Instance ID | Pollster | Average rate of reads |
disk.write.bytes | Cumulative | B | Instance ID | Pollster | Volume of writes |
disk.write.bytes.rate | Gauge | B/s | Instance ID | Pollster | Average rate of writes |
disk.root.size | Gauge | GB | Instance ID | Notification | Size of root disk |
disk.ephemeral.size | Gauge | GB | Instance ID | Notification | Size of ephemeral disk |
disk.device.read.requests | Cumulative | request | Disk ID | Pollster | Number of read requests |
disk.device.read.requests.rate | Gauge | request/s | Disk ID | Pollster | Average rate of read requests |
disk.device.write.requests | Cumulative | request | Disk ID | Pollster | Number of write requests |
disk.device.write.requests.rate | Gauge | request/s | Disk ID | Pollster | Average rate of write requests |
disk.device.read.bytes | Cumulative | B | Disk ID | Pollster | Volume of reads |
disk.device.read.bytes .rate | Gauge | B/s | Disk ID | Pollster | Average rate of reads |
disk.device.write.bytes | Cumulative | B | Disk ID | Pollster | Volume of writes |
disk.device.write.bytes .rate | Gauge | B/s | Disk ID | Pollster | Average rate of writes |
disk.capacity | Gauge | B | Instance ID | Pollster | The amount of disk that the instance can see |
disk.allocation | Gauge | B | Instance ID | Pollster | The amount of disk occupied by the instance on the host machine |
disk.usage | Gauge | B | Instance ID | Pollster | The physical size in bytes of the image container on the host |
disk.device.capacity | Gauge | B | Disk ID | Pollster | The amount of disk per device that the instance can see |
disk.device.allocation | Gauge | B | Disk ID | Pollster | The amount of disk per device occupied by the instance on the host machine |
disk.device.usage | Gauge | B | Disk ID | Pollster | The physical size in bytes of the image container on the host per device |
network.incoming.bytes | Cumulative | B | Interface ID | Pollster | Number of incoming bytes |
network.outgoing.bytes | Cumulative | B | Interface ID | Pollster | Number of outgoing bytes |
network.incoming.packets | Cumulative | packet | Interface ID | Pollster | Number of incoming packets |
network.outgoing.packets | Cumulative | packet | Interface ID | Pollster | Number of outgoing packets |
13.3.3.3 Compute Host Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
compute.node.cpu.frequency | Gauge | MHz | Host ID | Notification | CPU frequency |
compute.node.cpu.kernel.time | Cumulative | ns | Host ID | Notification | CPU kernel time |
compute.node.cpu.idle.time | Cumulative | ns | Host ID | Notification | CPU idle time |
compute.node.cpu.user.time | Cumulative | ns | Host ID | Notification | CPU user mode time |
compute.node.cpu.iowait.time | Cumulative | ns | Host ID | Notification | CPU I/O wait time |
compute.node.cpu.kernel.percent | Gauge | % | Host ID | Notification | CPU kernel percentage |
compute.node.cpu.idle.percent | Gauge | % | Host ID | Notification | CPU idle percentage |
compute.node.cpu.user.percent | Gauge | % | Host ID | Notification | CPU user mode percentage |
compute.node.cpu.iowait.percent | Gauge | % | Host ID | Notification | CPU I/O wait percentage |
compute.node.cpu.percent | Gauge | % | Host ID | Notification | CPU utilization |
13.3.3.4 Image (glance) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
image.size | Gauge | B | Image ID | Notification | Uploaded image size |
image.update | Delta | Image | Image ID | Notification | Number of updates on the image |
image.upload | Delta | Image | Image ID | Notification | Number of uploads of the image |
image.delete | Delta | Image | Image ID | Notification | Number of deletes on the image |
13.3.3.5 Volume (cinder) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
volume.size | Gauge | GB | Vol ID | Notification | Size of volume |
snapshot.size | Gauge | GB | Snap ID | Notification | Size of snapshot's volume |
13.3.3.6 Storage (swift) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
storage.objects | Gauge | Object | Storage ID | Pollster | Number of objects |
storage.objects.size | Gauge | B | Storage ID | Pollster | Total size of stored objects |
storage.objects.containers | Gauge | Container | Storage ID | Pollster | Number of containers |
The resource_id
for any ceilometer query is the
tenant_id
for the swift object because swift usage is
rolled up at the tenant level.
13.3.4 Configure the Ceilometer Metering Service #
SUSE OpenStack Cloud 9 automatically deploys ceilometer to use the monasca database. ceilometer is deployed on the same control plane nodes along with other OpenStack services such as keystone, nova, neutron, glance, and swift.
The Metering Service can be configured using one of the procedures described below.
13.3.4.1 Run the Upgrade Playbook #
Follow the standard service upgrade mechanism available in the Cloud Lifecycle Manager distribution. For ceilometer, the playbook included with SUSE OpenStack Cloud is ceilometer-upgrade.yml.
13.3.4.2 Enable Services for Messaging Notifications #
After installation of SUSE OpenStack Cloud, the following services are enabled by default to send notifications:
nova
cinder
glance
neutron
swift
The list of meters for these services is specified in the Notification Agent's or Polling Agent's pipeline configuration file.
For steps on how to edit the pipeline configuration files, see: Section 13.3.5, “Ceilometer Metering Service Notifications”
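For orientation, the upstream ceilometer pipeline.yaml layout pairs sources (which meters to collect and how often to poll) with sinks (where samples are published). The fragment below is a simplified illustration of that layout, not the exact file shipped with SUSE OpenStack Cloud:

sources:
  - name: meter_source
    interval: 600        # polling interval in seconds
    meters:
      - "*"
    sinks:
      - meter_sink
sinks:
  - name: meter_sink
    transformers:
    publishers:
      - notifier://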
13.3.4.3 Restart the Polling Agent #
The Polling Agent is responsible for coordinating the polling activity. It parses the pipeline.yml configuration file and identifies all the sources where data is collected. The sources are then evaluated and are translated to resources that a dedicated pollster can retrieve. The Polling Agent follows this process:
At each identified interval, the pipeline.yml configuration file is parsed.
The resource list is composed.
The pollster collects the data.
The pollster sends data to the queue.
Metering processes should normally be running at all times. In SUSE OpenStack Cloud the metering agents are managed as systemd services (as shown in the examples below); systemd starts and stops the processes as required and can restart stopped processes automatically. To stop or start the Polling Agent without conflicting with the service manager, use the following steps.
To restart the Polling Agent:
To determine whether the process is running, run:
tux > sudo systemctl status ceilometer-agent-notification

#SAMPLE OUTPUT:
ceilometer-agent-notification.service - ceilometer-agent-notification Service
   Loaded: loaded (/etc/systemd/system/ceilometer-agent-notification.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-06-12 05:07:14 UTC; 2 days ago
 Main PID: 31529 (ceilometer-agen)
    Tasks: 69
   CGroup: /system.slice/ceilometer-agent-notification.service
           ├─31529 ceilometer-agent-notification: master process [/opt/stack/service/ceilometer-agent-notification/venv/bin/ceilometer-agent-notification --config-file /opt/stack/service/ceilometer-agent-noti...
           └─31621 ceilometer-agent-notification: NotificationService worker(0)

Jun 12 05:07:14 ardana-qe201-cp1-c1-m2-mgmt systemd[1]: Started ceilometer-agent-notification Service.

To stop the process, run:

tux > sudo systemctl stop ceilometer-agent-notification

To start the process, run:

tux > sudo systemctl start ceilometer-agent-notification
13.3.4.4 Replace a Logging, Monitoring, and Metering Controller #
In a medium-scale environment, if a metering controller has to be replaced or rebuilt, use the following steps:
If the ceilometer nodes are not on the shared control plane, you must reconfigure ceilometer to implement the changes and replace the controller. To do this, run the ceilometer-reconfigure.yml Ansible playbook without the limit option, as shown in the example below.
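For example (the playbook location matches the one used in Section 13.3.5.3; omit any --limit option so that all nodes are reconfigured):
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml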
13.3.4.5 Configure Monitoring #
The monasca HTTP process monitors ceilometer's notification and polling agents. If these agents are down, monasca monitoring alarms are triggered. You can use the notification alarms to debug the issue and restart the notification agent. However, for the Central-Agent (polling) and the Collector, the alarms need to be deleted. These two processes are not started after an upgrade, so when the monitoring process checks the alarms for these components, they will be in the UNDETERMINED state. SUSE OpenStack Cloud no longer monitors these processes. To resolve this issue, manually delete alarms that are still installed but no longer used.
To resolve notification alarms, first check the ceilometer-agent-notification logs for errors in the /var/log/ceilometer directory. You can also use the Operations Console to access Kibana and check the logs. This will help you understand and debug the error.
To restart the service, run the ceilometer-start.yml playbook (see the example below). This playbook starts only the ceilometer processes that have stopped; processes are otherwise restarted only during install, upgrade, or reconfigure, which is what is needed in this case. Restarting the stopped process resolves this alarm, because the monasca alarm means that ceilometer-agent-notification is no longer running on certain nodes.
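For example, assuming the same playbook directory as the other ceilometer playbooks in this chapter:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-start.yml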
You can access ceilometer data through monasca. ceilometer publishes samples to monasca with credentials of the following accounts:
ceilometer user
services
Data collected by ceilometer can also be retrieved by the monasca REST API. Make sure you use the following guidelines when requesting data from the monasca REST API:
Verify you have the monasca-admin role. This role is configured in the monasca-api configuration file.
Specify the tenant id of the services project.
For more details, read the monasca API Specification.
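As a minimal sketch (the monasca API host, port, and project ID below are placeholders, not values from this document), a request for metrics scoped to the services project could look like this:
# Obtain a token with credentials that carry the monasca-admin role
ardana > TOKEN=$(openstack token issue -f value -c id)
# Query the monasca metrics resource, scoped to the services project
ardana > curl -s -H "X-Auth-Token: $TOKEN" \
    "http://MONASCA_API_HOST:8070/v2.0/metrics?tenant_id=SERVICES_PROJECT_ID"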
To run monasca commands at the command line, you must have the admin role. This allows you to use the ceilometer account credentials in place of the default admin account credentials defined in the service.osrc file. When you use the ceilometer account credentials, monasca commands will only return data collected by ceilometer. At this time, the monasca command-line interface (CLI) does not support retrieving data of other tenants or projects.
13.3.5 Ceilometer Metering Service Notifications #
ceilometer uses the notification agent to listen to the message queue, convert notifications to Events and Samples, and apply pipeline actions.
13.3.5.1 Manage Whitelisting and Polling #
SUSE OpenStack Cloud is designed to reduce the amount of data that is stored. SUSE OpenStack Cloud's use of a SQL-based cluster, which is not recommended for big data, means you must control the data that ceilometer collects. You can do this by filtering (whitelisting) the data or by using the configuration files for the ceilometer Polling Agent and the ceilometer Notification Agent.
Whitelisting is used in a rule specification as a positive filtering parameter. Whitelists are only included in rules that can be used in direct mappings, covering identity service concerns such as service discovery and the provisioning of users, groups, roles, projects, and domains, as well as user authentication and authorization.
You can run tests against specific scenarios to see if filtering reduces the amount of data stored. You can create a test by editing or creating a run filter file (whitelist). For steps on how to do this, see: Section 38.1, “API Verification”.
ceilometer Polling Agent (polling agent) and ceilometer Notification Agent (notification agent) use different pipeline.yaml files to configure meters that are collected. This prevents accidentally polling for meters which can be retrieved by the polling agent as well as the notification agent. For example, glance image and image.size are meters which can be retrieved both by polling and notifications.
In both of the separate configuration files, there is a setting for interval. The interval attribute determines the frequency, in seconds, at which data is collected. You can use this setting to control the amount of resources that are used for notifications and for polling. For example, if you want to use more resources for notifications and fewer for polling, set the interval in the polling configuration file to a large value, such as 604800 seconds, which polls only once a week. Then, in the notifications configuration file, you can set the interval to a much smaller value, such as 30 seconds, so that data is collected every 30 seconds.
swift account data will be collected using the polling mechanism in an hourly interval.
Setting this interval to manage both notifications and polling is the recommended procedure when using a SQL cluster back-end.
Sample ceilometer Polling Agent file:
#File: ~/opt/stack/service/ceilometer-polling/etc/pipeline-polling.yaml
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Sample ceilometer Notification Agent(notification agent) file:
#File: ~/opt/stack/service/ceilometer-agent-notification/etc/pipeline-agent-notification.yaml
---
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Both of the pipeline files have two major sections:
- Sources: represents the data that is collected either from notifications posted by services or through polling. In the Sources section there is a list of meters. These meters define what kind of data is collected. For a full list refer to the ceilometer documentation available at: Telemetry Measurements
- Sinks: represents how the data is modified before it is published to the internal queue for collection and storage.
You will only need to change a setting in the Sources section to control the data collection interval.
For more information, see Telemetry Measurements
To change the ceilometer Polling Agent interval setting:
To find the polling agent configuration file, run:
cd ~/opt/stack/service/ceilometer-polling/etc
In a text editor, open the following file:
pipeline-polling.yaml
In the following section, change the value of interval to the desired amount of time:
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
In the sample code above, the polling agent will collect data every 3600 seconds, or once an hour.
To change the ceilometer Notification Agent (notification agent) interval setting:
To find the notification agent configuration file, run:
cd /opt/stack/service/ceilometer-agent-notification/etc
In a text editor, open the following file:
pipeline-agent-notification.yaml
In the following section, change the value of interval to the desired amount of time:
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
In the sample code above, the notification agent will collect data every 30 seconds.
The pipeline-agent-notification.yaml file needs to be changed on all controller nodes to change the white-listing and polling strategy.
13.3.5.2 Edit the List of Meters #
The number of enabled meters can be reduced or increased by editing the pipeline configuration of the notification and polling agents. To deploy these changes you must then restart the agent. If pollsters and notifications are both modified, then you will have to restart both the Polling Agent and the Notification Agent. ceilometer Collector will also need to be restarted. The following code is an example of a compute-only ceilometer Notification Agent (notification agent) pipeline-agent-notification.yaml file:
---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
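After editing the pipeline files, the agents can be restarted with systemctl as shown in Section 13.3.4.3; the polling agent unit name below (ceilometer-polling) is an assumption and may differ in your deployment:
tux > sudo systemctl restart ceilometer-agent-notification
# The polling agent unit name is assumed; verify it with "systemctl list-units | grep ceilometer"
tux > sudo systemctl restart ceilometer-polling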
If you enable meters at the container level in this file, every time the polling interval triggers a collection, at least 5 messages per existing container in swift are collected.
The following table illustrates the amount of data produced hourly in different scenarios:
swift Containers | swift Objects per container | Samples per Hour | Samples stored per 24 hours |
10 | 10 | 500 | 12000 |
10 | 100 | 5000 | 120000 |
100 | 100 | 50000 | 1200000 |
100 | 1000 | 500000 | 12000000 |
The data in the table shows that even a very small swift storage with 10 containers and 100 files will store 120,000 samples in 24 hours, for a total of 3.6 million samples over a 30-day period.
The size of each file does not have any impact on the number of samples collected. As shown in the table above, the smallest number of samples results from polling when there are a small number of files and a small number of containers. When there are many containers with many files, the number of samples is the highest.
13.3.5.3 Add Resource Fields to Meters #
By default, not all the resource metadata fields for an event are recorded and stored in ceilometer. If you want to collect metadata fields for a consumer application, for example, it is easier to add a field to an existing meter rather than creating a new meter. If you create a new meter, you must also reconfigure ceilometer.
Consider the following information before you add or edit a meter:
You can add a maximum of 12 new fields.
Adding or editing a meter causes all non-default meters to STOP receiving notifications. You will need to restart ceilometer.
New meters added to the pipeline-polling.yaml.j2 file must also be added to the pipeline-agent-notification.yaml.j2 file. This is because polling meters are drained by the notification agent and not by the collector.
After SUSE OpenStack Cloud is installed, services like compute, cinder, glance, and neutron are configured to publish ceilometer meters by default. Other meters can also be enabled after the services are configured to start publishing them. The only requirement for publishing a meter is that the origin must have a value of notification. For a complete list of meters, see the OpenStack documentation on Measurements.
Not all meters are supported. Meters collected by the ceilometer Compute Agent or any agent other than ceilometer Polling are not supported or tested with SUSE OpenStack Cloud.
Identity meters are disabled by keystone.
To enable ceilometer to start collecting meters, some services require that you enable the meters in the service itself before enabling them in ceilometer. Refer to the documentation for the specific service before you add new meters or resource fields.
To add Resource Metadata fields:
Log on to the Cloud Lifecycle Manager (deployer node).
To change to the ceilometer directory, run:
ardana > cd ~/openstack/my_cloud/config/ceilometer
In a text editor, open the target configuration file (for example, monasca-field-definitions.yaml.j2).
In the metadata section, either add a new meter or edit an existing one provided by SUSE OpenStack Cloud.
Include the metadata fields you need. You can use the instance meter in the file as an example.
Save and close the configuration file.
To save your changes in SUSE OpenStack Cloud, run:
ardana > cd ~/openstack
ardana > git add -A
ardana > git commit -m "My config"
If you added a new meter, reconfigure ceilometer:
ardana > cd ~/openstack/ardana/ansible/
# To run the config-processor playbook:
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
# To run the ready-deployment playbook:
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml
13.3.5.4 Update the Polling Strategy and Swift Considerations #
Polling can be very taxing on the system due to the sheer volume of data that the system may have to process. It also has a severe impact on queries, since the database will have a very large amount of data to scan when responding to a query. This consumes a great amount of CPU and memory, which can result in long wait times for query responses and, in extreme cases, timeouts.
There are 3 polling meters in swift:
storage.objects
storage.objects.size
storage.objects.containers
Here is an example of pipeline.yml in which swift polling is set to occur hourly.
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
With the configuration above, we did not enable polling of container-based meters, and we only collect 3 messages for any given tenant, one for each meter listed in the configuration file. Because there are only 3 messages per tenant, this does not create a heavy load on the MySQL database, as it would if container-based meters were enabled. Hence, other APIs are not affected by this data collection configuration.
13.3.6 Ceilometer Metering Setting Role-based Access Control #
Role-Based Access Control (RBAC) is a technique that limits access to resources based on a specific set of roles associated with each user's credentials.
keystone has a set of users that are associated with each project. Each user has at least one role. After a user has authenticated with keystone using a valid set of credentials, keystone will augment that request with the Roles that are associated with that user. These roles are added to the Request Header under the X-Roles attribute and are presented as a comma-separated list.
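For illustration only (the token value is a placeholder and the project ID is reused from the examples below), a request that has passed through the keystone middleware might carry headers such as:
X-Auth-Token: gAAAAAB...             # validated keystone token
X-Project-Id: 00cbaf647bf24627b01b1a314e796138
X-Roles: admin,member                # comma-separated list of the user's roles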
13.3.6.1 Displaying All Users #
To discover the list of users available in the system, an administrator can run the following command using the keystone command-line interface:
ardana > source ~/service.osrc
ardana > openstack user list
The output should resemble this response, which is a list of all the users currently available in this system.
+----------------------------------+--------------------+---------+--------------------+
| id                               | name               | enabled | email              |
+----------------------------------+--------------------+---------+--------------------+
| 1c20d327c92a4ea8bb513894ce26f1f1 | admin              | True    | admin.example.com  |
| 0f48f3cc093c44b4ad969898713a0d65 | ceilometer         | True    | nobody@example.com |
| 85ba98d27b1c4c8f97993e34fcd14f48 | cinder             | True    | nobody@example.com |
| d2ff982a0b6547d0921b94957db714d6 | demo               | True    | demo@example.com   |
| b2d597e83664489ebd1d3c4742a04b7c | ec2                | True    | nobody@example.com |
| 2bd85070ceec4b608d9f1b06c6be22cb | glance             | True    | nobody@example.com |
| 0e9e2daebbd3464097557b87af4afa4c | heat               | True    | nobody@example.com |
| 0b466ddc2c0f478aa139d2a0be314467 | neutron            | True    | nobody@example.com |
| 5cda1a541dee4555aab88f36e5759268 | nova               | True    | nobody@example.com |
| 1cefd1361be8437d9684eb2add8bdbfa | swift              | True    | nobody@example.com |
| f05bac3532c44414a26c0086797dab23 | user20141203213957 | True    | nobody@example.com |
| 3db0588e140d4f88b0d4cc8b5ca86a0b | user20141205232231 | True    | nobody@example.com |
+----------------------------------+--------------------+---------+--------------------+
13.3.6.2 Displaying All Roles #
To see all the roles that are currently available in the deployment, an administrator (someone with the admin role) can run the following command:
ardana > source ~/service.osrc
ardana > openstack role list
The output should resemble the following response:
+----------------------------------+-----------------+
| id                               | name            |
+----------------------------------+-----------------+
| 507bface531e4ac2b7019a1684df3370 | ResellerAdmin   |
| 9fe2ff9ee4384b1894a90878d3e92bab | member          |
| e00e9406b536470dbde2689ce1edb683 | admin           |
| aa60501f1e664ddab72b0a9f27f96d2c | heat_stack_user |
| a082d27b033b4fdea37ebb2a5dc1a07b | service         |
| 8f11f6761534407585feecb5e896922f | swiftoperator   |
+----------------------------------+-----------------+
13.3.6.3 Assigning a Role to a User #
In this example, we want to add the role ResellerAdmin to the demo user who has the ID d2ff982a0b6547d0921b94957db714d6.
Determine which Project/Tenant the user belongs to.
ardana > source ~/service.osrc
ardana > openstack user show d2ff982a0b6547d0921b94957db714d6
The response should resemble the following output:
+---------------------+----------------------------------+
| Field               | Value                            |
+---------------------+----------------------------------+
| domain_id           | default                          |
| enabled             | True                             |
| id                  | d2ff982a0b6547d0921b94957db714d6 |
| name                | demo                             |
| options             | {}                               |
| password_expires_at | None                             |
+---------------------+----------------------------------+
We need to link the ResellerAdmin Role to a Project/Tenant. To start, determine which tenants are available on this deployment.
ardana > source ~/service.osrc
ardana > openstack project list
The response should resemble the following output:
+----------------------------------+---------+---------+
| id                               | name    | enabled |
+----------------------------------+---------+---------+
| 4a8f4207a13444089a18dc524f41b2cf | admin   | True    |
| 00cbaf647bf24627b01b1a314e796138 | demo    | True    |
| 8374761f28df43b09b20fcd3148c4a08 | gf1     | True    |
| 0f8a9eef727f4011a7c709e3fbe435fa | gf2     | True    |
| 6eff7b888f8e470a89a113acfcca87db | gf3     | True    |
| f0b5d86c7769478da82cdeb180aba1b0 | jaq1    | True    |
| a46f1127e78744e88d6bba20d2fc6e23 | jaq2    | True    |
| 977b9b7f9a6b4f59aaa70e5a1f4ebf0b | jaq3    | True    |
| 4055962ba9e44561ab495e8d4fafa41d | jaq4    | True    |
| 33ec7f15476545d1980cf90b05e1b5a8 | jaq5    | True    |
| 9550570f8bf147b3b9451a635a1024a1 | service | True    |
+----------------------------------+---------+---------+
Now that we have all the pieces, we can assign the ResellerAdmin role to this User on the Demo project.
ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138 507bface531e4ac2b7019a1684df3370
This will produce no response if everything is correct.
Validate that the role has been assigned correctly. Pass in the user and tenant ID and request a list of roles assigned.
ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138
Note that all members have the member role as a default role in addition to any other roles that have been assigned.
+----------------------------------+---------------+----------------------------------+----------------------------------+
| id                               | name          | user_id                          | tenant_id                        |
+----------------------------------+---------------+----------------------------------+----------------------------------+
| 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
| 9fe2ff9ee4384b1894a90878d3e92bab | member        | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
+----------------------------------+---------------+----------------------------------+----------------------------------+
13.3.6.4 Creating a New Role #
In this example, we will create a Level 3 Support role called L3Support.
Add the new role to the list of roles.
ardana > openstack role create L3Support
The response should resemble the following output:
+----------+----------------------------------+
| Property | Value                            |
+----------+----------------------------------+
| id       | 7e77946db05645c4ba56c6c82bf3f8d2 |
| name     | L3Support                        |
+----------+----------------------------------+
Now that we have the new role's ID, we can add that role to the Demo user from the previous example.
ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138 7e77946db05645c4ba56c6c82bf3f8d2
This will produce no response if everything is correct.
Verify that the user Demo has both the ResellerAdmin and L3Support roles.
ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138
The response should resemble the following output. Note that this user has the L3Support role, the ResellerAdmin role, and the default member role.
+----------------------------------+---------------+----------------------------------+----------------------------------+
| id                               | name          | user_id                          | tenant_id                        |
+----------------------------------+---------------+----------------------------------+----------------------------------+
| 7e77946db05645c4ba56c6c82bf3f8d2 | L3Support     | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
| 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
| 9fe2ff9ee4384b1894a90878d3e92bab | member        | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
+----------------------------------+---------------+----------------------------------+----------------------------------+
13.3.6.5 Access Policies #
Before introducing RBAC, ceilometer had very simple access control. There were two types of user: admins and users. Admins could access any API and perform any operation. Users could only access non-admin APIs and perform operations only on the Project/Tenant to which they belonged.
13.3.7 Ceilometer Metering Failover HA Support #
In the SUSE OpenStack Cloud environment, the ceilometer metering service supports native Active-Active high-availability (HA) for the notification and polling agents. Implementing HA support includes workload-balancing, workload-distribution and failover.
Tooz is the coordination engine that is used to coordinate workload among multiple active agent instances. It also maintains the knowledge of active-instance-to-handle failover and group membership using heartbeats (pings).
Zookeeper is used as the coordination back-end. Through Tooz, it exposes the APIs that manage group membership and retrieve the workload specific to each agent.
The following section in the configuration file is used to implement high-availability (HA):
[coordination]
backend_url = <IP address of Zookeeper host:port>   (the port is usually 2181, the Zookeeper default)
heartbeat = 1.0
check_watchers = 10.0
For the notification agent to be configured in HA mode, additional configuration is needed in the configuration file:
[notification]
workload_partitioning = true
The HA notification agent distributes workload among multiple queues that are created based on the number of unique source:sink combinations. The combinations are configured in the notification agent pipeline configuration file. If there are additional services to be metered using notifications, then the recommendation is to use a separate source for those events. This is recommended especially if the expected load of data from that source is considered high. Implementing HA support should lead to better workload balancing among multiple active notification agents.
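As an illustrative sketch only (the source name highvolume_source and the meters grouped under it are hypothetical), splitting meters across two sources in pipeline-agent-notification.yaml could look like this:
---
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
      sinks:
          - meter_sink
    # Hypothetical separate source for a service with a high expected event load
    - name: highvolume_source
      interval: 30
      meters:
          - "volume"
          - "volume.size"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://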
ceilometer-expirer also runs in Active-Active HA mode. Tooz is used to pick an expirer process that acquires a lock when there are multiple contenders, and the winning process runs. There is no failover support, as expirer is not a daemon and is scheduled to run at pre-determined intervals.
You must ensure that only a single expirer process runs when multiple processes are scheduled, via cron, to run at the same time on multiple controller nodes (see the sketch below).
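For example, a crontab entry of the following form could be used on each controller node; the path to ceilometer-expirer and the schedule are placeholders that must be adapted to your deployment:
# Run ceilometer-expirer once a day at 02:30 (binary path is deployment-specific)
30 2 * * * /PATH/TO/ceilometer-expirer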
The following configuration is needed to enable expirer HA:
[coordination]
backend_url = <IP address of Zookeeper host:port>   (the port is usually 2181, the Zookeeper default)
heartbeat = 1.0
check_watchers = 10.0
The notification agent HA support is mainly designed to coordinate among notification agents so that correlated samples can be handled by the same agent. This happens when samples get transformed from other samples. The SUSE OpenStack Cloud ceilometer pipeline has no transformers, so this task of coordination and workload partitioning does not need to be enabled. The notification agent is deployed on multiple controller nodes and they distribute workload among themselves by randomly fetching the data from the queue.
To disable coordination and workload partitioning by OpenStack, set the following value in the configuration file:
[notification]
workload_partitioning = False
When a configuration change is made to an API running under the HA Proxy, that change needs to be replicated in all controllers.
13.3.8 Optimizing the Ceilometer Metering Service #
You can improve ceilometer responsiveness by configuring metering to store only the data you require. This topic provides strategies for getting the most out of metering without overloading your resources.
13.3.8.1 Change the List of Meters #
The list of meters can be easily reduced or increased by editing the pipeline.yaml file and restarting the polling agent.
Sample compute-only pipeline.yaml file with the daily poll interval:
---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
This change will cause all non-default meters to stop receiving notifications.
13.3.8.2 Enable Nova Notifications #
You can configure nova to send notifications by enabling the setting in the configuration file. When enabled, nova will send information to ceilometer related to its usage and VM status. You must restart nova for these changes to take effect.
The OpenStack notification daemon, also known as a polling agent, monitors the message bus for data being provided by other OpenStack components such as nova. The notification daemon loads one or more listener plugins, using the ceilometer.notification namespace. Each plugin can listen to any topic, but by default it will listen to the notifications.info topic. The listeners grab messages off the defined topics and redistribute them to the appropriate plugins (endpoints) to be processed into Events and Samples. After the nova service is restarted, you should verify that the notification daemons are receiving traffic.
For a more in-depth look at how information is sent over openstack.common.rpc, refer to the OpenStack ceilometer documentation.
nova can be configured to send the following data to ceilometer:
Name | Type | Unit | Resource | Note |
---|---|---|---|---|
instance | g | instance | inst ID | Existence of instance |
instance: type | g | instance | inst ID | Existence of instance of type (where type is a valid OpenStack type) |
memory | g | MB | inst ID | Amount of allocated RAM. Measured in MB. |
vcpus | g | vcpu | inst ID | Number of VCPUs |
disk.root.size | g | GB | inst ID | Size of root disk. Measured in GB. |
disk.ephemeral.size | g | GB | inst ID | Size of ephemeral disk. Measured in GB. |
To enable nova to publish notifications:
In a text editor, open the following file:
nova.conf
Compare the example of a working configuration file with the necessary changes to your configuration file. If there is anything missing in your file, add it, and then save the file.
notification_driver=messaging
notification_topics=notifications
notify_on_state_change=vm_and_task_state
instance_usage_audit=True
instance_usage_audit_period=hour
Important: The instance_usage_audit_period interval can be set to check the instance's status every hour, once a day, once a week or once a month. Every time the audit period elapses, nova sends a notification to ceilometer to record whether or not the instance is alive and running. Metering this statistic is critical if billing depends on usage.
To restart the nova services, run:
tux > sudo systemctl restart nova-api.service
tux > sudo systemctl restart nova-conductor.service
tux > sudo systemctl restart nova-scheduler.service
tux > sudo systemctl restart nova-novncproxy.service
Important: Different platforms may use their own unique command to restart nova-compute services. If the above commands do not work, please refer to the documentation for your specific platform.
To verify successful launch of each process, list the service components:
ardana > source ~/service.osrc
ardana > openstack compute service list
+----+----------------+------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary         | Host       | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+----------------+------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-conductor | controller | internal | enabled | up    | 2014-09-16T23:54:02.000000 | -               |
| 3  | nova-scheduler | controller | internal | enabled | up    | 2014-09-16T23:54:07.000000 | -               |
| 4  | nova-cert      | controller | internal | enabled | up    | 2014-09-16T23:54:00.000000 | -               |
| 5  | nova-compute   | compute1   | nova     | enabled | up    | 2014-09-16T23:54:06.000000 | -               |
+----+----------------+------------+----------+---------+-------+----------------------------+-----------------+
13.3.8.3 Improve Reporting API Responsiveness #
Reporting APIs are the main access to the metering data stored in ceilometer. These APIs are accessed by horizon to provide basic usage data and information.
SUSE OpenStack Cloud uses Apache2 Web Server to provide the API access. This topic provides some strategies to help you optimize the front-end and back-end databases.
To improve responsiveness, you can increase the number of threads and processes in the ceilometer configuration. Each process can have a certain number of threads managing the filters and applications that make up the processing pipeline.
To configure Apache2 to increase the number of threads, use the steps in Section 13.3.4, “Configure the Ceilometer Metering Service”.
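As an illustrative sketch only (the process and thread counts, process group name, and user are assumptions, not tuned recommendations), a mod_wsgi based deployment of the ceilometer API typically controls this with the WSGIDaemonProcess directive in the Apache virtual host:
# Example mod_wsgi tuning for the ceilometer API (values are placeholders)
WSGIDaemonProcess ceilometer-api processes=4 threads=10 user=ceilometer display-name=%{GROUP}
WSGIProcessGroup ceilometer-api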
The resource usage panel could take some time to load depending on the number of metrics selected.
13.3.8.4 Update the Polling Strategy and Swift Considerations #
Polling can put an excessive amount of strain on the system due to the amount of data the system may have to process. Polling also has a severe impact on queries, since the database can have a very large amount of data to scan before responding to the query. This usually consumes a large amount of CPU and memory to complete the requests. Clients can also experience long waits for queries to return and, in extreme cases, even timeouts.
There are 3 polling meters in swift:
storage.objects
storage.objects.size
storage.objects.containers
Sample section of the pipeline.yaml configuration file with swift polling on an hourly interval:
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Every time the polling interval occurs, at least 3 messages per existing object/container in swift are collected. The following table illustrates the amount of data produced hourly in different scenarios:
swift Containers | swift Objects per container | Samples per Hour | Samples stored per 24 hours |
10 | 10 | 500 | 12000 |
10 | 100 | 5000 | 120000 |
100 | 100 | 50000 | 1200000 |
100 | 1000 | 500000 | 12000000 |
Looking at the data, we can see that even a very small swift storage with 10 containers and 100 files will store 120,000 samples in 24 hours, bringing the total to 3.6 million samples over a 30-day period.
The size of each file does not have any impact on the number of samples collected. In fact, the smaller the number of containers or files, the smaller the number of samples. In the scenario where there are a large number of small files and containers, the number of samples is at its largest and the performance is at its worst.
13.3.9 Metering Service Samples #
Samples are discrete collections of a particular meter or the actual usage data defined by a meter description. Each sample is time-stamped and includes a variety of data that varies per meter, but usually includes the project ID and user ID of the entity that consumed the resource represented by the meter and sample.
In a typical deployment, the number of samples can be in the tens of thousands if not higher for a specific collection period depending on overall activity.
Sample collection and data storage expiry settings are configured in ceilometer. For use cases such as collecting data for monthly billing cycles, samples are usually stored over a period of 45 days, which requires a large, scalable back-end database to support the large volume of samples generated by production OpenStack deployments.
Example configuration:
[database]
metering_time_to_live=-1
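The value is expressed in seconds, and -1 means samples are never expired. As a sketch, to expire samples after the 45-day retention period mentioned above, the setting would be 45 × 86400 = 3,888,000 seconds:
[database]
metering_time_to_live = 3888000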
In our example use case, to construct a complete billing record, an external billing application must collect all pertinent samples. The results must then be sorted, summarized, and combined with the results of other types of metered samples as required. This function, known as aggregation, is external to the ceilometer service.
Meter data, or samples, can also be collected directly from the service APIs by individual ceilometer polling agents. These polling agents directly access service usage by calling the API of each service.
OpenStack services such as swift currently provide metered data only through this function, and some other OpenStack services provide specific metrics only through a polling action.