Operations Guide CLM #
At the time of the SUSE OpenStack Cloud 9 release, this guide contains information pertaining to the operation, administration, and user functions of SUSE OpenStack Cloud. The audience is the admin-level operator of the cloud.
- 1 Operations Overview
- 2 Tutorials
- 3 Cloud Lifecycle Manager Admin UI User Guide
- 4 Third-Party Integrations
- 5 Managing Identity
- 5.1 The Identity Service
- 5.2 Supported Upstream Keystone Features
- 5.3 Understanding Domains, Projects, Users, Groups, and Roles
- 5.4 Identity Service Token Validation Example
- 5.5 Configuring the Identity Service
- 5.6 Retrieving the Admin Password
- 5.7 Changing Service Passwords
- 5.8 Reconfiguring the Identity service
- 5.9 Integrating LDAP with the Identity Service
- 5.10 keystone-to-keystone Federation
- 5.11 Configuring Web Single Sign-On
- 5.12 Identity Service Notes and Limitations
- 6 Managing Compute
- 6.1 Managing Compute Hosts using Aggregates and Scheduler Filters
- 6.2 Using Flavor Metadata to Specify CPU Model
- 6.3 Forcing CPU and RAM Overcommit Settings
- 6.4 Enabling the Nova Resize and Migrate Features
- 6.5 Enabling ESX Compute Instance(s) Resize Feature
- 6.6 GPU passthrough
- 6.7 Configuring the Image Service
- 7 Managing ESX
- 7.1 Networking for ESXi Hypervisor (OVSvApp)
- 7.2 Validating the neutron Installation
- 7.3 Removing a Cluster from the Compute Resource Pool
- 7.4 Removing an ESXi Host from a Cluster
- 7.5 Configuring Debug Logging
- 7.6 Making Scale Configuration Changes
- 7.7 Monitoring vCenter Clusters
- 7.8 Monitoring Integration with OVSvApp Appliance
- 8 Managing Block Storage
- 9 Managing Object Storage
- 10 Managing Networking
- 11 Managing the Dashboard
- 12 Managing Orchestration
- 13 Managing Monitoring, Logging, and Usage Reporting
- 14 Managing Container as a Service (Magnum)
- 15 System Maintenance
- 16 Operations Console
- 17 Backup and Restore
- 18 Troubleshooting Issues
- 18.1 General Troubleshooting
- 18.2 Control Plane Troubleshooting
- 18.3 Troubleshooting Compute service
- 18.4 Network Service Troubleshooting
- 18.5 Troubleshooting the Image (glance) Service
- 18.6 Storage Troubleshooting
- 18.7 Monitoring, Logging, and Usage Reporting Troubleshooting
- 18.8 Orchestration Troubleshooting
- 18.9 Troubleshooting Tools
- 3.1 Cloud Lifecycle Manager Admin UI Login Page
- 3.2 Cloud Lifecycle Manager Admin UI Service Information
- 3.3 Cloud Lifecycle Manager Admin UI SUSE Cloud Package
- 3.4 Cloud Lifecycle Manager Admin UI SUSE Service Configuration
- 3.5 Cloud Lifecycle Manager Admin UI SUSE Service Configuration Editor
- 3.6 Cloud Lifecycle Manager Admin UI SUSE Service Configuration Update
- 3.7 Cloud Lifecycle Manager Admin UI SUSE Service Model
- 3.8 Cloud Lifecycle Manager Admin UI SUSE Service Model Editor
- 3.9 Cloud Lifecycle Manager Admin UI SUSE Service Model Confirmation
- 3.10 Cloud Lifecycle Manager Admin UI SUSE Service Model Update
- 3.11 Cloud Lifecycle Manager Admin UI Services Per Role
- 3.12 Cloud Lifecycle Manager Admin UI Server Summary
- 3.13 Server Details (1/2)
- 3.14 Server Details (2/2)
- 3.15 Control Plane Topology
- 3.16 Control Plane Topology - Availability Zones
- 3.17 Regions Topology
- 3.18 Services Topology
- 3.19 Service Details Topology
- 3.20 Networks Topology
- 3.21 Network Groups Topology
- 3.22 Server Groups Topology
- 3.23 Roles Topology
- 3.24 Add Server Overview
- 3.25 Manually Add Server
- 3.26 Manually Add Server
- 3.27 Add Server Settings options
- 3.28 Select Servers to Provision OS
- 3.29 Confirm Provision OS
- 3.30 OS Install Progress
- 3.31 OS Install Summary
- 3.32 Confirm Deploy Servers
- 3.33 Validate Server Changes
- 3.34 Prepare Servers
- 3.35 Deploy Servers
- 3.36 Deploy Summary
- 3.37 Activate Server
- 3.38 Activate Server Progress
- 3.39 Deactivate Server
- 3.40 Deactivate Server Confirmation
- 3.41 Deactivate Server Progress
- 3.42 Select Migration Target
- 3.43 Deactivate Migration Progress
- 3.44 Delete Server
- 3.45 Delete Server Confirmation
- 3.46 Unreachable Delete Confirmation
- 3.47 Delete Server Progress
- 3.48 Replace Server Menu
- 3.49 Replace Controller Form
- 3.50 Replace Controller Progress
- 3.51 Replace Compute Menu
- 3.52 Unreachable Compute Node Warning
- 3.53 Replace Compute Form
- 3.54 Install SLES on New Compute
- 3.55 Prepare Compute Server
- 3.56 Deploy New Compute Server
- 3.57 Host Aggregate Removal Warning
- 3.58 Migrate Instances from Existing Compute Server
- 3.59 Disable Existing Compute Server
- 3.60 Existing Server Shutdown Check
- 3.61 Existing Server Delete
- 3.62 Compute Replacement Summary
- 5.1 Keystone Authentication Flow
- 16.1 Compute Hosts
- 16.2 Compute Summary
- 10.1 Intel 82599 devices supported with SRIOV and PCIPT
- 13.1 Aggregated Metrics
- 13.2 HTTP Check Metrics
- 13.3 HTTP Metric Components
- 13.4 Tunable Libvirt Metrics
- 13.5 Untunable Libvirt Metrics
- 13.6 Per-router metrics
- 13.7 Per-DHCP port and rate metrics
- 13.8 CPU Metrics
- 13.9 Disk Metrics
- 13.10 Load Metrics
- 13.11 Memory Metrics
- 13.12 Network Metrics
- 17.1 Cloud Lifecycle Manager Backup Paths
Copyright © 2006–2024 SUSE LLC and contributors. All rights reserved.
Except where otherwise noted, this document is licensed under the Creative Commons Attribution 3.0 License: https://creativecommons.org/licenses/by/3.0/legalcode.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.
1 Operations Overview #
A high-level overview of the processes related to operating a SUSE OpenStack Cloud 9 cloud.
1.1 What is a cloud operator? #
When we talk about a cloud operator it is important to understand the scope of the tasks and responsibilities we are referring to. SUSE OpenStack Cloud defines a cloud operator as the person or group of people who will be administering the cloud infrastructure, which includes:
Monitoring the cloud infrastructure, resolving issues as they arise.
Managing hardware resources, adding/removing hardware due to capacity needs.
Repairing, and recovering if needed, any hardware issues.
Performing domain administration tasks, which involves creating and managing projects, users, and groups as well as setting and managing resource quotas.
1.2 Tools provided to operate your cloud #
SUSE OpenStack Cloud provides the following tools which are available to operate your cloud:
Operations Console
Often referred to as the Ops Console, you can use this console to view data about your cloud infrastructure in a web-based graphical user interface (GUI) to make sure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can manage data in the following ways:
Triage alarm notifications in the central dashboard
Monitor the environment by giving priority to alarms that take precedence
Manage compute nodes and easily use a form to create a new host
Refine the monitoring environment by creating new alarms to specify a combination of metrics, services, and hosts that match the triggers unique to an environment
Plan for future storage by tracking capacity over time to predict with some degree of reliability the amount of additional storage needed
Dashboard
Often referred to as horizon or the horizon dashboard, you can use this console to manage resources on a domain and project level in a web-based graphical user interface (GUI). The following are some of the typical operational tasks that you may perform using the dashboard:
Creating and managing projects, users, and groups within your domain.
Assigning roles to users and groups to manage access to resources.
Setting and updating resource quotas for the projects.
For more details, see the following page: Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”
Command-line interface (CLI)
The OpenStack community has created a unified client, called the openstackclient (OSC), which combines the available commands in the various service-specific clients into one tool. Some service-specific commands do not have OSC equivalents.
You will find processes defined in our documentation that use these command-line tools. We have also outlined a list of common cloud administration tasks that you can perform with the command-line tools.
There are references throughout the SUSE OpenStack Cloud documentation to the HPE Smart Storage Administrator (HPE SSA) CLI. HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f
1.3 Daily tasks #
Ensure your cloud is running correctly: SUSE OpenStack Cloud is deployed as a set of highly available services to minimize the impact of failures. That said, hardware and software systems can fail. Detection of failures early in the process will enable you to address issues before they affect the broader system. SUSE OpenStack Cloud provides a monitoring solution, based on OpenStack’s monasca, which provides monitoring and metrics for all OpenStack components and much of the underlying system, including service status, performance metrics, compute node, and virtual machine status. Failures are exposed via the Operations Console and/or alarm notifications. In the case where more detailed diagnostics are required, you can use a centralized logging system based on the Elasticsearch, Logstash, and Kibana (ELK) stack. This provides the ability to search service logs to get detailed information on behavior and errors.
Perform critical maintenance: To ensure your OpenStack installation is running correctly, provides the right access and functionality, and is secure, you should make ongoing adjustments to the environment. Examples of daily maintenance tasks include:
Add/remove projects and users. The frequency of this task depends on your policy.
Apply security patches (if released).
Run daily backups.
1.4 Weekly or monthly tasks #
Do regular capacity planning: Your initial deployment will likely reflect the known near to mid-term scale requirements, but at some point your needs will outgrow your initial deployment’s capacity. You can expand SUSE OpenStack Cloud in a variety of ways, such as by adding compute and storage capacity.
To manage your cloud’s capacity, begin by determining the load on the existing system. OpenStack is a set of relatively independent components and services, so there are multiple subsystems that can affect capacity. These include control plane nodes, compute nodes, object storage nodes, block storage nodes, and an image management system. At the most basic level, you should look at the CPU used, RAM used, I/O load, and the disk space used relative to the amounts available. For compute nodes, you can also evaluate the allocation of resource to hosted virtual machines. This information can be viewed in the Operations Console. You can pull historical information from the monitoring service (OpenStack’s monasca) by using its client or API. Also, OpenStack provides you some ability to manage the hosted resource utilization by using quotas for projects. You can track this usage over time to get your growth trend so that you can project when you will need to add capacity.
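The growth-trend projection described above can be sketched with nothing more than awk. This is a hedged illustration: the sample file, the daily RAM-used percentages, and the 80% threshold are all invented, and in practice you would export the series from the monitoring service first.

```shell
# Daily RAM-used samples (%; invented numbers) and a least-squares
# projection of when the 80% threshold will be reached.
printf '%s\n' 40 42 44 46 48 50 52 > /tmp/ram_used.sample
projection=$(awk -v threshold=80 '
  { x = NR - 1; y = $1
    sx += x; sy += y; sxx += x*x; sxy += x*y; n++ }
  END {
    slope = (n*sxy - sx*sy) / (n*sxx - sx*sx)
    intercept = (sy - slope*sx) / n
    if (slope <= 0) { print "no growth trend"; exit }
    printf "threshold reached at day %.0f", (threshold - intercept)/slope
  }' /tmp/ram_used.sample)
echo "$projection"   # with these samples: threshold reached at day 20
```

The same fit can, of course, be run per subsystem (control plane, compute, storage) once you decide which metrics to track.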
1.5 Semi-annual tasks #
Perform upgrades: OpenStack releases new versions on a six-month cycle. In general, SUSE OpenStack Cloud will release new major versions annually with minor versions and maintenance updates more often. Each new release consists of both new functionality and services, as well as bug fixes for existing functionality.
If you are planning to upgrade, this is also an excellent time to evaluate your existing capabilities, especially in terms of capacity (see Capacity Planning above).
1.6 Troubleshooting #
As part of managing your cloud, you should be ready to troubleshoot issues, as needed. The following are some common troubleshooting scenarios and solutions:
How do I determine if my cloud is operating correctly now?: SUSE OpenStack Cloud provides a monitoring solution based on OpenStack’s monasca service. This service provides monitoring and metrics for all OpenStack components, as well as much of the underlying system. By default, SUSE OpenStack Cloud comes with a set of alarms that provide coverage of the primary systems. In addition, you can define alarms based on threshold values for any metrics defined in the system. You can view alarm information in the Operations Console. You can also receive or deliver this information to others by configuring email or other mechanisms. Alarms provide information about whether a component failed and is affecting the system, and also what condition triggered the alarm.
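As a mental model of such a threshold alarm (an illustration of the idea only, not monasca's actual implementation), the alarm is a predicate over the most recent samples of a metric:

```shell
# Last few CPU-percent samples (invented); the alarm fires only if all
# of the last 3 samples exceed the 90% threshold.
printf '%s\n' 35 40 91 95 93 > /tmp/cpu.sample
state=$(awk -v threshold=90 -v periods=3 '
  { v[NR] = $1 }
  END {
    if (NR < periods) { print "UNDETERMINED"; exit }
    for (i = NR - periods + 1; i <= NR; i++)
      if (v[i] <= threshold) { print "OK"; exit }
    print "ALARM"
  }' /tmp/cpu.sample)
echo "$state"   # ALARM
```

Requiring several consecutive breaches, rather than a single spike, is what keeps threshold alarms from flapping.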
How do I troubleshoot and resolve performance issues for my cloud?: There are a variety of factors that can affect the performance of a cloud system, such as the following:
Health of the control plane
Health of the hosting compute node and virtualization layer
Resource allocation on the compute node
If your cloud users are experiencing performance issues on your cloud, use the following approach:
View the compute summary page on the Operations Console to determine if any alarms have been triggered.
Determine the hosting node of the virtual machine that is having issues.
On the compute hosts page, view the status and resource utilization of the compute node to determine if it has errors or is over-allocated.
On the compute instances page you can view the status of the VM along with its metrics.
How do I troubleshoot and resolve availability issues for my cloud?: If your cloud users are experiencing availability issues, determine what your users are experiencing that indicates to them the cloud is down. For example, can they not access the Dashboard service (horizon) console or APIs, indicating a problem with the control plane? Or are they having trouble accessing resources? Console/API issues would indicate a problem with the control planes. Use the Operations Console to view the status of services to see if there is an issue. However, if it is an issue of accessing a virtual machine, then also search the consolidated logs that are available in the ELK stack for errors related to the virtual machine and supporting networking.
1.7 Common Questions #
What skills do my cloud administrators need?
Your administrators should be experienced Linux admins. They should have experience in application management, as well as experience with Ansible. It is a plus if they have experience with Bash shell scripting and Python programming skills.
In addition, you will need skilled networking engineering staff to administer the cloud network environment.
2 Tutorials #
This section contains tutorials for common tasks for your SUSE OpenStack Cloud 9 cloud.
2.1 SUSE OpenStack Cloud Quickstart Guide #
2.1.1 Introduction #
This document provides simplified instructions for installing and setting up a SUSE OpenStack Cloud. Use this quickstart guide to build testing, demonstration, and lab-type environments, rather than production installations. When you complete this quickstart process, you will have a fully functioning SUSE OpenStack Cloud demo environment.
These simplified instructions are intended for testing or demonstration. Instructions for production installations are in Book “Deployment Guide using Cloud Lifecycle Manager”.
2.1.2 Overview of components #
The following are short descriptions of the components that SUSE OpenStack Cloud employs when installing and deploying your cloud.
Ansible. Ansible is a powerful configuration management tool used by SUSE OpenStack Cloud to manage nearly all aspects of your cloud infrastructure. Most commands in this quickstart guide execute Ansible scripts, known as playbooks. You will run playbooks that install packages, edit configuration files, manage network settings, and take care of the general administration tasks required to get your cloud up and running.
Get more information on Ansible at https://www.ansible.com/.
Cobbler. Cobbler is another third-party tool used by SUSE OpenStack Cloud to deploy operating systems across the physical servers that make up your cloud. Find more info at http://cobbler.github.io/.
Git. Git is the version control system used to manage the configuration files that define your cloud. Any changes made to your cloud configuration files must be committed to the locally hosted git repository to take effect. Read more information on Git at https://git-scm.com/.
2.1.3 Preparation #
Successfully deploying a SUSE OpenStack Cloud environment is a large endeavor, but it is not complicated. For a successful deployment, you must put a number of components in place before rolling out your cloud. Most importantly, a basic SUSE OpenStack Cloud requires the proper network infrastructure. Because SUSE OpenStack Cloud segregates the network traffic of many of its elements, if the necessary networks, routes, and firewall access rules are not in place, communication required for a successful deployment will not occur.
2.1.4 Getting Started #
When your network infrastructure is in place, go ahead and set up the Cloud Lifecycle Manager. This is the server that will orchestrate the deployment of the rest of your cloud. It is also the server you will run most of your deployment and management commands on.
Set up the Cloud Lifecycle Manager
Download the installation media
Obtain a copy of the SUSE OpenStack Cloud installation media, and make sure that it is accessible by the server that you are installing it on. Your method of doing this may vary. For instance, some may choose to load the installation ISO on a USB drive and physically attach it to the server, while others may run the IPMI Remote Console and attach the ISO to a virtual disc drive.
Install the operating system
Boot your server, using the installation media as the boot source.
Choose "install" from the list of options and choose your preferred keyboard layout, location, language, and other settings.
Set the address, netmask, and gateway for the primary network interface.
Create a root user account.
Proceed with the OS installation. After the installation is complete and the server has rebooted into the new OS, log in with the user account you created.
Configure the new server
SSH to your new server, and set a valid DNS nameserver in the /etc/resolv.conf file.
Set the LC_ALL environment variable:
export LC_ALL=C
You now have a server running SUSE Linux Enterprise Server (SLES). The next step is to configure this machine as a Cloud Lifecycle Manager.
Configure the Cloud Lifecycle Manager
The installation media you used to install the OS on the server also has the files that will configure your cloud. You need to mount this installation media on your new server in order to use these files.
Using the URL that you obtained the SUSE OpenStack Cloud installation media from, run wget to download the ISO file to your server:
wget INSTALLATION_ISO_URL
Now mount the ISO in the /media/cdrom/ directory:
sudo mount INSTALLATION_ISO /media/cdrom/
Unpack the tar file found in the /media/cdrom/ardana/ directory where you just mounted the ISO:
tar xvf /media/cdrom/ardana/ardana-x.x.x-x.tar
Now you will install and configure all the components needed to turn this server into a Cloud Lifecycle Manager. Run the ardana-init.bash script from the uncompressed tar file:
~/ardana-x.x.x/ardana-init.bash
The ardana-init.bash script prompts you to enter an optional SSH passphrase. This passphrase protects the RSA key used to SSH to the other cloud nodes. The passphrase is optional; you can skip it by pressing Enter at the prompt.
The ardana-init.bash script automatically installs and configures everything needed to set up this server as the lifecycle manager for your cloud.
When the script has finished running, you can proceed to the next step: editing your input files.
Edit your input files
Your SUSE OpenStack Cloud input files are where you define your cloud infrastructure and how it runs. The input files define options such as which servers are included in your cloud, the type of disks the servers use, and their network configuration. The input files also define which services your cloud will provide and use, the network architecture, and the storage backends for your cloud.
There are several example configurations, which you can find on your Cloud Lifecycle Manager in the ~/openstack/examples/ directory.
The simplest way to set up your cloud is to copy the contents of one of these example configurations to your ~/openstack/my_cloud/definition/ directory. You can then edit the copied files to define your cloud:
cp -r ~/openstack/examples/CHOSEN_EXAMPLE/* ~/openstack/my_cloud/definition/
Edit the files in your ~/openstack/my_cloud/definition/ directory to define your cloud.
Commit your changes
When you finish editing the necessary input files, stage them, and then commit the changes to the local Git repository:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My commit message"
Image your servers
Now that you have finished editing your input files, you can deploy the configuration to the servers that will comprise your cloud.
Image the servers. You will install the SLES operating system across all the servers in your cloud, using Ansible playbooks to trigger the process.
The following playbook confirms that your servers are accessible over their IPMI ports, which is a prerequisite for the imaging process:
ansible-playbook -i hosts/localhost bm-power-status.yml
Now validate that your cloud configuration files have proper YAML syntax by running the config-processor-run.yml playbook:
ansible-playbook -i hosts/localhost config-processor-run.yml
If you receive an error when running the preceding playbook, one or more of your configuration files has an issue. Refer to the output of the Ansible playbook, and look for clues in the Ansible log file, found at ~/.ansible/ansible.log.
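When the playbook fails, a quick grep over that log usually surfaces the failing task. A sketch against an invented log excerpt (the timestamps and message text here are made up for illustration):

```shell
# Stand-in for ~/.ansible/ansible.log (contents invented) and a grep
# that surfaces the failing task.
cat > /tmp/ansible.log.sample <<'EOF'
2024-01-10 09:01:12 p=1234 u=ardana | TASK [config-processor | validate input model]
2024-01-10 09:01:13 p=1234 u=ardana | fatal: [localhost]: FAILED! => duplicate server id
2024-01-10 09:01:13 p=1234 u=ardana | PLAY RECAP
EOF
grep -E 'fatal|FAILED|ERROR' /tmp/ansible.log.sample
```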
The next step is to prepare your imaging system, Cobbler, to deploy operating systems to all your cloud nodes:
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Now you can image your cloud nodes. You will use an Ansible playbook to trigger Cobbler to deploy operating systems to all the nodes you specified in your input files:
ansible-playbook -i hosts/localhost bm-reimage.yml
The bm-reimage.yml playbook performs the following operations:
Powers down the servers.
Sets the servers to boot from a network interface.
Powers on the servers and performs a PXE OS installation.
Waits for the servers to power themselves down as part of a successful OS installation. This can take some time.
Sets the servers to boot from their local hard disks and powers on the servers.
Waits for the SSH service to start on the servers and verifies that they have the expected host-key signature.
Deploy your cloud
Now that your servers are running the SLES operating system, it is time to configure them for the roles they will play in your new cloud.
Prepare the Cloud Lifecycle Manager to deploy your cloud configuration to all the nodes:
ansible-playbook -i hosts/localhost ready-deployment.yml
NOTE: The preceding playbook creates a new directory, ~/scratch/ansible/next/ardana/ansible/, from which you will run many of the following commands.
(Optional) If you are reusing servers or disks to run your cloud, you can wipe the disks of your newly imaged servers by running the wipe_disks.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts wipe_disks.yml
The wipe_disks.yml playbook removes any existing data from the drives on your new servers. This can be helpful if you are reusing servers or disks. This action will not affect the OS partitions on the servers.
Note: The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used in any other case, it may not wipe all of the expected partitions. For example, if site.yml fails, you cannot start fresh by running wipe_disks.yml. You must bm-reimage the node first and then run wipe_disks.yml.
Now it is time to deploy your cloud. Do this by running the site.yml playbook, which pushes the configuration you defined in the input files out to all the servers that will host your cloud:
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts site.yml
The site.yml playbook installs packages, starts services, configures network interface settings, sets iptables firewall rules, and more. Upon successful completion of this playbook, your SUSE OpenStack Cloud will be in place and in a running state. This playbook can take up to six hours to complete.
SSH to your nodes
Now that you have successfully run site.yml, your cloud will be up and running. You can verify connectivity to your nodes by connecting to each one using SSH. You can find the IP addresses of your nodes by viewing the /etc/hosts file.
For security reasons, you can only SSH to your nodes from the Cloud Lifecycle Manager. SSH connections from any machine other than the Cloud Lifecycle Manager will be refused by the nodes.
From the Cloud Lifecycle Manager, SSH to your nodes:
ssh <management IP address of node>
Also note that SSH is limited to your cloud's management network. Each node has an address on the management network, and you can find this address by reading the /etc/hosts or server_info.yml file.
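For example, assuming the default naming scheme in which management-network hostnames end in -mgmt (as in examples elsewhere in this guide), you can list just the management entries with awk. The hosts excerpt below is invented for illustration:

```shell
# Invented /etc/hosts excerpt in the default naming style, where
# management-network hostnames end in -mgmt.
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1       localhost
192.168.10.3    ardana-cp1-c1-m1-mgmt
192.168.10.4    ardana-cp1-comp0001-mgmt
10.0.20.4       ardana-cp1-comp0001-extapi
EOF
# Print only the management entries (address and hostname).
awk '$2 ~ /-mgmt$/ {print $1, $2}' /tmp/hosts.sample
```

Run against the real /etc/hosts on the Cloud Lifecycle Manager, this gives you the list of addresses you can SSH to.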
2.2 Installing the Command-Line Clients #
During the installation, by default, the suite of OpenStack command-line tools is installed on the Cloud Lifecycle Manager and the control plane in your environment. You can learn more about these in the OpenStack documentation: OpenStackClient.
If you wish to install the command-line interfaces on other nodes in your environment, there are two methods you can use, described below.
2.2.1 Installing the CLI tools using the input model #
During the initial install phase of your cloud you can edit your input model to request that the command-line clients be installed on any of the node clusters in your environment. To do so, follow these steps:
Log in to the Cloud Lifecycle Manager.
Edit your control_plane.yml file. Full path:
~/openstack/my_cloud/definition/data/control_plane.yml
In this file you will see a list of service-components to be installed on each of your clusters. These clusters are divided per role, with your controller node cluster likely coming first. Here you will see a list of each of the clients that can be installed. These include:
keystone-client
glance-client
cinder-client
nova-client
neutron-client
swift-client
heat-client
openstack-client
monasca-client
barbican-client
designate-client
For each client you want to install, specify the name under the service-components section for the cluster you want to install it on.
So, for example, if you would like to install the nova and neutron clients on your Compute node cluster, you can do so by adding the nova-client and neutron-client services, like this:
resources:
  - name: compute
    resource-prefix: comp
    server-role: COMPUTE-ROLE
    allocation-policy: any
    min-count: 0
    service-components:
      - ntp-client
      - nova-compute
      - nova-compute-kvm
      - neutron-l3-agent
      - neutron-metadata-agent
      - neutron-openvswitch-agent
      - nova-client
      - neutron-client
Note: This example uses the entry-scale-kvm sample file. Your model may be different, so use this as a guide, but do not copy and paste the contents of this example into your input model.
Commit your configuration to the local git repo, as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Continue with the rest of your installation.
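Before continuing, you can sanity-check that the clients you added actually appear under the intended cluster. A minimal sketch using grep against an invented fragment shaped like the example above (in your environment you would grep the real control_plane.yml):

```shell
# Invented fragment in the shape of the compute resources entry.
cat > /tmp/control_plane.fragment <<'EOF'
  - name: compute
    service-components:
      - ntp-client
      - nova-compute
      - nova-client
      - neutron-client
EOF
# Confirm each wanted client component appears in the fragment.
for client in nova-client neutron-client; do
  grep -q -- "- $client" /tmp/control_plane.fragment \
    && echo "$client: present" || echo "$client: MISSING"
done
```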
2.2.2 Installing the CLI tools using Ansible #
At any point after your initial installation you can install the command-line clients on any of the nodes in your environment. To do so, follow these steps:
Log in to the Cloud Lifecycle Manager.
Obtain the hostname for the nodes you want to install the clients on by looking in your hosts file:
cat /etc/hosts
Install the clients using this playbook, specifying your hostnames using commas:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts -e "install_package=<client_name>" client-deploy.yml -e "install_hosts=<hostname>"
So, for example, if you would like to install the novaclient package on two of your Compute nodes with hostnames ardana-cp1-comp0001-mgmt and ardana-cp1-comp0002-mgmt, you can use this syntax:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts -e "install_package=novaclient" client-deploy.yml -e "install_hosts=ardana-cp1-comp0001-mgmt,ardana-cp1-comp0002-mgmt"
Once the playbook completes successfully, you should be able to SSH to those nodes and, using the proper credentials, authenticate and use the command-line interfaces you have installed.
2.3 Cloud Admin Actions with the Command Line #
Cloud admins can use the command line tools to perform domain admin tasks such as user and project administration.
2.3.1 Creating Additional Cloud Admins #
You can create additional Cloud Admins to help with the administration of your cloud.
keystone identity service query and administration tasks can be performed using the OpenStack command line utility. The utility is installed onto the Cloud Lifecycle Manager as part of the cloud deployment.
keystone administration tasks should be performed by an admin user with a token scoped to the default domain via the keystone v3 identity API. These settings are preconfigured in the file ~/keystone.osrc. By default, keystone.osrc is configured with the admin endpoint of keystone. If the admin endpoint is not accessible from your network, change OS_AUTH_URL to point to the public endpoint.
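keystone.osrc is a standard OpenStack RC file, that is, a set of OS_* exports that the clients read. The following is an illustrative sketch only, not the file's verbatim contents; the endpoint address and password are invented placeholders:

```shell
# Illustrative keystone.osrc-style RC file; values are placeholders.
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_URL=https://192.168.245.5:35357/v3   # admin endpoint (invented address)
export OS_USERNAME=admin
export OS_PASSWORD=ADMIN_PASSWORD_HERE
export OS_PROJECT_NAME=admin
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USER_DOMAIN_NAME=Default
```

With a file in this shape, pointing at the public endpoint instead is a one-line change to OS_AUTH_URL.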
2.3.2 Command Line Examples #
For a full list of OpenStackClient commands, see OpenStackClient Command List.
Sourcing the keystone Administration Credentials
You can set the environment variables needed for identity administration by sourcing the keystone.osrc file created by the lifecycle manager:
source ~/keystone.osrc
List users in the default domain
These users are created by the Cloud Lifecycle Manager in the MySQL back end:
openstack user list
Example output:
$ openstack user list
+----------------------------------+------------------+
| ID                               | Name             |
+----------------------------------+------------------+
| 155b68eda9634725a1d32c5025b91919 | heat             |
| 303375d5e44d48f298685db7e6a4efce | octavia          |
| 40099e245a394e7f8bb2aa91243168ee | logging          |
| 452596adbf4d49a28cb3768d20a56e38 | admin            |
| 76971c3ad2274820ad5347d46d7560ec | designate        |
| 7b2dc0b5bb8e4ffb92fc338f3fa02bf3 | hlm_backup       |
| 86d345c960e34c9189519548fe13a594 | barbican         |
| 8e7027ab438c4920b5853d52f1e08a22 | nova_monasca     |
| 9c57dfff57e2400190ab04955e7d82a0 | barbican_service |
| a3f99bcc71b242a1bf79dbc9024eec77 | nova             |
| aeeb56fc4c4f40e0a6a938761f7b154a | glance-check     |
| af1ef292a8bb46d9a1167db4da48ac65 | cinder           |
| af3000158c6d4d3d9257462c9cc68dda | demo             |
| b41a7d0cb1264d949614dc66f6449870 | swift            |
| b78a2b17336b43368fb15fea5ed089e9 | cinderinternal   |
| bae1718dee2d47e6a75cd6196fb940bd | monasca          |
| d4b9b32f660943668c9f5963f1ff43f9 | ceilometer       |
| d7bef811fb7e4d8282f19fb3ee5089e9 | swift-monitor    |
| e22bbb2be91342fd9afa20baad4cd490 | neutron          |
| ec0ad2418a644e6b995d8af3eb5ff195 | glance           |
| ef16c37ec7a648338eaf53c029d6e904 | swift-dispersion |
| ef1a6daccb6f4694a27a1c41cc5e7a31 | glance-swift     |
| fed3a599b0864f5b80420c9e387b4901 | monasca-agent    |
+----------------------------------+------------------+
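When a script needs a single field from listings like this, the client's -f value -c COLUMN output options are the idiomatic route (for example, openstack user show admin -f value -c id); plain awk over the table also works. A sketch against a two-row sample copied from the output above:

```shell
# Two rows copied from the table above, saved as a sample.
cat > /tmp/users.sample <<'EOF'
| 452596adbf4d49a28cb3768d20a56e38 | admin            |
| a3f99bcc71b242a1bf79dbc9024eec77 | nova             |
EOF
# Print the ID whose Name column is exactly "admin".
admin_id=$(awk -F'|' '{gsub(/ /, "", $2); gsub(/ /, "", $3)}
                      $3 == "admin" {print $2}' /tmp/users.sample)
echo "$admin_id"   # 452596adbf4d49a28cb3768d20a56e38
```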
List domains created by the installation process:
openstack domain list
Example output:
$ openstack domain list
+----------------------------------+---------+---------+----------------------------------------------------------------------+
| ID                               | Name    | Enabled | Description                                                          |
+----------------------------------+---------+---------+----------------------------------------------------------------------+
| 6740dbf7465a4108a36d6476fc967dbd | heat    | True    | Owns users and projects created by heat                              |
| default                          | Default | True    | Owns users and tenants (i.e. projects) available on Identity API v2. |
+----------------------------------+---------+---------+----------------------------------------------------------------------+
List the roles:
openstack role list
Example output:
$ openstack role list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 0be3da26cd3f4cd38d490b4f1a8b0c03 | designate_admin           |
| 13ce16e4e714473285824df8188ee7c0 | monasca-agent             |
| 160f25204add485890bc95a6065b9954 | key-manager:service-admin |
| 27755430b38c411c9ef07f1b78b5ebd7 | monitor                   |
| 2b8eb0a261344fbb8b6b3d5934745fe1 | key-manager:observer      |
| 345f1ec5ab3b4206a7bffdeb5318bd32 | admin                     |
| 49ba3b42696841cea5da8398d0a5d68e | nova_admin                |
| 5129400d4f934d4fbfc2c3dd608b41d9 | ResellerAdmin             |
| 60bc2c44f8c7460a9786232a444b56a5 | neutron_admin             |
| 654bf409c3c94aab8f929e9e82048612 | cinder_admin              |
| 854e542baa144240bfc761cdb5fe0c07 | monitoring-delegate       |
| 8946dbdfa3d346b2aa36fa5941b43643 | key-manager:auditor       |
| 901453d9a4934610ad0d56434d9276b4 | key-manager:admin         |
| 9bc90d1121544e60a39adbfe624a46bc | monasca-user              |
| 9fe2a84a3e7443ae868d1009d6ab4521 | service                   |
| 9fe2ff9ee4384b1894a90878d3e92bab | member                    |
| a24d4e0a5de14bffbe166bfd68b36e6a | swiftoperator             |
| ae088fcbf579425580ee4593bfa680e5 | heat_stack_user           |
| bfba56b2562942e5a2e09b7ed939f01b | keystoneAdmin             |
| c05f54cf4bb34c7cb3a4b2b46c2a448b | glance_admin              |
| fe010be5c57240db8f559e0114a380c1 | key-manager:creator       |
+----------------------------------+---------------------------+
List admin user role assignment within default domain:
openstack role assignment list --user admin --domain default
Example output:
# This indicates that the admin user is assigned the admin role within the default domain
ardana > openstack role assignment list --user admin --domain default
+----------------------------------+----------------------------------+-------+---------+---------+
| Role | User | Group | Project | Domain |
+----------------------------------+----------------------------------+-------+---------+---------+
| b398322103504546a070d607d02618ad | fed1c038d9e64392890b6b44c38f5bbb | | | default |
+----------------------------------+----------------------------------+-------+---------+---------+
Create a new user in default domain:
openstack user create --domain default --password-prompt --email <email_address> --description <description> --enable <username>
Example output showing the creation of a user named testuser with email address test@example.com and a description of "Test User":
ardana > openstack user create --domain default --password-prompt --email test@example.com --description "Test User" --enable testuser
User Password:
Repeat User Password:
+-------------+----------------------------------+
| Field | Value |
+-------------+----------------------------------+
| description | Test User |
| domain_id | default |
| email | test@example.com |
| enabled | True |
| id | 8aad69acacf0457e9690abf8c557754b |
| name | testuser |
+-------------+----------------------------------+
Assign admin role for testuser within the default domain:
openstack role add admin --user <username> --domain default
openstack role assignment list --user <username> --domain default
Example output:
# Just for demonstration purposes - do not do this in a production environment!
ardana > openstack role add admin --user testuser --domain default
ardana > openstack role assignment list --user testuser --domain default
+----------------------------------+----------------------------------+-------+---------+---------+
| Role                             | User                             | Group | Project | Domain  |
+----------------------------------+----------------------------------+-------+---------+---------+
| b398322103504546a070d607d02618ad | 8aad69acacf0457e9690abf8c557754b |       |         | default |
+----------------------------------+----------------------------------+-------+---------+---------+
2.3.3 Assigning the default service admin roles #
The following examples illustrate how you can assign each of the new service admin roles to a user.
Assigning the glance_admin role
A user must have the role of admin in order to assign the glance_admin role. To assign the role, you will set the environment variables needed for the identity service administrator.
First, source the identity service credentials:
source ~/keystone.osrc
You can add the glance_admin role to a user on a project with this command:
openstack role add --user <username> --project <project_name> glance_admin
Example showing a user named testuser being granted the glance_admin role in the test_project project:
openstack role add --user testuser --project test_project glance_admin
You can confirm the role assignment by listing out the roles:
openstack role assignment list --user <username>
Example output:
ardana > openstack role assignment list --user testuser
+----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
| Role                             | User                             | Group | Project                          | Domain | Inherited |
+----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
| 46ba80078bc64853b051c964db918816 | 8bcfe10101964e0c8ebc4de391f3e345 |       | 0ebbf7640d7948d2a17ac08bbbf0ca5b |        | False     |
+----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
Note that only the role ID is displayed. To get the role name, execute the following:
openstack role show <role_id>
Example output:
ardana > openstack role show 46ba80078bc64853b051c964db918816
+-------+----------------------------------+
| Field | Value                            |
+-------+----------------------------------+
| id    | 46ba80078bc64853b051c964db918816 |
| name  | glance_admin                     |
+-------+----------------------------------+
To demonstrate that the user has glance admin privileges, authenticate with those user credentials and then upload and publish an image. Only a user with an admin role or glance_admin can publish an image.
The easiest way to do this will be to make a copy of the service.osrc file and edit it with your user credentials. You can do that with this command:
cp ~/service.osrc ~/user.osrc
Using your preferred editor, edit the user.osrc file and replace the values for the following entries to match your user credentials:
export OS_USERNAME=<username>
export OS_PASSWORD=<password>
You will also need to edit the following lines for your environment:
## Change these values from 'unset' to 'export'
export OS_PROJECT_NAME=<project_name>
export OS_PROJECT_DOMAIN_NAME=Default
Here is an example of an edited user.osrc file:
unset OS_DOMAIN_NAME
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_VERSION=3
export OS_PROJECT_NAME=test_project
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USERNAME=testuser
export OS_USER_DOMAIN_NAME=Default
export OS_PASSWORD=testuser
export OS_AUTH_URL=http://192.168.245.9:35357/v3
export OS_ENDPOINT_TYPE=internalURL
# OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
export OS_INTERFACE=internal
export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
Source the environment variables for your user:
source ~/user.osrc
Upload an image and make it public:
openstack image create --name "upload me" --visibility public --container-format bare --disk-format qcow2 --file uploadme.txt
Example output:
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | dd75c3b840a16570088ef12f6415dd15     |
| container_format | bare                                 |
| created_at       | 2016-01-06T23:31:27Z                 |
| disk_format      | qcow2                                |
| id               | cf1490f4-1eb1-477c-92e8-15ebbe91da03 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | upload me                            |
| owner            | bd24897932074780a20b780c4dde34c7     |
| protected        | False                                |
| size             | 10                                   |
| status           | active                               |
| tags             | []                                   |
| updated_at       | 2016-01-06T23:31:31Z                 |
| virtual_size     | None                                 |
| visibility       | public                               |
+------------------+--------------------------------------+
Note: You can use the command openstack help image create to get the full syntax for this command.
Assigning the nova_admin role
A user must have the role of admin in order to assign the nova_admin role. To assign the role, you will set the environment variables needed for the identity service administrator.
First, source the identity service credentials:
source ~/keystone.osrc
You can add the nova_admin role to a user on a project with this command:
openstack role add --user <username> --project <project_name> nova_admin
Example showing a user named testuser being granted the nova_admin role in the test_project project:
openstack role add --user testuser --project test_project nova_admin
You can confirm the role assignment by listing out the roles:
openstack role assignment list --user <username>
Example output:
ardana > openstack role assignment list --user testuser
+----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
| Role                             | User                             | Group | Project                          | Domain | Inherited |
+----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
| 8cdb02bab38347f3b65753099f3ab73c | 8bcfe10101964e0c8ebc4de391f3e345 |       | 0ebbf7640d7948d2a17ac08bbbf0ca5b |        | False     |
+----------------------------------+----------------------------------+-------+----------------------------------+--------+-----------+
Note that only the role ID is displayed. To get the role name, execute the following:
openstack role show <role_id>
Example output:
ardana > openstack role show 8cdb02bab38347f3b65753099f3ab73c
+-------+----------------------------------+
| Field | Value                            |
+-------+----------------------------------+
| id    | 8cdb02bab38347f3b65753099f3ab73c |
| name  | nova_admin                       |
+-------+----------------------------------+
To demonstrate that the user has nova admin privileges, authenticate with those user credentials and then list the virtual machines across all projects, which requires an admin or nova_admin role.
The easiest way to do this will be to make a copy of the service.osrc file and edit it with your user credentials. You can do that with this command:
cp ~/service.osrc ~/user.osrc
Using your preferred editor, edit the user.osrc file and replace the values for the following entries to match your user credentials:
export OS_USERNAME=<username>
export OS_PASSWORD=<password>
You will also need to edit the following lines for your environment:
## Change these values from 'unset' to 'export'
export OS_PROJECT_NAME=<project_name>
export OS_PROJECT_DOMAIN_NAME=Default
Here is an example of an edited user.osrc file:
unset OS_DOMAIN_NAME
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_VERSION=3
export OS_PROJECT_NAME=test_project
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USERNAME=testuser
export OS_USER_DOMAIN_NAME=Default
export OS_PASSWORD=testuser
export OS_AUTH_URL=http://192.168.245.9:35357/v3
export OS_ENDPOINT_TYPE=internalURL
# OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
export OS_INTERFACE=internal
export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
Source the environment variables for your user:
source ~/user.osrc
List all of the virtual machines in the project specified in user.osrc:
openstack server list
Example output showing no virtual machines, because there are no virtual machines created on the project specified in the user.osrc file:
+--------------------------------------+------+--------+----------+
| ID                                   | Name | Status | Networks |
+--------------------------------------+------+--------+----------+
+--------------------------------------+------+--------+----------+
For this demonstration, there is a virtual machine associated with a different project. Because your user has nova_admin permissions, you can view those virtual machines using a slightly different command:
openstack server list --all-projects
Example output, now showing a virtual machine:
ardana > openstack server list --all-projects
+--------------------------------------+---------+--------+----------+
| ID                                   | Name    | Status | Networks |
+--------------------------------------+---------+--------+----------+
| da4f46e2-4432-411b-82f7-71ab546f91f3 | testvml | ACTIVE |          |
+--------------------------------------+---------+--------+----------+
You can also now delete virtual machines in other projects by using the --all-projects switch:
openstack server delete --all-projects <instance_id>
Example showing the deletion of the instance from the previous step:
openstack server delete --all-projects da4f46e2-4432-411b-82f7-71ab546f91f3
You can get a full list of available commands by using this:
openstack -h
You can perform the same steps as above for the neutron and cinder service admin roles:
neutron_admin
cinder_admin
2.3.4 Customize policy.json on the Cloud Lifecycle Manager #
One way to deploy policy.json for a service is to go to each of the target nodes and make changes there. This is no longer necessary: the process has been streamlined, and policy.json files can be edited on the Cloud Lifecycle Manager and then deployed to the nodes. Exercise caution when modifying policy.json files. It is best to validate the changes in a non-production environment before rolling policy.json changes into production, and you should not make policy.json changes without a way to validate the desired policy behavior. Updated policy.json files can be deployed using the appropriate <service_name>-reconfigure.yml playbook.
2.3.5 Roles #
Service roles represent the functionality used to implement the OpenStack role based access control (RBAC) model. This is used to manage access to each OpenStack service. Roles are named and assigned per user or group for each project by the identity service. Role definition and policy enforcement are defined outside of the identity service independently by each OpenStack service.
The token generated by the identity service for each user authentication contains the role(s) assigned to that user for a particular project. When a user attempts to access a specific OpenStack service, the role is parsed by the service, compared to the service-specific policy file, and then granted the resource access defined for that role by the service policy file.
Each service has its own service policy file with the
/etc/[SERVICE_CODENAME]/policy.json
file name format
where [SERVICE_CODENAME]
represents a specific OpenStack
service name. For example, the OpenStack nova service would have a policy
file called /etc/nova/policy.json
.
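As an illustration of the file format, each policy.json entry maps an API action to a rule expression evaluated against the caller's roles. The entries below are a hypothetical sketch in the upstream nova style, not the defaults shipped with SUSE OpenStack Cloud:

```json
{
    "context_is_admin": "role:admin or role:nova_admin",
    "admin_or_owner": "is_admin:True or project_id:%(project_id)s",
    "os_compute_api:servers:delete": "rule:admin_or_owner"
}
```

With such a policy, deleting a server is permitted for the project that owns it or for any caller whose token carries a role matching context_is_admin.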
Service policy files can be modified and deployed to control nodes from the Cloud Lifecycle Manager. Administrators are advised to validate policy changes in a non-production environment before checking them in to the site branch of the local git repository and rolling them into production. Do not make changes to policy files without having a way to validate them.
The policy files are located at the following site branch directory on the Cloud Lifecycle Manager.
~/openstack/ardana/ansible/roles/
For test and validation, policy files can be modified in a non-production
environment from the ~/scratch/
directory. For a specific
policy file, run a search for policy.json
. To deploy
policy changes for a service, run the service specific reconfiguration
playbook (for example, nova-reconfigure.yml
). For a
complete list of reconfiguration playbooks, change directories to
~/scratch/ansible/next/ardana/ansible
and run this
command:
ls -l | grep reconfigure
Comments added to any *.j2
files (including templates)
must follow proper comment syntax. Otherwise you may see errors when
running the config-processor or any of the service playbooks.
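The workflow described above can be sketched as a short shell session. This is a minimal sketch: the nova service is used as an example, the find runs against the roles directory named earlier, and the reconfigure step is shown commented out because it must be run on a real Cloud Lifecycle Manager.

```shell
# Locate policy.json templates for a service (nova is only an example here).
ROLES_DIR="${ROLES_DIR:-$HOME/openstack/ardana/ansible/roles}"
find "$ROLES_DIR" -name 'policy.json*' 2>/dev/null || true

# After editing the file found above, deploy the change with the service's
# reconfigure playbook (run on a real Cloud Lifecycle Manager):
#   cd ~/scratch/ansible/next/ardana/ansible
#   ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
```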
2.4 Log Management and Integration #
2.4.1 Overview #
SUSE OpenStack Cloud uses the ELK (Elasticsearch, Logstash, Kibana) stack for log management across the entire cloud infrastructure. This configuration facilitates simple administration as well as integration with third-party tools. This tutorial covers how to forward your logs to a third-party tool or service, and how to access and search the Elasticsearch log stores through API endpoints.
2.4.2 The ELK stack #
The ELK logging stack consists of the Elasticsearch, Logstash, and Kibana elements.
Elasticsearch. Elasticsearch is the storage and indexing component of the ELK stack. It stores and indexes the data received from Logstash. Indexing makes your log data searchable by tools designed for querying and analyzing massive sets of data. You can query the Elasticsearch datasets from the built-in Kibana console, a third-party data analysis tool, or through the Elasticsearch API (covered later).
Logstash. Logstash reads the log data from the services running on your servers, and then aggregates and ships that data to a storage location. By default, Logstash sends the data to the Elasticsearch indexes, but it can also be configured to send data to other storage and indexing tools such as Splunk.
Kibana. Kibana provides a simple and easy-to-use method for searching, analyzing, and visualizing the log data stored in the Elasticsearch indexes. You can customize the Kibana console to provide graphs, charts, and other visualizations of your log data.
2.4.3 Using the Elasticsearch API #
You can query the Elasticsearch indexes through various language-specific
APIs, as well as directly over the IP address and port that Elasticsearch
exposes on your implementation. By default, Elasticsearch presents from
localhost, port 9200. You can run queries directly from a terminal using
curl
. For example:
ardana > curl -XGET 'http://localhost:9200/_search?q=tag:yourSearchTag'
The preceding command searches all indexes for all data with the "yourSearchTag" tag.
You can also use the Elasticsearch API from outside the logging node. This method connects over the Kibana VIP address, port 5601, using basic http authentication. For example, you can use the following command to perform the same search as the preceding search:
curl -u kibana:<password> kibana_vip:5601/_search?q=tag:yourSearchTag
You can further refine your search to a specific index of data, in this case the "elasticsearch" index:
ardana > curl -XGET 'http://localhost:9200/elasticsearch/_search?q=tag:yourSearchTag'
The search API is RESTful, so responses are provided in JSON format. Here's a sample (though empty) response:
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 45,
    "successful": 45,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
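Because responses are JSON, they can be filtered directly from the shell. The following is a minimal sketch, assuming python3 is available on the node; the sample response above stands in for live curl output:

```shell
# Extract hits.total from an Elasticsearch search response. In practice you
# would pipe `curl -XGET 'http://localhost:9200/_search?q=...'` into the filter.
response='{"took":13,"timed_out":false,"_shards":{"total":45,"successful":45,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}'
echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["hits"]["total"])'
# prints: 0
```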
2.4.4 For More Information #
You can find more detailed Elasticsearch API documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html.
Review the Elasticsearch Python API documentation at http://elasticsearch-py.readthedocs.io/en/master/api.html.
Read the Elasticsearch Java API documentation at https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/index.html.
2.4.5 Forwarding your logs #
You can configure Logstash to ship your logs to an outside storage and indexing system, such as Splunk. Setting up this configuration is as simple as editing a few configuration files, and then running the Ansible playbooks that implement the changes. Here are the steps.
Begin by logging in to the Cloud Lifecycle Manager.
Verify that the logging system is up and running:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts logging-status.yml
When the preceding playbook completes without error, proceed to the next step.
Edit the Logstash configuration file, found at the following location:
~/openstack/ardana/ansible/roles/logging-server/templates/logstash.conf.j2
Near the end of the Logstash configuration file, you will find a section for configuring Logstash output destinations. The following example demonstrates the changes necessary to forward your logs to an outside server. The added tcp block sets up a TCP connection to the destination server's IP address over port 5514.
# Logstash outputs
output {
  # Configure Elasticsearch output
  # http://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
  elasticsearch {
    index => "%{[@metadata][es_index]}"
    hosts => ["{{ elasticsearch_http_host }}:{{ elasticsearch_http_port }}"]
    flush_size => {{ logstash_flush_size }}
    idle_flush_time => 5
    workers => {{ logstash_threads }}
  }
  # Forward logs to Splunk on TCP port 5514, matching the port specified in the Splunk Web UI.
  tcp {
    mode => "client"
    host => "<Enter Destination listener IP address>"
    port => 5514
  }
}
Logstash can forward log data to multiple sources, so there is no need to remove or alter the Elasticsearch section in the preceding file. However, if you choose to stop forwarding your log data to Elasticsearch, you can do so by removing the related section in this file, and then continue with the following steps.
Commit your changes to the local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Your commit message"
Run the configuration processor to check the status of all configuration files:
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Run the ready-deployment playbook:
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Implement the changes to the Logstash configuration file:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-server-configure.yml
Configuring the receiving service will vary from product to product. Consult the documentation for your particular product for instructions on how to set it up to receive log files from Logstash.
2.5 Integrating Your Logs with Splunk #
2.5.1 Integrating with Splunk #
The SUSE OpenStack Cloud 9 logging solution provides a flexible and extensible framework to centralize the collection and processing of logs from all nodes in your cloud. The logs are shipped to a highly available and fault-tolerant cluster where they are transformed and stored for better searching and reporting. The SUSE OpenStack Cloud 9 logging solution uses the ELK stack (Elasticsearch, Logstash and Kibana) as a production-grade implementation and can support other storage and indexing technologies.
You can configure Logstash, the service that aggregates and forwards the logs to a searchable index, to send the logs to a third-party target, such as Splunk.
For how to integrate the SUSE OpenStack Cloud 9 centralized logging solution with Splunk, including the steps to set up and forward logs, please refer to Section 4.1, “Splunk Integration”.
2.6 Integrating SUSE OpenStack Cloud with an LDAP System #
You can configure your SUSE OpenStack Cloud cloud to work with an outside user authentication source such as Active Directory or OpenLDAP. keystone, the SUSE OpenStack Cloud identity service, functions as the first stop for any user authorization/authentication requests. keystone can also function as a proxy for user account authentication, passing along authentication and authorization requests to any LDAP-enabled system that has been configured as an outside source. This type of integration lets you use an existing user-management system such as Active Directory and its powerful group-based organization features as a source for permissions in SUSE OpenStack Cloud.
Upon successful completion of this tutorial, your cloud will refer user authentication requests to an outside LDAP-enabled directory system, such as Microsoft Active Directory or OpenLDAP.
2.6.1 Configure your LDAP source #
To configure your SUSE OpenStack Cloud cloud to use an outside user-management source, perform the following steps:
Make sure that the LDAP-enabled system you plan to integrate with is up and running and accessible over the necessary ports from your cloud management network.
Edit the /var/lib/ardana/openstack/my_cloud/config/keystone/keystone.conf.j2 file and set the following options:
domain_specific_drivers_enabled = True
domain_configurations_from_database = False
Create a YAML file in the /var/lib/ardana/openstack/my_cloud/config/keystone/ directory that defines your LDAP connection. You can make a copy of the sample keystone-LDAP configuration file, and then edit that file with the details of your LDAP connection.
The following example copies the keystone_configure_ldap_sample.yml file and names the new file keystone_configure_ldap_my.yml:
ardana > cp /var/lib/ardana/openstack/my_cloud/config/keystone/keystone_configure_ldap_sample.yml \
  /var/lib/ardana/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml
Edit the new file to define the connection to your LDAP source. This guide does not provide comprehensive information on all aspects of the keystone_configure_ldap.yml file. Find a complete list of keystone/LDAP configuration file options at: https://github.com/openstack/keystone/tree/stable/rocky/etc
The following file illustrates an example keystone configuration that is customized for an Active Directory connection.
keystone_domainldap_conf:

    # CA certificates file content.
    # Certificates are stored in Base64 PEM format. This may be entire LDAP server
    # certificate (in case of self-signed certificates), certificate of authority
    # which issued LDAP server certificate, or a full certificate chain (Root CA
    # certificate, intermediate CA certificate(s), issuer certificate).
    #
    cert_settings:
      cacert: |
        -----BEGIN CERTIFICATE-----
        certificate appears here
        -----END CERTIFICATE-----

    # A domain will be created in MariaDB with this name, and associated with ldap back end.
    # Installer will also generate a config file named /etc/keystone/domains/keystone.<domain_name>.conf
    #
    domain_settings:
      name: ad
      description: Dedicated domain for ad users

    conf_settings:
      identity:
        driver: ldap

      # For a full list and description of ldap configuration options, please refer to
      # http://docs.openstack.org/liberty/config-reference/content/keystone-configuration-file.html.
      #
      # Please note:
      # 1. LDAP configuration is read-only. Configuration which performs write operations (i.e. creates users, groups, etc)
      #    is not supported at the moment.
      # 2. LDAP is only supported for identity operations (reading users and groups from LDAP). Assignment
      #    operations with LDAP (i.e. managing roles, projects) are not supported.
      # 3. LDAP is configured as non-default domain. Configuring LDAP as a default domain is not supported.
      #
      ldap:
        url: ldap://YOUR_COMPANY_AD_URL
        suffix: YOUR_COMPANY_DC
        query_scope: sub
        user_tree_dn: CN=Users,YOUR_COMPANY_DC
        user: CN=admin,CN=Users,YOUR_COMPANY_DC
        password: REDACTED
        user_objectclass: user
        user_id_attribute: cn
        user_name_attribute: cn
        group_tree_dn: CN=Users,YOUR_COMPANY_DC
        group_objectclass: group
        group_id_attribute: cn
        group_name_attribute: cn
        use_pool: True
        user_enabled_attribute: userAccountControl
        user_enabled_mask: 2
        user_enabled_default: 512
        use_tls: True
        tls_req_cert: demand
        # if you are configuring multiple LDAP domains, and LDAP server certificates are issued
        # by different authorities, make sure that you place certs for all the LDAP backend domains in the
        # cacert parameter as seen in this sample yml file so that all the certs are combined in a single CA file
        # and every LDAP domain configuration points to the combined CA file.
        # Note:
        # 1. Please be advised that every time a new ldap domain is configured, the single CA file gets overwritten
        #    and hence ensure that you place certs for all the LDAP backend domains in the cacert parameter.
        # 2. There is a known issue on one cert per CA file per domain when the system processes
        #    concurrent requests to multiple LDAP domains. Using the single CA file with all certs combined
        #    shall get the system working properly.
        tls_cacertfile: /etc/keystone/ssl/certs/all_ldapdomains_ca.pem
Add your new file to the local Git repository and commit the changes.
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add -A
ardana > git commit -m "Adding LDAP server integration config"
Run the configuration processor and deployment preparation playbooks to validate the YAML files and prepare the environment for configuration.
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the keystone reconfiguration playbook to implement your changes, passing the newly created YAML file as an argument to the -e@FILE_PATH parameter:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml \
  -e@/var/lib/ardana/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml
To integrate your SUSE OpenStack Cloud cloud with multiple domains, repeat these steps starting from Step 3 for each domain.
3 Cloud Lifecycle Manager Admin UI User Guide #
The Cloud Lifecycle Manager Admin UI is a web-based GUI for viewing and managing the
configuration of an installed cloud. After successfully deploying the cloud
with the Install UI, the final screen displays a link to the CLM Admin UI.
(For example, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.5 “Running the Install UI”, Cloud Deployment Successful). Usually the URL
associated with this link is
https://DEPLOYER_MGMT_NET_IP:9085
,
although it may be different depending on the cloud configuration and the
installed version of SUSE OpenStack Cloud.
3.1 Accessing the Admin UI #
In a browser, go to
https://DEPLOYER_MGMT_NET_IP:9085
.
The
DEPLOYER_MGMT_NET_IP:PORT_NUMBER
is not necessarily the same for all installations, and can be displayed with
the following command:
ardana > openstack endpoint list --service ardana --interface admin -c URL
Accessing the Cloud Lifecycle Manager Admin UI requires access to the MANAGEMENT network that was configured when the cloud was deployed; without it, you cannot reach the Admin UI or log in. Depending on the network setup, it may be necessary to use an SSH tunnel similar to what is recommended in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.5 “Running the Install UI”. The Admin UI requires keystone and HAProxy to be running and accessible. If keystone or HAProxy are not running, cloud reconfiguration is limited to the command line.
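A hypothetical tunnel of the kind mentioned above might look like the following. The address 192.0.2.10 stands in for DEPLOYER_MGMT_NET_IP, and the user and jump host are placeholders for your environment:

```shell
# Forward local port 9085 to the Admin UI on the MANAGEMENT network
# (placeholders only - substitute real addresses before running):
#   ssh -N -L 9085:192.0.2.10:9085 ardana@jump.example.com
# With the tunnel up, browse to https://localhost:9085
# `ssh -G` resolves the options such a command would use, without connecting:
ssh -G -L 9085:192.0.2.10:9085 ardana@jump.example.com | grep -i '^localforward'
```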
Logging in requires a keystone user. If the user is not an admin on the default domain and one or more projects, the Cloud Lifecycle Manager Admin UI will not display information about the Cloud and may present errors.
3.2 Admin UI Pages #
3.2.1 Services #
Services pages relay information about the various OpenStack and other
services that have been deployed as part of the cloud. Service information
displays the list of services registered with keystone and the endpoints
associated with those services. The information is equivalent to the output of the command openstack endpoint list.
The Service Information
table contains the following
information, based on how the service is registered with keystone:
- Name
The name of the service; this may be an OpenStack code name
- Description
Service description; for some services this is a repeat of the name
- Endpoints
Services typically have one or more endpoints that are accessible for making API calls. The most common configuration is for a service to have Admin, Public, and Internal endpoints, with each intended for access by the corresponding type of consumer.
- Region
Service endpoints are part of a region. In multi-region clouds, some services will have endpoints in multiple regions.
3.2.2 Packages #
The Packages tab displays packages that are part of the SUSE OpenStack Cloud product.
The SUSE Cloud Packages
table contains the following:
- Name
The name of the SUSE Cloud package
- Version
The version of the package which is installed in the Cloud
Packages with the venv-
prefix denote the version of
the specific OpenStack package that is deployed. The release name can be
determined from the OpenStack Releases
page.
3.2.3 Configuration #
The Configuration tab lists the .j2 configuration files for the deployed services, which can be edited and applied to the cloud.
This page also provides the ability to set up SUSE Enterprise Storage Integration after initial deployment.
Clicking one of the listed configuration files opens the file editor where changes can be made. Asterisks identify files that have been edited but have not had their updates applied to the cloud.
After editing the service configuration, click the button to begin deploying configuration changes to the cloud. The status of those changes will be streamed to the UI.
Configure SUSE Enterprise Storage After Initial Deployment
A link to the settings.yml file is available under the ses selection on the Configuration tab.
To set up SUSE Enterprise Storage Integration:
1. Click the link to edit the settings.yml file.
2. Uncomment the ses_config_path parameter, specify the location on the deployer host containing the ses_config.yml file, and save the settings.yml file.
3. If the ses_config.yml file does not yet exist in that location on the deployer host, a new link will appear for uploading a file from your local workstation.
4. When ses_config.yml is present on the deployer host, it will appear in the ses section of the Configuration tab and can be edited directly there.
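As an illustration of the ses_config_path edit, the same change can be made from a shell on the deployer. The file location and path value below are placeholder examples, not the actual defaults:

```shell
# Create a throwaway stand-in for settings.yml to demonstrate the edit;
# in practice, edit the real settings.yml opened from the Admin UI.
cat > /tmp/settings.yml <<'EOF'
# ses_config_path: /var/lib/ardana/ses/
EOF

# Uncomment ses_config_path and point it at the directory on the deployer
# host that contains ses_config.yml:
sed -i 's|^# *ses_config_path:.*|ses_config_path: /var/lib/ardana/ses/|' /tmp/settings.yml
cat /tmp/settings.yml
```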
If the cloud is configured using self-signed certificates, the streaming status updates (including the log) may be interrupted and require a reload of the CLM Admin UI. See Book “Security Guide”, Chapter 8 “Transport Layer Security (TLS) Overview”, Section 8.2 “TLS Configuration” for details on using signed certificates.
3.2.4 Model #
The Model tab displays input models that are deployed in the cloud and the associated model files. The model files listed can be modified.
Clicking one of the listed model files opens the file editor where changes can be made. Asterisks identify files that have been edited but have not had their updates applied to the cloud.
After editing a model file, applying the update involves running the config-processor-run.yml and ready-deployment.yml playbooks or running a full deployment. The UI also indicates the risk of updating the deployed cloud.
Click the button to start deployment. The status of the changes will be streamed to the UI.
If the cloud is configured using self-signed certificates, the streaming status updates (including the log) may be interrupted and the CLM Admin UI must be reloaded. See Book “Security Guide”, Chapter 8 “Transport Layer Security (TLS) Overview”, Section 8.2 “TLS Configuration” for details on using signed certificates.
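The playbooks involved can also be run manually from the Cloud Lifecycle Manager, as shown elsewhere in this guide. The sketch below only prints the commands, since running them requires a deployed cloud:

```shell
# Commands that roughly correspond to what the Admin UI drives when
# applying model changes; printed rather than executed here because
# they need a live Cloud Lifecycle Manager deployment.
CLM_STEPS='cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
ansible-playbook -i hosts/localhost ready-deployment.yml'
echo "$CLM_STEPS"
```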
3.2.5 Roles #
The Roles tab displays the list of all roles that have been defined in the Cloud Lifecycle Manager input model, the servers that have each role, and the services installed on those servers.
The Services Per Role
table contains the following:
- Role
The name of the role in the data model. In the included data model templates, these names are descriptive, such as MTRMON-ROLE for a metering and monitoring server. There is no strict constraint on role names, and they may have been altered at install time.
- Servers
The model IDs for the servers that have been assigned this role. This does not necessarily correspond to any DNS or other naming labels a host has, unless the host ID was set that way during install.
- Services
A list of OpenStack and other Cloud related services that comprise this role. Servers that have been assigned this role will have these services installed and enabled.
3.2.6 Servers #
The Servers pages contain information about the hardware that comprises the cloud, including the configuration of the servers, as well as the ability to add new Compute Nodes to the cloud.
The Servers
table contains the following information:
- ID
This is the ID of the server in the data model. This does not necessarily correspond to any DNS or other naming labels a host has, unless the host ID was set that way during install.
- IP Address
The management network IP address of the server
- Server Group
The server group which this server is assigned to
- NIC Mapping
The NIC mapping that describes the PCI slot addresses for the server's ethernet adapters
- MAC Address
The hardware address of the server's primary physical ethernet adapter
3.2.7 Admin UI Server Details #
Server Details can be viewed by clicking the menu at the right side of each row in the Servers table. The server details dialog contains the information from the Servers table and the following additional fields:
- IPMI IP Address
The IPMI network address; this may be empty if the server was provisioned prior to being added to the Cloud
- IPMI Username
The username that was specified for IPMI access
- IPMI Password
This is obscured in the read-only dialog, but is editable when adding a new server
- Network Interfaces
The network interfaces configured on the server
- Filesystem Utilization
Filesystem usage (percentage of filesystem in use). Only available if monasca is in use
3.3 Topology #
The topology section of the Cloud Lifecycle Manager Admin UI displays an overview of how the Cloud is configured. Each section of the topology represents some facet of the Cloud configuration and provides a visual layout of the way components are associated with each other. Many of the components in the topology are linked to each other, and can be navigated between by clicking on any component that appears as a hyperlink.
3.3.1 Control Planes #
The Control Planes tab displays control planes and availability zones within the Cloud.
Each control plane is shown as a table of clusters, resources, and load balancers (represented by vertical columns in the table).
- Control Plane
A set of servers dedicated to running the infrastructure of the Cloud. Many Cloud configurations will have only a single control plane.
- Clusters
A set of one or more servers hosting a particular set of services, tied to the role that has been assigned to each server. Clusters are generally differentiated from Resources in that they are fixed-size groups of servers that do not grow as the Cloud grows.
- Resources
Servers hosting the scalable parts of the Cloud, such as Compute Hosts that host VMs, or swift servers for object storage. These will vary in number with the size and scale of the Cloud and can generally be increased after the initial Cloud deployment.
- Load Balancers
Servers that distribute API calls across servers hosting the called services.
- Availability Zones
Listed beneath the running services; each row groups the hosts in a particular availability zone for a particular cluster or resource type (the rows are AZs, the columns are clusters/resources)
3.3.2 Regions #
Displays the distribution of control plane services across regions. Clouds that have only a single region will list all services in the same cell.
- Control Planes
The group of services that run the Cloud infrastructure
- Region
Each region will be represented by a column with the region name as the column header. The list of services that are running in that region will be in that column, with each row corresponding to a particular control plane.
3.3.3 Services #
A list of services running in the Cloud, organized by the type (class) of service. Each service is then listed along with the control planes that the service is part of, the other services that each particular service consumes (requires), and the endpoints of the service, if the service exposes an API.
- Class
A category of like services, such as "security" or "operations". Multiple services may belong to the same category.
- Description
A short description of the service, typically sourced from the service itself
- Service
The name of the service. For OpenStack services, this is the project codename, such as nova for virtual machine provisioning. Clicking a service will navigate to the section of this page with details for that particular service.
The detail data about a service provides additional insight, such as what other services are required to run it and what network protocols can be used to access it.
- Components
Each service is made up of one or more components, which are listed separately here. The components of a service may represent pieces of the service that run on different hosts, provide distinct functionality, or modularize business logic.
- Control Planes
A service may be running in multiple control planes. Each control plane that a service is running in will be listed here.
- Consumes
Other services required for this service to operate correctly.
- Endpoints
How a service can be accessed, typically a REST API, though other network protocols may be listed here. Services that do not expose an API or have any sort of external access will not list any entries here.
3.3.4 Networks #
Lists the networks and network groups that comprise the Cloud. Each network group is represented by a row in the table, with columns identifying which networks are used by the intersection of the group (row) and cluster/resource (column).
- Group
The network group
- Clusters
A set of one or more servers hosting a particular set of services, tied to the role that has been assigned to each server. Clusters are generally differentiated from Resources in that they are fixed-size groups of servers that do not grow as the Cloud grows.
- Resources
Servers hosting the scalable parts of the Cloud, such as Compute Hosts that host VMs, or swift servers for object storage. These will vary in number with the size and scale of the Cloud and can generally be increased after the initial Cloud deployment.
Cells in the middle of the table represent the network that is running on the resource/cluster represented by that column and is part of the network group identified in the leftmost column of the same row.
Each network group is listed along with the servers and interfaces that comprise the network group.
- Network Group
The elements that make up the network group, whose name is listed above the table
- Networks
Networks that are part of the specified network group
- Address
IP address of the corresponding server
- Server
Server name of the server that is part of this network. Clicking on a server will load the server topology details.
- Interface Model
The particular combination of hardware address and bonding that ties this server to the specified network group. Clicking on an Interface Model will load the corresponding section of the Roles page.
3.3.5 Servers #
A hierarchical display of the tree of Server Groups. Groups will be
represented by a heading with their name, starting with the first row which
contains the Cloud-wide server group (often called
CLOUD
). Within each Server Group, the Network Groups,
Networks, Servers, and Server Roles are broken down. Note that server groups
can be nested, producing a tree-like structure of groups.
- Network Groups
The network groups that are part of this server group.
- Networks
The network that is part of the server group and corresponds to the network group in the same row.
- Server Roles
The model-defined role that was applied to the server, made up of a combination of services and network/storage configurations unique to that role within the Cloud
- Servers
The servers that have the role defined in their row and are part of the network group represented by the column the server is in.
3.3.6 Roles #
The list of server roles that define the server configurations for the Cloud. Each server role consists of several configurations. In this topology the focus is on the Disk Models and Network Interface Models that are applied to the servers with that role.
- Server Role
The name of the role, as it is defined in the model
- Disk Model
The name of the disk model
- Volume Group
Name of the volume group
- Mount
Name of the volume being mounted on the server
- Size
The size of the volume as a percentage of physical disk space
- FS Type
Filesystem type
- Options
Optional flags applied when mounting the volume
- PVol(s)
The physical address to the storage used for this volume group
- Interface Model
The name of the interface model
- Network Group
The name of the network group. Clicking on a Network Group will load the details of that group on the Networks page.
- Interface/Options
Includes the logical network names, such as hed1 and hed2, and any bond information grouping those logical networks together. The Cloud software will map these to physical devices.
3.4 Server Management #
3.4.1 Adding Servers #
The Add Server page in the Cloud Lifecycle Manager Admin UI allows for adding additional Compute Nodes to the Cloud.
3.4.1.1 Available Servers #
Servers that can be added to the Cloud are shown on the left side of the
Add Server
screen. Additional servers can be included in
this list in three different ways:
Discover servers via SUSE Manager or HPE OneView (for details on adding servers via autodiscovery, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.4 “Optional: Importing Certificates for SUSE Manager and HPE OneView” and Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.5 “Running the Install UI”)
Manually add servers individually by clicking the button and filling out the form with the server information (instructions below)
Create a CSV file of the servers to be added (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”, Section 21.3 “Optional: Creating a CSV File to Import Server Data”)
Manually adding a server requires the following fields:
- ID
A unique name for the server
- IP Address
The IP address that the server has, or will have, in the Cloud
- Server Group
Which server group the server will belong to. The IP address must be compatible with the selected Server Group. If the required Server Group is not present, it can be created
- NIC Mapping
The NIC to PCI address mapping for the server being added to the Cloud. If the required NIC mapping is not present, it can be created
- Role
Which compute role to add the server to. If this is set, the server will be immediately assigned that role on the right side of the page. If it is not set, the server will be added to the left side panel of available servers
Some additional fields must be set if the server is not already provisioned with an OS, or if a new OS install is desired for the server. These fields are not required if an OpenStack Cloud compatible OS is already installed:
- MAC Address
The MAC address of the IPMI network card of the server
- IPMI IP Address
The IPMI network address (IP address) of the server
- IPMI Username
Username to log in to IPMI on the server
- IPMI Password
Password to log in to IPMI on the server
Servers in the available list can be dragged to the desired role on the right. Only Compute-related roles will be displayed.
3.4.1.2 Add Server Settings #
There are several settings that apply across all Compute Nodes being added to the Cloud. Beneath the list of nodes, users will find options to control whether existing nodes can be modified, whether the new nodes should have their data disks wiped, and whether to activate the new Compute Nodes as part of the update process.
- Safe Mode
Prevents modification of existing Compute Nodes. Can be unchecked to allow modifications. Modifying existing Compute Nodes has the potential to disrupt the continuous operation of the Cloud and should be done with caution.
- Wipe Data Disks
The data disks on the new server will not be wiped by default, but users can specify to wipe clean the data disks as part of the process of adding the Compute Node(s) to the Cloud.
- Activate
Activates the added Compute Node(s) during the process of adding them to the Cloud. Activation adds a Compute Node to the pool of nodes that the
nova-scheduler
uses when instantiating VMs.
3.4.1.3 Install OS #
Servers that have been assigned a role but not yet deployed can have SLES installed as part of the Cloud deployment. This step is necessary for servers that are not provisioned with an OS.
On the Install OS page, the Available Servers list will be populated with servers that have been assigned to a role but not yet deployed to the Cloud. From here, select which servers to install an OS onto and use the arrow controls to move them to the Selected Servers box on the right. After all servers that require an OS have been added to the Selected Servers list, click the button to proceed.
The UI will prompt for confirmation that the OS should be installed, because provisioning an OS will replace any existing operating system on the server.
When the OS install begins, progress of the install will be displayed on screen.
After OS provisioning is complete, a summary of the provisioned servers will be displayed. Clicking the button will return the user to the role selection page where deployment can continue.
3.4.1.4 Deploy New Servers #
When all newly added servers have an OS provisioned, either via the Install OS process detailed above or having previously been provisioned outside of the Cloud Lifecycle Manager Admin UI, deployment can begin.
The button will be enabled when one or more new servers have been assigned roles. Clicking it will prompt for confirmation before beginning the deployment process.
The deployment process begins by running the Configuration Processor in basic validation mode to check the values input for the servers being added. This will check IP addresses, server groups, and NIC mappings for syntax or format errors.
After validation is successful, the servers will be prepared for deployment. The preparation consists of running the full Configuration Processor and two additional playbooks to ready servers for deployment.
After the servers have been prepared, deployment can begin. This process will generate a new hosts file, run the site.yml playbook, and update monasca (if monasca is deployed).
When deployment is completed, a summary page will be displayed. Clicking the button returns the user to the Add Server page.
3.4.2 Activating Servers #
The Server Summary page in the Cloud Lifecycle Manager Admin UI allows for activating
Compute Nodes in the Cloud. Compute Nodes may be activated when
they are added to the Cloud. An activated compute node is available
for the nova-scheduler
to use for hosting new VMs that are created.
Only servers that are not currently activated will have the
activation menu option available.
Once activation is triggered, the progress of activating the node
and adding it to the nova-scheduler
is displayed.
3.4.3 Deactivating Servers #
The Server Summary page in the Cloud Lifecycle Manager Admin UI allows for deactivating
Compute Nodes in the Cloud. Deactivating a Compute Node removes it from
the pool of servers that the nova-scheduler
will put VMs on.
When a Compute Node is deactivated, the UI attempts to migrate any
currently running VMs from that server to an active node.
The deactivation process requires confirmation before proceeding.
Once deactivation is triggered, the progress of deactivating the node
and removing it from the nova-scheduler
is displayed.
If a Compute Node selected for deactivation has VMs running on it, a prompt will appear to select where to migrate the running VMs.
A summary of the VMs being migrated will be displayed, along with the progress migrating them from the deactivated Compute Node to the target host. Once the migration attempt is complete, click 'Done' to continue the deactivation process.
3.4.4 Deleting Servers #
The Server Summary page in the Cloud Lifecycle Manager Admin UI allows for deleting Compute Nodes from the Cloud. Deleting a Compute Node removes it from the cloud. Only Compute Nodes that are deactivated can be deleted.
The deletion process requires confirmation before proceeding.
If the Compute Node is not reachable (SSH from the deployer is not possible), a warning will appear, requesting confirmation that the node is shut down or otherwise removed from the environment. Reachable Compute Nodes will be shut down as part of the deletion process.
The progress of deleting the Compute Node will be displayed, including a streaming log with additional details of the running playbooks.
3.5 Server Replacement #
The process of replacing a server is initiated from the Server Summary (see Section 3.2.6, “Servers”). Replacing a server will remove the existing server from the Cloud configuration and install the new server in its place. The rest of this process varies slightly depending on the type of server being replaced.
3.5.1 Control Plane Servers #
Servers that are part of the Control Plane (generally those that are not hosting Compute VMs or ephemeral storage) are replaced "in-place". This means the replacement server has the same IP Address and is expected to have the same NIC Mapping and Server Group as the server being replaced.
To replace a Control Plane server, click the menu to the right of the server listing on the Section 3.2.6, “Servers” page. From the menu options, select the replace option.
Selecting it will open a dialog box that includes information about the server being replaced, as well as a form for inputting the required information for the new server. The IPMI information for the new server is required to perform the replacement process.
- MAC Address
The hardware address of the server's primary physical ethernet adapter
- IPMI IP Address
The network address for IPMI access to the new server
- IPMI Username
The username credential for IPMI access to the new server
- IPMI Password
The password associated with the
IPMI Username
on the new server
To use a server that has already been discovered, check the corresponding box and select an existing server from the dropdown. This will automatically populate the server information fields above with the information previously entered or discovered for the specified server.
If SLES is not already installed, or to reinstall SLES on the new server, check the corresponding box. The username will be pre-populated with the username from the Cloud install. Installing the OS requires specifying the password that was used for deploying the cloud so that the replacement process can access the host after the OS is installed.
The data disks on the new server will not be wiped by default, but users can specify to wipe clean the data disks as part of the replacement process.
Once the new server information is set, click the button in the lower right to begin replacement. A list of the replacement process steps will be displayed, and there will be a link at the bottom of the list to show the log file as the changes are made.
When all of the steps are complete, click the button to return to the Servers page.
3.5.2 Compute Servers #
When servers that host VMs are replaced, the following actions happen:
- a new server is added
- existing instances are migrated from the existing server to the new server
- the existing server is deleted from the model
The new server will not have the same IP Address and may have a different NIC Mapping and Server Group than the server being replaced.
To replace a Compute server, click the menu to the right of the server listing on the Section 3.2.6, “Servers” page. From the menu options, select the replace option.
Selecting it will open a dialog box that includes information about the server being replaced, and a form for inputting the required information for the new server. If the IP address of the server being replaced cannot be reached by the deployer, a warning will appear to verify that the replacement should continue.
Replacing a Compute server involves adding the new server and then performing migration. This requires some new information:
- an unused IP address
- a new ID
- selections for Server Group and NIC Mapping, which do not need to match the original server
- ID
This is the ID of the server in the data model. This does not necessarily correspond to any DNS or other naming labels of a host, unless the host ID was set that way during install.
- IP Address
The management network IP address of the server
- Server Group
The server group which this server is assigned to. If the required Server Group does not exist, it can be created
- NIC Mapping
The NIC mapping that describes the PCI slot addresses for the server's ethernet adapters. If the required NIC mapping does not exist, it can be created
The IPMI information for the new server is also required to perform the replacement process.
- MAC Address
The hardware address of the server's primary physical ethernet adapter
- IPMI IP Address
The network address for IPMI access to the new server
- IPMI Username
The username credential for IPMI access to the new server
- IPMI Password
The password associated with the IPMI Username on the new server
To use a server that has already been discovered, check the corresponding box and select an existing server from the dropdown. This will automatically populate the server information fields above with the information previously entered or discovered for the specified server.
If SLES is not already installed, or to reinstall SLES on the new server, check the corresponding box. The username will be pre-populated with the username from the Cloud install. Installing the OS requires specifying the password that was used for deploying the cloud so that the replacement process can access the host after the OS is installed.
The data disks on the new server will not be wiped by default, but wiping them clean can be specified as part of the replacement process.
When the new server information is set, click the button in the lower right to begin replacement. The configuration processor will be run to validate that the entered information is compatible with the configuration of the Cloud.
When validation has completed, the Compute replacement takes place in several distinct steps, and each will have its own page with a list of process steps displayed. A link at the bottom of the list can show the log file as the changes are made.
Install SLES if that option was selected.
Figure 3.54: Install SLES on New Compute #
Commit the changes to the data model and run the configuration processor.
Figure 3.55: Prepare Compute Server #
Deploy the new server, install services on it, update monasca (if installed), and activate the server with nova so that it can host VMs.
Figure 3.56: Deploy New Compute Server #
Disable the existing server. If the existing server is unreachable, there may be warnings about disabling services on that server.
Figure 3.57: Host Aggregate Removal Warning #
If the existing server is reachable, instances on that server will be migrated to the new server.
Figure 3.58: Migrate Instances from Existing Compute Server #
If the existing server is not reachable, the migration step will be skipped.
Figure 3.59: Disable Existing Compute Server #
Remove the existing server from the model and update the cloud configuration. If the server is not reachable, the user is asked to verify that the server is shut down. If the server is reachable, the cloud services running on it will be stopped and the server will be shut down as part of the removal from the Cloud.
Figure 3.60: Existing Server Shutdown Check #
Upon verification that the unreachable host is shut down, it will be removed from the data model.
Figure 3.61: Existing Server Delete #
After the model has been updated, a summary of the changes will appear. Click the button to return to the server summary screen.
Figure 3.62: Compute Replacement Summary #
4 Third-Party Integrations #
4.1 Splunk Integration #
This documentation demonstrates the possible integration between the SUSE OpenStack Cloud 9 centralized logging solution and Splunk including the steps to set up and forward logs.
The SUSE OpenStack Cloud 9 logging solution provides a flexible and extensible framework to centralize the collection and processing of logs from all of the nodes in a cloud. The logs are shipped to a highly available and fault tolerant cluster where they are transformed and stored for better searching and reporting. The SUSE OpenStack Cloud 9 logging solution uses the ELK stack (Elasticsearch, Logstash and Kibana) as a production grade implementation and can support other storage and indexing technologies. The Logstash pipeline can be configured to forward the logs to an alternative target if you wish.
4.1.1 What is Splunk? #
Splunk is software for searching, monitoring, and analyzing machine-generated big data, via a web-style interface. Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations. It is commercial software (unlike Elasticsearch) and more details about Splunk can be found at https://www.splunk.com.
4.1.2 Configuring Splunk to receive log messages from SUSE OpenStack Cloud 9 #
This documentation assumes that you already have Splunk set up and running. For help with installing and setting up Splunk, refer to Splunk Tutorial.
There are different ways in which a log message (or "event" in Splunk's terminology) can be sent to Splunk. These steps will set up a TCP port where Splunk will listen for messages.
On the Splunk web UI, click on the Settings menu in the upper right-hand corner.
In the Data section of the Settings menu, click Data Inputs.
Choose the TCP option.
Click the button to add an input.
In the port field, enter the port number you want to use.
Note: If you are on a less secure network and want to restrict connections to this port, use the source restriction field to restrict the traffic to a specific IP address.
Click the button to continue.
Specify the Source Type by clicking the corresponding button and choosing linux_messages_syslog from the list.
Click the button to continue.
Review the configuration and click the button to submit.
A success message will be displayed.
4.1.3 Forwarding log messages from SUSE OpenStack Cloud 9 Centralized Logging to Splunk #
When you have Splunk set up and configured to receive log messages, you can configure SUSE OpenStack Cloud 9 to forward the logs to Splunk.
Log in to the Cloud Lifecycle Manager.
Check the status of the logging service:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts logging-status.yml
If everything is up and running, continue to the next step.
Edit the logstash config file at the location below:
~/openstack/ardana/ansible/roles/logging-server/templates/logstash.conf.j2
At the bottom of the file is a section for the Logstash outputs. Add the details of your Splunk environment there.
Below is an example showing the placement of the Splunk output section:
# Logstash outputs
#------------------------------------------------------------------------------
output {
  # Configure Elasticsearch output
  # http://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
  elasticsearch {
    index => %{[@metadata][es_index]}
    hosts => ["{{ elasticsearch_http_host }}:{{ elasticsearch_http_port }}"]
    flush_size => {{ logstash_flush_size }}
    idle_flush_time => 5
    workers => {{ logstash_threads }}
  }
  # Forward Logs to Splunk on the TCP port that matches the one specified in Splunk Web UI.
  tcp {
    mode => "client"
    host => "<Enter Splunk listener IP address>"
    port => TCP_PORT_NUMBER
  }
}
Note: If you are not planning on using the Splunk UI to parse your centralized logs, there is no need to forward your logs to Elasticsearch. In this situation, comment out the lines in the Logstash outputs pertaining to Elasticsearch. However, you can continue to forward your centralized logs to multiple locations.
Commit your changes to git:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Logstash configuration change for Splunk integration"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Complete this change with a reconfigure of the logging environment:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml

In your Splunk UI, confirm that the logs have begun to forward.
4.1.4 Searching for log messages from the Splunk dashboard #
To verify that your integration worked and to search the log messages that have been forwarded, navigate back to your Splunk dashboard. In the search field, use this string:
source="tcp:TCP_PORT_NUMBER"
Find information on using the Splunk search tool at http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchTutorial/WelcometotheSearchTutorial.
4.2 Operations Bridge Integration #
The SUSE OpenStack Cloud 9 monitoring solution (monasca) can easily be integrated with your existing monitoring tools. Integrating SUSE OpenStack Cloud 9 monasca with Operations Bridge using the Operations Bridge Connector simplifies monitoring and managing events and topology information.
The integration provides the following functionality:
Forwarding of SUSE OpenStack Cloud monasca alerts and topology to Operations Bridge for event correlation
Customization of forwarded events and topology
For more information about this connector, see https://software.microfocus.com/en-us/products/operations-bridge-suite/overview.
4.3 Monitoring Third-Party Components With Monasca #
4.3.1 monasca Monitoring Integration Overview #
monasca, the SUSE OpenStack Cloud 9 monitoring service, collects information about your cloud's systems and allows you to create alarm definitions based on these measurements. monasca-agent is the component that collects metrics and forwards them to the monasca-api for further processing, such as metric storage and alarm thresholding.
With a small amount of configuration, you can use the detection and check plugins that are provided with your cloud to monitor integrated third-party components. In addition, you can write custom plugins and integrate them with the existing monitoring service.
Find instructions for customizing existing plugins to monitor third-party components in Section 4.3.4, “Configuring Check Plugins”.
Find instructions for installing and configuring new custom plugins in Section 4.3.3, “Writing Custom Plugins”.
You can also use existing alarm definitions, as well as create new alarm definitions that relate to a custom plugin or metric. Instructions for defining new alarm definitions are in Section 4.3.6, “Configuring Alarm Definitions”.
You can use the Operations Console and monasca CLI to list all of the alarms, alarm-definitions, and metrics that exist on your cloud.
4.3.2 monasca Agent #
The monasca agent (monasca-agent) collects information about your cloud using the installed plugins. The plugins are written in Python, and determine the monitoring metrics for your system, as well as the interval for collection. The default collection interval is 30 seconds, and we strongly recommend not changing this default value.
The following two types of custom plugins can be added to your cloud.
Detection Plugin. Determines whether the monasca-agent has the ability to monitor the specified component or service on a host. If successful, this type of plugin configures an associated check plugin by creating a YAML configuration file.
Check Plugin. Specifies the metrics to be monitored, using the configuration file created by the detection plugin.
monasca-agent is installed on every server in your cloud, and provides plugins that monitor the following.
System metrics relating to CPU, memory, disks, host availability, etc.
Process health metrics (process, http_check)
SUSE OpenStack Cloud 9-specific component metrics, such as apache, rabbitmq, kafka, cassandra, etc.
monasca is pre-configured with default check plugins and associated detection plugins. The default plugins can be reconfigured to monitor third-party components, and often only require small adjustments to adapt them to this purpose. Find a list of the default plugins here: https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md#detection-plugins
Often, a single check plugin is used to monitor multiple services. For example, many services use the http_check.py detection plugin to detect the up/down status of a service endpoint, and the process.py check plugin, which provides process monitoring metrics, is often used as a basis for a custom process detection plugin.
More information about the monasca agent can be found in the following locations:
monasca agent overview: https://github.com/openstack/monasca-agent/blob/master/docs/Agent.md
Information on existing plugins: https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md
Information on plugin customizations: https://github.com/openstack/monasca-agent/blob/master/docs/Customizations.md
4.3.3 Writing Custom Plugins #
When the pre-built monasca plugins do not meet your monitoring needs, you can write custom plugins to monitor your cloud. After you have written a plugin, you must install and configure it.
When your needs dictate a very specific custom monitoring check, you must provide both a detection and check plugin.
The steps involved in configuring a custom plugin include running a detection plugin and passing any necessary parameters to it, so that the resulting check configuration file is created with all necessary data.
When using an existing check plugin to monitor a third-party component, a custom detection plugin is needed only if there is not an associated default detection plugin.
Check plugin configuration files
Each plugin needs a corresponding YAML configuration file with the same stem name as the plugin check file. For example, the plugin file http_check.py (in /usr/lib/python2.7/site-packages/monasca_agent/collector/checks_d/) should have a corresponding configuration file, http_check.yaml (in /etc/monasca/agent/conf.d/). The stem name http_check must be the same for both files.
Permissions for the YAML configuration file must be read+write for the monasca-agent user, who must also own the file, and read for the monasca group, with access restricted to that user and group. The following example shows correct permission settings for the file http_check.yaml.

ardana > ls -alt /etc/monasca/agent/conf.d/http_check.yaml
-rw-r----- 1 monasca-agent monasca 10590 Jul 26 05:44 http_check.yaml
A check plugin YAML configuration file has the following structure.
init_config:
  key1: value1
  key2: value2
instances:
  - name: john_smith
    username: john_smith
    password: 123456
  - name: jane_smith
    username: jane_smith
    password: 789012
In the above file structure, the init_config section allows you to specify any number of global key:value pairs. Each pair will be available on every run of the check that relates to the YAML configuration file.

The instances section lists the instances that the related check will be run on. The check is run once on each instance listed in the instances section, so ensure that each instance has a unique name.
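The relationship between init_config and instances can be sketched in a few lines of Python. This is an illustrative sketch only, using plain dicts to stand in for the structure that a YAML parser would return; run_check is a hypothetical placeholder, not the agent's actual dispatch code.

```python
# Plain dicts stand in for the parsed YAML configuration file.
config = {
    "init_config": {"key1": "value1", "key2": "value2"},
    "instances": [
        {"name": "john_smith", "username": "john_smith", "password": "123456"},
        {"name": "jane_smith", "username": "jane_smith", "password": "789012"},
    ],
}

def run_check(init_config, instance):
    # A real check would collect metrics here; this stub just records
    # which instance ran and that the global settings were visible.
    return "checked %s with %d global settings" % (instance["name"], len(init_config))

# The check is run once per entry in instances, with init_config
# available on every run.
results = [run_check(config["init_config"], inst) for inst in config["instances"]]
for line in results:
    print(line)
```

Because the check is dispatched once per instance, duplicate instance names would make the resulting metrics indistinguishable, which is why each name must be unique.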
Custom detection plugins
Detection plugins should be written to perform checks that ensure that a component can be monitored on a host. Any arguments needed by the associated check plugin are passed into the detection plugin at setup (configuration) time. The detection plugin will write to the associated check configuration file.
When a detection plugin is successfully run in the configuration step, it will write to the check configuration YAML file. The configuration file for the check is written to the following directory.
/etc/monasca/agent/conf.d/
Writing a process detection plugin using the ServicePlugin class

The monasca-agent provides a ServicePlugin class that makes process detection monitoring easy.
Process check
The process check plugin generates metrics based on the process status for specified process names. By default, it generates process.pid_count metrics for the specified dimensions, along with a set of detailed process metrics.
The ServicePlugin class allows you to specify a list of process names to detect, and uses psutil to see if the process exists on the host. It then appends the process names to the process.yml configuration file, if they do not already exist.
The following is an example of a process.py check ServicePlugin.
import logging

import monasca_setup.detection

log = logging.getLogger(__name__)


class monascaTransformDetect(monasca_setup.detection.ServicePlugin):
    """Detect monasca Transform daemons and setup configuration to monitor them."""

    def __init__(self, template_dir, overwrite=False, args=None):
        log.info(" Watching the monasca transform processes.")
        service_params = {
            'args': {},
            'template_dir': template_dir,
            'overwrite': overwrite,
            'service_name': 'monasca-transform',
            'process_names': ['monasca-transform', 'pyspark',
                              'transform/lib/driver']
        }
        super(monascaTransformDetect, self).__init__(service_params)
Writing a Custom Detection Plugin using the Plugin or ArgsPlugin classes

A custom detection plugin class should derive from either the Plugin or ArgsPlugin class provided in the /usr/lib/python2.7/site-packages/monasca_setup/detection directory.
If the plugin parses command line arguments, the ArgsPlugin class is useful. The ArgsPlugin class derives from the Plugin class, and adds a method to check for required arguments and a method to return the instance that will be used for writing to the configuration file, with the dimensions from the command line parsed and included. If the ArgsPlugin methods do not apply, derive directly from the Plugin class.
When deriving from these classes, the following methods should be implemented.
_detect - sets self.available = True when the conditions are met that the component to monitor exists on a host.
build_config - writes the instance information to the configuration and returns the configuration.
dependencies_installed (default implementation is in ArgsPlugin, but not Plugin) - returns True when the dependent Python libraries are installed.
The following is an example custom detection plugin.
import ast
import logging

import monasca_setup.agent_config
import monasca_setup.detection

log = logging.getLogger(__name__)


class HttpCheck(monasca_setup.detection.ArgsPlugin):
    """Setup an http_check according to the passed in args.
       Despite being a detection plugin this plugin does no detection and
       will be a noop without arguments. Expects space separated arguments,
       the required argument is url. Optional parameters include:
       disable_ssl_validation and match_pattern.
    """

    def _detect(self):
        """Run detection, set self.available True if the service is detected.
        """
        self.available = self._check_required_args(['url'])

    def build_config(self):
        """Build the config as a Plugins object and return.
        """
        config = monasca_setup.agent_config.Plugins()
        # No support for setting headers at this time
        instance = self._build_instance(['url', 'timeout', 'username', 'password',
                                         'match_pattern', 'disable_ssl_validation',
                                         'name', 'use_keystone',
                                         'collect_response_time'])
        # Normalize any boolean parameters
        for param in ['use_keystone', 'collect_response_time']:
            if param in self.args:
                instance[param] = ast.literal_eval(self.args[param].capitalize())
        # Set some defaults
        if 'collect_response_time' not in instance:
            instance['collect_response_time'] = True
        if 'name' not in instance:
            instance['name'] = self.args['url']
        config['http_check'] = {'init_config': None,
                                'instances': [instance]}
        return config
Installing a detection plugin in the OpenStack version delivered with SUSE OpenStack Cloud
Install a plugin by copying it to the plugin directory (/usr/lib/python2.7/site-packages/monasca_agent/collector/checks_d/).
The plugin should have file permissions of read+write for the root user (the user that should also own the file) and read for the root group and all other users.
The following is an example of correct file permissions for the http_check.py file.
-rw-r--r-- 1 root root 1769 Sep 19 20:14 http_check.py
Detection plugins should be placed in the following directory.
/usr/lib/monasca/agent/custom_detect.d/
The detection plugin directory should be accessed using the monasca_agent_detection_plugin_dir Ansible variable. This variable is defined in the roles/monasca-agent/vars/main.yml file.
monasca_agent_detection_plugin_dir: /usr/lib/monasca/agent/custom_detect.d/
Example: add an Ansible monasca_configure task to install the plugin. (The monasca_configure task can be added to any service playbook.) In this example, it is added to ~/openstack/ardana/ansible/roles/_CEI-CMN/tasks/monasca_configure.yml.
---
- name: _CEI-CMN | monasca_configure | Copy ceilometer Custom plugin
  become: yes
  copy:
    src: ardanaceilometer_mon_plugin.py
    dest: "{{ monasca_agent_detection_plugin_dir }}"
    owner: root
    group: root
    mode: 0440
Custom check plugins
Custom check plugins generate metrics. Take scalability into consideration on systems that will have hundreds of servers, as a large number of metrics can affect performance by increasing disk, RAM, and CPU usage.
You may want to tune your configuration parameters so that less-important metrics are not monitored as frequently. When check plugins are configured (that is, when they have an associated YAML configuration file), the agent will attempt to run them.
Checks should be able to run within the 30-second metric collection window. If your check runs a command, you should provide a timeout to prevent the check from running longer than the default 30-second window. You can use monasca_agent.common.util.timeout_command to set a timeout in your custom check plugin Python code.
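The same guard can be sketched with the standard library alone. The helper below is a hypothetical, simplified stand-in for what monasca_agent.common.util.timeout_command provides (the agent's real implementation differs); it shows the pattern of bounding a shelled-out command so a check never exceeds the collection window.

```python
import subprocess

def run_with_timeout(cmd, timeout_seconds):
    """Run a command, returning (stdout, stderr, return_code).

    Simplified, illustrative stand-in for the agent's timeout_command
    helper, built on the standard library. A hung command is treated
    as a failure instead of blocking metric collection.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_seconds)
        return proc.stdout, proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:
        return "", "command timed out", -1

# Bound the command well under the 30-second collection window.
out, err, rc = run_with_timeout(["echo", "cassandra ok"], timeout_seconds=5)
```

A check can then emit a failure metric (rather than hanging) whenever the return code indicates a timeout.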
Find a description of how to write custom check plugins at https://github.com/openstack/monasca-agent/blob/master/docs/Customizations.md#creating-a-custom-check-plugin
Custom checks derive from the AgentCheck class located in the monasca_agent/collector/checks/check.py file. A check method is required.
Metrics should contain dimensions that make each item that you are monitoring unique (such as service, component, hostname). The hostname dimension is defined by default within the AgentCheck class, so every metric has this dimension.
A custom check will do the following.
Read the configuration instance passed into the check method.
Set dimensions that will be included in the metric.
Create the metric with gauge, rate, or counter types.
Metric Types:

gauge: an instantaneous reading of a particular value (for example, mem.free_mb).
rate: a measurement over a time period, defined by the equation rate = delta_v / float(delta_t).
counter: a count of events, with increment and decrement methods (for example, zookeeper.timeouts).
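The rate formula above divides the change in a value by the elapsed time between two samples. A minimal sketch (illustrative only, not agent code; compute_rate is a hypothetical helper):

```python
def compute_rate(prev_value, prev_time, value, time):
    # rate = delta_v / float(delta_t), per the definition above
    delta_v = value - prev_value
    delta_t = time - prev_time
    return delta_v / float(delta_t)

# Two samples taken 30 seconds apart (the default collection interval):
# the value grew by 150 units, so the rate is 5.0 per second.
rate = compute_rate(prev_value=100, prev_time=0, value=250, time=30)
print(rate)  # 5.0
```

A gauge, by contrast, would simply report the latest sampled value (250 here) with no reference to the previous sample.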
The following is an example component check named SimpleCassandraExample.
import monasca_agent.collector.checks as checks
from monasca_agent.common.util import timeout_command

CASSANDRA_VERSION_QUERY = "SELECT version();"


class SimpleCassandraExample(checks.AgentCheck):

    def __init__(self, name, init_config, agent_config):
        super(SimpleCassandraExample, self).__init__(name, init_config,
                                                     agent_config)

    @staticmethod
    def _get_config(instance):
        user = instance.get('user')
        password = instance.get('password')
        service = instance.get('service')
        timeout = int(instance.get('timeout'))
        return user, password, service, timeout

    def check(self, instance):
        user, password, service, timeout = self._get_config(instance)
        dimensions = self._set_dimensions({'component': 'cassandra',
                                           'service': service}, instance)
        results, connection_status = self._query_database(
            user, password, timeout, CASSANDRA_VERSION_QUERY)
        if connection_status != 0:
            self.gauge('cassandra.connection_status', 1, dimensions=dimensions)
        else:
            # successful connection status
            self.gauge('cassandra.connection_status', 0, dimensions=dimensions)

    def _query_database(self, user, password, timeout, query):
        stdout, stderr, return_code = timeout_command(
            ["/opt/cassandra/bin/vsql", "-U", user, "-w", password,
             "-A", "-R", "|", "-t", "-F", ",", "-x"],
            timeout,
            command_input=query)
        if return_code == 0:
            # remove trailing newline
            stdout = stdout.rstrip()
            return stdout, 0
        else:
            self.log.error("Error querying cassandra with return code of "
                           "{0} and error {1}".format(return_code, stderr))
            return stderr, 1
Installing a check plugin
The check plugin needs to have the same file permissions as the detection plugin. File permissions must be read+write for the root user (the user that should own the file), and read for the root group and all other users.
Check plugins should be placed in the following directory.
/usr/lib/monasca/agent/custom_checks.d/
The check plugin directory should be accessed using the monasca_agent_check_plugin_dir Ansible variable. This variable is defined in the roles/monasca-agent/vars/main.yml file.
monasca_agent_check_plugin_dir: /usr/lib/monasca/agent/custom_checks.d/
4.3.4 Configuring Check Plugins #
You can manually configure a plugin when unit-testing by using the monasca-setup script that is installed with the monasca-agent.
Find a good explanation of configuring plugins here: https://github.com/openstack/monasca-agent/blob/master/docs/Agent.md#configuring
SSH to a node that has both the monasca-agent installed as well as the component you wish to monitor.
The following is an example command that configures a plugin that has no parameters (uses the detection plugin class name).
root # /usr/bin/monasca-setup -d ARDANACeilometer
The following is an example command that configures the apache plugin and includes related parameters.
root # /usr/bin/monasca-setup -d apache -a 'url=http://192.168.245.3:9095/server-status?auto'
If there is a change in the configuration, monasca-setup restarts the monasca-agent on the host so that the configuration is loaded.
After the plugin is configured, you can verify that the configuration file has your changes (see the next Verify that your check plugin is configured section).
Use the monasca CLI to see if your metric exists (see the Verify that metrics exist section).
Using Ansible modules to configure plugins in SUSE OpenStack Cloud 9
The monasca_agent_plugin module is installed as part of the monasca-agent role.
The following Ansible example configures the process.py plugin for the ceilometer detection plugin; it passes in only the name of the detection class.
- name: _CEI-CMN | monasca_configure | Run monasca agent Cloud Lifecycle Manager specific ceilometer detection plugin
  become: yes
  monasca_agent_plugin:
    name: "ARDANACeilometer"
If a password or other sensitive data is passed to the detection plugin, the no_log option should be set to True. If the no_log option is not set to True, the data passed to the plugin will be logged to syslog.
The following Ansible example configures the Cassandra plugin and passes in related arguments.
- name: Run monasca Agent detection plugin for Cassandra
  monasca_agent_plugin:
    name: "Cassandra"
    args: "directory_names={{ FND_CDB.vars.cassandra_data_dir }},{{ FND_CDB.vars.cassandra_commit_log_dir }} process_username={{ FND_CDB.vars.cassandra_user }}"
  when: database_type == 'cassandra'
The following Ansible example configures the keystone endpoint using the http_check.py detection plugin. The class name httpcheck of the http_check.py detection plugin is used as the name.
- name: keystone-monitor | local_monitor | Setup active check on keystone internal endpoint locally
  become: yes
  monasca_agent_plugin:
    name: "httpcheck"
    args: "use_keystone=False \
           url=http://{{ keystone_internal_listen_ip }}:{{ keystone_internal_port }}/v3 \
           dimensions=service:identity-service,\
                      component:keystone-api,\
                      api_endpoint:internal,\
                      monitored_host_type:instance"
  tags:
    - keystone
    - keystone_monitor
Verify that your check plugin is configured
All check configuration files are located in the following directory. You can see the plugins that are running by looking at the plugin configuration directory.
/etc/monasca/agent/conf.d/
When the monasca-agent starts up, all of the check plugins that have a matching configuration file in the /etc/monasca/agent/conf.d/ directory will be loaded.
If there are errors running the check plugin they will be written to the following error log file.
/var/log/monasca/agent/collector.log
You can change the monasca-agent log level by modifying the log_level option in the /etc/monasca/agent/agent.yaml configuration file, and then restarting the monasca-agent using the following command.
root # service openstack-monasca-agent restart
You can debug a check plugin by running monasca-collector with the check option. The following is an example of the monasca-collector command.
tux > sudo /usr/bin/monasca-collector check CHECK_NAME
Verify that metrics exist
Begin by logging in to your deployer or controller node.
Run the following set of commands, including the monasca metric-list command. If the metric exists, it will be displayed in the output.

ardana > source ~/service.osrc
ardana > monasca metric-list --name METRIC_NAME
4.3.5 Metric Performance Considerations #
Collecting metrics on your virtual machines can greatly affect performance. SUSE OpenStack Cloud 9 supports 200 compute nodes, with up to 40 VMs each. If your environment is managing the maximum number of VMs, adding a single metric for all VMs is the equivalent of adding 8,000 metrics.
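The sizing estimate above is simply nodes × VMs per node × metrics per VM, a back-of-the-envelope calculation worth repeating before enabling any per-VM metric:

```python
compute_nodes = 200   # supported compute nodes
vms_per_node = 40     # maximum VMs per compute node
metrics_per_vm = 1    # one new metric added for every VM

# Total new metric streams introduced by a single per-VM metric.
new_metrics = compute_nodes * vms_per_node * metrics_per_vm
print(new_metrics)  # 8000
```

Scaling metrics_per_vm to even a handful of measurements multiplies the load on the metric pipeline accordingly.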
Because of the potential impact that new metrics have on system performance, consider adding only new metrics that are useful for alarm-definition, capacity planning, or debugging process failure.
4.3.6 Configuring Alarm Definitions #
The monasca-api-spec, found at https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md, provides an explanation of alarm definitions and alarms. You can find more information on alarm definition expressions at the following page: https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md#alarm-definition-expressions.
When an alarm definition is defined, the monasca-threshold engine will generate an alarm for each unique instance of the match_by metric dimensions found in the metric. This allows a single alarm definition that can dynamically handle the addition of new hosts.
There are default alarm definitions configured for all "process check" (process.py check) and "HTTP Status" (http_check.py check) metrics in the monasca-default-alarms role. The monasca-default-alarms role is installed as part of the monasca deployment phase of your cloud's deployment. You do not need to create alarm definitions for these existing checks.
Third parties should create an alarm definition when they wish to alarm on a custom plugin metric. The alarm definition should only be defined once. Setting a notification method for the alarm definition is recommended but not required.
The following Ansible modules used for alarm definitions are installed as part of the monasca-alarm-definition role. This process takes place during the monasca set up phase of your cloud's deployment.
monasca_alarm_definition
monasca_notification_method
The following examples, found in the ~/openstack/ardana/ansible/roles/monasca-default-alarms directory, illustrate how monasca sets up the default alarm definitions.
monasca Notification Methods
The monasca-api-spec provides details about creating a notification method: https://github.com/openstack/monasca-api/blob/master/docs/monasca-api-spec.md#create-notification-method
The following are supported notification types.
EMAIL
WEBHOOK
PAGERDUTY
The keystone_admin_tenant project is used so that the alarms will show up on the Operations Console UI.
The following file snippet shows variables from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/defaults/main.yml file.
---
notification_address: root@localhost
notification_name: 'Default Email'
notification_type: EMAIL
monasca_keystone_url: "{{ KEY_API.advertises.vips.private[0].url }}/v3"
monasca_api_url: "{{ MON_AGN.consumes_MON_API.vips.private[0].url }}/v2.0"
monasca_keystone_user: "{{ MON_API.consumes_KEY_API.vars.keystone_monasca_user }}"
monasca_keystone_password: "{{ MON_API.consumes_KEY_API.vars.keystone_monasca_password | quote }}"
monasca_keystone_project: "{{ KEY_API.vars.keystone_admin_tenant }}"
monasca_client_retries: 3
monasca_client_retry_delay: 2
You can specify a single default notification method in the ~/openstack/ardana/ansible/roles/monasca-default-alarms/tasks/main.yml file. You can also add or modify the notification type and related details using the Operations Console UI or the monasca CLI.
The following is a code snippet from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/tasks/main.yml file.
---
- name: monasca-default-alarms | main | Setup default notification method
  monasca_notification_method:
    name: "{{ notification_name }}"
    type: "{{ notification_type }}"
    address: "{{ notification_address }}"
    keystone_url: "{{ monasca_keystone_url }}"
    keystone_user: "{{ monasca_keystone_user }}"
    keystone_password: "{{ monasca_keystone_password }}"
    keystone_project: "{{ monasca_keystone_project }}"
    monasca_api_url: "{{ monasca_api_url }}"
  no_log: True
  tags:
    - system_alarms
    - monasca_alarms
    - openstack_alarms
  register: default_notification_result
  until: not default_notification_result | failed
  retries: "{{ monasca_client_retries }}"
  delay: "{{ monasca_client_retry_delay }}"
monasca Alarm Definition
In the alarm definition "expression" field, you can specify the metric name and threshold. The "match_by" field is used to create a new alarm for every unique combination of the match_by metric dimensions.
Find more details on alarm definitions at the monasca API documentation: (https://github.com/stackforge/monasca-api/blob/master/docs/monasca-api-spec.md#alarm-definitions-and-alarms).
The following is a code snippet from the ~/openstack/ardana/ansible/roles/monasca-default-alarms/tasks/main.yml file.
- name: monasca-default-alarms | main | Create Alarm Definitions
  monasca_alarm_definition:
    name: "{{ item.name }}"
    description: "{{ item.description | default('') }}"
    expression: "{{ item.expression }}"
    keystone_token: "{{ default_notification_result.keystone_token }}"
    match_by: "{{ item.match_by | default(['hostname']) }}"
    monasca_api_url: "{{ default_notification_result.monasca_api_url }}"
    severity: "{{ item.severity | default('LOW') }}"
    alarm_actions:
      - "{{ default_notification_result.notification_method_id }}"
    ok_actions:
      - "{{ default_notification_result.notification_method_id }}"
    undetermined_actions:
      - "{{ default_notification_result.notification_method_id }}"
  register: monasca_system_alarms_result
  until: not monasca_system_alarms_result | failed
  retries: "{{ monasca_client_retries }}"
  delay: "{{ monasca_client_retry_delay }}"
  with_flattened:
    - monasca_alarm_definitions_system
    - monasca_alarm_definitions_monasca
    - monasca_alarm_definitions_openstack
    - monasca_alarm_definitions_misc_services
  when: monasca_create_definitions
In the following example ~/openstack/ardana/ansible/roles/monasca-default-alarms/vars/main.yml Ansible variables file, the alarm definition named Process Check sets the match_by variable with the following parameters.
process_name
hostname
monasca_alarm_definitions_system:
  - name: "Host Status"
    description: "Alarms when the specified host is down or not reachable"
    severity: "HIGH"
    expression: "host_alive_status > 0"
    match_by:
      - "target_host"
      - "hostname"
  - name: "HTTP Status"
    description: >
      "Alarms when the specified HTTP endpoint is down or not reachable"
    severity: "HIGH"
    expression: "http_status > 0"
    match_by:
      - "service"
      - "component"
      - "hostname"
      - "url"
  - name: "CPU Usage"
    description: "Alarms when CPU usage is high"
    expression: "avg(cpu.idle_perc) < 10 times 3"
  - name: "High CPU IOWait"
    description: "Alarms when CPU IOWait is high, possible slow disk issue"
    expression: "avg(cpu.wait_perc) > 40 times 3"
    match_by:
      - "hostname"
  - name: "Disk Inode Usage"
    description: "Alarms when disk inode usage is high"
    expression: "disk.inode_used_perc > 90"
    match_by:
      - "hostname"
      - "device"
    severity: "HIGH"
  - name: "Disk Usage"
    description: "Alarms when disk usage is high"
    expression: "disk.space_used_perc > 90"
    match_by:
      - "hostname"
      - "device"
    severity: "HIGH"
  - name: "Memory Usage"
    description: "Alarms when memory usage is high"
    severity: "HIGH"
    expression: "avg(mem.usable_perc) < 10 times 3"
  - name: "Network Errors"
    description: >
      "Alarms when either incoming or outgoing network errors are high"
    severity: "MEDIUM"
    expression: "net.in_errors_sec > 5 or net.out_errors_sec > 5"
  - name: "Process Check"
    description: "Alarms when the specified process is not running"
    severity: "HIGH"
    expression: "process.pid_count < 1"
    match_by:
      - "process_name"
      - "hostname"
  - name: "Crash Dump Count"
    description: "Alarms when a crash directory is found"
    severity: "MEDIUM"
    expression: "crash.dump_count > 0"
    match_by:
      - "hostname"
The preceding configuration would result in the creation of an alarm for each unique metric that matched the following criteria.
process.pid_count + process_name + hostname
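The match_by behavior can be illustrated with a small sketch: grouping incoming process.pid_count measurements by the (process_name, hostname) dimension pair yields one alarm per unique combination. This is illustrative only; the real grouping happens inside the monasca-threshold engine, and the metric samples below are made up.

```python
# Hypothetical incoming measurements, each carrying its dimensions.
metrics = [
    {"name": "process.pid_count", "dimensions": {"process_name": "monasca-api", "hostname": "host1"}},
    {"name": "process.pid_count", "dimensions": {"process_name": "monasca-api", "hostname": "host2"}},
    {"name": "process.pid_count", "dimensions": {"process_name": "keystone", "hostname": "host1"}},
    # A repeat of an existing combination does not create a new alarm.
    {"name": "process.pid_count", "dimensions": {"process_name": "monasca-api", "hostname": "host1"}},
]

match_by = ["process_name", "hostname"]

# One alarm per unique tuple of match_by dimension values.
alarms = {tuple(m["dimensions"][d] for d in match_by) for m in metrics}
print(len(alarms))  # 3
```

This is why a single alarm definition can dynamically cover new hosts: a metric arriving with a previously unseen hostname simply produces a new unique combination, and therefore a new alarm.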
Check that the alarms exist
Begin by using the following commands, including monasca alarm-definition-list, to check that the alarm definition exists.

ardana > source ~/service.osrc
ardana > monasca alarm-definition-list --name ALARM_DEFINITION_NAME
Then use either of the following commands to check that the alarm has been generated. A status of "OK" indicates a healthy alarm.
ardana > monasca alarm-list --metric-name METRIC_NAME
Or
ardana > monasca alarm-list --alarm-definition-id ID_FROM_ALARM-DEFINITION-LIST
To see CLI options, use the monasca help command.
Alarm state upgrade considerations
If the name of a monitoring metric changes or is no longer being sent, existing alarms will show the alarm state as UNDETERMINED. You can update an alarm definition as long as you do not change the metric name or dimension name values in the expression or match_by fields. If you need to alter either of these values, you must delete the old alarm definitions and create new definitions with the updated values.
If a metric is never sent, but has a related alarm definition, then no alarms would exist. If you find that metrics are never sent, then you should remove the related alarm definitions.
When removing an alarm definition, the Ansible module monasca_alarm_definition supports the state absent.

The following file snippet shows an example of how to remove an alarm definition by setting the state to absent.
- name: monasca-pre-upgrade | Remove alarm definitions
  monasca_alarm_definition:
    name: "{{ item.name }}"
    state: "absent"
    keystone_url: "{{ monasca_keystone_url }}"
    keystone_user: "{{ monasca_keystone_user }}"
    keystone_password: "{{ monasca_keystone_password }}"
    keystone_project: "{{ monasca_keystone_project }}"
    monasca_api_url: "{{ monasca_api_url }}"
  with_items:
    - { name: "Kafka Consumer Lag" }
An alarm exists in the OK state when the monasca threshold engine has seen at least one metric associated with the alarm definition and has not exceeded the alarm definition threshold.
4.3.7 OpenStack Integration of Custom Plugins into monasca-agent (if applicable) #
monasca-agent is an OpenStack open-source project. monasca can also monitor non-openstack services. Third parties should install custom plugins into their SUSE OpenStack Cloud 9 system using the steps outlined in the Section 4.3.3, “Writing Custom Plugins”. If the OpenStack community determines that the custom plugins are of general benefit, the plugin may be added to the openstack/monasca-agent so that they are installed with the monasca-agent. During the review process for openstack/monasca-agent there are no guarantees that code will be approved or merged by a deadline. Open-source contributors are expected to help with codereviews in order to get their code accepted. Once changes are approved and integrated into the openstack/monasca-agent and that version of the monasca-agent is integrated with SUSE OpenStack Cloud 9, the third party can remove the custom plugin installation steps since they would be installed in the default monasca-agent venv.
Find the open source repository for the monasca-agent here: https://github.com/openstack/monasca-agent
5 Managing Identity #
The Identity service provides the structure for user authentication to your cloud.
5.1 The Identity Service #
This topic explains the purpose and mechanisms of the identity service.
The SUSE OpenStack Cloud Identity service, based on the OpenStack keystone API, is responsible for providing UserID authentication and access authorization to enable organizations to achieve their access security and compliance objectives and successfully deploy OpenStack. In short, the Identity service is the gateway to the rest of the OpenStack services.
5.1.1 Which version of the Identity service should you use? #
Use Identity API version 3.0, as previous versions no longer exist as endpoints for Identity API queries.
Similarly, when performing queries, you must use the OpenStack CLI (the openstack command), and not the keystone CLI (keystone), as the latter is only compatible with API versions prior to 3.0.
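For example, the following hypothetical shell session points the client at a v3 endpoint and issues a token with the openstack command (the URL and credentials are placeholders, not defaults):

```shell
# Placeholders: replace the URL and credentials with your own values
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_URL=https://keystone.example.com:5000/v3
export OS_USERNAME=admin
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_NAME=admin
export OS_PROJECT_DOMAIN_NAME=Default
export OS_PASSWORD=ADMIN_PASSWORD

# The openstack CLI speaks Identity v3; the keystone CLI does not
openstack token issue
```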
5.1.2 Authentication #
The authentication function provides the initial login function to OpenStack. keystone supports multiple sources of authentication, including a native or built-in authentication system. The keystone native system can be used for all user management functions for proof of concept deployments or small deployments not requiring integration with a corporate authentication system, but it lacks some of the advanced functions usually found in user management systems such as forcing password changes. The focus of the keystone native authentication system is to be the source of authentication for OpenStack-specific users required for the operation of the various OpenStack services. These users are stored by keystone in a default domain; the addition of these IDs to an external authentication system is not required.
keystone is more commonly integrated with external authentication systems such as OpenLDAP or Microsoft Active Directory. These systems are usually centrally deployed by organizations to serve as the single source of user management and authentication for all in-house deployed applications and systems requiring user authentication. In addition to LDAP and Microsoft Active Directory, support for integration with Security Assertion Markup Language (SAML)-based identity providers from companies such as Ping, CA, IBM, Oracle, and others is also nearly "production-ready".
keystone also provides architectural support via the underlying Apache deployment for other types of authentication systems such as Multi-Factor Authentication. These types of systems typically require driver support and integration from the respective provider vendors.
While support for Identity Providers and Multi-factor authentication is available in keystone, it has not yet been certified by the SUSE OpenStack Cloud engineering team and is an experimental feature in SUSE OpenStack Cloud.
LDAP-compatible directories such as OpenLDAP and Microsoft Active Directory are recommended alternatives to using the keystone local authentication. Both methods are widely used by organizations and are integrated with a variety of other enterprise applications. These directories act as the single source of user information within an organization. keystone can be configured to authenticate against an LDAP-compatible directory on a per-domain basis.
Domains, as explained in Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”, can be configured so that based on the user ID, an incoming user is automatically mapped to a specific domain. This domain can then be configured to authenticate against a specific LDAP directory. The user credentials provided by the user to keystone are passed along to the designated LDAP source for authentication. This communication can be optionally configured to be secure via SSL encryption. No special LDAP administrative access is required, and only read-only access is needed for this configuration. keystone will not add any LDAP information. All user additions, deletions, and modifications are performed by the application's front end in the LDAP directories. After a user has been successfully authenticated, they are then assigned to the groups, roles, and projects defined by the keystone domain or project administrators. This information is stored within the keystone service database.
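As a sketch of what such a per-domain configuration can look like, the following illustrative domain configuration file enables read-only LDAP authentication for one domain (the file path, bind DN, and URL are example values, not shipped defaults):

```
# /etc/keystone/domains/keystone.CORP.conf -- example values only
[identity]
driver = ldap

[ldap]
url               = ldaps://ldap.example.com              # optionally SSL-secured
user              = cn=keystone,ou=svc,dc=example,dc=com  # read-only bind account
password          = BIND_PASSWORD
user_tree_dn      = ou=people,dc=example,dc=com
user_allow_create = false   # keystone never writes to LDAP
user_allow_update = false
user_allow_delete = false
```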
Another form of external authentication provided by the keystone service is via integration with SAML-based Identity Providers (IdP) such as Ping Identity, IBM Tivoli, and Microsoft Active Directory Federation Server. A SAML-based identity provider provides authentication that is often called "single sign-on". The IdP server is configured to authenticate against identity sources such as Active Directory and provides a single authentication API against multiple types of downstream identity sources. This means that an organization could have multiple identity storage sources but a single authentication source. In addition, if a user has logged into one such source during a defined session time frame, they do not need to re-authenticate within the defined session. Instead, the IdP will automatically validate the user to requesting applications and services.
A SAML-based IdP authentication source is configured with keystone on a per-domain basis similar to the manner in which native LDAP directories are configured. Extra mapping rules are required in the configuration that define which keystone group an incoming UID is automatically assigned to. This means that groups need to be defined in keystone first, but it also removes the requirement that a domain or project admin assign user roles and project membership on a per-user basis. Instead, groups are used to define project membership and roles and incoming users are automatically mapped to keystone groups based on their upstream group membership. This provides a very consistent role-based access control (RBAC) model based on the upstream identity source. The configuration of this option is fairly straightforward. IdP vendors such as Ping and IBM are contributing to the maintenance of this function and have also produced their own integration documentation. Microsoft Active Directory Federation Services (ADFS) is used for functional testing and future documentation.
In addition to SAML-based IdP, keystone also supports external authentication with a third party IdP using OpenID Connect protocol by leveraging the capabilities provided by the Apache2 auth_mod_openidc module. The configuration of OpenID Connect is similar to SAML.
The third keystone-supported authentication source is known as Multi-Factor Authentication (MFA). MFA typically requires an external source of authentication beyond a login name and password, and can include options such as SMS text, a temporal token generator, a fingerprint scanner, etc. Each of these types of MFA is usually specific to a particular MFA vendor. The keystone architecture supports an MFA-based authentication system, but this has not yet been certified or documented for SUSE OpenStack Cloud.
5.1.3 Authorization #
The second major function provided by the keystone service is access authorization that determines what resources and actions are available based on the UserID, the role of the user, and the projects that a user is provided access to. All of this information is created, managed, and stored by keystone. These functions are applied via the horizon web interface, the OpenStack command-line interface, or the direct keystone API.
keystone provides support for organizing users via three entities including:
- Domains
Domains provide the highest level of organization. Domains are intended to be used as high-level containers for multiple projects. A domain can represent different tenants, companies or organizations for an OpenStack cloud deployed for public cloud deployments or represent major business units, functions, or any other type of top-level organization unit in an OpenStack private cloud deployment. Each domain has at least one Domain Admin assigned to it. This Domain Admin can then create multiple projects within the domain and assign the project admin role to specific project owners. Each domain created in an OpenStack deployment is unique and the projects assigned to a domain cannot exist in another domain.
- Projects
Projects are entities within a domain that represent groups of users, each user role within that project, and how many underlying infrastructure resources can be consumed by members of the project.
- Groups
Groups are an optional function and provide the means of assigning project roles to multiple users at once.
keystone also provides the means to create and assign roles to groups of users or individual users. The role names are created and user assignments are made within keystone. The actual function of a role is defined currently per each OpenStack service via scripts. When a user requests access to an OpenStack service, his access token contains information about his assigned project membership and role for that project. This role is then matched to the service-specific script and the user is allowed to perform functions within that service defined by the role mapping.
5.2 Supported Upstream Keystone Features #
5.2.1 OpenStack upstream features that are enabled by default in SUSE OpenStack Cloud 9 #
The following supported keystone features are enabled by default in the SUSE OpenStack Cloud 9 release.
Name | User/Admin | Note: API support only. No CLI/UI support |
---|---|---|
Implied Roles | Admin | https://blueprints.launchpad.net/keystone/+spec/implied-roles |
Domain-Specific Roles | Admin | https://blueprints.launchpad.net/keystone/+spec/domain-specific-roles |
Fernet Token Provider | User and Admin | https://docs.openstack.org/keystone/rocky/admin/identity-fernet-token-faq.html |
Implied roles
To allow for the practice of hierarchical permissions in user roles, this feature enables roles to be linked in such a way that they function as a hierarchy with role inheritance.
When a user is assigned a superior role, the user will also be assigned all roles implied by any subordinate roles. The hierarchy of the assigned roles will be expanded when issuing the user a token.
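The expansion described above is a transitive closure over the role-implication graph. The following illustrative Python sketch (not keystone source code; the role names and implication table are hypothetical) shows how a set of assigned roles expands when a token is issued:

```python
# Illustrative sketch of implied-role expansion, as keystone performs it
# when issuing a token. Role names and the "implies" table are examples.

def expand_roles(assigned, implies):
    """Return the transitive closure of the assigned roles."""
    expanded = set(assigned)
    stack = list(assigned)
    while stack:
        role = stack.pop()
        for implied in implies.get(role, ()):
            if implied not in expanded:
                expanded.add(implied)
                stack.append(implied)
    return expanded

# Example hierarchy: admin implies member, member implies reader.
implies = {"admin": ["member"], "member": ["reader"]}
print(sorted(expand_roles({"admin"}, implies)))
# -> ['admin', 'member', 'reader']
```

A user assigned only the superior role therefore receives the full chain of subordinate roles in the issued token.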
Domain-specific roles
This feature extends the principle of implied roles to include a set of roles that are specific to a domain. At the time a token is issued, the domain-specific roles are not included in the token, however, the roles that they map to are.
Fernet token provider
Provides tokens in the Fernet format. This feature is automatically configured and is enabled by default. Fernet tokens are preferred and used by default instead of the older UUID token format.
5.2.2 OpenStack upstream features that are disabled by default in SUSE OpenStack Cloud 9 #
The following is a list of features which are fully supported in the SUSE OpenStack Cloud 9 release, but are disabled by default. Customers can run a playbook to enable the features.
Name | User/Admin | Reason Disabled |
---|---|---|
Support multiple LDAP backends via per-domain configuration | Admin | Needs explicit configuration. |
WebSSO | User and Admin | Needs explicit configuration. |
keystone-to-keystone (K2K) federation | User and Admin | Needs explicit configuration. |
Domain-specific config in SQL | Admin | Domain specific configuration options can be stored in SQL instead of configuration files, using the new REST APIs. |
Multiple LDAP backends for each domain
This feature allows identity backends to be configured on a domain-by-domain basis. Domains will be capable of having their own exclusive LDAP service (or multiple services). A single LDAP service can also serve multiple domains, with each domain in a separate subtree.
To implement this feature, individual domains will require domain-specific configuration files. Domains that do not implement this feature will continue to share a common backend driver.
WebSSO
This feature enables the keystone service to provide federated identity services through a token-based single sign-on page. This feature is disabled by default, as it requires explicit configuration.
keystone-to-keystone (K2K) federation
This feature enables separate keystone instances to federate identities among the instances, offering inter-cloud authorization. This feature is disabled by default, as it requires explicit configuration.
Domain-specific config in SQL
Using the new REST APIs, domain-specific configuration options can be stored in a SQL database instead of in configuration files.
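For illustration, a domain's identity configuration can be uploaded through the Identity v3 API; the request below is a sketch only (the domain ID, endpoint, and option values are placeholders):

```
PUT /v3/domains/{domain_id}/config

{
  "config": {
    "identity": { "driver": "ldap" },
    "ldap": { "url": "ldap://ldap.example.com" }
  }
}
```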
5.2.3 OpenStack upstream features that have been specifically disabled in SUSE OpenStack Cloud 9 #
The following is a list of extensions which are disabled by default in SUSE OpenStack Cloud 9, according to keystone policy.
Target Release | Name | User/Admin | Reason Disabled |
---|---|---|---|
TBD | Endpoint Filtering | Admin | This extension was implemented to facilitate service activation. However, due to lack of enforcement on the service side, this feature is currently only partially effective. |
TBD | Endpoint Policy | Admin | This extension was intended to facilitate policy (policy.json) management and enforcement. It is currently unusable due to the lack of the middleware needed to use the policy files stored in keystone. |
TBD | OAuth 1.0a | User and Admin | Complexity in workflow and lack of adoption. Its alternative, keystone Trust, is enabled by default and is used by heat. |
TBD | Revocation Events | Admin | For PKI tokens only; PKI tokens are disabled by default due to usability concerns. |
TBD | OS CERT | Admin | For PKI tokens only; PKI tokens are disabled by default due to usability concerns. |
TBD | PKI Token | Admin | PKI tokens are disabled by default due to usability concerns. |
TBD | Driver-level caching | Admin | Disabled by default due to complexity in setup. |
TBD | Tokenless Authz | Admin | Tokenless authorization with an X.509 SSL client certificate. |
TBD | TOTP Authentication | User | Not fully mature; has not been battle-tested. |
TBD | is_admin_project | Admin | No integration with the services. |
5.3 Understanding Domains, Projects, Users, Groups, and Roles #
The identity service uses these concepts for authentication within your cloud and these are descriptions of each of them.
The SUSE OpenStack Cloud 9 identity service uses OpenStack keystone and the concepts of domains, projects, users, groups, and roles to manage authentication. This page describes how these work together.
5.3.1 Domains, Projects, Users, Groups, and Roles #
Most large business organizations use an identity system such as Microsoft Active Directory to store and manage their internal user information. A variety of applications such as HR systems are, in turn, used to manage the data inside of Active Directory. These same organizations often deploy a separate user management system for external users such as contractors, partners, and customers. Multiple authentication systems are then deployed to support multiple types of users.
An LDAP-compatible directory such as Active Directory provides a top-level organization or domain component. In this example, the organization is called Acme. The domain component (DC) is defined as acme.com. Underneath the top level domain component are entities referred to as organizational units (OU). Organizational units are typically designed to reflect the entity structure of the organization. For example, this particular schema has 3 different organizational units for the Marketing, IT, and Contractors units or departments of the Acme organization. Users (and other types of entities like printers) are then defined appropriately underneath each organizational entity. The keystone domain entity can be used to match the LDAP OU entity; each LDAP OU can have a corresponding keystone domain created. In this example, both the Marketing and IT domains represent internal employees of Acme and use the same authentication source. The Contractors domain contains all external people associated with Acme. UserIDs associated with the Contractor domain are maintained in a separate user directory and thus have a different authentication source assigned to the corresponding keystone-defined Contractors domain.
A public cloud deployment usually supports multiple, separate organizations. keystone domains can be created to provide a domain per organization, with each domain configured to the underlying organization's authentication source. For example, the ABC company would have a keystone domain created called "abc". All users authenticating to the "abc" domain would be authenticated against the authentication system provided by the ABC organization; in this case ldap://ad.abc.com.
5.3.2 Domains #
A domain is a top-level container targeted at defining major organizational entities.
Domains can be used in a multi-tenant OpenStack deployment to segregate projects and users from different companies in a public cloud deployment or different organizational units in a private cloud setting.
Domains provide the means to identify multiple authentication sources.
Each domain is unique within an OpenStack implementation.
Multiple projects can be assigned to a domain but each project can only belong to a single domain.
Each domain has an assigned "admin".
Each project has an assigned "admin".
Domains are created by the "admin" service account and domain admins are assigned by the "admin" user.
The "admin" UserID (UID) is created during the keystone installation, has the "admin" role assigned to it, and is defined as the "Cloud Admin". This UID is created using the "magic" or "secret" admin token found in the default keystone.conf file after the keystone service has been installed. This secret token should be removed after installation and the "admin" password changed.
The "default" domain is created automatically during the SUSE OpenStack Cloud keystone installation.
The "default" domain contains all OpenStack service accounts that are installed during the SUSE OpenStack Cloud keystone installation process.
No users but the OpenStack service accounts should be assigned to the "default" domain.
Domain admins can be any UserID inside or outside of the domain.
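The points above can be sketched with the OpenStack CLI; in this hypothetical example the Cloud Admin creates a domain and delegates it by granting the "admin" role scoped to that domain (the domain and user names are examples):

```shell
# Cloud Admin creates a new top-level domain
openstack domain create contractors
# Grant the admin role on that domain to a chosen UID
openstack role add --domain contractors --user jdoe admin
```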
5.3.3 Domain Administrator #
A UID is a domain administrator for a given domain if that UID has a domain-scoped token for the given domain. This means that the UID has the "admin" role assigned to it for the selected domain.
The Cloud Admin UID assigns the domain administrator role for a domain to a selected UID.
A domain administrator can create and delete local users who have authenticated against keystone. These users will be assigned to the domain belonging to the domain administrator who creates the UserID.
A domain administrator can only create users and projects within her assigned domains.
A domain administrator can assign the "admin" role of their domains to another UID or revoke it; each UID with the "admin" role for a specified domain will be a co-administrator for that domain.
A UID can be assigned to be the domain admin of multiple domains.
A domain administrator can assign non-admin roles to any users and groups within their assigned domain, including projects owned by their assigned domain.
A domain admin UID can belong to projects within their administered domains.
Each domain can have a different authentication source.
The domain field is used during the initial login to define the source of authentication.
The "List Users" function can only be executed by a UID with the domain admin role.
A domain administrator can assign a UID from outside of their domain the "domain admin" role, but it is assumed that the domain admin would know the specific UID and would not need to list users from an external domain.
A domain administrator can assign a UID from outside of their domain the "project admin" role for a specific project within their domain, but it is assumed that the domain admin would know the specific UID and would not need to list users from an external domain.
Any user that needs the ability to create a user in a project should be granted the "admin" role for the domain where the user and the project reside.
In order for the horizon panel to properly fill the "Owner" column, any user that is granted the admin role on a project must also be granted the "member" or "admin" role in the domain.
5.3.4 Projects #
The domain administrator creates projects within his assigned domain and assigns the project admin role for each project to a selected UID. A UID is a project administrator for a given project if that UID has a project-scoped token for the given project. There can be multiple projects per domain. The project admin sets the project quota settings, adds/deletes users and groups to and from the project, and defines the user/group roles for the assigned project. Users can belong to multiple projects and have different roles on each project. Users are assigned to a specific domain and a default project. Roles are assigned per project.
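A hypothetical CLI session covering these steps (the project, user, and quota values are examples only):

```shell
# Domain admin creates a project within her domain
openstack project create --domain contractors webapp-dev
# Assign a per-project role to a user
openstack role add --project webapp-dev --user jdoe member
# Set a project quota
openstack quota set --instances 20 webapp-dev
```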
5.3.5 Users and Groups #
Each user belongs to one domain only. Domain assignments are defined either by the domain configuration files or by a domain administrator when creating a new, local (user authenticated against keystone) user. There is no current method for "moving" a user from one domain to another. A user can belong to multiple projects within a domain with a different role assignment per project. A group is a collection of users. Users can be assigned to groups either by the project admin or automatically via mappings if an external authentication source is defined for the assigned domain. Groups can be assigned to multiple projects within a domain and have different roles assigned to the group per project. A group can be assigned the "admin" role for a domain or project. All members of the group will be an "admin" for the selected domain or project.
5.3.6 Roles #
Service roles represent the functionality used to implement the OpenStack role-based access control (RBAC) model used to manage access to each OpenStack service. Roles are named and assigned per user or group for each project by the identity service. Role definition and policy enforcement are defined outside of the identity service independently by each OpenStack service. The token generated by the identity service for each user authentication contains the role assigned to that user for a particular project. When a user attempts to access a specific OpenStack service, the role is parsed by the service, compared to the service-specific policy file, and then granted the resource access defined for that role by the service policy file.
Each service has its own service policy file with the /etc/[SERVICE_CODENAME]/policy.json file name format, where [SERVICE_CODENAME] represents a specific OpenStack service name. For example, the OpenStack nova service has a policy file called /etc/nova/policy.json. Service policy files can be modified and deployed to control nodes from the Cloud Lifecycle Manager. Administrators are advised to validate policy changes before checking them in to the site branch of the local git repository and rolling the changes into production. Do not make changes to policy files without having a way to validate them.
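As an illustration of the role-to-rule mapping, a policy file contains entries like the following (modeled on nova's defaults; treat the exact rules as an example rather than the shipped file):

```
{
    "admin_or_owner": "is_admin:True or project_id:%(project_id)s",
    "os_compute_api:servers:create": "rule:admin_or_owner",
    "os_compute_api:servers:delete": "rule:admin_or_owner"
}
```

When a user calls the server-create API, nova compares the roles carried in the token against the rule referenced by os_compute_api:servers:create and grants or denies access accordingly.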
The policy files are located at the following site branch locations on the Cloud Lifecycle Manager.
~/openstack/ardana/ansible/roles/GLA-API/templates/policy.json.j2
~/openstack/ardana/ansible/roles/ironic-common/files/policy.json
~/openstack/ardana/ansible/roles/KEYMGR-API/templates/policy.json
~/openstack/ardana/ansible/roles/heat-common/files/policy.json
~/openstack/ardana/ansible/roles/CND-API/templates/policy.json
~/openstack/ardana/ansible/roles/nova-common/files/policy.json
~/openstack/ardana/ansible/roles/CEI-API/templates/policy.json.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/policy.json.j2
For test and validation, policy files can be modified in a non-production environment from the ~/scratch/ directory. For a specific policy file, run a search for policy.json. To deploy policy changes for a service, run the service-specific reconfiguration playbook (for example, nova-reconfigure.yml). For a complete list of reconfiguration playbooks, change directories to ~/scratch/ansible/next/ardana/ansible and run this command:
ardana > ls | grep reconfigure
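For example, to roll out a modified nova policy, the service-specific playbook is run from that same directory. The inventory path below follows the usual Cloud Lifecycle Manager layout; verify it against your deployment before running:

```shell
cd ~/scratch/ansible/next/ardana/ansible
# List the available reconfiguration playbooks
ls | grep reconfigure
# Deploy the changed policy to the control nodes (illustrative invocation)
ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
```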
A read-only role named project_observer is explicitly created in SUSE OpenStack Cloud 9. Any user who is granted this role can use list_project.
5.4 Identity Service Token Validation Example #
The following diagram illustrates the flow of typical Identity service (keystone) requests/responses between SUSE OpenStack Cloud services and the Identity service. It shows how keystone issues and validates tokens to ensure the identity of the caller of each service.
horizon sends an HTTP authentication request to keystone for user credentials.
keystone validates the credentials and replies with token.
horizon sends a POST request, with token to nova to start provisioning a virtual machine.
nova sends token to keystone for validation.
keystone validates the token.
nova forwards a request for an image with the attached token.
glance sends token to keystone for validation.
keystone validates the token.
glance provides image-related information to nova.
nova sends request for networks to neutron with token.
neutron sends token to keystone for validation.
keystone validates the token.
neutron provides network-related information to nova.
nova reports the status of the virtual machine provisioning request.
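The validation steps in this flow use the Identity v3 tokens API: a service passes the user's token in the X-Subject-Token header and authenticates itself with its own token. An illustrative request (the tokens are placeholders):

```
GET /v3/auth/tokens
X-Auth-Token: SERVICE_TOKEN
X-Subject-Token: USER_TOKEN
```

A 200 response containing the token body confirms that the subject token is valid.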
5.5 Configuring the Identity Service #
5.5.1 What is the Identity service? #
The SUSE OpenStack Cloud Identity service, based on the OpenStack keystone API, provides UserID authentication and access authorization to help organizations achieve their access security and compliance objectives and successfully deploy OpenStack. In short, the Identity service is the gateway to the rest of the OpenStack services.
The identity service is installed automatically by the Cloud Lifecycle Manager (just after MySQL and RabbitMQ). When your cloud is up and running, you can customize keystone in a number of ways, including integrating with LDAP servers. This topic describes the default configuration. See Section 5.8, “Reconfiguring the Identity service” for changes you can implement. Also see Section 5.9, “Integrating LDAP with the Identity Service” for information on integrating with an LDAP provider.
5.5.2 Which version of the Identity service should you use? #
Note that you should use identity API version 3.0. Identity API v2.0 has been deprecated. Many features such as LDAP integration and fine-grained access control will not work with v2.0. The following are a few questions you may have regarding versions.
Why does the keystone identity catalog still show version 2.0?
Tempest tests still use the v2.0 API. They are in the process of migrating to v3.0. We will remove the v2.0 version once tempest has migrated the tests. The Identity catalog has version 2.0 just to support tempest migration.
Will the keystone identity v3.0 API work if the identity catalog has only the v2.0 endpoint?
Identity v3.0 does not rely on the content of the catalog. It will continue to work regardless of the version of the API in the catalog.
Which CLI client should you use?
You should use the OpenStack CLI, not the deprecated keystone CLI. The keystone CLI does not support the v3.0 API; only the OpenStack CLI supports the v3.0 API.
5.5.3 Authentication #
The authentication function provides the initial login function to OpenStack. keystone supports multiple sources of authentication, including a native or built-in authentication system. You can use the keystone native system for all user management functions for proof-of-concept deployments or small deployments not requiring integration with a corporate authentication system, but it lacks some of the advanced functions usually found in user management systems such as forcing password changes. The focus of the keystone native authentication system is to be the source of authentication for OpenStack-specific users required to operate various OpenStack services. These users are stored by keystone in a default domain; the addition of these IDs to an external authentication system is not required.
keystone is more commonly integrated with external authentication systems such as OpenLDAP or Microsoft Active Directory. These systems are usually centrally deployed by organizations to serve as the single source of user management and authentication for all in-house deployed applications and systems requiring user authentication. In addition to LDAP and Microsoft Active Directory, support for integration with Security Assertion Markup Language (SAML)-based identity providers from companies such as Ping, CA, IBM, Oracle, and others is also nearly "production-ready."
keystone also provides architectural support through the underlying Apache deployment for other types of authentication systems, such as multi-factor authentication. These types of systems typically require driver support and integration from the respective providers.
While support for Identity providers and multi-factor authentication is available in keystone, it has not yet been certified by the SUSE OpenStack Cloud engineering team and is an experimental feature in SUSE OpenStack Cloud.
LDAP-compatible directories such as OpenLDAP and Microsoft Active Directory are recommended alternatives to using keystone local authentication. Both methods are widely used by organizations and are integrated with a variety of other enterprise applications. These directories act as the single source of user information within an organization. You can configure keystone to authenticate against an LDAP-compatible directory on a per-domain basis.
Domains, as explained in Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”, can be configured so that, based on the user ID, an incoming user is automatically mapped to a specific domain. You can then configure this domain to authenticate against a specific LDAP directory. User credentials provided by the user to keystone are passed along to the designated LDAP source for authentication. You can optionally configure this communication to be secure through SSL encryption. No special LDAP administrative access is required, and only read-only access is needed for this configuration. keystone will not add any LDAP information. All user additions, deletions, and modifications are performed by the application's front end in the LDAP directories. After a user has been successfully authenticated, that user is then assigned to the groups, roles, and projects defined by the keystone domain or project administrators. This information is stored in the keystone service database.
Another form of external authentication provided by the keystone service is through integration with SAML-based identity providers (IdP) such as Ping Identity, IBM Tivoli, and Microsoft Active Directory Federation Server. A SAML-based identity provider provides authentication that is often called "single sign-on." The IdP server is configured to authenticate against identity sources such as Active Directory and provides a single authentication API against multiple types of downstream identity sources. This means that an organization could have multiple identity storage sources but a single authentication source. In addition, if a user has logged into one such source during a defined session time frame, that user does not need to reauthenticate within the defined session. Instead, the IdP automatically validates the user to requesting applications and services.
A SAML-based IdP authentication source is configured with keystone on a per-domain basis similar to the manner in which native LDAP directories are configured. Extra mapping rules are required in the configuration that define which keystone group an incoming UID is automatically assigned to. This means that groups need to be defined in keystone first, but it also removes the requirement that a domain or project administrator assign user roles and project membership on a per-user basis. Instead, groups are used to define project membership and roles and incoming users are automatically mapped to keystone groups based on their upstream group membership. This strategy provides a consistent role-based access control (RBAC) model based on the upstream identity source. The configuration of this option is fairly straightforward. IdP vendors such as Ping and IBM are contributing to the maintenance of this function and have also produced their own integration documentation. HPE is using the Microsoft Active Directory Federation Services (AD FS) for functional testing and future documentation.
The third keystone-supported authentication source is known as multi-factor authentication (MFA). MFA typically requires an external source of authentication beyond a login name and password, and can include options such as SMS text, a temporal token generator, or a fingerprint scanner. Each of these types of MFAs are usually specific to a particular MFA vendor. The keystone architecture supports an MFA-based authentication system, but this has not yet been certified or documented for SUSE OpenStack Cloud.
5.5.4 Authorization #
Another major function provided by the keystone service is access authorization that determines which resources and actions are available based on the UserID, the role of the user, and the projects that a user is provided access to. All of this information is created, managed, and stored by keystone. These functions are applied through the horizon web interface, the OpenStack command-line interface, or the direct keystone API.
keystone provides support for organizing users by using three entities:
- Domains
Domains provide the highest level of organization. Domains are intended to be used as high-level containers for multiple projects. A domain can represent different tenants, companies, or organizations for an OpenStack cloud deployed for public cloud deployments or it can represent major business units, functions, or any other type of top-level organization unit in an OpenStack private cloud deployment. Each domain has at least one Domain Admin assigned to it. This Domain Admin can then create multiple projects within the domain and assign the project administrator role to specific project owners. Each domain created in an OpenStack deployment is unique and the projects assigned to a domain cannot exist in another domain.
- Projects
Projects are entities within a domain that represent groups of users, each user role within that project, and how many underlying infrastructure resources can be consumed by members of the project.
- Groups
Groups are an optional function and provide the means of assigning project roles to multiple users at once.
keystone also makes it possible to create and assign roles to groups of users or individual users. Role names are created and user assignments are made within keystone. The actual function of a role is defined currently for each OpenStack service via scripts. When users request access to an OpenStack service, their access tokens contain information about their assigned project membership and role for that project. This role is then matched to the service-specific script and users are allowed to perform functions within that service defined by the role mapping.
5.5.5 Default settings #
Identity service configuration settings
The identity service configuration options are described in the OpenStack documentation at keystone Configuration Options on the OpenStack site.
Default domain and service accounts
The "default" domain is automatically created during the installation to contain the various required OpenStack service accounts, including the following:
admin, barbican, barbican_service, ceilometer, cinder, cinderinternal, demo, designate, glance, glance-check, glance-swift, heat, logging, logging_api, logging_beaver, logging_monitor, magnum, manila, manilainternal, monasca, monasca-agent, monasca_read_only, neutron, nova, nova_monasca, octavia, placement, swift, swift-demo, swift-dispersion, swift-monitor
These accounts are required by the underlying OpenStack services. They should not be removed or reassigned to a different domain, and the "default" domain should be used only for these service accounts.
5.5.6 Preinstalled roles #
The following are the preinstalled roles. Additional roles can be created by users holding the "admin" role. Roles are defined on a per-service basis (more information is available at Manage projects, users, and roles on the OpenStack website).
Role | Description |
---|---|
admin | The "superuser" role. Provides full access to all SUSE OpenStack Cloud services across all domains and projects. This role should be given only to a cloud administrator. |
member | A general role that enables a user to access resources within an assigned project, including creating, modifying, and deleting compute, storage, and network resources. |
You can find additional information on these roles in each service policy stored in the /etc/PROJECT/policy.json files, where PROJECT is a placeholder for an OpenStack service. For example, the Compute (nova) service roles are stored in the /etc/nova/policy.json file. Each service policy file defines the specific API functions available to a role label.
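For example, a policy file fragment of this shape maps API actions to role labels. The rule names and actions here are hypothetical, shown only to illustrate the format:

```json
{
  "context_is_admin": "role:admin",
  "compute:create": "role:admin or role:member",
  "compute:delete": "rule:context_is_admin"
}
```

Under these illustrative rules, a user whose token carries the member role for a project would pass the compute:create check but fail compute:delete.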
5.6 Retrieving the Admin Password #
The admin password is used to access the dashboard and Operations Console, and to authenticate when using the command-line tools and API.
In a default SUSE OpenStack Cloud 9 installation, a randomly generated password is created for the Admin user. The following steps show how to retrieve this password.
5.6.1 Retrieving the Admin Password #
You can retrieve the randomly generated Admin password by using this command on the Cloud Lifecycle Manager:
ardana > cat ~/service.osrc
In this example output, the value for OS_PASSWORD is the Admin password:
ardana > cat ~/service.osrc
unset OS_DOMAIN_NAME
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_VERSION=3
export OS_PROJECT_NAME=admin
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USERNAME=admin
export OS_USER_DOMAIN_NAME=Default
export OS_PASSWORD=SlWSfwxuJY0
export OS_AUTH_URL=https://10.13.111.145:5000/v3
export OS_ENDPOINT_TYPE=internalURL
# OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
export OS_INTERFACE=internal
export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
export OS_COMPUTE_API_VERSION=2
5.7 Changing Service Passwords #
SUSE OpenStack Cloud provides a process for changing the default service passwords, including your admin user password, which you may want to do for security or other purposes.
You can easily change the inter-service passwords used for authenticating communications between services in your SUSE OpenStack Cloud deployment, promoting better compliance with your organization’s security policies. The inter-service passwords that can be changed include (but are not limited to) keystone, MariaDB, RabbitMQ, Cloud Lifecycle Manager cluster, monasca and barbican.
The general process for changing the passwords is to:
- Indicate to the configuration processor which password(s) you want to change, optionally including the new value of each password.
- Run the configuration processor to generate the new passwords (you do not need to run git add before this).
- Run ready-deployment.
- Check your password name(s) against the tables included below to see which high-level credentials-change playbook(s) you need to run.
- Run the appropriate high-level credentials-change playbook(s).
5.7.1 Password Strength #
Encryption passwords supplied to the configuration processor for use with Ansible Vault and for encrypting the configuration processor’s persistent state must have a minimum length of 12 characters and a maximum of 128 characters. Passwords must contain characters from each of the following three categories:
Uppercase characters (A-Z)
Lowercase characters (a-z)
Base 10 digits (0-9)
Service passwords that are automatically generated by the configuration processor are chosen uniformly from the 62-character alphabet made up of the 26 uppercase letters, the 26 lowercase letters, and the 10 digits, with the minimum and maximum lengths determined by the specific requirements of individual services.
Currently, you cannot use any special characters with Ansible Vault, service passwords, or the vCenter configuration.
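The generation scheme can be sketched as follows. This is a minimal illustration of the character rules above, not the configuration processor's actual code; the default length of 16 is arbitrary.

```shell
# Pick characters uniformly from the 62-character alphabet (A-Z, a-z, 0-9),
# retrying until the result contains at least one character from each of the
# three required categories. No special characters are ever produced.
generate_service_password() {
    length="${1:-16}"
    while true; do
        candidate=$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length" || true)
        # require at least one uppercase, one lowercase, and one digit
        case "$candidate" in *[A-Z]*) ;; *) continue ;; esac
        case "$candidate" in *[a-z]*) ;; *) continue ;; esac
        case "$candidate" in *[0-9]*) ;; *) continue ;; esac
        printf '%s\n' "$candidate"
        return 0
    done
}
```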
5.7.2 Telling the configuration processor which password(s) you want to change #
In SUSE OpenStack Cloud 9, the configuration processor produces metadata about each of the passwords (and other variables) that it generates in the file ~/openstack/my_cloud/info/private_data_metadata_ccp.yml. A snippet of this file follows:
5.7.3 private_data_metadata_ccp.yml #
metadata_proxy_shared_secret:
  metadata:
  - clusters:
    - cluster1
    component: nova-metadata
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
mysql_admin_password:
  metadata:
  - clusters:
    - cluster1
    component: ceilometer
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: heat
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: keystone
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    - compute
    component: nova
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: cinder
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: glance
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    - compute
    component: neutron
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  - clusters:
    - cluster1
    component: horizon
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
mysql_barbican_password:
  metadata:
  - clusters:
    - cluster1
    component: barbican
    consumes: mysql
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
For each variable, there is a metadata entry for each pair of services that use the variable, including a list of the clusters on which the consuming service component (defined as "component:" in private_data_metadata_ccp.yml above) runs.
Note above that the variable mysql_admin_password is used by a number of service components; the service that is consumed in each case is mysql, which in this context refers to the MariaDB instance that is part of the product.
5.7.4 Steps to change a password #
First, make sure that you have a copy of private_data_metadata_ccp.yml. If you do not, run the configuration processor to generate one:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Make a copy of the private_data_metadata_ccp.yml file and place it into the ~/openstack/change_credentials directory:
ardana > cp ~/openstack/my_cloud/info/private_data_metadata_control-plane-1.yml \
~/openstack/change_credentials/
Edit the copied file in ~/openstack/change_credentials, leaving only those passwords you intend to change; delete all other entries from this template file. If you leave in passwords that you do not want to change, they will be regenerated and will no longer match those in use, which could disrupt operations.
You must change passwords in batches, grouped by the categories listed below.
For example, the snippet below would result in the configuration processor generating new random values for keystone_backup_password, keystone_ceilometer_password, and keystone_cinder_password:
keystone_backup_password:
  metadata:
  - clusters:
    - cluster0
    - cluster1
    - compute
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
keystone_ceilometer_password:
  metadata:
  - clusters:
    - cluster1
    component: ceilometer-common
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
keystone_cinder_password:
  metadata:
  - clusters:
    - cluster1
    component: cinder-api
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
5.7.5 Specifying password value #
Optionally, you can specify a value for the password by including a "value:" key and value at the same level as metadata:
keystone_backup_password:
  value: 'new_password'
  metadata:
  - clusters:
    - cluster0
    - cluster1
    - compute
    consumes: keystone-api
    consuming-cp: ccp
    cp: ccp
  version: '2.0'
Note that you can have multiple files in openstack/change_credentials. The configuration processor will only read files that end in .yml or .yaml.
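A quick way to confirm which files in the directory will actually be read is a filter like the following (a simple sketch; list_credential_files is a helper name invented here, not a shipped tool):

```shell
# Only files ending in .yml or .yaml are read by the configuration processor;
# anything else in the directory is ignored.
list_credential_files() {
    find "$1" -maxdepth 1 -type f \( -name '*.yml' -o -name '*.yaml' \) -print
}
# Example: list_credential_files ~/openstack/change_credentials
```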
If you have specified a password value in your credential change file, you may want to encrypt it using ansible-vault. If you decide to encrypt with ansible-vault, make sure that you use the encryption key you have already used when running the configuration processor.
To encrypt a file using ansible-vault, execute:
ardana > cd ~/openstack/change_credentials
ardana > ansible-vault encrypt CREDENTIALS_FILE
where CREDENTIALS_FILE is the credential change file ending in .yml or .yaml.
Be sure to provide the encryption key when prompted. Note that if you have specified the wrong ansible-vault password, the configuration-processor will error out with a message like the following:
##################################################
Reading Persistent State
##################################################
################################################################################
# The configuration processor failed.
# PersistentStateCreds: User-supplied creds file test1.yml was not parsed properly
################################################################################
5.7.6 Running the configuration processor to change passwords #
The directory openstack/change_credentials is not managed by git, so to rerun the configuration processor to generate new passwords and prepare for the next deployment, enter the following commands:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
The files that you placed in ~/openstack/change_credentials should be removed once you have run the configuration processor, because the old and new password values are stored in the configuration processor's persistent state.
Note that if you see output like the following after running the configuration processor:
################################################################################
# The configuration processor completed with warnings.
# PersistentStateCreds: User-supplied password name 'blah' is not valid
################################################################################
this tells you that the password name you supplied, 'blah', does not exist. A failure to correctly parse the credentials change file results in the configuration processor erroring out with a message like the following:
##################################################
Reading Persistent State
##################################################
################################################################################
# The configuration processor failed.
# PersistentStateCreds: User-supplied creds file test1.yml was not parsed properly
################################################################################
Once you have run the configuration processor to change passwords, an information file, ~/openstack/my_cloud/info/password_change.yml, similar to private_data_metadata_ccp.yml, is written to tell you which passwords have been changed, including the metadata but not the values.
5.7.7 Password change playbooks and tables #
Once you have completed the steps above to change password values and to prepare for the deployment that actually switches over to the new passwords, you need to run some high-level playbooks. The passwords that can be changed are grouped into six categories, and the tables below list the password names that belong in each category. The categories are:
- keystone
Playbook: ardana-keystone-credentials-change.yml
- RabbitMQ
Playbook: ardana-rabbitmq-credentials-change.yml
- MariaDB
Playbook: ardana-reconfigure.yml
- Cluster
Playbook: ardana-cluster-credentials-change.yml
- monasca
Playbook: monasca-reconfigure-credentials-change.yml
- Other
Playbook: ardana-other-credentials-change.yml
It is recommended that you change passwords in batches; in other words, run through a complete password change process for each batch of passwords, preferably in the above order. Once you have followed the process indicated above to change password(s), check the names against the tables below to see which password change playbook(s) you should run.
Changing identity service credentials
The following table lists identity service credentials you can change.
keystone credentials |
---|
Password name
barbican_admin_password
barbican_service_password
keystone_admin_pwd
keystone_ceilometer_password
keystone_cinder_password
keystone_cinderinternal_password
keystone_demo_pwd
keystone_designate_password
keystone_glance_password
keystone_glance_swift_password
keystone_heat_password
keystone_magnum_password
keystone_monasca_agent_password
keystone_monasca_password
keystone_neutron_password
keystone_nova_password
keystone_octavia_password
keystone_swift_dispersion_password
keystone_swift_monitor_password
keystone_swift_password
nova_monasca_password |
The playbook to run to change keystone credentials is ardana-keystone-credentials-change.yml. Execute the following commands to make the changes:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-keystone-credentials-change.yml
Changing RabbitMQ credentials
The following table lists the RabbitMQ credentials you can change.
RabbitMQ credentials |
---|
Password name
rmq_barbican_password
rmq_ceilometer_password
rmq_cinder_password
rmq_designate_password
rmq_keystone_password
rmq_magnum_password
rmq_monasca_monitor_password
rmq_nova_password
rmq_octavia_password
rmq_service_password |
The playbook to run to change RabbitMQ credentials is ardana-rabbitmq-credentials-change.yml. Execute the following commands to make the changes:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-rabbitmq-credentials-change.yml
Changing MariaDB credentials
The following table lists the MariaDB credentials you can change.
MariaDB credentials |
---|
Password name
mysql_admin_password
mysql_barbican_password
mysql_clustercheck_pwd
mysql_designate_password
mysql_magnum_password
mysql_monasca_api_password
mysql_monasca_notifier_password
mysql_monasca_thresh_password
mysql_octavia_password
mysql_powerdns_password
mysql_root_pwd
mysql_sst_password
ops_mon_mdb_password
mysql_monasca_transform_password
mysql_nova_api_password
password |
The playbook to run to change MariaDB credentials is ardana-reconfigure.yml. To make the changes, execute the following commands:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
Changing cluster credentials
The following table lists the cluster credentials you can change.
cluster credentials |
---|
Password name
haproxy_stats_password
keepalive_vrrp_password |
The playbook to run to change cluster credentials is ardana-cluster-credentials-change.yml. To make the changes, execute the following commands:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-cluster-credentials-change.yml
Changing monasca credentials
The following table lists the monasca credentials you can change.
monasca credentials |
---|
Password name
cassandra_monasca_api_password
cassandra_monasca_persister_password |
The playbook to run to change monasca credentials is monasca-reconfigure-credentials-change.yml. To make the changes, execute the following commands:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure-credentials-change.yml
Changing other credentials
The following table lists the other credentials you can change.
Other credentials |
---|
Password name
logging_beaver_password
logging_api_password
logging_monitor_password
logging_kibana_password |
The playbook to run to change these credentials is ardana-other-credentials-change.yml. To make the changes, execute the following commands:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-other-credentials-change.yml
5.7.8 Changing RADOS Gateway Credential #
To change the keystone credentials of the RADOS Gateway, follow the steps documented in Section 5.7, “Changing Service Passwords”, modifying the keystone_rgw_password section in the private_data_metadata_ccp.yml file as described in Section 5.7.4, “Steps to change a password” or Section 5.7.5, “Specifying password value”.
5.7.9 Immutable variables #
The values of certain variables are immutable, which means that once they have been generated by the configuration processor they cannot be changed. These variables are:
barbican_master_kek_db_plugin
swift_hash_path_suffix
swift_hash_path_prefix
mysql_cluster_name
heartbeat_key
erlang_cookie
The configuration processor will not regenerate the values of the above passwords, nor will it allow you to specify a value for them. In addition to the above variables, the following are immutable in SUSE OpenStack Cloud 9:
All ssh keys generated by the configuration processor
All UUIDs generated by the configuration processor
metadata_proxy_shared_secret
horizon_secret_key
ceilometer_metering_secret
5.8 Reconfiguring the Identity service #
5.8.1 Updating the keystone Identity Service #
This topic explains configuration options for the Identity service.
SUSE OpenStack Cloud lets you perform updates on the following parts of the Identity service configuration:
- Any content in the main keystone configuration file, /etc/keystone/keystone.conf. This lets you manipulate keystone configuration parameters. Continue with Section 5.8.2, “Updating the Main Identity service Configuration File”.
- Certain configuration options and features, such as:
  - Verbosity of logs being written to keystone log files.
  - Process counts for the Apache2 WSGI module, separately for admin and public keystone interfaces.
  - Enabling/disabling auditing.
  - Enabling/disabling Fernet tokens.
  For more information, see Section 5.8.3, “Enabling Identity Service Features”.
- Domain-specific configuration files: /etc/keystone/domains/keystone.<domain_name>.conf. Creating and updating these files lets you integrate keystone with one or more external authentication sources, such as an LDAP server. See Section 5.9, “Integrating LDAP with the Identity Service”.
5.8.2 Updating the Main Identity service Configuration File #
The main keystone Identity service configuration file (/etc/keystone/keystone.conf), located on each control plane server, is generated from the following template file located on a Cloud Lifecycle Manager:
~/openstack/my_cloud/config/keystone/keystone.conf.j2
Modify this template file as appropriate. See the keystone Liberty documentation for full descriptions of all settings. This is a Jinja2 template, which expects certain template variables to be set. Do not change values inside double curly braces: {{ }}.
Note: SUSE OpenStack Cloud 9 has the following token expiration setting, which differs from the upstream value 3600:
[token]
expiration = 14400
After you modify the template, commit the change to the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”):
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add my_cloud/config/keystone/keystone.conf.j2
ardana > git commit -m "Adjusting some parameters in keystone.conf"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the reconfiguration playbook in the deployment area:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
5.8.3 Enabling Identity Service Features #
To enable or disable keystone features, do the following:
Adjust respective parameters in ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml
Commit the change into the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”):
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add my_cloud/config/keystone/keystone_deploy_config.yml
ardana > git commit -m "Adjusting some WSGI or logging parameters for keystone"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the reconfiguration playbook in the deployment area:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
5.8.4 Fernet Tokens #
SUSE OpenStack Cloud 9 supports Fernet tokens by default. The benefit of using Fernet tokens is that tokens are not persisted in a database, which is helpful if you want to deploy the keystone Identity service as one master and multiple slaves; only roles, projects, and other details are replicated from master to slaves. The token table is not replicated.
Tempest does not work with Fernet tokens in SUSE OpenStack Cloud 9. If Fernet tokens are enabled, do not run token tests in Tempest.
During reconfiguration when switching to a Fernet token provider, or during Fernet key rotation, you may see a warning in keystone.log stating [fernet_tokens] key_repository is world readable: /etc/keystone/fernet-keys/. This is expected, and you can safely ignore this message; it is not displayed for other keystone operations. Directory permissions are in fact set to 600 (read/write by owner only), not world readable.
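If you want to confirm the permissions yourself, a check along these lines can be used (check_key_repo_perms is an illustrative helper written for this guide, not part of the product):

```shell
# Return success only if the given directory's mode is exactly 600
# (read/write by owner only), as described above for the Fernet key repository.
check_key_repo_perms() {
    [ "$(stat -c '%a' "$1")" = "600" ]
}
# Example: check_key_repo_perms /etc/keystone/fernet-keys
```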
Fernet token-signing key rotation is handled by a cron job, which is configured on one of the controllers. The controller with the Fernet token-signing key rotation cron job is also known as the Fernet Master node. By default, the Fernet token-signing key is rotated once every 24 hours. The Fernet token-signing keys are distributed from the Fernet Master node to the rest of the controllers at each rotation. Therefore, the Fernet token-signing keys are consistent across all the controllers at all times.
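Conceptually, the job on the Fernet Master node is equivalent to a crontab entry like the following. The schedule and flags here are illustrative assumptions; the actual entry is created and managed by the deployment:

```
# Rotate the Fernet token-signing keys once every 24 hours
0 0 * * * keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
```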
When enabling the Fernet token provider for the first time, specific steps are needed to set up the necessary mechanisms for Fernet token-signing key distribution:
Set keystone_configure_fernet to True in ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml.
Run the following commands to commit your change in Git and enable Fernet:
ardana > git add my_cloud/config/keystone/keystone_deploy_config.yml
ardana > git commit -m "enable Fernet token provider"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-deploy.yml
When the Fernet token provider is enabled, a Fernet Master alarm definition is also created in monasca to monitor the Fernet Master node. If the Fernet Master node is offline or unreachable, a CRITICAL alarm is raised for the Cloud Admin to take corrective action. If the Fernet Master node is offline for a prolonged period of time, Fernet token-signing key rotation is not performed, which may introduce security risks to the cloud. The Cloud Admin must take immediate action to resurrect the Fernet Master node.
5.9 Integrating LDAP with the Identity Service #
5.9.1 Integrating with an external LDAP server #
The keystone identity service provides two primary functions: user authentication and access authorization. The user authentication function validates a user's identity. keystone has a very basic user management system that can be used to create and manage user login and password credentials but this system is intended only for proof of concept deployments due to the very limited password control functions. The internal identity service user management system is also commonly used to store and authenticate OpenStack-specific service account information.
The recommended source of authentication is external user management systems such as LDAP directory services. The identity service can be configured to connect to and use external systems as the source of user authentication. The identity service domain construct is used to define different authentication sources based on domain membership. For example, cloud deployment could consist of as few as two domains:
The default domain that is pre-configured for the service account users that are authenticated directly against the identity service internal user management system
A customer-defined domain that contains all user projects and membership definitions. This domain can then be configured to use an external LDAP directory such as Microsoft Active Directory as the authentication source.
SUSE OpenStack Cloud can support multiple domains for deployments that support multiple tenants. Multiple domains can be created with each domain configured to either the same or different external authentication sources. This deployment model is known as a "per-domain" model.
There are currently two ways to configure "per-domain" authentication sources:
File store – each domain configuration is created and stored in separate text files. This is the older and current default method for defining domain configurations.
Database store – each domain configuration can be created using either the identity service manager utility (recommended) or a Domain Admin API (from OpenStack.org), and the results are stored in the identity service MariaDB database. This database store is a newer method introduced in the OpenStack Kilo release and now available in SUSE OpenStack Cloud.
Instructions for initially creating per-domain configuration files, and then migrating to the database store method via the identity service manager utility, are provided below.
5.9.2 Set up domain-specific driver configuration - file store #
To update the configuration for a specific LDAP domain:
Ensure that the following configuration options are in the main configuration file template: ~/openstack/my_cloud/config/keystone/keystone.conf.j2
[identity]
domain_specific_drivers_enabled = True
domain_configurations_from_database = False
Create a YAML file that contains the definition of the LDAP server connection. The sample file below is already provided as part of the Cloud Lifecycle Manager in the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”. It is available on the Cloud Lifecycle Manager in the following file:
~/openstack/my_cloud/config/keystone/keystone_configure_ldap_sample.yml
Save a copy of this file with a new name, for example:
~/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml
Note: Refer to the LDAP section of the keystone configuration example for OpenStack for the full option list and descriptions.
Below are samples of YAML configurations for identity service LDAP certificate settings, optimized for Microsoft Active Directory server.
Sample YAML configuration keystone_configure_ldap_my.yml
---
keystone_domainldap_conf:
  # CA certificates file content.
  # Certificates are stored in Base64 PEM format. This may be the entire LDAP server
  # certificate (in case of self-signed certificates), the certificate of the authority
  # which issued the LDAP server certificate, or a full certificate chain (Root CA
  # certificate, intermediate CA certificate(s), issuer certificate).
  #
  cert_settings:
    cacert: |
      -----BEGIN CERTIFICATE-----
      certificate appears here
      -----END CERTIFICATE-----

  # A domain will be created in MariaDB with this name, and associated with the ldap back end.
  # The installer will also generate a config file named
  # /etc/keystone/domains/keystone.<domain_name>.conf
  #
  domain_settings:
    name: ad
    description: Dedicated domain for ad users

  conf_settings:
    identity:
      driver: ldap

    # For a full list and description of ldap configuration options, please refer to
    # https://github.com/openstack/keystone/blob/master/etc/keystone.conf.sample or
    # http://docs.openstack.org/liberty/config-reference/content/keystone-configuration-file.html.
    #
    # Please note:
    # 1. LDAP configuration is read-only. Configuration which performs write operations
    #    (i.e. creates users, groups, etc.) is not supported at the moment.
    # 2. LDAP is only supported for identity operations (reading users and groups from
    #    LDAP). Assignment operations with LDAP (i.e. managing roles, projects) are not
    #    supported.
    # 3. LDAP is configured as a non-default domain. Configuring LDAP as a default domain
    #    is not supported.
    #
    ldap:
      url: ldap://ad.hpe.net
      suffix: DC=hpe,DC=net
      query_scope: sub
      user_tree_dn: CN=Users,DC=hpe,DC=net
      user: CN=admin,CN=Users,DC=hpe,DC=net
      password: REDACTED
      user_objectclass: user
      user_id_attribute: cn
      user_name_attribute: cn
      group_tree_dn: CN=Users,DC=hpe,DC=net
      group_objectclass: group
      group_id_attribute: cn
      group_name_attribute: cn
      use_pool: True
      user_enabled_attribute: userAccountControl
      user_enabled_mask: 2
      user_enabled_default: 512
      use_tls: True
      tls_req_cert: demand
      # If you are configuring multiple LDAP domains, and the LDAP server certificates are
      # issued by different authorities, make sure that you place certs for all the LDAP
      # backend domains in the cacert parameter as seen in this sample yml file, so that
      # all the certs are combined in a single CA file and every LDAP domain configuration
      # points to the combined CA file.
      # Note:
      # 1. Every time a new ldap domain is configured, the single CA file gets overwritten,
      #    so ensure that you place certs for all the LDAP backend domains in the cacert
      #    parameter.
      # 2. There is a known issue with one cert per CA file per domain when the system
      #    processes concurrent requests to multiple LDAP domains. Using the single CA file
      #    with all certs combined gets the system working properly. The issue is in the
      #    underlying SSL library; upstream is no longer investing in the python-ldap
      #    package, and it is also not python3 compliant.
      tls_cacertfile: /etc/keystone/ssl/certs/all_ldapdomains_ca.pem
keystone_domain_MSAD_conf:

    # CA certificates file content.
    # Certificates are stored in Base64 PEM format. This may be the entire LDAP server
    # certificate (in the case of self-signed certificates), the certificate of the
    # authority which issued the LDAP server certificate, or a full certificate chain
    # (Root CA certificate, intermediate CA certificate(s), issuer certificate).
    #
    cert_settings:
      cacert: |
        -----BEGIN CERTIFICATE-----
        certificate appears here
        -----END CERTIFICATE-----

    # A domain will be created in MariaDB with this name, and associated with the LDAP
    # back end. The installer will also generate a config file named
    # /etc/keystone/domains/keystone.<domain_name>.conf
    #
    domain_settings:
      name: msad
      description: Dedicated domain for msad users

    conf_settings:
      identity:
        driver: ldap

      # For a full list and description of LDAP configuration options, refer to
      # https://github.com/openstack/keystone/blob/master/etc/keystone.conf.sample or
      # http://docs.openstack.org/liberty/config-reference/content/keystone-configuration-file.html
      #
      # Please note:
      # 1. LDAP configuration is read-only. Configuration which performs write operations
      #    (that is, creates users, groups, etc.) is not supported at the moment.
      # 2. LDAP is only supported for identity operations (reading users and groups from
      #    LDAP). Assignment operations with LDAP (that is, managing roles, projects) are
      #    not supported.
      # 3. LDAP is configured as a non-default domain. Configuring LDAP as the default
      #    domain is not supported.
      #
      ldap:
        # If the url parameter is set to ldap, then typically use_tls should be set to
        # True. If url is set to ldaps, then use_tls should be set to False.
        url: ldaps://10.16.22.5
        use_tls: False
        query_scope: sub
        user_tree_dn: DC=l3,DC=local
        # This is the user and password for the account that has access to the AD server.
        user: administrator@l3.local
        password: OpenStack123
        user_objectclass: user
        # For a default Active Directory schema this is where to find the user name;
        # OpenLDAP uses a different value.
        user_id_attribute: userPrincipalName
        user_name_attribute: sAMAccountName
        group_tree_dn: DC=l3,DC=local
        group_objectclass: group
        group_id_attribute: cn
        group_name_attribute: cn
        # An upstream defect requires use_pool to be set to False.
        use_pool: False
        user_enabled_attribute: userAccountControl
        user_enabled_mask: 2
        user_enabled_default: 512
        tls_req_cert: allow
        # Referrals may contain URLs that cannot be resolved and will cause timeouts;
        # ignore them.
        chase_referrals: False
        # If you are configuring multiple LDAP domains whose server certificates are issued
        # by different authorities, place the certificates for all the LDAP back-end domains
        # in the cacert parameter, as in this sample file, so that all the certificates are
        # combined in a single CA file and every LDAP domain configuration points to that
        # combined CA file.
        # Note:
        # 1. Every time a new LDAP domain is configured, the single CA file is overwritten,
        #    so ensure that the certificates for all LDAP back-end domains are present in
        #    the cacert parameter.
        # 2. There is a known issue with one certificate per CA file per domain when the
        #    system processes concurrent requests to multiple LDAP domains. Using a single
        #    CA file with all certificates combined keeps the system working properly.
        tls_cacertfile: /etc/keystone/ssl/certs/all_ldapdomains_ca.pem
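The user_enabled_attribute, user_enabled_mask, and user_enabled_default settings in the samples above map Active Directory's userAccountControl bit field onto keystone's enabled flag. The sketch below is purely illustrative (the function name is hypothetical, not product code) and shows the bit arithmetic implied by mask 2 and default 512:

```python
# Illustrative only: how an LDAP-backed identity driver can interpret
# Active Directory's userAccountControl with the settings above.
# user_enabled_mask = 2 is AD's ACCOUNTDISABLE bit; user_enabled_default = 512
# (NORMAL_ACCOUNT) is the value representing an ordinary enabled account.

USER_ENABLED_MASK = 2       # bit set when the account is disabled
USER_ENABLED_DEFAULT = 512  # value of a normal, enabled account

def is_enabled(user_account_control: int) -> bool:
    """Return True when the ACCOUNTDISABLE bit is not set."""
    return (user_account_control & USER_ENABLED_MASK) == 0

print(is_enabled(512))  # normal account: True
print(is_enabled(514))  # 512 + 2, disabled account: False
```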
As suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”, commit the new file to the local git repository, and rerun the configuration processor and ready deployment playbooks:
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add my_cloud/config/keystone/keystone_configure_ldap_my.yml
ardana > git commit -m "Adding LDAP server integration config"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfiguration playbook in a deployment area, passing the YAML file created in the previous step as a command-line option:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml \
  -e@~/openstack/my_cloud/config/keystone/keystone_configure_ldap_my.yml

Follow these same steps for each LDAP domain with which you are integrating the identity service, creating a YAML file for each and running the reconfigure playbook once for each additional domain.
Ensure that a new domain was created for LDAP (Microsoft AD in this example), and set environment variables for admin-level access.
ardana > source keystone.osrc

Get a list of domains:

ardana > openstack domain list

The output will look similar to this:
+----------------------------------+---------+---------+----------------------------------------------------------------------+
| ID                               | Name    | Enabled | Description                                                          |
+----------------------------------+---------+---------+----------------------------------------------------------------------+
| 6740dbf7465a4108a36d6476fc967dbd | heat    | True    | Owns users and projects created by heat                              |
| default                          | Default | True    | Owns users and tenants (i.e. projects) available on Identity API v2. |
| b2aac984a52e49259a2bbf74b7c4108b | ad      | True    | Dedicated domain for users managed by Microsoft AD server            |
+----------------------------------+---------+---------+----------------------------------------------------------------------+
Note: The LDAP domain is read-only. This means that you cannot create new user or group records in it.
Once the LDAP user is granted the appropriate role, you can authenticate within the specified domain. Set environment variables for admin-level access:
ardana > source keystone.osrc

Get the user record within the ad (Active Directory) domain:

ardana > openstack user show testuser1 --domain ad

Note the output:
+-----------+------------------------------------------------------------------+
| Field     | Value                                                            |
+-----------+------------------------------------------------------------------+
| domain_id | 143af847018c4dc7bd35390402395886                                 |
| id        | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
| name      | testuser1                                                        |
+-----------+------------------------------------------------------------------+
Now, get a list of LDAP groups:

ardana > openstack group list --domain ad

Here you see testgroup1 and testgroup2:
+------------------------------------------------------------------+------------+
| ID                                                               | Name       |
+------------------------------------------------------------------+------------+
| 03976b0ea6f54a8e4c0032e8f756ad581f26915c7e77500c8d4aaf0e83afcdc6 | testgroup1 |
| 7ba52ee1c5829d9837d740c08dffa07ad118ea1db2d70e0dc7fa7853e0b79fcf | testgroup2 |
+------------------------------------------------------------------+------------+
Create a new role. Note that the role is not bound to the domain.
ardana > openstack role create testrole1

testrole1 has been created:
+-------+----------------------------------+
| Field | Value                            |
+-------+----------------------------------+
| id    | 02251585319d459ab847409dea527dee |
| name  | testrole1                        |
+-------+----------------------------------+
Grant the user a role within the domain by executing the command below. Note that due to a current OpenStack CLI limitation, you must use the user ID rather than the user name when working with a non-default domain.
ardana > openstack role add testrole1 \
  --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 \
  --domain ad

Verify that the role was successfully granted, as shown here:

ardana > openstack role assignment list \
  --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 \
  --domain ad
+----------------------------------+------------------------------------------------------------------+-------+---------+----------------------------------+
| Role                             | User                                                             | Group | Project | Domain                           |
+----------------------------------+------------------------------------------------------------------+-------+---------+----------------------------------+
| 02251585319d459ab847409dea527dee | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |       |         | 143af847018c4dc7bd35390402395886 |
+----------------------------------+------------------------------------------------------------------+-------+---------+----------------------------------+

Authenticate (get a domain-scoped token) as the new user with the new role. The --os-* command-line parameters specified below override the respective OS_* environment variables set by the keystone.osrc script to provide admin access. To ensure that the command below is executed in a clean environment, you may want to log out from the node and log in again.
ardana > openstack --os-identity-api-version 3 \
  --os-username testuser1 \
  --os-password testuser1_password \
  --os-auth-url http://10.0.0.6:35357/v3 \
  --os-domain-name ad \
  --os-user-domain-name ad \
  token issue

Here is the result:
+-----------+------------------------------------------------------------------+
| Field     | Value                                                            |
+-----------+------------------------------------------------------------------+
| domain_id | 143af847018c4dc7bd35390402395886                                 |
| expires   | 2015-09-09T21:36:15.306561Z                                      |
| id        | 6f8f9f1a932a4d01b7ad9ab061eb0917                                 |
| user_id   | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
+-----------+------------------------------------------------------------------+
Users can also have a project within the domain and get a project-scoped token. To accomplish this, set environment variables for admin-level access:

ardana > source keystone.osrc

Then create a new project within the domain:

ardana > openstack project create testproject1 --domain ad

The result shows that the project has been created:
+-------------+----------------------------------+
| Field       | Value                            |
+-------------+----------------------------------+
| description |                                  |
| domain_id   | 143af847018c4dc7bd35390402395886 |
| enabled     | True                             |
| id          | d065394842d34abd87167ab12759f107 |
| name        | testproject1                     |
+-------------+----------------------------------+
Grant the user a role within the project, re-using the role created in the previous example. Note that due to a current OpenStack CLI limitation, you must use the user ID rather than the user name when working with a non-default domain.
ardana > openstack role add testrole1 \
  --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 \
  --project testproject1

Verify that the role was successfully granted by generating a list:

ardana > openstack role assignment list \
  --user e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 \
  --project testproject1

The output shows the result:
+----------------------------------+------------------------------------------------------------------+-------+----------------------------------+--------+
| Role                             | User                                                             | Group | Project                          | Domain |
+----------------------------------+------------------------------------------------------------------+-------+----------------------------------+--------+
| 02251585319d459ab847409dea527dee | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |       | d065394842d34abd87167ab12759f107 |        |
+----------------------------------+------------------------------------------------------------------+-------+----------------------------------+--------+
Authenticate (get a project-scoped token) as the new user with the new role. The --os-* command-line parameters specified below override their respective OS_* environment variables set by keystone.osrc to provide admin access. To ensure that the command below is executed in a clean environment, you may want to log out from the node and log in again. Note that both the --os-project-domain-name and --os-user-domain-name parameters are needed, because neither the user nor the project is in the default domain.
ardana > openstack --os-identity-api-version 3 \
  --os-username testuser1 \
  --os-password testuser1_password \
  --os-auth-url http://10.0.0.6:35357/v3 \
  --os-project-name testproject1 \
  --os-project-domain-name ad \
  --os-user-domain-name ad \
  token issue

Below is the result:

+------------+------------------------------------------------------------------+
| Field      | Value                                                            |
+------------+------------------------------------------------------------------+
| expires    | 2015-09-09T21:50:49.945893Z                                      |
| id         | 328e18486f69441fb13f4842423f52d1                                 |
| project_id | d065394842d34abd87167ab12759f107                                 |
| user_id    | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
+------------+------------------------------------------------------------------+
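Behind both token issue commands above is keystone's v3 POST /v3/auth/tokens call. As a rough sketch (the helper function is hypothetical; the names mirror the example values used earlier in this section), this is the JSON request body for a project-scoped password authentication, which shows why the user's domain and the project's domain must each be named explicitly when neither is in the default domain:

```python
# Illustrative helper: build the keystone v3 password-auth request body
# scoped to a project in a non-default domain. The user domain and project
# domain objects correspond to --os-user-domain-name and
# --os-project-domain-name on the CLI.
import json

def project_scoped_auth(username, password, user_domain, project, project_domain):
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"name": user_domain},
                        "password": password,
                    }
                },
            },
            "scope": {
                "project": {
                    "name": project,
                    "domain": {"name": project_domain},
                }
            },
        }
    }

body = project_scoped_auth("testuser1", "testuser1_password", "ad",
                           "testproject1", "ad")
print(json.dumps(body, indent=2))
```

For a domain-scoped token, the "scope" member would instead contain a "domain" object, matching the earlier --os-domain-name example.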
5.9.3 Set up or switch to domain-specific driver configuration using a database store #
To make the switch, execute the steps below. Remember, you must have already set up the configuration for a file store as explained in Section 5.9.2, “Set up domain-specific driver configuration - file store”, and it must be working properly.
Ensure that the following configuration options are set in the main configuration file, ~/openstack/my_cloud/config/keystone/keystone.conf.j2:
[identity]
domain_specific_drivers_enabled = True
domain_configurations_from_database = True

[domain_config]
driver = sql
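These are standard INI-style options, so a rendered keystone.conf can be sanity-checked with Python's configparser. The snippet below is an illustration (not part of the deployment tooling): it parses the fragment above and reads back the two switches that enable database-backed domain-specific configuration.

```python
# Parse the keystone.conf fragment shown above and confirm the switches
# that enable database-backed domain-specific driver configuration.
import configparser

conf = configparser.ConfigParser()
conf.read_string("""
[identity]
domain_specific_drivers_enabled = True
domain_configurations_from_database = True

[domain_config]
driver = sql
""")

print(conf.getboolean("identity", "domain_specific_drivers_enabled"))  # True
print(conf.get("domain_config", "driver"))                             # sql
```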
Once the template is modified, commit the change to the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested at Using Git for Configuration Management):
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add -A

Verify that the files have been added using git status:
ardana > git status

Then commit the changes:

ardana > git commit -m "Use Domain-Specific Driver Configuration - Database Store: more description here..."

Next, run the configuration processor and ready deployment playbooks:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfiguration playbook in a deployment area:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

Upload the domain-specific config files to the database if they have not been loaded. If they have already been loaded and you want to switch back to database store mode, then skip this upload step and move on to step 5.
Go to one of the controller nodes where keystone is deployed.
Verify that the domain-specific driver configuration files are located in the configuration directory (by default, /etc/keystone/domains) and follow the naming format keystone.<domain name>.conf. Then use the keystone-manage utility to load the domain-specific config files into the database. There are two options for uploading the files:
Option 1: Upload all configuration files to the SQL database:
ardana > keystone-manage domain_config_upload --all

Option 2: Upload individual domain-specific configuration files by specifying the domain name one by one:

ardana > keystone-manage domain_config_upload --domain-name <domain name>

Here is an example:

ardana > keystone-manage domain_config_upload --domain-name ad
Note that the keystone-manage utility does not upload the domain-specific driver configuration file a second time for the same domain. For management of the domain-specific driver configuration in the database store, refer to OpenStack Identity API - Domain Configuration.
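The keystone.<domain name>.conf naming convention means the domain name can be recovered from the filename. The small helper below is purely illustrative (it is not part of keystone-manage) and shows how the files in /etc/keystone/domains map to domain names:

```python
# Hypothetical helper: derive domain names from the contents of the domain
# configuration directory, mirroring the keystone.<domain name>.conf convention.

def domains_from_filenames(filenames):
    domains = []
    for name in filenames:
        if name.startswith("keystone.") and name.endswith(".conf"):
            domain = name[len("keystone."):-len(".conf")]
            if domain:  # skip the main keystone.conf, which has no domain part
                domains.append(domain)
    return domains

print(domains_from_filenames(["keystone.ad.conf", "keystone.msad.conf",
                              "keystone.conf"]))
# -> ['ad', 'msad']
```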
Verify that the switched domain driver configuration for LDAP (Microsoft AD in this example) in the database store works properly. Then set the environment variables for admin-level access:

ardana > source ~/keystone.osrc

Get a list of domain users:

ardana > openstack user list --domain ad

Note the three users returned:
+------------------------------------------------------------------+------------+
| ID                                                               | Name       |
+------------------------------------------------------------------+------------+
| e7dbec51ecaf07906bd743debcb49157a0e8af557b860a7c1dadd454bdab03fe | testuser1  |
| 8a09630fde3180c685e0cd663427e8638151b534a8a7ccebfcf244751d6f09bd | testuser2  |
| ea463d778dadcefdcfd5b532ee122a70dce7e790786678961420ae007560f35e | testuser3  |
+------------------------------------------------------------------+------------+
Get the user record within the ad domain:

ardana > openstack user show testuser1 --domain ad

Here testuser1 is returned:
+-----------+------------------------------------------------------------------+
| Field     | Value                                                            |
+-----------+------------------------------------------------------------------+
| domain_id | 143af847018c4dc7bd35390402395886                                 |
| id        | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
| name      | testuser1                                                        |
+-----------+------------------------------------------------------------------+
Get a list of LDAP groups:

ardana > openstack group list --domain ad

Note that testgroup1 and testgroup2 are returned:
+------------------------------------------------------------------+------------+
| ID                                                               | Name       |
+------------------------------------------------------------------+------------+
| 03976b0ea6f54a8e4c0032e8f756ad581f26915c7e77500c8d4aaf0e83afcdc6 | testgroup1 |
| 7ba52ee1c5829d9837d740c08dffa07ad118ea1db2d70e0dc7fa7853e0b79fcf | testgroup2 |
+------------------------------------------------------------------+------------+
Note: The LDAP domain is read-only. This means that you cannot create new user or group records in it.
5.9.4 Domain-specific driver configuration: switching from a database store to a file store #
Following is the procedure to switch a domain-specific driver configuration from a database store to a file store. It is assumed that:
The domain-specific driver configuration with a database store has been set up and is working properly.
Domain-specific driver configuration files, named in the format keystone.<domain name>.conf, have already been placed and verified in the configuration directory (by default, /etc/keystone/domains/) on all of the controller nodes.
Ensure that the following configuration options are set in the main configuration file template in ~/openstack/my_cloud/config/keystone/keystone.conf.j2:
[identity]
domain_specific_drivers_enabled = True
domain_configurations_from_database = False

[domain_config]
# driver = sql
Once the template is modified, commit the change to the local git repository, and rerun the configuration processor / deployment area preparation playbooks (as suggested at Using Git for Configuration Management):
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add -A

Verify that the files have been added using git status, then commit the changes:

ardana > git status
ardana > git commit -m "Domain-Specific Driver Configuration - Switch From Database Store to File Store: more description here..."

Then run the configuration processor and ready deployment playbooks:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfiguration playbook in a deployment area:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

Verify that the switched domain driver configuration for LDAP (Microsoft AD in this example) using the file store works properly. Set environment variables for admin-level access:
ardana > source ~/keystone.osrc

Get a list of domain users:

ardana > openstack user list --domain ad

Here you see the three users:
+------------------------------------------------------------------+------------+
| ID                                                               | Name       |
+------------------------------------------------------------------+------------+
| e7dbec51ecaf07906bd743debcb49157a0e8af557b860a7c1dadd454bdab03fe | testuser1  |
| 8a09630fde3180c685e0cd663427e8638151b534a8a7ccebfcf244751d6f09bd | testuser2  |
| ea463d778dadcefdcfd5b532ee122a70dce7e790786678961420ae007560f35e | testuser3  |
+------------------------------------------------------------------+------------+
Get the user record within the ad domain:

ardana > openstack user show testuser1 --domain ad

Here is the result:
+-----------+------------------------------------------------------------------+
| Field     | Value                                                            |
+-----------+------------------------------------------------------------------+
| domain_id | 143af847018c4dc7bd35390402395886                                 |
| id        | e6d8c90abdc4510621271b73cc4dda8bc6009f263e421d8735d5f850f002f607 |
| name      | testuser1                                                        |
+-----------+------------------------------------------------------------------+
Get a list of LDAP groups:

ardana > openstack group list --domain ad

Here are the groups returned:
+------------------------------------------------------------------+------------+
| ID                                                               | Name       |
+------------------------------------------------------------------+------------+
| 03976b0ea6f54a8e4c0032e8f756ad581f26915c7e77500c8d4aaf0e83afcdc6 | testgroup1 |
| 7ba52ee1c5829d9837d740c08dffa07ad118ea1db2d70e0dc7fa7853e0b79fcf | testgroup2 |
+------------------------------------------------------------------+------------+
Note: The LDAP domain is read-only. This means that you cannot create new user or group records in it.
5.9.5 Update LDAP CA certificates #
LDAP CA certificates may expire or otherwise stop working. Follow the steps below to update the LDAP CA certificates on the identity service side.
Locate the file keystone_configure_ldap_certs_sample.yml
~/openstack/my_cloud/config/keystone/keystone_configure_ldap_certs_sample.yml
Save a copy of this file with a new name, for example:
~/openstack/my_cloud/config/keystone/keystone_configure_ldap_certs_all.yml
Edit the file and specify the correct single-file path name for the LDAP certificates. This path name must be consistent with the one defined in tls_cacertfile of the domain-specific configuration. Then populate or update the file with the LDAP CA certificates for all LDAP domains.
As suggested in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”, add the new file to the local git repository:
ardana > cd ~/openstack
ardana > git checkout site
ardana > git add -A

Verify that the files have been added using git status, then commit the file:

ardana > git status
ardana > git commit -m "Update LDAP CA certificates: more description here..."

Then run the configuration processor and ready deployment playbooks:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfiguration playbook in the deployment area:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml \
  -e@~/openstack/my_cloud/config/keystone/keystone_configure_ldap_certs_all.yml
5.9.6 Limitations #
SUSE OpenStack Cloud 9 domain-specific configuration:
No Global User Listing: Once domain-specific driver configuration is enabled, listing all users and listing all groups are not supported operations. Those calls require a specific domain filter and a domain-scoped token for the target domain.
You cannot have both a file store and a database store for domain-specific driver configuration in a single identity service instance. Once a database store is enabled within the identity service instance, any file store will be ignored, and vice versa.
The identity service allows a list limit configuration to globally set the maximum number of entities that will be returned in an identity collection per request, but it does not support per-domain list limit settings at this time.
Each time a new domain is configured with LDAP integration, the single CA file is overwritten. Ensure that you place the certificates for all the LDAP back-end domains in the cacert parameter. Detailed CA file inclusion instructions are provided in the comments of the sample YAML configuration file keystone_configure_ldap_my.yml (Section 5.9.2, “Set up domain-specific driver configuration - file store”).
LDAP is only supported for identity operations (reading users and groups from LDAP). keystone assignment operations on LDAP records, such as managing or assigning roles and projects, are not currently supported.
The SUSE OpenStack Cloud 'default' domain is pre-configured to store service account users and is authenticated locally against the identity service. Domains configured for external LDAP integration are non-default domains.
When using the current OpenStackClient CLI you must use the user ID rather than the user name when working with a non-default domain.
Each LDAP connection with the identity service is for read-only operations. Configurations that require identity service write operations (to create users, groups, etc.) are not currently supported.
SUSE OpenStack Cloud 9 API-based domain-specific configuration management:
No GUI dashboard for domain-specific driver configuration management
API-based domain-specific configuration does not check the type of an option.
API-based domain-specific configuration does not check whether an option value is supported.
The API-based domain configuration method does not provide retrieval of the default values of domain-specific configuration options.
Status: Domain-specific driver configuration database store is a non-core feature for SUSE OpenStack Cloud 9.
When integrating with an external identity provider, cloud security depends on the security of that identity provider. Examine the security of the identity provider, in particular the SAML 2.0 token generation process, and decide which security properties you need to ensure adequate security of your cloud deployment. More information about SAML security can be found at https://www.owasp.org/index.php/SAML_Security_Cheat_Sheet.
5.10 keystone-to-keystone Federation #
This topic explains how you can use one instance of keystone as an identity provider and one as a service provider.
5.10.1 What Is Keystone-to-Keystone Federation? #
Identity federation lets you configure SUSE OpenStack Cloud using existing identity management systems such as an LDAP directory as the source of user access authentication. The keystone-to-keystone federation (K2K) function extends this concept for accessing resources in multiple, separate SUSE OpenStack Cloud clouds. You can configure each cloud to trust the authentication credentials of other clouds to provide the ability for users to authenticate with their home cloud and to access authorized resources in another cloud without having to reauthenticate with the remote cloud. This function is sometimes referred to as "single sign-on" or SSO.
The SUSE OpenStack Cloud cloud that provides the initial user authentication is called the identity provider (IdP). The identity provider cloud can support domain-based authentication against external authentication sources including LDAP-based directories such as Microsoft Active Directory. The identity provider creates the user attributes, known as assertions, which are used to automatically authenticate users with other SUSE OpenStack Cloud clouds.
A SUSE OpenStack Cloud cloud that provides resources is called a service provider (SP). A service provider cloud accepts user authentication assertions from the identity provider and provides access to project resources based on the mapping file settings developed for each service provider cloud. The following are characteristics of a service provider:
Each service provider cloud has a unique set of projects, groups, and group role assignments that are created and managed locally.
The mapping file consists of a set of rules that define user group membership.
The mapping file enables auto-assignment of incoming users to a specific group. Project membership and access are defined by group membership.
Project quotas are defined locally by each service provider cloud.
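The mapping rules referred to above follow keystone's federation mapping format: a JSON list of rules, each with a "local" part (the group and user to map onto) and a "remote" part (the asserted attributes to match). The example below is illustrative only; the group ID and the asserted attribute name are placeholders, not values from this guide:

```python
# Illustrative mapping rules in the shape of a k2k_sp_mapping.json file:
# incoming federated users carrying the asserted attribute are placed into a
# local group on the service provider cloud, which carries the role assignments.
import json

rules = [
    {
        "local": [
            {"user": {"name": "{0}"}},                   # user name from the assertion
            {"group": {"id": "GROUP_ID_PLACEHOLDER"}},   # local SP group (placeholder)
        ],
        "remote": [
            {"type": "openstack_user"},                  # attribute asserted by the IdP
        ],
    }
]

print(json.dumps(rules, indent=2))
```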
keystone-to-keystone federation is supported and enabled in SUSE OpenStack Cloud 9 using configuration parameters in specific Ansible files. Instructions are provided to define and enable the required configurations.
Support for keystone-to-keystone federation is provided at the API level; you must implement it in your own client code by calling the supported APIs. python-keystoneclient provides APIs to access the K2K functions.
The following k2kclient.py file is an example, and the request diagram Figure 5.1, “Keystone Authentication Flow” explains the flow of client requests.
import json

import requests

from keystoneclient.auth.identity import v3
from keystoneclient import session
class K2KClient(object):
def __init__(self):
# IdP auth URL
self.auth_url = "http://192.168.245.9:35357/v3/"
self.project_name = "admin"
self.project_domain_name = "Default"
self.username = "admin"
self.password = "vvaQIZ1S"
self.user_domain_name = "Default"
self.session = requests.Session()
self.verify = False
# identity provider Id
self.idp_id = "z420_idp"
# service provider Id
self.sp_id = "z620_sp"
#self.sp_ecp_url = "https://16.103.149.44:8443/Shibboleth.sso/SAML2/ECP"
#self.sp_auth_url = "https://16.103.149.44:8443/v3"
def v3_authenticate(self):
auth = v3.Password(auth_url=self.auth_url,
username=self.username,
password=self.password,
user_domain_name=self.user_domain_name,
project_name=self.project_name,
project_domain_name=self.project_domain_name)
self.auth_session = session.Session(session=requests.session(),
auth=auth, verify=self.verify)
auth_ref = self.auth_session.auth.get_auth_ref(self.auth_session)
self.token = self.auth_session.auth.get_token(self.auth_session)
def _generate_token_json(self):
return {
"auth": {
"identity": {
"methods": [
"token"
],
"token": {
"id": self.token
}
},
"scope": {
"service_provider": {
"id": self.sp_id
}
}
}
}
def get_saml2_ecp_assertion(self):
token = json.dumps(self._generate_token_json())
url = self.auth_url + 'auth/OS-FEDERATION/saml2/ecp'
r = self.session.post(url=url,
data=token,
verify=self.verify)
if not r.ok:
raise Exception("Something went wrong, %s" % r.__dict__)
self.ecp_assertion = r.text
def _get_sp_url(self):
url = self.auth_url + 'OS-FEDERATION/service_providers/' + self.sp_id
r = self.auth_session.get(
url=url,
verify=self.verify)
if not r.ok:
raise Exception("Something went wrong, %s" % r.__dict__)
sp = json.loads(r.text)[u'service_provider']
self.sp_ecp_url = sp[u'sp_url']
self.sp_auth_url = sp[u'auth_url']
def _handle_http_302_ecp_redirect(self, response, method, **kwargs):
location = self.sp_auth_url + '/OS-FEDERATION/identity_providers/' + self.idp_id + '/protocols/saml2/auth'
return self.auth_session.request(location, method, authenticated=False, **kwargs)
def exchange_assertion(self):
"""Send assertion to a keystone SP and get token."""
self._get_sp_url()
print("SP ECP Url:%s" % self.sp_ecp_url)
print("SP Auth Url:%s" % self.sp_auth_url)
#self.sp_ecp_url = 'https://16.103.149.44:8443/Shibboleth.sso/SAML2/ECP'
r = self.auth_session.post(
self.sp_ecp_url,
headers={'Content-Type': 'application/vnd.paos+xml'},
data=self.ecp_assertion,
authenticated=False, redirect=False)
r = self._handle_http_302_ecp_redirect(r, 'GET',
headers={'Content-Type': 'application/vnd.paos+xml'})
self.fed_token_id = r.headers['X-Subject-Token']
self.fed_token = r.text
if __name__ == "__main__":
client = K2KClient()
client.v3_authenticate()
client.get_saml2_ecp_assertion()
client.exchange_assertion()
print('Unscoped token_id: %s' % client.fed_token_id)
print('Unscoped token body:
%s' % client.fed_token)
5.10.2 Setting Up a keystone Provider #
To set up keystone as a service provider, follow these steps.
Create a config file called k2k.yml with the following parameters and place it in any directory on your Cloud Lifecycle Manager, such as /tmp.

keystone_trusted_idp: k2k
keystone_sp_conf:
  shib_sso_idp_entity_id: <protocol>://<idp_host>:<port>/v3/OS-FEDERATION/saml2/idp
  shib_sso_application_entity_id: http://service_provider_uri_entityId
  target_domain:
    name: domain1
    description: my domain
  target_project:
    name: project1
    description: my project
  target_group:
    name: group1
    description: my group
  role:
    name: service
  idp_metadata_file: /tmp/idp_metadata.xml
  identity_provider:
    id: my_idp_id
    description: This is the identity service provider.
  mapping:
    id: mapping1
    rules_file: /tmp/k2k_sp_mapping.json
  protocol:
    id: saml2
  attribute_map:
    - name: name1
      id: id1
The following are descriptions of each of the attributes.

- keystone_trusted_idp: A flag that indicates whether this configuration is used for keystone-to-keystone or WebSSO. The value can be either k2k or adfs.
- keystone_sp_conf:
  - shib_sso_idp_entity_id: The identity provider URI used as an entity Id to identify the IdP. Use the following value: <protocol>://<idp_host>:<port>/v3/OS-FEDERATION/saml2/idp.
  - shib_sso_application_entity_id: The service provider URI used as an entity Id. For keystone-to-keystone it can be any URI.
  - target_domain: The domain in which the group will be created.
    - name: Any domain name. If it does not exist, it will be created or updated.
    - description: Any description.
  - target_project: The project scope of the group.
    - name: Any project name. If it does not exist, it will be created or updated.
    - description: Any description.
  - target_group: The group to be created in target_domain.
    - name: Any group name. If it does not exist, it will be created or updated.
    - description: Any description.
  - role: The role to be assigned on target_project. This role determines the permissions of the IdP user's scoped token on the service provider side.
    - name: Must be an existing role.
  - idp_metadata_file: A reference to the IdP metadata file that validates the SAML2 assertion.
  - identity_provider: A supported IdP.
    - id: Any Id. If it does not exist, it will be created or updated. This Id needs to be shared with the client so that the right mapping is selected.
    - description: Any description.
  - mapping: A mapping in JSON format that maps a federated user to a corresponding group.
    - id: Any Id. If it does not exist, it will be created or updated.
    - rules_file: A reference to the file that contains the mapping in JSON.
  - protocol: The supported federation protocol.
    - id: Security Assertion Markup Language 2.0 (saml2) is the only supported protocol for keystone-to-keystone.
  - attribute_map: A shibboleth mapping that defines additional attributes, mapping the attributes from the SAML2 assertion to the keystone-to-keystone mapping that the service provider understands. Keystone-to-keystone does not require any additional attribute mapping.
    - name: An attribute name from the SAML2 assertion.
    - id: The Id that the preceding name will be mapped to.

Create a metadata file that is referenced from k2k.yml, such as /tmp/idp_metadata.xml. The content of the metadata file comes from the identity provider and can be found in /etc/keystone/idp_metadata.xml.

Create a mapping file that is referenced in k2k.yml, such as /tmp/k2k_sp_mapping.json (the rules_file entry in the preceding k2k.yml example). The following is an example of the mapping file.

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://idp_host:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
You can find more information on how the K2K mapping works at http://docs.openstack.org.
Go to ~/scratch/ansible/next/ardana/ansible and run the following playbook to enable the service provider:

ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@/tmp/k2k.yml
Setting Up an Identity Provider
To set up keystone as an identity provider, follow these steps:
Create a config file k2k.yml with the following parameters and place it in any directory on your Cloud Lifecycle Manager, such as /tmp. Note that the certificate and key here are excerpted for space.

keystone_k2k_idp_conf:
  service_provider:
    - id: my_sp_id
      description: This is the service provider.
      sp_url: https://sp_host:5000
      auth_url: https://sp_host:5000/v3
      signer_cert: |
        -----BEGIN CERTIFICATE-----
        MIIDmDCCAoACCQDS+ZDoUfrcIzANBgkqhkiG9w0BAQsFADCBjDELMAkGA1UEBhMC
        VVMxEzARBgNVBAgMCkNhbGlmb3JuaWExEjAQBgNVBAcMCVN1bm55dmFsZTEMMAoG
        ...
        nOpKEvhlMsl5I/tle
        -----END CERTIFICATE-----
      signer_key: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEA1gRiHiwSO6L5PrtroHi/f17DQBOpJ1KMnS9FOHS
        ...
The following are descriptions of each of the attributes under keystone_k2k_idp_conf:
- service_provider
One or more service providers can be defined. If a service provider does not exist, it will be created or updated.
- id
Any Id. If it does not exist, it will be created or updated. This Id needs to be shared with the client so that it knows where the service provider is.
- description
Any description.
- sp_url
Service provider base URL.
- auth_url
Service provider auth URL.
- signer_cert
Content of the self-signed certificate that is embedded in the metadata file. We recommend setting the validity to a longer period, such as 3650 days (10 years).
- signer_key
A private key that has a key size of 2048 bits.
Create a private key and a self-signed certificate. The openssl command-line tool is required to generate the keys and certificates; if the system does not have it, you must install it.

Create a private key of size 2048:

ardana > openssl genrsa -out myidp.key 2048

Generate a certificate request named myidp.csr. When prompted, set CommonName to the server's hostname:

ardana > openssl req -new -key myidp.key -out myidp.csr

Generate a self-signed certificate named myidp.cer:

ardana > openssl x509 -req -days 3650 -in myidp.csr -signkey myidp.key -out myidp.cer
Go to ~/scratch/ansible/next/ardana/ansible and run the following playbook to enable the service provider in keystone:

ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@/tmp/k2k.yml
5.10.3 Test It Out #
You can use the script listed earlier, k2kclient.py (Example 5.1, “k2kclient.py”), as an example of the end-to-end flow. To run k2kclient.py, follow these steps:

A few parameters must be changed at the beginning of k2kclient.py. For example, enter your specific URL, project name, and user name, as follows:

# IdP auth URL
self.auth_url = "http://idp_host:5000/v3/"
self.project_name = "my_project_name"
self.project_domain_name = "my_project_domain_name"
self.username = "test"
self.password = "mypass"
self.user_domain_name = "my_domain"
# identity provider Id that is defined in the SP config
self.idp_id = "my_idp_id"
# service provider Id that is defined in the IdP config
self.sp_id = "my_sp_id"
Install python-keystoneclient along with its dependencies.
Run the k2kclient.py script. An unscoped token will be returned from the service provider.
At this point, the domain or project scope of the unscoped token can be discovered by sending the following requests:

ardana > curl -k -X GET -H "X-Auth-Token: <unscoped token>" \
  https://<sp_public_endpoint>:5000/v3/OS-FEDERATION/domains

ardana > curl -k -X GET -H "X-Auth-Token: <unscoped token>" \
  https://<sp_public_endpoint>:5000/v3/OS-FEDERATION/projects
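The same discovery calls can be sketched in Python. This is a minimal illustration, not part of k2kclient.py; the service provider host and the token value are placeholders, and the response shape follows the keystone OS-FEDERATION listing API.

```python
# Minimal sketch of the scope-discovery calls above. The host and token
# are placeholders; the JSON layout mirrors the OS-FEDERATION listings.

def discovery_request(sp_endpoint, unscoped_token, resource):
    """Build the URL and headers for an OS-FEDERATION discovery call."""
    url = "https://%s:5000/v3/OS-FEDERATION/%s" % (sp_endpoint, resource)
    headers = {"X-Auth-Token": unscoped_token}
    return url, headers

def scopable_names(response_body, resource):
    """Extract the names the federated user can scope to from a response."""
    return [entry["name"] for entry in response_body[resource]]

if __name__ == "__main__":
    url, headers = discovery_request("sp.example.com", "TOKEN", "projects")
    print(url)  # -> https://sp.example.com:5000/v3/OS-FEDERATION/projects
    # A (shortened) projects response from the service provider:
    sample = {"projects": [{"id": "abc123", "name": "project1"}]}
    print(scopable_names(sample, "projects"))  # -> ['project1']
```

Passing resource="domains" builds the corresponding domains request.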
5.10.4 Inside keystone-to-keystone Federation #
K2K federation places a lot of responsibility with the user. The complexity is apparent from the following diagram.
Users must first authenticate to their home or local cloud, or local identity provider keystone instance to obtain a scoped token.
Users must discover which service providers (or remote clouds) are available to them by querying their local cloud.
For a given remote cloud, users must discover which resources are available to them by querying the remote cloud for the projects they can scope to.
To talk to the remote cloud, users must first exchange, with the local cloud, their locally scoped token for a SAML2 assertion to present to the remote cloud.
Users then present the SAML2 assertion to the remote cloud. The remote cloud applies its mapping for the incoming SAML2 assertion to map each user to a local ephemeral persona (such as groups) and issues an unscoped token.
Users query the remote cloud for the list of projects they have access to.
Users then rescope their token to a given project.
Users now have access to the resources owned by the project.
The following diagram illustrates the flow of authentication requests.
5.10.5 Additional Testing Scenarios #
The following tests assume one identity provider and one service provider.
Test Case 1: Any federated user in the identity provider maps to a single designated group in the service provider
On the identity provider side:
hostname=myidp.com username=user1
On the service provider side:
group=group1
group_domain_name=domain1
'group1' scopes to 'project1'
Mapping used: testcase1_1.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to project1.
Test Case 2: A federated user in a specific domain in the identity provider maps to two different groups in the service provider
On the identity provider side:
hostname=myidp.com username=user1 user_domain_name=Default
On the service provider side:
group=group1
group_domain_name=domain1
'group1' scopes to 'project1'
group=group2
group_domain_name=domain2
'group2' scopes to 'project2'
Mapping used: testcase1_2.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  },
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group2", "domain": { "name": "domain2" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "openstack_user_domain", "any_one_of": [ "Default" ] },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to both project1 and project2.
Test Case 3: A federated user with a specific project in the identity provider maps to a specific group in the service provider
On the identity provider side:
hostname=myidp.com username=user4 user_project_name=test1
On the service provider side:
group=group4 group_domain_name=domain4 'group4' scopes to 'project4'
Mapping used: testcase1_3.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group4", "domain": { "name": "domain4" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "openstack_project", "any_one_of": [ "test1" ] },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  },
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group5", "domain": { "name": "domain5" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "openstack_roles", "not_any_of": [ "member" ] },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to project4.
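The way keystone applies such rules can be illustrated with a small evaluator. This is a simplified sketch, not keystone's actual mapping engine (which also supports regular-expression matching, direct mapping, and more); it models only the bare, any_one_of, and not_any_of matchers used in these test cases.

```python
# Simplified sketch of how a mapping rule's "remote" matchers are evaluated
# against the attributes of an incoming SAML2 assertion. Only the matcher
# kinds used in the test cases above are modeled.

def rule_matches(rule, assertion):
    """Return True if every remote matcher in the rule accepts the assertion."""
    for matcher in rule["remote"]:
        values = assertion.get(matcher["type"], [])
        if "any_one_of" in matcher:
            if not any(v in matcher["any_one_of"] for v in values):
                return False
        elif "not_any_of" in matcher:
            if any(v in matcher["not_any_of"] for v in values):
                return False
        elif not values:
            # a bare {"type": ...} matcher only requires the attribute to exist
            return False
    return True

def matched_groups(rules, assertion):
    """Collect the local group names of every rule that matches."""
    groups = []
    for rule in rules:
        if rule_matches(rule, assertion):
            for local in rule["local"]:
                if "group" in local:
                    groups.append(local["group"]["name"])
    return groups

IDP = "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp"
rules = [
    {"local": [{"user": {"name": "{0}"}},
               {"group": {"name": "group4", "domain": {"name": "domain4"}}}],
     "remote": [{"type": "openstack_user"},
                {"type": "openstack_project", "any_one_of": ["test1"]},
                {"type": "Shib-Identity-Provider", "any_one_of": [IDP]}]},
    {"local": [{"user": {"name": "{0}"}},
               {"group": {"name": "group5", "domain": {"name": "domain5"}}}],
     "remote": [{"type": "openstack_user"},
                {"type": "openstack_roles", "not_any_of": ["member"]},
                {"type": "Shib-Identity-Provider", "any_one_of": [IDP]}]},
]
assertion = {"openstack_user": ["user4"],
             "openstack_project": ["test1"],
             "openstack_roles": ["member"],
             "Shib-Identity-Provider": [IDP]}
print(matched_groups(rules, assertion))  # -> ['group4']
```

Here a user presenting project test1 and the role member matches only the first rule, so only group4 (and hence project4) is granted, consistent with the expected result above.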
Test Case 4: A federated user with a specific role in the identity provider maps to a specific group in the service provider
On the identity provider side:
hostname=myidp.com, username=user5, role_name=member
On the service provider side:
group=group5, group_domain_name=domain5, 'group5' scopes to 'project5'
Mapping used: testcase1_3.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group4", "domain": { "name": "domain4" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "openstack_project", "any_one_of": [ "test1" ] },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  },
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group5", "domain": { "name": "domain5" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "openstack_roles", "not_any_of": [ "member" ] },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to project5.
Test Case 5: Retain the previous scope for a federated user
On the identity provider side:
hostname=myidp.com, username=user1, user_domain_name=Default
On the service provider side:
group=group1, group_domain_name=domain1, 'group1' scopes to 'project1'
Mapping used: testcase1_1.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to project1. Later, we would like to scope federated users who have the default domain in the identity provider to project2 in addition to project1.
On the identity provider side:
hostname=myidp.com, username=user1, user_domain_name=Default
On the service provider side:
group=group1
group_domain_name=domain1
'group1' scopes to 'project1'
group=group2
group_domain_name=domain2
'group2' scopes to 'project2'
Mapping used: testcase1_2.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  },
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group2", "domain": { "name": "domain2" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "openstack_user_domain", "any_one_of": [ "Default" ] },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to project1 and project2.
Test Case 6: Scope a federated user to a domain
On the identity provider side:
hostname=myidp.com, username=user1
On the service provider side:
group=group1, group_domain_name=domain1, 'group1' scopes to 'project1'
Mapping used: testcase1_1.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result:
The federated user will scope to project1.
User uses CLI/Curl to assign any existing role to group1 on domain1.
User uses CLI/Curl to remove project1 scope from group1.
Final result: The federated user will scope to domain1.
Test Case 7: Test five remote attributes for mapping
Test all five different remote attributes, as follows, with similar test cases as noted previously.
openstack_user
openstack_user_domain
openstack_roles
openstack_project
openstack_project_domain
The attribute openstack_user does not make much sense for testing because it matches only a specific username. The preceding test cases have already covered the attributes openstack_user_domain, openstack_roles, and openstack_project.
Note that similar tests have also been run for two identity providers with one service provider, and for one identity provider with two service providers.
5.10.6 Known Issues and Limitations #
Keep the following points in mind:
When a user is disabled in the identity provider, a federated token already issued by the service provider remains valid until it expires, based on the keystone token expiration setting.
An already issued federated token retains its scope until its expiration. Any changes in the mapping on the service provider do not affect the scope of an already issued federated token. For example, if an already issued federated token was mapped to group1, which has scope on project1, and the mapping is changed to group2, which has scope on project2, the previously issued federated token still has scope on project1.
Access to service provider resources is provided only through the python-keystoneclient CLI or the keystone API. No horizon web interface support is currently available.
Domains, projects, groups, roles, and quotas are created per the service provider cloud. Support for federated projects, groups, roles, and quotas is currently not available.
keystone-to-keystone federation and WebSSO cannot be configured by putting both sets of configuration attributes in the same config file; they will overwrite each other. Consequently, they need to be configured individually.
Scoping the federated user to a domain is not supported by default in the playbook. Please follow the steps at Section 5.10.7, “Scope Federated User to Domain”.
5.10.7 Scope Federated User to Domain #
Use the following steps to scope a federated user to a domain:
On the IdP side, set hostname=myidp.com and username=user1.

On the service provider side, set group=group1 and group_domain_name=domain1; group1 scopes to project1.

Mapping used: testcase1_1.json

[
  {
    "local": [
      { "user": { "name": "{0}" } },
      { "group": { "name": "group1", "domain": { "name": "domain1" } } }
    ],
    "remote": [
      { "type": "openstack_user" },
      { "type": "Shib-Identity-Provider",
        "any_one_of": [ "https://myidp.com:5000/v3/OS-FEDERATION/saml2/idp" ] }
    ]
  }
]
Expected result: The federated user will scope to project1. Use CLI/Curl to assign any existing role to group1 on domain1. Use CLI/Curl to remove project1 scope from group1.
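The CLI/Curl steps above correspond to two keystone v3 role-assignment calls. The sketch below only builds the request method and URL; the host and all IDs are placeholders you would look up first (for example via GET /v3/roles and GET /v3/groups), and each request would carry an admin X-Auth-Token header.

```python
# Sketch of the two role-assignment operations described above, expressed
# as keystone v3 REST requests. The host and all IDs are placeholders.

KEYSTONE = "https://sp_host:5000/v3"

def grant_domain_role(domain_id, group_id, role_id):
    """Grant a role to a group on a domain, so the group can scope to it."""
    return ("PUT", "%s/domains/%s/groups/%s/roles/%s"
            % (KEYSTONE, domain_id, group_id, role_id))

def revoke_project_role(project_id, group_id, role_id):
    """Remove the group's role on a project, dropping that project scope."""
    return ("DELETE", "%s/projects/%s/groups/%s/roles/%s"
            % (KEYSTONE, project_id, group_id, role_id))

if __name__ == "__main__":
    # Both requests would be sent with an admin "X-Auth-Token" header.
    print(grant_domain_role("domain1_id", "group1_id", "role_id"))
    print(revoke_project_role("project1_id", "group1_id", "role_id"))
```

After both calls, a federated user mapped into group1 receives domain1 scope instead of project1 scope.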
Result: The federated user will scope to domain1.
5.11 Configuring Web Single Sign-On #
The external-name in ~/openstack/my_cloud/definition/data/network_groups.yml must be set to a valid DNS-resolvable FQDN.
This topic explains how to implement web single sign-on.
5.11.1 What is WebSSO? #
WebSSO, or web single sign-on, is a method for web browsers to receive current authentication information from an identity provider system without requiring a user to log in again to the application displayed by the browser. Users initially access the identity provider web page and supply their credentials. If the user successfully authenticates with the identity provider, the authentication credentials are then stored in the user’s web browser and automatically provided to all web-based applications, such as the horizon dashboard in SUSE OpenStack Cloud 9. If users have not yet authenticated with an identity provider or their credentials have timed out, they are automatically redirected to the identity provider to renew their credentials.
5.11.2 Limitations #
The WebSSO function supports only horizon web authentication. It is not supported for direct API or CLI access.
WebSSO works only with the Fernet token provider. See Section 5.8.4, “Fernet Tokens”.
The SUSE OpenStack Cloud WebSSO function was tested with Microsoft Active Directory Federation Services (AD FS). The instructions provided are pertinent to ADFS and are intended to provide a sample configuration for deploying WebSSO with an external identity provider. If you have a different identity provider such as Ping Identity or IBM Tivoli, consult with those vendors for specific instructions for those products.
The SUSE OpenStack Cloud WebSSO function with the OpenID method was tested with Google OAuth 2.0 APIs, which conform to the OpenID Connect specification. The interaction between keystone and the external identity provider (IdP) is handled by the Apache2 auth_openidc module. Consult with the specific OpenID Connect vendor on whether they support auth_openidc.

Both SAML and OpenID methods are supported for WebSSO federation in SUSE OpenStack Cloud 9.
WebSSO has a change password option in User Settings, but note that this function is not accessible for users authenticating with external systems such as LDAP or SAML Identity Providers.
5.11.3 Enabling WebSSO #
SUSE OpenStack Cloud 9 provides WebSSO support for the horizon web interface. This support requires several configuration steps including editing the horizon configuration file as well as ensuring that the correct keystone authentication configuration is enabled to receive the authentication assertions provided by the identity provider.
WebSSO supports both SAML and OpenID methods. The following workflow depicts how horizon and keystone support WebSSO via the SAML method if no current authentication assertion is available.
horizon redirects the web browser to the keystone endpoint.
keystone automatically redirects the web browser to the correct identity provider authentication web page based on the keystone configuration file.
The user authenticates with the identity provider.
The identity provider automatically redirects the web browser back to the keystone endpoint.
keystone generates the required Javascript code to POST a token back to horizon.
keystone automatically redirects the web browser back to horizon and the user can then access projects and resources assigned to the user.
The following diagram provides more details on the WebSSO authentication workflow.
Note that the horizon dashboard service never talks directly to the keystone identity service until the end of the sequence, after the federated unscoped token negotiation has completed. The browser interacts with the horizon dashboard service, the keystone identity service, and ADFS on their respective public endpoints.
The following sequence of events is depicted in the diagram.
The user's browser reaches the horizon dashboard service's login page. The user selects ADFS login from the drop-down menu.
The horizon dashboard service issues an HTTP Redirect (301) to redirect the browser to the keystone identity service's (public) SAML2 Web SSO endpoint (/auth/OS-FEDERATION/websso/saml2). The endpoint is protected by Apache mod_shib (shibboleth).
The browser talks to the keystone identity service. Because the user's browser does not have an active session with AD FS, the keystone identity service issues an HTTP Redirect (301) to the browser, along with the required SAML2 request, to the ADFS endpoint.
The browser talks to AD FS. ADFS returns a login form. The browser presents it to the user.
The user enters credentials (such as username and password) and submits the form to AD FS.
Upon successful validation of the user's credentials, ADFS issues an HTTP Redirect (301) to the browser, along with the SAML2 assertion, to the keystone identity service's (public) SAML2 endpoint (/auth/OS-FEDERATION/websso/saml2).
The browser talks to the keystone identity service. The keystone identity service validates the SAML2 assertion and issues a federated unscoped token, then returns JavaScript code to be executed by the browser, along with the federated unscoped token in the headers.
Upon execution of the JavaScript code, the browser is redirected to the horizon dashboard service with the federated unscoped token in the header.
The browser talks to the horizon dashboard service with the federated unscoped token.
With the unscoped token, the horizon dashboard service talks to the keystone identity service's (internal) endpoint to get a list of projects the user has access to.
The horizon dashboard service rescopes the token to the first project in the list. At this point, the user is successfully logged in.
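The rescoping in that last step uses the standard keystone v3 token exchange: the federated unscoped token is POSTed to /v3/auth/tokens with a project scope. A minimal sketch of the request body follows; the token id and project id are placeholders.

```python
# Sketch of the token-rescope request horizon issues at the end of the
# WebSSO flow. The token id and project id are placeholders.
import json

def rescope_request(unscoped_token_id, project_id):
    """Build the keystone v3 body that rescopes a token to a project."""
    return {
        "auth": {
            "identity": {
                "methods": ["token"],
                "token": {"id": unscoped_token_id},
            },
            "scope": {"project": {"id": project_id}},
        }
    }

if __name__ == "__main__":
    # This body is POSTed to <keystone>/v3/auth/tokens; the scoped token
    # comes back in the X-Subject-Token response header.
    print(json.dumps(rescope_request("UNSCOPED_TOKEN_ID", "PROJECT_ID"),
                     indent=2))
```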
The sequence of events for WebSSO using OpenID method is similar to SAML method.
5.11.4 Prerequisites #
5.11.4.1 WebSSO Using SAML Method #
5.11.4.1.1 Creating ADFS metadata #
For information about creating Active Directory Federation Services metadata, see the section To create edited ADFS 2.0 metadata with an added scope element of https://technet.microsoft.com/en-us/library/gg317734.
On the ADFS computer, use a browser such as Internet Explorer to view https://<adfs_server_hostname>/FederationMetadata/2007-06/FederationMetadata.xml.

On the File menu, click Save as, then navigate to the Windows desktop and save the file with the name adfs_metadata.xml. Make sure to change the Save as type drop-down box to All Files (*.*).
Use Windows Explorer to navigate to the Windows desktop, right-click adfs_metadata.xml, and then click Edit.
In Notepad, insert the following XML in the first element. Before editing, the EntityDescriptor appears as follows:
<EntityDescriptor ID="abc123" entityID="http://WIN-CAICP35LF2I.vlan44.domain/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata">
After editing, it should look like this:
<EntityDescriptor ID="abc123" entityID="http://WIN-CAICP35LF2I.vlan44.domain/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:shibmd="urn:mace:shibboleth:metadata:1.0">
In Notepad, on the Edit menu, click Find. In Find what, type IDPSSO, and then click Find Next.
Insert the following XML in this section: Before editing, the IDPSSODescriptor appears as follows:
<IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol"><KeyDescriptor use="encryption">
After editing, it should look like this:
<IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol"><Extensions><shibmd:Scope regexp="false">vlan44.domain</shibmd:Scope></Extensions><KeyDescriptor use="encryption">
Delete the metadata document signature section of the file (the <ds:Signature> element shown in the following code). Because you have edited the document, the signature will now be invalid. Before editing, the signature appears as follows:
<EntityDescriptor ID="abc123" entityID="http://FSWEB.contoso.com/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:shibmd="urn:mace:shibboleth:metadata:1.0"> <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"> SIGNATURE DATA </ds:Signature> <RoleDescriptor xsi:type=…>
After editing it should look like this:
<EntityDescriptor ID="abc123" entityID="http://FSWEB.contoso.com/adfs/services/trust" xmlns="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:shibmd="urn:mace:shibboleth:metadata:1.0"> <RoleDescriptor xsi:type=…>
Save and close adfs_metadata.xml.
Copy adfs_metadata.xml to the Cloud Lifecycle Manager node, place it in the /var/lib/ardana/openstack/my_cloud/config/keystone/ directory, and put it under revision control:

ardana > cd ~/openstack
ardana > git checkout site
ardana > git add my_cloud/config/keystone/adfs_metadata.xml
ardana > git commit -m "Add ADFS metadata file for WebSSO authentication"
5.11.4.1.2 Setting Up WebSSO #
Start by creating a config file adfs_config.yml with the following parameters and place it in the /var/lib/ardana/openstack/my_cloud/config/keystone/ directory on your Cloud Lifecycle Manager node.

keystone_trusted_idp: adfs
keystone_sp_conf:
  idp_metadata_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_metadata.xml
  shib_sso_application_entity_id: http://sp_uri_entityId
  shib_sso_idp_entity_id: http://default_idp_uri_entityId
  target_domain:
    name: domain1
    description: my domain
  target_project:
    name: project1
    description: my project
  target_group:
    name: group1
    description: my group
  role:
    name: service
  identity_provider:
    id: adfs_idp1
    description: This is the ADFS identity provider.
  mapping:
    id: mapping1
    rules_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_mapping.json
  protocol:
    id: saml2
  attribute_map:
    - name: http://schemas.xmlsoap.org/claims/Group
      id: ADFS_GROUP
    - name: urn:oid:1.3.6.1.4.1.5923.1.1.1.6
      id: ADFS_LOGIN
A sample config file like this exists in roles/KEY-API/files/samples/websso/keystone_configure_adfs_sample.yml. Here are detailed descriptions of each of the config options:

- keystone_trusted_idp: A flag to indicate whether this configuration is used for WebSSO or keystone-to-keystone. The value can be either 'adfs' or 'k2k'.
- keystone_sp_conf:
  - shib_sso_idp_entity_id: The ADFS URI used as an entity Id to identify the IdP.
  - shib_sso_application_entity_id: The service provider URI used as an entity Id. For WebSSO it can be any URI, as long as it is unique to the SP.
  - target_domain: The domain in which the group will be created.
    - name: Any domain name. If it does not exist, it will be created or updated.
    - description: Any description.
  - target_project: The project scope of the group.
    - name: Any project name. If it does not exist, it will be created or updated.
    - description: Any description.
  - target_group: The group to be created in target_domain.
    - name: Any group name. If it does not exist, it will be created or updated.
    - description: Any description.
  - role: The role to be assigned on target_project. This role determines the permissions of the IdP user's scoped token on the service provider side.
    - name: Must be an existing role.
  - idp_metadata_file: A reference to the ADFS metadata file that validates the SAML2 assertion.
  - identity_provider: An ADFS IdP.
    - id: Any Id. If it does not exist, it will be created or updated. This Id needs to be shared with the client so that the right mapping will be selected.
    - description: Any description.
  - mapping: A mapping in JSON format that maps a federated user to a corresponding group.
    - id: Any Id. If it does not exist, it will be created or updated.
    - rules_file: A reference to the file that contains the mapping in JSON.
  - protocol: The supported federation protocol.
    - id: 'saml2' is the only supported protocol for WebSSO via the SAML method.
  - attribute_map: A shibboleth mapping that defines additional attributes, mapping the attributes from the SAML2 assertion to the WebSSO mapping that the SP understands.
    - name: An attribute name from the SAML2 assertion.
    - id: The Id that the preceding name will be mapped to.
Create a mapping file, adfs_mapping.json, that is referenced from the preceding config file in /var/lib/ardana/openstack/my_cloud/config/keystone/:

rules_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_mapping.json

The following is an example of the mapping file, existing in roles/KEY-API/files/samples/websso/adfs_sp_mapping.json:

[
  {
    "local": [ { "user": { "name": "{0}" } } ],
    "remote": [ { "type": "ADFS_LOGIN" } ]
  },
  {
    "local": [ { "group": { "id": "GROUP_ID" } } ],
    "remote": [
      { "type": "ADFS_GROUP", "any_one_of": [ "Domain Users" ] }
    ]
  }
]
You can find more details about how the WebSSO mapping works at http://docs.openstack.org. Also see Section 5.11.4.1.3, “Mapping rules” for more information.
Add
adfs_config.yml
andadfs_mapping.json
to revision control.ardana >
cd ~/openstackardana >
git checkout siteardana >
git add my_cloud/config/keystone/adfs_config.ymlardana >
git add my_cloud/config/keystone/adfs_mapping.jsonardana >
git commit -m "Add ADFS config and mapping."Go to ~/scratch/ansible/next/ardana/ansible and run the following playbook to enable WebSSO in the keystone identity service:
ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml -e@/var/lib/ardana/openstack/my_cloud/config/keystone/adfs_config.yml
Enable WebSSO in the horizon dashboard service by setting the horizon_websso_enabled flag to True in roles/HZN-WEB/defaults/main.yml, and then run the horizon-reconfigure playbook:
ardana >
ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
5.11.4.1.3 Mapping rules #
An IdP-SP pair has only one mapping. The last mapping that the customer
configures is the one used, overwriting any previous mapping setting.
Therefore, if the example mapping adfs_sp_mapping.json is used, the
following behavior is expected, because it maps the federated user only to
the one group configured in keystone_configure_adfs_sample.yml:

Configure domain1/project1/group1 with mapping1; after WebSSO login to horizon, the user sees project1.

Then reconfigure with domain1/project2/group1 and mapping1; after WebSSO login to horizon, the user sees project1 and project2.

Reconfigure with domain3/project3/group3 and mapping1; after WebSSO login to horizon, the user only sees project3, because the IdP mapping now maps the federated user to group3, which only has privileges on project3.
If you need a more complex mapping, you can use a custom mapping file, which needs to be specified in the rules_file setting of keystone_configure_adfs_sample.yml.
You can use different attributes of the ADFS user in order to map to different or multiple groups.
An example of a more complex mapping file is adfs_sp_mapping_multiple_groups.json, as follows.
adfs_sp_mapping_multiple_groups.json
[
    {
        "local": [
            { "user": { "name": "{0}" } },
            { "group": { "name": "group1", "domain": { "name": "domain1" } } }
        ],
        "remote": [
            { "type": "ADFS_LOGIN" },
            { "type": "ADFS_GROUP", "any_one_of": [ "Domain Users" ] }
        ]
    },
    {
        "local": [
            { "user": { "name": "{0}" } },
            { "group": { "name": "group2", "domain": { "name": "domain2" } } }
        ],
        "remote": [
            { "type": "ADFS_LOGIN" },
            { "type": "ADFS_SCOPED_AFFILIATION", "any_one_of": [ "member@contoso.com" ] }
        ]
    }
]
The adfs_sp_mapping_multiple_groups.json file must be used together with keystone_configure_mutiple_groups_sample.yml, which adds new attributes for the shibboleth mapping. That file is as follows:
keystone_configure_mutiple_groups_sample.yml
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
---
keystone_trusted_idp: adfs
keystone_sp_conf:
  identity_provider:
    id: adfs_idp1
    description: This is the ADFS identity provider.
  idp_metadata_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_metadata.xml
  shib_sso_application_entity_id: http://blabla
  shib_sso_idp_entity_id: http://WIN-CAICP35LF2I.vlan44.domain/adfs/services/trust
  target_domain:
    name: domain2
    description: my domain
  target_project:
    name: project6
    description: my project
  target_group:
    name: group2
    description: my group
  role:
    name: admin
  mapping:
    id: mapping1
    rules_file: /var/lib/ardana/openstack/my_cloud/config/keystone/adfs_sp_mapping_multiple_groups.json
  protocol:
    id: saml2
  attribute_map:
    - name: http://schemas.xmlsoap.org/claims/Group
      id: ADFS_GROUP
    - name: urn:oid:1.3.6.1.4.1.5923.1.1.1.6
      id: ADFS_LOGIN
    - name: urn:oid:1.3.6.1.4.1.5923.1.1.1.9
      id: ADFS_SCOPED_AFFILIATION
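To see how rules like these resolve to groups, here is a simplified, hypothetical evaluator of the mapping semantics. The real keystone mapping engine supports many more constructs; the helper and data names below are illustrative only:

```python
# Simplified sketch of how keystone-style mapping rules turn SAML
# assertion attributes into local groups. Hypothetical helper; the
# real keystone mapping engine performs much more validation.
def evaluate_mapping(rules, assertion):
    groups = []
    for rule in rules:
        # A rule matches only if every "remote" requirement is satisfied.
        matched = True
        for req in rule["remote"]:
            values = assertion.get(req["type"], [])
            if "any_one_of" in req:
                if not set(req["any_one_of"]) & set(values):
                    matched = False
                    break
            elif not values:  # bare type: the attribute must be present
                matched = False
                break
        if matched:
            for local in rule["local"]:
                if "group" in local:
                    groups.append(local["group"]["name"])
    return groups

# Assertion attributes as produced by the shibboleth attribute_map:
assertion = {
    "ADFS_LOGIN": ["alice"],
    "ADFS_GROUP": ["Domain Users"],
    "ADFS_SCOPED_AFFILIATION": ["member@contoso.com"],
}
rules = [
    {"local": [{"user": {"name": "{0}"}},
               {"group": {"name": "group1", "domain": {"name": "domain1"}}}],
     "remote": [{"type": "ADFS_LOGIN"},
                {"type": "ADFS_GROUP", "any_one_of": ["Domain Users"]}]},
    {"local": [{"user": {"name": "{0}"}},
               {"group": {"name": "group2", "domain": {"name": "domain2"}}}],
     "remote": [{"type": "ADFS_LOGIN"},
                {"type": "ADFS_SCOPED_AFFILIATION",
                 "any_one_of": ["member@contoso.com"]}]},
]
print(evaluate_mapping(rules, assertion))  # ['group1', 'group2']
```

Under these multi-group rules, a user presenting both the "Domain Users" group claim and the "member@contoso.com" affiliation lands in both group1 and group2.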
5.11.4.2 Setting up the ADFS server as the identity provider #
For ADFS to be able to communicate with the keystone identity service, you need to add the keystone identity service as a trusted relying party for ADFS and also specify the user attributes that you want to send to the keystone identity service when users authenticate via WebSSO.
For more information, see the Microsoft ADFS wiki, section "Step 2: Configure ADFS 2.0 as the identity provider and shibboleth as the Relying Party".
Log in to the ADFS server.
Add a relying party using metadata
From Server Manager Dashboard, click Tools on the upper right, then ADFS Management.
Right-click ADFS, and then select Add Relying Party Trust.
Click Start, leaving the already selected option Import data about the relying party published online or on a local network.

In the Federation metadata address field, type <keystone_publicEndpoint>/Shibboleth.sso/Metadata (your keystone identity service metadata endpoint), and then click Next. You can also import metadata from a file: create a file with the output of the following curl command

curl <keystone_publicEndpoint>/Shibboleth.sso/Metadata

and then choose this file for importing the metadata for the relying party.
In the Specify Display Name page, choose a proper name to identify this trust relationship, and then click Next.
On the Choose Issuance Authorization Rules page, leave the default Permit all users to access the relying party selected, and then click Next.
Click Next, and then click Close.
Edit claim rules for relying party trust
The Edit Claim Rules dialog box should already be open. If not, in the ADFS center pane, under Relying Party Trusts, right-click your newly created trust, and then click Edit Claim Rules.
On the Issuance Transform Rules tab, click Add Rule.
On the Select Rule Template page, select Send LDAP Attributes as Claims, and then click Next.
On the Configure Rule page, in the Claim rule name box, type Get Data.
In the Attribute Store list, select Active Directory.
In the Mapping of LDAP attributes section, create the following mappings.
LDAP Attribute | Outgoing Claim Type |
---|---|
Token-Groups – Unqualified Names | Group |
User-Principal-Name | UPN |

Click Finish.
On the Issuance Transform Rules tab, click Add Rule.
On the Select Rule Template page, select Send Claims Using a Custom Rule, and then click Next.
In the Configure Rule page, in the Claim rule name box, type Transform UPN to epPN.
In the Custom Rule window, type or copy and paste the following:
c:[Type == "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"] => issue(Type = "urn:oid:1.3.6.1.4.1.5923.1.1.1.6", Value = c.Value, Properties["http://schemas.xmlsoap.org/ws/2005/05/identity/claimproperties/attributename"] = "urn:oasis:names:tc:SAML:2.0:attrname-format:uri");
Click Finish.
On the Issuance Transform Rules tab, click Add Rule.
On the Select Rule Template page, select Send Claims Using a Custom Rule, and then click Next.
On the Configure Rule page, in the Claim rule name box, type Transform Group to epSA.
In the Custom Rule window, type or copy and paste the following:
c:[Type == "http://schemas.xmlsoap.org/claims/Group", Value == "Domain Users"] => issue(Type = "urn:oid:1.3.6.1.4.1.5923.1.1.1.9", Value = "member@contoso.com", Properties["http://schemas.xmlsoap.org/ws/2005/05/identity/claimproperties/attributename"] = "urn:oasis:names:tc:SAML:2.0:attrname-format:uri");
Click Finish, and then click OK.
This list of Claim Rules is just an example and can be modified or enhanced based on the customer's needs and the specifics of the ADFS setup.
Create a sample user on the ADFS server
From the Server Manager Dashboard, click Tools on the upper right, then Active Directory Users and Computers.

Right-click Users, then select New, and then User.
Follow the on-screen instructions.
You can test the horizon dashboard service "Login with ADFS" by opening a
browser at the horizon dashboard service URL and choosing
Authenticate using: ADFS Credentials. You should be
redirected to the ADFS login page and be able to log in to the horizon
dashboard service with your ADFS credentials.
5.11.5 WebSSO Using OpenID Method #
The interaction between Keystone and the external Identity Provider (IdP) is handled by the Apache2 auth_openidc module.
There are two steps to enable the feature.
Configure Keystone with the required OpenID Connect provider information.
Create the Identity Provider, protocol, and mapping in Keystone, using OpenStack Command Line Tool.
5.11.5.1 Configuring Keystone #
Log in to the Cloud Lifecycle Manager node and edit the ~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml file, adding the keystone_openid_connect_conf variable. For example:

keystone_openid_connect_conf:
  identity_provider: google
  response_type: id_token
  scope: "openid email profile"
  metadata_url: https://accounts.google.com/.well-known/openid-configuration
  client_id: [Replace with your client ID]
  client_secret: [Replace with your client secret]
  redirect_uri: https://www.myenterprise.com:5000/v3/OS-FEDERATION/identity_providers/google/protocols/openid/auth
  crypto_passphrase: ""
Where:

identity_provider: The name of the OpenID Connect identity provider. This must be the same as the identity provider to be created in Keystone using the OpenStack Command Line Tool. For example, if the identity provider is foo, we must create the identity provider with that name:

openstack identity provider create foo

response_type: Corresponds to the auth_openidc OIDCResponseType. In most cases, it should be "id_token".

scope: Corresponds to the auth_openidc OIDCScope.

metadata_url: Corresponds to the auth_openidc OIDCProviderMetadataURL.

client_id: Corresponds to the auth_openidc OIDCClientID.

client_secret: Corresponds to the auth_openidc OIDCClientSecret.

redirect_uri: Corresponds to the auth_openidc OIDCRedirectURI. This must be the Keystone public endpoint for the given OpenID Connect identity provider, for example "https://keystone-public-endpoint.foo.com/v3/OS-FEDERATION/identity_providers/foo/protocols/openid/auth".

Warning: Some OpenID Connect IdPs such as Google require the hostname in the "redirect_uri" to be a public FQDN. In that case, the hostname in the Keystone public endpoint must also be a public FQDN and must match the one specified in the "redirect_uri".

crypto_passphrase: Corresponds to the auth_openidc OIDCCryptoPassphrase. If left blank, a random crypto passphrase will be generated.
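As a sanity check for the redirect_uri value, the federation auth URL can be derived from the Keystone public endpoint, the identity provider name, and the protocol. The helper below is a hypothetical sketch (not part of the product) that reproduces the URL pattern shown in the example above:

```python
# Sketch: build the OIDCRedirectURI that keystone expects for a given
# OpenID Connect identity provider. Hypothetical helper; the endpoint
# and provider name are examples only.
def federation_redirect_uri(keystone_public_endpoint, identity_provider,
                            protocol="openid"):
    return (f"{keystone_public_endpoint.rstrip('/')}"
            f"/OS-FEDERATION/identity_providers/{identity_provider}"
            f"/protocols/{protocol}/auth")

uri = federation_redirect_uri("https://www.myenterprise.com:5000/v3", "google")
print(uri)
# https://www.myenterprise.com:5000/v3/OS-FEDERATION/identity_providers/google/protocols/openid/auth
```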
Commit the changes to your local git repository.
cd ~/openstack/ardana/ansible
git add -A
git commit -m "add OpenID Connect configuration"
Run the keystone-reconfigure Ansible playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
5.11.5.2 Configure Horizon #
Complete the following steps to configure horizon to support WebSSO with OpenID method.
Edit the ~/openstack/ardana/ansible/roles/HZN-WEB/defaults/main.yml file and set the following parameter to True:

horizon_websso_enabled: True

Locate the last line in the ~/openstack/ardana/ansible/roles/HZN-WEB/defaults/main.yml file. The default configuration for this line should look like the following:

horizon_websso_choices:
  - {protocol: saml2, description: "ADFS Credentials"}
If your cloud does not have ADFS enabled, then replace the preceding entry under horizon_websso_choices: with the following:

- {protocol: openid, description: "OpenID Connect"}

The resulting block should look like the following:

horizon_websso_choices:
  - {protocol: openid, description: "OpenID Connect"}

If your cloud does have ADFS enabled, then do not replace the default entry; simply add the following line to the existing horizon_websso_choices: block:

- {protocol: openid, description: "OpenID Connect"}

If your cloud has ADFS enabled, the final block of your ~/openstack/ardana/ansible/roles/HZN-WEB/defaults/main.yml should have the following entries:

horizon_websso_choices:
  - {protocol: openid, description: "OpenID Connect"}
  - {protocol: saml2, description: "ADFS Credentials"}
Run the following commands to add your changes to the local git repository, and reconfigure the horizon service, enabling the changes made in Step 1:
cd ~/openstack
git add -A
git commit -m "Configured WebSSO using OpenID Connect"
cd ~/openstack/ardana/ansible/
ansible-playbook -i hosts/localhost config-processor-run.yml
ansible-playbook -i hosts/localhost ready-deployment.yml
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
5.11.5.3 Create Identity Provider, Protocol, and Mapping #
To fully enable OpenID Connect, the Identity Provider, Protocol, and Mapping for the given IdP must be created in Keystone. This is done with the OpenStack Command Line Tool, using the Keystone admin credential.
Log in to the Cloud Lifecycle Manager node and source the keystone.osrc file:

source ~/keystone.osrc
Create the Identity Provider. For example:
openstack identity provider create foo
WarningThe name of the Identity Provider must be exactly the same as the "identity_provider" attribute given when configuring Keystone in the previous section.
Next, create the Mapping for the Identity Provider. Before creating the Mapping, make sure you understand the possible mapping combinations, as an incorrect mapping can have serious security implications. The following is an example of a mapping file:
[
    {
        "local": [
            {
                "user": {
                    "name": "{0}",
                    "email": "{1}",
                    "type": "ephemeral"
                },
                "group": {
                    "domain": { "name": "Default" },
                    "name": "openidc_demo"
                }
            }
        ],
        "remote": [
            { "type": "REMOTE_USER" },
            { "type": "HTTP_OIDC_EMAIL" }
        ]
    }
]
Once the mapping file is created, create the Mapping resource in Keystone. For example:
openstack mapping create --rules oidc_mapping.json oidc_mapping
Lastly, create the Protocol for the Identity Provider and its mapping. For OpenID Connect, the protocol name must be openid. For example:
openstack federation protocol create --identity-provider google --mapping oidc_mapping openid
5.12 Identity Service Notes and Limitations #
5.12.1 Notes #
This topic describes limitations of and important notes pertaining to the identity service.

Domains
Domains can be created and managed by the horizon web interface, keystone API and OpenStackClient CLI.
The configuration of external authentication systems requires the creation and usage of Domains.
All configurations are managed by creating and editing specific configuration files.
End users can authenticate to a particular project and domain via the horizon web interface, keystone API and OpenStackClient CLI.
A new horizon login page that requires a Domain entry is now installed by default.
keystone-to-keystone Federation
keystone-to-keystone (K2K) Federation provides the ability to authenticate once with one cloud and then use these credentials to access resources on other federated clouds.
All configurations are managed by creating and editing specific configuration files.
Multi-Factor Authentication (MFA)
The keystone architecture provides support for MFA deployments.
MFA provides the ability to deploy non-password-based authentication; for example, token-providing hardware and text messages.
Hierarchical Multitenancy
Provides the ability to create sub-projects within a Domain-Project hierarchy.
Hash Algorithm Configuration
The default hash algorithm is bcrypt, which has a built-in limitation of 72 characters. As keystone defaults to a secret length of 86 characters, customers may choose to change the keystone hash algorithm to one that supports the full length of their secret.

Process for changing the hash algorithm configuration:
Update the identity section of keystone.conf.j2 to reference the desired algorithm:

[identity]
password_hash_algorithm=pbkdf2_sha512

Commit the changes.

Run the keystone-redeploy.yml playbook:

ansible-playbook -i hosts/verb_hosts keystone-redeploy.yml

Verify that existing users retain access by logging in to horizon.
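The motivation for switching away from bcrypt can be illustrated with the Python standard library alone. The snippet below simulates bcrypt's 72-byte truncation (it does not call a real bcrypt implementation) and contrasts it with pbkdf2_sha512, which uses the full secret; all values are illustrative:

```python
import hashlib

# Two 86-character secrets that differ only after byte 72.
secret_a = "x" * 72 + "A" * 14
secret_b = "x" * 72 + "B" * 14
salt = b"demo-salt"

def bcrypt_like(secret):
    # Simulates bcrypt's behavior of ignoring input beyond 72 bytes.
    # (sha512 over the truncated input is used purely for illustration.)
    return hashlib.sha512(salt + secret.encode()[:72]).hexdigest()

def pbkdf2_sha512(secret):
    # pbkdf2_sha512 consumes the full secret, so the two hashes differ.
    return hashlib.pbkdf2_hmac("sha512", secret.encode(), salt, 1000).hex()

print(bcrypt_like(secret_a) == bcrypt_like(secret_b))      # True: collision
print(pbkdf2_sha512(secret_a) == pbkdf2_sha512(secret_b))  # False
```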
5.12.2 Limitations #
Authentication with external authentication systems (LDAP, Active Directory (AD) or Identity Providers)
No horizon web portal support currently exists for the creation and management of external authentication system configurations.
Integration with LDAP services

SUSE OpenStack Cloud 9 domain-specific configuration:
No Global User Listing: Once domain-specific driver configuration is enabled, listing all users and listing all groups are not supported operations. Those calls require a specific domain filter and a domain-scoped token for the target domain.
You cannot have both a file store and a database store for domain-specific driver configuration in a single identity service instance. Once a database store is enabled within the identity service instance, any file store will be ignored, and vice versa.
The identity service allows a list limit configuration to globally set the maximum number of entities that will be returned in an identity collection per request but it does not support per-domain list limit setting at this time.
Each time a new domain is configured with LDAP integration, the single CA file gets overwritten. Ensure that you place certs for all the LDAP back-end domains in the cacert parameter. Detailed CA file inclusion instructions are provided in the comments of the sample YAML configuration file keystone_configure_ldap_my.yml (see Section 5.9.2, “Set up domain-specific driver configuration - file store”).

LDAP is only supported for identity operations (reading users and groups from LDAP). keystone assignment operations from LDAP records, such as managing or assigning roles and projects, are not currently supported.
The SUSE OpenStack Cloud 'default' domain is pre-configured to store service account users and is authenticated locally against the identity service. Domains configured for external LDAP integration are non-default domains.
When using the current OpenStackClient CLI you must use the user ID rather than the user name when working with a non-default domain.
Each LDAP connection with the identity service is for read-only operations. Configurations that require identity service write operations (to create users, groups, etc.) are not currently supported.
SUSE OpenStack Cloud 9 API-based domain-specific configuration management
No GUI dashboard for domain-specific driver configuration management
API-based domain-specific configuration does not check the type of an option.

API-based domain-specific configuration does not check whether the option values are supported.

The API-based domain configuration method does not provide retrieval of the default values of domain-specific configuration options.
Status: Domain-specific driver configuration database store is a non-core feature for SUSE OpenStack Cloud 9.
5.12.3 keystone-to-keystone federation #
When a user is disabled in the identity provider, the issued federated token from the service provider still remains valid until the token is expired based on the keystone expiration setting.
An already issued federated token will retain its scope until its expiration. Any changes in the mapping on the service provider will not impact the scope of an already issued federated token. For example, if an already issued federated token was mapped to group1, which has scope on project1, and the mapping is changed to group2, which has scope on project2, the previously issued federated token still has scope on project1.
Access to service provider resources is provided only through the python-keystone CLI client or the keystone API. No horizon web interface support is currently available.
Domains, projects, groups, roles, and quotas are created per the service provider cloud. Support for federated projects, groups, roles, and quotas is currently not available.
keystone-to-keystone federation and WebSSO cannot be configured by putting both sets of configuration attributes in the same config file; they will overwrite each other. Consequently, they need to be configured individually.
Scoping the federated user to a domain is not supported by default in the playbook. To enable it, see the steps in Section 5.10.7, “Scope Federated User to Domain”.
No horizon web portal support currently exists for the creation and management of federation configurations.
All end user authentication is available only via the keystone API and OpenStackClient CLI.
Additional information can be found at http://docs.openstack.org.
WebSSO
The WebSSO function supports only horizon web authentication. It is not supported for direct API or CLI access.
WebSSO works only with Fernet token provider. See Section 5.8.4, “Fernet Tokens”.
The SUSE OpenStack Cloud WebSSO function with SAML method was tested with Microsoft Active Directory Federation Services (ADFS). The instructions provided are pertinent to ADFS and are intended to provide a sample configuration for deploying WebSSO with an external identity provider. If you have a different identity provider such as Ping Identity or IBM Tivoli, consult with those vendors for specific instructions for those products.
The SUSE OpenStack Cloud WebSSO function with the OpenID method was tested with Google OAuth 2.0 APIs, which conform to the OpenID Connect specification. The interaction between keystone and the external Identity Provider (IdP) is handled by the Apache2 auth_openidc module. Consult with the specific OpenID Connect vendor on whether they support auth_openidc.

Both SAML and OpenID methods are supported for WebSSO federation in SUSE OpenStack Cloud 9.
WebSSO has a change password option in User Settings, but note that this function is not accessible for users authenticating with external systems such as LDAP or SAML Identity Providers.
Multi-factor authentication (MFA)
SUSE OpenStack Cloud MFA support is a custom configuration requiring Sales Engineering support.
MFA drivers are not included with SUSE OpenStack Cloud and need to be provided by a specific MFA vendor.
Additional information can be found at http://docs.openstack.org/security-guide/content/identity-authentication-methods.html#identity-authentication-methods-external-authentication-methods.
Hierarchical multitenancy
This function requires additional support from various OpenStack services to be functional. It is a non-core function in SUSE OpenStack Cloud and is not ready for either proof of concept or production deployments.
Additional information can be found at http://specs.openstack.org/openstack/keystone-specs/specs/juno/hierarchical_multitenancy.html.
Missing quota information for compute resources
An error message will appear on the default horizon page if you are running a swift-only deployment (no Compute service). In this configuration, you will not see any quota information for Compute resources and will see the following error message:

The Compute service is not installed or is not configured properly. No information is available for Compute resources.

This error message is expected, as no Compute service is configured for this deployment. Please ignore the message.
The following performance benchmark is based on 150 concurrent requests, run for 10-minute periods of stable load.
Operation | In SUSE OpenStack Cloud 9 (secs/request) | In SUSE OpenStack Cloud 9 3.0 (secs/request) |
---|---|---|
Token Creation | 0.86 | 0.42 |
Token Validation | 0.47 | 0.41 |
Considering that token creation operations do not happen as frequently as token validation operations, you are likely to experience less of a performance problem regardless of the extended time for token creation.
5.12.4 System cron jobs need setup #
keystone relies on two cron jobs to periodically clean up expired tokens and for token revocation. The following is how the cron jobs appear on the system:
1 1 * * * /opt/stack/service/keystone/venv/bin/keystone-manage token_flush
1 1,5,10,15,20 * * * /opt/stack/service/keystone/venv/bin/revocation_cleanup.sh
By default, the two cron jobs are enabled on controller node 1 only, not on the other two nodes. When controller node 1 is down or has failed for any reason, these two cron jobs must be manually set up on one of the other two nodes.
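As a quick way to read the second entry's schedule, the sketch below expands its hour field. This is a minimal parser written for illustration (lists and '*' only), not a general cron implementation:

```python
# Sketch: expand the hour field of the revocation-cleanup cron entry
# "1 1,5,10,15,20 * * *" to see when it fires each day.
def cron_field_values(field):
    """Expand a simple cron field (comma lists and '*' only) into ints."""
    if field == "*":
        return list(range(24))
    return [int(v) for v in field.split(",")]

minute, hour = "1", "1,5,10,15,20"
runs = [f"{h:02d}:{int(minute):02d}" for h in cron_field_values(hour)]
print(runs)  # ['01:01', '05:01', '10:01', '15:01', '20:01']
```

So revocation cleanup runs five times a day, one minute past the hour, while token_flush runs once daily at 01:01.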
6 Managing Compute #
Information about managing and configuring the Compute service.
6.1 Managing Compute Hosts using Aggregates and Scheduler Filters #
OpenStack nova has the concepts of availability zones and host aggregates that enable you to segregate your Compute hosts. Availability zones are used to specify logical separation within your cloud based on the physical isolation or redundancy you have set up. Host aggregates are used to group compute hosts together based upon common features, such as operating system. For more information, see Scaling and Segregating your Cloud.
The nova scheduler also has a filter scheduler, which supports both filtering and weighting to make decisions on where new compute instances should be created. For more information, see Filter Scheduler and Scheduling.
This document is going to show you how to set up both a nova host aggregate and configure the filter scheduler to further segregate your compute hosts.
6.1.1 Creating a nova Aggregate #
These steps will show you how to create a nova aggregate and how to add a compute host to it. You can run these steps on any machine that contains the OpenStackClient that also has network access to your cloud environment. These requirements are met by the Cloud Lifecycle Manager.
Log in to the Cloud Lifecycle Manager.
Source the administrative credentials:

ardana > source ~/service.osrc

List your current nova aggregates:

ardana > openstack aggregate list

Create a new nova aggregate with this syntax:

ardana > openstack aggregate create AGGREGATE-NAME

If you wish to have the aggregate appear as an availability zone, then specify an availability zone with this syntax:

ardana > openstack aggregate create --zone AVAILABILITY-ZONE-NAME AGGREGATE-NAME

So, for example, if you wish to create a new aggregate for your SUSE Linux Enterprise compute hosts and you wanted that to show up as the SLE availability zone, you could use this command:

ardana > openstack aggregate create --zone SLE SLE

This would produce an output similar to this:
+----+------+-------------------+-------+--------------------------+
| Id | Name | Availability Zone | Hosts | Metadata                 |
+----+------+-------------------+-------+--------------------------+
| 12 | SLE  | SLE               |       | 'availability_zone=SLE'  |
+----+------+-------------------+-------+--------------------------+
Next, you need to add compute hosts to this aggregate. You can view the current list of hosts running the compute service like this:

ardana > openstack hypervisor list

You can then add host(s) to your aggregate with this syntax:

ardana > openstack aggregate add host AGGREGATE-NAME HOST

Then you can confirm that this has been completed by listing the details of your aggregate:

ardana > openstack aggregate show AGGREGATE-NAME

You can also list out your availability zones using this command:

ardana > openstack availability zone list
6.1.2 Using nova Scheduler Filters #
The nova scheduler has two filters that can help with differentiating between different compute hosts that we'll describe here.
Filter | Description |
---|---|
AggregateImagePropertiesIsolation |
Isolates compute hosts based on image properties and aggregate metadata. You can use commas to specify multiple values for the same property. The filter will then ensure at least one value matches. |
AggregateInstanceExtraSpecsFilter |
Checks that the aggregate metadata satisfies any extra specifications associated with the instance type. This uses the aggregate_instance_extra_specs namespace. |
For details about other available filters, see Filter Scheduler.
Using the AggregateImagePropertiesIsolation Filter
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/config/nova/nova.conf.j2 file and add AggregateImagePropertiesIsolation to the scheduler_default_filters setting. For example:

# Scheduler
...
scheduler_available_filters = nova.scheduler.filters.all_filters
scheduler_default_filters = AvailabilityZoneFilter,RetryFilter,ComputeFilter,
  DiskFilter,RamFilter,ImagePropertiesFilter,ServerGroupAffinityFilter,
  ServerGroupAntiAffinityFilter,ComputeCapabilitiesFilter,NUMATopologyFilter,
  AggregateImagePropertiesIsolation
...
Optionally, you can also add these lines:

aggregate_image_properties_isolation_namespace = <a prefix string>
aggregate_image_properties_isolation_separator = <a separator character> (defaults to .)

If these are added, the filter will only match image properties starting with the namespace and separator. For example, setting the namespace to my_name_space and the separator to : would mean that the image property my_name_space:image_type=SLE matches metadata image_type=SLE, but an_other=SLE would not be inspected for a match at all.

If these are not added, all image properties will be matched against any similarly named aggregate metadata.

Add image properties to images that should be scheduled using the above filter.
Commit the changes to git:
ardana > git add -A
ardana > git commit -a -m "editing nova schedule filters"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run the ready deployment playbook:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the nova reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
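The namespace/separator matching described above can be sketched as follows. This is a simplified illustration of the matching logic, not the actual nova filter code:

```python
# Sketch of AggregateImagePropertiesIsolation namespace matching
# (simplified; the real filter lives in nova.scheduler.filters).
NAMESPACE = "my_name_space"
SEPARATOR = ":"

def properties_to_check(image_properties):
    """Return only the image properties the filter would compare
    against aggregate metadata, with the namespace prefix stripped."""
    prefix = NAMESPACE + SEPARATOR
    return {key[len(prefix):]: value
            for key, value in image_properties.items()
            if key.startswith(prefix)}

props = {"my_name_space:image_type": "SLE",  # inspected by the filter
         "an_other": "SLE"}                   # ignored by the filter
print(properties_to_check(props))  # {'image_type': 'SLE'}
```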
Using the AggregateInstanceExtraSpecsFilter Filter
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/config/nova/nova.conf.j2 file and add AggregateInstanceExtraSpecsFilter to the scheduler_default_filters setting. For example:

# Scheduler
...
scheduler_available_filters = nova.scheduler.filters.all_filters
scheduler_default_filters = AvailabilityZoneFilter,RetryFilter,ComputeFilter,
  DiskFilter,RamFilter,ImagePropertiesFilter,ServerGroupAffinityFilter,
  ServerGroupAntiAffinityFilter,ComputeCapabilitiesFilter,NUMATopologyFilter,
  AggregateInstanceExtraSpecsFilter
...
There is no additional configuration needed because the following is true:

The filter assumes : is a separator.

The filter will match all simple keys in extra_specs, plus all keys with a separator if the prefix is aggregate_instance_extra_specs. For example, image_type=SLE and aggregate_instance_extra_specs:image_type=SLE will both be matched against aggregate metadata image_type=SLE.

Add extra_specs to flavors that should be scheduled according to the above.

Commit the changes to git:
ardana > git add -A
ardana > git commit -a -m "Editing nova scheduler filters"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run the ready deployment playbook:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the nova reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
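The key-matching behavior described above can be sketched as follows. This is a simplified illustration of the matching logic, not the actual nova filter code:

```python
# Sketch of AggregateInstanceExtraSpecsFilter key handling (simplified).
PREFIX = "aggregate_instance_extra_specs"
SEPARATOR = ":"

def spec_key_for_match(key):
    """Return the aggregate-metadata key a flavor extra_specs key is
    matched against, or None if the filter skips it."""
    if SEPARATOR in key:
        scope, _, rest = key.partition(SEPARATOR)
        return rest if scope == PREFIX else None  # other scopes are skipped
    return key  # simple keys are matched as-is

print(spec_key_for_match("image_type"))                                 # image_type
print(spec_key_for_match("aggregate_instance_extra_specs:image_type"))  # image_type
print(spec_key_for_match("hw:cpu_policy"))                              # None
```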
6.2 Using Flavor Metadata to Specify CPU Model #
Libvirt is a collection of software used in OpenStack to manage virtualization. It has the ability to emulate a host CPU model in a guest VM. In SUSE OpenStack Cloud nova, the ComputeCapabilitiesFilter limits this ability by checking the exact CPU model of the compute host against the requested compute instance model. It will only pick compute hosts that have the cpu_model requested by the instance model; if the selected compute host does not have that cpu_model, the ComputeCapabilitiesFilter moves on to find another compute host that matches, if possible. Selecting an unavailable vCPU model may cause nova to fail with no valid host found.

To assist, there is a nova scheduler filter that captures cpu_models as a subset of a particular CPU family. The filter determines whether the host CPU model is capable of emulating the guest CPU model by maintaining a mapping of the vCPU models and comparing it with the host CPU model.
There is a limitation when a particular cpu_model is specified with hw:cpu_model via a compute flavor: the cpu_mode will be set to custom. This mode ensures that a persistent guest virtual machine will see the same hardware no matter which host physical machine the guest virtual machine is booted on, which allows easier live migration of virtual machines. Because of this limitation, only some of the features of a CPU are exposed to the guest, and requesting particular CPU features is not supported.
6.2.1 Editing the flavor metadata in the horizon dashboard #
These steps can be used to edit a flavor's metadata in the horizon dashboard to add the extra_specs for a cpu_model:
Access the horizon dashboard and log in with admin credentials.
Access the Flavors menu by (A) clicking on the menu button, (B) navigating to the Admin section, and then (C) clicking on Flavors:
In the list of flavors, choose the flavor you wish to edit and click on the entry under the Metadata column:
Note: You can also create a new flavor and then choose that one to edit.

In the Custom field, enter hw:cpu_model and then click on the + (plus) sign to continue. Then enter the CPU model that you wish to use into the field and click Save:
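The family mapping mentioned in Section 6.2 can be sketched as an ordered model list: a host model can emulate guest models that appear earlier (are older) in the family. The list and helper below are our illustration only, not the actual mapping shipped with the nova filter:

```shell
# Hypothetical ordered list of vCPU models in one CPU family, oldest
# first (illustrative; the real mapping lives inside the nova filter).
FAMILY="Conroe Penryn Nehalem Westmere SandyBridge IvyBridge Haswell"

# A host model can emulate a guest model that appears at or before
# the host's own position in the family list.
can_emulate() {
  host="$1"; guest="$2"; seen_guest=no
  for m in $FAMILY; do
    [ "$m" = "$guest" ] && seen_guest=yes
    if [ "$m" = "$host" ]; then
      [ "$seen_guest" = yes ] && return 0 || return 1
    fi
  done
  return 1
}

can_emulate Haswell SandyBridge && echo "Haswell host can emulate SandyBridge guest"
```

The reverse check fails, which mirrors why requesting a vCPU model newer than any available host CPU leads to "no valid host found".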
6.3 Forcing CPU and RAM Overcommit Settings #
SUSE OpenStack Cloud supports overcommitting of CPU and RAM resources on compute nodes. Overcommitting is a technique of allocating more virtualized CPUs and/or memory than there are physical resources.
The default settings for this are:
Setting | Default Value | Description
---|---|---
cpu_allocation_ratio | 16 | Virtual CPU to physical CPU allocation ratio, which affects all CPU filters. This configuration specifies a global ratio for CoreFilter. AggregateCoreFilter falls back to this value if no per-aggregate setting is found. Note: this can be set per-compute, or if set to 0.0, the value set on the scheduler node(s) or compute node(s) will be used.
ram_allocation_ratio | 1.0 | Virtual RAM to physical RAM allocation ratio, which affects all RAM filters. This configuration specifies a global ratio for RamFilter. AggregateRamFilter falls back to this value if no per-aggregate setting is found. Note: this can be set per-compute, or if set to 0.0, the value set on the scheduler node(s) or compute node(s) will be used.
disk_allocation_ratio | 1.0 | Virtual disk to physical disk allocation ratio, used by the disk_filter.py script to determine whether a host has sufficient disk space to fit a requested instance. A ratio greater than 1.0 will result in over-subscription of the available physical disk, which can be useful for more efficiently packing instances created with images that do not use the entire virtual disk, such as sparse or compressed images. It can be set to a value between 0.0 and 1.0 in order to preserve a percentage of the disk for uses other than instances. Note: this can be set per-compute, or if set to 0.0, the value set on the scheduler node(s) or compute node(s) will be used.
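To see what these ratios mean in practice, the capacity arithmetic can be sketched as follows (the host sizes below are made-up example values, not defaults):

```shell
# Example: effective schedulable capacity of one compute host under
# the default overcommit ratios (host sizes are illustrative).
PHYS_CPUS=16          # physical cores on the host
PHYS_RAM_MB=65536     # 64 GB of physical RAM
CPU_RATIO=16          # cpu_allocation_ratio
RAM_RATIO=1           # ram_allocation_ratio = 1.0 (no RAM overcommit)

VCPUS=$((PHYS_CPUS * CPU_RATIO))
VRAM_MB=$((PHYS_RAM_MB * RAM_RATIO))

echo "schedulable vCPUs:  $VCPUS"
echo "schedulable RAM MB: $VRAM_MB"
```

With the defaults, a 16-core host can be scheduled up to 256 vCPUs, while RAM is not overcommitted at all.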
6.3.1 Changing the overcommit ratios for your entire environment #
If you wish to change the CPU and/or RAM overcommit ratio settings for your entire environment then you can do so via your Cloud Lifecycle Manager with these steps.
Log in to the Cloud Lifecycle Manager.
Edit the nova configuration settings located in this file:
~/openstack/my_cloud/config/nova/nova.conf.j2
Add or edit the following lines to specify the ratios you wish to use:
cpu_allocation_ratio = 16
ram_allocation_ratio = 1.0
Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "setting nova overcommit settings"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the nova reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
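Before committing, you can sanity-check that the ratio lines actually landed in the template. A minimal sketch, run here against a temporary stand-in file rather than the real nova.conf.j2:

```shell
# Sketch: verify overcommit options are present in a nova.conf.j2-style
# file. A temp file stands in for the real template, so this is safe
# to run anywhere.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
cpu_allocation_ratio = 16
ram_allocation_ratio = 1.0
EOF

CPU_SET=$(grep -c '^cpu_allocation_ratio' "$tmp")
RAM_SET=$(grep -c '^ram_allocation_ratio' "$tmp")
echo "cpu_allocation_ratio lines: $CPU_SET"
echo "ram_allocation_ratio lines: $RAM_SET"
rm -f "$tmp"
```

Against the real file, replace the temp path with ~/openstack/my_cloud/config/nova/nova.conf.j2.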
6.4 Enabling the Nova Resize and Migrate Features #
The nova resize and migrate features are disabled by default. If you wish to utilize these options, these steps will show you how to enable them in your cloud.
The two features below are disabled by default:
Resize - this feature allows you to change the size of a Compute instance by changing its flavor. See the OpenStack User Guide for more details on its use.
Migrate - read about the differences between "live" migration (enabled by default) and regular migration (disabled by default) in Section 15.1.3.3, “Live Migration of Instances”.
These two features are disabled by default because they require passwordless SSH access between Compute hosts with the user having access to the file systems to perform the copy.
6.4.1 Enabling Nova Resize and Migrate #
If you wish to enable these features, use these steps on your lifecycle
manager. This will deploy a set of public and private SSH keys to the
Compute hosts, allowing the nova
user SSH access between
each of your Compute hosts.
Log in to the Cloud Lifecycle Manager.
Run the nova reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml --extra-vars nova_migrate_enabled=true

To ensure that the resize and migration options show up in the horizon dashboard, run the horizon reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
6.4.2 Disabling Nova Resize and Migrate #
This feature is disabled by default. However, if you have previously enabled it and wish to re-disable it, you can use these steps on your lifecycle manager. This will remove the set of public and private SSH keys that were previously added to the Compute hosts, removing the nova user's SSH access between each of your Compute hosts.
Log in to the Cloud Lifecycle Manager.
Run the nova reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml --extra-vars nova_migrate_enabled=false

To ensure that the resize and migrate options are removed from the horizon dashboard, run the horizon reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
6.5 Enabling ESX Compute Instance(s) Resize Feature #
Resizing ESX compute instances is disabled by default. If you want to utilize this option, these steps will show you how to configure and enable it in your cloud.
The following feature is disabled by default:
Resize - this feature allows you to change the size of a Compute instance by changing its flavor. See the OpenStack User Guide for more details on its use.
6.5.1 Procedure #
If you want to configure and resize ESX compute instance(s), perform the following steps:
Log in to the Cloud Lifecycle Manager.
Edit ~/openstack/my_cloud/config/nova/nova.conf.j2 to add the following parameter under Policy:

# Policy
allow_resize_to_same_host=True
Commit your configuration:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "<commit message>"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

By default the nova resize feature is disabled. To enable nova resize, refer to Section 6.4, “Enabling the Nova Resize and Migrate Features”.
By default an ESX console log is not set up. For more details about Hypervisor setup, refer to the OpenStack documentation.
6.6 GPU passthrough #
GPU passthrough for SUSE OpenStack Cloud provides the nova instance direct access to the GPU device for increased performance.
This section demonstrates the steps to pass through an Nvidia GPU card supported by SUSE OpenStack Cloud.

Resizing the VM to the same host with the same PCI card is not supported with PCI passthrough.

The following steps are necessary to leverage PCI passthrough on a SUSE OpenStack Cloud 9 Compute Node: preparing the Compute Node, preparing nova via the input model updates, and preparing glance. Ensure you follow the procedures below in sequence:
There should be no kernel drivers or binaries with direct access to the PCI device. If there are kernel modules, ensure they are blacklisted.
For example, it is common to have a nouveau driver from when the node was installed. This driver is a graphics driver for Nvidia-based GPUs. It must be blacklisted as shown in this example (writing to /etc/modprobe.d requires root privileges):

root # echo 'blacklist nouveau' >> /etc/modprobe.d/nouveau-default.conf

The file location and its contents are important; however, the name of the file is your choice. Other drivers can be blacklisted in the same manner, including Nvidia drivers.
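The effect of the blacklist entry can be exercised safely in a sandbox directory standing in for /etc/modprobe.d (our stand-in, so no root access is needed):

```shell
# Sketch: create a modprobe blacklist entry in a sandbox directory
# (stand-in for /etc/modprobe.d, where root access would be required).
sandbox=$(mktemp -d)
echo 'blacklist nouveau' >> "$sandbox/nouveau-default.conf"

# The file name is arbitrary; modprobe reads every *.conf in the
# directory, so only the location and contents matter.
CONTENT=$(cat "$sandbox/nouveau-default.conf")
echo "blacklist entry written: $CONTENT"
rm -rf "$sandbox"
```

On a real node, point the redirect at /etc/modprobe.d as root, as shown above.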
On the host, iommu_groups is necessary and may already be enabled. To check if IOMMU is enabled, run the following command:

root # virt-host-validate
.....
QEMU: Checking if IOMMU is enabled by kernel : WARN (IOMMU appears to be disabled in kernel. Add intel_iommu=on to kernel cmdline arguments)
.....

To modify the kernel command line as suggested in the warning, edit /etc/default/grub and append intel_iommu=on to the GRUB_CMDLINE_LINUX_DEFAULT variable. Then run:

root # update-bootloader

Reboot to enable iommu_groups.

After the reboot, check that IOMMU is enabled:

root # virt-host-validate
.....
QEMU: Checking if IOMMU is enabled by kernel : PASS
.....

Confirm IOMMU groups are available by finding the group associated with your PCI device (for example, an Nvidia GPU):

ardana > lspci -nn | grep -i nvidia
84:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)

In this example, 84:00.0 is the address of the PCI device, the vendor ID is 10de, and the product ID is 1db4.

Confirm that the devices are available for passthrough:

ardana > ls -ld /sys/kernel/iommu_groups/*/devices/*84:00.?/
drwxr-xr-x 3 root root 0 Nov 19 17:00 /sys/kernel/iommu_groups/56/devices/0000:84:00.0/
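The PCI address, vendor ID, and product ID needed for the input model in the next section can be pulled out of the lspci line mechanically. This sketch parses a captured copy of the example output above (normally you would pipe `lspci -nn | grep -i nvidia` in directly):

```shell
# Sketch: extract the PCI address, vendor ID and product ID from a
# captured `lspci -nn` line (the example output from above).
line='84:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] [10de:1db4] (rev a1)'

ADDR=${line%% *}   # first field: the PCI address, e.g. 84:00.0
# The [vendor:product] pair is the last bracketed 4-hex:4-hex token.
IDS=$(echo "$line" | sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p')
VENDOR_ID=${IDS%:*}
PRODUCT_ID=${IDS#*:}

echo "address=$ADDR vendor=$VENDOR_ID product=$PRODUCT_ID"
```

These are exactly the values that go into the vendor_id, product_id, and bus_address fields of the pass-through section below (the bus_address form adds the 0000: domain prefix).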
6.6.1 Preparing nova via the input model updates #
To implement the required configuration, log into the Cloud Lifecycle Manager node and update the Cloud Lifecycle Manager model files to enable GPU passthrough for compute nodes.
Edit servers.yml: add the pass-through section after the definition of the servers section in the servers.yml file. The following example shows only the relevant sections:

---
product:
  version: 2
baremetal:
  netmask: 255.255.255.0
  subnet: 192.168.100.0
servers:
  .
  .
  - id: compute-0001
    ip-addr: 192.168.75.5
    role: COMPUTE-ROLE
    server-group: RACK3
    nic-mapping: HP-DL360-4PORT
    ilo-ip: ****
    ilo-user: ****
    ilo-password: ****
    mac-addr: ****
  .
  .
  - id: compute-0008
    ip-addr: 192.168.75.7
    role: COMPUTE-ROLE
    server-group: RACK2
    nic-mapping: HP-DL360-4PORT
    ilo-ip: ****
    ilo-user: ****
    ilo-password: ****
    mac-addr: ****
pass-through:
  servers:
    - id: compute-0001
      data:
        gpu:
          - vendor_id: 10de
            product_id: 1db4
            bus_address: 0000:84:00.0
            pf_mode: type-PCI
            name: a1
          - vendor_id: 10de
            product_id: 1db4
            bus_address: 0000:85:00.0
            pf_mode: type-PCI
            name: b1
    - id: compute-0008
      data:
        gpu:
          - vendor_id: 10de
            product_id: 1db4
            pf_mode: type-PCI
            name: c1
Check out the site branch of the local git repository and change to the correct directory:
ardana > cd ~/openstack
ardana > git checkout site
ardana > cd ~/openstack/my_cloud/definition/data/

Open the file containing the servers list, for example servers.yml, with your chosen editor. Save the changes to the file and commit to the local git repository:

ardana > git add -A

Confirm that the changes to the tree are relevant changes and commit:

ardana > git status
ardana > git commit -m "your commit message goes here in quotes"

Enable your changes by running the necessary playbooks:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible

If you are enabling GPU passthrough for your compute nodes during your initial installation, run the following command:

ardana > ansible-playbook -i hosts/verb_hosts site.yml

If you are enabling GPU passthrough for your compute nodes post-installation, run the following command:

ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
The above procedure updates the configuration for the nova api, nova compute and scheduler as defined in https://docs.openstack.org/nova/rocky/admin/pci-passthrough.html.
The following is the PCI configuration for the compute-0001 node using the above example after the playbook run:

[pci]
passthrough_whitelist = [{"address": "0000:84:00.0"}, {"address": "0000:85:00.0"}]
alias = {"vendor_id": "10de", "name": "a1", "device_type": "type-PCI", "product_id": "1db4"}
alias = {"vendor_id": "10de", "name": "b1", "device_type": "type-PCI", "product_id": "1db4"}

The following is the PCI configuration for the compute-0008 node using the above example after the playbook run:

[pci]
passthrough_whitelist = [{"vendor_id": "10de", "product_id": "1db4"}]
alias = {"vendor_id": "10de", "name": "c1", "device_type": "type-PCI", "product_id": "1db4"}
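The two generated configurations illustrate a simple rule: a gpu entry with a bus_address produces an address-pinned whitelist entry, while one without it matches on vendor and product ID. A sketch of that rule (the helper function is ours, not Cloud Lifecycle Manager code):

```shell
# Sketch: how a gpu entry in the input model maps to a
# passthrough_whitelist entry (illustrative helper, not CLM code).
whitelist_entry() {
  vendor="$1"; product="$2"; bus="$3"
  if [ -n "$bus" ]; then
    # bus_address present: pin the exact PCI address.
    printf '{"address": "%s"}' "$bus"
  else
    # no bus_address: match any device with this vendor/product pair.
    printf '{"vendor_id": "%s", "product_id": "%s"}' "$vendor" "$product"
  fi
}

E1=$(whitelist_entry 10de 1db4 0000:84:00.0)   # compute-0001 style
E2=$(whitelist_entry 10de 1db4 "")             # compute-0008 style
echo "$E1"
echo "$E2"
```

This matches the two [pci] fragments shown above.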
After running the site.yml
playbook above,
reboot the compute nodes that are configured with Intel PCI devices.
6.6.2 Create a flavor #
For GPU passthrough, set the pci_passthrough:alias property. You can do so for an existing flavor or create a new flavor as shown in the example below:

ardana > openstack flavor create --ram 8192 --disk 100 --vcpu 8 gpuflavor
ardana > openstack flavor set gpuflavor --property "pci_passthrough:alias"="a1:1"
Here the a1
references the alias name as provided
in the model while the 1
tells nova that a single GPU
should be assigned.
Boot an instance using the flavor created above:

ardana > openstack server create --flavor gpuflavor --image sles12sp4 --key-name key --nic net-id=$net_id gpu-instance-1
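The "a1:1" value follows the alias:count convention described above; how it decomposes can be sketched as:

```shell
# Sketch: split a pci_passthrough:alias value into its alias name and
# requested device count (the "<alias>:<count>" convention).
SPEC="a1:1"
ALIAS_NAME=${SPEC%:*}   # "a1" - must match an alias name from the model
GPU_COUNT=${SPEC#*:}    # "1"  - number of GPUs nova should assign

echo "alias=$ALIAS_NAME count=$GPU_COUNT"
```

Requesting "a1:2" on the example compute-0001 node would ask nova for both of its whitelisted GPUs at once.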
6.7 Configuring the Image Service #
The Image service, based on OpenStack glance, works out of the box and does not need any special configuration. However, a few features detailed below, such as glance image caching and the glance copy-from feature, require some additional configuration if you choose to use them.
glance images are assigned IDs upon creation, either automatically or specified by the user. The ID of an image should be unique, so if a user assigns an ID which already exists, a conflict (409) will occur.
This only becomes a problem if users can publicize or share images with others. If users can share images but cannot publicize them, your system is not vulnerable. If the system has also been purged (via glance-manage db purge), then it is possible for deleted image IDs to be reused.
If deleted image IDs can be reused then recycling of public and shared images becomes a possibility. This means that a new (or modified) image can replace an old image, which could be malicious.
If this is a problem for you, please contact Sales Engineering.
6.7.1 How to enable glance image caching #
In SUSE OpenStack Cloud 9, by default, the glance image caching option is not enabled. You have the option to have image caching enabled and these steps will show you how to do that.
The main benefits of using image caching are that it allows the glance service to return images faster and it causes less load on other services to supply the image.
In order to use the image caching option you will need to supply a logical volume for the service to use for the caching.
If you wish to use the glance image caching option, you will see the
section below in your
~/openstack/my_cloud/definition/data/disks_controller.yml
file. You will specify the mount point for the logical volume you wish to
use for this.
Log in to the Cloud Lifecycle Manager.
Edit your ~/openstack/my_cloud/definition/data/disks_controller.yml file and specify the volume and mount point for your glance-cache. Here is an example:

# glance cache: if a logical volume with consumer usage glance-cache
# is defined glance caching will be enabled. The logical volume can be
# part of an existing volume group or a dedicated volume group.
- name: glance-vg
  physical-volumes:
    - /dev/sdx
  logical-volumes:
    - name: glance-cache
      size: 95%
      mount: /var/lib/glance/cache
      fstype: ext4
      mkfs-opts: -O large_file
      consumer:
        name: glance-api
        usage: glance-cache
If you are enabling image caching during your initial installation, prior to running site.yml the first time, then continue with the installation steps. However, if you are making this change post-installation, then you will need to commit your changes with the steps below.

Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the glance reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml
An existing volume image cache is not properly deleted when cinder detects that the source image has changed. After updating any source image, delete the cache volume so that the cache is refreshed.

The volume image cache must be deleted before trying to use the associated source image in any other volume operations. This includes creating bootable volumes or booting an instance with create volume enabled and the updated image as the source image.
6.7.2 Allowing the glance copy-from option in your environment #
When creating images, one of the options you have is to copy the image from a remote location to your local glance store. You do this by specifying the --copy-from option when creating the image. To use this feature, though, you need to ensure the following conditions are met:

The server hosting the glance service must have network access to the remote location that is hosting the image.

There cannot be a proxy between glance and the remote location.

The glance v1 API must be enabled, as v2 does not currently support the copy-from function.

The http glance store must be enabled in the environment, following the steps below.
Enabling the HTTP glance Store
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/config/glance/glance-api.conf.j2 file and add http to the list of glance stores in the [glance_store] section as seen below:

[glance_store]
stores = {{ glance_stores }}, http
Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the glance reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml

Run the horizon reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
7 Managing ESX #
Information about managing and configuring the ESX service.
7.1 Networking for ESXi Hypervisor (OVSvApp) #
To provide network as a service for tenant VMs hosted on the ESXi Hypervisor, a service VM called the OVSvApp VM is deployed on each ESXi Hypervisor within a cluster managed by OpenStack nova, as shown in the following figure.

The OVSvApp VM runs SLES as a guest operating system and has Open vSwitch 2.1.0 or above installed. It also runs an agent called the OVSvApp agent, which is responsible for dynamically creating the port groups for the tenant VMs and manages the OVS bridges, which contain the flows related to security groups and L2 networking.
To facilitate fault tolerance and mitigation of data path loss for tenant VMs, the neutron-ovsvapp-agent-monitor process runs as part of the neutron-ovsvapp-agent service and is responsible for monitoring the Open vSwitch module within the OVSvApp VM. It also uses an nginx server to provide the health status of the Open vSwitch module to the neutron server for mitigation actions. A systemd script keeps the neutron-ovsvapp-agent service alive.
When an OVSvApp service VM crashes, an agent monitoring mechanism starts a cluster mitigation process. You can mitigate data path traffic loss for VMs on the failed ESX host in that cluster by putting the failed ESX host into maintenance mode. This, in turn, triggers vCenter DRS to migrate tenant VMs to other ESX hosts within the same cluster, ensuring data path continuity of tenant VM traffic.
View Cluster Mitigation
Install python-networking-vsphere so that the neutron ovsvapp commands work properly:

ardana > sudo zypper in python-networking-vsphere
An administrator can view cluster mitigation status using the following commands.
neutron ovsvapp-mitigated-cluster-list
Lists all the clusters where at least one round of host mitigation has happened.
Example:
ardana > neutron ovsvapp-mitigated-cluster-list
+------------+------------+-----------------+-------------------+
| vcenter_id | cluster_id | being_mitigated | threshold_reached |
+------------+------------+-----------------+-------------------+
| vcenter1   | cluster1   | True            | False             |
| vcenter2   | cluster2   | False           | True              |
+------------+------------+-----------------+-------------------+

neutron ovsvapp-mitigated-cluster-show --vcenter-id <VCENTER_ID> --cluster-id <CLUSTER_ID>

Shows the status of a particular cluster.

Example:

ardana > neutron ovsvapp-mitigated-cluster-show --vcenter-id vcenter1 --cluster-id cluster1
+-------------------+----------+
| Field             | Value    |
+-------------------+----------+
| being_mitigated   | True     |
| cluster_id        | cluster1 |
| threshold_reached | False    |
| vcenter_id        | vcenter1 |
+-------------------+----------+

There can be instances where a triggered mitigation does not succeed and the neutron server is not informed of the failure (for example, if the selected agent which had to mitigate the host goes down before finishing the task). In this case, the cluster will be locked. To unlock the cluster for further mitigations, use the update command.
neutron ovsvapp-mitigated-cluster-update --vcenter-id <VCENTER_ID> --cluster-id <CLUSTER_ID>

Update the status of a mitigated cluster:

Modify the value of being-mitigated from True to False to unlock the cluster.

Example:

ardana > neutron ovsvapp-mitigated-cluster-update --vcenter-id vcenter1 --cluster-id cluster1 --being-mitigated False

Update the threshold value:

Update the threshold-reached value to True if no further migration is required in the selected cluster.

Example:

ardana > neutron ovsvapp-mitigated-cluster-update --vcenter-id vcenter1 --cluster-id cluster1 --being-mitigated False --threshold-reached True
REST API

ardana > curl -i -X GET http://<ip>:9696/v2.0/ovsvapp_mitigated_clusters \
  -H "User-Agent: python-neutronclient" -H "Accept: application/json" \
  -H "X-Auth-Token: <token_id>"
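The JSON returned by that endpoint can be inspected with standard tools. This sketch works on a canned response; the payload shape and field names are inferred from the CLI output above, so verify them against your deployment before relying on them:

```shell
# Sketch: list clusters still being mitigated from a canned JSON
# response (field names inferred from the CLI tables above - treat
# the payload shape as an assumption, not a documented contract).
response='{"ovsvapp_mitigated_clusters": [
  {"vcenter_id": "vcenter1", "cluster_id": "cluster1", "being_mitigated": true},
  {"vcenter_id": "vcenter2", "cluster_id": "cluster2", "being_mitigated": false}]}'

# Split the JSON on commas, then pull the cluster_id that precedes a
# "being_mitigated": true flag.
MITIGATED=$(echo "$response" | tr ',' '\n' | grep -B2 '"being_mitigated": true' \
  | sed -n 's/.*"cluster_id": "\([^"]*\)".*/\1/p')
echo "clusters being mitigated: $MITIGATED"
```

In practice, replace the canned response with the output of the curl command above.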
7.1.1 More Information #
For more information on the Networking for ESXi Hypervisor (OVSvApp), see the following references:
VBrownBag session in Vancouver OpenStack Liberty Summit:
https://www.youtube.com/watch?v=icYA_ixhwsM&feature=youtu.be
Wiki Link:
Codebase:
Whitepaper:
https://github.com/hp-networking/ovsvapp/blob/master/OVSvApp_Solution.pdf
7.2 Validating the neutron Installation #
You can validate that the ESX compute cluster is added to the cloud successfully using the following command:
# openstack network agent list
+------------------+--------------------+-----------------------+-------------------+-------+----------------+---------------------------+
| id               | agent_type         | host                  | availability_zone | alive | admin_state_up | binary                    |
+------------------+--------------------+-----------------------+-------------------+-------+----------------+---------------------------+
| 05ca6ef...999c09 | L3 agent           | doc-cp1-comp0001-mgmt | nova              | :-)   | True           | neutron-l3-agent          |
| 3b9179a...28e2ef | Metadata agent     | doc-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-metadata-agent    |
| 4e8f84f...c9c58f | Metadata agent     | doc-cp1-comp0002-mgmt |                   | :-)   | True           | neutron-metadata-agent    |
| 55a5791...c17451 | L3 agent           | doc-cp1-c1-m1-mgmt    | nova              | :-)   | True           | neutron-vpn-agent         |
| 5e3db8f...87f9be | Open vSwitch agent | doc-cp1-c1-m1-mgmt    |                   | :-)   | True           | neutron-openvswitch-agent |
| 6968d9a...b7b4e9 | L3 agent           | doc-cp1-c1-m2-mgmt    | nova              | :-)   | True           | neutron-vpn-agent         |
| 7b02b20...53a187 | Metadata agent     | doc-cp1-c1-m2-mgmt    |                   | :-)   | True           | neutron-metadata-agent    |
| 8ece188...5c3703 | Open vSwitch agent | doc-cp1-comp0002-mgmt |                   | :-)   | True           | neutron-openvswitch-agent |
| 8fcb3c7...65119a | Metadata agent     | doc-cp1-c1-m1-mgmt    |                   | :-)   | True           | neutron-metadata-agent    |
| 9f48967...36effe | OVSvApp agent      | doc-cp1-comp0002-mgmt |                   | :-)   | True           | ovsvapp-agent             |
| a2a0b78...026da9 | Open vSwitch agent | doc-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-openvswitch-agent |
| a2fbd4a...28a1ac | DHCP agent         | doc-cp1-c1-m2-mgmt    | nova              | :-)   | True           | neutron-dhcp-agent        |
| b2428d5...ee60b2 | DHCP agent         | doc-cp1-c1-m1-mgmt    | nova              | :-)   | True           | neutron-dhcp-agent        |
| c0983a6...411524 | Open vSwitch agent | doc-cp1-c1-m2-mgmt    |                   | :-)   | True           | neutron-openvswitch-agent |
| c32778b...a0fc75 | L3 agent           | doc-cp1-comp0002-mgmt | nova              | :-)   | True           | neutron-l3-agent          |
+------------------+--------------------+-----------------------+-------------------+-------+----------------+---------------------------+
7.3 Removing a Cluster from the Compute Resource Pool #
7.3.1 Prerequisites #
Write down the hostname and ESXi configuration IP addresses of the OVSvApp VMs of that ESX cluster before deleting the VMs. These IP addresses and hostnames will be used to clean up the monasca alarm definitions.
Perform the following steps:
7.3.2 Removing an existing cluster from the compute resource pool #
Perform the following steps to remove an existing cluster from the compute resource pool.
Run the following command to check for the instances launched in that cluster:

# openstack server list --host <hostname>
+--------------------------------------+------+--------+------------+-------------+------------------+
| ID                                   | Name | Status | Task State | Power State | Networks         |
+--------------------------------------+------+--------+------------+-------------+------------------+
| 80e54965-758b-425e-901b-9ea756576331 | VM1  | ACTIVE | -          | Running     | private=10.0.0.2 |
+--------------------------------------+------+--------+------------+-------------+------------------+
where:
hostname: Specifies hostname of the compute proxy present in that cluster.
Delete all instances spawned in that cluster:
# openstack server delete <server> [<server ...>]
where:
server: Specifies the name or ID of the server(s)
OR
Migrate all instances spawned in that cluster.
# openstack server migrate <server>
Run the following playbooks to stop the Compute (nova) and Networking (neutron) services:

ardana > ansible-playbook -i hosts/verb_hosts nova-stop.yml --limit <hostname>
ardana > ansible-playbook -i hosts/verb_hosts neutron-stop.yml --limit <hostname>

where:

hostname: Specifies the hostname of the compute proxy present in that cluster.
7.3.3 Cleanup monasca-agent for OVSvAPP Service #
Perform the following procedure to clean up the monasca agents for the ovsvapp-agent service.

If the monasca API is installed on a different node, copy service.osrc from the Cloud Lifecycle Manager to the monasca API server:

scp service.osrc $USER@ardana-cp1-mtrmon-m1-mgmt:
SSH to the monasca API server. You must SSH to each monasca API server for cleanup. For example:

ssh ardana-cp1-mtrmon-m1-mgmt
Edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove the reference to the OVSvApp you removed. This requires sudo access.

sudo vi /etc/monasca/agent/conf.d/host_alive.yaml

A sample of host_alive.yaml:

- alive_test: ping
  built_by: HostAlive
  host_name: esx-cp1-esx-ovsvapp0001-mgmt
  name: esx-cp1-esx-ovsvapp0001-mgmt ping
  target_hostname: esx-cp1-esx-ovsvapp0001-mgmt
where host_name and target_hostname are taken from the DNS name field in the vSphere client. (Refer to Section 7.3.1, “Prerequisites”.)
After removing the reference on each of the monasca API servers, restart the monasca-agent on each of those servers by executing the following command.
tux > sudo service openstack-monasca-agent restart

With the OVSvApp references removed and the monasca-agent restarted, you can delete the corresponding alarm to complete the cleanup process. We recommend using the monasca CLI, which is installed on each of your monasca API servers by default. Execute the following command from the monasca API server (for example: ardana-cp1-mtrmon-mX-mgmt).

monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=<ovsvapp deleted>
For example, you can execute the following command to get the alarm ID if the OVSvApp appears as in the preceding example.
monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
+--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
| id                                   | alarm_definition_id                  | alarm_definition_name | metric_name       | metric_dimensions                         | severity | state | lifecycle_state | link | state_updated_timestamp  | updated_timestamp        | created_timestamp        |
+--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
| cfc6bfa4-2485-4319-b1e5-0107886f4270 | cca96c53-a927-4b0a-9bf3-cb21d28216f3 | Host Status           | host_alive_status | service: system                           | HIGH     | OK    | None            | None | 2016-10-27T06:33:04.256Z | 2016-10-27T06:33:04.256Z | 2016-10-23T13:41:57.258Z |
|                                      |                                      |                       |                   | cloud_name: entry-scale-kvm-esx-mml       |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | test_type: ping                           |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | hostname: ardana-cp1-esx-ovsvapp0001-mgmt |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | control_plane: control-plane-1            |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | cluster: mtrmon                           |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m1-mgmt  |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       | host_alive_status | (same dimension set as above, with        |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m3-mgmt) |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       | host_alive_status | (same dimension set as above, with        |          |       |                 |      |                          |                          |                          |
|                                      |                                      |                       |                   | observer_host: ardana-cp1-mtrmon-m2-mgmt) |          |       |                 |      |                          |                          |                          |
+--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
Delete the monasca alarm.
monasca alarm-delete <alarm ID>
For example:
monasca alarm-delete cfc6bfa4-2485-4319-b1e5-0107886f4270
Successfully deleted alarm
After deleting the alarms and updating the monasca-agent configuration, those alarms are removed from the Operations Console UI. You can log in to the Operations Console to view the status.
7.3.4 Removing the Compute Proxy from Monitoring #
Once you have removed the Compute proxy, the alarms against it will still trigger. To resolve this, perform the following steps.
SSH to the monasca API server. You must SSH to each monasca API server for cleanup.
For example:
ssh ardana-cp1-mtrmon-m1-mgmt
Edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove the reference to the Compute proxy you removed. This requires sudo access.
sudo vi /etc/monasca/agent/conf.d/host_alive.yaml
A sample host_alive.yaml file:
- alive_test: ping
  built_by: HostAlive
  host_name: MCP-VCP-cpesx-esx-comp0001-mgmt
  name: MCP-VCP-cpesx-esx-comp0001-mgmt ping
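Before editing, you can confirm which lines of the file mention the removed proxy. A minimal sketch (the hostname is taken from the sample above; adjust it to your environment):

```shell
# List the host_alive.yaml lines that mention the removed Compute proxy,
# with line numbers, so you know which block to delete.
grep -n 'MCP-VCP-cpesx-esx-comp0001-mgmt' /etc/monasca/agent/conf.d/host_alive.yaml
```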
Once you have removed the references on each of your monasca API servers, execute the following command to restart the monasca-agent on each of those servers.
tux > sudo service openstack-monasca-agent restart
With the Compute proxy references removed and the monasca-agent restarted, delete the corresponding alarms to complete the cleanup process. We recommend using the monasca CLI, which is installed on each of your monasca API servers by default.
monasca alarm-list --metric-dimensions hostname=<compute node deleted>
For example, if the Compute proxy appears as in the preceding example, you can execute the following command to get the alarm ID:
monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt
Delete the monasca alarm.
monasca alarm-delete <alarm ID>
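If the node has several alarms, deleting them one at a time is tedious. The following loop is a sketch, not part of the product tooling: it assumes the default table output of the monasca CLI, where the alarm ID is the second `|`-separated column of each data row, and deletes every matching alarm in one pass.

```shell
# Delete all monasca alarms for a removed host in one pass (sketch).
host="ardana-cp1-comp0001-mgmt"   # hostname of the removed node (illustrative)
monasca alarm-list --metric-dimensions hostname="$host" |
  awk -F'|' '/^\| [0-9a-f]/ { gsub(/ /, "", $2); print $2 }' |
  while read -r id; do
    monasca alarm-delete "$id"
  done
```

The awk pattern skips the header, separator, and continuation rows, so only real alarm IDs reach the delete command.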
7.3.5 Cleaning the monasca Alarms Related to ESX Proxy and vCenter Cluster #
Perform the following procedure:
Using the ESX proxy hostname, execute the following command to list all alarms.
monasca alarm-list --metric-dimensions hostname=COMPUTE_NODE_DELETED
where COMPUTE_NODE_DELETED is the hostname as shown in the vSphere client (refer to Section 7.3.1, “Prerequisites”).
Note: Make a note of all the alarm IDs that are displayed after executing the preceding command.
For example, the compute proxy hostname is
MCP-VCP-cpesx-esx-comp0001-mgmt
.monasca alarm-list --metric-dimensions hostname=MCP-VCP-cpesx-esx-comp0001-mgmt ardana@R28N6340-701-cp1-c1-m1-mgmt:~$ monasca alarm-list --metric-dimensions hostname=R28N6340-701-cp1-esx-comp0001-mgmt +--------------------------------------+--------------------------------------+------------------------+------------------------+--------------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+ | id | alarm_definition_id | alarm_definition_name | metric_name | metric_dimensions | severity | state | lifecycle_state | link | state_updated_timestamp | updated_timestamp | created_timestamp | +--------------------------------------+--------------------------------------+------------------------+------------------------+--------------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+ | 02342bcb-da81-40db-a262-09539523c482 | 3e302297-0a36-4f0e-a1bd-03402b937a4e | HTTP Status | http_status | service: compute | HIGH | OK | None | None | 2016-11-11T06:58:11.717Z | 2016-11-11T06:58:11.717Z | 2016-11-10T08:55:45.136Z | | | | | | cloud_name: entry-scale-esx-kvm | | | | | | | | | | | | | url: https://10.244.209.9:8774 | | | | | | | | | | | | | hostname: R28N6340-701-cp1-esx-comp0001-mgmt | | | | | | | | | | | | | component: nova-api | | | | | | | | | | | | | control_plane: control-plane-1 | | | | | | | | | | | | | cluster: esx-compute | | | | | | | | | 04cb36ce-0c7c-4b4c-9ebc-c4011e2f6c0a | 15c593de-fa54-4803-bd71-afab95b980a4 | Disk Usage | disk.space_used_perc | mount_point: /proc/sys/fs/binfmt_misc | HIGH | OK | None | None | 2016-11-10T08:52:52.886Z | 2016-11-10T08:52:52.886Z | 2016-11-10T08:51:29.197Z | | | | | | service: system | | | | | | | | | | | | | cloud_name: entry-scale-esx-kvm | | | | | | | | | | | | | hostname: R28N6340-701-cp1-esx-comp0001-mgmt 
| | | | | | | | | | | | | control_plane: control-plane-1 | | | | | | | | | | | | | cluster: esx-compute | | | | | | | | | | | | | device: systemd-1 | | | | | | | | +--------------------------------------+--------------------------------------+------------------------+------------------------+--------------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+
Delete the alarm using the alarm IDs.
monasca alarm-delete <alarm ID>
Perform this step for all alarm IDs listed in the preceding step (Step 1).
For example:
monasca alarm-delete 1cc219b1-ce4d-476b-80c2-0cafa53e1a12
7.4 Removing an ESXi Host from a Cluster #
This topic describes how to remove an existing ESXi host from a cluster and clean up the services for the OVSvApp VM.
Before performing this procedure, wait until vCenter has migrated all the tenant VMs to other active hosts in the same cluster.
7.4.1 Prerequisite #
Write down the hostname and ESXi configuration IP addresses of the OVSvApp VMs of that ESX cluster before deleting the VMs. These IP addresses and hostnames are used to clean up the monasca alarm definitions.
Log in to the vSphere client.
Select the ovsvapp node running on the ESXi host and click the Summary tab.
7.4.2 Procedure #
Right-click the host and put it in maintenance mode. This automatically migrates all the tenant VMs except the OVSvApp VM.
Cancel the maintenance mode task.
Right-click the ovsvapp VM (IP Address) node, select Power, and then click Power Off.
Right-click the node and then click Delete from Disk.
Right-click the Host, and then click Enter Maintenance Mode.
Disconnect the VM. Right-click the VM, and then click Disconnect.
The ESXi node is removed from the vCenter.
7.4.3 Clean up neutron-agent for OVSvAPP Service #
After removing the ESXi node from vCenter, perform the following procedure to clean up the neutron agents for the ovsvapp-agent service.
Log in to the Cloud Lifecycle Manager.
Source the credentials.
ardana > source service.osrc
Execute the following command.
ardana > openstack network agent list | grep <OVSvApp hostname>
For example:
ardana > openstack network agent list | grep MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
| 92ca8ada-d89b-43f9-b941-3e0cd2b51e49 | OVSvApp Agent | MCP-VCP-cpesx-esx-ovsvapp0001-mgmt | | :-) | True | ovsvapp-agent |
Delete the OVSvApp agent.
ardana > openstack network agent delete <agent ID>
For example:
ardana >
openstack network agent delete 92ca8ada-d89b-43f9-b941-3e0cd2b51e49
If you have more than one host, perform the preceding procedure for all the hosts.
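When several OVSvApp VMs were removed, the agent cleanup can be scripted. The sketch below is illustrative (the hostname list is an assumption for your environment): it reads the agent ID and host columns from the agent list and deletes each matching agent.

```shell
# Delete the OVSvApp neutron agent for each removed host (sketch).
# The hostnames below are illustrative; substitute your removed OVSvApp VMs.
for host in MCP-VCP-cpesx-esx-ovsvapp0001-mgmt MCP-VCP-cpesx-esx-ovsvapp0002-mgmt; do
  openstack network agent list -f value -c ID -c Host |
    awk -v h="$host" '$2 == h { print $1 }' |
    xargs -r -n1 openstack network agent delete
done
```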
7.4.4 Clean up monasca-agent for OVSvAPP Service #
Perform the following procedure to clean up monasca agents for ovsvapp-agent service.
If the monasca API is installed on a different node, copy service.osrc from the Cloud Lifecycle Manager to the monasca API server.
ardana > scp service.osrc $USER@ardana-cp1-mtrmon-m1-mgmt:
SSH to the monasca API server. You must SSH to each monasca API server for cleanup.
For example:
ardana > ssh ardana-cp1-mtrmon-m1-mgmt
Edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove the reference to the OVSvApp VM you removed. This requires sudo access.
sudo vi /etc/monasca/agent/conf.d/host_alive.yaml
A sample host_alive.yaml:
- alive_test: ping
  built_by: HostAlive
  host_name: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
  name: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt ping
  target_hostname: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
where host_name and target_hostname match the DNS name field in the vSphere client (refer to Section 7.4.1, “Prerequisite”).
After removing the reference on each of the monasca API servers, restart the monasca-agent on each of those servers by executing the following command.
tux > sudo service openstack-monasca-agent restart
With the OVSvApp references removed and the monasca-agent restarted, you can delete the corresponding alarm to complete the cleanup process. We recommend using the monasca CLI, which is installed on each of your monasca API servers by default. Execute the following command from the monasca API server (for example, ardana-cp1-mtrmon-mX-mgmt).
ardana > monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=<ovsvapp deleted>
For example, if the OVSvApp appears as in the preceding example, you can execute the following command to get the alarm ID:
ardana >
monasca alarm-list --metric-name host_alive_status --metric-dimensions hostname=MCP-VCP-cpesx-esx-ovsvapp0001-mgmt +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+ | id | alarm_definition_id | alarm_definition_name | metric_name | metric_dimensions | severity | state | lifecycle_state | link | state_updated_timestamp | updated_timestamp | created_timestamp | +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+ | cfc6bfa4-2485-4319-b1e5-0107886f4270 | cca96c53-a927-4b0a-9bf3-cb21d28216f3 | Host Status | host_alive_status | service: system | HIGH | OK | None | None | 2016-10-27T06:33:04.256Z | 2016-10-27T06:33:04.256Z | 2016-10-23T13:41:57.258Z | | | | | | cloud_name: entry-scale-kvm-esx-mml | | | | | | | | | | | | | test_type: ping | | | | | | | | | | | | | hostname: ardana-cp1-esx-ovsvapp0001-mgmt | | | | | | | | | | | | | control_plane: control-plane-1 | | | | | | | | | | | | | cluster: mtrmon | | | | | | | | | | | | | observer_host: ardana-cp1-mtrmon-m1-mgmt | | | | | | | | | | | | host_alive_status | service: system | | | | | | | | | | | | | cloud_name: entry-scale-kvm-esx-mml | | | | | | | | | | | | | test_type: ping | | | | | | | | | | | | | hostname: ardana-cp1-esx-ovsvapp0001-mgmt | | | | | | | | | | | | | control_plane: control-plane-1 | | | | | | | | | | | | | cluster: mtrmon | | | | | | | | | | | | | observer_host: ardana-cp1-mtrmon-m3-mgmt | | | | | | | | | | | | host_alive_status | service: system | | | | | | | | | | | | | cloud_name: entry-scale-kvm-esx-mml | | | | | | | | | | 
| | | test_type: ping | | | | | | | | | | | | | hostname: ardana-cp1-esx-ovsvapp0001-mgmt | | | | | | | | | | | | | control_plane: control-plane-1 | | | | | | | | | | | | | cluster: mtrmon | | | | | | | | | | | | | observer_host: ardana-cp1-mtrmon-m2-mgmt | | | | | | | | +--------------------------------------+--------------------------------------+-----------------------+-------------------+-------------------------------------------+----------+-------+-----------------+------+--------------------------+--------------------------+--------------------------+Delete the monasca alarm.
ardana > monasca alarm-delete <alarm ID>
For example:
ardana > monasca alarm-delete cfc6bfa4-2485-4319-b1e5-0107886f4270
Successfully deleted alarm
After deleting the alarms and updating the monasca-agent configuration, those alarms are removed from the Operations Console UI. You can log in to the Operations Console to view the status.
7.4.5 Clean up the entries of OVSvApp VM from /etc/hosts #
Perform the following procedure to clean up the entries of the OVSvApp VM from /etc/hosts.
Login to Cloud Lifecycle Manager.
Edit /etc/hosts.
ardana > vi /etc/hosts
For example, the MCP-VCP-cpesx-esx-ovsvapp0001-mgmt VM is present in /etc/hosts:
192.168.86.17 MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
Delete the OVSvApp entries from
/etc/hosts
.
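The same cleanup can be done non-interactively. A sketch using sed (the hostname is the example from above; a .bak backup of the file is kept):

```shell
# Remove the OVSvApp VM's line from /etc/hosts, keeping a backup copy.
sudo sed -i.bak '/MCP-VCP-cpesx-esx-ovsvapp0001-mgmt/d' /etc/hosts
```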
7.4.6 Remove the OVSVAPP VM from the servers.yml and pass_through.yml files and run the Configuration Processor #
Complete these steps from the Cloud Lifecycle Manager to remove the OVSvAPP VM:
Log in to the Cloud Lifecycle Manager
Edit the servers.yml file to remove references to the OVSvApp VM(s) you want to remove:
~/openstack/my_cloud/definition/data/servers.yml
For example:
- ip-addr: 192.168.86.17
  server-group: AZ1
  role: OVSVAPP-ROLE
  id: 6afaa903398c8fc6425e4d066edf4da1a0f04388
Edit the ~/openstack/my_cloud/definition/data/pass_through.yml file to remove the OVSvApp VM references, using the server id from the section above to find them.
- data:
    vmware:
      vcenter_cluster: Clust1
      cluster_dvs_mapping: 'DC1/host/Clust1:TRUNK-DVS-Clust1'
      esx_hostname: MCP-VCP-cpesx-esx-ovsvapp0001-mgmt
      vcenter_id: 0997E2ED9-5E4F-49EA-97E6-E2706345BAB2
  id: 6afaa903398c8fc6425e4d066edf4da1a0f04388
Commit the changes to git:
ardana > git commit -a -m "Remove ESXi host <name>"
Run the configuration processor. You may want to use the remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data” for more details.
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
7.4.7 Clean Up nova Agent for ESX Proxy #
Log in to the Cloud Lifecycle Manager
Source the credentials.
ardana > source service.osrc
Find the nova ID for ESX Proxy with openstack compute service list.
Delete the ESX Proxy service.
ardana > openstack compute service delete ESX_PROXY_ID
If you have more than one host, perform the preceding procedure for all the hosts.
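The lookup and delete steps above can be combined. A sketch that finds the ESX proxy's compute service ID by hostname (the hostname is illustrative) and deletes it:

```shell
# Find and delete the ESX proxy's nova compute service (sketch).
proxy="MCP-VCP-cpesx-esx-comp0001-mgmt"   # assumed proxy hostname
id=$(openstack compute service list -f value -c ID -c Host |
       awk -v h="$proxy" '$2 == h { print $1 }')
if [ -n "$id" ]; then
  openstack compute service delete "$id"
fi
```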
7.4.8 Clean Up monasca Agent for ESX Proxy #
Using the ESX proxy hostname, execute the following command to list all alarms.
ardana > monasca alarm-list --metric-dimensions hostname=COMPUTE_NODE_DELETED
where COMPUTE_NODE_DELETED is the hostname as shown in the vSphere client (refer to Section 7.3.1, “Prerequisites”).
Note: Make a note of all the alarm IDs that are displayed after executing the preceding command.
Delete the ESX Proxy alarm using the alarm IDs.
monasca alarm-delete <alarm ID>
This step has to be performed for all alarm IDs listed with the
monasca alarm-list
command.
7.4.9 Clean Up ESX Proxy Entries in /etc/hosts
#
Log in to the Cloud Lifecycle Manager
Edit the
/etc/hosts
file, removing ESX Proxy entries.
7.4.10 Remove ESX Proxy from servers.yml
and pass_through.yml
files; run the Configuration Processor #
Log in to the Cloud Lifecycle Manager
Edit the servers.yml file to remove references to ESX Proxy:
~/openstack/my_cloud/definition/data/servers.yml
Edit the ~/openstack/my_cloud/definition/data/pass_through.yml file to remove the ESX Proxy references, using the server-id from the servers.yml file.
Commit the changes to git:
git commit -a -m "Remove ESX Proxy references"
Run the configuration processor. You may want to use the remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data” for more details.
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml \
  -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
7.4.11 Remove Distributed Resource Scheduler (DRS) Rules #
Perform the following procedure to remove the DRS rules, which are added by the OVSvApp installer to ensure that OVSvApp VMs do not get migrated to other hosts.
Log in to vCenter.
Right-click on the cluster and select Edit settings.
A cluster settings page appears.
Click DRS Groups Manager on the left-hand side of the pop-up box. Select the group created for the deleted OVSvApp and click Remove.
Click Rules on the left-hand side of the pop-up box, select the checkbox for the deleted OVSvApp, and click Remove.
Click OK.
7.5 Configuring Debug Logging #
7.5.1 To Modify the OVSVAPP VM Log Level #
To change the OVSVAPP log level to DEBUG, do the following:
Log in to the Cloud Lifecycle Manager.
Edit the file below:
~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
Set the logging level value of the logger_root section to DEBUG, like this:
[logger_root]
qualname: root
handlers: watchedfile, logstash
level: DEBUG
Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Deploy your changes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
7.5.2 To Enable OVSVAPP Service for Centralized Logging #
To enable OVSVAPP Service for centralized logging:
Log in to the Cloud Lifecycle Manager.
Edit the file below:
~/openstack/my_cloud/config/logging/vars/neutron-ovsvapp-clr.yml
Set the value of centralized_logging to true as shown in the following sample:
logr_services:
  neutron-ovsvapp:
    logging_options:
      - centralized_logging:
          enabled: true
          format: json
...
Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Deploy your changes, specifying the hostname for your OVSvApp host:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml --limit <hostname>
The hostname of the node can be found in the list generated from the output of the following command:
grep hostname ~/openstack/my_cloud/info/server_info.yml
7.6 Making Scale Configuration Changes #
This procedure describes how to make the recommended configuration changes to achieve 8,000 virtual machine instances.
In a scale environment for ESX computes, the configuration of vCenter Proxy VM has to be increased to 8 vCPUs and 16 GB RAM. By default it is 4 vCPUs and 4 GB RAM.
Change the directory. The nova.conf.j2 file is present in the following directory:
cd ~/openstack/ardana/ansible/roles/nova-common/templates
Edit the DEFAULT section in the nova.conf.j2 file as below:
[DEFAULT]
rpc_response_timeout = 180
service_down_time = 300
report_interval = 30
Commit your configuration:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "<commit message>"
Prepare your environment for deployment:
ansible-playbook -i hosts/localhost ready-deployment.yml
cd ~/scratch/ansible/next/ardana/ansible
Execute the
nova-reconfigure
playbook:ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
7.7 Monitoring vCenter Clusters #
Remote monitoring of an activated ESX cluster is enabled through the vCenter plugin of monasca. The monasca-agent running in each ESX Compute proxy node is configured with the vcenter plugin to monitor the cluster.
Alarm definitions are created with the default threshold values, and whenever the threshold limit is breached, the respective alarms (OK/ALARM/UNDETERMINED) are generated.
The configuration file details are given below:
init_config: {}
instances:
  - vcenter_ip: <vcenter-ip>
    username: <vcenter-username>
    password: <vcenter-password>
    clusters: <[cluster list]>
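A filled-in sketch of the same configuration, with purely illustrative values (the IP, user name, and cluster names are placeholders, not defaults):

```yaml
init_config: {}
instances:
  - vcenter_ip: 10.1.200.91
    username: monitoring-user
    password: PASSWORD
    clusters: [Clust1, Clust2]
```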
Metrics
The metrics posted to monasca by the vCenter plugin are listed below:
vcenter.cpu.total_mhz
vcenter.cpu.used_mhz
vcenter.cpu.used_perc
vcenter.cpu.total_logical_cores
vcenter.mem.total_mb
vcenter.mem.used_mb
vcenter.mem.used_perc
vcenter.disk.total_space_mb
vcenter.disk.total_used_space_mb
vcenter.disk.total_used_space_perc
monasca measurement-list --dimensions esx_cluster_id=domain-c7.D99502A9-63A8-41A2-B3C3-D8E31B591224 vcenter.disk.total_used_space_mb 2016-08-30T11:20:08
+----------------------------------------------+----------------------------------------------------------------------------------------------+-----------------------------------+------------------+-----------------+ | name | dimensions | timestamp | value | value_meta | +----------------------------------------------+----------------------------------------------------------------------------------------------+-----------------------------------+------------------+-----------------+ | vcenter.disk.total_used_space_mb | vcenter_ip: 10.1.200.91 | 2016-08-30T11:20:20.703Z | 100371.000 | | | | esx_cluster_id: domain-c7.D99502A9-63A8-41A2-B3C3-D8E31B591224 | 2016-08-30T11:20:50.727Z | 100371.000 | | | | hostname: MCP-VCP-cpesx-esx-comp0001-mgmt | 2016-08-30T11:21:20.707Z | 100371.000 | | | | | 2016-08-30T11:21:50.700Z | 100371.000 | | | | | 2016-08-30T11:22:20.700Z | 100371.000 | | | | | 2016-08-30T11:22:50.700Z | 100371.000 | | | | | 2016-08-30T11:23:20.620Z | 100371.000 | | +----------------------------------------------+-----------------------------------------------------------------------------------------------+-----------------------------------+------------------+-----------------+
Dimensions
Each metric has the following dimensions:
- vcenter_ip
FQDN/IP address of the registered vCenter server
- esx_cluster_id
clusterName.vCenter-id, as seen in the openstack hypervisor list
- hostname
ESX compute proxy name
Alarms
Alarms are created for monitoring CPU, memory, and disk usage for each activated cluster. The alarm definition details are:
Name | Expression | Severity | Match_by |
---|---|---|---|
ESX cluster CPU Usage | avg(vcenter.cpu.used_perc) > 90 times 3 | High | esx_cluster_id |
ESX cluster Memory Usage | avg(vcenter.mem.used_perc) > 90 times 3 | High | esx_cluster_id |
ESX cluster Disk Usage | vcenter.disk.total_used_space_perc > 90 | High | esx_cluster_id |
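If one of these definitions is deleted or needs to be recreated, the monasca CLI can create a definition matching the table above. A sketch (run from a node with the monasca CLI and service credentials sourced; the exact option set of your CLI version may vary):

```shell
# Recreate the ESX cluster CPU usage alarm definition (sketch).
name="ESX cluster CPU Usage"
expression="avg(vcenter.cpu.used_perc) > 90 times 3"
monasca alarm-definition-create "$name" "$expression" \
  --severity HIGH \
  --match-by esx_cluster_id
```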
7.8 Monitoring Integration with OVSvApp Appliance #
7.8.1 Processes Monitored with monasca-agent #
Using the monasca agent, the following services are monitored on the OVSvApp appliance:
neutron_ovsvapp_agent service - This is the neutron agent which runs in the appliance which will help enable networking for the tenant virtual machines.
Openvswitch - This service is used by the neutron_ovsvapp_agent service for enabling the datapath and security for the tenant virtual machines.
Ovsdb-server - This service is used by the neutron_ovsvapp_agent service.
If any of the above three processes fails to run on the OVSvApp appliance, it will lead to network disruption for the tenant virtual machines. This is why they are monitored.
The monasca-agent periodically reports the status of these processes and metrics data ('load' - cpu.load_avg_1min, 'process' - process.pid_count, 'memory' - mem.usable_perc, 'disk' - disk.space_used_perc, 'cpu' - cpu.idle_perc for examples) to the monasca server.
7.8.2 How It Works #
Once the vApp is configured and up, the monasca-agent will attempt to register with the monasca server. After successful registration, the monitoring begins on the processes listed above and you will be able to see status updates on the server side.
The monasca-agent monitors the processes at the system level so, in the case of failures of any of the configured processes, updates should be seen immediately from monasca.
To check the events from the server side, log into the Operations Console.
8 Managing Block Storage #
Information about managing and configuring the Block Storage service.
8.1 Managing Block Storage using Cinder #
SUSE OpenStack Cloud Block Storage volume operations use the OpenStack cinder service to manage storage volumes, which includes creating volumes, attaching/detaching volumes to nova instances, creating volume snapshots, and configuring volumes.
SUSE OpenStack Cloud supports the following storage back ends for block storage volumes and backup datastore configuration:
Volumes
SUSE Enterprise Storage; for more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.3 “SUSE Enterprise Storage Integration”.
3PAR FC or iSCSI; for more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”.
Backup
swift
8.1.1 Setting Up Multiple Block Storage Back-ends #
SUSE OpenStack Cloud supports setting up multiple block storage backends and multiple volume types.
Whether you have a single or multiple block storage back-ends defined in
your cinder.conf.j2
file, you can create one or more
volume types using the specific attributes associated with the back-end. You
can find details on how to do that for each of the supported back-end types
here:
Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.3 “SUSE Enterprise Storage Integration”
Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”
8.1.2 Creating a Volume Type for your Volumes #
Creating volume types allows you to create standard specifications for your volumes.
Volume types are used to specify a standard Block Storage back-end and a collection of extra specifications for your volumes. This allows an administrator to give users a variety of options while simplifying the process of creating volumes.
The tasks involved in this process are:
8.1.2.1 Create a Volume Type for your Volumes #
The default volume type will be thin provisioned and will have no fault tolerance (RAID 0). You should configure cinder to fully provision volumes, and you may want to configure fault tolerance. Follow the instructions below to create a new volume type that is fully provisioned and fault tolerant:
Perform the following steps to create a volume type using the horizon GUI:
Log in to the horizon dashboard.
Ensure that you are scoped to your
admin
Project. Then under the menu in the navigation pane, click on under the subheading.Select the
tab and then click the button to display a dialog box.Enter a unique name for the volume type and then click the
button to complete the action.
The newly created volume type will be displayed in the Volume
Types
list confirming its creation.
You must set a default_volume_type
in
cinder.conf.j2
, whether it is
default_type
or one you have created. For more
information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”, Section 35.1.4 “Configure 3PAR FC as a Cinder Backend”.
8.1.2.2 Associate the Volume Type to the Back-end #
After the volume type(s) have been created, you can assign extra specification attributes to the volume types. Each Block Storage back-end option has unique attributes that can be used.
To map a volume type to a back-end, do the following:
Log into the horizon dashboard.
Ensure that you are scoped to your Project (for more information, see Section 5.10.7, “Scope Federated User to Domain”). Then under the menu in the navigation pane, click on under the subheading.
tab to list the volume types.In the
column of the Volume Type you created earlier, click the drop-down option and select which will bring up the options.Click the
button on theVolume Type Extra Specs
screen.In the
Key
field, enter one of the key values in the table in the next section. In theValue
box, enter its corresponding value. Once you have completed that, click the button to create the extra volume type specs.
Once the volume type is mapped to a back-end, you can create volumes with this volume type.
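The same create-and-map workflow can be done from the command line instead of horizon. A sketch, where the type name, back-end name, and extra-spec value are illustrative assumptions (the volume_backend_name must match the back-end you defined in your cinder configuration):

```shell
# Create a volume type, map it to a back-end via extra specs, and use it (sketch).
openstack volume type create full-provisioned-3par
openstack volume type set \
  --property volume_backend_name=3par_FC \
  --property hp3par:provisioning=full \
  full-provisioned-3par
# Create a 10 GB volume with the new type.
openstack volume create --type full-provisioned-3par --size 10 my-volume
```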
8.1.2.3 Extra Specification Options for 3PAR #
3PAR supports volumes creation with additional attributes. These attributes can be specified using the extra specs options for your volume type. The administrator is expected to define appropriate extra spec for 3PAR volume type as per the guidelines provided at http://docs.openstack.org/liberty/config-reference/content/hp-3par-supported-ops.html.
The following cinder Volume Type extra-specs options enable control over the 3PAR storage provisioning type:
Key | Value | Description |
---|---|---|
volume_backend_name | volume backend name | The name of the back-end to which you want to associate the volume type, which you also specified earlier in the |
hp3par:provisioning (optional) | thin, full, or dedup | |
For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 35 “Integrations”, Section 35.1 “Configuring for 3PAR Block Storage Backend”.
8.1.3 Managing cinder Volume and Backup Services #
If the host running the cinder-volume
service fails for
any reason, it should be restarted as quickly as possible. Often, the host
running cinder services also runs high availability (HA) services
such as MariaDB and RabbitMQ. These HA services are at risk while one of the
nodes in the cluster is down. If it will take a significant amount of time
to recover the failed node, then you may migrate the
cinder-volume
service and its backup service to one of
the other controller nodes. When the node has been recovered, you should
migrate the cinder-volume
service and its backup service
to the original (default) node.
The cinder-volume
service and its backup service migrate
as a pair. If you migrate the cinder-volume
service, its
backup service will also be migrated.
8.1.3.1 Migrating the cinder-volume service #
The following steps will migrate the cinder-volume service and its backup service.
Log in to the Cloud Lifecycle Manager node.
Determine the host index numbers for each of your control plane nodes. This host index number will be used in a later step. They can be obtained by running this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts cinder-show-volume-hosts.yml
Here is an example snippet showing the output of a single three node control plane, with the host index numbers in bold:
TASK: [_CND-CMN | show_volume_hosts | Show cinder Volume hosts index and hostname] ***
ok: [ardana-cp1-c1-m1] => (item=(0, 'ardana-cp1-c1-m1')) => {
    "item": [ 0, "ardana-cp1-c1-m1" ],
    "msg": "Index 0 Hostname ardana-cp1-c1-m1"
}
ok: [ardana-cp1-c1-m1] => (item=(1, 'ardana-cp1-c1-m2')) => {
    "item": [ 1, "ardana-cp1-c1-m2" ],
    "msg": "Index 1 Hostname ardana-cp1-c1-m2"
}
ok: [ardana-cp1-c1-m1] => (item=(2, 'ardana-cp1-c1-m3')) => {
    "item": [ 2, "ardana-cp1-c1-m3" ],
    "msg": "Index 2 Hostname ardana-cp1-c1-m3"
}
Locate the control plane fact file for the control plane you need to migrate the service from. It will be located in the following directory:
/etc/ansible/facts.d/
These fact files use the following naming convention:
cinder_volume_run_location_<control_plane_name>.fact
Edit the fact file to contain the host index number of the control plane node you wish to migrate the cinder-volume services to. For example, if the services currently reside on your first controller node (host index 0) and you wish to migrate them to your second controller, change the value in the fact file to 1.
If you are using data encryption on your Cloud Lifecycle Manager, ensure you have included the encryption key in your environment variables. For more information, see Book “Security Guide”, Chapter 10 “Encryption of Passwords and Sensitive Data”.
export HOS_USER_PASSWORD_ENCRYPT_KEY=<encryption key>
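As a concrete illustration of the fact-file edit, the following sketch assumes a control plane named control-plane-1 and assumes the fact file stores the bare host index. Both are assumptions for illustration; inspect your existing file under /etc/ansible/facts.d/ to confirm the actual name and format before editing.

```shell
# Hypothetical example only: move cinder-volume from host index 0 to index 1.
# "control-plane-1" and the bare-index file format are assumptions; check the
# existing file on your system first.
FACT=cinder_volume_run_location_control-plane-1.fact   # normally in /etc/ansible/facts.d/
echo '1' > "$FACT"    # was previously 0 (first controller)
cat "$FACT"
```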
After you have edited the control plane fact file, run the cinder volume migration playbook for the control plane nodes involved in the migration. At minimum, this includes the node on which cinder-volume will be started and the node on which it will be stopped:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts cinder-migrate-volume.yml --limit=<limit_pattern1,limit_pattern2>
Note: <limit_pattern> is the pattern used to limit the hosts that are selected to those within a specific control plane. For example, with the nodes in the snippet shown above, you would use:
--limit=ardana-cp1-c1-m1,ardana-cp1-c1-m2
If the playbook summary reports no errors, you may disregard informational messages such as:
msg: Marking ardana_notify_cinder_restart_required to be cleared from the fact cache
Once your maintenance or other tasks are completed, ensure that you migrate the
cinder-volume
services back to their original node using these same steps.
9 Managing Object Storage #
Information about managing and configuring the Object Storage service.
The Object Storage service may be deployed in a full-fledged manner, with proxy nodes engaging rings for managing the accounts, containers, and objects being stored. Or, it may simply be deployed as a front-end to SUSE Enterprise Storage, offering Object Storage APIs with an external back-end.
In the former case, managing your Object Storage environment includes tasks such as keeping your swift rings balanced; these and other topics are discussed in more detail in this section. swift includes many commands and utilities for these purposes.
When used as a front-end to SUSE Enterprise Storage, many swift constructs such as rings and ring balancing, replica dispersion, etc. do not apply, as swift itself is not responsible for the mechanics of object storage.
9.1 Running the swift Dispersion Report #
swift contains a tool called swift-dispersion-report
that
can be used to determine whether your containers and objects have three
replicas like they are supposed to. This tool works by populating a
percentage of partitions in the system with containers and objects (using
swift-dispersion-populate
) and then running the report to
see if all the replicas of these containers and objects are in the correct
place. For a more detailed explanation of this tool in OpenStack swift, see the
OpenStack swift Administrator's Guide.
9.1.1 Configuring the swift dispersion populate #
Once a swift system has been fully deployed in SUSE OpenStack Cloud 9, you can
set up the swift-dispersion-report using the default parameters found in
~/openstack/ardana/ansible/roles/swift-dispersion/templates/dispersion.conf.j2.
This populates 1% of the partitions on the system. If you are happy with
this figure, proceed to step 2 below. Otherwise, follow step 1 to
edit the configuration file.
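To put the coverage figure in perspective, the following back-of-envelope calculation shows how many partitions a given dispersion_coverage value populates. The partition power of 17 is an assumed example value, not a product default; use the value from your own ring specification.

```python
# How many partitions swift-dispersion-populate touches for a given coverage.
# part_power = 17 is an assumed example value, not a product default.
part_power = 17
partitions = 2 ** part_power          # total partitions in the ring

for coverage in (1.0, 5.0):           # candidate dispersion_coverage values (%)
    populated = int(partitions * coverage / 100.0)
    print(f"{coverage}% coverage populates about {populated} of {partitions} partitions")
```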
If you wish to change the dispersion coverage percentage, connect to the Cloud Lifecycle Manager server and change the value of dispersion_coverage in the ~/openstack/ardana/ansible/roles/swift-dispersion/templates/dispersion.conf.j2 file to the value you wish to use. In the example below we have altered the file to create 5% dispersion:
...
[dispersion]
auth_url = {{ keystone_identity_uri }}/v3
auth_user = {{ swift_dispersion_tenant }}:{{ swift_dispersion_user }}
auth_key = {{ swift_dispersion_password }}
endpoint_type = {{ endpoint_type }}
auth_version = {{ disp_auth_version }}
# Set this to the percentage coverage. We recommend a value
# of 1%. You can increase this to get more coverage. However, if you
# decrease the value, the dispersion containers and objects are
# not deleted.
dispersion_coverage = 5.0
Commit your configuration to the Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > git add -A
ardana > git commit -m "My config or other commit message"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Reconfigure the swift servers:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
Run this playbook to populate your swift system for the health check:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-dispersion-populate.yml
9.1.2 Running the swift dispersion report #
Check the status of the swift system by running the swift dispersion report with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-dispersion-report.yml
The output of the report will look similar to this:
TASK: [swift-dispersion | report | Display dispersion report results] *********
ok: [padawan-ccp-c1-m1-mgmt] => {
    "var": {
        "dispersion_report_result.stdout_lines": [
            "Using storage policy: General",
            "",
            "Queried 40 containers for dispersion reporting, 0s, 0 retries",
            "100.00% of container copies found (120 of 120)",
            "Sample represents 0.98% of the container partition space",
            "",
            "Queried 40 objects for dispersion reporting, 0s, 0 retries",
            "There were 40 partitions missing 0 copies.",
            "100.00% of object copies found (120 of 120)",
            "Sample represents 0.98% of the object partition space"
        ]
    }
}
...
In addition to being able to run the report above, a cron job
scheduled to run every 2 hours on the primary proxy node of your
cloud environment will run dispersion-report and save
the results to the following location on its local file system:
/var/cache/swift/dispersion-report
When interpreting the results of this report, we recommend consulting the swift Administrator's Guide - Cluster Health section.
9.2 Gathering Swift Data #
The swift-recon
command retrieves data from swift servers
and displays the results. To use this command, log on as a root user to any
node which is running the swift-proxy service.
9.2.1 Notes #
For help with the swift-recon
command you can use this:
tux >
sudo swift-recon --help
The --driveaudit
option is not supported.
SUSE OpenStack Cloud does not support an ec_type of isa_l_rs_vand
with ec_num_parity_fragments greater than or equal to 5
in the storage-policy configuration.
This particular policy is known to harm data durability.
9.2.2 Using the swift-recon Command #
The following command retrieves and displays disk usage information:
tux >
sudo swift-recon --diskusage
For example:
tux >
sudo swift-recon --diskusage
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-09-14 16:01:40] Checking disk usage now
Distribution Graph:
10% 3 *********************************************************************
11% 1 ***********************
12% 2 **********************************************
Disk usage: space used: 13745373184 of 119927734272
Disk usage: space free: 106182361088 of 119927734272
Disk usage: lowest: 10.39%, highest: 12.96%, avg: 11.4613798613%
===============================================================================
In the above example, the results for several nodes are combined together. You can also view the results from individual nodes by adding the -v option as shown in the following example:
tux >
sudo swift-recon --diskusage -v
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-09-14 16:12:30] Checking disk usage now
-> http://192.168.245.3:6000/recon/diskusage: [{'device': 'disk1', 'avail': 17398411264, 'mounted': True, 'used': 2589544448, 'size': 19987955712}, {'device': 'disk0', 'avail': 17904222208, 'mounted': True, 'used': 2083733504, 'size': 19987955712}]
-> http://192.168.245.2:6000/recon/diskusage: [{'device': 'disk1', 'avail': 17769721856, 'mounted': True, 'used': 2218233856, 'size': 19987955712}, {'device': 'disk0', 'avail': 17793581056, 'mounted': True, 'used': 2194374656, 'size': 19987955712}]
-> http://192.168.245.4:6000/recon/diskusage: [{'device': 'disk1', 'avail': 17912147968, 'mounted': True, 'used': 2075807744, 'size': 19987955712}, {'device': 'disk0', 'avail': 17404235776, 'mounted': True, 'used': 2583719936, 'size': 19987955712}]
Distribution Graph:
10% 3 *********************************************************************
11% 1 ***********************
12% 2 **********************************************
Disk usage: space used: 13745414144 of 119927734272
Disk usage: space free: 106182320128 of 119927734272
Disk usage: lowest: 10.39%, highest: 12.96%, avg: 11.4614140152%
===============================================================================
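The per-device figures in the verbose output can be turned into utilization percentages with a short script. The entries below are copied from the example output above:

```python
# Compute per-device utilization from swift-recon --diskusage -v style data.
# The entries below are copied from the example output above.
devices = [
    {"device": "disk1", "used": 2589544448, "size": 19987955712},
    {"device": "disk0", "used": 2083733504, "size": 19987955712},
]

for d in devices:
    pct = 100.0 * d["used"] / d["size"]
    print(f"{d['device']}: {pct:.2f}% used")
```

These per-device percentages are the same figures swift-recon aggregates into its lowest/highest/average summary line.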
By default, swift-recon
uses the object-0 ring for
information about nodes and drives. For some commands, it is appropriate to
specify account,
container, or
object to indicate the type of ring. For
example, to check the checksum of the account ring, use the following:
tux >
sudo swift-recon --md5 account
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-09-14 16:17:28] Checking ring md5sums
3/3 hosts matched, 0 error[s] while checking hosts.
===============================================================================
[2015-09-14 16:17:28] Checking swift.conf md5sum
3/3 hosts matched, 0 error[s] while checking hosts.
===============================================================================
9.3 Gathering Swift Monitoring Metrics #
The swiftlm-scan
command is the mechanism used to gather
metrics for the monasca system. These metrics are used to derive alarms. For
a list of alarms that can be generated from this data, see
Section 18.1.1, “Alarm Resolution Procedures”.
To view the metrics, use the swiftlm-scan
command
directly. Log on to the swift node as the root user. The following example
shows the command and a snippet of the output:
tux >
sudo swiftlm-scan --pretty
. . .
{
"dimensions": {
"device": "sdc",
"hostname": "padawan-ccp-c1-m2-mgmt",
"service": "object-storage"
},
"metric": "swiftlm.swift.drive_audit",
"timestamp": 1442248083,
"value": 0,
"value_meta": {
"msg": "No errors found on device: sdc"
}
},
. . .
To make the JSON file easier to read, use the --pretty
option.
The fields are as follows:
metric
|
Specifies the name of the metric. |
dimensions
|
Provides information about the source or location of the metric. The
dimensions differ depending on the metric in question. The following
dimensions are used by
|
value |
The value of the metric. For many metrics, this is simply the measured
value. For some metrics, the value indicates a status rather than a
measurement.
|
value_meta
|
Additional information. The msg field is the most useful of this information. |
9.3.1 Optional Parameters #
You can focus on specific sets of metrics by using one of the following optional parameters:
--replication |
Checks replication and health status. |
--file-ownership |
Checks that swift owns its relevant files and directories. |
--drive-audit |
Checks for logged events about corrupted sectors (unrecoverable read errors) on drives. |
--connectivity |
Checks connectivity to various servers used by the swift system, including:
|
--swift-services |
Check that the relevant swift processes are running. |
--network-interface |
Checks NIC speed and reports statistics for each interface. |
--check-mounts |
Checks that the node has correctly mounted drives used by swift. |
--hpssacli |
If this server uses a Smart Array Controller, this checks the operation of the controller and disk drives. |
9.4 Using the swift Command-line Client (CLI) #
OpenStackClient (OSC) is a command-line client for OpenStack with a uniform
command structure for OpenStack services. Some swift commands do not
have OSC equivalents. The swift
utility (or swift
CLI) is installed on the Cloud Lifecycle Manager node and also on all other nodes running the
swift proxy service. To use this utility on the Cloud Lifecycle Manager, you can use the
~/service.osrc
file as a basis and then edit it with the
credentials of another user if you need to.
ardana >
cp ~/service.osrc ~/swiftuser.osrc
Then you can use your preferred editor to edit swiftuser.osrc so you can
authenticate using the OS_USERNAME, OS_PASSWORD, and OS_PROJECT_NAME
you wish to use. For example, if you want to use the demo user
that is created automatically for you, it would look like this:
unset OS_DOMAIN_NAME
export OS_IDENTITY_API_VERSION=3
export OS_AUTH_VERSION=3
export OS_PROJECT_NAME=demo
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USERNAME=demo
export OS_USER_DOMAIN_NAME=Default
export OS_PASSWORD=<password>
export OS_AUTH_URL=<auth_URL>
export OS_ENDPOINT_TYPE=internalURL
# OpenstackClient uses OS_INTERFACE instead of OS_ENDPOINT
export OS_INTERFACE=internal
export OS_CACERT=/etc/ssl/certs/ca-certificates.crt
export OS_COMPUTE_API_VERSION=2
You must use the appropriate password for the demo user and select the
correct endpoint for the OS_AUTH_URL value,
which should be in the ~/service.osrc
file you copied.
You can then examine the following account data using this command:
ardana >
openstack object store account show
Example showing an environment with no containers or objects:
ardana >
openstack object store account show
Account: AUTH_205804d000a242d385b8124188284998
Containers: 0
Objects: 0
Bytes: 0
X-Put-Timestamp: 1442249536.31989
Connection: keep-alive
X-Timestamp: 1442249536.31989
X-Trans-Id: tx5493faa15be44efeac2e6-0055f6fb3f
Content-Type: text/plain; charset=utf-8
Use the following command to create a container:
ardana >
openstack container create CONTAINER_NAME
Example, creating a container named documents
:
ardana >
openstack container create documents
The newly created container appears, but contains no objects:
ardana >
openstack container show documents
Account: AUTH_205804d000a242d385b8124188284998
Container: documents
Objects: 0
Bytes: 0
Read ACL:
Write ACL:
Sync To:
Sync Key:
Accept-Ranges: bytes
X-Storage-Policy: General
Connection: keep-alive
X-Timestamp: 1442249637.69486
X-Trans-Id: tx1f59d5f7750f4ae8a3929-0055f6fbcc
Content-Type: text/plain; charset=utf-8
Upload a document:
ardana >
openstack object create CONTAINER_NAME FILENAME
Example:
ardana >
openstack object create documents mydocument
mydocument
List objects in the container:
ardana >
openstack object list CONTAINER_NAME
Example using a container called documents
:
ardana >
openstack object list documents
mydocument
This is a brief introduction to the swift
CLI. Use the
swift --help
command for more information. You can also
use the OpenStack CLI, see openstack -h
for more
information.
9.5 Managing swift Rings #
swift rings are a machine-readable description of which disk drives are used by the Object Storage service (for example, a drive is used to store account or object data). Rings also specify the policy for data storage (for example, defining the number of replicas). The rings are automatically built during the initial deployment of your cloud, with the configuration provided during setup of the SUSE OpenStack Cloud Input Model. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 5 “Input Model”.
After successful deployment of your cloud, you may want to change or modify the configuration for swift. For example, you may want to add or remove swift nodes, add additional storage policies, or upgrade the size of the disk drives. For instructions, see Section 9.5.5, “Applying Input Model Changes to Existing Rings” and Section 9.5.6, “Adding a New Swift Storage Policy”.
The process of modifying or adding a configuration is similar to other
configuration or topology changes in the cloud. Generally, you make the
changes to the input model files at
~/openstack/my_cloud/definition/
on the Cloud Lifecycle Manager and then
run Ansible playbooks to reconfigure the system.
Changes to the rings require several phases to complete, therefore, you may need to run the playbooks several times over several days.
The following topics cover ring management.
9.5.1 Rebalancing Swift Rings #
The swift ring building process tries to distribute data evenly among the available disk drives. The data is stored in partitions. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.10 “Understanding Swift Ring Specifications”.) If you, for example, double the number of disk drives in a ring, you need to move 50% of the partitions to the new drives so that all drives contain the same number of partitions (and hence same amount of data). However, it is not possible to move the partitions in a single step. It can take minutes to hours to move partitions from the original drives to their new drives (this process is called the replication process).
If you move all partitions at once, there would be a period where swift would expect to find partitions on the new drives, but the data has not yet replicated there so that swift could not return the data to the user. Therefore, swift will not be able to find all of the data in the middle of replication because some data has finished replication while other bits of data are still in the old locations and have not yet been moved. So it is considered best practice to move only one replica at a time. If the replica count is 3, you could first move 16.6% of the partitions and then wait until all data has replicated. Then move another 16.6% of partitions. Wait again and then finally move the remaining 16.6% of partitions. For any given object, only one of the replicas is moved at a time.
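The fractions quoted above can be derived directly: with a replica count of 3 and a change that requires 50% of partitions to move, one replica's worth of that data moves per pass.

```python
# Staged rebalance arithmetic from the paragraph above: doubling the drives
# means 50% of partitions must eventually move, spread over one pass per replica.
replica_count = 3
fraction_to_move = 0.5                 # doubling the number of drives

per_pass = fraction_to_move / replica_count
print(f"{per_pass:.1%} of partitions per pass, over {replica_count} passes")
```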
9.5.1.1 Reasons to Move Partitions Gradually #
Due to the following factors, you must move the partitions gradually:
Not all devices are of the same size. SUSE OpenStack Cloud 9 automatically assigns different weights to drives so that smaller drives store fewer partitions than larger drives.
The process attempts to keep replicas of the same partition in different servers.
Making a large change in one step (for example, doubling the number of drives in the ring), would result in a lot of network traffic due to the replication process and the system performance suffers. There are two ways to mitigate this:
Add servers in smaller groups
Set the weight-step attribute in the ring specification. For more information, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
9.5.2 Using the Weight-Step Attributes to Prepare for Ring Changes #
swift rings are built during a deployment and this process sets the weights of disk drives such that smaller disk drives have a smaller weight than larger disk drives. When making changes in the ring, you should limit the amount of change that occurs. SUSE OpenStack Cloud 9 does this by limiting the weights of the new drives to a smaller value and then building new rings. Once the replication process has finished, SUSE OpenStack Cloud 9 will increase the weight and rebuild rings to trigger another round of replication. (For more information, see Section 9.5.1, “Rebalancing Swift Rings”.)
In addition, you should become familiar with how the replication process
behaves on your system during normal operation. Before making ring changes,
use the swift-recon
command to determine the typical
oldest replication times for your system. For instructions, see
Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.
In SUSE OpenStack Cloud, the weight-step attribute is set in the ring specification of the input model. The weight-step value specifies a maximum value for the change of the weight of a drive in any single rebalance. For example, if you add a drive of 4TB, you would normally assign a weight of 4096. However, if the weight-step attribute is set to 1024, the weight is initially set to 1024 when you add that drive. Each subsequent rebalance raises the weight by at most 1024 until the final value of 4096 is reached.
The value of the weight-step attribute depends on the size of the drives, the number of servers being added, and how experienced you are with the replication process. A common starting value is 20% of the size of an individual drive. For example, when adding 4TB drives, a value of 820 would be appropriate. As you gain more experience with your system, you may increase or reduce this value.
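The weight progression and the 20% starting point can be sketched as follows. The helper below is illustrative only, not a product tool:

```python
import math

# Illustrative helper (not a product tool): the weight applied to a new drive
# after each successive rebalance, when weight-step caps each change.
def weight_schedule(target, step):
    weight, schedule = 0, []
    while weight < target:
        weight = min(weight + step, target)
        schedule.append(weight)
    return schedule

# A 4 TB drive (target weight 4096) added with weight-step 1024.
print(weight_schedule(4096, 1024))

# The "20% of drive size" starting point for weight-step on a 4 TB drive.
print(math.ceil(4096 * 0.20))
```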
9.5.2.1 Setting the weight-step attribute #
Perform the following steps to set the weight-step attribute:
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file containing the ring-specifications for the account, container, and object rings.
Add the weight-step attribute to the ring in this format:
- name: account
  weight-step: WEIGHT_STEP_VALUE
  display-name: Account Ring
  min-part-hours: 16
  ...
For example, to set weight-step to 820, add the attribute like this:
- name: account
  weight-step: 820
  display-name: Account Ring
  min-part-hours: 16
  ...
Repeat step 2 for the other rings, if necessary (container, object-0, etc).
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Use the playbook to create a deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To complete the configuration, use the ansible playbooks documented in Section 9.5.3, “Managing Rings Using swift Playbooks”.
9.5.3 Managing Rings Using swift Playbooks #
The following table describes how playbooks relate to ring management.
All of these playbooks will be run from the Cloud Lifecycle Manager from the
~/scratch/ansible/next/ardana/ansible
directory.
Playbook | Description | Notes |
---|---|---|
swift-update-from-model-rebalance-rings.yml
|
There are two steps in this playbook:
|
This playbook performs its actions on the first node running the swift-proxy service. (For more information, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”.) However, it also scans all swift nodes to find the size of disk drives.
If there are no changes in the ring delta, the
|
swift-compare-model-rings.yml
|
There are two steps in this playbook:
The playbook reports any issues or problems it finds with the input model. This playbook can be useful to confirm that there are no errors in the input model. It also allows you to check that when you change the input model, that the proposed ring changes are as expected. For example, if you have added a server to the input model, but this playbook reports that no drives are being added, you should determine the cause. |
There is troubleshooting information related to the information that you receive in this report that you can view on this page: Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”. |
swift-deploy.yml
|
|
This playbook is included in the |
swift-reconfigure.yml
|
|
Every time that you directly use the
|
9.5.3.1 Optional Ansible variables related to ring management #
The following optional variables may be specified when running the playbooks
outlined above. They are specified using the --extra-vars
option.
Variable | Description and Use |
---|---|
limit_ring
|
Limit changes to the named ring. Other rings will not be examined or
updated. This option may be used with any of the swift playbooks. For
example, to only update the
|
drive_detail |
Used only with the swift-compare-model-rings.yml playbook. The playbook will include details of changes to every drive where the model and existing rings differ. If you omit the drive_detail variable, only summary information is provided. The following shows how to use the drive_detail variable:
|
9.5.3.2 Interpreting the report from the swift-compare-model-rings.yml playbook #
The swift-compare-model-rings.yml
playbook compares the
existing swift rings with the input model and prints a report telling you
how the rings and the model differ. Specifically, it will tell you what
actions will take place when you next run the
swift-update-from-model-rebalance-rings.yml
playbook (or
a playbook such as ardana-deploy.yml
that runs
swift-update-from-model-rebalance-rings.yml
).
The swift-compare-model-rings.yml
playbook will make no
changes, but is just an advisory report.
Here is an example output from the playbook. The report is between "report.stdout_lines" and "PLAY RECAP":
TASK: [swiftlm-ring-supervisor | validate-input-model | Print report] *********
ok: [ardana-cp1-c1-m1-mgmt] => {
    "var": {
        "report.stdout_lines": [
            "Rings:",
            "  ACCOUNT:",
            "    ring exists (minimum time to next rebalance: 8:07:33)",
            "    will remove 1 devices (18.00GB)",
            "    ring will be rebalanced",
            "  CONTAINER:",
            "    ring exists (minimum time to next rebalance: 8:07:35)",
            "    no device changes",
            "    ring will be rebalanced",
            "  OBJECT-0:",
            "    ring exists (minimum time to next rebalance: 8:07:34)",
            "    no device changes",
            "    ring will be rebalanced"
        ]
    }
}
The following describes the report in more detail:
Message | Description |
---|---|
ring exists |
The ring already exists on the system. |
ring will be created |
The ring does not yet exist on the system. |
no device changes |
The devices in the ring exactly match the input model. There are no servers being added or removed and the weights are appropriate for the size of the drives. |
minimum time to next rebalance |
If this time is zero, the ring can be rebalanced immediately.
If the time is non-zero, it means that not enough time has elapsed
since the ring was last rebalanced. Even if you run a swift playbook
that attempts to change the ring, the ring will not actually
rebalance. This time is determined by the min-part-hours attribute.
|
set-weight ardana-ccp-c1-m1-mgmt:disk0:/dev/sdc 8.00 > 12.00 > 18.63 |
The weight of disk0 (mounted on /dev/sdc) on server ardana-ccp-c1-m1-mgmt is 8.00, will be set to 12.00 in this rebalance, and has a final target weight of 18.63.
This information is only shown when you use the drive_detail variable.
|
will change weight on 12 devices (6.00TB) |
The weight of 12 devices will be increased. This might happen, for
example, if a server had been added in a prior ring update. However,
with use of the weight-step attribute, the weight may be raised over
several rebalances rather than in a single step. |
add: ardana-ccp-c1-m1-mgmt:disk0:/dev/sdc |
The disk0 device will be added to the ardana-ccp-c1-m1-mgmt server. This happens when a server is added to the input model or if a disk model is changed to add additional devices.
This information is only shown when you use the drive_detail variable.
|
remove: ardana-ccp-c1-m1-mgmt:disk0:/dev/sdc |
The device is no longer in the input model and will be removed from the ring. This happens if a server is removed from the model, a disk drive is removed from a disk model or the server is marked for removal using the pass-through feature.
This information is only shown when you use the drive_detail variable.
|
will add 12 devices (6TB) |
There are 12 devices in the input model that have not yet been added to the ring. Usually this is because one or more servers have been added. In this example, this could be one server with 12 drives or two servers, each with 6 drives. The size in the report is the change in total available capacity. When the weight-step attribute is used, this may be a fraction of the total size of the disk drives. In this example, 6TB of capacity is being added. For example, if your system currently has 100TB of available storage, when these devices are added, there will be 106TB of available storage. If your system is 50% utilized, this means that when the ring is rebalanced, up to 3TB of data may be moved by the replication process. This is an estimate - in practice, because only one copy of a given replica is moved in any given rebalance, it may not be possible to move this amount of data in a single ring rebalance. |
will remove 12 devices (6TB) |
There are 12 devices in rings that no longer appear in the input model. Usually this is because one or more servers have been removed. In this example, this could be one server with 12 drives or two servers, each with 6 drives. The size in the report is the change in total removed capacity. In this example, 6TB of capacity is being removed. For example, if your system currently has 100TB of available storage, when these devices are removed, there will be 94TB of available storage. If your system is 50% utilized, this means that when the ring is rebalanced, approximately 3TB of data must be moved by the replication process. |
min-part-hours will be changed |
The min-part-hours value in the input model differs from the value used to build the existing ring and will be updated. |
replica-count will be changed |
The replica-count value in the input model differs from the value used to build the existing ring and will be updated. |
ring will be rebalanced |
This is always reported. Every time the
swift-update-from-model-rebalance-rings.yml playbook runs, it attempts to rebalance the rings.
|
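The capacity arithmetic in the add/remove rows above can be checked with a quick calculation:

```python
# Back-of-envelope check of the "will add 12 devices (6TB)" example above:
# adding 6 TB to a 100 TB system that is 50% utilized.
current_tb, added_tb, utilization = 100, 6, 0.50

new_total = current_tb + added_tb
moved_upper_bound = added_tb * utilization   # data the rebalance may move

print(f"new capacity: {new_total} TB, up to {moved_upper_bound:.0f} TB moved")
```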
9.5.4 Determining When to Rebalance and Deploy a New Ring #
Before deploying a new ring, you must be sure the change that has been applied to the last ring is complete (that is, all the partitions are in their correct location). There are three aspects to this:
Is the replication system busy?
You might want to postpone a ring change until after replication has finished. If the replication system is busy repairing a failed drive, a ring change will place additional load on the system. To check that replication has finished, use the swift-recon command with the --replication argument. (For more information, see Section 9.2, “Gathering Swift Data”.) The oldest completion time can indicate that the replication process is very busy: if it is more than 15 or 20 minutes, the object replication process is probably still very busy. The following example indicates that the oldest completion was 120 seconds ago, so the replication process is probably not busy:
root # swift-recon --replication
===============================================================================
--> Starting reconnaissance on 3 hosts
===============================================================================
[2015-10-02 15:31:45] Checking on replication
[replication_time] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%, no_result: 0, reported: 3
Oldest completion was 2015-10-02 15:31:32 (120 seconds ago) by 192.168.245.4:6000.
Most recent completion was 2015-10-02 15:31:43 (10 seconds ago) by 192.168.245.3:6000.
===============================================================================
Are there drive or server failures?
A drive failure does not preclude deploying a new ring. In principle, there should be two copies elsewhere. However, another drive failure in the middle of replication might make data temporary unavailable. If possible, postpone ring changes until all servers and drives are operating normally.
Has min-part-hours elapsed?
The swift-ring-builder will refuse to build a new ring until min-part-hours has elapsed since the last time it built rings. You must postpone changes until this time has elapsed.
You can determine how long you must wait by running the swift-compare-model-rings.yml playbook, which will tell you how long until the min-part-hours has elapsed. For more details, see Section 9.5.3, “Managing Rings Using swift Playbooks”.
You can change the value of min-part-hours. (For instructions, see Section 9.5.7, “Changing min-part-hours in Swift”.)
Is the swift dispersion report clean?
Run the swift-dispersion-report.yml playbook (as described in Section 9.1, “Running the swift Dispersion Report”) and examine the results. If the replication process has not yet replicated partitions that were moved to new drives in the last ring rebalance, the dispersion report will indicate that some containers or objects are missing a copy. For example:
There were 462 partitions missing one copy.
Assuming all servers and disk drives are operational, the reason for the missing partitions is that the replication process has not yet managed to copy a replica into the partitions.
You should wait an hour, rerun the dispersion report, and examine the results. The number of partitions missing one copy should have decreased. Continue to wait until this reaches zero before making any further ring rebalances.
Note: It is normal to see partitions missing one copy if disk drives or servers are down. If all servers are up and all disk drives are mounted, and you did not recently perform a ring rebalance, you should investigate whether there are problems with the replication process. You can use the Operations Console to investigate replication issues.
Important: If there are any partitions missing two copies, you must reboot or repair any failed servers and disk drives as soon as possible. Do not shut down any swift nodes in this situation. Assuming a replica count of 3, if you are missing two copies you are in danger of losing the only remaining copy.
9.5.5 Applying Input Model Changes to Existing Rings #
This page describes a general approach for making changes to your existing swift rings. This approach applies to actions such as adding and removing a server and replacing and upgrading disk drives, and must be performed as a series of phases, as shown below:
9.5.5.1 Changing the Input Model Configuration Files #
The first step to apply new changes to the swift environment is to update the configuration files. Follow these steps:
Log in to the Cloud Lifecycle Manager.
Set the weight-step attribute, as needed, for the nodes you are altering. (For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”).
Edit the configuration files as part of the Input Model as appropriate. (For general information about the Input Model, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.14 “Networks”. For more specific information about the swift parts of the configuration files, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”)
Once you have completed all of the changes, commit your configuration to the local git repository. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”.):
ardana >
git add -Aardana >
git commit -m "commit message"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlCreate a deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlRun the swift playbook that will validate your configuration files and give you a report as an output:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.ymlUse the report to validate that the number of drives proposed to be added or deleted, or the weight change, is correct. Fix any errors in your input model. At this stage, no changes have been made to rings.
9.5.5.2 First phase of Ring Rebalance #
To begin the rebalancing of the swift rings, follow these steps:
After going through the steps in the section above, deploy your changes to all of the swift nodes in your environment by running this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-deploy.ymlWait until replication has finished or
min-part-hours
has elapsed (whichever is longer). For more information, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.
9.5.5.3 Weight Change Phase of Ring Rebalance #
At this stage, no changes have been made to the input model. However, when
you set the weight-step
attribute, the rings that were
rebuilt in the previous rebalance phase have weights that are different from
their target/final value. You gradually move to the target/final weight by
rebalancing a number of times as described on this page. For more
information about the weight-step attribute, see
Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
To begin the re-balancing of the rings, follow these steps:
Rebalance the rings by running the playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.ymlRun the reconfiguration:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.ymlWait until replication has finished or
min-part-hours
has elapsed (whichever is longer). For more information, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”. Run the following command and review the report:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*The following is an example of the output after executing the above command. In the example no weight changes are proposed:
TASK: [swiftlm-ring-supervisor | validate-input-model | Print report] ********* ok: [padawan-ccp-c1-m1-mgmt] => { "var": { "report.stdout_lines": [ "Need to add 0 devices", "Need to remove 0 devices", "Need to set weight on 0 devices" ] } }
When there are no proposed weight changes, you proceed to the final phase.
If there are proposed weight changes, repeat this phase.
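The proceed-or-repeat decision above can be automated by scanning the report lines. A hedged sketch, assuming the line format from the example output; the helper name is ours:

```python
import re

# Report lines as printed by the swift-compare-model-rings.yml playbook
report_lines = [
    "Need to add 0 devices",
    "Need to remove 0 devices",
    "Need to set weight on 0 devices",
]

def changes_proposed(lines):
    """Return True if the report proposes any device or weight changes."""
    counts = [int(m.group(1))
              for line in lines
              if (m := re.search(r"(\d+) devices", line))]
    return any(counts)

# All counts are zero: proceed to the final rebalance phase
print("final phase" if not changes_proposed(report_lines) else "repeat phase")
```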
9.5.5.4 Final Rebalance Phase #
The final rebalance phase moves all replicas to their final destination.
Rebalance the rings by running the playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml | tee /tmp/rebalance.log
Note: The output is saved for later reference.
Review the output from the previous step. If the output for all rings is similar to the following, the rebalance had no effect. That is, the rings are balanced and no further changes are needed. In addition, the ring files were not changed so you do not need to deploy them to the swift nodes:
"Running: swift-ring-builder /etc/swiftlm/cloud1/cp1/builder_dir/account.builder rebalance 999", "NOTE: No partitions could be reassigned.", "Either none need to be or none can be due to min_part_hours [16]."
The text No partitions could be reassigned indicates that no further rebalances are necessary. If this is true for all the rings, you have completed the final phase.
Note: You must have allowed enough time to elapse since the last rebalance. As shown in the above example,
min_part_hours [16]
means that you must wait at least 16 hours since the last rebalance. If not, wait until enough time has elapsed and repeat this phase. Run the
swift-reconfigure.yml
playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.ymlWait until replication has finished or
min-part-hours
has elapsed (whichever is longer). For more information, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”. Repeat the above steps until the ring is rebalanced.
9.5.5.5 System Changes that Change Existing Rings #
There are many system changes, ranging from adding servers to replacing drives, that might require you to rebuild and rebalance your rings.
Actions | Process |
---|---|
Adding Server(s) |
|
Removing Server(s) |
In SUSE OpenStack Cloud, when you remove servers from the input model, the
disk drives are removed from the ring - the weight is not gradually
reduced using the
|
Adding Disk Drive(s) |
|
Replacing Disk Drive(s) |
When a drive fails, replace it as soon as possible. Do not attempt to remove it from the ring - this creates operator overhead. swift will continue to store the correct number of replicas by handing off objects to other drives instead of the failed drive.
If the disk drives are of the same size as the original when the
drive is replaced, no ring changes are required. You can confirm this
by running the
For a single drive replacement, even if the drive is significantly larger than the original drives, you do not need to rebalance the ring (however, the extra space on the drive will not be used). |
Upgrading Disk Drives |
If the drives are a different size (for example, you are upgrading your system), you can proceed as follows:
|
Removing Disk Drive(s) |
When removing a disk drive from the input model, keep in mind that this
drops the disk out of the ring without allowing swift to move the data
off it first. While it should be fine in a properly replicated healthy
cluster, we do not recommend this approach. A better solution is to
step down |
9.5.6 Adding a New Swift Storage Policy #
This page describes how to add an additional storage policy to an existing system. For an overview of storage policies, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.11 “Designing Storage Policies”.
To Add a Storage Policy
Perform the following steps to add the storage policy to an existing system.
Log in to the Cloud Lifecycle Manager.
Select a storage policy index and ring name.
For example, if you already have object-0 and object-1 rings in your ring-specifications (usually in the
~/openstack/my_cloud/definition/data/swift/swift_config.yml
file), the next index is 2 and the ring name is object-2.
Select a user-visible name so that you can see it when you examine container metadata or when you want to specify the storage policy used when you create a container. The name should be a single word (hyphens and dashes are allowed).
Decide if this new policy will be the default for all new containers.
Decide on other attributes such as
partition-power
andreplica-count
if you are using a standard replication ring. If you are using an erasure coded ring, you also need to decide on additional attributes:ec-type
,ec-num-data-fragments
,ec-num-parity-fragments
, andec-object-segment-size
. For more details on the required attributes, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.10 “Understanding Swift Ring Specifications”.Edit the
ring-specifications
attribute (usually in the~/openstack/my_cloud/definition/data/swift/swift_config.yml
file) and add the new ring specification. If this policy is to be the default storage policy for new containers, set thedefault
attribute to yes.
Note: Ensure that only one object ring has the default attribute set to yes. If you set two rings as default, swift processes will not start.
Do not specify the
weight-step
attribute for the new object ring. Since this is a new ring there is no need to gradually increase device weights.
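Pulling the points above together, a hedged sketch of what the new entry in the ring-specifications might look like for a replicated object-2 ring follows; the attribute values shown are illustrative, not recommendations:

```yaml
ring-specifications:
  - region: region1
    rings:
      # ... existing account, container, object-0, object-1 entries ...
      - name: object-2
        display-name: General
        default: yes            # only if object-2 is to be the new default policy
        partition-power: 17     # illustrative value
        min-part-hours: 16      # illustrative value
        replication-policy:
          replica-count: 3
        # note: no weight-step attribute for a brand-new ring
```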
Update the appropriate disk model to use the new storage policy (for example, the
data/disks_swobj.yml
file). The following sample shows that object-2 has been added to the list of existing rings that use the drives:disk-models: - name: SWOBJ-DISKS ... device-groups: - name: swobj devices: ... consumer: name: swift attrs: rings: - object-0 - object-1 - object-2 ...
NoteYou must use the new object ring on at least one node that runs the
swift-object
service. If you skip this step and continue to run theswift-compare-model-rings.yml
orswift-deploy.yml
playbooks, they will fail with an error There are no devices in this ring, or all devices have been deleted, as shown below:TASK: [swiftlm-ring-supervisor | build-rings | Build ring (make-delta, rebalance)] *** failed: [padawan-ccp-c1-m1-mgmt] => {"changed": true, "cmd": ["swiftlm-ring-supervisor", "--make-delta", "--rebalance"], "delta": "0:00:03.511929", "end": "2015-10-07 14:02:03.610226", "rc": 2, "start": "2015-10-07 14:02:00.098297", "warnings": []} ... Running: swift-ring-builder /etc/swiftlm/cloud1/cp1/builder_dir/object-2.builder rebalance 999 ERROR: ------------------------------------------------------------------------------- An error has occurred during ring validation. Common causes of failure are rings that are empty or do not have enough devices to accommodate the replica count. Original exception message: There are no devices in this ring, or all devices have been deleted -------------------------------------------------------------------------------
Commit your configuration:
ardana >
git add -Aardana >
git commit -m "commit message"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlCreate a deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlValidate the changes by running the
swift-compare-model-rings.yml
playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.ymlIf any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”. Then, re-run steps 5 - 10.
Create the new ring (for example, object-2). Then verify the swift service status and reconfigure the swift node to use a new storage policy, by running these playbooks:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-status.ymlardana >
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
After adding a storage policy, there is no need to rebalance the ring.
9.5.7 Changing min-part-hours in Swift #
The min-part-hours
parameter specifies the number of
hours you must wait before swift will allow a given partition to be moved.
In other words, it constrains how often you perform ring rebalance
operations. Before changing this value, you should get some experience with
how long it takes your system to perform replication after you make ring
changes (for example, when you add servers).
See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for more information about determining when replication has completed.
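The waiting rule reduces to simple arithmetic: the next rebalance is allowed min-part-hours after the last ring build. A sketch with illustrative timestamps:

```python
from datetime import datetime, timedelta

min_part_hours = 16                               # value from the ring specification
last_rebalance = datetime(2015, 10, 2, 8, 0, 0)   # illustrative timestamps
now = datetime(2015, 10, 2, 15, 30, 0)

allowed_at = last_rebalance + timedelta(hours=min_part_hours)
remaining = allowed_at - now

if remaining > timedelta(0):
    print("wait %s before the next rebalance" % remaining)
else:
    print("min-part-hours has elapsed; rebalance is allowed")
```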
9.5.7.1 Changing the min-part-hours Value #
To change the min-part-hours
value, follow these
steps:
Log in to the Cloud Lifecycle Manager.
Edit your
~/openstack/my_cloud/definition/data/swift/swift_config.yml
file and change the value(s) ofmin-part-hours
for the rings you desire. The value is expressed in hours and a value of zero is not allowed.Commit your configuration to the local Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana >
cd ~/openstack/ardana/ansibleardana >
git add -Aardana >
git commit -m "My config or other commit message"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlApply the changes by running this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
9.5.8 Changing Swift Zone Layout #
Before changing the number of swift zones or the assignment of servers to specific zones, you must ensure that your system has sufficient storage available to perform the operation. Specifically, if you are adding a new zone, you may need additional storage. There are two reasons for this:
You cannot simply change the swift zone number of disk drives in the ring. Instead, you need to remove the server(s) from the ring and then re-add the server(s) with a new swift zone number to the ring. At the point where the servers are removed from the ring, there must be sufficient spare capacity on the remaining servers to hold the data that was originally hosted on the removed servers.
The total amount of storage in each swift zone must be the same. This is because new data is added to each zone at the same rate. If one zone has a lower capacity than the other zones, once that zone becomes full, you cannot add more data to the system – even if there is unused space in the other zones.
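The equal-capacity constraint above can be sanity-checked before committing a zone change. A minimal sketch with illustrative per-zone capacities:

```python
# Total usable capacity (in TB) per planned swift zone -- illustrative numbers
zone_capacity = {1: 120, 2: 120, 3: 100}

smallest = min(zone_capacity.values())
largest = max(zone_capacity.values())

# Once the smallest zone fills up, no more data can be added, so the
# effective cluster capacity is bounded by the smallest zone.
effective = smallest * len(zone_capacity)
print("effective capacity: %d TB (raw: %d TB)"
      % (effective, sum(zone_capacity.values())))
if smallest != largest:
    print("warning: zones are unbalanced")
```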
As mentioned above, you cannot simply change the swift zone number of disk drives in an existing ring. Instead, you must remove and then re-add servers. This is a summary of the process:
Identify appropriate server groups that correspond to the desired swift zone layout.
Remove the servers in a server group from the rings. This process may be protracted; you can limit the amount of replication traffic that happens at once either by removing servers in small batches or by using the weight-step attribute.
Once all the targeted servers are removed, edit the
swift-zones
attribute in the ring specifications to add or remove a swift zone.Re-add the servers you had temporarily removed to the rings. Again you may need to do this in batches or rely on the weight-step attribute.
Continue removing and re-adding servers until you reach your final configuration.
9.5.8.1 Process for Changing Swift Zones #
This section describes the detailed process of reorganizing swift zones. As a concrete example, we assume we start with a single swift zone and the target is three swift zones. The same general process applies if you are reducing the number of zones.
The process is as follows:
Identify the appropriate server groups that represent the desired final state. In this example, we are going to change the swift zone layout as follows:
Original Layout:
swift-zones: - id: 1 server-groups: - AZ1 - AZ2 - AZ3
Target Layout:
swift-zones: - id: 1 server-groups: - AZ1 - id: 2 - AZ2 - id: 3 - AZ3
The plan is to move servers from server groups
AZ2
andAZ3
to a new swift zone number. The servers inAZ1
will remain in swift zone 1.
If you have not already done so, consider setting the weight-step attribute as described in Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Identify the servers in the
AZ2
server group. You may remove all servers at once or remove them in batches. If this is the first time you have performed a major ring change, we suggest you remove only one or two servers in the first batch. When you see how long this takes and the impact replication has on your system, you can use that experience to decide whether to remove a larger batch of servers, or increase or decrease the weight-step attribute, for the next server-removal cycle. To remove a server, use steps 2-9 as described in Section 15.1.5.1.4, “Removing a Swift Node”, ensuring that you do not remove the servers from the input model.
This process may take a number of ring rebalance cycles until the disk drives are removed from the ring files. Once this happens, you can edit the ring specifications and add swift zone 2 as shown in this example:
swift-zones: - id: 1 server-groups: - AZ1 - AZ3 - id: 2 - AZ2
The server removal process in step #3 sets the "remove" attribute in the
pass-through
attribute of the servers in server group AZ2.
Edit the input model files and remove this pass-through
attribute. This signals to the system that the servers should be used the next time we rebalance the rings (that is, the servers should be added to the rings). Commit your configuration to the local Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana >
cd ~/openstack/ardana/ansibleardana >
git add -Aardana >
git commit -m "My config or other commit message"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUse the playbook to create a deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlRebuild and deploy the swift rings containing the re-added servers by running this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-deploy.ymlWait until replication has finished. For more details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.
You may need to continue to rebalance the rings. For instructions, see the "Final Rebalance Stage" steps at Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
At this stage, the servers in server group
AZ2
are responsible for swift zone 2. Repeat the process in steps #3-9 to remove the servers in server groupAZ3
from the rings and then re-add them to swift zone 3. The ring specifications for zones (step 4) should be as follows:swift-zones: - id: 1 server-groups: - AZ1 - id: 2 - AZ2 - id: 3 - AZ3
Once complete, all data should be dispersed across the swift zones as specified in the input model (that is, each replica is located in a different swift zone).
9.6 Configuring your swift System to Allow Container Sync #
swift has a feature where all the contents of a container can be mirrored to another container through background synchronization. swift operators configure their system to allow/accept sync requests to/from other systems, and the user specifies where to sync their container to along with a secret synchronization key. For an overview of this feature, refer to OpenStack swift - Container to Container Synchronization.
9.6.1 Notes and limitations #
Container synchronization is done as a background action. When you put an object into the source container, it will take some time before it becomes visible in the destination container. Storage services will not necessarily copy objects in any particular order, meaning they may be transferred in a different order from that in which they were created.
Container sync may not be able to keep up with a moderate upload rate to a container. For example, if the average object upload rate to a container is greater than one object per second, then container sync may not be able to keep the objects synced.
If container sync is enabled on a container that already has a large number of objects then container sync may take a long time to sync the data. For example, a container with one million 1KB objects could take more than 11 days to complete a sync.
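The 11-day figure follows directly from the one-object-per-second rate. A quick check:

```python
objects = 1_000_000   # objects already in the container
rate = 1              # objects per second container sync can copy (approx.)

seconds = objects / rate
days = seconds / (24 * 60 * 60)
print("approx. %.1f days to complete the initial sync" % days)
```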
You may operate on the destination container just like any other container -- adding or deleting objects -- including the objects that are in the destination container because they were copied from the source container. To decide how to handle object creation, replacement, or deletion, the system uses timestamps to determine what to do. In general, the latest timestamp "wins". That is, if you create an object, replace it, delete it, and then re-create it, the destination container will eventually contain the most recently created object. However, if you also create and delete objects in the destination container, you get some subtle behaviors, as follows:
If an object is copied to the destination container and then deleted, it remains deleted in the destination even though there is still a copy in the source container. If you modify the object (replace or change its metadata) in the source container, it will reappear in the destination again.
The same applies to a replacement or metadata modification of an object in the destination container -- the object will remain as-is unless there is a replacement or modification in the source container.
If you replace or modify metadata of an object in the destination container and then delete it in the source container, it is not deleted from the destination. This is because your modified object has a later timestamp than the object you deleted in the source.
If you create an object in the source container and, before the system has had a chance to copy it to the destination, you create an object with the same name in the destination, then the object in the destination is not overwritten by the source container's object.
Segmented objects
Segmented objects (objects larger than 5GB) will not work seamlessly with container synchronization. If the manifest object is copied to the destination container before the object segments, a GET operation on the manifest object may fail to find some or all of the object segments. If your manifest and object segments are in different containers, do not forget that both containers must be synchronized and that the container name of the object segments must be the same on both source and destination.
9.6.2 Prerequisites #
Container to container synchronization requires that SSL certificates are configured on both the source and destination systems. For more information on how to implement SSL, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 41 “Configuring Transport Layer Security (TLS)”.
9.6.3 Configuring container sync #
Container to container synchronization requires that both the source and destination swift systems involved be configured to allow/accept this. In the context of container to container synchronization, swift uses the term cluster to denote a swift system. swift clusters correspond to Control Planes in OpenStack terminology.
Gather the public API endpoints for both swift systems
Gather information about the external/public URL used by each system, as follows:
On the Cloud Lifecycle Manager of one system, get the public API endpoint of the system by running the following commands:
ardana >
source ~/service.osrcardana >
openstack endpoint list | grep swiftThe output of the command will look similar to this:
ardana >
openstack endpoint list | grep swift | 063a84b205c44887bc606c3ba84fa608 | region0 | swift | object-store | True | admin | https://10.13.111.176:8080/v1/AUTH_%(tenant_id)s | | 3c46a9b2a5f94163bb5703a1a0d4d37b | region0 | swift | object-store | True | public | https://10.13.120.105:8080/v1/AUTH_%(tenant_id)s | | a7b2f4ab5ad14330a7748c950962b188 | region0 | swift | object-store | True | internal | https://10.13.111.176:8080/v1/AUTH_%(tenant_id)s |The portion that you want is the endpoint up to, but not including, the
AUTH
part. In the above example, this is https://10.13.120.105:8080/v1.
Repeat these steps on the other swift system so you have the public API endpoints for both systems.
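Trimming the endpoint down to the /v1 prefix can be scripted. A hedged sketch; the URL is taken from the example output above and the helper name is ours:

```python
# Public endpoint as reported by `openstack endpoint list`
endpoint = "https://10.13.120.105:8080/v1/AUTH_%(tenant_id)s"

def cluster_endpoint(url):
    """Strip everything after /v1 -- container sync wants the bare prefix."""
    prefix, _, _ = url.partition("/v1")
    return prefix + "/v1"

print(cluster_endpoint(endpoint))
```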
Validate connectivity between both systems
The swift nodes running the swift-container
service must
be able to connect to the public API endpoints of each other for the
container sync to work. You can validate connectivity on each system using
these steps.
For the sake of the examples, we will use the terms source and destination to denote the nodes doing the synchronization.
Log in to a swift node running the
swift-container
service on the source system. You can determine this by looking at the service list in your~/openstack/my_cloud/info/service_info.yml
file for a list of the servers containing this service.Verify the SSL certificates by running this command against the destination swift server:
echo | openssl s_client -connect PUBLIC_API_ENDPOINT:8080 -CAfile /etc/ssl/certs/ca-certificates.crt
If the connection was successful you should see a return code of
0 (ok)
similar to this:... Timeout : 300 (sec) Verify return code: 0 (ok)
Also verify that the source node can connect to the destination swift system using this command:
ardana >
curl -k DESTINATION_IP OR HOSTNAME:8080/healthcheckIf the connection was successful, you should see a response of
OK
. Repeat these verification steps on any system involved in your container synchronization setup.
Configure container to container synchronization
Both the source and destination swift systems must be configured the same way, using sync realms. For more details on how sync realms work, see OpenStack swift - Configuring Container Sync.
To configure one of the systems, follow these steps:
Log in to the Cloud Lifecycle Manager.
Edit the
~/openstack/my_cloud/config/swift/container-sync-realms.conf.j2
file and uncomment the sync realm section.Here is a sample showing this section in the file:
#Add sync realms here, for example: # [realm1] # key = realm1key # key2 = realm1key2 # cluster_name1 = https://host1/v1/ # cluster_name2 = https://host2/v1/
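For illustration, a filled-in realm section might look like the following; the realm name, keys, and endpoints are hypothetical, and the cluster values are the public API endpoints gathered earlier:

```ini
[realm1]
key = realm1key
key2 = realm1key2
cluster_cloud1 = https://10.13.120.105:8080/v1/
cluster_cloud2 = https://10.14.120.105:8080/v1/
```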
Add in the details for your source and destination systems. Each realm you define is a set of clusters that have agreed to allow container syncing between them. These values are case sensitive.
Only one
key
is required. The second key is optional and can be provided to allow an operator to rotate keys if desired. The values for the clusters must contain the prefix cluster_
and will be populated with the public API endpoints for the systems.
Commit the changes to git:
ardana >
git add -Aardana >
git commit -a -m "Add node <name>"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUpdate the deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlRun the swift reconfigure playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.ymlRun this command to validate that your container synchronization is configured:
ardana >
source ~/service.osrcardana >
swift capabilitiesHere is a snippet of the output showing the container sync information. This should be populated with your cluster names:
... Additional middleware: container_sync Options: realms: {u'INTRACLUSTER': {u'clusters': {u'THISCLUSTER': {}}}}
Repeat these steps on any other swift systems that will be involved in your sync realms.
9.6.4 Configuring Intra Cluster Container Sync #
It is possible to use the swift container sync functionality to sync objects
between containers within the same swift system. swift is automatically
configured to allow intra cluster container sync. Each swift PAC server will
have an intracluster container sync realm defined in
/etc/swift/container-sync-realms.conf
.
For example:
# The intracluster realm facilitates syncing containers on this system [intracluster] key = lQ8JjuZfO # key2 = cluster_thiscluster = http://SWIFT-PROXY-VIP:8080/v1/
The keys defined in /etc/swift/container-sync-realms.conf
are used by the container-sync daemon to determine trust. In addition, the
two containers being synced need a separate shared key, which both define
in their container metadata to establish trust between each other.
Create two containers, for example, container-src and container-dst. In this example, we will sync one way, from container-src to container-dst.
ardana >
openstack container create container-srcardana >
openstack container create container-dstDetermine your swift account. In the following example it is AUTH_1234
ardana >
swift stat Account: AUTH_1234 Containers: 3 Objects: 42 Bytes: 21692421 Containers in policy "erasure-code-ring": 3 Objects in policy "erasure-code-ring": 42 Bytes in policy "erasure-code-ring": 21692421 Content-Type: text/plain; charset=utf-8 X-Account-Project-Domain-Id: default X-Timestamp: 1472651418.17025 X-Trans-Id: tx81122c56032548aeae8cd-0057cee40c Accept-Ranges: bytesConfigure container-src to sync to container-dst using a key specified by both containers. Replace KEY with your key.
ardana >
openstack container set -t '//intracluster/thiscluster/AUTH_1234/container-dst' -k 'KEY' container-srcConfigure container-dst to accept synced objects with this key
ardana >
openstack container set -k 'KEY' container-dstUpload objects to container-src. Within a number of minutes the objects should be automatically synced to container-dst.
Changing the intracluster realm key
The intracluster realm key used by container sync to sync objects between containers in the same swift system is automatically generated. The process for changing passwords is described in Section 5.7, “Changing Service Passwords”.
The steps to change the intracluster realm key are as follows.
On the Cloud Lifecycle Manager create a file called
~/openstack/change_credentials/swift_data_metadata.yml
with the contents included below. The consuming-cp and cp entries are the control plane name specified in
~/openstack/my_cloud/definition/data/control_plane.yml
where the swift-container service is running.
swift_intracluster_sync_key:
  metadata:
  - clusters:
    - swpac
    component: swift-container
    consuming-cp: control-plane-1
    cp: control-plane-1
  version: '2.0'
Run the following commands:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Reconfigure the swift credentials:
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure-credentials-change.yml
Delete ~/openstack/change_credentials/swift_data_metadata.yml:
ardana > rm ~/openstack/change_credentials/swift_data_metadata.yml
On a swift PAC server, check that the intracluster realm key has been updated in /etc/swift/container-sync-realms.conf:
# The intracluster realm facilitates syncing containers on this system
[intracluster]
key = aNlDn3kWK
Update any containers using the intracluster container sync to use the new intracluster realm key:
ardana > openstack container set -k 'aNlDn3kWK' container-src
ardana > openstack container set -k 'aNlDn3kWK' container-dst
10 Managing Networking #
Information about managing and configuring the Networking service.
10.1 SUSE OpenStack Cloud Firewall #
Firewall as a Service (FWaaS) provides the ability to assign network-level port security for all traffic entering an existing tenant network. More information on this service can be found in the public OpenStack documentation located at http://specs.openstack.org/openstack/neutron-specs/specs/api/firewall_as_a_service__fwaas_.html. The following documentation provides command-line interface example instructions for configuring and testing a SUSE OpenStack Cloud firewall. FWaaS can also be configured and managed through the horizon web interface.
With SUSE OpenStack Cloud, FWaaS is implemented directly in the L3 agent (neutron-l3-agent). However if VPNaaS is enabled, FWaaS is implemented in the VPNaaS agent (neutron-vpn-agent). Because FWaaS does not use a separate agent process or start a specific service, there currently are no monasca alarms for it.
If DVR is enabled, the firewall service currently does not filter traffic between OpenStack private networks (known as east-west traffic); it only filters traffic from external networks (known as north-south traffic).
The L3 agent must be restarted on each compute node hosting a DVR router when removing the FWaaS or adding a new FWaaS. This condition only applies when updating existing instances connected to DVR routers. For more information, see the upstream bug.
10.1.1 Overview of the SUSE OpenStack Cloud Firewall configuration #
The following instructions provide information about how to identify and modify the overall SUSE OpenStack Cloud firewall that is configured in front of the control services. This firewall is administered only by a cloud admin and is not available for tenant use for private network firewall services.
During the installation process, the configuration processor will
automatically generate "allow" firewall rules for each server based on the
services deployed and block all other ports. These are populated in
~/openstack/my_cloud/info/firewall_info.yml
, which includes
a list of all the ports by network, including the addresses on which the
ports will be opened. This is described in more detail in
Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 5 “Input Model”, Section 5.2 “Concepts”, Section 5.2.10 “Networking”, Section 5.2.10.5 “Firewall Configuration”.
The firewall_rules.yml
file in the input model allows you
to define additional rules for each network group. You can read more about
this in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.15 “Firewall Rules”.
The purpose of this document is to show you how to make post-installation changes to the firewall rules if the need arises.
This process is not to be confused with Firewall-as-a-Service, which is a separate service that enables the ability for SUSE OpenStack Cloud tenants to create north-south, network-level firewalls to provide stateful protection to all instances in a private, tenant network. This service is optional and is tenant-configured.
10.1.2 SUSE OpenStack Cloud 9 FWaaS Configuration #
Check for an enabled firewall.
You should check to determine if the firewall is enabled. The output of the openstack extension list command should contain a firewall entry.
openstack extension list
Assuming the external network is already created by the admin, this command will show the external network.
openstack network list
Create required assets.
Before creating firewalls, you will need to create a network, subnet, router, security group rules, start an instance and assign it a floating IP address.
Create the network, subnet and router.
openstack network create private
openstack subnet create sub --network private --subnet-range 10.0.0.0/24 --gateway 10.0.0.1
openstack router create router
openstack router add subnet router sub
openstack router set --external-gateway ext-net router
Create security group rules. Security group rules filter traffic at the VM level.
openstack security group rule create default --protocol icmp
openstack security group rule create default --protocol tcp --dst-port 22
openstack security group rule create default --protocol tcp --dst-port 80
Boot a VM.
NET=$(openstack network list | awk '/private/ {print $2}')
openstack server create --flavor 1 --image <image> --nic net-id=$NET --wait vm1
Verify if the instance is ACTIVE and is assigned an IP address.
openstack server list
Get the port id of the vm1 instance.
fixedip=$(openstack server list | awk '/vm1/ {print $12}' | awk -F '=' '{print $2}' | awk -F ',' '{print $1}')
vmportuuid=$(openstack port list | grep $fixedip | awk '{print $2}')
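The awk pipeline above pulls the fixed IP out of the server list table and then looks up the matching port. A self-contained sketch of the same extraction, run against a sample row (the row layout and the field position are illustrative; they vary with client version and column count):

```shell
# Sample `openstack server list` row (illustrative); in this shortened sample
# the networks column lands in awk field $8 rather than $12.
row='| 9f3c | vm1 | ACTIVE | private=10.0.0.5, 172.16.0.8 |'
# Take the networks field, strip the "net=" prefix, then drop the floating IP.
fixedip=$(echo "$row" | awk '/vm1/ {print $8}' | awk -F '=' '{print $2}' | awk -F ',' '{print $1}')
echo "$fixedip"   # prints 10.0.0.5
```

Adjust the field index to match the actual table emitted by your client before reusing this.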
Create and associate a floating IP address to the vm1 instance.
openstack floating ip create --port $vmportuuid ext-net
Verify if the floating IP is assigned to the instance. The following command should show an assigned floating IP address from the external network range.
openstack server show vm1
Verify that the instance is reachable from the external network. SSH into the instance from a node in (or with a route to) the external network.
ssh cirros@FIP-VM1
password: <password>
Create and attach the firewall.
By default, an internal "drop all" rule is enabled in iptables if none of the defined rules match the real-time data packets.
Create new firewall rules using the
firewall-rule-create
command, providing the protocol, the action (allow, deny, reject) and a name for the new rule. Firewall actions determine how data traffic is handled: an allow rule lets traffic pass through the firewall, deny drops the traffic, and reject drops it and returns a destination-unreachable response. Using reject speeds up failure detection dramatically for legitimate users, since they do not have to wait for retransmission timeouts or submit retries. Stick with deny where hostile attackers may attempt port scans and similar probing; silently dropping all packets makes reconnaissance more difficult. The default firewall action is deny.
The example below demonstrates how to allow icmp and ssh while denying access to http. See the
OpenStackClient
command-line reference at https://docs.openstack.org/python-openstackclient/rocky/ for additional options such as source IP, destination IP, source port and destination port.
Note: You can create firewall rules with identical names, and each instance will have a unique id associated with the created rule; however, for clarity this is not recommended.
neutron firewall-rule-create --protocol icmp --action allow --name allow-icmp
neutron firewall-rule-create --protocol tcp --destination-port 80 --action deny --name deny-http
neutron firewall-rule-create --protocol tcp --destination-port 22 --action allow --name allow-ssh
Once the rules are created, create the firewall policy by using the
firewall-policy-create
command with the --firewall-rules option and the rules to include in quotes, followed by the name of the new policy. The order of the rules is important.
neutron firewall-policy-create --firewall-rules "allow-icmp deny-http allow-ssh" policy-fw
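The rule order matters because rules are evaluated in sequence, with the internal "drop all" rule applying when nothing matches. A toy first-match sketch of the policy above (a model for illustration, not the neutron implementation):

```shell
# Toy first-match evaluation of the policy "allow-icmp deny-http allow-ssh".
fw_action() {
  proto=$1 port=$2
  case "$proto:$port" in
    icmp:*) echo allow ;;  # allow-icmp matches first
    tcp:80) echo deny ;;   # deny-http
    tcp:22) echo allow ;;  # allow-ssh
    *)      echo deny ;;   # implicit drop-all when nothing matches
  esac
}
fw_action icmp any   # prints allow
fw_action tcp 80     # prints deny
fw_action tcp 443    # prints deny
```

Reordering the rules would change the outcome for overlapping matches, which is why the quoted rule list is ordered.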
Finish the firewall creation by using the
firewall-create
command, the policy name, and the name you want to give your new firewall.
neutron firewall-create policy-fw --name user-fw
You can view the details of your new firewall by using the
firewall-show
command and the name of your firewall. This will verify that the status of the firewall is ACTIVE.
neutron firewall-show user-fw
Verify the FWaaS is functional.
Since the allow-icmp firewall rule is set, you can ping the floating IP address of the instance from the external network.
ping <FIP-VM1>
Similarly, you can connect via ssh to the instance due to the allow-ssh firewall rule.
ssh cirros@<FIP-VM1>
password: <password>
Run a web server on the vm1 instance that listens on port 80, accepts requests, and sends a WELCOME response.
$ vi webserv.sh

#!/bin/bash
MYIP=$(/sbin/ifconfig eth0 | grep 'inet addr' | awk -F: '{print $2}' | awk '{print $1}')
while true; do
  echo -e "HTTP/1.0 200 OK\n\nWelcome to $MYIP" | sudo nc -l -p 80
done

# Give it exec rights
$ chmod 755 webserv.sh

# Execute the script
$ ./webserv.sh
You should expect to see curl fail over port 80 because of the deny-http firewall rule. If curl succeeds, the firewall is not blocking incoming http requests.
curl -vvv <FIP-VM1>
When using the reference implementation, new networks, FIPs and routers created after the firewall creation will not be automatically updated with firewall rules. Execute the firewall-update command, passing the current and new router IDs, so that the rules are reconfigured across all the routers (both current and new).
For example if router-1 is created before and router-2 is created after the firewall creation
$ neutron firewall-update --router <router-1-id> --router <router-2-id> <firewall-name>
10.1.3 Making Changes to the Firewall Rules #
Log in to your Cloud Lifecycle Manager.
Edit your
~/openstack/my_cloud/definition/data/firewall_rules.yml
file and add the lines necessary to allow the port(s) needed through the firewall.
In this example we are going to open up port range 5900-5905 to allow VNC traffic through the firewall:
- name: VNC
  network-groups:
  - MANAGEMENT
  rules:
  - type: allow
    remote-ip-prefix: 0.0.0.0/0
    port-range-min: 5900
    port-range-max: 5905
    protocol: tcp
Note: The example above shows a remote-ip-prefix of 0.0.0.0/0, which opens the ports to all IP ranges. To be more secure, you can specify the CIDR of the local IP address range you will be running the VNC connection from.
Commit those changes to your local git:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "firewall rule update"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Create the deployment directory structure:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Change to the deployment directory and run the osconfig-iptables-deploy.yml playbook to update your iptables rules to allow VNC:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-iptables-deploy.yml
You can repeat these steps as needed to add, remove, or edit any of these firewall rules.
10.1.4 More Information #
Firewalls are based on iptables settings.
Each firewall that is created is known as an instance.
A firewall instance can be deployed on selected project routers. If no specific project router is selected, a firewall instance is automatically applied to all project routers.
Only one firewall instance can be applied to a project router.
Only one firewall policy can be applied to a firewall instance.
Multiple firewall rules can be added and applied to a firewall policy.
Firewall rules can be shared across different projects via the Share API flag.
Firewall rules supersede the Security Group rules that are applied at the Instance level for all traffic entering or leaving a private, project network.
For more information on the command-line interface (CLI) and firewalls, see the OpenStack networking command-line client reference: https://docs.openstack.org/python-openstackclient/rocky/
10.2 Using VPN as a Service (VPNaaS) #
SUSE OpenStack Cloud 9 VPNaaS Configuration
This document describes the configuration process and requirements for the SUSE OpenStack Cloud 9 Virtual Private Network (VPN) as a Service (VPNaaS) module.
10.2.1 Prerequisites #
SUSE OpenStack Cloud must be installed.
Before setting up VPNaaS, you will need to have created an external network and a subnet with access to the internet. Information on how to create the external network and subnet can be found in Section 10.2.4, “More Information”.
This document assumes 172.16.0.0/16 as the ext-net CIDR.
10.2.2 Considerations #
Using the neutron plugin-based VPNaaS causes additional processes to be run on the Network Service Nodes. One of these processes, the ipsec charon process from StrongSwan, runs as root and listens on an external network. A vulnerability in that process can lead to remote root compromise of the Network Service Nodes. If this is a concern customers should consider using a VPN solution other than the neutron plugin-based VPNaaS and/or deploying additional protection mechanisms.
10.2.3 Configuration #
Set up Networks. You can set up VPN as a
Service (VPNaaS) by first creating networks, subnets and routers using the
neutron
command line. The VPNaaS module enables the
ability to extend access between private networks across two different
SUSE OpenStack Cloud clouds or between a SUSE OpenStack Cloud cloud and a non-cloud network. VPNaaS
is based on the open source software application called StrongSwan.
StrongSwan (more information available
at http://www.strongswan.org/)
is an IPsec implementation and provides basic VPN gateway functionality.
You can execute the included commands from any shell with access to the service APIs. In the included examples, the commands are executed from the lifecycle manager, however you could execute the commands from the controller node or any other shell with aforementioned service API access.
The use of floating IPs is not possible with the current version of VPNaaS when DVR is enabled. Ensure that no floating IP is associated with instances that will be using VPNaaS behind a DVR router. Floating IPs associated with instances are fine when using a CVR router.
From the Cloud Lifecycle Manager, create the first private network, subnet and router, assuming that ext-net was created by the admin.
openstack network create privateA
openstack subnet create subA --network privateA --subnet-range 10.1.0.0/24 --gateway 10.1.0.1
openstack router create router1
openstack router add subnet router1 subA
openstack router set --external-gateway ext-net router1
Create the second private network, subnet and router.
openstack network create privateB
openstack subnet create subB --network privateB --subnet-range 10.2.0.0/24 --gateway 10.2.0.1
openstack router create router2
openstack router add subnet router2 subB
openstack router set --external-gateway ext-net router2
From the Cloud Lifecycle Manager run the following to start the virtual machines. Begin with adding secgroup rules for SSH and ICMP.
openstack security group rule create default --protocol icmp
openstack security group rule create default --protocol tcp --dst-port 22
Start the virtual machine in the privateA subnet. Use openstack image list to obtain the image id, and boot using the image id instead of the image name. After executing this step, it is recommended that you wait approximately 10 seconds to allow the virtual machine to become active.
NETA=$(openstack network list | awk '/privateA/ {print $2}')
openstack server create --flavor 1 --image <id> --nic net-id=$NETA vm1
Start the virtual machine in the privateB subnet.
NETB=$(openstack network list | awk '/privateB/ {print $2}')
openstack server create --flavor 1 --image <id> --nic net-id=$NETB vm2
Verify that private IPs are allocated to the respective VMs. Take note of the IPs for later use.
openstack server show vm1
openstack server show vm2
You can set up the VPN by executing the below commands from the lifecycle manager or any shell with access to the service APIs. Begin with creating the policies with
vpn-ikepolicy-create
and
vpn-ipsecpolicy-create.
neutron vpn-ikepolicy-create ikepolicy
neutron vpn-ipsecpolicy-create ipsecpolicy
Create the VPN service at router1.
neutron vpn-service-create --name myvpnA --description "My vpn service" router1 subA
Wait at least 5 seconds and then run
ipsec-site-connection-create
to create an ipsec-site connection. Note that --peer-address is the assigned ext-net IP from router2 and --peer-cidr is the subB cidr.
neutron ipsec-site-connection-create --name vpnconnection1 --vpnservice-id myvpnA \
  --ikepolicy-id ikepolicy --ipsecpolicy-id ipsecpolicy --peer-address 172.16.0.3 \
  --peer-id 172.16.0.3 --peer-cidr 10.2.0.0/24 --psk secret
Create the VPN service at router2.
neutron vpn-service-create --name myvpnB --description "My vpn serviceB" router2 subB
Wait at least 5 seconds and then run
ipsec-site-connection-create
to create an ipsec-site connection. Note that --peer-address is the assigned ext-net IP from router1 and --peer-cidr is the subA cidr.
neutron ipsec-site-connection-create --name vpnconnection2 --vpnservice-id myvpnB \
  --ikepolicy-id ikepolicy --ipsecpolicy-id ipsecpolicy --peer-address 172.16.0.2 \
  --peer-id 172.16.0.2 --peer-cidr 10.1.0.0/24 --psk secret
On the Cloud Lifecycle Manager, run the
ipsec-site-connection-list
command to see the active connections. Be sure to check that the vpn_services are ACTIVE. You can check this by running
vpn-service-list
and then checking the ipsec-site-connection status. Both the vpn-services and the ipsec-site-connections can take as long as 1 to 3 minutes to become ACTIVE.
neutron ipsec-site-connection-list
+--------------------------------------+----------------+--------------+---------------+------------+-----------+--------+
| id                                   | name           | peer_address | peer_cidrs    | route_mode | auth_mode | status |
+--------------------------------------+----------------+--------------+---------------+------------+-----------+--------+
| 1e8763e3-fc6a-444c-a00e-426a4e5b737c | vpnconnection2 | 172.16.0.2   | "10.1.0.0/24" | static     | psk       | ACTIVE |
| 4a97118e-6d1d-4d8c-b449-b63b41e1eb23 | vpnconnection1 | 172.16.0.3   | "10.2.0.0/24" | static     | psk       | ACTIVE |
+--------------------------------------+----------------+--------------+---------------+------------+-----------+--------+
Verify the VPN. In the case of non-admin users, you can verify the VPN connection by pinging the virtual machines.
Check the VPN connections.
Note: vm1-ip and vm2-ip denote the private IPs of vm1 and vm2 respectively. The private IPs were obtained as described in Step 4. If you are unable to SSH to the private network due to a lack of direct access, the VM console can be accessed through horizon.
ssh cirros@vm1-ip
password: <password>
# ping the private IP address of vm2
ping ###.###.###.###
In another terminal.
ssh cirros@vm2-ip
password: <password>
# ping the private IP address of vm1
ping ###.###.###.###
You should see ping responses from both virtual machines.
As the admin user, you should check to make sure that a route exists between the router gateways. Once the gateways have been checked, packet encryption can be verified with a traffic analyzer (tcpdump) by tapping the respective namespace (qrouter-* in the case of non-DVR, snat-* in the case of DVR) and the right interface (qg-***).
When using DVR namespaces, all the occurrences of qrouter-xxxxxx in the following commands should be replaced with respective snat-xxxxxx.
Check if the route exists between the two router gateways. You can get the right qrouter namespace id by executing sudo ip netns. Once you have the qrouter namespace id, you can find the interface by executing sudo ip netns exec qrouter-xxxxxxxx ip addr.
sudo ip netns
sudo ip netns exec qrouter-<router1 UUID> ping <router2 gateway>
sudo ip netns exec qrouter-<router2 UUID> ping <router1 gateway>
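When many namespaces exist, it can help to filter the qrouter entries out of the ip netns listing first. A small sketch against sample output (the UUIDs are illustrative; filter on snat- instead of qrouter- when DVR is in use):

```shell
# Sample `ip netns` output; real systems list one namespace per line.
netns='qdhcp-11111111-2222-3333-4444-555555555555
qrouter-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
snat-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'
# Keep only the router namespaces.
echo "$netns" | grep '^qrouter-'   # prints the qrouter namespace line
```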
Initiate a tcpdump on the interface.
sudo ip netns exec qrouter-xxxxxxxx tcpdump -i qg-xxxxxx
Check the VPN connection.
ssh cirros@vm1-ip
password: <password>
# ping the private IP address of vm2
ping ###.###.###.###
Repeat for the other namespace and the right tap interface.
sudo ip netns exec qrouter-xxxxxxxx tcpdump -i qg-xxxxxx
In another terminal.
ssh cirros@vm2-ip
password: <password>
# ping the private IP address of vm1
ping ###.###.###.###
You will find encrypted packets containing ‘ESP’ in the tcpdump trace.
10.2.4 More Information #
VPNaaS currently only supports Pre-shared Keys (PSK) security between VPN gateways. A different VPN gateway solution should be considered if stronger, certificate-based security is required.
For more information on the neutron command-line interface (CLI) and VPN as a Service (VPNaaS), see the OpenStack networking command-line client reference: https://docs.openstack.org/python-openstackclient/rocky/
For information on how to create an external network and subnet, see the OpenStack manual: http://docs.openstack.org/user-guide/dashboard_create_networks.html
10.3 DNS Service Overview #
SUSE OpenStack Cloud DNS service provides multi-tenant Domain Name Service with REST API management for domain and records.
The DNS Service is not intended to be used as an internal or private DNS service. The name records in DNSaaS should be treated as public information that anyone could query. There are controls to prevent tenants from creating records for domains they do not own. TSIG (Transaction SIGnature) ensures integrity during zone transfers to other DNS servers.
10.3.1 For More Information #
For more information about designate REST APIs, see the OpenStack REST API Documentation at http://docs.openstack.org/developer/designate/rest.html.
For a glossary of terms for designate, see the OpenStack glossary at http://docs.openstack.org/developer/designate/glossary.html.
10.3.2 designate Initial Configuration #
After the SUSE OpenStack Cloud installation has been completed, designate requires initial configuration to operate.
10.3.2.1 Identifying Name Server Public IPs #
Depending on the back-end, the method used to identify the name servers' public IPs will differ.
10.3.2.1.1 InfoBlox #
If InfoBlox will act as your public name servers, consult the InfoBlox management UI to identify the IPs.
10.3.2.1.2 BIND Back-end #
You can find the name server IPs in /etc/hosts
by
looking for the ext-api
addresses, which are the
addresses of the controllers. For example:
192.168.10.1 example-cp1-c1-m1-extapi
192.168.10.2 example-cp1-c1-m2-extapi
192.168.10.3 example-cp1-c1-m3-extapi
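The lookup above can be scripted. A sketch that filters /etc/hosts-style content for the extapi entries, run here against the sample hostnames from this guide rather than a live file:

```shell
# Extract the ext-api controller IPs from hosts-file content.
hosts='192.168.10.1 example-cp1-c1-m1-extapi
192.168.10.2 example-cp1-c1-m2-extapi
192.168.10.3 example-cp1-c1-m3-extapi'
echo "$hosts" | awk '/-extapi$/ {print $1}'   # prints the three IPs, one per line
```

Against a real deployment, replace the sample text with `awk '/-extapi$/ {print $1}' /etc/hosts`.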
10.3.2.1.3 Creating Name Server A Records #
Each name server requires a public name, for example
ns1.example.com.
, to which designate-managed domains will
be delegated. There are two common locations where these may be registered,
either within a zone hosted on designate itself, or within a zone hosted on an
external DNS service.
If you are using an externally managed zone for these names:
For each name server public IP, create the necessary A records in the external system.
If you are using a designate-managed zone for these names:
Create the zone in designate which will contain the records:
ardana > openstack zone create --email hostmaster@example.com example.com.
+----------------+--------------------------------------+
| Field          | Value                                |
+----------------+--------------------------------------+
| action         | CREATE                               |
| created_at     | 2016-03-09T13:16:41.000000           |
| description    | None                                 |
| email          | hostmaster@example.com               |
| id             | 23501581-7e34-4b88-94f4-ad8cec1f4387 |
| masters        |                                      |
| name           | example.com.                         |
| pool_id        | 794ccc2c-d751-44fe-b57f-8894c9f5c842 |
| project_id     | a194d740818942a8bea6f3674e0a3d71     |
| serial         | 1457529400                           |
| status         | PENDING                              |
| transferred_at | None                                 |
| ttl            | 3600                                 |
| type           | PRIMARY                              |
| updated_at     | None                                 |
| version        | 1                                    |
+----------------+--------------------------------------+
For each name server public IP, create an A record. For example:
ardana > openstack recordset create --records 192.168.10.1 --type A example.com. ns1.example.com.
+-------------+--------------------------------------+
| Field       | Value                                |
+-------------+--------------------------------------+
| action      | CREATE                               |
| created_at  | 2016-03-09T13:18:36.000000           |
| description | None                                 |
| id          | 09e962ed-6915-441a-a5a1-e8d93c3239b6 |
| name        | ns1.example.com.                     |
| records     | 192.168.10.1                         |
| status      | PENDING                              |
| ttl         | None                                 |
| type        | A                                    |
| updated_at  | None                                 |
| version     | 1                                    |
| zone_id     | 23501581-7e34-4b88-94f4-ad8cec1f4387 |
+-------------+--------------------------------------+
When records have been added, list the record sets in the zone to validate:
ardana > openstack recordset list example.com.
+--------------+------------------+------+---------------------------------------------------+
| id           | name             | type | records                                           |
+--------------+------------------+------+---------------------------------------------------+
| 2d6cf...655b | example.com.     | SOA  | ns1.example.com. hostmaster.example.com 145...600 |
| 33466...bd9c | example.com.     | NS   | ns1.example.com.                                  |
| da98c...bc2f | example.com.     | NS   | ns2.example.com.                                  |
| 672ee...74dd | example.com.     | NS   | ns3.example.com.                                  |
| 09e96...39b6 | ns1.example.com. | A    | 192.168.10.1                                      |
| bca4f...a752 | ns2.example.com. | A    | 192.168.10.2                                      |
| 0f123...2117 | ns3.example.com. | A    | 192.168.10.3                                      |
+--------------+------------------+------+---------------------------------------------------+
Contact your domain registrar requesting Glue Records to be registered in the
com.
zone for the nameserver and public IP address pairs above. If you are using a sub-zone of an existing company zone (for example, ns1.cloud.mycompany.com.), the Glue must be placed in the mycompany.com. zone.
10.3.2.1.4 For More Information #
For additional DNS integration and configuration information, see the OpenStack designate documentation at https://docs.openstack.org/designate/rocky/.
For more information on creating servers, domains and examples, see the OpenStack REST API documentation at https://developer.openstack.org/api-ref/dns/.
10.3.3 DNS Service Monitoring Support #
10.3.3.1 DNS Service Monitoring Support #
Additional monitoring support for the DNS Service (designate) has been added to SUSE OpenStack Cloud.
In the Networking section of the Operations Console, you can see alarms for all of
the DNS Services (designate), such as designate-zone-manager, designate-api,
designate-pool-manager, designate-mdns, and designate-central after running
designate-stop.yml
.
You can run designate-start.yml
to start the DNS Services
back up and the alarms will change from a red status to green and be removed
from the New Alarms panel of the
Operations Console.
An example of the generated alarms from the Operations Console is provided below
after running designate-stop.yml
:
ALARM: STATE: ALARM ID: LAST CHECK: DIMENSION:
Process Check 0f221056-1b0e-4507-9a28-2e42561fac3e 2016-10-03T10:06:32.106Z hostname=ardana-cp1-c1-m1-mgmt, service=dns, cluster=cluster1, process_name=designate-zone-manager, component=designate-zone-manager, control_plane=control-plane-1, cloud_name=entry-scale-kvm
Process Check 50dc4c7b-6fae-416c-9388-6194d2cfc837 2016-10-03T10:04:32.086Z hostname=ardana-cp1-c1-m1-mgmt, service=dns, cluster=cluster1, process_name=designate-api, component=designate-api, control_plane=control-plane-1, cloud_name=entry-scale-kvm
Process Check 55cf49cd-1189-4d07-aaf4-09ed08463044 2016-10-03T10:05:32.109Z hostname=ardana-cp1-c1-m1-mgmt, service=dns, cluster=cluster1, process_name=designate-pool-manager, component=designate-pool-manager, control_plane=control-plane-1, cloud_name=entry-scale-kvm
Process Check c4ab7a2e-19d7-4eb2-a9e9-26d3b14465ea 2016-10-03T10:06:32.105Z hostname=ardana-cp1-c1-m1-mgmt, service=dns, cluster=cluster1, process_name=designate-mdns, component=designate-mdns, control_plane=control-plane-1, cloud_name=entry-scale-kvm
HTTP Status c6349bbf-4fd1-461a-9932-434169b86ce5 2016-10-03T10:05:01.731Z service=dns, cluster=cluster1, url=http://100.60.90.3:9001/, hostname=ardana-cp1-c1-m3-mgmt, component=designate-api, control_plane=control-plane-1, api_endpoint=internal, cloud_name=entry-scale-kvm, monitored_host_type=instance
Process Check ec2c32c8-3b91-4656-be70-27ff0c271c89 2016-10-03T10:04:32.082Z hostname=ardana-cp1-c1-m1-mgmt, service=dns, cluster=cluster1, process_name=designate-central, component=designate-central, control_plane=control-plane-1, cloud_name=entry-scale-kvm
10.4 Networking Service Overview #
SUSE OpenStack Cloud Networking is a virtual Networking service that leverages the OpenStack neutron service to provide network connectivity and addressing to SUSE OpenStack Cloud Compute service devices.
The Networking service also provides an API to configure and manage a variety of network services.
You can use the Networking service to connect guest servers or you can define and configure your own virtual network topology.
10.4.1 Installing the Networking Service #
SUSE OpenStack Cloud Network Administrators are responsible for planning the neutron Networking service and, once it is installed, for configuring the service to meet the needs of their cloud network users.
10.4.2 Working with the Networking service #
To perform tasks using the Networking service, you can use the dashboard, API or CLI.
10.4.3 Reconfiguring the Networking service #
If you change any of the network configuration after installation, it is recommended that you reconfigure the Networking service by running the neutron-reconfigure playbook.
On the Cloud Lifecycle Manager:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
10.4.4 For more information #
For information on how to operate your cloud we suggest you read the OpenStack Operations Guide. The Architecture section contains useful information about how an OpenStack Cloud is put together. However, SUSE OpenStack Cloud takes care of these details for you. The Operations section contains information on how to manage the system.
10.4.5 Neutron External Networks #
10.4.5.1 External networks overview #
This topic explains how to create a neutron external network.
External networks provide access to the internet.
The typical use is to provide an IP address that can be used to reach a VM from an external network which can be a public network like the internet or a network that is private to an organization.
10.4.5.2 Using the Ansible Playbook #
This playbook will query the Networking service for an existing external
network, and then create a new one if you do not already have one. The
resulting external network will have the name ext-net
with a subnet matching the CIDR you specify in the command below.
If you need to specify more granularity, for example specifying an allocation pool for the subnet, use the Section 10.4.5.3, “Using the python-neutronclient CLI”.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-cloud-configure.yml -e EXT_NET_CIDR=<CIDR>
The table below shows the optional switch that you can use as part of this playbook to specify environment-specific information:
Switch | Description
---|---
EXT_NET_CIDR | Optional. You can use this switch to specify the external network CIDR. If you choose not to use this switch, or use a wrong value, the VMs will not be accessible over the network. This CIDR will be from the EXTERNAL VM network.
10.4.5.3 Using the python-neutronclient CLI #
For more granularity you can utilize the OpenStackClient tool to create your external network.
Log in to the Cloud Lifecycle Manager.
Source the Admin creds:
ardana > source ~/service.osrc
Create the external network and then the subnet using the commands below.
Creating the network:
ardana > openstack network create --external <external-network-name>
Creating the subnet:
ardana > openstack subnet create --network EXTERNAL-NETWORK-NAME --subnet-range CIDR --gateway GATEWAY --allocation-pool start=IP_START,end=IP_END [--no-dhcp] SUBNET-NAME
Where:
Value | Description |
---|---|
external-network-name | The name given to your external network. This is a unique value that you will choose. The value ext-net is usually used. |
CIDR | The external network CIDR. If you use a wrong value, the VMs will not be accessible over the network. This CIDR will be from the EXTERNAL VM network. |
--gateway | Optional switch to specify the gateway IP for your subnet. If it is not included, the first available IP is chosen. |
--allocation-pool start,end | Optional switch to specify the start and end IP addresses to use as the allocation pool for this subnet. |
--no-dhcp | Optional switch to disable DHCP on this subnet. If it is not specified, DHCP will be enabled. |
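The defaults described above (the gateway falls back to the first available address, and the allocation pool covers the remaining usable addresses) can be sketched with Python's standard `ipaddress` module. This helper is purely illustrative and is not part of the product:

```python
import ipaddress

def default_subnet_layout(cidr):
    """Mimic the documented defaults: without --gateway, the first host
    address becomes the gateway; the allocation pool then spans the
    remaining usable addresses of the subnet."""
    net = ipaddress.ip_network(cidr)
    hosts = list(net.hosts())           # excludes network/broadcast addresses
    gateway = hosts[0]                  # first available IP
    pool_start, pool_end = hosts[1], hosts[-1]
    return str(gateway), (str(pool_start), str(pool_end))

gateway, pool = default_subnet_layout("172.30.0.0/24")
print(gateway)   # 172.30.0.1
print(pool)      # ('172.30.0.2', '172.30.0.254')
```

Passing an explicit `--gateway` or `--allocation-pool` on the command line overrides these defaults.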
10.4.5.4 Multiple External Networks #
SUSE OpenStack Cloud provides the ability to have multiple external networks, by using the Network Service (neutron) provider networks for external networks. You can configure SUSE OpenStack Cloud to allow the use of provider VLANs as external networks by following these steps.
Do NOT include the neutron.l3_agent.external_network_bridge tag in the network_groups definition for your cloud. This results in the l3_agent.ini external_network_bridge being set to an empty value (rather than the traditional br-ex).
Configure your cloud to use provider VLANs by specifying the provider_physical_network tag on one of the network_groups defined for your cloud.
For example, to run provider VLANs over the EXAMPLE network group (some attributes omitted for brevity):
network-groups:
  - name: EXAMPLE
    tags:
      - neutron.networks.vlan:
          provider-physical-network: physnet1
After the cloud has been deployed, you can create external networks using provider VLANs.
For example, using the OpenStackClient:
Create external network 1 on vlan101:
ardana > openstack network create --provider-network-type vlan --provider-physical-network physnet1 --provider-segment 101 --external ext-net1
Create external network 2 on vlan102:
ardana > openstack network create --provider-network-type vlan --provider-physical-network physnet1 --provider-segment 102 --external ext-net2
10.4.6 Neutron Provider Networks #
This topic explains how to create a neutron provider network.
A provider network is a virtual network created in the SUSE OpenStack Cloud cloud that is consumed by SUSE OpenStack Cloud services. The distinctive element of a provider network is that it does not create a virtual router; rather, it depends on L3 routing that is provided by the infrastructure.
A provider network is created by adding the specification to the SUSE OpenStack Cloud input model. It consists of at least one network and one or more subnets.
10.4.6.1 SUSE OpenStack Cloud input model #
The input model is the primary mechanism a cloud admin uses in defining a SUSE OpenStack Cloud installation. It exists as a directory with a data subdirectory that contains YAML files. By convention, any service that creates a neutron provider network will create a subdirectory under the data directory and the name of the subdirectory shall be the project name. For example, the Octavia project will use neutron provider networks so it will have a subdirectory named 'octavia' and the config file that specifies the neutron network will exist in that subdirectory.
├── cloudConfig.yml
├── data
│   ├── control_plane.yml
│   ├── disks_compute.yml
│   ├── disks_controller_1TB.yml
│   ├── disks_controller.yml
│   ├── firewall_rules.yml
│   ├── net_interfaces.yml
│   ├── network_groups.yml
│   ├── networks.yml
│   ├── neutron
│   │   └── neutron_config.yml
│   ├── nic_mappings.yml
│   ├── server_groups.yml
│   ├── server_roles.yml
│   ├── servers.yml
│   ├── swift
│   │   └── swift_config.yml
│   └── octavia
│       └── octavia_config.yml
├── README.html
└── README.md
10.4.6.2 Network/Subnet specification #
The elements required in the input model for you to define a network are:
name
network_type
physical_network
Elements that are optional when defining a network are:
segmentation_id
shared
Required elements for the subnet definition are:
cidr
Optional elements for the subnet definition are:
allocation_pools which will require start and end addresses
host_routes which will require a destination and nexthop
gateway_ip
no_gateway
enable-dhcp
NOTE: Only IPv4 is supported at the present time.
10.4.6.3 Network details #
The following table outlines the network values to be set, and what they represent.
Attribute | Required/optional | Allowed Values | Usage |
---|---|---|---|
name | Required | ||
network_type | Required | flat, vlan, vxlan | The type of desired network |
physical_network | Required | Valid physical network name | Name of the physical network that is overlaid with the virtual network |
segmentation_id | Optional | vlan or vxlan ranges | VLAN id for vlan or tunnel id for vxlan |
shared | Optional | True | Shared by all projects or private to a single project |
10.4.6.4 Subnet details #
The following table outlines the subnet values to be set, and what they represent.
Attribute | Req/Opt | Allowed Values | Usage |
---|---|---|---|
cidr | Required | Valid CIDR range | for example, 172.30.0.0/24 |
allocation_pools | Optional | See allocation_pools table below | |
host_routes | Optional | See host_routes table below | |
gateway_ip | Optional | Valid IP addr | Subnet gateway to other nets |
no_gateway | Optional | True | No distribution of gateway |
enable-dhcp | Optional | True | Enable dhcp for this subnet |
10.4.6.5 ALLOCATION_POOLS details #
The following table explains allocation pool settings.
Attribute | Req/Opt | Allowed Values | Usage |
---|---|---|---|
start | Required | Valid IP addr | First ip address in pool |
end | Required | Valid IP addr | Last ip address in pool |
10.4.6.6 HOST_ROUTES details #
The following table explains host route settings.
Attribute | Req/Opt | Allowed Values | Usage |
---|---|---|---|
destination | Required | Valid CIDR | Destination subnet |
nexthop | Required | Valid IP addr | Hop to take to destination subnet |
Multiple destination/nexthop values can be used.
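The constraints in the tables above can be checked mechanically before running the configuration processor. The validator below is a hypothetical helper (not part of the Cloud Lifecycle Manager) that uses only the standard `ipaddress` module, under the assumption that a subnet entry is given as a plain dictionary:

```python
import ipaddress

def validate_subnet_spec(spec):
    """Return a list of problems found in an input-model subnet entry.
    Checks the rules from the tables above: cidr is required and must be
    IPv4; allocation pools and gateway_ip must fall inside the cidr;
    host_routes need a valid destination CIDR and nexthop address."""
    problems = []
    try:
        net = ipaddress.ip_network(spec["cidr"])
    except (KeyError, ValueError):
        return ["cidr is required and must be a valid CIDR"]
    if net.version != 4:
        problems.append("only IPv4 is supported")
    for pool in spec.get("allocation_pools", []):
        start = ipaddress.ip_address(pool["start"])
        end = ipaddress.ip_address(pool["end"])
        if not (start in net and end in net and start <= end):
            problems.append("allocation pool %s not inside %s" % (pool, net))
    gw = spec.get("gateway_ip")
    if gw and ipaddress.ip_address(gw) not in net:
        problems.append("gateway_ip %s not inside %s" % (gw, net))
    for route in spec.get("host_routes", []):
        ipaddress.ip_network(route["destination"])   # raises if invalid
        ipaddress.ip_address(route["nexthop"])
    return problems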
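The constraints in the tables above can be checked mechanically before running the configuration processor. The validator below is a hypothetical helper (not part of the Cloud Lifecycle Manager) that uses only the standard `ipaddress` module, under the assumption that a subnet entry is given as a plain dictionary:

```python
import ipaddress

def validate_subnet_spec(spec):
    """Return a list of problems found in an input-model subnet entry.
    Checks the rules from the tables above: cidr is required and must be
    IPv4; allocation pools and gateway_ip must fall inside the cidr;
    host_routes need a valid destination CIDR and nexthop address."""
    problems = []
    try:
        net = ipaddress.ip_network(spec["cidr"])
    except (KeyError, ValueError):
        return ["cidr is required and must be a valid CIDR"]
    if net.version != 4:
        problems.append("only IPv4 is supported")
    for pool in spec.get("allocation_pools", []):
        start = ipaddress.ip_address(pool["start"])
        end = ipaddress.ip_address(pool["end"])
        if not (start in net and end in net and start <= end):
            problems.append("allocation pool %s not inside %s" % (pool, net))
    gw = spec.get("gateway_ip")
    if gw and ipaddress.ip_address(gw) not in net:
        problems.append("gateway_ip %s not inside %s" % (gw, net))
    for route in spec.get("host_routes", []):
        ipaddress.ip_network(route["destination"])   # raises if invalid
        ipaddress.ip_address(route["nexthop"])
    return problems
```

Running it against the OCTAVIA-MGMT-NET values from the example in Section 10.4.6.7 returns an empty problem list.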
10.4.6.7 Examples #
The following examples show the configuration file settings for neutron and Octavia.
Octavia configuration
This file defines the mapping. It does not need to be edited unless you want to change the name of your VLAN.
Path:
~/openstack/my_cloud/definition/data/octavia/octavia_config.yml
---
product:
  version: 2
configuration-data:
  - name: OCTAVIA-CONFIG-CP1
    services:
      - octavia
    data:
      amp_network_name: OCTAVIA-MGMT-NET
neutron configuration
Input your network configuration information for your provider VLANs in neutron_config.yml, found here: ~/openstack/my_cloud/definition/data/neutron/.
---
product:
  version: 2
configuration-data:
  - name: NEUTRON-CONFIG-CP1
    services:
      - neutron
    data:
      neutron_provider_networks:
        - name: OCTAVIA-MGMT-NET
          provider:
            - network_type: vlan
              physical_network: physnet1
              segmentation_id: 2754
          cidr: 10.13.189.0/24
          no_gateway: True
          enable_dhcp: True
          allocation_pools:
            - start: 10.13.189.4
              end: 10.13.189.252
          host_routes:
            # route to MANAGEMENT-NET
            - destination: 10.13.111.128/26
              nexthop: 10.13.189.5
10.4.6.8 Implementing your changes #
Commit the changes to git:
ardana > git add -A
ardana > git commit -a -m "configuring provider network"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Then continue with your clean cloud installation.
If you are only adding a neutron Provider network to an existing model, then run the neutron-deploy.yml playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-deploy.yml
10.4.6.9 Multiple Provider Networks #
The physical network infrastructure must be configured to convey the provider VLAN traffic as tagged VLANs to the cloud compute nodes and network service network nodes. Configuration of the physical network infrastructure is outside the scope of the SUSE OpenStack Cloud 9 software.
SUSE OpenStack Cloud 9 automates the server networking configuration and the Network Service configuration based on information in the cloud definition. To configure the system for provider VLANs, specify the neutron.networks.vlan tag with a provider-physical-network attribute on one or more network groups. For example (some attributes omitted for brevity):
network-groups:
  - name: NET_GROUP_A
    tags:
      - neutron.networks.vlan:
          provider-physical-network: physnet1
  - name: NET_GROUP_B
    tags:
      - neutron.networks.vlan:
          provider-physical-network: physnet2
A network group is associated with a server network interface via an interface model. For example (some attributes omitted for brevity):
interface-models:
  - name: INTERFACE_SET_X
    network-interfaces:
      - device:
          name: bond0
        network-groups:
          - NET_GROUP_A
      - device:
          name: eth3
        network-groups:
          - NET_GROUP_B
A network group used for provider VLANs may contain only a single SUSE OpenStack Cloud network, because that VLAN must span all compute nodes and any Network Service network nodes/controllers (that is, it is a single L2 segment). The SUSE OpenStack Cloud network must be defined with tagged-vlan false, otherwise a Linux VLAN network interface will be created. For example:
networks:
  - name: NET_A
    tagged-vlan: false
    network-group: NET_GROUP_A
  - name: NET_B
    tagged-vlan: false
    network-group: NET_GROUP_B
When the cloud is deployed, SUSE OpenStack Cloud 9 will create the appropriate bridges on the servers, and set the appropriate attributes in the neutron configuration files (for example, bridge_mappings).
After the cloud has been deployed, create Network Service network objects for each provider VLAN. For example, using the Network Service CLI:
ardana > openstack network create --provider-network-type vlan --provider-physical-network physnet1 --provider-segment 101 mynet101
ardana > openstack network create --provider-network-type vlan --provider-physical-network physnet2 --provider-segment 234 mynet234
10.4.6.10 More Information #
For more information on the Network Service command-line interface (CLI), see the OpenStack networking command-line client reference: http://docs.openstack.org/cli-reference/content/neutronclient_commands.html
10.4.7 Using IPAM Drivers in the Networking Service #
This topic describes how to choose and implement an IPAM driver.
10.4.7.1 Selecting and implementing an IPAM driver #
Beginning with the Liberty release, OpenStack networking includes a pluggable interface for the IP Address Management (IPAM) function. This interface creates a driver framework for the allocation and de-allocation of subnets and IP addresses, enabling the integration of alternate IPAM implementations or third-party IP Address Management systems.
There are three possible IPAM driver options:
Non-pluggable driver. This option is the default when the ipam_driver parameter is not specified in neutron.conf.
Pluggable reference IPAM driver. The pluggable IPAM driver interface was introduced in SUSE OpenStack Cloud 9 (OpenStack Liberty). It is a refactoring of the Kilo non-pluggable driver to use the new pluggable interface. The setting in neutron.conf to specify this driver is ipam_driver = internal.
Pluggable Infoblox IPAM driver. The pluggable Infoblox IPAM driver is a third-party implementation of the pluggable IPAM interface. The corresponding setting in neutron.conf to specify this driver is ipam_driver = networking_infoblox.ipam.driver.InfobloxPool.
Note: You can use either the non-pluggable IPAM driver or a pluggable one. However, you cannot use both.
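Conceptually, neutron imports the class named by the dotted path in ipam_driver and delegates subnet and address (de)allocation to it. The toy classes below only illustrate that shape; they are simplified assumptions, not the real neutron base classes (which live in the neutron source tree):

```python
import ipaddress

class ToySubnet:
    """Toy stand-in for a driver's per-subnet allocation object."""
    def __init__(self, cidr):
        self._hosts = [str(h) for h in ipaddress.ip_network(cidr).hosts()]
        self._allocated = set()

    def allocate(self, address=None):
        # Allocate a specific address, or the next free one.
        if address is None:
            address = next(h for h in self._hosts if h not in self._allocated)
        if address in self._allocated or address not in self._hosts:
            raise ValueError("%s not available" % address)
        self._allocated.add(address)
        return address

    def deallocate(self, address):
        self._allocated.discard(address)

class ToyIpamDriver:
    """Toy stand-in for the class referenced by ipam_driver in neutron.conf."""
    def __init__(self):
        self._subnets = {}

    def allocate_subnet(self, subnet_id, cidr):
        self._subnets[subnet_id] = ToySubnet(cidr)
        return self._subnets[subnet_id]

    def remove_subnet(self, subnet_id):
        del self._subnets[subnet_id]
```

Any implementation of this contract (internal or Infoblox) can be swapped in without changing the neutron API server itself, which is the point of the pluggable interface.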
10.4.7.2 Using the Pluggable reference IPAM driver #
To use the Pluggable reference IPAM driver, the only parameter needed is ipam_driver. Set it by finding the commented line ipam_driver = internal in the neutron.conf.j2 template, uncommenting it, and committing the file. After following the standard steps to deploy neutron, neutron will be configured to run using the Pluggable reference IPAM driver.
The file you must edit is neutron.conf.j2 on the Cloud Lifecycle Manager, in the directory ~/openstack/my_cloud/config/neutron. Here is the relevant section where you can see the ipam_driver parameter commented out:
[DEFAULT]
...
l3_ha_net_cidr = 169.254.192.0/18
# Uncomment the line below if the Reference Pluggable IPAM driver is to be used
# ipam_driver = internal
...
After uncommenting the line ipam_driver = internal, commit the file using git commit from the openstack/my_cloud directory:
ardana > git commit -a -m 'My config for enabling the internal IPAM Driver'
Then follow the steps to deploy SUSE OpenStack Cloud in the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 13 “Overview” appropriate to your cloud configuration.
Currently there is no migration path from the non-pluggable driver to a pluggable IPAM driver because changes are needed to database tables and neutron currently cannot make those changes.
10.4.7.3 Using the Infoblox IPAM driver #
As suggested above, using the Infoblox IPAM driver requires changes to existing parameters in nova.conf and neutron.conf. If you want to use the Infoblox appliance, you will need to add the infoblox-ipam-agent service component to the service role containing the neutron API server. To use the Infoblox appliance for IPAM, both the agent and the Infoblox IPAM driver are required. The infoblox-ipam-agent should be deployed on the same node where the neutron-server component is running. Usually this is a Controller node.
Have the Infoblox appliance running on the management network (the Infoblox appliance admin or the datacenter administrator should know how to perform this step).
Change the control plane definition to add infoblox-ipam-agent as a service in the controller node cluster. Make the changes in control_plane.yml, found here: ~/openstack/my_cloud/definition/data/control_plane.yml
---
product:
  version: 2
control-planes:
  - name: ccp
    control-plane-prefix: ccp
    ...
    clusters:
      - name: cluster0
        cluster-prefix: c0
        server-role: ARDANA-ROLE
        member-count: 1
        allocation-policy: strict
        service-components:
          - lifecycle-manager
      - name: cluster1
        cluster-prefix: c1
        server-role: CONTROLLER-ROLE
        member-count: 3
        allocation-policy: strict
        service-components:
          - ntp-server
          ...
          - neutron-server
          - infoblox-ipam-agent
          ...
          - designate-client
          - bind
    resources:
      - name: compute
        resource-prefix: comp
        server-role: COMPUTE-ROLE
        allocation-policy: any
Modify the ~/openstack/my_cloud/config/neutron/neutron.conf.j2 file on the controller node, commenting and uncommenting the lines noted below to enable use with the Infoblox appliance:
[DEFAULT]
...
l3_ha_net_cidr = 169.254.192.0/18
# Uncomment the line below if the Reference Pluggable IPAM driver is to be used
# ipam_driver = internal
# Comment out the line below if the Infoblox IPAM Driver is to be used
# notification_driver = messaging
# Uncomment the lines below if the Infoblox IPAM driver is to be used
ipam_driver = networking_infoblox.ipam.driver.InfobloxPool
notification_driver = messagingv2

# Modify the infoblox sections below to suit your cloud environment
[infoblox]
cloud_data_center_id = 1

# The name of this section is formed by "infoblox-dc:<infoblox.cloud_data_center_id>"
# If cloud_data_center_id is 1, then the section name is "infoblox-dc:1"
[infoblox-dc:1]
http_request_timeout = 120
http_pool_maxsize = 100
http_pool_connections = 100
ssl_verify = False
wapi_version = 2.2
admin_user_name = admin
admin_password = infoblox
grid_master_name = infoblox.localdomain
grid_master_host = 1.2.3.4

[QUOTAS]
...
Change nova.conf.j2 to replace the notification driver "messaging" with "messagingv2":
...
# Oslo messaging
notification_driver = log
# Note:
# If the infoblox-ipam-agent is to be deployed in the cloud, change the
# notification_driver setting from "messaging" to "messagingv2".
notification_driver = messagingv2
notification_topics = notifications

# Policy
...
Commit the changes:
ardana > cd ~/openstack/my_cloud
ardana > git commit -a -m 'My config for enabling the Infoblox IPAM driver'
Deploy the cloud with the changes. Because of the changes to control_plane.yml, you will need to rerun the config-processor-run.yml playbook if you have already run it during the install process.
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml
10.4.7.4 Configuration parameters for using the Infoblox IPAM driver #
Changes required in the notification parameters in nova.conf:
Parameter Name | Section in nova.conf | Default Value | Current Value | Description |
---|---|---|---|---|
notify_on_state_change | DEFAULT | None | vm_and_task_state | Send compute.instance.update notifications on instance state changes. vm_and_task_state means notify on both VM and task state changes. Infoblox requires notifications on VM state changes (vm_state), which vm_and_task_state already covers, so NO CHANGE is needed for Infoblox. |
notification_topics | DEFAULT | empty list | notifications |
NO CHANGE is needed for infoblox. The infoblox installation guide requires the notifications to be "notifications" |
notification_driver | DEFAULT | None | messaging |
Change needed. The infoblox installation guide requires the notification driver to be "messagingv2". |
Changes to existing parameters in neutron.conf
Parameter Name | Section in neutron.conf | Default Value | Current Value | Description |
---|---|---|---|---|
ipam_driver | DEFAULT | None |
None (param is undeclared in neutron.conf) |
Pluggable IPAM driver to be used by neutron API server. For infoblox, the value is "networking_infoblox.ipam.driver.InfobloxPool" |
notification_driver | DEFAULT | empty list | messaging |
The driver used to send notifications from the neutron API server to the neutron agents. The installation guide for networking-infoblox calls for the notification_driver to be "messagingv2" |
notification_topics | DEFAULT | None | notifications |
No change needed. This row is here to show the changes in the neutron parameters described in the installation guide for networking-infoblox. |
Parameters specific to the Networking Infoblox Driver. All the parameters for the Infoblox IPAM driver must be defined in neutron.conf.
Parameter Name | Section in neutron.conf | Default Value | Description |
---|---|---|---|
cloud_data_center_id | infoblox | 0 | ID for selecting a particular grid from one or more grids to serve networks in the Infoblox back end |
ipam_agent_workers | infoblox | 1 | Number of Infoblox IPAM agent workers to run |
grid_master_host | infoblox-dc:<cloud_data_center_id> | empty string | IP address of the grid master. WAPI requests are sent to the grid_master_host. |
ssl_verify | infoblox-dc:<cloud_data_center_id> | False | Whether WAPI requests sent over HTTPS require SSL verification |
wapi_version | infoblox-dc:<cloud_data_center_id> | 1.4 | The WAPI version. The value should be 2.2. |
admin_user_name | infoblox-dc:<cloud_data_center_id> | empty string | Admin user name to access the grid master or cloud platform appliance |
admin_password | infoblox-dc:<cloud_data_center_id> | empty string | Admin user password |
http_pool_connections | infoblox-dc:<cloud_data_center_id> | 100 | |
http_pool_maxsize | infoblox-dc:<cloud_data_center_id> | 100 | |
http_request_timeout | infoblox-dc:<cloud_data_center_id> | 120 | |
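The section-naming rule above (the grid section is named "infoblox-dc:" followed by the value of cloud_data_center_id) can be demonstrated with Python's stdlib configparser. The fragment below is a trimmed, hypothetical version of the sample configuration shown earlier:

```python
import configparser

# Trimmed version of the neutron.conf fragment shown earlier. The grid
# section name is derived from cloud_data_center_id: "infoblox-dc:<id>".
SAMPLE = """
[infoblox]
cloud_data_center_id = 1

[infoblox-dc:1]
wapi_version = 2.2
grid_master_host = 1.2.3.4
ssl_verify = False
"""

cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE)
dc_id = cfg["infoblox"]["cloud_data_center_id"]
grid = cfg["infoblox-dc:%s" % dc_id]        # section lookup by derived name
print(grid["grid_master_host"])             # 1.2.3.4
```

If cloud_data_center_id and the section name disagree, the driver cannot find its grid settings, so it is worth checking this pairing after any edit.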
(Diagram: nova compute sends notifications to the infoblox-ipam-agent.)
10.4.7.5 Limitations #
There is no IPAM migration path from the non-pluggable to a pluggable IPAM driver (https://bugs.launchpad.net/neutron/+bug/1516156): neutron cannot reconfigure its database to switch drivers. Unless you change the default non-pluggable IPAM configuration to a pluggable driver at install time, you will have no later opportunity to make that change, because reconfiguring SUSE OpenStack Cloud 9 from the default non-pluggable IPAM configuration to a pluggable IPAM driver is not supported.
Upgrade from previous versions of SUSE OpenStack Cloud to SUSE OpenStack Cloud 9 to use a pluggable IPAM driver is not supported.
The Infoblox appliance does not allow for overlapping IPs. For example, only one tenant can have a CIDR of 10.0.0.0/24.
The Infoblox IPAM driver fails to create a subnet when no gateway IP is supplied. For example, the command
openstack subnet create ... --no-gateway ...
will fail.
10.4.8 Configuring Load Balancing as a Service (LBaaS) #
SUSE OpenStack Cloud 9 LBaaS Configuration
Load Balancing as a Service (LBaaS) is an advanced networking service that allows load balancing of multi-node environments. It provides the ability to spread requests across multiple servers thereby reducing the load on any single server. This document describes the installation steps and the configuration for LBaaS v2.
The LBaaS architecture is based on a driver model to support different load balancers. LBaaS-compatible drivers are provided by load balancer vendors including F5 and Citrix. A new software load balancer driver was introduced in the OpenStack Liberty release called "Octavia". The Octavia driver deploys a software load balancer called HAProxy. Octavia is the default load balancing provider in SUSE OpenStack Cloud 9 for LBaaS v2. Until Octavia is configured, the creation of load balancers will fail with an error. Refer to Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service” for information on installing Octavia.
Before upgrading to SUSE OpenStack Cloud 9, contact F5 and SUSE to determine which F5 drivers have been certified for use with SUSE OpenStack Cloud. Loading drivers not certified by SUSE may result in failure of your cloud deployment.
LBaaS v2, together with Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”, offers a software load balancing solution that supports both a highly available control plane and data plane. However, if an external hardware load balancer is selected, the cloud operator can achieve additional performance and availability.
LBaaS v2
Reasons to select this version:
Your vendor already has a driver that supports LBaaS v2. Many hardware load balancer vendors already support LBaaS v2 and this list is growing all the time.
You intend to script your load balancer creation and management so a UI is not important right now (horizon support will be added in a future release).
You intend to support TLS termination at the load balancer.
You intend to use the Octavia software load balancer (adding HA and scalability).
You do not want to take your load balancers offline to perform subsequent LBaaS upgrades.
You expect to need L7 load balancing in future releases.
Reasons not to select this version.
Your LBaaS vendor does not have a v2 driver.
You must be able to manage your load balancers from horizon.
You have legacy software which utilizes the LBaaS v1 API.
LBaaS v2 is installed by default with SUSE OpenStack Cloud and requires minimal configuration to start the service.
The LBaaS v2 API includes automatic failover of a deployed load balancer with Octavia. More information about this driver can be found in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”.
10.4.8.1 Prerequisites #
SUSE OpenStack Cloud LBaaS v2
SUSE OpenStack Cloud must be installed for LBaaS v2.
Follow the instructions to install Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”
10.4.9 Load Balancer: Octavia Driver Administration #
This document provides the instructions on how to enable and manage various components of the Load Balancer Octavia driver if that driver is enabled.
Section 10.4.9.2, “Tuning Octavia Installation”
Homogeneous Compute Configuration
Octavia and Floating IP's
Configuration Files
Spare Pools
Section 10.4.9.3, “Managing Amphora”
Updating the Cryptographic Certificates
Accessing VM information in nova
Initiating Failover of an Amphora VM
10.4.9.1 Monasca Alerts #
The monasca-agent has the following Octavia-related plugins:
Process checks – checks whether the Octavia processes are running. When it starts, it detects which processes are running and then monitors them.
http_connect check – checks whether it can connect to the Octavia API servers.
Alerts are displayed in the Operations Console.
10.4.9.2 Tuning Octavia Installation #
Homogeneous Compute Configuration
Octavia works only with homogeneous compute node configurations. Currently, Octavia does not support multiple nova flavors. If Octavia needs to be supported on multiple compute nodes, then all the compute nodes should carry the same set of physnets (which will be used for Octavia).
Octavia and Floating IPs
Due to a neutron limitation, Octavia will only work with CVR routers. Another option is to use VLAN provider networks, which do not require a router.
You cannot currently assign a floating IP address as the VIP (user-facing) address for a load balancer created by the Octavia driver if the underlying neutron network is configured to support Distributed Virtual Router (DVR). The Octavia driver uses a neutron function known as allowed address pairs to support load balancer failover, and there is currently a neutron bug that does not support this function in a DVR configuration.
Octavia Configuration Files
The system comes pre-tuned and should not need any adjustments for most customers. If in rare instances manual tuning is needed, follow these steps:
Changes might be lost during SUSE OpenStack Cloud upgrades.
Edit the Octavia configuration files in my_cloud/config/octavia. It is recommended that any changes be made in all of the Octavia configuration files:
octavia-api.conf.j2
octavia-health-manager.conf.j2
octavia-housekeeping.conf.j2
octavia-worker.conf.j2
After the changes are made to the configuration files, redeploy the service.
Commit changes to git:
ardana > cd ~/openstack
ardana > git add -A
ardana > git commit -m "My Octavia Config"
Run the configuration processor and ready deployment:
ardana > cd ~/openstack/ardana/ansible/
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the Octavia reconfigure:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml
Spare Pools
The Octavia driver supports maintaining a spare pool of VMs with the HAProxy software already installed. Instead of booting a new VM when a load balancer is created, a create call pulls a ready-built load balancer from the spare pool. Because the spare pool consumes resources, its size defaults to 0, which disables the feature.
Reasons to enable a load balancing spare pool in SUSE OpenStack Cloud
You expect a large number of load balancers to be provisioned all at once (puppet scripts, or ansible scripts) and you want them to come up quickly.
You want to reduce the wait time a customer has while requesting a new load balancer.
To increase the number of load balancers in your spare pool, edit the Octavia configuration files, uncommenting spare_amphora_pool_size and setting the number of load balancers you would like to keep in your spare pool:
# Pool size for the spare pool
# spare_amphora_pool_size = 0
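The spare-pool semantics described above can be sketched as a toy model: a create call takes a prebuilt amphora from the pool when one exists (fast path), otherwise it boots a new VM (slow path), and a housekeeping task keeps the pool topped up. The class and method names below are illustrative assumptions, not Octavia's actual internals:

```python
from collections import deque

class SparePool:
    """Toy model of Octavia's spare-amphora pool behavior."""
    def __init__(self, spare_amphora_pool_size=0):
        self.size = spare_amphora_pool_size
        self.pool = deque()
        self.booted = 0

    def _boot_amphora(self):
        # Stands in for the slow path of booting a new amphora VM.
        self.booted += 1
        return "amphora-%d" % self.booted

    def housekeeping(self):
        # Periodic task: top the pool up to the configured size.
        while len(self.pool) < self.size:
            self.pool.append(self._boot_amphora())

    def create_load_balancer(self):
        if self.pool:                      # fast path: reuse a spare VM
            return self.pool.popleft(), "from-pool"
        return self._boot_amphora(), "fresh-boot"
```

With the default size of 0, housekeeping never prebuilds anything and every create call takes the slow path, which matches the "feature disabled" behavior described above.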
10.4.9.3 Managing Amphora #
Octavia starts a separate VM for each load balancing function. These VMs are called amphora.
Updating the Cryptographic Certificates
Octavia uses two-way SSL encryption for communication between amphora and the control plane. Octavia keeps track of the certificates on the amphora and will automatically recycle them. The certificates on the control plane are valid for one year after installation of SUSE OpenStack Cloud.
You can check on the status of the certificate by logging into the controller node as root and running:
ardana > cd /opt/stack/service/octavia-SOME UUID/etc/certs/
ardana > openssl x509 -in client.pem -text -noout
This prints the certificate out where you can check on the expiration dates.
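The certificate text printed by openssl includes a "Not After" timestamp in GMT. Python's stdlib `ssl.cert_time_to_seconds` parses exactly that format, so a quick check of the remaining validity can be scripted; the date below is a hypothetical example, not taken from a real client.pem:

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Convert an openssl "Not After" timestamp (e.g. 'Jun  1 12:00:00
    2030 GMT') into the number of days remaining from `now` (defaults to
    the current time)."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - (now if now is not None else time.time())) / 86400

# Hypothetical date for illustration; read the real one from client.pem.
print(round(days_until_expiry("Jun  1 12:00:00 2030 GMT")))
```

If the result approaches zero, run octavia-reconfigure as described below to regenerate the certificates.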
To renew the certificates, reconfigure Octavia. Reconfiguring causes Octavia to automatically generate new certificates and deploy them to the controller hosts.
On the Cloud Lifecycle Manager execute octavia-reconfigure:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml
Accessing VM information in nova
You can use openstack project list as an administrative user to obtain information about the tenant or project ID of the Octavia project. In the example below, the Octavia project has a project ID of 37fd6e4feac14741b6e75aba14aea833.
ardana > openstack project list
+----------------------------------+------------------+
| ID | Name |
+----------------------------------+------------------+
| 055071d8f25d450ea0b981ca67f7ccee | glance-swift |
| 37fd6e4feac14741b6e75aba14aea833 | octavia |
| 4b431ae087ef4bd285bc887da6405b12 | swift-monitor |
| 8ecf2bb5754646ae97989ba6cba08607 | swift-dispersion |
| b6bd581f8d9a48e18c86008301d40b26 | services |
| bfcada17189e4bc7b22a9072d663b52d | cinderinternal |
| c410223059354dd19964063ef7d63eca | monitor |
| d43bc229f513494189422d88709b7b73 | admin |
| d5a80541ba324c54aeae58ac3de95f77 | demo |
| ea6e039d973e4a58bbe42ee08eaf6a7a | backup |
+----------------------------------+------------------+
You can then use openstack server list --tenant <project-id> to list the VMs for the Octavia tenant. Take particular note of the IP address on the OCTAVIA-MGMT-NET; in the example below it is 172.30.1.11. For additional nova command-line options see Section 10.4.9.5, “For More Information”.
ardana > openstack server list --tenant 37fd6e4feac14741b6e75aba14aea833
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| 1ed8f651-de31-4208-81c5-817363818596 | amphora-1c3a4598-5489-48ea-8b9c-60c821269e4c | 37fd6e4feac14741b6e75aba14aea833 | ACTIVE | - | Running | private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.11 |
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
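When scripting against this output, the management address can be pulled out of the Networks column with a small helper. The row string and the regex below are illustrative assumptions; in practice you would prefer the CLI's machine-readable output formats (for example -f json) over screen-scraping:

```python
import re

# One data row of the table above, flattened into a single string.
ROW = ("| 1ed8f651-de31-4208-81c5-817363818596 | "
       "amphora-1c3a4598-5489-48ea-8b9c-60c821269e4c | "
       "37fd6e4feac14741b6e75aba14aea833 | ACTIVE | - | Running | "
       "private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.11 |")

def management_ip(row):
    """Pull the OCTAVIA-MGMT-NET address out of a server-list row."""
    match = re.search(r"OCTAVIA-MGMT-NET=([\d.]+)", row)
    if not match:
        raise ValueError("no OCTAVIA-MGMT-NET address in row")
    return match.group(1)

print(management_ip(ROW))   # 172.30.1.11
```

This address is the one used in the failover procedure later in this section.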
The Amphora VMs do not have SSH or any other access. In the rare case that there is a problem with the underlying load balancer the whole amphora will need to be replaced.
Initiating Failover of an Amphora VM
Under normal operations Octavia will monitor the health of the amphora constantly and automatically fail them over if there are any issues. This helps to minimize any potential downtime for load balancer users. There are, however, a few cases where a failover needs to be initiated manually:
The load balancer has become unresponsive and Octavia has not detected an error.
A new image has become available and existing load balancers need to start using the new image.
The cryptographic certificates to control and/or the HMAC password to verify Health information of the amphora have been compromised.
To minimize the impact for end users we will keep the existing load balancer working until shortly before the new one has been provisioned. There will be a short interruption for the load balancing service so keep that in mind when scheduling the failovers. To achieve that follow these steps (assuming the management ip from the previous step):
Assign the IP to a SHELL variable for better readability.
ardana >
export MGM_IP=172.30.1.11
Identify the port of the VM on the management network.
ardana >
openstack port list | grep $MGM_IP
| 0b0301b9-4ee8-4fb6-a47c-2690594173f4 | | fa:16:3e:d7:50:92 | {"subnet_id": "3e0de487-e255-4fc3-84b8-60e08564c5b7", "ip_address": "172.30.1.11"} |
Disable the port to initiate a failover. Note that the load balancer will still function, but can no longer be controlled by Octavia.
Note: Changes after disabling the port will result in errors.
ardana >
openstack port set --admin-state-up False 0b0301b9-4ee8-4fb6-a47c-2690594173f4
Updated port: 0b0301b9-4ee8-4fb6-a47c-2690594173f4
You can check to see if the amphora failed over with
openstack server list --tenant <project-id>
. This may take some time and in some cases may need to be repeated several times. You can tell that the failover has been successful by the changed IP on the management network.
ardana >
openstack server list --tenant 37fd6e4feac14741b6e75aba14aea833
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| 1ed8f651-de31-4208-81c5-817363818596 | amphora-1c3a4598-5489-48ea-8b9c-60c821269e4c | 37fd6e4feac14741b6e75aba14aea833 | ACTIVE | - | Running | private=10.0.0.4; OCTAVIA-MGMT-NET=172.30.1.12 |
+--------------------------------------+----------------------------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
Do not issue too many failovers at once. In a big installation, you might be tempted to initiate several failovers in parallel, for instance to speed up an update of amphora images. This puts a strain on the nova service, and depending on the size of your installation, you might need to throttle the failover rate.
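The throttling advice above can be sketched as a small loop that disables one management port at a time and pauses between failovers. This is an illustrative sketch, not a supported tool: the function name, the OPENSTACK variable (overridable for a dry run), and the delay value are assumptions you should adapt to your installation.

```shell
# Sketch: serialized amphora failover with a delay between each one.
# OPENSTACK and FAILOVER_DELAY are assumptions, overridable for testing.
OPENSTACK=${OPENSTACK:-openstack}
FAILOVER_DELAY=${FAILOVER_DELAY:-300}   # seconds to wait between failovers

throttled_failover() {
    # Each argument is the ID of an amphora's management-network port.
    local port
    for port in "$@"; do
        # Disabling the management port makes Octavia fail over the amphora.
        "$OPENSTACK" port set --admin-state-up False "$port"
        sleep "$FAILOVER_DELAY"
    done
}
```

A serial loop with a generous delay keeps at most one replacement amphora building at a time, which bounds the load on nova.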
10.4.9.4 Load Balancer: Octavia Administration #
10.4.9.4.1 Removing load balancers #
The following procedures demonstrate how to delete a load
balancer that is in the ERROR
,
PENDING_CREATE
, or
PENDING_DELETE
state.
Query the Neutron service for the loadbalancer ID:
tux >
neutron lbaas-loadbalancer-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+---------+----------------------------------+--------------+---------------------+----------+
| id | name | tenant_id | vip_address | provisioning_status | provider |
+--------------------------------------+---------+----------------------------------+--------------+---------------------+----------+
| 7be4e4ab-e9c6-4a57-b767-da9af5ba7405 | test-lb | d62a1510b0f54b5693566fb8afeb5e33 | 192.168.1.10 | ERROR | haproxy |
+--------------------------------------+---------+----------------------------------+--------------+---------------------+----------+
Connect to the neutron database:
Important: The default database name depends on the life cycle manager. Ardana uses ovs_neutron while Crowbar uses neutron.
Ardana:
mysql> use ovs_neutron
Crowbar:
mysql> use neutron
Get the pools and healthmonitors associated with the loadbalancer:
mysql> select id, healthmonitor_id, loadbalancer_id from lbaas_pools where loadbalancer_id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
+--------------------------------------+--------------------------------------+--------------------------------------+
| id | healthmonitor_id | loadbalancer_id |
+--------------------------------------+--------------------------------------+--------------------------------------+
| 26c0384b-fc76-4943-83e5-9de40dd1c78c | 323a3c4b-8083-41e1-b1d9-04e1fef1a331 | 7be4e4ab-e9c6-4a57-b767-da9af5ba7405 |
+--------------------------------------+--------------------------------------+--------------------------------------+
Get the members associated with the pool:
mysql> select id, pool_id from lbaas_members where pool_id = '26c0384b-fc76-4943-83e5-9de40dd1c78c';
+--------------------------------------+--------------------------------------+
| id | pool_id |
+--------------------------------------+--------------------------------------+
| 6730f6c1-634c-4371-9df5-1a880662acc9 | 26c0384b-fc76-4943-83e5-9de40dd1c78c |
| 06f0cfc9-379a-4e3d-ab31-cdba1580afc2 | 26c0384b-fc76-4943-83e5-9de40dd1c78c |
+--------------------------------------+--------------------------------------+
Delete the pool members:
mysql> delete from lbaas_members where id = '6730f6c1-634c-4371-9df5-1a880662acc9';
mysql> delete from lbaas_members where id = '06f0cfc9-379a-4e3d-ab31-cdba1580afc2';
Find and delete the listener associated with the loadbalancer:
mysql> select id, loadbalancer_id, default_pool_id from lbaas_listeners where loadbalancer_id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
+--------------------------------------+--------------------------------------+--------------------------------------+
| id | loadbalancer_id | default_pool_id |
+--------------------------------------+--------------------------------------+--------------------------------------+
| 3283f589-8464-43b3-96e0-399377642e0a | 7be4e4ab-e9c6-4a57-b767-da9af5ba7405 | 26c0384b-fc76-4943-83e5-9de40dd1c78c |
+--------------------------------------+--------------------------------------+--------------------------------------+
mysql> delete from lbaas_listeners where id = '3283f589-8464-43b3-96e0-399377642e0a';
Delete the pool associated with the loadbalancer:
mysql> delete from lbaas_pools where id = '26c0384b-fc76-4943-83e5-9de40dd1c78c';
Delete the healthmonitor associated with the pool:
mysql> delete from lbaas_healthmonitors where id = '323a3c4b-8083-41e1-b1d9-04e1fef1a331';
Delete the loadbalancer:
mysql> delete from lbaas_loadbalancer_statistics where loadbalancer_id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
mysql> delete from lbaas_loadbalancers where id = '7be4e4ab-e9c6-4a57-b767-da9af5ba7405';
Query the Octavia service for the loadbalancer ID:
tux >
openstack loadbalancer list --column id --column name --column provisioning_status
+--------------------------------------+---------+---------------------+
| id | name | provisioning_status |
+--------------------------------------+---------+---------------------+
| d8ac085d-e077-4af2-b47a-bdec0c162928 | test-lb | ERROR |
+--------------------------------------+---------+---------------------+
Query the Octavia service for the amphora IDs (in this example we use
ACTIVE/STANDBY
topology with 1 spare Amphora):
tux >
openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
| 6dc66d41-e4b6-4c33-945d-563f8b26e675 | d8ac085d-e077-4af2-b47a-bdec0c162928 | ALLOCATED | BACKUP | 172.30.1.7 | 192.168.1.8 |
| 1b195602-3b14-4352-b355-5c4a70e200cf | d8ac085d-e077-4af2-b47a-bdec0c162928 | ALLOCATED | MASTER | 172.30.1.6 | 192.168.1.8 |
| b2ee14df-8ac6-4bb0-a8d3-3f378dbc2509 | None | READY | None | 172.30.1.20 | None |
+--------------------------------------+--------------------------------------+-----------+--------+---------------+-------------+
Query the Octavia service for the loadbalancer pools:
tux >
openstack loadbalancer pool list
+--------------------------------------+-----------+----------------------------------+---------------------+----------+--------------+----------------+
| id | name | project_id | provisioning_status | protocol | lb_algorithm | admin_state_up |
+--------------------------------------+-----------+----------------------------------+---------------------+----------+--------------+----------------+
| 39c4c791-6e66-4dd5-9b80-14ea11152bb5 | test-pool | 86fba765e67f430b83437f2f25225b65 | ACTIVE | TCP | ROUND_ROBIN | True |
+--------------------------------------+-----------+----------------------------------+---------------------+----------+--------------+----------------+
Connect to the octavia database:
mysql> use octavia
Delete any listeners, pools, health monitors, and members from the load balancer:
mysql> delete from listener where load_balancer_id = 'd8ac085d-e077-4af2-b47a-bdec0c162928';
mysql> delete from health_monitor where pool_id = '39c4c791-6e66-4dd5-9b80-14ea11152bb5';
mysql> delete from member where pool_id = '39c4c791-6e66-4dd5-9b80-14ea11152bb5';
mysql> delete from pool where load_balancer_id = 'd8ac085d-e077-4af2-b47a-bdec0c162928';
Delete the amphora entries in the database:
mysql> delete from amphora_health where amphora_id = '6dc66d41-e4b6-4c33-945d-563f8b26e675';
mysql> update amphora set status = 'DELETED' where id = '6dc66d41-e4b6-4c33-945d-563f8b26e675';
mysql> delete from amphora_health where amphora_id = '1b195602-3b14-4352-b355-5c4a70e200cf';
mysql> update amphora set status = 'DELETED' where id = '1b195602-3b14-4352-b355-5c4a70e200cf';
Delete the load balancer instance:
mysql> update load_balancer set provisioning_status = 'DELETED' where id = 'd8ac085d-e077-4af2-b47a-bdec0c162928';
The following script automates the above steps:
#!/bin/bash

if (( $# != 1 )); then
  echo "Please specify a loadbalancer ID"
  exit 1
fi

LB_ID=$1

set -u -e -x

readarray -t AMPHORAE < <(openstack loadbalancer amphora list \
  --format value \
  --column id \
  --column loadbalancer_id \
  | grep ${LB_ID} \
  | cut -d ' ' -f 1)

readarray -t POOLS < <(openstack loadbalancer show ${LB_ID} \
  --format value \
  --column pools)

mysql octavia --execute "delete from listener where load_balancer_id = '${LB_ID}';"

for p in "${POOLS[@]}"; do
  mysql octavia --execute "delete from health_monitor where pool_id = '${p}';"
  mysql octavia --execute "delete from member where pool_id = '${p}';"
done

mysql octavia --execute "delete from pool where load_balancer_id = '${LB_ID}';"

for a in "${AMPHORAE[@]}"; do
  mysql octavia --execute "delete from amphora_health where amphora_id = '${a}';"
  mysql octavia --execute "update amphora set status = 'DELETED' where id = '${a}';"
done

mysql octavia --execute "update load_balancer set provisioning_status = 'DELETED' where id = '${LB_ID}';"
10.4.9.5 For More Information #
For more information on the OpenStackClient and Octavia terminology, see the OpenStackClient guide.
10.4.10 Role-based Access Control in neutron #
This topic explains how to achieve more granular access control for your neutron networks.
Previously in SUSE OpenStack Cloud, a network object was either private to a project or could be used by all projects. If the network's shared attribute was True, then the network could be used by every project in the cloud. If false, only the members of the owning project could use it. There was no way for the network to be shared by only a subset of the projects.
neutron Role Based Access Control (RBAC) solves this problem for networks. The network owner can now create RBAC policies that give network access to target projects. Members of a targeted project can use the network named in the RBAC policy the same way as if the network were owned by the project. Constraints are described in Section 10.4.10.10, “Limitations”.
With RBAC you are able to let another tenant use a network that you created, but as the owner of the network, you need to create the subnet and the router for the network.
10.4.10.1 Creating a Network #
ardana >
openstack network create demo-net
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| admin_state_up | UP |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2018-07-25T17:43:59Z |
| description | |
| dns_domain | |
| id | 9c801954-ec7f-4a65-82f8-e313120aabc4 |
| ipv4_address_scope | None |
| ipv6_address_scope | None |
| is_default | False |
| is_vlan_transparent | None |
| mtu | 1450 |
| name | demo-net |
| port_security_enabled | False |
| project_id | cb67c79e25a84e328326d186bf703e1b |
| provider:network_type | vxlan |
| provider:physical_network | None |
| provider:segmentation_id | 1009 |
| qos_policy_id | None |
| revision_number | 2 |
| router:external | Internal |
| segments | None |
| shared | False |
| status | ACTIVE |
| subnets | |
| tags | |
| updated_at | 2018-07-25T17:43:59Z |
+---------------------------+--------------------------------------+
10.4.10.2 Creating an RBAC Policy #
Here we will create an RBAC policy where a member of the project 'demo' shares the network with members of the project 'demo2'.
To create the RBAC policy, run:
ardana >
openstack network rbac create --target-project DEMO2-PROJECT-ID --type network --action access_as_shared demo-net
Here is an example where the DEMO2-PROJECT-ID is 5a582af8b44b422fafcd4545bd2b7eb5:
ardana >
openstack network rbac create --target-project 5a582af8b44b422fafcd4545bd2b7eb5 \
--type network --action access_as_shared demo-net
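If you only know the target project by name, you can resolve its ID first and create the policy in one step. A minimal sketch, assuming a project named demo2; the helper name and the OPENSTACK override (useful for a dry run) are illustrative, not part of the product.

```shell
# Sketch: look up a project's ID by name, then create the RBAC policy.
# OPENSTACK is overridable for testing/dry runs.
OPENSTACK=${OPENSTACK:-openstack}

share_network_with_project() {
    local network=$1 project=$2
    local target_id
    # "-f value -c id" prints just the project ID.
    target_id=$("$OPENSTACK" project show "$project" -f value -c id)
    "$OPENSTACK" network rbac create --target-project "$target_id" \
        --type network --action access_as_shared "$network"
}
```

Usage would then be, for the example above: share_network_with_project demo-net demo2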
10.4.10.3 Listing RBACs #
To list all the RBAC rules/policies, execute:
ardana >
openstack network rbac list
+--------------------------------------+-------------+--------------------------------------+
| ID | Object Type | Object ID |
+--------------------------------------+-------------+--------------------------------------+
| 0fdec7f0-9b94-42b4-a4cd-b291d04282c1 | network | 7cd94877-4276-488d-b682-7328fc85d721 |
+--------------------------------------+-------------+--------------------------------------+
10.4.10.4 Listing the Attributes of an RBAC #
To see the attributes of a specific RBAC policy, run
ardana >
openstack network rbac show POLICY-ID
For example:
ardana >
openstack network rbac show 0fd89dcb-9809-4a5e-adc1-39dd676cb386
Here is the output:
+---------------+--------------------------------------+
| Field | Value |
+---------------+--------------------------------------+
| action | access_as_shared |
| id | 0fd89dcb-9809-4a5e-adc1-39dd676cb386 |
| object_id | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b |
| object_type | network |
| target_tenant | 5a582af8b44b422fafcd4545bd2b7eb5 |
| tenant_id | 75eb5efae5764682bca2fede6f4d8c6f |
+---------------+--------------------------------------+
10.4.10.5 Deleting an RBAC Policy #
To delete an RBAC policy, run openstack network rbac delete
passing the policy id:
ardana >
openstack network rbac delete POLICY-ID
For example:
ardana >
openstack network rbac delete 0fd89dcb-9809-4a5e-adc1-39dd676cb386
Here is the output:
Deleted rbac_policy: 0fd89dcb-9809-4a5e-adc1-39dd676cb386
10.4.10.6 Sharing a Network with All Tenants #
Either the administrator or the network owner can make a network shareable by all tenants.
To make the network
demo-shareall-net
accessible by all tenants in the cloud, as the administrator:
Get a list of all projects
ardana >
source ~/service.osrc
ardana >
openstack project list
which produces the list:
+----------------------------------+------------------+
| ID | Name |
+----------------------------------+------------------+
| 1be57778b61645a7a1c07ca0ac488f9e | demo |
| 5346676226274cd2b3e3862c2d5ceadd | admin |
| 749a557b2b9c482ca047e8f4abf348cd | swift-monitor |
| 8284a83df4df429fb04996c59f9a314b | swift-dispersion |
| c7a74026ed8d4345a48a3860048dcb39 | demo-sharee |
| e771266d937440828372090c4f99a995 | glance-swift |
| f43fb69f107b4b109d22431766b85f20 | services |
+----------------------------------+------------------+
Get a list of networks:
ardana >
openstack network list
This produces the following list:
+--------------------------------------+-------------------+----------------------------------------------------+
| id | name | subnets |
+--------------------------------------+-------------------+----------------------------------------------------+
| f50f9a63-c048-444d-939d-370cb0af1387 | ext-net | ef3873db-fc7a-4085-8454-5566fb5578ea 172.31.0.0/16 |
| 9fb676f5-137e-4646-ac6e-db675a885fd3 | demo-net | 18fb0b77-fc8b-4f8d-9172-ee47869f92cc 10.0.1.0/24 |
| 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e | demo-shareall-net | 2bbc85a9-3ffe-464c-944b-2476c7804877 10.0.250.0/24 |
| 73f946ee-bd2b-42e9-87e4-87f19edd0682 | demo-share-subset | c088b0ef-f541-42a7-b4b9-6ef3c9921e44 10.0.2.0/24 |
+--------------------------------------+-------------------+----------------------------------------------------+
Set the network you want to share to a shared value of True:
ardana >
openstack network set --share 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e
You should see the following output:
Updated network: 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e
Check the attributes of that network by running the following command using the ID of the network in question:
ardana >
openstack network show 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e
The output will look like this:
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| admin_state_up | UP |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2018-07-25T17:43:59Z |
| description | |
| dns_domain | |
| id | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e |
| ipv4_address_scope | None |
| ipv6_address_scope | None |
| is_default | None |
| is_vlan_transparent | None |
| mtu | 1450 |
| name | demo-net |
| port_security_enabled | False |
| project_id | cb67c79e25a84e328326d186bf703e1b |
| provider:network_type | vxlan |
| provider:physical_network | None |
| provider:segmentation_id | 1009 |
| qos_policy_id | None |
| revision_number | 2 |
| router:external | Internal |
| segments | None |
| shared | False |
| status | ACTIVE |
| subnets | |
| tags | |
| updated_at | 2018-07-25T17:43:59Z |
+---------------------------+--------------------------------------+
As the owner of the
demo-shareall-net
network, view the RBAC attributes fordemo-shareall-net
(id=8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e
) by first getting an RBAC list:
ardana >
echo $OS_USERNAME ; echo $OS_PROJECT_NAME
demo
demo
ardana >
openstack network rbac list
This produces the list:
+--------------------------------------+--------------------------------------+
| id | object_id |
+--------------------------------------+--------------------------------------+
| ... |
| 3e078293-f55d-461c-9a0b-67b5dae321e8 | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e |
+--------------------------------------+--------------------------------------+
View the RBAC information:
ardana >
openstack network rbac show 3e078293-f55d-461c-9a0b-67b5dae321e8
+---------------+--------------------------------------+
| Field | Value |
+---------------+--------------------------------------+
| action | access_as_shared |
| id | 3e078293-f55d-461c-9a0b-67b5dae321e8 |
| object_id | 8eada4f7-83cf-40ba-aa8c-5bf7d87cca8e |
| object_type | network |
| target_tenant | * |
| tenant_id | 1be57778b61645a7a1c07ca0ac488f9e |
+---------------+--------------------------------------+
With network RBAC, the owner of the network can also make the network shareable by all tenants. First create the network:
ardana >
echo $OS_PROJECT_NAME ; echo $OS_USERNAME
demo
demo
ardana >
openstack network create test-net
The network is created:
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| admin_state_up | UP |
| availability_zone_hints | |
| availability_zones | |
| created_at | 2018-07-25T18:04:25Z |
| description | |
| dns_domain | |
| id | a4bd7c3a-818f-4431-8cdb-fedf7ff40f73 |
| ipv4_address_scope | None |
| ipv6_address_scope | None |
| is_default | False |
| is_vlan_transparent | None |
| mtu | 1450 |
| name | test-net |
| port_security_enabled | False |
| project_id | cb67c79e25a84e328326d186bf703e1b |
| provider:network_type | vxlan |
| provider:physical_network | None |
| provider:segmentation_id | 1073 |
| qos_policy_id | None |
| revision_number | 2 |
| router:external | Internal |
| segments | None |
| shared | False |
| status | ACTIVE |
| subnets | |
| tags | |
| updated_at | 2018-07-25T18:04:25Z |
+---------------------------+--------------------------------------+
Create the RBAC. It is important that the asterisk is surrounded by single-quotes to prevent the shell from expanding it to all files in the current directory.
ardana >
openstack network rbac create --type network \
  --action access_as_shared --target-project '*' test-net
Here are the resulting RBAC attributes:
+---------------+--------------------------------------+
| Field | Value |
+---------------+--------------------------------------+
| action | access_as_shared |
| id | 0b797cc6-debc-48a1-bf9d-d294b077d0d9 |
| object_id | a4bd7c3a-818f-4431-8cdb-fedf7ff40f73 |
| object_type | network |
| target_tenant | * |
| tenant_id | 1be57778b61645a7a1c07ca0ac488f9e |
+---------------+--------------------------------------+
10.4.10.7 Target Project (demo2
) View of Networks and Subnets #
Note that the owner of the network and subnet is not the tenant named
demo2
. Both the network and subnet are owned by tenant demo
.
Demo2
members cannot create subnets of the network. They also cannot
modify or delete subnets owned by demo
.
As the tenant demo2
, you can get a list of neutron networks:
ardana >
openstack network list
+--------------------------------------+-----------+--------------------------------------------------+
| id | name | subnets |
+--------------------------------------+-----------+--------------------------------------------------+
| f60f3896-2854-4f20-b03f-584a0dcce7a6 | ext-net | 50e39973-b2e3-466b-81c9-31f4d83d990b |
| c3d55c21-d8c9-4ee5-944b-560b7e0ea33b | demo-net | d9b765da-45eb-4543-be96-1b69a00a2556 10.0.1.0/24 |
...
+--------------------------------------+-----------+--------------------------------------------------+
And get a list of subnets:
ardana >
openstack subnet list --network c3d55c21-d8c9-4ee5-944b-560b7e0ea33b
+--------------------------------------+---------+--------------------------------------+---------------+
| ID | Name | Network | Subnet |
+--------------------------------------+---------+--------------------------------------+---------------+
| a806f28b-ad66-47f1-b280-a1caa9beb832 | ext-net | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b | 10.0.1.0/24 |
+--------------------------------------+---------+--------------------------------------+---------------+
To show details of the subnet:
ardana >
openstack subnet show d9b765da-45eb-4543-be96-1b69a00a2556
+-------------------+--------------------------------------------+
| Field | Value |
+-------------------+--------------------------------------------+
| allocation_pools | {"start": "10.0.1.2", "end": "10.0.1.254"} |
| cidr | 10.0.1.0/24 |
| dns_nameservers | |
| enable_dhcp | True |
| gateway_ip | 10.0.1.1 |
| host_routes | |
| id | d9b765da-45eb-4543-be96-1b69a00a2556 |
| ip_version | 4 |
| ipv6_address_mode | |
| ipv6_ra_mode | |
| name | sb-demo-net |
| network_id | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b |
| subnetpool_id | |
| tenant_id | 75eb5efae5764682bca2fede6f4d8c6f |
+-------------------+--------------------------------------------+
10.4.10.8 Target Project: Creating a Port Using demo-net #
The owner of the port is demo2
. Members of the network owner project
(demo
) will not see this port.
Running the following command:
ardana >
openstack port create c3d55c21-d8c9-4ee5-944b-560b7e0ea33b
Creates a new port:
+-----------------------+-----------------------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+-----------------------------------------------------------------------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| binding:vnic_type | normal |
| device_id | |
| device_owner | |
| dns_assignment | {"hostname": "host-10-0-1-10", "ip_address": "10.0.1.10", "fqdn": "host-10-0-1-10.openstacklocal."} |
| dns_name | |
| fixed_ips | {"subnet_id": "d9b765da-45eb-4543-be96-1b69a00a2556", "ip_address": "10.0.1.10"} |
| id | 03ef2dce-20dc-47e5-9160-942320b4e503 |
| mac_address | fa:16:3e:27:8d:ca |
| name | |
| network_id | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b |
| security_groups | 275802d0-33cb-4796-9e57-03d8ddd29b94 |
| status | DOWN |
| tenant_id | 5a582af8b44b422fafcd4545bd2b7eb5 |
+-----------------------+-----------------------------------------------------------------------------------------------------+
10.4.10.9 Target Project: Booting a VM Using demo-net #
Here the tenant demo2
boots a VM that uses the demo-net
shared network:
ardana >
openstack server create --flavor 1 --image $OS_IMAGE --nic net-id=c3d55c21-d8c9-4ee5-944b-560b7e0ea33b demo2-vm-using-demo-net-nic
+--------------------------------------+------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------+
| OS-EXT-AZ:availability_zone | |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | sS9uSv9PT79F |
| config_drive | |
| created | 2016-01-04T19:23:24Z |
| flavor | m1.tiny (1) |
| hostId | |
| id | 3a4dc44a-027b-45e9-acf8-054a7c2dca2a |
| image | cirros-0.3.3-x86_64 (6ae23432-8636-4e...1efc5) |
| key_name | - |
| metadata | {} |
| name | demo2-vm-using-demo-net-nic |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tenant_id | 5a582af8b44b422fafcd4545bd2b7eb5 |
| updated | 2016-01-04T19:23:24Z |
| user_id | a0e6427b036344fdb47162987cb0cee5 |
+--------------------------------------+------------------------------------------------+
Run openstack server list:
ardana >
openstack server list
See the VM running:
+-------------------+-----------------------------+--------+------------+-------------+--------------------+
| ID | Name | Status | Task State | Power State | Networks |
+-------------------+-----------------------------+--------+------------+-------------+--------------------+
| 3a4dc...a7c2dca2a | demo2-vm-using-demo-net-nic | ACTIVE | - | Running | demo-net=10.0.1.11 |
+-------------------+-----------------------------+--------+------------+-------------+--------------------+
Run openstack port list:
ardana >
openstack port list --device-id 3a4dc44a-027b-45e9-acf8-054a7c2dca2a
View the subnet:
+---------------------+------+-------------------+-------------------------------------------------------------------+
| id | name | mac_address | fixed_ips |
+---------------------+------+-------------------+-------------------------------------------------------------------+
| 7d14ef8b-9...80348f | | fa:16:3e:75:32:8e | {"subnet_id": "d9b765da-45...00a2556", "ip_address": "10.0.1.11"} |
+---------------------+------+-------------------+-------------------------------------------------------------------+
Run openstack port show:
ardana >
openstack port show 7d14ef8b-9d48-4310-8c02-00c74d80348f
+-----------------------+-----------------------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+-----------------------------------------------------------------------------------------------------+
| admin_state_up | True |
| allowed_address_pairs | |
| binding:vnic_type | normal |
| device_id | 3a4dc44a-027b-45e9-acf8-054a7c2dca2a |
| device_owner | compute:None |
| dns_assignment | {"hostname": "host-10-0-1-11", "ip_address": "10.0.1.11", "fqdn": "host-10-0-1-11.openstacklocal."} |
| dns_name | |
| extra_dhcp_opts | |
| fixed_ips | {"subnet_id": "d9b765da-45eb-4543-be96-1b69a00a2556", "ip_address": "10.0.1.11"} |
| id | 7d14ef8b-9d48-4310-8c02-00c74d80348f |
| mac_address | fa:16:3e:75:32:8e |
| name | |
| network_id | c3d55c21-d8c9-4ee5-944b-560b7e0ea33b |
| security_groups | 275802d0-33cb-4796-9e57-03d8ddd29b94 |
| status | ACTIVE |
| tenant_id | 5a582af8b44b422fafcd4545bd2b7eb5 |
+-----------------------+-----------------------------------------------------------------------------------------------------+
10.4.10.10 Limitations #
Note the following limitations of RBAC in neutron.
neutron network is the only supported RBAC neutron object type.
The "access_as_external" action is not supported – even though it is listed as a valid action by python-neutronclient.
The neutron-api server will not accept an action value of 'access_as_external'. The
access_as_external
definition is not found in the specs.
The target project users cannot create, modify, or delete subnets on networks that have RBAC policies.
The subnet of a network that has an RBAC policy cannot be added as an interface of a target tenant's router. For example, the command
openstack router add subnet tgt-tenant-router <sb-demo-net uuid>
will error out.
The security group rules of the network owner do not apply to other projects that can use the network.
A user in the target project can boot VMs with a VNIC on the shared network, and can assign a floating IP (FIP) to such a VM. The target project must have security group rules that allow SSH and/or ICMP for VM connectivity.
neutron RBAC creation and management are currently not supported in horizon. For now, the neutron CLI has to be used to manage RBAC rules.
An RBAC rule tells neutron whether a tenant can access a network (Allow). Currently there is no DENY action.
Port creation on a shared network fails if
--fixed-ip
is specified in theopenstack port create
command.
10.4.11 Configuring Maximum Transmission Units in neutron #
This topic explains how you can configure MTUs, what to look out for, and the results and implications of changing the default MTU settings. It is important to note that every network within a network group will have the same MTU.
An MTU change will not affect existing networks that have had VMs created on them. It will only take effect on new networks created after the reconfiguration process.
10.4.11.1 Overview #
A Maximum Transmission Unit (MTU) is the maximum packet size (in bytes) that a network device can handle, or is configured to handle. There are a number of places in your cloud where MTU configuration is relevant: the physical interfaces managed and configured by SUSE OpenStack Cloud, the virtual interfaces created by neutron and nova for neutron networking, and the interfaces inside the VMs.
SUSE OpenStack Cloud-managed physical interfaces
SUSE OpenStack Cloud-managed physical interfaces include the physical interfaces
and the bonds, bridges, and VLANs created on top of them. The MTU for these
interfaces is configured via the 'mtu' property of a network group. Because
multiple network groups can be mapped to one physical interface, there may
have to be some resolution of differing MTUs between the untagged and tagged
VLANs on the same physical interface. For instance, suppose an untagged VLAN, vlan101 (with an MTU of 1500), and a tagged VLAN, vlan201 (with an MTU of 9000), are both on one interface (eth0). eth0 is then configured to handle 1500, but the VLAN interface created on top of it (that is,
vlan201@eth0
) wants 9000. However, vlan201 cannot have a
higher MTU than eth0, so vlan201 will be limited to 1500 when it is brought
up, and fragmentation will result.
In general, a VLAN interface MTU must be lower than or equal to the base device MTU. If they are different, as in the case above, the MTU of eth0 can be overridden and raised to 9000, but in any case the discrepancy will have to be reconciled.
neutron/nova interfaces
neutron/nova interfaces include the virtual devices created by neutron and nova during the normal process of realizing a neutron network/router and booting a VM on it (qr-*, qg-*, tap-*, qvo-*, qvb-*, etc.). There is currently no support in neutron/nova for per-network MTUs in which every interface along the path for a particular neutron network has the correct MTU for that network. There is, however, support for globally changing the MTU of devices created by neutron/nova (see network_device_mtu below). This means that if you want to enable jumbo frames for any set of VMs, you will have to enable it for all your VMs. You cannot just enable them for a particular neutron network.
VM interfaces
VMs typically get their MTU via DHCP advertisement, which means that the dnsmasq processes spawned by the neutron-dhcp-agent actually advertise a particular MTU to the VMs. In SUSE OpenStack Cloud 9, the DHCP server advertises a 1400 MTU to all VMs via a forced setting in dnsmasq-neutron.conf. This is suboptimal for every network type (VXLAN, flat, VLAN, and so on), but it does prevent fragmentation of a VM's packets due to encapsulation.
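For reference, that forced setting is a standard dnsmasq DHCP option line (option 26 is the interface-MTU option). The fragment below mirrors the line that the procedures later in this section tell you to remove:

```ini
# ~/openstack/my_cloud/config/neutron/dnsmasq-neutron.conf.j2
# DHCP option 26 = interface MTU; this forces a 1400-byte MTU on all VMs
dhcp-option-force=26,1400
```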
For instance, if you set the new *-mtu configuration options to a default of 1500 and create a VXLAN network, it will be given an MTU of 1450 (with the remaining 50 bytes used by the VXLAN encapsulation header) and will advertise a 1450 MTU to any VM booted on that network. If you create a provider VLAN network, it will have an MTU of 1500 and will advertise 1500 to booted VMs on the network. It should be noted that this default starting point for MTU calculation and advertisement is also global, meaning you cannot have an MTU of 8950 on one VXLAN network and 1450 on another. However, you can have provider physical networks with different MTUs by using the physical_network_mtus config option, but nova still requires a global MTU option for the interfaces it creates, thus you cannot really take advantage of that configuration option.
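The MTU advertised to a VM on an overlay network is simply the global starting MTU minus the encapsulation overhead. A minimal sketch of that arithmetic, using the 50-byte VXLAN overhead stated above:

```shell
# MTU advertised to VMs on a VXLAN network, given the global starting MTU.
# The 50-byte VXLAN encapsulation overhead is from the text above.
global_mtu=1500
vxlan_overhead=50
tenant_mtu=$((global_mtu - vxlan_overhead))
echo "$tenant_mtu"   # 1450
```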
10.4.11.2 Network settings in the input model #
MTU can be set as an attribute of a network group in network_groups.yml. Note that this applies only to KVM. That setting means that every network in the network group will be assigned the specified MTU. The MTU value must be set individually for each network group. For example:
network-groups:
  - name: GUEST
    mtu: 9000
    ...
  - name: EXTERNAL-API
    mtu: 9000
    ...
  - name: EXTERNAL-VM
    mtu: 9000
    ...
10.4.11.3 Infrastructure support for jumbo frames #
If you want to use jumbo frames, or frames with an MTU of 9000 or more, the physical switches and routers that make up the infrastructure of the SUSE OpenStack Cloud installation must be configured to support them. To realize the advantages, all devices in the same broadcast domain must have the same MTU.
If you want to configure jumbo frames on compute and controller nodes, then all switches joining the compute and controller nodes must have jumbo frames enabled. Similarly, the "infrastructure gateway" through which the external VM network flows, commonly known as the default route for the external VM VLAN, must also have the same MTU configured.
In general, anything in the same VLAN or the same IP subnet can be considered part of the same broadcast domain.
10.4.11.4 Enabling end-to-end jumbo frames for a VM #
Add an mtu attribute to all the network groups in your model. Note that adding the MTU for the network groups only affects the configuration of physical network interfaces.
To add the mtu attribute, find the YAML file that contains your network-groups entry. We will assume it is network_groups.yml, unless you have changed it. Whatever the file is named, it will be found in ~/openstack/my_cloud/definition/data/.
To edit these files, begin by checking out the site branch on the Cloud Lifecycle Manager node. You may already be on that branch; if so, you will remain there.
ardana > cd ~/openstack/ardana/ansible
ardana > git checkout site
Then begin editing the files. In network_groups.yml, add mtu: 9000:
network-groups:
  - name: GUEST
    hostname-suffix: guest
    mtu: 9000
    tags:
      - neutron.networks.vxlan
This sets the MTU of the physical interface managed by SUSE OpenStack Cloud 9 that has the GUEST network group tag assigned to it. The tag assignment can be found in the interfaces_set.yml file under the interface-models section.
Edit neutron.conf.j2, found in ~/openstack/my_cloud/config/neutron/, to set global_physnet_mtu to 9000 under [DEFAULT]:
[DEFAULT]
...
global_physnet_mtu = 9000
This allows neutron to advertise the optimal MTU to instances (based on global_physnet_mtu minus the encapsulation size).
Remove the dhcp-option-force=26,1400 line from ~/openstack/my_cloud/config/neutron/dnsmasq-neutron.conf.j2.
OvS will set br-int to the MTU of the lowest physical interface. If you are using jumbo frames on some of your networks, br-int on the controllers may be set to 1500 instead of 9000. Work around this condition by running:
ovs-vsctl set int br-int mtu_request=9000
Commit your changes:
ardana > git add -A
ardana > git commit -m "your commit message goes here in quotes"
If SUSE OpenStack Cloud has not been deployed yet, do the normal deployment and skip to Step 8. Assuming it has been deployed already, continue here:
Run the configuration processor:
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
and ready the deployment:
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Then run the network_interface-reconfigure.yml playbook, changing directories first:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts network_interface-reconfigure.yml
Then run neutron-reconfigure.yml:
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
Then nova-reconfigure.yml:
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
Note: adding or changing network-group mtu settings will likely require a network restart when network_interface-reconfigure.yml is run.
Follow the normal process for creating a neutron network and booting a VM or two. In this example, if a VXLAN network is created and a VM is booted on it, the VM will have an MTU of 8950, with the remaining 50 bytes used by the VXLAN encapsulation header.
Test and verify that the VM can send and receive jumbo frames without fragmentation. You can use ping. For example, to test the 8950-byte VXLAN network MTU, send the largest ICMP payload that fits after the 28 bytes of IP and ICMP headers:
ardana > ping -M do -s 8922 YOUR_VM_FLOATING_IP
Substitute your actual floating IP address for YOUR_VM_FLOATING_IP.
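The payload size passed to ping has to account for packet headers. A small sketch of the arithmetic for a 9000-byte physical MTU with VXLAN (the header sizes are standard IPv4/ICMP values, not from this guide):

```shell
# Largest ICMP payload that fits on a VXLAN tenant network built over a
# 9000-byte physical MTU: subtract VXLAN overhead, then IP + ICMP headers.
phys_mtu=9000
vxlan_overhead=50      # VXLAN encapsulation (from the text)
ip_icmp_headers=28     # 20-byte IPv4 header + 8-byte ICMP header
payload=$((phys_mtu - vxlan_overhead - ip_icmp_headers))
echo "$payload"        # 8922
```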
10.4.11.5 Enabling Optimal MTU Advertisement Feature #
To enable the optimal MTU feature, follow these steps:
Edit ~/openstack/my_cloud/config/neutron/neutron.conf.j2 to remove the advertise_mtu variable under [DEFAULT]:
[DEFAULT]
...
advertise_mtu = False #remove this
Remove the dhcp-option-force=26,1400 line from ~/openstack/my_cloud/config/neutron/dnsmasq-neutron.conf.j2.
If SUSE OpenStack Cloud has already been deployed, follow the remaining steps; otherwise follow the normal deployment procedures.
Commit your changes:
ardana > git add -A
ardana > git commit -m "your commit message goes here in quotes"
Run the configuration processor:
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Ready the deployment:
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the network_interface-reconfigure.yml playbook, changing directories first:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts network_interface-reconfigure.yml
Run the neutron-reconfigure.yml playbook:
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
If you are upgrading an existing deployment, avoid creating an MTU mismatch between the network interfaces of preexisting VMs and those of VMs created after the upgrade. If you do have an MTU mismatch, the new VMs (whose interface MTU is 1500 minus the underlay protocol overhead) will not have L2 connectivity with preexisting VMs (which have a 1400 MTU due to dhcp-option-force).
10.4.12 Improve Network Performance with Isolated Metadata Settings #
In SUSE OpenStack Cloud, neutron currently sets enable_isolated_metadata = True by default in dhcp_agent.ini because several services require isolated networks (neutron networks without a router). It also sets force_metadata = True if DVR is enabled, to improve scalability in large environments with a high churn rate. However, this has the effect of spawning a neutron-ns-metadata-proxy process on one of the controller nodes for every active neutron network. In environments that create many neutron networks, these extra neutron-ns-metadata-proxy processes can quickly eat up a lot of memory on the controllers, which does not scale well.
For deployments that do not require isolated metadata (that is, they do not require the Platform Services and will always create networks with an attached router) and do not have a high churn rate, you can set enable_isolated_metadata = False and force_metadata = False in dhcp_agent.ini to reduce neutron memory usage on the controllers, allowing a greater number of active neutron networks.
Note that the dhcp_agent.ini.j2 template is found in ~/openstack/my_cloud/config/neutron on the Cloud Lifecycle Manager node. At install time, the edit can be made there and the standard deployment run. In a deployed cloud, run the neutron reconfiguration procedure outlined here:
First check out the site branch:
ardana > cd ~/openstack/my_cloud/config/neutron
ardana > git checkout site
Edit the dhcp_agent.ini.j2 file to change the
enable_isolated_metadata = {{ neutron_enable_isolated_metadata }}
force_metadata = {{ router_distributed }}
lines in the [DEFAULT] section to read:
enable_isolated_metadata = False
force_metadata = False
Commit the file:
ardana > git add -A
ardana > git commit -m "your commit message goes here in quotes"
Run the ready-deployment.yml playbook from ~/openstack/ardana/ansible:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Then run the neutron-reconfigure.yml playbook, changing directories first:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
10.4.13 Moving from DVR deployments to non_DVR #
If you have an older deployment of SUSE OpenStack Cloud which is using DVR as a default and you are attempting to move to non_DVR, follow these steps:
Remove all your existing DVR routers and their workloads. Make sure to remove interfaces, floating IPs, and gateways, if applicable.
ardana > openstack router remove subnet ROUTER-NAME SUBNET-NAME/SUBNET-ID
ardana > openstack floating ip unset --port FLOATINGIP-ID
ardana > openstack router unset --external-gateway ROUTER-NAME
Then delete the router:
ardana > openstack router delete ROUTER-NAME
Before you create any non_DVR router, make sure that l3-agents and metadata-agents are not running on any compute host. You can run the command openstack network agent list to see if any neutron-l3-agent is running on a compute host in your deployment.
You must disable neutron-l3-agent and neutron-metadata-agent on every compute host by running the following commands:
ardana > openstack network agent list
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
| 810f0ae7-63aa-4ee3-952d-69837b4b2fe4 | L3 agent           | ardana-cp1-comp0001-mgmt | nova              | :-)   | True           | neutron-l3-agent          |
| 89ac17ba-2f43-428a-98fa-b3698646543d | Metadata agent     | ardana-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-metadata-agent    |
| f602edce-1d2a-4c8a-ba56-fa41103d4e17 | Open vSwitch agent | ardana-cp1-comp0001-mgmt |                   | :-)   | True           | neutron-openvswitch-agent |
...
+--------------------------------------+--------------------+--------------------------+-------------------+-------+----------------+---------------------------+
ardana > openstack network agent set --disable 810f0ae7-63aa-4ee3-952d-69837b4b2fe4
Updated agent: 810f0ae7-63aa-4ee3-952d-69837b4b2fe4
ardana > openstack network agent set --disable 89ac17ba-2f43-428a-98fa-b3698646543d
Updated agent: 89ac17ba-2f43-428a-98fa-b3698646543d
Note: only the L3 and Metadata agents were disabled.
Once the L3 and metadata neutron agents are stopped, follow steps 1 through 7 in the document Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 12 “Alternative Configurations”, Section 12.2 “Configuring SUSE OpenStack Cloud without DVR” and then run the neutron-reconfigure.yml playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
10.4.14 OVS-DPDK Support #
SUSE OpenStack Cloud uses a version of Open vSwitch (OVS) that is built with the Data Plane Development Kit (DPDK) and includes a QEMU hypervisor which supports vhost-user.
The OVS-DPDK package modifies the OVS fast path, which is normally performed in kernel space, and allows it to run in userspace, so there is no context switch to the kernel for processing network packets.
The EAL component of DPDK supports mapping the Network Interface Card (NIC) registers directly into userspace. The DPDK provides a Poll Mode Driver (PMD) that can access the NIC hardware from userspace and uses polling instead of interrupts to avoid the user to kernel transition.
The PMD maps the shared address space of the VM that is provided by the vhost-user capability of QEMU. The vhost-user mode causes neutron to create a Unix domain socket that allows communication between the PMD and QEMU. The PMD uses this in order to acquire the file descriptors to the pre-allocated VM memory. This allows the PMD to directly access the VM memory space and perform a fast zero-copy of network packets directly into and out of the VM's virtio_net vring.
This yields performance improvements in the time it takes to process network packets.
10.4.14.1 Usage considerations #
The target for a DPDK Open vSwitch is VM performance, and VMs only run on compute nodes, so the following considerations are compute-node specific.
In order to use DPDK with VMs, hugepages must be enabled; see Section 10.4.14.3, “Configuring Hugepages for DPDK in Networks”. The memory to be used must be allocated at boot time, so you must know beforehand how many VMs will be scheduled on a node. Also, for NUMA reasons, you want those hugepages on the same NUMA node as the NIC. A VM maps its entire address space into a hugepage.
For maximum performance you must reserve logical cores for DPDK poll mode driver (PMD) usage and for hypervisor (QEMU) usage. This keeps the Linux kernel from scheduling processes on those cores. The PMD threads will go to 100% CPU utilization, since they poll the hardware instead of using interrupts. There will be at least two cores dedicated to PMD threads. Each VM will have a core dedicated to it, although for lower performance VMs can share cores.
VMs can use the virtio_net or the virtio_pmd drivers. There is also a PMD for an emulated e1000.
Only VMs that use hugepages can be successfully launched on a DPDK-enabled NIC. If there is a need to support both DPDK-based and non-DPDK-based VMs, an additional port managed by the Linux kernel must exist.
10.4.14.2 For more information #
See the following topics for more information:
10.4.14.3 Configuring Hugepages for DPDK in Networks #
To take advantage of DPDK and its network performance enhancements, enable hugepages first.
With hugepages, physical RAM is reserved at boot time and dedicated to a virtual machine. Only that virtual machine and Open vSwitch can use this specifically allocated RAM. The host OS cannot access it. This memory is contiguous, and because of its larger size, reduces the number of entries in the memory map and number of times it must be read.
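To illustrate why fewer, larger pages shrink the memory map, compare the number of pages needed to back a 48 GB reservation at 4 KiB versus 1 GiB page sizes (48 GB matches the count used in the example model in this section; the arithmetic itself is the only point here):

```shell
# Pages required to back 48 GiB of guest RAM at two page sizes.
total=$((48 * 1024 * 1024 * 1024))          # 48 GiB in bytes
pages_4k=$((total / 4096))                  # standard 4 KiB pages
pages_1g=$((total / (1024 * 1024 * 1024)))  # 1 GiB hugepages
echo "$pages_4k $pages_1g"                  # 12582912 48
```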
The hugepage reservation is made in /etc/default/grub, but this is handled by the Cloud Lifecycle Manager.
In addition to hugepages, DPDK requires CPU isolation. This is achieved with the isolcpus kernel parameter in /etc/default/grub, but this is also managed by the Cloud Lifecycle Manager using a new input model file.
The two new input model files introduced with this release to help you configure the necessary settings and persist them are:
memory_models.yml (for hugepages)
cpu_models.yml (for CPU isolation)
10.4.14.3.1 memory_models.yml #
In this file you set your hugepage size along with the number of hugepages to allocate. For example:
---
product:
  version: 2
memory-models:
  - name: COMPUTE-MEMORY-NUMA
    default-huge-page-size: 1G
    huge-pages:
      - size: 1G
        count: 24
        numa-node: 0
      - size: 1G
        count: 24
        numa-node: 1
      - size: 1G
        count: 48
10.4.14.3.2 cpu_models.yml #
---
product:
  version: 2
cpu-models:
  - name: COMPUTE-CPU
    assignments:
      - components:
          - nova-compute-kvm
        cpu:
          - processor-ids: 3-5,12-17
            role: vm
      - components:
          - openvswitch
        cpu:
          - processor-ids: 0
            role: eal
          - processor-ids: 1-2
            role: pmd
10.4.14.3.3 NUMA memory allocation #
As mentioned above, the memory used for hugepages is locked down at boot
time by an entry in /etc/default/grub
. As an admin, you
can specify in the input model how to arrange this memory on NUMA nodes. It
can be spread across NUMA nodes or you can specify where you want it. For
example, if you have only one NIC, you would probably want all the hugepages
memory to be on the NUMA node closest to that NIC.
If you do not specify the numa-node
settings in the
memory_models.yml
input model file and use only the last
entry indicating "size: 1G" and "count: 48" then this memory is spread
evenly across all NUMA nodes.
Also note that the hugepage service runs once at boot time and then goes to an inactive state so you should not expect to see it running. If you decide to make changes to the NUMA memory allocation, you will need to reboot the compute node for the changes to take effect.
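Putting the two models together, the Cloud Lifecycle Manager persists settings of roughly this shape in /etc/default/grub on the compute node. The exact line is generated for you; the values below simply mirror the example models above and are illustrative only:

```sh
# Illustrative only: generated by the Cloud Lifecycle Manager, not hand-edited.
# Existing options are kept; hugepage and isolcpus values come from the
# memory_models.yml and cpu_models.yml examples above.
GRUB_CMDLINE_LINUX_DEFAULT="quiet default_hugepagesz=1G hugepagesz=1G hugepages=48 isolcpus=1-5,12-17"
```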
10.4.14.4 DPDK Setup for Networking #
10.4.14.4.1 Hardware requirements #
An Intel-based compute node is required. DPDK is not available on AMD-based systems.
The following BIOS settings must be enabled for DL360 Gen9:
Virtualization Technology
Intel(R) VT-d
PCI-PT (Also see Section 10.4.15.14, “Enabling PCI-PT on HPE DL360 Gen 9 Servers”)
Adequate host memory is needed to allow for hugepages. The examples below use 1G hugepages for the VMs.
10.4.14.4.2 Limitations #
DPDK is supported on SLES only.
Applies to SUSE OpenStack Cloud 9 only.
The tenant network can be untagged VLAN or untagged VXLAN.
DPDK port names must be of the form 'dpdk<portid>', where the port ID is sequential and starts at 0.
No support for converting DPDK ports to non-DPDK ports without rebooting the compute node.
No security group support; userspace conntrack would be needed.
No jumbo frame support.
10.4.14.4.3 Setup instructions #
These setup instructions and the example model are for a three-host system: one controller with the Cloud Lifecycle Manager in the cloud control plane, and two compute hosts.
After the initial run of site.yml, all compute nodes must be rebooted to pick up the changes in grub for hugepages and isolcpus.
Changes to non-uniform memory access (NUMA) memory, isolcpus, or network devices must be followed by a reboot of the compute nodes.
Run sudo reboot to pick up the libvirt change and the hugepage/isolcpus grub changes:
tux > sudo reboot
Use the bash script below to configure nova aggregates, neutron networks, a new flavor, etc., and then spin up two VMs.
VM spin-up instructions
Before running the spin-up script you need to get a copy of the cirros image onto your Cloud Lifecycle Manager node. You can manually scp a copy of the cirros image to the system, or fetch it with wget:
ardana > wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
Save the following shell script in the home directory and run it. This should spin up two VMs, one on each compute node.
Make sure to change all network-specific information in the script to match your environment.
#!/usr/bin/env bash
source service.osrc

######## register glance image
openstack image create --container-format=bare --disk-format=qcow2 cirros < ~/cirros-0.3.4-x86_64-disk.img

####### create nova aggregate and flavor for dpdk
MI_NAME=dpdk
openstack aggregate create --zone nova $MI_NAME
openstack aggregate add host $MI_NAME openstack-cp-comp0001-mgmt
openstack aggregate add host $MI_NAME openstack-cp-comp0002-mgmt
openstack aggregate set --property pinned=true $MI_NAME
openstack flavor create --id 6 --ram 1024 --disk 20 --vcpus 1 $MI_NAME
openstack flavor set --property hw:cpu_policy=dedicated $MI_NAME
openstack flavor set --property aggregate_instance_extra_specs:pinned=true $MI_NAME
openstack flavor set --property hw:mem_page_size=1048576 $MI_NAME

######## sec groups NOTE: no sec groups supported on DPDK. This is in case we do non-DPDK compute hosts.
nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

######## nova keys
openstack keypair create mykey > mykey.pem
chmod 400 mykey.pem

######## create neutron external network
openstack network create --external ext-net
openstack subnet create --network ext-net --subnet-range 10.231.0.0/19 --gateway 10.231.0.1 --ip-version 4 --no-dhcp --allocation-pool start=10.231.17.0,end=10.231.17.255 ext-subnet

######## neutron network
openstack network create mynet1
openstack subnet create --network mynet1 --subnet-range 10.1.1.0/24 mysubnet1
openstack router create myrouter1
openstack router add subnet myrouter1 mysubnet1
openstack router set --external-gateway ext-net myrouter1
export MYNET=$(openstack network list | grep mynet | awk '{print $2}')

######## spin up 2 VMs, 1 on each compute
openstack server create --image cirros --nic net-id=${MYNET} --key-name mykey --flavor dpdk --availability-zone nova:openstack-cp-comp0001-mgmt vm1
openstack server create --image cirros --nic net-id=${MYNET} --key-name mykey --flavor dpdk --availability-zone nova:openstack-cp-comp0002-mgmt vm2

######## create floating ip and attach to instance
export MYFIP1=$(nova floating-ip-create | grep ext-net | awk '{print $4}')
nova add-floating-ip vm1 ${MYFIP1}
export MYFIP2=$(nova floating-ip-create | grep ext-net | awk '{print $4}')
nova add-floating-ip vm2 ${MYFIP2}
openstack server list
10.4.14.5 DPDK Configurations #
10.4.14.5.1 Base configuration #
The following is specific to DL360 Gen9 and BIOS configuration as detailed in Section 10.4.14.4, “DPDK Setup for Networking”.
EAL cores - 1, isolate: False in cpu-models
PMD cores - 1 per NIC port
Hugepages - 1G per PMD thread
Memory channels - 4
Global rx queues - based on needs
10.4.14.5.2 Performance considerations common to all NIC types #
Compute host core frequency
Host CPUs should be running at maximum performance. The following is a script to set that. Note that in this case there are 24 cores. This needs to be modified to fit your environment. For a HP DL360 Gen9, the BIOS should be configured to use "OS Control Mode" which can be found on the iLO Power Settings page.
for i in `seq 0 23`; do
  echo "performance" > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
done
IO non-posted prefetch
The DL360 Gen9 should have the IO non-posted prefetch disabled. Experimental evidence shows this yields an additional 6-8% performance boost.
10.4.14.5.3 Multiqueue configuration #
In order to use multiqueue, a property must be applied to the glance image and a setting inside the resulting VM must be applied. In this example we create a 4 vCPU flavor for DPDK using 1G hugepages.
MI_NAME=dpdk
openstack aggregate create --zone nova $MI_NAME
openstack aggregate add host $MI_NAME openstack-cp-comp0001-mgmt
openstack aggregate add host $MI_NAME openstack-cp-comp0002-mgmt
openstack aggregate set --property pinned=true $MI_NAME
openstack flavor create --id 6 --ram 1024 --disk 20 --vcpus 4 $MI_NAME
openstack flavor set --property hw:cpu_policy=dedicated $MI_NAME
openstack flavor set --property aggregate_instance_extra_specs:pinned=true $MI_NAME
openstack flavor set --property hw:mem_page_size=1048576 $MI_NAME
And set the hw_vif_multiqueue_enabled property on the glance image:
ardana > openstack image set --property hw_vif_multiqueue_enabled=true IMAGE_UUID
Once the VM is booted using the flavor above, inside the VM choose the number of combined rx and tx queues to be equal to the number of vCPUs:
tux > sudo ethtool -L eth0 combined 4
On the hypervisor you can verify that multiqueue has been properly set by looking at the qemu process
-netdev type=vhost-user,id=hostnet0,chardev=charnet0,queues=4 -device virtio-net-pci,mq=on,vectors=10,
Here you can see that mq=on and vectors=10. The formula for vectors is 2 * num_queues + 2.
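A quick check of that formula for the four-queue flavor used above (the formula is from the text; the shell arithmetic is just a sketch):

```shell
# vectors = 2 * num_queues + 2, for the 4-queue multiqueue example above
num_queues=4
vectors=$((2 * num_queues + 2))
echo "$vectors"   # 10
```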
10.4.14.6 Troubleshooting DPDK #
10.4.14.6.1 Hardware configuration #
Because there are several variations of hardware, it is up to you to verify that the hardware is configured properly.
Only Intel-based compute nodes are supported. There is no DPDK available for AMD-based CPUs.
PCI-PT must be enabled for the NIC that will be used with DPDK.
When using an Intel Niantic NIC and the igb_uio driver, VT-d must be enabled in the BIOS.
For DL360 Gen9 systems, see Section 10.4.15.14, “Enabling PCI-PT on HPE DL360 Gen 9 Servers” for the required BIOS shared-memory setting.
Adequate memory must be available for hugepage usage; see Section 10.4.14.3, “Configuring Hugepages for DPDK in Networks”.
Hyper-threading can be enabled but is not required for base functionality.
Determine the PCI slot that the DPDK NIC(s) are installed in to determine the associated NUMA node.
Only the Intel Haswell, Broadwell, and Skylake microarchitectures are supported. Intel Sandy Bridge is not supported.
10.4.14.6.2 System configuration #
Only SLES12-SP4 compute nodes are supported.
If a NIC port is used with PCI-PT, SRIOV-only, or PCI-PT+SRIOV, then it cannot be used with DPDK. They are mutually exclusive. This is because DPDK depends on an OvS bridge, which does not exist if you use any combination of PCI-PT and SRIOV. You can use DPDK, SRIOV-only, and PCI-PT on different interfaces of the same server.
There is an association between the PCI slot for the NIC and a NUMA node. Make sure to use logical CPU cores that are on the NUMA node associated to the NIC. Use the following to determine which CPUs are on which NUMA node.
ardana > lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
Stepping:              2
CPU MHz:               1200.000
CPU max MHz:           1800.0000
CPU min MHz:           1200.0000
BogoMIPS:              3597.06
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47
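A small sketch of how you might check, from lscpu output like the above, that a candidate PMD or VM core sits on the NUMA node associated with the NIC (here node 0; the CPU ranges are copied from the example output):

```shell
# CPU ranges for NUMA node0, as reported by lscpu in the example above.
node0="0-11,24-35"
core=3            # candidate core from the cpu-models processor-ids
on_node0=false
IFS=','
for range in $node0; do
  lo=${range%-*}   # low end of the range
  hi=${range#*-}   # high end of the range
  if [ "$core" -ge "$lo" ] && [ "$core" -le "$hi" ]; then
    on_node0=true
  fi
done
unset IFS
echo "$on_node0"   # true
```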
10.4.14.6.3 Input model configuration #
If you do not specify a driver for a DPDK device, igb_uio will be selected as the default.
DPDK devices must be named dpdk<port-id>, where the port-id starts at 0 and increments sequentially.
Tenant networks supported are untagged VXLAN and VLAN.
Jumbo frames MTU is not supported with DPDK.
Sample VXLAN model
Sample VLAN model
10.4.14.6.4 Reboot requirements #
A reboot of a compute node must be performed when an input model change causes any of the following:
After the initial site.yml play on a new OpenStack environment
Changes to an existing OpenStack environment that modify the /etc/default/grub file, such as:
hugepage allocations
CPU isolation
IOMMU changes
Changes to a NIC port usage type, such as:
moving from DPDK to any combination of PCI-PT and SRIOV
moving from DPDK to a kernel-based eth driver
10.4.15 SR-IOV and PCI Passthrough Support #
SUSE OpenStack Cloud supports both single-root I/O virtualization (SR-IOV) and PCI passthrough (PCIPT). Both technologies provide for better network performance.
This improves network I/O, decreases latency, and reduces processor overhead.
10.4.15.1 SR-IOV #
A PCI-SIG Single Root I/O Virtualization and Sharing (SR-IOV) Ethernet interface is a physical PCI Ethernet NIC that implements hardware-based virtualization mechanisms to expose multiple virtual network interfaces that can be used by one or more virtual machines simultaneously. With SR-IOV based NICs, the traditional virtual bridge is no longer required. Each SR-IOV port is associated with a virtual function (VF).
When compared with a PCI passthrough Ethernet interface, an SR-IOV Ethernet interface:
Provides benefits similar to those of a PCI passthrough Ethernet interface, including lower latency packet processing.
Scales up more easily in a virtualized environment by providing multiple VFs that can be attached to multiple virtual machine interfaces.
Shares the same limitations, including the lack of support for LAG, QoS, ACL, and live migration.
Has the same requirements regarding the VLAN configuration of the access switches.
The process for configuring SR-IOV includes creating a VLAN provider network and subnet, then attaching VMs to that network.
10.4.15.2 PCI passthrough Ethernet interfaces #
A passthrough Ethernet interface is a physical PCI Ethernet NIC on a compute node to which a virtual machine is granted direct access. PCI passthrough allows a VM to have direct access to the hardware without being brokered by the hypervisor. This minimizes packet processing delays but at the same time demands special operational considerations. For all purposes, a PCI passthrough interface behaves as if it were physically attached to the virtual machine. Therefore any potential throughput limitations coming from the virtualized environment, such as the ones introduced by internal copying of data buffers, are eliminated. However, by bypassing the virtualized environment, the use of PCI passthrough Ethernet devices introduces several restrictions that must be taken into consideration. They include:
no support for LAG, QoS, ACL, or host interface monitoring
no support for live migration
no access to the compute node's OVS switch
A passthrough interface bypasses the compute node's OVS switch completely, and is attached instead directly to the provider network's access switch. Therefore, proper routing of traffic to connect the passthrough interface to a particular tenant network depends entirely on the VLAN tagging options configured on both the passthrough interface and the access port on the switch (TOR).
The access switch routes incoming traffic based on a VLAN ID, which ultimately determines the tenant network to which the traffic belongs. The VLAN ID is either explicit, as found in incoming tagged packets, or implicit, as defined by the access port's default VLAN ID when the incoming packets are untagged. In both cases the access switch must be configured to process the proper VLAN ID, which therefore has to be known in advance.
10.4.15.3 Leveraging PCI Passthrough #
Two parts are necessary to leverage PCI passthrough on a SUSE OpenStack Cloud 9 Compute Node: preparing the Compute Node, and preparing nova and glance.
Preparing the Compute Node
There should be no kernel drivers or binaries with direct access to the PCI device. If there are kernel modules, they should be blacklisted.
For example, it is common to have a
nouveau
driver from when the node was installed. This driver is a graphics driver for Nvidia-based GPUs. It must be blacklisted as shown in this example:
ardana > echo 'blacklist nouveau' >> /etc/modprobe.d/nouveau-default.conf
The file location and its contents are important; the name of the file is your choice. Other drivers can be blacklisted in the same manner, possibly including Nvidia drivers.
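The resulting file is a plain modprobe configuration fragment; additional drivers can be blacklisted one per line (the file name below is taken from the example above):

```
# /etc/modprobe.d/nouveau-default.conf
blacklist nouveau
```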
On the host,
iommu_groups
is necessary and may already be enabled. To check if IOMMU is enabled:
root # virt-host-validate
.....
QEMU: Checking if IOMMU is enabled by kernel : WARN (IOMMU appears to be disabled in kernel. Add intel_iommu=on to kernel cmdline arguments)
.....
To modify the kernel cmdline as suggested in the warning, edit the file
/etc/default/grub
and append intel_iommu=on to the GRUB_CMDLINE_LINUX_DEFAULT variable. Then run update-bootloader. A reboot is required for iommu_groups to be enabled. After the reboot, check that IOMMU is enabled:
root # virt-host-validate
.....
QEMU: Checking if IOMMU is enabled by kernel : PASS
.....
Confirm IOMMU groups are available by finding the group associated with your PCI device (for example an Nvidia GPU):
ardana > lspci -nn | grep -i nvidia
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT218 [NVS 300] [10de:10d8] (rev a2)
08:00.1 Audio device [0403]: NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)
In this example, 08:00.0 and 08:00.1 are addresses of the PCI device. The vendor ID is 10de. The product IDs are 10d8 and 0be3. Confirm that the devices are available for passthrough:
ardana > ls -ld /sys/kernel/iommu_groups/*/devices/*08:00.?/
drwxr-xr-x 3 root root 0 Feb 14 13:05 /sys/kernel/iommu_groups/20/devices/0000:08:00.0/
drwxr-xr-x 3 root root 0 Feb 19 16:09 /sys/kernel/iommu_groups/20/devices/0000:08:00.1/
Note: With PCI passthrough, only an entire IOMMU group can be passed. Parts of the group cannot be passed. In this example, the IOMMU group is 20.
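The vendor and product IDs identified above are what nova's PCI configuration needs. A small sketch (a hypothetical helper, not part of any SUSE tooling) of extracting them from lspci -nn output:

```python
import re

# Pull the PCI bus address and the trailing [vendor:product] ID pair out
# of one line of "lspci -nn" output.
LSPCI_RE = re.compile(r'^(\S+)\s.*\[([0-9a-f]{4}):([0-9a-f]{4})\]')

def parse_lspci(line):
    match = LSPCI_RE.match(line)
    if not match:
        return None
    address, vendor_id, product_id = match.groups()
    return {"address": address, "vendor_id": vendor_id, "product_id": product_id}

# Sample line from the lspci output shown above:
line = ("08:00.0 VGA compatible controller [0300]: NVIDIA Corporation "
        "GT218 [NVS 300] [10de:10d8] (rev a2)")
print(parse_lspci(line))
```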
Preparing nova and glance for passthrough
Information about configuring nova and glance is available in the documentation at https://docs.openstack.org/nova/rocky/admin/pci-passthrough.html. Both nova-compute and nova-scheduler must be configured.
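As a rough sketch following that upstream guide, the nova.conf fragments would look something like the following. The IDs are the NVS 300 example values from the previous section, and the alias name gpu is an arbitrary choice here:

```
[pci]
# On the compute node: which devices may be passed through, and an alias
# that flavors can reference ("gpu" is an arbitrary name chosen here).
passthrough_whitelist = { "vendor_id": "10de", "product_id": "10d8" }
alias = { "vendor_id": "10de", "product_id": "10d8", "device_type": "type-PCI", "name": "gpu" }

[filter_scheduler]
# On the scheduler: append PciPassthroughFilter to your existing
# enabled_filters list (shortened here for illustration).
enabled_filters = ...,PciPassthroughFilter
```

A flavor can then request the device via the alias, for example with the flavor property pci_passthrough:alias="gpu:1".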
10.4.15.4 Supported Intel 82599 Devices #
Vendor | Device | Title |
---|---|---|
Intel Corporation | 10f8 | 82599 10 Gigabit Dual Port Backplane Connection |
Intel Corporation | 10f9 | 82599 10 Gigabit Dual Port Network Connection |
Intel Corporation | 10fb | 82599ES 10-Gigabit SFI/SFP+ Network Connection |
Intel Corporation | 10fc | 82599 10 Gigabit Dual Port Network Connection |
10.4.15.5 SRIOV PCIPT configuration #
If you plan to take advantage of SR-IOV support in SUSE OpenStack Cloud, plan in advance to meet the following requirements:
Use one of the supported NIC cards:
HP Ethernet 10Gb 2-port 560FLR-SFP+ Adapter (Intel Niantic). Product part number: 665243-B21 -- Same part number for the following card options:
FlexLOM card
PCI slot adapter card
Identify the NIC ports to be used for PCI Passthrough devices and SRIOV devices from each compute node
Ensure that:
SRIOV is enabled in the BIOS
HP Shared memory is disabled in the BIOS on the compute nodes.
The Intel boot agent is disabled on the compute nodes (the utility described in Section 10.4.15.11, “Intel bootutils” can be used to perform this)
Note: Because of Intel driver limitations, you cannot use a NIC port as an SRIOV NIC as well as a physical NIC. Using the physical function to carry the normal tenant traffic through the OVS bridge at the same time as assigning the VFs from the same NIC device as passthrough to the guest VM is not supported.
If the above prerequisites are met, then SR-IOV or PCIPT can be reconfigured at any time. There is no need to do it at install time.
10.4.15.6 Deployment use cases #
The following are typical use cases that should cover your particular needs:
A device on the host needs to be enabled for both PCI-passthrough and PCI-SRIOV during deployment. At run time nova decides whether to use physical functions or virtual function depending on vnic_type of the port used for booting the VM.
A device on the host needs to be configured only for PCI-passthrough.
A device on the host needs to be configured only for PCI-SRIOV virtual functions.
10.4.15.7 Input model updates #
SUSE OpenStack Cloud 9 provides various options for the user to configure the network for tenant VMs. These options have been enhanced to support SRIOV and PCIPT.
The Cloud Lifecycle Manager input model changes to support SRIOV and PCIPT are as follows. If you were familiar with the configuration settings previously, you will notice these changes.
net_interfaces.yml: This file defines the interface details of the nodes. In it, the following fields have been added under the compute node interface section:
Key | Value |
---|---|
sriov_only: |
Indicates that only SR-IOV be enabled on the interface. This should be set to true if you want to dedicate the NIC interface to support only SR-IOV functionality. |
pci-pt: |
When this value is set to true, it indicates that PCIPT should be enabled on the interface. |
vf-count: |
Indicates the number of VFs to be configured on a given interface. |
In control_plane.yml, under the Compute resource, neutron-sriov-nic-agent has been added as a service component.
under resources:
Key | Value |
---|---|
name: | Compute |
resource-prefix: | Comp |
server-role: | COMPUTE-ROLE |
allocation-policy: | Any |
min-count: | 0 |
service-components: | ntp-client |
nova-compute | |
nova-compute-kvm | |
neutron-l3-agent | |
neutron-metadata-agent | |
neutron-openvswitch-agent | |
neutron-sriov-nic-agent | |
nic_device_data.yml: This is the new file
added with this release to support SRIOV and PCIPT configuration details. It
contains information about the specifics of a nic, and is found at
/usr/share/ardana/input-model/2.0/services/osconfig/nic_device_data.yml
.
The fields in this file are as follows.
nic-device-types: The nic-device-types section contains the following key-value pairs:
Key | Value |
---|---|
name: | The name of the nic-device-type that will be referenced in nic_mappings.yml |
family: | The name of the nic-device-family to be used with this nic-device-type |
device-id: | Device ID as specified by the vendor for the particular NIC |
type: | The value of this field can be simple-port or multi-port. If a single bus address is assigned to more than one NIC, the value will be multi-port. If there is a one-to-one mapping between bus address and NIC, it will be simple-port. |

nic-device-families: The nic-device-families section contains the following key-value pairs:
Key | Value |
---|---|
name: | The name of the device family that can be referenced in nic-device-types. |
vendor-id: | Vendor ID of the NIC |
config-script: | A script file used to create the virtual functions (VFs) on the Compute node. |
driver: | Indicates the NIC driver that needs to be used. |
vf-count-type: | This value can be either port or driver. “port” indicates that the device supports per-port virtual function (VF) counts. “driver” indicates that all ports using the same driver will be configured with the same number of VFs, whether or not the interface model specifies a vf-count attribute for the port. If two or more ports specify different vf-count values, the config processor errors out. |
max-vf-count: | This field indicates the maximum number of VFs that can be configured on an interface, as defined by the vendor. |
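The vf-count rules above can be sketched as a small validation routine (a hypothetical illustration, not the actual config processor code):

```python
# Validate requested per-port VF counts against the nic-device-family
# rules described above.
def validate_vf_counts(vf_count_type, port_vf_counts, max_vf_count):
    """port_vf_counts holds the vf-count requested on each port of a NIC."""
    if any(count > max_vf_count for count in port_vf_counts):
        raise ValueError("vf-count exceeds the vendor-defined max-vf-count")
    if vf_count_type == "driver" and len(set(port_vf_counts)) > 1:
        # All ports sharing the driver must agree on a single VF count,
        # otherwise the config processor errors out.
        raise ValueError("conflicting vf-count values for a 'driver' family")
    return True

# "port" families allow different per-port counts (63 is the Niantic max):
print(validate_vf_counts("port", [6, 8], 63))
```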
control_plane.yml: This file provides the
information about the services to be run on a particular node. To support
SR-IOV on a particular compute node, you must run
neutron-sriov-nic-agent
on that node.
Mapping the use cases with various fields in input model
Vf-count | SR-IOV | PCIPT | OVS bridge | Can be NIC bonded | Use case | |
---|---|---|---|---|---|---|
sriov-only: true | Mandatory | Yes | No | No | No | Dedicated to SRIOV |
pci-pt : true | Not Specified | No | Yes | No | No | Dedicated to PCI-PT |
pci-pt : true | Specified | Yes | Yes | No | No | PCI-PT or SRIOV |
pci-pt and sriov-only keywords are not specified | Specified | Yes | No | Yes | No | SRIOV with PF used by host |
pci-pt and sriov-only keywords are not specified | Not Specified | No | No | Yes | Yes | Traditional/Usual use case |
10.4.15.8 Mappings between nic_mappings.yml
and net_interfaces.yml
#
The following diagram shows which fields in
nic_mappings.yml
map to corresponding fields in
net_interfaces.yml
:
10.4.15.9 Example Use Cases for Intel #
nic-device-types and nic-device-families for the Intel 82599 with ixgbe as the driver:
nic-device-types:
  - name: '8086:10fb'
    family: INTEL-82599
    device-id: '10fb'
    type: simple-port
nic-device-families:
  # Niantic
  - name: INTEL-82599
    vendor-id: '8086'
    config-script: intel-82599.sh
    driver: ixgbe
    vf-count-type: port
    max-vf-count: 63
net_interfaces.yml for the SRIOV-only use case:
- name: COMPUTE-INTERFACES
  network-interfaces:
    - name: hed1
      device:
        name: hed1
        sriov-only: true
        vf-count: 6
      network-groups:
        - GUEST1
net_interfaces.yml for the PCIPT-only use case:
- name: COMPUTE-INTERFACES
  network-interfaces:
    - name: hed1
      device:
        name: hed1
        pci-pt: true
      network-groups:
        - GUEST1
net_interfaces.yml for the SRIOV and PCIPT use case
- name: COMPUTE-INTERFACES
  network-interfaces:
    - name: hed1
      device:
        name: hed1
        pci-pt: true
        vf-count: 6
      network-groups:
        - GUEST1
net_interfaces.yml for SRIOV and Normal Virtio use case
- name: COMPUTE-INTERFACES
  network-interfaces:
    - name: hed1
      device:
        name: hed1
        vf-count: 6
      network-groups:
        - GUEST1
net_interfaces.yml for PCI-PT (hed1 and hed4 refer to the DUAL ports of the PCI-PT NIC):
- name: COMPUTE-PCI-INTERFACES
  network-interfaces:
    - name: hed3
      device:
        name: hed3
      network-groups:
        - MANAGEMENT
        - EXTERNAL-VM
      forced-network-groups:
        - EXTERNAL-API
    - name: hed1
      device:
        name: hed1
        pci-pt: true
      network-groups:
        - GUEST
    - name: hed4
      device:
        name: hed4
        pci-pt: true
      network-groups:
        - GUEST
10.4.15.10 Launching Virtual Machines #
Provisioning a VM with SR-IOV NIC is a two-step process.
Create a neutron port with vnic_type = direct:
ardana > openstack port create --network $net_id --vnic-type direct sriov_port
Boot a VM with the created port-id:
ardana > openstack server create --flavor m1.large --image opensuse --nic port-id=$port_id test-sriov
Provisioning a VM with PCI-PT NIC is a two-step process.
Create two neutron ports with vnic_type = direct-physical:
ardana > openstack port create --network net1 --vnic-type direct-physical pci-port1
ardana > openstack port create --network net1 --vnic-type direct-physical pci-port2
Boot a VM with the created ports:
ardana > openstack server create --flavor 4 --image opensuse --nic port-id=pci-port1-port-id \
  --nic port-id=pci-port2-port-id vm1-pci-passthrough
If PCI-PT VM gets stuck (hangs) at boot time when using an Intel NIC, the boot agent should be disabled.
10.4.15.11 Intel bootutils #
When Intel cards are used for PCI-PT, a tenant VM can get stuck at boot time. When this happens, download Intel bootutils and use it to disable the boot agent.
Download Preboot.tar.gz from https://downloadcenter.intel.com/download/19186/Intel-Ethernet-Connections-Boot-Utility-Preboot-Images-and-EFI-Drivers
Untar Preboot.tar.gz on the compute node where the PCI-PT VM is to be hosted.
Go to ~/APPS/BootUtil/Linux_x64:
ardana > cd ~/APPS/BootUtil/Linux_x64
and run the following command:
ardana > ./bootutil64e -BOOTENABLE disable -all
Boot the PCI-PT VM; it should boot without getting stuck.
Note: Even though the VM console shows the VM getting stuck at PXE boot, this is not related to BIOS PXE settings.
10.4.15.12 Making input model changes and implementing PCI PT and SR-IOV #
To implement the configuration you require, log into the Cloud Lifecycle Manager node and update the Cloud Lifecycle Manager model files to enable SR-IOV or PCIPT following the relevant use case explained above. You will need to edit the following:
net_interfaces.yml
nic_device_data.yml
control_plane.yml
To make the edits,
Check out the site branch of the local git repository and change to the correct directory:
ardana > git checkout site
ardana > cd ~/openstack/my_cloud/definition/data/
Open each file in vim or another editor and make the necessary changes. Save each file, then commit to the local git repository:
ardana > git add -A
ardana > git commit -m "your commit message goes here in quotes"
Have the Cloud Lifecycle Manager enable your changes by running the necessary playbooks:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml
After running the site.yml playbook above, you must reboot the compute nodes that are configured with Intel PCI devices.
When a VM is running on an SRIOV port on a given compute node, reconfiguration is not supported.
You can set the number of virtual functions that must be enabled on a compute node at install time. You can update the number of virtual functions after deployment. If any VMs have been spawned before you change the number of virtual functions, those VMs may lose connectivity. Therefore, it is always recommended that if any virtual function is used by any tenant VM, you should not reconfigure the virtual functions. Instead, you should delete/migrate all the VMs on that NIC before reconfiguring the number of virtual functions.
10.4.15.13 Limitations #
Security groups are not applicable for PCI-PT and SRIOV ports.
Live migration is not supported for VMs with PCI-PT and SRIOV ports.
Rate limiting (QoS) is not applicable on SRIOV and PCI-PT ports.
SRIOV/PCIPT is not supported for VxLAN networks.
DVR is not supported with SRIOV/PCIPT.
For Intel cards, the same NIC cannot be used for both SRIOV and normal VM boot.
Current upstream OpenStack code does not support hot plugging of an SRIOV/PCIPT interface using the nova attach_interface command. See https://review.openstack.org/#/c/139910/ for more information.
The openstack port update command will not work when the admin state is down.
On SLES Compute Nodes with dual-port PCI-PT NICs, both ports must always be passed into the VM. It is not possible to split the dual port and pass through just a single port.
10.4.15.14 Enabling PCI-PT on HPE DL360 Gen 9 Servers #
The HPE DL360 Gen 9 and HPE ProLiant systems with Intel processors use a region of system memory for sideband communication of management information. The BIOS sets up Reserved Memory Region Reporting (RMRR) to report these memory regions and devices to the operating system. There is a conflict between the Linux kernel and RMRR which causes problems with PCI pass-through (PCI-PT). This is needed for IOMMU use by DPDK. Note that this does not affect SR-IOV.
In order to enable PCI-PT on the HPE DL360 Gen 9 you must have a version of firmware that supports setting this and you must change a BIOS setting.
To begin, get the latest firmware and install it on your compute nodes.
Once the firmware has been updated:
Reboot the server and press F9 (system utilities) during POST (power on self test)
Choose
Select the NIC for which you want to enable PCI-PT
Choose
Disable the shared memory feature in the BIOS.
Save the changes and reboot server
10.4.16 Setting up VLAN-Aware VMs #
Creating a VM with a trunk port will allow a VM to gain connectivity to one or more networks over the same virtual NIC (vNIC) through the use of VLAN interfaces in the guest VM. Connectivity to different networks can be added and removed dynamically through the use of subports. The network of the parent port will be presented to the VM as the untagged VLAN, and the networks of the child ports will be presented to the VM as the tagged VLANs (the VIDs of which can be chosen arbitrarily as long as they are unique to that trunk). The VM will send/receive VLAN-tagged traffic over the subports, and neutron will mux/demux the traffic onto the subport's corresponding network. This is not to be confused with VLAN transparency, where a VM can pass VLAN-tagged traffic transparently across the network without interference from neutron. VLAN transparency is not supported.
10.4.16.1 Terminology #
Trunk: a resource that logically represents a trunked vNIC and references a parent port.
Parent port: a neutron port that a Trunk is referenced to. Its network is presented as the untagged VLAN.
Subport: a resource that logically represents a tagged VLAN port on a Trunk. A Subport references a child port and consists of the <port>,<segmentation-type>,<segmentation-id> tuple. Currently only the
vlan
segmentation type is supported.Child port: a neutron port that a Subport is referenced to. Its network is presented as a tagged VLAN based upon the segmentation-id used when creating/adding a Subport.
Legacy VM: a VM that does not use a trunk port.
Legacy port: a neutron port that is not used in a Trunk.
VLAN-aware VM: a VM that uses at least one trunk port.
10.4.16.2 Trunk CLI reference #
Command | Action |
---|---|
network trunk create | Create a trunk. |
network trunk delete | Delete a given trunk. |
network trunk list | List all trunks. |
network trunk show | Show information of a given trunk. |
network trunk set | Add subports to a given trunk. |
network subport list | List all subports for a given trunk. |
network trunk unset | Remove subports from a given trunk. |
network trunk set | Update trunk properties. |
10.4.16.3 Enabling VLAN-aware VM capability #
Edit ~/openstack/my_cloud/config/neutron/neutron.conf.j2 to add the trunk service_plugin:
service_plugins = {{ neutron_service_plugins }},trunk
Edit ~/openstack/my_cloud/config/neutron/ml2_conf.ini.j2 to enable the noop firewall driver:
[securitygroup]
firewall_driver = neutron.agent.firewall.NoopFirewallDriver
Note: This is a manual configuration step because it must be made apparent that this step disables neutron security groups completely. The default SUSE OpenStack Cloud firewall_driver is neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver, which does not implement security groups for trunk ports. Optionally, the SUSE OpenStack Cloud default firewall_driver may still be used (this step can be skipped), which would provide security groups for legacy VMs but not for VLAN-aware VMs. However, this mixed environment is not recommended. For more information, see Section 10.4.16.6, “Firewall issues”.
Commit the configuration changes:
ardana > git add -A
ardana > git commit -m "Enable vlan-aware VMs"
ardana > cd ~/openstack/ardana/ansible/
If this is an initial deployment, continue the rest of the normal deployment process:
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml
If the cloud has already been deployed and this is a reconfiguration:
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
10.4.16.4 Use Cases #
Creating a trunk port
Assume that a number of neutron networks/subnets already exist: private, foo-net, and bar-net. This will create a trunk with two subports allocated to it. The parent port will be on the "private" network, while the two child ports will be on "foo-net" and "bar-net", respectively:
Create a port that will function as the trunk's parent port:
ardana > openstack port create --name trunkparent private
Create ports that will function as the child ports to be used in subports:
ardana > openstack port create --name subport1 foo-net
ardana > openstack port create --name subport2 bar-net
Create a trunk port using the openstack network trunk create command, passing the parent port created in step 1 and the child ports created in step 2:
ardana > openstack network trunk create --parent-port trunkparent --subport port=subport1,segmentation-type=vlan,segmentation-id=1 --subport port=subport2,segmentation-type=vlan,segmentation-id=2 mytrunk
+-----------------+-----------------------------------------------------------------------------------------------+
| Field           | Value                                                                                         |
+-----------------+-----------------------------------------------------------------------------------------------+
| admin_state_up  | UP                                                                                            |
| created_at      | 2017-06-02T21:49:59Z                                                                          |
| description     |                                                                                               |
| id              | bd822ebd-33d5-423e-8731-dfe16dcebac2                                                          |
| name            | mytrunk                                                                                       |
| port_id         | 239f8807-be2e-4732-9de6-c64519f46358                                                          |
| project_id      | f51610e1ac8941a9a0d08940f11ed9b9                                                              |
| revision_number | 1                                                                                             |
| status          | DOWN                                                                                          |
| sub_ports       | port_id='9d25abcf-d8a4-4272-9436-75735d2d39dc', segmentation_id='1', segmentation_type='vlan' |
|                 | port_id='e3c38cb2-0567-4501-9602-c7a78300461e', segmentation_id='2', segmentation_type='vlan' |
| tenant_id       | f51610e1ac8941a9a0d08940f11ed9b9                                                              |
| updated_at      | 2017-06-02T21:49:59Z                                                                          |
+-----------------+-----------------------------------------------------------------------------------------------+
ardana > openstack network subport list --trunk mytrunk
+--------------------------------------+-------------------+-----------------+
| Port                                 | Segmentation Type | Segmentation ID |
+--------------------------------------+-------------------+-----------------+
| 9d25abcf-d8a4-4272-9436-75735d2d39dc | vlan              | 1               |
| e3c38cb2-0567-4501-9602-c7a78300461e | vlan              | 2               |
+--------------------------------------+-------------------+-----------------+
Optionally, a trunk may be created without subports (they can be added later):
ardana > openstack network trunk create --parent-port trunkparent mytrunk
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| admin_state_up  | UP                                   |
| created_at      | 2017-06-02T21:45:35Z                 |
| description     |                                      |
| id              | eb8a3c7d-9f0a-42db-b26a-ca15c2b38e6e |
| name            | mytrunk                              |
| port_id         | 239f8807-be2e-4732-9de6-c64519f46358 |
| project_id      | f51610e1ac8941a9a0d08940f11ed9b9     |
| revision_number | 1                                    |
| status          | DOWN                                 |
| sub_ports       |                                      |
| tenant_id       | f51610e1ac8941a9a0d08940f11ed9b9     |
| updated_at      | 2017-06-02T21:45:35Z                 |
+-----------------+--------------------------------------+
A port that is already bound (that is, already in use by a VM) cannot be upgraded to a trunk port. The port must be unbound to be eligible for use as a trunk's parent port. When adding subports to a trunk, the child ports must be unbound as well.
Checking a port's trunk details
Once a trunk has been created, its parent port will show the
trunk_details
attribute, which consists of the
trunk_id
and list of subport dictionaries:
ardana > openstack port show -F trunk_details trunkparent
+---------------+-------------------------------------------------------------------------------------+
| Field | Value |
+---------------+-------------------------------------------------------------------------------------+
| trunk_details | {"trunk_id": "bd822ebd-33d5-423e-8731-dfe16dcebac2", "sub_ports": |
| | [{"segmentation_id": 2, "port_id": "e3c38cb2-0567-4501-9602-c7a78300461e", |
| | "segmentation_type": "vlan", "mac_address": "fa:16:3e:11:90:d2"}, |
| | {"segmentation_id": 1, "port_id": "9d25abcf-d8a4-4272-9436-75735d2d39dc", |
| | "segmentation_type": "vlan", "mac_address": "fa:16:3e:ff:de:73"}]} |
+---------------+-------------------------------------------------------------------------------------+
Ports that are not trunk parent ports will not have a
trunk_details
field:
ardana > openstack port show -F trunk_details subport1
need more than 0 values to unpack
Adding subports to a trunk
Assuming a trunk and a new child port have been created already, the openstack network trunk set command with the --subport option will add one or more subports to the trunk.
Run openstack network trunk set:
ardana > openstack network trunk set --subport port=subport3,segmentation-type=vlan,segmentation-id=3 mytrunk
Run openstack network subport list:
ardana > openstack network subport list --trunk mytrunk
+--------------------------------------+-------------------+-----------------+
| Port                                 | Segmentation Type | Segmentation ID |
+--------------------------------------+-------------------+-----------------+
| 9d25abcf-d8a4-4272-9436-75735d2d39dc | vlan              | 1               |
| e3c38cb2-0567-4501-9602-c7a78300461e | vlan              | 2               |
| bf958742-dbf9-467f-b889-9f8f2d6414ad | vlan              | 3               |
+--------------------------------------+-------------------+-----------------+
The --subport option may be repeated multiple times in order to add multiple subports at a time.
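Because the repeated --subport syntax is fiddly, it can help to generate the arguments programmatically. A sketch (a hypothetical helper, not part of the OpenStack client):

```python
# Render the repeated --subport arguments for "openstack network trunk
# create/set" from (child port, VLAN ID) pairs.
def subport_args(pairs):
    args = []
    for port, vlan_id in pairs:
        args += ["--subport",
                 "port=%s,segmentation-type=vlan,segmentation-id=%d" % (port, vlan_id)]
    return args

print(" ".join(subport_args([("subport1", 1), ("subport2", 2)])))
```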
Removing subports from a trunk
To remove a subport from a trunk, use the openstack network trunk unset command:
ardana > openstack network trunk unset --subport subport3 mytrunk
Deleting a trunk port
To delete a trunk port, use the openstack network trunk delete command:
ardana > openstack network trunk delete mytrunk
Once a trunk has been created successfully, its parent port may be passed to
the openstack server create
command, which will make the VM VLAN-aware:
ardana > openstack server create --image ubuntu-server --flavor 1 --nic port-id=239f8807-be2e-4732-9de6-c64519f46358 vlan-aware-vm
A trunk cannot be deleted until its parent port is unbound. This means you must delete the VM using the trunk port before you are allowed to delete the trunk.
10.4.16.5 VLAN-aware VM network configuration #
This section illustrates how to configure the VLAN interfaces inside a VLAN-aware VM based upon the subports allocated to the trunk port being used.
Run openstack network subport list to see the VLAN IDs in use on the trunk port:
ardana > openstack network subport list --trunk mytrunk
+--------------------------------------+-------------------+-----------------+
| Port                                 | Segmentation Type | Segmentation ID |
+--------------------------------------+-------------------+-----------------+
| e3c38cb2-0567-4501-9602-c7a78300461e | vlan              | 2               |
+--------------------------------------+-------------------+-----------------+
Run openstack port show on the child port to get its mac_address:
ardana > openstack port show -F mac_address 08848e38-50e6-4d22-900c-b21b07886fb7
+-------------+-------------------+
| Field       | Value             |
+-------------+-------------------+
| mac_address | fa:16:3e:08:24:61 |
+-------------+-------------------+
Log into the VLAN-aware VM and run the following commands to set up the VLAN interface, noting the usage of the mac_address from step 2 and the VLAN ID from step 1:
ardana > sudo ip link add link ens3 ens3.2 address fa:16:3e:11:90:d2 broadcast ff:ff:ff:ff:ff:ff type vlan id 2
ardana > sudo ip link set dev ens3.2 up
Trigger a DHCP request for the new vlan interface to verify connectivity and retrieve its IP address. On an Ubuntu VM, this might be:
ardana > sudo dhclient ens3.2
ardana > sudo ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:8d:77:39 brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.5/24 brd 10.10.10.255 scope global ens3
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe8d:7739/64 scope link
       valid_lft forever preferred_lft forever
3: ens3.2@ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether fa:16:3e:11:90:d2 brd ff:ff:ff:ff:ff:ff
    inet 10.10.12.7/24 brd 10.10.12.255 scope global ens3.2
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe11:90d2/64 scope link
       valid_lft forever preferred_lft forever
10.4.16.6 Firewall issues #
The SUSE OpenStack Cloud default firewall_driver is
neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver
.
This default does not implement security groups for VLAN-aware VMs, but it
does implement security groups for legacy VMs. For this reason, it is
recommended to disable neutron security groups altogether when using
VLAN-aware VMs. To do so, set:
firewall_driver = neutron.agent.firewall.NoopFirewallDriver
Doing this will prevent having a mix of firewalled and non-firewalled VMs in the same environment, but it should be done with caution because all VMs would be non-firewalled.
10.5 Creating a Highly Available Router #
10.5.1 CVR and DVR High Available Routers #
CVR (Centralized Virtual Routing) and DVR (Distributed Virtual Routing) are two types of technologies which can be used to provide routing processes in SUSE OpenStack Cloud 9. You can create Highly Available (HA) versions of CVR and DVR routers by using the options in the table below when creating your router.
The command for creating a router, openstack router create router_name --distributed=True|False --ha=True|False, requires administrative permissions. See the example in the next section, Section 10.5.2, “Creating a High Availability Router”.
--distributed | --ha | Router Type | Description |
---|---|---|---|
False | False | CVR | Centralized Virtual Router |
False | True | CVRHA | Centralized Virtual Router with L3 High Availability |
True | False | DVR | Distributed Virtual Router without SNAT High Availability |
True | True | DVRHA | Distributed Virtual Router with SNAT High Availability |
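The table above can be expressed as a simple lookup, shown here as an illustrative sketch:

```python
# Map the (--distributed, --ha) flag combination to the resulting
# router type from the table above.
ROUTER_TYPES = {
    (False, False): "CVR",    # Centralized Virtual Router
    (False, True):  "CVRHA",  # Centralized with L3 High Availability
    (True,  False): "DVR",    # Distributed without SNAT High Availability
    (True,  True):  "DVRHA",  # Distributed with SNAT High Availability
}

print(ROUTER_TYPES[(True, True)])
```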
10.5.2 Creating a High Availability Router #
You can create a highly available router using the OpenStackClient.
To create the HA router, add --ha=True to the openstack router create command. If you also want to make the router distributed, add --distributed=True. In this example, a DVR SNAT HA router is created with the name routerHA:
ardana > openstack router create routerHA --distributed=True --ha=True
Set the gateway for the external network and add an interface:
ardana > openstack router set --external-gateway <ext-net-id> routerHA
ardana > openstack router add subnet routerHA <private_subnet_id>
When the router is created, the gateway is set, and the interface attached, you have a router with high availability.
10.5.3 Test Router for High Availability #
You can demonstrate that the router is HA by running a continuous ping from a VM instance that is running on the private network to an external server such as a public DNS. As the ping is running, list the l3 agents hosting the router and identify the agent that is responsible for hosting the active router. Induce the failover mechanism by creating a catastrophic event such as shutting down the node hosting the l3 agent. Once the node is shut down, you will see that the ping from the VM to the external network continues to run as the backup l3 agent takes over. To verify that the agent hosting the primary router has changed, list the agents hosting the router. You will see that a different agent is now hosting the active router.
Boot an instance on the private network:
ardana > openstack server create --image <image_id> --flavor <flavor_id> --nic net-id=<private_net_id> --key-name <key> VM1
Log into the VM using the SSH keys:
ardana > ssh -i <key> <ipaddress of VM1>
Start a ping to X.X.X.X. While pinging, make sure there is no packet loss and leave the ping running:
ardana > ping X.X.X.X
Check which agent is hosting the active router:
ardana > openstack network agent list --router <router_id>
Shut down the node hosting the agent.
Within 10 seconds, check again to see which L3 agent is hosting the active router:
ardana > openstack network agent list --router <router_id>
You will see a different agent.
11 Managing the Dashboard #
Information about managing and configuring the Dashboard service.
11.1 Configuring the Dashboard Service #
horizon is the OpenStack service that serves as the basis for the SUSE OpenStack Cloud dashboards.
The dashboards provide a web-based user interface to SUSE OpenStack Cloud services including Compute, Volume Operations, Networking, and Identity.
Along the left side of the dashboard are panels that provide access to the Project and Identity sections. If your login credentials have been assigned the 'admin' role, you will also see a separate Admin section that provides additional system-wide setting options.
Across the top are menus to switch between projects and menus where you can access user settings.
11.1.1 Dashboard Service and TLS in SUSE OpenStack Cloud #
By default, the Dashboard service is configured with TLS in the input model (ardana-input-model). You should not disable TLS in the input model for the Dashboard service. The normal use case for users is to have all services behind TLS, but users are given the freedom in the input model to take a service off TLS for troubleshooting or debugging. TLS should always be enabled for production environments.
Make sure that horizon_public_protocol and horizon_private_protocol are both set to use https.
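For reference, the corresponding input-model settings should look like the following sketch (attribute names are taken from this section; the exact file in your input model may vary):

```yaml
# Keep both Dashboard endpoints on TLS (recommended for production)
horizon_public_protocol: https
horizon_private_protocol: https
```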
11.2 Changing the Dashboard Timeout Value #
The default session timeout for the dashboard is 1800 seconds or 30 minutes. This is the recommended default and best practice for those concerned with security.
As an administrator, you can change the session timeout by changing the value of the SESSION_TIMEOUT to anything less than or equal to 14400, which is equal to four hours. Values greater than 14400 should not be used due to keystone constraints.
Increasing the value of SESSION_TIMEOUT increases the risk of abuse.
11.2.1 How to Change the Dashboard Timeout Value #
Follow these steps to change and commit the horizon timeout value.
Log in to the Cloud Lifecycle Manager.
Edit the Dashboard config file at
~/openstack/my_cloud/config/horizon/local_settings.py
and, if it is not already present, add a line for SESSION_TIMEOUT above the line for SESSION_ENGINE. Here is an example snippet:
SESSION_TIMEOUT = <timeout value>
SESSION_ENGINE = 'django.contrib.sessions.backends.db'
Important: Do not exceed the maximum value of 14400.
Commit the changes to git:
git add -A
git commit -a -m "changed horizon timeout value"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the Dashboard reconfigure playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts horizon-reconfigure.yml
11.3 Creating a Load Balancer with the Dashboard #
In SUSE OpenStack Cloud 9 you can create a Load Balancer with the Load Balancer Panel in the Dashboard.
Follow the steps below to create the load balancer, listener, pool, add members to the pool and create the health monitor.
Optionally, you may add members to the load balancer pool after the load balancer has been created.
Log in to the Dashboard
Log in to the Dashboard using your domain, user account, and password.
Navigate and Create Load Balancer
Once logged into the Dashboard, navigate to the Load Balancers panel via the navigation menu, then select Create Load Balancer on the Load Balancers page.
Provide the Load Balancer details, Load Balancer Name, Description (optional), IP Address and Subnet. When complete, select Next.
Create Listener
Provide a Name, Description, Protocol (HTTP, TCP, TERMINATED_HTTPS) and Port for the Load Balancer Listener.
Create Pool
Provide the Name, Description and Method (LEAST_CONNECTIONS, ROUND_ROBIN, SOURCE_IP) for the Load Balancer Pool.
Add Pool Members
Add members to the Load Balancer Pool.
Note: Optionally, you may add members to the load balancer pool after the load balancer has been created.
Create Health Monitor
Create Health Monitor by providing the Monitor type (HTTP, PING, TCP), the Health check interval, Retry count, timeout, HTTP Method, Expected HTTP status code and the URL path. Once all fields are filled, select Create Load Balancer.
Load Balancer Provisioning Status
Clicking on the Load Balancers tab again will provide the status of the Load Balancer. The Load Balancer will be in Pending Create until the Load Balancer is created, at which point the Load Balancer will change to an Active state.
Load Balancer Overview
Once Load Balancer 1 has been created, it will appear in the Load Balancers list. Click Load Balancer 1 to show the Overview. In this view, you can see the Load Balancer Provider type, the Admin State, Floating IP, Load Balancer, Subnet, and Port IDs.
12 Managing Orchestration #
Information about managing and configuring the Orchestration service, based on OpenStack heat.
12.1 Configuring the Orchestration Service #
Information about configuring the Orchestration service, based on OpenStack heat.
The Orchestration service, based on OpenStack heat, does not need any additional configuration to be used. This document describes some configuration options as well as reasons you may want to use them.
heat Stack Tag Feature
heat provides a feature called Stack Tags to allow attributing a set of simple string-based tags to stacks and optionally the ability to hide stacks with certain tags by default. This feature can be used for behind-the-scenes orchestration of cloud infrastructure, without exposing the cloud user to the resulting automatically-created stacks.
Additional details can be seen here: OpenStack - Stack Tags.
In order to use the heat stack tag feature, you need to use the following
steps to define the hidden_stack_tags
setting in the heat
configuration file and then reconfigure the service to enable the feature.
Log in to the Cloud Lifecycle Manager.
Edit the heat configuration file, at this location:
~/openstack/my_cloud/config/heat/heat.conf.j2
Under the [DEFAULT] section, add a line for hidden_stack_tags. Example:
[DEFAULT]
hidden_stack_tags="<hidden_tag>"
Commit the changes to your local git:
ardana > cd ~/openstack/ardana/ansible
ardana > git add --all
ardana > git commit -m "enabling heat Stack Tag feature"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Reconfigure the Orchestration service:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml
To begin using the feature, use these steps to create a heat stack using the defined hidden tag. You will need to use credentials that have the heat admin permissions. In the example steps below we are going to do this from the Cloud Lifecycle Manager using the admin credentials and a heat template named heat.yaml:
Log in to the Cloud Lifecycle Manager.
Source the admin credentials:
ardana > source ~/service.osrc
Create a heat stack using this feature:
ardana > openstack stack create -t heat.yaml hidden_stack_tags --tags hidden
If you list your heat stacks, your hidden one will not show unless you use the --hidden switch.
Example, not showing hidden stacks:
ardana > openstack stack list
Example, showing the hidden stacks:
ardana > openstack stack list --hidden
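The heat.yaml referenced above is not shown in this guide; any valid template works. A minimal hedged sketch (the resource name demo_secret is illustrative):

```yaml
heat_template_version: 2016-10-14
description: Minimal template used to demonstrate hidden stack tags
resources:
  demo_secret:
    type: OS::Heat::RandomString
    properties:
      length: 16
```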
12.2 Autoscaling using the Orchestration Service #
Autoscaling is a process that can be used to scale up and down your compute resources based on the load they are currently experiencing to ensure a balanced load.
12.2.1 What is autoscaling? #
Autoscaling is a process that can be used to scale up and down your compute resources based on the load they are currently experiencing to ensure a balanced load across your compute environment.
Autoscaling is only supported for KVM.
12.2.2 How does autoscaling work? #
The monitoring service, monasca, monitors your infrastructure resources and generates alarms based on their state. The Orchestration service, heat, talks to the monasca API and offers the capability to templatize the existing monasca resources, which are the monasca Notification and monasca Alarm definition. heat can configure certain alarms for the infrastructure resources (compute instances and block storage volumes) it creates and can expect monasca to notify continuously if a certain evaluation pattern in an alarm definition is met.
For example, heat can tell monasca that it needs an alarm generated if the average CPU utilization of the compute instance in a scaling group goes beyond 90%.
As monasca continuously monitors all the resources in the cloud, if it happens to see a compute instance spiking above 90% load as configured by heat, it generates an alarm and in turn sends a notification to heat. Once heat is notified, it will execute an action that was preconfigured in the template. Commonly, this action will be a scale up to increase the number of compute instances to balance the load that is being taken by the compute instance scaling group.
monasca sends a notification every 60 seconds while the alarm is in the ALARM state.
12.2.3 Autoscaling template example #
The following monasca alarm definition template snippet is an example of instructing monasca to generate an alarm if the average CPU utilization in a group of compute instances exceeds 50%. If the alarm is triggered, it will invoke the up_notification webhook once the alarm evaluation expression is satisfied.
cpu_alarm_high:
  type: OS::monasca::AlarmDefinition
  properties:
    name: CPU utilization beyond 50 percent
    description: CPU utilization reached beyond 50 percent
    expression:
      str_replace:
        template: avg(cpu.utilization_perc{scale_group=scale_group_id}) > 50 times 3
        params:
          scale_group_id: {get_param: "OS::stack_id"}
    severity: high
    alarm_actions:
      - {get_resource: up_notification}
The following monasca notification template snippet is an example of creating a monasca notification resource that will be used by the alarm definition snippet to notify heat.
up_notification:
  type: OS::monasca::Notification
  properties:
    type: webhook
    address: {get_attr: [scale_up_policy, alarm_url]}
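The alarm_url consumed above is exposed by a scaling policy attached to a scaling group. As a hedged sketch of how these two resources typically fit together (the resource names scale_up_policy and scale_group match the snippets in this section; the property values are illustrative, using the standard OS::Heat::ScalingPolicy and OS::Heat::AutoScalingGroup plugins):

```yaml
scale_group:
  type: OS::Heat::AutoScalingGroup
  properties:
    min_size: 1
    max_size: 3
    resource:
      # Typically a nested template whose servers carry the
      # scale_group metadata dimension used in the alarm expression
      type: OS::Nova::Server
      properties:
        image: <image_id>
        flavor: <flavor_id>

scale_up_policy:
  type: OS::Heat::ScalingPolicy
  properties:
    adjustment_type: change_in_capacity
    auto_scaling_group_id: {get_resource: scale_group}
    cooldown: 60
    scaling_adjustment: 1   # add one instance each time the webhook fires
```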
12.2.4 monasca Agent configuration options #
There is a monasca Agent configuration option which controls the behavior around compute instance creation and the measurements being received from the compute instance.
The variable is monasca_libvirt_vm_probation
which is set
in the
~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
file. Here is a snippet of the file showing the description and variable:
# The period of time (in seconds) in which to suspend metrics from a
# newly-created VM. This is used to prevent creating and storing
# quickly-obsolete metrics in an environment with a high amount of instance
# churn (VMs created and destroyed in rapid succession). Setting to 0
# disables VM probation and metrics will be recorded as soon as possible
# after a VM is created. Decreasing this value in an environment with a high
# amount of instance churn can have a large effect on the total number of
# metrics collected and increase the amount of CPU, disk space and network
# bandwidth required for monasca. This value may need to be decreased if
# heat Autoscaling is in use so that heat knows that a new VM has been
# created and is handling some of the load.
monasca_libvirt_vm_probation: 300
The default value is 300
. This is the time in seconds
that a compute instance must live before the monasca libvirt agent plugin
will send measurements for it. This is so that the monasca metrics database
does not fill with measurements from short lived compute instances. However,
this means that the monasca threshold engine will not see measurements from
a newly created compute instance for at least five minutes on scale up. If
the newly created compute instance is able to start handling the load in
less than five minutes, then heat autoscaling may mistakenly create another
compute instance since the alarm does not clear.
If the default monasca_libvirt_vm_probation
turns out to
be an issue, it can be lowered. However, that will affect all compute
instances, not just ones used by heat autoscaling which can increase the
number of measurements stored in monasca if there are many short lived
compute instances. You should consider how often compute instances are
created that live less than the new value of
monasca_libvirt_vm_probation
. If few, if any, compute
instances live less than the value of
monasca_libvirt_vm_probation
, then this value can be
decreased without causing issues. If many compute instances live less than
the monasca_libvirt_vm_probation
period, then decreasing
monasca_libvirt_vm_probation
can cause excessive disk,
CPU and memory usage by monasca.
If you wish to change this value, follow these steps:
Log in to the Cloud Lifecycle Manager.
Edit the
monasca_libvirt_vm_probation
value in this configuration file:
~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
Commit your changes to the local git:
ardana > cd ~/openstack/ardana/ansible
ardana > git add --all
ardana > git commit -m "changing monasca Agent configuration option"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run this playbook to reconfigure the nova service and enact your changes:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
12.3 Orchestration Service support for LBaaS v2 #
In SUSE OpenStack Cloud, the Orchestration service provides support for LBaaS v2, which means users can create LBaaS v2 resources using Orchestration.
The OpenStack documentation for LBaaS v2 resource plugins is available at the following locations.
neutron LBaaS v2 LoadBalancer: http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Neutron::LBaaS::LoadBalancer
neutron LBaaS v2 Listener: http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Neutron::LBaaS::Listener
neutron LBaaS v2 Pool: http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Neutron::LBaaS::Pool
neutron LBaaS v2 Pool Member: http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Neutron::LBaaS::PoolMember
neutron LBaaS v2 Health Monitor: http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Neutron::LBaaS::HealthMonitor
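Taken together, the five plugins above compose into one template. A hedged sketch (property names follow the upstream heat plugin documentation linked above; resource names and IDs are placeholders):

```yaml
lb:
  type: OS::Neutron::LBaaS::LoadBalancer
  properties:
    vip_subnet: <subnet_id>

listener:
  type: OS::Neutron::LBaaS::Listener
  properties:
    loadbalancer: {get_resource: lb}
    protocol: HTTP
    protocol_port: 80

pool:
  type: OS::Neutron::LBaaS::Pool
  properties:
    listener: {get_resource: listener}
    lb_algorithm: ROUND_ROBIN
    protocol: HTTP

member:
  type: OS::Neutron::LBaaS::PoolMember
  properties:
    pool: {get_resource: pool}
    address: <member_ip>
    protocol_port: 80
    subnet: <subnet_id>

monitor:
  type: OS::Neutron::LBaaS::HealthMonitor
  properties:
    pool: {get_resource: pool}
    type: HTTP
    delay: 5
    max_retries: 3
    timeout: 5
```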
12.3.1 Limitations #
In order to avoid stack-create timeouts when using load balancers, it is recommended that no more than 100 load balancers be created at a time using stack-create loops. Larger numbers of load balancers could reach quotas and/or exhaust resources, resulting in stack-create timeouts.
12.3.2 More Information #
For more information on the neutron command-line interface (CLI) and load balancing, see the OpenStack networking command-line client reference: http://docs.openstack.org/cli-reference/content/neutronclient_commands.html
For more information on heat see: http://docs.openstack.org/developer/heat
13 Managing Monitoring, Logging, and Usage Reporting #
Information about the monitoring, logging, and metering services included with your SUSE OpenStack Cloud.
13.1 Monitoring #
The SUSE OpenStack Cloud Monitoring service leverages OpenStack monasca, which is a multi-tenant, scalable, fault tolerant monitoring service.
13.1.1 Getting Started with Monitoring #
You can use the SUSE OpenStack Cloud Monitoring service to monitor the health of your cloud and, if necessary, to troubleshoot issues.
monasca data can be extracted and used for a variety of legitimate purposes, and different purposes require different forms of data sanitization or encoding to protect against invalid or malicious data. Any data pulled from monasca should be considered untrusted data, so users are advised to apply appropriate encoding and/or sanitization techniques to ensure safe and correct usage and display of data in a web browser, database scan, or any other use of the data.
13.1.1.1 Monitoring Service Overview #
13.1.1.1.1 Installation #
The monitoring service is automatically installed as part of the SUSE OpenStack Cloud installation.
No specific configuration is required to use monasca. However, you can configure the database for storing metrics as explained in Section 13.1.2, “Configuring the Monitoring Service”.
13.1.1.1.2 Differences Between Upstream and SUSE OpenStack Cloud Implementations #
In SUSE OpenStack Cloud, the OpenStack monitoring service, monasca, is included as the monitoring solution, except for the following components, which are not included:
Transform Engine
Events Engine
Anomaly and Prediction Engine
Icinga was supported in previous SUSE OpenStack Cloud versions but it has been deprecated in SUSE OpenStack Cloud 9.
13.1.1.1.3 Diagram of monasca Service #
13.1.1.1.4 For More Information #
For more details on OpenStack monasca, see monasca.io
13.1.1.1.5 Back-end Database #
The monitoring service default metrics database is Cassandra, which is a highly-scalable analytics database and the recommended database for SUSE OpenStack Cloud.
You can learn more about Cassandra at Apache Cassandra.
13.1.1.2 Working with Monasca #
monasca-Agent
The monasca-agent is a Python program that runs on the control plane nodes. It runs the defined checks and then sends data onto the API. The checks that the agent runs include:
System Metrics: CPU utilization, memory usage, disk I/O, network I/O, and filesystem utilization on the control plane and resource nodes.
Service Metrics: the agent supports plugins such as MySQL, RabbitMQ, Kafka, and many others.
VM Metrics: CPU utilization, disk I/O, network I/O, and memory usage of hosted virtual machines on compute nodes. Full details of these can be found at https://github.com/openstack/monasca-agent/blob/master/docs/Plugins.md#per-instance-metrics.
For a full list of packaged plugins that are included in SUSE OpenStack Cloud, see monasca Plugins.
You can further customize the monasca-agent to suit your needs; see Customizing the Agent.
13.1.1.3 Accessing the Monitoring Service #
Access to the Monitoring service is available through a number of different interfaces.
13.1.1.3.1 Command-Line Interface #
For users who prefer using the command line, there is the python-monascaclient, which is part of the default installation on your Cloud Lifecycle Manager node.
For details on the CLI, including installation instructions, see Python-monasca Client
monasca API
If low-level access is desired, there is the monasca REST API.
Full details of the monasca API can be found on GitHub.
13.1.1.3.2 Operations Console GUI #
You can use the Operations Console (Ops Console) for SUSE OpenStack Cloud to view data about your SUSE OpenStack Cloud infrastructure in a web-based graphical user interface (GUI) and ensure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can manage data in a number of ways, such as triaging alarm notifications.
Alarm Definitions and notifications now have their own screens and are collected under the Alarm Explorer menu item which can be accessed from the Central Dashboard. Central Dashboard now allows you to customize the view in the following ways:
Rename or re-configure existing alarm cards to include services different from the defaults
Create a new alarm card with the services you want to select
Reorder alarm cards using drag and drop
View all alarms that have no service dimension now grouped in an Uncategorized Alarms card
View all alarms that have a service dimension that does not match any of the other cards, now grouped in an Other Alarms card
You can also easily access alarm data for a specific component. On the Summary page for the following components, a link is provided to an alarms screen specifically for that component.
13.1.1.3.3 Connecting to the Operations Console #
To connect to the Operations Console, perform the following:
Ensure your login has the required access credentials.
Connect through a browser.
Optionally, use a host name or the virtual IP address to access the Operations Console.
The Operations Console is always accessed over port 9095.
13.1.1.4 Service Alarm Definitions #
SUSE OpenStack Cloud comes with some predefined monitoring alarms for the services installed.
Full details of all service alarms can be found here: Section 18.1.1, “Alarm Resolution Procedures”.
Each alarm will have one of the following statuses:
An alarm exists for a service or component that is not installed in the environment.
An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.
There is a gap between the last reported metric and the next metric.
When alarms are triggered it is helpful to review the service logs.
13.1.2 Configuring the Monitoring Service #
The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. You also have options for your alarm metrics database should you choose not to use the default option provided with the product.
In SUSE OpenStack Cloud you have the option to specify a SMTP server for email notifications and a database platform you want to use for the metrics database. These steps will assist in this process.
13.1.2.1 Configuring the Monitoring Email Notification Settings #
The monitoring service, based on monasca, allows you to configure an external SMTP server for email notifications when alarms trigger. In SUSE OpenStack Cloud, you have the option to specify a SMTP server for email notifications. These steps will assist in this process.
If you are going to use the email notification feature of the monitoring service, you must set the configuration options with valid email settings including an SMTP server and valid email addresses. The email server is not provided by SUSE OpenStack Cloud, but must be specified in the configuration file described below. The email server must support SMTP.
13.1.2.1.1 Configuring monitoring notification settings during initial installation #
Log in to the Cloud Lifecycle Manager.
To change the SMTP server configuration settings edit the following file:
~/openstack/my_cloud/definition/cloudConfig.yml
Enter your email server settings. Here is an example snippet showing the configuration file contents, uncomment these lines before entering your environment details.
smtp-settings:
#  server: mailserver.examplecloud.com
#  port: 25
#  timeout: 15
# These are only needed if your server requires authentication
#  user:
#  password:
This table explains each of these values:
Server (required): The server entry must be uncommented and set to a valid hostname or IP address.
Port (optional): If your SMTP server is running on a port other than the standard 25, uncomment the port line and set it to your port.
Timeout (optional): If your email server is heavily loaded, the timeout parameter can be uncommented and set to a larger value. 15 seconds is the default.
User / Password (optional): If your SMTP server requires authentication, you can configure user and password. Use double quotes around the password to avoid issues with special characters.
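For reference, an uncommented smtp-settings block might look like the following sketch (the hostname, port, and credentials shown are hypothetical placeholders, not defaults):

```yaml
smtp-settings:
  server: smtp.example.com   # hypothetical SMTP relay reachable from the cloud
  port: 587                  # only needed if not using the standard port 25
  timeout: 30                # raise from the 15-second default for a loaded server
  user: monasca-mailer       # only needed if the server requires authentication
  password: "s3cret!"        # double quotes guard special characters
```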
To configure the sending email addresses, edit the following file:
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
Modify the following value to add your sending email address:
email_from_addr
Note: The default value in the file is
email_from_address: notification@exampleCloud.com
which you should edit.
[Optional] To configure the receiving email addresses, edit the following file:
~/openstack/ardana/ansible/roles/monasca-default-alarms/defaults/main.yml
Modify the following value to configure a receiving email address:
notification_address
Note: You can also set the receiving email address via the Operations Console. Instructions for this are in the last section.
If your environment requires a proxy address then you can add that in as well:
# notification_environment can be used to configure proxies if needed.
# Below is an example configuration. Note that all of the quotes are required.
# notification_environment: '"http_proxy=http://<your_proxy>:<port>" "https_proxy=http://<your_proxy>:<port>"'
notification_environment: ''
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Updated monitoring service email notification settings"
Continue with your installation.
13.1.2.1.2 Monasca and Apache Commons Validator #
The monasca notification uses a standard Apache Commons validator to validate the configured SUSE OpenStack Cloud domain names before sending the notification over webhook. monasca notification supports some non-standard domain names, but not all. See the Domain Validator documentation for more information: https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/DomainValidator.html
You should ensure that any domains that you use are supported by IETF and IANA. As an example, .local is not listed by IANA and is invalid but .gov and .edu are valid.
Internet Assigned Numbers Authority (IANA): https://www.iana.org/domains/root/db
Failure to use supported domains will generate an unprocessable exception in monasca notification create:
HTTPException code=422 message={"unprocessable_entity": {"code":422,"message":"Address https://myopenstack.sample:8000/v1/signal/test is not of correct format","details":"","internal_code":"c6cf9d9eb79c3fc4"}
13.1.2.1.3 Configuring monitoring notification settings after the initial installation #
If you need to make changes to the email notification settings after your initial deployment, you can change the "From" address using the configuration files but the "To" address will need to be changed in the Operations Console. The following section will describe both of these processes.
To change the sending email address:
Log in to the Cloud Lifecycle Manager.
To configure the sending email addresses, edit the following file:
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
Modify the following value to add your sending email address:
email_from_addr
Note: The default value in the file is
email_from_address: notification@exampleCloud.com
which you should edit.
Commit your configuration to the local Git repository (Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Updated monitoring service email notification settings"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the monasca reconfigure playbook to deploy the changes:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
Note: You may need to use the --ask-vault-pass switch if you opted for encryption during the initial deployment.
To change the receiving email address via the Operations Console:
To configure the "To" email address after installation:
Connect to and log in to the Operations Console.
On the Home screen, click the menu represented by 3 horizontal lines ().
From the menu that slides in on the left side, click Home, and then Alarm Explorer.
On the Alarm Explorer page, at the top, click the Notification Methods text.
On the Notification Methods page, find the row with the Default Email notification.
In the Default Email row, click the details icon (), then click Edit.
On the Edit Notification Method: Default Email page, in Name, Type, and Address/Key, type in the values you want to use.
On the Edit Notification Method: Default Email page, click Update Notification.
Once the notification has been added through the Operations Console, the Ansible playbook procedures above will no longer change it.
13.1.2.2 Managing Notification Methods for Alarms #
13.1.2.2.1 Enabling a Proxy for Webhook or Pager Duty Notifications #
If your environment requires a proxy in order for communications to function then these steps will show you how you can enable one. These steps will only be needed if you are utilizing the webhook or pager duty notification methods.
These steps will require access to the Cloud Lifecycle Manager in your cloud deployment so you may need to contact your Administrator. You can make these changes during the initial configuration phase prior to the first installation or you can modify your existing environment, the only difference being the last step.
Log in to the Cloud Lifecycle Manager.
Edit the
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
file and edit the line below with your proxy address values:
notification_environment: '"http_proxy=http://<proxy_address>:<port>" "https_proxy=http://<proxy_address>:<port>"'
Note: There are single quotation marks around the entire value of this entry and then double quotation marks around the individual proxy entries. This formatting must exist when you enter these values into your configuration file.
If you are making these changes prior to your initial installation then you are done and can continue on with the installation. However, if you are modifying an existing environment, you will need to continue on with the remaining steps below.
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Generate an updated deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the monasca reconfigure playbook to enable these changes:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml --tags notification
13.1.2.2.2 Creating a New Notification Method #
Log in to the Operations Console.
Use the navigation menu to go to the Alarm Explorer page:
Select the Notification Methods menu and then click the Create Notification Method button:
On the Create Notification Method window you will select your options and then click the Create Notification button.
A description of each of the fields you use for each notification method:
Name: Enter a unique name value for the notification method you are creating.
Type: Choose a type. Available values are Webhook, Email, or Pager Duty.
Address/Key: Enter the value corresponding to the type you chose.
13.1.2.2.3 Applying a Notification Method to an Alarm Definition #
Log in to the Operations Console.
Use the navigation menu to go to the Alarm Explorer page:
Select the Alarm Definition menu which will give you a list of each of the alarm definitions in your environment.
Locate the alarm you want to change the notification method for and click on its name to bring up the edit menu. You can use the sorting methods for assistance.
In the edit menu, scroll down to the Notifications and Severity section where you will select one or more Notification Methods before selecting the Update Alarm Definition button:
Repeat as needed until all of your alarms have the notification methods you desire.
13.1.2.3 Enabling the RabbitMQ Admin Console #
The RabbitMQ Admin Console is off by default in SUSE OpenStack Cloud. You can turn on the console by following these steps:
Log in to the Cloud Lifecycle Manager.
Edit the
~/openstack/my_cloud/config/rabbitmq/main.yml
file. Under the rabbit_plugins: line, uncomment the line:
- rabbitmq_management
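After uncommenting, the relevant part of main.yml should look like the following (any other plugin entries in your file remain unchanged):

```yaml
rabbit_plugins:
  - rabbitmq_management
```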
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "Enabled RabbitMQ Admin Console"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the RabbitMQ reconfigure playbook to deploy the changes:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-reconfigure.yml
To turn the RabbitMQ Admin Console off again, add the comment back and repeat steps 3 through 6.
13.1.2.4 Capacity Reporting and Monasca Transform #
Capacity reporting is a feature in SUSE OpenStack Cloud that provides cloud operators with overall capacity information (available, used, and remaining) via the Operations Console, so that they can ensure cloud resource pools have sufficient capacity to meet the demands of users. Cloud operators can also set thresholds and alarms to be notified when those thresholds are reached.
For Compute
Host Capacity - CPU/Disk/Memory: Used, Available and Remaining Capacity - for the entire cloud installation or by host
VM Capacity - CPU/Disk/Memory: Allocated, Available and Remaining - for the entire cloud installation, by host or by project
For Object Storage
Disk Capacity - Used, Available and Remaining Capacity - for the entire cloud installation or by project
In addition to overall capacity, roll up views with appropriate slices provide views by a particular project, or compute node. Graphs also show trends and the change in capacity over time.
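The capacity figures above reduce to a simple relationship: remaining = total − used, with utilization as their ratio. The following minimal sketch illustrates the arithmetic behind a capacity tile; the function name and the numbers are illustrative, not actual monasca API output:

```python
def capacity_summary(total, used):
    """Return used, remaining, and percent-utilized figures,
    as presented in Operations Console capacity tiles."""
    remaining = total - used
    utilization_pct = round(100.0 * used / total, 1) if total else 0.0
    return {"total": total, "used": used,
            "remaining": remaining, "utilization_pct": utilization_pct}

# Hypothetical host CPU-core capacity for a small cloud:
print(capacity_summary(total=96, used=60))
# {'total': 96, 'used': 60, 'remaining': 36, 'utilization_pct': 62.5}
```

The same calculation applies per host, per project, or for the entire cloud, depending on how the underlying metrics are sliced.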
13.1.2.4.1 monasca Transform Features #
monasca Transform is a new component in monasca which transforms and aggregates metrics using Apache Spark.
Aggregated metrics are published to Kafka, are available to other monasca components such as monasca-threshold, and are stored in the monasca datastore.
Cloud operators can set thresholds and alarms to receive notifications when thresholds are met.
These aggregated metrics are made available to cloud operators via the Operations Console's new Capacity Summary (reporting) UI.
Capacity reporting is a new feature in SUSE OpenStack Cloud which provides cloud operators with overall capacity (available, used, and remaining) for Compute and Object Storage.
Cloud operators can view capacity reporting via the Operations Console's Compute Capacity Summary and Object Storage Capacity Summary UI.
Capacity reporting allows cloud operators to ensure that cloud resource pools have sufficient capacity to meet the demands of users. See the table below for Service and Capacity Types.
A list of aggregated metrics is provided in Section 13.1.2.4.4, “New Aggregated Metrics”.
Capacity reporting aggregated metrics are aggregated and published every hour.
In addition to the overall capacity, there are graphs which show the capacity trends over a time range (1 day, 7 days, 30 days, or 45 days).
Graphs showing the capacity trends by a particular project or compute host are also provided.
monasca Transform is integrated with centralized monitoring (monasca) and centralized logging
Flexible Deployment
Upgrade & Patch Support
Service | Type of Capacity | Description
---|---|---
Compute | Host Capacity | CPU/Disk/Memory: Used, Available and Remaining Capacity - for entire cloud installation or by compute host
Compute | VM Capacity | CPU/Disk/Memory: Allocated, Available and Remaining - for entire cloud installation, by host or by project
Object Storage | Disk Capacity | Used, Available and Remaining Disk Capacity - for entire cloud installation or by project
Object Storage | Storage Capacity | Utilized Storage Capacity - for entire cloud installation or by project
13.1.2.4.2 Architecture for Monasca Transform and Spark #
monasca Transform is a new component in monasca. monasca Transform uses Spark for data aggregation. Both monasca Transform and Spark are depicted in the example diagram below.
You can see that the monasca components run on the Cloud Controller nodes, and the monasca agents run on all nodes in the Mid-scale Example configuration.
13.1.2.4.3 Components for Capacity Reporting #
13.1.2.4.3.1 monasca Transform: Data Aggregation Reporting #
monasca-transform is a new component which provides a mechanism to aggregate or transform metrics and publish new aggregated metrics to monasca.
monasca Transform is a data driven Apache Spark based data aggregation engine which collects, groups and aggregates existing individual monasca metrics according to business requirements and publishes new transformed (derived) metrics to the monasca Kafka queue.
Since the new transformed metrics are published as any other metric in monasca, alarms can be set and triggered on the transformed metric, just like any other metric.
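Because transformed metrics flow through the same pipeline as any other metric, a threshold alarm on an aggregated metric behaves exactly like one on a raw metric. The following is a schematic sketch of a threshold check, not monasca-threshold's actual implementation; the metric name and values are invented:

```python
def evaluate_alarm(measurements, threshold, comparator="GT"):
    """Return 'ALARM' if any measurement crosses the threshold,
    otherwise 'OK' -- a schematic version of a monasca threshold check."""
    crossed = any(m > threshold if comparator == "GT" else m < threshold
                  for m in measurements)
    return "ALARM" if crossed else "OK"

# Hourly values of a hypothetical aggregated metric, e.g. mem.usable_mb_agg,
# alarming when usable memory drops below 40000 MB:
hourly_values = [52000, 48000, 39000]
print(evaluate_alarm(hourly_values, threshold=40000, comparator="LT"))  # ALARM
```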
13.1.2.4.3.2 Object Storage and Compute Capacity Summary Operations Console UI #
A new "Capacity Summary" tab for Compute and Object Storage will displays all the aggregated metrics under the "Compute" and "Object Storage" sections.
Operations Console UI makes calls to monasca API to retrieve and display various tiles and graphs on Capacity Summary tab in Compute and Object Storage Summary UI pages.
13.1.2.4.3.3 Persist new metrics and Trigger Alarms #
New aggregated metrics will be published to monasca's Kafka queue and will be ingested by monasca-persister. If thresholds and alarms have been set on the aggregated metrics, monasca will generate and trigger alarms as it currently does with any other metric. No new/additional change is expected with persisting of new aggregated metrics or setting threshold/alarms.
13.1.2.4.4 New Aggregated Metrics #
The following is the list of aggregated metrics produced by monasca Transform in SUSE OpenStack Cloud:
# | Metric Name | For | Description | Dimensions | Notes
---|---|---|---|---|---
1 | cpu.utilized_logical_cores_agg | compute summary | utilized physical host cpu core capacity for one or all hosts by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
2 | cpu.total_logical_cores_agg | compute summary | total physical host cpu core capacity for one or all hosts by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
3 | mem.total_mb_agg | compute summary | total physical host memory capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
4 | mem.usable_mb_agg | compute summary | usable physical host memory capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
5 | disk.total_used_space_mb_agg | compute summary | utilized physical host disk capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
6 | disk.total_space_mb_agg | compute summary | total physical host disk capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
7 | nova.vm.cpu.total_allocated_agg | compute summary | cpus allocated across all VMs by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
8 | vcpus_agg | compute summary | virtual cpus allocated capacity for VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all or <project ID> | Available as total or per project
9 | nova.vm.mem.total_allocated_mb_agg | compute summary | memory allocated to all VMs by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
10 | vm.mem.used_mb_agg | compute summary | memory utilized by VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: <project ID> | Available as total or per project
11 | vm.mem.total_mb_agg | compute summary | memory allocated to VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: <project ID> | Available as total or per project
12 | vm.cpu.utilization_perc_agg | compute summary | cpu utilized by all VMs by project by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: <project ID> |
13 | nova.vm.disk.total_allocated_gb_agg | compute summary | disk space allocated to all VMs by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
14 | vm.disk.allocation_agg | compute summary | disk allocation for VMs of one or all projects by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all or <project ID> | Available as total or per project
15 | swiftlm.diskusage.val.size_agg | object storage summary | total available object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
16 | swiftlm.diskusage.val.avail_agg | object storage summary | remaining object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all or <host name>; project_id: all | Available as total or per host
17 | swiftlm.diskusage.rate_agg | object storage summary | rate of change of object storage usage by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
18 | storage.objects.size_agg | object storage summary | used object storage capacity by time interval (defaults to an hour) | aggregation_period: hourly; host: all; project_id: all |
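Most rows in the table above follow the same shape: per-host (or per-project) measurements within an hourly interval are rolled up into one aggregate, and some metrics are additionally published per host or per project. A toy sketch of that rollup, with invented host names and values (this is not monasca Transform's actual code):

```python
from collections import defaultdict

def rollup(measurements, group_key=None):
    """Sum measurements into hourly aggregates.

    measurements: list of (host, value) tuples for one hourly interval.
    group_key=None produces the single host: 'all' total; group_key='host'
    produces one aggregate per host, matching the table's
    'Available as total or per host' metrics."""
    totals = defaultdict(float)
    for host, value in measurements:
        key = host if group_key == "host" else "all"
        totals[key] += value
    return dict(totals)

# Invented per-host logical-core counts for one hourly interval:
samples = [("comp001", 32), ("comp002", 32), ("comp003", 64)]
print(rollup(samples))                    # {'all': 128.0}
print(rollup(samples, group_key="host"))
# {'comp001': 32.0, 'comp002': 32.0, 'comp003': 64.0}
```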
13.1.2.4.5 Deployment #
monasca Transform and Spark are deployed on the same control plane nodes, along with the Logging and Monitoring service (monasca).
Security Consideration during deployment of monasca Transform and Spark
The SUSE OpenStack Cloud Monitoring system connects internally to the Kafka and Spark technologies without authentication. If you choose to deploy Monitoring, configure it to use only trusted networks such as the Management network, as illustrated on the network diagrams below for Entry Scale Deployment and Mid Scale Deployment.
Entry Scale Deployment

In an Entry Scale Deployment, monasca Transform and Spark are deployed on the shared control plane along with the other OpenStack services, including Monitoring and Logging.

Mid Scale Deployment

In a Mid Scale Deployment, monasca Transform and Spark are deployed on the dedicated Metering, Monitoring and Logging (MML) control plane along with other data-processing-intensive services such as Metering, Monitoring, and Logging.

Multi Control Plane Deployment

In a Multi Control Plane Deployment, monasca Transform and Spark are deployed on the shared control plane along with the rest of the monasca components.
Start, Stop and Status for monasca Transform and Spark processes
The service management methods for monasca-transform and spark follow the convention for services in the OpenStack platform. When executing from the deployer node, the commands are as follows:
Status
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-status.yml
Start
Because monasca-transform depends on Spark for processing metrics, Spark must be started before monasca-transform.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-start.yml
Stop
As a precaution, stop the monasca-transform service before taking Spark down. Interrupting the Spark service altogether while monasca-transform is still running can leave an unresponsive monasca-transform process that must be cleaned up manually.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-stop.yml
ardana > ansible-playbook -i hosts/verb_hosts spark-stop.yml
13.1.2.4.6 Reconfigure #
The reconfigure process can be triggered again from the deployer. Provided that changes have been made to the variables in the appropriate places, running the respective Ansible playbooks is enough to update the configuration. The Spark reconfigure process alters the nodes serially, meaning that Spark is never down altogether: each node is stopped in turn, and ZooKeeper manages the leaders accordingly. This means that monasca-transform may be left running even while Spark is upgraded.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.7 Adding monasca Transform and Spark to SUSE OpenStack Cloud Deployment #
Since monasca Transform and Spark are optional components, users might elect not to install them during the initial SUSE OpenStack Cloud installation. The following instructions describe how to add monasca Transform and Spark to an existing SUSE OpenStack Cloud deployment.
Steps
Add monasca Transform and Spark to the input model. On an entry-level cloud, monasca Transform and Spark are installed on the common control plane; on a mid-scale cloud, which has a Metering, Monitoring and Logging (MML) cluster, they should be added to the MML cluster.
ardana > cd ~/openstack/my_cloud/definition/data/

Add spark and monasca-transform to the input model file control_plane.yml:

clusters:
  - name: core
    cluster-prefix: c1
    server-role: CONTROLLER-ROLE
    member-count: 3
    allocation-policy: strict
    service-components:
      [...]
      - zookeeper
      - kafka
      - cassandra
      - storm
      - spark
      - monasca-api
      - monasca-persister
      - monasca-notifier
      - monasca-threshold
      - monasca-client
      - monasca-transform
      [...]
Run the Configuration Processor
ardana > cd ~/openstack/my_cloud/definition
ardana > git add -A
ardana > git commit -m "Adding monasca Transform and Spark"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run Ready Deployment:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run Cloud Lifecycle Manager Deploy:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
Verify Deployment
Log in to each controller node and run:
tux > sudo service monasca-transform status
tux > sudo service spark-master status
tux > sudo service spark-worker status
tux > sudo service monasca-transform status
● monasca-transform.service - monasca Transform Daemon
   Loaded: loaded (/etc/systemd/system/monasca-transform.service; disabled)
   Active: active (running) since Wed 2016-08-24 00:47:56 UTC; 2 days ago
 Main PID: 7351 (bash)
   CGroup: /system.slice/monasca-transform.service
           ├─ 7351 bash /etc/monasca/transform/init/start-monasca-transform.sh
           ├─ 7352 /opt/stack/service/monasca-transform/venv//bin/python /opt/monasca/monasca-transform/lib/service_runner.py
           ├─27904 /bin/sh -c export SPARK_HOME=/opt/stack/service/spark/venv/bin/../current && spark-submit --supervise --master spark://omega-cp1-c1-m1-mgmt:7077,omega-cp1-c1-m2-mgmt:7077,omega-cp1-c1...
           ├─27905 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/lib/drizzle-jdbc-1.3.jar:/opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/v...
           └─28355 python /opt/monasca/monasca-transform/lib/driver.py
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

tux > sudo service spark-worker status
● spark-worker.service - Spark Worker Daemon
   Loaded: loaded (/etc/systemd/system/spark-worker.service; disabled)
   Active: active (running) since Wed 2016-08-24 00:46:05 UTC; 2 days ago
 Main PID: 63513 (bash)
   CGroup: /system.slice/spark-worker.service
           ├─ 7671 python -m pyspark.daemon
           ├─28948 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0...
           ├─63513 bash /etc/spark/init/start-spark-worker.sh &
           └─63514 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

tux > sudo service spark-master status
● spark-master.service - Spark Master Daemon
   Loaded: loaded (/etc/systemd/system/spark-master.service; disabled)
   Active: active (running) since Wed 2016-08-24 00:44:24 UTC; 2 days ago
 Main PID: 55572 (bash)
   CGroup: /system.slice/spark-master.service
           ├─55572 bash /etc/spark/init/start-spark-master.sh &
           └─55573 /usr/bin/java -cp /opt/stack/service/spark/venv/bin/../current/conf/:/opt/stack/service/spark/venv/bin/../current/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/opt/stack/service/spark/ven...
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
13.1.2.4.8 Increase monasca Transform Scale #
monasca Transform in the default configuration can scale up to the estimated data volume of a 100-node cloud deployment. The estimated maximum rate of metrics from a 100-node cloud deployment is 120M/hour.
You can further increase the processing rate to 180M/hour. Making the Spark configuration change will increase the CPUs used by Spark and monasca Transform from an average of around 3.5 to 5.5 CPUs per control node over a 10-minute batch processing interval.
To increase the processing rate to 180M/hour, make the following Spark configuration changes.
Steps
Edit /var/lib/ardana/openstack/my_cloud/config/spark/spark-defaults.conf.j2 and set spark.cores.max to 6 and spark.executor.cores to 2:
Set spark.cores.max to 6
spark.cores.max {{ spark_cores_max }}
to
spark.cores.max 6
Set spark.executor.cores to 2
spark.executor.cores {{ spark_executor_cores }}
to
spark.executor.cores 2
Edit ~/openstack/my_cloud/config/spark/spark-env.sh.j2
Set SPARK_WORKER_CORES to 2
export SPARK_WORKER_CORES={{ spark_worker_cores }}
to
export SPARK_WORKER_CORES=2
Edit ~/openstack/my_cloud/config/spark/spark-worker-env.sh.j2
Set SPARK_WORKER_CORES to 2
export SPARK_WORKER_CORES={{ spark_worker_cores }}
to
export SPARK_WORKER_CORES=2
Run Configuration Processor
ardana > cd ~/openstack/my_cloud/definition
ardana > git add -A
ardana > git commit -m "Changing Spark Config increase scale"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run Ready Deployment:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run spark-reconfigure.yml and monasca-transform-reconfigure.yml:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.4.9 Change Compute Host Pattern Filter in Monasca Transform #
monasca Transform identifies compute host metrics by pattern matching on the hostname dimension in the incoming monasca metrics. The default pattern is of the form compNNN, for example comp001, comp002, and so on. To filter for it in the transformation specs, use the expression -comp[0-9]+-. If the compute host names follow a pattern other than the standard one above, the filter-by expression used when aggregating metrics must be changed.
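Before editing the transformation specs, the effect of a filter expression can be checked with a quick regular-expression test. This sketch uses Python's re module; the host names are invented examples:

```python
import re

def matches_filter(hostname, filter_expression):
    """Return True if the host dimension would pass an 'include'
    filter_by_list entry in a transformation spec."""
    return re.search(filter_expression, hostname) is not None

# The default pattern matches hosts named like ...-comp001-...:
print(matches_filter("cloud-cp1-comp001-mgmt", r"-comp[0-9]+-"))      # True

# A different naming scheme does not match the default pattern,
# so the expression must be changed:
print(matches_filter("cloud-cp1-compute001-mgmt", r"-comp[0-9]+-"))   # False
print(matches_filter("cloud-cp1-compute001-mgmt", r"-compute[0-9]+-"))  # True
```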
Steps
On the deployer, edit ~/openstack/my_cloud/config/monasca-transform/transform_specs.json.j2.

Look for all references of -comp[0-9]+- and change the regular expression to the desired pattern, for example -compute[0-9]+-. Change

{"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data","insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"], "usage_fetch_operation": "avg", "filter_by_list": [{"field_to_filter": "host", "filter_expression": "-comp[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}
to
{"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming", "usage":"fetch_quantity", "setters":["rollup_quantity", "set_aggregated_metric_name", "set_aggregated_period"], "insert":["prepare_data", "insert_data_pre_hourly"]}, "aggregated_metric_name":"mem.total_mb_agg", "aggregation_period":"hourly", "aggregation_group_by_list": ["host", "metric_id", "tenant_id"],"usage_fetch_operation": "avg","filter_by_list": [{"field_to_filter": "host","filter_expression": "-compute[0-9]+", "filter_operation": "include"}], "setter_rollup_group_by_list":[], "setter_rollup_operation": "sum", "dimension_list":["aggregation_period", "host", "project_id"], "pre_hourly_operation":"avg", "pre_hourly_group_by_list":["default"]}, "metric_group":"mem_total_all", "metric_id":"mem_total_all"}
Note: The filter_expression has been changed to the new pattern.
To change all host metric transformation specs in the same JSON file, repeat Step 2.
Transformation specs must be changed for the following metric_ids: "mem_total_all", "mem_usable_all", "disk_total_all", "disk_usable_all", "cpu_total_all", "cpu_total_host", "cpu_util_all", "cpu_util_host".
Run the Configuration Processor:
ardana > cd ~/openstack/my_cloud/definition
ardana > git add -A
ardana > git commit -m "Changing monasca Transform specs"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Run Ready Deployment:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run monasca Transform Reconfigure:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-transform-reconfigure.yml
13.1.2.5 Configuring Availability of Alarm Metrics #
Using the monasca agent tuning knobs, you can choose which alarm metrics are available in your environment.
The addition of the libvirt and OVS plugins to the monasca agent provides a number of additional metrics that can be used. Most of these metrics are included by default, but others are not. You have the ability to use tuning knobs to add or remove these metrics to your environment based on your individual needs in your cloud.
These metrics are listed below, along with the tuning knob names and instructions for adjusting them.
13.1.2.5.1 Libvirt plugin metric tuning knobs #
The following metrics are added as part of the libvirt plugin:
For a description of each of these metrics, see Section 13.1.4.11, “Libvirt Metrics”.
Tuning Knob | Default Setting | Admin Metric Name | Project Metric Name
---|---|---|---
vm_cpu_check_enable | True | vm.cpu.time_ns | cpu.time_ns
| | vm.cpu.utilization_norm_perc | cpu.utilization_norm_perc
| | vm.cpu.utilization_perc | cpu.utilization_perc
vm_disks_check_enable | True (creates 20 disk metrics per disk device per virtual machine) | vm.io.errors | io.errors
| | vm.io.errors_sec | io.errors_sec
| | vm.io.read_bytes | io.read_bytes
| | vm.io.read_bytes_sec | io.read_bytes_sec
| | vm.io.read_ops | io.read_ops
| | vm.io.read_ops_sec | io.read_ops_sec
| | vm.io.write_bytes | io.write_bytes
| | vm.io.write_bytes_sec | io.write_bytes_sec
| | vm.io.write_ops | io.write_ops
| | vm.io.write_ops_sec | io.write_ops_sec
vm_network_check_enable | True (creates 16 network metrics per NIC per virtual machine) | vm.net.in_bytes | net.in_bytes
| | vm.net.in_bytes_sec | net.in_bytes_sec
| | vm.net.in_packets | net.in_packets
| | vm.net.in_packets_sec | net.in_packets_sec
| | vm.net.out_bytes | net.out_bytes
| | vm.net.out_bytes_sec | net.out_bytes_sec
| | vm.net.out_packets | net.out_packets
| | vm.net.out_packets_sec | net.out_packets_sec
vm_ping_check_enable | True | vm.ping_status | ping_status
vm_extended_disks_check_enable | True (creates 6 metrics per device per virtual machine) | vm.disk.allocation | disk.allocation
| | vm.disk.capacity | disk.capacity
| | vm.disk.physical | disk.physical
vm_extended_disks_check_enable | True (creates 6 aggregate metrics per virtual machine) | vm.disk.allocation_total | disk.allocation_total
| | vm.disk.capacity_total | disk.capacity_total
| | vm.disk.physical_total | disk.physical_total
vm_disks_check_enable vm_extended_disks_check_enable | True (creates 20 aggregate metrics per virtual machine) | vm.io.errors_total | io.errors_total
| | vm.io.errors_total_sec | io.errors_total_sec
| | vm.io.read_bytes_total | io.read_bytes_total
| | vm.io.read_bytes_total_sec | io.read_bytes_total_sec
| | vm.io.read_ops_total | io.read_ops_total
| | vm.io.read_ops_total_sec | io.read_ops_total_sec
| | vm.io.write_bytes_total | io.write_bytes_total
| | vm.io.write_bytes_total_sec | io.write_bytes_total_sec
| | vm.io.write_ops_total | io.write_ops_total
| | vm.io.write_ops_total_sec | io.write_ops_total_sec
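The per-VM metric volume implied by the table above can be estimated before deciding which knobs to enable. The following sketch uses the per-knob counts stated in the table (3 CPU metrics, 20 disk metrics per disk device, 16 network metrics per NIC, 1 ping metric); the function and its parameter names are illustrative only:

```python
def metrics_per_vm(disks=1, nics=1, cpu=True, disk=True, net=True, ping=True):
    """Rough per-VM metric count derived from the libvirt
    tuning-knob table (excludes the aggregate *_total metrics)."""
    total = 0
    if cpu:
        total += 3            # vm_cpu_check_enable metrics
    if disk:
        total += 20 * disks   # vm_disks_check_enable: 20 per disk device
    if net:
        total += 16 * nics    # vm_network_check_enable: 16 per NIC
    if ping:
        total += 1            # vm_ping_check_enable
    return total

# A VM with two disks and one NIC, with these knobs enabled:
print(metrics_per_vm(disks=2, nics=1))  # 60
```

Multiplying this per-VM figure by the VM count gives a rough sense of the extra load that the monasca tuning parameters (discussed below) must accommodate.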
13.1.2.5.1.1 Configuring the libvirt metrics using the tuning knobs #
Use the following steps to configure the tuning knobs for the libvirt plugin metrics.
Log in to the Cloud Lifecycle Manager.
Edit the following file:

~/openstack/my_cloud/config/nova/libvirt-monitoring.yml

Change the value for each tuning knob to the desired setting: True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

vm_cpu_check_enable: <true or false>
vm_disks_check_enable: <true or false>
vm_extended_disks_check_enable: <true or false>
vm_network_check_enable: <true or false>
vm_ping_check_enable: <true or false>
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "configuring libvirt plugin tuning knobs"

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the nova reconfigure playbook to implement the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
If you modify either of the following files, then the monasca tuning parameters should be adjusted to handle a higher load on the system.
~/openstack/my_cloud/config/nova/libvirt-monitoring.yml
~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2

Tuning parameters are located in ~/openstack/my_cloud/config/monasca/configuration.yml. The parameter monasca_tuning_selector_override should be changed to the extra-large setting.
13.1.2.5.2 OVS plugin metric tuning knobs #
The following metrics are added as part of the OVS plugin:
For a description of each of these metrics, see Section 13.1.4.16, “Open vSwitch (OVS) Metrics”.
Tuning Knob | Default Setting | Admin Metric Name | Project Metric Name
---|---|---|---
use_rate_metrics | False | ovs.vrouter.in_bytes_sec | vrouter.in_bytes_sec
| | ovs.vrouter.in_packets_sec | vrouter.in_packets_sec
| | ovs.vrouter.out_bytes_sec | vrouter.out_bytes_sec
| | ovs.vrouter.out_packets_sec | vrouter.out_packets_sec
use_absolute_metrics | True | ovs.vrouter.in_bytes | vrouter.in_bytes
| | ovs.vrouter.in_packets | vrouter.in_packets
| | ovs.vrouter.out_bytes | vrouter.out_bytes
| | ovs.vrouter.out_packets | vrouter.out_packets
use_health_metrics with use_rate_metrics | False | ovs.vrouter.in_dropped_sec | vrouter.in_dropped_sec
| | ovs.vrouter.in_errors_sec | vrouter.in_errors_sec
| | ovs.vrouter.out_dropped_sec | vrouter.out_dropped_sec
| | ovs.vrouter.out_errors_sec | vrouter.out_errors_sec
use_health_metrics with use_absolute_metrics | False | ovs.vrouter.in_dropped | vrouter.in_dropped
| | ovs.vrouter.in_errors | vrouter.in_errors
| | ovs.vrouter.out_dropped | vrouter.out_dropped
| | ovs.vrouter.out_errors | vrouter.out_errors
13.1.2.5.2.1 Configuring the OVS metrics using the tuning knobs #
Use the following steps to configure the tuning knobs for the OVS plugin metrics.
Log in to the Cloud Lifecycle Manager.
Edit the following file:

~/openstack/my_cloud/config/neutron/monasca_ovs_plugin.yaml.j2

Change the value for each tuning knob to the desired setting: True if you want the metrics created and False if you want them removed. Refer to the table above for which metrics are controlled by each tuning knob.

init_config:
  use_absolute_metrics: <true or false>
  use_rate_metrics: <true or false>
  use_health_metrics: <true or false>
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "configuring OVS plugin tuning knobs"

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the neutron reconfigure playbook to implement the changes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
13.1.3 Integrating HipChat, Slack, and JIRA #
monasca, the SUSE OpenStack Cloud monitoring and notification service, includes three default notification methods, email, PagerDuty, and webhook. monasca also supports three other notification plugins which allow you to send notifications to HipChat, Slack, and JIRA. Unlike the default notification methods, the additional notification plugins must be manually configured.
This guide details the steps to configure each of the three non-default notification plugins. This guide also assumes that your cloud is fully deployed and functional.
13.1.3.1 Configuring the HipChat Plugin #
To configure the HipChat plugin you will need the following four pieces of information from your HipChat system.
The URL of your HipChat system.
A token providing permission to send notifications to your HipChat system.
The ID of the HipChat room you wish to send notifications to.
A HipChat user account. This account will be used to authenticate any incoming notifications from your SUSE OpenStack Cloud installation.
Obtain a token
Use the following instructions to obtain a token from your HipChat system.
Log in to HipChat as the user account that will be used to authenticate the notifications.
Navigate to the following URL: https://<your_hipchat_system>/account/api. Replace <your_hipchat_system> with the fully qualified domain name of your HipChat system.

Select the Create token option. Ensure that the token has the "SendNotification" attribute.
Obtain a room ID
Use the following instructions to obtain the ID of a HipChat room.
Log in to HipChat as the user account that will be used to authenticate the notifications.
Select My account from the application menu.
Select the Rooms tab.
Select the room that you want your notifications sent to.
Look for the API ID field in the room information. This is the room ID.
Create HipChat notification type
Use the following instructions to create a HipChat notification type.
Begin by obtaining the API URL for the HipChat room that you wish to send notifications to. The format for a URL used to send notifications to a room is as follows:
/v2/room/{room_id_or_name}/notification
Use the monasca API to create a new notification method. The following example demonstrates how to create a HipChat notification type named MyHipChatNotification, for room ID 13, using an example API URL and auth token.
ardana > monasca notification-create NAME TYPE ADDRESS
ardana > monasca notification-create MyHipChatNotification HIPCHAT https://hipchat.hpe.net/v2/room/13/notification?auth_token=1234567890

The preceding example creates a notification type with the following characteristics:
NAME: MyHipChatNotification
TYPE: HIPCHAT
ADDRESS: https://hipchat.hpe.net/v2/room/13/notification
auth_token: 1234567890
The horizon dashboard can also be used to create a HipChat notification type.
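Putting the pieces together, the notification address is simply the room's API URL with the token appended as a query parameter. The following shell sketch assembles it from placeholder values (the hostname, room ID, and token below are the examples from this section, not real credentials):

```shell
# Assemble the HipChat notification address from its parts.
# All values are placeholders; substitute those gathered in the steps above.
HIPCHAT_HOST="hipchat.hpe.net"   # fully qualified domain name of the HipChat system
ROOM_ID="13"                     # API ID of the target room
AUTH_TOKEN="1234567890"          # token with the "SendNotification" attribute
ADDRESS="https://${HIPCHAT_HOST}/v2/room/${ROOM_ID}/notification?auth_token=${AUTH_TOKEN}"
echo "${ADDRESS}"
```

The resulting string is what you pass as the ADDRESS argument to monasca notification-create.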
13.1.3.2 Configuring the Slack Plugin #
Configuring a Slack notification type requires four pieces of information from your Slack system:
- Slack server URL
- Authentication token
- Slack channel
- A Slack user account. This account will be used to authenticate incoming notifications to Slack.
Identify a Slack channel
Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.
In the left navigation panel, under the CHANNELS section, locate the channel that you want to receive the notifications. The instructions that follow use the example channel #general.
Create a Slack token
Log in to your Slack system as the user account that will be used to authenticate the notifications to Slack.
Navigate to the following URL: https://api.slack.com/docs/oauth-test-tokens
Select the Create token button.
Create a Slack notification type
Begin by identifying the structure of the API call to be used by your notification method. The format for a call to the Slack Web API is as follows:
https://slack.com/api/METHOD
You can authenticate a Web API request by using the token that you created in the preceding Create a Slack token section. Doing so results in an API call that looks like the following.
https://slack.com/api/METHOD?token=auth_token
You can further refine your call by specifying the channel that the message will be posted to. Doing so will result in an API call that looks like the following.
https://slack.com/api/METHOD?token=AUTH_TOKEN&channel=#channel
The following example uses the chat.postMessage method, the token 1234567890, and the channel #general:

https://slack.com/api/chat.postMessage?token=1234567890&channel=#general
Find more information on the Slack Web API here: https://api.slack.com/web
Use the CLI on your Cloud Lifecycle Manager to create a new Slack notification type, using the API call that you created in the preceding step. The following example creates a notification type named MySlackNotification, using token 1234567890, and posting to channel #general.
ardana > monasca notification-create MySlackNotification SLACK https://slack.com/api/chat.postMessage?token=1234567890&channel=#general
Notification types can also be created in the horizon dashboard.
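Both & and # are special characters, to the shell and in URLs respectively, so the address is best built and quoted explicitly. The sketch below uses the same example token and channel as above; percent-encoding the # as %23 is an assumption on our part to keep strict URL parsers from treating #general as a fragment:

```shell
# Build the Slack API address for the notification type.
# Example values only. The '#' in '#general' is percent-encoded as '%23'
# so it is not interpreted as a URL fragment; the unencoded form shown
# in the text above may also be accepted by Slack.
TOKEN="1234567890"
CHANNEL="%23general"
ADDRESS="https://slack.com/api/chat.postMessage?token=${TOKEN}&channel=${CHANNEL}"
echo "${ADDRESS}"
```

Quote the whole address when passing it to monasca notification-create; otherwise the shell treats the & as a background operator.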
13.1.3.3 Configuring the JIRA Plugin #
Configuring the JIRA plugin requires three pieces of information from your JIRA system:
- The URL of your JIRA system.
- The username and password of a JIRA account that will be used to authenticate the notifications.
- The name of the JIRA project that the notifications will be sent to.
Create JIRA notification type
You will configure the monasca service to send notifications to a particular JIRA project. You must also configure JIRA to create a new issue for each notification it receives for this project; however, that configuration is outside the scope of this document.
The monasca JIRA notification plugin supports only the following two JIRA issue fields.
PROJECT. This is the only supported “mandatory” JIRA issue field.
COMPONENT. This is the only supported “optional” JIRA issue field.
The JIRA issue type that your notifications will create may only be configured with the "Project" field as mandatory. If your JIRA issue type has any other mandatory fields, the monasca plugin will not function correctly. Currently, the monasca plugin only supports the single optional "component" field.
Creating the JIRA notification type requires a few more steps than other notification types covered in this guide. Because the Python and YAML files for this notification type are not yet included in SUSE OpenStack Cloud 9, you must perform the following steps to manually retrieve and place them on your Cloud Lifecycle Manager.
Configure the JIRA plugin by adding the following block to the /etc/monasca/notification.yaml file, under the notification_types section, and add the username and password of the JIRA account used for the notifications to the respective fields.

    plugins:
     - monasca_notification.plugins.jira_notifier:JiraNotifier
    jira:
      user:
      password:
      timeout: 60
After adding the necessary block, the notification_types section should look like the following example. Note that you must also add the username and password for the JIRA user related to the notification type.

    notification_types:
      plugins:
       - monasca_notification.plugins.jira_notifier:JiraNotifier
      jira:
        user:
        password:
        timeout: 60
      webhook:
        timeout: 5
      pagerduty:
        timeout: 5
        url: "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
Create the JIRA notification type. The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO.

ardana > monasca notification-create MyJiraNotification JIRA https://jira.hpcloud.net/?project=HISO

The following command example creates a JIRA notification type named MyJiraNotification, in the JIRA project HISO, and adds the optional component field with a value of keystone.

ardana > monasca notification-create MyJiraNotification JIRA https://jira.hpcloud.net/?project=HISO&component=keystone

Note: There is a slash (/) separating the URL path and the query string. The slash is required if you have a query parameter without a path parameter.

Note: Notification types may also be created in the horizon dashboard.
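One practical note on the commands above: when the address contains an &, the shell splits the command unless the address is quoted. A minimal sketch, using the same example project and component:

```shell
# Single-quote the address so the shell does not interpret '&' or '?'.
# Project and component values are the examples from the text above.
ADDRESS='https://jira.hpcloud.net/?project=HISO&component=keystone'
echo "${ADDRESS}"
# Passed unquoted, everything after '&' would run as a separate command,
# and 'component=keystone' would become a shell variable assignment.
```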
13.1.4 Alarm Metrics #
You can use the available metrics to create custom alarms to further monitor your cloud infrastructure and facilitate autoscaling features.
For details on how to create custom alarms using the Operations Console, see Section 16.2, “Alarm Definition”.
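Alarms over these metrics can also be created from the command line. The following sketch composes a monasca alarm expression over one of the metrics listed below and prints the resulting invocation; the alarm name, threshold, and period count are illustrative assumptions, not recommended values:

```shell
# Compose an alarm expression: average busy Apache workers above 50
# for 3 consecutive evaluation periods. Threshold and periods are
# examples only; pick values appropriate for your deployment.
EXPRESSION='avg(apache.performance.busy_worker_count) > 50 times 3'
# Print the CLI invocation; run it (without the echo) on a node where
# the monasca CLI is configured with valid credentials.
echo "monasca alarm-definition-create apache-busy-workers \"${EXPRESSION}\""
```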
13.1.4.1 Apache Metrics #
A list of metrics associated with the Apache service.
Metric Name | Dimensions | Description
---|---|---
apache.net.hits | hostname, service=apache, component=apache | Total accesses
apache.net.kbytes_sec | hostname, service=apache, component=apache | Total Kbytes per second
apache.net.requests_sec | hostname, service=apache, component=apache | Total accesses per second
apache.net.total_kbytes | hostname, service=apache, component=apache | Total Kbytes
apache.performance.busy_worker_count | hostname, service=apache, component=apache | The number of workers serving requests
apache.performance.cpu_load_perc | hostname, service=apache, component=apache | The current percentage of CPU used by each worker and in total by all workers combined
apache.performance.idle_worker_count | hostname, service=apache, component=apache | The number of idle workers
apache.status | apache_port, hostname, service=apache, component=apache | Status of Apache port
13.1.4.2 ceilometer Metrics #
A list of metrics associated with the ceilometer service.
Metric Name | Dimensions | Description
---|---|---
disk.total_space_mb_agg | aggregation_period=hourly, host=all, project_id=all | Total space of disk
disk.total_used_space_mb_agg | aggregation_period=hourly, host=all, project_id=all | Total used space of disk
swiftlm.diskusage.rate_agg | aggregation_period=hourly, host=all, project_id=all |
swiftlm.diskusage.val.avail_agg | aggregation_period=hourly, host, project_id=all |
swiftlm.diskusage.val.size_agg | aggregation_period=hourly, host, project_id=all |
image | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Existence of the image
image.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Delete operation on this image
image.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=B, source=openstack | Size of the uploaded image
image.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Update operation on this image
image.upload | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=image, source=openstack | Upload operation on this image
instance | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=instance, source=openstack | Existence of instance
disk.ephemeral.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of ephemeral disk on this instance
disk.root.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of root disk on this instance
memory | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=MB, source=openstack | Size of memory on this instance
ip.floating | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=ip, source=openstack | Existence of IP
ip.floating.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=ip, source=openstack | Create operation on this fip
ip.floating.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=ip, source=openstack | Update operation on this fip
mem.total_mb_agg | aggregation_period=hourly, host=all, project_id=all | Total space of memory
mem.usable_mb_agg | aggregation_period=hourly, host=all, project_id=all | Available space of memory
network | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=network, source=openstack | Existence of network
network.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=network, source=openstack | Create operation on this network
network.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=network, source=openstack | Update operation on this network
network.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=network, source=openstack | Delete operation on this network
port | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=port, source=openstack | Existence of port
port.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=port, source=openstack | Create operation on this port
port.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=port, source=openstack | Delete operation on this port
port.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=port, source=openstack | Update operation on this port
router | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=router, source=openstack | Existence of router
router.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Create operation on this router
router.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Delete operation on this router
router.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=router, source=openstack | Update operation on this router
snapshot | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=snapshot, source=openstack | Existence of the snapshot
snapshot.create.end | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=snapshot, source=openstack | Create operation on this snapshot
snapshot.delete.end | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=snapshot, source=openstack | Delete operation on this snapshot
snapshot.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of this snapshot
subnet | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=subnet, source=openstack | Existence of the subnet
subnet.create | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=subnet, source=openstack | Create operation on this subnet
subnet.delete | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=subnet, source=openstack | Delete operation on this subnet
subnet.update | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=subnet, source=openstack | Update operation on this subnet
vcpus | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=vcpus, source=openstack | Number of virtual CPUs allocated to the instance
vcpus_agg | aggregation_period=hourly, host=all, project_id | Number of vcpus used by a project
volume | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=volume, source=openstack | Existence of the volume
volume.create.end | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Create operation on this volume
volume.delete.end | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Delete operation on this volume
volume.resize.end | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Resize operation on this volume
volume.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=GB, source=openstack | Size of this volume
volume.update.end | user_id, region, resource_id, datasource=ceilometer, project_id, type=delta, unit=volume, source=openstack | Update operation on this volume
storage.objects | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=object, source=openstack | Number of objects
storage.objects.size | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=B, source=openstack | Total size of stored objects
storage.objects.containers | user_id, region, resource_id, datasource=ceilometer, project_id, type=gauge, unit=container, source=openstack | Number of containers
13.1.4.3 cinder Metrics #
A list of metrics associated with the cinder service.
Metric Name | Dimensions | Description
---|---|---
cinderlm.cinder.backend.physical.list | service=block-storage, hostname, cluster, cloud_name, control_plane, component, backends | List of physical backends
cinderlm.cinder.backend.total.avail | service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname | Total available capacity metric per backend
cinderlm.cinder.backend.total.size | service=block-storage, hostname, cluster, cloud_name, control_plane, component, backendname | Total capacity metric per backend
cinderlm.cinder.cinder_services | service=block-storage, hostname, cluster, cloud_name, control_plane, component | Status of a cinder-volume service
cinderlm.hp_hardware.hpssacli.logical_drive | service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, logical_drive, controller_slot, array | Status of a logical drive
cinderlm.hp_hardware.hpssacli.physical_drive | service=block-storage, hostname, cluster, cloud_name, control_plane, component, box, bay, controller_slot | Status of a physical drive
cinderlm.hp_hardware.hpssacli.smart_array | service=block-storage, hostname, cluster, cloud_name, control_plane, component, sub_component, model | Status of smart array
cinderlm.hp_hardware.hpssacli.smart_array.firmware | service=block-storage, hostname, cluster, cloud_name, control_plane, component, model | Checks firmware version

Note: The HPE Smart Storage Administrator (HPE SSA) CLI component must be installed for the hpssacli status metrics to be reported. To download and install the SSACLI utility to enable management of disk controllers, refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f
13.1.4.4 Compute Metrics #
Compute instance metrics are listed in Section 13.1.4.11, “Libvirt Metrics”.
A list of metrics associated with the Compute service.
Metric Name | Dimensions | Description
---|---|---
nova.heartbeat | service=compute, cloud_name, hostname, component, control_plane, cluster | Checks that all services are running heartbeats (uses the nova user to list services, then sets up a check for each; for example, nova-scheduler, nova-conductor, nova-compute)
nova.vm.cpu.total_allocated | service=compute, hostname, component, control_plane, cluster | Total CPUs allocated across all VMs
nova.vm.disk.total_allocated_gb | service=compute, hostname, component, control_plane, cluster | Total Gbytes of disk space allocated to all VMs
nova.vm.mem.total_allocated_mb | service=compute, hostname, component, control_plane, cluster | Total Mbytes of memory allocated to all VMs
13.1.4.5 Crash Metrics #
A list of metrics associated with the Crash service.
Metric Name | Dimensions | Description
---|---|---
crash.dump_count | service=system, hostname, cluster | Number of crash dumps found
13.1.4.6 Directory Metrics #
A list of metrics associated with the Directory service.
Metric Name | Dimensions | Description
---|---|---
directory.files_count | service, hostname, path | Total number of files under a specific directory path
directory.size_bytes | service, hostname, path | Total size of a specific directory path
13.1.4.7 Elasticsearch Metrics #
A list of metrics associated with the Elasticsearch service.
Metric Name | Dimensions | Description
---|---|---
elasticsearch.active_primary_shards | service=logging, url, hostname | Indicates the number of primary shards in your cluster. This is an aggregate total across all indices.
elasticsearch.active_shards | service=logging, url, hostname | Aggregate total of all shards across all indices, which includes replica shards.
elasticsearch.cluster_status | service=logging, url, hostname | Cluster health status.
elasticsearch.initializing_shards | service=logging, url, hostname | The count of shards that are being freshly created.
elasticsearch.number_of_data_nodes | service=logging, url, hostname | Number of data nodes.
elasticsearch.number_of_nodes | service=logging, url, hostname | Number of nodes.
elasticsearch.relocating_shards | service=logging, url, hostname | Shows the number of shards that are currently moving from one node to another node.
elasticsearch.unassigned_shards | service=logging, url, hostname | The number of unassigned shards from the master node.
13.1.4.8 HAProxy Metrics #
A list of metrics associated with the HAProxy service.
Metric Name | Dimensions | Description
---|---|---
haproxy.backend.bytes.in_rate | |
haproxy.backend.bytes.out_rate | |
haproxy.backend.denied.req_rate | |
haproxy.backend.denied.resp_rate | |
haproxy.backend.errors.con_rate | |
haproxy.backend.errors.resp_rate | |
haproxy.backend.queue.current | |
haproxy.backend.response.1xx | |
haproxy.backend.response.2xx | |
haproxy.backend.response.3xx | |
haproxy.backend.response.4xx | |
haproxy.backend.response.5xx | |
haproxy.backend.response.other | |
haproxy.backend.session.current | |
haproxy.backend.session.limit | |
haproxy.backend.session.pct | |
haproxy.backend.session.rate | |
haproxy.backend.warnings.redis_rate | |
haproxy.backend.warnings.retr_rate | |
haproxy.frontend.bytes.in_rate | |
haproxy.frontend.bytes.out_rate | |
haproxy.frontend.denied.req_rate | |
haproxy.frontend.denied.resp_rate | |
haproxy.frontend.errors.req_rate | |
haproxy.frontend.requests.rate | |
haproxy.frontend.response.1xx | |
haproxy.frontend.response.2xx | |
haproxy.frontend.response.3xx | |
haproxy.frontend.response.4xx | |
haproxy.frontend.response.5xx | |
haproxy.frontend.response.other | |
haproxy.frontend.session.current | |
haproxy.frontend.session.limit | |
haproxy.frontend.session.pct | |
haproxy.frontend.session.rate | |
13.1.4.9 HTTP Check Metrics #
A list of metrics associated with the HTTP Check service:
Metric Name | Dimensions | Description
---|---|---
http_response_time | url, hostname, service, component | The response time in seconds of the http endpoint call.
http_status | url, hostname, service | The status of the http endpoint call (0 = success, 1 = failure).
For each component and HTTP metric name there are two separate metrics reported, one for the local URL and another for the virtual IP (VIP) URL:
Component | Dimensions | Description
---|---|---
account-server | service=object-storage, component=account-server, url | swift account-server http endpoint status and response time
barbican-api | service=key-manager, component=barbican-api, url | barbican-api http endpoint status and response time
cinder-api | service=block-storage, component=cinder-api, url | cinder-api http endpoint status and response time
container-server | service=object-storage, component=container-server, url | swift container-server http endpoint status and response time
designate-api | service=dns, component=designate-api, url | designate-api http endpoint status and response time
glance-api | service=image-service, component=glance-api, url | glance-api http endpoint status and response time
glance-registry | service=image-service, component=glance-registry, url | glance-registry http endpoint status and response time
heat-api | service=orchestration, component=heat-api, url | heat-api http endpoint status and response time
heat-api-cfn | service=orchestration, component=heat-api-cfn, url | heat-api-cfn http endpoint status and response time
heat-api-cloudwatch | service=orchestration, component=heat-api-cloudwatch, url | heat-api-cloudwatch http endpoint status and response time
ardana-ux-services | service=ardana-ux-services, component=ardana-ux-services, url | ardana-ux-services http endpoint status and response time
horizon | service=web-ui, component=horizon, url | horizon http endpoint status and response time
keystone-api | service=identity-service, component=keystone-api, url | keystone-api http endpoint status and response time
monasca-api | service=monitoring, component=monasca-api, url | monasca-api http endpoint status
monasca-persister | service=monitoring, component=monasca-persister, url | monasca-persister http endpoint status
neutron-server | service=networking, component=neutron-server, url | neutron-server http endpoint status and response time
neutron-server-vip | service=networking, component=neutron-server-vip, url | neutron-server-vip http endpoint status and response time
nova-api | service=compute, component=nova-api, url | nova-api http endpoint status and response time
nova-vnc | service=compute, component=nova-vnc, url | nova-vnc http endpoint status and response time
object-server | service=object-storage, component=object-server, url | object-server http endpoint status and response time
object-storage-vip | service=object-storage, component=object-storage-vip, url | object-storage-vip http endpoint status and response time
octavia-api | service=octavia, component=octavia-api, url | octavia-api http endpoint status and response time
ops-console-web | service=ops-console, component=ops-console-web, url | ops-console-web http endpoint status and response time
proxy-server | service=object-storage, component=proxy-server, url | proxy-server http endpoint status and response time
13.1.4.10 Kafka Metrics #
A list of metrics associated with the Kafka service.
Metric Name | Dimensions | Description
---|---|---
kafka.consumer_lag | topic, service, component=kafka, consumer_group, hostname | Hostname consumer offset lag from broker offset
13.1.4.11 Libvirt Metrics #
For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.1, “Libvirt plugin metric tuning knobs”.
A list of metrics associated with the Libvirt service.
Admin Metric Name | Project Metric Name | Dimensions | Description
---|---|---|---
vm.cpu.time_ns | cpu.time_ns | zone, service, resource_id, hostname, component | Cumulative CPU time (in ns)
vm.cpu.utilization_norm_perc | cpu.utilization_norm_perc | zone, service, resource_id, hostname, component | Normalized CPU utilization (percentage)
vm.cpu.utilization_perc | cpu.utilization_perc | zone, service, resource_id, hostname, component | Overall CPU utilization (percentage)
vm.io.errors | io.errors | zone, service, resource_id, hostname, component | Overall disk I/O errors
vm.io.errors_sec | io.errors_sec | zone, service, resource_id, hostname, component | Disk I/O errors per second
vm.io.read_bytes | io.read_bytes | zone, service, resource_id, hostname, component | Disk I/O read bytes value
vm.io.read_bytes_sec | io.read_bytes_sec | zone, service, resource_id, hostname, component | Disk I/O read bytes per second
vm.io.read_ops | io.read_ops | zone, service, resource_id, hostname, component | Disk I/O read operations value
vm.io.read_ops_sec | io.read_ops_sec | zone, service, resource_id, hostname, component | Disk I/O read operations per second
vm.io.write_bytes | io.write_bytes | zone, service, resource_id, hostname, component | Disk I/O write bytes value
vm.io.write_bytes_sec | io.write_bytes_sec | zone, service, resource_id, hostname, component | Disk I/O write bytes per second
vm.io.write_ops | io.write_ops | zone, service, resource_id, hostname, component | Disk I/O write operations value
vm.io.write_ops_sec | io.write_ops_sec | zone, service, resource_id, hostname, component | Disk I/O write operations per second
vm.net.in_bytes | net.in_bytes | zone, service, resource_id, hostname, component, device, port_id | Network received total bytes
vm.net.in_bytes_sec | net.in_bytes_sec | zone, service, resource_id, hostname, component, device, port_id | Network received bytes per second
vm.net.in_packets | net.in_packets | zone, service, resource_id, hostname, component, device, port_id | Network received total packets
vm.net.in_packets_sec | net.in_packets_sec | zone, service, resource_id, hostname, component, device, port_id | Network received packets per second
vm.net.out_bytes | net.out_bytes | zone, service, resource_id, hostname, component, device, port_id | Network transmitted total bytes
vm.net.out_bytes_sec | net.out_bytes_sec | zone, service, resource_id, hostname, component, device, port_id | Network transmitted bytes per second
vm.net.out_packets | net.out_packets | zone, service, resource_id, hostname, component, device, port_id | Network transmitted total packets
vm.net.out_packets_sec | net.out_packets_sec | zone, service, resource_id, hostname, component, device, port_id | Network transmitted packets per second
vm.ping_status | ping_status | zone, service, resource_id, hostname, component | 0 for ping success, 1 for ping failure
vm.disk.allocation | disk.allocation | zone, service, resource_id, hostname, component | Total disk allocation for a device
vm.disk.allocation_total | disk.allocation_total | zone, service, resource_id, hostname, component | Total disk allocation across devices for instances
vm.disk.capacity | disk.capacity | zone, service, resource_id, hostname, component | Total disk capacity for a device
vm.disk.capacity_total | disk.capacity_total | zone, service, resource_id, hostname, component | Total disk capacity across devices for instances
vm.disk.physical | disk.physical | zone, service, resource_id, hostname, component | Total disk usage for a device
vm.disk.physical_total | disk.physical_total | zone, service, resource_id, hostname, component | Total disk usage across devices for instances
vm.io.errors_total | io.errors_total | zone, service, resource_id, hostname, component | Total disk I/O errors across all devices
vm.io.errors_total_sec | io.errors_total_sec | zone, service, resource_id, hostname, component | Total disk I/O errors per second across all devices
vm.io.read_bytes_total | io.read_bytes_total | zone, service, resource_id, hostname, component | Total disk I/O read bytes across all devices
vm.io.read_bytes_total_sec | io.read_bytes_total_sec | zone, service, resource_id, hostname, component | Total disk I/O read bytes per second across devices
vm.io.read_ops_total | io.read_ops_total | zone, service, resource_id, hostname, component | Total disk I/O read operations across all devices
vm.io.read_ops_total_sec | io.read_ops_total_sec | zone, service, resource_id, hostname, component | Total disk I/O read operations per second across all devices
vm.io.write_bytes_total | io.write_bytes_total | zone, service, resource_id, hostname, component | Total disk I/O write bytes across all devices
vm.io.write_bytes_total_sec | io.write_bytes_total_sec | zone, service, resource_id, hostname, component | Total disk I/O write bytes per second across devices
vm.io.write_ops_total | io.write_ops_total | zone, service, resource_id, hostname, component | Total disk I/O write operations across all devices
vm.io.write_ops_total_sec | io.write_ops_total_sec | zone, service, resource_id, hostname, component | Total disk I/O write operations per second across all devices
The following libvirt metrics are always enabled and cannot be disabled using the tuning knobs.

Admin Metric Name | Project Metric Name | Dimensions | Description
---|---|---|---
vm.host_alive_status | host_alive_status | zone, service, resource_id, hostname, component | -1 for no status, 0 for Running/OK, 1 for Idle/blocked, 2 for Paused, 3 for Shutting down, 4 for Shut off or nova suspend, 5 for Crashed, 6 for Power management suspend (S3 state)
vm.mem.free_mb | mem.free_mb | cluster, service, hostname | Free memory in Mbytes
vm.mem.free_perc | mem.free_perc | cluster, service, hostname | Percent of memory free
vm.mem.resident_mb | | cluster, service, hostname | Total memory used on host, an Operations-only metric
vm.mem.swap_used_mb | mem.swap_used_mb | cluster, service, hostname | Used swap space in Mbytes
vm.mem.total_mb | mem.total_mb | cluster, service, hostname | Total memory in Mbytes
vm.mem.used_mb | mem.used_mb | cluster, service, hostname | Used memory in Mbytes
13.1.4.12 Monitoring Metrics #
A list of metrics associated with the Monitoring service.
Metric Name | Dimensions | Description |
---|---|---|
alarm-state-transitions-added-to-batch-counter |
service=monitoring url hostname component=monasca-persister | |
jvm.memory.total.max |
service=monitoring url hostname component | Maximum JVM overall memory |
jvm.memory.total.used |
service=monitoring url hostname component | Used JVM overall memory |
metrics-added-to-batch-counter |
service=monitoring url hostname component=monasca-persister | |
metrics.published |
service=monitoring url hostname component=monasca-api | Total number of published metrics |
monasca.alarms_finished_count |
hostname component=monasca-notification service=monitoring | Total number of alarms received |
monasca.checks_running_too_long |
hostname component=monasca-agent service=monitoring cluster | Only emitted when collection time for a check is too long |
monasca.collection_time_sec |
hostname component=monasca-agent service=monitoring cluster | Collection time in monasca-agent |
monasca.config_db_time |
hostname component=monasca-notification service=monitoring | |
monasca.created_count |
hostname component=monasca-notification service=monitoring | Number of notifications created |
monasca.invalid_type_count |
hostname component=monasca-notification service=monitoring | Number of notifications with invalid type |
monasca.log.in_bulks_rejected |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.in_logs |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.in_logs_bytes |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.in_logs_rejected |
hostname component=monasca-log-api service=monitoring version | |
monasca.log.out_logs |
hostname component=monasca-log-api service=monitoring | |
monasca.log.out_logs_lost |
hostname component=monasca-log-api service=monitoring | |
monasca.log.out_logs_truncated_bytes |
hostname component=monasca-log-api service=monitoring | |
monasca.log.processing_time_ms |
hostname component=monasca-log-api service=monitoring | |
monasca.log.publish_time_ms |
hostname component=monasca-log-api service=monitoring | |
monasca.thread_count |
service=monitoring process_name hostname component | Number of threads monasca is using |
raw-sql.time.avg |
service=monitoring url hostname component | Average raw SQL query time |
raw-sql.time.max |
service=monitoring url hostname component | Maximum raw SQL query time |
13.1.4.13 Monasca Aggregated Metrics #
A list of the aggregated metrics associated with the monasca Transform feature.
Metric Name | For | Dimensions | Description |
---|---|---|---|
cpu.utilized_logical_cores_agg | Compute summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Utilized physical host CPU core capacity for one or all hosts by time interval (defaults to an hour). Available as total or per host |
cpu.total_logical_cores_agg | Compute summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Total physical host CPU core capacity for one or all hosts by time interval (defaults to an hour). Available as total or per host |
mem.total_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Total physical host memory capacity by time interval (defaults to an hour) |
mem.usable_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all | Usable physical host memory capacity by time interval (defaults to an hour) |
disk.total_used_space_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Utilized physical host disk capacity by time interval (defaults to an hour) |
disk.total_space_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all | Total physical host disk capacity by time interval (defaults to an hour) |
nova.vm.cpu.total_allocated_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
CPUs allocated across all virtual machines by time interval (defaults to an hour) |
vcpus_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Virtual CPUs allocated capacity for virtual machines of one or all projects by time interval (defaults to an hour). Available as total or per host |
nova.vm.mem.total_allocated_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Memory allocated to all virtual machines by time interval (defaults to an hour) |
vm.mem.used_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Memory utilized by virtual machines of one or all projects by time interval (defaults to an hour) Available as total or per host |
vm.mem.total_mb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Memory allocated to virtual machines of one or all projects by time interval (defaults to an hour) Available as total or per host |
vm.cpu.utilization_perc_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
CPU utilized by all virtual machines by project by time interval (defaults to an hour) |
nova.vm.disk.total_allocated_gb_agg | Compute summary |
aggregation_period: hourly host: all project_id: all |
Disk space allocated to all virtual machines by time interval (defaults to an hour) |
vm.disk.allocation_agg | Compute summary |
aggregation_period: hourly host: all project_id: all or <project ID> |
Disk allocation for virtual machines of one or all projects by time interval (defaults to an hour). Available as total or per host |
swiftlm.diskusage.val.size_agg | Object Storage summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Total available object storage capacity by time interval (defaults to an hour). Available as total or per host |
swiftlm.diskusage.val.avail_agg | Object Storage summary |
aggregation_period: hourly host: all or <hostname> project_id: all |
Remaining object storage capacity by time interval (defaults to an hour). Available as total or per host |
swiftlm.diskusage.rate_agg | Object Storage summary |
aggregation_period: hourly host: all project_id: all |
Rate of change of object storage usage by time interval (defaults to an hour) |
storage.objects.size_agg | Object Storage summary |
aggregation_period: hourly host: all project_id: all |
Used object storage capacity by time interval (defaults to an hour) |
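The hourly aggregates above are time-bucketed rollups of per-host samples. The following is a minimal illustrative sketch of that bucketing (it is not the monasca Transform implementation; the hosts and values are hypothetical), showing how the host: all rollup relates to the per-host values:

```python
from collections import defaultdict
from datetime import datetime

def aggregate_hourly(samples):
    """Sum (timestamp, host, value) samples into per-hour buckets keyed by
    (hour, host), plus an (hour, "all") rollup, mirroring the
    host: all or <hostname> dimension in the tables above."""
    buckets = defaultdict(float)
    for ts, host, value in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, host)] += value
        buckets[(hour, "all")] += value
    return dict(buckets)

# One core-count reading per host per collection interval (illustrative):
samples = [
    (datetime(2019, 4, 1, 10, 5), "compute1", 16.0),
    (datetime(2019, 4, 1, 10, 40), "compute2", 8.0),
    (datetime(2019, 4, 1, 11, 2), "compute1", 16.0),
]
agg = aggregate_hourly(samples)
```

The "all" bucket for 10:00 then holds the sum of both hosts' readings, while the per-host keys keep the individual contributions.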
13.1.4.14 MySQL Metrics #
A list of metrics associated with the MySQL service.
Metric Name | Dimensions | Description |
---|---|---|
mysql.innodb.buffer_pool_free |
hostname mode service=mysql |
The number of free pages, in bytes. This value is calculated by multiplying Innodb_buffer_pool_pages_free by Innodb_page_size. |
mysql.innodb.buffer_pool_total |
hostname mode service=mysql |
The total size of the buffer pool, in bytes. This value is calculated by multiplying Innodb_buffer_pool_pages_total by Innodb_page_size. |
mysql.innodb.buffer_pool_used |
hostname mode service=mysql |
The number of used pages, in bytes. This value is calculated by subtracting Innodb_buffer_pool_pages_free from Innodb_buffer_pool_pages_total and multiplying by Innodb_page_size. |
mysql.innodb.current_row_locks |
hostname mode service=mysql |
Corresponding to the current row locks reported in the server status. |
mysql.innodb.data_reads |
hostname mode service=mysql |
Corresponding to the Innodb_data_reads server status variable. |
mysql.innodb.data_writes |
hostname mode service=mysql |
Corresponding to the Innodb_data_writes server status variable. |
mysql.innodb.mutex_os_waits |
hostname mode service=mysql |
Corresponding to the OS waits of the server status variable. |
mysql.innodb.mutex_spin_rounds |
hostname mode service=mysql |
Corresponding to spinlock rounds of the server status variable. |
mysql.innodb.mutex_spin_waits |
hostname mode service=mysql |
Corresponding to the spin waits of the server status variable. |
mysql.innodb.os_log_fsyncs |
hostname mode service=mysql |
Corresponding to the Innodb_os_log_fsyncs server status variable. |
mysql.innodb.row_lock_time |
hostname mode service=mysql |
Corresponding to the Innodb_row_lock_time server status variable. |
mysql.innodb.row_lock_waits |
hostname mode service=mysql |
Corresponding to the Innodb_row_lock_waits server status variable. |
mysql.net.connections |
hostname mode service=mysql |
Corresponding to the Connections server status variable. |
mysql.net.max_connections |
hostname mode service=mysql |
Corresponding to the Max_used_connections server status variable. |
mysql.performance.com_delete |
hostname mode service=mysql |
Corresponding to the Com_delete server status variable. |
mysql.performance.com_delete_multi |
hostname mode service=mysql |
Corresponding to the Com_delete_multi server status variable. |
mysql.performance.com_insert |
hostname mode service=mysql |
Corresponding to the Com_insert server status variable. |
mysql.performance.com_insert_select |
hostname mode service=mysql |
Corresponding to the Com_insert_select server status variable. |
mysql.performance.com_replace_select |
hostname mode service=mysql |
Corresponding to the Com_replace_select server status variable. |
mysql.performance.com_select |
hostname mode service=mysql |
Corresponding to the Com_select server status variable. |
mysql.performance.com_update |
hostname mode service=mysql |
Corresponding to the Com_update server status variable. |
mysql.performance.com_update_multi |
hostname mode service=mysql |
Corresponding to the Com_update_multi server status variable. |
mysql.performance.created_tmp_disk_tables |
hostname mode service=mysql |
Corresponding to the Created_tmp_disk_tables server status variable. |
mysql.performance.created_tmp_files |
hostname mode service=mysql |
Corresponding to the Created_tmp_files server status variable. |
mysql.performance.created_tmp_tables |
hostname mode service=mysql |
Corresponding to the Created_tmp_tables server status variable. |
mysql.performance.kernel_time |
hostname mode service=mysql |
The kernel (system) CPU time consumed by the database, in seconds. |
mysql.performance.open_files |
hostname mode service=mysql |
Corresponding to the Open_files server status variable. |
mysql.performance.qcache_hits |
hostname mode service=mysql |
Corresponding to the Qcache_hits server status variable. |
mysql.performance.queries |
hostname mode service=mysql |
Corresponding to the Queries server status variable. |
mysql.performance.questions |
hostname mode service=mysql |
Corresponding to the Questions server status variable. |
mysql.performance.slow_queries |
hostname mode service=mysql |
Corresponding to the Slow_queries server status variable. |
mysql.performance.table_locks_waited |
hostname mode service=mysql |
Corresponding to the Table_locks_waited server status variable. |
mysql.performance.threads_connected |
hostname mode service=mysql |
Corresponding to the Threads_connected server status variable. |
mysql.performance.user_time |
hostname mode service=mysql |
The user CPU time consumed by the database, in seconds. |
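The three InnoDB buffer pool metrics are simple arithmetic over SHOW GLOBAL STATUS counters, as described in the table above. A minimal sketch of that derivation (the status values below are illustrative, not from a real server):

```python
def innodb_buffer_pool_metrics(status):
    """Derive the three buffer pool metrics (in bytes) from SHOW GLOBAL
    STATUS counters: total and free are pages * page size, used is the
    difference."""
    page_size = int(status["Innodb_page_size"])
    pages_total = int(status["Innodb_buffer_pool_pages_total"])
    pages_free = int(status["Innodb_buffer_pool_pages_free"])
    return {
        "mysql.innodb.buffer_pool_total": pages_total * page_size,
        "mysql.innodb.buffer_pool_free": pages_free * page_size,
        "mysql.innodb.buffer_pool_used": (pages_total - pages_free) * page_size,
    }

# Hypothetical status values for illustration (16 KiB pages, 128 MiB pool):
status = {
    "Innodb_page_size": "16384",
    "Innodb_buffer_pool_pages_total": "8192",
    "Innodb_buffer_pool_pages_free": "1024",
}
metrics = innodb_buffer_pool_metrics(status)
```

By construction, buffer_pool_free plus buffer_pool_used always equals buffer_pool_total.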
13.1.4.15 NTP Metrics #
A list of metrics associated with the NTP service.
Metric Name | Dimensions | Description |
---|---|---|
ntp.connection_status |
hostname ntp_server | Value of ntp server connection status (0=Healthy) |
ntp.offset |
hostname ntp_server | Time offset in seconds |
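The two NTP metrics are typically evaluated together: a server is reachable (connection status 0) and its clock is close to the reference. A minimal sketch of such a check; the 1-second offset threshold is an assumption for illustration, not a value from this guide:

```python
def ntp_healthy(connection_status, offset_seconds, max_offset=1.0):
    """True when ntp.connection_status reports healthy (0) and
    ntp.offset is within max_offset seconds (illustrative threshold)."""
    return connection_status == 0 and abs(offset_seconds) <= max_offset
```

An alarm definition on these metrics would express the same condition declaratively rather than in code.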
13.1.4.16 Open vSwitch (OVS) Metrics #
A list of metrics associated with the OVS service.
For information on how to turn these metrics on and off using the tuning knobs, see Section 13.1.2.5.2, “OVS plugin metric tuning knobs”.
Admin Metric Name | Project Metric Name | Dimensions | Description |
---|---|---|---|
ovs.vrouter.in_bytes_sec | vrouter.in_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Inbound bytes per second for the router (if
|
ovs.vrouter.in_packets_sec | vrouter.in_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets per second for the router |
ovs.vrouter.out_bytes_sec | vrouter.out_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing bytes per second for the router (if
|
ovs.vrouter.out_packets_sec | vrouter.out_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets per second for the router |
ovs.vrouter.in_bytes | vrouter.in_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Inbound bytes for the router (if |
ovs.vrouter.in_packets | vrouter.in_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets for the router |
ovs.vrouter.out_bytes | vrouter.out_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing bytes for the router (if |
ovs.vrouter.out_packets | vrouter.out_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets for the router |
ovs.vrouter.in_dropped_sec | vrouter.in_dropped_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets per second for the router |
ovs.vrouter.in_errors_sec | vrouter.in_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Number of incoming errors per second for the router |
ovs.vrouter.out_dropped_sec | vrouter.out_dropped_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets per second for the router |
ovs.vrouter.out_errors_sec | vrouter.out_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Number of outgoing errors per second for the router |
ovs.vrouter.in_dropped | vrouter.in_dropped |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets for the router |
ovs.vrouter.in_errors | vrouter.in_errors |
service=networking resource_id component=ovs router_name port_id |
Number of incoming errors for the router |
ovs.vrouter.out_dropped | vrouter.out_dropped |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets for the router |
ovs.vrouter.out_errors | vrouter.out_errors |
service=networking resource_id tenant_id component=ovs router_name port_id |
Number of outgoing errors for the router |
Admin Metric Name | Tenant Metric Name | Dimensions | Description |
---|---|---|---|
ovs.vswitch.in_bytes_sec | vswitch.in_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Incoming bytes per second on the DHCP port (if |
ovs.vswitch.in_packets_sec | vswitch.in_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets per second for the DHCP port |
ovs.vswitch.out_bytes_sec | vswitch.out_bytes_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing bytes per second on the DHCP port (if |
ovs.vswitch.out_packets_sec | vswitch.out_packets_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets per second for the DHCP port |
ovs.vswitch.in_bytes | vswitch.in_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Inbound bytes for the DHCP port (if |
ovs.vswitch.in_packets | vswitch.in_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming packets for the DHCP port |
ovs.vswitch.out_bytes | vswitch.out_bytes |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing bytes for the DHCP port (if |
ovs.vswitch.out_packets | vswitch.out_packets |
service=networking resource_id tenant_id component=ovs router_name port_id |
Outgoing packets for the DHCP port |
ovs.vswitch.in_dropped_sec | vswitch.in_dropped_sec |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets per second for the DHCP port |
ovs.vswitch.in_errors_sec | vswitch.in_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Incoming errors per second for the DHCP port |
ovs.vswitch.out_dropped_sec | vswitch.out_dropped_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets per second for the DHCP port |
ovs.vswitch.out_errors_sec | vswitch.out_errors_sec |
service=networking resource_id component=ovs router_name port_id |
Outgoing errors per second for the DHCP port |
ovs.vswitch.in_dropped | vswitch.in_dropped |
service=networking resource_id tenant_id component=ovs router_name port_id |
Incoming dropped packets for the DHCP port |
ovs.vswitch.in_errors | vswitch.in_errors |
service=networking resource_id component=ovs router_name port_id |
Errors received for the DHCP port |
ovs.vswitch.out_dropped | vswitch.out_dropped |
service=networking resource_id component=ovs router_name port_id |
Outgoing dropped packets for the DHCP port |
ovs.vswitch.out_errors | vswitch.out_errors |
service=networking resource_id tenant_id component=ovs router_name port_id |
Errors transmitted for the DHCP port |
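The *_sec metrics above are per-second rates derived from the corresponding cumulative counters (for example, ovs.vrouter.in_bytes_sec from ovs.vrouter.in_bytes). A minimal sketch of that counter-to-rate conversion, for illustration only (not the OVS plugin's actual code):

```python
def per_second_rate(prev_count, curr_count, interval_seconds):
    """Turn two successive readings of a cumulative counter into a
    per-second rate over the collection interval."""
    delta = curr_count - prev_count
    if delta < 0:
        # Counter reset or rollover between samples: report no rate
        # rather than a large negative one (illustrative policy).
        return 0.0
    return delta / interval_seconds
```

For example, two byte-counter readings taken 60 seconds apart that differ by 6000 yield a rate of 100 bytes per second.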
13.1.4.17 Process Metrics #
A list of metrics associated with processes.
Metric Name | Dimensions | Description |
---|---|---|
process.cpu_perc |
hostname service process_name component | Percentage of CPU being consumed by a process |
process.io.read_count |
hostname service process_name component | Number of reads by a process |
process.io.read_kbytes |
hostname service process_name component | Kbytes read by a process |
process.io.write_count |
hostname service process_name component | Number of writes by a process |
process.io.write_kbytes |
hostname service process_name component | Kbytes written by a process |
process.mem.rss_mbytes |
hostname service process_name component | Amount of physical memory allocated to a process, including memory from shared libraries in Mbytes |
process.open_file_descriptors |
hostname service process_name component | Number of files being used by a process |
process.pid_count |
hostname service process_name component | Number of processes that exist with this process name |
process.thread_count |
hostname service process_name component | Number of threads a process is using |
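process.pid_count and process.thread_count are aggregates over all processes that share a process_name. A minimal sketch of that aggregation from hypothetical (name, pid, threads) samples (illustrative only, not the monasca-agent implementation):

```python
from collections import defaultdict

def summarize_processes(procs):
    """Aggregate per-PID samples into process.pid_count (matching
    processes) and process.thread_count (their total threads),
    keyed by process_name."""
    pid_count = defaultdict(int)
    thread_count = defaultdict(int)
    for name, pid, threads in procs:
        pid_count[name] += 1
        thread_count[name] += threads
    return dict(pid_count), dict(thread_count)

# Hypothetical samples: two nova-api workers and one nova-compute.
procs = [
    ("nova-api", 101, 4),
    ("nova-api", 102, 4),
    ("nova-compute", 201, 20),
]
pid_count, thread_count = summarize_processes(procs)
```

With the samples above, nova-api reports a pid_count of 2 and a thread_count of 8.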
13.1.4.17.1 process.cpu_perc, process.mem.rss_mbytes, process.pid_count and process.thread_count metrics #
Component Name | Dimensions | Description |
---|---|---|
apache-storm |
service=monitoring process_name=monasca-thresh process_user=storm | apache-storm process info: CPU percent, memory, PID count, and thread count |
barbican-api |
service=key-manager process_name=barbican-api | barbican-api process info: CPU percent, memory, PID count, and thread count |
ceilometer-agent-notification |
service=telemetry process_name=ceilometer-agent-notification | ceilometer-agent-notification process info: CPU percent, memory, PID count, and thread count |
ceilometer-polling |
service=telemetry process_name=ceilometer-polling | ceilometer-polling process info: CPU percent, memory, PID count, and thread count |
cinder-api |
service=block-storage process_name=cinder-api | cinder-api process info: CPU percent, memory, PID count, and thread count |
cinder-scheduler |
service=block-storage process_name=cinder-scheduler | cinder-scheduler process info: CPU percent, memory, PID count, and thread count |
designate-api |
service=dns process_name=designate-api | designate-api process info: CPU percent, memory, PID count, and thread count |
designate-central |
service=dns process_name=designate-central | designate-central process info: CPU percent, memory, PID count, and thread count |
designate-mdns |
service=dns process_name=designate-mdns | designate-mdns process info: CPU percent, memory, PID count, and thread count |
designate-pool-manager |
service=dns process_name=designate-pool-manager | designate-pool-manager process info: CPU percent, memory, PID count, and thread count |
heat-api |
service=orchestration process_name=heat-api | heat-api process info: CPU percent, memory, PID count, and thread count |
heat-api-cfn |
service=orchestration process_name=heat-api-cfn | heat-api-cfn process info: CPU percent, memory, PID count, and thread count |
heat-api-cloudwatch |
service=orchestration process_name=heat-api-cloudwatch | heat-api-cloudwatch process info: CPU percent, memory, PID count, and thread count |
heat-engine |
service=orchestration process_name=heat-engine | heat-engine process info: CPU percent, memory, PID count, and thread count |
ipsec/charon |
service=networking process_name=ipsec/charon | ipsec/charon process info: CPU percent, memory, PID count, and thread count |
keystone-admin |
service=identity-service process_name=keystone-admin | keystone-admin process info: CPU percent, memory, PID count, and thread count |
keystone-main |
service=identity-service process_name=keystone-main | keystone-main process info: CPU percent, memory, PID count, and thread count |
monasca-agent |
service=monitoring process_name=monasca-agent | monasca-agent process info: CPU percent, memory, PID count, and thread count |
monasca-api |
service=monitoring process_name=monasca-api | monasca-api process info: CPU percent, memory, PID count, and thread count |
monasca-notification |
service=monitoring process_name=monasca-notification | monasca-notification process info: CPU percent, memory, PID count, and thread count |
monasca-persister |
service=monitoring process_name=monasca-persister | monasca-persister process info: CPU percent, memory, PID count, and thread count |
monasca-transform |
service=monasca-transform process_name=monasca-transform | monasca-transform process info: CPU percent, memory, PID count, and thread count |
neutron-dhcp-agent |
service=networking process_name=neutron-dhcp-agent | neutron-dhcp-agent process info: CPU percent, memory, PID count, and thread count |
neutron-l3-agent |
service=networking process_name=neutron-l3-agent | neutron-l3-agent process info: CPU percent, memory, PID count, and thread count |
neutron-metadata-agent |
service=networking process_name=neutron-metadata-agent | neutron-metadata-agent process info: CPU percent, memory, PID count, and thread count |
neutron-openvswitch-agent |
service=networking process_name=neutron-openvswitch-agent | neutron-openvswitch-agent process info: CPU percent, memory, PID count, and thread count |
neutron-rootwrap |
service=networking process_name=neutron-rootwrap | neutron-rootwrap process info: CPU percent, memory, PID count, and thread count |
neutron-server |
service=networking process_name=neutron-server | neutron-server process info: CPU percent, memory, PID count, and thread count |
neutron-vpn-agent |
service=networking process_name=neutron-vpn-agent | neutron-vpn-agent process info: CPU percent, memory, PID count, and thread count |
nova-api |
service=compute process_name=nova-api | nova-api process info: CPU percent, memory, PID count, and thread count |
nova-compute |
service=compute process_name=nova-compute | nova-compute process info: CPU percent, memory, PID count, and thread count |
nova-conductor |
service=compute process_name=nova-conductor | nova-conductor process info: CPU percent, memory, PID count, and thread count |
nova-novncproxy |
service=compute process_name=nova-novncproxy | nova-novncproxy process info: CPU percent, memory, PID count, and thread count |
nova-scheduler |
service=compute process_name=nova-scheduler | nova-scheduler process info: CPU percent, memory, PID count, and thread count |
octavia-api |
service=octavia process_name=octavia-api | octavia-api process info: CPU percent, memory, PID count, and thread count |
octavia-health-manager |
service=octavia process_name=octavia-health-manager | octavia-health-manager process info: CPU percent, memory, PID count, and thread count |
octavia-housekeeping |
service=octavia process_name=octavia-housekeeping | octavia-housekeeping process info: CPU percent, memory, PID count, and thread count |
octavia-worker |
service=octavia process_name=octavia-worker | octavia-worker process info: CPU percent, memory, PID count, and thread count |
org.apache.spark.deploy.master.Master |
service=spark process_name=org.apache.spark.deploy.master.Master | org.apache.spark.deploy.master.Master process info: CPU percent, memory, PID count, and thread count |
org.apache.spark.executor.CoarseGrainedExecutorBackend |
service=monasca-transform process_name=org.apache.spark.executor.CoarseGrainedExecutorBackend | org.apache.spark.executor.CoarseGrainedExecutorBackend process info: CPU percent, memory, PID count, and thread count |
pyspark |
service=monasca-transform process_name=pyspark | pyspark process info: CPU percent, memory, PID count, and thread count |
transform/lib/driver |
service=monasca-transform process_name=transform/lib/driver | transform/lib/driver process info: CPU percent, memory, PID count, and thread count |
cassandra |
service=cassandra process_name=cassandra | cassandra process info: CPU percent, memory, PID count, and thread count |
13.1.4.17.2 process.io.*, process.open_file_descriptors metrics #
Component Name | Dimensions | Description |
---|---|---|
monasca-agent |
service=monitoring process_name=monasca-agent process_user=mon-agent | monasca-agent process info: number of reads, number of writes, and number of files being used |
13.1.4.18 RabbitMQ Metrics #
A list of metrics associated with the RabbitMQ service.
Metric Name | Dimensions | Description |
---|---|---|
rabbitmq.exchange.messages.published_count |
hostname exchange vhost type service=rabbitmq |
Value of the "publish_out" field of "message_stats" object |
rabbitmq.exchange.messages.published_rate |
hostname exchange vhost type service=rabbitmq |
Value of the "rate" field of "message_stats/publish_out_details" object |
rabbitmq.exchange.messages.received_count |
hostname exchange vhost type service=rabbitmq |
Value of the "publish_in" field of "message_stats" object |
rabbitmq.exchange.messages.received_rate |
hostname exchange vhost type service=rabbitmq |
Value of the "rate" field of "message_stats/publish_in_details" object |
rabbitmq.node.fd_used |
hostname node service=rabbitmq |
Value of the "fd_used" field in the response of /api/nodes |
rabbitmq.node.mem_used |
hostname node service=rabbitmq |
Value of the "mem_used" field in the response of /api/nodes |
rabbitmq.node.run_queue |
hostname node service=rabbitmq |
Value of the "run_queue" field in the response of /api/nodes |
rabbitmq.node.sockets_used |
hostname node service=rabbitmq |
Value of the "sockets_used" field in the response of /api/nodes |
rabbitmq.queue.messages |
hostname queue vhost service=rabbitmq |
Sum of ready and unacknowledged messages (queue depth) |
rabbitmq.queue.messages.deliver_rate |
hostname queue vhost service=rabbitmq |
Value of the "rate" field of "message_stats/deliver_details" object |
rabbitmq.queue.messages.publish_rate |
hostname queue vhost service=rabbitmq |
Value of the "rate" field of "message_stats/publish_details" object |
rabbitmq.queue.messages.redeliver_rate |
hostname queue vhost service=rabbitmq |
Value of the "rate" field of "message_stats/redeliver_details" object |
13.1.4.19 Swift Metrics #
A list of metrics associated with the swift service.
Metric Name | Dimensions | Description |
---|---|---|
swiftlm.access.host.operation.get.bytes |
service=object-storage |
This metric is the number of bytes read from objects in GET requests processed by this host during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included. |
swiftlm.access.host.operation.ops |
service=object-storage |
This metric is a count of all the API requests made to swift that were processed by this host during the last minute. |
swiftlm.access.host.operation.project.get.bytes | ||
swiftlm.access.host.operation.project.ops | ||
swiftlm.access.host.operation.project.put.bytes | ||
swiftlm.access.host.operation.put.bytes |
service=object-storage |
This metric is the number of bytes written to objects in PUT or POST requests processed by this host during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included. |
swiftlm.access.host.operation.status | ||
swiftlm.access.project.operation.status |
service=object-storage |
This metric reports whether the swiftlm-access-log-tailer program is running normally. |
swiftlm.access.project.operation.ops |
tenant_id service=object-storage |
This metric is a count of all the API requests for a given project ID made to swift that were processed by this host during the last minute. |
swiftlm.access.project.operation.get.bytes |
tenant_id service=object-storage |
This metric is the number of bytes read from objects in GET requests processed by this host for a given project during the last minute. Only successful GET requests to objects are counted. GET requests to the account or container are not included. |
swiftlm.access.project.operation.put.bytes |
tenant_id service=object-storage |
This metric is the number of bytes written to objects in PUT or POST requests processed by this host for a given project during the last minute. Only successful requests to objects are counted. Requests to the account or container are not included. |
swiftlm.async_pending.cp.total.queue_length |
observer_host service=object-storage |
This metric reports the total length of all async pending queues in the system. When a container update fails, the update is placed on the async pending queue. An update may fail because the container server is too busy or because the server is down or has failed. Later the system "replays" updates from the queue, so eventually the container listings will show all objects known to the system. If you know that container servers are down, it is normal to see the value of async pending increase. Once the server is restored, the value should return to zero. A non-zero value may also indicate that containers are too large. Look for "lock timeout" messages in /var/log/swift/swift.log. If you find such messages, consider reducing the container size or enabling rate limiting. |
swiftlm.check.failure |
check error component service=object-storage |
The full exception string is truncated if longer than 1919 characters, with an ellipsis replacing the first three characters of the message. If more than one error is reported, the list of errors is paired to the last reported error, and the operator is expected to resolve failures until no more are reported. When no further errors are reported, the Value Class is emitted as 'Ok'. |
swiftlm.diskusage.cp.avg.usage |
observer_host service=object-storage |
The average utilization of all drives in the system. The value is a percentage (for example, 30.0 means 30% of the total space is used). |
swiftlm.diskusage.cp.max.usage |
observer_host service=object-storage |
The highest utilization of all drives in the system. The value is a percentage (for example, 80.0 means at least one drive is 80% utilized). This value is just as important as swiftlm.diskusage.usage.avg. For example, if swiftlm.diskusage.usage.avg is 70%, you might think that there is plenty of space available. However, if swiftlm.diskusage.usage.max is 100%, some objects cannot be stored on that drive. swift will store replicas on other drives, but this creates extra overhead. |
swiftlm.diskusage.cp.min.usage |
observer_host service=object-storage |
The lowest utilization of all drives in the system. The value is a percentage (for example, 10.0 means at least one drive is 10% utilized). |
swiftlm.diskusage.cp.total.avail |
observer_host service=object-storage |
The size in bytes of the available (unused) space of all drives in the system. Only drives used by swift are included. |
swiftlm.diskusage.cp.total.size |
observer_host service=object-storage |
The raw size in bytes of all drives in the system. |
swiftlm.diskusage.cp.total.used |
observer_host service=object-storage |
The size in bytes of the used space of all drives in the system. Only drives used by swift are included. |
swiftlm.diskusage.host.avg.usage |
hostname service=object-storage |
This metric reports the average percent usage of all swift filesystems on a host. |
swiftlm.diskusage.host.max.usage |
hostname service=object-storage |
This metric reports the percent usage of a swift filesystem that is most used (full) on a host. The value is the max of the percentage used of all swift filesystems. |
swiftlm.diskusage.host.min.usage |
hostname service=object-storage |
This metric reports the percent usage of a swift filesystem that is least used (has free space) on a host. The value is the min of the percentage used of all swift filesystems. |
swiftlm.diskusage.host.val.avail |
hostname service=object-storage mount device label |
This metric reports the number of bytes available (free) in a swift filesystem. The value is an integer (units: Bytes) |
swiftlm.diskusage.host.val.size |
hostname service=object-storage mount device label |
This metric reports the size in bytes of a swift filesystem. The value is an integer (units: Bytes) |
swiftlm.diskusage.host.val.usage |
hostname service=object-storage mount device label |
This metric reports the percent usage of a swift filesystem. The value is a floating point number in range 0.0 to 100.0 |
swiftlm.diskusage.host.val.used |
hostname service=object-storage mount device label |
This metric reports the number of used bytes in a swift filesystem. The value is an integer (units: Bytes) |
swiftlm.load.cp.avg.five |
observer_host service=object-storage |
This is the averaged value of the five minutes system load average of all nodes in the swift system. |
swiftlm.load.cp.max.five |
observer_host service=object-storage |
This is the five minute load average of the busiest host in the swift system. |
swiftlm.load.cp.min.five |
observer_host service=object-storage |
This is the five minute load average of the least loaded host in the swift system. |
swiftlm.load.host.val.five |
hostname service=object-storage |
This metric reports the 5 minute load average of a host. The value is derived from /proc/loadavg. |
swiftlm.md5sum.cp.check.ring_checksums |
observer_host service=object-storage |
If you are in the middle of deploying new rings, it is normal for this to be in the failed state. However, if you are not in the middle of a deployment, you need to investigate the cause. Use swift-recon --md5 -v to identify the problem hosts. |
swiftlm.replication.cp.avg.account_duration |
observer_host service=object-storage |
This is the average time, across all servers, for the account replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds. |
swiftlm.replication.cp.avg.container_duration |
observer_host service=object-storage |
This is the average time, across all servers, for the container replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds. |
swiftlm.replication.cp.avg.object_duration |
observer_host service=object-storage |
This is the average time, across all servers, for the object replicator to complete a cycle. As the system becomes busy, the time to complete a cycle increases. The value is in seconds. |
swiftlm.replication.cp.max.account_last |
hostname path service=object-storage |
This is the number of seconds since the account replicator last completed a scan on the host with the oldest completion time. Normally the replicator runs periodically, so this value decreases whenever a replicator completes. However, if a replicator is not completing its cycle, this value increases (by one second for each second that the replicator fails to complete). If the value remains high and increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle. |
swiftlm.replication.cp.max.container_last |
hostname path service=object-storage |
This is the number of seconds since the container replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically, and hence this value will decrease whenever a replication cycle completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and keeps increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle. |
swiftlm.replication.cp.max.object_last |
hostname path service=object-storage |
This is the number of seconds since the object replicator last completed a scan on the host that has the oldest completion time. Normally the replicator runs periodically, and hence this value will decrease whenever a replication cycle completes. However, if a replicator is not completing a cycle, this value increases (by one second for each second that the replicator is not completing). If the value remains high and keeps increasing for a long period of time, it indicates that one of the hosts is not completing the replication cycle. |
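The way these last-completed metrics grow can be sketched as a simple staleness computation. The helper names and the four-hour threshold below are illustrative, not part of swiftlm:

```python
import time

def last_completed_metric(last_completion_ts, now=None):
    """Seconds since the replicator last completed a cycle.

    Grows by one for every second a stuck replicator fails to complete,
    mirroring the swiftlm.replication.cp.max.*_last behavior above.
    """
    now = time.time() if now is None else now
    return now - last_completion_ts

def is_stale(last_completion_ts, threshold_sec=4 * 3600, now=None):
    # Illustrative threshold: flag hosts whose replicator has not
    # completed a cycle within the threshold window.
    return last_completed_metric(last_completion_ts, now) > threshold_sec
```

With a fixed `now`, a host that last completed 4,000 seconds ago reports 4,000 and is flagged once it crosses the chosen threshold.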
swiftlm.swift.drive_audit |
hostname service=object-storage mount_point kernel_device |
If an unrecoverable read error (URE) occurs on a filesystem, the error is logged in the kernel log. The swift-drive-audit program scans the kernel log looking for patterns indicating possible UREs. To get more information, log onto the node in question and run: sudo swift-drive-audit /etc/swift/drive-audit.conf UREs are common on large disk drives. They do not necessarily indicate that the drive has failed. You can use the xfs_repair command to attempt to repair the filesystem. Failing this, you may need to wipe the filesystem. If UREs occur very often on a specific drive, this may indicate that the drive is about to fail and should be replaced. |
swiftlm.swift.file_ownership.config |
hostname path service |
This metric reports if a directory or file has the appropriate owner. The check looks at swift configuration directories and files. It also looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects). |
swiftlm.swift.file_ownership.data |
hostname path service |
This metric reports if a directory or file has the appropriate owner. The check looks at swift configuration directories and files. It also looks at the top-level directories of mounted file systems (for example, /srv/node/disk0 and /srv/node/disk0/objects). |
swiftlm.swiftlm_check |
hostname service=object-storage |
This indicates whether the swiftlm monitoring agent itself is running correctly. |
swiftlm.swift.replication.account.last_replication |
hostname service=object-storage |
This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad. |
swiftlm.swift.replication.container.last_replication |
hostname service=object-storage |
This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad. |
swiftlm.swift.replication.object.last_replication |
hostname service=object-storage |
This reports how long (in seconds) since the replicator process last finished a replication run. If the replicator is stuck, the time will keep increasing forever. The time a replicator normally takes depends on disk sizes and how much data needs to be replicated. However, a value over 24 hours is generally bad. |
swiftlm.swift.swift_services |
hostname service=object-storage |
This metric reports whether the process named in the component dimension is running; the value_meta.msg indicates whether the process is running or not. Use the component dimension to determine which swift process the metric refers to. |
swiftlm.swift.swift_services.check_ip_port |
hostname service=object-storage component | Reports whether a service is listening on the correct IP address and port. |
swiftlm.systems.check_mounts |
hostname service=object-storage mount device label |
This metric reports the mount state of each drive that should be mounted on this node. |
swiftlm.systems.connectivity.connect_check |
observer_host url target_port service=object-storage |
This metric reports whether a server can connect to the VIPs. Currently the following VIPs are checked:
|
swiftlm.systems.connectivity.memcache_check |
observer_host hostname target_port service=object-storage |
This metric reports if memcached on the host specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg are used:

We successfully connected to <hostname> on port <target_port>:

{
  "dimensions": {
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "11211"
  },
  "metric": "swiftlm.systems.connectivity.memcache_check",
  "timestamp": 1449084058,
  "value": 0,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:11211 ok"
  }
}

We failed to connect to <hostname> on port <target_port>:

{
  "dimensions": {
    "fail_message": "[Errno 111] Connection refused",
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "11211"
  },
  "metric": "swiftlm.systems.connectivity.memcache_check",
  "timestamp": 1449084150,
  "value": 2,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:11211 [Errno 111] Connection refused"
  }
} |
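The connect checks above can be approximated with a small sketch that attempts a TCP connection and emits a metric document shaped like the samples. The function name and timeout are illustrative assumptions:

```python
import socket
import time

def connect_check(hostname, port, observer_host,
                  metric="swiftlm.systems.connectivity.memcache_check"):
    """Try a TCP connect and build a metric like the samples above.

    value 0 means the connect succeeded; value 2 means it failed,
    matching the ok/fail values shown in the sample documents.
    """
    dimensions = {
        "hostname": hostname,
        "observer_host": observer_host,
        "service": "object-storage",
        "target_port": str(port),
    }
    try:
        with socket.create_connection((hostname, port), timeout=2):
            value, msg = 0, "%s:%s ok" % (hostname, port)
    except OSError as exc:
        value, msg = 2, "%s:%s %s" % (hostname, port, exc)
        dimensions["fail_message"] = str(exc)
    return {
        "metric": metric,
        "dimensions": dimensions,
        "timestamp": int(time.time()),
        "value": value,
        "value_meta": {"msg": msg},
    }
```

The same shape works for the rsync check below by passing port 873 and the rsync_check metric name.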
swiftlm.systems.connectivity.rsync_check |
observer_host hostname target_port service=object-storage |
This metric reports if rsyncd on the host specified by the hostname dimension is accepting connections from the host running the check. The following value_meta.msg are used:

We successfully connected to <hostname> on port <target_port>:

{
  "dimensions": {
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "873"
  },
  "metric": "swiftlm.systems.connectivity.rsync_check",
  "timestamp": 1449082663,
  "value": 0,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:873 ok"
  }
}

We failed to connect to <hostname> on port <target_port>:

{
  "dimensions": {
    "fail_message": "[Errno 111] Connection refused",
    "hostname": "ardana-ccp-c1-m1-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "target_port": "873"
  },
  "metric": "swiftlm.systems.connectivity.rsync_check",
  "timestamp": 1449082860,
  "value": 2,
  "value_meta": {
    "msg": "ardana-ccp-c1-m1-mgmt:873 [Errno 111] Connection refused"
  }
} |
swiftlm.umon.target.avg.latency_sec |
component hostname observer_host service=object-storage url |
Reports the average value of N-iterations of the latency values recorded for a component. |
swiftlm.umon.target.check.state |
component hostname observer_host service=object-storage url |
This metric reports the state of each component after N iterations of checks. If the initial check succeeds, the checks move on to the next component until all components are queried, then the checks sleep for 'main_loop_interval' seconds. If a check fails, it is retried every second, up to 'retries' times per component. If the check fails 'retries' times, it is reported as a fail instance.

A successful state will be reported in JSON:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.check.state",
  "timestamp": 1453111805,
  "value": 0
}

A failed state will report a "fail" value, and the value_meta will provide the HTTP response error:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.check.state",
  "timestamp": 1453112841,
  "value": 2,
  "value_meta": {
    "msg": "HTTPConnectionPool(host='192.168.245.9', port=8080): Max retries exceeded with url: /v1/AUTH_76538ce683654a35983b62e333001b47 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd857d7f550>: Failed to establish a new connection: [Errno 110] Connection timed out',))"
  }
} |
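The retry behavior described above can be sketched as follows; the `probe` callable, the zero-second default interval, and the 0/2 return codes (matching the ok/fail values in the samples) are illustrative:

```python
import time

def check_component(probe, retries=3, interval_sec=0):
    """Retry semantics from the text: after an initial failure,
    re-probe up to `retries` times, then report a fail instance.

    `probe` is any zero-argument callable returning True on success.
    `interval_sec` models the one-second wait between retries
    (0 here so the sketch runs instantly).
    """
    if probe():
        return 0          # ok on the first attempt
    for _ in range(retries):
        time.sleep(interval_sec)
        if probe():
            return 0      # recovered during a retry
    return 2              # failed `retries` times: fail instance
```

A component that recovers on the second retry still reports success; only a component that stays down through every retry is reported as failed.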
swiftlm.umon.target.max.latency_sec |
component hostname observer_host service=object-storage url |
This metric reports the maximum response time in seconds of a REST call from the observer to the component REST API listening on the reported host. A response time query will be reported in JSON:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.max.latency_sec",
  "timestamp": 1453111805,
  "value": 0.2772650718688965
}

A failed query will have a much longer time value:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.max.latency_sec",
  "timestamp": 1453112841,
  "value": 127.288015127182
} |
swiftlm.umon.target.min.latency_sec |
component hostname observer_host service=object-storage url |
This metric reports the minimum response time in seconds of a REST call from the observer to the component REST API listening on the reported host. A response time query will be reported in JSON:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.min.latency_sec",
  "timestamp": 1453111805,
  "value": 0.10025882720947266
}

A failed query will have a much longer time value:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.min.latency_sec",
  "timestamp": 1453112841,
  "value": 127.25378203392029
} |
swiftlm.umon.target.val.avail_day |
component hostname observer_host service=object-storage url |
This metric reports the average of all the collected records in the swiftlm.umon.target.val.avail_minute metric data. This is a walking average of the approximately per-minute states of the swift Object Store. The most basic case is a whole day of successful per-minute records, which averages to 100% availability. If there is any downtime during the day resulting in gaps of data two minutes or longer, the per-minute availability data is "back filled" with an assumed down state for all the per-minute records that do not exist during the non-reported time. Because this is a walking average of approximately 24 hours worth of data, any outage takes 24 hours to be purged from the dataset. A 24-hour average availability report:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.val.avail_day",
  "timestamp": 1453645405,
  "value": 7.894736842105263
} |
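A simplified reading of the walking average with back-filled gaps might look like this; the 1,440-minute window handling and the function name are assumptions for illustration:

```python
def avail_day(minute_records, window_minutes=1440):
    """24-hour walking average of per-minute availability records.

    `minute_records` maps a minute index to an availability value
    (100.0 for an "up" minute, 0.0 for a "down" minute). Minutes
    missing from the map (reporting gaps) are back-filled as down,
    as described above.
    """
    if not minute_records:
        return 0.0
    end = max(minute_records)
    start = end - window_minutes + 1
    # Back-fill: any minute without a record counts as 0.0 (down).
    values = [minute_records.get(m, 0.0) for m in range(start, end + 1)]
    return sum(values) / len(values)
```

A full day of 100.0 records averages to 100%; deleting a 144-minute stretch (a tenth of the day) from the records drops the average to 90%, because the gap is back-filled as downtime.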
swiftlm.umon.target.val.avail_minute |
component hostname observer_host service=object-storage url |
A value of 100 indicates that swift-uptime-monitor was able to get a token from keystone and was able to perform operations against the swift API during the reported minute. A value of zero indicates that either keystone or swift failed to respond successfully. A metric is produced every minute that swift-uptime-monitor is running.

An "up" minute will report a value of 100 [percent]:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.val.avail_minute",
  "timestamp": 1453645405,
  "value": 100.0
}

A "down" minute will report a value of 0 [percent]:

{
  "dimensions": {
    "component": "rest-api",
    "hostname": "ardana-ccp-vip-admin-SWF-PRX-mgmt",
    "observer_host": "ardana-ccp-c1-m1-mgmt",
    "service": "object-storage",
    "url": "http://ardana-ccp-vip-admin-SWF-PRX-mgmt:8080"
  },
  "metric": "swiftlm.umon.target.val.avail_minute",
  "timestamp": 1453649139,
  "value": 0.0
} |
swiftlm.hp_hardware.hpssacli.smart_array.firmware |
hostname service=object-storage component model controller_slot |
This metric reports the firmware version of a component of a Smart Array controller. |
swiftlm.hp_hardware.hpssacli.smart_array |
hostname service=object-storage component sub_component model controller_slot |
This reports the status of various sub-components of a Smart Array controller. A failure is considered to have occurred if:
|
swiftlm.hp_hardware.hpssacli.physical_drive |
hostname service=object-storage component controller_slot box bay |
This reports the status of a disk drive attached to a Smart Array controller. |
swiftlm.hp_hardware.hpssacli.logical_drive |
component hostname observer_host service=object-storage controller_slot array logical_drive sub_component |
This reports the status of a LUN presented by a Smart Array controller. A LUN is considered failed if the LUN has failed or if the LUN cache is not enabled and working. |
The HPE Smart Storage Administrator (HPE SSA) CLI component must be installed on all control nodes that are swift nodes in order to generate the following swift metrics:
swiftlm.hp_hardware.hpssacli.smart_array
swiftlm.hp_hardware.hpssacli.logical_drive
swiftlm.hp_hardware.hpssacli.smart_array.firmware
swiftlm.hp_hardware.hpssacli.physical_drive
HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f
After the HPE SSA CLI component is installed on the swift nodes, the metrics will be generated automatically during the next agent polling cycle. Manual reboot of the node is not required.
13.1.4.20 System Metrics #
A list of metrics associated with the System.
Metric Name | Dimensions | Description |
---|---|---|
cpu.frequency_mhz |
cluster hostname service=system |
Maximum MHz value for the CPU frequency. Note: This value is dynamic and driven by the CPU governor depending on current resource needs. |
cpu.idle_perc |
cluster hostname service=system |
Percentage of time the CPU is idle when no I/O requests are in progress |
cpu.idle_time |
cluster hostname service=system |
Time the CPU is idle when no I/O requests are in progress |
cpu.percent |
cluster hostname service=system |
Percentage of time the CPU is used in total |
cpu.stolen_perc |
cluster hostname service=system |
Percentage of stolen CPU time, that is, the time spent in other OS contexts when running in a virtualized environment |
cpu.system_perc |
cluster hostname service=system |
Percentage of time the CPU is used at the system level |
cpu.system_time |
cluster hostname service=system |
Time the CPU is used at the system level |
cpu.time_ns |
cluster hostname service=system |
Time the CPU is used at the host level |
cpu.total_logical_cores |
cluster hostname service=system |
Total number of logical cores available for an entire node (includes hyper-threading). Note: This is an optional metric that is only sent when send_rollup_stats is set to true. |
cpu.user_perc |
cluster hostname service=system |
Percentage of time the CPU is used at the user level |
cpu.user_time |
cluster hostname service=system |
Time the CPU is used at the user level |
cpu.wait_perc |
cluster hostname service=system |
Percentage of time the CPU is idle AND there is at least one I/O request in progress |
cpu.wait_time |
cluster hostname service=system |
Time the CPU is idle AND there is at least one I/O request in progress |
Metric Name | Dimensions | Description |
---|---|---|
disk.inode_used_perc |
mount_point service=system hostname cluster device |
The percentage of inodes that are used on a device |
disk.space_used_perc |
mount_point service=system hostname cluster device |
The percentage of disk space that is being used on a device |
disk.total_space_mb |
mount_point service=system hostname cluster device |
The total amount of disk space in Mbytes aggregated across all the disks on a particular node. Note: This is an optional metric that is only sent when send_rollup_stats is set to true. |
disk.total_used_space_mb |
mount_point service=system hostname cluster device |
The total amount of used disk space in Mbytes aggregated across all the disks on a particular node. Note: This is an optional metric that is only sent when send_rollup_stats is set to true. |
io.read_kbytes_sec |
mount_point service=system hostname cluster device |
Kbytes/sec read by an io device |
io.read_req_sec |
mount_point service=system hostname cluster device |
Number of read requests/sec to an io device |
io.read_time_sec |
mount_point service=system hostname cluster device |
Amount of read time in seconds to an io device |
io.write_kbytes_sec |
mount_point service=system hostname cluster device |
Kbytes/sec written by an io device |
io.write_req_sec |
mount_point service=system hostname cluster device |
Number of write requests/sec to an io device |
io.write_time_sec |
mount_point service=system hostname cluster device |
Amount of write time in seconds to an io device |
Metric Name | Dimensions | Description |
---|---|---|
load.avg_15_min |
service=system hostname cluster |
The normalized (by number of logical cores) average system load over a 15 minute period |
load.avg_1_min |
service=system hostname cluster |
The normalized (by number of logical cores) average system load over a 1 minute period |
load.avg_5_min |
service=system hostname cluster |
The normalized (by number of logical cores) average system load over a 5 minute period |
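The normalization these load metrics describe (raw load average divided by the number of logical cores) can be sketched with Python's standard library:

```python
import os

def normalized_load_averages():
    """Load averages divided by the logical core count, as reported
    by load.avg_1_min, load.avg_5_min, and load.avg_15_min.

    A normalized value of 1.0 means the run queue exactly saturates
    the available cores over that window.
    """
    cores = os.cpu_count() or 1          # guard against None
    one, five, fifteen = os.getloadavg() # 1-, 5-, 15-minute averages
    return one / cores, five / cores, fifteen / cores
```

`os.getloadavg()` is available on Unix-like systems, which is where these agents run.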
Metric Name | Dimensions | Description |
---|---|---|
mem.free_mb |
service=system hostname cluster |
Mbytes of free memory |
mem.swap_free_mb |
service=system hostname cluster |
Mbytes of swap memory that is free |
mem.swap_free_perc |
service=system hostname cluster |
Percentage of swap memory that is free |
mem.swap_total_mb |
service=system hostname cluster |
Mbytes of total physical swap memory |
mem.swap_used_mb |
service=system hostname cluster |
Mbytes of total swap memory used |
mem.total_mb |
service=system hostname cluster |
Total Mbytes of memory |
mem.usable_mb |
service=system hostname cluster |
Total Mbytes of usable memory |
mem.usable_perc |
service=system hostname cluster |
Percentage of total memory that is usable |
mem.used_buffers |
service=system hostname cluster |
Mbytes of memory used by the kernel for block I/O buffers |
mem.used_cache |
service=system hostname cluster |
Mbytes of memory used for the page cache |
mem.used_mb |
service=system hostname cluster |
Total Mbytes of used memory |
Metric Name | Dimensions | Description |
---|---|---|
net.in_bytes_sec |
service=system hostname device |
Number of network bytes received per second |
net.in_errors_sec |
service=system hostname device |
Number of network errors on incoming network traffic per second |
net.in_packets_dropped_sec |
service=system hostname device |
Number of inbound network packets dropped per second |
net.in_packets_sec |
service=system hostname device |
Number of network packets received per second |
net.out_bytes_sec |
service=system hostname device |
Number of network bytes sent per second |
net.out_errors_sec |
service=system hostname device |
Number of network errors on outgoing network traffic per second |
net.out_packets_dropped_sec |
service=system hostname device |
Number of outbound network packets dropped per second |
net.out_packets_sec |
service=system hostname device |
Number of network packets sent per second |
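The net.* metrics are per-second rates derived from cumulative interface counters (such as those in /proc/net/dev). A minimal sketch of that derivation, with illustrative counter names:

```python
def per_second_rates(prev_counters, curr_counters, interval_sec):
    """Turn two samples of cumulative interface counters into the
    per-second rates reported by the net.* metrics.

    `prev_counters` and `curr_counters` map counter name -> cumulative
    value at two sample times `interval_sec` seconds apart.
    """
    if interval_sec <= 0:
        raise ValueError("interval must be positive")
    return {
        name: (curr_counters[name] - prev_counters[name]) / interval_sec
        for name in curr_counters
    }
```

For example, 60,000 bytes received over a 60-second polling interval yields an in_bytes rate of 1,000 bytes per second.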
13.1.4.21 Zookeeper Metrics #
A list of metrics associated with the Zookeeper service.
Metric Name | Dimensions | Description |
---|---|---|
zookeeper.avg_latency_sec |
hostname mode service=zookeeper | Average latency in seconds |
zookeeper.connections_count |
hostname mode service=zookeeper | Number of connections |
zookeeper.in_bytes |
hostname mode service=zookeeper | Received bytes |
zookeeper.max_latency_sec |
hostname mode service=zookeeper | Maximum latency in seconds |
zookeeper.min_latency_sec |
hostname mode service=zookeeper | Minimum latency in seconds |
zookeeper.node_count |
hostname mode service=zookeeper | Number of nodes |
zookeeper.out_bytes |
hostname mode service=zookeeper | Sent bytes |
zookeeper.outstanding_bytes |
hostname mode service=zookeeper | Outstanding bytes |
zookeeper.zxid_count |
hostname mode service=zookeeper | Count portion of the zxid (ZooKeeper transaction ID) |
zookeeper.zxid_epoch |
hostname mode service=zookeeper | Epoch portion of the zxid (ZooKeeper transaction ID) |
13.2 Centralized Logging Service #
You can use the Centralized Logging Service to evaluate and troubleshoot your distributed cloud environment from a single location.
13.2.1 Getting Started with Centralized Logging Service #
A typical cloud consists of multiple servers which makes locating a specific log from a single server difficult. The Centralized Logging feature helps the administrator evaluate and troubleshoot the distributed cloud deployment from a single location.
The Logging API is a component in the centralized logging architecture. It works between log producers and log storage. In most cases it works by default after installation with no additional configuration. To use Logging API with logging-as-a-service, you must configure an end-point. This component adds flexibility and supportability for features in the future.
Do I need to configure monasca-log-api? If you are only using Cloud Lifecycle Manager, then the default configuration is ready to use.
If you are using logging in any of the following deployments, then you will need to query keystone to get an end-point to use.
Logging as a Service
Platform as a Service
The Logging API is protected by keystone’s role-based access control. To ensure that logging is allowed and monasca alarms can be triggered, the user must have the monasca-user role. To get an end-point from keystone:
Log on to Cloud Lifecycle Manager (deployer node).
To list the Identity service catalog, run:
ardana > source ./service.osrc
ardana > openstack catalog list
In the output, find kronos. For example:
Name | Type | Endpoints |
---|---|---|
kronos | region0 | public: http://myardana.test:5607/v3.0, admin: http://192.168.245.5:5607/v3.0, internal: http://192.168.245.5:5607/v3.0 |
Use the same port number as found in the output. In the example, you would use port 5607.
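Extracting the port from a catalog endpoint URL can also be done programmatically; a small sketch using Python's standard library:

```python
from urllib.parse import urlsplit

def endpoint_port(endpoint_url):
    """Pull the port out of a catalog endpoint URL, as shown in the
    `openstack catalog list` output above."""
    parts = urlsplit(endpoint_url)
    if parts.port is None:
        raise ValueError("no explicit port in %r" % endpoint_url)
    return parts.port
```

Applied to the example endpoint http://myardana.test:5607/v3.0, this returns 5607, the port to use for the Logging API.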
In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.
It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.
For more information, see Section 13.2.4, “Managing the Centralized Logging Feature”.
13.2.1.1 For More Information #
For more information about the centralized logging components, see the following sites:
13.2.2 Understanding the Centralized Logging Service #
The Centralized Logging feature collects logs on a central system, rather than leaving the logs scattered across the network. The administrator can use a single Kibana interface to view log information in charts, graphs, tables, histograms, and other forms.
13.2.2.1 What Components are Part of Centralized Logging? #
Centralized logging consists of several components, detailed below:
Administrator's Browser: Operations Console can be used to access logging alarms or to access Kibana's dashboards to review logging data.
Apache Website for Kibana: A standard Apache website that proxies web/REST requests to the Kibana NodeJS server.
Beaver: A Python daemon that collects information in log files and sends it to the Logging API (monasca-log API) over a secure connection.
Cloud Auditing Data Federation (CADF): Defines a standard, full-event model anyone can use to fill in the essential data needed to certify, self-manage and self-audit application security in cloud environments.
Centralized Logging and Monitoring (CLM): Used to evaluate and troubleshoot your SUSE OpenStack Cloud distributed cloud environment from a single location.
Curator: a tool provided by Elasticsearch to manage indices.
Elasticsearch: A data store offering fast indexing and querying.
SUSE OpenStack Cloud: Provides public, private, and managed cloud solutions to get you moving on your cloud journey.
JavaScript Object Notation (JSON) log file: A file stored in the JSON format and used to exchange data. JSON uses JavaScript syntax, but the JSON format is text only. Text can be read and used as a data format by any programming language. This format is used by the Beaver and Logstash components.
Kafka: A messaging broker used for collection of SUSE OpenStack Cloud centralized logging data across nodes. It is highly available, scalable and performant. Kafka stores logs in disk instead of memory and is therefore more tolerant to consumer down times.
Important: Make sure not to undersize your Kafka partition or the data retention period may be lower than expected. If the Kafka partition capacity is lower than 85%, the retention period will increase to 30 minutes. Over time Kafka will also eject old data.
Kibana: A client/server application with rich dashboards to visualize the data in Elasticsearch through a web browser. Kibana enables you to create charts and graphs using the log data.
Logging API (monasca-log-api): SUSE OpenStack Cloud API provides a standard REST interface to store logs. It uses keystone authentication and role-based access control support.
Logstash: A log processing system for receiving, processing and outputting logs. Logstash retrieves logs from Kafka, processes and enriches the data, then stores the data in Elasticsearch.
MML Service Node: Metering, Monitoring, and Logging (MML) service node. All services associated with metering, monitoring, and logging run on a dedicated three-node cluster. Three nodes are required for high availability with quorum.
Monasca: OpenStack monitoring at scale infrastructure for the cloud that supports alarms and reporting.
OpenStack Service: An OpenStack service process that requires logging services.
Oslo.log: An OpenStack library for log handling.
Platform as a Service (PaaS): A platform that automates configuration, deployment and scaling of complete, ready-for-work application platforms. Some PaaS solutions, such as Cloud Foundry, combine operating systems, containers, and orchestrators with developer tools, operations utilities, metrics, and security to create a developer-rich solution.
Text log: A type of file used in the logging process that contains human-readable records.
These components are configured to work out-of-the-box and the admin should be able to view log data using the default configurations.
In addition to each of the services, Centralized Logging also processes logs for the following features:
HAProxy
Syslog
keepalived
The purpose of the logging service is to provide a common logging infrastructure with centralized user access. Since there are numerous services and applications running on each node of a SUSE OpenStack Cloud deployment, and there can be hundreds of nodes, all of these services and applications can generate enough log files to make it very difficult to search for specific events in log files across all of the nodes. Centralized Logging addresses this issue by sending log messages in real time to a central Elasticsearch, Logstash, and Kibana cluster, where they are indexed and organized for easier and visual searches. The following illustration describes the architecture used to collect operational logs.
The arrows come from the active (requesting) side to the passive (listening) side. The active side is always the one providing credentials, so the arrows may also be seen as coming from the credential holder to the application requiring authentication.
13.2.2.2 Steps 1- 2 #
Services configured to generate log files record the data. Beaver listens for changes to the files and sends the log files to the Logging Service. The first step the Logging service takes is to re-format the original log file to a new log file with text only and to remove all network operations. In Step 1a, the Logging service uses the Oslo.log library to re-format the file to text-only. In Step 1b, the Logging service uses the Python-Logstash library to format the original audit log file to a JSON file.
- Step 1a
Beaver watches configured service operational log files for changes and reads incremental log changes from the files.
- Step 1b
Beaver watches configured service audit log files for changes and reads incremental log changes from the files.
- Step 2a
The monascalog transport of Beaver makes a token request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.
- Step 2b
The monascalog transport of Beaver batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection. Failure logs are written to the local Beaver log.
- Step 2c
The REST API client for monasca-log-api makes a token-request call to keystone passing in credentials. The token returned is cached to avoid multiple network round-trips.
- Step 2d
The REST API client for monasca-log-api batches multiple logs (operational or audit) and posts them to the monasca-log-api VIP over a secure connection.
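The batching described in steps 2b and 2d can be sketched as follows; the request body layout here is an assumption for illustration, not the documented monasca-log-api wire format:

```python
import json

def build_log_batch(log_lines, dimensions):
    """Group multiple log lines into one request body, mirroring the
    batching behavior of Beaver and the monasca-log-api REST client.

    The body shape (shared dimensions plus a list of log messages)
    is an illustrative assumption.
    """
    return json.dumps({
        "dimensions": dimensions,
        "logs": [{"message": line} for line in log_lines],
    })
```

Batching many lines into one POST is what keeps the per-log network overhead low; the cached keystone token from steps 2a/2c is then attached to each request rather than re-fetched per log line.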
13.2.2.3 Steps 3a- 3b #
The Logging API (monasca-log API) communicates with keystone to validate the incoming request, and then sends the logs to Kafka.
- Step 3a
The monasca-log-api WSGI pipeline is configured to validate incoming request tokens with keystone. The keystone middleware used for this purpose is configured to use the monasca-log-api admin user, password and project that have the required keystone role to validate a token.
- Step 3b
monasca-log-api sends log messages to Kafka using a language-agnostic TCP protocol.
13.2.2.4 Steps 4- 8 #
Logstash pulls messages from Kafka, identifies the log type, and transforms the messages into either the audit log format or operational format. Then Logstash sends the messages to Elasticsearch, using either an audit or operational indices.
- Step 4
Logstash input workers pull log messages from the Kafka-Logstash topic using TCP.
- Step 5
This Logstash filter processes the log message in-memory in the request pipeline and identifies the log type from a field in the message.
- Step 6
This Logstash filter processes the log message in-memory in the request pipeline. If the message is of audit-log type, Logstash transforms it from the monasca-log-api envelope format to the original CADF format.
- Step 7
This Logstash filter determines which index should receive the log message. There are separate indices in Elasticsearch for operational versus audit logs.
- Step 8
Logstash output workers write the messages read from Kafka to the daily index in the local Elasticsearch instance.
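The index routing in steps 5-7 amounts to choosing a daily index based on the log type; the field name and index name patterns below are illustrative assumptions, not the actual Logstash configuration:

```python
def target_index(message, day):
    """Route a log message to the audit or operational daily index,
    as in steps 5-7.

    `message` is a decoded log document; `day` is the date string
    embedded in the daily index name. The "type" field and the
    index prefixes are assumptions for illustration.
    """
    if message.get("type") == "audit":
        return "audit-%s" % day       # separate index for audit logs
    return "logstash-%s" % day        # default: operational index
```

Keeping audit and operational logs in separate indices lets retention and access control be applied to each independently.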
13.2.2.5 Steps 9- 12 #
When an administrator who has access to the guest network accesses the Kibana client and makes a request, Apache forwards the request to the Kibana NodeJS server. Then the server uses the Elasticsearch REST API to service the client requests.
- Step 9
An administrator who has access to the guest network accesses the Kibana client to view and search log data. The request can originate from the external network in the cloud through a tenant that has a pre-defined access route to the guest network.
- Step 10
An administrator who has access to the guest network uses a web browser and points to the Kibana URL. This allows the user to search logs and view Dashboard reports.
- Step 11
The authenticated request is forwarded to the Kibana NodeJS server to render the required dashboard, visualization, or search page.
- Step 12
The Kibana NodeJS web server uses the Elasticsearch REST API in localhost to service the UI requests.
13.2.2.6 Steps 13 - 15 #
Log data is backed-up and deleted in the final steps.
- Step 13
A daily cron job running in the ELK node runs curator to prune old Elasticsearch log indices.
- Step 14
The curator configuration is done at the deployer node through the Ansible role logging-common. Curator is scripted to then prune or clone old indices based on this configuration.
- Step 15
The audit logs must be backed up manually. For more information about Backup and Recovery, see Chapter 17, Backup and Restore.
13.2.2.7 How Long are Log Files Retained? #
The logs that are centrally stored are saved to persistent storage as
Elasticsearch indices. These indices are stored in the partition
/var/lib/elasticsearch
on each of the Elasticsearch
cluster nodes. Out of the box, logs are stored in one Elasticsearch index
per service. As more days go by, the number of indices stored in this disk
partition grows. Eventually the partition fills up. If they are
open, each of these indices takes up CPU
and memory. If these indices are left unattended they will continue to
consume system resources and eventually deplete them.
Elasticsearch, by itself, does not prevent this from happening.
SUSE OpenStack Cloud uses a tool called curator, developed by the Elasticsearch community, to handle these situations. SUSE OpenStack Cloud installs curator together with several configurable settings. curator is called by cron and performs the following checks:
First Check. The hourly cron job checks to see if the currently used Elasticsearch partition size is over the value set in:
curator_low_watermark_percent
If it is higher than this value, the curator deletes old indices according to the value set in:
curator_num_of_indices_to_keep
Second Check. Another check verifies whether the partition size is below the high watermark percent. If it is still too high, curator deletes all indices, except the current one, that exceed the size set in:
curator_max_index_size_in_gb
Third Check. A third check verifies if the partition size is still too high. If it is, curator will delete all indices except the current one.
Final Check. A final check verifies if the partition size is still high. If it is, an error message is written to the log file but the current index is NOT deleted.
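The cascade of checks above can be sketched in shell. This is a simplified illustration, not the shipped cron script, and the watermark value shown is an assumed example; the real value comes from curator_low_watermark_percent in the logging configuration.

```shell
#!/bin/sh
# Illustrative sketch of the hourly curator check cascade.
curator_low_watermark_percent=80   # assumed example value

# Print the used percentage of the filesystem holding $1, as an integer.
usage_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Decide whether curator needs to prune indices on this partition.
check_and_prune() {
    used=$(usage_percent "$1")
    if [ "$used" -gt "$curator_low_watermark_percent" ]; then
        # First check: delete indices beyond curator_num_of_indices_to_keep.
        # If still over the high watermark: delete indices larger than
        # curator_max_index_size_in_gb, then all but the current index,
        # and finally log an error without touching the current index.
        echo "prune"
    else
        echo "ok"
    fi
}

check_and_prune /    # "/" is used here only for illustration
```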
In the case of an extreme network issue, log files can run out of disk space in under an hour. To avoid this, SUSE OpenStack Cloud uses a shell script called logrotate_if_needed.sh. The cron process runs this script every 5 minutes to check whether the size of /var/log has exceeded the high_watermark_percent (95% of the disk, by default). If it is at or above this level, logrotate_if_needed.sh runs the logrotate script to rotate logs and free up extra space. This script helps to minimize the chance of running out of disk space on /var/log.
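The check performed by logrotate_if_needed.sh can be sketched as follows. This is an illustrative reconstruction based on the description above, not the shipped script; the real script invokes logrotate at this point rather than printing a message.

```shell
#!/bin/sh
# Illustrative sketch of the logrotate_if_needed.sh check. Cron runs the
# real script every 5 minutes.
high_watermark_percent=95   # default per the text above

# Used percentage of the filesystem holding /var/log, as an integer.
used=$(df -P /var/log | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if [ "$used" -ge "$high_watermark_percent" ]; then
    # The real script runs logrotate here to rotate logs and free space.
    echo "would run logrotate to free space on /var/log"
else
    echo "/var/log is at ${used}%, below the watermark"
fi
```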
13.2.2.8 How Are Logs Rotated? #
SUSE OpenStack Cloud uses the cron process, which in turn calls Logrotate, to provide rotation, compression, and removal of log files. Each log file can be rotated hourly, daily, weekly, or monthly. If no rotation period is set, the log file is only rotated when it grows too large.
Rotating a file means that the Logrotate process creates a copy of the log file with a new extension, for example the .1 extension, and then empties the contents of the original file. If a .1 file already exists, that file is first renamed with a .2 extension. If a .2 file already exists, it is renamed to .3, and so on, up to the maximum number of rotated files specified in the settings file. When Logrotate reaches the maximum extension, it deletes the oldest rotated file on the next rotation. By the time the Logrotate process needs to delete a file, the file's contents will already have been copied to Elasticsearch, the central logging database.
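The renaming scheme described above can be demonstrated with a small shell sketch. rotate_once is a hypothetical helper written for illustration only; in SUSE OpenStack Cloud the actual rotation is performed by logrotate.

```shell
#!/bin/sh
# Demonstration of the rotation scheme: .1 is renamed to .2, and so on,
# then the live log is copied to .1 and truncated. Illustrative only.
rotate_once() {
    log=$1
    keep=$2                     # maximum number of rotated files to keep
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -f "$log.$prev" ]; then
            mv "$log.$prev" "$log.$i"
        fi
        i=$prev
    done
    cp "$log" "$log.1"
    : > "$log"                  # empty the original file
}

# Rotate a scratch log file twice to show the renaming chain.
tmp=$(mktemp -d)
echo "first entry"  > "$tmp/app.log"
rotate_once "$tmp/app.log" 3
echo "second entry" > "$tmp/app.log"
rotate_once "$tmp/app.log" 3
ls "$tmp"   # app.log (now empty), app.log.1 (second entry), app.log.2 (first entry)
```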
The log rotation setting files can be found in the following directory:
~/scratch/ansible/next/ardana/ansible/roles/logging-common/vars
These files allow you to set the following options:
- Service
The name of the service that creates the log entries.
- Rotated Log Files
List of log files to be rotated. These files are kept locally on the server and will continue to be rotated. If the file is also listed as Centrally Logged, it will also be copied to Elasticsearch.
- Frequency
The timing of when the logs are rotated. Options include: hourly, daily, weekly, or monthly.
- Max Size
The maximum file size the log can be before it is rotated out.
- Rotation
The number of log files that are rotated.
- Centrally Logged Files
These files will be indexed by Elasticsearch and will be available for searching in the Kibana user interface.
Only files that are listed in the Centrally Logged Files section are copied to Elasticsearch.
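As an illustration, an entry in one of these vars files might look like the following. This is a hypothetical sketch: the key names simply mirror the options listed above, and the actual files shipped with your release may use different keys and values.

```yaml
# Hypothetical example; key names mirror the options described above.
- service: nova
  rotated_log_files:
    - /var/log/nova/nova-api.log
    - /var/log/nova/nova-compute.log
  frequency: daily
  max_size: 45M
  rotation: 7
  centrally_logged_files:
    - /var/log/nova/nova-api.log
```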
All of the variables for the Logrotate process are found in the following file:
~/scratch/ansible/next/ardana/ansible/roles/logging-ansible/logging-common/defaults/main.yml
Cron runs Logrotate hourly. Every 5 minutes, another process called "logrotate_if_needed" is run, which uses a watermark value to determine whether the Logrotate process needs to run. If the "high watermark" has been reached, that is, if the /var/log partition is more than 95% full (by default; this can be adjusted), then Logrotate is run within 5 minutes.
13.2.2.9 Are Log Files Backed-Up To Elasticsearch? #
While centralized logging is enabled out of the box, the backup of these logs is not. This is because Centralized Logging relies on the Elasticsearch FileSystem Repository plugin, which in turn requires shared disk partitions to be configured and accessible from each of the Elasticsearch nodes. Since there are multiple ways to set up a shared disk partition, SUSE OpenStack Cloud allows you to choose an approach that works best for your deployment before enabling the backup of log files to Elasticsearch.
If you enable automatic back-up of centralized log files, all the logs collected from the cloud nodes are backed up to Elasticsearch. Every hour, on the management controller nodes where Elasticsearch is set up, a cron job runs to check whether Elasticsearch is running low on disk space. If so, it further checks whether the backup feature is enabled. If it is, the cron job uses curator to save a snapshot of the Elasticsearch indices to the configured shared disk partition. Next, the script deletes the oldest index and moves down from there, checking each time whether there is enough space for Elasticsearch. A check is also made to ensure that the backup runs only once a day.
For steps on how to enable automatic back-up, see Section 13.2.5, “Configuring Centralized Logging”.
13.2.3 Accessing Log Data #
All logging data in SUSE OpenStack Cloud is managed by the Centralized Logging Service and can be viewed or analyzed by Kibana. Kibana is the only graphical interface provided with SUSE OpenStack Cloud to search or create a report from log data. Operations Console provides only a link to the Kibana Logging dashboard.
The following two methods allow you to access the Kibana Logging dashboard to search log data:
To learn more about Kibana, read the Getting Started with Kibana guide.
13.2.3.1 Use the Operations Console Link #
Operations Console allows you to access Kibana in the same tool that you use to manage the other SUSE OpenStack Cloud resources in your deployment. To use Operations Console, you must have the correct permissions.
To use Operations Console:
In a browser, open the Operations Console.
On the login page, enter the user name, and the Password, and then click LOG IN.
On the Home/Central Dashboard page, click the menu represented by three horizontal lines.
From the menu that slides in on the left, select Home, and then select Logging.
On the Home/Logging page, click View Logging Dashboard.
In SUSE OpenStack Cloud, Kibana usually runs on a different network than Operations Console. Due to this configuration, it is possible that using Operations Console to access Kibana will result in a "404 not found" error. This error only occurs if the user has access only to the public-facing network.
13.2.3.2 Using Kibana to Access Log Data #
Kibana is an open-source, data-visualization plugin for Elasticsearch. Kibana provides visualization capabilities using the log content indexed on an Elasticsearch cluster. Users can create bar and pie charts, line and scatter plots, and maps using the data collected by SUSE OpenStack Cloud in the cloud log files.
While creating Kibana dashboards is beyond the scope of this document, it is important to know that the dashboards you create are JSON files, which you can modify or use as the basis for new dashboards.
Kibana is client-server software. To operate properly, the browser must be able to access port 5601 on the control plane.
Field | Default Value | Description |
---|---|---|
user | kibana | User name that is required for logging in to the Kibana UI. |
password | random password is generated | Password generated during installation that is used to log in to the Kibana UI. |
13.2.3.3 Logging into Kibana #
To log into Kibana to view data, you must make sure you have the required login configuration.
Verify login credentials: Section 13.2.3.3.1, “Verify Login Credentials”
Find the randomized password: Section 13.2.3.3.2, “Find the Randomized Password”
Access Kibana using a direct link: Section 13.2.3.3.3, “Access Kibana Using a Direct Link”
13.2.3.3.1 Verify Login Credentials #
During the installation of Kibana, a password is automatically set and it is randomized. Therefore, unless an administrator has already changed it, you need to retrieve the default password from a file on the control plane node.
13.2.3.3.2 Find the Randomized Password #
To find the Kibana password, run:
ardana >
grep kibana ~/scratch/ansible/next/my_cloud/stage/internal/CloudModel.yaml
13.2.3.3.3 Access Kibana Using a Direct Link #
This section helps you verify the horizon virtual IP (VIP) address that you should use. To provide enhanced security, access to Kibana is not available on the External network.
To determine which IP address to use to access Kibana, from your Cloud Lifecycle Manager, run:
ardana > grep HZN-WEB /etc/hosts
The output of the grep command shows the virtual IP address for Kibana that you should use.
Important: If nothing is returned by the grep command, you can open the following file to look for the IP address manually:
/etc/hosts
Access to Kibana will be over port 5601 of that virtual IP address. Example:
https://VIP:5601
13.2.4 Managing the Centralized Logging Feature #
No specific configuration tasks are required to use Centralized Logging, as it is enabled by default after installation. However, you can configure the individual components as needed for your environment.
13.2.4.1 How Do I Stop and Start the Logging Service? #
Although you might not need to stop and start the logging service very often, you may need to if, for example, one of the logging services is not behaving as expected or not working.
You cannot enable or disable centralized logging across all services unless you stop all centralized logging. Instead, it is recommended that you enable or disable individual log files in the <service>-clr.yml files and then reconfigure logging. You would enable centralized logging for a file when you want to make sure you are able to monitor those logs in Kibana.
In SUSE OpenStack Cloud, the logging-ansible restart playbook has been updated to manage the start, stop, and restart of the Centralized Logging Service in a specific way. This change was made to ensure the proper stop, start, and restart of Elasticsearch.
It is recommended that you only use the logging playbooks to perform the start, stop, and restart of the Centralized Logging Service. Manually mixing the start, stop, and restart operations with the logging playbooks will result in complex failures.
The steps in this section only impact centralized logging. Logrotate is an essential feature that keeps the service log files from filling the disk and will not be affected.
These playbooks must be run from the Cloud Lifecycle Manager.
To stop the Logging service:
To change to the directory containing the ansible playbook, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible
To run the ansible playbook that will stop the logging service, run:
ardana > ansible-playbook -i hosts/verb_hosts logging-stop.yml
To start the Logging service:
To change to the directory containing the ansible playbook, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible
To run the ansible playbook that will start the logging service, run:
ardana > ansible-playbook -i hosts/verb_hosts logging-start.yml
13.2.4.2 How Do I Enable or Disable Centralized Logging For a Service? #
To enable or disable Centralized Logging for a service you need to modify the configuration for the service, set the enabled flag to true or false, and then reconfigure logging.
There are consequences if you enable too many logging files for a service. If there is not enough storage to support the increased logging, the retention period of logs in Elasticsearch is decreased. Alternatively, if you wanted to increase the retention period of log files or if you did not want those logs to show up in Kibana, you would disable centralized logging for a file.
To enable Centralized Logging for a service:
Use the documentation provided with the service to verify that it is not already configured for centralized logging.
To find the SUSE OpenStack Cloud file to edit, run:
ardana > find ~/openstack/my_cloud/config/logging/vars/ -name "*service-name*"
Edit the file for the service for which you want to enable logging.
To enable Centralized Logging, find the following code and change the enabled flag to true; to disable, change it to false:
logging_options:
  - centralized_logging:
      enabled: true
      format: json
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
To run the configuration processor:
ardana > cd ~/openstack/ardana/ansible/
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To reconfigure logging, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
13.2.5 Configuring Centralized Logging #
You can adjust the settings for centralized logging when you are troubleshooting problems with a service or to decrease log size and retention to save on disk space. For steps on how to configure logging settings, refer to the following tasks:
13.2.5.1 Configuration Files #
Centralized Logging settings are stored in the configuration files in the
following directory on the Cloud Lifecycle Manager:
~/openstack/my_cloud/config/logging/
The configuration files and their use are described below:
File | Description |
---|---|
main.yml | Main configuration file for all centralized logging components. |
elasticsearch.yml.j2 | Main configuration file for Elasticsearch. |
elasticsearch-default.j2 | Default overrides for the Elasticsearch init script. |
kibana.yml.j2 | Main configuration file for Kibana. |
kibana-apache2.conf.j2 | Apache configuration file for Kibana. |
logstash.conf.j2 | Logstash inputs/outputs configuration. |
logstash-default.j2 | Default overrides for the Logstash init script. |
beaver.conf.j2 | Main configuration file for Beaver. |
vars | Path to logrotate configuration files. |
13.2.5.2 Planning Resource Requirements #
The Centralized Logging service needs to have enough resources available to it to perform adequately for different scale environments. The base logging levels are tuned during installation according to the amount of RAM allocated to your control plane nodes to ensure optimum performance.
These values can be viewed and changed in the
~/openstack/my_cloud/config/logging/main.yml
file, but you
will need to run a reconfigure of the Centralized Logging service if changes
are made.
The total process memory consumption for Elasticsearch will be the above
allocated heap value (in
~/openstack/my_cloud/config/logging/main.yml
) plus any Java
Virtual Machine (JVM) overhead.
Setting Disk Size Requirements
In the entry-scale models, the disk partition sizes on your controller nodes
for the logging and Elasticsearch data are set as a percentage of your total
disk size. You can see these in the following file on the Cloud Lifecycle Manager
(deployer):
~/openstack/my_cloud/definition/data/<controller_disk_files_used>
Sample file settings:
# Local Log files.
- name: log
  size: 13%
  mount: /var/log
  fstype: ext4
  mkfs-opts: -O large_file
# Data storage for centralized logging. This holds log entries from all
# servers in the cloud and hence can require a lot of disk space.
- name: elasticsearch
  size: 30%
  mount: /var/lib/elasticsearch
  fstype: ext4
The disk size is set automatically based on the hardware configuration. If you need to adjust it, you can set it manually with the following steps.
To set disk sizes:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/definition/data/disks.yml
Make any desired changes.
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
To run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To run the logging reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
13.2.5.3 Backing Up Elasticsearch Log Indices #
The log files that are centrally collected in SUSE OpenStack Cloud are stored by
Elasticsearch on disk in the /var/lib/elasticsearch
partition. However, this is distributed across each of the Elasticsearch
cluster nodes as shards. A cron job runs periodically to see if the disk
partition runs low on space, and, if so, it runs curator to delete the old
log indices to make room for new logs. This deletion is permanent and the
logs are lost forever. If you want to backup old logs, for example to comply
with certain regulations, you can configure automatic backup of
Elasticsearch indices.
If you need to restore data that was archived prior to SUSE OpenStack Cloud 9 and used the older versions of Elasticsearch, then this data will need to be restored to a separate deployment of Elasticsearch.
This can be accomplished using the following steps:
Deploy a separate, distinct Elasticsearch instance whose version matches the version in SUSE OpenStack Cloud.
Make the backed-up data available to that Elasticsearch instance using NFS or another shared-storage mechanism.
Before enabling automatic back-ups, make sure you understand how much disk space you will need, and configure the disks that will store the data. Use the following checklist to prepare your deployment for enabling automatic backups:
☐ | Item |
---|---|
☐ | Add a shared disk partition to each of the Elasticsearch controller nodes. The default partition used for backup is /var/lib/esbackup. You can change this by setting curator_es_backup_partition in ~/openstack/my_cloud/config/logging/main.yml. |
☐ | Ensure the shared disk has enough storage to retain backups for the desired retention period. |
To enable automatic back-up of centralized logs to Elasticsearch:
Log in to the Cloud Lifecycle Manager (deployer node).
Open the following file in a text editor:
~/openstack/my_cloud/config/logging/main.yml
Find the following variables:
curator_backup_repo_name: "es_{{host.my_dimensions.cloud_name}}"
curator_es_backup_partition: /var/lib/esbackup
To enable backup, change the curator_enable_backup value to true in the curator section:
curator_enable_backup: true
Save your changes and re-run the configuration processor:
ardana > cd ~/openstack
ardana > git add -A # Verify the added files
ardana > git status
ardana > git commit -m "Enabling Elasticsearch Backup"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To re-configure logging:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
To verify that the indices are backed up, check the contents of the partition:
ardana > ls /var/lib/esbackup
13.2.5.4 Restoring Logs From an Elasticsearch Backup #
To restore logs from an Elasticsearch backup, see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-snapshots.html.
We do not recommend restoring to the original SUSE OpenStack Cloud Centralized Logging cluster, as doing so may cause storage or capacity issues. Instead, we recommend setting up a separate ELK cluster of the same version and restoring the logs there.
13.2.5.5 Tuning Logging Parameters #
When centralized logging is installed in SUSE OpenStack Cloud, parameters for Elasticsearch heap size and logstash heap size are automatically configured based on the amount of RAM on the system. These values are typically the required values, but they may need to be adjusted if performance issues arise, or disk space issues are encountered. These values may also need to be adjusted if hardware changes are made after an installation.
These values are defined at the top of the following file: .../logging-common/defaults/main.yml. An example of the contents of the file is below:
# Select heap tunings based on system RAM
#-------------------------------------------------------------------------------
threshold_small_mb: 31000
threshold_medium_mb: 63000
threshold_large_mb: 127000
tuning_selector: " {% if ansible_memtotal_mb < threshold_small_mb|int %} demo {% elif ansible_memtotal_mb < threshold_medium_mb|int %} small {% elif ansible_memtotal_mb < threshold_large_mb|int %} medium {% else %} large {%endif %} "
logging_possible_tunings:
  # RAM < 32GB
  demo:
    elasticsearch_heap_size: 512m
    logstash_heap_size: 512m
  # RAM < 64GB
  small:
    elasticsearch_heap_size: 8g
    logstash_heap_size: 2g
  # RAM < 128GB
  medium:
    elasticsearch_heap_size: 16g
    logstash_heap_size: 4g
  # RAM >= 128GB
  large:
    elasticsearch_heap_size: 31g
    logstash_heap_size: 8g
logging_tunings: "{{ logging_possible_tunings[tuning_selector] }}"
This specifies thresholds, in terms of memory, for what counts as a small, medium, or large system. To see which values will be used, check how much RAM your system has and where it falls relative to the thresholds. To modify the values, you can either adjust the threshold values (so that, for example, your system changes from a small configuration to a medium configuration), or keep the thresholds the same and modify the heap_size variables directly for the selector that matches your system. For example, if your system matches the medium configuration, which sets heap sizes to 16 GB for Elasticsearch and 4 GB for Logstash, and you want twice as much set aside for Logstash, you could increase the Logstash value from 4 GB to 8 GB.
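The tier selection performed by the tuning_selector expression above can be sketched in shell; select_tier is a hypothetical helper for illustration that maps total RAM in MB to one of the four tuning profiles, using the same thresholds as the configuration file.

```shell
#!/bin/sh
# Sketch of the tier selection performed by tuning_selector (illustrative).
select_tier() {
    mem_mb=$1
    if   [ "$mem_mb" -lt 31000 ];  then echo demo    # RAM < 32GB
    elif [ "$mem_mb" -lt 63000 ];  then echo small   # RAM < 64GB
    elif [ "$mem_mb" -lt 127000 ]; then echo medium  # RAM < 128GB
    else                                echo large   # RAM >= 128GB
    fi
}

select_tier 16000    # demo:   512m Elasticsearch heap, 512m Logstash heap
select_tier 64000    # medium: 16g Elasticsearch heap, 4g Logstash heap
select_tier 256000   # large:  31g Elasticsearch heap, 8g Logstash heap
```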
13.2.6 Configuring Settings for Other Services #
When you configure settings for the Centralized Logging Service, those changes impact all services that are enabled for centralized logging. However, if you only need to change the logging configuration for one specific service, you will want to modify the service's files instead of changing the settings for the entire Centralized Logging service. This topic helps you complete the following tasks:
13.2.6.1 Setting Logging Levels for Services #
When it is necessary to increase the logging level for a specific service to troubleshoot an issue, or to decrease logging levels to save disk space, you can edit the service's config file and then reconfigure logging. All changes will be made to the service's files and not to the Centralized Logging service files.
Messages only appear in the log files if they are the same as or more severe than the log level you set. The DEBUG level logs everything. Most services default to the INFO logging level, which lists informational events, plus warnings, errors, and critical errors. Some services provide other logging options that narrow the focus, helping you debug an issue, warning you when an operation fails, or alerting you to a serious issue with the cloud.
For more information on logging levels, see the OpenStack Logging Guidelines documentation.
13.2.6.2 Configuring the Logging Level for a Service #
If you want to increase or decrease the amount of details that are logged by a service, you can change the current logging level in the configuration files. Most services support, at a minimum, the DEBUG and INFO logging levels. For more information about what levels are supported by a service, check the documentation or Website for the specific service.
13.2.6.3 Barbican #
Service | Sub-component | Supported Logging Levels |
---|---|---|
barbican | barbican-api, barbican-worker | INFO (default), DEBUG |
To change the barbican logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/
ardana > vi my_cloud/config/barbican/barbican_deploy_config.yml
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
barbican_loglevel: "{{ ardana_loglevel | default('INFO') }}"
barbican_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
To run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Run the reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts barbican-reconfigure.yml
13.2.6.4 Block Storage (cinder) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
cinder | cinder-api, cinder-scheduler, cinder-backup, cinder-volume | INFO (default), DEBUG |
To manage cinder logging:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/ardana/ansible
ardana > vi roles/_CND-CMN/defaults/main.yml
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
cinder_loglevel: {{ ardana_loglevel | default('INFO') }}
cinder_logstash_loglevel: {{ ardana_loglevel | default('INFO') }}
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
To run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To run the reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml
13.2.6.5 Ceilometer #
Service | Sub-component | Supported Logging Levels |
---|---|---|
ceilometer | ceilometer-collector, ceilometer-agent-notification, ceilometer-polling, ceilometer-expirer | INFO (default), DEBUG |
To change the ceilometer logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/ardana/ansible
ardana > vi roles/_CEI-CMN/defaults/main.yml
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
ceilometer_loglevel: INFO
ceilometer_logstash_loglevel: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
To run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To run the reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml
13.2.6.6 Compute (nova) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
nova | | INFO (default), DEBUG |
To change the nova logging level:
Log in to the Cloud Lifecycle Manager.
The nova service component logging can be changed by modifying the following files:
~/openstack/my_cloud/config/nova/novncproxy-logging.conf.j2
~/openstack/my_cloud/config/nova/api-logging.conf.j2
~/openstack/my_cloud/config/nova/compute-logging.conf.j2
~/openstack/my_cloud/config/nova/conductor-logging.conf.j2
~/openstack/my_cloud/config/nova/scheduler-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
To run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
To create a deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
To run the reconfigure playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
13.2.6.7 Designate (DNS) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
designate | designate-api, designate-central, designate-mdns, designate-producer, designate-worker, designate-pool-manager, designate-zone-manager | INFO (default), DEBUG |
To change the designate logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/
ardana > vi my_cloud/config/designate/designate.conf.j2
To change the logging level, set the value of the following line:
debug = False
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts designate-reconfigure.yml
13.2.6.8 Identity (keystone) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
keystone | keystone | INFO (default), DEBUG, WARN, ERROR |
To change the keystone logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/keystone/keystone_deploy_config.yml
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
keystone_loglevel: INFO
keystone_logstash_loglevel: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
13.2.6.9 Image (glance) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
glance | glance-api | INFO (default), DEBUG |
To change the glance logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/glance/glance-api-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml
13.2.6.10 Bare Metal (ironic) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
ironic | ironic-api-logging.conf.j2, ironic-conductor-logging.conf.j2 | INFO (default), DEBUG |
To change the ironic logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
ardana > cd ~/openstack/ardana/ansible
ardana > vi roles/ironic-common/defaults/main.yml

To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
ironic_api_loglevel: "{{ ardana_loglevel | default('INFO') }}"
ironic_api_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
ironic_conductor_loglevel: "{{ ardana_loglevel | default('INFO') }}"
ironic_conductor_logstash_loglevel: "{{ ardana_loglevel | default('INFO') }}"
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ironic-reconfigure.yml
13.2.6.11 Monitoring (monasca) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
monasca | monasca-persister, zookeeper, storm, monasca-notification, monasca-api, kafka, monasca-agent | WARN (default), INFO |
To change the monasca logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Monitoring service component logging can be changed by modifying the following files:
~/openstack/ardana/ansible/roles/monasca-persister/defaults/main.yml
~/openstack/ardana/ansible/roles/storm/defaults/main.yml
~/openstack/ardana/ansible/roles/monasca-notification/defaults/main.yml
~/openstack/ardana/ansible/roles/monasca-api/defaults/main.yml
~/openstack/ardana/ansible/roles/kafka/defaults/main.yml
~/openstack/ardana/ansible/roles/monasca-agent/defaults/main.yml (for this file, you will need to add the variable)
To change the logging level, use ALL CAPS to set the desired level in the following line:
monasca_log_level: WARN
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-reconfigure.yml
13.2.6.12 Networking (neutron) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
neutron | neutron-server, dhcp-agent, l3-agent, metadata-agent, openvswitch-agent, ovsvapp-agent, sriov-agent, infoblox-ipam-agent, l2gateway-agent | INFO (default), DEBUG |
To change the neutron logging level:
Log in to the Cloud Lifecycle Manager.
The neutron service component logging can be changed by modifying the following files:
~/openstack/ardana/ansible/roles/neutron-common/templates/dhcp-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/infoblox-ipam-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/l2gateway-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/l3-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/metadata-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/openvswitch-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/ovsvapp-agent-logging.conf.j2
~/openstack/ardana/ansible/roles/neutron-common/templates/sriov-agent-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts neutron-reconfigure.yml
13.2.6.13 Object Storage (swift) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
swift | | INFO (default), DEBUG |
Currently it is not recommended to log at any level other than INFO.
13.2.6.14 Octavia #
Service | Sub-component | Supported Logging Levels |
---|---|---|
octavia | octavia-api, octavia-worker, octavia-hk, octavia-hm | INFO (default), DEBUG |
To change the Octavia logging level:
Log in to the Cloud Lifecycle Manager.
The Octavia service component logging can be changed by modifying the following files:
~/openstack/my_cloud/config/octavia/octavia-api-logging.conf.j2
~/openstack/my_cloud/config/octavia/octavia-worker-logging.conf.j2
~/openstack/my_cloud/config/octavia/octavia-hk-logging.conf.j2
~/openstack/my_cloud/config/octavia/octavia-hm-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts octavia-reconfigure.yml
13.2.6.15 Operations Console #
Service | Sub-component | Supported Logging Levels |
---|---|---|
opsconsole | ops-web, ops-mon | INFO (default), DEBUG |
To change the Operations Console logging level:
Log in to the Cloud Lifecycle Manager.
Open the following file:
~/openstack/ardana/ansible/roles/OPS-WEB/defaults/main.yml
To change the logging level, use ALL CAPS to set the desired level in the following line:
ops_console_loglevel: "{{ ardana_loglevel | default('INFO') }}"
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ops-console-reconfigure.yml
13.2.6.16 Orchestration (heat) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
heat | api-cfn, api, engine | INFO (default), DEBUG |
To change the heat logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following files:
~/openstack/my_cloud/config/heat/*-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml
13.2.6.17 Magnum #
Service | Sub-component | Supported Logging Levels |
---|---|---|
magnum | api, conductor | INFO (default), DEBUG |
To change the Magnum logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following files:

~/openstack/my_cloud/config/magnum/api-logging.conf.j2
~/openstack/my_cloud/config/magnum/conductor-logging.conf.j2
The threshold for default text log files may be set by editing the [handler_watchedfile] section, or the JSON content forwarded to centralized logging may be set by editing the [handler_logstash] section. In either section, replace the value of the following line with the desired log level:
level: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts magnum-reconfigure.yml
13.2.6.18 File Storage (manila) #
Service | Sub-component | Supported Logging Levels |
---|---|---|
manila | api | INFO (default), DEBUG |
To change the manila logging level:
Log in to the Cloud Lifecycle Manager (deployer).
Open the following file:
~/openstack/my_cloud/config/manila/manila-logging.conf.j2
To change the logging level, replace the values in these lines with the desired threshold (in ALL CAPS) for the standard log file on disk and the JSON log entries forwarded to centralized log services.
manila_loglevel: INFO
manila_logstash_loglevel: INFO
Save the changes to the file.
To commit the changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

To run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

To create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

To run the reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts manila-reconfigure.yml
13.2.6.19 Selecting Files for Centralized Logging #
As you use SUSE OpenStack Cloud, you might find a need to redefine which log files are rotated on disk or transferred to centralized logging. These changes are all made in the centralized logging definition files.
SUSE OpenStack Cloud uses the logrotate service to provide rotation, compression, and
removal of log files. All of the tunable variables for the logrotate process
itself can be controlled in the following file:
~/openstack/ardana/ansible/roles/logging-common/defaults/main.yml
You can find the centralized logging definition files for each service in
the following directory:
~/openstack/ardana/ansible/roles/logging-common/vars
You can change log settings for a service by following these steps.
Log in to the Cloud Lifecycle Manager.
Open the *.yml file for the service or sub-component that you want to modify.
Using keystone, the Identity service, as an example:

ardana > vi ~/openstack/ardana/ansible/roles/logging-common/vars/keystone-clr.yml

Consider the opening clause of the file:
sub_service:
  hosts: KEY-API
  name: keystone
  service: keystone
The hosts setting defines the role that triggers this logrotate definition being applied to a particular host. It can use regular expressions for pattern matching, for example, NEU-.*.
The service setting identifies the high-level service name associated with this content, which will be used for determining log files' collective quotas for storage on disk.
Verify logging is enabled by locating the following lines:
centralized_logging:
  enabled: true
  format: rawjson
Note: When possible, centralized logging is most effective on log files generated using logstash-formatted JSON. These files should specify format: rawjson. When only plaintext log files are available, format: json is appropriate. (This will cause their plaintext log lines to be wrapped in a JSON envelope before being sent to centralized logging storage.)
Observe log files selected for rotation:
- files:
    - /var/log/keystone/keystone.log
  log_rotate:
    - daily
    - maxsize 300M
    - rotate 7
    - compress
    - missingok
    - notifempty
    - copytruncate
    - create 640 keystone adm
Note: With the introduction of dynamic log rotation, the frequency (that is, daily) and file size threshold (that is, maxsize) settings no longer have any effect. The rotate setting may be easily overridden on a service-by-service basis.
Commit any changes to your local git repository:

ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Create a deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the logging reconfigure playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts kronos-reconfigure.yml
13.2.6.20 Controlling Disk Space Allocation and Retention of Log Files #
Each service is assigned a weighted allocation of the
/var/log
filesystem's capacity. When all its log files'
cumulative sizes exceed this allocation, a rotation is triggered for that
service's log files according to the behavior specified in the
/etc/logrotate.d/*
specification.
These specification files are auto-generated based on YML sources delivered with the Cloud Lifecycle Manager codebase. The source files can be edited and reapplied to control the allocation of disk space across services or the behavior during a rotation.
Disk capacity is allocated as a percentage of the total weighted value of all services running on a particular node. For example, if 20 services run on the same node, all with a default weight of 100, they will each be granted 1/20th of the log filesystem's capacity. If the configuration is updated to change one service's weight to 150, all the services' allocations will be adjusted to make it possible for that one service to consume 150% of the space available to other individual services.
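The weighted-share arithmetic described above can be sketched in a few lines of Python (illustrative only; this is not the actual /opt/kronos implementation):

```python
def allocate_log_space(total_mb, weights):
    """Split a log filesystem's capacity across services by weight."""
    total_weight = sum(weights.values())
    return {svc: total_mb * w / total_weight for svc, w in weights.items()}

# 20 services, all at the default weight of 100: each gets 1/20th.
even = allocate_log_space(10000, {"svc%d" % i: 100 for i in range(20)})
print(even["svc0"])  # 500.0 MB out of a 10 GB filesystem

# Raise one service's weight to 150: its share becomes 1.5x the others'.
weights = {"svc%d" % i: 100 for i in range(19)}
weights["cinder"] = 150
skewed = allocate_log_space(10000, weights)
print(skewed["cinder"] / skewed["svc0"])  # 1.5
```

Note that raising one weight shrinks every other service's absolute allocation, since the shares always sum to the filesystem's capacity.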
These policies are enforced by the script
/opt/kronos/rotate_if_exceeded_quota.py
, which will be
executed every 5 minutes via a cron job and will rotate the log files of any
services which have exceeded their respective quotas. When log rotation
takes place for a service, logs are generated to describe the activity in
/var/log/kronos/check_if_exceeded_quota.log
.
When logrotate is performed on a service, its existing log files are compressed and archived to make space available for fresh log entries. Once the number of archived log files exceeds that service's retention threshold, the oldest files are deleted. Thus, longer retention thresholds (for example, 10 to 15) will result in more of the service's allocated log capacity being used for historic logs, while shorter retention thresholds (for example, 1 to 5) will keep more space available for its active plaintext log files.
Use the following process to make adjustments to services' log capacity allocations or retention thresholds:
Navigate to the following directory on your Cloud Lifecycle Manager:
~/stack/scratch/ansible/next/ardana/ansible
Open and edit the service weights file:
ardana > vi roles/kronos-logrotation/vars/rotation_config.yml

Edit the service entries to set the desired values. Example:
cinder:
  weight: 300
  retention: 2
Note: A retention setting of default will use the recommended defaults for each service's log files.
Run the kronos-logrotation-deploy playbook:
ardana > ansible-playbook -i hosts/verb_hosts kronos-logrotation-deploy.yml

To verify that the quota changes are active:

Log in to a node and check the contents of the file /opt/kronos/service_info.yml to see the active quotas for that node, and the specifications in /etc/logrotate.d/* for rotation thresholds.
13.2.6.21 Configuring Elasticsearch for Centralized Logging #
Elasticsearch includes some tunable options exposed in its configuration. SUSE OpenStack Cloud uses these options in Elasticsearch to prioritize indexing speed over search speed. SUSE OpenStack Cloud also configures Elasticsearch for optimal performance in low RAM environments. The options that SUSE OpenStack Cloud modifies are listed below along with an explanation about why they were modified.
These configurations are defined in the
~/openstack/my_cloud/config/logging/main.yml
file and are
implemented in the Elasticsearch configuration file
~/openstack/my_cloud/config/logging/elasticsearch.yml.j2
.
13.2.6.22 Safeguards for the Log Partitions Disk Capacity #
Because the logging partitions are at high risk of filling up over time, a condition that can cause many negative side effects for running services, it is important to safeguard against log files consuming 100% of available capacity.
This protection is implemented by pairs of low/high
watermark thresholds, with values
established in
~/stack/scratch/ansible/next/ardana/ansible/roles/logging-common/defaults/main.yml
and applied by the kronos-logrotation-deploy
playbook.
- var_log_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/log partition beyond which alarms will be triggered (visible to administrators in monasca).
- var_log_high_watermark_percent (default: 95) defines how much capacity of the /var/log partition to make available for log rotation (in calculating weighted service allocations).
- var_audit_low_watermark_percent (default: 80) sets a capacity level for the contents of the /var/audit partition beyond which alarm notifications will be triggered.
- var_audit_high_watermark_percent (default: 95) sets a capacity level for the contents of the /var/audit partition which will cause log rotation to be forced according to the specification in /etc/auditlogrotate.conf.
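As an illustration, the four thresholds could be pinned explicitly in roles/logging-common/defaults/main.yml; the values shown here are simply the documented defaults:

```yaml
var_log_low_watermark_percent: 80     # alarm when /var/log passes 80%
var_log_high_watermark_percent: 95    # capacity made available for log rotation
var_audit_low_watermark_percent: 80   # alarm when /var/audit passes 80%
var_audit_high_watermark_percent: 95  # force rotation of /var/audit logs
```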
13.2.7 Audit Logging Overview #
Existing OpenStack service logging varies widely across services. Generally, log messages do not have enough detail about who is requesting the application program interface (API), or enough context-specific details about an action performed. Often details are not even consistently logged across various services, leading to inconsistent data formats being used across services. These issues make it difficult to integrate logging with existing audit tools and processes.
To help you monitor your workload and data in compliance with your corporate, industry or regional policies, SUSE OpenStack Cloud provides auditing support as a basic security feature. The audit logging can be integrated with customer Security Information and Event Management (SIEM) tools and support your efforts to correlate threat forensics.
The SUSE OpenStack Cloud audit logging feature uses Audit Middleware for Python services. This middleware is designed for OpenStack services that use the Paste Deploy system; most OpenStack services use the paste deploy mechanism to find and configure WSGI servers and applications. Using the paste deploy system provides auditing support in services with minimal changes.
By default, audit logging is disabled in the cloudConfig file on the Cloud Lifecycle Manager. It can only be enabled after SUSE OpenStack Cloud installation or upgrade.
The tasks in this section explain how to enable services for audit logging in your environment. SUSE OpenStack Cloud provides audit logging for the following services:
nova
barbican
keystone
cinder
ceilometer
neutron
glance
heat
For audit log backup information, see Section 17.3.4, “Audit Log Backup and Restore”.
13.2.7.1 Audit Logging Checklist #
Before enabling audit logging, make sure you understand how much disk space you will need, and configure the disks that will store the logging data. Use the following table to complete these tasks:
13.2.7.1.1 Frequently Asked Questions #
- How are audit logs generated?
The audit logs are created by services running in the cloud management controller nodes. The events that create auditing entries are formatted using a structure that is compliant with Cloud Auditing Data Federation (CADF) policies. The formatted audit entries are then saved to disk files. For more information, see the Cloud Auditing Data Federation Website.
- Where are audit logs stored?
We strongly recommend adding a dedicated disk volume for /var/audit.

If the disk templates for the controllers are not updated to create a separate volume for /var/audit, the audit logs will still be created in the root partition under the folder /var/audit. This could be problematic if the root partition does not have adequate space to hold the audit logs.

Warning: We recommend that you do not store audit logs in the /var/log volume. The /var/log volume is used for storing operational logs, and log rotation/alarms have been preconfigured for various services based on the size of this volume. Adding audit logs here may impact these, causing undesired alarms. This would also impact the retention times for the operational logs.

- Are audit logs centrally stored?
Yes. The existing operational log profiles have been configured to centrally log audit logs as well, once their generation has been enabled. The audit logs will be stored in Elasticsearch indices separate from the operational logs.
- How long are audit log files retained?
By default, audit logs are configured to be retained for 7 days on disk. The audit logs are rotated each day and the rotated files are stored in a compressed format and retained up to 7 days (configurable). The backup service has been configured to back up the audit logs to a location outside of the controller nodes for much longer retention periods.
- Do I lose audit data if a management controller node goes down?
Yes. For this reason, it is strongly recommended that you back up the audit partition in each of the management controller nodes for protection against any data loss.
13.2.7.1.2 Estimate Disk Size #
The table below provides estimates from each service of audit log size generated per day. The estimates are provided for environments with 100 nodes, 300 nodes, and 500 nodes.
Service | Log File Size: 100 nodes | Log File Size: 300 nodes | Log File Size: 500 nodes |
---|---|---|---|
barbican | 2.6 MB | 4.2 MB | 5.6 MB |
keystone | 96 - 131 MB | 288 - 394 MB | 480 - 657 MB |
nova | 186 (with a margin of 46) MB | 557 (with a margin of 139) MB | 928 (with a margin of 232) MB |
ceilometer | 12 MB | 12 MB | 12 MB |
cinder | 2 - 250 MB | 2 - 250 MB | 2 - 250 MB |
neutron | 145 MB | 433 MB | 722 MB |
glance | 20 (with a margin of 8) MB | 60 (with a margin of 22) MB | 100 (with a margin of 36) MB |
heat | 432 MB (1 transaction per second) | 432 MB (1 transaction per second) | 432 MB (1 transaction per second) |
swift | 33 GB (700 transactions per second) | 102 GB (2100 transactions per second) | 172 GB (3500 transactions per second) |
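As a back-of-the-envelope check, the per-day figures above can be combined with the default 7-day on-disk retention to size the /var/audit volume. The sketch below uses the 100-node upper-bound estimates from the table (margins included) and ignores compression of rotated files, so it errs on the generous side; swift is left out because its volume is listed separately and dwarfs the rest:

```python
# Upper-bound audit log volume per day, in MB, for a 100-node cloud
# (taken from the table above; nova and glance include their stated margins).
PER_DAY_MB = {
    "barbican": 2.6,
    "keystone": 131,
    "nova": 186 + 46,
    "ceilometer": 12,
    "cinder": 250,
    "neutron": 145,
    "glance": 20 + 8,
    "heat": 432,
}

RETENTION_DAYS = 7  # default on-disk retention for audit logs

total_per_day = sum(PER_DAY_MB.values())
needed_gb = total_per_day * RETENTION_DAYS / 1024
print("~%.0f MB/day, ~%.1f GB for %d days"
      % (total_per_day, needed_gb, RETENTION_DAYS))
```

On these assumptions, a /var/audit volume of roughly 10 GB leaves comfortable headroom at 100 nodes; repeat the exercise with the 300- or 500-node columns for larger clouds.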
13.2.7.1.3 Add disks to the controller nodes #
You need to add disks for the audit log partition to store the data in a secure manner. The steps to complete this task will vary depending on the type of server you are running. Please refer to the manufacturer’s instructions on how to add disks for the type of server node used by the management controller cluster. If you already have extra disks in the controller node, you can identify any unused one and use it for the audit log partition.
13.2.7.1.4 Update the disk template for the controller nodes #
Since audit logging is disabled by default, the audit volume groups in the disk templates are commented out. If you want to turn on audit logging, the template needs to be updated first. If it is not updated, there will be no audit volume group. To update the disk template, you will need to copy templates from the examples folder to the definition folder and then edit the disk controller settings. Changes to the disk template used for provisioning cloud nodes must be made prior to deploying the nodes.
To update the disk controller template:
Log in to your Cloud Lifecycle Manager.
To copy the example templates folder, run the following command:
Important: If you already have the required templates in the definition folder, you can skip this step.
ardana > cp -r ~/openstack/examples/entry-scale-esx/* ~/openstack/my_cloud/definition/

To change to the data folder, run:

ardana > cd ~/openstack/my_cloud/definition/

To edit the disks controller settings, open the file that matches your server model and disk model in a text editor:
Model | File |
---|---|
entry-scale-kvm | disks_controller_1TB.yml, disks_controller_600GB.yml |
mid-scale | disks_compute.yml, disks_control_common_600GB.yml, disks_dbmq_600GB.yml, disks_mtrmon_2TB.yml, disks_mtrmon_4.5TB.yml, disks_mtrmon_600GB.yml, disks_swobj.yml, disks_swpac.yml |
To update the settings and enable an audit log volume group, edit the appropriate file(s) listed above and remove the '#' comments from these lines, confirming that they are appropriate for your environment.
- name: audit-vg
  physical-volumes:
    - /dev/sdz
  logical-volumes:
    - name: audit
      size: 95%
      mount: /var/audit
      fstype: ext4
      mkfs-opts: -O large_file
13.2.7.1.5 Save your changes #
To save your changes, you will use the Git repository to add the setup disk files.
To save your changes:
To change to the openstack directory, run:
ardana > cd ~/openstack

To add the new and updated files, run:

ardana > git add -A

To verify the files are added, run:

ardana > git status

To commit your changes, run:

ardana > git commit -m "Setup disks for audit logging"
13.2.7.2 Enable Audit Logging #
To enable audit logging you must edit your cloud configuration settings, save your changes and re-run the configuration processor. Then you can run the playbooks to create the volume groups and configure them.
In the ~/openstack/my_cloud/definition/cloudConfig.yml
file,
service names defined under enabled-services or disabled-services override
the default setting.
The following is an example of your audit-settings section:
# Disc space needs to be allocated to the audit directory before enabling
# auditing.
# Default can be either "disabled" or "enabled". Services listed in
# "enabled-services" and "disabled-services" override the default setting.
audit-settings:
  default: disabled
  #enabled-services:
  #  - keystone
  #  - barbican
  disabled-services:
    - nova
    - barbican
    - keystone
    - cinder
    - ceilometer
    - neutron
In this example, although the default setting for all services is set to disabled, keystone and barbican may be explicitly enabled by removing the comments from these lines and this setting overrides the default.
13.2.7.2.1 To edit the configuration file: #
Log in to your Cloud Lifecycle Manager.
To change to the cloud definition folder, run:
ardana > cd ~/openstack/my_cloud/definition

To edit the auditing settings, in a text editor, open the following file:
cloudConfig.yml
To enable audit logging, begin by uncommenting the enabled-services: block, then list under it any service you want to enable for audit logging.
For example, keystone has been enabled in the following text:
Default cloudConfig.yml file:

audit-settings:
  default: disabled
  enabled-services:
  # - keystone

Enabling keystone audit logging:

audit-settings:
  default: disabled
  enabled-services:
    - keystone
To move the services you want to enable, comment out the service in the disabled section and add it to the enabled section. For example, barbican has been enabled in the following text:
cloudConfig.yml file:

audit-settings:
  default: disabled
  enabled-services:
    - keystone
  disabled-services:
    - nova
    # - keystone
    - barbican
    - cinder

Enabling barbican audit logging:

audit-settings:
  default: disabled
  enabled-services:
    - keystone
    - barbican
  disabled-services:
    - nova
    # - barbican
    # - keystone
    - cinder
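After editing, it is worth double-checking which services actually ended up in the enabled block. The sketch below extracts the enabled-services entries; it uses an inline sample file standing in for cloudConfig.yml, so point the command at the real file in practice.

```shell
# Stand-in for the audit-settings section of cloudConfig.yml
# (sample content only; use the real file in practice).
cat > /tmp/audit-sample.yml <<'EOF'
audit-settings:
  default: disabled
  enabled-services:
    - keystone
    - barbican
EOF
# List the services currently enabled for audit logging
sed -n '/enabled-services:/,$p' /tmp/audit-sample.yml | grep '^[[:space:]]*- '
```

Commented-out entries (lines starting with "#") are intentionally not matched, so only effective services are listed.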
13.2.7.2.2 To save your changes and run the configuration processor: #
To change to the openstack directory, run:
ardana >
cd ~/openstack
To add the new and updated files, run:
ardana >
git add -A
To verify the files are added, run:
ardana >
git status
To commit your changes, run:
ardana >
git commit -m "Enable audit logging"
To change to the directory with the ansible playbooks, run:
ardana >
cd ~/openstack/ardana/ansible
To rerun the configuration processor, run:
ardana >
ansible-playbook -i hosts/localhost config-processor-run.yml
ardana >
ansible-playbook -i hosts/localhost ready-deployment.yml
13.2.7.2.3 To create the volume group: #
To change to the directory containing the osconfig playbook, run:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
To remove the stub file that osconfig uses to decide if the disks are already configured, run:
ardana >
ansible -i hosts/verb_hosts KEY-API -a 'sudo rm -f /etc/openstack/osconfig-ran'
Important: The osconfig playbook uses this stub file to mark already configured disks as "idempotent." To stop osconfig from identifying your new disk as already configured, you must remove the stub file /etc/openstack/osconfig-ran before re-running the osconfig playbook.
To run the playbook that enables auditing for a service, run:
ardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API
Important: The variable KEY-API is used as an example to cover the management controller cluster. To enable auditing for a service that does not run on the same cluster, add that service's host group to the --limit flag in the above command. For example:
ardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit KEY-API:NEU-SVR
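The --limit value is simply a colon-separated list of host groups, so extending it to more services is mechanical. A small sketch (the group names are examples, as above):

```shell
# Build a colon-separated --limit value from a space-separated list of
# host groups (KEY-API and NEU-SVR are example group names).
groups="KEY-API NEU-SVR"
limit=$(echo "$groups" | tr ' ' ':')
echo "ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit $limit"
```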
13.2.7.2.4 To Reconfigure services for audit logging: #
To change to the directory containing the service playbooks, run:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
To run the playbook that reconfigures a service for audit logging, run:
ardana >
ansible-playbook -i hosts/verb_hosts SERVICE_NAME-reconfigure.yml
For example, to reconfigure keystone for audit logging, run:
ardana >
ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
Repeat steps 1 and 2 for each service you need to reconfigure.
Important: You must reconfigure each service that you changed to be enabled or disabled in the cloudConfig.yml file.
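To avoid missing a service, the per-service reconfigure runs can be scripted. This sketch only echoes the playbook invocations (remove the echo to run them for real); the service list is a hypothetical example of services changed in cloudConfig.yml.

```shell
# Hypothetical list of services whose audit setting changed
changed_services="keystone barbican"
for svc in $changed_services; do
  # echo makes this a dry run; drop it to actually invoke ansible-playbook
  echo ansible-playbook -i hosts/verb_hosts "${svc}-reconfigure.yml"
done
```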
13.2.8 Troubleshooting #
For information on troubleshooting Central Logging, see Section 18.7.1, “Troubleshooting Centralized Logging”.
13.3 Metering Service (ceilometer) Overview #
The SUSE OpenStack Cloud metering service collects and provides access to OpenStack usage data that can be used for billing reporting such as showback and chargeback. The metering service can also provide general usage reporting. ceilometer acts as the central collection and data access service to the meters provided by all the OpenStack services. The data collected is available through the monasca API. ceilometer V2 API was deprecated in the Pike release upstream.
13.3.1 Metering Service New Functionality #
13.3.1.1 New Metering Functionality in SUSE OpenStack Cloud 9 #
ceilometer is now integrated with monasca, using it as the datastore.
The default meters and other items configured for ceilometer can now be modified and additional meters can be added. We recommend that users test overall SUSE OpenStack Cloud performance prior to deploying any ceilometer modifications to ensure the addition of new notifications or polling events does not negatively affect overall system performance.
ceilometer Central Agent (pollster) is now called Polling Agent and is configured to support HA (Active-Active).
Notification Agent has built-in HA (Active-Active) with support for pipeline transformers, but workload partitioning has been disabled in SUSE OpenStack Cloud.
swift poll-based account-level meters are enabled by default with an hourly collection cycle.
Integration with centralized monitoring (monasca) and centralized logging
Support for upgrade and reconfigure operations
13.3.1.2 Limitations #
The number of metadata attributes that can be extracted from resource_metadata is limited to 16. This is the combined number of fields in the metadata.common and metadata.<service.meters> sections of the monasca_field_definitions.yaml file for any service; their total cannot exceed 16.
Several network-related attributes are written with a colon (":") but are returned with a period ("."). For example, you can request a sample list using the following command:
ardana >
source ~/service.osrc
ardana >
ceilometer --debug sample-list network -q "resource_id=421d50a5-156e-4cb9-b404-d2ce5f32f18b;resource_metadata.provider.network_type=flat"
However, in the response you will see the following:
provider.network_type
instead of
provider:network_type
This limitation is known for the following attributes:
provider:network_type
provider:physical_network
provider:segmentation_id
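When building queries programmatically, the simplest workaround is to normalize the colon form to the period form before constructing the -q filter, for example:

```shell
# Normalize a neutron-style attribute name (colon separator) to the
# period-separated form that the metering query interface returns.
attr="provider:network_type"
normalized=$(printf '%s' "$attr" | tr ':' '.')
echo "resource_metadata.${normalized}=flat"   # prints "resource_metadata.provider.network_type=flat"
```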
ceilometer Expirer is not supported. Data retention expiration is handled by monasca with a default retention period of 45 days.
ceilometer Collector is not supported.
13.3.2 Understanding the Metering Service Concepts #
13.3.2.1 Ceilometer Introduction #
Before configuring the ceilometer Metering Service, it is important to understand how it works.
13.3.2.1.1 Metering Architecture #
SUSE OpenStack Cloud automatically configures ceilometer to use Logging and Monitoring Service (monasca) as its backend. ceilometer is deployed on the same control plane nodes as monasca.
The installation of ceilometer creates several management nodes running different metering components.
ceilometer Components on Controller nodes
This controller node is the first node of the highly available (HA) cluster.
ceilometer Sample Polling
Sample Polling is part of the Polling Agent. Messages are posted by the Notification Agent directly to monasca API.
ceilometer Polling Agent
The Polling Agent is responsible for coordinating the polling activity. It
parses the pipeline.yml
configuration file and
identifies all the sources that need to be polled. The sources are then
evaluated using a discovery mechanism and all the sources are translated to
resources where a dedicated pollster can retrieve and publish data. At each
identified interval the discovery mechanism is triggered, the resource list
is composed, and the data is polled and sent to the queue.
ceilometer Collector No Longer Required
In previous versions, the collector was responsible for getting the samples/events from the RabbitMQ service and storing them in the main database. The ceilometer Collector is no longer enabled. Now that the Notification Agent posts the data directly to the monasca API, the collector is no longer required.
13.3.2.1.2 Meter Reference #
ceilometer collects basic information grouped into categories known as
meters
. A meter is the unique resource-usage measurement
of a particular OpenStack service. Each OpenStack service defines what type
of data is exposed for metering.
Each meter has the following characteristics:
Attribute | Description |
---|---|
Name | Description of the meter |
Unit of Measurement | The method by which the data is measured. For example: storage meters are defined in Gigabytes (GB) and network bandwidth is measured in Gigabits (Gb). |
Type | The origin of the meter's data. OpenStack defines the following origins: Cumulative (increasing over time), Gauge (discrete items or fluctuating values), and Delta (values changing over time). |
A meter is defined for every measurable resource. A meter can exist beyond the actual existence of a particular resource, such as an active instance, to provision long-cycle use cases such as billing.
For a list of meter types and default meters installed with SUSE OpenStack Cloud, see Section 13.3.3, “Ceilometer Metering Available Meter Types”
The most common meter submission method is notifications. With this method, each service sends the data from their respective meters on a periodic basis to a common notifications bus.
ceilometer, in turn, pulls all of the events from the bus and saves the notifications in a ceilometer-specific database. The period of time that the data is collected and saved is known as the ceilometer expiry and is configured during ceilometer installation. Each meter is collected from one or more samples, gathered from the messaging queue or polled by agents. The samples are represented by counter objects. Each counter has the following fields:
Attribute | Description |
---|---|
counter_name | Description of the counter |
counter_unit | The method by which the data is measured. For example: data can be defined in Gigabytes (GB) or for network bandwidth, measured in Gigabits (Gb). |
counter_type | The origin of the counter's data. OpenStack defines the following origins: Cumulative (increasing over time), Gauge (discrete items or fluctuating values), and Delta (values changing over time). |
counter_volume | The volume of data measured (CPU ticks, bytes transmitted, etc.). Not used for gauge counters. Set to a default value such as 1. |
resource_id | The identifier of the resource measured (UUID) |
project_id | The project (tenant) ID to which the resource belongs. |
user_id | The ID of the user who owns the resource. |
resource_metadata | Other data transmitted in the metering notification payload. |
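Put together, a single serialized sample roughly resembles the JSON below. All values are hypothetical and shown only to map the fields in the table above onto a concrete record:

```shell
# Print an illustrative counter sample; every value here is made up.
cat <<'EOF'
{
  "counter_name": "disk.read.bytes",
  "counter_unit": "B",
  "counter_type": "cumulative",
  "counter_volume": 262144,
  "resource_id": "421d50a5-156e-4cb9-b404-d2ce5f32f18b",
  "project_id": "example-project-id",
  "user_id": "example-user-id",
  "resource_metadata": {"instance_type": "m1.small"}
}
EOF
```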
13.3.3 Ceilometer Metering Available Meter Types #
The Metering service contains three types of meters:
- Cumulative
A cumulative meter measures data over time (for example, instance hours).
- Gauge
A gauge measures discrete items (for example, floating IPs or image uploads) or fluctuating values (such as disk input or output).
- Delta
A delta measures change over time, for example, monitoring bandwidth.
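The relationship between the cumulative and rate-style gauge meters can be seen with simple arithmetic: a rate such as disk.read.bytes.rate is the difference between two cumulative samples divided by the sampling interval. The values below are made up for illustration:

```shell
# Derive a gauge rate from two cumulative samples taken 30 seconds apart
prev=1000      # cumulative bytes at t0 (hypothetical)
curr=4000      # cumulative bytes at t0 + 30s (hypothetical)
interval=30    # seconds between samples
rate=$(( (curr - prev) / interval ))
echo "${rate} B/s"   # prints "100 B/s"
```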
Each meter is populated from one or more samples, which are gathered from the messaging queue (listening agent), polling agents, or push agents. Samples are populated by counter objects.
Each counter contains the following fields:
- name
the name of the meter
- type
the type of meter (cumulative, gauge, or delta)
- amount
the amount of data measured
- unit
the unit of measure
- resource
the resource being measured
- project ID
the project the resource is assigned to
- user
the user the resource is assigned to.
Note: The metering service shares the same high-availability proxy, messaging, and database clusters with the other OpenStack services. To avoid unnecessarily high loads, see Section 13.3.8, “Optimizing the Ceilometer Metering Service”.
13.3.3.1 SUSE OpenStack Cloud Default Meters #
These meters are installed and enabled by default during a SUSE OpenStack Cloud installation. More information about ceilometer can be found at OpenStack ceilometer.
13.3.3.2 Compute (nova) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
vcpus | Gauge | vcpu | Instance ID | Notification | Number of virtual CPUs allocated to the instance |
memory | Gauge | MB | Instance ID | Notification | Volume of RAM allocated to the instance |
memory.resident | Gauge | MB | Instance ID | Pollster | Volume of RAM used by the instance on the physical machine |
memory.usage | Gauge | MB | Instance ID | Pollster | Volume of RAM used by the instance from the amount of its allocated memory |
cpu | Cumulative | ns | Instance ID | Pollster | CPU time used |
cpu_util | Gauge | % | Instance ID | Pollster | Average CPU utilization |
disk.read.requests | Cumulative | request | Instance ID | Pollster | Number of read requests |
disk.read.requests.rate | Gauge | request/s | Instance ID | Pollster | Average rate of read requests |
disk.write.requests | Cumulative | request | Instance ID | Pollster | Number of write requests |
disk.write.requests.rate | Gauge | request/s | Instance ID | Pollster | Average rate of write requests |
disk.read.bytes | Cumulative | B | Instance ID | Pollster | Volume of reads |
disk.read.bytes.rate | Gauge | B/s | Instance ID | Pollster | Average rate of reads |
disk.write.bytes | Cumulative | B | Instance ID | Pollster | Volume of writes |
disk.write.bytes.rate | Gauge | B/s | Instance ID | Pollster | Average rate of writes |
disk.root.size | Gauge | GB | Instance ID | Notification | Size of root disk |
disk.ephemeral.size | Gauge | GB | Instance ID | Notification | Size of ephemeral disk |
disk.device.read.requests | Cumulative | request | Disk ID | Pollster | Number of read requests |
disk.device.read.requests.rate | Gauge | request/s | Disk ID | Pollster | Average rate of read requests |
disk.device.write.requests | Cumulative | request | Disk ID | Pollster | Number of write requests |
disk.device.write.requests.rate | Gauge | request/s | Disk ID | Pollster | Average rate of write requests |
disk.device.read.bytes | Cumulative | B | Disk ID | Pollster | Volume of reads |
disk.device.read.bytes.rate | Gauge | B/s | Disk ID | Pollster | Average rate of reads |
disk.device.write.bytes | Cumulative | B | Disk ID | Pollster | Volume of writes |
disk.device.write.bytes.rate | Gauge | B/s | Disk ID | Pollster | Average rate of writes |
disk.capacity | Gauge | B | Instance ID | Pollster | The amount of disk that the instance can see |
disk.allocation | Gauge | B | Instance ID | Pollster | The amount of disk occupied by the instance on the host machine |
disk.usage | Gauge | B | Instance ID | Pollster | The physical size in bytes of the image container on the host |
disk.device.capacity | Gauge | B | Disk ID | Pollster | The amount of disk per device that the instance can see |
disk.device.allocation | Gauge | B | Disk ID | Pollster | The amount of disk per device occupied by the instance on the host machine |
disk.device.usage | Gauge | B | Disk ID | Pollster | The physical size in bytes of the image container on the host per device |
network.incoming.bytes | Cumulative | B | Interface ID | Pollster | Number of incoming bytes |
network.outgoing.bytes | Cumulative | B | Interface ID | Pollster | Number of outgoing bytes |
network.incoming.packets | Cumulative | packet | Interface ID | Pollster | Number of incoming packets |
network.outgoing.packets | Cumulative | packet | Interface ID | Pollster | Number of outgoing packets |
13.3.3.3 Compute Host Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
compute.node.cpu.frequency | Gauge | MHz | Host ID | Notification | CPU frequency |
compute.node.cpu.kernel.time | Cumulative | ns | Host ID | Notification | CPU kernel time |
compute.node.cpu.idle.time | Cumulative | ns | Host ID | Notification | CPU idle time |
compute.node.cpu.user.time | Cumulative | ns | Host ID | Notification | CPU user mode time |
compute.node.cpu.iowait.time | Cumulative | ns | Host ID | Notification | CPU I/O wait time |
compute.node.cpu.kernel.percent | Gauge | % | Host ID | Notification | CPU kernel percentage |
compute.node.cpu.idle.percent | Gauge | % | Host ID | Notification | CPU idle percentage |
compute.node.cpu.user.percent | Gauge | % | Host ID | Notification | CPU user mode percentage |
compute.node.cpu.iowait.percent | Gauge | % | Host ID | Notification | CPU I/O wait percentage |
compute.node.cpu.percent | Gauge | % | Host ID | Notification | CPU utilization |
13.3.3.4 Image (glance) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
image.size | Gauge | B | Image ID | Notification | Uploaded image size |
image.update | Delta | Image | Image ID | Notification | Number of updates on the image |
image.upload | Delta | Image | Image ID | Notification | Number of uploads of the image |
image.delete | Delta | Image | Image ID | Notification | Number of deletes on the image |
13.3.3.5 Volume (cinder) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
volume.size | Gauge | GB | Vol ID | Notification | Size of volume |
snapshot.size | Gauge | GB | Snap ID | Notification | Size of snapshot's volume |
13.3.3.6 Storage (swift) Meters #
Meter | Type | Unit | Resource | Origin | Note |
---|---|---|---|---|---|
storage.objects | Gauge | Object | Storage ID | Pollster | Number of objects |
storage.objects.size | Gauge | B | Storage ID | Pollster | Total size of stored objects |
storage.objects.containers | Gauge | Container | Storage ID | Pollster | Number of containers |
The resource_id
for any ceilometer query is the
tenant_id
for the swift object because swift usage is
rolled up at the tenant level.
13.3.4 Configure the Ceilometer Metering Service #
SUSE OpenStack Cloud 9 automatically deploys ceilometer to use the monasca database. ceilometer is deployed on the same control plane nodes along with other OpenStack services such as keystone, nova, neutron, glance, and swift.
The Metering Service can be configured using one of the procedures described below.
13.3.4.1 Run the Upgrade Playbook #
Follow the standard service upgrade mechanism available in the Cloud Lifecycle Manager distribution. For ceilometer, the playbook included with SUSE OpenStack Cloud is ceilometer-upgrade.yml.
13.3.4.2 Enable Services for Messaging Notifications #
After installation of SUSE OpenStack Cloud, the following services are enabled by default to send notifications:
nova
cinder
glance
neutron
swift
The list of meters for these services is specified in the Notification Agent's or Polling Agent's pipeline configuration file.
For steps on how to edit the pipeline configuration files, see: Section 13.3.5, “Ceilometer Metering Service Notifications”
13.3.4.3 Restart the Polling Agent #
The Polling Agent is responsible for coordinating the polling activity. It parses the pipeline.yml configuration file and identifies all the sources where data is collected. The sources are then evaluated and are translated to resources that a dedicated pollster can retrieve. The Polling Agent follows this process:
At each identified interval, the pipeline.yml configuration file is parsed.
The resource list is composed.
The pollster collects the data.
The pollster sends data to the queue.
Metering processes should normally be operating at all times. This need is addressed by the Upstart event engine, which is designed to run on any Linux system. Upstart creates events, handles the consequences of those events, and starts and stops processes as required. Upstart will continually attempt to restart stopped processes, even if the process was stopped manually. To stop or start the Polling Agent without conflicting with Upstart, use the following steps.
To restart the Polling Agent:
To determine whether the process is running, run:
tux >
sudo systemctl status ceilometer-agent-notification

#SAMPLE OUTPUT:
ceilometer-agent-notification.service - ceilometer-agent-notification Service
   Loaded: loaded (/etc/systemd/system/ceilometer-agent-notification.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-06-12 05:07:14 UTC; 2 days ago
 Main PID: 31529 (ceilometer-agen)
    Tasks: 69
   CGroup: /system.slice/ceilometer-agent-notification.service
           ├─31529 ceilometer-agent-notification: master process [/opt/stack/service/ceilometer-agent-notification/venv/bin/ceilometer-agent-notification --config-file /opt/stack/service/ceilometer-agent-noti...
           └─31621 ceilometer-agent-notification: NotificationService worker(0)

Jun 12 05:07:14 ardana-qe201-cp1-c1-m2-mgmt systemd[1]: Started ceilometer-agent-notification Service.

To stop the process, run:
tux >
sudo systemctl stop ceilometer-agent-notification
To start the process, run:
tux >
sudo systemctl start ceilometer-agent-notification
13.3.4.4 Replace a Logging, Monitoring, and Metering Controller #
In a medium-scale environment, if a metering controller has to be replaced or rebuilt, use the following steps:
If the ceilometer nodes are not on the shared control plane, you must reconfigure ceilometer to implement the changes and replace the controller. To do this, run the ceilometer-reconfigure.yml ansible playbook without the --limit option.
13.3.4.5 Configure Monitoring #
The monasca HTTP process monitors ceilometer's notification and polling
agents. If these agents are down, monasca monitoring alarms
are triggered. You can use the notification alarms to debug the issue and
restart the notification agent. However, for the
Central Agent
(polling) and the Collector,
the alarms need to be deleted. These two processes are not started after an
upgrade, so when the monitoring process checks the alarms for these
components, they will be in the UNDETERMINED
state. SUSE OpenStack Cloud no longer monitors these processes. To resolve
this issue, manually delete alarms that are installed but no longer
used.
To resolve notification alarms, first check the ceilometer-agent-notification logs for errors in the /var/log/ceilometer directory. You can also use the Operations Console to access Kibana and check the logs. This will help you understand and debug the error.
To restart the service, run the ceilometer-start.yml playbook. It starts any ceilometer processes that have stopped; processes are otherwise only restarted during install, upgrade, or reconfigure, which is what is needed in this case. Restarting the stopped process resolves the alarm, because this monasca alarm means that ceilometer-agent-notification is no longer running on certain nodes.
You can access ceilometer data through monasca. ceilometer publishes samples to monasca with credentials of the following accounts:
ceilometer user
services
Data collected by ceilometer can also be retrieved by the monasca REST API. Make sure you use the following guidelines when requesting data from the monasca REST API:
Verify you have the monasca-admin role. This role is configured in the monasca-api configuration file.
Specify the
tenant id
of the services project.
For more details, read the monasca API Specification.
To run monasca commands at the command line, you must have the admin role. This allows you to use the ceilometer account credentials to replace the default admin account credentials defined in the service.osrc file. When you use the ceilometer account credentials, monasca commands will only return data collected by ceilometer. At this time, the monasca command line interface (CLI) does not support data retrieval for other tenants or projects.
13.3.5 Ceilometer Metering Service Notifications #
ceilometer uses the notification agent to listen to the message queue, convert notifications to Events and Samples, and apply pipeline actions.
13.3.5.1 Manage Whitelisting and Polling #
SUSE OpenStack Cloud is designed to reduce the amount of data that is stored. SUSE OpenStack Cloud's use of a SQL-based cluster, which is not recommended for big data, means you must control the data that ceilometer collects. You can do this by filtering (whitelisting) the data or by using the configuration files for the ceilometer Polling Agent and the ceilometer Notification Agent.
Whitelisting is used in a rule specification as a positive filtering parameter. Whitelist is only included in rules that can be used in direct mappings, for identity service issues such as service discovery, provisioning users, groups, roles, projects, domains as well as user authentication and authorization.
You can run tests against specific scenarios to see if filtering reduces the amount of data stored. You can create a test by editing or creating a run filter file (whitelist). For steps on how to do this, see: Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 38 “Post Installation Tasks”, Section 38.1 “API Verification”.
ceilometer Polling Agent (polling agent) and ceilometer Notification Agent (notification agent) use different pipeline.yaml files to configure meters that are collected. This prevents accidentally polling for meters which can be retrieved by the polling agent as well as the notification agent. For example, glance image and image.size are meters which can be retrieved both by polling and notifications.
In both of the separate configuration files, there is a setting for
interval
. The interval attribute determines the
frequency, in seconds, of how often data is collected. You can use this
setting to control the amount of resources that are used for notifications
and for polling. For example, if you want to use more resources for
notifications and fewer for polling, set the
interval
in the polling configuration file to a large
amount of time, such as 604800 seconds, which polls only once a week. Then,
in the notifications configuration file, you can set the
interval
to a smaller amount, such as 30 seconds, to collect data
every 30 seconds.
swift account data will be collected using the polling mechanism in an hourly interval.
Setting this interval to manage both notifications and polling is the recommended procedure when using a SQL cluster back-end.
Sample ceilometer Polling Agent file:
#File: ~/opt/stack/service/ceilometer-polling/etc/pipeline-polling.yaml
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Sample ceilometer Notification Agent(notification agent) file:
#File: ~/opt/stack/service/ceilometer-agent-notification/etc/pipeline-agent-notification.yaml
---
sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Both of the pipeline files have two major sections:
- Sources
represents the data that is collected either from notifications posted by services or through polling. In the Sources section there is a list of meters. These meters define what kind of data is collected. For a full list refer to the ceilometer documentation available at: Telemetry Measurements
- Sinks
represents how the data is modified before it is published to the internal queue for collection and storage.
You will only need to change a setting in the Sources section to control the data collection interval.
For more information, see Telemetry Measurements
To change the ceilometer Polling Agent interval setting:
To find the polling agent configuration file, run:
cd ~/opt/stack/service/ceilometer-polling/etc
In a text editor, open the following file:
pipeline-polling.yaml
In the following section, change the value of
interval
to the desired amount of time:

---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
In the sample code above, the polling agent will collect data every 3600 seconds, or hourly.
To change the ceilometer Notification Agent (notification agent) interval setting:
To find the notification agent configuration file, run:
cd /opt/stack/service/ceilometer-agent-notification
In a text editor, open the following file:
pipeline-agent-notification.yaml
In the following section, change the value of
interval
to the desired amount of time:

sources:
    - name: meter_source
      interval: 30
      meters:
          - "instance"
          - "image"
          - "image.size"
          - "image.upload"
          - "image.delete"
          - "volume"
          - "volume.size"
          - "snapshot"
          - "snapshot.size"
          - "ip.floating"
          - "network"
          - "network.create"
          - "network.update"
In the sample code above, the notification agent will collect data every 30 seconds.
The pipeline-agent-notification.yaml
file needs to be changed on all
controller nodes to change the white-listing and polling strategy.
13.3.5.2 Edit the List of Meters #
The number of enabled meters can be reduced or increased by editing the pipeline configuration of the notification and polling agents. To deploy these changes, you must then restart the agent. If both pollsters and notifications are modified, you must restart both the Polling Agent and the Notification Agent. The following code is an example of a compute-only ceilometer Notification Agent (notification agent) pipeline-agent-notification.yaml file:
---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
If you enable meters at the container level in this file, every time the polling interval triggers a collection, at least 5 messages per existing container in swift are collected.
The following table illustrates the amount of data produced hourly in different scenarios:
swift Containers | swift Objects per container | Samples per Hour | Samples stored per 24 hours |
---|---|---|---|
10 | 10 | 500 | 12000 |
10 | 100 | 5000 | 120000 |
100 | 100 | 50000 | 1200000 |
100 | 1000 | 500000 | 12000000 |
The data in the table shows that even a very small swift store with 10 containers and 100 files per container will store 120,000 samples in 24 hours, or roughly 3.6 million samples per month.
The size of each file does not have any impact on the number of samples collected. As shown in the table above, the smallest number of samples results from polling when there are a small number of files and a small number of containers. When there are a lot of small files and containers, the number of samples is the highest.
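The table's rows are consistent with roughly 5 samples per object per hourly polling cycle, so the load can be estimated before enabling container-level meters. A back-of-the-envelope sketch (the per-object factor is inferred from the table above):

```shell
# Estimate hourly and daily sample volume for container-level swift meters,
# assuming ~5 samples per object per polling cycle and hourly polling.
containers=10
objects_per_container=100
samples_per_hour=$((5 * containers * objects_per_container))
samples_per_day=$((samples_per_hour * 24))
echo "${samples_per_hour} samples/hour, ${samples_per_day} samples/day"   # prints "5000 samples/hour, 120000 samples/day"
```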
13.3.5.3 Add Resource Fields to Meters #
By default, not all the resource metadata fields for an event are recorded and stored in ceilometer. If you want to collect metadata fields for a consumer application, for example, it is easier to add a field to an existing meter rather than creating a new meter. If you create a new meter, you must also reconfigure ceilometer.
Consider the following information before you add or edit a meter:
You can add a maximum of 12 new fields.
Adding or editing a meter causes all non-default meters to STOP receiving notifications. You will need to restart ceilometer.
New meters added to the
pipeline-polling.yaml.j2
file must also be added to the
pipeline-agent-notification.yaml.j2
file. This is because polling meters are drained by the notification agent and not by the collector.
After SUSE OpenStack Cloud is installed, services like compute, cinder, glance, and neutron are configured to publish ceilometer meters by default. Other meters can also be enabled after the services are configured to start publishing the meter. The only requirement for publishing a meter is that the
origin
must have a value of
notification
. For a complete list of meters, see the OpenStack documentation on Measurements.
Not all meters are supported. Meters collected by the ceilometer Compute Agent or any agent other than ceilometer Polling are not supported or tested with SUSE OpenStack Cloud.
Identity meters are disabled by keystone.
To enable ceilometer to start collecting meters, some services require you to enable the meters in the service itself before enabling them in ceilometer. Refer to the documentation for the specific service before you add new meters or resource fields.
To add Resource Metadata fields:
Log on to the Cloud Lifecycle Manager (deployer node).
To change to the ceilometer directory, run:
ardana >
cd ~/openstack/my_cloud/config/ceilometer
In a text editor, open the target configuration file (for example, monasca-field-definitions.yaml.j2).
In the metadata section, either add a new meter or edit an existing one provided by SUSE OpenStack Cloud.
Include the metadata fields you need. You can use the
instance
meter in the file as an example.
Save and close the configuration file.
To save your changes in SUSE OpenStack Cloud, run:
ardana >
cd ~/openstack
ardana >
git add -A
ardana >
git commit -m "My config"
If you added a new meter, reconfigure ceilometer:
ardana >
cd ~/openstack/ardana/ansible/ # To run the config-processor playbook:ardana >
ansible-playbook -i hosts/localhost config-processor-run.yml #To run the ready-deployment playbook:ardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts ceilometer-reconfigure.yml
13.3.5.4 Update the Polling Strategy and Swift Considerations #
Polling can be very taxing on the system due to the sheer volume of data that the system may have to process. It also has a severe impact on queries, since the database will have a very large amount of data to scan when responding, which consumes a great deal of CPU and memory. This can result in long wait times for query responses and, in extreme cases, timeouts.
There are 3 polling meters in swift:
storage.objects
storage.objects.size
storage.objects.containers
Here is an example of pipeline.yml in which swift polling is set to occur hourly:
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      resources:
      discovery:
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
With the configuration above, container-based meters are not polled, and only three messages are collected for any given tenant, one for each meter listed in the configuration file. Because there are only three messages per tenant, this does not create a heavy load on the MySQL database as it would if container-based meters were enabled, and other APIs are not affected by this data collection configuration.
13.3.6 Ceilometer Metering Setting Role-based Access Control #
Role-Based Access Control (RBAC) is a technique that limits access to resources based on a specific set of roles associated with each user's credentials.
keystone has a set of users that are associated with each project. Each user has at least one role. After a user has authenticated with keystone using a valid set of credentials, keystone augments that request with the roles associated with that user. These roles are added to the request header under the X-Roles attribute and are presented as a comma-separated list.
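To make the comma-separated X-Roles value concrete, here is a minimal shell sketch (not from the product) of how a service behind keystone middleware could check whether a required role is present in that header:

```shell
# Hypothetical helper: succeed if $2 appears in the comma-separated
# role list $1 (the value keystone middleware puts in X-Roles).
has_role() {
    roles="$1"
    wanted="$2"
    case ",$roles,," in
        *",$wanted,"*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Example X-Roles value for a user holding two roles.
x_roles="ResellerAdmin,member"

if has_role "$x_roles" "ResellerAdmin"; then
    decision="allowed"
else
    decision="denied"
fi
echo "$decision"
```

Running the sketch prints `allowed`; changing the wanted role to one not in the list would print `denied`.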
13.3.6.1 Displaying All Users #
To discover the list of users available in the system, an administrator can run the following command using the keystone command-line interface:
ardana >
source ~/service.osrcardana >
openstack user list
The output should resemble this response, which is a list of all the users currently available in this system.
+----------------------------------+--------------------+---------+--------------------+
| id                               | name               | enabled | email              |
+----------------------------------+--------------------+---------+--------------------+
| 1c20d327c92a4ea8bb513894ce26f1f1 | admin              | True    | admin.example.com  |
| 0f48f3cc093c44b4ad969898713a0d65 | ceilometer         | True    | nobody@example.com |
| 85ba98d27b1c4c8f97993e34fcd14f48 | cinder             | True    | nobody@example.com |
| d2ff982a0b6547d0921b94957db714d6 | demo               | True    | demo@example.com   |
| b2d597e83664489ebd1d3c4742a04b7c | ec2                | True    | nobody@example.com |
| 2bd85070ceec4b608d9f1b06c6be22cb | glance             | True    | nobody@example.com |
| 0e9e2daebbd3464097557b87af4afa4c | heat               | True    | nobody@example.com |
| 0b466ddc2c0f478aa139d2a0be314467 | neutron            | True    | nobody@example.com |
| 5cda1a541dee4555aab88f36e5759268 | nova               | True    | nobody@example.com |
| 1cefd1361be8437d9684eb2add8bdbfa | swift              | True    | nobody@example.com |
| f05bac3532c44414a26c0086797dab23 | user20141203213957 | True    | nobody@example.com |
| 3db0588e140d4f88b0d4cc8b5ca86a0b | user20141205232231 | True    | nobody@example.com |
+----------------------------------+--------------------+---------+--------------------+
13.3.6.2 Displaying All Roles #
To see all the roles that are currently available in the deployment, an administrator (someone with the admin role) can run the following command:
ardana >
source ~/service.osrcardana >
openstack role list
The output should resemble the following response:
+----------------------------------+-----------------+
| id                               | name            |
+----------------------------------+-----------------+
| 507bface531e4ac2b7019a1684df3370 | ResellerAdmin   |
| 9fe2ff9ee4384b1894a90878d3e92bab | member          |
| e00e9406b536470dbde2689ce1edb683 | admin           |
| aa60501f1e664ddab72b0a9f27f96d2c | heat_stack_user |
| a082d27b033b4fdea37ebb2a5dc1a07b | service         |
| 8f11f6761534407585feecb5e896922f | swiftoperator   |
+----------------------------------+-----------------+
13.3.6.3 Assigning a Role to a User #
In this example, we want to add the role ResellerAdmin to the demo user who has the ID d2ff982a0b6547d0921b94957db714d6.
Determine which Project/Tenant the user belongs to.
ardana > source ~/service.osrc
ardana > openstack user show d2ff982a0b6547d0921b94957db714d6
The response should resemble the following output:
+---------------------+----------------------------------+
| Field               | Value                            |
+---------------------+----------------------------------+
| domain_id           | default                          |
| enabled             | True                             |
| id                  | d2ff982a0b6547d0921b94957db714d6 |
| name                | demo                             |
| options             | {}                               |
| password_expires_at | None                             |
+---------------------+----------------------------------+
We need to link the ResellerAdmin Role to a Project/Tenant. To start, determine which tenants are available on this deployment.
ardana > source ~/service.osrc
ardana > openstack project list
The response should resemble the following output:
+----------------------------------+---------+---------+
| id                               | name    | enabled |
+----------------------------------+---------+---------+
| 4a8f4207a13444089a18dc524f41b2cf | admin   | True    |
| 00cbaf647bf24627b01b1a314e796138 | demo    | True    |
| 8374761f28df43b09b20fcd3148c4a08 | gf1     | True    |
| 0f8a9eef727f4011a7c709e3fbe435fa | gf2     | True    |
| 6eff7b888f8e470a89a113acfcca87db | gf3     | True    |
| f0b5d86c7769478da82cdeb180aba1b0 | jaq1    | True    |
| a46f1127e78744e88d6bba20d2fc6e23 | jaq2    | True    |
| 977b9b7f9a6b4f59aaa70e5a1f4ebf0b | jaq3    | True    |
| 4055962ba9e44561ab495e8d4fafa41d | jaq4    | True    |
| 33ec7f15476545d1980cf90b05e1b5a8 | jaq5    | True    |
| 9550570f8bf147b3b9451a635a1024a1 | service | True    |
+----------------------------------+---------+---------+
Now that we have all the pieces, we can assign the ResellerAdmin role to this User on the Demo project.
ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138 507bface531e4ac2b7019a1684df3370
This command produces no output if everything is correct.
Validate that the role has been assigned correctly. Pass in the user and tenant ID and request a list of roles assigned.
ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138
Note that all members have the member role as a default role in addition to any other roles that have been assigned.
+----------------------------------+---------------+----------------------------------+----------------------------------+
| id                               | name          | user_id                          | tenant_id                        |
+----------------------------------+---------------+----------------------------------+----------------------------------+
| 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
| 9fe2ff9ee4384b1894a90878d3e92bab | member        | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
+----------------------------------+---------------+----------------------------------+----------------------------------+
13.3.6.4 Creating a New Role #
In this example, we will create a Level 3 Support role called L3Support.
Add the new role to the list of roles.
ardana > openstack role create L3Support
The response should resemble the following output:
+----------+----------------------------------+
| Property | Value                            |
+----------+----------------------------------+
| id       | 7e77946db05645c4ba56c6c82bf3f8d2 |
| name     | L3Support                        |
+----------+----------------------------------+
Now that we have the new role's ID, we can add that role to the Demo user from the previous example.
ardana > openstack role add --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138 7e77946db05645c4ba56c6c82bf3f8d2
This command produces no output if everything is correct.
Verify that the user Demo has both the ResellerAdmin and L3Support roles.
ardana > openstack role list --user d2ff982a0b6547d0921b94957db714d6 --project 00cbaf647bf24627b01b1a314e796138
The response should resemble the following output. Note that this user has the L3Support role, the ResellerAdmin role, and the default member role.
+----------------------------------+---------------+----------------------------------+----------------------------------+
| id                               | name          | user_id                          | tenant_id                        |
+----------------------------------+---------------+----------------------------------+----------------------------------+
| 7e77946db05645c4ba56c6c82bf3f8d2 | L3Support     | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
| 507bface531e4ac2b7019a1684df3370 | ResellerAdmin | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
| 9fe2ff9ee4384b1894a90878d3e92bab | member        | d2ff982a0b6547d0921b94957db714d6 | 00cbaf647bf24627b01b1a314e796138 |
+----------------------------------+---------------+----------------------------------+----------------------------------+
13.3.6.5 Access Policies #
Before introducing RBAC, ceilometer had very simple access control. There were two types of user: admins and users. Admins could access any API and perform any operation. Users could only access non-admin APIs and perform operations only on the Project/Tenant to which they belonged.
13.3.7 Ceilometer Metering Failover HA Support #
In the SUSE OpenStack Cloud environment, the ceilometer metering service supports native Active-Active high-availability (HA) for the notification and polling agents. Implementing HA support includes workload-balancing, workload-distribution and failover.
Tooz is the coordination engine that is used to coordinate workload among multiple active agent instances. It also maintains knowledge of which instances are active, handling failover and group membership using heartbeats (pings).
Zookeeper is used as the coordination backend. Tooz uses Zookeeper to expose the APIs that manage group membership and retrieve workload specific to each agent.
The following section in the configuration file is used to implement high-availability (HA):
[coordination]
backend_url = <IP address of Zookeeper host:port>  # the port is usually 2181, the Zookeeper default
heartbeat = 1.0
check_watchers = 10.0
For the notification agent to be configured in HA mode, additional configuration is needed in the configuration file:
[notification]
workload_partitioning = true
The HA notification agent distributes workload among multiple queues that are created based on the number of unique source:sink combinations. The combinations are configured in the notification agent pipeline configuration file. If there are additional services to be metered using notifications, then the recommendation is to use a separate source for those events. This is recommended especially if the expected load of data from that source is considered high. Implementing HA support should lead to better workload balancing among multiple active notification agents.
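As a hedged sketch of that recommendation, a separate source for a hypothetical high-volume service could look like the following pipeline fragment (the source name, interval, and meter are placeholders for illustration, not defaults shipped with the product):

```yaml
---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
      sinks:
          - meter_sink
    # Hypothetical separate source: a busy service gets its own
    # source:sink combination, and therefore its own queue, so its
    # load does not delay meters from the default source.
    - name: highvolume_source
      interval: 600
      meters:
          - "example.busy.meter"   # placeholder meter name
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
```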
ceilometer-expirer is also Active-Active HA. Tooz is used to pick an expirer process when there are multiple contenders: one acquires a lock and that winning process runs. There is no failover support, as the expirer is not a daemon and is scheduled to run at predetermined intervals.
You must ensure that only a single expirer process runs when multiple processes are scheduled to run at the same time on multiple controller nodes. This can be done using cron-based scheduling.
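On a single node, the usual way to keep cron from starting a duplicate process is flock(1); the sketch below demonstrates the mechanism with a throwaway lock file (for multiple controller nodes the lock resource would have to be shared, which is what Tooz provides in the real deployment — this is only an illustration of the single-runner idea):

```shell
# Illustration of single-instance scheduling with flock(1).
lock=/tmp/expirer-demo.lock

# First contender acquires the lock and holds it briefly.
flock "$lock" -c 'sleep 1' &
sleep 0.2

# Second contender uses -n (non-blocking): instead of starting a
# duplicate run while the lock is held, it gives up immediately.
result=$(flock -n "$lock" -c 'echo ran' || echo 'lock busy, skipping')
echo "$result"
wait
```

A cron entry would wrap the real command the same way, for example `flock -n /var/run/expirer.lock ceilometer-expirer` (path and invocation illustrative).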
The following configuration is needed to enable expirer HA:
[coordination]
backend_url = <IP address of Zookeeper host:port>  # the port is usually 2181, the Zookeeper default
heartbeat = 1.0
check_watchers = 10.0
The notification agent HA support is mainly designed to coordinate among notification agents so that correlated samples can be handled by the same agent. This happens when samples get transformed from other samples. The SUSE OpenStack Cloud ceilometer pipeline has no transformers, so this task of coordination and workload partitioning does not need to be enabled. The notification agent is deployed on multiple controller nodes and they distribute workload among themselves by randomly fetching the data from the queue.
To disable coordination and workload partitioning by OpenStack, set the following value in the configuration file:
[notification]
workload_partitioning = False
When a configuration change is made to an API running under the HA Proxy, that change needs to be replicated in all controllers.
13.3.8 Optimizing the Ceilometer Metering Service #
You can improve ceilometer responsiveness by configuring metering to store only the data you require. This topic provides strategies for getting the most out of metering while not overloading your resources.
13.3.8.1 Change the List of Meters #
The list of meters can be easily reduced or increased by editing the pipeline.yaml file and restarting the polling agent.
Sample compute-only pipeline.yaml file with the daily poll interval:
---
sources:
    - name: meter_source
      interval: 86400
      meters:
          - "instance"
          - "memory"
          - "vcpus"
          - "compute.instance.create.end"
          - "compute.instance.delete.end"
          - "compute.instance.update"
          - "compute.instance.exists"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
This change will cause all non-default meters to stop receiving notifications.
13.3.8.2 Enable Nova Notifications #
You can configure nova to send notifications by enabling the setting in the configuration file. When enabled, nova will send information to ceilometer related to its usage and VM status. You must restart nova for these changes to take effect.
The OpenStack notification daemon, also known as a polling agent, monitors the message bus for data being provided by other OpenStack components such as nova. The notification daemon loads one or more listener plugins, using the ceilometer.notification namespace. Each plugin can listen to any topic, but by default it listens to the notifications.info topic. The listeners grab messages off the defined topics and redistribute them to the appropriate plugins (endpoints) to be processed into Events and Samples. After the nova service is restarted, you should verify that the notification daemons are receiving traffic.
For a more in-depth look at how information is sent over openstack.common.rpc, refer to the OpenStack ceilometer documentation.
nova can be configured to send the following data to ceilometer:
Name | Type | Unit | Resource | Note |
instance | g | instance | inst ID | Existence of instance |
instance:<type> | g | instance | inst ID | Existence of instance of <type> (where <type> is a valid OpenStack type) |
memory | g | MB | inst ID | Amount of allocated RAM. Measured in MB. |
vcpus | g | vcpu | inst ID | Number of VCPUs |
disk.root.size | g | GB | inst ID | Size of root disk. Measured in GB. |
disk.ephemeral.size | g | GB | inst ID | Size of ephemeral disk. Measured in GB. |
To enable nova to publish notifications:
In a text editor, open the following file:
nova.conf
Compare the example of a working configuration file with the necessary changes to your configuration file. If there is anything missing in your file, add it, and then save the file.
notification_driver=messaging
notification_topics=notifications
notify_on_state_change=vm_and_task_state
instance_usage_audit=True
instance_usage_audit_period=hour
Important: The instance_usage_audit_period interval can be set to check the instance's status every hour, once a day, once a week, or once a month. Every time the audit period elapses, nova sends a notification to ceilometer to record whether or not the instance is alive and running. Metering this statistic is critical if billing depends on usage.
To restart the nova service, run:
tux > sudo systemctl restart nova-api.service
tux > sudo systemctl restart nova-conductor.service
tux > sudo systemctl restart nova-scheduler.service
tux > sudo systemctl restart nova-novncproxy.service
Important: Different platforms may use their own unique command to restart nova-compute services. If the above commands do not work, refer to the documentation for your specific platform.
To verify successful launch of each process, list the service components:
ardana > source ~/service.osrc
ardana > openstack compute service list
+----+----------------+------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary         | Host       | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+----------------+------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-conductor | controller | internal | enabled | up    | 2014-09-16T23:54:02.000000 | -               |
| 3  | nova-scheduler | controller | internal | enabled | up    | 2014-09-16T23:54:07.000000 | -               |
| 4  | nova-cert      | controller | internal | enabled | up    | 2014-09-16T23:54:00.000000 | -               |
| 5  | nova-compute   | compute1   | nova     | enabled | up    | 2014-09-16T23:54:06.000000 | -               |
+----+----------------+------------+----------+---------+-------+----------------------------+-----------------+
13.3.8.3 Improve Reporting API Responsiveness #
Reporting APIs are the main access to the metering data stored in ceilometer. These APIs are accessed by horizon to provide basic usage data and information.
SUSE OpenStack Cloud uses Apache2 Web Server to provide the API access. This topic provides some strategies to help you optimize the front-end and back-end databases.
To improve responsiveness, you can increase the number of threads and processes in the ceilometer configuration file. Each process can have a number of threads managing the filters and applications that comprise the processing pipeline.
To configure Apache2 to increase the number of threads, use the steps in Section 13.3.4, “Configure the Ceilometer Metering Service”.
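Assuming the ceilometer API runs under Apache2 with mod_wsgi (as the setup above implies), the knobs involved typically look like the following fragment; the daemon process name and the values are illustrative, not the shipped defaults:

```apache
# Illustrative mod_wsgi tuning: 4 worker processes, 10 threads each.
WSGIDaemonProcess ceilometer-api processes=4 threads=10 display-name=%{GROUP}
```

More processes trade memory for isolation between requests; more threads per process trade isolation for a smaller footprint.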
The resource usage panel could take some time to load depending on the number of metrics selected.
13.3.8.4 Update the Polling Strategy and Swift Considerations #
Polling can put an excessive amount of strain on the system due to the amount of data the system may have to process. Polling also has a severe impact on queries, since the database can have a very large amount of data to scan before responding. This usually consumes a large amount of CPU and memory. Clients can also experience long waits for queries to return and, in extreme cases, timeouts.
There are 3 polling meters in swift:
storage.objects
storage.objects.size
storage.objects.containers
Sample section of the pipeline.yaml configuration file with swift polling on an hourly interval:
---
sources:
    - name: swift_source
      interval: 3600
      meters:
          - "storage.objects"
          - "storage.objects.size"
          - "storage.objects.containers"
      sinks:
          - meter_sink
sinks:
    - name: meter_sink
      transformers:
      publishers:
          - notifier://
Every time the polling interval occurs, at least 3 messages per existing object/container in swift are collected. The following table illustrates the amount of data produced hourly in different scenarios:
swift Containers | swift Objects per container | Samples per Hour | Samples stored per 24 hours |
10 | 10 | 500 | 12000 |
10 | 100 | 5000 | 120000 |
100 | 100 | 50000 | 1200000 |
100 | 1000 | 500000 | 12000000 |
Looking at the data, we can see that even a very small swift deployment with 10 containers and 100 files per container will store 120K samples in 24 hours, which comes to a total of 3.6 million samples over a 30-day period.
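The arithmetic behind that claim can be sketched for the 10-container, 100-objects-per-container row of the table:

```shell
# Sanity check of the table arithmetic: 5000 samples per hour for the
# 10 containers x 100 objects case.
per_hour=5000
per_day=$((per_hour * 24))     # samples stored per 24 hours
per_month=$((per_day * 30))    # samples after a 30-day period
echo "$per_day $per_month"     # prints: 120000 3600000
```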
The size of each file has no impact on the number of samples collected. In fact, the smaller the number of containers or files, the smaller the number of samples. In a scenario with a large number of small files and containers, the number of samples is at its largest and performance is at its worst.
13.3.9 Metering Service Samples #
Samples are discrete collections of a particular meter, or the actual usage data defined by a meter description. Each sample is time-stamped and includes a variety of data that varies per meter, but usually includes the project ID and user ID of the entity that consumed the resource represented by the meter and sample.
In a typical deployment, the number of samples can be in the tens of thousands if not higher for a specific collection period depending on overall activity.
Sample collection and data storage expiry settings are configured in ceilometer. Use cases that include collecting data for monthly billing cycles are usually stored over a period of 45 days and require a large, scalable, back-end database to support the large volume of samples generated by production OpenStack deployments.
Example configuration:
[database]
metering_time_to_live=-1
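A value of -1 keeps samples indefinitely. To expire samples instead, metering_time_to_live takes a retention window in seconds; a 45-day window, as in the billing use case above, would look like the following sketch:

```ini
[database]
# 45 days * 24 hours * 3600 seconds = 3888000
metering_time_to_live=3888000
```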
In our example use case, to construct a complete billing record, an external billing application must collect all pertinent samples. The results must then be sorted, summarized, and combined with the results of other types of metered samples as required. This function, known as aggregation, is external to the ceilometer service.
Meter data, or samples, can also be collected directly from the service APIs by individual ceilometer polling agents. These polling agents directly access service usage by calling the API of each service.
OpenStack services such as swift currently only provide metered data through this function and some of the other OpenStack services provide specific metrics only through a polling action.
14 Managing Container as a Service (Magnum) #
The SUSE OpenStack Cloud Magnum Service makes container orchestration engines such as Docker Swarm, Kubernetes, and Apache Mesos available as first-class resources. SUSE OpenStack Cloud Magnum uses heat to orchestrate an OS image that contains Docker and Kubernetes, and runs that image in either virtual machines or on bare metal in a cluster configuration.
14.1 Deploying a Kubernetes Cluster on Fedora Atomic #
14.1.1 Prerequisites #
These steps assume the following have been completed:
The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.
Deploying a Kubernetes Cluster on Fedora Atomic requires the Fedora Atomic image fedora-atomic-26-20170723.0.x86_64.qcow2 prepared specifically for the OpenStack release. You can download the fedora-atomic-26-20170723.0.x86_64.qcow2 image from https://fedorapeople.org/groups/magnum/
14.1.2 Creating the Cluster #
The following example is created using Kubernetes Container Orchestration Engine (COE) running on Fedora Atomic guest OS on SUSE OpenStack Cloud VMs.
As the stack user, log in to the Cloud Lifecycle Manager.
Source the OpenStack admin credentials.
$ source service.osrc
If you haven't already, download the Fedora Atomic image prepared for the OpenStack Pike release.
$ wget https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-26-20170723.0/CloudImages/x86_64/images/Fedora-Atomic-26-20170723.0.x86_64.qcow2
Create a glance image.
$ openstack image create --name fedora-atomic-26-20170723.0.x86_64 --visibility public \
    --disk-format qcow2 --os-distro fedora-atomic --container-format bare \
    --file Fedora-Atomic-26-20170723.0.x86_64.qcow2 --progress
[=============================>] 100%
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | 9d233b8e7fbb7ea93f20cc839beb09ab     |
| container_format | bare                                 |
| created_at       | 2017-04-10T21:13:48Z                 |
| disk_format      | qcow2                                |
| id               | 4277115a-f254-46c0-9fb0-fffc45d2fd38 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | fedora-atomic-26-20170723.0.x86_64   |
| os_distro        | fedora-atomic                        |
| owner            | 2f5b83ab49d54aaea4b39f5082301d09     |
| protected        | False                                |
| size             | 515112960                            |
| status           | active                               |
| tags             | []                                   |
| updated_at       | 2017-04-10T21:13:56Z                 |
| virtual_size     | None                                 |
| visibility       | public                               |
+------------------+--------------------------------------+
Create a nova keypair.
$ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
Create a Magnum cluster template.
$ magnum cluster-template-create --name my-template \
    --image-id 4277115a-f254-46c0-9fb0-fffc45d2fd38 \
    --keypair-id testkey \
    --external-network-id ext-net \
    --dns-nameserver 8.8.8.8 \
    --flavor-id m1.small \
    --docker-volume-size 5 \
    --network-driver flannel \
    --coe kubernetes \
    --http-proxy http://proxy.yourcompany.net:8080/ \
    --https-proxy http://proxy.yourcompany.net:8080/
Note
Use the image_id from the openstack image create command output in the previous step.
Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should provide resolution for that hostname.
The proxy settings are only needed if the public internet (for example, https://discovery.etcd.io/ or https://gcr.io/) is not accessible without a proxy.
Create the cluster. The command below creates a minimal cluster consisting of a single Kubernetes master (kubemaster) and a single Kubernetes node (worker, kubeminion).
$ magnum cluster-create --name my-cluster --cluster-template my-template --node-count 1 --master-count 1
Immediately after you issue the cluster-create command, the cluster status should turn to CREATE_IN_PROGRESS and a stack_id should be assigned:
$ magnum cluster-show my-cluster
+---------------------+------------------------------------------------------------+
| Property            | Value                                                      |
+---------------------+------------------------------------------------------------+
| status              | CREATE_IN_PROGRESS                                         |
| cluster_template_id | 245c6bf8-c609-4ea5-855a-4e672996cbbc                       |
| uuid                | 0b78a205-8543-4589-8344-48b8cfc24709                       |
| stack_id            | 22385a42-9e15-49d9-a382-f28acef36810                       |
| status_reason       | -                                                          |
| created_at          | 2017-04-10T21:25:11+00:00                                  |
| name                | my-cluster                                                 |
| updated_at          | -                                                          |
| discovery_url       | https://discovery.etcd.io/193d122f869c497c2638021eae1ab0f7 |
| api_address         | -                                                          |
| coe_version         | -                                                          |
| master_addresses    | []                                                         |
| create_timeout      | 60                                                         |
| node_addresses      | []                                                         |
| master_count        | 1                                                          |
| container_version   | -                                                          |
| node_count          | 1                                                          |
+---------------------+------------------------------------------------------------+
You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:
$ heat resource-list -n2 22385a42-9e15-49d9-a382-f28acef36810
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
+-----------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+-------------------------+
| resource_name               | physical_resource_id                 | resource_type                     | resource_status | updated_time         | stack_name              |
+-----------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+-------------------------+
| api_address_floating_switch | 06b2cc0d-77f9-4633-8d96-f51e2db1faf3 | Magnum::FloatingIPAddressSwitcher | CREATE_COMPLETE | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv |
| api_address_lb_switch       | 965124ca-5f62-4545-bbae-8d9cda7aff2e | Magnum::ApiGatewaySwitcher        | CREATE_COMPLETE | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv |
.
.
.
The cluster is complete when all resources show CREATE_COMPLETE.
Install kubectl onto your Cloud Lifecycle Manager.
$ export https_proxy=http://proxy.yourcompany.net:8080
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.0/bin/linux/amd64/kubectl
$ chmod +x ./kubectl
$ sudo mv ./kubectl /usr/local/bin/kubectl
Generate the cluster configuration using magnum cluster-config. If the CLI option --tls-disabled was not specified during cluster template creation, authentication in the cluster will be turned on. In this case, the magnum cluster-config command will generate a client authentication certificate (cert.pem) and key (key.pem). Copy and paste the magnum cluster-config output into your command line to finalize the configuration (that is, export the KUBECONFIG environment variable).
$ mkdir my_cluster
$ cd my_cluster
/my_cluster $ ls
/my_cluster $ magnum cluster-config my-cluster
export KUBECONFIG=./config
/my_cluster $ ls
ca.pem cert.pem config key.pem
/my_cluster $ export KUBECONFIG=./config
/my_cluster $ kubectl version
Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"5cb86ee022267586db386f62781338b0483733b3", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"cffae0523cfa80ddf917aba69f08508b91f603d5", GitTreeState:"clean"}
Create a simple Nginx replication controller, exposed as a service of type NodePort.
$ cat >nginx.yml <<-EOF
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 1
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: NodePort
  ports:
    - port: 80
      nodePort: 30080
  selector:
    app: nginx
EOF
$ kubectl create -f nginx.yml
Check pod status until it turns from Pending to Running.
$ kubectl get pods
NAME                     READY     STATUS    RESTARTS   AGE
nginx-controller-5cmev   1/1       Running   0          2m
Ensure that the Nginx welcome page is displayed at port 30080 using the kubemaster floating IP.
$ http_proxy= curl http://172.31.0.6:30080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
14.2 Deploying a Kubernetes Cluster on CoreOS #
14.2.1 Prerequisites #
These steps assume the following have been completed:
The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.
Creating the Magnum cluster requires the CoreOS image for OpenStack. You can download compressed image file coreos_production_openstack_image.img.bz2 from http://stable.release.core-os.net/amd64-usr/current/.
14.2.2 Creating the Cluster #
The following example is created using Kubernetes Container Orchestration Engine (COE) running on CoreOS guest OS on SUSE OpenStack Cloud VMs.
Login to the Cloud Lifecycle Manager.
Source openstack admin credentials.
$ source service.osrc
If you haven't already, download the CoreOS image that is compatible with the OpenStack release.
NoteThe https_proxy is only needed if your environment requires a proxy.
$ export https_proxy=http://proxy.yourcompany.net:8080
$ wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_openstack_image.img.bz2
$ bunzip2 coreos_production_openstack_image.img.bz2
Create a glance image.
$ openstack image create --name coreos-magnum --visibility public \
    --disk-format raw --os-distro coreos --container-format bare \
    --file coreos_production_openstack_image.img --progress
[=============================>] 100%
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | 4110469bb15af72ec0cf78c2da4268fa     |
| container_format | bare                                 |
| created_at       | 2017-04-25T18:10:52Z                 |
| disk_format      | raw                                  |
| id               | c25fc719-2171-437f-9542-fcb8a534fbd1 |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | coreos-magnum                        |
| os_distro        | coreos                               |
| owner            | 2f5b83ab49d54aaea4b39f5082301d09     |
| protected        | False                                |
| size             | 806551552                            |
| status           | active                               |
| tags             | []                                   |
| updated_at       | 2017-04-25T18:11:07Z                 |
| virtual_size     | None                                 |
| visibility       | public                               |
+------------------+--------------------------------------+
Create a nova keypair.
$ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
Create a Magnum cluster template.
$ magnum cluster-template-create --name my-coreos-template \ --image-id c25fc719-2171-437f-9542-fcb8a534fbd1 \ --keypair-id testkey \ --external-network-id ext-net \ --dns-nameserver 8.8.8.8 \ --flavor-id m1.small \ --docker-volume-size 5 \ --network-driver flannel \ --coe kubernetes \ --http-proxy http://proxy.yourcompany.net:8080/ \ --https-proxy http://proxy.yourcompany.net:8080/
Note: Use the image_id from the openstack image create command output in the previous step. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should provide resolution for that hostname.
The proxy settings are only needed if public internet sites (for example, https://discovery.etcd.io/ or https://gcr.io/) are not accessible without a proxy.
Create the cluster. The command below creates a minimal cluster consisting of a single Kubernetes master (kubemaster) and a single Kubernetes node (worker, kubeminion).
$ magnum cluster-create --name my-coreos-cluster --cluster-template my-coreos-template --node-count 1 --master-count 1
Almost immediately after issuing the cluster-create command, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.
$ magnum cluster-show my-coreos-cluster +---------------------+------------------------------------------------------------+ | Property | Value | +---------------------+------------------------------------------------------------+ | status | CREATE_IN_PROGRESS | | cluster_template_id | c48fa7c0-8dd9-4da4-b599-9e62dc942ca5 | | uuid | 6b85e013-f7c3-4fd3-81ea-4ea34201fd45 | | stack_id | c93f873a-d563-4721-9bd9-3bae2340750a | | status_reason | - | | created_at | 2017-04-25T22:38:43+00:00 | | name | my-coreos-cluster | | updated_at | - | | discovery_url | https://discovery.etcd.io/6e4c0e5ff5e5b9872173d06880886a0c | | api_address | - | | coe_version | - | | master_addresses | [] | | create_timeout | 60 | | node_addresses | [] | | master_count | 1 | | container_version | - | | node_count | 1 | +---------------------+------------------------------------------------------------+
You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:
$ heat resource-list -n2 c93f873a-d563-4721-9bd9-3bae2340750a WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead | resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name | | api_address_switch | | Magnum::ApiGatewaySwitcher | INIT_COMPLETE | 2017-04-25T22:38:42Z | my-coreos-cluster-mscybll54eoj | . . .
The cluster is complete when all resources show CREATE_COMPLETE.
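The completion check can also be scripted. The helper below is a sketch (the function name and sample rows are illustrative, not taken from a live cloud); it scans resource-list style output and reports whether every resource has reached CREATE_COMPLETE:

```shell
# Sketch: succeed only if no resource row is still pending or failed.
# Feed it the captured output of "openstack stack resource list <stack_id>".
all_resources_complete() {
    ! printf '%s\n' "$1" | grep -E 'INIT_COMPLETE|CREATE_IN_PROGRESS|CREATE_FAILED' > /dev/null
}

# Illustrative sample rows, mimicking the table shown above:
sample='| api_address_switch | Magnum::ApiGatewaySwitcher | INIT_COMPLETE |
| kube_masters       | OS::Heat::ResourceGroup    | CREATE_COMPLETE |'

if all_resources_complete "$sample"; then
    echo "cluster complete"
else
    echo "still creating"
fi
```

In a real run, the sample variable would be replaced with the captured command output, and the check could be polled in a loop until it succeeds.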
Install kubectl onto your Cloud Lifecycle Manager.
$ export https_proxy=http://proxy.yourcompany.net:8080
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.2.0/bin/linux/amd64/kubectl
$ chmod +x ./kubectl
$ sudo mv ./kubectl /usr/local/bin/kubectl
Generate the cluster configuration using magnum cluster-config. If the CLI option --tls-disabled was not specified during cluster template creation, authentication in the cluster will be turned on. In this case, the magnum cluster-config command will generate a client authentication certificate (cert.pem) and key (key.pem). Copy and paste the magnum cluster-config output into your command line to finalize the configuration (that is, to export the KUBECONFIG environment variable).
$ mkdir my_cluster $ cd my_cluster /my_cluster $ ls /my_cluster $ magnum cluster-config my-cluster export KUBECONFIG=./config /my_cluster $ ls ca.pem cert.pem config key.pem /my_cluster $ export KUBECONFIG=./config /my_cluster $ kubectl version Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"5cb86ee022267586db386f62781338b0483733b3", GitTreeState:"clean"} Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.0", GitCommit:"cffae0523cfa80ddf917aba69f08508b91f603d5", GitTreeState:"clean"}
Create a simple Nginx replication controller, exposed as a service of type NodePort.
$ cat >nginx.yml <<-EOF apiVersion: v1 kind: ReplicationController metadata: name: nginx-controller spec: replicas: 1 selector: app: nginx template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx ports: - containerPort: 80 --- apiVersion: v1 kind: Service metadata: name: nginx-service spec: type: NodePort ports: - port: 80 nodePort: 30080 selector: app: nginx EOF $ kubectl create -f nginx.yml
Check pod status until it turns from Pending to Running.
$ kubectl get pods NAME READY STATUS RESTARTS AGE nginx-controller-5cmev 1/1 Running 0 2m
Ensure that the Nginx welcome page is displayed at port 30080 using the kubemaster floating IP.
$ http_proxy= curl http://172.31.0.6:30080 <!DOCTYPE html> <html> <head> <title>Welcome to nginx!</title>
14.3 Deploying a Docker Swarm Cluster on Fedora Atomic #
14.3.1 Prerequisites #
These steps assume the following have been completed:
The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.
Deploying a Docker Swarm Cluster on Fedora Atomic requires the Fedora Atomic image fedora-atomic-26-20170723.0.x86_64.qcow2 prepared specifically for the OpenStack Pike release. You can download the fedora-atomic-26-20170723.0.x86_64.qcow2 image from https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-26-20170723.0/CloudImages/x86_64/
14.3.2 Creating the Cluster #
The following example creates a cluster using the Docker Swarm Container Orchestration Engine (COE) running on the Fedora Atomic guest OS on SUSE OpenStack Cloud VMs.
As the stack user, log in to the Cloud Lifecycle Manager.
Source the OpenStack admin credentials.
$ source service.osrc
If you have not already done so, download the Fedora Atomic image prepared for the OpenStack Pike release.
$ wget https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-26-20170723.0/CloudImages/x86_64/images/Fedora-Atomic-26-20170723.0.x86_64.qcow2
Create a glance image.
$ openstack image create --name fedora-atomic-26-20170723.0.x86_64 --visibility public \ --disk-format qcow2 --os-distro fedora-atomic --container-format bare \ --file Fedora-Atomic-26-20170723.0.x86_64.qcow2 --progress [=============================>] 100% +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | checksum | 9d233b8e7fbb7ea93f20cc839beb09ab | | container_format | bare | | created_at | 2017-04-10T21:13:48Z | | disk_format | qcow2 | | id | 4277115a-f254-46c0-9fb0-fffc45d2fd38 | | min_disk | 0 | | min_ram | 0 | | name | fedora-atomic-26-20170723.0.x86_64 | | os_distro | fedora-atomic | | owner | 2f5b83ab49d54aaea4b39f5082301d09 | | protected | False | | size | 515112960 | | status | active | | tags | [] | | updated_at | 2017-04-10T21:13:56Z | | virtual_size | None | | visibility | public | +------------------+--------------------------------------+
Create a nova keypair.
$ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
Create a Magnum cluster template.
Note: The --tls-disabled flag is not specified in the included template, so authentication via client certificate will be turned on in clusters created from this template.
$ magnum cluster-template-create --name my-swarm-template \ --image-id 4277115a-f254-46c0-9fb0-fffc45d2fd38 \ --keypair-id testkey \ --external-network-id ext-net \ --dns-nameserver 8.8.8.8 \ --flavor-id m1.small \ --docker-volume-size 5 \ --network-driver docker \ --coe swarm \ --http-proxy http://proxy.yourcompany.net:8080/ \ --https-proxy http://proxy.yourcompany.net:8080/
Note: Use the image_id from the openstack image create command output in the previous step. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should provide resolution for that hostname.
The proxy settings are only needed if public internet sites (for example, https://discovery.etcd.io/ or https://gcr.io/) are not accessible without a proxy.
Create the cluster. The command below creates a minimal cluster consisting of a single Swarm master and a single Swarm node (worker).
$ magnum cluster-create --name my-swarm-cluster --cluster-template my-swarm-template \ --node-count 1 --master-count 1
Immediately after issuing the cluster-create command, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.
$ magnum cluster-show my-swarm-cluster +---------------------+------------------------------------------------------------+ | Property | Value | +---------------------+------------------------------------------------------------+ | status | CREATE_IN_PROGRESS | | cluster_template_id | 17df266e-f8e1-4056-bdee-71cf3b1483e3 | | uuid | c3e13e5b-85c7-44f4-839f-43878fe5f1f8 | | stack_id | 3265d843-3677-4fed-bbb7-e0f56c27905a | | status_reason | - | | created_at | 2017-04-21T17:13:08+00:00 | | name | my-swarm-cluster | | updated_at | - | | discovery_url | https://discovery.etcd.io/54e83ea168313b0c2109d0f66cd0aa6f | | api_address | - | | coe_version | - | | master_addresses | [] | | create_timeout | 60 | | node_addresses | [] | | master_count | 1 | | container_version | - | | node_count | 1 | +---------------------+------------------------------------------------------------+
You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:
$ heat resource-list -n2 3265d843-3677-4fed-bbb7-e0f56c27905a WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead +--------------------+--------------------------------------+--------------------------------------------+-----------------+----------------------+-------------------------------+ | resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name | |--------------------+--------------------------------------+--------------------------------------------+-----------------+----------------------+-------------------------------+ | api_address_switch | 430f82f2-03e3-4085-8c07-b4a6b6d7e261 | Magnum::ApiGatewaySwitcher | CREATE_COMPLETE | 2017-04-21T17:13:07Z | my-swarm-cluster-j7gbjcxaremy | . . .
The cluster is complete when all resources show CREATE_COMPLETE. You can also obtain the floating IP address once the cluster has been created.
$ magnum cluster-show my-swarm-cluster +---------------------+------------------------------------------------------------+ | Property | Value | +---------------------+------------------------------------------------------------+ | status | CREATE_COMPLETE | | cluster_template_id | 17df266e-f8e1-4056-bdee-71cf3b1483e3 | | uuid | c3e13e5b-85c7-44f4-839f-43878fe5f1f8 | | stack_id | 3265d843-3677-4fed-bbb7-e0f56c27905a | | status_reason | Stack CREATE completed successfully | | created_at | 2017-04-21T17:13:08+00:00 | | name | my-swarm-cluster | | updated_at | 2017-04-21T17:18:26+00:00 | | discovery_url | https://discovery.etcd.io/54e83ea168313b0c2109d0f66cd0aa6f | | api_address | tcp://172.31.0.7:2376 | | coe_version | 1.0.0 | | master_addresses | ['172.31.0.7'] | | create_timeout | 60 | | node_addresses | ['172.31.0.5'] | | master_count | 1 | | container_version | 1.9.1 | | node_count | 1 | +---------------------+------------------------------------------------------------+
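Fields such as api_address can also be extracted from the table programmatically instead of copying them by hand. Below is a sketch using awk, run against a two-row sample copied from the output above (the field helper name is illustrative):

```shell
# Sketch: pull the value column for a named property out of
# "magnum cluster-show" table output.
show_output="| api_address | tcp://172.31.0.7:2376 |
| master_addresses | ['172.31.0.7'] |"

field() {
    printf '%s\n' "$show_output" | \
        awk -F'|' -v name="$1" '$2 ~ name { gsub(/^ +| +$/, "", $3); print $3 }'
}

field api_address        # prints tcp://172.31.0.7:2376
```

In practice, show_output would be captured directly from the magnum cluster-show command.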
Generate and sign the client certificate using the magnum cluster-config command.
$ mkdir my_swarm_cluster $ cd my_swarm_cluster/ ~/my_swarm_cluster $ magnum cluster-config my-swarm-cluster {'tls': True, 'cfg_dir': '.', 'docker_host': u'tcp://172.31.0.7:2376'} ~/my_swarm_cluster $ ls ca.pem cert.pem key.pem
Copy the generated certificates and key to the ~/.docker folder on the first cluster master node.
$ scp -r ~/my_swarm_cluster fedora@172.31.0.7:.docker ca.pem 100% 1066 1.0KB/s 00:00 key.pem 100% 1679 1.6KB/s 00:00 cert.pem 100% 1005 1.0KB/s 00:00
Log in to the first master node and set up the cluster access environment variables.
$ ssh fedora@172.31.0.7 [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ export DOCKER_TLS_VERIFY=1 [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ export DOCKER_HOST=tcp://172.31.0.7:2376
Verify that the Swarm container is up and running.
[fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ docker ps -a CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES fcbfab53148c swarm:1.0.0 "/swarm join --addr 1" 24 minutes ago Up 24 minutes 2375/tcp my-xggjts5zbgr-0-d4qhxhdujh4q-swarm-node-vieanhwdonon.novalocal/swarm-agent
Deploy a sample Docker application (Nginx) and verify that Nginx is serving requests on port 8080 on the worker node(s), on both the floating and private IPs:
[fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ docker run -itd -p 8080:80 nginx 192030325fef0450b7b917af38da986edd48ac5a6d9ecb1e077b017883d18802 [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ docker port 192030325fef 80/tcp -> 10.0.0.11:8080 [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ curl http://10.0.0.11:8080 <!DOCTYPE html> <html> <head> <title>Welcome to nginx!</title> <style> ... [fedora@my-6zxz5ukdu-0-bvqbsn2z2uwo-swarm-master-n6wfplu7jcwo ~]$ curl http://172.31.0.5:8080 <!DOCTYPE html> <html> <head> <title>Welcome to nginx!</title> <style> ...
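The docker port output shown above can be turned directly into a curl target with a little shell; a minimal sketch (the sample line is copied from the output above):

```shell
# Sketch: derive a test URL from "docker port" output of the
# form "80/tcp -> 10.0.0.11:8080".
port_line='80/tcp -> 10.0.0.11:8080'
endpoint=${port_line##* }    # keep everything after the last space
echo "http://$endpoint"      # prints http://10.0.0.11:8080
```

Running curl against this URL (and against the node's floating IP on the same port) is exactly the verification performed above.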
14.4 Deploying an Apache Mesos Cluster on Ubuntu #
14.4.1 Prerequisites #
These steps assume the following have been completed:
The Magnum service has been installed. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.
Deploying an Apache Mesos cluster requires an Ubuntu image with Mesos that is compatible with the OpenStack release. You can download the ubuntu-mesos-latest.qcow2 image from https://fedorapeople.org/groups/magnum/
14.4.2 Creating the Cluster #
The following example creates a cluster using the Apache Mesos Container Orchestration Engine (COE) running on an Ubuntu guest OS on SUSE OpenStack Cloud VMs.
As the stack user, log in to the Cloud Lifecycle Manager.
Source the OpenStack admin credentials.
$ source service.osrc
If you have not already done so, download the Ubuntu Mesos image that is compatible with the OpenStack release.
Note: The https_proxy variable is only needed if your environment requires a proxy.
$ https_proxy=http://proxy.yourcompany.net:8080 wget https://fedorapeople.org/groups/magnum/ubuntu-mesos-latest.qcow2
Create a glance image.
$ openstack image create --name ubuntu-mesos-latest --visibility public --disk-format qcow2 --os-distro ubuntu --container-format bare --file ubuntu-mesos-latest.qcow2 --progress [=============================>] 100% +------------------+--------------------------------------+ | Property | Value | +------------------+--------------------------------------+ | checksum | 97cc1fdb9ca80bf80dbd6842aab7dab5 | | container_format | bare | | created_at | 2017-04-21T19:40:20Z | | disk_format | qcow2 | | id | d6a4e6f9-9e34-4816-99fe-227e0131244f | | min_disk | 0 | | min_ram | 0 | | name | ubuntu-mesos-latest | | os_distro | ubuntu | | owner | 2f5b83ab49d54aaea4b39f5082301d09 | | protected | False | | size | 753616384 | | status | active | | tags | [] | | updated_at | 2017-04-21T19:40:32Z | | virtual_size | None | | visibility | public | +------------------+--------------------------------------+
Create a nova keypair.
$ test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
$ openstack keypair create --pub-key ~/.ssh/id_rsa.pub testkey
Create a Magnum cluster template.
$ magnum cluster-template-create --name my-mesos-template \ --image-id d6a4e6f9-9e34-4816-99fe-227e0131244f \ --keypair-id testkey \ --external-network-id ext-net \ --dns-nameserver 8.8.8.8 \ --flavor-id m1.small \ --docker-volume-size 5 \ --network-driver docker \ --coe mesos \ --http-proxy http://proxy.yourcompany.net:8080/ \ --https-proxy http://proxy.yourcompany.net:8080/
Note: Use the image_id from the openstack image create command output in the previous step. Use your organization's DNS server. If the SUSE OpenStack Cloud public endpoint is configured with a hostname, this server should provide resolution for that hostname.
The proxy settings are only needed if public internet sites (for example, https://discovery.etcd.io/ or https://gcr.io/) are not accessible without a proxy.
Create the cluster. The command below creates a minimal cluster consisting of a single Mesos master and a single Mesos node (worker).
$ magnum cluster-create --name my-mesos-cluster --cluster-template my-mesos-template --node-count 1 --master-count 1
Immediately after issuing the cluster-create command, the cluster status should change to CREATE_IN_PROGRESS and a stack_id should be assigned.
$ magnum cluster-show my-mesos-cluster +---------------------+--------------------------------------+ | Property | Value | +---------------------+--------------------------------------+ | status | CREATE_IN_PROGRESS | | cluster_template_id | be354919-fa6c-4db8-9fd1-69792040f095 | | uuid | b1493402-8571-4683-b81e-ddc129ff8937 | | stack_id | 50aa20a6-bf29-4663-9181-cf7ba3070a25 | | status_reason | - | | created_at | 2017-04-21T19:50:34+00:00 | | name | my-mesos-cluster | | updated_at | - | | discovery_url | - | | api_address | - | | coe_version | - | | master_addresses | [] | | create_timeout | 60 | | node_addresses | [] | | master_count | 1 | | container_version | - | | node_count | 1 | +---------------------+--------------------------------------+
You can monitor cluster creation progress by listing the resources of the heat stack. Use the stack_id value from the magnum cluster-show output above in the following command:
$ heat resource-list -n2 50aa20a6-bf29-4663-9181-cf7ba3070a25 WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead +------------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+-------------------------------+ | resource_name | physical_resource_id | resource_type | resource_status | updated_time | stack_name | +------------------------------+--------------------------------------+-----------------------------------+-----------------+----------------------+-------------------------------+ | add_proxy_master | 10394a74-1503-44b4-969a-44258c9a7be1 | OS::Heat::SoftwareConfig | CREATE_COMPLETE | 2017-04-21T19:50:33Z | my-mesos-cluster-w2trq7m46qus | | add_proxy_master_deployment | | OS::Heat::SoftwareDeploymentGroup | INIT_COMPLETE | 2017-04-21T19:50:33Z | my-mesos-cluster-w2trq7m46qus | ...
The cluster is complete when all resources show CREATE_COMPLETE.
$ magnum cluster-show my-mesos-cluster +---------------------+--------------------------------------+ | Property | Value | +---------------------+--------------------------------------+ | status | CREATE_COMPLETE | | cluster_template_id | 9e942bfa-2c78-4837-82f5-6bea88ba1bf9 | | uuid | 9d7bb502-8865-4cbd-96fa-3cd75f0f6945 | | stack_id | 339a72b4-a131-47c6-8d10-365e6f6a18cf | | status_reason | Stack CREATE completed successfully | | created_at | 2017-04-24T20:54:31+00:00 | | name | my-mesos-cluster | | updated_at | 2017-04-24T20:59:18+00:00 | | discovery_url | - | | api_address | 172.31.0.10 | | coe_version | - | | master_addresses | ['172.31.0.10'] | | create_timeout | 60 | | node_addresses | ['172.31.0.5'] | | master_count | 1 | | container_version | 1.9.1 | | node_count | 1 | +---------------------+--------------------------------------+
Verify that the Marathon web console is available at http://${MASTER_IP}:8080/ and that the Mesos UI is available at http://${MASTER_IP}:5050/
Create an example Mesos application.
$ mkdir my_mesos_cluster
$ cd my_mesos_cluster/
$ cat > sample.json <<-EOF { "id": "sample", "cmd": "python3 -m http.server 8080", "cpus": 0.5, "mem": 32.0, "container": { "type": "DOCKER", "docker": { "image": "python:3", "network": "BRIDGE", "portMappings": [ { "containerPort": 8080, "hostPort": 0 } ] } } } EOF
$ curl -s -X POST -H "Content-Type: application/json" \ http://172.31.0.10:8080/v2/apps -d@sample.json | json_pp { "dependencies" : [], "healthChecks" : [], "user" : null, "mem" : 32, "requirePorts" : false, "tasks" : [], "cpus" : 0.5, "upgradeStrategy" : { "minimumHealthCapacity" : 1, "maximumOverCapacity" : 1 }, "maxLaunchDelaySeconds" : 3600, "disk" : 0, "constraints" : [], "executor" : "", "cmd" : "python3 -m http.server 8080", "id" : "/sample", "labels" : {}, "ports" : [ 0 ], "storeUrls" : [], "instances" : 1, "tasksRunning" : 0, "tasksHealthy" : 0, "acceptedResourceRoles" : null, "env" : {}, "tasksStaged" : 0, "tasksUnhealthy" : 0, "backoffFactor" : 1.15, "version" : "2017-04-25T16:37:40.657Z", "uris" : [], "args" : null, "container" : { "volumes" : [], "docker" : { "portMappings" : [ { "containerPort" : 8080, "hostPort" : 0, "servicePort" : 0, "protocol" : "tcp" } ], "parameters" : [], "image" : "python:3", "forcePullImage" : false, "network" : "BRIDGE", "privileged" : false }, "type" : "DOCKER" }, "deployments" : [ { "id" : "6fbe48f0-6a3c-44b7-922e-b172bcae1be8" } ], "backoffSeconds" : 1 }
Wait for the sample application to start. Use the REST API or the Marathon web console to monitor its status:
$ curl -s http://172.31.0.10:8080/v2/apps/sample | json_pp { "app" : { "deployments" : [], "instances" : 1, "tasks" : [ { "id" : "sample.7fdd1ee4-29d5-11e7-9ee0-02427da4ced1", "stagedAt" : "2017-04-25T16:37:40.807Z", "version" : "2017-04-25T16:37:40.657Z", "ports" : [ 31827 ], "appId" : "/sample", "slaveId" : "21444bc5-3eb8-49cd-b020-77041e0c88d0-S0", "host" : "10.0.0.9", "startedAt" : "2017-04-25T16:37:42.003Z" } ], "upgradeStrategy" : { "maximumOverCapacity" : 1, "minimumHealthCapacity" : 1 }, "storeUrls" : [], "requirePorts" : false, "user" : null, "id" : "/sample", "acceptedResourceRoles" : null, "tasksRunning" : 1, "cpus" : 0.5, "executor" : "", "dependencies" : [], "args" : null, "backoffFactor" : 1.15, "ports" : [ 10000 ], "version" : "2017-04-25T16:37:40.657Z", "container" : { "volumes" : [], "docker" : { "portMappings" : [ { "servicePort" : 10000, "protocol" : "tcp", "hostPort" : 0, "containerPort" : 8080 } ], "forcePullImage" : false, "parameters" : [], "image" : "python:3", "privileged" : false, "network" : "BRIDGE" }, "type" : "DOCKER" }, "constraints" : [], "tasksStaged" : 0, "env" : {}, "mem" : 32, "disk" : 0, "labels" : {}, "tasksHealthy" : 0, "healthChecks" : [], "cmd" : "python3 -m http.server 8080", "backoffSeconds" : 1, "maxLaunchDelaySeconds" : 3600, "versionInfo" : { "lastConfigChangeAt" : "2017-04-25T16:37:40.657Z", "lastScalingAt" : "2017-04-25T16:37:40.657Z" }, "uris" : [], "tasksUnhealthy" : 0 } }
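Instead of reading the JSON by eye, the assigned host and port can be extracted with python3. The sketch below runs against a trimmed copy of the response shown above:

```shell
# Sketch: print host:port of the first task from a saved Marathon
# /v2/apps/<app> response (trimmed here to the relevant fields).
response='{"app": {"tasks": [{"host": "10.0.0.9", "ports": [31827]}]}}'
printf '%s' "$response" | python3 -c '
import json, sys
task = json.load(sys.stdin)["app"]["tasks"][0]
print("%s:%d" % (task["host"], task["ports"][0]))
'
```

In practice, response would be captured from the curl call shown above.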
Verify that the deployed application is responding on the automatically assigned port on the floating IP address of the worker node.
$ curl http://172.31.0.5:31827 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>Directory listing for /</title> ...
14.5 Creating a Magnum Cluster with the Dashboard #
You can alternatively create a cluster template and cluster with the Magnum
UI in horizon. The example instructions below demonstrate how to deploy a
Kubernetes Cluster using the Fedora Atomic image. Other deployments such as
Kubernetes on CoreOS, Docker Swarm on Fedora, and Mesos on Ubuntu all follow
the same set of instructions mentioned below with slight variations to their
parameters. You can determine those parameters by looking at the previous
set of CLI instructions in the
magnum cluster-template-create
and
magnum cluster-create
commands.
14.5.1 Prerequisites #
Magnum must be installed before proceeding. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 26 “Magnum Overview”, Section 26.2 “Install the Magnum Service”.
Important: Pay particular attention to external-name: in data/network_groups.yml. This cannot be set to the default myardana.test and must be a valid DNS-resolvable FQDN. If you do not have a DNS-resolvable FQDN, remove or comment out the external-name entry; the public endpoint will then use an IP address instead of a name.
The image on which you want to base your cluster must already have been uploaded into glance. See the previous CLI instructions on deploying a cluster for how this is done.
14.5.2 Creating the Cluster Template #
You will need access to the Dashboard to create the cluster template.
Open a web browser that has both JavaScript and cookies enabled. In the address bar, enter the host name or IP address for the dashboard.
On the login page, enter your user name and password and then click the login button.
Make sure you are in the appropriate domain and project in the drop-down box in the left pane.
A key pair is required for cluster template creation. It is applied to the VMs created during the cluster creation process, which allows SSH access to your cluster's VMs. If you would like to create a new key pair, do so in the key pairs section of the dashboard.
Go to the cluster templates section of the dashboard. Insert the CLUSTER_NAME and click the create button with the following options. Proxies are only needed if the created VMs require a proxy to connect externally.
The master load balancer option should be turned off; LBaaS v2 (Octavia) is not available in SUSE OpenStack Cloud.
Click the create button to create the cluster template; you should see my-template in the list of templates.
14.5.3 Creating the Cluster #
15 System Maintenance #
This section contains the following subsections to help you manage, configure, and maintain your SUSE OpenStack Cloud environment, as well as procedures for performing node maintenance.
15.1 Planned System Maintenance #
Planned maintenance tasks for your cloud. See sections below for:
15.1.1 Whole Cloud Maintenance #
Planned maintenance procedures for your whole cloud.
15.1.1.1 Bringing Down Your Cloud: Services Down Method #
If you have a planned maintenance and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.
If you prefer a rolling-reboot method in which your cloud services continue running, see Section 15.1.1.2, “Rolling Reboot of the Cloud”.
To perform backups prior to these steps, visit the backup and restore pages first at Chapter 17, Backup and Restore.
15.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment #
You will do the following steps from your Cloud Lifecycle Manager.
Log in to your Cloud Lifecycle Manager.
Gracefully shut down your cloud by running the ardana-stop.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml
Shut down and restart your nodes. There are multiple ways you can do this:
You can SSH to each node and use sudo reboot -f to reboot it. Reboot the control plane nodes first so that they become functional as early as possible.
You can shut down the nodes and then physically restart them, either via the power button or via the IPMI. If your cloud data model (servers.yml) specifies iLO connectivity for all nodes, you can use the bm-power-down.yml and bm-power-up.yml playbooks on the Cloud Lifecycle Manager. Power down the control plane nodes last so that they remain online as long as possible, and power them back up before the other nodes to restore their services quickly.
Perform the necessary maintenance.
After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.
Determine the current power status of the nodes in your environment:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-status.yml
If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
Note: Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.
Bring the databases back up:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
Gracefully bring up your cloud services by running the ardana-start.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml
Pause for a few minutes and give the cloud environment time to come up completely and then verify the status of the individual services using this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
If any services did not start properly, you can run playbooks for the specific services having issues.
For example:
If RabbitMQ fails, run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
You can check the status of RabbitMQ afterwards with this:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
If the recovery fails, you can run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml
Each of the other services has a playbook in the ~/scratch/ansible/next/ardana/ansible directory in the format <service>-start.yml that you can run. One example, for the Compute service, is nova-start.yml.
Continue checking the status of your SUSE OpenStack Cloud 9 services until there are no more failed or unreachable nodes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
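If services take a few attempts to settle, the status check can be wrapped in a simple retry helper. This is a generic sketch; in practice the command passed to it would be the ansible-playbook invocation shown above:

```shell
# Sketch: run a command until it succeeds, up to a fixed number of
# attempts, pausing briefly between tries.
retry() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i failed, retrying..." >&2
        i=$((i + 1))
        sleep 1               # a real run would likely wait longer
    done
    return 1
}

# Placeholder command; substitute the status playbook invocation:
retry 3 true && echo "status check passed"
```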
15.1.1.2 Rolling Reboot of the Cloud #
If you have a planned maintenance and need to bring down your entire cloud and restart services while minimizing downtime, follow the steps here to safely restart your cloud. If you do not mind your services being down, then another option for planned maintenance can be found at Section 15.1.1.1, “Bringing Down Your Cloud: Services Down Method”.
15.1.1.2.1 Recommended node reboot order #
To ensure that rebooted nodes reintegrate into the cluster, the key is having enough time between controller reboots.
The recommended way to achieve this is as follows:
Reboot the controller nodes one by one, with a suitable interval in between. If you alternate between controllers and compute nodes, you will gain more time between the controller reboots.
Reboot the compute nodes (if present in your cloud).
Reboot the swift nodes (if present in your cloud).
Reboot the ESX nodes (if present in your cloud).
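The alternating order suggested above can be sketched as a small helper that interleaves the two lists. Host names here are hypothetical placeholders; the script only prints the order and does not reboot anything:

```shell
# Sketch: print a reboot order that alternates controller and compute
# nodes so each controller has extra settle time before the next
# controller reboot.
reboot_order() {
    ctl=$1
    set -- $2                 # positional parameters = compute nodes
    for c in $ctl; do
        echo "reboot $c"
        if [ $# -gt 0 ]; then
            echo "reboot $1"
            shift
        fi
    done
    for n in "$@"; do         # any computes left over
        echo "reboot $n"
    done
}

reboot_order "controller1 controller2 controller3" "compute1 compute2"
```

Leftover compute nodes, then swift and ESX nodes, would follow the printed order.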
15.1.1.2.2 Rebooting controller nodes #
Turn off the keystone Fernet Token-Signing Key Rotation
Before rebooting any controller node, you need to ensure that the keystone Fernet token-signing key rotation is turned off. Run the following command:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml
Migrate singleton services first
If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:
ardana > sudo systemctl start apache2
The first consideration before rebooting any controller nodes is that a few services run as singletons (non-HA) and will therefore be unavailable while the controller they run on is down. Typically this is a very small window, but if you want to retain service during the reboot of that server, you should take special action to maintain it, such as migrating the service.
For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.
For the cinder-volume singleton service:
Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:
ardana > ps auxww | grep cinder-volume | grep -v grep
Run the cinder-migrate-volume.yml playbook. Details about the cinder volume and backup migration instructions can be found in Section 8.1.3, “Managing cinder Volume and Backup Services”.
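The ps check above can be run from the Cloud Lifecycle Manager against every controller in one pass. A sketch only; the SSH probe loop is illustrative and the hostnames you pass in are your own controller names:

```shell
#!/usr/bin/env bash
# Probe each controller over SSH and print the one running cinder-volume.
# The [c]inder pattern stops grep from matching its own process on the far side.
find_cinder_volume_host() {
    for host in "$@"; do
        if ssh "$host" "ps auxww | grep [c]inder-volume" > /dev/null 2>&1; then
            echo "$host"
            return 0
        fi
    done
    return 1   # no controller reported a cinder-volume process
}
```

For example: `find_cinder_volume_host ardana-cp1-c1-m1-mgmt ardana-cp1-c1-m2-mgmt ardana-cp1-c1-m3-mgmt`.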
For the SNAT namespace singleton service:
If you reboot the controller node hosting the SNAT namespace service on it, Compute instances without floating IPs will lose network connectivity when that controller is rebooted. To prevent this from happening, you can use these steps to determine which controller node is hosting the SNAT namespace service and migrate it to one of the other controller nodes while that node is rebooted.
Locate the SNAT node where the router is providing the active snat_service. From the Cloud Lifecycle Manager, list your ports to determine which port is serving as the router gateway:
ardana > source ~/service.osrc
ardana > openstack port list --device_owner network:router_gateway
Example:
$ openstack port list --device_owner network:router_gateway
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| id                                   | name | mac_address       | fixed_ips                                                                           |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| 287746e6-7d82-4b2c-914c-191954eba342 |      | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
Look at the details of this port to determine the binding:host_id value, which identifies the host the port is bound to:
ardana > openstack port show <port_id>
Example, with the value you need in the binding:host_id field:
ardana > openstack port show 287746e6-7d82-4b2c-914c-191954eba342
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                        |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                         |
| allowed_address_pairs |                                                                                                              |
| binding:host_id       | ardana-cp1-c1-m2-mgmt                                                                                        |
| binding:profile       | {}                                                                                                           |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
| binding:vif_type      | ovs                                                                                                          |
| binding:vnic_type     | normal                                                                                                       |
| device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
| device_owner          | network:router_gateway                                                                                       |
| dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
| dns_name              |                                                                                                              |
| extra_dhcp_opts       |                                                                                                              |
| fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
| id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
| mac_address           | fa:16:3e:2e:26:ac                                                                                            |
| name                  |                                                                                                              |
| network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
| security_groups       |                                                                                                              |
| status                | DOWN                                                                                                         |
| tenant_id             |                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
In this example, ardana-cp1-c1-m2-mgmt is the node hosting the SNAT namespace service.
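Instead of reading the field out of the table by eye, the value can be extracted with awk. A sketch that assumes the default table layout of `openstack port show`, as in the example above:

```shell
#!/usr/bin/env bash
# Pull the binding:host_id value out of `openstack port show` table output.
# Usage: openstack port show <port_id> | extract_binding_host
extract_binding_host() {
    awk -F'|' '$2 ~ /binding:host_id/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $3)   # trim the padding around the value
        print $3
    }'
}
```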
SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:
ardana > ssh <IP_of_SNAT_namespace_host>
ardana > sudo ip netns exec snat-<router_ID> bash
Example:
ardana > sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
Obtain the ID for the L3 agent on the node hosting your SNAT namespace:
ardana > source ~/service.osrc
ardana > openstack network agent list
Example, with the entry you need given the examples above:
ardana > openstack network agent list
+--------------------------------------+--------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+--------------------------+-------+----------------+---------------------------+
| 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent         | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent     | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-metadata-agent    |
| 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent           | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-vpn-agent         |
| 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent           | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-l3-agent          |
| 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent           | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-vpn-agent         |
| 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent     | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-metadata-agent    |
| 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent     | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-metadata-agent    |
| a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent           | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-vpn-agent         |
| a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent         | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-openvswitch-agent |
| e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent         | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent     | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-metadata-agent    |
+--------------------------------------+--------------------+--------------------------+-------+----------------+---------------------------+
Also obtain the ID for the L3 agent of the node you are going to move the SNAT namespace service to, using the same commands as in the previous step.
Use these commands to move the SNAT namespace service, with router_id being the same value as the ID of the router.
Remove the L3 agent for the old host:
ardana > openstack network agent remove router --agent-type l3 <agent_id_of_snat_namespace_host> <qrouter_uuid>
Example:
ardana > openstack network agent remove router --agent-type l3 a209c67d-c00f-4a00-b31c-0db30e9ec661 e122ea3f-90c5-4662-bf4a-3889f677aacf
Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent
Remove the SNAT namespace:
ardana > sudo ip netns delete snat-<router_id>
Example:
ardana > sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf
Create a new L3 agent for the new host:
ardana > openstack network agent add router --agent-type l3 <agent_id_of_new_snat_namespace_host> <qrouter_uuid>
Example:
ardana > openstack network agent add router --agent-type l3 3bc28451-c895-437b-999d-fdcff259b016 e122ea3f-90c5-4662-bf4a-3889f677aacf
Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent
Confirm that it has been moved by listing the details of your port from step 1b above; the value of binding:host_id should now show the host you moved your SNAT namespace to:
ardana > openstack port show <port_ID>
Example:
ardana > openstack port show 287746e6-7d82-4b2c-914c-191954eba342
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                        |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                         |
| allowed_address_pairs |                                                                                                              |
| binding:host_id       | ardana-cp1-c1-m1-mgmt                                                                                        |
| binding:profile       | {}                                                                                                           |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
| binding:vif_type      | ovs                                                                                                          |
| binding:vnic_type     | normal                                                                                                       |
| device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
| device_owner          | network:router_gateway                                                                                       |
| dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
| dns_name              |                                                                                                              |
| extra_dhcp_opts       |                                                                                                              |
| fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
| id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
| mac_address           | fa:16:3e:2e:26:ac                                                                                            |
| name                  |                                                                                                              |
| network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
| security_groups       |                                                                                                              |
| status                | DOWN                                                                                                         |
| tenant_id             |                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
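The three commands above can be grouped into one helper so the move always happens in a fixed order (remove, delete namespace, add). A sketch only: it reuses the exact CLI invocations shown above and does no error recovery.

```shell
#!/usr/bin/env bash
# Move the SNAT namespace: detach the router from the old L3 agent, delete
# the stale namespace, then attach the router to the new L3 agent.
move_snat() {
    old_agent=$1    # L3 agent ID currently hosting the SNAT namespace
    new_agent=$2    # L3 agent ID on the target controller
    router=$3       # router UUID (the device_id from `openstack port show`)
    openstack network agent remove router --agent-type l3 "$old_agent" "$router" || return 1
    sudo ip netns delete "snat-$router"
    openstack network agent add router --agent-type l3 "$new_agent" "$router"
}
```

Afterwards, confirm the move with `openstack port show <port_ID>` as described above.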
Reboot the controllers
In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services.
ardana > for i in $(grep -w cluster-prefix \
  ~/openstack/my_cloud/definition/data/control_plane.yml \
  | awk '{print $2}'); do grep $i \
  ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts \
  | grep ansible_ssh_host | awk '{print $1}'; done
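The same extraction can be wrapped in a function whose file locations can be overridden, which also makes it easy to test against copies of the files. The default paths match the command above; everything else is a sketch.

```shell
#!/usr/bin/env bash
# Print the hostnames of all control-plane nodes by matching each
# cluster-prefix from the cloud model against the Ansible inventory.
list_controllers() {
    cp_file=${1:-$HOME/openstack/my_cloud/definition/data/control_plane.yml}
    hosts_file=${2:-$HOME/scratch/ansible/next/ardana/ansible/hosts/verb_hosts}
    for prefix in $(grep -w cluster-prefix "$cp_file" | awk '{print $2}'); do
        grep "$prefix" "$hosts_file" | grep ansible_ssh_host | awk '{print $1}'
    done
}
```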
Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:
If any singleton services are active on this node, they will be unavailable while the node is down. If you want to retain them during the reboot, take special action to maintain service, such as migrating them as described above.
Stop all services on the controller node that you are rebooting first:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <controller node>
Reboot the controller node, for example by running the following command on the controller itself:
ardana > sudo reboot
Note that the node currently being rebooted could be hosting the lifecycle manager.
Wait for the controller node to become reachable via SSH, and allow an additional minimum of five minutes for the controller node to settle. Start all services on the controller node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <controller node>
Verify that the status of all services on the controller node is OK:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit <controller node>
When the start operation has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off the node first.
It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).
Reenable the keystone Fernet Token-Signing Key Rotation
After all the controller nodes are successfully updated and back online, you need to re-enable the keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
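The whole stop/reboot/settle/start/status cycle for a list of controllers can be expressed as one loop. A sketch under the procedure above: `run_playbook` wraps the ansible-playbook calls, the SSH reboot and the fixed five-minute settle time are simplifications, and a real run must still respect the singleton-migration steps.

```shell
#!/usr/bin/env bash
# Convenience wrapper; run from ~/scratch/ansible/next/ardana/ansible.
run_playbook() {
    ansible-playbook -i hosts/verb_hosts "$@"
}

# Reboot each controller in turn, never starting the next one until the
# previous node reports a clean status.
rolling_reboot() {
    for node in "$@"; do
        run_playbook ardana-stop.yml --limit "$node" || return 1
        ssh "$node" sudo reboot
        sleep 300    # crude stand-in for "reachable via SSH plus five minutes to settle"
        run_playbook ardana-start.yml --limit "$node" || return 1
        run_playbook ardana-status.yml --limit "$node" || return 1
    done
}
```

For example, `rolling_reboot ardana-cp1-c1-m1-mgmt ardana-cp1-c1-m2-mgmt ardana-cp1-c1-m3-mgmt` processes the controllers strictly one at a time.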
15.1.1.2.3 Rebooting compute nodes #
To reboot a compute node the following operations will need to be performed:
Disable provisioning of the node to take the node offline to prevent further instances being scheduled to the node during the reboot.
Identify instances that exist on the compute node, and then either:
Live migrate the instances off the node before actioning the reboot. OR
Stop the instances
Reboot the node
Restart the nova services
Disable provisioning:
ardana > openstack compute service set --disable --disable-reason "DESCRIBE REASON" compute nova-compute
If the node has existing instances running on it, these instances will need to be migrated or stopped prior to rebooting the node.
Live migrate existing instances. Identify the instances on the compute node. Note: The following command must be run with nova admin credentials.
ardana > openstack server list --host <hostname> --all-tenants
Migrate or stop the instances on the compute node.
Migrate the instances off the node by running one of the following commands for each of the instances:
If your instance is booted from a volume and has any number of cinder volumes attached, use the nova live-migration command:
ardana > nova live-migration <instance uuid> [<target compute host>]
If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:
ardana > nova live-migration --block-migrate <instance uuid> [<target compute host>]
Note: The [<target compute host>] option is optional. If you do not specify a target host, the nova scheduler will choose a node for you.
OR
Stop the instances on the node by running the following command for each of the instances:
ardana > openstack server stop <instance-uuid>
Stop all services on the Compute node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>
SSH to your Compute nodes and reboot them:
ardana > sudo reboot
The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.
Run the ardana-start.yml playbook from the Cloud Lifecycle Manager. If needed, use the bm-power-up.yml playbook to restart the node. Specify just the node(s) you want to start in the 'nodelist' parameter arguments, that is, nodelist=<node1>[,<node2>][,<node3>].
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
Execute the ardana-start.yml playbook, specifying the node(s) you want to start in the 'limit' parameter arguments. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
Re-enable provisioning on the node:
ardana > openstack compute service set --enable compute nova-compute
Restart any instances you stopped:
ardana > openstack server start <instance-uuid>
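For the stop/start path, it helps to record which instances you stopped so that exactly those are restarted after the reboot. A sketch that shells out to the same CLI commands as above; the `-f value -c ID` output options and the state file are assumptions to adapt.

```shell
#!/usr/bin/env bash
# Save the instance UUIDs on a host to a state file, then stop each one.
stop_instances() {
    host=$1; state_file=$2
    openstack server list --host "$host" --all-tenants -f value -c ID > "$state_file"
    while read -r id; do
        openstack server stop "$id"
    done < "$state_file"
}

# Restart exactly the instances recorded earlier.
start_instances() {
    state_file=$1
    while read -r id; do
        openstack server start "$id"
    done < "$state_file"
}
```

For example: `stop_instances ardana-cp1-comp0001-mgmt /tmp/comp0001-instances` before the reboot, then `start_instances /tmp/comp0001-instances` afterwards.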
15.1.1.2.4 Rebooting swift nodes #
If your swift services are on a controller node, follow the controller node reboot instructions above.
For a dedicated swift PAC cluster or swift Object resource node:
For each swift host:
Stop all services on the swift node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <swift node>
Reboot the swift node by running the following command on the swift node itself:
ardana > sudo reboot
Wait for the node to become reachable via SSH and then start all services on the swift node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <swift node>
15.1.1.2.5 Get list of status playbooks #
The following command will display a list of status playbooks:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ls *status*
15.1.2 Planned Control Plane Maintenance #
Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.
15.1.2.1 Replacing a Controller Node #
This section outlines steps for replacing a controller node in your environment.
For SUSE OpenStack Cloud, you must have three controller nodes; therefore, adding or removing nodes is not an option. However, if you need to repair or replace a controller node, you may do so by following the steps outlined here. Note that all cloud maintenance playbooks are run from the Cloud Lifecycle Manager.
These steps will depend on whether you need to replace a shared lifecycle manager/controller node or whether this is a standalone controller node.
Keep in mind while performing the following tasks:
Do not add entries for a new server. Instead, update the entries for the broken one.
Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.
15.1.2.1.2 Replacing a Standalone Controller Node #
If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.
Log in to the Cloud Lifecycle Manager.
Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Remove the old controller node(s) from Cobbler. You can list the systems currently in Cobbler with this command:
ardana > sudo cobbler system list
and then remove the old controller nodes with this command:
ardana > sudo cobbler system remove --name <node>
Remove the SSH key of the old controller node from the known hosts file. You will specify the ip-addr value:
ardana > ssh-keygen -f "~/.ssh/known_hosts" -R <ip_addr>
You should see a response similar to this one:
ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f "~/.ssh/known_hosts" -R 10.13.111.135
# Host 10.13.111.135 found: line 6 type ECDSA
~/.ssh/known_hosts updated.
Original contents retained as ~/.ssh/known_hosts.old
Run the cobbler-deploy playbook to add the new controller node:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
Important: You must ensure that the old controller node is powered off before completing this step. This is because the new controller node will re-use the original IP address.
Run the wipe_disks.yml playbook to ensure all non-OS partitions on the new node are completely wiped prior to continuing with the installation. (The value to be used for hostname is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.)
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
Run osconfig on the replacement controller node. For example:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
If the controller being replaced is the swift ring builder (see Section 18.6.2.4, “Identifying the Swift Ring Building Server”), you need to restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 18.6.2.7, “Recovering swift Builder Files” for details.
Run the ardana-deploy playbook on the replacement controller.
If the node being replaced is the swift ring builder server, you only need to use the --limit switch for that node; otherwise, you need to specify both the hostname of your swift ring builder server and the hostname of the node being replaced.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller-hostname>,<swift-ring-builder-hostname>
Important: If you receive a keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys. In this situation, do not use the --limit option when running keystone-reconfigure.yml; in order to re-sync Fernet keys, all the controller nodes must be in the play.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
Important: If you receive a RabbitMQ failure when running this playbook, review Section 18.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue and then re-run the ardana-deploy playbook.
During the replacement of the node there will be alarms that show up during the process. If those do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
15.1.3 Planned Compute Maintenance #
Planned maintenance tasks for compute nodes.
15.1.3.1 Planned Maintenance for a Compute Node #
If one or more of your compute nodes needs hardware maintenance and you can schedule a planned maintenance then this procedure should be followed.
15.1.3.1.1 Performing planned maintenance on a compute node #
If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:
Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
source ~/service.osrc
Obtain the hostname for your compute node, which you will use in subsequent commands where <hostname> is requested:
openstack host list | grep compute
The following example shows two compute nodes:
$ openstack host list | grep compute
| ardana-cp1-comp0001-mgmt | compute | AZ1 |
| ardana-cp1-comp0002-mgmt | compute | AZ2 |
Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:
openstack compute service set --disable --disable-reason "Maintenance mode" <hostname> nova-compute
Note: Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:
openstack compute service set --enable <hostname> nova-compute
At this point you have two choices:
Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.
Stop/start the instances: Issuing openstack server stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.
If you choose the live migration route, see Section 15.1.3.3, “Live Migration of Instances” for more details. Skip to step 6 after you finish live migration.
If you choose the stop start method, continue on.
List all of the instances on the node so you can issue stop commands to them:
openstack server list --host <hostname> --all-tenants
Issue the openstack server stop command against each of the instances:
openstack server stop <instance uuid>
Confirm that the instances are stopped. If stoppage was successful, you should see the instances in a SHUTOFF state, as shown here:
$ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
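The Status column can also be checked mechanically before proceeding with the maintenance. A sketch that parses the default table layout of `openstack server list` (the column positions are an assumption):

```shell
#!/usr/bin/env bash
# Succeed only if every data row piped in has Status == SHUTOFF.
# Usage: openstack server list --host <hostname> --all-tenants | all_shutoff
all_shutoff() {
    awk -F'|' '
        NF < 6 { next }                        # skip +---+ border lines
        { gsub(/ /, "", $5) }                  # trim the Status column
        $5 == "Status" || $5 == "" { next }    # skip the header row
        $5 != "SHUTOFF" { bad = 1 }
        END { exit bad }
    '
}
```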
Do your required maintenance. If this maintenance does not take down the disks completely, you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:
openstack server list --host <hostname> --all-tenants
Start the instances back up using this command:
openstack server start <instance uuid>
Example:
$ openstack server start ef31c453-f046-4355-9bd3-11e774b1772f
Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.
Confirm that the instances started back up. If restarting is successful, you should see the instances in an ACTIVE state, as shown here:
$ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
If openstack server start fails, you can try a hard reboot:
openstack server reboot --hard <instance uuid>
If this does not resolve the issue you may want to contact support.
Re-enable provisioning when the node is fixed:
openstack compute service set --enable <hostname> nova-compute
15.1.3.2 Rebooting a Compute Node #
If all you need to do is reboot a Compute node, the following steps can be used.
You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. This playbook will ensure that all services on the Compute node are restarted properly.
Log in to the Cloud Lifecycle Manager.
Reboot the Compute node(s) with the following playbook.
You can specify either single or multiple Compute nodes using the --limit switch. An optional reboot wait time can also be specified; if no reboot wait time is specified, it defaults to 300 seconds.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
Note: If the Compute node fails to reboot, you should troubleshoot this issue separately, as this playbook will not attempt to recover after a failed reboot.
15.1.3.3 Live Migration of Instances #
Live migration allows you to move active compute instances between compute nodes, allowing for less downtime during maintenance.
SUSE OpenStack Cloud nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use will depend on the state of the host, what operating system is on the host, what type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off of the host. We will describe these options on this page as well as give you step-by-step instructions for performing them.
15.1.3.3.1 Migration Options #
If your compute node has failed
A compute host failure could be caused by hardware failure, such as a data disk needing to be replaced, lost power, or any other type of failure that requires you to replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.
In these cases you will want to use one of the nova evacuate commands, which will cause nova to rebuild the instances on other hosts.
This table describes each of the evacuate options for failed compute nodes:
Command | Description
---|---
nova evacuate | Used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified, the nova scheduler will choose one for you.
nova host-evacuate | Used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified, the nova scheduler will choose a target host for each instance.
If your compute host is active, powered on, and the data disks are in working order, you can utilize the migration commands to migrate your compute instances. There are two migration features: "cold" migration (also referred to simply as "migration") and live migration. Migration and live migration are two different functions.
Cold migration is used to copy an instance's data, while it is in a SHUTOFF status, from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the openstack server migrate function has been disabled by default, but you have the ability to enable this feature if you would like. Details on how to do this can be found in Section 6.4, “Enabling the Nova Resize and Migrate Features”.
Live migration can be performed on instances in either an ACTIVE or PAUSED state. It uses the QEMU hypervisor to manage the copy of the running processes and associated resources to the destination compute host using the hypervisor's own protocol, and thus is a more secure method that allows for less downtime. There may be a short network outage during a live migration, usually a few milliseconds, though it could be up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.
The compute host must remain powered on during the migration process.
Both the cold migration and live migration options will honor nova group policies, which include affinity settings. There is a limitation to keep in mind if you use group policies, which is discussed in Section 15.1.3.3, “Live Migration of Instances”.
This table describes each of the migration options for active compute nodes:
Command | Description | SLES |
---|---|---|
openstack server migrate | Used to cold migrate a single instance from a compute host. The nova scheduler will choose the target host. This command will work against instances in an ACTIVE or SHUTOFF state. See the difference between cold migration and live migration at the start of this section. | |
nova host-servers-migrate | Used to cold migrate all instances off a specified host to other available hosts, chosen by the nova scheduler. This command will work against instances in an ACTIVE or SHUTOFF state. See the difference between cold migration and live migration at the start of this section. | |
nova live-migration | Used to live migrate a single instance between two compute hosts. You can optionally specify a target host, or you can allow the nova scheduler to choose a host for you. If you specify a target host, ensure that it has enough resources to host the instance prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in an ACTIVE or PAUSED state. | X |
nova live-migration --block-migrate | Used to live migrate a single instance between two compute hosts. You can optionally specify a target host, or you can allow the nova scheduler to choose a host for you. If you specify a target host, ensure that it has enough resources to host the instance prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only, or that have a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in an ACTIVE or PAUSED state. | X |
nova host-evacuate-live | Used to live migrate all instances off of a compute host. You can optionally specify a target host, or you can allow the nova scheduler to choose a host for you. If you specify a target host, ensure that it has enough resources to host the instances prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in an ACTIVE or PAUSED state. | X |
nova host-evacuate-live --block-migrate | Used to live migrate all instances off of a compute host. You can optionally specify a target host, or you can allow the nova scheduler to choose a host for you. If you specify a target host, ensure that it has enough resources to host the instances prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only, or that have a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in an ACTIVE or PAUSED state. | X |
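The storage-layout rule encoded in the table above can be captured in a small helper. This is only a hedged sketch: live_cmd is a hypothetical function name, and the caller is assumed to already know whether the instance is volume-backed or uses local disks.

```shell
# Hypothetical helper: pick the live-migration command variant based on
# whether the instance is volume-backed or uses local (ephemeral) disks.
live_cmd() {
  case "$1" in
    ephemeral) echo "nova live-migration --block-migrate" ;;  # local disks present
    volume)    echo "nova live-migration" ;;                  # booted from volume
    *)         echo "unknown storage type: $1" >&2; return 1 ;;
  esac
}
```

For example, live_cmd ephemeral prints the variant that copies local disks with block migration, while live_cmd volume prints the plain variant for volume-backed instances.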
15.1.3.3.2 Limitations of these Features #
There are limitations that may impact your use of these features:
- To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state, then cold migration should be used.
- Instances in a PAUSED state cannot be live migrated using the horizon dashboard. You will need to use the python-novaclient CLI to perform these migrations.
- Both cold migration and live migration honor an instance's group policies. If you are using an affinity policy and are migrating multiple instances, you may run into an error stating no hosts are available to migrate to. To work around this issue, specify a target host when migrating these instances, which will bypass the nova-scheduler. Ensure that the target host you choose has the resources available to host the instances.
- The nova host-evacuate-live command will produce an error if a compute host has a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, using the --block-migrate option. This is described in further detail in Section 15.1.3.3, “Live Migration of Instances”.
- Instances on KVM hosts can only be live migrated to other KVM hosts.
- The migration options described in this document are not available on ESX compute hosts.
- Ensure that you read and take into account any other limitations that exist in the release notes.
15.1.3.3.3 Performing a Live Migration #
Cloud administrators can perform a migration on an instance using the horizon dashboard, the API, or the CLI. Instances in a PAUSED state cannot be live migrated using the horizon GUI; you will need to use the CLI to perform these migrations.
We have documented different scenarios:
15.1.3.3.4 Migrating instances off of a failed compute host #
1. Log in to the Cloud Lifecycle Manager.

2. If the compute node is not already powered off, do so with this playbook:

   cd ~/openstack/ardana/ansible
   ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>

   Note: The value for <node_name> will be the name that Cobbler has for the node when you run sudo cobbler system list from the Cloud Lifecycle Manager.

3. Source the admin credentials necessary to run administrative commands against the nova API:

   source ~/service.osrc

4. Force the nova-compute service to go down on the compute node:

   openstack compute service set --down HOSTNAME nova-compute

   Note: The value for HOSTNAME can be obtained by running openstack host list from the Cloud Lifecycle Manager.

5. Evacuate the instances off of the failed compute node. This causes the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.

   For a single instance on a failed host:

   nova evacuate <instance_uuid> <target_hostname>

   For all instances on a failed host:

   nova host-evacuate <hostname> [--target_host <target_hostname>]

6. When you have repaired the failed node and powered it back on, the nova-compute process will clean up the evacuated instances when it starts.
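The per-instance evacuation from the steps above can be sketched as a small shell loop. This is only a sketch under assumptions: extract_uuids and evacuate_host are hypothetical names, admin credentials are assumed to be sourced, and NOVA_CMD lets you dry-run with echo instead of the real nova client.

```shell
# NOVA_CMD is an assumption for dry runs (set NOVA_CMD=echo); defaults to nova.
NOVA_CMD=${NOVA_CMD:-nova}

# Pull instance UUIDs out of the CLI's ASCII-table output (first data column).
extract_uuids() {
  awk -F'|' 'NF > 2 {
    id = $2; gsub(/ /, "", id)
    if (id ~ /^[0-9a-f-]+$/ && id ~ /-/) print id   # looks like a UUID
  }'
}

# Evacuate every instance on a failed host one at a time, letting the
# nova scheduler pick a target for each (no target host argument).
evacuate_host() {
  local failed=$1
  $NOVA_CMD list --host "$failed" --all-tenants | extract_uuids |
  while read -r uuid; do
    $NOVA_CMD evacuate "$uuid"
  done
}
```

With NOVA_CMD=echo, evacuate_host prints nothing and touches nothing, which makes it easy to sanity-check before pointing it at a real failed host.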
15.1.3.3.5 Migrating instances off of an active compute host #
Migrating instances using the horizon dashboard
The horizon dashboard offers a GUI method for performing live migrations.
Instances in a Paused
state will not provide you the live
migration option in horizon so you will need to use the CLI instructions in
the next section to perform these.
Log into the horizon dashboard with admin credentials.
Navigate to the menu
› › .Next to the instance you want to migrate, select the drop down menu and choose the
option.In the Live Migrate wizard you will see the compute host the instance currently resides on and then a drop down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:
False
. If you check this box then it will allow you to override the check that occurs to ensure the destination host has the available disk space to host the instance.False
. If you check this box then it will migrate the local disks by using block migration. Use this option if you are only using ephemeral storage on your instances. If you are using block storage for your instance then ensure this box is not checked.To begin the live migration, click
.
Migrating instances using the python-novaclient CLI
To perform migrations from the command line, use the python-novaclient. The Cloud Lifecycle Manager node in your cloud environment should already have the python-novaclient installed. If you will be accessing your environment through a different method, ensure that the python-novaclient is installed; you can do so using Python's pip package manager.
To run the commands in the steps below, you need administrator
credentials. From the Cloud Lifecycle Manager, you can source the
service.osrc
file which is provided that has the
necessary credentials:
source ~/service.osrc
Here are the steps to perform:
1. Log in to the Cloud Lifecycle Manager.

2. Identify the instances on the compute node you wish to migrate:

   openstack server list --all-tenants --host <hostname>

   Example showing a host with a single compute instance on it:

   ardana > openstack server list --host ardana-cp1-comp0001-mgmt --all-tenants
   +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
   | ID                                   | Name | Tenant ID                        | Status | Task State | Power State | Networks              |
   +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
   | 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | -          | Running     | adminnetwork=10.0.0.5 |
   +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+

3. When using live migration you can either specify a target host that the instance will be migrated to, or omit the target to allow the nova-scheduler to choose a node for you. To get a list of available hosts, use this command:

   openstack host list

4. Migrate the instance(s) on the compute node using the notes below.

   If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:

   nova live-migration <instance uuid> [<target compute host>]

   If your instance has local (ephemeral) disk(s) only, or if your instance has a mix of ephemeral disk(s) and block storage volume(s), use the --block-migrate option:

   nova live-migration --block-migrate <instance uuid> [<target compute host>]

   Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova scheduler will choose a node for you.

Multiple instances

If you want to live migrate all of the instances off a single compute host, use the nova host-evacuate-live command, which begins the live migration process.

If all of the instances on the host are using at least one local (ephemeral) disk, use this syntax:

   nova host-evacuate-live --block-migrate <hostname>

Alternatively, if all of the instances are only using block storage volumes, omit the --block-migrate option:

   nova host-evacuate-live <hostname>

Note: You can either let the nova-scheduler choose a suitable target host or specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.
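As noted in the limitations, a host that mixes ephemeral-backed and volume-backed instances needs host-evacuate-live run once with and once without --block-migrate. A wrapper can make the two passes explicit. This is a hedged sketch: evacuate_mixed is a hypothetical name, NOVA_CMD defaults to the real nova CLI (set NOVA_CMD=echo for a dry run), and errors from the wrong mode are tolerated because each instance should succeed in exactly one pass.

```shell
# NOVA_CMD is an assumption to allow dry runs; defaults to the real nova CLI.
NOVA_CMD=${NOVA_CMD:-nova}

evacuate_mixed() {
  local host=$1
  # Pass 1: instances with local (ephemeral) disks need --block-migrate.
  $NOVA_CMD host-evacuate-live --block-migrate "$host" || true
  # Pass 2: volume-backed instances must be migrated without --block-migrate.
  $NOVA_CMD host-evacuate-live "$host" || true
}
```

Each pass reports per-instance errors for instances of the other storage type; after both passes, check openstack server list --host for any stragglers.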
15.1.3.3.6 Troubleshooting migration or host evacuate issues #
Issue: When attempting to use nova
host-evacuate-live
against a node, you receive the error below:
$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt +--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Server UUID | Live Migration Accepted | Error Message | +--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 95a7ded8-ebfc-4848-9090-2df378c88a4c | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7) | | 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration cannot be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6) | +--------------------------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Fix: This occurs when you are attempting to
live evacuate a host that contains instances booted from local storage and
you are not specifying --block-migrate
in your command.
Re-attempt the live evacuation with this syntax:
nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]
Issue: When attempting to use nova
host-evacuate-live
against a node, you receive the error below:
$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt +--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Server UUID | Live Migration Accepted | Error Message | +--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | e9874122-c5dc-406f-9039-217d9258c020 | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a) | | 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112) | +--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Fix: This occurs when you are attempting to
live evacuate a host that contains instances booted from a block storage
volume and you are specifying --block-migrate
in your
command. Re-attempt the live evacuation with this syntax:
nova host-evacuate-live <hostname> [--target-host <target_hostname>]
Issue: When attempting to use nova
live-migration
against an instance, you receive the error below:
$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)
Fix: This occurs when you are attempting to
live migrate an instance that was booted from local storage and you are not
specifying --block-migrate
in your command. Re-attempt
the live migration with this syntax:
nova live-migration --block-migrate <instance_uuid> <target_hostname>
Issue: When attempting to use nova
live-migration
against an instance, you receive the error below:
$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)
Fix: This occurs when you are attempting to
live migrate an instance that was booted from a block storage volume and you
are specifying --block-migrate
in your command.
Re-attempt the live migration with this syntax:
nova live-migration <instance_uuid> <target_hostname>
15.1.3.4 Adding Compute Node #
Adding a Compute Node allows you to add capacity.
15.1.3.4.1 Adding a SLES Compute Node #
You may need to add SLES compute hosts to provide additional virtual machine capacity or for another purpose; these steps will help you achieve this.
There are two methods you can use to add SLES compute hosts to your environment:
- Adding SLES pre-installed compute hosts. This method does not require the SLES ISO to be present on the Cloud Lifecycle Manager.
- Using the provided Ansible playbooks and Cobbler, SLES will be installed on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP4 ISO during the initial installation of your cloud, following the instructions at Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute”, Section 31.1 “SLES Compute Node Installation Overview”.

If you want to use the provided Ansible playbooks and Cobbler to set up and configure your SLES hosts, and you did not have the SUSE Linux Enterprise Server 12 SP4 ISO on your Cloud Lifecycle Manager during your initial installation, ensure you read the note at the top of that section before proceeding.
15.1.3.4.1.1 Prerequisites #
You need to ensure your input model files are properly setup for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further at Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute”, Section 31.3 “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.1 “SLES Compute Nodes”.
15.1.3.4.1.2 Adding a SLES compute node #
Adding pre-installed SLES compute hosts
This method requires that you have SUSE Linux Enterprise Server 12 SP4 pre-installed on the baremetal host prior to beginning these steps.
1. Ensure you have SUSE Linux Enterprise Server 12 SP4 pre-installed on your baremetal host.

2. Log in to the Cloud Lifecycle Manager.

3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

   For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth, you would add your details to the bottom of the file in this format. Note that the IPMI details are left out because they are not needed when you pre-install the SLES OS on your host(s):

   - id: compute4
     ip-addr: 192.168.102.70
     role: SLES-COMPUTE-ROLE
     server-group: RACK1

   You can find detailed descriptions of these fields in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

   Important: Verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file, check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, change that value to member-count: 4.

   See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

5. Commit the changes to git:

   ardana > git add -A
   ardana > git commit -a -m "Add node <name>"

6. Run the configuration processor and resolve any errors that are indicated:

   ardana > cd ~/openstack/ardana/ansible
   ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

7. Update your deployment directory:

   ardana > cd ~/openstack/ardana/ansible
   ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

   Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

8. [OPTIONAL] Run the wipe_disks.yml playbook to ensure all of the non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

   Note: The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used in any other case, it may not wipe all of the expected partitions.

   The value to be used for <hostname> is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

   ardana > cd ~/scratch/ansible/next/ardana/ansible/
   ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>

9. Complete the compute host deployment with these playbooks:

   ardana > cd ~/scratch/ansible/next/ardana/ansible/
   ardana > ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
   ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
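The Important note in the steps above asks you to verify the chosen ip-addr against ~/openstack/my_cloud/info/address_info.yml. A hedged sketch of that check follows; check_ip is a hypothetical helper name, and the default file path is the one given in the documentation.

```shell
# Hypothetical helper: fail if a candidate IP already appears in the
# generated address info file.
check_ip() {
  local ip=$1 file=${2:-$HOME/openstack/my_cloud/info/address_info.yml}
  # -F: fixed string (dots are not wildcards); -w: whole-word match only
  if grep -qwF "$ip" "$file"; then
    echo "conflict: $ip is already allocated"
    return 1
  fi
  echo "ok: $ip is free"
}
```

For example, check_ip 192.168.102.70 before adding that address to servers.yml; a non-zero exit status means the address is already in use somewhere in the cloud.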
Adding SLES compute hosts with Ansible playbooks and Cobbler
These steps will show you how to add the new SLES compute host to your
servers.yml
file and then run the playbooks that update
your cloud configuration. You will run these playbooks from the lifecycle
manager.
If you did not have the SUSE Linux Enterprise Server 12 SP4 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute”.
When you are prepared to continue, use these steps:
1. Log in to your Cloud Lifecycle Manager.

2. Check out the site branch of your local git so you can begin to make the necessary edits:

   ardana > cd ~/openstack/my_cloud/definition/data
   ardana > git checkout site

3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

   For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth, you would add your details to the bottom of the file in this format:

   - id: compute4
     ip-addr: 192.168.102.70
     role: SLES-COMPUTE-ROLE
     server-group: RACK1
     mac-addr: e8:39:35:21:32:4e
     ilo-ip: 10.1.192.36
     ilo-password: password
     ilo-user: admin
     distro-id: sles12sp4-x86_64

   You can find detailed descriptions of these fields in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

   Important: Verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file, check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, change that value to member-count: 4.

   See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

5. Commit the changes to git:

   ardana > git add -A
   ardana > git commit -a -m "Add node <name>"

6. Run the configuration processor and resolve any errors that are indicated:

   ardana > cd ~/openstack/ardana/ansible
   ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

7. Confirm that your servers are accessible over their IPMI ports with the following playbook:

   cd ~/openstack/ardana/ansible
   ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4

8. Add the new node into Cobbler:

   ardana > cd ~/openstack/ardana/ansible
   ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

9. Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook reconfigures Cobbler for the nodes listed:

   ardana > cd ~/scratch/ansible/next/ardana/ansible
   ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]

10. Image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>

    Note: If you do not know the <node name>, you can get it by using sudo cobbler system list.

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

11. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

12. [OPTIONAL] Run the wipe_disks.yml playbook to ensure all of the non-OS partitions on your hosts are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used in any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>

    Note: You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

13. Verify that the netmask, bootproto, and other necessary settings are correct; if they are not, redo them. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 31 “Installing SLES Compute” for details.

14. Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you added with the --limit switch:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
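Several of the playbooks above take a comma-separated -e nodelist=... argument (for example, bm-power-status.yml, prepare-sles-grub2.yml, and bm-reimage.yml). A tiny helper can assemble that argument from individual node names; nodelist_arg is a hypothetical name, not part of the product tooling.

```shell
# Hypothetical helper: join node names into the -e nodelist=... argument
# accepted by the bare-metal playbooks.
nodelist_arg() {
  local IFS=','   # "$*" joins positional parameters with the first IFS char
  echo "-e nodelist=$*"
}
```

For example, ansible-playbook -i hosts/localhost bm-reimage.yml $(nodelist_arg compute4 compute5) expands to the usual -e nodelist=compute4,compute5 form.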
15.1.3.4.1.3 Adding a new SLES compute node to monitoring #
If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
15.1.3.5 Removing a Compute Node #
You may need to remove a Compute node to reduce capacity or reallocate hardware; these steps will help you achieve this.
15.1.3.5.1 Disable Provisioning on the Compute Host #
1. Get a list of the running nova services, which provides the details needed to disable provisioning on the Compute host you want to remove:

   ardana > openstack compute service list

   Here is an example; in these examples, the Compute node being removed is ardana-cp1-comp0002-mgmt:

   ardana > openstack compute service list
   +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
   | Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
   +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
   | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
   | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
   | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
   | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
   | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
   | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
   | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
   | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
   +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

2. Disable the nova service on the Compute node you want to remove, which ensures it is taken out of the scheduling rotation:

   ardana > openstack compute service set --disable --disable-reason "enter reason here" <node hostname> nova-compute

   Here is an example for removing the ardana-cp1-comp0002-mgmt node in the output above:

   ardana > openstack compute service set --disable --disable-reason "hardware reallocation" ardana-cp1-comp0002-mgmt nova-compute
   +--------------------------+--------------+----------+-----------------------+
   | Host                     | Binary       | Status   | Disabled Reason       |
   +--------------------------+--------------+----------+-----------------------+
   | ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
   +--------------------------+--------------+----------+-----------------------+
15.1.3.5.2 Remove the Compute Host from its Availability Zone #
If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.
Get a list of the nova services running which will provide us with the details we need to remove a Compute node:
ardana >
openstack compute service listHere is an example below. I've highlighted the Compute node we are going to remove in the examples:
ardana >
openstack compute service list +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+ | Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason | +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+ | 1 | nova-conductor | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T22:50:43.000000 | - | | 10 | nova-scheduler | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T22:50:34.000000 | - | | 13 | nova-conductor | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T22:50:43.000000 | - | | 16 | nova-conductor | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T22:50:43.000000 | - | | 28 | nova-scheduler | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T22:50:38.000000 | - | | 31 | nova-scheduler | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T22:50:42.000000 | - | | 34 | nova-compute | ardana-cp1-comp0001-mgmt | AZ1 | enabled | up | 2015-11-22T22:50:35.000000 | - | | 37 | nova-compute | ardana-cp1-comp0002-mgmt | AZ2 | enabled | up | 2015-11-22T22:50:44.000000 | hardware reallocation | +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+If the
Zone
reported for this host is simply "nova", then it is not a member of a particular availability zone, and this step will not be necessary. Otherwise, you must remove the Compute host from its availability zone:ardana >
ardana > openstack aggregate remove host AVAILABILITY_ZONE HOSTNAME
For the same example as in the previous step, the ardana-cp1-comp0002-mgmt host was in the AZ2 availability zone, so you would use this command to remove it:
ardana > openstack aggregate remove host AZ2 ardana-cp1-comp0002-mgmt
Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
+----+------+-------------------+-------+-------------------------+
| Id | Name | Availability Zone | Hosts | Metadata                |
+----+------+-------------------+-------+-------------------------+
| 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
+----+------+-------------------+-------+-------------------------+
You can confirm the last two steps completed successfully by running another openstack compute service list.
Here is an example which confirms that the node has been disabled and that it has been removed from the availability zone; note the Zone and Status values for ardana-cp1-comp0002-mgmt:
ardana > openstack compute service list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
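The zone check above lends itself to scripting when several hosts are being drained. Below is a minimal shell sketch that parses service-list records; the helper name, the sample data, and the use of machine-readable (-f value) output are illustrative assumptions, not part of the product:

```shell
#!/bin/sh
# Sketch: find the availability zone of a compute host in service-list output
# so the aggregate-removal step can be skipped when the zone is plain "nova".
# Hostnames and sample data are illustrative assumptions.
zone_for_host() {
    # $1 = hostname; reads "ID Binary Host Zone Status State" records on stdin
    awk -v host="$1" '$2 == "nova-compute" && $3 == host { print $4 }'
}

# Stand-in for: openstack compute service list -f value
sample='34 nova-compute ardana-cp1-comp0001-mgmt AZ1 enabled up
37 nova-compute ardana-cp1-comp0002-mgmt AZ2 disabled up'

zone=$(printf '%s\n' "$sample" | zone_for_host ardana-cp1-comp0002-mgmt)
if [ "$zone" != "nova" ]; then
    # On a live cloud this branch would run:
    #   openstack aggregate remove host "$zone" ardana-cp1-comp0002-mgmt
    echo "remove ardana-cp1-comp0002-mgmt from aggregate $zone"
fi
```

On a live cloud, replace the sample variable with the real command output and let the branch run the aggregate removal.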
15.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts #
Verify whether the Compute node is currently hosting any instances. You can do this with the command below:
ardana > openstack server list --host HOSTNAME --all-projects
Here is an example which shows a single running instance on this node:
ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
You will likely want to migrate this instance off of the node before removing it. You can do this with the live migration functionality within nova. The command will look like this:
ardana > nova live-migration --block-migrate INSTANCE_ID
Here is an example using the instance from the previous step:
ardana > nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9
You can check the status of the migration using the same command from the previous step:
ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
List the compute instances again to see that the running instance has been migrated:
ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
+----+------+-----------+--------+------------+-------------+----------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+----+------+-----------+--------+------------+-------------+----------+
+----+------+-----------+--------+------------+-------------+----------+
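When several instances are still running on the host, the migration commands can be generated in a loop. The sketch below only prints the commands it would run; the helper name and the machine-readable listing it expects (openstack server list --host HOSTNAME --all-projects -f value -c ID) are assumptions:

```shell
#!/bin/sh
# Sketch: emit one live-migration command per instance still on the host.
# The helper name is an assumption; the sample instance ID is taken from
# the example above.
migration_commands() {
    # stdin: one instance ID per line
    while read -r id; do
        [ -n "$id" ] && echo "nova live-migration --block-migrate $id"
    done
}

printf '%s\n' 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | migration_commands
```

Repeat the server list until it comes back empty before proceeding.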
15.1.3.5.4 Disable Neutron Agents on Node to be Removed #
You should also locate and disable or remove the neutron agents running on the node. To see the neutron agents running:
ardana > openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
ardana > openstack network agent set --disable 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana > openstack network agent set --disable dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana > openstack network agent set --disable f0d262d1-7139-40c7-bdc2-f227c6dee5c8
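Rather than copying each UUID by hand, the agent IDs can be extracted from the table and disabled in a loop. The following sketch parses abridged sample rows copied from the listing above; the helper name is an assumption, and on a live cloud each printed ID would be passed to openstack network agent set --disable:

```shell
#!/bin/sh
# Sketch: collect the UUIDs of all neutron agents on a node from the table
# printed by "openstack network agent list | grep NODE_NAME". Sample rows are
# abridged copies of the example output; the helper name is an assumption.
agent_ids() {
    # stdin: table rows; prints the first data column (the agent id)
    awk -F'|' 'NF > 2 { gsub(/ /, "", $2); if ($2 != "" && $2 != "id") print $2 }'
}

sample='| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent | ardana-cp1-comp0002-mgmt | :-) | True | neutron-l3-agent |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent | ardana-cp1-comp0002-mgmt | :-) | True | neutron-metadata-agent |'

for id in $(printf '%s\n' "$sample" | agent_ids); do
    # On a live cloud: openstack network agent set --disable "$id"
    echo "disable $id"
done
```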
15.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host #
To perform this step you have a few options. You can SSH into the Compute host and run the following commands:
tux > sudo systemctl stop nova-compute
tux > sudo systemctl stop neutron-*
Because the neutron agents self-register against the neutron server, you may want to prevent the following services from coming back online. Here is how you can get the list:
tux > sudo systemctl list-units neutron-* --all
Here are the results:
UNIT                               LOAD      ACTIVE   SUB    DESCRIPTION
neutron-common-rundir.service      loaded    inactive dead   Create /var/run/neutron
•neutron-dhcp-agent.service        not-found inactive dead   neutron-dhcp-agent.service
neutron-l3-agent.service           loaded    inactive dead   neutron-l3-agent Service
neutron-metadata-agent.service     loaded    inactive dead   neutron-metadata-agent Service
•neutron-openvswitch-agent.service loaded    failed   failed neutron-openvswitch-agent Service
neutron-ovs-cleanup.service        loaded    inactive dead   neutron OVS Cleanup Service

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

7 loaded units listed. To show all installed unit files use 'systemctl list-unit-files'.
For each loaded service, issue the command:
tux > sudo systemctl disable SERVICE_NAME
In the above example, that is each service except neutron-dhcp-agent.service, which is not loaded.
For example:
tux > sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-metadata-agent neutron-openvswitch-agent
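The set of units to disable can also be derived mechanically from the systemctl listing, skipping any unit whose LOAD state is not "loaded" (such as neutron-dhcp-agent.service above). A minimal sketch, with sample data standing in for the real listing and an assumed helper name:

```shell
#!/bin/sh
# Sketch: derive the neutron units to disable from
# "systemctl list-units neutron-* --all" output, keeping only units whose
# LOAD column is "loaded". Sample data mirrors the listing above.
loaded_neutron_units() {
    awk '$1 ~ /^neutron-/ && $2 == "loaded" { print $1 }'
}

sample='neutron-common-rundir.service loaded inactive dead Create /var/run/neutron
neutron-dhcp-agent.service not-found inactive dead neutron-dhcp-agent.service
neutron-l3-agent.service loaded inactive dead neutron-l3-agent Service'

# On a live node the result would be fed to: sudo systemctl disable ...
printf '%s\n' "$sample" | loaded_neutron_units
```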
Now you can shut down the node:
tux > sudo shutdown now
OR
From the Cloud Lifecycle Manager, you can use the bm-power-down.yml playbook to shut down the node:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=NODE_NAME
The NODE_NAME value is the name corresponding to this node in Cobbler. You can run sudo cobbler system list to retrieve these names.
15.1.3.5.6 Delete the Compute Host from Nova #
Retrieve the list of nova services:
ardana > openstack compute service list
Here is an example highlighting the Compute host we're going to remove:
ardana > openstack compute service list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1 | nova-conductor | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 10 | nova-scheduler | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:34.000000 | - |
| 13 | nova-conductor | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 16 | nova-conductor | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 28 | nova-scheduler | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T23:04:28.000000 | - |
| 31 | nova-scheduler | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T23:04:32.000000 | - |
| 34 | nova-compute | ardana-cp1-comp0001-mgmt | AZ1 | enabled | up | 2015-11-22T23:04:25.000000 | - |
| 37 | nova-compute | ardana-cp1-comp0002-mgmt | nova | disabled | up | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
Delete the host from nova using the command below:
ardana > openstack compute service delete SERVICE_ID
Following our example above, you would use:
ardana > openstack compute service delete 37
Use the command below to confirm that the Compute host has been completely removed from nova:
ardana > openstack hypervisor list
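This final confirmation can be scripted as a simple absence check over the hypervisor list. A minimal sketch with stand-in data; the helper name and the sample row are assumptions:

```shell
#!/bin/sh
# Sketch: confirm the deleted host no longer appears in the hypervisor list.
# On a live cloud the list would come from "openstack hypervisor list".
# Helper name and sample data are illustrative assumptions.
host_absent() {
    # $1 = hostname; stdin: hypervisor list output; succeeds if not present
    ! grep -qw "$1"
}

sample='1 ardana-cp1-comp0001-mgmt QEMU 10.1.1.11 up'

if printf '%s\n' "$sample" | host_absent ardana-cp1-comp0002-mgmt; then
    echo "ardana-cp1-comp0002-mgmt fully removed from nova"
fi
```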
15.1.3.5.7 Delete the Compute Host from Neutron #
Multiple neutron agents are running on the compute node. You have to remove all of the agents running on the node using the openstack network agent delete command. In the example below, the l3-agent, openvswitch-agent, and metadata-agent are running:
ardana > openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id | agent_type | host | alive | admin_state_up | binary |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent | ardana-cp1-comp0002-mgmt | :-) | False | neutron-l3-agent |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent | ardana-cp1-comp0002-mgmt | :-) | False | neutron-metadata-agent |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent | ardana-cp1-comp0002-mgmt | :-) | False | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
ardana > openstack network agent delete AGENT_ID
ardana > openstack network agent delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana > openstack network agent delete dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana > openstack network agent delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
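If you prefer not to paste each UUID, the deletions can be driven by a loop. This sketch only echoes the commands it would run; on a live cloud the IDs could come from openstack network agent list --host NODE_NAME -f value -c ID, and the helper name is an assumption:

```shell
#!/bin/sh
# Sketch: drive the agent deletions with a loop instead of pasting each UUID.
# The function only echoes the commands it would run; the UUIDs below are the
# ones from the example table above.
delete_commands() {
    for id in "$@"; do
        echo "openstack network agent delete $id"
    done
}

delete_commands 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 \
                dbe4fe11-8f08-4306-8244-cc68e98bb770 \
                f0d262d1-7139-40c7-bdc2-f227c6dee5c8
```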
15.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor #
Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:
Log in to the Cloud Lifecycle Manager.
Edit your servers.yml file in the location below to remove references to the Compute node(s) you want to remove:
ardana > cd ~/openstack/my_cloud/definition/data
ardana > vi servers.yml
You may also need to edit your control_plane.yml file to update the values for member-count, min-count, and max-count, if you used them, to ensure they reflect the exact number of nodes you are using. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.
Commit the changes to git:
ardana > git commit -a -m "Remove node NODE_NAME"
To release the network capacity allocated to the deleted server(s), use the switches remove_deleted_servers and free_unused_addresses when running the configuration processor. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.)
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml \
  -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Refresh the /etc/hosts file throughout the cloud to remove references to the old node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
15.1.3.5.9 Remove the Compute Host from Cobbler #
Complete these steps to remove the node from Cobbler:
Confirm the system name in Cobbler with this command:
tux > sudo cobbler system list
Remove the system from Cobbler using this command:
tux > sudo cobbler system remove --name=NODE_NAME
Run the cobbler-deploy.yml playbook to complete the process:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
15.1.3.5.10 Remove the Compute Host from Monitoring #
Once you have removed the Compute nodes, the alarms against them will trigger, so there are additional steps to take to resolve this.
To find all monasca API servers:
tux > sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
bind ardana-cp1-vip-public-MON-API-extapi:8070 ssl crt /etc/ssl/private//my-public-cert-entry-scale
server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
listen ardana-cp1-vip-MON-API-mgmt-8070
bind ardana-cp1-vip-MON-API-mgmt:8070 ssl crt /etc/ssl/private//ardana-internal-cert
server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
In the above example, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the monasca API servers.
SSH to each of the monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Compute node you removed. This requires sudo access. The entries will look similar to the one below:
- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping
Once you have removed the references on each of your monasca API servers, restart the monasca-agent on each of those servers with this command:
tux > sudo service openstack-monasca-agent restart
With the Compute node references removed and the monasca-agent restarted, you can delete the corresponding alarms to finish this process. To do so, we recommend using the monasca CLI, which is installed on each of your monasca API servers by default:
ardana > monasca alarm-list --metric-dimensions hostname=DELETED_COMPUTE_NODE
For example, if your Compute node looked like the example above, you would use this command to get the alarm IDs:
ardana > monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt
You can then delete each alarm with this command:
ardana > monasca alarm-delete ALARM_ID
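Deleting the alarms one at a time can be tedious when several alarms exist for the host. The sketch below extracts alarm IDs from a sample alarm-list style table and emits one delete command per alarm; the helper name and sample row are assumptions, and the real CLI column layout should be verified before use:

```shell
#!/bin/sh
# Sketch: extract alarm IDs for the removed host from a "monasca alarm-list"
# style table and emit one delete command per alarm. Sample row and helper
# name are illustrative assumptions.
alarm_ids() {
    # stdin: table rows; prints the first data column (the alarm id)
    awk -F'|' 'NF > 2 { gsub(/ /, "", $2); if ($2 != "" && $2 != "id") print $2 }'
}

sample='| 02342bcb-da81-40db-a262-09539523c482 | host_alive_check | ALARM |'

for id in $(printf '%s\n' "$sample" | alarm_ids); do
    # On a live monasca API server: monasca alarm-delete "$id"
    echo "monasca alarm-delete $id"
done
```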
15.1.4 Planned Network Maintenance #
Planned maintenance task for networking nodes.
15.1.4.1 Adding a Network Node #
Adding an additional neutron networking node allows you to increase the performance of your cloud.
You may need to add an additional neutron network node for increased performance or another purpose; these steps will help you achieve this.
15.1.4.1.1 Prerequisites #
If you are using the mid-scale model then your networking nodes are already
separate and the roles are defined. If you are not already using this model
and wish to add separate networking nodes then you need to ensure that those
roles are defined. You can look in the ~/openstack/examples
folder on your Cloud Lifecycle Manager for the mid-scale example model files which
show how to do this. We have also added the basic edits that need to be made
below:
In your server_roles.yml file, ensure you have the NEUTRON-ROLE defined.
Path to file: ~/openstack/my_cloud/definition/data/server_roles.yml
Example snippet:
- name: NEUTRON-ROLE
  interface-model: NEUTRON-INTERFACES
  disk-model: NEUTRON-DISKS
In your net_interfaces.yml file, ensure you have the NEUTRON-INTERFACES defined.
Path to file: ~/openstack/my_cloud/definition/data/net_interfaces.yml
Example snippet:
- name: NEUTRON-INTERFACES
  network-interfaces:
  - device:
      name: hed3
    name: hed3
    network-groups:
    - EXTERNAL-VM
    - GUEST
    - MANAGEMENT
Create a disks_neutron.yml file and ensure you have the NEUTRON-DISKS defined in it.
Path to file: ~/openstack/my_cloud/definition/data/disks_neutron.yml
Example snippet:
product:
  version: 2

disk-models:
- name: NEUTRON-DISKS
  volume-groups:
  - name: ardana-vg
    physical-volumes:
    - /dev/sda_root
    logical-volumes:
    # The policy is not to consume 100% of the space of each volume group.
    # 5% should be left free for snapshots and to allow for some flexibility.
    - name: root
      size: 35%
      fstype: ext4
      mount: /
    - name: log
      size: 50%
      mount: /var/log
      fstype: ext4
      mkfs-opts: -O large_file
    - name: crash
      size: 10%
      mount: /var/crash
      fstype: ext4
      mkfs-opts: -O large_file
Modify your control_plane.yml file to ensure you have the NEUTRON-ROLE defined as well as the neutron services added.
Path to file: ~/openstack/my_cloud/definition/data/control_plane.yml
Example snippet:
- allocation-policy: strict
  cluster-prefix: neut
  member-count: 1
  name: neut
  server-role: NEUTRON-ROLE
  service-components:
  - ntp-client
  - neutron-vpn-agent
  - neutron-dhcp-agent
  - neutron-metadata-agent
  - neutron-openvswitch-agent
You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 2 “Hardware and Software Support Matrix”.
15.1.4.1.2 Adding a network node #
These steps will show you how to add the new network node to your
servers.yml
file and then run the playbooks that update
your cloud configuration. You will run these playbooks from the lifecycle
manager.
Log in to your Cloud Lifecycle Manager.
Check out the site branch of your local git so you can begin to make the necessary edits:
ardana > cd ~/openstack/my_cloud/definition/data
ardana > git checkout site
In the same directory, edit your servers.yml file to include the details about your new network node(s). For example, if you already had a cluster of three network nodes and needed to add a fourth one, you would add its details to the bottom of the file in this format:
# network nodes
- id: neut3
  ip-addr: 10.13.111.137
  role: NEUTRON-ROLE
  server-group: RACK2
  mac-addr: "5c:b9:01:89:b6:18"
  nic-mapping: HP-DL360-6PORT
  ilo-ip: 10.1.12.91
  ilo-password: password
  ilo-user: admin
Important: You will need to verify that the ip-addr value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your control_plane.yml file you will need to check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.
Commit the changes to git:
ardana > git commit -a -m "Add new networking node <name>"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Add the new node into Cobbler:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Then you can image the node:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>
Note: If you do not know the <hostname>, you can get it by using sudo cobbler system list.
[OPTIONAL] Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on the node are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used in any other case, it may not wipe all of the expected partitions.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
Configure the operating system on the new networking node with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
Complete the networking node deployment with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>
Run the site.yml playbook with the required tag so that all other services become aware of the new node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
15.1.4.1.3 Adding a New Network Node to Monitoring #
If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
15.1.5 Planned Storage Maintenance #
Planned maintenance procedures for swift storage nodes.
15.1.5.1 Planned Maintenance Tasks for swift Nodes #
Planned maintenance tasks including recovering, adding, and removing swift nodes.
15.1.5.1.1 Adding a Swift Object Node #
Adding additional object nodes allows you to increase capacity.
This topic describes how to add additional swift object server nodes to an existing system.
15.1.5.1.1.1 To add a new node #
To add a new node to your cloud, you will need to add it to servers.yml, and then run the scripts that update your cloud configuration. To begin, access the servers.yml file by checking out the Git branch where you are required to make the changes.
Then, perform the following steps to add a new node:
Log in to the Cloud Lifecycle Manager node.
Get the servers.yml file stored in Git:
cd ~/openstack/my_cloud/definition/data
git checkout site
If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Add the details of new nodes to the servers.yml file. In the following example only one new server, swobj4, is added. However, you can add multiple servers by providing the server details in the servers.yml file:
servers:
...
- id: swobj4
  role: SWOBJ_ROLE
  server-group: <server-group-name>
  mac-addr: <mac-address>
  nic-mapping: <nic-mapping-name>
  ip-addr: <ip-address>
  ilo-ip: <ilo-ip-address>
  ilo-user: <ilo-username>
  ilo-password: <ilo-password>
Commit your changes:
git add -A
git commit -m "Add Node <name>"
Note: Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
For instructions, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 30 “Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only”.
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list with the nodelist argument):
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
In the following example, the server id is swobj4 (mentioned in step 3):
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
Note: You must use the server id as it appears in the servers.yml file, in the id field.
Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
The hostname of the newly added server can be found in the list generated from the output of the following command:
grep hostname ~/openstack/my_cloud/info/server_info.yml
For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt
Validate that the disk drives of the new node are compatible with the disk model used by the node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Run the following playbook to ensure that all other servers' hosts files are updated with the new server:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
For example:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
15.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node #
Steps for adding additional PAC nodes to your swift system.
This topic describes how to add additional swift proxy, account, and container (PAC) servers to an existing system.
15.1.5.1.2.1 Adding a new node #
To add a new node to your cloud, you will need to add it to servers.yml, and then run the scripts that update your cloud configuration. To begin, access the servers.yml file by checking out the Git branch where you are required to make the changes.
Then, perform the following steps to add a new node:
Log in to the Cloud Lifecycle Manager.
Get the servers.yml file stored in Git:
cd ~/openstack/my_cloud/definition/data
git checkout site
If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Add details of new nodes to the servers.yml file:
servers:
...
- id: swpac6
  role: SWPAC-ROLE
  server-group: <server-group-name>
  mac-addr: <mac-address>
  nic-mapping: <nic-mapping-name>
  ip-addr: <ip-address>
  ilo-ip: <ilo-ip-address>
  ilo-user: <ilo-username>
  ilo-password: <ilo-password>
In the above example, only one new server, swpac6, is added. However, you can add multiple servers by providing the server details in the servers.yml file.
In the entry-scale configurations there is no dedicated swift PAC cluster. Instead, there is a cluster using servers that have a role of CONTROLLER-ROLE. You cannot add additional nodes dedicated exclusively to swift PAC because that would change the member-count of the entire cluster. In that case, to create a dedicated swift PAC cluster, you will need to add it to the configuration files. For details on how to do this, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.7 “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.
If adding new PAC nodes, you must add the PAC nodes' configuration details in the following YAML files:
control_plane.yml
disks_pac.yml
net_interfaces.yml
servers.yml
server_roles.yml
You can see a good example of this in the example configurations for the mid-scale model in the ~/openstack/examples/mid-scale-kvm directory.
The following steps assume that you have already created a dedicated swift PAC cluster and that it has two members (swpac4 and swpac5).
Set the member count of the swift PAC cluster to match the number of nodes. For example, if you are adding swpac6 as the sixth swift PAC node, the member count should be increased from 5 to 6 as shown in the following example:
control-planes:
- name: control-plane-1
  control-plane-prefix: cp1
  ...
  clusters:
  ...
  - name: swpac
    cluster-prefix: swpac
    server-role: SWPAC-ROLE
    member-count: 6
  ...
Commit your changes:
git add -A
git commit -m "Add Node <name>"
Note: Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
For instructions, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 30 “Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only”.
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the nodelist argument):
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
In the following example, the server id is swpac6 (mentioned in step 3):
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
Note: You must use the server id as it appears in the servers.yml file, in the id field.
Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
Validate that the disk drives of the new node are compatible with the disk model used by the node:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml
If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Run the following playbook to ensure that all other server's host file are updated with the new server:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this run to just the node (swpac6) that you are adding:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
15.1.5.1.3 Adding Additional Disks to a Swift Node #
Steps for adding additional disks to any nodes hosting swift services.
You may need to add disks to a node for swift usage; these steps show how. They work for adding disks to swift object nodes or to proxy, account, container (PAC) nodes. They also apply to adding disks to a controller node that hosts the swift service, as in the entry-scale example models.
Read through the notes below before beginning the process.
You can add multiple disks at the same time; there is no need to add them one at a time.
You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three swift servers.
15.1.5.1.3.1 Adding additional disks to your Swift servers #
Verify the general health of the swift system and that it is safe to rebalance your rings. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Perform the disk maintenance.
Shut down the first swift server you wish to add disks to.
Add the additional disks to the physical server. The added disk drives should be clean: they should contain either no partitions or a single partition covering the entire disk, and no file system or volume groups. Otherwise, errors will occur and the disks will not be added.
For more details, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.6 “Swift Requirements for Device Group Drives”.
Power the server on.
While the server was shut down, data that normally would have been placed on it was placed elsewhere. When the server is rebooted, the swift replication process moves that data back onto the server. Monitor the replication process to determine when it is complete. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Repeat the steps from Step 2.a for each of the swift servers you are adding the disks to, one at a time.
Note: If the additional disks can be added to the swift servers online (for example, via hotplugging), there is no need to perform the last two steps.
On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.
Edit the disk configuration file that correlates to the type of server you are adding your new disks to.
Path to the typical disk configuration files:
~/openstack/my_cloud/definition/data/disks_swobj.yml
~/openstack/my_cloud/definition/data/disks_swpac.yml
~/openstack/my_cloud/definition/data/disks_controller_*.yml
Example showing the addition of a single new disk, /dev/sdd:
device-groups:
  - name: swiftObject
    devices:
      - name: "/dev/sdb"
      - name: "/dev/sdc"
      - name: "/dev/sdd"
    consumer:
      name: swift
...
Note: For more details on how the disk model works, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”.
Configure the swift weight-step value in the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.
Commit the changes to Git:
cd ~/openstack
git commit -a -m "adding additional swift disks"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the osconfig-run.yml playbook against the swift nodes you have added disks to. Use the --limit switch to target the specific nodes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>
You can use a wildcard when specifying the hostnames with the --limit switch. If you added disks to all of the swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...), you can use a wildcard such as ardana-cp1-swobj*. If you only added disks to some of the nodes, you can use a comma-delimited list of the hostnames of the nodes you added disks to.
Validate your swift configuration with the following playbook, which will also provide details of each drive being added:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
Verify that swift services are running on all of your servers:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml
If everything looks okay with the swift status, then apply the changes to your swift rings with this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
At this point your swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
15.1.5.1.4 Removing a Swift Node #
Removal process for both swift Object and PAC nodes.
You can use this process when you want to remove one or more swift nodes permanently. This process applies to both swift Proxy, Account, Container (PAC) nodes and swift Object nodes.
15.1.5.1.4.1 Setting the Pass-through Attributes #
This process will remove the swift node's drives from the rings and rebalance their responsibilities among the remaining nodes in your cluster. Note that removal will not succeed if it causes the number of remaining disks in the cluster to decrease below the replica count of its rings.
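The replica-count constraint above amounts to simple arithmetic; a hedged sketch (the numbers are illustrative, not taken from your rings):

```shell
# a removal is viable only if the disks left in the cluster still
# cover the replica count of the rings
replicas=3
total_disks=6
disks_to_remove=2
remaining=$((total_disks - disks_to_remove))
if [ "$remaining" -lt "$replicas" ]; then
  echo "removal would fail: only $remaining disks would remain for $replicas replicas"
else
  echo "removal ok: $remaining disks remain for $replicas replicas"
fi
```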
Log in to the Cloud Lifecycle Manager.
Ensure that the weight-step attribute is set. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.
Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include it in your ~/openstack/my_cloud/definition/data/servers.yml file, since your server IDs are already listed in that file. For more information about pass-through, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.17 “Pass Through”.
Here is the format required, which can be inserted at the topmost level of indentation in your file (typically 2 spaces):
pass-through:
  servers:
    - id: server-id
      data:
        subsystem: subsystem-attributes
Here is an example:
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0001
      data:
        swift:
          drain: yes
If a pass-through definition already exists in any of your input model data files, just include the additional data for the server which you are removing instead of defining an entirely new pass-through block.
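For example, with an existing pass-through block, the draining server is appended to the existing servers list rather than starting a new block (the server IDs here are illustrative):

```yaml
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0002            # existing entry, left as-is
      data:
        # ... existing attributes ...
    - id: ccn-0001            # server being drained, appended to the same list
      data:
        swift:
          drain: yes
```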
By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute as described in the previous step. This process is known as "draining", where you remove the swift data from the node in preparation for removing the node.
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Use the playbook to create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
Wait until the replication has completed. For further details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.
Determine whether all of the partitions have been removed from all drives on the swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:
cd /etc/swiftlm/cloud1/cp1/builder_dir/
sudo swift-ring-builder ring_name.builder
For example, if the node you are removing was part of the object-0 ring, the command would be:
sudo swift-ring-builder object-0.builder
Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:
$ cd /etc/swiftlm/cloud1/cp1/builder_dir/
$ sudo swift-ring-builder object-0.builder
object-0.builder, build version 6
4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 16
The overload factor is 0.00% (0.000000)
Devices:   id region zone     ip address  port  replication ip  replication port  name weight partitions balance meta
            0      1    1  192.168.245.3  6002   192.168.245.3              6002 disk0   0.00          0   -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
            1      1    1  192.168.245.3  6002   192.168.245.3              6002 disk1   0.00          0   -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
            2      1    1  192.168.245.4  6002   192.168.245.4              6002 disk0  18.63       2048   -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc
            3      1    1  192.168.245.4  6002   192.168.245.4              6002 disk1  18.63       2048   -0.00 padawan-ccp-c1-m2:disk1:/dev/sdd
            4      1    1  192.168.245.5  6002   192.168.245.5              6002 disk0  18.63       2048   -0.00 padawan-ccp-c1-m3:disk0:/dev/sdc
            5      1    1  192.168.245.5  6002   192.168.245.5              6002 disk1  18.63       2048   -0.00 padawan-ccp-c1-m3:disk1:/dev/sdd
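Rather than scanning the device lines by eye, the partitions column can be summed per server IP. A minimal sketch against a pasted fragment of such output (field 10 is the partitions column in this layout; the rows are illustrative):

```shell
# sum the partitions column (field 10) for one server IP in
# swift-ring-builder device lines; 0 means the server is fully drained
ring_devices='0 1 1 192.168.245.3 6002 192.168.245.3 6002 disk0 0.00 0 -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
1 1 1 192.168.245.3 6002 192.168.245.3 6002 disk1 0.00 0 -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
2 1 1 192.168.245.4 6002 192.168.245.4 6002 disk0 18.63 2048 -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc'
echo "$ring_devices" | awk -v ip=192.168.245.3 '$4 == ip { sum += $10 } END { print sum+0 }'
# prints 0 for this drained server
```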
If the number of partitions is zero for the server on all rings, you can move to the next step; otherwise, continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.
Once the number of partitions is zero for the server on all rings, you can remove the swift node's drives from all rings. Edit the pass-through data you created in step 3 and set the remove attribute as shown in this example:
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0001
      data:
        swift:
          remove: yes
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the swift deploy playbook to rebuild the rings by removing the server:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.
15.1.5.1.4.2 To Disable Swift on a Node #
The next phase in this process will disable the swift service on the node. In this example, swobj4 is the node being removed from swift.
Log in to the Cloud Lifecycle Manager.
Stop swift services on the node using the swift-stop.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit <hostname>
Note: When using the --limit argument, you must specify the full hostname (for example, ardana-cp1-swobj0004) or use the wildcard * (for example, *swobj4*).
The following example uses the swift-stop.yml playbook to stop swift services on ardana-cp1-swobj0004:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
Remove the configuration files.
ssh ardana-cp1-swobj4-mgmt
sudo rm -R /etc/swift
Note: Do not run any other playbooks until you have finished the process described in Section 15.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate /etc/swift and restart swift on swobj4. If you accidentally run a playbook, repeat the process in Section 15.1.5.1.4.2, “To Disable Swift on a Node”.
15.1.5.1.4.3 To Remove a Node from the Input Model #
Use the following steps to finish the process of removing the swift node.
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/definition/data/servers.yml file and remove the entry for the node (swobj4 in this example). In addition, remove the related entry you created in the pass-through section earlier in this process.
If this was a SWPAC node, reduce the member-count attribute by 1 in the ~/openstack/my_cloud/definition/data/control_plane.yml file. For SWOBJ nodes, no such action is needed.
Commit your configuration to the local Git repository (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”), as follows:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Using the remove_deleted_servers and free_unused_addresses switches is recommended to free up the resources associated with the removed node when running the configuration processor. For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.
ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Validate the changes you have made to the configuration files using the playbook below before proceeding further:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
If any errors occur, correct them in your configuration files and repeat steps 3-5 until no more errors occur before going to the next step.
For more details on how to interpret and resolve errors, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Remove the node from Cobbler:
sudo cobbler system remove --name=swobj4
Run the Cobbler deploy playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
The final step will depend on what type of swift node you are removing.
If the node was a SWPAC node, run the ardana-deploy.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
If the node was a SWOBJ node (and not a SWPAC node), run the swift-deploy.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
Wait until replication has finished. For more details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.
You may need to continue to rebalance the rings. For instructions, see Final Rebalance Phase at Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
15.1.5.1.4.4 Remove the Swift Node from Monitoring #
Once you have removed the swift node(s), alarms against them will trigger. Take the following additional steps to resolve those alarms.
Connect to each of the nodes in your cluster running the monasca-api service (as defined in ~/openstack/my_cloud/definition/data/control_plane.yml) and edit /etc/monasca/agent/conf.d/host_alive.yaml (for example, with sudo vi) to delete all references to the swift node(s) you removed.
Once you have removed the references on each of your monasca API servers, restart the monasca-agent on each of those servers with this command:
tux > sudo service openstack-monasca-agent restart
With the swift node references removed and the monasca-agent restarted, you can delete the corresponding alarms to finish this process. To do so, we recommend using the monasca CLI, which is installed on each of your monasca API servers by default:
monasca alarm-list --metric-dimensions hostname=DELETED_SWIFT_NODE
You can then delete each alarm with this command:
monasca alarm-delete ALARM_ID
15.1.5.1.5 Replacing a swift Node #
Maintenance steps for replacing a failed swift node in your environment.
This process is used when you want to replace a failed swift node in your cloud.
If it applies to the server, do not skip step 10. If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but it may move most objects in your system to new locations and may make data unavailable until the replication process has completed.
15.1.5.1.5.1 How to replace a swift node in your environment #
Log in to the Cloud Lifecycle Manager.
Power off the node.
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=OLD_SWIFT_CONTROLLER_NODE
Update your cloud configuration with the details of your replacement swift node.
Edit your servers.yml file to include the details (MAC address, IPMI user, password, and IPMI IP address, if these have changed) about your replacement swift node.
Note: Do not change the server's IP address (that is, ip-addr).
Path to file:
~/openstack/my_cloud/definition/data/servers.yml
Example showing the fields to edit:
- id: swobj5
  role: SWOBJ-ROLE
  server-group: rack2
  mac-addr: 8c:dc:d4:b5:cb:bd
  nic-mapping: HP-DL360-6PORT
  ip-addr: 10.243.131.10
  ilo-ip: 10.1.12.88
  ilo-user: iLOuser
  ilo-password: iLOpass
...
Commit the changes to Git:
ardana > cd ~/openstack
ardana > git commit -a -m "replacing a swift node"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Prepare SLES:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook prepare-sles-loader.yml
ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=NEW_REPLACEMENT_NODE
Update Cobbler and reimage your replacement swift node:
Obtain the name in Cobbler for the node you wish to remove. You will use this value to replace <node name> in future steps.
ardana > sudo cobbler system list
Remove the replaced swift node from Cobbler:
ardana > sudo cobbler system remove --name <node name>
Re-run the cobbler-deploy.yml playbook to add the replacement node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Reimage the node using this playbook:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Wipe the disks on the NEW_REPLACEMENT_NODE. This action will not affect the OS partitions on the server.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit NEW_REPLACEMENT_NODE
Complete the deployment of your replacement swift node.
Obtain the hostname for your new swift node. Use this value to replace <hostname> in future steps.
ardana > cat ~/openstack/my_cloud/info/server_info.yml
Configure the operating system on your replacement swift node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
If this is the swift ring builder server, restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.
Configure services on the node using the ardana-deploy.yml playbook. If you used an encryption password when running the configuration processor, include the --ask-vault-pass argument.
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
15.1.5.1.6 Replacing Drives in a swift Node #
Maintenance steps for replacing drives in a swift node.
This process is used when you want to remove a failed hard drive from a swift node and replace it with a new one.
There are two different classes of drives in a swift node that may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. There is a different procedure for the replacement of each class of drive to bring the node back to normal.
15.1.5.1.6.1 To Replace the Operating System Disk Drive #
After the operating system disk drive is replaced, the node must be reimaged.
Log in to the Cloud Lifecycle Manager.
Update your Cobbler profile:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Reimage the node using this playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>
In the example below, the swobj2 server is reimaged:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2
Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
If this is the first server running the swift-proxy service, restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.
Configure services on the node using the ardana-deploy.yml playbook. If you used an encryption password when running the configuration processor, include the --ask-vault-pass argument.
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
For example:
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
15.1.5.1.6.2 To Replace a Storage Disk Drive #
After a storage drive is replaced, there is no need to reimage the server. Instead, run the swift-reconfigure.yml playbook.
Log in to the Cloud Lifecycle Manager.
Run the following commands:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>
In the following example, the server used is swobj2:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt
15.1.6 Updating MariaDB with Galera #
Updating MariaDB with Galera must be done manually. Updates are not installed automatically. This is particularly an issue with upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.
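Whether an upgrade path crosses the 10.2.17 boundary can be checked with a version-aware comparison; a hedged sketch using sort -V (the version numbers are examples):

```shell
# does the upgrade path cross the 10.2.17 boundary noted above?
installed=10.2.16
boundary=10.2.17
lowest=$(printf '%s\n' "$installed" "$boundary" | sort -V | head -n1)
if [ "$lowest" = "$installed" ] && [ "$installed" != "$boundary" ]; then
  echo "crossing ${boundary}: plan for mysql_upgrade --skip-write-binlog on each node"
fi
```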
Using the CLI, update MariaDB with the following procedure:
Mark Galera as unmanaged:
crm resource unmanage galera
Or put the whole cluster into maintenance mode:
crm configure property maintenance-mode=true
Pick a node other than the one currently targeted by the load balancer and stop MariaDB on that node:
crm_resource --wait --force-demote -r galera -V
Perform updates:
Uninstall the old versions of MariaDB and the Galera wsrep provider.
Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.
Change configuration options if necessary.
Start MariaDB on the node.
crm_resource --wait --force-promote -r galera -V
Run mysql_upgrade with the --skip-write-binlog option.
On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run mysql_upgrade.
Mark Galera as managed:
crm resource manage galera
Or take the cluster out of maintenance mode.
15.2 Unplanned System Maintenance #
Unplanned maintenance tasks for your cloud.
15.2.1 Whole Cloud Recovery Procedures #
Unplanned maintenance procedures for your whole cloud.
15.2.1.1 Full Disaster Recovery #
In this disaster scenario, you have lost everything in your cloud. In other words, you have lost access to all data stored in the cloud that was not backed up to an external backup location, including:
Data in swift object storage
glance images
cinder volumes
Metering, Monitoring, and Logging (MML) data
Workloads running on compute resources
In effect, the following recovery process creates a minimal new cloud with the existing identity information. Much of the operating state and data would have been lost, as would running workloads.
We recommend maintaining backups external to your cloud for your data, including as much as possible of the types of resources listed above. With sufficient external backups, most workloads that were running can be recreated.
15.2.1.1.1 Install and Set Up a Cloud Lifecycle Manager Node #
Before beginning the process of a full cloud recovery, you need to install and set up a Cloud Lifecycle Manager node as though you are creating a new cloud. There are several steps in that process:
Install the appropriate version of SUSE Linux Enterprise Server
Restore the passwd, shadow, and group files. They contain User ID (UID) and group ID (GID) content that will be used to set up the new cloud. If these files are not restored immediately after installing the operating system, the cloud deployment will create new UIDs and GIDs, overwriting the existing content.
Install the Cloud Lifecycle Manager software
Prepare the Cloud Lifecycle Manager, which includes installing the necessary packages
Initialize the Cloud Lifecycle Manager
Restore your OpenStack git repository
Adjust input model settings if the hardware setup has changed
The following sections cover these steps in detail.
15.2.1.1.2 Install the Operating System #
Follow the instructions for installing SUSE Linux Enterprise Server in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”.
15.2.1.1.3 Restore files with UID and GID content #
There is a risk that you may lose data completely. Restore the backups for /etc/passwd, /etc/shadow, and /etc/group immediately after installing SUSE Linux Enterprise Server.
Some backup files contain content that would no longer be valid if your cloud were freshly deployed in the next step of a whole cloud recovery. As a result, some of the backups must be restored before deploying a new cloud. Three kinds of backups are involved: passwd, shadow, and group. The following steps restore them.
Log in to the server where the Cloud Lifecycle Manager will be installed.
Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.
ardana > scp USER@REMOTE_SERVER:TAR_ARCHIVE
Untar the TAR archives to overwrite the three locations:
passwd
shadow
group
ardana > sudo tar -z --incremental --extract --ignore-zeros \
   --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gz
The following are examples. Use the actual tar.gz file names of the backups.
BACKUP_TARGET=/etc/passwd
ardana > sudo tar -z --incremental --extract --ignore-zeros \
   --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
BACKUP_TARGET=/etc/shadow
ardana > sudo tar -z --incremental --extract --ignore-zeros \
   --warning=none --overwrite --directory /etc/ -f shadow.tar.gz
BACKUP_TARGET=/etc/group
ardana > sudo tar -z --incremental --extract --ignore-zeros \
   --warning=none --overwrite --directory /etc/ -f group.tar.gz
15.2.1.1.4 Install the Cloud Lifecycle Manager #
To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.5.2 “Installing the SUSE OpenStack Cloud Extension”.
15.2.1.1.5 Prepare to deploy your cloud #
The following is the general process for preparing to deploy a SUSE OpenStack Cloud. You may not need to perform all the steps, depending on your particular disaster recovery situation.
When you install the ardana cloud pattern in the following process, the ardana user and ardana group will already exist in /etc/passwd and /etc/group. Do not re-create them.
When you run ardana-init in the following process, /var/lib/ardana is created as a deployer account using the account settings in /etc/passwd and /etc/group that were restored in the previous step.
15.2.1.1.5.1 Prepare for Cloud Installation #
Review the Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 14 “Pre-Installation Checklist” about recommended pre-installation tasks.
Prepare the Cloud Lifecycle Manager node. The Cloud Lifecycle Manager must be accessible either directly or via
ssh
, and have SUSE Linux Enterprise Server 12 SP4 installed. All nodes must be accessible to the Cloud Lifecycle Manager. If the nodes do not have direct access to online Cloud subscription channels, the Cloud Lifecycle Manager node will need to host the Cloud repositories.If you followed the installation instructions for Cloud Lifecycle Manager server (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”), SUSE OpenStack Cloud software should already be installed. Double-check whether SUSE Linux Enterprise and SUSE OpenStack Cloud are properly registered at the SUSE Customer Center by starting YaST and running › .
If you have not yet installed SUSE OpenStack Cloud, do so by starting YaST and running › › . Choose and follow the on-screen instructions. Make sure to register SUSE OpenStack Cloud during the installation process and to install the software pattern
patterns-cloud-ardana
.tux >
sudo zypper -n in patterns-cloud-ardanaEnsure the SUSE OpenStack Cloud media repositories and updates repositories are made available to all nodes in your deployment. This can be accomplished either by configuring the Cloud Lifecycle Manager server as an SMT mirror as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 16 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)” or by syncing or mounting the Cloud and updates repositories to the Cloud Lifecycle Manager server as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 17 “Software Repository Setup”.
Configure passwordless
sudo
for the user created when setting up the node (as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.4 “Creating a User”). Note that this is not the user
ardana
that will be used later in this procedure. In the following we assume you named the user
cloud
. Run the command
visudo
as user
root
and add the following line to the end of the file:
CLOUD ALL = (root) NOPASSWD:ALL
Make sure to replace CLOUD with your chosen user name.
Set the password for the user
ardana
:tux >
sudo passwd ardana
Become the user
ardana
:tux >
su - ardana
Place a copy of the SUSE Linux Enterprise Server 12 SP4
.iso
in the
ardana
home directory,
/var/lib/ardana
, and rename it to
sles12sp4.iso
.
Install the templates, examples, and working model directories:
ardana >
/usr/bin/ardana-init
15.2.1.1.6 Restore the remaining Cloud Lifecycle Manager content from a remote backup #
Log in to the Cloud Lifecycle Manager.
Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.
ardana >
scp USER@REMOTE_SERVER:TAR_ARCHIVE
Untar the TAR archives to overwrite the two remaining required locations:
home
ssh
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gz
The following are examples. Use the actual
tar.gz
file names of the backups.
BACKUP_TARGET=/var/lib/ardana
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /var/lib/ -f home.tar.gz
BACKUP_TARGET=/etc/ssh/
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz
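Before running the restore commands above against real system locations, it can help to list each archive's members first, so you know exactly which paths the extraction will overwrite. The following sketch builds a stand-in archive in a throwaway directory; the file names are illustrative, not the actual backups.

```shell
# Sketch: list a backup archive's members before restoring it.
# The archive built here only simulates a real "home" backup.
set -e
workdir=$(mktemp -d)
cd "$workdir"

mkdir -p var/lib/ardana
echo "placeholder" > var/lib/ardana/example.txt
tar -cz --incremental -f home.tar.gz var/lib/ardana

# -t lists members without extracting anything; -z matches the compression.
tar -tzf home.tar.gz
```

Running `tar -tzf` against each retrieved backup before the `--extract` step confirms that an archive really contains the location you are about to overwrite.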
15.2.1.1.7 Re-deployment of controllers 1, 2 and 3 #
Change back to the default ardana user.
Run the
cobbler-deploy.yml
playbook.
ardana >
cd ~/openstack/ardana/ansible
ardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Run the
bm-reimage.yml
playbook limited to the second and third controllers.
ardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3
The names controller2 and controller3 are examples. Use the
bm-power-status.yml
playbook to check the cobbler names of these nodes.
Run the
site.yml
playbook limited to the three controllers and localhost. In this example:
doc-cp1-c1-m1-mgmt
,
doc-cp1-c1-m2-mgmt
,
doc-cp1-c1-m3-mgmt
, and
localhost
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit \ doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
You can now perform the procedures to restore MariaDB and swift.
15.2.1.1.8 Restore MariaDB from a remote backup #
Log in to the first node running the MariaDB service.
Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example,
mydb.tar.gz
).
ardana >
mkdir /tmp/mysql_restore; sudo tar -z --incremental \ --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \ -f mydb.tar.gz
Verify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_info
Stop SUSE OpenStack Cloud services on the three controllers (using the hostnames of the controllers in your configuration).
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \ doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
Delete the files in the
mysql
directory and copy the restored backup to that directory.
root #
cd /var/lib/mysql/
root #
rm -rf ./*
root #
cp -pr /tmp/mysql_restore/* ./
Switch back to the
ardana
user when the copy is finished.
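The copy above uses `cp -pr` so that the ownership, permissions, and timestamps recorded in the backup survive the copy; MariaDB can fail to start if its data files end up with the wrong mode or owner. A minimal demonstration with throwaway paths (not the real MariaDB directories):

```shell
# Sketch: -p preserves mode, ownership, and timestamps; -r recurses.
set -e
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir "$src/db"
echo "data" > "$src/db/ibdata1"
chmod 600 "$src/db/ibdata1"   # restrictive mode, as MariaDB data files have

cp -pr "$src"/* "$dst"/

# Without -p, ownership and timestamps would be reset to the copying user
# and the current time. Here the restrictive mode is carried over:
stat -c '%a' "$dst/db/ibdata1"
```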
15.2.1.1.9 Restore swift from a remote backup #
Log in to the first swift Proxy (
SWF-PRX--first-member
) node.
To find the first swift Proxy node:
On the Cloud Lifecycle Manager:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts swift-status.yml \ --limit SWF-PRX--first-member
At the end of the output, you will see something like the following example:
... Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)' Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)' PLAY RECAP ******************************************************************** ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
Find the first node name and its IP address. For example:
ardana >
cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
Retrieve (
scp
) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.
Create a temporary directory and extract the TAR archive (for example,
swring.tar.gz
).
ardana >
mkdir /tmp/swift_builder_dir_restore; sudo tar -z \ --incremental --extract --ignore-zeros --warning=none --overwrite --directory \ /tmp/swift_builder_dir_restore/ -f swring.tar.gz
Log in to the Cloud Lifecycle Manager.
Stop the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts swift-stop.yml
Log back in to the first swift Proxy (
SWF-PRX--first-member
) node, which was determined in Step 1.
Copy the restored files.
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
For example:
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
Log back in to the Cloud Lifecycle Manager.
Reconfigure the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.1.1.10 Restart SUSE OpenStack Cloud services #
Restart the MariaDB database
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
On the deployer node, execute the
galera-bootstrap.yml
playbook, which will determine the log sequence number, bootstrap the main node, and start the database cluster.
If this process fails to recover the database cluster, refer to Section 15.2.3.1.2, “Recovering the MariaDB Database”.
Restart SUSE OpenStack Cloud services on the three controllers as in the following example.
ardana >
ansible-playbook -i hosts/verb_hosts ardana-start.yml \ --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
Reconfigure SUSE OpenStack Cloud:
ardana >
ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
15.2.2 Recover Start-up Processes #
In this scenario, required processes do not start. If those processes are not running, the Ansible start-up playbooks will fail. On the deployer, use Ansible to check the status of services on the control plane servers. The following checks and remedies address common causes of this condition.
If disk space is low, determine the cause and remove anything that is no longer needed. Check disk space with the following command:
ardana >
ansible KEY-API -m shell -a 'df -h'
Check that Network Time Protocol (NTP) is synchronizing clocks properly with the following command:
ardana >
ansible resources -i hosts/verb_hosts \ -m shell -a "sudo ntpq -c peers"
Check
keepalived
, the daemon that monitors services or systems and automatically fails over to a standby if problems occur.
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status keepalived | head -8"
Restart
keepalived
if necessary.
Check RabbitMQ status first:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo rabbitmqctl status | head -10"
Restart RabbitMQ if necessary:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl start rabbitmq-server"
If RabbitMQ is running, restart
keepalived
:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl restart keepalived"
If RabbitMQ is up, is it clustered?
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo rabbitmqctl cluster_status"
Restart the RabbitMQ cluster if necessary:
ardana >
ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
Check
Kafka
messaging:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status kafka | head -5"
Check the
Spark
framework:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status spark-worker | head -8"
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status spark-master | head -8"
If necessary, start
Spark
:
ardana >
ansible-playbook -i hosts/verb_hosts spark-start.yml
ardana >
ansible KEY-API -i hosts/verb_hosts -m shell -a \ "sudo systemctl start spark-master | head -8"
Check the
Zookeeper
centralized service:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status zookeeper | head -8"
Check MariaDB:
ardana >
ansible KEY-API -i hosts/verb_hosts -m shell -a "sudo mysql -e 'show status;' | grep -e wsrep_incoming_addresses \ -e wsrep_local_state_comment "
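In a healthy three-node Galera cluster, the MariaDB check should show `wsrep_incoming_addresses` containing all three members and `wsrep_local_state_comment` reading `Synced`. The sketch below shows one way to evaluate such output; the sample values are fabricated for illustration:

```shell
# Sketch: judge Galera health from the two status variables collected above.
# The sample output is fabricated.
status='wsrep_incoming_addresses 192.168.24.2:3306,192.168.24.3:3306,192.168.24.4:3306
wsrep_local_state_comment Synced'

# split() returns the number of comma-separated cluster members.
members=$(printf '%s\n' "$status" | awk '/wsrep_incoming_addresses/ {print split($2, a, ",")}')
state=$(printf '%s\n' "$status" | awk '/wsrep_local_state_comment/ {print $2}')

if [ "$members" -eq 3 ] && [ "$state" = "Synced" ]; then
  echo "cluster healthy"
else
  echo "cluster degraded: members=$members state=$state"
fi
```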
15.2.3 Unplanned Control Plane Maintenance #
Unplanned maintenance tasks for controller nodes such as recovery from power failure.
15.2.3.1 Restarting Controller Nodes After a Reboot #
When a controller node is rebooted, needs hardware maintenance, loses network connectivity, or loses power, these steps will help you recover the node.
These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.
15.2.3.1.1 Prerequisites #
The following conditions must be true in order to perform these steps successfully:
Each of your controller nodes should be powered on.
Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.
The operator who performs these steps will need access to the Cloud Lifecycle Manager.
15.2.3.1.2 Recovering the MariaDB Database #
The recovery process for your MariaDB database cluster will depend on how many of your controller nodes need to be recovered. We will cover two scenarios:
Scenario 1: Recovering one or two of your controller nodes but not the entire cluster
If you need to recover one or two of your controller nodes but not the entire cluster, use these steps:
Ensure the controller nodes have power and are booted to the command prompt.
If the MariaDB service is not started, start it with this command:
ardana >
sudo service mysql start
If MariaDB fails to start, proceed to the next section, which covers the bootstrap process.
Scenario 2: Recovering the entire controller cluster with the bootstrap playbook
If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.
Make sure no
mysqld
daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is a
mysqld
daemon running, then use the command below to shut down the daemon.
ardana >
sudo systemctl stop mysql
If the mysqld daemon does not go down following the service stop, then kill the daemon using
kill -9
before continuing.
On the deployer node, execute the
galera-bootstrap.yml
playbook, which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
15.2.3.1.3 Restarting Services on the Controller Nodes #
From the Cloud Lifecycle Manager you should execute the
ardana-start.yml
playbook for each node that was brought
down so the services can be started back up.
If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>
If you have a shared Cloud Lifecycle Manager/controller setup and need to restart
services on this shared node, you can use localhost
to
indicate the shared node, like this:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost
If you leave off the --limit
switch, the playbook will
be run against all nodes.
15.2.3.1.4 Restart the Monitoring Agents #
As part of the recovery process, you should also restart the
monasca-agent
and these steps will show you how:
Log in to the Cloud Lifecycle Manager.
Stop the
monasca-agent
:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-agent-stop.ymlRestart the
monasca-agent
:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-agent-start.ymlYou can then confirm the status of the
monasca-agent
with this playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
15.2.3.2 Recovering the Control Plane #
If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, there are several scenarios for recovering your cloud.
If you backed up the Cloud Lifecycle Manager manually after installation (see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 38 “Post Installation Tasks”), you will have a backup copy of
/etc/group
. When recovering a Cloud Lifecycle Manager node, manually copy
the /etc/group
file from a backup of the old Cloud Lifecycle Manager.
15.2.3.2.1 Point-in-Time MariaDB Database Recovery #
In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.
15.2.3.2.1.1 Restore MariaDB manually #
Follow this procedure to manually restore MariaDB:
Log in to the Cloud Lifecycle Manager.
Stop the MariaDB cluster:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-stop.ymlOn all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:
ardana >
sudo rm -r /var/lib/mysql/*On the first node running the MariaDB service restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.
ardana >
sudo cp -pr /tmp/mysql_restore/* /var/lib/mysqlIf you need to restore the files manually from SSH, follow these steps:
Log in to the first node running the MariaDB service.
Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example,
mydb.tar.gz
).
ardana >
mkdir /tmp/mysql_restore; sudo tar -z --incremental \ --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \ -f mydb.tar.gz
Verify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_info
Log back in to the Cloud Lifecycle Manager.
Start the MariaDB service.
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
After approximately 10-15 minutes, the output of the
percona-status.yml
playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-status.ymlAn example output is as follows:
TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] ************* ok: [ardana-cp1-c1-m1-mgmt] => { "msg": "mysql is synced." } ok: [ardana-cp1-c1-m2-mgmt] => { "msg": "mysql is synced." } ok: [ardana-cp1-c1-m3-mgmt] => { "msg": "mysql is synced." }
15.2.3.2.1.2 Point-in-Time Cassandra Recovery #
A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.
The following steps should be taken before enabling and deploying the replacement node.
Determine the IP address of the node that was removed or is being replaced.
On one of the functional Cassandra control plane nodes, log in as the
ardana
user.Run the command
nodetool status
to display a list of Cassandra nodes.
If the removed node no longer appears in the list (no IP address matches that of the removed node), skip the next two steps.
If the node that was removed is still in the list, copy its node ID.
Run the command
nodetool removenode ID
.
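`nodetool status` marks down nodes with `DN` in the first column, and the Host ID column holds the value that `nodetool removenode` expects. This sketch extracts it from captured output; the sample listing is fabricated, and the awk field position assumes the usual two-token Load value (size plus unit):

```shell
# Sketch: extract the Host ID of a down ("DN") node from `nodetool status`
# output. The sample listing below is fabricated.
sample='--  Address       Load     Tokens  Owns   Host ID                               Rack
UN  192.168.24.2  1.2 MiB  256     66.0%  11111111-2222-3333-4444-555555555555  rack1
DN  192.168.24.5  1.1 MiB  256     64.3%  aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee  rack1'

down_id=$(printf '%s\n' "$sample" | awk '$1 == "DN" {print $7}')
echo "$down_id"

# The ID would then be passed to: nodetool removenode "$down_id"
```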
After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 15.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.
For more information, please consult the Cassandra documentation.
15.2.3.2.2 Point-in-Time swift Rings Recovery #
In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your swift rings to a previous state.
This process restores swift rings only, not swift data.
15.2.3.2.2.1 Restore from a swift backup #
Log in to the first swift Proxy (
SWF-PRX--first-member
) node.
To find the first swift Proxy node:
On the Cloud Lifecycle Manager:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts swift-status.yml \ --limit SWF-PRX--first-member
At the end of the output, you will see something like the following example:
... Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)' Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)' PLAY RECAP ******************************************************************** ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
Find the first node name and its IP address. For example:
ardana >
cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
Retrieve (
scp
) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.
Create a temporary directory and extract the TAR archive (for example,
swring.tar.gz
).
ardana >
mkdir /tmp/swift_builder_dir_restore; sudo tar -z \ --incremental --extract --ignore-zeros --warning=none --overwrite --directory \ /tmp/swift_builder_dir_restore/ -f swring.tar.gz
Log in to the Cloud Lifecycle Manager.
Stop the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts swift-stop.yml
Log back in to the first swift Proxy (
SWF-PRX--first-member
) node, which was determined in Step 1.
Copy the restored files.
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
For example:
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
Log back in to the Cloud Lifecycle Manager.
Reconfigure the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.3.2.3 Point-in-time Cloud Lifecycle Manager Recovery #
In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.
Log in to the Cloud Lifecycle Manager.
Retrieve the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.
Extract the TAR archives for each of the seven locations.
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gzFor example, with a directory such as BACKUP_TARGET=
/etc/ssh/
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gzWith a file such as BACKUP_TARGET=
/etc/passwd
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
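Since every archive pairs with its own extraction directory (a directory backup restores into its parent, a file backup into its containing directory), a small helper keeps the repeated invocations consistent. The sketch below exercises the documented tar flags in a throwaway sandbox; the archive names, contents, and the `restore` helper are illustrative, not part of the actual backup set.

```shell
# Sketch: restore archives with the documented tar flags via a helper that
# maps each archive to its extraction directory. All names are illustrative.
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# Simulate two backups: a directory backup (ssh) and a file backup (passwd).
mkdir -p etc/ssh
echo "HostKey /etc/ssh/ssh_host_ed25519_key" > etc/ssh/sshd_config
echo "root:x:0:0:root:/root:/bin/bash" > etc/passwd
tar -cz --incremental -f ssh.tar.gz etc/ssh
tar -cz --incremental -f passwd.tar.gz etc/passwd
rm -r etc

restore() {   # restore ARCHIVE into TARGET_DIR
  mkdir -p "$2"
  tar -z --incremental --extract --ignore-zeros \
      --warning=none --overwrite --directory "$2" -f "$1"
}

restore ssh.tar.gz    restored/
restore passwd.tar.gz restored/
ls restored/etc
```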
15.2.3.2.4 Cloud Lifecycle Manager Disaster Recovery #
In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.
To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 15 “Installing the Cloud Lifecycle Manager server”, Section 15.5.2 “Installing the SUSE OpenStack Cloud Extension” before proceeding.
Prepare the Cloud Lifecycle Manager following the steps in the Before You
Start
section of Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 21 “Installing with the Install UI”.
15.2.3.2.4.1 Restore from a remote backup #
Log in to the Cloud Lifecycle Manager.
Retrieve (with
scp
) the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.Extract the TAR archives for each of the seven locations.
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gzFor example, with a directory such as BACKUP_TARGET=
/etc/ssh/
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gzWith a file such as BACKUP_TARGET=
/etc/passwd
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gzUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready_deployment.ymlWhen the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
15.2.3.2.5 One or Two Controller Node Disaster Recovery #
This scenario makes the following assumptions:
Your Cloud Lifecycle Manager is still intact and working.
One or two of your controller nodes went down, but not the entire cluster.
The node needs to be rebuilt from scratch, not simply rebooted.
15.2.3.2.5.1 Steps to recovering one or two controller nodes #
Ensure that your node has power and all of the hardware is functioning.
Log in to the Cloud Lifecycle Manager.
Verify that all of the information in your
~/openstack/my_cloud/definition/data/servers.yml
file is correct for your controller node. You may need to replace the existing information if you had to replace either your entire controller node or just pieces of it.
If you made changes to your
servers.yml
file then commit those changes to your local git:ardana >
git add -Aardana >
git commit -a -m "editing controller information"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlEnsure that Cobbler has the correct system information:
If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:
ardana >
sudo cobbler system listRemove any controller nodes from Cobbler that no longer exist:
ardana >
sudo cobbler system remove --name=<node>Add the new node into Cobbler:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Then you can image the node:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>
Note
If you do not know the
<node_name>
already, you can get it by using
sudo cobbler system list
.
Before proceeding, look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. To prevent loss of data, the configuration processor retains data about removed nodes and keeps their ID numbers from being reallocated. For more information about how this works, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations”.
Run the
wipe_disks.yml
playbook to ensure the non-OS partitions on your nodes are completely wiped prior to continuing with the installation.ImportantThe
wipe_disks.yml
playbook is only meant to be run on systems immediately after runningbm-reimage.yml
. If used for any other situation, it may not wipe all of the expected partitions.ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>Complete the rebuilding of your controller node with the two playbooks below:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>ardana >
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>
15.2.3.2.6 Three Control Plane Node Disaster Recovery #
In this scenario, all control plane nodes are down and need to be rebuilt or replaced. Restoring from a swift backup is not possible because swift is gone.
15.2.3.2.6.1 Restore from an SSH backup #
Log in to the Cloud Lifecycle Manager.
Deploy the control plane nodes, using the values for your control plane node hostnames:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit \ CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2, \ CONTROL_PLANE_HOSTNAME3 -e rebuild=TrueFor example, if you were using the default values from the example model files, the command would look like this:
ardana >
ansible-playbook -i hosts/verb_hosts site.yml \ --limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \ -e rebuild=TrueNoteThe
-e rebuild=True
is only used on a single control plane node when there are other controllers available to pull configuration data from. This causes the MariaDB database to be reinitialized, which is the only choice if there are no additional control nodes.Log in to the Cloud Lifecycle Manager.
Stop MariaDB:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-stop.ymlRetrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example,
mydb.tar.gz
).
ardana >
mkdir /tmp/mysql_restore; sudo tar -z --incremental \ --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \ -f mydb.tar.gz
Verify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_info
Log back in to the first controller node and move the following files:
ardana >
ssh FIRST_CONTROLLER_NODE
ardana >
sudo su
root #
rm -rf /var/lib/mysql/*
root #
cp -pr /tmp/mysql_restore/* /var/lib/mysql/
Log back in to the Cloud Lifecycle Manager and bootstrap MariaDB:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.ymlVerify the status of MariaDB:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-status.yml
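The tar options used to extract the backup above can be exercised locally before touching a controller. The following is a minimal, self-contained sketch: all paths are temporary and illustrative, and a level-0 incremental archive is created with -g /dev/null so that the same extraction flags used in this procedure can be demonstrated end to end.

```shell
# Create a throwaway source tree and archive it using GNU tar's
# incremental format (level 0; the snapshot file is discarded via /dev/null).
tmp=$(mktemp -d)
mkdir -p "$tmp/src"
echo "hello" > "$tmp/src/file.txt"
tar -C "$tmp/src" -cz -g /dev/null -f "$tmp/mydb.tar.gz" .

# Extract with the same options the restore procedure uses.
mkdir -p "$tmp/restore"
tar -z --incremental --extract --ignore-zeros --warning=none \
    --overwrite --directory "$tmp/restore" -f "$tmp/mydb.tar.gz"

# The restored file should match the original.
cat "$tmp/restore/file.txt"
rm -rf "$tmp"
```

Running this dry run first confirms that the tar version on the node supports the incremental extraction flags before the real backup archive is touched.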
15.2.3.2.7 swift Rings Recovery #
To recover the swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings with the manual swift backup and restore or use the SSH backup.
15.2.3.2.7.1 Restore from the swift deployment backup #
15.2.3.2.7.2 Restore from the SSH backup #
If you have lost all of the system disks of all object nodes and the swift proxy nodes are corrupted, you can recover the rings from a copy of the swift rings that was backed up previously. The swift data itself is still available (the disks used by swift still need to be accessible).
Recover the rings with these steps.
Log in to a swift proxy node.
Become root:

ardana > sudo su

Create the temporary directory for your restored files:

root # mkdir /tmp/swift_builder_dir_restore/

Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

Create a temporary directory and extract the TAR archive (for example, swring.tar.gz):

ardana > mkdir /tmp/swift_builder_dir_restore; sudo tar -z \
  --incremental --extract --ignore-zeros --warning=none --overwrite \
  --directory /tmp/swift_builder_dir_restore/ -f swring.tar.gz

You now have the swift rings in /tmp/swift_builder_dir_restore/.
If the SWF-PRX--first-member is already deployed, copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-PRX--first-member. Then, from the Cloud Lifecycle Manager, run:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
  /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
  /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml

If the SWF-ACC--first-member is not deployed, from the Cloud Lifecycle Manager run these playbooks:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts guard-deployment.yml
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>

Copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-ACC[0].

Create the directories:

/etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
  /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
  /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/

From the Cloud Lifecycle Manager, run the ardana-deploy.yml playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
15.2.4 Unplanned Compute Maintenance #
Unplanned maintenance tasks including recovering compute nodes.
15.2.4.1 Recovering a Compute Node #
If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, you need to perform disaster recovery. The following scenarios describe how to resolve common failures and repair your cloud.
Typical scenarios in which you will need to recover a compute node include the following:
The node has failed, either because it has shut down, has a hardware failure, or for another reason.
The node is working but the nova-compute process is not responding; instances are running but you cannot manage them (for example, to delete or reboot them, or to attach/detach volumes).

The node is fully operational but monitoring indicates a potential issue (such as disk errors) that requires down time to fix.
15.2.4.1.1 What to do if your compute node is down #
Compute node has power but is not powered on
If your compute node has power but is not powered on, use these steps to restore the node:
Log in to the Cloud Lifecycle Manager.
Obtain the name for your compute node in Cobbler:
ardana > sudo cobbler system list

Power the node back up with this playbook, specifying the node name from Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>
Compute node is powered on but services are not running on it
If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:
Log in to the Cloud Lifecycle Manager.
Confirm the status of the compute service on the node with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>

You can start the compute service on the node with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>
15.2.4.1.2 Scenarios involving disk failures on your compute nodes #
Your compute nodes should have a minimum of two disks, one that is used for
the operating system and one that is used as the data disk. These are
defined during the installation of your cloud, in the
~/openstack/my_cloud/definition/data/disks_compute.yml
file on the Cloud Lifecycle Manager. The data disk(s) are where the
nova-compute
service lives. Recovery scenarios will
depend on whether one or the other, or both, of these disks experienced
failures.
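To make the role of that file concrete, the following is a simplified, hypothetical sketch of the kind of disk model defined there. The device names, sizes, and volume-group names are illustrative only; consult the actual ~/openstack/my_cloud/definition/data/disks_compute.yml in your input model for the real layout.

```yaml
# Illustrative fragment only; your actual disks_compute.yml will differ.
disk-models:
- name: COMPUTE-DISKS
  volume-groups:
  - name: ardana-vg              # operating system volume group
    physical-volumes:
    - /dev/sda_root
    logical-volumes:
    - name: root
      size: 80%
      fstype: ext4
      mount: /
  - name: vg-comp                # data volume group used by nova-compute
    physical-volumes:
    - /dev/sdb
    logical-volumes:
    - name: compute
      size: 95%
      fstype: ext4
      mount: /var/lib/nova
```

The separation between the operating system volume group and the data volume group is what makes the recovery scenarios below differ: losing one disk does not necessarily mean losing the other.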
If your operating system disk failed but the data disk(s) are okay
If you have had issues with the physical volume that hosts your operating system, ensure that the physical volume is restored, and then use the following steps to restore the operating system:
Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
ardana > source ~/service.osrc

Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

ardana > openstack host list | grep compute

Obtain the status of the nova-compute service on that node:

ardana > openstack compute service list --host <hostname>

You will likely want to disable provisioning on that node to ensure that nova-scheduler does not attempt to place any additional instances on the node while you are repairing it:

ardana > openstack compute service set --disable --reason "node is being rebuilt" <hostname>

Obtain the status of the instances on the compute node:

ardana > openstack server list --host <hostname> --all-tenants

Before continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, you can use the nova evacuate or nova host-evacuate commands to do this. See Section 15.1.3.3, “Live Migration of Instances” for more details on how to do this.

If your instances are not booted from volumes, you will need to stop the instances using the openstack server stop command. Because the nova-compute service is not running on the node, you will not see the instance status change, but the Task State for the instance should change to powering-off.

ardana > openstack server stop <instance_uuid>

Verify the status of each of the instances using these commands, checking that the Task State is powering-off:

ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server show <instance_uuid>

At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:
Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:

ardana > sudo cobbler system list

Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]

Reimage the compute node with this playbook:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Once reimaging is complete, use the following playbook to configure the operating system and start up services:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>

You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not, use the openstack server start command to bring them to the ACTIVE state:

ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server start <instance_uuid>

Re-enable provisioning:

ardana > openstack compute service set --enable <hostname>

Start any instances that you had stopped previously:

ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server start <instance_uuid>
If your data disk(s) failed but the operating system disk is okay OR if all drives failed
In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.
After that is complete, use the openstack server rebuild
command to respawn your instances, which will also ensure that they receive
the same IP address:
ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server rebuild <instance_uuid>
15.2.5 Unplanned Storage Maintenance #
Unplanned maintenance tasks for storage nodes.
15.2.5.1 Unplanned swift Storage Maintenance #
Unplanned maintenance tasks for swift storage nodes.
15.2.5.1.1 Recovering a Swift Node #
If one or more of your swift object or PAC nodes has experienced an issue, such as power loss or hardware failure, you need to perform disaster recovery. The following scenarios describe how to resolve them and repair your cloud.
Typical scenarios in which you will need to repair a swift object or PAC node include:
The node has either shut down or been rebooted.
The entire node has failed and needs to be replaced.
A disk drive has failed and must be replaced.
15.2.5.1.1.1 What to do if your Swift host has shut down or rebooted #
If your swift host has power but is not powered on, restore the node with these steps from the Cloud Lifecycle Manager:
Log in to the Cloud Lifecycle Manager.
Obtain the name for your swift host in Cobbler:
ardana > sudo cobbler system list
Power the node back up with this playbook, specifying the node name from Cobbler:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>
Once the node is booted up, swift should start automatically. You can verify this with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml
Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 18.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.
15.2.5.1.1.2 How to replace your Swift node #
If your swift node has irreparable damage and you need to replace the entire node in your environment, see Section 15.1.5.1.5, “Replacing a swift Node” for details on how to do this.
15.2.5.1.1.3 How to replace a hard disk in your Swift node #
If you need to do a hard drive replacement in your swift node, see Section 15.1.5.1.6, “Replacing Drives in a swift Node” for details on how to do this.
15.3 Cloud Lifecycle Manager Maintenance Update Procedure #
Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 16 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”. Alternatives to setting up an SMT server are described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 17 “Software Repository Setup”.
Read the Release Notes for the security and maintenance updates that will be installed.
Have a backup strategy in place. For further information, see Chapter 17, Backup and Restore.
Ensure that you have a known starting state by resolving any unexpected alarms.
Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.
Review steps in Section 15.1.4.1, “Adding a Network Node” and Section 15.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the neutron services are not provided via external SDN controllers.
Before the update, prepare your workloads by consolidating all of your instances onto one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and move the instances from the remaining Compute Nodes to the newly rebooted ones. Then, update the remaining Compute Nodes.
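One way to judge whether the reboot mentioned above can safely be skipped is to check which running processes are still using files that were deleted or replaced by an update. On SUSE systems, zypper can report this directly:

```
ardana > sudo zypper ps -s
```

If the list is empty, or contains only services that you have already restarted, the impact of skipping the reboot is easier to reason about. Kernel updates still require a reboot to take effect.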
15.3.1 Performing the Update #
Before you proceed, get the status of all your services:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
If the status check returns an error for a specific service, run the SERVICE-reconfigure.yml playbook. Then run the SERVICE-status.yml playbook to check that the issue has been resolved.
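For example, if the nova status check reported a failure, the corresponding pair of playbooks would be the following; any other service name follows the same pattern:

```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts nova-status.yml
```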
Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”.
The described workflow also covers cases in which the deployer node is provisioned as an active cloud node.
To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 15.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.
Install all available security and maintenance updates on the deployer using the zypper patch command.

Initialize the Cloud Lifecycle Manager and prepare the update playbooks:

Run the ardana-init initialization script to update the deployer.

Redeploy Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Installation and management of updates can be automated with the following playbooks:
ardana-update-pkgs.yml
ardana-update.yml
ardana-update-status.yml
ardana-reboot.yml
Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs.yml playbook on each node:

ardana > hostnamectl

Notice that Boot ID: and Kernel: have changed.

By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME

There may be a delay in the playbook output at the following task while updates are pulled from the deployer:
TASK: [ardana-upgrade-tools | pkg-update | Download and install package updates] ***
After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks:

ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit TARGET_NODE_NAME

To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true:

ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME \
  -e zypper_update_include_reboot_patches=true

If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.

Note: To update a single package (for example, to apply a PTF on a single node or on all nodes), run zypper update PACKAGE. To install all package updates, use zypper update.

Update services:

ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
  --limit TARGET_NODE_NAME

If indicated by the ardana-update-status.yml playbook, reboot the node. There may also be a warning to reboot after running the ardana-update-pkgs.yml playbook. This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable.

ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
  --limit TARGET_NODE_NAME
To recheck the pending system reboot status at a later time, run the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit ardana-cp1-c1-m2 -e update_status_var=system-reboot

The pending system reboot status can be reset by running:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit ardana-cp1-c1-m2 \
  -e update_status_var=system-reboot \
  -e update_status_reset=true

Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.

Warning: When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time. If multiple nodes are specified on the command line (with --limit), services on those servers will experience outages as the packages are shut down and updated. Migrate the workload off a Compute Node (or group of Compute Nodes) before you update it. The same applies to Control Nodes: move singleton services off the control plane node that will be updated.

Important: Do not reboot all of your controllers at the same time.
When the node comes up after the reboot, run the spark-start.yml playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml

Verify that Spark is running on all Control Nodes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml

After all nodes have been updated, check the status of all services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
15.3.2 Summary of the Update Playbooks #
- ardana-update-pkgs.yml
Top-level playbook that automates the installation of package updates on a single node. It also works for multiple nodes if the single-node restriction is overridden, either by setting the SKIP_SINGLE_HOST_CHECKS environment variable or by running ardana-update-pkgs.yml -e skip_single_host_checks=true.
Provide the following -e options to modify the default behavior:
  zypper_update_method (default: patch)
    patch installs all patches for the system. Patches are intended for specific bug and security fixes.
    update installs all packages that have a higher version number than the installed packages.
    dist-upgrade replaces each installed package with the version from the repository and deletes packages not available in the repositories.
  zypper_update_repositories (default: all) restricts the list of repositories used.
  zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks that packages are correctly signed.
  zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third-party licenses.
  zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).
- ardana-update.yml
Top-level playbook that automates the update of all services. Runs on all nodes by default, or can be limited to a single node by adding --limit nodename.
- ardana-reboot.yml
Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.
- ardana-update-status.yml
This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.
15.4 Upgrading Cloud Lifecycle Manager 8 to Cloud Lifecycle Manager 9 #
Before undertaking the upgrade from SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, ensure that your existing SUSE OpenStack Cloud 8 Cloud Lifecycle Manager installation is up to date by following the maintenance update procedure at https://documentation.suse.com/hpe-helion/8/html/hpe-helion-openstack-clm-all/system-maintenance.html#maintenance-update.
Ensure you review the following resources:
To confirm that all nodes have been successfully updated with no pending actions, run the ardana-update-status.yml playbook on the Cloud Lifecycle Manager deployer node as follows:

ardana > cd scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml
Ensure that all nodes have been updated, and that there are no pending update actions remaining to be completed. In particular, ensure that any nodes that need to be rebooted have been, using the documented reboot procedure.
Once all nodes have been successfully updated, and there are no pending update actions remaining, you should be able to run the ardana-pre-upgrade-validations.sh script as follows:

ardana > cd scratch/ansible/next/ardana/ansible/
ardana > ./ardana-pre-upgrade-validations.sh
~/scratch/ansible/next/ardana/ansible ~/scratch/ansible/next/ardana/ansible
PLAY [Initialize an empty list of msgs] ***************************************
TASK: [set_fact ] *************************************************************
ok: [localhost]
...
PLAY RECAP ********************************************************************
...
localhost                  : ok=8    changed=5    unreachable=0    failed=0
msg: Please refer to /var/log/ardana-pre-upgrade-validations.log for
the results of this run. Ensure that any messages in the file that
have the words FAIL or WARN are resolved.

The last line of output from the ardana-pre-upgrade-validations.sh script gives the name of its log file; in this case, /var/log/ardana-pre-upgrade-validations.log. If you look at the log file, you will see content similar to the following:

ardana > sudo cat /var/log/ardana-pre-upgrade-validations.log
ardana-cp-dbmqsw-m1*************************************************************
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
ardana-cp-dbmqsw-m2*************************************************************
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
ardana-cp-dbmqsw-m3*************************************************************
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
ardana-cp-mml-m1****************************************************************
SUCCESS: Keystone V2 ==> V3 API config changes detected.
ardana-cp-mml-m2****************************************************************
SUCCESS: Keystone V2 ==> V3 API config changes detected.
ardana-cp-mml-m3****************************************************************
SUCCESS: Keystone V2 ==> V3 API config changes detected.
localhost***********************************************************************

The report states the following:
- SUCCESS: Keystone V2 ==> V3 API config changes detected.
This check confirms that your cloud has been updated with the necessary changes such that all services will be using Keystone V3 API. This means that there should be minimal interruption of service during the upgrade. This is important because the Keystone V2 API has been removed in SUSE OpenStack Cloud 9.
- NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than the SUSE Linux Enterprise 12 SP4 recommendation of 512. Some recommended XFS data integrity features may not be available after upgrade.
This check will only report something if you have local swift configured and it is formatted with the SUSE Linux Enterprise 12 SP3 default XFS inode size of 256. In SUSE Linux Enterprise 12 SP4, the default XFS inode size for a newly-formatted XFS file system has been increased to 512, to allow room for enabling some additional XFS data-integrity features by default.
There will be no loss of swift functionality after the upgrade. The difference is that some additional XFS features will not be available on file systems that were formatted under SUSE Linux Enterprise 12 SP3 or earlier. These XFS features aid in the detection of, and recovery from, data corruption. They are enabled by default for XFS file systems formatted under SUSE Linux Enterprise 12 SP4.
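To check the inode size of an existing swift XFS file system yourself, run xfs_info against the mount point (for example, /srv/node/disk0) and look at the isize field in the meta-data line. The following is a small sketch of extracting that field; the sample line is illustrative output, not taken from a live system:

```shell
# On a real node you would capture this with: xfs_info /srv/node/disk0
sample='meta-data=/dev/sdb1   isize=256   agcount=4, agsize=655360 blks'

# Pull out the isize field; 256 indicates an SLE 12 SP3-era format,
# 512 is the SLE 12 SP4 default.
echo "$sample" | grep -o 'isize=[0-9]*'
```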
In addition to the automated upgrade checks above, there are some checks that should be performed manually.
For each network interface device specified in the input model under ~/openstack/my_cloud/definition, ensure that there is only one untagged VLAN. The SUSE OpenStack Cloud 9 Cloud Lifecycle Manager configuration processor will fail with an error if it detects this problem during the upgrade, so address this problem before starting the upgrade process.

If the deployer node is not a standalone system, but is instead co-located with the DB services, this can lead to potentially longer service disruptions during the upgrade process. To determine if this is the case, check whether the deployer node (OPS-LM--first-member) is a member of the database nodes (FND-MDB). You can do this with the following command:

ardana > cd scratch/ansible/next/ardana/ansible/
ardana > ansible -i hosts/verb_hosts 'FND-MDB:&OPS-LM--first-member' --list-hosts

If the output is:

No hosts matched

then the deployer node is not co-located with the database nodes. Otherwise, if the command reports a hostname, there may be additional interruptions to the database services during the upgrade.
Similarly, if the deployer is co-located with the database services, and you are also trying to run a local SMT service on the deployer node, you will run into issues trying to configure the SMT to enable and mirror the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.
In such cases, it is recommended that you run the SMT services on a different node and NFS-import the /srv/www/htdocs/repo directory onto the deployer node, instead of trying to run the SMT services locally.
The integrated backup solution in SUSE OpenStack Cloud 8 Cloud Lifecycle Manager, freezer, is no longer available in SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. Therefore, we recommend doing a manual backup to a server that is not a member of the cloud, as per Chapter 17, Backup and Restore.
15.4.1 Migrating the Deployer Node Packages #
The upgrade process first migrates the SUSE OpenStack Cloud 8 Cloud Lifecycle Manager deployer node to SUSE Linux Enterprise 12 SP4 and the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager packages.
If the deployer node is not a dedicated node, but is instead a member of one of the cloud-control planes, then some services may restart with the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM versions of the software during the migration. This may mean that:
Some services fail to restart. This will be resolved when the appropriate SUSE OpenStack Cloud 9 configuration changes are applied by running the ardana-upgrade.yml playbook later during the upgrade process.

Other services may log excessive warnings about connectivity issues and backwards-compatibility problems. This will be resolved when the relevant services are upgraded during the ardana-upgrade.yml playbook run.
In order to upgrade the deployer node to be based on SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you first need to migrate the system to SUSE Linux Enterprise 12 SP4 with the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager product installed.
The process for migrating the deployer node differs somewhat, depending on whether your deployer node is registered with the SUSE Customer Center (or an SMT mirror), versus using locally-maintained repositories available at the relevant locations.
If your deployer node is registered with the SUSE Customer Center or an SMT, the migration
process requires the zypper-migration-plugin
package to be installed.
If you are using an SMT server to mirror the relevant repositories, then you need to enable mirroring of the relevant repositories. See Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 16 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”, Section 16.3 “Setting up Repository Mirroring on the SMT Server” for more information.
Ensure that the mirroring process has completed before proceeding.
Ensure that the zypper-migration-plugin package is installed; if not, install it:

ardana > sudo zypper install zypper-migration-plugin
Refreshing service 'SMT-http_smt_example_com'.
Loading repository data...
Reading installed packages...
'zypper-migration-plugin' is already installed.
No update candidate for 'zypper-migration-plugin-0.10-12.4.noarch'. The highest available version is already installed.
Resolving package dependencies...
Nothing to do.

De-register the SUSE Linux Enterprise Server LTSS 12 SP3 x86_64 extension (if enabled):
ardana >
sudo SUSEConnect --status-text Installed Products: ------------------------------------------ SUSE Linux Enterprise Server 12 SP3 LTSS (SLES-LTSS/12.3/x86_64) Registered ------------------------------------------ SUSE Linux Enterprise Server 12 SP3 (SLES/12.3/x86_64) Registered ------------------------------------------ SUSE OpenStack Cloud 8 (suse-openstack-cloud/8/x86_64) Registered ------------------------------------------ ardana > sudo SUSEConnect -d -p SLES-LTSS/12.3/x86_64 Deregistering system from registration proxy https://smt.example.com/ Deactivating SLES-LTSS 12.3 x86_64 ... -> Refreshing service ... -> Removing release package ... ardana > sudo SUSEConnect --status-text Installed Products: ------------------------------------------ SUSE Linux Enterprise Server 12 SP3 (SLES/12.3/x86_64) Registered ------------------------------------------ SUSE OpenStack Cloud 8 (suse-openstack-cloud/8/x86_64) Registered ------------------------------------------Disable any other SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories. The zypper migration process should detect and disable most of these automatically, but in some cases it may not catch all of them, which can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the
/srv/www/suse-12.3
directory or the SUSE-12-4 alias under http://localhost:79/
, you could use the following commands:ardana >
zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2 PTF SLES12-SP3-LTSS-Updates SLES12-SP3-Pool SLES12-SP3-Updates SUSE-OpenStack-Cloud-8-Pool SUSE-OpenStack-Cloud-8-Updates ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done Repository 'PTF' has been successfully disabled. Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled. Repository 'SLES12-SP3-Pool' has been successfully disabled. Repository 'SLES12-SP3-Updates' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3 (a new one, based on SUSE Linux Enterprise 12 SP4, will be created during the upgrade process):
ardana >
zypper repos | grep PTF 2 | PTF | PTF | No | (r ) Yes | Yes ardana > sudo zypper removerepo PTF Removing repository 'PTF' ..............................................................................................[done] Repository 'PTF' has been removed.Remove the Cloud media repository (if defined):
ardana >
zypper repos | grep '[|] Cloud ' 1 | Cloud | SUSE OpenStack Cloud 8 DVD #1 | Yes | (r ) Yes | No ardana > sudo zypper removerepo Cloud Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done] Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.Run the
zypper migration
command, which should offer a single choice: namely, to upgrade to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. You need to accept the offered choice, then answer yes
to any prompts to disable obsoleted repositories. At that point, the zypper migration
command will run zypper dist-upgrade
, which will prompt you to agree with the proposed package changes. Finally, you will need to agree with any new licenses. After this, the package upgrade of the deployer node will proceed. The output of the running zypper migration
should look something like the following:ardana >
sudo zypper migration Executing 'zypper refresh' Repository 'SLES12-SP3-Pool' is up to date. Repository 'SLES12-SP3-Updates' is up to date. Repository 'SLES12-SP3-Pool' is up to date. Repository 'SLES12-SP3-Updates' is up to date. Repository 'SUSE-OpenStack-Cloud-8-Pool' is up to date. Repository 'SUSE-OpenStack-Cloud-8-Updates' is up to date. Repository 'OpenStack-Cloud-8-Pool' is up to date. Repository 'OpenStack-Cloud-8-Updates' is up to date. All repositories have been refreshed. Executing 'zypper --no-refresh patch-check --updatestack-only' Loading repository data... Reading installed packages... 0 patches needed (0 security patches) Available migrations: 1 | SUSE Linux Enterprise Server 12 SP4 x86_64 SUSE OpenStack Cloud 9 x86_64 [num/q]: 1 Executing 'snapper create --type pre --cleanup-algorithm=number --print-number --userdata important=yes --description 'before online migration'' The config 'root' does not exist. Likely snapper is not configured. See 'man snapper' for further instructions. Upgrading product SUSE Linux Enterprise Server 12 SP4 x86_64. Found obsolete repository SLES12-SP3-Updates Disable obsolete repository SLES12-SP3-Updates [y/n] (y): y ... disabling. Found obsolete repository SLES12-SP3-Pool Disable obsolete repository SLES12-SP3-Pool [y/n] (y): y ... disabling. Upgrading product SUSE OpenStack Cloud 9 x86_64. Found obsolete repository OpenStack-Cloud-8-Pool Disable obsolete repository OpenStack-Cloud-8-Pool [y/n] (y): y ... disabling. 
Executing 'zypper --releasever 12.4 ref -f' Warning: Enforced setting: $releasever=12.4 Forcing raw metadata refresh Retrieving repository 'SLES12-SP4-Pool' metadata .......................................................................[done] Forcing building of repository cache Building repository 'SLES12-SP4-Pool' cache ............................................................................[done] Forcing raw metadata refresh Retrieving repository 'SLES12-SP4-Updates' metadata ....................................................................[done] Forcing building of repository cache Building repository 'SLES12-SP4-Updates' cache .........................................................................[done] Forcing raw metadata refresh Retrieving repository 'SUSE-OpenStack-Cloud-9-Pool' metadata ...........................................................[done] Forcing building of repository cache Building repository 'SUSE-OpenStack-Cloud-9-Pool' cache ................................................................[done] Forcing raw metadata refresh Retrieving repository 'SUSE-OpenStack-Cloud-9-Updates' metadata ........................................................[done] Forcing building of repository cache Building repository 'SUSE-OpenStack-Cloud-9-Updates' cache .............................................................[done] Forcing raw metadata refresh Retrieving repository 'OpenStack-Cloud-8-Updates' metadata .............................................................[done] Forcing building of repository cache Building repository 'OpenStack-Cloud-8-Updates' cache ..................................................................[done] All repositories have been refreshed. Executing 'zypper --releasever 12.4 --no-refresh dist-upgrade --no-allow-vendor-change ' Warning: Enforced setting: $releasever=12.4 Warning: You are about to do a distribution upgrade with all enabled repositories. 
Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command. Loading repository data... Reading installed packages... Computing distribution upgrade... ... 525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch. Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used. Continue? [y/n/...? shows all options] (y): y ... dracut: *** Generating early-microcode cpio image *** dracut: *** Constructing GenuineIntel.bin **** dracut: *** Store current command line parameters *** dracut: Stored kernel commandline: dracut: rd.lvm.lv=ardana-vg/root dracut: root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' *** dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done *** Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script: Refresh script btrfs-scrub.sh for monthly Refresh script btrfs-defrag.sh for none Refresh script btrfs-balance.sh for weekly Refresh script btrfs-trim.sh for none There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.
In this configuration, you need to manually migrate the system using
zypper dist-upgrade
, according to the following steps:
Disable any SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud 8 Cloud Lifecycle Manager-related repositories. Leaving the SUSE Linux Enterprise 12 SP3 and/or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories enabled can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the
/srv/www/suse-12.3
directory, or the SUSE-12-4 alias under http://localhost:79/
, use the following commands:ardana >
zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2 PTF SLES12-SP3-LTSS-Updates SLES12-SP3-Pool SLES12-SP3-Updates SUSE-OpenStack-Cloud-8-Pool SUSE-OpenStack-Cloud-8-Updates ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done Repository 'PTF' has been successfully disabled. Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled. Repository 'SLES12-SP3-Pool' has been successfully disabled. Repository 'SLES12-SP3-Updates' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.NoteThe SLES12-SP3-LTSS-Updates repository should only be present if you have purchased the optional SUSE Linux Enterprise 12 SP3 LTSS support. Whether or not it is configured will not impact the upgrade process.
Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3. A new one based on SUSE Linux Enterprise 12 SP4 will be created during the upgrade process.
ardana >
zypper repos | grep PTF 2 | PTF | PTF | Yes | (r ) Yes | Yes ardana > sudo zypper removerepo PTF Removing repository 'PTF' ..............................................................................................[done] Repository 'PTF' has been removed.Remove the Cloud media repository if defined.
ardana >
zypper repos | grep '[|] Cloud ' 1 | Cloud | SUSE OpenStack Cloud 8 DVD #1 | Yes | (r ) Yes | No ardana > sudo zypper removerepo Cloud Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done] Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.Ensure the deployer node has access to the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM repositories as documented in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 17 “Software Repository Setup” paying attention to the non-SMT based repository setup. When you run
zypper repos --show-enabled-only
, the output should look similar to the following:ardana >
zypper repos --show-enabled-only # | Alias | Name | Enabled | GPG Check | Refresh ---+--------------------------------+--------------------------------+---------+-----------+-------- 1 | Cloud | SUSE OpenStack Cloud 9 DVD #1 | Yes | (r ) Yes | No 7 | SLES12-SP4-Pool | SLES12-SP4-Pool | Yes | (r ) Yes | No 8 | SLES12-SP4-Updates | SLES12-SP4-Updates | Yes | (r ) Yes | Yes 9 | SUSE-OpenStack-Cloud-9-Pool | SUSE-OpenStack-Cloud-9-Pool | Yes | (r ) Yes | No 10 | SUSE-OpenStack-Cloud-9-Updates | SUSE-OpenStack-Cloud-9-Updates | Yes | (r ) Yes | YesNoteThe Cloud repository above is optional. Its content is equivalent to the SUSE-Openstack-Cloud-9-Pool repository.
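Before proceeding, it can help to double-check that no SP3- or Cloud-8-era repositories slipped through. A minimal sketch operating on a saved listing (the sample file contents below are illustrative stand-in data, not real zypper output):

```shell
# Capture the real listing first with:
#   zypper repos --show-enabled-only > /tmp/repos.txt
# The sample below is illustrative stand-in data, not real zypper output.
cat > /tmp/repos.txt <<'EOF'
7  | SLES12-SP4-Pool                | SLES12-SP4-Pool                | Yes
8  | SLES12-SP4-Updates             | SLES12-SP4-Updates             | Yes
9  | SUSE-OpenStack-Cloud-9-Pool    | SUSE-OpenStack-Cloud-9-Pool    | Yes
EOF
# Any SP3 or Cloud-8 repository still enabled indicates a missed cleanup step
if grep -Eq 'SP3|Cloud-8' /tmp/repos.txt; then
  echo "stale SP3/Cloud-8 repositories still enabled"
else
  echo "only SP4/Cloud-9 repositories enabled"
fi
```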
Run the
zypper dist-upgrade
command to upgrade the deployer node:ardana >
sudo zypper dist-upgrade Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command. Loading repository data... Reading installed packages... Computing distribution upgrade... ... 525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch. Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used. Continue? [y/n/...? shows all options] (y): y ... dracut: *** Generating early-microcode cpio image *** dracut: *** Constructing GenuineIntel.bin **** dracut: *** Store current command line parameters *** dracut: Stored kernel commandline: dracut: rd.lvm.lv=ardana-vg/root dracut: root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' *** dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done *** Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script: Refresh script btrfs-scrub.sh for monthly Refresh script btrfs-defrag.sh for none Refresh script btrfs-balance.sh for weekly Refresh script btrfs-trim.sh for none There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.NoteYou may need to run the
zypper dist-upgrade
command more than once if it determines that it needs to update the zypper
infrastructure on your system to be able to successfully dist-upgrade
the node; the command will tell you if you need to run it again.
15.4.2 Upgrading the Deployer Node Configuration Settings #
Now that the deployer node packages have been migrated to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, the configuration settings must be updated to be SUSE OpenStack Cloud 9 Cloud Lifecycle Manager based.
The first step is to run the ardana-init
command. This will:
Add the PTF repository, creating it if needed.
Optionally add appropriate local repository references for any SMT-provided SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.
Upgrade the deployer account
~/openstack
area to be based upon SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources. This will import the new SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible code into the Git repository on the Ardana branch, and then rebase the customer site branch on top of the updated Ardana branch.
Follow the directions to resolve any Git merge conflicts that may arise due to local changes that may have been made on the site branch:
ardana >
ardana-init ... To continue installation copy your cloud layout to: /var/lib/ardana/openstack/my_cloud/definition Then execute the installation playbooks: cd /var/lib/ardana/openstack/ardana/ansible git add -A git commit -m 'My config' ansible-playbook -i hosts/localhost cobbler-deploy.yml ansible-playbook -i hosts/localhost bm-reimage.yml ansible-playbook -i hosts/localhost config-processor-run.yml ansible-playbook -i hosts/localhost ready-deployment.yml cd /var/lib/ardana/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts site.yml If you prefer to use the UI to install the product, you can do either of the following: - If you are running a browser on this machine, you can point your browser to http://localhost:9085 to start the install via the UI. - If you are running the browser on a remote machine, you will need to create an ssh tunnel to access the UI. Please refer to the Ardana installation documentation for further details.
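If the rebase performed by ardana-init stops on a merge conflict, resolution follows the standard Git flow. A self-contained toy illustration (generic Git in a throwaway repository; the file name, branch names, and commit messages are invented for the example, not the actual Ardana branches):

```shell
# Toy repository demonstrating the conflict-resolution flow; names are invented.
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email demo@example.com && git config user.name demo
echo base > site.yml && git add site.yml && git commit -qm "base"
main=$(git symbolic-ref --short HEAD)        # master or main, depending on Git
git checkout -qb site
echo "local change" > site.yml && git commit -qam "site edit"
git checkout -q "$main"
echo "upstream change" > site.yml && git commit -qam "ardana update"
git checkout -q site
if ! git rebase "$main" >/dev/null 2>&1; then
  echo "conflict detected; resolving"
  echo "merged content" > site.yml           # resolve by editing the file
  git add site.yml
  GIT_EDITOR=true git rebase --continue >/dev/null 2>&1
fi
git log --oneline -1 | tee /tmp/rebase-demo.log
```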
As we are upgrading to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, we do not need
to run the suggested bm-reimage.yml
playbook.
If you were previously using the cobbler
-based integrated provisioning
solution, then you will need to perform the following steps to import
the SUSE Linux Enterprise 12 SP4 ISO and update the default provisioning distribution:
Ensure there is a copy of the
SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso
, named sles12sp4.iso
, available in the /var/lib/ardana
directory. Ensure that any distribution entries in
servers.yml
(or whichever file holds the server node definitions) under ~/openstack/my_cloud/definition
are updated to specify sles12sp4
if they are currently using sles12sp3
. Note: The default distribution will now be
sles12sp4
, so if there are no specific distribution entries specified for the servers, then no change will be required. If you have made any changes to the
~/openstack/my_cloud/definition
files, you will need to commit those changes, as follows:ardana >
cd ~/openstack/my_cloud/definitionardana >
git add -Aardana >
git commit -m "Update sles12sp3 distro entries to sles12sp4"Run the
cobbler-deploy.yml
playbook to import the SUSE Linux Enterprise 12 SP4 distribution as the new default distribution:ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml Enter the password that will be used to access provisioned nodes: confirm Enter the password that will be used to access provisioned nodes: PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - cobbler-deploy.yml" } msg: Playbook started - cobbler-deploy.yml ... PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - cobbler-deploy.yml" } msg: Playbook finished - cobbler-deploy.yml PLAY RECAP ******************************************************************** localhost : ok=92 changed=45 unreachable=0 failed=0
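The distribution update described in the steps above can be sketched as follows, assuming the server entries carry a distro-id key as in the default input model (the sample file is illustrative, not your real servers.yml):

```shell
# Sketch, assuming a 'distro-id' key on the server entries (as in the default
# input model); the sample file below is illustrative, not your real servers.yml.
f=/tmp/servers-sample.yml
cat > "$f" <<'EOF'
servers:
  - id: controller1
    distro-id: sles12sp3
  - id: compute1
    distro-id: sles12sp3
EOF
# Switch every sles12sp3 distribution entry to sles12sp4
sed -i 's/distro-id: sles12sp3/distro-id: sles12sp4/' "$f"
grep -c 'distro-id: sles12sp4' "$f"
```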
You are now ready to upgrade the input model to be compatible.
At this point, there are some mandatory changes that must be made to the existing input model to permit the upgrade to proceed. These mandatory changes are:
The removal of previously deprecated service components;
The removal of service components that are no longer supported;
Ensuring that there is only one untagged VLAN per network interface;
Ensuring that there is a
MANAGEMENT
network group.
There are also some service components that have been made redundant
and have no effect. These should be removed to quieten the associated
config-processor-run.yml
warnings.
For example, if you run the config-processor-run.yml
playbook from the ~/openstack/ardana/ansible
directory before making the necessary input model changes, you should
see it fail with errors similar to those shown below, unless your input
model does not deploy the problematic service components:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.yml Enter encryption key (press return for none): confirm Enter encryption key (press return for none): To change encryption key enter new key (press return for none): confirm To change encryption key enter new key (press return for none): PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] ... "################################################################################", "# The configuration processor failed. ", "# control-planes-2.0 WRN: cp:openstack-core: 'designate-pool-manager' has been deprecated and will be replaced by 'designate-worker'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.", "", "# control-planes-2.0 WRN: cp:openstack-core: 'manila-share' service component is deprecated. The 'manila-share' service component can be removed as manila share service will be deployed where manila-api is specified. This is not a deprecation for openstack-manila-share but just an entry deprecation in input model.", "", "# control-planes-2.0 WRN: cp:openstack-core: 'designate-zone-manager' has been deprecated and will be replaced by 'designate-producer'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.", "", "# control-planes-2.0 WRN: cp:openstack-core: 'glance-registry' has been deprectated and is no longer deployed. Please update you input model to remove any 'glance-registry' service component specifications to remove this warning.", "", "# control-planes-2.0 WRN: cp:mml: 'ceilometer-api' is no longer used by Ardana and will not be deployed. 
Please update your input model to remove this warning.", "", "# control-planes-2.0 WRN: cp:sles-compute: 'neutron-lbaasv2-agent' has been deprecated and replaced by 'octavia' and will not be deployed in a future release. Please update your input model to remove this warning.", "", "# control-planes-2.0 ERR: cp:common-service-components: Undefined component 'freezer-agent'", "# control-planes-2.0 ERR: cp:openstack-core: Undefined component 'nova-console-auth'", "# control-planes-2.0 ERR: cp:openstack-core: Undefined component 'heat-api-cloudwatch'", "# control-planes-2.0 ERR: cp:mml: Undefined component 'freezer-api'", "################################################################################" ] } } TASK: [debug var=config_processor_result.stderr] ****************************** ok: [localhost] => { "var": { "config_processor_result.stderr": "/usr/lib/python2.7/site-packages/ardana_configurationprocessor/cp/model/YamlConfigFile.py:95: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n self._contents = yaml.load(''.join(lines))" } } TASK: [fail msg="Configuration processor run failed, see log output above for details"] *** failed: [localhost] => {"failed": true} msg: Configuration processor run failed, see log output above for details msg: Configuration processor run failed, see log output above for details FATAL: all hosts have already failed -- aborting PLAY RECAP ******************************************************************** to retry, use: --limit @/var/lib/ardana/config-processor-run.retry localhost : ok=8 changed=5 unreachable=0 failed=1
To resolve any errors and warnings like those shown above, you will need to perform the following actions:
Remove any service component entries that are no longer valid from the
control_plane.yml
(or whichever file holds the control-plane definitions) under ~/openstack/my_cloud/definition
. This means that you have to comment out (or delete) any lines for the following service components, which are no longer available:freezer-agent
freezer-api
heat-api-cloudwatch
nova-console-auth
Note: This should resolve the errors that cause the
config-processor-run.yml
playbook to fail. Similarly, remove any service components that are redundant and no longer required. This means that you should comment out (or delete) any lines for the following service components:
ceilometer-api
glance-registry
manila-share
neutron-lbaasv2-agent
Note: This should resolve most of the warnings reported by the
config-processor-run.yml
playbook. Important: If you have deployed the
designate
service components (designate-pool-manager
anddesignate-zone-manager
) in your cloud, you will see warnings like those shown above, indicating that these service components have been deprecated. You can switch to using the newer
designate-worker
anddesignate-producer
service components, which will quieten these deprecation warnings produced by theconfig-processor-run.yml
playbook run.However, this is a procedure that should be perfomed after the upgrade has completed, as outlined in the Section 15.4.5, “Post-Upgrade Tasks” section below.
Once you have made the necessary changes to your input model, if you run
git diff
under the ~/openstack/my_cloud/definition
directory, you should see output similar to the following:ardana >
cd ~/openstack/my_cloud/definitionardana >
git diff diff --git a/my_cloud/definition/data/control_plane.yml b/my_cloud/definition/data/control_plane.yml index f7cfd84..2c1a73c 100644 --- a/my_cloud/definition/data/control_plane.yml +++ b/my_cloud/definition/data/control_plane.yml @@ -32,7 +32,6 @@ - NEUTRON-CONFIG-CP1 common-service-components: - lifecycle-manager-target - - freezer-agent - stunnel - monasca-agent - logging-rotate @@ -118,12 +117,10 @@ - cinder-volume - cinder-backup - glance-api - - glance-registry - nova-api - nova-placement-api - nova-scheduler - nova-conductor - - nova-console-auth - nova-novncproxy - neutron-server - neutron-ml2-plugin @@ -137,7 +134,6 @@ - horizon - heat-api - heat-api-cfn - - heat-api-cloudwatch - heat-engine - ops-console-web - barbican-api @@ -151,7 +147,6 @@ - magnum-api - magnum-conductor - manila-api - - manila-share - name: mml cluster-prefix: mml @@ -164,9 +159,7 @@ # freezer-api shares elastic-search with logging-server # so must be co-located with it - - freezer-api - - ceilometer-api - ceilometer-polling - ceilometer-agent-notification - ceilometer-common @@ -194,4 +187,3 @@ - neutron-l3-agent - neutron-metadata-agent - neutron-openvswitch-agent - - neutron-lbaasv2-agentIf you are happy with these changes, commit them into the Git repository as follows:
ardana >
cd ~/openstack/my_cloud/definitionardana >
git add -Aardana >
git commit -m "SOC 9 CLM Upgrade input model migration"Now you are ready to run the
config-processor-run.yml
playbook. If the necessary input model changes have been made, it will complete successfully:ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.yml Enter encryption key (press return for none): confirm Enter encryption key (press return for none): To change encryption key enter new key (press return for none): confirm To change encryption key enter new key (press return for none): PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] ... PLAY RECAP ******************************************************************** localhost : ok=24 changed=20 unreachable=0 failed=0
15.4.3 Upgrading Cloud Services #
The deployer node is now ready to be used to upgrade the remaining cloud nodes and running services.
If upgrading from Helion OpenStack 8, there is a manual file update that must
be applied before continuing the upgrade process. In the file
/usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml
replace `command` with `shell` in the first Ansible task. The correct version
of the file appears below.
- name: deployer-setup | check-product-status | Check HOS product installed shell: |- zypper info hpe-helion-openstack-release | grep "^Installed *: *Yes" ignore_errors: yes register: product_flavor_hos - name: deployer-setup | check-product-status | Check SOC product availability become: yes zypper: name: "suse-openstack-cloud-release>=8" state: present ignore_errors: yes register: product_flavor_soc - name: deployer-setup | check-product-status | Provide help fail: msg: > The deployer node does not have a Cloud Add-On product installed. In YaST select Software/Add-On Products to see an overview of installed add-on products and use "Add" to add the Cloud product. when: - product_flavor_soc|failed - product_flavor_hos|failed
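The edit described above can be made with a one-line sed substitution. The sketch below applies it to a local sample copy so it can be verified safely; on a real system the target would be the check-product-status.yml path given above, edited with sudo after taking a backup:

```shell
# Illustrative one-liner for the edit described above, applied here to a local
# sample copy; on a real system the target would be
# /usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml
# (edited with sudo, after making a backup).
f=/tmp/check-product-status-sample.yml
cat > "$f" <<'EOF'
- name: deployer-setup | check-product-status | Check HOS product installed
  command: |-
    zypper info hpe-helion-openstack-release | grep "^Installed *: *Yes"
  ignore_errors: yes
  register: product_flavor_hos
EOF
# Replace only the first 'command:' occurrence in the file (GNU sed 0,/re/ form)
sed -i '0,/command:/s/command:/shell:/' "$f"
grep -q 'shell: |-' "$f" && echo "patched"
```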
Changes to the check-product-status.yml
file must be staged
and committed via git.
git add -u
git commit -m "applying osconfig fix prior to HOS8 to SOC9 upgrade"
The ardana-upgrade.yml
playbook runs the upgrade process
against all nodes in parallel, though some of the steps are serialised
to run on only one node at a time to avoid triggering potentially
problematic race conditions. As such, the playbook can take a long time to run.
Generate the updated scratch area using the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.yml PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] ... PLAY RECAP ******************************************************************** localhost : ok=31 changed=16 unreachable=0 failed=0Confirm that there are no pending updates for the deployer node. This could happen if you are using an SMT to manage the repositories, and updates have been released through the official channels since the deployer node was migrated. To check for any pending Cloud Lifecycle Manager package updates, you can run the
ardana-update-pkgs.yml
playbook as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit OPS-LM--first-member PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-dplyr-m1] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-update-pkgs.yml" } ... TASK: [_ardana-update-status | Report update status] ************************** ok: [ardana-cp-dplyr-m1] => { "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n=====================================================================" } msg: ===================================================================== Update status for node ardana-cp-dplyr-m1: ===================================================================== No pending update actions on the ardana-cp-dplyr-m1 host were collected or reset during this update run or persisted during previous unsuccessful or incomplete update runs. ===================================================================== PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-update-pkgs.yml" } msg: Playbook finished - ardana-update-pkgs.yml PLAY RECAP ******************************************************************** ardana-cp-dplyr-m1 : ok=98 changed=12 unreachable=0 failed=0 localhost : ok=6 changed=2 unreachable=0 failed=0NoteIf running the
ardana-update-pkgs.yml
playbook identifies that there were updates that needed to be installed on your deployer node, then you need to go back to running theardana-init
command, followed by thecobbler-deploy.yml
playbook, then theconfig-processor-run.yml
playbook, and finally theready-deployment.yml
playbook, addressing any additional input model changes that may be needed. Then, repeat this step to check for any pending updates before continuing with the upgrade.Double-check that there are no pending actions needed for the deployer node by running the
ardana-update-status.yml
playbook, as follows:ardana >
ansible-playbook -i hosts/verb_hosts ardana-update-status.yml --limit OPS-LM--first-member PLAY [resources] ************************************************************** ... TASK: [_ardana-update-status | Report update status] ************************** ok: [ardana-cp-dplyr-m1] => { "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n=====================================================================" } msg: ===================================================================== Update status for node ardana-cp-dplyr-m1: ===================================================================== No pending update actions on the ardana-cp-dplyr-m1 host were collected or reset during this update run or persisted during previous unsuccessful or incomplete update runs. ===================================================================== PLAY RECAP ******************************************************************** ardana-cp-dplyr-m1 : ok=12 changed=0 unreachable=0 failed=0Having verified that there are no pending actions detected, it is safe to proceed with running the
ardana-upgrade.yml
playbook to upgrade the entire cloud:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-upgrade.yml PLAY [all] ******************************************************************** ... TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-upgrade.yml" } msg: Playbook started - ardana-upgrade.yml ... ... ... ... ... TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-upgrade.yml" } msg: Playbook finished - ardana-upgrade.yml
The ardana-upgrade.yml
playbook run will take a long time. The
zypper dist-upgrade
phase is serialised across all of
the nodes and usually takes between five and 10 minutes for each node. This
is followed by the cloud service upgrade phase, which will take
approximately the same amount of time as a full cloud deploy. During
this time, the cloud should remain largely functional, though there
may be brief interruptions to some services. Nevertheless, we recommend
avoiding workload management tasks during this period.
Until the ardana-upgrade.yml
playbook run has
completed successfully, other playbooks, such as ardana-status.yml,
may report status problems. This is because some services that are
expected to be running may not yet be installed, enabled, or migrated.
The ardana-upgrade.yml
playbook run may sometimes
fail during the whole-cloud upgrade phase if a service (for example,
the monasca-thresh
service) is slow to restart. In such cases, it is
safe to run the ardana-upgrade.yml
playbook again,
and in most cases it should continue past the stage that failed
previously. However, if the same problem persists across multiple runs,
contact your support team for assistance.
It is important to disable all SUSE Linux Enterprise 12 SP3 SUSE OpenStack Cloud 8 Cloud Lifecycle Manager
repositories before migrating the deployer to SUSE Linux Enterprise 12 SP4 SUSE OpenStack Cloud
9 Cloud Lifecycle Manager. If you did not do this, then the first time you
run the ardana-upgrade.yml
playbook, it may
complain that there are pending updates for the deployer node.
If this happens, repeat the earlier steps to upgrade the
deployer node, starting with running the ardana-init
command. This does not represent a serious problem.
In SUSE OpenStack Cloud 9 Cloud Lifecycle Manager the LBaaS V2 legacy driver has been
deprecated and removed. As part of the ardana-upgrade.yml
playbook run,
all existing LBaaS V2 load-balancers will be automatically migrated to being
based on the Octavia Amphora provider. To enable creation of any new Octavia-
based load-balancer instances, you need to ensure that an appropriate Amphora
image is registered for use when creating instances, by following
Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 43 “Configuring Load Balancer as a Service”.
While running the ardana-upgrade.yml
playbook, a point will be
reached when the Neutron services are upgraded. As part of this upgrade,
any existing LBaaS V2 load-balancer definitions will be migrated to
Octavia Amphora-based load-balancer definitions.
After this migration of load-balancer definitions has completed, if a load-balancer failover is triggered, the replacement load-balancer may fail to start, because an appropriate Octavia Amphora image for SUSE OpenStack Cloud 9 Cloud Lifecycle Manager will not yet be available.
However, once the Octavia Amphora image has been uploaded using the above instructions, you can recover any failed load-balancers by re-triggering the failover; follow the instructions at https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#loadbalancer-failover.
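The failover re-triggering described above can be scripted. The following sketch is illustrative only: the load-balancer IDs and statuses are sample data standing in for the output of `openstack loadbalancer list -f value -c id -c provisioning_status`; it builds one failover command per load-balancer left in ERROR state.

```shell
# Sample listing (an assumption); in a real cloud you would pipe in:
#   openstack loadbalancer list -f value -c id -c provisioning_status
sample_lb_list='lb-0001 ACTIVE
lb-0002 ERROR
lb-0003 ERROR'

# Emit one "openstack loadbalancer failover" command per ERROR entry.
failover_cmds=$(printf '%s\n' "$sample_lb_list" | \
  awk '$2 == "ERROR" { print "openstack loadbalancer failover " $1 }')

printf '%s\n' "$failover_cmds"
```

The generated commands can then be reviewed and run one at a time, allowing each replacement Amphora to come up before triggering the next failover.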
15.4.4 Rebooting the Nodes into the SUSE Linux Enterprise 12 SP4 Kernel #
At this point, all of the cloud services have been upgraded, but the nodes are still running the SUSE Linux Enterprise 12 SP3 kernel. The final step in the upgrade workflow is to reboot all of the nodes in the cloud in a controlled fashion, to ensure that active services failover appropriately.
The recommended order for rebooting nodes is to start with the deployer. This requires special handling, since the Ansible-based automation cannot fully manage the reboot of the node that it is running on.
After that, we recommend rebooting the rest of the nodes in the control planes in a rolling-reboot fashion, ensuring that high-availability services remain available.
Finally, the compute nodes can be rebooted, either individually or in groups, as is appropriate to avoid interruptions to running workloads.
Do not reboot all of your control-plane nodes at the same time.
The reboot of the deployer node requires additional steps, as the Ansible-based automation framework cannot fully automate the reboot of the node that runs the ansible-playbook commands.
Run the
ardana-reboot.yml
playbook limited to the deployer node, either by name or by using the logical node identifier OPS-LM--first-member
, as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit OPS-LM--first-member PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-dplyr-m1] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-reboot.yml" } msg: Playbook started - ardana-reboot.yml ... TASK: [ardana-reboot | Deployer node has to be rebooted manually] ************* failed: [ardana-cp-dplyr-m1] => {"failed": true} msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook: cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1 msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook: cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1 FATAL: all hosts have already failed -- aborting PLAY RECAP ******************************************************************** to retry, use: --limit @/var/lib/ardana/ardana-reboot.retry ardana-cp-dplyr-m1 : ok=8 changed=3 unreachable=0 failed=1 localhost : ok=7 changed=0 unreachable=0 failed=0The
ardana-reboot.yml
playbook will fail when run on a deployer node; this is expected. The reported failure message tells you what you need to do to complete the remaining steps of the reboot manually: namely, rebooting the node, then logging back in again to run the_ardana-post-reboot.yml
playbook, to start any services that need to be running on the node.Manually reboot the deployer node, for example with
shutdown -r now
.Once the deployer node has rebooted, you need to log in again and run the
_ardana-post-reboot.yml
playbook to complete the startup of any services that should be running on the deployer node, as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook _ardana-post-reboot.yml --limit OPS-LM--first-member PLAY [resources] ************************************************************** TASK: [Set pending_clm_update] ************************************************ skipping: [ardana-cp-dplyr-m1] ... TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-status.yml" } msg: Playbook finished - ardana-status.yml PLAY RECAP ******************************************************************** ardana-cp-dplyr-m1 : ok=26 changed=0 unreachable=0 failed=0 localhost : ok=19 changed=1 unreachable=0 failed=0
For the remaining nodes, you can use ardana-reboot.yml
to fully automate the reboot process. However, it is recommended that you
reboot the nodes in a rolling-reboot fashion, such that high-availability
services continue to run without interruption. Similarly, to avoid
interruption of service for any singleton services (such as the cinder-volume
and cinder-backup
services), they should be migrated off the intended
node before it is rebooted, and then migrated back again afterwards.
Use the
ansible
command's--list-hosts
option to list the remaining nodes in the cloud that are neither the deployer nor a compute node:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP' ardana-cp-dbmqsw-m1 ardana-cp-dbmqsw-m2 ardana-cp-dbmqsw-m3 ardana-cp-osc-m1 ardana-cp-osc-m2 ardana-cp-mml-m1 ardana-cp-mml-m2 ardana-cp-mml-m3Use the following command to generate the set of
ansible-playbook
commands that need to be run to reboot all the nodes sequentially:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
for node in $(ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP'); do echo ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ${node} || break; done ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m1 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m2 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m3 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m1 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m2 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m1 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m2 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3WarningDo not reboot all your control-plane nodes at the same time.
To reboot a specific control-plane node, you can use the above
ansible-playbook
commands as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3 PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-mml-m3] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-reboot.yml" } msg: Playbook started - ardana-reboot.yml ... PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-status.yml" } msg: Playbook finished - ardana-status.yml PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-reboot.yml" } msg: Playbook finished - ardana-reboot.yml PLAY RECAP ******************************************************************** ardana-cp-mml-m3 : ok=389 changed=105 unreachable=0 failed=0 localhost : ok=27 changed=1 unreachable=0 failed=0
You can reboot more than one control-plane node at a time, but only if they are members of different control-plane clusters. For example, you could reboot one node from each of the OpenStack controller, database, swift, monitoring, or logging clusters, so long as doing so reboots only one node from each cluster at the same time.
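The one-node-per-cluster rule above can be turned into a reboot schedule mechanically. This is a hedged sketch with assumed node names: it infers the cluster from each node name by stripping the trailing -mN member suffix, then batches the nodes into rounds so that no round contains two members of the same cluster.

```shell
# Sample control-plane node list (an assumption); live data would come
# from the "ansible --list-hosts" command shown earlier.
nodes='ardana-cp-dbmqsw-m1 ardana-cp-dbmqsw-m2 ardana-cp-dbmqsw-m3
ardana-cp-osc-m1 ardana-cp-osc-m2
ardana-cp-mml-m1 ardana-cp-mml-m2 ardana-cp-mml-m3'

# Group by cluster (node name minus the -mN suffix) and assign the n-th
# member of every cluster to round n.
rounds=$(printf '%s\n' $nodes | awk '{
    cluster = $0; sub(/-m[0-9]+$/, "", cluster)
    round = ++seen[cluster]
    batch[round] = batch[round] " " $0
    if (round > max) max = round
} END {
    for (r = 1; r <= max; r++) print "round " r ":" batch[r]
}')
printf '%s\n' "$rounds"
```

Each round can then be rebooted as a group (one ardana-reboot.yml run per node, or one run per round), waiting for each round to complete before starting the next.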
When rebooting the first member of the control-plane cluster where
monitoring services run, the monasca-thresh
service can sometimes fail
to start up in a timely fashion when the node is coming back up after
being rebooted. This can cause ardana-reboot.yml
to fail.
See below for suggestions on how to handle this problem.
Getting monasca-thresh Running After an ardana-reboot.yml Failure #
If the ardana-reboot.yml
playbook failed because
monasca-thresh
didn't start up in a timely fashion
after a reboot, you can retry starting the services on the node using
the _ardana-post-reboot.yml
playbook for the node.
This is similar to the manual handling of the deployer reboot, since
the node has already successfully rebooted onto the new kernel, and
you just need to get the required services running again on the node.
It can sometimes take up to 15 minutes for the monasca-thresh
service to successfully start in such cases.
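The wait-and-retry pattern described above can be sketched as a small helper. The check command, attempt count, and interval below are placeholders; in practice the predicate would be something like `systemctl is-active monasca-thresh` run on the affected node.

```shell
# Generic "wait until a check passes or we run out of attempts" helper.
wait_for() {
    # wait_for <attempts> <sleep-seconds> <command...>
    attempts=$1; interval=$2; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then return 0; fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}

# Stand-in predicate that succeeds on its third invocation, simulating a
# service that becomes active after a delay.
tries_file=$(mktemp)
flaky_check() {
    n=$(( $(cat "$tries_file" 2>/dev/null | wc -c) + 1 ))
    printf x >> "$tries_file"
    [ "$n" -ge 3 ]
}

wait_for 5 0 flaky_check && echo "service is up"
```

With a real predicate you would use a longer interval (for example, 30 seconds) and enough attempts to cover the 15-minute window mentioned above.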
However, if the service still fails to start after that time, then you may need to force a restart of the
storm-nimbus
andstorm-supervisor
services on all nodes in theMON-THR
node group, as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible MON-THR -b -m shell -a "systemctl restart storm-nimbus" ardana-cp-mml-m1 | success | rc=0 >> ardana-cp-mml-m2 | success | rc=0 >> ardana-cp-mml-m3 | success | rc=0 >> ardana > ansible MON-THR -b -m shell -a "systemctl restart storm-supervisor" ardana-cp-mml-m1 | success | rc=0 >> ardana-cp-mml-m2 | success | rc=0 >> ardana-cp-mml-m3 | success | rc=0 >> ardana > ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-mml-m1
If the monasca-thresh
service still fails to start up,
contact your support team.
To check which control-plane nodes have not yet been rebooted onto
the new kernel, you can use Ansible to run uname -r
on the target nodes, as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible -i hosts/verb_hosts 'resources:!OPS-LM--first-member:!NOV-CMP' -m command -a 'uname -r' ardana-cp-dbmqsw-m1 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-dbmqsw-m3 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-osc-m1 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-dbmqsw-m2 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-mml-m2 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-osc-m2 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-mml-m1 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-mml-m3 | success | rc=0 >> 4.12.14-95.57-default ardana > uname -r 4.12.14-95.57-default
If any node's uname -r
value does not match the kernel that the deployer is running, that node
has probably not yet been rebooted.
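The comparison can be automated. In this illustrative sketch, the node/kernel pairs are sample data standing in for the Ansible uname -r output shown above; any node whose kernel differs from the deployer's is flagged as still pending a reboot.

```shell
# Kernel reported by the deployer (uname -r); the node list below is a
# captured-sample assumption, not live data.
deployer_kernel=4.12.14-95.57-default
node_kernels='ardana-cp-dbmqsw-m1 4.12.14-95.57-default
ardana-cp-osc-m1 4.12.14-95.37-default
ardana-cp-mml-m3 4.12.14-95.57-default'

# Print the names of nodes whose running kernel differs.
pending=$(printf '%s\n' "$node_kernels" | \
  awk -v want="$deployer_kernel" '$2 != want { print $1 }')
printf 'Nodes still to reboot: %s\n' "${pending:-none}"
```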
Finally, you need to reboot the compute nodes. Rebooting multiple compute nodes at the same time is possible, so long as doing so does not compromise the integrity of running workloads. We recommend that you migrate workloads off groups of compute nodes in a controlled fashion, enabling them to be rebooted together.
Do not reboot all of your compute nodes at the same time.
To see all the compute nodes that are available to be rebooted, you can run the following command:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible -i hosts/verb_hosts --list-hosts NOV-CMP ardana-cp-slcomp0001 ardana-cp-slcomp0002 ... ardana-cp-slcomp0080Reboot the compute nodes, individually or in groups, using the
ardana-reboot.yml
playbook as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-slcomp0001,ardana-cp-slcomp0002 PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-slcomp0001] ok: [ardana-cp-slcomp0002] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-reboot.yml" } msg: Playbook started - ardana-reboot.yml ... PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-status.yml" } msg: Playbook finished - ardana-status.yml PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-reboot.yml" } msg: Playbook finished - ardana-reboot.yml PLAY RECAP ******************************************************************** ardana-cp-slcomp0001 : ok=120 changed=11 unreachable=0 failed=0 ardana-cp-slcomp0002 : ok=120 changed=11 unreachable=0 failed=0 localhost : ok=27 changed=1 unreachable=0 failed=0
You must ensure that there is sufficient unused workload capacity to host any migrated workload or Amphora instances that may be running on the targeted compute nodes.
When rebooting multiple compute nodes at the same time, consider manually migrating any running workloads and Amphora instances off the target nodes in advance, to avoid any potential risk of workload or service interruption.
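A minimal sketch of batching, with assumed node names: split the compute node list into fixed-size groups and emit one ardana-reboot.yml invocation per group, using the same comma-separated --limit form shown above.

```shell
# Batch size and node list are assumptions; live data would come from
# "ansible -i hosts/verb_hosts --list-hosts NOV-CMP".
batch_size=3
compute_nodes='ardana-cp-slcomp0001 ardana-cp-slcomp0002 ardana-cp-slcomp0003
ardana-cp-slcomp0004 ardana-cp-slcomp0005'

# Group the nodes, join each group with commas, and prefix the playbook
# invocation.
cmds=$(printf '%s\n' $compute_nodes | \
  xargs -n "$batch_size" | tr ' ' ',' | \
  sed 's|^|ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit |')
printf '%s\n' "$cmds"
```

Workloads would be migrated off each batch before its command is run, and back afterwards, so that only idle nodes are rebooted together.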
15.4.5 Post-Upgrade Tasks #
After the cloud has been upgraded to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager,
if designate
was previously configured, then the deprecated service
components designate-zone-manager
and
designate-pool-manager
are still in use.
They will continue to operate correctly under SUSE OpenStack Cloud 9 Cloud Lifecycle Manager,
but we recommend that you migrate to the newer designate-worker
and designate-producer
service components instead by
following the procedure documented in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 25 “DNS Service Installation Overview”, Section 25.4 “Migrate Zone/Pool to Worker/Producer after Upgrade”.
After migrating the deployer node, a small number of previously installed packages are no longer required, such as the
ceilometer
and freezer
virtualenv
(venv) packages. You can safely remove these packages with the following command:
ardana >
zypper packages --orphaned Loading repository data... Reading installed packages... S | Repository | Name | Version | Arch --+------------+----------------------------------+-----------------------------------+------- i | @System | python-flup | 1.0.3.dev_20110405-2.10.52 | noarch i | @System | python-happybase | 0.9-1.64 | noarch i | @System | venv-openstack-ceilometer-x86_64 | 9.0.8~dev7-12.24.2 | noarch i | @System | venv-openstack-freezer-x86_64 | 5.0.0.0~xrc2~dev2-10.22.1 | noarch ardana> sudo zypper remove venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64 Loading repository data... Reading installed packages... Resolving package dependencies... The following 2 packages are going to be REMOVED: venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64 2 packages to remove. After the operation, 79.0 MiB will be freed. Continue? [y/n/...? shows all options] (y): y (1/2) Removing venv-openstack-ceilometer-x86_64-9.0.8~dev7-12.24.2.noarch ..................................................................[done] Additional rpm output: /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. return yaml.load(f) (2/2) Removing venv-openstack-freezer-x86_64-5.0.0.0~xrc2~dev2-10.22.1.noarch ..............................................................[done] Additional rpm output: /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. return yaml.load(f)
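If you want to script the cleanup, the package names can be extracted from the zypper packages --orphaned table. The listing below is a trimmed sample of the output above, not live data; only installed packages whose names start with venv- are selected.

```shell
# Trimmed sample of "zypper packages --orphaned" output (an assumption).
orphaned='S | Repository | Name                             | Version | Arch
--+------------+----------------------------------+---------+-------
i | @System    | python-flup                      | 1.0.3   | noarch
i | @System    | venv-openstack-ceilometer-x86_64 | 9.0.8   | noarch
i | @System    | venv-openstack-freezer-x86_64    | 5.0.0   | noarch'

# Keep installed rows (status "i"), strip padding from the Name column,
# and select only venv packages.
venv_pkgs=$(printf '%s\n' "$orphaned" | \
  awk -F'|' '$1 ~ /^i/ { gsub(/ /, "", $3); if ($3 ~ /^venv-/) print $3 }')
printf '%s\n' "$venv_pkgs"
# The resulting names could then be passed to: sudo zypper remove $venv_pkgs
```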
The freezer service has been deprecated and removed from SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, but the backups that the freezer service created before you upgraded will still be consuming space in your Swift Object store.
Therefore, once you have completed the upgrade successfully, you can safely delete the containers that freezer used to hold the database and ring backups, freeing up that space.
Using the credentials in the
backup.osrc
file, found on the deployer node in the Ardana account's home directory, run the following commands:ardana >
. ~/backup.osrcardana >
swift list freezer_database_backups freezer_rings_backups ardana> swift delete --all freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/1_1598548599/segments/000000021 freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/2_1598605266/data1 ... freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/0_1598505404/segments/000000001 freezer_database_backups freezer_rings_backups/metadata/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/metadata ... freezer_rings_backups/data/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/data freezer_rings_backups
15.5 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment #
Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.
Use the following steps to deploy a PTF:
When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:
ardana >
tmpdir=`mktemp -d`ardana >
cd $tmpdirardana >
wget --no-directories --recursive --reject "index.html*"\ --user=USER_NAME \ --ask-password \ --no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030/Remove any old data from the PTF repository, such as a listing for a PTF repository from a migration or when previous product patches were installed.
ardana >
sudo rm -rf /srv/www/suse-12.4/x86_64/repos/PTF/*Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a neutron PTF.
ardana >
sudo mkdir -p /srv/www/suse-12.4/x86_64/repos/PTF/ardana >
sudo mv $tmpdir/* /srv/www/suse-12.4/x86_64/repos/PTF/ardana >
sudo chown --recursive root:root /srv/www/suse-12.4/x86_64/repos/PTF/*ardana >
rmdir $tmpdirCreate or update the repository metadata:
ardana >
sudo /usr/local/sbin/createrepo-cloud-ptf Spawning worker 0 with 2 pkgs Workers Finished Saving Primary metadata Saving file lists metadata Saving other metadataRefresh the PTF repository before installing package updates on the Cloud Lifecycle Manager
ardana >
sudo zypper refresh --force --repo PTF Forcing raw metadata refresh Retrieving repository 'PTF' metadata ..........................................[d one] Forcing building of repository cache Building repository 'PTF' cache ..........................................[done] Specified repositories have been refreshed.The PTF shows as available on the deployer.
ardana >
sudo zypper se --repo PTF Loading repository data... Reading installed packages... S | Name | Summary | Type --+-------------------------------+-----------------------------------------+-------- | python-neutronclient | Python API and CLI for OpenStack neutron | package i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack neutron | packageInstall the PTF venv packages on the Cloud Lifecycle Manager
ardana >
sudo zypper dup --from PTF Refreshing service Loading repository data... Reading installed packages... Computing distribution upgrade... The following package is going to be upgraded: venv-openstack-neutron-x86_64 The following package has no support information from its vendor: venv-openstack-neutron-x86_64 1 package to upgrade. Overall download size: 64.2 MiB. Already cached: 0 B. After the operation, additional 6.9 KiB will be used. Continue? [y/n/...? shows all options] (y): y Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ... (1/1), 64.2 MiB ( 64.6 MiB unpacked) Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done] Checking for file conflicts: ..............................................................[done] (1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done] Additional rpm output: warning warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEYValidate the venv tarball has been installed into the deployment directory:(note:the packages file under that dir shows the registered tarballs that will be used for the services, which should align with the installed venv RPM)
ardana >
ls -la /opt/ardana_packager/ardana-9/sles_venv/x86_64 total 898952 drwxr-xr-x 2 root root 4096 Oct 30 16:10 . ... -rw-r--r-- 1 root root 67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz <<< -rw-r--r-- 1 root root 64674087 Aug 14 16:14 nova-20180814T161306Z.tgz -rw-r--r-- 1 root root 45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz -rw-r--r-- 1 root root 1879 Oct 30 16:10 packages -rw-r--r-- 1 root root 27186008 Apr 26 2018 swift-20180426T230541Z.tgzInstall the non-venv PTF packages on the Compute Node
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' --limit comp0001-mgmtWhen it has finished, you can see that the upgraded package has been installed on
comp0001-mgmt
.ardana >
sudo zypper se --detail python-neutronclient Loading repository data... Reading installed packages... S | Name | Type | Version | Arch | Repository --+----------------------+----------+---------------------------------+--------+-------------------------------------- i | python-neutronclient | package | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF | python-neutronclient | package | 6.5.0-4.361 | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1Running the ardana update playbook will distribute the PTF venv packages to the cloud server. Then you can find them loaded in the virtual environment directory with the other venvs.
The Compute Node before running the update playbook:
ardana >
ls -la /opt/stack/venv total 24 drwxr-xr-x 9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z drwxr-xr-x 9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306ZRun the update.
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmtWhen it has finished, you can see that an additional virtual environment has been installed.
ardana >
ls -la /opt/stack/venv total 28 drwxr-xr-x 9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z drwxr-xr-x 9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z drwxr-xr-x 9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z <<< New venv installed drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306ZThe PTF may also have
RPM
package updates in addition to venv updates. To complete the update, follow the instructions at Section 15.3.1, “Performing the Update”.
15.6 Periodic OpenStack Maintenance Tasks #
The heat-manage command helps manage heat-specific database operations. The associated
database should be purged periodically to save space. Set up the following
as a cron job on the servers where the heat service is running, at
/etc/cron.weekly/local-cleanup-heat
with the following content:
#!/bin/bash
su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :
The nova-manage db archive_deleted_rows command moves deleted rows
from production tables to shadow tables. Including
--until-complete
makes the command run continuously
until all deleted rows are archived. We recommend setting up this task
as /etc/cron.weekly/local-cleanup-nova
on the servers where the nova service is running, with the
following content:
#!/bin/bash
su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :
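A sketch of installing one of these cron scripts; the target directory is a temporary stand-in for /etc/cron.weekly (writing there for real requires root), and the script body matches the heat cleanup job above.

```shell
# Temporary stand-in for /etc/cron.weekly (an assumption for this sketch).
target_dir=$(mktemp -d)

# Write the weekly heat cleanup script.
cat > "$target_dir/local-cleanup-heat" <<'EOF'
#!/bin/bash
su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :
EOF

# cron.weekly scripts must be executable to be picked up by run-parts.
chmod 0755 "$target_dir/local-cleanup-heat"
ls -l "$target_dir/local-cleanup-heat"
```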
16 Operations Console #
Often referred to as the Ops Console, this web-based graphical user interface (GUI) lets you view data about your cloud infrastructure and ensure your cloud is operating correctly.
16.1 Using the Operations Console #
16.1.1 Operations Console Overview #
Often referred to as the Ops Console, this web-based graphical user interface (GUI) lets you view data about your cloud infrastructure and ensure your cloud is operating correctly.
You can use the Operations Console for SUSE OpenStack Cloud 9 to view data about your SUSE OpenStack Cloud infrastructure in a web-based graphical user interface (GUI) and ensure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can manage data in the following ways:
- Triage alarm notifications
Alarm Definitions and notifications now have their own screens and are collected under the Alarm Explorer menu item which can be accessed from the Central Dashboard. Central Dashboard now allows you to customize the view in the following ways:
- Rename or re-configure existing alarm cards to include services different from the defaults
- Create a new alarm card with the services you want to select
- Reorder alarm cards using drag and drop
- View all alarms that have no service dimension, now grouped in an Uncategorized Alarms card
- View all alarms that have a service dimension that does not match any of the other cards, now grouped in an Other Alarms card
You can also easily access alarm data for a specific component. On the Summary page for the following components, a link is provided to an alarms screen specifically for that component:
Compute Instances: Section 16.1.3, “Managing Compute Hosts”
Object Storage: Section 16.1.4.4, “Alarm Summary”
16.1.1.1 Monitor the environment by giving priority to alarms that take precedence. #
Alarm Explorer now allows you to manage alarms in the following ways:
- Refine the monitoring environment by creating new alarms to specify a combination of metrics, services, and hosts that match the triggers unique to an environment
- Filter alarms in one place using an enumerated filter box instead of service badges
- Specify full alarm IDs as dimension key-value pairs in the form of dimension=value
16.1.1.2 Support Changes #
To resolve scalability issues, plain-text search through alarm sets is no longer supported.
The Business Logic Layer of Operations Console is a middleware component that serves as a single point of contact for the user interface to communicate with OpenStack services such as monasca, nova, and others.
16.1.2 Connecting to the Operations Console #
Instructions for accessing the Operations Console through a web browser.
To connect to Operations Console, perform the following:
Ensure your login has the required access credentials: Section 16.1.2.1, “Required Access Credentials”
Connect through a browser: Section 16.1.2.2, “Connect Through a Browser”
Optionally use a Hostname OR virtual IP address to access Operations Console: Section 16.1.2.3, “Optionally use a Hostname OR virtual IP address to access Operations Console”
Operations Console will always be accessed over port 9095.
16.1.2.1 Required Access Credentials #
In previous versions of Operations Console you were required to have only the password for the Administrator account (admin by default). Now the Administrator user account must also have all of the following credentials:
| Project | Domain | Role | Description |
|---|---|---|---|
| *All projects* | *not specific* | Admin | Admin role on at least one project |
| *All projects* | *not specific* | Admin | Admin role in default domain |
| Admin | default | Admin or monasca-user | Admin or monasca-user role on admin project |
If your login account has administrator role on the administrator project, then you only need to make sure you have the administrator role on the default domain.
Administrator account
During installation, an administrator account called
admin
is created by default.
Administrator password
During installation, an administrator password is randomly created by default. It is not recommended that you change the default password.
To display the randomized password, log in to the Cloud Lifecycle Manager and run:
cat ~/service.osrc
16.1.2.2 Connect Through a Browser #
The following instructions show you how to find the URL to access Operations Console. You will use SSH (Secure Shell), which provides administrators with a secure way to access a remote computer.
To access Operations Console:
Log in to the Cloud Lifecycle Manager.
Locate the URL or IP address for the Operations Console with the following command:
source ~/service.osrc && openstack endpoint list | grep opsconsole | grep admin
Sample output:
| 8ef10dd9c00e4abdb18b5b22adc93e87 | region1 | opsconsole | opsconsole | True | admin | https://192.168.24.169:9095/api/v1/
To access Operations Console, in the sample output, remove everything after port 9095 (api/v1/) and in a browser, type:
https://192.168.24.169:9095
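The URL transformation above can be sketched as a small shell transform; the endpoint value is the sample from the output above, and yours will differ:

```shell
# Derive the console URL by stripping the API path (/api/v1/) from the
# admin endpoint URL. The endpoint value below is the sample from above.
endpoint="https://192.168.24.169:9095/api/v1/"
console_url="$(printf '%s\n' "$endpoint" | sed -E 's|^(https://[^/]+).*|\1|')"
echo "$console_url"   # https://192.168.24.169:9095
```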
16.1.2.3 Optionally use a Hostname OR virtual IP address to access Operations Console #
If you can access Operations Console using the above instructions, then you can skip this section. These steps provide an alternate way to access Operations Console if the above steps do not work for you.
To find your hostname OR IP address:
Navigate to and open in a text editor the following file:
network_groups.yml
Find the following entry:
external-name
If your administrator set a hostname value in the external-name field, use that hostname when logging in to Operations Console. For example, in a browser you would type:
https://myardana.test:9095
If your administrator did not set a hostname value, then to determine the IP address to use, from your Cloud Lifecycle Manager, run:
grep HZN-WEB /etc/hosts
The output of that command will show you the virtual IP address you should use. For example, in a browser you would type:
https://VIP:9095
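The lookup above can be sketched as follows; the hosts entry is an invented sample (your hostnames and addresses will differ), and the first field of a HZN-WEB entry is assumed to be the virtual IP:

```shell
# Hypothetical sketch: extract the virtual IP from a HZN-WEB entry in
# /etc/hosts. The entry below is an invented sample value.
entry="192.168.245.5  ardana-cp1-vip-HZN-WEB-extapi"
vip="$(printf '%s\n' "$entry" | awk '{ print $1 }')"
echo "https://$vip:9095"   # https://192.168.245.5:9095
```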
16.1.3 Managing Compute Hosts #
Operations Console (Ops Console) provides a graphical interface for you to add and delete compute hosts.
As your deployment grows and changes, you may need to add more compute hosts to increase your capacity for VMs, or delete a host to reallocate hardware for a different use. To accomplish these tasks, in previous versions of SUSE OpenStack Cloud you had to use the command line to update configuration files and run ansible playbooks. Now Operations Console provides a graphical interface for you to complete the same tasks quickly using menu items in the console.
Do not refresh the Operations Console page or open Operations Console in another window during the following tasks. If you do, you will not see any notifications or be able to review the error log for more information. This would make troubleshooting difficult since you would not know the error that was encountered, or why it occurred.
Use Operations Console to perform the following tasks:
Create a Compute Host: Section 16.1.3.1, “Create a Compute Host”
Deactivate a Compute Host: Section 16.1.3.2, “Deactivate a Compute Host”
Activate a Compute Host: Section 16.1.3.3, “Activate a Compute Host”
Delete a Compute Host: Section 16.1.3.4, “Delete a Compute Host”
To use Operations Console, you need to have the correct permissions and know the URL or VIP connected to Operations Console during installation.
16.1.3.1 Create a Compute Host #
If you need to create additional compute hosts for more virtual machine capacity, you can do this easily on the Compute Hosts screen.
To add a compute host:
To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
From the menu that slides in on the left side, click Compute, and then Compute Hosts.
On the Compute Hosts page, click Create Host.
On the Add & Activate Compute Host tab that slides in from the right, enter the following information:
- Host ID
Cloud Lifecycle Manager model's server ID
- Host Role
Defined in the Cloud Lifecycle Manager model and cannot be modified in Operations Console
- Host Group
Defined in the Cloud Lifecycle Manager model and cannot be modified in Operations Console
- Host NIC Mapping
Defined in the Cloud Lifecycle Manager model and cannot be modified in Operations Console
- Encryption Key
If the configuration is encrypted, enter the encryption key here
Click the create button, and in the confirmation screen that opens, click Confirm.
Wait for SUSE OpenStack Cloud to complete the pre-deployment steps. This process can take up to 2 minutes.
If pre-deployment is successful, you will see a notification that deployment has started.
Important: If you receive a notice that pre-deployment did not complete successfully, read the notification explaining at which step the error occurred. You can click on the error notification and see the ansible log for the configuration processor playbook. Then you can click Create Host in step 4 again and correct the mistake.
Wait for SUSE OpenStack Cloud to complete the deployment steps. This process can take up to 20 minutes.
If deployment is successful, you will see a notification and a new entry will appear in the compute hosts table.
Important: If you receive a notice that deployment did not complete successfully, read the notification explaining at which step the error occurred. You can click on the error notification for more details.
16.1.3.2 Deactivate a Compute Host #
If you have multiple compute hosts and want to disable all but one for debugging, you will need to deactivate and then activate compute hosts. You must also deactivate a host before you can delete it. This can be done easily in the Operations Console.
The host must be in the following state: ACTIVATED
To deactivate a compute host:
To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
From the menu that slides in on the left side, click Compute, and then Compute Hosts.
On the Compute Hosts page, in the row for the host you want to deactivate, click the details button.
Click Deactivate, and in the confirmation screen that opens, click Confirm.
Wait for SUSE OpenStack Cloud to complete the operation. This process can take up to 2 minutes.
If deactivation is successful, you will see a notification and in the compute hosts table the STATE will change to DEACTIVATED.
Important: If you receive a notice that the operation did not complete successfully, read the notification explaining at which step the error occurred. You can click on the link in the error notification for more details. In the compute hosts table the STATE will remain ACTIVATED.
16.1.3.3 Activate a Compute Host #
The host must be in the following state: DEACTIVATED
To activate a compute host:
To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
From the menu that slides in on the left side, click Compute, and then Compute Hosts.
On the Compute Hosts page, in the row for the host you want to activate, click the details button.
Click Activate, and in the confirmation screen that opens, click Confirm.
Wait for SUSE OpenStack Cloud to complete the operation. This process can take up to 2 minutes.
If activation is successful, you will see a notification and in the compute hosts table the STATE will change to ACTIVATED.
Important: If you receive a notice that the operation did not complete successfully, read the notification explaining at which step the error occurred. You can click on the link in the error notification for more details. In the compute hosts table the STATE will remain DEACTIVATED.
16.1.3.4 Delete a Compute Host #
If you need to scale down the size of your current deployment to use the hardware for other purposes, you may want to delete a compute host.
Complete the following steps before deleting a host:
The host must be in the following state: DEACTIVATED
Optionally, you can migrate any instances off the host to be deleted. To do this, complete the following sections in Section 15.1.3.5, “Removing a Compute Node”:
Disable provisioning on the compute host.
Use live migration to move any instances on this host to other hosts.
To delete a compute host:
To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
From the menu that slides in on the left side, click Compute, and then Compute Hosts.
On the Compute Hosts page, in the row for the host you want to delete, click the details button.
Click Delete, and if the configuration is encrypted, enter the encryption key.
In the confirmation screen that opens, click Confirm.
In the compute hosts table you will see the STATE change to Deleting.
Wait for SUSE OpenStack Cloud to complete the operation. This process can take up to 2 minutes.
If deletion is successful, you will see a notification and in the compute hosts table the host will not be listed.
Important: If you receive a notice that the operation did not complete successfully, read the notification explaining at which step the error occurred. You can click on the link in the error notification for more details. In the compute hosts table the STATE will remain DEACTIVATED.
16.1.3.5 For More Information #
For more information on how to complete these tasks through the command line, see the following topics:
16.1.4 Managing Swift Performance #
In Operations Console you can monitor your swift cluster to ensure long-term data protection as well as sufficient performance.
OpenStack swift is an object storage solution with a focus on availability. While there are various mechanisms inside swift to protect stored data and ensure a high availability, you must still closely monitor your swift cluster to ensure long-term data protection as well as sufficient performance. The best way to manage swift is to collect useful data that will detect possible performance impacts early on.
The new Object Summary Dashboard in Operations Console provides an overview of your swift environment.
If swift is not installed and configured, you will not be able to access this dashboard. The swift endpoint must be present in keystone for the Object Summary to be present in the menu.
In Operations Console's object storage dashboard, you can easily review the following information:
Performance Summary: Section 16.1.4.1, “Performance Summary”
Inventory Summary: Section 16.1.4.2, “Inventory Summary”
Capacity Summary: Section 16.1.4.3, “Capacity Summary”
Alarm Summary: Section 16.1.4.4, “Alarm Summary”
16.1.4.1 Performance Summary #
View a comprehensive summary of current performance values.
To access the object storage performance dashboard:
Performance data includes:
- Healthcheck Latency from monasca
This latency is the average time it takes for swift to respond to a healthcheck, or ping, request. The swiftlm-uptime monitor program reports the value. A large difference between average and maximum may indicate a problem with one node.
- Operational Latency from monasca
Operational latency is the average time it takes for swift to respond to an upload, download, or object delete request. The swiftlm-uptime monitor program reports the value. A large difference between average and maximum may indicate a problem with one node.
- Service Availability
This is the availability over the last 24 hours as a percentage.
100% - No outages in the last 24 hours
50% - swift was unavailable for a total of 12 hours in the last 24-hour period
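The percentage follows from total outage time over the 24-hour window; a minimal sketch, where the 720-minute (12-hour) outage is the example value from above:

```shell
# Availability over 24 hours (1440 minutes) as a percentage, given the
# total outage time in minutes. 720 minutes reproduces the 50% example.
outage_minutes=720
awk -v out="$outage_minutes" 'BEGIN { printf "%.0f%%\n", (1 - out / 1440) * 100 }'
# prints 50%
```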
- Graph of Performance Over Time
Create a visual representation of performance data to see when swift encountered longer-than-normal response times.
To create a graph:
Choose the length of time you want to graph. This sets the length of the x-axis, which counts backwards until it reaches the present time. In the example below, 1 day is selected, so the x-axis shows performance starting from 24 hours ago (-24) until the present time.
Look at the y-axis to understand the range of response times. The first number is the smallest value in the data collected from the backend, and the last number is the longest amount of time it took swift to respond to a request. In the example below, the shortest time for a response from swift was 16.1 milliseconds.
Look for spikes, which represent longer-than-normal response times. In the example below, swift experienced long response times 21 hours ago and again 1 hour ago.
Look for the latency value at the present time. The line running across the x-axis at 16.1 milliseconds shows you the current response time.
16.1.4.2 Inventory Summary #
Monitor details about all the swift resources deployed in your cloud.
To access the object storage inventory screen:
General swift metrics are available for the following attributes:
- Important
On a public cloud deployment, this value can reach millions. If it continues to grow, it means that the container updates are not keeping up with the requests. It is also normal for this number to grow if a node hosting the swift container service is down.
- Green
Indicates all alarms are in a known and untriggered state. For example, if there are 5 nodes and they are all known with no alarms, you will see the number 5 in the green box, and a zero in all the other colored boxes.
- Yellow
Indicates that some low or medium alarms have been triggered but no critical or high alarms. For example, if there are 5 nodes, and there are 3 nodes with untriggered alarms and 2 nodes with medium severity alarms, you will see the number 3 in the green box, the number 2 in the yellow box, and zeros in all the other colored boxes.
- Red
Indicates at least one critical or high severity alarm has been triggered on a node. For example, if there are 5 nodes, and there are 3 nodes with untriggered alarms, 1 node with a low severity, and 1 node with a critical alarm, you will see the number 3 in the green box, the number 1 in the yellow box, the number 1 in the red box,and a zero in the gray box.
- Gray
Indicates that all alarms on the nodes are unknown. For example, if there are 5 nodes with no data reported, you will see the number 5 in the gray box, and zeros in all the other colored boxes.
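The color-box counts described above amount to bucketing each node by its worst alarm severity. A minimal sketch, where the five node states are invented samples chosen to reproduce the Red example above (3 untriggered nodes, 1 low, 1 critical):

```shell
# Count nodes per colored summary box from each node's worst alarm
# severity (ok, low, medium, high, critical, unknown).
printf '%s\n' ok ok ok low critical | awk '
  $1 == "ok"                       { green++ }
  $1 == "low" || $1 == "medium"    { yellow++ }
  $1 == "high" || $1 == "critical" { red++ }
  $1 == "unknown"                  { gray++ }
  END { printf "green=%d yellow=%d red=%d gray=%d\n", green, yellow, red, gray }'
# prints green=3 yellow=1 red=1 gray=0
```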
16.1.4.3 Capacity Summary #
Use this screen to view the size of the file system space on all nodes and disk drives assigned to swift. Also shown is the remaining space available and the total size of all file systems used by swift. Values are given in megabytes (MB).
To access the object storage capacity summary screen:
To open Operations Console, in a browser enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
In the menu, navigate to the object storage summary, and on the Summary page, click Capacity Summary.
16.1.4.4 Alarm Summary #
Use this page to quickly see the most recent alarms and triage all alarms related to object storage.
To access the object storage alarm summary screen:
To open Operations Console, in a browser enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
In the menu, navigate to the object storage summary, and on the Summary page, click Alarm Summary.
Each row has a checkbox to allow you to select multiple alarms and set the same condition on them.
The State column displays a graphical indicator representing the state of each alarm:
Green indicator: OK. Good operating state.
Yellow indicator: Warning. Low severity, not requiring immediate action.
Red indicator: Alarm. Varying severity levels and must be addressed.
Gray indicator: Undetermined.
The Alarm column identifies the alarm by the name it was given when it was originally created.
The Last Check column displays the date and time of the most recent occurrence of the alarm.
The Dimension column describes the components to check in order to clear the alarm.
The last column, depicted by three dots, reveals an Actions menu. From this menu you can update comments on an alarm or go to the tab showing that specific alarm definition.
16.1.5 Visualizing Data in Charts #
Operations Console allows you to create a new chart and select the time range and the metric you want to chart, based on monasca metrics.
Present data in a pictorial or graphical format to enable administrators and decision makers to grasp difficult concepts or identify new patterns.
Create new time-series graphs from My Dashboard.
My Dashboard also allows you to customize the view in the following ways:
Include alarm cards from the Central Dashboard
Customize graphs in new ways
Reorder items using drag and drop
Plan for future storage
Track capacity over time to predict with some degree of reliability the amount of additional storage needed.
Charts and graphs provide a quick way to visualize large amounts of complex data. It is especially useful when trying to find relationships and understand your data, which could include thousands or even millions of variables. You can create a new chart in Operations Console from My Dashboard.
The charts in Operations Console are based on monasca data. When you create a new chart you will be able to select the time range and the metric you want to chart. The list of metrics you can choose from is equivalent to the output of monasca metric-name-list on the command line. After you select a metric, you can then specify a dimension, which is derived from the results of monasca metric-list --name <metric_name>. The dimension list changes based on the selected metric.
This topic provides instructions on how to create a basic chart, and how to create a chart specifically to visualize your cinder capacity.
Create a Chart: Section 16.1.5.1, “Create a Chart”
Chart cinder Capacity: Section 16.1.5.2, “Chart cinder Capacity”
16.1.5.1 Create a Chart #
Create a chart to visually display data for up to 6 metrics over a period of time.
To create a chart:
To open Operations Console, in a browser, enter either the URL or Virtual IP connected to Operations Console.
For example:
https://myardana.test:9095
https://VIP:9095
On the Home screen, click the menu represented by 3 horizontal lines.
From the menu that slides in on the left, select Home, and then select My Dashboard.
On the My Dashboard screen, select Create New Chart.
On the Add New Time Series Chart screen, in Chart Definition complete any of the optional fields:
- Name
Short description of chart.
- Time Range
Specifies the time range of data displayed in the chart. The default is 1 hour. Can be set to hours (1, 2, 4, 8, 24) or days (7, 30, 45).
- Chart Update Rate
Collects metric data and adds it to the chart at the specified interval. The default is 1 minute. Can be set to minutes (1,5,10,30) or 1 hour.
- Chart Type
Determines how the data is displayed. The default type is Line. Can be set to the following values:
- Chart Size
This controls the visual display of the chart width as it appears on My Dashboard. The default is Small. This field can be set to Small to display it at 50% or Large for 100%.
On the Add New Time Series Chart screen, in Added Chart Data complete the following fields:
- Metric
In monasca, a metric is a multi-dimensional description that consists of the following fields: name, dimensions, timestamp, value and value_meta. The pre-populated list is equivalent to using the monasca metric-name-list on the command line.
- Dimension
The set of unique dimensions that are defined for a specific metric. Dimensions are a dictionary of key-value pairs. This pre-populated list is equivalent to using monasca metric-list --name <metric_name> on the command line.
- Function
Operations Console uses monasca to provide the results of all mathematical functions. monasca in turn uses Graphite to perform the mathematical calculations and return the results. The Function field can be set to AVG (default), MIN, MAX, and COUNT. For more information on these functions, see the Graphite documentation at http://www.aosabook.org/en/graphite.html.
Click Add Data To Chart. To add another metric to the chart, repeat the previous step until all metrics are added. The maximum you can have in one chart is 6 metrics.
To create the chart, click Create New Chart.
After you click Create New Chart, you will be returned to My Dashboard where the new chart will be shown. From the My Dashboard screen you can use the menu in the top-right corner of the card to delete or edit the chart. You can also select an option to create a comma-delimited file of the data in the chart.
16.1.5.2 Chart cinder Capacity #
To visualize the use of storage capacity over time, you can create a chart that graphs the total block storage backend capacity. To find out how much of that total is being used, you can also create a chart that graphs the available block storage capacity.
Visualizing cinder:
Chart Total Capacity: Section 16.1.5.3, “Chart Total Capacity”
Chart Available Capacity: Section 16.1.5.4, “Chart Available Capacity”
The total and free capacity values are based on the available capacity reported by the cinder backend. Be aware that some backends can be configured to thinly provision.
16.1.5.3 Chart Total Capacity #
To chart the total block-storage backend capacity:
Log in to Operations Console.
Follow the steps in the previous instructions to start creating a chart.
To chart the total backend capacity, on the Add New Time Series Chart screen, in Chart Definition use the following settings:
Field | Setting |
---|---|
Metrics | cinderlm.cinder.backend.total.size |
Dimension | any hostname. If multiple backends are available, select any one; the backends will all return the same metric data. |
Add the data to the chart and click Create.
Example of a cinder Total Capacity Chart:
16.1.5.4 Chart Available Capacity #
To chart the available block-storage backend capacity:
Log in to Operations Console.
Follow the steps in the previous instructions to start creating a chart.
To chart the available backend capacity, on the Add New Time Series Chart screen, in Chart Definition use the following settings:
Field | Setting |
---|---|
Metrics | cinderlm.cinder.backend.total.avail |
Dimension | any hostname. If multiple backends are available, select any one; the backends will all return the same metric data. |
Add the data to the chart and click Create.
Example of a chart showing cinder Available Capacity:
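Used capacity is not charted directly, but it can be derived from the two metrics above; a minimal sketch with invented sample values, in whatever units the backend reports:

```shell
# Derive used capacity from the two cinderlm metrics charted above.
# The values are invented samples.
total=1024   # cinderlm.cinder.backend.total.size
avail=640    # cinderlm.cinder.backend.total.avail
used=$(( total - avail ))
echo "used=$used"   # used=384
```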
The source data for the Capacity Summary pages is only refreshed at the top of each hour. This affects the latency of the displayed data on those pages.
16.1.6 Getting Help with the Operations Console #
On each of the Operations Console pages there is a help menu that you can click on to take you to a help page specific to the console you are currently viewing.
To reach the help page:
Click the help menu option in the upper-right corner of the page, depicted by a question mark.
Click the help link, which will open the help page in a new tab in your browser.
16.2 Alarm Definition #
The Alarm Definition section allows you to define alarms that are useful in generating notifications and metrics required by your organization. By default, alarm definitions are sorted by name and displayed in a table format.
16.2.1 Filter and Sort #
The search feature allows you to search and filter alarm entries by name and description.
The check box above the top left of the table is used to select all alarm definitions on the current page.
To sort the table, click the desired column header. To reverse the sort order, click the column again.
16.2.2 Create Alarm Definitions #
The button next to the search bar allows you to create a new alarm definition.
To create a new alarm definition:
Click the button to open the Create Alarm Definition dialog.
In the Create Alarm Definition window, type a name for the alarm in the name text field. The name is mandatory, can be up to 255 characters long, and can include letters, numbers, and special characters.
Provide a short description of the alarm in the description text field (optional).
Select the desired severity level of the alarm from the severity drop-down box. The severity level is subjective, so choose the level appropriate for prioritizing the handling of alarms when they occur.
To specify how to receive notifications, select the method(s) of notification (Email, Web, API, etc.) from the list of options in the Alarm Notifications area. If none are available to choose from, you must first configure them in the Notification Methods window. Refer to the Notification Methods help page for further instructions.
To enable notifications for the alarm, enable the check box next to the desired alarm notification method.
Apply the desired rules to your alarm by using the Alarm Expression form.
To save the changes and add the new alarm definition to the table, click the save button.
16.3 Alarm Explorer #
This page displays the alarms for all services and appliances. By default, alarms are sorted by their state.
16.3.1 Filter and Sort #
Using the filter button, you can filter the alarms by their IDs and dimensions. The dialog lets you configure a filtering rule using the available fields and options.
You can display the alarms in grid, list, or table views by selecting the corresponding icons.
To sort the alarm list, click the desired column header. To reverse the sort order, click the column again.
16.3.2 Alarm Table #
Each row has a checkbox to allow you to select multiple alarms and set the same condition on them.
The Status column displays a graphical indicator that shows the state of each alarm:
Green indicator: OK. Good operating state.
Yellow indicator: Warning. Low severity, not requiring immediate action.
Red indicator: Alarm. Varying severity levels and must be addressed.
Gray indicator: Unknown.
The Alarm column identifies the alarm by the name it was given when it was originally created.
The Last Check column displays the date and time of the most recent occurrence of the alarm.
The Dimension column describes the components to check in order to clear the alarm.
16.3.3 Notification Methods #
The Notification Methods section of the Alarm Explorer allows you to define notification methods that are used by the alarms. By default, notification methods are sorted by name.
16.3.3.1 Filter and Sort #
The filter bar allows you to filter the notification methods by specifying a filter criteria. You can sort the available notification methods by clicking on the desired column header in the table.
16.3.3.2 Create Notification Methods #
The button beside the search bar allows you to create a new notification method.
To create a new notification method:
Click the button to open the Create Notification Method window.
In the window, specify a name for the notification in the text field. The name is required, and it can be up to 255 characters in length, consisting of letters, numbers, or special characters.
Select a notification type in the drop-down and select the desired option. Web Hook allows you to enter an internet address, also referred to as a URL.
In the text field, provide the required values.
Press the save button, and you should see the created notification method in the table.
16.4 Compute Hosts #
This page in the Compute section allows you to view your Compute Host resources.
16.4.1 Filter and Sort #
The dedicated bar at the top of the page lets you filter alarm entries using the available filtering options.
Click the filter icon to select one of the available options.
The alarm entries can be sorted by clicking on the appropriate column header.
To view detailed information (including alarm counts and utilization metrics) about a specific host in the list, click the host's name in the list.
16.5 Compute Instances #
This Operations Console page allows you to monitor your Compute instances.
16.5.1 Search and Sort #
The search bar allows you to filter the instances you want to view. Type and Status are examples of criteria that can be specified. Additionally, you can filter by typing in text, similar to searching by keywords.
The checkbox allows you to select (or deselect) a group of instances to delete.
You can display the instances in grid, list, or table views by selecting the corresponding icons next to the view control.
The sort control contains a drop-down list of ways by which you can sort the compute nodes. Alternatively, you can also sort using the column headers in the table.
16.6 Compute Summary #
The Compute Summary page in the Compute section gives you access to inventory, capacity, and alarm summaries.
16.6.1 Inventory Summary #
The Inventory Summary section provides an overview of compute alarms by status. These alarms are grouped by control plane. There is also information on resource usage for each compute host. Here you can also see alarms triggered on individual compute hosts.
16.6.2 Capacity Summary #
The Capacity Summary section offers an overview of the utilization of physical resources and allocation of virtual resources among compute nodes. Here you will also find a breakdown of CPU, memory, and storage usage across all compute resources in the cloud.
16.6.3 Compute Summary #
16.6.4 Appliances #
This page displays details of an appliance.
Search and Sort
The search bar allows you to filter the appliances you want to view. Role and Status are examples of criteria that can be specified. Additionally, you can filter by selecting Any Column and typing in text similar to searching by keywords.
You can sort using the column headers in the table.
Actions
Click the Action icon (three dots) to view details of an appliance.
16.6.5 Block Storage Summary #
This page displays the alarms that have triggered since the timeframe indicated.
Search and Sort
The search bar allows you to filter the alarms you want to view. State and Service are examples of criteria that can be specified. Additionally, you can filter by typing in text similar to searching by keywords.
You can sort alarm entries using the column headers in the table.
New Alarms: Block Storage
The New Alarms section shows you the alarms that have triggered since the timeframe indicated. You can select the timeframe using the Configure control with options ranging from the Last Minute to Last 30 Days. This section refreshes every 60 seconds.
The new alarms will be separated into the following categories:
Category | Description |
---|---|
Critical | Open alarms, identified by red indicator. |
Warning | Open alarms, identified by yellow indicator. |
Unknown | Open alarms, identified by gray indicator. Unknown is the status of an alarm that has stopped receiving a metric. This can be caused by an alarm for a service or component that is not installed, an alarm for a virtual machine or node that was removed without removing its alarms, or a gap between the last reported metric and the next metric. |
Open | Complete list of open alarms. |
Total |
Complete list of alarms, may include Acknowledged and Resolved alarms. |
More Information
16.7 Logging #
This page displays the link to the Logging Interface, known as Kibana.
The Kibana logging interface only runs on the management network. You need to have access to that network to be able to use Kibana.
16.7.1 View Logging Interface #
To access the logging interface, click the button, which will open the interface in a new window.
For more details about the logging interface, see Section 13.2, “Centralized Logging Service”.
16.8 My Dashboard #
This page allows you to customize the dashboard by mixing and matching graphs and alarm cards.
My Dashboard allows you to customize the dashboard by mixing and matching graphs and alarm cards. Since different operators may be interested in different metrics and alarms, the configuration for this page is tied to the login account used to access Operations Console. Charts available here are based on metrics collected by the monasca monitoring component.
16.9 Networking Alarm Summary #
This page displays the alarms for the Networking (neutron), DNS, Firewall, and Load Balancing services. By default, alarms are sorted by State.
16.9.1 Filter and Sort #
The filter bar allows you to filter the alarms by the available criteria. The Dimension filter accepts key/value pairs, while the State filter provides a selection of valid values.
You can sort alarm entries using the column headers in the table.
16.9.2 Alarm Table #
You can select one or multiple alarms using the check box next to each entry.
The State column displays a graphical indicator that shows the state of each alarm:
Green indicator: OK. Good operating state.
Yellow indicator: Warning. Low severity, not requiring immediate action.
Red indicator: Alarm. Varying severity levels and must be addressed.
Gray square (or gray indicator): Undetermined.
The Alarm column identifies the alarm by its name.
The Last Check column displays the date and time of the most recent occurrence of the alarm.
The Dimension column shows the components to check in order to clear the alarm.
The last column, depicted by three dots, reveals an Actions menu that gives you access to the following options:
16.10 Central Dashboard #
This page displays a high level overview of all cloud resources and their alarm status.
16.10.1 Central Dashboard #
16.10.2 New Alarms #
The New Alarms section shows you the alarms that have triggered since the timeframe indicated. You can select the timeframe using the Configure control with options ranging from the Last Minute to Last 30 Days. This section refreshes every 60 seconds.
The new alarms will be separated into the following categories:
An alarm exists for a service or component that is not installed in the environment.
An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.
There is a gap between the last reported metric and the next metric.
16.10.3 Alarm Summary #
Each service or group of services have a dedicated card displaying related alarms.
An alarm exists for a service or component that is not installed in the environment.
An alarm exists for a virtual machine or node that previously existed but has been removed without the corresponding alarms being removed.
There is a gap between the last reported metric and the next metric.
17 Backup and Restore #
The following sections cover backup and restore operations. Before installing your cloud, there are several things you must do so that you achieve the backup and recovery results you need. SUSE OpenStack Cloud comes with playbooks and procedures to recover the control plane from various disaster scenarios.
As of SUSE OpenStack Cloud 9, Freezer (a distributed backup restore and disaster recovery service) is no longer supported; backup and restore are manual operations.
Consider Section 17.2, “Enabling Backups to a Remote Server” so that you can still restore services if you lose the cloud servers that hold your backup data.
The following features are supported:
File system backup using a point-in-time snapshot.
Strong encryption: AES-256-CFB.
MariaDB database backup with LVM snapshot.
Restoring your data from a previous backup.
Low storage requirement: backups are stored as compressed files.
Flexible backup (both incremental and differential).
Data is archived in GNU Tar format for file-based incremental backup and restore.
When a key is provided, OpenSSL is used to encrypt data (AES-256-CFB).
17.1 Manual Backup Overview #
This section covers manual backup and some restore processes. Full documentation for restore operations is in Section 15.2, “Unplanned System Maintenance”. To back up outside the cluster, refer to Section 17.2, “Enabling Backups to a Remote Server”. Backups of the following types of resources are covered:
Cloud Lifecycle Manager Data. All important information on the Cloud Lifecycle Manager
MariaDB database that is part of the Control Plane. The MariaDB database contains most of the data needed to restore services. MariaDB supports full back up and recovery for all services. Logging data in Elasticsearch is not backed up. swift objects are not backed up because of the redundant nature of swift.
swift Rings used in the swift storage deployment. swift rings are backed up so that you can recover more quickly than rebuilding with swift. swift can rebuild the rings without this backup data, but automatically rebuilding the rings is slower than restoring from a backup.
Audit Logs. Audit Logs are backed up to provide retrospective information and statistical data for performance and security purposes.
The following services will be backed up. Specifically, the data needed to restore the services is backed up. This includes databases and configuration-related files.
Data content for some services is not backed up, as indicated below.
ceilometer. There is no backup of metrics data.
cinder. There is no backup of the volumes.
glance. There is no backup of the images.
heat
horizon
keystone
neutron
nova. There is no backup of the images.
swift. There is no backup of the objects. swift has its own high availability and redundancy. swift rings are backed up. Although swift can rebuild the rings itself, restoring from backup is faster.
Operations Console
monasca. There is no backup of the metrics.
17.2 Enabling Backups to a Remote Server #
We recommend that you set up a remote server to store your backups, so that you can restore the control plane nodes. This may be necessary if you lose all of your control plane nodes at the same time.
A remote backup server must be set up before proceeding.
You do not have to restore from the remote server if only one or two control plane nodes are lost. In that case, the control planes can be recovered from the data on a remaining control plane node following the restore procedures in Section 15.2.3.2, “Recovering the Control Plane”.
17.2.1 Securing your SSH backup server #
You can do the following to harden an SSH server:
Disable root login
Move SSH to a non-default port (the default SSH port is 22)
Disable password login (only allow RSA keys)
Disable SSH v1
Authorize Secure File Transfer Protocol (SFTP) only for the designated backup user (disable SSH shell)
Firewall SSH traffic to ensure it comes from the SUSE OpenStack Cloud address range
Install a Fail2Ban solution
Restrict users who are allowed to SSH
Additional suggestions are available online
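The hardening items above can be sketched as an sshd_config fragment. This is a minimal example, not the document's official configuration: the port number 2222 and the backupuser name are placeholder choices, and the Match block assumes the backup user should get SFTP-only, chrooted access.

```
# /etc/ssh/sshd_config (fragment; adjust names and paths to your site)
Port 2222                       # move SSH off the default port 22
Protocol 2                      # disable SSH v1
PermitRootLogin no              # disable root login
PasswordAuthentication no       # keys only
PubkeyAuthentication yes
AllowUsers backupuser           # restrict which users may SSH in
Match User backupuser
    ForceCommand internal-sftp  # SFTP only, no interactive shell
    ChrootDirectory /mnt/backups
```

Note that sshd requires a ChrootDirectory to be owned by root and not writable by other users, so the backup user typically writes into a subdirectory of the chroot.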
Remove the key pair generated earlier on the backup server; the only file needed is .ssh/authorized_keys. You can remove the .ssh/id_rsa and .ssh/id_rsa.pub files. Be sure to save a backup of them.
17.2.2 General tips #
Provide adequate space in the directory that is used for backup.
Monitor the space left on that directory.
Keep the system up to date on that server.
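The first two tips above can be automated with a small check script. This is a sketch: the /mnt/backups path and the 80% threshold are assumptions, so adjust them to your environment (for demonstration, the script falls back to /tmp if the backup directory does not exist).

```shell
# Free-space check for the backup directory.
BACKUP_DIR=/mnt/backups
THRESHOLD=80
[ -d "$BACKUP_DIR" ] || BACKUP_DIR=/tmp   # fallback for demonstration only
# df -P prints one POSIX-format line per filesystem; field 5 is Use%.
usage=$(df -P "$BACKUP_DIR" | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARNING: $BACKUP_DIR is ${usage}% full"
else
  echo "OK: $BACKUP_DIR is ${usage}% full"
fi
```

A script like this can run from cron on the backup server and feed an alerting channel of your choice.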
17.3 Manual Backup and Restore Procedures #
Each backup requires the following steps:
Create a snapshot.
Mount the snapshot.
Generate a TAR archive and save it.
Unmount and delete the snapshot.
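The four steps above can be sketched as one script. This is a dry run: the run helper only prints each command, so nothing is executed; replace echo with sudo to run the sequence for real. The /etc/ssh target and archive name are illustrative.

```shell
# Dry-run sketch of the snapshot/mount/archive/cleanup cycle.
run() { echo "+ $*"; }          # swap 'echo' for 'sudo' to execute for real
SNAP_LV=/dev/ardana-vg/lvm_clm_snapshot
MNT=/var/tmp/clm_snapshot
run lvcreate --size 2G --snapshot --permission r \
    --name lvm_clm_snapshot /dev/ardana-vg/root          # 1. create snapshot
run mkdir "$MNT"
run mount -o ro "$SNAP_LV" "$MNT"                        # 2. mount snapshot
run tar --create -z --warning=none --no-check-device \
    --one-file-system --preserve-permissions --same-owner --seek \
    --ignore-failed-read --file ssh.tar.gz -C "$MNT/etc/ssh" .   # 3. archive
run umount -l -f "$MNT"                                  # 4. unmount and
run rm -rf "$MNT"                                        #    delete the
run lvremove -f "$SNAP_LV"                               #    snapshot
```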
17.3.1 Cloud Lifecycle Manager Data Backup #
The following procedure is used for each of the five BACKUP_TARGETS (list below). Incremental backup instructions follow the full backup procedure. For both full and incremental backups, the last step of the procedure is to unmount and delete the snapshot after the TAR archive has been created and saved. A new snapshot must be created every time a backup is created.
Create a snapshot on the Cloud Lifecycle Manager in ardana-vg, the location where all Cloud Lifecycle Manager data is stored.
ardana > sudo lvcreate --size 2G --snapshot --permission r \
  --name lvm_clm_snapshot /dev/ardana-vg/root
Note: If you have stored additional data or files in your ardana-vg directory, you may need more space than the 2G indicated for the size parameter. In this situation, create a preliminary TAR archive with the tar command on the directory before creating a snapshot. Set the size snapshot parameter larger than the size of the archive.
Mount the snapshot.
ardana > sudo mkdir /var/tmp/clm_snapshot
ardana > sudo mount -o ro /dev/ardana-vg/lvm_clm_snapshot /var/tmp/clm_snapshot
Generate a TAR archive (does not apply to incremental backups) with an appropriate BACKUP_TAR_ARCHIVE_NAME.tar.gz backup file for each of the following BACKUP_TARGETS:
home
ssh
shadow
passwd
group
The backup TAR archive should contain only the necessary data; nothing extra. Some of the archives will be stored as directories, others as files. The backup commands are slightly different for each type.
If the BACKUP_TARGET is a directory, then that directory must be appended to /var/tmp/clm_snapshot/ as TARGET_DIR. If the BACKUP_TARGET is a file, then its parent directory must be appended to /var/tmp/clm_snapshot/.
In the commands that follow, replace BACKUP_TARGET with the appropriate BACKUP_PATH (replacement table is below).
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek \
  --ignore-failed-read --file BACKUP_TAR_ARCHIVE_NAME.tar.gz -C \
  /var/tmp/clm_snapshot/TARGET_DIR BACKUP_TARGET_WITHOUT_LEADING_DIR
If BACKUP_TARGET is a directory, replace TARGET_DIR with BACKUP_PATH.
For example, where BACKUP_PATH=/etc/ssh/ (a directory):
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek \
  --ignore-failed-read --file ssh.tar.gz -C /var/tmp/clm_snapshot/etc/ssh .
If BACKUP_TARGET is a file (not a directory), replace TARGET_DIR with the parent directory of BACKUP_PATH.
For example, where BACKUP_PATH=/etc/passwd (a file):
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek \
  --ignore-failed-read --file passwd.tar.gz -C /var/tmp/clm_snapshot/etc passwd
Save the TAR archive to the remote server.
ardana > scp TAR_ARCHIVE USER@REMOTE_SERVER
Use the following commands to unmount and delete the snapshot.
ardana > sudo umount -l -f /var/tmp/clm_snapshot; rm -rf /var/tmp/clm_snapshot
ardana > sudo lvremove -f /dev/ardana-vg/lvm_clm_snapshot
The table below shows Cloud Lifecycle Manager backup_names and their respective backup_paths.
backup_name | backup_path |
---|---|
home_backup | /var/lib/ardana (directory) |
etc_ssh_backup | /etc/ssh/ (directory) |
shadow_backup | /etc/shadow (file) |
passwd_backup | /etc/passwd (file) |
group_backup | /etc/group (file) |
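The per-target tar commands can be driven from the table above with a small loop. This is a sketch, not part of the product: the backup function, the SNAP and OUT variables, and the dir/file classification are illustrative; point SNAP at /var/tmp/clm_snapshot and run the tar calls under sudo on a real snapshot. The guard at the end skips execution when no snapshot is mounted.

```shell
# One tar archive per backup target, from the mounted snapshot.
SNAP=${SNAP:-/var/tmp/clm_snapshot}
OUT=${OUT:-.}
backup() { # usage: backup NAME PATH TYPE   (TYPE is dir or file)
  name=$1; path=$2; type=$3
  if [ "$type" = dir ]; then
    cdir=$path; member=.                        # archive the directory itself
  else
    cdir=$(dirname "$path"); member=$(basename "$path")  # archive from parent
  fi
  tar --create -z --warning=none --no-check-device --one-file-system \
      --preserve-permissions --same-owner --seek --ignore-failed-read \
      --file "$OUT/$name.tar.gz" -C "$SNAP$cdir" "$member"
}
if [ -d "$SNAP" ]; then      # only run when a snapshot is mounted
  backup home_backup    /var/lib/ardana dir
  backup etc_ssh_backup /etc/ssh        dir
  backup shadow_backup  /etc/shadow     file
  backup passwd_backup  /etc/passwd     file
  backup group_backup   /etc/group      file
fi
```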
17.3.1.1 Cloud Lifecycle Manager Incremental Backup #
Incremental backups require a meta file. If you use the incremental backup option, a meta file must be included in the tar command in the initial backup and whenever you do an incremental backup. A copy of the original meta file should be stored in each backup. The meta file is used to determine the incremental changes from the previous backup, so it is rewritten with each incremental backup.
Versions are useful for incremental backup because they provide a way to differentiate between each backup. Versions are included in the tar command.
Every incremental backup requires creating and mounting a separate snapshot. After the TAR archive is created, the snapshot is unmounted and deleted.
To prepare for incremental backup, follow the steps in Procedure 17.1, “Manual Backup Setup” with the following differences in the commands for generating a tar archive.
First time full backup:
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek \
  --ignore-failed-read --listed-incremental=PATH_TO_YOUR_META \
  --file BACKUP_TAR_ARCHIVE_NAME.tar.gz -C \
  /var/tmp/clm_snapshot/TARGET_DIR BACKUP_TARGET_WITHOUT_LEADING_DIR
For example, where BACKUP_PATH=/etc/ssh/:
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --listed-incremental=mysshMeta --file ssh.tar.gz -C \
  /var/tmp/clm_snapshot/etc/ssh .
Incremental backup:
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek \
  --ignore-failed-read --listed-incremental=PATH_TO_YOUR_META \
  --file BACKUP_TAR_ARCHIVE_NAME_VERSION.tar.gz -C \
  /var/tmp/clm_snapshot/TARGET_DIR BACKUP_TARGET_WITHOUT_LEADING_DIR
For example, where BACKUP_PATH=/etc/ssh/:
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --listed-incremental=mysshMeta --file ssh_v1.tar.gz -C \
  /var/tmp/clm_snapshot/etc/ssh .
After creating an incremental backup, use the following commands to unmount and delete a snapshot.
ardana > sudo umount -l -f /var/tmp/clm_snapshot; rm -rf /var/tmp/clm_snapshot
ardana > sudo lvremove -f /dev/ardana-vg/lvm_clm_snapshot
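The meta-file behavior described above can be seen in a self-contained demo. This sketch runs entirely in a throwaway temp directory (not the real /var/tmp/clm_snapshot); the myMeta name and backup_v0/backup_v1 version suffixes are illustrative.

```shell
# Demo of GNU tar incremental backups with a meta (snapshot) file.
work=$(mktemp -d)
mkdir -p "$work/data"
echo one > "$work/data/a.txt"
# Full backup: tar records the file state in the meta file.
tar --create -z --listed-incremental="$work/myMeta" \
    --file "$work/backup_v0.tar.gz" -C "$work/data" .
echo two > "$work/data/b.txt"
# Incremental backup: only the change (b.txt) is archived, and the
# meta file is rewritten to reflect the new state.
tar --create -z --listed-incremental="$work/myMeta" \
    --file "$work/backup_v1.tar.gz" -C "$work/data" .
tar --list -z --file "$work/backup_v1.tar.gz"
```

Listing backup_v1.tar.gz shows b.txt but not a.txt, which is why each incremental archive needs a distinct version suffix and why the meta file must travel with the backups.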
17.3.1.2 Encryption #
When a key is provided, OpenSSL is used to encrypt data (AES-256-CFB). Backup files can be encrypted with the following command:
ardana > sudo openssl enc -aes-256-cfb -pass file:ENCRYPT_PASS_FILE_PATH -in \
  YOUR_BACKUP_TAR_ARCHIVE_NAME.tar.gz -out YOUR_BACKUP_TAR_ARCHIVE_NAME.tar.gz.enc
For example, using the ssh.tar.gz generated above:
ardana > sudo openssl enc -aes-256-cfb -pass file:myEncFile -in ssh.tar.gz -out ssh.tar.gz.enc
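At restore time the matching decrypt call is the same openssl enc invocation with -d. The sketch below round-trips a stand-in file in a temp directory so you can verify the key file and cipher settings before relying on them; the file names are placeholders, not real backup artifacts.

```shell
# Encrypt/decrypt round trip with openssl enc (AES-256-CFB).
tmp=$(mktemp -d)
echo "secret passphrase" > "$tmp/myEncFile"          # key file
echo "pretend archive contents" > "$tmp/ssh.tar.gz"  # stand-in archive
openssl enc -aes-256-cfb -pass "file:$tmp/myEncFile" \
    -in "$tmp/ssh.tar.gz" -out "$tmp/ssh.tar.gz.enc"
openssl enc -d -aes-256-cfb -pass "file:$tmp/myEncFile" \
    -in "$tmp/ssh.tar.gz.enc" -out "$tmp/ssh.tar.gz.dec"
cmp -s "$tmp/ssh.tar.gz" "$tmp/ssh.tar.gz.dec" && echo "round trip OK"
```

Keep the key file out of the backup set itself; losing it makes the .enc archives unrecoverable.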
17.3.2 MariaDB Database Backup #
When backing up MariaDB, the following process must be performed on all nodes in the cluster. It is similar to the backup procedure above for the Cloud Lifecycle Manager (see Procedure 17.1, “Manual Backup Setup”). The difference is the addition of SQL commands, which are run with the create_db_snapshot.yml playbook.
Create the create_db_snapshot.yml file in ~/scratch/ansible/next/ardana/ansible/ on the deployer with the following content:
- hosts: FND-MDB
  vars:
    - snapshot_name: lvm_mysql_snapshot
    - lvm_target: /dev/ardana-vg/mysql
  tasks:
    - name: Cleanup old snapshots
      become: yes
      shell: |
        lvremove -f /dev/ardana-vg/{{ snapshot_name }}
      ignore_errors: True
    - name: Create snapshot
      become: yes
      shell: |
        lvcreate --size 2G --snapshot --permission r --name {{ snapshot_name }} {{ lvm_target }}
      register: snapshot_st
      ignore_errors: True
    - fail:
        msg: "Fail to create snapshot on {{ lvm_target }}"
      when: snapshot_st.rc != 0
Verify the validity of the lvm_target variable (which refers to the actual database LVM volume) before proceeding with the backup.
Doing the MariaDB backup
We recommend storing the MariaDB version with your backup. The following command saves the MariaDB version as MARIADB_VER.
mysql -V | grep -Eo '(\S+?)-MariaDB' > MARIADB_VER
Open a MariaDB client session on all controllers.
Run the following command to take a read lock on all controllers, and keep the MariaDB session open.
>> FLUSH TABLES WITH READ LOCK;
Open a new terminal and run the create_db_snapshot.yml playbook created above.
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts create_db_snapshot.yml
Go back to the open MariaDB session and run the command to release the lock on all controllers.
>> UNLOCK TABLES;
Mount the snapshot.
dbnode>> mkdir /var/tmp/mysql_snapshot
dbnode>> sudo mount -o ro /dev/ardana-vg/lvm_mysql_snapshot /var/tmp/mysql_snapshot
On each database node, generate a TAR archive with an appropriate BACKUP_TAR_ARCHIVE_NAME.tar.gz backup file for the BACKUP_TARGET. The backup_name is mysql_backup and the backup_path (BACKUP_TARGET) is /var/lib/mysql/.
dbnode>> sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --file mydb.tar.gz -C /var/tmp/mysql_snapshot/var/lib/mysql .
Unmount and delete the MariaDB snapshot on each database node.
dbnode>> sudo umount -l -f /var/tmp/mysql_snapshot; \ sudo rm -rf /var/tmp/mysql_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_mysql_snapshot
17.3.2.1 Incremental MariaDB Database Backup #
Incremental backups require a meta file. If you use the incremental backup option, a meta file must be included in the tar command in the initial backup and whenever you do an incremental backup. A copy of the original meta file should be stored in each backup. The meta file is used to determine the incremental changes from the previous backup, so it is rewritten with each incremental backup.
Versions are useful for incremental backup because they provide a way to differentiate between each backup. Versions are included in the tar command.
To prepare for incremental backup, follow the steps in the previous section except for the tar commands. Incremental backup tar commands must have additional information.
First time MariaDB database full backup
dbnode>> sudo tar --create -z --warning=none --no-check-device \ --one-file-system --preserve-permissions --same-owner --seek \ --ignore-failed-read --listed-incremental=PATH_TO_YOUR_DB_META \ --file mydb.tar.gz -C /var/tmp/mysql_snapshot/var/lib/mysql .
For example, where BACKUP_PATH=/var/lib/mysql/:
dbnode>> sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --listed-incremental=mydbMeta --file mydb.tar.gz -C \
  /var/tmp/mysql_snapshot/var/lib/mysql .
Incremental MariaDB database backup
dbnode>> sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek \
  --ignore-failed-read --listed-incremental=PATH_TO_YOUR_META \
  --file BACKUP_TAR_ARCHIVE_NAME_VERSION.tar.gz -C \
  /var/tmp/mysql_snapshot/var/lib/mysql .
For example, where BACKUP_PATH=/var/lib/mysql/:
dbnode>> sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --listed-incremental=mydbMeta --file mydb_v1.tar.gz -C \
  /var/tmp/mysql_snapshot/var/lib/mysql .
After creating and saving the TAR archive, unmount and delete the snapshot.
dbnode>> sudo umount -l -f /var/tmp/mysql_snapshot; \ sudo rm -rf /var/tmp/mysql_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_mysql_snapshot
17.3.2.2 MariaDB Database Encryption #
Encrypt your MariaDB database backup following the instructions in Section 17.3.1.2, “Encryption”.
Upload your BACKUP_TARGET.tar.gz to your preferred remote server.
17.3.3 swift Ring Backup #
The following procedure is used to back up swift rings. It is similar to the Cloud Lifecycle Manager backup (see Procedure 17.1, “Manual Backup Setup”).
The steps must be performed only on the building server (for more information, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”).
The backup_name is swift_builder_dir_backup and the backup_path is /etc/swiftlm/.
Create a snapshot.
ardana > sudo lvcreate --size 2G --snapshot --permission r \
  --name lvm_root_snapshot /dev/ardana-vg/root
Mount the snapshot.
ardana > mkdir /var/tmp/root_snapshot; sudo mount -o ro \
  /dev/ardana-vg/lvm_root_snapshot /var/tmp/root_snapshot
Create the TAR archive.
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --file swring.tar.gz -C /var/tmp/root_snapshot/etc/swiftlm .
Upload your swring.tar.gz TAR archive to your preferred remote server.
Unmount and delete the snapshot.
ardana > sudo umount -l -f /var/tmp/root_snapshot; sudo rm -rf \
  /var/tmp/root_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_root_snapshot
17.3.4 Audit Log Backup and Restore #
17.3.4.1 Audit Log Backup #
The following procedure is used to back up Audit Logs. It is similar to the Cloud Lifecycle Manager backup (see Procedure 17.1, “Manual Backup Setup”). The steps must be performed on all nodes; there will be a backup TAR archive for each node. Before performing the following steps, run through Section 13.2.7.2, “Enable Audit Logging” .
The backup_name is audit_log_backup and the backup_path is /var/audit.
Create a snapshot.
ardana > sudo lvcreate --size 2G --snapshot --permission r --name \
  lvm_root_snapshot /dev/ardana-vg/root
Mount the snapshot.
ardana > mkdir /var/tmp/root_snapshot; sudo mount -o ro \
  /dev/ardana-vg/lvm_root_snapshot /var/tmp/root_snapshot
Create the TAR archive.
ardana > sudo tar --create -z --warning=none --no-check-device \
  --one-file-system --preserve-permissions --same-owner --seek --ignore-failed-read \
  --file audit.tar.gz -C /var/tmp/root_snapshot/var/audit .
Upload your audit.tar.gz TAR archive to your preferred remote server.
Unmount and delete the snapshot.
ardana > sudo umount -l -f /var/tmp/root_snapshot; sudo rm -rf \
  /var/tmp/root_snapshot; sudo lvremove -f /dev/ardana-vg/lvm_root_snapshot
17.3.4.2 Audit Logs Restore #
Restore the Audit Logs backup with the following commands:
Retrieve the Audit Logs TAR archive
Extract the TAR archive to the proper backup location
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /var/audit/ -f audit.tar.gz
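The create/extract pair used throughout this chapter can be verified end to end with a small round trip. This sketch uses temp directories and placeholder names (audit.log, a generated archive path) rather than real audit data.

```shell
# Round trip of the tar create (with meta file) and incremental extract.
src=$(mktemp -d); dst=$(mktemp -d)
arch=$(mktemp -u).tar.gz        # archive path outside the source tree
meta=$(mktemp -u)               # meta file path; tar creates it
echo "audit entry" > "$src/audit.log"
tar --create -z --listed-incremental="$meta" \
    --file "$arch" -C "$src" .
tar -z --incremental --extract --ignore-zeros --warning=none \
    --overwrite --directory "$dst" -f "$arch"
cmp "$src/audit.log" "$dst/audit.log" && echo "restore OK"
```

Running a round trip like this against a scratch directory is a cheap way to validate a backup before you need it in a real recovery.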
17.4 Full Disaster Recovery Test #
17.4.1 High Level View of the Recovery Process #
Back up the control plane using the manual backup procedure.
Back up the Cassandra database.
Re-install Controller 1 with the SUSE OpenStack Cloud ISO.
Use manual restore steps to recover deployment data (and the model).
Re-install SUSE OpenStack Cloud on Controllers 1, 2, and 3.
Recover the backup of the MariaDB database.
Recover the Cassandra database.
Verify the restored deployment with test workloads.
17.4.2 Description of the testing environment #
The testing environment is similar to the Entry Scale model.
It uses five servers: three Control Nodes and two Compute Nodes.
Each controller node has three disks. The first is reserved for the system; the others are used for swift.
For this Disaster Recovery test, data has been saved on disks 2 and 3 of the swift controllers, which allows swift objects to be restored during the recovery. If these disks were also wiped, swift data would be lost, but the procedure would not change. The only difference is that glance images would be lost and would have to be uploaded again.
Unless specified otherwise, all commands should be executed on controller 1, which is also the deployer node.
17.4.3 Pre-Disaster testing #
In order to validate the procedure after recovery, we need to create some workloads.
Source the service credential file.
ardana > source ~/service.osrc
Copy an image to the platform and create a glance image with it. In this example, Cirros is used.
ardana > openstack image create --disk-format raw --container-format \
  bare --public --file ~/cirros-0.3.5-x86_64-disk.img cirros
Create a network.
ardana > openstack network create test_net
Create a subnet.
ardana > openstack subnet create 07c35d11-13f9-41d4-8289-fa92147b1d44 192.168.42.0/24 --name test_subnet
Create some instances.
ardana > openstack server create server_1 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_2 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
ardana > openstack server create server_3 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
ardana > openstack server create server_4 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
ardana > openstack server create server_5 --image 411a03...e2da52e --flavor m1.small --nic net-id=07c35d...147b1d44
ardana > openstack server list
Create containers and objects.
ardana > openstack object create container_1 ~/service.osrc
var/lib/ardana/service.osrc
ardana > swift upload container_1 ~/backup.osrc
ardana > openstack object list container_1
var/lib/ardana/backup.osrc
var/lib/ardana/service.osrc
17.4.4 Preparation of the test backup server #
17.4.4.1 Preparation to store backups #
In this example, backups are stored on the server 192.168.69.132.
Connect to the backup server.
Create the user.
root # useradd BACKUPUSER --create-home --home-dir /mnt/backups/
Switch to that user.
root # su BACKUPUSER
Create the SSH keypair. Leave the default for the first question and do not set any passphrase.
backupuser > ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/mnt/backups//.ssh/id_rsa):
Created directory '/mnt/backups//.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /mnt/backups//.ssh/id_rsa
Your public key has been saved in /mnt/backups//.ssh/id_rsa.pub
The key fingerprint is:
a9:08:ae:ee:3c:57:62:31:d2:52:77:a7:4e:37:d1:28 backupuser@padawan-ccp-c0-m1-mgmt
Add the public key to the list of the keys authorized to connect to that user on this server.
backupuser > cat /mnt/backups/.ssh/id_rsa.pub >> /mnt/backups/.ssh/authorized_keys
Print the private key. This will be used for the backup configuration (ssh_credentials.yml file).
backupuser > cat /mnt/backups/.ssh/id_rsa
-----BEGIN RSA PRIVATE KEY-----
MIIEogIBAAKCAQEAvjwKu6f940IVGHpUj3ffl3eKXACgVr3L5s9UJnb15+zV3K5L
BZuor8MLvwtskSkgdXNrpPZhNCsWSkryJff5I335Jhr/e5o03Yy+RqIMrJAIa0X5
...
iBKVKGPhOnn4ve3dDqy3q7fS5sivTqCrpaYtByJmPrcJNjb2K7VMLNvgLamK/AbL
qpSTZjicKZCCl+J2+8lrKAaDWqWtIjSUs29kCL78QmaPOgEvfsw=
-----END RSA PRIVATE KEY-----
17.4.4.2 Preparation to store Cassandra backups #
In this example, backups will be stored on the server 192.168.69.132, in the /mnt/backups/cassandra_backups/ directory.
Create a directory on the backup server to store Cassandra backups.
backupuser >
mkdir /mnt/backups/cassandra_backups
Copy the private SSH key from the backup server to all controller nodes. Replace CONTROLLER with each control node, for example doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, and so on.
Log in to each controller node and copy the private SSH key to the .ssh directory of the root user.
ardana > sudo cp /var/lib/ardana/.ssh/id_rsa_backup /root/.ssh/
Verify that you can SSH to the backup server as backupuser using the private key.
root #
ssh -i ~/.ssh/id_rsa_backup backupuser@192.168.69.132
17.4.5 Perform Backups for disaster recovery test #
17.4.5.1 Execute backup of Cassandra #
Create the following cassandra-backup-extserver.sh
script on all controller nodes.
root #
cat > ~/cassandra-backup-extserver.sh << EOF
#!/bin/sh
# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/
# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool
# example: cassandra-snp-2018-06-26-1003
SNAPSHOT_NAME=cassandra-snp-\$(date +%F-%H%M)
HOST_NAME=\$(/bin/hostname)_
# Take a snapshot of Cassandra database
\$NODETOOL snapshot -t \$SNAPSHOT_NAME monasca
# Collect a list of directories that make up the snapshot
SNAPSHOT_DIR_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
for d in \$SNAPSHOT_DIR_LIST
do
# copy snapshot directories to external server
rsync -avR -e "ssh -i /root/.ssh/id_rsa_backup" \$d \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME
done
\$NODETOOL clearsnapshot monasca
EOF
root #
chmod +x ~/cassandra-backup-extserver.sh
Execute the following steps on all the controller nodes.
The ~/cassandra-backup-extserver.sh script should be executed on all three controller nodes at the same time (within seconds of each other) for a successful backup.
Edit the ~/cassandra-backup-extserver.sh script.
scriptSet BACKUP_USER and BACKUP_SERVER to the desired backup user (for example,
backupuser
) and desired backup server (for example,192.168.68.132
), respectively.BACKUP_USER=backupuser BACKUP_SERVER=192.168.69.132 BACKUP_DIR=/mnt/backups/cassandra_backups/
Execute ~/cassandra-backup-extserver.sh on all controller nodes which are also Cassandra nodes.
root #
~/cassandra-backup-extserver.sh Requested creating snapshot(s) for [monasca] with snapshot name [cassandra-snp-2018-06-28-0251] and options {skipFlush=false} Snapshot directory: cassandra-snp-2018-06-28-0251 sending incremental file list created directory /mnt/backups/cassandra_backups//doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251 /var/ /var/cassandra/ /var/cassandra/data/ /var/cassandra/data/data/ /var/cassandra/data/data/monasca/ ... ... ... /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-Summary.db /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-TOC.txt /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/schema.cql sent 173,691 bytes received 531 bytes 116,148.00 bytes/sec total size is 171,378 speedup is 0.98 Requested clearing snapshot(s) for [monasca]Verify the Cassandra backup directory on the backup server.
backupuser >
ls -alt /mnt/backups/cassandra_backups total 16 drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 . drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306 drwxr-xr-x 3 backupuser users 4096 Jun 28 02:51 doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251 drwxr-xr-x 8 backupuser users 4096 Jun 27 20:56 .. $backupuser@backupserver> du -shx /mnt/backups/cassandra_backups/* 6.2G /mnt/backups/cassandra_backups/doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251 6.3G /mnt/backups/cassandra_backups/doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
17.4.5.2 Execute backup of SUSE OpenStack Cloud #
Back up the Cloud Lifecycle Manager using the procedure at Section 17.3.1, “Cloud Lifecycle Manager Data Backup”
Back up the MariaDB database using the procedure at Section 17.3.2, “MariaDB Database Backup”
Back up swift rings using the procedure at Section 17.3.3, “swift Ring Backup”
17.4.5.2.1 Restore the first controller #
Log in to the Cloud Lifecycle Manager.
Retrieve the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.
Extract the TAR archives for each of the five locations.
ardana > sudo tar -z --incremental --extract --ignore-zeros \
  --warning=none --overwrite --directory RESTORE_TARGET \
  -f BACKUP_TARGET.tar.gz
For example, with a directory such as BACKUP_TARGET=/etc/ssh/:
ardana > sudo tar -z --incremental --extract --ignore-zeros \
  --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz
With a file such as BACKUP_TARGET=/etc/passwd:
ardana > sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
17.4.5.2.2 Re-deployment of controllers 1, 2 and 3 #
Change back to the default ardana user.
Run the
cobbler-deploy.yml
playbook.ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Run the
bm-reimage.yml
playbook limited to the second and third controllers.ardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3The names of controller2 and controller3. Use the
bm-power-status.yml
playbook to check the cobbler names of these nodes.Run the
site.yml
playbook limited to the three controllers and localhost—in this example,doc-cp1-c1-m1-mgmt
,doc-cp1-c1-m2-mgmt
,doc-cp1-c1-m3-mgmt
, andlocalhost
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit \ doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
17.4.5.2.3 Restore Databases #
17.4.5.2.3.1 Restore MariaDB database #
Log in to the first controller node.
Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).
ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
  --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
  -f mydb.tar.gz
Verify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_infoStop SUSE OpenStack Cloud services on the three controllers (using the hostnames of the controllers in your configuration).
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \ doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhostDelete the files in the
mysql
directory and copy the restored backup to that directory.root #
cd /var/lib/mysql/root #
rm -rf ./*root #
cp -pr /tmp/mysql_restore/* ./Switch back to the
ardana
user when the copy is finished.
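The extract-and-verify steps above can be collected into one guarded script. This is a minimal sketch using the example archive name from this section; the schema directories spot-checked at the end are assumptions and should match the services deployed in your cloud.

```shell
# Minimal sketch of the extract-and-verify step. The archive name and
# restore directory are the examples used in this section; the schema
# directories checked at the end are assumptions.
BACKUP_TAR=mydb.tar.gz
RESTORE_DIR=/tmp/mysql_restore

mkdir -p "$RESTORE_DIR"
if [ -f "$BACKUP_TAR" ]; then
    # Same tar flags as the manual step above.
    sudo tar -z --incremental --extract --ignore-zeros --warning=none \
        --overwrite --directory "$RESTORE_DIR" -f "$BACKUP_TAR"
    # Spot-check that a few service schemas made it out of the archive.
    for db in keystone nova glance; do
        test -d "$RESTORE_DIR/$db" || echo "missing database directory: $db"
    done
fi
```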
17.4.5.2.3.2 Restore Cassandra database #
Create a script called cassandra-restore-extserver.sh on all controller nodes.
root # cat > ~/cassandra-restore-extserver.sh << EOF
#!/bin/sh
# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/
# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool
HOST_NAME=\$(/bin/hostname)_
#Get snapshot name from command line.
if [ -z "\$*" ]
then
echo "usage \$0 <snapshot to restore>"
exit 1
fi
SNAPSHOT_NAME=\$1
# restore
rsync -av -e "ssh -i /root/.ssh/id_rsa_backup" \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME/ /
# set ownership of newly restored files
chown -R cassandra:cassandra \$DATA_DIR/monasca/*
# Get a list of snapshot directories that have files to be restored.
RESTORE_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
# use RESTORE_LIST to move snapshot files back into place of database.
for d in \$RESTORE_LIST
do
cd \$d
mv * ../..
KEYSPACE=\$(pwd | rev | cut -d '/' -f4 | rev)
TABLE_NAME=\$(pwd | rev | cut -d '/' -f3 |rev | cut -d '-' -f1)
\$NODETOOL refresh \$KEYSPACE \$TABLE_NAME
done
cd
# Cleanup snapshot directories
\$NODETOOL clearsnapshot \$KEYSPACE
EOF
root #
chmod +x ~/cassandra-restore-extserver.sh
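The keyspace and table names used for the nodetool refresh in the script above are derived purely from the snapshot path layout (.../data/KEYSPACE/TABLE-uuid/snapshots/SNAPSHOT_NAME). The sketch below shows that derivation in isolation; the path is a hypothetical example modeled on the rsync output later in this section.

```shell
# Hypothetical snapshot path, modeled on the Cassandra data layout
# .../data/<keyspace>/<table>-<uuid>/snapshots/<snapshot_name>
p=/var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20/snapshots/cassandra-snp-2018-06-28-0306

# Fourth path component from the end is the keyspace ...
KEYSPACE=$(echo "$p" | rev | cut -d '/' -f4 | rev)
# ... third from the end is "<table>-<uuid>"; strip the uuid suffix.
TABLE_NAME=$(echo "$p" | rev | cut -d '/' -f3 | rev | cut -d '-' -f1)

echo "$KEYSPACE $TABLE_NAME"
# Prints: monasca alarm_state_history
```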
Execute the following steps on all controller nodes.
Edit the ~/cassandra-restore-extserver.sh script. Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.
BACKUP_USER=backupuser
BACKUP_SERVER=192.168.69.132
BACKUP_DIR=/mnt/backups/cassandra_backups/
Execute ~/cassandra-restore-extserver.sh SNAPSHOT_NAME. Find SNAPSHOT_NAME in the listing of /mnt/backups/cassandra_backups. All the directories have the format HOST_SNAPSHOT_NAME.
ardana > ls -alt /mnt/backups/cassandra_backups
total 16
drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
root # ~/cassandra-restore-extserver.sh cassandra-snp-2018-06-28-0306
receiving incremental file list
./
var/
var/cassandra/
var/cassandra/data/
var/cassandra/data/data/
var/cassandra/data/data/monasca/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/manifest.json
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-CompressionInfo.db
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-Data.db
...
/usr/bin/nodetool clearsnapshot monasca
17.4.5.2.3.3 Restart SUSE OpenStack Cloud services #
Restart the MariaDB database. On the deployer node, execute the galera-bootstrap.yml playbook, which determines the log sequence number, bootstraps the main node, and starts the database cluster.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
If this process fails to recover the database cluster, refer to Section 15.2.3.1.2, “Recovering the MariaDB Database”.
Restart SUSE OpenStack Cloud services on the three controllers as in the following example.
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
--limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
Reconfigure SUSE OpenStack Cloud.
ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
17.4.5.2.4 Post restore testing #
Source the service credential file.
ardana > source ~/service.osrc
swift
ardana > openstack container list
container_1
volumebackups
ardana > openstack object list container_1
var/lib/ardana/backup.osrc
var/lib/ardana/service.osrc
ardana > openstack object save container_1 /tmp/backup.osrc
neutron
ardana > openstack network list
+--------------------------------------+----------+--------------------------------------+
| ID                                   | Name     | Subnets                              |
+--------------------------------------+----------+--------------------------------------+
| 07c35d11-13f9-41d4-8289-fa92147b1d44 | test-net | 02d5ca3b-1133-4a74-a9ab-1f1dc2853ec8 |
+--------------------------------------+----------+--------------------------------------+
glance
ardana > openstack image list
+--------------------------------------+---------------------+--------+
| ID                                   | Name                | Status |
+--------------------------------------+---------------------+--------+
| 411a0363-7f4b-4bbc-889c-b9614e2da52e | cirros-0.4.0-x86_64 | active |
+--------------------------------------+---------------------+--------+
ardana > openstack image save --file /tmp/cirros f751c39b-f1e3-4f02-8332-3886826889ba
ardana > ls -lah /tmp/cirros
-rw-r--r-- 1 ardana ardana 12716032 Jul 2 20:52 /tmp/cirros
nova
ardana > openstack server list
ardana > openstack server create server_6 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
+-------------------------------------+------------------------------------------------------------+
| Field                               | Value                                                      |
+-------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                     |
| OS-EXT-AZ:availability_zone         |                                                            |
| OS-EXT-SRV-ATTR:host                | None                                                       |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                       |
| OS-EXT-SRV-ATTR:instance_name       |                                                            |
| OS-EXT-STS:power_state              | NOSTATE                                                    |
| OS-EXT-STS:task_state               | scheduling                                                 |
| OS-EXT-STS:vm_state                 | building                                                   |
| OS-SRV-USG:launched_at              | None                                                       |
| OS-SRV-USG:terminated_at            | None                                                       |
| accessIPv4                          |                                                            |
| accessIPv6                          |                                                            |
| addresses                           |                                                            |
| adminPass                           | iJBoBaj53oUd                                               |
| config_drive                        |                                                            |
| created                             | 2018-07-02T21:02:01Z                                       |
| flavor                              | m1.small (2)                                               |
| hostId                              |                                                            |
| id                                  | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c                       |
| image                               | cirros-0.4.0-x86_64 (f751c39b-f1e3-4f02-8332-3886826889ba) |
| key_name                            | None                                                       |
| name                                | server_6                                                   |
| progress                            | 0                                                          |
| project_id                          | cca416004124432592b2949a5c5d9949                           |
| properties                          |                                                            |
| security_groups                     | name='default'                                             |
| status                              | BUILD                                                      |
| updated                             | 2018-07-02T21:02:01Z                                       |
| user_id                             | 8cb1168776d24390b44c3aaa0720b532                           |
| volumes_attached                    |                                                            |
+-------------------------------------+------------------------------------------------------------+
ardana > openstack server list
+--------------------------------------+----------+--------+------------+---------------------+----------+
| ID                                   | Name     | Status | Networks   | Image               | Flavor   |
+--------------------------------------+----------+--------+------------+---------------------+----------+
| ce7689ff-23bf-4fe9-b2a9-922d4aa9412c | server_6 | ACTIVE | n1=1.1.1.8 | cirros-0.4.0-x86_64 | m1.small |
+--------------------------------------+----------+--------+------------+---------------------+----------+
ardana > openstack server delete ce7689ff-23bf-4fe9-b2a9-922d4aa9412c
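The manual checks above can be wrapped in a small smoke-test loop that reports each service check without aborting on the first failure. This is a hedged sketch, not a shipped tool; the command list is an assumption to be extended for your deployment, and service.osrc is the credential file used earlier in this section.

```shell
# Hedged sketch of a post-restore smoke test. The command list below is an
# assumption - add the read-only checks relevant to your deployment.
if [ -f ~/service.osrc ]; then
    . ~/service.osrc
fi

# Run one read-only check and report OK/FAIL without aborting the loop.
run_check() {
    if $1 > /dev/null 2>&1; then
        echo "OK:   $1"
    else
        echo "FAIL: $1"
    fi
}

run_check "openstack container list"
run_check "openstack network list"
run_check "openstack image list"
run_check "openstack server list"
```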
18 Troubleshooting Issues #
Troubleshooting and support processes for solving issues in your environment.
This section contains troubleshooting tasks for your SUSE OpenStack Cloud environment.
18.1 General Troubleshooting #
General troubleshooting procedures for resolving your cloud issues including steps for resolving service alarms and support contact information.
Before contacting support to help you with a problem on SUSE OpenStack Cloud, we recommend gathering as much information as possible about your system and the problem. For this purpose, SUSE OpenStack Cloud ships with a tool called supportconfig. It gathers system information such as the current kernel version being used, the hardware, RPM database, partitions, and other items. supportconfig also collects the most important log files. This information assists support staff in identifying and solving your problem.
Always run supportconfig on the Cloud Lifecycle Manager and on the Control Node(s). If a Compute Node or a Storage Node is part of the problem, run supportconfig on the affected node as well. For details on how to run supportconfig, see https://documentation.suse.com/sles/15-SP1/single-html/SLES-admin/#cha-adm-support.
18.1.1 Alarm Resolution Procedures #
SUSE OpenStack Cloud provides a monitoring solution based on OpenStack’s monasca service. This service provides monitoring and metrics for all OpenStack components, as well as much of the underlying system. By default, SUSE OpenStack Cloud comes with a set of alarms that provide coverage of the primary systems. In addition, you can define alarms based on threshold values for any metrics defined in the system. You can view alarm information in the Operations Console. You can also receive or deliver this information to others by configuring email or other mechanisms. Alarms provide information about whether a component failed and is affecting the system, and also what condition triggered the alarm.
Here is a list of the included service-specific alarms and the recommended troubleshooting steps. These alarms are organized by the section of the SUSE OpenStack Cloud Operations Console in which they appear, as well as by the service dimension defined for them.
18.1.1.1 Compute Alarms #
These alarms show under the Compute section of the SUSE OpenStack Cloud Operations Console.
18.1.1.1.1 SERVICE: COMPUTE #
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status
Description: This is a Likely cause: Process crashed. | Restart the nova-api process on the affected
node. Review the nova-api.log files. Try to connect
locally to the http port that is found in the dimension field of the alarm
to see if the connection is accepted. |
Name: Host Status Description: Alarms when the specified host is down or not reachable. Likely cause: The host is down, has been rebooted, or has network connectivity issues. | If it is a single host, attempt to restart the system. If it is multiple hosts, investigate networking issues. |
Name: Process Bound Check
Description: Likely cause: Process crashed, or too many processes are running | Stop all the processes and restart the nova-api process on the affected host. Review the system and nova-api logs. |
Name: Process Check
Description: Separate alarms for each of these nova services,
specified by the
Likely cause: Process specified by the |
Restart the process on the affected node using these steps:
Review the associated logs. The logs will be in the format of
|
Name: nova.heartbeat Description: Check that all services are sending heartbeats. Likely cause: The process for the service specified in the alarm has crashed or is hung and not reporting its status to the database. Alternatively, the service may be fine, but an issue with messaging or the database means the status is not being updated correctly. | Restart the affected service. If the service is reporting OK, the issue may be with RabbitMQ or MySQL. In that case, check the alarms for those services. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a service set to
| Find the service that is consuming too much disk space. Look at the
logs. If DEBUG log entries exist, set the logging level
to INFO . If the logs are repeatedly logging an error
message, do what is needed to resolve the error. If old log files exist,
configure log rotate to remove them. You could also choose to remove old
log files by hand after backing them up if needed. |
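The log-rotation mitigation above can be illustrated with a logrotate policy fragment. The file name, log path, and retention values below are hypothetical examples, not shipped defaults; match them to the service that exceeded its quota. A policy file of this shape would typically live under /etc/logrotate.d/.

```
# Hypothetical policy for an over-quota service log - example values only
/var/log/nova/nova-api.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
```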
18.1.1.1.2 SERVICE: IMAGE-SERVICE in Compute section #
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status
Description: Separate alarms for
each of these glance services, specified by the
Likely cause: API is unresponsive. |
Restart the process on the affected node using these steps:
Review the associated logs. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a service set to
| Find the service that is consuming too much disk space. Look at
the logs. If DEBUG log entries exist, set the logging
level to INFO . If the logs are repeatedly logging an
error message, do what is needed to resolve the error. If old log files
exist, configure log rotate to remove them. You could also choose to
remove old log files by hand after backing them up if needed. |
18.1.1.1.3 SERVICE: BAREMETAL in Compute section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the
specified process is not running: Likely cause: The ironic API is unresponsive. |
Restart the
|
Name: Process Check
Description: Alarms when the
specified process is not running:
Likely cause: The
|
Restart the
|
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable. Likely cause: The API is unresponsive. |
|
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a
service set to | Find the service that is consuming too much disk space. Look at
the logs. If DEBUG log entries exist, set the logging
level to INFO . If the logs are repeatedly logging an
error message, do what is needed to resolve the error. If old log files
exist, configure log rotate to remove them. You could also choose to
remove old log files by hand after backing them up if needed. |
18.1.1.2 Storage Alarms #
These alarms show under the Storage section of the SUSE OpenStack Cloud Operations Console.
18.1.1.2.1 SERVICE: OBJECT-STORAGE #
Alarm Information | Mitigation Tasks |
---|---|
Name: swiftlm-scan monitor
Description: Alarms if
Likely cause: The
|
Click on the alarm to examine the sudo swiftlm-scan | python -mjson.tool
The |
Name: swift account replicator last completed in 12 hours
Description: Alarms if an
Likely cause: This can indicate that
the |
Another cause of this problem may be a corrupt file system. Look for signs of this in these logs on the affected node:
/var/log/swift/swift.log
/var/log/kern.log
The file system may need to be wiped; contact Sales Engineering for advice on the best way to do that if needed. You can then reformat the file system with these steps:
|
Name: swift container replicator last completed in 12 hours Description: Alarms if a container-replicator process did not complete a replication cycle within the last 12 hours Likely cause: This can indicate that the container-replication process is stuck. |
SSH to the affected host and restart the process with this command:
sudo systemctl restart swift-container-replicator
Another cause of this problem may be a corrupt file system. Look for signs of this in these logs on the affected node:
/var/log/swift/swift.log
/var/log/kern.log
The file system may need to be wiped; contact Sales Engineering for advice on the best way to do that if needed. You can then reformat the file system with these steps:
|
Name: swift object replicator last completed in 24 hours Description: Alarms if an object-replicator process did not complete a replication cycle within the last 24 hours Likely cause: This can indicate that the object-replication process is stuck. |
SSH to the affected host and restart the process with this command:
sudo systemctl restart swift-object-replicator
Another cause of this problem may be a corrupt file system. Look for signs of this in these logs on the affected node:
/var/log/swift/swift.log
/var/log/kern.log
The file system may need to be wiped; contact Sales Engineering for advice on the best way to do that if needed. You can then reformat the file system with these steps:
|
Name: swift configuration file ownership
Description: Alarms if
files/directories in
Likely cause: For files in
|
For files in
|
Name: swift data filesystem ownership
Description: Alarms if files or
directories in
Likely cause: For directories in
|
For directories and files in |
Name: Drive URE errors detected
Description: Alarms if
Likely cause: An unrecoverable read error occurred when swift attempted to access a directory. |
The UREs reported only apply to file system metadata (that is, directory structures). For UREs in object files, the swift system automatically deletes the file and replicates a fresh copy from one of the other replicas. UREs are a normal feature of large disk drives. It does not mean that the drive has failed. However, if you get regular UREs on a specific drive, then this may indicate that the drive has indeed failed and should be replaced. You can use standard XFS repair actions to correct the UREs in the file system. If the XFS repair fails, you should wipe the GPT table as follows (where <drive_name> is replaced by the actual drive name):
Then follow the steps below which will reformat the drive, remount it, and restart swift services on the affected node.
It is safe to reformat drives containing swift data because swift maintains other copies of the data (usually, swift is configured to have three replicas of all data). |
Name: swift service
Description: Alarms if a swift
process, specified by the
Likely cause: A daemon specified by
the |
Examine the
Restart swift processes by running the
|
Name: swift filesystem mount point status Description: Alarms if a file system/drive used by swift is not correctly mounted.
Likely cause: The device specified by
the The most probable cause is that the drive has failed or that it had a temporary failure during the boot process and remained unmounted. Another possible cause is file system corruption that prevents the device from being mounted. |
Reboot the node and see if the file system remains unmounted. If the file system is corrupt, see the process used for the "Drive URE errors" alarm to wipe and reformat the drive. |
Name: swift uptime-monitor status
Description: Alarms if the
swiftlm-uptime-monitor has errors using keystone ( Likely cause: The swiftlm-uptime-monitor cannot get a token from keystone or cannot get a successful response from the swift Object-Storage API. |
Check that the keystone service is running:
Check that swift is running:
Restart the swiftlm-uptime-monitor as follows:
|
Name: swift keystone server connect Description: Alarms if a socket cannot be opened to the keystone service (used for token validation)
Likely cause: The Identity service
(keystone) server may be down. Another possible cause is that the
network between the host reporting the problem and the keystone server
or the |
The |
Name: swift service listening on ip and port Description: Alarms when a swift service is not listening on the correct port or ip. Likely cause: The swift service may be down. |
Verify the status of the swift service on the affected host, as
specified by the
If an issue is determined, you can stop and restart the swift service with these steps:
|
Name: swift rings checksum Description: Alarms if the swift rings checksums do not match on all hosts.
Likely cause: The swift ring files
must be the same on every node. The files are located in
If you have just changed any of the rings and you are still deploying the change, it is normal for this alarm to trigger. |
If you have just changed any of your swift rings, if you wait until the changes complete then this alarm will likely clear on its own. If it does not, then continue with these steps.
Use
Run the
|
Name: swift memcached server connect Description: Alarms if a socket cannot be opened to the specified memcached server. Likely cause: The server may be down. The memcached daemon running the server may have stopped. |
If the server is down, restart it.
If memcached has stopped, you can restart it by using the
If the server is running and memcached is running, there may be a network problem blocking port 11211. If you see sporadic alarms on different servers, the system may be running out of resources. Contact Sales Engineering for advice. |
Name: swift individual disk usage exceeds 80% Description: Alarms when a disk drive used by swift exceeds 80% utilization. Likely cause: Generally all disk drives will fill roughly at the same rate. If an individual disk drive becomes filled faster than other drives it can indicate a problem with the replication process. |
If many or most of your disk drives are 80% full, you need to add more nodes to your system or delete existing objects. If one disk drive is noticeably (more than 30%) more utilized than the average of other disk drives, check that swift processes are working on the server (use the steps below) and also look for alarms related to the host. Otherwise continue to monitor the situation.
|
Name: swift individual disk usage exceeds 90% Description: Alarms when a disk drive used by swift exceeds 90% utilization. Likely cause: Generally all disk drives will fill roughly at the same rate. If an individual disk drive becomes filled faster than other drives it can indicate a problem with the replication process. |
If one disk drive is noticeably (more than 30%) more utilized than the average of other disk drives, check that swift processes are working on the server, using these steps:
Also look for alarms related to the host. An individual disk drive filling can indicate a problem with the replication process.
Restart swift on that host using the
If the utilization does not return to similar values as other disk drives, you can reformat the disk drive. You should only do this if the average utilization of all disk drives is less than 80%. To format a disk drive contact Sales Engineering for instructions. |
Name: swift total disk usage exceeds 80% Description: Alarms when the average disk utilization of swift disk drives exceeds 80% utilization. Likely cause: The number and size of objects in your system is beginning to fill the available disk space. Account and container storage is included in disk utilization. However, this generally consumes 1-2% of space compared to objects, so object storage is the dominant consumer of disk space. |
You need to add more nodes to your system or delete existing objects to remain under 80% utilization.
If you delete a project/account, the objects in that account are not
removed until a week later by the |
Name: swift total disk usage exceeds 90% Description: Alarms when the average disk utilization of swift disk drives exceeds 90% utilization. Likely cause: The number and size of objects in your system is beginning to fill the available disk space. Account and container storage is included in disk utilization. However, this generally consumes 1-2% of space compared to objects, so object storage is the dominant consumer of disk space. |
If your disk drives are 90% full, you must immediately stop all applications that put new objects into the system. At that point you can either delete objects or add more servers.
Using the steps below, set the
If you allow your file systems to become full, you will be unable to delete objects or add more nodes to the system. This is because the system needs some free space to handle the replication process when adding nodes. With no free space, the replication process cannot work. |
Name: swift service per-minute availability Description: Alarms if the swift service reports unavailable for the previous minute.
Likely cause: The
|
There are many reasons why the endpoint may stop running. Check:
|
Name: swift rsync connect Description: Alarms if a socket cannot be opened to the specified rsync server Likely cause: The rsync daemon on the specified node cannot be contacted. The most probable cause is that the node is down. The rsync service might also have been stopped on the node. |
Reboot the server if it is down. Attempt to restart rsync with this command: systemctl restart rsync.service |
Name: swift smart array controller status Description: Alarms if there is a failure in the Smart Array. Likely cause: The Smart Array or Smart HBA controller has a fault or a component of the controller (such as a battery) is failed or caching is disabled. The HPE Smart Storage Administrator (HPE SSA) CLI component will have to be installed for SSACLI status to be reported. HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f |
Log in to the reported host and run these commands to find out the status of the controllers: sudo hpssacli => controller show all detail For hardware failures (such as failed battery), replace the failed component. If the cache is disabled, reenable the cache. |
Name: swift physical drive status Description: Alarms if there is a failure in the Physical Drive. Likely cause: A disk drive on the server has failed or has warnings. |
Log in to the reported host and run these commands to find out the status of the drive: sudo hpssacli => ctrl slot=1 pd all show Replace any broken drives. |
Name: swift logical drive status Description: Alarms if there is a failure in the Logical Drive. Likely cause: A LUN on the server is degraded or has failed. |
Log in to the reported host and run these commands to find out the status of the LUN: sudo hpssacli => ctrl slot=1 ld all show => ctrl slot=1 pd all show Replace any broken drives. |
Name: Process Check Description: Alarms when the specified process is not running.
Likely cause: If the
|
If the |
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable.
Likely cause: If the
|
If the |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a
service set to |
Find the service that is consuming too much disk space. Look at the
logs. If |
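The "noticeably (more than 30%) more utilized than the average" test used by the individual-disk-usage alarms above can be sketched numerically. The utilization figures below are invented examples; in practice they would come from df output for the swift node's mounted data drives.

```shell
# Example per-drive utilization figures (percent); in practice these would
# come from 'df' on the swift node's mounted data drives.
utils="62 58 65 99 60"

avg=$(echo "$utils" | tr ' ' '\n' | awk '{s += $1; n++} END {printf "%d", s / n}')
echo "average utilization: ${avg}%"

for u in $utils; do
    if [ $((u - avg)) -gt 30 ]; then
        echo "drive at ${u}% is more than 30 points above average - check replication"
    fi
done
```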
18.1.1.2.2 SERVICE: BLOCK-STORAGE in Storage section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Separate alarms for each
of these cinder services, specified by the
Likely cause: Process crashed. |
Restart the process on the affected node. Review the associated logs.
|
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. |
Alert may be incorrect if the service has migrated. Validate that the service is intended to be running on this node before restarting the service. Review the associated logs. |
Name: Process Check Description: Alarms when the specified process is not running: process_name=cinder-scheduler Likely cause: Process crashed. |
Restart the process on the affected node. Review the associated logs.
|
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. |
Alert may be incorrect if the service has migrated. Validate that the service is intended to be running on this node before restarting the service. Review the associated logs. |
Name: cinder backup running <hostname> check Description: cinder backup singleton check. Likely cause: Backup process is one of the following:
|
Run the
|
Name: cinder volume running <hostname> check Description: cinder volume singleton check.
Likely cause: The
|
Run the
|
Name: Storage faulty lun check Description: Alarms if local LUNs on your HPE servers using smartarray are not OK. Likely cause: A LUN on the server is degraded or has failed. |
Log in to the reported host and run these commands to find out the status of the LUN: sudo hpssacli => ctrl slot=1 ld all show => ctrl slot=1 pd all show Replace any broken drives. |
Name: Storage faulty drive check Description: Alarms if the local disk drives on your HPE servers using smartarray are not OK. Likely cause: A disk drive on the server has failed or has warnings. |
Log in to the reported host and run these commands to find out the status of the drive: sudo hpssacli => ctrl slot=1 pd all show Replace any broken drives. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a
service set to |
Find the service that is consuming too much disk space. Look at the
logs. If |
18.1.1.3 Networking Alarms #
These alarms show under the Networking section of the SUSE OpenStack Cloud Operations Console.
18.1.1.3.1 SERVICE: NETWORKING #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the specified
process is not running. Separate alarms for each of these neutron
services, specified by the
Likely cause: Process crashed. |
Restart the process on the affected node:
|
Name: Process Check Description: Alarms when the specified process is not running: process_name = neutron-rootwrap Likely cause: Process crashed. |
Currently
|
Name: HTTP Status Description: neutron api health check
Likely cause: Process is stuck if the
|
|
Name: HTTP Status Description: neutron api health check Likely cause: The node crashed. Alternatively, only connectivity might have been lost if the local node HTTP Status is OK or UNKNOWN. | Reboot the node if it crashed or diagnose the networking connectivity failures between the local and remote nodes. Review the logs. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a
service set to | Find the service that is consuming too much disk space. Look at
the logs. If DEBUG log entries exist, set the logging
level to INFO . If the logs are repeatedly logging an
error message, do what is needed to resolve the error. If old log files
exist, configure log rotate to remove them. You could also choose to
remove old log files by hand after backing them up if needed. |
18.1.1.3.2 SERVICE: DNS in Networking section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check Description: Alarms when the specified process is not running: process_name = designate-zone-manager Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the log located at: /var/log/designate/designate-zone-manager.log |
Name: Process Check Description: Alarms when the specified process is not running: process_name = designate-pool-manager Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the log located at: /var/log/designate/designate-pool-manager.log |
Name: Process Check Description: Alarms when the specified process is not running: process_name = designate-central Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the log located at: /var/log/designate/designate-central.log |
Name: Process Check Description: Alarms when the specified process is not running: process_name = designate-api Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the log located at: /var/log/designate/designate-api.log |
Name: Process Check Description: Alarms when the specified process is not running: process_name = designate-mdns Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the log located at: /var/log/designate/designate-mdns.log |
Name: HTTP Status
Description: Likely cause: The API is unresponsive. |
Restart the process on the affected node using these steps:
Review the logs located at: /var/log/designate/designate-api.log /var/log/designate/designate-central.log |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota.
Likely cause: This could be due to a
service set to | Find the service that is consuming too much disk space. Look at
the logs. If DEBUG log entries exist, set the logging
level to INFO . If the logs are repeatedly logging an
error message, do what is needed to resolve the error. If old log files
exist, configure log rotate to remove them. You could also choose to
remove old log files by hand after backing them up if needed. |
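Each designate row above follows the same pattern: restart the named process, then review its log. As a sketch (assuming each designate process runs as a systemd unit of the same name, which may differ in your deployment; verify with `systemctl list-units 'designate*'` on the affected node), a helper can print the commands to run for a given alarm:

```shell
# Sketch: map a designate process-check alarm to the restart command and
# the log file to review. The systemd unit names are assumptions.
designate_mitigation() {
    process_name="$1"   # e.g. designate-api, designate-central, designate-mdns
    echo "sudo systemctl restart ${process_name}"
    echo "less /var/log/designate/${process_name}.log"
}
```

For example, `designate_mitigation designate-api` prints the restart command and the log path listed in the table row for that alarm.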
18.1.1.3.3 SERVICE: BIND in Networking section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the specified process is not running. Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the logs by querying /var/log/syslog |
Name: Process Check
Description: Alarms when the specified process is not running. Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the logs by querying /var/log/syslog |
18.1.1.4 Identity Alarms #
These alarms show under the Identity section of the SUSE OpenStack Cloud Operations Console.
18.1.1.4.1 SERVICE: IDENTITY-SERVICE #
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status Description: This check is contacting the keystone public endpoint directly. component=keystone-api api_endpoint=public Likely cause: The keystone service is down on the affected node. |
Restart the keystone service on the affected node:
|
Name: HTTP Status Description: This check is contacting the keystone admin endpoint directly component=keystone-api api_endpoint=admin Likely cause: The keystone service is down on the affected node. |
Restart the keystone service on the affected node:
|
Name: HTTP Status Description: This check is contacting the keystone admin endpoint via the virtual IP address (HAProxy) component=keystone-api monitored_host_type=vip Likely cause: The keystone service is unreachable via the virtual IP address. |
If the other two keystone HTTP Status alarms are not alerting, the issue is likely with HAProxy. You can restart the haproxy service with these steps:
|
Name: Process Check
Description: Separate alarms for each of these keystone services, specified by the process name. Likely cause: Process crashed. |
You can restart the keystone service with these steps:
Review the logs in |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be a repeating error message filling up the log files. Finally, log rotate may not be configured properly, so old log files are not being deleted. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.5 Telemetry Alarms #
These alarms show under the Telemetry section of the SUSE OpenStack Cloud Operations Console.
18.1.1.5.1 SERVICE: TELEMETRY #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the
Likely cause: Process has crashed. |
Review the logs on the alarming host in the following location for the cause: /var/log/ceilometer/ceilometer-agent-notification-json.log Restart the process on the affected node using these steps:
|
Name: Process Check
Description: Alarms when the
Likely cause: Process has crashed. |
Review the logs on the alarming host in the following location for the cause: /var/log/ceilometer/ceilometer-polling-json.log Restart the process on the affected node using these steps:
|
18.1.1.5.2 SERVICE: METERING in Telemetry section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be a repeating error message filling up the log files. Finally, log rotate may not be configured properly, so old log files are not being deleted. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.5.3 SERVICE: KAFKA in Telemetry section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Kafka Persister Metric Consumer Lag Description: Alarms when the Persister consumer group is not keeping up with the incoming messages on the metric topic. Likely cause: There is a slow down in the system or heavy load. |
Verify that all of the monasca-persister services are up with these steps:
Look for high load in the various systems. This alert can fire for multiple topics or on multiple hosts. Determining which alarms are firing can help diagnose likely causes. For example, if all alarms are on one machine, the machine itself is the likely cause; if one topic is lagging across multiple machines, the consumers of that topic are the likely cause. |
Name: Kafka Alarm Transition Consumer Lag Description: Alarms when the specified consumer group is not keeping up with the incoming messages on the alarm state transition topic. Likely cause: There is a slow down in the system or heavy load. |
Check that monasca-thresh and monasca-notification are up. Look for high load in the various systems. This alert can fire for multiple topics or on multiple hosts. Which alarms are firing can help diagnose likely causes. For example:
|
Name: Kafka Kronos Consumer Lag Description: Alarms when the Kronos consumer group is not keeping up with the incoming messages on the metric topic. Likely cause: There is a slow down in the system or heavy load. |
Look for high load in the various systems. This alert can fire for multiple topics or on multiple hosts. Which alarms are firing can help diagnose likely causes. For example:
|
Name: Process Check Description: Alarms when the specified process is not running: process_name = kafka.Kafka Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the logs in |
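Consumer lag, which the Kafka alarms above track, is the difference between a partition's log-end offset and the consumer group's committed offset. The sketch below demonstrates that arithmetic against sample "topic partition current-offset log-end-offset" rows; on a live system these values come from Kafka's consumer-group tooling, and the sample data here is purely illustrative:

```shell
# Sketch: compute per-partition and total consumer lag from rows of the
# form "topic partition current-offset log-end-offset" (sample data).
compute_lag() {
    awk '{ lag = $4 - $3; print $1, "partition", $2, "lag", lag; total += lag }
         END { print "total lag", total+0 }'
}
```

Usage example: `printf 'metrics 0 100 150\nmetrics 1 200 200\n' | compute_lag` shows partition 0 lagging by 50 messages. A steadily growing total is the condition these alarms report.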
18.1.1.5.4 SERVICE: LOGGING in Telemetry section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Beaver Memory Usage Description: Beaver is using more memory than expected. This may indicate that it cannot forward messages and its queue is filling up. If you continue to see this, see the troubleshooting guide. Likely cause: Overloaded system or services with memory leaks. | Log on to the reporting host to investigate high memory users. |
Name: Audit Log Partition Low Watermark
Description: The audit log disk partition usage has crossed the low watermark (var_audit_low_watermark_percent). Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
Name: Audit Log Partition High Watermark
Description: The audit log disk partition usage has crossed the high watermark (var_audit_high_watermark_percent). Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
Name: Elasticsearch Unassigned Shards Description: component = elasticsearch; Elasticsearch unassigned shards count is greater than 0. Likely cause: Environment could be misconfigured. |
To find the unassigned shards, run the following command on the Cloud Lifecycle Manager:
This shows which shards are unassigned, like this: logstash-2015.10.21 4 p UNASSIGNED ... 10.240.75.10 NodeName. The last column shows the name that Elasticsearch uses for the node that the unassigned shards are on. To find the actual host name, run:
When you find the host name, take the following steps:
|
Name: Elasticsearch Number of Log Entries
Description: Elasticsearch Number of Log Entries. Likely cause: The number of log entries may get too large. | Older versions of Kibana (version 3 and earlier) may hang if the number of log entries is too large (for example, above 40,000). Keep the page size small enough (about 20,000 results); a larger page size (for example, 200,000) may hang the browser. Kibana 4 should not have this issue. |
Name: Elasticsearch Field Data Evictions
Description: Elasticsearch Field
Data Evictions count is greater than 0: Likely cause: Field Data Evictions may be found even though it is nowhere near the limit set. |
The
|
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be a repeating error message filling up the log files. Finally, log rotate may not be configured properly, so old log files are not being deleted. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
Name: Process Check
Description: Separate alarms for each of these logging services, specified by the process name. Likely cause: Process has crashed. |
On the affected node, attempt to restart the process.
If the logstash process has crashed, use:
The rest of the processes can be restarted using similar commands, listed here:
|
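The Elasticsearch unassigned-shards check described above can be sketched in shell. The query URL is an assumption based on Elasticsearch's default port 9200; the parsing step is demonstrated against the sample row shown in the table rather than a live cluster:

```shell
# Sketch: list the Elasticsearch node names holding unassigned shards.
# On a live Cloud Lifecycle Manager the input would come from something
# like: curl -s http://localhost:9200/_cat/shards
# (endpoint and port 9200 are assumptions based on Elasticsearch defaults).
unassigned_nodes() {
    grep 'UNASSIGNED' | awk '{print $NF}' | sort -u
}
```

Feeding in the sample row from the table, `printf 'logstash-2015.10.21 4 p UNASSIGNED ... 10.240.75.10 NodeName\n' | unassigned_nodes` prints the Elasticsearch node name, which you then map to an actual host name as the mitigation describes.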
18.1.1.5.5 SERVICE: MONASCA-TRANSFORM in Telemetry section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Likely cause: Service process has crashed. |
Restart process on affected node. Review logs.
Child process of |
Name: Process Check Description: process_name = org.apache.spark.executor.CoarseGrainedExecutorBackend Likely cause: Service process has crashed. |
Restart process on affected node. Review logs.
Child process of |
Name: Process Check
Description: Likely cause: Service process has crashed. | Restart the service on affected node. Review logs. |
18.1.1.5.6 SERVICE: MONITORING in Telemetry section #
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status
Description: Persister Health Check
Likely cause: The process has crashed or a dependency is down. |
If the process has crashed, restart it using the steps below. If a dependent service is down, address that issue. Restart the process on the affected node using these steps:
Review the associated logs. |
Name: HTTP Status
Description: API Health Check
Likely cause: The process has crashed or a dependency is down. |
If the process has crashed, restart it using the steps below. If a dependent service is down, address that issue. Restart the process on the affected node using these steps:
Review the associated logs. |
Name: monasca Agent Collection Time
Description: Alarms when the elapsed time the agent takes to collect metrics is high. Likely cause: Heavy load on the box or a stuck agent plug-in. |
Address the load issue on the machine. If needed, restart the agent using the steps below: Restart the agent on the affected node using these steps:
Review the associated logs. |
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the associated logs. |
Name: Process Check Description: Alarms when the specified process is not running: process_name = monasca-notification Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the associated logs. |
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. |
Restart the agent on the affected node using these steps:
Review the associated logs. |
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the associated logs. |
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the associated logs. |
Name: Process Check Description: Alarms when the specified process is not running: process_name = backtype.storm.daemon.nimbus component = apache-storm Likely cause: Process crashed. |
Review the logs on the affected node. Note: the logs containing threshold engine logging are on the 2nd and 3rd controller nodes.
Restart monasca-thresh with these steps:
|
Name: Process Check Description: Alarms when the specified process is not running: process_name = backtype.storm.daemon.supervisor component = apache-storm Likely cause: Process crashed. |
Review the logs on the affected node. Note: the logs containing threshold engine logging are on the 2nd and 3rd controller nodes. Restart monasca-thresh with these steps:
|
Name: Process Check Description: Alarms when the specified process is not running: process_name = backtype.storm.daemon.worker component = apache-storm Likely cause: Process crashed. |
Review the logs on the affected node. Note: the logs containing threshold engine logging are on the 2nd and 3rd controller nodes.
Restart monasca-thresh with these steps:
|
Name: Process Check
Description: Alarms when the specified
process is not running: process_name = monasca-thresh component = apache-storm Likely cause: Process crashed. |
Restart the process on the affected node using these steps:
Review the associated logs. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: A service may be set to DEBUG instead of INFO level, an error message may be repeating in the logs, or log rotate may not be configured properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.6 Console Alarms #
These alarms show under the Console section of the SUSE OpenStack Cloud Operations Console.
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status
Description: Alarms when the Operations Console endpoint is down or not reachable. Likely cause: The Operations Console is unresponsive. |
Review the Operations Console logs on the affected node.
|
Name: Process Check
Description: Alarms when the specified process is not running. Likely cause: Process crashed or unresponsive. |
Review the Operations Console logs on the affected node.
|
18.1.1.7 System Alarms #
These alarms show under the System section and are set up per
hostname
and/or mount_point
.
18.1.1.7.1 SERVICE: SYSTEM #
Alarm Information | Mitigation Tasks |
---|---|
Name: CPU Usage Description: Alarms on high CPU usage. Likely cause: Heavy load or runaway processes. | Log onto the reporting host and diagnose the heavy CPU usage. |
Name: Elasticsearch Low Watermark
Description: Elasticsearch disk usage has crossed the low watermark. Likely cause: Running out of disk space for Elasticsearch. |
Free up space by removing indices (backing them up first if desired).
Alternatively, adjust curator_low_watermark_percent, curator_high_watermark_percent, and/or elasticsearch_max_total_indices_size_in_bytes if needed. For more information about how to back up your centralized logs, see Section 13.2.5, “Configuring Centralized Logging”. |
Name: Elasticsearch High Watermark
Description: Elasticsearch disk usage has crossed the high watermark. Likely cause: Running out of disk space for Elasticsearch. |
Verify that disk space was freed up by the curator. If needed, free up additional space by removing indices (backing them up first if desired). Alternatively, adjust curator_low_watermark_percent, curator_high_watermark_percent, and/or elasticsearch_max_total_indices_size_in_bytes if needed. For more information about how to back up your centralized logs, see Section 13.2.5, “Configuring Centralized Logging”. |
Name: Log Partition Low Watermark
Description: The log disk partition usage has crossed the low watermark. Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
Name: Log Partition High Watermark
Description: The log disk partition usage has crossed the high watermark. Likely cause: This could be due to a service set to DEBUG instead of INFO level. Another reason could be due to a repeating error message filling up the log files. Finally, it could be due to log rotate not configured properly so old log files are not being deleted properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
Name: Crash Dump Count
Description: Alarms if it receives any metrics indicating a crash dump is present. Likely cause: When a crash dump is generated by kdump, the crash dump file is put into the crash dump directory. |
Analyze the crash dump file(s). Move the file to a new location so that a developer can take a look at it. Make sure all of the processes are back up after the crash.
Name: Disk Inode Usage
Description: Nearly out of inodes for a partition. Likely cause: Many files on the disk. | Investigate cleanup of data or migration to other partitions. |
Name: Disk Usage
Description: High disk usage on a partition. Likely cause: Large files on the disk. |
Investigate cleanup of data or migration to other partitions. |
Name: Host Status
Description: Alerts when a host is
unreachable. Likely cause: Host or network is down. | If a single host, attempt to restart the system. If multiple hosts, investigate network issues. |
Name: Memory Usage Description: High memory usage. Likely cause: Overloaded system or services with memory leaks. | Log onto the reporting host to investigate high memory users. |
Name: Network Errors Description: Alarms on a high network error rate. Likely cause: Bad network or cabling. | Take this host out of service until the network can be fixed. |
Name: NTP Time Sync Description: Alarms when the NTP time offset is high. |
Log in to the reported host and check if the ntp service is running. If it is running, then use these steps:
|
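For the NTP Time Sync alarm above, the per-peer offset can be read from ntpq -p output (column 9, in milliseconds) on hosts running ntpd, or from chronyc tracking on chrony-based hosts. The sketch below parses ntpq-style rows, since running the live command needs a reachable NTP daemon; the 50 ms threshold in the usage example is an illustrative assumption, not a value from this guide:

```shell
# Sketch: flag NTP peers whose absolute offset exceeds a threshold (ms).
# Expects "ntpq -p"-style rows; column 9 is the offset. Rows whose ninth
# field is not numeric (e.g. the header) are skipped.
check_ntp_offset() {
    threshold_ms="$1"
    awk -v t="$threshold_ms" '$9 ~ /^-?[0-9.]+$/ {
        off = $9; if (off < 0) off = -off;
        if (off + 0 > t + 0) print "HIGH OFFSET:", $1, $9 "ms" }'
}
```

Usage example: `ntpq -p | check_ntp_offset 50` would list peers drifting beyond 50 ms, the situation in which restarting or resynchronizing the time service is warranted.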
18.1.1.8 Other Services Alarms #
These alarms show under the Other Services section of the SUSE OpenStack Cloud Operations Console.
18.1.1.8.1 SERVICE: APACHE #
Alarm Information | Mitigation Tasks |
---|---|
Name: Apache Status Description: Alarms on failure to reach the Apache status endpoint. | |
Name: Process Check
Description: Alarms when the specified process is not running. | If the Apache process goes down, connect to the affected node via SSH and restart it with this command: sudo systemctl restart apache2 |
Name: Apache Idle Worker Count Description: Alarms when there are no idle workers in the Apache server. | |
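The idle-worker count in the alarm above comes from Apache's mod_status module. A sketch that extracts it from the machine-readable server-status output; the status URL is an assumption (it requires mod_status to be enabled and is shown here against a sample payload rather than a live server):

```shell
# Sketch: extract the IdleWorkers value from Apache mod_status "?auto"
# output. Live usage (assumption, requires mod_status enabled):
#   curl -s http://localhost/server-status?auto | idle_workers
idle_workers() {
    awk -F': ' '/^IdleWorkers/ {print $2}'
}
```

A result of 0 matches the alarm condition: every worker is busy, so Apache cannot accept new requests promptly.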
18.1.1.8.2 SERVICE: BACKUP in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: A service may be set to DEBUG instead of INFO level, an error message may be repeating in the logs, or log rotate may not be configured properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.8.3 SERVICE: HAPROXY in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the specified process is not running. Likely cause: HAProxy is not running on this machine. |
Restart the process on the affected node:
Review the associated logs. |
18.1.1.8.4 SERVICE: ARDANA-UX-SERVICES in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable. |
18.1.1.8.5 SERVICE: KEY-MANAGER in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the specified
process is not running: process_name = barbican-api Likely cause: Process has crashed. |
Restart the process on the affected node using these steps:
|
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable. component = barbican-api api_endpoint = public or internal Likely cause: The endpoint is not responsive, it may be down. |
For the HTTP Status alarms for the public and internal endpoints, restart the process on the affected node using these steps:
Examine the logs in |
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable. component = barbican-api monitored_host_type = vip Likely cause: The barbican API on the admin virtual IP is down. | This alarm is verifying access to the barbican API via the virtual IP address (HAProxy). If this check is failing but the other two HTTP Status alarms for the key-manager service are not, then the issue is likely with HAProxy, so you should view the alarms for that service. If the other two HTTP Status alarms are alerting as well, then restart barbican using the steps listed. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: A service may be set to DEBUG instead of INFO level, an error message may be repeating in the logs, or log rotate may not be configured properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.8.6 SERVICE: MYSQL in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: MySQL Slow Query Rate Description: Alarms when the slow query rate is high. Likely cause: The system load is too high. | This could be an indication of near capacity limits or an exposed bad query. First, check overall system load and then investigate MySQL details. |
Name: Process Check Description: Alarms when the specified process is not running. Likely cause: MySQL crashed. | Restart MySQL on the affected node. |
18.1.1.8.7 SERVICE: OCTAVIA in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check Description: Alarms when the specified process is not running. There are individual alarms for each of these processes:
Likely cause: The process has crashed. |
Restart the process on the affected node using these steps:
|
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable.
Likely cause: The process serving the endpoint is down, or the endpoint is unreachable via the virtual IP address. |
If the process has crashed, restart it on the affected node.
If it is not the process, investigate the virtual IP address (HAProxy) and review the alarms for that service. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: A service may be set to DEBUG instead of INFO level, an error message may be repeating in the logs, or log rotate may not be configured properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.8.8 SERVICE: ORCHESTRATION in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check Description: Alarms when the specified process is not running. There are individual alarms for each of these processes:
heat-api process check on each node Likely cause: Process crashed. |
Restart the process with these steps:
Review the relevant log at the following locations on the affected node: /var/log/heat/heat-api.log /var/log/heat/heat-cfn.log /var/log/heat/heat-cloudwatch.log /var/log/heat/heat-engine.log |
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable.
|
Restart the heat service with these steps:
Review the relevant log at the following locations on the affected node: /var/log/heat/heat-api.log /var/log/heat/heat-cfn.log /var/log/heat/heat-cloudwatch.log |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: A service may be set to DEBUG instead of INFO level, an error message may be repeating in the logs, or log rotate may not be configured properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.8.9 SERVICE: OVSVAPP-SERVICEVM in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check Description:Alarms when the specified process is not running: process_name = ovs-vswitchd process_name = neutron-ovsvapp-agent process_name = ovsdb-server Likely cause: Process has crashed. | Restart process on affected node. Review logs. |
18.1.1.8.10 SERVICE: RABBITMQ in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check Description: Alarms when the specified process is not running: process_name = rabbitmq process_name = epmd Likely cause: Process has crashed. | Restart process on affected node. Review logs. |
18.1.1.8.11 SERVICE: SPARK in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check Description: Alarms when the specified process is not running process_name = org.apache.spark.deploy.master.Master process_name = org.apache.spark.deploy.worker.Worker Likely cause: Process has crashed. | Restart process on affected node. Review logs. |
18.1.1.8.12 SERVICE: WEB-UI in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: HTTP Status Description: Alarms when the specified HTTP endpoint is down or not reachable. Likely cause: Apache is not running or there is a misconfiguration. | Check that Apache is running; investigate horizon logs. |
Name: Service Log Directory Size Description: Service log directory consuming more disk than its quota. Likely cause: A service may be set to DEBUG instead of INFO level, an error message may be repeating in the logs, or log rotate may not be configured properly. | Find the service that is consuming too much disk space. Look at the logs. If DEBUG log entries exist, set the logging level to INFO. If the logs are repeatedly logging an error message, do what is needed to resolve the error. If old log files exist, configure log rotate to remove them. You could also choose to remove old log files by hand after backing them up if needed. |
18.1.1.8.13 SERVICE: ZOOKEEPER in Other Services section #
Alarm Information | Mitigation Tasks |
---|---|
Name: Process Check
Description: Alarms when the specified
process is not running: Likely cause: Process crashed. | Restart the process on the affected node. Review the associated logs. |
Name: ZooKeeper Latency Description: Alarms when the ZooKeeper latency is high. Likely cause: Heavy system load. | Check the individual system as well as activity across the entire service. |
18.1.1.9 ESX vCenter Plugin Alarms #
These alarms relate to your ESX cluster, if you are utilizing one.
Alarm Information | Mitigation Tasks |
---|---|
Name: ESX cluster CPU Usage Description: Alarms when the average CPU usage for a particular cluster exceeds 90% continuously for 3 polling cycles. The alarm has the following dimension: esx_cluster_id=<domain>.<vcenter-id> Likely cause: Virtual machines are consuming more than 90% of allocated vCPUs. |
|
Name: ESX cluster Disk Usage Description:
Likely cause:
|
|
Name: ESX cluster Memory Usage Description: Alarms when the average RAM usage for a particular cluster exceeds 90% continuously for 3 polling cycles. The alarm has the following dimension: esx_cluster_id=<domain>.<vcenter-id> Likely cause: Virtual machines are consuming more than 90% of their total allocated memory. |
|
18.1.2 Support Resources #
To solve issues in your cloud, consult the Knowledge Base or contact Sales Engineering.
18.1.2.1 Using the Knowledge Base #
Support information is available at the SUSE Support page https://www.suse.com/products/suse-openstack-cloud/. This page offers access to the Knowledge Base, forums and documentation.
18.1.2.2 Contacting SUSE Support #
The central location for information about accessing and using SUSE Technical Support is available at https://www.suse.com/support/handbook/. This page has guidelines and links to many online support services, such as support account management, incident reporting, issue reporting, feature requests, training, and consulting.
18.2 Control Plane Troubleshooting #
Troubleshooting procedures for control plane services.
18.2.1 Understanding and Recovering RabbitMQ after Failure #
RabbitMQ is the message queue service that runs on each of your controller nodes and brokers communication between multiple services in your SUSE OpenStack Cloud 9 cloud environment. It is important for cloud operators to understand how different troubleshooting scenarios affect RabbitMQ so they can minimize downtime in their environments. This section discusses multiple scenarios, how each affects RabbitMQ, and how to recover when there are issues.
18.2.1.1 How upgrades affect RabbitMQ #
There are two types of upgrades within SUSE OpenStack Cloud -- major and minor. The effect that the upgrade process has on RabbitMQ depends on these types.
A major upgrade is defined by an erlang change or a major version upgrade of RabbitMQ. A minor upgrade is one where RabbitMQ stays within the same version, such as v3.4.3 to v3.4.6.
During both types of upgrades there may be minor blips in the authentication process of client services as the accounts are recreated.
RabbitMQ during a major upgrade
There will be a RabbitMQ service outage while the upgrade is performed.
During the upgrade, high availability consistency is compromised -- all but the primary node will go down and be reset, meaning their database copies are deleted. The primary node is not taken down until the last step, and then it is upgraded. The database of users and permissions is maintained during this process. The other nodes are then brought back into the cluster and resynchronized.
RabbitMQ during a minor upgrade
Minor upgrades are performed node by node. This "rolling" process means there should be no overall service outage because each node is taken out of its cluster in turn, its database is reset, and then it is added back to the cluster and resynchronized.
18.2.1.2 How RabbitMQ is affected by other operational processes #
There are operational tasks, such as Section 15.1.1.1, “Bringing Down Your Cloud: Services Down Method”, where
you use the ardana-stop.yml
and
ardana-start.yml
playbooks to gracefully restart your cloud.
If you use these playbooks, and there are no errors associated with them
forcing you to troubleshoot further, then RabbitMQ is brought down
gracefully and brought back up. There is nothing special to note regarding
RabbitMQ in these normal operational processes.
However, there are other scenarios where an understanding of RabbitMQ is important when a graceful shutdown did not occur.
These examples that follow assume you are using one of the entry-scale
models where RabbitMQ is hosted on your controller node cluster. If you are
using a mid-scale model or have a dedicated cluster that RabbitMQ lives on
you may need to alter the steps accordingly. To determine which nodes
RabbitMQ is on, you can use the rabbitmq-status.yml
playbook
from your Cloud Lifecycle Manager:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
Your entire control plane cluster goes down
If you have a scenario where all of your controller nodes went down, either manually or via another process such as a power outage, then an understanding of how RabbitMQ should be brought back up is important. Follow these steps to recover RabbitMQ on your controller node cluster in these cases:
The order in which the nodes went down is key here. Locate the last node to go down, as this will be used as the primary node when bringing the RabbitMQ cluster back up. You can review the timestamps in the /var/log/rabbitmq log file to determine which node went down last.
Note: The primary status of a node is transient; it only applies for the duration that this process is running. There is no long-term distinction between any of the nodes in your cluster. The primary node is simply the one that owns the RabbitMQ configuration database that will be synchronized across the cluster.
Run the ardana-start.yml playbook, specifying the primary node (that is, the last node down, as determined in the first step):
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<hostname>
Note: The <hostname> value is the "shortname" for your node, as found in the /etc/hosts file.
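To compare the final log timestamps and pick the candidate primary, you can sort by timestamp and take the newest entry. The sketch below shows only the comparison logic; the hostnames and timestamps are illustrative stand-ins for values you would pull from /var/log/rabbitmq on each controller, not output from a real cloud.

```shell
# Pick the node whose last log entry has the newest timestamp; with
# ISO-8601 timestamps a plain lexical sort is sufficient. The host and
# timestamp pairs below are made-up sample data.
last_down=$(sort -k2 <<'EOF' | tail -n 1 | cut -d' ' -f1
ardana-cp1-c1-m1-mgmt 2019-01-10T09:12:01
ardana-cp1-c1-m2-mgmt 2019-01-10T09:14:33
ardana-cp1-c1-m3-mgmt 2019-01-10T09:13:05
EOF
)
echo "candidate primary: $last_down"
```

With the sample data, the node with the 09:14:33 entry is selected as the candidate primary.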
If one of your controller nodes goes down
The first step is to determine whether the controller that went down is the
primary RabbitMQ host or not. The primary host is going to be the first host
member in the FND-RMQ
group in the file below on your
Cloud Lifecycle Manager:
ardana >
~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts
In the example below, ardana-cp1-c1-m1-mgmt
would be the
primary:
[FND-RMQ-ccp-cluster1:children]
ardana-cp1-c1-m1-mgmt
ardana-cp1-c1-m2-mgmt
ardana-cp1-c1-m3-mgmt
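As a sketch, the first member of the group (and therefore the primary) can be extracted from such a fragment with awk. The here-document below reuses the sample group from above rather than reading a real verb_hosts file:

```shell
# Print the first host listed under the FND-RMQ group header, stopping at
# the next [section] header. The here-document stands in for the
# hosts/verb_hosts inventory file.
primary=$(awk '/^\[FND-RMQ/ {grp=1; next}
               grp && /^\[/ {exit}
               grp && NF    {print; exit}' <<'EOF'
[FND-RMQ-ccp-cluster1:children]
ardana-cp1-c1-m1-mgmt
ardana-cp1-c1-m2-mgmt
ardana-cp1-c1-m3-mgmt
EOF
)
echo "primary: $primary"
```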
If your primary RabbitMQ controller node has gone down and you need to bring
it back up, you can follow these steps. In this playbook you are using the
rabbit_primary_hostname
parameter to specify the hostname
for one of the other controller nodes in your environment hosting RabbitMQ,
which will serve as the primary node in the recovery. You will also use
the --limit
parameter to specify the controller node you
are attempting to bring back up.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_you_are_bringing_up>
If the node you need to bring back is not
the primary RabbitMQ node then you can just run the
ardana-start.yml
playbook with the
--limit
parameter and your node should recover:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <hostname_of_node_you_are_bringing_up>
If you are replacing one or more of your controller nodes
The same general process noted above is used if you are removing or replacing one or more of your controller nodes.
If your node needs minor hardware repairs, but does not need to be replaced
with a new node, you should use the ardana-stop.yml
playbook
with the --limit
parameter to stop services on that node
prior to removing it from the cluster.
Log in to the Cloud Lifecycle Manager.
Run the rabbitmq-stop.yml playbook, specifying the hostname of the node you are removing, which will remove the node from the RabbitMQ cluster:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-stop.yml --limit <hostname_of_node_you_are_removing>
Run the ardana-stop.yml playbook, again specifying the hostname of the node you are removing, which will stop the rest of the services and prepare the node to be removed:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <hostname_of_node_you_are_removing>
If your node cannot be repaired and needs to be replaced with another baremetal node, any references to the replaced node must be removed from the RabbitMQ cluster. This is because RabbitMQ associates a cookie with each node in the cluster which is derived, in part, from the specific hardware. It is therefore possible to replace a hard drive in a node; however, changing a motherboard or replacing the node with another node entirely may cause RabbitMQ to stop working. When this happens, the running RabbitMQ cluster must be edited from a running RabbitMQ node. The following steps show how to do this.
In this example, controller 3 is the node being replaced with the following steps:
ardana > cd ~/scratch/ansible/next/ardana/ansible
SSH to a running RabbitMQ cluster node:
ardana > ssh cloud-cp1-rmq-mysql-m1-mgmt
Force the cluster to forget the node you are removing (in this example, the controller 3 node):
ardana > sudo rabbitmqctl forget_cluster_node rabbit@cloud-cp1-rmq-mysql-m3-mgmt
Confirm that the node has been removed:
ardana > sudo rabbitmqctl cluster_status
On the replacement node, information and services related to RabbitMQ must be removed:
ardana > sudo systemctl stop rabbitmq-server
ardana > sudo systemctl stop epmd.socket
Verify that the epmd service has stopped (kill it if it is still running):
ardana > ps -eaf | grep epmd
Remove the Mnesia database directory:
ardana > sudo rm -rf /var/lib/rabbitmq/mnesia
Restart the RabbitMQ server:
ardana > sudo systemctl start rabbitmq-server
On the Cloud Lifecycle Manager, run the ardana-start.yml playbook.
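After the forget_cluster_node step, you can confirm the node is gone by checking that it no longer appears in the cluster status. A sketch of that check, run here against a sample running_nodes line imitating rabbitmqctl cluster_status output rather than a live cluster:

```shell
# The status text below imitates the running_nodes line of
# "rabbitmqctl cluster_status" after controller 3 was forgotten.
status='{running_nodes,[rabbit@cloud-cp1-rmq-mysql-m1-mgmt,rabbit@cloud-cp1-rmq-mysql-m2-mgmt]}'
if printf '%s' "$status" | grep -q 'cloud-cp1-rmq-mysql-m3-mgmt'; then
  result="still clustered"
else
  result="removed"
fi
echo "controller 3: $result"
```

In a live recovery, you would pipe the real `sudo rabbitmqctl cluster_status` output through the same grep instead of using the sample string.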
If the node you are removing or replacing is your primary host, then when adding it back to your cluster, ensure that you specify a new primary host, as follows:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_you_are_adding>
If the node you are removing/replacing is not your primary host then you can add it as follows:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <hostname_of_node_you_are_adding>
If one of your controller nodes has rebooted or temporarily lost power
After a single reboot, RabbitMQ will not automatically restart. This is by design to protect your RabbitMQ cluster. To restart RabbitMQ, you should follow the process below.
If the rebooted node was your primary RabbitMQ host, you will specify a different primary hostname using one of the other nodes in your cluster:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_that_rebooted>
If the rebooted node was not the primary RabbitMQ host then you can just start it back up with this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <hostname_of_node_that_rebooted>
18.2.1.3 Recovering RabbitMQ #
In this section we show you how to check the status of RabbitMQ and how to perform a variety of disaster recovery procedures.
Verifying the status of RabbitMQ
You can verify the status of RabbitMQ on each of your controller nodes by using the following steps:
Log in to the Cloud Lifecycle Manager.
Run the rabbitmq-status.yml playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
If all is well, you should see output similar to the following:
PLAY RECAP ********************************************************************
rabbitmq | status | Check RabbitMQ running hosts in cluster ------------- 2.12s
rabbitmq | status | Check RabbitMQ service running ---------------------- 1.69s
rabbitmq | status | Report status of RabbitMQ --------------------------- 0.32s
-------------------------------------------------------------------------------
Total: ------------------------------------------------------------------ 4.36s
ardana-cp1-c1-m1-mgmt      : ok=2    changed=0    unreachable=0    failed=0
ardana-cp1-c1-m2-mgmt      : ok=2    changed=0    unreachable=0    failed=0
ardana-cp1-c1-m3-mgmt      : ok=2    changed=0    unreachable=0    failed=0
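When scripting health checks around this playbook, the recap lines can be scanned for any host reporting unreachable or failed tasks. A sketch of that scan, run against inline sample text (with one deliberately unhealthy host) rather than live playbook output:

```shell
# Count hosts whose recap line shows unreachable or failed tasks. The
# here-document is sample recap text; the m3 host is made unhealthy on
# purpose to exercise the check.
bad=$(awk '/unreachable=/ {
             if ($0 !~ /unreachable=0/ || $0 !~ /failed=0/) n++
           } END { print n + 0 }' <<'EOF'
ardana-cp1-c1-m1-mgmt : ok=2 changed=0 unreachable=0 failed=0
ardana-cp1-c1-m2-mgmt : ok=2 changed=0 unreachable=0 failed=0
ardana-cp1-c1-m3-mgmt : ok=2 changed=0 unreachable=1 failed=0
EOF
)
echo "hosts with problems: $bad"
```

A result of 0 means every controller reported a clean recap.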
If one or more of your controller nodes are having RabbitMQ issues then continue reading, looking for the scenario that best matches yours.
RabbitMQ recovery after a small network outage
In the case of a transient network outage, the version of RabbitMQ included
with SUSE OpenStack Cloud 9 is likely to recover automatically without any further
action needed. However, if yours does not and the
rabbitmq-status.yml
playbook is reporting an issue then
use the scenarios below to resolve your issues.
All of your controller nodes have gone down and other methods have not brought RabbitMQ back up
If your RabbitMQ cluster is irrecoverable and you need rapid service recovery, because other methods either cannot resolve the issue or you do not have time to investigate more nuanced approaches, we provide a disaster recovery playbook. This playbook tears down and resets all RabbitMQ services. This has an extreme effect on your services, but the process ensures that the RabbitMQ cluster is recreated.
Log in to your Cloud Lifecycle Manager.
Run the RabbitMQ disaster recovery playbook. This generally takes around two minutes.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml
Run the reconfigure playbooks for both cinder (Block Storage) and heat (Orchestration), if those services are present in your cloud. These services are affected when the fan-out queues are not recovered correctly. The reconfigure generally takes around five minutes.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts kronos-server-configure.yml
If you need to do a safe recovery of all the services in your environment, you can use this playbook. This is a more lengthy process, as all services are inspected.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
One of your controller nodes has gone down and other methods have not brought RabbitMQ back up
This disaster recovery procedure has the same caveats as the preceding one, but the steps differ.
If your primary RabbitMQ controller node has gone down and you need to perform a disaster recovery, use this playbook from your Cloud Lifecycle Manager:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml -e rabbit_primary_hostname=<new_primary_hostname> --limit <hostname_of_node_that_needs_recovered>
If the controller node is not your primary, you can use this playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml --limit <hostname_of_node_that_needs_recovered>
No reconfigure playbooks are needed because all of the fan-out exchanges are maintained by the running members of your RabbitMQ cluster.
18.3 Troubleshooting Compute service #
Troubleshooting scenarios with resolutions for the nova service.
nova offers scalable, on-demand, self-service access to compute resources. You can use this guide to help with known issues and troubleshooting of nova services.
18.3.1 How can I reset the state of a compute instance? #
If you have an instance that is stuck in a non-Active state, such as
Deleting
or Rebooting
and you want to
reset the state so you can interact with the instance again, there is a way
to do this.
The OpenStackClient command openstack server set --state
allows you to reset the state of a server.
The syntax is:
$ openstack server set --state <state> <server>
Here, <state> is either active or error, and <server> is the name or ID of the server.
If you had an instance that was stuck in a Rebooting
state, you would use this command to reset it back to
Active
:
$ openstack server set --state active <instance_id>
18.3.2 Enabling the migrate or resize functions in nova post-installation when using encryption #
If you used encryption for your data when running the configuration processor during your cloud deployment, and you are enabling the nova resize and migrate functionality after the initial installation, an issue arises if you made additional configuration changes that required running the configuration processor before enabling these features.
You will only experience an issue if you have enabled encryption. If you
haven't enabled encryption, then there is no need to follow the procedure
below. If you are using encryption and you have made a configuration change
and run the configuration processor after your initial install or upgrade,
and you have run the ready-deployment.yml
playbook, and
you want to enable migrate or resize in nova, then the following steps will
allow you to proceed. Note that the ansible vault key referred to below is
the encryption key that you have provided to the configuration processor.
Log in to the Cloud Lifecycle Manager.
Check out the ansible branch of your local git repository:
cd ~/openstack
git checkout ansible
Do a git log, and pick the previous commit:
git log
In the example below, the commit is
ac54d619b4fd84b497c7797ec61d989b64b9edb3:
$ git log
commit 69f95002f9bad0b17f48687e4d97b2a791476c6a
Merge: 439a85e ac54d61
Author: git user <user@company.com>
Date:   Fri May 6 09:08:55 2016 +0000

    Merging promotion of saved output

commit 439a85e209aeeca3ab54d1a9184efb01604dbbbb
Author: git user <user@company.com>
Date:   Fri May 6 09:08:24 2016 +0000

    Saved output from CP run on 1d3976dac4fd7e2e78afad8d23f7b64f9d138778

commit ac54d619b4fd84b497c7797ec61d989b64b9edb3
Merge: a794083 66ffe07
Author: git user <user@company.com>
Date:   Fri May 6 08:32:04 2016 +0000

    Merging promotion of saved output
Check out the commit:
git checkout <commit_ID>
Using the same example above, here is the command:
$ git checkout ac54d619b4fd84b497c7797ec61d989b64b9edb3
Note: checking out 'ac54d619b4fd84b497c7797ec61d989b64b9edb3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at ac54d61... Merging promotion of saved output
Change to the ansible output directory:
cd ~/openstack/my_cloud/stage/ansible/group_vars/
Identify the group_vars file for your compute cluster. Its name will be of the form below, with your compute cluster name being the indicator:
<cloud name>-<control plane name>-<compute cluster name>
View this group_vars file from the ansible vault with this command, which will prompt you for your vault password:
ansible-vault view <group_vars_file>
Search the contents of this file for the nova_ssh_key section, which contains both the private and public SSH keys. Save these into a temporary file so you can use them in a later step.
Here is an example snippet, with the key material being the part that you need to save:
NOV_KVM: vars: nova_ssh_key: private: '-----BEGIN RSA PRIVATE KEY----- MIIEpAIBAAKCAQEAv/hhekzykD2K8HnVNBKZcJWYrVlUyb6gR8cvE6hbh2ISzooA jQc3xgglIwpt5TuwpTY3LL0C4PEHObxy9WwqXTHBZp8jg/02RzD02bEcZ1WT49x7 Rj8f5+S1zutHlDv7PwEIMZPAHA8lihfGFG5o+QHUmsUHgjShkWPdHXw1+6mCO9V/ eJVZb3nDbiunMOBvyyk364w+fSzes4UDkmCq8joDa5KkpTgQK6xfw5auEosyrh8D zocN/JSdr6xStlT6yY8naWziXr7p/QhG44RPD9SSD7dhkyJh+bdCfoFVGdjmF8yA h5DlcLu9QhbJ/scb7yMP84W4L5GwvuWCCFJTHQIDAQABAoIBAQCCH5O7ecMFoKG4 JW0uMdlOJijqf93oLk2oucwgUANSvlivJX4AGj9k/YpmuSAKvS4cnqZBrhDwdpCG Q0XNM7d3mk1VCVPimNWc5gNiOBpftPNdBcuNryYqYq4WBwdq5EmGyGVMbbFPk7jH ZRwAJ2MCPoplKl7PlGtcCMwNu29AGNaxCQEZFmztXcEFdMrfpTh3kuBI536pBlEi Srh23mRILn0nvLXMAHwo94S6bI3JOQSK1DBCwtA52r5YgX0nkZbi2MvHISY1TXBw SiWgzqW8dakzVu9UNif9nTDyaJDpU0kr0/LWtBQNdcpXnDSkHGjjnIm2pJVBC+QJ SM9o8h1lAoGBANjGHtG762+dNPEUUkSNWVwd7tvzW9CZY35iMR0Rlux4PO+OXwNq agldHeUpgG1MPl1ya+rkf0GD62Uf4LHTDgaEkUfiXkYtcJwHbjOnj3EjZLXaYMX2 LYBE0bMKUkQCBdYtCvZmo6+dfC2DBEWPEhvWi7zf7o0CJ9260aS4UHJzAoGBAOK1 P//K7HBWXvKpY1yV2KSCEBEoiM9NA9+RYcLkNtIy/4rIk9ShLdCJQVWWgDfDTfso sJKc5S0OtOsRcomvv3OIQD1PvZVfZJLKpgKkt20/w7RwfJkYC/jSjQpzgDpZdKRU vRY8P5iryptleyImeqV+Vhf+1kcH8t5VQMUU2XAvAoGATpfeOqqIXMpBlJqKjUI2 QNi1bleYVVQXp43QQrrK3mdlqHEU77cYRNbW7OwUHQyEm/rNN7eqj8VVhi99lttv fVt5FPf0uDrnVhq3kNDSh/GOJQTNC1kK/DN3WBOI6hFVrmZcUCO8ewJ9MD8NQG7z 4NXzigIiiktayuBd+/u7ZxMCgYEAm6X7KaBlkn8KMypuyIsssU2GwHEG9OSYay9C Ym8S4GAZKGyrakm6zbjefWeV4jMZ3/1AtXg4tCWrutRAwh1CoYyDJlUQAXT79Phi 39+8+6nSsJimQunKlmvgX7OK7wSp24U+SPzWYPhZYzVaQ8kNXYAOlezlquDfMxxv GqBE5QsCgYA8K2p/z2kGXCNjdMrEM02reeE2J1Ft8DS/iiXjg35PX7WVIZ31KCBk wgYTWq0Fwo2W/EoJVl2o74qQTHK0Bs+FTnR2nkVF3htEOAW2YXQTTN2rEsHmlQqE A9iGTNwm9hvzbvrWeXtx8Zk/6aYfsXCoxq193KglS40shOCaXzWX0w== -----END RSA PRIVATE KEY-----' public: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/+GF6TPKQPYrwedU0Epl wlZitWVTJvqBHxy8TqFuHYhLOigCNBzfGCCUjCm3lO7ClNjcsvQLg8Qc5vHL1bCpdMc FmnyOD/TZHMPTZsRxnVZPj3HtGPx/n5LXO60eUO/s/AQgxk8AcDyWKF8YUbmj5Ad SaxQeCNKGRY90dfDX7qYI71X94lVlvecNuK6cw4G/LKTfrjD59LN6zhQOSYKryOgNrkq 
SlOBArrF/Dlq4SizKuHwPOhw38lJ2vrFK2VPrJjydpbOJevun9CEbjhE8P1JIPt2GTImH5t0 J+gVUZ2OYXzICHkOVwu71CFsn+xxvvIw/zhbgvkbC+5YIIUlMd Generated Key for nova User NTP_CLI:
Switch back to the site branch by checking it out:
cd ~/openstack
git checkout site
Navigate to your group_vars directory in this branch:
cd ~/scratch/ansible/next/ardana/ansible/group_vars
Edit your compute group_vars file, which will prompt you for your vault password:
ansible-vault edit <group_vars_file>
Vault password:
Decryption successful
Search the contents of this file for the nova_ssh_key section and replace the private and public keys with the contents that you saved in a temporary file earlier.
Remove the temporary file that you created earlier. You are now ready to run the deployment. For information about enabling nova resizing and migration, see Section 6.4, “Enabling the Nova Resize and Migrate Features”.
18.3.3 Compute (ESX) #
Unable to Create Instance Snapshot when Instance is Active
There is a known issue with VMware vCenter where, if you have a compute
instance in Active
state you will receive the error below
when attempting to take a snapshot of it:
An error occurred while saving the snapshot: Failed to quiesce the virtual machine
The workaround for this issue is to stop the instance. Here are steps to achieve this using the command line tool:
Stop the instance using the OpenStackClient:
openstack server stop <instance UUID>
Take the snapshot of the instance.
Start the instance back up:
openstack server start <instance UUID>
18.3.4 How to archive deleted instances from the database #
The nova-reconfigure.yml playbook can take a long time to run if the database has a large number of deleted instances.
To find the number of rows being used by deleted instances:
sudo mysql nova -e "select count(*) from instances where vm_state='deleted';"
To archive a batch of 1000 deleted instances to shadow tables:
sudo /opt/stack/service/nova-api/venv/bin/nova-manage \
  --config-dir /opt/stack/service/nova-api/etc/nova/ \
  db archive_deleted_rows --verbose --max_rows 1000
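Because each run archives at most --max_rows rows, draining a large backlog means repeating the command until a pass moves nothing. The batching logic can be sketched as follows; a simple counter stands in for the nova-manage call above, and the row count is made up for illustration:

```shell
# Simulate draining 2500 deleted rows in batches of up to 1000. In a real
# run, the loop body would invoke the nova-manage command shown above and
# inspect its reported row count instead of this arithmetic.
remaining=2500
passes=0
while [ "$remaining" -gt 0 ]; do
  moved=1000
  if [ "$remaining" -lt 1000 ]; then
    moved=$remaining
  fi
  remaining=$(( remaining - moved ))
  passes=$(( passes + 1 ))
done
echo "archive passes needed: $passes"
```

With 2500 simulated rows and a batch size of 1000, three passes are needed before the table is drained.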
18.4 Network Service Troubleshooting #
Troubleshooting scenarios with resolutions for the Networking service.
18.4.1 CVR HA - Split-brain result of failover of L3 agent when master comes back up #
This situation is specific to when L3 HA is configured and a network failure occurs to the node hosting the currently active l3 agent. L3 HA is intended to provide HA in situations where the l3-agent crashes or the node hosting an l3-agent crashes/restarts. In the case of a physical networking issue which isolates the active l3 agent, the stand-by l3-agent takes over but when the physical networking issue is resolved, traffic to the VMs is disrupted due to a "split-brain" situation in which traffic is split over the two L3 agents. The solution is to restart the L3-agent that was originally the master.
18.4.2 OVSvApp Loses Connectivity with vCenter #
If the OVSvApp loses connectivity with the vCenter cluster, you will see the following symptoms:
The OVSvApp VM goes into the ERROR state
The OVSvApp VM does not get an IP address
When you see these symptoms, restart the OVSvApp agent on the OVSvApp VM by executing the following command:
sudo service neutron-ovsvapp-agent restart
18.4.3 Fail over a plain CVR router because the node became unavailable #
Get a list of l3 agent UUIDs, which can be used in the commands that follow:
openstack network agent list | grep l3
Determine the current host:
openstack network agent list --router <router uuid>
Remove the router from the current host:
openstack network agent remove router --agent-type l3 <current l3 agent uuid> <router uuid>
Add the router to a new host:
openstack network agent add router --agent-type l3 <new l3 agent uuid> <router uuid>
18.4.4 Trouble setting maximum transmission units (MTU) #
See Section 10.4.11, “Configuring Maximum Transmission Units in neutron” for more information.
18.4.5 Floating IP on allowed_address_pair port with DVR-routed networks #
You may notice this issue: If you have an
allowed_address_pair
associated with multiple virtual
machine (VM) ports, and if all the VM ports are ACTIVE, then the
allowed_address_pair
port binding will have the last
ACTIVE VM's binding host as its bound host.
In addition, you may notice that if the
floating IP is assigned to the allowed_address_pair
that
is bound to multiple VMs that are ACTIVE, then the floating IP will not work
with DVR routers. This is different from the centralized router behavior
where it can handle unbound allowed_address_pair
ports
that are associated with floating IPs.
Currently we support allowed_address_pair
ports with DVR
only if they have floating IPs enabled, and have just one ACTIVE port.
Using the CLI, you can follow these steps:
Create a network to add the host to:
$ openstack network create vrrp-net
Attach a subnet to that network with a specified allocation-pool range:
$ openstack subnet create --network vrrp-net --subnet-range 10.0.0.0/24 --allocation-pool start=10.0.0.2,end=10.0.0.200 vrrp-subnet
Create a router, uplink the vrrp-subnet to it, and attach the router to an upstream network called public:
$ openstack router create router1
$ openstack router add subnet router1 vrrp-subnet
$ openstack router set --external-gateway public router1
Create a security group called vrrp-sec-group and add ingress rules to allow ICMP and TCP port 80 and 22:
$ openstack security group create vrrp-sec-group
$ openstack security group rule create --protocol icmp vrrp-sec-group
$ openstack security group rule create --protocol tcp --dst-port 80 vrrp-sec-group
$ openstack security group rule create --protocol tcp --dst-port 22 vrrp-sec-group
Next, boot two instances:
$ openstack server create --min 2 --max 2 --image ubuntu-12.04 --flavor 1 --nic net-id=24e92ee1-8ae4-4c23-90af-accb3919f4d1 --security-group vrrp-sec-group vrrp-node
When you create two instances, make sure that both instances are not in the ACTIVE state before you associate the
allowed_address_pair
. The instances:
$ openstack server list
+--------------------------------------+------------------------------------------------+--------+------------+-------------+-------------------+
| ID                                   | Name                                           | Status | Task State | Power State | Networks          |
+--------------------------------------+------------------------------------------------+--------+------------+-------------+-------------------+
| 15b70af7-2628-4906-a877-39753082f84f | vrrp-node-15b70af7-2628-4906-a877-39753082f84f | ACTIVE | -          | Running     | vrrp-net=10.0.0.3 |
| e683e9d1-7eea-48dd-9d3a-a54cf9d9b7d6 | vrrp-node-e683e9d1-7eea-48dd-9d3a-a54cf9d9b7d6 | DOWN   | -          | Running     | vrrp-net=10.0.0.4 |
+--------------------------------------+------------------------------------------------+--------+------------+-------------+-------------------+
Create a port in the VRRP IP range that was left out of the ip-allocation range:
$ openstack port create --network vrrp-net --fixed-ip ip-address=10.0.0.201 --security-group vrrp-sec-group vrrp-port
Created a new port:
+-----------------------+-----------------------------------------------------------------------------------+
| Field                 | Value                                                                             |
+-----------------------+-----------------------------------------------------------------------------------+
| admin_state_up        | True                                                                              |
| allowed_address_pairs |                                                                                   |
| device_id             |                                                                                   |
| device_owner          |                                                                                   |
| fixed_ips             | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.201"} |
| id                    | 6239f501-e902-4b02-8d5c-69062896a2dd                                              |
| mac_address           | fa:16:3e:20:67:9f                                                                 |
| name                  | vrrp-port                                                                         |
| network_id            | 24e92ee1-8ae4-4c23-90af-accb3919f4d1                                              |
| port_security_enabled | True                                                                              |
| security_groups       | 36c8131f-d504-4bcc-b708-f330c9f6b67a                                              |
| status                | DOWN                                                                              |
| tenant_id             | d4e4332d5f8c4a8eab9fcb1345406cb0                                                  |
+-----------------------+-----------------------------------------------------------------------------------+
Another thing to cross-check after you associate the allowed_address_pair port with the VM port is whether the
allowed_address_pair
port has inherited the VM's host binding:
$ neutron --os-username admin --os-password ZIy9xitH55 --os-tenant-name admin port-show f5a252b2-701f-40e9-a314-59ef9b5ed7de
+-----------------------+--------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                  |
+-----------------------+--------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                   |
| allowed_address_pairs |                                                                                                        |
| binding:host_id       | ...-cp1-comp0001-mgmt                                                                                  |
| binding:profile       | {}                                                                                                     |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                         |
| binding:vif_type      | ovs                                                                                                    |
| binding:vnic_type     | normal                                                                                                 |
| device_id             |                                                                                                        |
| device_owner          | compute:None                                                                                           |
| dns_assignment        | {"hostname": "host-10-0-0-201", "ip_address": "10.0.0.201", "fqdn": "host-10-0-0-201.openstacklocal."} |
| dns_name              |                                                                                                        |
| extra_dhcp_opts       |                                                                                                        |
| fixed_ips             | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.201"}                      |
| id                    | 6239f501-e902-4b02-8d5c-69062896a2dd                                                                   |
| mac_address           | fa:16:3e:20:67:9f                                                                                      |
| name                  |                                                                                                        |
| network_id            | 24e92ee1-8ae4-4c23-90af-accb3919f4d1                                                                   |
| port_security_enabled | True                                                                                                   |
| security_groups       | 36c8131f-d504-4bcc-b708-f330c9f6b67a                                                                   |
| status                | DOWN                                                                                                   |
| tenant_id             | d4e4332d5f8c4a8eab9fcb1345406cb0                                                                       |
+-----------------------+--------------------------------------------------------------------------------------------------------+
Note that you were allocated a port with the IP address 10.0.0.201 as requested. Next, associate a floating IP to this port to be able to access it publicly:
$ openstack floating ip create --port 6239f501-e902-4b02-8d5c-69062896a2dd public
Created a new floatingip:
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| fixed_ip_address    | 10.0.0.201                           |
| floating_ip_address | 10.36.12.139                         |
| floating_network_id | 3696c581-9474-4c57-aaa0-b6c70f2529b0 |
| id                  | a26931de-bc94-4fd8-a8b9-c5d4031667e9 |
| port_id             | 6239f501-e902-4b02-8d5c-69062896a2dd |
| router_id           | 178fde65-e9e7-4d84-a218-b1cc7c7b09c7 |
| tenant_id           | d4e4332d5f8c4a8eab9fcb1345406cb0     |
+---------------------+--------------------------------------+
Now update the ports attached to your VRRP instances to include this IP address as an allowed-address-pair so they will be able to send traffic out using this address. First find the ports attached to these instances:
$ openstack port list --network 24e92ee1-8ae4-4c23-90af-accb3919f4d1
+--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
| id                                   | name | mac_address       | fixed_ips                                                                         |
+--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
| 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d |      | fa:16:3e:7a:7b:18 | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.4"}   |
| 14f57a85-35af-4edb-8bec-6f81beb9db88 |      | fa:16:3e:2f:7e:ee | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.2"}   |
| 6239f501-e902-4b02-8d5c-69062896a2dd |      | fa:16:3e:20:67:9f | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.201"} |
| 87094048-3832-472e-a100-7f9b45829da5 |      | fa:16:3e:b3:38:30 | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.1"}   |
| c080dbeb-491e-46e2-ab7e-192e7627d050 |      | fa:16:3e:88:2e:e2 | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.3"}   |
+--------------------------------------+------+-------------------+-----------------------------------------------------------------------------------+
Add this address to the ports c080dbeb-491e-46e2-ab7e-192e7627d050 and 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d which are 10.0.0.3 and 10.0.0.4 (your vrrp-node instances):
$ openstack port set --allowed-address ip-address=10.0.0.201 c080dbeb-491e-46e2-ab7e-192e7627d050
$ openstack port set --allowed-address ip-address=10.0.0.201 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d
The allowed-address-pair 10.0.0.201 now shows up on the port:
$ openstack port show 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d
+-----------------------+---------------------------------------------------------------------------------+
| Field                 | Value                                                                           |
+-----------------------+---------------------------------------------------------------------------------+
| admin_state_up        | True                                                                            |
| allowed_address_pairs | {"ip_address": "10.0.0.201", "mac_address": "fa:16:3e:7a:7b:18"}                |
| device_id             | e683e9d1-7eea-48dd-9d3a-a54cf9d9b7d6                                            |
| device_owner          | compute:None                                                                    |
| fixed_ips             | {"subnet_id": "94a0c371-d37c-4796-821e-57c2a8ec65ae", "ip_address": "10.0.0.4"} |
| id                    | 12bf9ea4-4845-4e2c-b511-3b8b1ad7291d                                            |
| mac_address           | fa:16:3e:7a:7b:18                                                               |
| name                  |                                                                                 |
| network_id            | 24e92ee1-8ae4-4c23-90af-accb3919f4d1                                            |
| port_security_enabled | True                                                                            |
| security_groups       | 36c8131f-d504-4bcc-b708-f330c9f6b67a                                            |
| status                | ACTIVE                                                                          |
| tenant_id             | d4e4332d5f8c4a8eab9fcb1345406cb0                                                |
+-----------------------+---------------------------------------------------------------------------------+
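For completeness, the two vrrp-node instances would typically run a VRRP daemon such as keepalived to manage the shared 10.0.0.201 address that the allowed-address-pair permits. A minimal, hypothetical configuration sketch (the interface name, virtual_router_id, and priority are assumptions; the peer instance would use state BACKUP and a lower priority):

```
vrrp_instance VI_1 {
    state MASTER
    interface eth0          # guest-side NIC name is an assumption
    virtual_router_id 51    # must match on both peers
    priority 100            # the BACKUP peer uses a lower value
    advert_int 1
    virtual_ipaddress {
        10.0.0.201/24       # the allowed-address-pair address from above
    }
}
```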
18.4.6 OpenStack traffic that must traverse VXLAN tunnel dropped when using HPE 5930 switch #
Cause: UDP destination port 4789 is conflicting with OpenStack VXLAN traffic.
There is a configuration setting you can use in the switch to configure the port number the HPN kit will use for its own VXLAN tunnels. Setting this to a port number other than the one neutron will use by default (4789) will keep the HPN kit from absconding with neutron's VXLAN traffic. Specifically:
Parameters:
port-number: Specifies a UDP port number in the range of 1 to 65535. As a best practice, specify a port number in the range of 1024 to 65535 to avoid conflict with well-known ports.
Usage guidelines:
You must configure the same destination UDP port number on all VTEPs in a VXLAN.
Examples
# Set the destination UDP port number to 6666 for VXLAN packets.
<Sysname> system-view
[Sysname] vxlan udp-port 6666
Use vxlan udp-port to configure the destination UDP port number of VXLAN packets; the port applies to all VXLAN packets. Default: the destination UDP port number for VXLAN packets is 4789.
OVS can be configured to use a different port number itself:
# (IntOpt) The port number to utilize if tunnel_types includes 'vxlan'. By
# default, this will make use of the Open vSwitch default value of '4789' if
# not specified.
#
# vxlan_udp_port =
# Example: vxlan_udp_port = 8472
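As a sketch of applying that change, the snippet below uncomments and sets vxlan_udp_port in a copy of the agent configuration. The file path and the sed approach are illustrative assumptions; on a real node the option lives in the ML2/OVS agent configuration and the service must be restarted afterwards.

```shell
# Sketch: set vxlan_udp_port in a copy of the OVS agent config.
# The real file path (for example /etc/neutron/plugins/ml2/ml2_conf.ini)
# varies by deployment; a temp file stands in for it here.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[agent]
tunnel_types = vxlan
# vxlan_udp_port = 4789
EOF

# Uncomment the option and set the port to match the value configured
# on the switch (8472 in this example).
sed -i 's/^# *vxlan_udp_port *=.*/vxlan_udp_port = 8472/' "$conf"

grep vxlan_udp_port "$conf"   # vxlan_udp_port = 8472
```

The same port number must then be used consistently on all VTEPs, as noted in the switch usage guidelines above.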
18.4.7 Issue: PCI-PT virtual machine gets stuck at boot #
When Intel NICs are used for PCI-PT, the tenant virtual machine sometimes gets stuck at boot. If this happens, download the Intel bootutils package and use it to disable the boot agent.
Use the following steps:
Download preboot.tar.gz from the Intel website.

Untar the preboot.tar.gz file on the compute host where the PCI-PT virtual machine is to be hosted.

Go to the ~/APPS/BootUtil/Linux_x64 directory and then run the following command:

./bootutil64e -BOOTENABLE disable -all
Now boot the PCI-PT virtual machine and it should boot without getting stuck.
18.5 Troubleshooting the Image (glance) Service #
Troubleshooting scenarios with resolutions for the glance service. We have gathered some of the common issues and the troubleshooting steps that help resolve them.
18.5.1 Images Created in Horizon UI Get Stuck in a Queued State #
When creating a new image in the horizon UI, you will see the Image Location option, which allows you to enter an HTTP source to use when creating a new image for your cloud. However, this option is disabled by default for security reasons. As a result, any new images created via this method get stuck in a Queued state.
We cannot guarantee the security of any third-party sites you use as image sources, and the traffic travels over unencrypted HTTP (non-SSL).
Resolution: You will need your cloud administrator to enable the HTTP store option in glance for your cloud.
Here are the steps to enable this option:
Log in to the Cloud Lifecycle Manager.
Edit the file below:
~/openstack/ardana/ansible/roles/GLA-API/templates/glance-api.conf.j2
Locate the glance store options and add the http value in the stores field. It will look like this:

[glance_store]
stores = {{ glance_stores }}
Change this to:
[glance_store]
stores = {{ glance_stores }},http
Commit your configuration to your local git repository, as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”:
cd ~/openstack/ardana/ansible
git add -A
git commit -m "adding HTTP option to glance store list"
Run the configuration processor with this command:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Use the playbook below to create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Run the glance service reconfigure playbook which will update these settings:
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts glance-reconfigure.yml
18.6 Storage Troubleshooting #
Troubleshooting scenarios with resolutions for the block storage (cinder) and object storage (swift) services.
18.6.1 Block Storage Troubleshooting #
The block storage service utilizes OpenStack cinder and can integrate with multiple back-ends, including 3Par, SUSE Enterprise Storage, and Ceph. Failures may surface at the cinder API level, an operation may fail, or you may see an alarm trigger in the monitoring service. These may be caused by configuration problems, network issues, or issues with your servers or storage back-ends. The purpose of this section is to describe how the service works, where to find additional information, some of the common problems that come up, and how to address them.
18.6.1.1 Where to find information #
When debugging block storage issues it is helpful to understand the deployment topology and know where to locate the logs with additional information.
The cinder service consists of:
An API service, typically deployed and active on the controller nodes.
A scheduler service, also typically deployed and active on the controller nodes.
A volume service, which is deployed on all of the controller nodes but only active on one of them.
A backup service, which is deployed on the same controller node as the volume service.
You can refer to your configuration files (usually located in
~/openstack/my_cloud/definition/
on the Cloud Lifecycle Manager) for
specifics about where your services are located. They will usually be
located on the controller nodes.
cinder uses a MariaDB database and communicates between components by consuming messages from a RabbitMQ message service.
The cinder API service is layered underneath a HAProxy service and accessed using a virtual IP address maintained using keepalived.
If any of the cinder components is not running on its intended host then an
alarm will be raised. Details on how to resolve these alarms can be found on
our Section 18.1.1, “Alarm Resolution Procedures” page. You should check the logs for
the service on the appropriate nodes. All cinder logs are stored in
/var/log/cinder/
and all log entries above
INFO
level are also sent to the centralized logging
service. For details on how to change the logging level of the cinder
service, see Section 13.2.6, “Configuring Settings for Other Services”.
In order to get the full context of an error you may need to examine the full log files on individual nodes. Note that if a component runs on more than one node you will need to review the logs on each of the nodes that component runs on. Also remember that, as logs rotate, the time interval you are interested in may be in an older log file.
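For example, a single request ID can be followed across both the current and rotated logs with zgrep, which reads plain and gzip-compressed files alike. The sample log lines below are stand-ins for real entries under /var/log/cinder/:

```shell
# Sketch: follow one request ID across current and rotated (gzipped) logs.
# A temp directory with sample entries stands in for /var/log/cinder/.
logdir=$(mktemp -d)
echo '2016-04-25 10:09:51 INFO ... req-a14cd6f3 ... status 200' > "$logdir/cinder-api.log"
echo '2016-04-25 09:00:00 INFO ... req-a14cd6f3 ... accepted'   > "$logdir/cinder-api.log.1"
gzip "$logdir/cinder-api.log.1"

# zgrep searches plain and gzip-compressed files alike, so rotated
# logs do not have to be decompressed first.
zgrep -h 'req-a14cd6f3' "$logdir"/cinder-api.log*
```

On a real node the same pattern is simply `zgrep 'req-…' /var/log/cinder/cinder-api.log*`.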
Log locations:
/var/log/cinder/cinder-api.log
- Check this log if you
have endpoint or connectivity issues
/var/log/cinder/cinder-scheduler.log
- Check this log if
the system cannot assign your volume to a back-end
/var/log/cinder/cinder-backup.log
- Check this log if you
have backup or restore issues
/var/log/cinder/cinder-volume.log
- Check here for
failures during volume creation
/var/log/nova/nova-compute.log
- Check here for failures
with attaching volumes to compute instances
You can also check the logs for the database and/or the RabbitMQ service if your cloud exhibits database or messaging errors.
If the API servers are up and running but the API is not reachable then checking the HAProxy logs on the active keepalived node would be the place to look.
If you have errors attaching volumes to compute instances using the nova API then the logs would be on the compute node associated with the instance. You can use the following command to determine which node is hosting the instance:
openstack server show <instance_uuid>
Then you can check the logs located at
/var/log/nova/nova-compute.log
on that compute node.
18.6.1.2 Understanding the cinder volume states #
Once the topology is understood, if the issue with the cinder service relates to a specific volume, then you should understand the various states a volume can be in. The states are:
attaching
available
backing-up
creating
deleting
downloading
error
error attaching
error deleting
error detaching
error extending
error restoring
in-use
extending
restoring
restoring backup
retyping
uploading
The common states are in-use, which indicates a volume is currently attached to a compute instance, and available, which means the volume is created on a back-end and is free to be attached to an instance. All -ing states are transient and represent a transition. If a volume stays in one of those states for too long, indicating it is stuck, or if it fails and goes into an error state, you should check for failures in the logs.
18.6.1.3 Initial troubleshooting steps #
These should be the initial troubleshooting steps you go through.
If you have noticed an issue with the service, you should check your monitoring system for any alarms that may have triggered. See Section 18.1.1, “Alarm Resolution Procedures” for resolution steps for those alarms.
Check if the cinder API service is active by listing the available volumes from the Cloud Lifecycle Manager:
source ~/service.osrc
openstack volume list
18.6.1.4 Common failures #
Alerts from the cinder service
Check for alerts associated with the block storage service, noting that these could include alerts related to the server nodes being down, alerts related to the messaging and database services, or the HAProxy and keepalived services, as well as alerts directly attributed to the block storage service.
The Operations Console provides a web UI method for checking alarms.
cinder volume service is down
The cinder volume service could be down if the server hosting the
volume service fails. (Running the command openstack volume service
list
will show the state of the volume service.) In this case,
follow the documented procedure linked below to start the volume service on
another controller node. See Section 8.1.3, “Managing cinder Volume and Backup Services” for details.
Creating a cinder bootable volume fails
When creating a bootable volume from an image, your cinder volume must be larger than the Virtual Size (raw size) of your image or creation will fail with an error.
When creating your disk model for nodes that will have the cinder volume role, make sure that there is sufficient disk space allocated for temporary space for image conversion if you will be creating bootable volumes. Allocate enough space to the filesystem as would be needed to contain the raw size of images to be used for bootable volumes. For example, Windows images can be quite large in raw format.
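As a sketch of the sizing rule, the helper below computes the smallest whole-GiB volume that can hold an image of a given virtual size in bytes (cinder sizes volumes in whole GiB). The byte counts are example values; on a real system the virtual size comes from `qemu-img info` or the image's properties:

```shell
# Sketch: smallest volume size (in GiB) that can hold an image whose
# virtual (raw) size is the given number of bytes. cinder sizes volumes
# in whole GiB, so round up.
min_volume_gib() {
  bytes=$1
  echo $(( (bytes + 1073741823) / 1073741824 ))   # ceil(bytes / 2^30)
}

# An image reported as exactly 10 GiB (10737418240 bytes) fits in a
# 10 GiB volume; one byte more requires 11 GiB.
min_volume_gib 10737418240   # 10
min_volume_gib 10737418241   # 11
```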
By default, cinder uses /var/lib/cinder
for image
conversion and this will be on the root filesystem unless it is explicitly
separated. You can ensure there is enough space by ensuring that the root
file system is sufficiently large, or by creating a logical volume mounted
at /var/lib/cinder
in the disk model when installing the
system.
If your system is already installed, use these steps to update this:
Edit the configuration item image_conversion_dir in cinder.conf.j2 to point to another location with more disk space. Make sure that the new directory location has the same ownership and permissions as /var/lib/cinder (owner: cinder, group: cinder, mode 0750).

Then run this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts cinder-reconfigure.yml
API-level failures
If the API is inaccessible, determine if the API service has halted on
the controller nodes. If a single instance of cinder-api
goes down but other instances remain online on other controllers, load
balancing would typically automatically direct all traffic to the online
nodes. The cinder-status.yml
playbook can be used to
report on the health of the API service from the Cloud Lifecycle Manager:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts cinder-status.yml
Service failures can be diagnosed by reviewing the logs in centralized logging or on the individual controller nodes.
After a controller node is rebooted, you must make sure to run the
ardana-start.yml
playbook to ensure all the services are
up and running. For more information, see
Section 15.2.3.1, “Restarting Controller Nodes After a Reboot”.
If the API service is returning an error code, look for the error message in the API logs on all API nodes. Successful completions would be logged like this:
2016-04-25 10:09:51.107 30743 INFO eventlet.wsgi.server [req-a14cd6f3-6c7c-4076-adc3-48f8c91448f6 dfb484eb00f94fb39b5d8f5a894cd163 7b61149483ba4eeb8a05efa92ef5b197 - - -] 192.168.186.105 - - [25/Apr/2016 10:09:51] "GET /v2/7b61149483ba4eeb8a05efa92ef5b197/volumes/detail HTTP/1.1" 200 13915 0.235921
where 200 represents HTTP status 200 for a successful completion. Look for a line with your status code and then examine all entries associated with the request ID (req-a14cd6f3-6c7c-4076-adc3-48f8c91448f6 in the example above).
The request may have failed at the scheduler or at the volume or backup service and you should also check those logs at the time interval of interest, noting that the log file of interest may be on a different node.
Operations that do not complete
If you have started an operation, such as creating or deleting a volume, that does not complete, the cinder volume may be stuck in a state. You should follow the procedures for dealing with stuck volumes.
There are six transitory states that a volume can get stuck in:
| State | Description |
| --- | --- |
| creating | The cinder volume manager has sent a request to a back-end driver to create a volume, but has not received confirmation that the volume is available. |
| attaching | cinder has received a request from nova to make a volume available for attaching to an instance but has not received confirmation from nova that the attachment is complete. |
| detaching | cinder has received notification from nova that it will detach a volume from an instance but has not received notification that the detachment is complete. |
| deleting | cinder has received a request to delete a volume but has not completed the operation. |
| backing-up | The cinder backup manager has started to back a volume up to swift, or some other backup target, but has not completed the operation. |
| restoring | The cinder backup manager has started to restore a volume from swift, or some other backup target, but has not completed the operation. |
At a high level, the steps that you would take to address any of these states are similar:
Confirm that the volume is actually stuck, and not just temporarily blocked.
Where possible, remove any resources being held by the volume. For example, if a volume is stuck detaching it may be necessary to remove associated iSCSI or DM devices on the compute node.
Reset the state of the volume to an appropriate state, for example to available or error.

Do any final cleanup. For example, if you reset the state to error you can then delete the volume.
The next sections will describe specific steps you can take for volumes stuck in each of the transitory states.
Volumes stuck in Creating
Broadly speaking, there are two possible scenarios where a volume would get
stuck in creating
. The cinder-volume
service could have thrown an exception while it was attempting to create the
volume, and failed to handle the exception correctly. Or the volume back-end
could have failed, or gone offline, after it received the request from
cinder to create the volume.
These two cases are different in that for the second case you will need to
determine the reason the back-end is offline and restart it. Often, when the
back-end has been restarted, the volume will move from
creating
to available
so your issue
will be resolved.
If you can create volumes successfully on the same back-end as the volume
stuck in creating
then the back-end is not down. So you
will need to reset the state for the volume and then delete it.
To reset the state of a volume you can use the openstack volume set
--state
command. You can use either the UUID or the volume name of
the stuck volume.
For example, here is a volume list where we have a stuck volume:
$ openstack volume list
+--------------------------------------+----------+------+------+-------------+-------------+
| ID                                   | Status   | Name | Size | Volume Type | Attached to |
+--------------------------------------+----------+------+------+-------------+-------------+
| 14b76133-e076-4bd3-b335-fa67e09e51f6 | creating | vol1 | 1    | -           |             |
+--------------------------------------+----------+------+------+-------------+-------------+
You can reset the state by using the openstack volume set --state
command, like this:
openstack volume set --state error 14b76133-e076-4bd3-b335-fa67e09e51f6
Confirm that with another listing:
$ openstack volume list
+--------------------------------------+--------+------+------+-------------+-------------+
| ID                                   | Status | Name | Size | Volume Type | Attached to |
+--------------------------------------+--------+------+------+-------------+-------------+
| 14b76133-e076-4bd3-b335-fa67e09e51f6 | error  | vol1 | 1    | -           |             |
+--------------------------------------+--------+------+------+-------------+-------------+
You can then delete the volume:
$ openstack volume delete 14b76133-e076-4bd3-b335-fa67e09e51f6
Request to delete volume 14b76133-e076-4bd3-b335-fa67e09e51f6 has been accepted.
Volumes stuck in Deleting
If a volume is stuck in the deleting state then the request to delete the volume may or may not have been sent to and actioned by the back-end. If you can identify volumes on the back-end then you can examine the back-end to determine whether the volume is still there or not. Then you can decide which of the following paths you can take. It may also be useful to determine whether the back-end is responding, either by checking for recent volume create attempts, or creating and deleting a test volume.
The first option is to reset the state of the volume to
available
and then attempt to delete the volume again.
The second option is to reset the state of the volume to
error
and then delete the volume.
If you have reset the volume state to error
then the volume
may still be consuming storage on the back-end. If that is the case then you
will need to delete it from the back-end using your back-end's specific tool.
Volumes stuck in Attaching
The most complicated situation to deal with is where a volume is stuck either in attaching or detaching, because as well as dealing with the state of the volume in cinder and the back-end, you have to deal with exports from the back-end, imports to the compute node, and attachments to the compute instance.
The two options you have here are to make sure that all exports and imports
are deleted and to reset the state of the volume to
available
or to make sure all of the exports and imports
are correct and to reset the state of the volume to
in-use
.
A volume that is in attaching state should never have been made available to
a compute instance and therefore should not have any data written to it, or
in any buffers between the compute instance and the volume back-end. In that
situation, it is often safe to manually tear down the devices exported on
the back-end and imported on the compute host and then reset the volume state
to available
.
You can use the management features of the back-end you are using to locate the compute host to which the volume is being exported.
Volumes stuck in Detaching
The steps in dealing with a volume stuck in detaching
state are very similar to those for a volume stuck in
attaching
. However, there is the added consideration that
the volume was attached to, and probably servicing, I/O from a compute
instance. So you must take care to ensure that all buffers are properly
flushed before detaching the volume.
When a volume is stuck in detaching, the output from an openstack volume list command will include the UUID of the instance to which the volume was attached. From that you can identify the compute host that is running the instance using the openstack server show command.
For example, here are some snippets:
$ openstack volume list
+--------------------------------------+-----------+------+-----------------+
| ID                                   | Status    | Name | Attached to     |
+--------------------------------------+-----------+------+-----------------+
| 85384325-5505-419a-81bb-546c69064ec2 | detaching | vol1 | 4bedaa76-78ca-… |
+--------------------------------------+-----------+------+-----------------+
$ openstack server show 4bedaa76-78ca-4fe3-806a-3ba57a9af361 | grep host
| OS-EXT-SRV-ATTR:host                | mycloud-cp1-comp0005-mgmt |
| OS-EXT-SRV-ATTR:hypervisor_hostname | mycloud-cp1-comp0005-mgmt |
| hostId                              | 61369a349bd6e17611a47adba60da317bd575be9a900ea590c1be816 |
The first thing to check in this case is whether the instance is still
importing the volume. Use virsh list
and virsh
dumpxml ID
to see the underlying
condition of the virtual machine. If the XML for the
instance has a reference to the device, then you should reset the volume
state to in-use
and attempt the cinder
detach
operation again.
$ openstack volume set --state in-use --attach-status attached 85384325-5505-419a-81bb-546c69064ec2
If the volume gets stuck detaching again, there may be a more fundamental problem, which is outside the scope of this document and you should contact the Support team.
If the volume is not referenced in the XML for the instance then you should
remove any devices on the compute node and back-end and then reset the state
of the volume to available
.
$ openstack volume set --state available --attach-status detached 85384325-5505-419a-81bb-546c69064ec2
You can use the management features of the back-end you are using to locate the compute host to which the volume is being exported.
Volumes stuck in restoring
Restoring a cinder volume from backup is as slow as backing it up, so you must confirm that the volume is actually stuck by examining cinder-backup.log. For example:
# tail -f cinder-backup.log | grep 162de6d5-ba92-4e36-aba4-e37cac41081b
2016-04-27 12:39:14.612 6689 DEBUG swiftclient [req-0c65ec42-8f9d-430a-b0d5-05446bf17e34 - -
2016-04-27 12:39:15.533 6689 DEBUG cinder.backup.chunkeddriver [req-0c65ec42-8f9d-430a-b0d5-
2016-04-27 12:39:15.566 6689 DEBUG requests.packages.urllib3.connectionpool [req-0c65ec42-
2016-04-27 12:39:15.567 6689 DEBUG swiftclient [req-0c65ec42-8f9d-430a-b0d5-05446bf17e34 - - -
If you determine that the volume is genuinely stuck in the
restoring
state then you must follow the procedure described
in the detaching section above to remove any volumes that remain exported from
the back-end and imported on the controller node. Remember that in this case
the volumes will be imported and mounted on the controller node running
cinder-backup
. So you do not have to search for the
correct compute host. Also remember that no instances are involved so you do
not need to confirm that the volume is not imported to any instances.
18.6.1.5 Debugging volume attachment #
In an error case, it is possible for a cinder volume to fail to complete an operation and revert to its initial state. For example, if attaching a cinder volume to a nova instance fails and the volume reverts to available, you would follow the steps above to examine the nova compute logs for the attach request.
18.6.1.6 Errors creating volumes #
If you are creating a volume and it goes into the ERROR state, a common error to see is No valid host was found. This means that the scheduler could not schedule your volume to a back-end. First check that the volume service is up and running, using this command:
$ sudo cinder-manage service list
Binary            Host                    Zone  Status   State  Updated At
cinder-scheduler  ha-volume-manager       nova  enabled  :-)    2016-04-25 11:39:30
cinder-volume     ha-volume-manager@ses1  nova  enabled  XXX    2016-04-25 11:27:26
cinder-backup     ha-volume-manager       nova  enabled  :-)    2016-04-25 11:39:28
In this example, the state XXX indicates that the service is down.
If the service is up, next check that the back-end has sufficient space. You can use this command to show the available and total space on each back-end:
openstack volume backend pool list --detail
If your deployment is using volume types, verify that the
volume_backend_name
in your
cinder.conf
file matches the
volume_backend_name
for the volume type you selected.
You can verify the back-end name on your volume type by using this command:
openstack volume type list
Then list the details about your volume type. For example:
$ openstack volume type show dfa8ecbd-8b95-49eb-bde7-6520aebacde0
+---------------------------------+--------------------------------------+
| Field                           | Value                                |
+---------------------------------+--------------------------------------+
| description                     | None                                 |
| id                              | dfa8ecbd-8b95-49eb-bde7-6520aebacde0 |
| is_public                       | True                                 |
| name                            | my3par                               |
| os-volume-type-access:is_public | True                                 |
| properties                      | volume_backend_name='3par'           |
+---------------------------------+--------------------------------------+
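A quick way to compare the two names is sketched below. The cinder.conf fragment and the type's backend name are sample data; on a real system you would read them from /etc/cinder/cinder.conf and from the volume type's properties shown above:

```shell
# Sketch: check that the volume_backend_name in a cinder.conf backend
# section matches the volume_backend_name on the volume type.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[3par]
volume_driver = ...
volume_backend_name = 3par
EOF

# Pull the backend name out of the conf fragment.
conf_name=$(sed -n 's/^volume_backend_name *= *//p' "$conf")
# Sample value taken from the volume type's properties.
type_name="3par"

if [ "$conf_name" = "$type_name" ]; then
  echo match
else
  echo "mismatch: conf=$conf_name type=$type_name"
fi
```

A mismatch between the two names is enough to make the scheduler report "No valid host was found".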
18.6.2 swift Storage Troubleshooting #
Troubleshooting scenarios with resolutions for the swift service. You can use these guides to help you identify and resolve basic problems you may experience while deploying or using the Object Storage service. It contains the following troubleshooting scenarios:
18.6.2.1 Deployment Fails With “MSDOS Disks Labels Do Not Support Partition Names” #
Description
If a disk drive allocated to swift uses the MBR partition table type, the deploy process refuses to label and format the drive. This is to prevent potential data loss. (For more information, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.5 “Allocating Disk Drives for Object Storage”.) If you intend to use the disk drive for swift, you must convert the MBR partition table to GPT on the drive using /sbin/sgdisk.
This process only applies to swift drives. It does not apply to the operating system or boot drive.
Resolution
You must install gdisk before using sgdisk:
Run the following command to install gdisk:

sudo zypper install gdisk

Convert to the GPT partition type. The following is an example of converting /dev/sdd to the GPT partition type:

sudo sgdisk -g /dev/sdd
Reboot the node for the change to take effect. You may then resume the deployment (repeat the playbook that reported the error).
18.6.2.2 Examining Planned Ring Changes #
Before making major changes to your rings, you can see the planned layout of swift rings using the following steps:
Log in to the Cloud Lifecycle Manager.
Run the swift-compare-model-rings.yml playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
Validate the following in the output:
Drives are being added to all rings in the ring specifications.
Servers are being used as expected (for example, you may have a different set of servers for the account/container rings than the object rings.)
The drive size is the expected size.
18.6.2.3 Interpreting Swift Input Model Validation Errors #
The following examples provide an error message, description, and resolution.
To resolve an error, you must first modify the input model and re-run the configuration processor. (For instructions, see Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”.) Then, continue with the deployment.
Example Message - Model Mismatch: Cannot find drive /dev/sdt on padawan-ccp-c1-m2 (192.168.245.3)

Description: The disk model used for node padawan-ccp-c1-m2 has drive /dev/sdt listed in the devices list of a device-group where swift is the consumer. However, the /dev/sdt device does not exist on that node.

Resolution: If a drive or controller has failed on a node, the operating system does not see the drive and so the corresponding block device may not exist. Sometimes this is transitory and a reboot may resolve the problem. The problem may not be with /dev/sdt, but with another drive. For example, if /dev/sds has failed, when you boot the node, the drive that you expect to be called /dev/sdt is actually called /dev/sds.

Alternatively, there may not be enough drives installed in the server. You can add drives. Another option is to remove /dev/sdt from the appropriate disk model. However, this removes the drive for all servers using the disk model.

Example Message - Model Mismatch: Cannot find drive /dev/sdd2 on padawan-ccp-c1-m2 (192.168.245.3)

Description: The disk model used for node padawan-ccp-c1-m2 has drive /dev/sdd2 listed in the devices list of a device-group where swift is the consumer. However, a partition number (2) has been specified in the model. This is not supported - only specify the block device name (for example /dev/sdd), not partition names, in disk models.

Resolution: Remove the partition number from the disk model.

Example Message - Cannot find IP address of padawan-ccp-c1-m3-swift for ring: account host: padawan-ccp-c1-m3-mgmt

Description: The service (in this example, swift-account) is running on the node padawan-ccp-c1-m3. However, this node does not have a connection to the network designated for the swift-account service (that is, the SWIFT network).

Resolution: Check the input model for which networks are configured for each node type.

Example Message - Ring: object-2 has specified replication_policy and erasure_coding_policy. Only one may be specified.

Description: Only one of replication-policy or erasure-coding-policy may be used in ring-specifications.

Resolution: Remove one of the policy types.

Example Message - Ring: object-3 is missing a policy type (replication-policy or erasure-coding-policy)

Description: There is no replication-policy or erasure-coding-policy section in ring-specifications for the object-3 ring.

Resolution: Add a policy type to the input model file.
18.6.2.4 Identifying the Swift Ring Building Server #
18.6.2.4.1 Identify the swift Ring Building server #
Perform the following steps to identify the swift ring building server:
Log in to the Cloud Lifecycle Manager.
Run the following command:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml --limit SWF-ACC[0]
Examine the output of this playbook. The last line underneath the play recap will give you the server name, which is your swift ring building server.
PLAY RECAP ********************************************************************
_SWF_CMN | status | Check systemd service running ----------------------- 1.61s
_SWF_CMN | status | Check systemd service running ----------------------- 1.16s
_SWF_CMN | status | Check systemd service running ----------------------- 1.09s
_SWF_CMN | status | Check systemd service running ----------------------- 0.32s
_SWF_CMN | status | Check systemd service running ----------------------- 0.31s
_SWF_CMN | status | Check systemd service running ----------------------- 0.26s
-------------------------------------------------------------------------------
Total: ------------------------------------------------------------------ 7.88s
ardana-cp1-c1-m1-mgmt : ok=7 changed=0 unreachable=0 failed=0
In the above example, the swift ring building server is ardana-cp1-c1-m1-mgmt.
For the purposes of this document, any errors you see in the output of this playbook can be ignored if all you are looking for is the server name for your swift ring builder server.
18.6.2.5 Verifying a Swift Partition Label #
For a system upgrade do NOT clear the label before starting the upgrade.
This topic describes how to check whether a device has a label on a partition.
18.6.2.5.1 Check Partition Label #
To check whether a device has a label on a partition, perform the following step:
Log on to the node and use the parted command:

sudo parted -l

The output lists all of the block devices. The following is example output for /dev/sdc with a single partition and a label of c0a8f502h000. Because the partition has a label, if you are about to install and deploy the system, you must clear this label before starting the deployment. As part of the deployment process, the system will label the partition.

Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sdc: 20.0GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name          Flags
 1      1049kB  20.0GB  20.0GB  xfs          c0a8f502h000
18.6.2.6 Verifying a Swift File System Label #
For a system upgrade do NOT clear the label before starting the upgrade.
This topic describes how to check whether a file system in a partition has a label.
To check whether a file system in a partition has a label, perform the following step:
Log on to the server and execute the xfs_admin command (where /dev/sdc1 is the partition where the file system is located):

sudo xfs_admin -l /dev/sdc1
The output shows if a file system has a label. For example, this shows a label of c0a8f502h000:
$ sudo xfs_admin -l /dev/sdc1
label = "c0a8f502h000"
If no file system exists, the result is as follows:
$ sudo xfs_admin -l /dev/sde1
xfs_admin: /dev/sde is not a valid XFS file system (unexpected SB magic number 0x00000000)
If you are about to install and deploy the system, you must delete the label before starting the deployment. As part of the deployment process, the system will label the partition.
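The quoted label can also be extracted from saved xfs_admin output for use in scripts. A minimal sketch on a stand-in output string (not live xfs_admin output):

```shell
# Hypothetical example: pull the label out of saved `xfs_admin -l` output.
out='label = "c0a8f502h000"'
# Strip everything outside the double quotes.
label=$(echo "$out" | sed 's/.*"\(.*\)".*/\1/')
echo "file system label: $label"
```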
18.6.2.7 Recovering swift Builder Files #
When you execute the deploy process for a system, a copy of the builder files is stored on the following nodes and directories:

- On the swift ring building node, the primary reference copy is stored in the /etc/swiftlm/<cloud-name>/<control-plane-name>/builder_dir/ directory.
- On the next node after the swift ring building node, a backup copy is stored in the /etc/swiftlm/<cloud-name>/<control-plane-name>/builder_dir/ directory.
- In addition, in the deploy process, the builder files are also copied to the /etc/swiftlm/deploy_dir/<cloud-name> directory on every swift node.
If these builder files are found on the primary swift ring building node
(to identify which node is the primary ring building node, see
Section 18.6.2.4, “Identifying the Swift Ring Building Server”) in the directory
/etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir
,
then no further recovery action is needed. If not, you need to copy the
files from an intact swift node onto the primary swift ring building node.
If you have no intact /etc/swiftlm
directory on any swift
node, you may be able to restore from a backup. See
Section 15.2.3.2, “Recovering the Control Plane”.
To restore builder files on the primary ring builder node from a backup stored on another member of the ring, use the following process:
Log in to the swift ring building server (To identify the swift ring building server, see Section 18.6.2.4, “Identifying the Swift Ring Building Server”).
Create the /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir directory structure with the following commands. Replace CLOUD_NAME with the name of your cloud and CONTROL_PLANE_NAME with the name of your control plane.

tux > sudo mkdir -p /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
tux > sudo chown -R ardana.ardana /etc/swiftlm/

Log in to a swift node where an intact /etc/swiftlm/deploy_dir directory exists.

Copy the builder files to the swift ring building node. In the example below we use scp to transfer the files, where swpac-ccp-c1-m1-mgmt is the node where the files can be found, cloud1 is the cloud, and cp1 is the control plane name:

tux > sudo mkdir -p /etc/swiftlm/cloud1/cp1/builder_dir
tux > sudo cd /etc/swiftlm/cloud1/cp1/builder_dir
tux > sudo scp -r ardana@swpac-ccp-c1-m1-mgmt:/etc/swiftlm/cloud1/cp1/builder_dir/* ./
tux > sudo chown -R swift:swift /etc/swiftlm

(Any permissions errors related to files in the backups directory can be ignored.)

Skip this step if you are rebuilding the entire node. It should only be used if swift components are already present and functioning on the server, and you are recovering or updating the ring builder files.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
18.6.2.8 Restarting the Object Storage Deployment #
This page describes the various operational procedures performed by swift.
18.6.2.8.1 Restart the Swift Object Storage Deployment #
The structure of a ring is built in incremental stages: when you modify a ring, the new ring uses the state of the old ring as its basis. Rings are stored in builder files. The swiftlm-ring-supervisor stores builder files in the /etc/swiftlm/cloud1/cp1/builder_dir/ directory on the Ring-Builder node. The builder files are named <ring-name>.builder. Prior versions of the builder files are stored in the /etc/swiftlm/cloud1/cp1/builder_dir/backups directory.
Generally, you use an existing builder file as the basis for changes to a ring. However, at initial deployment, when you create a ring there will be no builder file. Instead, the first step in the process is to build a builder file. The deploy playbook does this as a part of the deployment process. If you have successfully deployed some of the system, the ring builder files will exist.
If you change your input model now (for example, by adding servers), the process assumes you are modifying an existing ring and behaves differently than when creating a ring from scratch. In this case, the ring is not balanced. So, if the cloud model contains an error or you decide to make substantive changes, it is a best practice to start from scratch and build rings using the steps below.
18.6.2.8.2 Reset Builder Files #
You must reset the builder files during the initial deployment process only. This process should be used only when you want to restart a deployment from scratch. If you reset the builder files after completing your initial deployment, you risk losing critical system data.
Delete the builder files in the /etc/swiftlm/cloud1/cp1/builder_dir/ directory. For example, for the region0 keystone region (the default single region designation), do the following:

sudo rm /etc/swiftlm/cloud1/cp1/builder_dir/*.builder
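Because deleting builder files is unrecoverable once the initial deployment has completed, it can be worth archiving them first. A hedged sketch using stand-in /tmp paths rather than the real /etc/swiftlm tree:

```shell
# Hypothetical example: archive builder files before removing them,
# using a demo directory in place of /etc/swiftlm/cloud1/cp1/builder_dir/.
demo=/tmp/builder_dir_demo
mkdir -p "$demo"
touch "$demo/account.builder" "$demo/container.builder" "$demo/object.builder"
# Keep a compressed copy, then remove the builder files.
tar -czf /tmp/builder_dir_backup.tgz -C "$demo" .
rm "$demo"/*.builder
```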
If you have successfully deployed a system and accidentally delete the builder files, you can recover to the correct state. For instructions, see Section 18.6.2.7, “Recovering swift Builder Files”.
18.6.2.9 Increasing the Swift Node Timeout Value #
On a heavily loaded Object Storage system, timeouts may occur when transferring data to or from swift, particularly for large objects.
The following is an example of a timeout message in the log
(/var/log/swift/swift.log
) on a swift proxy server:
Jan 21 16:55:08 ardana-cp1-swpaco-m1-mgmt proxy-server: ERROR with Object server 10.243.66.202:6000/disk1 re: Trying to write to /v1/AUTH_1234/testcontainer/largeobject: ChunkWriteTimeout (10s)
If this occurs, it may be necessary to increase the
node_timeout
parameter in the
proxy-server.conf
configuration file.
The node_timeout
parameter in the swift
proxy-server.conf
file is the maximum amount of time the
proxy server will wait for a response from the account, container, or object
server. The default value is 10 seconds.
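One way to judge whether the default is too low is to count recent timeout errors in the proxy log. A minimal sketch against a stand-in log file (the sample lines are abbreviated from the example above; the real log is /var/log/swift/swift.log):

```shell
# Hypothetical example: count ChunkWriteTimeout errors in a saved log sample.
cat > /tmp/swift-sample.log <<'EOF'
Jan 21 16:55:08 host proxy-server: ERROR ... ChunkWriteTimeout (10s)
Jan 21 16:57:31 host proxy-server: ERROR ... ChunkWriteTimeout (10s)
Jan 21 17:02:12 host proxy-server: object GET ok
EOF
count=$(grep -c 'ChunkWriteTimeout' /tmp/swift-sample.log)
echo "timeouts seen: $count"
```

A steadily climbing count under load is a hint that raising node_timeout is worthwhile.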
To modify the timeout, use these steps:
Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/config/swift/proxy-server.conf.j2 file and add a line specifying the node_timeout in the [app:proxy-server] section of the file. For example, to increase the timeout to 30 seconds:

[app:proxy-server]
use = egg:swift#proxy
.
.
node_timeout = 30
Commit your configuration to your local git repository, as described in Book “Deployment Guide using Cloud Lifecycle Manager”, Chapter 22 “Using Git for Configuration Management”:

cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
Use the playbook below to create a deployment directory:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
Change to the deployment directory and run the swift reconfigure playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
18.6.2.10 Troubleshooting Swift File System Usage Issues #
If you have recycled your environment to do a re-installation and you did not run the wipe_disks.yml playbook in the process, you may find that your file system usage continues to grow even though you are not adding any files to your swift system.
This is likely occurring because the quarantined directory is getting filled
up. You can find this directory at
/srv/node/disk0/quarantined
.
You can resolve this issue by following these steps:
SSH to each of your swift nodes and stop the replication processes on each of them. The following commands must be executed on each of your swift nodes. Make note of the time that you performed this action as you will reference it in step three.
sudo systemctl stop swift-account-replicator
sudo systemctl stop swift-container-replicator
sudo systemctl stop swift-object-replicator
Examine the /var/log/swift/swift.log file for events that indicate when the auditor processes have started and completed audit cycles. For details on these log entries, see Section 18.6.2.10.1, “Examining the swift Log for Audit Event Cycles”.

Wait until you see that the auditor processes have finished two complete cycles since the time you stopped the replication processes (from step one). You must check every swift node; on a lightly loaded, recently installed system this should take less than two hours.
At this point you should notice that your quarantined directory has stopped growing. You may now delete the files in that directory on each of your nodes.
Restart the replication processes using the swift start playbook:
Log in to the Cloud Lifecycle Manager.
Run the swift start playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-start.yml
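To confirm that the quarantined directory really has stopped growing before deleting its contents, its size can be sampled twice. A hedged sketch using a stand-in directory in place of /srv/node/disk0/quarantined:

```shell
# Hypothetical example: sample the directory size twice; equal readings
# suggest it has stopped growing.
qdir=/tmp/quarantined_demo
mkdir -p "$qdir"
echo data > "$qdir/obj1"
before=$(du -s "$qdir" | awk '{print $1}')
sleep 1
after=$(du -s "$qdir" | awk '{print $1}')
[ "$before" = "$after" ] && echo "quarantined directory is stable"
```

On a real node you would wait minutes, not one second, between samples.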
18.6.2.10.1 Examining the swift Log for Audit Event Cycles #
Below is an example of the object-auditor cycle start and end details. They were taken by using the following command on a swift node:
sudo grep object-auditor /var/log/swift/swift.log|grep ALL
Example output:
$ sudo grep object-auditor /var/log/swift/swift.log|grep ALL
...
Apr 1 13:31:18 padawan-ccp-c1-m1-mgmt object-auditor: Begin object audit "forever" mode (ALL)
Apr 1 13:31:18 padawan-ccp-c1-m1-mgmt object-auditor: Object audit (ALL). Since Fri Apr 1 13:31:18 2016: Locally: 0 passed, 0 quarantined, 0 errors files/sec: 0.00 , bytes/sec: 0.00, Total time: 0.00, Auditing time: 0.00, Rate: 0.00
Apr 1 13:51:32 padawan-ccp-c1-m1-mgmt object-auditor: Object audit (ALL) "forever" mode completed: 1213.78s. Total quarantined: 0, Total errors: 0, Total files/sec: 7.02, Total bytes/sec: 9999722.38, Auditing time: 1213.07, Rate: 1.00
In this example, the auditor started at 13:31 and ended at 13:51.
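The completed line already reports the cycle duration, so it can be extracted directly instead of comparing timestamps. A minimal sketch on a saved log line (abbreviated from the example output above):

```shell
# Hypothetical example: parse the reported duration out of the
# "mode completed" log line.
line='Apr 1 13:51:32 host object-auditor: Object audit (ALL) "forever" mode completed: 1213.78s. Total quarantined: 0, Total errors: 0'
secs=$(echo "$line" | grep -o 'completed: [0-9.]*' | awk '{print $2}')
mins=$(awk -v s="$secs" 'BEGIN {printf "%.0f", s/60}')
echo "audit cycle: ${secs}s (~${mins} min)"
```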
In this next example, the account-auditor and container-auditor use a similar message structure, so we show only the container auditor. You can substitute account for container as well:
$ sudo grep container-auditor /var/log/swift/swift.log
...
Apr 1 14:07:00 padawan-ccp-c1-m1-mgmt container-auditor: Begin container audit pass.
Apr 1 14:07:00 padawan-ccp-c1-m1-mgmt container-auditor: Since Fri Apr 1 13:07:00 2016: Container audits: 42 passed audit, 0 failed audit
Apr 1 14:37:00 padawan-ccp-c1-m1-mgmt container-auditor: Container audit pass completed: 0.10s
In this example, the container auditor started a cycle at 14:07 and the cycle finished at 14:37.
18.7 Monitoring, Logging, and Usage Reporting Troubleshooting #
Troubleshooting scenarios with resolutions for the Monitoring, Logging, and Usage Reporting services.
18.7.1 Troubleshooting Centralized Logging #
This section contains the following scenarios:
18.7.1.1 Reviewing Log Files #
You can troubleshoot service-specific issues by reviewing the logs. After logging into Kibana, follow these steps to load the logs for viewing:
Navigate to the Settings menu to configure an index pattern to search for.
In the Index name or pattern field, you can enter logstash-* to query all Elasticsearch indices.

Click the green Create button to create and load the index.
Navigate to the Discover menu to load the index and make it available to search.
If you want to search specific Elasticsearch indices, you can run the following command from the control plane to get a full list of available indices:
curl localhost:9200/_cat/indices?v
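The index names in that listing can be extracted for scripted queries. A hedged sketch against a stand-in copy of the curl output (the sample rows and column layout are illustrative):

```shell
# Hypothetical example: list index names from saved `_cat/indices?v` output.
cat > /tmp/indices.txt <<'EOF'
health status index               pri rep docs.count store.size
green  open   logstash-2016.04.01   5   1      12345       42mb
green  open   logstash-2016.04.02   5   1      23456       57mb
EOF
# Skip the header row; the index name is the third column.
awk 'NR > 1 {print $3}' /tmp/indices.txt
```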
Once the logs load, you can change the timeframe from the dropdown in the upper right-hand corner of the Kibana window. You have the following options to choose from:
Quick - a variety of time frame choices will be available here
Relative - allows you to select a start time relative to the current time to show this range
Absolute - allows you to select a date range to query
When searching there are common fields you will want to use, such as:
type - this will include the service name, such as keystone or ceilometer
host - you can specify a specific host to search for in the logs
file - you can specify a specific log file to search
For more details on using Kibana and Elasticsearch to query logs, see https://www.elastic.co/guide/en/kibana/3.0/working-with-queries-and-filters.html
18.7.1.2 Monitoring Centralized Logging #
To help keep ahead of potential logging issues and resolve issues before they affect logging, you may want to monitor the Centralized Logging Alarms.
To monitor logging alarms:
Log in to Operations Console.
From the menu button in the upper left corner, navigate to the Alarm Definitions page.
Find the alarm definitions that are applied to the various hosts. See Section 18.1.1, “Alarm Resolution Procedures” for the Centralized Logging alarm definitions.

Navigate to the Alarms page.

Find the alarm definitions applied to the various hosts. These should match the alarm definitions in Section 18.1.1, “Alarm Resolution Procedures”.

See if the alarm is green (good) or in a bad state. If any are in a bad state, see the possible actions to perform in Section 18.1.1, “Alarm Resolution Procedures”.
You can use this filtering technique in the "Alarms" page to look for the following:
To look for processes that may be down, filter for "Process", then make sure the following processes are up:
Elasticsearch
Logstash
Beaver
Apache (Kafka)
Kibana
monasca
To look for sufficient disk space, filter for "Disk"
To look for sufficient RAM memory, filter for "Memory"
18.7.1.3 Situations In Which Logs Might Not Be Collected #
Centralized logging might not collect log data under the following circumstances:
If the Beaver service is not running on one or more of the nodes (controller or compute), logs from these nodes will not be collected.
18.7.1.4 Error When Creating a Kibana Visualization #
When creating a visualization in Kibana you may get an error similar to this:
"logstash-*" index pattern does not contain any of the following field types: number
To resolve this issue:
Log in to Kibana.
Navigate to the Settings page.

In the left panel, select the logstash-* index.

Click the Refresh button. You may see a mapping conflict warning after refreshing the index.
Re-create the visualization.
18.7.1.5 After Deploying Logging-API, Logs Are Not Centrally Stored #
If you are using the Logging-API and logs are not being centrally stored, use the following checklist to troubleshoot Logging-API.
☐ Ensure monasca is running.
☐ Check any alarms monasca has triggered.
☐ Check whether the Logging-API (monasca-log-api) process alarm has triggered.
☐ Run an Ansible playbook to get the status of the Cloud Lifecycle Manager: ansible-playbook -i hosts/verb_hosts ardana-status.yml
☐ Troubleshoot any specific tasks that have failed on the Cloud Lifecycle Manager.
☐ Ensure that the Logging-API daemon is up.
☐ Run an Ansible playbook to try to bring the Logging-API daemon up: ansible-playbook -i hosts/verb_hosts logging-start.yml
☐ If you get errors trying to bring up the daemon, resolve them.
☐ Verify that the Logging-API configuration settings are correct in the configuration file: roles/kronos-api/templates/kronos-apache2.conf.j2
The following is a sample Logging-API configuration file:
{#
# (c) Copyright 2015-2016 Hewlett Packard Enterprise Development LP
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
#}
Listen {{ kronos_api_host }}:{{ kronos_api_port }}
<VirtualHost *:{{ kronos_api_port }}>
    WSGIDaemonProcess log-api processes=4 threads=4 socket-timeout=300 user={{ kronos_user }} group={{ kronos_group }} python-path=/opt/stack/service/kronos/venv:/opt/stack/service/kronos/venv/bin/../lib/python2.7/site-packages/ display-name=monasca-log-api
    WSGIProcessGroup log-api
    WSGIApplicationGroup log-api
    WSGIScriptAlias / {{ kronos_wsgi_dir }}/app.wsgi
    ErrorLog /var/log/kronos/wsgi.log
    LogLevel info
    CustomLog /var/log/kronos/wsgi-access.log combined
    <Directory /opt/stack/service/kronos/venv/bin/../lib/python2.7/site-packages/monasca_log_api>
        Options Indexes FollowSymLinks MultiViews
        Require all granted
        AllowOverride None
        Order allow,deny
        allow from all
        LimitRequestBody 102400
    </Directory>
    SetEnv no-gzip 1
</VirtualHost>
18.7.1.6 Re-enabling Slow Logging #
MariaDB slow logging was enabled by default in earlier versions. Slow
logging logs slow MariaDB queries to
/var/log/mysql/mysql-slow.log
on
FND-MDB hosts.
As it is possible for temporary tokens to be logged to the slow log, we have disabled slow log in this version for security reasons.
To re-enable slow logging, use the following procedure:

Log in to the Cloud Lifecycle Manager and set a mariadb service configurable to enable slow logging.
cd ~/openstack/my_cloud
Check slow_query_log is currently disabled with a value of 0:
grep slow ./config/percona/my.cfg.j2
slow_query_log = 0
slow_query_log_file = /var/log/mysql/mysql-slow.log
Enable slow logging in the server configurable template file and confirm the new value:
sed -e 's/slow_query_log = 0/slow_query_log = 1/' -i ./config/percona/my.cfg.j2
grep slow ./config/percona/my.cfg.j2
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
Commit the changes:
git add -A
git commit -m "Enable Slow Logging"
Run the configuration processor.
cd ~/openstack/ardana/ansible/
ansible-playbook -i hosts/localhost config-processor-run.yml
You will be prompted for an encryption key, and asked whether you want to change the encryption key to a new value; the new key must be different from the current one. You can turn off encryption by running the playbook as follows:
ansible-playbook -i hosts/localhost config-processor-run.yml -e encrypt="" -e rekey=""
Create a deployment directory.
ansible-playbook -i hosts/localhost ready-deployment.yml
Reconfigure Percona (note that this will restart the mysqld server on your cluster hosts).
ansible-playbook -i hosts/verb_hosts percona-reconfigure.yml
18.7.2 Usage Reporting Troubleshooting #
Troubleshooting scenarios with resolutions for the ceilometer service.
18.7.2.1 Logging #
Logs for the various components running on the Overcloud Controllers can be found at /var/log/ceilometer.log.

The Upstart scripts for the services also log data at /var/log/upstart.
18.7.2.2 Modifying #
Change the level of debugging in ceilometer by editing the ceilometer.conf file located at /etc/ceilometer/ceilometer.conf. To log the maximum amount of information, change the level entry to DEBUG.
Note: When the logging level for a service is changed, that service must be re-started before the change will take effect.
This is an excerpt of the ceilometer.conf configuration file showing where to make changes:
[loggers]
keys: root

[handlers]
keys: watchedfile, logstash

[formatters]
keys: context, logstash

[logger_root]
qualname: root
handlers: watchedfile, logstash
level: NOTSET
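The level change can be rehearsed on a scratch copy of the file before touching the real /etc/ceilometer/ceilometer.conf. A minimal sketch using a stand-in path:

```shell
# Hypothetical example: raise the root logger to DEBUG in a demo copy
# of the configuration fragment, then confirm the change.
cat > /tmp/ceilometer-demo.conf <<'EOF'
[logger_root]
qualname: root
handlers: watchedfile, logstash
level: NOTSET
EOF
sed -i 's/^level: NOTSET/level: DEBUG/' /tmp/ceilometer-demo.conf
grep '^level:' /tmp/ceilometer-demo.conf
```

Remember that the service must be restarted before a logging-level change takes effect.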
18.7.2.3 Messaging/Queuing Errors #
ceilometer relies on a message bus for passing data between the various components. In high-availability scenarios, RabbitMQ servers are used for this purpose. If these servers are not available, the ceilometer log will record errors during "Connecting to AMQP" attempts.
These errors may indicate that the RabbitMQ messaging nodes are not running as expected and/or the RPC publishing pipeline is stale. When these errors occur, re-start the instances.
Example error:
Error: unable to connect to node 'rabbit@xxxx-rabbitmq0000': nodedown
Use the RabbitMQ CLI to restart the downed node and then the RabbitMQ application:

Restart the downed cluster node:

sudo invoke-rc.d rabbitmq-server start

Restart the RabbitMQ application:

sudo rabbitmqctl start_app
18.8 Orchestration Troubleshooting #
Troubleshooting scenarios with resolutions for the Orchestration services.
18.8.1 Heat Troubleshooting #
Troubleshooting scenarios with resolutions for the heat service.
18.8.1.1 RPC timeout on Heat Stack Creation #
If you experience a remote procedure call (RPC) timeout failure when
attempting heat stack-create
, you can work around the
issue by increasing the timeout value and purging records of deleted stacks
from the database. To do so, follow the steps below. An example of the
error is:
MessagingTimeout: resources.XXX-LCP-Pair01.resources[0]: Timed out waiting for a reply to message ID e861c4e0d9d74f2ea77d3ec1984c5cb6
Increase the timeout value:

ardana > cd ~/openstack/my_cloud/config/heat

Make changes to heat config files. In heat.conf.j2, add this timeout value:

rpc_response_timeout=300
Commit your changes:
git commit -a -m "some message"
Move to the ansible directory and run the following playbooks:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Change to the scratch directory and run heat-reconfigure:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts heat-reconfigure.yml

Purge records of deleted stacks from the database. First delete all stacks that are in a failed state, then execute the following:

sudo /opt/stack/venv/heat-20151116T000451Z/bin/python2 /opt/stack/service/heat-engine/venv/bin/heat-manage --config-file /opt/stack/service/heat-engine-20151116T000451Z/etc/heat/heat.conf --config-file /opt/stack/service/heat-engine-20151116T000451Z/etc/heat/engine.conf purge_deleted 0
18.8.1.2 General Heat Stack Creation Errors #
Generally in heat, a timeout means that the underlying resource service, such as nova, neutron, or cinder, failed to complete the required action. Whatever error the underlying service reports, heat simply reports it back. So in the case of a timeout in heat stack create, look at the logs of the underlying services, most importantly the nova service, to understand the reason for the timeout.
18.8.1.3 Multiple Heat Stack Create Failure #
The monasca AlarmDefinition resource, OS::monasca::AlarmDefinition, used for heat autoscaling, has an optional name property for defining the alarm name. If this optional property is specified in the heat template, the name must be unique within the project. Otherwise, multiple heat stack creates using the same heat template will fail with the following conflict:
| cpu_alarm_low | 5fe0151b-5c6a-4a54-bd64-67405336a740 | HTTPConflict: resources.cpu_alarm_low: An alarm definition already exists for project / tenant: 835d6aeeb36249b88903b25ed3d2e55a named: CPU utilization less than 15 percent | CREATE_FAILED | 2016-07-29T10:28:47 |
This is because monasca registers the alarm definition name using this name property when it is defined in the heat template, and the name must be unique.

To avoid this problem, if you want to define an alarm name using this property in the template, make sure the name is unique within the project. Otherwise, leave this optional property undefined in your template; in that case, the system will create a unique alarm name automatically during heat stack create.
18.8.1.4 Unable to Retrieve QOS Policies #
Launching the Orchestration Template Generator may trigger the message:
Unable to retrieve resources Qos Policies
. This is a
known upstream
bug. This information message can be ignored.
18.8.2 Troubleshooting Magnum Service #
Troubleshooting scenarios with resolutions for the Magnum service. The Magnum service provides container orchestration engines such as Docker Swarm, Kubernetes, and Apache Mesos as first-class resources. Use this guide to help with known issues and troubleshooting of Magnum services.
18.8.2.1 Magnum cluster fails to create #
Typically, small clusters need about 3-5 minutes to stand up. If cluster stand-up takes longer, you may proceed with troubleshooting rather than waiting for the status to turn to CREATE_FAILED after timing out.
Use heat resource-list STACK-ID to identify which heat stack resource is stuck in CREATE_IN_PROGRESS.

Note: The main heat stack has nested stacks, one for kubemaster(s) and one for kubeminion(s). These stacks are visible as resources of type OS::heat::ResourceGroup (in the parent stack) and file:///... (in the nested stack). If any resource remains in the CREATE_IN_PROGRESS state within a nested stack, the overall state of that resource will be CREATE_IN_PROGRESS.
$ heat resource-list -n2 22385a42-9e15-49d9-a382-f28acef36810
+-------------------------------+--------------------------------------+--------------------------------------+--------------------+----------------------+------------------------------------------------------------------+
| resource_name                 | physical_resource_id                 | resource_type                        | resource_status    | updated_time         | stack_name                                                       |
+-------------------------------+--------------------------------------+--------------------------------------+--------------------+----------------------+------------------------------------------------------------------+
| api_address_floating_switch   | 06b2cc0d-77f9-4633-8d96-f51e2db1faf3 | Magnum::FloatingIPAddressSwitcher    | CREATE_COMPLETE    | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv                                          |
. . .
| fixed_subnet                  | d782bdf2-1324-49db-83a8-6a3e04f48bb9 | OS::neutron::Subnet                  | CREATE_COMPLETE    | 2017-04-10T21:25:11Z | my-cluster-z4aquda2mgpv                                          |
| kube_masters                  | f0d000aa-d7b1-441a-a32b-17125552d3e0 | OS::heat::ResourceGroup              | CREATE_IN_PROGRESS | 2017-04-10T21:25:10Z | my-cluster-z4aquda2mgpv                                          |
| 0                             | b1ff8e2c-23dc-490e-ac7e-14e9f419cfb6 | file:///opt/s...ates/kubemaster.yaml | CREATE_IN_PROGRESS | 2017-04-10T21:25:41Z | my-cluster-z4aquda2mgpv-kube_masters-utyggcbucbhb                |
| kube_master                   | 4d96510e-c202-4c62-8157-c0e3dddff6d5 | OS::nova::Server                     | CREATE_IN_PROGRESS | 2017-04-10T21:25:48Z | my-cluster-z4aquda2mgpv-kube_masters-utyggcbucbhb-0-saafd5k7l7im |
. . .
If stack creation failed on a native OpenStack resource, such as OS::nova::Server or OS::neutron::Router, proceed with the respective service troubleshooting. This type of error usually does not cause a timeout, and the cluster turns to status CREATE_FAILED quickly. The underlying reason for the failure, as reported by heat, can be checked via the magnum cluster-show command.

If stack creation stopped on a resource of type OS::heat::WaitCondition, heat is not receiving notification from the cluster VM about bootstrap sequence completion. Locate the corresponding resource of type OS::nova::Server and use its physical_resource_id to get information about the VM (which should be in status CREATE_COMPLETE):
$ openstack server show 4d96510e-c202-4c62-8157-c0e3dddff6d5
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+
| Property                             | Value                                                                                                         |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                        |
| OS-EXT-AZ:availability_zone          | nova                                                                                                          |
| OS-EXT-SRV-ATTR:host                 | comp1                                                                                                         |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | comp1                                                                                                         |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000025                                                                                             |
| OS-EXT-STS:power_state               | 1                                                                                                             |
| OS-EXT-STS:task_state                | -                                                                                                             |
| OS-EXT-STS:vm_state                  | active                                                                                                        |
| OS-SRV-USG:launched_at               | 2017-04-10T22:10:40.000000                                                                                    |
| OS-SRV-USG:terminated_at             | -                                                                                                             |
| accessIPv4                           |                                                                                                               |
| accessIPv6                           |                                                                                                               |
| config_drive                         |                                                                                                               |
| created                              | 2017-04-10T22:09:53Z                                                                                          |
| flavor                               | m1.small (2)                                                                                                  |
| hostId                               | eb101a0293a9c4c3a2d79cee4297ab6969e0f4ddd105f4d207df67d2                                                      |
| id                                   | 4d96510e-c202-4c62-8157-c0e3dddff6d5                                                                          |
| image                                | fedora-atomic-26-20170723.0.x86_64 (4277115a-f254-46c0-9fb0-fffc45d2fd38)                                     |
| key_name                             | testkey                                                                                                       |
| metadata                             | {}                                                                                                            |
| name                                 | my-zaqshggwge-0-sqhpyez4dig7-kube_master-wc4vv7ta42r6                                                         |
| os-extended-volumes:volumes_attached | [{"id": "24012ce2-43dd-42b7-818f-12967cb4eb81"}]                                                              |
| private network                      | 10.0.0.14, 172.31.0.6                                                                                         |
| progress                             | 0                                                                                                             |
| security_groups                      | my-cluster-z7ttt2jvmyqf-secgroup_base-gzcpzsiqkhxx, my-cluster-z7ttt2jvmyqf-secgroup_kube_master-27mzhmkjiv5v |
| status                               | ACTIVE                                                                                                        |
| tenant_id                            | 2f5b83ab49d54aaea4b39f5082301d09                                                                              |
| updated                              | 2017-04-10T22:10:40Z                                                                                          |
| user_id                              | 7eba6d32db154d4790e1d3877f6056fb                                                                              |
+--------------------------------------+---------------------------------------------------------------------------------------------------------------+
Use the floating IP of the master VM to log in to the first master node. Use the appropriate username below for your VM type. Passwords should not be required, as the VMs should have a public ssh key installed.

VM Type                              Username
Kubernetes or Swarm on Fedora Atomic fedora
Kubernetes on CoreOS                 core
Mesos on Ubuntu                      ubuntu

Useful diagnostic commands:
Kubernetes cluster on Fedora Atomic
sudo journalctl --system
sudo journalctl -u cloud-init.service
sudo journalctl -u etcd.service
sudo journalctl -u docker.service
sudo journalctl -u kube-apiserver.service
sudo journalctl -u kubelet.service
sudo journalctl -u wc-notify.service
Kubernetes cluster on CoreOS
sudo journalctl --system
sudo journalctl -u oem-cloudinit.service
sudo journalctl -u etcd2.service
sudo journalctl -u containerd.service
sudo journalctl -u flanneld.service
sudo journalctl -u docker.service
sudo journalctl -u kubelet.service
sudo journalctl -u wc-notify.service
Swarm cluster on Fedora Atomic
sudo journalctl --system
sudo journalctl -u cloud-init.service
sudo journalctl -u docker.service
sudo journalctl -u swarm-manager.service
sudo journalctl -u wc-notify.service
Mesos cluster on Ubuntu
sudo less /var/log/syslog
sudo less /var/log/cloud-init.log
sudo less /var/log/cloud-init-output.log
sudo less /var/log/os-collect-config.log
sudo less /var/log/marathon.log
sudo less /var/log/mesos-master.log
18.9 Troubleshooting Tools #
Tools to assist with troubleshooting issues in your cloud. Additional troubleshooting information is available at Section 18.1, “General Troubleshooting”.
18.9.1 Retrieving the SOS Report #
The SOS report provides debug level information about your environment to assist in troubleshooting issues. When troubleshooting and debugging issues in your SUSE OpenStack Cloud environment you can run an ansible playbook that will provide you with a full debug report, referred to as a SOS report. These reports can be sent to the support team when seeking assistance.
18.9.1.1 Retrieving the SOS Report #
Log in to the Cloud Lifecycle Manager.
Run the SOS report ansible playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts sosreport-run.yml
Retrieve the SOS report tarballs, which will be in the following directories on your Cloud Lifecycle Manager:
/tmp
/tmp/sosreport-report-archives/
You can then use these reports to troubleshoot issues further or provide to the support team when you reach out to them.
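When several reports have accumulated, the most recent tarball is usually the one to send. A hedged sketch using stand-in files and paths (real reports live under the directories listed above, and the exact archive names may differ):

```shell
# Hypothetical example: find the newest SOS report tarball in a demo directory.
dir=/tmp/sosreport-demo
rm -rf "$dir"
mkdir -p "$dir"
touch "$dir/sosreport-old.tar.xz"
sleep 1
touch "$dir/sosreport-new.tar.xz"
# ls -t sorts by modification time, newest first.
newest=$(ls -t "$dir"/sosreport-*.tar.xz | head -1)
echo "newest report: $newest"
```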
The SOS report may contain sensitive information, because service configuration file data is included in the report. Remove any sensitive information before sending the SOS report tarball externally.