
Operations Guide


Applies to HPE Helion OpenStack 8

1 Operations Overview

1.1 What is a cloud operator?
1.2 Tools provided to operate your cloud
1.3 Daily tasks
1.4 Weekly or monthly tasks
1.5 Semi-annual tasks
1.6 Troubleshooting
1.7 Common Questions

This chapter gives a high-level overview of the processes involved in operating an HPE Helion OpenStack 8 cloud.

1.1 What is a cloud operator?

Before discussing operations, it is important to understand the scope of a cloud operator's tasks and responsibilities. HPE Helion OpenStack defines a cloud operator as the person or group of people who administer the cloud infrastructure, which includes:

Monitoring the cloud infrastructure and resolving issues as they arise.
Managing hardware resources, including adding or removing hardware as capacity needs change.
Repairing hardware issues and recovering from them when needed.
Performing domain administration tasks, which involves creating and managing projects, users, and groups as well as setting and managing resource quotas.

1.2 Tools provided to operate your cloud

HPE Helion OpenStack provides the following tools for operating your cloud:

Operations Console

The Operations Console, often referred to as the Ops Console, is a web-based graphical user interface (GUI) in which you can view data about your cloud infrastructure to make sure your cloud is operating correctly. By logging on to the console, HPE Helion OpenStack administrators can manage data in the following ways:

Triage alarm notifications in the central dashboard
Monitor the environment by giving priority to alarms that take precedence
Manage compute nodes and easily use a form to create a new host
Refine the monitoring environment by creating new alarms to specify a combination of metrics, services, and hosts that match the triggers unique to an environment
Plan for future storage by tracking capacity over time to predict with some degree of reliability the amount of additional storage needed

For more details on how to connect to and use the Operations Console, see Book “User Guide”, Chapter 1 “Using the Operations Console”, Section 1.1 “Operations Console Overview”.

Dashboard

The Dashboard, often referred to as Horizon or the Horizon dashboard, is a web-based graphical user interface (GUI) in which you can manage resources at the domain and project level. The following are some of the typical operational tasks that you may perform using the dashboard:

Creating and managing projects, users, and groups within your domain.
Assigning roles to users and groups to manage access to resources.
Setting and updating resource quotas for the projects.

For more details, see the following pages:

Section 4.3, “Understanding Domains, Projects, Users, Groups, and Roles”
Book “User Guide”, Chapter 3 “Cloud Admin Actions with the Dashboard”

Command-line interface (CLI)

Each service within HPE Helion OpenStack provides a command-line client. Examples include the novaclient (sometimes referred to as the python-novaclient or nova CLI) for the Compute service and the keystoneclient for the Identity service. There is also an effort in the OpenStack community to build a unified client, called the openstackclient, which combines the commands of the various service-specific clients into one tool. By default, each of the necessary clients is installed on the hosts in your environment for you to use.

Many of the processes defined in our documentation use these command-line tools. We have also outlined a list of common cloud administration tasks that you can perform with them. For more details, see Book “User Guide”, Chapter 4 “Cloud Admin Actions with the Command Line”.

There are references throughout the SUSE OpenStack Cloud documentation to the HPE Smart Storage Administrator (HPE SSA) CLI. HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f

1.3 Daily tasks

Ensure your cloud is running correctly: HPE Helion OpenStack is deployed as a set of highly available services to minimize the impact of failures. Even so, hardware and software systems can fail, and detecting failures early lets you address issues before they affect the broader system. HPE Helion OpenStack provides a monitoring solution, based on OpenStack’s Monasca, which provides monitoring and metrics for all OpenStack components and much of the underlying system, including service status, performance metrics, compute node status, and virtual machine status. Failures are exposed through the Operations Console, alarm notifications, or both. When you need more detailed diagnostics, you can use the centralized logging system, which is based on the Elasticsearch, Logstash, and Kibana (ELK) stack and lets you search service logs for detailed information on behavior and errors.
Perform critical maintenance: To ensure your OpenStack installation is running correctly, provides the right access and functionality, and is secure, you should make ongoing adjustments to the environment. Examples of daily maintenance tasks include:
- Add/remove projects and users. The frequency of this task depends on your policy.
- Apply security patches (if released).
- Run daily backups.

1.4 Weekly or monthly tasks

Do regular capacity planning: Your initial deployment will likely reflect your known near-term to mid-term scale requirements, but at some point your needs will outgrow its capacity. You can expand HPE Helion OpenStack in a variety of ways, such as by adding compute and storage capacity.

To manage your cloud’s capacity, begin by determining the load on the existing system. OpenStack is a set of relatively independent components and services, so there are multiple subsystems that can affect capacity. These include control plane nodes, compute nodes, object storage nodes, block storage nodes, and an image management system. At the most basic level, you should look at the CPU, RAM, I/O load, and disk space used relative to the amounts available. For compute nodes, you can also evaluate the allocation of resources to hosted virtual machines. This information can be viewed in the Operations Console. You can pull historical information from the monitoring service (OpenStack’s Monasca) by using its client or API. OpenStack also gives you some ability to manage hosted resource utilization through project quotas. By tracking usage over time, you can establish your growth trend and project when you will need to add capacity.
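The growth-trend projection described above can be sketched as a simple linear fit. The following is a minimal illustration in Python, not part of HPE Helion OpenStack: the function name `project_capacity_date` and the sample data are hypothetical, and it assumes you have already exported dated usage samples (for example, disk space used) from the monitoring service. Real usage rarely grows in a perfectly straight line, so treat the result as a rough planning estimate.

```python
import math
from datetime import date, timedelta


def project_capacity_date(samples, capacity):
    """Estimate when usage reaches `capacity` using a least-squares line.

    samples  -- list of (date, used) pairs, e.g. disk space used per day
    capacity -- total amount available, in the same unit as `used`

    Returns the projected date, or None if there are too few samples
    or usage is flat or shrinking. If usage already exceeds capacity,
    the returned date lies in the past.
    """
    if len(samples) < 2:
        return None

    origin = samples[0][0]
    xs = [(day - origin).days for day, _ in samples]
    ys = [float(used) for _, used in samples]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples were taken on the same day

    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope <= 0:
        return None  # no growth, so no projected exhaustion date

    intercept = mean_y - slope * mean_x
    days_until_full = (capacity - intercept) / slope
    return origin + timedelta(days=math.ceil(days_until_full))


# Hypothetical samples: terabytes of block storage used on three dates.
usage = [
    (date(2024, 1, 1), 100),
    (date(2024, 1, 31), 130),
    (date(2024, 3, 1), 160),
]
projected = project_capacity_date(usage, capacity=200)
```

Running the same projection separately for CPU, RAM, and disk space shows which resource is likely to run out first.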

1.5 Semi-annual tasks

Perform upgrades: OpenStack releases new versions on a six-month cycle. In general, HPE Helion OpenStack will release new major versions annually, with minor versions and maintenance updates more often. Each new release includes new functionality and services as well as bug fixes for existing functionality.

Note

If you are planning to upgrade, this is also an excellent time to evaluate your existing capabilities, especially in terms of capacity (see Section 1.4, “Weekly or monthly tasks”).

1.6 Troubleshooting

As part of managing your cloud, you should be ready to troubleshoot issues, as needed. The following are some common troubleshooting scenarios and solutions:

How do I determine if my cloud is operating correctly now?: HPE Helion OpenStack provides a monitoring solution based on OpenStack’s Monasca service. This service provides monitoring and metrics for all OpenStack components, as well as much of the underlying system. By default, HPE Helion OpenStack comes with a set of alarms that provide coverage of the primary systems. In addition, you can define alarms based on threshold values for any metric defined in the system. You can view alarm information in the Operations Console. You can also have alarm notifications sent to you or to others by configuring email or other notification mechanisms. Alarms tell you whether a component has failed and is affecting the system, and what condition triggered the alarm.
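To make the idea of a threshold-based alarm concrete, here is a simplified Python sketch. It is an illustration only and does not reproduce how Monasca evaluates alarm definitions: the function `alarm_state`, its `periods` parameter, and the sample values are all assumptions made for this example. The state names OK, ALARM, and UNDETERMINED mirror the alarm states that Monasca reports.

```python
def alarm_state(measurements, threshold, periods=3):
    """Classify a metric against a threshold (simplified illustration).

    measurements -- metric values, oldest first (e.g. CPU percent)
    threshold    -- value above which a measurement counts as a breach
    periods      -- how many of the latest measurements must all breach

    Returns "UNDETERMINED" when there is not enough data, "ALARM" when
    every one of the last `periods` measurements exceeds the threshold,
    and "OK" otherwise.
    """
    if len(measurements) < periods:
        return "UNDETERMINED"
    recent = measurements[-periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


# Hypothetical CPU utilization samples for one host, in percent.
cpu_percent = [50, 92, 95, 97]
state = alarm_state(cpu_percent, threshold=90)
```

Requiring several consecutive breaches before raising an alarm is one common way to avoid notifications for short spikes.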

How do I troubleshoot and resolve performance issues for my cloud?: There are a variety of factors that can affect the performance of a cloud system, such as the following:

Health of the control plane
Health of the hosting compute node and virtualization layer
Resource allocation on the compute node

If your cloud users are experiencing performance issues on your cloud, use the following approach:

View the compute summary page on the Operations Console to determine if any alarms have been triggered.
Determine the hosting node of the virtual machine that is having issues.
On the compute hosts page, view the status and resource utilization of the compute node to determine if it has errors or is over-allocated.
On the compute instances page, view the status of the VM along with its metrics.
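The over-allocation check in the steps above amounts to comparing what has been allocated to virtual machines with what the node physically provides. The following Python sketch shows that comparison under stated assumptions: the dictionary keys, the function name `over_allocated`, and the default ratios are hypothetical values chosen for illustration, not HPE Helion OpenStack settings. Use the overcommit ratios that your own deployment is configured with.

```python
def over_allocated(host, cpu_ratio=4.0, ram_ratio=1.0):
    """Return the list of resources that are over-allocated on a host.

    host      -- dict with physical and allocated amounts (keys are
                 hypothetical; adapt them to the data you collect)
    cpu_ratio -- allowed virtual-to-physical CPU overcommit (assumed)
    ram_ratio -- allowed RAM overcommit (assumed)
    """
    reasons = []
    if host["vcpus_allocated"] > host["cpus"] * cpu_ratio:
        reasons.append("cpu")
    if host["ram_allocated_mb"] > host["ram_mb"] * ram_ratio:
        reasons.append("ram")
    return reasons


# Hypothetical compute node: 16 physical CPUs and 64 GB of RAM.
node = {
    "cpus": 16,
    "vcpus_allocated": 80,
    "ram_mb": 65536,
    "ram_allocated_mb": 49152,
}
problems = over_allocated(node)
```

An empty list means the node is within the assumed limits, so the performance problem more likely lies with the VM itself or with the control plane.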

How do I troubleshoot and resolve availability issues for my cloud?: If your cloud users are experiencing availability issues, first determine what they are seeing that indicates to them that the cloud is down. For example, are they unable to access the Dashboard service (Horizon) console or the APIs, or are they having trouble accessing resources? Console or API issues indicate a problem with the control plane. Use the Operations Console to view the status of services to see if there is an issue. If the problem is with accessing a virtual machine, also search the consolidated logs available in the ELK stack for errors related to the virtual machine and its supporting networking.

1.7 Common Questions

To manage a cloud, how many administrators do I need?

A 24x7 cloud needs a 24x7 cloud operations team. If you already have a network operations center (NOC), managing the cloud can be added to its workload.

A cloud with 20 nodes will need one part-time person. You can manage a cloud with 200 nodes with two people. As the number of nodes increases, you will need more administrators, but once processes and automation are in place the need does not grow linearly. For example, if you have 3000 nodes and 15 clouds, you will probably need six administrators.

What skills do my cloud administrators need?

Your administrators should be experienced Linux administrators. They should have experience in application management, as well as experience with Ansible. Experience with Bash shell scripting and Python programming is a plus.

In addition, you will need networking engineers. A 3000-node environment will need two networking engineers.

What operations should I plan on performing daily, weekly, monthly, or semi-annually?

Plan for operations by understanding which tasks you need to do daily, weekly, monthly, or semi-annually. The specific list of tasks depends on your cloud configuration, but it should include the high-level tasks described in Chapter 2, “Tutorials”.
