1 Operations Overview
A high-level overview of the processes related to operating a SUSE OpenStack Cloud 9 cloud.
1.1 What is a cloud operator?
It is important to understand the scope of the tasks and responsibilities that the term cloud operator covers. SUSE OpenStack Cloud defines a cloud operator as the person or group of people who administer the cloud infrastructure, which includes:
Monitoring the cloud infrastructure and resolving issues as they arise.
Managing hardware resources, adding or removing hardware as capacity needs change.
Repairing hardware issues and recovering from them when needed.
Performing domain administration tasks, which involves creating and managing projects, users, and groups as well as setting and managing resource quotas.
1.2 Tools provided to operate your cloud
SUSE OpenStack Cloud provides the following tools to operate your cloud:
Operations Console
The Operations Console, often referred to as the Ops Console, is a web-based graphical user interface (GUI) that you can use to view data about your cloud infrastructure and make sure your cloud is operating correctly. By logging on to the console, SUSE OpenStack Cloud administrators can manage data in the following ways:
Triage alarm notifications in the central dashboard
Monitor the environment by giving priority to alarms that take precedence
Manage compute nodes and create a new host by filling in a form
Refine the monitoring environment by creating new alarms to specify a combination of metrics, services, and hosts that match the triggers unique to an environment
Plan for future storage by tracking capacity over time to predict with some degree of reliability the amount of additional storage needed
Dashboard
The Dashboard, often referred to as horizon or the horizon dashboard, is a web-based graphical user interface (GUI) that you can use to manage resources on a domain and project level. The following are some of the typical operational tasks that you may perform using the dashboard:
Creating and managing projects, users, and groups within your domain.
Assigning roles to users and groups to manage access to resources.
Setting and updating resource quotas for the projects.
For more details, see Section 5.3, “Understanding Domains, Projects, Users, Groups, and Roles”.
Command-line interface (CLI)
The OpenStack community has created a unified client, called the openstackclient (OSC), which combines the available commands in the various service-specific clients into one tool. Some service-specific commands do not have OSC equivalents.
Our documentation defines processes that use these command-line tools. It also outlines a list of common cloud administration tasks that you can perform with the command-line tools.
There are references throughout the SUSE OpenStack Cloud documentation to the HPE Smart Storage Administrator (HPE SSA) CLI. HPE-specific binaries that are not based on open source are distributed directly from and supported by HPE. To download and install the SSACLI utility, please refer to: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_3d16386b418a443388c18da82f
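As an illustration of how the domain administration tasks described above (creating projects, assigning roles, setting quotas) map onto the unified client, the following minimal sketch builds the corresponding `openstack` command lines without running them. The project, user, and role names and the quota values are hypothetical, and the helper `project_setup_commands` is not part of any SUSE or OpenStack tool. Depending on your domain setup you may also need options such as `--user-domain` or `--project-domain`; check the command help in your environment before use.

```python
import shlex


def project_setup_commands(domain, project, user, role, cores, instances):
    """Return OSC command lines for a basic project setup (nothing is executed).

    All argument values are supplied by the caller; the names used in the
    example below are hypothetical placeholders.
    """
    commands = [
        # Create the project inside the given domain.
        ["openstack", "project", "create", "--domain", domain, project],
        # Grant an existing user a role on the new project.
        ["openstack", "role", "add", "--project", project, "--user", user, role],
        # Set compute quotas for the project.
        ["openstack", "quota", "set",
         "--cores", str(cores), "--instances", str(instances), project],
    ]
    # shlex.join quotes arguments so the lines are safe to paste into a shell.
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    for line in project_setup_commands(
        "default", "demo-project", "demo-user", "member", cores=20, instances=10
    ):
        print(line)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere and lets you review each line against your own policy before applying it.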
1.3 Daily tasks
Ensure your cloud is running correctly: SUSE OpenStack Cloud is deployed as a set of highly available services to minimize the impact of failures. That said, hardware and software systems can fail. Detecting failures early lets you address issues before they affect the broader system. SUSE OpenStack Cloud provides a monitoring solution, based on OpenStack’s monasca, which provides monitoring and metrics for all OpenStack components and much of the underlying system, including service status, performance metrics, compute node status, and virtual machine status. Failures are exposed via the Operations Console, alarm notifications, or both. When more detailed diagnostics are required, you can use the centralized logging system based on the Elasticsearch, Logstash, and Kibana (ELK) stack, which lets you search service logs for detailed information on behavior and errors.
Perform critical maintenance: To ensure your OpenStack installation is running correctly, provides the right access and functionality, and is secure, you should make ongoing adjustments to the environment. Examples of daily maintenance tasks include:
Add/remove projects and users. The frequency of this task depends on your policy.
Apply security patches (if released).
Run daily backups.
1.4 Weekly or monthly tasks
Do regular capacity planning: Your initial deployment will likely reflect the known near- to mid-term scale requirements, but at some point your needs will outgrow your initial deployment’s capacity. You can expand SUSE OpenStack Cloud in a variety of ways, such as by adding compute and storage capacity.
To manage your cloud’s capacity, begin by determining the load on the existing system. OpenStack is a set of relatively independent components and services, so there are multiple subsystems that can affect capacity. These include control plane nodes, compute nodes, object storage nodes, block storage nodes, and an image management system. At the most basic level, you should look at the CPU used, RAM used, I/O load, and disk space used relative to the amounts available. For compute nodes, you can also evaluate the allocation of resources to hosted virtual machines. This information can be viewed in the Operations Console. You can pull historical information from the monitoring service (OpenStack’s monasca) by using its client or API. OpenStack also gives you some ability to manage hosted resource utilization by using quotas for projects. You can track this usage over time to get your growth trend so that you can project when you will need to add capacity.
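The growth-trend projection described above can be sketched as a simple linear fit. The example below is a minimal illustration, assuming you have already exported usage samples as (day, amount used) pairs from your monitoring data; the function names and the sample figures are hypothetical, and a straight-line fit is only one possible model of growth.

```python
def utilization(used, available):
    """Return the fraction of a resource in use (used relative to available)."""
    if available <= 0:
        raise ValueError("available must be positive")
    return used / available


def linear_trend(samples):
    """Least-squares slope and intercept for a list of (day, used) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(samples, capacity):
    """Days after the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking, because the trend
    then never reaches the capacity limit.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    day_full = (capacity - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, day_full - last_day)


if __name__ == "__main__":
    # Hypothetical weekly samples: (day number, terabytes used).
    storage_samples = [(0, 40), (7, 47), (14, 54), (21, 61)]
    print(utilization(61, 100))
    print(days_until_full(storage_samples, capacity=100))
```

With these hypothetical samples, usage grows by one terabyte per day, so the trend reaches the 100 TB limit about 39 days after the last sample. In practice you would want to add capacity well before that point.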
1.5 Semi-annual tasks
Perform upgrades: OpenStack releases new versions on a six-month cycle. In general, SUSE OpenStack Cloud will release new major versions annually with minor versions and maintenance updates more often. Each new release consists of both new functionality and services, as well as bug fixes for existing functionality.
If you are planning to upgrade, this is also an excellent time to evaluate your existing capabilities, especially in terms of capacity (see the capacity planning guidance under Section 1.4, “Weekly or monthly tasks”).
1.6 Troubleshooting
As part of managing your cloud, you should be ready to troubleshoot issues, as needed. The following are some common troubleshooting scenarios and solutions:
How do I determine if my cloud is operating correctly now?: SUSE OpenStack Cloud provides a monitoring solution based on OpenStack’s monasca service. This service provides monitoring and metrics for all OpenStack components, as well as much of the underlying system. By default, SUSE OpenStack Cloud comes with a set of alarms that provide coverage of the primary systems. In addition, you can define alarms based on threshold values for any metrics defined in the system. You can view alarm information in the Operations Console. You can also have this information delivered to yourself or to others by configuring email or other notification mechanisms. Alarms provide information about whether a component failed and is affecting the system, and also what condition triggered the alarm.
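To make the idea of a threshold-based alarm concrete, the following minimal sketch evaluates a list of per-period metric averages against a threshold. It is a simplified illustration and not monasca's expression engine: the function `alarm_state`, its parameters, and the sample values are assumptions made for this example. Only the state names OK, ALARM, and UNDETERMINED are borrowed from monasca.

```python
def alarm_state(period_averages, threshold, periods=1):
    """Evaluate a simple 'greater than threshold' alarm.

    period_averages: metric averages, oldest first, one value per period.
    threshold: value the metric must exceed to count as a breach.
    periods: number of most recent consecutive periods that must all breach.
    """
    if periods < 1:
        raise ValueError("periods must be at least 1")
    recent = period_averages[-periods:]
    if len(recent) < periods:
        # Not enough measurements yet to decide either way.
        return "UNDETERMINED"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


if __name__ == "__main__":
    # Hypothetical CPU utilization percentages for one host.
    cpu_percent = [50.0, 95.0, 96.0, 97.0]
    print(alarm_state(cpu_percent, threshold=90.0, periods=3))
```

Requiring several consecutive breaching periods is a common way to avoid raising an alarm on a single short spike.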
How do I troubleshoot and resolve performance issues for my cloud?: There are a variety of factors that can affect the performance of a cloud system, such as the following:
Health of the control plane
Health of the hosting compute node and virtualization layer
Resource allocation on the compute node
If your cloud users are experiencing performance issues on your cloud, use the following approach:
View the compute summary page on the Operations Console to determine if any alarms have been triggered.
Determine the hosting node of the virtual machine that is having issues.
On the compute hosts page, view the status and resource utilization of the compute node to determine if it has errors or is over-allocated.
On the compute instances page, view the status of the VM along with its metrics.
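The over-allocation check in the approach above can be sketched as a comparison of allocated resources against a node's physical resources multiplied by an allocation ratio. This is a minimal illustration only: the dictionary keys, the ratio parameters, and the sample node are hypothetical, and the ratios that apply to your cloud depend on how your compute service is configured.

```python
def allocation_report(node, cpu_ratio=1.0, ram_ratio=1.0):
    """Summarize how much of a compute node's capacity is allocated to VMs.

    node: dict with 'vcpus', 'vcpus_used', 'memory_mb', 'memory_mb_used'
          (hypothetical keys chosen for this example).
    cpu_ratio, ram_ratio: assumed overcommit ratios; 1.0 means no overcommit.
    """
    cpu_limit = node["vcpus"] * cpu_ratio
    ram_limit = node["memory_mb"] * ram_ratio
    cpu_fraction = node["vcpus_used"] / cpu_limit
    ram_fraction = node["memory_mb_used"] / ram_limit
    return {
        "cpu_allocated_fraction": cpu_fraction,
        "ram_allocated_fraction": ram_fraction,
        # Over-allocated when either resource exceeds its limit.
        "over_allocated": cpu_fraction > 1.0 or ram_fraction > 1.0,
    }


if __name__ == "__main__":
    # Hypothetical node: 32 physical cores, 128 GiB RAM.
    node = {"vcpus": 32, "vcpus_used": 40,
            "memory_mb": 131072, "memory_mb_used": 98304}
    print(allocation_report(node))
    print(allocation_report(node, cpu_ratio=2.0))
```

The same node can look over-allocated or healthy depending on the ratio you assume, so compare against the values your deployment actually uses.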
How do I troubleshoot and resolve availability issues for my cloud?: If your cloud users are experiencing availability issues, first determine what they are seeing that indicates to them the cloud is down. For example, are they unable to access the Dashboard service (horizon) console or the APIs, or are they having trouble accessing resources? Console or API issues indicate a problem with the control plane. Use the Operations Console to view the status of services to see if there is an issue. However, if the problem is access to a virtual machine, then also search the consolidated logs that are available in the ELK stack for errors related to the virtual machine and its supporting networking.
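When searching the consolidated logs for errors related to a virtual machine, the search can be expressed as an Elasticsearch query body. The sketch below only builds that body as a Python dictionary; it does not contact any server. The field names `message` and `log_level`, the helper name `vm_error_query`, and the instance ID are assumptions for illustration, so check the field names in your own log indices before using a query like this.

```python
import json


def vm_error_query(instance_id, since="now-1h", size=50):
    """Build an Elasticsearch query body for recent error logs about one VM.

    instance_id: identifier of the virtual machine to look for in log text.
    since: lower time bound in Elasticsearch date-math form.
    size: maximum number of log entries to return.
    """
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # The log text must mention the instance ID as a phrase.
                "must": [{"match_phrase": {"message": instance_id}}],
                "filter": [
                    # Assumed field holding the log severity.
                    {"match": {"log_level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical instance ID.
    body = vm_error_query("example-instance-id", since="now-6h")
    print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into the Kibana developer console or sent to the search API of your logging cluster.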
1.7 Common Questions
What skills do my cloud administrators need?
Your administrators should be experienced Linux admins. They should have experience in application management, as well as experience with Ansible. Experience with Bash shell scripting and Python programming is a plus.
In addition, you will need skilled networking engineering staff to administer the cloud network environment.