Operations Guide

Applies to HPE Helion OpenStack 8

13 System Maintenance

13.1 Planned System Maintenance
13.2 Unplanned System Maintenance
13.3 Cloud Lifecycle Manager Maintenance Update Procedure
13.4 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment
13.5 Periodic OpenStack Maintenance Tasks

Information about managing and configuring your cloud as well as procedures for performing node maintenance.

The sections in this chapter help you manage, configure, and maintain your HPE Helion OpenStack cloud.

13.1 Planned System Maintenance #

Planned maintenance tasks for your cloud are described in the sections below.

13.1.1 Whole Cloud Maintenance #

Planned maintenance procedures for your whole cloud.

13.1.1.1 Bringing Down Your Cloud: Services Down Method #

Important

If you have planned maintenance and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 13.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.

If you prefer rolling reboots, which keep your cloud services running, see Section 13.1.1.2, “Rolling Reboot of the Cloud”.

To back up your cloud before performing these steps, see Chapter 14, Backup and Restore.

13.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment #

Perform the following steps from your Cloud Lifecycle Manager.

Gracefully shut down your cloud by running the ardana-stop.yml playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml

Shut down your nodes. You should shut down your controller nodes last and bring them up first after the maintenance.
There are multiple ways you can do this:
1. You can SSH to each node and use sudo reboot -f to reboot the node.
2. From the Cloud Lifecycle Manager, you can use the bm-power-down.yml and bm-power-up.yml playbooks.
3. You can shut down the nodes and then physically restart them either via a power button or the IPMI.
Perform the necessary maintenance.
After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.

Determine the current power status of the nodes in your environment:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-status.yml

If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
```
Note
Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.
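When several nodes need powering up, their names have to be joined into a single comma-separated nodelist value. The following is a minimal sketch, assuming the input holds one node name per line, possibly indented; the helper name build_nodelist is hypothetical and not part of the product:

```shell
# Hypothetical helper: join node names (one per line on stdin) into the
# comma-separated form used by -e nodelist=<node1>,<node2>.
# Assumption: the node name is the first whitespace-delimited token on
# each non-blank line.
build_nodelist() {
  awk 'NF { print $1 }' | paste -s -d, -
}

# Example (not executed here):
#   nodes=$(printf '%s\n' node1 node2 | build_nodelist)
#   ansible-playbook -i hosts/verb_hosts bm-power-up.yml -e nodelist="$nodes"
```

Check the resulting list before passing it to bm-power-up.yml, so that controller nodes are still powered up first.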

Bring the databases back up:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

Gracefully bring up your cloud services by running the ardana-start.yml playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml

Pause for a few minutes and give the cloud environment time to come up completely and then verify the status of the individual services using this playbook:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
```
If any services did not start properly, you can run playbooks for the specific services having issues.
For example:
If RabbitMQ fails, run:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
```
You can check the status of RabbitMQ afterwards with this:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
```
If the recovery fails, you can run:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml
```
Each of the other services has a playbook in the ~/scratch/ansible/next/ardana/ansible directory, named in the format <service>-start.yml, that you can run. One example, for the compute service, is nova-start.yml.
Continue checking the status of your HPE Helion OpenStack 8 cloud services until there are no more failed or unreachable nodes:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
```
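Repeating the status check by hand can be tedious. The following is a minimal sketch of a generic retry loop that could wrap the status playbook; the function name retry_until_ok and the attempt and delay values in the example are assumptions, not product defaults:

```shell
# Hypothetical helper: re-run a command until it succeeds or the attempts
# run out.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Success: stop retrying immediately.
    if "$@"; then
      return 0
    fi
    # Give up once the last allowed attempt has failed.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (not executed here), run from ~/scratch/ansible/next/ardana/ansible:
#   retry_until_ok 10 120 ansible-playbook -i hosts/verb_hosts ardana-status.yml
```

A non-zero return after the final attempt is a signal to investigate the failing services individually rather than to keep waiting.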

13.1.1.2 Rolling Reboot of the Cloud #

If you have planned maintenance and need to reboot every node in your cloud while minimizing service downtime, follow the steps here to safely restart your cloud. If you do not mind your services being down, another option for planned maintenance is described in Section 13.1.1.1, “Bringing Down Your Cloud: Services Down Method”.

13.1.1.2.1 Recommended node reboot order #

To ensure that rebooted nodes reintegrate into the cluster, the key is having enough time between controller reboots.

The recommended way to achieve this is as follows:

Reboot controller nodes one-by-one with a suitable interval in between. If you alternate between controllers and compute nodes you will gain more time between the controller reboots.
Reboot of compute nodes (if present in your cloud).
Reboot of Swift nodes (if present in your cloud).
Reboot of ESX nodes (if present in your cloud).

13.1.1.2.2 Rebooting controller nodes #

Turn off the Keystone Fernet Token-Signing Key Rotation

Before rebooting any controller node, you need to ensure that the Keystone Fernet token-signing key rotation is turned off. Run the following command:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml

Migrate singleton services first

Note

If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:

sudo systemctl start apache2

The first consideration before rebooting any controller nodes is that there are a few services that run as singletons (non-HA), thus they will be unavailable while the controller they run on is down. Typically this is a very small window, but if you want to retain the service during the reboot of that server you should take special action to maintain service, such as migrating the service.

For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.

For the cinder-volume singleton service:

Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:

ps auxww | grep cinder-volume | grep -v grep

Run the cinder-migrate-volume.yml playbook. Details about Cinder volume and backup migration can be found in Section 7.1.3, “Managing Cinder Volume and Backup Services”.

For the nova-consoleauth singleton service:

The nova-consoleauth component runs by default on the first controller node, that is, the host with consoleauth_host_index=0. To move it to another controller node before rebooting controller 0, run the ansible playbook nova-start.yml and pass it the index of the next controller node. For example, to move it to controller 2 (index of 1), run:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-start.yml --extra-vars "consoleauth_host_index=1"

After you run this command, the nova service-list command may show two instances of the nova-consoleauth service, with the duplicate shown in a disabled state. You can delete the duplicate service using these steps.

Obtain the service ID for the duplicated nova-consoleauth service:

nova service-list

Example:

$ nova service-list
+----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
| Id | Binary           | Host                      | Zone     | Status   | State | Updated_at                 | Disabled Reason |
+----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
| 1  | nova-conductor   | ...a-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
| 10 | nova-conductor   | ...a-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:47.000000 | -               |
| 13 | nova-conductor   | ...a-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
| 16 | nova-scheduler   | ...a-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:39.000000 | -               |
| 19 | nova-scheduler   | ...a-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:41.000000 | -               |
| 22 | nova-scheduler   | ...a-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:44.000000 | -               |
| 25 | nova-consoleauth | ...a-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:45.000000 | -               |
| 49 | nova-compute     | ...a-cp1-comp0001-mgmt | nova     | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
| 52 | nova-compute     | ...a-cp1-comp0002-mgmt | nova     | enabled  | up    | 2016-08-25T12:11:41.000000 | -               |
| 55 | nova-compute     | ...a-cp1-comp0003-mgmt | nova     | enabled  | up    | 2016-08-25T12:11:43.000000 | -               |
| 70 | nova-consoleauth | ...a-cp1-c1-m3-mgmt    | internal | disabled | down  | 2016-08-25T12:10:40.000000 | -               |
+----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+

Delete the disabled duplicate service with this command:
```
nova service-delete <service_ID>
```
Given the example in the previous step, the command could be:
```
nova service-delete 70
```
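Picking the service ID out of the table by eye is error-prone on larger clouds. The following is a minimal sketch, assuming the table layout shown in the example above (Id, Binary, Host, Zone, Status as the first five columns); the function name disabled_consoleauth_ids is hypothetical:

```shell
# Hypothetical helper: read `nova service-list` table output on stdin and
# print the Id of every nova-consoleauth row whose Status is "disabled".
# Assumption: columns are ordered as in the example table above.
disabled_consoleauth_ids() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($3) == "nova-consoleauth" && trim($6) == "disabled" { print trim($2) }
  '
}

# Example (not executed here):
#   nova service-list | disabled_consoleauth_ids
```

Review the printed IDs before passing any of them to nova service-delete.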

For the SNAT namespace singleton service:

If you reboot the controller node hosting the SNAT namespace service, Compute instances without floating IPs will lose network connectivity while that controller is down. To prevent this, use these steps to determine which controller node is hosting the SNAT namespace service and migrate it to one of the other controller nodes while that node is rebooted.

Locate the SNAT node where the router is providing the active snat_service:

From the Cloud Lifecycle Manager, list out your ports to determine which port is serving as the router gateway:

source ~/service.osrc
neutron port-list --device_owner network:router_gateway

Example:

$ neutron port-list --device_owner network:router_gateway
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| id                                   | name | mac_address       | fixed_ips                                                                           |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| 287746e6-7d82-4b2c-914c-191954eba342 |      | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+

Look at the details of this port to determine the binding:host_id value, which identifies the host to which the port is bound:

neutron port-show <port_id>

Example; the value you need is in the binding:host_id row:

$ neutron port-show 287746e6-7d82-4b2c-914c-191954eba342
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                        |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                         |
| allowed_address_pairs |                                                                                                              |
| binding:host_id       | ardana-cp1-c1-m2-mgmt                                                                                        |
| binding:profile       | {}                                                                                                           |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
| binding:vif_type      | ovs                                                                                                          |
| binding:vnic_type     | normal                                                                                                       |
| device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
| device_owner          | network:router_gateway                                                                                       |
| dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
| dns_name              |                                                                                                              |
| extra_dhcp_opts       |                                                                                                              |
| fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
| id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
| mac_address           | fa:16:3e:2e:26:ac                                                                                            |
| name                  |                                                                                                              |
| network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
| security_groups       |                                                                                                              |
| status                | DOWN                                                                                                         |
| tenant_id             |                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------+

In this example, the ardana-cp1-c1-m2-mgmt is the node hosting the SNAT namespace service.
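The host name can also be pulled out of the port details with a small filter. The following is a minimal sketch, assuming the two-column Field/Value table layout shown above; the function name port_binding_host is hypothetical:

```shell
# Hypothetical helper: read `neutron port-show` table output on stdin and
# print the value of the binding:host_id field.
port_binding_host() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($2) == "binding:host_id" { print trim($3) }
  '
}

# Example (not executed here):
#   neutron port-show <port_id> | port_binding_host
```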

SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:
```
ssh <IP_of_SNAT_namespace_host>
sudo ip netns exec snat-<router_ID> bash
```
Example:
```
sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
```

Obtain the ID for the L3 Agent for the node hosting your SNAT namespace:

source ~/service.osrc
neutron agent-list

Example; given the examples above, the entry you need is the L3 agent on host ardana-cp1-c1-m2-mgmt:

$ neutron agent-list
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent           | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent       | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-metadata-agent    |
| 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent             | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-vpn-agent         |
| 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent             | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-l3-agent          |
| 58f01f34-b6ca-4186-ac38-b56ee376ffeb | Loadbalancerv2 agent | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-lbaasv2-agent     |
| 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent   | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent             | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-vpn-agent         |
| 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent       | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-metadata-agent    |
| 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent       | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-metadata-agent    |
| a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent             | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-vpn-agent         |
| a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent           | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent   | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent   | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-openvswitch-agent |
| e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent   | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-openvswitch-agent |
| f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent           | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-dhcp-agent        |
| fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent       | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-metadata-agent    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

Also obtain the ID for the L3 Agent of the node you are going to move the SNAT namespace service to using the same commands as the previous step.
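Both lookups follow the same pattern: find the row whose agent_type is "L3 agent" for a given host. The following is a minimal sketch, assuming the column order shown in the example table above; the function name l3_agent_id_for_host is hypothetical:

```shell
# Hypothetical helper: read `neutron agent-list` table output on stdin and
# print the ID of the L3 agent running on the given host.
# Usage: neutron agent-list | l3_agent_id_for_host <host>
l3_agent_id_for_host() {
  host=$1
  awk -F'|' -v host="$host" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($3) == "L3 agent" && trim($4) == host { print trim($2) }
  '
}
```

With the example table above, passing ardana-cp1-c1-m2-mgmt selects the agent used in the removal example that follows.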

Use these commands to move the SNAT namespace service. In each command, <qrouter_uuid> and <router_id> both refer to the ID of the router, which is the device_id value of the gateway port:

Remove the router from the L3 agent on the old host:

neutron l3-agent-router-remove <agent_id_of_snat_namespace_host> <qrouter_uuid>

Example:

$ neutron l3-agent-router-remove a209c67d-c00f-4a00-b31c-0db30e9ec661 e122ea3f-90c5-4662-bf4a-3889f677aacf
Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent

Remove the SNAT namespace:

sudo ip netns delete snat-<router_id>

Example:

$ sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf

Add the router to the L3 agent on the new host:

neutron l3-agent-router-add <agent_id_of_new_snat_namespace_host> <qrouter_uuid>

Example:

$ neutron l3-agent-router-add 3bc28451-c895-437b-999d-fdcff259b016 e122ea3f-90c5-4662-bf4a-3889f677aacf
Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent

Confirm that the service has been moved by listing the details of the router gateway port you examined earlier. The value of binding:host_id should now be the host you moved your SNAT namespace to:

neutron port-show <port_ID>

Example:

$ neutron port-show 287746e6-7d82-4b2c-914c-191954eba342
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| Field                 | Value                                                                                                        |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                                         |
| allowed_address_pairs |                                                                                                              |
| binding:host_id       | ardana-cp1-c1-m1-mgmt                                                                                        |
| binding:profile       | {}                                                                                                           |
| binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
| binding:vif_type      | ovs                                                                                                          |
| binding:vnic_type     | normal                                                                                                       |
| device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
| device_owner          | network:router_gateway                                                                                       |
| dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
| dns_name              |                                                                                                              |
| extra_dhcp_opts       |                                                                                                              |
| fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
| id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
| mac_address           | fa:16:3e:2e:26:ac                                                                                            |
| name                  |                                                                                                              |
| network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
| security_groups       |                                                                                                              |
| status                | DOWN                                                                                                         |
| tenant_id             |                                                                                                              |
+-----------------------+--------------------------------------------------------------------------------------------------------------+
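Because the move involves three commands that share the same IDs, it can help to generate them together for review before running anything. The following is a minimal sketch that only prints the commands; the function name snat_move_commands is hypothetical, and running the namespace deletion on the old SNAT host is an assumption based on the namespace being local to that host:

```shell
# Hypothetical helper: print the three commands that move a router from
# one L3 agent to another, in the order used in this section.
# Usage: snat_move_commands <old_agent_id> <new_agent_id> <router_id>
snat_move_commands() {
  old_agent=$1
  new_agent=$2
  router=$3
  printf 'neutron l3-agent-router-remove %s %s\n' "$old_agent" "$router"
  # Assumption: this command is run on the old SNAT namespace host.
  printf 'sudo ip netns delete snat-%s\n' "$router"
  printf 'neutron l3-agent-router-add %s %s\n' "$new_agent" "$router"
}

# Example (not executed here):
#   snat_move_commands <old_agent_id> <new_agent_id> <router_id>
```

Printing rather than executing keeps the operator in control of where each command runs.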

Reboot the controllers

In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services.

for i in $(grep -w cluster-prefix ~/openstack/my_cloud/definition/data/control_plane.yml | awk '{print $2}'); do grep $i ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts | grep ansible_ssh_host | awk '{print $1}'; done
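The one-liner above is dense, so it may help to see the same pipeline spread over several lines. The following is a sketch of an equivalent function that takes the two file paths as arguments; the function name list_control_plane_hosts is hypothetical, and the behavior is intended to match the original command:

```shell
# Hypothetical helper: the same pipeline as the one-liner, split up.
# Usage: list_control_plane_hosts <control_plane.yml> <verb_hosts>
list_control_plane_hosts() {
  control_plane_yml=$1
  verb_hosts=$2
  # Take the second field of every cluster-prefix line as the prefix.
  grep -w cluster-prefix "$control_plane_yml" | awk '{print $2}' |
  while read -r prefix; do
    # Keep inventory lines that mention the prefix and define
    # ansible_ssh_host, then print the host name in the first field.
    grep -- "$prefix" "$verb_hosts" | grep ansible_ssh_host | awk '{print $1}'
  done
}

# Example (not executed here):
#   list_control_plane_hosts \
#     ~/openstack/my_cloud/definition/data/control_plane.yml \
#     ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts
```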

Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:

If any singleton services are active on this node, they will be unavailable while the node is down. If you want to retain the service during the reboot, you should take special action to maintain the service, such as migrating the service as appropriate as noted above.

Stop all services on the controller node that you are rebooting first:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <controller node>

Reboot the controller node, e.g. run the following command on the controller itself:
```
sudo reboot
```
Note that the current node being rebooted could be hosting the lifecycle manager.
Wait for the controller node to become ssh-able and allow an additional minimum of five minutes for the controller node to settle. Start all services on the controller node:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <controller node>
```

Verify that the status of all services on the controller node is OK:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit <controller node>

When the start operation above has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off that node first.

Note

It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).
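The per-controller sequence (stop, reboot, wait, start, verify) can be sketched as a single function with a dry-run mode. This is an illustration only: the names run, cycle_controller, DRY_RUN, and SETTLE_SECONDS are hypothetical, the 30-second and 10-second waits are arbitrary assumptions, and the sketch does not apply when the node being rebooted hosts the Cloud Lifecycle Manager itself:

```shell
# Print the command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: stop, reboot, and restart one controller node.
# Run from ~/scratch/ansible/next/ardana/ansible on the Cloud Lifecycle Manager.
cycle_controller() {
  node=$1
  # Five minutes is the minimum settle time given in the steps above.
  settle=${SETTLE_SECONDS:-300}
  run ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit "$node" || return 1
  # The SSH session usually drops during the reboot, so ignore its status.
  run ssh "$node" sudo reboot || true
  # Assumed grace period so the node is really down before polling.
  run sleep 30
  # Poll until the node accepts SSH connections again.
  until run ssh -o ConnectTimeout=5 "$node" true; do
    run sleep 10
  done
  run sleep "$settle"
  run ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit "$node" || return 1
  run ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit "$node"
}

# Example (not executed here): preview the commands for one node.
#   DRY_RUN=1 cycle_controller <controller node>
```

The function returns non-zero if the stop, start, or status playbook fails, which matches the rule that the next controller must wait until the status check succeeds.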

Re-enable the Keystone Fernet Token-Signing Key Rotation

After all the controller nodes are successfully updated and back online, you need to re-enable the Keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml

13.1.1.2.3 Rebooting compute nodes #

To reboot a compute node, perform the following operations:

Disable provisioning on the node so that no further instances are scheduled to it during the reboot.
Identify the instances that exist on the compute node, and then either:
- Live migrate the instances off the node before rebooting it, OR
- Stop the instances
Reboot the node
Restart the Nova services

Disable provisioning:
```
nova service-disable --reason "<describe reason>" <node name> nova-compute
```
If the node has instances running on it, migrate or stop them before rebooting the node.
Identify the instances on the compute node. Note: The following command must be run with nova admin credentials.
```
nova list --host <hostname> --all-tenants
```
Migrate or Stop the instances on the compute node.
Migrate the instances off the node by running one of the following commands for each of the instances:
If your instance is booted from a volume and has any number of Cinder volumes attached, use the nova live-migration command:
```
nova live-migration <instance uuid> [<target compute host>]
```
If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:
```
nova live-migration --block-migrate <instance uuid> [<target compute host>]
```
Note: The <target compute host> argument is optional. If you do not specify a target host, the nova scheduler will choose a node for you.
OR
Stop the instances on the node by running the following command for each of the instances:
```
nova stop <instance-uuid>
```
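Since the migrate or stop command has to be repeated for each instance, the instance IDs can be extracted from the listing and fed to a loop. The following is a minimal sketch, assuming the ID is the first column of the nova list table; the function name instance_ids is hypothetical:

```shell
# Hypothetical helper: print the instance UUIDs found in the first column
# of `nova list` table output read from stdin.
instance_ids() {
  awk -F'|' '
    {
      id = $2
      gsub(/[ \t]/, "", id)
      # Keep only values shaped like a UUID (8-4-4-4-12 hex digits).
      if (length(id) == 36 && id ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
        print id
    }
  '
}

# Example (not executed here):
#   nova list --host <hostname> --all-tenants | instance_ids |
#   while read -r id; do nova stop "$id"; done
```

Replace nova stop in the loop with the appropriate nova live-migration form if you are migrating instead of stopping.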

Stop all services on the Compute node:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>

SSH to your Compute nodes and reboot them:
```
sudo reboot
```
The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.
Run the ardana-start.yml playbook from the Cloud Lifecycle Manager. If needed, use the bm-power-up.yml playbook to restart the node. Specify just the node(s) you want to start in the 'nodelist' parameter arguments, that is, nodelist=<node1>[,<node2>][,<node3>].
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
```
Execute the ardana-start.yml playbook, specifying the node(s) you want to start in the 'limit' parameter arguments. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
```

Re-enable provisioning on the node:

nova service-enable <node-name> nova-compute

Restart any instances you stopped.
```
nova start <instance-uuid>
```

13.1.1.2.4 Rebooting Swift nodes #

If your Swift services are on a controller node, follow the controller node reboot instructions above.

For a dedicated Swift PAC cluster or Swift Object resource node:

Perform the following steps for each Swift host:

Stop all services on the Swift node:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <Swift node>

Reboot the Swift node by running the following command on the Swift node itself:
```
sudo reboot
```

Wait for the node to become ssh-able and then start all services on the Swift node:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <Swift node>

13.1.1.2.5 Get list of status playbooks #

Running the following command will yield a list of status playbooks:

cd ~/scratch/ansible/next/ardana/ansible
ls *status*

Here is the list:

$ ls *status*
bm-power-status.yml          heat-status.yml      logging-producer-status.yml
ceilometer-status.yml        FND-AP2-status.yml   ardana-status.yml
FND-CLU-status.yml           horizon-status.yml   logging-status.yml
cinder-status.yml            freezer-status.yml   ironic-status.yml
cmc-status.yml               glance-status.yml    keystone-status.yml
galera-status.yml            memcached-status.yml nova-status.yml
logging-server-status.yml    monasca-status.yml   ops-console-status.yml
monasca-agent-status.yml     neutron-status.yml   rabbitmq-status.yml
swift-status.yml             zookeeper-status.yml
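The file names in this listing follow a <name>-status.yml pattern, so the name prefixes can be derived from the directory contents. The following is a minimal sketch; the function name status_playbook_prefixes is hypothetical:

```shell
# Hypothetical helper: print the name prefix of every *-status.yml playbook
# in the given directory (default: current directory), one per line.
status_playbook_prefixes() {
  dir=${1:-.}
  for f in "$dir"/*-status.yml; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$f" ] || continue
    name=${f##*/}
    printf '%s\n' "${name%-status.yml}"
  done
}

# Example (not executed here), run from ~/scratch/ansible/next/ardana/ansible:
#   status_playbook_prefixes
```

For the listing above, this would print prefixes such as nova, galera, and bm-power.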

13.1.2 Planned Control Plane Maintenance #

Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.

13.1.2.1 Replacing a Controller Node #

This section outlines steps for replacing a controller node in your environment.

HPE Helion OpenStack requires three controller nodes, so adding or removing controller nodes is not an option. However, you can repair or replace a controller node by following the steps outlined here. Always run cloud maintenance playbooks from the Cloud Lifecycle Manager.

The steps depend on whether you are replacing a shared Cloud Lifecycle Manager/controller node or a standalone controller node.

Keep in mind while performing the following tasks:

Do not add entries for a new server. Instead, update the entries for the broken one.
Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.

13.1.2.1.1 Replacing a Shared Cloud Lifecycle Manager/Controller Node #

If the controller node you need to replace was also being used as your Cloud Lifecycle Manager, use the steps below. If this is not a shared controller, skip to the next section.

To ensure that you use the same version of HPE Helion OpenStack that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the installation guide:
1. Book “Installing with Cloud Lifecycle Manager”, Chapter 3 “Installing the Cloud Lifecycle Manager server”, Section 3.5.2 “Installing the HPE Helion OpenStack Extension”
2. To restore your data, see Section 13.2.2.2.3, “Point-in-time Cloud Lifecycle Manager Recovery”
On the new node, update your cloud model with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields to reflect the attributes of the node. Do not change the id, ip-addr, role, or server-group settings.
Note
When imaging servers with your own tooling, it is still necessary to have ILO/IPMI settings for all nodes. Even if you are not using Cobbler, the username and password fields in servers.yml need to be filled in with dummy settings. For example, add the following to servers.yml:
```
ilo-user: manual
ilo-password: deployment
```
Get the servers.yml file stored in git:
```
cd ~/openstack/my_cloud/definition/data
git checkout site
```
Then change, as necessary, the mac-addr, ilo-ip, ilo-password, and ilo-user fields of this existing controller node. Save and commit the change:
```
git commit -a -m "repaired node X"
```
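
The rule for this step (change only mac-addr, ilo-ip, ilo-password, and ilo-user; leave id, ip-addr, role, and server-group alone) can be checked mechanically before you commit. The following is a minimal sketch and not part of the product tooling: `check_servers_diff` is a hypothetical helper name, and it assumes every changed line in the diff is a simple `key: value` pair.

```shell
# Hypothetical helper (not shipped with the product): read a unified diff on
# stdin and flag changed lines whose YAML key is not one of the four fields
# that may change when a controller is replaced.
check_servers_diff() {
    awk '
        /^(\+\+\+ |--- )/ { next }        # skip the diff file headers
        /^[+-]/ {
            line = substr($0, 2)          # drop the leading + or -
            sub(/^[ \t-]+/, "", line)     # drop indentation and a YAML list dash
            if (line == "") next
            key = line
            sub(/:.*/, "", key)           # keep only the text before the first colon
            if (key != "mac-addr" && key != "ilo-ip" &&
                key != "ilo-password" && key != "ilo-user") {
                print "unexpected change: " $0
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Example use from ~/openstack/my_cloud/definition/data, before committing:
#   git diff -U0 -- servers.yml | check_servers_diff && echo "only expected fields changed"
```

The helper returns a non-zero status and prints each offending line when a protected field was touched, so it can gate the commit in a script.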

Run the configuration processor as follows:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Then run ready-deployment:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Deploy Cobbler:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
```
Note
After this step you may see failures because MariaDB has not finished syncing. If so, rerun this step.
Delete the haproxy user:
```
sudo userdel haproxy
```

Install the software on your new Cloud Lifecycle Manager/controller node with these three playbooks:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml
ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller-hostname>,<first-proxy-hostname>

Alarms will appear while the node is being replaced. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
```

13.1.2.1.2 Replacing a Standalone Controller Node #

If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.

Log in to the Cloud Lifecycle Manager.
Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
```
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
```

Run the configuration processor:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Remove the old controller node(s) from Cobbler. You can list out the systems in Cobbler currently with this command:
```
sudo cobbler system list
```
and then remove the old controller nodes with this command:
```
sudo cobbler system remove --name <node>
```

Remove the SSH key of the old controller node from the known hosts file. You will specify the ip-addr value:

ssh-keygen -f ~/.ssh/known_hosts -R <ip_addr>

You should see a response similar to this one:

ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f ~/.ssh/known_hosts -R 10.13.111.135
# Host 10.13.111.135 found: line 6 type ECDSA
~/.ssh/known_hosts updated.
Original contents retained as ~/.ssh/known_hosts.old

Run the cobbler-deploy playbook to add the new controller node:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml

Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
```
Important
You must ensure that the old controller node is powered off before completing this step. This is because the new controller node will re-use the original IP address.

Configure the necessary keys used by the database and other services:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml

Run osconfig on the replacement controller node. For example:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>

If the controller being replaced is the Swift ring builder (see Section 15.6.2.4, “Identifying the Swift Ring Building Server”) you need to restore the Swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 15.6.2.7, “Recovering Swift Builder Files” for details.
Run the ardana-deploy playbook on the replacement controller.
If the node being replaced is the Swift ring builder server, you only need to specify that node with the --limit switch. Otherwise, specify both the hostname of your Swift ring builder server and the hostname of the node being replaced.
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True \
  --limit=<controller-hostname>,<swift-ring-builder-hostname>
```
Important
If you receive a Keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys.
In this situation, do not use the --limit option when running keystone-reconfigure.yml. In order to re-sync Fernet keys, all the controller nodes must be in the play.
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
```
Important
If you receive a RabbitMQ failure when running this playbook, review Section 15.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue and then re-run the ardana-deploy playbook.
Alarms will appear while the node is being replaced. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
```
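
The --limit rule for the ardana-deploy step in this procedure can be sketched as a small function. This is an illustration only: `deploy_limit` is a hypothetical helper name and the hostnames in the example are placeholders.

```shell
# Hypothetical helper: build the --limit value for ardana-deploy.yml.
#   $1  hostname of the controller being replaced
#   $2  hostname of the Swift ring builder server
deploy_limit() {
    if [ "$1" = "$2" ]; then
        # The replaced node is the ring builder: limit to that node only.
        printf '%s\n' "$1"
    else
        # Otherwise include both the replaced node and the ring builder.
        printf '%s,%s\n' "$1" "$2"
    fi
}

# Example use (hostnames are placeholders):
#   ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True \
#       --limit="$(deploy_limit ardana-cp1-c1-m2-mgmt ardana-cp1-c1-m1-mgmt)"
```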

13.1.3 Planned Compute Maintenance #

Planned maintenance tasks for compute nodes.

13.1.3.1 Planned Maintenance for a Compute Node #

Follow this procedure if one or more of your compute nodes needs hardware maintenance and you can schedule the work in advance.

13.1.3.1.1 Performing planned maintenance on a compute node #

If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:

Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
```
source ~/service.osrc
```

Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

nova host-list | grep compute

The following example shows two compute nodes:

$ nova host-list | grep compute
| ardana-cp1-comp0001-mgmt | compute     | AZ1      |
| ardana-cp1-comp0002-mgmt | compute     | AZ2      |

Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:
```
nova service-disable --reason "Maintenance mode" <hostname> nova-compute
```
Note
Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:
```
nova service-enable <hostname> nova-compute
```

At this point you have two choices:

Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.
Stop/start the instances: Issuing nova stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.

If you choose the live migration route, see Section 13.1.3.3, “Live Migration of Instances” for more details. Skip to step #6 after you finish live migration.

If you choose the stop/start method, continue on.

List all of the instances on the node so you can issue stop commands to them:
```
nova list --host <hostname> --all-tenants
```
Issue the nova stop command against each of the instances:
```
nova stop <instance uuid>
```

Confirm that the instances are stopped. If stoppage was successful you should see the instances in a SHUTOFF state, as shown here:

$ nova list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+

Do your required maintenance. If this maintenance does not take down the disks completely then you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:
```
nova list --host <hostname> --all-tenants
```

Start the instances back up using this command:

nova start <instance uuid>

Example:

$ nova start ef31c453-f046-4355-9bd3-11e774b1772f
Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.

Confirm that the instances started back up. If restarting is successful you should see the instances in an ACTIVE state, as shown here:

$ nova list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+

If the nova start fails, you can try doing a hard reboot:
```
nova reboot --hard <instance uuid>
```
If this does not resolve the issue you may want to contact support.

Re-enable provisioning when the node is fixed:
```
nova service-enable <hostname> nova-compute
```
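
When a node hosts many instances, the stop and verify steps of the stop/start method can be scripted. The sketch below is one possible approach, not a supported tool: the function names are hypothetical, and the parsing assumes the table layout that `nova list --all-tenants` prints in the examples in this section (ID, Name, Tenant ID, Status).

```shell
# Print the ID column of a `nova list` table read from stdin.
nova_table_ids() {
    awk -F'|' '
        /^\|/ {
            id = $2
            gsub(/^[ \t]+|[ \t]+$/, "", id)      # trim surrounding blanks
            if (id != "" && id != "ID") print id # skip the header row
        }
    '
}

# Print the IDs of instances whose Status column differs from $1.
# Assumes the --all-tenants layout: ID, Name, Tenant ID, Status.
nova_table_ids_not_in_status() {
    awk -F'|' -v want="$1" '
        /^\|/ {
            id = $2; status = $5
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (id != "" && id != "ID" && status != want) print id
        }
    '
}

# Example use on the Cloud Lifecycle Manager, after sourcing ~/service.osrc
# (HOST is a placeholder for the compute hostname):
#   nova list --host "$HOST" --all-tenants | nova_table_ids |
#       while read -r id; do nova stop "$id"; done
#   nova list --host "$HOST" --all-tenants | nova_table_ids_not_in_status SHUTOFF
```

If the second command prints nothing, every instance on the host has reached the SHUTOFF state. The same check with ACTIVE as the argument can be used after starting the instances again.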

13.1.3.2 Rebooting a Compute Node #

If all you need to do is reboot a Compute node, the following steps can be used.

You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. This playbook will ensure that all services on the Compute node are restarted properly.

Log in to the Cloud Lifecycle Manager.
Reboot the Compute node(s) with the following playbook.
You can specify either single or multiple Compute nodes using the --limit switch.
An optional reboot wait time can also be specified. If no reboot wait time is specified it will default to 300 seconds.
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
```
Note
If the Compute node fails to reboot, you should troubleshoot this issue separately as this playbook will not attempt to recover after a failed reboot.
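
To make the placeholders in the playbook command concrete, the following sketch prints the command line for one node with the default wait time and for two nodes with a longer wait. The helper name `reboot_command` is hypothetical, the 600-second value is an arbitrary example, and the comma-separated host list relies on standard Ansible --limit syntax.

```shell
# Hypothetical helper: print the nova-compute-reboot.yml command line.
#   $1  one hostname, or several separated by commas
#   $2  optional reboot wait time in seconds (the playbook default is 300)
reboot_command() {
    cmd="ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit $1"
    if [ -n "${2:-}" ]; then
        cmd="$cmd -e nova_reboot_wait_timeout=$2"
    fi
    printf '%s\n' "$cmd"
}

# One node, default wait time.
reboot_command ardana-cp1-comp0001-mgmt
# Two nodes, waiting up to 600 seconds.
reboot_command ardana-cp1-comp0001-mgmt,ardana-cp1-comp0002-mgmt 600
```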

13.1.3.3 Live Migration of Instances #

Live migration allows you to move active compute instances between compute nodes, allowing for less downtime during maintenance.

HPE Helion OpenStack Nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use will depend on the state of the host, what operating system is on the host, what type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off of the host. We will describe these options on this page as well as give you step-by-step instructions for performing them.

13.1.3.3.1 Migration Options #

If your compute node has failed

A compute host failure could be caused by a hardware failure, such as a data disk that needs to be replaced, a loss of power, or any other type of failure that requires you to replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.

In these cases you will want to use one of the Nova evacuate commands, which will cause Nova to rebuild the instances on other hosts.

This table describes each of the evacuate options for failed compute nodes:

Command	Description
`nova evacuate <instance> <hostname>`	This command is used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified then the Nova scheduler will choose one for you. See `nova help evacuate` for more information and syntax. Further details can also be seen in the OpenStack documentation at http://docs.openstack.org/admin-guide/cli_nova_evacuate.html.
`nova host-evacuate <hostname> --target_host <target_hostname>`	This command is used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified then the Nova scheduler will choose a target host for each instance. See `nova help host-evacuate` for more information and syntax.

If your compute host is active, powered on and the data disks are in working order you can utilize the migration commands to migrate your compute instances. There are two migration features, "cold" migration (also referred to simply as "migration") and live migration. Migration and live migration are two different functions.

Cold migration is used to copy the data of an instance in a SHUTOFF state from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the nova migrate function is disabled by default, but you can enable it if you would like. Details on how to do this can be found in Section 5.4, “Enabling the Nova Resize and Migrate Features”.

Live migration can be performed on instances in either an ACTIVE or PAUSED state. It uses the QEMU hypervisor to manage the copy of the running processes and associated resources to the destination compute host using the hypervisor's own protocol, which makes it a more secure method and allows for less downtime. During a live migration there may be a short network outage, usually a few milliseconds but possibly up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.

The compute host must remain powered on during the migration process.

Both the cold migration and live migration options will honor Nova group policies, which include affinity settings. There is a limitation to keep in mind if you use group policies, which is discussed in Section 13.1.3.3, “Live Migration of Instances”.

This table describes each of the migration options for active compute nodes:

Command	Description	SLES
`nova migrate <instance_uuid>`	Used to cold migrate a single instance from a compute host. The `nova-scheduler` will choose the new host. This command will work against instances in an `ACTIVE` or `SHUTOFF` state. The instances, if active, will be shutdown and restarted. Instances in a `PAUSED` state cannot be cold migrated. See the difference between cold migration and live migration at the start of this section.
`nova host-servers-migrate <hostname>`	Used to cold migrate all instances off a specified host to other available hosts, chosen by the `nova-scheduler`. This command will work against instances in an `ACTIVE` or `SHUTOFF` state. The instances, if active, will be shutdown and restarted. Instances in a `PAUSED` state cannot be cold migrated. See the difference between cold migration and live migration at the start of this section.
`nova live-migration <instance_uuid> [<target host>]`	Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in `ACTIVE` or `PAUSED` states only.	X
`nova live-migration --block-migrate <instance_uuid> [<target host>]`	Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in `ACTIVE` or `PAUSED` states only.	X
`nova host-evacuate-live <hostname> [--target-host <target_hostname>]`	Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in `ACTIVE` or `PAUSED` states only.	X
`nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]`	Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in `ACTIVE` or `PAUSED` states only.	X

13.1.3.3.2 Limitations of these Features #

There are limitations that may impact your use of this feature:

To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state then cold migration should be used.
Instances in a Paused state cannot be live migrated using the Horizon dashboard. Use the NovaClient CLI to migrate these instances.
Both cold migration and live migration honor an instance's group policies. If you are utilizing an affinity policy and are migrating multiple instances you may run into an error stating no hosts are available to migrate to. To work around this issue you should specify a target host when migrating these instances, which will bypass the nova-scheduler. You should ensure that the target host you choose has the resources available to host the instances.
The nova host-evacuate-live command will produce an error if you have a compute host that has a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, utilizing the --block-migrate option. This is described in further detail in Section 13.1.3.3, “Live Migration of Instances”.
If you are using both Linux for HPE Helion (KVM) and SLES compute hosts, you cannot live migrate instances between them. Instances on KVM hosts can only be live migrated to other KVM hosts. Instances on SLES hosts can only be live migrated to other SLES hosts.
The migration options described in this document are not available on ESX compute hosts.
Ensure that you read and take into account any other limitations that exist in the release notes. See the release notes for more details.
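
The state and storage rules above can be condensed into a small lookup. This is a sketch under simplifying assumptions, not a supported tool: `choose_migration` is a hypothetical helper, the "volume" and "ephemeral" labels are invented for the example, and it ignores the host-type and group-policy limitations listed above.

```shell
# Hypothetical helper: print the command template that matches an instance.
#   $1  instance status: ACTIVE, PAUSED, or SHUTOFF
#   $2  boot disk: "volume" (booted from a block storage volume) or
#       "ephemeral" (booted from local disk)
choose_migration() {
    case "$1:$2" in
        SHUTOFF:*)
            # Cold migration; disabled by default and must be enabled first.
            echo "nova migrate <instance_uuid>" ;;
        ACTIVE:volume|PAUSED:volume)
            echo "nova live-migration <instance_uuid> [<target host>]" ;;
        ACTIVE:ephemeral|PAUSED:ephemeral)
            echo "nova live-migration --block-migrate <instance_uuid> [<target host>]" ;;
        *)
            echo "unsupported combination: $1 $2" >&2
            return 1 ;;
    esac
}

choose_migration PAUSED ephemeral
```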

13.1.3.3.3 Performing a Live Migration #

Cloud administrators can perform a migration on an instance using the Horizon dashboard, the API, or the CLI. Instances in a Paused state cannot be live migrated using the Horizon GUI; use the CLI to migrate these instances.

We have documented different scenarios:

13.1.3.3.4 Migrating instances off of a failed compute host #

Log in to the Cloud Lifecycle Manager.
If the compute node is not already powered off, do so with this playbook:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
```
Note
The value for <node_name> will be the name that Cobbler has when you run sudo cobbler system list from the Cloud Lifecycle Manager.
Source the admin credentials necessary to run administrative commands against the Nova API:
```
source ~/service.osrc
```
Force the nova-compute service to go down on the compute node:
```
nova service-force-down HOSTNAME nova-compute
```
Note
The value for HOSTNAME can be obtained by using nova host-list from the Cloud Lifecycle Manager.
Evacuate the instances off of the failed compute node. This will cause the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.
For single instances on a failed host:
```
nova evacuate <instance_uuid> <target_hostname>
```
For all instances on a failed host:
```
nova host-evacuate <hostname> [--target_host <target_hostname>]
```
After you repair the failed node and start it back up, the nova-compute process cleans up the evacuated instances when it starts again.
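
The order of operations in this procedure matters: power the node down, force the service down, then evacuate. The dry-run sketch below only prints the documented commands in that order so that you can review them before running anything. `evacuation_plan` is a hypothetical helper and `cpn-0001` is a placeholder Cobbler node name.

```shell
# Hypothetical dry-run helper: print the documented commands in order.
#   $1  node name as listed by `sudo cobbler system list`
#   $2  compute hostname as listed by `nova host-list`
#   $3  optional target host for the evacuated instances
evacuation_plan() {
    # Run from ~/openstack/ardana/ansible on the Cloud Lifecycle Manager.
    echo "ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=$1"
    # The nova commands need the credentials from ~/service.osrc.
    echo "nova service-force-down $2 nova-compute"
    if [ -n "${3:-}" ]; then
        echo "nova host-evacuate $2 --target_host $3"
    else
        echo "nova host-evacuate $2"
    fi
}

evacuation_plan cpn-0001 ardana-cp1-comp0001-mgmt
```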

13.1.3.3.5 Migrating instances off of an active compute host #

Migrating instances using the Horizon dashboard

The Horizon dashboard offers a GUI method for performing live migrations. Horizon does not offer the live migration option for instances in a Paused state, so use the CLI instructions in the next section for these instances.

Log into the Horizon dashboard with admin credentials.
Navigate to the menu Admin › Compute › Instances.
Next to the instance you want to migrate, select the drop down menu and choose the Live Migrate Instance option.
In the Live Migrate wizard you will see the compute host the instance currently resides on and then a drop down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:
Disk Over Commit - If this is not checked then the value will be False. If you check this box then it will allow you to override the check that occurs to ensure the destination host has the available disk space to host the instance.
Block Migration - If this is not checked then the value will be False. If you check this box then it will migrate the local disks by using block migration. Use this option if you are only using ephemeral storage on your instances. If you are using block storage for your instance then ensure this box is not checked.
To begin the live migration, click Submit.

Migrating instances using the NovaClient CLI

To perform migrations from the command-line, use the NovaClient. The Cloud Lifecycle Manager node in your cloud environment should have the NovaClient already installed. If you will be accessing your environment through a different method, ensure that the NovaClient is installed. You can do so using Python's pip package manager.

To run the commands in the steps below, you need administrator credentials. From the Cloud Lifecycle Manager, you can source the provided service.osrc file, which has the necessary credentials:

source ~/service.osrc

Here are the steps to perform:

Identify the instances on the compute node you wish to migrate:

nova list --all-tenants --host <hostname>

Example showing a host with a single compute instance on it:

ardana >  nova list --host ardana-cp1-comp0001-mgmt --all-tenants
+--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name | Tenant ID                        | Status | Task State | Power State | Networks              |
+--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
| 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | -          | Running     | adminnetwork=10.0.0.5 |
+--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+

When using live migration you can either specify a target host that the instance will be migrated to or you can omit the target to allow the nova-scheduler to choose a node for you. If you want to get a list of available hosts you can use this command:
```
nova host-list
```
Migrate the instance(s) on the compute node using the notes below.
If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:
```
nova live-migration <instance uuid> [<target compute host>]
```
If your instance has local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s), you should use the --block-migrate option:
```
nova live-migration --block-migrate <instance uuid> [<target compute host>]
```
Note
The [<target compute host>] argument is optional. If you do not specify a target host then the nova scheduler will choose a node for you.
Multiple instances
If you want to live migrate all of the instances off a single compute host you can utilize the nova host-evacuate-live command.
Issue the host-evacuate-live command, which will begin the live migration process.
If all of the instances on the host are using at least one local (ephemeral) disk, you should use this syntax:
```
nova host-evacuate-live --block-migrate <hostname>
```
Alternatively, if all of the instances are only using block storage volumes then omit the --block-migrate option:
```
nova host-evacuate-live <hostname>
```
Note
You can either let the nova-scheduler choose a suitable target host or you can specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.

13.1.3.3.6 Troubleshooting migration or host evacuate issues #

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                                                                                        |
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 95a7ded8-ebfc-4848-9090-2df378c88a4c | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7)     |
| 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration cannot be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6)     |
+--------------------------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| e9874122-c5dc-406f-9039-217d9258c020 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a)     |
| 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112)     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)

Fix: This occurs when you are attempting to live migrate an instance that was booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration --block-migrate <instance_uuid> <target_hostname>

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt
ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)

Fix: This occurs when you are attempting to live migrate an instance that was booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration <instance_uuid> <target_hostname>
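
The four cases above reduce to one rule: pass --block-migrate only when the instance was booted from local storage. The following is a minimal shell sketch of that rule. The function name live_migration_cmd and the labels local and volume are illustrative assumptions, not part of the nova client. The function only prints the command so that you can review it before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the nova client): print the live-migration
# command that matches how the instance was booted.
#   $1 = "local"  (booted from local storage) or
#        "volume" (booted from a block storage volume)
#   $2 = instance UUID
#   $3 = target hostname
live_migration_cmd() {
    case "$1" in
        local)  printf 'nova live-migration --block-migrate %s %s\n' "$2" "$3" ;;
        volume) printf 'nova live-migration %s %s\n' "$2" "$3" ;;
        *)      echo "unknown storage type: $1" >&2; return 1 ;;
    esac
}

# Example: an instance booted from local storage needs --block-migrate.
live_migration_cmd local 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
```

The same rule applies to nova host-evacuate-live, where the flag depends on how the instances on the evacuated host were booted.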

13.1.3.4 Adding Compute Node #

Adding a Compute Node allows you to add capacity.

13.1.3.4.1 Adding a SLES Compute Node #

Adding SLES compute hosts gives your cloud additional capacity for more virtual machines. The steps below show you how to do this.

There are two methods you can use to add SLES compute hosts to your environment:

Adding pre-installed SLES compute hosts. This method does not require the SLES ISO to be on the Cloud Lifecycle Manager.
Using the provided Ansible playbooks and Cobbler to install SLES on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP3 ISO during the initial installation of your cloud, following the instructions at Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”, Section 19.1 “SLES Compute Node Installation Overview”.
If you want to use the provided Ansible playbooks and Cobbler to set up and configure your SLES hosts but did not have the SUSE Linux Enterprise Server 12 SP3 ISO on your Cloud Lifecycle Manager during your initial installation, read the note at the top of that section before proceeding.

13.1.3.4.1.1 Prerequisites #

You need to ensure your input model files are properly set up for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further in Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”, Section 19.3 “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.1 “SLES Compute Nodes”.

13.1.3.4.1.2 Adding a SLES compute node #

Adding pre-installed SLES compute hosts

This method requires that you have SUSE Linux Enterprise Server 12 SP3 pre-installed on the baremetal host prior to beginning these steps.

Ensure you have SUSE Linux Enterprise Server 12 SP3 pre-installed on your baremetal host.
Log in to the Cloud Lifecycle Manager.
Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).
For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the following format. The IPMI details are left out because they are not needed when the SLES OS is pre-installed on your host(s).
```
- id: compute4
  ip-addr: 192.168.102.70
  role: SLES-COMPUTE-ROLE
  server-group: RACK1
```
You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.
Important
You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.
See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

Commit the changes to git:

git add -A
git commit -a -m "Add node <name>"

Run the configuration processor and resolve any errors that are indicated:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
```
Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.
[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation.
Note
The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.
You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.
```
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
```

Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you are adding with the --limit switch:

cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
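
The procedure above asks you to confirm that the new ip-addr value does not conflict with an existing address by checking address_info.yml. The following is a minimal sketch of that check as a plain text search. The function name ip_in_use is hypothetical, and the sketch assumes only that allocated addresses appear literally somewhere in the file; it does not interpret the YAML structure. The -w option is widely supported by grep but is not required by POSIX.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if IP address $1 appears anywhere in file $2.
# -F matches the dots literally; -w stops 192.168.102.7 from matching
# inside 192.168.102.70.
ip_in_use() {
    grep -qwF -- "$1" "$2"
}

# Usage on the Cloud Lifecycle Manager (path taken from the step above):
# if ip_in_use 192.168.102.70 ~/openstack/my_cloud/info/address_info.yml; then
#     echo "192.168.102.70 is already allocated" >&2
# fi
```

A match means the address is already recorded for another host or service, so choose a different ip-addr before committing the change.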

Adding SLES compute hosts with Ansible playbooks and Cobbler

These steps will show you how to add the new SLES compute host to your servers.yml file and then run the playbooks that update your cloud configuration. You will run these playbooks from the Cloud Lifecycle Manager.

If you did not have the SUSE Linux Enterprise Server 12 SP3 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”.

When you are prepared to continue, use these steps:

Log in to your Cloud Lifecycle Manager.
Check out the site branch of your local git repository so you can begin to make the necessary edits:
```
cd ~/openstack/my_cloud/definition/data
git checkout site
```
Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).
For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one you would add your details to the bottom of the file in this format:
```
- id: compute4
  ip-addr: 192.168.102.70
  role: SLES-COMPUTE-ROLE
  server-group: RACK1
  mac-addr: e8:39:35:21:32:4e
  ilo-ip: 10.1.192.36
  ilo-password: password
  ilo-user: admin
  distro-id: sles12sp3-x86_64
```
You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.
Important
You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.
See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

Commit the changes to git:

git add -A
git commit -a -m "Add node <name>"

Run the configuration processor and resolve any errors that are indicated:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

The following playbook confirms that your servers are accessible over their IPMI ports.

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4

Add the new node into Cobbler:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml

Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
```
Then you can image the node:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
```
Note
If you do not know the <node name>, you can get it by using sudo cobbler system list.
Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

Update your deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your hosts are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.
```
cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
```
Note
You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.
Verify that the netmask, bootproto, and other necessary settings are correct. If they are not, configure them again. See Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute” for details.

Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you are adding with the --limit switch:

cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
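
The member-count check in the procedure above compares a value in control_plane.yml with the number of servers that carry the compute role. The following is a minimal sketch for counting those servers. It assumes that every server entry in servers.yml has a role: line of its own, as in the example entry; the function name count_role is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the "role: <ROLE>" lines for role $1 in file $2.
# Assumes "role:" sits on its own line, as in the servers.yml example entry;
# an entry written as "- role: ..." would not be counted.
count_role() {
    awk -v role="$1" '$1 == "role:" && $2 == role { n++ } END { print n + 0 }' "$2"
}

# Usage on the Cloud Lifecycle Manager:
# count_role SLES-COMPUTE-ROLE ~/openstack/my_cloud/definition/data/servers.yml
```

If you specified member-count for the compute resources, the number printed after you add the new entry should equal that value.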

13.1.3.4.1.3 Adding a new SLES compute node to monitoring #

If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

13.1.3.4.2 Adding a RHEL Compute Node #

Adding a RHEL compute node allows you to increase cloud capacity for more virtual machines. These steps will help you add new RHEL compute hosts for this purpose.

13.1.3.4.2.1 Prerequisites #

You need to ensure your input model files are properly set up for RHEL compute host clusters. This must be done during the installation process of your cloud and is discussed further in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.2 “RHEL Compute Nodes”.

13.1.3.4.2.2 Adding a RHEL compute node #

You must have RHEL 7.5 pre-installed on the baremetal host prior to beginning these steps.

Ensure you have RHEL 7.5 pre-installed on your baremetal host.
Log in to the Cloud Lifecycle Manager.
Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).
For example, if you already had a cluster of three RHEL compute hosts using the RHEL-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the following format. The IPMI details are left out because they are not needed when the RHEL OS is pre-installed on your host(s).
```
- id: compute4
  ip-addr: 192.168.102.70
  role: RHEL-COMPUTE-ROLE
  server-group: RACK1
```
You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new RHEL hosts you are adding as you specified on your existing RHEL hosts.
Important
Verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.
See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

Commit the changes to git:

git add -A
git commit -a -m "Add node <name>"

Run the configuration processor and resolve any errors that are indicated:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
```
Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

Look up the value for the new compute node's hostname in ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

Then, complete the compute host deployment with these playbooks, specifying that hostname with the --limit switch in the last one:

cd ~/scratch/ansible/next/ardana/ansible/
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>

13.1.3.4.2.3 Adding a new RHEL compute node to monitoring #

If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

13.1.3.5 Removing a Compute Node #

Removing a Compute node allows you to remove capacity from your cloud. The steps below show you how to do this.

13.1.3.5.1 Disable Provisioning on the Compute Host #

Get a list of the running Nova services. The list provides the details you need to disable provisioning on the Compute host you want to remove:

nova service-list

Here is an example. The Compute node removed in these examples is ardana-cp1-comp0002-mgmt:

$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+

Disable the Nova service on the Compute node you want to remove. This takes the node out of the scheduling rotation:

nova service-disable --reason "<enter reason here>" <node hostname> nova-compute

Here is an example that disables the ardana-cp1-comp0002-mgmt node from the output above:

$ nova service-disable --reason "hardware reallocation" ardana-cp1-comp0002-mgmt nova-compute
+--------------------------+--------------+----------+-----------------------+
| Host                     | Binary       | Status   | Disabled Reason       |
+--------------------------+--------------+----------+-----------------------+
| ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
+--------------------------+--------------+----------+-----------------------+

13.1.3.5.2 Remove the Compute Host from its Availability Zone #

If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.

Get a list of the running Nova services. The list provides the details you need to remove a Compute node:

nova service-list

Here is an example. The Compute node removed in these examples is ardana-cp1-comp0002-mgmt:

$ nova service-list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+

You can remove the Compute host from the availability zone it was a part of with this command:

nova aggregate-remove-host <availability zone> <nova hostname>

Continuing the example from the previous step, the ardana-cp1-comp0002-mgmt host was in the AZ2 availability zone, so this command removes it:

$ nova aggregate-remove-host AZ2 ardana-cp1-comp0002-mgmt
Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
+----+------+-------------------+-------+-------------------------+
| Id | Name | Availability Zone | Hosts | Metadata                |
+----+------+-------------------+-------+-------------------------+
| 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
+----+------+-------------------+-------+-------------------------+

You can confirm the last two steps completed successfully by running another nova service-list.

Here is an example which confirms that the node has been disabled and removed from the availability zone. The Status of ardana-cp1-comp0002-mgmt is now disabled and its Zone is now nova:

$ nova service-list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
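
Reading the Zone and Status columns by eye gets harder as the service list grows. The following is a minimal sketch that extracts both values for one host. It assumes the column order shown in the example output above (Id, Binary, Host, Zone, Status); the function name compute_zone_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `nova service-list` table output on stdin and
# print "<zone> <status>" for the nova-compute service on host $1.
# Assumes the column order Id | Binary | Host | Zone | Status | ...
compute_zone_status() {
    awk -F'|' -v host="$1" '
        NF >= 6 {
            # Trim the padding around the Binary, Host, Zone, and Status cells.
            for (i = 3; i <= 6; i++) gsub(/^ +| +$/, "", $i)
            if ($3 == "nova-compute" && $4 == host) print $5, $6
        }
    '
}

# Usage:
# nova service-list | compute_zone_status ardana-cp1-comp0002-mgmt
```

For the example output above, a result with the zone nova and the status disabled means both steps took effect.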

13.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts #

Verify whether the Compute node is currently hosting any instances. You can do this with the command below:

nova list --host=<nova hostname> --all_tenants=1

Here is an example below which shows that we have a single running instance on this node currently:

$ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+

You will likely want to migrate this instance off this node before removing the node. You can do this with the live migration functionality within Nova. The command will look like this:

nova live-migration --block-migrate <nova instance ID>

Here is an example using the instance in the previous step:

$ nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9

You can check the status of the migration using the same command from the previous step:

$ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+

Run nova list again to confirm that the running instance has been migrated and the host is now empty:

$ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
+----+------+-----------+--------+------------+-------------+----------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+----+------+-----------+--------+------------+-------------+----------+
+----+------+-----------+--------+------------+-------------+----------+
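
When several instances are migrating, it helps to reduce the table to a single number. The following is a minimal sketch that counts the instance rows in nova list output. It assumes the table layout shown above, with the instance ID in the first column; the function name count_instances is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `nova list` table output on stdin and print the
# number of instance rows. Border lines and the header row are skipped.
count_instances() {
    awk -F'|' '
        NF >= 3 {
            id = $2
            gsub(/^ +| +$/, "", id)          # trim padding around the ID cell
            if (id != "" && id != "ID") n++  # skip the header row
        }
        END { print n + 0 }
    '
}

# Usage: poll until the host no longer reports any instances.
# while [ "$(nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1 | count_instances)" -gt 0 ]; do
#     sleep 10
# done
```

A count of 0 corresponds to the empty table shown above.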

13.1.3.5.4 Disable Neutron Agents on Node to be Removed #

You should also locate and disable or remove the Neutron agents on the node. List the agents running on the node, then set each one administratively down by its ID:

$ neutron agent-list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

$ neutron agent-update --admin-state-down 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ neutron agent-update --admin-state-down dbe4fe11-8f08-4306-8244-cc68e98bb770
$ neutron agent-update --admin-state-down f0d262d1-7139-40c7-bdc2-f227c6dee5c8
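
Copying each agent ID by hand is error-prone when a node runs several agents. The following is a minimal sketch that turns the agent list into the corresponding agent-update commands. It assumes the column order shown above (id, agent_type, host); the function name agent_disable_cmds is hypothetical. The commands are printed, not executed, so you can review them first.

```shell
#!/bin/sh
# Hypothetical helper: read `neutron agent-list` table output on stdin and
# print one "neutron agent-update --admin-state-down <id>" line per agent
# whose host column equals $1.
# Assumes the column order id | agent_type | host | ...
agent_disable_cmds() {
    awk -F'|' -v host="$1" '
        NF >= 5 {
            id = $2; h = $4
            gsub(/^ +| +$/, "", id)   # trim padding around the id cell
            gsub(/^ +| +$/, "", h)    # trim padding around the host cell
            if (h == host) print "neutron agent-update --admin-state-down " id
        }
    '
}

# Usage:
# neutron agent-list | agent_disable_cmds ardana-cp1-comp0002-mgmt
```

After checking the printed commands, you can run them by piping the output to sh.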

13.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host #

To perform this step you have a few options. You can SSH into the Compute host and run the following commands:

sudo systemctl stop nova-compute

sudo systemctl stop 'neutron-*'

Because the Neutron agent self-registers against the Neutron server, you may want to prevent the following services from coming back online. Here is how you can get the list (the pattern is quoted so that the shell passes it to systemctl unexpanded):

sudo systemctl list-units 'neutron-*' --all

Here are the results:

UNIT                                  LOAD        ACTIVE     SUB      DESCRIPTION
neutron-common-rundir.service         loaded      inactive   dead     Create /var/run/neutron
•neutron-dhcp-agent.service         not-found     inactive   dead     neutron-dhcp-agent.service
neutron-l3-agent.service              loaded      inactive   dead     neutron-l3-agent Service
neutron-lbaasv2-agent.service         loaded      inactive   dead     neutron-lbaasv2-agent Service
neutron-metadata-agent.service        loaded      inactive   dead     neutron-metadata-agent Service
•neutron-openvswitch-agent.service    loaded      failed     failed   neutron-openvswitch-agent Service
neutron-ovs-cleanup.service           loaded      inactive   dead     Neutron OVS Cleanup Service

        LOAD   = Reflects whether the unit definition was properly loaded.
        ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
        SUB    = The low-level unit activation state, values depend on unit type.

        7 loaded units listed.
        To show all installed unit files use 'systemctl list-unit-files'.

For each loaded service, issue the command:

sudo systemctl disable <service-name>

In the above example, that would be each service except neutron-dhcp-agent.service.

For example:

sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-lbaasv2-agent neutron-metadata-agent neutron-openvswitch-agent
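
The selection of loaded units can also be scripted. The following is a minimal sketch, not part of the product tooling: the helper name loaded_units is hypothetical, and it assumes the column layout shown in the example output above (unit name first, LOAD state second).

```shell
#!/bin/sh
# loaded_units: hypothetical helper, not part of HPE Helion OpenStack.
# Reads "systemctl list-units" rows on stdin and prints the names of the
# service units whose LOAD column is "loaded".
loaded_units() {
    awk '{
        sub(/^[^A-Za-z0-9]+/, "")      # drop a leading status marker, if any
        if ($2 == "loaded" && $1 ~ /\.service$/) print $1
    }'
}

# Possible usage on the Compute host (commented out so that nothing is
# changed by accident; "xargs -r" is a GNU extension):
#   systemctl list-units 'neutron-*' --all --no-legend \
#       | loaded_units | xargs -r sudo systemctl disable
```

Review the printed unit names before piping them to systemctl disable.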

Now you can shut down the node:

sudo shutdown now

Alternatively, from the Cloud Lifecycle Manager you can use the bm-power-down.yml playbook to shut down the node:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node name>

Note

The <node name> value will be the value corresponding to this node in Cobbler. You can run sudo cobbler system list to retrieve these names.

13.1.3.5.6 Delete the Compute Host from Nova #

Retrieve the list of Nova services:

nova service-list

Here is an example highlighting the Compute host we're going to remove:

$ nova service-list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+

Delete the host from Nova using the command below:

nova service-delete <service ID>

Following our example above, you would use:

nova service-delete 37

Use the command below to confirm that the Compute host has been completely removed from Nova:

nova hypervisor-list
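
If you prefer not to read the service ID from the table by eye, it can be extracted from the output. This is a minimal sketch that assumes the table layout shown in the example above; the helper name compute_service_id is hypothetical.

```shell
#!/bin/sh
# compute_service_id: hypothetical helper, not part of the nova client.
# Reads the table printed by "nova service-list" on stdin and prints the Id
# of the nova-compute service running on the host given as $1.
compute_service_id() {
    awk -F'|' -v host="$1" '{
        id = $2; binary = $3; h = $4
        gsub(/[ \t]/, "", id)
        gsub(/[ \t]/, "", binary)
        gsub(/[ \t]/, "", h)
        if (binary == "nova-compute" && h == host) print id
    }'
}

# Possible usage (prints the command instead of running it):
#   id=$(nova service-list | compute_service_id ardana-cp1-comp0002-mgmt)
#   echo "nova service-delete $id"
```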

13.1.3.5.7 Delete the Compute Host from Neutron #

Multiple Neutron agents are running on the compute node. You have to remove all of the agents running on the node using the "neutron agent-delete" command. In the example below, the l3-agent, openvswitch-agent and metadata-agent are running:

$ neutron agent-list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

$ neutron agent-delete AGENT_ID

$ neutron agent-delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ neutron agent-delete dbe4fe11-8f08-4306-8244-cc68e98bb770
$ neutron agent-delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
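
When a node runs several agents, the IDs can be collected from the table rather than copied one by one. The following is a sketch under the assumption that neutron agent-list prints the table layout shown above; agent_ids_for_host is a hypothetical helper name.

```shell
#!/bin/sh
# agent_ids_for_host: hypothetical helper, not part of the neutron client.
# Reads the table printed by "neutron agent-list" on stdin and prints the
# id of every agent whose host column equals $1.
agent_ids_for_host() {
    awk -F'|' -v host="$1" '{
        id = $2; h = $4
        gsub(/[ \t]/, "", id)
        gsub(/[ \t]/, "", h)
        if (h == host) print id
    }'
}

# Possible usage (commented out because it deletes agents):
#   neutron agent-list | agent_ids_for_host ardana-cp1-comp0002-mgmt |
#       while read -r id; do neutron agent-delete "$id"; done
```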

13.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor #

Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:

Log in to the Cloud Lifecycle Manager
Edit your servers.yml file in the location below to remove references to the Compute node(s) you want to remove:
```
~/openstack/my_cloud/definition/data/servers.yml
```
If you specified member-count, min-count, or max-count in your control_plane.yml file, you may also need to update those values so that they reflect the number of nodes you are now using.
See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.
Commit the changes to git:
```
git commit -a -m "Remove node <name>"
```
Run the configuration processor:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
```
To free up the resources when running the configuration processor, use the switches remove_deleted_servers and free_unused_addresses. For more information, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
```

Update your deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

13.1.3.5.9 Remove the Compute Host from Cobbler #

Complete these steps to remove the node from Cobbler:

Confirm the system name in Cobbler with this command:
```
sudo cobbler system list
```
Remove the system from Cobbler using this command:
```
sudo cobbler system remove --name=<node>
```

Run the cobbler-deploy.yml playbook to complete the process:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml

13.1.3.5.10 Remove the Compute Host from Monitoring #

Once you have removed the Compute nodes, the alarms against them will trigger so there are additional steps to take to resolve this issue.

To find all Monasca API servers:

tux > sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
    bind ardana-cp1-vip-public-MON-API-extapi:8070  ssl crt /etc/ssl/private//my-public-cert-entry-scale                                          
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5        
listen ardana-cp1-vip-MON-API-mgmt-8070
    bind ardana-cp1-vip-MON-API-mgmt:8070  ssl crt /etc/ssl/private//ardana-internal-cert                                          
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5

In the above example, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the Monasca API servers.
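
The host names can also be extracted from the HAProxy configuration directly. This sketch assumes the server lines have the form shown in the output above (server, a label containing MON_API, then host:port); monasca_api_hosts is a hypothetical helper name.

```shell
#!/bin/sh
# monasca_api_hosts: hypothetical helper. Reads haproxy.cfg content on stdin
# and prints each Monasca API back-end host once.
monasca_api_hosts() {
    awk '$1 == "server" && $2 ~ /MON_API/ {
        split($3, parts, ":")     # $3 is "<host>:<port>"
        print parts[1]
    }' | sort -u
}

# Possible usage:
#   sudo cat /etc/haproxy/haproxy.cfg | monasca_api_hosts
```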

You will want to SSH to each of the Monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Compute node you removed. This will require sudo access. The entries will look similar to the one below:

- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping

Once you have removed the references on each of your Monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the Compute node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the Monasca CLI which should be installed on each of your Monasca API servers by default:

monasca alarm-list --metric-dimensions hostname=<compute node deleted>

For example, if your Compute node looked like the example above then you would use this command to get the alarm ID:

monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt

You can then delete the alarm with this command:

monasca alarm-delete <alarm ID>

13.1.4 Planned Network Maintenance #

Planned maintenance tasks for networking nodes.

13.1.4.1 Adding a Neutron Network Node #

Adding an additional Neutron networking node allows you to increase the performance of your cloud.

The following steps describe how to add a Neutron network node, whether for increased performance or for another purpose.

13.1.4.1.1 Prerequisites #

If you are using the mid-scale model, your networking nodes are already separate and the roles are defined. If you are not using this model and wish to add separate networking nodes, you need to ensure that those roles are defined. The mid-scale example model files in the ~/openstack/examples folder on your Cloud Lifecycle Manager show how to do this. The basic edits that need to be made are listed below:

In your server_roles.yml file, ensure you have the NEUTRON-ROLE defined.

Path to file:

~/openstack/my_cloud/definition/data/server_roles.yml

Example snippet:

- name: NEUTRON-ROLE
  interface-model: NEUTRON-INTERFACES
  disk-model: NEUTRON-DISKS

In your net_interfaces.yml file, ensure you have the NEUTRON-INTERFACES defined.

Path to file:

~/openstack/my_cloud/definition/data/net_interfaces.yml

Example snippet:

- name: NEUTRON-INTERFACES
  network-interfaces:
  - device:
      name: hed3
    name: hed3
    network-groups:
    - EXTERNAL-VM
    - GUEST
    - MANAGEMENT

Create a disks_neutron.yml file and ensure that NEUTRON-DISKS is defined in it.

Path to file:

~/openstack/my_cloud/definition/data/disks_neutron.yml

Example snippet:

  product:
    version: 2

  disk-models:
  - name: NEUTRON-DISKS
    volume-groups:
      - name: ardana-vg
        physical-volumes:
         - /dev/sda_root

        logical-volumes:
        # The policy is not to consume 100% of the space of each volume group.
        # 5% should be left free for snapshots and to allow for some flexibility.
          - name: root
            size: 35%
            fstype: ext4
            mount: /
          - name: log
            size: 50%
            mount: /var/log
            fstype: ext4
            mkfs-opts: -O large_file
          - name: crash
            size: 10%
            mount: /var/crash
            fstype: ext4
            mkfs-opts: -O large_file
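
The comment in the snippet asks that about 5% of the volume group stay free. A quick way to check a disk model against that policy is to add up the size values. This is a sketch only: lv_size_total is a hypothetical helper, and it assumes the file defines a single volume group with sizes written as percentages.

```shell
#!/bin/sh
# lv_size_total: hypothetical helper. Reads a disk model on stdin and prints
# the sum of all "size: N%" values (assumes a single volume group).
lv_size_total() {
    awk '$1 == "size:" { sub(/%$/, "", $2); total += $2 }
         END { print total + 0 }'
}

# Possible usage:
#   lv_size_total < ~/openstack/my_cloud/definition/data/disks_neutron.yml
```

For the example snippet the sum is 35 + 50 + 10 = 95, which leaves the recommended 5% unallocated.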

Modify your control_plane.yml file to ensure that NEUTRON-ROLE is defined and that the Neutron services are added.

Path to file:

~/openstack/my_cloud/definition/data/control_plane.yml

Example snippet:

  - allocation-policy: strict
    cluster-prefix: neut
    member-count: 1
    name: neut
    server-role: NEUTRON-ROLE
    service-components:
    - ntp-client
    - neutron-vpn-agent
    - neutron-dhcp-agent
    - neutron-metadata-agent
    - neutron-openvswitch-agent

You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 2 “Hardware and Software Support Matrix”.

13.1.4.1.2 Adding a network node #

These steps show you how to add the new network node to your servers.yml file and then run the playbooks that update your cloud configuration. You run these playbooks from the Cloud Lifecycle Manager.

Log in to your Cloud Lifecycle Manager.
Check out the site branch of your local Git repository so you can begin to make the necessary edits:
```
ardana > cd ~/openstack/my_cloud/definition/data
ardana > git checkout site
```
In the same directory, edit your servers.yml file to include the details about your new network node(s).
For example, if you already had a cluster of three network nodes and needed to add a fourth one you would add your details to the bottom of the file in this format:
```
# network nodes
- id: neut3
  role: NEUTRON-ROLE
  server-group: RACK2
  mac-addr: "5c:b9:01:89:b6:18"
  nic-mapping: HP-DL360-6PORT
  ip-addr: 10.243.140.22
  ilo-ip: 10.1.12.91
  ilo-password: password
  ilo-user: admin
```
Important
You will need to verify that the ip-addr value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your control_plane.yml file you will need to check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match up with your new total node count. So for example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.
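
The address check described in the note above can be done with a whole-word search. This is a minimal sketch: ip_in_use is a hypothetical helper, and it makes no assumption about the structure of address_info.yml beyond the address appearing in it as text.

```shell
#!/bin/sh
# ip_in_use: hypothetical helper. Succeeds (exit status 0) if the IPv4
# address $2 appears as a whole word in the file $1.
ip_in_use() {
    grep -qwF -- "$2" "$1"
}

# Possible usage:
#   if ip_in_use ~/openstack/my_cloud/info/address_info.yml 10.243.140.22; then
#       echo "address is already allocated"
#   fi
```

The -w option keeps 10.243.140.22 from matching a longer address such as 10.243.140.220.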

Commit the changes to git:

ardana > git commit -a -m "Add new networking node <name>"

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Add the new node into Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Then you can image the node:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>

Note

If you do not know the <hostname>, you can get it by using sudo cobbler system list.

[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.
```
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
```

Configure the operating system on the new networking node with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

Complete the networking node deployment with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>

Run the site.yml playbook with the required tag so that all other services become aware of the new node:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"

13.1.4.1.3 Adding a New Network Node to Monitoring #

If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

13.1.5 Planned Storage Maintenance #

Planned maintenance procedures for Swift storage nodes.

13.1.5.1 Planned Maintenance Tasks for Swift Nodes #

Planned maintenance tasks including recovering, adding, and removing Swift nodes.

13.1.5.1.1 Adding a Swift Object Node #

Adding additional object nodes allows you to increase capacity.

This topic describes how to add additional Swift object server nodes to an existing system.

13.1.5.1.1.1 To add a new node #

To add a new node to your cloud, you add it to servers.yml and then run the playbooks that update your cloud configuration. You access the servers.yml file by checking out the Git branch where the changes must be made.

Perform the following steps to add a new node:

Get the servers.yml file stored in Git:

cd ~/openstack/my_cloud/definition/data
git checkout site

If not already done, set the weight-step attribute. For instructions, see Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

Add the details of new nodes to the servers.yml file. In the following example only one new server swobj4 is added. However, you can add multiple servers by providing the server details in the servers.yml file:

servers:
...
- id: swobj4
  role: SWOBJ_ROLE
  server-group: <server-group-name>
  mac-addr: <mac-address>
  nic-mapping: <nic-mapping-name>
  ip-addr: <ip-address>
  ilo-ip: <ilo-ip-address>
  ilo-user: <ilo-username>
  ilo-password: <ilo-password>

Commit your changes:
```
git add -A
git commit -m "Add Node <name>"
```
Note
Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:
```
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
```
For instructions, see Book “Installing with Cloud Lifecycle Manager”, Chapter 18 “Installation for HPE Helion OpenStack Entry-scale Cloud with Swift Only”.

Run the configuration processor:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Create a deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list with the nodelist argument):
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
```
In the following example, the server id is swobj4 (mentioned in step 3):
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
```
Note
You must use the server ID exactly as it appears in the id field of the servers.yml file.

Configure the operating system:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

The hostname of the newly added server can be found in the list generated from the output of the following command:

grep hostname ~/openstack/my_cloud/info/server_info.yml

For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt

Validate that the disk drives of the new node are compatible with the disk model used by the node:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit 'SWF*'
```
If any errors occur, correct them. For instructions, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.

Run the following playbook to ensure that the hosts file on every other server is updated with the new server:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"

Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
```
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
For example:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
```
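
The playbook runs in this procedure, from the configuration processor through ardana-deploy.yml, can be collected in a wrapper that stops at the first failure. This is a sketch only and not a supported tool: the names play, add_swift_object_node, CLM_DIR, DEPLOY_DIR, and DRY_RUN are hypothetical, and it assumes the servers.yml changes are already committed and the encryption key is exported if you use one.

```shell
#!/bin/sh
# Hypothetical wrapper around the playbook sequence described above.
# Set DRY_RUN=1 to print the commands instead of running them.
CLM_DIR="$HOME/openstack/ardana/ansible"
DEPLOY_DIR="$HOME/scratch/ansible/next/ardana/ansible"

# play DIR ARGS...: run ansible-playbook with ARGS from inside DIR.
play() {
    dir=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "(cd $dir && ansible-playbook $*)"
    else
        (cd "$dir" && ansible-playbook "$@")
    fi
}

# add_swift_object_node SERVER_ID HOSTNAME
# Each step runs only if the previous one succeeded.
add_swift_object_node() {
    server_id=$1
    host=$2
    play "$CLM_DIR" -i hosts/localhost config-processor-run.yml &&
    play "$CLM_DIR" -i hosts/localhost ready-deployment.yml &&
    play "$CLM_DIR" -i hosts/localhost cobbler-deploy.yml &&
    play "$CLM_DIR" -i hosts/localhost bm-reimage.yml -e "nodelist=$server_id" &&
    play "$DEPLOY_DIR" -i hosts/verb_hosts osconfig-run.yml --limit "$host" &&
    play "$DEPLOY_DIR" -i hosts/verb_hosts swift-compare-model-rings.yml --limit 'SWF*' &&
    play "$DEPLOY_DIR" -i hosts/verb_hosts site.yml --tags generate_hosts_file &&
    play "$DEPLOY_DIR" -i hosts/verb_hosts ardana-deploy.yml
}
```

Running it first with DRY_RUN=1, for example `DRY_RUN=1; add_swift_object_node swobj4 ardana-cp1-swobj0004-mgmt`, lets you review the command list before anything is executed.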

13.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node #

Steps for adding additional PAC nodes to your Swift system.

This topic describes how to add additional Swift proxy, account, and container (PAC) servers to an existing system.

13.1.5.1.2.1 Adding a new node #

Perform the following steps to add a new node:

Get the servers.yml file stored in Git:

cd ~/openstack/my_cloud/definition/data
git checkout site

If not already done, set the weight-step attribute. For instructions, see Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Add details of new nodes to the servers.yml file:
```
servers:
...
- id: swpac6
  role: SWPAC-ROLE
  server-group: <server-group-name>
  mac-addr: <mac-address>
  nic-mapping: <nic-mapping-name>
  ip-addr: <ip-address>
  ilo-ip: <ilo-ip-address>
  ilo-user: <ilo-username>
  ilo-password: <ilo-password>
```
In the above example, only one new server swpac6 is added. However, you can add multiple servers by providing the server details in the servers.yml file.
In the entry-scale configurations there is no dedicated Swift PAC cluster. Instead, there is a cluster using servers that have a role of CONTROLLER-ROLE. You cannot add swpac4 to this cluster because that would change the member-count. If your system does not already have a dedicated Swift PAC cluster you will need to add it to the configuration files. For details on how to do this, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.7 “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.
If you are using new PAC nodes, you must add the PAC nodes' configuration details to the following YAML files:
```
control_plane.yml
disks_pac.yml
net_interfaces.yml
servers.yml
server_roles.yml
```
You can see a good example of this in the example configurations for the mid-scale model in the ~/openstack/examples/mid-scale-kvm directory.
The following steps assume that you have already created a dedicated Swift PAC cluster and that it has two members (swpac4 and swpac5).
Increase the member count of the Swift PAC cluster, as appropriate. For example, if you are adding swpac6 and you previously had two Swift PAC nodes, the increased member count should be 3 as shown in the following example:
```
control-planes:
    - name: control-plane-1
      control-plane-prefix: cp1

  . . .
  clusters:
  . . .
     - name: ....
       cluster-prefix: c2
       server-role: SWPAC-ROLE
       member-count: 3
   . . .
```
Commit your changes:
```
git add -A
git commit -m "Add Node <name>"
```
Note
Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:
```
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
```
For instructions, see Book “Installing with Cloud Lifecycle Manager”, Chapter 18 “Installation for HPE Helion OpenStack Entry-scale Cloud with Swift Only”.

Run the configuration processor:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Create a deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the nodelist argument):
```
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
```
In the following example, the server id is swpac6 (mentioned in step 3):
```
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
```
Note
You must use the server ID exactly as it appears in the id field of the servers.yml file.
Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, ardana) and the control plane name (for example, cp1). These values give you the hostname of the node. Configure the operating system:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
```
For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
```
Validate that the disk drives of the new node are compatible with the disk model used by the node:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml
```
If any errors occur, correct them. For instructions, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.

Run the following playbook to ensure that the hosts file on every other server is updated with the new server:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"

Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swpac6) that you are adding:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
```
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 8.5.5, “Applying Input Model Changes to Existing Rings”.

13.1.5.1.3 Adding Additional Disks to a Swift Node #

Steps for adding additional disks to any nodes hosting Swift services.

This procedure describes how to add disks to a node for Swift usage. The steps apply to Swift object nodes and to proxy, account, container (PAC) nodes. They also apply to a controller node that hosts the Swift service, as is the case in the entry-scale example models.

Read through the notes below before beginning the process.

You can add multiple disks at the same time; there is no need to add them one at a time.

Important: Add the Same Number of Disks

You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three Swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three Swift servers.

13.1.5.1.3.1 Adding additional disks to your Swift servers #

Verify the general health of the Swift system and that it is safe to rebalance your rings. See Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Perform the disk maintenance.
1. Shut down the first Swift server you wish to add disks to.
2. Add the additional disks to the physical server. The disk drives that are added should be clean. Each should contain either no partitions or a single partition the size of the entire disk, and should not contain a file system or any volume groups. A disk that does not meet these requirements causes errors and is not added.
  For more details, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.6 “Swift Requirements for Device Group Drives”.
3. Power the server on.
4. While the server was shut down, data that normally would have been placed on the server was placed elsewhere. When the server is rebooted, the Swift replication process moves that data back onto the server. Monitor the replication process to determine when it is complete. See Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
5. Repeat the steps from Step 2.a for each of the Swift servers you are adding the disks to, one at a time.
  Note
  If the additional disks can be added to the Swift servers online (for example, via hotplugging) then there is no need to perform the last two steps.
On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.
1. Edit the disk configuration file that correlates to the type of server you are adding your new disks to.
  Path to the typical disk configuration files:
```
~/openstack/my_cloud/definition/data/disks_swobj.yml
~/openstack/my_cloud/definition/data/disks_swpac.yml
~/openstack/my_cloud/definition/data/disks_controller_*.yml
```
  Example showing the addition of a single new disk, /dev/sdd:
```
device-groups:
  - name: SwiftObject
    devices:
      - name: "/dev/sdb"
      - name: "/dev/sdc"
      - name: "/dev/sdd"
    consumer:
      name: swift
      ...
```
  Note
  For more details on how the disk model works, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”.
2. Configure the Swift weight-step value in the ~/openstack/my_cloud/definition/data/swift/rings.yml file. See Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.
3. Commit the changes to Git:
```
cd ~/openstack
git commit -a -m "adding additional Swift disks"
```
4. Run the configuration processor:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
```
5. Update your deployment directory:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml
```
Run the osconfig-run.yml playbook against the Swift nodes you have added disks to. Use the --limit switch to target the specific nodes:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>
```
You can use a wildcard when specifying the hostnames with the --limit switch. If you added disks to all of the Swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...), then you can use a wildcard like ardana-cp1-swobj*. If you added disks to only some of the nodes, you can use a comma-delimited list of the hostnames of the nodes you added disks to.
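
If the hostnames are kept one per line, the comma-delimited list can be built rather than typed. A minimal sketch follows; limit_list and the file name swift-hosts.txt are hypothetical.

```shell
#!/bin/sh
# limit_list: hypothetical helper. Joins the lines read on stdin into a
# single comma-delimited string.
limit_list() {
    paste -sd, -
}

# Possible usage, with one hostname per line in swift-hosts.txt:
#   ansible-playbook -i hosts/verb_hosts osconfig-run.yml \
#       --limit "$(limit_list < swift-hosts.txt)"
```

Quote a wildcard pattern such as 'ardana-cp1-swobj*' on the command line so that the shell passes it to ansible-playbook unchanged.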

Validate your Swift configuration with this playbook which will also provide details of each drive being added:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"

Verify that Swift services are running on all of your servers:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml

If everything looks okay with the Swift status, then apply the changes to your Swift rings with this playbook:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
```
At this point your Swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 8.5.5, “Applying Input Model Changes to Existing Rings”.

13.1.5.1.4 Removing a Swift Node #

Removal process for both Swift Object and PAC nodes.

You can use this process when you want to remove one or more Swift nodes permanently. This process applies to both Swift Proxy, Account, Container (PAC) nodes and Swift Object nodes.

13.1.5.1.4.1 Setting the Pass-through Attributes #

This process will remove the Swift node's drives from the rings and move their data to the remaining nodes in your cluster.

Log in to the Cloud Lifecycle Manager.
Ensure that the weight-step attribute is set. See Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.
Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include in your ~/openstack/my_cloud/definition/data/servers.yml file since your server IDs are already listed in that file. For more information about pass-through, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.17 “Pass Through”.
Here is the general format:
```
pass-through:
  servers:
    - id: <server-id>
      data:
          <subsystem>:
                <subsystem-attributes>
```
Here is an example:
```
---
  product:
    version: 2

  pass-through:
    servers:
      - id: ccn-0001
        data:
          swift:
            drain: yes
```
By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute as described in the previous step. This process is known as "draining", where you remove the Swift data from the node in preparation for removing the node.
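The pass-through fragment differs only in the server ID and in the attribute being set (drain here, remove later in this procedure). The following sketch prints the fragment for a given server ID and attribute so it can be reviewed before being pasted into the input model. The function name swift_pass_through is hypothetical; it is not a tool shipped with the product.

```shell
#!/bin/sh
# swift_pass_through: hypothetical helper that prints a pass-through fragment.
#   $1: server ID as listed in servers.yml
#   $2: Swift attribute to set, either "drain" or "remove"
swift_pass_through() {
  case $2 in
    drain|remove) ;;
    *) echo "attribute must be drain or remove" >&2; return 1 ;;
  esac
  cat << EOF
  pass-through:
    servers:
      - id: $1
        data:
          swift:
            $2: yes
EOF
}

# Print the fragment for the server used in the example above.
swift_pass_through ccn-0001 drain
```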
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
```
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
```

Run the configuration processor:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Use the playbook to create a deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Run the Swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
```
Wait until the replication has completed. For further details, see Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring”.

Determine whether all of the partitions have been removed from all drives on the Swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:

cd /etc/swiftlm/cloud1/cp1/builder_dir/
sudo swift-ring-builder <ring_name>.builder

For example, if the node you are removing was part of the object-0 ring, the command would be:

sudo swift-ring-builder object-0.builder

Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:

```
$ cd /etc/swiftlm/cloud1/cp1/builder_dir/
$ sudo swift-ring-builder object-0.builder
account.builder, build version 6
4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 16
The overload factor is 0.00% (0.000000)
Devices:    id  region  zone      ip address  port  replication ip  replication port      name weight partitions balance meta
             0       1     1   192.168.245.3  6002   192.168.245.3              6002     disk0   0.00          0   -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
             1       1     1   192.168.245.3  6002   192.168.245.3              6002     disk1   0.00          0   -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
             2       1     1   192.168.245.4  6002   192.168.245.4              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc
             3       1     1   192.168.245.4  6002   192.168.245.4              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m2:disk1:/dev/sdd
             4       1     1   192.168.245.5  6002   192.168.245.5              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m3:disk0:/dev/sdc
             5       1     1   192.168.245.5  6002   192.168.245.5              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m3:disk1:/dev/sdd
```
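Reading the partitions column by eye becomes error-prone on rings with many devices. The sketch below sums the partitions for one IP address from swift-ring-builder output supplied on standard input. The helper name partitions_on_ip is hypothetical, and the script assumes the column layout of the sample output above (IP address in the fourth field, partitions in the tenth); other Swift versions may lay the table out differently.

```shell
#!/bin/sh
# partitions_on_ip: hypothetical helper. Reads swift-ring-builder output on
# stdin and prints the total number of partitions held by devices at the
# given IP address. Exits non-zero if no device with that IP is listed.
# Assumption: field 4 is the IP address and field 10 is the partition count,
# as in the sample output in this section.
partitions_on_ip() {
  awk -v ip="$1" '
    $4 == ip { total += $10; found = 1 }
    END { if (!found) exit 1; print total + 0 }
  '
}
```

For example, piping the output of sudo swift-ring-builder object-0.builder into partitions_on_ip 192.168.245.3 would print 0 for the sample shown above, indicating that the server has been drained on that ring.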

If the number of partitions is zero for the server on all rings, you can move to the next step, otherwise continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.
If the number of partitions is zero for the server on all rings, you can remove the Swift nodes' drives from all rings. Edit the pass-through data you created in step #3 and set the remove attribute as shown in this example:
```
---
  product:
    version: 2

  pass-through:
    servers:
      - id: ccn-0001
        data:
          swift:
            remove: yes
```
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
```
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
```

Run the configuration processor:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Run the Swift deploy playbook to rebuild the rings by removing the server:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml

At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.

13.1.5.1.4.2 To Disable Swift on a Node #

The next phase in this process will disable the Swift service on the node. In this example, swobj4 is the node being removed from Swift.

Log in to the Cloud Lifecycle Manager.
Stop Swift services on the node using the swift-stop.yml playbook:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit <hostname>
```
Note
When using the --limit argument, you must specify the full hostname (for example, ardana-cp1-swobj0004) or use the wildcard * (for example, *swobj0004*).
The following example uses the swift-stop.yml playbook to stop Swift services on ardana-cp1-swobj0004:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
```
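Before running a playbook with a wildcard, it can help to confirm that the pattern matches the hostname you intend. The sketch below uses the shell's own glob matching as a stand-in; it assumes that simple * patterns passed to --limit behave like shell globs, which you should confirm for your Ansible version. The helper name matches_pattern is hypothetical.

```shell
#!/bin/sh
# matches_pattern: hypothetical helper that tests a hostname against a
# shell-style glob using the shell's case statement.
#   $1: hostname, $2: glob pattern
matches_pattern() {
  case $1 in
    $2) return 0 ;;   # unquoted expansion is interpreted as a pattern
    *)  return 1 ;;
  esac
}

# A pattern must contain text that really appears in the hostname.
if matches_pattern ardana-cp1-swobj0004 '*swobj0004*'; then
  echo "pattern matches"
fi
```

Note that a pattern built from the server ID alone, such as *swobj4*, does not match the hostname ardana-cp1-swobj0004 under glob rules, because the hostname contains swobj0004 rather than swobj4.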
Remove the configuration files.
```
ssh ardana-cp1-swobj0004-mgmt sudo rm -R /etc/swift
```
Note
Do not run any other playbooks until you have finished the process described in Section 13.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate /etc/swift and restart Swift on swobj4. If you accidentally run a playbook, repeat the process in Section 13.1.5.1.4.2, “To Disable Swift on a Node”.

13.1.5.1.4.3 To Remove a Node from the Input Model #

Use the following steps to finish the process of removing the Swift node.

Log in to the Cloud Lifecycle Manager.
Edit the ~/openstack/my_cloud/definition/data/servers.yml file and remove the entry for the node (swobj4 in this example).
If this was a SWPAC node, reduce the member-count attribute by 1 in the ~/openstack/my_cloud/definition/data/control_plane.yml file. For SWOBJ nodes, no such action is needed.
Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:
```
cd ~/openstack/ardana/ansible
git add -A
git commit -m "My config or other commit message"
```
Run the configuration processor:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml
```
You may want to use the remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. For more information, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.
```
ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
```

Update your deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Validate the changes you have made to the configuration files using the playbook below before proceeding further:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
```
If any errors occur, correct them in your configuration files and repeat steps 3-5 until no errors remain before going to the next step.
For more details on how to interpret and resolve errors, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.

Remove the node from Cobbler:

sudo cobbler system remove --name=swobj4

Run the Cobbler deploy playbook:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml

The final step will depend on what type of Swift node you are removing.
If the node was a SWPAC node, run the ardana-deploy.yml playbook:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
```
If the node was a SWOBJ node, run the swift-deploy.yml playbook:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-deploy.yml
```
Wait until replication has finished. For more details, see Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring”.
You may need to continue to rebalance the rings. For instructions, see Final Rebalance Phase at Section 8.5.5, “Applying Input Model Changes to Existing Rings”.

13.1.5.1.4.4 Remove the Swift Node from Monitoring #

Once you have removed the Swift node(s), the alarms for them will trigger. Take the following additional steps to resolve this.

You will want to SSH to each of the Monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Swift node(s) you removed. This will require sudo access.

Once you have removed the references on each of your Monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the Swift node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the Monasca CLI which should be installed on each of your Monasca API servers by default:

monasca alarm-list --metric-dimensions hostname=<swift node deleted>

You can then delete the alarm with this command:

monasca alarm-delete <alarm ID>
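If several alarms are listed for the removed node, the IDs can be pulled out of the listing rather than copied by hand. The sketch below assumes that monasca alarm-list prints a bordered table whose first column is the alarm ID and whose header cell is named id; verify this against the output on your system before relying on it. The helper name alarm_ids is hypothetical.

```shell
#!/bin/sh
# alarm_ids: hypothetical helper. Reads a bordered table on stdin and prints
# the value of the first column for every data row.
# Assumption: rows start with "|" and the header cell of column one is "id".
alarm_ids() {
  awk -F'|' '
    /^\|/ {
      id = $2
      gsub(/^[ \t]+|[ \t]+$/, "", id)   # trim surrounding whitespace
      if (id != "" && id != "id") print id
    }
  '
}
```

Under those assumptions, the output of the alarm-list command above could be piped through alarm_ids, and each resulting ID passed to monasca alarm-delete after you have reviewed the list.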

13.1.5.1.5 Replacing a Swift Node #

Maintenance steps for replacing a failed Swift node in your environment.

This process is used when you want to replace a failed Swift node in your cloud.

Warning

If it applies to the server, do not skip step 10. If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but, potentially, will move most objects in your system to new locations and may make data unavailable until the replication process has completed.

13.1.5.1.5.1 How to replace a Swift node in your environment #

Update your cloud configuration with the details of your replacement Swift node.

Edit your servers.yml file to include the details of your replacement Swift node (the MAC address, and the IPMI user, password, and IP address if these have changed).
Note
Do not change the server's IP address (that is, ip-addr).
Path to file:
```
~/openstack/my_cloud/definition/data/servers.yml
```
Example showing the fields to edit (mac-addr, ilo-ip, ilo-user, and ilo-password):
```
 - id: swobj5
   role: SWOBJ-ROLE
   server-group: rack2
   mac-addr: 8c:dc:d4:b5:cb:bd
   nic-mapping: HP-DL360-6PORT
   ip-addr: 10.243.131.10
   ilo-ip: 10.1.12.88
   ilo-user: iLOuser
   ilo-password: iLOpass
   ...
```

Commit the changes to Git:

cd ~/openstack
git commit -a -m "replacing a Swift node"

Run the configuration processor:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost ready-deployment.yml

Update Cobbler and reimage your replacement Swift node:
1. Obtain the name in Cobbler of the node you wish to remove. You will use this value to replace <node name> in future steps.
```
sudo cobbler system list
```
2. Remove the replaced Swift node from Cobbler:
```
sudo cobbler system remove --name <node name>
```
3. Re-run the cobbler-deploy.yml playbook to add the replaced node:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
```
4. Reimage the node using this playbook:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
```
Complete the deployment of your replacement Swift node.
1. Obtain the hostname for your new Swift node. You will use this value to replace <hostname> in future steps.
```
cat ~/openstack/my_cloud/info/server_info.yml
```
2. Configure the operating system on your replacement Swift node:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
```
3. If this is the Swift ring builder server, restore the Swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 15.6.2.4, “Identifying the Swift Ring Building Server” and Section 15.6.2.7, “Recovering Swift Builder Files”.
4. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.
```
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
```

13.1.5.1.6 Replacing Drives in a Swift Node #

Maintenance steps for replacing drives in a Swift node.

This process is used when you want to remove a failed hard drive from a Swift node and replace it with a new one.

There are two classes of drives in a Swift node that may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. Each class of drive has its own replacement procedure to bring the node back to normal.

13.1.5.1.6.1 To Replace the Operating System Disk Drive #

After the operating system disk drive is replaced, the node must be reimaged.

Update your Cobbler profile:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml

Reimage the node using this playbook:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>

In the example below, the swobj2 server is reimaged:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2

Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
```
In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
```
If this is the first server running the swift-proxy service, restore the Swift Ring Builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 15.6.2.4, “Identifying the Swift Ring Building Server” and Section 15.6.2.7, “Recovering Swift Builder Files”.
Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.
```
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass \
  --limit <hostname>
```
For example:
```
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
```

13.1.5.1.6.2 To Replace a Storage Disk Drive #

After a storage drive is replaced, there is no need to reimage the server. Instead, run the swift-reconfigure.yml playbook.

Log in to the Cloud Lifecycle Manager.

Run the following commands:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>

In the following example, the server used is swobj2:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt

13.1.6 Updating MariaDB with Galera #

Updating MariaDB with Galera must be done manually. Updates are not installed automatically. In particular, this situation applies to upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.
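To decide whether a given update falls into the case described above, the installed and target versions can be compared against 10.2.17. The following sketch does this with GNU sort's version ordering; it assumes sort -V is available on the node. The helper names version_lt and crosses_10_2_17 are hypothetical.

```shell
#!/bin/sh
# version_lt: true when version $1 sorts strictly before version $2.
# Relies on the -V (version sort) option of GNU sort.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# crosses_10_2_17: hypothetical helper. True when an update goes from
# MariaDB 10.2.16 or earlier ($1) to MariaDB 10.2.17 or higher ($2).
crosses_10_2_17() {
  version_lt "$1" 10.2.17 && ! version_lt "$2" 10.2.17
}

# Example: an update from 10.2.16 to 10.2.22 crosses the boundary.
if crosses_10_2_17 10.2.16 10.2.22; then
  echo "manual update procedure applies"
fi
```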

Using the CLI, update MariaDB with the following procedure:

Mark Galera as unmanaged:

crm resource unmanage galera

Or put the whole cluster into maintenance mode:

crm configure property maintenance-mode=true

Pick a node other than the one currently targeted by the loadbalancer and stop MariaDB on that node:
```
crm_resource --wait --force-demote -r galera -V
```
Perform updates:
1. Uninstall the old versions of MariaDB and the Galera wsrep provider.
2. Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.
3. Change configuration options if necessary.

Start MariaDB on the node.

crm_resource --wait --force-promote -r galera -V

Run mysql_upgrade with the --skip-write-binlog option.
On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run mysql_upgrade.
Mark Galera as managed:
```
crm resource manage galera
```
Or take the cluster out of maintenance mode.

13.2 Unplanned System Maintenance #

Unplanned maintenance tasks for your cloud.

13.2.1 Whole Cloud Recovery Procedures #

Unplanned maintenance procedures for your whole cloud.

13.2.1.1 Full Disaster Recovery #

In this disaster scenario, you have lost everything in the cloud, including Swift.

13.2.1.1.1 Restore from a Swift backup: #

Restoring from a Swift backup is not possible because Swift is gone.

13.2.1.1.2 Restore from an SSH backup: #

Log in to the Cloud Lifecycle Manager.
Edit the following file so it contains the same information as it had previously:
```
~/openstack/my_cloud/config/freezer/ssh_credentials.yml
```

On the Cloud Lifecycle Manager copy the following files:

cp -r ~/hp-ci/openstack/* ~/openstack/my_cloud/definition/

Run this playbook to restore the Cloud Lifecycle Manager helper:

cd ~/openstack/ardana/ansible/
ansible-playbook -i hosts/localhost _deployer_restore_helper.yml

Run as root, and change directories:

sudo su
cd /root/deployer_restore_helper/

Execute the restore:
```
./deployer_restore_script.sh
```

Run this playbook to deploy your cloud:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml -e '{ "freezer_backup_jobs_upload": false }'

You can now perform the procedures to restore MySQL and Swift. Once everything is restored, re-enable the backups from the Cloud Lifecycle Manager:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
```

13.2.1.2 Full Disaster Recovery Test #

13.2.1.2.1 Prerequisites #

HPE Helion OpenStack platform

An external server to store backups to via SSH

13.2.1.2.2 Goals #

Here is a high-level view of how we expect to test the disaster recovery of the platform.

Back up the control plane using Freezer to an SSH target
Back up the Cassandra Database
Re-install Controller 1 with the HPE Helion OpenStack ISO
Use Freezer to recover deployment data (model …)
Re-install HPE Helion OpenStack on Controller 1, 2, 3
Recover the Cassandra Database
Recover the backup of the MariaDB database

13.2.1.2.3 Description of the testing environment #

The testing environment is very similar to the Entry Scale model.

It used 5 servers: 3 controllers and 2 compute nodes.

Each controller node has three disks. The first is reserved for the system, while the others are used for Swift.

Note

During this disaster recovery exercise, we saved the data on disks 2 and 3 of the Swift controllers.

This allows the Swift objects to be restored after the recovery.

If these disks were wiped as well, the Swift data would be lost, but the procedure would not change.

The only difference is that the Glance images would be lost and would have to be re-uploaded.

13.2.1.2.4 Disaster recovery test note #

If it is not specified otherwise, all the commands should be executed on controller 1, which is also the deployer node.

13.2.1.2.5 Pre-Disaster testing #

In order to validate the procedure after recovery, we need to create some workloads.

Source the service credential file
```
ardana > source ~/service.osrc
```

Copy an image to the platform and create a Glance image with it. In this example, Cirros is used

ardana > openstack image create --disk-format raw --container-format bare --public --file ~/cirros-0.3.5-x86_64-disk.img cirros

Create a network

ardana > openstack network create test_net

Create a subnet

ardana > neutron subnet-create 07c35d11-13f9-41d4-8289-fa92147b1d44 192.168.42.0/24 --name test_subnet

Create some instances

ardana > openstack server create server_1 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_2 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_3 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_4 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server create server_5 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
ardana > openstack server list

Create containers and objects

ardana > swift upload container_1 ~/service.osrc
var/lib/ardana/service.osrc

ardana > swift upload container_1 ~/backup.osrc
var/lib/ardana/backup.osrc

ardana > swift list container_1
var/lib/ardana/backup.osrc
var/lib/ardana/service.osrc

13.2.1.2.6 Preparation of the backup server #

13.2.1.2.6.1 Preparation to store Freezer backups #

In this example, we want to store the backups on the server 192.168.69.132

Freezer will connect with the user backupuser on port 22 and store the backups in the /mnt/backups/ directory.

Connect to the backup server

Create the user

root # useradd backupuser --create-home --home-dir /mnt/backups/

Switch to that user
```
root # su backupuser
```

Create the SSH keypair

backupuser > ssh-keygen -t rsa
> # Just leave the default for the first question and do not set any passphrase
> Generating public/private rsa key pair.
> Enter file in which to save the key (/mnt/backups//.ssh/id_rsa):
> Created directory '/mnt/backups//.ssh'.
> Enter passphrase (empty for no passphrase):
> Enter same passphrase again:
> Your identification has been saved in /mnt/backups//.ssh/id_rsa
> Your public key has been saved in /mnt/backups//.ssh/id_rsa.pub
> The key fingerprint is:
> a9:08:ae:ee:3c:57:62:31:d2:52:77:a7:4e:37:d1:28 backupuser@padawan-ccp-c0-m1-mgmt
> The key's randomart image is:
> +---[RSA 2048]----+
> |          o      |
> |   . . E + .     |
> |  o . . + .      |
> | o +   o +       |
> |  + o o S .      |
> | . + o o         |
> |  o + .          |
> |.o .             |
> |++o              |
> +-----------------+

Add the public key to the list of the keys authorized to connect to that user on this server
```
backupuser > cat /mnt/backups/.ssh/id_rsa.pub >> /mnt/backups/.ssh/authorized_keys
```

Print the private key. This is what we will use for the backup configuration (ssh_credentials.yml file)

backupuser > cat /mnt/backups/.ssh/id_rsa

> -----BEGIN RSA PRIVATE KEY-----
> MIIEogIBAAKCAQEAvjwKu6f940IVGHpUj3ffl3eKXACgVr3L5s9UJnb15+zV3K5L
> BZuor8MLvwtskSkgdXNrpPZhNCsWSkryJff5I335Jhr/e5o03Yy+RqIMrJAIa0X5
> ...
> ...
> ...
> iBKVKGPhOnn4ve3dDqy3q7fS5sivTqCrpaYtByJmPrcJNjb2K7VMLNvgLamK/AbL
> qpSTZjicKZCCl+J2+8lrKAaDWqWtIjSUs29kCL78QmaPOgEvfsw=
> -----END RSA PRIVATE KEY-----

13.2.1.2.6.2 Preparation to store Cassandra backups #

In this example, we want to store the backups on the server 192.168.69.132. We will store the backups in the /mnt/backups/cassandra_backups/ directory.

Create a directory on the backup server to store cassandra backups
```
backupuser > mkdir /mnt/backups/cassandra_backups
```
Copy the private SSH key from the backup server to all controller nodes
```
backupuser > scp /mnt/backups/.ssh/id_rsa ardana@CONTROLLER:~/.ssh/id_rsa_backup
         Password:
         id_rsa     100% 1675     1.6KB/s   00:00
```
Replace CONTROLLER with each controller node in turn, for example doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, and so on.
Log in to each controller node and copy the private SSH key to the root user's .ssh directory
```
tux > sudo cp /var/lib/ardana/.ssh/id_rsa_backup /root/.ssh/
```
Verify that you can SSH to the backup server as the backup user using the private key
```
root # ssh -i ~/.ssh/id_rsa_backup backupuser@doc-cp1-comp0001-mgmt
```

13.2.1.2.7 Perform Backups for disaster recovery test #

13.2.1.2.7.1 Execute backup of Cassandra #

Create the cassandra-backup-extserver.sh script on all controller nodes where Cassandra runs. You can determine which nodes these are by running this command on the deployer:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible FND-CDB --list-hosts

root # cat > ~/cassandra-backup-extserver.sh << EOF
#!/bin/sh

# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/

# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool

# e.g. cassandra-snp-2018-06-26-1003
SNAPSHOT_NAME=cassandra-snp-\$(date +%F-%H%M)
HOST_NAME=\$(/bin/hostname)_

# Take a snapshot of cassandra database
\$NODETOOL snapshot -t \$SNAPSHOT_NAME monasca

# Collect a list of directories that make up the snapshot
SNAPSHOT_DIR_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
for d in \$SNAPSHOT_DIR_LIST
  do
    # copy snapshot directories to external server
    rsync -avR -e "ssh -i /root/.ssh/id_rsa_backup" \$d \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME
  done

\$NODETOOL clearsnapshot monasca
EOF

root # chmod +x ~/cassandra-backup-extserver.sh
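The backslashes before each dollar sign in the script above are needed because the here-document delimiter (EOF) is unquoted, so the shell would otherwise expand the variables and command substitutions while writing the file. Quoting the delimiter is an alternative that avoids the escapes. The sketch below writes the same line both ways into a temporary directory and confirms the results are identical; the file names are hypothetical and used only for the comparison.

```shell
#!/bin/sh
# Write the same one-line script twice: once with an unquoted delimiter and
# an escaped dollar sign, once with a quoted delimiter and no escape.
tmpdir=$(mktemp -d)

cat > "$tmpdir/escaped.sh" << EOF
SNAPSHOT_NAME=cassandra-snp-\$(date +%F-%H%M)
EOF

cat > "$tmpdir/quoted.sh" << 'EOF'
SNAPSHOT_NAME=cassandra-snp-$(date +%F-%H%M)
EOF

# cmp -s exits 0 when the two files are byte-for-byte identical.
if cmp -s "$tmpdir/escaped.sh" "$tmpdir/quoted.sh"; then
  echo "identical"
fi
```

Either form leaves the date command to be evaluated when the backup script runs, not when the file is created.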

Execute the following steps on all the controller nodes.

Note

~/cassandra-backup-extserver.sh should be executed on all three controller nodes at the same time (within seconds of each other) for a successful backup.

Edit the ~/cassandra-backup-extserver.sh script.
Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and desired backup server (for example, 192.168.69.132), respectively.
```
BACKUP_USER=backupuser
BACKUP_SERVER=192.168.69.132
BACKUP_DIR=/mnt/backups/cassandra_backups/
```

Execute ~/cassandra-backup-extserver.sh

root # ~/cassandra-backup-extserver.sh (on all controller nodes which are also cassandra nodes)

Requested creating snapshot(s) for [monasca] with snapshot name [cassandra-snp-2018-06-28-0251] and options {skipFlush=false}
Snapshot directory: cassandra-snp-2018-06-28-0251
sending incremental file list
created directory /mnt/backups/cassandra_backups//doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
/var/
/var/cassandra/
/var/cassandra/data/
/var/cassandra/data/data/
/var/cassandra/data/data/monasca/

...
...
...

/var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-Summary.db
/var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-TOC.txt
/var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/schema.cql
sent 173,691 bytes  received 531 bytes  116,148.00 bytes/sec
total size is 171,378  speedup is 0.98
Requested clearing snapshot(s) for [monasca]
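Because the script has to start on every Cassandra node within seconds of the others, launching it from a single shell can be easier than typing the command in several terminals. The sketch below starts one remote command per host in the background and waits for all of them to finish. The helper name run_on_all and the SSH variable are hypothetical; the remote command you pass is whatever starts the backup script with the required privileges in your environment.

```shell
#!/bin/sh
# run_on_all: hypothetical helper that runs one command on several hosts in
# parallel and returns once every invocation has finished.
#   $1: command to run remotely; remaining arguments: hostnames
# The SSH variable (an assumption of this sketch) overrides the client, which
# is convenient for a dry run, for example SSH=echo.
run_on_all() {
  remote_cmd=$1
  shift
  for host in "$@"; do
    ${SSH:-ssh} "$host" "$remote_cmd" &
  done
  wait
}
```

Setting SSH=echo first prints the host and command pairs without connecting, which gives a quick check of the host list before a real run.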

Verify cassandra backup directory on backup server

backupuser > ls -alt /mnt/backups/cassandra_backups
total 16
drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
drwxr-xr-x 3 backupuser users 4096 Jun 28 02:51 doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
drwxr-xr-x 8 backupuser users 4096 Jun 27 20:56 ..

backupuser > du -shx /mnt/backups/cassandra_backups/*
6.2G    /mnt/backups/cassandra_backups/doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
6.3G    /mnt/backups/cassandra_backups/doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
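In the listing above only two of the controller nodes have a snapshot directory, which is easy to overlook. The sketch below reports which hosts have no directory named in the form used by the backup script (hostname, underscore, then the snapshot name starting with cassandra-snp-). The helper name missing_backups is hypothetical.

```shell
#!/bin/sh
# missing_backups: hypothetical helper that prints each host for which no
# directory named <host>_cassandra-snp-* exists under the backup directory.
#   $1: backup directory; remaining arguments: hostnames to check
missing_backups() {
  backup_dir=$1
  shift
  for host in "$@"; do
    found=no
    for d in "$backup_dir"/"$host"_cassandra-snp-*; do
      # An unmatched glob stays literal, so test that it is a directory.
      if [ -d "$d" ]; then
        found=yes
      fi
    done
    if [ "$found" = no ]; then
      echo "$host"
    fi
  done
}
```

Run on the backup server with /mnt/backups/cassandra_backups and the controller hostnames as arguments, it prints nothing when every node has at least one snapshot directory.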

13.2.1.2.7.2 Execute backup of HPE Helion OpenStack #

Edit the configuration file for SSH backups (be careful to format the private key as requested: pipe on the first line and two spaces indentation). The private key is the key we created on the backup server earlier.

ardana > vi ~/openstack/my_cloud/config/freezer/ssh_credentials.yml

$ cat ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
freezer_ssh_host: 192.168.69.132
freezer_ssh_port: 22
freezer_ssh_username: backupuser
freezer_ssh_base_dir: /mnt/backups
freezer_ssh_private_key: |
  -----BEGIN RSA PRIVATE KEY-----
  MIIEowIBAAKCAQEAyzhZ+F+sXQp70N8zCDDb6ORKAxreT/qD4zAetjOTuBoFlGb8
  pRBY79t9vNp7qvrKaXHBfb1OkKzhqyUwEqNcC9bdngABbb8KkCq+OkfDSAZRrmja
  wa5PzgtSaZcSJm9jQcF04Fq19mZY2BLK3OJL4qISp1DmN3ZthgJcpksYid2G3YG+
  bY/EogrQrdgHfcyLaoEkiBWQSBTEENKTKFBB2jFQYdmif3KaeJySv9cJqihmyotB
  s5YTdvB5Zn/fFCKG66THhKnIm19NftbJcKc+Y3Z/ZX4W9SpMSj5dL2YW0Y176mLy
  gMLyZK9u5k+fVjYLqY7XlVAFalv9+HZsvQ3OQQIDAQABAoIBACfUkqXAsrrFrEDj
  DlCDqwZ5gBwdrwcD9ceYjdxuPXyu9PsCOHBtxNC2N23FcMmxP+zs09y+NuDaUZzG
  vCZbCFZ1tZgbLiyBbiOVjRVFLXw3aNkDSiT98jxTMcLqTi9kU5L2xN6YSOPTaYRo
  IoSqge8YjwlmLMkgGBVU7y3UuCmE/Rylclb1EI9mMPElTF+87tYK9IyA2QbIJm/w
  4aZugSZa3PwUvKGG/TCJVD+JfrZ1kCz6MFnNS1jYT/cQ6nzLsQx7UuYLgpvTMDK6
  Fjq63TmVg9Z1urTB4dqhxzpDbTNfJrV55MuA/z9/qFHs649tFB1/hCsG3EqWcDnP
  mcv79nECgYEA9WdOsDnnCI1bamKA0XZxovb2rpYZyRakv3GujjqDrYTI97zoG+Gh
  gLcD1EMLnLLQWAkDTITIf8eurkVLKzhb1xlN0Z4xCLs7ukgMetlVWfNrcYEkzGa8
  wec7n1LfHcH5BNjjancRH0Q1Xcc2K7UgGe2iw/Iw67wlJ8i5j2Wq3sUCgYEA0/6/
  irdJzFB/9aTC8SFWbqj1DdyrpjJPm4yZeXkRAdn2GeLU2jefqPtxYwMCB1goeORc
  gQLspQpxeDvLdiQod1Y1aTAGYOcZOyAatIlOqiI40y3Mmj8YU/KnL7NMkaYBCrJh
  aW//xo+l20dz52pONzLFjw1tW9vhCsG1QlrCaU0CgYB03qUn4ft4JDHUAWNN3fWS
  YcDrNkrDbIg7MD2sOIu7WFCJQyrbFGJgtUgaj295SeNU+b3bdCU0TXmQPynkRGvg
  jYl0+bxqZxizx1pCKzytoPKbVKCcw5TDV4caglIFjvoz58KuUlQSKt6rcZMHz7Oh
  BX4NiUrpCWo8fyh39Tgh7QKBgEUajm92Tc0XFI8LNSyK9HTACJmLLDzRu5d13nV1
  XHDhDtLjWQUFCrt3sz9WNKwWNaMqtWisfl1SKSjLPQh2wuYbqO9v4zRlQJlAXtQo
  yga1fxZ/oGlLVe/PcmYfKT91AHPvL8fB5XthSexPv11ZDsP5feKiutots47hE+fc
  U/ElAoGBAItNX4jpUfnaOj0mR0L+2R2XNmC5b4PrMhH/+XRRdSr1t76+RJ23MDwf
  SV3u3/30eS7Ch2OV9o9lr0sjMKRgBsLZcaSmKp9K0j/sotwBl0+C4nauZMUKDXqg
  uGCyWeTQdAOD9QblzGoWy6g3ZI+XZWQIMt0pH38d/ZRbuSUk5o5v
  -----END RSA PRIVATE KEY-----
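
The two-space indentation required under freezer_ssh_private_key can be produced mechanically instead of by hand. The following is a minimal sketch; the helper name indent_key and the key path in the usage comment are assumptions for illustration and are not part of the product.

```shell
#!/bin/sh
# indent_key: read a private key on stdin and prefix every line with
# two spaces, as required below "freezer_ssh_private_key: |".
# (indent_key is a hypothetical helper name.)
indent_key() {
  sed 's/^/  /'
}

# Example usage (the key path is an assumption):
#   indent_key < ~/.ssh/id_rsa_backup
```

The output can then be pasted below the line that ends with the pipe character.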

Save the modifications in the GIT repository

ardana > cd ~/openstack/
ardana > git add -A
ardana > git commit -a -m "SSH backup configuration"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Create the Freezer jobs

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml

Wait until all the SSH backup jobs have finished running

Freezer backup jobs run at the interval defined in each job specification, so you must wait for that interval to elapse before a backup job runs.

To find the interval:

ardana > freezer job-list | grep SSH

| 34c1364692f64a328c38d54b95753844 | Ardana Default: deployer backup to SSH      |         7 | success | scheduled |       |            |
| 944154642f624bb7b9ff12c573a70577 | Ardana Default: swift backup to SSH         |         1 | success | scheduled |       |            |
| 22c6bab7ac4d43debcd4f5a9c4c4bb19 | Ardana Default: mysql backup to SSH         |         1 | success | scheduled |       |            |

ardana > freezer job-show 944154642f624bb7b9ff12c573a70577
+-------------+---------------------------------------------------------------------------------+
| Field       | Value                                                                           |
+-------------+---------------------------------------------------------------------------------+
| Job ID      | 944154642f624bb7b9ff12c573a70577                                                |
| Client ID   | ardana-qe201-cp1-c1-m1-mgmt                                                     |
| User ID     | 33a6a77adc4b4799a79a4c3bd40f680d                                                |
| Session ID  |                                                                                 |
| Description | Ardana Default: swift backup to SSH                                             |
| Actions     | [{u'action_id': u'e8373b03ca4b41fdafd83f9ba7734bfa',                            |
|             |   u'freezer_action': {u'action': u'backup',                                     |
|             |                       u'backup_name': u'freezer_swift_builder_dir_backup',      |
|             |                       u'container': u'/mnt/backups/freezer_rings_backups',      |
|             |                       u'log_config_append': u'/etc/freezer/agent-logging.conf', |
|             |                       u'max_level': 14,                                         |
|             |                       u'path_to_backup': u'/etc/swiftlm/',                      |
|             |                       u'remove_older_than': 90,                                 |
|             |                       u'snapshot': True,                                        |
|             |                       u'ssh_host': u'192.168.69.132',                           |
|             |                       u'ssh_key': u'/etc/freezer/ssh_key',                      |
|             |                       u'ssh_port': u'22',                                       |
|             |                       u'ssh_username': u'backupuser',                           |
|             |                       u'storage': u'ssh'},                                      |
|             |   u'max_retries': 5,                                                            |
|             |   u'max_retries_interval': 60,                                                  |
|             |   u'user_id': u'33a6a77adc4b4799a79a4c3bd40f680d'}]                             |
| Start Date  |                                                                                 |
| End Date    |                                                                                 |
| Interval    | 24 hours                                                                        |
+-------------+---------------------------------------------------------------------------------+

The Swift SSH backup job has an interval of 24 hours, so the next backup runs after 24 hours.

In the default installation, the intervals for the backup jobs are:

Table 13.1: Default Interval for Freezer backup jobs #

Job Name	Interval
Ardana Default: deployer backup to SSH	48 hours
Ardana Default: mysql backup to SSH	12 hours
Ardana Default: swift backup to SSH	24 hours

You will have to wait for as long as 48 hours for all the backup jobs to run
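
If you prefer to read the interval of a job without scanning the whole table, the Interval row of the freezer job-show output can be extracted with a small filter. This is a sketch only: the helper name job_interval is an assumption, and it relies on the table layout shown in the example above.

```shell
# job_interval: read "freezer job-show" table output on stdin and print
# the value of the Interval row (hypothetical helper name).
job_interval() {
  awk -F'|' '$2 ~ /^ *Interval *$/ {
    gsub(/^ +| +$/, "", $3)   # trim the padding around the value
    print $3
  }'
}

# Example usage, with the job ID from the listing above:
#   freezer job-show 944154642f624bb7b9ff12c573a70577 | job_interval
```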

On the backup server, you can verify that the backup files are present

backupuser > ls -lah  /mnt/backups/
total 16
drwxr-xr-x 2 backupuser users 4096 Jun 27  2017 bin
drwxr-xr-x 2 backupuser users 4096 Jun 29 14:04 freezer_database_backups
drwxr-xr-x 2 backupuser users 4096 Jun 29 14:05 freezer_lifecycle_manager_backups
drwxr-xr-x 2 backupuser users 4096 Jun 29 14:05 freezer_rings_backups

backupuser > du -shx *
4.0K    bin
509M    freezer_audit_logs_backups
2.8G    freezer_database_backups
24G     freezer_lifecycle_manager_backups
160K    freezer_rings_backups
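
The same check can be scripted so that a missing or empty backup directory is reported explicitly. The sketch below assumes the directory names shown in the listing above; check_backup_dirs is a hypothetical helper name.

```shell
# check_backup_dirs BASE DIR...: report every expected backup directory
# under BASE that is missing or empty. Returns non-zero if any check fails.
check_backup_dirs() {
  base=$1; shift
  rc=0
  for d in "$@"; do
    if [ ! -d "$base/$d" ]; then
      echo "missing: $base/$d"; rc=1
    elif [ -z "$(ls -A "$base/$d")" ]; then
      echo "empty: $base/$d"; rc=1
    fi
  done
  return $rc
}

# Example usage on the backup server:
#   check_backup_dirs /mnt/backups freezer_database_backups \
#     freezer_lifecycle_manager_backups freezer_rings_backups
```

A non-empty directory only shows that a backup was written, not that it is complete or restorable.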

13.2.1.2.8 Restore of the first controller #

Edit the SSH backup configuration (re-enter the same information as earlier)
```
ardana > vi ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
```
Execute the restore helper. When prompted, enter the hostname the first controller had. In this example: doc-cp1-c1-m1-mgmt
```
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
```
Execute the restore. When prompted, leave the first value empty (none) and validate the restore by typing 'yes'.
```
ardana > sudo su
root # cd /root/deployer_restore_helper/
root # ./deployer_restore_script.sh
```

Create a restore file for Swift rings

ardana > nano swift_rings_restore.ini
ardana > cat swift_rings_restore.ini

Help:

[default]
action = restore
storage = ssh
# backup server ip
ssh_host = 192.168.69.132
# username to connect to the backup server
ssh_username = backupuser
ssh_key = /etc/freezer/ssh_key
# base directory for backups on the backup server
container = /mnt/backups/freezer_rings_backups
backup_name = freezer_swift_builder_dir_backup
restore_abs_path = /etc/swiftlm
log_file = /var/log/freezer-agent/freezer-agent.log
# hostname of the controller
hostname = doc-cp1-c1-m1-mgmt
overwrite = True

Execute the restore of the swift rings

ardana > freezer-agent --config ./swift_rings_restore.ini

13.2.1.2.9 Re-deployment of controllers 1, 2 and 3 #

Change back to the default ardana user

Deactivate the freezer backup jobs (otherwise empty backups would be added on top of the current good backups)

ardana > nano ~/openstack/my_cloud/config/freezer/activate_jobs.yml
ardana > cat ~/openstack/my_cloud/config/freezer/activate_jobs.yml

# If set to false, We wont create backups jobs.
freezer_create_backup_jobs: false

# If set to false, We wont create restore jobs.
freezer_create_restore_jobs: true
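
Because this flag is switched off here and switched back on after the restore, a small helper can make the change less error-prone than editing by hand. This is a sketch under the assumption that the file contains the freezer_create_backup_jobs key exactly as shown above; set_backup_jobs is a hypothetical helper name.

```shell
# set_backup_jobs FILE VALUE: set freezer_create_backup_jobs to VALUE
# (true or false) in FILE, leaving every other line untouched.
set_backup_jobs() {
  file=$1; value=$2
  case $value in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 1 ;;
  esac
  tmp=$(mktemp) || return 1
  # Write to a temporary file first so FILE is never left half-written.
  sed "s/^freezer_create_backup_jobs:.*/freezer_create_backup_jobs: $value/" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return $rc
}

# Example usage:
#   set_backup_jobs ~/openstack/my_cloud/config/freezer/activate_jobs.yml false
```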

Save the modification in the GIT repository

ardana > cd ~/openstack/
ardana > git add -A
ardana > git commit -a -m "De-Activate SSH backup jobs during re-deployment"
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Run the cobbler-deploy.yml playbook

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Run the bm-reimage.yml playbook limited to the second and third controller
```
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3
```
controller2 and controller3 names can vary. You can use the bm-power-status.yml playbook in order to check the cobbler names of these nodes.
Run the site.yml playbook limited to the three controllers and localhost. In this example, this means: doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, doc-cp1-c1-m3-mgmt and localhost
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
```

13.2.1.2.10 Cassandra database restore #

Create a script cassandra-restore-extserver.sh on all controller nodes

root # cat > ~/cassandra-restore-extserver.sh << EOF
#!/bin/sh

# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/

# Setup variables
DATA_DIR=/var/cassandra
NODETOOL=/usr/bin/nodetool

HOST_NAME=\$(/bin/hostname)_

#Get snapshot name from command line.
if [ -z "\$*"  ]
then
  echo "usage \$0 <snapshot to restore>"
  exit 1
fi
SNAPSHOT_NAME=\$1

# restore
rsync -av -e "ssh -i /root/.ssh/id_rsa_backup" \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME/ /

# set ownership of newly restored files
chown -R cassandra:cassandra \$DATA_DIR

# Get a list of snapshot directories that have files to be restored.
RESTORE_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)

# use RESTORE_LIST to move snapshot files back into place of database.
for d in \$RESTORE_LIST
do
  cd \$d
  mv * ../..
  KEYSPACE=\$(pwd | rev | cut -d '/' -f4 | rev)
  TABLE_NAME=\$(pwd | rev | cut -d '/' -f3 |rev | cut -d '-' -f1)
  \$NODETOOL refresh \$KEYSPACE \$TABLE_NAME
done
cd
# Cleanup snapshot directories
\$NODETOOL clearsnapshot \$KEYSPACE
EOF

root # chmod +x ~/cassandra-restore-extserver.sh
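
The script derives the keyspace and table name from the path of each snapshot directory with a rev/cut pipeline. The sketch below isolates that step so you can see what it yields; the function names keyspace_of and table_of are assumptions, and the sample path is the one that appears in the example restore output for this procedure.

```shell
# Derive keyspace and table name from a snapshot directory path, using
# the same rev/cut pipeline as the restore script.
# (keyspace_of and table_of are hypothetical helper names.)
keyspace_of() { printf '%s\n' "$1" | rev | cut -d '/' -f4 | rev; }
table_of()    { printf '%s\n' "$1" | rev | cut -d '/' -f3 | rev | cut -d '-' -f1; }

# Sample snapshot directory path.
p=/var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306
keyspace_of "$p"
table_of "$p"
```

Counting from the end of the path, the fourth component is the keyspace directory and the third is the table directory, whose name is the table name followed by a hyphen and an identifier.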

Execute following steps on all the controller nodes

Edit ~/cassandra-restore-extserver.sh script
Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.
```
BACKUP_USER=backupuser
BACKUP_SERVER=192.168.69.132
BACKUP_DIR=/mnt/backups/cassandra_backups/
```

Execute ~/cassandra-restore-extserver.sh SNAPSHOT_NAME

Find the SNAPSHOT_NAME by listing /mnt/backups/cassandra_backups. Each directory name has the format HOST_SNAPSHOT_NAME.

ls -alt /mnt/backups/cassandra_backups
total 16
drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
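
Since every directory name has the form HOST_SNAPSHOT_NAME, the snapshot names for one controller can be obtained by stripping the host prefix. The following is a sketch only; snapshots_for_host is a hypothetical helper name.

```shell
# snapshots_for_host HOST: read directory names on stdin (one per line)
# and print the SNAPSHOT_NAME part of those that start with "HOST_".
snapshots_for_host() {
  prefix="$1_"
  while IFS= read -r name; do
    case $name in
      "$prefix"*) printf '%s\n' "${name#"$prefix"}" ;;
    esac
  done
}

# Example usage:
#   ls /mnt/backups/cassandra_backups | snapshots_for_host doc-cp1-c1-m2-mgmt
```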

root # ~/cassandra-restore-extserver.sh cassandra-snp-2018-06-28-0306

receiving incremental file list
./
var/
var/cassandra/
var/cassandra/data/
var/cassandra/data/data/
var/cassandra/data/data/monasca/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/manifest.json
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-CompressionInfo.db
var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-Data.db
...
...
...
/usr/bin/nodetool clearsnapshot monasca

13.2.1.2.11 Databases restore #

13.2.1.2.11.1 MariaDB database restore #

Source the backup credentials file
```
ardana > source ~/backup.osrc
```

List Freezer jobs

Gather the ID of the job that corresponds to the first controller and has the description "mysql restore from SSH". For example:

ardana > freezer job-list | grep "mysql restore from SSH"
+----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
| Job ID                           | Description                                 | # Actions | Result  | Status    | Event | Session ID |
+----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
| 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH      |         1 |         | stop      |       |            |

ardana > freezer job-show 64715c6ce8ed40e1b346136083923260
+-------------+---------------------------------------------------------------------------------+
| Field       | Value                                                                           |
+-------------+---------------------------------------------------------------------------------+
| Job ID      | 64715c6ce8ed40e1b346136083923260                                                |
| Client ID   | doc-cp1-c1-m1-mgmt                                                              |
| User ID     | 33a6a77adc4b4799a79a4c3bd40f680d                                                |
| Session ID  |                                                                                 |
| Description | Ardana Default: mysql restore from SSH                                          |
| Actions     | [{u'action_id': u'19dfb0b1851e41c682716ecc6990b25b',                            |
|             |   u'freezer_action': {u'action': u'restore',                                    |
|             |                       u'backup_name': u'freezer_mysql_backup',                  |
|             |                       u'container': u'/mnt/backups/freezer_database_backups',   |
|             |                       u'hostname': u'doc-cp1-c1-m1-mgmt',                       |
|             |                       u'log_config_append': u'/etc/freezer/agent-logging.conf', |
|             |                       u'restore_abs_path': u'/tmp/mysql_restore/',              |
|             |                       u'ssh_host': u'192.168.69.132',                           |
|             |                       u'ssh_key': u'/etc/freezer/ssh_key',                      |
|             |                       u'ssh_port': u'22',                                       |
|             |                       u'ssh_username': u'backupuser',                           |
|             |                       u'storage': u'ssh'},                                      |
|             |   u'max_retries': 5,                                                            |
|             |   u'max_retries_interval': 60,                                                  |
|             |   u'user_id': u'33a6a77adc4b4799a79a4c3bd40f680d'}]                             |
| Start Date  |                                                                                 |
| End Date    |                                                                                 |
| Interval    |                                                                                 |
+-------------+---------------------------------------------------------------------------------+

Start the job using its id

ardana > freezer job-start 64715c6ce8ed40e1b346136083923260
Start request sent for job 64715c6ce8ed40e1b346136083923260

Wait until the job result is success

ardana > freezer job-list | grep "mysql restore from SSH"
+----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
| Job ID                           | Description                                 | # Actions | Result  | Status    | Event | Session ID |
+----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
| 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH      |         1 |         | running   |       |            |

ardana > freezer job-list | grep "mysql restore from SSH"
+----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
| Job ID                           | Description                                 | # Actions | Result  | Status    | Event | Session ID |
+----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
| 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH      |         1 | success | completed |       |            |
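
Rather than re-running the listing by hand, the Result and Status columns can be read from the job row and polled in a loop. This is a minimal sketch that assumes the column layout shown above; the helper names and the 30-second polling period are assumptions.

```shell
# job_column N: read one "freezer job-list" row on stdin and print the
# trimmed value of pipe-separated field N (hypothetical helper).
job_column() {
  awk -F'|' -v col="$1" 'NF > 6 { v = $col; gsub(/^ +| +$/, "", v); print v }'
}
job_result() { job_column 5; }   # Result column
job_status() { job_column 6; }   # Status column

# wait_for_restore JOB_ID: poll until the job reports a result, then
# print it. The 30-second period is an arbitrary choice.
wait_for_restore() {
  while :; do
    row=$(freezer job-list | grep "$1")
    result=$(printf '%s\n' "$row" | job_result)
    [ -n "$result" ] && { echo "$result"; break; }
    sleep 30
  done
}

# Example usage:
#   wait_for_restore 64715c6ce8ed40e1b346136083923260
```

The loop prints whatever result the job reports, so check that the printed value is success before you continue.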

Verify that the files have been restored on the controller

ardana > sudo du -shx /tmp/mysql_restore/*

16K     /tmp/mysql_restore/aria_log.00000001
4.0K    /tmp/mysql_restore/aria_log_control
3.4M    /tmp/mysql_restore/barbican
8.0K    /tmp/mysql_restore/ceilometer
4.2M    /tmp/mysql_restore/cinder
2.9M    /tmp/mysql_restore/designate
129M    /tmp/mysql_restore/galera.cache
2.1M    /tmp/mysql_restore/glance
4.0K    /tmp/mysql_restore/grastate.dat
4.0K    /tmp/mysql_restore/gvwstate.dat
2.6M    /tmp/mysql_restore/heat
752K    /tmp/mysql_restore/horizon
4.0K    /tmp/mysql_restore/ib_buffer_pool
76M     /tmp/mysql_restore/ibdata1
128M    /tmp/mysql_restore/ib_logfile0
128M    /tmp/mysql_restore/ib_logfile1
12M     /tmp/mysql_restore/ibtmp1
16K     /tmp/mysql_restore/innobackup.backup.log
313M    /tmp/mysql_restore/keystone
716K    /tmp/mysql_restore/magnum
12M     /tmp/mysql_restore/mon
8.3M    /tmp/mysql_restore/monasca_transform
0       /tmp/mysql_restore/multi-master.info
11M     /tmp/mysql_restore/mysql
4.0K    /tmp/mysql_restore/mysql_upgrade_info
14M     /tmp/mysql_restore/nova
4.4M    /tmp/mysql_restore/nova_api
14M     /tmp/mysql_restore/nova_cell0
3.6M    /tmp/mysql_restore/octavia
208K    /tmp/mysql_restore/opsconsole
38M     /tmp/mysql_restore/ovs_neutron
8.0K    /tmp/mysql_restore/performance_schema
24K     /tmp/mysql_restore/tc.log
4.0K    /tmp/mysql_restore/test
8.0K    /tmp/mysql_restore/winchester
4.0K    /tmp/mysql_restore/xtrabackup_galera_info

Repeat steps 2-5 on the other two controllers where the MariaDB/Galera database is running. To determine which controllers these are, run the following command on the deployer:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible FND-MDB --list-hosts
```

Stop HPE Helion OpenStack services on the three controllers (replace the hostnames of the controllers in the command)

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost

Clean the mysql directory and copy the restored backup on all three controllers where MariaDB/Galera database is running
```
root # cd /var/lib/mysql/
root # rm -rf ./*
root # cp -pr /tmp/mysql_restore/* ./
```
Switch back to the ardana user once the copy is finished
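
Because the cleanup step deletes the live database directory, it can be worth guarding it so that nothing is removed unless the restored files are actually present. The sketch below is one way to do that; replace_datadir is a hypothetical helper name. Unlike rm -rf ./*, it also removes hidden files from the target directory.

```shell
# replace_datadir SRC DST: empty DST and copy the contents of SRC into it,
# but only if SRC exists and is non-empty. Run as root.
replace_datadir() {
  src=$1; dst=$2
  if [ ! -d "$src" ] || [ -z "$(ls -A "$src")" ]; then
    echo "refusing: $src is missing or empty" >&2
    return 1
  fi
  [ -d "$dst" ] || return 1
  # Remove the old contents (including dotfiles), then copy while
  # preserving ownership, mode and timestamps, as cp -pr does above.
  find "$dst" -mindepth 1 -delete
  cp -pr "$src"/. "$dst"/
}

# Example usage:
#   replace_datadir /tmp/mysql_restore /var/lib/mysql
```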

13.2.1.2.11.2 Restart HPE Helion OpenStack services #

Restart the MariaDB Database
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
```
On the deployer node, execute the galera-bootstrap.yml playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.
If this process fails to recover the database cluster, please refer to Section 13.2.2.1.2, “Recovering the MariaDB Database”. There Scenario 3 covers the process of manually starting the database.
Restart HPE Helion OpenStack services limited to the three controllers (replace the hostnames of the controllers in the command).
```
ansible-playbook -i hosts/verb_hosts ardana-start.yml \
 --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
```

Re-configure HPE Helion OpenStack

ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml

13.2.1.2.11.3 Re-enable SSH backups #

Re-activate Freezer backup jobs

ardana > vi ~/openstack/my_cloud/config/freezer/activate_jobs.yml
ardana > cat ~/openstack/my_cloud/config/freezer/activate_jobs.yml

# If set to false, We wont create backups jobs.
freezer_create_backup_jobs: true

# If set to false, We wont create restore jobs.
freezer_create_restore_jobs: true

Save the modifications in the GIT repository

cd ~/openstack/ardana/ansible/
git add -A
git commit -a -m "Re-Activate SSH backup jobs"
ansible-playbook -i hosts/localhost config-processor-run.yml
ansible-playbook -i hosts/localhost ready-deployment.yml

Create Freezer jobs

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml

13.2.1.2.12 Post restore testing #

Source the service credential file
```
ardana > source ~/service.osrc
```

Swift

ardana > swift list
container_1
volumebackups

ardana > swift list container_1
var/lib/ardana/backup.osrc
var/lib/ardana/service.osrc

ardana > swift download container_1 /tmp/backup.osrc

Neutron

ardana > openstack network list
+--------------------------------------+---------------------+--------------------------------------+
| ID                                   | Name                | Subnets                              |
+--------------------------------------+---------------------+--------------------------------------+
| 07c35d11-13f9-41d4-8289-fa92147b1d44 | test-net            | 02d5ca3b-1133-4a74-a9ab-1f1dc2853ec8 |
+--------------------------------------+---------------------+--------------------------------------+

Glance

ardana > openstack image list
+--------------------------------------+----------------------+--------+
| ID                                   | Name                 | Status |
+--------------------------------------+----------------------+--------+
| 411a0363-7f4b-4bbc-889c-b9614e2da52e | cirros-0.4.0-x86_64  | active |
+--------------------------------------+----------------------+--------+
ardana > openstack image save --file /tmp/cirros f751c39b-f1e3-4f02-8332-3886826889ba
ardana > ls -lah /tmp/cirros
-rw-r--r-- 1 ardana ardana 12716032 Jul  2 20:52 /tmp/cirros

Nova

ardana > openstack server list

ardana > openstack server create server_6 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e  --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
+-------------------------------------+------------------------------------------------------------+
| Field                               | Value                                                      |
+-------------------------------------+------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                     |
| OS-EXT-AZ:availability_zone         |                                                            |
| OS-EXT-SRV-ATTR:host                | None                                                       |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                       |
| OS-EXT-SRV-ATTR:instance_name       |                                                            |
| OS-EXT-STS:power_state              | NOSTATE                                                    |
| OS-EXT-STS:task_state               | scheduling                                                 |
| OS-EXT-STS:vm_state                 | building                                                   |
| OS-SRV-USG:launched_at              | None                                                       |
| OS-SRV-USG:terminated_at            | None                                                       |
| accessIPv4                          |                                                            |
| accessIPv6                          |                                                            |
| addresses                           |                                                            |
| adminPass                           | iJBoBaj53oUd                                               |
| config_drive                        |                                                            |
| created                             | 2018-07-02T21:02:01Z                                       |
| flavor                              | m1.small (2)                                               |
| hostId                              |                                                            |
| id                                  | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c                       |
| image                               | cirros-0.4.0-x86_64 (f751c39b-f1e3-4f02-8332-3886826889ba) |
| key_name                            | None                                                       |
| name                                | server_6                                                   |
| progress                            | 0                                                          |
| project_id                          | cca416004124432592b2949a5c5d9949                           |
| properties                          |                                                            |
| security_groups                     | name='default'                                             |
| status                              | BUILD                                                      |
| updated                             | 2018-07-02T21:02:01Z                                       |
| user_id                             | 8cb1168776d24390b44c3aaa0720b532                           |
| volumes_attached                    |                                                            |
+-------------------------------------+------------------------------------------------------------+

ardana > openstack server list
+--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+
| ID                                   | Name     | Status | Networks                        | Image               | Flavor    |
+--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+
| ce7689ff-23bf-4fe9-b2a9-922d4aa9412c | server_6 | ACTIVE | n1=1.1.1.8                      | cirros-0.4.0-x86_64 | m1.small  |

ardana > openstack server delete ce7689ff-23bf-4fe9-b2a9-922d4aa9412c

13.2.2 Unplanned Control Plane Maintenance #

Unplanned maintenance tasks for controller nodes such as recovery from power failure.

13.2.2.1 Restarting Controller Nodes After a Reboot #

When a controller node is rebooted, needs hardware maintenance, loses network connectivity, or loses power, these steps will help you recover the node.

These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.

13.2.2.1.1 Prerequisites #

The following conditions must be true in order to perform these steps successfully:

Each of your controller nodes should be powered on.
Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.
The operator who performs these steps will need access to the Cloud Lifecycle Manager.

13.2.2.1.2 Recovering the MariaDB Database #

The recovery process for your MariaDB database cluster will depend on how many of your controller nodes need to be recovered. We will cover two scenarios:

Scenario 1: Recovering one or two of your controller nodes but not the entire cluster

If you need to recover one or two of your controller nodes but not the entire cluster, use these steps:

Ensure the controller nodes have power and are booted to the command prompt.
If the MariaDB service is not started, start it with this command:
```
sudo service mysql start
```
If MariaDB fails to start, proceed to the next section which covers the bootstrap process.

Scenario 2: Recovering the entire controller cluster with the bootstrap playbook

If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.

Make sure no mysqld daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is a mysqld daemon running, then use the command below to shut down the daemon.
```
sudo systemctl stop mysql
```
If the mysqld daemon does not go down following the service stop, then kill the daemon using kill -9 before continuing.
On the deployer node, execute the galera-bootstrap.yml playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
```

13.2.2.1.3 Restarting Services on the Controller Nodes #

From the Cloud Lifecycle Manager you should execute the ardana-start.yml playbook for each node that was brought down so the services can be started back up.

If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>

If you have a shared Cloud Lifecycle Manager/controller setup and need to restart services on this shared node, you can use localhost to indicate the shared node, like this:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost

Note

If you leave off the --limit switch, the playbook will be run against all nodes.

13.2.2.1.4 Restart the Monitoring Agents #

As part of the recovery process, you should also restart the monasca-agent and these steps will show you how:

Stop the monasca-agent:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-agent-stop.yml

Restart the monasca-agent:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-agent-start.yml

You can then confirm the status of the monasca-agent with this playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
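
The stop, start, and status playbooks above can be chained so that a failure in one step stops the sequence. This is a sketch only: restart_monasca_agent is a hypothetical helper name, and the ANSIBLE_PLAYBOOK variable is an assumption introduced here so that the runner can be swapped out for a dry run.

```shell
# restart_monasca_agent: run the stop, start and status playbooks in
# order and stop at the first failure.
# ANSIBLE_PLAYBOOK is a hypothetical override for the command to run;
# it defaults to ansible-playbook.
restart_monasca_agent() {
  runner=${ANSIBLE_PLAYBOOK:-ansible-playbook}
  for pb in monasca-agent-stop.yml monasca-agent-start.yml \
            monasca-agent-status.yml; do
    "$runner" -i hosts/verb_hosts "$pb" || return 1
  done
}

# Example usage:
#   cd ~/scratch/ansible/next/ardana/ansible
#   restart_monasca_agent
```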

13.2.2.2 Recovering the Control Plane #

If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, the scenarios below describe how to recover your cloud.

Note

You should have backed up /etc/group of the Cloud Lifecycle Manager manually after installation. While recovering a Cloud Lifecycle Manager node, manually copy the /etc/group file from a backup of the old Cloud Lifecycle Manager.

13.2.2.2.1 Point-in-Time MariaDB Database Recovery #

In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.

13.2.2.2.1.1 Restore from a Swift backup #

Log in to the Cloud Lifecycle Manager.
Determine which node is the first host member in the FND-MDB group, which will be the first node hosting the MariaDB service in your cloud. You can do this by using these commands:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > grep -A1 FND-MDB--first-member hosts/verb_hosts
```
The result will be similar to the following example:
```
[FND-MDB--first-member:children]
ardana002-cp1-c1-m1
```
In this example, the host name of the node is ardana002-cp1-c1-m1
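
If you want to capture the host name in a variable for the following steps, the line after the group header can be extracted directly. This is a minimal sketch; first_mdb_member is a hypothetical helper name, and it assumes the member is listed on the line immediately after the header, as in the example above.

```shell
# first_mdb_member FILE: print the line that follows the
# [FND-MDB--first-member:children] header in an inventory file.
first_mdb_member() {
  awk '/^\[FND-MDB--first-member:children\]/ { getline; print; exit }' "$1"
}

# Example usage:
#   cd ~/scratch/ansible/next/ardana/ansible
#   first_mdb_member hosts/verb_hosts
```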

Find the host IP address which will be used to log in.

ardana > cat /etc/hosts | grep ardana002-cp1-c1-m1
10.84.43.82      ardana002-cp1-c1-m1-extapi ardana002-cp1-c1-m1-extapi
192.168.24.21    ardana002-cp1-c1-m1-mgmt ardana002-cp1-c1-m1-mgmt
10.1.2.1         ardana002-cp1-c1-m1-guest ardana002-cp1-c1-m1-guest
10.84.65.3       ardana002-cp1-c1-m1-EXTERNAL-VM ardana002-cp1-c1-m1-external-vm

In this example, 192.168.24.21 is the IP address for the host.

SSH into the host.
```
ardana > ssh ardana@192.168.24.21
```

Source the backup file.

ardana > source /var/lib/ardana/backup.osrc

Find the Client ID for the host name from the beginning of this procedure (ardana002-cp1-c1-m1 in this example).

ardana > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

List the jobs:

ardana > freezer job-list -C CLIENT_ID

Using the example in the previous step:

ardana > freezer job-list -C ardana002-cp1-c1-m1-mgmt

Get the corresponding job id for Ardana Default: mysql restore from Swift.
Launch the restore process with:
```
ardana > freezer job-start JOB-ID
```
This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.
Log in to the Cloud Lifecycle Manager.

Stop the MariaDB service.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml

Log back in to the first node running the MariaDB service, the same node as in Step 3.
Clean the MariaDB directory using this command:
```
tux > sudo rm -r /var/lib/mysql/*
```

Copy the restored files back to the MariaDB directory:

tux > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql

Log in to each of the other nodes in your MariaDB cluster, which were determined in Step 3. Remove the grastate.dat file from each of them.
```
tux > sudo rm /var/lib/mysql/grastate.dat
```
Warning
Do not remove this file from the first node in your MariaDB cluster. Ensure you only do this from the other cluster nodes.
Log back in to the Cloud Lifecycle Manager.

Start the MariaDB service.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
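
The host lookup at the start of this procedure (the grep against hosts/verb_hosts followed by the grep against /etc/hosts) can be sketched as two small shell functions. This is only an illustration: the function names first_mdb_member and mgmt_ip are hypothetical, and the sketch assumes the management address is the /etc/hosts line carrying the NODE-mgmt name, as in the example output above.

```bash
# Hypothetical helpers, not part of the product.

# Print the first host listed under [FND-MDB--first-member:children]
# in an Ansible inventory file (default: hosts/verb_hosts).
first_mdb_member() (
    inventory=${1:-hosts/verb_hosts}
    awk '
        /^\[FND-MDB--first-member:children\]/ { in_group = 1; next }
        /^\[/                                  { in_group = 0 }
        in_group && NF                         { print $1; exit }
    ' "$inventory"
)

# Print the address of the hosts-file line that carries the NODE-mgmt
# name (assumed naming convention; default file: /etc/hosts).
mgmt_ip() (
    node=$1
    hosts_file=${2:-/etc/hosts}
    awk -v name="${node}-mgmt" '
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$hosts_file"
)
```

Run from ~/scratch/ansible/next/ardana/ansible, `mgmt_ip "$(first_mdb_member)"` would print the address to pass to ssh.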

13.2.2.2.1.2 Restore from an SSH backup #

Follow the same procedure as the one for Swift but select the job Ardana Default: mysql restore from SSH.

13.2.2.2.1.3 Restore MariaDB manually #

If restoring MariaDB fails during the procedure outlined above, you can follow this procedure to manually restore MariaDB:

Stop the MariaDB cluster:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml

On all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:
```
tux > sudo rm -r /var/lib/mysql/*
```
On the first node running the MariaDB service, restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.
```
tux > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql
```

If you need to restore the files manually from SSH, follow these steps:

Create the /root/mysql_restore.ini file with the contents below. Be careful to substitute the {{ values }}. Note that the SSH information refers to the SSH server you configured for backup before installing.

[default]
action = restore
storage = ssh
ssh_host = {{ freezer_ssh_host }}
ssh_username = {{ freezer_ssh_username }}
container = {{ freezer_ssh_base_dir }}/freezer_mysql_backup
ssh_key = /etc/freezer/ssh_key
backup_name = freezer_mysql_backup
restore_abs_path = /var/lib/mysql/
log_file = /var/log/freezer-agent/freezer-agent.log
hostname = {{ hostname of the first MariaDB node }}

Execute the restore job:

ardana > freezer-agent --config /root/mysql_restore.ini

Log back in to the Cloud Lifecycle Manager.

Start the MariaDB service.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

After approximately 10-15 minutes, the output of the percona-status.yml playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

An example output is as follows:

TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] *************
  ok: [ardana-cp1-c1-m1-mgmt] => {
  "msg": "mysql is synced."
  }
  ok: [ardana-cp1-c1-m2-mgmt] => {
  "msg": "mysql is synced."
  }
  ok: [ardana-cp1-c1-m3-mgmt] => {
  "msg": "mysql is synced."
  }
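
If you want to check the status output from a script rather than by eye, the following is a minimal sketch. The function name all_mysql_synced is hypothetical, and the sketch assumes the playbook output keeps the layout of the example above (a "Report status" task header followed by one ok: entry and one "msg" line per host).

```bash
# Hypothetical helper: read percona-status.yml output on stdin and succeed
# only if every host listed under the "Report status" task reports
# "mysql is synced."
all_mysql_synced() (
    awk '
        /^TASK: \[FND-MDB \| status \| Report status/ { in_task = 1; next }
        /^(TASK|PLAY)/                                 { in_task = 0 }
        in_task && /ok: \[/                            { hosts++ }
        in_task && /"msg": "mysql is synced\."/        { synced++ }
        END { exit !(hosts > 0 && hosts == synced) }
    '
)
```

For example, `ansible-playbook -i hosts/verb_hosts percona-status.yml | all_mysql_synced && echo "cluster in sync"`.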

13.2.2.2.1.4 Point-in-Time Cassandra Recovery #

A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.

The following steps should be taken before enabling and deploying the replacement node.

Determine the IP address of the node that was removed or is being replaced.
On one of the functional Cassandra control plane nodes, log in as the ardana user.
Run the command nodetool status to display a list of Cassandra nodes.
If the removed node is not in the list (no address in the list matches the IP address of the removed node), skip the next step.
If the node that was removed is still in the list, copy its node ID.
Run the command nodetool removenode ID.

After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 13.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.

For more information, please consult the Cassandra documentation.
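
The node ID lookup in the steps above can be sketched as a small filter over the nodetool status output. The function name cassandra_host_id is hypothetical. The sketch assumes the usual nodetool status layout, in which the second column of each node row is the address and the Host ID is a UUID later on the same row; verify this against the output of your Cassandra version.

```bash
# Hypothetical helper: read `nodetool status` output on stdin and print the
# Host ID (the UUID-shaped field) of the row whose address equals the given
# IP. Prints nothing if no row matches.
cassandra_host_id() (
    ip=$1
    awk -v ip="$ip" '
        $2 == ip {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+-[0-9a-fA-F]+$/) {
                    print $i
                    exit
                }
        }
    '
)
```

A possible use is `id=$(nodetool status | cassandra_host_id REMOVED_NODE_IP)`, followed by `nodetool removenode "$id"` only if `$id` is not empty.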

13.2.2.2.2 Point-in-Time Swift Rings Recovery #

In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your Swift rings to a previous state.

Note

Freezer backs up and restores Swift rings only, not Swift data.

13.2.2.2.2.1 Restore from a Swift backup #

Find the first Swift Proxy node by running the following on the Cloud Lifecycle Manager:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
--limit SWF-PRX[0]

At the end of the output, you will see something like the following example:

...
Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'

PLAY RECAP ********************************************************************
ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0

Find the first node name and its IP address. For example:
```
ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
```

Source the backup environment file:

ardana > source /var/lib/ardana/backup.osrc

Find the client id.

ardana > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

List the jobs:

ardana > freezer job-list -C CLIENT_ID

Using the example in the previous step:

ardana > freezer job-list -C ardana002-cp1-c1-m1-mgmt

Get the corresponding job id for Ardana Default: swift restore from Swift in the Description column.
Launch the restore job:
```
ardana > freezer job-start JOB-ID
```
This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.
Log in to the Cloud Lifecycle Manager.

Stop the Swift service:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml

Log back in to the first Swift Proxy (SWF-PRX[0]) node, which was determined in Step 1.

Copy the restored files.

tux > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
    /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

tux > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
    /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/

Log back in to the Cloud Lifecycle Manager.

Reconfigure the Swift service:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
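
The copy step above can be wrapped in a function that refuses to run when the restore job left nothing behind. This is a sketch under stated assumptions: the function name restore_builder_dir is hypothetical, the default paths are the ones used in this procedure, and the function must be run with root privileges on the first Swift Proxy node.

```bash
# Hypothetical helper: copy restored Swift builder files into place for one
# cloud and control plane. The source and destination roots default to the
# paths used in this procedure and can be overridden, for example to
# rehearse against scratch directories.
restore_builder_dir() (
    cloud=$1
    control_plane=$2
    src_root=${3:-/tmp/swift_builder_dir_restore}
    dst_root=${4:-/etc/swiftlm}
    src="$src_root/$cloud/$control_plane/builder_dir"
    dst="$dst_root/$cloud/$control_plane/builder_dir"

    # Refuse to continue if the restore job left nothing to copy.
    if [ ! -d "$src" ] || [ -z "$(ls -A "$src")" ]; then
        echo "nothing restored under $src" >&2
        exit 1
    fi
    # Create the destination if needed, then copy preserving mode and times.
    mkdir -p "$dst" && cp -pr "$src"/. "$dst"/
)
```

With the example names used above, the call would be `restore_builder_dir entry-scale-kvm control-plane-1`.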

13.2.2.2.2.2 Restore from an SSH backup #

Follow almost the same procedure as for Swift in the section immediately preceding this one: Section 13.2.2.2.2.1, “Restore from a Swift backup”. The only change is that the restore job uses a different job id. Get the corresponding job id for Ardana Default: Swift restore from SSH in the Description column.

13.2.2.2.3 Point-in-Time Cloud Lifecycle Manager Recovery #

In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.

Procedure 13.1: Restoring from a Swift or SSH Backup #

Source the backup environment file:

tux > source /var/lib/ardana/backup.osrc

Find the Client ID.

tux > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

List the jobs:

tux > freezer job-list -C CLIENT_ID

Using the example in the previous step:

tux > freezer job-list -C ardana002-cp1-c1-m1-mgmt

Find the correct job ID:
SSH backups: get the ID of the job named Ardana Default: deployer restore from SSH.
or
Swift backups: get the ID of the job named Ardana Default: deployer restore from Swift.
Stop the Dayzero UI:
```
tux > sudo systemctl stop dayzero
```
Launch the restore job:
```
tux > freezer job-start JOB-ID
```
This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.
Start the Dayzero UI:
```
tux > sudo systemctl start dayzero
```
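
The Client ID lookup used in this and the preceding procedures can be sketched as a filter over the freezer client-list table. The function name freezer_client_id is hypothetical, and the sketch assumes the table layout shown in the examples above (Client ID in the first column, hostname in the third).

```bash
# Hypothetical helper: read `freezer client-list` table output on stdin and
# print the Client ID whose hostname column equals HOST or HOST-mgmt.
freezer_client_id() (
    host=$1
    awk -F'|' -v host="$host" '
        NF >= 5 {
            id = $2; name = $4
            # Trim the padding around each table cell.
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == host || name == host "-mgmt") { print id; exit }
        }
    '
)
```

For example, `freezer job-list -C "$(freezer client-list | freezer_client_id ardana002-cp1-c1-m1)"` lists the jobs for that node.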

13.2.2.2.4 Cloud Lifecycle Manager Disaster Recovery #

In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.

To ensure that you use the same version of HPE Helion OpenStack that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the Book “Installing with Cloud Lifecycle Manager”, Chapter 3 “Installing the Cloud Lifecycle Manager server”, Section 3.5.2 “Installing the HPE Helion OpenStack Extension” before proceeding further.

13.2.2.2.4.1 Restore from a Swift backup #

Install the freezer-agent using the following playbook:

ardana > cd ~/openstack/ardana/ansible/
ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml

Access one of the other controller or compute nodes in your environment to perform the following steps:
1. Retrieve the /var/lib/ardana/backup.osrc file and copy it to the /var/lib/ardana/ directory on the Cloud Lifecycle Manager.
2. Copy all the files in the /opt/stack/service/freezer-api/etc/ directory to the same directory on the Cloud Lifecycle Manager.
3. Copy all the files in the /var/lib/ca-certificates directory to the same directory on the Cloud Lifecycle Manager.
4. Retrieve the /etc/hosts file and replace the one found on the Cloud Lifecycle Manager.
Log back in to the Cloud Lifecycle Manager.
Edit the value for client_id in the following file to contain the hostname of your Cloud Lifecycle Manager:
```
/opt/stack/service/freezer-api/etc/freezer-api.conf
```
Update your ca-certificates:
```
sudo update-ca-certificates
```

Edit the /etc/hosts file, ensuring you edit the 127.0.0.1 line so it points to ardana:

127.0.0.1       localhost ardana
::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters

On the Cloud Lifecycle Manager, source the backup user credentials:
```
ardana > source ~/backup.osrc
```

Find the Client ID (ardana002-cp1-c0-m1-mgmt) for the host name as done in previous procedures (see Procedure 13.1, “Restoring from a Swift or SSH Backup”).

ardana > freezer client-list
+-----------------------------+----------------------------------+-----------------------------+-------------+
| Client ID                   | uuid                             | hostname                    | description |
+-----------------------------+----------------------------------+-----------------------------+-------------+
| ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
| ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
| ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
| ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
| ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
| ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
+-----------------------------+----------------------------------+-----------------------------+-------------+

In this example, the hostname and the Client ID are the same: ardana002-cp1-c0-m1-mgmt.

List the Freezer jobs:

ardana > freezer job-list -C CLIENT_ID

Using the example in the previous step:

ardana > freezer job-list -C ardana002-cp1-c0-m1-mgmt

Get the id of the job corresponding to Ardana Default: deployer backup to Swift. Stop that job so the freezer scheduler does not begin making backups when started.
```
ardana > freezer job-stop JOB-ID
```
If it is present, also stop the Cloud Lifecycle Manager's SSH backup.

Start the freezer scheduler:

sudo systemctl start openstack-freezer-scheduler

Get the id of the job corresponding to Ardana Default: deployer restore from Swift and launch that job:
```
ardana > freezer job-start JOB-ID
```
This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.
When the job completes, the previous Cloud Lifecycle Manager contents should be restored to your home directory:
```
ardana > cd ~
ardana > ls
```

If you are using Cobbler, restore your Cobbler configuration with these steps:

Remove the following directories:

sudo rm -rf /var/lib/cobbler
sudo rm -rf /srv/www/cobbler

Deploy Cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Set the netboot-enabled flag for each of your nodes with this command:

for h in $(sudo cobbler system list)
do
  sudo cobbler system edit --name=$h --netboot-enabled=0
done

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

If you are using a dedicated Cloud Lifecycle Manager, follow these steps:
1. Re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
```
If you are using a shared Cloud Lifecycle Manager/controller, follow these steps:
1. If the node is also a Cloud Lifecycle Manager hypervisor, run the following commands to recreate the virtual machines that were lost:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-hypervisor-setup.yml --limit <this node>
```
2. If the node that was lost (or one of the VMs that it hosts) was a member of the RabbitMQ cluster then you need to remove the record of the old node, by running the following command on any one of the other cluster members. In this example the nodes are called cloud-cp1-rmq-mysql-m*-mgmt but you need to use the correct names for your system, which you can find in /etc/hosts:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ssh cloud-cp1-rmq-mysql-m3-mgmt sudo rabbitmqctl forget_cluster_node \
rabbit@cloud-cp1-rmq-mysql-m1-mgmt
```
3. Run the site.yml against the complete cloud to reinstall and rebuild the services that were lost. If you replaced one of the RabbitMQ cluster members then you will need to add the -e flag shown below, to nominate a new master node for the cluster, otherwise you can omit it.
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml -e \
rabbit_primary_hostname=cloud-cp1-rmq-mysql-m3
```
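
The /etc/hosts edit earlier in this procedure (pointing the 127.0.0.1 line at ardana) can also be scripted. The following is a minimal sketch, not a supported tool: the function name set_loopback_entry is hypothetical, and it must be run with root privileges when used on /etc/hosts itself.

```bash
# Hypothetical helper: rewrite the 127.0.0.1 entry of a hosts file so that
# it reads "127.0.0.1       localhost ardana". All other lines are kept.
set_loopback_entry() (
    hosts_file=${1:-/etc/hosts}
    tmp=$(mktemp) || exit 1
    # Write the edited copy first, then overwrite in place so that the
    # ownership and mode of the original file are preserved.
    awk '$1 == "127.0.0.1" { print "127.0.0.1       localhost ardana"; next }
         { print }' "$hosts_file" > "$tmp" && cat "$tmp" > "$hosts_file"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

Trying the function on a copy of the file first (`cp /etc/hosts /tmp/hosts.test; set_loopback_entry /tmp/hosts.test`) is a cheap way to review the result before touching the real file.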

13.2.2.2.4.2 Restore from an SSH backup #

On the Cloud Lifecycle Manager, edit the following file so it contains the same information as it did previously:
```
~/openstack/my_cloud/config/freezer/ssh_credentials.yml
```

On the Cloud Lifecycle Manager, copy the following files, change directories, and run the playbook _deployer_restore_helper.yml:

ardana > cp -r ~/hp-ci/openstack/* ~/openstack/my_cloud/definition/
ardana > cd ~/openstack/ardana/ansible/
ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml

Perform the restore. First become root and change directories:
```
sudo su
root # cd /root/deployer_restore_helper/
```
Execute the restore job:
```
root # ./deployer_restore_script.sh
```

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

When the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
```

13.2.2.2.5 One or Two Controller Node Disaster Recovery #

This scenario makes the following assumptions:

Your Cloud Lifecycle Manager is still intact and working.
One or two of your controller nodes went down, but not the entire cluster.
The node needs to be rebuilt from scratch, not simply rebooted.

13.2.2.2.5.1 Steps to recovering one or two controller nodes #

Ensure that your node has power and all of the hardware is functioning.
Log in to the Cloud Lifecycle Manager.
Verify that all of the information in your ~/openstack/my_cloud/definition/data/servers.yml file is correct for your controller node. You may need to replace the existing information if you had to replace either your entire controller node or just pieces of it.
If you made changes to your servers.yml file then commit those changes to your local git:
```
ardana > git add -A
ardana > git commit -a -m "editing controller information"
```

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Ensure that Cobbler has the correct system information:
1. If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:
```
ardana > sudo cobbler system list
```
2. Remove any controller nodes from Cobbler that no longer exist:
```
ardana > sudo cobbler system remove --name=<node>
```
3. Add the new node into Cobbler:
```
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
```
Then you can image the node:
```
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>
```
Note
If you do not know the <node_name> already, you can get it by using sudo cobbler system list.
Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See the Persisted Server Allocations section for information on how this works.
[OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.
```
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>
```

Complete the rebuilding of your controller node with the two playbooks below:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>

13.2.2.2.6 Three Control Plane Node Disaster Recovery #

In this scenario, all control plane nodes are destroyed and need to be rebuilt or replaced.

13.2.2.2.6.1 Restore from a Swift backup #

Restoring from a Swift backup is not possible because Swift is gone.

13.2.2.2.6.2 Restore from an SSH backup #

Disable the default backup job(s) by editing the following file:

~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml

Set the value for freezer_create_backup_jobs to false:

# If set to false, We won't create backups jobs.
freezer_create_backup_jobs: false

Deploy the control plane nodes, using the values for your control plane node hostnames:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml \
  --limit CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2,CONTROL_PLANE_HOSTNAME3 \
  -e rebuild=True
```
For example, if you were using the default values from the example model files your command would look like this:
```
ardana > ansible-playbook -i hosts/verb_hosts site.yml \
    --limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \
    -e rebuild=True
```
Note
Normally, -e rebuild=True is used on a single control plane node when other controllers are available to pull configuration data from. Used here against all control plane nodes, it causes the MariaDB database to be reinitialized, which is the only choice if there are no additional control nodes.
Restore the MariaDB backup on the first controller node.
1. List the Freezer jobs:
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > freezer job-list -C FIRST_CONTROLLER_NODE
```
2. Run the Ardana Default: mysql restore from SSH job for your first controller node, replacing the JOB_ID for that job:
```
ardana > freezer job-start JOB_ID
```
You can monitor the restore job by connecting to your first controller node via SSH and running the following commands:
```
ardana > ssh FIRST_CONTROLLER_NODE
ardana > sudo su
root # tail -n 100 /var/log/freezer/freezer-scheduler.log
```
Log back in to the Cloud Lifecycle Manager.

Stop MySQL:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml

Log back in to the first controller node, remove the old database files, and copy the restored files into place:

ardana > ssh FIRST_CONTROLLER_NODE
ardana > sudo su
root # rm -rf /var/lib/mysql/*
root # cp -pr /tmp/mysql_restore/* /var/lib/mysql/

Log back in to the Cloud Lifecycle Manager and bootstrap MySQL:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

Verify the status of MySQL:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

Re-enable the default backup job(s) by editing the following file:

~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml

Set the value for freezer_create_backup_jobs to true:

# If set to false, We won't create backups jobs.
freezer_create_backup_jobs: true

Run this playbook to deploy the backup jobs:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
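
The two edits to activate.yml in this procedure (disabling and later re-enabling the backup jobs) can be sketched as a single function. The function name set_backup_jobs is hypothetical, and the sketch assumes the freezer_create_backup_jobs key starts at the beginning of a line, as in the snippet shown above.

```bash
# Hypothetical helper: set freezer_create_backup_jobs to true or false in
# the given YAML file, leaving every other line (including comments) as is.
set_backup_jobs() (
    value=$1
    file=$2
    case $value in
        true|false) ;;
        *) echo "value must be true or false" >&2; exit 2 ;;
    esac
    tmp=$(mktemp) || exit 1
    sed "s/^freezer_create_backup_jobs:.*/freezer_create_backup_jobs: $value/" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

For example, `set_backup_jobs false ~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml` before the deployment, and the same call with `true` before running _freezer_manage_jobs.yml.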

13.2.2.2.7 Swift Rings Recovery #

To recover your Swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings from one Swift node if possible, or use the SSH backup that you have set up.

13.2.2.2.7.1 Restore from the Swift deployment backup #

See Section 15.6.2.7, “Recovering Swift Builder Files”.

13.2.2.2.7.2 Restore from the SSH Freezer backup #

In the very specific use case where you lost all system disks of all object nodes and the Swift proxy nodes are corrupted, you can recover the rings because a copy of the Swift rings is stored in Freezer. This assumes that the Swift data is still there (the disks used by Swift must still be accessible).

Recover the rings with these steps.

Log in to a node that has the freezer-agent installed.
Become root:
```
ardana > sudo su
```
Create the temporary directory to restore your files to:
```
root # mkdir /tmp/swift_builder_dir_restore/
```

Create a restore file with the following content:

root # cat << EOF > ./restore_config.ini
[default]
action = restore
storage = ssh
compression = bzip2
restore_abs_path = /tmp/swift_builder_dir_restore/
ssh_key = /etc/freezer/ssh_key
ssh_host = <freezer_ssh_host>
ssh_port = <freezer_ssh_port>
ssh_username = <freezer_ssh_username>
container = <freezer_ssh_base_dir>/freezer_swift_builder_backup
backup_name = freezer_swift_builder_backup
hostname = <hostname of the old first Swift-Proxy (SWF-PRX[0])>
EOF

Edit the file and replace all <tags> with the right information.
```
vim ./restore_config.ini
```
You will also need to put the SSH key used to do the backups in /etc/freezer/ssh_key and remember to set the right permissions: 600.
Execute the restore job:
```
root # freezer-agent --config ./restore_config.ini
```
You now have the Swift rings in /tmp/swift_builder_dir_restore/.

If the SWF-PRX[0] is already deployed, copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-PRX[0]:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
    /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
    /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/

Then, from the Cloud Lifecycle Manager, run:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml

If the SWF-ACC[0] is not deployed, from the Cloud Lifecycle Manager run these playbooks:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts guard-deployment.yml
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>

Copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-ACC[0]. You will have to create the directory /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ first.

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
    /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

For example:

ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
    /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/

From the Cloud Lifecycle Manager, run the ardana-deploy.yml playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
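
Two of the manual steps in this procedure are easy to get wrong: leaving a <tag> placeholder in restore_config.ini, and installing the SSH key with permissions other than 600. The following sketch guards against both. The function names check_placeholders and install_freezer_key are hypothetical; the default key path is the one named above.

```bash
# Hypothetical helper: fail if the config file still contains an
# unreplaced <tag> placeholder. Offending lines are shown on stderr.
check_placeholders() (
    file=$1
    if grep -n '<[^>]*>' "$file" >&2; then
        echo "unreplaced placeholders remain in $file" >&2
        exit 1
    fi
)

# Hypothetical helper: copy the backup SSH key into place with mode 600
# (read and write for the owner only).
install_freezer_key() (
    src=$1
    dest=${2:-/etc/freezer/ssh_key}
    install -m 600 "$src" "$dest"
)
```

A possible sequence, run as root, is `install_freezer_key /path/to/backup_key && check_placeholders ./restore_config.ini && freezer-agent --config ./restore_config.ini`, where /path/to/backup_key stands for wherever you keep the key.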

13.2.3 Unplanned Compute Maintenance #

Unplanned maintenance tasks including recovering compute nodes.

13.2.3.1 Recovering a Compute Node #

If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, then you need to perform disaster recovery. Here we provide different scenarios and how to resolve them to get your cloud repaired.

Typical scenarios in which you will need to recover a compute node include the following:

The node has failed, either because it has shut down, has a hardware failure, or for another reason.
The node is working but the nova-compute process is not responding, thus instances are working but you cannot manage them (for example to delete, reboot, and attach/detach volumes).
The node is fully operational but monitoring indicates a potential issue (such as disk errors) that requires downtime to fix.

13.2.3.1.1 What to do if your compute node is down #

Compute node has power but is not powered on

If your compute node has power but is not powered on, use these steps to restore the node:

Log in to the Cloud Lifecycle Manager.
Obtain the name for your compute node in Cobbler:
```
sudo cobbler system list
```

Power the node back up with this playbook, specifying the node name from Cobbler:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>

Compute node is powered on but services are not running on it

If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:

Confirm the status of the compute service on the node with this playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>

You can start the compute service on the node with this playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>

13.2.3.1.2 Scenarios involving disk failures on your compute nodes #

Your compute nodes should have a minimum of two disks, one that is used for the operating system and one that is used as the data disk. These are defined during the installation of your cloud, in the ~/openstack/my_cloud/definition/data/disks_compute.yml file on the Cloud Lifecycle Manager. The data disk(s) are where the nova-compute service lives. Recovery scenarios will depend on whether one or the other, or both, of these disks experienced failures.

If your operating system disk failed but the data disk(s) are okay

If you have had issues with the physical volume that hosts your operating system, first ensure that the physical volume is restored. Then use the following steps to restore the operating system:

Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
```
source ~/service.osrc
```
Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:
```
nova host-list | grep compute
```
Obtain the status of the nova-compute service on that node:
```
nova service-list --host <hostname>
```
You will likely want to disable provisioning on that node to ensure that nova-scheduler does not attempt to place any additional instances on the node while you are repairing it:
```
nova service-disable --reason "node is being rebuilt" <hostname> nova-compute
```
Obtain the status of the instances on the compute node:
```
nova list --host <hostname> --all-tenants
```
Before continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, then you can use the nova evacuate or nova host-evacuate commands to do this. See Section 13.1.3.3, “Live Migration of Instances” for more details on how to do this.
If your instances are not booted from volumes, you will need to stop the instances using the nova stop command. Because the nova-compute service is not running on the node you will not see the instance status change, but the Task State for the instance should change to powering-off.
```
nova stop <instance_uuid>
```
Verify the status of each of the instances using these commands, confirming that the Task State shows powering-off:
```
nova list --host <hostname> --all-tenants
nova show <instance_uuid>
```
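
When a node hosts many instances, the stop step can be applied to each of them in turn. The following is a minimal sketch that assumes nova list prints its usual ASCII table with the instance ID in the first column; the function name instance_ids is hypothetical and not part of the nova client.

```shell
#!/bin/bash
# Hypothetical helper: read a nova list table on stdin and print one
# instance ID per line. Border and header rows are skipped because their
# first column is not a 36-character UUID.
instance_ids() {
    awk -F'|' '{
        id = $2
        gsub(/[ \t]/, "", id)
        if (length(id) == 36 && id ~ /^[0-9a-fA-F-]+$/) print id
    }'
}

# Example use (NODE holds the compute node hostname):
#   nova list --host "$NODE" --all-tenants | instance_ids |
#       while read -r id; do nova stop "$id"; done
```

Filtering on the ID column rather than on the whole line avoids picking up UUID-like strings that may appear in instance names.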
At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:
1. Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:
```
sudo cobbler system list
```
2. Reimage the compute node with this playbook:
```
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
```
Once reimaging is complete, use the following playbook to configure the operating system and start up services:
```
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
```
You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not then use the nova start command to bring them to the ACTIVE state:
```
nova list --host <hostname> --all-tenants
nova start <instance_uuid>
```

Reenable provisioning:

nova service-enable <hostname> nova-compute

Start any instances that you had stopped previously:

nova list --host <hostname> --all-tenants
nova start <instance_uuid>

If your data disk(s) failed but the operating system disk is okay OR if all drives failed

In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.

After that is complete, use the nova rebuild command to respawn your instances, which will also ensure that they receive the same IP address:

nova list --host <hostname> --all-tenants
nova rebuild <instance_uuid>

13.2.4 Unplanned Storage Maintenance #

Unplanned maintenance tasks for storage nodes.

13.2.4.1 Unplanned Swift Storage Maintenance #

Unplanned maintenance tasks for Swift storage nodes.

13.2.4.1.1 Recovering a Swift Node #

If one or more of your Swift object or PAC nodes has experienced an issue, such as power loss or hardware failure, you need to perform disaster recovery. The scenarios below describe how to resolve each case and get your cloud repaired.

Typical scenarios in which you will need to repair a Swift object or PAC node include:

The node has either shut down or been rebooted.
The entire node has failed and needs to be replaced.
A disk drive has failed and must be replaced.

13.2.4.1.1.1 What to do if your Swift host has shut down or rebooted #

If your Swift host has power but is not powered on, use these steps from the Cloud Lifecycle Manager to restore it:

Log in to the Cloud Lifecycle Manager.
Obtain the name for your Swift host in Cobbler:
```
sudo cobbler system list
```

Power the node back up with this playbook, specifying the node name from Cobbler:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>

Once the node is booted up, Swift should start automatically. You can verify this with this playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml

Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 15.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.

13.2.4.1.1.2 How to replace your Swift node #

If your Swift node has irreparable damage and you need to replace the entire node in your environment, see Section 13.1.5.1.5, “Replacing a Swift Node” for details on how to do this.

13.2.4.1.1.3 How to replace a hard disk in your Swift node #

If you need to do a hard drive replacement in your Swift node, see Section 13.1.5.1.6, “Replacing Drives in a Swift Node” for details on how to do this.

13.3 Cloud Lifecycle Manager Maintenance Update Procedure #

Procedure 13.2: Preparing for Update #

Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Book “Installing with Cloud Lifecycle Manager”, Chapter 4 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”. Alternatives to setting up an SMT server are described in Book “Installing with Cloud Lifecycle Manager”, Chapter 5 “Software Repository Setup”.
Read the Release Notes for the security and maintenance updates that will be installed.
Have a backup strategy in place. For further information, see Chapter 14, Backup and Restore.
Ensure that you have a known starting state by resolving any unexpected alarms.
Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.
Review steps in Section 13.1.4.1, “Adding a Neutron Network Node” and Section 13.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the Neutron services are not provided via external SDN controllers.
Before the update, prepare your workloads by consolidating all of your instances onto one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and move the instances from the remaining Compute Nodes to the newly booted ones. Then, update the remaining Compute Nodes.

13.3.1 Performing the Update #

Before you proceed, get the status of all your services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml

If the status check returns an error for a specific service, run the SERVICE-reconfigure.yml playbook. Then run the SERVICE-status.yml playbook to check that the issue has been resolved.

Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 13.1.1.2, “Rolling Reboot of the Cloud”.

Note

The described workflow also covers cases in which the deployer node is also provisioned as an active cloud node.

To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 13.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.

Procedure 13.3: Update Instructions #

Install all available security and maintenance updates on the deployer using the zypper patch command.

Initialize the Cloud Lifecycle Manager and prepare the update playbooks.

Run the ardana-init initialization script to update the deployer.

Redeploy cobbler:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml

Run the configuration processor:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml

Update your deployment directory:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

Installation and management of updates can be automated with the following playbooks:
- ardana-update-pkgs.yml
- ardana-update.yml
- ardana-update-status.yml
  Important
  Some playbooks are being deprecated. To determine how your system is affected, run:
  ardana > rpm -qa ardana-ansible
  The result will be ardana-ansible-8.0+git. followed by a version number string.
  If the first part of the version number string is greater than or equal to 1553878455 (for example, ardana-ansible-8.0+git.1553878455.7439e04), use the newly introduced parameters:
  pending_clm_update
  pending_service_update
  pending_system_reboot
  If the first part of the version number string is less than 1553878455 (for example, ardana-ansible-8.0+git.1552032267.5298d45), use the following parameters:
  update_status_var
  update_status_set
  update_status_reset
- ardana-reboot.yml
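
The version comparison described in the Important note above can be sketched in shell. This is a minimal sketch: the threshold 1553878455 comes from the text, while the function name update_param_style and its output words new and old are assumptions made for illustration.

```shell
#!/bin/bash
# Threshold taken from the Important note above.
THRESHOLD=1553878455

# Hypothetical helper: print "new" if the ardana-ansible version string calls
# for the pending_* parameters, "old" if it calls for the update_status_*
# parameters, or "unknown" if no numeric git timestamp can be found.
update_param_style() {
    local pkg="$1" stamp
    stamp="${pkg#*+git.}"   # drop everything up to and including "+git."
    stamp="${stamp%%.*}"    # keep only the part before the next dot
    case "$stamp" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$stamp" -ge "$THRESHOLD" ]; then
        echo "new"
    else
        echo "old"
    fi
}

# Example use on the deployer:
#   update_param_style "$(rpm -qa ardana-ansible)"
```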
Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs playbook on each node.
```
ardana > hostnamectl
```
Notice that the Boot ID: and Kernel: information has changed.
By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.
```
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
--limit TARGET_NODE_NAME
```
There may be a delay in the playbook output at the following task while updates are pulled from the deployer.
```
TASK: [ardana-upgrade-tools | pkg-update | Download and install
package updates] ***
```
After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks.
```
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
--limit TARGET_NODE_NAME
```
To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true.
```
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
--limit  TARGET_NODE_NAME \
-e zypper_update_include_reboot_patches=true
```
If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.
Note
To update a single package (for example, apply a PTF on a single node or on all nodes), run zypper update PACKAGE.
To install all package updates, use zypper update.

Update services:

ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
--limit TARGET_NODE_NAME

If indicated by the ardana-update-status.yml playbook, reboot the node.
There may also be a warning to reboot after running the ardana-update-pkgs.yml.
This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable.
```
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
--limit TARGET_NODE_NAME
```

To recheck pending system reboot status at a later time, run the following commands:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
--limit ardana-cp1-c1-m2

The pending system reboot status can be reset by running:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
--limit ardana-cp1-c1-m2 \
-e pending_system_reboot=off

Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.
Warning
When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time.
If multiple nodes are specified on the command line (with --limit), those servers will experience outages while their services are shut down and packages are updated. Before updating a Compute Node (or a group of Compute Nodes), migrate its workload to other nodes. The same applies to Control Nodes: move singleton services off the control plane node that will be updated.
Important
Do not reboot all of your controllers at the same time.
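
One way to guard against updating an entire cluster at once is to compare the planned --limit list with the cluster membership before running the playbook. The following is a minimal sketch: the function name limit_covers_cluster is hypothetical, it only handles a plain comma-separated list of host names (not Ansible groups or wildcards), and the host names in the usage comment other than ardana-cp1-c1-m2 are assumed for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) when every cluster member
# passed as an argument also appears in the comma-separated limit list,
# meaning the planned run would touch the whole cluster at once.
limit_covers_cluster() {
    local limit="$1"; shift
    local member
    [ "$#" -gt 0 ] || return 1
    for member in "$@"; do
        case ",$limit," in
            *",$member,"*) ;;       # this member is in the limit list
            *) return 1 ;;          # at least one member is left untouched
        esac
    done
    return 0
}

# Example use:
#   if limit_covers_cluster "$LIMIT" ardana-cp1-c1-m1 ardana-cp1-c1-m2 ardana-cp1-c1-m3; then
#       echo "Refusing to update all controllers at the same time" >&2
#   fi
```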

When the node comes up after the reboot, run the spark-start.yml playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml

Verify that Spark is running on all Control Nodes:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml

After all nodes have been updated, check the status of all services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
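
The per-node sequence of playbooks in Procedure 13.3 can be summarized as a dry run that only prints the commands in order. This is a minimal sketch: the function name print_update_plan and its second argument are assumptions made for illustration, and the printed commands are the ones shown in the procedure above.

```shell
#!/bin/bash
# Hypothetical helper: print, without executing, the update commands for one
# node. Pass "true" as the second argument to include the reboot-related steps.
print_update_plan() {
    local node="$1" with_reboot="${2:-false}"
    local base="ansible-playbook -i hosts/verb_hosts"

    echo "cd ~/scratch/ansible/next/ardana/ansible"
    # Install patches that do not require a reboot, then check what is pending.
    echo "$base ardana-update-pkgs.yml --limit $node"
    echo "$base ardana-update-status.yml --limit $node"
    if [ "$with_reboot" = "true" ]; then
        echo "$base ardana-update-pkgs.yml --limit $node -e zypper_update_include_reboot_patches=true"
    fi
    # Update services; the reboot comes only after this step.
    echo "$base ardana-update.yml --limit $node"
    if [ "$with_reboot" = "true" ]; then
        echo "$base ardana-reboot.yml --limit $node"
    fi
}
```

Printing the plan first makes it easy to review the order, in particular that ardana-reboot.yml runs only after ardana-update.yml.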

13.3.2 Summary of the Update Playbooks #

ardana-update-pkgs.yml

Top-level playbook that automates the installation of package updates on a single node. It also works for multiple nodes if the single-node restriction is overridden by setting the SKIP_SINGLE_HOST_CHECKS environment variable or by passing -e skip_single_host_checks=true to ardana-update-pkgs.yml.

Provide the following -e options to modify default behavior:

zypper_update_method (default: patch)
- patch will install all patches for the system. Patches are intended for specific bug and security fixes.
- update will install all packages that have a higher version number than the installed packages.
- dist-upgrade replaces each package installed with the version from the repository and deletes packages not available in the repositories.
zypper_update_repositories (default: all) restricts the list of repositories used
zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks if packages are correctly signed.
zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third party licenses.
zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).

ardana-update.yml

Top-level playbook that automates the update of all the services. Runs on all nodes by default, or can be limited to a single node by adding --limit nodename.

ardana-reboot.yml

Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.

ardana-update-status.yml

This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.

13.4 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment #

Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.

Use the following steps to deploy a PTF:

When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:

ardana > tmpdir=`mktemp -d`
ardana > cd $tmpdir
ardana > sudo wget --no-directories --recursive --reject "index.html*" \
--user=USER_NAME \
--password=PASSWORD \
--no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030

Remove any old data from the PTF repository, such as a listing for a PTF repository from a migration or when previous product patches were installed.
```
ardana > sudo rm -rf /srv/www/suse-12.3/x86_64/repos/PTF/*
```

Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a Neutron PTF.

ardana > sudo mkdir -p /srv/www/suse-12.3/x86_64/repos/PTF/
ardana > sudo mv $tmpdir/* \
   /srv/www/suse-12.3/x86_64/repos/PTF/
ardana > sudo chown --recursive root:root /srv/www/suse-12.3/x86_64/repos/PTF/*
ardana > rmdir $tmpdir

Create or update the repository metadata:

ardana > sudo /usr/local/sbin/createrepo-cloud-ptf
Spawning worker 0 with 2 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata

Refresh the PTF repository before installing package updates on the Cloud Lifecycle Manager:

ardana > sudo zypper refresh --force --repo PTF
Forcing raw metadata refresh
Retrieving repository 'PTF' metadata ..........................................[done]
Forcing building of repository cache
Building repository 'PTF' cache ..........................................[done]
Specified repositories have been refreshed.

The PTF shows as available on the deployer.

ardana > sudo zypper se --repo PTF
Loading repository data...
Reading installed packages...

S | Name                          | Summary                                 | Type
--+-------------------------------+-----------------------------------------+--------
  | python-neutronclient          | Python API and CLI for OpenStack Neutron | package
i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack Neutron | package

Install the PTF venv packages on the Cloud Lifecycle Manager:

ardana > sudo zypper dup  --from PTF
Refreshing service
Loading repository data...
Reading installed packages...
Computing distribution upgrade...

The following package is going to be upgraded:
  venv-openstack-neutron-x86_64

The following package has no support information from its vendor:
  venv-openstack-neutron-x86_64

1 package to upgrade.
Overall download size: 64.2 MiB. Already cached: 0 B. After the operation, additional 6.9 KiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ... (1/1),  64.2 MiB ( 64.6 MiB unpacked)
Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done]
Checking for file conflicts: ..............................................................[done]
(1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done]
Additional rpm output:
warning
warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEY

Validate that the venv tarball has been installed into the deployment directory. Note: the packages file in that directory lists the registered tarballs that will be used for the services, which should align with the installed venv RPM.

ardana > ls -la /opt/ardana_packager/ardana-8/sles_venv/x86_64
total 898952
drwxr-xr-x 2 root root     4096 Oct 30 16:10 .
...
-rw-r--r-- 1 root root 67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz <<<
-rw-r--r-- 1 root root 64674087 Aug 14 16:14 nova-20180814T161306Z.tgz
-rw-r--r-- 1 root root 45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz
-rw-r--r-- 1 root root     1879 Oct 30 16:10 packages
-rw-r--r-- 1 root root 27186008 Apr 26  2018 swift-20180426T230541Z.tgz

Install the non-venv PTF packages on the Compute Node:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' --limit comp0001-mgmt

When it has finished, you can see that the upgraded package has been installed on comp0001-mgmt.

ardana > sudo zypper se --detail python-neutronclient
Loading repository data...
Reading installed packages...

S | Name                 | Type     | Version                         | Arch   | Repository
--+----------------------+----------+---------------------------------+--------+--------------------------------------
i | python-neutronclient | package  | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF
  | python-neutronclient | package  | 6.5.0-4.361                     | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1

Running the ardana update playbook will distribute the PTF venv packages to the cloud server. Then you can find them loaded in the virtual environment directory with the other venvs.

The Compute Node before running the update playbook:

ardana > ls -la /opt/stack/venv
total 24
drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z

Run the update.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt

When it has finished, you can see that an additional virtual environment has been installed.

ardana > ls -la /opt/stack/venv
total 28
drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
drwxr-xr-x  9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z <<< New venv installed
drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z

The PTF may also have RPM package updates in addition to venv updates. To complete the update, follow the instructions at Section 13.3.1, “Performing the Update”.

13.5 Periodic OpenStack Maintenance Tasks #

The heat-manage command helps manage Heat-specific database operations. The associated database should be purged periodically to save space. Set up the following as a cron job at /etc/cron.weekly/local-cleanup-heat on the servers where the heat service is running:

  #!/bin/bash
  su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :

The nova-manage db archive_deleted_rows command moves deleted rows from production tables to shadow tables. Including --until-complete makes the command run continuously until all deleted rows are archived. It is recommended to set up this task as /etc/cron.weekly/local-cleanup-nova on the servers where the nova service is running, with the following content:

  #!/bin/bash
  su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :
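
Scripts in /etc/cron.weekly are only run if they are executable, so creating the file is not enough on its own. The following is a minimal sketch of writing such a job and setting its mode; the function name install_weekly_job and its argument order are assumptions made for illustration, and the command passed in must not contain double quotes.

```shell
#!/bin/bash
# Hypothetical helper: write a weekly cleanup script of the form shown above
# and mark it executable.
# Arguments: target directory, file name, service user, command to run.
install_weekly_job() {
    local dir="$1" name="$2" user="$3" cmd="$4"
    local path="$dir/$name"

    # The trailing "|| :" keeps the job from reporting a failure to cron.
    printf '#!/bin/bash\nsu %s -s /bin/bash -c "%s" || :\n' "$user" "$cmd" > "$path" &&
        chmod 755 "$path"
}

# Example use (as root):
#   install_weekly_job /etc/cron.weekly local-cleanup-heat heat \
#       "/usr/bin/heat-manage purge_deleted -g days 14"
```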
