Applies to HPE Helion OpenStack 8

13 System Maintenance

Information about managing and configuring your cloud as well as procedures for performing node maintenance.

This chapter contains the following sections to help you manage, configure, and maintain your HPE Helion OpenStack cloud.

13.1 Planned System Maintenance

Planned maintenance tasks for your cloud are described in the sections below.

13.1.1 Whole Cloud Maintenance

Planned maintenance procedures for your whole cloud.

13.1.1.1 Bringing Down Your Cloud: Services Down Method

Important

If you have a planned maintenance and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 13.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.

If you prefer rolling reboots, which keep your cloud services running, see Section 13.1.1.2, “Rolling Reboot of the Cloud”.

To perform backups prior to these steps, visit the backup and restore pages first at Chapter 14, Backup and Restore.

13.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment

You will do the following steps from your Cloud Lifecycle Manager.

  1. Log in to your Cloud Lifecycle Manager.

  2. Gracefully shut down your cloud by running the ardana-stop.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-stop.yml
  3. Shut down your nodes. You should shut down your controller nodes last and bring them up first after the maintenance.

    There are multiple ways you can do this:

    1. You can SSH to each node and use sudo reboot -f to reboot the node.

    2. From the Cloud Lifecycle Manager, you can use the bm-power-down.yml and bm-power-up.yml playbooks.

    3. You can shut down the nodes and then physically restart them either via a power button or the IPMI.

  4. Perform the necessary maintenance.

  5. After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.

  6. Determine the current power status of the nodes in your environment:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts bm-power-status.yml
  7. If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
    Note

    Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.

  8. Bring the databases back up:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  9. Gracefully bring up your cloud services by running the ardana-start.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-start.yml
  10. Pause for a few minutes and give the cloud environment time to come up completely and then verify the status of the individual services using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml
  11. If any services did not start properly, you can run playbooks for the specific services having issues.

    For example:

    If RabbitMQ fails, run:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml

    You can check the status of RabbitMQ afterwards with this:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml

    If the recovery fails, you can run:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml

    Each of the other services has a playbook in the ~/scratch/ansible/next/ardana/ansible directory, named in the format <service>-start.yml, that you can run. One example, for the compute service, is nova-start.yml.

  12. Continue checking the status of your HPE Helion OpenStack 8 cloud services until there are no more failed or unreachable nodes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml
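
The repeated status check in the last step can be scripted. The following is a minimal sketch only: the function name retry_until_ok and the attempt count and delay in the usage comment are illustrative assumptions, not part of the product tooling.

```shell
# Retry a command until it succeeds or the attempt budget is used up.
# Usage: retry_until_ok <max_attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        # Sleep only if another attempt remains.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example (run from ~/scratch/ansible/next/ardana/ansible):
#   retry_until_ok 10 120 ansible-playbook -i hosts/verb_hosts ardana-status.yml
```

The function returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can gate later steps in a maintenance script.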

13.1.1.2 Rolling Reboot of the Cloud

If you have a planned maintenance and need to bring down your entire cloud and restart services while minimizing downtime, follow the steps here to safely restart your cloud. If you do not mind your services being down, then another option for planned maintenance can be found at Section 13.1.1.1, “Bringing Down Your Cloud: Services Down Method”.

13.1.1.2.1 Recommended node reboot order

To ensure that rebooted nodes reintegrate into the cluster, the key is having enough time between controller reboots.

The recommended way to achieve this is as follows:

  1. Reboot controller nodes one by one with a suitable interval in between. If you alternate between controllers and compute nodes, you will gain more time between the controller reboots.

  2. Reboot compute nodes (if present in your cloud).

  3. Reboot Swift nodes (if present in your cloud).

  4. Reboot ESX nodes (if present in your cloud).

13.1.1.2.2 Rebooting controller nodes

Turn off the Keystone Fernet Token-Signing Key Rotation

Before rebooting any controller node, you need to ensure that the Keystone Fernet token-signing key rotation is turned off. Run the following command:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml

Migrate singleton services first

Note

If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:

sudo systemctl start apache2

The first consideration before rebooting any controller nodes is that there are a few services that run as singletons (non-HA), thus they will be unavailable while the controller they run on is down. Typically this is a very small window, but if you want to retain the service during the reboot of that server you should take special action to maintain service, such as migrating the service.

For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.

For the cinder-volume singleton service:

Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:

ps auxww | grep cinder-volume | grep -v grep

Run the cinder-migrate-volume.yml playbook - details about the Cinder volume and backup migration instructions can be found in Section 7.1.3, “Managing Cinder Volume and Backup Services”.

For the nova-consoleauth singleton service:

The nova-consoleauth component runs by default on the first controller node, that is, the host with consoleauth_host_index=0. To move it to another controller node before rebooting controller 0, run the ansible playbook nova-start.yml and pass it the index of the next controller node. For example, to move it to controller 2 (index of 1), run:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-start.yml --extra-vars "consoleauth_host_index=1"

After you run this command, you may see two instances of the nova-consoleauth service when you run the nova service-list command, one of which is in the disabled state. You can delete the disabled service using these steps.

  1. Obtain the service ID for the duplicated nova-consoleauth service:

    nova service-list

    Example:

    $ nova service-list
    +----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
    | Id | Binary           | Host                      | Zone     | Status   | State | Updated_at                 | Disabled Reason |
    +----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
    | 1  | nova-conductor   | ...a-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
    | 10 | nova-conductor   | ...a-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:47.000000 | -               |
    | 13 | nova-conductor   | ...a-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
    | 16 | nova-scheduler   | ...a-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:39.000000 | -               |
    | 19 | nova-scheduler   | ...a-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:41.000000 | -               |
    | 22 | nova-scheduler   | ...a-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:44.000000 | -               |
    | 25 | nova-consoleauth | ...a-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2016-08-25T12:11:45.000000 | -               |
    | 49 | nova-compute     | ...a-cp1-comp0001-mgmt | nova     | enabled  | up    | 2016-08-25T12:11:48.000000 | -               |
    | 52 | nova-compute     | ...a-cp1-comp0002-mgmt | nova     | enabled  | up    | 2016-08-25T12:11:41.000000 | -               |
    | 55 | nova-compute     | ...a-cp1-comp0003-mgmt | nova     | enabled  | up    | 2016-08-25T12:11:43.000000 | -               |
    | 70 | nova-consoleauth | ...a-cp1-c1-m3-mgmt    | internal | disabled | down  | 2016-08-25T12:10:40.000000 | -               |
    +----+------------------+---------------------------+----------+----------+-------+----------------------------+-----------------+
  2. Delete the disabled duplicate service with this command:

    nova service-delete <service_ID>

    Given the example in the previous step, the command could be:

    nova service-delete 70
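
To find the service ID without reading the table by eye, the output of nova service-list can be filtered. This is a sketch that assumes the column order shown in the example above (Id, Binary, Host, Zone, Status); the function name disabled_consoleauth_ids is illustrative.

```shell
# Print the Id of every nova-consoleauth row whose Status column is "disabled".
# Reads the table printed by `nova service-list` on standard input.
disabled_consoleauth_ids() {
    awk -F'|' '
        {
            # Fields are split on "|": $2 = Id, $3 = Binary, $6 = Status
            id = $2; binary = $3; status = $6
            gsub(/ /, "", id); gsub(/ /, "", binary); gsub(/ /, "", status)
        }
        binary == "nova-consoleauth" && status == "disabled" { print id }
    '
}

# Example:
#   nova service-list | disabled_consoleauth_ids
```

Review the printed IDs before passing any of them to nova service-delete.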

For the SNAT namespace singleton service:

If you reboot the controller node hosting the SNAT namespace service, Compute instances without floating IPs will lose network connectivity while that controller is down. To prevent this, use these steps to determine which controller node is hosting the SNAT namespace service and migrate it to one of the other controller nodes before the reboot.

  1. Locate the SNAT node where the router is providing the active snat_service:

    1. From the Cloud Lifecycle Manager, list out your ports to determine which port is serving as the router gateway:

      source ~/service.osrc
      neutron port-list --device_owner network:router_gateway

      Example:

      $ neutron port-list --device_owner network:router_gateway
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
      | id                                   | name | mac_address       | fixed_ips                                                                           |
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
      | 287746e6-7d82-4b2c-914c-191954eba342 |      | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} |
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
    2. Look at the details of this port to determine the binding:host_id value, which identifies the host to which the port is bound:

      neutron port-show <port_id>

      Example, with the value you need in the binding:host_id row:

      $ neutron port-show 287746e6-7d82-4b2c-914c-191954eba342
      +-----------------------+--------------------------------------------------------------------------------------------------------------+
      | Field                 | Value                                                                                                        |
      +-----------------------+--------------------------------------------------------------------------------------------------------------+
      | admin_state_up        | True                                                                                                         |
      | allowed_address_pairs |                                                                                                              |
      | binding:host_id       | ardana-cp1-c1-m2-mgmt                                                                                        |
      | binding:profile       | {}                                                                                                           |
      | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
      | binding:vif_type      | ovs                                                                                                          |
      | binding:vnic_type     | normal                                                                                                       |
      | device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
      | device_owner          | network:router_gateway                                                                                       |
      | dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
      | dns_name              |                                                                                                              |
      | extra_dhcp_opts       |                                                                                                              |
      | fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
      | id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
      | mac_address           | fa:16:3e:2e:26:ac                                                                                            |
      | name                  |                                                                                                              |
      | network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
      | security_groups       |                                                                                                              |
      | status                | DOWN                                                                                                         |
      | tenant_id             |                                                                                                              |
      +-----------------------+--------------------------------------------------------------------------------------------------------------+

      In this example, the ardana-cp1-c1-m2-mgmt is the node hosting the SNAT namespace service.

  2. SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:

    ssh <IP_of_SNAT_namespace_host>
    sudo ip netns exec snat-<router_ID> bash

    Example:

    sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
  3. Obtain the ID for the L3 Agent for the node hosting your SNAT namespace:

    source ~/service.osrc
    neutron agent-list

    Example. Given the examples above, the entry you need is the L3 agent on ardana-cp1-c1-m2-mgmt (ID a209c67d-c00f-4a00-b31c-0db30e9ec661):

    $ neutron agent-list
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
    | id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
    | 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent           | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent       | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-metadata-agent    |
    | 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent             | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent             | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-l3-agent          |
    | 58f01f34-b6ca-4186-ac38-b56ee376ffeb | Loadbalancerv2 agent | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-lbaasv2-agent     |
    | 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent   | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent             | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent       | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-metadata-agent    |
    | 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent       | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-metadata-agent    |
    | a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent             | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent           | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent   | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent   | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-openvswitch-agent |
    | e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent   | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent           | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent       | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-metadata-agent    |
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
  4. Also obtain the ID for the L3 Agent of the node you are going to move the SNAT namespace service to using the same commands as the previous step.

  5. Use these commands to move the SNAT namespace service. Both <qrouter_uuid> and <router_id> are the ID of the router, which is the device_id value of the port shown in step 1:

    1. Remove the router from the L3 Agent on the old host:

      neutron l3-agent-router-remove <agent_id_of_snat_namespace_host> <qrouter_uuid>

      Example:

      $ neutron l3-agent-router-remove a209c67d-c00f-4a00-b31c-0db30e9ec661 e122ea3f-90c5-4662-bf4a-3889f677aacf
      Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent
    2. Remove the SNAT namespace:

      sudo ip netns delete snat-<router_id>

      Example:

      $ sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf
    3. Add the router to the L3 Agent on the new host:

      neutron l3-agent-router-add <agent_id_of_new_snat_namespace_host> <qrouter_uuid>

      Example:

      $ neutron l3-agent-router-add 3bc28451-c895-437b-999d-fdcff259b016 e122ea3f-90c5-4662-bf4a-3889f677aacf
      Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent

    Confirm that it has been moved by listing the details of your port from step 1b above, noting the value of binding:host_id which should be updated to the host you moved your SNAT namespace to:

    neutron port-show <port_ID>

    Example:

    $ neutron port-show 287746e6-7d82-4b2c-914c-191954eba342
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
    | Field                 | Value                                                                                                        |
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
    | admin_state_up        | True                                                                                                         |
    | allowed_address_pairs |                                                                                                              |
    | binding:host_id       | ardana-cp1-c1-m1-mgmt                                                                                        |
    | binding:profile       | {}                                                                                                           |
    | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
    | binding:vif_type      | ovs                                                                                                          |
    | binding:vnic_type     | normal                                                                                                       |
    | device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
    | device_owner          | network:router_gateway                                                                                       |
    | dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
    | dns_name              |                                                                                                              |
    | extra_dhcp_opts       |                                                                                                              |
    | fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
    | id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
    | mac_address           | fa:16:3e:2e:26:ac                                                                                            |
    | name                  |                                                                                                              |
    | network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
    | security_groups       |                                                                                                              |
    | status                | DOWN                                                                                                         |
    | tenant_id             |                                                                                                              |
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
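
The two lookups used in this procedure (the host a port is bound to, and the L3 agent ID for a host) can be pulled out of the table output with awk. This is a sketch that assumes the table layouts shown in the examples above; the function names port_binding_host and l3_agent_id_for_host are illustrative.

```shell
# Print the binding:host_id value from `neutron port-show` table output (stdin).
port_binding_host() {
    awk -F'|' '
        { field = $2; value = $3; gsub(/ /, "", field); gsub(/ /, "", value) }
        field == "binding:host_id" { print value }
    '
}

# Print the ID of the L3 agent running on the host given as $1, reading
# `neutron agent-list` table output on stdin.
l3_agent_id_for_host() {
    awk -F'|' -v host="$1" '
        {
            # $2 = id, $3 = agent_type, $4 = host; trim surrounding spaces only,
            # because agent_type values such as "L3 agent" contain a space.
            id = $2; atype = $3; h = $4
            gsub(/^ +| +$/, "", id); gsub(/^ +| +$/, "", atype); gsub(/^ +| +$/, "", h)
        }
        atype == "L3 agent" && h == host { print id }
    '
}

# Example:
#   snat_host=$(neutron port-show <port_id> | port_binding_host)
#   neutron agent-list | l3_agent_id_for_host "$snat_host"
```

Running port_binding_host again after the move gives a quick check that binding:host_id now names the new host.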

Reboot the controllers

In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services:

for i in $(grep -w cluster-prefix ~/openstack/my_cloud/definition/data/control_plane.yml | awk '{print $2}'); do
  grep $i ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts | grep ansible_ssh_host | awk '{print $1}'
done

Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:

  1. If any singleton services are active on this node, they will be unavailable while the node is down. If you want to retain the service during the reboot, you should take special action to maintain the service, such as migrating the service as appropriate as noted above.

  2. Stop all services on the controller node that you are rebooting first:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <controller node>
  3. Reboot the controller node, e.g. run the following command on the controller itself:

    sudo reboot

    Note that the current node being rebooted could be hosting the lifecycle manager.

  4. Wait for the controller node to become ssh-able and allow an additional minimum of five minutes for the controller node to settle. Start all services on the controller node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <controller node>
  5. Verify that the status of all services on the controller node is OK:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit <controller node>
  6. When the start operation above has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off that node first.

Note

It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).
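
The per-controller steps above can be wrapped in one function. The following is a sketch under several assumptions: the node name is reachable over SSH from the Cloud Lifecycle Manager, the node being rebooted is not the one hosting the Cloud Lifecycle Manager, and singleton services have already been migrated by hand. The names run, reboot_controller, DRY_RUN, ANSIBLE_DIR and SETTLE_SECONDS, the 60-second pause, and the 15-second probe interval are all illustrative choices, not product settings.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop services, reboot, wait, start services, and check status for one node.
reboot_controller() {
    node=$1
    ansible_dir=${ANSIBLE_DIR:-$HOME/scratch/ansible/next/ardana/ansible}
    settle=${SETTLE_SECONDS:-300}   # five-minute settle time from step 4

    run cd "$ansible_dir" || return 1
    run ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit "$node" || return 1
    # The connection usually drops as the node goes down, so ignore ssh's exit code.
    run ssh "$node" sudo reboot || true
    # Give the node time to go down so the probe below does not reach the old boot.
    run sleep 60
    # Wait until the node accepts SSH connections again.
    until run ssh -o ConnectTimeout=10 "$node" true; do
        sleep 15
    done
    run sleep "$settle"
    run ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit "$node" || return 1
    run ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit "$node"
}

# Example: preview the commands for one controller without executing them.
#   DRY_RUN=1; reboot_controller <controller node>
```

The function returns the exit status of the final ardana-status.yml run, so a calling loop can stop before moving to the next controller if the status check fails.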

Reenable the Keystone Fernet Token-Signing Key Rotation

After all the controller nodes are successfully updated and back online, you need to re-enable the Keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
13.1.1.2.3 Rebooting compute nodes

To reboot a compute node the following operations will need to be performed:

  • Disable provisioning of the node to take the node offline to prevent further instances being scheduled to the node during the reboot.

  • Identify instances that exist on the compute node, and then either:

    • Live migrate the instances off the node before actioning the reboot. OR

    • Stop the instances

  • Reboot the node

  • Restart the Nova services

  1. Disable provisioning:

    nova service-disable --reason "<describe reason>" <node name> nova-compute

    If the node has existing instances running on it, these instances will need to be migrated or stopped prior to rebooting the node.

  2. Identify the instances on the compute node. Note: The following command must be run with nova admin credentials.

    nova list --host <hostname> --all-tenants
  3. Migrate or Stop the instances on the compute node.

    Migrate the instances off the node by running one of the following commands for each of the instances:

    If your instance is booted from a volume and has any number of Cinder volumes attached, use the nova live-migration command:

    nova live-migration <instance uuid> [<target compute host>]

    If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:

    nova live-migration --block-migrate <instance uuid> [<target compute host>]

    Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova scheduler will choose a node for you.

    OR

    Stop the instances on the node by running the following command for each of the instances:

    nova stop <instance-uuid>
  4. Stop all services on the Compute node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>
  5. SSH to your Compute nodes and reboot them:

    sudo reboot

    The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.

  6. If the node does not come back up on its own, use the bm-power-up.yml playbook from the Cloud Lifecycle Manager to restart it. Specify just the node(s) you want to start in the nodelist parameter, that is, nodelist=<node1>[,<node2>][,<node3>].

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
  7. Execute the ardana-start.yml playbook. Specify the node(s) you want to start in the --limit parameter. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
  8. Re-enable provisioning on the node:

    nova service-enable <node-name> nova-compute
  9. Restart any instances you stopped.

    nova start <instance-uuid>
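
When a compute node hosts many instances, the UUIDs can be extracted from the nova list table and fed to the migrate or stop command. This is a sketch that assumes the instance ID is the first column of the table, and the function name instance_ids is illustrative.

```shell
# Print every instance UUID found in the first column of `nova list` output.
# Border and header rows are skipped because they do not look like a UUID.
instance_ids() {
    awk -F'|' '
        { id = $2; gsub(/ /, "", id) }
        id ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ { print id }
    '
}

# Example: live migrate every instance off one compute host and let the
# scheduler choose the targets.
#   nova list --host <hostname> --all-tenants | instance_ids |
#   while read -r uuid; do nova live-migration "$uuid"; done
```

Saving the UUID list to a file before stopping instances makes it easy to run nova start on the same set after the reboot.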
13.1.1.2.4 Rebooting Swift nodes

If your Swift services are on controller nodes, follow the controller node reboot instructions above.

For a dedicated Swift PAC cluster or Swift Object resource node:

For each Swift host:

  1. Stop all services on the Swift node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <Swift node>
  2. Reboot the Swift node by running the following command on the Swift node itself:

    sudo reboot
  3. Wait for the node to become ssh-able and then start all services on the Swift node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <swift node>
13.1.1.2.5 Get list of status playbooks

Running the following command will yield a list of status playbooks:

cd ~/scratch/ansible/next/ardana/ansible
ls *status*

Here is the list:

ls *status*
bm-power-status.yml          heat-status.yml      logging-producer-status.yml
ceilometer-status.yml        FND-AP2-status.yml   ardana-status.yml
FND-CLU-status.yml           horizon-status.yml   logging-status.yml
cinder-status.yml            freezer-status.yml   ironic-status.yml
cmc-status.yml               glance-status.yml    keystone-status.yml
galera-status.yml            memcached-status.yml nova-status.yml
logging-server-status.yml    monasca-status.yml   ops-console-status.yml
monasca-agent-status.yml     neutron-status.yml   rabbitmq-status.yml
swift-status.yml             zookeeper-status.yml
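
To run all of the status playbooks in one pass and list only the ones that fail, a loop such as the following can be used. This is a sketch: the function name run_status_playbooks and the RUNNER variable are illustrative, and RUNNER exists only so that the loop can be tried without invoking ansible-playbook.

```shell
# Run every *-status.yml playbook in a directory and print the ones that fail.
# Returns 0 if all playbooks succeed and 1 if at least one fails.
run_status_playbooks() {
    dir=$1
    failed=0
    for playbook in "$dir"/*-status.yml; do
        # Skip the unexpanded pattern when no playbook matches.
        [ -e "$playbook" ] || continue
        name=$(basename "$playbook")
        # Run in a subshell so the caller's working directory is unchanged.
        if ! ( cd "$dir" && ${RUNNER:-ansible-playbook} -i hosts/verb_hosts "$name" ); then
            echo "FAILED: $name"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   run_status_playbooks ~/scratch/ansible/next/ardana/ansible
```

Note that ardana-status.yml matches the same pattern, so this loop runs it along with the per-service playbooks.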

13.1.2 Planned Control Plane Maintenance

Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.

13.1.2.1 Replacing a Controller Node

This section outlines steps for replacing a controller node in your environment.

For HPE Helion OpenStack, you must have three controller nodes. Therefore, adding or removing nodes is not an option. However, if you need to repair or replace a controller node, you may do so by following the steps outlined here. Note that all playbooks for cloud maintenance are run from the Cloud Lifecycle Manager.

These steps will depend on whether you need to replace a shared lifecycle manager/controller node or whether this is a standalone controller node.

Keep in mind while performing the following tasks:

  • Do not add entries for a new server. Instead, update the entries for the broken one.

  • Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.

13.1.2.1.1 Replacing a Shared Cloud Lifecycle Manager/Controller Node

If the controller node you need to replace was also being used as your Cloud Lifecycle Manager then use these steps below. If this is not a shared controller then skip to the next section.

  1. To ensure that you use the same version of HPE Helion OpenStack that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the installation guide:

    1. Book “Installing with Cloud Lifecycle Manager”, Chapter 3 “Installing the Cloud Lifecycle Manager server”, Section 3.5.2 “Installing the HPE Helion OpenStack Extension”

    2. To restore your data, see Section 13.2.2.2.3, “Point-in-time Cloud Lifecycle Manager Recovery”

  2. On the new node, update your cloud model with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields to reflect the attributes of the node. Do not change the id, ip-addr, role, or server-group settings.

    Note

    When imaging servers with your own tooling, it is still necessary to have ILO/IPMI settings for all nodes. Even if you are not using Cobbler, the username and password fields in servers.yml need to be filled in with dummy settings. For example, add the following to servers.yml:

    ilo-user: manual
    ilo-password: deployment
  3. Get the servers.yml file stored in git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site

    Then change, as necessary, the mac-addr, ilo-ip, ilo-password, and ilo-user fields of this existing controller node. Save and commit the change:

    git commit -a -m "repaired node X"
  4. Run the configuration processor as follows:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml

    Then run ready-deployment:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  5. Deploy Cobbler:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    Note

    After this step you may see failures because MariaDB has not finished syncing. If so, rerun this step.

  6. Delete the haproxy user:

    sudo userdel haproxy
  7. Install the software on your new Cloud Lifecycle Manager/controller node with these three playbooks:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller-hostname>,<first-proxy-hostname>
  8. Alarms will be raised while the node is being replaced. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
    ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
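
The note under step 5 recommends rerunning the step if it fails while MariaDB is still syncing. A minimal sketch of that retry pattern follows. The function name retry, the attempt count, and the delay are assumptions chosen for illustration, not values taken from the product.

```shell
# Hypothetical helper (an assumption, not shipped with the product):
# run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success and 1 if every attempt fails.
retry() {
    retry_attempts=$1
    retry_delay=$2
    shift 2
    retry_n=1
    while [ "$retry_n" -le "$retry_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only pause when another attempt is still to come.
        if [ "$retry_n" -lt "$retry_attempts" ]; then
            sleep "$retry_delay"
        fi
        retry_n=$((retry_n + 1))
    done
    return 1
}

# Example: rerun the Cobbler deployment up to 3 times, 60 seconds apart.
#   cd ~/openstack/ardana/ansible
#   retry 3 60 ansible-playbook -i hosts/localhost cobbler-deploy.yml
```
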
13.1.2.1.2 Replacing a Standalone Controller Node

If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.

  1. Log in to the Cloud Lifecycle Manager.

  2. Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.

  3. Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  4. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Remove the old controller node(s) from Cobbler. You can list out the systems in Cobbler currently with this command:

    sudo cobbler system list

    and then remove the old controller nodes with this command:

    sudo cobbler system remove --name <node>
  7. Remove the SSH key of the old controller node from the known hosts file. You will specify the ip-addr value:

    ssh-keygen -f ~/.ssh/known_hosts -R <ip_addr>

    You should see a response similar to this one:

    ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f ~/.ssh/known_hosts -R 10.13.111.135
    # Host 10.13.111.135 found: line 6 type ECDSA
    ~/.ssh/known_hosts updated.
    Original contents retained as ~/.ssh/known_hosts.old
  8. Run the cobbler-deploy playbook to add the new controller node:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
    Important

    You must ensure that the old controller node is powered off before completing this step. This is because the new controller node will re-use the original IP address.

  10. Configure the necessary keys, such as those used for the database:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml
  11. Run osconfig on the replacement controller node. For example:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
  12. If the controller being replaced is the Swift ring builder (see Section 15.6.2.4, “Identifying the Swift Ring Building Server”) you need to restore the Swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 15.6.2.7, “Recovering Swift Builder Files” for details.

  13. Run the ardana-deploy playbook on the replacement controller.

    If the node being replaced is the Swift ring builder server then you only need to use the --limit switch for that node, otherwise you need to specify the hostname of your Swift ring builder server and the hostname of the node being replaced.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller-hostname>,<swift-ring-builder-hostname>
    Important

    If you receive a Keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys.

    In this situation, do not use the --limit option when running keystone-reconfigure.yml. In order to re-sync Fernet keys, all the controller nodes must be in the play.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
    Important

    If you receive a RabbitMQ failure when running this playbook, review Section 15.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue and then re-run the ardana-deploy playbook.

  14. Alarms will be raised while the node is being replaced. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
    ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
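
The --limit rule in step 13 can be sketched as a small helper. The function name deploy_limit is hypothetical and exists only for this illustration. It prints the value to pass to --limit and does not run the playbook itself.

```shell
# Hypothetical helper (an assumption, not shipped with the product).
# $1 = hostname of the replaced controller
# $2 = hostname of the Swift ring builder server
# Prints one hostname if both are the same node, otherwise both, comma-separated.
deploy_limit() {
    if [ "$1" = "$2" ]; then
        printf '%s\n' "$1"
    else
        printf '%s,%s\n' "$1" "$2"
    fi
}

# Example (replace the placeholders with your hostnames):
#   cd ~/scratch/ansible/next/ardana/ansible
#   ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True \
#       --limit="$(deploy_limit <controller-hostname> <swift-ring-builder-hostname>)"
```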

13.1.3 Planned Compute Maintenance

Planned maintenance tasks for compute nodes.

13.1.3.1 Planned Maintenance for a Compute Node

Follow this procedure if one or more of your compute nodes needs hardware maintenance and you can schedule the maintenance in advance.

13.1.3.1.1 Performing planned maintenance on a compute node

If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrator credentials:

    source ~/service.osrc
  3. Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

    nova host-list | grep compute

    The following example shows two compute nodes:

    $ nova host-list | grep compute
    | ardana-cp1-comp0001-mgmt | compute     | AZ1      |
    | ardana-cp1-comp0002-mgmt | compute     | AZ2      |
  4. Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:

    nova service-disable --reason "Maintenance mode" <hostname> nova-compute
    Note

    Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:

    nova service-enable <hostname> nova-compute
  5. At this point you have two choices:

    1. Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.

    2. Stop/start the instances: Issuing nova stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.

    If you choose the live migration route, see Section 13.1.3.3, “Live Migration of Instances” for more details. Skip to step 6 after you finish live migration.

    If you choose the stop/start method, continue on.

    1. List all of the instances on the node so you can issue stop commands to them:

      nova list --host <hostname> --all-tenants
    2. Issue the nova stop command against each of the instances:

      nova stop <instance uuid>
    3. Confirm that the instances are stopped. If stoppage was successful you should see the instances in a SHUTOFF state, as shown here:

      $ nova list --host ardana-cp1-comp0002-mgmt --all-tenants
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
      | ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
      | ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
    4. Do your required maintenance. If this maintenance does not take down the disks completely then you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:

      nova list --host <hostname> --all-tenants
    5. Start the instances back up using this command:

      nova start <instance uuid>

      Example:

      $ nova start ef31c453-f046-4355-9bd3-11e774b1772f
      Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.
    6. Confirm that the instances started back up. If restarting is successful you should see the instances in an ACTIVE state, as shown here:

      $ nova list --host ardana-cp1-comp0002-mgmt --all-tenants
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
      | ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
      | ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
    7. If the nova start fails, you can try doing a hard reboot:

      nova reboot --hard <instance uuid>

      If this does not resolve the issue you may want to contact support.

  6. Re-enable provisioning when the node is fixed:

    nova service-enable <hostname> nova-compute
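
When a node hosts many instances, the stop/start sub-steps above can be scripted. The following is a minimal sketch that assumes the nova list table layout shown in the examples, with ID in the first column and Status in the fourth. The function name instances_with_status is hypothetical.

```shell
# Hypothetical helper (an assumption, not shipped with the product):
# reads `nova list` table output on stdin and prints the ID of every row
# whose Status column equals $1 (for example ACTIVE or SHUTOFF).
instances_with_status() {
    awk -F'|' -v want="$1" '
        NF > 4 {
            id = $2; status = $5
            # Trim the padding spaces around each cell.
            gsub(/^ +| +$/, "", id)
            gsub(/^ +| +$/, "", status)
            if (id != "ID" && status == want) print id
        }'
}

# Example: stop every ACTIVE instance on a node, then start them again later.
#   nova list --host <hostname> --all-tenants | instances_with_status ACTIVE |
#       while read -r uuid; do nova stop "$uuid"; done
#   nova list --host <hostname> --all-tenants | instances_with_status SHUTOFF |
#       while read -r uuid; do nova start "$uuid"; done
```

Note that the second loop starts every instance in a SHUTOFF state on the node, including any that were already stopped before the maintenance, so record the list of instances you stopped if that distinction matters.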

13.1.3.2 Rebooting a Compute Node

If all you need to do is reboot a Compute node, the following steps can be used.

You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. The playbook used in this procedure ensures that all services on the Compute node are restarted properly.

  1. Log in to the Cloud Lifecycle Manager.

  2. Reboot the Compute node(s) with the following playbook.

    You can specify either single or multiple Compute nodes using the --limit switch.

    An optional reboot wait time can also be specified. If no reboot wait time is specified it will default to 300 seconds.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
    Note

    If the Compute node fails to reboot, you should troubleshoot this issue separately as this playbook will not attempt to recover after a failed reboot.
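
To make the bracketed syntax above concrete, the sketch below assembles the reboot command for a given node list and optional wait time. The function name reboot_cmd is hypothetical. It only prints the command so that you can review it before running it from ~/scratch/ansible/next/ardana/ansible.

```shell
# Hypothetical helper (an assumption, not shipped with the product).
# $1 = Compute node, or a comma-separated list of Compute nodes
# $2 = optional reboot wait time in seconds (the playbook default is 300)
# Prints the command instead of running it.
reboot_cmd() {
    cmd="ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit $1"
    if [ -n "${2:-}" ]; then
        cmd="$cmd -e nova_reboot_wait_timeout=$2"
    fi
    printf '%s\n' "$cmd"
}

# Example:
#   reboot_cmd ardana-cp1-comp0001-mgmt,ardana-cp1-comp0002-mgmt 600
```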

13.1.3.3 Live Migration of Instances

Live migration allows you to move active compute instances between compute nodes, allowing for less downtime during maintenance.

HPE Helion OpenStack Nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use will depend on the state of the host, what operating system is on the host, what type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off of the host. We will describe these options on this page as well as give you step-by-step instructions for performing them.

13.1.3.3.1 Migration Options

If your compute node has failed

A compute host failure could be caused by a hardware failure, such as a data disk that needs to be replaced, a loss of power, or any other type of failure that requires you to replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.

In these cases you will want to use one of the Nova evacuate commands, which will cause Nova to rebuild the instances on other hosts.

This table describes each of the evacuate options for failed compute nodes:

Command | Description

nova evacuate <instance> <hostname>

This command is used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified then the Nova scheduler will choose one for you.

See nova help evacuate for more information and syntax. Further details can also be seen in the OpenStack documentation at http://docs.openstack.org/admin-guide/cli_nova_evacuate.html.

nova host-evacuate <hostname> [--target_host <target_hostname>]

This command is used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified then the Nova scheduler will choose a target host for each instance.

See nova help host-evacuate for more information and syntax.

If your compute host is active, powered on and the data disks are in working order you can utilize the migration commands to migrate your compute instances. There are two migration features, "cold" migration (also referred to simply as "migration") and live migration. Migration and live migration are two different functions.

Cold migration is used to copy the data of an instance in a SHUTOFF state from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the nova migrate function is disabled by default, but you can enable it if you wish. Details on how to do this can be found in Section 5.4, “Enabling the Nova Resize and Migrate Features”.

Live migration can be performed on instances in either an ACTIVE or PAUSED state. It uses the QEMU hypervisor to manage the copy of the running processes and associated resources to the destination compute host using the hypervisor's own protocol, which makes it a more secure method and allows for less downtime. During a live migration there may be a short network outage, usually a few milliseconds but possibly up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.

The compute host must remain powered on during the migration process.

Both the cold migration and live migration options will honor Nova group policies, which include affinity settings. There is a limitation to keep in mind if you use group policies; it is discussed in Section 13.1.3.3, “Live Migration of Instances”.

This table describes each of the migration options for active compute nodes:

Command | Description | SLES

nova migrate <instance_uuid>

Used to cold migrate a single instance from a compute host. The nova-scheduler will choose the new host.

This command will work against instances in an ACTIVE or SHUTOFF state. The instances, if active, will be shut down and restarted. Instances in a PAUSED state cannot be cold migrated.

See the difference between cold migration and live migration at the start of this section.

 

nova host-servers-migrate <hostname>

Used to cold migrate all instances off a specified host to other available hosts, chosen by the nova-scheduler.

This command will work against instances in an ACTIVE or SHUTOFF state. The instances, if active, will be shut down and restarted. Instances in a PAUSED state cannot be cold migrated.

See the difference between cold migration and live migration at the start of this section.

 

nova live-migration <instance_uuid> [<target host>]

Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached.

This command works against instances in ACTIVE or PAUSED states only.

SLES: X

nova live-migration --block-migrate <instance_uuid> [<target host>]

Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume.

This command works against instances in ACTIVE or PAUSED states only.

SLES: X

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached.

This command works against instances in ACTIVE or PAUSED states only.

SLES: X

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume.

This command works against instances in ACTIVE or PAUSED states only.

SLES: X
13.1.3.3.2 Limitations of these Features

There are limitations that may impact your use of this feature:

  • To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state then cold migration should be used.

  • Instances in a PAUSED state cannot be live migrated using the Horizon dashboard. Use the NovaClient CLI to live migrate them.

  • Both cold migration and live migration honor an instance's group policies. If you are utilizing an affinity policy and are migrating multiple instances you may run into an error stating no hosts are available to migrate to. To work around this issue you should specify a target host when migrating these instances, which will bypass the nova-scheduler. You should ensure that the target host you choose has the resources available to host the instances.

  • The nova host-evacuate-live command will produce an error if you have a compute host that has a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, utilizing the --block-migrate option. This is described in further detail in Section 13.1.3.3, “Live Migration of Instances”.

  • If you are using both Linux for HPE Helion (KVM) and SLES compute hosts, you cannot live migrate instances between them. Instances on KVM hosts can only be live migrated to other KVM hosts. Instances on SLES hosts can only be live migrated to other SLES hosts.

  • The migration options described in this document are not available on ESX compute hosts.

  • Ensure that you read and take into account any other limitations that exist in the release notes. See the release notes for more details.

13.1.3.3.3 Performing a Live Migration

Cloud administrators can perform a migration on an instance using the Horizon dashboard, the API, or the CLI. Instances in a PAUSED state cannot be live migrated using the Horizon GUI; use the CLI to migrate them.

We have documented different scenarios:

13.1.3.3.4 Migrating instances off of a failed compute host
  1. Log in to the Cloud Lifecycle Manager.

  2. If the compute node is not already powered off, do so with this playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
    Note

    The value for <node_name> will be the name that Cobbler has when you run sudo cobbler system list from the Cloud Lifecycle Manager.

  3. Source the admin credentials necessary to run administrative commands against the Nova API:

    source ~/service.osrc
  4. Force the nova-compute service to go down on the compute node:

    nova service-force-down HOSTNAME nova-compute
    Note

    The value for HOSTNAME can be obtained by using nova host-list from the Cloud Lifecycle Manager.

  5. Evacuate the instances off of the failed compute node. This will cause the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.

    For single instances on a failed host:

    nova evacuate <instance_uuid> <target_hostname>

    For all instances on a failed host:

    nova host-evacuate <hostname> [--target_host <target_hostname>]
  6. After you have repaired the failed node and started it back up, the nova-compute process cleans up the evacuated instances when it starts.
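
The command sequence from steps 2, 4, and 5 can be sketched as a dry-run helper that prints the commands in order for review. The function name evacuation_plan is hypothetical. The power-down command is meant to be run from ~/openstack/ardana/ansible, and the nova commands require the credentials from step 3.

```shell
# Hypothetical helper (an assumption, not shipped with the product).
# $1 = node name as listed by `sudo cobbler system list`
# $2 = compute hostname as listed by `nova host-list`
# $3 = optional target host for the evacuated instances
# Prints the commands in the order they should be run; it runs nothing.
evacuation_plan() {
    plan_target=${3:-}
    printf '%s\n' "ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=$1"
    printf '%s\n' "nova service-force-down $2 nova-compute"
    printf '%s\n' "nova host-evacuate $2${plan_target:+ --target_host $plan_target}"
}

# Example:
#   evacuation_plan <node_name> ardana-cp1-comp0001-mgmt
```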

13.1.3.3.5 Migrating instances off of an active compute host

Migrating instances using the Horizon dashboard

The Horizon dashboard offers a GUI method for performing live migrations. Horizon does not offer the live migration option for instances in a PAUSED state, so use the CLI instructions in the next section to migrate them.

  1. Log in to the Horizon dashboard with admin credentials.

  2. Navigate to the menu Admin › Compute › Instances.

  3. Next to the instance you want to migrate, select the drop down menu and choose the Live Migrate Instance option.

  4. In the Live Migrate wizard you will see the compute host the instance currently resides on and then a drop down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:

    Disk Over Commit - If this is not checked then the value will be False. If you check this box then it will allow you to override the check that occurs to ensure the destination host has the available disk space to host the instance.

    Block Migration - If this is not checked then the value will be False. If you check this box then it will migrate the local disks by using block migration. Use this option if you are only using ephemeral storage on your instances. If you are using block storage for your instance then ensure this box is not checked.

  5. To begin the live migration, click Submit.

Migrating instances using the NovaClient CLI

To perform migrations from the command-line, use the NovaClient. The Cloud Lifecycle Manager node in your cloud environment should have the NovaClient already installed. If you will be accessing your environment through a different method, ensure that the NovaClient is installed. You can do so using Python's pip package manager.

To run the commands in the steps below, you need administrator credentials. From the Cloud Lifecycle Manager, you can source the provided service.osrc file, which contains the necessary credentials:

source ~/service.osrc

Here are the steps to perform:

  1. Log in to the Cloud Lifecycle Manager.

  2. Identify the instances on the compute node you wish to migrate:

    nova list --all-tenants --host <hostname>

    Example showing a host with a single compute instance on it:

    ardana >  nova list --host ardana-cp1-comp0001-mgmt --all-tenants
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
    | ID                                   | Name | Tenant ID                        | Status | Task State | Power State | Networks              |
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
    | 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | -          | Running     | adminnetwork=10.0.0.5 |
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
  3. When using live migration you can either specify a target host that the instance will be migrated to or you can omit the target to allow the nova-scheduler to choose a node for you. If you want to get a list of available hosts you can use this command:

    nova host-list
  4. Migrate the instance(s) on the compute node using the notes below.

    If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:

    nova live-migration <instance uuid> [<target compute host>]

    If your instance has local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s), you should use the --block-migrate option:

    nova live-migration --block-migrate <instance uuid> [<target compute host>]
    Note

    The <target compute host> argument is optional. If you do not specify a target host then the nova scheduler will choose a node for you.

    Multiple instances

    If you want to live migrate all of the instances off a single compute host you can utilize the nova host-evacuate-live command.

    Issue the host-evacuate-live command, which will begin the live migration process.

    If all of the instances on the host are using at least one local (ephemeral) disk, you should use this syntax:

    nova host-evacuate-live --block-migrate <hostname>

    Alternatively, if all of the instances are only using block storage volumes then omit the --block-migrate option:

    nova host-evacuate-live <hostname>
    Note

    You can either let the nova-scheduler choose a suitable target host or you can specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.
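
The choice between the two single-instance syntaxes in step 4 can be sketched as follows. The function name live_migration_cmd and the volume/local keywords are assumptions made for this illustration. The helper prints the command instead of running it, and you still need to determine how each instance is booted yourself.

```shell
# Hypothetical helper (an assumption, not shipped with the product).
# $1 = instance UUID
# $2 = "volume" if the instance is booted from a block storage volume,
#      "local" if it uses local (ephemeral) disks
# $3 = optional target compute host
# Prints the matching nova live-migration command.
live_migration_cmd() {
    lm_target=${3:-}
    case "$2" in
        volume) lm_opts="" ;;
        local)  lm_opts=" --block-migrate" ;;
        *)
            echo "boot source must be 'volume' or 'local'" >&2
            return 2
            ;;
    esac
    printf 'nova live-migration%s %s%s\n' "$lm_opts" "$1" "${lm_target:+ $lm_target}"
}

# Example:
#   live_migration_cmd 553ba508-2d75-4513-b69a-f6a2a08d04e3 local ardana-cp1-comp0002-mgmt
```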

13.1.3.3.6 Troubleshooting migration or host evacuate issues

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                                                                                        |
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 95a7ded8-ebfc-4848-9090-2df378c88a4c | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7)     |
| 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration cannot be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6)     |
+--------------------------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| e9874122-c5dc-406f-9039-217d9258c020 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a)     |
| 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112)     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)

Fix: This occurs when you are attempting to live migrate an instance that was booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration --block-migrate <instance_uuid> <target_hostname>

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt
ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)

Fix: This occurs when you are attempting to live migrate an instance that was booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration <instance_uuid> <target_hostname>
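
The four Issue/Fix pairs above follow one rule: an error containing "is not on shared storage" means --block-migrate should be added, and an error containing "is not on local storage" means it should be dropped. The following Python sketch encodes only that rule. The function name retry_command is hypothetical, and it assumes the command has the form "nova <subcommand> ..." as in the examples in this section.

```python
def retry_command(command: str, error: str) -> str:
    """Suggest the command to re-attempt for the two errors described above.

    Assumption: the command looks like "nova <subcommand> ...".
    """
    parts = command.split()
    has_flag = "--block-migrate" in parts
    if "is not on shared storage" in error and not has_flag:
        # Instance uses local storage: add --block-migrate after the subcommand.
        return " ".join(parts[:2] + ["--block-migrate"] + parts[2:])
    if "is not on local storage" in error and has_flag:
        # Instance is on shared storage or booted from a volume: drop the flag.
        return " ".join(p for p in parts if p != "--block-migrate")
    # Any other error: leave the command unchanged.
    return command


print(retry_command(
    "nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt",
    "ardana-cp1-comp0001-mgmt is not on shared storage",
))
```

The sketch only rewrites the command string. It does not run nova, so you still decide whether to execute the suggested command.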

13.1.3.4 Adding Compute Node

Adding a Compute Node allows you to add capacity.

13.1.3.4.1 Adding a SLES Compute Node

Adding a SLES compute node gives your cloud capacity for more virtual machines.

The steps below show you how to add SLES compute hosts, whether for more virtual machine capacity or for another purpose.

There are two methods you can use to add SLES compute hosts to your environment:

  1. Adding SLES pre-installed compute hosts. This method does not require the SLES ISO to be on the Cloud Lifecycle Manager.

  2. Using the provided Ansible playbooks and Cobbler, SLES will be installed on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP3 ISO during the initial installation of your cloud, following the instructions at Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”, Section 19.1 “SLES Compute Node Installation Overview”.

    If you want to use the provided Ansible playbooks and Cobbler to set up and configure your SLES hosts, and you did not have the SUSE Linux Enterprise Server 12 SP3 ISO on your Cloud Lifecycle Manager during your initial installation, read the note at the top of that section before proceeding.

13.1.3.4.1.1 Prerequisites

You need to ensure your input model files are properly set up for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further in Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”, Section 19.3 “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.1 “SLES Compute Nodes”.

13.1.3.4.1.2 Adding a SLES compute node

Adding pre-installed SLES compute hosts

This method requires that you have SUSE Linux Enterprise Server 12 SP3 pre-installed on the baremetal host prior to beginning these steps.

  1. Ensure you have SUSE Linux Enterprise Server 12 SP3 pre-installed on your baremetal host.

  2. Log in to the Cloud Lifecycle Manager.

  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the format shown below. The IPMI details are left out because they are not needed when the SLES OS is pre-installed on your host(s).

    - id: compute4
      ip-addr: 192.168.102.70
      role: SLES-COMPUTE-ROLE
      server-group: RACK1

    You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

    Important

    You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  5. Commit the changes to git:

    git add -A
    git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

  8. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

    Note

    The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

    cd ~/scratch/ansible/next/ardana/ansible/
    ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  9. Complete the compute host deployment with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible/
    ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
    ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
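
Step 3 above asks you to confirm that the new ip-addr value does not conflict with an address already recorded in ~/openstack/my_cloud/info/address_info.yml. The following Python sketch shows one way to do that check. Because the layout of address_info.yml is not described here, the sketch makes no assumption about it and scans the file text for IPv4 addresses. The function names and the sample text are hypothetical.

```python
import ipaddress
import re

# Matches dotted-quad candidates that are not part of a longer number.
IPV4_PATTERN = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def addresses_in_use(text: str) -> set:
    """Collect every valid IPv4 address that appears anywhere in the text."""
    found = set()
    for candidate in IPV4_PATTERN.findall(text):
        try:
            found.add(ipaddress.IPv4Address(candidate))
        except ipaddress.AddressValueError:
            pass  # for example 999.1.1.1 is not a valid address
    return found


def conflicts(new_ip: str, text: str) -> bool:
    """Return True if new_ip already appears in the given file text."""
    return ipaddress.IPv4Address(new_ip) in addresses_in_use(text)


# Hypothetical sample text standing in for the content of address_info.yml.
sample = "192.168.102.7: host-a\n192.168.102.71: host-b\n"
print(conflicts("192.168.102.70", sample))
```

To use it against the real file, read the file content into a string, for example with pathlib.Path.read_text(), and pass that string as the second argument.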

Adding SLES compute hosts with Ansible playbooks and Cobbler

These steps will show you how to add the new SLES compute host to your servers.yml file and then run the playbooks that update your cloud configuration. You will run these playbooks from the Cloud Lifecycle Manager.

If you did not have the SUSE Linux Enterprise Server 12 SP3 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute”.

When you are prepared to continue, use these steps:

  1. Log in to your Cloud Lifecycle Manager.

  2. Check out the site branch of your local git repository so you can begin to make the necessary edits:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one you would add your details to the bottom of the file in this format:

    - id: compute4
      ip-addr: 192.168.102.70
      role: SLES-COMPUTE-ROLE
      server-group: RACK1
      mac-addr: e8:39:35:21:32:4e
      ilo-ip: 10.1.192.36
      ilo-password: password
      ilo-user: admin
      distro-id: sles12sp3-x86_64

    You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

    Important

    You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  5. Commit the changes to git:

    git add -A
    git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Run the following playbook to confirm that your servers are accessible over their IPMI ports:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4
  8. Add the new node into Cobbler:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
  10. Then you can image the node:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
    Note

    If you do not know the <node name>, you can get it by using sudo cobbler system list.

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

  11. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  12. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your hosts are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    cd ~/scratch/ansible/next/ardana/ansible/
    ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
    Note

    You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

  13. Verify that the netmask, bootproto, and other necessary settings are correct, and redo any that are not. See Book “Installing with Cloud Lifecycle Manager”, Chapter 19 “Installing SLES Compute” for details.

  14. Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you are adding with the --limit switch:

    cd ~/scratch/ansible/next/ardana/ansible/
    ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
    ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
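
Step 4 above asks you to make sure member-count, min-count, and max-count still agree with the new total node count. The following Python sketch illustrates that check on data that has already been loaded from servers.yml and control_plane.yml. The function names are hypothetical. Treating min-count and max-count as lower and upper bounds on the total is an assumption of this sketch, not something stated in this section.

```python
def count_servers_with_role(servers, role):
    """Count the server entries that use the given role."""
    return sum(1 for server in servers if server.get("role") == role)


def member_count_problems(servers, role, member_count=None,
                          min_count=None, max_count=None):
    """Return a list of human-readable mismatches (empty if none)."""
    total = count_servers_with_role(servers, role)
    problems = []
    if member_count is not None and member_count != total:
        problems.append(f"member-count is {member_count} but {total} servers use {role}")
    # Assumption: min-count and max-count bound the total node count.
    if min_count is not None and total < min_count:
        problems.append(f"{total} servers is below min-count {min_count}")
    if max_count is not None and total > max_count:
        problems.append(f"{total} servers is above max-count {max_count}")
    return problems


# Four SLES compute hosts, as in the example where compute4 is added.
servers = [{"id": f"compute{n}", "role": "SLES-COMPUTE-ROLE"} for n in range(1, 5)]
print(member_count_problems(servers, "SLES-COMPUTE-ROLE", member_count=3))
```

With member_count=4 the returned list is empty, which corresponds to the change from member-count: 3 to member-count: 4 described in step 4.
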
13.1.3.4.1.3 Adding a new SLES compute node to monitoring

If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
13.1.3.4.2 Adding a RHEL Compute Node

Adding a RHEL compute node allows you to increase cloud capacity for more virtual machines. These steps will help you add new RHEL compute hosts for this purpose.

13.1.3.4.2.1 Prerequisites

You need to ensure your input model files are properly set up for RHEL compute host clusters. This must be done during the installation process of your cloud and is discussed further in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 10 “Modifying Example Configurations for Compute Nodes”, Section 10.2 “RHEL Compute Nodes”.

13.1.3.4.2.2 Adding a RHEL compute node

You must have RHEL 7.5 pre-installed on the baremetal host prior to beginning these steps.

  1. Ensure you have RHEL 7.5 pre-installed on your baremetal host.

  2. Log in to the Cloud Lifecycle Manager.

  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three RHEL compute hosts using the RHEL-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the format shown below. The IPMI details are left out because they are not needed when the RHEL OS is pre-installed on your host(s).

    - id: compute4
      ip-addr: 192.168.102.70
      role: RHEL-COMPUTE-ROLE
      server-group: RACK1

    You can find detailed descriptions of these fields in Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.5 “Servers”. Ensure that you use the same role for any new RHEL hosts you are adding as you specified on your existing RHEL hosts.

    Important

    Verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  5. Commit the changes to git:

    git add -A
    git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”, Section 7.3.1 “Persisted Server Allocations” for information on how this works.

  8. Look up the value for the new compute node's hostname in ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

    Then, complete the compute host deployment with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible/
    ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
    ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
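
Steps 6 to 8 above always run in the same order and differ only in the hostname passed to --limit. The following Python sketch collects those commands into one ordered list so they can be reviewed before being run. It only copies the commands from the steps above. The function name and the example hostname are hypothetical.

```python
import shlex


def deployment_commands(hostname: str) -> list:
    """Return the shell commands from steps 6 to 8, in order."""
    limit = shlex.quote(hostname)  # guard against unexpected characters
    return [
        "cd ~/openstack/ardana/ansible",
        "ansible-playbook -i hosts/localhost config-processor-run.yml",
        "ansible-playbook -i hosts/localhost ready-deployment.yml",
        "cd ~/scratch/ansible/next/ardana/ansible/",
        'ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"',
        f"ansible-playbook -i hosts/verb_hosts site.yml --limit {limit}",
    ]


# The hostname below is a made-up example; look up the real value in verb_hosts.
for line in deployment_commands("ardana-cp1-comp0004-mgmt"):
    print(line)
```

The sketch prints the commands rather than executing them, so that any errors reported by the configuration processor can still be resolved by hand between steps.
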
13.1.3.4.2.3 Adding a new RHEL compute node to monitoring

If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

13.1.3.5 Removing a Compute Node

Removing a Compute node allows you to remove capacity.

The steps below walk you through removing a Compute node.

13.1.3.5.1 Disable Provisioning on the Compute Host
  1. List the running Nova services. The output provides the details you need to disable provisioning on the Compute host you want to remove:

    nova service-list

    Here is an example. The Compute node we are going to remove in these examples is ardana-cp1-comp0002-mgmt:

    $ nova service-list
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
    | Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
  2. Disable the Nova service on the Compute node you want to remove. This ensures it is taken out of the scheduling rotation:

    nova service-disable --reason "<enter reason here>" <node hostname> nova-compute

    Here is an example that disables the ardana-cp1-comp0002-mgmt node from the output above:

    $ nova service-disable --reason "hardware reallocation" ardana-cp1-comp0002-mgmt nova-compute
    +--------------------------+--------------+----------+-----------------------+
    | Host                     | Binary       | Status   | Disabled Reason       |
    +--------------------------+--------------+----------+-----------------------+
    | ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
    +--------------------------+--------------+----------+-----------------------+
13.1.3.5.2 Remove the Compute Host from its Availability Zone

If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.

  1. List the running Nova services. The output provides the details you need to remove a Compute node:

    nova service-list

    Here is an example. The Compute node we are going to remove in these examples is ardana-cp1-comp0002-mgmt:

    $ nova service-list
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:34.000000 | -                     |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:38.000000 | -                     |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:38.000000 | -                     |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T22:50:42.000000 | -                     |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T22:50:35.000000 | -                     |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | disabled | up    | 2015-11-22T22:50:44.000000 | hardware reallocation |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
  2. You can remove the Compute host from the availability zone it was a part of with this command:

    nova aggregate-remove-host <availability zone> <nova hostname>

    In the example from the previous step, the ardana-cp1-comp0002-mgmt host was in the AZ2 availability zone, so this command removes it:

    $ nova aggregate-remove-host AZ2 ardana-cp1-comp0002-mgmt
    Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
    +----+------+-------------------+-------+-------------------------+
    | Id | Name | Availability Zone | Hosts | Metadata                |
    +----+------+-------------------+-------+-------------------------+
    | 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
    +----+------+-------------------+-------+-------------------------+
  3. You can confirm the last two steps completed successfully by running another nova service-list.

    Here is an example which confirms that the node has been disabled and that it has been removed from the availability zone. In the last row, the Status is disabled and the Zone is no longer AZ2:

    $ nova service-list
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
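
The confirmation in step 3 above amounts to reading two cells of one row. The following Python sketch shows how that check could be scripted against saved nova service-list output. It parses the ASCII table as plain text, relies only on the column names visible in the output above, and uses hypothetical function names.

```python
def parse_table(output: str) -> list:
    """Turn an ASCII table (as printed by the nova client) into a list of dicts."""
    header = None
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first "|" line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def ready_for_removal(output: str, host: str, old_zone: str) -> bool:
    """True if nova-compute on host is disabled and no longer in old_zone."""
    for row in parse_table(output):
        if row["Binary"] == "nova-compute" and row["Host"] == host:
            return row["Status"] == "disabled" and row["Zone"] != old_zone
    return False  # host not found in the listing


sample = """
+----+--------------+--------------------------+------+----------+
| Id | Binary       | Host                     | Zone | Status   |
+----+--------------+--------------------------+------+----------+
| 37 | nova-compute | ardana-cp1-comp0002-mgmt | nova | disabled |
+----+--------------+--------------------------+------+----------+
"""
print(ready_for_removal(sample, "ardana-cp1-comp0002-mgmt", "AZ2"))
```

The sample table is shortened to the columns the check needs. The full output shown above parses the same way.
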
13.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts
  1. Verify whether the Compute node is currently hosting any instances. You can do this with the command below:

    nova list --host=<nova hostname> --all_tenants=1

    Here is an example below which shows that we have a single running instance on this node currently:

    $ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
    | ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
    | 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
  2. You will likely want to migrate this instance off of this node before removing it. You can do this with the live migration functionality within Nova. The command will look like this:

    nova live-migration --block-migrate <nova instance ID>

    Here is an example using the instance in the previous step:

    $ nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9

    You can check the status of the migration using the same command from the previous step:

    $ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
    | ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
    | 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
  3. Run nova list again to see that the running instance has been migrated:

    $ nova list --host=ardana-cp1-comp0002-mgmt --all_tenants=1
    +----+------+-----------+--------+------------+-------------+----------+
    | ID | Name | Tenant ID | Status | Task State | Power State | Networks |
    +----+------+-----------+--------+------------+-------------+----------+
    +----+------+-----------+--------+------------+-------------+----------+
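
The host is drained once nova list for that host prints a table with no data rows, as in step 3 above. The following Python sketch shows one way to read that from saved command output. It treats the output as plain text, uses the ID and Status column names from the tables above, and uses hypothetical function names.

```python
def instances_on_host(output: str) -> list:
    """Return (ID, Status) pairs for the data rows of `nova list` output."""
    table_lines = [line.strip() for line in output.splitlines()
                   if line.strip().startswith("|")]
    if not table_lines:
        return []
    header = [cell.strip() for cell in table_lines[0].strip("|").split("|")]
    id_col, status_col = header.index("ID"), header.index("Status")
    pairs = []
    for line in table_lines[1:]:
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        pairs.append((cells[id_col], cells[status_col]))
    return pairs


def host_is_empty(output: str) -> bool:
    """True when the listing contains a header but no instances."""
    return not instances_on_host(output)


migrating = """
+--------------------------------------+--------+-----------+
| ID                                   | Name   | Status    |
+--------------------------------------+--------+-----------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | MIGRATING |
+--------------------------------------+--------+-----------+
"""
print(instances_on_host(migrating))
print(host_is_empty(migrating))
```

A row whose Status is still MIGRATING means the migration has not finished, so the check would be repeated later.
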
13.1.3.5.4 Disable Neutron Agents on Node to be Removed

You should also locate and disable or remove the neutron agents on the node. List the neutron agents running on it, then set the admin state of each one to down using its id:

$ neutron agent-list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

$ neutron agent-update --admin-state-down 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ neutron agent-update --admin-state-down dbe4fe11-8f08-4306-8244-cc68e98bb770
$ neutron agent-update --admin-state-down f0d262d1-7139-40c7-bdc2-f227c6dee5c8
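
When a node runs several agents, the IDs can be extracted from the agent-list table rather than copied by hand. The following is a minimal sketch, assuming the column order shown above (id, agent_type, host). The function name agent_ids_for_host is illustrative and is not part of the neutron client.

```shell
# Print the agent IDs for every table row whose host column equals the given
# host name. Reads `neutron agent-list` output on stdin.
# Assumption: columns are ordered id | agent_type | host | ... as shown above.
agent_ids_for_host() {
  awk -F'|' -v host="$1" '
    NF >= 5 {
      id = $2; h = $4
      gsub(/^[ \t]+|[ \t]+$/, "", id)   # trim padding around the id
      gsub(/^[ \t]+|[ \t]+$/, "", h)    # trim padding around the host
      if (h == host) print id
    }'
}

# Usage on the Cloud Lifecycle Manager (not executed here):
#   neutron agent-list | agent_ids_for_host ardana-cp1-comp0002-mgmt |
#     while read -r id; do neutron agent-update --admin-state-down "$id"; done
```

Border lines and the header row are skipped because they either contain no column separators or do not match the host name.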
13.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host

To perform this step you have a few options. You can SSH into the Compute host and run the following commands:

sudo systemctl stop nova-compute
sudo systemctl stop 'neutron-*'

Because each Neutron agent registers itself with the Neutron server again when it starts, you may want to prevent these services from coming back online. List them with:

sudo systemctl list-units 'neutron-*' --all

Here are the results:

UNIT                                  LOAD        ACTIVE     SUB      DESCRIPTION
neutron-common-rundir.service         loaded      inactive   dead     Create /var/run/neutron
•neutron-dhcp-agent.service         not-found     inactive   dead     neutron-dhcp-agent.service
neutron-l3-agent.service              loaded      inactive   dead     neutron-l3-agent Service
neutron-lbaasv2-agent.service         loaded      inactive   dead     neutron-lbaasv2-agent Service
neutron-metadata-agent.service        loaded      inactive   dead     neutron-metadata-agent Service
•neutron-openvswitch-agent.service    loaded      failed     failed   neutron-openvswitch-agent Service
neutron-ovs-cleanup.service           loaded      inactive   dead     Neutron OVS Cleanup Service

        LOAD   = Reflects whether the unit definition was properly loaded.
        ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
        SUB    = The low-level unit activation state, values depend on unit type.

        7 loaded units listed.
        To show all installed unit files use 'systemctl list-unit-files'.

For each loaded service, run:

sudo systemctl disable <service-name>

In the example above, that is every service except neutron-dhcp-agent.service, which is not loaded.

For example:

sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-lbaasv2-agent neutron-metadata-agent neutron-openvswitch-agent
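
The list of loaded units can also be derived from the systemctl output. This is a sketch only: it assumes the listing is produced with the --plain and --no-legend options of systemctl list-units, so that the unit name is the first column and the LOAD state the second. The function name loaded_units is illustrative.

```shell
# Print the names of units whose LOAD column is "loaded".
# Expects `systemctl list-units ... --plain --no-legend` output on stdin.
loaded_units() {
  awk '$2 == "loaded" { print $1 }'
}

# Usage on the Compute host (not executed here; xargs -r is a GNU extension
# that skips the command when the list is empty):
#   sudo systemctl list-units 'neutron-*' --all --plain --no-legend |
#     loaded_units | xargs -r sudo systemctl disable
```

Units reported as not-found, such as neutron-dhcp-agent.service in the listing above, are filtered out.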

Now you can shut down the node:

sudo shutdown now

OR

From the Cloud Lifecycle Manager you can use the bm-power-down.yml playbook to shut down the node:

cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node name>
Note

The <node name> value will be the value corresponding to this node in Cobbler. You can run sudo cobbler system list to retrieve these names.

13.1.3.5.6 Delete the Compute Host from Nova

Retrieve the list of Nova services:

nova service-list

Here is an example highlighting the Compute host we're going to remove:

$ nova service-list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 25 | nova-consoleauth | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+

Delete the host from Nova using the command below:

nova service-delete <service ID>

Following our example above, you would use:

nova service-delete 37

Use the command below to confirm that the Compute host has been completely removed from Nova:

nova hypervisor-list
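
The confirmation can be turned into a pass or fail check. The sketch below makes no assumption about the table layout of nova hypervisor-list; it only searches the output for the host name as a fixed string. The function name host_absent is illustrative.

```shell
# Succeed (exit status 0) only if the given host name does not appear
# anywhere in the text read from stdin.
host_absent() {
  ! grep -q -F -- "$1"
}

# Usage (not executed here):
#   nova hypervisor-list | host_absent ardana-cp1-comp0002-mgmt &&
#     echo "host no longer registered as a hypervisor"
```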
13.1.3.5.7 Delete the Compute Host from Neutron

Multiple Neutron agents are running on the compute node. You have to remove all of the agents running on the node using the "neutron agent-delete" command. In the example below, the l3-agent, openvswitch-agent and metadata-agent are running:

$ neutron agent-list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

$ neutron agent-delete AGENT_ID

$ neutron agent-delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ neutron agent-delete dbe4fe11-8f08-4306-8244-cc68e98bb770
$ neutron agent-delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
13.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor

Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit your servers.yml file in the location below to remove references to the Compute node(s) you want to remove:

    ~/openstack/my_cloud/definition/data/servers.yml
  3. If you set member-count, min-count, or max-count in your control_plane.yml file, update those values so that they reflect the number of nodes that remain.

    See Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.2 “Control Plane” for more details.

  4. Commit the changes to git:

    git commit -a -m "Remove node <name>"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml

    To free up the resources when running the configuration processor, use the switches remove_deleted_servers and free_unused_addresses. For more information, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
13.1.3.5.9 Remove the Compute Host from Cobbler

Complete these steps to remove the node from Cobbler:

  1. Confirm the system name in Cobbler with this command:

    sudo cobbler system list
  2. Remove the system from Cobbler using this command:

    sudo cobbler system remove --name=<node>
  3. Run the cobbler-deploy.yml playbook to complete the process:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
13.1.3.5.10 Remove the Compute Host from Monitoring

After you remove a Compute node, the alarms defined against it will trigger. Take the additional steps below to resolve this.

To find all Monasca API servers:

tux > sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
    bind ardana-cp1-vip-public-MON-API-extapi:8070  ssl crt /etc/ssl/private//my-public-cert-entry-scale                                          
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5        
listen ardana-cp1-vip-MON-API-mgmt-8070
    bind ardana-cp1-vip-MON-API-mgmt:8070  ssl crt /etc/ssl/private//ardana-internal-cert                                          
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5                                          
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5

In the above example, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the Monasca API servers.

You will want to SSH to each of the Monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Compute node you removed. This will require sudo access. The entries will look similar to the one below:

- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping
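
Removing these entries by hand on several servers is error-prone, so a filter can help. The following is a sketch under assumptions: it treats every line starting with "- " as the start of a list item, passes through anything before the first item unchanged, and assumes nothing follows the list. The function name drop_host_entries is illustrative. Review the result before replacing the real file.

```shell
# Copy a host_alive.yaml-style file from stdin to stdout, dropping every list
# item ("- key: value" plus the lines that follow it) whose host_name equals
# the given host. Assumption: the list of entries is the last thing in the file.
drop_host_entries() {
  awk -v host="$1" '
    function flush() {
      # Emit the buffered item unless it was marked for removal.
      if (n > 0 && !drop) for (i = 1; i <= n; i++) print buf[i]
      n = 0; drop = 0
    }
    /^[ \t]*- / { flush(); initem = 1 }   # a new list item starts here
    !initem { print; next }               # header lines before the list
    {
      buf[++n] = $0
      line = $0
      sub(/^[ \t]*(- )?/, "", line)       # strip indentation and list marker
      if (line == "host_name: " host) drop = 1
    }
    END { flush() }'
}

# Usage on a Monasca API server (not executed here):
#   sudo cat /etc/monasca/agent/conf.d/host_alive.yaml |
#     drop_host_entries ardana-cp1-comp0001-mgmt > /tmp/host_alive.yaml.new
```

Writing to a separate file first lets you compare it with the original before copying it into place with sudo.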

Once you have removed the references on each of your Monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the Compute node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the Monasca CLI which should be installed on each of your Monasca API servers by default:

monasca alarm-list --metric-dimensions hostname=<compute node deleted>

For example, if your Compute node looked like the example above then you would use this command to get the alarm ID:

monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt

You can then delete the alarm with this command:

monasca alarm-delete <alarm ID>

13.1.4 Planned Network Maintenance

Planned maintenance tasks for networking nodes.

13.1.4.1 Adding a Neutron Network Node

Adding an additional Neutron networking node allows you to increase the performance of your cloud.

The steps below show how to add a Neutron network node, whether for increased performance or for another purpose.

13.1.4.1.1 Prerequisites

If you are using the mid-scale model, your networking nodes are already separate and the roles are defined. If you are not using this model and wish to add separate networking nodes, you need to ensure that those roles are defined. The mid-scale example model files in the ~/openstack/examples folder on your Cloud Lifecycle Manager show how to do this. The basic edits are listed below:

  1. In your server_roles.yml file, ensure you have the NEUTRON-ROLE defined.

    Path to file:

    ~/openstack/my_cloud/definition/data/server_roles.yml

    Example snippet:

    - name: NEUTRON-ROLE
      interface-model: NEUTRON-INTERFACES
      disk-model: NEUTRON-DISKS
  2. In your net_interfaces.yml file, ensure you have the NEUTRON-INTERFACES defined.

    Path to file:

    ~/openstack/my_cloud/definition/data/net_interfaces.yml

    Example snippet:

    - name: NEUTRON-INTERFACES
      network-interfaces:
      - device:
          name: hed3
        name: hed3
        network-groups:
        - EXTERNAL-VM
        - GUEST
        - MANAGEMENT
  3. Create a disks_neutron.yml file and ensure that NEUTRON-DISKS is defined in it.

    Path to file:

    ~/openstack/my_cloud/definition/data/disks_neutron.yml

    Example snippet:

      product:
        version: 2
    
      disk-models:
      - name: NEUTRON-DISKS
        volume-groups:
          - name: ardana-vg
            physical-volumes:
             - /dev/sda_root
    
            logical-volumes:
            # The policy is not to consume 100% of the space of each volume group.
            # 5% should be left free for snapshots and to allow for some flexibility.
              - name: root
                size: 35%
                fstype: ext4
                mount: /
              - name: log
                size: 50%
                mount: /var/log
                fstype: ext4
                mkfs-opts: -O large_file
              - name: crash
                size: 10%
                mount: /var/crash
                fstype: ext4
                mkfs-opts: -O large_file
  4. Modify your control_plane.yml file so that the NEUTRON-ROLE is defined and the Neutron services are added.

    Path to file:

    ~/openstack/my_cloud/definition/data/control_plane.yml

    Example snippet:

      - allocation-policy: strict
        cluster-prefix: neut
        member-count: 1
        name: neut
        server-role: NEUTRON-ROLE
        service-components:
        - ntp-client
        - neutron-vpn-agent
        - neutron-dhcp-agent
        - neutron-metadata-agent
        - neutron-openvswitch-agent

You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 2 “Hardware and Software Support Matrix”.

13.1.4.1.2 Adding a network node

These steps show you how to add the new network node to your servers.yml file and then run the playbooks that update your cloud configuration. You run these playbooks from the Cloud Lifecycle Manager.

  1. Log in to your Cloud Lifecycle Manager.

  2. Check out the site branch of your local Git repository so you can begin to make the necessary edits:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > git checkout site
  3. In the same directory, edit your servers.yml file to include the details about your new network node(s).

    For example, if you already had a cluster of three network nodes and needed to add a fourth one you would add your details to the bottom of the file in this format:

    # network nodes
    - id: neut3
      role: NEUTRON-ROLE
      server-group: RACK2
      mac-addr: "5c:b9:01:89:b6:18"
      nic-mapping: HP-DL360-6PORT
      ip-addr: 10.243.140.22
      ilo-ip: 10.1.12.91
      ilo-password: password
      ilo-user: admin
    Important

    You will need to verify that the ip-addr value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your control_plane.yml file you will need to check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match up with your new total node count. So for example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.

  5. Commit the changes to git:

    ardana > git commit -a -m "Add new networking node <name>"
  6. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Add the new node into Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>
    Note

    If you do not know the <hostname>, you can get it by using sudo cobbler system list.

  10. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  11. Configure the operating system on the new networking node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
  12. Complete the networking node deployment with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>
  13. Run the site.yml playbook with the required tag so that all other services become aware of the new node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
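
The address check described in step 3 can be scripted. This sketch does not assume any particular layout for address_info.yml; it only searches the file for the address as a whole word. The function name ip_in_use is illustrative.

```shell
# Succeed (exit status 0) if the exact IP address appears in the file.
# -F treats the address as a fixed string, so dots are literal.
# -w requires a whole-word match, so 10.243.140.22 does not match 10.243.140.221.
ip_in_use() {
  grep -q -F -w -- "$1" "$2"
}

# Usage on the Cloud Lifecycle Manager (not executed here):
#   if ip_in_use 10.243.140.22 ~/openstack/my_cloud/info/address_info.yml; then
#     echo "address already allocated, choose another"
#   fi
```

A match only shows that the address is recorded in that file. An address used outside the cloud definition will not be detected this way.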
13.1.4.1.3 Adding a New Network Node to Monitoring

If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
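
The playbook runs in Section 13.1.4.1.2 and the monitoring playbook above can be chained so that a failure stops the sequence. This is a sketch, not a supported tool: the function names and the DRY_RUN variable are illustrative, and stopping at the first failure is a design choice made here. Imaging the node (step 9) and the optional disk wipe (step 10) are left as manual steps between the two functions.

```shell
# Run one playbook from the given directory.
# With DRY_RUN=1 the command is printed instead of executed.
run_playbook() {
  dir="$1"
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '(cd %s && ansible-playbook %s)\n' "$dir" "$*"
  else
    (cd "$dir" && ansible-playbook "$@")
  fi
}

# Steps 6 to 8: configuration processor, deployment directory, Cobbler.
prepare_network_node() {
  ansible_dir="$HOME/openstack/ardana/ansible"
  run_playbook "$ansible_dir" -i hosts/localhost config-processor-run.yml &&
    run_playbook "$ansible_dir" -i hosts/localhost ready-deployment.yml &&
    run_playbook "$ansible_dir" -i hosts/localhost cobbler-deploy.yml
}

# Steps 11 to 13 plus the monitoring playbook, for one imaged host.
deploy_network_node() {
  host="$1"
  scratch_dir="$HOME/scratch/ansible/next/ardana/ansible"
  run_playbook "$scratch_dir" -i hosts/verb_hosts osconfig-run.yml --limit "$host" &&
    run_playbook "$scratch_dir" -i hosts/verb_hosts ardana-deploy.yml --limit "$host" &&
    run_playbook "$scratch_dir" -i hosts/verb_hosts site.yml --tags generate_hosts_file &&
    run_playbook "$scratch_dir" -i hosts/verb_hosts monasca-deploy.yml --tags active_ping_checks
}
```

Running either function with DRY_RUN=1 prints the commands in order, which is a convenient way to review the sequence before executing it.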

13.1.5 Planned Storage Maintenance

Planned maintenance procedures for Swift storage nodes.

13.1.5.1 Planned Maintenance Tasks for Swift Nodes

Planned maintenance tasks including recovering, adding, and removing Swift nodes.

13.1.5.1.1 Adding a Swift Object Node

Adding additional object nodes allows you to increase capacity.

This topic describes how to add additional Swift object server nodes to an existing system.

13.1.5.1.1.1 To add a new node

To add a new node to your cloud, add it to servers.yml and then run the playbooks that update your cloud configuration. You edit servers.yml on the Git branch that holds your cloud definition, which you check out in step 2.

Perform the following steps to add a new node:

  1. Log in to the Cloud Lifecycle Manager node.

  2. Get the servers.yml file stored in Git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. If not already done, set the weight-step attribute. For instructions, see Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  4. Add the details of new nodes to the servers.yml file. In the following example only one new server swobj4 is added. However, you can add multiple servers by providing the server details in the servers.yml file:

    servers:
    ...
    - id: swobj4
      role: SWOBJ-ROLE
      server-group: <server-group-name>
      mac-addr: <mac-address>
      nic-mapping: <nic-mapping-name>
      ip-addr: <ip-address>
      ilo-ip: <ilo-ip-address>
      ilo-user: <ilo-username>
      ilo-password: <ilo-password>
  5. Commit your changes:

    git add -A
    git commit -m "Add Node <name>"
    Note

    Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:

    export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY

    For instructions, see Book “Installing with Cloud Lifecycle Manager”, Chapter 18 “Installation for HPE Helion OpenStack Entry-scale Cloud with Swift Only”.

  6. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list with the nodelist argument):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>

    In the following example, the server id is swobj4 (mentioned in step 4):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
    Note

    You must use the server id as it appears in the id field of the servers.yml file.

  9. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    The hostname of the newly added server can be found in the list generated from the output of the following command:

    grep hostname ~/openstack/my_cloud/info/server_info.yml

    For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt
  10. Validate that the disk drives of the new node are compatible with the disk model used by the node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit 'SWF*'

    If any errors occur, correct them. For instructions, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  11. Run the following playbook to ensure that the hosts file on every other server is updated with the new server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
  12. Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
  13. You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 8.5.5, “Applying Input Model Changes to Existing Rings”.

    For example:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
13.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node

Steps for adding additional PAC nodes to your Swift system.

This topic describes how to add additional Swift proxy, account, and container (PAC) servers to an existing system.

13.1.5.1.2.1 Adding a new node

To add a new node to your cloud, add it to servers.yml and then run the playbooks that update your cloud configuration. You edit servers.yml on the Git branch that holds your cloud definition, which you check out in step 2.

Perform the following steps to add a new node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Get the servers.yml file stored in Git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. If not already done, set the weight-step attribute. For instructions, see Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  4. Add details of new nodes to the servers.yml file:

    servers:
    ...
    - id: swpac6
      role: SWPAC-ROLE
      server-group: <server-group-name>
      mac-addr: <mac-address>
      nic-mapping: <nic-mapping-name>
      ip-addr: <ip-address>
      ilo-ip: <ilo-ip-address>
      ilo-user: <ilo-username>
      ilo-password: <ilo-password>

    In the above example, only one new server swpac6 is added. However, you can add multiple servers by providing the server details in the servers.yml file.

    In the entry-scale configurations there is no dedicated Swift PAC cluster. Instead, there is a cluster using servers that have a role of CONTROLLER-ROLE. You cannot add swpac4 to this cluster because that would change the member-count. If your system does not already have a dedicated Swift PAC cluster you will need to add it to the configuration files. For details on how to do this, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.7 “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.

    If you are using new PAC nodes, you must add the PAC nodes' configuration details in the following YAML files:

    control_plane.yml
    disks_pac.yml
    net_interfaces.yml
    servers.yml
    server_roles.yml

    You can see a good example of this in the example configurations for the mid-scale model in the ~/openstack/examples/mid-scale-kvm directory.

    The following steps assume that you have already created a dedicated Swift PAC cluster and that it has two members (swpac4 and swpac5).

  5. Increase the member count of the Swift PAC cluster, as appropriate. For example, if you are adding swpac6 and you previously had two Swift PAC nodes, the increased member count should be 3 as shown in the following example:

    control-planes:
        - name: control-plane-1
          control-plane-prefix: cp1
    
      . . .
      clusters:
      . . .
         - name: ....
           cluster-prefix: c2
           server-role: SWPAC-ROLE
           member-count: 3
       . . .
  6. Commit your changes:

    git add -A
    git commit -m "Add Node <name>"
    Note

    Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:

    export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY

    For instructions, see Book “Installing with Cloud Lifecycle Manager”, Chapter 18 “Installation for HPE Helion OpenStack Entry-scale Cloud with Swift Only”.

  7. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  8. Create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  9. Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the nodelist argument):

    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>

    In the following example, the server id is swpac6 (mentioned in step 4):

    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
    Note

    You must use the server id as it appears in the id field of the servers.yml file.

  10. Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, ardana) and the control plane name (for example, cp1). Together these give you the hostname of the node. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
  11. Validate that the disk drives of the new node are compatible with the disk model used by the node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml

    If any errors occur, correct them. For instructions, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  12. Run the following playbook to ensure that the hosts file on every other server is updated with the new server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
  13. Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swpac6) that you are adding:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
  14. You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
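
When several nodes are added at once, step 9 expects the server ids as a comma-separated nodelist value. A small helper can build that value from a list of ids. This is a minimal sketch; the function name join_nodelist and the second id swpac7 are illustrative.

```shell
# Join the given server ids with commas, for use as -e nodelist=<value>.
join_nodelist() {
  # Run in a subshell so the IFS change does not leak to the caller.
  ( IFS=,; printf '%s\n' "$*" )
}

# Usage (not executed here):
#   ansible-playbook -i hosts/localhost bm-reimage.yml \
#     -e nodelist="$(join_nodelist swpac6 swpac7)"
```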

13.1.5.1.3 Adding Additional Disks to a Swift Node

Steps for adding additional disks to any nodes hosting Swift services.

You may need to add disks to a node for Swift usage. These steps apply to Swift object nodes and to proxy, account, container (PAC) nodes. They also apply to a controller node that hosts the Swift service, as in the entry-scale example models.

Read through the notes below before beginning the process.

You can add multiple disks at the same time; there is no need to add them one at a time.

Important: Add the Same Number of Disks

You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three Swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three Swift servers.

13.1.5.1.3.1 Adding additional disks to your Swift servers
  1. Verify the general health of the Swift system and that it is safe to rebalance your rings. See Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.

  2. Perform the disk maintenance.

    1. Shut down the first Swift server you wish to add disks to.

    2. Add the additional disks to the physical server. The disk drives that you add should be clean: each should contain either no partitions or a single partition the size of the entire disk, and should not contain a file system or any volume groups. Otherwise, errors occur and the disk is not added.

      For more details, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 11 “Modifying Example Configurations for Object Storage using Swift”, Section 11.6 “Swift Requirements for Device Group Drives”.

    3. Power the server on.

    4. While the server was shut down, data that normally would have been placed on the server was placed elsewhere. When the server is rebooted, the Swift replication process moves that data back onto the server. Monitor the replication process to determine when it is complete. See Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.

    5. Repeat the steps from Step 2.a for each of the Swift servers you are adding the disks to, one at a time.

      Note

      If the additional disks can be added to the Swift servers online (for example, via hotplugging) then there is no need to perform the last two steps.

  3. On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.

    1. Edit the disk configuration file that correlates to the type of server you are adding your new disks to.

      Path to the typical disk configuration files:

      ~/openstack/my_cloud/definition/data/disks_swobj.yml
      ~/openstack/my_cloud/definition/data/disks_swpac.yml
      ~/openstack/my_cloud/definition/data/disks_controller_*.yml

      Example showing the addition of a single new disk, /dev/sdd:

      device-groups:
        - name: SwiftObject
          devices:
            - name: "/dev/sdb"
            - name: "/dev/sdc"
            - name: "/dev/sdd"
          consumer:
            name: swift
            ...
      Note

      For more details on how the disk model works, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”.

    2. Configure the Swift weight-step value in the ~/openstack/my_cloud/definition/data/swift/rings.yml file. See Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.

    3. Commit the changes to Git:

      cd ~/openstack
      git commit -a -m "adding additional Swift disks"
    4. Run the configuration processor:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost config-processor-run.yml
    5. Update your deployment directory:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run the osconfig-run.yml playbook against the Swift nodes you have added disks to. Use the --limit switch to target the specific nodes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>

    You can use a wildcard when specifying the hostnames with the --limit switch. If you added disks to all of the Swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...), you can use a wildcard like ardana-cp1-swobj*. If you added disks to only some of the nodes, use a comma-delimited list of the hostnames of the nodes you added disks to.

  5. Validate your Swift configuration with this playbook, which also provides details of each drive being added:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
  6. Verify that Swift services are running on all of your servers:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-status.yml
  7. If everything looks okay with the Swift status, then apply the changes to your Swift rings with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  8. At this point your Swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 8.5.5, “Applying Input Model Changes to Existing Rings”.
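
As a minimal sketch of the --limit guidance in step 4, the helper below joins an explicit set of hostnames into a comma-delimited value. The function name join_limit and the hostnames in the usage comment are hypothetical and only illustrate the form of the argument.

```shell
# join_limit: hypothetical helper that joins its arguments with commas,
# producing a value that can be passed to the --limit switch.
join_limit() {
  # "$*" joins the arguments using the first character of IFS; the
  # subshell keeps the IFS change from leaking into the caller.
  (IFS=,; printf '%s\n' "$*")
}

# Usage (illustrative hostnames only):
#   ansible-playbook -i hosts/verb_hosts osconfig-run.yml \
#     --limit "$(join_limit ardana-cp1-swobj0001 ardana-cp1-swobj0003)"
```

When you pass a wildcard instead, quoting it (for example, --limit 'ardana-cp1-swobj*') prevents the shell from expanding the pattern against file names in the current directory before Ansible sees it.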

13.1.5.1.4 Removing a Swift Node

Removal process for both Swift Object and PAC nodes.

You can use this process when you want to remove one or more Swift nodes permanently. This process applies to both Swift Proxy, Account, Container (PAC) nodes and Swift Object nodes.

13.1.5.1.4.1 Setting the Pass-through Attributes

This process will remove the Swift node's drives from the rings and move the data they hold to the remaining nodes in your cluster.

  1. Log in to the Cloud Lifecycle Manager.

  2. Ensure that the weight-step attribute is set. See Section 8.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.

  3. Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include in your ~/openstack/my_cloud/definition/data/servers.yml file since your server IDs are already listed in that file. For more information about pass-through, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 6 “Configuration Objects”, Section 6.17 “Pass Through”.

    Here is the general format:

    pass-through:
      servers:
        - id: <server-id>
          data:
              <subsystem>:
                    <subsystem-attributes>

    Here is an example:

    ---
      product:
        version: 2
    
      pass-through:
        servers:
          - id: ccn-0001
            data:
              swift:
                drain: yes

    By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute as described in the previous step. This process is known as "draining", where you remove the Swift data from the node in preparation for removing the node.

  4. Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Use the playbook to create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the Swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  8. Wait until the replication has completed. For further details, see Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  9. Determine whether all of the partitions have been removed from all drives on the Swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:

    cd /etc/swiftlm/cloud1/cp1/builder_dir/
    sudo swift-ring-builder <ring_name>.builder

    For example, if the node you are removing was part of the object-0 ring the command would be:

    sudo swift-ring-builder object-0.builder

    Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:

    $ cd /etc/swiftlm/cloud1/cp1/builder_dir/
    $ sudo swift-ring-builder object-0.builder
    account.builder, build version 6
    4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
    The minimum number of hours before a partition can be reassigned is 16
    The overload factor is 0.00% (0.000000)
    Devices:    id  region  zone      ip address  port  replication ip  replication port      name weight partitions balance meta
                 0       1     1   192.168.245.3  6002   192.168.245.3              6002     disk0   0.00          0   -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
                 1       1     1   192.168.245.3  6002   192.168.245.3              6002     disk1   0.00          0   -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
                 2       1     1   192.168.245.4  6002   192.168.245.4              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc
                 3       1     1   192.168.245.4  6002   192.168.245.4              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m2:disk1:/dev/sdd
                 4       1     1   192.168.245.5  6002   192.168.245.5              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m3:disk0:/dev/sdc
                 5       1     1   192.168.245.5  6002   192.168.245.5              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m3:disk1:/dev/sdd
  10. If the number of partitions is zero for the server on all rings, you can move to the next step, otherwise continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.

  11. If the number of partitions is zero for the server on all rings, you can remove the Swift nodes' drives from all rings. Edit the pass-through data you created in step #3 and set the remove attribute as shown in this example:

    ---
      product:
        version: 2
    
      pass-through:
        servers:
          - id: ccn-0001
            data:
              swift:
                remove: yes
  12. Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  13. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  14. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  15. Run the Swift deploy playbook to rebuild the rings by removing the server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  16. At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.
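
The check in steps 9 and 10 (whether the drained server still holds partitions) can be sketched as a small filter over the swift-ring-builder output. This is an illustration under assumptions, not part of the product tooling: the function name partitions_for_ip is hypothetical, and the column positions are taken from the sample output in step 9, so they may differ in other Swift versions.

```shell
# partitions_for_ip: hypothetical helper. Reads the output of
# `swift-ring-builder <ring_name>.builder` on stdin and prints the total
# number of partitions assigned to devices whose IP address is $1.
# Assumed column order (from the sample output in step 9):
#   id region zone ip-address port replication-ip replication-port
#   name weight partitions balance meta
partitions_for_ip() {
  awk -v ip="$1" '
    # Device rows begin with a numeric id and carry the IP in column 4.
    $1 ~ /^[0-9]+$/ && $4 == ip { total += $10; seen = 1 }
    END { if (seen) print total + 0; else print "not found" }
  '
}

# Usage on the first account server node (IP address is illustrative):
#   cd /etc/swiftlm/cloud1/cp1/builder_dir/
#   for b in *.builder; do
#     printf '%s: ' "$b"
#     sudo swift-ring-builder "$b" | partitions_for_ip 192.168.245.3
#   done
```

A result of 0 for every builder file corresponds to the condition in step 10. A result of "not found" means no device row matched the IP address, which is worth checking by hand.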

13.1.5.1.4.2 To Disable Swift on a Node

The next phase in this process will disable the Swift service on the node. In this example, swobj4 is the node being removed from Swift.

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop Swift services on the node using the swift-stop.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit <hostname>
    Note

    When using the --limit argument, you must specify the full hostname (for example: ardana-cp1-swobj0004) or use the wild card * (for example, *swobj4*).

    The following example uses the swift-stop.yml playbook to stop Swift services on ardana-cp1-swobj0004:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
  3. Remove the configuration files.

    ssh ardana-cp1-swobj0004-mgmt sudo rm -R /etc/swift
    Note

    Do not run any other playbooks until you have finished the process described in Section 13.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate /etc/swift and restart Swift on swobj4. If you accidentally run a playbook, repeat the process in Section 13.1.5.1.4.2, “To Disable Swift on a Node”.

13.1.5.1.4.3 To Remove a Node from the Input Model

Use the following steps to finish the process of removing the Swift node.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/definition/data/servers.yml file and remove the entry for the node (swobj4 in this example).

  3. If this was a SWPAC node, reduce the member-count attribute by 1 in the ~/openstack/my_cloud/definition/data/control_plane.yml file. For SWOBJ nodes, no such action is needed.

  4. Commit your configuration to the local Git repository (see Book “Installing with Cloud Lifecycle Manager”, Chapter 10 “Using Git for Configuration Management”), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml

    You may want to use the remove_deleted_servers and free_unused_addresses switches to free up the resources when running the configuration processor. For more information, see Book “Planning an Installation with Cloud Lifecycle Manager”, Chapter 7 “Other Topics”, Section 7.3 “Persisted Data”.

    ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Validate the changes you have made to the configuration files using the playbook below before proceeding further:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*

    If any errors occur, correct them in your configuration files and repeat steps 3-5 until no more errors occur before going to the next step.

    For more details on how to interpret and resolve errors, see Section 15.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  8. Remove the node from Cobbler:

    sudo cobbler system remove --name=swobj4
  9. Run the Cobbler deploy playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  10. The final step will depend on what type of Swift node you are removing.

    If the node was a SWPAC node, run the ardana-deploy.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml

    If the node was a SWOBJ node, run the swift-deploy.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  11. Wait until replication has finished. For more details, see Section 8.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  12. You may need to continue to rebalance the rings. For instructions, see Final Rebalance Phase at Section 8.5.5, “Applying Input Model Changes to Existing Rings”.

13.1.5.1.4.4 Remove the Swift Node from Monitoring

Once you have removed the Swift node(s), the alarms against them will trigger so there are additional steps to take to resolve this issue.

You will want to SSH to each of the Monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Swift node(s) you removed. This will require sudo access.

Once you have removed the references on each of your Monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the Swift node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the Monasca CLI which should be installed on each of your Monasca API servers by default:

monasca alarm-list --metric-dimensions hostname=<swift node deleted>

You can then delete the alarm with this command:

monasca alarm-delete <alarm ID>
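
When several alarms match the removed node, the two commands above can be chained. The following is a sketch only: the function name alarm_ids is hypothetical, and it assumes that monasca alarm-list prints a pipe-delimited table whose first column is the alarm ID. Review the list before deleting anything.

```shell
# alarm_ids: hypothetical helper. Reads `monasca alarm-list` table output on
# stdin and prints the first column of every row where it looks like a UUID.
# Assumes a pipe-delimited table with the alarm ID in the first column.
alarm_ids() {
  awk -F'|' '
    NF > 2 {
      id = $2
      gsub(/[ \t]/, "", id)   # strip the cell padding
      # Keep only values shaped like a 36-character UUID.
      if (length(id) == 36 && id ~ /^[0-9a-f-]+$/) print id
    }
  '
}

# Usage:
#   monasca alarm-list --metric-dimensions hostname=<swift node deleted> \
#     | alarm_ids \
#     | while read -r id; do monasca alarm-delete "$id"; done
```

Header rows, border rows, and continuation rows with an empty first column are skipped because they do not contain a UUID-shaped value.
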
13.1.5.1.5 Replacing a Swift Node

Maintenance steps for replacing a failed Swift node in your environment.

This process is used when you want to replace a failed Swift node in your cloud.

Warning

If it applies to the server, do not skip the step that restores the Swift ring builder files (Step 4.c). If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but it may move most objects in your system to new locations and may make data unavailable until the replication process has completed.

13.1.5.1.5.1 How to replace a Swift node in your environment
  1. Log in to the Cloud Lifecycle Manager.

  2. Update your cloud configuration with the details of your replacement Swift node.

    1. Edit your servers.yml file to include the details of your replacement Swift node (the MAC address and, if they have changed, the IPMI user, password, and IP address).

      Note

      Do not change the server's IP address (that is, ip-addr).

      Path to file:

      ~/openstack/my_cloud/definition/data/servers.yml

      Example showing the fields to edit (mac-addr, ilo-ip, ilo-user, and ilo-password):

       - id: swobj5
         role: SWOBJ-ROLE
         server-group: rack2
         mac-addr: 8c:dc:d4:b5:cb:bd
         nic-mapping: HP-DL360-6PORT
         ip-addr: 10.243.131.10
         ilo-ip: 10.1.12.88
         ilo-user: iLOuser
         ilo-password: iLOpass
         ...
    2. Commit the changes to Git:

      cd ~/openstack
      git commit -a -m "replacing a Swift node"
    3. Run the configuration processor:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost config-processor-run.yml
    4. Update your deployment directory:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Update Cobbler and reimage your replacement Swift node:

    1. Obtain the name in Cobbler of the node you wish to remove. You will use this value to replace <node name> in future steps.

      sudo cobbler system list
    2. Remove the replaced Swift node from Cobbler:

      sudo cobbler system remove --name <node name>
    3. Re-run the cobbler-deploy.yml playbook to add the replaced node:

      cd ~/scratch/ansible/next/ardana/ansible
      ansible-playbook -i hosts/localhost cobbler-deploy.yml
    4. Reimage the node using this playbook:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
  4. Complete the deployment of your replacement Swift node.

    1. Obtain the hostname for your new Swift node. You will use this value to replace <hostname> in future steps.

      cat ~/openstack/my_cloud/info/server_info.yml
    2. Configure the operating system on your replacement Swift node:

      cd ~/scratch/ansible/next/ardana/ansible
      ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
    3. If this is the Swift ring builder server, restore the Swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 15.6.2.4, “Identifying the Swift Ring Building Server” and Section 15.6.2.7, “Recovering Swift Builder Files”.

    4. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.

      ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
13.1.5.1.6 Replacing Drives in a Swift Node

Maintenance steps for replacing drives in a Swift node.

This process is used when you want to remove a failed hard drive from a Swift node and replace it with a new one.

Two classes of drives in a Swift node may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. The replacement procedure that brings the node back to normal differs for each class.

13.1.5.1.6.1 To Replace the Operating System Disk Drive

After the operating system disk drive is replaced, the node must be reimaged.

  1. Log in to the Cloud Lifecycle Manager.

  2. Update your Cobbler profile:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  3. Reimage the node using this playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>

    In the example below, the swobj2 server is reimaged:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2
  4. Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
  5. If this is the first server running the swift-proxy service, restore the Swift Ring Builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 15.6.2.4, “Identifying the Swift Ring Building Server” and Section 15.6.2.7, “Recovering Swift Builder Files”.

  6. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.

    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass \
      --limit <hostname>

    For example:

    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
13.1.5.1.6.2 To Replace a Storage Disk Drive

After a storage drive is replaced, there is no need to reimage the server. Instead, run the swift-reconfigure.yml playbook.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the following commands:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>

    In the following example, the server used is swobj2:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt

13.1.6 Updating MariaDB with Galera

Updating MariaDB with Galera must be done manually. Updates are not installed automatically. In particular, this situation applies to upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.

Using the CLI, update MariaDB with the following procedure:

  1. Mark Galera as unmanaged:

    crm resource unmanage galera

    Or put the whole cluster into maintenance mode:

    crm configure property maintenance-mode=true
  2. Pick a node other than the one currently targeted by the load balancer and stop MariaDB on that node:

    crm_resource --wait --force-demote -r galera -V
  3. Perform updates:

    1. Uninstall the old versions of MariaDB and the Galera wsrep provider.

    2. Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.

    3. Change configuration options if necessary.

  4. Start MariaDB on the node.

    crm_resource --wait --force-promote -r galera -V
  5. Run mysql_upgrade with the --skip-write-binlog option.

  6. On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run mysql_upgrade.

  7. Mark Galera as managed:

    crm resource manage galera

    Or take the cluster out of maintenance mode.

13.2 Unplanned System Maintenance

Unplanned maintenance tasks for your cloud.

13.2.1 Whole Cloud Recovery Procedures

Unplanned maintenance procedures for your whole cloud.

13.2.1.1 Full Disaster Recovery

In this disaster scenario, you have lost everything in the cloud, including Swift.

13.2.1.1.1 Restore from a Swift backup:

Restoring from a Swift backup is not possible because Swift is gone.

13.2.1.1.2 Restore from an SSH backup:
  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the following file so it contains the same information as it had previously:

    ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
  3. On the Cloud Lifecycle Manager copy the following files:

    cp -r ~/hp-ci/openstack/* ~/openstack/my_cloud/definition/
  4. Run this playbook to restore the Cloud Lifecycle Manager helper:

    cd ~/openstack/ardana/ansible/
    ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
  5. Become root and change to the restore helper directory:

    sudo su
    cd /root/deployer_restore_helper/
  6. Execute the restore:

    ./deployer_restore_script.sh
  7. Run this playbook to deploy your cloud:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml -e '{ "freezer_backup_jobs_upload": false }'
  8. You can now perform the procedures to restore MySQL and Swift. Once everything is restored, re-enable the backups from the Cloud Lifecycle Manager:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml

13.2.1.2 Full Disaster Recovery Test

13.2.1.2.1 Prerequisites

HPE Helion OpenStack platform

An external server to store backups to via SSH

13.2.1.2.2 Goals

Here is a high-level view of how we expect to test the disaster recovery of the platform.

  1. Back up the control plane using Freezer to an SSH target

  2. Back up the Cassandra Database

  3. Re-install Controller 1 with the HPE Helion OpenStack ISO

  4. Use Freezer to recover deployment data (model …)

  5. Re-install HPE Helion OpenStack on Controller 1, 2, 3

  6. Recover the Cassandra Database

  7. Recover the backup of the MariaDB database

13.2.1.2.3 Description of the testing environment

The testing environment is very similar to the Entry Scale model.

It used five servers: three controllers and two compute nodes.

Each controller node has three disks. The first is reserved for the system, while the others are used for Swift.

Note

During this disaster recovery exercise, we saved the data on disks 2 and 3 of the Swift controllers.

This allows the Swift objects to be restored after the recovery.

If these disks were wiped as well, the Swift data would be lost, but the procedure would not change.

The only difference is that the Glance images would be lost and would have to be re-uploaded.

13.2.1.2.4 Disaster recovery test note

If it is not specified otherwise, all the commands should be executed on controller 1, which is also the deployer node.

13.2.1.2.5 Pre-Disaster testing

In order to validate the procedure after recovery, we need to create some workloads.

  1. Source the service credential file

    ardana > source ~/service.osrc
  2. Copy an image to the platform and create a Glance image with it. In this example, Cirros is used

    ardana > openstack image create --disk-format raw --container-format bare --public --file ~/cirros-0.3.5-x86_64-disk.img cirros
  3. Create a network

    ardana > openstack network create test_net
  4. Create a subnet

    ardana > neutron subnet-create 07c35d11-13f9-41d4-8289-fa92147b1d44 192.168.42.0/24 --name test_subnet
  5. Create some instances

    ardana > openstack server create server_1 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    ardana > openstack server create server_2 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    ardana > openstack server create server_3 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    ardana > openstack server create server_4 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    ardana > openstack server create server_5 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    ardana > openstack server list
  6. Create containers and objects

    ardana > swift upload container_1 ~/service.osrc
    var/lib/ardana/service.osrc
    
    ardana > swift upload container_1 ~/backup.osrc
    var/lib/ardana/backup.osrc
    
    ardana > swift list container_1
    var/lib/ardana/backup.osrc
    var/lib/ardana/service.osrc
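
The five server create commands in step 5 differ only in the server name, so they can be generated in a loop. The sketch below is an illustration, not part of the tested procedure: the function name create_test_servers and the DRY_RUN variable are hypothetical, and the image and network IDs must be the ones returned in your own environment.

```shell
# create_test_servers: hypothetical helper wrapping the command from step 5.
#   $1 = image ID, $2 = network ID, $3 = number of servers (defaults to 5)
# Set DRY_RUN=1 to print the commands instead of running them.
create_test_servers() {
  image_id=$1
  net_id=$2
  count=${3:-5}
  i=1
  while [ "$i" -le "$count" ]; do
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} openstack server create "server_$i" \
      --image "$image_id" --flavor m1.small --nic "net-id=$net_id"
    i=$((i + 1))
  done
}

# Usage (IDs are the ones shown in the steps above):
#   DRY_RUN=1 create_test_servers 411a0363-7f4b-4bbc-889c-b9614e2da52e \
#     07c35d11-13f9-41d4-8289-fa92147b1d44
```

Running it once with DRY_RUN=1 lets you check the generated commands before creating any instances.
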
13.2.1.2.6 Preparation of the backup server

13.2.1.2.6.1 Preparation to store Freezer backups

In this example, we want to store the backups on the server 192.168.69.132.

Freezer will connect with the user backupuser on port 22 and store the backups in the /mnt/backups/ directory.

  1. Connect to the backup server

  2. Create the user

    root # useradd backupuser --create-home --home-dir /mnt/backups/
  3. Switch to that user

    root # su backupuser
  4. Create the SSH keypair

    backupuser > ssh-keygen -t rsa
    > # Just leave the default for the first question and do not set any passphrase
    > Generating public/private rsa key pair.
    > Enter file in which to save the key (/mnt/backups//.ssh/id_rsa):
    > Created directory '/mnt/backups//.ssh'.
    > Enter passphrase (empty for no passphrase):
    > Enter same passphrase again:
    > Your identification has been saved in /mnt/backups//.ssh/id_rsa
    > Your public key has been saved in /mnt/backups//.ssh/id_rsa.pub
    > The key fingerprint is:
    > a9:08:ae:ee:3c:57:62:31:d2:52:77:a7:4e:37:d1:28 backupuser@padawan-ccp-c0-m1-mgmt
    > The key's randomart image is:
    > +---[RSA 2048]----+
    > |          o      |
    > |   . . E + .     |
    > |  o . . + .      |
    > | o +   o +       |
    > |  + o o S .      |
    > | . + o o         |
    > |  o + .          |
    > |.o .             |
    > |++o              |
    > +-----------------+
  5. Add the public key to the list of the keys authorized to connect to that user on this server

    backupuser > cat /mnt/backups/.ssh/id_rsa.pub >> /mnt/backups/.ssh/authorized_keys
  6. Print the private key. This is what we will use for the backup configuration (ssh_credentials.yml file)

    backupuser > cat /mnt/backups/.ssh/id_rsa
    
    > -----BEGIN RSA PRIVATE KEY-----
    > MIIEogIBAAKCAQEAvjwKu6f940IVGHpUj3ffl3eKXACgVr3L5s9UJnb15+zV3K5L
    > BZuor8MLvwtskSkgdXNrpPZhNCsWSkryJff5I335Jhr/e5o03Yy+RqIMrJAIa0X5
    > ...
    > ...
    > ...
    > iBKVKGPhOnn4ve3dDqy3q7fS5sivTqCrpaYtByJmPrcJNjb2K7VMLNvgLamK/AbL
    > qpSTZjicKZCCl+J2+8lrKAaDWqWtIjSUs29kCL78QmaPOgEvfsw=
    > -----END RSA PRIVATE KEY-----
13.2.1.2.6.2 Preparation to store Cassandra backups

In this example, we want to store the backups on the server 192.168.69.132. We will store the backups in the /mnt/backups/cassandra_backups/ directory.

  1. Create a directory on the backup server to store cassandra backups

    backupuser > mkdir /mnt/backups/cassandra_backups
  2. Copy the private SSH key from the backup server to all controller nodes

    backupuser > scp /mnt/backups/.ssh/id_rsa ardana@CONTROLLER:~/.ssh/id_rsa_backup
             Password:
             id_rsa     100% 1675     1.6KB/s   00:00

    Replace CONTROLLER with the name of each controller node in turn (for example, doc-cp1-c1-m1-mgmt and doc-cp1-c1-m2-mgmt).

  3. Log in to each controller node and copy the private SSH key to the root user's .ssh directory

    tux > sudo cp /var/lib/ardana/.ssh/id_rsa_backup /root/.ssh/
  4. Verify that you can SSH to the backup server as the backup user using the private key

    root # ssh -i ~/.ssh/id_rsa_backup backupuser@doc-cp1-comp0001-mgmt
13.2.1.2.7 Perform Backups for disaster recovery test

13.2.1.2.7.1 Execute backup of Cassandra

Create the cassandra-backup-extserver.sh script on all controller nodes where Cassandra runs. To list those nodes, run the first two commands below on the deployer; then create the script as root on each node.

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible FND-CDB --list-hosts
root # cat > ~/cassandra-backup-extserver.sh << EOF
#!/bin/sh

# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/

# Setup variables
DATA_DIR=/var/cassandra/data/data
NODETOOL=/usr/bin/nodetool

# e.g. cassandra-snp-2018-06-26-1003
SNAPSHOT_NAME=cassandra-snp-\$(date +%F-%H%M)
HOST_NAME=\$(/bin/hostname)_

# Take a snapshot of cassandra database
\$NODETOOL snapshot -t \$SNAPSHOT_NAME monasca

# Collect a list of directories that make up the snapshot
SNAPSHOT_DIR_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)
for d in \$SNAPSHOT_DIR_LIST
  do
    # copy snapshot directories to external server
    rsync -avR -e "ssh -i /root/.ssh/id_rsa_backup" \$d \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME
  done

\$NODETOOL clearsnapshot monasca
EOF
root # chmod +x ~/cassandra-backup-extserver.sh
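
The listing above escapes each $ as \$ because the here-document delimiter (EOF) is unquoted, so the shell would otherwise expand variables and command substitutions while writing the file. Quoting the delimiter is an alternative that leaves the body untouched. The following minimal sketch demonstrates that behavior with two lines from the script; the function name write_snapshot_vars is hypothetical, and it is not a replacement for the script itself.

```shell
# write_snapshot_vars: hypothetical demo function. Writes two lines of the
# backup script to the file named by $1, using a quoted delimiter ('EOF').
# With a quoted delimiter the shell performs no expansion inside the body,
# so $(...) is written literally and no backslashes are needed.
write_snapshot_vars() {
  cat > "$1" << 'EOF'
SNAPSHOT_NAME=cassandra-snp-$(date +%F-%H%M)
HOST_NAME=$(/bin/hostname)_
EOF
}

# Usage:
#   write_snapshot_vars /tmp/snapshot-vars.sh
#   cat /tmp/snapshot-vars.sh    # shows the unexpanded $(...) text
```

Either form produces the same file; the quoted delimiter only changes how the body has to be typed.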

Execute the following steps on all the controller nodes.

Note

~/cassandra-backup-extserver.sh must be executed on all three controller nodes at the same time (within seconds of each other) for a successful backup.

  1. Edit the ~/cassandra-backup-extserver.sh script

    Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.

    BACKUP_USER=backupuser
    BACKUP_SERVER=192.168.69.132
    BACKUP_DIR=/mnt/backups/cassandra_backups/
  2. Execute ~/cassandra-backup-extserver.sh on all controller nodes that are also Cassandra nodes

    root # ~/cassandra-backup-extserver.sh
    
    Requested creating snapshot(s) for [monasca] with snapshot name [cassandra-snp-2018-06-28-0251] and options {skipFlush=false}
    Snapshot directory: cassandra-snp-2018-06-28-0251
    sending incremental file list
    created directory /mnt/backups/cassandra_backups//doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
    /var/
    /var/cassandra/
    /var/cassandra/data/
    /var/cassandra/data/data/
    /var/cassandra/data/data/monasca/
    
    ...
    ...
    ...
    
    /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-Summary.db
    /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-72-big-TOC.txt
    /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/schema.cql
    sent 173,691 bytes  received 531 bytes  116,148.00 bytes/sec
    total size is 171,378  speedup is 0.98
    Requested clearing snapshot(s) for [monasca]
  3. Verify the Cassandra backup directory on the backup server

    backupuser > ls -alt /mnt/backups/cassandra_backups
    total 16
    drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
    drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
    drwxr-xr-x 3 backupuser users 4096 Jun 28 02:51 doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
    drwxr-xr-x 8 backupuser users 4096 Jun 27 20:56 ..
    
    backupuser > du -shx /mnt/backups/cassandra_backups/*
    6.2G    /mnt/backups/cassandra_backups/doc-cp1-c1-m1-mgmt_cassandra-snp-2018-06-28-0251
    6.3G    /mnt/backups/cassandra_backups/doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
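
Because the backup only succeeds when every Cassandra node has sent its snapshot, the manual listing in step 3 can be complemented by a check that reports nodes without a backup directory. This is a minimal sketch that assumes the HOST_cassandra-snp-... directory naming produced by the script above; the function name missing_backups is hypothetical.

```shell
#!/bin/sh
# Sketch only: report controllers that have no Cassandra backup directory.
# missing_backups is an illustrative helper name, not a product command.

# missing_backups DIR HOST... -> prints each HOST without a
# HOST_cassandra-snp-* directory under DIR; returns non-zero if any is missing.
missing_backups() {
    dir=$1
    shift
    rc=0
    for host in "$@"; do
        found=no
        for d in "$dir/${host}_cassandra-snp-"*; do
            # an unmatched glob stays literal and fails the -d test
            if [ -d "$d" ]; then found=yes; fi
        done
        if [ "$found" = no ]; then
            echo "$host"
            rc=1
        fi
    done
    return "$rc"
}

# Example (on the backup server):
#   missing_backups /mnt/backups/cassandra_backups \
#       doc-cp1-c1-m1-mgmt doc-cp1-c1-m2-mgmt doc-cp1-c1-m3-mgmt
```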
13.2.1.2.7.2 Execute backup of HPE Helion OpenStack

  1. Edit the configuration file for SSH backups. Format the private key exactly as shown: a pipe character (|) on the first line, then the key body indented by two spaces. The private key is the one created on the backup server earlier.

    ardana > vi ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
    
    ardana > cat ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
    freezer_ssh_host: 192.168.69.132
    freezer_ssh_port: 22
    freezer_ssh_username: backupuser
    freezer_ssh_base_dir: /mnt/backups
    freezer_ssh_private_key: |
      -----BEGIN RSA PRIVATE KEY-----
      MIIEowIBAAKCAQEAyzhZ+F+sXQp70N8zCDDb6ORKAxreT/qD4zAetjOTuBoFlGb8
      pRBY79t9vNp7qvrKaXHBfb1OkKzhqyUwEqNcC9bdngABbb8KkCq+OkfDSAZRrmja
      wa5PzgtSaZcSJm9jQcF04Fq19mZY2BLK3OJL4qISp1DmN3ZthgJcpksYid2G3YG+
      bY/EogrQrdgHfcyLaoEkiBWQSBTEENKTKFBB2jFQYdmif3KaeJySv9cJqihmyotB
      s5YTdvB5Zn/fFCKG66THhKnIm19NftbJcKc+Y3Z/ZX4W9SpMSj5dL2YW0Y176mLy
      gMLyZK9u5k+fVjYLqY7XlVAFalv9+HZsvQ3OQQIDAQABAoIBACfUkqXAsrrFrEDj
      DlCDqwZ5gBwdrwcD9ceYjdxuPXyu9PsCOHBtxNC2N23FcMmxP+zs09y+NuDaUZzG
      vCZbCFZ1tZgbLiyBbiOVjRVFLXw3aNkDSiT98jxTMcLqTi9kU5L2xN6YSOPTaYRo
      IoSqge8YjwlmLMkgGBVU7y3UuCmE/Rylclb1EI9mMPElTF+87tYK9IyA2QbIJm/w
      4aZugSZa3PwUvKGG/TCJVD+JfrZ1kCz6MFnNS1jYT/cQ6nzLsQx7UuYLgpvTMDK6
      Fjq63TmVg9Z1urTB4dqhxzpDbTNfJrV55MuA/z9/qFHs649tFB1/hCsG3EqWcDnP
      mcv79nECgYEA9WdOsDnnCI1bamKA0XZxovb2rpYZyRakv3GujjqDrYTI97zoG+Gh
      gLcD1EMLnLLQWAkDTITIf8eurkVLKzhb1xlN0Z4xCLs7ukgMetlVWfNrcYEkzGa8
      wec7n1LfHcH5BNjjancRH0Q1Xcc2K7UgGe2iw/Iw67wlJ8i5j2Wq3sUCgYEA0/6/
      irdJzFB/9aTC8SFWbqj1DdyrpjJPm4yZeXkRAdn2GeLU2jefqPtxYwMCB1goeORc
      gQLspQpxeDvLdiQod1Y1aTAGYOcZOyAatIlOqiI40y3Mmj8YU/KnL7NMkaYBCrJh
      aW//xo+l20dz52pONzLFjw1tW9vhCsG1QlrCaU0CgYB03qUn4ft4JDHUAWNN3fWS
      YcDrNkrDbIg7MD2sOIu7WFCJQyrbFGJgtUgaj295SeNU+b3bdCU0TXmQPynkRGvg
      jYl0+bxqZxizx1pCKzytoPKbVKCcw5TDV4caglIFjvoz58KuUlQSKt6rcZMHz7Oh
      BX4NiUrpCWo8fyh39Tgh7QKBgEUajm92Tc0XFI8LNSyK9HTACJmLLDzRu5d13nV1
      XHDhDtLjWQUFCrt3sz9WNKwWNaMqtWisfl1SKSjLPQh2wuYbqO9v4zRlQJlAXtQo
      yga1fxZ/oGlLVe/PcmYfKT91AHPvL8fB5XthSexPv11ZDsP5feKiutots47hE+fc
      U/ElAoGBAItNX4jpUfnaOj0mR0L+2R2XNmC5b4PrMhH/+XRRdSr1t76+RJ23MDwf
      SV3u3/30eS7Ch2OV9o9lr0sjMKRgBsLZcaSmKp9K0j/sotwBl0+C4nauZMUKDXqg
      uGCyWeTQdAOD9QblzGoWy6g3ZI+XZWQIMt0pH38d/ZRbuSUk5o5v
      -----END RSA PRIVATE KEY-----
  2. Commit the modifications to the Git repository, then run the configuration processor and update the deployment directory

    ardana > cd ~/openstack/
    ardana > git add -A
    ardana > git commit -a -m "SSH backup configuration"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Create the Freezer jobs

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
  4. Wait until all the SSH backup jobs have finished running

    Freezer backup jobs run at the interval given in each job specification, so you must wait for the scheduled interval to elapse before a backup job runs.

    To find the interval:

    ardana > freezer job-list | grep SSH
    
    | 34c1364692f64a328c38d54b95753844 | Ardana Default: deployer backup to SSH      |         7 | success | scheduled |       |            |
    | 944154642f624bb7b9ff12c573a70577 | Ardana Default: swift backup to SSH         |         1 | success | scheduled |       |            |
    | 22c6bab7ac4d43debcd4f5a9c4c4bb19 | Ardana Default: mysql backup to SSH         |         1 | success | scheduled |       |            |
    
    ardana > freezer job-show 944154642f624bb7b9ff12c573a70577
    +-------------+---------------------------------------------------------------------------------+
    | Field       | Value                                                                           |
    +-------------+---------------------------------------------------------------------------------+
    | Job ID      | 944154642f624bb7b9ff12c573a70577                                                |
    | Client ID   | ardana-qe201-cp1-c1-m1-mgmt                                                     |
    | User ID     | 33a6a77adc4b4799a79a4c3bd40f680d                                                |
    | Session ID  |                                                                                 |
    | Description | Ardana Default: swift backup to SSH                                             |
    | Actions     | [{u'action_id': u'e8373b03ca4b41fdafd83f9ba7734bfa',                            |
    |             |   u'freezer_action': {u'action': u'backup',                                     |
    |             |                       u'backup_name': u'freezer_swift_builder_dir_backup',      |
    |             |                       u'container': u'/mnt/backups/freezer_rings_backups',      |
    |             |                       u'log_config_append': u'/etc/freezer/agent-logging.conf', |
    |             |                       u'max_level': 14,                                         |
    |             |                       u'path_to_backup': u'/etc/swiftlm/',                      |
    |             |                       u'remove_older_than': 90,                                 |
    |             |                       u'snapshot': True,                                        |
    |             |                       u'ssh_host': u'192.168.69.132',                           |
    |             |                       u'ssh_key': u'/etc/freezer/ssh_key',                      |
    |             |                       u'ssh_port': u'22',                                       |
    |             |                       u'ssh_username': u'backupuser',                           |
    |             |                       u'storage': u'ssh'},                                      |
    |             |   u'max_retries': 5,                                                            |
    |             |   u'max_retries_interval': 60,                                                  |
    |             |   u'user_id': u'33a6a77adc4b4799a79a4c3bd40f680d'}]                             |
    | Start Date  |                                                                                 |
    | End Date    |                                                                                 |
    | Interval    | 24 hours                                                                        |
    +-------------+---------------------------------------------------------------------------------+

    The Swift SSH backup job has an interval of 24 hours, so the next backup will run after 24 hours.

    In the default installation, the intervals for the various backup jobs are:

    Table 13.1: Default Interval for Freezer backup jobs

    Job Name                                 Interval
    Ardana Default: deployer backup to SSH   48 hours
    Ardana Default: mysql backup to SSH      12 hours
    Ardana Default: swift backup to SSH      24 hours

    You may have to wait up to 48 hours for all the backup jobs to run.

  5. On the backup server, you can verify that the backup files are present

    backupuser > ls -lah  /mnt/backups/
    total 16
    drwxr-xr-x 2 backupuser users 4096 Jun 27  2017 bin
    drwxr-xr-x 2 backupuser users 4096 Jun 29 14:04 freezer_database_backups
    drwxr-xr-x 2 backupuser users 4096 Jun 29 14:05 freezer_lifecycle_manager_backups
    drwxr-xr-x 2 backupuser users 4096 Jun 29 14:05 freezer_rings_backups
    backupuser > du -shx *
    4.0K    bin
    509M    freezer_audit_logs_backups
    2.8G    freezer_database_backups
    24G     freezer_lifecycle_manager_backups
    160K    freezer_rings_backups
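
The key layout required in step 1 (a pipe on the first line, then the key indented by two spaces) can be produced mechanically instead of by hand. The following is a minimal sketch, not part of the product; the function name key_to_yaml is hypothetical.

```shell
#!/bin/sh
# Sketch only: print a private key as the YAML entry expected in
# ssh_credentials.yml. key_to_yaml is an illustrative helper name.

# key_to_yaml KEYFILE -> prints "freezer_ssh_private_key: |" followed by
# every line of KEYFILE indented by two spaces (a YAML literal block).
key_to_yaml() {
    printf 'freezer_ssh_private_key: |\n'
    sed 's/^/  /' "$1"
}

# Example (on the backup server, key path as used earlier in this procedure):
#   key_to_yaml /mnt/backups/.ssh/id_rsa
```

The output can be pasted over the freezer_ssh_private_key entry in ~/openstack/my_cloud/config/freezer/ssh_credentials.yml.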
13.2.1.2.8 Restore of the first controller

  1. Edit the SSH backup configuration (re-enter the same information as earlier)

    ardana > vi ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
  2. Execute the restore helper. When prompted, enter the hostname the first controller had. In this example: doc-cp1-c1-m1-mgmt

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
  3. Execute the restore. When prompted, leave the first value empty (none) and validate the restore by typing 'yes'.

    ardana > sudo su
    root # cd /root/deployer_restore_helper/
    root # ./deployer_restore_script.sh
  4. Create a restore file for Swift rings

    ardana > nano swift_rings_restore.ini
    ardana > cat swift_rings_restore.ini

    Example contents:

    [default]
    action = restore
    storage = ssh
    # backup server ip
    ssh_host = 192.168.69.132
    # username to connect to the backup server
    ssh_username = backupuser
    ssh_key = /etc/freezer/ssh_key
    # directory on the backup server that holds the Swift rings backups
    container = /mnt/backups/freezer_rings_backups
    backup_name = freezer_swift_builder_dir_backup
    restore_abs_path = /etc/swiftlm
    log_file = /var/log/freezer-agent/freezer-agent.log
    # hostname that the first controller had
    hostname = doc-cp1-c1-m1-mgmt
    overwrite = True
  5. Execute the restore of the swift rings

    ardana > freezer-agent --config ./swift_rings_restore.ini
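
The restore can only find the backup if the container value in the restore file matches the directory the backup job wrote to (the Swift backup job shown earlier uses /mnt/backups/freezer_rings_backups). As a hedged sketch, the value can be read back from the file before running freezer-agent; the function name ini_get is hypothetical and the parser handles only simple "key = value" lines.

```shell
#!/bin/sh
# Sketch only: read one value from a simple "key = value" ini file.
# ini_get is an illustrative helper name, not a Freezer command.

# ini_get FILE KEY -> prints the value of the first "KEY = value" line,
# ignoring comment lines and surrounding whitespace.
ini_get() {
    awk -F'=' -v key="$2" '
        /^[ \t]*#/ { next }                       # skip comment lines
        {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim the key
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)    # trim the value
                print v
                exit
            }
        }' "$1"
}

# Example:
#   ini_get swift_rings_restore.ini container
#   ini_get swift_rings_restore.ini hostname
```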
13.2.1.2.9 Re-deployment of controllers 1, 2 and 3

  1. Change back to the default ardana user

  2. Deactivate the freezer backup jobs (otherwise empty backups would be added on top of the current good backups)

    ardana > nano ~/openstack/my_cloud/config/freezer/activate_jobs.yml
    ardana > cat ~/openstack/my_cloud/config/freezer/activate_jobs.yml
    
    # If set to false, We wont create backups jobs.
    freezer_create_backup_jobs: false
    
    # If set to false, We wont create restore jobs.
    freezer_create_restore_jobs: true
  3. Save the modification in the GIT repository

    ardana > cd ~/openstack/
    ardana > git add -A
    ardana > git commit -a -m "De-Activate SSH backup jobs during re-deployment"
    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run the cobbler-deploy.yml playbook

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  5. Run the bm-reimage.yml playbook limited to the second and third controller

    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3

    The names controller2 and controller3 can vary. Use the bm-power-status.yml playbook to check the cobbler names of these nodes.

  6. Run the site.yml playbook limited to the three controllers and localhost. In this example, this means: doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, doc-cp1-c1-m3-mgmt and localhost

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
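
The comma-separated --limit value can be assembled from a host listing instead of being typed by hand. The sketch below assumes the usual output layout of ansible --list-hosts (a "hosts (N):" header followed by one indented host name per line) and assumes that the group you list contains exactly your three controllers; the function name hosts_to_limit is hypothetical.

```shell
#!/bin/sh
# Sketch only: turn `ansible GROUP --list-hosts` output into a --limit value.
# hosts_to_limit is an illustrative helper name; the input layout is assumed.

# hosts_to_limit -> reads the listing on stdin and prints
# "host1,host2,...,localhost".
hosts_to_limit() {
    awk '
        /hosts \([0-9]+\):/ { next }         # skip the "hosts (N):" header
        NF { list = list $1 "," }            # collect each host name
        END { print list "localhost" }'
}

# Example (on the deployer; FND-MDB is the MariaDB host group used in this chapter):
#   cd ~/scratch/ansible/next/ardana/ansible
#   LIMIT=$(ansible FND-MDB --list-hosts | hosts_to_limit)
#   ansible-playbook -i hosts/verb_hosts site.yml --limit "$LIMIT"
```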
13.2.1.2.10 Cassandra database restore

Create the cassandra-restore-extserver.sh script on all controller nodes:

root # cat > ~/cassandra-restore-extserver.sh << EOF
#!/bin/sh

# backup user
BACKUP_USER=backupuser
# backup server
BACKUP_SERVER=192.168.69.132
# backup directory
BACKUP_DIR=/mnt/backups/cassandra_backups/

# Setup variables
DATA_DIR=/var/cassandra
NODETOOL=/usr/bin/nodetool

HOST_NAME=\$(/bin/hostname)_

#Get snapshot name from command line.
if [ -z "\$*"  ]
then
  echo "usage \$0 <snapshot to restore>"
  exit 1
fi
SNAPSHOT_NAME=\$1

# restore
rsync -av -e "ssh -i /root/.ssh/id_rsa_backup" \$BACKUP_USER@\$BACKUP_SERVER:\$BACKUP_DIR/\$HOST_NAME\$SNAPSHOT_NAME/ /

# set ownership of newly restored files
chown -R cassandra:cassandra \$DATA_DIR

# Get a list of snapshot directories that have files to be restored.
RESTORE_LIST=\$(find \$DATA_DIR -type d -name \$SNAPSHOT_NAME)

# use RESTORE_LIST to move snapshot files back into place of database.
for d in \$RESTORE_LIST
do
  cd \$d
  mv * ../..
  KEYSPACE=\$(pwd | rev | cut -d '/' -f4 | rev)
  TABLE_NAME=\$(pwd | rev | cut -d '/' -f3 |rev | cut -d '-' -f1)
  \$NODETOOL refresh \$KEYSPACE \$TABLE_NAME
done
cd
# Cleanup snapshot directories
\$NODETOOL clearsnapshot \$KEYSPACE
EOF
root # chmod +x ~/cassandra-restore-extserver.sh
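
The loop in the script derives the keyspace and table name from the position of each snapshot directory in the path: the keyspace is the fourth path component from the end, and the table is the third from the end with its ID suffix removed. The following sketch shows the same derivation with awk instead of the rev and cut pipeline; the function name snapshot_path_parts is hypothetical.

```shell
#!/bin/sh
# Sketch only: show how a snapshot path maps to the arguments of
# `nodetool refresh`. snapshot_path_parts is an illustrative helper name.

# snapshot_path_parts PATH -> prints "KEYSPACE TABLE" for a path of the form
# .../KEYSPACE/TABLE-<table id>/snapshots/SNAPSHOT_NAME
snapshot_path_parts() {
    printf '%s\n' "$1" | awk -F'/' '{
        keyspace = $(NF-3)        # 4th component from the end
        table = $(NF-2)           # 3rd from the end: TABLE-<table id>
        sub(/-.*/, "", table)     # keep the part before the first "-"
        print keyspace, table
    }'
}

# Example:
#   snapshot_path_parts /var/cassandra/data/data/monasca/measurements-e29033d0488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306
```

For the example path, the keyspace is monasca and the table is measurements, which are the two values the script passes to nodetool refresh.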

Execute the following steps on all the controller nodes:

  1. Edit ~/cassandra-restore-extserver.sh script

    Set BACKUP_USER and BACKUP_SERVER to the desired backup user (for example, backupuser) and the desired backup server (for example, 192.168.69.132), respectively.

    BACKUP_USER=backupuser
    BACKUP_SERVER=192.168.69.132
    BACKUP_DIR=/mnt/backups/cassandra_backups/
  2. Execute ~/cassandra-restore-extserver.sh SNAPSHOT_NAME

    Find the SNAPSHOT_NAME by listing /mnt/backups/cassandra_backups on the backup server. The directory names have the format HOST_SNAPSHOT_NAME.

    backupuser > ls -alt /mnt/backups/cassandra_backups
    total 16
    drwxr-xr-x 4 backupuser users 4096 Jun 28 03:06 .
    drwxr-xr-x 3 backupuser users 4096 Jun 28 03:06 doc-cp1-c1-m2-mgmt_cassandra-snp-2018-06-28-0306
    root # ~/cassandra-restore-extserver.sh cassandra-snp-2018-06-28-0306
    
    receiving incremental file list
    ./
    var/
    var/cassandra/
    var/cassandra/data/
    var/cassandra/data/data/
    var/cassandra/data/data/monasca/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/manifest.json
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-CompressionInfo.db
    var/cassandra/data/data/monasca/alarm_state_history-e6bbdc20488d11e8bdabc32666406af1/snapshots/cassandra-snp-2018-06-28-0306/mc-37-big-Data.db
    ...
    ...
    ...
    /usr/bin/nodetool clearsnapshot monasca
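
Because the backup directories are named HOST_SNAPSHOT_NAME and the snapshot name can differ between nodes, it can be convenient to list the snapshot names for one host at a time. This is a minimal sketch under that naming assumption; the function name snapshots_for_host is hypothetical.

```shell
#!/bin/sh
# Sketch only: list the snapshot names stored for one controller.
# snapshots_for_host is an illustrative helper name, not a product command.

# snapshots_for_host DIR HOST -> prints the names of the directories under
# DIR that start with "HOST_", with that prefix removed.
snapshots_for_host() {
    prefix="${2}_"
    for d in "$1/$prefix"*; do
        if [ -d "$d" ]; then
            name=${d##*/}                       # strip the leading path
            printf '%s\n' "${name#"$prefix"}"   # strip the HOST_ prefix
        fi
    done
}

# Example (on the backup server):
#   snapshots_for_host /mnt/backups/cassandra_backups doc-cp1-c1-m2-mgmt
```

The printed name is the value to pass as SNAPSHOT_NAME to ~/cassandra-restore-extserver.sh on that controller.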
13.2.1.2.11 Databases restore

13.2.1.2.11.1 MariaDB database restore

  1. Source the backup credentials file

    ardana > source ~/backup.osrc
  2. List Freezer jobs

    Note the ID of the job that belongs to the first controller and has the description Ardana Default: mysql restore from SSH. For example:

    ardana > freezer job-list | grep "mysql restore from SSH"
    +----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
    | Job ID                           | Description                                 | # Actions | Result  | Status    | Event | Session ID |
    +----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
    | 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH      |         1 |         | stop      |       |            |
    
    ardana > freezer job-show 64715c6ce8ed40e1b346136083923260
    +-------------+---------------------------------------------------------------------------------+
    | Field       | Value                                                                           |
    +-------------+---------------------------------------------------------------------------------+
    | Job ID      | 64715c6ce8ed40e1b346136083923260                                                |
    | Client ID   | doc-cp1-c1-m1-mgmt                                                              |
    | User ID     | 33a6a77adc4b4799a79a4c3bd40f680d                                                |
    | Session ID  |                                                                                 |
    | Description | Ardana Default: mysql restore from SSH                                          |
    | Actions     | [{u'action_id': u'19dfb0b1851e41c682716ecc6990b25b',                            |
    |             |   u'freezer_action': {u'action': u'restore',                                    |
    |             |                       u'backup_name': u'freezer_mysql_backup',                  |
    |             |                       u'container': u'/mnt/backups/freezer_database_backups',   |
    |             |                       u'hostname': u'doc-cp1-c1-m1-mgmt',                       |
    |             |                       u'log_config_append': u'/etc/freezer/agent-logging.conf', |
    |             |                       u'restore_abs_path': u'/tmp/mysql_restore/',              |
    |             |                       u'ssh_host': u'192.168.69.132',                           |
    |             |                       u'ssh_key': u'/etc/freezer/ssh_key',                      |
    |             |                       u'ssh_port': u'22',                                       |
    |             |                       u'ssh_username': u'backupuser',                           |
    |             |                       u'storage': u'ssh'},                                      |
    |             |   u'max_retries': 5,                                                            |
    |             |   u'max_retries_interval': 60,                                                  |
    |             |   u'user_id': u'33a6a77adc4b4799a79a4c3bd40f680d'}]                             |
    | Start Date  |                                                                                 |
    | End Date    |                                                                                 |
    | Interval    |                                                                                 |
    +-------------+---------------------------------------------------------------------------------+
  3. Start the job using its id

    ardana > freezer job-start 64715c6ce8ed40e1b346136083923260
    Start request sent for job 64715c6ce8ed40e1b346136083923260
  4. Wait for the job result to be success

    ardana > freezer job-list | grep "mysql restore from SSH"
    +----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
    | Job ID                           | Description                                 | # Actions | Result  | Status    | Event | Session ID |
    +----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
    | 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH      |         1 |         | running   |       |            |
    ardana > freezer job-list | grep "mysql restore from SSH"
    +----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
    | Job ID                           | Description                                 | # Actions | Result  | Status    | Event | Session ID |
    +----------------------------------+---------------------------------------------+-----------+---------+-----------+-------+------------+
    | 64715c6ce8ed40e1b346136083923260 | Ardana Default: mysql restore from SSH      |         1 | success | completed |       |            |
  5. Verify that the files have been restored on the controller

    ardana > sudo du -shx /tmp/mysql_restore/*
    
    16K     /tmp/mysql_restore/aria_log.00000001
    4.0K    /tmp/mysql_restore/aria_log_control
    3.4M    /tmp/mysql_restore/barbican
    8.0K    /tmp/mysql_restore/ceilometer
    4.2M    /tmp/mysql_restore/cinder
    2.9M    /tmp/mysql_restore/designate
    129M    /tmp/mysql_restore/galera.cache
    2.1M    /tmp/mysql_restore/glance
    4.0K    /tmp/mysql_restore/grastate.dat
    4.0K    /tmp/mysql_restore/gvwstate.dat
    2.6M    /tmp/mysql_restore/heat
    752K    /tmp/mysql_restore/horizon
    4.0K    /tmp/mysql_restore/ib_buffer_pool
    76M     /tmp/mysql_restore/ibdata1
    128M    /tmp/mysql_restore/ib_logfile0
    128M    /tmp/mysql_restore/ib_logfile1
    12M     /tmp/mysql_restore/ibtmp1
    16K     /tmp/mysql_restore/innobackup.backup.log
    313M    /tmp/mysql_restore/keystone
    716K    /tmp/mysql_restore/magnum
    12M     /tmp/mysql_restore/mon
    8.3M    /tmp/mysql_restore/monasca_transform
    0       /tmp/mysql_restore/multi-master.info
    11M     /tmp/mysql_restore/mysql
    4.0K    /tmp/mysql_restore/mysql_upgrade_info
    14M     /tmp/mysql_restore/nova
    4.4M    /tmp/mysql_restore/nova_api
    14M     /tmp/mysql_restore/nova_cell0
    3.6M    /tmp/mysql_restore/octavia
    208K    /tmp/mysql_restore/opsconsole
    38M     /tmp/mysql_restore/ovs_neutron
    8.0K    /tmp/mysql_restore/performance_schema
    24K     /tmp/mysql_restore/tc.log
    4.0K    /tmp/mysql_restore/test
    8.0K    /tmp/mysql_restore/winchester
    4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  6. Repeat steps 2-5 on the other two controllers where the MariaDB/Galera database is running. To determine which nodes these are, run the following commands on the deployer:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible FND-MDB --list-hosts
  7. Stop HPE Helion OpenStack services on the three controllers (substitute the hostnames of your controllers in the command)

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  8. Clean the mysql directory and copy the restored backup on all three controllers where MariaDB/Galera database is running

    root # cd /var/lib/mysql/
    root # rm -rf ./*
    root # cp -pr /tmp/mysql_restore/* ./

    Switch back to the ardana user once the copy is finished
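
Waiting for the restore job in step 4 means re-running freezer job-list until the Result column reads success. As a hedged sketch, the relevant columns can be extracted from the table output shown above; the function name job_state is hypothetical and the column positions are taken from that sample output.

```shell
#!/bin/sh
# Sketch only: extract job ID, result, and status from `freezer job-list`
# table output. job_state is an illustrative helper name.

# job_state PATTERN -> reads the table on stdin and prints
# "JOB_ID RESULT STATUS" for rows whose description contains PATTERN.
# An empty Result column is printed as "-".
job_state() {
    awk -F'|' -v pat="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 7 && index($3, pat) {
            result = trim($5)
            if (result == "") result = "-"
            print trim($2), result, trim($6)
        }'
}

# Example (poll every 30 seconds; interrupt manually if the job fails):
#   until freezer job-list | job_state 'mysql restore from SSH' \
#           | grep -q ' success completed$'; do
#       sleep 30
#   done
```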

13.2.1.2.11.2 Restart HPE Helion OpenStack services

  1. Restart the MariaDB Database

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

    On the deployer node, execute the galera-bootstrap.yml playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.

    If this process fails to recover the database cluster, please refer to Section 13.2.2.1.2, “Recovering the MariaDB Database”. There Scenario 3 covers the process of manually starting the database.

  2. Restart HPE Helion OpenStack services limited to the three controllers (substitute the hostnames of your controllers in the command).

    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
     --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  3. Re-configure HPE Helion OpenStack

    ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
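
Step 1 mentions that the bootstrap playbook determines the log sequence number. The internals of the playbook are not shown here; purely as an illustration of what that number is, Galera records the last committed sequence number in the seqno field of grastate.dat in the database directory. The sketch below reads that field; the function name grastate_seqno is hypothetical.

```shell
#!/bin/sh
# Sketch only: read the sequence number from a Galera grastate.dat file.
# grastate_seqno is an illustrative helper name; this is not how the
# galera-bootstrap.yml playbook is known to work.

# grastate_seqno FILE -> prints the value of the "seqno:" line.
grastate_seqno() {
    awk -F':' '$1 == "seqno" { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}

# Example (as root on each controller):
#   grastate_seqno /var/lib/mysql/grastate.dat
```

A value of -1 means the file does not record a usable position, for example after an unclean shutdown. In that case, rely on the playbook or on the manual procedure referenced in step 1.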
13.2.1.2.11.3 Re-enable SSH backups

  1. Re-activate Freezer backup jobs

    ardana > vi ~/openstack/my_cloud/config/freezer/activate_jobs.yml
    ardana > cat ~/openstack/my_cloud/config/freezer/activate_jobs.yml
    
    # If set to false, We wont create backups jobs.
    freezer_create_backup_jobs: true
    
    # If set to false, We wont create restore jobs.
    freezer_create_restore_jobs: true
  2. Save the modifications in the GIT repository

    ardana > cd ~/openstack/ardana/ansible/
    ardana > git add -A
    ardana > git commit -a -m "Re-Activate SSH backup jobs"
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Create Freezer jobs

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
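
The freezer_create_backup_jobs switch is changed twice in this procedure: to false before re-deployment and back to true here. Instead of editing the file by hand, the change can be scripted. The following is a minimal sketch that assumes the setting sits on a single line, as in the listing above; the function name set_backup_jobs is hypothetical.

```shell
#!/bin/sh
# Sketch only: switch freezer_create_backup_jobs in activate_jobs.yml.
# set_backup_jobs is an illustrative helper name, not a product command.

# set_backup_jobs FILE true|false -> rewrites the freezer_create_backup_jobs
# line in FILE and leaves every other line untouched.
set_backup_jobs() {
    case $2 in
        true|false) ;;
        *) echo "usage: set_backup_jobs FILE true|false" >&2; return 1 ;;
    esac
    tmp=$(mktemp) || return 1
    if sed "s/^freezer_create_backup_jobs:.*/freezer_create_backup_jobs: $2/" "$1" > "$tmp"; then
        cat "$tmp" > "$1"      # overwrite in place, keeping owner and mode
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}

# Example:
#   set_backup_jobs ~/openstack/my_cloud/config/freezer/activate_jobs.yml true
```

The change still has to be committed and followed by the config-processor-run.yml, ready-deployment.yml, and _freezer_manage_jobs.yml playbooks as described in steps 2 and 3.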
13.2.1.2.12 Post restore testing

  1. Source the service credential file

    ardana > source ~/service.osrc
  2. Swift

    ardana > swift list
    container_1
    volumebackups
    
    ardana > swift list container_1
    var/lib/ardana/backup.osrc
    var/lib/ardana/service.osrc
    
    ardana > swift download container_1 /tmp/backup.osrc
  3. Neutron

    ardana > openstack network list
    +--------------------------------------+---------------------+--------------------------------------+
    | ID                                   | Name                | Subnets                              |
    +--------------------------------------+---------------------+--------------------------------------+
    | 07c35d11-13f9-41d4-8289-fa92147b1d44 | test-net            | 02d5ca3b-1133-4a74-a9ab-1f1dc2853ec8 |
    +--------------------------------------+---------------------+--------------------------------------+
  4. Glance

    ardana > openstack image list
    +--------------------------------------+----------------------+--------+
    | ID                                   | Name                 | Status |
    +--------------------------------------+----------------------+--------+
    | 411a0363-7f4b-4bbc-889c-b9614e2da52e | cirros-0.4.0-x86_64  | active |
    +--------------------------------------+----------------------+--------+
    ardana > openstack image save --file /tmp/cirros f751c39b-f1e3-4f02-8332-3886826889ba
    ardana > ls -lah /tmp/cirros
    -rw-r--r-- 1 ardana ardana 12716032 Jul  2 20:52 /tmp/cirros
  5. Nova

    ardana > openstack server list
    
    ardana > openstack server create server_6 --image 411a0363-7f4b-4bbc-889c-b9614e2da52e  --flavor m1.small --nic net-id=07c35d11-13f9-41d4-8289-fa92147b1d44
    +-------------------------------------+------------------------------------------------------------+
    | Field                               | Value                                                      |
    +-------------------------------------+------------------------------------------------------------+
    | OS-DCF:diskConfig                   | MANUAL                                                     |
    | OS-EXT-AZ:availability_zone         |                                                            |
    | OS-EXT-SRV-ATTR:host                | None                                                       |
    | OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                       |
    | OS-EXT-SRV-ATTR:instance_name       |                                                            |
    | OS-EXT-STS:power_state              | NOSTATE                                                    |
    | OS-EXT-STS:task_state               | scheduling                                                 |
    | OS-EXT-STS:vm_state                 | building                                                   |
    | OS-SRV-USG:launched_at              | None                                                       |
    | OS-SRV-USG:terminated_at            | None                                                       |
    | accessIPv4                          |                                                            |
    | accessIPv6                          |                                                            |
    | addresses                           |                                                            |
    | adminPass                           | iJBoBaj53oUd                                               |
    | config_drive                        |                                                            |
    | created                             | 2018-07-02T21:02:01Z                                       |
    | flavor                              | m1.small (2)                                               |
    | hostId                              |                                                            |
    | id                                  | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c                       |
    | image                               | cirros-0.4.0-x86_64 (f751c39b-f1e3-4f02-8332-3886826889ba) |
    | key_name                            | None                                                       |
    | name                                | server_6                                                   |
    | progress                            | 0                                                          |
    | project_id                          | cca416004124432592b2949a5c5d9949                           |
    | properties                          |                                                            |
    | security_groups                     | name='default'                                             |
    | status                              | BUILD                                                      |
    | updated                             | 2018-07-02T21:02:01Z                                       |
    | user_id                             | 8cb1168776d24390b44c3aaa0720b532                           |
    | volumes_attached                    |                                                            |
    +-------------------------------------+------------------------------------------------------------+
    
    ardana > openstack server list
    +--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+
    | ID                                   | Name     | Status | Networks                        | Image               | Flavor    |
    +--------------------------------------+----------+--------+---------------------------------+---------------------+-----------+
    | ce7689ff-23bf-4fe9-b2a9-922d4aa9412c | server_6 | ACTIVE | n1=1.1.1.8                      | cirros-0.4.0-x86_64 | m1.small  |
    
    ardana > openstack server delete ce7689ff-23bf-4fe9-b2a9-922d4aa9412c

13.2.2 Unplanned Control Plane Maintenance

Unplanned maintenance tasks for controller nodes such as recovery from power failure.

13.2.2.1 Restarting Controller Nodes After a Reboot

Follow these steps to recover a controller node that has been rebooted, needs hardware maintenance, or has lost network connectivity or power.

These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.

13.2.2.1.1 Prerequisites

The following conditions must be true in order to perform these steps successfully:

  • Each of your controller nodes should be powered on.

  • Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.

  • The operator who performs these steps will need access to the Cloud Lifecycle Manager.

13.2.2.1.2 Recovering the MariaDB Database

The recovery process for your MariaDB database cluster will depend on how many of your controller nodes need to be recovered. We will cover two scenarios:

Scenario 1: Recovering one or two of your controller nodes but not the entire cluster

To recover one or two of your controller nodes but not the entire cluster, follow these steps:

  1. Ensure the controller nodes have power and are booted to the command prompt.

  2. If the MariaDB service is not started, start it with this command:

    sudo service mysql start
  3. If MariaDB fails to start, proceed to the next section which covers the bootstrap process.

Scenario 2: Recovering the entire controller cluster with the bootstrap playbook

If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.

  1. Make sure no mysqld daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is a mysqld daemon running, then use the command below to shut down the daemon.

    sudo systemctl stop mysql

    If the mysqld daemon does not go down following the service stop, then kill the daemon using kill -9 before continuing.

  2. On the deployer node, execute the galera-bootstrap.yml playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
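
Step 1 of Scenario 2 requires that no mysqld daemon is left running on any node in the cluster. The following is a minimal sketch of how that check could be scripted from the deployer node. The function names are hypothetical and not part of the product, and the sketch assumes that the deployer can reach each node over SSH.

```shell
# Hypothetical helpers, not shipped with the product.

# Reads command names from stdin (one per line, as printed by
# `ps -e -o comm=`) and succeeds only if one of them is exactly "mysqld".
mysqld_running() {
    grep -qx 'mysqld'
}

# Checks each host given as an argument over SSH and reports the ones
# where mysqld is still running. Returns non-zero if any are found.
check_mysqld_stopped() {
    rc=0
    for host in "$@"; do
        if ssh -n "$host" ps -e -o comm= | mysqld_running; then
            echo "mysqld is still running on $host" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Call check_mysqld_stopped with the host names of your controller nodes and run galera-bootstrap.yml only when it succeeds. A node that cannot be reached over SSH is not reported by this sketch, so confirm connectivity first as described in the prerequisites.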
13.2.2.1.3 Restarting Services on the Controller Nodes

From the Cloud Lifecycle Manager you should execute the ardana-start.yml playbook for each node that was brought down so the services can be started back up.

If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>

If you have a shared Cloud Lifecycle Manager/controller setup and need to restart services on this shared node, you can use localhost to indicate the shared node, like this:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost
Note

If you leave off the --limit switch, the playbook will be run against all nodes.
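
When more than one node was brought down, the host names can be passed to --limit as a single comma-separated value, as in the shared-node example above. A minimal sketch of a helper that builds that value follows. The function name join_hosts is hypothetical, and the host names in the usage comment are placeholders.

```shell
# Hypothetical helper: joins its arguments with commas so the result can
# be passed to --limit as one value without embedded spaces.
join_hosts() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Usage (placeholder host names):
#   cd ~/scratch/ansible/next/ardana/ansible
#   ansible-playbook -i hosts/verb_hosts ardana-start.yml \
#       --limit="$(join_hosts NODE1 NODE2)"
```

Building the value this way avoids a stray space between host names, which the shell would otherwise treat as the start of a separate argument.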

13.2.2.1.4 Restart the Monitoring Agents

As part of the recovery process, you should also restart the monasca-agent:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the monasca-agent:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-agent-stop.yml
  3. Start the monasca-agent:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-agent-start.yml
  4. You can then confirm the status of the monasca-agent with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml

13.2.2.2 Recovering the Control Plane

If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, use the scenarios below to recover your cloud.

Note

You should have backed up /etc/group of the Cloud Lifecycle Manager manually after installation. While recovering a Cloud Lifecycle Manager node, manually copy the /etc/group file from a backup of the old Cloud Lifecycle Manager.

13.2.2.2.1 Point-in-Time MariaDB Database Recovery

In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.

13.2.2.2.1.1 Restore from a Swift backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Determine which node is the first host member in the FND-MDB group, which will be the first node hosting the MariaDB service in your cloud. You can do this by using these commands:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > grep -A1 FND-MDB--first-member hosts/verb_hosts

    The result will be similar to the following example:

    [FND-MDB--first-member:children]
    ardana002-cp1-c1-m1

    In this example, the host name of the node is ardana002-cp1-c1-m1.

  3. Find the host IP address which will be used to log in.

    ardana > cat /etc/hosts | grep ardana002-cp1-c1-m1
    10.84.43.82      ardana002-cp1-c1-m1-extapi ardana002-cp1-c1-m1-extapi
    192.168.24.21    ardana002-cp1-c1-m1-mgmt ardana002-cp1-c1-m1-mgmt
    10.1.2.1         ardana002-cp1-c1-m1-guest ardana002-cp1-c1-m1-guest
    10.84.65.3       ardana002-cp1-c1-m1-EXTERNAL-VM ardana002-cp1-c1-m1-external-vm

    In this example, 192.168.24.21 is the IP address for the host.

  4. SSH into the host.

    ardana > ssh ardana@192.168.24.21
  5. Source the backup environment file:

    ardana > source /var/lib/ardana/backup.osrc
  6. Find the Client ID for the host name determined at the beginning of this procedure (ardana002-cp1-c1-m1 in this example).

    ardana > freezer client-list
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | Client ID                   | uuid                             | hostname                    | description |
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
    | ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
    | ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
    | ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
    | ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
    | ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
    +-----------------------------+----------------------------------+-----------------------------+-------------+

    In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

  7. List the jobs:

    ardana > freezer job-list -C CLIENT_ID

    Using the example in the previous step:

    ardana > freezer job-list -C ardana002-cp1-c1-m1-mgmt
  8. Get the corresponding job id for Ardana Default: mysql restore from Swift.

  9. Launch the restore process with:

    ardana > freezer job-start JOB-ID
  10. This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

  11. Log in to the Cloud Lifecycle Manager.

  12. Stop the MariaDB service.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  13. Log back in to the first node running the MariaDB service, the same node as in Step 3.

  14. Clean the MariaDB directory using this command:

    tux > sudo rm -r /var/lib/mysql/*
  15. Copy the restored files back to the MariaDB directory:

    tux > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql
  16. Log in to each of the other nodes in your MariaDB cluster, which were determined in Step 3. Remove the grastate.dat file from each of them.

    tux > sudo rm /var/lib/mysql/grastate.dat
    Warning

    Do not remove this file from the first node in your MariaDB cluster. Ensure you only do this from the other cluster nodes.

  17. Log back in to the Cloud Lifecycle Manager.

  18. Start the MariaDB service.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
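
Steps 2 and 3 of this procedure look up the first MariaDB host and its management address by hand. A minimal sketch of the same lookup as two shell functions follows. The function names are hypothetical, and the "-mgmt" suffix is an assumption taken from the example /etc/hosts output above.

```shell
# Hypothetical helpers, not shipped with the product.

# Prints the host listed directly under the
# [FND-MDB--first-member:children] header of the inventory file $1.
first_mdb_member() {
    awk '/^\[FND-MDB--first-member:children\]/ { getline; print $1; exit }' "$1"
}

# Prints the address that the hosts file $2 (default /etc/hosts) maps to
# "<host>-mgmt". The "-mgmt" suffix is assumed from the example output.
mgmt_ip() {
    awk -v name="$1-mgmt" '$2 == name { print $1; exit }' "${2:-/etc/hosts}"
}

# Usage, from ~/scratch/ansible/next/ardana/ansible:
#   host=$(first_mdb_member hosts/verb_hosts)
#   ssh "ardana@$(mgmt_ip "$host")"
```

If either function prints nothing, fall back to the manual grep commands shown in the procedure, because your inventory or hosts file may be laid out differently.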
13.2.2.2.1.2 Restore from an SSH backup

Follow the same procedure as the one for Swift but select the job Ardana Default: mysql restore from SSH.

13.2.2.2.1.3 Restore MariaDB manually

If restoring MariaDB fails during the procedure outlined above, you can follow this procedure to manually restore MariaDB:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the MariaDB cluster:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  3. On all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:

    tux > sudo rm -r /var/lib/mysql/*
  4. On the first node running the MariaDB service restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.

    tux > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql
  5. If you need to restore the files manually from SSH, follow these steps:

    1. Create the /root/mysql_restore.ini file with the contents below. Be careful to substitute the {{ values }}. Note that the SSH information refers to the SSH server you configured for backup before installing.

      [default]
      action = restore
      storage = ssh
      ssh_host = {{ freezer_ssh_host }}
      ssh_username = {{ freezer_ssh_username }}
      container = {{ freezer_ssh_base_dir }}/freezer_mysql_backup
      ssh_key = /etc/freezer/ssh_key
      backup_name = freezer_mysql_backup
      restore_abs_path = /var/lib/mysql/
      log_file = /var/log/freezer-agent/freezer-agent.log
      hostname = {{ hostname of the first MariaDB node }}
    2. Execute the restore job:

      ardana > freezer-agent --config /root/mysql_restore.ini
  6. Log back in to the Cloud Lifecycle Manager.

  7. Start the MariaDB service.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  8. After approximately 10-15 minutes, the output of the percona-status.yml playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

    An example output is as follows:

    TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] *************
      ok: [ardana-cp1-c1-m1-mgmt] => {
      "msg": "mysql is synced."
      }
      ok: [ardana-cp1-c1-m2-mgmt] => {
      "msg": "mysql is synced."
      }
      ok: [ardana-cp1-c1-m3-mgmt] => {
      "msg": "mysql is synced."
      }
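
To avoid reading the status output by eye, the check in Step 8 can be scripted. The sketch below counts the "mysql is synced." messages and compares the count with the number of MariaDB nodes you expect. The function names are hypothetical, and the message text is assumed to match the example output above exactly.

```shell
# Hypothetical helpers, not shipped with the product.

# Reads playbook output on stdin and prints how many nodes reported the
# message shown in the example output.
synced_count() {
    grep -cF '"msg": "mysql is synced."'
}

# Succeeds only if the number of synced nodes on stdin equals $1.
all_synced() {
    [ "$(synced_count)" -eq "$1" ]
}

# Usage for a three-node cluster:
#   ansible-playbook -i hosts/verb_hosts percona-status.yml | all_synced 3
```

If all_synced fails shortly after the bootstrap, wait and run the status playbook again, since the nodes may still be synchronizing.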
13.2.2.2.1.4 Point-in-Time Cassandra Recovery

A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.

The following steps should be taken before enabling and deploying the replacement node.

  1. Determine the IP address of the node that was removed or is being replaced.

  2. On one of the functional Cassandra control plane nodes, log in as the ardana user.

  3. Run the command nodetool status to display a list of Cassandra nodes.

  4. If the removed node is not in the list (no IP address matches that of the removed node), skip the remaining steps.

  5. If the node that was removed is still in the list, copy its node ID.

  6. Run the command nodetool removenode ID.

After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 13.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.

For more information, please consult the Cassandra documentation.
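
The lookup in steps 3 to 5 can be sketched as a small filter over the nodetool status output. The function name is hypothetical. The sketch assumes that each node line carries the address in its second column and a UUID-style host ID somewhere after it, so verify this against the output on your own system.

```shell
# Hypothetical helper, not shipped with the product.
# Reads `nodetool status` output on stdin and prints the host ID of the
# node whose address equals $1. Prints nothing if no such node is listed.
cassandra_host_id() {
    awk -v addr="$1" '
        $2 == addr {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/) {
                    print $i
                    exit
                }
        }'
}

# Usage (REMOVED_NODE_IP is a placeholder):
#   id=$(nodetool status | cassandra_host_id REMOVED_NODE_IP)
#   if [ -n "$id" ]; then nodetool removenode "$id"; fi
```

An empty result corresponds to the case in Step 4 where the removed node is no longer in the list and nothing needs to be removed.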

13.2.2.2.2 Point-in-Time Swift Rings Recovery

In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your Swift rings to a previous state.

Note

Freezer backs up and restores Swift rings only, not Swift data.

13.2.2.2.2.1 Restore from a Swift backup
  1. Log in to the first Swift Proxy (SWF-PRX[0]) node.

    To find the first Swift Proxy node:

    1. On the Cloud Lifecycle Manager, run:

      ardana > cd  ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
      --limit SWF-PRX[0]

      At the end of the output, you will see something like the following example:

      ...
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'
      
      PLAY RECAP ********************************************************************
      ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
    2. Find the first node name and its IP address. For example:

      ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
  2. Source the backup environment file:

    ardana > source /var/lib/ardana/backup.osrc
  3. Find the Client ID:

    ardana > freezer client-list
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | Client ID                   | uuid                             | hostname                    | description |
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
    | ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
    | ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
    | ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
    | ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
    | ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
    +-----------------------------+----------------------------------+-----------------------------+-------------+

    In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

  4. List the jobs:

    ardana > freezer job-list -C CLIENT_ID

    Using the example in the previous step:

    ardana > freezer job-list -C ardana002-cp1-c1-m1-mgmt
  5. Get the corresponding job id for Ardana Default: swift restore from Swift in the Description column.

  6. Launch the restore job:

    ardana > freezer job-start JOB-ID
  7. This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

  8. Log in to the Cloud Lifecycle Manager.

  9. Stop the Swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml
  10. Log back in to the first Swift Proxy (SWF-PRX[0]) node, which was determined in Step 1.

  11. Copy the restored files.

    tux > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    tux > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  12. Log back in to the Cloud Lifecycle Manager.

  13. Reconfigure the Swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
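
The copy command in Step 11 repeats the cloud name and the control plane name in both the source and the destination path, which makes it easy to mistype. A minimal sketch of a helper that prints the command for review follows. The function name is hypothetical, and the helper only prints the command; it does not run it.

```shell
# Hypothetical helper, not shipped with the product.
# Prints the Step 11 copy command for cloud name $1 and control plane
# name $2 so it can be reviewed before it is run.
builder_dir_restore_cmd() {
    cloud=$1
    control_plane=$2
    printf 'sudo cp -pr /tmp/swift_builder_dir_restore/%s/%s/builder_dir/* /etc/swiftlm/%s/%s/builder_dir/\n' \
        "$cloud" "$control_plane" "$cloud" "$control_plane"
}

# Usage with the names from the example in Step 11:
#   builder_dir_restore_cmd entry-scale-kvm control-plane-1
```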
13.2.2.2.2.2 Restore from an SSH backup

Follow almost the same procedure as for Swift in the section immediately preceding this one: Section 13.2.2.2.2.1, “Restore from a Swift backup”. The only change is that the restore job uses a different job id. Get the corresponding job id for Ardana Default: Swift restore from SSH in the Description column.

13.2.2.2.3 Point-in-Time Cloud Lifecycle Manager Recovery

In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.

Procedure 13.1: Restoring from a Swift or SSH Backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Source the backup environment file:

    tux > source /var/lib/ardana/backup.osrc
  3. Find the Client ID.

    tux > freezer client-list
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | Client ID                   | uuid                             | hostname                    | description |
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
    | ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
    | ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
    | ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
    | ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
    | ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
    +-----------------------------+----------------------------------+-----------------------------+-------------+

    In this example, the hostname and the Client ID are the same: ardana002-cp1-c1-m1-mgmt.

  4. List the jobs:

    tux > freezer job-list -C CLIENT_ID

    Using the example in the previous step:

    tux > freezer job-list -C ardana002-cp1-c1-m1-mgmt
  5. Find the correct job ID:

    SSH backups: Get the job ID for Ardana Default: deployer restore from SSH.

    or

    Swift backups: Get the job ID for Ardana Default: deployer restore from Swift.

  6. Stop the Dayzero UI:

    tux > sudo systemctl stop dayzero
  7. Launch the restore job:

    tux > freezer job-start JOB-ID
  8. This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

  9. Start the Dayzero UI:

    tux > sudo systemctl start dayzero
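
Finding the job ID in Step 5 means scanning the job list for a description. The sketch below does this for any table drawn with | column separators. The function name is hypothetical, and the sketch assumes that the job ID is the first column of the freezer job-list output, so check this against your own output before relying on it.

```shell
# Hypothetical helper, not shipped with the product.
# Reads a table with "|" column separators on stdin and prints the first
# column of the row in which some column equals the description $1.
# Assumes the job ID is the first column of the table.
job_id_for() {
    awk -F'|' -v want="$1" '
        {
            for (i = 2; i <= NF; i++) {
                field = $i
                gsub(/^ +| +$/, "", field)
                if (field == want) {
                    id = $2
                    gsub(/^ +| +$/, "", id)
                    print id
                    exit
                }
            }
        }'
}

# Usage, with the Client ID from the example in Step 3:
#   freezer job-list -C ardana002-cp1-c1-m1-mgmt \
#       | job_id_for 'Ardana Default: deployer restore from Swift'
```

The description must match the text in the table exactly. An empty result means no job with that description was found for the client.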
13.2.2.2.4 Cloud Lifecycle Manager Disaster Recovery

In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.

To ensure that you use the same version of HPE Helion OpenStack that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the Book “Installing with Cloud Lifecycle Manager”, Chapter 3 “Installing the Cloud Lifecycle Manager server”, Section 3.5.2 “Installing the HPE Helion OpenStack Extension” before proceeding further.

13.2.2.2.4.1 Restore from a Swift backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Install the freezer-agent using the following playbook:

    ardana > cd ~/openstack/ardana/ansible/
    ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
  3. Access one of the other controller or compute nodes in your environment to perform the following steps:

    1. Retrieve the /var/lib/ardana/backup.osrc file and copy it to the /var/lib/ardana/ directory on the Cloud Lifecycle Manager.

    2. Copy all the files in the /opt/stack/service/freezer-api/etc/ directory to the same directory on the Cloud Lifecycle Manager.

    3. Copy all the files in the /var/lib/ca-certificates directory to the same directory on the Cloud Lifecycle Manager.

    4. Retrieve the /etc/hosts file and replace the one found on the Cloud Lifecycle Manager.

  4. Log back in to the Cloud Lifecycle Manager.

  5. Edit the value for client_id in the following file to contain the hostname of your Cloud Lifecycle Manager:

    /opt/stack/service/freezer-api/etc/freezer-api.conf
  6. Update your ca-certificates:

    sudo update-ca-certificates
  7. Edit the /etc/hosts file, ensuring you edit the 127.0.0.1 line so it points to ardana:

    127.0.0.1       localhost ardana
    ::1             localhost ip6-localhost ip6-loopback
    ff02::1         ip6-allnodes
    ff02::2         ip6-allrouters
  8. On the Cloud Lifecycle Manager, source the backup user credentials:

    ardana > source ~/backup.osrc
  9. Find the Client ID (ardana002-cp1-c0-m1-mgmt) for the host name as done in previous procedures (see Procedure 13.1, “Restoring from a Swift or SSH Backup”).

    ardana > freezer client-list
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | Client ID                   | uuid                             | hostname                    | description |
    +-----------------------------+----------------------------------+-----------------------------+-------------+
    | ardana002-cp1-comp0001-mgmt | f4d9cfe0725145fb91aaf95c80831dd6 | ardana002-cp1-comp0001-mgmt |             |
    | ardana002-cp1-comp0002-mgmt | 55c93eb7d609467a8287f175a2275219 | ardana002-cp1-comp0002-mgmt |             |
    | ardana002-cp1-c0-m1-mgmt    | 50d26318e81a408e97d1b6639b9404b2 | ardana002-cp1-c0-m1-mgmt    |             |
    | ardana002-cp1-c1-m1-mgmt    | 78fe921473914bf6a802ad360c09d35b | ardana002-cp1-c1-m1-mgmt    |             |
    | ardana002-cp1-c1-m2-mgmt    | b2e9a4305c4b4272acf044e3f89d327f | ardana002-cp1-c1-m2-mgmt    |             |
    | ardana002-cp1-c1-m3-mgmt    | a3ceb80b8212425687dd11a92c8bc48e | ardana002-cp1-c1-m3-mgmt    |             |
    +-----------------------------+----------------------------------+-----------------------------+-------------+

    In this example, the hostname and the Client ID are the same: ardana002-cp1-c0-m1-mgmt.

  10. List the Freezer jobs:

    ardana > freezer job-list -C CLIENT_ID

    Using the example in the previous step:

    ardana > freezer job-list -C ardana002-cp1-c0-m1-mgmt
  11. Get the id of the job corresponding to Ardana Default: deployer backup to Swift. Stop that job so the freezer scheduler does not begin making backups when started.

    ardana > freezer job-stop JOB-ID

    If it is present, also stop the Cloud Lifecycle Manager's SSH backup.

  12. Start the freezer scheduler:

    sudo systemctl start openstack-freezer-scheduler
  13. Get the id of the job corresponding to Ardana Default: deployer restore from Swift and launch that job:

    ardana > freezer job-start JOB-ID

    This will take some time. You can follow the progress by running tail -f /var/log/freezer/freezer-scheduler.log. Wait until the restore job is finished before doing the next step.

  14. When the job completes, the previous Cloud Lifecycle Manager contents should be restored to your home directory:

    ardana > cd ~
    ardana > ls
  15. If you are using Cobbler, restore your Cobbler configuration with these steps:

    1. Remove the following files:

      sudo rm -rf /var/lib/cobbler
      sudo rm -rf /srv/www/cobbler
    2. Deploy Cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
    3. Set the netboot-enabled flag for each of your nodes with this command:

      for h in $(sudo cobbler system list)
      do
        sudo cobbler system edit --name=$h --netboot-enabled=0
      done
  16. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  17. If you are using a dedicated Cloud Lifecycle Manager, follow these steps:

    1. Re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
  18. If you are using a shared Cloud Lifecycle Manager/controller, follow these steps:

    1. If the node is also a Cloud Lifecycle Manager hypervisor, run the following commands to recreate the virtual machines that were lost:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts ardana-hypervisor-setup.yml --limit <this node>
    2. If the node that was lost (or one of the VMs that it hosts) was a member of the RabbitMQ cluster then you need to remove the record of the old node, by running the following command on any one of the other cluster members. In this example the nodes are called cloud-cp1-rmq-mysql-m*-mgmt but you need to use the correct names for your system, which you can find in /etc/hosts:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ssh cloud-cp1-rmq-mysql-m3-mgmt sudo rabbitmqctl forget_cluster_node \
      rabbit@cloud-cp1-rmq-mysql-m1-mgmt
    3. Run the site.yml against the complete cloud to reinstall and rebuild the services that were lost. If you replaced one of the RabbitMQ cluster members then you will need to add the -e flag shown below, to nominate a new master node for the cluster, otherwise you can omit it.

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts site.yml -e \
      rabbit_primary_hostname=cloud-cp1-rmq-mysql-m3
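
Step 7 of this procedure edits the 127.0.0.1 line of /etc/hosts by hand after the file has been copied from another node. A minimal sketch of the same edit as a filter follows. The function name and the temporary file name in the usage comment are hypothetical. The filter writes to standard output, so the result can be reviewed before it replaces the real file.

```shell
# Hypothetical helper, not shipped with the product.
# Reads a hosts file on stdin and writes it to stdout with the IPv4
# loopback entry replaced by the line shown in Step 7.
fix_loopback_entry() {
    sed 's/^127\.0\.0\.1[[:space:]].*$/127.0.0.1       localhost ardana/'
}

# Usage:
#   fix_loopback_entry < /etc/hosts > /tmp/hosts.new
#   # review /tmp/hosts.new, then:
#   sudo cp /tmp/hosts.new /etc/hosts
```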
13.2.2.2.4.2 Restore from an SSH backup
  1. On the Cloud Lifecycle Manager, edit the following file so it contains the same information as it did previously:

    ~/openstack/my_cloud/config/freezer/ssh_credentials.yml
  2. On the Cloud Lifecycle Manager, copy the following files, change directories, and run the playbook _deployer_restore_helper.yml:

    ardana > cp -r ~/hp-ci/openstack/* ~/openstack/my_cloud/definition/
    ardana > cd ~/openstack/ardana/ansible/
    ardana > ansible-playbook -i hosts/localhost _deployer_restore_helper.yml
  3. Perform the restore. First become root and change directories:

    sudo su
    root # cd /root/deployer_restore_helper/
  4. Execute the restore job:

    root # ./deployer_restore_script.sh
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. When the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
13.2.2.2.5 One or Two Controller Node Disaster Recovery

This scenario makes the following assumptions:

  • Your Cloud Lifecycle Manager is still intact and working.

  • One or two of your controller nodes went down, but not the entire cluster.

  • The node needs to be rebuilt from scratch, not simply rebooted.

13.2.2.2.5.1 Steps to recover one or two controller nodes
  1. Ensure that your node has power and all of the hardware is functioning.

  2. Log in to the Cloud Lifecycle Manager.

  3. Verify that all of the information in your ~/openstack/my_cloud/definition/data/servers.yml file is correct for your controller node. You may need to replace the existing information if you had to replace either your entire controller node or just parts of it.

  4. If you made changes to your servers.yml file then commit those changes to your local git:

    ardana > git add -A
    ardana > git commit -a -m "editing controller information"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Ensure that Cobbler has the correct system information:

    1. If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:

      ardana > sudo cobbler system list
    2. Remove any controller nodes from Cobbler that no longer exist:

      ardana > sudo cobbler system remove --name=<node>
    3. Add the new node into Cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  8. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>
    Note

    If you do not know the <node_name> already, you can get it by using sudo cobbler system list.

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See the Persisted Server Allocations section for information on how this works.

  9. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>
  10. Complete the rebuilding of your controller node with the two playbooks below:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>
13.2.2.2.6 Three Control Plane Node Disaster Recovery

In this scenario, all control plane nodes have been destroyed and need to be rebuilt or replaced.

13.2.2.2.6.1 Restore from a Swift backup

Restoring from a Swift backup is not possible because Swift is gone.

13.2.2.2.6.2 Restore from an SSH backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Disable the default backup job(s) by editing the following file:

    ~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml

    Set the value for freezer_create_backup_jobs to false:

    # If set to false, We won't create backups jobs.
    freezer_create_backup_jobs: false
  3. Deploy the control plane nodes, using the values for your control plane node hostnames:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml \
        --limit CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2,CONTROL_PLANE_HOSTNAME3 \
        -e rebuild=True

    For example, if you were using the default values from the example model files your command would look like this:

    ardana > ansible-playbook -i hosts/verb_hosts site.yml \
        --limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \
        -e rebuild=True
    Note

    The -e rebuild=True option is normally used only on a single control plane node, when other controllers are available to pull configuration data from. Used here, it causes the MariaDB database to be reinitialized, which is the only choice when no other control nodes remain.

  4. Restore the MariaDB backup on the first controller node.

    1. List the Freezer jobs:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > freezer job-list -C FIRST_CONTROLLER_NODE
    2. Run the Ardana Default: mysql restore from SSH job for your first controller node, replacing the JOB_ID for that job:

      ardana > freezer job-start JOB_ID
  5. You can monitor the restore job by connecting to your first controller node via SSH and running the following commands:

    ardana > ssh FIRST_CONTROLLER_NODE
    ardana > sudo su
    root # tail -n 100 /var/log/freezer/freezer-scheduler.log
  6. Log back in to the Cloud Lifecycle Manager.

  7. Stop MySQL:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  8. Log back in to the first controller node and move the following files:

    ardana > ssh FIRST_CONTROLLER_NODE
    ardana > sudo su
    root # rm -rf /var/lib/mysql/*
    root # cp -pr /tmp/mysql_restore/* /var/lib/mysql/
  9. Log back in to the Cloud Lifecycle Manager and bootstrap MySQL:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  10. Verify the status of MySQL:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml
  11. Re-enable the default backup job(s) by editing the following file:

    ~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml

    Set the value for freezer_create_backup_jobs to true:

    # If set to false, We won't create backups jobs.
    freezer_create_backup_jobs: true
  12. Run this playbook to deploy the backup jobs:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts _freezer_manage_jobs.yml
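
Steps 2 and 11 above both edit the same freezer_create_backup_jobs setting by hand. As a minimal sketch, the edit could be scripted with a small shell function. The function name set_backup_jobs is hypothetical and not part of the product, and the sketch assumes the key starts at the beginning of a line in activate.yml.

```shell
# Set freezer_create_backup_jobs to "true" or "false" in an activate.yml file.
# Hypothetical helper; assumes the key starts at the beginning of a line.
# Usage: set_backup_jobs FILE true|false
set_backup_jobs() {
    file=$1
    value=$2
    case "$value" in
        true|false) ;;
        *) echo "value must be true or false" >&2; return 1 ;;
    esac
    # Fail early if the key is absent, so a wrong path or key name is noticed.
    grep -q '^freezer_create_backup_jobs:' "$file" || {
        echo "freezer_create_backup_jobs not found in $file" >&2
        return 1
    }
    # Write the edited copy to a temporary file, then overwrite the original
    # in place so that its ownership and permissions are kept.
    tmp=$(mktemp) || return 1
    sed "s/^freezer_create_backup_jobs:.*/freezer_create_backup_jobs: $value/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example (path taken from the procedure above):
#   set_backup_jobs ~/scratch/ansible/next/ardana/ansible/roles/freezer-jobs/defaults/activate.yml false
```

Review the file after running the function, before you continue with the deployment playbooks.
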
13.2.2.2.7 Swift Rings Recovery

To recover your Swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings from one Swift node if possible, or use the SSH backup that you have set up.


13.2.2.2.7.1 Restore from the Swift deployment backup

See Section 15.6.2.7, “Recovering Swift Builder Files”.

13.2.2.2.7.2 Restore from the SSH Freezer backup

In the very specific case where you have lost all system disks of all object nodes and the Swift proxy nodes are corrupted, you can still recover the rings because a copy of the Swift rings is stored in Freezer. This assumes that the Swift data is still there (the disks used by Swift must still be accessible).

Recover the rings with these steps.

  1. Log in to a node that has the freezer-agent installed.

  2. Become root:

    ardana > sudo su
  3. Create the temporary directory to restore your files to:

    root # mkdir /tmp/swift_builder_dir_restore/
  4. Create a restore file with the following content:

    root # cat << EOF > ./restore_config.ini
    [default]
    action = restore
    storage = ssh
    compression = bzip2
    restore_abs_path = /tmp/swift_builder_dir_restore/
    ssh_key = /etc/freezer/ssh_key
    ssh_host = <freezer_ssh_host>
    ssh_port = <freezer_ssh_port>
    ssh_username = <freezer_ssh_username>
    container = <freezer_ssh_base_rid>/freezer_swift_backup_name = freezer_swift_builder_backup
    hostname = <hostname of the old first Swift-Proxy (SWF-PRX[0])>
    EOF
  5. Edit the file and replace all <tags> with the right information.

    root # vim ./restore_config.ini

    You will also need to place the SSH key that was used for the backups in /etc/freezer/ssh_key and set its permissions to 600.

  6. Execute the restore job:

    root # freezer-agent --config ./restore_config.ini

    You now have the Swift rings in /tmp/swift_builder_dir_restore/

  7. If SWF-PRX[0] is already deployed, copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on SWF-PRX[0], then run the swift-reconfigure.yml playbook from the Cloud Lifecycle Manager:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  8. If the SWF-ACC[0] is not deployed, from the Cloud Lifecycle Manager run these playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts guard-deployment.yml
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>
  9. Copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on SWF-ACC[0]. You will have to create the directory /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ first:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  10. From the Cloud Lifecycle Manager, run the ardana-deploy.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
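
The copy in steps 7 and 9, including creating the destination directory when it does not exist yet, can be sketched as one shell function. This is an illustration only: the function name copy_builder_dir and the optional DEST_ROOT argument (useful for a trial run against a scratch directory) are assumptions, not part of the product.

```shell
# Copy restored Swift builder files into the swiftlm builder directory,
# creating the destination if it does not exist yet.
# Hypothetical helper; DEST_ROOT defaults to /etc/swiftlm.
# Usage: copy_builder_dir RESTORE_ROOT CLOUD_NAME CONTROL_PLANE_NAME [DEST_ROOT]
copy_builder_dir() {
    src="$1/$2/$3/builder_dir"
    dest="${4:-/etc/swiftlm}/$2/$3/builder_dir"
    if [ ! -d "$src" ]; then
        echo "restored builder directory not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -p preserves mode, ownership and timestamps; -r copies recursively.
    # "$src"/. copies the directory contents rather than the directory itself.
    cp -pr "$src"/. "$dest"/
}

# Example with the names used above (needs root to write under /etc/swiftlm):
#   copy_builder_dir /tmp/swift_builder_dir_restore entry-scale-kvm control-plane-1
```
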

13.2.3 Unplanned Compute Maintenance

Unplanned maintenance tasks including recovering compute nodes.

13.2.3.1 Recovering a Compute Node

If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, then you need to perform disaster recovery. Here we provide different scenarios and how to resolve them to get your cloud repaired.

Typical scenarios in which you will need to recover a compute node include the following:

  • The node has failed, either because it has shut down, has a hardware failure, or for another reason.

  • The node is working but the nova-compute process is not responding, thus instances are working but you cannot manage them (for example to delete, reboot, and attach/detach volumes).

  • The node is fully operational but monitoring indicates a potential issue (such as disk errors) that requires downtime to fix.

13.2.3.1.1 What to do if your compute node is down

Compute node has power but is not powered on

If your compute node has power but is not powered on, use these steps to restore the node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the name for your compute node in Cobbler:

    sudo cobbler system list
  3. Power the node back up with this playbook, specifying the node name from Cobbler:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node_name>

Compute node is powered on but services are not running on it

If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:

  1. Log in to the Cloud Lifecycle Manager.

  2. Confirm the status of the compute service on the node with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>
  3. You can start the compute service on the node with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>
13.2.3.1.2 Scenarios involving disk failures on your compute nodes

Your compute nodes should have a minimum of two disks, one that is used for the operating system and one that is used as the data disk. These are defined during the installation of your cloud, in the ~/openstack/my_cloud/definition/data/disks_compute.yml file on the Cloud Lifecycle Manager. The data disk(s) are where the nova-compute service lives. Recovery scenarios will depend on whether one or the other, or both, of these disks experienced failures.

If your operating system disk failed but the data disk(s) are okay

If you have had issues with the physical volume that hosts your operating system, first ensure that the physical volume is restored. You can then use the following steps to restore the operating system:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrator credentials:

    source ~/service.osrc
  3. Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

    nova host-list | grep compute
  4. Obtain the status of the nova-compute service on that node:

    nova service-list --host <hostname>
  5. You will likely want to disable provisioning on that node to ensure that nova-scheduler does not attempt to place any additional instances on the node while you are repairing it:

    nova service-disable --reason "node is being rebuilt" <hostname> nova-compute
  6. Obtain the status of the instances on the compute node:

    nova list --host <hostname> --all-tenants
  7. Before continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, then you can use the nova evacuate or nova host-evacuate commands to do this. See Section 13.1.3.3, “Live Migration of Instances” for more details on how to do this.

    If your instances are not booted from volumes, you will need to stop the instances using the nova stop command. Because the nova-compute service is not running on the node you will not see the instance status change, but the Task State for the instance should change to powering-off.

    nova stop <instance_uuid>

    Verify the status of each instance using these commands, confirming that the Task State is powering-off:

    nova list --host <hostname> --all-tenants
    nova show <instance_uuid>
  8. At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:

    1. Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:

      sudo cobbler system list
    2. Reimage the compute node with this playbook:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>
  9. Once reimaging is complete, use the following playbook to configure the operating system and start up services:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
  10. You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not then use the nova start command to bring them to the ACTIVE state:

    nova list --host <hostname> --all-tenants
    nova start <instance_uuid>
  11. Reenable provisioning:

    nova service-enable <hostname> nova-compute
  12. Start any instances that you had stopped previously:

    nova list --host <hostname> --all-tenants
    nova start <instance_uuid>
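
Several steps above run nova stop or nova start once per instance. As a rough sketch, the instance IDs can be pulled out of the nova list table and fed to a loop. The helper name instance_ids is hypothetical, and the sketch assumes the default table layout in which ID is the first column.

```shell
# Print the ID column of a `nova list` table read from standard input.
# Hypothetical helper; assumes rows of the form "| <uuid> | name | ...".
instance_ids() {
    awk -F'|' '
        # Keep only rows whose first cell looks like a UUID.
        # Border rows and the header row are skipped.
        {
            id = $2
            gsub(/[ \t]/, "", id)
            if (length(id) == 36 && id ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
                print id
        }'
}

# Example: stop every instance on one host (commands as used in the steps above).
#   nova list --host <hostname> --all-tenants | instance_ids | while read -r id; do
#       nova stop "$id"
#   done
```

Check the list of IDs on its own before piping it into nova stop or nova start, since the loop acts on every instance it finds.
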

If your data disk(s) failed but the operating system disk is okay OR if all drives failed

In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.

After that is complete, use the nova rebuild command to respawn your instances, which will also ensure that they receive the same IP address:

nova list --host <hostname> --all-tenants
nova rebuild <instance_uuid>

13.2.4 Unplanned Storage Maintenance

Unplanned maintenance tasks for storage nodes.

13.2.4.1 Unplanned Swift Storage Maintenance

Unplanned maintenance tasks for Swift storage nodes.

13.2.4.1.1 Recovering a Swift Node

If one or more of your Swift Object or PAC nodes has experienced an issue such as power loss or hardware failure, you need to perform disaster recovery. The scenarios below describe how to resolve each situation and get your cloud repaired.

Typical scenarios in which you will need to repair a Swift object or PAC node include:

  • The node has either shut down or been rebooted.

  • The entire node has failed and needs to be replaced.

  • A disk drive has failed and must be replaced.

13.2.4.1.1.1 What to do if your Swift host has shut down or rebooted

If your Swift host has power but is not powered on, use these steps from the Cloud Lifecycle Manager to restore the node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the name for your Swift host in Cobbler:

    sudo cobbler system list
  3. Power the node back up with this playbook, specifying the node name from Cobbler:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node_name>

Once the node is booted up, Swift should start automatically. You can verify this with this playbook:

cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-status.yml

Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 15.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.

13.2.4.1.1.2 How to replace your Swift node

If your Swift node has irreparable damage and you need to replace the entire node in your environment, see Section 13.1.5.1.5, “Replacing a Swift Node” for details on how to do this.

13.2.4.1.1.3 How to replace a hard disk in your Swift node

If you need to do a hard drive replacement in your Swift node, see Section 13.1.5.1.6, “Replacing Drives in a Swift Node” for details on how to do this.

13.3 Cloud Lifecycle Manager Maintenance Update Procedure

Procedure 13.2: Preparing for Update
  1. Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Book “Installing with Cloud Lifecycle Manager”, Chapter 4 “Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional)”. Alternatives to setting up an SMT server are described in Book “Installing with Cloud Lifecycle Manager”, Chapter 5 “Software Repository Setup”.

  2. Read the Release Notes for the security and maintenance updates that will be installed.

  3. Have a backup strategy in place. For further information, see Chapter 14, Backup and Restore.

  4. Ensure that you have a known starting state by resolving any unexpected alarms.

  5. Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.

  6. Review steps in Section 13.1.4.1, “Adding a Neutron Network Node” and Section 13.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the Neutron services are not provided via external SDN controllers.

  7. Before the update, prepare your workloads by consolidating all of your instances onto one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and move the instances from the remaining Compute Nodes to the newly booted ones. Then, update the remaining Compute Nodes.

13.3.1 Performing the Update

Before you proceed, get the status of all your services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml

If the status check returns an error for a specific service, run the SERVICE-reconfigure.yml playbook. Then run the SERVICE-status.yml playbook to check that the issue has been resolved.

Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 13.1.1.2, “Rolling Reboot of the Cloud”.

Note

The described workflow also covers cases in which the deployer node is also provisioned as an active cloud node.

To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 13.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.

Procedure 13.3: Update Instructions
  1. Install all available security and maintenance updates on the deployer using the zypper patch command.

  2. Initialize the Cloud Lifecycle Manager and prepare the update playbooks.

    1. Run the ardana-init initialization script to update the deployer.

    2. Redeploy cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
    3. Run the configuration processor:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    4. Update your deployment directory:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Installation and management of updates can be automated with the following playbooks:

    • ardana-update-pkgs.yml

    • ardana-update.yml

    • ardana-update-status.yml

      Important

      Some playbooks are being deprecated. To determine how your system is affected, run:

      ardana > rpm -qa ardana-ansible

      The result will be ardana-ansible-8.0+git. followed by a version number string.

      • If the first part of the version number string is greater than or equal to 1553878455 (for example, ardana-ansible-8.0+git.1553878455.7439e04), use the newly introduced parameters:

        • pending_clm_update

        • pending_service_update

        • pending_system_reboot

      • If the first part of the version number string is less than 1553878455 (for example, ardana-ansible-8.0+git.1552032267.5298d45), use the following parameters:

        • update_status_var

        • update_status_set

        • update_status_reset

    • ardana-reboot.yml

  4. Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs playbook on each node.

    ardana > hostnamectl

    Notice that the Boot ID: and Kernel: information has changed.

  5. By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --limit TARGET_NODE_NAME

    There may be a delay in the playbook output at the following task while updates are pulled from the deployer.

    TASK: [ardana-upgrade-tools | pkg-update | Download and install
    package updates] ***
  6. After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit TARGET_NODE_NAME
  7. To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --limit TARGET_NODE_NAME \
    -e zypper_update_include_reboot_patches=true

    If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.

    Note

    To update a single package (for example, apply a PTF on a single node or on all nodes), run zypper update PACKAGE.

    To install all package updates, run zypper update.

  8. Update services:

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
    --limit TARGET_NODE_NAME
  9. If indicated by the ardana-update-status.yml playbook, reboot the node.

    There may also be a warning to reboot after running the ardana-update-pkgs.yml playbook.

    This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
    --limit TARGET_NODE_NAME
  10. To recheck pending system reboot status at a later time, run the following commands:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit ardana-cp1-c1-m2
  11. The pending system reboot status can be reset by running:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit ardana-cp1-c1-m2 \
    -e pending_system_reboot=off
  12. Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.

    Warning

    When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time.

    If multiple nodes are specified on the command line (with --limit), those servers will experience outages as services are shut down and packages are updated. If you plan to update a Compute Node (or a group of Compute Nodes), migrate the workload off it first. The same applies to Control Nodes: move singleton services off the control plane node that will be updated.

    Important

    Do not reboot all of your controllers at the same time.

  13. When the node comes up after the reboot, run the spark-start.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
  14. Verify that Spark is running on all Control Nodes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
  15. After all nodes have been updated, check the status of all services:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
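
The per-node part of Procedure 13.3 can be summarized by printing the playbook commands in the order the steps give them. The sketch below only prints commands and never runs them. The function name print_update_commands is hypothetical, and the sketch assumes a single node is updated at a time.

```shell
# Print, in order, the playbook commands from Procedure 13.3 for one node.
# Hypothetical helper: nothing is executed, the commands are only printed.
# Usage: print_update_commands TARGET_NODE_NAME
print_update_commands() {
    node=$1
    [ -n "$node" ] || { echo "usage: print_update_commands TARGET_NODE_NAME" >&2; return 1; }
    base="ansible-playbook -i hosts/verb_hosts"
    echo "cd ~/scratch/ansible/next/ardana/ansible"
    # Patches that do not require a reboot, followed by a status check.
    echo "$base ardana-update-pkgs.yml --limit $node"
    echo "$base ardana-update-status.yml --limit $node"
    # Patches that do require a reboot.
    echo "$base ardana-update-pkgs.yml --limit $node -e zypper_update_include_reboot_patches=true"
    # Service update, then the reboot if one is indicated.
    echo "$base ardana-update.yml --limit $node"
    echo "$base ardana-reboot.yml --limit $node   # only if a reboot is indicated"
}
```

Printing rather than executing keeps the operator in control: each command can be reviewed and run by hand once the node has been prepared as described before the procedure.
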

13.3.2 Summary of the Update Playbooks

ardana-update-pkgs.yml

Top-level playbook that automates the installation of package updates on a single node. It also works for multiple nodes if the single-node restriction is overridden, either by setting the SKIP_SINGLE_HOST_CHECKS environment variable or by passing -e skip_single_host_checks=true to ardana-update-pkgs.yml.

Provide the following -e options to modify default behavior:

  • zypper_update_method (default: patch)

    • patch will install all patches for the system. Patches are intended for specific bug and security fixes.

    • update will install all packages that have a higher version number than the installed packages.

    • dist-upgrade replaces each package installed with the version from the repository and deletes packages not available in the repositories.

  • zypper_update_repositories (default: all) restricts the list of repositories used

  • zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks if packages are correctly signed.

  • zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third party licenses.

  • zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).
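
Because a mistyped value for one of these options may only surface once the playbook is running, it can help to validate the values before building the command line. The following is a minimal sketch; the function name update_pkgs_args is hypothetical, and it covers only two of the options listed above.

```shell
# Print the -e arguments for ardana-update-pkgs.yml after checking that the
# update method is one of the three values listed above.
# Hypothetical helper; the defaults mirror the documented ones.
# Usage: update_pkgs_args [METHOD] [INCLUDE_REBOOT_PATCHES]
update_pkgs_args() {
    method=${1:-patch}
    reboot=${2:-false}
    case "$method" in
        patch|update|dist-upgrade) ;;
        *) echo "unknown zypper_update_method: $method" >&2; return 1 ;;
    esac
    case "$reboot" in
        true|false) ;;
        *) echo "zypper_update_include_reboot_patches must be true or false" >&2; return 1 ;;
    esac
    printf '%s\n' "-e zypper_update_method=$method -e zypper_update_include_reboot_patches=$reboot"
}

# Example:
#   args=$(update_pkgs_args update true) &&
#       ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit TARGET_NODE_NAME $args
```
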

ardana-update.yml

Top-level playbook that automates the update of all the services. It runs on all nodes by default, or can be limited to a single node by adding --limit nodename.

ardana-reboot.yml

Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.

ardana-update-status.yml

This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.
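
Which set of status parameters applies depends on the ardana-ansible version check described in Procedure 13.3, where the first part of the version number string is compared against 1553878455. That comparison can be sketched as a shell function. The function name update_param_set is hypothetical, and the sketch assumes the number follows +git. directly, as in the examples given in the procedure.

```shell
# Decide which ardana-update-status.yml parameter set applies, based on the
# number embedded in the ardana-ansible package version.
# Hypothetical helper; assumes the form "ardana-ansible-8.0+git.NUMBER.HASH".
# Usage: update_param_set VERSION_STRING
update_param_set() {
    # Extract the digits between "+git." and the next dot.
    stamp=$(printf '%s\n' "$1" | sed -n 's/.*+git\.\([0-9][0-9]*\)\..*/\1/p')
    if [ -z "$stamp" ]; then
        echo "cannot parse version string: $1" >&2
        return 1
    fi
    if [ "$stamp" -ge 1553878455 ]; then
        echo "pending_clm_update pending_service_update pending_system_reboot"
    else
        echo "update_status_var update_status_set update_status_reset"
    fi
}

# Example:
#   update_param_set "$(rpm -qa ardana-ansible)"
```
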

13.4 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment

Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.

Use the following steps to deploy a PTF:

  1. When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:

    ardana > tmpdir=`mktemp -d`
    ardana > cd $tmpdir
    ardana > sudo wget --no-directories --recursive --reject "index.html*" \
    --user=USER_NAME \
    --password=PASSWORD \
    --no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030
  2. Remove any old data from the PTF repository, such as a listing for a PTF repository from a migration or when previous product patches were installed.

    ardana > sudo rm -rf /srv/www/suse-12.3/x86_64/repos/PTF/*
  3. Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a Neutron PTF.

    ardana > sudo mkdir -p /srv/www/suse-12.3/x86_64/repos/PTF/
    ardana > sudo mv $tmpdir/* \
       /srv/www/suse-12.3/x86_64/repos/PTF/
    ardana > sudo chown --recursive root:root /srv/www/suse-12.3/x86_64/repos/PTF/*
    ardana > rmdir $tmpdir
  4. Create or update the repository metadata:

    ardana > sudo /usr/local/sbin/createrepo-cloud-ptf
    Spawning worker 0 with 2 pkgs
    Workers Finished
    Saving Primary metadata
    Saving file lists metadata
    Saving other metadata
  5. Refresh the PTF repository before installing package updates on the Cloud Lifecycle Manager:

    ardana > sudo zypper refresh --force --repo PTF
    Forcing raw metadata refresh
    Retrieving repository 'PTF' metadata ..........................................[done]
    Forcing building of repository cache
    Building repository 'PTF' cache ..........................................[done]
    Specified repositories have been refreshed.
  6. The PTF shows as available on the deployer.

    ardana > sudo zypper se --repo PTF
    Loading repository data...
    Reading installed packages...
    
    S | Name                          | Summary                                  | Type
    --+-------------------------------+------------------------------------------+--------
      | python-neutronclient          | Python API and CLI for OpenStack Neutron | package
    i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack Neutron  | package
  7. Install the PTF venv packages on the Cloud Lifecycle Manager:

    ardana > sudo zypper dup --from PTF
    Refreshing service
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    The following package is going to be upgraded:
      venv-openstack-neutron-x86_64
    
    The following package has no support information from its vendor:
      venv-openstack-neutron-x86_64
    
    1 package to upgrade.
    Overall download size: 64.2 MiB. Already cached: 0 B. After the operation, additional 6.9 KiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ... (1/1),  64.2 MiB ( 64.6 MiB unpacked)
    Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done]
    Checking for file conflicts: ..............................................................[done]
    (1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done]
    Additional rpm output:
    warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEY
  8. Validate that the venv tarball has been installed into the deployment directory. Note that the packages file in that directory lists the registered tarballs that will be used for the services, which should align with the installed venv RPM:

    ardana > ls -la /opt/ardana_packager/ardana-8/sles_venv/x86_64
    total 898952
    drwxr-xr-x 2 root root     4096 Oct 30 16:10 .
    ...
    -rw-r--r-- 1 root root 67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz <<<
    -rw-r--r-- 1 root root 64674087 Aug 14 16:14 nova-20180814T161306Z.tgz
    -rw-r--r-- 1 root root 45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz
    -rw-r--r-- 1 root root     1879 Oct 30 16:10 packages
    -rw-r--r-- 1 root root 27186008 Apr 26  2018 swift-20180426T230541Z.tgz
  9. Install the non-venv PTF packages on the Compute Node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' --limit comp0001-mgmt

    When it has finished, you can see that the upgraded package has been installed on comp0001-mgmt.

    ardana > sudo zypper se --detail python-neutronclient
    Loading repository data...
    Reading installed packages...
    
    S | Name                 | Type     | Version                         | Arch   | Repository
    --+----------------------+----------+---------------------------------+--------+--------------------------------------
    i | python-neutronclient | package  | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF
      | python-neutronclient | package  | 6.5.0-4.361                     | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1
  10. Running the ardana-update.yml playbook distributes the PTF venv packages to the cloud server. Afterward, you can find them in the virtual environment directory alongside the other venvs.

    The Compute Node before running the update playbook:

    ardana > ls -la /opt/stack/venv
    total 24
    drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
    drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
    drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
    drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
  11. Run the update.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt

    When it has finished, you can see that an additional virtual environment has been installed.

    ardana > ls -la /opt/stack/venv
    total 28
    drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
    drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
    drwxr-xr-x  9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z <<< New venv installed
    drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
    drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
  12. The PTF may also have RPM package updates in addition to venv updates. To complete the update, follow the instructions at Section 13.3.1, “Performing the Update”.
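
The listings in steps 10 and 11 show that venv directories carry a UTC timestamp suffix, so the newest one for a service sorts last. As a rough sketch of how that check could be scripted, the helper below picks the most recent venv name for a given service from a list of directory names. The function name newest_venv is hypothetical and not part of the product; it assumes the name-YYYYMMDDTHHMMSSZ pattern seen in the listings.

```shell
# Hypothetical helper (not part of the product): print the newest
# timestamped venv name for a service.
# Reads directory names on stdin, one per line.
# Assumes names look like <service>-YYYYMMDDTHHMMSSZ, as in the listings above,
# so a plain lexical sort is also a chronological sort.
newest_venv() {
    service="$1"
    grep -E "^${service}-[0-9]{8}T[0-9]{6}Z$" | LC_ALL=C sort | tail -n 1
}

# Example usage on a node (path taken from the listings above):
#   ls /opt/stack/venv | newest_venv neutron
```

After the update in step 11, the result for neutron should match the timestamp of the tarball validated in step 8.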

13.5 Periodic OpenStack Maintenance Tasks

The heat-manage utility handles Heat-specific database operations. Purge the associated database periodically to save space. Set up the following as a cron job on the servers where the heat service is running, at /etc/cron.weekly/local-cleanup-heat, with the following content, which purges entries that were marked as deleted more than 14 days ago:

  #!/bin/bash
  su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :

The nova-manage db archive_deleted_rows command moves deleted rows from production tables to shadow tables. Including --until-complete makes the command run continuously until all deleted rows are archived. We recommend setting up this task as /etc/cron.weekly/local-cleanup-nova on the servers where the nova service is running, with the following content:

  #!/bin/bash
  su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :
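
Scripts in /etc/cron.weekly are generally only run if they are executable, which the steps above do not state explicitly. The following is a minimal sketch of creating such a script and setting its mode in one step. The function name install_weekly_job and its arguments are hypothetical, chosen for illustration only; the command strings are the ones shown above.

```shell
# Hypothetical helper (not part of the product): write a weekly cron script
# and mark it executable.
#   $1 = target directory (normally /etc/cron.weekly)
#   $2 = script file name
#   $3 = command line to run; "|| :" is appended so a failure does not
#        produce a non-zero exit status, matching the scripts above
install_weekly_job() {
    dir="$1"
    name="$2"
    cmd="$3"
    printf '#!/bin/bash\n%s || :\n' "$cmd" > "${dir}/${name}"
    chmod 755 "${dir}/${name}"
}

# Example usage (run as root on the relevant servers):
#   install_weekly_job /etc/cron.weekly local-cleanup-heat \
#     'su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14"'
#   install_weekly_job /etc/cron.weekly local-cleanup-nova \
#     'su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete"'
```

Passing the target directory as an argument makes it possible to try the helper against a scratch directory before writing to /etc/cron.weekly.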