Applies to SUSE OpenStack Cloud 9

15 System Maintenance

This section contains the following subsections to help you manage, configure, and maintain your SUSE OpenStack Cloud environment, as well as procedures for performing node maintenance.

15.1 Planned System Maintenance

Planned maintenance tasks for your cloud are covered in the sections below.

15.1.1 Whole Cloud Maintenance

Planned maintenance procedures for your whole cloud.

15.1.1.1 Bringing Down Your Cloud: Services Down Method

Important

If you have a planned maintenance and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.

If you want to use a rolling reboot method where your cloud services continue running, see Section 15.1.1.2, “Rolling Reboot of the Cloud”.

To perform backups prior to these steps, see Chapter 17, Backup and Restore.

15.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment

You will do the following steps from your Cloud Lifecycle Manager.

  1. Log in to your Cloud Lifecycle Manager.

  2. Gracefully shut down your cloud by running the ardana-stop.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-stop.yml
  3. Shut down and restart your nodes. There are multiple ways you can do this:

    1. You can SSH to each node and use sudo reboot -f to reboot the node. Reboot the control plane nodes first so that they become functional as early as possible.

    2. You can shut down the nodes and then physically restart them either via a power button or the IPMI. If your cloud data model servers.yml specifies iLO connectivity for all nodes, then you can use the bm-power-down.yml and bm-power-up.yml playbooks on the Cloud Lifecycle Manager.

      Power down the control plane nodes last so that they remain online as long as possible, and power them back up before other nodes to restore their services quickly.
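
      As a sketch, if your servers.yml provides working iLO/IPMI credentials for every node, one plausible way to power a node off and back on from the Cloud Lifecycle Manager is shown below (the nodelist value and playbook invocation follow the bm-power-down.yml and bm-power-up.yml examples used later in this chapter; adjust to your environment):

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
      # ...perform maintenance, then power the node back on...
      ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node_name>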

  4. Perform the necessary maintenance.

  5. After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.

  6. Determine the current power status of the nodes in your environment:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts bm-power-status.yml
  7. If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
    Note

    Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.

  8. Bring the databases back up:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  9. Gracefully bring up your cloud services by running the ardana-start.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-start.yml
  10. Pause for a few minutes to give the cloud environment time to come up completely, then verify the status of the individual services using this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml
  11. If any services did not start properly, you can run playbooks for the specific services having issues.

    For example:

    If RabbitMQ fails, run:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml

    You can check the status of RabbitMQ afterwards with this:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml

    If the recovery fails, you can run:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml

    Each of the other services has a playbook in the ~/scratch/ansible/next/ardana/ansible directory in the format <service>-start.yml that you can run. One example, for the compute service, is nova-start.yml.
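
    For example, a minimal sketch of restarting the compute service and then checking it (the nova-status.yml playbook is an assumption; confirm which status playbooks exist with ls *status* as shown in Section 15.1.1.2.5):

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-start.yml
    ansible-playbook -i hosts/verb_hosts nova-status.yml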

  12. Continue checking the status of your SUSE OpenStack Cloud 9 cloud services until there are no more failed or unreachable nodes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-status.yml

15.1.1.2 Rolling Reboot of the Cloud

If you have a planned maintenance and need to bring down your entire cloud and restart services while minimizing downtime, follow the steps here to safely restart your cloud. If you do not mind your services being down, then another option for planned maintenance can be found at Section 15.1.1.1, “Bringing Down Your Cloud: Services Down Method”.

15.1.1.2.1 Recommended node reboot order

The key to ensuring that rebooted nodes reintegrate into the cluster is to allow enough time between controller reboots.

The recommended way to achieve this is as follows:

  1. Reboot controller nodes one-by-one with a suitable interval in between. If you alternate between controllers and compute nodes you will gain more time between the controller reboots.

  2. Reboot compute nodes (if present in your cloud).

  3. Reboot swift nodes (if present in your cloud).

  4. Reboot ESX nodes (if present in your cloud).

15.1.1.2.2 Rebooting controller nodes

Turn off the keystone Fernet Token-Signing Key Rotation

Before rebooting any controller node, you need to ensure that the keystone Fernet token-signing key rotation is turned off. Run the following command:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml

Migrate singleton services first

Note

If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:

ardana > sudo systemctl start apache2
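
To verify the service state before continuing, a standard systemd check such as the following can be used:

ardana > sudo systemctl status apache2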

The first consideration before rebooting any controller node is that a few services run as singletons (non-HA), so they will be unavailable while the controller they run on is down. Typically this window is very small, but if you want to retain a service during the reboot of its host, you should take special action to maintain it, such as migrating the service.

For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.

For the cinder-volume singleton service:

Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:

ardana > ps auxww | grep cinder-volume | grep -v grep

Run the cinder-migrate-volume.yml playbook - details about the cinder volume and backup migration instructions can be found in Section 8.1.3, “Managing cinder Volume and Backup Services”.
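
A minimal sketch of the playbook invocation is shown below; the exact extra variables (for example, source and target hosts) depend on your environment and are described in Section 8.1.3, so treat this as a starting point rather than a complete command:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts cinder-migrate-volume.yml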

For the SNAT namespace singleton service:

If you reboot the controller node hosting the SNAT namespace service, compute instances without floating IPs will lose network connectivity while that controller is down. To prevent this, use these steps to determine which controller node is hosting the SNAT namespace service and migrate it to one of the other controller nodes while that node is rebooted.

  1. Locate the SNAT node where the router is providing the active snat_service:

    1. From the Cloud Lifecycle Manager, list out your ports to determine which port is serving as the router gateway:

      ardana > source ~/service.osrc
      ardana > openstack port list --device-owner network:router_gateway

      Example:

      $ openstack port list --device-owner network:router_gateway
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
      | id                                   | name | mac_address       | fixed_ips                                                                           |
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
      | 287746e6-7d82-4b2c-914c-191954eba342 |      | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} |
      +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
    2. Look at the details of this port to determine the binding:host_id value, which indicates the host to which the port is bound:

      openstack port show <port_id>

      Example, with the value you need shown in the binding:host_id field:

      ardana > openstack port show 287746e6-7d82-4b2c-914c-191954eba342
      +-----------------------+--------------------------------------------------------------------------------------------------------------+
      | Field                 | Value                                                                                                        |
      +-----------------------+--------------------------------------------------------------------------------------------------------------+
      | admin_state_up        | True                                                                                                         |
      | allowed_address_pairs |                                                                                                              |
      | binding:host_id       | ardana-cp1-c1-m2-mgmt                                                                                        |
      | binding:profile       | {}                                                                                                           |
      | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
      | binding:vif_type      | ovs                                                                                                          |
      | binding:vnic_type     | normal                                                                                                       |
      | device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
      | device_owner          | network:router_gateway                                                                                       |
      | dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
      | dns_name              |                                                                                                              |
      | extra_dhcp_opts       |                                                                                                              |
      | fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
      | id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
      | mac_address           | fa:16:3e:2e:26:ac                                                                                            |
      | name                  |                                                                                                              |
      | network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
      | security_groups       |                                                                                                              |
      | status                | DOWN                                                                                                         |
      | tenant_id             |                                                                                                              |
      +-----------------------+--------------------------------------------------------------------------------------------------------------+

      In this example, the ardana-cp1-c1-m2-mgmt is the node hosting the SNAT namespace service.

  2. SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:

    ardana > ssh <IP_of_SNAT_namespace_host>
    ardana > sudo ip netns exec snat-<router_ID> bash

    Example:

    ardana > sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
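
    Once inside the namespace shell, standard networking commands can be used to confirm that the SNAT interfaces and routes are present before you proceed, for example:

    # run inside the namespace shell opened above
    ip addr
    ip route
    exit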
  3. Obtain the ID for the L3 Agent for the node hosting your SNAT namespace:

    ardana > source ~/service.osrc
    ardana > openstack network agent list

    Example output. Given the examples above, the entry you need is the L3 agent on ardana-cp1-c1-m2-mgmt:

    ardana > openstack network agent list
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
    | id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
    | 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent           | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent       | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-metadata-agent    |
    | 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent             | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent             | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-l3-agent          |
    | 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent   | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent             | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent       | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-metadata-agent    |
    | 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent       | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-metadata-agent    |
    | a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent             | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-vpn-agent         |
    | a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent           | ardana-cp1-c1-m1-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent   | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent   | ardana-cp1-comp0001-mgmt | :-)   | True           | neutron-openvswitch-agent |
    | e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent   | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-openvswitch-agent |
    | f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent           | ardana-cp1-c1-m2-mgmt    | :-)   | True           | neutron-dhcp-agent        |
    | fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent       | ardana-cp1-c1-m3-mgmt    | :-)   | True           | neutron-metadata-agent    |
    +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
  4. Also obtain the ID for the L3 Agent of the node you are going to move the SNAT namespace service to using the same commands as the previous step.
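
    If your OpenStack client supports the filtering options, you can narrow the listing instead of scanning the full table; for example (the --agent-type and --host filters are assumptions about your client version):

    ardana > openstack network agent list --agent-type l3 --host <new_snat_namespace_host>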

  5. Use these commands to move the SNAT namespace service, where <router_id> is the router ID (shown as device_id in the port details above):

    1. Remove the router from the L3 agent on the old host:

      ardana > openstack network agent remove router --agent-type l3 \
      <agent_id_of_snat_namespace_host> \
      <qrouter_uuid>

      Example:

      ardana > openstack network agent remove router --agent-type l3 \
      a209c67d-c00f-4a00-b31c-0db30e9ec661 \
      e122ea3f-90c5-4662-bf4a-3889f677aacf
      Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent
    2. Remove the SNAT namespace:

      ardana > sudo ip netns delete snat-<router_id>

      Example:

      ardana > sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf
    3. Add the router to the L3 agent on the new host:

      ardana > openstack network agent add router --agent-type l3 \
      <agent_id_of_new_snat_namespace_host> \
      <qrouter_uuid>

      Example:

      ardana > openstack network agent add router --agent-type l3 \
      3bc28451-c895-437b-999d-fdcff259b016 \
      e122ea3f-90c5-4662-bf4a-3889f677aacf
      Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent

    Confirm that it has been moved by listing the details of your port from step 1b above, noting the value of binding:host_id which should be updated to the host you moved your SNAT namespace to:

    ardana > openstack port show <port_ID>

    Example:

    ardana > openstack port show 287746e6-7d82-4b2c-914c-191954eba342
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
    | Field                 | Value                                                                                                        |
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
    | admin_state_up        | True                                                                                                         |
    | allowed_address_pairs |                                                                                                              |
    | binding:host_id       | ardana-cp1-c1-m1-mgmt                                                                                        |
    | binding:profile       | {}                                                                                                           |
    | binding:vif_details   | {"port_filter": true, "ovs_hybrid_plug": true}                                                               |
    | binding:vif_type      | ovs                                                                                                          |
    | binding:vnic_type     | normal                                                                                                       |
    | device_id             | e122ea3f-90c5-4662-bf4a-3889f677aacf                                                                         |
    | device_owner          | network:router_gateway                                                                                       |
    | dns_assignment        | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} |
    | dns_name              |                                                                                                              |
    | extra_dhcp_opts       |                                                                                                              |
    | fixed_ips             | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"}                          |
    | id                    | 287746e6-7d82-4b2c-914c-191954eba342                                                                         |
    | mac_address           | fa:16:3e:2e:26:ac                                                                                            |
    | name                  |                                                                                                              |
    | network_id            | d3cb12a6-a000-4e3e-82c4-ee04aa169291                                                                         |
    | security_groups       |                                                                                                              |
    | status                | DOWN                                                                                                         |
    | tenant_id             |                                                                                                              |
    +-----------------------+--------------------------------------------------------------------------------------------------------------+
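
    As a convenience, you can filter the output instead of reading the whole table; the full openstack port show output above contains the same information:

    ardana > openstack port show <port_ID> | grep binding:host_id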

Reboot the controllers

In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services.

ardana > for i in $(grep -w cluster-prefix \
~/openstack/my_cloud/definition/data/control_plane.yml \
| awk '{print $2}'); do grep $i \
~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts \
| grep ansible_ssh_host | awk '{print $1}'; done
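
The loop prints one control plane hostname per line. In a cloud following the naming used in the examples in this section, the output would look similar to the following (actual names depend on your cloud model):

ardana-cp1-c1-m1-mgmt
ardana-cp1-c1-m2-mgmt
ardana-cp1-c1-m3-mgmt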

Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:

  1. If any singleton services are active on this node, they will be unavailable while the node is down. If you want to retain a service during the reboot, take special action to maintain it, such as migrating it as described above.

  2. First, stop all services on the controller node that you are rebooting:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \
    <controller node>
  3. Reboot the controller node by running the following command on the controller itself:

    ardana > sudo reboot

    Note that the current node being rebooted could be hosting the lifecycle manager.

  4. Wait for the controller node to become reachable via SSH, then allow a minimum of an additional five minutes for the controller node to settle. Start all services on the controller node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
    --limit <controller node>
  5. Verify that the status of all services on the controller node is OK:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml \
    --limit <controller node>
  6. When the start operation has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off that node first.

Note

It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).

Reenable the keystone Fernet Token-Signing Key Rotation

After all the controller nodes are successfully updated and back online, you need to re-enable the keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
15.1.1.2.3 Rebooting compute nodes

To reboot a compute node, the following operations need to be performed:

  • Disable provisioning on the node to take it offline and prevent further instances from being scheduled to it during the reboot.

  • Identify instances that exist on the compute node, and then either:

    • Live migrate the instances off the node before actioning the reboot. OR

    • Stop the instances

  • Reboot the node

  • Restart the nova services

  1. Disable provisioning:

    ardana > openstack compute service set --disable --disable-reason "DESCRIBE REASON" <hostname> nova-compute

    If the node has existing instances running on it, these instances will need to be migrated or stopped prior to rebooting the node.

  2. Identify the instances that exist on the compute node. Note: The following command must be run with nova admin credentials.

    ardana > openstack server list --host <hostname> --all-tenants
  3. Migrate or Stop the instances on the compute node.

    Migrate the instances off the node by running one of the following commands for each of the instances:

    If your instance is booted from a volume or has any number of cinder volumes attached, use the nova live-migration command:

    ardana > nova live-migration <instance uuid> [<target compute host>]

    If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:

    ardana > nova live-migration --block-migrate <instance uuid> [<target compute host>]

    Note: The [<target compute host>] option is optional. If you do not specify a target host then the nova scheduler will choose a node for you.

    OR

    Stop the instances on the node by running the following command for each of the instances:

    ardana > openstack server stop <instance-uuid>
  4. Stop all services on the Compute node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>
  5. SSH to your Compute nodes and reboot them:

    ardana > sudo reboot

    The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.

  6. If needed, use the bm-power-up.yml playbook from the Cloud Lifecycle Manager to power the node back on. Specify just the node(s) you want to start in the nodelist parameter arguments, that is, nodelist=<node1>[,<node2>][,<node3>]. The ardana-start.yml playbook is run from the Cloud Lifecycle Manager in the next step.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
  7. Execute the ardana-start.yml playbook, specifying the node(s) you want to start in the --limit parameter arguments. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
  8. Re-enable provisioning on the node:

    ardana > openstack compute service set --enable compute nova-compute
  9. Restart any instances you stopped.

    ardana > openstack server start <instance-uuid>
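
    If you stopped several instances in step 3, restarting them can be scripted. A minimal sketch, assuming the admin credentials from ~/service.osrc are sourced and using only the openstack CLI options shown elsewhere in this section:

    ardana > for id in $(openstack server list --host <compute node> --all-tenants --status SHUTOFF -f value -c ID); do
                 openstack server start "$id"
             done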
15.1.1.2.4 Rebooting swift nodes

If your swift services run on a controller node, follow the controller node reboot instructions above.

For a dedicated swift PAC cluster or swift object resource node, perform the following steps for each swift host:

  1. Stop all services on the swift node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <swift node>
  2. Reboot the swift node by running the following command on the swift node itself:

    ardana > sudo reboot
  3. Wait for the node to become ssh-able and then start all services on the swift node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <swift node>
15.1.1.2.5 Get list of status playbooks

The following command will display a list of status playbooks:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ls *status*

15.1.2 Planned Control Plane Maintenance

Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.

15.1.2.1 Replacing a Controller Node

This section outlines steps for replacing a controller node in your environment.

For SUSE OpenStack Cloud, you must have three controller nodes. Therefore, adding or removing nodes is not an option. However, if you need to repair or replace a controller node, you may do so by following the steps outlined here. Note that all playbooks for cloud maintenance are run from the Cloud Lifecycle Manager.

These steps will depend on whether you need to replace a shared lifecycle manager/controller node or whether this is a standalone controller node.

Keep in mind while performing the following tasks:

  • Do not add entries for a new server. Instead, update the entries for the broken one.

  • Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.

15.1.2.1.1 Replacing a Shared Cloud Lifecycle Manager/Controller Node

If the controller node you need to replace was also being used as your Cloud Lifecycle Manager then use these steps below. If this is not a shared controller, skip to the next section.

  1. To ensure that you use the same version of SUSE OpenStack Cloud that you previously had loaded on your Cloud Lifecycle Manager, you will need to download and install the lifecycle management software using the instructions from the installation guide:

    Section 15.5.2, “Installing the SUSE OpenStack Cloud Extension”

  2. Initialize the Cloud Lifecycle Manager platform by running ardana-init.

  3. To restore your data, see Section 15.2.3.2.3, “Point-in-time Cloud Lifecycle Manager Recovery”. At this time, restore only the backup of ardana files on the system into /var/lib/ardana (the user's home directory).

  4. On the new node, update your cloud model with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields to reflect the attributes of the node. Do not change the id, ip-addr, role, or server-group settings.

    Note

    When imaging servers with your own tooling, it is still necessary to have ILO/IPMI settings for all nodes. Even if you are not using Cobbler, the username and password fields in servers.yml need to be filled in with dummy settings. For example, add the following to servers.yml:

    ilo-user: manual
    ilo-password: deployment
  5. Open the servers.yml file describing your cloud nodes:

    ardana > git -C ~/openstack checkout site
    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > vi servers.yml

    Change, as necessary, the mac-addr, ilo-ip, ilo-password, and ilo-user fields of the existing controller node. Save and commit the change:

    ardana > git commit -a -m "repaired node X"
  6. Run the configuration processor and ready-deployment playbooks as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the wipe_disks.yml playbook to ensure all non-OS partitions on the new node are completely wiped prior to continuing with the installation. (The value to be used for hostname is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.)

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit deployer_node_name

    The value for deployer_node_name should be the name identifying the deployer/controller being initialized as it is represented in the hosts/verb_hosts file.

  8. Deploy Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Refer again to Section 15.2.3.2.3, “Point-in-time Cloud Lifecycle Manager Recovery” and proceed to restore all remaining backups, with the exclusion of /var/lib/ardana (which was done earlier) and the cobbler content in /var/lib/cobbler and /srv/www/cobbler.

  10. Install the software on your new Cloud Lifecycle Manager/controller node with these playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-rebuild-pretasks.yml
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml \
    -e rebuild=True --limit deployer_node_name
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml \
    -e rebuild=True --limit deployer_node_name,localhost
    ardana > ansible-playbook -i hosts/verb_hosts tempest-deploy.yml
    Important

    If you receive the message stderr: Error: mnesia_not_running when running the ardana-deploy.yml playbook, it is likely due to one of the following conditions:

    • RabbitMQ was not running on the clustered node

    • The old node was not removed from the cluster

    Correct this problem with the following steps:

    1. Of the remaining clustered nodes (M2 and M3), M2 is the new master. Make sure the application has started and M1 is no longer a member. On the M2 node, run:

      ardana > sudo rabbitmqctl start_app
      ardana > sudo rabbitmqctl forget_cluster_node rabbit@M1

      Check that M1 is no longer a member of the cluster.

      ardana > sudo rabbitmqctl cluster_status
    2. On the newly installed node, M1, make sure RabbitMQ has stopped. On M1, run:

      ardana > sudo rabbitmqctl stop_app
    3. Re-run the ardana-deploy.yml playbook as before.

  11. During the replacement of the node, alarms will show up. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
15.1.2.1.2 Replacing a Standalone Controller Node

If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.

  1. Log in to the Cloud Lifecycle Manager.

  2. Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.

  3. Commit your configuration to the local git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    ardana > cd ~/openstack/ardana/ansible
    ardana > git add -A
    ardana > git commit -m "My config or other commit message"
  4. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  5. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  6. Remove the old controller node(s) from Cobbler. You can list out the systems in Cobbler currently with this command:

    ardana > sudo cobbler system list

    and then remove the old controller nodes with this command:

    ardana > sudo cobbler system remove --name <node>
  7. Remove the SSH key of the old controller node from the known hosts file. You will specify the ip-addr value:

    ardana > ssh-keygen -f "~/.ssh/known_hosts" -R <ip_addr>

    You should see a response similar to this one:

    ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f "~/.ssh/known_hosts" -R 10.13.111.135
    # Host 10.13.111.135 found: line 6 type ECDSA
    ~/.ssh/known_hosts updated.
    Original contents retained as ~/.ssh/known_hosts.old
  8. Run the cobbler-deploy playbook to add the new controller node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
    Important

    You must ensure that the old controller node is powered off before completing this step. This is because the new controller node will re-use the original IP address.

  10. Run the wipe_disks.yml playbook to ensure all non-OS partitions on the new node are completely wiped prior to continuing with the installation. (The value to be used for hostname is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.)

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  11. Run osconfig on the replacement controller node. For example:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
  12. If the controller being replaced is the swift ring builder (see Section 18.6.2.4, “Identifying the Swift Ring Building Server”) you need to restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 18.6.2.7, “Recovering swift Builder Files” for details.

  13. Run the ardana-deploy playbook on the replacement controller.

    If the node being replaced is the swift ring builder server, then you only need to use the --limit switch for that node; otherwise you need to specify the hostname of your swift ring builder server and the hostname of the node being replaced.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True \
    --limit=<controller-hostname>,<swift-ring-builder-hostname>
    Important

    If you receive a keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys.

    In this situation, do not use the --limit option when running keystone-reconfigure.yml. In order to re-sync Fernet keys, all the controller nodes must be in the play.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
    Important

    If you receive a RabbitMQ failure when running this playbook, review Section 18.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue and then re-run the ardana-deploy playbook.

  14. During the replacement of the node, alarms will show up. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
    ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh

15.1.3 Planned Compute Maintenance

Planned maintenance tasks for compute nodes.

15.1.3.1 Planned Maintenance for a Compute Node

If one or more of your compute nodes needs hardware maintenance and you can schedule planned maintenance, follow this procedure.

15.1.3.1.1 Performing planned maintenance on a compute node

If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrator credentials:

    source ~/service.osrc
  3. Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

    openstack host list | grep compute

    The following example shows two compute nodes:

    $ openstack host list | grep compute
    | ardana-cp1-comp0001-mgmt | compute     | AZ1      |
    | ardana-cp1-comp0002-mgmt | compute     | AZ2      |
  4. Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:

    openstack compute service set --disable --disable-reason "Maintenance mode" <hostname> nova-compute
    Note

    Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:

    openstack compute service set --enable <hostname> nova-compute
  5. At this point you have two choices:

    1. Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.

    2. Stop/start the instances: Issuing openstack server stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.

    If you choose the live migration route, see Section 15.1.3.3, “Live Migration of Instances” for more details. Skip to step #6 after you finish live migration.

    If you choose the stop/start method, continue on.

    1. List all of the instances on the node so you can issue stop commands to them:

      openstack server list --host <hostname> --all-tenants
    2. Issue the openstack server stop command against each of the instances:

      openstack server stop <instance uuid>
    3. Confirm that the instances are stopped. If the stop was successful, you should see the instances in a SHUTOFF state, as shown here:

      $ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
      | ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
      | ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
      +--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
    4. Do your required maintenance. If this maintenance does not take down the disks completely then you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:

      openstack server list --host <hostname> --all-tenants
    5. Start the instances back up using this command:

      openstack server start <instance uuid>

      Example:

      $ openstack server start ef31c453-f046-4355-9bd3-11e774b1772f
      Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.
    6. Confirm that the instances started back up. If restarting is successful you should see the instances in an ACTIVE state, as shown here:

      $ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
      | ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
      | ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
      +--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
    7. If openstack server start fails, you can try a hard reboot:

      openstack server reboot --hard <instance uuid>

      If this does not resolve the issue you may want to contact support.

  6. Re-enable provisioning when the node is fixed:

    openstack compute service set --enable <hostname> nova-compute

15.1.3.2 Rebooting a Compute Node

If all you need to do is reboot a Compute node, the following steps can be used.

You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. This playbook will ensure that all services on the Compute node are restarted properly.

  1. Log in to the Cloud Lifecycle Manager.

  2. Reboot the Compute node(s) with the following playbook.

    You can specify either single or multiple Compute nodes using the --limit switch.

    An optional reboot wait time can also be specified. If no reboot wait time is specified it will default to 300 seconds.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
    Note

    If the Compute node fails to reboot, you should troubleshoot this issue separately as this playbook will not attempt to recover after a failed reboot.

15.1.3.3 Live Migration of Instances

Live migration allows you to move active compute instances between compute nodes, allowing for less downtime during maintenance.

SUSE OpenStack Cloud nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use will depend on the state of the host, what operating system is on the host, what type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off of the host. We will describe these options on this page as well as give you step-by-step instructions for performing them.

15.1.3.3.1 Migration Options

If your compute node has failed

A compute host failure could be caused by a hardware fault, such as a data disk needing to be replaced, a loss of power, or any other failure that requires you to replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.

In these cases you will want to use one of the nova evacuate commands, which will cause nova to rebuild the instances on other hosts.

This table describes each of the evacuate options for failed compute nodes:


nova evacuate <instance> <hostname>

This command is used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified then the nova scheduler will choose one for you.

See nova help evacuate for more information and syntax. Further details can also be seen in the OpenStack documentation at http://docs.openstack.org/admin-guide/cli_nova_evacuate.html.

nova host-evacuate <hostname> --target_host <target_hostname>

This command is used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified then the nova scheduler will choose a target host for each instance.

See nova help host-evacuate for more information and syntax.

If your compute host is active, powered on, and the data disks are in working order, you can use the migration commands to migrate your compute instances. There are two migration features: "cold" migration (also referred to simply as "migration") and live migration. Migration and live migration are two different functions.

Cold migration is used to copy an instance's data in a SHUTOFF status from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the openstack server migrate function has been disabled by default, but you have the ability to enable this feature if you would like. Details on how to do this can be found in Section 6.4, “Enabling the Nova Resize and Migrate Features”.

Live migration can be performed on instances in either an ACTIVE or PAUSED state and uses the QEMU hypervisor to manage the copy of the running processes and associated resources to the destination compute host using the hypervisor's own protocol. It is therefore a more secure method and allows for less downtime. There may be a short network outage during a live migration, usually a few milliseconds, but it could be up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.

The compute host must remain powered on during the migration process.

Both the cold migration and live migration options honor nova group policies, which include affinity settings. There is a limitation to keep in mind if you use group policies; it is discussed in Section 15.1.3.3.2, “Limitations of these Features”.
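
Before migrating, you can inspect an instance's server group and its policy with standard OpenStack client commands; a short sketch (the group ID shown is a placeholder):

ardana > openstack server group list
ardana > openstack server group show <server_group_id>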

This table describes each of the migration options for active compute nodes:


openstack server migrate <instance_uuid>

Used to cold migrate a single instance from a compute host. The nova-scheduler will choose the new host.

This command will work against instances in an ACTIVE or SHUTOFF state. The instances, if active, will be shutdown and restarted. Instances in a PAUSED state cannot be cold migrated.

See the difference between cold migration and live migration at the start of this section.

 

nova host-servers-migrate <hostname>

Used to cold migrate all instances off a specified host to other available hosts, chosen by the nova-scheduler.

This command will work against instances in an ACTIVE or SHUTOFF state. The instances, if active, will be shutdown and restarted. Instances in a PAUSED state cannot be cold migrated.

See the difference between cold migration and live migration at the start of this section.

 

nova live-migration <instance_uuid> [<target host>]

Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached.

This command works against instances in ACTIVE or PAUSED states only.

Supported on SLES compute nodes.

nova live-migration --block-migrate <instance_uuid> [<target host>]

Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume.

This command works against instances in ACTIVE or PAUSED states only.

Supported on SLES compute nodes.

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached.

This command works against instances in ACTIVE or PAUSED states only.

Supported on SLES compute nodes.

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration.

Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume.

This command works against instances in ACTIVE or PAUSED states only.

Supported on SLES compute nodes.
15.1.3.3.2 Limitations of these Features

There are limitations that may impact your use of this feature:

  • To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state then cold migration should be used.

  • Instances in a Paused state cannot be live migrated using the horizon dashboard. You will need to utilize the python-novaclient CLI to perform these.

  • Both cold migration and live migration honor an instance's group policies. If you are utilizing an affinity policy and are migrating multiple instances you may run into an error stating no hosts are available to migrate to. To work around this issue you should specify a target host when migrating these instances, which will bypass the nova-scheduler. You should ensure that the target host you choose has the resources available to host the instances.

  • The nova host-evacuate-live command will produce an error if you have a compute host that has a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, utilizing the --block-migrate option. This is described in further detail in Section 15.1.3.3, “Live Migration of Instances”.

  • Instances on KVM hosts can only be live migrated to other KVM hosts.

  • The migration options described in this document are not available on ESX compute hosts.

  • Ensure that you read and take into account any other limitations documented in the release notes.

15.1.3.3.3 Performing a Live Migration

Cloud administrators can perform a migration on an instance using either the horizon dashboard, API, or CLI. Instances in a Paused state cannot be live migrated using the horizon GUI. You will need to utilize the CLI to perform these.

We have documented different scenarios:

15.1.3.3.4 Migrating instances off of a failed compute host
  1. Log in to the Cloud Lifecycle Manager.

  2. If the compute node is not already powered off, do so with this playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
    Note

    The value for <node_name> will be the name that Cobbler has when you run sudo cobbler system list from the Cloud Lifecycle Manager.

  3. Source the admin credentials necessary to run administrative commands against the nova API:

    source ~/service.osrc
  4. Force the nova-compute service to go down on the compute node:

    openstack compute service set --down HOSTNAME nova-compute
    Note

    The value for HOSTNAME can be obtained by using openstack host list from the Cloud Lifecycle Manager.

  5. Evacuate the instances off of the failed compute node. This will cause the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.

    For single instances on a failed host:

    nova evacuate <instance_uuid> <target_hostname>

    For all instances on a failed host:

    nova host-evacuate <hostname> [--target_host <target_hostname>]
  6. When you have repaired the failed node and started it back up again, the nova-compute process will clean up the evacuated instances when it starts.
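
    To confirm how the failed host and its nova-compute service are currently reported, you can list the compute services before and after the evacuation; for example:

    openstack compute service list --service nova-compute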

15.1.3.3.5 Migrating instances off of an active compute host

Migrating instances using the horizon dashboard

The horizon dashboard offers a GUI method for performing live migrations. Instances in a Paused state will not provide you the live migration option in horizon so you will need to use the CLI instructions in the next section to perform these.

  1. Log into the horizon dashboard with admin credentials.

  2. Navigate to the menu Admin › Compute › Instances.

  3. Next to the instance you want to migrate, select the drop down menu and choose the Live Migrate Instance option.

  4. In the Live Migrate wizard you will see the compute host the instance currently resides on and then a drop down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:

    Disk Over Commit - If this is not checked, the value will be False. Checking this box allows you to override the check that ensures the destination host has enough available disk space to host the instance.

    Block Migration - If this is not checked, the value will be False. Checking this box migrates the local disks by using block migration. Use this option if your instances use only ephemeral storage. If your instance uses block storage, ensure this box is not checked.

  5. To begin the live migration, click Submit.

Migrating instances using the python-novaclient CLI

To perform migrations from the command-line, use the python-novaclient. The Cloud Lifecycle Manager node in your cloud environment should have the python-novaclient already installed. If you will be accessing your environment through a different method, ensure that the python-novaclient is installed. You can do so using Python's pip package manager.
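For example, a minimal sketch of installing the clients with pip (assuming pip is available on the host; the unified python-openstackclient is included because the openstack commands below rely on it):

pip install python-novaclient python-openstackclient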

To run the commands in the steps below, you need administrator credentials. From the Cloud Lifecycle Manager, you can source the service.osrc file which is provided that has the necessary credentials:

source ~/service.osrc

Here are the steps to perform:

  1. Log in to the Cloud Lifecycle Manager.

  2. Identify the instances on the compute node you wish to migrate:

    openstack server list --all-tenants --host <hostname>

    Example showing a host with a single compute instance on it:

    ardana >  openstack server list --host ardana-cp1-comp0001-mgmt --all-tenants
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
    | ID                                   | Name | Tenant ID                        | Status | Task State | Power State | Networks              |
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
    | 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | -          | Running     | adminnetwork=10.0.0.5 |
    +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+
  3. When using live migration you can either specify a target host that the instance will be migrated to or you can omit the target to allow the nova-scheduler to choose a node for you. If you want to get a list of available hosts you can use this command:

    openstack host list
  4. Migrate the instance(s) on the compute node using the notes below.

    If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:

    nova live-migration <instance uuid> [<target compute host>]

    If your instance has local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s), you should use the --block-migrate option:

    nova live-migration --block-migrate <instance uuid> [<target compute host>]
    Note
    Note

    The [<target compute host>] option is optional. If you do not specify a target host then the nova scheduler will choose a node for you.
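    For example, to live migrate the instance listed earlier to a hypothetical target host, assuming the instance uses local ephemeral storage:

    nova live-migration --block-migrate 553ba508-2d75-4513-b69a-f6a2a08d04e3 ardana-cp1-comp0002-mgmt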

    Multiple instances

    If you want to live migrate all of the instances off a single compute host you can utilize the nova host-evacuate-live command.

    Issue the host-evacuate-live command, which will begin the live migration process.

    If all of the instances on the host are using at least one local (ephemeral) disk, you should use this syntax:

    nova host-evacuate-live --block-migrate <hostname>

    Alternatively, if all of the instances are only using block storage volumes then omit the --block-migrate option:

    nova host-evacuate-live <hostname>
    Note
    Note

    You can either let the nova-scheduler choose a suitable target host or you can specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.

15.1.3.3.6 Troubleshooting migration or host evacuate issues

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                                                                                        |
+--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 95a7ded8-ebfc-4848-9090-2df378c88a4c | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7)     |
| 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration cannot be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6)     |
+--------------------------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:

$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Server UUID                          | Live Migration Accepted | Error Message                                                                                                                                                                                                     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| e9874122-c5dc-406f-9039-217d9258c020 | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a)     |
| 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False                   | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112)     |
+--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:

nova host-evacuate-live <hostname> [--target-host <target_hostname>]

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)

Fix: This occurs when you are attempting to live migrate an instance that was booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration --block-migrate <instance_uuid> <target_hostname>

Issue: When attempting to use nova live-migration against an instance, you receive the error below:

$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt
ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)

Fix: This occurs when you are attempting to live migrate an instance that was booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live migration with this syntax:

nova live-migration <instance_uuid> <target_hostname>

15.1.3.4 Adding a Compute Node

Adding a Compute Node allows you to add capacity.

15.1.3.4.1 Adding a SLES Compute Node

Adding a SLES compute node allows you to add capacity for more virtual machines.

You may need to add additional SLES compute hosts for more virtual machine capacity or for another purpose; these steps will help you achieve this.

There are two methods you can use to add SLES compute hosts to your environment:

  1. Adding SLES pre-installed compute hosts. This method does not require the SLES ISO to be present on the Cloud Lifecycle Manager.

  2. Using the provided Ansible playbooks and Cobbler, SLES will be installed on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP4 ISO during the initial installation of your cloud, following the instructions at Section 31.1, “SLES Compute Node Installation Overview”.

    If you want to use the provided Ansible playbooks and Cobbler to set up and configure your SLES hosts, and you did not have the SUSE Linux Enterprise Server 12 SP4 ISO on your Cloud Lifecycle Manager during your initial installation, ensure you read the note at the top of that section before proceeding.

15.1.3.4.1.1 Prerequisites

You need to ensure your input model files are properly setup for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further at Section 31.3, “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Section 10.1, “SLES Compute Nodes”.

15.1.3.4.1.2 Adding a SLES compute node

Adding pre-installed SLES compute hosts

This method requires that you have SUSE Linux Enterprise Server 12 SP4 pre-installed on the baremetal host prior to beginning these steps.

  1. Ensure you have SUSE Linux Enterprise Server 12 SP4 pre-installed on your baremetal host.

  2. Log in to the Cloud Lifecycle Manager.

  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the following format. Note that we left out the IPMI details because they are not needed when you pre-install the SLES OS on your host(s).

    - id: compute4
      ip-addr: 192.168.102.70
      role: SLES-COMPUTE-ROLE
      server-group: RACK1

    You can find detailed descriptions of these fields in Section 6.5, “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

    Important
    Important

    You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
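    For example, a quick way to check is to grep for the address you plan to assign (192.168.102.70 in the snippet above); no output means no conflict was found in that file:

    ardana > grep 192.168.102.70 ~/openstack/my_cloud/info/address_info.yml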

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Section 6.2, “Control Plane” for more details.

  5. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Section 7.3.1, “Persisted Server Allocations” for information on how this works.

  8. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

    Note
    Note

    The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    The value to be used for <hostname> is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
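    If you are unsure of the identifier, you can search the verb_hosts file for the new node; with the default naming scheme, compute hostnames typically contain "comp" (for example, ardana-cp1-comp0004-mgmt), although your cloud's prefix may differ:

    ardana > grep comp ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts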
  9. Complete the compute host deployment with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
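    For example, if the new node was assigned the hostname ardana-cp1-comp0004-mgmt (hypothetical; confirm the actual name in hosts/verb_hosts):

    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit ardana-cp1-comp0004-mgmt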

Adding SLES compute hosts with Ansible playbooks and Cobbler

These steps will show you how to add the new SLES compute host to your servers.yml file and then run the playbooks that update your cloud configuration. You will run these playbooks from the lifecycle manager.

If you did not have the SUSE Linux Enterprise Server 12 SP4 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Chapter 31, Installing SLES Compute.

When you are prepared to continue, use these steps:

  1. Log in to your Cloud Lifecycle Manager.

  2. Checkout the site branch of your local git so you can begin to make the necessary edits:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > git checkout site
  3. Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).

    For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one you would add your details to the bottom of the file in this format:

    - id: compute4
      ip-addr: 192.168.102.70
      role: SLES-COMPUTE-ROLE
      server-group: RACK1
      mac-addr: e8:39:35:21:32:4e
      ilo-ip: 10.1.192.36
      ilo-password: password
      ilo-user: admin
      distro-id: sles12sp4-x86_64

    You can find detailed descriptions of these fields in Section 6.5, “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.

    Important
    Important

    You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.

    See Section 6.2, “Control Plane” for more details.

  5. Commit the changes to git:

    ardana > git add -A
    ardana > git commit -a -m "Add node <name>"
  6. Run the configuration processor and resolve any errors that are indicated:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. The following playbook confirms that your servers are accessible over their IPMI ports.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4
  8. Add the new node into Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
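    For example, if compute4 is the only new UEFI SLES node and its Cobbler name matches the server id used in servers.yml:

    ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=compute4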
  10. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
    Note
    Note

    If you do not know the <node name>, you can get it by using sudo cobbler system list.

    Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Section 7.3.1, “Persisted Server Allocations” for information on how this works.

  11. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  12. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your hosts are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
    Note
    Note

    You can obtain the <hostname> from the file ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.

  13. You should verify that the netmask, bootproto, and other necessary settings are correct; if they are not, correct them. See Chapter 31, Installing SLES Compute for details.

  14. Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you added with the --limit switch:

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
15.1.3.4.1.3 Adding a new SLES compute node to monitoring

If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

15.1.3.5 Removing a Compute Node

Removing a Compute node allows you to remove capacity.

You may have a need to remove a Compute node and these steps will help you achieve this.

15.1.3.5.1 Disable Provisioning on the Compute Host
  1. Get a list of the nova services running which will provide us with the details we need to disable the provisioning on the Compute host you are wanting to remove:

    ardana > openstack compute service list

    Here is an example below; the Compute node we are going to remove in these examples is ardana-cp1-comp0002-mgmt:

    ardana > openstack compute service list
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
    | Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
  2. Disable the nova service on the Compute node you are wanting to remove which will ensure it is taken out of the scheduling rotation:

    ardana > openstack compute service set --disable --disable-reason "enter reason here" <node hostname> nova-compute

    Here is an example if we wanted to disable the ardana-cp1-comp0002-mgmt node shown in the output above:

    ardana > openstack compute service set --disable --disable-reason "hardware reallocation" ardana-cp1-comp0002-mgmt nova-compute
    +--------------------------+--------------+----------+-----------------------+
    | Host                     | Binary       | Status   | Disabled Reason       |
    +--------------------------+--------------+----------+-----------------------+
    | ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
    +--------------------------+--------------+----------+-----------------------+
15.1.3.5.2 Remove the Compute Host from its Availability Zone

If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.

  1. Get a list of the nova services running which will provide us with the details we need to remove a Compute node:

    ardana > openstack compute service list

    Here is an example below; the Compute node we are going to remove in these examples is ardana-cp1-comp0002-mgmt:

    ardana > openstack compute service list
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
    | Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason       |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -                     |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -                     |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -                     |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | hardware reallocation |
    +----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
  2. If the Zone reported for this host is simply "nova", then it is not a member of a particular availability zone, and this step will not be necessary. Otherwise, you must remove the Compute host from its availability zone:

    ardana > openstack aggregate remove host <availability zone> <nova hostname>

    So for the same example in the previous step, the ardana-cp1-comp0002-mgmt host was in the AZ2 availability zone so you would use this command to remove it:

    ardana > openstack aggregate remove host AZ2 ardana-cp1-comp0002-mgmt
    Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
    +----+------+-------------------+-------+-------------------------+
    | Id | Name | Availability Zone | Hosts | Metadata                |
    +----+------+-------------------+-------+-------------------------+
    | 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
    +----+------+-------------------+-------+-------------------------+
  3. You can confirm the last two steps completed successfully by running another openstack compute service list.

    Here is an example which confirms that the node has been disabled and that it has been removed from the availability zone (see the ardana-cp1-comp0002-mgmt row):

    ardana > openstack compute service list
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
    | 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
    | 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
    | 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
    | 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
    | 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
    | 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
    +----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
15.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts
  1. You will need to verify if the Compute node is currently hosting any instances on it. You can do this with the command below:

    ardana > openstack server list --host <nova hostname> --all-projects

    Here is an example below which shows that we have a single running instance on this node currently:

    ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
    | ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
    | 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
    +--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
  2. You will likely want to migrate this instance off of this node before removing it. You can do this with the live migration functionality within nova. The command will look like this:

    ardana > nova live-migration --block-migrate <nova instance ID>

    Here is an example using the instance in the previous step:

    ardana > nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9

    You can check the status of the migration using the same command from the previous step:

    ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
    | ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
    | 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
    +--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
  3. List the compute instances again to see that the running instance has been migrated:

    ardana > openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
    +----+------+-----------+--------+------------+-------------+----------+
    | ID | Name | Tenant ID | Status | Task State | Power State | Networks |
    +----+------+-----------+--------+------------+-------------+----------+
    +----+------+-----------+--------+------------+-------------+----------+
15.1.3.5.4 Disable Neutron Agents on Node to be Removed

You should also locate and disable or remove neutron agents. To see the neutron agents running:

ardana > openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

ardana > openstack network agent set --disable 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana > openstack network agent set --disable dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana > openstack network agent set --disable f0d262d1-7139-40c7-bdc2-f227c6dee5c8
15.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host

To perform this step you have a few options. You can SSH into the Compute host and run the following commands:

tux > sudo systemctl stop nova-compute
tux > sudo systemctl stop neutron-*

Because the neutron agent self-registers against neutron server, you may want to prevent the following services from coming back online. Here is how you can get the list:

tux > sudo systemctl list-units neutron-* --all

Here are the results:

UNIT                                  LOAD        ACTIVE     SUB      DESCRIPTION
neutron-common-rundir.service         loaded      inactive   dead     Create /var/run/neutron
•neutron-dhcp-agent.service         not-found     inactive   dead     neutron-dhcp-agent.service
neutron-l3-agent.service              loaded      inactive   dead     neutron-l3-agent Service
neutron-metadata-agent.service        loaded      inactive   dead     neutron-metadata-agent Service
•neutron-openvswitch-agent.service    loaded      failed     failed   neutron-openvswitch-agent Service
neutron-ovs-cleanup.service           loaded      inactive   dead     neutron OVS Cleanup Service

        LOAD   = Reflects whether the unit definition was properly loaded.
        ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
        SUB    = The low-level unit activation state, values depend on unit type.

        7 loaded units listed.
        To show all installed unit files use 'systemctl list-unit-files'.

For each loaded service issue the command

tux > sudo systemctl disable service-name

In the above example, that would be each service except neutron-dhcp-agent.service.

For example:

tux > sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-metadata-agent neutron-openvswitch-agent

Now you can shut down the node:

tux > sudo shutdown now

OR

From the Cloud Lifecycle Manager you can use the bm-power-down.yml playbook to shut down the node:

ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node name>
Note
Note

The <node name> value will be the name corresponding to this node in Cobbler. You can run sudo cobbler system list to retrieve these names.

15.1.3.5.6 Delete the Compute Host from Nova

Retrieve the list of nova services:

ardana > openstack compute service list

Here is an example highlighting the Compute host we're going to remove:

ardana > openstack compute service list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+

Delete the host from nova using the command below:

ardana > openstack compute service delete <service ID>

Following our example above, you would use:

ardana > openstack compute service delete 37

Use the command below to confirm that the Compute host has been completely removed from nova:

ardana > openstack hypervisor list
15.1.3.5.7 Delete the Compute Host from Neutron

Multiple neutron agents are running on the compute node. You have to remove all of the agents running on the node using the openstack network agent delete command. In the example below, the l3-agent, openvswitch-agent and metadata-agent are running:

ardana > openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | False          | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+

$ openstack network agent delete AGENT_ID

$ openstack network agent delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
$ openstack network agent delete dbe4fe11-8f08-4306-8244-cc68e98bb770
$ openstack network agent delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
15.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor

Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:

  1. Log in to the Cloud Lifecycle Manager

  2. Edit your servers.yml file in the location below to remove references to the Compute node(s) you want to remove:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > vi servers.yml
  3. You may also need to edit your control_plane.yml file to update the values for member-count, min-count, and max-count, if you specified them, so that they reflect the exact number of nodes you are using.

    See Section 6.2, “Control Plane” for more details.

  4. Commit the changes to git:

    ardana > git commit -a -m "Remove node NODE_NAME"
  5. To release the network capacity allocated to the deleted server(s), use the switches remove_deleted_servers and free_unused_addresses when running the configuration processor. (For more information, see Section 7.3, “Persisted Data”.)

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml \
      -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Refresh the /etc/hosts file across the cloud to remove references to the old node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
15.1.3.5.9 Remove the Compute Host from Cobbler

Complete these steps to remove the node from Cobbler:

  1. Confirm the system name in Cobbler with this command:

    tux > sudo cobbler system list
  2. Remove the system from Cobbler using this command:

    tux > sudo cobbler system remove --name=<node name>
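    For example, if Cobbler lists the node as ardana-cp1-comp0002-mgmt (a hypothetical name; use the one returned by the list command above):

    tux > sudo cobbler system remove --name=ardana-cp1-comp0002-mgmt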
  3. Run the cobbler-deploy.yml playbook to complete the process:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
15.1.3.5.10 Remove the Compute Host from Monitoring

Once you have removed the Compute nodes, the alarms against them will trigger, so there are additional steps to take to resolve this.

To find all monasca API servers:

tux > sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
    bind ardana-cp1-vip-public-MON-API-extapi:8070  ssl crt /etc/ssl/private//my-public-cert-entry-scale
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
listen ardana-cp1-vip-MON-API-mgmt-8070
    bind ardana-cp1-vip-MON-API-mgmt:8070  ssl crt /etc/ssl/private//ardana-internal-cert
    server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
    server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5

In the above example, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the monasca API servers.

You will want to SSH to each of the monasca API servers and edit the /etc/monasca/agent/conf.d/host_alive.yaml file to remove references to the Compute node you removed. This will require sudo access. The entries will look similar to the one below:

- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping

Once you have removed the references on each of your monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the Compute node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the monasca CLI which should be installed on each of your monasca API servers by default:

ardana > monasca alarm-list --metric-dimensions hostname=<deleted compute node hostname>

For example, if your Compute node looked like the example above then you would use this command to get the alarm ID:

ardana > monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt

You can then delete the alarm with this command:

ardana > monasca alarm-delete <alarm ID>
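For example, using a hypothetical alarm ID taken from the alarm-list output above:

ardana > monasca alarm-delete 0f4f4a9d-dc0d-49f1-a74b-8f5e12c58f63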

15.1.4 Planned Network Maintenance

Planned maintenance task for networking nodes.

15.1.4.1 Adding a Network Node

Adding an additional neutron network node allows you to increase the performance of your cloud.

You may need to add an additional neutron network node for increased performance or for another purpose; these steps will help you achieve this.

15.1.4.1.1 Prerequisites

If you are using the mid-scale model then your networking nodes are already separate and the roles are defined. If you are not already using this model and wish to add separate networking nodes then you need to ensure that those roles are defined. You can look in the ~/openstack/examples folder on your Cloud Lifecycle Manager for the mid-scale example model files which show how to do this. We have also added the basic edits that need to be made below:

  1. In your server_roles.yml file, ensure you have the NEUTRON-ROLE defined.

    Path to file:

    ~/openstack/my_cloud/definition/data/server_roles.yml

    Example snippet:

    - name: NEUTRON-ROLE
      interface-model: NEUTRON-INTERFACES
      disk-model: NEUTRON-DISKS
  2. In your net_interfaces.yml file, ensure you have the NEUTRON-INTERFACES defined.

    Path to file:

    ~/openstack/my_cloud/definition/data/net_interfaces.yml

    Example snippet:

    - name: NEUTRON-INTERFACES
      network-interfaces:
      - device:
          name: hed3
        name: hed3
        network-groups:
        - EXTERNAL-VM
        - GUEST
        - MANAGEMENT
  3. Create a disks_neutron.yml file, ensure you have the NEUTRON-DISKS defined in it.

    Path to file:

    ~/openstack/my_cloud/definition/data/disks_neutron.yml

    Example snippet:

      product:
        version: 2
    
      disk-models:
      - name: NEUTRON-DISKS
        volume-groups:
          - name: ardana-vg
            physical-volumes:
             - /dev/sda_root
    
            logical-volumes:
            # The policy is not to consume 100% of the space of each volume group.
            # 5% should be left free for snapshots and to allow for some flexibility.
              - name: root
                size: 35%
                fstype: ext4
                mount: /
              - name: log
                size: 50%
                mount: /var/log
                fstype: ext4
                mkfs-opts: -O large_file
              - name: crash
                size: 10%
                mount: /var/crash
                fstype: ext4
                mkfs-opts: -O large_file
  4. Modify your control_plane.yml file, ensure you have the NEUTRON-ROLE defined as well as the neutron services added.

    Path to file:

    ~/openstack/my_cloud/definition/data/control_plane.yml

    Example snippet:

      - allocation-policy: strict
        cluster-prefix: neut
        member-count: 1
        name: neut
        server-role: NEUTRON-ROLE
        service-components:
        - ntp-client
        - neutron-vpn-agent
        - neutron-dhcp-agent
        - neutron-metadata-agent
        - neutron-openvswitch-agent

You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Chapter 2, Hardware and Software Support Matrix.

15.1.4.1.2 Adding a network node

These steps will show you how to add the new network node to your servers.yml file and then run the playbooks that update your cloud configuration. You will run these playbooks from the lifecycle manager.

  1. Log in to your Cloud Lifecycle Manager.

  2. Checkout the site branch of your local git so you can begin to make the necessary edits:

    ardana > cd ~/openstack/my_cloud/definition/data
    ardana > git checkout site
  3. In the same directory, edit your servers.yml file to include the details about your new network node(s).

    For example, if you already had a cluster of three network nodes and needed to add a fourth one you would add your details to the bottom of the file in this format:

    # network nodes
    - id: neut3
      ip-addr: 10.13.111.137
      role: NEUTRON-ROLE
      server-group: RACK2
      mac-addr: "5c:b9:01:89:b6:18"
      nic-mapping: HP-DL360-6PORT
      ilo-ip: 10.1.12.91
      ilo-password: password
      ilo-user: admin
    Important
    Important

    You will need to verify that the ip-addr value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.

  4. In your control_plane.yml file you will need to check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.

  5. Commit the changes to git:

    ardana > git commit -a -m "Add new networking node <name>"
  6. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Add the new node into Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  9. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>
    Note
    Note

    If you do not know the <hostname>, you can get it by using sudo cobbler system list.

  10. [OPTIONAL] - Run the wipe_disks.yml playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other case, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
  11. Configure the operating system on the new networking node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
  12. Complete the networking node deployment with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>
  13. Run the site.yml playbook with the required tag so that all other services become aware of the new node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
15.1.4.1.3 Adding a New Network Node to Monitoring

If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"

15.1.5 Planned Storage Maintenance

Planned maintenance procedures for swift storage nodes.

15.1.5.1 Planned Maintenance Tasks for swift Nodes

Planned maintenance tasks including recovering, adding, and removing swift nodes.

15.1.5.1.1 Adding a Swift Object Node

Adding additional object nodes allows you to increase capacity.

This topic describes how to add additional swift object server nodes to an existing system.

15.1.5.1.1.1 To add a new node

To add a new node to your cloud, you will need to add it to servers.yml and then run the playbooks that update your cloud configuration. To begin, access the servers.yml file by checking out the Git branch where you are required to make the changes, and then perform the following steps to add the new node:

  1. Log in to the Cloud Lifecycle Manager node.

  2. Get the servers.yml file stored in Git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  4. Add the details of new nodes to the servers.yml file. In the following example only one new server swobj4 is added. However, you can add multiple servers by providing the server details in the servers.yml file:

    servers:
    ...
    - id: swobj4
      role: SWOBJ_ROLE
      server-group: <server-group-name>
      mac-addr: <mac-address>
      nic-mapping: <nic-mapping-name>
      ip-addr: <ip-address>
      ilo-ip: <ilo-ip-address>
      ilo-user: <ilo-username>
      ilo-password: <ilo-password>
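    For example, a filled-in entry might look like the following sketch; every value shown here is hypothetical and must be replaced with values that match your environment (server group, MAC address, NIC mapping, IP addresses, and iLO credentials):

    - id: swobj4
      role: SWOBJ_ROLE
      server-group: RACK1
      mac-addr: e8:39:35:21:32:4f
      nic-mapping: HP-DL360-4PORT
      ip-addr: 192.168.245.47
      ilo-ip: 10.1.192.48
      ilo-user: admin
      ilo-password: password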
  5. Commit your changes:

    git add -A
    git commit -m "Add Node <name>"
    Note
    Note

    Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:

    export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY

    For instructions, see Chapter 30, Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only.

  6. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  7. Create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  8. Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list with the nodelist argument):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>

    In the following example, the server id is swobj4 (mentioned in step 4):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
    Note
    Note

    You must use the server id as it appears in the file servers.yml in the field id.

  9. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    The hostname of the newly added server can be found in the list generated from the output of the following command:

    grep hostname ~/openstack/my_cloud/info/server_info.yml

    For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt
  10. Validate that the disk drives of the new node are compatible with the disk model used by the node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*

    If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  11. Run the following playbook to ensure that the hosts files on all other servers are updated with the new server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
  12. Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
  13. You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

    For example:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
15.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node

Steps for adding additional PAC nodes to your swift system.

This topic describes how to add additional swift proxy, account, and container (PAC) servers to an existing system.

15.1.5.1.2.1 Adding a new node

To add a new node to your cloud, you need to add it to servers.yml and then run the playbooks that update your cloud configuration. To begin, check out the Git branch where the changes must be made so that you can access the servers.yml file. Then perform the following steps to add the new node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Get the servers.yml file stored in Git:

    cd ~/openstack/my_cloud/definition/data
    git checkout site
  3. If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.

  4. Add details of new nodes to the servers.yml file:

    servers:
    ...
    - id: swpac6
      role: SWPAC-ROLE
      server-group: <server-group-name>
      mac-addr: <mac-address>
      nic-mapping: <nic-mapping-name>
      ip-addr: <ip-address>
      ilo-ip: <ilo-ip-address>
      ilo-user: <ilo-username>
      ilo-password: <ilo-password>

    In the above example, only one new server swpac6 is added. However, you can add multiple servers by providing the server details in the servers.yml file.

    In the entry-scale configurations there is no dedicated swift PAC cluster. Instead, there is a cluster using servers that have a role of CONTROLLER-ROLE. You cannot add additional nodes dedicated exclusively to swift PAC because that would change the member-count of the entire cluster. In that case, to create a dedicated swift PAC cluster, you will need to add it to the configuration files. For details on how to do this, see Section 11.7, “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.

    If you are using new PAC nodes, you must add the PAC nodes' configuration details to the following YAML files:

    control_plane.yml
    disks_pac.yml
    net_interfaces.yml
    servers.yml
    server_roles.yml

    You can see a good example of this in the example configurations for the mid-scale model in the ~/openstack/examples/mid-scale-kvm directory.

    The following steps assume that you have already created a dedicated swift PAC cluster and that it already contains member nodes (such as swpac4 and swpac5).

  5. Set the member count of the swift PAC cluster to match the number of nodes. For example, if you are adding swpac6 as the 6th swift PAC node, the member count should be increased from 5 to 6 as shown in the following example:

    control-planes:
      - name: control-plane-1
        control-plane-prefix: cp1
        . . .
        clusters:
          . . .
          - name: swpac
            cluster-prefix: swpac
            server-role: SWPAC-ROLE
            member-count: 6
          . . .
  6. Commit your changes:

    git add -A
    git commit -m "Add Node <name>"
    Note
    Note

    Before you run any playbooks, remember that you need to export the encryption key in the following environment variable:

    export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY

    For instructions, see Chapter 30, Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only.

  7. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  8. Create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  9. Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the nodelist argument):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>

    In the following example, the server id is swpac6 (mentioned in step 4):

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
    Note
    Note

    You must use the server id as it appears in the file servers.yml in the field id.

  10. Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
  11. Validate that the disk drives of the new node are compatible with the disk model used by the node:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml

    If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  12. Run the following playbook to ensure that the host files on all other servers are updated with the new server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
  13. Run the ardana-deploy.yml playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swpac6) that you are adding:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
  14. You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

15.1.5.1.3 Adding Additional Disks to a Swift Node

Steps for adding additional disks to any nodes hosting swift services.

You may need to add additional disks to a node for swift usage, and these steps show you how. They apply to adding additional disks to swift object or proxy, account, container (PAC) nodes. They also apply to adding additional disks to a controller node that hosts the swift service, as in the entry-scale example models.

Read through the notes below before beginning the process.

You can add multiple disks at the same time; there is no need to add them one at a time.

Important
Important: Add the Same Number of Disks

You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three swift servers.

15.1.5.1.3.1 Adding additional disks to your Swift servers
  1. Verify the general health of the swift system and that it is safe to rebalance your rings. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.

  2. Perform the disk maintenance.

    1. Shut down the first swift server you wish to add disks to.

    2. Add the additional disks to the physical server. The disk drives that are added should be clean. They should either contain no partitions or a single partition the size of the entire disk. It should not contain a file system or any volume groups. Failure to comply will cause errors and the disk will not be added.

      For more details, see Section 11.6, “Swift Requirements for Device Group Drives”.
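
      A quick way to confirm that a drive is clean before adding it (this is a sketch; /dev/sdd is an assumed device name, so substitute the actual device) is to list its partitions and any existing filesystem or volume signatures:

      lsblk /dev/sdd
      sudo wipefs /dev/sdd

      Run without options, wipefs only reports signatures and does not erase anything; empty output means no file system, RAID, or LVM signature was found.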

    3. Power the server on.

    4. While the server was shut down, data that normally would have been placed on the server is placed elsewhere. When the server is rebooted, the swift replication process will move that data back onto the server. Monitor the replication process to determine when it is complete. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.

    5. Repeat the steps from Step 2.a for each of the swift servers you are adding the disks to, one at a time.

      Note
      Note

      If the additional disks can be added to the swift servers online (for example, via hotplugging) then there is no need to perform the last two steps.

  3. On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.

    1. Edit the disk configuration file that correlates to the type of server you are adding your new disks to.

      Path to the typical disk configuration files:

      ~/openstack/my_cloud/definition/data/disks_swobj.yml
      ~/openstack/my_cloud/definition/data/disks_swpac.yml
      ~/openstack/my_cloud/definition/data/disks_controller_*.yml

      Example showing the addition of a single new disk, /dev/sdd:

      device-groups:
        - name: swiftObject
          devices:
            - name: "/dev/sdb"
            - name: "/dev/sdc"
            - name: "/dev/sdd"
          consumer:
            name: swift
            ...
      Note
      Note

      For more details on how the disk model works, see Chapter 6, Configuration Objects.

    2. Configure the swift weight-step value in the ~/openstack/my_cloud/definition/data/swift/swift_config.yml file. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.

    3. Commit the changes to Git:

      cd ~/openstack
      git commit -a -m "adding additional swift disks"
    4. Run the configuration processor:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost config-processor-run.yml
    5. Update your deployment directory:

      cd ~/openstack/ardana/ansible
      ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Run the osconfig-run.yml playbook against the swift nodes you have added disks to. Use the --limit switch to target the specific nodes:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>

    You can use a wildcard when specifying the hostnames with the --limit switch. If you added disks to all of the swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...) then you can use a wildcard such as ardana-cp1-swobj*. If you only added disks to some of the nodes, use a comma-delimited list of the hostnames of the nodes you added disks to.
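
    For example, to run against all object nodes that share a common hostname prefix (the prefix below comes from the example above; use the prefix that matches your own nodes):

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj*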

  5. Validate your swift configuration with this playbook which will also provide details of each drive being added:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
  6. Verify that swift services are running on all of your servers:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-status.yml
  7. If everything looks okay with the swift status, then apply the changes to your swift rings with this playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  8. At this point your swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

15.1.5.1.4 Removing a Swift Node

Removal process for both swift Object and PAC nodes.

You can use this process when you want to remove one or more swift nodes permanently. This process applies to both swift Proxy, Account, Container (PAC) nodes and swift Object nodes.

15.1.5.1.4.1 Setting the Pass-through Attributes

This process will remove the swift node's drives from the rings and rebalance their responsibilities among the remaining nodes in your cluster. Note that removal will not succeed if it causes the number of remaining disks in the cluster to decrease below the replica count of its rings.

  1. Log in to the Cloud Lifecycle Manager.

  2. Ensure that the weight-step attribute is set. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.

  3. Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include in your ~/openstack/my_cloud/definition/data/servers.yml file since your server IDs are already listed in that file. For more information about pass-through, see Section 6.17, “Pass Through”.

    Here is the format required, which can be inserted at the topmost level of indentation in your file (typically 2 spaces):

    pass-through:
      servers:
        - id: server-id
          data:
            subsystem:
              subsystem-attributes

    Here is an example:

    ---
      product:
        version: 2
    
      pass-through:
        servers:
          - id: ccn-0001
            data:
              swift:
                drain: yes

    If a pass-through definition already exists in any of your input model data files, just include the additional data for the server which you are removing instead of defining an entirely new pass-through block.

    By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute as described in the previous step. This process is known as "draining", where you remove the swift data from the node in preparation for removing the node.

  4. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Use the playbook to create a deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Run the swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  8. Wait until the replication has completed. For further details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”

  9. Determine whether all of the partitions have been removed from all drives on the swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:

    cd /etc/swiftlm/cloud1/cp1/builder_dir/
    sudo swift-ring-builder ring_name.builder

    For example, if the node you are removing was part of the object-0 ring the command would be:

    sudo swift-ring-builder object-0.builder

    Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:

    $ cd /etc/swiftlm/cloud1/cp1/builder_dir/
    $ sudo swift-ring-builder object-0.builder
    object-0.builder, build version 6
    4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
    The minimum number of hours before a partition can be reassigned is 16
    The overload factor is 0.00% (0.000000)
    Devices:    id  region  zone      ip address  port  replication ip  replication port      name weight partitions balance meta
                 0       1     1   192.168.245.3  6002   192.168.245.3              6002     disk0   0.00          0   -0.00 padawan-ccp-c1-m1:disk0:/dev/sdc
                 1       1     1   192.168.245.3  6002   192.168.245.3              6002     disk1   0.00          0   -0.00 padawan-ccp-c1-m1:disk1:/dev/sdd
                 2       1     1   192.168.245.4  6002   192.168.245.4              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m2:disk0:/dev/sdc
                 3       1     1   192.168.245.4  6002   192.168.245.4              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m2:disk1:/dev/sdd
                 4       1     1   192.168.245.5  6002   192.168.245.5              6002     disk0  18.63       2048   -0.00 padawan-ccp-c1-m3:disk0:/dev/sdc
                 5       1     1   192.168.245.5  6002   192.168.245.5              6002     disk1  18.63       2048   -0.00 padawan-ccp-c1-m3:disk1:/dev/sdd
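
    To repeat this check across every ring in one pass, a short loop like the following can help (it reuses the builder directory and the drained node's IP address from the example above; substitute your own values):

    cd /etc/swiftlm/cloud1/cp1/builder_dir/
    for ring in *.builder; do echo "$ring"; sudo swift-ring-builder "$ring" | grep 192.168.245.3; done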
  10. If the number of partitions is zero for the server on all rings, you can move to the next step, otherwise continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.

  11. If the number of partitions is zero for the server on all rings, you can remove the swift node's drives from all rings. Edit the pass-through data you created in step 3 and set the remove attribute as shown in this example:

    ---
      product:
        version: 2
    
      pass-through:
        servers:
          - id: ccn-0001
            data:
              swift:
                remove: yes
  12. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  13. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml
  14. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  15. Run the swift deploy playbook to rebuild the rings by removing the server:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  16. At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.

15.1.5.1.4.2 To Disable Swift on a Node

The next phase in this process will disable the swift service on the node. In this example, swobj4 is the node being removed from swift.

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop swift services on the node using the swift-stop.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit hostname
    Note
    Note

    When using the --limit argument, you must specify the full hostname (for example: ardana-cp1-swobj0004) or use the wild card * (for example, *swobj0004*).

    The following example uses the swift-stop.yml playbook to stop swift services on ardana-cp1-swobj0004:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
  3. Remove the configuration files.

    ssh ardana-cp1-swobj0004-mgmt sudo rm -R /etc/swift
    Note
    Note

    Do not run any other playbooks until you have finished the process described in Section 15.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate /etc/swift and restart swift on swobj4. If you accidentally run a playbook, repeat the process in Section 15.1.5.1.4.2, “To Disable Swift on a Node”.

15.1.5.1.4.3 To Remove a Node from the Input Model

Use the following steps to finish the process of removing the swift node.

  1. Log in to the Cloud Lifecycle Manager.

  2. Edit the ~/openstack/my_cloud/definition/data/servers.yml file and remove the entry for the node (swobj4 in this example). In addition, remove the related entry you created in the pass-through section earlier in this process.

  3. If this was a SWPAC node, reduce the member-count attribute by 1 in the ~/openstack/my_cloud/definition/data/control_plane.yml file. For SWOBJ nodes, no such action is needed.

  4. Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:

    cd ~/openstack/ardana/ansible
    git add -A
    git commit -m "My config or other commit message"
  5. Run the configuration processor:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost config-processor-run.yml

    Using the remove_deleted_servers and free_unused_addresses switches is recommended to free up the resources associated with the removed node when running the configuration processor. For more information, see Section 7.3, “Persisted Data”.

    ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
  6. Update your deployment directory:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Validate the changes you have made to the configuration files using the playbook below before proceeding further:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*

    If any errors occur, correct them in your configuration files and repeat the preceding steps until no errors occur before going to the next step.

    For more details on how to interpret and resolve errors, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.

  8. Remove the node from Cobbler:

    sudo cobbler system remove --name=swobj4
  9. Run the Cobbler deploy playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  10. The final step will depend on what type of swift node you are removing.

    If the node was a SWPAC node, run the ardana-deploy.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml

    If the node was a SWOBJ node (and not a SWPAC node), run the swift-deploy.yml playbook:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-deploy.yml
  11. Wait until replication has finished. For more details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.

  12. You may need to continue to rebalance the rings. For instructions, see Final Rebalance Phase at Section 9.5.5, “Applying Input Model Changes to Existing Rings”.

15.1.5.1.4.4 Remove the Swift Node from Monitoring

Once you have removed the swift node(s), the alarms against them will trigger, so there are additional steps to take to resolve this issue.

Connect to each of the nodes in your cluster running the monasca-api service (as defined in ~/openstack/my_cloud/definition/data/control_plane.yml) and use sudo vi /etc/monasca/agent/conf.d/host_alive.yaml to delete all references to the swift node(s) you removed.

Once you have removed the references on each of your monasca API servers you then need to restart the monasca-agent on each of those servers with this command:

tux > sudo service openstack-monasca-agent restart

With the swift node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the monasca CLI which should be installed on each of your monasca API servers by default:

monasca alarm-list --metric-dimensions hostname=<hostname of the removed swift node>

You can then delete the alarm with this command:

monasca alarm-delete <alarm ID>
15.1.5.1.5 Replacing a swift Node

Maintenance steps for replacing a failed swift node in your environment.

This process is used when you want to replace a failed swift node in your cloud.

Warning
Warning

If it applies to the server, do not skip the step below that restores the swift ring builder files to the replacement node. If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but it may move most objects in your system to new locations and may make data unavailable until the replication process has completed.

15.1.5.1.5.1 How to replace a swift node in your environment
  1. Log in to the Cloud Lifecycle Manager.

  2. Power off the node.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=OLD_SWIFT_CONTROLLER_NODE
  3. Update your cloud configuration with the details of your replacement swift node.

    1. Edit your servers.yml file to include the details (MAC address, IPMI user, password, and IPMI IP address, if these have changed) of your replacement swift node.

      Note
      Note

      Do not change the server's IP address (that is, ip-addr).

      Path to file:

      ~/openstack/my_cloud/definition/data/servers.yml

      Example showing the fields to edit (mac-addr, ilo-ip, ilo-user, and ilo-password):

       - id: swobj5
         role: SWOBJ-ROLE
         server-group: rack2
         mac-addr: 8c:dc:d4:b5:cb:bd
         nic-mapping: HP-DL360-6PORT
         ip-addr: 10.243.131.10
         ilo-ip: 10.1.12.88
         ilo-user: iLOuser
         ilo-password: iLOpass
         ...
    2. Commit the changes to Git:

      ardana > cd ~/openstack
      ardana > git commit -a -m "replacing a swift node"
    3. Run the configuration processor:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    4. Update your deployment directory:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  4. Prepare SLES:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook prepare-sles-loader.yml
    ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=NEW_REPLACEMENT_NODE
  5. Update Cobbler and reimage your replacement swift node:

    1. Obtain the name in Cobbler for your node you wish to remove. You will use this value to replace <node name> in future steps.

      ardana > sudo cobbler system list
    2. Remove the replaced swift node from Cobbler:

      ardana > sudo cobbler system remove --name <node name>
    3. Re-run the cobbler-deploy.yml playbook to add the replaced node:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
    4. Reimage the node using this playbook:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
  6. Wipe the disks on the NEW REPLACEMENT NODE. This action will not affect the OS partitions on the server.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit NEW_REPLACEMENT_NODE
  7. Complete the deployment of your replacement swift node.

    1. Obtain the hostname for your new swift node. Use this value to replace <hostname> in future steps.

      ardana > cat ~/openstack/my_cloud/info/server_info.yml
    2. Configure the operating system on your replacement swift node:

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
    3. If this is the swift ring builder server, restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.

    4. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor, include the --ask-vault-pass argument.

      ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
15.1.5.1.6 Replacing Drives in a swift Node

Maintenance steps for replacing drives in a swift node.

This process is used when you want to remove a failed hard drive from swift node and replace it with a new one.

There are two different classes of drives in a swift node that may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. There is a different procedure for replacing each class of drive to bring the node back to normal.

15.1.5.1.6.1 To Replace the Operating System Disk Drive

After the operating system disk drive is replaced, the node must be reimaged.

  1. Log in to the Cloud Lifecycle Manager.

  2. Update your Cobbler profile:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/localhost cobbler-deploy.yml
  3. Reimage the node using this playbook:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>

    In the example below, the swobj2 server is reimaged:

    cd ~/openstack/ardana/ansible
    ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2
  4. Review the cloudConfig.yml and data/control_plane.yml files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>

    In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
  5. If this is the first server running the swift-proxy service, restore the swift Ring Builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.

  6. Configure services on the node using the ardana-deploy.yml playbook. If you have used an encryption password when running the configuration processor include the --ask-vault-pass argument.

    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass \
      --limit <hostname>

    For example:

    ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
15.1.5.1.6.2 To Replace a Storage Disk Drive

After a storage drive is replaced, there is no need to reimage the server. Instead, run the swift-reconfigure.yml playbook.

  1. Log in to the Cloud Lifecycle Manager.

  2. Run the following commands:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>

    In the following example, the server used is swobj2:

    cd ~/scratch/ansible/next/ardana/ansible
    ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt

15.1.6 Updating MariaDB with Galera

Updating MariaDB with Galera must be done manually. Updates are not installed automatically. This is particularly an issue with upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.

Using the CLI, update MariaDB with the following procedure:

  1. Mark Galera as unmanaged:

    crm resource unmanage galera

    Or put the whole cluster into maintenance mode:

    crm configure property maintenance-mode=true
  2. Pick a node other than the one currently targeted by the loadbalancer and stop MariaDB on that node:

    crm_resource --wait --force-demote -r galera -V
  3. Perform updates:

    1. Uninstall the old versions of MariaDB and the Galera wsrep provider.

    2. Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.

    3. Change configuration options if necessary.

  4. Start MariaDB on the node.

    crm_resource --wait --force-promote -r galera -V
  5. Run mysql_upgrade with the --skip-write-binlog option.
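
    For example (a minimal sketch, assuming you run it as root on the node you just promoted; depending on your setup you may need to supply database credentials with -u and -p):

    sudo mysql_upgrade --skip-write-binlog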

  6. On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run mysql_upgrade.

  7. Mark Galera as managed:

    crm resource manage galera

    Or take the cluster out of maintenance mode.
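
    If you put the cluster into maintenance mode in step 1, the corresponding command to leave it is:

    crm configure property maintenance-mode=false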

15.2 Unplanned System Maintenance

Unplanned maintenance tasks for your cloud.

15.2.1 Whole Cloud Recovery Procedures

Unplanned maintenance procedures for your whole cloud.

15.2.1.1 Full Disaster Recovery

In this disaster scenario, you have lost everything in your cloud. In other words, you have lost access to all data stored in the cloud that was not backed up to an external backup location, including:

  • Data in swift object storage

  • glance images

  • cinder volumes

  • Metering, Monitoring, and Logging (MML) data

  • Workloads running on compute resources

In effect, the following recovery process creates a minimal new cloud with the existing identity information. Much of the operating state and data would have been lost, as would running workloads.

Important
Important

We recommend keeping backups of your data external to your cloud, covering as much as possible of the resource types listed above. Most workloads that were running can potentially be recreated from sufficient external backups.

15.2.1.1.1 Install and Set Up a Cloud Lifecycle Manager Node

Before beginning the process of a full cloud recovery, you need to install and set up a Cloud Lifecycle Manager node as though you are creating a new cloud. There are several steps in that process:

  1. Install the appropriate version of SUSE Linux Enterprise Server

  2. Restore passwd, shadow, and group files. They have User ID (UID) and group ID (GID) content that will be used to set up the new cloud. If these are not restored immediately after installing the operating system, the cloud deployment will create new UIDs and GIDs, overwriting the existing content.

  3. Install Cloud Lifecycle Manager software

  4. Prepare the Cloud Lifecycle Manager, which includes installing the necessary packages

  5. Initialize the Cloud Lifecycle Manager

  6. Restore your OpenStack git repository

  7. Adjust input model settings if the hardware setup has changed

The following sections cover these steps in detail.

15.2.1.1.2 Install the Operating System

Follow the instructions for installing SUSE Linux Enterprise Server in Chapter 15, Installing the Cloud Lifecycle Manager server.

15.2.1.1.3 Restore files with UID and GID content
Important
Important

There is a risk that you may lose data completely. Restore the backups for /etc/passwd, /etc/shadow, and /etc/group immediately after installing SUSE Linux Enterprise Server.

Some backup files contain content that would no longer be valid if your cloud were freshly deployed in the next step of a whole cloud recovery. As a result, these backups must be restored before deploying a new cloud. Three backups are involved: passwd, shadow, and group. The following steps restore them.

  1. Log in to the server where the Cloud Lifecycle Manager will be installed.

  2. Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.

    ardana > scp USER@REMOTE_SERVER:TAR_ARCHIVE
  3. Untar the TAR archives to overwrite the three locations:

    • passwd

    • shadow

    • group

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gz

    The following are examples. Use the actual tar.gz file names of the backups.

    BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ -f passwd.tar.gz

    BACKUP_TARGET=/etc/shadow

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ -f shadow.tar.gz

    BACKUP_TARGET=/etc/group

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ -f group.tar.gz
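
    As a quick sanity check after the restore, you can confirm that the expected cloud account and group are present again (the ardana user and group come from the standard deployment; adjust the names if your installation differs):

    ardana > getent passwd ardana
    ardana > getent group ardana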
15.2.1.1.4 Install the Cloud Lifecycle Manager

To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Section 15.5.2, “Installing the SUSE OpenStack Cloud Extension”.

15.2.1.1.5 Prepare to deploy your cloud

The following is the general process for preparing to deploy a SUSE OpenStack Cloud. You may not need to perform all the steps, depending on your particular disaster recovery situation.

Important
Important

When you install the ardana cloud pattern in the following process, the ardana user and ardana group will already exist in /etc/passwd and /etc/group. Do not re-create them.

When you run ardana-init in the following process, /var/lib/ardana is created as a deployer account using the account settings in /etc/passwd and /etc/group that were restored in the previous step.

15.2.1.1.5.1 Prepare for Cloud Installation
  1. Review the Chapter 14, Pre-Installation Checklist about recommended pre-installation tasks.

  2. Prepare the Cloud Lifecycle Manager node. The Cloud Lifecycle Manager must be accessible either directly or via ssh, and have SUSE Linux Enterprise Server 12 SP4 installed. All nodes must be accessible to the Cloud Lifecycle Manager. If the nodes do not have direct access to online Cloud subscription channels, the Cloud Lifecycle Manager node will need to host the Cloud repositories.

    1. If you followed the installation instructions for Cloud Lifecycle Manager server (see Chapter 15, Installing the Cloud Lifecycle Manager server), SUSE OpenStack Cloud software should already be installed. Double-check whether SUSE Linux Enterprise and SUSE OpenStack Cloud are properly registered at the SUSE Customer Center by starting YaST and running Software › Product Registration.

      If you have not yet installed SUSE OpenStack Cloud, do so by starting YaST and running Software › Product Registration › Select Extensions. Choose SUSE OpenStack Cloud and follow the on-screen instructions. Make sure to register SUSE OpenStack Cloud during the installation process and to install the software pattern patterns-cloud-ardana.

      tux > sudo zypper -n in patterns-cloud-ardana
    2. Ensure the SUSE OpenStack Cloud media repositories and updates repositories are made available to all nodes in your deployment. This can be accomplished either by configuring the Cloud Lifecycle Manager server as an SMT mirror as described in Chapter 16, Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional) or by syncing or mounting the Cloud and updates repositories to the Cloud Lifecycle Manager server as described in Chapter 17, Software Repository Setup.

    3. Configure passwordless sudo for the user created when setting up the node (as described in Section 15.4, “Creating a User”). Note that this is not the user ardana that will be used later in this procedure. In the following we assume you named the user cloud. Run the command visudo as user root and add the following line to the end of the file:

      CLOUD ALL = (root) NOPASSWD:ALL

      Make sure to replace CLOUD with your user name choice.

    4. Set the password for the user ardana:

      tux > sudo passwd ardana
    5. Become the user ardana:

      tux > su - ardana
    6. Place a copy of the SUSE Linux Enterprise Server 12 SP4 .iso in the ardana home directory, /var/lib/ardana, and rename it to sles12sp4.iso.

    7. Install the templates, examples, and working model directories:

      ardana > /usr/bin/ardana-init
15.2.1.1.6 Restore the remaining Cloud Lifecycle Manager content from a remote backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.

    ardana > scp USER@REMOTE_SERVER:TAR_ARCHIVE
  3. Untar the TAR archives to overwrite the remaining two required locations:

    • home

    • ssh

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gz

    The following are examples. Use the actual tar.gz file names of the backups.

    BACKUP_TARGET=/var/lib/ardana

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /var/lib/ -f home.tar.gz

    BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz
15.2.1.1.7 Re-deployment of controllers 1, 2 and 3
  1. Change back to the default ardana user.

  2. Run the cobbler-deploy.yml playbook.

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  3. Run the bm-reimage.yml playbook limited to the second and third controllers.

    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3

    The names controller2 and controller3 may differ in your environment. Use the bm-power-status.yml playbook to check the Cobbler names of these nodes.

  4. Run the site.yml playbook limited to the three controllers and localhost—in this example, doc-cp1-c1-m1-mgmt, doc-cp1-c1-m2-mgmt, doc-cp1-c1-m3-mgmt, and localhost

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit \
    doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  5. You can now perform the procedures to restore MariaDB and swift.

15.2.1.1.8 Restore MariaDB from a remote backup
  1. Log in to the first node running the MariaDB service.

  2. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

    ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
    --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
    -f mydb.tar.gz
  4. Verify that the files have been restored on the controller.

    ardana > sudo du -shx /tmp/mysql_restore/*
    16K     /tmp/mysql_restore/aria_log.00000001
    4.0K    /tmp/mysql_restore/aria_log_control
    3.4M    /tmp/mysql_restore/barbican
    8.0K    /tmp/mysql_restore/ceilometer
    4.2M    /tmp/mysql_restore/cinder
    2.9M    /tmp/mysql_restore/designate
    129M    /tmp/mysql_restore/galera.cache
    2.1M    /tmp/mysql_restore/glance
    4.0K    /tmp/mysql_restore/grastate.dat
    4.0K    /tmp/mysql_restore/gvwstate.dat
    2.6M    /tmp/mysql_restore/heat
    752K    /tmp/mysql_restore/horizon
    4.0K    /tmp/mysql_restore/ib_buffer_pool
    76M     /tmp/mysql_restore/ibdata1
    128M    /tmp/mysql_restore/ib_logfile0
    128M    /tmp/mysql_restore/ib_logfile1
    12M     /tmp/mysql_restore/ibtmp1
    16K     /tmp/mysql_restore/innobackup.backup.log
    313M    /tmp/mysql_restore/keystone
    716K    /tmp/mysql_restore/magnum
    12M     /tmp/mysql_restore/mon
    8.3M    /tmp/mysql_restore/monasca_transform
    0       /tmp/mysql_restore/multi-master.info
    11M     /tmp/mysql_restore/mysql
    4.0K    /tmp/mysql_restore/mysql_upgrade_info
    14M     /tmp/mysql_restore/nova
    4.4M    /tmp/mysql_restore/nova_api
    14M     /tmp/mysql_restore/nova_cell0
    3.6M    /tmp/mysql_restore/octavia
    208K    /tmp/mysql_restore/opsconsole
    38M     /tmp/mysql_restore/ovs_neutron
    8.0K    /tmp/mysql_restore/performance_schema
    24K     /tmp/mysql_restore/tc.log
    4.0K    /tmp/mysql_restore/test
    8.0K    /tmp/mysql_restore/winchester
    4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  5. Stop SUSE OpenStack Cloud services on the three controllers (using the hostnames of the controllers in your configuration).

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \
    doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  6. Delete the files in the mysql directory and copy the restored backup to that directory.

    root # cd /var/lib/mysql/
    root # rm -rf ./*
    root # cp -pr /tmp/mysql_restore/* ./
  7. Switch back to the ardana user when the copy is finished.

15.2.1.1.9 Restore swift from a remote backup
  1. Log in to the first swift Proxy (SWF-PRX--first-member) node.

    To find the first swift Proxy node:

    1. On the Cloud Lifecycle Manager

      ardana > cd  ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
      --limit SWF-PRX--first-member

      At the end of the output, you will see something like the following example:

      ...
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'
      
      PLAY RECAP ********************************************************************
      ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
    2. Find the first node name and its IP address. For example:

      ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
  2. Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, swring.tar.gz).

    ardana > mkdir /tmp/swift_builder_dir_restore; sudo tar -z \
    --incremental --extract --ignore-zeros --warning=none --overwrite --directory \
    /tmp/swift_builder_dir_restore/  -f swring.tar.gz
  4. Log in to the Cloud Lifecycle Manager.

  5. Stop the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml
  6. Log back in to the first swift Proxy (SWF-PRX--first-member) node, which was determined in Step 1.

  7. Copy the restored files.

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  8. Log back in to the Cloud Lifecycle Manager.

  9. Reconfigure the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.1.1.10 Restart SUSE OpenStack Cloud services
  1. Restart the MariaDB database

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml

    Running the galera-bootstrap.yml playbook on the deployer node determines the log sequence number, bootstraps the main node, and starts the database cluster.

    If this process fails to recover the database cluster, refer to Section 15.2.3.1.2, “Recovering the MariaDB Database”.

  2. Restart SUSE OpenStack Cloud services on the three controllers as in the following example.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml \
    --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhost
  3. Reconfigure SUSE OpenStack Cloud

    ardana > ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml

15.2.2 Recover Start-up Processes

In this scenario, required processes do not start. If those processes are not running, Ansible start-up playbooks will fail. On the deployer, use Ansible to check the status of the control plane servers. The following checks and remedies address common causes of this condition.

  • If disk space is low, determine the cause and remove anything that is no longer needed. Check disk space with the following command:

    ardana > ansible KEY-API -m shell -a 'df -h'
  • Check that Network Time Protocol (NTP) is synchronizing clocks properly with the following command.

    ardana > ansible resources -i hosts/verb_hosts \
    -m shell -a "sudo ntpq -c peers"
  • Check keepalived, the daemon that monitors services or systems and automatically fails over to a standby if problems occur.

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status keepalived | head -8"
  • Restart keepalived if necessary.

    1. Check RabbitMQ status first:

      ardana > ansible KEY-API -i hosts/verb_hosts \
      -m shell -a "sudo rabbitmqctl status | head -10"
    2. Restart RabbitMQ if necessary:

      ardana > ansible KEY-API -i hosts/verb_hosts \
      -m shell -a "sudo systemctl start rabbitmq-server"
    3. If RabbitMQ is running, restart keepalived:

      ardana > ansible KEY-API -i hosts/verb_hosts \
      -m shell -a "sudo systemctl restart keepalived"
  • If RabbitMQ is up, is it clustered?

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo rabbitmqctl cluster_status"

    Restart RabbitMQ cluster if necessary:

    ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
  • Check Kafka messaging:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status kafka | head -5"
  • Check the Spark framework:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status spark-worker | head -8"
    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status spark-master | head -8"
  • If necessary, start Spark:

    ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
    ardana > ansible KEY-API -i hosts/verb_hosts -m shell -a \
    "sudo systemctl start spark-master | head -8"
  • Check Zookeeper centralized service:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo systemctl status zookeeper| head -8"
  • Check MariaDB:

    ardana > ansible KEY-API -i hosts/verb_hosts \
    -m shell -a "sudo mysql -e 'show status;' | grep -e wsrep_incoming_addresses \
    -e wsrep_local_state_comment "

15.2.3 Unplanned Control Plane Maintenance

Unplanned maintenance tasks for controller nodes such as recovery from power failure.

15.2.3.1 Restarting Controller Nodes After a Reboot

Steps to follow if one or more of your controller nodes loses network connectivity or power, including cases where a node is rebooted or needs hardware maintenance.

When a controller node is rebooted, needs hardware maintenance, loses network connectivity or loses power, these steps will help you recover the node.

These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.

15.2.3.1.1 Prerequisites

The following conditions must be true in order to perform these steps successfully:

  • Each of your controller nodes should be powered on.

  • Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.

  • The operator who performs these steps will need access to the Cloud Lifecycle Manager.

15.2.3.1.2 Recovering the MariaDB Database

The recovery process for your MariaDB database cluster will depend on how many of your controller nodes need to be recovered. We will cover two scenarios:

Scenario 1: Recovering one or two of your controller nodes but not the entire cluster

If you need to recover one or two of your controller nodes but not the entire cluster, use these steps:

  1. Ensure the controller nodes have power and are booted to the command prompt.

  2. If the MariaDB service is not started, start it with this command:

    ardana > sudo service mysql start
  3. If MariaDB fails to start, proceed to the next section which covers the bootstrap process.

Scenario 2: Recovering the entire controller cluster with the bootstrap playbook

If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.

  1. Make sure no mysqld daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is a mysqld daemon running, then use the command below to shut down the daemon.

    ardana > sudo systemctl stop mysql

    If the mysqld daemon does not go down following the service stop, then kill the daemon using kill -9 before continuing.

  2. On the deployer node, execute the galera-bootstrap.yml playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
15.2.3.1.3 Restarting Services on the Controller Nodes

From the Cloud Lifecycle Manager you should execute the ardana-start.yml playbook for each node that was brought down so the services can be started back up.

If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>

If you have a shared Cloud Lifecycle Manager/controller setup and need to restart services on this shared node, you can use localhost to indicate the shared node, like this:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost
Note
Note

If you leave off the --limit switch, the playbook will be run against all nodes.

15.2.3.1.4 Restart the Monitoring Agents

As part of the recovery process, you should also restart the monasca-agent and these steps will show you how:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the monasca-agent:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-stop.yml
  3. Restart the monasca-agent:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-start.yml
  4. You can then confirm the status of the monasca-agent with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml

15.2.3.2 Recovering the Control Plane

If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, there are several scenarios for recovering your cloud.

Note
Note

If you backed up the Cloud Lifecycle Manager manually after installation (see Chapter 38, Post Installation Tasks), you will have a backup copy of /etc/group. When recovering a Cloud Lifecycle Manager node, manually copy the /etc/group file from a backup of the old Cloud Lifecycle Manager.

15.2.3.2.1 Point-in-Time MariaDB Database Recovery

In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.

15.2.3.2.1.1 Restore MariaDB manually

Follow this procedure to manually restore MariaDB:

  1. Log in to the Cloud Lifecycle Manager.

  2. Stop the MariaDB cluster:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  3. On all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:

    ardana > sudo rm -r /var/lib/mysql/*
  4. On the first node running the MariaDB service restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.

    ardana > sudo cp -pr /tmp/mysql_restore/* /var/lib/mysql
  5. If you need to restore the files manually over SSH, follow these steps:

    1. Log in to the first node running the MariaDB service.

    2. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

    3. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

      ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
      --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
      -f mydb.tar.gz
    4. Verify that the files have been restored on the controller.

      ardana > sudo du -shx /tmp/mysql_restore/*
      16K     /tmp/mysql_restore/aria_log.00000001
      4.0K    /tmp/mysql_restore/aria_log_control
      3.4M    /tmp/mysql_restore/barbican
      8.0K    /tmp/mysql_restore/ceilometer
      4.2M    /tmp/mysql_restore/cinder
      2.9M    /tmp/mysql_restore/designate
      129M    /tmp/mysql_restore/galera.cache
      2.1M    /tmp/mysql_restore/glance
      4.0K    /tmp/mysql_restore/grastate.dat
      4.0K    /tmp/mysql_restore/gvwstate.dat
      2.6M    /tmp/mysql_restore/heat
      752K    /tmp/mysql_restore/horizon
      4.0K    /tmp/mysql_restore/ib_buffer_pool
      76M     /tmp/mysql_restore/ibdata1
      128M    /tmp/mysql_restore/ib_logfile0
      128M    /tmp/mysql_restore/ib_logfile1
      12M     /tmp/mysql_restore/ibtmp1
      16K     /tmp/mysql_restore/innobackup.backup.log
      313M    /tmp/mysql_restore/keystone
      716K    /tmp/mysql_restore/magnum
      12M     /tmp/mysql_restore/mon
      8.3M    /tmp/mysql_restore/monasca_transform
      0       /tmp/mysql_restore/multi-master.info
      11M     /tmp/mysql_restore/mysql
      4.0K    /tmp/mysql_restore/mysql_upgrade_info
      14M     /tmp/mysql_restore/nova
      4.4M    /tmp/mysql_restore/nova_api
      14M     /tmp/mysql_restore/nova_cell0
      3.6M    /tmp/mysql_restore/octavia
      208K    /tmp/mysql_restore/opsconsole
      38M     /tmp/mysql_restore/ovs_neutron
      8.0K    /tmp/mysql_restore/performance_schema
      24K     /tmp/mysql_restore/tc.log
      4.0K    /tmp/mysql_restore/test
      8.0K    /tmp/mysql_restore/winchester
      4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  6. Log back in to the Cloud Lifecycle Manager.

  7. Start the MariaDB service.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  8. After approximately 10-15 minutes, the output of the percona-status.yml playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml

    An example output is as follows:

    TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] *************
      ok: [ardana-cp1-c1-m1-mgmt] => {
      "msg": "mysql is synced."
      }
      ok: [ardana-cp1-c1-m2-mgmt] => {
      "msg": "mysql is synced."
      }
      ok: [ardana-cp1-c1-m3-mgmt] => {
      "msg": "mysql is synced."
      }
15.2.3.2.1.2 Point-in-Time Cassandra Recovery

A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.

The following steps should be taken before enabling and deploying the replacement node.

  1. Determine the IP address of the node that was removed or is being replaced.

  2. On one of the functional Cassandra control plane nodes, log in as the ardana user.

  3. Run the command nodetool status to display a list of Cassandra nodes.

  4. If the node that has been removed does not appear in the list (that is, no IP address matches that of the removed node), skip the next two steps.

  5. If the node that was removed is still in the list, copy its node ID.

  6. Run the command nodetool removenode ID.

After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 15.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.

For more information, please consult the Cassandra documentation.
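
The nodetool commands referenced in the steps above follow this general pattern (a minimal sketch; HOST_ID is a placeholder for the Host ID value you copy from your own nodetool status output):

ardana > nodetool status
ardana > nodetool removenode HOST_ID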

15.2.3.2.2 Point-in-Time swift Rings Recovery

In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your swift rings to a previous state.

Note
Note

This process restores swift rings only, not swift data.

15.2.3.2.2.1 Restore from a swift backup
  1. Log in to the first swift Proxy (SWF-PRX--first-member) node.

    To find the first swift Proxy node:

    1. On the Cloud Lifecycle Manager

      ardana > cd  ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml \
      --limit SWF-PRX--first-member

      At the end of the output, you will see something like the following example:

      ...
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)'
      Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)'
      
      PLAY RECAP ********************************************************************
      ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
    2. Find the first node name and its IP address. For example:

      ardana > cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
  2. Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

  3. Create a temporary directory and extract the TAR archive (for example, swring.tar.gz).

    ardana > mkdir /tmp/swift_builder_dir_restore; sudo tar -z \
    --incremental --extract --ignore-zeros --warning=none --overwrite --directory \
    /tmp/swift_builder_dir_restore/  -f swring.tar.gz
  4. Log in to the Cloud Lifecycle Manager.

  5. Stop the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-stop.yml
  6. Log back in to the first swift Proxy (SWF-PRX--first-member) node, which was determined in Step 1.

  7. Copy the restored files.

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  8. Log back in to the Cloud Lifecycle Manager.

  9. Reconfigure the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.3.2.3 Point-in-time Cloud Lifecycle Manager Recovery

In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.

Procedure 15.1: Restoring from a Swift or SSH Backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.

  3. Extract the TAR archives for each of the seven locations.

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
        --warning=none --overwrite --directory RESTORE_TARGET \
        -f BACKUP_TARGET.tar.gz

    For example, with a directory such as BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz

    With a file such as BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
15.2.3.2.4 Cloud Lifecycle Manager Disaster Recovery

In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.

To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Section 15.5.2, “Installing the SUSE OpenStack Cloud Extension” before proceeding.

Prepare the Cloud Lifecycle Manager following the steps in the Before You Start section of Chapter 21, Installing with the Install UI.

15.2.3.2.4.1 Restore from a remote backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Retrieve (with scp) the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.

  3. Extract the TAR archives for each of the seven locations.

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
        --warning=none --overwrite --directory RESTORE_TARGET \
        -f BACKUP_TARGET.tar.gz

    For example, with a directory such as BACKUP_TARGET=/etc/ssh/

    ardana > sudo tar -z --incremental --extract --ignore-zeros \
    --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz

    With a file such as BACKUP_TARGET=/etc/passwd

    ardana > sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
  4. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready_deployment.yml
  5. When the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
15.2.3.2.5 One or Two Controller Node Disaster Recovery

This scenario makes the following assumptions:

  • Your Cloud Lifecycle Manager is still intact and working.

  • One or two of your controller nodes went down, but not the entire cluster.

  • The node needs to be rebuilt from scratch, not simply rebooted.

15.2.3.2.5.1 Steps to recovering one or two controller nodes
  1. Ensure that your node has power and all of the hardware is functioning.

  2. Log in to the Cloud Lifecycle Manager.

  3. Verify that all of the information in your ~/openstack/my_cloud/definition/data/servers.yml file is correct for your controller node. You may need to update the existing information if you had to replace either your entire controller node or just parts of it.

  4. If you made changes to your servers.yml file then commit those changes to your local git:

    ardana > git add -A
    ardana > git commit -a -m "editing controller information"
  5. Run the configuration processor:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
  6. Update your deployment directory:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  7. Ensure that Cobbler has the correct system information:

    1. If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:

      ardana > sudo cobbler system list
    2. Remove any controller nodes from Cobbler that no longer exist:

      ardana > sudo cobbler system remove --name=<node>
    3. Add the new node into Cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
  8. Then you can image the node:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>
    Note
    Note

    If you do not know the <node_name> already, you can get it by using sudo cobbler system list.

    Before proceeding, look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. To prevent loss of data, the configuration processor retains data about removed nodes and keeps their ID numbers from being reallocated. For more information about how this works, see Section 7.3.1, “Persisted Server Allocations”.

  9. Run the wipe_disks.yml playbook to ensure the non-OS partitions on your nodes are completely wiped prior to continuing with the installation.

    Important
    Important

    The wipe_disks.yml playbook is only meant to be run on systems immediately after running bm-reimage.yml. If used for any other situation, it may not wipe all of the expected partitions.

    ardana > cd ~/scratch/ansible/next/ardana/ansible/
    ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>
  10. Complete the rebuilding of your controller node with the two playbooks below:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>
15.2.3.2.6 Three Control Plane Node Disaster Recovery

In this scenario, all control plane nodes are down and need to be rebuilt or replaced. Restoring from a swift backup is not possible because swift is gone.

15.2.3.2.6.1 Restore from an SSH backup
  1. Log in to the Cloud Lifecycle Manager.

  2. Deploy the control plane nodes, using the values for your control plane node hostnames:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit \
      CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2, \
      CONTROL_PLANE_HOSTNAME3 -e rebuild=True

    For example, if you were using the default values from the example model files, the command would look like this:

    ardana > ansible-playbook -i hosts/verb_hosts site.yml \
    --limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \
    -e rebuild=True
    Note
    Note

    The -e rebuild=True option is normally only used on a single control plane node when there are other controllers available to pull configuration data from. In this scenario, where no other control nodes are available, it causes the MariaDB database to be reinitialized, which is the only option when there are no surviving controllers to pull data from.

  3. Log in to the Cloud Lifecycle Manager.

  4. Stop MariaDB:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-stop.yml
  5. Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.

  6. Create a temporary directory and extract the TAR archive (for example, mydb.tar.gz).

    ardana > mkdir /tmp/mysql_restore; sudo tar -z --incremental \
    --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \
    -f mydb.tar.gz
  7. Verify that the files have been restored on the controller.

    ardana > sudo du -shx /tmp/mysql_restore/*
    16K     /tmp/mysql_restore/aria_log.00000001
    4.0K    /tmp/mysql_restore/aria_log_control
    3.4M    /tmp/mysql_restore/barbican
    8.0K    /tmp/mysql_restore/ceilometer
    4.2M    /tmp/mysql_restore/cinder
    2.9M    /tmp/mysql_restore/designate
    129M    /tmp/mysql_restore/galera.cache
    2.1M    /tmp/mysql_restore/glance
    4.0K    /tmp/mysql_restore/grastate.dat
    4.0K    /tmp/mysql_restore/gvwstate.dat
    2.6M    /tmp/mysql_restore/heat
    752K    /tmp/mysql_restore/horizon
    4.0K    /tmp/mysql_restore/ib_buffer_pool
    76M     /tmp/mysql_restore/ibdata1
    128M    /tmp/mysql_restore/ib_logfile0
    128M    /tmp/mysql_restore/ib_logfile1
    12M     /tmp/mysql_restore/ibtmp1
    16K     /tmp/mysql_restore/innobackup.backup.log
    313M    /tmp/mysql_restore/keystone
    716K    /tmp/mysql_restore/magnum
    12M     /tmp/mysql_restore/mon
    8.3M    /tmp/mysql_restore/monasca_transform
    0       /tmp/mysql_restore/multi-master.info
    11M     /tmp/mysql_restore/mysql
    4.0K    /tmp/mysql_restore/mysql_upgrade_info
    14M     /tmp/mysql_restore/nova
    4.4M    /tmp/mysql_restore/nova_api
    14M     /tmp/mysql_restore/nova_cell0
    3.6M    /tmp/mysql_restore/octavia
    208K    /tmp/mysql_restore/opsconsole
    38M     /tmp/mysql_restore/ovs_neutron
    8.0K    /tmp/mysql_restore/performance_schema
    24K     /tmp/mysql_restore/tc.log
    4.0K    /tmp/mysql_restore/test
    8.0K    /tmp/mysql_restore/winchester
    4.0K    /tmp/mysql_restore/xtrabackup_galera_info
  8. Log back in to the first controller node, remove the old database files, and copy the restored files into place:

    ardana > ssh FIRST_CONTROLLER_NODE
    ardana > sudo su
    root # rm -rf /var/lib/mysql/*
    root # cp -pr /tmp/mysql_restore/* /var/lib/mysql/
  9. Log back in to the Cloud Lifecycle Manager and bootstrap MariaDB:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
  10. Verify the status of MariaDB:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts percona-status.yml
15.2.3.2.7 swift Rings Recovery

To recover the swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings with the manual swift backup and restore or use the SSH backup.

15.2.3.2.7.1 Restore from the swift deployment backup

See Section 18.6.2.7, “Recovering swift Builder Files”.

15.2.3.2.7.2 Restore from the SSH backup

In case you have lost all system disks of all object nodes and the swift proxy nodes are corrupted, you can recover the rings from a copy of the swift rings that was backed up previously. The swift data itself is still available (the disks used by swift still need to be accessible).

Recover the rings with these steps.

  1. Log in to a swift proxy node.

  2. Become root:

    ardana > sudo su
  3. Create the temporary directory for your restored files:

    root # mkdir /tmp/swift_builder_dir_restore/
  4. Retrieve (scp) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.

  5. Create a temporary directory and extract the TAR archive (for example, swring.tar.gz).

    ardana > mkdir /tmp/swift_builder_dir_restore; sudo tar -z \
    --incremental --extract --ignore-zeros --warning=none --overwrite --directory \
    /tmp/swift_builder_dir_restore/  -f swring.tar.gz

    You now have the swift rings in /tmp/swift_builder_dir_restore/

  6. If the SWF-PRX--first-member is already deployed, log in to it. The contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) must be copied to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on that node.

  7. On the SWF-PRX--first-member, copy the restored files:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
        /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
        /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/

  8. Then, from the Cloud Lifecycle Manager, reconfigure the swift service:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
  9. If the SWF-ACC--first-member is not deployed, from the Cloud Lifecycle Manager run these playbooks:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts guard-deployment.yml
    ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>
  10. Copy the contents of the restored directory (/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/) to /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/ on the SWF-ACC[0].

    Create the directories: /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
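
    For example, a minimal way to create the target directory (assuming it does not already exist):

    ardana > sudo mkdir -p /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/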

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \
    /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/

    For example:

    ardana > sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \
    /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/
  11. From the Cloud Lifecycle Manager, run the ardana-deploy.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml

15.2.4 Unplanned Compute Maintenance

Unplanned maintenance tasks including recovering compute nodes.

15.2.4.1 Recovering a Compute Node

If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, you need to perform disaster recovery. The following scenarios describe how to resolve them and repair your cloud.

Typical scenarios in which you will need to recover a compute node include the following:

  • The node has failed, either because it has shut down, has a hardware failure, or for another reason.

  • The node is working but the nova-compute process is not responding, so instances are running but you cannot manage them (for example, delete or reboot them, or attach/detach volumes).

  • The node is fully operational but monitoring indicates a potential issue (such as disk errors) that requires downtime to fix.

15.2.4.1.1 What to do if your compute node is down

Compute node has power but is not powered on

If your compute node has power but is not powered on, use these steps to restore the node:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the name for your compute node in Cobbler:

    ardana > sudo cobbler system list
  3. Power the node back up with this playbook, specifying the node name from Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>

Compute node is powered on but services are not running on it

If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:

  1. Log in to the Cloud Lifecycle Manager.

  2. Confirm the status of the compute service on the node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>
  3. You can start the compute service on the node with this playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>
15.2.4.1.2 Scenarios involving disk failures on your compute nodes

Your compute nodes should have a minimum of two disks, one that is used for the operating system and one that is used as the data disk. These are defined during the installation of your cloud, in the ~/openstack/my_cloud/definition/data/disks_compute.yml file on the Cloud Lifecycle Manager. The data disk(s) are where the nova-compute service lives. Recovery scenarios will depend on whether one or the other, or both, of these disks experienced failures.
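
To review how the disks are currently defined for your compute nodes, you can inspect the file mentioned above from the Cloud Lifecycle Manager:

ardana > less ~/openstack/my_cloud/definition/data/disks_compute.yml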

If your operating system disk failed but the data disk(s) are okay

If you have had issues with the physical volume that hosts your operating system, ensure that the physical volume is restored, and then use the following steps to restore the operating system:

  1. Log in to the Cloud Lifecycle Manager.

  2. Source the administrator credentials:

    ardana > source ~/service.osrc
  3. Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:

    ardana > openstack host list | grep compute
  4. Obtain the status of the nova-compute service on that node:

    ardana > openstack compute service list --host <hostname>
  5. You will likely want to disable provisioning on that node to ensure that nova-scheduler does not attempt to place any additional instances on the node while you are repairing it:

    ardana > openstack compute service set --disable --reason "node is being rebuilt" <hostname>
  6. Obtain the status of the instances on the compute node:

    ardana > openstack server list --host <hostname> --all-tenants
  7. Before continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, then you can use the nova evacuate or nova host-evacuate commands to do this. See Section 15.1.3.3, “Live Migration of Instances” for more details on how to do this.

    If your instances are not booted from volumes, you will need to stop the instances using the openstack server stop command. Because the nova-compute service is not running on the node you will not see the instance status change, but the Task State for the instance should change to powering-off.

    ardana > openstack server stop <instance_uuid>

    Verify the status of each of the instances using these commands, confirming that the Task State shows powering-off:

    ardana > openstack server list --host <hostname> --all-tenants
    ardana > openstack server show <instance_uuid>
  8. At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:

    1. Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:

      ardana > sudo cobbler system list
    2. Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.

      ardana > cd ~/scratch/ansible/next/ardana/ansible
      ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
    3. Reimage the compute node with this playbook:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
  9. Once reimaging is complete, use the following playbook to configure the operating system and start up services:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
  10. You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not then use the openstack server start command to bring them to the ACTIVE state:

    ardana > openstack server list --host <hostname> --all-tenants
    ardana > openstack server start <instance_uuid>
  11. Re-enable provisioning:

    ardana > openstack compute service set --enable <hostname>
  12. Start any instances that you had stopped previously:

    ardana > openstack server list --host <hostname> --all-tenants
    ardana > openstack server start <instance_uuid>

If your data disk(s) failed but the operating system disk is okay OR if all drives failed

In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.

After that is complete, use the openstack server rebuild command to respawn your instances, which will also ensure that they receive the same IP address:

ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server rebuild <instance_uuid>

15.2.5 Unplanned Storage Maintenance

Unplanned maintenance tasks for storage nodes.

15.2.5.1 Unplanned swift Storage Maintenance

Unplanned maintenance tasks for swift storage nodes.

15.2.5.1.1 Recovering a Swift Node

If one or more of your swift Object or PAC nodes has experienced an issue, such as power loss or hardware failure, and you need to perform disaster recovery, the following scenarios describe how to resolve them and repair your cloud.

Typical scenarios in which you will need to repair a swift object or PAC node include:

  • The node has either shut down or been rebooted.

  • The entire node has failed and needs to be replaced.

  • A disk drive has failed and must be replaced.

15.2.5.1.1.1 What to do if your Swift host has shut down or rebooted

If your swift host has power but is not powered on, use these steps from the Cloud Lifecycle Manager to restore it:

  1. Log in to the Cloud Lifecycle Manager.

  2. Obtain the name for your swift host in Cobbler:

    ardana > sudo cobbler system list
  3. Power the node back up with this playbook, specifying the node name from Cobbler:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>

Once the node is booted up, swift should start automatically. You can verify this with this playbook:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-status.yml

Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 18.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.

15.2.5.1.1.2 How to replace your Swift node

If your swift node has irreparable damage and you need to replace the entire node in your environment, see Section 15.1.5.1.5, “Replacing a swift Node” for details on how to do this.

15.2.5.1.1.3 How to replace a hard disk in your Swift node

If you need to do a hard drive replacement in your swift node, see Section 15.1.5.1.6, “Replacing Drives in a swift Node” for details on how to do this.

15.3 Cloud Lifecycle Manager Maintenance Update Procedure

Procedure 15.2: Preparing for Update
  1. Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Chapter 16, Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional). Alternatives to setting up an SMT server are described in Chapter 17, Software Repository Setup.

  2. Read the Release Notes for the security and maintenance updates that will be installed.

  3. Have a backup strategy in place. For further information, see Chapter 17, Backup and Restore.

  4. Ensure that you have a known starting state by resolving any unexpected alarms.

  5. Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.

  6. Review steps in Section 15.1.4.1, “Adding a Network Node” and Section 15.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the neutron services are not provided via external SDN controllers.

  7. Before the update, prepare your workloads by consolidating all of your instances onto one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and migrate the instances from the remaining Compute Nodes to the newly rebooted ones. Then, update the remaining Compute Nodes.

15.3.1 Performing the Update

Before you proceed, get the status of all your services:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml

If the status check returns an error for a specific service, run the SERVICE-reconfigure.yml playbook. Then run the SERVICE-status.yml playbook to check that the issue has been resolved.
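
For example, if the nova status check reports an error, you could run the following, substituting the service name reported in your own status output for nova:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts nova-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts nova-status.yml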

Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”.

Note
Note

The described workflow also covers cases in which the deployer node is also provisioned as an active cloud node.

To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 15.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.

Procedure 15.3: Update Instructions
  1. Install all available security and maintenance updates on the deployer using the zypper patch command.
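
    For example, run the following on the deployer:

    ardana > sudo zypper patch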

  2. Initialize the Cloud Lifecycle Manager and prepare the update playbooks.

    1. Run the ardana-init initialization script to update the deployer.
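
      For example, as the ardana user on the deployer:

      ardana > ardana-init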

    2. Redeploy cobbler:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
    3. Run the configuration processor:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
    4. Update your deployment directory:

      ardana > cd ~/openstack/ardana/ansible
      ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
  3. Installation and management of updates can be automated with the following playbooks:

    • ardana-update-pkgs.yml

    • ardana-update.yml

    • ardana-update-status.yml

    • ardana-reboot.yml

  4. Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs playbook on each node.

    ardana > hostnamectl

    Notice that Boot ID: and Kernel: have changed.

  5. By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --limit TARGET_NODE_NAME

    There may be a delay in the playbook output at the following task while updates are pulled from the deployer.

    TASK: [ardana-upgrade-tools | pkg-update | Download and install
    package updates] ***
  6. After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit TARGET_NODE_NAME
  7. To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true.

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --limit  TARGET_NODE_NAME \
    -e zypper_update_include_reboot_patches=true

    If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.

    Note
    Note

    To update a single package (for example, apply a PTF on a single node or on all nodes), run zypper update PACKAGE.

    To install all package updates, use zypper update.

  8. Update services:

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
    --limit TARGET_NODE_NAME
  9. If indicated by the ardana-update-status.yml playbook, reboot the node.

    There may also be a warning to reboot after running the ardana-update-pkgs.yml.

    This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable.

    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
    --limit TARGET_NODE_NAME
  10. To recheck pending system reboot status at a later time, run the following commands:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit ardana-cp1-c1-m2 -e update_status_var=system-reboot
  11. The pending system reboot status can be reset by running:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
    --limit ardana-cp1-c1-m2 \
    -e update_status_var=system-reboot \
    -e update_status_reset=true
  12. Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.

    Warning
    Warning

    When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time.

    If multiple nodes are specified on the command line (with --limit), services on those servers will experience outages as the packages are shut down and updated. On a Compute Node (or group of Compute Nodes), migrate the workload off before updating it. The same applies to Control Nodes: move singleton services off the control plane node that will be updated.

    Important
    Important

    Do not reboot all of your controllers at the same time.

  13. When the node comes up after the reboot, run the spark-start.yml playbook:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
  14. Verify that Spark is running on all Control Nodes:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
  15. After all nodes have been updated, check the status of all services:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml

15.3.2 Summary of the Update Playbooks

ardana-update-pkgs.yml

Top-level playbook that automates the installation of package updates on a single node. It also works for multiple nodes if the single-node restriction is overridden, either by setting the SKIP_SINGLE_HOST_CHECKS environment variable or by passing -e skip_single_host_checks=true to ardana-update-pkgs.yml.

Provide the following -e options to modify default behavior:

  • zypper_update_method (default: patch)

    • patch installs all patches for the system. Patches are intended for specific bug and security fixes.

    • update installs all packages that have a higher version number than the installed packages.

    • dist-upgrade replaces each package installed with the version from the repository and deletes packages not available in the repositories.

  • zypper_update_repositories (default: all) restricts the list of repositories used

  • zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks if packages are correctly signed.

  • zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third party licenses.

  • zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).
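
For example, these options can be combined on the command line. The following sketch (using the flags described above and a placeholder node name) installs full package updates, including reboot-requiring patches, on a single node:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME \
  -e zypper_update_method=update \
  -e zypper_update_include_reboot_patches=true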

ardana-update.yml

Top-level playbook that automates the update of all the services. It runs on all nodes by default, or can be limited to a single node by adding --limit nodename.

ardana-reboot.yml

Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.

ardana-update-status.yml

This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.

15.4 Upgrading Cloud Lifecycle Manager 8 to Cloud Lifecycle Manager 9

Before undertaking the upgrade from SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you need to ensure that your existing SUSE OpenStack Cloud 8 Cloud Lifecycle Manager installation is up to date by following the maintenance update procedure at https://documentation.suse.com/hpe-helion/8/html/hpe-helion-openstack-clm-all/system-maintenance.html#maintenance-update.

Ensure you review the following resources:

To confirm that all nodes have been successfully updated with no pending actions, run the ardana-update-status.yml playbook on the Cloud Lifecycle Manager deployer node as follows:

ardana > cd scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml
Note
Note

Ensure that all nodes have been updated, and that there are no pending update actions remaining to be completed. In particular, ensure that any nodes that need to be rebooted have been rebooted, using the documented reboot procedure.

Procedure 15.4: Running the Pre-Upgrade Validation Checks to Ensure that your Cloud is Ready for Upgrade
  • Once all nodes have been successfully updated, and there are no pending update actions remaining, you should be able to run the ardana-pre-upgrade-validations.sh script, as follows:

    ardana > cd scratch/ansible/next/ardana/ansible/
    ardana > ./ardana-pre-upgrade-validations.sh
    ~/scratch/ansible/next/ardana/ansible ~/scratch/ansible/next/ardana/ansible
    
    PLAY [Initialize an empty list of msgs] ***************************************
    
    TASK: [set_fact ] *************************************************************
    ok: [localhost]
    ...
    
    PLAY RECAP ********************************************************************
    ...
    localhost                  : ok=8    changed=5    unreachable=0    failed=0
    
    msg: Please refer to /var/log/ardana-pre-upgrade-validations.log for the results of this run. Ensure that any messages in the file that have the words FAIL or WARN are resolved.

    The last line of output from the ardana-pre-upgrade-validations.sh script will tell you the name of its log file, in this case /var/log/ardana-pre-upgrade-validations.log. If you look at the log file, you will see content similar to the following:

    ardana > sudo cat /var/log/ardana-pre-upgrade-validations.log
    ardana-cp-dbmqsw-m1*************************************************************
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    
    ardana-cp-dbmqsw-m2*************************************************************
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    
    ardana-cp-dbmqsw-m3*************************************************************
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade.
    
    ardana-cp-mml-m1****************************************************************
    SUCCESS: Keystone V2 ==> V3 API config changes detected.
    ardana-cp-mml-m2****************************************************************
    SUCCESS: Keystone V2 ==> V3 API config changes detected.
    ardana-cp-mml-m3****************************************************************
    SUCCESS: Keystone V2 ==> V3 API config changes detected.
    localhost***********************************************************************

    The report states the following:

    SUCCESS: Keystone V2 ==> V3 API config changes detected.

    This check confirms that your cloud has been updated with the necessary changes such that all services will be using Keystone V3 API. This means that there should be minimal interruption of service during the upgrade. This is important because the Keystone V2 API has been removed in SUSE OpenStack Cloud 9.

    NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than the SUSE Linux Enterprise 12 SP4 recommendation of 512. Some recommended XFS data integrity features may not be available after upgrade.

    This check will only report something if you have local swift configured and it is formatted with the SUSE Linux Enterprise 12 SP3 default XFS inode size of 256. In SUSE Linux Enterprise 12 SP4, the default XFS inode size for a newly-formatted XFS file system has been increased to 512, to allow room for enabling some additional XFS data-integrity features by default.

Note
Note

There will be no loss of swift functionality after the upgrade. The difference is that some additional XFS features will not be available on file systems which were formatted under SUSE Linux Enterprise 12 SP3 or earlier. These XFS features aid in the detection of, and recovery from, data corruption. They are enabled by default for XFS file systems formatted under SUSE Linux Enterprise 12 SP4.
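
If you want to verify the inode size of an existing swift file system yourself, one way is to run the following on a swift node (assuming a swift data device mounted at /srv/node/disk0, as in the report above); the isize value corresponds to the inode size reported by the check:

ardana > sudo xfs_info /srv/node/disk0 | grep isize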

Procedure 15.5: Additional Pre-Upgrade Checks That Should Be Performed

In addition to the automated upgrade checks above, there are some checks that should be performed manually.

  1. For each network interface device specified in the input model under ~/openstack/my_cloud/definition, ensure that there is only one untagged VLAN. The SUSE OpenStack Cloud 9 Cloud Lifecycle Manager configuration processor will fail with an error if it detects this problem during the upgrade, so address this problem before starting the upgrade process.
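
    One way to spot candidate networks is to search the input model for untagged VLAN definitions and then check which network group and interface each one maps to. This sketch assumes your networks are defined with the tagged-vlan attribute, as in the standard example models:

    ardana > grep -rn 'tagged-vlan: *false' ~/openstack/my_cloud/definition/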

  2. If the deployer node is not a standalone system, but is instead co-located with the DB services, this can lead to potentially longer service disruptions during the upgrade process. To determine if this is the case, check if the deployer node (OPS-LM--first-member) is a member of the database nodes (FND-MDB). You can do this with the following command:

    ardana > cd scratch/ansible/next/ardana/ansible/
    ardana > ansible -i hosts/verb_hosts 'FND-MDB:&OPS-LM--first-member' --list-hosts

    If the output is:

           No hosts matched

    then the deployer node is not co-located with the database nodes. Otherwise, if the command reports a hostname, there may be additional interruptions to the database services during the upgrade.

  3. Similarly, if the deployer is co-located with the database services, and you are also trying to run a local SMT service on the deployer node, you will run into issues trying to configure the SMT to enable and mirror the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.

    In such cases, it is recommended that you run the SMT services on a different node, and NFS-import the /srv/www/htdocs/repo onto the deployer node, instead of trying to run the SMT services locally.

Note
Note: Backup the Cloud Lifecycle Manager Configuration Settings

The integrated backup solution in SUSE OpenStack Cloud 8 Cloud Lifecycle Manager, freezer, is no longer available in SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. Therefore, we recommend doing a manual backup to a server that is not a member of the cloud, as per Chapter 17, Backup and Restore.

15.4.1 Migrating the Deployer Node Packages

The upgrade process first migrates the SUSE OpenStack Cloud 8 Cloud Lifecycle Manager deployer node to SUSE Linux Enterprise 12 SP4 and the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager packages.

Important
Important

If the deployer node is not a dedicated node, but is instead a member of one of the cloud-control planes, then some services may restart with the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM versions of the software during the migration. This may mean that:

  • Some services fail to restart. This will be resolved when the appropriate SUSE OpenStack Cloud 9 configuration changes are applied by running the ardana-upgrade.yml playbook, later during the upgrade process.

  • Other services may log excessive warnings about connectivity issues and backwards-compatibility warnings. This will be resolved when the relevant services are upgraded during the ardana-upgrade.yml playbook run.

In order to upgrade the deployer node to be based on SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you first need to migrate the system to SUSE Linux Enterprise 12 SP4 with the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager product installed.

The process for migrating the deployer node differs somewhat, depending on whether your deployer node is registered with the SUSE Customer Center (or an SMT mirror), versus using locally-maintained repositories available at the relevant locations.

If your deployer node is registered with the SUSE Customer Center or an SMT, the migration process requires the zypper-migration-plugin package to be installed.

Procedure 15.6: Migrating an SCC/SMT Registered Deployer Node
  1. If you are using an SMT server to mirror the relevant repositories, then you need to enable mirroring of the relevant repositories. See Section 16.3, “Setting up Repository Mirroring on the SMT Server” for more information.

    Ensure that the mirroring process has completed before proceeding.

  2. Ensure that the zypper-migration-plugin package is installed; if not, install it:

    ardana > sudo zypper install zypper-migration-plugin
    Refreshing service 'SMT-http_smt_example_com'.
    Loading repository data...
    Reading installed packages...
    'zypper-migration-plugin' is already installed.
    No update candidate for 'zypper-migration-plugin-0.10-12.4.noarch'. The highest available version is already installed.
    Resolving package dependencies...
    
    Nothing to do.
  3. De-register the SUSE Linux Enterprise Server LTSS 12 SP3 x86_64 extension (if enabled):

    ardana > sudo SUSEConnect --status-text
    Installed Products:
    ------------------------------------------
    
      SUSE Linux Enterprise Server 12 SP3 LTSS
      (SLES-LTSS/12.3/x86_64)
    
      Registered
    
    ------------------------------------------
    
      SUSE Linux Enterprise Server 12 SP3
      (SLES/12.3/x86_64)
    
      Registered
    
    ------------------------------------------
    
      SUSE OpenStack Cloud 8
      (suse-openstack-cloud/8/x86_64)
    
      Registered
    
    ------------------------------------------
    
    
    ardana > sudo SUSEConnect -d -p SLES-LTSS/12.3/x86_64
    Deregistering system from registration proxy https://smt.example.com/
    
    Deactivating SLES-LTSS 12.3 x86_64 ...
    -> Refreshing service ...
    -> Removing release package ...
    ardana > sudo SUSEConnect --status-text
    Installed Products:
    ------------------------------------------
    
      SUSE Linux Enterprise Server 12 SP3
      (SLES/12.3/x86_64)
    
      Registered
    
    ------------------------------------------
    
      SUSE OpenStack Cloud 8
      (suse-openstack-cloud/8/x86_64)
    
      Registered
    
    ------------------------------------------
  4. Disable any other SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories. The zypper migration process should detect and disable most of these automatically, but in some cases it may not catch all of them, which can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the /srv/www/suse-12.3 directory or the SUSE-12-4 alias under http://localhost:79/, you could use the following commands:

    ardana > zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2
    PTF
    SLES12-SP3-LTSS-Updates
    SLES12-SP3-Pool
    SLES12-SP3-Updates
    SUSE-OpenStack-Cloud-8-Pool
    SUSE-OpenStack-Cloud-8-Updates
    ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done
    Repository 'PTF' has been successfully disabled.
    Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled.
    Repository 'SLES12-SP3-Pool' has been successfully disabled.
    Repository 'SLES12-SP3-Updates' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.
  5. Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3 (a new one, based on SUSE Linux Enterprise 12 SP4, will be created during the upgrade process):

    ardana > zypper repos | grep PTF
     2 | PTF                                                               | PTF                                      | No      | (r ) Yes  | Yes
    ardana > sudo zypper removerepo PTF
    Removing repository 'PTF' ..............................................................................................[done]
    Repository 'PTF' has been removed.
  6. Remove the Cloud media repository (if defined):

    ardana > zypper repos | grep '[|] Cloud '
     1 | Cloud                          | SUSE OpenStack Cloud 8 DVD #1  | Yes     | (r ) Yes  | No
    ardana > sudo zypper removerepo Cloud
    Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done]
    Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.
  7. Run the zypper migration command, which should offer a single choice: namely, to upgrade to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. You need to accept the offered choice, then answer yes to any prompts to disable obsoleted repositories. At that point, the zypper migration command will run zypper dist-upgrade, which will prompt you to agree with the proposed package changes. Finally, you will need to agree to any new licenses. After this, the package upgrade of the deployer node will proceed. The output of the running zypper migration should look something like the following:

    ardana > sudo zypper migration
    
    Executing 'zypper  refresh'
    
    Repository 'SLES12-SP3-Pool' is up to date.
    Repository 'SLES12-SP3-Updates' is up to date.
    Repository 'SLES12-SP3-Pool' is up to date.
    Repository 'SLES12-SP3-Updates' is up to date.
    Repository 'SUSE-OpenStack-Cloud-8-Pool' is up to date.
    Repository 'SUSE-OpenStack-Cloud-8-Updates' is up to date.
    Repository 'OpenStack-Cloud-8-Pool' is up to date.
    Repository 'OpenStack-Cloud-8-Updates' is up to date.
    All repositories have been refreshed.
    
    Executing 'zypper  --no-refresh patch-check --updatestack-only'
    
    Loading repository data...
    Reading installed packages...
    
    0 patches needed (0 security patches)
    
    Available migrations:
    
        1 | SUSE Linux Enterprise Server 12 SP4 x86_64
            SUSE OpenStack Cloud 9 x86_64
    
    
    [num/q]: 1
    
    Executing 'snapper create --type pre --cleanup-algorithm=number --print-number --userdata important=yes --description 'before online migration''
    
    The config 'root' does not exist. Likely snapper is not configured.
    See 'man snapper' for further instructions.
    Upgrading product SUSE Linux Enterprise Server 12 SP4 x86_64.
    Found obsolete repository SLES12-SP3-Updates
    Disable obsolete repository SLES12-SP3-Updates [y/n] (y): y
    ... disabling.
    Found obsolete repository SLES12-SP3-Pool
    Disable obsolete repository SLES12-SP3-Pool [y/n] (y): y
    ... disabling.
    Upgrading product SUSE OpenStack Cloud 9 x86_64.
    Found obsolete repository OpenStack-Cloud-8-Pool
    Disable obsolete repository OpenStack-Cloud-8-Pool [y/n] (y): y
    ... disabling.
    
    Executing 'zypper --releasever 12.4 ref -f'
    
    Warning: Enforced setting: $releasever=12.4
    Forcing raw metadata refresh
    Retrieving repository 'SLES12-SP4-Pool' metadata .......................................................................[done]
    Forcing building of repository cache
    Building repository 'SLES12-SP4-Pool' cache ............................................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'SLES12-SP4-Updates' metadata ....................................................................[done]
    Forcing building of repository cache
    Building repository 'SLES12-SP4-Updates' cache .........................................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'SUSE-OpenStack-Cloud-9-Pool' metadata ...........................................................[done]
    Forcing building of repository cache
    Building repository 'SUSE-OpenStack-Cloud-9-Pool' cache ................................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'SUSE-OpenStack-Cloud-9-Updates' metadata ........................................................[done]
    Forcing building of repository cache
    Building repository 'SUSE-OpenStack-Cloud-9-Updates' cache .............................................................[done]
    Forcing raw metadata refresh
    Retrieving repository 'OpenStack-Cloud-8-Updates' metadata .............................................................[done]
    Forcing building of repository cache
    Building repository 'OpenStack-Cloud-8-Updates' cache ..................................................................[done]
    All repositories have been refreshed.
    
    Executing 'zypper --releasever 12.4  --no-refresh  dist-upgrade --no-allow-vendor-change '
    
    Warning: Enforced setting: $releasever=12.4
    Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    ...
    
    525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch.
    Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    ...
        dracut: *** Generating early-microcode cpio image ***
        dracut: *** Constructing GenuineIntel.bin ****
        dracut: *** Store current command line parameters ***
        dracut: Stored kernel commandline:
        dracut:  rd.lvm.lv=ardana-vg/root
        dracut:  root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered
        dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' ***
        dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done ***
    
    Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script:
        Refresh script btrfs-scrub.sh for monthly
        Refresh script btrfs-defrag.sh for none
        Refresh script btrfs-balance.sh for weekly
        Refresh script btrfs-trim.sh for none
    
    There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.
Procedure 15.7: Migrating a Deployer Node with Locally-Managed Repositories

In this configuration, you need to manually migrate the system using zypper dist-upgrade, according to the following steps:

  1. Disable any SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories; leaving them enabled can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the /srv/www/suse-12.3 directory, or the SUSE-12-4 alias under http://localhost:79/, use the following commands:

    ardana > zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2
    PTF
    SLES12-SP3-LTSS-Updates
    SLES12-SP3-Pool
    SLES12-SP3-Updates
    SUSE-OpenStack-Cloud-8-Pool
    SUSE-OpenStack-Cloud-8-Updates
    ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done
    Repository 'PTF' has been successfully disabled.
    Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled.
    Repository 'SLES12-SP3-Pool' has been successfully disabled.
    Repository 'SLES12-SP3-Updates' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled.
    Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.
    Note
    Note

    The SLES12-SP3-LTSS-Updates repository should only be present if you have purchased the optional SUSE Linux Enterprise 12 SP3 LTSS support. Whether or not it is configured will not impact the upgrade process.

  2. Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3. A new one based on SUSE Linux Enterprise 12 SP4 will be created during the upgrade process.

    ardana > zypper repos | grep PTF
     2 | PTF                                                               | PTF                                      | Yes     | (r ) Yes  | Yes
    ardana > sudo zypper removerepo PTF
    Removing repository 'PTF' ..............................................................................................[done]
    Repository 'PTF' has been removed.
  3. Remove the Cloud media repository if defined.

    ardana > zypper repos | grep '[|] Cloud '
     1 | Cloud                          | SUSE OpenStack Cloud 8 DVD #1  | Yes     | (r ) Yes  | No
    ardana > sudo zypper removerepo Cloud
    Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done]
    Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.
  4. Ensure the deployer node has access to the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM repositories as documented in Chapter 17, Software Repository Setup, paying attention to the non-SMT-based repository setup. When you run zypper repos --show-enabled-only, the output should look similar to the following:

    ardana > zypper repos --show-enabled-only
    #  | Alias                          | Name                           | Enabled | GPG Check | Refresh
    ---+--------------------------------+--------------------------------+---------+-----------+--------
     1 | Cloud                          | SUSE OpenStack Cloud 9 DVD #1  | Yes     | (r ) Yes  | No
     7 | SLES12-SP4-Pool                | SLES12-SP4-Pool                | Yes     | (r ) Yes  | No
     8 | SLES12-SP4-Updates             | SLES12-SP4-Updates             | Yes     | (r ) Yes  | Yes
     9 | SUSE-OpenStack-Cloud-9-Pool    | SUSE-OpenStack-Cloud-9-Pool    | Yes     | (r ) Yes  | No
    10 | SUSE-OpenStack-Cloud-9-Updates | SUSE-OpenStack-Cloud-9-Updates | Yes     | (r ) Yes  | Yes
    Note
    Note

    The Cloud repository above is optional. Its content is equivalent to the SUSE-OpenStack-Cloud-9-Pool repository.

  5. Run the zypper dist-upgrade command to upgrade the deployer node:

    ardana > sudo zypper dist-upgrade
    
    Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    ...
    
    525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch.
    Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    ...
        dracut: *** Generating early-microcode cpio image ***
        dracut: *** Constructing GenuineIntel.bin ****
        dracut: *** Store current command line parameters ***
        dracut: Stored kernel commandline:
        dracut:  rd.lvm.lv=ardana-vg/root
        dracut:  root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered
        dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' ***
        dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done ***
    
    Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script:
        Refresh script btrfs-scrub.sh for monthly
        Refresh script btrfs-defrag.sh for none
        Refresh script btrfs-balance.sh for weekly
        Refresh script btrfs-trim.sh for none
    
    There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.
    Note
    Note

    You may need to run the zypper dist-upgrade command more than once if it determines that it first needs to update the zypper infrastructure on your system before it can dist-upgrade the node; the command will tell you if you need to run it again.

15.4.2 Upgrading the Deployer Node Configuration Settings

Now that the deployer node packages have been migrated to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, we need to update the configuration settings to be SUSE OpenStack Cloud 9 Cloud Lifecycle Manager based.

The first step is to run the ardana-init command. This will:

  • Add the PTF repository, creating it if needed.

  • Optionally add appropriate local repository references for any SMT-provided SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.

  • Upgrade the deployer account ~/openstack area to be based upon SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources.

    • This will import the new SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible code into the Git repository on the Ardana branch, and then rebase the customer site branch on top of the updated Ardana branch.

    • Follow the directions to resolve any Git merge conflicts that may arise due to local changes that may have been made on the site branch:

      ardana > ardana-init
      ...
       To continue installation copy your cloud layout to:
           /var/lib/ardana/openstack/my_cloud/definition
      
       Then execute the installation playbooks:
           cd /var/lib/ardana/openstack/ardana/ansible
           git add -A
           git commit -m 'My config'
           ansible-playbook -i hosts/localhost cobbler-deploy.yml
           ansible-playbook -i hosts/localhost bm-reimage.yml
           ansible-playbook -i hosts/localhost config-processor-run.yml
           ansible-playbook -i hosts/localhost ready-deployment.yml
           cd /var/lib/ardana/scratch/ansible/next/ardana/ansible
           ansible-playbook -i hosts/verb_hosts site.yml
      
       If you prefer to use the UI to install the product, you can
       do either of the following:
           - If you are running a browser on this machine, you can point
             your browser to http://localhost:9085 to start the install
             via the UI.
           - If you are running the browser on a remote machine, you will
             need to create an ssh tunnel to access the UI.  Please refer
             to the Ardana installation documentation for further details.
Note
Note

As we are upgrading to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, we do not need to run the suggested bm-reimage.yml playbook.

Procedure 15.8: Updating the Bare-Metal Provisioning Configuration

If you were previously using the cobbler-based integrated provisioning solution, then you will need to perform the following steps to import the SUSE Linux Enterprise 12 SP4 ISO and update the default provisioning distribution:

  1. Ensure there is a copy of the SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso, named sles12sp4.iso, available in the /var/lib/ardana directory.
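
    For example, assuming the ISO has already been downloaded to a hypothetical location such as /tmp, you can copy it into place under the expected name and confirm that it is present:

    ardana > cp /tmp/SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso /var/lib/ardana/sles12sp4.iso
    ardana > ls -lh /var/lib/ardana/sles12sp4.iso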

  2. Ensure that any distribution entries in servers.yml (or whichever file holds the server node definitions) under ~/openstack/my_cloud/definition are updated to specify sles12sp4 if they are currently using sles12sp3.
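
    If you are unsure which files reference the old distribution, a minimal sketch such as the following can locate and update those entries, assuming they use the literal string sles12sp3 and live in data/servers.yml (adjust the file name to match your model):

    ardana > cd ~/openstack/my_cloud/definition
    ardana > grep -rln sles12sp3 data/
    ardana > sed -i 's/sles12sp3/sles12sp4/g' data/servers.yml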

    Note
    Note

    The default distribution will now be sles12sp4, so if no distribution entries are specified for the servers, no change is required.

    If you have made any changes to the ~/openstack/my_cloud/definition files, you will need to commit those changes, as follows:

    ardana > cd ~/openstack/my_cloud/definition
     ardana > git add -A
     ardana > git commit -m "Update sles12sp3 distro entries to sles12sp4"
  3. Run the cobbler-deploy.yml playbook to import the SUSE Linux Enterprise 12 SP4 distribution as the new default distribution:

    ardana > cd ~/openstack/ardana/ansible
     ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
     Enter the password that will be used to access provisioned nodes:
     confirm Enter the password that will be used to access provisioned nodes:
    
     PLAY [localhost] **************************************************************
    
     GATHERING FACTS ***************************************************************
     ok: [localhost]
    
     TASK: [pbstart.yml pb_start_playbook] *****************************************
     ok: [localhost] => {
         "msg": "Playbook started - cobbler-deploy.yml"
     }
    
     msg: Playbook started - cobbler-deploy.yml
    
     ...
    
     PLAY [localhost] **************************************************************
    
     TASK: [pbfinish.yml pb_finish_playbook] ***************************************
     ok: [localhost] => {
         "msg": "Playbook finished - cobbler-deploy.yml"
     }
    
     msg: Playbook finished - cobbler-deploy.yml
    
     PLAY RECAP ********************************************************************
     localhost                  : ok=92   changed=45   unreachable=0    failed=0

You are now ready to upgrade the input model to be compatible.

Procedure 15.9: Upgrading the Cloud Input Model

At this point, some mandatory changes must be made to the existing input model to permit the upgrade to proceed. These mandatory changes reflect:

  • The removal of previously-deprecated service components;

  • The dropping of service components that are no longer supported;

  • That there can be only one untagged VLAN per network interface;

  • That there must be a MANAGEMENT network group.

There are also some service components that have been made redundant and have no effect. These should be removed to quieten the associated config-processor-run.yml warnings.

For example, if you run the config-processor-run.yml playbook from the ~/openstack/ardana/ansible directory before making the necessary input model changes, you should see it fail with errors similar to those shown below (unless your input model does not deploy the problematic service components):

ardana > cd ~/openstack/ardana/ansible
 ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
 Enter encryption key (press return for none):
 confirm Enter encryption key (press return for none):
 To change encryption key enter new key (press return for none):
 confirm To change encryption key enter new key (press return for none):

 PLAY [localhost] **************************************************************

 GATHERING FACTS ***************************************************************
 ok: [localhost]

 ...

             "################################################################################",
             "# The configuration processor failed.  ",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'designate-pool-manager' has been deprecated and will be replaced by 'designate-worker'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'manila-share' service component is deprecated. The 'manila-share' service component can be removed as manila share service will be deployed where manila-api is specified. This is not a deprecation for openstack-manila-share but just an entry deprecation in input model.",
             "",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'designate-zone-manager' has been deprecated and will be replaced by 'designate-producer'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:openstack-core: 'glance-registry' has been deprectated and is no longer deployed. Please update you input model to remove any 'glance-registry' service component specifications to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:mml: 'ceilometer-api' is no longer used by Ardana and will not be deployed. Please update your input model to remove this warning.",
             "",
             "#   control-planes-2.0        WRN: cp:sles-compute: 'neutron-lbaasv2-agent' has been deprecated and replaced by 'octavia' and will not be deployed in a future release. Please update your input model to remove this warning.",
             "",
             "#   control-planes-2.0        ERR: cp:common-service-components: Undefined component 'freezer-agent'",
             "#   control-planes-2.0        ERR: cp:openstack-core: Undefined component 'nova-console-auth'",
             "#   control-planes-2.0        ERR: cp:openstack-core: Undefined component 'heat-api-cloudwatch'",
             "#   control-planes-2.0        ERR: cp:mml: Undefined component 'freezer-api'",
             "################################################################################"
         ]
     }
 }

 TASK: [debug var=config_processor_result.stderr] ******************************
 ok: [localhost] => {
     "var": {
         "config_processor_result.stderr": "/usr/lib/python2.7/site-packages/ardana_configurationprocessor/cp/model/YamlConfigFile.py:95: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n  self._contents = yaml.load(''.join(lines))"
     }
 }

 TASK: [fail msg="Configuration processor run failed, see log output above for details"] ***
 failed: [localhost] => {"failed": true}
 msg: Configuration processor run failed, see log output above for details

 msg: Configuration processor run failed, see log output above for details

 FATAL: all hosts have already failed -- aborting

 PLAY RECAP ********************************************************************
            to retry, use: --limit @/var/lib/ardana/config-processor-run.retry

 localhost                  : ok=8    changed=5    unreachable=0    failed=1

To resolve any errors and warnings like those shown above, you will need to perform the following actions:

  1. Remove any service component entries that are no longer valid from the control_plane.yml (or whichever file holds the control-plane definitions) under ~/openstack/my_cloud/definition. This means that you have to comment out (or delete) any lines for the following service components, which are no longer available:

    • freezer-agent

    • freezer-api

    • heat-api-cloudwatch

    • nova-console-auth

    Note
    Note

    This should resolve the errors that cause the config-processor-run.yml playbook to fail.

  2. Similarly, remove any service components that are redundant and no longer required. This means that you should comment out (or delete) any lines for the following service components:

    • ceilometer-api

    • glance-registry

    • manila-share

    • neutron-lbaasv2-agent

    Note
    Note

    This should resolve most of the warnings reported by the config-processor-run.yml playbook.

    Important
    Important

    If you have deployed the designate service components (designate-pool-manager and designate-zone-manager) in your cloud, you will see warnings like those shown above, indicating that these service components have been deprecated.

    You can switch to using the newer designate-worker and designate-producer service components, which will quieten these deprecation warnings produced by the config-processor-run.yml playbook run.

    However, this is a procedure that should be performed after the upgrade has completed, as outlined in Section 15.4.5, “Post-Upgrade Tasks” below.
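
    To locate every occurrence of the service components listed in this step and the previous one, a simple search across the input model files can help; this is only a sketch, and each match still needs to be reviewed by hand before it is commented out or deleted:

    ardana > cd ~/openstack/my_cloud/definition
    ardana > grep -rn -e freezer-agent -e freezer-api -e heat-api-cloudwatch -e nova-console-auth \
                 -e ceilometer-api -e glance-registry -e manila-share -e neutron-lbaasv2-agent .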

  3. Once you have made the necessary changes to your input model, if you run git diff under the ~/openstack/my_cloud/definition directory, you should see output similar to the following:

    ardana > cd ~/openstack/my_cloud/definition
     ardana > git diff
     diff --git a/my_cloud/definition/data/control_plane.yml b/my_cloud/definition/data/control_plane.yml
     index f7cfd84..2c1a73c 100644
     --- a/my_cloud/definition/data/control_plane.yml
     +++ b/my_cloud/definition/data/control_plane.yml
     @@ -32,7 +32,6 @@
              - NEUTRON-CONFIG-CP1
            common-service-components:
              - lifecycle-manager-target
     -        - freezer-agent
              - stunnel
              - monasca-agent
              - logging-rotate
     @@ -118,12 +117,10 @@
                  - cinder-volume
                  - cinder-backup
                  - glance-api
     -            - glance-registry
                  - nova-api
                  - nova-placement-api
                  - nova-scheduler
                  - nova-conductor
     -            - nova-console-auth
                  - nova-novncproxy
                  - neutron-server
                  - neutron-ml2-plugin
     @@ -137,7 +134,6 @@
                  - horizon
                  - heat-api
                  - heat-api-cfn
     -            - heat-api-cloudwatch
                  - heat-engine
                  - ops-console-web
                  - barbican-api
     @@ -151,7 +147,6 @@
                  - magnum-api
                  - magnum-conductor
                  - manila-api
     -            - manila-share
    
              - name: mml
                cluster-prefix: mml
     @@ -164,9 +159,7 @@
    
                  # freezer-api shares elastic-search with logging-server
                  # so must be co-located with it
     -            - freezer-api
    
     -            - ceilometer-api
                  - ceilometer-polling
                  - ceilometer-agent-notification
                  - ceilometer-common
     @@ -194,4 +187,3 @@
                  - neutron-l3-agent
                  - neutron-metadata-agent
                  - neutron-openvswitch-agent
     -            - neutron-lbaasv2-agent
  4. If you are happy with these changes, commit them into the Git repository as follows:

    ardana > cd ~/openstack/my_cloud/definition
     ardana > git add -A
     ardana > git commit -m "SOC 9 CLM Upgrade input model migration"
  5. Now you are ready to run the config-processor-run.yml playbook. If the necessary input model changes have been made, it will complete successfully:

    ardana > cd ~/openstack/ardana/ansible
     ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
     Enter encryption key (press return for none):
     confirm Enter encryption key (press return for none):
     To change encryption key enter new key (press return for none):
     confirm To change encryption key enter new key (press return for none):
    
     PLAY [localhost] **************************************************************
    
     GATHERING FACTS ***************************************************************
     ok: [localhost]
    
     ...
     PLAY RECAP ********************************************************************
     localhost                  : ok=24   changed=20   unreachable=0    failed=0

15.4.3 Upgrading Cloud Services

The deployer node is now ready to be used to upgrade the remaining cloud nodes and running services.

Warning
Warning

If upgrading from Helion OpenStack 8, a manual file update must be applied before continuing the upgrade process. In the file /usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml, replace `command` with `shell` in the first Ansible task. The correct version of the file appears below.

- name: deployer-setup | check-product-status | Check HOS product installed
  shell: |-
    zypper info hpe-helion-openstack-release | grep "^Installed *: *Yes"
  ignore_errors: yes
  register: product_flavor_hos

- name: deployer-setup | check-product-status | Check SOC product availability
  become: yes
  zypper:
    name: "suse-openstack-cloud-release>=8"
    state: present
  ignore_errors: yes
  register: product_flavor_soc

- name: deployer-setup | check-product-status | Provide help
  fail:
    msg: >
      The deployer node does not have a Cloud Add-On product installed.
      In YaST select Software/Add-On Products to see an overview of installed
       add-on products and use "Add" to add the Cloud product.
  when:
    - product_flavor_soc|failed
    - product_flavor_hos|failed
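
If you prefer not to edit the file by hand, the following one-liner is a minimal sketch that applies the same change, assuming the first task in the file still uses the command module; verify the result against the corrected version shown above before committing it:

ardana > sudo sed -i '0,/command:/s//shell:/' /usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml
ardana > grep -n 'shell:' /usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml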

Changes to the check-product-status.yml file must be staged and committed via git.

git add -u
git commit -m "applying osconfig fix prior to HOS8 to SOC9 upgrade"
Note
Note

The ardana-upgrade.yml playbook runs the upgrade process against all nodes in parallel, though some of the steps are serialised to run on only one node at a time to avoid triggering potentially problematic race conditions. As such, the playbook can take a long time to run.

Procedure 15.10: Generate the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Based Scratch Area
  1. Generate the updated scratch area using the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources:

    ardana > cd ~/openstack/ardana/ansible
    ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
    
    PLAY [localhost] **************************************************************
    
    GATHERING FACTS ***************************************************************
    ok: [localhost]
    
    ...
    
    PLAY RECAP ********************************************************************
    localhost                  : ok=31   changed=16   unreachable=0    failed=0
  2. Confirm that there are no pending updates for the deployer node. This could happen if you are using an SMT to manage the repositories, and updates have been released through the official channels since the deployer node was migrated. To check for any pending Cloud Lifecycle Manager package updates, you can run the ardana-update-pkgs.yml playbook as follows:

     ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit OPS-LM--first-member
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-dplyr-m1]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-update-pkgs.yml"
    }
    
    ...
    
    TASK: [_ardana-update-status | Report update status] **************************
    ok: [ardana-cp-dplyr-m1] => {
        "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n====================================================================="
    }
    
    msg: =====================================================================
    Update status for node ardana-cp-dplyr-m1:
    =====================================================================
    No pending update actions on the ardana-cp-dplyr-m1 host
    were collected or reset during this update run or persisted during
    previous unsuccessful or incomplete update runs.
    
    =====================================================================
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-update-pkgs.yml"
    }
    
    msg: Playbook finished - ardana-update-pkgs.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-dplyr-m1         : ok=98   changed=12   unreachable=0    failed=0
    localhost                  : ok=6    changed=2    unreachable=0    failed=0
    Note
    Note

    If running the ardana-update-pkgs.yml playbook identifies that there were updates that needed to be installed on your deployer node, then you need to go back to running the ardana-init command, followed by the cobbler-deploy.yml playbook, then the config-processor-run.yml playbook, and finally the ready-deployment.yml playbook, addressing any additional input model changes that may be needed. Then, repeat this step to check for any pending updates before continuing with the upgrade.

  3. Double-check that there are no pending actions needed for the deployer node by running the ardana-update-status.yml playbook, as follows:

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml --limit OPS-LM--first-member
    
    PLAY [resources] **************************************************************
    
    ...
    
    TASK: [_ardana-update-status | Report update status] **************************
    ok: [ardana-cp-dplyr-m1] => {
        "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n====================================================================="
    }
    
    msg: =====================================================================
    Update status for node ardana-cp-dplyr-m1:
    =====================================================================
    No pending update actions on the ardana-cp-dplyr-m1 host
    were collected or reset during this update run or persisted during
    previous unsuccessful or incomplete update runs.
    
    =====================================================================
    
    PLAY RECAP ********************************************************************
    ardana-cp-dplyr-m1         : ok=12   changed=0    unreachable=0    failed=0
  4. Having verified that no pending actions were detected, it is safe to proceed with running the ardana-upgrade.yml playbook to upgrade the entire cloud:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-upgrade.yml
    PLAY [all] ********************************************************************
    
    ...
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-upgrade.yml"
    }
    
    msg: Playbook started - ardana-upgrade.yml
    
    ...
    ...
    ...
    ...
    ...
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-upgrade.yml"
    }
    
    msg: Playbook finished - ardana-upgrade.yml

The ardana-upgrade.yml playbook run will take a long time. The zypper dist-upgrade phase is serialised across all of the nodes and usually takes between five and ten minutes per node. This is followed by the cloud service upgrade phase, which takes approximately the same amount of time as a full cloud deploy. During this time, the cloud should remain basically functional, though there may be brief interruptions to some services. However, we recommend avoiding any workload management tasks during this period.

Note
Note

Until the ardana-upgrade.yml playbook run has completed successfully, other playbooks, such as ardana-status.yml, may report status problems. This is because some services that are expected to be running may not be installed, enabled, or migrated yet.

The ardana-upgrade.yml playbook run may sometimes fail during the whole-cloud upgrade phase if a service (for example, the monasca-thresh service) is slow to restart. If this happens, it is safe to run the ardana-upgrade.yml playbook again; in most cases it will continue past the stage that failed previously. However, if the same problem persists across multiple runs, contact your support team for assistance.

Important
Important

It is important to disable all SUSE Linux Enterprise 12 SP3 and SUSE OpenStack Cloud 8 Cloud Lifecycle Manager repositories before migrating the deployer to SUSE Linux Enterprise 12 SP4 and the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. If you did not do this, then the first time you run the ardana-upgrade.yml playbook, it may complain that there are pending updates for the deployer node. This requires you to repeat the earlier steps to upgrade the deployer node, starting with running the ardana-init command. If this happens, repeat the steps as requested; it does not represent a serious problem.

In SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, the LBaaS V2 legacy driver has been deprecated and removed. As part of the ardana-upgrade.yml playbook run, all existing LBaaS V2 load-balancers are automatically migrated to the Octavia Amphora provider. To enable creation of new Octavia-based load-balancer instances, ensure that an appropriate Amphora image is registered for use when creating instances by following Chapter 43, Configuring Load Balancer as a Service.

Note
Note

While running the ardana-upgrade.yml playbook, a point will be reached when the Neutron services are upgraded. As part of this upgrade, any existing LBaaS V2 load-balancer definitions will be migrated to Octavia Amphora-based load-balancer definitions.

After this migration of load-balancer definitions has completed, if a load-balancer failover is triggered, then the replacement load-balancer may fail to start, as an appropriate Octavia Amphora image for SUSE OpenStack Cloud 9 Cloud Lifecycle Manager will not yet be available.

However, once the Octavia Amphora image has been uploaded using the above instructions, then it will be possible to recover any failed load-balancers by re-triggering the failover: follow the instructions at https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#loadbalancer-failover.
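
For reference, a minimal sketch of re-triggering a failover with the OpenStack client, assuming suitable admin credentials have been sourced and using a placeholder for the load balancer, looks like this:

ardana > openstack loadbalancer list
ardana > openstack loadbalancer failover <LOAD_BALANCER_ID>
ardana > openstack loadbalancer show <LOAD_BALANCER_ID>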

15.4.4 Rebooting the Nodes into the SUSE Linux Enterprise 12 SP4 Kernel

At this point, all of the cloud services have been upgraded, but the nodes are still running the SUSE Linux Enterprise 12 SP3 kernel. The final step in the upgrade workflow is to reboot all of the nodes in the cloud in a controlled fashion, to ensure that active services failover appropriately.

The recommended order for rebooting nodes is to start with the deployer. This requires special handling, since the Ansible-based automation cannot fully manage the reboot of the node that it is running on.

After that, we recommend rebooting the rest of the nodes in the control planes in a rolling-reboot fashion, ensuring that high-availability services remain available.

Finally, the compute nodes can be rebooted, either individually or in groups, as is appropriate to avoid interruptions to running workloads.

Warning
Warning

Do not reboot all your control plane nodes at the same time.

Procedure 15.11: Rebooting the Deployer Node

The reboot of the deployer node requires additional steps, as the Ansible-based automation framework cannot fully automate the reboot of the node that runs the ansible-playbook commands.

  1. Run the ardana-reboot.yml playbook limited to the deployer node, either by name, or using the logical node identifier OPS-LM--first-member, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit OPS-LM--first-member
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-dplyr-m1]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-reboot.yml"
    }
    
    msg: Playbook started - ardana-reboot.yml
    
    ...
    
    TASK: [ardana-reboot | Deployer node has to be rebooted manually] *************
    failed: [ardana-cp-dplyr-m1] => {"failed": true}
    msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook:
    cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1
    
    msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook:
    cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1
    
    FATAL: all hosts have already failed -- aborting
    
    PLAY RECAP ********************************************************************
               to retry, use: --limit @/var/lib/ardana/ardana-reboot.retry
    
    ardana-cp-dplyr-m1         : ok=8    changed=3    unreachable=0    failed=1
    localhost                  : ok=7    changed=0    unreachable=0    failed=0

    The ardana-reboot.yml playbook will fail when run on a deployer node; this is expected. The reported failure message tells you what you need to do to complete the remaining steps of the reboot manually: namely, rebooting the node, then logging back in again to run the _ardana-post-reboot.yml playbook, to start any services that need to be running on the node.

  2. Manually reboot the deployer node, for example with shutdown -r now.

  3. Once the deployer node has rebooted, you need to log in again and run the _ardana-post-reboot.yml playbook to complete the startup of any services that should be running on the deployer node, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook _ardana-post-reboot.yml --limit OPS-LM--first-member
    
    PLAY [resources] **************************************************************
    
    TASK: [Set pending_clm_update] ************************************************
    skipping: [ardana-cp-dplyr-m1]
    
    ...
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-status.yml"
    }
    
    msg: Playbook finished - ardana-status.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-dplyr-m1         : ok=26   changed=0    unreachable=0    failed=0
    localhost                  : ok=19   changed=1    unreachable=0    failed=0
Procedure 15.12: Rebooting the Remaining Control Plane Nodes

For the remaining nodes, you can use ardana-reboot.yml to fully automate the reboot process. However, it is recommended that you reboot the nodes in a rolling-reboot fashion, such that high-availability services continue to run without interruption. Similarly, to avoid interruption of service for any singleton services (such as the cinder-volume and cinder-backup services), they should be migrated off the intended node before it is rebooted, and then migrated back again afterwards.

  1. Use the ansible command's --list-hosts option to list the remaining nodes in the cloud that are neither the deployer nor a compute node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP'
        ardana-cp-dbmqsw-m1
        ardana-cp-dbmqsw-m2
        ardana-cp-dbmqsw-m3
        ardana-cp-osc-m1
        ardana-cp-osc-m2
        ardana-cp-mml-m1
        ardana-cp-mml-m2
        ardana-cp-mml-m3
  2. Use the following command to generate the set of ansible-playbook commands that need to be run to reboot all the nodes sequentially:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > for node in $(ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP'); do echo ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ${node} || break; done
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m1
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m2
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m3
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m1
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m2
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m1
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m2
    ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3
    Warning
    Warning

    Do not reboot all your control-plane nodes at the same time.

  3. To reboot a specific control-plane node, you can use the above ansible-playbook commands as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-mml-m3]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-reboot.yml"
    }
    
    msg: Playbook started - ardana-reboot.yml
    
    
    
    ...
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-status.yml"
    }
    
    msg: Playbook finished - ardana-status.yml
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-reboot.yml"
    }
    
    msg: Playbook finished - ardana-reboot.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-mml-m3           : ok=389  changed=105  unreachable=0    failed=0
    localhost                  : ok=27   changed=1    unreachable=0    failed=0
Note
Note

You can reboot more than one control-plane node at a time, but only if they are members of different control-plane clusters. For example, you could reboot one node from each of the OpenStack controller, database, swift, monitoring, or logging clusters, so long as doing so reboots only one node from each cluster at a time.

When rebooting the first member of the control-plane cluster where monitoring services run, the monasca-thresh service can sometimes fail to start up in a timely fashion when the node is coming back up after being rebooted. This can cause ardana-reboot.yml to fail. See below for suggestions on how to handle this problem.

Procedure 15.13: Getting monasca-thresh Running After an ardana-reboot.yml Failure

If the ardana-reboot.yml playbook failed because monasca-thresh didn't start up in a timely fashion after a reboot, you can retry starting the services on the node using the _ardana-post-reboot.yml playbook for the node. This is similar to the manual handling of the deployer reboot, since the node has already successfully rebooted onto the new kernel, and you just need to get the required services running again on the node.

It can sometimes take up to 15 minutes for the monasca-thresh service to successfully start in such cases.

  • However, if the service still fails to start after that time, then you may need to force a restart of the storm-nimbus and storm-supervisor services on all nodes in the MON-THR node group, as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible MON-THR -b -m shell -a "systemctl restart storm-nimbus"
    ardana-cp-mml-m1 | success | rc=0 >>
    
    
    ardana-cp-mml-m2 | success | rc=0 >>
    
    
    ardana-cp-mml-m3 | success | rc=0 >>
    
    
    ardana > ansible MON-THR -b -m shell -a "systemctl restart storm-supervisor"
    ardana-cp-mml-m1 | success | rc=0 >>
    
    
    ardana-cp-mml-m2 | success | rc=0 >>
    
    
    ardana-cp-mml-m3 | success | rc=0 >>
    
    
    ardana > ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-mml-m1

If the monasca-thresh service still fails to start up, contact your support team.

To check which control plane nodes have not yet been rebooted onto the new kernel, you can use an Ansible command to run the command uname -r on the target nodes, as follows:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible -i hosts/verb_hosts 'resources:!OPS-LM--first-member:!NOV-CMP' -m command -a 'uname -r'
ardana-cp-dbmqsw-m1 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-dbmqsw-m3 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-osc-m1 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-dbmqsw-m2 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-mml-m2 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-osc-m2 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-mml-m1 | success | rc=0 >>
4.12.14-95.57-default

ardana-cp-mml-m3 | success | rc=0 >>
4.12.14-95.57-default

ardana > uname -r
4.12.14-95.57-default

If any node's uname -r value does not match the kernel that the deployer is running, you probably have not yet rebooted that node.

Procedure 15.14: Rebooting the Compute Nodes

Finally, you need to reboot the compute nodes. Rebooting multiple compute nodes at the same time is possible, so long as doing so does not compromise the integrity of running workloads. We recommend migrating workloads off groups of compute nodes in a controlled fashion, enabling those nodes to be rebooted together.

Warning
Warning

Do not reboot all of your compute nodes at the same time.

  1. To see all the compute nodes that are available to be rebooted, you can run the following command:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
        ardana > ansible -i hosts/verb_hosts --list-hosts NOV-CMP
        ardana-cp-slcomp0001
        ardana-cp-slcomp0002
    ...
        ardana-cp-slcomp0080
  2. Reboot the compute nodes, individually or in groups, using the ardana-reboot.yml playbook as follows:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
        ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-slcomp0001,ardana-cp-slcomp0002
    
    PLAY [all] ********************************************************************
    
    TASK: [setup] *****************************************************************
    ok: [ardana-cp-slcomp0001]
    ok: [ardana-cp-slcomp0002]
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbstart.yml pb_start_playbook] *****************************************
    ok: [localhost] => {
        "msg": "Playbook started - ardana-reboot.yml"
    }
    
    msg: Playbook started - ardana-reboot.yml
    
    ...
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-status.yml"
    }
    
    msg: Playbook finished - ardana-status.yml
    
    PLAY [localhost] **************************************************************
    
    TASK: [pbfinish.yml pb_finish_playbook] ***************************************
    ok: [localhost] => {
        "msg": "Playbook finished - ardana-reboot.yml"
    }
    
    msg: Playbook finished - ardana-reboot.yml
    
    PLAY RECAP ********************************************************************
    ardana-cp-slcomp0001       : ok=120  changed=11   unreachable=0    failed=0
    ardana-cp-slcomp0002       : ok=120  changed=11   unreachable=0    failed=0
    localhost                  : ok=27   changed=1    unreachable=0    failed=0
Important
Important

You must ensure that there is sufficient unused workload capacity to host any migrated workload or Amphora instances that may be running on the targeted compute nodes.

When rebooting multiple compute nodes at the same time, consider manually migrating any running workloads and Amphora instances off the target nodes in advance, to avoid any potential risk of workload or service interruption.
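
As a minimal sketch of preparing a compute node for its reboot, assuming admin credentials have been sourced and using the hypothetical host names from the examples above, you can disable scheduling to the node and list the instances that still need to be migrated off it:

ardana > openstack compute service set --disable --disable-reason "SP4 reboot" ardana-cp-slcomp0001 nova-compute
ardana > openstack server list --all-projects --host ardana-cp-slcomp0001

Live-migrate any listed instances, run the ardana-reboot.yml playbook for the node, and then re-enable the service with openstack compute service set --enable.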

15.4.5 Post-Upgrade Tasks

After the cloud has been upgraded to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, if designate was previously configured, it will still be using the deprecated designate-zone-manager and designate-pool-manager service components.

They will continue to operate correctly under SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, but we recommend that you migrate to the newer designate-worker and designate-producer service components by following the procedure documented in Section 25.4, “Migrate Zone/Pool to Worker/Producer after Upgrade”.

Procedure 15.15: Cleanup Orphaned Packages
  • After migrating the deployer node, a small number of installed packages are no longer required, such as the ceilometer and freezer virtualenv (venv) packages.

    You can safely remove these packages with the following command:

    ardana > zypper packages --orphaned
    Loading repository data...
    Reading installed packages...
    S | Repository | Name                             | Version                           | Arch
    --+------------+----------------------------------+-----------------------------------+-------
    i | @System    | python-flup                      | 1.0.3.dev_20110405-2.10.52        | noarch
    i | @System    | python-happybase                 | 0.9-1.64                          | noarch
    i | @System    | venv-openstack-ceilometer-x86_64 | 9.0.8~dev7-12.24.2                | noarch
    i | @System    | venv-openstack-freezer-x86_64    | 5.0.0.0~xrc2~dev2-10.22.1         | noarch
    ardana > sudo zypper remove venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64
    Loading repository data...
    Reading installed packages...
    Resolving package dependencies...
    
    The following 2 packages are going to be REMOVED:
      venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64
    
    2 packages to remove.
    After the operation, 79.0 MiB will be freed.
    Continue? [y/n/...? shows all options] (y): y
    (1/2) Removing venv-openstack-ceilometer-x86_64-9.0.8~dev7-12.24.2.noarch ..................................................................[done]
    Additional rpm output:
    /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      return yaml.load(f)
    
    
    (2/2) Removing venv-openstack-freezer-x86_64-5.0.0.0~xrc2~dev2-10.22.1.noarch ..............................................................[done]
    Additional rpm output:
    /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      return yaml.load(f)
Procedure 15.16: Delete freezer Containers from swift

The freezer service has been deprecated and removed from SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, but the backups it created before the upgrade still consume space in your swift object store.

Therefore, once you have completed the upgrade successfully, you can safely delete the containers that freezer used to hold the database and ring backups, freeing up that space.

  • Using the credentials in the backup.osrc file, found on the deployer node in the Ardana account's home directory, run the following commands:

    ardana > . ~/backup.osrc
    ardana > swift list
    freezer_database_backups
    freezer_rings_backups
    ardana > swift delete --all
    freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/1_1598548599/segments/000000021
    freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/2_1598605266/data1
    ...
    freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/0_1598505404/segments/000000001
    freezer_database_backups
    freezer_rings_backups/metadata/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/metadata
    ...
    freezer_rings_backups/data/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/data
    freezer_rings_backups

15.5 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment

Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.

Use the following steps to deploy a PTF:

  1. When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:

    ardana > tmpdir=`mktemp -d`
    ardana > cd $tmpdir
    ardana > wget --no-directories --recursive --reject "index.html*"\
    --user=USER_NAME \
    --ask-password \
    --no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030/
  2. Remove any old data from the PTF repository, such as packages left over from an earlier migration or from previously installed product patches.

    ardana > sudo rm -rf /srv/www/suse-12.4/x86_64/repos/PTF/*
  3. Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a neutron PTF.

    ardana > sudo mkdir -p /srv/www/suse-12.4/x86_64/repos/PTF/
    ardana > sudo mv $tmpdir/* /srv/www/suse-12.4/x86_64/repos/PTF/
    ardana > sudo chown --recursive root:root /srv/www/suse-12.4/x86_64/repos/PTF/*
    ardana > rmdir $tmpdir
  4. Create or update the repository metadata:

    ardana > sudo /usr/local/sbin/createrepo-cloud-ptf
    Spawning worker 0 with 2 pkgs
    Workers Finished
    Saving Primary metadata
    Saving file lists metadata
    Saving other metadata
  5. Refresh the PTF repository before installing package updates on the Cloud Lifecycle Manager:

    ardana > sudo zypper refresh --force --repo PTF
    Forcing raw metadata refresh
    Retrieving repository 'PTF' metadata ...................................................[done]
    Forcing building of repository cache
    Building repository 'PTF' cache ..........................................[done]
    Specified repositories have been refreshed.
  6. The PTF now shows as available on the deployer:

    ardana > sudo zypper se --repo PTF
    Loading repository data...
    Reading installed packages...
    
    S | Name                          | Summary                                 | Type
    --+-------------------------------+-----------------------------------------+--------
      | python-neutronclient          | Python API and CLI for OpenStack neutron | package
    i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack neutron | package
  7. Install the PTF venv packages on the Cloud Lifecycle Manager:

    ardana > sudo zypper dup  --from PTF
    Refreshing service
    Loading repository data...
    Reading installed packages...
    Computing distribution upgrade...
    
    The following package is going to be upgraded:
      venv-openstack-neutron-x86_64
    
    The following package has no support information from its vendor:
      venv-openstack-neutron-x86_64
    
    1 package to upgrade.
    Overall download size: 64.2 MiB. Already cached: 0 B. After the operation, additional 6.9 KiB will be used.
    Continue? [y/n/...? shows all options] (y): y
    Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ... (1/1),  64.2 MiB ( 64.6 MiB unpacked)
    Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done]
    Checking for file conflicts: ..............................................................[done]
    (1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done]
    Additional rpm output:
    warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEY
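
    To confirm which venv package version is now installed, you can query the RPM database; the version string should match the PTF you deployed:

    ardana > rpm -q venv-openstack-neutron-x86_64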
  8. Validate that the venv tarball has been installed into the deployment directory. (Note: the packages file in that directory lists the registered tarballs that will be used for the services, which should align with the installed venv RPM; a quick cross-check is shown after the listing below.)

    ardana > ls -la /opt/ardana_packager/ardana-9/sles_venv/x86_64
    total 898952
    drwxr-xr-x 2 root root     4096 Oct 30 16:10 .
    ...
    -rw-r--r-- 1 root root 67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz <<<
    -rw-r--r-- 1 root root 64674087 Aug 14 16:14 nova-20180814T161306Z.tgz
    -rw-r--r-- 1 root root 45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz
    -rw-r--r-- 1 root root     1879 Oct 30 16:10 packages
    -rw-r--r-- 1 root root 27186008 Apr 26  2018 swift-20180426T230541Z.tgz
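
    As a quick cross-check of that registration, you can search the packages file for the service affected by the PTF (neutron in this example):

    ardana > grep neutron /opt/ardana_packager/ardana-9/sles_venv/x86_64/packages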
  9. Install the non-venv PTF packages on the Compute Node:

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' --limit comp0001-mgmt

    When it has finished, you can verify on comp0001-mgmt that the upgraded package has been installed:

    ardana > sudo zypper se --detail python-neutronclient
    Loading repository data...
    Reading installed packages...
    
    S | Name                 | Type     | Version                         | Arch   | Repository
    --+----------------------+----------+---------------------------------+--------+--------------------------------------
    i | python-neutronclient | package  | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF
      | python-neutronclient | package  | 6.5.0-4.361                     | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1
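
    The example above targets a single node with --limit. To update several Compute Nodes in one run, you can pass a comma-separated list or an Ansible host pattern instead; the host names below are illustrative:

    ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
    --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' \
    --limit comp0001-mgmt,comp0002-mgmt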
  10. Running the ardana update playbook will distribute the PTF venv packages to the cloud servers. You will then find them in the virtual environment directory alongside the other venvs.

    The Compute Node before running the update playbook:

    ardana > ls -la /opt/stack/venv
    total 24
    drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
    drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
    drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
    drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
  11. Run the update.

    ardana > cd ~/scratch/ansible/next/ardana/ansible
    ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt

    When it has finished, you can see that an additional virtual environment has been installed.

    ardana > ls -la /opt/stack/venv
    total 28
    drwxr-xr-x  9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z
    drwxr-xr-x  9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z
    drwxr-xr-x  9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z <<< New venv installed
    drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z
    drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
  12. The PTF may also include RPM package updates in addition to the venv updates. To complete the update, follow the instructions in Section 15.3.1, “Performing the Update”.

15.6 Periodic OpenStack Maintenance Tasks

The heat-manage command helps manage heat-specific database operations. The associated database should be purged periodically to save space. Set this up as a cron job at /etc/cron.weekly/local-cleanup-heat on the servers where the heat service is running, with the following content:

  #!/bin/bash
  su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :
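
Make sure the script is executable so that cron will run it; a minimal example:

  ardana > sudo chmod 755 /etc/cron.weekly/local-cleanup-heat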

The nova-manage db archive_deleted_rows command moves deleted rows from production tables to shadow tables. Including --until-complete makes the command run until all deleted rows have been archived. It is recommended to set up this task as /etc/cron.weekly/local-cleanup-nova on the servers where the nova service is running, with the following content:

  #!/bin/bash
  su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :
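
As with the heat job, make the script executable. If you want to test archiving manually before relying on the cron job, nova-manage also accepts --max_rows to limit how many rows are processed per batch; for example:

  ardana > sudo chmod 755 /etc/cron.weekly/local-cleanup-nova
  ardana > sudo su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --max_rows 1000"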