15 System Maintenance #
This section contains the following subsections to help you manage, configure, and maintain your SUSE OpenStack Cloud environment, as well as procedures for performing node maintenance.
15.1 Planned System Maintenance #
Planned maintenance tasks for your cloud are covered in the sections below.
15.1.1 Whole Cloud Maintenance #
Planned maintenance procedures for your whole cloud.
15.1.1.1 Bringing Down Your Cloud: Services Down Method #
If you have a planned maintenance window and need to bring down your entire cloud, update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”. This method will bring down all of your services.
If you wish to use a method utilizing rolling reboots where your cloud services will continue running, see Section 15.1.1.2, “Rolling Reboot of the Cloud”.
To perform backups prior to these steps, see the backup and restore procedures in Chapter 17, Backup and Restore.
15.1.1.1.1 Gracefully Bringing Down and Restarting Your Cloud Environment #
You will do the following steps from your Cloud Lifecycle Manager.
Log in to your Cloud Lifecycle Manager.
Gracefully shut down your cloud by running the ardana-stop.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml
Shut down and restart your nodes. There are multiple ways you can do this:
You can SSH to each node and use sudo reboot -f to reboot the node. Reboot the control plane nodes first so that they become functional again as early as possible.
You can shut down the nodes and then physically restart them, either via the power button or the IPMI. If your cloud data model (servers.yml) specifies iLO connectivity for all nodes, then you can use the bm-power-down.yml and bm-power-up.yml playbooks on the Cloud Lifecycle Manager. Power down the control plane nodes last so that they remain online as long as possible, and power them back up before the other nodes to restore their services quickly.
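If your data model does specify iLO/IPMI connectivity, the power cycle can be scripted with the baremetal playbooks used elsewhere in this guide. The following is only a sketch; the node names are placeholders, and the real names come from sudo cobbler system list:
# Sketch only: power-cycle nodes with the baremetal playbooks; node names are examples.
cd ~/openstack/ardana/ansible
# Power down compute nodes first, then the control plane nodes.
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=compute1,compute2
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=controller1,controller2,controller3
# ...perform the maintenance...
# Power the control plane nodes up first, then the remaining nodes.
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=controller1,controller2,controller3
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=compute1,compute2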
Perform the necessary maintenance.
After the maintenance is complete, power your Cloud Lifecycle Manager back up and then SSH to it.
Determine the current power status of the nodes in your environment:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-status.yml
If necessary, power up any nodes that are not already powered up, ensuring that you power up your controller nodes first. You can target specific nodes with the -e nodelist=<node_name> switch.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts bm-power-up.yml [-e nodelist=<node_name>]
Note: Obtain the <node_name> by using the sudo cobbler system list command from the Cloud Lifecycle Manager.
Bring the databases back up:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
Gracefully bring up your cloud services by running the ardana-start.yml playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-start.yml
Pause for a few minutes and give the cloud environment time to come up completely and then verify the status of the individual services using this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
If any services did not start properly, you can run playbooks for the specific services having issues.
For example:
If RabbitMQ fails, run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-start.yml
You can check the status of RabbitMQ afterwards with this playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml
If the recovery failed, you can run:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts rabbitmq-disaster-recovery.yml
Each of the other services has playbooks in the ~/scratch/ansible/next/ardana/ansible directory in the format of <service>-start.yml that you can run. One example, for the compute service, is nova-start.yml.
Continue checking the status of your SUSE OpenStack Cloud 9 services until there are no more failed or unreachable nodes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-status.yml
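If you prefer to automate this re-check rather than re-running the playbook by hand, a small wrapper loop such as the sketch below works because ansible-playbook exits non-zero while any host is failed or unreachable:
# Sketch: keep polling ardana-status.yml until it completes without failures.
cd ~/scratch/ansible/next/ardana/ansible
until ansible-playbook -i hosts/verb_hosts ardana-status.yml; do
  echo "Some services are still failed or unreachable; retrying in two minutes..."
  sleep 120
done
echo "All services report OK."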
15.1.1.2 Rolling Reboot of the Cloud #
If you have a planned maintenance window and need to bring down your entire cloud and restart services while minimizing downtime, follow the steps here to safely restart your cloud. If you do not mind your services being down, then another option for planned maintenance can be found at Section 15.1.1.1, “Bringing Down Your Cloud: Services Down Method”.
15.1.1.2.1 Recommended node reboot order #
To ensure that rebooted nodes reintegrate into the cluster, the key is to allow enough time between controller reboots.
The recommended way to achieve this is as follows:
Reboot controller nodes one-by-one with a suitable interval in between. If you alternate between controllers and compute nodes you will gain more time between the controller reboots.
Reboot compute nodes (if present in your cloud).
Reboot swift nodes (if present in your cloud).
Reboot ESX nodes (if present in your cloud).
15.1.1.2.2 Rebooting controller nodes #
Turn off the keystone Fernet Token-Signing Key Rotation
Before rebooting any controller node, you need to ensure that the keystone Fernet token-signing key rotation is turned off. Run the following command:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-stop-fernet-auto-rotation.yml
Migrate singleton services first
If you have previously rebooted your Cloud Lifecycle Manager for any reason, ensure that the apache2 service is running before continuing. To start the apache2 service, use this command:
ardana > sudo systemctl start apache2
The first consideration before rebooting any controller nodes is that a few services run as singletons (non-HA), so they will be unavailable while the controller they run on is down. Typically this is a very small window, but if you want to retain a service during the reboot of its host, take special action to maintain it, such as migrating the service to another controller.
For these steps, if your singleton services are running on controller1 and you move them to controller2, then ensure you move them back to controller1 before proceeding to reboot controller2.
For the cinder-volume singleton service:
Execute the following command on each controller node to determine which node is hosting the cinder-volume singleton. It should be running on only one node:
ardana > ps auxww | grep cinder-volume | grep -v grep
Run the cinder-migrate-volume.yml playbook. Details about migrating the cinder volume and backup services can be found in Section 8.1.3, “Managing cinder Volume and Backup Services”.
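Instead of logging in to each controller by hand, you can run the same check over SSH from the Cloud Lifecycle Manager. This is a sketch only; the controller hostnames are examples, so substitute the names from hosts/verb_hosts:
# Sketch: find the controller currently running the cinder-volume singleton.
for node in ardana-cp1-c1-m1-mgmt ardana-cp1-c1-m2-mgmt ardana-cp1-c1-m3-mgmt; do
  echo "== $node =="
  ssh "$node" "ps auxww | grep [c]inder-volume"
done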
For the SNAT namespace singleton service:
If you reboot the controller node hosting the SNAT namespace service on it, Compute instances without floating IPs will lose network connectivity when that controller is rebooted. To prevent this from happening, you can use these steps to determine which controller node is hosting the SNAT namespace service and migrate it to one of the other controller nodes while that node is rebooted.
Locate the SNAT node where the router is providing the active snat_service:
From the Cloud Lifecycle Manager, list out your ports to determine which port is serving as the router gateway:
ardana > source ~/service.osrc
ardana > openstack port list --device_owner network:router_gateway
Example:
$ openstack port list --device_owner network:router_gateway +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+ | id | name | mac_address | fixed_ips | +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+ | 287746e6-7d82-4b2c-914c-191954eba342 | | fa:16:3e:2e:26:ac | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} | +--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
Look at the details of this port to determine the binding:host_id value, which points to the host that the port is bound to:
openstack port show <port_id>
Example, with the value you need in bold:
ardana >
openstack port show 287746e6-7d82-4b2c-914c-191954eba342 +-----------------------+--------------------------------------------------------------------------------------------------------------+ | Field | Value | +-----------------------+--------------------------------------------------------------------------------------------------------------+ | admin_state_up | True | | allowed_address_pairs | | | binding:host_id | ardana-cp1-c1-m2-mgmt | | binding:profile | {} | | binding:vif_details | {"port_filter": true, "ovs_hybrid_plug": true} | | binding:vif_type | ovs | | binding:vnic_type | normal | | device_id | e122ea3f-90c5-4662-bf4a-3889f677aacf | | device_owner | network:router_gateway | | dns_assignment | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} | | dns_name | | | extra_dhcp_opts | | | fixed_ips | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} | | id | 287746e6-7d82-4b2c-914c-191954eba342 | | mac_address | fa:16:3e:2e:26:ac | | name | | | network_id | d3cb12a6-a000-4e3e-82c4-ee04aa169291 | | security_groups | | | status | DOWN | | tenant_id | | +-----------------------+--------------------------------------------------------------------------------------------------------------+In this example, the
ardana-cp1-c1-m2-mgmt
is the node hosting the SNAT namespace service.
SSH to the node hosting the SNAT namespace service and check the SNAT namespace, specifying the router_id that has the interface with the subnet that you are interested in:
ardana > ssh <IP_of_SNAT_namespace_host>
ardana > sudo ip netns exec snat-<router_ID> bash
Example:
ardana > sudo ip netns exec snat-e122ea3f-90c5-4662-bf4a-3889f677aacf bash
Obtain the ID for the L3 agent on the node hosting your SNAT namespace:
ardana > source ~/service.osrc
ardana > openstack network agent list
Example, with the entry you need given the examples above:
ardana >
openstack network agent list +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+ | id | agent_type | host | alive | admin_state_up | binary | +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+ | 0126bbbf-5758-4fd0-84a8-7af4d93614b8 | DHCP agent | ardana-cp1-c1-m3-mgmt | :-) | True | neutron-dhcp-agent | | 33dec174-3602-41d5-b7f8-a25fd8ff6341 | Metadata agent | ardana-cp1-c1-m2-mgmt | :-) | True | neutron-metadata-agent | | 3bc28451-c895-437b-999d-fdcff259b016 | L3 agent | ardana-cp1-c1-m1-mgmt | :-) | True | neutron-vpn-agent | | 4af1a941-61c1-4e74-9ec1-961cebd6097b | L3 agent | ardana-cp1-comp0001-mgmt | :-) | True | neutron-l3-agent | | 65bcb3a0-4039-4d9d-911c-5bb790953297 | Open vSwitch agent | ardana-cp1-c1-m1-mgmt | :-) | True | neutron-openvswitch-agent | | 6981c0e5-5314-4ccd-bbad-98ace7db7784 | L3 agent | ardana-cp1-c1-m3-mgmt | :-) | True | neutron-vpn-agent | | 7df9fa0b-5f41-411f-a532-591e6db04ff1 | Metadata agent | ardana-cp1-comp0001-mgmt | :-) | True | neutron-metadata-agent | | 92880ab4-b47c-436c-976a-a605daa8779a | Metadata agent | ardana-cp1-c1-m1-mgmt | :-) | True | neutron-metadata-agent | | a209c67d-c00f-4a00-b31c-0db30e9ec661 | L3 agent | ardana-cp1-c1-m2-mgmt | :-) | True | neutron-vpn-agent | | a9467f7e-ec62-4134-826f-366292c1f2d0 | DHCP agent | ardana-cp1-c1-m1-mgmt | :-) | True | neutron-dhcp-agent | | b13350df-f61d-40ec-b0a3-c7c647e60f75 | Open vSwitch agent | ardana-cp1-c1-m2-mgmt | :-) | True | neutron-openvswitch-agent | | d4c07683-e8b0-4a2b-9d31-b5b0107b0b31 | Open vSwitch agent | ardana-cp1-comp0001-mgmt | :-) | True | neutron-openvswitch-agent | | e91d7f3f-147f-4ad2-8751-837b936801e3 | Open vSwitch agent | ardana-cp1-c1-m3-mgmt | :-) | True | neutron-openvswitch-agent | | f33015c8-f4e4-4505-b19b-5a1915b6e22a | DHCP agent | ardana-cp1-c1-m2-mgmt | :-) | True | neutron-dhcp-agent | | fe43c0e9-f1db-4b67-a474-77936f7acebf | Metadata agent | ardana-cp1-c1-m3-mgmt | :-) | True | neutron-metadata-agent | +--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+Also obtain the ID for the L3 Agent of the node you are going to move the SNAT namespace service to using the same commands as the previous step.
Use these commands to move the SNAT namespace service, with the router_id being the same value as the ID of the router:
Remove the router from the L3 agent on the old host:
ardana > openstack network agent remove router --agent-type l3 <agent_id_of_snat_namespace_host> \
  <qrouter_uuid>
Example:
ardana > openstack network agent remove router --agent-type l3 a209c67d-c00f-4a00-b31c-0db30e9ec661 \
  e122ea3f-90c5-4662-bf4a-3889f677aacf
Removed router e122ea3f-90c5-4662-bf4a-3889f677aacf from L3 agent
Remove the SNAT namespace:
ardana > sudo ip netns delete snat-<router_id>
Example:
ardana > sudo ip netns delete snat-e122ea3f-90c5-4662-bf4a-3889f677aacf
Add the router to the L3 agent on the new host:
ardana > openstack network agent add router --agent-type l3 <agent_id_of_new_snat_namespace_host> \
  <qrouter_uuid>
Example:
ardana > openstack network agent add router --agent-type l3 3bc28451-c895-437b-999d-fdcff259b016 \
  e122ea3f-90c5-4662-bf4a-3889f677aacf
Added router e122ea3f-90c5-4662-bf4a-3889f677aacf to L3 agent
Confirm that it has been moved by listing the details of your port from step 1b above, noting the value of binding:host_id, which should be updated to the host you moved your SNAT namespace to:
ardana > openstack port show <port_ID>
Example:
ardana >
openstack port show 287746e6-7d82-4b2c-914c-191954eba342 +-----------------------+--------------------------------------------------------------------------------------------------------------+ | Field | Value | +-----------------------+--------------------------------------------------------------------------------------------------------------+ | admin_state_up | True | | allowed_address_pairs | | | binding:host_id | ardana-cp1-c1-m1-mgmt | | binding:profile | {} | | binding:vif_details | {"port_filter": true, "ovs_hybrid_plug": true} | | binding:vif_type | ovs | | binding:vnic_type | normal | | device_id | e122ea3f-90c5-4662-bf4a-3889f677aacf | | device_owner | network:router_gateway | | dns_assignment | {"hostname": "host-10-247-96-29", "ip_address": "10.247.96.29", "fqdn": "host-10-247-96-29.openstacklocal."} | | dns_name | | | extra_dhcp_opts | | | fixed_ips | {"subnet_id": "f4152001-2500-4ebe-ba9d-a8d6149a50df", "ip_address": "10.247.96.29"} | | id | 287746e6-7d82-4b2c-914c-191954eba342 | | mac_address | fa:16:3e:2e:26:ac | | name | | | network_id | d3cb12a6-a000-4e3e-82c4-ee04aa169291 | | security_groups | | | status | DOWN | | tenant_id | | +-----------------------+--------------------------------------------------------------------------------------------------------------+
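If you only want the one field rather than the full port details, a quick check such as the following sketch (using the example port ID from above) is enough:
# Sketch: print only the binding:host_id line for the router gateway port.
openstack port show 287746e6-7d82-4b2c-914c-191954eba342 | grep binding:host_id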
Reboot the controllers
In order to reboot the controller nodes, you must first retrieve a list of nodes in your cloud running control plane services.
ardana > for i in $(grep -w cluster-prefix \
    ~/openstack/my_cloud/definition/data/control_plane.yml \
    | awk '{print $2}'); do grep $i \
    ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts \
    | grep ansible_ssh_host | awk '{print $1}'; done
Then perform the following steps from your Cloud Lifecycle Manager for each of your controller nodes:
If any singleton services are active on this node, they will be unavailable while the node is down. If you want to retain the service during the reboot, you should take special action to maintain it, such as migrating the service as noted above.
Stop all services on the controller node that you are rebooting first:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <controller node>
Reboot the controller node. For example, run the following command on the controller itself:
ardana > sudo reboot
Note that the node currently being rebooted could be hosting the Cloud Lifecycle Manager.
Wait for the controller node to become reachable via SSH and allow at least an additional five minutes for the controller node to settle. Start all services on the controller node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <controller node>
Verify that the status of all services on the controller node is OK:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit <controller node>
When the above start operation has completed successfully, you may proceed to the next controller node. Ensure that you migrate your singleton services off that node first.
It is important that you not begin the reboot procedure for a new controller node until the reboot of the previous controller node has been completed successfully (that is, the ardana-status playbook has completed without error).
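Putting the steps above together, the per-controller cycle can be expressed as the following rough sketch. It assumes the controller names were gathered with the command above and that singleton services are handled before each iteration; if a controller also hosts the Cloud Lifecycle Manager, run its cycle manually instead.
# Sketch: rolling reboot of the controller nodes, one at a time.
CONTROLLERS="ardana-cp1-c1-m1-mgmt ardana-cp1-c1-m2-mgmt ardana-cp1-c1-m3-mgmt"   # example names
cd ~/scratch/ansible/next/ardana/ansible
for node in $CONTROLLERS; do
  # Migrate singleton services off $node first (see above).
  ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit "$node"
  ssh "$node" sudo reboot
  sleep 60
  until ssh -o ConnectTimeout=5 "$node" true; do sleep 30; done
  sleep 300   # allow at least five minutes for the node to settle
  ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit "$node"
  # Do not move on to the next controller until the status play passes.
  ansible-playbook -i hosts/verb_hosts ardana-status.yml --limit "$node" || break
done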
Re-enable the keystone Fernet Token-Signing Key Rotation
After all the controller nodes are successfully updated and back online, you need to re-enable the keystone Fernet token-signing key rotation job by running the keystone-reconfigure.yml playbook. On the deployer, run:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
15.1.1.2.3 Rebooting compute nodes #
To reboot a compute node, perform the following operations:
Disable provisioning of the node to take the node offline to prevent further instances being scheduled to the node during the reboot.
Identify instances that exist on the compute node, and then either:
Live migrate the instances off the node before performing the reboot, OR
Stop the instances
Reboot the node
Restart the nova services
Disable provisioning:
ardana > openstack compute service set --disable --disable-reason "DESCRIBE REASON" compute nova-compute
If the node has existing instances running on it, these instances will need to be migrated or stopped prior to rebooting the node.
Live migrate existing instances. Identify the instances on the compute node. Note: The following command must be run with nova admin credentials.
ardana > openstack server list --host <hostname> --all-tenants
Migrate or stop the instances on the compute node.
Migrate the instances off the node by running one of the following commands for each of the instances:
If your instance is booted from a volume or has any number of cinder volumes attached, use the nova live-migration command:
ardana > nova live-migration <instance uuid> [<target compute host>]
If your instance has local (ephemeral) disk(s) only, you can use the --block-migrate option:
ardana > nova live-migration --block-migrate <instance uuid> [<target compute host>]
Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova scheduler will choose a node for you.
OR
Stop the instances on the node by running the following command for each of the instances:
ardana > openstack server stop <instance-uuid>
Stop all services on the Compute node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <compute node>
SSH to your Compute nodes and reboot them:
ardana > sudo reboot
The operating system cleanly shuts down services and then automatically reboots. If you want to be very thorough, run your backup jobs just before you reboot.
Run the ardana-start.yml playbook from the Cloud Lifecycle Manager. If needed, use the bm-power-up.yml playbook to restart the node. Specify just the node(s) you want to start in the 'nodelist' parameter arguments, that is, nodelist=<node1>[,<node2>][,<node3>].
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<compute node>
Execute the ardana-start.yml playbook, specifying the node(s) you want to start in the 'limit' parameter arguments. This parameter accepts wildcard arguments and also '@<filename>' to process all hosts listed in the file.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <compute node>
Re-enable provisioning on the node:
ardana > openstack compute service set --enable compute nova-compute
Restart any instances you stopped:
ardana > openstack server start <instance-uuid>
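For reference, the whole compute node sequence condensed into a sketch, using the stop/start approach; the hostname and disable reason are examples, and you can substitute live migration for the stop loop as described above:
# Sketch: planned reboot of a single compute node with its instances stopped.
NODE=ardana-cp1-comp0001-mgmt   # example compute node name
openstack compute service set --disable --disable-reason "Planned reboot" "$NODE" nova-compute
# Stop every instance on the node (or live migrate them instead).
for id in $(openstack server list --host "$NODE" --all-tenants -f value -c ID); do
  openstack server stop "$id"
done
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit "$NODE"
ssh "$NODE" sudo reboot
# ...wait for the node to come back up...
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit "$NODE"
openstack compute service set --enable "$NODE" nova-compute
for id in $(openstack server list --host "$NODE" --all-tenants -f value -c ID); do
  openstack server start "$id"
done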
15.1.1.2.4 Rebooting swift nodes #
If your swift services are on a controller node, follow the controller node reboot instructions above.
For a dedicated swift PAC cluster or swift Object resource node:
For each swift host:
Stop all services on the swift node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit <swift node>
Reboot the swift node by running the following command on the swift node itself:
ardana > sudo reboot
Wait for the node to become reachable via SSH, then start all services on the swift node:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit <swift node>
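For a larger swift cluster you can loop over the hosts, waiting for each to come back before moving on. A sketch, with example hostnames:
# Sketch: rolling reboot of dedicated swift nodes, one at a time.
SWIFT_NODES="ardana-cp1-swpac0001-mgmt ardana-cp1-swobj0001-mgmt"   # example names
cd ~/scratch/ansible/next/ardana/ansible
for node in $SWIFT_NODES; do
  ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit "$node"
  ssh "$node" sudo reboot
  sleep 60
  until ssh -o ConnectTimeout=5 "$node" true; do sleep 30; done
  ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit "$node"
done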
15.1.1.2.5 Get list of status playbooks #
The following command will display a list of status playbooks:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ls *status*
15.1.2 Planned Control Plane Maintenance #
Planned maintenance tasks for controller nodes such as full cloud reboots and replacing controller nodes.
15.1.2.1 Replacing a Controller Node #
This section outlines steps for replacing a controller node in your environment.
For SUSE OpenStack Cloud, you must have three controller nodes; adding or removing controller nodes is not an option. However, if you need to repair or replace a controller node, you may do so by following the steps outlined here. Note that all cloud maintenance playbooks are run from the Cloud Lifecycle Manager.
These steps will depend on whether you need to replace a shared lifecycle manager/controller node or whether this is a standalone controller node.
Keep in mind while performing the following tasks:
Do not add entries for a new server. Instead, update the entries for the broken one.
Be aware that all management commands are run on the node where the Cloud Lifecycle Manager is running.
15.1.2.1.2 Replacing a Standalone Controller Node #
If the controller node you need to replace is not also being used as the Cloud Lifecycle Manager, follow the steps below.
Log in to the Cloud Lifecycle Manager.
Update your cloud model, specifically the servers.yml file, with the new mac-addr, ilo-ip, ilo-password, and ilo-user fields where these have changed. Do not change the id, ip-addr, role, or server-group settings.
Commit your configuration to the local git repository (see Chapter 22, Using Git for Configuration Management), as follows:
ardana > cd ~/openstack/ardana/ansible
ardana > git add -A
ardana > git commit -m "My config or other commit message"
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Remove the old controller node(s) from Cobbler. You can list the systems currently in Cobbler with this command:
ardana > sudo cobbler system list
and then remove the old controller nodes with this command:
ardana > sudo cobbler system remove --name <node>
Remove the SSH key of the old controller node from the known hosts file. You will specify the ip-addr value:
ardana > ssh-keygen -f "~/.ssh/known_hosts" -R <ip_addr>
You should see a response similar to this one:
ardana@ardana-cp1-c1-m1-mgmt:~/openstack/ardana/ansible$ ssh-keygen -f "~/.ssh/known_hosts" -R 10.13.111.135
# Host 10.13.111.135 found: line 6 type ECDSA
~/.ssh/known_hosts updated.
Original contents retained as ~/.ssh/known_hosts.old
Run the cobbler-deploy playbook to add the new controller node:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Image the new node(s) by using the bm-reimage playbook. You will specify the name for the node in Cobbler in the command:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node-name>
Important: You must ensure that the old controller node is powered off before completing this step. This is because the new controller node will re-use the original IP address.
Run the wipe_disks.yml playbook to ensure all non-OS partitions on the new node are completely wiped prior to continuing with the installation. (The value to be used for hostname is the host's identifier from ~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts.)
ardana > cd ~/scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>
Run osconfig on the replacement controller node. For example:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller-hostname>
If the controller being replaced is the swift ring builder (see Section 18.6.2.4, “Identifying the Swift Ring Building Server”), you need to restore the swift ring builder files to the /etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir directory. See Section 18.6.2.7, “Recovering swift Builder Files” for details.
Run the ardana-deploy playbook on the replacement controller.
If the node being replaced is the swift ring builder server, then you only need to use the --limit switch for that node; otherwise you need to specify the hostname of your swift ring builder server and the hostname of the node being replaced.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller-hostname>,<swift-ring-builder-hostname>
Important: If you receive a keystone failure when running this playbook, it is likely due to Fernet keys being out of sync. This problem can be corrected by running the keystone-reconfigure.yml playbook to re-sync the Fernet keys.
In this situation, do not use the --limit option when running keystone-reconfigure.yml. In order to re-sync Fernet keys, all the controller nodes must be in the play.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts keystone-reconfigure.yml
Important: If you receive a RabbitMQ failure when running this playbook, review Section 18.2.1, “Understanding and Recovering RabbitMQ after Failure” for how to resolve the issue and then re-run the ardana-deploy playbook.
During the replacement of the node, alarms will show up. If they do not clear after the node is back up and healthy, restart the threshold engine by running the following playbooks:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts monasca-stop.yml --tags thresh
ardana > ansible-playbook -i hosts/verb_hosts monasca-start.yml --tags thresh
15.1.3 Planned Compute Maintenance #
Planned maintenance tasks for compute nodes.
15.1.3.1 Planned Maintenance for a Compute Node #
If one or more of your compute nodes needs hardware maintenance and you can schedule planned maintenance, then this procedure should be followed.
15.1.3.1.1 Performing planned maintenance on a compute node #
If you have planned maintenance to perform on a compute node, you have to take it offline, repair it, and restart it. To do so, follow these steps:
Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
source ~/service.osrc
Obtain the hostname for your compute node, which you will use in subsequent commands when <hostname> is requested:
openstack host list | grep compute
The following example shows two compute nodes:
$ openstack host list | grep compute
| ardana-cp1-comp0001-mgmt | compute | AZ1 |
| ardana-cp1-comp0002-mgmt | compute | AZ2 |
Disable provisioning on the compute node, which will prevent additional instances from being spawned on it:
openstack compute service set --disable --disable-reason "Maintenance mode" <hostname> nova-compute
Note: Make sure you re-enable provisioning after the maintenance is complete if you want to continue to be able to spawn instances on the node. You can do this with the command:
openstack compute service set --enable <hostname> nova-compute
At this point you have two choices:
Live migration: This option enables you to migrate the instances off the compute node with minimal downtime so you can perform the maintenance without risk of losing data.
Stop/start the instances: Issuing openstack server stop commands to each of the instances will halt them. This option lets you do maintenance and then start the instances back up, as long as no disk failures occur on the compute node data disks. This method involves downtime for the length of the maintenance.
If you choose the live migration route, see Section 15.1.3.3, “Live Migration of Instances” for more details. Skip to step #6 after you finish live migration.
If you choose the stop/start method, continue on.
List all of the instances on the node so you can issue stop commands to them:
openstack server list --host <hostname> --all-tenants
Issue the openstack server stop command against each of the instances:
openstack server stop <instance uuid>
Confirm that the instances are stopped. If stoppage was successful, you should see the instances in a SHUTOFF state, as shown here:
$ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status  | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | SHUTOFF | -          | Shutdown    | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+---------+------------+-------------+-----------------------+
Do your required maintenance. If this maintenance does not take down the disks completely, then you should be able to list the instances again after the repair and confirm that they are still in their SHUTOFF state:
openstack server list --host <hostname> --all-tenants
Start the instances back up using this command:
openstack server start <instance uuid>
Example:
$ openstack server start ef31c453-f046-4355-9bd3-11e774b1772f
Request to start server ef31c453-f046-4355-9bd3-11e774b1772f has been accepted.
Confirm that the instances started back up. If restarting is successful, you should see the instances in an ACTIVE state, as shown here:
$ openstack server list --host ardana-cp1-comp0002-mgmt --all-tenants
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name      | Tenant ID                        | Status | Task State | Power State | Networks              |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
| ef31c453-f046-4355-9bd3-11e774b1772f | instance1 | 4365472e025c407c8d751fc578b7e368 | ACTIVE | -          | Running     | demo_network=10.0.0.5 |
+--------------------------------------+-----------+----------------------------------+--------+------------+-------------+-----------------------+
If the openstack server start fails, you can try doing a hard reboot:
openstack server reboot --hard <instance uuid>
If this does not resolve the issue you may want to contact support.
Re-enable provisioning when the node is fixed:
openstack compute service set --enable <hostname> nova-compute
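If the node hosts many instances, the stop, verify, and start steps above can be scripted with the client's machine-readable output. The sketch below uses the example hostname from this section:
# Sketch: stop all instances on a compute node, wait for SHUTOFF, start them again later.
HOST=ardana-cp1-comp0002-mgmt   # example hostname
for id in $(openstack server list --host "$HOST" --all-tenants -f value -c ID); do
  openstack server stop "$id"
done
# Wait until no instance on the host is still ACTIVE.
while openstack server list --host "$HOST" --all-tenants -f value -c Status | grep -q ACTIVE; do
  sleep 10
done
# ...perform the maintenance, then start the instances again...
for id in $(openstack server list --host "$HOST" --all-tenants -f value -c ID); do
  openstack server start "$id"
done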
15.1.3.2 Rebooting a Compute Node #
If all you need to do is reboot a Compute node, the following steps can be used.
You can choose to live migrate all Compute instances off the node prior to the reboot. Any instances that remain will be restarted when the node is rebooted. This playbook will ensure that all services on the Compute node are restarted properly.
Log in to the Cloud Lifecycle Manager.
Reboot the Compute node(s) with the following playbook.
You can specify either single or multiple Compute nodes using the --limit switch.
An optional reboot wait time can also be specified. If no reboot wait time is specified, it will default to 300 seconds.
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts nova-compute-reboot.yml --limit [compute_node_or_list] [-e nova_reboot_wait_timeout=(seconds)]
Note: If the Compute node fails to reboot, you should troubleshoot this issue separately, as this playbook will not attempt to recover after a failed reboot.
15.1.3.3 Live Migration of Instances #
Live migration allows you to move active compute instances between compute nodes, allowing for less downtime during maintenance.
SUSE OpenStack Cloud nova offers a set of commands that allow you to move compute instances between compute hosts. Which command you use will depend on the state of the host, what operating system is on the host, what type of storage the instances are using, and whether you want to migrate a single instance or all of the instances off of the host. We will describe these options on this page as well as give you step-by-step instructions for performing them.
15.1.3.3.1 Migration Options #
If your compute node has failed
A compute host failure could be caused by a hardware failure, such as a data disk needing to be replaced, a loss of power, or any other type of failure that requires you to replace the baremetal host. In this scenario, the instances on the compute node are unrecoverable and any data on the local ephemeral storage is lost. If you are utilizing block storage volumes, either as a boot device or as additional storage, they should be unaffected.
In these cases you will want to use one of the nova evacuate commands, which will cause nova to rebuild the instances on other hosts.
This table describes each of the evacuate options for failed compute nodes:
Command | Description |
---|---|
nova evacuate | This command is used to evacuate a single instance from a failed host. You specify the compute instance UUID and the target host you want to evacuate it to. If no host is specified then the nova scheduler will choose one for you. |
nova host-evacuate | This command is used to evacuate all instances from a failed host. You specify the hostname of the compute host you want to evacuate. Optionally you can specify a target host. If no target host is specified then the nova scheduler will choose a target host for each instance. |
If your compute host is active, powered on, and the data disks are in working order, you can utilize the migration commands to migrate your compute instances. There are two migration features: "cold" migration (also referred to simply as "migration") and live migration. Migration and live migration are two different functions.
Cold migration is used to copy an instance's data in a SHUTOFF status from one compute host to another. It does this using passwordless SSH access, which has security concerns associated with it. For this reason, the openstack server migrate function has been disabled by default, but you have the ability to enable this feature if you would like. Details on how to do this can be found in Section 6.4, “Enabling the Nova Resize and Migrate Features”.
Live migration can be performed on instances in either an ACTIVE or PAUSED state. It uses the QEMU hypervisor to manage the copy of the running processes and associated resources to the destination compute host using the hypervisor's own protocol, and thus is a more secure method that allows for less downtime. During a live migration there may be a short network outage, usually a few milliseconds, but it could be up to a few seconds if your compute instances are busy. There may also be some performance degradation during the process.
The compute host must remain powered on during the migration process.
Both the cold migration and live migration options will honor nova group policies, which include affinity settings. There is a limitation to keep in mind if you use group policies; it is discussed in Section 15.1.3.3.2, “Limitations of these Features”.
This table describes each of the migration options for active compute nodes:
Command | Description | SLES |
---|---|---|
openstack server migrate | Used to cold migrate a single instance from a compute host. The nova scheduler will choose the new host for you. This command will work against instances in an ACTIVE or SHUTOFF state. See the difference between cold migration and live migration at the start of this section. | |
nova host-servers-migrate | Used to cold migrate all instances off a specified host to other available hosts, chosen by the nova scheduler. This command will work against instances in an ACTIVE or SHUTOFF state. See the difference between cold migration and live migration at the start of this section. | |
nova live-migration | Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in an ACTIVE or PAUSED state. | X |
nova live-migration --block-migrate | Used to migrate a single instance between two compute hosts. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in an ACTIVE or PAUSED state. | X |
nova host-evacuate-live | Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from a block storage volume or that have any number of block storage volumes attached. This command works against instances in an ACTIVE or PAUSED state. | X |
nova host-evacuate-live --block-migrate | Used to live migrate all instances off of a compute host. You can optionally specify a target host or you can allow the nova scheduler to choose a host for you. If you choose to specify a target host, ensure that the target host has enough resources to host the instance prior to live migration. Works for instances that are booted from local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s) but are not booted from a block storage volume. This command works against instances in an ACTIVE or PAUSED state. | X |
15.1.3.3.2 Limitations of these Features #
There are limitations that may impact your use of this feature:
To use live migration, your compute instances must be in either an ACTIVE or PAUSED state on the compute host. If you have instances in a SHUTOFF state, then cold migration should be used.
Instances in a Paused state cannot be live migrated using the horizon dashboard. You will need to utilize the python-novaclient CLI to perform these migrations.
Both cold migration and live migration honor an instance's group policies. If you are utilizing an affinity policy and are migrating multiple instances, you may run into an error stating no hosts are available to migrate to. To work around this issue, you should specify a target host when migrating these instances, which will bypass the nova-scheduler. You should ensure that the target host you choose has the resources available to host the instances.
The nova host-evacuate-live command will produce an error if you have a compute host that has a mix of instances that use local ephemeral storage and instances that are booted from a block storage volume or have any number of block storage volumes attached. If you have a mix of these instance types, you may need to run the command twice, utilizing the --block-migrate option. This is described in further detail in Section 15.1.3.3.6, “Troubleshooting migration or host evacuate issues”.
Instances on KVM hosts can only be live migrated to other KVM hosts.
The migration options described in this document are not available on ESX compute hosts.
Ensure that you read and take into account any other limitations described in the release notes; see the release notes for more details.
15.1.3.3.3 Performing a Live Migration #
Cloud administrators can perform a migration on an instance using either the horizon dashboard, API, or CLI. Instances in a Paused state cannot be live migrated using the horizon GUI. You will need to utilize the CLI to perform these.
We have documented different scenarios:
15.1.3.3.4 Migrating instances off of a failed compute host #
Log in to the Cloud Lifecycle Manager.
If the compute node is not already powered off, do so with this playbook:
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node_name>
Note: The value for <node_name> will be the name that Cobbler has when you run sudo cobbler system list from the Cloud Lifecycle Manager.
Source the admin credentials necessary to run administrative commands against the nova API:
source ~/service.osrc
Force the nova-compute service to go down on the compute node:
openstack compute service set --down HOSTNAME nova-compute
Note: The value for HOSTNAME can be obtained by using openstack host list from the Cloud Lifecycle Manager.
Evacuate the instances off of the failed compute node. This will cause the nova-scheduler to rebuild the instances on other valid hosts. Any local ephemeral data on the instances is lost.
For single instances on a failed host:
nova evacuate <instance_uuid> <target_hostname>
For all instances on a failed host:
nova host-evacuate <hostname> [--target_host <target_hostname>]
When you have repaired the failed node and started it back up again, the nova-compute process will clean up the evacuated instances when it starts.
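The failed-host flow above can be condensed into a short sketch; the hostname and the Cobbler node name are examples:
# Sketch: evacuate every instance from a failed compute host.
FAILED=ardana-cp1-comp0001-mgmt   # example failed compute node
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=compute1   # Cobbler name of the node
source ~/service.osrc
openstack compute service set --down "$FAILED" nova-compute
nova host-evacuate "$FAILED"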
15.1.3.3.5 Migrating instances off of an active compute host #
Migrating instances using the horizon dashboard
The horizon dashboard offers a GUI method for performing live migrations. Instances in a Paused state will not provide you the live migration option in horizon, so you will need to use the CLI instructions in the next section to perform these.
Log into the horizon dashboard with admin credentials.
Navigate to the Instances panel under the Admin menu.
Next to the instance you want to migrate, select the drop-down menu and choose the Live Migrate Instance option.
In the Live Migrate wizard you will see the compute host the instance currently resides on and then a drop-down menu that allows you to choose the compute host you want to migrate the instance to. Select a destination host from that menu. You also have two checkboxes for additional options, which are described below:
Disk Over Commit - defaults to False. If you check this box, it will allow you to override the check that occurs to ensure the destination host has the available disk space to host the instance.
Block Migration - defaults to False. If you check this box, it will migrate the local disks by using block migration. Use this option if you are only using ephemeral storage on your instances. If you are using block storage for your instance, ensure this box is not checked.
To begin the live migration, click the submit button in the wizard.
Migrating instances using the python-novaclient CLI
To perform migrations from the command line, use the python-novaclient.
The Cloud Lifecycle Manager node in your cloud environment should have the python-novaclient already installed. If you will be accessing your environment through a different method, ensure that the python-novaclient is installed. You can do so using Python's pip package manager.
To run the commands in the steps below, you need administrator credentials. From the Cloud Lifecycle Manager, you can source the provided service.osrc file, which has the necessary credentials:
source ~/service.osrc
Here are the steps to perform:
Log in to the Cloud Lifecycle Manager.
Identify the instances on the compute node you wish to migrate:
openstack server list --all-tenants --host <hostname>
Example showing a host with a single compute instance on it:
ardana >
openstack server list --host ardana-cp1-comp0001-mgmt --all-tenants +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+ | ID | Name | Tenant ID | Status | Task State | Power State | Networks | +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+ | 553ba508-2d75-4513-b69a-f6a2a08d04e3 | test | 193548a949c146dfa1f051088e141f0b | ACTIVE | - | Running | adminnetwork=10.0.0.5 | +--------------------------------------+------+----------------------------------+--------+------------+-------------+-----------------------+When using live migration you can either specify a target host that the instance will be migrated to or you can omit the target to allow the nova-scheduler to choose a node for you. If you want to get a list of available hosts you can use this command:
openstack host list
Migrate the instance(s) on the compute node using the notes below.
If your instance is booted from a block storage volume or has any number of block storage volumes attached, use the nova live-migration command with this syntax:
nova live-migration <instance uuid> [<target compute host>]
If your instance has local (ephemeral) disk(s) only or if your instance has a mix of ephemeral disk(s) and block storage volume(s), you should use the --block-migrate option:
nova live-migration --block-migrate <instance uuid> [<target compute host>]
Note: The [<target compute host>] argument is optional. If you do not specify a target host, the nova scheduler will choose a node for you.
Multiple instances
If you want to live migrate all of the instances off a single compute host, you can utilize the nova host-evacuate-live command.
Issue the host-evacuate-live command, which will begin the live migration process.
If all of the instances on the host are using at least one local (ephemeral) disk, you should use this syntax:
nova host-evacuate-live --block-migrate <hostname>
Alternatively, if all of the instances are only using block storage volumes, then omit the --block-migrate option:
nova host-evacuate-live <hostname>
Note: You can either let the nova-scheduler choose a suitable target host or you can specify one using the --target-host <hostname> switch. See nova help host-evacuate-live for details.
15.1.3.3.6 Troubleshooting migration or host evacuate issues #
Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:
$ nova host-evacuate-live ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt +--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Server UUID | Live Migration Accepted | Error Message | +--------------------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 95a7ded8-ebfc-4848-9090-2df378c88a4c | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-9fd79670-a780-40ed-a515-c14e28e0a0a7) | | 13ab4ef7-0623-4d00-bb5a-5bb2f1214be4 | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on shared storage: Live migration cannot be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-26834267-c3ec-4f8b-83cc-5193d6a394d6) | +--------------------------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:
nova host-evacuate-live --block-migrate <hostname> [--target-host <target_hostname>]
Issue: When attempting to use nova host-evacuate-live against a node, you receive the error below:
$ nova host-evacuate-live --block-migrate ardana-cp1-comp0001-mgmt --target-host ardana-cp1-comp0003-mgmt +--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Server UUID | Live Migration Accepted | Error Message | +--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | e9874122-c5dc-406f-9039-217d9258c020 | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-60b1196e-84a0-4b71-9e49-96d6f1358e1a) | | 84a02b42-9527-47ac-bed9-8fde1f98e3fe | False | Error while live migrating instance: ardana-cp1-comp0001-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-0cdf1198-5dbd-40f4-9e0c-e94aa1065112) | +--------------------------------------+-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Fix: This occurs when you are attempting to live evacuate a host that contains instances booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live evacuation with this syntax:
nova host-evacuate-live <hostname> [--target-host <target_hostname>]
Issue: When attempting to use nova live-migration against an instance, you receive the error below:
$ nova live-migration 2a13ffe6-e269-4d75-8e46-624fec7a5da0 ardana-cp1-comp0002-mgmt
ERROR (BadRequest): ardana-cp1-comp0001-mgmt is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-158dd415-0bb7-4613-8529-6689265387e7)
Fix: This occurs when you are attempting to live migrate an instance that was booted from local storage and you are not specifying --block-migrate in your command. Re-attempt the live migration with this syntax:
nova live-migration --block-migrate <instance_uuid> <target_hostname>
Issue: When attempting to use nova live-migration against an instance, you receive the error below:
$ nova live-migration --block-migrate 84a02b42-9527-47ac-bed9-8fde1f98e3fe ardana-cp1-comp0001-mgmt
ERROR (BadRequest): ardana-cp1-comp0002-mgmt is not on local storage: Block migration can not be used with shared storage. (HTTP 400) (Request-ID: req-51fee8d6-6561-4afc-b0c9-7afa7dc43a5b)
Fix: This occurs when you are attempting to live migrate an instance that was booted from a block storage volume and you are specifying --block-migrate in your command. Re-attempt the live migration with this syntax:
nova live-migration <instance_uuid> <target_hostname>
15.1.3.4 Adding Compute Node #
Adding a Compute Node allows you to add capacity.
15.1.3.4.1 Adding a SLES Compute Node #
Adding a SLES compute node allows you to add additional capacity for more virtual machines.
You may need to add additional SLES compute hosts for more virtual machine capacity or another purpose; these steps will help you achieve this.
There are two methods you can use to add SLES compute hosts to your environment:
Adding SLES pre-installed compute hosts. This method does not require the SLES ISO be on the Cloud Lifecycle Manager to complete.
Using the provided Ansible playbooks and Cobbler, SLES will be installed on your new compute hosts. This method requires that you provided a SUSE Linux Enterprise Server 12 SP4 ISO during the initial installation of your cloud, following the instructions at Section 31.1, “SLES Compute Node Installation Overview”.
If you want to use the provided Ansible playbooks and Cobbler to setup and configure your SLES hosts and you did not have the SUSE Linux Enterprise Server 12 SP4 ISO on your Cloud Lifecycle Manager during your initial installation then ensure you look at the note at the top of that section before proceeding.
15.1.3.4.1.1 Prerequisites #
You need to ensure your input model files are properly setup for SLES compute host clusters. This must be done during the installation process of your cloud and is discussed further at Section 31.3, “Using the Cloud Lifecycle Manager to Deploy SLES Compute Nodes” and Section 10.1, “SLES Compute Nodes”.
15.1.3.4.1.2 Adding a SLES compute node #
Adding pre-installed SLES compute hosts
This method requires that you have SUSE Linux Enterprise Server 12 SP4 pre-installed on the baremetal host prior to beginning these steps.
Ensure you have SUSE Linux Enterprise Server 12 SP4 pre-installed on your baremetal host.
Log in to the Cloud Lifecycle Manager.
Edit your ~/openstack/my_cloud/definition/data/servers.yml file to include the details about your new compute host(s).
For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth one, you would add your details to the bottom of the file in the following format. Note that we left out the IPMI details because they will not be needed since you pre-installed the SLES OS on your host(s).
- id: compute4
  ip-addr: 192.168.102.70
  role: SLES-COMPUTE-ROLE
  server-group: RACK1
You can find detailed descriptions of these fields in Section 6.5, “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.
Important: You will need to verify that the ip-addr value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml file on your Cloud Lifecycle Manager.
In your ~/openstack/my_cloud/definition/data/control_plane.yml file you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match up with your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.
See Section 6.2, “Control Plane” for more details.
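Two quick checks before committing are sketched below: confirming that the new ip-addr is not already allocated, and reviewing the count values in control_plane.yml. The IP shown is the example value used above.
# Sketch: pre-commit sanity checks for the new compute host entry.
grep -n "192.168.102.70" ~/openstack/my_cloud/info/address_info.yml || echo "IP not allocated - OK"
grep -n -E "member-count|min-count|max-count" \
  ~/openstack/my_cloud/definition/data/control_plane.yml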
Commit the changes to git:
ardana > git add -A
ardana > git commit -a -m "Add node <name>"
Run the configuration processor and resolve any errors that are indicated:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Section 7.3.1, “Persisted Server Allocations” for information on how this works.
[OPTIONAL] - Run the
wipe_disks.yml
playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation.NoteThe
wipe_disks.yml
playbook is only meant to be run on systems immediately after runningbm-reimage.yml
. If used for any other case, it may not wipe all of the expected partitions.The value to be used for
hostname
is host's identifier from~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts
.ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>Complete the compute host deployment with this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"ardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
Adding SLES compute hosts with Ansible playbooks and Cobbler
These steps will show you how to add the new SLES compute host to your
servers.yml
file and then run the playbooks that update
your cloud configuration. You will run these playbooks from the lifecycle
manager.
If you did not have the SUSE Linux Enterprise Server 12 SP4 ISO available on your Cloud Lifecycle Manager during your initial installation, it must be installed before proceeding further. Instructions can be found in Chapter 31, Installing SLES Compute.
When you are prepared to continue, use these steps:
Log in to your Cloud Lifecycle Manager.
Checkout the
site
branch of your local git so you can begin to make the necessary edits:ardana >
cd ~/openstack/my_cloud/definition/dataardana >
git checkout siteEdit your
~/openstack/my_cloud/definition/data/servers.yml
file to include the details about your new compute host(s).
For example, if you already had a cluster of three SLES compute hosts using the SLES-COMPUTE-ROLE role and needed to add a fourth, you would add its details to the bottom of the file in this format:
- id: compute4
  ip-addr: 192.168.102.70
  role: SLES-COMPUTE-ROLE
  server-group: RACK1
  mac-addr: e8:39:35:21:32:4e
  ilo-ip: 10.1.192.36
  ilo-password: password
  ilo-user: admin
  distro-id: sles12sp4-x86_64
You can find detailed descriptions of these fields in Section 6.5, “Servers”. Ensure that you use the same role for any new SLES hosts you are adding as you specified on your existing SLES hosts.
Important: You will need to verify that the
ip-addr
value you choose for this host does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml
file on your Cloud Lifecycle Manager.
In your
~/openstack/my_cloud/definition/data/control_plane.yml
file, you will need to check the values for member-count, min-count, and max-count. If you specified them, ensure that they match your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth compute node, you will need to change that value to member-count: 4.
See Section 6.2, “Control Plane” for more details.
Commit the changes to git:
ardana >
git add -Aardana >
git commit -a -m "Add node <name>"Run the configuration processor and resolve any errors that are indicated:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlThe following playbook confirms that your servers are accessible over their IPMI ports.
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost bm-power-status.yml -e nodelist=compute4
Add the new node into Cobbler:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.ymlRun the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]Then you can image the node:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>NoteIf you do not know the
<node name>
, you can get it by usingsudo cobbler system list
.Before proceeding, you may want to take a look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. This is to prevent loss of data; the config processor retains data about removed nodes and keeps their ID numbers from being reallocated. See Section 7.3.1, “Persisted Server Allocations” for information on how this works.
Update your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.yml[OPTIONAL] - Run the
wipe_disks.yml
playbook to ensure all of your non-OS partitions on your hosts are completely wiped prior to continuing with the installation. Thewipe_disks.yml
playbook is only meant to be run on systems immediately after runningbm-reimage.yml
. If used for any other case, it may not wipe all of the expected partitions.ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>NoteYou can obtain the
<hostname>
from the file~/scratch/ansible/next/ardana/ansible/hosts/verb_hosts
You should verify that the netmask, bootproto, and other necessary network settings are correct; if they are not, correct them. See Chapter 31, Installing SLES Compute for details.
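As a hedged example of such a check on a SLES compute node, you can inspect the generated interface configuration and the live addresses; the interface name eth0 is only a placeholder for whichever interface your node uses:

tux > sudo cat /etc/sysconfig/network/ifcfg-eth0
tux > ip addr show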
Complete the compute host deployment with these playbooks. For the last one, ensure you specify the compute hosts you are adding with the
--limit
switch:ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"ardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
15.1.3.4.1.3 Adding a new SLES compute node to monitoring #
If you want to add a new Compute node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
15.1.3.5 Removing a Compute Node #
Removing a Compute node allows you to remove capacity.
If you need to remove a Compute node, these steps will help you achieve this.
15.1.3.5.1 Disable Provisioning on the Compute Host #
Get a list of the running nova services; this will provide the details needed to disable provisioning on the Compute host you want to remove:
ardana >
openstack compute service list
Here is an example; the Compute node we are going to remove is ardana-cp1-comp0002-mgmt:
ardana >
openstack compute service list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -               |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -               |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -               |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -               |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -               |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | -               |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------+
Disable the nova service on the Compute node you want to remove; this ensures it is taken out of the scheduling rotation:
ardana >
openstack compute service set --disable --reason "enter reason here" <node hostname>
Here is an example that removes the
ardana-cp1-comp0002-mgmt
node shown in the output above:
ardana >
openstack compute service set --disable --reason "hardware reallocation" ardana-cp1-comp0002-mgmt
+--------------------------+--------------+----------+-----------------------+
| Host                     | Binary       | Status   | Disabled Reason       |
+--------------------------+--------------+----------+-----------------------+
| ardana-cp1-comp0002-mgmt | nova-compute | disabled | hardware reallocation |
+--------------------------+--------------+----------+-----------------------+
15.1.3.5.2 Remove the Compute Host from its Availability Zone #
If you configured the Compute host to be part of an availability zone, these steps will show you how to remove it.
Get a list of the running nova services; this will provide the details needed to remove the Compute node:
ardana >
openstack compute service list
Here is an example; the Compute node we are going to remove is ardana-cp1-comp0002-mgmt:
ardana >
openstack compute service list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status  | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled | up    | 2015-11-22T22:50:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:43.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled | up    | 2015-11-22T22:50:38.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled | up    | 2015-11-22T22:50:42.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled | up    | 2015-11-22T22:50:35.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | AZ2      | enabled | up    | 2015-11-22T22:50:44.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+-----------------------+
If the
Zone
reported for this host is simply "nova", then it is not a member of a particular availability zone, and this step will not be necessary. Otherwise, you must remove the Compute host from its availability zone:
ardana >
openstack aggregate remove host <availability zone> <node hostname>
For the same example as in the previous step, the
ardana-cp1-comp0002-mgmt
host was in the AZ2
availability zone, so you would use this command to remove it:
ardana >
openstack aggregate remove host AZ2 ardana-cp1-comp0002-mgmt
Host ardana-cp1-comp0002-mgmt has been successfully removed from aggregate 4
+----+------+-------------------+-------+-------------------------+
| Id | Name | Availability Zone | Hosts | Metadata                |
+----+------+-------------------+-------+-------------------------+
| 4  | AZ2  | AZ2               |       | 'availability_zone=AZ2' |
+----+------+-------------------+-------+-------------------------+
You can confirm the last two steps completed successfully by running another
openstack compute service list
.
Here is an example which confirms that the node has been disabled (Status is disabled) and that it has been removed from the availability zone (Zone is now nova):
ardana >
openstack compute service list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary           | Host                     | Zone     | Status   | State | Updated_at                 | Disabled Reason       |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1  | nova-conductor   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 10 | nova-scheduler   | ardana-cp1-c1-m1-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:34.000000 | -                     |
| 13 | nova-conductor   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 16 | nova-conductor   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:33.000000 | -                     |
| 28 | nova-scheduler   | ardana-cp1-c1-m2-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:28.000000 | -                     |
| 31 | nova-scheduler   | ardana-cp1-c1-m3-mgmt    | internal | enabled  | up    | 2015-11-22T23:04:32.000000 | -                     |
| 34 | nova-compute     | ardana-cp1-comp0001-mgmt | AZ1      | enabled  | up    | 2015-11-22T23:04:25.000000 | -                     |
| 37 | nova-compute     | ardana-cp1-comp0002-mgmt | nova     | disabled | up    | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
15.1.3.5.3 Use Live Migration to Move Any Instances on this Host to Other Hosts #
Verify whether the Compute node is currently hosting any instances. You can do this with the command below:
ardana >
openstack server list --host <node hostname> --all-projects
Here is an example showing a single running instance currently on this node:
ardana >
openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | ACTIVE | -          | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+--------+------------+-------------+-----------------+
You will likely want to migrate this instance off of this node before removing it. You can do this with the live migration functionality within nova. The command will look like this:
ardana >
nova live-migration --block-migrate <instance ID>
Here is an example using the instance from the previous step:
ardana >
nova live-migration --block-migrate 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9
You can check the status of the migration using the same command from the previous step:
ardana >
openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| ID                                   | Name   | Tenant ID                        | Status    | Task State | Power State | Networks        |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
| 78fdb938-a89c-4a0c-a0d4-b88f1555c3b9 | paul4d | 5e9998f1b1824ea9a3b06ad142f09ca5 | MIGRATING | migrating  | Running     | paul=10.10.10.7 |
+--------------------------------------+--------+----------------------------------+-----------+------------+-------------+-----------------+
List the compute instances again to see that the running instance has been migrated:
ardana >
openstack server list --host ardana-cp1-comp0002-mgmt --all-projects
+----+------+-----------+--------+------------+-------------+----------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+----+------+-----------+--------+------------+-------------+----------+
+----+------+-----------+--------+------------+-------------+----------+
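If you prefer not to re-run the listing command by hand while a migration is in progress, a minimal sketch such as the following polls until the host reports no instances; the hostname is the example node used above and the 30-second interval is arbitrary:

ardana > while openstack server list --host ardana-cp1-comp0002-mgmt --all-projects -f value -c ID | grep -q .; do sleep 30; done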
15.1.3.5.4 Disable Neutron Agents on Node to be Removed #
You should also locate and disable or remove neutron agents. To see the neutron agents running:
ardana >
openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id                                   | agent_type           | host                     | alive | admin_state_up | binary                    |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent             | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-l3-agent          |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent       | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-metadata-agent    |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent   | ardana-cp1-comp0002-mgmt | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
ardana >
openstack network agent set --disable 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana >
openstack network agent set --disable dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana >
openstack network agent set --disable f0d262d1-7139-40c7-bdc2-f227c6dee5c8
15.1.3.5.5 Shut down or Stop the Nova and Neutron Services on the Compute Host #
To perform this step you have a few options. You can SSH into the Compute host and run the following commands:
tux >
sudo systemctl stop nova-compute
tux >
sudo systemctl stop neutron-*
Because the neutron agents self-register against the neutron server, you may want to prevent the following services from coming back online. Here is how you can get the list:
tux >
sudo systemctl list-units neutron-* --all
Here are the results:
UNIT                                LOAD      ACTIVE   SUB    DESCRIPTION
neutron-common-rundir.service       loaded    inactive dead   Create /var/run/neutron
•neutron-dhcp-agent.service         not-found inactive dead   neutron-dhcp-agent.service
neutron-l3-agent.service            loaded    inactive dead   neutron-l3-agent Service
neutron-metadata-agent.service      loaded    inactive dead   neutron-metadata-agent Service
•neutron-openvswitch-agent.service  loaded    failed   failed neutron-openvswitch-agent Service
neutron-ovs-cleanup.service         loaded    inactive dead   neutron OVS Cleanup Service

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

7 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.
For each loaded service, issue the command
tux >
sudo systemctl disable <service-name>
In the above example, that would be every service except neutron-dhcp-agent.service.
For example:
tux >
sudo systemctl disable neutron-common-rundir neutron-l3-agent neutron-metadata-agent neutron-openvswitch-agent
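If you would rather disable every loaded neutron unit in one pass, the following is a rough sketch only; it assumes the plain, legend-free output format of systemctl and should be reviewed before use:

tux > for unit in $(sudo systemctl list-units 'neutron-*' --all --plain --no-legend | awk '$2 == "loaded" {print $1}'); do sudo systemctl disable "$unit"; done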
Now you can shut down the node:
tux >
sudo shutdown now
OR
From the Cloud Lifecycle Manager you can use the
bm-power-down.yml
playbook to shut down the node:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=<node name>
The <node name> value is the name of this node as it appears in Cobbler. You can run
sudo cobbler system list
to retrieve these names.
15.1.3.5.6 Delete the Compute Host from Nova #
Retrieve the list of nova services:
ardana >
openstack compute service list
Here is an example highlighting the Compute host we're going to remove:
ardana >
openstack compute service list
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
| 1 | nova-conductor | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 10 | nova-scheduler | ardana-cp1-c1-m1-mgmt | internal | enabled | up | 2015-11-22T23:04:34.000000 | - |
| 13 | nova-conductor | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 16 | nova-conductor | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T23:04:33.000000 | - |
| 28 | nova-scheduler | ardana-cp1-c1-m2-mgmt | internal | enabled | up | 2015-11-22T23:04:28.000000 | - |
| 31 | nova-scheduler | ardana-cp1-c1-m3-mgmt | internal | enabled | up | 2015-11-22T23:04:32.000000 | - |
| 34 | nova-compute | ardana-cp1-comp0001-mgmt | AZ1 | enabled | up | 2015-11-22T23:04:25.000000 | - |
| 37 | nova-compute | ardana-cp1-comp0002-mgmt | nova | disabled | up | 2015-11-22T23:04:34.000000 | hardware reallocation |
+----+------------------+--------------------------+----------+----------+-------+----------------------------+-----------------------+
Delete the host from nova using the command below:
ardana >
openstack compute service delete <service ID>
Following our example above, you would use:
ardana >
openstack compute service delete 37
Use the command below to confirm that the Compute host has been completely removed from nova:
ardana >
openstack hypervisor list
15.1.3.5.7 Delete the Compute Host from Neutron #
Multiple neutron agents are running on the compute node. You have to remove
all of the agents running on the node using the openstack network
agent delete
command. In the example below, the l3-agent,
openvswitch-agent and metadata-agent are running:
ardana >
openstack network agent list | grep NODE_NAME
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| id | agent_type | host | alive | admin_state_up | binary |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
| 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4 | L3 agent | ardana-cp1-comp0002-mgmt | :-) | False | neutron-l3-agent |
| dbe4fe11-8f08-4306-8244-cc68e98bb770 | Metadata agent | ardana-cp1-comp0002-mgmt | :-) | False | neutron-metadata-agent |
| f0d262d1-7139-40c7-bdc2-f227c6dee5c8 | Open vSwitch agent | ardana-cp1-comp0002-mgmt | :-) | False | neutron-openvswitch-agent |
+--------------------------------------+----------------------+--------------------------+-------+----------------+---------------------------+
ardana > openstack network agent delete AGENT_ID
ardana > openstack network agent delete 08f16dbc-4ba2-4c1d-a4a3-a2ff2526ebe4
ardana > openstack network agent delete dbe4fe11-8f08-4306-8244-cc68e98bb770
ardana > openstack network agent delete f0d262d1-7139-40c7-bdc2-f227c6dee5c8
15.1.3.5.8 Remove the Compute Host from the servers.yml File and Run the Configuration Processor #
Complete these steps from the Cloud Lifecycle Manager to remove the Compute node:
Log in to the Cloud Lifecycle Manager
Edit your
servers.yml
file in the location below to remove references to the Compute node(s) you want to remove:ardana >
cd ~/openstack/my_cloud/definition/dataardana >
vi servers.yml
You may also need to edit your
control_plane.yml
file to update the values for member-count, min-count, and max-count, if you used them, so that they reflect the exact number of nodes you are using.
See Section 6.2, “Control Plane” for more details.
Commit the changes to git:
ardana >
git commit -a -m "Remove node NODE_NAME"To release the network capacity allocated to the deleted server(s), use the switches
remove_deleted_servers
andfree_unused_addresses
when running the configuration processors. (For more information, see Section 7.3, “Persisted Data”.)ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.yml \ -e remove_deleted_servers="y" -e free_unused_addresses="y"Update your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlRefresh the
/etc/hosts
file through the cloud to remove references to the old node:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --tag "generate_hosts_file"
15.1.3.5.9 Remove the Compute Host from Cobbler #
Complete these steps to remove the node from Cobbler:
Confirm the system name in Cobbler with this command:
tux >
sudo cobbler system listRemove the system from Cobbler using this command:
tux >
sudo cobbler system remove --name=<node name>
Run the
cobbler-deploy.yml
playbook to complete the process:ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml
15.1.3.5.10 Remove the Compute Host from Monitoring #
Once you have removed the Compute nodes, the alarms against them will trigger so there are additional steps to take to resolve this issue.
To find all monasca API servers:
tux >
sudo cat /etc/haproxy/haproxy.cfg | grep MON
listen ardana-cp1-vip-public-MON-API-extapi-8070
bind ardana-cp1-vip-public-MON-API-extapi:8070 ssl crt /etc/ssl/private//my-public-cert-entry-scale
server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
listen ardana-cp1-vip-MON-API-mgmt-8070
bind ardana-cp1-vip-MON-API-mgmt:8070 ssl crt /etc/ssl/private//ardana-internal-cert
server ardana-cp1-c1-m1-mgmt-MON_API-8070 ardana-cp1-c1-m1-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m2-mgmt-MON_API-8070 ardana-cp1-c1-m2-mgmt:8070 check inter 5000 rise 2 fall 5
server ardana-cp1-c1-m3-mgmt-MON_API-8070 ardana-cp1-c1-m3-mgmt:8070 check inter 5000 rise 2 fall 5
In the above example, ardana-cp1-c1-m1-mgmt, ardana-cp1-c1-m2-mgmt, and ardana-cp1-c1-m3-mgmt are the monasca API servers.
You will want to SSH to each of the monasca API servers and edit the
/etc/monasca/agent/conf.d/host_alive.yaml
file to remove
references to the Compute node you removed. This will require
sudo
access. The entries will look similar to the one
below:
- alive_test: ping
  built_by: HostAlive
  host_name: ardana-cp1-comp0001-mgmt
  name: ardana-cp1-comp0001-mgmt ping
Once you have removed the references on each of your monasca API servers you then need to restart the monasca-agent on each of those servers with this command:
tux >
sudo service openstack-monasca-agent restart
With the Compute node references removed and the monasca-agent restarted, you can then delete the corresponding alarm to finish this process. To do so we recommend using the monasca CLI which should be installed on each of your monasca API servers by default:
ardana >
monasca alarm-list --metric-dimensions hostname=<deleted compute node hostname>
For example, if your Compute node looked like the example above then you would use this command to get the alarm ID:
ardana >
monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt
You can then delete the alarm with this command:
ardana >
monasca alarm-delete <alarm ID>
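If many alarms exist for the removed node, you can script the cleanup. The following is a sketch only: it assumes the default table output of the monasca CLI, where the alarm ID is the second |-delimited field, and should be reviewed before running. The hostname is the one from the example above:

ardana > for id in $(monasca alarm-list --metric-dimensions hostname=ardana-cp1-comp0001-mgmt | awk -F '|' 'NF > 2 { gsub(/ /, "", $2); if (length($2) == 36) print $2 }'); do monasca alarm-delete "$id"; done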
15.1.4 Planned Network Maintenance #
Planned maintenance tasks for networking nodes.
15.1.4.1 Adding a Network Node #
Adding an additional neutron networking node allows you to increase the performance of your cloud.
If you need to add an additional neutron network node for increased performance or another purpose, these steps will help you achieve this.
15.1.4.1.1 Prerequisites #
If you are using the mid-scale model then your networking nodes are already
separate and the roles are defined. If you are not already using this model
and wish to add separate networking nodes then you need to ensure that those
roles are defined. You can look in the ~/openstack/examples
folder on your Cloud Lifecycle Manager for the mid-scale example model files which
show how to do this. We have also added the basic edits that need to be made
below:
In your
server_roles.yml
file, ensure you have theNEUTRON-ROLE
defined.Path to file:
~/openstack/my_cloud/definition/data/server_roles.yml
Example snippet:
- name: NEUTRON-ROLE
  interface-model: NEUTRON-INTERFACES
  disk-model: NEUTRON-DISKS
In your
net_interfaces.yml
file, ensure you have theNEUTRON-INTERFACES
defined.Path to file:
~/openstack/my_cloud/definition/data/net_interfaces.yml
Example snippet:
- name: NEUTRON-INTERFACES
  network-interfaces:
    - device:
        name: hed3
      name: hed3
      network-groups:
        - EXTERNAL-VM
        - GUEST
        - MANAGEMENT
Create a
disks_neutron.yml
file, ensure you have theNEUTRON-DISKS
defined in it.Path to file:
~/openstack/my_cloud/definition/data/disks_neutron.yml
Example snippet:
product:
  version: 2
disk-models:
  - name: NEUTRON-DISKS
    volume-groups:
      - name: ardana-vg
        physical-volumes:
          - /dev/sda_root
        logical-volumes:
          # The policy is not to consume 100% of the space of each volume group.
          # 5% should be left free for snapshots and to allow for some flexibility.
          - name: root
            size: 35%
            fstype: ext4
            mount: /
          - name: log
            size: 50%
            mount: /var/log
            fstype: ext4
            mkfs-opts: -O large_file
          - name: crash
            size: 10%
            mount: /var/crash
            fstype: ext4
            mkfs-opts: -O large_file
Modify your
control_plane.yml
file, ensure you have theNEUTRON-ROLE
defined as well as the neutron services added.Path to file:
~/openstack/my_cloud/definition/data/control_plane.yml
Example snippet:
- allocation-policy: strict
  cluster-prefix: neut
  member-count: 1
  name: neut
  server-role: NEUTRON-ROLE
  service-components:
    - ntp-client
    - neutron-vpn-agent
    - neutron-dhcp-agent
    - neutron-metadata-agent
    - neutron-openvswitch-agent
You should also have one or more baremetal servers that meet the minimum hardware requirements for a network node which are documented in the Chapter 2, Hardware and Software Support Matrix.
15.1.4.1.2 Adding a network node #
These steps will show you how to add the new network node to your
servers.yml
file and then run the playbooks that update
your cloud configuration. You will run these playbooks from the lifecycle
manager.
Log in to your Cloud Lifecycle Manager.
Checkout the
site
branch of your local git so you can begin to make the necessary edits:ardana >
cd ~/openstack/my_cloud/definition/dataardana >
git checkout siteIn the same directory, edit your
servers.yml
file to include the details about your new network node(s).For example, if you already had a cluster of three network nodes and needed to add a fourth one you would add your details to the bottom of the file in this format:
# network nodes
- id: neut3
  ip-addr: 10.13.111.137
  role: NEUTRON-ROLE
  server-group: RACK2
  mac-addr: "5c:b9:01:89:b6:18"
  nic-mapping: HP-DL360-6PORT
  ilo-ip: 10.1.12.91
  ilo-password: password
  ilo-user: admin
Important: You will need to verify that the
ip-addr
value you choose for this node does not conflict with any other IP address in your cloud environment. You can confirm this by checking the ~/openstack/my_cloud/info/address_info.yml
file on your Cloud Lifecycle Manager.
In your
control_plane.yml
file, you will need to check the values for member-count, min-count, and max-count, if you specified them, to ensure that they match your new total node count. For example, if you had previously specified member-count: 3 and are adding a fourth network node, you will need to change that value to member-count: 4.
Commit the changes to git:
ardana >
git commit -a -m "Add new networking node <name>"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlAdd the new node into Cobbler:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.ymlThen you can image the node:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<hostname>NoteIf you do not know the
<hostname>
, you can get it by usingsudo cobbler system list
.[OPTIONAL] - Run the
wipe_disks.yml
playbook to ensure all of your non-OS partitions on your nodes are completely wiped prior to continuing with the installation. Thewipe_disks.yml
playbook is only meant to be run on systems immediately after runningbm-reimage.yml
. If used for any other case, it may not wipe all of the expected partitions.ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <hostname>Configure the operating system on the new networking node with this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>Complete the networking node deployment with this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --limit <hostname>Run the
site.yml
playbook with the required tag so that all other services become aware of the new node:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
15.1.4.1.3 Adding a New Network Node to Monitoring #
If you want to add a new networking node to the monitoring service checks, there is an additional playbook that must be run to ensure this happens:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-deploy.yml --tags "active_ping_checks"
15.1.5 Planned Storage Maintenance #
Planned maintenance procedures for swift storage nodes.
15.1.5.1 Planned Maintenance Tasks for swift Nodes #
Planned maintenance tasks including recovering, adding, and removing swift nodes.
15.1.5.1.1 Adding a Swift Object Node #
Adding additional object nodes allows you to increase capacity.
This topic describes how to add additional swift object server nodes to an existing system.
15.1.5.1.1.1 To add a new node #
To add a new node to your cloud, you will need to add it to
servers.yml
, and then run the scripts that update your
cloud configuration. To begin, access the servers.yml
file
by checking out the Git branch where you are required to make
the changes:
Then, perform the following steps to add a new node:
Log in to the Cloud Lifecycle Manager node.
Get the
servers.yml
file stored in Git:cd ~/openstack/my_cloud/definition/data git checkout site
If not already done, set the
weight-step
attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.Add the details of new nodes to the
servers.yml
file. In the following example only one new server swobj4 is added. However, you can add multiple servers by providing the server details in theservers.yml
file:servers: ... - id: swobj4 role: SWOBJ_ROLE server-group: <server-group-name> mac-addr: <mac-address> nic-mapping: <nic-mapping-name> ip-addr: <ip-address> ilo-ip: <ilo-ip-address> ilo-user: <ilo-username> ilo-password: <ilo-password>
Commit your changes:
git add -A git commit -m "Add Node <name>"
NoteBefore you run any playbooks, remember that you need to export the encryption key in the following environment variable:
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
For instructions, see Chapter 30, Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only.
Run the configuration processor:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Create a deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Configure Cobbler to include the new node, and then reimage the node (if you are adding several nodes, use a comma-separated list with the
nodelist
argument):cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost cobbler-deploy.yml ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
In the following example, the server id is swobj4 (mentioned in step 3):
cd ~/openstack/ardana/ansible
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj4
NoteYou must use the server id as it appears in the file
servers.yml
in the fieldid
.Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
The hostname of the newly added server can be found in the list generated from the output of the following command:
grep hostname ~/openstack/my_cloud/info/server_info.yml
For example, for swobj4, the hostname is ardana-cp1-swobj0004-mgmt.
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0004-mgmt
Validate that the disk drives of the new node are compatible with the disk model used by the node:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Run the following playbook to ensure that the host files on all other servers are updated with the new server:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
Run the
ardana-deploy.yml
playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swobj4) that you are adding:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
For example:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-update-from-model-rebalance-rings.yml
15.1.5.1.2 Adding a Swift Proxy, Account, Container (PAC) Node #
Steps for adding additional PAC nodes to your swift system.
This topic describes how to add additional swift proxy, account, and container (PAC) servers to an existing system.
15.1.5.1.2.1 Adding a new node #
To add a new node to your cloud, you will need to add it to
servers.yml
, and then run the scripts that update your
cloud configuration. To begin, access the servers.yml
file by checking out the Git branch where you are required to make
the changes:
Then, perform the following steps to add a new node:
Log in to the Cloud Lifecycle Manager.
Get the
servers.yml
file stored in Git:cd ~/openstack/my_cloud/definition/data git checkout site
If not already done, set the weight-step attribute. For instructions, see Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes”.
Add details of new nodes to the
servers.yml
file:servers: ... - id: swpac6 role: SWPAC-ROLE server-group: <server-group-name> mac-addr: <mac-address> nic-mapping: <nic-mapping-name> ip-addr: <ip-address> ilo-ip: <ilo-ip-address> ilo-user: <ilo-username> ilo-password: <ilo-password>
In the above example, only one new server swpac6 is added. However, you can add multiple servers by providing the server details in the
servers.yml
file.In the entry-scale configurations there is no dedicated swift PAC cluster. Instead, there is a cluster using servers that have a role of
CONTROLLER-ROLE
. You cannot add additional nodes dedicated exclusively to swift PAC because that would change the member-count
of the entire cluster. In that case, to create a dedicated swift PAC cluster, you will need to add it to the configuration files. For details on how to do this, see Section 11.7, “Creating a Swift Proxy, Account, and Container (PAC) Cluster”.
If you are using new PAC nodes, you must add the PAC node's configuration details in the following yaml files:
control_plane.yml
disks_pac.yml
net_interfaces.yml
servers.yml
server_roles.yml
You can see a good example of this in the example configurations for the mid-scale model in the
~/openstack/examples/mid-scale-kvm
directory.The following steps assume that you have already created a dedicated swift PAC cluster and that it has two members (swpac4 and swpac5).
Set the member count of the swift PAC cluster to match the number of nodes. For example, if you are adding swpac6 as the 6th swift PAC node, the member count should be increased from 5 to 6 as shown in the following example:
control-planes:
  - name: control-plane-1
    control-plane-prefix: cp1
    . . .
    clusters:
      . . .
      - name: swpac
        cluster-prefix: swpac
        server-role: SWPAC-ROLE
        member-count: 6
        . . .
Commit your changes:
git add -A git commit -m "Add Node <name>"
NoteBefore you run any playbooks, remember that you need to export the encryption key in the following environment variable:
export HOS_USER_PASSWORD_ENCRYPT_KEY=ENCRYPTION_KEY
For instructions, see Chapter 30, Installation for SUSE OpenStack Cloud Entry-scale Cloud with Swift Only.
Run the configuration processor:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Create a deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Configure Cobbler to include the new node and reimage the node (if you are adding several nodes, use a comma-separated list for the
nodelist
argument):
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server-id>
In the following example, the server id is swpac6 (mentioned in step 3):
ansible-playbook -i hosts/localhost cobbler-deploy.yml
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swpac6
NoteYou must use the server id as it appears in the file
servers.yml
in the fieldid
.Review the
cloudConfig.yml
anddata/control_plane.yml
files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
For example, for swpac6, the hostname is ardana-cp1-c2-m3-mgmt:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-c2-m3-mgmt
Validate that the disk drives of the new node are compatible with the disk model used by the node:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml
If any errors occur, correct them. For instructions, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”.
Run the following playbook to ensure that the host files on all other servers are updated with the new server:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts site.yml --tags "generate_hosts_file"
Run the
ardana-deploy.yml
playbook to rebalance the rings to include the node, deploy the rings, and configure the new node. Do not limit this to just the node (swpac6) that you are adding:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
You may need to perform further rebalances of the rings. For instructions, see the "Weight Change Phase of Ring Rebalance" and the "Final Rebalance Phase" sections of Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
15.1.5.1.3 Adding Additional Disks to a Swift Node #
Steps for adding additional disks to any nodes hosting swift services.
You may need to add additional disks to a node for swift usage; the following steps show you how. These steps work for adding additional disks to swift object or proxy, account, container (PAC) nodes. They also apply to adding additional disks to a controller node that is hosting the swift service, as in the entry-scale example models.
Read through the notes below before beginning the process.
You can add multiple disks at the same time; there is no need to do it one at a time.
You must add the same number of disks to each server that the disk model applies to. For example, if you have a single cluster of three swift servers and you want to increase capacity and decide to add two additional disks, you must add two to each of your three swift servers.
15.1.5.1.3.1 Adding additional disks to your Swift servers #
Verify the general health of the swift system and that it is safe to rebalance your rings. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Perform the disk maintenance.
Shut down the first swift server you wish to add disks to.
Add the additional disks to the physical server. The disk drives that are added should be clean. They should either contain no partitions or a single partition the size of the entire disk. It should not contain a file system or any volume groups. Failure to comply will cause errors and the disk will not be added.
For more details, see Section 11.6, “Swift Requirements for Device Group Drives”.
Power the server on.
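After powering the server back on, you may want to confirm from the operating system that each new drive really is empty before swift tries to use it. A minimal sketch, where /dev/sdd is an assumed name for the newly added drive:

tux > lsblk /dev/sdd
tux > sudo wipefs --all /dev/sdd    # only if the drive still carries an old partition or filesystem signature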
While the server was shutdown, data that normally would have been placed on the server is placed elsewhere. When the server is rebooted, the swift replication process will move that data back onto the server. Monitor the replication process to determine when it is complete. See Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” for details on how to do this.
Repeat the steps from Step 2.a for each of the swift servers you are adding the disks to, one at a time.
NoteIf the additional disks can be added to the swift servers online (for example, via hotplugging) then there is no need to perform the last two steps.
On the Cloud Lifecycle Manager, update your cloud configuration with the details of your additional disks.
Edit the disk configuration file that correlates to the type of server you are adding your new disks to.
Path to the typical disk configuration files:
~/openstack/my_cloud/definition/data/disks_swobj.yml ~/openstack/my_cloud/definition/data/disks_swpac.yml ~/openstack/my_cloud/definition/data/disks_controller_*.yml
Example showing the addition of a single new disk, indicated by the
/dev/sdd
entry:
device-groups:
  - name: swiftObject
    devices:
      - name: "/dev/sdb"
      - name: "/dev/sdc"
      - name: "/dev/sdd"
    consumer:
      name: swift
  ...
NoteFor more details on how the disk model works, see Chapter 6, Configuration Objects.
Configure the swift weight-step value in the
~/openstack/my_cloud/definition/data/swift/swift_config.yml
file. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for details on how to do this.
Commit the changes to Git:
cd ~/openstack git commit -a -m "adding additional swift disks"
Run the configuration processor:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Run the
osconfig-run.yml
playbook against the swift nodes you have added disks to. Use the--limit
switch to target the specific nodes:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostnames>
You can use a wildcard when specifying the hostnames with the
--limit
switch. If you added disks to all of the swift servers in your environment and they all have the same prefix (for example, ardana-cp1-swobj...), then you can use a wildcard like ardana-cp1-swobj*. If you only added disks to some of the nodes, you can use a comma-delimited list of the hostnames of the nodes you added disks to.
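For example, if every object server shares the ardana-cp1-swobj prefix used elsewhere in this document, the wildcard form might look like this:

ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit 'ardana-cp1-swobj*'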
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --extra-vars "drive_detail=yes"
Verify that swift services are running on all of your servers:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-status.yml
If everything looks okay with the swift status, then apply the changes to your swift rings with this playbook:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-deploy.yml
At this point your swift rings will begin rebalancing. You should wait until replication has completed or min-part-hours has elapsed (whichever is longer), as described in Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring” and then follow the "Weight Change Phase of Ring Rebalance" process as described in Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
15.1.5.1.4 Removing a Swift Node #
Removal process for both swift Object and PAC nodes.
You can use this process when you want to remove one or more swift nodes permanently. This process applies to both swift Proxy, Account, Container (PAC) nodes and swift Object nodes.
15.1.5.1.4.1 Setting the Pass-through Attributes #
This process will remove the swift node's drives from the rings and rebalance their responsibilities among the remaining nodes in your cluster. Note that removal will not succeed if it causes the number of remaining disks in the cluster to decrease below the replica count of its rings.
Log in to the Cloud Lifecycle Manager.
Ensure that the weight-step attribute is set. See Section 9.5.2, “Using the Weight-Step Attributes to Prepare for Ring Changes” for more details.
Add the pass-through definition to your input model, specifying the server ID (as opposed to the server name). It is easiest to include in your
~/openstack/my_cloud/definition/data/servers.yml
file since your server IDs are already listed in that file. For more information about pass-through, see Section 6.17, “Pass Through”.
Here is the format required, which can be inserted at the topmost level of indentation in your file (typically 2 spaces):
pass-through:
  servers:
    - id: server-id
      data:
        subsystem: subsystem-attributes
Here is an example:
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0001
      data:
        swift:
          drain: yes
If a pass-through definition already exists in any of your input model data files, just include the additional data for the server which you are removing instead of defining an entirely new pass-through block.
By setting this pass-through attribute, you indicate that the system should reduce the weight of the server's drives. The weight reduction is determined by the weight-step attribute as described in the previous step. This process is known as "draining", where you remove the swift data from the node in preparation for removing the node.
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
cd ~/openstack/ardana/ansible git add -A git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Use the playbook to create a deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Run the swift deploy playbook to perform the first ring rebuild. This will remove some of the partitions from all drives on the node you are removing:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-deploy.yml
Wait until the replication has completed. For further details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”
Determine whether all of the partitions have been removed from all drives on the swift node you are removing. You can do this by SSH'ing into the first account server node and using these commands:
cd /etc/swiftlm/cloud1/cp1/builder_dir/
sudo swift-ring-builder ring_name.builder
For example, if the node you are removing was part of the object-o ring the command would be:
sudo swift-ring-builder object-0.builder
Check the output. You will need to know the IP address of the server being drained. In the example below, the number of partitions of the drives on 192.168.245.3 has reached zero for the object-0 ring:
$ cd /etc/swiftlm/cloud1/cp1/builder_dir/
$ sudo swift-ring-builder object-0.builder
account.builder, build version 6
4096 partitions, 3.000000 replicas, 1 regions, 1 zones, 6 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 16
The overload factor is 0.00% (0.000000)
Devices:
  id region zone ip address    port replication ip replication port name  weight partitions balance meta
  0  1      1    192.168.245.3 6002 192.168.245.3  6002             disk0 0.00   0          -0.00   padawan-ccp-c1-m1:disk0:/dev/sdc
  1  1      1    192.168.245.3 6002 192.168.245.3  6002             disk1 0.00   0          -0.00   padawan-ccp-c1-m1:disk1:/dev/sdd
  2  1      1    192.168.245.4 6002 192.168.245.4  6002             disk0 18.63  2048       -0.00   padawan-ccp-c1-m2:disk0:/dev/sdc
  3  1      1    192.168.245.4 6002 192.168.245.4  6002             disk1 18.63  2048       -0.00   padawan-ccp-c1-m2:disk1:/dev/sdd
  4  1      1    192.168.245.5 6002 192.168.245.5  6002             disk0 18.63  2048       -0.00   padawan-ccp-c1-m3:disk0:/dev/sdc
  5  1      1    192.168.245.5 6002 192.168.245.5  6002             disk1 18.63  2048       -0.00   padawan-ccp-c1-m3:disk1:/dev/sdd
If the number of partitions is zero for the server on all rings, you can move to the next step, otherwise continue the ring rebalance cycle by repeating steps 7-9 until the weight has reached zero.
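To avoid checking each ring by hand, a small sketch such as the following prints the rows for the server being drained from every builder file; the IP address 192.168.245.3 and the builder directory path are taken from the example above:

tux > cd /etc/swiftlm/cloud1/cp1/builder_dir/
tux > for f in *.builder; do echo "== $f"; sudo swift-ring-builder "$f" | grep 192.168.245.3; done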
If the number of partitions is zero for the server on all rings, you can remove the swift nodes' drives from all rings. Edit the pass-through data you created in step #3 and set the
remove
attribute as shown in this example:
---
product:
  version: 2
pass-through:
  servers:
    - id: ccn-0001
      data:
        swift:
          remove: yes
Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
cd ~/openstack/ardana/ansible git add -A git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Run the swift deploy playbook to rebuild the rings by removing the server:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-deploy.yml
At this stage, the server has been removed from the rings and the data that was originally stored on the server has been replicated in a balanced way to the other servers in the system. You can proceed to the next phase.
15.1.5.1.4.2 To Disable Swift on a Node #
The next phase in this process will disable the swift service on the node. In this example, swobj4 is the node being removed from swift.
Log in to the Cloud Lifecycle Manager.
Stop swift services on the node using the
swift-stop.yml
playbook:
cd ~/scratch/ansible/next/ardana/ansible
ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit <hostname>
NoteWhen using the
--limit
argument, you must specify the full hostname (for example: ardana-cp1-swobj0004) or use the wild card*
(for example, *swobj4*).The following example uses the
swift-stop.yml
playbook to stop swift services on ardana-cp1-swobj0004:cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-stop.yml --limit ardana-cp1-swobj0004
Remove the configuration files.
ssh ardana-cp1-swobj4-mgmt sudo rm -R /etc/swift
NoteDo not run any other playbooks until you have finished the process described in Section 15.1.5.1.4.3, “To Remove a Node from the Input Model”. Otherwise, these playbooks may recreate
/etc/swift
and restart swift on swobj4. If you accidentally run a playbook, repeat the process in Section 15.1.5.1.4.2, “To Disable Swift on a Node”.
15.1.5.1.4.3 To Remove a Node from the Input Model #
Use the following steps to finish the process of removing the swift node.
Log in to the Cloud Lifecycle Manager.
Edit the
~/openstack/my_cloud/definition/data/servers.yml
file and remove the entry for the node (swobj4 in this example). In addition, remove the related entry you created in the pass-through section earlier in this process.
If this was a SWPAC node, reduce the member-count attribute by 1 in the
~/openstack/my_cloud/definition/data/control_plane.yml
file. For SWOBJ nodes, no such action is needed.Commit your configuration to the local Git repository (see Chapter 22, Using Git for Configuration Management), as follows:
cd ~/openstack/ardana/ansible git add -A git commit -m "My config or other commit message"
Run the configuration processor:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost config-processor-run.yml
Using the
remove_deleted_servers
andfree_unused_addresses
switches is recommended to free up the resources associated with the removed node when running the configuration processor. For more information, see Section 7.3, “Persisted Data”.ansible-playbook -i hosts/localhost config-processor-run.yml -e remove_deleted_servers="y" -e free_unused_addresses="y"
Update your deployment directory:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost ready-deployment.yml
Validate the changes you have made to the configuration files using the playbook below before proceeding further:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*
If any errors occur, correct them in your configuration files and repeat steps 3-5 until no more errors occur before going to the next step.
For more details on how to interpret and resolve errors, see Section 18.6.2.3, “Interpreting Swift Input Model Validation Errors”
Remove the node from Cobbler:
sudo cobbler system remove --name=swobj4
Run the Cobbler deploy playbook:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost cobbler-deploy.yml
The final step will depend on what type of swift node you are removing.
If the node was a SWPAC node, run the
ardana-deploy.yml
playbook:cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
If the node was a SWOBJ node (and not a SWPAC node), run the
swift-deploy.yml
playbook:cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-deploy.yml
Wait until replication has finished. For more details, see Section 9.5.4, “Determining When to Rebalance and Deploy a New Ring”.
You may need to continue to rebalance the rings. For instructions, see Final Rebalance Phase at Section 9.5.5, “Applying Input Model Changes to Existing Rings”.
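As a final sanity check before cleaning up monitoring, you can confirm that the node is gone from Cobbler and that the deployed rings still match the input model. This re-uses the validation playbook from the steps above; swobj4 is the example node name:
ardana > sudo cobbler system list | grep swobj4     # should return nothing
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts swift-compare-model-rings.yml --limit SWF*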
15.1.5.1.4.4 Remove the Swift Node from Monitoring #
Once you have removed the swift node(s), alarms will trigger against them, so you need to take additional steps to resolve this.
Connect to each of the nodes in your cluster running the
monasca-api
service (as defined in
~/openstack/my_cloud/definition/data/control_plane.yml
)
and use sudo vi /etc/monasca/agent/conf.d/host_alive.yaml
to delete all references to the swift node(s) you removed.
Once you have removed the references on each of your monasca API servers you then need to restart the monasca-agent on each of those servers with this command:
tux >
sudo service openstack-monasca-agent restart
With the swift node references removed and the monasca-agent restarted, you can then delete the corresponding alarms to finish this process. To do so, we recommend using the monasca CLI, which should be installed on each of your monasca API servers by default:
monasca alarm-list --metric-dimensions hostname=<deleted swift node hostname>
You can then delete the alarm with this command:
monasca alarm-delete <alarm ID>
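If several alarms were raised for the removed node, you can list and delete them in one pass. This is a sketch only; it assumes the default table output of the monasca CLI with the alarm ID in the first column, so review the list before deleting anything (ardana-cp1-swobj0004 is the example hostname):
ardana > monasca alarm-list --metric-dimensions hostname=ardana-cp1-swobj0004
ardana > for id in $(monasca alarm-list --metric-dimensions hostname=ardana-cp1-swobj0004 \
           | awk -F'|' 'NF>2 && $2 !~ /id/ {gsub(/ /,"",$2); print $2}'); do \
           monasca alarm-delete "$id"; \
         done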
15.1.5.1.5 Replacing a swift Node #
Maintenance steps for replacing a failed swift node in your environment.
This process is used when you want to replace a failed swift node in your cloud.
If step 10 applies to the server, do not skip it. If you do, the system will overwrite the existing rings with new rings. This will not cause data loss, but it may move most objects in your system to new locations and may make data unavailable until the replication process has completed.
15.1.5.1.5.1 How to replace a swift node in your environment #
Log in to the Cloud Lifecycle Manager.
Power off the node.
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-power-down.yml -e nodelist=OLD_SWIFT_CONTROLLER_NODE
Update your cloud configuration with the details of your replacement swift node.
Edit your
servers.yml
file to include the details (MAC address, IPMI user, password, and IP address (IPMI), if these have changed) about your replacement swift node.NoteDo not change the server's IP address (that is,
ip-addr
).Path to file:
~/openstack/my_cloud/definition/data/servers.yml
Example showing the fields to edit, in bold:
- id: swobj5 role: SWOBJ-ROLE server-group: rack2 mac-addr: 8c:dc:d4:b5:cb:bd nic-mapping: HP-DL360-6PORT ip-addr: 10.243.131.10 ilo-ip: 10.1.12.88 ilo-user: iLOuser ilo-password: iLOpass ...
Commit the changes to Git:
ardana >
cd ~/openstackardana >
git commit -a -m "replacing a swift node"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.yml
Prepare SLES:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook prepare-sles-loader.ymlardana >
ansible-playbook prepare-sles-grub2.yml -e nodelist=NEW_REPLACEMENT_NODE
Update Cobbler and reimage your replacement swift node:
Obtain the name in Cobbler for your node you wish to remove. You will use this value to replace
<node name>
in future steps.ardana >
sudo cobbler system listRemove the replaced swift node from Cobbler:
ardana >
sudo cobbler system remove --name <node name>Re-run the
cobbler-deploy.yml
playbook to add the replaced node:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.ymlReimage the node using this playbook:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Wipe the disks on the
NEW REPLACEMENT NODE
. This action will not affect the OS partitions on the server.ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit NEW_REPLACEMENT_NODE
Complete the deployment of your replacement swift node.
Obtain the hostname for your new swift node. Use this value to replace
<hostname>
in future steps.ardana >
cat ~/openstack/my_cloud/info/server_info.ymlConfigure the operating system on your replacement swift node:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>If this is the swift ring builder server, restore the swift ring builder files to the
/etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir
directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.Configure services on the node using the
ardana-deploy.yml
playbook. If you have used an encryption password when running the configuration processor, include the--ask-vault-pass
argument.ardana >
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit <hostname>
15.1.5.1.6 Replacing Drives in a swift Node #
Maintenance steps for replacing drives in a swift node.
This process is used when you want to remove a failed hard drive from swift node and replace it with a new one.
There are two classes of drives in a swift node that may need to be replaced: the operating system disk drive (generally /dev/sda) and the storage disk drives. Each class of drive has a different replacement procedure to bring the node back to normal.
15.1.5.1.6.1 To Replace the Operating System Disk Drive #
After the operating system disk drive is replaced, the node must be reimaged.
Log in to the Cloud Lifecycle Manager.
Update your Cobbler profile:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/localhost cobbler-deploy.yml
Reimage the node using this playbook:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<server name>
In the example below swobj2 server is reimaged:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=swobj2
Review the
cloudConfig.yml
anddata/control_plane.yml
files to get the host prefix (for example, openstack) and the control plane name (for example, cp1). This gives you the hostname of the node. Configure the operating system:cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <hostname>
In the following example, for swobj2, the hostname is ardana-cp1-swobj0002:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit ardana-cp1-swobj0002*
If this is the first server running the swift-proxy service, restore the swift Ring Builder files to the
/etc/swiftlm/CLOUD-NAME/CONTROL-PLANE-NAME/builder_dir
directory. For more information and instructions, see Section 18.6.2.4, “Identifying the Swift Ring Building Server” and Section 18.6.2.7, “Recovering swift Builder Files”.Configure services on the node using the
ardana-deploy.yml
playbook. If you have used an encryption password when running the configuration processor include the--ask-vault-pass
argument.ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass \ --limit <hostname>
For example:
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml --ask-vault-pass --limit ardana-cp1-swobj0002*
15.1.5.1.6.2 To Replace a Storage Disk Drive #
After a storage drive is replaced, there is no need to reimage the server.
Instead, run the swift-reconfigure.yml
playbook.
Log onto the Cloud Lifecycle Manager.
Run the following commands:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit <hostname>
In following example, the server used is swobj2:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml --limit ardana-cp1-swobj0002-mgmt
15.1.6 Updating MariaDB with Galera #
Updating MariaDB with Galera must be done manually. Updates are not installed automatically. This is particularly an issue with upgrades to MariaDB 10.2.17 or higher from MariaDB 10.2.16 or earlier. See MariaDB 10.2.22 Release Notes - Notable Changes.
Using the CLI, update MariaDB with the following procedure:
Mark Galera as unmanaged:
crm resource unmanage galera
Or put the whole cluster into maintenance mode:
crm configure property maintenance-mode=true
Pick a node other than the one currently targeted by the loadbalancer and stop MariaDB on that node:
crm_resource --wait --force-demote -r galera -V
Perform updates:
Uninstall the old versions of MariaDB and the Galera wsrep provider.
Install the new versions of MariaDB and the Galera wsrep provider. Select the appropriate instructions at Installing MariaDB with zypper.
Change configuration options if necessary.
Start MariaDB on the node.
crm_resource --wait --force-promote -r galera -V
Run
mysql_upgrade
with the--skip-write-binlog
option.On the other nodes, repeat the process detailed above: stop MariaDB, perform updates, start MariaDB, run
mysql_upgrade
.Mark Galera as managed:
crm resource manage galera
Or take the cluster out of maintenance mode.
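Putting the per-node steps together, the sequence on each database node looks roughly like the following. This is a sketch, not a verified script: run it as root, start with a node other than the one currently targeted by the load balancer, and follow the "Installing MariaDB with zypper" instructions referenced above for the actual package removal and installation. Wait for each node to rejoin the cluster before moving to the next one.
crm configure property maintenance-mode=true            # or: crm resource unmanage galera
crm_resource --wait --force-demote -r galera -V          # stop MariaDB on this node
# remove the old MariaDB and Galera wsrep provider packages and install the new
# versions here, following "Installing MariaDB with zypper" as referenced above
crm_resource --wait --force-promote -r galera -V         # start MariaDB on this node again
mysql_upgrade --skip-write-binlog
# repeat the demote/update/promote/upgrade steps on the remaining nodes, then:
crm configure property maintenance-mode=false            # or: crm resource manage galera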
15.2 Unplanned System Maintenance #
Unplanned maintenance tasks for your cloud.
15.2.1 Whole Cloud Recovery Procedures #
Unplanned maintenance procedures for your whole cloud.
15.2.1.1 Full Disaster Recovery #
In this disaster scenario, you have lost everything in your cloud. In other words, you have lost access to all data stored in the cloud that was not backed up to an external backup location, including:
Data in swift object storage
glance images
cinder volumes
Metering, Monitoring, and Logging (MML) data
Workloads running on compute resources
In effect, the following recovery process creates a minimal new cloud with the existing identity information. Much of the operating state and data would have been lost, as would running workloads.
We recommend keeping backups of your data external to your cloud, covering as many of the resource types listed above as possible. Most workloads that were running can be recreated if sufficient external backups exist.
15.2.1.1.1 Install and Set Up a Cloud Lifecycle Manager Node #
Before beginning the process of a full cloud recovery, you need to install and set up a Cloud Lifecycle Manager node as though you are creating a new cloud. There are several steps in that process:
Install the appropriate version of SUSE Linux Enterprise Server
Restore
passwd
,shadow
, andgroup
files. They have User ID (UID) and group ID (GID) content that will be used to set up the new cloud. If these are not restored immediately after installing the operating system, the cloud deployment will create new UIDs and GIDs, overwriting the existing content.Install Cloud Lifecycle Manager software
Prepare the Cloud Lifecycle Manager, which includes installing the necessary packages
Initialize the Cloud Lifecycle Manager
Restore your OpenStack git repository
Adjust input model settings if the hardware setup has changed
The following sections cover these steps in detail.
15.2.1.1.2 Install the Operating System #
Follow the instructions for installing SUSE Linux Enterprise Server in Chapter 15, Installing the Cloud Lifecycle Manager server.
15.2.1.1.3 Restore files with UID and GID content #
There is a risk that you may lose data completely. Restore the backups for
/etc/passwd
, /etc/shadow
, and
/etc/group
immediately after installing SUSE Linux Enterprise Server.
Some backup files contain content that would no longer be valid if your cloud were to be freshly deployed in the next step of a whole cloud recovery. As a result, some of the backup must be restored before deploying a new cloud. Three kinds of backups are involved: passwd, shadow, and group. The following steps will restore those backups.
Log in to the server where the Cloud Lifecycle Manager will be installed.
Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.
ardana >
scp USER@REMOTE_SERVER:TAR_ARCHIVEUntar the TAR archives to overwrite the three locations:
passwd
shadow
group
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gzThe following are examples. Use the actual
tar.gz
file names of the backups.BACKUP_TARGET=
/etc/passwd
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ -f passwd.tar.gzBACKUP_TARGET=
/etc/shadow
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ -f shadow.tar.gzBACKUP_TARGET=
/etc/group
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ -f group.tar.gz
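If you prefer, the three restores can be driven by a small loop. This is a sketch that assumes the backup archives are named passwd.tar.gz, shadow.tar.gz, and group.tar.gz and are in the current directory, as in the examples above:
ardana > for f in passwd shadow group; do \
           sudo tar -z --incremental --extract --ignore-zeros \
                --warning=none --overwrite --directory /etc/ -f ${f}.tar.gz; \
         done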
15.2.1.1.4 Install the Cloud Lifecycle Manager #
To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Section 15.5.2, “Installing the SUSE OpenStack Cloud Extension”.
15.2.1.1.5 Prepare to deploy your cloud #
The following is the general process for preparing to deploy a SUSE OpenStack Cloud. You may not need to perform all the steps, depending on your particular disaster recovery situation.
When you install the ardana cloud pattern
in the
following process, the ardana
user and
ardana
group will already exist in
/etc/passwd
and /etc/group
. Do
not re-create them.
When you run ardana-init
in the following process,
/var/lib/ardana
is created as a deployer account using
the account settings in /etc/passwd
and
/etc/group
that were restored in the previous step.
15.2.1.1.5.1 Prepare for Cloud Installation #
Review the Chapter 14, Pre-Installation Checklist about recommended pre-installation tasks.
Prepare the Cloud Lifecycle Manager node. The Cloud Lifecycle Manager must be accessible either directly or via
ssh
, and have SUSE Linux Enterprise Server 12 SP4 installed. All nodes must be accessible to the Cloud Lifecycle Manager. If the nodes do not have direct access to online Cloud subscription channels, the Cloud Lifecycle Manager node will need to host the Cloud repositories.If you followed the installation instructions for Cloud Lifecycle Manager server (see Chapter 15, Installing the Cloud Lifecycle Manager server), SUSE OpenStack Cloud software should already be installed. Double-check whether SUSE Linux Enterprise and SUSE OpenStack Cloud are properly registered at the SUSE Customer Center by starting YaST and running › .
If you have not yet installed SUSE OpenStack Cloud, do so by starting YaST and running › › . Choose and follow the on-screen instructions. Make sure to register SUSE OpenStack Cloud during the installation process and to install the software pattern
patterns-cloud-ardana
.tux >
sudo zypper -n in patterns-cloud-ardanaEnsure the SUSE OpenStack Cloud media repositories and updates repositories are made available to all nodes in your deployment. This can be accomplished either by configuring the Cloud Lifecycle Manager server as an SMT mirror as described in Chapter 16, Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional) or by syncing or mounting the Cloud and updates repositories to the Cloud Lifecycle Manager server as described in Chapter 17, Software Repository Setup.
Configure passwordless
sudo
for the user created when setting up the node (as described in Section 15.4, “Creating a User”). Note that this is not the userardana
that will be used later in this procedure. In the following we assume you named the usercloud
. Run the commandvisudo
as userroot
and add the following line to the end of the file:CLOUD ALL = (root) NOPASSWD:ALL
Make sure to replace CLOUD with your user name choice.
Set the password for the user
ardana
:tux >
sudo passwd ardanaBecome the user
ardana
:tux >
su - ardanaPlace a copy of the SUSE Linux Enterprise Server 12 SP4
.iso
in theardana
home directory,var/lib/ardana
, and rename it tosles12sp4.iso
.Install the templates, examples, and working model directories:
ardana >
/usr/bin/ardana-init
15.2.1.1.6 Restore the remaining Cloud Lifecycle Manager content from a remote backup #
Log in to the Cloud Lifecycle Manager.
Retrieve the Cloud Lifecycle Manager backups from the remote server, which were created and saved during Procedure 17.1, “Manual Backup Setup”.
ardana >
scp USER@REMOTE_SERVER:TAR_ARCHIVEUntar the TAR archives to overwrite the remaining required locations:
home
ssh
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gzThe following are examples. Use the actual
tar.gz
file names of the backups.BACKUP_TARGET=
/var/lib/ardana
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /var/lib/ -f home.tar.gzBACKUP_TARGET=/etc/ssh/
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gz
15.2.1.1.7 Re-deployment of controllers 1, 2 and 3 #
Change back to the default ardana user.
Run the
cobbler-deploy.yml
playbook.ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.ymlRun the
bm-reimage.yml
playbook limited to the second and third controllers.ardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=controller2,controller3The names controller2 and controller3 are examples. Use the
bm-power-status.yml
playbook to check the cobbler names of these nodes.Run the
site.yml
playbook limited to the three controllers and localhost—in this example,doc-cp1-c1-m1-mgmt
,doc-cp1-c1-m2-mgmt
,doc-cp1-c1-m3-mgmt
, andlocalhost
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit \ doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhostYou can now perform the procedures to restore MariaDB and swift.
15.2.1.1.8 Restore MariaDB from a remote backup #
Log in to the first node running the MariaDB service.
Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example,
mydb.tar.gz
).ardana >
mkdir /tmp/mysql_restore; sudo tar -z --incremental \ --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \ -f mydb.tar.gzVerify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_infoStop SUSE OpenStack Cloud services on the three controllers (using the hostnames of the controllers in your configuration).
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-stop.yml --limit \ doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhostDelete the files in the
mysql
directory and copy the restored backup to that directory.root #
cd /var/lib/mysql/root #
rm -rf ./*root #
cp -pr /tmp/mysql_restore/* ./Switch back to the
ardana
user when the copy is finished.
15.2.1.1.9 Restore swift from a remote backup #
Log in to the first swift Proxy (
SWF-PRX--first-member
) node.To find the first swift Proxy node:
On the Cloud Lifecycle Manager
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-status.yml \ --limit SWF-PRX--first-memberAt the end of the output, you will see something like the following example:
... Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)' Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)' PLAY RECAP ******************************************************************** ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
Find the first node name and its IP address. For example:
ardana >
cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
Retrieve (
scp
) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.Create a temporary directory and extract the TAR archive (for example,
swring.tar.gz
).ardana >
mkdir /tmp/swift_builder_dir_restore; sudo tar -z \ --incremental --extract --ignore-zeros --warning=none --overwrite --directory \ /tmp/swift_builder_dir_restore/ -f swring.tar.gzLog in to the Cloud Lifecycle Manager.
Stop the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-stop.ymlLog back in to the first swift Proxy (
SWF-PRX--first-member
) node, which was determined in Step 1.Copy the restored files.
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/For example
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/Log back in to the Cloud Lifecycle Manager.
Reconfigure the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.1.1.10 Restart SUSE OpenStack Cloud services #
Restart the MariaDB database
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.ymlOn the deployer node, execute the
galera-bootstrap.yml
playbook which will determine the log sequence number, bootstrap the main node, and start the database cluster.If this process fails to recover the database cluster, refer to Section 15.2.3.1.2, “Recovering the MariaDB Database”.
Restart SUSE OpenStack Cloud services on the three controllers as in the following example.
ardana >
ansible-playbook -i hosts/verb_hosts ardana-start.yml \ --limit doc-cp1-c1-m1-mgmt,doc-cp1-c1-m2-mgmt,doc-cp1-c1-m3-mgmt,localhostReconfigure SUSE OpenStack Cloud
ardana >
ansible-playbook -i hosts/verb_hosts ardana-reconfigure.yml
15.2.2 Recover Start-up Processes #
In this scenario, processes do not start. If those processes are not running,
ansible start-up scripts will fail. On the deployer, use
Ansible
to check status on the control plane servers. The
following checks and remedies address common causes of this condition.
If disk space is low, determine the cause and remove anything that is no longer needed. Check disk space with the following command:
ardana >
ansible KEY-API -m shell -a 'df -h'Check that Network Time Protocol (NTP) is synchronizing clocks properly with the following command.
ardana >
ansible resources -i hosts/verb_hosts \ -m shell -a "sudo ntpq -c peers"Check
keepalived
, the daemon that monitors services or systems and automatically fails over to a standby if problems occur.ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status keepalived | head -8"Restart
keepalived
if necessary.Check RabbitMQ status first:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo rabbitmqctl status | head -10"Restart RabbitMQ if necessary:
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl start rabbitmq-server"If RabbitMQ is running, restart
keepalived
:ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl restart keepalived"
If RabbitMQ is up, is it clustered?
ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo rabbitmqctl cluster_status"Restart RabbitMQ cluster if necessary:
ardana >
ansible-playbook -i hosts/verb_hosts rabbitmq-start.ymlCheck
Kafka
messaging:ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status kafka | head -5"Check the
Spark
framework:ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status spark-worker | head -8"ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status spark-master | head -8"If necessary, start
Spark
:ardana >
ansible-playbook -i hosts/verb_hosts spark-start.ymlardana >
ansible KEY-API -i hosts/verb_hosts -m shell -a \ "sudo systemctl start spark-master | head -8"Check
Zookeeper
centralized service:ardana >
ansible KEY-API -i hosts/verb_hosts \ -m shell -a "sudo systemctl status zookeeper| head -8"Check MariaDB:
ardana >
ansible KEY-API -i hosts/verb_hosts -m shell -a "sudo mysql -e 'show status;' | grep -e wsrep_incoming_addresses \ -e wsrep_local_state_comment "
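To run through the most common checks quickly, you can loop over the relevant services with an ad hoc Ansible command. This is a sketch only; it assumes the KEY-API group used in the examples above and that the unit names match those shown (systemctl is-active prints the state of each unit):
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > for svc in keepalived rabbitmq-server kafka zookeeper spark-worker spark-master; do \
           echo "== $svc =="; \
           ansible KEY-API -i hosts/verb_hosts -m shell -a "sudo systemctl is-active $svc"; \
         done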
15.2.3 Unplanned Control Plane Maintenance #
Unplanned maintenance tasks for controller nodes such as recovery from power failure.
15.2.3.1 Restarting Controller Nodes After a Reboot #
Steps to follow if one or more of your controller nodes lose network connectivity or power, which includes if the node is either rebooted or needs hardware maintenance.
When a controller node is rebooted, needs hardware maintenance, loses network connectivity or loses power, these steps will help you recover the node.
These steps may also be used if the Host Status (ping) alarm is triggered for one or more of your controller nodes.
15.2.3.1.1 Prerequisites #
The following conditions must be true in order to perform these steps successfully:
Each of your controller nodes should be powered on.
Each of your controller nodes should have network connectivity, verified by SSH connectivity from the Cloud Lifecycle Manager to them.
The operator who performs these steps will need access to the Cloud Lifecycle Manager.
15.2.3.1.2 Recovering the MariaDB Database #
The recovery process for your MariaDB database cluster will depend on how many of your controller nodes need to be recovered. We will cover two scenarios:
Scenario 1: Recovering one or two of your controller nodes but not the entire cluster
If you need to recover one or two of your controller nodes, but not the entire cluster, use these steps:
Ensure the controller nodes have power and are booted to the command prompt.
If the MariaDB service is not started, start it with this command:
ardana >
sudo service mysql startIf MariaDB fails to start, proceed to the next section which covers the bootstrap process.
Scenario 2: Recovering the entire controller cluster with the bootstrap playbook
If the scenario above failed or if you need to recover your entire control plane cluster, use the process below to recover the MariaDB database.
Make sure no
mysqld
daemon is running on any node in the cluster before you continue with the steps in this procedure. If there is amysqld
daemon running, then use the command below to shut down the daemon.ardana >
sudo systemctl stop mysqlIf the mysqld daemon does not go down following the service stop, then kill the daemon using
kill -9
before continuing.On the deployer node, execute the
galera-bootstrap.yml
playbook which will automatically determine the log sequence number, bootstrap the main node, and start the database cluster.ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.yml
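Once the bootstrap playbook completes, you can verify that all database nodes have rejoined the cluster. A minimal check, assuming the FND-MDB host group shown in the percona-status output elsewhere in this chapter:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible FND-MDB -i hosts/verb_hosts -m shell -a "sudo mysql -e 'show status;' | grep wsrep_cluster_size"   # expect the full number of database nodes, typically 3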
15.2.3.1.3 Restarting Services on the Controller Nodes #
From the Cloud Lifecycle Manager you should execute the
ardana-start.yml
playbook for each node that was brought
down so the services can be started back up.
If you have a dedicated (separate) Cloud Lifecycle Manager node you can use this syntax:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>
If you have a shared Cloud Lifecycle Manager/controller setup and need to restart
services on this shared node, you can use localhost
to
indicate the shared node, like this:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-start.yml --limit=<hostname_of_node>,localhost
If you leave off the --limit
switch, the playbook will
be run against all nodes.
15.2.3.1.4 Restart the Monitoring Agents #
As part of the recovery process, you should also restart the
monasca-agent
and these steps will show you how:
Log in to the Cloud Lifecycle Manager.
Stop the
monasca-agent
:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-agent-stop.ymlRestart the
monasca-agent
:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-agent-start.ymlYou can then confirm the status of the
monasca-agent
with this playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts monasca-agent-status.yml
15.2.3.2 Recovering the Control Plane #
If one or more of your controller nodes has experienced data or disk corruption due to power loss or hardware failure and you need to perform disaster recovery, there are several scenarios for recovering your cloud.
If you backed up the Cloud Lifecycle Manager manually after installation (see Chapter 38, Post Installation Tasks), you will have a backup copy of
/etc/group
. When recovering a Cloud Lifecycle Manager node, manually copy
the /etc/group
file from a backup of the old Cloud Lifecycle Manager.
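For example, if the backup archive is available as group.tar.gz (the file name is an assumption; use the actual name of your backup), the file can be restored with the same tar invocation used elsewhere in this chapter:
ardana > sudo tar -z --incremental --extract --ignore-zeros \
         --warning=none --overwrite --directory /etc/ -f group.tar.gz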
15.2.3.2.1 Point-in-Time MariaDB Database Recovery #
In this scenario, everything is still running (Cloud Lifecycle Manager, cloud controller nodes, and compute nodes) but you want to restore the MariaDB database to a previous state.
15.2.3.2.1.1 Restore MariaDB manually #
Follow this procedure to manually restore MariaDB:
Log in to the Cloud Lifecycle Manager.
Stop the MariaDB cluster:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-stop.ymlOn all of the nodes running the MariaDB service, which should be all of your controller nodes, run the following command to purge the old database:
ardana >
sudo rm -r /var/lib/mysql/*On the first node running the MariaDB service restore the backup with the command below. If you have already restored to a temporary directory, copy the files again.
ardana >
sudo cp -pr /tmp/mysql_restore/* /var/lib/mysqlIf you need to restore the files manually from SSH, follow these steps:
Log in to the first node running the MariaDB service.
Retrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example,
mydb.tar.gz
).ardana >
mkdir /tmp/mysql_restore; sudo tar -z --incremental \ --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \ -f mydb.tar.gzVerify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_info
Log back in to the Cloud Lifecycle Manager.
Start the MariaDB service.
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.ymlAfter approximately 10-15 minutes, the output of the
percona-status.yml
playbook should show all the MariaDB nodes in sync. MariaDB cluster status can be checked using this playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-status.ymlAn example output is as follows:
TASK: [FND-MDB | status | Report status of "{{ mysql_service }}"] ************* ok: [ardana-cp1-c1-m1-mgmt] => { "msg": "mysql is synced." } ok: [ardana-cp1-c1-m2-mgmt] => { "msg": "mysql is synced." } ok: [ardana-cp1-c1-m3-mgmt] => { "msg": "mysql is synced." }
15.2.3.2.1.2 Point-in-Time Cassandra Recovery #
A node may have been removed either due to an intentional action in the Cloud Lifecycle Manager Admin UI or as a result of a fatal hardware event that requires a server to be replaced. In either case, the entry for the failed or deleted node should be removed from Cassandra before a new node is brought up.
The following steps should be taken before enabling and deploying the replacement node.
Determine the IP address of the node that was removed or is being replaced.
On one of the functional Cassandra control plane nodes, log in as the
ardana
user.Run the command
nodetool status
to display a list of Cassandra nodes.If the node that has been removed (no IP address matches that of the removed node) is not in the list, skip the next step.
If the node that was removed is still in the list, copy its node ID.
Run the command
nodetool removenode ID
.
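For example (the Host ID below is a placeholder; copy the actual ID of the removed node from the nodetool status output):
ardana > nodetool status                                          # note the Host ID of the node that no longer exists
ardana > nodetool removenode 11111111-2222-3333-4444-555555555555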
After any obsolete node entries have been removed, the replacement node can be deployed as usual (for more information, see Section 15.1.2, “Planned Control Plane Maintenance”). The new Cassandra node will be able to join the cluster and replicate data.
For more information, please consult the Cassandra documentation.
15.2.3.2.2 Point-in-Time swift Rings Recovery #
In this situation, everything is still running (Cloud Lifecycle Manager, control plane nodes, and compute nodes) but you want to restore your swift rings to a previous state.
This process restores swift rings only, not swift data.
15.2.3.2.2.1 Restore from a swift backup #
Log in to the first swift Proxy (
SWF-PRX--first-member
) node.To find the first swift Proxy node:
On the Cloud Lifecycle Manager
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-status.yml \ --limit SWF-PRX--first-memberAt the end of the output, you will see something like the following example:
... Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:max-latency: 0.679254770279 (at 1529352109.66)' Jun 18 20:01:49 ardana-qe102-cp1-c1-m1-mgmt swiftlm-uptime-mon[3985]: 'uptime-mon - INFO : Metric:keystone-get-token:avg-latency: 0.679254770279 (at 1529352109.66)' PLAY RECAP ******************************************************************** ardana-qe102-cp1-c1-m1 : ok=12 changed=0 unreachable=0 failed=0
Find the first node name and its IP address. For example:
ardana >
cat /etc/hosts | grep ardana-qe102-cp1-c1-m1
Retrieve (
scp
) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.Create a temporary directory and extract the TAR archive (for example,
swring.tar.gz
).ardana >
mkdir /tmp/swift_builder_dir_restore; sudo tar -z \ --incremental --extract --ignore-zeros --warning=none --overwrite --directory \ /tmp/swift_builder_dir_restore/ -f swring.tar.gzLog in to the Cloud Lifecycle Manager.
Stop the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-stop.ymlLog back in to the first swift Proxy (
SWF-PRX--first-member
) node, which was determined in Step 1.Copy the restored files.
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/For example
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/Log back in to the Cloud Lifecycle Manager.
Reconfigure the swift service:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.yml
15.2.3.2.3 Point-in-time Cloud Lifecycle Manager Recovery #
In this scenario, everything is still running (Cloud Lifecycle Manager, controller nodes, and compute nodes) but you want to restore the Cloud Lifecycle Manager to a previous state.
Log in to the Cloud Lifecycle Manager.
Retrieve the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.
Extract the TAR archives for each of the seven locations.
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gzFor example, with a directory such as BACKUP_TARGET=
/etc/ssh/
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gzWith a file such as BACKUP_TARGET=
/etc/passwd
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gz
15.2.3.2.4 Cloud Lifecycle Manager Disaster Recovery #
In this scenario everything is still running (controller nodes and compute nodes) but you have lost either a dedicated Cloud Lifecycle Manager or a shared Cloud Lifecycle Manager/controller node.
To ensure that you use the same version of SUSE OpenStack Cloud that was previously loaded on your Cloud Lifecycle Manager, download and install the Cloud Lifecycle Manager software using the instructions from Section 15.5.2, “Installing the SUSE OpenStack Cloud Extension” before proceeding.
Prepare the Cloud Lifecycle Manager following the steps in the Before You
Start
section of Chapter 21, Installing with the Install UI.
15.2.3.2.4.1 Restore from a remote backup #
Log in to the Cloud Lifecycle Manager.
Retrieve (with
scp
) the Cloud Lifecycle Manager backups that were created with Section 17.3.1, “Cloud Lifecycle Manager Data Backup”. There are multiple backups; directories are handled differently than files.Extract the TAR archives for each of the seven locations.
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory RESTORE_TARGET -f BACKUP_TARGET.tar.gzFor example, with a directory such as BACKUP_TARGET=
/etc/ssh/
ardana >
sudo tar -z --incremental --extract --ignore-zeros \ --warning=none --overwrite --directory /etc/ssh/ -f ssh.tar.gzWith a file such as BACKUP_TARGET=
/etc/passwd
ardana >
sudo tar -z --incremental --extract --ignore-zeros --warning=none --overwrite --directory /etc/ -f passwd.tar.gzUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlWhen the Cloud Lifecycle Manager is restored, re-run the deployment to ensure the Cloud Lifecycle Manager is in the correct state:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit localhost
15.2.3.2.5 One or Two Controller Node Disaster Recovery #
This scenario makes the following assumptions:
Your Cloud Lifecycle Manager is still intact and working.
One or two of your controller nodes went down, but not the entire cluster.
The node needs to be rebuilt from scratch, not simply rebooted.
15.2.3.2.5.1 Steps to recovering one or two controller nodes #
Ensure that your node has power and all of the hardware is functioning.
Log in to the Cloud Lifecycle Manager.
Verify that all of the information in your
~/openstack/my_cloud/definition/data/servers.yml
file is correct for your controller node. You may need to replace the existing information if you had to replace either your entire controller node or just pieces of it.If you made changes to your
servers.yml
file then commit those changes to your local git:ardana >
git add -Aardana >
git commit -a -m "editing controller information"Run the configuration processor:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.ymlUpdate your deployment directory:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.ymlEnsure that Cobbler has the correct system information:
If you replaced your controller node with a completely new machine, you need to verify that Cobbler has the correct list of controller nodes:
ardana >
sudo cobbler system listRemove any controller nodes from Cobbler that no longer exist:
ardana >
sudo cobbler system remove --name=<node>Add the new node into Cobbler:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml
Then you can image the node:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node_name>NoteIf you do not know the
<node name>
already, you can get it by usingsudo cobbler system list
.Before proceeding, look at info/server_info.yml to see if the assignment of the node you have added is what you expect. It may not be, as nodes will not be numbered consecutively if any have previously been removed. To prevent loss of data, the configuration processor retains data about removed nodes and keeps their ID numbers from being reallocated. For more information about how this works, see Section 7.3.1, “Persisted Server Allocations”.
Run the
wipe_disks.yml
playbook to ensure the non-OS partitions on your nodes are completely wiped prior to continuing with the installation.ImportantThe
wipe_disks.yml
playbook is only meant to be run on systems immediately after runningbm-reimage.yml
. If used for any other situation, it may not wipe all of the expected partitions.ardana >
cd ~/scratch/ansible/next/ardana/ansible/ardana >
ansible-playbook -i hosts/verb_hosts wipe_disks.yml --limit <controller_node_hostname>Complete the rebuilding of your controller node with the two playbooks below:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml -e rebuild=True --limit=<controller_node_hostname>ardana >
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml -e rebuild=True --limit=<controller_node_hostname>
15.2.3.2.6 Three Control Plane Node Disaster Recovery #
In this scenario, all control plane nodes are down and need to be rebuilt or replaced. Restoring from a swift backup is not possible because swift is gone.
15.2.3.2.6.1 Restore from an SSH backup #
Log in to the Cloud Lifecycle Manager.
Deploy the control plane nodes, using the values for your control plane node hostnames:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts site.yml --limit \ CONTROL_PLANE_HOSTNAME1,CONTROL_PLANE_HOSTNAME2, \ CONTROL_PLANE_HOSTNAME3 -e rebuild=TrueFor example, if you were using the default values from the example model files, the command would look like this:
ardana >
ansible-playbook -i hosts/verb_hosts site.yml \ --limit ardana-ccp-c1-m1-mgmt,ardana-ccp-c1-m2-mgmt,ardana-ccp-c1-m3-mgmt \ -e rebuild=TrueNoteThe
-e rebuild=True
is only used on a single control plane node when there are other controllers available to pull configuration data from. This causes the MariaDB database to be reinitialized, which is the only choice if there are no additional control nodes.Log in to the Cloud Lifecycle Manager.
Stop MariaDB:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-stop.ymlRetrieve the MariaDB backup that was created with Section 17.3.2, “MariaDB Database Backup”.
Create a temporary directory and extract the TAR archive (for example,
mydb.tar.gz
).ardana >
mkdir /tmp/mysql_restore; sudo tar -z --incremental \ --extract --ignore-zeros --warning=none --overwrite --directory /tmp/mysql_restore/ \ -f mydb.tar.gzVerify that the files have been restored on the controller.
ardana >
sudo du -shx /tmp/mysql_restore/* 16K /tmp/mysql_restore/aria_log.00000001 4.0K /tmp/mysql_restore/aria_log_control 3.4M /tmp/mysql_restore/barbican 8.0K /tmp/mysql_restore/ceilometer 4.2M /tmp/mysql_restore/cinder 2.9M /tmp/mysql_restore/designate 129M /tmp/mysql_restore/galera.cache 2.1M /tmp/mysql_restore/glance 4.0K /tmp/mysql_restore/grastate.dat 4.0K /tmp/mysql_restore/gvwstate.dat 2.6M /tmp/mysql_restore/heat 752K /tmp/mysql_restore/horizon 4.0K /tmp/mysql_restore/ib_buffer_pool 76M /tmp/mysql_restore/ibdata1 128M /tmp/mysql_restore/ib_logfile0 128M /tmp/mysql_restore/ib_logfile1 12M /tmp/mysql_restore/ibtmp1 16K /tmp/mysql_restore/innobackup.backup.log 313M /tmp/mysql_restore/keystone 716K /tmp/mysql_restore/magnum 12M /tmp/mysql_restore/mon 8.3M /tmp/mysql_restore/monasca_transform 0 /tmp/mysql_restore/multi-master.info 11M /tmp/mysql_restore/mysql 4.0K /tmp/mysql_restore/mysql_upgrade_info 14M /tmp/mysql_restore/nova 4.4M /tmp/mysql_restore/nova_api 14M /tmp/mysql_restore/nova_cell0 3.6M /tmp/mysql_restore/octavia 208K /tmp/mysql_restore/opsconsole 38M /tmp/mysql_restore/ovs_neutron 8.0K /tmp/mysql_restore/performance_schema 24K /tmp/mysql_restore/tc.log 4.0K /tmp/mysql_restore/test 8.0K /tmp/mysql_restore/winchester 4.0K /tmp/mysql_restore/xtrabackup_galera_infoLog back in to the first controller node and move the following files:
ardana >
ssh FIRST_CONTROLLER_NODEardana >
sudo suroot #
rm -rf /var/lib/mysql/*root #
cp -pr /tmp/mysql_restore/* /var/lib/mysql/Log back in to the Cloud Lifecycle Manager and bootstrap MariaDB:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts galera-bootstrap.ymlVerify the status of MariaDB:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts percona-status.yml
15.2.3.2.7 swift Rings Recovery #
To recover the swift rings in the event of a disaster, follow the procedure that applies to your situation: either recover the rings with the manual swift backup and restore or use the SSH backup.
15.2.3.2.7.1 Restore from the swift deployment backup #
15.2.3.2.7.2 Restore from the SSH backup #
If you have lost all the system disks of all object nodes and the swift proxy nodes are corrupted, you can recover the rings from a copy of the swift rings that was backed up previously. The swift data is still available (the disks used by swift still need to be accessible).
Recover the rings with these steps.
Log in to a swift proxy node.
Become root:
ardana >
sudo suCreate the temporary directory for your restored files:
root #
mkdir /tmp/swift_builder_dir_restore/Retrieve (
scp
) the swift backup that was created with Section 17.3.3, “swift Ring Backup”.Create a temporary directory and extract the TAR archive (for example,
swring.tar.gz
).ardana >
mkdir /tmp/swift_builder_dir_restore; sudo tar -z \ --incremental --extract --ignore-zeros --warning=none --overwrite --directory \ /tmp/swift_builder_dir_restore/ -f swring.tar.gzYou now have the swift rings in
/tmp/swift_builder_dir_restore/
If the SWF-PRX--first-member is already deployed, copy the contents of the restored directory (
/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
) to/etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
on the SWF-PRX--first-member.Then from the Cloud Lifecycle Manager run:
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/For example
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts swift-reconfigure.ymlIf the SWF-ACC--first-member is not deployed, from the Cloud Lifecycle Manager run these playbooks:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts guard-deployment.ymlardana >
ansible-playbook -i hosts/verb_hosts osconfig-run.yml --limit <SWF-ACC[0]-hostname>Copy the contents of the restored directory (
/tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
) to/etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
on the SWF-ACC[0].Create the directories:
/etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/* \ /etc/swiftlm/CLOUD_NAME/CONTROL_PLANE_NAME/builder_dir/For example:
ardana >
sudo cp -pr /tmp/swift_builder_dir_restore/entry-scale-kvm/control-plane-1/builder_dir/* \ /etc/swiftlm/entry-scale-kvm/control-plane-1/builder_dir/From the Cloud Lifecycle Manager, run the
ardana-deploy.yml
playbook:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-deploy.yml
15.2.4 Unplanned Compute Maintenance #
Unplanned maintenance tasks including recovering compute nodes.
15.2.4.1 Recovering a Compute Node #
If one or more of your compute nodes has experienced an issue such as power loss or hardware failure, then you need to perform disaster recovery. Here we provide different scenarios and how to resolve them to get your cloud repaired.
Typical scenarios in which you will need to recover a compute node include the following:
The node has failed, either because it has shut down, has a hardware failure, or for another reason.
The node is working but the
nova-compute
process is not responding; instances are still running but you cannot manage them (for example, to delete or reboot them, or to attach/detach volumes).The node is fully operational, but monitoring indicates a potential issue (such as disk errors) that requires downtime to fix.
15.2.4.1.1 What to do if your compute node is down #
Compute node has power but is not powered on
If your compute node has power but is not powered on, use these steps to restore the node:
Log in to the Cloud Lifecycle Manager.
Obtain the name for your compute node in Cobbler:
ardana >
sudo cobbler system listPower the node back up with this playbook, specifying the node name from Cobbler:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>
Compute node is powered on but services are not running on it
If your compute node is powered on but you are unsure if services are running, you can use these steps to ensure that they are running:
Log in to the Cloud Lifecycle Manager.
Confirm the status of the compute service on the node with this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts nova-status.yml --limit <hostname>You can start the compute service on the node with this playbook:
ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts nova-start.yml --limit <hostname>
15.2.4.1.2 Scenarios involving disk failures on your compute nodes #
Your compute nodes should have a minimum of two disks, one that is used for
the operating system and one that is used as the data disk. These are
defined during the installation of your cloud, in the
~/openstack/my_cloud/definition/data/disks_compute.yml
file on the Cloud Lifecycle Manager. The data disk(s) are where the
nova-compute
service lives. Recovery scenarios will
depend on whether one or the other, or both, of these disks experienced
failures.
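If you are unsure which physical disk holds the operating system and which are the data disks on a given compute node, a quick check with lsblk (run on the compute node itself) shows the disks, their partitions, and mount points; the disk carrying / is the operating system disk:
ardana > lsblk -o NAME,SIZE,TYPE,MOUNTPOINT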
If your operating system disk failed but the data disk(s) are okay
If you have had issues with the physical volume that hosts your operating system, ensure that the physical volume is restored, and then use the following steps to restore the operating system:
Log in to the Cloud Lifecycle Manager.
Source the administrator credentials:
ardana >
source ~/service.osrcObtain the hostname for your compute node, which you will use in subsequent commands when
<hostname>
is requested:ardana >
openstack host list | grep computeObtain the status of the
nova-compute
service on that node:ardana >
openstack compute service list --host <hostname>You will likely want to disable provisioning on that node to ensure that
nova-scheduler
does not attempt to place any additional instances on the node while you are repairing it:ardana >
openstack compute service set --disable --reason "node is being rebuilt" <hostname>Obtain the status of the instances on the compute node:
ardana >
openstack server list --host <hostname> --all-tenantsBefore continuing, you should either evacuate all of the instances off your compute node or shut them down. If the instances are booted from volumes, then you can use the
nova evacuate
ornova host-evacuate
commands to do this. See Section 15.1.3.3, “Live Migration of Instances” for more details on how to do this.If your instances are not booted from volumes, you will need to stop the instances using the
openstack server stop
command. Because thenova-compute
service is not running on the node you will not see the instance status change, but theTask State
for the instance should change topowering-off
.ardana >
openstack server stop <instance_uuid>Verify the status of each of the instances using these commands, verifying the
Task State
statespowering-off
:ardana >
openstack server list --host <hostname> --all-tenantsardana >
openstack server show <instance_uuid>At this point you should be ready with a functioning hard disk in the node that you can use for the operating system. Follow these steps:
Obtain the name for your compute node in Cobbler, which you will use in subsequent commands when <node_name> is requested:
ardana > sudo cobbler system list
Run the following playbook, ensuring that you specify only your UEFI SLES nodes using the nodelist. This playbook will reconfigure Cobbler for the nodes listed.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook prepare-sles-grub2.yml -e nodelist=node1[,node2,node3]
Reimage the compute node with this playbook:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost bm-reimage.yml -e nodelist=<node name>
Once reimaging is complete, use the following playbook to configure the operating system and start up services:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts site.yml --limit <hostname>
You should then ensure any instances on the recovered node are in an ACTIVE state. If they are not, use the openstack server start command to bring them to the ACTIVE state:
ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server start <instance_uuid>
Re-enable provisioning:
ardana > openstack compute service set --enable <hostname>
Start any instances that you had stopped previously:
ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server start <instance_uuid>
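As a final sanity check after re-enabling provisioning, you may want to confirm that the nova-compute service on the recovered node is reported as enabled and up. This simply reuses the commands shown earlier in this procedure:
ardana > source ~/service.osrc
ardana > openstack compute service list --host <hostname>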
If your data disk(s) failed but the operating system disk is okay OR if all drives failed
In this scenario your instances on the node are lost. First, follow steps 1 to 5 and 8 to 9 in the previous scenario.
After that is complete, use the openstack server rebuild
command to respawn your instances, which will also ensure that they receive
the same IP address:
ardana > openstack server list --host <hostname> --all-tenants
ardana > openstack server rebuild <instance_uuid>
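If the node hosted many instances, rebuilding them one at a time can be tedious. The following shell sketch is illustrative only and not part of the documented procedure: it lists the instance IDs on the host and rebuilds each one in turn. Review the list before running it, because openstack server rebuild resets each instance's disk to its base image:
ardana > for id in $(openstack server list --host <hostname> --all-tenants -f value -c ID); do openstack server rebuild "$id"; done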
15.2.5 Unplanned Storage Maintenance #
Unplanned maintenance tasks for storage nodes.
15.2.5.1 Unplanned swift Storage Maintenance #
Unplanned maintenance tasks for swift storage nodes.
15.2.5.1.1 Recovering a Swift Node #
If one or more of your swift Object or PAC nodes has experienced an issue, such as power loss or hardware failure, and you need to perform disaster recovery, the following scenarios describe how to resolve them and repair your cloud.
Typical scenarios in which you will need to repair a swift object or PAC node include:
The node has either shut down or been rebooted.
The entire node has failed and needs to be replaced.
A disk drive has failed and must be replaced.
15.2.5.1.1.1 What to do if your Swift host has shut down or rebooted #
If your swift host has power available but is not powered on, you can power it back on from the Cloud Lifecycle Manager with the following steps:
Log in to the Cloud Lifecycle Manager.
Obtain the name for your swift host in Cobbler:
sudo cobbler system list
Power the node back up with this playbook, specifying the node name from Cobbler:
cd ~/openstack/ardana/ansible ansible-playbook -i hosts/localhost bm-power-up.yml -e nodelist=<node name>
Once the node is booted up, swift should start automatically. You can verify this with this playbook:
cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts swift-status.yml
Any alarms that have triggered due to the host going down should clear within 10 minutes. See Section 18.1.1, “Alarm Resolution Procedures” if further assistance is needed with the alarms.
15.2.5.1.1.2 How to replace your Swift node #
If your swift node has irreparable damage and you need to replace the entire node in your environment, see Section 15.1.5.1.5, “Replacing a swift Node” for details on how to do this.
15.2.5.1.1.3 How to replace a hard disk in your Swift node #
If you need to do a hard drive replacement in your swift node, see Section 15.1.5.1.6, “Replacing Drives in a swift Node” for details on how to do this.
15.3 Cloud Lifecycle Manager Maintenance Update Procedure #
Ensure that the update repositories have been properly set up on all nodes. The easiest way to provide the required repositories on the Cloud Lifecycle Manager Server is to set up an SMT server as described in Chapter 16, Installing and Setting Up an SMT Server on the Cloud Lifecycle Manager server (Optional). Alternatives to setting up an SMT server are described in Chapter 17, Software Repository Setup.
Read the Release Notes for the security and maintenance updates that will be installed.
Have a backup strategy in place. For further information, see Chapter 17, Backup and Restore.
Ensure that you have a known starting state by resolving any unexpected alarms.
Determine if you need to reboot your cloud after updating the software. Rebooting is highly recommended to ensure that all affected services are restarted. Reboot may be required after installing Linux kernel updates, but it can be skipped if the impact on running services is non-existent or well understood.
Review steps in Section 15.1.4.1, “Adding a Network Node” and Section 15.1.1.2, “Rolling Reboot of the Cloud” to minimize the impact on existing workloads. These steps are critical when the neutron services are not provided via external SDN controllers.
Before the update, prepare your workloads by consolidating all of your instances to one or more Compute Nodes. After the update is complete on the evacuated Compute Nodes, reboot them and move the instances from the remaining Compute Nodes to the newly booted ones. Then, update the remaining Compute Nodes.
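For example, to live-migrate a single instance off a Compute Node that you are about to update, you could use a command along the following lines. This is a sketch only: the instance UUID and target host are placeholders, and the exact flags differ between client versions (older clients use --live <target_host>, newer ones use --live-migration with --host). See Section 15.1.3.3, “Live Migration of Instances” for the fully documented procedure.
ardana > openstack server migrate --live <target_host> <instance_uuid>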
15.3.1 Performing the Update #
Before you proceed, get the status of all your services:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
If the status check returns an error for a specific service, run the
SERVICE-reconfigure.yml
playbook. Then run the
SERVICE-status.yml
playbook to check that the issue has been resolved.
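For example, if the status check reports a problem with RabbitMQ, the corresponding playbooks would be rabbitmq-reconfigure.yml and rabbitmq-status.yml; this is shown purely as an illustration of the SERVICE-*.yml naming convention described above:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-reconfigure.yml
ardana > ansible-playbook -i hosts/verb_hosts rabbitmq-status.yml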
Update and reboot all nodes in the cloud one by one. Start with the deployer node, then follow the order recommended in Section 15.1.1.2, “Rolling Reboot of the Cloud”.
The described workflow also covers cases in which the deployer node is itself provisioned as an active cloud node.
To minimize the impact on the existing workloads, the node should first be prepared for an update and a subsequent reboot by following the steps leading up to stopping services listed in Section 15.1.1.2, “Rolling Reboot of the Cloud”, such as migrating singleton agents on Control Nodes and evacuating Compute Nodes. Do not stop services running on the node, as they need to be running during the update.
Install all available security and maintenance updates on the deployer using the zypper patch command.
Initialize the Cloud Lifecycle Manager and prepare the update playbooks.
Run the ardana-init initialization script to update the deployer.
Redeploy cobbler:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost cobbler-deploy.yml
Run the configuration processor:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost config-processor-run.yml
Update your deployment directory:
ardana > cd ~/openstack/ardana/ansible
ardana > ansible-playbook -i hosts/localhost ready-deployment.yml
Installation and management of updates can be automated with the following playbooks:
ardana-update-pkgs.yml
ardana-update.yml
ardana-update-status.yml
ardana-reboot.yml
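Taken together, a typical per-node pass through these playbooks might look like the following sketch, where comp0001-mgmt is a hypothetical node name and the final reboot step is only run if the status playbook indicates that a reboot is required:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit comp0001-mgmt
ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml --limit comp0001-mgmt
ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit comp0001-mgmt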
Confirm version changes by running hostnamectl before and after running the ardana-update-pkgs.yml playbook on each node.
ardana > hostnamectl
Notice that Boot ID: and Kernel: have changed.
By default, the ardana-update-pkgs.yml playbook will install patches and updates that do not require a system reboot. Patches and updates that do require a system reboot will be installed later in this process.
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME
There may be a delay in the playbook output at the following task while updates are pulled from the deployer.
TASK: [ardana-upgrade-tools | pkg-update | Download and install package updates] ***
After running the ardana-update-pkgs.yml playbook to install patches and updates not requiring reboot, check the status of remaining tasks.
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit TARGET_NODE_NAME
To install patches that require reboot, run the ardana-update-pkgs.yml playbook with the parameter -e zypper_update_include_reboot_patches=true.
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit TARGET_NODE_NAME \
  -e zypper_update_include_reboot_patches=true
If the output of ardana-update-pkgs.yml indicates that a reboot is required, run ardana-reboot.yml after completing the ardana-update.yml step below. Running ardana-reboot.yml will cause cloud service interruption.
Note: To update a single package (for example, to apply a PTF on a single node or on all nodes), run zypper update PACKAGE. To install all package updates, use zypper update.
Update services:
ardana > ansible-playbook -i hosts/verb_hosts ardana-update.yml \
  --limit TARGET_NODE_NAME
If indicated by the ardana-update-status.yml playbook, reboot the node. There may also be a warning to reboot after running the ardana-update-pkgs.yml playbook.
This check can be overridden by setting the SKIP_UPDATE_REBOOT_CHECKS environment variable or the skip_update_reboot_checks Ansible variable.
ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml \
  --limit TARGET_NODE_NAME
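As an illustration of the override mentioned above, either of the following forms should work; the environment variable is assumed here to accept any non-empty value. Only skip these safety checks when you are certain no reboot is pending:
ardana > SKIP_UPDATE_REBOOT_CHECKS=1 ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit TARGET_NODE_NAME
ardana > ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit TARGET_NODE_NAME -e skip_update_reboot_checks=true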
To recheck pending system reboot status at a later time, run the following commands:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit ardana-cp1-c1-m2 -e update_status_var=system-reboot
The pending system reboot status can be reset by running:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml \
  --limit ardana-cp1-c1-m2 \
  -e update_status_var=system-reboot \
  -e update_status_reset=true
Multiple servers can be patched at the same time with ardana-update-pkgs.yml by setting the option -e skip_single_host_checks=true.
Warning: When patching multiple servers at the same time, take care not to compromise HA capability by updating an entire cluster (controller, database, monitor, logging) at the same time.
If multiple nodes are specified on the command line (with --limit), services on those servers will experience outages as the packages are shut down and updated. If you plan to update a Compute Node (or a group of Compute Nodes), migrate the workload off it first. The same applies to Control Nodes: move singleton services off of the control plane node that will be updated.
Important: Do not reboot all of your controllers at the same time.
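For illustration, the following sketch patches the packages on two hypothetical Compute Nodes in one pass, assuming their workloads have already been migrated away as described above; the node names are placeholders:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit comp0001-mgmt,comp0002-mgmt \
  -e skip_single_host_checks=true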
When the node comes up after the reboot, run the spark-start.yml playbook:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-start.yml
Verify that Spark is running on all Control Nodes:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts spark-status.yml
After all nodes have been updated, check the status of all services:
ardana > cd ~/scratch/ansible/next/ardana/ansible
ardana > ansible-playbook -i hosts/verb_hosts ardana-status.yml
15.3.2 Summary of the Update Playbooks #
- ardana-update-pkgs.yml
This top-level playbook automates the installation of package updates on a single node. It also works for multiple nodes, if the single-node restriction is overridden by setting the SKIP_SINGLE_HOST_CHECKS environment variable or the corresponding Ansible variable, for example:
ardana-update-pkgs.yml -e skip_single_host_checks=true
Provide the following -e options to modify default behavior:
zypper_update_method (default: patch)
patch installs all patches for the system. Patches are intended for specific bug and security fixes.
update installs all packages that have a higher version number than the installed packages.
dist-upgrade replaces each installed package with the version from the repository and deletes packages not available in the repositories.
zypper_update_repositories (default: all) restricts the list of repositories used.
zypper_update_gpg_checks (default: true) enables GPG checks. If set to true, checks if packages are correctly signed.
zypper_update_licenses_agree (default: false) automatically agrees with licenses. If set to true, zypper automatically accepts third-party licenses.
zypper_update_include_reboot_patches (default: false) includes patches that require reboot. Setting this to true installs patches that require a reboot (such as kernel or glibc updates).
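As an illustrative combination of these options (not a recommendation for every environment), the following sketch performs a full package update, including reboot-requiring patches, against a single hypothetical node:
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml \
  --limit comp0001-mgmt \
  -e zypper_update_method=update \
  -e zypper_update_include_reboot_patches=true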
- ardana-update.yml
This top-level playbook automates the update of all the services. It runs on all nodes by default, or can be limited to a single node by adding --limit nodename.
- ardana-reboot.yml
Top-level playbook that automates the steps required to reboot a node. It includes pre-boot and post-boot phases, which can be extended to include additional checks.
- ardana-update-status.yml
This playbook can be used to check or reset the update-related status variables maintained by the update playbooks. The main reason for having this mechanism is to allow the update status to be checked at any point during the update procedure. It is also used heavily by the automation scripts to orchestrate installing maintenance updates on multiple nodes.
15.4 Upgrading Cloud Lifecycle Manager 8 to Cloud Lifecycle Manager 9 #
Before undertaking the upgrade from SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you need to ensure that your existing SUSE OpenStack Cloud 8 Cloud Lifecycle Manager installation is up to date by following the maintenance update procedure at https://documentation.suse.com/hpe-helion/8/html/hpe-helion-openstack-clm-all/system-maintenance.html#maintenance-update.
Ensure you review the following resources:
To confirm that all nodes have been successfully updated with no pending
actions, run the ardana-update-status.yml
playbook on
the Cloud Lifecycle Manager deployer node as follows:
ardana > cd scratch/ansible/next/ardana/ansible/
ardana > ansible-playbook -i hosts/verb_hosts ardana-update-status.yml
Ensure that all nodes have been updated, and that there are no pending update actions remaining to be completed. In particular, ensure that any nodes that need to be rebooted have been rebooted, using the documented reboot procedure.
Once all nodes have been successfully updated, and there are no pending update actions remaining, you should be able to run the
ardana-pre-upgrade-validations.sh
script, as follows:ardana >
cd scratch/ansible/next/ardana/ansible/ardana >
./ardana-pre-upgrade-validations.sh ~/scratch/ansible/next/ardana/ansible ~/scratch/ansible/next/ardana/ansible PLAY [Initialize an empty list of msgs] *************************************** TASK: [set_fact ] ************************************************************* ok: [localhost] ... PLAY RECAP ******************************************************************** ... localhost : ok=8 changed=5 unreachable=0 failed=0 msg: Please refer to /var/log/ardana-pre-upgrade-validations.log for the results of this run. Ensure that any messages in the file that have the words FAIL or WARN are resolved.The last line of output from the
ardana-pre-upgrade-validations.sh
script will tell you the name of its log file—in this case,/var/log/ardana-pre-upgrade-validations.log
. If you look at the log file, you will see content similar to the following:ardana >
sudo cat /var/log/ardana-pre-upgrade-validations.log ardana-cp-dbmqsw-m1************************************************************* NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. ardana-cp-dbmqsw-m2************************************************************* NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. ardana-cp-dbmqsw-m3************************************************************* NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk1 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk2 is smaller than SLE 12 SP4 recommended 512. Some recommended XFS data integrity features may not be available after upgrade. ardana-cp-mml-m1**************************************************************** SUCCESS: Keystone V2 ==> V3 API config changes detected. ardana-cp-mml-m2**************************************************************** SUCCESS: Keystone V2 ==> V3 API config changes detected. ardana-cp-mml-m3**************************************************************** SUCCESS: Keystone V2 ==> V3 API config changes detected. localhost***********************************************************************The report states the following:
- SUCCESS: Keystone V2 ==> V3 API config changes detected.
This check confirms that your cloud has been updated with the necessary changes such that all services will be using Keystone V3 API. This means that there should be minimal interruption of service during the upgrade. This is important because the Keystone V2 API has been removed in SUSE OpenStack Cloud 9.
- NOTE: pre-upgrade-swift-checks: Swift XFS inode size of 256 for /srv/node/disk0 is smaller than the SUSE Linux Enterprise 12 SP4 recommendation of 512. Some recommended XFS data integrity features may not be available after upgrade.
This check will only report something if you have local swift configured and it is formatted with the SUSE Linux Enterprise 12 SP3 default XFS inode size of 256. In SUSE Linux Enterprise 12 SP4, the default XFS inode size for a newly-formatted XFS file system has been increased to 512, to allow room for enabling some additional XFS data-integrity features by default.
There will be no loss of swift functionality after the upgrade. The difference is that some additional XFS features will not be available on file systems that were formatted under SUSE Linux Enterprise 12 SP3 or earlier. These XFS features aid in the detection of, and recovery from, data corruption. They are enabled by default for XFS file systems formatted under SUSE Linux Enterprise 12 SP4.
In addition to the automated upgrade checks above, there are some checks that should be performed manually.
For each network interface device specified in the input model under ~/openstack/my_cloud/definition, ensure that there is only one untagged VLAN. The SUSE OpenStack Cloud 9 Cloud Lifecycle Manager configuration processor will fail with an error if it detects this problem during the upgrade, so address this problem before starting the upgrade process.
If the deployer node is not a standalone system, but is instead co-located with the DB services, this can lead to potentially longer service disruptions during the upgrade process. To determine if this is the case, check if the deployer node (OPS-LM--first-member) is a member of the database nodes (FND-MDB). You can do this with the following command:
ardana > cd scratch/ansible/next/ardana/ansible/
ardana > ansible -i hosts/verb_hosts 'FND-MDB:&OPS-LM--first-member' --list-hosts
If the output is:
No hosts matched
Then the deployer node is not co-located with the database nodes. Otherwise, if the command reports a hostname, then there may be additional interruptions to the database services during the upgrade.
Similarly, if the deployer is co-located with the database services, and you are also trying to run a local SMT service on the deployer node, you will run into issues trying to configure the SMT to enable and mirror the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.
In such cases, it is recommended that you run the SMT services on a different node, and NFS-import the
/srv/www/htdocs/repo
onto the deployer node, instead of trying to run the SMT services locally.
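A minimal sketch of that NFS import, assuming a hypothetical SMT host smt.example.com exporting /srv/www/htdocs/repo and an existing mount point on the deployer node (add a matching /etc/fstab entry if the mount should persist across reboots):
ardana > sudo mount -t nfs smt.example.com:/srv/www/htdocs/repo /srv/www/htdocs/repo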
The integrated backup solution in SUSE OpenStack Cloud 8 Cloud Lifecycle Manager, freezer, is no longer available in SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. Therefore, we recommend doing a manual backup to a server that is not a member of the cloud, as per Chapter 17, Backup and Restore.
15.4.1 Migrating the Deployer Node Packages #
The upgrade process first migrates the SUSE OpenStack Cloud 8 Cloud Lifecycle Manager deployer node to SUSE Linux Enterprise 12 SP4 and the SOC 9 Cloud Lifecycle Manager packages.
If the deployer node is not a dedicated node, but is instead a member of one of the cloud-control planes, then some services may restart with the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM versions of the software during the migration. This may mean that:
Some services fail to restart. This will be resolved when the appropriate SUSE OpenStack Cloud 9 configuration changes are applied by running the
ardana-upgrade.yml
playbook, later during the upgrade process.Other services may log excessive warnings about connectivity issues and backwards-compatibility warnings. This will be resolved when the relevant services are upgraded during the
ardana-upgrade.yml
playbook run.
In order to upgrade the deployer node to be based on SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, you first need to migrate the system to SUSE Linux Enterprise 12 SP4 with the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager product installed.
The process for migrating the deployer node differs somewhat, depending on whether your deployer node is registered with the SUSE Customer Center (or an SMT mirror), versus using locally-maintained repositories available at the relevant locations.
If your deployer node is registered with the SUSE Customer Center or an SMT, the migration
process requires the zypper-migration-plugin
package to be installed.
If you are using an SMT server to mirror the relevant repositories, then you need to enable mirroring of the relevant repositories. See Section 16.3, “Setting up Repository Mirroring on the SMT Server” for more information.
Ensure that the mirroring process has completed before proceeding.
Ensure that the
zypper-migration-plugin
package is installed; if not, install it:ardana >
sudo zypper install zypper-migration-plugin Refreshing service 'SMT-http_smt_example_com'. Loading repository data... Reading installed packages... 'zypper-migration-plugin' is already installed. No update candidate for 'zypper-migration-plugin-0.10-12.4.noarch'. The highest available version is already installed. Resolving package dependencies... Nothing to do.De-register the SUSE Linux Enterprise Server LTSS 12 SP3 x86_64 extension (if enabled):
ardana >
sudo SUSEConnect --status-text Installed Products: ------------------------------------------ SUSE Linux Enterprise Server 12 SP3 LTSS (SLES-LTSS/12.3/x86_64) Registered ------------------------------------------ SUSE Linux Enterprise Server 12 SP3 (SLES/12.3/x86_64) Registered ------------------------------------------ SUSE OpenStack Cloud 8 (suse-openstack-cloud/8/x86_64) Registered ------------------------------------------ ardana > sudo SUSEConnect -d -p SLES-LTSS/12.3/x86_64 Deregistering system from registration proxy https://smt.example.com/ Deactivating SLES-LTSS 12.3 x86_64 ... -> Refreshing service ... -> Removing release package ... ardana > sudo SUSEConnect --status-text Installed Products: ------------------------------------------ SUSE Linux Enterprise Server 12 SP3 (SLES/12.3/x86_64) Registered ------------------------------------------ SUSE OpenStack Cloud 8 (suse-openstack-cloud/8/x86_64) Registered ------------------------------------------Disable any other SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories. The zypper migration process should detect and disable most of these automatically, but in some cases it may not catch all of them, which can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the
/srv/www/suse-12.3
directory or the SUSE-12-4 alias underhttp://localhost:79/
, you could use the following commands:ardana >
zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2 PTF SLES12-SP3-LTSS-Updates SLES12-SP3-Pool SLES12-SP3-Updates SUSE-OpenStack-Cloud-8-Pool SUSE-OpenStack-Cloud-8-Updates ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done Repository 'PTF' has been successfully disabled. Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled. Repository 'SLES12-SP3-Pool' has been successfully disabled. Repository 'SLES12-SP3-Updates' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3 (a new one, based on SUSE Linux Enterprise 12 SP4, will be created during the upgrade process):
ardana >
zypper repos | grep PTF 2 | PTF | PTF | No | (r ) Yes | Yes ardana > sudo zypper removerepo PTF Removing repository 'PTF' ..............................................................................................[done] Repository 'PTF' has been removed.Remove the Cloud media repository (if defined):
ardana >
zypper repos | grep '[|] Cloud ' 1 | Cloud | SUSE OpenStack Cloud 8 DVD #1 | Yes | (r ) Yes | No ardana > sudo zypper removerepo Cloud Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done] Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.Run the
zypper migration
command, which should offer a single choice: namely, to upgrade to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager. You need to accept the offered choice, then answer yes to any prompts to disable obsoleted repositories. At that point, the zypper migration command will run zypper dist-upgrade, which will prompt you to agree with the proposed package changes. Finally, you will need to agree to any new licenses. After this, the package upgrade of the deployer node will proceed. The output of the running zypper migration command should look something like the following:
ardana >
sudo zypper migration Executing 'zypper refresh' Repository 'SLES12-SP3-Pool' is up to date. Repository 'SLES12-SP3-Updates' is up to date. Repository 'SLES12-SP3-Pool' is up to date. Repository 'SLES12-SP3-Updates' is up to date. Repository 'SUSE-OpenStack-Cloud-8-Pool' is up to date. Repository 'SUSE-OpenStack-Cloud-8-Updates' is up to date. Repository 'OpenStack-Cloud-8-Pool' is up to date. Repository 'OpenStack-Cloud-8-Updates' is up to date. All repositories have been refreshed. Executing 'zypper --no-refresh patch-check --updatestack-only' Loading repository data... Reading installed packages... 0 patches needed (0 security patches) Available migrations: 1 | SUSE Linux Enterprise Server 12 SP4 x86_64 SUSE OpenStack Cloud 9 x86_64 [num/q]: 1 Executing 'snapper create --type pre --cleanup-algorithm=number --print-number --userdata important=yes --description 'before online migration'' The config 'root' does not exist. Likely snapper is not configured. See 'man snapper' for further instructions. Upgrading product SUSE Linux Enterprise Server 12 SP4 x86_64. Found obsolete repository SLES12-SP3-Updates Disable obsolete repository SLES12-SP3-Updates [y/n] (y): y ... disabling. Found obsolete repository SLES12-SP3-Pool Disable obsolete repository SLES12-SP3-Pool [y/n] (y): y ... disabling. Upgrading product SUSE OpenStack Cloud 9 x86_64. Found obsolete repository OpenStack-Cloud-8-Pool Disable obsolete repository OpenStack-Cloud-8-Pool [y/n] (y): y ... disabling. Executing 'zypper --releasever 12.4 ref -f' Warning: Enforced setting: $releasever=12.4 Forcing raw metadata refresh Retrieving repository 'SLES12-SP4-Pool' metadata .......................................................................[done] Forcing building of repository cache Building repository 'SLES12-SP4-Pool' cache ............................................................................[done] Forcing raw metadata refresh Retrieving repository 'SLES12-SP4-Updates' metadata ....................................................................[done] Forcing building of repository cache Building repository 'SLES12-SP4-Updates' cache .........................................................................[done] Forcing raw metadata refresh Retrieving repository 'SUSE-OpenStack-Cloud-9-Pool' metadata ...........................................................[done] Forcing building of repository cache Building repository 'SUSE-OpenStack-Cloud-9-Pool' cache ................................................................[done] Forcing raw metadata refresh Retrieving repository 'SUSE-OpenStack-Cloud-9-Updates' metadata ........................................................[done] Forcing building of repository cache Building repository 'SUSE-OpenStack-Cloud-9-Updates' cache .............................................................[done] Forcing raw metadata refresh Retrieving repository 'OpenStack-Cloud-8-Updates' metadata .............................................................[done] Forcing building of repository cache Building repository 'OpenStack-Cloud-8-Updates' cache ..................................................................[done] All repositories have been refreshed. Executing 'zypper --releasever 12.4 --no-refresh dist-upgrade --no-allow-vendor-change ' Warning: Enforced setting: $releasever=12.4 Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command. 
Loading repository data... Reading installed packages... Computing distribution upgrade... ... 525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch. Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used. Continue? [y/n/...? shows all options] (y): y ... dracut: *** Generating early-microcode cpio image *** dracut: *** Constructing GenuineIntel.bin **** dracut: *** Store current command line parameters *** dracut: Stored kernel commandline: dracut: rd.lvm.lv=ardana-vg/root dracut: root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' *** dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done *** Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script: Refresh script btrfs-scrub.sh for monthly Refresh script btrfs-defrag.sh for none Refresh script btrfs-balance.sh for weekly Refresh script btrfs-trim.sh for none There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.
If your deployer node is not registered with the SUSE Customer Center or an SMT, and instead uses locally-maintained repositories, you need to manually migrate the system using zypper dist-upgrade, according to the following steps:
Disable any SUSE Linux Enterprise 12 SP3 or SUSE OpenStack Cloud 8 Cloud Lifecycle Manager-related repositories. Leaving the SUSE Linux Enterprise 12 SP3 and/or SUSE OpenStack Cloud (or HOS) 8 Cloud Lifecycle Manager-related repositories enabled can lead to a minor disruption later during the upgrade procedure. For example, to disable any repositories served from the
/srv/www/suse-12.3
directory, or the SUSE-12-4 alias underhttp://localhost:79/
, use the following commands:ardana >
zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2 PTF SLES12-SP3-LTSS-Updates SLES12-SP3-Pool SLES12-SP3-Updates SUSE-OpenStack-Cloud-8-Pool SUSE-OpenStack-Cloud-8-Updates ardana > for repo in $(zypper repos --show-enabled-only --uri | grep -e dir:///srv/www/suse-12.3 -e http://localhost:79/SUSE-12-4/ | cut -d '|' -f 2); do sudo zypper modifyrepo --disable "${repo}"; done Repository 'PTF' has been successfully disabled. Repository 'SLES12-SP3-LTSS-Updates' has been successfully disabled. Repository 'SLES12-SP3-Pool' has been successfully disabled. Repository 'SLES12-SP3-Updates' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Pool' has been successfully disabled. Repository 'SUSE-OpenStack-Cloud-8-Updates' has been successfully disabled.NoteThe SLES12-SP3-LTSS-Updates repository should only be present if you have purchased the optional SUSE Linux Enterprise 12 SP3 LTSS support. Whether or not it is configured will not impact the upgrade process.
Remove the PTF repository, which is based on SUSE Linux Enterprise 12 SP3. A new one based on SUSE Linux Enterprise 12 SP4 will be created during the upgrade process.
ardana >
zypper repos | grep PTF 2 | PTF | PTF | Yes | (r ) Yes | Yes ardana > sudo zypper removerepo PTF Removing repository 'PTF' ..............................................................................................[done] Repository 'PTF' has been removed.Remove the Cloud media repository if defined.
ardana >
zypper repos | grep '[|] Cloud ' 1 | Cloud | SUSE OpenStack Cloud 8 DVD #1 | Yes | (r ) Yes | No ardana > sudo zypper removerepo Cloud Removing repository 'SUSE OpenStack Cloud 8 DVD #1' ....................................................................[done] Repository 'SUSE OpenStack Cloud 8 DVD #1' has been removed.Ensure the deployer node has access to the SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 CLM repositories as documented in Chapter 17, Software Repository Setup paying attention to the non-SMT based repository setup. When you run
zypper repos --show-enabled-only
, the output should look similar to the following:ardana >
zypper repos --show-enabled-only # | Alias | Name | Enabled | GPG Check | Refresh ---+--------------------------------+--------------------------------+---------+-----------+-------- 1 | Cloud | SUSE OpenStack Cloud 9 DVD #1 | Yes | (r ) Yes | No 7 | SLES12-SP4-Pool | SLES12-SP4-Pool | Yes | (r ) Yes | No 8 | SLES12-SP4-Updates | SLES12-SP4-Updates | Yes | (r ) Yes | Yes 9 | SUSE-OpenStack-Cloud-9-Pool | SUSE-OpenStack-Cloud-9-Pool | Yes | (r ) Yes | No 10 | SUSE-OpenStack-Cloud-9-Updates | SUSE-OpenStack-Cloud-9-Updates | Yes | (r ) Yes | YesNoteThe Cloud repository above is optional. Its content is equivalent to the SUSE-Openstack-Cloud-9-Pool repository.
Run the
zypper dist-upgrade
command to upgrade the deployer node:ardana >
sudo zypper dist-upgrade Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command. Loading repository data... Reading installed packages... Computing distribution upgrade... ... 525 packages to upgrade, 14 to downgrade, 62 new, 5 to remove, 1 to change vendor, 1 to change arch. Overall download size: 1.24 GiB. Already cached: 0 B. After the operation, additional 780.8 MiB will be used. Continue? [y/n/...? shows all options] (y): y ... dracut: *** Generating early-microcode cpio image *** dracut: *** Constructing GenuineIntel.bin **** dracut: *** Store current command line parameters *** dracut: Stored kernel commandline: dracut: rd.lvm.lv=ardana-vg/root dracut: root=/dev/mapper/ardana--vg-root rootfstype=ext4 rootflags=rw,relatime,data=ordered dracut: *** Creating image file '/boot/initrd-4.4.180-94.127-default' *** dracut: *** Creating initramfs image file '/boot/initrd-4.4.180-94.127-default' done *** Output of btrfsmaintenance-0.2-18.1.noarch.rpm %posttrans script: Refresh script btrfs-scrub.sh for monthly Refresh script btrfs-defrag.sh for none Refresh script btrfs-balance.sh for weekly Refresh script btrfs-trim.sh for none There are some running programs that might use files deleted by recent upgrade. You may wish to check and restart some of them. Run 'zypper ps -s' to list these programs.NoteYou may need to run the
zypper dist-upgrade
command more than once, if it determines that it needs to update thezypper
infrastructure on your system to be able to successfullydist-upgrade
the node; the command will tell you if you need to run it again.
15.4.2 Upgrading the Deployer Node Configuration Settings #
Now that the deployer node packages have been migrated to SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, we need to update the configuration settings to be SUSE OpenStack Cloud 9 Cloud Lifecycle Manager based.
The first step is to run the ardana-init
command. This will:
Add the PTF repository, creating it if needed.
Optionally add appropriate local repository references for any SMT-provided SUSE Linux Enterprise 12 SP4 and SUSE OpenStack Cloud 9 repositories.
Upgrade the deployer account
~/openstack
area to be based upon SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources. This will import the new SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible code into the Git repository on the Ardana branch, and then rebase the customer site branch on top of the updated Ardana branch.
Follow the directions to resolve any Git merge conflicts that may arise due to local changes that may have been made on the site branch:
ardana >
ardana-init ... To continue installation copy your cloud layout to: /var/lib/ardana/openstack/my_cloud/definition Then execute the installation playbooks: cd /var/lib/ardana/openstack/ardana/ansible git add -A git commit -m 'My config' ansible-playbook -i hosts/localhost cobbler-deploy.yml ansible-playbook -i hosts/localhost bm-reimage.yml ansible-playbook -i hosts/localhost config-processor-run.yml ansible-playbook -i hosts/localhost ready-deployment.yml cd /var/lib/ardana/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts site.yml If you prefer to use the UI to install the product, you can do either of the following: - If you are running a browser on this machine, you can point your browser to http://localhost:9085 to start the install via the UI. - If you are running the browser on a remote machine, you will need to create an ssh tunnel to access the UI. Please refer to the Ardana installation documentation for further details.
As we are upgrading to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, we do not need
to run the suggested bm-reimage.yml
playbook.
If you were previously using the cobbler-based integrated provisioning
solution, then you will need to perform the following steps to import
the SUSE Linux Enterprise 12 SP4 ISO and update the default provisioning distribution:
Ensure there is a copy of the
SLE-12-SP4-Server-DVD-x86_64-GM-DVD1.iso
, namedsles12sp4.iso
, available in the/var/lib/ardana
directory.Ensure that any distribution entries in
servers.yml
(or whichever file holds the server node definitions) under~/openstack/my_cloud/definition
are updated to specifysles12sp4
if they are currently usingsles12sp3
.NoteThe default distribution will now be
sles12sp4
, so if there are no specific distribution entries specified for the servers, then no change will be required.If you have made any changes to the
~/openstack/my_cloud/definition
files, you will need to commit those changes, as follows:ardana >
cd ~/openstack/my_cloud/definitionardana >
git add -Aardana >
git commit -m "Update sles12sp3 distro entries to sles12sp4"Run the
cobbler-deploy.yml
playbook to import the SUSE Linux Enterprise 12 SP4 distribution as the new default distribution:ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost cobbler-deploy.yml Enter the password that will be used to access provisioned nodes: confirm Enter the password that will be used to access provisioned nodes: PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - cobbler-deploy.yml" } msg: Playbook started - cobbler-deploy.yml ... PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - cobbler-deploy.yml" } msg: Playbook finished - cobbler-deploy.yml PLAY RECAP ******************************************************************** localhost : ok=92 changed=45 unreachable=0 failed=0
You are now ready to upgrade the input model to be compatible.
At this point, there are some mandatory changes that will need to be made to the existing input model to permit the upgrade to proceed. These mandatory changes represent:
The removal of previously-deprecated service components;
The dropping of service components that are no longer supported;
That there can be only one untagged VLAN per network interface;
That there must be a
MANAGEMENT
network group.
There are also some service components that have been made redundant
and have no effect. These should be removed to quieten the associated
config-processor-run.yml
warnings.
For example, if you run the config-processor-run.yml
playbook from the ~/openstack/ardana/ansible
directory before making the necessary input model changes, you should
see it fail with errors similar to those shown below, unless your input
model does not deploy the problematic service components:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost config-processor-run.yml Enter encryption key (press return for none): confirm Enter encryption key (press return for none): To change encryption key enter new key (press return for none): confirm To change encryption key enter new key (press return for none): PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] ... "################################################################################", "# The configuration processor failed. ", "# control-planes-2.0 WRN: cp:openstack-core: 'designate-pool-manager' has been deprecated and will be replaced by 'designate-worker'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.", "", "# control-planes-2.0 WRN: cp:openstack-core: 'manila-share' service component is deprecated. The 'manila-share' service component can be removed as manila share service will be deployed where manila-api is specified. This is not a deprecation for openstack-manila-share but just an entry deprecation in input model.", "", "# control-planes-2.0 WRN: cp:openstack-core: 'designate-zone-manager' has been deprecated and will be replaced by 'designate-producer'. The replacement component will be automatically deployed in a future release. You will then need to update the input model to remove this warning.", "", "# control-planes-2.0 WRN: cp:openstack-core: 'glance-registry' has been deprectated and is no longer deployed. Please update you input model to remove any 'glance-registry' service component specifications to remove this warning.", "", "# control-planes-2.0 WRN: cp:mml: 'ceilometer-api' is no longer used by Ardana and will not be deployed. Please update your input model to remove this warning.", "", "# control-planes-2.0 WRN: cp:sles-compute: 'neutron-lbaasv2-agent' has been deprecated and replaced by 'octavia' and will not be deployed in a future release. Please update your input model to remove this warning.", "", "# control-planes-2.0 ERR: cp:common-service-components: Undefined component 'freezer-agent'", "# control-planes-2.0 ERR: cp:openstack-core: Undefined component 'nova-console-auth'", "# control-planes-2.0 ERR: cp:openstack-core: Undefined component 'heat-api-cloudwatch'", "# control-planes-2.0 ERR: cp:mml: Undefined component 'freezer-api'", "################################################################################" ] } } TASK: [debug var=config_processor_result.stderr] ****************************** ok: [localhost] => { "var": { "config_processor_result.stderr": "/usr/lib/python2.7/site-packages/ardana_configurationprocessor/cp/model/YamlConfigFile.py:95: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n self._contents = yaml.load(''.join(lines))" } } TASK: [fail msg="Configuration processor run failed, see log output above for details"] *** failed: [localhost] => {"failed": true} msg: Configuration processor run failed, see log output above for details msg: Configuration processor run failed, see log output above for details FATAL: all hosts have already failed -- aborting PLAY RECAP ******************************************************************** to retry, use: --limit @/var/lib/ardana/config-processor-run.retry localhost : ok=8 changed=5 unreachable=0 failed=1
To resolve any errors and warnings like those shown above, you will need to perform the following actions:
Remove any service component entries that are no longer valid from the
control_plane.yml
(or whichever file holds the control-plane definitions) under~/openstack/my_cloud/definition
. This means that you have to comment out (or delete) any lines for the following service components, which are no longer available:freezer-agent
freezer-api
heat-api-cloudwatch
nova-console-auth
NoteThis should resolve the errors that cause the
config-processor-run.yml
playbook to fail.Similarly, remove any service components that are redundant and no longer required. This means that you should comment out (or delete) any lines for the following service components:
ceilometer-api
glance-registry
manila-share
neutron-lbaasv2-agent
NoteThis should resolve most of the warnings reported by the
config-processor-run.yml
playbook.ImportantIf you have deployed the
designate
service components (designate-pool-manager
anddesignate-zone-manager
) in your cloud, you will see warnings like those shown above, indicating that these service components have been deprecated.You can switch to using the newer
designate-worker
anddesignate-producer
service components, which will quieten these deprecation warnings produced by theconfig-processor-run.yml
playbook run.However, this is a procedure that should be perfomed after the upgrade has completed, as outlined in the Section 15.4.5, “Post-Upgrade Tasks” section below.
Once you have made the necessary changes to your input model, if you run
git diff
under the~/openstack/my_cloud/definition
directory, you should see output similar to the following:ardana >
cd ~/openstack/my_cloud/definitionardana >
git diff diff --git a/my_cloud/definition/data/control_plane.yml b/my_cloud/definition/data/control_plane.yml index f7cfd84..2c1a73c 100644 --- a/my_cloud/definition/data/control_plane.yml +++ b/my_cloud/definition/data/control_plane.yml @@ -32,7 +32,6 @@ - NEUTRON-CONFIG-CP1 common-service-components: - lifecycle-manager-target - - freezer-agent - stunnel - monasca-agent - logging-rotate @@ -118,12 +117,10 @@ - cinder-volume - cinder-backup - glance-api - - glance-registry - nova-api - nova-placement-api - nova-scheduler - nova-conductor - - nova-console-auth - nova-novncproxy - neutron-server - neutron-ml2-plugin @@ -137,7 +134,6 @@ - horizon - heat-api - heat-api-cfn - - heat-api-cloudwatch - heat-engine - ops-console-web - barbican-api @@ -151,7 +147,6 @@ - magnum-api - magnum-conductor - manila-api - - manila-share - name: mml cluster-prefix: mml @@ -164,9 +159,7 @@ # freezer-api shares elastic-search with logging-server # so must be co-located with it - - freezer-api - - ceilometer-api - ceilometer-polling - ceilometer-agent-notification - ceilometer-common @@ -194,4 +187,3 @@ - neutron-l3-agent - neutron-metadata-agent - neutron-openvswitch-agent - - neutron-lbaasv2-agentIf you are happy with these changes, commit them into the Git repository as follows:
ardana >
cd ~/openstack/my_cloud/definitionardana >
git add -Aardana >
git commit -m "SOC 9 CLM Upgrade input model migration"Now you are ready to run the
config-processor-run.yml
playbook. If the necessary input model changes have been made, it will complete successfully:
ardana > cd ~/openstack/ardana/ansible
ardana >
ansible-playbook -i hosts/localhost config-processor-run.yml Enter encryption key (press return for none): confirm Enter encryption key (press return for none): To change encryption key enter new key (press return for none): confirm To change encryption key enter new key (press return for none): PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] ... PLAY RECAP ******************************************************************** localhost : ok=24 changed=20 unreachable=0 failed=0
15.4.3 Upgrading Cloud Services #
The deployer node is now ready to be used to upgrade the remaining cloud nodes and running services.
If upgrading from Helion OpenStack 8, there is a manual file update that must
be applied before continuing the upgrade process. In the file
/usr/share/ardana/ansible/roles/osconfig/tasks/check-product-status.yml
replace `command` with `shell` in the first Ansible task. The correct version
of the file appears below.
- name: deployer-setup | check-product-status | Check HOS product installed
  shell: |-
    zypper info hpe-helion-openstack-release | grep "^Installed *: *Yes"
  ignore_errors: yes
  register: product_flavor_hos

- name: deployer-setup | check-product-status | Check SOC product availability
  become: yes
  zypper:
    name: "suse-openstack-cloud-release>=8"
    state: present
  ignore_errors: yes
  register: product_flavor_soc

- name: deployer-setup | check-product-status | Provide help
  fail:
    msg: >
      The deployer node does not have a Cloud Add-On product installed.
      In YaST select Software/Add-On Products to see an overview of
      installed add-on products and use "Add" to add the Cloud product.
  when:
    - product_flavor_soc|failed
    - product_flavor_hos|failed
Changes to the check-product-status.yml
file must be staged
and committed via git.
git add -u
git commit -m "applying osconfig fix prior to HOS8 to SOC9 upgrade"
The ardana-upgrade.yml
playbook runs the upgrade process
against all nodes in parallel, though some of the steps are serialised
to run on only one node at a time to avoid triggering potentially
problematic race conditions. As such, the playbook can take a long time to run.
Generate the updated scratch area using the SUSE OpenStack Cloud 9 Cloud Lifecycle Manager Ansible sources:
ardana >
cd ~/openstack/ardana/ansibleardana >
ansible-playbook -i hosts/localhost ready-deployment.yml PLAY [localhost] ************************************************************** GATHERING FACTS *************************************************************** ok: [localhost] ... PLAY RECAP ******************************************************************** localhost : ok=31 changed=16 unreachable=0 failed=0Confirm that there are no pending updates for the deployer node. This could happen if you are using an SMT to manage the repositories, and updates have been released through the official channels since the deployer node was migrated. To check for any pending Cloud Lifecycle Manager package updates, you can run the
ardana-update-pkgs.yml
playbook as follows:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --limit OPS-LM--first-member PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-dplyr-m1] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-update-pkgs.yml" } ... TASK: [_ardana-update-status | Report update status] ************************** ok: [ardana-cp-dplyr-m1] => { "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n=====================================================================" } msg: ===================================================================== Update status for node ardana-cp-dplyr-m1: ===================================================================== No pending update actions on the ardana-cp-dplyr-m1 host were collected or reset during this update run or persisted during previous unsuccessful or incomplete update runs. ===================================================================== PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-update-pkgs.yml" } msg: Playbook finished - ardana-update-pkgs.yml PLAY RECAP ******************************************************************** ardana-cp-dplyr-m1 : ok=98 changed=12 unreachable=0 failed=0 localhost : ok=6 changed=2 unreachable=0 failed=0NoteIf running the
ardana-update-pkgs.yml
playbook identifies that there were updates that needed to be installed on your deployer node, then you need to go back to running theardana-init
command, followed by thecobbler-deploy.yml
playbook, then theconfig-processor-run.yml
playbook, and finally theready-deployment.yml
playbook, addressing any additional input model changes that may be needed. Then, repeat this step to check for any pending updates before continuing with the upgrade.Double-check that there are no pending actions needed for the deployer node by running the
ardana-update-status.yml
playbook, as follows:ardana >
ansible-playbook -i hosts/verb_hosts ardana-update-status.yml --limit OPS-LM--first-member PLAY [resources] ************************************************************** ... TASK: [_ardana-update-status | Report update status] ************************** ok: [ardana-cp-dplyr-m1] => { "msg": "=====================================================================\nUpdate status for node ardana-cp-dplyr-m1:\n=====================================================================\nNo pending update actions on the ardana-cp-dplyr-m1 host\nwere collected or reset during this update run or persisted during\nprevious unsuccessful or incomplete update runs.\n\n=====================================================================" } msg: ===================================================================== Update status for node ardana-cp-dplyr-m1: ===================================================================== No pending update actions on the ardana-cp-dplyr-m1 host were collected or reset during this update run or persisted during previous unsuccessful or incomplete update runs. ===================================================================== PLAY RECAP ******************************************************************** ardana-cp-dplyr-m1 : ok=12 changed=0 unreachable=0 failed=0Having verified that there are no pending actions detected, it is safe to proceed with running the
ardana-upgrade.yml
playbook to upgrade the entire cloud:ardana >
cd ~/scratch/ansible/next/ardana/ansibleardana >
ansible-playbook -i hosts/verb_hosts ardana-upgrade.yml PLAY [all] ******************************************************************** ... TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-upgrade.yml" } msg: Playbook started - ardana-upgrade.yml ... ... ... ... ... TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-upgrade.yml" } msg: Playbook finished - ardana-upgrade.yml
The ardana-upgrade.yml
playbook run will take a long time. The
zypper dist-upgrade
phase is serialized across all of
the nodes and usually takes between five and ten minutes per node. This
is followed by the cloud service upgrade phase, which takes
approximately the same amount of time as a full cloud deploy. During
this time, the cloud should remain largely functional, although there
may be brief interruptions to some services. We therefore recommend
avoiding any workload management tasks during this period.
Until the ardana-upgrade.yml
playbook run has completed successfully, other playbooks, such as ardana-status.yml,
may report status problems. This is because some services that are
expected to be running may not be installed, enabled, or migrated yet.
The ardana-upgrade.yml
playbook run may occasionally
fail during the whole-cloud upgrade phase if a service (for example,
the monasca-thresh
service) is slow to restart. In such cases, it is
safe to run the ardana-upgrade.yml
playbook again;
in most cases it will continue past the stage that failed
previously. However, if the same problem persists across multiple runs,
contact your support team for assistance.
It is important to disable all SUSE Linux Enterprise 12 SP3 SUSE OpenStack Cloud 8 Cloud Lifecycle Manager
repositories before migrating the deployer to SUSE Linux Enterprise 12 SP4 SUSE OpenStack Cloud
9 Cloud Lifecycle Manager. If you did not do this, then the first time you
run the ardana-upgrade.yml
playbook, it may
complain that there are pending updates for the deployer node.
If this happens, repeat the earlier steps to upgrade the
deployer node, starting with running the ardana-init
command. This does not indicate a serious problem.
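As a quick sanity check before starting, you can list the repositories configured on the deployer and disable any remaining SP3 or Cloud 8 entries. The commands below use standard zypper options; the repository alias shown is only an example, so substitute the aliases reported on your own deployer:
# List all configured repositories and their aliases
sudo zypper repos --uri
# Disable a leftover SLE 12 SP3 / Cloud 8 repository by its alias (example alias)
sudo zypper modifyrepo --disable SUSE-OPENSTACK-CLOUD-8-x86_64-Pool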
In SUSE OpenStack Cloud 9 Cloud Lifecycle Manager the LBaaS V2 legacy driver has been
deprecated and removed. As part of the ardana-upgrade.yml
playbook run,
all existing LBaaS V2 load-balancers will be automatically migrated to being
based on the Octavia Amphora provider. To enable creation of any new Octavia-
based load-balancer instances, you need to ensure that an appropriate Amphora
image is registered for use when creating instances, by following
Chapter 43, Configuring Load Balancer as a Service.
While running the ardana-upgrade.yml
playbook, a point will be
reached when the Neutron services are upgraded. As part of this upgrade,
any existing LBaaS V2 load-balancer definitions will be migrated to
Octavia Amphora-based load-balancer definitions.
After this migration of load-balancer definitions has completed, if a load-balancer failover is triggered, then the replacement load-balancer may fail to start, as an appropriate Octavia Amphora image for SUSE OpenStack Cloud 9 Cloud Lifecycle Manager will not yet be available.
However, once the Octavia Amphora image has been uploaded using the above instructions, it will be possible to recover any failed load-balancers by re-triggering the failover: follow the instructions at https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#loadbalancer-failover.
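For example, once the Amphora image is in place, a failed load-balancer can be recovered with the Octavia OpenStack client; the load-balancer name below is a placeholder:
# Identify the load-balancer that needs to be recovered
openstack loadbalancer list
# Re-trigger the failover so that a new Amphora is built from the uploaded image
openstack loadbalancer failover my-loadbalancer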
15.4.4 Rebooting the Nodes into the SUSE Linux Enterprise 12 SP4 Kernel #
At this point, all of the cloud services have been upgraded, but the nodes are still running the SUSE Linux Enterprise 12 SP3 kernel. The final step in the upgrade workflow is to reboot all of the nodes in the cloud in a controlled fashion, to ensure that active services failover appropriately.
The recommended order for rebooting nodes is to start with the deployer. This requires special handling, since the Ansible-based automation cannot fully manage the reboot of the node that it is running on.
After that, we recommend rebooting the rest of the nodes in the control planes in a rolling-reboot fashion, ensuring that high-availability services remain available.
Finally, the compute nodes can be rebooted, either individually or in groups, as is appropriate to avoid interruptions to running workloads.
Do not reboot all your control plane nodes at the same time.
The reboot of the deployer node requires additional steps, as the Ansible-based automation framework cannot fully automate the reboot of the node that runs the ansible-playbook commands.
Run the
ardana-reboot.yml
playbook limited to the deployer node, either by name or by using the logical node identifier OPS-LM--first-member,
as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit OPS-LM--first-member PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-dplyr-m1] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-reboot.yml" } msg: Playbook started - ardana-reboot.yml ... TASK: [ardana-reboot | Deployer node has to be rebooted manually] ************* failed: [ardana-cp-dplyr-m1] => {"failed": true} msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook: cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1 msg: The deployer node needs to be rebooted manually. After reboot, resume by running the post-reboot playbook: cd ~/scratch/ansible/next/ardana/ansible ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-dplyr-m1 FATAL: all hosts have already failed -- aborting PLAY RECAP ******************************************************************** to retry, use: --limit @/var/lib/ardana/ardana-reboot.retry ardana-cp-dplyr-m1 : ok=8 changed=3 unreachable=0 failed=1 localhost : ok=7 changed=0 unreachable=0 failed=0The
ardana-reboot.yml
playbook will fail when run on the deployer node; this is expected. The reported failure message tells you how to complete the remaining reboot steps manually: reboot the node, then log back in and run the _ardana-post-reboot.yml
playbook to start any services that need to be running on the node.
Manually reboot the deployer node, for example with
shutdown -r now
.
Once the deployer node has rebooted, you need to log in again and run the
_ardana-post-reboot.yml
playbook to complete the startup of any services that should be running on the deployer node, as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook _ardana-post-reboot.yml --limit OPS-LM--first-member PLAY [resources] ************************************************************** TASK: [Set pending_clm_update] ************************************************ skipping: [ardana-cp-dplyr-m1] ... TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-status.yml" } msg: Playbook finished - ardana-status.yml PLAY RECAP ******************************************************************** ardana-cp-dplyr-m1 : ok=26 changed=0 unreachable=0 failed=0 localhost : ok=19 changed=1 unreachable=0 failed=0
For the remaining nodes, you can use ardana-reboot.yml
to fully automate the reboot process. However, we recommend that you
reboot the nodes in a rolling-reboot fashion, so that high-availability
services continue to run without interruption. Similarly, to avoid
interrupting any singleton services (such as the cinder-volume
and cinder-backup
services), migrate them off the intended
node before it is rebooted, and migrate them back again afterwards; a quick way to check where these services currently run is shown below.
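As a quick, read-only check (not a replacement for the documented migration procedure), you can see which hosts are currently running the singleton cinder services from the volume service listing; the name of the admin credentials file used here is an assumption:
# Show which control-plane host currently runs cinder-volume and cinder-backup
. ~/service.osrc
openstack volume service list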
Use the
ansible
command's --list-hosts
option to list the remaining nodes in the cloud that are neither the deployer nor a compute node:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP' ardana-cp-dbmqsw-m1 ardana-cp-dbmqsw-m2 ardana-cp-dbmqsw-m3 ardana-cp-osc-m1 ardana-cp-osc-m2 ardana-cp-mml-m1 ardana-cp-mml-m2 ardana-cp-mml-m3
Use the following command to generate the set of
ansible-playbook
commands that need to be run to reboot all the nodes sequentially:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
for node in $(ansible -i hosts/verb_hosts --list-hosts 'resources:!OPS-LM--first-member:!NOV-CMP'); do echo ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ${node} || break; done
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m1
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m2
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-dbmqsw-m3
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m1
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-osc-m2
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m1
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m2
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3
Warning: Do not reboot all your control-plane nodes at the same time.
To reboot a specific control-plane node, you can use the above
ansible-playbook
commands as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-mml-m3 PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-mml-m3] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-reboot.yml" } msg: Playbook started - ardana-reboot.yml ... PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-status.yml" } msg: Playbook finished - ardana-status.yml PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-reboot.yml" } msg: Playbook finished - ardana-reboot.yml PLAY RECAP ******************************************************************** ardana-cp-mml-m3 : ok=389 changed=105 unreachable=0 failed=0 localhost : ok=27 changed=1 unreachable=0 failed=0
You can reboot more than one control-plane node at a time, but only if they are members of different control-plane clusters. For example, you could reboot one node from each of the OpenStack controller, database, swift, monitoring, and logging clusters at the same time, as long as no cluster has more than one node rebooting at once.
When rebooting the first member of the control-plane cluster where
monitoring services run, the monasca-thresh
service can sometimes fail
to start up in a timely fashion when the node is coming back up after
being rebooted. This can cause ardana-reboot.yml
to fail.
See below for suggestions on how to handle this problem.
Getting monasca-thresh Running After an ardana-reboot.yml Failure #
If the ardana-reboot.yml
playbook failed because
monasca-thresh
did not start up in a timely fashion
after a reboot, you can retry starting the services on that node using
the _ardana-post-reboot.yml
playbook.
This is similar to the manual handling of the deployer reboot, since
the node has already successfully rebooted onto the new kernel, and
you just need to get the required services running again on the node.
It can sometimes take up to 15 minutes for the monasca-thresh
service to successfully start in such cases.
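To watch for the service coming up, you can poll its status with an ad hoc Ansible command; this sketch reuses the MON-THR node group and the monasca-thresh unit name referenced below:
cd ~/scratch/ansible/next/ardana/ansible
# A non-zero exit status here simply means the service is not active yet
ansible -i hosts/verb_hosts MON-THR -b -m shell -a "systemctl status monasca-thresh --no-pager"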
However, if the service still fails to start after that time, then you may need to force a restart of the
storm-nimbus
and storm-supervisor
services on all nodes in the MON-THR
node group, as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible MON-THR -b -m shell -a "systemctl restart storm-nimbus" ardana-cp-mml-m1 | success | rc=0 >> ardana-cp-mml-m2 | success | rc=0 >> ardana-cp-mml-m3 | success | rc=0 >> ardana > ansible MON-THR -b -m shell -a "systemctl restart storm-supervisor" ardana-cp-mml-m1 | success | rc=0 >> ardana-cp-mml-m2 | success | rc=0 >> ardana-cp-mml-m3 | success | rc=0 >> ardana > ansible-playbook -i hosts/verb_hosts _ardana-post-reboot.yml --limit ardana-cp-mml-m1
If the monasca-thresh
service still fails to start up,
contact your support team.
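Before opening a support request, it can help to collect recent service logs from the monitoring nodes; a minimal sketch (the number of log lines shown is only an example):
cd ~/scratch/ansible/next/ardana/ansible
# Gather the most recent monasca-thresh journal entries from each MON-THR node
ansible -i hosts/verb_hosts MON-THR -b -m command -a "journalctl -u monasca-thresh --no-pager -n 100"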
To check which control-plane nodes have not yet been rebooted onto
the new kernel, you can use an ad hoc Ansible command to run uname -r
on the target nodes, as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible -i hosts/verb_hosts 'resources:!OPS-LM--first-member:!NOV-CMP' -m command -a 'uname -r' ardana-cp-dbmqsw-m1 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-dbmqsw-m3 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-osc-m1 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-dbmqsw-m2 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-mml-m2 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-osc-m2 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-mml-m1 | success | rc=0 >> 4.12.14-95.57-default ardana-cp-mml-m3 | success | rc=0 >> 4.12.14-95.57-default ardana > uname -r 4.12.14-95.57-default
If any node's uname -r
value does not match the kernel
that the deployer is running, that node has probably not yet been rebooted.
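The comparison can also be scripted; the following sketch, which assumes the same inventory pattern used above, makes each node report whether its running kernel matches the deployer's:
cd ~/scratch/ansible/next/ardana/ansible
# Capture the kernel version that the (already rebooted) deployer is running
deployer_kernel=$(uname -r)
# Each remaining control-plane node prints up-to-date or REBOOT-NEEDED
ansible -i hosts/verb_hosts 'resources:!OPS-LM--first-member:!NOV-CMP' \
    -m shell -a "test \"\$(uname -r)\" = \"$deployer_kernel\" && echo up-to-date || echo REBOOT-NEEDED"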
Finally, you need to reboot the compute nodes. Rebooting multiple compute nodes at the same time is possible, so long as doing so does not compromise the integrity of running workloads. We recommend that you migrate workloads off groups of compute nodes in a controlled fashion so that those groups can be rebooted together.
Do not reboot all of your compute nodes at the same time.
To see all the compute nodes that are available to be rebooted, you can run the following command:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible -i hosts/verb_hosts --list-hosts NOV-CMP ardana-cp-slcomp0001 ardana-cp-slcomp0002 ... ardana-cp-slcomp0080
Reboot the compute nodes, individually or in groups, using the
ardana-reboot.yml
playbook as follows:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts ardana-reboot.yml --limit ardana-cp-slcomp0001,ardana-cp-slcomp0002 PLAY [all] ******************************************************************** TASK: [setup] ***************************************************************** ok: [ardana-cp-slcomp0001] ok: [ardana-cp-slcomp0002] PLAY [localhost] ************************************************************** TASK: [pbstart.yml pb_start_playbook] ***************************************** ok: [localhost] => { "msg": "Playbook started - ardana-reboot.yml" } msg: Playbook started - ardana-reboot.yml ... PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-status.yml" } msg: Playbook finished - ardana-status.yml PLAY [localhost] ************************************************************** TASK: [pbfinish.yml pb_finish_playbook] *************************************** ok: [localhost] => { "msg": "Playbook finished - ardana-reboot.yml" } msg: Playbook finished - ardana-reboot.yml PLAY RECAP ******************************************************************** ardana-cp-slcomp0001 : ok=120 changed=11 unreachable=0 failed=0 ardana-cp-slcomp0002 : ok=120 changed=11 unreachable=0 failed=0 localhost : ok=27 changed=1 unreachable=0 failed=0
You must ensure that there is sufficient spare capacity elsewhere in the cloud to host any workload or Amphora instances migrated off the targeted compute nodes.
When rebooting multiple compute nodes at the same time, consider manually migrating any running workloads and Amphora instances off the target nodes in advance, to avoid any potential risk of workload or service interruption.
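A minimal sketch of preparing a single compute node before its reboot, using standard OpenStack client commands; the host name and credentials file name are placeholders, and your deployment may require the dedicated compute maintenance procedures instead:
# Source admin credentials (file name assumed)
. ~/service.osrc
# Stop new instances from being scheduled onto the node that will be rebooted
openstack compute service set --disable --disable-reason "kernel reboot" ardana-cp-slcomp0001-mgmt nova-compute
# List the instances currently running on that node, then live-migrate them off
# (for example with the nova live-migration command) before rebooting
openstack server list --all-projects --host ardana-cp-slcomp0001-mgmt
# Re-enable the service once the node is back up
openstack compute service set --enable ardana-cp-slcomp0001-mgmt nova-compute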
15.4.5 Post-Upgrade Tasks #
After the cloud has been upgraded to SUSE OpenStack Cloud 9 Cloud Lifecycle Manager,
any existing designate
deployment that was previously configured will still be using the deprecated
designate-zone-manager
and
designate-pool-manager
service components.
They will continue to operate correctly under SUSE OpenStack Cloud 9 Cloud Lifecycle Manager,
but we recommend that you migrate to the newer designate-worker
and designate-producer
service components by
following the procedure documented in Section 25.4, “Migrate Zone/Pool to Worker/Producer after Upgrade”.
After migrating the deployer node, a small number of previously installed packages are no longer required, such as the
ceilometer
and freezer
virtualenv
(venv) packages. You can safely remove these packages with the following commands:
ardana >
zypper packages --orphaned Loading repository data... Reading installed packages... S | Repository | Name | Version | Arch --+------------+----------------------------------+-----------------------------------+------- i | @System | python-flup | 1.0.3.dev_20110405-2.10.52 | noarch i | @System | python-happybase | 0.9-1.64 | noarch i | @System | venv-openstack-ceilometer-x86_64 | 9.0.8~dev7-12.24.2 | noarch i | @System | venv-openstack-freezer-x86_64 | 5.0.0.0~xrc2~dev2-10.22.1 | noarch ardana> sudo zypper remove venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64 Loading repository data... Reading installed packages... Resolving package dependencies... The following 2 packages are going to be REMOVED: venv-openstack-ceilometer-x86_64 venv-openstack-freezer-x86_64 2 packages to remove. After the operation, 79.0 MiB will be freed. Continue? [y/n/...? shows all options] (y): y (1/2) Removing venv-openstack-ceilometer-x86_64-9.0.8~dev7-12.24.2.noarch ..................................................................[done] Additional rpm output: /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. return yaml.load(f) (2/2) Removing venv-openstack-freezer-x86_64-5.0.0.0~xrc2~dev2-10.22.1.noarch ..............................................................[done] Additional rpm output: /usr/lib/python2.7/site-packages/ardana_packager/indexer.py:148: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details. return yaml.load(f)
The freezer service has been deprecated and removed from SUSE OpenStack Cloud 9 Cloud Lifecycle Manager, but the backups that the freezer service created before you upgraded will still be consuming space in your Swift Object store.
Therefore, once you have completed the upgrade successfully, you can safely delete the containers that freezer used to hold the database and ring backups, freeing up that space.
Using the credentials in the
backup.osrc
file, found on the deployer node in the Ardana account's home directory, run the following commands:
ardana >
. ~/backup.osrc
ardana >
swift list freezer_database_backups freezer_rings_backups ardana> swift delete --all freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/1_1598548599/segments/000000021 freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/2_1598605266/data1 ... freezer_database_backups/data/tar/ardana-cp-dbmqsw-m2-host_freezer_mysql_backup/1598505404/0_1598505404/segments/000000001 freezer_database_backups freezer_rings_backups/metadata/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/metadata ... freezer_rings_backups/data/tar/ardana-cp-dbmqsw-m1-host_freezer_swift_builder_dir_backup/1598548636/0_1598548636/data freezer_rings_backups
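To confirm that the space has been reclaimed, you can list the remaining containers afterwards; the freezer containers should no longer appear:
# Verify that the freezer backup containers are gone
. ~/backup.osrc
swift list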
15.5 Cloud Lifecycle Manager Program Temporary Fix (PTF) Deployment #
Occasionally, in order to fix a given issue, SUSE will provide a set of packages known as a Program Temporary Fix (PTF). Such a PTF is fully supported by SUSE until the Maintenance Update containing a permanent fix has been released via the regular Update repositories. Customers running PTF fixes will be notified through the related Service Request when a permanent patch for a PTF has been released.
Use the following steps to deploy a PTF:
When SUSE has developed a PTF, you will receive a URL for that PTF. You should download the packages from the location provided by SUSE Support to a temporary location on the Cloud Lifecycle Manager. For example:
ardana >
tmpdir=`mktemp -d`
ardana >
cd $tmpdir
ardana >
wget --no-directories --recursive --reject "index.html*" \
     --user=USER_NAME \
     --ask-password \
     --no-parent https://ptf.suse.com/54321aaaa...dddd12345/cloud8/042171/x86_64/20181030/
Remove any old data from the PTF repository, such as a listing for a PTF repository from a migration or when previous product patches were installed.
ardana >
sudo rm -rf /srv/www/suse-12.4/x86_64/repos/PTF/*
Move packages from the temporary download location to the PTF repository directory on the CLM Server. This example is for a neutron PTF.
ardana >
sudo mkdir -p /srv/www/suse-12.4/x86_64/repos/PTF/
ardana >
sudo mv $tmpdir/* /srv/www/suse-12.4/x86_64/repos/PTF/
ardana >
sudo chown --recursive root:root /srv/www/suse-12.4/x86_64/repos/PTF/*
ardana >
rmdir $tmpdir
Create or update the repository metadata:
ardana >
sudo /usr/local/sbin/createrepo-cloud-ptf Spawning worker 0 with 2 pkgs Workers Finished Saving Primary metadata Saving file lists metadata Saving other metadata
Refresh the PTF repository before installing package updates on the Cloud Lifecycle Manager:
ardana >
sudo zypper refresh --force --repo PTF Forcing raw metadata refresh Retrieving repository 'PTF' metadata ..........................................[done] Forcing building of repository cache Building repository 'PTF' cache ..........................................[done] Specified repositories have been refreshed.
The PTF shows as available on the deployer:
ardana >
sudo zypper se --repo PTF Loading repository data... Reading installed packages... S | Name | Summary | Type --+-------------------------------+-----------------------------------------+-------- | python-neutronclient | Python API and CLI for OpenStack neutron | package i | venv-openstack-neutron-x86_64 | Python virtualenv for OpenStack neutron | package
Install the PTF venv packages on the Cloud Lifecycle Manager:
ardana >
sudo zypper dup --from PTF Refreshing service Loading repository data... Reading installed packages... Computing distribution upgrade... The following package is going to be upgraded: venv-openstack-neutron-x86_64 The following package has no support information from its vendor: venv-openstack-neutron-x86_64 1 package to upgrade. Overall download size: 64.2 MiB. Already cached: 0 B. After the operation, additional 6.9 KiB will be used. Continue? [y/n/...? shows all options] (y): y Retrieving package venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ... (1/1), 64.2 MiB ( 64.6 MiB unpacked) Retrieving: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm ....[done] Checking for file conflicts: ..............................................................[done] (1/1) Installing: venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch ....[done] Additional rpm output: warning warning: /var/cache/zypp/packages/PTF/noarch/venv-openstack-neutron-x86_64-11.0.2-13.8.1.042171.0.PTF.102473.noarch.rpm: Header V3 DSA/SHA1 Signature, key ID b37b98a9: NOKEYValidate the venv tarball has been installed into the deployment directory:(note:the packages file under that dir shows the registered tarballs that will be used for the services, which should align with the installed venv RPM)
ardana >
ls -la /opt/ardana_packager/ardana-9/sles_venv/x86_64 total 898952 drwxr-xr-x 2 root root 4096 Oct 30 16:10 . ... -rw-r--r-- 1 root root 67688160 Oct 30 12:44 neutron-20181030T124310Z.tgz <<< -rw-r--r-- 1 root root 64674087 Aug 14 16:14 nova-20180814T161306Z.tgz -rw-r--r-- 1 root root 45378897 Aug 14 16:09 octavia-20180814T160839Z.tgz -rw-r--r-- 1 root root 1879 Oct 30 16:10 packages -rw-r--r-- 1 root root 27186008 Apr 26 2018 swift-20180426T230541Z.tgz
Install the non-venv PTF packages on the Compute Node:
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts ardana-update-pkgs.yml --extra-vars '{"zypper_update_method": "update", "zypper_update_repositories": ["PTF"]}' --limit comp0001-mgmt
When it has finished, you can see that the upgraded package has been installed on
comp0001-mgmt
.
ardana >
sudo zypper se --detail python-neutronclient Loading repository data... Reading installed packages... S | Name | Type | Version | Arch | Repository --+----------------------+----------+---------------------------------+--------+-------------------------------------- i | python-neutronclient | package | 6.5.1-4.361.042171.0.PTF.102473 | noarch | PTF | python-neutronclient | package | 6.5.0-4.361 | noarch | SUSE-OPENSTACK-CLOUD-x86_64-GM-DVD1
Running the ardana update playbook will distribute the PTF venv packages to the cloud server. Then you can find them loaded in the virtual environment directory with the other venvs.
The Compute Node before running the update playbook:
ardana >
ls -la /opt/stack/venv total 24 drwxr-xr-x 9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z drwxr-xr-x 9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
Run the update.
ardana >
cd ~/scratch/ansible/next/ardana/ansible
ardana >
ansible-playbook -i hosts/verb_hosts ardana-update.yml --limit comp0001-mgmt
When it has finished, you can see that an additional virtual environment has been installed.
ardana >
ls -la /opt/stack/venv total 28 drwxr-xr-x 9 root root 4096 Jul 18 15:47 neutron-20180718T154642Z drwxr-xr-x 9 root root 4096 Aug 14 16:13 neutron-20180814T161306Z drwxr-xr-x 9 root root 4096 Oct 30 12:43 neutron-20181030T124310Z <<< New venv installed drwxr-xr-x 10 root root 4096 May 28 09:30 nova-20180528T092954Z drwxr-xr-x 10 root root 4096 Aug 14 16:13 nova-20180814T161306Z
The PTF may also have
RPM
package updates in addition to venv updates. To complete the update, follow the instructions at Section 15.3.1, “Performing the Update”.
15.6 Periodic OpenStack Maintenance Tasks #
The heat-manage command helps manage heat-specific database operations. The associated
database should be purged periodically to save space. Set this up
as a cron job at
/etc/cron.weekly/local-cleanup-heat
on the servers where the heat service is running, with the following content:
#!/bin/bash
su heat -s /bin/bash -c "/usr/bin/heat-manage purge_deleted -g days 14" || :
The nova-manage db archive_deleted_rows command moves deleted rows
from production tables to shadow tables. Including
--until-complete
makes the command run continuously
until all deleted rows are archived. We recommend setting up this task
as /etc/cron.weekly/local-cleanup-nova
on the servers where the nova service is running, with the
following content:
#!/bin/bash
su nova -s /bin/bash -c "/usr/bin/nova-manage db archive_deleted_rows --until-complete" || :
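Archiving only moves rows into the shadow tables. If you also want to reclaim that space, newer nova releases provide nova-manage db purge for deleting rows from the shadow tables; a minimal sketch, assuming your nova version includes this command and following the same cron-job pattern (the file name is only an example):
#!/bin/bash
# Example: /etc/cron.weekly/local-purge-nova-shadow
# Delete all rows that have already been archived into the nova shadow tables
su nova -s /bin/bash -c "/usr/bin/nova-manage db purge --all" || :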