Installing a Basic Three-Node High Availability Cluster
- WHAT?
How to set up a basic three-node High Availability cluster with diskless SBD and a software watchdog.
- WHY?
This cluster can be used for testing purposes or as a minimal cluster configuration that can be extended later.
- EFFORT
Setting up a basic High Availability cluster takes approximately 15 minutes, depending on the speed of your network connection.
- GOAL
Get started with SUSE Linux Enterprise High Availability quickly and easily.
1 Usage scenario #
This guide describes the setup of a minimal High Availability cluster with the following properties:
Three cluster nodes with passwordless SSH access to each other. Three nodes are required for this setup so that diskless SBD can handle split-brain scenarios without the help of QDevice.
A floating, virtual IP address that allows clients to connect to the graphical management tool Hawk, no matter which node the service is running on.
Diskless SBD (STONITH Block Device) and a software watchdog used as the node fencing mechanism to avoid split-brain scenarios.
Failover of resources from one node to another if the active host breaks down (active/passive setup).
This is a simple cluster setup with minimal external requirements. You can use this cluster for testing purposes or as a basic cluster configuration that you can extend for a production environment later.
2 Installation overview #
To install the High Availability cluster described in Section 1, “Usage scenario”, you must perform the following tasks:
Review Section 3, “System requirements” to make sure you have everything you need.
Install SUSE Linux Enterprise High Availability on the cluster nodes with Section 4, “Enabling the High Availability extension”.
Initialize the cluster on the first node with Section 5, “Setting up the first node”.
Add more nodes to the cluster with Section 6, “Adding the second and third nodes”.
Log in to the Hawk Web interface to monitor the cluster with Section 7, “Logging in to Hawk”.
Perform basic tests to make sure the cluster works as expected with Section 8, “Testing the cluster”.
Review Section 9, “Next steps” for advice on expanding the cluster for a production environment.
3 System requirements #
This section describes the system requirements for a minimal setup of SUSE Linux Enterprise High Availability.
3.1 Hardware requirements #
- Servers
Three servers to act as cluster nodes.
The servers can be bare metal or virtual machines. They do not require identical hardware (memory, disk space, etc.), but they must have the same architecture. Cross-platform clusters are not supported.
See the System Requirements section at https://www.suse.com/download/sle-ha/ for more details about server hardware.
- Network Interface Cards (NICs)
At least two NICs per cluster node. This allows you to configure two or more communication channels for the cluster, using one of the following methods:
Combine the NICs into a network bond (preferred). In this case, you must set up the bonded device on each node before you initialize the cluster.
Create a second communication channel in Corosync. This can be configured by the cluster setup script. In this case, the two NICs must be on different subnets.
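If you plan to configure a second Corosync channel rather than a bond, you can quickly check that the two NICs sit on different subnets with a standard iproute2 command. The interface names and addresses below are examples only:
> ip -br addr show
eth0    UP    192.168.1.1/24
eth1    UP    10.0.0.1/24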
- Node fencing
To be supported, all SUSE Linux Enterprise High Availability clusters must have at least one node fencing device configured. For critical workloads, we recommend using two or three fencing devices. A fencing device can be either a physical device (a power switch) or a software mechanism (SBD in combination with a watchdog).
The minimal setup described in this guide uses a software watchdog and diskless SBD, so no additional hardware is required. Before using this cluster in a production environment, replace the software watchdog with a hardware watchdog.
3.2 Software requirements #
- Operating system
All nodes must have SUSE Linux Enterprise Server installed and registered.
- High Availability extension
The SUSE Linux Enterprise High Availability extension requires an additional registration code.
This extension can be enabled during the SLES installation, or you can enable it later on a running system. This guide explains how to enable and register the extension on a running system.
3.3 Network requirements #
- Time synchronization
All systems must synchronize to an NTP server outside the cluster. SUSE Linux Enterprise Server uses chrony for NTP. When you initialize the cluster, you are warned if chrony is not running.
Even if the nodes are synchronized, log files and cluster reports can still become difficult to analyze if the nodes have different time zones configured.
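To verify time synchronization on a node before you initialize the cluster, you can use the standard chrony tools; the exact output varies by system:
> sudo systemctl status chronyd
> chronyc sources -v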
- Host name and IP address
All cluster nodes must be able to find each other by name. Use the following methods for reliable name resolution:
Use static IP addresses.
List all nodes in the /etc/hosts file with their IP address, FQDN and short host name.
Only the primary IP address on each NIC is supported.
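For illustration, /etc/hosts entries for three nodes might look like the following. The host names and addresses are examples only; adjust them to your environment:
192.168.1.1    alice.example.com      alice
192.168.1.2    bob.example.com        bob
192.168.1.3    charlie.example.com    charlie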
- SSH
All cluster nodes must be able to access each other via SSH. Certain cluster operations also require passwordless SSH authentication. When you initialize the cluster, the setup script checks for existing SSH keys and generates them if they do not exist.
Important: root SSH access in SUSE Linux Enterprise 16
In SUSE Linux Enterprise 16, root SSH login with a password is disabled by default.
On each node, either create a user with sudo privileges or set up passwordless SSH authentication for the root user before you initialize the cluster.
If you initialize the cluster with a sudo user, certain crmsh commands also require passwordless sudo permission.
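A minimal sketch of the first option, assuming a hypothetical administration user named clusteradmin, could look like this on each node. Adjust the sudo rule to your security policy:
# useradd -m clusteradmin
# passwd clusteradmin
# echo 'clusteradmin ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/clusteradmin
# chmod 440 /etc/sudoers.d/clusteradmin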
4 Enabling the High Availability extension #
This procedure explains how to install SUSE Linux Enterprise High Availability on an existing SUSE Linux Enterprise Server. You can skip this procedure if you already installed the High Availability extension and packages during the SLES installation with Agama.
SUSE Linux Enterprise Server is installed and registered with the SUSE Customer Center.
You have an additional registration code for SUSE Linux Enterprise High Availability.
Perform this procedure on all the machines you intend to use as cluster nodes:
Log in either as the root user or as a user with sudo privileges.
Check whether the High Availability extension is already enabled:
> sudo SUSEConnect --list-extensions
Check whether the High Availability packages are already installed:
> zypper search ha_sles
Enable the SUSE Linux Enterprise High Availability extension:
> sudo SUSEConnect -p sle-ha/16.0/x86_64 -r HA_REGCODE
Install the High Availability packages:
> sudo zypper install -t pattern ha_sles
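To confirm that the extension is active and the core cluster packages are installed, you can run checks like the following. The package names are the usual High Availability components; treat the exact output as illustrative:
> sudo SUSEConnect --status-text
> rpm -q corosync pacemaker crmsh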
5 Setting up the first node #
SUSE Linux Enterprise High Availability includes setup scripts to simplify the installation of a cluster. To set up the
cluster on the first node, use the crm cluster init script.
5.1 Overview of the crm cluster init script #
The crm cluster init command starts a script that defines the basic
parameters needed for cluster communication, resulting in a running one-node cluster.
The script checks and configures the following components:
- NTP
Checks if chrony is configured to start at boot time. If not, a message appears.
- SSH
Detects or generates SSH keys for passwordless login between cluster nodes.
- Firewall
Opens the ports in the firewall that are needed for cluster communication.
- Csync2
Configures Csync2 to replicate configuration files across all nodes in a cluster.
- Corosync
Configures the cluster communication system.
- SBD/watchdog
Checks if a watchdog exists and asks whether to configure SBD as the node fencing mechanism.
- Hawk cluster administration
Enables the Hawk service and displays the URL for the Hawk Web interface.
- Virtual floating IP
Asks whether to configure a virtual IP address for the Hawk Web interface.
- QDevice/QNetd
Asks whether to configure QDevice and QNetd to participate in quorum decisions. This is recommended for clusters with an even number of nodes, and especially for two-node clusters.
The options set by the crm cluster init script might not be the same
as the Pacemaker default settings. You can check which settings the script changed in
/var/log/crmsh/crmsh.log. Any options set during the bootstrap
process can be modified later with crmsh.
The crm cluster init script detects the system environment (for example,
Microsoft Azure) and adjusts certain cluster settings based on the profile for that environment.
For more information, see the file /etc/crm/profiles.yml.
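After the bootstrap finishes, you can review what it configured, for example by reading the log mentioned above and the generated Corosync configuration (its standard location is shown below):
> sudo less /var/log/crmsh/crmsh.log
> sudo cat /etc/corosync/corosync.conf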
5.2 Initializing the cluster on the first node #
Configure the cluster on the first node with the crm cluster init
script. The script prompts you for basic information about the cluster and configures
the required settings and services. For more information, run the
crm cluster init --help command.
SUSE Linux Enterprise High Availability is installed and up to date.
All nodes have at least two network interfaces or a network bond, with static IP addresses listed in the /etc/hosts file along with each node's FQDN and short host name.
Perform this procedure on only one node:
Log in to the first node either as the root user or as a user with sudo privileges.
Start the crm cluster init script:
> sudo crm cluster init
The script checks whether chrony is running, opens the required firewall ports, configures Csync2, and checks for SSH keys. If no SSH keys are available, the script generates them.
Configure Corosync for cluster communication:
Enter an IP address for the first communication channel (ring0). By default, the script proposes the address of the first available network interface. This could be either an individual interface or a bonded device. Accept this address or enter a different one.
If the script detects multiple network interfaces, it asks whether you want to configure a second communication channel (ring1). If you configured the first channel with a bonded device, you can decline with n. If you need to configure a second channel, confirm with y and enter the IP address of another network interface. The two interfaces must be on different subnets.
The script configures the default firewall ports for Corosync communication.
Choose whether to set up SBD as the node fencing mechanism:
Confirm with y that you want to use SBD.
When prompted for a path to a block device, enter none to configure diskless SBD.
The script configures SBD, including the relevant timeout settings. Unlike disk-based SBD, diskless SBD does not require a fence_sbd cluster resource.
If no hardware watchdog is available, the script configures the software watchdog softdog.
Configure a virtual IP address for cluster administration with the Hawk Web interface:
Confirm with y that you want to configure a virtual IP address.
Enter an unused IP address to use as the administration IP for Hawk.
Instead of logging in to Hawk on an individual cluster node, you can connect to the virtual IP address.
Choose whether to configure QDevice and QNetd:
For the minimal setup described in this document, decline with n.
The script starts the cluster services to bring the cluster online and enable Hawk. The URL to
use for Hawk is displayed on the screen. You can also check the status of the cluster with the
crm status command.
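For example, immediately after initialization, the output of crm status might look roughly like this. The node name and resource reflect the examples used later in this guide, and the exact formatting depends on the crmsh version:
> sudo crm status
[...]
Node List:
  * Online: [ alice ]
Full List of Resources:
  * admin-ip   (ocf:heartbeat:IPaddr2):    Started alice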
Important: Secure password for hacluster
The crm cluster init script creates a default cluster user and password.
Replace the default password with a secure one as soon as possible:
> sudo passwd hacluster
6 Adding the second and third nodes #
Add more nodes to the cluster with the crm cluster join script.
The script only needs access to an existing cluster node and completes the basic
setup on the current machine automatically. For more information, run the
crm cluster join --help command.
SUSE Linux Enterprise High Availability is installed and up to date.
An existing cluster is already running on at least one node.
All nodes have at least two network interfaces or a network bond, with static IP addresses listed in the /etc/hosts file along with each node's FQDN and short host name.
If you log in as a sudo user: The same user must exist on all nodes. This user must have passwordless sudo permission.
If you log in as the root user: Passwordless SSH authentication must be configured on all nodes.
Perform this procedure on each additional node:
Log in to this node as the same user you set up the first node with.
Start the crm cluster join script:
If you set up the first node as root, you can start the script with no additional parameters:
# crm cluster join
If you set up the first node as a sudo user, you must specify that user with the -c option:
> sudo crm cluster join -c USER@NODE1
The script checks if chrony is running, opens the required firewall ports, and configures Csync2.
If you did not already specify the first node with -c, you are prompted for its IP address or host name.
If you did not already configure passwordless SSH authentication between the nodes, you are prompted for the passwords of each of the existing nodes.
Configure Corosync for cluster communication:
The script proposes an IP address for ring0. This IP address must be on the same subnet as the IP address used for ring0 on the first node. If it is not, enter the correct IP address.
If the cluster has two Corosync communication channels configured, the script prompts you for an IP address for ring1. This IP address must be on the same subnet as the IP address used for ring1 on the first node.
The script copies the cluster configuration from the first node, adjusts the timeout settings to consider the new node, and brings the new node online.
You can check the status of the cluster with the crm status command.
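For example, once both additional nodes have joined, crm status run on any node should list all three nodes as online. The node names below are illustrative:
> sudo crm status
[...]
Node List:
  * Online: [ alice bob charlie ]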
Important: Secure password for hacluster
The crm cluster join script creates a default cluster user and password.
On each node, replace the default password with a secure one as soon as possible:
> sudo passwd hacluster
7 Logging in to Hawk #
Hawk allows you to monitor and administer a High Availability cluster using a graphical Web browser. You can also configure a virtual IP address that allows clients to connect to Hawk no matter which node it is running on.
The client machine must be able to connect to the cluster nodes.
The client machine must have a graphical Web browser with JavaScript and cookies enabled.
You can perform this procedure on any machine that can connect to the cluster nodes:
Start a Web browser and enter the following URL:
https://HAWKSERVER:7630/
Replace HAWKSERVER with the IP address or host name of a cluster node, or the Hawk virtual IP address if one is configured.
Note: Certificate warning
If a certificate warning appears when you access the URL for the first time, a self-signed certificate is in use. To verify the certificate, ask your cluster operator for the certificate details. To proceed anyway, you can add an exception in the browser to bypass the warning.
On the Hawk login screen, enter the username and password of the hacluster user.
Click Log In. The Hawk Web interface shows the Status screen by default.
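If the login page does not load, you can check on a cluster node that the Hawk service is running and listening on port 7630. The service is typically named hawk; the commands below are standard systemd and iproute2 tools, and the exact output varies:
> sudo systemctl status hawk
> ss -tln | grep 7630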
8 Testing the cluster #
The following tests can help you identify basic issues with the cluster setup. However, realistic tests involve specific use cases and scenarios. Before using the cluster in a production environment, test it thoroughly according to your use cases.
8.1 Testing resource failover #
Check whether the cluster moves resources to another node if the current node is
set to standby. This procedure uses example nodes called
alice and bob,
and a virtual IP resource called admin-ip with the example
IP address 192.168.1.10.
Open two terminals.
In the first terminal, ping the virtual IP address:
> ping 192.168.1.10
In the second terminal, log in to one of the cluster nodes.
Check which node the virtual IP address is running on:
> sudo crm status
[...]
Node List:
  * Online: [ alice bob ]
Full List of Resources:
  * admin-ip   (ocf:heartbeat:IPaddr2):    Started alice
Put alice into standby mode:
> sudo crm node standby alice
Check the cluster status again. The resource admin-ip should have migrated to bob:
> sudo crm status
[...]
Node List:
  * Node alice: standby
  * Online: [ bob ]
Full List of Resources:
  * admin-ip   (ocf:heartbeat:IPaddr2):    Started bob
In the first terminal, you should see an uninterrupted flow of pings to the virtual IP address during the migration. This shows that the cluster setup and the floating IP address work correctly.
Cancel the ping command with Ctrl–C.
In the second terminal, bring alice back online:
> sudo crm node online alice
8.2 Testing cluster failures #
The crm cluster crash_test command simulates cluster failures and
reports the results.
The command supports the following checks:
- --fence-node NODE
Fences a specific node passed from the command line.
- --kill-sbd / --kill-corosync / --kill-pacemakerd
Kills the daemons for SBD, Corosync, or Pacemaker. After running one of these tests, you can find a report in the directory /var/lib/crmsh/crash_test/. The report includes a test case description, action logging, and an explanation of possible results.
- --split-brain-iptables
Simulates a split-brain scenario by blocking the Corosync port, and checks whether one node can be fenced as expected. You must install iptables before you can run this test.
For more information, run the crm cluster crash_test --help command.
This example uses nodes called alice and
bob, and tests fencing bob. To watch
bob change status during the test, you can log in to Hawk and
navigate to › , or
run crm status from another node.
admin@alice> sudo crm cluster crash_test --fence-node bob
==============================================
Testcase:          Fence node bob
Fence action:      reboot
Fence timeout:     95

!!! WARNING WARNING WARNING !!!
THIS CASE MAY LEAD TO NODE BE FENCED.
TYPE Yes TO CONTINUE, OTHER INPUTS WILL CANCEL THIS CASE [Yes/No](No): Yes
INFO: Trying to fence node "bob"
INFO: Waiting 95s for node "bob" reboot...
INFO: Node "bob" will be fenced by "alice"!
INFO: Node "bob" was fenced by "alice" at DATE TIME
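After a test run, you could list the generated reports in the directory mentioned above; the file names depend on the test case and time of execution:
> sudo ls -l /var/lib/crmsh/crash_test/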
9 Next steps #
This guide describes a basic High Availability cluster that can be used for testing purposes. To expand this cluster for use in production environments, more steps are recommended:
- Adding more nodes
Add more nodes to the cluster using the crm cluster join script.
- Enabling a hardware watchdog
Before using the cluster in a production environment, replace softdog with a hardware watchdog.
- Adding more fencing devices
For critical workloads, we highly recommend having two or three fencing devices, using either physical devices or disk-based SBD.
- Configuring QDevice
QDevice and QNetd participate in quorum decisions. With assistance from the arbitrator QNetd, QDevice provides a configurable number of votes. This allows clusters to sustain more node failures than the standard quorum rules allow. We recommend deploying QDevice and QNetd in clusters with an even number of nodes, and especially in two-node clusters.
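If you decide to add QDevice later, crmsh provides a bootstrap stage for it. The sketch below assumes a separate machine running QNetd with the hypothetical host name qnetd-server; see crm cluster init --help for the options available in your version:
> sudo crm cluster init qdevice --qnetd-hostname=qnetd-server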
10 Legal Notice #
Copyright © 2006–2025 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.
