
Installation and Setup Quick Start

SUSE Linux Enterprise High Availability Extension 15 SP2

This document guides you through the setup of a very basic two-node cluster, using the bootstrap scripts provided by the ha-cluster-bootstrap package. This includes the configuration of a virtual IP address as a cluster resource and the use of SBD on shared storage as a node fencing mechanism.

Authors: Tanja Roth and Thomas Schraitle
Publication Date: November 24, 2020

1 Usage Scenario

The procedures in this document will lead to a minimal setup of a two-node cluster with the following properties:

  • Two nodes: alice (IP: 192.168.1.1) and bob (IP: 192.168.1.2), connected to each other via network.

  • A floating, virtual IP address (192.168.2.1) which allows clients to connect to the service no matter which physical node it is running on.

  • A shared storage device, used as SBD fencing mechanism. This avoids split brain scenarios.

  • Failover of resources from one node to the other if the active host breaks down (active/passive setup).

After setting up the cluster with the bootstrap scripts, we will monitor the cluster with Hawk2, one of the graphical cluster management tools included with SUSE® Linux Enterprise High Availability Extension. As a basic test of whether resource failover works, we will put one of the nodes into standby mode and check whether the virtual IP address is migrated to the second node.

You can use the two-node cluster for testing purposes or as a minimal cluster configuration that you can extend later on. Before using the cluster in a production environment, modify it according to your requirements.

2 System Requirements

This section informs you about the key system requirements for the scenario described in Section 1. To adjust the cluster for use in a production environment, refer to the full list in Chapter 2, System Requirements and Recommendations.

2.1 Hardware Requirements

Servers

Two servers with software as specified in Section 2.2, “Software Requirements”.

The servers can be bare metal or virtual machines. They do not require identical hardware (memory, disk space, etc.), but they must have the same architecture. Cross-platform clusters are not supported.

Communication Channels

At least two TCP/IP communication media per cluster node. The network equipment must support the communication means you want to use for cluster communication: multicast or unicast. The communication media should support a data rate of 100 Mbit/s or higher. For a supported cluster setup, two or more redundant communication paths are required. They can be set up via:

  • Network Device Bonding (preferred).

  • A second communication channel in Corosync.
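
For example, a second ring can be declared with an additional interface section in /etc/corosync/corosync.conf. The following is only a minimal sketch with placeholder addresses, assuming the classic rrp_mode syntax of Corosync 2.x; the bootstrap scripts and the YaST cluster module can generate this configuration for you:

totem {
    version: 2
    rrp_mode: passive                # use the second ring as a backup path
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0     # placeholder: network of the first channel
        mcastaddr: 239.25.1.1        # placeholder multicast address
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0        # placeholder: network of the second channel
        mcastaddr: 239.25.2.1        # placeholder multicast address
        mcastport: 5407
    }
}

With network device bonding, the redundancy is handled below Corosync, so a single ring bound to the bonding device (for example, bond0) is sufficient.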

Node Fencing/STONITH

To avoid a split brain scenario, clusters need a node fencing mechanism. In a split brain scenario, cluster nodes are divided into two or more groups that do not know about each other (because of a hardware or software failure or because of a cut network connection). A fencing mechanism isolates the node in question (usually by resetting or powering off the node). This is also called STONITH (Shoot the other node in the head). A node fencing mechanism can be either a physical device (a power switch) or a mechanism like SBD (STONITH by disk) in combination with a watchdog. Using SBD requires shared storage.

2.2 Software Requirements

All nodes that will be part of the cluster need at least the following modules and extensions:

  • Base System Module 15 SP2

  • Server Applications Module 15 SP2

  • SUSE Linux Enterprise High Availability Extension 15 SP2

2.3 Other Requirements and Recommendations

Time Synchronization

Cluster nodes must synchronize to an NTP server outside the cluster. Since SUSE Linux Enterprise High Availability Extension 15, chrony is the default implementation of NTP. For more information, see the Administration Guide for SUSE Linux Enterprise Server 15 SP2.

If nodes are not synchronized, the cluster may not work properly. In addition, log files and cluster reports are very hard to analyze without synchronization. If you use the bootstrap scripts, you will be warned if NTP is not configured yet.
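
If chrony is not set up yet, a minimal sketch of a manual configuration might look like this (ntp.example.com is a placeholder for your NTP server):

root # echo "server ntp.example.com iburst" >> /etc/chrony.conf
root # systemctl enable --now chronyd
root # chronyc sources      # verify that the time source is reachable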

Host Name and IP Address
  • Use static IP addresses.

  • List all cluster nodes in the /etc/hosts file with their fully qualified host name and short host name. It is essential that members of the cluster can find each other by name. If the names are not available, internal cluster communication will fail.
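
For the scenario from Section 1, the /etc/hosts entries on both nodes might look like the following (example.com is a placeholder domain):

192.168.1.1   alice.example.com   alice
192.168.1.2   bob.example.com     bob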

SSH

All cluster nodes must be able to access each other via SSH. Tools like crm report (for troubleshooting) and Hawk2's History Explorer require passwordless SSH access between the nodes, otherwise they can only collect data from the current node.

If you use the bootstrap scripts for setting up the cluster, the SSH keys will automatically be created and copied.
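
If you set up the cluster without the bootstrap scripts, passwordless SSH access can be configured manually; a minimal sketch, run on alice, might look like this:

root # ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa    # create a key pair without passphrase
root # ssh-copy-id root@bob                            # copy the public key to the other node
root # ssh root@bob hostname                           # verify that passwordless login works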

3 Overview of the Bootstrap Scripts

All commands from the ha-cluster-bootstrap package execute bootstrap scripts that require only a minimum of time and manual intervention.

  • With ha-cluster-init, define the basic parameters needed for cluster communication. This leaves you with a running one-node cluster.

  • With ha-cluster-join, add more nodes to your cluster.

  • With ha-cluster-remove, remove nodes from your cluster.

All bootstrap scripts log to /var/log/ha-cluster-bootstrap.log. Check this file for any details of the bootstrap process. Any options set during the bootstrap process can be modified later with the YaST cluster module. See Chapter 4, Using the YaST Cluster Module for details.

Each script comes with a man page covering the range of functions, the script's options, and an overview of the files the script can create and modify.

The bootstrap script ha-cluster-init checks and configures the following components:

NTP

If NTP has not been configured to start at boot time, a message appears. Since SUSE Linux Enterprise High Availability Extension 15, chrony is the default implementation of NTP.

SSH

It creates SSH keys for passwordless login between cluster nodes.

Csync2

It configures Csync2 to replicate configuration files across all nodes in a cluster.

Corosync

It configures the cluster communication system.

SBD/Watchdog

It checks if a watchdog exists and asks you whether to configure SBD as node fencing mechanism.

Virtual Floating IP

It asks you whether to configure a virtual IP address for cluster administration with Hawk2.

Firewall

It opens the ports in the firewall that are needed for cluster communication.

Cluster Name

It defines a name for the cluster, by default hacluster. This is optional and mostly useful for Geo clusters. Usually, the cluster name reflects the location and makes it easier to distinguish a site inside a Geo cluster.

QDevice/QNetd

This setup is not covered here. If you want to use a QNetd server, you can set it up with the bootstrap script as described in Chapter 12, QDevice and QNetd.

4 Installing SUSE Linux Enterprise High Availability Extension

The packages for configuring and managing a cluster with the High Availability Extension are included in the High Availability installation pattern (named ha_sles on the command line). This pattern is only available after SUSE Linux Enterprise High Availability Extension has been installed as an extension to SUSE® Linux Enterprise Server.

For information on how to install extensions, see the Deployment Guide for SUSE Linux Enterprise Server 15 SP2.

Procedure 1: Installing the High Availability Pattern

If the pattern is not installed yet, proceed as follows:

  1. Install it via command line using Zypper:

    root # zypper install -t pattern ha_sles
  2. Install the High Availability pattern on all machines that will be part of your cluster.

    Note
    Note: Installing Software Packages on All Parties

    For an automated installation of SUSE Linux Enterprise Server 15 SP2 and SUSE Linux Enterprise High Availability Extension 15 SP2 use AutoYaST to clone existing nodes. For more information see Section 3.2, “Mass Installation and Deployment with AutoYaST”.

  3. Register the machines at SUSE Customer Center. Find more information in the Upgrade Guide for SUSE Linux Enterprise Server 15 SP2.
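
Registration can also be done from the command line with SUSEConnect. The following is a sketch only; the registration codes and the product identifier are placeholders, and SUSEConnect --list-extensions shows the exact product names available on your system:

root # SUSEConnect -r REGISTRATION_CODE -e EMAIL_ADDRESS    # register the base system
root # SUSEConnect --list-extensions                        # show available modules and extensions
root # SUSEConnect -p sle-ha/15.2/x86_64 -r HA_REGISTRATION_CODE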

5 Using SBD as Fencing Mechanism

If you have shared storage, for example, a SAN (Storage Area Network), you can use it to avoid split brain scenarios. To do so, configure SBD as node fencing mechanism. SBD uses watchdog support and the external/sbd STONITH resource agent.

5.1 Requirements for SBD

During setup of the first node with ha-cluster-init, you can decide whether to use SBD. If yes, you need to enter the path to the shared storage device. By default, ha-cluster-init will automatically create a small partition on the device to be used for SBD.

To use SBD, the following requirements must be met:

  • The path to the shared storage device must be persistent and consistent across all nodes in the cluster. Use stable device names such as /dev/disk/by-id/dm-uuid-part1-mpath-abcedf12345.

  • The SBD device must not use host-based RAID or LVM2, and must not reside on a DRBD* instance.
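
To find a stable name for your shared storage device, you can list the persistent links that udev creates, for example:

root # ls -l /dev/disk/by-id/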

For details of how to set up shared storage, refer to the Storage Administration Guide for SUSE Linux Enterprise Server 15 SP2.

5.2 Enabling the Softdog Watchdog for SBD

In SUSE Linux Enterprise Server, watchdog support in the kernel is enabled by default: It ships with several kernel modules that provide hardware-specific watchdog drivers. The High Availability Extension uses the SBD daemon as the software component that feeds the watchdog.

The following procedure uses the softdog watchdog.

Important
Important: Softdog Limitations

The softdog driver assumes that at least one CPU is still running. If all CPUs are stuck, the code in the softdog driver that should reboot the system will never be executed. In contrast, hardware watchdogs keep working even if all CPUs are stuck.

Before using the cluster in a production environment, we highly recommend replacing the softdog module with the hardware watchdog module that best fits your hardware.

However, if no watchdog matches your hardware, softdog can be used as kernel watchdog module.

  1. Create a persistent, shared storage as described in Section 5.1, “Requirements for SBD”.

  2. Enable the softdog watchdog:

    root # echo softdog > /etc/modules-load.d/watchdog.conf
    root # systemctl restart systemd-modules-load
  3. Test if the softdog module is loaded correctly:

    root # lsmod | grep dog
    softdog                16384  1

We highly recommend testing the SBD fencing mechanism for proper function to prevent a split brain scenario. Such a test can be done by blocking the Corosync cluster communication.
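
After the bootstrap script has initialized SBD (see Section 6), you can also inspect the SBD device from the command line; a sketch, with the device path as a placeholder:

root # sbd -d /dev/disk/by-id/DEVICE_ID dump    # show the SBD header with its timeouts
root # sbd -d /dev/disk/by-id/DEVICE_ID list    # list the node slots and their messages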

6 Setting Up the First Node

Set up the first node with the ha-cluster-init script. This requires only a minimum of time and manual intervention.

Procedure 2: Setting Up the First Node (alice) with ha-cluster-init
  1. Log in as root to the physical or virtual machine to use as cluster node.

  2. Start the bootstrap script by executing:

    root # ha-cluster-init --name CLUSTERNAME

    Replace the CLUSTERNAME placeholder with a meaningful name, like the geographical location of your cluster (for example, amsterdam). This is especially helpful to create a Geo cluster later on, as it simplifies the identification of a site.

    If you need unicast instead of multicast (the default) for your cluster communication, use the option -u. After installation, find the value udpu in the file /etc/corosync/corosync.conf. If ha-cluster-init detects a node running on Amazon Web Services (AWS), the script will use unicast automatically as default for cluster communication.

    The script checks for NTP configuration and a hardware watchdog service. It generates the public and private SSH keys used for SSH access and Csync2 synchronization, and starts the respective services.

  3. Configure the cluster communication layer (Corosync):

    1. Enter a network address to bind to. By default, the script will propose the network address of eth0. Alternatively, enter a different network address, for example the address of bond0.

    2. Enter a multicast address. The script proposes a random address that you can use as default. Of course, your particular network needs to support this multicast address.

    3. Enter a multicast port. The script proposes 5405 as default.

  4. Set up SBD as node fencing mechanism:

    1. Confirm with y that you want to use SBD.

    2. Enter a persistent path to the partition of your block device that you want to use for SBD (see Section 5, “Using SBD as Fencing Mechanism”). The path must be consistent across all nodes in the cluster.

  5. Configure a virtual IP address for cluster administration with Hawk2. (We will use this virtual IP resource for testing successful failover later on).

    1. Confirm with y that you want to configure a virtual IP address.

    2. Enter an unused IP address that you want to use as administration IP for Hawk2: 192.168.2.1

      Instead of logging in to an individual cluster node with Hawk2, you can connect to the virtual IP address.

Finally, the script will start the Pacemaker service to bring the cluster online and enable Hawk2. The URL to use for Hawk2 is displayed on the screen.
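
Apart from Hawk2, you can also verify the result from the command line, for example:

root # crm status                     # show the nodes and resources of the new cluster
root # systemctl status pacemaker     # confirm that the cluster services are running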

You now have a running one-node cluster. To view its status, proceed as follows:

Procedure 3: Logging In to the Hawk2 Web Interface
  1. On any machine, start a Web browser and make sure that JavaScript and cookies are enabled.

  2. As URL, enter the IP address or host name of any cluster node running the Hawk Web service. Alternatively, enter the virtual IP address that you configured in Step 5 of Procedure 2, “Setting Up the First Node (alice) with ha-cluster-init”:

    https://HAWKSERVER:7630/
    Note
    Note: Certificate Warning

    If a certificate warning appears when you try to access the URL for the first time, a self-signed certificate is in use. Self-signed certificates are not considered trustworthy by default.

    Ask your cluster operator for the certificate details to verify the certificate.

    To proceed anyway, you can add an exception in the browser to bypass the warning.

  3. On the Hawk2 login screen, enter the Username and Password of the user that has been created during the bootstrap procedure (user hacluster, password linux).

    Important
    Important: Secure Password

    Replace the default password with a secure one as soon as possible:

    root # passwd hacluster
  4. Click Log In. After login, the Hawk2 Web interface shows the Status screen by default, displaying the current cluster status at a glance:

    Status of the One-Node Cluster in Hawk2
    Figure 1: Status of the One-Node Cluster in Hawk2

7 Adding the Second Node

If you have a one-node cluster up and running, add the second cluster node with the ha-cluster-join bootstrap script, as described in Procedure 4. The script only needs access to an existing cluster node and will complete the basic setup on the current machine automatically. For details, refer to the ha-cluster-join man page.

The bootstrap scripts take care of changing the configuration specific to a two-node cluster, for example, SBD and Corosync.

Procedure 4: Adding the Second Node (bob) with ha-cluster-join
  1. Log in as root to the physical or virtual machine that is supposed to join the cluster.

  2. Start the bootstrap script by executing:

    root # ha-cluster-join

    If NTP has not been configured to start at boot time, a message appears. The script also checks for a hardware watchdog device (which is important in case you want to configure SBD). You are warned if none is present.

  3. If you decide to continue anyway, you will be prompted for the IP address of an existing node. Enter the IP address of the first node (alice, 192.168.1.1).

  4. If you have not already configured a passwordless SSH access between both machines, you will be prompted for the root password of the existing node.

    After logging in to the specified node, the script will copy the Corosync configuration, configure SSH and Csync2, and bring the current machine online as a new cluster node. Apart from that, it will start the service needed for Hawk2.

Check the cluster status in Hawk2. Under Status › Nodes you should see two nodes with a green status (see Figure 2, “Status of the Two-Node Cluster”).

Status of the Two-Node Cluster
Figure 2: Status of the Two-Node Cluster
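
Alternatively, you can verify the membership from the command line, for example:

root # crm_mon -1             # one-shot cluster status; both nodes should be listed as online
root # corosync-cfgtool -s    # show the status of the Corosync rings on this node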

8 Testing the Cluster

Section 8.1, “Testing Resource Failover” is a simple test to check if the cluster moves the virtual IP address to the other node in case the node that currently runs the resource is set to standby.

However, a realistic test involves specific use cases and scenarios, including testing of your fencing mechanism to avoid a split brain situation. If you have not set up your fencing mechanism correctly, the cluster will not work properly.

Before using the cluster in a production environment, test it thoroughly according to your use cases or with the help of the ha-cluster-preflight-check script.

8.1 Testing Resource Failover

As a quick test, the following procedure checks on resource failovers:

Procedure 5: Testing Resource Failover
  1. Open a terminal and ping 192.168.2.1, your virtual IP address:

    root # ping 192.168.2.1
  2. Log in to your cluster as described in Procedure 3, “Logging In to the Hawk2 Web Interface”.

  3. In Hawk2 Status › Resources, check which node the virtual IP address (resource admin_addr) is running on. We assume the resource is running on alice.

  4. Put alice into Standby mode (see Figure 3, “Node alice in Standby Mode”).

    Node alice in Standby Mode
    Figure 3: Node alice in Standby Mode
  5. Click Status › Resources. The resource admin_addr has been migrated to bob.

During the migration, you should see an uninterrupted flow of pings to the virtual IP address. This shows that the cluster setup and the floating IP work correctly. Cancel the ping command with Ctrl+C.
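
If you prefer the command line, the same failover test can be sketched with the crm shell (run on either node):

root # crm node standby alice    # stop resources on alice; admin_addr should move to bob
root # crm status                # check that the virtual IP now runs on bob
root # crm node online alice     # bring alice back online when the test is done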

8.2 Testing with the ha-cluster-preflight-check Command

The command ha-cluster-preflight-check runs standardized tests for a cluster. It triggers cluster failures and verifies configuration to find problems. Before you use your cluster in production, it is recommended to use this command to make sure everything works as expected.

The command supports the following checks:

  • Environment Check (-e/--env-check). This test checks:

    • Are host names resolvable?

    • Is the time service enabled and started?

    • Does the current node have a watchdog configured?

    • Is the firewalld service started and are cluster-related ports open?

  • Cluster State Check (-c/--cluster-check). Checks different states and services of the cluster. This test checks:

    • Are cluster services (Pacemaker/Corosync) enabled and running?

    • Is STONITH enabled? It also checks whether STONITH-related resources are configured and started. If you configured SBD, is the SBD service started?

    • Does the cluster have quorum? Shows the current DC node, and which nodes are online, offline, or unclean.

    • Do we have started, stopped, or failed resources?

  • Split Brain Check (--split-brain-iptables). Simulates a split brain scenario by blocking the Corosync port. Checks whether one node can be fenced as expected.

  • Kills the daemons for SBD, Corosync, and Pacemaker (--kill-sbd/--kill-corosync/--kill-pacemakerd). After running such a test, you can find a report in /var/lib/ha-cluster-preflight-check. The report includes a test case description, action logging, and an explanation of the possible results.

  • Fence Node Check (--fence-node). Fences the specific node passed on the command line.

For example, to view the current cluster state and then test the environment, run:

root # crm_mon -1
Stack: corosync
Current DC: alice (version ...) - partition with quorum
Last updated: Fri Mar 03 14:40:21 2020
Last change: Fri Mar 03 14:35:07 2020 by root via cibadmin on alice

2 nodes configured
1 resource configured

Online: [ alice bob ]
Active resources:

 stonith-sbd    (stonith:external/sbd): Started alice

root # ha-cluster-preflight-check -e
[2020/03/20 14:40:45]INFO: Checking hostname resolvable [Pass]
[2020/03/20 14:40:45]INFO: Checking time service [Fail]
 INFO: chronyd.service is available
 WARNING: chronyd.service is disabled
 WARNING: chronyd.service is not active
[2020/03/20 14:40:45]INFO: Checking watchdog [Pass]
[2020/03/20 14:40:45]INFO: Checking firewall [Fail]
 INFO: firewalld.service is available
 WARNING: firewalld.service is not active

You can inspect the result in /var/log/ha-cluster-preflight-check.log.

9 For More Information

More documentation for this product is available at https://documentation.suse.com/sle-ha/15-SP2. For further configuration and administration tasks, see the comprehensive Administration Guide.

10 Legal Notice

Copyright © 2006–2020 SUSE LLC and contributors. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled GNU Free Documentation License.

For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.
