Jump to content
documentation.suse.com / Administration Guide
SUSE Linux Enterprise High Availability 15 SP4

Administration Guide

This guide is intended for administrators who need to set up, configure, and maintain clusters with SUSE® Linux Enterprise High Availability. For quick and efficient configuration and administration, the product includes both a graphical user interface and a command line interface (CLI). For performing key tasks, both approaches are covered in this guide. Thus, you can choose the appropriate tool that matches your needs.

Publication Date: December 05, 2024

Copyright © 2006–2024 SUSE LLC and contributors. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled GNU Free Documentation License.

For SUSE trademarks, see https://www.suse.com/company/legal/. All third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.

Preface

1 Available documentation

Online documentation

Our documentation is available online at https://documentation.suse.com. Browse or download the documentation in various formats.

Note
Note: Latest updates

The latest updates are usually available in the English-language version of this documentation.

SUSE Knowledgebase

If you have run into an issue, also check out the Technical Information Documents (TIDs) that are available online at https://www.suse.com/support/kb/. Search the SUSE Knowledgebase for known solutions driven by customer need.

Release notes

For release notes, see https://www.suse.com/releasenotes/.

In your system

For offline use, the release notes are also available under /usr/share/doc/release-notes on your system. The documentation for individual packages is available at /usr/share/doc/packages.

Many commands are also described in their manual pages. To view them, run man, followed by a specific command name. If the man command is not installed on your system, install it with sudo zypper install man.

2 Improving the documentation

Your feedback and contributions to this documentation are welcome. The following channels for giving feedback are available:

Service requests and support

For services and support options available for your product, see https://www.suse.com/support/.

To open a service request, you need a SUSE subscription registered at SUSE Customer Center. Go to https://scc.suse.com/support/requests, log in, and click Create New.

Bug reports

Report issues with the documentation at https://bugzilla.suse.com/.

To simplify this process, click the Report an issue icon next to a headline in the HTML version of this document. This preselects the right product and category in Bugzilla and adds a link to the current section. You can start typing your bug report right away.

A Bugzilla account is required.

Contributions

To contribute to this documentation, click the Edit source document icon next to a headline in the HTML version of this document. This will take you to the source code on GitHub, where you can open a pull request.

A GitHub account is required.

Note
Note: Edit source document only available for English

The Edit source document icons are only available for the English version of each document. For all other languages, use the Report an issue icons instead.

For more information about the documentation environment used for this documentation, see the repository's README.

Mail

You can also report errors and send feedback concerning the documentation to <>. Include the document title, the product version, and the publication date of the document. Additionally, include the relevant section number and title (or provide the URL) and provide a concise description of the problem.

3 Documentation conventions

The following notices and typographic conventions are used in this document:

  • /etc/passwd: Directory names and file names

  • PLACEHOLDER: Replace PLACEHOLDER with the actual value

  • PATH: An environment variable

  • ls, --help: Commands, options, and parameters

  • user: The name of user or group

  • package_name: The name of a software package

  • Alt, AltF1: A key to press or a key combination. Keys are shown in uppercase as on a keyboard.

  • File, File › Save As: menu items, buttons

  • AMD/Intel This paragraph is only relevant for the AMD64/Intel 64 architectures. The arrows mark the beginning and the end of the text block.

    IBM Z, POWER This paragraph is only relevant for the architectures IBM Z and POWER. The arrows mark the beginning and the end of the text block.

  • Chapter 1, Example chapter: A cross-reference to another chapter in this guide.

  • Commands that must be run with root privileges. You can also prefix these commands with the sudo command to run them as a non-privileged user:

    # command
    > sudo command
  • Commands that can be run by non-privileged users:

    > command
  • Commands can be split into two or multiple lines by a backslash character (\) at the end of a line. The backslash informs the shell that the command invocation will continue after the line's end:

    > echo a b \
    c d
  • A code block that shows both the command (preceded by a prompt) and the respective output returned by the shell:

    > command
    output
  • Commands executed in the interactive crm shell.

    crm(live)# 
  • Notices

    Warning
    Warning: Warning notice

    Vital information you must be aware of before proceeding. Warns you about security issues, potential loss of data, damage to hardware, or physical hazards.

    Important
    Important: Important notice

    Important information you should be aware of before proceeding.

    Note
    Note: Note notice

    Additional information, for example about differences in software versions.

    Tip
    Tip: Tip notice

    Helpful information, like a guideline or a piece of practical advice.

  • Compact Notices

    Note

    Additional information, for example about differences in software versions.

    Tip

    Helpful information, like a guideline or a piece of practical advice.

For an overview of naming conventions for cluster nodes and names, resources, and constraints, see Appendix B, Naming conventions.

4 Support

Find the support statement for SUSE Linux Enterprise High Availability and general information about technology previews below. For details about the product lifecycle, see https://www.suse.com/lifecycle.

If you are entitled to support, find details on how to collect information for a support ticket at https://documentation.suse.com/sles-15/html/SLES-all/cha-adm-support.html.

4.1 Support statement for SUSE Linux Enterprise High Availability

To receive support, you need an appropriate subscription with SUSE. To view the specific support offers available to you, go to https://www.suse.com/support/ and select your product.

The support levels are defined as follows:

L1

Problem determination, which means technical support designed to provide compatibility information, usage support, ongoing maintenance, information gathering and basic troubleshooting using available documentation.

L2

Problem isolation, which means technical support designed to analyze data, reproduce customer problems, isolate a problem area and provide a resolution for problems not resolved by Level 1 or prepare for Level 3.

L3

Problem resolution, which means technical support designed to resolve problems by engaging engineering to resolve product defects which have been identified by Level 2 Support.

For contracted customers and partners, SUSE Linux Enterprise High Availability is delivered with L3 support for all packages, except for the following:

  • Technology previews.

  • Sound, graphics, fonts, and artwork.

  • Packages that require an additional customer contract.

  • Some packages shipped as part of the module Workstation Extension are L2-supported only.

  • Packages with names ending in -devel (containing header files and similar developer resources) will only be supported together with their main packages.

SUSE will only support the usage of original packages. That is, packages that are unchanged and not recompiled.

4.2 Technology previews

Technology previews are packages, stacks, or features delivered by SUSE to provide glimpses into upcoming innovations. Technology previews are included for your convenience to give you a chance to test new technologies within your environment. We would appreciate your feedback. If you test a technology preview, please contact your SUSE representative and let them know about your experience and use cases. Your input is helpful for future development.

Technology previews have the following limitations:

  • Technology previews are still in development. Therefore, they may be functionally incomplete, unstable, or otherwise not suitable for production use.

  • Technology previews are not supported.

  • Technology previews may only be available for specific hardware architectures.

  • Details and functionality of technology previews are subject to change. As a result, upgrading to subsequent releases of a technology preview may be impossible and require a fresh installation.

  • SUSE may discover that a preview does not meet customer or market needs, or does not comply with enterprise standards. Technology previews can be removed from a product at any time. SUSE does not commit to providing a supported version of such technologies in the future.

For an overview of technology previews shipped with your product, see the release notes at https://www.suse.com/releasenotes.

Part I Installation and setup

  • 1 Product overview
  • SUSE® Linux Enterprise High Availability is an integrated suite of open source clustering technologies. It enables you to implement highly available physical and virtual Linux clusters, and to eliminate single points of failure. It ensures the high availability and manageability of critical network resources including data, applications, and services. Thus, it helps you maintain business continuity, protect data integrity, and reduce unplanned downtime for your mission-critical Linux workloads.

    It ships with essential monitoring, messaging, and cluster resource management functionality (supporting failover, failback, and migration (load balancing) of individually managed cluster resources).

    This chapter introduces the main product features and benefits of SUSE Linux Enterprise High Availability. Inside you will find several example clusters and learn about the components making up a cluster. The last section provides an overview of the architecture, describing the individual architecture layers and processes within the cluster.

    For explanations of some common terms used in the context of High Availability clusters, refer to Glossary.

  • 2 System requirements and recommendations
  • The following section informs you about system requirements, and some prerequisites for SUSE® Linux Enterprise High Availability. It also includes recommendations for cluster setup.

  • 3 Installing SUSE Linux Enterprise High Availability
  • If you are setting up a High Availability cluster with SUSE® Linux Enterprise High Availability for the first time, the easiest way is to start with a basic two-node cluster. You can also use the two-node cluster to run some tests. Afterward, you can add more nodes by cloning existing cluster nodes with AutoYaST. The cloned nodes will have the same packages installed and the same system configuration as the original ones.

    If you want to upgrade an existing cluster that runs an older version of SUSE Linux Enterprise High Availability, refer to Chapter 29, Upgrading your cluster and updating software packages.

  • 4 Using the YaST cluster module
  • The YaST cluster module allows you to set up a cluster manually (from scratch) or to modify options for an existing cluster.

    However, if you prefer an automated approach for setting up a cluster, refer to Article “Installation and Setup Quick Start”. It describes how to install the needed packages and leads you to a basic two-node cluster, which is set up with the bootstrap scripts provided by the crm shell.

    You can also use a combination of both setup methods, for example: set up one node with YaST cluster and then use one of the bootstrap scripts to integrate more nodes (or vice versa).

1 Product overview

SUSE® Linux Enterprise High Availability is an integrated suite of open source clustering technologies. It enables you to implement highly available physical and virtual Linux clusters, and to eliminate single points of failure. It ensures the high availability and manageability of critical network resources including data, applications, and services. Thus, it helps you maintain business continuity, protect data integrity, and reduce unplanned downtime for your mission-critical Linux workloads.

It ships with essential monitoring, messaging, and cluster resource management functionality (supporting failover, failback, and migration (load balancing) of individually managed cluster resources).

This chapter introduces the main product features and benefits of SUSE Linux Enterprise High Availability. Inside you will find several example clusters and learn about the components making up a cluster. The last section provides an overview of the architecture, describing the individual architecture layers and processes within the cluster.

For explanations of some common terms used in the context of High Availability clusters, refer to Glossary.

1.1 Availability as a module or extension

High Availability is available as a module or extension for several products. For details, see https://documentation.suse.com/sles/html/SLES-all/article-modules.html#art-modules-high-availability.

1.2 Key features

SUSE® Linux Enterprise High Availability helps you ensure and manage the availability of your network resources. The following sections highlight some of the key features:

1.2.1 Wide range of clustering scenarios

SUSE Linux Enterprise High Availability supports the following scenarios:

  • Active/active configurations

  • Active/passive configurations: N+1, N+M, N to 1, N to M

  • Hybrid physical and virtual clusters, allowing virtual servers to be clustered with physical servers. This improves service availability and resource usage.

  • Local clusters

  • Metro clusters (stretched local clusters)

  • Geo clusters (geographically dispersed clusters)

Important
Important: No support for mixed architectures

All nodes belonging to a cluster should have the same processor platform: x86, IBM Z, or POWER. Clusters of mixed architectures are not supported.

Your cluster can contain up to 32 Linux servers. Using pacemaker_remote, the cluster can be extended to include additional Linux servers beyond this limit. Any server in the cluster can restart resources (applications, services, IP addresses, and file systems) from a failed server in the cluster.

1.2.2 Flexibility

SUSE Linux Enterprise High Availability ships with Corosync messaging and membership layer and Pacemaker Cluster Resource Manager. Using Pacemaker, administrators can continually monitor the health and status of their resources, and manage dependencies. They can automatically stop and start services based on highly configurable rules and policies. SUSE Linux Enterprise High Availability allows you to tailor a cluster to the specific applications and hardware infrastructure that fit your organization. Time-dependent configuration enables services to automatically migrate back to repaired nodes at specified times.

1.2.3 Storage and data replication

With SUSE Linux Enterprise High Availability you can dynamically assign and reassign server storage as needed. It supports Fibre Channel or iSCSI storage area networks (SANs). Shared disk systems are also supported, but they are not a requirement. SUSE Linux Enterprise High Availability also comes with a cluster-aware file system (OCFS2) and the cluster Logical Volume Manager (Cluster LVM). For replication of your data, use DRBD* to mirror the data of a High Availability service from the active node of a cluster to its standby node. Furthermore, SUSE Linux Enterprise High Availability also supports CTDB (Cluster Trivial Database), a technology for Samba clustering.

1.2.4 Support for virtualized environments

SUSE Linux Enterprise High Availability supports the mixed clustering of both physical and virtual Linux servers. SUSE Linux Enterprise Server 15 SP4 ships with Xen, an open source virtualization hypervisor, and with KVM (Kernel-based Virtual Machine). KVM is a virtualization software for Linux which is based on hardware virtualization extensions. The cluster resource manager in SUSE Linux Enterprise High Availability can recognize, monitor, and manage services running within virtual servers and services running in physical servers. Guest systems can be managed as services by the cluster.

Important
Important: Live migration in High Availability clusters

Use caution when performing live migration of nodes in an active cluster. The cluster stack might not tolerate an operating system freeze caused by the live migration process, which could lead to the node being fenced.

We recommend either of the following actions to help avoid node fencing during live migration:

You must thoroughly test this setup before attempting live migration in a production environment.

1.2.5 Support of local, metro, and Geo clusters

SUSE Linux Enterprise High Availability supports different geographical scenarios, including geographically dispersed clusters (Geo clusters).

Local clusters

A single cluster in one location (for example, all nodes are located in one data center). The cluster uses multicast or unicast for communication between the nodes and manages failover internally. Network latency can be neglected. Storage is typically accessed synchronously by all nodes.

Metro clusters

A single cluster that can stretch over multiple buildings or data centers, with all sites connected by fibre channel. The cluster uses multicast or unicast for communication between the nodes and manages failover internally. Network latency is usually low (<5 ms for distances of approximately 20 miles). Storage is frequently replicated (mirroring or synchronous replication).

Geo clusters (multi-site clusters)

Multiple, geographically dispersed sites with a local cluster each. The sites communicate via IP. Failover across the sites is coordinated by a higher-level entity. Geo clusters need to cope with limited network bandwidth and high latency. Storage is replicated asynchronously.

Note
Note: Geo clustering and SAP workloads

Currently Geo clusters neither support SAP HANA system replication nor SAP S/4HANA and SAP NetWeaver enqueue replication setups.

The greater the geographical distance between individual cluster nodes, the more factors may potentially disturb the high availability of services the cluster provides. Network latency, limited bandwidth and access to storage are the main challenges for long-distance clusters.

1.2.6 Resource agents

SUSE Linux Enterprise High Availability includes a huge number of resource agents to manage resources such as Apache, IPv4, IPv6 and many more. It also ships with resource agents for popular third party applications such as IBM WebSphere Application Server. For an overview of Open Cluster Framework (OCF) resource agents included with your product, use the crm ra command as described in Section 5.5.3, “Displaying information about OCF resource agents”.

1.2.7 User-friendly administration tools

SUSE Linux Enterprise High Availability ships with a set of powerful tools. Use them for basic installation and setup of your cluster and for effective configuration and administration:

YaST

A graphical user interface for general system installation and administration. Use it to install SUSE Linux Enterprise High Availability on top of SUSE Linux Enterprise Server as described in the Installation and Setup Quick Start. YaST also provides the following modules in the High Availability category to help configure your cluster or individual components:

Hawk2

A user-friendly Web-based interface with which you can monitor and administer your High Availability clusters from Linux or non-Linux machines alike. Hawk2 can be accessed from any machine inside or outside of the cluster by using a (graphical) Web browser. Therefore it is the ideal solution even if the system on which you are working only provides a minimal graphical user interface. For details, Section 5.4, “Introduction to Hawk2”.

crm Shell

A powerful unified command line interface to configure resources and execute all monitoring or administration tasks. For details, refer to Section 5.5, “Introduction to crmsh”.

1.3 Benefits

SUSE Linux Enterprise High Availability allows you to configure up to 32 Linux servers into a high-availability cluster (HA cluster). Resources can be dynamically switched or moved to any node in the cluster. Resources can be configured to automatically migrate if a node fails, or they can be moved manually to troubleshoot hardware or balance the workload.

SUSE Linux Enterprise High Availability provides high availability from commodity components. Lower costs are obtained through the consolidation of applications and operations onto a cluster. SUSE Linux Enterprise High Availability also allows you to centrally manage the complete cluster. You can adjust resources to meet changing workload requirements (thus, manually load balance the cluster). Allowing clusters of more than two nodes also provides savings by allowing several nodes to share a hot spare.

An equally important benefit is the potential reduction of unplanned service outages and planned outages for software and hardware maintenance and upgrades.

Reasons that you would want to implement a cluster include:

  • Increased availability

  • Improved performance

  • Low cost of operation

  • Scalability

  • Disaster recovery

  • Data protection

  • Server consolidation

  • Storage consolidation

Shared disk fault tolerance can be obtained by implementing RAID on the shared disk subsystem.

The following scenario illustrates some benefits SUSE Linux Enterprise High Availability can provide.

Example cluster scenario

Suppose you have configured a three-node cluster, with a Web server installed on each of the three nodes in the cluster. Each of the nodes in the cluster hosts two Web sites. All the data, graphics, and Web page content for each Web site are stored on a shared disk subsystem connected to each of the nodes in the cluster. The following figure depicts how this setup might look.

Three-server cluster
Figure 1.1: Three-server cluster

During normal cluster operation, each node is in constant communication with the other nodes in the cluster and performs periodic polling of all registered resources to detect failure.

Suppose Web Server 1 experiences hardware or software problems and the users depending on Web Server 1 for Internet access, e-mail, and information lose their connections. The following figure shows how resources are moved when Web Server 1 fails.

Three-server cluster after one server fails
Figure 1.2: Three-server cluster after one server fails

Web Site A moves to Web Server 2 and Web Site B moves to Web Server 3. IP addresses and certificates also move to Web Server 2 and Web Server 3.

When you configured the cluster, you decided where the Web sites hosted on each Web server would go should a failure occur. In the previous example, you configured Web Site A to move to Web Server 2 and Web Site B to move to Web Server 3. This way, the workload formerly handled by Web Server 1 continues to be available and is evenly distributed between any surviving cluster members.

When Web Server 1 failed, the High Availability software did the following:

  • Detected a failure and verified with STONITH that Web Server 1 was really dead. STONITH is an acronym for Shoot The Other Node In The Head. It is a means of bringing down misbehaving nodes to prevent them from causing trouble in the cluster.

  • Remounted the shared data directories that were formerly mounted on Web server 1 on Web Server 2 and Web Server 3.

  • Restarted applications that were running on Web Server 1 on Web Server 2 and Web Server 3.

  • Transferred IP addresses to Web Server 2 and Web Server 3.

In this example, the failover process happened quickly and users regained access to Web site information within seconds, usually without needing to log in again.

Now suppose the problems with Web Server 1 are resolved, and Web Server 1 is returned to a normal operating state. Web Site A and Web Site B can either automatically fail back (move back) to Web Server 1, or they can stay where they are. This depends on how you configured the resources for them. Migrating the services back to Web Server 1 will incur some down-time. Therefore SUSE Linux Enterprise High Availability also allows you to defer the migration until a period when it will cause little or no service interruption. There are advantages and disadvantages to both alternatives.

SUSE Linux Enterprise High Availability also provides resource migration capabilities. You can move applications, Web sites, etc. to other servers in your cluster as required for system management.

For example, you could have manually moved Web Site A or Web Site B from Web Server 1 to either of the other servers in the cluster. Use cases for this are upgrading or performing scheduled maintenance on Web Server 1, or increasing performance or accessibility of the Web sites.

1.4 Cluster configurations: storage

Cluster configurations with SUSE Linux Enterprise High Availability might or might not include a shared disk subsystem. The shared disk subsystem can be connected via high-speed Fibre Channel cards, cables, and switches, or it can be configured to use iSCSI. If a node fails, another designated node in the cluster automatically mounts the shared disk directories that were previously mounted on the failed node. This gives network users continuous access to the directories on the shared disk subsystem.

Important
Important: Shared disk subsystem with LVM

When using a shared disk subsystem with LVM, that subsystem must be connected to all servers in the cluster from which it needs to be accessed.

Typical resources might include data, applications, and services. The following figures show how a typical Fibre Channel cluster configuration might look. The green lines depict connections to an Ethernet power switch. Such a device can be controlled over a network and can reboot a node when a ping request fails.

Typical Fibre Channel cluster configuration
Figure 1.3: Typical Fibre Channel cluster configuration

Although Fibre Channel provides the best performance, you can also configure your cluster to use iSCSI. iSCSI is an alternative to Fibre Channel that can be used to create a low-cost Storage Area Network (SAN). The following figure shows how a typical iSCSI cluster configuration might look.

Typical iSCSI cluster configuration
Figure 1.4: Typical iSCSI cluster configuration

Although most clusters include a shared disk subsystem, it is also possible to create a cluster without a shared disk subsystem. The following figure shows how a cluster without a shared disk subsystem might look.

Typical cluster configuration without shared storage
Figure 1.5: Typical cluster configuration without shared storage

1.5 Architecture

This section provides a brief overview of SUSE Linux Enterprise High Availability architecture. It identifies and provides information on the architectural components, and describes how those components interoperate.

1.5.1 Architecture layers

SUSE Linux Enterprise High Availability has a layered architecture. Figure 1.6, “Architecture” illustrates the different layers and their associated components.

Architecture
Figure 1.6: Architecture

1.5.1.1 Membership and messaging layer (Corosync)

This component provides reliable messaging, membership, and quorum information about the cluster. This is handled by the Corosync cluster engine, a group communication system.

1.5.1.2 Cluster resource manager (Pacemaker)

Pacemaker as cluster resource manager is the brain which reacts to events occurring in the cluster. It is implemented as pacemaker-controld, the cluster controller, which coordinates all actions. Events can be nodes that join or leave the cluster, failure of resources, or scheduled activities such as maintenance, for example.

Local resource manager

The local resource manager is located between the Pacemaker layer and the resources layer on each node. It is implemented as pacemaker-execd daemon. Through this daemon, Pacemaker can start, stop, and monitor resources.

Cluster Information Database (CIB)

On every node, Pacemaker maintains the cluster information database (CIB). It is an XML representation of the cluster configuration (including cluster options, nodes, resources, constraints and the relationship to each other). The CIB also reflects the current cluster status. Each cluster node contains a CIB replica, which is synchronized across the whole cluster. The pacemaker-based daemon takes care of reading and writing cluster configuration and status.

Designated Coordinator (DC)

The DC is elected from all nodes in the cluster. This happens if there is no DC yet or if the current DC leaves the cluster for any reason. The DC is the only entity in the cluster that can decide that a cluster-wide change needs to be performed, such as fencing a node or moving resources around. All other nodes get their configuration and resource allocation information from the current DC.

Policy Engine

The policy engine runs on every node, but the one on the DC is the active one. The engine is implemented as pacemaker-schedulerd daemon. When a cluster transition is needed, based on the current state and configuration, pacemaker-schedulerd calculates the expected next state of the cluster. It determines what actions need to be scheduled to achieve the next state.

1.5.1.3 Resources and resource agents

In a High Availability cluster, the services that need to be highly available are called resources. Resource agents (RAs) are scripts that start, stop, and monitor cluster resources.

1.5.2 Process flow

The pacemakerd daemon launches and monitors all other related daemons. The daemon that coordinates all actions, pacemaker-controld, has an instance on each cluster node. Pacemaker centralizes all cluster decision-making by electing one of those instances as a primary. Should the elected pacemaker-controld daemon fail, a new primary is established.

Many actions performed in the cluster will cause a cluster-wide change. These actions can include things like adding or removing a cluster resource or changing resource constraints. It is important to understand what happens in the cluster when you perform such an action.

For example, suppose you want to add a cluster IP address resource. To do this, you can use the crm shell or the Web interface to modify the CIB. It is not required to perform the actions on the DC. You can use either tool on any node in the cluster and they will be relayed to the DC. The DC will then replicate the CIB change to all cluster nodes.

Based on the information in the CIB, the pacemaker-schedulerd then computes the ideal state of the cluster and how it should be achieved. It feeds a list of instructions to the DC. The DC sends commands via the messaging/infrastructure layer which are received by the pacemaker-controld peers on other nodes. Each of them uses its local resource agent executor (implemented as pacemaker-execd) to perform resource modifications. The pacemaker-execd is not cluster-aware and interacts directly with resource agents.

All peer nodes report the results of their operations back to the DC. After the DC concludes that all necessary operations are successfully performed in the cluster, the cluster will go back to the idle state and wait for further events. If any operation was not carried out as planned, the pacemaker-schedulerd is invoked again with the new information recorded in the CIB.

In some cases, it might be necessary to power off nodes to protect shared data or complete resource recovery. In a Pacemaker cluster, the implementation of node level fencing is STONITH. For this, Pacemaker comes with a fencing subsystem, pacemaker-fenced. STONITH devices must be configured as cluster resources (that use specific fencing agents), because this allows monitoring of the fencing devices. When clients detect a failure, they send a request to pacemaker-fenced, which then executes the fencing agent to bring down the node.

2 System requirements and recommendations

The following section informs you about system requirements, and some prerequisites for SUSE® Linux Enterprise High Availability. It also includes recommendations for cluster setup.

2.1 Hardware requirements

The following list specifies hardware requirements for a cluster based on SUSE® Linux Enterprise High Availability. These requirements represent the minimum hardware configuration. Additional hardware might be necessary, depending on how you intend to use your cluster.

Servers

1 to 32 Linux servers with software as specified in Section 2.2, “Software requirements”.

The servers can be bare metal or virtual machines. They do not require identical hardware (memory, disk space, etc.), but they must have the same architecture. Cross-platform clusters are not supported.

Using pacemaker_remote, the cluster can be extended to include additional Linux servers beyond the 32-node limit.

Communication channels

At least two TCP/IP communication media per cluster node. The network equipment must support the communication means you want to use for cluster communication: multicast or unicast. The communication media should support a data rate of 100 Mbit/s or higher. For a supported cluster setup two or more redundant communication paths are required. This can be done via:

  • Network Device Bonding (preferred).

  • A second communication channel in Corosync.

For details, refer to Chapter 16, Network device bonding and Procedure 4.3, “Defining a redundant communication channel”, respectively.

Node fencing/STONITH

To avoid a split brain scenario, clusters need a node fencing mechanism. In a split brain scenario, cluster nodes are divided into two or more groups that do not know about each other (because of a hardware or software failure or because of a cut network connection). A fencing mechanism isolates the node in question (usually by resetting or powering off the node). This is also called STONITH (Shoot the other node in the head). A node fencing mechanism can be either a physical device (a power switch) or a mechanism like SBD (STONITH by disk) in combination with a watchdog. Using SBD requires shared storage.

Unless SBD is used, each node in the High Availability cluster must have at least one STONITH device. We strongly recommend multiple STONITH devices per node.

Important
Important: No Support Without STONITH
  • You must have a node fencing mechanism for your cluster.

  • The global cluster options stonith-enabled and startup-fencing must be set to true. When you change them, you lose support.

2.2 Software requirements

All nodes that will be part of the cluster need at least the following modules and extensions:

  • Basesystem Module 15 SP4

  • Server Applications Module 15 SP4

  • SUSE Linux Enterprise High Availability 15 SP4

Depending on the system roles you select during installation, the following software patterns are installed by default:

Table 2.1: System roles and installed patterns

System Role

Software Pattern (YaST/Zypper)

HA Node

  • High Availability (sles_ha)

  • Enhanced Base System (enhanced_base)

HA GEO Node

  • Geo Clustering for High Availability (ha_geo)

  • Enhanced Base System (enhanced_base)

Note
Note: Minimal installation

An installation via those system roles results in a minimal installation only. You might need to add more packages manually, if required.

For machines that originally had another system role assigned, you need to manually install the sles_ha or ha_geo patterns and any further packages that you need.

2.3 Storage requirements

Some services require shared storage. If using an external NFS share, it must be reliably accessible from all cluster nodes via redundant communication paths.

To make data highly available, a shared disk system (Storage Area Network, or SAN) is recommended for your cluster. If a shared disk subsystem is used, ensure the following:

  • The shared disk system is properly set up and functional according to the manufacturer’s instructions.

  • The disks contained in the shared disk system should be configured to use mirroring or RAID to add fault tolerance to the shared disk system.

  • If you are using iSCSI for shared disk system access, ensure that you have properly configured iSCSI initiators and targets.

  • When using DRBD* to implement a mirroring RAID system that distributes data across two machines, make sure to only access the device provided by DRBD—never the backing device. To leverage the redundancy it is possible to use the same NICs as the rest of the cluster.

When using SBD as STONITH mechanism, additional requirements apply for the shared storage. For details, see Section 13.3, “Requirements and restrictions”.

2.4 Other requirements and recommendations

For a supported and useful High Availability setup, consider the following recommendations:

Number of cluster nodes

For clusters with more than two nodes, it is strongly recommended to use an odd number of cluster nodes to have quorum. For more information about quorum, see Section 5.2, “Quorum determination”. A regular cluster can contain up to 32 nodes. With the pacemaker_remote service, High Availability clusters can be extended to include additional nodes beyond this limit. See Pacemaker Remote Quick Start for more details.

Time synchronization

Cluster nodes must synchronize to an NTP server outside the cluster. Since SUSE Linux Enterprise High Availability 15, chrony is the default implementation of NTP. For more information, see the Administration Guide for SUSE Linux Enterprise Server 15 SP4.

The cluster might not work properly if the nodes are not synchronized, or even if they are synchronized but have different timezones configured. In addition, log files and cluster reports are very hard to analyze without synchronization. If you use the bootstrap scripts, you will be warned if NTP is not configured yet.

Network Interface Card (NIC) names

Must be identical on all nodes.

Host name and IP address
  • Use static IP addresses.

  • Only the primary IP address is supported.

  • List all cluster nodes in the /etc/hosts file with their fully qualified host name and short host name. It is essential that members of the cluster can find each other by name. If the names are not available, internal cluster communication will fail.

    For details on how Pacemaker gets the node names, see also http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-name.html.

SSH

All cluster nodes must be able to access each other via SSH. Tools like crm report (for troubleshooting) and Hawk2's History Explorer require passwordless SSH access between the nodes, otherwise they can only collect data from the current node.

Note
Note: Regulatory requirements

If passwordless SSH access does not comply with regulatory requirements, you can use the work-around described in Appendix D, Running cluster reports without root access for running crm report.

For the History Explorer there is currently no alternative for passwordless login.

3 Installing SUSE Linux Enterprise High Availability

If you are setting up a High Availability cluster with SUSE® Linux Enterprise High Availability for the first time, the easiest way is to start with a basic two-node cluster. You can also use the two-node cluster to run some tests. Afterward, you can add more nodes by cloning existing cluster nodes with AutoYaST. The cloned nodes will have the same packages installed and the same system configuration as the original ones.

If you want to upgrade an existing cluster that runs an older version of SUSE Linux Enterprise High Availability, refer to Chapter 29, Upgrading your cluster and updating software packages.

3.1 Manual installation

For the manual installation of the packages for High Availability refer to Article “Installation and Setup Quick Start”. It leads you through the setup of a basic two-node cluster.

3.2 Mass installation and deployment with AutoYaST

After you have installed and set up a two-node cluster, you can extend the cluster by cloning existing nodes with AutoYaST and adding the clones to the cluster.

AutoYaST uses profiles that contains installation and configuration data. A profile tells AutoYaST what to install and how to configure the installed system to get a ready-to-use system in the end. This profile can then be used for mass deployment in different ways (for example, to clone existing cluster nodes).

For detailed instructions on how to use AutoYaST in various scenarios, see the AutoYaST Guide for SUSE Linux Enterprise Server 15 SP4.

Important
Important: Identical hardware

Procedure 3.1, “Cloning a cluster node with AutoYaST” assumes you are rolling out SUSE Linux Enterprise High Availability 15 SP4 to a set of machines with identical hardware configurations.

If you need to deploy cluster nodes on non-identical hardware, refer to the Deployment Guide for SUSE Linux Enterprise Server 15 SP4, chapter Automated Installation, section Rule-Based Autoinstallation.

Procedure 3.1: Cloning a cluster node with AutoYaST
  1. Make sure the node you want to clone is correctly installed and configured. For details, see the Installation and Setup Quick Start or Chapter 4, Using the YaST cluster module.

  2. Follow the description outlined in the SUSE Linux Enterprise 15 SP4 Deployment Guide for simple mass installation. This includes the following basic steps:

    1. Creating an AutoYaST profile. Use the AutoYaST GUI to create and modify a profile based on the existing system configuration. In AutoYaST, choose the High Availability module and click the Clone button. If needed, adjust the configuration in the other modules and save the resulting control file as XML.

      If you have configured DRBD, you can select and clone this module in the AutoYaST GUI, too.

    2. Determining the source of the AutoYaST profile and the parameter to pass to the installation routines for the other nodes.

    3. Determining the source of the SUSE Linux Enterprise Server and SUSE Linux Enterprise High Availability installation data.

    4. Determining and setting up the boot scenario for autoinstallation.

    5. Passing the command line to the installation routines, either by adding the parameters manually or by creating an info file.

    6. Starting and monitoring the autoinstallation process.

After the clone has been successfully installed, execute the following steps to make the cloned node join the cluster:

Procedure 3.2: Bringing the cloned node online
  1. Transfer the key configuration files from the already configured nodes to the cloned node with Csync2 as described in Section 4.7, “Transferring the configuration to all nodes”.

  2. To bring the node online, start the cluster services on the cloned node as described in Section 4.8, “Bringing the cluster online”.

The cloned node will now join the cluster because the /etc/corosync/corosync.conf file has been applied to the cloned node via Csync2. The CIB is automatically synchronized among the cluster nodes.

4 Using the YaST cluster module

The YaST cluster module allows you to set up a cluster manually (from scratch) or to modify options for an existing cluster.

However, if you prefer an automated approach for setting up a cluster, refer to Article “Installation and Setup Quick Start”. It describes how to install the needed packages and leads you to a basic two-node cluster, which is set up with the bootstrap scripts provided by the crm shell.

You can also use a combination of both setup methods, for example: set up one node with YaST cluster and then use one of the bootstrap scripts to integrate more nodes (or vice versa).

4.1 Definition of terms

Several key terms used in the YaST cluster module and in this chapter are defined below.

Bind network address (bindnetaddr)

The network address the Corosync executive should bind to. To simplify sharing configuration files across the cluster, Corosync uses network interface netmask to mask only the address bits that are used for routing the network. For example, if the local interface is 192.168.5.92 with netmask 255.255.255.0, set bindnetaddr to 192.168.5.0. If the local interface is 192.168.5.92 with netmask 255.255.255.192, set bindnetaddr to 192.168.5.64.

If nodelist with ringX_addr is explicitly configured in /etc/corosync/corosync.conf, bindnetaddr is not strictly required.

Note
Note: Network address for all nodes

As the same Corosync configuration will be used on all nodes, make sure to use a network address as bindnetaddr, not the address of a specific network interface.

conntrack Tools

Allow interaction with the in-kernel connection tracking system for enabling stateful packet inspection for iptables. Used by SUSE Linux Enterprise High Availability to synchronize the connection status between cluster nodes. For detailed information, refer to http://conntrack-tools.netfilter.org/.

Csync2

A synchronization tool that can be used to replicate configuration files across all nodes in the cluster, and even across Geo clusters. Csync2 can handle any number of hosts, sorted into synchronization groups. Each synchronization group has its own list of member hosts and its include/exclude patterns that define which files should be synchronized in the synchronization group. The groups, the host names belonging to each group, and the include/exclude rules for each group are specified in the Csync2 configuration file, /etc/csync2/csync2.cfg.

For authentication, Csync2 uses the IP addresses and pre-shared keys within a synchronization group. You need to generate one key file for each synchronization group and copy it to all group members.

For more information about Csync2, refer to http://oss.linbit.com/csync2/paper.pdf

Existing cluster

The term existing cluster is used to refer to any cluster that consists of at least one node. Existing clusters have a basic Corosync configuration that defines the communication channels, but they do not necessarily have resource configuration yet.

Multicast

A technology used for a one-to-many communication within a network that can be used for cluster communication. Corosync supports both multicast and unicast.

Note
Note: Switches and multicast

To use multicast for cluster communication, make sure your switches support multicast.

Multicast address (mcastaddr)

IP address to be used for multicasting by the Corosync executive. The IP address can either be IPv4 or IPv6. If IPv6 networking is used, node IDs must be specified. You can use any multicast address in your private network.

Multicast port (mcastport)

The port to use for cluster communication. Corosync uses two ports: the specified mcastport for receiving multicast, and mcastport -1 for sending multicast.

Redundant Ring Protocol (RRP)

Allows the use of multiple redundant local area networks for resilience against partial or total network faults. This way, cluster communication can still be kept up as long as a single network is operational. Corosync supports the Totem Redundant Ring Protocol. A logical token-passing ring is imposed on all participating nodes to deliver messages in a reliable and sorted manner. A node is allowed to broadcast a message only if it holds the token.

When having defined redundant communication channels in Corosync, use RRP to tell the cluster how to use these interfaces. RRP can have three modes (rrp_mode):

  • If set to active, Corosync uses both interfaces actively. However, this mode is deprecated.

  • If set to passive, Corosync sends messages alternatively over the available networks.

  • If set to none, RRP is disabled.

Unicast

A technology for sending messages to a single network destination. Corosync supports both multicast and unicast. In Corosync, unicast is implemented as UDP-unicast (UDPU).

4.2 YaST Cluster module

Start YaST and select High Availability › Cluster. Alternatively, start the module from command line:

sudo yast2 cluster

The following list shows an overview of the available screens in the YaST cluster module. It also mentions whether the screen contains parameters that are required for successful cluster setup or whether its parameters are optional.

Communication channels (required)

Allows you to define one or two communication channels for communication between the cluster nodes. As transport protocol, either use multicast (UDP) or unicast (UDPU). For details, see Section 4.3, “Defining the communication channels”.

Important
Important: Redundant communication paths

For a supported cluster setup two or more redundant communication paths are required. The preferred way is to use network device bonding as described in Chapter 16, Network device bonding.

If this is impossible, you need to define a second communication channel in Corosync.

Security (optional but recommended)

Allows you to define the authentication settings for the cluster. HMAC/SHA1 authentication requires a shared secret used to protect and authenticate messages. For details, see Section 4.4, “Defining authentication settings”.

Configure Csync2 (optional but recommended)

Csync2 helps you to keep track of configuration changes and to keep files synchronized across the cluster nodes. For details, see Section 4.7, “Transferring the configuration to all nodes”.

Configure conntrackd (optional)

Allows you to configure the user space conntrackd. Use the conntrack tools for stateful packet inspection for iptables. For details, see Section 4.5, “Synchronizing connection status between cluster nodes”.

Service (required)

Allows you to configure the service for bringing the cluster node online. Define whether to start the cluster services at boot time and whether to open the ports in the firewall that are needed for communication between the nodes. For details, see Section 4.6, “Configuring services”.

If you start the cluster module for the first time, it appears as a wizard, guiding you through all the steps necessary for basic setup. Otherwise, click the categories on the left panel to access the configuration options for each step.

Note
Note: Settings in the YaST Cluster module

Some settings in the YaST cluster module apply only to the current node. Other settings may automatically be transferred to all nodes with Csync2. Find detailed information about this in the following sections.

4.3 Defining the communication channels

For successful communication between the cluster nodes, define at least one communication channel. As transport protocol, either use multicast (UDP) or unicast (UDPU) as described in Procedure 4.1 or Procedure 4.2, respectively. If you want to define a second, redundant channel (Procedure 4.3), both communication channels must use the same protocol.

Note
Note: Public clouds: use unicast

For deploying SUSE Linux Enterprise High Availability in public cloud platforms, use unicast as transport protocol. Multicast is generally not supported by the cloud platforms themselves.

All settings defined in the YaST Communication Channels screen are written to /etc/corosync/corosync.conf. Find example files for a multicast and a unicast setup in /usr/share/doc/packages/corosync/.

If you are using IPv4 addresses, node IDs are optional. If you are using IPv6 addresses, node IDs are required. Instead of specifying IDs manually for each node, the YaST cluster module contains an option to automatically generate a unique ID for every cluster node.

Procedure 4.1: Defining the first communication channel (multicast)

When using multicast, the same bindnetaddr, mcastaddr, and mcastport will be used for all cluster nodes. All nodes in the cluster will know each other by using the same multicast address. For different clusters, use different multicast addresses.

  1. Start the YaST cluster module and switch to the Communication Channels category.

  2. Set the Transport protocol to Multicast.

  3. Define the Bind Network Address. Set the value to the subnet you will use for cluster multicast.

  4. Define the Multicast Address.

  5. Define the Port.

  6. To automatically generate a unique ID for every cluster node keep Auto Generate Node ID enabled.

  7. Define a Cluster Name.

  8. Enter the number of Expected Votes. This is important for Corosync to calculate quorum in case of a partitioned cluster. By default, each node has 1 vote. The number of Expected Votes must match the number of nodes in your cluster.

  9. Confirm your changes.

  10. If needed, define a redundant communication channel in Corosync as described in Procedure 4.3, “Defining a redundant communication channel”.

YaST Cluster—multicast configuration
Figure 4.1: YaST Cluster—multicast configuration

To use unicast instead of multicast for cluster communication, proceed as follows.

Procedure 4.2: Defining the first communication channel (unicast)
  1. Start the YaST cluster module and switch to the Communication Channels category.

  2. Set the Transport protocol to Unicast.

  3. Define the Port.

  4. For unicast communication, Corosync needs to know the IP addresses of all nodes in the cluster. For each node that will be part of the cluster, click Add and enter the following details:

    • IP Address

    • Redundant IP Address (only required if you use a second communication channel in Corosync)

    • Node ID (only required if the option Auto Generate Node ID is disabled)

    To modify or remove any addresses of cluster members, use the Edit or Del buttons.

  5. To automatically generate a unique ID for every cluster node keep Auto Generate Node ID enabled.

  6. Define a Cluster Name.

  7. Enter the number of Expected Votes. This is important for Corosync to calculate quorum in case of a partitioned cluster. By default, each node has 1 vote. The number of Expected Votes must match the number of nodes in your cluster.

  8. Confirm your changes.

  9. If needed, define a redundant communication channel in Corosync as described in Procedure 4.3, “Defining a redundant communication channel”.

YaST Cluster—unicast configuration
Figure 4.2: YaST Cluster—unicast configuration

If network device bonding cannot be used for any reason, the second best choice is to define a redundant communication channel (a second ring) in Corosync. That way, two physically separate networks can be used for communication. If one network fails, the cluster nodes can still communicate via the other network.

The additional communication channel in Corosync will form a second token-passing ring. In /etc/corosync/corosync.conf, the first channel you configured is the primary ring and gets the ring number 0. The second ring (redundant channel) gets the ring number 1.

When having defined redundant communication channels in Corosync, use RRP to tell the cluster how to use these interfaces. With RRP, two physically separate networks are used for communication. If one network fails, the cluster nodes can still communicate via the other network.

RRP can have three modes:

  • If set to active, Corosync uses both interfaces actively. However, this mode is deprecated.

  • If set to passive, Corosync sends messages alternatively over the available networks.

  • If set to none, RRP is disabled.

Procedure 4.3: Defining a redundant communication channel
Important
Important: Redundant rings and /etc/hosts

If multiple rings are configured in Corosync, each node can have multiple IP addresses. This needs to be reflected in the /etc/hosts file of all nodes.

  1. Start the YaST cluster module and switch to the Communication Channels category.

  2. Activate Redundant Channel. The redundant channel must use the same protocol as the first communication channel you defined.

  3. If you use multicast, enter the following parameters: the Bind Network Address to use, the Multicast Address and the Port for the redundant channel.

    If you use unicast, define the following parameters: the Bind Network Address to use, and the Port. Enter the IP addresses of all nodes that will be part of the cluster.

  4. To tell Corosync how and when to use the different channels, select the rrp_mode to use:

    • If only one communication channel is defined, rrp_mode is automatically disabled (value none).

    • If set to active, Corosync uses both interfaces actively. However, this mode is deprecated.

    • If set to passive, Corosync sends messages alternatively over the available networks.

    When RRP is used, SUSE Linux Enterprise High Availability monitors the status of the current rings and automatically re-enables redundant rings after faults.

    Alternatively, check the ring status manually with corosync-cfgtool. View the available options with -h.

  5. Confirm your changes.

4.4 Defining authentication settings

To define the authentication settings for the cluster, you can use HMAC/SHA1 authentication. This requires a shared secret used to protect and authenticate messages. The authentication key (password) you specify will be used on all nodes in the cluster.

Procedure 4.4: Enabling secure authentication
  1. Start the YaST cluster module and switch to the Security category.

  2. Activate Enable Security Auth.

  3. For a newly created cluster, click Generate Auth Key File. An authentication key is created and written to /etc/corosync/authkey.

    If you want the current machine to join an existing cluster, do not generate a new key file. Instead, copy the /etc/corosync/authkey from one of the nodes to the current machine (either manually or with Csync2).

  4. Confirm your changes. YaST writes the configuration to /etc/corosync/corosync.conf.

YaST Cluster—security
Figure 4.3: YaST Cluster—security

4.5 Synchronizing connection status between cluster nodes

To enable stateful packet inspection for iptables, configure and use the conntrack tools. This requires the following basic steps:

Procedure 4.5: Configuring the conntrackd with YaST

Use the YaST cluster module to configure the user space conntrackd (see Figure 4.4, “YaST Clusterconntrackd). It needs a dedicated network interface that is not used for other communication channels. The daemon can be started via a resource agent afterward.

  1. Start the YaST cluster module and switch to the Configure conntrackd category.

  2. Define the Multicast Address to be used for synchronizing the connection status.

  3. In Group Number, define a numeric ID for the group to synchronize the connection status to.

  4. Click Generate /etc/conntrackd/conntrackd.conf to create the configuration file for conntrackd.

  5. If you modified any options for an existing cluster, confirm your changes and close the cluster module.

  6. For further cluster configuration, click Next and proceed with Section 4.6, “Configuring services”.

  7. Select a Dedicated Interface for synchronizing the connection status. The IPv4 address of the selected interface is automatically detected and shown in YaST. It must already be configured and it must support multicast.

YaST Cluster—conntrackd
Figure 4.4: YaST Clusterconntrackd

After having configured the conntrack tools, you can use them for Linux Virtual Server (see Load balancing).

4.6 Configuring services

In the YaST cluster module define whether to start certain services on a node at boot time. You can also use the module to start and stop the services manually. To bring the cluster nodes online and start the cluster resource manager, Pacemaker must be running as a service.

Procedure 4.6: Enabling the cluster services
  1. In the YaST cluster module, switch to the Service category.

  2. To start the cluster services each time this cluster node is booted, select the respective option in the Booting group. If you select Off in the Booting group, you must start the cluster services manually each time this node is booted. To start the cluster services manually, use the command:

    # crm cluster start
  3. To start or stop the cluster services immediately, click the respective button.

  4. To open the ports in the firewall that are needed for cluster communication on the current machine, activate Open Port in Firewall.

  5. Confirm your changes. Note that the configuration only applies to the current machine, not to all cluster nodes.

YaST Cluster—services
Figure 4.5: YaST Cluster—services

4.7 Transferring the configuration to all nodes

Instead of copying the resulting configuration files to all nodes manually, use the csync2 tool for replication across all nodes in the cluster.

This requires the following basic steps:

Csync2 helps you to keep track of configuration changes and to keep files synchronized across the cluster nodes:

  • You can define a list of files that are important for operation.

  • You can show changes to these files (against the other cluster nodes).

  • You can synchronize the configured files with a single command.

  • With a simple shell script in ~/.bash_logout, you can be reminded about unsynchronized changes before logging out of the system.

Find detailed information about Csync2 at http://oss.linbit.com/csync2/ and http://oss.linbit.com/csync2/paper.pdf.

4.7.1 Configuring Csync2 with YaST

Procedure 4.7: Configuring Csync2 with YaST
  1. Start the YaST cluster module and switch to the Csync2 category.

  2. To specify the synchronization group, click Add in the Sync Host group and enter the local host names of all nodes in your cluster. For each node, you must use exactly the strings that are returned by the hostname command.

    Tip
    Tip: Host name resolution

    If host name resolution does not work properly in your network, you can also specify a combination of host name and IP address for each cluster node. To do so, use the string HOSTNAME@IP such as alice@192.168.2.100, for example. Csync2 will then use the IP addresses when connecting.

  3. Click Generate Pre-Shared-Keys to create a key file for the synchronization group. The key file is written to /etc/csync2/key_hagroup. After it has been created, it must be copied manually to all members of the cluster.

  4. To populate the Sync File list with the files that usually need to be synchronized among all nodes, click Add Suggested Files.

  5. To Edit, Add or Remove files from the list of files to be synchronized use the respective buttons. You must enter the absolute path name for each file.

  6. Activate Csync2 by clicking Turn Csync2 ON. This will execute the following command to start Csync2 automatically at boot time:

    # systemctl enable csync2.socket
  7. Click Finish. YaST writes the Csync2 configuration to /etc/csync2/csync2.cfg.

YaST Cluster—Csync2
Figure 4.6: YaST Cluster—Csync2

4.7.2 Synchronizing changes with Csync2

Before running Csync2 for the first time, you need to make the following preparations:

Procedure 4.8: Preparing for initial synchronization with Csync2
  1. Copy the file /etc/csync2/csync2.cfg manually to all nodes after you have configured it as described in Section 4.7.1, “Configuring Csync2 with YaST”.

  2. Copy the file /etc/csync2/key_hagroup that you have generated on one node in Step 3 of Section 4.7.1 to all nodes in the cluster. It is needed for authentication by Csync2. However, do not regenerate the file on the other nodes—it needs to be the same file on all nodes.

  3. Execute the following command on all nodes to start the service now:

    # systemctl start csync2.socket
Procedure 4.9: Synchronizing the configuration files with Csync2
  1. To initially synchronize all files once, execute the following command on the machine that you want to copy the configuration from:

    # csync2 -xv

    This will synchronize all the files once by pushing them to the other nodes. If all files are synchronized successfully, Csync2 will finish with no errors.

    If one or several files that are to be synchronized have been modified on other nodes (not only on the current one), Csync2 reports a conflict. You will get an output similar to the one below:

    While syncing file /etc/corosync/corosync.conf:
    ERROR from peer hex-14: File is also marked dirty here!
    Finished with 1 errors.
  2. If you are sure that the file version on the current node is the best one, you can resolve the conflict by forcing this file and resynchronizing:

    # csync2 -f /etc/corosync/corosync.conf
    # csync2 -x

For more information on the Csync2 options, run

# csync2 -help
Note
Note: Pushing synchronization after any changes

Csync2 only pushes changes. It does not continuously synchronize files between the machines.

Each time you update files that need to be synchronized, you need to push the changes to the other machines: Run csync2  -xv on the machine where you did the changes. If you run the command on any of the other machines with unchanged files, nothing will happen.

4.8 Bringing the cluster online

Before starting the cluster, make sure passwordless SSH is configured between the nodes. If you did not already configure passwordless SSH before setting up the cluster, you can do so now by using the ssh stage of the bootstrap scripts:

  1. On the first node, run crm cluster init ssh.

  2. On the rest of the nodes, run crm cluster join ssh -c NODE1.

After the initial cluster configuration is done, start the cluster services on all cluster nodes to bring the stack online:

Procedure 4.10: Starting cluster services and checking the status
  1. Log in to an existing node.

  2. Start the cluster services on all cluster nodes:

    # crm cluster start --all
  3. Check the cluster status with the crm status command. If all nodes are online, the output should be similar to the following:

    # crm status
    Cluster Summary:
      * Stack: corosync
      * Current DC: alice (version ...) - partition with quorum
      * Last updated: ...
      * Last change:  ... by hacluster via crmd on bob
      * 2 nodes configured
      * 1 resource instance configured
    
    Node List:
      * Online: [ alice bob ]
    ...

    This output indicates that the cluster resource manager is started and is ready to manage resources.

After the basic configuration is done and the nodes are online, you can start to configure cluster resources. Use one of the cluster management tools like the crm shell (crmsh) or Hawk2. For more information, see Section 5.5, “Introduction to crmsh” or Section 5.4, “Introduction to Hawk2”.

Part II Configuration and administration

  • 5 Configuration and administration basics
  • The main purpose of an HA cluster is to manage user services. Typical examples of user services are an Apache Web server or a database. From the user's point of view, the services do something specific when ordered to do so. To the cluster, however, they are only resources which may be started or stopped—the nature of the service is irrelevant to the cluster.

    This chapter introduces some basic concepts you need to know when administering your cluster. The following chapters show you how to execute the main configuration and administration tasks with each of the management tools SUSE Linux Enterprise High Availability provides.

  • 6 Configuring cluster resources
  • As a cluster administrator, you need to create cluster resources for every resource or application you run on servers in your cluster. Cluster resources can include Web sites, e-mail servers, databases, file systems, virtual machines, and any other server-based applications or services you want to make available to users at all times.

  • 7 Configuring resource constraints
  • Having all the resources configured is only part of the job. Even if the cluster knows all needed resources, it still might not be able to handle them correctly. Resource constraints let you specify which cluster nodes resources can run on, what order resources will load in, and what other resources a specific resource is dependent on.

  • 8 Managing cluster resources
  • After configuring the resources in the cluster, use the cluster management tools to start, stop, clean up, remove or migrate the resources. This chapter describes how to use Hawk2 or crmsh for resource management tasks.

  • 9 Managing services on remote hosts
  • The possibilities for monitoring and managing services on remote hosts has become increasingly important during the last few years. SUSE Linux Enterprise High Availability 11 SP3 offered fine-grained monitoring of services on remote hosts via monitoring plug-ins. The recent addition of the pacemaker_remote service now allows SUSE Linux Enterprise High Availability 15 SP4 to fully manage and monitor resources on remote hosts just as if they were a real cluster node—without the need to install the cluster stack on the remote machines.

  • 10 Adding or modifying resource agents
  • All tasks that need to be managed by a cluster must be available as a resource. There are two major groups here to consider: resource agents and STONITH agents. For both categories, you can add your own agents, extending the abilities of the cluster to your own needs.

  • 11 Monitoring clusters
  • This chapter describes how to monitor a cluster's health and view its history.

  • 12 Fencing and STONITH
  • Fencing is a very important concept in computer clusters for HA (High Availability). A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource. Fencing may be defined as a method to bring an HA cluster to a known state.

    Every resource in a cluster has a state attached. For example: resource r1 is started on alice. In an HA cluster, such a state implies that resource r1 is stopped on all nodes except alice, because the cluster must make sure that every resource may be started on only one node. Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.

    When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.

  • 13 Storage protection and SBD
  • SBD (STONITH Block Device) provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). This isolates the fencing mechanism from changes in firmware version or dependencies on specific firmware controllers. SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. Under certain conditions, it is also possible to use SBD without shared storage, by running it in diskless mode.

    The cluster bootstrap scripts provide an automated way to set up a cluster with the option of using SBD as fencing mechanism. For details, see the Article “Installation and Setup Quick Start”. However, manually setting up SBD provides you with more options regarding the individual settings.

    This chapter explains the concepts behind SBD. It guides you through configuring the components needed by SBD to protect your cluster from potential data corruption in case of a split brain scenario.

    In addition to node level fencing, you can use additional mechanisms for storage protection, such as LVM exclusive activation or OCFS2 file locking support (resource level fencing). They protect your system against administrative or application faults.

  • 14 QDevice and QNetd
  • QDevice and QNetd participate in quorum decisions. With assistance from the arbitrator corosync-qnetd, corosync-qdevice provides a configurable number of votes, allowing a cluster to sustain more node failures than the standard quorum rules allow. We recommend deploying corosync-qnetd and corosync-qdevice for clusters with an even number of nodes, and especially for two-node clusters.

  • 15 Access control lists
  • The cluster administration tools like crm shell (crmsh) or Hawk2 can be used by root or any user in the group haclient. By default, these users have full read/write access. To limit access or assign more fine-grained access rights, you can use Access control lists (ACLs).

    Access control lists consist of an ordered set of access rules. Each rule allows read or write access or denies access to a part of the cluster configuration. Rules are typically combined to produce a specific role, then users may be assigned to a role that matches their tasks.

  • 16 Network device bonding
  • For many systems, it is desirable to implement network connections that comply to more than the standard data security or availability requirements of a typical Ethernet device. In these cases, several Ethernet devices can be aggregated to a single bonding device.

  • 17 Load balancing
  • Load Balancing makes a cluster of servers appear as one large, fast server to outside clients. This apparent single server is called a virtual server. It consists of one or more load balancers dispatching incoming requests and several real servers running the actual services. With a load balancing setup of SUSE Linux Enterprise High Availability, you can build highly scalable and highly available network services, such as Web, cache, mail, FTP, media and VoIP services.

  • 18 High Availability for virtualization
  • This chapter explains how to configure virtual machines as highly available cluster resources.

  • 19 Geo clusters (multi-site clusters)
  • Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High Availability 15 SP4 also supports geographically dispersed clusters (Geo clusters, sometimes also called multi-site clusters). That means you can have multiple, geographically dispersed sites with a local cluster each. Failover between these clusters is coordinated by a higher level entity, the so-called booth. For details on how to use and set up Geo clusters, refer to Article “Geo Clustering Quick Start” and Book “Geo Clustering Guide”.

5 Configuration and administration basics

The main purpose of an HA cluster is to manage user services. Typical examples of user services are an Apache Web server or a database. From the user's point of view, the services do something specific when ordered to do so. To the cluster, however, they are only resources which may be started or stopped—the nature of the service is irrelevant to the cluster.

This chapter introduces some basic concepts you need to know when administering your cluster. The following chapters show you how to execute the main configuration and administration tasks with each of the management tools SUSE Linux Enterprise High Availability provides.

5.1 Use case scenarios

In general, clusters fall into one of two categories:

  • Two-node clusters

  • Clusters with more than two nodes. This usually means an odd number of nodes.

Adding also different topologies, different use cases can be derived. The following use cases are the most common:

Two-node cluster in one location

Configuration:FC SAN or similar shared storage, layer 2 network.

Usage scenario:Embedded clusters that focus on service high availability and not data redundancy for data replication. Such a setup is used for radio stations or assembly line controllers, for example.

Two-node clusters in two locations (most widely used)

Configuration:Symmetrical stretched cluster, FC SAN, and layer 2 network all across two locations.

Usage scenario:Classic stretched clusters, focus on high availability of services and local data redundancy. For databases and enterprise resource planning. One of the most popular setups.

Odd number of nodes in three locations

Configuration:2×N+1 nodes, FC SAN across two main locations. Auxiliary third site with no FC SAN, but acts as a majority maker. Layer 2 network at least across two main locations.

Usage scenario:Classic stretched cluster, focus on high availability of services and data redundancy. For example, databases, enterprise resource planning.

5.2 Quorum determination

Whenever communication fails between one or more nodes and the rest of the cluster, a cluster partition occurs. The nodes can only communicate with other nodes in the same partition and are unaware of the separated nodes. A cluster partition is defined as having quorum (being quorate) if it has the majority of nodes (or votes). How this is achieved is done by quorum calculation. Quorum is a requirement for fencing.

Quorum is not calculated or determined by Pacemaker. Corosync can handle quorum for two-node clusters directly without changing the Pacemaker configuration.

How quorum is calculated is influenced by the following factors:

Number of cluster nodes

To keep services running, a cluster with more than two nodes relies on quorum (majority vote) to resolve cluster partitions. Based on the following formula, you can calculate the minimum number of operational nodes required for the cluster to function:

N ≥ C/2 + 1

N = minimum number of operational nodes
C = number of cluster nodes

For example, a five-node cluster needs a minimum of three operational nodes (or two nodes which can fail).

We strongly recommend to use either a two-node cluster or an odd number of cluster nodes. Two-node clusters make sense for stretched setups across two sites. Clusters with an odd number of nodes can either be built on one single site or might be spread across three sites.

Corosync configuration

Corosync is a messaging and membership layer, see Section 5.2.1, “Corosync configuration for two-node clusters” and Section 5.2.2, “Corosync configuration for n-node clusters”.

5.2.1 Corosync configuration for two-node clusters

When using the bootstrap scripts, the Corosync configuration contains a quorum section with the following options:

Example 5.1: Excerpt of Corosync configuration for a two-node cluster
quorum {
   # Enable and configure quorum subsystem (default: off)
   # see also corosync.conf.5 and votequorum.5
   provider: corosync_votequorum
   expected_votes: 2
   two_node: 1
}

By default, when two_node: 1 is set, the wait_for_all option is automatically enabled. If wait_for_all is not enabled, the cluster should be started on both nodes in parallel. Otherwise the first node will perform a startup-fencing on the missing second node.

5.2.2 Corosync configuration for n-node clusters

When not using a two-node cluster, we strongly recommend an odd number of nodes for your N-node cluster. For quorum configuration, you have the following options:

  • Adding additional nodes with the crm cluster join command, or

  • Adapting the Corosync configuration manually.

If you adjust /etc/corosync/corosync.conf manually, use the following settings:

Example 5.2: Excerpt of Corosync configuration for an n-node cluster
quorum {
   provider: corosync_votequorum 1
   expected_votes: N 2
   wait_for_all: 1 3
}

1

Use the quorum service from Corosync

2

The number of votes to expect. This parameter can either be provided inside the quorum section, or is automatically calculated when the nodelist section is available.

3

Enables the wait for all (WFA) feature. When WFA is enabled, the cluster will be quorate for the first time only after all nodes have become visible. To avoid some startup race conditions, setting wait_for_all to 1 may help. For example, in a five-node cluster every node has one vote and thus, expected_votes is set to 5. When three or more nodes are visible to each other, the cluster partition becomes quorate and can start operating.

5.3 Global cluster options

Global cluster options control how the cluster behaves when confronted with certain situations. They are grouped into sets and can be viewed and modified with the cluster management tools like Hawk2 and the crm shell.

The predefined values can usually be kept. However, to make key functions of your cluster work as expected, you might need to adjust the following parameters after basic cluster setup:

5.3.1 Global option no-quorum-policy

This global option defines what to do when a cluster partition does not have quorum (no majority of nodes is part of the partition).

The following values are available:

ignore

Setting no-quorum-policy to ignore makes a cluster partition behave like it has quorum, even if it does not. The cluster partition is allowed to issue fencing and continue resource management.

On SLES 11 this was the recommended setting for a two-node cluster. Starting with SLES 12, the value ignore is obsolete and must not be used. Based on configuration and conditions, Corosync gives cluster nodes or a single node quorum—or not.

For two-node clusters the only meaningful behavior is to always react in case of node loss. The first step should always be to try to fence the lost node.

freeze

If quorum is lost, the cluster partition freezes. Resource management is continued: running resources are not stopped (but possibly restarted in response to monitor events), but no further resources are started within the affected partition.

This setting is recommended for clusters where certain resources depend on communication with other nodes (for example, OCFS2 mounts). In this case, the default setting no-quorum-policy=stop is not useful, as it would lead to the following scenario: Stopping those resources would not be possible while the peer nodes are unreachable. Instead, an attempt to stop them would eventually time out and cause a stop failure, triggering escalated recovery and fencing.

stop (default value)

If quorum is lost, all resources in the affected cluster partition are stopped in an orderly fashion.

suicide

If quorum is lost, all nodes in the affected cluster partition are fenced. This option works only in combination with SBD, see Chapter 13, Storage protection and SBD.

5.3.2 Global option stonith-enabled

This global option defines whether to apply fencing, allowing STONITH devices to shoot failed nodes and nodes with resources that cannot be stopped. By default, this global option is set to true, because for normal cluster operation it is necessary to use STONITH devices. According to the default value, the cluster will refuse to start any resources if no STONITH resources have been defined.

If you need to disable fencing for any reasons, set stonith-enabled to false, but be aware that this has impact on the support status for your product. Furthermore, with stonith-enabled="false", resources like the Distributed Lock Manager (DLM) and all services depending on DLM (such as lvmlockd, GFS2, and OCFS2) will fail to start.

Important
Important: No support without STONITH

A cluster without STONITH is not supported.

5.4 Introduction to Hawk2

To configure and manage cluster resources, either use Hawk2, or the crm shell (crmsh) command line utility.

Hawk2's user-friendly Web interface allows you to monitor and administer your High Availability clusters from Linux or non-Linux machines alike. Hawk2 can be accessed from any machine inside or outside of the cluster by using a (graphical) Web browser.

5.4.1 Hawk2 requirements

Before users can log in to Hawk2, the following requirements need to be fulfilled:

hawk2 Package

The hawk2 package must be installed on all cluster nodes you want to connect to with Hawk2.

Web browser

On the machine from which to access a cluster node using Hawk2, you need a (graphical) Web browser (with JavaScript and cookies enabled) to establish the connection.

Hawk2 service

To use Hawk2, the respective Web service must be started on the node that you want to connect to via the Web interface. See Procedure 5.1, “Starting Hawk2 services”.

If you have set up your cluster with the bootstrap scripts provided by the crm shell, the Hawk2 service is already enabled.

Username, group and password on each cluster node

Hawk2 users must be members of the haclient group. The installation creates a Linux user named hacluster, who is added to the haclient group.

When using the crm cluster init script for setup, a default password is set for the hacluster user. Before starting Hawk2, change it to a secure password. If you did not use the crm cluster init script, either set a password for the hacluster first or create a new user which is a member of the haclient group. Do this on every node you will connect to with Hawk2.

Wild card certificate handling

A wild card certificate is a public key certificate that is valid for multiple sub-domains. For example, a wild card certificate for *.example.com secures the domains www.example.com, login.example.com, and possibly more.

Hawk2 supports wild card certificates as well as conventional certificates. A self-signed default private key and certificate is generated by /srv/www/hawk/bin/generate-ssl-cert.

To use your own certificate (conventional or wild card), replace the generated certificate at /etc/ssl/certs/hawk.pem with your own.

Procedure 5.1: Starting Hawk2 services
  1. On the node you want to connect to, open a shell and log in as root.

  2. Check the status of the service by entering

    # systemctl status hawk
  3. If the service is not running, start it with

    # systemctl start hawk

    If you want Hawk2 to start automatically at boot time, execute the following command:

    # systemctl enable hawk

5.4.2 Logging in

The Hawk2 Web interface uses the HTTPS protocol and port 7630.

Instead of logging in to an individual cluster node with Hawk2, you can configure a floating, virtual IP address (IPaddr or IPaddr2) as a cluster resource. It does not need any special configuration. It allows clients to connect to the Hawk service no matter which physical node the service is running on.

When setting up the cluster with the bootstrap scripts provided by the crm shell, you will be asked whether to configure a virtual IP for cluster administration.

Procedure 5.2: Logging in to the Hawk2 web interface
  1. On any machine, start a Web browser and enter the following URL:

    https://HAWKSERVER:7630/

    Replace HAWKSERVER with the IP address or host name of any cluster node running the Hawk Web service. If a virtual IP address has been configured for cluster administration with Hawk2, replace HAWKSERVER with the virtual IP address.

    Note
    Note: Certificate warning

    If a certificate warning appears when you try to access the URL for the first time, a self-signed certificate is in use. Self-signed certificates are not considered trustworthy by default.

    To verify the certificate, ask your cluster operator for the certificate details.

    To proceed anyway, you can add an exception in the browser to bypass the warning.

    For information on how to replace the self-signed certificate with a certificate signed by an official Certificate Authority, refer to Replacing the self-signed certificate.

  2. On the Hawk2 login screen, enter the Username and Password of the hacluster user (or of any other user that is a member of the haclient group).

  3. Click Log In.

5.4.3 Hawk2 overview: main elements

After logging in to Hawk2, you will see a navigation bar on the left-hand side and a top-level row with several links on the right-hand side.

Note
Note: Available functions in Hawk2

By default, users logged in as root or hacluster have full read-write access to all cluster configuration tasks. However, Access control lists (ACLs) can be used to define fine-grained access permissions.

If ACLs are enabled in the CRM, the available functions in Hawk2 depend on the user role and their assigned access permissions. The History Explorer in Hawk2 can only be executed by the user hacluster.

5.4.3.1 Left navigation bar

Monitoring
  • Status: Displays the current cluster status at a glance (similar to crm status on the crmsh). For details, see Section 11.1.1, “Monitoring a single cluster”. If your cluster includes guest nodes (nodes that run the pacemaker_remote daemon), they are displayed, too. The screen refreshes in near real-time: any status changes for nodes or resources are visible almost immediately.

  • Dashboard: Allows you to monitor multiple clusters (also located on different sites, in case you have a Geo cluster setup). For details, see Section 11.1.2, “Monitoring multiple clusters”. If your cluster includes guest nodes (nodes that run the pacemaker_remote daemon), they are displayed, too. The screen refreshes in near real-time: any status changes for nodes or resources are visible almost immediately.

Troubleshooting
Configuration

5.4.3.2 Top-level row

Hawk2's top-level row shows the following entries:

  • Batch: Click to switch to batch mode. This allows you to simulate and stage changes and to apply them as a single transaction. For details, see Section 5.4.7, “Using the batch mode”.

  • USERNAME: Allows you to set preferences for Hawk2 (for example, the language for the Web interface, or whether to display a warning if STONITH is disabled).

  • Help: Access the SUSE Linux Enterprise High Availability documentation, read the release notes or report a bug.

  • Logout: Click to log out.

5.4.4 Configuring global cluster options

Global cluster options control how the cluster behaves when confronted with certain situations. They are grouped into sets and can be viewed and modified with cluster management tools like Hawk2 and crmsh. The predefined values can usually be kept. However, to ensure the key functions of your cluster work correctly, you need to adjust the following parameters after basic cluster setup:

Procedure 5.3: Modifying global cluster options
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Cluster Configuration.

    The Cluster Configuration screen opens. It displays the global cluster options and their current values.

    To display a short description of the parameter on the right-hand side of the screen, hover your mouse over a parameter.

    Hawk2—cluster configuration
    Figure 5.1: Hawk2—cluster configuration
  3. Check the values for no-quorum-policy and stonith-enabled and adjust them, if necessary:

    1. Set no-quorum-policy to the appropriate value. See Section 5.3.1, “Global option no-quorum-policy for more details.

    2. If you need to disable fencing for any reason, set stonith-enabled to no. By default, it is set to true, because using STONITH devices is necessary for normal cluster operation. According to the default value, the cluster will refuse to start any resources if no STONITH resources have been configured.

      Important
      Important: No Support Without STONITH
      • You must have a node fencing mechanism for your cluster.

      • The global cluster options stonith-enabled and startup-fencing must be set to true. When you change them, you lose support.

    3. To remove a parameter from the cluster configuration, click the Minus icon next to the parameter. If a parameter is deleted, the cluster will behave as if that parameter had the default value.

    4. To add a new parameter to the cluster configuration, choose one from the drop-down box.

  4. If you need to change Resource Defaults or Operation Defaults, proceed as follows:

    1. To adjust a value, either select a different value from the drop-down box or edit the value directly.

    2. To add a new resource default or operation default, choose one from the empty drop-down box and enter a value. If there are default values, Hawk2 proposes them automatically.

    3. To remove a parameter, click the Minus icon next to it. If no values are specified for Resource Defaults and Operation Defaults, the cluster uses the default values that are documented in Section 6.12, “Resource options (meta attributes)” and Section 6.14, “Resource operations”.

  5. Confirm your changes.

5.4.5 Showing the current cluster configuration (CIB)

Sometimes a cluster administrator needs to know the cluster configuration. Hawk2 can show the current configuration in crm shell syntax, as XML and as a graph. To view the cluster configuration in crm shell syntax, from the left navigation bar select Configuration › Edit Configuration and click Show. To show the configuration in raw XML instead, click XML. Click Graph for a graphical representation of the nodes and resources configured in the CIB. It also shows the relationships between resources.

5.4.6 Adding resources with the wizard

The Hawk2 wizard is a convenient way of setting up simple resources like a virtual IP address or an SBD STONITH resource, for example. It is also useful for complex configurations that include multiple resources, like the resource configuration for a DRBD block device or an Apache Web server. The wizard guides you through the configuration steps and provides information about the parameters you need to enter.

Procedure 5.4: Using the resource wizard
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Wizards.

  3. Expand the individual categories by clicking the arrow down icon next to them and select the desired wizard.

  4. Follow the instructions on the screen. After the last configuration step, Verify the values you have entered.

    Hawk2 shows which actions it is going to perform and what the configuration looks like. Depending on the configuration, you might be prompted for the root password before you can Apply the configuration.

Hawk2—wizard for Apache web server
Figure 5.2: Hawk2—wizard for Apache web server

For more information, see Chapter 6, Configuring cluster resources.

5.4.7 Using the batch mode

Hawk2 provides a Batch Mode, including a cluster simulator. It can be used for the following:

  • Staging changes to the cluster and applying them as a single transaction, instead of having each change take effect immediately.

  • Simulating changes and cluster events, for example, to explore potential failure scenarios.

For example, batch mode can be used when creating groups of resources that depend on each other. Using batch mode, you can avoid applying intermediate or incomplete configurations to the cluster.

While batch mode is enabled, you can add or edit resources and constraints or change the cluster configuration. It is also possible to simulate events in the cluster, including nodes going online or offline, resource operations and tickets being granted or revoked. See Procedure 5.6, “Injecting node, resource or ticket events” for details.

The cluster simulator runs automatically after every change and shows the expected outcome in the user interface. For example, this also means: If you stop a resource while in batch mode, the user interface shows the resource as stopped—while actually, the resource is still running.

Important
Important: Wizards and changes to the live system

Some wizards include actions beyond mere cluster configuration. When using those wizards in batch mode, any changes that go beyond cluster configuration would be applied to the live system immediately.

Therefore wizards that require root permission cannot be executed in batch mode.

Procedure 5.5: Working with the batch mode
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. To activate the batch mode, select Batch from the top-level row.

    An additional bar appears below the top-level row. It indicates that batch mode is active and contains links to actions that you can execute in batch mode.

    Hawk2 batch mode activated
    Figure 5.3: Hawk2 batch mode activated
  3. While batch mode is active, perform any changes to your cluster, like adding or editing resources and constraints or editing the cluster configuration.

    The changes will be simulated and shown in all screens.

  4. To view details of the changes you have made, select Show from the batch mode bar. The Batch Mode window opens.

    For any configuration changes it shows the difference between the live state and the simulated changes in crmsh syntax: Lines starting with a - character represent the current state whereas lines starting with + show the proposed state.

  5. To inject events or view even more details, see Procedure 5.6. Otherwise Close the window.

  6. Choose to either Discard or Apply the simulated changes and confirm your choice. This also deactivates batch mode and takes you back to normal mode.

When running in batch mode, Hawk2 also allows you to inject Node Events and Resource Events.

Node events

Let you change the state of a node. Available states are online, offline, and unclean.

Resource events

Let you change some properties of a resource. For example, you can set an operation (like start, stop, monitor), the node it applies to, and the expected result to be simulated.

Ticket events

Let you test the impact of granting and revoking tickets (used for Geo clusters).

Procedure 5.6: Injecting node, resource or ticket events
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. If batch mode is not active yet, click Batch at the top-level row to switch to batch mode.

  3. In the batch mode bar, click Show to open the Batch Mode window.

  4. To simulate a status change of a node:

    1. Click Inject › Node Event.

    2. Select the Node you want to manipulate and select its target State.

    3. Confirm your changes. Your event is added to the queue of events listed in the Batch Mode dialog.

  5. To simulate a resource operation:

    1. Click Inject › Resource Event.

    2. Select the Resource you want to manipulate and select the Operation to simulate.

    3. If necessary, define an Interval.

    4. Select the Node on which to run the operation and the targeted Result. Your event is added to the queue of events listed in the Batch Mode dialog.

    5. Confirm your changes.

  6. To simulate a ticket action:

    1. Click Inject › Ticket Event.

    2. Select the Ticket you want to manipulate and select the Action to simulate.

    3. Confirm your changes. Your event is added to the queue of events listed in the Batch Mode dialog.

  7. The Batch Mode dialog (Figure 5.4) shows a new line per injected event. Any event listed here is simulated immediately and is reflected on the Status screen.

    If you have made any configuration changes, too, the difference between the live state and the simulated changes is shown below the injected events.

    Hawk2 batch mode—injected invents and configuration changes
    Figure 5.4: Hawk2 batch mode—injected invents and configuration changes
  8. To remove an injected event, click the Remove icon next to it. Hawk2 updates the Status screen accordingly.

  9. To view more details about the simulation run, click Simulator and choose one of the following:

    Summary

    Shows a detailed summary.

    CIB (in)/CIB (out)

    CIB (in) shows the initial CIB state. CIB (out) shows what the CIB would look like after the transition.

    Transition graph

    Shows a graphical representation of the transition.

    Transition

    Shows an XML representation of the transition.

  10. If you have reviewed the simulated changes, close the Batch Mode window.

  11. To leave the batch mode, either Apply or Discard the simulated changes.

5.5 Introduction to crmsh

To configure and manage cluster resources, either use the crm shell (crmsh) command line utility or Hawk2, a Web-based user interface.

This section introduces the command line tool crm. The crm command has several subcommands which manage resources, CIBs, nodes, resource agents, and others. It offers a thorough help system with embedded examples. All examples follow a naming convention described in Appendix B.

Events are logged to /var/log/crmsh/crmsh.log.

Note
Note: User privileges

Sufficient privileges are necessary to manage a cluster. The crm command and its subcommands need to be run either as root user or as the CRM owner user (typically the user hacluster).

However, the user option allows you to run crm and its subcommands as a regular (unprivileged) user and to change its ID using sudo whenever necessary. For example, with the following command crm will use hacluster as the privileged user ID:

# crm options user hacluster

Note that you need to set up /etc/sudoers so that sudo does not ask for a password.

Tip
Tip: Interactive crm prompt

By using crm without arguments (or with only one sublevel as argument), the crm shell enters the interactive mode. This mode is indicated by the following prompt:

crm(live/HOSTNAME)

For readability reasons, we omit the host name in the interactive crm prompts in our documentation. We only include the host name if you need to run the interactive shell on a specific node, like alice for example:

crm(live/alice)

5.5.1 Getting help

Help can be accessed in several ways:

  • To output the usage of crm and its command line options:

    # crm --help
  • To give a list of all available commands:

    # crm help
  • To access other help sections, not only the command reference:

    # crm help topics
  • To view the extensive help text of the configure subcommand:

    # crm configure help
  • To print the syntax, its usage, and examples of the group subcommand of configure:

    # crm configure help group

    This is the same:

    # crm help configure group

Almost all output of the help subcommand (do not mix it up with the --help option) opens a text viewer. This text viewer allows you to scroll up or down and read the help text more comfortably. To leave the text viewer, press the Q key.

Tip
Tip: Use tab completion in Bash and interactive shell

The crmsh supports full tab completion in Bash directly, not only for the interactive shell. For example, typing crm help config→| will complete the word like in the interactive shell.

5.5.2 Executing crmsh's subcommands

The crm command itself can be used in the following ways:

  • Directly: Concatenate all subcommands to crm, press Enter and you see the output immediately. For example, enter crm help ra to get information about the ra subcommand (resource agents).

    It is possible to abbreviate subcommands as long as they are unique. For example, you can shorten status as st and crmsh will know what you mean.

    Another feature is to shorten parameters. Usually, you add parameters through the params keyword. You can leave out the params section if it is the first and only section. For example, this line:

    # crm primitive ipaddr IPaddr2 params ip=192.168.0.55

    is equivalent to this line:

    # crm primitive ipaddr IPaddr2 ip=192.168.0.55
  • As crm shell script: Crm shell scripts contain subcommands of crm. For more information, see Section 5.5.4, “Using crmsh's shell scripts”.

  • As crmsh cluster scripts:These are a collection of metadata, references to RPM packages, configuration files, and crmsh subcommands bundled under a single, yet descriptive name. They are managed through the crm script command.

    Do not confuse them with crmsh shell scripts: although both share some common objectives, the crm shell scripts only contain subcommands whereas cluster scripts incorporate much more than a simple enumeration of commands. For more information, see Section 5.5.5, “Using crmsh's cluster scripts”.

  • Interactive as internal shell: Type crm to enter the internal shell. The prompt changes to crm(live). With help you can get an overview of the available subcommands. As the internal shell has different levels of subcommands, you can enter one by typing this subcommand and press Enter.

    For example, if you type resource you enter the resource management level. Your prompt changes to crm(live)resource#. To leave the internal shell, use the command quit. If you need to go one level back, use back, up, end, or cd.

    You can enter the level directly by typing crm and the respective subcommand(s) without any options and press Enter.

    The internal shell supports also tab completion for subcommands and resources. Type the beginning of a command, press →| and crm completes the respective object.

In addition to previously explained methods, crmsh also supports synchronous command execution. Use the -w option to activate it. If you have started crm without -w, you can enable it later with the user preference's wait set to yes (options wait yes). If this option is enabled, crm waits until the transition is finished. Whenever a transaction is started, dots are printed to indicate progress. Synchronous command execution is only applicable for commands like resource start.

Note
Note: Differentiate between management and configuration subcommands

The crm tool has management capability (the subcommands resource and node) and can be used for configuration (cib, configure).

The following subsections give you an overview of some important aspects of the crm tool.

5.5.3 Displaying information about OCF resource agents

As you need to deal with resource agents in your cluster configuration all the time, the crm tool contains the ra command. Use it to show information about resource agents and to manage them (for additional information, see also Section 6.2, “Supported resource agent classes”):

# crm ra
crm(live)ra# 

The command classes lists all classes and providers:

crm(live)ra# classes
 lsb
 ocf / heartbeat linbit lvm2 ocfs2 pacemaker
 service
 stonith
 systemd

To get an overview of all available resource agents for a class (and provider) use the list command:

crm(live)ra# list ocf
AoEtarget           AudibleAlarm        CTDB                ClusterMon
Delay               Dummy               EvmsSCC             Evmsd
Filesystem          HealthCPU           HealthSMART         ICP
IPaddr              IPaddr2             IPsrcaddr           IPv6addr
LVM                 LinuxSCSI           MailTo              ManageRAID
ManageVE            Pure-FTPd           Raid1               Route
SAPDatabase         SAPInstance         SendArp             ServeRAID
...

An overview of a resource agent can be viewed with info:

crm(live)ra# info ocf:linbit:drbd
This resource agent manages a DRBD* resource
as a master/slave resource. DRBD is a shared-nothing replicated storage
device. (ocf:linbit:drbd)

Master/Slave OCF Resource Agent for DRBD

Parameters (* denotes required, [] the default):

drbd_resource* (string): drbd resource name
    The name of the drbd resource from the drbd.conf file.

drbdconf (string, [/etc/drbd.conf]): Path to drbd.conf
    Full path to the drbd.conf file.

Operations' defaults (advisory minimum):

    start         timeout=240
    promote       timeout=90
    demote        timeout=90
    notify        timeout=90
    stop          timeout=100
    monitor_Slave_0 interval=20 timeout=20 start-delay=1m
    monitor_Master_0 interval=10 timeout=20 start-delay=1m

Leave the viewer by pressing Q.

Tip
Tip: Use crm directly

In the former example we used the internal shell of the crm command. However, you do not necessarily need to use it. You get the same results if you add the respective subcommands to crm. For example, you can list all the OCF resource agents by entering crm ra list ocf in your shell.

5.5.4 Using crmsh's shell scripts

The crmsh shell scripts provide a convenient way to enumerate crmsh subcommands into a file. This makes it easy to comment specific lines or to replay them later. Keep in mind that a crmsh shell script can contain only crmsh subcommands. Any other commands are not allowed.

Before you can use a crmsh shell script, create a file with specific commands. For example, the following file prints the status of the cluster and gives a list of all nodes:

Example 5.3: A simple crmsh shell script
# A small example file with some crm subcommands
status
node list

Any line starting with the hash symbol (#) is a comment and is ignored. If a line is too long, insert a backslash (\) at the end and continue in the next line. It is recommended to indent lines that belong to a certain subcommand to improve readability.

To use this script, use one of the following methods:

# crm -f example.cli
# crm < example.cli

5.5.5 Using crmsh's cluster scripts

Collecting information from all cluster nodes and deploying any changes is a key cluster administration task. Instead of performing the same procedures manually on different nodes (which is error-prone), you can use the crmsh cluster scripts.

Do not confuse them with the crmsh shell scripts, which are explained in Section 5.5.4, “Using crmsh's shell scripts”.

In contrast to crmsh shell scripts, cluster scripts performs additional tasks like:

  • Installing software that is required for a specific task.

  • Creating or modifying any configuration files.

  • Collecting information and reporting potential problems with the cluster.

  • Deploying the changes to all nodes.

crmsh cluster scripts do not replace other tools for managing clusters—they provide an integrated way to perform the above tasks across the cluster. Find detailed information at http://crmsh.github.io/scripts/.

5.5.5.1 Usage

To get a list of all available cluster scripts, run:

# crm script list

To view the components of a script, use the show command and the name of the cluster script, for example:

# crm script show mailto
mailto (Basic)
MailTo

 This is a resource agent for MailTo. It sends email to a sysadmin
whenever  a takeover occurs.

1. Notifies recipients by email in the event of resource takeover

  id (required)  (unique)
      Identifier for the cluster resource
  email (required)
      Email address
  subject
      Subject

The output of show contains a title, a short description, and a procedure. Each procedure is divided into a series of steps, performed in the given order.

Each step contains a list of required and optional parameters, along with a short description and its default value.

Each cluster script understands a set of common parameters. These parameters can be passed to any script:

Table 5.1: Common parameters
ParameterArgumentDescription
actionINDEXIf set, only execute a single action (index, as returned by verify)
dry_runBOOLIf set, simulate execution only (default: no)
nodesLISTList of nodes to execute the script for
portNUMBERPort to connect to
statefileFILEWhen single-stepping, the state is saved in the given file
sudoBOOLIf set, crm will prompt for a sudo password and use sudo where appropriate (default: no)
timeoutNUMBERExecution timeout in seconds (default: 600)
userUSERRun script as the given user

5.5.5.2 Verifying and running a cluster script

Before running a cluster script, review the actions that it will perform and verify its parameters to avoid problems. A cluster script can potentially perform a series of actions and may fail for various reasons. Thus, verifying your parameters before running it helps to avoid problems.

For example, the mailto resource agent requires a unique identifier and an e-mail address. To verify these parameters, run:

# crm script verify mailto id=sysadmin email=tux@example.org
1. Ensure mail package is installed

        mailx

2. Configure cluster resources

        primitive sysadmin MailTo
                email="tux@example.org"
                op start timeout="10"
                op stop timeout="10"
                op monitor interval="10" timeout="10"

        clone c-sysadmin sysadmin

The verify command prints the steps and replaces any placeholders with your given parameters. If verify finds any problems, it will report it. If everything is OK, replace the verify command with run:

# crm script run mailto id=sysadmin email=tux@example.org
INFO: MailTo
INFO: Nodes: alice, bob
OK: Ensure mail package is installed
OK: Configure cluster resources

Check whether your resource is integrated into your cluster with crm status:

# crm status
[...]
 Clone Set: c-sysadmin [sysadmin]
     Started: [ alice bob ]

5.5.6 Using configuration templates

Note
Note: Deprecation notice

The use of configuration templates is deprecated and will be removed in the future. Configuration templates will be replaced by cluster scripts, see Section 5.5.5, “Using crmsh's cluster scripts”.

Configuration templates are ready-made cluster configurations for crmsh. Do not confuse them with the resource templates (as described in Section 6.8.2, “Creating resource templates with crmsh”). Those are templates for the cluster and not for the crm shell.

Configuration templates require minimum effort to be tailored to the particular user's needs. Whenever a template creates a configuration, warning messages give hints which can be edited later for further customization.

The following procedure shows how to create a simple yet functional Apache configuration:

  1. Log in as root and start the crm interactive shell:

    # crm configure
  2. Create a new configuration from a configuration template:

    1. Switch to the template subcommand:

      crm(live)configure# template
    2. List the available configuration templates:

      crm(live)configure template# list templates
      gfs2-base   filesystem  virtual-ip  apache   clvm     ocfs2    gfs2
    3. Decide which configuration template you need. As we need an Apache configuration, we select the apache template and name it g-intranet:

      crm(live)configure template# new g-intranet apache
      INFO: pulling in template apache
      INFO: pulling in template virtual-ip
  3. Define your parameters:

    1. List the configuration you have created:

      crm(live)configure template# list
      g-intranet
    2. Display the minimum required changes that need to be filled out by you:

      crm(live)configure template# show
      ERROR: 23: required parameter ip not set
      ERROR: 61: required parameter id not set
      ERROR: 65: required parameter configfile not set
    3. Invoke your preferred text editor and fill out all lines that have been displayed as errors in Step 3.b:

      crm(live)configure template# edit
  4. Show the configuration and check whether it is valid (bold text depends on the configuration you have entered in Step 3.c):

    crm(live)configure template# show
    primitive virtual-ip ocf:heartbeat:IPaddr \
        params ip="192.168.1.101"
    primitive apache apache \
        params configfile="/etc/apache2/httpd.conf"
        monitor apache 120s:60s
    group g-intranet \
        apache virtual-ip
  5. Apply the configuration:

    crm(live)configure template# apply
    crm(live)configure# cd ..
    crm(live)configure# show
  6. Submit your changes to the CIB:

    crm(live)configure# commit

It is possible to simplify the commands even more, if you know the details. The above procedure can be summarized with the following command on the shell:

# crm configure template \
   new g-intranet apache params \
   configfile="/etc/apache2/httpd.conf" ip="192.168.1.101"

If you are inside your internal crm shell, use the following command:

crm(live)configure template# new intranet apache params \
   configfile="/etc/apache2/httpd.conf" ip="192.168.1.101"

However, the previous command only creates its configuration from the configuration template. It does not apply nor commit it to the CIB.

5.5.7 Testing with shadow configuration

A shadow configuration is used to test different configuration scenarios. If you have created several shadow configurations, you can test them one by one to see the effects of your changes.

The usual process looks like this:

  1. Log in as root and start the crm interactive shell:

    # crm configure
  2. Create a new shadow configuration:

    crm(live)configure# cib new myNewConfig
    INFO: myNewConfig shadow CIB created

    If you omit the name of the shadow CIB, a temporary name @tmp@ is created.

  3. To copy the current live configuration into your shadow configuration, use the following command, otherwise skip this step:

    crm(myNewConfig)# cib reset myNewConfig

    The previous command makes it easier to modify any existing resources later.

  4. Make your changes as usual. After you have created the shadow configuration, all changes go there. To save all your changes, use the following command:

    crm(myNewConfig)# commit
  5. If you need the live cluster configuration again, switch back with the following command:

    crm(myNewConfig)configure# cib use live
    crm(live)# 

5.5.8 Debugging your configuration changes

Before loading your configuration changes back into the cluster, it is recommended to review your changes with ptest. The ptest command can show a diagram of actions that will be induced by committing the changes. You need the graphviz package to display the diagrams. The following example is a transcript, adding a monitor operation:

# crm configure
crm(live)configure# show fence-bob
primitive fence-bob stonith:apcsmart \
        params hostlist="bob"
crm(live)configure# monitor fence-bob 120m:60s
crm(live)configure# show changed
primitive fence-bob stonith:apcsmart \
        params hostlist="bob" \
        op monitor interval="120m" timeout="60s"
crm(live)configure# ptest
crm(live)configure# commit

5.5.9 Cluster diagram

To output a cluster diagram, use the command crm configure graph. It displays the current configuration on its current window, therefore requiring X11.

If you prefer Scalable Vector Graphics (SVG), use the following command:

# crm configure graph dot config.svg svg

5.5.10 Managing Corosync configuration

Corosync is the underlying messaging layer for most HA clusters. The corosync subcommand provides commands for editing and managing the Corosync configuration.

For example, to list the status of the cluster, use status:

# crm corosync status
Printing ring status.
Local node ID 175704363
RING ID 0
        id      = 10.121.9.43
        status  = ring 0 active with no faults
Quorum information
------------------
Date:             Thu May  8 16:41:56 2014
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          175704363
Ring ID:          4032
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
 175704363          1 alice.example.com (local)
 175704619          1 bob.example.com

The diff command is very helpful: It compares the Corosync configuration on all nodes (if not stated otherwise) and prints the difference between:

# crm corosync diff
--- bob
+++ alice
@@ -46,2 +46,2 @@
-       expected_votes: 2
-       two_node: 1
+       expected_votes: 1
+       two_node: 0

For more details, see http://crmsh.nongnu.org/crm.8.html#cmdhelp_corosync.

5.5.11 Setting passwords independent of cib.xml

If your cluster configuration contains sensitive information, such as passwords, it should be stored in local files. That way, these parameters will never be logged or leaked in support reports.

Before using secret, run the show command to get an overview of all your resources:

# crm configure show
primitive mydb mysql \
   params replication_user=admin ...

To set a password for the above mydb resource, use the following commands:

# crm resource secret mydb set passwd linux
INFO: syncing /var/lib/heartbeat/lrm/secrets/mydb/passwd to [your node list]

You can get the saved password back with:

# crm resource secret mydb show passwd
linux

Note that the parameters need to be synchronized between nodes; the crm resource secret command will take care of that. We highly recommend to only use this command to manage secret parameters.

5.6 For more information

http://crmsh.github.io/

Home page of the crm shell (crmsh), the advanced command line interface for High Availability cluster management.

http://crmsh.github.io/documentation

Holds several documents about the crm shell, including a Getting Started tutorial for basic cluster setup with crmsh and the comprehensive Manual for the crm shell. The latter is available at http://crmsh.github.io/man-2.0/. Find the tutorial at http://crmsh.github.io/start-guide/.

http://clusterlabs.org/

Home page of Pacemaker, the cluster resource manager shipped with SUSE Linux Enterprise High Availability.

http://www.clusterlabs.org/pacemaker/doc/

Holds several comprehensive manuals and some shorter documents explaining general concepts. For example:

  • Pacemaker Explained: Contains comprehensive and very detailed information for reference.

  • Colocation Explained

  • Ordering Explained

6 Configuring cluster resources

As a cluster administrator, you need to create cluster resources for every resource or application you run on servers in your cluster. Cluster resources can include Web sites, e-mail servers, databases, file systems, virtual machines, and any other server-based applications or services you want to make available to users at all times.

6.1 Types of resources

The following types of resources can be created:

Primitives

A primitive resource, the most basic type of resource.

Groups

Groups contain a set of resources that need to be located together, started sequentially and stopped in the reverse order.

Clones

Clones are resources that can be active on multiple hosts. Any resource can be cloned, provided the respective resource agent supports it.

Promotable clones (also known as multi-state resources) are a special type of clone resources that can be promoted.

6.2 Supported resource agent classes

For each cluster resource you add, you need to define the standard that the resource agent conforms to. Resource agents abstract the services they provide and present an accurate status to the cluster, which allows the cluster to be non-committal about the resources it manages. The cluster relies on the resource agent to react appropriately when given a start, stop or monitor command.

Typically, resource agents come in the form of shell scripts. SUSE Linux Enterprise High Availability supports the following classes of resource agents:

Open Cluster Framework (OCF) resource agents

OCF RA agents are best suited for use with High Availability, especially when you need promotable clone resources or special monitoring abilities. The agents are generally located in /usr/lib/ocf/resource.d/provider/. Their functionality is similar to that of LSB scripts. However, the configuration is always done with environmental variables which allow them to accept and process parameters easily. OCF specifications have strict definitions of which exit codes must be returned by actions, see Section 10.3, “OCF return codes and failure recovery”. The cluster follows these specifications exactly.

All OCF Resource Agents are required to have at least the actions start, stop, status, monitor, and meta-data. The meta-data action retrieves information about how to configure the agent. For example, to know more about the IPaddr agent by the provider heartbeat, use the following command:

OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/IPaddr meta-data

The output is information in XML format, including several sections (general description, available parameters, available actions for the agent).

Alternatively, use the crmsh to view information on OCF resource agents. For details, see Section 5.5.3, “Displaying information about OCF resource agents”.

Linux Standards Base (LSB) scripts

LSB resource agents are generally provided by the operating system/distribution and are found in /etc/init.d. To be used with the cluster, they must conform to the LSB init script specification. For example, they must have several actions implemented, which are, at minimum, start, stop, restart, reload, force-reload, and status. For more information, see http://refspecs.linuxbase.org/LSB_4.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html.

The configuration of those services is not standardized. If you intend to use an LSB script with High Availability, make sure that you understand how the relevant script is configured. Often you can find information about this in the documentation of the relevant package in /usr/share/doc/packages/PACKAGENAME.

Systemd

Starting with SUSE Linux Enterprise 12, systemd is a replacement for the popular System V init daemon. Pacemaker can manage systemd services if they are present. Instead of init scripts, systemd has unit files. Generally the services (or unit files) are provided by the operating system. In case you want to convert existing init scripts, find more information at http://0pointer.de/blog/projects/systemd-for-admins-3.html.

Service

There are currently many common types of system services that exist in parallel: LSB (belonging to System V init), systemd, and (in some distributions) upstart. Therefore, Pacemaker supports a special alias which intelligently figures out which one applies to a given cluster node. This is particularly useful when the cluster contains a mix of systemd, upstart, and LSB services. Pacemaker will try to find the named service in the following order: as an LSB (SYS-V) init script, a systemd unit file, or an Upstart job.

Nagios

Monitoring plug-ins (formerly called Nagios plug-ins) allow to monitor services on remote hosts. Pacemaker can do remote monitoring with the monitoring plug-ins if they are present. For detailed information, see Section 9.1, “Monitoring services on remote hosts with monitoring plug-ins”.

STONITH (fencing) resource agents

This class is used exclusively for fencing related resources. For more information, see Chapter 12, Fencing and STONITH.

The agents supplied with SUSE Linux Enterprise High Availability are written to OCF specifications.

6.3 Timeout values

Timeouts values for resources can be influenced by the following parameters:

  • op_defaults (global timeout for operations),

  • a specific timeout value defined in a resource template,

  • a specific timeout value defined for a resource.

Note
Note: Priority of values

If a specific value is defined for a resource, it takes precedence over the global default. A specific value for a resource also takes precedence over a value that is defined in a resource template.

Getting timeout values right is very important. Setting them too low will result in a lot of (unnecessary) fencing operations for the following reasons:

  1. If a resource runs into a timeout, it fails and the cluster will try to stop it.

  2. If stopping the resource also fails (for example, because the timeout for stopping is set too low), the cluster will fence the node. It considers the node where this happens to be out of control.

You can adjust the global default for operations and set any specific timeout values with both crmsh and Hawk2. The best practice for determining and setting timeout values is as follows:

Procedure 6.1: Determining timeout values
  1. Check how long it takes your resources to start and stop (under load).

  2. If needed, add the op_defaults parameter and set the (default) timeout value accordingly:

    1. For example, set op_defaults to 60 seconds:

      crm(live)configure# op_defaults timeout=60
    2. For resources that need longer periods of time, define individual timeout values.

  3. When configuring operations for a resource, add separate start and stop operations. When configuring operations with Hawk2, it will provide useful timeout proposals for those operations.

6.4 Creating primitive resources

Before you can use a resource in the cluster, it must be set up. For example, to use an Apache server as a cluster resource, set up the Apache server first and complete the Apache configuration before starting the respective resource in your cluster.

If a resource has specific environment requirements, make sure they are present and identical on all cluster nodes. This kind of configuration is not managed by SUSE Linux Enterprise High Availability. You must do this yourself.

You can create primitive resources using either Hawk2 or crmsh.

Note
Note: Do not touch services managed by the cluster

When managing a resource with SUSE Linux Enterprise High Availability, the same resource must not be started or stopped otherwise (outside of the cluster, for example manually or on boot or reboot). The High Availability software is responsible for all service start or stop actions.

If you need to execute testing or maintenance tasks after the services are already running under cluster control, make sure to put the resources, nodes, or the whole cluster into maintenance mode before you touch any of them manually. For details, see Section 28.2, “Different options for maintenance tasks”.

Important
Important: Resource IDs and node names

Cluster resources and cluster nodes should be named differently. Otherwise Hawk2 will fail.

6.4.1 Creating primitive resources with Hawk2

To create the most basic type of resource, proceed as follows:

Procedure 6.2: Adding a primitive resource with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Primitive.

  3. Enter a unique Resource ID.

  4. If a resource template exists on which you want to base the resource configuration, select the respective Template.

  5. Select the resource agent Class you want to use: lsb, ocf, service, stonith, or systemd. For more information, see Section 6.2, “Supported resource agent classes”.

  6. If you selected ocf as class, specify the Provider of your OCF resource agent. The OCF specification allows multiple vendors to supply the same resource agent.

  7. From the Type list, select the resource agent you want to use (for example, IPaddr or Filesystem). A short description for this resource agent is displayed.

    Note
    Note

    The selection you get in the Type list depends on the Class (and for OCF resources also on the Provider) you have chosen.

    Hawk2—primitive resource
    Figure 6.1: Hawk2—primitive resource
  8. After you have specified the resource basics, Hawk2 shows the following categories. Either keep these categories as suggested by Hawk2, or edit them as required.

    Parameters (instance attributes)

    Determines which instance of a service the resource controls. When creating a resource, Hawk2 automatically shows any required parameters. Edit them to get a valid resource configuration.

    For more information, refer to Section 6.13, “Instance attributes (parameters)”.

    Operations

    Needed for resource monitoring. When creating a resource, Hawk2 displays the most important resource operations (monitor, start, and stop).

    For more information, refer to Section 6.14, “Resource operations”.

    Meta attributes

    Tells the CRM how to treat a specific resource. When creating a resource, Hawk2 automatically lists the important meta attributes for that resource (for example, the target-role attribute that defines the initial state of a resource. By default, it is set to Stopped, so the resource will not start immediately).

    For more information, refer to Section 6.12, “Resource options (meta attributes)”.

    Utilization

    Tells the CRM what capacity a certain resource requires from a node.

    For more information, refer to Section 7.10.1, “Placing resources based on their load impact with Hawk2”.

  9. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

6.4.2 Creating primitive resources with crmsh

Procedure 6.3: Adding a primitive resource with crmsh
  1. Log in as root and start the crm tool:

    # crm configure
  2. Configure a primitive IP address:

    crm(live)configure# primitive myIP IPaddr \
          params ip=127.0.0.99 op monitor interval=60s

    The previous command configures a primitive with the name myIP. You need to choose a class (here ocf), provider (heartbeat), and type (IPaddr). Furthermore, this primitive expects other parameters like the IP address. Change the address to your setup.

  3. Display and review the changes you have made:

    crm(live)configure# show
  4. Commit your changes to take effect:

    crm(live)configure# commit

6.5 Creating resource groups

Some cluster resources depend on other components or resources. They require that each component or resource starts in a specific order and runs together on the same server with resources it depends on. To simplify this configuration, you can use cluster resource groups.

You can create resource groups using either Hawk2 or crmsh.

Example 6.1: Resource group for a web server

An example of a resource group would be a Web server that requires an IP address and a file system. In this case, each component is a separate resource that is combined into a cluster resource group. The resource group would run on one or more servers. In case of a software or hardware malfunction, the group would fail over to another server in the cluster, similar to an individual cluster resource.

Group resource
Figure 6.2: Group resource

Groups have the following properties:

Starting and stopping

Resources are started in the order they appear in and stopped in the reverse order.

Dependency

If a resource in the group cannot run anywhere, then none of the resources located after that resource in the group is allowed to run.

Contents

Groups may only contain a collection of primitive cluster resources. Groups must contain at least one resource, otherwise the configuration is not valid. To refer to the child of a group resource, use the child's ID instead of the group's ID.

Constraints

Although it is possible to reference the group's children in constraints, it is usually preferable to use the group's name instead.

Stickiness

Stickiness is additive in groups. Every active member of the group will contribute its stickiness value to the group's total. So if the default resource-stickiness is 100 and a group has seven members (five of which are active), the group as a whole will prefer its current location with a score of 500.

Resource monitoring

To enable resource monitoring for a group, you must configure monitoring separately for each resource in the group that you want monitored.

6.5.1 Creating resource groups with Hawk2

Note
Note: Empty groups

Groups must contain at least one resource, otherwise the configuration is not valid. While creating a group, Hawk2 allows you to create more primitives and add them to the group.

Procedure 6.4: Adding a resource group with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Group.

  3. Enter a unique Group ID.

  4. To define the group members, select one or multiple entries in the list of Children. Re-sort group members by dragging and dropping them into the order you want by using the handle icon on the right.

  5. If needed, modify or add Meta Attributes.

  6. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

Hawk2—resource group
Figure 6.3: Hawk2—resource group

6.5.2 Creating a resource group with crmsh

The following example creates two primitives (an IP address and an e-mail resource).

Procedure 6.5: Adding a resource group with crmsh
  1. Run the crm command as system administrator. The prompt changes to crm(live).

  2. Configure the primitives:

    crm(live)# configure
    crm(live)configure# primitive Public-IP ocf:heartbeat:IPaddr2 \
        params ip=1.2.3.4 \
        op monitor interval=10s
    crm(live)configure# primitive Email systemd:postfix \
        op monitor interval=10s
  3. Group the primitives with their relevant identifiers in the correct order:

    crm(live)configure# group g-mailsvc Public-IP Email

6.6 Creating clone resources

You may want certain resources to run simultaneously on multiple nodes in your cluster. To do this you must configure a resource as a clone. Examples of resources that might be configured as clones include cluster file systems like OCFS2. You can clone any resource provided. This is supported by the resource's Resource Agent. Clone resources may even be configured differently depending on which nodes they are hosted.

There are three types of resource clones:

Anonymous clones

These are the simplest type of clones. They behave identically anywhere they are running. Because of this, there can only be one instance of an anonymous clone active per machine.

Globally unique clones

These resources are distinct entities. An instance of the clone running on one node is not equivalent to another instance on another node; nor would any two instances on the same node be equivalent.

Promotable clones (multi-state resources)

Active instances of these resources are divided into two states, active and passive. These are also sometimes called primary and secondary. Promotable clones can be either anonymous or globally unique. For more information, see Section 6.7, “Creating promotable clones (multi-state resources)”.

Clones must contain exactly one group or one regular resource.

When configuring resource monitoring or constraints, clones have different requirements than simple resources. For details, see Pacemaker Explained, available from http://www.clusterlabs.org/pacemaker/doc/.

You can create clone resources using either Hawk2 or crmsh.

6.6.1 Creating clone resources with Hawk2

Note
Note: Child resources for clones

Clones can either contain a primitive or a group as child resources. In Hawk2, child resources cannot be created or modified while creating a clone. Before adding a clone, create child resources and configure them as desired.

Procedure 6.6: Adding a clone resource with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Clone.

  3. Enter a unique Clone ID.

  4. From the Child Resource list, select the primitive or group to use as a sub-resource for the clone.

  5. If needed, modify or add Meta Attributes.

  6. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

Hawk2—clone resource
Figure 6.4: Hawk2—clone resource

6.6.2 Creating clone resources with crmsh

To create an anonymous clone resource, first create a primitive resource and then refer to it with the clone command.

Procedure 6.7: Adding a clone resource with crmsh
  1. Log in as root and start the crm interactive shell:

    # crm configure
  2. Configure the primitive, for example:

    crm(live)configure# primitive Apache apache
  3. Clone the primitive:

    crm(live)configure# clone cl-apache Apache

6.7 Creating promotable clones (multi-state resources)

Promotable clones (formerly known as multi-state resources) are a specialization of clones. They allow the instances to be in one of two operating modes (primary or secondary). Promotable clones must contain exactly one group or one regular resource.

When configuring resource monitoring or constraints, promotable clones have different requirements than simple resources. For details, see Pacemaker Explained, available from http://www.clusterlabs.org/pacemaker/doc/.

You can create promotable clones using either Hawk2 or crmsh.

6.7.1 Creating promotable clones with Hawk2

Note
Note: Child resources for promotable clones

Promotable clones can either contain a primitive or a group as child resources. In Hawk2, child resources cannot be created or modified while creating a promotable clone. Before adding a promotable clone, create child resources and configure them as desired. See Section 6.4.1, “Creating primitive resources with Hawk2” or Section 6.5.1, “Creating resource groups with Hawk2”.

Procedure 6.8: Adding a promotable clone with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Multi-state.

  3. Enter a unique Multi-state ID.

  4. From the Child Resource list, select the primitive or group to use as a sub-resource for the multi-state resource.

  5. If needed, modify or add Meta Attributes.

  6. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

6.7.2 Creating promotable clones with crmsh

To create a promotable clone resource, first create a primitive resource and then the promotable clone resource. The promotable clone resource must support at least promote and demote operations.

Procedure 6.9: Adding a promotable clone with crmsh
  1. Log in as root and start the crm interactive shell:

    # crm configure
  2. Configure the primitive. Change the intervals if needed:

    crm(live)configure# primitive my-rsc ocf:myCorp:myAppl \
        op monitor interval=60 \
        op monitor interval=61 role=Promoted
  3. Create the promotable clone resource:

    crm(live)configure# clone clone-rsc my-rsc meta promotable=true

6.8 Creating resource templates

If you want to create lots of resources with similar configurations, defining a resource template is the easiest way. After being defined, it can be referenced in primitives, or in certain types of constraints as described in Section 7.3, “Resource templates and constraints”.

If a template is referenced in a primitive, the primitive will inherit all operations, instance attributes (parameters), meta attributes, and utilization attributes defined in the template. Additionally, you can define specific operations or attributes for your primitive. If any of these are defined in both the template and the primitive, the values defined in the primitive will take precedence over the ones defined in the template.

You can create resource templates using either Hawk2 or crmsh.

6.8.1 Creating resource templates with Hawk2

Resource templates are configured like primitive resources.

Procedure 6.10: Adding a resource template
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Template.

  3. Enter a unique Resource ID.

  4. Follow the instructions in Procedure 6.2, “Adding a primitive resource with Hawk2”, starting from Step 5.

6.8.2 Creating resource templates with crmsh

Use the rsc_template command to get familiar with the syntax:

# crm configure rsc_template
 usage: rsc_template <name> [<class>:[<provider>:]]<type>
        [params <param>=<value> [<param>=<value>...]]
        [meta <attribute>=<value> [<attribute>=<value>...]]
        [utilization <attribute>=<value> [<attribute>=<value>...]]
        [operations id_spec
            [op op_type [<attribute>=<value>...] ...]]

For example, the following command creates a new resource template with the name BigVM derived from the ocf:heartbeat:Xen resource and some default values and operations:

crm(live)configure# rsc_template BigVM ocf:heartbeat:Xen \
   params allow_mem_management="true" \
   op monitor timeout=60s interval=15s \
   op stop timeout=10m \
   op start timeout=10m

Once you define the new resource template, you can use it in primitives or reference it in order, colocation, or rsc_ticket constraints. To reference the resource template, use the @ sign:

crm(live)configure# primitive MyVM1 @BigVM \
   params xmfile="/etc/xen/shared-vm/MyVM1" name="MyVM1"

The new primitive MyVM1 is going to inherit everything from the BigVM resource templates. For example, the equivalent of the above two would be:

crm(live)configure# primitive MyVM1 Xen \
   params xmfile="/etc/xen/shared-vm/MyVM1" name="MyVM1" \
   params allow_mem_management="true" \
   op monitor timeout=60s interval=15s \
   op stop timeout=10m \
   op start timeout=10m

If you want to overwrite some options or operations, add them to your (primitive) definition. For example, the following new primitive MyVM2 doubles the timeout for monitor operations but leaves others untouched:

crm(live)configure# primitive MyVM2 @BigVM \
   params xmfile="/etc/xen/shared-vm/MyVM2" name="MyVM2" \
   op monitor timeout=120s interval=30s

A resource template may be referenced in constraints to stand for all primitives which are derived from that template. This helps to produce a more concise and clear cluster configuration. Resource template references are allowed in all constraints except location constraints. Colocation constraints may not contain more than one template reference.

6.9 Creating STONITH resources

Important
Important: No Support Without STONITH
  • You must have a node fencing mechanism for your cluster.

  • The global cluster options stonith-enabled and startup-fencing must be set to true. When you change them, you lose support.

By default, the global cluster option stonith-enabled is set to true. If no STONITH resources have been defined, the cluster will refuse to start any resources. Configure one or more STONITH resources to complete the STONITH setup. While STONITH resources are configured similarly to other resources, their behavior is different in some respects. For details refer to Section 12.3, “STONITH resources and configuration”.

You can create STONITH resources using either Hawk2 or crmsh.

6.9.1 Creating STONITH resources with Hawk2

To add a STONITH resource for SBD, for libvirt (KVM/Xen), or for vCenter/ESX Server, the easiest way is to use the Hawk2 wizard.

Procedure 6.11: Adding a STONITH resource with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Primitive.

  3. Enter a unique Resource ID.

  4. From the Class list, select the resource agent class stonith.

  5. From the Type list, select the STONITH plug-in to control your STONITH device. A short description for this plug-in is displayed.

  6. Hawk2 automatically shows the required Parameters for the resource. Enter values for each parameter.

  7. Hawk2 displays the most important resource Operations and proposes default values. If you do not modify any settings here, Hawk2 adds the proposed operations and their default values when you confirm.

  8. If there is no reason to change them, keep the default Meta Attributes settings.

    Hawk2—STONITH resource
    Figure 6.5: Hawk2—STONITH resource
  9. Confirm your changes to create the STONITH resource.

    A message at the top of the screen shows if the action has been successful.

To complete your fencing configuration, add constraints. For more details, refer to Chapter 12, Fencing and STONITH.

6.9.2 Creating STONITH resources with crmsh

Procedure 6.12: Adding a STONITH resource with crmsh
  1. Log in as root and start the crm interactive shell:

    # crm
  2. Get a list of all STONITH types with the following command:

    crm(live)# ra list stonith
    apcmaster                  apcmastersnmp              apcsmart
    baytech                    bladehpi                   cyclades
    drac3                      external/drac5             external/dracmc-telnet
    external/hetzner           external/hmchttp           external/ibmrsa
    external/ibmrsa-telnet     external/ipmi              external/ippower9258
    external/kdumpcheck        external/libvirt           external/nut
    external/rackpdu           external/riloe             external/sbd
    external/vcenter           external/vmware            external/xen0
    external/xen0-ha           fence_legacy               ibmhmc
    ipmilan                    meatware                   nw_rpc100s
    rcd_serial                 rps10                      suicide
    wti_mpc                    wti_nps
  3. Choose a STONITH type from the above list and view the list of possible options. Use the following command:

    crm(live)# ra info stonith:external/ipmi
    IPMI STONITH external device (stonith:external/ipmi)
    
    ipmitool based power management. Apparently, the power off
    method of ipmitool is intercepted by ACPI which then makes
    a regular shutdown. If case of a split brain on a two-node
    it may happen that no node survives. For two-node clusters
    use only the reset method.
    
    Parameters (* denotes required, [] the default):
    
    hostname (string): Hostname
       The name of the host to be managed by this STONITH device.
    ...
  4. Create the STONITH resource with the stonith class, the type you have chosen in Step 3, and the respective parameters if needed, for example:

    crm(live)# configure
    crm(live)configure# primitive my-stonith stonith:external/ipmi \
       params hostname="alice" \
       ipaddr="192.168.1.221" \
       userid="admin" passwd="secret" \
       op monitor interval=60m timeout=120s

6.10 Configuring resource monitoring

If you want to ensure that a resource is running, you must configure resource monitoring for it. You can configure resource monitoring using either Hawk2 or crmsh.

If the resource monitor detects a failure, the following takes place:

  • Log file messages are generated, according to the configuration specified in the logging section of /etc/corosync/corosync.conf.

  • The failure is reflected in the cluster management tools (Hawk2, crm status), and in the CIB status section.

  • The cluster initiates noticeable recovery actions which may include stopping the resource to repair the failed state and restarting the resource locally or on another node. The resource also may not be restarted, depending on the configuration and state of the cluster.

If you do not configure resource monitoring, resource failures after a successful start will not be communicated, and the cluster will always show the resource as healthy.

Usually, resources are only monitored by the cluster while they are running. However, to detect concurrency violations, also configure monitoring for resources which are stopped. For resource monitoring, specify a timeout and/or start delay value, and an interval. The interval tells the CRM how often it should check the resource status. You can also set particular parameters such as timeout for start or stop operations.

For more information about monitor operation parameters, see Section 6.14, “Resource operations”.

6.10.1 Configuring resource monitoring with Hawk2

Procedure 6.13: Adding and modifying an operation
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. Add a resource as described in Procedure 6.2, “Adding a primitive resource with Hawk2” or select an existing primitive to edit.

    Hawk2 automatically shows the most important Operations (start, stop, monitor) and proposes default values.

    To see the attributes belonging to each proposed value, hover the mouse pointer over the respective value.

    Image
  3. To change the suggested timeout values for the start or stop operation:

    1. Click the pen icon next to the operation.

    2. In the dialog that opens, enter a different value for the timeout parameter, for example 10, and confirm your change.

  4. To change the suggested interval value for the monitor operation:

    1. Click the pen icon next to the operation.

    2. In the dialog that opens, enter a different value for the monitoring interval.

    3. To configure resource monitoring in the case that the resource is stopped:

      1. Select the role entry from the empty drop-down box below.

      2. From the role drop-down box, select Stopped.

      3. Click Apply to confirm your changes and to close the dialog for the operation.

  5. Confirm your changes in the resource configuration screen. A message at the top of the screen shows if the action has been successful.

To view resource failures, switch to the Status screen in Hawk2 and select the resource you are interested in. In the Operations column click the arrow down icon and select Recent Events. The dialog that opens lists recent actions performed for the resource. Failures are displayed in red. To view the resource details, click the magnifier icon in the Operations column.

Hawk2—resource details
Figure 6.6: Hawk2—resource details

6.10.2 Configuring resource monitoring with crmsh

To monitor a resource, there are two possibilities: either define a monitor operation with the op keyword or use the monitor command. The following example configures an Apache resource and monitors it every 60 seconds with the op keyword:

crm(live)configure# primitive apache apache \
  params ... \
  op monitor interval=60s timeout=30s

The same can be done with the following commands:

crm(live)configure# primitive apache apache \
   params ...
 crm(live)configure# monitor apache 60s:30s
Monitoring stopped resources

Usually, resources are only monitored by the cluster as long as they are running. However, to detect concurrency violations, also configure monitoring for resources which are stopped. For example:

crm(live)configure# primitive dummy1 Dummy \
     op monitor interval="300s" role="Stopped" timeout="10s" \
     op monitor interval="30s" timeout="10s"

This configuration triggers a monitoring operation every 300 seconds for the resource dummy1 when it is in role="Stopped". When running, it will be monitored every 30 seconds.

Probing

The CRM executes an initial monitoring for each resource on every node, the so-called probe. A probe is also executed after the cleanup of a resource. If multiple monitoring operations are defined for a resource, the CRM will select the one with the smallest interval and will use its timeout value as default timeout for probing. If no monitor operation is configured, the cluster-wide default applies. The default is 20 seconds (if not specified otherwise by configuring the op_defaults parameter). If you do not want to rely on the automatic calculation or the op_defaults value, define a specific monitoring operation for the probing of this resource. Do so by adding a monitoring operation with the interval set to 0, for example:

crm(live)configure# primitive rsc1 ocf:pacemaker:Dummy \
     op monitor interval="0" timeout="60"

The probe of rsc1 will time out in 60s, independent of the global timeout defined in op_defaults, or any other operation timeouts configured. If you did not set interval="0" for specifying the probing of the respective resource, the CRM will automatically check for any other monitoring operations defined for that resource and will calculate the timeout value for probing as described above.

6.11 Loading resources from a file

Parts or all of the configuration can be loaded from a local file or a network URL. Three different methods can be defined:

replace

This option replaces the current configuration with the new source configuration.

update

This option tries to import the source configuration. It adds new items or updates existing items to the current configuration.

push

This option imports the content from the source into the current configuration (same as update). However, it removes objects that are not available in the new configuration.

To load the new configuration from the file mycluster-config.txt use the following syntax:

# crm configure load push mycluster-config.txt

6.12 Resource options (meta attributes)

For each resource you add, you can define options. Options are used by the cluster to decide how your resource should behave; they tell the CRM how to treat a specific resource. Resource options can be set with the crm_resource --meta command or with Hawk2.

The following resource options are available:

priority

If not all resources can be active, the cluster stops lower priority resources to keep higher priority resources active.

The default value is 0.

target-role

In what state should the cluster attempt to keep this resource? Allowed values: Stopped, Started, Unpromoted, Promoted.

The default value is Started.

is-managed

Is the cluster allowed to start and stop the resource? Allowed values: true, false. If the value is set to false, the status of the resource is still monitored and any failures are reported. This is different from setting a resource to maintenance="true".

The default value is true.

maintenance

Can the resources be touched manually? Allowed values: true, false. If set to true, all resources become unmanaged: the cluster stops monitoring them and does not know their status. You can stop or restart cluster resources without the cluster attempting to restart them.

The default value is false.

resource-stickiness

How much does the resource prefer to stay where it is?

The default value is 1 for individual clone instances, and 0 for all other resources.

migration-threshold

How many failures should occur for this resource on a node before making the node ineligible to host this resource?

The default value is INFINITY.

multiple-active

What should the cluster do if it ever finds the resource active on more than one node? Allowed values: block (mark the resource as unmanaged), stop_only, stop_start.

The default value is stop_start.

failure-timeout

How many seconds to wait before acting as if the failure did not occur (and potentially allowing the resource back to the node on which it failed)?

The default value is 0 (disabled).

allow-migrate

Whether to allow live migration for resources that support migrate_to and migrate_from actions. If the value is set to true, the resource can be migrated without loss of state. If the value is set to false, the resource will be shut down on the first node and restarted on the second node.

The default value is true for ocf:pacemaker:remote resources, and false for all other resources.

remote-node

The name of the remote node this resource defines. This both enables the resource as a remote node and defines the unique name used to identify the remote node. If no other parameters are set, this value is also assumed as the host name to connect to at the remote-port port.

This option is disabled by default.

Warning
Warning: Use unique IDs

This value must not overlap with any existing resource or node IDs.

remote-port

Custom port for the guest connection to pacemaker_remote.

The default value is 3121.

remote-addr

The IP address or host name to connect to if the remote node's name is not the host name of the guest.

The default value is the value set by remote-node.

remote-connect-timeout

How long before a pending guest connection times out?

The default value is 60s.

6.13 Instance attributes (parameters)

The scripts of all resource classes can be given parameters which determine how they behave and which instance of a service they control. If your resource agent supports parameters, you can add them with the crm_resource command or with Hawk2. In the crm command line utility and in Hawk2, instance attributes are called params or Parameter, respectively. The list of instance attributes supported by an OCF script can be found by executing the following command as root:

# crm ra info [class:[provider:]]resource_agent

or (without the optional parts):

# crm ra info resource_agent

The output lists all the supported attributes, their purpose and default values.

Note
Note: Instance attributes for groups, clones or promotable clones

Note that groups, clones and promotable clone resources do not have instance attributes. However, any instance attributes set will be inherited by the group's, clone's or promotable clone's children.

6.14 Resource operations

By default, the cluster will not ensure that your resources are still healthy. To instruct the cluster to do this, you need to add a monitor operation to the resource's definition. Monitor operations can be added for all classes or resource agents.

Monitor operations can have the following properties:

id

Your name for the action. Must be unique. (The ID is not shown).

name

The action to perform. Common values: monitor, start, stop.

interval

How frequently to perform the operation, in seconds.

timeout

How long to wait before declaring the action has failed.

requires

What conditions need to be satisfied before this action occurs. Allowed values: nothing, quorum, fencing. The default depends on whether fencing is enabled and if the resource's class is stonith. For STONITH resources, the default is nothing.

on-fail

The action to take if this action ever fails. Allowed values:

  • ignore: Pretend the resource did not fail.

  • block: Do not perform any further operations on the resource.

  • stop: Stop the resource and do not start it elsewhere.

  • restart: Stop the resource and start it again (possibly on a different node).

  • fence: Bring down the node on which the resource failed (STONITH).

  • standby: Move all resources away from the node on which the resource failed.

enabled

If false, the operation is treated as if it does not exist. Allowed values: true, false.

role

Run the operation only if the resource has this role.

record-pending

Can be set either globally or for individual resources. Makes the CIB reflect the state of in-flight operations on resources.

description

Description of the operation.

7 Configuring resource constraints

Having all the resources configured is only part of the job. Even if the cluster knows all needed resources, it still might not be able to handle them correctly. Resource constraints let you specify which cluster nodes resources can run on, what order resources will load in, and what other resources a specific resource is dependent on.

7.1 Types of constraints

There are three different kinds of constraints available:

Resource location

Location constraints define on which nodes a resource may be run, may not be run or is preferred to be run.

Resource colocation

Colocation constraints tell the cluster which resources may or may not run together on a node.

Resource order

Order constraints define the sequence of actions.

Important
Important: Restrictions for constraints and certain types of resources
  • Do not create colocation constraints for members of a resource group. Create a colocation constraint pointing to the resource group as a whole instead. All other types of constraints are safe to use for members of a resource group.

  • Do not use any constraints on a resource that has a clone resource or a promotable clone resource applied to it. The constraints must apply to the clone or promotable clone resource, not to the child resource.

7.2 Scores and infinity

When defining constraints, you also need to deal with scores. Scores of all kinds are integral to how the cluster works. Practically everything from migrating a resource to deciding which resource to stop in a degraded cluster is achieved by manipulating scores in some way. Scores are calculated on a per-resource basis and any node with a negative score for a resource cannot run that resource. After calculating the scores for a resource, the cluster then chooses the node with the highest score.

INFINITY is currently defined as 1,000,000. Additions or subtractions with it stick to the following three basic rules:

  • Any value + INFINITY = INFINITY

  • Any value - INFINITY = -INFINITY

  • INFINITY - INFINITY = -INFINITY

When defining resource constraints, you specify a score for each constraint. The score indicates the value you are assigning to this resource constraint. Constraints with higher scores are applied before those with lower scores. By creating additional location constraints with different scores for a given resource, you can specify an order for the nodes that a resource will fail over to.

7.3 Resource templates and constraints

If you have defined a resource template (see Section 6.8, “Creating resource templates”), it can be referenced in the following types of constraints:

  • order constraints

  • colocation constraints

  • rsc_ticket constraints (for Geo clusters)

However, colocation constraints must not contain more than one reference to a template. Resource sets must not contain a reference to a template.

Resource templates referenced in constraints stand for all primitives which are derived from that template. This means, the constraint applies to all primitive resources referencing the resource template. Referencing resource templates in constraints is an alternative to resource sets and can simplify the cluster configuration considerably. For details about resource sets, refer to Section 7.7, “Using resource sets to define constraints”.

7.4 Adding location constraints

A location constraint determines on which node a resource may be run, is preferably run, or may not be run. An example of a location constraint is to place all resources related to a certain database on the same node. This type of constraint may be added multiple times for each resource. All location constraints are evaluated for a given resource.

You can add location constraints using either Hawk2 or crmsh.

7.4.1 Adding location constraints with Hawk2

Procedure 7.1: Adding a location constraint
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Constraint › Location.

  3. Enter a unique Constraint ID.

  4. From the list of Resources select the resource or resources for which to define the constraint.

  5. Enter a Score. The score indicates the value you are assigning to this resource constraint. Positive values indicate the resource can run on the Node you specify in the next step. Negative values mean it should not run on that node. Constraints with higher scores are applied before those with lower scores.

    Some often-used values can also be set via the drop-down box:

    • To force the resources to run on the node, click the arrow icon and select Always. This sets the score to INFINITY.

    • If you never want the resources to run on the node, click the arrow icon and select Never. This sets the score to -INFINITY, meaning that the resources must not run on the node.

    • To set the score to 0, click the arrow icon and select Advisory. This disables the constraint. This is useful when you want to set resource discovery but do not want to constrain the resources.

  6. Select a Node.

  7. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

Hawk2—location constraint
Figure 7.1: Hawk2—location constraint

7.4.2 Adding location constraints with crmsh

The location command defines on which nodes a resource may be run, may not be run or is preferred to be run.

A simple example that expresses a preference to run the resource fs1 on the node with the name alice to 100 would be the following:

crm(live)configure# location loc-fs1 fs1 100: alice

Another example is a location with ping:

crm(live)configure# primitive ping ping \
    params name=ping dampen=5s multiplier=100 host_list="r1 r2"
crm(live)configure# clone cl-ping ping meta interleave=true
crm(live)configure# location loc-node_pref internal_www \
    rule 50: #uname eq alice \
    rule ping: defined ping

The parameter host_list is a space-separated list of hosts to ping and count. Another use case for location constraints are grouping primitives as a resource set. This can be useful if several resources depend on, for example, a ping attribute for network connectivity. In former times, the -inf/ping rules needed to be duplicated several times in the configuration, making it unnecessarily complex.

The following example creates a resource set loc-alice, referencing the virtual IP addresses vip1 and vip2:

crm(live)configure# primitive vip1 IPaddr2 params ip=192.168.1.5
crm(live)configure# primitive vip2 IPaddr2 params ip=192.168.1.6
crm(live)configure# location loc-alice { vip1 vip2 } inf: alice

In some cases it is much more efficient and convenient to use resource patterns for your location command. A resource pattern is a regular expression between two slashes. For example, the above virtual IP addresses can be all matched with the following:

crm(live)configure# location loc-alice /vip.*/ inf: alice

7.5 Adding colocation constraints

A colocation constraint tells the cluster which resources may or may not run together on a node. As a colocation constraint defines a dependency between resources, you need at least two resources to create a colocation constraint.

You can add colocation constraints using either Hawk2 or crmsh.

7.5.1 Adding colocation constraints with Hawk2

Procedure 7.2: Adding colocation constraints
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Constraint › Colocation.

  3. Enter a unique Constraint ID.

  4. Enter a Score. The score determines the location relationship between the resources. Positive values indicate that the resources should run on the same node. Negative values indicate that the resources should not run on the same node. The score will be combined with other factors to decide where to put the resource.

    Some often-used values can also be set via the drop-down box:

    • To force the resources to run on the same node, click the arrow icon and select Always. This sets the score to INFINITY.

    • If you never want the resources to run on the same node, click the arrow icon and select Never. This sets the score to -INFINITY, meaning that the resources must not run on the same node.

  5. To define the resources for the constraint:

    1. From the drop-down box in the Resources category, select a resource (or a template).

      The resource is added and a new empty drop-down box appears beneath.

    2. Repeat this step to add more resources.

      As the topmost resource depends on the next resource and so on, the cluster will first decide where to put the last resource, then place the depending ones based on that decision. If the constraint cannot be satisfied, the cluster may not allow the dependent resource to run.

    3. To swap the order of resources within the colocation constraint, click the arrow up icon next to a resource to swap it with the entry above.

  6. If needed, specify further parameters for each resource (such as Started, Stopped, Promote, Demote): Click the empty drop-down box next to the resource and select the desired entry.

  7. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

Hawk2—colocation constraint
Figure 7.2: Hawk2—colocation constraint

7.5.2 Adding colocation constraints with crmsh

The colocation command is used to define what resources should run on the same or on different hosts.

You can set a score of either +inf or -inf, defining resources that must always or must never run on the same node. You can also use non-infinite scores. In that case the colocation is called advisory and the cluster may decide not to follow them in favor of not stopping other resources if there is a conflict.

For example, to always run the resources resource1 and resource2 on the same host, use the following constraint:

crm(live)configure# colocation coloc-2resource inf: resource1 resource2

For a primary/secondary configuration, it is necessary to know if the current node is a primary in addition to running the resource locally.

7.6 Adding order constraints

Order constraints can be used to start or stop a service right before or after a different resource meets a special condition, such as being started, stopped, or promoted to primary. For example, you cannot mount a file system before the device is available to a system. As an order constraint defines a dependency between resources, you need at least two resources to create an order constraint.

You can add order constraints using either Hawk2 or crmsh.

7.6.1 Adding order constraints with Hawk2

Procedure 7.3: Adding an order constraint
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Constraint › Order.

  3. Enter a unique Constraint ID.

  4. Enter a Score. If the score is greater than zero, the order constraint is mandatory, otherwise it is optional.

    Some often-used values can also be set via the drop-down box:

    • To make the order constraint mandatory, click the arrow icon and select Mandatory.

    • To make the order constraint a suggestion only, click the arrow icon and select Optional.

    • Serialize: To ensure that no two stop/start actions occur concurrently for the resources, click the arrow icon and select Serialize. This makes sure that one resource must complete starting before the other can be started. A typical use case are resources that put a high load on the host during start-up.

  5. For order constraints, you can usually keep the option Symmetrical enabled. This specifies that resources are stopped in reverse order.

  6. To define the resources for the constraint:

    1. From the drop-down box in the Resources category, select a resource (or a template).

      The resource is added and a new empty drop-down box appears beneath.

    2. Repeat this step to add more resources.

      The topmost resource will start first, then the second, etc. Usually the resources will be stopped in reverse order.

    3. To swap the order of resources within the order constraint, click the arrow up icon next to a resource to swap it with the entry above.

  7. If needed, specify further parameters for each resource (like Started, Stopped, Promote, Demote): Click the empty drop-down box next to the resource and select the desired entry.

  8. Confirm your changes to finish the configuration. A message at the top of the screen shows if the action has been successful.

Hawk2—order constraint
Figure 7.3: Hawk2—order constraint

7.6.2 Adding order constraints using crmsh

The order command defines a sequence of action.

For example, to always start resource1 before resource2, use the following constraint:

crm(live)configure# order res1_before_res2 Mandatory: resource1 resource2

7.7 Using resource sets to define constraints

As an alternative format for defining location, colocation, or order constraints, you can use resource sets, where primitives are grouped together in one set. Previously this was possible either by defining a resource group (which could not always accurately express the design), or by defining each relationship as an individual constraint. The latter caused a constraint explosion as the number of resources and combinations grew. The configuration via resource sets is not necessarily less verbose, but is easier to understand and maintain.

You can configure resource sets using either Hawk2 or crmsh.

7.7.1 Using resource sets to define constraints with Hawk2

Procedure 7.4: Using a resource set for constraints
  1. To use a resource set within a location constraint:

    1. Proceed as outlined in Procedure 7.1, “Adding a location constraint”, apart from Step 4: Instead of selecting a single resource, select multiple resources by pressing Ctrl or Shift and mouse click. This creates a resource set within the location constraint.

    2. To remove a resource from the location constraint, press Ctrl and click the resource again to deselect it.

  2. To use a resource set within a colocation or order constraint:

    1. Proceed as described in Procedure 7.2, “Adding colocation constraints” or Procedure 7.3, “Adding an order constraint”, apart from the step where you define the resources for the constraint (Step 5.a or Step 6.a):

    2. Add multiple resources.

    3. To create a resource set, click the chain icon next to a resource to link it to the resource above. A resource set is visualized by a frame around the resources belonging to a set.

    4. You can combine multiple resources in a resource set or create multiple resource sets.

      Hawk2—two resource sets in a colocation constraint
      Figure 7.4: Hawk2—two resource sets in a colocation constraint
    5. To unlink a resource from the resource above, click the scissors icon next to the resource.

  3. Confirm your changes to finish the constraint configuration.

7.7.2 Using resource sets to define constraints with crmsh

Example 7.1: A resource set for location constraints

For example, you can use the following configuration of a resource set (loc-alice) in the crmsh to place two virtual IPs (vip1 and vip2) on the same node, alice:

crm(live)configure# primitive vip1 IPaddr2 params ip=192.168.1.5
crm(live)configure# primitive vip2 IPaddr2 params ip=192.168.1.6
crm(live)configure# location loc-alice { vip1 vip2 } inf: alice

If you want to use resource sets to replace a configuration of colocation constraints, consider the following two examples:

Example 7.2: A chain of colocated resources
<constraints>
    <rsc_colocation id="coloc-1" rsc="B" with-rsc="A" score="INFINITY"/>
    <rsc_colocation id="coloc-2" rsc="C" with-rsc="B" score="INFINITY"/>
    <rsc_colocation id="coloc-3" rsc="D" with-rsc="C" score="INFINITY"/>
</constraints>

The same configuration expressed by a resource set:

<constraints>
   <rsc_colocation id="coloc-1" score="INFINITY" >
    <resource_set id="colocated-set-example" sequential="true">
     <resource_ref id="A"/>
     <resource_ref id="B"/>
     <resource_ref id="C"/>
     <resource_ref id="D"/>
    </resource_set>
   </rsc_colocation>
</constraints>

If you want to use resource sets to replace a configuration of order constraints, consider the following two examples:

Example 7.3: A chain of ordered resources
<constraints>
    <rsc_order id="order-1" first="A" then="B" />
    <rsc_order id="order-2" first="B" then="C" />
    <rsc_order id="order-3" first="C" then="D" />
</constraints>

The same purpose can be achieved by using a resource set with ordered resources:

Example 7.4: A chain of ordered resources expressed as resource set
<constraints>
    <rsc_order id="order-1">
    <resource_set id="ordered-set-example" sequential="true">
    <resource_ref id="A"/>
    <resource_ref id="B"/>
    <resource_ref id="C"/>
    <resource_ref id="D"/>
    </resource_set>
    </rsc_order>
</constraints>

Sets can be either ordered (sequential=true) or unordered (sequential=false). Furthermore, the require-all attribute can be used to switch between AND and OR logic.

7.7.3 Collocating sets for resources without dependency

Sometimes it is useful to place a group of resources on the same node (defining a colocation constraint), but without having hard dependencies between the resources. For example, you want two resources to be placed on the same node, but you do not want the cluster to restart the other one if one of them fails.

This can be achieved on the crm shell by using the weak-bond command:

# crm configure assist weak-bond resource1 resource2

The weak-bond command creates a dummy resource and a colocation constraint with the given resources automatically.

7.8 Specifying resource failover nodes

A resource will be automatically restarted if it fails. If that cannot be achieved on the current node, or it fails N times on the current node, it tries to fail over to another node. Each time the resource fails, its fail count is raised. You can define several failures for resources (a migration-threshold), after which they will migrate to a new node. If you have more than two nodes in your cluster, the node a particular resource fails over to is chosen by the High Availability software.

However, you can specify the node a resource will fail over to by configuring one or several location constraints and a migration-threshold for that resource.

You can specify resource failover nodes using either Hawk2 or crmsh.

Example 7.5: Migration threshold—process flow

For example, let us assume you have configured a location constraint for resource rsc1 to preferably run on alice. If it fails there, migration-threshold is checked and compared to the fail count. If failcount >= migration-threshold, then the resource is migrated to the node with the next best preference.

After the threshold has been reached, the node will no longer be allowed to run the failed resource until the resource's fail count is reset. This can be done manually by the cluster administrator or by setting a failure-timeout option for the resource.

For example, a setting of migration-threshold=2 and failure-timeout=60s would cause the resource to migrate to a new node after two failures. It would be allowed to move back (depending on the stickiness and constraint scores) after one minute.

There are two exceptions to the migration threshold concept, occurring when a resource either fails to start or fails to stop:

  • Start failures set the fail count to INFINITY and thus always cause an immediate migration.

  • Stop failures cause fencing (when stonith-enabled is set to true which is the default).

    In case there is no STONITH resource defined (or stonith-enabled is set to false), the resource will not migrate.

7.8.1 Specifying resource failover nodes with Hawk2

Procedure 7.5: Specifying a failover node
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. Configure a location constraint for the resource as described in Procedure 7.1, “Adding a location constraint”.

  3. Add the migration-threshold meta attribute to the resource as described in Procedure 8.1: Modifying a resource or group, Step 5 and enter a value for the migration-threshold. The value should be positive and less than INFINITY.

  4. If you want to automatically expire the fail count for a resource, add the failure-timeout meta attribute to the resource as described in Procedure 6.2: Adding a primitive resource with Hawk2, Step 5 and enter a Value for the failure-timeout.

    Image
  5. If you want to specify additional failover nodes with preferences for a resource, create additional location constraints.

Instead of letting the fail count for a resource expire automatically, you can also clean up fail counts for a resource manually at any time. Refer to Section 8.5.1, “Cleaning up cluster resources with Hawk2” for details.

7.8.2 Specifying resource failover nodes with crmsh

To determine a resource failover, use the meta attribute migration-threshold. If the fail count exceeds migration-threshold on all nodes, the resource will remain stopped. For example:

crm(live)configure# location rsc1-alice rsc1 100: alice

Normally, rsc1 prefers to run on alice. If it fails there, migration-threshold is checked and compared to the fail count. If failcount >= migration-threshold, then the resource is migrated to the node with the next best preference.

Start failures set the fail count to inf depend on the start-failure-is-fatal option. Stop failures cause fencing. If there is no STONITH defined, the resource will not migrate.

7.9 Specifying resource failback nodes (resource stickiness)

A resource might fail back to its original node when that node is back online and in the cluster. To prevent a resource from failing back to the node that it was running on, or to specify a different node for the resource to fail back to, change its resource stickiness value. You can either specify resource stickiness when you are creating a resource or afterward.

Consider the following implications when specifying resource stickiness values:

Value is 0:

The resource is placed optimally in the system. This may mean that it is moved when a better or less loaded node becomes available. The option is almost equivalent to automatic failback, except that the resource may be moved to a node that is not the one it was previously active on. This is the Pacemaker default.

Value is greater than 0:

The resource will prefer to remain in its current location, but may be moved if a more suitable node is available. Higher values indicate a stronger preference for a resource to stay where it is.

Value is less than 0:

The resource prefers to move away from its current location. Higher absolute values indicate a stronger preference for a resource to be moved.

Value is INFINITY:

The resource will always remain in its current location unless forced off because the node is no longer eligible to run the resource (node shutdown, node standby, reaching the migration-threshold, or configuration change). This option is almost equivalent to completely disabling automatic failback.

Value is -INFINITY:

The resource will always move away from its current location.

7.9.1 Specifying resource failback nodes with Hawk2

Procedure 7.6: Specifying resource stickiness
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. Add the resource-stickiness meta attribute to the resource as described in Procedure 8.1: Modifying a resource or group, Step 5.

  3. Specify a value between -INFINITY and INFINITY for resource-stickiness.

    Image

7.10 Placing resources based on their load impact

Not all resources are equal. Some, such as Xen guests, require that the node hosting them meets their capacity requirements. If resources are placed such that their combined need exceed the provided capacity, the resources diminish in performance (or even fail).

To take this into account, SUSE Linux Enterprise High Availability allows you to specify the following parameters:

  1. The capacity a certain node provides.

  2. The capacity a certain resource requires.

  3. An overall strategy for placement of resources.

You can configure these setting using either Hawk2 or crmsh:

A node is considered eligible for a resource if it has sufficient free capacity to satisfy the resource's requirements. To manually configure the resource's requirements and the capacity a node provides, use utilization attributes. You can name the utilization attributes according to your preferences and define as many name/value pairs as your configuration needs. However, the attribute's values must be integers.

If multiple resources with utilization attributes are grouped or have colocation constraints, SUSE Linux Enterprise High Availability takes that into account. If possible, the resources is placed on a node that can fulfill all capacity requirements.

Note
Note: Utilization attributes for groups

It is impossible to set utilization attributes directly for a resource group. However, to simplify the configuration for a group, you can add a utilization attribute with the total capacity needed to any of the resources in the group.

SUSE Linux Enterprise High Availability also provides the means to detect and configure both node capacity and resource requirements automatically:

The NodeUtilization resource agent checks the capacity of a node (regarding CPU and RAM). To configure automatic detection, create a clone resource of the following class, provider, and type: ocf:pacemaker:NodeUtilization. One instance of the clone should be running on each node. After the instance has started, a utilization section will be added to the node's configuration in CIB. For more information, see crm ra info NodeUtilization.

For automatic detection of a resource's minimal requirements (regarding RAM and CPU) the Xen resource agent has been improved. Upon start of a Xen resource, it will reflect the consumption of RAM and CPU. Utilization attributes will automatically be added to the resource configuration.

Note
Note: Different resource agents for Xen and libvirt

The ocf:heartbeat:Xen resource agent should not be used with libvirt, as libvirt expects to be able to modify the machine description file.

For libvirt, use the ocf:heartbeat:VirtualDomain resource agent.

Apart from detecting the minimal requirements, you can also monitor the current utilization via the VirtualDomain resource agent. It detects CPU and RAM use of the virtual machine. To use this feature, configure a resource of the following class, provider and type: ocf:heartbeat:VirtualDomain. The following instance attributes are available:

  • autoset_utilization_cpu

  • autoset_utilization_hv_memory (for Xen) or autoset_utilization_host_memory (for KVM)

These attributes default to true. This updates the utilization values in the CIB during each monitoring cycle. For more information, see crm ra info VirtualDomain.

Note
Note: hv_memory and host_memory

In the NodeUtilization and VirtualDomain resource agents, hv_memory and host_memory both default to true. However, Xen only requires hv_memory, and KVM only requires host_memory. To avoid confusion, we recommend disabling the unnecessary attributes. For example:

Example 7.6: Creating resource agents for KVM with hv_memory disabled
# crm configure primitive p_nu NodeUtilization \
      params utilization_hv_memory=false \
      op monitor timeout=20s interval=60
# crm configure primitive p_vm VirtualDomain \
      params autoset_utilization_hv_memory=false \
      op monitor timeout=30s interval=10s
Example 7.7: Creating resource agents for Xen with host_memory disabled
# crm configure primitive p_nu NodeUtilization \
      params utilization_host_memory=false \
      op monitor timeout=20s interval=60
# crm configure primitive p_vm VirtualDomain \
      params autoset_utilization_host_memory=false \
      op monitor timeout=30s interval=10s

For either virtualization type, make sure to clone the NodeUtilization resource so it can run on all nodes:

# crm configure cl_nu p_nu

You can set the placement-strategy property to balanced with the following command:

# crm configure property placement-strategy=balanced

Independent of manually or automatically configuring capacity and requirements, the placement strategy must be specified with the placement-strategy property (in the global cluster options). The following values are available:

default (default value)

Utilization values are not considered. Resources are allocated according to location scoring. If scores are equal, resources are evenly distributed across nodes.

utilization

Utilization values are considered when deciding if a node has enough free capacity to satisfy a resource's requirements. However, load-balancing is still done based on the number of resources allocated to a node.

minimal

Utilization values are considered when deciding if a node has enough free capacity to satisfy a resource's requirements. An attempt is made to concentrate the resources on as few nodes as possible (to achieve power savings on the remaining nodes).

balanced

Utilization values are considered when deciding if a node has enough free capacity to satisfy a resource's requirements. An attempt is made to distribute the resources evenly, thus optimizing resource performance.

Note
Note: Configuring resource priorities

The available placement strategies are best-effort—they do not yet use complex heuristic solvers to always reach optimum allocation results. Ensure that resource priorities are properly set so that your most important resources are scheduled first.

7.10.1 Placing resources based on their load impact with Hawk2

Utilization attributes are used to configure both the resource's requirements and the capacity a node provides. You first need to configure a node's capacity before you can configure the capacity a resource requires.

Procedure 7.7: Configuring the capacity a node provides
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Status.

  3. On the Nodes tab, select the node whose capacity you want to configure.

  4. In the Operations column, click the arrow down icon and select Edit.

    The Edit Node screen opens.

  5. Below Utilization, enter a name for a utilization attribute into the empty drop-down box.

    The name can be arbitrary (for example, RAM_in_GB).

  6. Click the Add icon to add the attribute.

  7. In the empty text box next to the attribute, enter an attribute value. The value must be an integer.

    Image
  8. Add as many utilization attributes as you need and add values for all of them.

  9. Confirm your changes. A message at the top of the screen shows if the action has been successful.

Procedure 7.8: Configuring the capacity a resource requires

Configure the capacity a certain resource requires from a node either when creating a primitive resource or when editing an existing primitive resource.

Before you can add utilization attributes to a resource, you need to have set utilization attributes for your cluster nodes as described in Procedure 7.7.

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. To add a utilization attribute to an existing resource: Go to Manage › Status and open the resource configuration dialog as described in Section 8.2.1, “Editing resources and groups with Hawk2”.

    If you create a new resource: Go to Configuration › Add Resource and proceed as described in Section 6.4.1, “Creating primitive resources with Hawk2”.

  3. In the resource configuration dialog, go to the Utilization category.

  4. From the empty drop-down box, select one of the utilization attributes that you have configured for the nodes in Procedure 7.7.

  5. In the empty text box next to the attribute, enter an attribute value. The value must be an integer.

  6. Add as many utilization attributes as you need and add values for all of them.

  7. Confirm your changes. A message at the top of the screen shows if the action has been successful.

After you have configured the capacities your nodes provide and the capacities your resources require, set the placement strategy in the global cluster options. Otherwise the capacity configurations have no effect. Several strategies are available to schedule the load: for example, you can concentrate it on as few nodes as possible, or balance it evenly over all available nodes.

Procedure 7.9: Setting the placement strategy
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Cluster Configuration to open the respective screen. It shows global cluster options and resource and operation defaults.

  3. From the empty drop-down box in the upper part of the screen, select placement-strategy.

    By default, its value is set to Default, which means that utilization attributes and values are not considered.

  4. Depending on your requirements, set Placement Strategy to the appropriate value.

  5. Confirm your changes.

7.10.2 Placing resources based on their load impact with crmsh

To configure the resource's requirements and the capacity a node provides, use utilization attributes. You can name the utilization attributes according to your preferences and define as many name/value pairs as your configuration needs. In certain cases, some agents update the utilization themselves, for example the VirtualDomain.

In the following example, we assume that you already have a basic configuration of cluster nodes and resources. You now additionally want to configure the capacities a certain node provides and the capacity a certain resource requires.

Procedure 7.10: Adding or modifying utilization attributes with crm
  1. Log in as root and start the crm interactive shell:

    # crm configure
  2. To specify the capacity a node provides, use the following command and replace the placeholder NODE_1 with the name of your node:

    crm(live)configure# node NODE_1 utilization hv_memory=16384 cpu=8

    With these values, NODE_1 would be assumed to provide 16GB of memory and 8 CPU cores to resources.

  3. To specify the capacity a resource requires, use:

    crm(live)configure# primitive xen1 Xen ... \
          utilization hv_memory=4096 cpu=4

    This would make the resource consume 4096 of those memory units from NODE_1, and 4 of the CPU units.

  4. Configure the placement strategy with the property command:

    crm(live)configure# property ...

    The following values are available:

    default (default value)

    Utilization values are not considered. Resources are allocated according to location scoring. If scores are equal, resources are evenly distributed across nodes.

    utilization

    Utilization values are considered when deciding if a node has enough free capacity to satisfy a resource's requirements. However, load-balancing is still done based on the number of resources allocated to a node.

    minimal

    Utilization values are considered when deciding if a node has enough free capacity to satisfy a resource's requirements. An attempt is made to concentrate the resources on as few nodes as possible (to achieve power savings on the remaining nodes).

    balanced

    Utilization values are considered when deciding if a node has enough free capacity to satisfy a resource's requirements. An attempt is made to distribute the resources evenly, thus optimizing resource performance.

    Note
    Note: Configuring resource priorities

    The available placement strategies are best-effort—they do not yet use complex heuristic solvers to always reach optimum allocation results. Ensure that resource priorities are properly set so that your most important resources are scheduled first.

  5. Commit your changes before leaving crmsh:

    crm(live)configure# commit

The following example demonstrates a three node cluster of equal nodes, with four virtual machines:

crm(live)configure# node alice utilization hv_memory="4000"
crm(live)configure# node bob utilization hv_memory="4000"
crm(live)configure# node charlie utilization hv_memory="4000"
crm(live)configure# primitive xenA Xen \
    utilization hv_memory="3500" meta priority="10" \
    params xmfile="/etc/xen/shared-vm/vm1"
crm(live)configure# primitive xenB Xen \
    utilization hv_memory="2000" meta priority="1" \
    params xmfile="/etc/xen/shared-vm/vm2"
crm(live)configure# primitive xenC Xen \
    utilization hv_memory="2000" meta priority="1" \
    params xmfile="/etc/xen/shared-vm/vm3"
crm(live)configure# primitive xenD Xen \
    utilization hv_memory="1000" meta priority="5" \
    params xmfile="/etc/xen/shared-vm/vm4"
crm(live)configure# property placement-strategy="minimal"

With all three nodes up, xenA will be placed onto a node first, followed by xenD. xenB and xenC would either be allocated together or one of them with xenD.

If one node failed, too little total memory would be available to host them all. xenA would be ensured to be allocated, as would xenD. However, only one of xenB or xenC could still be placed, and since their priority is equal, the result is not defined yet. To resolve this ambiguity as well, you would need to set a higher priority for either one.

7.11 For more information

For more information on configuring constraints and detailed background information about the basic concepts of ordering and colocation, refer to the following documents at http://www.clusterlabs.org/pacemaker/doc/:

  • Pacemaker Explained, chapter Resource Constraints

  • Colocation Explained

  • Ordering Explained

8 Managing cluster resources

After configuring the resources in the cluster, use the cluster management tools to start, stop, clean up, remove or migrate the resources. This chapter describes how to use Hawk2 or crmsh for resource management tasks.

8.1 Showing cluster resources

8.1.1 Showing cluster resources with crmsh

When administering a cluster the command crm configure show lists the current CIB objects like cluster configuration, global options, primitives, and others:

# crm configure show
node 178326192: alice
node 178326448: bob
primitive admin_addr IPaddr2 \
       params ip=192.168.2.1 \
       op monitor interval=10 timeout=20
primitive stonith-sbd stonith:external/sbd \
       params pcmk_delay_max=30
property cib-bootstrap-options: \
       have-watchdog=true \
       dc-version=1.1.15-17.1-e174ec8 \
       cluster-infrastructure=corosync \
       cluster-name=hacluster \
       stonith-enabled=true \
       placement-strategy=balanced \
       standby-mode=true
rsc_defaults rsc-options: \
       resource-stickiness=1 \
       migration-threshold=3
op_defaults op-options: \
       timeout=600 \
       record-pending=true

If you have lots of resources, the output of show is too verbose. To restrict the output, use the name of the resource. For example, to list the properties of the primitive admin_addr only, append the resource name to show:

# crm configure show admin_addr
primitive admin_addr IPaddr2 \
       params ip=192.168.2.1 \
       op monitor interval=10 timeout=20

However, in some cases, you want to limit the output of specific resources even more. This can be achieved with filters. Filters limit the output to specific components. For example, to list the nodes only, use type:node:

# crm configure show type:node
node 178326192: alice
node 178326448: bob

If you are also interested in primitives, use the or operator:

# crm configure show type:node or type:primitive
node 178326192: alice
node 178326448: bob
primitive admin_addr IPaddr2 \
       params ip=192.168.2.1 \
       op monitor interval=10 timeout=20
primitive stonith-sbd stonith:external/sbd \
       params pcmk_delay_max=30

Furthermore, to search for an object that starts with a certain string, use this notation:

# crm configure show type:primitive and 'admin*'
primitive admin_addr IPaddr2 \
       params ip=192.168.2.1 \
       op monitor interval=10 timeout=20

To list all available types, enter crm configure show type: and press the →| key. The Bash completion will give you a list of all types.

8.2 Editing resources and groups

You can edit resources or groups using either Hawk2 or crmsh.

8.2.1 Editing resources and groups with Hawk2

If you have created a resource, you can edit its configuration at any time by adjusting parameters, operations, or meta attributes as needed.

Procedure 8.1: Modifying a resource or group
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. On the Hawk2 Status screen, go to the Resources list.

  3. In the Operations column, click the arrow down icon next to the resource or group you want to modify and select Edit.

    The resource configuration screen opens.

    Hawk2—editing a primitive resource
    Figure 8.1: Hawk2—editing a primitive resource
  4. At the top of the configuration screen, you can select operations to perform.

    If you edit a primitive resource, the following operations are available:

    • Copying the resource

    • Renaming the resource (changing its ID)

    • Deleting the resource

    If you edit a group, the following operations are available:

    • Creating a new primitive which will be added to this group

    • Renaming the group (changing its ID)

    • Dragging and dropping group members into a new order

  5. To add a new parameter, operation, or meta attribute, select an entry from the empty drop-down box.

  6. To edit any values in the Operations category, click the Edit icon of the respective entry, enter a different value for the operation, and click Apply.

  7. When you are finished, click the Apply button in the resource configuration screen to confirm your changes to the parameters, operations, or meta attributes.

    A message at the top of the screen shows if the action has been successful.

8.2.2 Editing groups with crmsh

To change the order of a group member, use the modgroup command from the configure subcommand. For example, use the following command to move the primitive Email before Public-IP:

crm(live)configure# modgroup g-mailsvc add Email before Public-IP

To remove a resource from a group (for example, Email), use this command:

crm(live)configure# modgroup g-mailsvc remove Email

8.3 Starting cluster resources

Before you start a cluster resource, make sure it is set up correctly. For example, if you use an Apache server as a cluster resource, set up the Apache server first. Complete the Apache configuration before starting the respective resource in your cluster.

Note
Note: Do not touch services managed by the cluster

When managing a resource via the High Availability software, the resource must not be started or stopped otherwise (outside the cluster, for example manually or on boot or reboot). The High Availability software is responsible for all service start or stop actions.

However, if you want to check if the service is configured properly, start it manually, but make sure that it is stopped again before the High Availability software takes over.

For interventions in resources that are currently managed by the cluster, set the resource to maintenance mode first. For details, see Procedure 28.5, “Putting a resource into maintenance mode with Hawk2”.

You can start a cluster resource using either Hawk2 or crmsh.

8.3.1 Starting cluster resources with Hawk2

When creating a resource with Hawk2, you can set its initial state with the target-role meta attribute. If you set its value to stopped, the resource does not start automatically after being created.

Procedure 8.2: Starting a new resource
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Status. The list of Resources also shows the Status.

  3. Select the resource to start. In its Operations column click the Start icon. To continue, confirm the message that appears.

When the resource has started, Hawk2 changes the resource's Status to green and shows on which node it is running.

8.3.2 Starting cluster resources with crmsh

To start a new cluster resource, you need the respective identifier.

Procedure 8.3: Starting a cluster resource with crmsh
  1. Log in as root and start the crm interactive shell:

    # crm
  2. Switch to the resource level:

    crm(live)# resource
  3. Start the resource with start and press the →| key to show all known resources:

    crm(live)resource# start ID

8.4 Stopping cluster resources

8.4.1 Stopping cluster resources with crmsh

To stop one or more existing cluster resources, you need the respective identifier(s).

Procedure 8.4: Stopping cluster resources with crmsh
  1. Log in as root and start the crm interactive shell:

    # crm
  2. Switch to the resource level:

    crm(live)# resource
  3. Stop the resource with stop and press the →| key to show all known resources:

    crm(live)resource# stop ID

    You can stop multiple resources at once:

    crm(live)resource# stop ID1 ID2 ...

8.5 Cleaning up cluster resources

A resource is automatically restarted if it fails, but each failure increases the resource's fail count.

If a migration-threshold has been set for the resource, the node will no longer run the resource when the number of failures reaches the migration threshold.

By default, fail counts are not automatically reset. You can configure a fail count to be reset automatically by setting a failure-timeout option for the resource, or you can manually reset the fail count using either Hawk2 or crmsh.

8.5.1 Cleaning up cluster resources with Hawk2

Procedure 8.5: Cleaning up a resource
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Status. The list of Resources also shows the Status.

  3. Go to the resource to clean up. In the Operations column click the arrow down button and select Cleanup. To continue, confirm the message that appears.

    This executes the command crm resource cleanup and cleans up the resource on all nodes.

8.5.2 Cleaning up cluster resources with crmsh

Procedure 8.6: Cleaning up a resource with crmsh
  1. Open a shell and log in as user root.

  2. Get a list of all your resources:

    # crm resource status
    Full List of Resources
       * admin-ip      (ocf:heartbeat:IPaddr2):    Started
       * stonith-sbd   (stonith:external/sbd):     Started
       * Resource Group: dlm-clvm:
         * dlm:        (ocf:pacemaker:controld)    Started
         * clvm:       (ocf:heartbeat:lvmlockd)    Started
  3. Show the fail count of a resource:

    # crm resource failcount RESOURCE show NODE

    For example, to show the fail count of the resource dlm on node alice:

    # crm resource failcount dlm show alice
    scope=status name=fail-count-dlm value=2
  4. Clean up the resource:

    # crm resource cleanup RESOURCE

    This command cleans up the resource on all nodes. If the resource is part of a group, crmsh also cleans up the other resources in the group.

8.6 Removing cluster resources

To remove a resource from the cluster, follow either the Hawk2 or crmsh procedure below to avoid configuration errors.

8.6.1 Removing cluster resources with Hawk2

Procedure 8.7: Removing a cluster resource
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. Clean up the resource on all nodes as described in Procedure 8.5, “Cleaning up a resource”.

  3. Stop the resource:

    1. From the left navigation bar, select Monitoring › Status. The list of Resources also shows the Status.

    2. In the Operations column click the Stop button next to the resource.

    3. To continue, confirm the message that appears.

      The Status column will reflect the change when the resource is stopped.

  4. Delete the resource:

    1. From the left navigation bar, select Configuration › Edit Configuration.

    2. In the list of Resources, go to the respective resource. From the Operations column click the Delete icon next to the resource.

    3. To continue, confirm the message that appears.

8.6.2 Removing cluster resources with crmsh

Procedure 8.8: Removing a cluster resource with crmsh
  1. Log in as root and start the crm interactive shell:

    # crm
  2. Get a list of your resources:

    crm(live)# resource status
    Full List of Resources:
      * admin-ip     (ocf:heartbeat:IPaddr2):     Started
      * stonith-sbd  (stonith:external/sbd):      Started
      * nfsserver    (ocf:heartbeat:nfsserver):   Started
  3. Stop the resource you want to remove:

    crm(live)# resource stop RESOURCE
  4. Delete the resource:

    crm(live)# configure delete RESOURCE

8.7 Migrating cluster resources

The cluster will fail over (migrate) resources automatically in case of software or hardware failures, according to certain parameters you can define (for example, migration threshold or resource stickiness). You can also manually migrate a resource to another node in the cluster, or move the resource away from the current node and let the cluster decide where to put it.

You can migrate a cluster resource using either Hawk2 or crmsh.

8.7.1 Migrating cluster resources with Hawk2

Procedure 8.9: Manually migrating a resource
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Status. The list of Resources also shows the Status.

  3. In the list of Resources, select the respective resource.

  4. In the Operations column click the arrow down button and select Migrate.

  5. In the window that opens you have the following choices:

    • Away from current node: This creates a location constraint with a -INFINITY score for the current node.

    • Alternatively, you can move the resource to another node. This creates a location constraint with an INFINITY score for the destination node.

  6. Confirm your choice.

To allow a resource to move back again, proceed as follows:

Procedure 8.10: Unmigrating a resource
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Status. The list of Resources also shows the Status.

  3. In the list of Resources, go to the respective resource.

  4. In the Operations column click the arrow down button and select Clear. To continue, confirm the message that appears.

    Hawk2 uses the crm_resource  --clear command. The resource can move back to its original location or it may stay where it is (depending on resource stickiness).

For more information, see Pacemaker Explained, available from http://www.clusterlabs.org/pacemaker/doc/. Refer to section Resource Migration.

8.7.2 Migrating cluster resources with crmsh

Use the move command for this task. For example, to migrate the resource ipaddress1 to a cluster node named bob, use these commands:

# crm resource
 crm(live)resource# move ipaddress1 bob

8.8 Grouping resources by using tags

Tags are a way to refer to multiple resources at once, without creating any colocation or ordering relationship between them. This can be useful for grouping conceptually related resources. For example, if you have several resources related to a database, create a tag called databases and add all resources related to the database to this tag. This allows you to stop or start them all with a single command.

Tags can also be used in constraints. For example, the following location constraint loc-db-prefer applies to the set of resources tagged with databases:

location loc-db-prefer databases 100: alice

You can create tags using either Hawk2 or crmsh.

8.8.1 Grouping resources by using tags with Hawk2

Procedure 8.11: Adding a tag
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Add Resource › Tag.

  3. Enter a unique Tag ID.

  4. From the Objects list, select the resources you want to refer to with the tag.

  5. Click Create to finish the configuration. A message at the top of the screen shows if the action has been successful.

Hawk2—tag
Figure 8.2: Hawk2—tag

8.8.2 Grouping resources by using tags with crmsh

For example, if you have several resources related to a database, create a tag called databases and add all resources related to the database to this tag:

# crm configure tag databases: db1 db2 db3

This allows you to start them all with a single command:

# crm resource start databases

Similarly, you can stop them all too:

# crm resource stop databases

9 Managing services on remote hosts

The possibilities for monitoring and managing services on remote hosts has become increasingly important during the last few years. SUSE Linux Enterprise High Availability 11 SP3 offered fine-grained monitoring of services on remote hosts via monitoring plug-ins. The recent addition of the pacemaker_remote service now allows SUSE Linux Enterprise High Availability 15 SP4 to fully manage and monitor resources on remote hosts just as if they were a real cluster node—without the need to install the cluster stack on the remote machines.

9.1 Monitoring services on remote hosts with monitoring plug-ins

Monitoring of virtual machines can be done with the VM agent (which only checks if the guest shows up in the hypervisor), or by external scripts called from the VirtualDomain or Xen agent. Up to now, more fine-grained monitoring was only possible with a full setup of the High Availability stack within the virtual machines.

By providing support for monitoring plug-ins (formerly named Nagios plug-ins), SUSE Linux Enterprise High Availability now also allows you to monitor services on remote hosts. You can collect external statuses on the guests without modifying the guest image. For example, VM guests might run Web services or simple network resources that need to be accessible. With the Nagios resource agents, you can now monitor the Web service or the network resource on the guest. If these services are not reachable anymore, SUSE Linux Enterprise High Availability triggers a restart or migration of the respective guest.

If your guests depend on a service (for example, an NFS server to be used by the guest), the service can either be an ordinary resource, managed by the cluster, or an external service that is monitored with Nagios resources instead.

To configure the Nagios resources, the following packages must be installed on the host:

  • monitoring-plugins

  • monitoring-plugins-metadata

YaST or Zypper will resolve any dependencies on further packages, if required.

A typical use case is to configure the monitoring plug-ins as resources belonging to a resource container, which usually is a VM. The container will be restarted if any of its resources has failed. Refer to Example 9.1, “Configuring resources for monitoring plug-ins” for a configuration example. Alternatively, Nagios resource agents can also be configured as ordinary resources to use them for monitoring hosts or services via the network.

Example 9.1: Configuring resources for monitoring plug-ins
primitive vm1 VirtualDomain \
    params hypervisor="qemu:///system" config="/etc/libvirt/qemu/vm1.xml" \
    op start interval="0" timeout="90" \
    op stop interval="0" timeout="90" \
    op monitor interval="10" timeout="30"
primitive vm1-sshd nagios:check_tcp \
    params hostname="vm1" port="22" \ 1
    op start interval="0" timeout="120" \ 2
    op monitor interval="10"
group g-vm1-and-services vm1 vm1-sshd \
    meta container="vm1" 3

1

The supported parameters are the same as the long options of a monitoring plug-in. Monitoring plug-ins connect to services with the parameter hostname. Therefore the attribute's value must be a resolvable host name or an IP address.

2

As it takes some time to get the guest operating system up and its services running, the start timeout of the monitoring resource must be long enough.

3

A cluster resource container of type ocf:heartbeat:Xen, ocf:heartbeat:VirtualDomain or ocf:heartbeat:lxc. It can either be a VM or a Linux Container.

The example above contains only one resource for the check_tcpplug-in, but multiple resources for different plug-in types can be configured (for example, check_http or check_udp).

If the host names of the services are the same, the hostname parameter can also be specified for the group, instead of adding it to the individual primitives. For example:

group g-vm1-and-services vm1 vm1-sshd vm1-httpd \
     meta container="vm1" \
     params hostname="vm1"

If any of the services monitored by the monitoring plug-ins fail within the VM, the cluster will detect that and restart the container resource (the VM). Which action to take in this case can be configured by specifying the on-fail attribute for the service's monitoring operation. It defaults to restart-container.

Failure counts of services will be taken into account when considering the VM's migration-threshold.

9.2 Managing services on remote nodes with pacemaker_remote

With the pacemaker_remote service, High Availability clusters can be extended to virtual nodes or remote bare-metal machines. They do not need to run the cluster stack to become members of the cluster.

SUSE Linux Enterprise High Availability can now launch virtual environments (KVM and LXC), plus the resources that live within those virtual environments without requiring the virtual environments to run Pacemaker or Corosync.

For the use case of managing both virtual machines as cluster resources plus the resources that live within the VMs, you can now use the following setup:

  • The normal (bare-metal) cluster nodes run SUSE Linux Enterprise High Availability.

  • The virtual machines run the pacemaker_remote service (almost no configuration required on the VM's side).

  • The cluster stack on the normal cluster nodes launches the VMs and connects to the pacemaker_remote service running on the VMs to integrate them as remote nodes into the cluster.

As the remote nodes do not have the cluster stack installed, this has the following implications:

  • Remote nodes do not take part in quorum.

  • Remote nodes cannot become the DC.

  • Remote nodes are not bound by the scalability limits (Corosync has a member limit of 32 nodes).

Find more information about the remote_pacemaker service, including multiple use cases with detailed setup instructions in Article “Pacemaker Remote Quick Start”.

10 Adding or modifying resource agents

All tasks that need to be managed by a cluster must be available as a resource. There are two major groups here to consider: resource agents and STONITH agents. For both categories, you can add your own agents, extending the abilities of the cluster to your own needs.

10.1 STONITH agents

A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource.

Warning
Warning: External SSH/STONITH are not supported

It is impossible to know how SSH might react to other system problems. For this reason, external SSH/STONITH agents (like stonith:external/ssh) are not supported for production environments. If you still want to use such agents for testing, install the libglue-devel package.

To get a list of all currently available STONITH devices (from the software side), use the command crm ra list stonith. If you do not find your favorite agent, install the -devel package. For more information on STONITH devices and resource agents, see Chapter 12, Fencing and STONITH.

As of yet there is no documentation about writing STONITH agents. If you want to write new STONITH agents, consult the examples available in the source of the cluster-glue package.

10.2 Writing OCF resource agents

All OCF resource agents (RAs) are available in /usr/lib/ocf/resource.d/, see Section 6.2, “Supported resource agent classes” for more information. Each resource agent must supported the following operations to control it:

start

start or enable the resource

stop

stop or disable the resource

status

returns the status of the resource

monitor

similar to status, but checks also for unexpected states

validate

validate the resource's configuration

meta-data

returns information about the resource agent in XML

The general procedure of how to create an OCF RA is like the following:

  1. Load the file /usr/lib/ocf/resource.d/pacemaker/Dummy as a template.

  2. Create a new subdirectory for each new resource agents to avoid naming contradictions. For example, if you have a resource group kitchen with the resource coffee_machine, add this resource to the directory /usr/lib/ocf/resource.d/kitchen/. To access this RA, execute the command crm:

    # crm configure primitive coffee_1 ocf:coffee_machine:kitchen ...
  3. Implement the different shell functions and save your file under a different name.

More details about writing OCF resource agents can be found at http://www.clusterlabs.org/pacemaker/doc/ in the guide Pacemaker Administration. Find special information about several concepts at Chapter 1, Product overview.

10.3 OCF return codes and failure recovery

According to the OCF specification, there are strict definitions of the exit codes an action must return. The cluster always checks the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and a recovery action is initiated. There are three types of failure recovery:

Table 10.1: Failure recovery types

Recovery Type

Description

Action Taken by the Cluster

soft

A transient error occurred.

Restart the resource or move it to a new location.

hard

A non-transient error occurred. The error may be specific to the current node.

Move the resource elsewhere and prevent it from being retried on the current node.

fatal

A non-transient error occurred that will be common to all cluster nodes. This means a bad configuration was specified.

Stop the resource and prevent it from being started on any cluster node.

Assuming an action is considered to have failed, the following table outlines the different OCF return codes. It also shows the type of recovery the cluster will initiate when the respective error code is received.

Table 10.2: OCF return codes

OCF Return Code

OCF Alias

Description

Recovery Type

0

OCF_SUCCESS

Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.

soft

1

OCF_ERR_­GENERIC

Generic there was a problem error code.

soft

2

OCF_ERR_ARGS

The resource’s configuration is not valid on this machine (for example, it refers to a location/tool not found on the node).

hard

3

OCF_­ERR_­UN­IMPLEMENTED

The requested action is not implemented.

hard

4

OCF_ERR_PERM

The resource agent does not have sufficient privileges to complete the task.

hard

5

OCF_ERR_­INSTALLED

The tools required by the resource are not installed on this machine.

hard

6

OCF_ERR_­CONFIGURED

The resource’s configuration is invalid (for example, required parameters are missing).

fatal

7

OCF_NOT_­RUNNING

The resource is not running. The cluster will not attempt to stop a resource that returns this for any action.

This OCF return code may or may not require resource recovery—it depends on what is the expected resource status. If unexpected, then soft recovery.

N/A

8

OCF_RUNNING_­PROMOTED

The resource is running in Promoted mode.

soft

9

OCF_FAILED_­PROMOTED

The resource is in Promoted mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again.

soft

other

N/A

Custom error code.

soft

11 Monitoring clusters

This chapter describes how to monitor a cluster's health and view its history.

11.1 Monitoring cluster status

Hawk2 has different screens for monitoring single clusters and multiple clusters: the Status and the Dashboard screen.

11.1.1 Monitoring a single cluster

To monitor a single cluster, use the Status screen. After you have logged in to Hawk2, the Status screen is displayed by default. An icon in the upper right corner shows the cluster status at a glance. For further details, have a look at the following categories:

Errors

If errors have occurred, they are shown at the top of the page.

Resources

Shows the configured resources including their Status, Name (ID), Location (node on which they are running), and resource agent Type. From the Operations column, you can start or stop a resource, trigger several actions, or view details. Actions that can be triggered include setting the resource to maintenance mode (or removing maintenance mode), migrating it to a different node, cleaning up the resource, showing any recent events, or editing the resource.

Nodes

Shows the nodes belonging to the cluster site you are logged in to, including the nodes' Status and Name. In the Maintenance and Standby columns, you can set or remove the maintenance or standby flag for a node. The Operations column allows you to view recent events for the node or further details: for example, if a utilization, standby or maintenance attribute is set for the respective node.

Tickets

Only shown if tickets have been configured (for use with Geo clustering).

Hawk2—cluster status
Figure 11.1: Hawk2—cluster status

11.1.2 Monitoring multiple clusters

To monitor multiple clusters, use the Hawk2 Dashboard. The cluster information displayed in the Dashboard screen is stored on the server side. It is synchronized between the cluster nodes (if passwordless SSH access between the cluster nodes has been configured). For details, see Section D.2, “Configuring a passwordless SSH account”. However, the machine running Hawk2 does not even need to be part of any cluster for that purpose—it can be a separate, unrelated system.

In addition to the general Hawk2 requirements, the following prerequisites need to be fulfilled to monitor multiple clusters with Hawk2:

Prerequisites
  • All clusters to be monitored from Hawk2's Dashboard must be running SUSE Linux Enterprise High Availability 15 SP4.

  • If you did not replace the self-signed certificate for Hawk2 on every cluster node with your own certificate (or a certificate signed by an official Certificate Authority) yet, do the following: Log in to Hawk2 on every node in every cluster at least once. Verify the certificate (or add an exception in the browser to bypass the warning). Otherwise Hawk2 cannot connect to the cluster.

Procedure 11.1: Monitoring multiple clusters with the dashboard
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Dashboard.

    Hawk2 shows an overview of the resources and nodes on the current cluster site. In addition, it shows any Tickets that have been configured for use with a Geo cluster. If you need information about the icons used in this view, click Legend. To search for a resource ID, enter the name (ID) into the Search text box. To only show specific nodes, click the filter icon and select a filtering option.

    Hawk2 dashboard with one cluster site (amsterdam)
    Figure 11.2: Hawk2 dashboard with one cluster site (amsterdam)
  3. To add dashboards for multiple clusters:

    1. Click Add Cluster.

    2. Enter the Cluster name with which to identify the cluster in the Dashboard. For example, berlin.

    3. Enter the fully qualified host name of one of the nodes in the second cluster. For example, charlie.

      Image
    4. Click Add. Hawk2 will display a second tab for the newly added cluster site with an overview of its nodes and resources.

      Note
      Note: Connection error

      If instead you are prompted to log in to this node by entering a password, you probably did not connect to this node yet and have not replaced the self-signed certificate. In that case, even after entering the password, the connection will fail with the following message: Error connecting to server. Retrying every 5 seconds... '.

      To proceed, see Replacing the self-signed certificate.

  4. To view more details for a cluster site or to manage it, switch to the site's tab and click the chain icon.

    Hawk2 opens the Status view for this site in a new browser window or tab. From there, you can administer this part of the Geo cluster.

  5. To remove a cluster from the dashboard, click the x icon on the right-hand side of the cluster's details.

11.2 Verifying cluster health

You can check the health of a cluster using either Hawk2 or crmsh.

11.2.1 Verifying cluster health with Hawk2

Hawk2 provides a wizard which checks and detects issues with your cluster. After the analysis is complete, Hawk2 creates a cluster report with further details. To verify cluster health and generate the report, Hawk2 requires passwordless SSH access between the nodes. Otherwise it can only collect data from the current node. If you have set up your cluster with the bootstrap scripts, provided by the crm shell, passwordless SSH access is already configured. In case you need to configure it manually, see Section D.2, “Configuring a passwordless SSH account”.

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Configuration › Wizards.

  3. Expand the Basic category.

  4. Select the Verify health and configuration wizard.

  5. Confirm with Verify.

  6. Enter the root password for your cluster and click Apply. Hawk2 will generate the report.

11.2.2 Getting health status with crmsh

The health status of a cluster or node can be displayed with so called scripts. A script can perform different tasks—they are not targeted to health. However, for this subsection, we focus on how to get the health status.

To get all the details about the health command, use describe:

# crm script describe health

It shows a description and a list of all parameters and their default values. To execute a script, use run:

# crm script run health

If you prefer to run only one step from the suite, the describe command lists all available steps in the Steps category.

For example, the following command executes the first step of the health command. The output is stored in the health.json file for further investigation:

# crm script run health statefile='health.json'

It is also possible to run the above commands with crm cluster health.

For additional information regarding scripts, see http://crmsh.github.io/scripts/.

11.3 Viewing the cluster history

Hawk2 provides the following possibilities to view past events on the cluster (on different levels and in varying detail):

You can also view cluster history information using crmsh:

11.3.1 Viewing recent events of nodes or resources

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Monitoring › Status. It lists Resources and Nodes.

  3. To view recent events of a resource:

    1. Click Resources and select the respective resource.

    2. In the Operations column for the resource, click the arrow down button and select Recent events.

      Hawk2 opens a new window and displays a table view of the latest events.

  4. To view recent events of a node:

    1. Click Nodes and select the respective node.

    2. In the Operations column for the node, select Recent events.

      Hawk2 opens a new window and displays a table view of the latest events.

      Image

11.3.2 Using the history explorer for cluster reports

From the left navigation bar, select Troubleshooting › History to access the History Explorer. The History Explorer allows you to create detailed cluster reports and view transition information. It provides the following options:

Generate

Create a cluster report for a certain time. Hawk2 calls the crm report command to generate the report.

Upload

Allows you to upload crm report archives that have either been created with the crm shell directly or even on a different cluster.

After reports have been generated or uploaded, they are shown below Reports. From the list of reports, you can show a report's details, download or delete the report.

Hawk2—history explorer main view
Figure 11.3: Hawk2—history explorer main view
Procedure 11.2: Generating or uploading a cluster report
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Troubleshooting › History.

    The History Explorer screen opens in the Generate view. By default, the suggested time frame for a report is the last hour.

  3. To create a cluster report:

    1. To immediately start a report, click Generate.

    2. To modify the time frame for the report, click anywhere on the suggested time frame and select another option from the drop-down box. You can also enter a Custom start date, end date and hour, respectively. To start the report, click Generate.

      After the report has finished, it is shown below Reports.

  4. To upload a cluster report, the crm report archive must be located on a file system that you can access with Hawk2. Proceed as follows:

    1. Switch to the Upload tab.

    2. Browse for the cluster report archive and click Upload.

      After the report is uploaded, it is shown below Reports.

  5. To download or delete a report, click the respective icon next to the report in the Operations column.

  6. To view Report details in history explorer, click the report's name or select Show from the Operations column.

    Image
  7. Return to the list of reports by clicking the Reports button.

Report details in history explorer
  • Name of the report.

  • Start time of the report.

  • End time of the report.

  • Number of transitions plus time line of all transitions in the cluster that are covered by the report. To learn how to view more details for a transition, see Section 11.3.3.

  • Node events.

  • Resource events.

11.3.3 Viewing transition details in the history explorer

For each transition, the cluster saves a copy of the state which it provides as input to pacemaker-schedulerd. The path to this archive is logged. All pe-* files are generated on the Designated Coordinator (DC). As the DC can change in a cluster, there may be pe-* files from several nodes. Any pe-* files are saved snapshots of the CIB, used as input of calculations by pacemaker-schedulerd.

In Hawk2, you can display the name of each pe-* file plus the time and node on which it was created. In addition, the History Explorer can visualize the following details, based on the respective pe-* file:

Transition details in the history explorer
Details

Shows snippets of logging data that belongs to the transition. Displays the output of the following command (including the resource agents' log messages):

crm history transition peinput
Configuration

Shows the cluster configuration at the time that the pe-* file was created.

Diff

Shows the differences of configuration and status between the selected pe-* file and the following one.

Log

Shows snippets of logging data that belongs to the transition. Displays the output of the following command:

crm history transition log peinput

This includes details from the following daemons: pacemaker-schedulerd, pacemaker-controld, and pacemaker-execd.

Graph

Shows a graphical representation of the transition. If you click Graph, the calculation is simulated (exactly as done by pacemaker-schedulerd) and a graphical visualization is generated.

Procedure 11.3: Viewing transition details
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. From the left navigation bar, select Troubleshooting › History.

    If reports have already been generated or uploaded, they are shown in the list of Reports. Otherwise generate or upload a report as described in Procedure 11.2.

  3. Click the report's name or select Show from the Operations column to open the Report details in history explorer.

  4. To access the transition details, you need to select a transition point in the transition time line that is shown below. Use the Previous and Next icons and the Zoom In and Zoom Out icons to find the transition that you are interested in.

  5. To display the name of a pe-input* file plus the time and node on which it was created, hover the mouse pointer over a transition point in the time line.

  6. To view the Transition details in the history explorer, click the transition point for which you want to know more.

  7. To show Details, Configuration, Diff, Logs or Graph, click the respective buttons to show the content described in Transition details in the history explorer.

  8. To return to the list of reports, click the Reports button.

11.3.4 Retrieving history information with crmsh

Investigating the cluster history is a complex task. To simplify this task, crmsh contains the history command with its subcommands. It is assumed SSH is configured correctly.

Each cluster moves states, migrates resources, or starts important processes. All these actions can be retrieved by subcommands of history.

By default, all history commands look at the events of the last hour. To change this time frame, use the limit subcommand. The syntax is:

# crm history
crm(live)history# limit FROM_TIME [TO_TIME]

Some valid examples include:

limit 4:00pm , limit 16:00

Both commands mean the same, today at 4pm.

limit 2012/01/12 6pm

January 12th 2012 at 6pm

limit "Sun 5 20:46"

In the current year of the current month at Sunday the 5th at 8:46pm

Find more examples and how to create time frames at http://labix.org/python-dateutil.

The info subcommand shows all the parameters which are covered by the crm report:

crm(live)history# info
Source: live
Period: 2012-01-12 14:10:56 - end
Nodes: alice
Groups:
Resources:

To limit crm report to certain parameters view the available options with the subcommand help.

To narrow down the level of detail, use the subcommand detail with a level:

crm(live)history# detail 1

The higher the number, the more detailed your report will be. Default is 0 (zero).

After you have set above parameters, use log to show the log messages.

To display the last transition, use the following command:

crm(live)history# transition -1
INFO: fetching new logs, please wait ...

This command fetches the logs and runs dotty (from the graphviz package) to show the transition graph. The shell opens the log file which you can browse with the and cursor keys.

If you do not want to open the transition graph, use the nograph option:

crm(live)history# transition -1 nograph

11.4 Monitoring system health with the SysInfo resource agent

To prevent a node from running out of disk space and thus being unable to manage any resources that have been assigned to it, SUSE Linux Enterprise High Availability provides a resource agent, ocf:pacemaker:SysInfo. Use it to monitor a node's health with regard to disk partitions. The SysInfo RA creates a node attribute named #health_disk which will be set to red if any of the monitored disks' free space is below a specified limit.

To define how the CRM should react in case a node's health reaches a critical state, use the global cluster option node-health-strategy.

Procedure 11.4: Configuring system health monitoring

To automatically move resources away from a node in case the node runs out of disk space, proceed as follows:

  1. Configure an ocf:pacemaker:SysInfo resource:

    primitive sysinfo ocf:pacemaker:SysInfo \
         params disks="/tmp /var"1 min_disk_free="100M"2 disk_unit="M"3 \
         op monitor interval="15s"

    1

    Which disk partitions to monitor. For example, /tmp, /usr, /var, and /dev. To specify multiple partitions as attribute values, separate them with a blank.

    Note
    Note: / file system always monitored

    You do not need to specify the root partition (/) in disks. It is always monitored by default.

    2

    The minimum free disk space required for those partitions. Optionally, you can specify the unit to use for measurement (in the example above, M for megabytes is used). If not specified, min_disk_free defaults to the unit defined in the disk_unit parameter.

    3

    The unit in which to report the disk space.

  2. To complete the resource configuration, create a clone of ocf:pacemaker:SysInfo and start it on each cluster node.

  3. Set the node-health-strategy to migrate-on-red:

    property node-health-strategy="migrate-on-red"

    In case of a #health_disk attribute set to red, the pacemaker-schedulerd adds -INF to the resources' score for that node. This will cause any resources to move away from this node. The STONITH resource will be the last one to be stopped but even if the STONITH resource is not running anymore, the node can still be fenced. Fencing has direct access to the CIB and will continue to work.

After a node's health status has turned to red, solve the issue that led to the problem. Then clear the red status to make the node eligible again for running resources. Log in to the cluster node and use one of the following methods:

  • Execute the following command:

    # crm node status-attr NODE delete #health_disk
  • Restart the cluster services on that node.

  • Reboot the node.

The node will be returned to service and can run resources again.

12 Fencing and STONITH

Fencing is a very important concept in computer clusters for HA (High Availability). A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource. Fencing may be defined as a method to bring an HA cluster to a known state.

Every resource in a cluster has a state attached. For example: resource r1 is started on alice. In an HA cluster, such a state implies that resource r1 is stopped on all nodes except alice, because the cluster must make sure that every resource may be started on only one node. Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.

When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.

12.1 Classes of fencing

There are two classes of fencing: resource level and node level fencing. The latter is the primary subject of this chapter.

Resource level fencing

Resource level fencing ensures exclusive access to a given resource. Common examples of this are changing the zoning of the node from a SAN fiber channel switch (thus locking the node out of access to its disks) or methods like SCSI reserve. For examples, refer to Section 13.10, “Additional mechanisms for storage protection”.

Node level fencing

Node level fencing prevents a failed node from accessing shared resources entirely. This is usually done in a simple and abrupt way: reset or power off the node.

Table 12.1: Classes of fencing
ClassesMethodsOptionsExamples
Node fencing[a] (reboot or shutdown)Remote managementIn-nodeILO, DRAC, IPMI
ExternalHMC, vCenter, EC2
SBD and watchdogDisk-basedhpwdt, iTCO_wdt, ipmi_wdt, softdog
Diskless
I/O fencing (locking or reservation)Pure lockingIn-clusterSFEX
ExternalSCSI2 reservation, SCSI3 reservation
Built-in lockingCluster-basedCluster MD, LVM with lvmlockd and DLM
Cluster-handledMD-RAID

[a] Mandatory in High Availability clusters

12.2 Node level fencing

In a Pacemaker cluster, the implementation of node level fencing is STONITH (Shoot The Other Node in the Head). SUSE Linux Enterprise High Availability includes the stonith command line tool, an extensible interface for remotely powering down a node in the cluster. For an overview of the available options, run stonith --help or refer to the man page of stonith for more information.

12.2.1 STONITH devices

To use node level fencing, you first need to have a fencing device. To get a list of STONITH devices which are supported by SUSE Linux Enterprise High Availability, run one of the following commands on any of the nodes:

# stonith -L

or

# crm ra list stonith

STONITH devices may be classified into the following categories:

Power Distribution Units (PDU)

Power Distribution Units are an essential element in managing power capacity and functionality for critical network, server and data center equipment. They can provide remote load monitoring of connected equipment and individual outlet power control for remote power recycling.

Uninterruptible Power Supplies (UPS)

A stable power supply provides emergency power to connected equipment by supplying power from a separate source if a utility power failure occurs.

Blade power control devices

If you are running a cluster on a set of blades, then the power control device in the blade enclosure is the only candidate for fencing. Of course, this device must be capable of managing single blade computers.

Lights-out devices

Lights-out devices (IBM RSA, HP iLO, Dell DRAC) are becoming increasingly popular and may even become standard in off-the-shelf computers. However, they are inferior to UPS devices, because they share a power supply with their host (a cluster node). If a node stays without power, the device supposed to control it would be useless. In that case, the CRM would continue its attempts to fence the node indefinitely while all other resource operations would wait for the fencing/STONITH operation to complete.

Testing devices

Testing devices are used exclusively for testing purposes. They are usually more gentle on the hardware. Before the cluster goes into production, they must be replaced with real fencing devices.

The choice of the STONITH device depends mainly on your budget and the kind of hardware you use.

12.2.2 STONITH implementation

The STONITH implementation of SUSE® Linux Enterprise High Availability consists of two components:

pacemaker-fenced

pacemaker-fenced is a daemon which can be accessed by local processes or over the network. It accepts the commands which correspond to fencing operations: reset, power-off, and power-on. It can also check the status of the fencing device.

The pacemaker-fenced daemon runs on every node in the High Availability cluster. The pacemaker-fenced instance running on the DC node receives a fencing request from the pacemaker-controld. It is up to this and other pacemaker-fenced programs to carry out the desired fencing operation.

STONITH plug-ins

For every supported fencing device there is a STONITH plug-in which is capable of controlling said device. A STONITH plug-in is the interface to the fencing device. The STONITH plug-ins contained in the cluster-glue package reside in /usr/lib64/stonith/plugins on each node. (If you installed the fence-agents package, too, the plug-ins contained there are installed in /usr/sbin/fence_*.) All STONITH plug-ins look the same to pacemaker-fenced, but are quite different on the other side, reflecting the nature of the fencing device.

Some plug-ins support more than one device. A typical example is ipmilan (or external/ipmi) which implements the IPMI protocol and can control any device which supports this protocol.

12.3 STONITH resources and configuration

To set up fencing, you need to configure one or more STONITH resources—the pacemaker-fenced daemon requires no configuration. All configuration is stored in the CIB. A STONITH resource is a resource of class stonith (see Section 6.2, “Supported resource agent classes”). STONITH resources are a representation of STONITH plug-ins in the CIB. Apart from the fencing operations, the STONITH resources can be started, stopped and monitored, like any other resource. Starting or stopping STONITH resources means loading and unloading the STONITH device driver on a node. Starting and stopping are thus only administrative operations and do not translate to any operation on the fencing device itself. However, monitoring does translate to logging in to the device (to verify that the device will work in case it is needed). When a STONITH resource fails over to another node it enables the current node to talk to the STONITH device by loading the respective driver.

STONITH resources can be configured like any other resource. For details how to do so with your preferred cluster management tool:

The list of parameters (attributes) depends on the respective STONITH type. To view a list of parameters for a specific device, use the stonith command:

# stonith -t stonith-device-type -n

For example, to view the parameters for the ibmhmc device type, enter the following:

# stonith -t ibmhmc -n

To get a short help text for the device, use the -h option:

# stonith -t stonith-device-type -h

12.3.1 Example STONITH resource configurations

In the following, find some example configurations written in the syntax of the crm command line tool. To apply them, put the sample in a text file (for example, sample.txt) and run:

# crm < sample.txt

For more information about configuring resources with the crm command line tool, refer to Section 5.5, “Introduction to crmsh”.

Example 12.1: Configuration of an IBM RSA lights-out device

An IBM RSA lights-out device might be configured like this:

# crm configure
crm(live)configure# primitive st-ibmrsa-1 stonith:external/ibmrsa-telnet \
params nodename=alice ip_address=192.168.0.101 \
username=USERNAME password=PASSW0RD
crm(live)configure# primitive st-ibmrsa-2 stonith:external/ibmrsa-telnet \
params nodename=bob ip_address=192.168.0.102 \
username=USERNAME password=PASSW0RD
crm(live)configure# location l-st-alice st-ibmrsa-1 -inf: alice
crm(live)configure# location l-st-bob st-ibmrsa-2 -inf: bob
crm(live)configure# commit

In this example, location constraints are used for the following reason: There is always a certain probability that the STONITH operation is going to fail. Therefore, a STONITH operation on the node which is the executioner as well is not reliable. If the node is reset, it cannot send the notification about the fencing operation outcome. The only way to do that is to assume that the operation is going to succeed and send the notification beforehand. But if the operation fails, problems could arise. Therefore, by convention, pacemaker-fenced refuses to terminate its host.

Example 12.2: Configuration of a UPS fencing device

The configuration of a UPS type fencing device is similar to the examples above. The details are not covered here. All UPS devices employ the same mechanics for fencing. How the device is accessed varies. Old UPS devices only had a serial port, usually connected at 1200baud using a special serial cable. Many new ones still have a serial port, but often they also use a USB or Ethernet interface. The kind of connection you can use depends on what the plug-in supports.

For example, compare the apcmaster with the apcsmart device by using the stonith -t stonith-device-type -n command:

# stonith -t apcmaster -h

returns the following information:

STONITH Device: apcmaster - APC MasterSwitch (via telnet)
NOTE: The APC MasterSwitch accepts only one (telnet)
connection/session a time. When one session is active,
subsequent attempts to connect to the MasterSwitch will fail.
For more information see http://www.apc.com/
List of valid parameter names for apcmaster STONITH device:
        ipaddr
        login
        password
For Config info [-p] syntax, give each of the above parameters in order as
the -p value.
Arguments are separated by white space.
Config file [-F] syntax is the same as -p, except # at the start of a line
denotes a comment

With

# stonith -t apcsmart -h

you get the following output:

STONITH Device: apcsmart - APC Smart UPS
(via serial port - NOT USB!).
Works with higher-end APC UPSes, like
Back-UPS Pro, Smart-UPS, Matrix-UPS, etc.
(Smart-UPS may have to be >= Smart-UPS 700?).
See http://www.networkupstools.org/protocols/apcsmart.html
for protocol compatibility details.
For more information see http://www.apc.com/
List of valid parameter names for apcsmart STONITH device:
ttydev
hostlist

The first plug-in supports APC UPS with a network port and telnet protocol. The second plug-in uses the APC SMART protocol over the serial line, which is supported by many APC UPS product lines.

Example 12.3: Configuration of a Kdump device

Kdump belongs to the Special fencing devices and is in fact the opposite of a fencing device. The plug-in checks if a Kernel dump is in progress on a node. If so, it returns true and acts as if the node has been fenced, because the node will reboot after the Kdump is complete. If not, it returns a failure and the next fencing device is triggered.

The Kdump plug-in must be used together with another, real STONITH device, for example, external/ipmi. It does not work with SBD as the STONITH device. For the fencing mechanism to work properly, you must specify the order of the fencing devices so that Kdump is checked before a real STONITH device is triggered, as shown in the following procedure.

  1. Use the stonith:fence_kdump fence agent. A configuration example is shown below. For more information, see crm ra info stonith:fence_kdump.

    # crm configure
    crm(live)configure# primitive st-kdump stonith:fence_kdump \
        params nodename="alice "\ 1
        pcmk_host_list="alice" \
        pcmk_host_check="static-list" \
        pcmk_reboot_action="off" \
        pcmk_monitor_action="metadata" \
        pcmk_reboot_retries="1" \
        timeout="60"2
    crm(live)configure# commit

    1

    Name of the node to listen for a message from fence_kdump_send. Configure more STONITH resources for other nodes if needed.

    2

    Defines how long to wait for a message from fence_kdump_send. If a message is received, then a Kdump is in progress and the fencing mechanism considers the node to be fenced. If no message is received, fence_kdump times out, which indicates that the fence operation failed. The next STONITH device in the fencing_topology eventually fences the node.

  2. On each node, configure fence_kdump_send to send a message to all nodes when the Kdump process is finished. In /etc/sysconfig/kdump, edit the KDUMP_POSTSCRIPT line. For example:

    KDUMP_POSTSCRIPT="/usr/lib/fence_kdump_send -i 10 -p 7410 -c 1 NODELIST"

    Replace NODELIST with the host names of all the cluster nodes.

  3. Run either systemctl restart kdump.service or mkdumprd. Either of these commands will detect that /etc/sysconfig/kdump was modified, and will regenerate the initrd to include the library fence_kdump_send with network enabled.

  4. Open a port in the firewall for the fence_kdump resource. The default port is 7410.

  5. To have Kdump checked before triggering a real fencing mechanism (like external/ipmi), use a configuration similar to the following:

    crm(live)configure# fencing_topology \
      alice: kdump-node1 ipmi-node1 \
      bob: kdump-node2 ipmi-node2
    crm(live)configure# commit

    For more details on fencing_topology:

    crm(live)configure# help fencing_topology

12.4 Monitoring fencing devices

Like any other resource, the STONITH class agents also support the monitoring operation for checking status.

Fencing devices are an indispensable part of an HA cluster, but the less you need to use them, the better. Power management equipment is often affected by too much broadcast traffic. Some devices cannot handle more than ten or so connections per minute. Some get confused if two clients try to connect at the same time. Most cannot handle more than one session at a time.

The probability that a fencing operation needs to be performed and the fencing device fails is low. For most devices, a monitoring interval of at least 1800 seconds (30 minutes) should suffice. The exact value depends on the device and infrastructure. STONITH SBD resources do not need a monitor at all. See Section 12.5, “Special fencing devices” and Chapter 13, Storage protection and SBD.

For detailed information on how to configure monitor operations, refer to Section 6.10.2, “Configuring resource monitoring with crmsh” for the command line approach.

12.5 Special fencing devices

In addition to plug-ins which handle real STONITH devices, there are special purpose STONITH plug-ins.

Warning
Warning: For testing only

Some STONITH plug-ins mentioned below are for demonstration and testing purposes only. Do not use any of the following devices in real-life scenarios because this may lead to data corruption and unpredictable results:

  • external/ssh

  • ssh

fence_kdump

This plug-in checks if a Kernel dump is in progress on a node. If so, it returns true, and acts as if the node has been fenced. The node cannot run any resources during the dump anyway. This avoids fencing a node that is already down but doing a dump, which takes some time. The plug-in must be used in concert with another, real STONITH device.

For configuration details, see Example 12.3, “Configuration of a Kdump device”.

external/sbd

This is a self-fencing device. It reacts to a so-called poison pill which can be inserted into a shared disk. On shared-storage connection loss, it stops the node from operating. Learn how to use this STONITH agent to implement storage-based fencing in Chapter 13, Procedure 13.7, “Configuring the cluster to use SBD”. See also http://www.linux-ha.org/wiki/SBD_Fencing for more details.

Important
Important: external/sbd and DRBD

The external/sbd fencing mechanism requires that the SBD partition is readable directly from each node. Thus, a DRBD* device must not be used for an SBD partition.

However, you can use the fencing mechanism for a DRBD cluster, provided the SBD partition is located on a shared disk that is not mirrored or replicated.

external/ssh

Another software-based fencing mechanism. The nodes must be able to log in to each other as root without passwords. It takes a single parameter, hostlist, specifying the nodes that it will target. As it is not able to reset a truly failed node, it must not be used for real-life clusters—for testing and demonstration purposes only. Using it for shared storage would result in data corruption.

meatware

meatware requires help from the user to operate. Whenever invoked, meatware logs a CRIT severity message which shows up on the node's console. The operator then confirms that the node is down and issues a meatclient(8) command. This tells meatware to inform the cluster that the node should be considered dead. See /usr/share/doc/packages/cluster-glue/README.meatware for more information.

suicide

This is a software-only device, which can reboot a node it is running on, using the reboot command. This requires action by the node's operating system and can fail under certain circumstances. Therefore avoid using this device whenever possible. However, it is safe to use on one-node clusters.

Diskless SBD

This configuration is useful if you want a fencing mechanism without shared storage. In this diskless mode, SBD fences nodes by using the hardware watchdog without relying on any shared device. However, diskless SBD cannot handle a split brain scenario for a two-node cluster. Use this option only for clusters with more than two nodes.

suicide is the only exception to the I do not shoot my host rule.

12.6 Basic recommendations

Check the following list of recommendations to avoid common mistakes:

  • Do not configure several power switches in parallel.

  • To test your STONITH devices and their configuration, pull the plug once from each node and verify that fencing the node does takes place.

  • Test your resources under load and verify the timeout values are appropriate. Setting timeout values too low can trigger (unnecessary) fencing operations. For details, refer to Section 6.3, “Timeout values”.

  • Use appropriate fencing devices for your setup. For details, also refer to Section 12.5, “Special fencing devices”.

  • Configure one or more STONITH resources. By default, the global cluster option stonith-enabled is set to true. If no STONITH resources have been defined, the cluster will refuse to start any resources.

  • Do not set the global cluster option stonith-enabled to false for the following reasons:

    • Clusters without STONITH enabled are not supported.

    • DLM/OCFS2 will block forever waiting for a fencing operation that will never happen.

  • Do not set the global cluster option startup-fencing to false. By default, it is set to true for the following reason: If a node is in an unknown state during cluster start-up, the node will be fenced once to clarify its status.

12.7 For more information

/usr/share/doc/packages/cluster-glue

In your installed system, this directory contains README files for many STONITH plug-ins and devices.

http://www.clusterlabs.org/pacemaker/doc/

Pacemaker Explained: Explains the concepts used to configure Pacemaker. Contains comprehensive and detailed information for reference.

http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html

Article explaining the concepts of split brain, quorum and fencing in HA clusters.

13 Storage protection and SBD

SBD (STONITH Block Device) provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). This isolates the fencing mechanism from changes in firmware version or dependencies on specific firmware controllers. SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. Under certain conditions, it is also possible to use SBD without shared storage, by running it in diskless mode.

The cluster bootstrap scripts provide an automated way to set up a cluster with the option of using SBD as fencing mechanism. For details, see the Article “Installation and Setup Quick Start”. However, manually setting up SBD provides you with more options regarding the individual settings.

This chapter explains the concepts behind SBD. It guides you through configuring the components needed by SBD to protect your cluster from potential data corruption in case of a split brain scenario.

In addition to node level fencing, you can use additional mechanisms for storage protection, such as LVM exclusive activation or OCFS2 file locking support (resource level fencing). They protect your system against administrative or application faults.

13.1 Conceptual overview

SBD expands to Storage-Based Death or STONITH Block Device.

The highest priority of the High Availability cluster stack is to protect the integrity of data. This is achieved by preventing uncoordinated concurrent access to data storage. The cluster stack takes care of this using several control mechanisms.

However, network partitioning or software malfunction could potentially cause scenarios where several DCs are elected in a cluster. If this so-called split brain scenario were allowed to unfold, data corruption might occur.

Node fencing via STONITH is the primary mechanism to prevent this. Using SBD as a node fencing mechanism is one way of shutting down nodes without using an external power off device in case of a split brain scenario.

SBD components and mechanisms
SBD partition

In an environment where all nodes have access to shared storage, a small partition of the device is formatted for use with SBD. The size of the partition depends on the block size of the used disk (for example, 1 MB for standard SCSI disks with 512 byte block size or 4 MB for DASD disks with 4 kB block size). The initialization process creates a message layout on the device with slots for up to 255 nodes.

SBD daemon

After the respective SBD daemon is configured, it is brought online on each node before the rest of the cluster stack is started. It is terminated after all other cluster components have been shut down, thus ensuring that cluster resources are never activated without SBD supervision.

Messages

The daemon automatically allocates one of the message slots on the partition to itself, and constantly monitors it for messages addressed to itself. Upon receipt of a message, the daemon immediately complies with the request, such as initiating a power-off or reboot cycle for fencing.

Also, the daemon constantly monitors connectivity to the storage device, and terminates itself in case the partition becomes unreachable. This guarantees that it is not disconnected from fencing messages. If the cluster data resides on the same logical unit in a different partition, this is not an additional point of failure: The workload will terminate anyway if the storage connectivity has been lost.

Watchdog

Whenever SBD is used, a correctly working watchdog is crucial. Modern systems support a hardware watchdog that needs to be tickled or fed by a software component. The software component (in this case, the SBD daemon) feeds the watchdog by regularly writing a service pulse to the watchdog. If the daemon stops feeding the watchdog, the hardware will enforce a system restart. This protects against failures of the SBD process itself, such as dying, or becoming stuck on an I/O error.

If Pacemaker integration is activated, loss of device majority alone does not trigger self-fencing. For example, your cluster contains three nodes: A, B, and C. Because of a network split, A can only see itself while B and C can still communicate. In this case, there are two cluster partitions: one with quorum because of being the majority (B, C), and one without (A). If this happens while the majority of fencing devices are unreachable, node A would self-fence, but nodes B and C would continue to run.

13.2 Overview of manually setting up SBD

The following steps are necessary to manually set up storage-based protection. They must be executed as root. Before you start, check Section 13.3, “Requirements and restrictions”.

  1. Setting up the watchdog

  2. Depending on your scenario, either use SBD with one to three devices or in diskless mode. For an outline, see Section 13.4, “Number of SBD devices”. The detailed setup is described in:

  3. Testing SBD and fencing

13.3 Requirements and restrictions

  • You can use up to three SBD devices for storage-based fencing. When using one to three devices, the shared storage must be accessible from all nodes.

  • The path to the shared storage device must be persistent and consistent across all nodes in the cluster. Use stable device names such as /dev/disk/by-id/dm-uuid-part1-mpath-abcedf12345.

  • The shared storage can be connected via Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), or even iSCSI.

  • The shared storage segment must not use host-based RAID, LVM, or DRBD*. DRBD can be split, which is problematic for SBD, as there cannot be two states in SBD. Cluster multi-device (Cluster MD) cannot be used for SBD.

  • However, using storage-based RAID and multipathing is recommended for increased reliability.

  • An SBD device can be shared between different clusters, as long as no more than 255 nodes share the device.

  • Fencing does not work with an asymmetric SBD setup. When using more than one SBD device, all nodes must have a slot in all SBD devices.

  • When using more than one SBD device, all devices must have the same configuration, for example, the same timeout values.

  • For clusters with more than two nodes, you can also use SBD in diskless mode.

13.4 Number of SBD devices

SBD supports the use of up to three devices:

One device

The most simple implementation. It is appropriate for clusters where all of your data is on the same shared storage.

Two devices

This configuration is primarily useful for environments that use host-based mirroring but where no third storage device is available. SBD will not terminate itself if it loses access to one mirror leg, allowing the cluster to continue. However, since SBD does not have enough knowledge to detect an asymmetric split of the storage, it will not fence the other side while only one mirror leg is available. Thus, it cannot automatically tolerate a second failure while one of the storage arrays is down.

Three devices

The most reliable configuration. It is resilient against outages of one device—be it because of failures or maintenance. SBD will terminate itself only if more than one device is lost and if required, depending on the status of the cluster partition or node. If at least two devices are still accessible, fencing messages can be successfully transmitted.

This configuration is suitable for more complex scenarios where storage is not restricted to a single array. Host-based mirroring solutions can have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI.

Diskless

This configuration is useful if you want a fencing mechanism without shared storage. In this diskless mode, SBD fences nodes by using the hardware watchdog without relying on any shared device. However, diskless SBD cannot handle a split brain scenario for a two-node cluster. Use this option only for clusters with more than two nodes.

13.5 Calculation of timeouts

When using SBD as a fencing mechanism, it is vital to consider the timeouts of all components, because they depend on each other. When using more than one SBD device, all devices must have the same timeout values.

watchdog timeout

This timeout is set during initialization of the SBD device. It depends mostly on your storage latency. The majority of devices must be successfully read within this time. Otherwise, the node might self-fence.

Note
Note: Multipath or iSCSI setup

If your SBD device(s) reside on a multipath setup or iSCSI, the timeout should be set to the time required to detect a path failure and switch to the next path.

This also means that in /etc/multipath.conf the value of max_polling_interval must be less than the watchdog timeout.

msgwait timeout

This timeout is set during initialization of the SBD device. It defines the time after which a message written to a node's slot on the SBD device is considered delivered. The timeout should be long enough for the node to detect that it needs to self-fence.

However, if the msgwait timeout is relatively long, a fenced cluster node might rejoin before the fencing action returns. This can be mitigated by setting the SBD_DELAY_START parameter in the SBD configuration, as described in Procedure 13.4 in Step 3.

stonith-timeout in the CIB

This timeout is set in the CIB as a global cluster property. It defines how long to wait for the STONITH action (reboot, on, off) to complete.

stonith-watchdog-timeout in the CIB

This timeout is set in the CIB as a global cluster property. If not set explicitly, it defaults to 0, which is appropriate for using SBD with one to three devices. For SBD in diskless mode, this timeout must not be 0. For details, see Procedure 13.8, “Configuring diskless SBD”.

If you change the watchdog timeout, you need to adjust the other two timeouts as well. The following formula expresses the relationship between these three values:

Example 13.1: Formula for timeout calculation
Timeout (msgwait) >= (Timeout (watchdog) * 2)
stonith-timeout >= Timeout (msgwait) + 20%

For example, if you set the watchdog timeout to 120, set the msgwait timeout to at least 240 and the stonith-timeout to at least 288.

If you use the bootstrap scripts provided by the crm shell to set up a cluster and to initialize the SBD device, the relationship between these timeouts is automatically considered.

13.6 Setting up the watchdog

SUSE Linux Enterprise High Availability ships with several kernel modules that provide hardware-specific watchdog drivers. For a list of the most commonly used ones, see Commonly used watchdog drivers.

For clusters in production environments we recommend to use a hardware-specific watchdog driver. However, if no watchdog matches your hardware, softdog can be used as kernel watchdog module.

SUSE Linux Enterprise High Availability uses the SBD daemon as the software component that feeds the watchdog.

13.6.1 Using a hardware watchdog

Finding the right watchdog kernel module for a given system is not trivial. Automatic probing fails very often. As a result, lots of modules are already loaded before the right one gets a chance.

The following table lists some commonly used watchdog drivers. However, this is not a complete list of supported drivers. If your hardware is not listed here, you can also find a list of choices in the directories /lib/modules/KERNEL_VERSION/kernel/drivers/watchdog and /lib/modules/KERNEL_VERSION/kernel/drivers/ipmi. Alternatively, ask your hardware or system vendor for details on system-specific watchdog configuration.

Table 13.1: Commonly used watchdog drivers
HardwareDriver
HPhpwdt
Dell, Lenovo (Intel TCO)iTCO_wdt
Fujitsuipmi_watchdog
LPAR on IBM Powerpseries-wdt
VM on IBM z/VMvmwatchdog
Xen VM (DomU)xen_xdt
VM on VMware vSpherewdat_wdt
Genericsoftdog
Important
Important: Accessing the watchdog timer

Some hardware vendors ship systems management software that uses the watchdog for system resets (for example, HP ASR daemon). If the watchdog is used by SBD, disable such software. No other software must access the watchdog timer.

Procedure 13.1: Loading the correct kernel module

To make sure the correct watchdog module is loaded, proceed as follows:

  1. List the drivers that have been installed with your kernel version:

    # rpm -ql kernel-VERSION | grep watchdog
  2. List any watchdog modules that are currently loaded in the kernel:

    # lsmod | egrep "(wd|dog)"
  3. If you get a result, unload the wrong module:

    # rmmod WRONG_MODULE
  4. Enable the watchdog module that matches your hardware:

    # echo WATCHDOG_MODULE > /etc/modules-load.d/watchdog.conf
    # systemctl restart systemd-modules-load
  5. Test whether the watchdog module is loaded correctly:

    # lsmod | grep dog
  6. Verify if the watchdog device is available and works:

    # ls -l /dev/watchdog*
    # sbd query-watchdog

    If your watchdog device is not available, stop here and check the module name and options. Maybe use another driver.

  7. Verify if the watchdog device works:

    # sbd -w WATCHDOG_DEVICE test-watchdog
  8. Reboot your machine to make sure there are no conflicting kernel modules. For example, if you find the message cannot register ... in your log, this would indicate such conflicting modules. To ignore such modules, refer to https://documentation.suse.com/sles/html/SLES-all/cha-mod.html#sec-mod-modprobe-blacklist.

13.6.2 Using the software watchdog (softdog)

For clusters in production environments we recommend to use a hardware-specific watchdog driver. However, if no watchdog matches your hardware, softdog can be used as kernel watchdog module.

Important
Important: Softdog limitations

The softdog driver assumes that at least one CPU is still running. If all CPUs are stuck, the code in the softdog driver that should reboot the system will never be executed. In contrast, hardware watchdogs keep working even if all CPUs are stuck.

Procedure 13.2: Loading the softdog kernel module
  1. Enable the softdog watchdog:

    # echo softdog > /etc/modules-load.d/watchdog.conf
    # systemctl restart systemd-modules-load
  2. Test whether the softdog watchdog module is loaded correctly:

    # lsmod | grep softdog

13.7 Setting up SBD with devices

The following steps are necessary for setup:

Before you start, make sure the block device or devices you want to use for SBD meet the requirements specified in Section 13.3.

When setting up the SBD devices, you need to take several timeout values into account. For details, see Section 13.5, “Calculation of timeouts”.

The node will terminate itself if the SBD daemon running on it has not updated the watchdog timer fast enough. After having set the timeouts, test them in your specific environment.

Procedure 13.3: Initializing the SBD devices

To use SBD with shared storage, you must first create the messaging layout on one to three block devices. The sbd create command will write a metadata header to the specified device or devices. It will also initialize the messaging slots for up to 255 nodes. If executed without any further options, the command will use the default timeout settings.

Warning
Warning: Overwriting existing data

Make sure the device or devices you want to use for SBD do not hold any important data. When you execute the sbd create command, roughly the first megabyte of the specified block devices will be overwritten without further requests or backup.

  1. Decide which block device or block devices to use for SBD.

  2. Initialize the SBD device with the following command:

    # sbd -d /dev/disk/by-id/DEVICE_ID create

    To use more than one device for SBD, specify the -d option multiple times, for example:

    # sbd -d /dev/disk/by-id/DEVICE_ID1 -d /dev/disk/by-id/DEVICE_ID2 -d /dev/disk/by-id/DEVICE_ID3 create
  3. If your SBD device resides on a multipath group, use the -1 and -4 options to adjust the timeouts to use for SBD. If you initialized more than one device, you must set the same timeout values for all devices. For details, see Section 13.5, “Calculation of timeouts”. All timeouts are given in seconds:

    # sbd -d /dev/disk/by-id/DEVICE_ID -1 901 -4 1802 create

    1

    The -1 option is used to specify the watchdog timeout. In the example above, it is set to 90 seconds. The minimum allowed value for the emulated watchdog is 15 seconds.

    2

    The -4 option is used to specify the msgwait timeout. In the example above, it is set to 180 seconds.

  4. Check what has been written to the device:

    # sbd -d /dev/disk/by-id/DEVICE_ID dump
    Header version     : 2.1
    UUID               : 619127f4-0e06-434c-84a0-ea82036e144c
    Number of slots    : 255
    Sector size        : 512
    Timeout (watchdog) : 5
    Timeout (allocate) : 2
    Timeout (loop)     : 1
    Timeout (msgwait)  : 10
    ==Header on disk /dev/disk/by-id/DEVICE_ID is dumped

    As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.

After you have initialized the SBD devices, edit the SBD configuration file, then enable and start the respective services for the changes to take effect.

Procedure 13.4: Editing the SBD configuration file
  1. Open the file /etc/sysconfig/sbd.

  2. Search for the following parameter: SBD_DEVICE.

    It specifies the devices to monitor and to use for exchanging SBD messages.

    Edit this line by replacing /dev/disk/by-id/DEVICE_ID with your SBD device:

    SBD_DEVICE="/dev/disk/by-id/DEVICE_ID"

    If you need to specify multiple devices in the first line, separate them with semicolons (the order of the devices does not matter):

    SBD_DEVICE="/dev/disk/by-id/DEVICE_ID1;/dev/disk/by-id/DEVICE_ID2;/dev/disk/by-id/DEVICE_ID3"

    If the SBD device is not accessible, the daemon fails to start and inhibits cluster start-up.

  3. Search for the following parameter: SBD_DELAY_START.

    Enables or disables a delay. Set SBD_DELAY_START to yes if msgwait is relatively long, but your cluster nodes boot very fast. Setting this parameter to yes delays the start of SBD on boot. This is sometimes necessary with virtual machines.

    The default delay length is the same as the msgwait timeout value. Alternatively, you can specify an integer, in seconds, instead of yes.

    If you enable SBD_DELAY_START, you must also check the SBD service file to ensure that the value of TimeoutStartSec is greater than the value of SBD_DELAY_START. For more information, see https://www.suse.com/support/kb/doc/?id=000019356.

  4. Copy the configuration file to all nodes by using csync2:

    # csync2 -xv

    For more information, see Section 4.7, “Transferring the configuration to all nodes”.

After you have added your SBD devices to the SBD configuration file, enable the SBD daemon. The SBD daemon is a critical piece of the cluster stack. It needs to be running when the cluster stack is running. Thus, the sbd service is started as a dependency whenever the cluster services are started.

Procedure 13.5: Enabling and starting the SBD service
  1. On each node, enable the SBD service:

    # systemctl enable sbd

    It will be started together with the Corosync service whenever the cluster services are started.

  2. Restart the cluster services on all nodes at once by using the --all option:

    # crm cluster restart --all

    This automatically triggers the start of the SBD daemon.

    Important
    Important: Restart cluster services for SBD changes

    If any SBD metadata changes, you must restart the cluster services again. To keep critical cluster resources running during the restart, consider putting the cluster into maintenance mode first. For more information, see Chapter 28, Executing maintenance tasks.

As a next step, test the SBD devices as described in Procedure 13.6.

Procedure 13.6: Testing the SBD devices
  1. The following command will dump the node slots and their current messages from the SBD device:

    # sbd -d /dev/disk/by-id/DEVICE_ID list

    Now you should see all cluster nodes that have ever been started with SBD listed here. For example, if you have a two-node cluster, the message slot should show clear for both nodes:

    0       alice        clear
    1       bob          clear
  2. Try sending a test message to one of the nodes:

    # sbd -d /dev/disk/by-id/DEVICE_ID message alice test
  3. The node will acknowledge the receipt of the message in the system log files:

    May 03 16:08:31 alice sbd[66139]: /dev/disk/by-id/DEVICE_ID: notice: servant:
    Received command test from bob on disk /dev/disk/by-id/DEVICE_ID

    This confirms that SBD is indeed up and running on the node and that it is ready to receive messages.

As a final step, you need to adjust the cluster configuration as described in Procedure 13.7.

Procedure 13.7: Configuring the cluster to use SBD
  1. Start a shell and log in as root or equivalent.

  2. Run crm configure.

  3. Enter the following:

    crm(live)configure# property stonith-enabled="true"1
    crm(live)configure# property stonith-watchdog-timeout=02
    crm(live)configure# property stonith-timeout="40s"3

    1

    This is the default configuration, because clusters without STONITH are not supported. But in case STONITH has been deactivated for testing purposes, make sure this parameter is set to true again.

    2

    If not explicitly set, this value defaults to 0, which is appropriate for use of SBD with one to three devices.

    3

    To calculate the stonith-timeout, refer to Section 13.5, “Calculation of timeouts”. A stonith-timeout value of 40 would be appropriate if the msgwait timeout value for SBD was set to 30 seconds.

  4. Configure the SBD STONITH resource. You do not need to clone this resource.

    For a two-node cluster, in case of split brain, there will be fencing issued from each node to the other as expected. To prevent both nodes from being reset at practically the same time, it is recommended to apply the following fencing delays to help one of the nodes, or even the preferred node, win the fencing match. For clusters with more than two nodes, you do not need to apply these delays.

    Priority fencing delay

    The priority-fencing-delay cluster property is disabled by default. By configuring a delay value, if the other node is lost and it has the higher total resource priority, the fencing targeting it will be delayed for the specified amount of time. This means that in case of split-brain, the more important node wins the fencing match.

    Resources that matter can be configured with priority meta attribute. On calculation, the priority values of the resources or instances that are running on each node are summed up to be accounted. A promoted resource instance takes the configured base priority plus one so that it receives a higher value than any unpromoted instance.

    # crm configure property priority-fencing-delay=30

    Even if priority-fencing-delay is used, we still recommend also using pcmk_delay_base or pcmk_delay_max as described below to address any situations where the nodes happen to have equal priority. The value of priority-fencing-delay should be significantly greater than the maximum of pcmk_delay_base / pcmk_delay_max, and preferably twice the maximum.

    Predictable static delay

    This parameter adds a static delay before executing STONITH actions. To prevent the nodes from being reset at the same time under split-brain of a two-node cluster, configure separate fencing resources with different delay values. The preferred node can be marked with the parameter to be targeted with a longer fencing delay so that it wins any fencing match. To make this succeed, it is essential to create two primitive STONITH devices for each node. In the following configuration, alice will win and survive in case of a split brain scenario:

    crm(live)configure# primitive st-sbd-alice stonith:external/sbd params \
    pcmk_host_list=alice pcmk_delay_base=20
    crm(live)configure# primitive st-sbd-bob stonith:external/sbd params \
    pcmk_host_list=bob pcmk_delay_base=0
    Dynamic random delay

    This parameter adds a random delay for STONITH actions on the fencing device. Rather than a static delay targeting a specific node, the parameter pcmk_delay_max adds a random delay for any fencing with the fencing resource to prevent double reset. Unlike pcmk_delay_base, this parameter can be specified for a unified fencing resource targeting multiple nodes.

    crm(live)configure# primitive stonith_sbd stonith:external/sbd \
    params pcmk_delay_max=30
    Warning
    Warning: pcmk_delay_max might not prevent double reset in a split-brain scenario

    The lower the value of pcmk_delay_max, the higher the chance that a double reset might still occur.

    If your aim is to have a predictable survivor, use a priority fencing delay or predictable static delay.

  5. Review your changes with show.

  6. Submit your changes with commit and leave the crm live configuration with quit.

After the resource has started, your cluster is successfully configured for use of SBD. It will use this method in case a node needs to be fenced.

13.8 Setting up diskless SBD

SBD can be operated in a diskless mode. In this mode, a watchdog device will be used to reset the node in the following cases: if it loses quorum, if any monitored daemon is lost and not recovered, or if Pacemaker decides that the node requires fencing. Diskless SBD is based on self-fencing of a node, depending on the status of the cluster, the quorum and some reasonable assumptions. No STONITH SBD resource primitive is needed in the CIB.

Important
Important: Do not block Corosync traffic in the local firewall

Diskless SBD relies on reformed membership and loss of quorum to achieve fencing. Corosync traffic must be able to pass through all network interfaces, including the loopback interface, and must not be blocked by a local firewall. Otherwise, Corosync cannot reform a new membership, which can cause a split-brain scenario that cannot be handled by diskless SBD fencing.

Important
Important: Number of cluster nodes

Do not use diskless SBD as a fencing mechanism for two-node clusters. Use diskless SBD only for clusters with three or more nodes. SBD in diskless mode cannot handle split brain scenarios for two-node clusters. If you want to use diskless SBD for two-node clusters, use QDevice as described in Chapter 14, QDevice and QNetd.

Procedure 13.8: Configuring diskless SBD
  1. Open the file /etc/sysconfig/sbd and use the following entries:

    SBD_PACEMAKER=yes
    SBD_STARTMODE=always
    SBD_DELAY_START=no
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    The SBD_DEVICE entry is not needed as no shared disk is used. When this parameter is missing, the sbd service does not start any watcher process for SBD devices.

    If you need to delay the start of SBD on boot, change SBD_DELAY_START to yes. The default delay length is double the value of SBD_WATCHDOG_TIMEOUT. Alternatively, you can specify an integer, in seconds, instead of yes.

    Important
    Important: SBD_WATCHDOG_TIMEOUT for diskless SBD and QDevice

    If you use QDevice with diskless SBD, the SBD_WATCHDOG_TIMEOUT value must be greater than QDevice's sync_timeout value, or SBD will time out and fail to start.

    The default value for sync_timeout is 30 seconds. Therefore, set SBD_WATCHDOG_TIMEOUT to a greater value, such as 35.

  2. On each node, enable the SBD service:

    # systemctl enable sbd

    It will be started together with the Corosync service whenever the cluster services are started.

  3. Restart the cluster services on all nodes at once by using the --all option:

    # crm cluster restart --all

    This automatically triggers the start of the SBD daemon.

    Important
    Important: Restart cluster services for SBD changes

    If any SBD metadata changes, you must restart the cluster services again. To keep critical cluster resources running during the restart, consider putting the cluster into maintenance mode first. For more information, see Chapter 28, Executing maintenance tasks.

  4. Check if the parameter have-watchdog=true has been automatically set:

    # crm configure show | grep have-watchdog
             have-watchdog=true
  5. Run crm configure and set the following cluster properties on the crm shell:

    crm(live)configure# property stonith-enabled="true"1
    crm(live)configure# property stonith-watchdog-timeout=102
    crm(live)configure# property stonith-timeout=153

    1

    This is the default configuration, because clusters without STONITH are not supported. But in case STONITH has been deactivated for testing purposes, make sure this parameter is set to true again.

    2

    For diskless SBD, this parameter must not equal zero. It defines after how long it is assumed that the fencing target has already self-fenced. Use the following formula to calculate this timeout:

    stonith-watchdog-timeout >= (SBD_WATCHDOG_TIMEOUT * 2)

    If you set stonith-watchdog-timeout to a negative value, Pacemaker automatically calculates this timeout and sets it to twice the value of SBD_WATCHDOG_TIMEOUT.

    3

    This parameter must allow sufficient time for fencing to complete. For diskless SBD, use the following formula to calculate this timeout:

    stonith-timeout >= stonith-watchdog-timeout + 20%
    Important
    Important: Diskless SBD timeouts

    With diskless SBD, if the stonith-timeout value is smaller than the stonith-watchdog-timeout value, failed nodes can become stuck in an UNCLEAN state and block failover of active resources.

  6. Review your changes with show.

  7. Submit your changes with commit and leave the crm live configuration with quit.

13.9 Testing SBD and fencing

To test whether SBD works as expected for node fencing purposes, use one or all of the following methods:

Manually triggering fencing of a node

To trigger a fencing action for node NODENAME:

# crm node fence NODENAME

Check if the node is fenced and if the other nodes consider the node as fenced after the stonith-watchdog-timeout.

Simulating an SBD failure
  1. Identify the process ID of the SBD inquisitor:

    # systemctl status sbd
    ● sbd.service - Shared-storage based fencing daemon
    
       Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
       Active: active (running) since Tue 2018-04-17 15:24:51 CEST; 6 days ago
         Docs: man:sbd(8)
      Process: 1844 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
     Main PID: 1859 (sbd)
        Tasks: 4 (limit: 4915)
       CGroup: /system.slice/sbd.service
               ├─1859 sbd: inquisitor
    [...]
  2. Simulate an SBD failure by terminating the SBD inquisitor process. In our example, the process ID of the SBD inquisitor is 1859):

    # kill -9 1859

    The node proactively self-fences. The other nodes notice the loss of the node and consider it has self-fenced after the stonith-watchdog-timeout.

Triggering fencing through a monitor operation failure

With a normal configuration, a failure of a resource stop operation will trigger fencing. To trigger fencing manually, you can produce a failure of a resource stop operation. Alternatively, you can temporarily change the configuration of a resource monitor operation and produce a monitor failure as described below:

  1. Configure an on-fail=fence property for a resource monitor operation:

    op monitor interval=10 on-fail=fence
  2. Let the monitoring operation fail (for example, by terminating the respective daemon, if the resource relates to a service).

    This failure triggers a fencing action.

13.10 Additional mechanisms for storage protection

Apart from node fencing via STONITH there are other methods to achieve storage protection at a resource level. For example, SCSI-3 and SCSI-4 use persistent reservations whereas sfex provides a locking mechanism. Both methods are explained in the following subsections.

13.10.1 Configuring an sg_persist resource

The SCSI specifications 3 and 4 define persistent reservations. These are SCSI protocol features and can be used for I/O fencing and failover. This feature is implemented in the sg_persist Linux command.

Note
Note: SCSI disk compatibility

Any backing disks for sg_persist must be SCSI disk compatible. sg_persist only works for devices like SCSI disks or iSCSI LUNs. Do not use it for IDE, SATA, or any block devices which do not support the SCSI protocol.

Before you proceed, check if your disk supports persistent reservations. Use the following command (replace DEVICE_ID with your device name):

# sg_persist -n --in --read-reservation -d /dev/disk/by-id/DEVICE_ID

The result shows whether your disk supports persistent reservations:

  • Supported disk:

    PR generation=0x0, there is NO reservation held
  • Unsupported disk:

    PR in (Read reservation): command not supported
    Illegal request, Invalid opcode

If you get an error message (like the one above), replace the old disk with an SCSI compatible disk. Otherwise proceed as follows:

  1. Create the primitive resource sg_persist, using a stable device name for the disk:

    # crm configure
    crm(live)configure# primitive sg sg_persist \
        params devs="/dev/disk/by-id/DEVICE_ID" reservation_type=3 \
        op monitor interval=60 timeout=60
  2. Create a promotable clone of the sg_persist primitive:

    crm(live)configure# clone clone-sg sg \
        meta promotable=true promoted-max=1 notify=true
  3. Test the setup. When the resource is promoted, you can mount and write to the disk partitions on the cluster node where the primary instance is running, but you cannot write on the cluster node where the secondary instance is running.

  4. Add a file system primitive for Ext4, using a stable device name for the disk partition:

    crm(live)configure# primitive ext4 Filesystem \
        params device="/dev/disk/by-id/DEVICE_ID" directory="/mnt/ext4" fstype=ext4
  5. Add the following order relationship plus a collocation between the sg_persist clone and the file system resource:

    crm(live)configure# order o-clone-sg-before-ext4 Mandatory: clone-sg:promote ext4:start
    crm(live)configure# colocation col-ext4-with-sg-persist inf: ext4 clone-sg:Promoted
  6. Check all your changes with the show changed command.

  7. Commit your changes.

For more information, refer to the sg_persist man page.

13.10.2 Ensuring exclusive storage activation with sfex

This section introduces sfex, an additional low-level mechanism to lock access to shared storage exclusively to one node. Note that sfex does not replace STONITH. As sfex requires shared storage, it is recommended that the SBD node fencing mechanism described above is used on another partition of the storage.

By design, sfex cannot be used with workloads that require concurrency (such as OCFS2). It serves as a layer of protection for classic failover style workloads. This is similar to an SCSI-2 reservation in effect, but more general.

13.10.2.1 Overview

In a shared storage environment, a small partition of the storage is set aside for storing one or more locks.

Before acquiring protected resources, the node must first acquire the protecting lock. The ordering is enforced by Pacemaker. The sfex component ensures that even if Pacemaker were subject to a split brain situation, the lock will never be granted more than once.

These locks must also be refreshed periodically, so that a node's death does not permanently block the lock and other nodes can proceed.

13.10.2.2 Setup

In the following, learn how to create a shared partition for use with sfex and how to configure a resource for the sfex lock in the CIB. A single sfex partition can hold any number of locks, and needs 1 KB of storage space allocated per lock. By default, sfex_init creates one lock on the partition.

Important
Important: Requirements
  • The shared partition for sfex should be on the same logical unit as the data you want to protect.

  • The shared sfex partition must not use host-based RAID or DRBD.

  • Using an LVM logical volume is possible.

Procedure 13.9: Creating an sfex partition
  1. Create a shared partition for use with sfex. Note the name of this partition and use it as a substitute for /dev/sfex below.

  2. Create the sfex metadata with the following command:

    # sfex_init -n 1 /dev/sfex
  3. Verify that the metadata has been created correctly:

    # sfex_stat -i 1 /dev/sfex ; echo $?

    This should return 2, since the lock is not currently held.

Procedure 13.10: Configuring a resource for the sfex lock
  1. The sfex lock is represented via a resource in the CIB, configured as follows:

    crm(live)configure# primitive sfex_1 ocf:heartbeat:sfex \
          params device="/dev/sfex" index="1" collision_timeout="1" \
          lock_timeout="70" monitor_interval="10" \
          op monitor interval="10s" timeout="30s" on-fail="fence"
  2. To protect resources via an sfex lock, create mandatory order and placement constraints between the resources to protect the sfex resource. If the resource to be protected has the ID filesystem1:

    crm(live)configure# order order-sfex-1 Mandatory: sfex_1 filesystem1
    crm(live)configure# colocation col-sfex-1 inf: filesystem1 sfex_1
  3. If using group syntax, add the sfex resource as the first resource to the group:

    crm(live)configure# group LAMP sfex_1 filesystem1 apache ipaddr

13.11 Changing SBD configuration

You might need to change the cluster's SBD configuration for various reasons. For example:

  • Changing disk-based SBD to diskless SBD

  • Changing diskless SBD to disk-based SBD

  • Replacing an SBD device with a new device

  • Changing timeout values and other settings

You can use crmsh to change the SBD configuration. This method uses crmsh's default settings, including timeout values.

If you need to change any settings, or if you use custom settings and need to retain them when replacing a device, you must manually edit the SBD configuration.

Procedure 13.11: Redeploying disk-based SBD with crmsh

Use this procedure to change diskless SBD to disk-based SBD, or to replace an existing SBD device with a new device.

  1. Put the cluster into maintenance mode:

    # crm maintenance on

    In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even if the cluster services stop.

  2. Configure the new device:

    # crm -F cluster init sbd -s /dev/disk/by-id/DEVICE_ID

    The -F option allows crmsh to reconfigure SBD even if the SBD service is already running.

  3. Check the status of the cluster:

    # crm status

    Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.

  4. Check the SBD configuration. First, check the device metadata:

    # sbd -d /dev/disk/by-id/DEVICE_ID dump

    Then check that all nodes in the cluster are assigned to a slot in the device:

    # sbd -d /dev/disk/by-id/DEVICE_ID list
  5. Check the status of the SBD service:

    # systemctl status sbd

    If you changed diskless SBD to disk-based SBD, check that the following section includes a device ID:

    CGroup: /system/.slice/sbd.service
            |—23314 "sbd: inquisitor"
            |—23315 "sbd: watcher: /dev/disk/by-id/DEVICE_ID - slot: 0 - uuid: DEVICE_UUID"
            |—23316 "sbd: watcher: Pacemaker"
            |—23317 "sbd: watcher: Cluster"
  6. When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:

    # crm maintenance off
Procedure 13.12: Redeploying disk-based SBD manually

Use this procedure to change diskless SBD to disk-based SBD, to replace an existing SBD device with a new device, or to change the timeout settings for disk-based SBD.

  1. Put the cluster into maintenance mode:

    # crm maintenance on

    In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even when you stop the cluster services.

  2. Stop the cluster services, including SBD, on all nodes:

    # crm cluster stop --all
  3. Reinitialize the device metadata, specifying the new device ID and timeouts as required:

    # sbd -d /dev/disk/by-id/DEVICE_ID -1 VALUE -4 VALUE create

    The -1 option specifies the watchdog timeout.

    The -4 option specifies the msgwait timeout. This must be at least double the watchdog timeout.

    For more information, see Procedure 13.3, “Initializing the SBD devices”.

  4. Open the file /etc/sysconfig/sbd.

  5. If you are changing diskless SBD to disk-based SBD, add the following line and specify the device ID. If you are replacing an SBD device, change the value of this line to the new device ID:

    SBD_DEVICE="/dev/disk/by-id/DEVICE_ID"
  6. Adjust the other settings as required. For more information, see Procedure 13.4, “Editing the SBD configuration file”.

  7. Copy the configuration file to all nodes:

    # csync2 -xv
  8. Start the cluster services on all nodes:

    # crm cluster start --all
  9. Check the status of the cluster:

    # crm status

    Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.

  10. Check the SBD configuration. First, check the device metadata:

    # sbd -d /dev/disk/by-id/DEVICE_ID dump

    Then check that all nodes in the cluster are assigned to a slot in the device:

    # sbd -d /dev/disk/by-id/DEVICE_ID list
  11. Check the status of the SBD service:

    # systemctl status sbd

    If you changed diskless SBD to disk-based SBD, check that the following section includes a device ID:

    CGroup: /system/.slice/sbd.service
            |—23314 "sbd: inquisitor"
            |—23315 "sbd: watcher: /dev/disk/by-id/DEVICE_ID - slot: 0 - uuid: DEVICE_UUID"
            |—23316 "sbd: watcher: Pacemaker"
            |—23317 "sbd: watcher: Cluster"
  12. If you changed any timeouts, or if you changed diskless SBD to disk-based SBD, you might also need to change the CIB properties stonith-timeout and stonith-watchdog-timeout. For disk-based SBD, stonith-watchdog-timeout should be 0 or defaulted. For more information, see Section 13.5, “Calculation of timeouts”.

    To check the current values, run the following command:

    # crm configure show

    If you need to change the values, use the following commands:

    # crm configure property stonith-watchdog-timeout=0
    # crm configure property stonith-timeout=VALUE
  13. If you changed diskless SBD to disk-based SBD, you must configure a STONITH resource for SBD. For example:

    # crm configure primitive stonith-sbd stonith:external/sbd

    For more information, see Step 4 in Procedure 13.7, “Configuring the cluster to use SBD”.

  14. When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:

    # crm maintenance off
Procedure 13.13: Redeploying diskless SBD with crmsh

Use this procedure to change disk-based SBD to diskless SBD.

  1. Put the cluster into maintenance mode:

    # crm maintenance on

    In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even if the cluster services stop.

  2. Configure diskless SBD:

    # crm -F cluster init sbd -S

    The -F option allows crmsh to reconfigure SBD even if the SBD service is already running.

  3. Check the status of the cluster:

    # crm status

    Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.

  4. Check the status of the SBD service:

    # systemctl status sbd

    The following section should not include a device ID:

    CGroup: /system/.slice/sbd.service
            |—23314 "sbd: inquisitor"
            |—23315 "sbd: watcher: Pacemaker"
            |—23316 "sbd: watcher: Cluster"
  5. When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:

    # crm maintenance off
Procedure 13.14: Redeploying diskless SBD manually

Use this procedure to change disk-based SBD to diskless SBD, or to change the timeout values for diskless SBD.

  1. Put the cluster into maintenance mode:

    # crm maintenance on

    In this state, the cluster stops monitoring all resources, so the services managed by the resources can continue to run even when you stop the cluster services.

  2. Stop the cluster services, including SBD, on all nodes:

    # crm cluster stop --all
  3. Open the file /etc/sysconfig/sbd.

  4. If you are changing disk-based SBD to diskless SBD, remove or comment out the SBD_DEVICE entry.

  5. Adjust the other settings as required. For more information, see Section 13.8, “Setting up diskless SBD”.

  6. Copy the configuration file to all nodes:

    # csync2 -xv
  7. Start the cluster services on all nodes:

    # crm cluster start --all
  8. Check the status of the cluster:

    # crm status

    Initially, the nodes have a status of UNCLEAN (offline), but after a short time they change to Online.

  9. Check the status of the SBD service:

    # systemctl status sbd

    If you changed disk-based SBD to diskless SBD, check that the following section does not include a device ID:

    CGroup: /system/.slice/sbd.service
            |—23314 "sbd: inquisitor"
            |—23315 "sbd: watcher: Pacemaker"
            |—23316 "sbd: watcher: Cluster"
  10. If you changed any timeouts, or if you changed disk-based SBD to diskless SBD, you might also need to change the CIB properties stonith-timeout and stonith-watchdog-timeout. For more information, see Step 5 of Procedure 13.8, “Configuring diskless SBD”.

    To check the current values, run the following command:

    # crm configure show

    If you need to change the values, use the following commands:

    # crm configure property stonith-watchdog-timeout=VALUE
    # crm configure property stonith-timeout=VALUE
  11. When the nodes are back online, move the cluster out of maintenance mode and back into normal operation:

    # crm maintenance off

13.12 For more information

For more details, see man sbd.

14 QDevice and QNetd

QDevice and QNetd participate in quorum decisions. With assistance from the arbitrator corosync-qnetd, corosync-qdevice provides a configurable number of votes, allowing a cluster to sustain more node failures than the standard quorum rules allow. We recommend deploying corosync-qnetd and corosync-qdevice for clusters with an even number of nodes, and especially for two-node clusters.

14.1 Conceptual overview

In comparison to calculating quora among cluster nodes, the QDevice-and-QNetd approach has the following benefits:

  • It provides better sustainability in case of node failures.

  • You can write your own heuristics scripts to affect votes. This is especially useful for complex setups.

  • It enables you to configure a QNetd server to provide votes for multiple clusters.

  • It allows using diskless SBD for two-node clusters.

  • It helps with quorum decisions for clusters with an even number of nodes under split-brain situations, especially for two-node clusters.

A setup with QDevice/QNetd consists of the following components and mechanisms:

QDevice/QNetd components and mechanisms
QNetd (corosync-qnetd)

A systemd service (a daemon, the QNetd server) which is not part of the cluster. The systemd service provides a vote to the corosync-qdevice daemon.

To improve security, corosync-qnetd can work with TLS for client certificate checking.

QDevice (corosync-qdevice)

A systemd service (a daemon) on each cluster node running together with Corosync. This is the client of corosync-qnetd. Its primary use is to allow a cluster to sustain more node failures than standard quorum rules allow.

QDevice is designed to work with different arbitrators. However, currently, only QNetd is supported.

Algorithms

QDevice supports different algorithms, which determine the behavior of how votes are assigned. Currently, the following exist:

  • FFSplit (fifty-fifty split) is the default. It is used for clusters with an even number of nodes. If the cluster splits into two similar partitions, this algorithm provides one vote to one of the partitions, based on the results of heuristics checks and other factors.

  • LMS (last man standing) allows the only remaining node that can see the QNetd server to get the votes. So this algorithm is useful when a cluster with only one active node should remain quorate.

Heuristics

QDevice supports a set of commands (heuristics). The commands are executed locally on start-up of cluster services, cluster membership change, successful connection to corosync-qnetd, or, optionally, at regular times. The heuristics can be set with the quorum.device.heuristics key (in the corosync.conf file) or with the --qdevice-heuristics-mode option. Both know the values off (default), sync, and on. The difference between sync and on is that you can additionally execute the above commands regularly.

Only if all commands are executed successfully are the heuristics considered to have passed; otherwise, they failed. The heuristics' result is sent to corosync-qnetd where it is used in calculations to determine which partition should be quorate.

Tiebreaker

This is used as a fallback if the cluster partitions are completely equal even with the same heuristics results. It can be configured to be the lowest, the highest, or a specific node ID.

14.2 Requirements and prerequisites

Before setting up QDevice and QNetd, you need to prepare the environment as follows:

  • In addition to the cluster nodes, you have a separate machine which will become the QNetd server. See Section 14.3, “Setting up the QNetd server”.

  • A different physical network than the one that Corosync uses. It is recommended for QDevice to reach the QNetd server. Ideally, the QNetd server should be in a separate rack than the main cluster, or at least on a separate PSU and not in the same network segment as the Corosync ring or rings.

14.3 Setting up the QNetd server

The QNetd server is not part of the cluster stack, and it is also not a real member of your cluster. As such, you cannot move resources to this server.

The QNetd server is almost state free. Usually, you do not need to change anything in the configuration file /etc/sysconfig/corosync-qnetd. By default, the corosync-qnetd service runs the daemon as user coroqnetd in the group coroqnetd. This avoids running the daemon as root.

To create a QNetd server, proceed as follows:

  1. On the machine that will become the QNetd server, install SUSE Linux Enterprise Server 15 SP4.

  2. Enable the SUSE Linux Enterprise High Availability using the command listed in SUSEConnect --list-extensions.

  3. Install the corosync-qnetd package:

    # zypper install corosync-qnetd

    You do not need to manually start the corosync-qnetd service. It will be started automatically when you configure QDevice on the cluster.

Your QNetd server is ready to accept connections from a QDevice client corosync-qdevice. Further configuration is not needed.

14.4 Connecting QDevice clients to the QNetd server

After you set up your QNetd server, you can set up and run the clients. You can connect the clients to the QNetd server during the installation of your cluster, or you can add them later. This procedure documents how to add them later.

  1. On all nodes, install the corosync-qdevice package:

    # zypper install corosync-qdevice
  2. On one of the nodes, run the following command to configure QDevice:

    # crm cluster init qdevice
    Do you want to configure QDevice (y/n)? y
    HOST or IP of the QNetd server to be used []QNETD_SERVER
    TCP PORT of QNetd server [5403]
    QNetd decision ALGORITHM (ffsplit/lms) [ffsplit]
    QNetd TIE_BREAKER (lowest/highest/valid node id) [lowest]
    Whether using TLS on QDevice/QNetd (on/off/required) [on]
    Heuristics COMMAND to run with absolute path; For multiple commands, use ";" to separate []

    Confirm with y that you want to configure QDevice, then enter the host name or IP address of the QNetd server. For the remaining fields, you can accept the default values or change them if required.

    Important
    Important: SBD_WATCHDOG_TIMEOUT for diskless SBD and QDevice

    If you use QDevice with diskless SBD, the SBD_WATCHDOG_TIMEOUT value must be greater than QDevice's sync_timeout value, or SBD will time out and fail to start.

    The default value for sync_timeout is 30 seconds. Therefore, in the file /etc/sysconfig/sbd, make sure that SBD_WATCHDOG_TIMEOUT is set to a greater value, such as 35.

14.5 Setting up a QDevice with heuristics

If you need additional control over how votes are determined, use heuristics. Heuristics are a set of commands that are executed in parallel.

For this purpose, the command crm cluster init qdevice provides the option --qdevice-heuristics. You can pass one or more commands (separated by semicolons) with absolute paths.

For example, if your own command for heuristic checks is located at /usr/sbin/my-script.sh you can run it on one of your cluster nodes as follows:

# crm cluster init qdevice --qnetd-hostname=charlie \
     --qdevice-heuristics=/usr/sbin/my-script.sh \
     --qdevice-heuristics-mode=on

The command or commands can be written in any language such as Shell, Python, or Ruby. If they succeed, they return 0 (zero), otherwise they return an error code.

You can also pass a set of commands. Only when all commands finish successfully (return code is zero), have the heuristics passed.

The --qdevice-heuristics-mode=on option lets the heuristics commands run regularly.

14.6 Checking and showing quorum status

You can query the quorum status on one of your cluster nodes as shown in Example 14.1, “Status of QDevice”. It shows the status of your QDevice nodes.

Example 14.1: Status of QDevice
# corosync-quorumtool1
 Quorum information
------------------
Date:             ...
Quorum provider:  corosync_votequorum
Nodes:            2 2
Node ID:          3232235777 3
Ring ID:          3232235777/8
Quorate:          Yes 4

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
 3232235777         1    A,V,NMW 192.168.1.1 (local) 5
 3232235778         1    A,V,NMW 192.168.1.2 5
         0          1            Qdevice

1

As an alternative with an identical result, you can also use the crm corosync status quorum command.

2

The number of nodes we are expecting. In this example, it is a two-node cluster.

3

As the node ID is not explicitly specified in corosync.conf this ID is a 32-bit integer representation of the IP address. In this example, the value 3232235777 stands for the IP address 192.168.1.1.

4

The quorum status. In this case, the cluster has quorum.

5

The status for each cluster node means:

A (alive) or NA (not alive)

Shows the connectivity status between QDevice and Corosync. If there is a heartbeat between QDevice and Corosync, it is shown as alive (A).

V (vote) or NV (non vote)

Shows if the quorum device has given a vote (letter V) to the node. A letter V means that both nodes can communicate with each other. In a split-brain situation, one node would be set to V and the other node would be set to NV.

MW (master wins) or NMW(not master wins)

Shows if the quorum device master_wins flag is set. By default, the flag is not set, so you see NMW (not master wins). See the man page votequorum_qdevice_master_wins(3) for more information.

NR (not registered)

Shows that the cluster is not using a quorum device.

If you query the status of the QNetd server, you get a similar output to that shown in Example 14.2, “Status of QNetd server”:

Example 14.2: Status of QNetd server
# corosync-qnetd-tool -lv1
Cluster "hacluster": 2
    Algorithm:          Fifty-Fifty split 3
    Tie-breaker:        Node with lowest node ID
    Node ID 3232235777: 4
        Client address:         ::ffff:192.168.1.1:54732
        HB interval:            8000ms
        Configured node list:   3232235777, 3232235778
        Ring ID:                aa10ab0.8
        Membership node list:   3232235777, 3232235778
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 3232235778:
        Client address:         ::ffff:192.168.1.2:43016
        HB interval:            8000ms
        Configured node list:   3232235777, 3232235778
        Ring ID:                aa10ab0.8
        Membership node list:   3232235777, 3232235778
        Heuristics:             Undefined (membership: Undefined, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   No change (ACK)

1

As an alternative with an identical result, you can also use the crm corosync status qnetd command on one of the cluster nodes.

2

The name of your cluster as set in the configuration file /etc/corosync/corosync.conf in the totem.cluster_name section.

3

The algorithm currently used. In this example, it is FFSplit.

4

This is the entry for the node with the IP address 192.168.1.1. TLS is active.

14.7 For more information

For additional information about QDevice and QNetd, see the man pages of corosync-qdevice(8) and corosync-qnetd(8).

15 Access control lists

The cluster administration tools like crm shell (crmsh) or Hawk2 can be used by root or any user in the group haclient. By default, these users have full read/write access. To limit access or assign more fine-grained access rights, you can use Access control lists (ACLs).

Access control lists consist of an ordered set of access rules. Each rule allows read or write access or denies access to a part of the cluster configuration. Rules are typically combined to produce a specific role, then users may be assigned to a role that matches their tasks.

Note
Note: CIB syntax validation version and ACL differences

This ACL documentation only applies if your CIB is validated with the CIB syntax version pacemaker-2.0 or higher. For details on how to check this and upgrade the CIB version, see Note: Upgrading the CIB syntax version.

15.1 Requirements and prerequisites

Before you start using ACLs on your cluster, make sure the following conditions are fulfilled:

  • Ensure you have the same users on all nodes in your cluster, either by using NIS, Active Directory, or by manually adding the same users to all nodes.

  • All users for whom you want to modify access rights with ACLs must belong to the haclient group.

  • All users need to run crmsh by its absolute path /usr/sbin/crm.

  • If non-privileged users want to run crmsh, their PATH variable needs to be extended with /usr/sbin.

Important
Important: Default access rights
  • ACLs are an optional feature. By default, use of ACLs is disabled.

  • If ACLs are not enabled, root and all users belonging to the haclient group have full read/write access to the cluster configuration.

  • Even if ACLs are enabled and configured, both root and the default CRM owner hacluster always have full access to the cluster configuration.

15.2 Conceptual overview

Access control lists consist of an ordered set of access rules. Each rule allows read or write access or denies access to a part of the cluster configuration. Rules are typically combined to produce a specific role, then users may be assigned to a role that matches their tasks. An ACL role is a set of rules which describe access rights to CIB. A rule consists of the following:

  • an access right like read, write, or deny

  • a specification where to apply the rule. This specification can be a type, an ID reference, or an XPath expression. XPath is a language for selecting nodes in an XML document. Refer to http://en.wikipedia.org/wiki/XPath.

Usually, it is convenient to bundle ACLs into roles and assign a specific role to system users (ACL targets). There are these methods to create ACL roles:

15.3 Enabling use of ACLs in your cluster

Before you can start configuring ACLs, you need to enable use of ACLs. To do so, use the following command in the crmsh:

# crm configure property enable-acl=true

Alternatively, use Hawk2 as described in Procedure 15.1, “Enabling use of ACLs with Hawk2”.

Procedure 15.1: Enabling use of ACLs with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. In the left navigation bar, select Cluster Configuration to display the global cluster options and their current values.

  3. Below Cluster Configuration click the empty drop-down box and select enable-acl to add the parameter. It is added with its default value No.

  4. Set its value to Yes and apply your changes.

15.4 Creating a read-only monitor role

The following subsections describe how to configure read-only access by defining a monitor role either in Hawk2 or crm shell.

15.4.1 Creating a read-only monitor role with Hawk2

The following procedures show how to configure read-only access to the cluster configuration by defining a monitor role and assigning it to a user. Alternatively, you can use crmsh to do so, as described in Procedure 15.4, “Adding a monitor role and assigning a user with crmsh”.

Procedure 15.2: Adding a monitor role with Hawk2
  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. In the left navigation bar, select Roles.

  3. Click Create.

  4. Enter a unique Role ID, for example, monitor.

  5. As access Right, select Read.

  6. As Xpath, enter the XPath expression /cib.

    Image
  7. Click Create.

    This creates a new role with the name monitor, sets the read rights and applies this to all elements in the CIB by using the XPath expression/cib.

  8. If necessary, add more rules by clicking the plus icon and specifying the respective parameters.

  9. Sort the individual rules by using the arrow up or down buttons.

Procedure 15.3: Assigning a role to a target with Hawk2

To assign the role we created in Procedure 15.2 to a system user (target), proceed as follows:

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. In the left navigation bar, select Targets.

  3. To create a system user (ACL Target), click Create and enter a unique Target ID, for example, tux. Make sure this user belongs to the haclient group.

  4. To assign a role to the target, select one or multiple Roles.

    In our example, select the monitor role you created in Procedure 15.2.

    Image
  5. Confirm your choice.

To configure access rights for resources or constraints, you can also use the abbreviated syntax as explained in Section 15.8, “Setting ACL rules via abbreviations”.

15.4.2 Creating a read-only monitor role with crmsh

The following procedure shows how to configure a read-only access to the cluster configuration by defining a monitor role and assigning it to a user.

Procedure 15.4: Adding a monitor role and assigning a user with crmsh
  1. Log in as root.

  2. Start the interactive mode of crmsh:

    # crm configure
    crm(live)configure# 
  3. Define your ACL role(s):

    1. Use the role command to define a new role:

      crm(live)configure# role monitor read xpath:"/cib"

      The previous command creates a new role with the name monitor, sets the read rights and applies it to all elements in the CIB by using the XPath expression /cib. If necessary, you can add more access rights and XPath arguments.

    2. Add additional roles as needed.

  4. Assign your roles to one or multiple ACL targets, which are the corresponding system users. Make sure they belong to the haclient group.

    crm(live)configure# acl_target tux monitor
  5. Check your changes:

    crm(live)configure# show
  6. Commit your changes:

    crm(live)configure# commit

To configure access rights for resources or constraints, you can also use the abbreviated syntax as explained in Section 15.8, “Setting ACL rules via abbreviations”.

15.5 Removing a user

The following subsections describe how to remove an existing user from ACL either in Hawk2 or crmsh.

15.5.1 Removing a user with Hawk2

To remove a user from ACL, proceed as follows:

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. In the left navigation bar, select Targets.

  3. To remove a system user (ACL target), click the garbage bin icon under the Operations column.

  4. Confirm the dialog box.

15.5.2 Removing a user with crmsh

To remove a user from ACL, replace the placeholder USER with the name of the user:

# crm configure delete USERNAME

As an alternative, you can use the edit subcommand:

# crm configure edit USERNAME

15.6 Removing an existing role

The following subsections describe how to remove an existing role in either Hawk2 or crmsh.

Note
Note: Deleting roles with referenced users

Keep in mind, no user should belong to this role. If there is still a reference to a user in the role, the role cannot be deleted. Delete the references to users first, before you delete the role.

15.6.1 Removing an existing role with Hawk2

To remove a role, proceed as follows:

  1. Log in to Hawk2:

    https://HAWKSERVER:7630/
  2. In the left navigation bar, select Roles.

  3. To remove a role, click the garbage bin icon under the Operations column.

  4. Confirm the dialog box. If an error message appears, make sure your role is empty and does not reference users.

15.6.2 Removing an existing role with crmsh

To remove an existing role, replace the placeholder ROLE with the name of the role:

# crm configure delete ROLE

15.7 Setting ACL rules via XPath expressions

To manage ACL rules via XPath, you need to know the structure of the underlying XML. Retrieve the structure with the following command that shows your cluster configuration in XML (see Example 15.1):

# crm configure show xml
Example 15.1: Excerpt of a cluster configuration in XML
<cib>
  <!-- ... -->
  <configuration>
    <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
        <nvpair name="stonith-enabled" value="true" id="cib-bootstrap-options-stonith-enabled"/>
       [...]
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="175704363" uname="alice"/>
      <node id="175704619" uname="bob"/>
    </nodes>
    <resources> [...]  </resources>
    <constraints/>
    <rsc_defaults> [...] </rsc_defaults>
    <op_defaults> [...] </op_defaults>
  <configuration>
</cib>

With the XPath language you can locate nodes in this XML document. For example, to select the root node (cib) use the XPath expression /cib. To locate the global cluster configurations, use the XPath expression /cib/configuration/crm_config.

As an example, Table 15.1, “Operator role—access types and XPath expressions” shows the parameters (access type and XPath expression) to create an operator role. Users with this role can only execute the tasks mentioned in the second column—they cannot reconfigure any resources (for example, change parameters or operations), nor change the configuration of colocation or order constraints.

Table 15.1: Operator role—access types and XPath expressions

Type

XPath/Explanation

Write

//crm_config//nvpair[@name='maintenance-mode']

Turn cluster maintenance mode on or off.

Write

//op_defaults//nvpair[@name='record-pending']

Choose whether pending operations are recorded.

Write

//nodes/node//nvpair[@name='standby']

Set node in online or standby mode.

Write

//resources//nvpair[@name='target-role']

Start, stop, promote, or demote any resource.

Write

//resources//nvpair[@name='maintenance']

Select whether a resource should be put in maintenance mode or not.

Write

//constraints/rsc_location

Migrate/move resources from one node to another.

Read

/cib

View the status of the cluster.

15.8 Setting ACL rules via abbreviations

For users who do not want to deal with the XML structure, there is an easier method.

For example, consider the following XPath:

//*[@id="rsc1"]

which locates all the XML nodes with the ID rsc1.

The abbreviated syntax is written like this:

ref:"rsc1"

This also works for constraints. Here is the verbose XPath:

//constraints/rsc_location

The abbreviated syntax is written like this:

type:"rsc_location"

The abbreviated syntax can be used in crmsh and Hawk2. The CIB daemon knows how to apply the ACL rules to the matching objects.

16 Network device bonding

For many systems, it is desirable to implement network connections that comply to more than the standard data security or availability requirements of a typical Ethernet device. In these cases, several Ethernet devices can be aggregated to a single bonding device.

The configuration of the bonding device is done by means of bonding module options. The behavior is determined through the mode of the bonding device. By default, this is mode=active-backup, which means that a different device will become active if the primary device fails.

When using Corosync, the bonding device is not managed by the cluster software. Therefore, the bonding device must be configured on each cluster node that might possibly need to access the bonding device.

16.1 Configuring bonding devices with YaST

To configure a bonding device, you need to have multiple Ethernet devices that can be aggregated to a single bonding device. Proceed as follows:

  1. Start YaST as root and select System › Network Settings.

  2. In the Network Settings, switch to the Overview tab, which shows the available devices.

  3. Check if the Ethernet devices to be aggregate to a bonding device have an IP address assigned. If yes, change it:

    1. Select the device to change and click Edit.

    2. In the Address tab of the Network Card Setup dialog that opens, select the option No Link and IP Setup (Bond Ports).

    3. Click Next to return to the Overview tab in the Network Settings dialog.

  4. To add a new bonding device:

    1. Click Add and set the Device Type to Bond. Proceed with Next.

    2. Select how to assign the IP address to the bonding device. Three methods are at your disposal:

      • No Link and IP Setup (Bond Ports)

      • Dynamic Address (with DHCP or Zeroconf)

      • Statically assigned IP Address

      Use the method that is appropriate for your environment. If Corosync manages virtual IP addresses, select Statically assigned IP Address and assign an IP address to the interface.

    3. Switch to the Bond Ports tab.

    4. To select the Ethernet devices that you want to include into the bond, activate the check box in front of the respective devices.

    5. Edit the Bond Driver Options. The following modes are available:

      balance-rr

      Provides load balancing and fault tolerance, at the cost of out-of-order packet transmission. This may cause delays, for example, for TCP reassembly.

      active-backup

      Provides fault tolerance.

      balance-xor

      Provides load balancing and fault tolerance.

      broadcast

      Provides fault tolerance.

      802.3ad

      Provides dynamic link aggregation if supported by the connected switch.

      balance-tlb

      Provides load balancing for outgoing traffic.

      balance-alb

      Provides load balancing for incoming and outgoing traffic, if the network devices used allow the modifying of the network device's hardware address while in use.

    6. Make sure to add the parameter miimon=100 to Bond Driver Options. Without this parameter, the link is not checked regularly, so the bonding driver might continue to lose packets on a faulty link.

  5. Click Next and leave YaST with OK to finish the configuration of the bonding device. YaST writes the configuration to /etc/sysconfig/network/ifcfg-bondDEVICENUMBER.

16.2 Hotplugging devices into a bond

Sometimes it is necessary to replace an interface in a bond with another one, for example, if the respective network device constantly fails. The solution is to set up hotplugging. It is also necessary to change the udev rules to match the device by bus ID instead of by MAC address. This enables you to replace defective hardware (a network card in the same slot but with a different MAC address), if the hardware allows for that.

Procedure 16.1: Hotplugging devices into a bond with YaST

If you prefer manual configuration instead, refer to the SUSE Linux Enterprise Server Administration Guide, chapter Basic Networking, section Hotplugging of bond ports.

  1. Start YaST as root and select System › Network Settings.

  2. In the Network Settings, switch to the Overview tab, which shows the already configured devices. If devices are already configured in a bond, the Note column shows it.

  3. For each of the Ethernet devices that have been aggregated to a bonding device, execute the following steps:

    1. Select the device to change and click Edit. The Network Card Setup dialog opens.

    2. Switch to the General tab and make sure that Activate device is set to On Hotplug.

    3. Switch to the Hardware tab.

    4. For the Udev rules, click Change and select the BusID option.

    5. Click OK and Next to return to the Overview tab in the Network Settings dialog. If you click the Ethernet device entry now, the bottom pane shows the device's details, including the bus ID.

  4. Click OK to confirm your changes and leave the network settings.

At boot time, the network setup does not wait for the hotplug devices, but for the bond to become ready, which needs at least one available device. When one of the interfaces is removed from the system (unbind from NIC driver, rmmod of the NIC driver or true PCI hotplug removal), the Kernel removes it from the bond automatically. When a new card is added to the system (replacement of the hardware in the slot), udev renames it by applying the bus-based persistent name rule and calls ifup for it. The ifup call automatically joins it into the bond.

16.3 For more information

All modes and many options are explained in detail in the Linux Ethernet Bonding Driver HOWTO at https://www.kernel.org/doc/Documentation/networking/bonding.txt.

For High Availability setups, the following options are especially important: miimon and use_carrier.

17 Load balancing

Load Balancing makes a cluster of servers appear as one large, fast server to outside clients. This apparent single server is called a virtual server. It consists of one or more load balancers dispatching incoming requests and several real servers running the actual services. With a load balancing setup of SUSE Linux Enterprise High Availability, you can build highly scalable and highly available network services, such as Web, cache, mail, FTP, media and VoIP services.

17.1 Conceptual overview

SUSE Linux Enterprise High Availability supports two technologies for load balancing: Linux Virtual Server (LVS) and HAProxy. The key difference is Linux Virtual Server operates at OSI layer 4 (Transport), configuring the network layer of kernel, while HAProxy operates at layer 7 (Application), running in user space. Thus Linux Virtual Server needs fewer resources and can handle higher loads, while HAProxy can inspect the traffic, do SSL termination and make dispatching decisions based on the content of the traffic.

On the other hand, Linux Virtual Server includes two different software: IPVS (IP Virtual Server) and KTCPVS (Kernel TCP Virtual Server). IPVS provides layer 4 load balancing whereas KTCPVS provides layer 7 load balancing.

This section gives you a conceptual overview of load balancing in combination with high availability, then briefly introduces you to Linux Virtual Server and HAProxy. Finally, it points you to further reading.

The real servers and the load balancers may be interconnected by either high-speed LAN or by geographically dispersed WAN. The load balancers dispatch requests to the different servers. They make parallel services of the cluster appear as one virtual service on a single IP address (the virtual IP address or VIP). Request dispatching can use IP load balancing technologies or application-level load balancing technologies. Scalability of the system is achieved by transparently adding or removing nodes in the cluster.

High availability is provided by detecting node or service failures and reconfiguring the whole virtual server system appropriately, as usual.

There are several load balancing strategies. Here are some Layer 4 strategies, suitable for Linux Virtual Server:

  • Round robin.  The simplest strategy is to direct each connection to a different address, taking turns. For example, a DNS server can have several entries for a given host name. With DNS round robin, the DNS server will return all of them in a rotating order. Thus different clients will see different addresses.

  • Selecting the best server.  Although this has several drawbacks, balancing could be implemented with a first server who responds or least loaded server approach.

  • Balancing the number of connections per server.  A load balancer between users and servers can divide the number of users across multiple servers.

  • Geographical location.  It is possible to direct clients to a server nearby.

Here are some Layer 7 strategies, suitable for HAProxy:

  • URI.  Inspect the HTTP content and dispatch to a server most suitable for this specific URI.

  • URL parameter, RDP cookie.  Inspect the HTTP content for a session parameter, possibly in post parameters, or the RDP (remote desktop protocol) session cookie, and dispatch to the server serving this session.

Although there is some overlap, HAProxy can be used in scenarios where LVS/ipvsadm is not adequate and vice versa:

  • SSL termination.  The front-end load balancers can handle the SSL layer. Thus the cloud nodes do not need to have access to the SSL keys, or could take advantage of SSL accelerators in the load balancers.

  • Application level.  HAProxy operates at the application level, allowing the load balancing decisions to be influenced by the content stream. This allows for persistence based on cookies and other such filters.

On the other hand, LVS/ipvsadm cannot be fully replaced by HAProxy:

  • LVS supports direct routing, where the load balancer is only in the inbound stream, whereas the outbound traffic is routed to the clients directly. This allows for potentially much higher throughput in asymmetric environments.

  • LVS supports stateful connection table replication (via conntrackd). This allows for load balancer failover that is transparent to the client and server.

17.2 Configuring load balancing with Linux Virtual Server

The following sections give an overview of the main LVS components and concepts. Then we explain how to set up Linux Virtual Server on SUSE Linux Enterprise High Availability.

17.2.1 Director

The main component of LVS is the ip_vs (or IPVS) Kernel code. It is part of the default Kernel and implements transport-layer load balancing inside the Linux Kernel (layer-4 switching). The node that runs a Linux Kernel including the IPVS code is called director. The IPVS code running on the director is the essential feature of LVS.

When clients connect to the director, the incoming requests are load-balanced across all cluster nodes: The director forwards packets to the real servers, using a modified set of routing rules that make the LVS work. For example, connections do not originate or terminate on the director, it does not send acknowledgments. The director acts as a specialized router that forwards packets from end users to real servers (the hosts that run the applications that process the requests).

17.2.2 User space controller and daemons

The ldirectord daemon is a user space daemon for managing Linux Virtual Server and monitoring the real servers in an LVS cluster of load balanced virtual servers. A configuration file (see below) specifies the virtual services and their associated real servers and tells ldirectord how to configure the server as an LVS redirector. When the daemon is initialized, it creates the virtual services for the cluster.

By periodically requesting a known URL and checking the responses, the ldirectord daemon monitors the health of the real servers. If a real server fails, it will be removed from the list of available servers at the load balancer. When the service monitor detects that the dead server has recovered and is working again, it will add the server back to the list of available servers. In case that all real servers should be down, a fall-back server can be specified to which to redirect a Web service. Typically the fall-back server is localhost, presenting an emergency page about the Web service being temporarily unavailable.

The ldirectord uses the ipvsadm tool (package ipvsadm) to manipulate the virtual server table in the Linux Kernel.

17.2.3 Packet forwarding

There are three different methods of how the director can send packets from the client to the real servers:

Network address translation (NAT)

Incoming requests arrive at the virtual IP. They are forwarded to the real servers by changing the destination IP address and port to that of the chosen real server. The real server sends the response to the load balancer which in turn changes the destination IP address and forwards the response back to the client. Thus, the end user receives the replies from the expected source. As all traffic goes through the load balancer, it usually becomes a bottleneck for the cluster.

IP tunneling (IP-IP encapsulation)

IP tunneling enables packets addressed to an IP address to be redirected to another address, possibly on a different network. The LVS sends requests to real servers through an IP tunnel (redirecting to a different IP address) and the real servers reply directly to the client using their own routing tables. Cluster members can be in different subnets.

Direct routing

Packets from end users are forwarded directly to the real server. The IP packet is not modified, so the real servers must be configured to accept traffic for the virtual server's IP address. The response from the real server is sent directly to the client. The real servers and load balancers need to be in the same physical network segment.

17.2.4 Scheduling algorithms

Deciding which real server to use for a new connection requested by a client is implemented using different algorithms. They are available as modules and can be adapted to specific needs. For an overview of available modules, refer to the ipvsadm(8) man page. Upon receiving a connect request from a client, the director assigns a real server to the client based on a schedule. The scheduler is the part of the IPVS Kernel code which decides which real server will get the next new connection.

More detailed description about Linux Virtual Server scheduling algorithms can be found at http://kb.linuxvirtualserver.org/wiki/IPVS. Furthermore, search for --scheduler in the ipvsadm man page.

Related load balancing strategies for HAProxy can be found at http://www.haproxy.org/download/1.6/doc/configuration.txt.

17.2.5 Setting up IP load balancing with YaST

You can configure Kernel-based IP load balancing with the YaST IP Load Balancing module. It is a front-end for ldirectord.

To access the IP Load Balancing dialog, start YaST as root and select High Availability › IP Load Balancing. Alternatively, start the YaST cluster module as root on a command line with yast2 iplb.

The default installation does not include the configuration file /etc/ha.d/ldirectord.cf. This file is created by the YaST module. The tabs available in the YaST module correspond to the structure of the /etc/ha.d/ldirectord.cf configuration file, defining global options and defining the options for the virtual services.

For an example configuration and the resulting processes between load balancers and real servers, refer to Example 17.1, “Simple ldirectord configuration”.

Note
Note: Global parameters and virtual server parameters

If a certain parameter is specified in both the virtual server section and in the global section, the value defined in the virtual server section overrides the value defined in the global section.

Procedure 17.1: Configuring global parameters

The following procedure describes how to configure the most important global parameters. For more details about the individual parameters (and the parameters not covered here), click Help or refer to the ldirectord man page.

  1. With Check Interval, define the interval in which ldirectord will connect to each of the real servers to check if they are still online.

  2. With Check Timeout, set the time in which the real server should have responded after the last check.

  3. With Failure Count you can define how many times ldirectord will attempt to request the real servers until the check is considered failed.

  4. With Negotiate Timeout define a timeout in seconds for negotiate checks.

  5. In Fallback, enter the host name or IP address of the Web server onto which to redirect a Web service in case all real servers are down.

  6. If you want the system to send alerts in case the connection status to any real server changes, enter a valid e-mail address in Email Alert.

  7. With Email Alert Frequency, define after how many seconds the e-mail alert should be repeated if any of the real servers remains inaccessible.

  8. In Email Alert Status specify the server states for which e-mail alerts should be sent. If you want to define more than one state, use a comma-separated list.

  9. With Auto Reload define, if ldirectord should continuously monitor the configuration file for modification. If set to yes, the configuration is automatically reloaded upon changes.

  10. With the Quiescent switch, define whether to remove failed real servers from the Kernel's LVS table or not. If set to Yes, failed servers are not removed. Instead their weight is set to 0 which means that no new connections will be accepted. Already established connections will persist until they time out.

  11. To use an alternative path for logging, specify a path for the log files in Log File. By default, ldirectord writes its log files to /var/log/ldirectord.log.

YaST IP load balancing—global parameters
Figure 17.1: YaST IP load balancing—global parameters
Procedure 17.2: Configuring virtual services

You can configure one or more virtual services by defining a couple of parameters for each. The following procedure describes how to configure the most important parameters for a virtual service. For more details about the individual parameters (and the parameters not covered here), click Help or refer to the ldirectord man page.

  1. In the YaST IP Load Balancing module, switch to the Virtual Server Configuration tab.

  2. Add a new virtual server or Edit an existing virtual server. A new dialog shows the available options.

  3. In Virtual Server enter the shared virtual IP address (IPv4 or IPv6) and port under which the load balancers and the real servers are accessible as LVS. Instead of IP address and port number you can also specify a host name and a service. Alternatively, you can also use a firewall mark. A firewall mark is a way of aggregating an arbitrary collection of VIP:port services into one virtual service.

  4. To specify the Real Servers, you need to enter the IP addresses (IPv4, IPv6, or host names) of the servers, the ports (or service names) and the forwarding method. The forwarding method must either be gate, ipip or masq, see Section 17.2.3, “Packet forwarding”.

    Click the Add button and enter the required arguments for each real server.

  5. As Check Type, select the type of check that should be performed to test if the real servers are still alive. For example, to send a request and check if the response contains an expected string, select Negotiate.

  6. If you have set the Check Type to Negotiate, you also need to define the type of service to monitor. Select it from the Service drop-down box.

  7. In Request, enter the URI to the object that is requested on each real server during the check intervals.

  8. If you want to check if the response from the real servers contains a certain string (I'm alive message), define a regular expression that needs to be matched. Enter the regular expression into Receive. If the response from a real server contains this expression, the real server is considered to be alive.

  9. Depending on the type of Service you have selected in Step 6, you also need to specify further parameters for authentication. Switch to the Auth type tab and enter the details like Login, Password, Database, or Secret. For more information, refer to the YaST help text or to the ldirectord man page.

  10. Switch to the Others tab.

  11. Select the Scheduler to be used for load balancing. For information on the available schedulers, refer to the ipvsadm(8) man page.

  12. Select the Protocol to be used. If the virtual service is specified as an IP address and port, it must be either tcp or udp. If the virtual service is specified as a firewall mark, the protocol must be fwm.

  13. Define further parameters, if needed. Confirm your configuration with OK. YaST writes the configuration to /etc/ha.d/ldirectord.cf.

YaST IP load balancing—virtual services
Figure 17.2: YaST IP load balancing—virtual services
Example 17.1: Simple ldirectord configuration

The values shown in Figure 17.1, “YaST IP load balancing—global parameters” and Figure 17.2, “YaST IP load balancing—virtual services”, would lead to the following configuration, defined in /etc/ha.d/ldirectord.cf:

autoreload = yes 1
    checkinterval = 5 2
    checktimeout = 3 3
    quiescent = yes 4
    virtual = 192.168.0.200:80 5
    checktype = negotiate 6
    fallback = 127.0.0.1:80 7
    protocol = tcp 8
    real = 192.168.0.110:80 gate 9
    real = 192.168.0.120:80 gate 9
    receive = "still alive" 10
    request = "test.html" 11
    scheduler = wlc 12
    service = http 13

1

Defines that ldirectord should continuously check the configuration file for modification.

2

Interval in which ldirectord will connect to each of the real servers to check if they are still online.

3

Time in which the real server should have responded after the last check.

4

Defines not to remove failed real servers from the Kernel's LVS table, but to set their weight to 0 instead.

5

Virtual IP address (VIP) of the LVS. The LVS is available at port 80.

6

Type of check that should be performed to test if the real servers are still alive.

7

Server onto which to redirect a Web service all real servers for this service are down.

8

Protocol to be used.

9

Two real servers defined, both available at port 80. The packet forwarding method is gate, meaning that direct routing is used.

10

Regular expression that needs to be matched in the response string from the real server.

11

URI to the object that is requested on each real server during the check intervals.

12

Selected scheduler to be used for load balancing.

13

Type of service to monitor.

This configuration would lead to the following process flow: The ldirectord will connect to each real server once every 5 seconds (2) and request 192.168.0.110:80/test.html or 192.168.0.120:80/test.html as specified in 9 and 11. If it does not receive the expected still alive string (10) from a real server within 3 seconds (3) of the last check, it will remove the real server from the available pool. However, because of the quiescent=yes setting (4), the real server will not be removed from the LVS table. Instead its weight will be set to 0 so that no new connections to this real server will be accepted. Already established connections will be persistent until they time out.

17.2.6 Further setup

Apart from the configuration of ldirectord with YaST, you need to make sure the following conditions are fulfilled to complete the LVS setup:

  • The real servers are set up correctly to provide the needed services.

  • The load balancing server (or servers) must be able to route traffic to the real servers using IP forwarding. The network configuration of the real servers depends on which packet forwarding method you have chosen.

  • To prevent the load balancing server (or servers) from becoming a single point of failure for the whole system, you need to set up one or several backups of the load balancer. In the cluster configuration, configure a primitive resource for ldirectord, so that ldirectord can fail over to other servers in case of hardware failure.

  • As the backup of the load balancer also needs the ldirectord configuration file to fulfill its task, make sure the /etc/ha.d/ldirectord.cf is available on all servers that you want to use as backup for the load balancer. You can synchronize the configuration file with Csync2 as described in Section 4.7, “Transferring the configuration to all nodes”.

17.3 Configuring load balancing with HAProxy

The following section gives an overview of the HAProxy and how to set up on High Availability. The load balancer distributes all requests to its back-end servers. It is configured as active/passive, meaning if one server fails, the passive server becomes active. In such a scenario, the user will not notice any interruption.

In this section, we will use the following setup:

  • A load balancer, with the IP address 192.168.1.99.

  • A virtual, floating IP address 192.168.1.99.

  • Our servers (usually for Web content) www.example1.com (IP: 192.168.1.200) and www.example2.com (IP: 192.168.1.201)

To configure HAProxy, use the following procedure:

  1. Install the haproxy package.

  2. Create the file /etc/haproxy/haproxy.cfg with the following contents:

    global 1
      maxconn 256
      daemon
    
    defaults 2
      log     global
      mode    http
      option  httplog
      option  dontlognull
      retries 3
      option redispatch
      maxconn 2000
      timeout connect   5000  3
      timeout client    50s   4
      timeout server    50000 5
    
    frontend LB
      bind 192.168.1.99:80 6
      reqadd X-Forwarded-Proto:\ http
      default_backend LB
    
    backend LB
      mode http
      stats enable
      stats hide-version
      stats uri /stats
      stats realm Haproxy\ Statistics
      stats auth haproxy:password	7
      balance roundrobin	8
      option  httpclose
      option forwardfor
      cookie LB insert
      option httpchk GET /robots.txt HTTP/1.0
      server web1-srv 192.168.1.200:80 cookie web1-srv check
      server web2-srv 192.168.1.201:80 cookie web2-srv check

    1

    Section which contains process-wide and OS-specific options.

    maxconn

    Maximum per-process number of concurrent connections.

    daemon

    Recommended mode, HAProxy runs in the background.

    2

    Section which sets default parameters for all other sections following its declaration. Some important lines:

    redispatch

    Enables or disables session redistribution in case of connection failure.

    log

    Enables logging of events and traffic.

    mode http

    Operates in HTTP mode (recommended mode for HAProxy). In this mode, a request will be analyzed before a connection to any server is performed. Request that are not RFC-compliant will be rejected.

    option forwardfor

    Adds the HTTP X-Forwarded-For header into the request. You need this option if you want to preserve the client's IP address.

    3

    The maximum time to wait for a connection attempt to a server to succeed.

    4

    The maximum time of inactivity on the client side.

    5

    The maximum time of inactivity on the server side.

    6

    Section which combines front-end and back-end sections in one.

    balance leastconn

    Defines the load balancing algorithm, see http://cbonte.github.io/haproxy-dconv/configuration-1.5.html#4-balance.

    stats enable , stats auth

    Enables statistics reporting (by stats enable). The auth option logs statistics with authentication to a specific account.

    7

    Credentials for HAProxy Statistic report page.

    8

    Load balancing will work in a round-robin process.

  3. Test your configuration file:

    # haproxy -f /etc/haproxy/haproxy.cfg -c
  4. Add the following line to Csync2's configuration file /etc/csync2/csync2.cfg to make sure the HAProxy configuration file is included:

    include /etc/haproxy/haproxy.cfg
  5. Synchronize it:

    # csync2 -f /etc/haproxy/haproxy.cfg
    # csync2 -xv
    Note
    Note

    The Csync2 configuration part assumes that the HA nodes were configured using the bootstrap scripts provided by the crm shell. For details, see the Installation and Setup Quick Start.

  6. Make sure HAProxy is disabled on both load balancers (alice and bob) as it is started by Pacemaker:

    # systemctl disable haproxy
  7. Configure a new CIB:

    # crm
    crm(live)# cib new haproxy-config
    crm(haproxy-config)# primitive haproxy systemd:haproxy \
        op start timeout=120 interval=0 \
        op stop timeout=120 interval=0 \
        op monitor timeout=100 interval=5s \
        meta target-role=Started
    crm(haproxy-config)# primitive vip IPaddr2 \
        params ip=192.168.1.99 nic=eth0 cidr_netmask=23 broadcast=192.168.1.255 \
        op monitor interval=5s timeout=120 on-fail=restart
    crm(haproxy-config)# group g-haproxy vip haproxy
  8. Verify the new CIB and correct any errors:

    crm(haproxy-config)# verify
  9. Commit the new CIB:

    crm(haproxy-config)# cib use live
    crm(live)# cib commit haproxy-config

17.4 For more information

18 High Availability for virtualization

This chapter explains how to configure virtual machines as highly available cluster resources.

18.1 Overview

Virtual machines can take different roles in a High Availability cluster:

  • A virtual machine can be managed by the cluster as a resource, without the cluster managing the services that run on the virtual machine. In this case, the VM is opaque to the cluster. This is the scenario described in this document.

  • A virtual machine can be a cluster resource and run pacemaker_remote, which allows the cluster to manage services running on the virtual machine. In this case, the VM is a guest node and is transparent to the cluster. For this scenario, see Article “Pacemaker Remote Quick Start”, Section 4 “Use case 2: setting up a cluster with guest nodes”.

  • A virtual machine can run a full cluster stack. In this case, the VM is a regular cluster node and is not managed by the cluster as a resource. For this scenario, see Article “Installation and Setup Quick Start”.

The following procedures describe how to set up highly available virtual machines on block storage, with another block device used as an OCFS2 volume to store the VM lock files and XML configuration files. The virtual machines and the OCFS2 volume are configured as resources managed by the cluster, with resource constraints to ensure that the lock file directory is always available before a virtual machine starts on any node. This prevents the virtual machines from starting on multiple nodes.

18.2 Requirements

  • A running High Availability cluster with at least two nodes and a fencing device such as SBD.

  • Passwordless root SSH login between the cluster nodes.

  • A network bridge on each cluster node, to be used for installing and running the VMs. This must be separate from the network used for cluster communication and management.

  • Two or more shared storage devices (or partitions on a single shared device), so that all cluster nodes can access the files and storage required by the VMs:

    • A device to use as an OCFS2 volume, which will store the VM lock files and XML configuration files. Creating and mounting the OCFS2 volume is explained in the following procedure.

    • A device containing the VM installation source (such as an ISO file or disk image).

    • Depending on the installation source, you might also need another device for the VM storage disks.

    To avoid I/O starvation, these devices must be separate from the shared device used for SBD.

  • Stable device names for all storage paths, for example, /dev/disk/by-id/DEVICE_ID. A shared storage device might have mismatched /dev/sdX names on different nodes, which will cause VM migration to fail.

18.3 Configuring cluster resources to manage the lock files

Use this procedure to configure the cluster to manage the virtual machine lock files. The lock file directory must be available on all nodes so that the cluster is aware of the lock files no matter which node the VMs are running on.

You only need to run the following commands on one of the cluster nodes.

Procedure 18.1: Configuring cluster resources to manage the lock files
  1. Create an OCFS2 volume on one of the shared storage devices:

    # mkfs.ocfs2 /dev/disk/by-id/DEVICE_ID
  2. Run crm configure to start the crm interactive shell.

  3. Create a primitive resource for DLM:

    crm(live)configure# primitive dlm ocf:pacemaker:controld \
      op monitor interval=60 timeout=60
  4. Create a primitive resource for the OCFS2 volume:

    crm(live)configure# primitive ocfs2 Filesystem \
      params device="/dev/disk/by-id/DEVICE_ID" directory="/mnt/shared" fstype=ocfs2 \
      op monitor interval=20 timeout=40
  5. Create a group for the DLM and OCFS2 resources:

    crm(live)configure# group g-virt-lock dlm ocfs2
  6. Clone the group so that it runs on all nodes:

    crm(live)configure# clone cl-virt-lock g-virt-lock \
      meta interleave=true
  7. Review your changes with show.

  8. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

  9. Check the status of the group clone. It should be running on all nodes:

    # crm status
    [...]
    Full List of Resources:
    [...]
      * Clone Set: cl-virt-lock [g-virt-lock]:
        * Started: [ alice bob ]

18.4 Preparing the cluster nodes to host virtual machines

Use this procedure to install and start the required virtualization services, and to configure the nodes to store the VM lock files on the shared OCFS2 volume.

This procedure uses crm cluster run to run commands on all nodes at once. If you prefer to manage each node individually, you can omit the crm cluster run portion of the commands.

Procedure 18.2: Preparing the cluster nodes to host virtual machines
  1. Install the virtualization packages on all nodes in the cluster:

    # crm cluster run "zypper install -y -t pattern kvm_server kvm_tools"
  2. On one node, find and enable the lock_manager setting in the file /etc/libvirt/qemu.conf:

    lock_manager = "lockd"
  3. On the same node, find and enable the file_lockspace_dir setting in the file /etc/libvirt/qemu-lockd.conf, and change the value to point to a directory on the OCFS2 volume:

    file_lockspace_dir = "/mnt/shared/lockd"
  4. Copy these files to the other nodes in the cluster:

    # crm cluster copy /etc/libvirt/qemu.conf
    # crm cluster copy /etc/libvirt/qemu-lockd.conf
  5. Enable and start the libvirtd service on all nodes in the cluster:

    # crm cluster run "systemctl enable --now libvirtd"

    This also starts the virtlockd service.

18.5 Adding virtual machines as cluster resources

Use this procedure to add virtual machines to the cluster as cluster resources, with resource constraints to ensure the VMs can always access the lock files. The lock files are managed by the resources in the group g-virt-lock, which is available on all nodes via the clone cl-virt-lock.

Procedure 18.3: Adding virtual machines as cluster resources
  1. Install your virtual machines on one of the cluster nodes, with the following restrictions:

    • The installation source and storage must be on shared devices.

    • Do not configure the VMs to start on host boot.

    For more information, see Virtualization Guide for SUSE Linux Enterprise Server.

  2. If the virtual machines are running, shut them down. The cluster will start the VMs after you add them as resources.

  3. Dump the XML configuration to the OCFS2 volume. Repeat this step for each VM:

    # virsh dumpxml VM1 > /mnt/shared/VM1.xml
    Important

    Make sure the XML files do not contain any references to unshared local paths.

  4. Run crm configure to start the crm interactive shell.

  5. Create primitive resources to manage the virtual machines. Repeat this step for each VM:

    crm(live)configure# primitive VM1 VirtualDomain \
      params config="/mnt/shared/VM1.xml" remoteuri="qemu+ssh://%n/system" \
      meta allow-migrate=true \
      op monitor timeout=30s interval=10s

    The option allow-migrate=true enables live migration. If the value is set to false, the cluster migrates the VM by shutting it down on one node and restarting it on another node.

    If you need to set utilization attributes to help place VMs based on their load impact, see Section 7.10, “Placing resources based on their load impact”.

  6. Create a colocation constraint so that the virtual machines can only start on nodes where cl-virt-lock is running:

    crm(live)configure# colocation col-fs-virt inf: ( VM1 VM2 VMX ) cl-virt-lock
  7. Create an ordering constraint so that cl-virt-lock always starts before the virtual machines:

    crm(live)configure# order o-fs-virt Mandatory: cl-virt-lock ( VM1 VM2 VMX )
  8. Review your changes with show.

  9. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

  10. Check the status of the virtual machines:

    # crm status
    [...]
    Full List of Resources:
    [...]
      * Clone Set: cl-virt-lock [g-virt-lock]:
        * Started: [ alice bob ]
      * VM1 (ocf::heartbeat:VirtualDomain): Started alice
      * VM2 (ocf::heartbeat:VirtualDomain): Started alice
      * VMX (ocf::heartbeat:VirtualDomain): Started alice

The virtual machines are now managed by the High Availability cluster, and can migrate between the cluster nodes.

Important
Important: Do not manually start or stop cluster-managed VMs

After adding virtual machines as cluster resources, do not manage them manually. Only use the cluster tools as described in Chapter 8, Managing cluster resources.

To perform maintenance tasks on cluster-managed VMs, see Section 28.2, “Different options for maintenance tasks”.

18.6 Testing the setup

Use the following tests to confirm that the virtual machine High Availability setup works as expected.

Important

Perform these tests in a test environment, not a production environment.

Procedure 18.4: Verifying that the VM resource is protected across cluster nodes
  1. The virtual machine VM1 is running on node alice.

  2. On node bob, try to start the VM manually with virsh start VM1.

  3. Expected result: The virsh command fails. VM1 cannot be started manually on bob when it is running on alice.

Procedure 18.5: Verifying that the VM resource can live migrate between cluster nodes
  1. The virtual machine VM1 is running on node alice.

  2. Open two terminals.

  3. In the first terminal, connect to VM1 via SSH.

  4. In the second terminal, try to migrate VM1 to node bob with crm resource move VM1 bob.

  5. Run crm_mon -r to monitor the cluster status until it stabilizes. This might take a short time.

  6. In the first terminal, check whether the SSH connection to VM1 is still active.

  7. Expected result: The cluster status shows that VM1 has started on bob. The SSH connection to VM1 remains active during the whole migration.

Procedure 18.6: Verifying that the VM resource can migrate to another node when the current node reboots
  1. The virtual machine VM1 is running on node bob.

  2. Reboot bob.

  3. On node alice, run crm_mon -r to monitor the cluster status until it stabilizes. This might take a short time.

  4. Expected result: The cluster status shows that VM1 has started on alice.

Procedure 18.7: Verifying that the VM resource can fail over to another node when the current node crashes
  1. The virtual machine VM1 is running on node alice.

  2. Simulate a crash on alice by forcing the machine off or unplugging the power cable.

  3. On node bob, run crm_mon -r to monitor the cluster status until it stabilizes. VM failover after a node crashes usually takes longer than VM migration after a node reboots.

  4. Expected result: After a short time, the cluster status shows that VM1 has started on bob.

19 Geo clusters (multi-site clusters)

Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High Availability 15 SP4 also supports geographically dispersed clusters (Geo clusters, sometimes also called multi-site clusters). That means you can have multiple, geographically dispersed sites with a local cluster each. Failover between these clusters is coordinated by a higher level entity, the so-called booth. For details on how to use and set up Geo clusters, refer to Article “Geo Clustering Quick Start” and Book “Geo Clustering Guide”.

Part III Storage and data replication

  • 20 Distributed Lock Manager (DLM)
  • The Distributed Lock Manager (DLM) in the kernel is the base component used by OCFS2, GFS2, Cluster MD, and Cluster LVM (lvmlockd) to provide active-active storage at each respective layer.

  • 21 OCFS2
  • Oracle Cluster File System 2 (OCFS2) is a general-purpose journaling file system that has been fully integrated since the Linux 2.6 Kernel. OCFS2 allows you to store application binary files, data files, and databases on devices on shared storage. All nodes in a cluster have concurrent read and write access to the file system. A user space control daemon, managed via a clone resource, provides the integration with the HA stack, in particular with Corosync and the Distributed Lock Manager (DLM).

  • 22 GFS2
  • Global File System 2 or GFS2 is a shared disk file system for Linux computer clusters. GFS2 allows all nodes to have direct concurrent access to the same shared block storage. GFS2 has no disconnected operating-mode, and no client or server roles. All nodes in a GFS2 cluster function as peers. GFS2 supports up to 32 cluster nodes. Using GFS2 in a cluster requires hardware to allow access to the shared storage, and a lock manager to control access to the storage.

    SUSE recommends OCFS2 over GFS2 for your cluster environments if performance is one of your major requirements. Our tests have revealed that OCFS2 performs better as compared to GFS2 in such settings.

  • 23 DRBD
  • The distributed replicated block device (DRBD*) allows you to create a mirror of two block devices that are located at two different sites across an IP network. When used with Corosync, DRBD supports distributed high-availability Linux clusters. This chapter shows you how to install and set up DRBD.

  • 24 Cluster logical volume manager (Cluster LVM)
  • When managing shared storage on a cluster, every node must be informed about changes to the storage subsystem. Logical Volume Manager (LVM) supports transparent management of volume groups across the whole cluster. Volume groups shared among multiple hosts can be managed using the same commands as local storage.

  • 25 Cluster multi-device (Cluster MD)
  • The cluster multi-device (Cluster MD) is a software based RAID storage solution for a cluster. Currently, Cluster MD provides the redundancy of RAID1 mirroring to the cluster. With SUSE Linux Enterprise High Availability 15 SP4, RAID10 is included as a technology preview. If you want to try RAID10, replace mirror with 10 in the related mdadm command. This chapter shows you how to create and use Cluster MD.

  • 26 Samba clustering
  • A clustered Samba server provides a High Availability solution in your heterogeneous networks. This chapter explains some background information and how to set up a clustered Samba server.

  • 27 Disaster recovery with ReaR (Relax-and-Recover)
  • Relax-and-Recover (ReaR) is a disaster recovery framework for use by system administrators. It is a collection of Bash scripts that need to be adjusted to the specific production environment that is to be protected in case of disaster.

    No disaster recovery solution will work out-of-the-box. Therefore it is essential to take preparations before any disaster happens.

20 Distributed Lock Manager (DLM)

The Distributed Lock Manager (DLM) in the kernel is the base component used by OCFS2, GFS2, Cluster MD, and Cluster LVM (lvmlockd) to provide active-active storage at each respective layer.

20.1 Protocols for DLM communication

To avoid single points of failure, redundant communication paths are important for High Availability clusters. This is also true for DLM communication. If network bonding (Link Aggregation Control Protocol, LACP) cannot be used for any reason, we highly recommend defining a redundant communication channel (a second ring) in Corosync. For details, see Procedure 4.3, “Defining a redundant communication channel”.

DLM communicates through port 21064 using either the TCP or SCTP protocol, depending on the configuration in /etc/corosync/corosync.conf:

  • If rrp_mode is set to none (which means redundant ring configuration is disabled), DLM automatically uses TCP. However, without a redundant communication channel, DLM communication will fail if the TCP link is down.

  • If rrp_mode is set to passive (which is the typical setting), and a second communication ring in /etc/corosync/corosync.conf is configured correctly, DLM automatically uses SCTP. In this case, DLM messaging has the redundancy capability provided by SCTP.

20.2 Configuring DLM cluster resources

DLM uses the cluster membership services from Pacemaker which run in user space. Therefore, DLM needs to be configured as a clone resource that is present on each node in the cluster.

Note
Note: DLM resource for several solutions

As OCFS2, GFS2, Cluster MD, and Cluster LVM (lvmlockd) all use DLM, it is enough to configure one resource for DLM. As the DLM resource runs on all nodes in the cluster it is configured as a clone resource.

If you have a setup that includes both OCFS2 and Cluster LVM, configuring one DLM resource for both OCFS2 and Cluster LVM is enough. In this case, configure DLM using Procedure 20.1, “Configuring a base group for DLM”.

However, if you need to keep the resources that use DLM independent from one another (such as multiple OCFS2 mount points), use separate colocation and order constraints instead of a group. In this case, configure DLM using Procedure 20.2, “Configuring an independent DLM resource”.

Procedure 20.1: Configuring a base group for DLM

This configuration consists of a base group that includes several primitives and a base clone. Both base group and base clone can be used in various scenarios afterward (for both OCFS2 and Cluster LVM, for example). You only need to extend the base group with the respective primitives as needed. As the base group has internal colocation and ordering, this simplifies the overall setup as you do not need to specify several individual groups, clones and their dependencies.

  1. Log in to a node as root or equivalent.

  2. Run crm configure.

  3. Create the primitive resource for DLM:

    crm(live)configure# primitive dlm ocf:pacemaker:controld \
      op monitor interval="60" timeout="60"
  4. Create a base group for the dlm resource and further storage-related resources:

    crm(live)configure# group g-storage dlm
  5. Clone the g-storage group so that it runs on all nodes:

    crm(live)configure#  clone cl-storage g-storage \
      meta interleave=true target-role=Started
  6. Review your changes with show.

  7. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

Note
Note: Failure when disabling STONITH

Clusters without STONITH are not supported. If you set the global cluster option stonith-enabled to false for testing or troubleshooting purposes, the DLM resource and all services depending on it (such as Cluster LVM, GFS2, and OCFS2) will fail to start.

Procedure 20.2: Configuring an independent DLM resource

This configuration consists of a primitive and a clone, but no group. By adding colocation and order constraints, you can avoid introducing dependencies between multiple resources that use DLM (such as multiple OCFS2 mount points).

  1. Log in to a node as root or equivalent.

  2. Run crm configure.

  3. Create the primitive resource for DLM:

    crm(live)configure# primitive dlm ocf:pacemaker:controld \
      op start timeout=90 interval=0 \
      op stop timeout=100 interval=0 \
      op monitor interval=60 timeout=60
  4. Clone the dlm resource so that it runs on all nodes:

    crm(live)configure#  clone cl-dlm dlm meta interleave=true
  5. Review your changes with show.

  6. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

21 OCFS2

Oracle Cluster File System 2 (OCFS2) is a general-purpose journaling file system that has been fully integrated since the Linux 2.6 Kernel. OCFS2 allows you to store application binary files, data files, and databases on devices on shared storage. All nodes in a cluster have concurrent read and write access to the file system. A user space control daemon, managed via a clone resource, provides the integration with the HA stack, in particular with Corosync and the Distributed Lock Manager (DLM).

21.1 Features and benefits

OCFS2 can be used for the following storage solutions for example:

  • General applications and workloads.

  • Xen image store in a cluster. Xen virtual machines and virtual servers can be stored on OCFS2 volumes that are mounted by cluster servers. This provides quick and easy portability of Xen virtual machines between servers.

  • LAMP (Linux, Apache, MySQL, and PHP | Perl | Python) stacks.

As a high-performance, symmetric and parallel cluster file system, OCFS2 supports the following functions:

  • An application's files are available to all nodes in the cluster. Users simply install it once on an OCFS2 volume in the cluster.

  • All nodes can concurrently read and write directly to storage via the standard file system interface, enabling easy management of applications that run across the cluster.

  • File access is coordinated through DLM. DLM control is good for most cases, but an application's design might limit scalability if it contends with the DLM to coordinate file access.

  • Storage backup functionality is available on all back-end storage. An image of the shared application files can be easily created, which can help provide effective disaster recovery.

OCFS2 also provides the following capabilities:

  • Metadata caching.

  • Metadata journaling.

  • Cross-node file data consistency.

  • Support for multiple-block sizes up to 4 KB, cluster sizes up to 1 MB, for a maximum volume size of 4 PB (Petabyte).

  • Support for up to 32 cluster nodes.

  • Asynchronous and direct I/O support for database files for improved database performance.

Note
Note: Support for OCFS2

OCFS2 is only supported by SUSE when used with the pcmk (Pacemaker) stack, as provided by SUSE Linux Enterprise High Availability. SUSE does not provide support for OCFS2 in combination with the o2cb stack.

21.2 OCFS2 packages and management utilities

The OCFS2 Kernel module (ocfs2) is installed automatically in SUSE Linux Enterprise High Availability 15 SP4. To use OCFS2, make sure the following packages are installed on each node in the cluster: ocfs2-tools and the matching ocfs2-kmp-* packages for your Kernel.

The ocfs2-tools package provides the following utilities for management of OCFS2 volumes. For syntax information, see their man pages.

Table 21.1: OCFS2 utilities

OCFS2 Utility

Description

debugfs.ocfs2

Examines the state of the OCFS2 file system for debugging.

defragfs.ocfs2Reduces fragmentation of the OCFS2 file system.

fsck.ocfs2

Checks the file system for errors and optionally repairs errors.

mkfs.ocfs2

Creates an OCFS2 file system on a device, usually a partition on a shared physical or logical disk.

mounted.ocfs2

Detects and lists all OCFS2 volumes on a clustered system. Detects and lists all nodes on the system that have mounted an OCFS2 device or lists all OCFS2 devices.

tunefs.ocfs2

Changes OCFS2 file system parameters, including the volume label, number of node slots, journal size for all node slots, and volume size.

21.3 Configuring OCFS2 services and a STONITH resource

Before you can create OCFS2 volumes, you must configure the following resources as services in the cluster: DLM and a STONITH resource.

The following procedure uses the crm shell to configure the cluster resources. Alternatively, you can also use Hawk2 to configure the resources as described in Section 21.6, “Configuring OCFS2 resources with Hawk2”.

Procedure 21.1: Configuring a STONITH resource
Note
Note: STONITH device needed

You need to configure a fencing device. Without a STONITH mechanism (like external/sbd) in place the configuration will fail.

  1. Start a shell and log in as root or equivalent.

  2. Create an SBD partition as described in Procedure 13.3, “Initializing the SBD devices”.

  3. Run crm configure.

  4. Configure external/sbd as the fencing device:

    crm(live)configure# primitive sbd_stonith stonith:external/sbd \
      params pcmk_delay_max=30 meta target-role="Started"
  5. Review your changes with show.

  6. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

For details on configuring the resource for DLM, see Section 20.2, “Configuring DLM cluster resources”.

21.4 Creating OCFS2 volumes

After you have configured a DLM cluster resource as described in Section 21.3, “Configuring OCFS2 services and a STONITH resource”, configure your system to use OCFS2 and create OCFs2 volumes.

Note
Note: OCFS2 volumes for application and data files

We recommend that you generally store application files and data files on different OCFS2 volumes. If your application volumes and data volumes have different requirements for mounting, it is mandatory to store them on different volumes.

Before you begin, prepare the block devices you plan to use for your OCFS2 volumes. Leave the devices as free space.

Then create and format the OCFS2 volume with the mkfs.ocfs2 as described in Procedure 21.2, “Creating and formatting an OCFS2 volume”. The most important parameters for the command are listed in Table 21.2, “Important OCFS2 parameters”. For more information and the command syntax, refer to the mkfs.ocfs2 man page.

Table 21.2: Important OCFS2 parameters

OCFS2 Parameter

Description and Recommendation

Volume Label (-L)

A descriptive name for the volume to make it uniquely identifiable when it is mounted on different nodes. Use the tunefs.ocfs2 utility to modify the label as needed.

Cluster Size (-C)

Cluster size is the smallest unit of space allocated to a file to hold the data. For the available options and recommendations, refer to the mkfs.ocfs2 man page.

Number of Node Slots (-N)

The maximum number of nodes that can concurrently mount a volume. For each of the nodes, OCFS2 creates separate system files, such as the journals. Nodes that access the volume can be a combination of little-endian architectures (such as AMD64/Intel 64) and big-endian architectures (such as S/390x).

Node-specific files are called local files. A node slot number is appended to the local file. For example: journal:0000 belongs to whatever node is assigned to slot number 0.

Set each volume's maximum number of node slots when you create it, according to how many nodes that you expect to concurrently mount the volume. Use the tunefs.ocfs2 utility to increase the number of node slots as needed. Note that the value cannot be decreased, and one node slot will consume about 100 MiB of disk space.

In case the -N parameter is not specified, the number of node slots is decided based on the size of the file system. For the default value, refer to the mkfs.ocfs2 man page.

Block Size (-b)

The smallest unit of space addressable by the file system. Specify the block size when you create the volume. For the available options and recommendations, refer to the mkfs.ocfs2 man page.

Specific Features On/Off (--fs-features)

A comma separated list of feature flags can be provided, and mkfs.ocfs2 will try to create the file system with those features set according to the list. To turn a feature on, include it in the list. To turn a feature off, prepend no to the name.

For an overview of all available flags, refer to the mkfs.ocfs2 man page.

Pre-Defined Features (--fs-feature-level)

Allows you to choose from a set of pre-determined file system features. For the available options, refer to the mkfs.ocfs2 man page.

If you do not specify any features when creating and formatting the volume with mkfs.ocfs2, the following features are enabled by default: backup-super, sparse, inline-data, unwritten, metaecc, indexed-dirs, and xattr.

Procedure 21.2: Creating and formatting an OCFS2 volume

Execute the following steps only on one of the cluster nodes.

  1. Open a terminal window and log in as root.

  2. Check if the cluster is online with the command crm status.

  3. Create and format the volume using the mkfs.ocfs2 utility. For information about the syntax for this command, refer to the mkfs.ocfs2 man page.

    For example, to create a new OCFS2 file system that supports up to 32 cluster nodes, enter the following command:

    # mkfs.ocfs2 -N 32 /dev/disk/by-id/DEVICE_ID

    Always use a stable device name (for example: /dev/disk/by-id/scsi-ST2000DM001-0123456_Wabcdefg).

21.5 Mounting OCFS2 volumes

You can either mount an OCFS2 volume manually or with the cluster manager, as described in Procedure 21.4, “Mounting an OCFS2 volume with the cluster resource manager”.

To mount multiple OCFS2 volumes, see Procedure 21.5, “Mounting multiple OCFS2 volumes with the cluster resource manager”.

Procedure 21.3: Manually mounting an OCFS2 volume
  1. Open a terminal window and log in as root.

  2. Check if the cluster is online with the command crm status.

  3. Mount the volume from the command line, using the mount command.

Tip
Tip: Mounting an existing OCFS2 volume on a single node

You can mount an OCFS2 volume on a single node without a fully functional cluster stack: for example, to quickly access the data from a backup. To do so, use the mount command with the -o nocluster option.

Warning

This mount method lacks cluster-wide protection. To avoid damaging the file system, you must ensure that it is only mounted on one node.

Procedure 21.4: Mounting an OCFS2 volume with the cluster resource manager

To mount an OCFS2 volume with the High Availability software, configure an OCFS2 file system resource in the cluster. The following procedure uses the crm shell to configure the cluster resources. Alternatively, you can also use Hawk2 to configure the resources as described in Section 21.6, “Configuring OCFS2 resources with Hawk2”.

  1. Log in to a node as root or equivalent.

  2. Run crm configure.

  3. Configure Pacemaker to mount the OCFS2 file system on every node in the cluster:

    crm(live)configure# primitive ocfs2-1 ocf:heartbeat:Filesystem \
      params device="/dev/disk/by-id/DEVICE_ID" directory="/mnt/shared" fstype="ocfs2" \
      op monitor interval="20" timeout="40" \
      op start timeout="60" op stop timeout="60" \
      meta target-role="Started"
  4. Add the ocfs2-1 primitive to the g-storage group you created in Procedure 20.1, “Configuring a base group for DLM”.

    crm(live)configure# modgroup g-storage add ocfs2-1

    Because of the base group's internal colocation and ordering, the ocfs2-1 resource will only start on nodes that also have a dlm resource already running.

    Important
    Important: Do not use a group for multiple OCFS2 resources

    Adding multiple OCFS2 resources to a group creates a dependency between the OCFS2 volumes. For example, if you created a group with crm configure group g-storage dlm ocfs2-1 ocfs2-2, then stopping ocfs2-1 will also stop ocfs2-2, and starting ocfs2-2 will also start ocfs2-1.

    To use multiple OCFS2 resources in the cluster, use colocation and order constraints as described in Procedure 21.5, “Mounting multiple OCFS2 volumes with the cluster resource manager”.

  5. Review your changes with show.

  6. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

Procedure 21.5: Mounting multiple OCFS2 volumes with the cluster resource manager

To mount multiple OCFS2 volumes in the cluster, configure an OCFS2 file system resource for each volume, and colocate them with the dlm resource you created in Procedure 20.2, “Configuring an independent DLM resource”.

Important

Do not add multiple OCFS2 resources to a group with DLM. This creates a dependency between the OCFS2 volumes. For example, if ocfs2-1 and ocfs2-2 are in the same group, then stopping ocfs2-1 will also stop ocfs2-2.

  1. Log in to a node as root or equivalent.

  2. Run crm configure.

  3. Create the primitive for the first OCFS2 volume:

    crm(live)configure# primitive ocfs2-1 Filesystem \
      params directory="/srv/ocfs2-1" fstype=ocfs2 device="/dev/disk/by-id/DEVICE_ID1" \
      op monitor interval=20 timeout=40 \
      op start timeout=60 interval=0 \
      op stop timeout=60 interval=0
  4. Create the primitive for the second OCFS2 volume:

    crm(live)configure# primitive ocfs2-2 Filesystem \
      params directory="/srv/ocfs2-2" fstype=ocfs2 device="/dev/disk/by-id/DEVICE_ID2" \
      op monitor interval=20 timeout=40 \
      op start timeout=60 interval=0 \
      op stop timeout=60 interval=0
  5. Clone the OCFS2 resources so that they can run on all nodes:

    crm(live)configure# clone cl-ocfs2-1 ocfs2-1 meta interleave=true
    crm(live)configure# clone cl-ocfs2-2 ocfs2-2 meta interleave=true
  6. Add a colocation constraint for both OCFS2 resources so that they can only run on nodes where DLM is also running:

    crm(live)configure# colocation co-ocfs2-with-dlm inf: ( cl-ocfs2-1 cl-ocfs2-2 ) cl-dlm
  7. Add an order constraint for both OCFS2 resources so that they can only start after DLM is already running:

    crm(live)configure# order o-dlm-before-ocfs2 Mandatory: cl-dlm ( cl-ocfs2-1 cl-ocfs2-2 )
  8. Review your changes with show.

  9. If everything is correct, submit your changes with commit and leave the crm live configuration with quit.

21.6 Configuring OCFS2 resources with Hawk2

Instead of configuring the DLM and the file system resource for OCFS2 manually with the crm shell, you can also use the OCFS2 template in Hawk2's Setup Wizard.

Important
Important: Differences between manual configuration and Hawk2

The OCFS2 template in the Setup Wizard does not include the configuration of a STONITH resource. If you use the wizard, you still need to create an SBD partition on the shared storage and configure a STONITH resource as described in Procedure 21.1, “Configuring a STONITH resource”.

Using the OCFS2 template in the Hawk2 Setup Wizard also leads to a slightly different resource configuration than the manual configuration described in Procedure 20.1, “Configuring a base group for DLM” and Procedure 21.4, “Mounting an OCFS2 volume with the cluster resource manager”.

Procedure 21.6: Configuring OCFS2 resources with Hawk2's Wizard