The main purpose of an HA cluster is to manage user services. Typical examples of user services are an Apache Web server or a database. From the user's point of view, the services do something specific when ordered to do so. To the cluster, however, they are only resources which may be started or stopped—the nature of the service is irrelevant to the cluster.
In this chapter, we will introduce some basic concepts you need to know when configuring resources and administering your cluster. The following chapters show you how to execute the main configuration and administration tasks with each of the management tools the High Availability Extension provides.
To configure and manage cluster resources, either use HA Web Konsole (Hawk2), or the crm shell (crmsh) command line utility. If you upgrade from an earlier version of SUSE® Linux Enterprise High Availability Extension where Hawk was installed, the package will be replaced with the current version, Hawk2.
Hawk2's Web-based user-interface allows you to monitor and administer your Linux cluster from non-Linux machines. Furthermore, it is the ideal solution in case your system only provides a minimal graphical user interface.
To configure and manage cluster resources, either use the crm shell (crmsh) command line utility or HA Web Konsole (Hawk2), a Web-based user interface.
This chapter introduces crm
, the command line tool
and covers an overview of this tool, how to use templates, and mainly
configuring and managing cluster resources: creating basic and advanced
types of resources (groups and clones), configuring constraints,
specifying failover nodes and failback nodes, configuring resource
monitoring, starting, cleaning up or removing resources, and migrating
resources manually.
All tasks that need to be managed by a cluster must be available as a resource. There are two major groups here to consider: resource agents and STONITH agents. For both categories, you can add your own agents, extending the abilities of the cluster to your own needs.
Fencing is a very important concept in computer clusters for HA (High Availability). A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource. Fencing may be defined as a method to bring an HA cluster to a known state.
Every resource in a cluster has a state attached. For example: “resource r1 is started on alice”. In an HA cluster, such a state implies that “resource r1 is stopped on all nodes except alice”, because the cluster must make sure that every resource may be started on only one node. Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.
When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.
SBD (STONITH Block Device) provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). This isolates the fencing mechanism from changes in firmware version or dependencies on specific firmware controllers. SBD needs a watchdog on each node to ensure that misbehaving nodes are really stopped. Under certain conditions, it is also possible to use SBD without shared storage, by running it in diskless mode.
The ha-cluster-bootstrap scripts provide an automated way to set up a cluster with the option of using SBD as fencing mechanism. For details, see the Installation and Setup Quick Start. However, manually setting up SBD provides you with more options regarding the individual settings.
This chapter explains the concepts behind SBD. It guides you through configuring the components needed by SBD to protect your cluster from potential data corruption in case of a split brain scenario.
In addition to node level fencing, you can use additional mechanisms for storage protection, such as LVM2 exclusive activation or OCFS2 file locking support (resource level fencing). They protect your system against administrative or application faults.
The cluster administration tools like crm shell (crmsh) or
Hawk2 can be used by root
or any user in the group
haclient
. By default, these
users have full read/write access. To limit access or assign more
fine-grained access rights, you can use Access control
lists (ACLs).
Access control lists consist of an ordered set of access rules. Each rule allows read or write access or denies access to a part of the cluster configuration. Rules are typically combined to produce a specific role, then users may be assigned to a role that matches their tasks.
For many systems, it is desirable to implement network connections that comply to more than the standard data security or availability requirements of a typical Ethernet device. In these cases, several Ethernet devices can be aggregated to a single bonding device.
Load Balancing makes a cluster of servers appear as one large, fast server to outside clients. This apparent single server is called a virtual server. It consists of one or more load balancers dispatching incoming requests and several real servers running the actual services. With a load balancing setup of High Availability Extension, you can build highly scalable and highly available network services, such as Web, cache, mail, FTP, media and VoIP services.
Apart from local clusters and metro area clusters, SUSE® Linux Enterprise High Availability Extension
12 SP5 also supports geographically dispersed clusters (Geo
clusters, sometimes also called multi-site clusters). That means you can
have multiple, geographically dispersed sites with a local cluster each.
Failover between these clusters is coordinated by a higher level entity,
the so-called booth
. Support for Geo clusters is
available as a separate extension to Geo Clustering for SUSE Linux Enterprise High Availability Extension. For details on how to
use and set up Geo clusters, refer to the Kurzanleitung zu Geo-Clustering
or the Geo Clustering Guide.
To perform maintenance tasks on the cluster nodes, you might need to stop the resources running on that node, to move them, or to shut down or reboot the node. It might also be necessary to temporarily take over the control of resources from the cluster, or even to stop the cluster service while resources remain running.
This chapter explains how to manually take down a cluster node without negative side effects. It also gives an overview of different options the cluster stack provides for executing maintenance tasks.