Applies to SUSE Linux Enterprise High Availability 12 SP5

Glossary #

active/active, active/passive #

A concept of how services run on nodes. In an active/passive scenario, one or more services run on the active node while the passive node waits for the active node to fail. In an active/active scenario, each node is active and passive at the same time: it runs some services itself, but can also take over services from the other node. Compare with primary/secondary and dual-primary in DRBD terminology.

arbitrator #

Additional instance in a Geo cluster that helps to reach consensus about decisions such as failover of resources across sites. Arbitrators are single machines that run one or more booth instances in a special mode.

AutoYaST #

AutoYaST is a system for installing one or more SUSE Linux Enterprise systems automatically and without user intervention.

bindnetaddr (bind network address) #

The network address the Corosync executive should bind to.
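As an illustration of what "network address" means here, the following minimal Python sketch derives the network address from an interface address and prefix length. The helper name `network_address` and the sample address are assumptions made for this example; they are not part of Corosync.

```python
import ipaddress


def network_address(ip_with_prefix: str) -> str:
    """Return the network address for an interface given as 'address/prefix'."""
    # ip_interface keeps the host bits; .network masks them out.
    iface = ipaddress.ip_interface(ip_with_prefix)
    return str(iface.network.network_address)


# Hypothetical node address 192.168.5.92 with a 24-bit netmask.
print(network_address("192.168.5.92/24"))  # prints 192.168.5.0
```

Under these assumptions, the resulting value is the kind of address that would typically be used for bindnetaddr.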

booth #

The instance that manages the failover process between the sites of a Geo cluster. It aims to get multi-site resources active on one and only one site. This is achieved by using so-called tickets that are treated as a failover domain between cluster sites in case a site goes down.

boothd (booth daemon) #

Each of the participating clusters and arbitrators in a Geo cluster runs a service, the boothd. It connects to the booth daemons running at the other sites and exchanges connectivity details.

CCM (consensus cluster membership) #

The CCM determines which nodes make up the cluster and shares this information across the cluster. Any new addition and any loss of nodes or quorum is delivered by the CCM. A CCM module runs on each node of the cluster.

CIB (cluster information base) #

A representation of the whole cluster configuration and status (cluster options, nodes, resources, constraints and the relationship to each other). It is written in XML and resides in memory. A master CIB is kept and maintained on the DC (designated coordinator) and replicated to the other nodes. Normal read and write operations on the CIB are serialized through the master CIB.

cluster #

A high-performance cluster is a group of computers (real or virtual) sharing the application load to achieve faster results. A high-availability cluster is designed primarily to secure the highest possible availability of services.

cluster partition #

Whenever communication fails between one or more nodes and the rest of the cluster, a cluster partition occurs. The nodes of the cluster are split into partitions but are still active. They can only communicate with nodes in the same partition and are unaware of the separated nodes. As the loss of the nodes in the other partition cannot be confirmed, a split brain scenario develops (see also split brain).

cluster site #

In Geo clustering, a cluster site (or just “site”) is a group of nodes in the same physical location, managed by booth.

concurrency violation #

A resource that should be running on only one node in the cluster is running on several nodes.

conntrack tools #

Allow interaction with the in-kernel connection tracking system for enabling stateful packet inspection for iptables. Used by SUSE Linux Enterprise High Availability to synchronize the connection status between cluster nodes.

CRM (cluster resource manager) #

The main management entity responsible for coordinating all non-local interactions. SUSE Linux Enterprise High Availability uses Pacemaker as CRM. Each node of the cluster has its own CRM instance, but the one running on the DC is the one elected to relay decisions to the other non-local CRMs and process their input. A CRM interacts with several components: local resource managers, both on its own node and on the other nodes, non-local CRMs, administrative commands, the fencing functionality, the membership layer, and booth.

crmd (cluster resource manager daemon) #

The CRM is implemented as a daemon, crmd. It has an instance on each cluster node. All cluster decision-making is centralized by electing one of the crmd instances to act as a master. If the elected crmd process (or the node it runs on) fails, a new one is established.

crmsh #

The command line utility crmsh manages your cluster, nodes, and resources.

See Chapter 7, Configuring and Managing Cluster Resources (Command Line) for more information.

Csync2 #

A synchronization tool that can be used to replicate configuration files across all nodes in the cluster, and even across Geo clusters.

DC (designated coordinator) #

One CRM in the cluster is elected as the Designated Coordinator (DC). The DC is the only entity in the cluster that can decide that a cluster-wide change needs to be performed, such as fencing a node or moving resources around. The DC is also the node where the master copy of the CIB is kept. All other nodes get their configuration and resource allocation information from the current DC. The DC is elected from all nodes in the cluster after a membership change.

Disaster #

Unexpected interruption of critical infrastructure induced by nature, humans, hardware failure, or software bugs.

Disaster Recovery Plan #

A strategy to recover from a disaster with minimum impact on IT infrastructure.

Disaster Recovery #

Disaster recovery is the process by which a business function is restored to the normal, steady state after a disaster.

DLM (distributed lock manager) #

DLM coordinates disk access for clustered file systems and administers file locking to increase performance and availability.

DRBD #

DRBD® is a block device designed for building high availability clusters. The whole block device is mirrored via a dedicated network and is seen as a network RAID-1.

existing cluster #

The term “existing cluster” is used to refer to any cluster that consists of at least one node. Existing clusters have a basic Corosync configuration that defines the communication channels, but they do not necessarily have resource configuration yet.

failover #

Occurs when a resource or node fails on one machine and the affected resources are started on another node.

failover domain #

A named subset of cluster nodes that are eligible to run a cluster service if a node fails.

fencing #

Describes the concept of preventing access to a shared resource by isolated or failing cluster members. Should a cluster node fail, it will be shut down or reset to prevent it from causing trouble. This way, resources are locked out of a node whose status is uncertain.

Geo cluster (geographically dispersed cluster) #

Consists of multiple, geographically dispersed sites, each with a local cluster. The sites communicate via IP. Failover across the sites is coordinated by a higher-level entity, booth. Geo clusters need to cope with limited network bandwidth and high latency. Storage is replicated asynchronously.

heartbeat #

A CCM that, in version 3, is an alternative to Corosync. It supports more than two communication paths, but not cluster file systems.

load balancing #

The ability to make several servers participate in the same service and do the same work.

local cluster #

A single cluster in one location (for example, all nodes are located in one data center). Network latency can be neglected. Storage is typically accessed synchronously by all nodes.

location #

In the context of a location constraint, “location” refers to the nodes on which a resource can or cannot run.

LRM (local resource manager) #

Responsible for performing operations on resources. It uses the resource agent scripts to carry out these operations. The LRM is “dumb” in that it does not know of any policy. It needs the DC to tell it what to do.

mcastaddr (multicast address) #

IP address to be used for multicasting by the Corosync executive. The IP address can either be IPv4 or IPv6.
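To check whether a candidate address lies in a multicast range, a small Python sketch such as the following can help. The function name `is_multicast` and the sample addresses are assumptions chosen for illustration only.

```python
import ipaddress


def is_multicast(addr: str) -> bool:
    """Return True if addr is an IPv4 or IPv6 multicast address."""
    # ip_address accepts both IPv4 and IPv6 textual forms.
    return ipaddress.ip_address(addr).is_multicast


# Sample addresses (assumed for this example).
for candidate in ("239.255.1.1", "ff02::1", "192.168.5.92"):
    print(candidate, is_multicast(candidate))
```

The first two sample addresses are in the IPv4 and IPv6 multicast ranges, while the third is an ordinary unicast address.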

mcastport (multicast port) #

The port to use for cluster communication.

metro cluster #

A single cluster that can stretch over multiple buildings or data centers, with all sites connected by Fibre Channel. Network latency is usually low (<5 ms for distances of approximately 20 miles). Storage is frequently replicated (mirroring or synchronous replication).

multicast #

A technology used for a one-to-many communication within a network that can be used for cluster communication. Corosync supports both multicast and unicast.

node #

Any computer (real or virtual) that is a member of a cluster and invisible to the user.

PE (policy engine) #

The policy engine computes the actions that need to be taken to implement policy changes in the CIB. The PE also produces a transition graph containing a list of (resource) actions and dependencies to achieve the next cluster state. The PE always runs on the DC.

quorum #

In a cluster, a cluster partition is defined to have quorum (to be “quorate”) if it has the majority of nodes (or votes). Quorum distinguishes exactly one partition. It is part of the algorithm to prevent several disconnected partitions or nodes from proceeding and causing data and service corruption (split brain). Quorum is a prerequisite for fencing, which then ensures that quorum is indeed unique.
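The majority rule can be sketched in a few lines of Python. This is only a simplified illustration of a strict majority of votes; the function name `has_quorum` is an assumption, and the sketch does not model any special quorum options a real cluster stack may offer.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Return True if the partition holds a strict majority of all votes."""
    # Integer comparison avoids rounding issues: more than half of the total.
    return 2 * partition_votes > total_votes


# Five-node cluster split into partitions of three and two nodes.
print(has_quorum(3, 5))  # prints True
print(has_quorum(2, 5))  # prints False

# Four-node cluster split evenly: neither half has a majority.
print(has_quorum(2, 4))  # prints False
```

Because a strict majority is required, at most one partition can satisfy the rule at any time, which is why quorum distinguishes exactly one partition.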

RA (resource agent) #

A script acting as a proxy to manage a resource (for example, to start, stop, or monitor a resource). SUSE Linux Enterprise High Availability supports different kinds of resource agents. For details, see Section 5.3.2, “Supported Resource Agent Classes”.

ReaR (Relax and Recover) #

An administrator tool set for creating disaster recovery images.

resource #

Any type of service or application that is known to Pacemaker. Examples include an IP address, a file system, or a database.

The term “resource” is also used for DRBD, where it names a set of block devices that are using a common connection for replication.

RRP (redundant ring protocol) #

Allows the use of multiple redundant local area networks for resilience against partial or total network faults. This way, cluster communication can still be kept up as long as a single network is operational. Corosync supports the Totem Redundant Ring Protocol.

SBD (STONITH Block Device) #

Provides a node fencing mechanism through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). Can also be used in diskless mode. Needs a hardware or software watchdog on each node to ensure that misbehaving nodes are really stopped.

SFEX (shared disk file exclusiveness) #

SFEX provides storage protection over SAN.

split brain #

A scenario in which the cluster nodes are divided into two or more groups that do not know of each other (through either a software or a hardware failure). STONITH prevents a split brain situation from badly affecting the entire cluster. Also known as a “partitioned cluster” scenario.

The term split brain is also used in DRBD but means that the two nodes contain different data.

SPOF (single point of failure) #

Any component of a cluster that, should it fail, triggers the failure of the entire cluster.

STONITH #

The acronym for “Shoot the other node in the head”. It refers to the fencing mechanism that shuts down a misbehaving node to prevent it from causing trouble in a cluster.

switchover #

Planned, on-demand moving of services to other nodes in a cluster. See failover.

ticket #

A component used in Geo clusters. A ticket grants the right to run certain resources on a specific cluster site. A ticket can only be owned by one site at a time. Resources can be bound to a certain ticket by dependencies. The respective resources are started only if the defined ticket is available at a site. Conversely, if the ticket is removed, the resources depending on that ticket are automatically stopped.
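The ownership rule can be pictured with a toy Python model. This is a conceptual sketch only, not the booth API; the class `Ticket`, its methods, the helper `resources_may_run`, and the site names are all hypothetical.

```python
from typing import Optional


class Ticket:
    """Toy model of a ticket that at most one site can own at a time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.owner: Optional[str] = None  # no site owns the ticket initially

    def grant(self, site: str) -> None:
        # Refuse to grant a ticket that another site still owns.
        if self.owner is not None and self.owner != site:
            raise ValueError(f"{self.name} is already owned by {self.owner}")
        self.owner = site

    def revoke(self) -> None:
        self.owner = None


def resources_may_run(ticket: Ticket, site: str) -> bool:
    """Resources bound to the ticket may run only on the owning site."""
    return ticket.owner == site


ticket = Ticket("ticketA")
ticket.grant("site-1")
print(resources_may_run(ticket, "site-1"))  # prints True
print(resources_may_run(ticket, "site-2"))  # prints False
ticket.revoke()
print(resources_may_run(ticket, "site-1"))  # prints False
```

In this model, revoking the ticket corresponds to the dependent resources being stopped on the site that previously owned it.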

unicast #

A technology for sending messages to a single network destination. Corosync supports both multicast and unicast. In Corosync, unicast is implemented as UDP-unicast (UDPU).