Applies to SUSE Linux Enterprise High Availability 15 SP4

Glossary #

active/active, active/passive #

A concept of how services run on nodes. In an active/passive scenario, one or more services run on the active node while the passive node waits for the active node to fail. In an active/active scenario, each node is active and passive at the same time: it runs some services of its own but can also take over services from the other node. Compare with primary/secondary and dual-primary in DRBD terminology.

arbitrator #

Additional instance in a Geo cluster that helps to reach consensus about decisions such as failover of resources across sites. Arbitrators are single machines that run one or more booth instances in a special mode.

AutoYaST #

AutoYaST is a system for installing one or more SUSE Linux Enterprise systems automatically and without user intervention.

bindnetaddr (bind network address) #

The network address the Corosync executive should bind to.
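As an illustration, the network address of a subnet can be derived from an interface's IP address and prefix length. The following minimal Python sketch assumes that bindnetaddr is set to the network address of the interface's subnet; the function name and the sample address 192.168.5.92/24 are hypothetical and not part of Corosync.

```python
import ipaddress


def network_address(interface_cidr: str) -> str:
    """Return the network address for an interface given as 'IP/prefix'."""
    iface = ipaddress.ip_interface(interface_cidr)
    # The network address is the interface address with all host bits cleared.
    return str(iface.network.network_address)


# Hypothetical interface address, used for illustration only.
print(network_address("192.168.5.92/24"))
```

For the sample interface above, the sketch yields the address of the /24 network that contains 192.168.5.92.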

booth #

The instance that manages the failover process between the sites of a Geo cluster. It aims to keep multi-site resources active on one and only one site. This is achieved by using so-called tickets, which are treated as a failover domain between cluster sites in case a site goes down.

boothd (booth daemon) #

Each of the participating clusters and arbitrators in a Geo cluster runs a service, the boothd. It connects to the booth daemons running at the other sites and exchanges connectivity details.

CIB (cluster information base) #

A representation of the whole cluster configuration and status (cluster options, nodes, resources, constraints and the relationship to each other). It is written in XML and resides in memory. A primary CIB is kept and maintained on the DC (designated coordinator) and replicated to the other nodes. Normal read and write operations on the CIB are serialized through the primary CIB.
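Because the CIB is XML, it can be inspected with any XML parser. The following Python sketch parses a heavily simplified, hypothetical CIB-like document and lists its nodes and resources. The node names, the resource ID, and the `summarize` helper are assumptions made for illustration; a real CIB carries many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical CIB-like document (illustration only).
CIB_XML = """
<cib>
  <configuration>
    <crm_config/>
    <nodes>
      <node id="1" uname="alice"/>
      <node id="2" uname="bob"/>
    </nodes>
    <resources>
      <primitive id="admin-ip" class="ocf" provider="heartbeat" type="IPaddr2"/>
    </resources>
    <constraints/>
  </configuration>
  <status/>
</cib>
"""


def summarize(cib_xml: str) -> dict:
    """Collect node names and resource IDs from a CIB-like XML string."""
    root = ET.fromstring(cib_xml.strip())
    config = root.find("configuration")
    return {
        "nodes": [n.get("uname") for n in config.find("nodes")],
        "resources": [r.get("id") for r in config.find("resources")],
        # The status section is kept separate from the configuration.
        "has_status": root.find("status") is not None,
    }


print(summarize(CIB_XML))
```

The sketch only reads a static string. It does not query a running cluster.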

cluster #

A high-performance cluster is a group of computers (real or virtual) sharing the application load to achieve faster results. A high-availability cluster is designed primarily to secure the highest possible availability of services.

cluster partition #

Whenever communication fails between one or more nodes and the rest of the cluster, a cluster partition occurs. The nodes of a cluster are split into partitions but are still active. They can only communicate with nodes in the same partition and are unaware of the separated nodes. As the loss of the nodes on the other partition cannot be confirmed, a split brain scenario develops (see also split brain).

cluster site #

In Geo clustering, a cluster site (or just “site”) is a group of nodes in the same physical location, managed by booth.

cluster stack #

The ensemble of software technologies and components that compose a cluster.

concurrency violation #

A resource that should be running on only one node in the cluster is running on several nodes.

conntrack tools #

Allow interaction with the in-kernel connection tracking system for enabling stateful packet inspection for iptables. Used by SUSE Linux Enterprise High Availability to synchronize the connection status between cluster nodes.

CRM (cluster resource manager) #

The management entity responsible for coordinating all non-local interactions in a High Availability cluster. SUSE Linux Enterprise High Availability uses Pacemaker as CRM. The CRM is implemented as pacemaker-controld. It interacts with several components: local resource managers (both on its own node and on the other nodes), non-local CRMs, administrative commands, the fencing functionality, and the membership layer.

crmsh #

The command line utility crmsh manages your cluster, nodes, and resources.

See Section 5.5, “Introduction to crmsh” for more information.

Csync2 #

A synchronization tool that can be used to replicate configuration files across all nodes in the cluster, and even across Geo clusters.

DC (designated coordinator) #

The DC is elected from all nodes in the cluster. This happens if there is no DC yet or if the current DC leaves the cluster for any reason. The DC is the only entity in the cluster that can decide that a cluster-wide change needs to be performed, such as fencing a node or moving resources around. All other nodes get their configuration and resource allocation information from the current DC.

Disaster #

Unexpected interruption of critical infrastructure induced by nature, humans, hardware failure, or software bugs.

Disaster Recovery Plan #

A strategy to recover from a disaster with minimum impact on IT infrastructure.

Disaster Recovery #

Disaster recovery is the process by which a business function is restored to the normal, steady state after a disaster.

DLM (distributed lock manager) #

DLM coordinates disk access for clustered file systems and administers file locking to increase performance and availability.

DRBD #

DRBD® is a block device designed for building high availability clusters. The whole block device is mirrored via a dedicated network and is seen as a network RAID-1.

existing cluster #

The term “existing cluster” is used to refer to any cluster that consists of at least one node. Existing clusters have a basic Corosync configuration that defines the communication channels, but they do not necessarily have resource configuration yet.

failover #

Occurs when a resource or node fails on one machine and the affected resources are started on another node.

failover domain #

A named subset of cluster nodes that are eligible to run a cluster service if a node fails.

fencing #

Describes the concept of preventing access to a shared resource by isolated or failing cluster members. There are two classes of fencing: resource level fencing and node level fencing. Resource level fencing ensures exclusive access to a given resource. Node level fencing prevents a failed node from accessing shared resources entirely and prevents resources from running on a node whose status is uncertain. This is usually done in a simple and abrupt way: reset or power off the node.

Geo cluster (geographically dispersed cluster) #

Consists of multiple, geographically dispersed sites with a local cluster each. The sites communicate via IP. Failover across the sites is coordinated by a higher-level entity, booth. Geo clusters need to cope with limited network bandwidth and high latency. Storage is replicated asynchronously.

load balancing #

The ability to make several servers participate in the same service and do the same work.

local cluster #

A single cluster in one location (for example, all nodes are located in one data center). Network latency can be neglected. Storage is typically accessed synchronously by all nodes.

location #

In the context of a location constraint, “location” refers to the nodes on which a resource can or cannot run.

LRM (local resource manager) #

The local resource manager is located between the Pacemaker layer and the resources layer on each node. It is implemented as the pacemaker-execd daemon. Through this daemon, Pacemaker can start, stop, and monitor resources.

mcastaddr (multicast address) #

IP address to be used for multicasting by the Corosync executive. The IP address can be either IPv4 or IPv6.
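A candidate address can be checked for being a multicast address before it is used. The following Python sketch is one way to do this for both IPv4 and IPv6; the helper name and the sample addresses are hypothetical, and the check says nothing about whether the network actually routes multicast traffic.

```python
import ipaddress


def is_multicast_address(addr: str) -> bool:
    """Return True if addr is a syntactically valid IPv4 or IPv6 multicast address."""
    try:
        return ipaddress.ip_address(addr).is_multicast
    except ValueError:
        # Not a valid IP address at all.
        return False


# Hypothetical sample values, for illustration only.
for candidate in ("239.255.1.1", "ff15::1", "192.168.1.10", "not-an-ip"):
    print(candidate, is_multicast_address(candidate))
```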

mcastport (multicast port) #

The port to use for cluster communication.

metro cluster #

A single cluster that can stretch over multiple buildings or data centers, with all sites connected by fibre channel. Network latency is usually low (<5 ms for distances of approximately 20 miles). Storage is frequently replicated (mirroring or synchronous replication).

multicast #

A technology used for a one-to-many communication within a network that can be used for cluster communication. Corosync supports both multicast and unicast.

node #

Any computer (real or virtual) that is a member of a cluster and invisible to the user.

pacemaker-controld (cluster controller daemon) #

The CRM is implemented as a daemon, pacemaker-controld. It has an instance on each cluster node. All cluster decision-making is centralized by electing one of the pacemaker-controld instances to act as a primary. If the elected pacemaker-controld process (or the node it runs on) fails, a new one is established.

PE (policy engine) #

The policy engine is implemented as the pacemaker-schedulerd daemon. When a cluster transition is needed, pacemaker-schedulerd calculates the expected next state of the cluster based on the current state and configuration. It determines what actions need to be scheduled to achieve the next state.

quorum #

In a cluster, a cluster partition is defined to have quorum (be “quorate”) if it has the majority of nodes (or votes). Quorum distinguishes exactly one partition. It is part of the algorithm to prevent several disconnected partitions or nodes from proceeding and causing data and service corruption (split brain). Quorum is a prerequisite for fencing, which then ensures that quorum is indeed unique.
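The majority rule can be expressed in a few lines. The following Python sketch assumes one vote per node and a strict majority; the function name is hypothetical, and additional quorum options offered by real cluster stacks are not modeled.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Return True if a partition holds a strict majority of all votes."""
    return 2 * partition_votes > total_votes


# A five-node cluster split into partitions of three and two nodes:
print(has_quorum(3, 5), has_quorum(2, 5))

# A four-node cluster split evenly: neither side holds a majority.
print(has_quorum(2, 4), has_quorum(2, 4))
```

Because two disjoint partitions cannot both hold more than half of the votes, at most one partition is quorate at any time.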

RA (resource agent) #

A script acting as a proxy to manage a resource (for example, to start, stop, or monitor a resource). SUSE Linux Enterprise High Availability supports different kinds of resource agents. For details, see Section 6.2, “Supported resource agent classes”.

ReaR (Relax and Recover) #

An administrator tool set for creating disaster recovery images.

resource #

Any type of service or application that is known to Pacemaker. Examples include an IP address, a file system, or a database.

The term “resource” is also used for DRBD, where it names a set of block devices that are using a common connection for replication.

RRP (redundant ring protocol) #

Allows the use of multiple redundant local area networks for resilience against partial or total network faults. This way, cluster communication can still be kept up as long as a single network is operational. Corosync supports the Totem Redundant Ring Protocol.

SBD (STONITH Block Device) #

Provides a node fencing mechanism through the exchange of messages via shared block storage (SAN, iSCSI, FCoE, etc.). Can also be used in diskless mode. Needs a hardware or software watchdog on each node to ensure that misbehaving nodes are really stopped.

SFEX (shared disk file exclusiveness) #

SFEX provides storage protection over SAN.

split brain #

A scenario in which the cluster nodes are divided into two or more groups that do not know of each other (through either a software or a hardware failure). STONITH prevents a split brain situation from badly affecting the entire cluster. Also known as a “partitioned cluster” scenario.

The term split brain is also used in DRBD but means that the two nodes contain different data.

SPOF (single point of failure) #

Any component of a cluster that, should it fail, triggers the failure of the entire cluster.

STONITH #

The acronym for “Shoot the other node in the head”. It refers to the fencing mechanism that shuts down a misbehaving node to prevent it from causing trouble in a cluster. In a Pacemaker cluster, the implementation of node level fencing is STONITH. For this, Pacemaker comes with a fencing subsystem, pacemaker-fenced.

switchover #

Planned, on-demand moving of services to other nodes in a cluster. Compare with failover.

ticket #

A component used in Geo clusters. A ticket grants the right to run certain resources on a specific cluster site. A ticket can only be owned by one site at a time. Resources can be bound to a certain ticket by dependencies. The respective resources are started only if the defined ticket is available at a site. Conversely, if the ticket is removed, the resources depending on that ticket are automatically stopped.
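The ownership rule can be pictured with a toy model. The following Python sketch is not how booth is implemented; the class, the function, the site names, and the resource names are all hypothetical. It only illustrates that a ticket has at most one owner and that dependent resources run only at the owning site.

```python
class Ticket:
    """Toy model of a ticket that at most one site can own at a time."""

    def __init__(self, name: str):
        self.name = name
        self.owner = None  # No site owns the ticket initially.

    def grant(self, site: str) -> None:
        # Refuse to grant the ticket while another site still owns it.
        if self.owner is not None and self.owner != site:
            raise RuntimeError(f"{self.name} is already owned by {self.owner}")
        self.owner = site

    def revoke(self) -> None:
        self.owner = None


def running_resources(ticket: Ticket, site: str, dependent_resources: list) -> list:
    """Resources bound to the ticket run only at the site that owns it."""
    return list(dependent_resources) if ticket.owner == site else []


ticket = Ticket("ticketA")
ticket.grant("site-a")
print(running_resources(ticket, "site-a", ["db", "ip"]))
print(running_resources(ticket, "site-b", ["db", "ip"]))

ticket.revoke()
print(running_resources(ticket, "site-a", ["db", "ip"]))
```

After the ticket is revoked, the model reports no running resources at any site, which mirrors the automatic stop described above.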

unicast #

A technology for sending messages to a single network destination. Corosync supports both multicast and unicast. In Corosync, unicast is implemented as UDP-unicast (UDPU).