Applies to SUSE Linux Enterprise High Availability Extension 12 SP5

9 Adding or Modifying Resource Agents

Abstract

All tasks that need to be managed by a cluster must be available as a resource. There are two major groups here to consider: resource agents and STONITH agents. For both categories, you can add your own agents, extending the abilities of the cluster to your own needs.

9.1 STONITH Agents

A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource.

Warning: External SSH/STONITH Are Not Supported

It is impossible to know how SSH might react to other system problems. For this reason, external SSH/STONITH agents (like stonith:external/ssh) are not supported for production environments. If you still want to use such agents for testing, install the libglue-devel package.

To get a list of all currently available STONITH devices (from the software side), use the command crm ra list stonith. If you do not find your favorite agent, install the -devel package. For more information on STONITH devices and resource agents, see Chapter 10, Fencing and STONITH.
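To check from a script whether a particular agent is in that list, a small filter is enough. The following is a minimal sketch: has_agent is a hypothetical helper name, and it assumes that the list arrives on standard input as whitespace-separated agent names.

```shell
# has_agent NAME: succeed if NAME appears as a complete entry in the
# whitespace-separated agent list read from standard input.
# (Hypothetical helper, not part of crmsh.)
has_agent() {
    # Put every entry on its own line, then look for an exact,
    # whole-line match so that "external/ssh" does not match
    # "external/sshx".
    tr -s ' \t' '\n\n' | grep -Fx -- "$1" >/dev/null
}
```

Assuming crm writes the plain list to standard output when piped, a call could look like this: crm ra list stonith | has_agent external/ssh && echo "agent available"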

Currently, there is no documentation about writing STONITH agents. If you want to write new STONITH agents, consult the examples available in the source of the cluster-glue package.

9.2 Writing OCF Resource Agents

All OCF resource agents (RAs) are available in /usr/lib/ocf/resource.d/. See Section 6.3.2, “Supported Resource Agent Classes” for more information. Each resource agent must support the following operations to control it:

start

start or enable the resource

stop

stop or disable the resource

status

returns the status of the resource

monitor

similar to status, but also checks for unexpected states

validate

validate the resource's configuration

meta-data

returns information about the resource agent in XML
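The operations above typically map to one shell function each, plus a dispatcher that selects the function from the first command line argument. The following is a minimal sketch, not a production agent: the "resource" is only a state file, the coffee_* function names and the state parameter are hypothetical, and the return code values are defined locally instead of being loaded from the shell function library that real agents normally source. It accepts validate-all, the action name used by the OCF specification, alongside validate.

```shell
#!/bin/sh
# Sketch of an OCF-style resource agent. The "resource" is only a state
# file, so the operations can be exercised without a real service.

# Return codes used below; values as defined by the OCF specification.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# "state" is a hypothetical parameter; the cluster hands parameters to
# the agent as environment variables named OCF_RESKEY_<parameter>.
STATE_FILE="${OCF_RESKEY_state:-${TMPDIR:-/tmp}/coffee_machine.state}"

coffee_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="coffee_machine" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Example agent that tracks a state file.</longdesc>
  <shortdesc lang="en">Example coffee machine agent</shortdesc>
  <parameters>
    <parameter name="state" unique="1">
      <longdesc lang="en">Path of the state file.</longdesc>
      <shortdesc lang="en">State file</shortdesc>
      <content type="string" />
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s" />
    <action name="stop" timeout="20s" />
    <action name="monitor" timeout="20s" interval="10s" />
    <action name="validate-all" timeout="20s" />
    <action name="meta-data" timeout="5s" />
  </actions>
</resource-agent>
EOF
}

coffee_validate() {
    # The directory of the state file must exist and be writable here.
    state_dir=$(dirname "$STATE_FILE")
    [ -d "$state_dir" ] && [ -w "$state_dir" ] || return $OCF_ERR_ARGS
    return $OCF_SUCCESS
}

coffee_monitor() {
    # "Running" simply means that the state file exists.
    [ -f "$STATE_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

coffee_start() {
    # Starting an already running resource counts as success.
    coffee_monitor && return $OCF_SUCCESS
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

coffee_stop() {
    # Stopping an already stopped resource counts as success, too.
    rm -f "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

# Dispatcher: the requested operation is the first argument.
coffee_main() {
    case "$1" in
        meta-data)             coffee_meta_data ;;
        start)                 coffee_validate && coffee_start ;;
        stop)                  coffee_stop ;;
        status|monitor)        coffee_monitor ;;
        validate|validate-all) coffee_validate ;;
        *)                     return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# Run the dispatcher only when an operation was passed to the script.
if [ $# -gt 0 ]; then
    coffee_main "$@"
    exit $?
fi
```

Saved as an executable file, the sketch can be tried by hand, for example with the argument monitor followed by echo $? to inspect the return code. Both start and stop are written to be idempotent, so repeating either operation still reports success.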

To create an OCF RA, proceed as follows:

  1. Load the file /usr/lib/ocf/resource.d/pacemaker/Dummy as a template.

  2. Create a new subdirectory for each new resource agent to avoid naming conflicts. For example, if you have a resource group kitchen with the resource coffee_machine, add this resource to the directory /usr/lib/ocf/resource.d/kitchen/. The subdirectory name serves as the provider part of the agent name (ocf:PROVIDER:AGENT). To access this RA, execute the command crm:

    root # crm configure primitive coffee_1 ocf:kitchen:coffee_machine ...

  3. Implement the different shell functions and save your file under a different name.
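The file handling in this procedure can be sketched as a small shell helper. This is an illustration under assumptions, not a tool shipped with the product: new_agent_from_template is a hypothetical name, and OCF_ROOT defaults to the /usr/lib/ocf location used in this chapter.

```shell
# OCF root directory; /usr/lib/ocf is the location named in this chapter.
OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}"

# new_agent_from_template PROVIDER AGENT [TEMPLATE]
# (Hypothetical helper, not part of any cluster package.)
new_agent_from_template() {
    provider=$1
    agent=$2
    template=${3:-$OCF_ROOT/resource.d/pacemaker/Dummy}
    target_dir=$OCF_ROOT/resource.d/$provider

    # Step 2: a separate subdirectory avoids name clashes with other agents.
    mkdir -p "$target_dir" || return 1

    # Never overwrite an agent that already exists.
    if [ -e "$target_dir/$agent" ]; then
        echo "refusing to overwrite $target_dir/$agent" >&2
        return 1
    fi

    # Steps 1 and 3: copy the template under the new name and make sure
    # the copy is executable.
    cp "$template" "$target_dir/$agent" && chmod 755 "$target_dir/$agent"
}
```

Run as root, new_agent_from_template kitchen coffee_machine would create /usr/lib/ocf/resource.d/kitchen/coffee_machine as a copy of the Dummy agent. The shell functions in the copy still need to be implemented by hand.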

More details about writing OCF resource agents can be found at https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc. Find special information about several concepts at Chapter 1, Product Overview.

9.3 OCF Return Codes and Failure Recovery

According to the OCF specification, there are strict definitions of the exit codes an action must return. The cluster always checks the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and a recovery action is initiated. There are three types of failure recovery:

Table 9.1: Failure Recovery Types

| Recovery Type | Description | Action Taken by the Cluster |
|---|---|---|
| soft | A transient error occurred. | Restart the resource or move it to a new location. |
| hard | A non-transient error occurred. The error may be specific to the current node. | Move the resource elsewhere and prevent it from being retried on the current node. |
| fatal | A non-transient error occurred that will be common to all cluster nodes. This means a bad configuration was specified. | Stop the resource and prevent it from being started on any cluster node. |

Assuming an action is considered to have failed, the following table outlines the different OCF return codes. It also shows the type of recovery the cluster will initiate when the respective error code is received.

Table 9.2: OCF Return Codes

| OCF Return Code | OCF Alias | Description | Recovery Type |
|---|---|---|---|
| 0 | OCF_SUCCESS | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | soft |
| 1 | OCF_ERR_GENERIC | Generic “there was a problem” error code. | soft |
| 2 | OCF_ERR_ARGS | The resource’s configuration is not valid on this machine (for example, it refers to a location/tool not found on the node). | hard |
| 3 | OCF_ERR_UNIMPLEMENTED | The requested action is not implemented. | hard |
| 4 | OCF_ERR_PERM | The resource agent does not have sufficient privileges to complete the task. | hard |
| 5 | OCF_ERR_INSTALLED | The tools required by the resource are not installed on this machine. | hard |
| 6 | OCF_ERR_CONFIGURED | The resource’s configuration is invalid (for example, required parameters are missing). | fatal |
| 7 | OCF_NOT_RUNNING | The resource is not running. The cluster will not attempt to stop a resource that returns this for any action. This return code may or may not require resource recovery, depending on the expected resource status. If it is unexpected, soft recovery is initiated. | N/A |
| 8 | OCF_RUNNING_MASTER | The resource is running in Master mode. | soft |
| 9 | OCF_FAILED_MASTER | The resource is in Master mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft |
| other | N/A | Custom error code. | soft |
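The comparison between expected and actual return code, and the resulting recovery type, can be expressed as a short lookup. The following sketch only mirrors the two tables in this section. The function names ocf_alias and check_result are hypothetical, and the code illustrates the mapping, not how the cluster implements it.

```shell
# ocf_alias RC: print the OCF alias for a numeric return code.
ocf_alias() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "N/A" ;;              # custom error code
    esac
}

# check_result EXPECTED ACTUAL: print "ok" if the codes match, otherwise
# the alias of the actual code and the recovery type from the table.
check_result() {
    if [ "$1" -eq "$2" ]; then
        echo "ok"
        return 0
    fi
    case "$2" in
        2|3|4|5) recovery=hard ;;
        6)       recovery=fatal ;;
        *)       recovery=soft ;;     # 0, 1, 8, 9, custom codes, unexpected 7
    esac
    echo "failed: $(ocf_alias "$2"), recovery=$recovery"
}

check_result 7 7    # prints: ok
check_result 0 5    # prints: failed: OCF_ERR_INSTALLED, recovery=hard
```

The first call models a monitor operation on a resource that is expected to be stopped, so return code 7 is the expected result and no recovery is needed.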
