Applies to SUSE Linux Enterprise High Availability 15 SP6

10 Adding or modifying resource agents #

Revision History: Documentação do SUSE Linux Enterprise High Availability

All tasks that need to be managed by a cluster must be available as a resource. There are two major groups to consider: resource agents and STONITH agents. For both categories, you can add your own agents, extending the abilities of the cluster to your own needs.

10.1 STONITH agents #

A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource.

Warning: External SSH/STONITH are not supported

It is impossible to know how SSH might react to other system problems. For this reason, external SSH/STONITH agents (like stonith:external/ssh) are not supported for production environments. If you still want to use such agents for testing, install the libglue-devel package.

To get a list of all currently available STONITH devices (from the software side), use the command crm ra list stonith. If you do not find your favorite agent, install the -devel package. For more information on STONITH devices and resource agents, see Chapter 12, Fencing and STONITH.

There is currently no documentation about writing STONITH agents. If you want to write new STONITH agents, consult the examples available in the source of the cluster-glue package.

10.2 Writing OCF resource agents #

All OCF resource agents (RAs) are available in /usr/lib/ocf/resource.d/, see Section 6.2, “Supported resource agent classes” for more information. Each resource agent must supported the following operations to control it:

start: start or enable the resource
stop: stop or disable the resource
status: returns the status of the resource
monitor: similar to status, but checks also for unexpected states
validate: validate the resource's configuration
meta-data: returns information about the resource agent in XML

The general procedure of how to create an OCF RA is like the following:

Load the file /usr/lib/ocf/resource.d/pacemaker/Dummy as a template.
Create a new subdirectory for each new resource agents to avoid naming contradictions. For example, if you have a resource group kitchen with the resource coffee_machine, add this resource to the directory /usr/lib/ocf/resource.d/kitchen/. To access this RA, execute the command crm:
```
# crm configure primitive coffee_1 ocf:coffee_machine:kitchen ...
```
Implement the different shell functions and save your file under a different name.

More details about writing OCF resource agents can be found at https://clusterlabs.org/projects/pacemaker/doc/ in the guide Pacemaker Administration. Find special information about several concepts at Chapter 1, Product overview.

10.3 OCF return codes and failure recovery #

According to the OCF specification, there are strict definitions of the exit codes an action must return. The cluster always checks the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and a recovery action is initiated. There are three types of failure recovery:

Table 10.1: Failure recovery types #

Recovery Type	Description	Action Taken by the Cluster
soft	A transient error occurred.	Restart the resource or move it to a new location.
hard	A non-transient error occurred. The error may be specific to the current node.	Move the resource elsewhere and prevent it from being retried on the current node.
fatal	A non-transient error occurred that is common to all cluster nodes. This means a bad configuration was specified.	Stop the resource and prevent it from being started on any cluster node.

Assuming an action is considered to have failed, the following table outlines the different OCF return codes. It also shows the type of recovery the cluster initiates when the respective error code is received.

Table 10.2: OCF return codes #

OCF Return Code	OCF Alias	Description	Recovery Type
0	OCF_SUCCESS	Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.	soft
1	OCF_ERR_GENERIC	Generic “there was a problem” error code.	soft
2	OCF_ERR_ARGS	The resource’s configuration is not valid on this machine (for example, it refers to a location/tool not found on the node).	hard
3	OCF_ERR_UNIMPLEMENTED	The requested action is not implemented.	hard
4	OCF_ERR_PERM	The resource agent does not have sufficient privileges to complete the task.	hard
5	OCF_ERR_INSTALLED	The tools required by the resource are not installed on this machine.	hard
6	OCF_ERR_CONFIGURED	The resource’s configuration is invalid (for example, required parameters are missing).	fatal
7	OCF_NOT_RUNNING	The resource is not running. The cluster does not attempt to stop a resource that returns this for any action. This OCF return code may or may not require resource recovery—it depends on what is the expected resource status. If unexpected, then `soft` recovery.	N/A
8	OCF_RUNNING_PROMOTED	The resource is running in Promoted mode.	soft
9	OCF_FAILED_PROMOTED	The resource is in Promoted mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again.	soft
other	N/A	Custom error code.	soft