10 Adding or modifying resource agents
All tasks that need to be managed by a cluster must be available as a resource. There are two major groups to consider: resource agents and STONITH agents. For both categories, you can add your own agents, extending the abilities of the cluster to your own needs.
10.1 STONITH agents
A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. This is called fencing and is commonly done with a STONITH resource.
It is impossible to know how SSH might react to other system problems. For this reason, external SSH/STONITH agents (like stonith:external/ssh) are not supported for production environments. If you still want to use such agents for testing, install the libglue-devel package.
To get a list of all currently available STONITH devices (from the software side), use the command crm ra list stonith.
If you do not find your favorite agent, install the -devel package.
For more information on STONITH devices and resource agents, see Chapter 12, Fencing and STONITH.
There is currently no documentation about writing STONITH agents. If you want to write new STONITH agents, consult the examples available in the source of the cluster-glue package.
10.2 Writing OCF resource agents
All OCF resource agents (RAs) are available in /usr/lib/ocf/resource.d/. See Section 6.2, “Supported resource agent classes” for more information.
Each resource agent must support the following operations to control it:
start
start or enable the resource
stop
stop or disable the resource
status
returns the status of the resource
monitor
similar to status, but also checks for unexpected states
validate
validate the resource's configuration
meta-data
returns information about the resource agent in XML
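The operations listed above can be sketched as a small shell script. This is a minimal illustration and not a replacement for the Dummy template: the agent name coffee_machine, the state parameter, and the coffee_* function names are hypothetical, and the return codes are defined by hand only so that the sketch runs without the OCF shell function library that a real agent would normally source. The dispatcher also accepts validate-all, the action name used by the OCF specification, alongside validate.

```shell
#!/bin/sh
# Sketch of a hypothetical OCF resource agent (kitchen/coffee_machine).
# Assumption: the resource counts as "running" while a state file exists.

# OCF return codes; a real agent gets these from the OCF shell function
# library instead of defining them by hand.
: "${OCF_SUCCESS:=0}" "${OCF_ERR_GENERIC:=1}" "${OCF_ERR_ARGS:=2}"
: "${OCF_ERR_UNIMPLEMENTED:=3}" "${OCF_NOT_RUNNING:=7}"

# Hypothetical "state" parameter; the cluster passes resource parameters
# to the agent as OCF_RESKEY_<name> environment variables.
STATE_FILE="${OCF_RESKEY_state:-${TMPDIR:-/tmp}/coffee_machine.state}"

coffee_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="coffee_machine" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Hypothetical example agent.</longdesc>
  <shortdesc lang="en">Example coffee machine</shortdesc>
  <parameters>
    <parameter name="state" unique="1">
      <longdesc lang="en">Location of the state file.</longdesc>
      <shortdesc lang="en">State file</shortdesc>
      <content type="string"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="validate-all" timeout="20s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

coffee_validate() {
    # The directory holding the state file must exist and be writable.
    dir=$(dirname "$STATE_FILE")
    [ -d "$dir" ] && [ -w "$dir" ] || return "$OCF_ERR_ARGS"
    return "$OCF_SUCCESS"
}

coffee_monitor() {
    [ -f "$STATE_FILE" ] && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}

coffee_start() {
    coffee_monitor && return "$OCF_SUCCESS"      # already running
    touch "$STATE_FILE" || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

coffee_stop() {
    # Stopping a resource that is not running still counts as success.
    rm -f "$STATE_FILE" || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

coffee_main() {
    case "$1" in
        meta-data)             coffee_meta_data ;;
        start)                 coffee_start ;;
        stop)                  coffee_stop ;;
        status|monitor)        coffee_monitor ;;
        validate|validate-all) coffee_validate ;;
        *)                     return "$OCF_ERR_UNIMPLEMENTED" ;;
    esac
}

# Dispatch only when an action is given, so the functions can also be
# sourced and called directly.
if [ "$#" -gt 0 ]; then
    coffee_main "$@"
    exit $?
fi
```

Saved as an executable file, the sketch could be called with an action name as its only argument, for example monitor, and the exit code inspected with echo $?.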
The general procedure for creating an OCF RA is as follows:
1. Load the file /usr/lib/ocf/resource.d/pacemaker/Dummy as a template.
2. Create a new subdirectory for each new resource agent to avoid naming conflicts. For example, if you have a resource group kitchen with the resource coffee_machine, add this resource to the directory /usr/lib/ocf/resource.d/kitchen/. To access this RA, execute the command crm:
   # crm configure primitive coffee_1 ocf:kitchen:coffee_machine ...
3. Implement the different shell functions and save your file under a different name.
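The copy-and-rename steps of this procedure might look like the following sketch. The helper names install_agent and agent_spec are hypothetical. OCF_ROOT defaults to /usr/lib/ocf, the parent of the resource.d directory named above, and can be pointed at a scratch directory for a dry run. The resource specification follows the class:provider:agent order, where the provider is the name of the subdirectory below resource.d.

```shell
#!/bin/sh
# Hypothetical helpers for placing a custom agent in its own provider directory.

# Parent of the resource.d directory; override for a dry run.
OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}"

# install_agent PROVIDER AGENT SOURCE
# Copy SOURCE (for example, an edited copy of the Dummy template) to
# $OCF_ROOT/resource.d/PROVIDER/AGENT and make it executable.
install_agent() {
    provider=$1 agent=$2 src=$3
    dest_dir="$OCF_ROOT/resource.d/$provider"
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/$agent" || return 1
    chmod 755 "$dest_dir/$agent"
}

# agent_spec PROVIDER AGENT
# Print the resource specification expected by "crm configure primitive".
agent_spec() {
    printf 'ocf:%s:%s\n' "$1" "$2"
}

# Example (as root, on a cluster node):
#   cp /usr/lib/ocf/resource.d/pacemaker/Dummy ./coffee_machine
#   ... edit ./coffee_machine ...
#   install_agent kitchen coffee_machine ./coffee_machine
#   crm configure primitive coffee_1 "$(agent_spec kitchen coffee_machine)" ...
```

Keeping the provider directory separate from the shipped providers means that a package update of the stock agents does not overwrite the custom one.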
More details about writing OCF resource agents can be found at https://www.clusterlabs.org/pacemaker/doc/ in the guide Pacemaker Administration. Find special information about several concepts at Chapter 1, Product overview.
10.3 OCF return codes and failure recovery
According to the OCF specification, there are strict definitions of the exit codes an action must return. The cluster always checks the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and a recovery action is initiated. There are three types of failure recovery:
| Recovery Type | Description | Action Taken by the Cluster |
|---|---|---|
| soft | A transient error occurred. | Restart the resource or move it to a new location. |
| hard | A non-transient error occurred. The error may be specific to the current node. | Move the resource elsewhere and prevent it from being retried on the current node. |
| fatal | A non-transient error occurred that is common to all cluster nodes. This means a bad configuration was specified. | Stop the resource and prevent it from being started on any cluster node. |
Assuming an action is considered to have failed, the following table outlines the different OCF return codes. It also shows the type of recovery the cluster initiates when the respective error code is received.
| OCF Return Code | OCF Alias | Description | Recovery Type |
|---|---|---|---|
| 0 | OCF_SUCCESS | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | soft |
| 1 | OCF_ERR_GENERIC | Generic “there was a problem” error code. | soft |
| 2 | OCF_ERR_ARGS | The resource’s configuration is not valid on this machine (for example, it refers to a location/tool not found on the node). | hard |
| 3 | OCF_ERR_UNIMPLEMENTED | The requested action is not implemented. | hard |
| 4 | OCF_ERR_PERM | The resource agent does not have sufficient privileges to complete the task. | hard |
| 5 | OCF_ERR_INSTALLED | The tools required by the resource are not installed on this machine. | hard |
| 6 | OCF_ERR_CONFIGURED | The resource’s configuration is invalid (for example, required parameters are missing). | fatal |
| 7 | OCF_NOT_RUNNING | The resource is not running. The cluster does not attempt to stop a resource that returns this for any action. This OCF return code may or may not require resource recovery; it depends on the expected resource status. If unexpected, then soft recovery. | N/A |
| 8 | OCF_RUNNING_PROMOTED | The resource is running in Promoted mode. | soft |
| 9 | OCF_FAILED_PROMOTED | The resource is in Promoted mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft |
| other | N/A | Custom error code. | soft |
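When testing an agent by hand, it can help to translate an exit code into the recovery type the cluster would apply. The following lookup is a sketch that only mirrors the Recovery Type column of the table above. The function name ocf_recovery_type is hypothetical and not part of any cluster tool.

```shell
#!/bin/sh
# Hypothetical helper: print the recovery type for an OCF return code,
# following the Recovery Type column of the return code table.
ocf_recovery_type() {
    case "$1" in
        0|1|8|9) echo soft ;;
        2|3|4|5) echo hard ;;
        6)       echo fatal ;;
        7)       echo N/A ;;    # depends on the expected resource status
        *)       echo soft ;;   # "other": custom error codes
    esac
}

# Print the mapping for a few sample return codes.
for rc in 1 5 6 7; do
    printf 'rc=%s -> %s\n' "$rc" "$(ocf_recovery_type "$rc")"
done
```

The mapping only applies when the returned code differs from the expected result. A return code of 0 from a successful start, for example, triggers no recovery at all.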