Salt timeouts

1. General Salt timeouts

Salt features two timeout parameters called timeout and gather_job_timeout that are relevant during the execution of Salt commands and jobs—​it does not matter whether they are triggered using the command line interface or API. These two parameters are explained in the following article.

This is a normal workflow when all clients are well reachable:

  • A salt command or job is executed:

    salt '*' test.ping
  • Salt master publishes the job with the targeted clients into the Salt PUB channel.

  • Clients take that job and start working on it.

  • Salt master is looking at the Salt RET channel to gather responses from the clients.

  • If Salt master gets all responses from targeted clients, then everything is completed and Salt master will return a response containing all the client responses.

If some of the clients are down during this process, the workflow continues as follows:

  1. If timeout is reached before getting all expected responses from the clients, then Salt master would trigger an aditional job (a Salt find_job job) targeting only pending clients to check whether the job is already running on the client.

  2. Now gather_job_timeout is evaluated. A new counter is now triggered.

  3. If this new find_job job responds that the original job is actually running on the client, then Salt master will wait for that client’s response for another gather_job_timeout interval before issuing the next find_job job.

  4. In case of reaching gather_job_timeout without having any response from the client (neither for the initial test.ping nor for the find_job job), Salt master will return with only the gathered responses from the responding clients.

By default, SUSE Manager globally sets timeout and gather_job_timeout to 120 seconds. So, in the worst case, a Salt call targeting unreachable clients will end up with 240 seconds of waiting until getting a response.

You can configure these values differently by creating a /etc/salt/master.d/custom.conf configuration file according to syntax in /etc/salt/master.conf.

2. Presence Ping Timeouts

Before Actions are executed on Salt clients, whether they scheduled via the Web UI or the API, SUSE Manager performs a "presence ping" command to ensure the respective salt-minion processes are active and able to respond. Then, a ping gather job runs on the Salt master to handle the incoming pings from the clients. Actual commands will begin only after all clients have either responded to the ping, or timed out.

The presence ping is an ordinary Salt command, but is not subject to the same timeout parameters as all other Salt commands (timeout/gather_job_timeout, described above). Rather, it has its own parameters (presence_ping_timeout/presence_ping_gather_job_timeout) that can be set in /etc/rhn/rhn.conf.

To allow for quicker detection of unresponsive clients, the timeout values for presence pings are by default significantly shorter than the general defaults. You can configure the presence ping parameters in /etc/rhn/rhn.conf, however the default values should be sufficient in most cases.

A lower total presence ping timeout value will increase the chance of false negatives. In some cases, a client might be marked as non-responding, when it is responding but did not respond quickly enough. Additionally, setting this total presence ping timeout value too low could result in a client hanging at the boot screen. A higher total presence ping timeout will increase the accuracy of the test, as even slow clients will respond to the presence ping before timing out. Additionally, a higher presence ping timeout could limit throughput if you are targeting a large number of clients, when some of them are slow.

If a client does not reply to a ping within the allocated time, it is marked as not available, and is excluded from the command. The Web UI shows a minion is down or could not be contacted message in this case.

The presence ping timeout parameter changes the timeout setting for the presence ping, in seconds. Adjust the java.salt_presence_ping_timeout parameter. Defaults to 4 seconds.

The presence ping gather job parameter changes the timeout setting for gathering the presence ping, in seconds. Adjust the java.salt_presence_ping_gather_job_timeout parameter. Defaults to 1 second.

3. Salt SSH Clients (SSH Push)

Salt SSH clients are slightly different than regular clients (zeromq). Salt SSH clients do not use Salt PUB/RET channels but a wrapper Salt command inside of an SSH call. Salt timeout and gather_job_timeout are not playing a role here.

SUSE Manager defines a timeout for SSH connections in /etc/rhn/rhn.conf:

# salt_ssh_connect_timeout = 180