The rate at which SUSE Manager can on-board minions (accept Salt keys) is limited and depends on hardware resources. On-boarding minions faster than SUSE Manager is configured for builds up a backlog of unprocessed keys, which slows the process and can exhaust resources. It is recommended to limit the key acceptance rate programmatically. A safe starting point is to on-board one minion every 15 seconds, which can be implemented with the following command:
for k in $(salt-key -l un | grep -v Unaccepted); do salt-key -y -a "$k"; sleep 15; done
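The one-liner above can also be split into two small shell functions, which makes the filtering step testable and the interval adjustable. This is a sketch, not a SUSE Manager tool: the function names and the ONBOARD_INTERVAL variable are hypothetical, and it assumes, like the one-liner, that salt-key prints an "Unaccepted Keys:" header followed by one minion ID per line.

```shell
#!/bin/sh
# Sketch only: a slightly more defensive variant of the one-liner.
# ONBOARD_INTERVAL is a hypothetical variable name (default: 15 seconds).

# Filter `salt-key -l un` output read on stdin: drop the
# "Unaccepted Keys:" header and blank lines, keep one minion ID per line.
pending_minions() {
  grep -v -e '^Unaccepted Keys:' -e '^[[:space:]]*$'
}

# Accept the keys that are pending right now, one at a time.
onboard_pending() {
  interval=${ONBOARD_INTERVAL:-15}
  salt-key -l un | pending_minions | while IFS= read -r minion; do
    # </dev/null keeps salt-key from reading the key list on stdin
    salt-key -y -a "$minion" </dev/null
    sleep "$interval"
  done
}
```

After sourcing the file, run onboard_pending, or ONBOARD_INTERVAL=30 onboard_pending for a slower rate. The key list is read once at the start, so keys that arrive during the run are only picked up by the next invocation.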
Minions that have not been on-boarded (minions running with unaccepted Salt keys) still consume resources, in particular inbound network bandwidth of about 2.5 Kb/s per minion. 1000 idle minions consume around 2.5 Mb/s, and this drops to almost 0 once on-boarding has been completed. For optimal performance, keep the number of systems waiting to be on-boarded low.
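To get a feel for these numbers, the estimate can be scripted. This sketch only multiplies the approximate per-minion figure quoted above by the number of pending keys; idle_bandwidth_mbps is a hypothetical helper, and 2.5 Kb/s per minion is an approximation, not a measurement of your network.

```shell
#!/bin/sh
# Rough estimate only: multiplies the ~2.5 Kb/s per unaccepted minion
# quoted in the text by the number of minions, then converts
# Kb/s to Mb/s (1 Mb = 1000 Kb) and prints one decimal place.
idle_bandwidth_mbps() {  # $1 = number of minions with unaccepted keys
  awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 2.5 / 1000 }'
}
```

For example, idle_bandwidth_mbps 1000 prints 2.5. To feed it the current number of pending keys, reuse the filter from the on-boarding command: idle_bandwidth_mbps "$(salt-key -l un | grep -vc Unaccepted)".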
Salt features two timeout parameters, timeout and gather_job_timeout, that are relevant during the execution of Salt commands and jobs, regardless of whether they are triggered from the command line interface or through the API.
These two parameters are explained below.
This is the normal workflow when all minions are reachable:
A Salt command or job is executed, for example:
salt '*' test.ping
The Salt master publishes the job, together with its target, on the Salt PUB channel.
The targeted minions pick up the job and start working on it.
The Salt master listens on the Salt RET channel to gather the responses from the minions.
Once the Salt master has received responses from all targeted minions, the job is complete and the master returns a response containing all minion responses.
If some of the minions are down during this process, the workflow continues as follows:
If timeout is reached before all expected responses have arrived, the Salt master triggers an additional job (a Salt find_job job) that targets only the pending minions, to check whether the original job is still running on them.
At this point, gather_job_timeout is evaluated: a new countdown starts.
If the find_job job reports that the original job is still running on a minion, the Salt master keeps waiting for that minion's response.
If gather_job_timeout is reached without any response from a minion (neither to the initial test.ping nor to the find_job job), the Salt master returns only the responses gathered from the minions that did respond.
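The decision the master makes for each pending minion can be summarized as a small toy model. It only restates the steps above and is not how Salt is implemented; model_minion and its arguments are hypothetical, and the model stops after the first find_job round, whereas the master keeps waiting for a minion whose job is reported as still running.

```shell
#!/bin/sh
# Toy model (not Salt's implementation) of how the master treats one
# targeted minion, following the workflow described in the text.
# All times are in seconds.
#   $1  when the minion returns the job result, or "never"
#   $2  when the minion answers find_job (counted from when find_job
#       is sent), or "never"
#   $3  timeout    $4  gather_job_timeout
model_minion() {
  ret=$1 fj=$2 timeout=$3 gjt=$4
  if [ "$ret" != never ] && [ "$ret" -le "$timeout" ]; then
    # normal workflow: the result arrives on the RET channel in time
    echo "result gathered after ${ret}s"
  elif [ "$fj" != never ] && [ "$fj" -le "$gjt" ]; then
    # find_job confirms the original job is still running
    echo "job still running: master keeps waiting"
  else
    # neither the job nor find_job answered in time
    echo "no response: dropped after $((timeout + gjt))s"
  fi
}
```

For example, model_minion never never 120 120 prints "no response: dropped after 240s", which matches the worst case for the SUSE Manager defaults.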
By default, SUSE Manager globally sets timeout and gather_job_timeout to 120 seconds.
In the worst case, therefore, a Salt call targeting unreachable minions waits 240 seconds (120 + 120) before returning a response.
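Because the worst case is the sum of the two parameters, it can be derived from the master configuration. This sketch reads /etc/salt/master.d/custom.conf, the file where the global timeout values can be configured, and falls back to 120 seconds for any value that is not set there. conf_value and worst_case_wait are hypothetical helpers, and they only understand plain top-level "key: value" lines, not full YAML.

```shell
#!/bin/sh
# Sketch: worst-case wait for a Salt call whose targets never answer,
# i.e. timeout + gather_job_timeout, read from a master config file.
# Assumption: plain top-level "key: value" lines with integer values;
# missing keys (or a missing file) fall back to the SUSE Manager default 120.
conf_value() {  # $1 = file, $2 = key, $3 = default
  v=$(awk -F ':' -v k="$2" '
    $1 == k { val = $2; gsub(/[ \t\r]/, "", val); last = val }
    END { print last }' "$1" 2>/dev/null)
  echo "${v:-$3}"
}

worst_case_wait() {  # $1 = config file (optional)
  conf=${1:-/etc/salt/master.d/custom.conf}
  t=$(conf_value "$conf" timeout 120)
  g=$(conf_value "$conf" gather_job_timeout 120)
  echo $((t + g))
}
```

Commented-out lines such as "# timeout: 60" do not match the key exactly and are therefore ignored; if a key appears more than once, the last occurrence wins.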
Two parameters control how presence pings from the Salt master are handled: one for the ping timeout and one for the ping gather job.
Salt batch calls begin with the Salt master performing a presence ping on the target minions. A ping gather job runs on the Salt master to collect the incoming ping responses from the minions. Batched commands begin only after all minions have either responded to the ping or timed out.
The presence ping is an ordinary Salt command, but it is not subject to the timeout parameters used by all other Salt commands (timeout/gather_job_timeout). Instead, it has its own parameters (presence_ping_timeout/presence_ping_gather_job_timeout).
You can configure the global timeout values in the /etc/salt/master.d/custom.conf configuration file.
To allow quicker detection of unresponsive minions, the presence ping timeouts are by default significantly shorter than those used elsewhere.
You can configure the presence ping parameters in /etc/rhn/rhn.conf; however, the default values should be sufficient in most cases.
A lower total presence ping timeout increases the chance of false negatives: a minion might be marked as non-responding when it is in fact responding, just not quickly enough. A higher total presence ping timeout increases the accuracy of the test, as even slow minions have time to respond before the ping times out. However, a higher presence ping timeout can also limit throughput when you target a large number of minions and some of them are slow.
If a minion does not reply to a ping within the allocated time, it is marked as not available and is excluded from the command.
In this case, the Web UI shows a minion is down message.
For more information on minion timeouts, see scale-minions.xml.
The presence ping timeout parameter sets the timeout for the presence ping, in seconds.
Adjust the java.salt_presence_ping_timeout parameter.
Defaults to 4 seconds.
The presence ping gather job parameter sets the timeout for gathering the presence ping responses, in seconds.
Adjust the java.salt_presence_ping_gather_job_timeout parameter.
Defaults to 1 second.
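A sketch for reading the two presence ping parameters from /etc/rhn/rhn.conf and adding them up. Treating the "total presence ping timeout" as the sum of the two values is an assumption made by analogy with timeout + gather_job_timeout; rhn_conf_get and presence_ping_total are hypothetical helpers, and only simple "key = value" lines with integer values are handled.

```shell
#!/bin/sh
# Sketch: read the presence ping settings from rhn.conf and add them up.
# Assumptions: "key = value" lines, whole-line # comments, integer seconds;
# unset keys fall back to the documented defaults (4 s and 1 s).
rhn_conf_get() {  # $1 = file, $2 = key, $3 = default
  v=$(awk -F '=' -v k="$2" '
    /^[ \t]*#/ { next }                      # skip commented-out lines
    { key = $1; gsub(/[ \t\r]/, "", key)
      if (key == k) { val = $2; gsub(/[ \t\r]/, "", val); last = val } }
    END { print last }' "$1" 2>/dev/null)
  echo "${v:-$3}"
}

presence_ping_total() {  # $1 = rhn.conf path (optional)
  conf=${1:-/etc/rhn/rhn.conf}
  t=$(rhn_conf_get "$conf" java.salt_presence_ping_timeout 4)
  g=$(rhn_conf_get "$conf" java.salt_presence_ping_gather_job_timeout 1)
  echo $((t + g))
}
```

With both parameters unset, presence_ping_total prints 5 (4 + 1 seconds), far below the 240-second worst case of ordinary Salt calls.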
Two parameters control how actions are sent to clients: one for the batch size and one for the delay.
When the Salt master sends an action to the target minions, it first sends it to as many minions as the batch size parameter allows. After the specified delay, the command is sent to the next batch of minions. The number of minions in each subsequent batch equals the number of minions that completed in the previous batch.
A lower batch size reduces system load and parallelism, but might reduce overall performance when processing actions.
The batch size parameter sets the maximum number of clients that can execute a single action at the same time.
Adjust the java.salt_batch_size parameter.
Defaults to 100.
Increasing the delay increases the chance that multiple minions have completed before the next batch is issued, resulting in fewer overall commands and reduced load.
The batch delay parameter sets the amount of time, in seconds, to wait after a command is processed before beginning to process the command on the next minion.
Adjust the java.salt_batch_delay parameter.
Defaults to 1.0 seconds.
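The effect of batch size and delay on a large action can be estimated with back-of-the-envelope arithmetic. Under the simplifying assumption that every minion in a batch finishes within the delay, each subsequent batch is full-sized, so N minions need ceil(N / batch size) batches separated by (batches - 1) delays. batch_estimate is a hypothetical helper, and the result is a lower bound: the time the command itself takes on each minion is not included.

```shell
#!/bin/sh
# Back-of-the-envelope sketch, assuming every minion in a batch completes
# within the batch delay, so each new batch is full-sized.
#   $1 = number of targeted minions
#   $2 = java.salt_batch_size (default 100)
#   $3 = java.salt_batch_delay in seconds (default 1.0)
batch_estimate() {
  awk -v n="$1" -v size="${2:-100}" -v delay="${3:-1.0}" 'BEGIN {
    batches = int((n + size - 1) / size)     # ceil(n / size) for integers
    wait = (batches > 0 ? batches - 1 : 0) * delay
    printf "%d batches, at least %.1f s of batch delay\n", batches, wait
  }'
}
```

With the defaults, batch_estimate 1000 prints "10 batches, at least 9.0 s of batch delay". Halving the batch size doubles the number of batches and therefore the accumulated delay, which is the trade-off between load and throughput described above.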
Salt SSH minions are slightly different from regular (ZeroMQ) minions. Salt SSH minions do not use the Salt PUB/RET channels; instead, a wrapper Salt command runs inside an SSH call.
The Salt timeout and gather_job_timeout parameters play no role here.
SUSE Manager defines a timeout for SSH connections in /etc/rhn/rhn.conf:
# salt_ssh_connect_timeout = 180
The presence ping mechanism also works with SSH minions. In this case, SUSE Manager uses salt_presence_ping_timeout to override the default timeout value for SSH connections.
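Put together, an SSH minion sees a different timeout depending on whether SUSE Manager is sending a presence ping or opening a regular connection. The sketch below picks the relevant value from /etc/rhn/rhn.conf. ssh_timeout_for is a hypothetical helper; the connection key is copied verbatim from the snippet above, the presence ping key is the rhn.conf name given earlier (java.salt_presence_ping_timeout), and you should verify both against your own rhn.conf. Defaults are the commented-out 180 for the connection and 4 for the presence ping.

```shell
#!/bin/sh
# Sketch: which timeout applies to a Salt SSH minion.
#   $1 = "presence" for the presence ping, anything else for a regular
#        SSH connection
#   $2 = rhn.conf path (optional)
# Only uncommented "key = value" lines count; key names are taken from
# the text and should be checked against the real configuration.
ssh_timeout_for() {
  conf=${2:-/etc/rhn/rhn.conf}
  case "$1" in
    presence) key=java.salt_presence_ping_timeout default=4 ;;
    *)        key=salt_ssh_connect_timeout default=180 ;;
  esac
  v=$(awk -F '=' -v k="$key" '
    /^[ \t]*#/ { next }                      # commented-out default
    { name = $1; gsub(/[ \t\r]/, "", name)
      if (name == k) { val = $2; gsub(/[ \t\r]/, "", val); last = val } }
    END { print last }' "$conf" 2>/dev/null)
  echo "${v:-$default}"
}
```

With the stock, commented-out setting, ssh_timeout_for connect prints 180 while ssh_timeout_for presence prints 4, so an unreachable SSH minion is detected by the presence ping long before a regular SSH connection would give up.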