Scaling Salt Clients
The rate at which SUSE Manager can on-board clients (accept Salt keys) is limited and depends on hardware resources. On-boarding clients faster than SUSE Manager is configured for builds up a backlog of unprocessed keys, which slows the process and can exhaust resources. It is recommended to limit the key acceptance rate programmatically. A safe starting point is to on-board one client every 15 seconds, which can be implemented with the following command:
for k in $(salt-key -l un | grep -v Unaccepted); do salt-key -y -a "$k"; sleep 15; done
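The one-liner above can also be written as a small reusable function, which makes the delay adjustable and allows a dry run before touching real keys. This is only a sketch: the function name accept_keys_throttled and the variables ONBOARD_DELAY and SALT_KEY_CMD are hypothetical names chosen for illustration, not part of SUSE Manager or Salt.

```shell
#!/bin/sh
# Sketch: accept pending Salt keys one at a time with a configurable pause.
# ONBOARD_DELAY and SALT_KEY_CMD are hypothetical variables for illustration.
ONBOARD_DELAY="${ONBOARD_DELAY:-15}"      # seconds to wait between acceptances
SALT_KEY_CMD="${SALT_KEY_CMD:-salt-key}"  # can be overridden for a dry run

# Read key names from standard input, one per line, and accept each in turn.
accept_keys_throttled() {
    while IFS= read -r key; do
        [ -n "$key" ] || continue                      # skip empty lines
        "$SALT_KEY_CMD" -y -a "$key" < /dev/null       # do not let the command consume our input
        sleep "$ONBOARD_DELAY"
    done
}
```

Assuming the same key listing as in the command above, it could be used as: salt-key -l un | grep -v Unaccepted | accept_keys_throttled. Setting SALT_KEY_CMD=echo beforehand prints the commands instead of running them.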
Clients that have not been on-boarded (clients running with unaccepted Salt keys) consume resources, in particular about 2.5 Kb/s of inbound network bandwidth per client. 1000 idle clients consume about 2.5 Mb/s, and this number drops to almost 0 once on-boarding has been completed. Limit the number of non-onboarded systems for optimal performance.
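As a rough planning aid, the per-client figure can be scaled to any number of pending clients. The sketch below only multiplies the approximate 2.5 Kb/s value quoted above; the function name pending_bandwidth_mbps is hypothetical, and real traffic will vary with the environment.

```shell
# Sketch: estimate inbound bandwidth used by clients with unaccepted keys.
# PER_CLIENT_KBPS is the approximate figure from the text; the function name is hypothetical.
PER_CLIENT_KBPS=2.5

pending_bandwidth_mbps() {
    # $1 = number of clients still waiting for key acceptance
    awk -v n="$1" -v k="$PER_CLIENT_KBPS" 'BEGIN { printf "%.1f\n", n * k / 1000 }'
}

pending_bandwidth_mbps 1000   # prints 2.5
```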
Salt features two timeout parameters called timeout and gather_job_timeout that are relevant during the execution of Salt commands and jobs. It does not matter whether they are triggered using the command line interface or the API.
These two parameters are explained in the following article.
This is the normal workflow when all clients are reachable:
A Salt command or job is executed:
salt '*' test.ping
Salt master publishes the job with the targeted clients into the Salt PUB channel.
Clients take that job and start working on it.
Salt master watches the Salt RET channel to gather responses from the clients.
If Salt master gets all responses from the targeted clients, everything is completed and Salt master returns a response containing all the client responses.
If some of the clients are down during this process, the workflow continues as follows:
If timeout is reached before all expected responses have arrived from the clients, Salt master triggers an additional job (a Salt find_job job) targeting only the pending clients, to check whether the job is already running on the client.
Now gather_job_timeout is evaluated. A new counter is started.
If this new find_job job reports that the original job is actually running on the client, Salt master waits for that client's response.
If gather_job_timeout is reached without any response from the client (neither for the initial test.ping nor for the find_job job), Salt master returns with only the responses gathered from the responding clients.
By default, SUSE Manager globally sets timeout and gather_job_timeout to 120 seconds.
So, in the worst case, a Salt call targeting unreachable clients waits 240 seconds (120 seconds for timeout plus 120 seconds for gather_job_timeout) before returning a response.
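The worst-case figure follows from adding the two timeouts, as described in the workflow above. The sketch below reproduces that arithmetic so the effect of other values can be checked quickly; the function name worst_case_wait is hypothetical, and the simple sum is an assumption based on the description here rather than a statement about Salt internals.

```shell
# Sketch: worst-case wait for a Salt call when targeted clients are unreachable.
# Values are the defaults quoted in the text; worst_case_wait is a hypothetical helper.
TIMEOUT=120
GATHER_JOB_TIMEOUT=120

worst_case_wait() {
    # $1 = timeout, $2 = gather_job_timeout, both in seconds
    echo $(( $1 + $2 ))
}

worst_case_wait "$TIMEOUT" "$GATHER_JOB_TIMEOUT"   # prints 240
worst_case_wait 30 20                              # prints 50
```

Under the same assumption, lowering either value shortens the wait for unreachable clients, at the cost of giving slow but healthy clients less time to answer.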
Salt SSH clients are slightly different from regular (ZeroMQ) clients. Salt SSH clients do not use the Salt PUB/RET channels but a wrapper Salt command inside of an SSH call. The timeout and gather_job_timeout parameters play no role here.
SUSE Manager defines a timeout for SSH connections in
# salt_ssh_connect_timeout = 180
The presence ping mechanism also works with SSH clients.
In this case, SUSE Manager uses salt_presence_ping_timeout to override the default timeout value for SSH connections.