Scaling Salt Clients

Salt Client Onboarding Rate

The rate at which SUSE Manager can on-board clients (accept Salt keys) is limited and depends on hardware resources. On-boarding clients faster than SUSE Manager is configured for builds up a backlog of unprocessed keys, which slows the process and can exhaust resources. It is recommended to limit the key acceptance rate programmatically. A safe starting point is to on-board one client every 15 seconds, which can be implemented with the following command:

for k in $(salt-key -l un | grep -v Unaccepted); do salt-key -y -a "$k"; sleep 15; done
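
The one-liner above can also be written as a small script, which makes the pause adjustable and gives a rough duration estimate. This is only a sketch: the names SALT_KEY_CMD, ONBOARD_INTERVAL, list_pending_keys, onboard_pending_keys and estimate_onboarding_seconds are hypothetical and introduced for illustration; only salt-key -l un and salt-key -y -a come from the command above.

```shell
#!/bin/sh
# Hypothetical knobs for this sketch (not SUSE Manager settings).
SALT_KEY_CMD="${SALT_KEY_CMD:-salt-key}"
ONBOARD_INTERVAL="${ONBOARD_INTERVAL:-15}"   # seconds between acceptances

# Print the IDs of unaccepted keys, one per line (header line filtered out).
list_pending_keys() {
    "$SALT_KEY_CMD" -l un | grep -v Unaccepted
}

# Accept every pending key, pausing after each acceptance.
onboard_pending_keys() {
    list_pending_keys | while IFS= read -r key; do
        [ -n "$key" ] || continue
        # </dev/null keeps the command from consuming the key list on stdin.
        "$SALT_KEY_CMD" -y -a "$key" </dev/null || echo "failed to accept $key" >&2
        sleep "$ONBOARD_INTERVAL"
    done
}

# Rough estimate, in seconds, of how long on-boarding N clients takes.
estimate_onboarding_seconds() {
    echo $(( $1 * ONBOARD_INTERVAL ))
}
```

At the suggested 15-second interval, estimate_onboarding_seconds 1000 yields 15000 seconds, a little over four hours, which may help when planning a large roll-out.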

Clients Running with Unaccepted Salt Keys

Clients that have not been on-boarded (clients running with unaccepted Salt keys) consume resources, in particular about 2.5 Kb/s of inbound network bandwidth per client. 1000 idle clients consume around 2.5 Mb/s, and this number drops to almost 0 once on-boarding has been completed. For optimal performance, keep the number of clients that are not on-boarded to a minimum.
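
The bandwidth figure scales linearly with the number of pending clients. As a rough sketch, assuming the 2.5 Kb/s per client stated above holds for your environment, the total can be estimated as follows (estimate_idle_bandwidth_kbps is a hypothetical helper name):

```shell
#!/bin/sh
# Estimate inbound bandwidth, in Kb/s, used by clients with unaccepted keys.
# Uses 25/10 instead of 2.5 because POSIX shell arithmetic is integer-only.
estimate_idle_bandwidth_kbps() {
    clients=$1
    echo $(( clients * 25 / 10 ))
}
```

For example, estimate_idle_bandwidth_kbps 1000 gives 2500 Kb/s, matching the 2.5 Mb/s figure above. The number of pending clients could be obtained with salt-key -l un | grep -vc Unaccepted.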

Salt Timeouts

Background Information

Salt features two timeout parameters, timeout and gather_job_timeout, that are relevant during the execution of Salt commands and jobs, regardless of whether they are triggered from the command line interface or the API. Both parameters are explained below.

This is the normal workflow when all clients are reachable:

  • A Salt command or job is executed:

    salt '*'

  • Salt master publishes the job, with its targeted clients, to the Salt PUB channel.

  • Clients take that job and start working on it.

  • Salt master watches the Salt RET channel to gather responses from the clients.

  • If Salt master gets responses from all targeted clients, the job is complete and Salt master returns a response containing all the client responses.

If some of the clients are down during this process, the workflow continues as follows:

  1. If timeout is reached before all expected responses have arrived, Salt master triggers an additional job (a Salt find_job job) targeting only the pending clients, to check whether the original job is still running on them.

  2. gather_job_timeout is now evaluated, and a new timer starts.

  3. If the find_job job reports that the original job is actually running on the client, Salt master waits for that client’s response.

  4. If gather_job_timeout is reached without any response from the client (neither for the initial job nor for the find_job job), Salt master returns with only the responses gathered from the responding clients.

By default, SUSE Manager globally sets timeout and gather_job_timeout to 120 seconds each. In the worst case, a Salt call targeting unreachable clients therefore waits 240 seconds before returning a response.
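
The worst case is simply the sum of the two timers, because gather_job_timeout only starts counting after timeout has expired. A minimal sketch of that arithmetic follows; worst_case_wait_seconds is a hypothetical helper, and its defaults are the 120-second values mentioned above:

```shell
#!/bin/sh
# Worst-case wait for a Salt call when targeted clients never respond:
# timeout expires first, then gather_job_timeout runs for the find_job check.
worst_case_wait_seconds() {
    timeout=${1:-120}
    gather_job_timeout=${2:-120}
    echo $(( timeout + gather_job_timeout ))
}
```

Called without arguments, the function prints 240. Calling it with other values, for example worst_case_wait_seconds 60 30, shows how lowering either parameter shortens the wait for unreachable clients.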

Salt SSH Clients (SSH Push)

Salt SSH clients are slightly different from regular clients (zeromq). Salt SSH clients do not use the Salt PUB/RET channels but a wrapper Salt command inside an SSH call. The Salt timeout and gather_job_timeout parameters play no role here.

SUSE Manager defines a timeout for SSH connections in /etc/rhn/rhn.conf:

# salt_ssh_connect_timeout = 180

The presence ping mechanism also works with SSH clients. In this case, SUSE Manager uses salt_presence_ping_timeout to override the default timeout value for SSH connections.
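
To check which SSH connection timeout is in effect, the setting can be read from the configuration file. The following sketch assumes that a commented-out entry means the built-in default applies and that this default is the 180 seconds shown in the commented line; rhn_conf_value is a hypothetical helper, not a SUSE Manager tool.

```shell
#!/bin/sh
# Print the value of KEY from an rhn.conf-style file ("key = value" lines).
# Falls back to DEFAULT when the key is missing or commented out.
rhn_conf_value() {
    key=$1
    default=$2
    file=${3:-/etc/rhn/rhn.conf}
    # Only lines that start with the key (optionally indented) are matched;
    # lines starting with '#' are ignored. The last match wins.
    value=$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" 2>/dev/null | tail -n 1)
    echo "${value:-$default}"
}
```

For example, rhn_conf_value salt_ssh_connect_timeout 180 prints the configured value if the line has been uncommented, and 180 otherwise.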