Tuning Large Scale Deployments
- 1. The Tuning Process
- 2. Environmental Variables
- 3. Parameters
- 3.1. MaxClients
- 3.2. ServerLimit
- 3.3. maxThreads
- 3.4. connectionTimeout
- 3.5. keepAliveTimeout
- 3.6. Tomcat’s -Xmx
- 3.7. java.disable_list_update_status
- 3.8. java.message_queue_thread_pool_size
- 3.9. java.salt_batch_size
- 3.10. java.salt_event_thread_pool_size
- 3.11. java.salt_presence_ping_timeout
- 3.12. java.salt_presence_ping_gather_job_timeout
- 3.13. java.taskomatic_channel_repodata_workers
- 3.14. taskomatic.java.maxmemory
- 3.15. org.quartz.threadPool.threadCount
- 3.16. org.quartz.scheduler.idleWaitTime
- 3.17. MinionActionExecutor.parallel_threads
- 3.18. SSHMinionActionExecutor.parallel_threads
- 3.19. hibernate.c3p0.max_size
- 3.20. rhn-search.java.maxmemory
- 3.21. shared_buffers
- 3.22. max_connections
- 3.23. work_mem
- 3.24. effective_cache_size
- 3.25. thread_pool
- 3.26. worker_threads
- 3.27. auth_events
- 3.28. minion_data_cache_events
- 3.29. pub_hwm
- 3.30. zmq_backlog
- 3.31. swappiness
- 3.32. wait_for_backend
- 3.33. tcp_keepalive
- 4. Memory Usage
SUSE Manager is designed by default to work on small and medium scale installations. For installations with more than 1000 clients per SUSE Manager Server, adequate hardware sizing and parameter tuning must be performed.
The instructions in this section can have severe and catastrophic performance impacts when improperly used. In some cases, they can cause SUSE Manager to completely cease functioning. Always test changes before implementing them in a production environment. During implementation, take care when changing parameters. Monitor performance before and after each change, and revert any steps that do not produce the expected result.
We strongly recommend that you contact SUSE Consulting for assistance with tuning. SUSE will not provide support for catastrophic failure when these advanced parameters are modified without consultation.
Tuning is not required on installations of fewer than 1000 clients. Do not perform these instructions on small or medium scale installations.
1. The Tuning Process
Any SUSE Manager installation is subject to a number of design and infrastructure constraints that, for the purposes of tuning, we call environmental variables. Environmental variables can include the total number of clients, the number of different operating systems under management, and the number of software channels.
Environmental variables influence, either directly or indirectly, the value of most configuration parameters. During the tuning process, the configuration parameters are manipulated to improve system performance.
Before you begin tuning, you will need to estimate the best setting for each environmental variable, and adjust the configuration parameters to suit.
To help you with the estimation process, we have provided you with a dependency graph. Locate the environmental variables on the dependency graph to determine how they will influence other variables and parameters.
Environmental variables are represented by graph nodes in a rectangle at the top of the dependency graph. Each node is connected to the relevant parameters that might need tuning. Consult the relevant sections in this document for more information about recommended values.
Tuning one parameter might require tuning other parameters, or changing hardware, or the infrastructure. When you change a parameter, follow the arrows from that node on the graph to determine what other parameters might need adjustment. Continue through each parameter until you have visited all nodes on the graph.
- 3D boxes are hardware design variables or constraints
- Oval-shaped boxes are software or system design variables or constraints
- Rectangle-shaped boxes are configurable parameters, color-coded by configuration file:
  - Red: Apache httpd configuration files
  - Blue: Salt configuration files
  - Brown: Tomcat configuration files
  - Grey: PostgreSQL configuration files
  - Purple: /etc/rhn/rhn.conf
- Dashed connecting lines indicate a variable or constraint that might require a change to another parameter
- Solid connecting lines indicate that changing a configuration parameter requires checking another one to prevent issues
After the initial tuning has been completed, you will need to consider tuning again in these cases:
- If your tuning inputs change significantly
- If special conditions arise that require a certain parameter to be changed. For example, if specific warnings appear in a log file.
- If performance is not satisfactory
To re-tune your installation, you will need to use the dependency graph again. Start from the node where significant change has happened.
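The graph walk described in this section can be sketched as a breadth-first traversal. This is a minimal illustration only: the node names and edges in `EXAMPLE_GRAPH` are hypothetical placeholders and are not taken from the actual SUSE Manager dependency graph.

```python
from collections import deque

# Hypothetical excerpt of a dependency graph. Each key is a node and each
# value lists the nodes its arrows point to. These names and edges are
# placeholders, NOT the real SUSE Manager dependency graph.
EXAMPLE_GRAPH = {
    "client_count": ["param_a", "param_b"],
    "param_a": ["param_c"],
    "param_b": ["param_c", "ram"],
    "param_c": [],
    "ram": ["param_d"],
    "param_d": [],
}


def nodes_to_review(graph, changed_node):
    """Return every node reachable from changed_node, in breadth-first order."""
    seen = {changed_node}
    order = []
    queue = deque([changed_node])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


if __name__ == "__main__":
    # Start from the node where the significant change happened.
    print(nodes_to_review(EXAMPLE_GRAPH, "client_count"))
```

Starting the walk from the changed node, rather than from the top of the graph, keeps a re-tuning pass limited to the parameters that the change can actually affect.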
2. Environmental Variables
This section contains information about environmental variables (inputs to the tuning process).
- Network Bandwidth: A measure of the typically available egress bandwidth from the SUSE Manager Server host to the clients or SUSE Manager Proxy hosts. This should take into account network hardware and topology as well as possible capacity limits on switches, routers, and other network equipment between the server and clients.
- Channel count: The number of expected channels to manage. Includes any vendor-provided, third-party, and cloned or staged channels.
- Client count: The total number of actual or expected clients. It is important to tune any parameters in advance of a client count increase, whenever possible.
- OS mix: The number of distinct operating system versions that managed clients have installed. This is ordered by family (SUSE Linux Enterprise, openSUSE, Red Hat Enterprise Linux, or Ubuntu based). Storage and computing requirements are different in each case.
- User count: The expected maximum number of concurrent users interacting with the Web UI plus the number of programs simultaneously using the XMLRPC API. Includes spacecmd, spacewalk-clone-by-date, and similar.
3. Parameters
This section contains information about the available parameters.
3.1. MaxClients
Description | The maximum number of HTTP requests served simultaneously by Apache httpd. Proxies, Web UI, and XMLRPC API clients each consume one. Requests exceeding the parameter will be queued and might result in timeouts.
Tune when | User count and proxy count increase significantly and this line appears in
Value default | 150
Value recommendation | 150-500
Location |
Example |
After changing | Immediately change
Notes | This parameter was renamed to
More information | https://httpd.apache.org/docs/2.4/en/mod/mpm_common.html#maxrequestworkers
3.2. ServerLimit
Description | The number of Apache httpd processes serving HTTP requests simultaneously. The number must equal
Tune when |
Value default | 150
Value recommendation | The same value as
Location |
Example |
More information | https://httpd.apache.org/docs/2.4/en/mod/mpm_common.html#serverlimit
3.3. maxThreads
Description | The number of Tomcat threads dedicated to serving HTTP requests
Tune when |
Value default | 150
Value recommendation | The same value as
Location |
Example |
More information |
3.4. connectionTimeout
Description | The number of milliseconds before a non-responding AJP connection is forcibly closed.
Tune when | Client count increases significantly and
Value default | 900000
Value recommendation | 20000-3600000
Location |
Example |
More information |
3.5. keepAliveTimeout
Description | The number of milliseconds without data exchange from the JVM before a non-responding AJP connection is forcibly closed.
Tune when | Client count increases significantly and
Value default | 300000
Value recommendation | 20000-600000
Location |
Example |
More information |
3.6. Tomcat’s -Xmx
Description | The maximum amount of memory Tomcat can use
Tune when |
Value default | 1 GiB
Value recommendation | 4-8 GiB
Location |
Example |
After changing | Check memory usage
More information | https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html
3.7. java.disable_list_update_status
Description | Disable displaying the update status for clients of a system group
Tune when | Displaying the update status causes timeouts
Value default |
Value recommendation |
Location |
Example |
After changing | ?
Notes |
More information |
3.8. java.message_queue_thread_pool_size
Description | The maximum number of threads in Tomcat dedicated to asynchronous operations
Tune when | Client count increases significantly
Value default | 5
Value recommendation | 50-150
Location |
Example |
After changing | Check
Notes | Incoming Salt events are handled in a separate thread pool, see
More information |
3.9. java.salt_batch_size
Description | The maximum number of minions concurrently executing a scheduled action.
Tune when | Client count reaches several thousand and actions are not executed quickly enough.
Value default | 200
Value recommendation | 200-500
Location |
Example |
After changing | Check memory usage. Monitor memory usage closely before and after the change.
More information |
3.10. java.salt_event_thread_pool_size
Description | The maximum number of threads in Tomcat dedicated to handling incoming Salt events.
Tune when | The number of queued Salt events grows. Typically, this can happen during onboarding of a large number of minions with a higher value of
Value default | 8
Value recommendation | 20-100
Location |
Example |
After changing | Check the length of the Salt event queue. Check
More information |
3.11. java.salt_presence_ping_timeout
Description | Before any action is executed on a client, a presence ping is executed to make sure the client is reachable. This parameter sets the amount of time before a second command (in most cases
Tune when | Client count increases significantly, or some clients are responding correctly but too slowly, and SUSE Manager excludes them from calls. This line appears in
Value default | 4 seconds
Value recommendation | 4-20 seconds
Location |
Example |
After changing | Large
More information |
3.12. java.salt_presence_ping_gather_job_timeout
Description | Before any action is executed on a client, a presence ping is executed to make sure the client is reachable. After
Tune when | Client count increases significantly, or some clients are responding correctly but too slowly, and SUSE Manager excludes them from calls. This line appears in
Value default | 1 second
Value recommendation | 1-50 seconds
Location |
Example |
More information |
3.13. java.taskomatic_channel_repodata_workers
Description | Whenever content is changed in a software channel, its metadata needs to be recomputed before clients can use it. Channel-altering operations include the addition of a patch, the removal of a package, or a repository synchronization run. This parameter specifies the maximum number of Taskomatic threads that SUSE Manager will use to recompute the channel metadata. Channel metadata computation is both CPU-bound and memory-heavy. Raising this parameter and operating on many channels simultaneously could cause Taskomatic to consume significant resources, but channels will be available to clients sooner.
Tune when | Channel count increases significantly (more than 50), or more concurrent operations on channels are expected.
Value default | 2
Value recommendation | 2-10
Location |
Example |
After changing | Check
More information |
3.14. taskomatic.java.maxmemory
Description | The maximum amount of memory Taskomatic can use. Generation of metadata, especially for some operating systems, can be memory-intensive, so this parameter might need raising depending on the managed OS mix.
Tune when |
Value default | 4096 MiB
Value recommendation | 4096-16384 MiB
Location |
Example |
After changing | Check memory usage.
More information |
3.15. org.quartz.threadPool.threadCount
Description | The number of Taskomatic worker threads. Increasing this value allows Taskomatic to serve more clients in parallel.
Tune when | Client count increases significantly
Value default | 20
Value recommendation | 20-200
Location |
Example |
After changing | Check
More information | http://www.quartz-scheduler.org/documentation/2.4.0-SNAPSHOT/configuration.html
3.16. org.quartz.scheduler.idleWaitTime
Description | Cycle time for Taskomatic. Decreasing this value lowers the latency of Taskomatic.
Tune when | Client count is in the thousands.
Value default | 5000 ms
Value recommendation | 1000-5000 ms
Location |
Example |
More information | http://www.quartz-scheduler.org/documentation/2.4.0-SNAPSHOT/configuration.html
3.17. MinionActionExecutor.parallel_threads
Description | Number of Taskomatic threads dedicated to sending commands to Salt clients as a result of actions being executed.
Tune when | Client count is in the thousands.
Value default | 1
Value recommendation | 1-10
Location |
Example |
3.18. SSHMinionActionExecutor.parallel_threads
Description | Number of Taskomatic threads dedicated to sending commands to Salt SSH clients as a result of actions being executed.
Tune when | Client count is in the hundreds.
Value default | 20
Value recommendation | 20-100
Location |
Example |
3.19. hibernate.c3p0.max_size
Description | Maximum number of PostgreSQL connections simultaneously available to both Tomcat and Taskomatic. If either component requires more concurrent connections, its requests will be queued.
Tune when |
Value default | 20
Value recommendation | 100 to 200, higher than the maximum of
Location |
Example |
After changing | Check
More information |
3.20. rhn-search.java.maxmemory
Description | The maximum amount of memory that the
Tune when | Client count increases significantly, and
Value default | 512 MiB
Value recommendation | 512-4096 MiB
Location |
Example |
After changing | Check memory usage.
3.21. shared_buffers
Description | The amount of memory reserved for PostgreSQL shared buffers, which contain caches of database tables and index data.
Tune when | RAM changes
Value default | 25% of total RAM
Value recommendation | 25-40% of total RAM
Location |
Example |
After changing | Check memory usage.
More information | https://www.postgresql.org/docs/15/runtime-config-resource.html#GUC-SHARED-BUFFERS
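As a rough aid for the 25-40% recommendation above, the following sketch converts a total RAM figure into the corresponding range. The function name and the choice of MiB as the unit are assumptions made for illustration. The result is a starting range to test, not a definitive value.

```python
def shared_buffers_range_mib(total_ram_mib):
    """Return (low, high): 25% and 40% of total RAM, in MiB, rounded down."""
    if total_ram_mib <= 0:
        raise ValueError("total_ram_mib must be positive")
    low = total_ram_mib * 25 // 100
    high = total_ram_mib * 40 // 100
    return low, high


if __name__ == "__main__":
    # Example: a server with 64 GiB (65536 MiB) of RAM.
    low, high = shared_buffers_range_mib(65536)
    print(f"shared_buffers candidate range: {low}-{high} MiB")
```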
3.22. max_connections
Description | Maximum number of PostgreSQL connections available to applications. More connections allow for more concurrent threads/workers in various components (in particular Tomcat and Taskomatic), which generally improves performance. However, each connection consumes resources, in particular
Tune when |
Value default | 400
Value recommendation | Depends on other settings, use
Location |
Example |
After changing | Check memory usage. Monitor memory usage closely before and after the change.
More information | https://www.postgresql.org/docs/15/runtime-config-connection.html#GUC-MAX-CONNECTIONS
3.23. work_mem
Description | The amount of memory allocated by PostgreSQL every time a connection needs to do a sort or hash operation. Every connection (as specified by
Tune when | Database operations are slow because of excessive temporary file disk I/O. To test if that is happening, add
Value recommendation | 2-20 MB
Location |
Example |
After changing | Check if the SUSE Manager Server might need additional RAM.
More information | https://www.postgresql.org/docs/15/runtime-config-resource.html#GUC-WORK-MEM
3.24. effective_cache_size
Description | Estimation of the total memory available to PostgreSQL for caching. It is the explicitly reserved memory (
Tune when | Hardware RAM or memory usage increases significantly
Value recommendation | Start with 75% of total RAM. For finer settings, use
Location |
Example |
After changing | Check memory usage
Notes | This is an estimation for the query planner, not an allocation.
More information | https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE
3.25. thread_pool
Description | The number of worker threads serving Salt API HTTP requests. A higher number can improve parallelism of SUSE Manager Server-initiated Salt operations, but will consume more memory.
Tune when |
Value default | 100
Value recommendation | 100-500, but should be higher than the sum of
Location |
Example |
After changing | Check
More information |
3.26. worker_threads
Description | The number of
Tune when | Client count increases significantly,
Value default | 8
Value recommendation | 8-32, depending on the number of CPU cores available for the server. Keep the value slightly lower than the number of CPU cores.
Location |
Example |
After changing | Check memory usage. Monitor memory usage closely before and after the change. It makes sense to monitor the
More information | https://docs.saltstack.com/en/latest/ref/configuration/master.html#worker-threads
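One way to read the recommendation above is to take the core count, subtract a small headroom, and clamp the result to the 8-32 range. The sketch below does that. The function name and the `headroom` value of 2 are assumptions, because the text only says "slightly lower". On hosts with few cores the lower bound of 8 wins, so treat the output as a starting point to test.

```python
import os


def suggest_worker_threads(cpu_cores, headroom=2, low=8, high=32):
    """Clamp (cpu_cores - headroom) to the recommended low-high range.

    headroom is an assumed interpretation of "slightly lower than the
    number of CPU cores"; adjust it to suit the host.
    """
    return max(low, min(high, cpu_cores - headroom))


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print(f"{cores} cores -> worker_threads candidate: {suggest_worker_threads(cores)}")
```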
3.27. auth_events
Description | Determines whether the master will fire authentication events. Authentication events are fired when a minion performs an authentication check with the master. Disabling them reduces the number of events published by the Salt Master Event Publisher and the workload on Event Publisher subscribers.
Tune when | Large amount of
Value default | True
Value recommendation | False
Location |
Example |
More information | https://docs.saltproject.io/en/latest/ref/configuration/master.html#auth-events
3.28. minion_data_cache_events
Description | Determines whether the master will fire minion data cache events (
Tune when | Large amount of
Value default | True
Value recommendation | False
Location |
Example |
More information | https://docs.saltproject.io/en/latest/ref/configuration/master.html#minion-data-cache-events
3.29. pub_hwm
Description | The maximum number of outstanding messages sent by
Tune when | Client count increases significantly and
Value default | 1000
Value recommendation | 10000-100000
Location |
Example |
More information | https://docs.saltstack.com/en/latest/ref/configuration/master.html#pub-hwm, https://zeromq.org/socket-api/#high-water-mark
3.30. zmq_backlog
Description | The maximum number of allowed client connections that have started but not concluded the opening process. If more than this number of clients connect in a very short time frame, connections are dropped and clients experience a delay reconnecting.
Tune when | Client count increases significantly and very many clients reconnect in a short time frame, TCP connections to the
Value default | 1000
Value recommendation | 1000-5000
Location |
Example |
More information | https://docs.saltstack.com/en/latest/ref/configuration/master.html#zmq-backlog, http://api.zeromq.org/3-0:zmq-getsockopt (
3.31. swappiness
Description | How aggressively the kernel moves unused data from memory to the swap partition. Setting a lower value typically reduces swap usage and results in better performance, especially when RAM is abundant.
Tune when | RAM increases, or swap is used when RAM is sufficient.
Value default | 60
Value recommendation | 1-60. For 128 GB of RAM, 10 is expected to give good results.
Location |
Example |
More information | https://documentation.suse.com/sles/15-SP4/html/SLES-all/cha-tuning-memory.html#cha-tuning-memory-vm
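Before changing this parameter, it can help to record the value currently in effect. The sketch below reads it from `/proc/sys/vm/swappiness`, the standard Linux procfs entry for this kernel setting, and checks it against the 1-60 range recommended above. The helper names are illustrative, and the sketch only reads the value. It does not change it.

```python
from pathlib import Path

# Standard Linux procfs entry for the vm.swappiness kernel setting.
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def parse_swappiness(text):
    """Parse the contents of the swappiness procfs file into an integer."""
    return int(text.strip())


def within_recommended_range(value, low=1, high=60):
    """Return True if value falls inside the range recommended above."""
    return low <= value <= high


if __name__ == "__main__":
    current = parse_swappiness(Path(SWAPPINESS_PATH).read_text())
    status = "inside" if within_recommended_range(current) else "outside"
    print(f"vm.swappiness is {current} ({status} the 1-60 range)")
```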
3.32. wait_for_backend
Description | Determines whether the
Tune when | Unstable connectivity between the SUSE Manager Proxy and the SUSE Manager Server.
Value default | False
Value recommendation | True
Location |
Example |
More information |
3.33. tcp_keepalive
Description | The TCP keepalive interval to set on TCP ports. This setting can be used to tune Salt connectivity issues in messy network environments with misbehaving firewalls.
Tune when | Unstable connectivity between managed clients and the SUSE Manager Proxy or the SUSE Manager Server.
Value default | True
Value recommendation | True
Location |
Example |
After changing | See Minions Connectivity for details on fine-tuning additional keepalive parameters.
More information | https://docs.saltproject.io/en/latest/ref/configuration/minion.html#tcp-keepalive, Minions Connectivity
4. Memory Usage
Adjusting some of the parameters listed in this section can result in a higher amount of RAM being used by various components. It is important that the amount of hardware RAM is adequate after any significant change.
To determine how RAM is being used, you will need to check each process that consumes it.
- Operating system: Stop all SUSE Manager services and inspect the output of free -h.
- Java-based components: This includes Taskomatic, Tomcat, and rhn-search. These services support a configurable memory cap.
- The SUSE Manager Server: Depends on many factors and can only be estimated. Measure the memory that PostgreSQL reserves permanently by checking shared_buffers. You can also multiply work_mem and max_connections, and multiply by three for a worst case estimate of per-query RAM. You will also need to check the operating system buffers and caches, which are used by PostgreSQL to host copies of database data. These often automatically occupy any available RAM.
It is important that the SUSE Manager Server has sufficient RAM to accommodate all of these processes, especially OS buffers and caches, to have reasonable PostgreSQL performance. We recommend you keep several gigabytes available at all times, and add more as the database size on disk increases.
Whenever the expected amount of memory available for OS buffers and caches changes, update the effective_cache_size parameter to have PostgreSQL use it correctly. You can calculate the total available by finding the total RAM available, less the expected memory usage.
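The arithmetic described in this section can be sketched as follows. The function names, the input figures in the example, and the decision to express every quantity in MiB are assumptions made for illustration. The sketch subtracts the operating system baseline, the Java memory caps, shared_buffers, and the worst case per-query estimate from total RAM to approximate what remains for OS buffers and caches.

```python
def worst_case_query_ram_mib(work_mem_mib, max_connections):
    """Worst case per-query RAM: work_mem x max_connections x 3."""
    return work_mem_mib * max_connections * 3


def available_for_os_caches_mib(total_ram_mib, os_baseline_mib, java_caps_mib,
                                shared_buffers_mib, work_mem_mib, max_connections):
    """Total RAM less the expected memory usage, floored at zero."""
    expected_usage = (
        os_baseline_mib
        + sum(java_caps_mib)
        + shared_buffers_mib
        + worst_case_query_ram_mib(work_mem_mib, max_connections)
    )
    return max(0, total_ram_mib - expected_usage)


if __name__ == "__main__":
    # Illustrative inputs only, not recommendations.
    remaining = available_for_os_caches_mib(
        total_ram_mib=65536,              # 64 GiB host
        os_baseline_mib=2048,             # e.g. measured with free -h, services stopped
        java_caps_mib=[8192, 4096, 512],  # Tomcat, Taskomatic, rhn-search caps
        shared_buffers_mib=16384,
        work_mem_mib=5,
        max_connections=400,
    )
    print(f"Approximately {remaining} MiB left for OS buffers and caches")
```

A result close to zero suggests that the configured caps and connection limits leave little room for caching, so either the parameters or the hardware RAM need revisiting.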
To get a live breakdown of the memory used by services on the SUSE Manager Server, use this command:
pidstat -p ALL -r --human 1 60 | tee pidstat-memory.log
This command will save a copy of displayed data in the pidstat-memory.log file for later analysis.