9 Kernel Control Groups
Kernel Control Groups (“cgroups”) are a kernel feature that allows assigning and limiting hardware and system resources for processes. Processes can also be organized in a hierarchical tree structure.
9.1 Overview
Every process is assigned exactly one administrative cgroup. cgroups are ordered in a hierarchical tree structure. You can set resource limitations, such as CPU, memory, disk I/O, or network bandwidth usage, for single processes or for whole branches of the hierarchy tree.
On SUSE Linux Enterprise Server, systemd uses cgroups to organize all processes in groups, which systemd calls slices. systemd also provides an interface for setting cgroup properties.
The command systemd-cgls displays the hierarchy tree.
This chapter is an overview. For more details, refer to the listed references.
9.2 Setting Resource Limits
Be aware that resource consumption implicitly depends on the environment where your workload executes (for example, the size of data structures in libraries or the kernel, the forking behavior of utilities, or computational efficiency). It is therefore recommended to recalibrate your limits whenever the environment changes.
Limitations to cgroups can be set with the systemctl set-property command. The syntax is:
# systemctl set-property [--runtime] NAME PROPERTY1=VALUE [PROPERTY2=VALUE]
Optionally, use the --runtime option. With this option, set limits do not persist after the next reboot.
Replace NAME with a systemd service, slice, scope, socket, mount, or swap name. Replace properties with one or more of the following:
CPUAccounting=[yes|no]
Turns on CPU usage accounting. This property takes yes and no as arguments.
Example:
# systemctl set-property user.slice CPUAccounting=yes
CPUQuota=PERCENTAGE
Assigns CPU time to processes. The value is a percentage followed by % as a suffix. This implies CPUAccounting=yes.
Example:
# systemctl set-property user.slice CPUQuota=50%
MemoryAccounting=[yes|no]
Turns on memory usage accounting. This property takes yes and no as arguments.
Example:
# systemctl set-property user.slice MemoryAccounting=yes
MemoryLow=BYTES
Unused memory from processes below this limit will not be reclaimed for other use. Use suffixes K, M, G or T for BYTES. This implies MemoryAccounting=yes.
Example:
# systemctl set-property nginx.service MemoryLow=512M
Note: Unified Control Group Hierarchy
This setting is available only if the unified control group hierarchy is used, and disables MemoryLimit=. To enable the unified control group hierarchy, append systemd.unified_cgroup_hierarchy=1 as a kernel command line parameter to the GRUB 2 boot loader. Refer to Chapter 14, The Boot Loader GRUB 2 for more details about configuring GRUB 2.
MemoryHigh=BYTES
If more memory above this limit is used, memory is aggressively taken away from the processes. Use suffixes K, M, G or T for BYTES. This implies MemoryAccounting=yes.
Example:
# systemctl set-property nginx.service MemoryHigh=2G
Note: Unified Control Group Hierarchy
This setting is available only if the unified control group hierarchy is used, and disables MemoryLimit=. To enable the unified control group hierarchy, append systemd.unified_cgroup_hierarchy=1 as a kernel command line parameter to the GRUB 2 boot loader. For more details about configuring GRUB 2, see Chapter 14, The Boot Loader GRUB 2.
MemoryMax=BYTES
Sets a maximum limit for used memory. Processes will be killed if they use more memory than allowed. Use suffixes K, M, G or T for BYTES. This implies MemoryAccounting=yes.
Example:
# systemctl set-property nginx.service MemoryMax=4G
DeviceAllow=
Allows read (r), write (w) and mknod (m) access. The property takes a device node specifier and a list of r, w or m, separated by white space.
Example:
# systemctl set-property system.slice DeviceAllow="/dev/sdb1 r"
DevicePolicy=[auto|closed|strict]
When set to strict, only access to devices that are listed in DeviceAllow is allowed. closed additionally allows access to standard pseudo devices including /dev/null, /dev/zero, /dev/full, /dev/random, and /dev/urandom. auto allows access to all devices if no specific rule is defined in DeviceAllow. auto is the default setting.
For more details and a complete list of properties, see man systemd.resource-control.
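To get a feeling for what a BYTES value with a suffix amounts to, the following sketch converts such a value to plain bytes. It assumes that the K, M, G, and T suffixes use base 1024, as described in systemd.resource-control(5), and it only handles whole numbers. The function name to_bytes is hypothetical and is not part of systemd.

```shell
#!/bin/sh
# Convert a size such as 512M or 4G to bytes, using base 1024.
# to_bytes is an illustrative helper, not a systemd command.
to_bytes() {
    value=$1
    number=${value%[KMGT]}      # strip one trailing suffix, if present
    suffix=${value#"$number"}   # the suffix that was stripped (may be empty)
    case $suffix in
        K) echo $(( number * 1024 )) ;;
        M) echo $(( number * 1024 * 1024 )) ;;
        G) echo $(( number * 1024 * 1024 * 1024 )) ;;
        T) echo $(( number * 1024 * 1024 * 1024 * 1024 )) ;;
        *) echo "$number" ;;
    esac
}

# Values taken from the MemoryLow= and MemoryHigh= examples.
to_bytes 512M
to_bytes 2G
```

Comparing the resulting numbers with the memory actually used by a service can help when calibrating MemoryLow=, MemoryHigh=, and MemoryMax=.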
9.3 Preventing Fork Bombs with TasksMax
systemd 228 shipped with a DefaultTasksMax limit of 512. This limited the number of processes any system unit can create at one time to 512. Previous versions had no default limit. The goal was to improve security by preventing runaway processes from creating excessive forks, or spawning enough threads to exhaust system resources.
However, it soon became apparent that there is not a single default that applies to all use cases. 512 is not low enough to prevent a runaway process from crashing a system, especially when other resources such as CPU and RAM are not restricted, and not high enough for processes that create a lot of threads, such as databases. In systemd 234 the default was changed to 15%, which is 4915 tasks (15% of the kernel limit of 32768; see cat /proc/sys/kernel/pid_max). This default is compiled, and can be changed in configuration files. The compiled defaults are documented in /etc/systemd/system.conf. You can edit this file to override the defaults, though there are other methods we will show in the following sections.
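The arithmetic behind the 4915 figure can be reproduced with a few lines of shell. This is only a sketch: it assumes, as the text above does, that the percentage is taken of the kernel's pid_max value and rounded down. The function name tasks_from_percent is hypothetical.

```shell
#!/bin/sh
# Print PERCENT percent of PID_MAX, rounded down to a whole number of tasks.
# tasks_from_percent is an illustrative helper, not a systemd command.
tasks_from_percent() {
    pid_max=$1
    percent=$2
    echo $(( pid_max * percent / 100 ))
}

# The kernel limit quoted in the text.
tasks_from_percent 32768 15

# On a running system, use the actual kernel limit if it is readable.
if [ -r /proc/sys/kernel/pid_max ]; then
    tasks_from_percent "$(cat /proc/sys/kernel/pid_max)" 15
fi
```

If pid_max has been raised on your system, the second number differs from the first accordingly.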
9.3.1 Finding the Current Default TasksMax Values
SUSE Linux Enterprise Server ships with two custom configurations that override the upstream defaults for system units and for user slices, and set them both to infinity.
/usr/lib/systemd/system.conf.d/20-suse-defaults.conf contains these lines:
[Manager]
DefaultTasksMax=infinity
/usr/lib/systemd/system/user-.slice.d/20-suse-defaults.conf contains these lines:
[Slice]
TasksMax=infinity
infinity means having no limit. It is not a requirement to change the default, but setting some limits may help to prevent system crashes from runaway processes.
9.3.2 Overriding the DefaultTasksMax Value
Change the global DefaultTasksMax value by creating a new override file, /etc/systemd/system.conf.d/10-system-tasksmax.conf, and writing the following lines to it. They set a new default limit of 256 tasks per system unit:
[Manager]
DefaultTasksMax=256
Load the new setting, then verify that it changed:
> sudo systemctl daemon-reload
> systemctl show --property DefaultTasksMax
DefaultTasksMax=256
Adjust this default value to suit your needs. You can set higher limits on individual services as needed. This example is for MariaDB. First check the current active value:
> systemctl status mariadb.service
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset>
Active: active (running) since Tue 2020-05-26 14:15:03 PDT; 27min ago
Docs: man:mysqld(8)
https://mariadb.com/kb/en/library/systemd/
Main PID: 11845 (mysqld)
Status: "Taking your SQL requests now..."
Tasks: 30 (limit: 256)
CGroup: /system.slice/mariadb.service
└─11845 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
The Tasks line shows that MariaDB currently has 30 tasks running, and has an upper limit of the default 256, which is inadequate for a database. The following example demonstrates how to raise MariaDB's limit to 8192. Create a new override file with systemctl edit, and enter the new value:
> sudo systemctl edit mariadb.service
[Service]
TasksMax=8192
> systemctl status mariadb.service
● mariadb.service - MariaDB database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disab>
Drop-In: /etc/systemd/system/mariadb.service.d
└─override.conf
Active: active (running) since Tue 2020-06-02 17:57:48 PDT; 7min ago
Docs: man:mysqld(8)
https://mariadb.com/kb/en/library/systemd/
Process: 3446 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper upgrade (code=exited, sta>
Process: 3440 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper install (code=exited, sta>
Main PID: 3452 (mysqld)
Status: "Taking your SQL requests now..."
Tasks: 30 (limit: 8192)
CGroup: /system.slice/mariadb.service
└─3452 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
systemctl edit creates an override file, /etc/systemd/system/mariadb.service.d/override.conf, that contains only the changes you want to apply to the existing unit file. The value does not have to be 8192, but should be whatever limit is appropriate for your workloads.
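If an interactive editor is not practical, for example in a provisioning script, a drop-in file with the same content can also be written directly. The following is a minimal sketch under stated assumptions: the function name write_tasksmax_dropin, its ROOT argument, and the file name 50-tasksmax.conf are hypothetical choices made for illustration. The ROOT prefix exists only so that the example can be tried in a scratch directory without touching /etc.

```shell
#!/bin/sh
# Write a TasksMax drop-in for UNIT below ROOT (use ROOT="" on a real system).
# write_tasksmax_dropin and the file name 50-tasksmax.conf are illustrative.
write_tasksmax_dropin() {
    root=$1
    unit=$2
    limit=$3
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nTasksMax=%s\n' "$limit" > "$dir/50-tasksmax.conf"
}

# Dry run into a scratch directory so that nothing under /etc is touched.
scratch=$(mktemp -d)
write_tasksmax_dropin "$scratch" mariadb.service 8192
cat "$scratch/etc/systemd/system/mariadb.service.d/50-tasksmax.conf"
rm -rf "$scratch"
```

On a real system, the file would be written as root with an empty ROOT, followed by systemctl daemon-reload so that systemd picks up the new drop-in.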
9.3.3 Default TasksMax Limit on Users
The default limit on users should be fairly high, because user sessions
need more resources.
Set your own default for users by creating a new file, for example /etc/systemd/system/user-.slice.d/user-taskmask.conf. The following example sets a default of 16284:
[Slice]
TasksMax=16284
Then reload systemd to load the new value, and verify the change by querying the user-.slice template and an individual user slice:
> sudo systemctl daemon-reload
> systemctl show --property TasksMax user-.slice
TasksMax=16284
> systemctl show --property TasksMax user-1000.slice
TasksMax=16284
How do you know what values to use? This varies according to your workloads, system resources, and other resource configurations. When your TasksMax value is too low, you will see error messages such as Failed to fork (Resources temporarily unavailable), Can't create thread to handle new connection, and Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'.
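One rough way to judge whether a limit is getting tight before such errors appear is to compare the two numbers on the Tasks line of the systemctl status output. The following sketch reads status output on standard input and prints the current task count, the limit, and the usage as a whole percentage. The function name tasks_usage is hypothetical, and the parsing assumes the line has the form shown in the MariaDB examples in this chapter; a limit displayed as something other than a number is not matched.

```shell
#!/bin/sh
# Read `systemctl status` output on stdin and print "CURRENT LIMIT PERCENT"
# for each line of the form "Tasks: 30 (limit: 256)".
# tasks_usage is an illustrative helper, not a systemd command.
tasks_usage() {
    sed -n 's/^ *Tasks: *\([0-9][0-9]*\) (limit: *\([0-9][0-9]*\)).*/\1 \2/p' |
    while read -r current limit; do
        echo "$current $limit $(( current * 100 / limit ))"
    done
}

# Sample input copied from the status output shown in this chapter.
printf 'Tasks: 30 (limit: 256)\n' | tasks_usage
```

On a live system, the input would come from a command such as systemctl status mariadb.service piped into tasks_usage. A percentage that stays close to 100 under normal load suggests that the limit leaves little headroom.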
For more information on configuring system resources in systemd, see systemd.resource-control (5).
9.4 For More Information
Kernel documentation (package kernel-source): files in /usr/src/linux/Documentation/cgroup-v1 and file /usr/src/linux/Documentation/cgroup-v2.txt.
Brown, Neil: Control Groups Series (2014, 7 parts). https://lwn.net/Articles/604609/
Corbet, Jonathan: Controlling memory use in containers (2007). https://lwn.net/Articles/243795/
Corbet, Jonathan: Process containers (2007). https://lwn.net/Articles/236038/