10 Kernel control groups #
Kernel Control Groups (“cgroups”) are a kernel feature for assigning and limiting hardware and system resources for processes. Processes can also be organized in a hierarchical tree structure.
10.1 Overview #
Every process is assigned exactly one administrative cgroup. cgroups are ordered in a hierarchical tree structure. You can set resource limitations, such as CPU, memory, disk I/O, or network bandwidth usage, for single processes or for whole branches of the hierarchy tree.
On SUSE Linux Enterprise Server, systemd uses cgroups to organize all processes in groups, which systemd calls slices. systemd also provides an interface for setting cgroup properties. The command systemd-cgls displays the hierarchy tree.
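To explore the hierarchy interactively, the following commands can be helpful; /system.slice is only an example argument, and the output depends on the services running on your system:

> systemd-cgls /system.slice
> systemd-cgtop

systemd-cgtop provides a top-like, continuously updated view of resource usage per control group.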
There are two versions of cgroup APIs provided by the kernel. They differ in the cgroup attributes they provide and in the organization of controller hierarchies. systemd attempts to abstract the differences away. By default, systemd runs in the hybrid mode, which means that controllers are used through the v1 API, while cgroup v2 is only used for systemd's own tracking. There is also the unified mode, in which the controllers are used through the v2 API. Only one of the modes can be set.
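To check which layout is currently in use, one quick sketch is to inspect the file system type mounted at /sys/fs/cgroup:

> stat -fc %T /sys/fs/cgroup/

If this prints cgroup2fs, the unified hierarchy is mounted at the top level; tmpfs indicates the legacy or hybrid layout, in which case the v2 hierarchy used for systemd's own tracking is typically mounted at /sys/fs/cgroup/unified.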
To enable the unified control group hierarchy, append systemd.unified_cgroup_hierarchy=1 as a kernel command line parameter to the GRUB 2 boot loader. (Refer to Chapter 14, The boot loader GRUB 2, for more details about configuring GRUB 2.)
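As a sketch of that change (the existing contents of the variable differ from system to system), append the parameter to the default kernel command line in /etc/default/grub and regenerate the GRUB 2 configuration afterwards:

GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=1"

> sudo grub2-mkconfig -o /boot/grub2/grub.cfg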
10.2 Resource accounting #
Organizing processes into different cgroups can be used to obtain per-cgroup resource consumption data.
The accounting has a relatively small but non-zero overhead, whose impact depends on the workload. Note that activating accounting for one unit also implicitly activates it for all units in the same slice, for all of its parent slices, and for the units contained in them.
The accounting can be set on a per-unit basis with directives such as MemoryAccounting=, or globally for all units in /etc/systemd/system.conf with the directive DefaultMemoryAccounting=. Refer to man systemd.resource-control for the exhaustive list of possible directives.
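For example, to switch on memory and CPU accounting for one service only, a drop-in similar to the following sketch could be used; both the unit name example.service and the drop-in path are hypothetical:

/etc/systemd/system/example.service.d/10-accounting.conf:

[Service]
MemoryAccounting=yes
CPUAccounting=yes

After running sudo systemctl daemon-reload and restarting the unit, the accounted value can be read back with, for example, systemctl show -p MemoryCurrent example.service.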
10.3 Setting resource limits #
Be aware that resource consumption implicitly depends on the environment where your workload executes (for example, size of data structures in libraries/kernel, forking behavior of utilities, computational efficiency). Hence it is recommended to (re)calibrate your limits should the environment change.
Limits on cgroups can be set with the systemctl set-property command. The syntax is:

# systemctl set-property [--runtime] NAME PROPERTY1=VALUE [PROPERTY2=VALUE]

The configured value is applied immediately. Optionally, use the --runtime option so that the new values do not persist after a reboot.
Replace NAME with a systemd service, scope, or slice name.
For a complete list of properties and more details, see man systemd.resource-control.
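As a hedged example, assuming a hypothetical unit named example.service, the following call caps its CPU time at half a core and its task count at 4096, without persisting the change across reboots:

> sudo systemctl set-property --runtime example.service CPUQuota=50% TasksMax=4096

Both CPUQuota= and TasksMax= are documented in man systemd.resource-control.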
10.4 Preventing fork bombs with TasksMax #
systemd supports configuring task count limits both for each individual leaf unit and aggregated on slices. Upstream systemd ships with defaults that limit the number of tasks in each unit to 15% of the kernel global limit (run /usr/sbin/sysctl kernel.pid_max to see the total limit). Each user's slice is limited to 33% of the kernel limit. However, the defaults are different for SLE.
10.4.1 Finding the current default TasksMax values #
It became apparent, in practice, that there is not a single default that applies to all use cases. SUSE Linux Enterprise Server ships with two custom configurations that override the upstream defaults for system units and for user slices, and set them both to infinity.
/usr/lib/systemd/system.conf.d/25-defaults-SLE.conf contains these lines:

[Manager]
DefaultTasksMax=infinity
/usr/lib/systemd/system/user-.slice.d/25-defaults-SLE.conf contains these lines:

[Slice]
TasksMax=infinity
Use systemctl to verify the DefaultTasksMax value:

> systemctl show --property DefaultTasksMax
DefaultTasksMax=infinity
infinity means that there is no limit. It is not a requirement to change the default, but setting some limits may help to prevent system crashes from runaway processes.
10.4.2 Overriding the DefaultTasksMax value #
Change the global DefaultTasksMax value by creating a new override file, /etc/systemd/system.conf.d/90-system-tasksmax.conf, and write the following lines to set a new default limit of 256 tasks per system unit:

[Manager]
DefaultTasksMax=256
Load the new setting, then verify that it changed:

> sudo systemctl daemon-reload
> systemctl show --property DefaultTasksMax
DefaultTasksMax=256
Adjust this default value to suit your needs. You can also set different limits on individual services as needed. The following example is for MariaDB. First check the current active value:
> systemctl status mariadb.service
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset>
   Active: active (running) since Tue 2020-05-26 14:15:03 PDT; 27min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
 Main PID: 11845 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 30 (limit: 256)
   CGroup: /system.slice/mariadb.service
           └─11845 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
The Tasks line shows that MariaDB currently has 30 tasks running, and has an upper limit of the default 256, which is inadequate for a database. The following example demonstrates how to raise MariaDB's limit to 8192.
> sudo systemctl set-property mariadb.service TasksMax=8192
> systemctl status mariadb.service
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disab>
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─50-TasksMax.conf
   Active: active (running) since Tue 2020-06-02 17:57:48 PDT; 7min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 3446 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper upgrade (code=exited, sta>
  Process: 3440 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper install (code=exited, sta>
 Main PID: 3452 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 30 (limit: 8192)
   CGroup: /system.slice/mariadb.service
           └─3452 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
systemctl set-property applies the new limit and creates a drop-in file for persistence, /etc/systemd/system/mariadb.service.d/50-TasksMax.conf, that contains only the changes you want to apply to the existing unit file. The value does not have to be 8192, but should be whatever limit is appropriate for your workloads.
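Assuming the drop-in was created by the command above, its contents should look similar to this sketch:

[Service]
TasksMax=8192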
10.4.3 Default TasksMax limit on users #
The default limit on users should be fairly high, because user sessions need more resources. Set your own default for any user by creating a new file, for example /etc/systemd/system/user-.slice.d/40-user-taskmask.conf. The following example sets a default of 16284:

[Slice]
TasksMax=16284
See https://documentation.suse.com/sles/15-SP3/html/SLES-all/cha-systemd.html#sec-boot-systemd-custom-drop-in to learn what numeric prefixes are expected for drop-in files.
Then reload systemd to load the new value, and verify the change:

> sudo systemctl daemon-reload
> systemctl show --property TasksMax user-1000.slice
TasksMax=16284
How do you know what values to use? This varies according to your workloads, system resources, and other resource configurations. When your TasksMax value is too low, you will see error messages such as Failed to fork (Resources temporarily unavailable), Can't create thread to handle new connection, and Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'.
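To see how close a unit or slice is to its limit before such errors appear, query the current and maximum task counts; user-1000.slice is only an example name here:

> systemctl show -p TasksCurrent,TasksMax user-1000.slice

systemctl status shows the same information in its Tasks: line.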
For more information on configuring system resources in systemd, see systemd.resource-control(5).
10.5 Controlling I/O with proportional weight policy #
This section introduces using the Linux kernel's block I/O controller to prioritize I/O operations. The cgroup blkio subsystem controls and monitors access to I/O on block devices. State objects that contain the subsystem parameters for a cgroup are represented as pseudo-files within the cgroup virtual file system, also called a pseudo-file system.
The examples in this section show how writing values to some of these pseudo-files limits access or bandwidth, and reading values from some of these pseudo-files provides information on I/O operations. Examples are provided for both cgroup-v1 and cgroup-v2.
You need a test directory containing two files for testing performance and changed settings. A quick way to create test files fully populated with text is using the yes command. The following example commands create a test directory, and then populate it with two 537 MB text files:

host1:~ # mkdir /io-cgroup
host1:~ # cd /io-cgroup
host1:~ # yes this is a test file | head -c 537MB > file1.txt
host1:~ # yes this is a test file | head -c 537MB > file2.txt
To run the examples, open three command shells. Two shells are for reader processes, and one shell is for running the steps that control I/O. In the examples, each command prompt is labeled to indicate whether it represents one of the reader processes or the I/O controller.
10.5.1 Using cgroup-v1 #
The following proportional weight policy file can be used to grant a reader process a higher priority for I/O operations than other reader processes accessing the same disk:
blkio.bfq.weight (available in kernels starting with version 5.0 with blk-mq, and only when the BFQ I/O scheduler is used)
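If a different I/O scheduler is currently active for the test disk, BFQ can be selected at runtime through sysfs; sda is only an example device name:

[io-controller] host1:~ # echo bfq > /sys/block/sda/queue/scheduler
[io-controller] host1:~ # cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none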
To test this, run a single reader process (in the examples, reading from an SSD) without controlling its I/O, using file2.txt:

[io-controller] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[io-controller] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 1.33049 s, 404 MB/s
Now run a background process reading from the same disk:
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5220
...
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.61592 s, 205 MB/s
Each process gets half of the throughput for I/O operations. Next, set up two control groups—one for each process—verify that BFQ is used, and set a different weight for reader2:
[io-controller] host1:/io-cgroup # cd /sys/fs/cgroup/blkio/
[io-controller] host1:/sys/fs/cgroup/blkio/ # mkdir reader1
[io-controller] host1:/sys/fs/cgroup/blkio/ # mkdir reader2
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 5220 > reader1/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 5251 > reader2/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader1/blkio.bfq.weight
100
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 200 > reader2/blkio.bfq.weight
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader2/blkio.bfq.weight
200
With these settings and reader1 in the background, reader2 should have higher throughput than previously:
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5220
...
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.06604 s, 260 MB/s
The higher proportional weight resulted in higher throughput for reader2. Now double its weight again:
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader1/blkio.bfq.weight
100
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 400 > reader2/blkio.bfq.weight
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader2/blkio.bfq.weight
400
This results in another increase in throughput for reader2:
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5220
...
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 1.69026 s, 318 MB/s
10.5.2 Using cgroup-v2 #
First set up your test environment as shown at the beginning of this section. Then make sure that the Block IO controller is not active, as it belongs to cgroup-v1. To do this, boot with the kernel parameter cgroup_no_v1=blkio. Verify that this parameter was used, and that the IO controller (cgroup-v2) is available:
[io-controller] host1:/io-cgroup # cat /proc/cmdline
BOOT_IMAGE=... cgroup_no_v1=blkio ...
[io-controller] host1:/io-cgroup # cat /sys/fs/cgroup/unified/cgroup.controllers
io
Next, enable the IO controller:
[io-controller] host1:/io-cgroup # cd /sys/fs/cgroup/unified/
[io-controller] host1:/sys/fs/cgroup/unified # echo '+io' > cgroup.subtree_control
[io-controller] host1:/sys/fs/cgroup/unified # cat cgroup.subtree_control
io
Now run all the test steps, similarly to the steps for cgroup-v1. Note that some of the directories are different. Run a single reader process (in the examples, reading from an SSD) without controlling its I/O, using file2.txt:
[io-controller] host1:/sys/fs/cgroup/unified # cd -
[io-controller] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[io-controller] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5633
[...]
Run a background process reading from the same disk and note your throughput values:
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5633
[...]
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5703
[...]
Each process gets half of the throughput for I/O operations. Set up two control groups—one for each process—verify that BFQ is the active scheduler, and set a different weight for reader2:
[io-controller] host1:/io-cgroup # cd -
[io-controller] host1:/sys/fs/cgroup/unified # mkdir reader1
[io-controller] host1:/sys/fs/cgroup/unified # mkdir reader2
[io-controller] host1:/sys/fs/cgroup/unified # echo 5633 > reader1/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/unified # echo 5703 > reader2/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/unified # cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
[io-controller] host1:/sys/fs/cgroup/unified # cat reader1/io.bfq.weight
default 100
[io-controller] host1:/sys/fs/cgroup/unified # echo 200 > reader2/io.bfq.weight
[io-controller] host1:/sys/fs/cgroup/unified # cat reader2/io.bfq.weight
default 200
Test your throughput with the new settings. reader2 should show an increase in throughput.
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5633
[...]
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5703
[...]
Try doubling the weight again for reader2, and testing the new setting:
[io-controller] host1:/sys/fs/cgroup/unified # echo 400 > reader2/io.bfq.weight
[io-controller] host1:/sys/fs/cgroup/unified # cat reader2/io.bfq.weight
default 400
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
[...]
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
[...]
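When you are done testing, the per-cgroup statistics can be inspected and the test groups removed again. This is only a sketch: the cgroups can be removed only after the shells have been moved back out of them, here into the root cgroup of the unified hierarchy:

[io-controller] host1:/sys/fs/cgroup/unified # cat reader2/io.stat
[io-controller] host1:/sys/fs/cgroup/unified # echo 5633 > cgroup.procs
[io-controller] host1:/sys/fs/cgroup/unified # echo 5703 > cgroup.procs
[io-controller] host1:/sys/fs/cgroup/unified # rmdir reader1 reader2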
10.6 More information #
Kernel documentation (package kernel-source): files in /usr/src/linux/Documentation/admin-guide/cgroup-v1 and file /usr/src/linux/Documentation/admin-guide/cgroup-v2.rst.
https://lwn.net/Articles/604609/ — Brown, Neil: Control Groups Series (2014, 7 parts).
https://lwn.net/Articles/243795/ — Corbet, Jonathan: Controlling memory use in containers (2007).
https://lwn.net/Articles/236038/ — Corbet, Jonathan: Process containers (2007).