Applies to SUSE Linux Enterprise Server 15 SP2

10 Kernel Control Groups

Kernel Control Groups (cgroups) are a kernel feature that allows assigning and limiting hardware and system resources for processes. Processes can also be organized in a hierarchical tree structure.

10.1 Overview

Every process is assigned exactly one administrative cgroup. cgroups are ordered in a hierarchical tree structure. You can set resource limitations, such as CPU, memory, disk I/O, or network bandwidth usage, for single processes or for whole branches of the hierarchy tree.

On SUSE Linux Enterprise Server, systemd uses cgroups to organize all processes in groups, which systemd calls slices. systemd also provides an interface for setting cgroup properties.

The command systemd-cgls displays the hierarchy tree.

This chapter is an overview. For more details, refer to the listed references.

10.2 Resource accounting

Placing processes into different cgroups makes it possible to obtain per-cgroup information about the consumption of certain resources.

The accounting has relatively small but non-zero overhead whose impact depends on the workload. Be aware that turning on accounting for one unit will also implicitly turn it on for all units directly contained in the same slice and for all its parent slices and the units directly contained therein. Therefore the accounting cost is not exclusive to the single unit.

Accounting can be enabled on a per-unit basis with directives such as MemoryAccounting=, or globally for all units in /etc/systemd/system.conf with the respective directive DefaultMemoryAccounting=. Refer to systemd.resource-control (5) for the exhaustive list of possible directives.
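
To see which accounting settings are currently active for a unit, the corresponding properties can be queried with systemctl show. The following is a minimal sketch: the helper name accounting_disabled and the unit nginx.service are illustrative assumptions, not part of systemd.

```shell
# Print the names of accounting properties that are not set to "yes".
# Reads "Name=value" lines, as printed by `systemctl show`, from standard input.
accounting_disabled() {
    awk -F= '$1 ~ /Accounting$/ && $2 != "yes" { print $1 }'
}

# On a live system (the unit name is only an example):
#   systemctl show nginx.service -p CPUAccounting -p MemoryAccounting -p TasksAccounting \
#       | accounting_disabled
```

An empty result means that every queried accounting property is enabled for the unit.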

10.3 Setting Resource Limits

Note: Implicit Resource Consumption

Be aware that resource consumption implicitly depends on the environment where your workload executes (for example, size of data structures in libraries/kernel, forking behavior of utilities, computational efficiency). Hence it is recommended to (re)calibrate your limits should the environment change.

Limitations to cgroups can be set with the systemctl set-property command. The syntax is:

# systemctl set-property [--runtime] NAME PROPERTY1=VALUE [PROPERTY2=VALUE]

The --runtime option is optional. With this option, the limits you set apply only until the next reboot.

Replace NAME with a systemd service, slice, scope, socket, mount, or swap name. Replace properties with one or more of the following:

CPUQuota=PERCENTAGE

Assigns CPU time to processes. The value is a percentage followed by a % as suffix. This implies CPUAccounting=yes.

Example:

# systemctl set-property user.slice CPUQuota=50%

MemoryLow=BYTES

Unused memory from processes below this limit will not be reclaimed for other use. Use suffixes K, M, G or T for BYTES. This implies MemoryAccounting=yes.

Example:

# systemctl set-property nginx.service MemoryLow=512M

Note: Unified Control Group Hierarchy

This setting is available only if the unified control group hierarchy is used, and disables MemoryLimit=. To enable the unified control group hierarchy, append systemd.unified_cgroup_hierarchy=1 as a kernel command line parameter to the GRUB 2 boot loader. Refer to Chapter 14, The Boot Loader GRUB 2 for more details about configuring GRUB 2.

MemoryHigh=BYTES

If memory usage exceeds this limit, memory is aggressively taken away from the processes. Use suffixes K, M, G or T for BYTES. This implies MemoryAccounting=yes. For example:

# systemctl set-property nginx.service MemoryHigh=2G

Note: Unified Control Group Hierarchy

This setting is available only if the unified control group hierarchy is used, and disables MemoryLimit=. To enable the unified control group hierarchy, append systemd.unified_cgroup_hierarchy=1 as a kernel command line parameter to the GRUB 2 boot loader. For more details about configuring GRUB 2, see Chapter 14, The Boot Loader GRUB 2.

MemoryMax=BYTES

Sets a maximum limit for used memory. Processes will be killed if they use more memory than allowed. Use suffixes K, M, G or T for BYTES. This implies MemoryAccounting=yes.

Example:

# systemctl set-property nginx.service MemoryMax=4G

DeviceAllow=

Allows read (r), write (w) and mknod (m) access. The property takes a device node specifier and a list of r, w or m, separated by whitespace.

Example:

# systemctl set-property system.slice DeviceAllow="/dev/sdb1 r"

DevicePolicy=[auto|closed|strict]

When set to strict, only access to devices that are listed in DeviceAllow is allowed. closed additionally allows access to standard pseudo devices including /dev/null, /dev/zero, /dev/full, /dev/random, and /dev/urandom. auto allows access to all devices if no specific rule is defined in DeviceAllow. auto is the default setting.

For more details and a complete list of properties, see man systemd.resource-control.
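
The memory properties above take sizes with K, M, G, or T suffixes. As a rough illustration of what these values mean in bytes, the following sketch converts them, assuming the suffixes denote powers of 1024 as described in systemd.resource-control (5). The helper name to_bytes is hypothetical and not part of systemd.

```shell
# Convert a size such as 512M or 2G to bytes.
# Assumption: K, M, G, and T are powers of 1024.
to_bytes() {
    value=$1
    number=${value%[KMGT]}   # strip one trailing suffix letter, if present
    case $value in
        *K) echo $(( number * 1024 )) ;;
        *M) echo $(( number * 1024 * 1024 )) ;;
        *G) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *T) echo $(( number * 1024 * 1024 * 1024 * 1024 )) ;;
        *)  echo "$number" ;;
    esac
}

to_bytes 512M   # value from the MemoryLow example
to_bytes 2G     # value from the MemoryHigh example
to_bytes 4G     # value from the MemoryMax example
```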

10.4 Preventing Fork Bombs with TasksMax

systemd 228 shipped with a DefaultTasksMax limit of 512. This limited the number of processes any system unit can create at one time to 512. Previous versions had no default limit. The goal was to improve security by preventing runaway processes from creating excessive forks, or spawning enough threads to exhaust system resources.

However, it soon became apparent that no single default suits all use cases. 512 is not low enough to prevent a runaway process from crashing a system, especially when other resources such as CPU and RAM are not restricted, and not high enough for processes that create a lot of threads, such as databases. In systemd 234, the default was changed to 15%, which is 4915 tasks (15% of the kernel limit of 32768; see cat /proc/sys/kernel/pid_max). This default is compiled in, but can be changed in configuration files. The compiled-in defaults are documented in /etc/systemd/system.conf. You can edit this file to override the defaults, though the following sections show other methods.
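
The figure of 4915 tasks follows from integer arithmetic on the kernel limit. The sketch below reproduces that calculation; the function name tasks_from_percent is an illustrative assumption, and the calculation only mirrors the explanation given in the text.

```shell
# Integer percentage of a PID limit, rounded down.
tasks_from_percent() {
    pid_max=$1
    percent=$2
    echo $(( pid_max * percent / 100 ))
}

tasks_from_percent 32768 15   # the kernel limit and percentage quoted in the text

# On a live system, the kernel limit could be read instead:
#   tasks_from_percent "$(cat /proc/sys/kernel/pid_max)" 15
```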

10.4.1 Finding the Current Default TasksMax Values

SUSE Linux Enterprise Server ships with two custom configurations that override the upstream defaults for system units and for user slices, and set them both to infinity. /usr/lib/systemd/system.conf.d/20-suse-defaults.conf contains these lines:

[Manager]
DefaultTasksMax=infinity

/usr/lib/systemd/system/user-.slice.d/20-suse-defaults.conf contains these lines:

[Slice]
TasksMax=infinity

infinity means having no limit. It is not a requirement to change the default, but setting some limits may help to prevent system crashes from runaway processes.
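
To check what a particular configuration file sets, the directive can be extracted with standard text tools. This is a minimal sketch; get_directive is a hypothetical helper, and it looks at one file only. The effective value that systemd uses is reported by systemctl show --property DefaultTasksMax.

```shell
# Print the value of KEY from a systemd-style configuration file.
# If the key is assigned several times, the last assignment is printed.
get_directive() {
    file=$1
    key=$2
    sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# The SUSE default files named above could be inspected like this:
#   get_directive /usr/lib/systemd/system.conf.d/20-suse-defaults.conf DefaultTasksMax
#   get_directive /usr/lib/systemd/system/user-.slice.d/20-suse-defaults.conf TasksMax
```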

10.4.2 Overriding the DefaultTasksMax Value

Change the global DefaultTasksMax value by creating a new override file, /etc/systemd/system.conf.d/90-system-tasksmax.conf, containing the following lines, which set a new default limit of 256 tasks per system unit:

[Manager]
DefaultTasksMax=256

Load the new setting, then verify that it changed:

> sudo systemctl daemon-reload
> systemctl show --property DefaultTasksMax
DefaultTasksMax=256

Adjust this default value to suit your needs. You can set higher limits on individual services as needed. This example is for MariaDB. First check the current active value:

> systemctl status mariadb.service
  ● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset>
   Active: active (running) since Tue 2020-05-26 14:15:03 PDT; 27min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
 Main PID: 11845 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 30 (limit: 256)
   CGroup: /system.slice/mariadb.service
           └─11845 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql

The Tasks line shows that MariaDB currently has 30 tasks running, and has an upper limit of the default 256, which is inadequate for a database. The following example demonstrates how to raise MariaDB's limit to 8192. Create a new override file with systemctl edit, and enter the new value:

> sudo systemctl edit mariadb.service

[Service]
TasksMax=8192

> systemctl status mariadb.service 
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disab>
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─override.conf
   Active: active (running) since Tue 2020-06-02 17:57:48 PDT; 7min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 3446 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper upgrade (code=exited, sta>
  Process: 3440 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper install (code=exited, sta>
 Main PID: 3452 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 30 (limit: 8192)
   CGroup: /system.slice/mariadb.service
           └─3452 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql

systemctl edit creates an override file, /etc/systemd/system/mariadb.service.d/override.conf, that contains only the changes you want to apply to the existing unit file. The value does not have to be 8192, but should be whatever limit is appropriate for your workloads.
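
For scripted setups where an interactive editor is not practical, the same drop-in file can be written directly. The following is a sketch under stated assumptions: write_tasksmax_override and its optional ROOT argument are hypothetical and exist only so that the function can be tried in a scratch directory first.

```shell
# Write a TasksMax drop-in for UNIT below ROOT.
# ROOT is optional; if it is empty, the real /etc/systemd/system is used.
write_tasksmax_override() {
    unit=$1
    limit=$2
    root=${3:-}
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" &&
    printf '[Service]\nTasksMax=%s\n' "$limit" > "$dir/override.conf"
}

# Example for the MariaDB unit from the text (requires root privileges):
#   write_tasksmax_override mariadb.service 8192
#   systemctl daemon-reload
```

Unlike systemctl edit, this approach does not reload systemd by itself, so the daemon-reload step is needed afterward.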

10.4.3 Default TasksMax Limit on Users

The default limit on users should be fairly high, because user sessions need more resources. Set your own default for users by creating a new file, for example /etc/systemd/system/user-.slice.d/user-taskmask.conf. The following example sets a default of 16284:

[Slice]
TasksMax=16284

Then reload systemd to load the new value, and verify the change:

> sudo systemctl daemon-reload

> systemctl show --property TasksMax user-.slice
TasksMax=16284

> systemctl show --property TasksMax user-1000.slice
TasksMax=16284

How do you know what values to use? This varies according to your workloads, system resources, and other resource configurations. When your TasksMax value is too low, you will see error messages such as Failed to fork (Resources temporarily unavailable), Can't create thread to handle new connection, and Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'.
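
One way to judge whether a limit is close to being exhausted is to compare the number of current tasks with the limit. The sketch below computes that ratio; tasks_usage_percent is a hypothetical helper, and the live-system example assumes that your systemctl version supports the --value option.

```shell
# Print the percentage of a task limit that is in use, rounded down.
# Prints "n/a" if either value is not a plain number (for example "infinity").
tasks_usage_percent() {
    current=$1
    limit=$2
    for v in "$current" "$limit"; do
        case $v in
            ''|*[!0-9]*) echo "n/a"; return 0 ;;
        esac
    done
    [ "$limit" -gt 0 ] || { echo "n/a"; return 0; }
    echo $(( current * 100 / limit ))
}

tasks_usage_percent 30 256   # values from the earlier MariaDB status output

# On a live system:
#   tasks_usage_percent "$(systemctl show -p TasksCurrent --value mariadb.service)" \
#                       "$(systemctl show -p TasksMax --value mariadb.service)"
```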

For more information on configuring system resources in systemd, see systemd.resource-control (5).

10.5 Controlling I/O with Proportional Weight Policy

This section introduces using the Linux kernel's block I/O controller to prioritize I/O operations. The cgroup blkio subsystem controls and monitors access to I/O on block devices. State objects that contain the subsystem parameters for a cgroup are represented as pseudo-files within the cgroup virtual file system, also called a pseudo-filesystem.

The examples in this section show how writing values to some of these pseudo-files limits access or bandwidth, and reading values from some of these pseudo-files provides information on I/O operations. Examples are provided for both cgroup-v1 and cgroup-v2.

You need a test directory containing two files for testing performance and changed settings. A quick way to create test files fully-populated with text is using the yes command. The following example commands create a test directory, and then populate it with two 537 MB text files:

host1:~ # mkdir /io-cgroup
host1:~ # cd /io-cgroup
host1:/io-cgroup # yes this is a test file | head -c 537MB > file1.txt
host1:/io-cgroup # yes this is a test file | head -c 537MB > file2.txt

To run the examples, open three command shells. Two shells are for reader processes, and one shell is for running the steps that control I/O. In the examples, each command prompt is labeled to indicate whether it represents one of the reader processes or the I/O controller.

10.5.1 Using cgroup-v1

The following proportional weight policy files can be used to grant a reader process a higher priority for I/O operations than other reader processes accessing the same disk.

  • blkio.weight (only available in kernels up to version 4.20 with legacy block layer and when using the CFQ I/O scheduler)

  • blkio.bfq.weight (available in kernels starting with version 5.0 with blk-mq and when using BFQ I/O scheduler)

To test this, run a single reader process (in the examples, reading from an SSD) without controlling its I/O, using file2.txt:

[io-controller] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[io-controller] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 1.33049 s, 404 MB/s

Now run a background process reading from the same disk:

[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5220
...
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.61592 s, 205 MB/s
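
The throughput figures that dd prints can be reproduced from the byte count and the elapsed time. The following sketch does this for the two runs above, taking 1 MB as 1,000,000 bytes, as the dd output (537 MB, 512 MiB) suggests. The function name throughput_mb is an illustrative assumption.

```shell
# Throughput in MB/s (1 MB = 1,000,000 bytes), rounded to the nearest integer.
throughput_mb() {
    LC_ALL=C awk -v bytes="$1" -v seconds="$2" \
        'BEGIN { printf "%.0f\n", bytes / seconds / 1000000 }'
}

echo $(( 4096 * 131072 ))          # bytes read with bs=4k count=131072
throughput_mb 536870912 1.33049    # single reader
throughput_mb 536870912 2.61592    # second reader running at the same time
```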

Each process gets half of the throughput for I/O operations. Next, set up two control groups, one for each process, verify that BFQ is used, and set a different weight for reader2:

[io-controller] host1:/io-cgroup # cd /sys/fs/cgroup/blkio/
[io-controller] host1:/sys/fs/cgroup/blkio/ # mkdir reader1
[io-controller] host1:/sys/fs/cgroup/blkio/ # mkdir reader2
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 5220 > reader1/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 5251 > reader2/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader1/blkio.bfq.weight
100
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 200 > reader2/blkio.bfq.weight
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader2/blkio.bfq.weight
200
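
In the scheduler file, the active scheduler is the name enclosed in square brackets. For scripts that need to check this, a small filter such as the following can extract it. This is a sketch; active_scheduler is a hypothetical helper name.

```shell
# Print the scheduler name that is enclosed in square brackets.
# Reads the content of a queue/scheduler file from standard input.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

echo 'mq-deadline kyber [bfq] none' | active_scheduler

# On a live system (sda is the device used in the example above):
#   active_scheduler < /sys/block/sda/queue/scheduler
```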

With these settings and reader1 in the background, reader2 should have higher throughput than previously:

[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5220
...
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 2.06604 s, 260 MB/s

The higher proportional weight resulted in higher throughput for reader2. Now double its weight again:

[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader1/blkio.bfq.weight
100
[io-controller] host1:/sys/fs/cgroup/blkio/ # echo 400 > reader2/blkio.bfq.weight
[io-controller] host1:/sys/fs/cgroup/blkio/ # cat reader2/blkio.bfq.weight
400

This results in another increase in throughput for reader2:

[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5220
...
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5251
131072+0 records in
131072+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 1.69026 s, 318 MB/s
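
As a rough plausibility check, the measured values can be compared with an idealized proportional-weight model in which each group's share of the bandwidth equals its weight divided by the sum of all competing weights. This model is a simplifying assumption made here for illustration, and weight_share is a hypothetical helper. Real results depend on the device and the workload.

```shell
# Share of the total weight held by one group, in percent (rounded down).
weight_share() {
    own=$1
    other=$2
    echo $(( own * 100 / (own + other) ))
}

weight_share 100 100   # equal weights
weight_share 200 100   # reader2 at 200, reader1 at 100
weight_share 400 100   # reader2 at 400, reader1 at 100
```

Relative to the 404 MB/s measured for a single reader, the throughput values of 260 MB/s and 318 MB/s for reader2 correspond to roughly 64% and 79%.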

10.5.2 Using cgroup-v2

First set up your test environment as shown at the beginning of Section 10.5.

Then make sure that the Block IO controller is not active, as that is for cgroup-v1. To do this, boot with kernel parameter cgroup_no_v1=blkio. Verify that this parameter was used, and that the IO controller (cgroup-v2) is available:

[io-controller] host1:/io-cgroup # cat /proc/cmdline
BOOT_IMAGE=... cgroup_no_v1=blkio ...
[io-controller] host1:/io-cgroup # cat /sys/fs/cgroup/unified/cgroup.controllers
io

Next, enable the IO controller:

[io-controller] host1:/io-cgroup # cd /sys/fs/cgroup/unified/
[io-controller] host1:/sys/fs/cgroup/unified # echo '+io' > cgroup.subtree_control
[io-controller] host1:/sys/fs/cgroup/unified # cat cgroup.subtree_control
io
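
The cgroup.controllers and cgroup.subtree_control files contain a space-separated list of controller names. In a script, a check for a specific controller might look like the following sketch, where has_controller is a hypothetical helper name.

```shell
# Succeed if controller NAME is listed in FILE, where FILE is a
# cgroup.controllers or cgroup.subtree_control file.
has_controller() {
    file=$1
    name=$2
    tr ' ' '\n' < "$file" | grep -qx "$name"
}

# On a live system, using the path from the example above:
#   has_controller /sys/fs/cgroup/unified/cgroup.subtree_control io && echo "io enabled"
```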

Now run all the test steps, similarly to the steps for cgroup-v1. Note that some of the directories are different. Run a single reader process (in the examples, reading from an SSD) without controlling its I/O, using file2.txt:

[io-controller] host1:/sys/fs/cgroup/unified # cd -
[io-controller] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[io-controller] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5633
[...]

Run a background process reading from the same disk and note your throughput values:

[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5633
[...]
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5703
[...]

Each process gets half of the throughput for I/O operations. Set up two control groups, one for each process, verify that BFQ is the active scheduler, and set a different weight for reader2:

[io-controller] host1:/io-cgroup # cd -
[io-controller] host1:/sys/fs/cgroup/unified # mkdir reader1
[io-controller] host1:/sys/fs/cgroup/unified # mkdir reader2
[io-controller] host1:/sys/fs/cgroup/unified # echo 5633 > reader1/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/unified # echo 5703 > reader2/cgroup.procs
[io-controller] host1:/sys/fs/cgroup/unified # cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
[io-controller] host1:/sys/fs/cgroup/unified # cat reader1/io.bfq.weight
default 100
[io-controller] host1:/sys/fs/cgroup/unified # echo 200 > reader2/io.bfq.weight
[io-controller] host1:/sys/fs/cgroup/unified # cat reader2/io.bfq.weight
default 200

Test your throughput with the new settings. reader2 should show an increase in throughput.

[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
5633
[...]
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
5703
[...]

Try doubling the weight again for reader2, and testing the new setting:

[io-controller] host1:/sys/fs/cgroup/unified # echo 400 > reader2/io.bfq.weight
[io-controller] host1:/sys/fs/cgroup/unified # cat reader2/io.bfq.weight
default 400
[reader1] host1:/io-cgroup # sync; echo 3 > /proc/sys/vm/drop_caches
[reader1] host1:/io-cgroup # echo $$; dd if=file1.txt of=/dev/null bs=4k
[...]
[reader2] host1:/io-cgroup # echo $$; dd if=file2.txt of=/dev/null bs=4k count=131072
[...]

10.6 For More Information