10 Kernel control groups #
Kernel Control Groups (“cgroups”) are a kernel feature for assigning and limiting hardware and system resources for processes. Processes can also be organized in a hierarchical tree structure.
10.1 Overview #
Every process is assigned exactly one administrative cgroup. cgroups are ordered in a hierarchical tree structure. You can set resource limitations, such as CPU, memory, disk I/O or network bandwidth usage, for single processes or for whole branches of the hierarchy tree.
      On SUSE Linux Enterprise Desktop, systemd uses cgroups to organize all processes in
      groups, which systemd calls slices. systemd also provides an
      interface for setting cgroup properties.
    
      The command systemd-cgls displays the hierarchy tree.
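
      Besides systemd-cgls, the cgroup of a single process can be read from
      /proc/PID/cgroup. The following is a minimal sketch, assuming the
      unified (v2) hierarchy, where the relevant entry has the form 0::PATH.
      The helper name cgroup_v2_path is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the cgroup v2 path from /proc/PID/cgroup content
# read on standard input. On the unified hierarchy the entry starts with "0::".
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Show the cgroup of the current process, if the file is available.
if [ -r /proc/self/cgroup ]; then
    cgroup_v2_path < /proc/self/cgroup
fi
```

      On a system that exposes only v1 hierarchies, the helper prints nothing,
      because no line starts with 0::.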
    
The kernel cgroup API comes in two variants—v1 and v2. Additionally, there can be multiple cgroup hierarchies exposing different APIs. From many possible combinations, there are two practical choices:
- unified: v2 hierarchy with controllers 
- hybrid: v2 hierarchy without controllers, and the controllers are on v1 hierarchies (deprecated) 
The default mode is unified. There is a hybrid mode that provides backward compatibility for applications that need it.
You may set only one mode.
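To check which mode a running system uses, a common convention is to look at the file system type mounted on /sys/fs/cgroup. The sketch below assumes that convention (cgroup2fs for unified, tmpfs for hybrid or legacy v1) and GNU stat; the function name cgroup_mode_from_fstype is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map the file system type of /sys/fs/cgroup to a mode.
# Assumption: cgroup2fs means unified; tmpfs means hybrid or legacy (v1).
cgroup_mode_from_fstype() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)     echo "hybrid or legacy" ;;
        *)         echo "unknown" ;;
    esac
}

# Query the live system, if the mount point exists.
if [ -d /sys/fs/cgroup ]; then
    cgroup_mode_from_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
fi
```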
10.1.1 Hybrid cgroup hierarchy #
cgroup v1 has been deprecated, and might be removed in a future release.
        To enable the hybrid control group hierarchy, append
        systemd.unified_cgroup_hierarchy=0 as a kernel
        command-line parameter to the GRUB 2 boot loader. For more details
        about configuring GRUB 2, refer to Chapter 18, “The boot loader GRUB 2”.
      
10.2 Resource accounting #
Organizing processes into different cgroups can be used to obtain per-cgroup resource consumption data.
The accounting has comparatively small but non-zero overhead, whose impact depends on the workload. Activating accounting for one unit also implicitly activates it for all units in the same slice, and for all its parent slices, and the units contained in them.
      The accounting can be set on a per-unit basis with directives such as
      MemoryAccounting= or globally for all units in
      /etc/systemd/system.conf with the directive
      DefaultMemoryAccounting=. Refer to man
      systemd.resource-control for the exhaustive list of possible
      directives.
    
10.3 Setting resource limits #
Be aware that resource consumption implicitly depends on the environment where your workload executes (for example, size of data structures in libraries/kernel, forking behavior of utilities, computational efficiency). Hence it is recommended to (re)calibrate your limits should the environment change.
      Limitations to cgroups can be set with the systemctl
      set-property command. The syntax is:
    
# systemctl set-property [--runtime] NAME PROPERTY1=VALUE [PROPERTY2=VALUE]
      The configured value is applied immediately. Optionally, use the
      --runtime option, so that the new values do not persist
      after reboot.
    
      Replace NAME with a systemd service, scope,
      or slice name.
    
      For a complete list of properties and more details, see man
      systemd.resource-control.
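
      As an illustration of the syntax, the following sketch prints two
      set-property invocations for a hypothetical unit named example.service.
      MemoryMax, CPUQuota and TasksMax are properties described in
      systemd.resource-control; the values, the unit name, and the DRY_RUN and
      run_cmd helpers are assumptions for this example only.

```shell
#!/bin/sh
# Print the command instead of running it unless DRY_RUN=0 is set.
# (DRY_RUN and run_cmd are illustration-only names.)
run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Temporary limit (lost on reboot): cap the memory of a hypothetical unit.
run_cmd systemctl set-property --runtime example.service MemoryMax=512M

# Persistent limits: two properties set in a single call.
run_cmd systemctl set-property example.service CPUQuota=50% TasksMax=512
```

      Run the script as root with DRY_RUN=0 to apply the settings instead of
      printing them.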
    
10.4 Preventing fork bombs with TasksMax #
      systemd supports configuring task count limits both for each individual
      leaf unit and aggregated on slices. Upstream systemd ships with
      defaults that limit the number of tasks in each unit to 15% of the kernel
      global limit (run /usr/sbin/sysctl kernel.pid_max to
      see the total limit). Each user's slice is limited to 33% of the kernel
      limit. However, this is different for SUSE Linux Enterprise Desktop.
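
      To get a feeling for the upstream percentages, the following sketch
      computes 15% and 33% of kernel.pid_max. It assumes that kernel.pid_max is
      the governing limit, as described above, and falls back to 4194304 when
      the value cannot be read. The helper name percent_of is hypothetical.

```shell
#!/bin/sh
# percent_of PERCENT LIMIT: integer percentage of a limit (rounded down).
percent_of() {
    echo $(( $2 * $1 / 100 ))
}

# Read the kernel limit; the fallback value is an assumption for systems
# where the file is not available.
pid_max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 4194304)

echo "kernel.pid_max:          $pid_max"
echo "15% (upstream per unit): $(percent_of 15 "$pid_max")"
echo "33% (upstream per user): $(percent_of 33 "$pid_max")"
```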
    
10.4.1 Finding the current default TasksMax values #
        In practice, it became apparent that no single default suits
        all use cases. SUSE Linux Enterprise Desktop ships with two custom
        configurations that override the upstream defaults for system units and
        for user slices, and set them both to infinity.
        /usr/lib/systemd/system.conf.d/__25-defaults-SLE.conf
        contains these lines:
      
[Manager]
DefaultTasksMax=infinity
        /usr/lib/systemd/system/user-.slice.d/25-defaults-SLE.conf
         contains these lines:
      
[Slice]
TasksMax=infinity
        Use systemctl to verify the DefaultTasksMax value:
      
> systemctl show --property DefaultTasksMax
DefaultTasksMax=infinity
        infinity means having no limit. It is not a
        requirement to change the default, but setting certain limits may help
        to prevent system crashes from runaway processes.
      
10.4.2 Overriding the DefaultTasksMax value #
        Change the global DefaultTasksMax value by creating
        a new override file,
        /etc/systemd/system.conf.d/90-system-tasksmax.conf,
        and write the following lines to set a new default limit of 256 tasks
        per system unit:
      
[Manager]
DefaultTasksMax=256
Load the new setting, then verify that it changed:
> sudo systemctl daemon-reload
> systemctl show --property DefaultTasksMax
DefaultTasksMax=256
Adjust this default value to suit your needs. You can set different limits on individual services as needed. This example is for MariaDB. First check the current active value:
> systemctl status mariadb.service
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset>
   Active: active (running) since Tue 2020-05-26 14:15:03 PDT; 27min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
 Main PID: 11845 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 30 (limit: 256)
   CGroup: /system.slice/mariadb.service
           └─11845 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
The Tasks line shows that MariaDB currently has 30 tasks running, and has an upper limit of the default 256, which is inadequate for a database. The following example demonstrates how to raise MariaDB's limit to 8192.
> sudo systemctl set-property mariadb.service TasksMax=8192
> systemctl status mariadb.service
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disab>
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─50-TasksMax.conf
   Active: active (running) since Tue 2020-06-02 17:57:48 PDT; 7min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 3446 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper upgrade (code=exited, sta>
  Process: 3440 ExecStartPre=/usr/lib/mysql/mysql-systemd-helper install (code=exited, sta>
 Main PID: 3452 (mysqld)
   Status: "Taking your SQL requests now..."
    Tasks: 30 (limit: 8192)
   CGroup: /system.slice/mariadb.service
           └─3452 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
        systemctl set-property applies the new limit and
        creates a drop-in file for persistence,
        /etc/systemd/system/mariadb.service.d/50-TasksMax.conf,
        that contains only the changes you want to apply to the existing unit
        file. The value does not have to be 8192, but should be whatever limit
        is appropriate for your workloads.
      
10.4.3 Default TasksMax limit on users #
        The default limit on users should be high, because user sessions need
        more resources. Set your own default for any user by creating a new
        file, for example
        /etc/systemd/system/user-.slice.d/40-user-taskmask.conf.
        The following example sets a default of 16284:
      
[Slice]
TasksMax=16284
See Section 19.5.3, “Creating drop-in files manually” to learn what numeric prefixes are expected for drop-in files.
Then reload systemd to load the new value, and verify the change:
> sudo systemctl daemon-reload
> systemctl show --property TasksMax user-1000.slice
TasksMax=16284
        How do you know what values to use? This varies according to your
        workloads, system resources, and other resource configurations. When
        your TasksMax value is too low, you may see error
        messages such as Failed to fork (Resources temporarily
        unavailable), Can't create thread to handle new
        connection, and Error: Function call 'fork' failed
        with error code 11, 'Resource temporarily unavailable'.
      
        For more information on configuring system resources in systemd, see
        systemd.resource-control (5).
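
        One way to judge whether a limit is too tight before such errors
        appear is to compare the current task count of a unit with its limit.
        The sketch below assumes the TasksCurrent and TasksMax unit properties
        and the --value option of systemctl show; the helper
        tasks_usage_percent is a hypothetical name, and mariadb.service is
        reused from the earlier example.

```shell
#!/bin/sh
# tasks_usage_percent CURRENT MAX
# Prints the integer percentage of the limit in use, "unlimited" when
# MAX is "infinity", or "unknown" when MAX is not a positive number.
tasks_usage_percent() {
    case "$2" in
        infinity)        echo "unlimited" ;;
        ''|*[!0-9]*|0)   echo "unknown" ;;
        *)               echo "$(( $1 * 100 / $2 ))%" ;;
    esac
}

unit=mariadb.service
if command -v systemctl >/dev/null 2>&1; then
    current=$(systemctl show --property TasksCurrent --value "$unit" 2>/dev/null)
    max=$(systemctl show --property TasksMax --value "$unit" 2>/dev/null)
    case "$current" in
        ''|*[!0-9]*) echo "$unit: no task count available" ;;
        *) echo "$unit: $current tasks, usage of limit: $(tasks_usage_percent "$current" "$max")" ;;
    esac
fi
```

        A unit that regularly runs close to 100% of its limit is a candidate
        for a higher TasksMax value.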
      
10.5 I/O control with cgroups #
This section introduces using the Linux kernel's block I/O controller to prioritize or throttle I/O operations. This leverages the means provided by systemd to configure cgroups, and discusses probable pitfalls when dealing with proportional I/O control.
10.5.1 Prerequisites #
The following subsections describe steps that you must take in advance when you design and configure your system, since those aspects cannot be changed during runtime.
10.5.1.1 File system #
You should use a cgroup-writeback-aware file system (otherwise writeback charging is not possible). The recommended SUSE Linux Enterprise Desktop file systems added support in the following upstream releases:
- Btrfs (v4.3) 
- Ext4 (v4.3) 
- XFS (v5.3) 
As of SUSE Linux Enterprise Desktop 15 SP3, any of the named file systems can be used.
10.5.1.2 Block I/O scheduler #
The throttling policy is implemented higher in the stack, therefore it does not require any additional adjustments. The proportional I/O control policies have two different implementations: the BFQ controller, and the cost-based model. We describe the BFQ controller here. For BFQ's proportional control to take effect on a particular device, BFQ must be the selected scheduler for that device. Check the current scheduler (the active one is shown in square brackets):
> cat /sys/class/block/sda/queue/scheduler
mq-deadline kyber bfq [none]
Switch the scheduler to BFQ:
# echo bfq > /sys/class/block/sda/queue/scheduler
You must specify the disk device (not a partition). The optimal way to set this attribute is a udev rule specific to the device. SUSE Linux Enterprise Desktop ships udev rules that already enable BFQ for rotational disk drives.
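To review the active scheduler of all disks at once, the bracketed entry can be extracted from each scheduler file. This is a minimal sketch; the helper name active_scheduler is hypothetical, and the loop assumes the sysfs layout shown above.

```shell
#!/bin/sh
# active_scheduler LINE: print the entry enclosed in square brackets,
# which marks the scheduler currently in use.
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Partitions have no queue directory, so only whole disks are matched.
for f in /sys/class/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/class/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler "$(cat "$f")")"
done
```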
10.5.1.3 Cgroup hierarchy layout #
Normally, all tasks reside in the root cgroup and they compete against each other. When the tasks are distributed into the cgroup tree, competition occurs between sibling cgroups only. This applies to proportional I/O control; throttling, in contrast, hierarchically aggregates the throughput of all descendants (see the following diagram).
r
`-  a      IOWeight=100
    `- [c] IOWeight=300
    `-  d  IOWeight=100
`- [b]     IOWeight=200
I/O is originating only from cgroups c and b. Even though c has a higher weight than b, it is treated with lower priority, because competition happens level by level: the parent of c, a (weight 100), competes with b (weight 200).
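The effect in the diagram can be worked through with simple arithmetic. The sketch below is an idealized calculation that assumes shares are split strictly by weight among active siblings; real throughput ratios deviate from it. The helper name share_percent is hypothetical, and the results are rounded down to whole percent.

```shell
#!/bin/sh
# share_percent WEIGHT TOTAL: integer share of one cgroup among its
# active siblings, where TOTAL is the sum of the active siblings' weights.
share_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Top level: a (100) competes with b (200).
a_share=$(share_percent 100 300)
b_share=$(share_percent 200 300)

# Below a: d is idle, so c is the only active child and gets all of a's share.
c_share=$(( a_share * $(share_percent 300 300) / 100 ))

echo "b: ${b_share}%  c: ${c_share}%"
```

In this idealized model, b receives about twice the share of c, although c has the highest configured weight in the tree.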
10.5.2 Configuring control quantities #
You can apply the values to (long running) services permanently.
> sudo systemctl set-property fast.service IOWeight=400
> sudo systemctl set-property slow.service IOWeight=50
> sudo systemctl set-property throttled.service IOReadBandwidthMax="/dev/sda 1M"
Alternatively, you can apply I/O control to individual commands, for example:
> sudo systemd-run --scope -p IOWeight=400 high_prioritized_command
> sudo systemd-run --scope -p IOWeight=50 low_prioritized_command
> sudo systemd-run --scope -p IOReadBandwidthMax="/dev/sda 1M" dd if=/dev/sda of=/dev/null bs=1M count=10
10.5.3 I/O control behavior and setting expectations #
The following list items describe I/O control behavior, and what you should expect under different conditions.
- I/O control works best for direct I/O operations (bypassing the page cache). Situations where the actual I/O is decoupled from the caller (typically writeback via the page cache) may manifest in various ways, for example, as delayed I/O control or even no observable I/O control (consider small bursts or competing workloads that happen to never “meet” by submitting I/O at the same time and saturating the bandwidth). For these reasons, the resulting ratio of I/O throughputs does not strictly follow the ratio of configured weights. 
- systemd performs scaling of configured weights (to adjust for narrower BFQ weight range), hence the resulting throughput ratios also differ. 
- The writeback activity depends on the amount of dirty pages, in addition to the global sysctl knobs (vm.dirty_background_ratio and vm.dirty_ratio). Memory limits of individual cgroups come into play when the dirty limits are distributed among cgroups, and this in turn may affect I/O intensity of affected cgroups.
- Not all storage is equal. I/O control happens at the I/O scheduler layer, which has ramifications for setups where devices that do no actual scheduling are stacked on top of the scheduled devices. Consider device mapper logical volumes spanning multiple physical devices, MD RAID, or even Btrfs RAID. I/O control over such setups may be challenging. 
- There is no separate setting for proportional I/O control of reads and writes. 
- Proportional I/O control is only one of the policies that can interact with each other (but responsible resource design perhaps avoids that). 
- The I/O device bandwidth is not the only shared resource on the I/O path. Global file system structures are involved, which is relevant when I/O control is meant to guarantee certain bandwidth; it does not, and it may even lead to priority inversion (prioritized cgroup waiting for a transaction of slower cgroup). 
- So far, we have been discussing only explicit I/O of file system data, but swap-in and swap-out can also be controlled. However, if such a need arises, it points to improperly provisioned memory (or memory limits). 
10.5.4 Resource control in user sessions #
        To apply cgroup resource control within user sessions,
        controllers must be delegated to user instances of systemd.
        SUSE Linux Enterprise Desktop ships a default systemd configuration that delegates
        no controllers.
      
        You can use drop-in files to change the set of delegated controllers.
        For instance,
        /etc/systemd/system/user@.service.d/60-delegate.conf
        adds controllers to all users, while
        /etc/systemd/system/user@uid.service.d/60-delegate.conf
        adds controllers only to a particular user. The content of the file
        should be like the following:
      
[Service]
Delegate=pids memory
        Both the systemd instance and the affected user instance must be notified
        to reload the new configuration.
      
> sudo systemctl daemon-reload
> systemctl --user daemon-reexec
Alternatively, the affected user may log out and log in instead of applying the second line to restart their user instance.
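To confirm that delegation took effect, the controllers available to the user instance can be read from its cgroup.controllers file. The following is a sketch under the assumption that the unified hierarchy is mounted on /sys/fs/cgroup and that the user instance lives in the usual user.slice/user-UID.slice/user@UID.service path; the helper name has_controller is hypothetical.

```shell
#!/bin/sh
# has_controller "LIST" NAME: succeed if NAME is in the space-separated LIST,
# which is the format of the cgroup.controllers file.
has_controller() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumed path layout for the user instance of the current user.
uid=$(id -u)
file=/sys/fs/cgroup/user.slice/user-$uid.slice/user@$uid.service/cgroup.controllers

if [ -r "$file" ]; then
    for c in pids memory; do
        if has_controller "$(cat "$file")" "$c"; then
            echo "$c: delegated"
        else
            echo "$c: not delegated"
        fi
    done
fi
```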
10.6 More information #
- Kernel documentation (package kernel-source): files in /usr/src/linux/Documentation/admin-guide/cgroup-v1 and file /usr/src/linux/Documentation/admin-guide/cgroup-v2.rst.
- man systemd.resource-control
- https://lwn.net/Articles/604609/—Brown, Neil: Control Groups Series (2014, 7 parts). 
- https://lwn.net/Articles/243795/—Corbet, Jonathan: Controlling memory use in containers (2007). 
- https://lwn.net/Articles/236038/—Corbet, Jonathan: Process containers (2007).