Understanding Kernel Control Groups
- WHAT?
From granular resource allocation to real-time performance monitoring and isolation, control groups (cgroups) enable you to manage how system resources like CPU, memory, and network bandwidth are distributed among your processes in an organized way.
- WHY?
This article provides a comprehensive overview of managing system resources and process isolation through the use of kernel control groups (cgroups).
- EFFORT
The average reading time of this article is approximately 30 minutes.
- GOAL
You will be able to manage and isolate your system's hardware resources efficiently using kernel control groups (cgroups).
- REQUIREMENTS
Use Btrfs, Ext4, or XFS to ensure the kernel can properly track and charge background I/O operations to the correct groups.
Ensure that the BFQ scheduler or IOCost is active on your block devices to enable the proportional weighting and prioritization of disk traffic.
For resource management within user sessions, you must explicitly delegate controllers to the user's systemd instance via drop-in configuration files.
1 Introduction to Control Groups #
Kernel control groups provide a hierarchical framework for organizing processes to monitor, isolate, and limit their consumption of system resources like CPU, memory, and I/O.
Every process is assigned to exactly one administrative control group (cgroup). These cgroups are organized into a unified hierarchical tree structure, allowing you to manage resource allocation for individual processes or entire branches of the hierarchy simultaneously. You can define specific limitations for system resources.
1.1 The role of systemd slices #
On SUSE Linux Enterprise Server, systemd serves as the primary manager for cgroups. Instead of requiring
administrators to interact with the raw cgroup file system manually, systemd abstracts this
management by organizing all processes into logical units called slices
(.slice units).
This integration bridges the gap between systemd service management and the kernel's resource tracking through several key mechanisms:
Direct hierarchical mapping: Slices represent inner nodes in a tree that maps directly onto the kernel’s cgroup directory structure.
Built-in partitioning: By default, systemd automatically organizes the operating system into distinct root slices to prevent workloads from interfering with each other.
systemd uses these root slices:
system.slice for services
user.slice for user sessions
machine.slice for managing resources of virtual machines
init.scope to host the PID 1 process itself
To visualize your current system organization, use the systemd-cgls command to display the hierarchy tree.
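To complement the tree view, you can also look up where a single process sits in the hierarchy. The following is a minimal sketch, assuming the unified hierarchy, where /proc/PID/cgroup contains one line of the form 0::/path. The helper name current_cgroup is illustrative and not part of systemd.

```shell
# Print the cgroup path of a process on the unified (v2) hierarchy.
# On cgroup v2, /proc/PID/cgroup contains one line of the form "0::/path".
# current_cgroup is an illustrative helper name, not a systemd tool.
current_cgroup() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# Show where the current shell lives in the tree.
if [ -r /proc/self/cgroup ]; then
    current_cgroup
fi
```

The printed path corresponds to one branch of the tree that systemd-cgls displays.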
1.2 The unified hierarchy (cgroup v2) #
The Linux kernel provides two cgroup API variants (v1 and v2). SLES uses only the unified (v2) hierarchy, which is the default and recommended mode.
- Unified hierarchy (cgroup v2)
A single hierarchy where all resource controllers are managed under a unified structure, providing better consistency and easier resource accounting.
- Hybrid/Legacy cgroup support (v1)
cgroup v1 is functionally obsolete and unsupported in SUSE Linux Enterprise Server 16.0. If you need cgroup v1, use an older SLES release.
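To confirm which mode a running system uses, you can test for the cgroup.controllers file, which exists only in the root of a unified (v2) hierarchy. This is a sketch that assumes the hierarchy is mounted at /sys/fs/cgroup (the systemd default). The helper name is_cgroup_v2 is illustrative.

```shell
# Report whether a directory is the root of a cgroup v2 (unified) mount.
# Assumption: the hierarchy is mounted at /sys/fs/cgroup, the systemd default.
# The cgroup.controllers file exists only in the unified hierarchy.
is_cgroup_v2() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

if is_cgroup_v2; then
    echo "unified hierarchy (cgroup v2)"
else
    echo "no cgroup v2 root found at /sys/fs/cgroup"
fi
```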
2 Configuring resource limits and accounting #
Beyond simply organizing processes, the integration between systemd and cgroups provides you with the dual capability to both measure and restrict resource consumption. Before applying strict constraints, systemd allows you to audit exactly how much CPU, memory, or I/O a workload uses. Once you establish these baselines, you can enforce runtime limitations dynamically to protect system stability.
2.1 Resource accounting #
Organizing processes into different cgroups can be used to obtain per-cgroup resource consumption data. The accounting has comparatively small but non-zero overhead, whose impact depends on the workload. Activating accounting for one unit also implicitly activates it for all units in the same slice, and for all its parent slices, and the units contained in them.
The accounting can be set on a per-unit basis with directives such as MemoryAccounting= or globally for all units in /etc/systemd/system.conf with the directive DefaultMemoryAccounting=.
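Once accounting is active, the collected figures can be read from the cgroup directory of a unit. The following sketch assumes cgroup v2 mounted at /sys/fs/cgroup and that the memory and pids controllers are enabled for the inspected cgroup. The helper name cgroup_usage is illustrative.

```shell
# Print basic consumption figures for one cgroup directory.
cgroup_usage() {
    dir=$1
    echo "memory_bytes=$(cat "$dir/memory.current")"
    echo "tasks=$(cat "$dir/pids.current")"
    # cpu.stat holds "key value" pairs; usage_usec is total CPU time in microseconds.
    echo "cpu_usec=$(awk '$1 == "usage_usec" { print $2 }' "$dir/cpu.stat")"
}

# Example: inspect system.slice if the required files exist on this machine.
slice=/sys/fs/cgroup/system.slice
if [ -r "$slice/memory.current" ] && [ -r "$slice/pids.current" ] && [ -r "$slice/cpu.stat" ]; then
    cgroup_usage "$slice"
fi
```

Reading the raw files is shown here only to illustrate where the numbers come from. For routine work, the systemd tools are the more convenient interface.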
2.2 Setting resource limits #
Be aware that resource consumption implicitly depends on the environment where your workload executes (for example, size of data structures in libraries/kernel, forking behavior of utilities, computational efficiency). Hence it is recommended to (re)calibrate your limits should the environment change.
Limitations to cgroups are primarily set with the systemctl
set-property command. The command syntax follows:
> sudo systemctl set-property [--runtime] NAME PROPERTY1=VALUE [PROPERTY2=VALUE]
NAME: A systemd service, scope, or slice name.
--runtime: Optional. Use this if you do not want values to persist after reboot.
PROPERTY: For a complete list, see man systemd.resource-control.
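As an illustration of this syntax, the sketch below sets two properties on a unit without persisting them. The unit name example.service and the wrapper apply_limits are hypothetical. The wrapper only prints the command unless DRY_RUN=0 is set, so you can review it before it changes anything.

```shell
# Apply temporary limits to a unit. example.service is a hypothetical unit name.
# With DRY_RUN=1 (the default here) the command is only printed, not executed.
apply_limits() {
    unit=$1
    shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "systemctl set-property --runtime $unit $*"
    else
        sudo systemctl set-property --runtime "$unit" "$@"
    fi
}

# MemoryMax= and CPUWeight= are documented in man systemd.resource-control.
apply_limits example.service MemoryMax=1G CPUWeight=200
```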
3 Preventing fork bombs with TasksMax #
systemd provides the TasksMax
parameter to limit the number of tasks (processes and threads) within a unit or slice.
This serves as a critical defense against “fork bombs” and runaway
processes that could otherwise exhaust the system's Process ID (PID) pool.
While upstream systemd defaults to 15%
of the kernel global limit (viewable via sysctl kernel.pid_max),
SLES overrides this behavior to provide maximum compatibility for high-performance
workloads.
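To get a feeling for the upstream default, you can compute 15% of the kernel limit yourself. This is a rough sketch that assumes kernel.pid_max is the binding global limit on your system. The helper name upstream_tasks_max is illustrative.

```shell
# Compute 15% of a given limit using integer arithmetic (rounds down).
upstream_tasks_max() {
    echo $(( $1 * 15 / 100 ))
}

# Assumption: kernel.pid_max is the binding global limit on this system.
if [ -r /proc/sys/kernel/pid_max ]; then
    upstream_tasks_max "$(cat /proc/sys/kernel/pid_max)"
fi
```

For example, a pid_max of 4194304 yields 629145 tasks.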
3.1 Identifying default TasksMax values #
In SLES, the default TasksMax value for both system units
and user slices is set to infinity. This means no restriction
is enforced by default. These overrides are located in the following vendor
configuration files:
System units: /usr/lib/systemd/system.conf.d/25-defaults-SLE.conf
User slices: /usr/lib/systemd/system/user-.slice.d/25-defaults-SLE.conf
To verify the effective DefaultTasksMax on your system, run:
systemctl show --property DefaultTasksMax
While infinity prevents service disruption due to reaching
task limits, it leaves the system vulnerable to resource exhaustion. We
recommend setting explicit limits for services that might be prone to exploits.
For more information, refer to “Configuring and tuning process and threads limits”.
3.2 Overriding the DefaultTasksMax value #
To set a global default limit for all system units, create a drop-in file at
/etc/systemd/system.conf.d/90-system-tasksmax.conf. For
example, to limit each unit to 256 tasks:
[Manager]
DefaultTasksMax=256
Apply the configuration and verify the change:
sudo systemctl daemon-reload
systemctl show --property DefaultTasksMax
3.2.1 Setting per-service limits #
Specific applications, such as large databases, often require more tasks
than the global default. You can override the limit for a specific service
(for example, MariaDB) using systemctl set-property:
sudo systemctl set-property mariadb.service TasksMax=8192
This command creates a persistent drop-in file at
/etc/systemd/system.control/mariadb.service.d/50-TasksMax.conf.
Verify the new limit in the Tasks: line of the service status:
Tasks: 30 (limit: 8192)
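If you want to track how close a service is to its limit, the two numbers can be extracted from that status line. A minimal sketch follows. The helper name tasks_from_status is illustrative, and the pattern assumes the line format shown above.

```shell
# Extract "current limit" from a status line such as "Tasks: 30 (limit: 8192)".
tasks_from_status() {
    sed -n 's/^ *Tasks: *\([0-9][0-9]*\) (limit: *\([0-9][0-9]*\)).*/\1 \2/p'
}

# Example with the line shown above; in practice, pipe in the output of
# "systemctl status mariadb.service" instead.
echo "Tasks: 30 (limit: 8192)" | tasks_from_status
```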
3.3 Configuring task limits for user sessions #
User sessions often require higher limits than system services. To set a
custom default for all users, create
/etc/systemd/system/user-.slice.d/40-user-taskmax.conf:
[Slice]
TasksMax=16284
If a TasksMax limit is too restrictive, the system logs
will report errors such as Failed to fork (Resource temporarily
unavailable). Incrementally increase the limit until the
workload stabilizes.
4 I/O Control with cgroups #
The following section outlines how to prioritize, isolate, and throttle disk input/output (I/O) operations by using systemd to manage the Linux kernel's block I/O controller. It defines the mandatory file system and storage scheduler prerequisites, demonstrates how to enforce limits on both persistent services and transient scopes, and details critical behavioral nuances—such as direct versus buffered I/O—that impact resource distribution across the system.
4.1 Prerequisites #
Before configuring I/O control, ensure your system meets the following requirements. These settings generally cannot be changed during runtime.
4.1.1 File System Support #
You must use a file system that supports cgroup writeback so that buffered writes are charged to the correct cgroup. Supported file systems, with the kernel version that introduced the support, include:
Btrfs (v4.3+)
Ext4 (v4.3+)
XFS (v5.3+)
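A quick way to check a mount point against this list is sketched below. It assumes findmnt from util-linux is installed. The helper name supports_cgroup_writeback is illustrative and only compares the type name. It does not check kernel versions.

```shell
# Succeed if the file system type is one of the three listed above.
supports_cgroup_writeback() {
    case "$1" in
        btrfs|ext4|xfs) return 0 ;;
        *) return 1 ;;
    esac
}

# findmnt (util-linux) prints the file system type backing a path.
if command -v findmnt >/dev/null 2>&1; then
    fstype=$(findmnt -n -o FSTYPE -T / || true)
    if supports_cgroup_writeback "$fstype"; then
        echo "/ is on $fstype: listed as supported"
    else
        echo "/ is on ${fstype:-unknown}: not in the list above"
    fi
fi
```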
4.1.2 Block I/O Scheduler #
Proportional I/O control requires a specific scheduler. SUSE recommends
the BFQ scheduler.
To verify the current scheduler for a device (for example,
sda):
cat /sys/class/block/sda/queue/scheduler
mq-deadline kyber bfq [none]
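In this output, the active scheduler is the one enclosed in square brackets. In scripts, it can be extracted as sketched below. The helper name active_scheduler is illustrative.

```shell
# Print the scheduler name enclosed in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example with the output shown above.
echo "mq-deadline kyber bfq [none]" | active_scheduler

# For a real device (sda is the example device used above):
# active_scheduler < /sys/class/block/sda/queue/scheduler
```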
To switch the scheduler to BFQ:
echo bfq | sudo tee /sys/class/block/sda/queue/scheduler
Apply this setting to the disk device itself, not a partition. The optimal way to set this attribute is a udev rule specific to the device. SLES
includes udev rules that
automatically enable BFQ for rotational drives. As an alternative to BFQ, the io.cost-based I/O controller operates independently of the I/O scheduler. However, this proportional control mechanism requires manual tuning of its cost-based model parameters.
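A device-specific udev rule for BFQ could look like the output of the following sketch. The helper name bfq_udev_rule and the rule file name in the comments are hypothetical. Adapt the device match to your hardware before installing the rule.

```shell
# Print a udev rule that selects BFQ for one whole-disk device.
bfq_udev_rule() {
    printf 'ACTION=="add|change", KERNEL=="%s", ATTR{queue/scheduler}="bfq"\n' "$1"
}

# Show the rule for the example device sda.
bfq_udev_rule sda

# To install it (the file name is hypothetical):
# bfq_udev_rule sda | sudo tee /etc/udev/rules.d/60-io-scheduler.rules
# sudo udevadm control --reload && sudo udevadm trigger --subsystem-match=block
```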
4.2 Configuring Control Quantities #
Apply I/O limits permanently to services using systemctl:
sudo systemctl set-property fast.service IOWeight=400
sudo systemctl set-property slow.service IOWeight=50
sudo systemctl set-property throttled.service IOReadBandwidthMax="/dev/sda 1M"
Alternatively, use systemd-run to apply limits to transient
scopes:
sudo systemd-run --scope -p IOWeight=400 high_prioritized_command
4.3 I/O control behavior and expectations #
The following list items describe I/O control behavior, and what you should expect under different conditions.
I/O control works best for direct I/O operations (bypassing the page cache). When the actual I/O is decoupled from the caller (typically writeback via the page cache), the effect varies: I/O control may be delayed or not observable at all (consider small bursts or competing workloads that never “meet” by submitting I/O at the same time and saturating the bandwidth). For these reasons, the resulting ratio of I/O throughput does not strictly follow the ratio of configured weights.
systemd performs scaling of configured weights (to adjust for narrower BFQ weight range), hence the resulting throughput ratios also differ.
The writeback activity depends on the amount of dirty pages, besides the global sysctl knobs (vm.dirty_background_ratio and vm.dirty_ratio). Memory limits of individual cgroups come into play when the dirty limits are distributed among cgroups, and this in turn may affect I/O intensity of affected cgroups.
Not all storage is equal. The I/O control happens at the I/O scheduler layer, which has ramifications for setups with stacked devices that perform no scheduling themselves. Consider device mapper logical volumes spanning multiple physical devices, MD RAID, or even Btrfs RAID. I/O control over such setups may be challenging.
There is no separate setting for proportional I/O control of reads and writes.
Proportional I/O control is only one of the policies that can interact with each other (but responsible resource design perhaps avoids that).
The I/O device bandwidth is not the only shared resource on the I/O path. Global file system structures are involved, which is relevant when I/O control is meant to guarantee certain bandwidth; it does not, and it may even lead to priority inversion (prioritized cgroup waiting for a transaction of slower cgroup).
So far, we have been discussing only explicit I/O of file system data, but swap-in and swap-out can also be controlled. However, if such a need arises, it points to improperly provisioned memory (or memory limits).
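For orientation, the nominal split implied by a set of competing weights can be computed as sketched below, using the weights 400 and 50 from the earlier example. The helper name nominal_share is illustrative. As the items above explain, measured throughput typically deviates from this idealized figure.

```shell
# Nominal share (percent, rounded down) of one weight among competing weights.
# Usage: nominal_share WEIGHT OTHER_WEIGHT...
nominal_share() {
    own=$1
    total=0
    # Sum all weights, including the first one.
    for w in "$@"; do
        total=$(( total + w ))
    done
    echo $(( own * 100 / total ))
}

# fast.service (400) competing with slow.service (50).
nominal_share 400 50
```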
4.4 Resource Control in User Sessions #
SLES ships with a default systemd configuration that delegates no controllers for performance reasons.
To enable resource control within a user session, create a drop-in file
at /etc/systemd/system/user@.service.d/60-delegate.conf:
[Service]
Delegate=pids memory
After modifying the configuration, notify the instances:
sudo systemctl daemon-reload
systemctl --user daemon-reexec
Alternatively, the affected user may log out and log in instead of applying the second line to restart their user instance.
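To check which controllers a user instance actually received, you can read the cgroup.controllers file of its cgroup. The sketch below assumes cgroup v2 mounted at /sys/fs/cgroup and the standard systemd layout user.slice/user-UID.slice/user@UID.service. The helper name delegated_controllers is illustrative.

```shell
# Print the controllers available to a user's systemd instance.
# Assumptions: cgroup v2 at /sys/fs/cgroup and the standard systemd layout
# user.slice/user-UID.slice/user@UID.service.
delegated_controllers() {
    uid=$1
    root=${2:-/sys/fs/cgroup}
    cat "$root/user.slice/user-$uid.slice/user@$uid.service/cgroup.controllers"
}

# Check the instance of the current user, if it exists.
f="/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
if [ -r "$f" ]; then
    delegated_controllers "$(id -u)"
fi
```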
5 Legal Notice #
Copyright © 2006–2026 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.