14 Tuning the task scheduler #
Modern operating systems, such as SUSE® Linux Enterprise Server, normally run many tasks at the same time. For example, you can be searching in a text file while receiving an e-mail and copying a big file to an external hard disk. These simple tasks require many additional processes to be run by the system. To provide each task with its required system resources, the Linux kernel needs a tool to distribute available system resources to individual tasks. And this is exactly what the task scheduler does.
The following sections explain the most important terms related to a process scheduling. They also introduce information about the task scheduler policy, scheduling algorithm, description of the task scheduler used by SUSE Linux Enterprise Server, and references to other sources of relevant information.
14.1 Introduction #
The Linux kernel controls the way that tasks (or processes) are managed on the system. The task scheduler, sometimes called process scheduler, is the part of the kernel that decides which task to run next. It is responsible for the optimal use of system resources to guarantee that multiple tasks are being executed simultaneously. This makes it a core component of any multitasking operating system.
14.1.1 Preemption #
The theory behind task scheduling is simple. If there are runnable processes in a system, at least one process must always be running. If there are more runnable processes than processors in a system, not all the processes can be running all the time.
Therefore, certain processes need to be stopped temporarily, or suspended, so that others can be running again. The scheduler decides what process in the queue runs next.
As already mentioned, Linux, like all other Unix variants, is a multitasking operating system. That means that several tasks can be running at the same time. Linux provides a so called preemptive multitasking, where the scheduler decides when a process is suspended. This forced suspension is called preemption. All Unix flavors have been providing preemptive multitasking since the beginning.
14.1.2 Timeslice #
The time period for which a process runs before it is preempted is defined in advance. It is called a timeslice of a process and represents the amount of processor time that is provided to each process. By assigning timeslices, the scheduler makes global decisions for the running system, and prevents individual processes from taking control of the processor resources.
14.1.3 Process priority #
The scheduler evaluates processes based on their priority. To calculate the current priority of a process, the task scheduler uses complex algorithms. As a result, each process is given a value according to which it is “allowed” to run on a processor.
14.2 Process classification #
Processes are classified according to their purpose and behavior. Although the borderline is not always distinct, generally two criteria are used to sort them. These criteria are independent and do not exclude each other.
One approach is to classify a process either I/O-bound or processor-bound.
- I/O-bound
I/O stands for Input/Output devices, such as keyboards, mice, or optical and hard disks. I/O-bound processes spend most of the time submitting and waiting for requests. They are run frequently, but at short time intervals, not to block other processes waiting for I/O requests.
- processor-bound
Processor-bound tasks use their time to execute a code, and run until they are preempted by the scheduler. They do not block processes waiting for I/O requests, and, therefore, can be run less frequently but for longer time intervals.
Another approach is to divide processes by type into interactive, batch, and real-time processes.
Interactive processes spend a lot of time waiting for I/O requests, such as keyboard or mouse operations. The scheduler must wake up such processes quickly on user request, or the user finds the environment unresponsive. The typical delay is approximately 100 ms. Office applications, text editors or image manipulation programs represent typical interactive processes.
Batch processes often run in the background and do not need to be responsive. They receive lower priority from the scheduler. Multimedia converters, database search engines, or log files analyzers are typical examples of batch processes.
Real-time processes must never be blocked by low-priority processes, and the scheduler guarantees a short response time to them. Applications for editing multimedia content are a good example here.
14.3 Completely Fair Scheduler #
Since the Linux kernel version 2.6.23, a new approach has been taken to the scheduling of runnable processes. Completely Fair Scheduler (CFS) became the default Linux kernel scheduler. Since then, important changes and improvements have been made. The information in this chapter applies to SUSE Linux Enterprise Server with kernel version 2.6.32 and higher (including 3.x kernels). The scheduler environment was divided into several parts, and three main new features were introduced:
- Modular scheduler core
The core of the scheduler was enhanced with scheduling classes. These classes are modular and represent scheduling policies.
- Completely Fair Scheduler
Introduced in kernel 2.6.23 and extended in 2.6.24, CFS tries to assure that each process obtains its “fair” share of the processor time.
- Group scheduling
For example, if you split processes into groups according to which user is running them, CFS tries to provide each of these groups with the same amount of processor time.
As a result, CFS brings optimized scheduling for both servers and desktops.
14.3.1 How CFS works #
CFS tries to guarantee a fair approach to each runnable task. To find the most balanced way of task scheduling, it uses the concept of red-black tree. A red-black tree is a type of self-balancing data search tree which provides inserting and removing entries in a reasonable way so that it remains well balanced.
When CFS schedules a task it accumulates “virtual runtime” or vruntime. The next task picked to run is always the task with the minimum accumulated vruntime so far. By balancing the red-black tree when tasks are inserted into the run queue (a planned time line of processes to be executed next), the task with the minimum vruntime is always the first entry in the red-black tree.
The amount of vruntime a task accrues is related to its priority. High priority tasks gain vruntime at a slower rate than low priority tasks, which results in high priority tasks being picked to run on the processor more often.
14.3.2 Grouping processes #
Since the Linux kernel version 2.6.24, CFS can be tuned to be fair to groups rather than to tasks only. Runnable tasks are then grouped to form entities, and CFS tries to be fair to these entities instead of individual runnable tasks. The scheduler also tries to be fair to individual tasks within these entities.
The kernel scheduler lets you group runnable tasks using control groups. For more information, see Chapter 10, Kernel control groups.
14.3.3 Kernel configuration options #
Basic aspects of the task scheduler behavior can be set through the kernel configuration options. Setting these options is part of the kernel compilation process. Because kernel compilation process is a complex task and out of this document's scope, refer to relevant source of information.
If you run SUSE Linux Enterprise Server on a kernel that was not shipped with it, for example on a self-compiled kernel, you lose the entire support entitlement.
14.3.4 Terminology #
Documents regarding task scheduling policy often use several technical terms which you need to know to understand the information correctly. A few of them are as follows:
- Latency
Delay between the time a process is scheduled to run and the actual process execution.
- Granularity
The relation between granularity and latency can be expressed by the following equation:
gran = ( lat / rtasks ) - ( lat / rtasks / rtasks )
where gran stands for granularity, lat stand for latency, and rtasks is the number of running tasks.
14.3.4.1 Scheduling policies #
The Linux kernel supports the following scheduling policies:
- SCHED_FIFO
Scheduling policy designed for special time-critical applications. It uses the First In-First Out scheduling algorithm.
- SCHED_BATCH
Scheduling policy designed for CPU-intensive tasks.
- SCHED_IDLE
Scheduling policy intended for very low-priority tasks.
- SCHED_OTHER
Default Linux time-sharing scheduling policy used by most of the processes.
- SCHED_RR
Similar to
SCHED_FIFO
, but uses the Round Robin scheduling algorithm.
14.3.5 Changing real-time attributes of processes with chrt
#
The chrt
command sets or retrieves the real-time
scheduling attributes of a running process, or runs a command with the
specified attributes. You can get or retrieve both the scheduling
policy and priority of a process.
In the following examples, a process whose PID is 16244 is used.
To retrieve the real-time attributes of an existing task:
#
chrt -p 16244
pid 16244's current scheduling policy: SCHED_OTHER
pid 16244's current scheduling priority: 0
Before setting a new scheduling policy on the process, you need to find out the minimum and maximum valid priorities for each scheduling algorithm:
#
chrt -m
SCHED_SCHED_OTHER min/max priority : 0/0
SCHED_SCHED_FIFO min/max priority : 1/99
SCHED_SCHED_RR min/max priority : 1/99
SCHED_SCHED_BATCH min/max priority : 0/0
SCHED_SCHED_IDLE min/max priority : 0/0
In the above example, SCHED_OTHER, SCHED_BATCH, SCHED_IDLE polices only allow for priority 0, while that of SCHED_FIFO and SCHED_RR can range from 1 to 99.
To set SCHED_BATCH scheduling policy:
#
chrt -b -p 0 16244
pid 16244's current scheduling policy: SCHED_BATCH
pid 16244's current scheduling priority: 0
For more information on chrt
, see its man page
(man 1 chrt
).
14.3.6 Runtime tuning with sysctl
#
The sysctl
interface for examining and changing
kernel parameters at runtime introduces important variables, using
which you can change the default behavior of the task scheduler. The
syntax of the sysctl
is simple, and all the
following commands must be entered on the command line as root
.
To read a value from a kernel variable, enter
#
sysctl VARIABLE
To assign a value, enter
#
sysctl VARIABLE=VALUE
To get a list of all scheduler-related variables, run the
sysctl
command, and use grep
to
filter the output:
#
sysctl -A | grep "sched" | grep -v "domain"
kernel.sched_cfs_bandwidth_slice_us = 5000 kernel.sched_child_runs_first = 0 kernel.sched_compat_yield = 0 kernel.sched_latency_ns = 24000000 kernel.sched_migration_cost_ns = 500000 kernel.sched_min_granularity_ns = 8000000 kernel.sched_nr_migrate = 32 kernel.sched_rr_timeslice_ms = 25 kernel.sched_rt_period_us = 1000000 kernel.sched_rt_runtime_us = 950000 kernel.sched_schedstats = 0 kernel.sched_shares_window_ns = 10000000 kernel.sched_time_avg_ms = 1000 kernel.sched_tunable_scaling = 1 kernel.sched_wakeup_granularity_ns = 10000000
Variables ending with “_ns” and “_us” accept values in nanoseconds and microseconds, respectively.
A list of the most important task scheduler sysctl
tuning variables (located at /proc/sys/kernel/
)
with a short description follows:
sched_cfs_bandwidth_slice_us
When CFS bandwidth control is in use, this parameter controls the amount of runtime (bandwidth) transferred to a run queue from the task's control group bandwidth pool. Small values allow the global bandwidth to be shared in a fine-grained manner among tasks, larger values reduce transfer overhead. See https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt.
sched_child_runs_first
A freshly forked child runs before the parent continues execution. Setting this parameter to
1
is beneficial for an application in which the child performs an execution after fork.sched_compat_yield
Enables the aggressive CPU yielding behavior of the old O(1) scheduler by moving the relinquishing task to the end of the runnable queue (right-most position in the red-black tree). Applications that depend on the
sched_yield(2)
syscall behavior may see performance improvements by giving other processes a chance to run when there are highly contended resources (such as locks). Given that this call occurs in context switching, misusing the call can hurt the workload. Use it only when you see a drop in performance. The default value is0
.sched_migration_cost_ns
Amount of time after the last execution that a task is considered to be “cache hot” in migration decisions. A “hot” task has fewer chances of migration to another CPU, so increasing this variable reduces task migrations. The default value is
500000
(ns).If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.
sched_latency_ns
Targeted preemption latency for CPU bound tasks. Increasing this variable increases a CPU bound task's timeslice. A task's timeslice is its weighted fair share of the scheduling period:
timeslice = scheduling period * (task's weight/total weight of tasks in the run queue)
The task's weight depends on the task's nice level and the scheduling policy. Minimum task weight for a SCHED_OTHER task is 15, corresponding to nice 19. The maximum task weight is 88761, corresponding to nice -20.
Timeslices become smaller as the load increases. When the number of runnable tasks exceeds
sched_latency_ns
/sched_min_granularity_ns
, the slice becomes number_of_running_tasks *sched_min_granularity_ns
. Before that, the slice is equal tosched_latency_ns
.This value also specifies the maximum amount of time during which a sleeping task is considered to be running for entitlement calculations. Increasing this variable increases the amount of time a waking task may consume before being preempted, thus increasing scheduler latency for CPU bound tasks. The default value is
6000000
(ns).sched_min_granularity_ns
Minimal preemption granularity for CPU bound tasks. See
sched_latency_ns
for details. The default value is4000000
(ns).sched_wakeup_granularity_ns
The wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components. The default value is
2500000
(ns).Warning: Setting the right wake-up granularity valueSettings larger than half of
sched_latency_ns
result in no wake-up preemption. Short duty cycle tasks cannot compete with CPU hogs effectively.sched_rr_timeslice_ms
Quantum that SCHED_RR tasks are allowed to run before they are preempted and put to the end of the task list.
sched_rt_period_us
Period over which real-time task bandwidth enforcement is measured. The default value is
1000000
(µs).sched_rt_runtime_us
Quantum allocated to real-time tasks during sched_rt_period_us. Setting to -1 disables RT bandwidth enforcement. By default, RT tasks may consume 95%CPU/sec, thus leaving 5%CPU/sec or 0.05 s to be used by SCHED_OTHER tasks. The default value is
950000
(µs).sched_nr_migrate
Controls how many tasks can be migrated across processors for load-balancing purposes. Because balancing iterates the runqueue with interrupts disabled (softirq), it can incur in irq-latency penalties for real-time tasks. Therefore, increasing this value may give a performance boost to large SCHED_OTHER threads at the expense of increased irq-latencies for real-time tasks. The default value is
32
.sched_time_avg_ms
This parameter sets the period over which the time spent running real-time tasks is averaged. That average assists CFS in making load-balancing decisions and gives an indication of how busy a CPU is with high-priority real-time tasks.
The optimal setting for this parameter is highly workload dependent and depends, among other things, on how frequently real-time tasks are running and for how long.
debugfs
If the default Linux kernel version of your operating system is 5.13
or later (can be checked using the command rpm -q
kernel-default
), you might notice the messages in the kernel
logs that are similar to the following example:
[ 20.485624] The sched.sched_min_granularity_ns sysctl was moved to debugfs in kernel 5.13 for CPU scheduler debugging only. This sysctl will be removed in a future SLE release. [ 20.485632] The sched.sched_wakeup_granularity_ns sysctl was moved to debugfs in kernel 5.13 for CPU scheduler debugging only. This sysctl will be removed in a future SLE release.
This happens because six scheduler parameters have been moved from
/proc/sys/kernel/sched_*
to
/sys/kernel/debug/sched/*
. The affected
scheduler parameters are as follows:
sched_latency_ns
sched_migration_cost_ns
sched_min_granularity_ns
sched_nr_migrate
sched_tunable_scaling
sched_wakeup_granularity_n
For temporary convenience, the sysctl
for
these scheduler parameters still exists in SUSE Linux Enterprise Server. However,
due to planned changes in the CPU scheduler implementation, there is
no guarantee that either the sysctl
or the
debugfs
options will exist in a future
release of SUSE Linux Enterprise Server.
If any of your current system tuning configurations depend on these six scheduler parameters, we strongly recommend that you find an alternate method of achieving your objective and stop depending on them for production workloads.
14.3.7 Debugging interface and scheduler statistics #
CFS comes with a new improved debugging interface, and provides runtime
statistics information. Relevant files were added to the
/proc
file system, which can be examined simply
with the cat
or less
command. A
list of the related /proc
files follows with their
short description:
/proc/sched_debug
Contains the current values of all tunable variables (see Section 14.3.6, “Runtime tuning with
sysctl
”) that affect the task scheduler behavior, CFS statistics, and information about the run queues (CFS, RT and deadline) on all available processors. A summary of the task running on each processor is also shown, with the task name and PID, along with scheduler specific statistics. The first being thetree-key
column, it indicates the task's virtual runtime, and its name comes from the kernel sorting all runnable tasks by this key in a red-black tree. Theswitches
column indicates the total number of switches (involuntary or not), and theprio
refers to the process priority. Thewait-time
value indicates the amount of time the task waited to be scheduled. Finally bothsum-exec
andsum-sleep
account for the total amount of time (in nanoseconds) the task was running on the processor or asleep, respectively.#
cat /proc/sched_debug Sched Debug Version: v0.11, 4.4.21-64-default #1 ktime : 23533900.395978 sched_clk : 23543587.726648 cpu_clk : 23533900.396165 jiffies : 4300775771 sched_clock_stable : 0 sysctl_sched .sysctl_sched_latency : 6.000000 .sysctl_sched_min_granularity : 2.000000 .sysctl_sched_wakeup_granularity : 2.500000 .sysctl_sched_child_runs_first : 0 .sysctl_sched_features : 154871 .sysctl_sched_tunable_scaling : 1 (logarithmic) cpu#0, 2666.762 MHz .nr_running : 1 .load : 1024 .nr_switches : 1918946 [...] cfs_rq[0]:/ .exec_clock : 170176.383770 .MIN_vruntime : 0.000001 .min_vruntime : 347375.854324 .max_vruntime : 0.000001 [...] rt_rq[0]:/ .rt_nr_running : 0 .rt_throttled : 0 .rt_time : 0.000000 .rt_runtime : 950.000000 dl_rq[0]: .dl_nr_running : 0 task PID tree-key switches prio wait-time [...] ------------------------------------------------------------------------ R cc1 63477 98876.717832 197 120 0.000000 .../proc/schedstat
Displays statistics relevant to the current run queue. Also domain-specific statistics for SMP systems are displayed for all connected processors. Because the output format is not user-friendly, read the contents of
/usr/src/linux/Documentation/scheduler/sched-stats.txt
for more information./proc/PID/sched
Displays scheduling information on the process with id PID.
#
cat /proc/$(pidof gdm)/sched gdm (744, #threads: 3) ------------------------------------------------------------------- se.exec_start : 8888.758381 se.vruntime : 6062.853815 se.sum_exec_runtime : 7.836043 se.statistics.wait_start : 0.000000 se.statistics.sleep_start : 8888.758381 se.statistics.block_start : 0.000000 se.statistics.sleep_max : 1965.987638 [...] se.avg.decay_count : 8477 policy : 0 prio : 120 clock-delta : 128 mm->numa_scan_seq : 0 numa_migrations, 0 numa_faults_memory, 0, 0, 1, 0, -1 numa_faults_memory, 1, 0, 0, 0, -1
14.4 More information #
To get a compact knowledge about Linux kernel task scheduling, you need to explore several information sources. Here are some:
For task scheduler System Calls description, see the relevant manual page (for example
man 2 sched_setaffinity
).A useful lecture on Linux scheduler policy and algorithm is available in https://www.inf.fu-berlin.de/lehre/SS01/OS/Lectures/Lecture08.pdf.
A good overview of Linux process scheduling is given in Linux Kernel Development by Robert Love (ISBN-10: 0-672-32512-8). See https://www.informit.com/articles/article.aspx?p=101760.
A comprehensive overview of the Linux kernel internals is given in Understanding the Linux Kernel by Daniel P. Bovet and Marco Cesati (ISBN 978-0-596-00565-8).
Technical information about task scheduler is covered in files under
/usr/src/linux/Documentation/scheduler
.