Quick Start
SUSE Linux Enterprise Real Time 12 SP5
SUSE Linux Enterprise Real Time is an add-on to SUSE® Linux Enterprise. It allows you to run tasks that require deterministic real-time processing in a SUSE Linux Enterprise environment.
SUSE Linux Enterprise Real Time meets this requirement by offering several options for CPU and I/O scheduling, CPU shielding, and setting the CPU affinity of processes.
Copyright © 2006–2024 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.
1 Product Overview
If your business can respond more quickly to new information and changing market conditions, you have a distinct advantage over those that cannot. Running your time-sensitive, mission-critical applications on SUSE Linux Enterprise Real Time reduces process dispatch latencies and gives you the time advantage you need to stay ahead of your competitors, increase profits, or avoid further financial losses.
1.1 Key Features
Some of the key features for SUSE Linux Enterprise Real Time are:
Pre-emptible real-time kernel.
Ability to assign high-priority processes.
Greater predictability to complete critical processes on time, every time.
Whereas the standard Linux kernel is optimized for overall system performance regardless of the response time of individual processes, the SUSE Linux Enterprise Real Time kernel is tuned for predictable process response times.
Increased reliability.
Lower infrastructure costs.
Tracing and debugging tools that help you analyze and identify bottlenecks in mission-critical applications.
1.2 Specific Scenario
SUSE Linux Enterprise Real Time Service Pack 3 supports virtualization and Docker. For details, refer to the Virtualization Guide.
2 Installing SUSE Linux Enterprise Real Time
To install SUSE Linux Enterprise Real Time 12 SP5, start a regular SUSE Linux Enterprise Server 12 SP5 installation and select SUSE Linux Enterprise Real Time 12 SP5 as an add-on product during the installation. Alternatively, if SUSE Linux Enterprise Server is already installed, you can start the Add-on Product installation from YaST or enable SUSE Linux Enterprise Real Time in the YaST SUSE Customer Center configuration. However, you then need to manually select the -rt kernel flavor as the default in the YaST Boot Loader configurator.
SUSE Linux Enterprise Real Time always requires a SUSE Linux Enterprise Server 12 SP5 base; it cannot be installed in stand-alone mode. For information on how to install add-on products, see the chapter “Installing Add-On Products” in the SUSE Linux Enterprise 12 SP5 Deployment Guide, available at http://www.suse.com/doc.
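After rebooting, you can check whether the -rt flavor is the kernel that is actually running. The following is a minimal sketch, assuming that the release string printed by uname -r ends in -rt for this kernel flavor; the function name is_rt_release is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given kernel release
# string ends in the "-rt" flavor suffix. Without an argument, the release
# of the running kernel (uname -r) is checked.
# Assumption: the -rt flavor is recognizable by this suffix.
is_rt_release() {
    case ${1:-$(uname -r)} in
        *-rt) return 0 ;;
        *)    return 1 ;;
    esac
}

# Example: report which kernel flavor is running.
if is_rt_release; then
    echo "Running the -rt kernel: $(uname -r)"
else
    echo "Not running the -rt kernel: $(uname -r)"
fi
```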
The following sections provide a brief introduction to the tools and possibilities of SUSE Linux Enterprise Real Time.
3 Managing CPU Sets with cset
In some circumstances, it is beneficial to run specific tasks only on defined CPUs. For this reason, the Linux kernel provides a feature called cpuset. The cpuset feature provides the means to do a so-called “soft partitioning” of the system: dedicated CPUs, together with some predefined memory, work on a defined group of tasks.
cset consists of one “super command” called shield and the “regular commands” set and proc. The purpose of the super command shield is to create a common CPU shielding setup in one step by combining regular commands.
For more information about options and parameters of the shield subcommand, view the help by running:
cset help shield
3.1 Setting Up a CPU Shield for a Single CPU
The cset command provides the high-level functionality to set up and manipulate CPU sets. An example of setting up a CPU shield is:
cset shield --cpu=3
This shields CPU3. On a machine with four cores, CPU0 to CPU2 remain unshielded.
3.2 Setting up CPU Shields for Multiple CPUs
If you need to shield more than one CPU, the argument of the --cpu option accepts comma-separated lists of CPUs, including range specifications:
cset shield --cpu=1,3,5-7
On a machine with 8 cores, this command will shield CPU1, CPU3, CPU5, CPU6, and CPU7. The CPUs CPU0, CPU2 and CPU4 will remain unshielded.
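The list syntax accepted by --cpu can be illustrated with a small sketch that expands a list into individual CPU numbers and computes which CPUs remain unshielded. The helper names expand_cpulist and unshielded_cpus are hypothetical; cset does this parsing internally, and this is not its code.

```shell
#!/bin/sh
# Hypothetical helper: expand a CPU list such as "1,3,5-7" into individual
# CPU numbers, printed space-separated on one line.
expand_cpulist() {
    out=""
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        case $item in
            *-*)
                # Range "first-last": emit every CPU in between.
                first=${item%-*}
                last=${item#*-}
                cpu=$first
                while [ "$cpu" -le "$last" ]; do
                    out="$out $cpu"
                    cpu=$((cpu + 1))
                done
                ;;
            *)
                out="$out $item"
                ;;
        esac
    done
    IFS=$old_ifs
    # Strip the leading space added by the loop.
    printf '%s\n' "${out# }"
}

# Hypothetical helper: print the CPUs 0..NCPUS-1 that are NOT in the
# shielded list, i.e. the CPUs left for the "system" cpuset.
unshielded_cpus() {
    shielded=" $(expand_cpulist "$2") "
    out=""
    cpu=0
    while [ "$cpu" -lt "$1" ]; do
        case $shielded in
            *" $cpu "*) ;;
            *) out="$out $cpu" ;;
        esac
        cpu=$((cpu + 1))
    done
    printf '%s\n' "${out# }"
}

# Usage:
#   expand_cpulist 1,3,5-7      # prints: 1 3 5 6 7
#   unshielded_cpus 8 1,3,5-7   # prints: 0 2 4
```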
Existing CPU shields can be extended with the same command. For example, to add CPU4 to the CPU set above, use this command:
cset shield --cpu=1,3-7
CPU1, CPU3, and CPU5 to CPU7 were already shielded, so only CPU4 is added. Technically, the command updates the current CPU shield schema. To reduce the number of shielded CPUs, for example to unshield CPU1, use the following command:
cset shield --cpu=3-7
Now only CPU3, CPU4, CPU5, CPU6, and CPU7 are shielded. CPU0, CPU1, and CPU2 are available for system usage.
3.3 Showing CPU Shields
After the CPU shielding is set up, you can display the current configuration by running cset shield without additional options:
cset shield
cset: --> shielding system active with
cset: "system" cpuset of: 0-2 cpu, with: 47
cset: "user" cpuset of: 3-7 cpu, with: 0
By default, CPU shielding consists of at least three cpusets:
root always exists and contains all available CPUs.
system is the cpuset of unshielded CPUs.
user is the cpuset of shielded CPUs.
3.4 Shielding Processes
After the CPU set is created, individual processes or groups of processes can be assigned to the shielded cpuset. To start a new process in the shielded CPU set, use the --exec option:
cset shield --exec APPLICATION
To move already running processes to the shielded CPU set, use the --shield and --pid options. The --pid option accepts a comma-separated list of PIDs and range specifications:
cset shield --shield --pid=1,2,600-700
This moves the processes with PIDs 1 and 2, and those from 600 to 700, to the shielded CPU set. If there are gaps in the range from 600 to 700, only the processes that exist are moved to the shield, without a warning. cset handles threads like processes: it also interprets TIDs and assigns the threads to the requested CPU set.
The --shield option does not check the processes you request to move into the shield. This means that the command also moves processes that are bound to specific CPUs, even kernel threads. You can cause a complete system lockup by indiscriminately passing arbitrary PIDs to the --shield option.
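One way to reduce this risk is to filter a PID list before passing it to --pid. The following sketch relies on a common heuristic that is not part of cset: kernel threads have an empty /proc/PID/cmdline. Zombie processes also have an empty command line and are skipped as well, so treat the result as a list to review, not a guarantee. The function name user_pids_only is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print, comma-separated, those of the given PIDs that
# look like user-space processes, so the result can be passed to --pid.
# Heuristic (assumption): kernel threads have an empty /proc/PID/cmdline.
# PIDs that do not exist (or exit meanwhile) are skipped silently.
user_pids_only() {
    out=""
    for pid in "$@"; do
        [ -r "/proc/$pid/cmdline" ] || continue
        # Read the command line with NUL separators removed; errors from a
        # process exiting between the check and the read are discarded.
        cmdline=$({ tr -d '\000' < "/proc/$pid/cmdline"; } 2>/dev/null)
        if [ -n "$cmdline" ]; then
            out="$out,$pid"
        fi
    done
    printf '%s\n' "${out#,}"
}

# Usage (as root), checking the PIDs 600 to 700 before shielding them:
#   pids=$(user_pids_only $(seq 600 700))
#   [ -n "$pids" ] && cset shield --shield --pid="$pids"
```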
3.5 Showing Shielded Processes
Use the cset shield command to show the number of currently shielded processes (the same command also shows the current CPU shield setup). To list shielded and unshielded processes, add the --verbose option:
cset shield --verbose
cset: --> shielding system active with
cset: "system" cpuset of: 0-2,4-15 cpu, with:
USER PID PPID S TASK NAME
-------- ----- ----- - ---------
root 1 0 S init [3]
[...]
cset: "user" cpuset of: 3 cpu, with: 1
USER PID PPID S TASK NAME
-------- ----- ----- - ---------
root 10202 10170 S application
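Besides the cset output, the kernel itself reports which CPUs a process may run on. The following sketch reads the Cpus_allowed_list field of /proc/PID/status and the /proc/PID/cpuset file, assuming a kernel with cpuset support; the helper names allowed_cpus and cpuset_of are hypothetical.

```shell
#!/bin/sh
# Print the CPUs a process may currently run on, as reported by the kernel
# in the Cpus_allowed_list field of /proc/PID/status.
# Without an argument, the current shell is inspected.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "/proc/${1:-$$}/status"
}

# Print the path of the cpuset a process belongs to.
# Assumption: the kernel provides /proc/PID/cpuset (cpuset support enabled).
cpuset_of() {
    cat "/proc/${1:-$$}/cpuset"
}

# Usage, for the shielded application in the example output above
# (allowed_cpus should report CPU 3 while the process is shielded):
#   allowed_cpus 10202
#   cpuset_of 10202
```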
3.6 Unshielding Processes
To remove a process (or group of processes) from the CPU shield, use the --unshield option. It is used in the same way as the --shield option: the --pid option accepts a comma-separated list of PIDs/TIDs and range specifications:
cset shield --unshield --pid=2,650-655
This command unshields the process with PID 2 and the processes with PIDs 650 to 655.
3.7 Resetting CPU Sets
To delete the CPU sets, use the --reset option of cset shield:
cset shield --reset
This unshields all CPUs and migrates the previously shielded processes back to all available CPUs.
4 Managing Tree-like Structures with cset
More detailed configuration of cpusets can be done with the cset subcommands set and proc.
The subcommand set is used to create, modify, and destroy cpusets. Compared to the super command shield, the set subcommand can additionally assign memory nodes on NUMA machines.
Besides assigning memory nodes, the subcommand set creates cpusets in a tree-like structure, rooted at the root cpuset.
To create a cpuset with the subcommand set, you need to specify the CPUs that should be used. Either use a comma-separated list or a range specification:
cset set --cpu=1-7 "/one"
This command creates a cpuset called one with the CPUs CPU1 to CPU7 assigned. To create a new cpuset called two that is a subset of one, proceed as follows:
cset set --cpu=6 "/one/two"
Cpusets follow certain rules. Children can only include CPUs that their parent already has. If you try to specify other CPUs, the kernel cpuset subsystem will not let you create that cpuset. For example, if you create a cpuset that contains CPU3 and then attempt to create a child of that cpuset with a CPU other than 3, you will get an error, and the cpuset will not be created. The resulting error is somewhat cryptic and is usually “Permission denied”.
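Because the resulting error is not very descriptive, it can help to check the parent/child rule before calling cset set. The following sketch is a hypothetical helper (cpus_subset is not a cset command) that succeeds only if every CPU of a proposed child list is also in the parent list; it includes its own small list-expansion function so that it is self-contained.

```shell
#!/bin/sh
# Expand a CPU list such as "1,3,5-7" into space-separated CPU numbers.
expand_cpulist() {
    out=""
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        case $item in
            *-*) first=${item%-*} last=${item#*-} ;;
            *)   first=$item last=$item ;;
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            out="$out $cpu"
            cpu=$((cpu + 1))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "${out# }"
}

# Hypothetical helper mirroring the kernel rule: succeed (exit status 0)
# only if every CPU in CHILD_LIST ($1) is also in PARENT_LIST ($2).
cpus_subset() {
    parent=" $(expand_cpulist "$2") "
    for cpu in $(expand_cpulist "$1"); do
        case $parent in
            *" $cpu "*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

# Usage, matching the example above:
#   cpus_subset 6 1-7 && cset set --cpu=6 "/one/two"
```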
To show a table containing useful information, like the CPU list and memory list, use the -r parameter. The “-X” column shows the exclusive state of CPUs or memory. The “path” column shows the real path in the virtual cpuset file system.
cset set -r
On NUMA machines, memory nodes can be assigned to a cpuset similar to CPUs. The --mem option of the subcommand set accepts a comma-separated list and inclusive range specifications of memory nodes. This example assigns MEM1, MEM3, MEM4, MEM5, and MEM6 to the cpuset new_set:
cset set --mem=1,3-6 new_set
Additionally, the --cpu_exclusive and --mem_exclusive options (which take no arguments) make the CPUs or memory nodes exclusive to a cpuset:
cset set --cpu_exclusive "/one"
The exclusive state of CPUs or memory is shown in the -X column when running:
cset set -r
For more detailed information about options and parameters of the subcommand set, view the cset help:
cset help set
After the cpuset is initialized, the subcommand proc can start processes in specific cpusets with the --exec option. The following command starts the application fastapp within the cpuset new_set:
cset proc --exec --set new_set fastapp
To move an already running process into an existing cpuset, use the --move option. It accepts a comma-separated list and range specifications of PIDs. The following command moves the process with PID 2442 and the processes with PIDs 3000 to 3200 into the cpuset new_set:
cset proc --move 2442,3000-3200 new_set
To list the processes running within a specific cpuset, use the --list option:
cset proc --list new_set
The subcommand proc can also move the entire list of processes within one cpuset to another cpuset by using the options --fromset and --toset. The following command moves all processes assigned to old_set to new_set:
cset proc --move --fromset old_set --toset new_set
For more detailed information about options and parameters of the subcommand proc, view the help:
cset help proc
5 Setting Real-time Attributes of a Process with chrt
Use the chrt command to manipulate the real-time attributes of an already running process (like scheduling policy and priority), or to execute a new process with specified real-time attributes.
The chrt command is especially recommended for applications that do not set real-time attributes themselves but should nevertheless benefit fully from real-time scheduling. To get the full real-time benefit, start these applications with the chrt command and the right combination of scheduler policy and priority parameters.
The following command shows all running processes with their real-time specific attributes. The class column shows the current scheduler policy and the rtprio column the real-time priority:
ps -eo pid,tid,class,rtprio,comm
...
1437 1437 FF 40 fastapp
The truncated example above shows the fastapp process with PID 1437 running with scheduler policy SCHED_FIFO and priority 40. Scheduler policy abbreviations are:
TS - SCHED_OTHER
FF - SCHED_FIFO
RR - SCHED_RR
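On a busy system, the ps output is long. The following sketch filters it down to the processes running with a real-time policy (FF or RR); it assumes the column order of the ps command above, so the policy is the third field. The function name rt_only is hypothetical.

```shell
#!/bin/sh
# Keep the header line and every process whose scheduling class (third
# column) is FF (SCHED_FIFO) or RR (SCHED_RR).
# Assumption: input comes from "ps -eo pid,tid,class,rtprio,comm".
rt_only() {
    awk 'NR == 1 || $3 == "FF" || $3 == "RR"'
}

# Usage:
#   ps -eo pid,tid,class,rtprio,comm | rt_only
```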
It is also possible to get the current scheduler policy and priority of a single process by specifying its PID with the -p parameter. For example:
chrt -p 1437
Scheduler policies have different minimum and maximum priority values. The minimum and maximum values for each available scheduler policy can be retrieved with chrt:
chrt -m
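Before assigning a priority, you can check it against the range reported by chrt -m. The following sketch assumes that each line of the chrt -m output has the form “SCHED_FIFO min/max priority : 1/99”, with the range as the last field; the function names prio_range and prio_valid are hypothetical.

```shell
#!/bin/sh
# Read "chrt -m" output on stdin and print the minimum and maximum
# priority of one policy, for example: chrt -m | prio_range SCHED_FIFO
# Assumption: the range "min/max" is the last field of the policy's line.
prio_range() {
    awk -v policy="$1" '$1 == policy { split($NF, r, "/"); print r[1], r[2] }'
}

# Succeed if PRIORITY ($2) lies within the range chrt -m reports for
# POLICY ($1). Usage: prio_valid SCHED_FIFO 42 && chrt --fifo -p 42 1437
prio_valid() {
    range=$(chrt -m | prio_range "$1")
    [ -n "$range" ] || return 1
    # After this, $1 = minimum, $2 = maximum, $3 = requested priority.
    set -- $range "$2"
    [ "$3" -ge "$1" ] && [ "$3" -le "$2" ]
}
```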
To change the scheduler policy and the priority of a running process, chrt provides the options --fifo for SCHED_FIFO, --rr for SCHED_RR, and --other for SCHED_OTHER.
The following example changes the scheduler policy of PID 1437 to SCHED_FIFO with priority 42:
chrt --fifo -p 42 1437
Handle changes to the real-time attributes of processes with care. Increasing the priority of certain processes can harm the entire system, depending on the behavior of the process. In some cases, this can lead to a complete system lockup or adversely affect certain devices.
For more information about chrt, see the chrt man page (man 1 chrt).
6 Specifying a CPU Affinity with taskset
The default behavior of the kernel is to keep a process running on the same CPU if the system load is balanced over the available CPUs. Otherwise, the kernel tries to improve the load balancing by moving processes to an idle CPU. In some situations, however, it is desirable to set a CPU affinity for a given process. In this case, the kernel will not move the process away from the selected CPUs. For example, if you use shielding, the shielded CPUs will not run any processes that do not have an affinity to the shielded CPUs. Another way to remove load from the other CPUs is to run all low-priority tasks on a selected CPU.
If a task is running inside a specific cpuset, the affinity mask must match at least one of the CPUs available in this set. The taskset command will not move a process outside the cpuset it is running in.
To set or retrieve the CPU affinity of a task, a bitmask is used. It is represented by a hexadecimal number. The lowest bit of this bitmask represents the first logical CPU as found in /proc/cpuinfo, the next bit the second CPU, and so on. For example:
0x00000001 is processor #0.
0x00000002 is processor #1.
0x00000003 is processor #0 and processor #1.
0xFFFFFFFE is all but the first CPU.
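Building masks by hand is error-prone. The following sketch converts a CPU list into the hexadecimal mask format described above and back again; the function names cpulist_to_mask and mask_to_cpus are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for building and decoding taskset affinity masks.
# Bit N of the mask corresponds to logical CPU N.

# Convert a CPU list such as "0,1" or "1,3,5-7" into a hexadecimal mask.
cpulist_to_mask() {
    mask=0
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        case $item in
            *-*) first=${item%-*} last=${item#*-} ;;
            *)   first=$item last=$item ;;
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            mask=$((mask | (1 << cpu)))
            cpu=$((cpu + 1))
        done
    done
    IFS=$old_ifs
    printf '0x%x\n' "$mask"
}

# Convert a hexadecimal mask back into a space-separated list of CPUs.
mask_to_cpus() {
    mask=$(($1))
    cpu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="$out $cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    printf '%s\n' "${out# }"
}

# Usage:
#   cpulist_to_mask 0,1    # prints 0x3
#   mask_to_cpus 0x3       # prints 0 1
```

For example, taskset -p "$(cpulist_to_mask 3)" 1437 sets the affinity of PID 1437 to CPU3. taskset also accepts a CPU list directly through its -c option, which avoids the conversion altogether.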
If a given mask does not contain any valid CPU on the system, an error is returned. If taskset returns without an error, the given program has been scheduled to the specified list of CPUs.
The taskset command either starts a new process with a given CPU affinity or redefines the CPU affinity of an already running process.
taskset -p PID
Retrieves the current CPU affinity of the process with the given PID.
taskset -p mask PID
Sets the CPU affinity of the process with the given PID to mask.
taskset mask command
Runs command with a CPU affinity of mask.
For more detailed information about taskset, see the man page (man 1 taskset).
7 Changing I/O Priorities with ionice
Handling I/O is one of the critical issues for all high-performance systems. If a task has lots of CPU power available but must wait for the disk, it will not work as efficiently as it could. The Linux kernel provides three different scheduling classes to determine the I/O handling for a process. The Best Effort and Real Time classes can additionally be fine-tuned with a nice level.
- The Best Effort Scheduler
The Best Effort scheduler is the default I/O scheduler, and is used for all processes that do not specify a different I/O scheduler class. By default, this scheduler sets its nice level according to the nice value of the running process.
There are eight different nice levels available for this scheduler. The lowest priority is represented by a nice level of seven, and the highest priority by zero.
This scheduler has the scheduling class number 2.
- The Real Time Scheduler
The real-time I/O class always gets the highest priority for disk access. The other schedulers are only served if no real-time request is present. This scheduling class can easily lock up the system if it is not used with care.
The Real Time scheduler defines nice levels in the same way as the Best Effort scheduler.
This scheduler has the scheduling class number 1.
- The Idle Scheduler
The Idle scheduler does not define any nice levels. I/O is only done in this class if no other scheduler runs an I/O request. This scheduler has the lowest available priority and can be used for processes that are not time-critical.
This scheduler has the scheduling class number 3.
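The relationship between the classes, their numbers, and the nice value can be illustrated with a small calculation. This sketch assumes the mapping used by the Linux kernel, in which the default Best Effort level is (nice + 20) / 5 and the combined value passed to the ioprio_set() system call is the class number shifted left by 13 bits plus the level. The helper names are hypothetical and are not part of ionice.

```shell
#!/bin/sh
# Default Best Effort level (0 = highest, 7 = lowest) derived from a CPU
# nice value between -20 and 19, using the kernel's (nice + 20) / 5 rule.
be_level_from_nice() {
    echo $(( ($1 + 20) / 5 ))
}

# Combined I/O priority value as passed to the ioprio_set() system call:
# class number (1 = Real Time, 2 = Best Effort, 3 = Idle) shifted left by
# 13 bits, plus the level within the class.
ioprio_value() {
    echo $(( ($1 << 13) | $2 ))
}

# Usage:
#   be_level_from_nice 0    # prints 4
#   be_level_from_nice 19   # prints 7
#   ioprio_value 1 2        # value for "ionice -c1 -n2", prints 8194
```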
To change I/O schedulers and nice values, use the ionice command. It provides a means to tune the scheduler of already running processes, or to start new processes with specific I/O settings.
ionice -c3 -p$$
Sets the scheduler of the current shell to Idle.
ionice
Without additional parameters, this prints the I/O scheduler settings of the current shell.
ionice -c1 -p42 -n2
Sets the scheduler of the process with process ID 42 to Real Time, and its nice value to 2.
ionice -c3 /bin/bash
Starts the Bash shell with the Idle I/O scheduler.
For more detailed information about ionice, see the ionice man page (man 1 ionice).
8 Changing the I/O Scheduler for Block Devices
The Linux kernel provides several block device schedulers that can be selected individually for each block device. All schedulers except noop perform some ordering of the requested blocks to reduce head movements on the hard disk. If you use an external storage system that has its own scheduler, you should disable the internal Linux reordering by selecting the noop scheduler.
- noop
The noop scheduler is a very simple scheduler that performs basic merging and sorting on I/O requests. This scheduler is mainly used for specialized environments that run their own schedulers optimized for the used hardware, such as storage systems or hardware RAID controllers.
- deadline
The deadline scheduler tries hard to serve each request before a given deadline. This results in very good I/O performance for random single I/O requests in real-time environments.
In principle, the deadline scheduler uses two lists with all requests. One is sorted by block sequences to reduce seeking latencies, the other is sorted by expire times for each request. Normally, requests are served according to the block sequence, but if a request reaches its deadline, the scheduler starts to work on this request.
- cfq
The Completely Fair Queuing scheduler uses a separate I/O queue for each process. All of these queues get a similar time slice for disk access. With this procedure, the CFQ scheduler tries to divide the bandwidth evenly between all requesting processes. This scheduler has a throughput similar to that of the anticipatory scheduler, but a much shorter maximum latency.
For the average system, this scheduler yields the best results and is therefore the default I/O scheduler on SUSE Linux Enterprise systems.
To print the current scheduler of a block device like /dev/sda, use the following command:
cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
In this case, the scheduler for /dev/sda is set to cfq, the Completely Fair Queuing scheduler. This is the default scheduler on SUSE Linux Enterprise Real Time.
To change the scheduler, write one of the names noop, deadline, or cfq to /sys/block/<device>/queue/scheduler. For example, to set the I/O scheduler of the device /dev/sda to noop, run the following command as root:
echo "noop" > /sys/block/sda/queue/scheduler
To set other variables in the /sys file system, use a similar approach.
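The active scheduler is the name in square brackets. The following sketch extracts it and only writes a new scheduler if the kernel lists it as available for the device; the function names are hypothetical, and writing to sysfs requires root privileges.

```shell
#!/bin/sh
# Read a scheduler line such as "noop deadline [cfq]" on stdin and print
# the active scheduler, i.e. the name in square brackets.
parse_active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Print the active I/O scheduler of a block device, e.g. active_scheduler sda
active_scheduler() {
    parse_active_scheduler < "/sys/block/$1/queue/scheduler"
}

# Switch DEVICE ($1) to SCHEDULER ($2) only if the kernel offers that
# scheduler for the device. Must be run as root.
# Example: set_scheduler sda deadline
set_scheduler() {
    file=/sys/block/$1/queue/scheduler
    # Remove the brackets around the active name to get the plain list.
    available=" $(sed 's/[][]//g' "$file") "
    case $available in
        *" $2 "*) echo "$2" > "$file" ;;
        *) echo "scheduler '$2' is not available for $1" >&2; return 1 ;;
    esac
}
```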
9 Tuning the Block Device I/O Scheduler
All schedulers except the noop scheduler have several parameters that can be tuned for each block device. You can access these parameters with sysfs in the /sys/block/<device>/queue/iosched/ directory. The following parameters are tunable for the respective scheduler:
- Anticipatory Scheduler
read_batch_expire
If write requests are scheduled, this is the time in milliseconds that reads are served before pending writes get a time slice. If writes are more important than reads, set this value lower than read_expire.
write_batch_expire
Similar to read_batch_expire for write requests.
- Deadline Scheduler
read_expire
The main focus of this scheduler is to limit the start latency for a request to a given time. Therefore, for each request, a deadline is calculated from the current time plus the value of read_expire in milliseconds.
write_expire
Similar to read_expire for write requests.
fifo_batch
If a request hits its deadline, it is necessary to move the request from the sorted I/O scheduler list to the dispatch queue. The variable fifo_batch controls how many requests are moved, depending on the cost of each request.
front_merges
The scheduler normally tries to find contiguous I/O requests and merges them. There are two kinds of merges: the new I/O request may be in front of the existing I/O request (front merge), or it may follow behind the existing request (back merge). Most merges are back merges. Therefore, you can disable the front merge functionality by setting front_merges to 0.
write_starved
If some read or write requests hit their deadline, the scheduler prefers the read requests by default. To prevent write requests from being postponed forever, the variable write_starved controls how often read requests are preferred before write requests are preferred over read requests.
- CFQ Scheduler
back_seek_max and back_seek_penalty
The CFQ scheduler normally uses a strict ascending elevator. When needed, it also allows small backward seeks, but it puts some penalty on them. The maximum backward sector seek is defined with back_seek_max, and the multiplier for the penalty is set by back_seek_penalty.
fifo_expire_async and fifo_expire_sync
The fifo_expire_* variables define the timeout in milliseconds for asynchronous and synchronous I/O requests. To prefer synchronous operations over asynchronous ones, the fifo_expire_sync value should be lower than fifo_expire_async.
quantum
Defines the number of I/O requests to be dispatched at once by the block device. This parameter is used for synchronous requests.
slice_async, slice_async_rq, slice_sync, and slice_idle
These variables define the time slices a block device gets for synchronous or asynchronous operations. slice_async and slice_sync serve as base values in milliseconds for asynchronous or synchronous disk slice length calculations. slice_async_rq defines how many requests an asynchronous disk slice can accommodate. slice_idle defines how long the I/O scheduler idles before servicing the next thread.
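To review the current values before changing them, you can print all tunables of a device in one go. The following sketch assumes the iosched directory layout described above; the function name show_iosched and its optional directory argument (useful for testing) are hypothetical.

```shell
#!/bin/sh
# Print every tunable of the active I/O scheduler of a block device as
# "name = value". An optional second argument overrides the directory.
# Example: show_iosched sda
show_iosched() {
    dir=${2:-/sys/block/$1/queue/iosched}
    for f in "$dir"/*; do
        # Skip anything that is not a regular file (or an unmatched glob).
        [ -f "$f" ] || continue
        printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

To change a value, write to the corresponding file as root, in the same way as for the scheduler itself.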
The system default block device I/O scheduler can also be set with the kernel parameter elevator=. For example, elevator=deadline changes the I/O scheduler to deadline.
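After rebooting, you can confirm that the parameter was passed to the kernel by checking /proc/cmdline. The following is a minimal sketch; the function name kernel_param_value is hypothetical. On SUSE Linux Enterprise 12, the parameter is typically added with the YaST Boot Loader module, or to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by grub2-mkconfig -o /boot/grub2/grub.cfg.

```shell
#!/bin/sh
# Print the value of a kernel boot parameter such as "elevator" from the
# current kernel command line (/proc/cmdline), or from a command line given
# as the second argument. Prints nothing if the parameter is absent.
kernel_param_value() {
    for word in ${2:-$(cat /proc/cmdline)}; do
        case $word in
            "$1"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}

# Usage:
#   kernel_param_value elevator   # prints deadline if booted with elevator=deadline
```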
10 For More Information
A lot of information about real-time implementations and administration can be found on the Internet. The following list contains several selected links:
More detailed information about real-time Linux development and an introduction to writing a real-time application can be found in the real-time Linux community Wiki: http://rt.wiki.kernel.org, http://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application
The cpuset feature of the kernel is explained in /usr/src/linux/Documentation/cgroups/cpusets.txt. More detailed documentation is available from http://lwn.net/Articles/127936/.
For more information about the deadline I/O scheduler, refer to https://en.wikipedia.org/wiki/Deadline_scheduler. In your installed system, find further information in /usr/src/linux/Documentation/block/deadline-iosched.txt.
The CFQ I/O scheduler is covered in detail in http://en.wikipedia.org/wiki/CFQ and /usr/src/linux/Documentation/block/cfq-iosched.txt.