Virtualization Guide #
SUSE Linux Enterprise Real Time 15 SP5 supports virtualization and Docker usage as a technology preview only (best-effort support).
To see the degree of interference from running within a particular KVM configuration versus running on a bare-metal configuration, each RT application has to be assessed individually. We do not give any specific guarantees on performance or deadlines being missed.
Virtualization inevitably introduces overhead, but there is currently no rule of thumb for the performance penalty incurred. It is up to each RT application developer to set performance and deadline requirements and evaluate if those requirements are met.
This guide provides the following three examples for user reference only.
1 Running RT applications with non-RT KVM guests #
It is possible to achieve isolation of real-time workloads running alongside
KVM by using standard methods—for example, cpusets and routing IRQs
to dedicated CPUs. These can be done using the cset
utility. Both libvirtd and KVM work fine in such configurations. System
configurations that share CPUs between both RT and KVM workloads are not
supported; proper isolation of workloads is imperative for achieving RT
deadline constraints. None of the observations and recommendations below are
specific to virtualization. Nevertheless, they can be considered
“best-effort” practice for isolating RT and KVM workloads. The basic
steps are:
1.1 Setup #
All examples were carried out on a 48-core Xeon machine with two NUMA nodes and 64 GB of RAM running SUSE Linux Enterprise Real Time. The virtual machine was installed with vm-install, running SUSE Linux Enterprise Server on four CPUs and 2 GB of memory. The disk was a physical disk, /dev/sdb, as recommended by the SUSE Linux Enterprise Server Virtualization Guide.
The cset utility was used to shield the RT workload from KVM as described in the SLE RT Shielding Guide (see Shielding Linux Resources):
cset shield --kthread=on -c 8-47
Affinity for the KVM vCPU tasks was modified via the virsh vcpupin command, with a 1-1 mapping: vCPU 0 was pinned to CPU 0, vCPU 1 to CPU 1, and so on.
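As a minimal sketch of the 1-1 pinning described above, the following helper prints one virsh vcpupin command per vCPU instead of running it, so the mapping can be reviewed first. The function name print_vcpupin_cmds and the domain name sles-guest are assumptions made for illustration; they do not come from the test setup.

```shell
#!/bin/sh
# Print "virsh vcpupin DOMAIN VCPU CPU" for a 1-1 vCPU-to-CPU mapping.
# Nothing is executed here; pipe the output to "sh" to apply it.
print_vcpupin_cmds() {
    domain=$1      # libvirt domain name (hypothetical in this example)
    n_vcpus=$2     # number of vCPUs in the guest
    vcpu=0
    while [ "$vcpu" -lt "$n_vcpus" ]; do
        # vCPU N is pinned to host CPU N
        printf 'virsh vcpupin %s %s %s\n' "$domain" "$vcpu" "$vcpu"
        vcpu=$(( vcpu + 1 ))
    done
}

# Four vCPUs, as in the guest used for the examples.
print_vcpupin_cmds sles-guest 4
```

With four vCPUs, the guest stays on CPUs 0-3, which are inside the system cpuset (CPUs 0-7) and outside the shielded range.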
The CPUs were split into two groups. CPUs 0-7 were allocated to the system cpuset, and CPUs 8-47 were allocated to the user group. Having CPUs on the same socket in two groups was done intentionally to monitor the effects on shared CPU resources, such as last-level cache (LLC).
The RT workload used throughout is cyclictest, executed as follows:
cset shield --exec cyclictest -- -a 8-47 -t 40 -n -m -p99 -d 0 -D 120 --quiet
1.2 Observations #
The following observations were made:
VM Heavy I/O
The test for this was to do the following in a VM:
dd if=/dev/zero of=empty bs=4096 count=$(((80*1024*1024)/4096))
Doing large amounts of disk I/O in the VM guests has a noticeable impact on the latency of RT tasks. This is because of the constant eviction of LLC data, resulting in more cache misses.
The maximum latencies for the real-time workload are seen on the CPUs that share a socket with the CPUs available to the KVM workload, that is, where the LLC is a shared resource between the system and user cpusets.
cpufreq drivers incur timer latency
Drivers like intel_pstate will set up a timer on each CPU to periodically sample and adjust the CPU's current P-state. If this timer fires at an inopportune time, it can delay the scheduling of RT tasks, particularly because many of the IRQ/timer code paths run with interrupts disabled.
Interrupt handling introduces delays
The handling of interrupts can result in latencies that affect RT workloads. Interrupts should be routed to “housekeeping” CPUs that are not running RT applications.
Some kernel threads cannot be controlled with cpuset
Performing heavy I/O in the VM may cause kthreads to be scheduled on the CPUs dedicated for RT. This can occur, for example, when a kthread is flushing dirty pages to disk.
While it is impossible to move some kworker threads into the system cpuset, the above issue can be mitigated by setting the CPU affinity for those threads via:
/sys/devices/virtual/workqueue/writeback/cpumask
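The following sketch shows one way to write a CPU mask to that control file and read the result back. The helper name set_workqueue_cpumask is hypothetical, and the mask ff (CPUs 0-7) is an assumption that matches the system cpuset of the example machine. Writing to the file requires root privileges.

```shell
#!/bin/sh
# Write a hex CPU mask to a workqueue cpumask control file and print
# the value that is stored afterwards.
set_workqueue_cpumask() {
    mask=$1
    # Default to the writeback control file named in the text above.
    mask_file=${2:-/sys/devices/virtual/workqueue/writeback/cpumask}
    if [ ! -w "$mask_file" ]; then
        echo "cannot write $mask_file (not root, or file missing)" >&2
        return 1
    fi
    printf '%s\n' "$mask" > "$mask_file"
    # Read the value back so the caller can see what was accepted.
    cat "$mask_file"
}

# Example, as root: keep writeback kworkers on CPUs 0-7
#   set_workqueue_cpumask ff
```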
1.3 Recommendations #
Suggestions for tuning machines running both RT and KVM workloads are as follows:
Use CPU affinity to schedule RT tasks to their own CPUs and, if possible, to CPUs on their own dedicated socket. Using a dedicated socket avoids the issue described in Section 1.2, “Observations”, where the LLC occupancy is churned by VMs doing lots of I/O operations. If that is not an option, consider Intel's Cache Allocation Technology to further enforce cache-allocation policies.
Disable drivers that arm per-CPU timers such as cpufreq drivers, for example, by booting with intel_pstate=disable.
Set IRQ affinity to CPUs that are not running RT workloads and disable irqbalance.
Set IRQ affinity to CPUs that are not running RT workloads. If irqbalance remains enabled, this can be achieved by setting the IRQBALANCE_BANNED_CPUS environment variable used by irqbalance(1) to a bitmask of banned CPUs. For the examples used throughout this document, the following setting was used:
IRQBALANCE_BANNED_CPUS="ffff,ffffff00"
Search for cpumask control files in /sys and set them appropriately for those cases that cannot be controlled via cpuset. The following command will list those files:
find /sys -name cpumask
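The banned-CPU value ffff,ffffff00 used above is a bitmask with bits 8 to 47 set, written as comma-separated 32-bit groups with the most significant group first. As an illustration only, the hypothetical helper cpu_range_mask below derives such a mask from a first and last CPU number. It is a sketch for contiguous ranges, not a tool shipped with the product.

```shell
#!/bin/sh
# Print a hex bitmask with bits FIRST..LAST set, as comma-separated
# 32-bit groups (most significant group first).
cpu_range_mask() (
    first=$1
    last=$2
    word=$(( last / 32 ))          # index of the highest 32-bit group needed
    out=""
    while [ "$word" -ge 0 ]; do
        val=0
        bit=0
        while [ "$bit" -lt 32 ]; do
            cpu=$(( word * 32 + bit ))
            if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
                val=$(( val | (1 << bit) ))
            fi
            bit=$(( bit + 1 ))
        done
        if [ -z "$out" ]; then
            out=$(printf '%x' "$val")            # top group: no zero padding
        else
            out="$out,$(printf '%08x' "$val")"   # lower groups: 8 hex digits
        fi
        word=$(( word - 1 ))
    done
    printf '%s\n' "$out"
)

# CPUs 8-47 are the shielded RT CPUs of the example machine.
cpu_range_mask 8 47    # prints ffff,ffffff00
```

The function body runs in a subshell, so its working variables do not leak into the calling script.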
2 Running real-time applications within Docker #
It is important to note that real-time processes will be affected by container activity, as there is insufficient isolation to guarantee zero cross-talk. There are no special settings or container-specific interactions to consider: from an RT perspective, nothing changes due to containers. Whether a noise source runs inside a container is irrelevant. Interference may be considerably higher if multiple RT applications are executed in separate containers. Also bear in mind that while worst-case latency may be better than on SLE, overall performance will not necessarily be better than with NOPREEMPT due to the overhead required for RT.
Some shielding is possible, but there is no tool-based support for it. A generic shield script is attached (see Section 2.3, “Scripts”) that can move Docker contents onto shielded cores once the container is running. Launching either KVM or Docker directly into a shielded home did not appear to be possible, but the Docker or virtualization team may be able to do better. The basic steps are:
2.1 Running real-time applications in a virtualized environment #
If you intend to run compute-intensive applications with real-time priority, you must make sure that kernel threads cannot starve. (This is general advice that applies to other real-time scenarios as well.)
A simple precaution is to use the rtkthread=PRIORITY and rtworkqueues=PRIORITY kernel boot parameters. Set the PRIORITY values higher than the priority of any process that has the potential to dominate a CPU. This is not strictly real-time capable, but it is safer overall.
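To check which values the running kernel was booted with, the command line can be inspected. The sketch below uses a hypothetical helper, cmdline_value, which reads /proc/cmdline unless a command-line string is passed explicitly. The priorities 60 and 61 in the sample call are made-up values for illustration only.

```shell
#!/bin/sh
# Print the value of NAME=VALUE from a kernel command line.
# $1 = parameter name, $2 = command line (defaults to /proc/cmdline).
cmdline_value() {
    name=$1
    line=${2:-$(cat /proc/cmdline)}
    for word in $line; do
        case $word in
            "$name"=*)
                # Strip everything up to and including the first "="
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1    # parameter not present
}

# Sample call with an explicit (hypothetical) command line:
cmdline_value rtkthread "quiet rtkthread=60 rtworkqueues=61"
```

On a live system, calling cmdline_value rtworkqueues without a second argument reports the value from /proc/cmdline, or returns a non-zero status if the parameter was not set.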
The kernel must be booted with the nortsched command-line parameter. This is to hide cgroup scheduling from Docker. If cgroup scheduling is required, then isolating Docker containers is very problematic.
The docker run command must be passed --privileged=true. This is required for using the RT classes.
Your container must be equipped with the chrt system tool.
If no isolation is required for your use case, the setup is complete. Start your container with docker run, using chrt to set the RT class/priority of the program you execute when starting the container. For example:
docker run --privileged=true ... /usr/bin/chrt -f 1 /usr/sbin/sshd -D
The above (with additional arguments, of course) will start sshd within the container as a SCHED_FIFO task of priority 1. Connect to it with ssh, and whatever you run in the container will inherit that RT scheduling class and priority.
2.2 Docker shielding #
There is currently no facility within Docker to launch a container directly into an isolated cpuset. You must do this manually.
# note cpuset mount point
cpuset_mnt=$(mount|grep cpuset|cut -d' ' -f3)
# create an isolated cpuset for your container
cset shield --userset=rtcpus --cpu=4-7 --kthread=on
# note path and id of your container
docker_path=$(docker run...)
docker_id=$(docker ps -q)
# move container content into the isolated cpuset
for i in $(cat ${cpuset_mnt}/system/docker/${docker_path}/tasks);
do
echo $i > ${cpuset_mnt}/rtcpus/tasks;
done
# stop/destroy the container
docker stop ${docker_id}
docker rm ${docker_id}
# remove dir docker abandons in the shield system directory
rmdir ${cpuset_mnt}/system/docker
# tear down the shield, and you're done
cset shield --userset=rtcpus --cpu=4-7 --reset
2.3 Scripts #
#!/bin/bash
let START_CPU=4
let END_CPU=63
let ONLINE=1
let SHIELD_UP=0
GOVERNOR="performance"
DEFAULT_MASK=ffffffff,ffffffff
SHIELD_MASK=00000000,0000000f
if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
RT_RUNTIME=$(cat /proc/sys/kernel/sched_rt_runtime_us)
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
NMI_WATCHDOG=$(cat /proc/sys/kernel/nmi_watchdog)
fi
CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
if [ ! -z $CPUSET_ROOT ]; then
if [ -d ${CPUSET_ROOT}/rtcpus ]; then
let SHIELD_UP=1
fi
if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
CPUSET_PREFIX=cpuset.
fi
fi
if [ $SHIELD_UP -eq 1 ]; then
# take it down
echo 1 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
cset shield --userset=rtcpus --reset
# restore default irq affinity
echo ${DEFAULT_MASK} > /proc/irq/default_smp_affinity
for irqlist in $(ls /proc/irq/*/smp_affinity); do
echo ${DEFAULT_MASK} > $irqlist 2>/dev/null
done
if [ -f /proc/sys/kernel/timer_migration ]; then
echo 1 > /proc/sys/kernel/timer_migration
fi
if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
echo ${RT_RUNTIME} > /proc/sys/kernel/sched_rt_runtime_us
fi
if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
echo 1 > /sys/kernel/debug/tracing/tracing_on
fi
if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
echo always > /sys/kernel/mm/transparent_hugepage/enabled
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
echo ${NMI_WATCHDOG} > /proc/sys/kernel/nmi_watchdog
fi
if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
fi
if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
fi
if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/cpumask
fi
if [ -f /proc/sys/vm/stat_interval ]; then
echo 1 > /proc/sys/vm/stat_interval
fi
if [ -f /sys/module/processor/parameters/latency_factor ]; then
echo 2 > /sys/module/processor/parameters/latency_factor
fi
if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
echo 0 > /sys/module/processor/parameters/ignore_ppc
fi
if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
echo 0 > /sys/module/processor/parameters/ignore_tpc
fi
if [ -f /etc/init.d/sgi_irqbalance ]; then
/etc/init.d/sgi_irqbalance start
fi
else
# route irqs away from shielded cpus
if [ -f /etc/init.d/sgi_irqbalance ]; then
/etc/init.d/sgi_irqbalance stop
fi
echo $SHIELD_MASK > /proc/irq/default_smp_affinity
for irqlist in $(ls /proc/irq/*/smp_affinity); do
echo $SHIELD_MASK > $irqlist 2>/dev/null
done
# poke some buttons..
if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
fi
if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
echo 0 > /sys/kernel/debug/tracing/tracing_on
fi
if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
echo 0 > /proc/sys/kernel/nmi_watchdog
fi
if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval
fi
if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
fi
if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/cpumask
fi
if [ -f /proc/sys/vm/stat_interval ]; then
echo 999999 > /proc/sys/vm/stat_interval
fi
if [ -f /sys/module/processor/parameters/latency_factor ]; then
echo 1 > /sys/module/processor/parameters/latency_factor
fi
if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
echo 1 > /sys/module/processor/parameters/ignore_ppc
fi
if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
echo 1 > /sys/module/processor/parameters/ignore_tpc
fi
# ...and fire up the shield
cset shield --userset=rtcpus --cpu=${START_CPU}-${END_CPU} --kthread=on
# If cpuset wasn't previously mounted (systemd will, like it or not),
# it has now been mounted. Find the mount point.
if [ -z $CPUSET_ROOT ]; then
CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
if [ -z $CPUSET_ROOT ]; then
# If it's not mounted now, bail.
echo EEK, cpuset is not mounted!
exit 1
else
# ok, check for cgroup mount
if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
CPUSET_PREFIX=cpuset.
fi
fi
fi
echo 0 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
echo 1 > ${CPUSET_ROOT}/system/${CPUSET_PREFIX}sched_load_balance
echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_relax_domain_level
# this ain't gonna happen in -rt kernels, but...
if [ -f ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us ]; then
echo 300000 > ${CPUSET_ROOT}/system/cpu.rt_runtime_us
echo 300000 > ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us
fi
echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_load_balance
# wait a bit for sched_domain rebuild
sleep 1
# now go to hpc
if [ -f ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt ]; then
echo 1 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt
fi
# offline/online to migrate timers and whatnot
if [ $ONLINE -eq 1 ]; then
for i in `seq ${START_CPU} ${END_CPU}`; do
echo 0 > /sys/devices/system/cpu/cpu$i/online
done
for i in `seq ${START_CPU} ${END_CPU}`; do
echo 1 > /sys/devices/system/cpu/cpu$i/online
done
# re-add CPUs the kernel removed on offline
echo ${START_CPU}-${END_CPU} > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}cpus
# and prioritize re-initialized kthreads
systemctl restart set_kthread_prio
fi
if [ -f /proc/sys/kernel/timer_migration ]; then
echo 0 > /proc/sys/kernel/timer_migration
fi
GOVERNOR="performance"
fi
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then
CURRENT_GOVERNOR=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)
if ! [ $GOVERNOR = $CURRENT_GOVERNOR ]; then
for i in $(ls /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do
echo $GOVERNOR > $i;
done
fi
fi
sysjitter to use the user affinity instead of whole box #
sysjitter.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
--- a/sysjitter.c
+++ b/sysjitter.c
@@ -412,7 +412,7 @@ static void write_raw(struct thread *thr
FILE *f;
int i;
for (i = 0; i < g.n_threads; ++i) {
- sprintf(fname, "%s.%d", outf, i);
+ sprintf(fname, "%s.%d", outf, threads[i].core_i);
if ((f = fopen(fname, "w")) == NULL) {
fprintf(stderr, "ERROR: Could not open '%s' for writing\n", fname);
fprintf(stderr, "ERROR: %s\n", strerror(errno));
@@ -578,6 +578,7 @@ int main(int argc, char *argv[])
const char *outf = NULL;
char dummy;
int i, n_cores, runtime = 70;
+ cpu_set_t cpus;
g.max_interruptions = 1000000;
@@ -609,10 +610,13 @@ int main(int argc, char *argv[])
sscanf(argv[0], "%u%c", &g.threshold_nsec, &dummy) != 1)
usage(app);
+ CPU_ZERO(&cpus);
+ sched_getaffinity(0, sizeof(cpus), &cpus);
+
n_cores = sysconf(_SC_NPROCESSORS_ONLN);
- TEST(threads = malloc(n_cores * sizeof(threads[0])));
+ TEST(threads = malloc(CPU_COUNT(&cpus) * sizeof(threads[0])));
for (i = 0; i < n_cores; ++i)
- if (move_to_core(i) == 0)
+ if (CPU_ISSET(i, &cpus) && move_to_core(i) == 0)
threads[g.n_threads++].core_i = i;
signal(SIGALRM, handle_alarm);
3 Running RT applications with RT KVM guests #
Section 1, “Running RT applications with non-RT KVM guests” shows that it is possible to isolate real-time workloads running alongside KVM by using standard methods. In SLE RT 15 SP2 this can be done in user space using libvirt/qemu.
Applications and guest operating systems run inside KVM guests similarly to how they run on bare metal. Guests interface with emulated hardware presented by QEMU, which submits I/O requests to the host on behalf of the guest. The host kernel then treats guest I/O like the I/O of any other user-space application.
In SLE RT 15 SP4, both QEMU and libvirt support isolating the CPUs, partitioning the memory for guests, and setting the vCPU/iothread scheduler policy and priority for running both non-RT KVM and RT KVM.
3.1 Support of QEMU/libvirt #
QEMU includes the -realtime mlock=on|off option. Mlocking of QEMU and guest memory is enabled with mlock=on (which is the default).
libvirt supports CPU allocation, CPU tuning, and memory backing, which allow you to control RT parameters. See Section 3.2, “Sample of libvirt.xml”.
- CPU allocation
You can define the maximum number of virtual CPUs allocated for the guest OS.
- CPU tuning
Pinning is a tuning option for the virtual CPUs in KVM guests. With pinning, you can control where the guest runs in order to reduce the overhead of scheduler switches, pin vCPUs to physical CPUs that have low utilization, and improve the data cache performance. Overall performance is improved when the memory that an application uses is local to the physical CPU, and the guest vCPU is pinned to this physical CPU.
You can specify the vCPU scheduler type (values batch, idle, fifo, or rr) and the priority for particular vCPU threads. Priority 99 is too high and will massively interfere with the host's ability to function properly: there are host-side per-CPU threads that must always be able to preempt, such as the timer softirq (sirq) threads.
- Memory backing
Use memory backing to allocate enough memory in the guest to avoid memory overcommit, and to lock the guest page memory in host memory to prevent it from being swapped out. This will show a performance improvement in some workloads.
3.2 Sample of libvirt.xml #
<domain>
…
<vcpu placement='static' cpuset="1-4,^3,6" current="1">4</vcpu>
…
<cputune>
<vcpupin vcpu="0" cpuset="1-5,^2"/>
<vcpupin vcpu="1" cpuset="0,1"/>
<vcpupin vcpu="2" cpuset="2,3"/>
<vcpupin vcpu="3" cpuset="0,4"/>
<vcpusched vcpus='0-4,^3' scheduler='fifo' priority='1'/>
</cputune>
…
<memoryBacking>
<locked/>
</memoryBacking>
…
</domain>
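The cpuset attributes in the sample combine single CPUs, ranges, and exclusions marked with ^. To make that notation concrete, the hypothetical helper expand_cpuset below expands such a string into a plain list of CPU numbers. It is a sketch that only handles single-CPU exclusions (for example ^3) and is not part of libvirt.

```shell
#!/bin/sh
# Expand a cpuset string such as "1-4,^3,6" into a space-separated
# list of CPU numbers.
expand_cpuset() (
    IFS=,                 # split the argument on commas (subshell-local)
    included=" "
    excluded=" "
    for item in $1; do
        case $item in
            \^*)          # exclusion of a single CPU, for example ^3
                excluded="$excluded${item#^} "
                ;;
            *-*)          # inclusive range, for example 1-4
                cpu=${item%-*}
                while [ "$cpu" -le "${item#*-}" ]; do
                    included="$included$cpu "
                    cpu=$(( cpu + 1 ))
                done
                ;;
            *)            # single CPU
                included="$included$item "
                ;;
        esac
    done
    IFS=' '
    result=""
    for cpu in $included; do
        case $excluded in
            *" $cpu "*) ;;                             # drop excluded CPUs
            *) result="$result${result:+ }$cpu" ;;
        esac
    done
    printf '%s\n' "$result"
)

expand_cpuset "1-4,^3,6"    # prints 1 2 4 6
```

Applied to the first vcpupin entry of the sample, expand_cpuset "1-5,^2" yields CPUs 1, 3, 4 and 5.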
3.3 Other host settings #
Power management. Intel processors have a power management feature that puts the system into power-saving mode when the system is under-utilized. The system should be configured for maximum performance, rather than allowing power-saving mode.
Turboboost and Speedstep. Turboboost overclocks a core when CPU demand is high, whereas Speedstep dynamically adjusts the frequency of the processor to meet processing needs. Turboboost requires Speedstep to be enabled, as it is an extension of Speedstep. For maximum performance, enable both Turboboost and Speedstep in the BIOS. The host OS may also need configuration to support running at higher clock speeds. For example:
cpupower -c all frequency-set -g performance
Disable interrupt balancing (irqbalance). The irqbalance daemon is enabled by default. It distributes hardware interrupts across CPUs in a multi-core system to increase performance. When irqbalance is disabled, all interrupts will be handled by cpu0, and therefore the guest should NOT run on cpu0.
RT throttling. The default values for the realtime throttling mechanism allocate 95% of the CPU time to realtime tasks, and the remaining 5% to non-realtime tasks. If RT throttling is disabled, realtime tasks may use up to 100% of CPU time. Hence, programming failures in real-time applications can cause the entire system to hang because no other task can preempt the realtime tasks.
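The 95% figure follows from two kernel settings: the RT runtime divided by the RT period, both in microseconds. As a rough sketch, the hypothetical helper rt_share_percent below computes that share. The values 950000 and 1000000 in the sample call are the usual kernel defaults and are assumed here, and a runtime of -1 means that throttling is disabled.

```shell
#!/bin/sh
# Print the percentage of CPU time available to real-time tasks.
# $1 = RT runtime in microseconds (-1 = throttling disabled)
# $2 = RT period in microseconds
rt_share_percent() {
    runtime_us=$1
    period_us=$2
    if [ "$runtime_us" -lt 0 ]; then
        echo 100                                   # no throttling at all
    else
        echo $(( runtime_us * 100 / period_us ))   # integer percentage
    fi
}

rt_share_percent 950000 1000000    # prints 95

# On a live system the inputs can be read from:
#   /proc/sys/kernel/sched_rt_runtime_us
#   /proc/sys/kernel/sched_rt_period_us
```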
The above settings are only part of the configuration needed for RT KVM to run with “best-effort” performance. Other factors, such as storage and network, must also be considered. The overall KVM performance depends on the host hardware, firmware, BIOS settings, and the guest OS and application characteristics.