Virtualization Guide #
SUSE Linux Enterprise Real Time 12 SP5
SUSE Linux Enterprise Real Time 12 SP5 supports virtualization and Docker. This guide describes how to run real-time workloads alongside both.
1 Running RT Applications with non-RT KVM Guests #
Real-time workloads running alongside KVM can be isolated by using standard methods such as cpusets and routing IRQs to dedicated CPUs, all of which can be achieved with the cset utility. Both libvirtd and KVM work fine in such configurations. System configurations that share CPUs between RT and KVM workloads are not supported; proper isolation of workloads is imperative for meeting RT deadline constraints. None of the observations and recommendations below are specific to virtualization. Nevertheless, they can be considered “best-effort” for isolating RT and KVM workloads. The following sections describe the test setup, the observations made, and the resulting recommendations.
1.1 Setup #
All examples were carried out on a 48-core Xeon machine with 2 NUMA nodes and 64GB of RAM running SLE12 RT and the 3.12.49-rt kernel. The virtual machine was installed with vm-install, running SLE12 SP5 on 4 CPUs and 2GB of memory. The disk used was the physical disk /dev/sdb, as recommended by the SUSE virtualization documentation.
The cpuset utility was used to shield the RT workload from KVM as described in the SLE RT Shielding Guide (see How to Shield Your Linux Resources):
cset shield --kthread=on -c 8-47
Affinity for the KVM vCPU tasks was modified via the virsh vcpupin command, with a 1-1 mapping. For example, vCPU 0 was pinned to CPU 0, and so on.
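The 1-1 pinning described above can be sketched as a small shell loop. This is a minimal, hedged example rather than the exact commands used for the tests: the domain name sle12-guest and the helper print_vcpupin_cmds are hypothetical, and the function only prints the virsh vcpupin commands so that they can be reviewed before being run as root.

```sh
#!/bin/sh
# Hypothetical helper: print one "virsh vcpupin" command per vCPU,
# pinning vCPU N to physical CPU N (1-1 mapping).
# $1 = libvirt domain name, $2 = number of vCPUs in the guest
print_vcpupin_cmds() {
    vcpu=0
    while [ "$vcpu" -lt "$2" ]; do
        echo "virsh vcpupin $1 $vcpu $vcpu"
        vcpu=$(( vcpu + 1 ))
    done
}

# Assumed example: a 4-vCPU guest named sle12-guest.
print_vcpupin_cmds sle12-guest 4
```

After checking the printed commands, they can be applied by piping the output to sh as root.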
The CPUs were split into two groups. CPUs 0-7 were allocated to the system cpuset and CPUs 8-47 were allocated to the user group. Having CPUs on the same socket in two groups was done intentionally to monitor the effects on shared CPU resources, such as LLC.
The RT workload used throughout is cyclictest, executed like so:
cset shield --exec cyclictest -- -a 8-47 -t 40 -n -m -p99 -d 0 -D 120 --quiet
1.2 Observations #
The following observations were made:
VM Heavy I/O
The test for this was to do the following in a VM:
dd if=/dev/zero of=empty bs=4096 count=$(((80*1024*1024)/4096))
Doing large amounts of disk I/O in the VM guests has a noticeable impact on the latency of RT tasks. This is because of the constant eviction of LLC data, resulting in more cache misses.
The maximum latencies for the real-time workload are seen on the CPUs that are on the same socket as the CPUs available to the KVM workload, that is, where the LLC is a shared resource between the system and user cpusets.
cpufreq drivers incur timer latency
Drivers like intel_pstate will set up a timer on each CPU to periodically sample and adjust the CPU's current P-state. If this fires at an inopportune time it can add delays to the scheduling of RT tasks, particularly because lots of the IRQ/timer code paths run with interrupts disabled.
Interrupt handling introduces delays
The handling of interrupts can result in latencies that affect RT workloads. Interrupts should be routed to “housekeeping” CPUs that are not running RT applications.
Some kernel threads cannot be controlled with cpuset
Performing heavy I/O in the VM may cause kthreads to be scheduled on the CPUs dedicated for RT. This can occur, for example, when a kthread is flushing dirty pages to disk.
While it is impossible to move some kworker threads into the system cpuset, the above issue can be mitigated by setting the CPU affinity for those threads via:
/sys/devices/virtual/workqueue/writeback/cpumask
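As a hedged illustration of this mitigation, the following sketch writes a CPU mask to a cpumask control file. The helper name set_cpumask is hypothetical, and the mask ff (CPUs 0-7, the system cpuset in this setup) is an assumption for illustration. Writing to the real control file requires root, and the file only exists on kernels that expose the writeback workqueue in sysfs.

```sh
#!/bin/sh
# Hypothetical helper: write a hex CPU mask to a cpumask control file.
# $1 = mask (for example ff for CPUs 0-7), $2 = path of the control file
set_cpumask() {
    if [ ! -w "$2" ]; then
        echo "cannot write $2 (missing file or insufficient privileges)" >&2
        return 1
    fi
    echo "$1" > "$2"
}
```

For example, running set_cpumask ff /sys/devices/virtual/workqueue/writeback/cpumask as root would keep the writeback kworkers on CPUs 0-7 and off the shielded CPUs 8-47.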
1.3 Recommendations #
Suggestions for tuning machines running both RT and KVM workloads are as follows:
Affinitize RT tasks to their own CPUs and, if possible, to CPUs on their own dedicated socket. Using a dedicated socket avoids the issue from Section 1.2, “Observations”, where the LLC occupancy is churned by VMs doing lots of I/O operations. If that is not an option, consider Intel's Cache Allocation Technology to further enforce cache allocation policies.
Disable drivers that arm per-CPU timers such as cpufreq drivers, for example, intel_pstate=disable.
Set IRQ affinity to CPUs that are not running RT workloads and disable irqbalance.
Set IRQ affinity to CPUs that are not running RT workloads. This can be achieved by setting the IRQBALANCE_BANNED_CPUS environment variable used by irqbalance(1) with a bitmask of banned CPUs. For the examples used throughout this document the following setting was used:
IRQBALANCE_BANNED_CPUS="ffff,ffffff00"
Search for cpumask control files in /sys and set them appropriately for those cases that cannot be controlled via cpuset. The following command will list those files:
find /sys -name cpumask
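A banned-CPU bitmask such as the one used for IRQBALANCE_BANNED_CPUS can be derived from a CPU range. The following is a minimal sketch, assuming the comma-separated 32-bit hexadecimal group format used in this document (most significant group first). The function name cpu_range_mask is hypothetical and is not part of irqbalance or cset.

```sh
#!/bin/sh
# Hypothetical helper: print the hex bitmask for an inclusive CPU range
# as comma-separated 32-bit groups, most significant group first.
# $1 = first CPU, $2 = last CPU
cpu_range_mask() {
    first=$1
    last=$2
    group=$(( last / 32 ))      # index of the highest 32-bit group needed
    mask=""
    while [ "$group" -ge 0 ]; do
        word=0
        bit=0
        while [ "$bit" -lt 32 ]; do
            cpu=$(( group * 32 + bit ))
            if [ "$cpu" -ge "$first" ] && [ "$cpu" -le "$last" ]; then
                word=$(( word | (1 << bit) ))
            fi
            bit=$(( bit + 1 ))
        done
        if [ -z "$mask" ]; then
            mask=$(printf '%x' "$word")             # leading group, no padding
        else
            mask="$mask,$(printf '%08x' "$word")"   # later groups, zero padded
        fi
        group=$(( group - 1 ))
    done
    echo "$mask"
}

# CPUs 8-47 are the shielded RT CPUs in this document's setup.
cpu_range_mask 8 47    # prints ffff,ffffff00
```

The same helper gives ff for CPUs 0-7, which is the form expected by the cpumask control files listed by the find command.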
2 RT Applications within Docker Feasibility #
It is important to note that real-time processes will be affected by container activity, as there is insufficient isolation to guarantee zero cross-talk. There are no special settings or container-specific interactions to consider: from an RT perspective, nothing changes due to containers. Whether a noise source is in a container is irrelevant. Interference may be considerably higher if multiple RT applications are executed in separate containers. Also bear in mind that while worst-case latency may be better than SLE, it will not necessarily perform better than NOPREEMPT due to the overhead required for RT.
Some shielding is possible, but there is no tool-based support for it. A generic shield script is attached that can move Docker contents onto shielded cores once running. Launching either KVM or Docker directly into a shielded home did not appear to be possible, but the Docker or virtualization team may be able to do better. The basic steps are described in the following sections.
2.1 Running Real-Time Applications in a Virtualized Environment #
Standard real-time dangers apply: if the intention is to run a compute-intensive application with real-time priority, the user must ensure that kernel threads cannot starve. A simple precaution is to use the rtkthread=prio and rtworkqueues=prio kernel boot parameters, with priority set higher than anything that may dominate a CPU. This is not strictly real-time capable, but it is safer overall.
The kernel must be booted with the nortsched command line parameter.
This is to hide cgroup scheduling from Docker. If cgroup scheduling is required, then isolating Docker is very problematic.
docker run must be passed --privileged=true.
This is required for using the RT classes.
Your container must be equipped with the chrt system tool.
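The prerequisites above can be checked before starting a container. The following is a minimal sketch under stated assumptions: the helper has_boot_param is hypothetical, it only inspects the text of the kernel command line, and the chrt check must be run inside the container image to be meaningful.

```sh
#!/bin/sh
# Hypothetical helper: succeed if kernel parameter $1 appears in the
# command line text $2, either bare (nortsched) or as name=value.
has_boot_param() {
    for word in $2; do
        case "$word" in
            "$1"|"$1"=*) return 0 ;;
        esac
    done
    return 1
}

# On the host: warn if the running kernel was booted without nortsched.
if [ -r /proc/cmdline ]; then
    has_boot_param nortsched "$(cat /proc/cmdline)" ||
        echo "warning: kernel was not booted with nortsched"
fi

# Inside the container: warn if chrt is not available.
command -v chrt >/dev/null 2>&1 || echo "warning: chrt not found in PATH"
```

The same helper can be used to confirm that rtkthread and rtworkqueues were passed at boot, for example has_boot_param rtkthread "$(cat /proc/cmdline)".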
If no isolation is required for your use case, the setup is ready. Run your container with docker run, using chrt to set the RT class/priority of whatever you execute upon startup of the container. Example:
docker run --privileged=true ... /usr/bin/chrt -f 1 /usr/sbin/sshd -D
The above (with additional arguments of course) will start sshd within the container as a SCHED_FIFO task of priority 1. ssh into it, and whatever you run therein will inherit the scheduler RT class/priority.
2.2 Docker Shielding #
There is currently no facility within Docker to launch a container directly into an isolated cpuset; this must be done manually.
# note cpuset mount point
cpuset_mnt=$(mount|grep cpuset|cut -d' ' -f3)
# create an isolated cpuset for your container
cset shield --userset=rtcpus --cpu=4-7 --kthread=on
# note path and id of your container
docker_path=$(docker run...)
docker_id=$(docker ps -q)
# move container content into the isolated cpuset
for i in $(cat ${cpuset_mnt}/system/docker/${docker_path}/tasks);
do
echo $i > ${cpuset_mnt}/rtcpus/tasks;
done
# stop/destroy the container
docker stop ${docker_id}
docker rm ${docker_id}
# remove dir docker abandons in the shield system directory
rmdir ${cpuset_mnt}/system/docker
# tear down the shield, and you're done
cset shield --userset=rtcpus --cpu=4-7 --reset
2.3 Scripts #
#!/bin/bash
let START_CPU=4
let END_CPU=63
let ONLINE=1
let SHIELD_UP=0
GOVERNOR="performance"
DEFAULT_MASK=ffffffff,ffffffff
SHIELD_MASK=00000000,0000000f
if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
RT_RUNTIME=$(cat /proc/sys/kernel/sched_rt_runtime_us)
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
NMI_WATCHDOG=$(cat /proc/sys/kernel/nmi_watchdog)
fi
CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
if [ ! -z $CPUSET_ROOT ]; then
if [ -d ${CPUSET_ROOT}/rtcpus ]; then
let SHIELD_UP=1
fi
if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
CPUSET_PREFIX=cpuset.
fi
fi
if [ $SHIELD_UP -eq 1 ]; then
# take it down
echo 1 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
cset shield --userset=rtcpus --reset
# restore default irq affinity
echo ${DEFAULT_MASK} > /proc/irq/default_smp_affinity
for irqlist in $(ls /proc/irq/*/smp_affinity); do
echo ${DEFAULT_MASK} > $irqlist 2>/dev/null
done
if [ -f /proc/sys/kernel/timer_migration ]; then
echo 1 > /proc/sys/kernel/timer_migration
fi
if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
echo ${RT_RUNTIME} > /proc/sys/kernel/sched_rt_runtime_us
fi
if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
echo 1 > /sys/kernel/debug/tracing/tracing_on
fi
if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
echo always > /sys/kernel/mm/transparent_hugepage/enabled
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
echo ${NMI_WATCHDOG} > /proc/sys/kernel/nmi_watchdog
fi
if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
fi
if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
fi
if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/cpumask
fi
if [ -f /proc/sys/vm/stat_interval ]; then
echo 1 > /proc/sys/vm/stat_interval
fi
if [ -f /sys/module/processor/parameters/latency_factor ]; then
echo 2 > /sys/module/processor/parameters/latency_factor
fi
if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
echo 0 > /sys/module/processor/parameters/ignore_ppc
fi
if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
echo 0 > /sys/module/processor/parameters/ignore_tpc
fi
if [ -f /etc/init.d/sgi_irqbalance ]; then
/etc/init.d/sgi_irqbalance start
fi
else
# route irqs away from shielded cpus
if [ -f /etc/init.d/sgi_irqbalance ]; then
/etc/init.d/sgi_irqbalance stop
fi
echo $SHIELD_MASK > /proc/irq/default_smp_affinity
for irqlist in $(ls /proc/irq/*/smp_affinity); do
echo $SHIELD_MASK > $irqlist 2>/dev/null
done
# poke some buttons..
if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
fi
if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
echo 0 > /sys/kernel/debug/tracing/tracing_on
fi
if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
echo 0 > /proc/sys/kernel/nmi_watchdog
fi
if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval
fi
if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
fi
if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/cpumask
fi
if [ -f /proc/sys/vm/stat_interval ]; then
echo 999999 > /proc/sys/vm/stat_interval
fi
if [ -f /sys/module/processor/parameters/latency_factor ]; then
echo 1 > /sys/module/processor/parameters/latency_factor
fi
if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
echo 1 > /sys/module/processor/parameters/ignore_ppc
fi
if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
echo 1 > /sys/module/processor/parameters/ignore_tpc
fi
# ...and fire up the shield
cset shield --userset=rtcpus --cpu=${START_CPU}-${END_CPU} --kthread=on
# If cpuset wasn't previously mounted (systemd will, like it or not),
# we just mounted it. Find the mount point.
if [ -z $CPUSET_ROOT ]; then
CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
if [ -z $CPUSET_ROOT ]; then
# If it's not mounted now, bail.
echo EEK, cpuset is not mounted!
exit 1
else
# ok, check for cgroup mount
if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
CPUSET_PREFIX=cpuset.
fi
fi
fi
echo 0 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
echo 1 > ${CPUSET_ROOT}/system/${CPUSET_PREFIX}sched_load_balance
echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_relax_domain_level
# this ain't gonna happen in -rt kernels, but...
if [ -f ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us ]; then
echo 300000 > ${CPUSET_ROOT}/system/cpu.rt_runtime_us
echo 300000 > ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us
fi
echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_load_balance
# wait a bit for sched_domain rebuild
sleep 1
# now we can go to hpc
if [ -f ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt ]; then
echo 1 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt
fi
# offline/online to migrate timers and whatnot
if [ $ONLINE -eq 1 ]; then
for i in `seq ${START_CPU} ${END_CPU}`; do
echo 0 > /sys/devices/system/cpu/cpu$i/online
done
for i in `seq ${START_CPU} ${END_CPU}`; do
echo 1 > /sys/devices/system/cpu/cpu$i/online
done
# re-add CPUs the kernel removed on offline
echo ${START_CPU}-${END_CPU} > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}cpus
# and prioritize re-initialized kthreads
systemctl restart set_kthread_prio
fi
if [ -f /proc/sys/kernel/timer_migration ]; then
echo 0 > /proc/sys/kernel/timer_migration
fi
GOVERNOR="performance"
fi
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then
CURRENT_GOVERNOR=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)
if ! [ $GOVERNOR = $CURRENT_GOVERNOR ]; then
for i in $(ls /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do
echo $GOVERNOR > $i;
done
fi
fi
sysjitter.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
--- a/sysjitter.c
+++ b/sysjitter.c
@@ -412,7 +412,7 @@ static void write_raw(struct thread *thr
FILE *f;
int i;
for (i = 0; i < g.n_threads; ++i) {
- sprintf(fname, "%s.%d", outf, i);
+ sprintf(fname, "%s.%d", outf, threads[i].core_i);
if ((f = fopen(fname, "w")) == NULL) {
fprintf(stderr, "ERROR: Could not open '%s' for writing\n", fname);
fprintf(stderr, "ERROR: %s\n", strerror(errno));
@@ -578,6 +578,7 @@ int main(int argc, char *argv[])
const char *outf = NULL;
char dummy;
int i, n_cores, runtime = 70;
+ cpu_set_t cpus;
g.max_interruptions = 1000000;
@@ -609,10 +610,13 @@ int main(int argc, char *argv[])
sscanf(argv[0], "%u%c", &g.threshold_nsec, &dummy) != 1)
usage(app);
+ CPU_ZERO(&cpus);
+ sched_getaffinity(0, sizeof(cpus), &cpus);
+
n_cores = sysconf(_SC_NPROCESSORS_ONLN);
- TEST(threads = malloc(n_cores * sizeof(threads[0])));
+ TEST(threads = malloc(CPU_COUNT(&cpus) * sizeof(threads[0])));
for (i = 0; i < n_cores; ++i)
- if (move_to_core(i) == 0)
+ if (CPU_ISSET(i, &cpus) && move_to_core(i) == 0)
threads[g.n_threads++].core_i = i;
signal(SIGALRM, handle_alarm);
3 Running RT Applications with RT KVM Guests #
In Section 1, “Running RT Applications with non-RT KVM Guests”, we see that it is possible to isolate real-time workloads running alongside KVM by using standard methods. In SLE12 RT SP3 this can be done in user space using libvirt/qemu.
Applications and guest operating systems run inside KVM guests similarly to how they run on bare metal. The guest interfaces with emulated hardware presented by QEMU, which submits I/O requests to the host on behalf of the guest. Then the host kernel treats the guest I/Os like any user-space application.
In SLE12 SP5, both QEMU and libvirt support isolating the CPUs, partitioning the memory for guests, and setting the vCPU/iothread scheduler policy and priority for running both non-RT KVM and RT KVM.
3.1 Support of QEMU/libvirt #
QEMU includes the -realtime mlock=on|off option. Mlocking QEMU and guest memory is enabled with mlock=on (which is enabled by default).
libvirt supports CPU Allocation, CPU Tuning, and Memory Backing, which allow you to control RT parameters; see Section 3.2, “Sample of libvirt.xml”.
- CPU Allocation
We can define the maximum number of virtual CPUs allocated for the guest OS.
- CPU Tuning
Pinning is a tuning option for the virtual CPUs in KVM guests. With pinning we can control where the guest runs in order to reduce the overhead of scheduler switches, pin vCPUs to physical CPUs that have low utilization, and improve the data cache performance. Overall performance is improved when the memory that an application uses is local to the physical CPU, and the guest vCPU is pinned to this physical CPU.
We can specify the vCPU scheduler type (values batch, idle, fifo, rr) and priority for particular vCPU threads. Priority 99 is too high and will massively interfere with the host's ability to function properly. There are host-side per-CPU threads that must always be able to preempt, for example, timer sirq threads.
- Memory Backing
Use memory backing to allocate enough memory in the guest to avoid memory overcommit, and to lock the guest page memory in host memory to prevent it from being swapped out. This will show a performance improvement in some workloads.
3.2 Sample of libvirt.xml #
<domain>
…
<vcpu placement='static' cpuset="1-4,^3,6" current="1">4</vcpu>
…
<cputune>
<vcpupin vcpu="0" cpuset="1-5,^2"/>
<vcpupin vcpu="1" cpuset="0,1"/>
<vcpupin vcpu="2" cpuset="2,3"/>
<vcpupin vcpu="3" cpuset="0,4"/>
<vcpusched vcpus='0-4,^3' scheduler='fifo' priority='1'/>
</cputune>
…
<memoryBacking>
<locked/>
</memoryBacking>
…
</domain>
3.3 Other Host Settings #
Power Management. Intel processors have a power management feature that puts the system into power-saving mode when the system is under-utilized. The system should be configured for maximum performance, rather than allowing power-saving mode.
Turboboost and Speedstep. Turboboost overclocks a core when CPU demand is high, whereas Speedstep dynamically adjusts the frequency of the processor to meet processing needs. Turboboost requires Speedstep to be enabled, as it is an extension of Speedstep. For maximum performance, enable both Turboboost and Speedstep in BIOS. The host OS may also need configuration to support running at higher clock speeds. For example:
cpupower -c all frequency-set -g performance
Disable Interrupt Balancing (irqbalance). The irqbalance daemon is enabled by default. It distributes hardware interrupts across CPUs in a multi-core system to increase performance. When irqbalance is disabled, all interrupts will be handled by cpu0, and therefore the guest should NOT run on cpu0.
RT Throttling. The default values for the realtime throttling mechanism allocate 95% of the CPU time to realtime tasks, and the remaining 5% to non-realtime tasks. If RT throttling is disabled, realtime tasks may use up to 100% of CPU time. Hence, programming failures in real-time applications can cause the entire system to hang because no other task can preempt the realtime tasks.
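The RT throttling share can be read from the kernel's sched_rt_runtime_us and sched_rt_period_us settings. The following sketch is an illustration only: the helper name rt_share_percent is hypothetical, and it assumes the usual convention that a runtime of -1 means throttling is disabled. With the default values of 950000 and 1000000 microseconds it yields the 95% mentioned above.

```sh
#!/bin/sh
# Hypothetical helper: percentage of CPU time available to RT tasks.
# $1 = sched_rt_runtime_us, $2 = sched_rt_period_us
rt_share_percent() {
    if [ "$1" -lt 0 ]; then
        echo 100                    # -1: RT throttling is disabled
    else
        echo $(( $1 * 100 / $2 ))   # integer percentage of each period
    fi
}

runtime_file=/proc/sys/kernel/sched_rt_runtime_us
period_file=/proc/sys/kernel/sched_rt_period_us
if [ -r "$runtime_file" ] && [ -r "$period_file" ]; then
    share=$(rt_share_percent "$(cat "$runtime_file")" "$(cat "$period_file")")
    echo "RT tasks may use up to ${share}% of CPU time"
fi
```

A result of 100 indicates that throttling is off, in which case a runaway real-time task can hang the system as described above.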
The above settings are just part of the configuration needed for RT KVM to run at “best-effort” performance. Other factors must be considered, such as storage and network. The overall KVM performance is dependent on the host hardware, firmware, BIOS settings, and the guest OS and application characteristics.