SUSE Linux Enterprise Real Time 15 SP5

Virtualization Guide

Publication Date: 05/02/2024

SUSE Linux Enterprise Real Time 15 SP5 supports virtualization and Docker usage as a technology preview only (best-effort support).

To see the degree of interference from running within a particular KVM configuration versus running on a bare-metal configuration, each RT application has to be assessed individually. We do not give any specific guarantees on performance or deadlines being missed.

Virtualization inevitably introduces overhead, but there is currently no rule of thumb for the performance penalty incurred. It is up to each RT application developer to set performance and deadline requirements and evaluate if those requirements are met.

This guide provides the following three examples for user reference only.

1 Running RT applications with non-RT KVM guests

It is possible to isolate real-time workloads running alongside KVM by using standard methods, for example, cpusets and routing IRQs to dedicated CPUs. This can be done with the cset utility. Both libvirtd and KVM work fine in such configurations. System configurations that share CPUs between RT and KVM workloads are not supported; proper isolation of the workloads is imperative for meeting RT deadline constraints. None of the observations and recommendations below are specific to virtualization, but they can be considered best-effort guidance for isolating RT and KVM workloads. The basic steps are described in the following sections.

1.1 Setup

All examples were carried out on a 48-core Xeon machine with two NUMA nodes and 64 GB of RAM running SUSE Linux Enterprise Real Time. The virtual machine was installed with vm-install, running SUSE Linux Enterprise Server on four CPUs and 2 GB of memory. The disk was a physical disk /dev/sdb as recommended by the SUSE Linux Enterprise Server Virtualization Guide.

The cset utility was used to shield the RT workload from KVM, as described in the SLE RT Shielding Guide (see Shielding Linux Resources):

cset shield --kthread=on -c 8-47

Affinity for the KVM vCPU tasks was modified via the virsh vcpupin command, using a 1-1 mapping: for example, vCPU 0 pinned to physical CPU 0, vCPU 1 to physical CPU 1, and so on.
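
A minimal sketch of such a 1-1 pinning, assuming the guest is defined in libvirt under the hypothetical name sles-guest:

virsh vcpupin sles-guest 0 0
virsh vcpupin sles-guest 1 1
virsh vcpupin sles-guest 2 2
virsh vcpupin sles-guest 3 3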

The CPUs were split into two groups. CPUs 0-7 were allocated to the system cpuset, and CPUs 8-47 were allocated to the user group. Having CPUs on the same socket in two groups was done intentionally to monitor the effects on shared CPU resources, such as last-level cache (LLC).
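
The resulting split can be inspected with the cset utility itself; a quick check (the output depends on the system):

cset set --list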

The RT workload used throughout is cyclictest, executed as follows:

cset shield --exec cyclictest -- -a 8-47 -t 40 -n -m -p99 -d 0 -D 120 --quiet

1.2 Observations

The following observations were made:

  1. VM Heavy I/O

    The test for this was to do the following in a VM:

    dd if=/dev/zero of=empty bs=4096 count=$(((80*1024*1024)/4096))

    Doing large amounts of disk I/O in the VM guests has a noticeable impact on the latency of RT tasks. This is because of the constant eviction of LLC data, resulting in more cache misses.

    The maximum latencies for the real-time workload are seen on those CPUs that share a socket with the CPUs available to the KVM workload, that is, where the LLC is a resource shared between the system and user cpusets.

  2. cpufreq drivers incur timer latency

    Drivers like intel_pstate set up a timer on each CPU to periodically sample and adjust the CPU's current P-state. If this timer fires at an inopportune time, it can delay the scheduling of RT tasks, particularly because many of the IRQ/timer code paths run with interrupts disabled.

  3. Interrupt handling introduces delays

    The handling of interrupts can result in latencies that affect RT workloads. Interrupts should be routed to housekeeping CPUs that are not running RT applications.

  4. Some kernel threads cannot be controlled with cpuset

    Performing heavy I/O in the VM may cause kthreads to be scheduled on the CPUs dedicated for RT. This can occur, for example, when a kthread is flushing dirty pages to disk.

    While some kworker threads cannot be moved into the system cpuset, the above issue can be mitigated by setting the CPU affinity for those threads via:

    /sys/devices/virtual/workqueue/writeback/cpumask
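
    For example, to restrict the writeback workers to the housekeeping CPUs 0-7 used in the setup above (a sketch; the mask value must match your own CPU split):

    echo ff > /sys/devices/virtual/workqueue/writeback/cpumask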

1.3 Recommendations

Suggestions for tuning machines running both RT and KVM workloads are as follows:

  1. Use CPU affinity to schedule RT tasks on their own CPUs and, if possible, on a dedicated socket. Using a dedicated socket avoids the issue from Section 1.2, “Observations” above, where the LLC occupancy is churned by VMs doing lots of I/O operations. If that is not an option, consider Intel's Cache Allocation Technology to further enforce cache-allocation policies.

  2. Disable drivers that arm per-CPU timers, such as cpufreq drivers. For example, boot with intel_pstate=disable.

  3. Set IRQ affinity to CPUs that are not running RT workloads and disable irqbalance (a manual example is sketched after this list).

  4. Alternatively, keep irqbalance running but prevent it from placing IRQs on the RT CPUs. This can be achieved by setting the IRQBALANCE_BANNED_CPUS environment variable used by irqbalance(1) to a bitmask of banned CPUs. For the examples used throughout this document, the following setting was used:

    IRQBALANCE_BANNED_CPUS="ffff,ffffff00"
  5. Search for cpumask control files in /sys and set them appropriately for those cases that cannot be controlled via cpuset. The following command will list those files:

    find /sys -name cpumask
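
As referenced in recommendation 3, IRQ affinity can also be set manually through /proc. A sketch that routes all IRQs to the housekeeping CPUs 0-7 of the example machine (some IRQs cannot be moved; the resulting write errors can be ignored):

echo ff > /proc/irq/default_smp_affinity
for irq in /proc/irq/*/smp_affinity; do
  echo ff > $irq 2>/dev/null
done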

2 Running real-time applications within Docker

It is important to note that real-time processes will be affected by container activity, as there is insufficient isolation to guarantee zero cross-talk. There are no special settings or container-specific interactions to consider: from an RT perspective, nothing changes because of containers, and whether or not a noise source runs inside a container is irrelevant. Interference may be considerably higher if multiple RT applications are executed in separate containers. Also bear in mind that while worst-case latency may be better than on SLE, performance will not necessarily be better than with NOPREEMPT, due to the overhead required for RT.

Some shielding is possible, but there is no tool-based support for it. The generic shield script in Section 2.3, “Scripts” can move a container's content onto shielded cores once the container is running. Launching either KVM or Docker directly into a shielded cpuset does not appear to be possible, but the Docker or virtualization team may be able to do better. The basic steps are described in the following sections.

2.1 Running real-time applications in a virtualized environment

If you intend to run compute-intensive applications with real-time priority, you must make sure that kernel threads cannot starve. (This is general advice that applies to other real-time scenarios as well.)

A simple precaution is to use the rtkthread=PRIORITY and rtworkqueues=PRIORITY kernel boot parameters. Set the PRIORITY values higher than the priority of any process that has the potential to dominate a CPU. This is not strictly real-time capable, but it is safer overall.
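
For example, a kernel boot command line entry along the following lines (the priority value 80 is illustrative; choose a value above the priority of any application that can dominate a CPU):

rtkthread=80 rtworkqueues=80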

Docker prerequisites
  • The kernel must be booted with the nortsched command-line parameter.

    This is to hide cgroup scheduling from Docker. If cgroup scheduling is required, then isolating Docker containers is very problematic.

  • The docker run command must be passed --privileged=true.

    This is required for using the RT classes.

  • Your container must include the chrt utility.

If no isolation is required for your use case, no further setup is needed. Start your container with docker run, using chrt to set the RT class and priority of the program you execute when starting the container. For example:

docker run --privileged=true ... /usr/bin/chrt -f 1 /usr/sbin/sshd -D

The above (with additional arguments, of course) will start sshd within the container as a SCHED_FIFO task of priority 1. ssh into it, and whatever you run in the container will inherit that RT class and priority.
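
To check from inside the container that the policy took effect (PID 1 is the sshd started above, because chrt executes it as the container's entry point):

chrt -p 1    # should report SCHED_FIFO with priority 1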

2.2 Docker shielding

There is currently no facility within Docker to launch a container directly into an isolated cpuset. You must do this manually.

Example 1: Pseudo script
# note cpuset mount point
cpuset_mnt=$(mount|grep cpuset|cut -d' ' -f3)

# create an isolated cpuset for your container
cset shield --userset=rtcpus --cpu=4-7 --kthread=on

# note path and id of your container
docker_path=$(docker run...)
docker_id=$(docker ps -q)

# move container content into the isolated cpuset
for i in $(cat ${cpuset_mnt}/system/docker/${docker_path}/tasks);
do
  echo $i > ${cpuset_mnt}/rtcpus/tasks;
done

# stop/destroy the container
docker stop ${docker_id}
docker rm ${docker_id}

# remove dir docker abandons in the shield system directory
rmdir ${cpuset_mnt}/system/docker

# tear down the shield, and you're done
cset shield --userset=rtcpus --cpu=4-7 --reset

2.3 Scripts

Example 2: Sample shield script
#!/bin/sh

let START_CPU=4
let END_CPU=63
let ONLINE=1
let SHIELD_UP=0
GOVERNOR="performance"

DEFAULT_MASK=ffffffff,ffffffff
SHIELD_MASK=00000000,0000000f

if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
  RT_RUNTIME=$(cat /proc/sys/kernel/sched_rt_runtime_us)
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
  NMI_WATCHDOG=$(cat /proc/sys/kernel/nmi_watchdog)
fi

CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
if [ ! -z $CPUSET_ROOT ]; then
  if [ -d ${CPUSET_ROOT}/rtcpus ]; then
    let SHIELD_UP=1
  fi
  if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
    CPUSET_PREFIX=cpuset.
  fi
fi

if [ $SHIELD_UP -eq 1 ]; then
  # take it down
  echo 1 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
  cset shield --userset=rtcpus --reset

  # restore default irq affinity
  echo ${DEFAULT_MASK} > /proc/irq/default_smp_affinity
  for irqlist in $(ls /proc/irq/*/smp_affinity); do
    echo ${DEFAULT_MASK} > $irqlist 2>/dev/null
  done

  if [ -f /proc/sys/kernel/timer_migration ]; then
    echo 1 > /proc/sys/kernel/timer_migration
  fi
  if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
    echo ${RT_RUNTIME} > /proc/sys/kernel/sched_rt_runtime_us
  fi
  if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
    echo 1 > /sys/kernel/debug/tracing/tracing_on
  fi
  if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
  fi
  if [ -f /proc/sys/kernel/nmi_watchdog ]; then
   echo ${NMI_WATCHDOG} > /proc/sys/kernel/nmi_watchdog
  fi
  if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
   echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
  fi
  if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
   echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
  fi
  if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
    echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/cpumask
  fi
  if [ -f /proc/sys/vm/stat_interval ]; then
    echo 1 > /proc/sys/vm/stat_interval
  fi
  if [ -f /sys/module/processor/parameters/latency_factor ]; then
   echo 2 > /sys/module/processor/parameters/latency_factor
  fi
  if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
   echo 0 > /sys/module/processor/parameters/ignore_ppc
  fi
  if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
   echo 0 > /sys/module/processor/parameters/ignore_tpc
  fi
  if [ -f /etc/init.d/sgi_irqbalance ]; then
   /etc/init.d/sgi_irqbalance start
  fi
else
  # route irqs away from shielded cpus
  if [ -f /etc/init.d/sgi_irqbalance ]; then
    /etc/init.d/sgi_irqbalance stop
  fi
  echo $SHIELD_MASK > /proc/irq/default_smp_affinity
  for irqlist in $(ls /proc/irq/*/smp_affinity); do
    echo $SHIELD_MASK > $irqlist 2>/dev/null
  done

  # poke some buttons..
  if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
    echo -1 > /proc/sys/kernel/sched_rt_runtime_us
  fi
  if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
    echo 0 > /sys/kernel/debug/tracing/tracing_on
  fi
  if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
  fi
  if [ -f /proc/sys/kernel/nmi_watchdog ]; then
    echo 0 > /proc/sys/kernel/nmi_watchdog
  fi
  if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
    echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval
  fi
  if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
    echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
  fi
  if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
    echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/cpumask
  fi
  if [ -f /proc/sys/vm/stat_interval ]; then
    echo 999999 > /proc/sys/vm/stat_interval
  fi
  if [ -f /sys/module/processor/parameters/latency_factor ]; then
    echo 1 > /sys/module/processor/parameters/latency_factor
  fi
  if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
    echo 1 > /sys/module/processor/parameters/ignore_ppc
  fi
  if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
    echo 1 > /sys/module/processor/parameters/ignore_tpc
  fi

  # ...and fire up the shield
  cset shield --userset=rtcpus --cpu=${START_CPU}-${END_CPU} --kthread=on

  # If cpuset wasn't previously mounted (systemd will, like it or not),
  # it has now been mounted. Find the mount point.
  if [ -z $CPUSET_ROOT ]; then
   CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
   if [ -z $CPUSET_ROOT ]; then
     # If it's not mounted now, bail.
     echo EEK, cpuset is not mounted!
     exit
   else
     # ok, check for cgroup mount
     if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
      CPUSET_PREFIX=cpuset.
     fi
   fi
  fi

  echo 0 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
  echo 1 > ${CPUSET_ROOT}/system/${CPUSET_PREFIX}sched_load_balance
  echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_relax_domain_level
  # this ain't gonna happen in -rt kernels, but...
  if [ -f ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us ]; then
    echo 300000 > ${CPUSET_ROOT}/system/cpu.rt_runtime_us
    echo 300000 > ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us
  fi
  echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_load_balance

  # wait a bit for sched_domain rebuild
  sleep 1

  # now go to hpc
  if [ -f ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt ]; then
    echo 1 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt
  fi

  # offline/online to migrate timers and whatnot
  if [ $ONLINE -eq 1 ]; then
    for i in `seq ${START_CPU} ${END_CPU}`; do
      echo 0 > /sys/devices/system/cpu/cpu$i/online
    done
    for i in `seq ${START_CPU} ${END_CPU}`; do
      echo 1 > /sys/devices/system/cpu/cpu$i/online
    done

    # re-add CPUs the kernel removed on offline
    echo ${START_CPU}-${END_CPU} > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}cpus

    # and prioritize re-initialized kthreads
    systemctl restart set_kthread_prio
  fi
  if [ -f /proc/sys/kernel/timer_migration ]; then
    echo 0 > /proc/sys/kernel/timer_migration
  fi
  GOVERNOR="performance"
fi

if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then
  CURRENT_GOVERNOR=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)
  if ! [ $GOVERNOR = $CURRENT_GOVERNOR ]; then
    for i in $(ls /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do
     echo $GOVERNOR > $i;
    done
  fi
fi
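
Assuming the script above is saved as, for example, shield.sh (a hypothetical name), it toggles the shield on each invocation:

./shield.sh    # first run: routes IRQs away, adjusts the sysctls, and brings the shield up
./shield.sh    # running it again tears the shield down and restores the defaults
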
Example 3: Patch to sysjitter to use the user affinity instead of whole box
sysjitter.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/sysjitter.c
+++ b/sysjitter.c
@@ -412,7 +412,7 @@ static void write_raw(struct thread *thr
 	FILE *f;
 	int i;
 	for (i = 0; i < g.n_threads; ++i) {
-		sprintf(fname, "%s.%d", outf, i);
+		sprintf(fname, "%s.%d", outf, threads[i].core_i);
 		if ((f = fopen(fname, "w")) == NULL) {
 			fprintf(stderr, "ERROR: Could not open '%s' for writing\n", fname);
 			fprintf(stderr, "ERROR: %s\n", strerror(errno));
@@ -578,6 +578,7 @@ int main(int argc, char *argv[])
 	const char *outf = NULL;
 	char dummy;
 	int i, n_cores, runtime = 70;
+	cpu_set_t cpus;

 	g.max_interruptions = 1000000;

@@ -609,10 +610,13 @@ int main(int argc, char *argv[])
 	    sscanf(argv[0], "%u%c", &g.threshold_nsec, &dummy) != 1)
 		usage(app);

+	CPU_ZERO(&cpus);
+	sched_getaffinity(0, sizeof(cpus), &cpus);
+
 	n_cores = sysconf(_SC_NPROCESSORS_ONLN);
-	TEST(threads = malloc(n_cores * sizeof(threads[0])));
+	TEST(threads = malloc(CPU_COUNT(&cpus) * sizeof(threads[0])));
 	for (i = 0; i < n_cores; ++i)
-		if (move_to_core(i) == 0)
+		if (CPU_ISSET(i, &cpus) && move_to_core(i) == 0)
 			threads[g.n_threads++].core_i = i;

 	signal(SIGALRM, handle_alarm);

3 Running RT applications with RT KVM guests

Section 1, “Running RT applications with non-RT KVM guests” shows that it is possible to isolate real-time workloads running alongside KVM by using standard methods. In SLE RT 15 SP2 this can be done in user space using libvirt/qemu.

Applications and guest operating systems run inside KVM guests much as they do on bare metal. Guests interface with emulated hardware presented by QEMU, which submits I/O requests to the host on behalf of the guest. The host kernel then treats the guest I/O like that of any other user-space application.

In SLE RT 15 SP4, both QEMU and libvirt support isolating the CPUs, partitioning the memory for guests, and setting the vCPU/iothread scheduler policy and priority for running both non-RT KVM and RT KVM.

3.1 Support of QEMU/libvirt

  1. QEMU provides the -realtime mlock=on|off option. Locking QEMU and guest memory is enabled with mlock=on, which is the default (see the sketch after this list).

  2. libvirt supports CPU allocation, CPU tuning, and memory backing, which allow you to control RT parameters; see Section 3.2, “Sample of libvirt.xml”.

    CPU allocation

    You can define the maximum number of virtual CPUs allocated for the guest OS.

    CPU tuning
    • Pinning is a tuning option for the virtual CPUs in KVM guests. With pinning, you can control where the guest runs in order to reduce the overhead of scheduler switches, pin vCPUs to physical CPUs that have low utilization, and improve the data cache performance. Overall performance is improved when the memory that an application uses is local to the physical CPU, and the guest vCPU is pinned to this physical CPU.

    • You can specify the vCPU scheduler type (batch, idle, fifo, or rr) and the priority for particular vCPU threads. Priority 99 is too high and will massively interfere with the host's ability to function properly, because there are host-side per-CPU threads, such as timer sirq threads, that must always be able to preempt.

    Memory backing

    Use memory backing to allocate enough memory in the guest to avoid memory overcommit, and to lock the guest page memory in host memory to prevent it from being swapped out. This will show a performance improvement in some workloads.
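
A minimal sketch of a QEMU invocation using the memory-locking option from item 1 above (memory and CPU counts are illustrative):

qemu-system-x86_64 -m 2048 -smp 4 -realtime mlock=on ...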

3.2 Sample of libvirt.xml

<domain>
   …
   <vcpu placement='static' cpuset="1-4,^3,6" current="1">4</vcpu>
   …
   <cputune>
       <vcpupin vcpu="0" cpuset="1-5,^2"/>
       <vcpupin vcpu="1" cpuset="0,1"/>
       <vcpupin vcpu="2" cpuset="2,3"/>
       <vcpupin vcpu="3" cpuset="0,4"/>
       <vcpusched vcpus='0-4,^3' scheduler='fifo' priority='1'/>
   </cputune>
   …
   <memoryBacking>
       <locked/>
   </memoryBacking>
   …
</domain>

3.3 Other host settings

  1. Power management.  Intel processors have a power management feature that puts the system into power-saving mode when the system is under-utilized. The system should be configured for maximum performance, rather than allowing power-saving mode.

  2. Turboboost and Speedstep.  Turboboost overclocks a core when CPU demand is high, whereas Speedstep dynamically adjusts the frequency of the processor to meet processing needs. Turboboost is an extension of Speedstep and requires Speedstep to be enabled. For maximum performance, enable both Turboboost and Speedstep in the BIOS. The host OS may also need configuration to support running at higher clock speeds. For example:

    cpupower -c all frequency-set -g performance
  3. Disable interrupt balancing (irqbalance).  The irqbalance daemon is enabled by default. It distributes hardware interrupts across CPUs in a multi-core system to increase performance. When irqbalance is disabled, all interrupts are handled by cpu0, and therefore the guest should NOT run on cpu0 (see the sketch after this list).

  4. RT throttling.  The default values for the realtime throttling mechanism allocate 95% of the CPU time to realtime tasks and the remaining 5% to non-realtime tasks. If RT throttling is disabled, realtime tasks may use up to 100% of CPU time; programming failures in real-time applications can then hang the entire system, because no other task can preempt the realtime tasks (disabling RT throttling is also shown in the sketch after this list).
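
A sketch of the host-side commands for items 3 and 4 above (disabling RT throttling removes a safety net, so only do this where the RT applications are trusted):

# stop and disable the irqbalance daemon (item 3)
systemctl disable --now irqbalance

# disable RT throttling so realtime tasks may use up to 100% of a CPU (item 4)
echo -1 > /proc/sys/kernel/sched_rt_runtime_us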

The above settings are only part of the configuration required for RT KVM to deliver best-effort performance. Other factors must also be considered, such as storage and networking. Overall KVM performance depends on the host hardware, firmware, BIOS settings, and on the guest OS and application characteristics.