documentation.suse.com / SUSE Edge Documentation / Product Documentation / Telco features configuration

36 Telco features configuration

This section documents and explains the configuration of Telco-specific features on ATIP-deployed clusters.

The directed network provisioning deployment method is used, as described in the ATIP Automated Provision (Chapter 37, Fully automated directed network provisioning) section.

The following topics are covered in this section:

36.1 Kernel image for real time

The real-time kernel image is not necessarily better than a standard kernel. It is a different kernel tuned to a specific use case. The real-time kernel is tuned for lower latency at the cost of throughput. The real-time kernel is not recommended for general purpose use, but in our case, this is the recommended kernel for Telco Workloads where latency is a key factor.

There are four top features:

  • Deterministic execution:

    Get greater predictability — ensure critical business processes complete in time, every time and deliver high-quality service, even under heavy system loads. By shielding key system resources for high-priority processes, you can ensure greater predictability for time-sensitive applications.

  • Low jitter:

    The low jitter built upon the highly deterministic technology helps to keep applications synchronized with the real world. This helps services that need ongoing and repeated calculation.

  • Priority inheritance:

    Priority inheritance refers to the ability of a lower priority process to assume a higher priority when there is a higher priority process that requires the lower priority process to finish before it can accomplish its task. SUSE Linux Enterprise Real Time solves these priority inversion problems for mission-critical processes.

  • Thread interrupts:

    Processes running in interrupt mode in a general-purpose operating system are not preemptible. With SUSE Linux Enterprise Real Time, these interrupts have been encapsulated by kernel threads, which are interruptible, and allow the hard and soft interrupts to be preempted by user-defined higher priority processes.

    In our case, if you have installed a real-time image like SLE Micro RT, kernel real time is already installed. From the SUSE Customer Center, you can download the real-time kernel image.

    Note
    Note

    For more information about the real-time kernel, visit SUSE Real Time.

36.2 Kernel arguments for low latency and high performance

The kernel arguments are important to be configured to enable the real-time kernel to work properly giving the best performance and low latency to run telco workloads. There are some important concepts to keep in mind when configuring the kernel arguments for this use case:

  • Remove kthread_cpus when using SUSE real-time kernel. This parameter controls on which CPUs kernel threads are created. It also controls which CPUs are allowed for PID 1 and for loading kernel modules (the kmod user-space helper). This parameter is not recognized and does not have any effect.

  • Add domain,nohz,managed_irq flags to isolcpus kernel argument. Without any flags, isolcpus is equivalent to specifying only the domain flag. This isolates the specified CPUs from scheduling, including kernel tasks. The nohz flag stops the scheduler tick on the specified CPUs (if only one task is runnable on a CPU), and the managed_irq flag avoids routing managed external (device) interrupts at the specified CPUs.

  • Remove intel_pstate=passive. This option configures intel_pstate to work with generic cpufreq governors, but to make this work, it disables hardware-managed P-states (HWP) as a side effect. To reduce the hardware latency, this option is not recommended for real-time workloads.

  • Replace intel_idle.max_cstate=0 processor.max_cstate=1 with idle=poll. To avoid C-State transitions, the idle=poll option is used to disable the C-State transitions and keep the CPU in the highest C-State. The intel_idle.max_cstate=0 option disables intel_idle, so acpi_idle is used, and acpi_idle.max_cstate=1 then sets max C-state for acpi_idle. On x86_64 architectures, the first ACPI C-State is always POLL, but it uses a poll_idle() function, which may introduce some tiny latency by reading the clock periodically, and restarting the main loop in do_idle() after a timeout (this also involves clearing and setting the TIF_POLL task flag). In contrast, idle=poll runs in a tight loop, busy-waiting for a task to be rescheduled. This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread.

  • Disable C1E in BIOS. This option is important to disable the C1E state in the BIOS to avoid the CPU from entering the C1E state when idle. The C1E state is a low-power state that can introduce latency when the CPU is idle.

  • Add nowatchdog to disable the soft-lockup watchdog which is implemented as a timer running in the timer hard-interrupt context. When it expires (i.e. a soft lockup is detected), it will print a warning (in the hard interrupt context), running any latency targets. Even if it never expires, it goes onto the timer list, slightly increasing the overhead of every timer interrupt. This option also disables the NMI watchdog, so NMIs cannot interfere.

  • Add nmi_watchdog=0. This option disables only the NMI watchdog.

This is an example of the kernel argument list including the aforementioned adjustments:

GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"

36.3 CPU tuned configuration

The CPU Tuned configuration allows the possibility to isolate the CPU cores to be used by the real-time kernel. It is important to prevent the OS from using the same cores as the real-time kernel, because the OS could use the cores and increase the latency in the real-time kernel.

To enable and configure this feature, the first thing is to create a profile for the CPU cores we want to isolate. In this case, we are isolating the cores 1-30 and 33-62.

$ echo "export tuned_params" >> /etc/grub.d/00_tuned

$ echo "isolated_cores=1-18,21-38" >> /etc/tuned/cpu-partitioning-variables.conf

$ tuned-adm profile cpu-partitioning
Tuned (re)started, changes applied.

Then we need to modify the GRUB option to isolate CPU cores and other important parameters for CPU usage. The following options are important to be customized with your current hardware specifications:

parametervaluedescription

isolcpus

domain,nohz,managed_irq,1-18,21-38

Isolate the cores 1-18 and 21-38

skew_tick

1

This option allows the kernel to skew the timer interrupts across the isolated CPUs.

nohz

on

This option allows the kernel to run the timer tick on a single CPU when the system is idle.

nohz_full

1-18,21-38

kernel boot parameter is the current main interface to configure full dynticks along with CPU Isolation.

rcu_nocbs

1-18,21-38

This option allows the kernel to run the RCU callbacks on a single CPU when the system is idle.

irqaffinity

0,19,20,39

This option allows the kernel to run the interrupts on a single CPU when the system is idle.

idle

poll

This minimizes the latency of exiting the idle state, but at the cost of keeping the CPU running at full speed in the idle thread.

nmi_watchdog

0

This option disables only the NMI watchdog.

nowatchdog

 

This option disables the soft-lockup watchdog which is implemented as a timer running in the timer hard-interrupt context.

With the values shown above, we are isolating 60 cores, and we are using four cores for the OS.

The following commands modify the GRUB configuration and apply the changes mentioned above to be present on the next boot:

Edit the /etc/default/grub file and add the parameters mentioned above:

GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"

Update the GRUB configuration:

$ transactional-update grub.cfg
$ reboot

To validate that the parameters are applied after the reboot, the following command can be used to check the kernel command line:

$ cat /proc/cmdline

There is another script that can be used to tune the CPU configuration, which basically is doing the following steps:

  • Set the CPU governor to performance.

  • Unset the timer migration to the isolated CPUs.

  • Migrate the kdaemon threads to the housekeeping CPUs.

  • Set the isolated CPUs latency to the lowest possible value.

  • Delay the vmstat updates to 300 seconds.

The script is available at SUSE ATIP Github repository - performance-settings.sh.

36.4 CNI Configuration

36.4.1 Cilium

Cilium is the default CNI plug-in for ATIP. To enable Cilium on RKE2 cluster as the default plug-in, the following configurations are required in the /etc/rancher/rke2/config.yaml file:

cni:
- cilium

This can also be specified with command-line arguments, that is, --cni=cilium into the server line in /etc/systemd/system/rke2-server file.

To use the SR-IOV network operator described in the next section (Section 36.5, “SR-IOV”), use Multus with another CNI plug-in, like Cilium or Calico, as a secondary plug-in.

cni:
- multus
- cilium
Note
Note

For more information about CNI plug-ins, visit Network Options.

36.5 SR-IOV

SR-IOV allows a device, such as a network adapter, to separate access to its resources among various PCIe hardware functions. There are different ways to deploy SR-IOV, and here, we show two different options:

  • Option 1: using the SR-IOV CNI device plug-ins and a config map to configure it properly.

  • Option 2 (recommended): using the SR-IOV Helm chart from Rancher Prime to make this deployment easy.

Option 1 - Installation of SR-IOV CNI device plug-ins and a config map to configure it properly

  • Prepare the config map for the device plug-in

Get the information to fill the config map from the lspci command:

$ lspci | grep -i acc
8a:00.0 Processing accelerators: Intel Corporation Device 0d5c

$ lspci | grep -i net
19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.2 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
19:00.3 Ethernet controller: Broadcom Inc. and subsidiaries BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
51:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:01.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.1 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.2 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:01.3 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.0 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.1 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.2 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)
51:11.3 Ethernet controller: Intel Corporation Ethernet Adaptive Virtual Function (rev 02)

The config map consists of a JSON file that describes devices using filters to discover, and creates groups for the interfaces. The key is understanding filters and groups. The filters are used to discover the devices and the groups are used to create the interfaces.

It could be possible to set filters:

  • vendorID: 8086 (Intel)

  • deviceID: 0d5c (Accelerator card)

  • driver: vfio-pci (driver)

  • pfNames: p2p1 (physical interface name)

It could be possible to also set filters to match more complex interface syntax, for example:

  • pfNames: ["eth1#1,2,3,4,5,6"] or [eth1#1-6] (physical interface name)

Related to the groups, we could create a group for the FEC card and another group for the Intel card, even creating a prefix depending on our use case:

  • resourceName: pci_sriov_net_bh_dpdk

  • resourcePrefix: Rancher.io

There are a lot of combinations to discover and create the resource group to allocate some VFs to the pods.

Note
Note

For more information about the filters and groups, visit sr-iov network device plug-in.

After setting the filters and groups to match the interfaces depending on the hardware and the use case, the following config map shows an example to be used:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [
            {
                "resourceName": "intel_fec_5g",
                "devicetype": "accelerator",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["0d5d"]
                }
            },
            {
                "resourceName": "intel_sriov_odu",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["1889"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["p2p1"]
                }
            },
            {
                "resourceName": "intel_sriov_oru",
                "selectors": {
                    "vendors": ["8086"],
                    "devices": ["1889"],
                    "drivers": ["vfio-pci"],
                    "pfNames": ["p2p2"]
                }
            }
        ]
    }
  • Prepare the daemonset file to deploy the device plug-in.

The device plug-in supports several architectures (arm, amd, ppc64le), so the same file can be used for different architectures deploying several daemonset for each architecture.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sriov-device-plugin
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-device-plugin-amd64
  namespace: kube-system
  labels:
    tier: node
    app: sriovdp
spec:
  selector:
    matchLabels:
      name: sriov-device-plugin
  template:
    metadata:
      labels:
        name: sriov-device-plugin
        tier: node
        app: sriovdp
    spec:
      hostNetwork: true
      nodeSelector:
        kubernetes.io/arch: amd64
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      serviceAccountName: sriov-device-plugin
      containers:
      - name: kube-sriovdp
        image: rancher/hardened-sriov-network-device-plugin:v3.7.0-build20240816
        imagePullPolicy: IfNotPresent
        args:
        - --log-dir=sriovdp
        - --log-level=10
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "250m"
            memory: "40Mi"
          limits:
            cpu: 1
            memory: "200Mi"
        volumeMounts:
        - name: devicesock
          mountPath: /var/lib/kubelet/
          readOnly: false
        - name: log
          mountPath: /var/log
        - name: config-volume
          mountPath: /etc/pcidp
        - name: device-info
          mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
      volumes:
        - name: devicesock
          hostPath:
            path: /var/lib/kubelet/
        - name: log
          hostPath:
            path: /var/log
        - name: device-info
          hostPath:
            path: /var/run/k8s.cni.cncf.io/devinfo/dp
            type: DirectoryOrCreate
        - name: config-volume
          configMap:
            name: sriovdp-config
            items:
            - key: config.json
              path: config.json
  • After applying the config map and the daemonset, the device plug-in will be deployed and the interfaces will be discovered and available for the pods.

    $ kubectl get pods -n kube-system | grep sriov
    kube-system  kube-sriov-device-plugin-amd64-twjfl  1/1  Running  0  2m
  • Check the interfaces discovered and available in the nodes to be used by the pods:

    $ kubectl get $(kubectl get nodes -oname) -o jsonpath='{.status.allocatable}' | jq
    {
      "cpu": "64",
      "ephemeral-storage": "256196109726",
      "hugepages-1Gi": "40Gi",
      "hugepages-2Mi": "0",
      "intel.com/intel_fec_5g": "1",
      "intel.com/intel_sriov_odu": "4",
      "intel.com/intel_sriov_oru": "4",
      "memory": "221396384Ki",
      "pods": "110"
    }
  • The FEC is intel.com/intel_fec_5g and the value is 1.

  • The VF is intel.com/intel_sriov_odu or intel.com/intel_sriov_oru if you deploy it with a device plug-in and the config map without Helm charts.

Important
Important

If there are no interfaces here, it makes little sense to continue because the interface will not be available for pods. Review the config map and filters to solve the issue first.

Option 2 (recommended) - Installation using Rancher using Helm chart for SR-IOV CNI and device plug-ins

  • Get Helm if not present:

$ curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  • Install SR-IOV.

This part could be done in two ways, using the CLI or using the Rancher UI.

Install Operator from CLI
helm install sriov-crd oci://registry.suse.com/edge/3.1/sriov-crd-chart -n sriov-network-operator
helm install sriov-network-operator oci://registry.suse.com/edge/3.1/sriov-network-operator-chart -n sriov-network-operator
Install Operator from Rancher UI

Once your cluster is installed, and you have access to the Rancher UI, you can install the SR-IOV Operator from the Rancher UI from the apps tab:

Note
Note

Make sure you select the right namespace to install the operator, for example, sriov-network-operator.

+ image::features_sriov.png[sriov.png]

  • Check the deployed resources crd and pods:

$ kubectl get crd
$ kubectl -n sriov-network-operator get pods
  • Check the label in the nodes.

With all resources running, the label appears automatically in your node:

$ kubectl get nodes -oyaml | grep feature.node.kubernetes.io/network-sriov.capable

feature.node.kubernetes.io/network-sriov.capable: "true"
  • Review the daemonset to see the new sriov-network-config-daemon and sriov-rancher-nfd-worker as active and ready:

$ kubectl get daemonset -A
NAMESPACE             NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                           AGE
calico-system            calico-node                     1         1         1       1            1           kubernetes.io/os=linux                                  15h
sriov-network-operator   sriov-network-config-daemon     1         1         1       1            1           feature.node.kubernetes.io/network-sriov.capable=true   45m
sriov-network-operator   sriov-rancher-nfd-worker        1         1         1       1            1           <none>                                                  45m
kube-system              rke2-ingress-nginx-controller   1         1         1       1            1           kubernetes.io/os=linux                                  15h
kube-system              rke2-multus-ds                  1         1         1       1            1           kubernetes.io/arch=amd64,kubernetes.io/os=linux         15h

In a few minutes (can take up to 10 min to be updated), the nodes are detected and configured with the SR-IOV capabilities:

$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -A
NAMESPACE             NAME     AGE
sriov-network-operator   xr11-2   83s
  • Check the interfaces detected.

The interfaces discovered should be the PCI address of the network device. Check this information with the lspci command in the host.

$ kubectl get sriovnetworknodestates.sriovnetwork.openshift.io -n kube-system -oyaml
apiVersion: v1
items:
- apiVersion: sriovnetwork.openshift.io/v1
  kind: SriovNetworkNodeState
  metadata:
    creationTimestamp: "2023-06-07T09:52:37Z"
    generation: 1
    name: xr11-2
    namespace: sriov-network-operator
    ownerReferences:
    - apiVersion: sriovnetwork.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: SriovNetworkNodePolicy
      name: default
      uid: 80b72499-e26b-4072-a75c-f9a6218ec357
    resourceVersion: "356603"
    uid: e1f1654b-92b3-44d9-9f87-2571792cc1ad
  spec:
    dpConfigVersion: "356507"
  status:
    interfaces:
    - deviceID: "1592"
      driver: ice
      eSwitchMode: legacy
      linkType: ETH
      mac: 40:a6:b7:9b:35:f0
      mtu: 1500
      name: p2p1
      pciAddress: "0000:51:00.0"
      totalvfs: 128
      vendor: "8086"
    - deviceID: "1592"
      driver: ice
      eSwitchMode: legacy
      linkType: ETH
      mac: 40:a6:b7:9b:35:f1
      mtu: 1500
      name: p2p2
      pciAddress: "0000:51:00.1"
      totalvfs: 128
      vendor: "8086"
    syncStatus: Succeeded
kind: List
metadata:
  resourceVersion: ""
Note
Note

If your interface is not detected here, ensure that it is present in the next config map:

$ kubectl get cm supported-nic-ids -oyaml -n sriov-network-operator

If your device is not there, edit the config map, adding the right values to be discovered (should be necessary to restart the sriov-network-config-daemon daemonset).

  • Create the NetworkNode Policy to configure the VFs.

Some VFs (numVfs) from the device (rootDevices) will be created, and it will be configured with the driver deviceType and the MTU:

Note
Note

The resourceName field must not contain any special characters and must be unique across the cluster. The example uses the deviceType: vfio-pci because dpdk will be used in combination with sr-iov. If you don’t use dpdk, the deviceType should be deviceType: netdevice (default value).

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-dpdk
  namespace: sriov-network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: intelnicsDpdk
  deviceType: vfio-pci
  numVfs: 8
  mtu: 1500
  nicSelector:
    deviceID: "1592"
    vendor: "8086"
    rootDevices:
    - 0000:51:00.0
  • Validate configurations:

$ kubectl get $(kubectl get nodes -oname) -o jsonpath='{.status.allocatable}' | jq
{
  "cpu": "64",
  "ephemeral-storage": "256196109726",
  "hugepages-1Gi": "60Gi",
  "hugepages-2Mi": "0",
  "intel.com/intel_fec_5g": "1",
  "memory": "200424836Ki",
  "pods": "110",
  "rancher.io/intelnicsDpdk": "8"
}
  • Create the sr-iov network (optional, just in case a different network is needed):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-dpdk
  namespace: sriov-network-operator
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "192.168.0.0/24",
      "rangeStart": "192.168.0.20",
      "rangeEnd": "192.168.0.60",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "192.168.0.1"
    }
  vlan: 500
  resourceName: intelnicsDpdk
  • Check the network created:

$ kubectl get network-attachment-definitions.k8s.cni.cncf.io -A -oyaml

apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: rancher.io/intelnicsDpdk
    creationTimestamp: "2023-06-08T11:22:27Z"
    generation: 1
    name: network-dpdk
    namespace: sriov-network-operator
    resourceVersion: "13124"
    uid: df7c89f5-177c-4f30-ae72-7aef3294fb15
  spec:
    config: '{ "cniVersion":"0.4.0", "name":"network-dpdk","type":"sriov","vlan":500,"vlanQoS":0,"ipam":{"type":"host-local","subnet":"192.168.0.0/24","rangeStart":"192.168.0.10","rangeEnd":"192.168.0.60","routes":[{"dst":"0.0.0.0/0"}],"gateway":"192.168.0.1"}
      }'
kind: List
metadata:
  resourceVersion: ""

36.6 DPDK

DPDK (Data Plane Development Kit) is a set of libraries and drivers for fast packet processing. It is used to accelerate packet processing workloads running on a wide variety of CPU architectures. The DPDK includes data plane libraries and optimized network interface controller (NIC) drivers for the following:

  1. A queue manager implements lockless queues.

  2. A buffer manager pre-allocates fixed size buffers.

  3. A memory manager allocates pools of objects in memory and uses a ring to store free objects; ensures that objects are spread equally on all DRAM channels.

  4. Poll mode drivers (PMD) are designed to work without asynchronous notifications, reducing overhead.

  5. A packet framework as a set of libraries that are helpers to develop packet processing.

The following steps will show how to enable DPDK and how to create VFs from the NICs to be used by the DPDK interfaces:

  • Install the DPDK package:

$ transactional-update pkg install dpdk dpdk-tools libdpdk-23
$ reboot
  • Kernel parameters:

To use DPDK, employ some drivers to enable certain parameters in the kernel:

parametervaluedescription

iommu

pt

This option enables the use of the vfio driver for the DPDK interfaces.

intel_iommu

on

This option enables the use of vfio for VFs.

To enable the parameters, add them to the /etc/default/grub file:

GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"

Update the GRUB configuration and reboot the system to apply the changes:

$ transactional-update grub.cfg
$ reboot
  • Load vfio-pci kernel module and enable SR-IOV on the NICs:

$ modprobe vfio-pci enable_sriov=1 disable_idle_d3=1
  • Create some virtual functions (VFs) from the NICs.

To create for VFs, for example, for two different NICs, the following commands are required:

$ echo 4 > /sys/bus/pci/devices/0000:51:00.0/sriov_numvfs
$ echo 4 > /sys/bus/pci/devices/0000:51:00.1/sriov_numvfs
  • Bind the new VFs with the vfio-pci driver:

$ dpdk-devbind.py -b vfio-pci 0000:51:01.0 0000:51:01.1 0000:51:01.2 0000:51:01.3 \
                              0000:51:11.0 0000:51:11.1 0000:51:11.2 0000:51:11.3
  • Review the configuration is correctly applied:

$ dpdk-devbind.py -s

Network devices using DPDK-compatible driver
============================================
0000:51:01.0 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.1 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.2 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.3 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:01.0 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:11.1 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:21.2 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio
0000:51:31.3 'Ethernet Adaptive Virtual Function 1889' drv=vfio-pci unused=iavf,igb_uio

Network devices using kernel driver
===================================
0000:19:00.0 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em1 drv=bnxt_en unused=igb_uio,vfio-pci *Active*
0000:19:00.1 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em2 drv=bnxt_en unused=igb_uio,vfio-pci
0000:19:00.2 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em3 drv=bnxt_en unused=igb_uio,vfio-pci
0000:19:00.3 'BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet 1751' if=em4 drv=bnxt_en unused=igb_uio,vfio-pci
0000:51:00.0 'Ethernet Controller E810-C for QSFP 1592' if=eth13 drv=ice unused=igb_uio,vfio-pci
0000:51:00.1 'Ethernet Controller E810-C for QSFP 1592' if=rename8 drv=ice unused=igb_uio,vfio-pci

36.7 vRAN acceleration (Intel ACC100/ACC200)

As communications service providers move from 4 G to 5 G networks, many are adopting virtualized radio access network (vRAN) architectures for higher channel capacity and easier deployment of edge-based services and applications. vRAN solutions are ideally located to deliver low-latency services with the flexibility to increase or decrease capacity based on the volume of real-time traffic and demand on the network.

One of the most compute-intensive 4 G and 5 G workloads is RAN layer 1 (L1) FEC, which resolves data transmission errors over unreliable or noisy communication channels. FEC technology detects and corrects a limited number of errors in 4 G or 5 G data, eliminating the need for retransmission. Since the FEC acceleration transaction does not contain cell state information, it can be easily virtualized, enabling pooling benefits and easy cell migration.

  • Kernel parameters

To enable the vRAN acceleration, we need to enable the following kernel parameters (if not present yet):

parametervaluedescription

iommu

pt

This option enables the use of vfio for the DPDK interfaces.

intel_iommu

on

This option enables the use of vfio for VFs.

Modify the GRUB file /etc/default/grub to add them to the kernel command line:

GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"

Update the GRUB configuration and reboot the system to apply the changes:

$ transactional-update grub.cfg
$ reboot

To verify that the parameters are applied after the reboot, check the command line:

$ cat /proc/cmdline
  • Load vfio-pci kernel modules to enable the vRAN acceleration:

$ modprobe vfio-pci enable_sriov=1 disable_idle_d3=1
  • Get interface information Acc100:

$ lspci | grep -i acc
8a:00.0 Processing accelerators: Intel Corporation Device 0d5c
  • Bind the physical interface (PF) with vfio-pci driver:

$ dpdk-devbind.py -b vfio-pci 0000:8a:00.0
  • Create the virtual functions (VFs) from the physical interface (PF).

Create 2 VFs from the PF and bind with vfio-pci following the next steps:

$ echo 2 > /sys/bus/pci/devices/0000:8a:00.0/sriov_numvfs
$ dpdk-devbind.py -b vfio-pci 0000:8b:00.0
  • Configure acc100 with the proposed configuration file:

$ pf_bb_config ACC100 -c /opt/pf-bb-config/acc100_config_vf_5g.cfg
Tue Jun  6 10:49:20 2023:INFO:Queue Groups: 2 5GUL, 2 5GDL, 2 4GUL, 2 4GDL
Tue Jun  6 10:49:20 2023:INFO:Configuration in VF mode
Tue Jun  6 10:49:21 2023:INFO: ROM version MM 99AD92
Tue Jun  6 10:49:21 2023:WARN:* Note: Not on DDR PRQ version  1302020 != 10092020
Tue Jun  6 10:49:21 2023:INFO:PF ACC100 configuration complete
Tue Jun  6 10:49:21 2023:INFO:ACC100 PF [0000:8a:00.0] configuration complete!
  • Check the new VFs created from the FEC PF:

$ dpdk-devbind.py -s
Baseband devices using DPDK-compatible driver
=============================================
0000:8a:00.0 'Device 0d5c' drv=vfio-pci unused=
0000:8b:00.0 'Device 0d5d' drv=vfio-pci unused=

Other Baseband devices
======================
0000:8b:00.1 'Device 0d5d' unused=

36.8 Huge pages

When a process uses RAM, the CPU marks it as used by that process. For efficiency, the CPU allocates RAM in chunks 4K bytes is the default value on many platforms. Those chunks are named pages. Pages can be swapped to disk, etc.

Since the process address space is virtual, the CPU and the operating system need to remember which pages belong to which process, and where each page is stored. The greater the number of pages, the longer the search for memory mapping. When a process uses 1 GB of memory, that is 262144 entries to look up (1 GB / 4 K). If a page table entry consumes 8 bytes, that is 2 MB (262144 * 8) to look up.

Most current CPU architectures support larger-than-default pages, which give the CPU/OS fewer entries to look up.

  • Kernel parameters

To enable the huge pages, we should add the next kernel parameters:

parametervaluedescription

hugepagesz

1G

This option allows to set the size of huge pages to 1 G

hugepages

40

This is the number of huge pages defined before

default_hugepagesz

1G

This is the default value to get the huge pages

Modify the GRUB file /etc/default/grub to add them to the kernel command line:

GRUB_CMDLINE_LINUX="skew_tick=1 BOOT_IMAGE=/boot/vmlinuz-6.4.0-9-rt root=UUID=77b713de-5cc7-4d4c-8fc6-f5eca0a43cf9 rd.timeout=60 rd.retry=45 console=ttyS1,115200 console=tty0 default_hugepagesz=1G hugepages=0 hugepages=40 hugepagesz=1G hugepagesz=2M ignition.platform.id=openstack intel_iommu=on iommu=pt irqaffinity=0,19,20,39 isolcpus=domain,nohz,managed_irq,1-18,21-38 mce=off nohz=on net.ifnames=0 nmi_watchdog=0 nohz_full=1-18,21-38 nosoftlockup nowatchdog quiet rcu_nocb_poll rcu_nocbs=1-18,21-38 rcupdate.rcu_cpu_stall_suppress=1 rcupdate.rcu_expedited=1 rcupdate.rcu_normal_after_boot=1 rcupdate.rcu_task_stall_timeout=0 rcutree.kthread_prio=99 security=selinux selinux=1"

Update the GRUB configuration and reboot the system to apply the changes:

$ transactional-update grub.cfg
$ reboot

To validate that the parameters are applied after the reboot, you can check the command line:

$ cat /proc/cmdline
  • Using huge pages

To use the huge pages, we need to mount them:

$ mkdir -p /hugepages
$ mount -t hugetlbfs nodev /hugepages

Deploy a Kubernetes workload, creating the resources and the volumes:

...
 resources:
   requests:
     memory: "24Gi"
     hugepages-1Gi: 16Gi
     intel.com/intel_sriov_oru: '4'
   limits:
     memory: "24Gi"
     hugepages-1Gi: 16Gi
     intel.com/intel_sriov_oru: '4'
...
...
volumeMounts:
  - name: hugepage
    mountPath: /hugepages
...
volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
...

36.9 CPU pinning configuration

  • Requirements

    1. Must have the CPU tuned to the performance profile covered in this section (Section 36.3, “CPU tuned configuration”).

    2. Must have the RKE2 cluster kubelet configured with the CPU management arguments adding the following block (as an example) to the /etc/rancher/rke2/config.yaml file:

kubelet-arg:
- "cpu-manager=true"
- "cpu-manager-policy=static"
- "cpu-manager-policy-options=full-pcpus-only=true"
- "cpu-manager-reconcile-period=0s"
- "kubelet-reserved=cpu=1"
- "system-reserved=cpu=1"
  • Using CPU pinning on Kubernetes

There are three ways to use that feature using the Static Policy defined in kubelet depending on the requests and limits you define on your workload:

  1. BestEffort QoS Class: If you do not define any request or limit for CPU, the pod is scheduled on the first CPU available on the system.

    An example of using the BestEffort QoS Class could be:

    spec:
      containers:
      - name: nginx
        image: nginx
  2. Burstable QoS Class: If you define a request for CPU, which is not equal to the limits, or there is no CPU request.

    Examples of using the Burstable QoS Class could be:

    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            memory: "200Mi"
          requests:
            memory: "100Mi"

    or

    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            memory: "200Mi"
            cpu: "2"
          requests:
            memory: "100Mi"
            cpu: "1"
  3. Guaranteed QoS Class: If you define a request for CPU, which is equal to the limits.

    An example of using the Guaranteed QoS Class could be:

    spec:
      containers:
        - name: nginx
          image: nginx
          resources:
            limits:
              memory: "200Mi"
              cpu: "2"
            requests:
              memory: "200Mi"
              cpu: "2"

36.10 NUMA-aware scheduling

Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a physical memory design used in SMP (multiprocessors) architecture, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

36.10.1 Identifying NUMA nodes

To identify the NUMA nodes, on your system use the following command:

$ lscpu | grep NUMA
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-63
Note
Note

For this example, we have only one NUMA node showing 64 CPUs.

NUMA needs to be enabled in the BIOS. If dmesg does not have records of NUMA initialization during the bootup, then NUMA-related messages in the kernel ring buffer might have been overwritten.

36.11 Metal LB

MetalLB is a load-balancer implementation for bare-metal Kubernetes clusters, using standard routing protocols like L2 and BGP as advertisement protocols. It is a network load balancer that can be used to expose services in a Kubernetes cluster to the outside world due to the need to use Kubernetes Services type LoadBalancer with bare-metal.

To enable MetalLB in the RKE2 cluster, the following steps are required:

  • Install MetalLB using the following command:

$ kubectl apply <<EOF -f
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: metallb
  namespace: kube-system
spec:
  chart: oci://registry.suse.com/edge/3.1/metallb-chart
  targetNamespace: metallb-system
  version: 0.14.9
  createNamespace: true
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: endpoint-copier-operator
  namespace: kube-system
spec:
  chart: oci://registry.suse.com/edge/3.1/endpoint-copier-operator-chart
  targetNamespace: endpoint-copier-operator
  version: 0.2.1
  createNamespace: true
EOF
  • Create the IpAddressPool and the L2advertisement configuration:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kubernetes-vip-ip-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.168.200.98/32
  serviceAllocation:
    priority: 100
    namespaces:
      - default
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: ip-pool-l2-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - kubernetes-vip-ip-pool
  • Create the endpoint service to expose the VIP:

apiVersion: v1
kind: Service
metadata:
  name: kubernetes-vip
  namespace: default
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: rke2-api
    port: 9345
    protocol: TCP
    targetPort: 9345
  - name: k8s-api
    port: 6443
    protocol: TCP
    targetPort: 6443
  sessionAffinity: None
  type: LoadBalancer
  • Check the VIP is created and the MetalLB pods are running:

$ kubectl get svc -n default
$ kubectl get pods -n default

36.12 Private registry configuration

Containerd can be configured to connect to private registries and use them to pull private images on each node.

Upon startup, RKE2 checks if a registries.yaml file exists at /etc/rancher/rke2/ and instructs containerd to use any registries defined in the file. If you wish to use a private registry, create this file as root on each node that will use the registry.

To add the private registry, create the file /etc/rancher/rke2/registries.yaml with the following content:

mirrors:
  docker.io:
    endpoint:
      - "https://registry.example.com:5000"
configs:
  "registry.example.com:5000":
    auth:
      username: xxxxxx # this is the registry username
      password: xxxxxx # this is the registry password
    tls:
      cert_file:            # path to the cert file used to authenticate to the registry
      key_file:             # path to the key file for the certificate used to authenticate to the registry
      ca_file:              # path to the ca file used to verify the registry's certificate
      insecure_skip_verify: # may be set to true to skip verifying the registry's certificate

or without authentication:

mirrors:
  docker.io:
    endpoint:
      - "https://registry.example.com:5000"
configs:
  "registry.example.com:5000":
    tls:
      cert_file:            # path to the cert file used to authenticate to the registry
      key_file:             # path to the key file for the certificate used to authenticate to the registry
      ca_file:              # path to the ca file used to verify the registry's certificate
      insecure_skip_verify: # may be set to true to skip verifying the registry's certificate

For the registry changes to take effect, you need to either configure this file before starting RKE2 on the node, or restart RKE2 on each configured node.

Note
Note

For more information about this, please check containerd registry configuration rke2.