Applies to SUSE AI 1.0

4 Preparing the cluster for AI Library #

This procedure assumes that you already have the base operating system installed on cluster nodes as well as the SUSE Rancher Prime: RKE2 Kubernetes distribution installed and operational. If you are installing from scratch, refer to Chapter 3, Installing the Linux and Kubernetes distribution first.

Install SUSE Rancher Prime (Section 4.1, “Installing SUSE Rancher Prime on a Kubernetes cluster in air-gapped environments”).
Important: Separate clusters for specific SUSE AI components
For production deployments, we strongly recommended deploying Rancher, SUSE Observability, and workloads from the AI library to separate Kubernetes clusters.
Install the NVIDIA GPU Operator as described in Section 4.2, “Installing the NVIDIA GPU Operator on the SUSE Rancher Prime: RKE2 cluster”.
Connect the downstream Kubernetes cluster to SUSE Rancher Prime running on the upstream cluster as described in Section 4.3, “Registering existing clusters”.
Configure the GPU-enabled nodes so that the SUSE AI containers are assigned to Pods that run on nodes equipped with NVIDIA GPU hardware. Find more details about assigning Pods to nodes in Section 4.4, “Assigning GPU nodes to applications”.
(Optional) Install SUSE Security as described in Section 4.5, “Installing SUSE Security”. Although this step is not required, we strongly encourage it to ensure data security in the production environment.
Install and configure SUSE Observability to observe the nodes used for SUSE AI application. Refer to Section 4.6, “Setting up SUSE Observability for SUSE AI” for more details.

4.1 Installing SUSE Rancher Prime on a Kubernetes cluster in air-gapped environments #

This section is about using the Helm CLI to install the Rancher server in an air-gapped environment.

4.1.1 Installation outline #

4.1.2 Set up the infrastructure and a private registry #

In this section, you will provision the underlying infrastructure for your Rancher management server in an air gapped environment. You will also set up the private container image registry that must be available to your Rancher node(s).

In this section, you will provision the underlying infrastructure for your Rancher management server in an air-gapped environment. You will also set up the private container image registry that must be available to your Rancher node(s). The procedures below focus on installing Rancher in the RKE2 cluster. To install the Rancher management server on a high-availability SUSE Rancher Prime: RKE2 cluster, we recommend setting up the following infrastructure:

Three Linux nodes, typically virtual machines, in an infrastructure provider such as Amazon’s EC2, Google Compute Engine or vSphere.
A load balancer to direct front-end traffic to the three nodes.
A DNS record to map a URL to the load balancer. This will become the Rancher server URL, and downstream Kubernetes clusters will need to reach it.
A private image registry to distribute container images to your machines.

These nodes must be in the same region or data center. You may place these servers in separate availability zones.

4.1.2.1 Why three nodes? #

In an RKE2 cluster, the Rancher server data is stored on etcd. This etcd database runs on all three nodes. The etcd database requires an odd number of nodes so that it can always elect a leader with a majority of the etcd cluster. If the etcd database cannot elect a leader, etcd can suffer from split brain, requiring the cluster to be restored from backup. If one of the three etcd nodes fails, the two remaining nodes can elect a leader because they have the majority of the total number of etcd nodes.

4.1.2.2 Set up Linux nodes #

These hosts will be disconnected from the Internet, but require being able to connect with your private registry. Make sure that your nodes fulfill the general installation requirements for OS, container runtime, hardware and networking. For an example of one way to set up Linux nodes, refer to this tutorial for setting up nodes as instances in Amazon EC2.

4.1.2.3 Set up the load balancer #

You will also need to set up a load balancer to direct traffic to the Rancher replica on both nodes. That will prevent the outage of any single node from taking down communications to the Rancher management server. When Kubernetes gets set up in a later step, the RKE2 tool will deploy an NGINX Ingress controller. This controller will listen on ports 80 and 443 of the worker nodes, answering traffic destined for specific hostnames. When Rancher is installed (also in a later step), the Rancher system creates an Ingress resource. That Ingress tells the NGINX Ingress controller to listen for traffic destined for the Rancher host name. The NGINX Ingress controller, when receiving traffic destined for the Rancher host name, will forward that traffic to the running Rancher pods in the cluster. For your implementation, consider if you want or need to use a Layer-4 or Layer-7 load balancer:

A layer-4 load balancer is the simpler of the two choices, in which you are forwarding TCP traffic to your nodes. We recommend configuring your load balancer as a Layer 4 balancer, forwarding traffic on ports TCP/80 and TCP/443 to the Rancher management cluster nodes. The Ingress controller on the cluster will redirect HTTP traffic to HTTPS and terminate SSL/TLS on port TCP/443. The Ingress controller will forward traffic on port TCP/80 to the Ingress pod in the Rancher deployment.
A layer-7 load balancer is a bit more complicated but can offer features that you may want. For instance, a layer-7 load balancer is capable of handling TLS termination at the load balancer, as opposed to Rancher doing TLS termination itself. This can be beneficial to centralize your TLS termination in your infrastructure. Layer-7 load balancing also allows your load balancer to make decisions based on HTTP attributes such as cookies—capabilities that a layer-4 load balancer cannot handle. If you decide to terminate the SSL/TLS traffic on a layer-7 load balancer, you will need to use the --set tls=external option when installing Rancher in a later step. For more information, refer to the Rancher Helm chart options.

For an example showing how to set up an NGINX load balancer, refer to this page. For a how-to guide for setting up an Amazon ELB Network Load Balancer, refer to this page.

Important

Do not use this load balancer (that is, the local cluster Ingress) to load balance applications other than Rancher following installation. Sharing this Ingress with other applications may result in WebSocket errors to Rancher following Ingress configuration reloads for other apps. We recommend dedicating the local cluster to Rancher and no other applications.

4.1.2.4 Set up the DNS record #

Once you have set up your load balancer, you will need to create a DNS record to send traffic to this load balancer. Depending on your environment, this may be an A record pointing to the LB IP, or it may be a CNAME pointing to the load balancer host name. In either case, make sure this record matches the host name you want Rancher to respond to. You will need to specify this host name in a later step when you install Rancher, and it is not possible to change it later. Make sure that your decision is final. For a how-to guide for setting up a DNS record to route domain traffic to an Amazon ELB load balancer, refer to the official AWS documentation.

4.1.2.5 Set up a private image registry #

Rancher supports air-gapped installations using a secure private registry. You must have your own private registry or other means of distributing container images to your machines. In a later step, when you set up your RKE2 Kubernetes cluster, you will create a private registries configuration file with details from this registry. If you need to create a private registry, refer to the documentation pages for your respective runtime:

4.1.3 Collect and publish images to your private registry #

This section describes how to set up your private registry so that when you install Rancher, Rancher will pull all the required images from this registry.

By default, all images used to provision Kubernetes clusters or launch any tools in Rancher, e.g. monitoring, pipelines, alerts, are pulled from Docker Hub. In an air-gapped installation of Rancher, you will need a private registry that is located somewhere accessible by your Rancher server. Then, you will load the registry with all the images.

Populating the private registry with images is the same process for installing SUSE Rancher Prime with Docker and for installing SUSE Rancher Prime on a Kubernetes cluster.

Prerequisites #

You must have a private registry available to use.
If the registry has certs, follow this K3s documentation about adding a private registry. The certs and registry configuration files need to be mounted into the SUSE Rancher Prime container.

The following steps populate your private registry.

Find the required assets for your SUSE Rancher Prime version (Section 4.1.3.1, “Find the required assets for your SUSE Rancher Prime version”)
Collect the cert-manager image (Section 4.1.3.2, “Collect the cert-manager image”) (unless you are bringing your own certificates or terminating TLS on a load balancer)
Save the images to your workstation (Section 4.1.3.3, “Save the images to your workstation”)
Populate the private registry (Section 4.1.3.4, “Populate the private registry”)

Prerequisites. These steps expect you to use a Linux workstation that has Internet access, access to your private registry, and at least 20 GB of disk space.

If you use ARM64 hosts, the registry must support manifests. As of April 2020, Amazon Elastic Container Registry does not support manifests.

4.1.3.1 Find the required assets for your SUSE Rancher Prime version #

Go to our releases page, find the SUSE Rancher Prime v2.x.x release that you want to install, and click Assets. Note: Do not use releases marked rc or Pre-release, as they are not stable for production environments.
From the release’s Assets section, download the following files, which are required to install Rancher in an air-gap environment:

Table 4.1: Required assets #

Release File	Description
`rancher-images.txt`	This file contains a list of images needed to install SUSE Rancher Prime, provision clusters and use SUSE Rancher Prime tools.
`rancher-save-images.sh`	This script pulls all the images in the `rancher-images.txt` from Docker Hub and saves all the images as `rancher-images.tar.gz`.
`rancher-load-images.sh`	This script loads images from the `rancher-images.tar.gz` file and pushes them to your private registry.

4.1.3.2 Collect the cert-manager image #

Note

Skip this step if you are using your own certificates, or if you are terminating TLS on an external load balancer.

In a Kubernetes Install, if you elect to use the Rancher default self-signed TLS certificates, you must add the cert-manager image to rancher-images.txt as well.

Fetch the latest cert-manager Helm chart and parse the template for image details:
Note
Recent changes to cert-manager require an upgrade. If you are upgrading SUSE Rancher Prime and using a version of cert-manager older than v0.12.0, please see our upgrade documentation.
```
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm fetch jetstack/cert-manager
helm template ./cert-manager-version.tgz | \
  awk '$1 ~ /image:/ {print $2}' | sed s/\"//g >> ./rancher-images.txt
```
Sort the image list and deduplicate it to remove any overlap between the sources:
```
> sort -u rancher-images.txt -o rancher-images.txt
```

4.1.3.3 Save the images to your workstation #

Make rancher-save-images.sh an executable:

{prompt_user}chmod +x rancher-save-images.sh

Run rancher-save-images.sh with the rancher-images.txt image list to create a tarball of all the required images:

{prompt_user}./rancher-save-images.sh --image-list ./rancher-images.txt

Result: Docker begins pulling the images used for an air-gap install. Be patient. This process takes a few minutes. When the process completes, your current directory will output a tarball named rancher-images.tar.gz. Check that the output is in the directory.

4.1.3.4 Populate the private registry #

Next, you will move the images in the rancher-images.tar.gz to your private registry using the scripts to load the images. Move the images in the rancher-images.tar.gz to your private registry using the scripts to load the images. The rancher-images.txt is expected to be on the workstation in the same directory that you are running the rancher-load-images.sh script. The rancher-images.tar.gz should also be in the same directory.

(Optional) Log in to your private registry if required:
```
> docker login REGISTRY.YOURDOMAIN.COM:PORT
```
Make rancher-load-images.sh an executable:
```
> chmod +x rancher-load-images.sh
```
Use rancher-load-images.sh to extract, tag and push rancher-images.txt and rancher-images.tar.gz to your private registry:
```
> ./rancher-load-images.sh --image-list ./rancher-images.txt \
   --registry REGISTRY.YOURDOMAIN.COM:PORT
```

4.1.4 Install Kubernetes #

This section describes how to install a Kubernetes cluster according to our best practices for the SUSE Rancher Prime server environment. This cluster should be dedicated to running only the SUSE Rancher Prime server.

Note

Skip this section if you are installing SUSE Rancher Prime on a single node with Docker.

Rancher can be installed on any Kubernetes cluster, including hosted Kubernetes providers.

The steps to set up an air-gapped Kubernetes cluster on RKE2 or K3s are shown below.

In this guide, we are assuming you have created your nodes in your air-gapped environment and have a secure Docker private registry on your bastion server.

== 1. Create RKE2 configuration

Create the config.yaml file at /etc/rancher/rke2/config.yaml. This will contain all the configuration options necessary to create a highly available RKE2 cluster.

On the first server the minimum configuration is:

token: my-shared-secret
tls-san:
  - loadbalancer-dns-domain.com

On each other server the configuration file should contain the same token and tell RKE2 to connect to the existing first server:

server: https://ip-of-first-server:9345
token: my-shared-secret
tls-san:
  - loadbalancer-dns-domain.com

For more information, refer to the RKE2 documentation.

Note

RKE2 additionally provides a resolv-conf option for kubelets, which may help with configuring DNS in air-gap networks.

== 2. Create Registry YAML

Create the registries.yaml file at /etc/rancher/rke2/registries.yaml. This will tell RKE2 the necessary details to connect to your private registry.

The registries.yaml file should look like this before plugging in the necessary information:

---
mirrors:
  customreg:
    endpoint:
      - "https://ip-to-server:5000"
configs:
  customreg:
    auth:
      username: xxxxxx # this is the registry username
      password: xxxxxx # this is the registry password
    tls:
      cert_file: <path to the cert file used in the registry>
      key_file:  <path to the key file used in the registry>
      ca_file: <path to the ca file used in the registry>

For more information on private registries configuration file for RKE2, refer to the RKE2 documentation.

== 3. Install RKE2

Rancher needs to be installed on a supported Kubernetes version. To find out which versions of Kubernetes are supported for your Rancher version, refer to the SUSE Rancher Support Matrix.

Download the install script, rke2, rke2-images, and sha256sum archives from the release and upload them into a directory on each server:

mkdir /tmp/rke2-artifacts && cd /tmp/rke2-artifacts/
wget https://github.com/rancher/rke2/releases/download/v1.21.5%2Brke2r2/rke2-images.linux-amd64.tar.zst
wget https://github.com/rancher/rke2/releases/download/v1.21.5%2Brke2r2/rke2.linux-amd64.tar.gz
wget https://github.com/rancher/rke2/releases/download/v1.21.5%2Brke2r2/sha256sum-amd64.txt
curl -sfL https://get.rke2.io --output install.sh

Next, run install.sh using the directory on each server, as in the example below:

INSTALL_RKE2_ARTIFACT_PATH=/tmp/rke2-artifacts sh install.sh

Then enable and start the service on all servers:

` systemctl enable rke2-server.service systemctl start rke2-server.service `

For more information, refer to the RKE2 documentation.

== 4. Save and Start Using the kubeconfig File

When you installed RKE2 on each Rancher server node, a kubeconfig file was created on the node at /etc/rancher/rke2/rke2.yaml. This file contains credentials for full access to the cluster, and you should save this file in a secure location.

To use this kubeconfig file,

Install kubectl, a Kubernetes command-line tool.
Copy the file at /etc/rancher/rke2/rke2.yaml and save it to the directory ~/.kube/config on your local machine.
In the kubeconfig file, the server directive is defined as localhost. Configure the server as the DNS of your load balancer, referring to port 6443. (The Kubernetes API server will be reached at port 6443, while the Rancher server will be reached at ports 80 and 443.) Here is an example rke2.yaml:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [CERTIFICATE-DATA]
    server: [LOAD-BALANCER-DNS]:6443 # Edit this line
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    password: [PASSWORD]
    username: admin

Result: You can now use kubectl to manage your RKE2 cluster. If you have more than one kubeconfig file, you can specify which one you want to use by passing in the path to the file when using kubectl:

kubectl --kubeconfig ~/.kube/config/rke2.yaml get pods --all-namespaces

For more information about the kubeconfig file, refer to the RKE2 documentation or the official Kubernetes documentation about organizing cluster access using kubeconfig files.

== Note on Upgrading

Upgrading an air-gap environment can be accomplished in the following manner:

Download the new air-gap artifacts and install script from the releases page for the version of RKE2 you will be upgrading to.
Run the script again just as you had done in the past with the same environment variables.
Restart the RKE2 service.

4.1.5 Install SUSE Rancher Prime #

This section is about how to deploy SUSE Rancher Prime for your air gapped environment in a high-availability Kubernetes installation. An air gapped environment could be where the SUSE Rancher Prime server will be installed offline, behind a firewall, or behind a proxy.

4.1.5.1 Privileged Access for SUSE Rancher Prime #

When the SUSE Rancher Prime server is deployed in the Docker container, a local Kubernetes cluster is installed within the container for SUSE Rancher Prime to use. Because many features of SUSE Rancher Prime run as deployments, and privileged mode is required to run containers within containers, you will need to install SUSE Rancher Prime with the --privileged option.

4.1.5.2 Kubernetes Instructions #

SUSE Rancher Prime recommends installing SUSE Rancher Prime on a Kubernetes cluster. A highly available Kubernetes install is comprised of three nodes running the SUSE Rancher Prime server components on a Kubernetes cluster. The persistence layer (etcd) is also replicated on these three nodes, providing redundancy and data duplication in case one of the nodes fails.

4.1.5.2.1 1. Add the Helm Chart Repository #

From a system that has access to the internet, fetch the latest Helm chart and copy the resulting manifests to a system that has access to the SUSE Rancher Prime server cluster.

If you haven’t already, install helm locally on a workstation that has internet access. Note: Refer to the Helm version requirements to choose a version of Helm to install SUSE Rancher Prime.
Use the helm repo add command to add the Helm chart repository that contains charts to install SUSE Rancher Prime.
```
helm repo add rancher-prime <helm-chart-repo-url>
```
Fetch the SUSE Rancher Prime chart. This will pull down the chart and save it in the current directory as a .tgz file.
- To fetch the latest version:
```
helm fetch rancher-prime/rancher
```
- To fetch a specific version:
  Check to see which version of SUSE Rancher Prime are available.
```
helm search repo --versions rancher-prime
```
  Fetch a specific version by specifying the --version parameter:
```
helm fetch rancher-prime/rancher --version=<version>
```

4.1.5.2.2 2. Choose your SSL Configuration #

The SUSE Rancher Prime server is designed to be secure by default and requires SSL/TLS configuration.

When SUSE Rancher Prime is installed on an air gapped Kubernetes cluster, there are two recommended options for the source of the certificate.

Note

If you want terminate SSL/TLS externally, see TLS termination on an External Load Balancer.

Configuration	Chart option	Description	Requires cert-manager
Rancher Generated Self-Signed Certificates	`ingress.tls.source=rancher`	Use certificates issued by Rancher’s generated CA (self signed) This is the default and does not need to be added when rendering the Helm template.	yes
Certificates from Files	`ingress.tls.source=secret`	Use your own certificate files by creating Kubernetes Secret(s). This option must be passed when rendering the Rancher Helm template.	no

4.1.5.2.3 Helm Chart Options for Air Gap Installations #

When setting up the Rancher Helm template, there are several options in the Helm chart that are designed specifically for air gap installations.

Chart Option	Chart Value	Description
`certmanager.version`	`<version>`	Configure proper SUSE Rancher Prime TLS issuer depending of running cert-manager version.
`systemDefaultRegistry`	`<REGISTRY.YOURDOMAIN.COM:PORT>`	Configure SUSE Rancher Prime server to always pull from your private registry when provisioning clusters.
`useBundledSystemChart`	`true`	Configure SUSE Rancher Prime server to use the packaged copy of Helm system charts. The system charts repository contains all the catalog items required for features such as monitoring, logging, alerting and global DNS. These Helm charts are located in GitHub, but since you are in an air gapped environment, using the charts that are bundled within SUSE Rancher Prime is much easier than setting up a Git mirror.

4.1.5.2.4 3. Fetch the Cert-Manager Chart #

Based on the choice your made in 2. Choose your SSL Configuration (Section 4.1.5.2.2, “2. Choose your SSL Configuration”), complete one of the procedures below.

4.1.5.2.4.1 Option A: Default Self-Signed Certificate #

By default, SUSE Rancher Prime generates a CA and uses cert-manager to issue the certificate for access to the SUSE Rancher Prime server interface.

Note

Recent changes to cert-manager require an upgrade. If you are upgrading SUSE Rancher Prime and using a version of cert-manager older than v0.11.0, please see our upgrade cert-manager documentation.

4.1.5.2.4.1.1 1. Add the cert-manager Repo #

From a system connected to the internet, add the cert-manager repo to Helm:

helm repo add jetstack https://charts.jetstack.io
helm repo update

4.1.5.2.4.1.2 2. Fetch the cert-manager Chart #

Fetch the latest cert-manager chart available from the Helm chart repository.

helm fetch jetstack/cert-manager --version v1.11.0

4.1.5.2.4.1.3 3. Retrieve the cert-manager CRDs #

Download the required CRD file for cert-manager:

   curl -L -o cert-manager-crd.yaml https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.crds.yaml

4.1.5.2.5 4. Install SUSE Rancher Prime #

Copy the fetched charts to a system that has access to the SUSE Rancher Prime server cluster to complete installation.

4.1.5.2.5.1 1. Install cert-manager #

Install cert-manager with the same options you would use to install the chart. Remember to set the image.repository option to pull the image from your private registry.

Note

To see options on how to customize the cert-manager install (including for cases where your cluster uses PodSecurityPolicies), see the cert-manager docs.

Example 4.1: Click to expand #

If you are using self-signed certificates, install cert-manager:

Create the namespace for cert-manager.
```
 kubectl create namespace cert-manager
```
Create the cert-manager CustomResourceDefinitions (CRDs).
```
 kubectl apply -f cert-manager-crd.yaml
```

Install cert-manager.

 helm install cert-manager ./cert-manager-v1.11.0.tgz \
     --namespace cert-manager \
     --set image.repository=<REGISTRY.YOURDOMAIN.COM:PORT>/quay.io/jetstack/cert-manager-controller \
     --set webhook.image.repository=<REGISTRY.YOURDOMAIN.COM:PORT>/quay.io/jetstack/cert-manager-webhook \
     --set cainjector.image.repository=<REGISTRY.YOURDOMAIN.COM:PORT>/quay.io/jetstack/cert-manager-cainjector \
     --set startupapicheck.image.repository=<REGISTRY.YOURDOMAIN.COM:PORT>/quay.io/jetstack/cert-manager-ctl

4.1.5.2.5.2 2. Install Rancher #

First, refer to Adding TLS Secrets to publish the certificate files so SUSE Rancher Prime and the ingress controller can use them.

Then, create the namespace for SUSE Rancher Prime using kubectl:

kubectl create namespace cattle-system

Next, install SUSE Rancher Prime, declaring your chosen options. Use the reference table below to replace each placeholder. SUSE Rancher Prime needs to be configured to use the private registry in order to provision any SUSE Rancher Prime launched Kubernetes clusters or SUSE Rancher Prime tools.

Placeholder	Description
`<VERSION>`	The version number of the output tarball.
`<RANCHER.YOURDOMAIN.COM>`	The DNS name you pointed at your load balancer.
`<REGISTRY.YOURDOMAIN.COM:PORT>`	The DNS name for your private registry.
`<CERTMANAGER_VERSION>`	Cert-manager version running on k8s cluster.

   helm install rancher ./rancher-<VERSION>.tgz \
    --namespace cattle-system \
    --set hostname=<RANCHER.YOURDOMAIN.COM> \
    --set certmanager.version=<CERTMANAGER_VERSION> \
    --set rancherImage=<REGISTRY.YOURDOMAIN.COM:PORT>/rancher/rancher \
    --set systemDefaultRegistry=<REGISTRY.YOURDOMAIN.COM:PORT> \ # Set a default private registry to be used in Rancher
    --set useBundledSystemChart=true # Use the packaged Rancher system charts

Optional: To install a specific SUSE Rancher Prime version, set the rancherImageTag value, example: --set rancherImageTag=v2.5.8

4.1.5.2.5.3 Option B: Certificates From Files Using Kubernetes Secrets #

4.1.5.2.5.3.1 1. Create Secrets #

Create Kubernetes secrets from your own certificates for SUSE Rancher Prime to use. The common name for the cert will need to match the host name option in the command below, or the ingress controller will fail to provision the site for SUSE Rancher Prime.

4.1.5.2.5.3.2 2. Install Rancher #

Install SUSE Rancher Prime, declaring your chosen options. Use the reference table below to replace each placeholder. SUSE Rancher Prime needs to be configured to use the private registry in order to provision any SUSE Rancher Prime launched Kubernetes clusters or SUSE Rancher Prime tools.

Placeholder	Description
`<VERSION>`	The version number of the output tarball.
`<RANCHER.YOURDOMAIN.COM>`	The DNS name you pointed at your load balancer.
`<REGISTRY.YOURDOMAIN.COM:PORT>`	The DNS name for your private registry.

   helm install rancher ./rancher-<VERSION>.tgz \
    --namespace cattle-system \
    --set hostname=<RANCHER.YOURDOMAIN.COM> \
    --set rancherImage=<REGISTRY.YOURDOMAIN.COM:PORT>/rancher/rancher \
    --set ingress.tls.source=secret \
    --set systemDefaultRegistry=<REGISTRY.YOURDOMAIN.COM:PORT> \ # Set a default private registry to be used in Rancher
    --set useBundledSystemChart=true # Use the packaged Rancher system charts

If you are using a Private CA signed cert, add --set privateCA=true following --set ingress.tls.source=secret:

   helm install rancher ./rancher-<VERSION>.tgz \
    --namespace cattle-system \
    --set hostname=<RANCHER.YOURDOMAIN.COM> \
    --set rancherImage=<REGISTRY.YOURDOMAIN.COM:PORT>/rancher/rancher \
    --set ingress.tls.source=secret \
    --set privateCA=true \
    --set systemDefaultRegistry=<REGISTRY.YOURDOMAIN.COM:PORT> \ # Set a default private registry to be used in Rancher
    --set useBundledSystemChart=true # Use the packaged Rancher system charts

The installation is complete.

4.1.5.3 Additional Resources #

These resources could be helpful when installing SUSE Rancher Prime:

4.2 Installing the NVIDIA GPU Operator on the SUSE Rancher Prime: RKE2 cluster #

The NVIDIA operator allows administrators of Kubernetes clusters to manage GPUs just like CPUs. It includes everything needed for pods to be able to operate GPUs.

4.2.1 Host OS requirements #

To expose the GPU to the pod correctly, the NVIDIA kernel drivers and the libnvidia-ml library must be correctly installed in the host OS. The NVIDIA Operator can automatically install drivers and libraries on some operating systems. Refer to the NVIDIA documentation for information on supported operating system releases. The installation of the NVIDIA components on your host OS is out of the scope of this documentation, refer to the NVIDIA documentation for instructions.

The following three commands should return a correct output if the kernel driver was correctly installed:

lsmod | grep nvidia returns a list of nvidia kernel modules. For example:

nvidia_uvm           2129920  0
nvidia_drm            131072  0
nvidia_modeset       1572864  1 nvidia_drm
video                  77824  1 nvidia_modeset
nvidia               9965568  2 nvidia_uvm,nvidia_modeset
ecc                    45056  1 nvidia

cat /proc/driver/nvidia/version returns the NVRM and GCC version of the driver. For example:

NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  555.42.06  Release Build  (abuild@host)  Thu Jul 11 12:00:00 UTC 2024
GCC version:  gcc version 7.5.0 (SUSE Linux)

find /usr/ -iname libnvidia-ml.so returns a path to the libnvidia-ml.so library. For example:
```
/usr/lib64/libnvidia-ml.so
```
This library is used by Kubernetes components to interact with the kernel driver.

4.2.2 Operator installation #

Once the OS is ready and RKE2 is running, install the GPU Operator with the following yaml manifest. Use the kubectl apply -f gpu-operator.yaml command to install the operator.

Caution

The NVIDIA operator restarts containerd with a hangup call which restarts RKE2.

Important

The envvars ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED, ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS and DEVICE_LIST_STRATEGY are required to properly isolate GPU resources as explained in Preventing unprivileged access to GPUs in Kubernetes.

Important

NVIDIA GPU Operator v25.10.x uses Container Device Interface (CDI) specification which simplifies operations. It is recommended that you enable CDI (the default) and the NRI plug-in on RKE2. With both features enabled, you no longer need to pass extra environment variables for security requirements or set runtimeClassName: nvidia in your pod specifications.

# gpu-operator.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  version: v25.3.4
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    driver:
      enabled: "false"
    toolkit:
      env:
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
    devicePlugin:
      env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts

After one minute approximately, you can perform the following checks to verify that everything is working as expected:

Assuming the drivers and libnvidia-ml.so library were previously installed, check if the operator detects them correctly:
```
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' | grep "nvidia.com/gpu.deploy.driver"
```
You should see the value pre-installed. If you see true, the drivers were not correctly installed. If the pre-requirements (Section 4.2.1, “Host OS requirements”) were correct, it is possible that you forgot to reboot the node after installing all packages.
You can also check other driver labels with:
```
kubectl get node $NODENAME -o jsonpath='{.metadata.labels}' |  grep "nvidia.com"
```
You should see labels specifying driver and GPU (e.g. nvidia.com/gpu.machine or nvidia.com/cuda.driver.major).
Check if the GPU was added by nvidia-device-plugin-daemonset as an allocatable resource in the node:
```
kubectl get node $NODENAME -o jsonpath='{.status.allocatable}'
```
You should see "nvidia.com/gpu": followed by the number of GPUs in the node.
Check that the container runtime binary was installed by the operator (in particular by the nvidia-container-toolkit-daemonset):
```
ls /usr/local/nvidia/toolkit/nvidia-container-runtime
```
Verify if containerd config was updated to include the NVIDIA container runtime:
```
grep nvidia /var/lib/rancher/rke2/agent/etc/containerd/config.toml
```

Run a pod to verify that the GPU resource can successfully be scheduled on a pod and the pod can detect it

apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1

Important: Version Gate

Available as of October 2024 releases: v1.28.15+rke2r1, v1.29.10+rke2r1, v1.30.6+rke2r1, v1.31.2+rke2r1.

RKE2 will now use PATH to find alternative container runtimes, in addition to checking the default paths used by the container runtime packages. In order to use this feature, you must modify the RKE2 service’s PATH environment variable to add the directories containing the container runtime binaries.

It’s recommended that you modify one of this two environment files:

/etc/default/rke2-server # or rke2-agent
/etc/sysconfig/rke2-server # or rke2-agent

This example will add the PATH in /etc/default/rke2-server:

Caution

PATH changes should be done with care to avoid placing untrusted binaries in the path of services that run as root.

echo PATH=$PATH >> /etc/default/rke2-server

4.3 Registering existing clusters #

In this section, you will learn how to register existing RKE2 clusters in SUSE Rancher Prime (Rancher).

The cluster registration feature replaced the feature to import clusters.

The control that Rancher has to manage a registered cluster depends on the type of cluster. For details, see Management Capabilities for Registered Clusters. (Section 4.3.3, “Management Capabilities for Registered Clusters”)

4.3.1 Prerequisites #

4.3.1.1 Kubernetes Node Roles #

Registered RKE Kubernetes clusters must have all three node roles - etcd, controlplane and worker. A cluster with only controlplane components cannot be registered in Rancher.

For more information on RKE node roles, see the best practices.

4.3.1.2 Permissions #

To register a cluster in Rancher, you must have cluster-admin privileges within that cluster. If you don’t, grant these privileges to your user by running:

kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole cluster-admin \
  --user [USER_ACCOUNT]

Since, by default, Google Kubernetes Engine (GKE) doesn’t grant the cluster-admin role, you must run these commands on GKE clusters before you can register them. To learn more about role-based access control for GKE, please see the official Google documentation.

4.3.1.3 Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE) #

To successfully import or provision EKS, AKS, and GKE clusters from Rancher, the cluster must have at least one managed node group.

AKS clusters can be imported only if local accounts are enabled. If a cluster is configured to use Microsoft Entra ID for authentication, then Rancher will not be able to import it and report an error.

EKS Anywhere clusters can be imported/registered into Rancher with an API address and credentials, as with any downstream cluster. EKS Anywhere clusters are treated as imported clusters and do not have full lifecycle support from Rancher.

GKE Autopilot clusters aren’t supported. See Compare GKE Autopilot and Standard for more information about the differences between GKE modes.

4.3.2 Registering a Cluster #

Click ☰ > Cluster Management.
On the Clusters page, Import Existing.
Choose the type of cluster.
Use Member Roles to configure user authorization for the cluster. Click Add Member to add users that can access the cluster. Use the Role drop-down to set permissions for each user.
If you are importing a generic Kubernetes cluster in Rancher, perform the following steps for setup:
1. Click Agent Environment Variables under Cluster Options to set environment variables for rancher cluster agent. The environment variables can be set using key value pairs. If rancher agent requires use of proxy to communicate with Rancher server, HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables can be set using agent environment variables.
2. Enable Project Network Isolation to ensure the cluster supports Kubernetes NetworkPolicy resources. Users can select the Project Network Isolation option under the Advanced Options dropdown to do so.
3. Configure the version management feature for imported RKE2 and K3s clusters (Section 4.3.4, “Configuring Version Management for SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters”).
Click Create.
The prerequisite for cluster-admin privileges is shown (see Prerequisites above), including an example command to fulfil the prerequisite.
Copy the kubectl command to your clipboard and run it on a node where kubeconfig is configured to point to the cluster you want to import. If you are unsure it is configured correctly, run kubectl get nodes to verify before running the command shown in Rancher.
If you are using self-signed certificates, you will receive the message certificate signed by unknown authority. To work around this validation, copy the command starting with curl displayed in Rancher to your clipboard. Then run the command on a node where kubeconfig is configured to point to the cluster you want to import.
When you finish running the command(s) on your node, click Done.

Caution

The NO_PROXY environment variable is not standardized, and the accepted format of the value can differ between applications. When configuring the NO_PROXY variable for Rancher, the value must adhere to the format expected by Golang.

Specifically, the value should be a comma-delimited string which only contains IP addresses, CIDR notation, domain names, or special DNS labels (e.g. *). For a full description of the expected value format, refer to the upstream Golang documentation.

Result:

Your cluster is registered and assigned a state of Pending. Rancher is deploying resources to manage your cluster.
You can access your cluster after its state is updated to Active.
Active clusters are assigned two Projects: Default (containing the namespace default) and System (containing the namespaces cattle-system, ingress-nginx, kube-public and kube-system, if present).

Note

You can not re-register a cluster that is currently active in a Rancher setup.

4.3.2.1 Configuring an Imported EKS, AKS or GKE Cluster with Terraform #

You should define only the minimum fields that Rancher requires when importing an EKS, AKS or GKE cluster with Terraform. This is important as Rancher will overwrite what was in the cluster configuration with any configuration that the user has provided.

Caution

Even a small difference between the current cluster and a user-provided configuration could have unexpected results.

The minimum configuration fields required by Rancher to import EKS clusters with Terraform using eks_config_v2 are as follows:

cloud_credential_id
name
region
imported (this field should always be set to true for imported clusters)

Example YAML configuration for imported EKS clusters:

resource "rancher2_cluster" "my-eks-to-import" {
  name        = "my-eks-to-import"
  description = "Terraform EKS Cluster"
  eks_config_v2 {
    cloud_credential_id = rancher2_cloud_credential.aws.id
    name                = var.aws_eks_name
    region              = var.aws_region
    imported            = true
  }
}

You can find additional examples for other cloud providers in the Rancher2 Terraform Provider documentation.

4.3.3 Management Capabilities for Registered Clusters #

The control that Rancher has to manage a registered cluster depends on the type of cluster.

Features for All Registered Clusters (Section 4.3.3.1, “Features for All Registered Clusters”)
Additional Features for Registered RKE2 and K3s Clusters (Section 4.3.3.2, “Additional Features for Registered SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters”)
Additional Features for Registered EKS, AKS and GKE Clusters (Section 4.3.3.3, “Additional Features for Registered EKS, AKS, and GKE Clusters”)

4.3.3.1 Features for All Registered Clusters #

After registering a cluster, the cluster owner can:

Manage cluster access through role-based access control
Enable monitoring, alerts and notifiers
Enable logging
Enable Istio
Manage projects and workloads

4.3.3.2 Additional Features for Registered SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters #

K3s is a lightweight, fully compliant Kubernetes distribution for edge installations.

RKE2 is Rancher’s next-generation Kubernetes distribution for datacenter and cloud installations.

When an RKE2 or K3s cluster is registered in Rancher, Rancher will recognize it. The Rancher UI will expose features available to all registered clusters (Section 4.3.3.1, “Features for All Registered Clusters”), along with the following options for editing and upgrading the cluster:

Enable or disable version management (Section 4.3.4, “Configuring Version Management for SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters”)
Upgrade the Kubernetes version when version management is enabled
Configure the upgrade strategy (Section 4.3.5, “Configuring SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Cluster Upgrades”) when version management is enabled
View a read-only version of the cluster’s configuration arguments and environment variables used to launch each node

4.3.3.3 Additional Features for Registered EKS, AKS, and GKE Clusters #

Rancher handles registered EKS, AKS, or GKE clusters similarly to clusters created in Rancher. However, Rancher doesn’t destroy registered clusters when you delete them through the Rancher UI.

When you create an EKS, AKS, or GKE cluster in Rancher, then delete it, Rancher destroys the cluster. When you delete a registered cluster through Rancher, the Rancher server disconnects from the cluster. The cluster remains live, although it’s no longer in Rancher. You can still access the deregistered cluster in the same way you did before you registered it.

See Cluster Management Capabilities by Cluster Type for more information about what features are available for managing registered clusters.

4.3.4 Configuring Version Management for SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters #

Warning

When version management is enabled for an imported cluster, upgrading it outside of Rancher may lead to unexpected consequences.

The version management feature for imported RKE2 and K3s clusters can be configured using one of the following options:

Global default (default): Inherits behavior from the global imported-cluster-version-management setting.
True: Enables version management, allowing users to control the Kubernetes version and upgrade strategy of the cluster through Rancher.
False: Disables version management, enabling users to manage the cluster’s Kubernetes version independently, outside of Rancher.

You can define the default behavior for newly created clusters or existing ones set to "Global default" by modifying the imported-cluster-version-management setting.

Changes to the global imported-cluster-version-management setting take effect during the cluster’s next reconciliation cycle.

Note

If version management is enabled for a cluster, Rancher will deploy the system-upgrade-controller app, along with the associated Plans and other required Kubernetes resources, to the cluster. If version management is disabled, Rancher will remove these components from the cluster.

4.3.5 Configuring SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Cluster Upgrades #

Tip

It is a Kubernetes best practice to back up the cluster before upgrading. When upgrading a high-availability K3s cluster with an external database, back up the database in whichever way is recommended by the relational database provider.

The concurrency is the maximum number of nodes that are permitted to be unavailable during an upgrade. If number of unavailable nodes is larger than the concurrency, the upgrade will fail. If an upgrade fails, you may need to repair or remove failed nodes before the upgrade can succeed.

Controlplane concurrency: The maximum number of server nodes to upgrade at a single time; also the maximum unavailable server nodes
Worker concurrency: The maximum number worker nodes to upgrade at the same time; also the maximum unavailable worker nodes

In the RKE2 and K3s documentation, controlplane nodes are called server nodes. These nodes run the Kubernetes master, which maintains the desired state of the cluster. By default, these controlplane nodes have the capability to have workloads scheduled to them by default.

Also in the RKE2 and K3s documentation, nodes with the worker role are called agent nodes. Any workloads or pods that are deployed in the cluster can be scheduled to these nodes by default.

4.3.6 Debug Logging and Troubleshooting for Registered SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters #

Nodes are upgraded by the system upgrade controller running in the downstream cluster. Based on the cluster configuration, Rancher deploys two plans to upgrade nodes: one for controlplane nodes and one for workers. The system upgrade controller follows the plans and upgrades the nodes.

To enable debug logging on the system upgrade controller deployment, edit the configmap to set the debug environment variable to true. Then restart the system-upgrade-controller pod.

Logs created by the system-upgrade-controller can be viewed by running this command:

kubectl logs -n cattle-system system-upgrade-controller

The current status of the plans can be viewed with this command:

kubectl get plans -A -o yaml

Tip

If the cluster becomes stuck in upgrading, restart the system-upgrade-controller.

To prevent issues when upgrading, the Kubernetes upgrade best practices should be followed.

4.3.7 Authorized Cluster Endpoint Support for SUSE Rancher Prime: RKE2 and SUSE Rancher Prime: K3s Clusters #

Rancher supports Authorized Cluster Endpoints (ACE) for registered RKE2 and K3s clusters. This support includes manual steps you will perform on the downstream cluster to enable the ACE. For additional information on the authorized cluster endpoint, click here.

Note: Notes:

These steps only need to be performed on the control plane nodes of the downstream cluster. You must configure each control plane node individually.
The following steps will work on both RKE2 and K3s clusters registered in v2.6.x as well as those registered (or imported) from a previous version of Rancher with an upgrade to v2.6.x.
These steps will alter the configuration of the downstream RKE2 and K3s clusters and deploy the kube-api-authn-webhook. If a future implementation of the ACE requires an update to the kube-api-authn-webhook, then this would also have to be done manually. For more information on this webhook, click here.

4.3.7.1 Manual steps to be taken on the control plane of each downstream cluster to enable ACE: #

Create a file at /var/lib/rancher/{rke2,k3s}/kube-api-authn-webhook.yaml with the following contents:

 apiVersion: v1
 kind: Config
 clusters:
   name: Default
cluster:
  insecure-skip-tls-verify: true
  server: http://127.0.0.1:6440/v1/authenticate
 users:
   name: Default
user:
  insecure-skip-tls-verify: true
 current-context: webhook
 contexts:
   name: webhook
context:
  user: Default
  cluster: Default

Add the following to the configuration file (or create one if it doesn’t exist); note that the default location is /etc/rancher/{rke2,k3s}/config.yaml:
```
 kube-apiserver-arg:
     - authentication-token-webhook-config-file=/var/lib/rancher/{rke2,k3s}/kube-api-authn-webhook.yaml
```

Run the following commands:

sudo systemctl stop {rke2,k3s}-server
sudo systemctl start {rke2,k3s}-server

Finally, you must go back to the Rancher UI and edit the imported cluster there to complete the ACE enablement. Click on ⋮ > Edit Config, then click the Networking tab under Cluster Configuration. Finally, click the Enabled button for Authorized Endpoint. Once the ACE is enabled, you then have the option of entering a fully qualified domain name (FQDN) and certificate information.

Note

The FQDN field is optional, and if one is entered, it should point to the downstream cluster. Certificate information is only needed if there is a load balancer in front of the downstream cluster that is using an untrusted certificate. If you have a valid certificate, then nothing needs to be added to the CA Certificates field.

4.3.8 Annotating Registered Clusters #

For all types of registered Kubernetes clusters except for RKE2 and K3s Kubernetes clusters, Rancher doesn’t have any information about how the cluster is provisioned or configured.

Therefore, when Rancher registers a cluster, it assumes that several capabilities are disabled by default. Rancher assumes this in order to avoid exposing UI options to the user even when the capabilities are not enabled in the registered cluster.

However, if the cluster has a certain capability, a user of that cluster might still want to select the capability for the cluster in the Rancher UI. In order to do that, the user will need to manually indicate to Rancher that certain capabilities are enabled for the cluster.

By annotating a registered cluster, it is possible to indicate to Rancher that a cluster was given additional capabilities outside of Rancher.

The following annotation indicates Ingress capabilities. Note that the values of non-primitive objects need to be JSON encoded, with quotations escaped.

"capabilities.cattle.io/ingressCapabilities": "[
  {
    "customDefaultBackend":true,
    "ingressProvider":"asdf"
  }
]"

These capabilities can be annotated for the cluster:

ingressCapabilities
loadBalancerCapabilities
nodePoolScalingSupported
nodePortRange
taintSupport

All the capabilities and their type definitions can be viewed in the Rancher API view, at [Rancher Server URL]/v3/schemas/capabilities.

To annotate a registered cluster,

Click ☰ > Cluster Management.
On the Clusters page, go to the custom cluster you want to annotate and click ⋮ > Edit Config.
Expand the Labels & Annotations section.
Click Add Annotation.
Add an annotation to the cluster with the format capabilities/<capability>: <value> where value is the cluster capability that will be overridden by the annotation. In this scenario, Rancher is not aware of any capabilities of the cluster until you add the annotation.
Click Save.

Result: The annotation does not give the capabilities to the cluster, but it does indicate to Rancher that the cluster has those capabilities.

4.3.9 Troubleshooting #

This section lists some of the most common errors that may occur when importing a cluster and provides steps to troubleshoot them.

4.3.9.1 AKS #

The following error may occur if local accounts are disabled in your cluster:

Error: Getting static credential is not allowed because this cluster is set to disable local accounts.

To resolve this issue, enable local accounts before attempting to register a cluster:

az aks update --resource-group <resource-group> --name <cluster-name> --enable-local-accounts

4.4 Assigning GPU nodes to applications #

When deploying a containerized application to Kubernetes, you need to ensure that containers requiring GPU resources are run on appropriate worker nodes. For example, Ollama, a core component of SUSE AI, can deeply benefit from the use of GPU acceleration. This topic describes how to satisfy this requirement by explicitly requesting GPU resources and labeling worker nodes for configuring the node selector.

Requirements #

Kubernetes cluster—such as SUSE Rancher Prime: RKE2—must be available and configured with more than one worker node in which certain nodes have NVIDIA GPU resources and others do not.
This document assumes that any kind of deployment to the Kubernetes cluster is done using Helm charts.

4.4.1 Labeling GPU nodes #

To distinguish nodes with the GPU support from non-GPU nodes, Kubernetes uses labels. Labels are used for relevant metadata and should not be confused with annotations that provide simple information about a resource. It is possible to manipulate labels with the kubectl command, as well as by tweaking configuration files from the nodes. If an IaC tool such as Terraform is used, labels can be inserted in the node resource configuration files.

To label a single node, use the following command:

> kubectl label node <GPU_NODE_NAME> accelerator=nvidia-gpu

To achieve the same result by tweaking the node.yaml node configuration, add the following content and apply the changes with kubectl apply -f node.yaml:

apiVersion: v1
kind: Node
metadata:
  name: node-name
  labels:
    accelerator: nvidia-gpu

Tip: Labeling multiple nodes

To label multiple nodes, use the following command:

> kubectl label node \
  <GPU_NODE_NAME1> \
  <GPU_NODE_NAME2> ... \
  accelerator=nvidia-gpu

Tip

If Terraform is being used as an IaC tool, you can add labels to a group of nodes by editing the .tf files and adding the following values to a resource:

resource "node_group" "example" {
  labels = {
    "accelerator" = "nvidia-gpu"
  }
}

To check if the labels are correctly applied, use the following command:

> kubectl get nodes --show-labels

4.4.2 Assigning GPU nodes #

The matching between a container and a node is configured by the explicit resource allocation and the use of labels and node selectors. The use cases described below focus on NVIDIA GPUs.

4.4.2.1 Enable GPU passthrough #

Containers are isolated from the host environment by default. For the containers that rely on the allocation of GPU resources, their Helm charts must enable GPU passthrough so that the container can access and use the GPU resource. Without enabling the GPU passthrough, the container may still run, but it can only use the main CPU for all computations. Refer to Ollama Helm chart for an example of the configuration required for GPU acceleration.

4.4.2.2 Assignment by resource request #

After the NVIDIA GPU Operator is configured on a node, you can instantiate applications requesting the resource nvidia.com/gpu provided by the operator. Add the following content to your values.yaml file. Specify the number of GPUs according to your setup.

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

4.4.2.3 Assignment by labels and node selectors #

If affected cluster nodes are labeled with a label such as accelerator=nvidia-gpu, you can configure the node selector to check for the label. In this case, use the following values in your values.yaml file.

nodeSelector:
  accelerator: nvidia-gpu

4.4.3 Verifying Ollama GPU assignment #

If the GPU is correctly detected, the Ollama container logs this event:

| [...] source=routes.go:1172 msg="Listening on :11434 (version 0.0.0)"                                              │
│ [...] source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2502346830/runners                       │
│ [...] source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 cpu cpu_avx cpu_avx2]"                             │
│ [...] source=gpu.go:204 msg="looking for compatible GPUs"                                                          │
│ [...] source=types.go:105 msg="inference compute" id=GPU-c9ad37d0-d304-5d2a-c2e6-d3788cd733a7 library=cuda compute │

4.5 Installing SUSE Security #

This chapter describes how to install SUSE Security to scan SUSE AI nodes for vulnerabilities and improve data protection.

4.5.1 Installing SUSE Security in air-gapped environments #

Follow the information in this section to perform an air-gapped deployment of SUSE Security.

4.5.1.1 Required tools #

We need to install three tools for downloading all components of SUSE Security.

Helm - Application Lifecycle Manager
Skopeo - Image/Registry Tool
ZStandard - Compression Algorithm

Install Helm.

 > curl -fsSL \
  https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Install Skopeo and zstd.

 > sudo zypper update && zypper install zstd skopeo -y

4.5.1.2 Get images and a Helm chart #

To get all the needed images, use the chart itself. Using Helm, add the repo and download the chart. Use Skopeo for downloading and uploading.

Create a directory for the images.
```
> > mkdir -p neuvector/images
```

Add a Neuvector repository.

> helm repo add neuvector \
  https://neuvector.github.io/neuvector-helm/

Download the latest Neuvector chart.

> helm repo update \
  && helm pull neuvector/core -d neuvector

You should see a file named core-X.Y.Z.tgz. To obtain the list of required images, run the following command:

> helm template neuvector/core-*.tgz \
  | awk '$1 ~ /image:/ {print $2}' | sed -e 's/\"//g' \
  > neuvector/images/list.txt

Download the images based on the generated list.

for i in $(cat neuvector/images/list.txt); do \
  skopeo copy docker://$i docker-archive:neuvector/images/$(echo $i| awk -F/ '{print $3}'|sed 's/:/_/g').tar:$(echo $i| awk -F/ '{print $3}') \
done

Now you have a directory similar to the following one:

> ls -lR neuvector
neuvector:
total 16
-rw-r--r--. 1 root root 15892 Jan  8 14:33 core-2.4.0.tgz
drwxr-xr-x. 2 root root   153 Jan  8 14:35 images

neuvector/images:
total 953920
-rw-r--r--. 1 root root 236693504 Jan  8 14:35 controller_5.3.2.tar
-rw-r--r--. 1 root root 226704384 Jan  8 14:35 enforcer_5.3.2.tar
-rw-r--r--. 1 root root       176 Jan  8 14:34 list.txt
-rw-r--r--. 1 root root 331550208 Jan  8 14:35 manager_5.3.2.tar
-rw-r--r--. 1 root root 169589760 Jan  8 14:35 scanner_latest.tar
-rw-r--r--. 1 root root  12265472 Jan  8 14:35 updater_latest.tar

4.5.1.3 Compress and move the images to the local network #

Use tar with the ZST format for maximum compression level.

> tar -I zstd -vcf neuvector_airgap.zst neuvector

Move the created neuvector_airgap.zst archive to the isolated local network.

4.5.1.4 Uncompress and load the images #

Uncompress the images from the archive. The following example leaves them in the neuvector directory.

> tar -I zstd -vxf neuvector_airgap.zst

Loading the images into a local image registry requires you to understand your local network. This example uses registry.awesome.sauce as the DNS name. Loading the images is done with the skopeo command. Make sure that skopeo is installed on the air-gapped local machine. You may need to authenticate with skopeo login first.

>  export REGISTRY=registry.awesome.sauce
for file in $(ls neuvector/images | grep -v txt ); do
     skopeo copy docker-archive:neuvector/images/$file docker://$(echo $file | sed 's/.tar//g' | awk -F_ '{print "'$REGISTRY'/neuvector/"$1":"$2}')
done

With all the images loaded in a local image registry, you can install using Helm.

4.5.1.5 Install the transferred images on a local cluster #

To install the images from the local registry, you must override REGISTRY and NEU_URL variables. Also, adjust the imagePullSecrets option to include the secret for your cluster to authenticate to the registry.

> export REGISTRY=registry.awesome.sauce
> export NEU_URL=neuvector.awesome.sauce
> helm upgrade -i neuvector --namespace neuvector neuvector/core \
--create-namespace --set imagePullSecrets=regsecret --set k3s.enabled=true \
--set k3s.runtimePath=/run/k3s/containerd/containerd.sock \
--set manager.ingress.enabled=true --set controller.pvc.enabled=true \
--set controller.pvc.capacity=10Gi --set manager.svc.type=ClusterIP \
--set registry=$REGISTRY --set tag=5.3.2 \
--set controller.image.repository=neuvector/controller \
--set enforcer.image.repository=neuvector/enforcer \
--set manager.image.repository=neuvector/manager \
--set cve.updater.image.repository=neuvector/updater \
--set manager.ingress.host=$NEU_URL

4.6 Setting up SUSE Observability for SUSE AI #

SUSE Observability provides comprehensive monitoring and insights into your infrastructure and applications. It enables efficient tracking of metrics, logs and traces, helping you maintain optimal performance and troubleshoot issues effectively. This procedure guides you through setting up SUSE Observability for the SUSE AI environment using the SUSE AI Observability Extension.

4.6.1 Deployment scenarios #

You can deploy SUSE Observability and SUSE AI in two different ways:

Single-Cluster setup: Both SUSE AI and SUSE Observability are installed in the same Kubernetes cluster. This is a simpler approach ideal for testing and proof-of-concept deployments. Communication between components can use internal cluster DNS.
Multi-Cluster setup: SUSE AI and SUSE Observability are installed on separate, dedicated Kubernetes clusters. This setup is recommended for production environments because it isolates workloads. Communication requires exposing the SUSE Observability endpoints externally, for example, via an Ingress.

This section provides instructions for both scenarios.

4.6.2 Requirements #

To set up SUSE Observability for SUSE AI, you need to meet the following requirements:

Have access to SUSE Application Collection
Have a valid SUSE AI subscription
Have a valid license for SUSE Observability in SUSE Customer Center
Instrument your applications for telemetry data acquisition with OpenTelemetry.

For details on how to collect traces and metrics from SUSE AI components and user-developed applications, refer to Monitoring SUSE AI with OpenTelemetry and SUSE Observability. It includes configurations that are essential for full observability.

Important: SUSE Application Collection not instrumented by default

Applications from the SUSE Application Collection are not instrumented by default. If you want to monitor your AI applications, you need to follow the instrumentation guidelines that we provide in the document Monitoring SUSE AI with OpenTelemetry and SUSE Observability.

4.6.3 Setup process overview #

The following chart shows the high-level steps for the setup procedure. You will first set up the SUSE Observability cluster, then configure the SUSE AI cluster, and finally instrument your applications. Execute the steps in each column from left to right and top to bottom.

Blue steps are related to Helm chart installations.
Gray steps represent another type of interaction, such as coding.

The chart showing a high-level overview of the SUSE Observability setup

Figure 4.1: High-level overview of the SUSE Observability setup #

Tip: Setup clusters

You can create and configure Kubernetes clusters for SUSE AI and SUSE Observability as you prefer. If you are using SUSE Rancher Prime, check its documentation. For testing purposes, you can even share one cluster for both deployments. You can skip instructions on setting up a specific cluster if you already have one configured.

The diagram below shows the result of the above steps. There are two clusters represented, one for the SUSE Observability workload and another one for SUSE AI. You may use identical setup or customize it for your environment.

The chart showing setup of separate clusters for SUSE AI and SUSE Observability

Figure 4.2: Separate clusters for SUSE AI and SUSE Observability #

Points to notice #

You can install SUSE AI Observability Extension alongside SUSE Observability. It means that you can confidently use the internal Kubernetes DNS.
SUSE Observability contains several components and the following two of them need to be accessible by the AI Cluster:
- The Collector endpoint. Refer to Exposing SUSE Observability outside of the cluster or Self-hosted SUSE Observability for details about exposing it.
- The SUSE Observability API. Refer to Exposing SUSE Observability outside of the cluster for details about exposing it.
- Milvus metrics and traces can be scraped by the OpenTelemetry Collector with simple configurations, provided below. The same is true for GPU metrics.
- To get information from Open WebUI, Ollama or vLLM, you must have a specific instrumentation set. It can be an application instrumented with the OpenLIT SDK or other form of instrumentation following the same patterns.

Important

Remember that in multi-cluster setups, it is critical to properly expose your endpoints. Configure TLS, be careful with the configuration, and make sure to provide the right keys and tokens. More details are provided in the respective instructions.

4.6.4 Setting up the SUSE Observability cluster #

This initial step is identical for both single-cluster and multi-cluster deployments.

Install SUSE Observability. You can follow the official SUSE Observability air-gapped installation documentation for all installation instructions. Remember to expose your APIs and collector endpoints to your SUSE AI cluster.
Important: Multi-cluster setup
For multi-cluster setups, you must expose the SUSE Observability API and collector endpoints so that the SUSE AI cluster can reach them. Refer to the guide on exposing SUSE Observability outside of the cluster.

Install the SUSE Observability extension. Create a new Helm values file named genai_values.yaml. Before creating the file, review the placeholders below.

SUSE_OBSERVABILITY_API_URL

The URL of the SUSE Observability API. For multi-cluster deployments, this is the external URL. For single-cluster deployments, this can be the internal service URL. Example: https://suse-observability-api.your-domain.com

SUSE_OBSERVABILITY_API_KEY

The API key from the baseConfig_values.yaml file used during the SUSE Observability installation.

SUSE_OBSERVABILITY_API_TOKEN_TYPE

Can be api for a token from the Web UI or service for a Service Token (used for automation/CI).

SUSE_OBSERVABILITY_TOKEN

The API or Service token itself. Ignored if existingSecret is set.

TLS_CA_CERTIFICATE

The CA certificate content in PEM format (optional). Ignored if tls.existingSecret is set.

OBSERVED_CLUSTER_NAMES

A list of Kubernetes cluster names to install StackPacks for. Each cluster listed here will get a kubernetes-v2 StackPack instance provisioned in SUSE Observability. Example: ["suse-ai-cluster"].

Create the genai_values.yaml file with the following content:

global:
  imagePullSecrets:
  - application-collection 1
  imageRegistry: <LOCAL_DOCKER_REGISTRY_URL>:5043
serverUrl: <SUSE_OBSERVABILITY_API_URL>
tokenType: <SUSE_OBSERVABILITY_API_TOKEN_TYPE>
apiToken: <SUSE_OBSERVABILITY_TOKEN> 2
kubernetesClusters: 3
  - <OBSERVED_CLUSTER_NAME>
tls: 4
  enabled: false
  skipVerify: false
  existingSecret: ""
  certificates:
    ca: "<TLS_CA_CERTIFICATE>"

1	Instructs Helm to use credentials from the SUSE Application Collection. For instructions on how to configure the image pull secrets for the SUSE Application Collection, refer to the official documentation.
2	Alternatively, you can reference an existing Kubernetes secret by setting `existingSecret` and `existingSecretKey` instead of providing the token inline. This is the recommended approach for production deployments.
3	List of Kubernetes cluster names to install StackPacks for. Add one entry per cluster you want to observe.
4	Provides optional TLS configuration for secure communication. Set `tls.enabled` to `true` and provide a CA certificate via `tls.certificates.ca` or reference an existing secret with `tls.existingSecret`. Set `tls.skipVerify` to `true` only for development and testing environments.

Run the install command.
```
> helm upgrade --install ai-obs \
  charts/suse-ai-observability-extension-2.0.0.tgz \
  -f genai_values.yaml --namespace so-extensions --create-namespace
```
Note: TLS and self-signed certificates
The extension supports TLS communication. If you are using self-signed certificates or a custom CA, you can provide the certificate content in the tls.certificates section of the genai_values.yaml file. Alternatively, consider running the extension in the same cluster as SUSE Observability and then use the internal Kubernetes address.
Note: OpenTelemetry GenAI Semantic Conventions compatibility
The SUSE AI Observability Extension is compatible with the OpenTelemetry Semantic Conventions for Generative AI as defined in version v1.40.0. These conventions are currently in Development status within the OpenTelemetry specification.
Important: Upgrading from version 1.5.0 to 2.0.0
When upgrading the SUSE AI Observability Extension from version 1.5.0 to 2.0.0, you must also update the OpenTelemetry Collector and the SUSE AI Filter (suse_ai_filter.py) to their corresponding versions. These components are designed to work together, and running mismatched versions may result in missing telemetry data or broken topology views.
After the installation is complete, a new menu called GenAI is added to the Web interface and also a Kubernetes cron job is created that synchronizes the topology view with the components found in the SUSE AI cluster.

Verify SUSE Observability extension. After the installation, you can verify that a new lateral menu appears:
Figure 4.3: New GenAI Observability menu item #

4.6.5 Setting up the SUSE AI cluster #

Follow the instructions for your deployment scenario.

Single-cluster deployment: In this setup, the SUSE AI components are installed in the same cluster as SUSE Observability and can communicate using internal service DNS.
Multi-cluster deployment: In this setup, the SUSE AI cluster is separate. Communication relies on externally exposed endpoints of the SUSE Observability cluster.

The difference between deployment scenarios affects the OTEL Collector exporter configuration and the SUSE Observability Agent URL. Adjust the following values to match your SUSE Observability configuration.

SUSE_OBSERVABILITY_API_URL

The URL of the SUSE Observability API.

Single-cluster example: http://suse-observability-otel-collector.suse-observability.svc.cluster.local:4317

Multi-cluster example: https://suse-observability-api.your-domain.com

SUSE_OBSERVABILITY_COLLECTOR_ENDPOINT

The endpoint of the SUSE Observability Collector.

Single-cluster example: http://suse-observability-router.suse-observability.svc.cluster.local:8080/receiver/stsAgent

Multi-cluster example: https://suse-observability-router.your-domain.com/receiver/stsAgent

Install NVIDIA GPU Operator. Follow the instructions in https://documentation.suse.com/cloudnative/rke2/latest/en/advanced.html#_deploy_nvidia_operator.

Install OpenTelemetry collector. Create the namespace for the collector if it does not exist yet.

> kubectl create namespace observability

Create a secret with your SUSE Observability API key. Retrieve the API key using the Web UI or from the baseConfig_values.yaml file that you used during the SUSE Observability installation.

> kubectl create secret generic open-telemetry-collector \
  --namespace observability \
  --from-literal=API_KEY='<SUSE_OBSERVABILITY_API_KEY>'

Create the image pull secret for the SUSE Registry. The username is regcode and the password is the SCC registration code of your SUSE AI subscription.

> kubectl create secret docker-registry suse-ai-registry \
  --docker-server=registry.suse.com \
  --docker-username=regcode \
  --docker-password=<SCC_REG_CODE> \
  -n observability

Create a new file named otel-values.yaml with the following content.

global:
  imagePullSecrets:
  - suse-ai-registry
extraEnvs:
  - name: K8S_CLUSTER_NAME
    value: "<CLUSTER_NAME>" 1
  - name: SUSE_AI_NAMESPACE
    value: "<SUSE_AI_NAMESPACE>" 2
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
mode: deployment
clusterRole:
  create: true
  rules:
    - apiGroups: [""]
      resources: ["pods", "nodes", "endpoints", "services"]
      verbs: ["get", "list", "watch"]
image:
  registry: <LOCAL_DOCKER_REGISTRY_URL>:5043
  repository: suse-ai-opentelemetry-collector
  tag: 0.149.0
  pullPolicy: IfNotPresent
ports:
  metrics:
    enabled: true
presets:
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
config:
  receivers:
    elasticsearch:
      endpoint: "http://opensearch-cluster-master-headless.<OPENSEARCH_NAMESPACE>.svc.cluster.local:9200"
      # If you have basic auth enabled:
      username: "admin"
      password: "MySecurePass123"
      # If using self-signed certs or internal CAs:
      tls:
        insecure_skip_verify: true
      collection_interval: 10s
      # Optional: Enable detailed node metrics
      metrics:
        elasticsearch.node.fs.disk.total:
          enabled: true
        elasticsearch.node.fs.disk.available:
          enabled: true
    prometheus:
      config:
        scrape_configs:
          - job_name: 'gpu-metrics'
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - gpu-operator
          - job_name: 'milvus'
            scrape_interval: 15s
            metrics_path: '/metrics'
            static_configs:
              - targets: ['<MILVUS_SERVICE_NAME>.<SUSE_AI_NAMESPACE>.svc.cluster.local:9091'] 3
          - job_name: 'qdrant'
            scrape_interval: 10s
            metrics_path: '/metrics'
            static_configs:
              - targets: ['<QDRANT_SERVICE_NAME>.<SUSE_AI_NAMESPACE>.svc.cluster.local:6333'] 4
          - job_name: 'vllm'
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              - role: service
            relabel_configs:
              - source_labels: [__meta_kubernetes_namespace]
                action: keep
                regex: '<VLLM_NAMESPACE>' 5

              - source_labels: [__meta_kubernetes_service_name]
                action: keep
                regex: '.*<VLLM_RELEASE_NAME>.*' 6
  exporters:
    topology:
      endpoint: http://<SUSE_OBSERVABILITY_ROUTER_ENDPOINT>:8080 7
      api_key: ${env:API_KEY}
      instance_url: ${env:K8S_CLUSTER_NAME}
      namespace: ${env:SUSE_AI_NAMESPACE}
      tls: 8
        insecure_skip_verify: true
    otlp:
      endpoint: http://<OPEN_TELEMETRY_COLLECTOR_NAME>.suse-observability.svc.cluster.local:4317 9
      headers:
        Authorization: "SUSEObservability ${env:API_KEY}"
      tls:
        insecure: true
  processors:
    resource/elasticsearch:
      attributes:
        - key: suse.ai.managed
          value: "true"
          action: insert
        - key: suse.ai.component.name
          value: "opensearch"
          action: insert
        - key: suse.ai.component.type
          value: "search-engine"
          action: insert
        - key: service.name
          value: "opensearch"
          action: insert
        - key: service.namespace
          value: ${env:SUSE_AI_NAMESPACE}
          action: insert
        - key: k8s.namespace.name
          value: ${env:SUSE_AI_NAMESPACE}
          action: upsert
        - key: service.instance.id
          value: "opensearch-cluster"
          action: insert
    transform/vllm:
      error_mode: ignore
      metric_statements:
        - context: resource
          statements:
            - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "vllm"
            - set(attributes["suse.ai.component.name"], "vllm") where attributes["service.name"] == "vllm"
            - set(attributes["suse.ai.component.type"], "inference-engine") where attributes["service.name"] == "vllm"
            - set(attributes["service.instance.id"], "vllm") where attributes["service.name"] == "vllm"
            - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "vllm"
    transform/qdrant:
      error_mode: ignore
      metric_statements:
        - context: resource
          statements:
            - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "qdrant"
            - set(attributes["suse.ai.component.name"], "qdrant") where attributes["service.name"] == "qdrant"
            - set(attributes["suse.ai.component.type"], "vectordb") where attributes["service.name"] == "qdrant"
            - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "qdrant"
    transform/milvus:
      error_mode: ignore
      metric_statements:
        - context: resource
          statements:
            - set(attributes["suse.ai.managed"], "true") where attributes["service.name"] == "milvus"
            - set(attributes["suse.ai.component.name"], "milvus") where attributes["service.name"] == "milvus"
            - set(attributes["suse.ai.component.type"], "vectordb") where attributes["service.name"] == "milvus"
            - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["service.name"] == "milvus"
    tail_sampling:
      decision_wait: 10s
      policies:
      - name: rate-limited-composite
        type: composite
        composite:
          max_total_spans_per_second: 500
          policy_order: [errors, slow-traces, rest]
          composite_sub_policy:
          - name: errors
            type: status_code
            status_code:
              status_codes: [ ERROR ]
          - name: slow-traces
            type: latency
            latency:
              threshold_ms: 1000
          - name: rest
            type: always_sample
          rate_allocation:
          - policy: errors
            percent: 33
          - policy: slow-traces
            percent: 33
          - policy: rest
            percent: 34
    resource:
      attributes:
      - key: k8s.cluster.name
        action: insert
        value: ${env:K8S_CLUSTER_NAME}
      - key: service.instance.id
        from_attribute: k8s.pod.uid
        action: insert
    # Infer inference engines from GenAI client metrics.
    filter/genai-metrics-only:
      error_mode: ignore
      metrics:
        metric:
          - not(IsMatch(name, "gen_ai\\..*"))
    groupbyattrs/infer-providers:
      keys:
        - gen_ai.provider.name
    transform/infer-providers:
      error_mode: ignore
      metric_statements:
        - context: resource
          statements:
            - set(attributes["service.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil
            - set(attributes["service.instance.id"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil
            - set(attributes["suse.ai.managed"], "true") where attributes["gen_ai.provider.name"] != nil
            - set(attributes["suse.ai.component.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil
            - set(attributes["suse.ai.component.type"], "inference-engine") where attributes["gen_ai.provider.name"] != nil
            - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["gen_ai.provider.name"] != nil
    # Infer LLM models from GenAI client metrics.
    groupbyattrs/infer-models:
      keys:
        - gen_ai.request.model
        - gen_ai.provider.name
    transform/infer-models:
      error_mode: ignore
      metric_statements:
        - context: resource
          statements:
            - set(attributes["service.name"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil
            - set(attributes["service.instance.id"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil
            - set(attributes["suse.ai.managed"], "true") where attributes["gen_ai.request.model"] != nil
            - set(attributes["suse.ai.component.name"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil
            - set(attributes["suse.ai.component.type"], "llm-model") where attributes["gen_ai.request.model"] != nil
            - set(attributes["k8s.namespace.name"], "${env:SUSE_AI_NAMESPACE}") where attributes["gen_ai.request.model"] != nil
    # Exclude services that already have suse.ai.component.name set.
    filter/exclude-already-tagged:
      error_mode: ignore
      metrics:
        resource:
          - attributes["suse.ai.component.name"] != nil
    # Infer application components from GenAI client metrics.
    transform/infer-applications:
      error_mode: ignore
      metric_statements:
        - context: resource
          statements:
            - set(attributes["suse.ai.managed"], "true")
            - set(attributes["suse.ai.component.type"], "application")
            - set(attributes["suse.ai.component.name"], attributes["service.name"])
    # Create provider -> model relations from GenAI trace spans.
    filter/genai-spans:
      error_mode: ignore
      traces:
        span:
          - attributes["gen_ai.request.model"] == nil
    groupbyattrs/model-relations:
      keys:
        - gen_ai.provider.name
    transform/model-relations:
      error_mode: ignore
      trace_statements:
        - context: resource
          statements:
            - set(attributes["service.name"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil
            - set(attributes["service.instance.id"], attributes["gen_ai.provider.name"]) where attributes["gen_ai.provider.name"] != nil
        - context: span
          statements:
            - set(attributes["peer.service"], attributes["gen_ai.request.model"]) where attributes["gen_ai.request.model"] != nil
    # Create application -> provider relations from GenAI trace spans.
    transform/provider-relations:
      error_mode: ignore
      trace_statements:
        - context: span
          statements:
            - set(attributes["peer.service"], resource.attributes["gen_ai.provider.name"]) where resource.attributes["gen_ai.provider.name"] != nil
    filter/dropMissingK8sAttributes:
      error_mode: ignore
      traces:
        span:
          - resource.attributes["k8s.node.name"] == nil
          - resource.attributes["k8s.pod.uid"] == nil
          - resource.attributes["k8s.namespace.name"] == nil
          - resource.attributes["k8s.pod.name"] == nil
  connectors:
    spanmetrics:
      metrics_expiration: 5m
      namespace: otel_span
    routing/traces:
      error_mode: ignore
      table:
      - statement: route()
        pipelines: [traces/sampling, traces/spanmetrics, traces/model-relations, traces/provider-relations, traces/topology]
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers: [otlp, jaeger]
        processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp]
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [memory_limiter, transform/qdrant, transform/milvus, transform/vllm, resource, batch]
        exporters: [debug, otlp]
      # Infer inference engines from GenAI client metrics.
      metrics/infer-providers:
        receivers: [otlp]
        processors: [filter/genai-metrics-only, groupbyattrs/infer-providers, transform/infer-providers, resource, batch]
        exporters: [otlp]
      # Infer LLM models from GenAI client metrics.
      metrics/infer-models:
        receivers: [otlp]
        processors: [filter/genai-metrics-only, groupbyattrs/infer-models, transform/infer-models, resource, batch]
        exporters: [otlp]
      # Infer application components from GenAI client metrics.
      metrics/infer-applications:
        receivers: [otlp]
        processors: [filter/genai-metrics-only, filter/exclude-already-tagged, transform/infer-applications, resource, batch]
        exporters: [otlp]
      # Create provider -> model relations from trace spans.
      traces/model-relations:
        receivers: [routing/traces]
        processors: [filter/genai-spans, groupbyattrs/model-relations, transform/model-relations, batch]
        exporters: [otlp]
      # Create application -> provider relations from trace spans.
      traces/provider-relations:
        receivers: [routing/traces]
        processors: [filter/genai-spans, transform/provider-relations, batch]
        exporters: [otlp]
      # Push product topology to {sobservability}.
      traces/topology:
        receivers: [routing/traces]
        processors: [batch]
        exporters: [topology]
      metrics/elasticsearch:
        receivers: [elasticsearch]
        processors: [memory_limiter, resource/elasticsearch, resource, batch]
        exporters: [debug, otlp]

1	Replace `<CLUSTER_NAME>` with the cluster’s name.
2	Replace `<SUSE_AI_NAMESPACE>` with the namespace where SUSE AI components are installed.
3	Configure the Milvus service and namespace for the Prometheus scraper. Because Milvus will be installed in subsequent steps, you can return to this step and edit the endpoint if necessary.
4	Configure the Qdrant service and namespace for the Prometheus scraper.
5	Update to match the values in the vLLM deployment section.
6	Update to match the values in the vLLM deployment section.
7	Set the topology exporter to your exposed SUSE Observability router. For single-cluster deployments, use the internal service URL (for example, `suse-observability-router.suse-observability.svc.cluster.local`). For multi-cluster deployments, use the external URL.
8	Optional TLS configuration for the topology exporter. Set `insecure_skip_verify` to `true` for self-signed certificates.
9	Set the exporter to your exposed SUSE Observability collector. Remember that the value can be distinct, depending on the deployment pattern. For production usage, we recommend using TLS communication.

Finally, run the installation command.

> helm upgrade --install opentelemetry-collector \
  oci://dp.apps.rancher.io/charts/opentelemetry-collector \
  -f otel-values.yaml --namespace observability

Verify the installation by checking the existence of a new deployment and service in the observability namespace.

Install the SUSE Observability Agent.

> helm upgrade --install \
  --namespace suse-observability --create-namespace \
  --set-string 'stackstate.apiKey'='<YOUR_API_KEY>' \ 1
  --set-string 'stackstate.cluster.name'='<CLUSTER_NAME>' \ 2
  --set-string 'stackstate.url'='http://suse-observability-router.suse-observability.svc.cluster.local:8080/receiver/stsAgent' \ 3
  --set 'nodeAgent.skipKubeletTLSVerify'=true suse-observability-agent \
  suse-observability/suse-observability-agent

1	Retrieve the API key using the Web UI or from the `baseConfig_values.yaml` file that you used during the SUSE Observability installation.
2	Replace `<CLUSTER_NAME>` with the cluster’s name.
3	Replace with your SUSE Observability server URL.

Install SUSE AI. Refer to Chapter 5, Installing applications from AI Library for the complete procedure.

Warning: SUSE Observability version 2.6.2 and above

With SUSE Observability version 2.6.2, a change of the standard behavior broke the vLLM monitoring performed by the extension. To fix it, update otel-values.yaml to include the following additions. No changes are required for people using SUSE Observability version 2.6.1 and below.

Add a new processor.

config:
  processors:
    ... # same as before
    transform:
      metric_statements:
        - context: metric
          statements:
            - replace_pattern(name, "^vllm:", "vllm_")

Modify the metrics pipeline to perform the transformation defined above:

config:
  service:
    pipelines:
      ... # same as before
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [transform, memory_limiter, resource, batch]
        exporters: [debug, otlp]

4.6.6 Instrument applications #

Instrumentation is the act of configuring your applications for telemetry data acquisition. Our stack employs OpenTelemetry standards as a vendor-neutral and open base for our telemetry. For a comprehensive guide on how to set up your instrumentation, please refer to Monitoring SUSE AI with OpenTelemetry and SUSE Observability.

By following the instructions in the document referenced above, you will be able to retrieve all relevant telemetry data from Open WebUI, Ollama, Milvus and vLLM by simply applying specific configuration to their Helm chart values. You can find links for advanced use cases (auto-instrumentation with the OTEL Operator) at the end of the document.