documentation.suse.com / Deploying and Installing SUSE AI

Deploying and Installing SUSE AI

Publication Date: 10 Oct 2025
WHAT?

This document provides a comprehensive, step-by-step guide for deploying SUSE AI.

WHY?

To help users successfully complete the deployment process.

GOAL

To learn enough to deploy SUSE AI in both testing and production environments.

EFFORT

Less than one hour of reading, plus advanced knowledge of Linux deployment.

SUSE AI is a versatile product consisting of multiple software layers and components. This document outlines the complete workflow for deployment and installation of all SUSE AI dependencies, as well as SUSE AI itself. You can also find references to recommended hardware and software requirements, as well as steps to take after the product installation.

Tip
Tip: Hardware and software requirements

For hardware, software and application-specific requirements, refer to SUSE AI requirements.

1 Installation overview

The following chart illustrates the installation process of SUSE AI. It outlines the following possible scenarios:

  • You have clean cluster nodes prepared without a supported Linux operating system installed.

  • You have a supported Linux operating system and Kubernetes distribution installed on cluster nodes.

  • You have SUSE Rancher Prime and all supporting components installed on the Kubernetes cluster and are prepared to install the required applications from the AI Library.

SUSE AI installation process
Figure 1: SUSE AI installation process

2 Installing the Linux and Kubernetes distribution

This procedure includes the steps to install the base Linux operating system and a Kubernetes distribution for users who start deploying on cluster nodes from scratch. If you already have a Kubernetes cluster installed and running, you can skip this procedure and continue with Section 4.1, “Installation procedure”.

  1. Install and register a supported Linux operating system on each cluster node. We recommend using SUSE Linux Enterprise Server (SLES) or SUSE Linux Micro.

    For a list of supported operating systems, refer to https://www.suse.com/suse-rancher/support-matrix/all-supported-versions/.

  2. Install the NVIDIA GPU driver on cluster nodes with GPUs. Refer to Section 2.2, “Installing NVIDIA GPU drivers” for details.

  3. Install Kubernetes on cluster nodes. We recommend using the supported SUSE Rancher Prime: RKE2 distribution. Refer to SUSE Rancher Prime: RKE2 Installation for details. For a list of supported Kubernetes platforms, refer to https://www.suse.com/suse-rancher/support-matrix/all-supported-versions/.
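
    If you choose SUSE Rancher Prime: RKE2, the quick-start installation of a server (control-plane) node boils down to a few commands. The following is a minimal sketch based on the RKE2 quick start; refer to the linked RKE2 documentation for production options such as high availability and custom configuration.

    # curl -sfL https://get.rke2.io | sh -
    # systemctl enable --now rke2-server.service
    # export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
    # /var/lib/rancher/rke2/bin/kubectl get nodes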

2.1 Installing SUSE Linux Enterprise Server

Use the following procedures to install SLES on all supported hardware platforms. They assume you have successfully booted into the installation system. For more detailed installation instructions and deployment strategies, refer to SUSE Linux Enterprise Server Deployment Guide.

2.1.1 The Unified Installer

Starting with SLES 15, the installation medium consists only of the Unified Installer, a minimal system for installing, updating and registering all SUSE Linux Enterprise base products. During the installation, you can add functionality by selecting modules and extensions to be installed on top of the Unified Installer.

2.1.2 Installing offline or without registration

The default installation medium SLE-15-SP6-Online-ARCH-GM-media1.iso is optimized for size and does not contain any modules or extensions. Therefore, the installation requires network access to register your product and retrieve repository data for the modules and extensions.

For installation without registering the system, use the SLE-15-SP6-Full-ARCH-GM-media1.iso image from https://www.suse.com/download/sles/ and refer to Installing without registration.

Tip
Tip: Copying the installation media image to a removable flash disk

Use the following command to copy the contents of the installation image to a removable flash disk.

> sudo dd if=IMAGE of=FLASH_DISK bs=4M && sync

IMAGE needs to be replaced with the path to the SLE-15-SP6-Online-ARCH-GM-media1.iso or SLE-15-SP6-Full-ARCH-GM-media1.iso image file. FLASH_DISK needs to be replaced with the flash device. To identify the device, insert it and run:

# grep -Ff <(hwinfo --disk --short) <(hwinfo --usb --short)
disk:
  /dev/sdc             General USB Flash Disk

Make sure the size of the device is sufficient for the desired image. You can check the size of the device with:

# fdisk -l /dev/sdc | grep -e "^/dev"
     /dev/sdc1  *     2048 31490047 31488000  15G 83 Linux

In this example, the device has a capacity of 15 GB. The command to use for the SLE-15-SP6-Full-ARCH-GM-media1.iso would be:

> sudo dd if=SLE-15-SP6-Full-ARCH-GM-media1.iso of=/dev/sdc bs=4M && sync

The device must not be mounted when running the dd command. Note that all data on the device will be erased.

2.1.3 The installation procedure

To install SLES, boot or IPL into the installer from the Unified Installer medium and start the installation.

2.1.3.1 Language, keyboard and product selection
Language, keyboard and product selection screen
Figure 2: Language, keyboard and product selection

The Language and Keyboard Layout settings are initialized with the language you chose on the boot screen. If you do not change the default, it remains English (US). Change the settings here, if necessary. Use the Keyboard Test text box to test the layout.

Select SUSE Linux Enterprise Server 15 SP6 for installation. You need to have a registration code for the product. Proceed with Next.

Tip
Tip: Light and high-contrast themes

If you have difficulty reading the labels in the installer, you can change the widget colors and theme.

Click the Change the widget theme button or press Shift+F3 to open a theme selection dialog. Select a theme from the list and Close the dialog.

Shift+F4 switches to the color scheme for vision-impaired users. Press the keys again to switch back to the default scheme.

2.1.3.2 License agreement
SLES License Agreement screen
Figure 3: License agreement

Read the License Agreement. It is presented in the language you have chosen on the boot screen. Translations are available via the License Language drop-down list. You need to accept the agreement by checking I Agree to the License Terms to install SLES. Proceed with Next.

2.1.3.3 Network settings
Network Settings screen
Figure 4: Network settings

A system analysis is performed, where the installer probes for storage devices and tries to find other installed systems. If the network was automatically configured via DHCP during the start of the installation, you are presented with the registration step.

If the network is not yet configured, the Network Settings dialog opens. Choose a network interface from the list and configure it with Edit. Alternatively, Add an interface manually. See the sections on installer network settings and configuring a network connection with YaST for more information. If you prefer to do an installation without network access, skip this step without making any changes and proceed with Next.

2.1.3.4 Registration
Registration screen
Figure 5: Registration

To get technical support and product updates, you need to register and activate SLES with the SUSE Customer Center or a local registration server. Registering your product at this stage also grants you immediate access to the update repository. This enables you to install the system with the latest updates and patches available.

When registering, repositories and dependencies for modules and extensions are loaded from the registration server.

Register system at scc.suse.com

To register at the SUSE Customer Center, enter the E-mail Address associated with your SUSE Customer Center account and the Registration Code for SLES. Proceed with Next.

Register system via local RMT server

If your organization provides a local registration server, you may alternatively register to it. Activate Register System via local RMT Server and either choose a URL from the drop-down list or type in an address. Proceed with Next.

Skip registration

If you are offline or want to skip registration, activate Skip Registration. Accept the warning with OK and proceed with Next.

Important
Important: Skipping the registration

Your system and extensions need to be registered to retrieve updates and to be eligible for support. Skipping the registration is only possible when installing from the SLE-15-SP6-Full-ARCH-GM-media1.iso image.

If you do not register during the installation, you can do so at any time later from the running system. To do so, run YaST › Product Registration or the command-line tool SUSEConnect.
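
For example, you can register the installed system from the command line with SUSEConnect, replacing the placeholders with your registration code and the e-mail address associated with your SUSE Customer Center account:

# SUSEConnect --regcode REGISTRATION_CODE --email EMAIL_ADDRESS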

Tip
Tip: Installing product patches at installation time

After SLES has been successfully registered, you are asked whether to install the latest available online updates during the installation. If you choose Yes, the system is installed with the most current packages without having to apply the updates after installation. Activating this option is recommended.

Note
Note: Firewall settings for receiving updates

By default, the firewall on SLES only blocks incoming connections. If your system is behind another firewall that blocks outgoing traffic, make sure to allow connections to https://scc.suse.com/ and https://updates.suse.com on ports 80 and 443 to receive updates.

2.1.3.5 Extension and module selection
Extension and Module Selection screen
Figure 6: Extension and module selection

After the system is successfully registered, the installer lists modules and extensions that are available for SLES. Modules are components that allow you to customize the product according to your needs. They are included in your SLES subscription. Extensions add functionality to your product. They must be purchased separately.

The availability of certain modules or extensions depends on the product selected in the first step of the installation. For a description of the modules and their lifecycles, select a module to see the accompanying text. More detailed information is available in the Modules and Extensions Quick Start.

The selection of modules indirectly affects the scope of the installation, because it defines which software sources (repositories) are available for installation and in the running system.

The following modules and extensions are available for SUSE Linux Enterprise Server:

Basesystem Module

This module adds a basic system on top of the Unified Installer. It is required by all other modules and extensions. The scope of an installation that only contains the base system is comparable to the installation pattern minimal system of previous SLES versions. This module is selected for installation by default and should not be deselected.

Dependencies: None

Certifications Module

Contains the FIPS certification packages.

Dependencies: Server Applications

Confidential Computing Technical Preview

Contains packages related to confidential computing.

Dependencies: Basesystem

Containers Module

Contains support and tools for containers.

Dependencies: Basesystem

Desktop Applications Module

Adds a graphical user interface and essential desktop applications to the system.

Dependencies: Basesystem

Development Tools Module

Contains the compilers (including gcc) and libraries required for compiling and debugging applications. Replaces the former Software Development Kit (SDK).

Dependencies: Basesystem, Desktop Applications

Legacy Module

Helps you with migrating applications from earlier versions of SLES and other systems to SLES 15 SP6 by providing packages which are discontinued on SUSE Linux Enterprise. Packages in this module are selected based on the requirements for migration and the level of complexity of configuration.

This module is recommended when migrating from a previous product version.

Dependencies: Basesystem, Server Applications

NVIDIA Compute Module

Contains the NVIDIA CUDA (Compute Unified Device Architecture) drivers.

The software in this module is provided by NVIDIA under the CUDA End User License Agreement and is not supported by SUSE.

Dependencies: Basesystem

Public Cloud Module

Contains all tools required to create images for deploying SLES in cloud environments such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, or OpenStack.

Dependencies: Basesystem, Server Applications

Python 3 Module

This module contains the most recent versions of the selected Python 3 packages.

Dependencies: Basesystem

SAP Business One Server

This module contains packages and system configurations specific to SAP Business One Server. It is maintained and supported under the SUSE Linux Enterprise Server product subscription.

Dependencies: Basesystem, Server Applications, Desktop Applications, Development Tools

Server Applications Module

Adds server functionality by providing network services such as DHCP server, name server, or Web server.

Dependencies: Basesystem

SUSE Linux Enterprise High Availability

Adds clustering support for mission-critical setups to SLES. This extension requires a separate license key.

Dependencies: Basesystem, Server Applications

SUSE Linux Enterprise Live Patching

Adds support for performing critical patching without having to shut down the system. This extension requires a separate license key.

Dependencies: Basesystem, Server Applications

SUSE Linux Enterprise Workstation Extension

Extends the functionality of SLES with packages from SUSE Linux Enterprise Desktop, like additional desktop applications (office suite, e-mail client, graphical editor, etc.) and libraries. It allows combining both products to create a fully featured workstation. This extension requires a separate license key.

Dependencies: Basesystem, Desktop Applications

SUSE Package Hub

Provides access to packages for SLES maintained by the openSUSE community. These packages are delivered without L3 support and do not interfere with the supportability of SLES. For more information, refer to https://packagehub.suse.com/.

Dependencies: Basesystem

Transactional Server Module

Adds support for transactional updates. Updates are either applied to the system as a single transaction or not applied at all. This happens without influencing the running system. If an update fails, or if the successful update is deemed to be incompatible or otherwise incorrect, it can be discarded to immediately return the system to its previous functioning state.

Dependencies: Basesystem

Web and Scripting Module

Contains packages intended for a running Web server.

Dependencies: Basesystem, Server Applications

Certain modules depend on the installation of other modules. Therefore, when selecting a module, other modules may be selected automatically to fulfill dependencies.

Depending on the product, the registration server can mark modules and extensions as recommended. Recommended modules and extensions are preselected for registration and installation. To avoid installing these recommendations, deselect them manually.

Select the modules and extensions you want to install and proceed with Next. In case you have chosen one or more extensions, you will be prompted to provide the respective registration codes. Depending on your choice, it may also be necessary to accept additional license agreements.
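
If you need a module or extension later, you can also activate it from the registered, running system with SUSEConnect. A brief example follows; the product identifier is illustrative, so list the exact identifiers available for your system first.

# SUSEConnect --list-extensions
# SUSEConnect -p sle-module-containers/15.6/x86_64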

Important
Important: Default modules for offline installation

When performing an offline installation from the SLE-15-SP6-Full-ARCH-GM-media1.iso, only the Basesystem Module is selected by default. To install the complete default package set of SUSE Linux Enterprise Server, additionally select the Server Applications Module and the Python 3 Module.

2.1.3.6 Add-on product
Add On Product screen
Figure 7: Add-on product

The Add-On Product dialog allows you to add additional software sources (called repositories) to SLES that are not provided by the SUSE Customer Center. Add-on products may include third-party products and drivers as well as additional software for your system.

Tip
Tip: Adding drivers during the installation

You can also add driver update repositories via the Add-On Product dialog. Driver updates for SUSE Linux Enterprise are provided at https://drivers.suse.com/. These drivers have been created through the SUSE SolidDriver Program.

To skip this step, proceed with Next. Otherwise, activate I would like to install an additional Add On Product. Specify a media type, a local path, or a network resource hosting the repository and follow the on-screen instructions.

Check Download Repository Description Files to download the files describing the repository now. If deactivated, they will be downloaded after the installation has started. Proceed with Next and insert a medium if required. Depending on the content of the product, it may be necessary to accept additional license agreements. Proceed with Next. If you have chosen an add-on product requiring a registration key, you will be asked to enter it before proceeding to the next step.

2.1.3.7 System role
System Role screen
Figure 8: System role

The availability of system roles depends on your selection of modules and extensions. System roles define, for example, the set of software patterns that are preselected for the installation. Refer to the descriptions on the screen to make your choice. Select a role and proceed with Next. If, based on the enabled modules, only one role or no role is suitable for the respective base product, the System Role dialog is omitted.

Tip
Tip: Release notes

From this point on, the Release Notes can be viewed from any screen during the installation process by selecting Release Notes.

2.1.3.8 Suggested partitioning
Suggested Partitioning screen
Figure 9: Suggested partitioning

Review the partition setup proposed by the system. If necessary, change it. You have the following options:

Guided setup

Starts a wizard that lets you refine the partitioning proposal. The options available here depend on your system setup. If it contains more than a single hard disk, you can choose which disk or disks to use and where to place the root partition. If the disks already contain partitions, decide whether to remove or resize them.

In subsequent steps, you may also add LVM support and disk encryption. You can change the file system for the root partition and decide whether or not to have a separate home partition.

Expert partitioner

Opens the Expert Partitioner. This gives you full control over the partitioning setup and lets you create a custom setup. This option is intended for experts. For details, see the Expert Partitioner chapter.

Warning
Warning: Disk space units

For partitioning purposes, disk space is measured in binary units rather than in decimal units. For example, if you enter sizes of 1GB, 1GiB or 1G, they all signify 1 GiB (Gibibyte), as opposed to 1 GB (Gigabyte).

Binary

1 GiB = 1 073 741 824 bytes.

Decimal

1 GB = 1 000 000 000 bytes.

Difference

1 GiB ≈ 1.07 GB.

To accept the proposed setup without any changes, choose Next to proceed.

2.1.3.9 Clock and time zone
Clock and Time Zone screen
Figure 10: Clock and time zone

Select the clock and time zone to use in your system. To manually adjust the time or to configure an NTP server for time synchronization, choose Other Settings. See the section on Clock and Time Zone for detailed information. Proceed with Next.

2.1.3.10 Local user
Local User screen
Figure 11: Local user creation

To create a local user, type the first and last name in the User’s Full Name field, the login name in the Username field, and the password in the Password field.

The password should be at least eight characters long and should contain both uppercase and lowercase letters and numbers. The maximum length for passwords is 72 characters, and passwords are case-sensitive.

For security reasons, it is also strongly recommended not to enable Automatic Login. You should also not Use this Password for the System Administrator but provide a separate root password in the next installation step.

If you install on a system where a previous Linux installation was found, you may Import User Data from a Previous Installation. Click Choose User for a list of available user accounts. Select one or more users.

In an environment where users are centrally managed (for example, by NIS or LDAP), you can skip the creation of local users. Select Skip User Creation in this case.

Proceed with Next.

2.1.3.11 Authentication for the system administrator root
Authentication for the system administrator “root” screen
Figure 12: Password for the system administrator root

Type a password for the system administrator (called the root user) or provide a public SSH key. If you want, you can use both.

Because the root user is equipped with extensive permissions, the password should be chosen carefully. You should never forget the root password. After you have entered it here, the password cannot be retrieved.

Tip
Tip: Passwords and keyboard layout

It is recommended to use only US ASCII characters. In the event of a system error or when you need to start your system in rescue mode, the keyboard may not be localized.

To access the system remotely via SSH using a public key, import a key from removable media or an existing partition. See the section on Authentication for the system administrator root for more information.

Proceed with Next.

2.1.3.12 Installation settings
Installation Settings screen
Figure 13: Installation settings

Use the Installation Settings screen to review and—if necessary—change several proposed installation settings. The current configuration is listed for each setting. To change it, click the headline. Certain settings, such as firewall or SSH, can be changed directly by clicking the respective links.

Important
Important: Remote access

Any changes you make here can also be made later at any time from the installed system. However, if you need remote access right after the installation, you may need to open the SSH port in the Security settings.

Software

The scope of the installation is defined by the modules and extensions you have chosen for this installation. However, depending on your selection, not all packages available in a module are selected for installation.

Clicking Software opens the Software Selection and System Tasks screen, where you can change the software selection by selecting or deselecting patterns. Each pattern contains several software packages needed for specific functions (for example, KVM Host Server). For a more detailed selection based on software packages to install, select Details to switch to the YaST Software Manager. See Installing or removing software for more information.
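
Patterns can also be listed and installed later from the running system with zypper. For example, the following commands list the available patterns and install one of them (the pattern name is illustrative):

# zypper search -t pattern
# zypper install -t pattern kvm_server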

Booting

This section shows the boot loader configuration. Changing the defaults is recommended only if really needed. Refer to The boot loader GRUB 2 for details.

Security

The CPU Mitigations refer to kernel boot command-line parameters for software mitigations that have been deployed to prevent CPU side-channel attacks. Click the selected entry to choose a different option. For details, see the section on CPU Mitigations.

By default, the Firewall is enabled on all configured network interfaces. To disable firewalld, click disable (not recommended). Refer to the Masquerading and Firewalls chapter for configuration details.

Note
Note: Firewall settings for receiving updates

By default, the firewall on SLES only blocks incoming connections. If your system is behind another firewall that blocks outgoing traffic, make sure to allow connections to https://scc.suse.com/ and https://updates.suse.com on ports 80 and 443 to receive updates.

The SSH service is enabled by default, but its port (22) is closed in the firewall. Click open to open the port or disable to disable the service. If SSH is disabled, remote logins will not be possible. Refer to Securing network operations with OpenSSH for more information.
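
If you keep the port closed now, you can open it later from the installed system, for example:

# firewall-cmd --permanent --add-service=ssh
# firewall-cmd --reload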

The default Major Linux Security Module is AppArmor. To disable it, select None as the module in the Security settings.

Security Policies

Click to enable the Defense Information Systems Agency STIG security policy. If any installation settings are incompatible with the policy, you will be prompted to modify them accordingly. Certain settings can be adjusted automatically while others require user input.

Enabling a security profile enables a full SCAP remediation on first boot. You can also perform a scan only or do nothing and manually remediate the system later with OpenSCAP. For more information, refer to the section on Security Profiles.

Network configuration

Displays the current network configuration. By default, wicked is used for server installations and NetworkManager for desktop workloads. Click Network Configuration to change the settings. For details, see the section on Configuring a network connection with YaST.

Important
Important: Support for NetworkManager

SUSE only supports NetworkManager for desktop workloads with SLED or the Workstation extension. All server certifications are done with wicked as the network configuration tool, and using NetworkManager may invalidate them. NetworkManager is not supported by SUSE for server workloads.

Kdump

Kdump saves the memory image (core dump) to the file system in case the kernel crashes. This enables you to find the cause of the crash by debugging the dump file. Kdump is preconfigured and enabled by default. See the Basic Kdump configuration for more information.

Default systemd target

If you have installed the desktop applications module, the system boots into the graphical target, with network, multi-user and display manager support. Switch to multi-user if you do not need to log in via a display manager.
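
The default target can also be changed later from the running system, for example:

# systemctl set-default multi-user.target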

System

View detailed hardware information by clicking System. In the resulting screen, you can also change Kernel Settings—see the section on System Information for more information.

2.1.3.13 Start the installation
Installation Settings screen with Confirm Installation dialog
Figure 14: Confirm installation

After you have finalized the system configuration on the Installation Settings screen, click Install. Depending on your software selection, you may need to agree to license agreements before the installation confirmation screen pops up. Up to this point, no changes have been made to your system. After you click Install a second time, the installation process starts.

2.1.3.14 The installation process
Performing Installation screen
Figure 15: Performing the installation

During the installation, the progress is shown. After the installation routine has finished, the computer is rebooted into the installed system.

2.2 Installing NVIDIA GPU drivers

This section demonstrates how to implement host-level NVIDIA GPU support via the open-driver. The open-driver is part of the core package repositories. Therefore, there is no need to compile it or download executable packages. This driver is built into the operating system rather than dynamically loaded by the NVIDIA GPU Operator. This configuration is desirable for customers who want to pre-build all artifacts required for deployment into the image, and where the dynamic selection of the driver version via Kubernetes is not a requirement.

2.2.1 Installing NVIDIA GPU drivers on SUSE Linux Enterprise Server

2.2.1.1 Requirements

This guide assumes that you have the following already available:

  • At least one host with SLES 15 SP6 installed, physical or virtual.

  • Your hosts are attached to a subscription as this is required for package access.

  • A compatible NVIDIA GPU installed or fully passed through to the virtual machine in which SLES is running.

  • Access to the root user—these instructions assume you are the root user, and not escalating your privileges via sudo.

2.2.1.2 Considerations before the installation
2.2.1.2.1 Select the driver generation

You must verify the driver generation for the NVIDIA GPU that your system has. For modern GPUs, the G06 driver is the most common choice. Find more details in the support database.

This section details the installation of the G06 generation of the driver.
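
If you are unsure which NVIDIA GPU is installed in your system, you can identify it before choosing the driver generation, for example:

# lspci | grep -i nvidia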

2.2.1.2.2 Additional NVIDIA components

Besides the NVIDIA open-driver provided by SUSE as part of SLES, you might also need additional NVIDIA components. These could include OpenGL libraries, CUDA toolkits, command-line utilities such as nvidia-smi, and container-integration components such as nvidia-container-toolkit. Many of these components are not shipped by SUSE as they are proprietary NVIDIA software. This section describes how to configure additional repositories that give you access to these components and provides examples of using these tools to achieve a fully functional system.

2.2.1.3 The installation procedure
  1. Add a package repository from NVIDIA. This allows pulling in additional utilities, for example, nvidia-smi.

    For the AMD64/Intel 64 architecture, run:

    # zypper ar \
      https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/ \
      cuda-sle15
    # zypper --gpg-auto-import-keys refresh

    For the Arm AArch64 architecture, run:

    # zypper ar \
      https://developer.download.nvidia.com/compute/cuda/repos/sles15/sbsa/ \
      cuda-sle15
    # zypper --gpg-auto-import-keys refresh
  2. Install the Open Kernel driver KMP and detect the driver version.

    # zypper install -y --auto-agree-with-licenses \
      nv-prefer-signed-open-driver
    # version=$(rpm -qa --queryformat '%{VERSION}\n' \
      nv-prefer-signed-open-driver | cut -d "_" -f1 | sort -u | tail -n 1)
  3. You can then install the appropriate packages for additional utilities that are useful for testing purposes.

    # zypper install -y --auto-agree-with-licenses \
    nvidia-compute-utils-G06=${version} \
    nvidia-persistenced=${version}
  4. Reboot the host to make the changes effective.

    # reboot
  5. Log back in and use the nvidia-smi tool to verify that the driver is loaded successfully and that it can both access and enumerate your GPUs.

    # nvidia-smi

    The output of this command should be similar to the following. In the example below, the system has one GPU.

    Fri Aug  1 15:32:10 2025       
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
    | N/A   33C    P8             13W /   70W |       0MiB /  15360MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
                                                                                             
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
2.2.1.4 Validation of the driver installation

Running the nvidia-smi command has verified that, at the host level, the NVIDIA device can be accessed and that the drivers are loading successfully. To confirm that the GPU is functioning, verify that it can take instructions from a user-space application, ideally via a container and through the CUDA library, as that is typically what a real workload would use. For this, we can make a further modification to the host OS by installing nvidia-container-toolkit.

  1. Install the nvidia-container-toolkit package from the NVIDIA Container Toolkit repository.

    # zypper ar \
    "https://nvidia.github.io/libnvidia-container/stable/rpm/"\
    nvidia-container-toolkit.repo
    # zypper --gpg-auto-import-keys install \
      -y nvidia-container-toolkit

    The nvidia-container-toolkit.repo file contains a stable repository nvidia-container-toolkit and an experimental repository nvidia-container-toolkit-experimental. Use the stable repository for production use. The experimental repository is disabled by default.

  2. Verify that the system can successfully enumerate the devices using the NVIDIA Container Toolkit. The output should be verbose, with INFO and WARN messages, but no ERROR messages.

    # nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    This ensures that any container started on the machine can employ discovered NVIDIA GPU devices.
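
    As an optional check, you can also list the device names contained in the generated CDI specification:

    # nvidia-ctk cdi list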

  3. You can then run a Podman-based container. Doing this via podman gives you a good way of validating access to the NVIDIA device from within a container, which should give confidence for doing the same with Kubernetes at a later stage.

    Give Podman access to the NVIDIA devices labeled by the previous command and run the bash command.

    # podman run --rm --device nvidia.com/gpu=all \
      --security-opt=label=disable \
      -it registry.suse.com/bci/bci-base:latest bash

    You can now execute commands from within a temporary Podman container. It does not have access to your underlying system and is ephemeral—whatever you change in the container does not persist. Also, you cannot break anything on the underlying host.

  4. Inside the container, install the required CUDA libraries. Identify their version from the output of the nvidia-smi command. In the example above, we install the CUDA 13.0 libraries together with examples, demos and development kits to fully validate the GPU.

    # zypper ar \
      http://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/ \
      cuda-sle15-sp6
    # zypper --gpg-auto-import-keys refresh
    # zypper install -y cuda-libraries-13-0 cuda-demo-suite-12-9
  5. Inside the container, run the deviceQuery CUDA example of the same version, which comprehensively validates GPU access via CUDA and from within the container itself.

    # /usr/local/cuda-12.9/extras/demo_suite/deviceQuery
    /usr/local/cuda-12.9/extras/demo_suite/deviceQuery Starting...
    
     CUDA Device Query (Runtime API)
    
    Detected 1 CUDA Capable device(s)
    
    Device 0: "Tesla T4"
      CUDA Driver Version / Runtime Version          13.0/ 13.0
      CUDA Capability Major/Minor version number:    7.5
      Total amount of global memory:                 14913 MBytes (15637086208 bytes)
      (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
      GPU Max Clock rate:                            1590 MHz (1.59 GHz)
      Memory Clock rate:                             5001 Mhz
      Memory Bus Width:                              256-bit
      L2 Cache Size:                                 4194304 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  1024
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 13.0, CUDA Runtime Version = 13.0, NumDevs = 1, Device0 = Tesla T4
    Result = PASS

    From inside the container, you can run any other CUDA workload, such as compilers, to perform further tests. When finished, you can exit the container.

    # exit
    Important
    Important

    Changes you have made in the container and packages you have installed inside will be lost and will not impact the underlying operating system.

2.2.2 Installing NVIDIA GPU drivers on SUSE Linux Micro

2.2.2.1 Requirements

This guide assumes that you have the following already available:

  • At least one host with SUSE Linux Micro 6.1 installed, physical or virtual.

  • Your hosts are attached to a subscription as this is required for package access.

  • A compatible NVIDIA GPU installed or fully passed through to the virtual machine in which SUSE Linux Micro is running.

  • Access to the root user—these instructions assume you are the root user, and not escalating your privileges via sudo.

2.2.2.2 Considerations before the installation
2.2.2.2.1 Select the driver generation

You must verify the driver generation for the NVIDIA GPU that your system has. For modern GPUs, the G06 driver is the most common choice. Find more details in the support database.

This section details the installation of the G06 generation of the driver.

2.2.2.2.2 Additional NVIDIA components

Besides the NVIDIA open-driver provided by SUSE as part of SUSE Linux Micro, you might also need additional NVIDIA components. These could include OpenGL libraries, CUDA toolkits, command-line utilities such as nvidia-smi, and container-integration components such as nvidia-container-toolkit. Many of these components are not shipped by SUSE as they are proprietary NVIDIA software. This section describes how to configure additional repositories that give you access to these components and provides examples of using these tools to achieve a fully functional system.

2.2.2.3 The installation procedure
  1. On each GPU-enabled host, open up a transactional-update shell session to create a new read/write snapshot of the underlying operating system so that we can make changes to the immutable platform.

    # transactional-update shell
  2. When you are in the transactional-update shell session, add a package repository from NVIDIA. This allows pulling in additional utilities, for example, nvidia-smi.

    For the AMD64/Intel 64 architecture, run:

    transactional update # zypper ar \
      https://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/ \
      cuda-sle15
    transactional update # zypper --gpg-auto-import-keys refresh

    For the Arm AArch64 architecture, run:

    transactional update # zypper ar \
      https://developer.download.nvidia.com/compute/cuda/repos/sles15/sbsa/ \
      cuda-sle15
    transactional update # zypper --gpg-auto-import-keys refresh
  3. Install the Open Kernel driver KMP and detect the driver version.

    transactional update # zypper install -y --auto-agree-with-licenses \
      nvidia-open-driver-G06-signed-cuda-kmp-default
    transactional update # version=$(rpm -qa --queryformat '%{VERSION}\n' \
      nvidia-open-driver-G06-signed-cuda-kmp-default \
      | cut -d "_" -f1 | sort -u | tail -n 1)
  4. You can then install the appropriate packages for additional utilities that are useful for testing purposes.

    transactional update # zypper install -y --auto-agree-with-licenses \
    nvidia-compute-utils-G06=${version} \
    nvidia-persistenced=${version}
  5. Exit the transactional-update session and reboot to the new snapshot that contains the changes you have made.

    transactional update # exit
    # reboot
  6. After the system has rebooted, log back in and use the nvidia-smi tool to verify that the driver is loaded successfully and that it can both access and enumerate your GPUs.

    # nvidia-smi

    The output of this command should be similar to the following. In the example below, the system has one GPU.

    Fri Aug  1 14:53:26 2025       
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  Tesla T4                       On  |   00000000:00:1E.0 Off |                    0 |
    | N/A   34C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
                                                                                             
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
2.2.2.4 Validation of the driver installation

Running the nvidia-smi command has verified that, at the host level, the NVIDIA device can be accessed and that the drivers are loading successfully. To confirm that the GPU is functioning, verify that it can take instructions from a user-space application, ideally via a container and through the CUDA library, as that is typically what a real workload would use. For this, we can make a further modification to the host OS by installing nvidia-container-toolkit.

  1. Open another transactional-update shell.

    # transactional-update shell
  2. Install the nvidia-container-toolkit package from the NVIDIA Container Toolkit repository.

    transactional update # zypper ar \
    "https://nvidia.github.io/libnvidia-container/stable/rpm/"\
    nvidia-container-toolkit.repo
    transactional update # zypper --gpg-auto-import-keys install \
      -y nvidia-container-toolkit

    The nvidia-container-toolkit.repo file contains a stable repository nvidia-container-toolkit and an experimental repository nvidia-container-toolkit-experimental. Use the stable repository for production use. The experimental repository is disabled by default.

  3. Exit the transactional-update session and reboot to the new snapshot that contains the changes you have made.

    transactional update # exit
    # reboot
  4. Verify that the system can successfully enumerate the devices using the NVIDIA Container Toolkit. The output should be verbose, with INFO and WARN messages, but no ERROR messages.

    # nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    This ensures that any container started on the machine can employ discovered NVIDIA GPU devices.

  5. You can then run a Podman-based container. Doing this via podman gives you a good way of validating access to the NVIDIA device from within a container, which should give confidence for doing the same with Kubernetes at a later stage.

    Give Podman access to the NVIDIA devices labeled by the previous command and run the bash command.

    # podman run --rm --device nvidia.com/gpu=all \
      --security-opt=label=disable \
      -it registry.suse.com/bci/bci-base:latest bash

    You can now execute commands from within a temporary Podman container. It does not have access to your underlying system and is ephemeral—whatever you change in the container does not persist. Also, you cannot break anything on the underlying host.

  6. Inside the container, install the required CUDA libraries. Identify their version from the output of the nvidia-smi command. In the example above, we install the CUDA 13.0 libraries together with examples, demos and development kits to fully validate the GPU.

    # zypper ar \
      http://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/ \
      cuda-sle15-sp6
    # zypper --gpg-auto-import-keys refresh
    # zypper install -y cuda-libraries-13-0 cuda-demo-suite-12-9
  7. Inside the container, run the deviceQuery CUDA example of the same version, which comprehensively validates GPU access via CUDA and from within the container itself.

    # /usr/local/cuda-12.9/extras/demo_suite/deviceQuery
    /usr/local/cuda-12.9/extras/demo_suite/deviceQuery Starting...
    
     CUDA Device Query (Runtime API)
    
    Detected 1 CUDA Capable device(s)
    
    Device 0: "Tesla T4"
      CUDA Driver Version / Runtime Version          13.0 / 13.0
      CUDA Capability Major/Minor version number:    7.5
      Total amount of global memory:                 14914 MBytes (15638134784 bytes)
      (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
      GPU Max Clock rate:                            1590 MHz (1.59 GHz)
      Memory Clock rate:                             5001 Mhz
      Memory Bus Width:                              256-bit
      L2 Cache Size:                                 4194304 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  1024
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 13.0, CUDA Runtime Version = 13.0, NumDevs = 1, Device0 = Tesla T4
    Result = PASS

    From inside the container, you can run any other CUDA workload, such as compilers, to perform further tests. When finished, you can exit the container.

    # exit
    Important
    Important

    Changes you have made in the container and packages you have installed inside will be lost and will not impact the underlying operating system.

3 Preparing the cluster for AI Library

This procedure assumes that you already have the base operating system installed on cluster nodes as well as the SUSE Rancher Prime: RKE2 Kubernetes distribution installed and operational. If you are installing from scratch, refer to Section 2, “Installing the Linux and Kubernetes distribution” first.

  1. Install SUSE Rancher Prime.

  2. Install the NVIDIA GPU Operator on the cluster.

    Tip
    Tip: Installing NVIDIA GPU Operator on SUSE Rancher Prime: RKE2

    If you run SUSE Rancher Prime: RKE2, follow these steps:

    1. On the agent nodes, run the following command:

      # echo PATH=$PATH:/usr/local/nvidia/toolkit >> /etc/default/rke2-agent
    2. On the server nodes, run the following command:

      # echo PATH=$PATH:/usr/local/nvidia/toolkit >> /etc/default/rke2-server
    3. Follow the steps described in https://documentation.suse.com/cloudnative/rke2/latest/en/advanced.html#_deploy_nvidia_operator.

  3. Connect the Kubernetes cluster to SUSE Rancher Prime. Refer to https://documentation.suse.com/cloudnative/rancher-manager/latest/en/cluster-deployment/register-existing-clusters.html for details.

  4. Configure the GPU-enabled nodes so that the SUSE AI containers are assigned to Pods that run on nodes equipped with NVIDIA GPU hardware. Find more details about assigning Pods to nodes in Section 3.1, “Assigning GPU nodes to applications”.

  5. Install and configure SUSE Security to scan the nodes used for SUSE AI. Although this step is not required, we strongly encourage it to ensure security in production environments.

  6. Install and configure SUSE Observability to observe the nodes used for the SUSE AI application. Refer to Section 3.2, “Setting up SUSE Observability for SUSE AI” for more details.

3.1 Assigning GPU nodes to applications

When deploying a containerized application to Kubernetes, you need to ensure that containers requiring GPU resources run on appropriate worker nodes. For example, Ollama, a core component of SUSE AI, benefits greatly from GPU acceleration. This topic describes how to satisfy this requirement by explicitly requesting GPU resources and by labeling worker nodes for use with a node selector.

Requirements
  • A Kubernetes cluster—such as SUSE Rancher Prime: RKE2—must be available and configured with more than one worker node, where certain nodes have NVIDIA GPU resources and others do not.

  • This document assumes that any kind of deployment to the Kubernetes cluster is done using Helm charts.

3.1.1 Labeling GPU nodes

To distinguish nodes with GPU support from non-GPU nodes, Kubernetes uses labels. Labels attach identifying metadata to resources and should not be confused with annotations, which only provide non-identifying information about a resource. You can manage labels with the kubectl command, as well as by editing the configuration files of the nodes. If an IaC tool such as Terraform is used, labels can be inserted into the node resource configuration files.

To label a single node, use the following command:

> kubectl label node GPU_NODE_NAME accelerator=nvidia-gpu

To achieve the same result by tweaking the node.yaml node configuration, add the following content and apply the changes with kubectl apply -f node.yaml:

apiVersion: v1
kind: Node
metadata:
  name: node-name
  labels:
    accelerator: nvidia-gpu
Tip
Tip: Labeling multiple nodes

To label multiple nodes, use the following command:

> kubectl label node \
  GPU_NODE_NAME1 \
  GPU_NODE_NAME2 ... \
  accelerator=nvidia-gpu
Tip
Tip

If Terraform is being used as an IaC tool, you can add labels to a group of nodes by editing the .tf files and adding the following values to a resource:

resource "node_group" "example" {
  labels = {
    "accelerator" = "nvidia-gpu"
  }
}

To check if the labels are correctly applied, use the following command:

> kubectl get nodes --show-labels
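
To list only the nodes that carry the GPU label, filter by the label selector:

> kubectl get nodes -l accelerator=nvidia-gpu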

3.1.2 Assigning GPU nodes

The matching between a container and a node is configured by explicit resource allocation and by the use of labels and node selectors. The use cases described below focus on NVIDIA GPUs.

3.1.2.1 Enable GPU passthrough

Containers are isolated from the host environment by default. For containers that rely on the allocation of GPU resources, their Helm charts must enable GPU passthrough so that the container can access and use the GPU. Without GPU passthrough, the container may still run, but it can only use the CPU for all computations. Refer to the Ollama Helm chart for an example of the configuration required for GPU acceleration.
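
As an illustration, on SUSE Rancher Prime: RKE2 with the NVIDIA GPU Operator installed, enabling passthrough typically means selecting the NVIDIA runtime class and requesting the GPU resource in the Pod specification rendered by the chart. The following is a minimal sketch of such a Pod-level configuration; the exact value keys depend on the specific Helm chart and are not the actual Ollama chart values.

spec:
  runtimeClassName: nvidia
  containers:
    - name: ollama
      image: OLLAMA_IMAGE
      resources:
        limits:
          nvidia.com/gpu: 1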

3.1.2.2 Assignment by resource request

After the NVIDIA GPU Operator is configured on a node, you can instantiate applications requesting the resource nvidia.com/gpu provided by the operator. Add the following content to your values.yaml file. Specify the number of GPUs according to your setup.

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
3.1.2.3 Assignment by labels and node selectors

If affected cluster nodes are labeled with a label such as accelerator=nvidia-gpu, you can configure the node selector to check for the label. In this case, use the following values in your values.yaml file.

nodeSelector:
  accelerator: nvidia-gpu
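
You can then pass the customized values file when installing or upgrading the application with Helm. The following example is a sketch only; the release name, chart reference and namespace are placeholders for your setup.

> helm upgrade --install ollama OLLAMA_CHART \
  --namespace NAMESPACE \
  -f values.yaml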

3.1.3 Verifying Ollama GPU assignment

If the GPU is correctly detected, the Ollama container logs this event:

[...] source=routes.go:1172 msg="Listening on :11434 (version 0.0.0)"
[...] source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2502346830/runners
[...] source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 cpu cpu_avx cpu_avx2]"
[...] source=gpu.go:204 msg="looking for compatible GPUs"
[...] source=types.go:105 msg="inference compute" id=GPU-c9ad37d0-d304-5d2a-c2e6-d3788cd733a7 library=cuda compute
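
To inspect these messages on a running cluster, you can view the logs of the Ollama workload, for example as follows. The namespace and deployment name are placeholders for your setup.

> kubectl logs -n NAMESPACE deployment/ollama | grep -i gpu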

3.2 Setting up SUSE Observability for SUSE AI

SUSE Observability provides comprehensive monitoring and insights into your infrastructure and applications. It enables efficient tracking of metrics, logs and traces, helping you maintain optimal performance and troubleshoot issues effectively. This procedure guides you through setting up SUSE Observability for the SUSE AI environment using the SUSE AI Observability Extension.

3.2.1 Deployment scenarios

You can deploy SUSE Observability and SUSE AI in two different ways:

  • Single-Cluster setup: Both SUSE AI and SUSE Observability are installed in the same Kubernetes cluster. This is a simpler approach ideal for testing and proof-of-concept deployments. Communication between components can use internal cluster DNS.

  • Multi-Cluster setup: SUSE AI and SUSE Observability are installed on separate, dedicated Kubernetes clusters. This setup is recommended for production environments because it isolates workloads. Communication requires exposing the SUSE Observability endpoints externally, for example, via an Ingress.

This section provides instructions for both scenarios.

3.2.2 Requirements

To set up SUSE Observability for SUSE AI, you need to meet the following requirements:

  • Have access to SUSE Application Collection

  • Have a valid SUSE AI subscription

  • Have a valid license for SUSE Observability in SUSE Customer Center

  • Instrument your applications for telemetry data acquisition with OpenTelemetry.

For details on how to collect traces and metrics from SUSE AI components and user-developed applications, refer to Monitoring SUSE AI with OpenTelemetry and SUSE Observability. It includes configurations that are essential for full observability.

Important
Important: SUSE Application Collection not instrumented by default

Applications from the SUSE Application Collection are not instrumented by default. If you want to monitor your AI applications, you need to follow the instrumentation guidelines that we provide in the document Monitoring SUSE AI with OpenTelemetry and SUSE Observability.

3.2.3 Setup process overview

The following chart shows the high-level steps for the setup procedure. You will first set up the SUSE Observability cluster, then configure the SUSE AI cluster, and finally instrument your applications. Execute the steps in each column from left to right and top to bottom.

  • Blue steps are related to Helm chart installations.

  • Gray steps represent another type of interaction, such as coding.

The chart showing a high-level overview of the SUSE Observability setup
Figure 16: High-level overview of the SUSE Observability setup
Tip
Tip: Setup clusters

You can create and configure Kubernetes clusters for SUSE AI and SUSE Observability as you prefer. If you are using SUSE Rancher Prime, check its documentation. For testing purposes, you can even share one cluster for both deployments. You can skip instructions on setting up a specific cluster if you already have one configured.

The diagram below shows the result of the above steps. Two clusters are represented, one for the SUSE Observability workload and one for SUSE AI. You may use an identical setup or customize it for your environment.

The chart showing setup of separate clusters for SUSE AI and SUSE Observability
Figure 17: Separate clusters for SUSE AI and SUSE Observability
Points to notice
  • You can install the SUSE AI Observability Extension alongside SUSE Observability. In that case, it can reach SUSE Observability through the internal Kubernetes DNS.

  • SUSE Observability consists of several components. Two of them, the SUSE Observability API and the collector endpoint, must be reachable from the SUSE AI cluster.

    Important
    Important

    Remember that in multi-cluster setups, it is critical to properly expose these endpoints: configure TLS and make sure to provide the correct keys and tokens. More details are provided in the respective instructions.

3.2.4 Setting up the SUSE Observability cluster

This initial step is identical for both single-cluster and multi-cluster deployments.

  1. Install SUSE Observability. You can follow the official SUSE Observability installation documentation for all installation instructions. Remember to expose your APIs and collector endpoints to your SUSE AI cluster.

    Important
    Important: Multi-cluster setup

    For multi-cluster setups, you must expose the SUSE Observability API and collector endpoints so that the SUSE AI cluster can reach them. Refer to the guide on exposing SUSE Observability outside of the cluster.

  2. Install the SUSE Observability extension. Create a new Helm values file named genai_values.yaml. Before creating the file, review the placeholders below.

    SUSE_OBSERVABILITY_API_URL

    The URL of the SUSE Observability API. For multi-cluster deployments, this is the external URL. For single-cluster deployments, this can be the internal service URL. Example: http://suse-observability-api.your-domain.com

    SUSE_OBSERVABILITY_API_KEY

    The API key from the baseConfig_values.yaml file used during the SUSE Observability installation.

    SUSE_OBSERVABILITY_API_TOKEN_TYPE

    Can be api for a token from the Web UI or service for a Service Token.

    SUSE_OBSERVABILITY_TOKEN

    The API or Service token itself.

    OBSERVED_SERVER_NAME

    The name of the cluster to observe. It must match the name used in the Kubernetes StackPack configuration. Example: suse-ai-cluster.

    Create the genai_values.yaml file with the following content:

    global:
      imagePullSecrets:
      - application-collection 1
      
    serverUrl: SUSE_OBSERVABILITY_API_URL
    apiKey: SUSE_OBSERVABILITY_API_KEY
    tokenType: SUSE_OBSERVABILITY_API_TOKEN_TYPE
    apiToken: SUSE_OBSERVABILITY_TOKEN
    clusterName: OBSERVED_SERVER_NAME

    1

    Instructs Helm to use credentials from the SUSE Application Collection. For instructions on how to configure the image pull secrets for the SUSE Application Collection, refer to the official documentation.

    Run the install command.

    > helm upgrade --install ai-obs \
      oci://dp.apps.rancher.io/charts/suse-ai-observability-extension \
      -f genai_values.yaml --namespace so-extensions --create-namespace
    Note
    Note: Self-signed certificates not supported

    Self-signed certificates are not supported. Consider running the extension in the same cluster as SUSE Observability and then use the internal Kubernetes address.

    After the installation is complete, a new menu called GenAI is added to the Web interface, and a Kubernetes cron job is created that synchronizes the topology view with the components found in the SUSE AI cluster.

  3. Verify the SUSE Observability extension. After the installation, you can verify that a new item appears in the left-hand menu:

    An image of a new left menu item GenAI Observability
    Figure 18: New GenAI Observability menu item
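
    You can also confirm that the synchronization cron job was created in the namespace used for the extension installation:

    > kubectl get cronjobs -n so-extensions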

3.2.5 Setting up the SUSE AI cluster

Follow the instructions for your deployment scenario.

Single-cluster deployment

In this setup, the SUSE AI components are installed in the same cluster as SUSE Observability and can communicate using internal service DNS.

Multi-cluster deployment

In this setup, the SUSE AI cluster is separate. Communication relies on externally exposed endpoints of the SUSE Observability cluster.

The difference between deployment scenarios affects the OTEL Collector exporter configuration and the SUSE Observability Agent URL as described in the following list.

SUSE_OBSERVABILITY_API_URL

The URL of the SUSE Observability API.

Single-cluster example: http://suse-observability-otel-collector.suse-observability.svc.cluster.local:4317

Multi-cluster example: https://suse-observability-api.your-domain.com

SUSE_OBSERVABILITY_COLLECTOR_ENDPOINT

The endpoint of the SUSE Observability Collector.

Single-cluster example: http://suse-observability-router.suse-observability.svc.cluster.local:8080/receiver/stsAgent

Multi-cluster example: https://suse-observability-router.your-domain.com/receiver/stsAgent

  1. Install NVIDIA GPU Operator. Follow the instructions in https://documentation.suse.com/cloudnative/rke2/latest/en/advanced.html#_deploy_nvidia_operator.

  2. Install OpenTelemetry collector. Create a secret with your SUSE Observability API key in the namespace where you want to install the collector. Retrieve the API key using the Web UI or from the baseConfig_values.yaml file that you used during the SUSE Observability installation. If the namespace does not exist yet, create it.

    kubectl create namespace observability
    kubectl create secret generic open-telemetry-collector \
      --namespace observability \
      --from-literal=API_KEY='SUSE_OBSERVABILITY_API_KEY'

    Create a new file named otel-values.yaml with the following content.

    global:
      imagePullSecrets:
      - application-collection
    
    extraEnvsFrom:
      - secretRef:
          name: open-telemetry-collector
    mode: deployment
    ports:
      metrics:
        enabled: true
    presets:
      kubernetesAttributes:
        enabled: true
        extractAllPodLabels: true
    config:
      receivers:
        prometheus:
          config:
            scrape_configs:
              - job_name: 'gpu-metrics'
                scrape_interval: 10s
                scheme: http
                kubernetes_sd_configs:
                  - role: endpoints
                    namespaces:
                      names:
                        - gpu-operator
              - job_name: 'milvus'
                scrape_interval: 15s
                metrics_path: '/metrics'
     
                static_configs:
                  - targets: ['MILVUS_SERVICE_NAME.SUSE_AI_NAMESPACE.svc.cluster.local:9091'] 1
      exporters:
        otlp:
          endpoint: https://OPEN_TELEMETRY_COLLECTOR_NAME.suse-observability.svc.cluster.local:4317 2
          headers:
            Authorization: "SUSEObservability ${env:API_KEY}"
          tls:
            insecure: true
      processors:
        tail_sampling:
          decision_wait: 10s
          policies:
          - name: rate-limited-composite
            type: composite
            composite:
              max_total_spans_per_second: 500
              policy_order: [errors, slow-traces, rest]
              composite_sub_policy:
              - name: errors
                type: status_code
                status_code:
                  status_codes: [ ERROR ]
              - name: slow-traces
                type: latency
                latency:
                  threshold_ms: 1000
              - name: rest
                type: always_sample
              rate_allocation:
              - policy: errors
                percent: 33
              - policy: slow-traces
                percent: 33
              - policy: rest
                percent: 34
        resource:
          attributes:
          - key: k8s.cluster.name
            action: upsert
            value: CLUSTER_NAME 3
          - key: service.instance.id
            from_attribute: k8s.pod.uid
            action: insert
        filter/dropMissingK8sAttributes:
          error_mode: ignore
          traces:
            span:
              - resource.attributes["k8s.node.name"] == nil
              - resource.attributes["k8s.pod.uid"] == nil
              - resource.attributes["k8s.namespace.name"] == nil
              - resource.attributes["k8s.pod.name"] == nil
      connectors:
        spanmetrics:
          metrics_expiration: 5m
          namespace: otel_span
        routing/traces:
          error_mode: ignore
          table:
          - statement: route()
            pipelines: [traces/sampling, traces/spanmetrics]
      service:
        extensions:
          - health_check
        pipelines:
          traces:
            receivers: [otlp, jaeger]
            processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
            exporters: [routing/traces]
          traces/spanmetrics:
            receivers: [routing/traces]
            processors: []
            exporters: [spanmetrics]
          traces/sampling:
            receivers: [routing/traces]
            processors: [tail_sampling, batch]
            exporters: [debug, otlp]
          metrics:
            receivers: [otlp, spanmetrics, prometheus]
            processors: [memory_limiter, resource, batch]
            exporters: [debug, otlp]

    1

    Configure the Milvus service and namespace for the Prometheus scraper. Because Milvus will be installed in subsequent steps, you can return to this step and edit the endpoint if necessary.

    2

    Set the exporter to your exposed SUSE Observability collector. Remember that the value can be distinct, depending on the deployment pattern. For production usage, we recommend using TLS communication.

    3

    Replace CLUSTER_NAME with the cluster's name.

    Finally, run the installation command.

    > helm upgrade --install opentelemetry-collector \
      oci://dp.apps.rancher.io/charts/opentelemetry-collector \
      -f otel-values.yaml --namespace observability

    Verify the installation by checking the existence of a new deployment and service in the observability namespace.
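
    For example, the following command lists both. The resource names are derived from the Helm release name, opentelemetry-collector in this procedure.

    > kubectl get deployments,services -n observability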

  3. The GPU metrics scraper that we configure in the OTEL Collector requires custom RBAC rules. Create a file named otel-rbac.yaml with the following content:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: suse-observability-otel-scraper
    rules:
      - apiGroups:
          - ""
        resources:
          - services
          - endpoints
        verbs:
          - list
          - watch
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: suse-observability-otel-scraper
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: suse-observability-otel-scraper
    subjects:
      - kind: ServiceAccount
        name: opentelemetry-collector
        namespace: observability

    Then apply the configuration by running the following command.

    > kubectl apply -n gpu-operator -f otel-rbac.yaml
  4. Install the SUSE Observability Agent.

    > helm upgrade --install \
      --namespace suse-observability --create-namespace \
      --set-string 'stackstate.apiKey'='YOUR_API_KEY' 1 \
      --set-string 'stackstate.cluster.name'='CLUSTER_NAME' 2 \
      --set-string 'stackstate.url'='http://suse-observability-router.suse-observability.svc.cluster.local:8080/receiver/stsAgent' 3 \
      --set 'nodeAgent.skipKubeletTLSVerify'=true suse-observability-agent \
      suse-observability/suse-observability-agent

    1

    Retrieve the API key using the Web UI or from the baseConfig_values.yaml file that you used during the SUSE Observability installation.

    2

    Replace CLUSTER_NAME with the cluster's name.

    3

    Replace with your SUSE Observability server URL.

  5. Install SUSE AI. Refer to Section 4, “Installing applications from AI Library” for the complete procedure.

3.2.6 Instrumenting applications

Instrumentation is the act of configuring your applications for telemetry data acquisition. Our stack employs OpenTelemetry standards as a vendor-neutral and open base for our telemetry. For a comprehensive guide on how to set up your instrumentation, please refer to Monitoring SUSE AI with OpenTelemetry and SUSE Observability.

By following the instructions in the document referenced above, you will be able to retrieve all relevant telemetry data from Open WebUI, Ollama and Milvus by simply applying specific configuration to their Helm chart values. You can find links for advanced use cases (auto-instrumentation with OTEL Operator) at the end of the document.

4 Installing applications from AI Library

SUSE AI is delivered as a set of components that you can combine to meet specific use cases. To enable the full integrated stack, you need to deploy multiple applications in sequence. Applications with the fewest dependencies must be installed first, followed by dependent applications once their required dependencies are in place within the cluster.

You can either install required AI Library components manually using their Helm charts, or use SUSE AI Deployer to include all the dependencies in one step.

4.1 Installation procedure

This procedure includes steps to install AI Library applications.

  1. Purchase the SUSE AI entitlement. It is a separate entitlement from SUSE Rancher Prime.

  2. Access the SUSE Application Collection at https://apps.rancher.io/, where the SUSE AI entitlement check is performed.

  3. If the entitlement check is successful, you are given access to the SUSE AI-related Helm charts and container images, and can deploy directly from the SUSE Application Collection.

  4. Visit the SUSE Application Collection, sign in and get the user access token as described in https://docs.apps.rancher.io/get-started/authentication/.

  5. Create a Kubernetes namespace if it does not already exist. The steps in this procedure assume that all containers are deployed into the same namespace, referred to as SUSE_AI_NAMESPACE. Replace the name to match your environment.

    > kubectl create namespace SUSE_AI_NAMESPACE
  6. Create the SUSE Application Collection secret.

    > kubectl create secret docker-registry application-collection \
      --docker-server=dp.apps.rancher.io \
      --docker-username=APPCO_USERNAME \
      --docker-password=APPCO_USER_TOKEN \
      -n SUSE_AI_NAMESPACE
  7. Log in to the Helm registry.

    > helm registry login dp.apps.rancher.io/charts \
      -u APPCO_USERNAME \
      -p APPCO_USER_TOKEN
  8. Install cert-manager as described in Section 4.2, “Installing cert-manager”.

  9. Install AI Library components. You can either install each component separately, or use the SUSE AI Deployer chart to install the components together as described in Section 4.7, “Installing AI Library components using SUSE AI Deployer”.

    1. Install Milvus as described in Section 4.3, “Installing Milvus”.

    2. (Optional) Install Ollama as described in Section 4.4, “Installing Ollama”.

    3. Install Open WebUI as described in Section 4.5, “Installing Open WebUI”.

4.2 Installing cert-manager

cert-manager is an extensible X.509 certificate controller for Kubernetes workloads. It supports certificates from popular public issuers as well as private issuers. cert-manager ensures that the certificates are valid and up-to-date, and attempts to renew certificates at a configured time before expiry.

In previous releases, cert-manager was automatically installed together with Open WebUI. Currently, cert-manager is no longer part of the Open WebUI Helm chart and you need to install it separately.

4.2.1 Details about the cert-manager application

Before deploying cert-manager, it is important to know more about the supported configurations and documentation. The following command provides the corresponding details:

helm show values oci://dp.apps.rancher.io/charts/cert-manager

Alternatively, you can also refer to the cert-manager Helm chart page on the SUSE Application Collection site at https://apps.rancher.io/applications/cert-manager. It contains available versions and the link to pull the cert-manager container image.

4.2.2 cert-manager installation procedure

Tip
Tip

Before the installation, you need to get user access to the SUSE Application Collection, create a Kubernetes namespace, and log in to the Helm registry as described in Section 4.1, “Installation procedure”.

  • Install the cert-manager chart.

    > helm upgrade --install cert-manager \
      oci://dp.apps.rancher.io/charts/cert-manager \
      -n CERT_MANAGER_NAMESPACE \
      --set crds.enabled=true \
      --set 'global.imagePullSecrets[0].name'=application-collection
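
    To verify the deployment, check that the cert-manager pods reach the Running state, for example:

    > kubectl get pods -n CERT_MANAGER_NAMESPACE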

4.2.3 Upgrading cert-manager

To upgrade cert-manager to a specific new version, run the following command:

> helm upgrade --install cert-manager \
  oci://dp.apps.rancher.io/charts/cert-manager \
  -n CERT_MANAGER_NAMESPACE \
  --version VERSION_NUMBER

To upgrade cert-manager to the latest version, run the following command:

> helm upgrade --install cert-manager \
  oci://dp.apps.rancher.io/charts/cert-manager \
  -n CERT_MANAGER_NAMESPACE

4.2.4 Uninstalling cert-manager

To uninstall cert-manager, run the following command:

> helm uninstall cert-manager -n CERT_MANAGER_NAMESPACE

4.3 Installing Milvus

Milvus is a scalable, high-performance vector database designed for AI applications. It enables efficient organization and searching of massive unstructured datasets, including text, images and multi-modal content. This procedure walks you through the installation of Milvus and its dependencies.

4.3.1 Details about the Milvus application

Before deploying Milvus, it is important to know more about the supported configurations and documentation. The following command provides the corresponding details:

helm show values oci://dp.apps.rancher.io/charts/milvus

Alternatively, you can also refer to the Milvus Helm chart page on the SUSE Application Collection site at https://apps.rancher.io/applications/milvus. It contains Milvus dependencies, available versions and the link to pull the Milvus container image.

Milvus page in the SUSE Application Collection
Figure 19: Milvus page in the SUSE Application Collection

4.3.2 Milvus installation procedure

Tip
Tip

Before the installation, you need to get user access to the SUSE Application Collection, create a Kubernetes namespace, and log in to the Helm registry as described in Section 4.1, “Installation procedure”.

  1. When installed as part of SUSE AI, Milvus depends on etcd, MinIO and Apache Kafka. Because the Milvus chart uses a non-default configuration, create an override file milvus_custom_overrides.yaml with the following content.

    Tip
    Tip

    As a template, you can download the Milvus Helm chart that includes the values.yaml file with the default configuration by running the following command:

    > helm pull oci://dp.apps.rancher.io/charts/milvus --version 4.2.2
    global:
      imagePullSecrets:
      - application-collection
      
    cluster:
      enabled: True
    standalone:
      persistence:
        persistentVolumeClaim:
          storageClassName: "local-path"
    etcd:
      replicaCount: 1
      persistence:
        storageClassName: "local-path"
    minio:
      mode: distributed
      replicas: 4
      rootUser: "admin"
      rootPassword: "adminminio"
      persistence:
        storageClass: "local-path"
      resources:
        requests:
          memory: 1024Mi
    kafka:
      enabled: true
      name: kafka
      replicaCount: 3
      broker:
        enabled: true
      cluster:
        listeners:
          client:
            protocol: 'PLAINTEXT'
          controller:
            protocol: 'PLAINTEXT'
      persistence:
        enabled: true
        annotations: {}
        labels: {}
        existingClaim: ""
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        storageClassName: "local-path"
    extraConfigFiles: 1
      user.yaml: |+
        trace:
          exporter: jaeger
          sampleFraction: 1
          jaeger:
            url: "http://opentelemetry-collector.observability.svc.cluster.local:14268/api/traces" 2

    1

    The extraConfigFiles section is optional, required only to receive telemetry data from Open WebUI.

    2

    The URL of the OpenTelemetry Collector installed by the user.

    Tip
    Tip

    The above example uses local storage. For production environments, we recommend using an enterprise-class storage solution such as SUSE Storage, in which case the storageClassName option must be set to longhorn.

  2. Install the Milvus Helm chart using the milvus_custom_overrides.yaml override file.

    > helm upgrade --install \
      milvus oci://dp.apps.rancher.io/charts/milvus \
      -n SUSE_AI_NAMESPACE \
      --version 4.2.2 -f milvus_custom_overrides.yaml
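
    The installation starts the Milvus components together with their dependencies etcd, MinIO and Apache Kafka. You can watch the pods until they are all in the Running state; the pod names depend on the release name, milvus in this procedure.

    > kubectl get pods -n SUSE_AI_NAMESPACE
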
4.3.2.1 Using Apache Kafka with SUSE Storage

When Milvus is deployed in cluster mode, it uses Apache Kafka as a message queue. If Apache Kafka uses SUSE Storage as a storage back-end, you need to create an XFS storage class and make it available to the Apache Kafka deployment. Otherwise, deploying Apache Kafka on a storage class backed by an Ext4 file system fails with the following error:

"Found directory /mnt/kafka/logs/lost+found, 'lost+found' is not
  in the form of topic-partition or topic-partition.uniqueId-delete
  (if marked for deletion)"

To introduce the XFS storage class, follow these steps:

  1. Create a file named longhorn-xfs.yaml with the following content:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: longhorn-xfs
    provisioner: driver.longhorn.io
    allowVolumeExpansion: true
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    parameters:
      numberOfReplicas: "3"
      staleReplicaTimeout: "30"
      fromBackup: ""
      fsType: "xfs"
      dataLocality: "disabled"
      unmapMarkSnapChainRemoved: "ignored"
  2. Create the new storage class using the kubectl command.

    > kubectl apply -f longhorn-xfs.yaml
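
    You can confirm that the new storage class is available before continuing:

    > kubectl get storageclass longhorn-xfs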
  3. Update the Milvus overrides YAML file to reference the Apache Kafka storage class, as in the following example:

    [...]
    kafka:
      enabled: true
      persistence:
        storageClassName: longhorn-xfs

4.3.3 Upgrading Milvus

The Milvus chart receives application updates and updates of the Helm chart templates. New versions may include changes that require manual steps. These steps are listed in the corresponding README file. All Milvus dependencies are updated automatically during Milvus upgrade.

To upgrade Milvus, identify the new version number and run the following command:

> helm upgrade --install \
  milvus oci://dp.apps.rancher.io/charts/milvus \
  -n SUSE_AI_NAMESPACE \
  --version VERSION_NUMBER \
  -f milvus_custom_overrides.yaml

4.3.4 Uninstalling Milvus

To uninstall Milvus, run the following command:

> helm uninstall milvus -n SUSE_AI_NAMESPACE

4.4 Installing Ollama

Ollama is a tool for running and managing language models locally on your computer. It offers a simple interface to download, run and interact with models without relying on cloud resources.

Tip
Tip

When installing SUSE AI, Ollama is installed as part of the Open WebUI installation by default. If you decide to install Ollama separately, disable it during the Open WebUI installation as outlined in Example 6, “Open WebUI override file with Ollama installed separately”.

4.4.1 Details about the Ollama application

Before deploying Ollama, it is important to know more about the supported configurations and documentation. The following command provides the corresponding details:

helm show values oci://dp.apps.rancher.io/charts/ollama

Alternatively, you can also refer to the Ollama Helm chart page on the SUSE Application Collection site at https://apps.rancher.io/applications/ollama. It contains the available versions and a link to pull the Ollama container image.

4.4.2 Ollama installation procedure

Tip
Tip

Before the installation, you need to get user access to the SUSE Application Collection, create a Kubernetes namespace, and log in to the Helm registry as described in Section 4.1, “Installation procedure”.

  1. Create the ollama_custom_overrides.yaml file to override the values of the parent Helm chart. Refer to Section 4.4.5, “Values for the Ollama Helm chart” for more details.

  2. Install the Ollama Helm chart using the ollama_custom_overrides.yaml override file.

    > helm upgrade \
      --install ollama oci://dp.apps.rancher.io/charts/ollama \
      -n SUSE_AI_NAMESPACE \
      -f ollama_custom_overrides.yaml
    Tip
    Tip: Hugging Face models

    Models downloaded from Hugging Face need to be converted before they can be used by Ollama. Refer to https://github.com/ollama/ollama/blob/main/docs/import.md for more details.

4.4.3 Uninstalling Ollama

To uninstall Ollama, run the following command:

> helm uninstall ollama -n SUSE_AI_NAMESPACE

4.4.4 Upgrading Ollama

You can upgrade Ollama to a specific version by running the following command:

> helm upgrade ollama oci://dp.apps.rancher.io/charts/ollama \
  -n SUSE_AI_NAMESPACE \
  --version OLLAMA_VERSION_NUMBER -f ollama_custom_overrides.yaml

If you omit the --version option, Ollama gets upgraded to the latest available version.

4.4.4.1 Upgrading from version 0.x.x to 1.x.x

Version 1.x.x introduces the ability to load models into memory at startup. To reflect this, change ollama.models to ollama.models.pull in your Ollama Helm chart values before upgrading to avoid errors, for example:

Example 1: Ollama Helm chart version 0.x.x
[...]
ollama:
  models:
    - "gemma:2b"
    - "llama3.1"
Example 2: Ollama Helm chart version 1.x.x
[...]
ollama:
  models:
    pull:
      - "gemma:2b"
      - "llama3.1"

Without this change you may experience the following error when trying to upgrade from 0.x.x to 1.x.x.

coalesce.go:286: warning: cannot overwrite table with non table for
ollama.ollama.models (map[pull:[] run:[]])
Error: UPGRADE FAILED: template: ollama/templates/deployment.yaml:145:27:
executing "ollama/templates/deployment.yaml" at <.Values.ollama.models.pull>:
can't evaluate field pull in type interface {}

4.4.5 Values for the Ollama Helm chart

To override the default values during the Helm chart installation or update, you can create an override YAML file with custom values. Then, apply these values by specifying the path to the override file with the -f option of the helm command.

Important
Important: GPU section

Ollama can run optimized for NVIDIA GPUs if the cluster nodes provide NVIDIA GPUs with the NVIDIA GPU drivers and the NVIDIA GPU Operator installed (see Section 3.1.2, “Assigning GPU nodes”), and if the gpu section is enabled in ollama_custom_overrides.yaml.

If you do not want to use the NVIDIA GPU, remove the gpu section from ollama_custom_overrides.yaml or disable it as in the following example.

 ollama:
  [...]
  gpu:
    enabled: false
    type: 'nvidia'
    number: 1
Example 3: Basic override file with GPU and two models pulled at startup
global:
  imagePullSecrets:
  - application-collection
ingress:
  enabled: false
defaultModel: "gemma:2b"
runtimeClassName: nvidia
ollama:
  models:
    pull:
      - "gemma:2b"
      - "llama3.1"
    run:
      - "gemma:2b"
      - "llama3.1"
  gpu:
    enabled: true
    type: 'nvidia'
    number: 1
    nvidiaResource: "nvidia.com/gpu"
persistentVolume: 1
  enabled: true
  storageClass: local-path 2

1

Without the persistentVolume option enabled, changes made to Ollama, such as downloaded LLMs, are lost when the container is restarted.

2

Use local-path storage only for testing purposes. For production use, we recommend using a storage solution suitable for persistent storage, such as SUSE Storage.

Example 4: Basic override file with Ingress and no GPU
ollama:
  models:
    pull:
      - llama2
    run:
      - llama2
  persistentVolume:
    enabled: true
    storageClass: local-path 1
ingress:
  enabled: true
  hosts:
  - host: OLLAMA_API_URL
    paths:
      - path: /
        pathType: Prefix

1

Use local-path storage (requires installing the corresponding provisioner) only for testing purposes. For production use, we recommend using a storage solution suitable for persistent storage, such as SUSE Storage.

Table 1: Override file options for the Ollama Helm chart

Key

Type

Default

Description

affinity

object

{}

Affinity for pod assignment

autoscaling.enabled

bool

false

Enable autoscaling

autoscaling.maxReplicas

int

100

Number of maximum replicas

autoscaling.minReplicas

int

1

Number of minimum replicas

autoscaling.targetCPUUtilizationPercentage

int

80

CPU usage to target replica

extraArgs

list

[]

Additional arguments on the output Deployment definition.

extraEnv

list

[]

Additional environment variables on the output Deployment definition.

fullnameOverride

string

""

String to fully override template

global.imagePullSecrets

list

[]

Global override for container image registry pull secrets

global.imageRegistry

string

""

Global override for container image registry

hostIPC

bool

false

Use the host’s IPC namespace

hostNetwork

bool

false

Use the host's network namespace

hostPID

bool

false

Use the host's PID namespace.

image.pullPolicy

string

"IfNotPresent"

Image pull policy to use for the Ollama container

image.registry

string

"dp.apps.rancher.io"

Image registry to use for the Ollama container

image.repository

string

"containers/ollama"

Image repository to use for the Ollama container

image.tag

string

"0.3.6"

Image tag to use for the Ollama container

imagePullSecrets

list

[]

Docker registry secret names as an array

ingress.annotations

object

{}

Additional annotations for the Ingress resource

ingress.className

string

""

IngressClass that is used to implement the Ingress (Kubernetes 1.18+)

ingress.enabled

bool

false

Enable Ingress controller resource

ingress.hosts[0].host

string

"ollama.local"

ingress.hosts[0].paths[0].path

string

"/"

ingress.hosts[0].paths[0].pathType

string

"Prefix"

ingress.tls

list

[]

The TLS configuration for host names to be covered with this Ingress record

initContainers

list

[]

Init containers to add to the pod

knative.containerConcurrency

int

0

Knative service container concurrency

knative.enabled

bool

false

Enable Knative integration

knative.idleTimeoutSeconds

int

300

Knative service idle timeout seconds

knative.responseStartTimeoutSeconds

int

300

Knative service response start timeout seconds

knative.timeoutSeconds

int

300

Knative service timeout seconds

livenessProbe.enabled

bool

true

Enable livenessProbe

livenessProbe.failureThreshold

int

6

Failure threshold for livenessProbe

livenessProbe.initialDelaySeconds

int

60

Initial delay seconds for livenessProbe

livenessProbe.path

string

"/"

Request path for livenessProbe

livenessProbe.periodSeconds

int

10

Period seconds for livenessProbe

livenessProbe.successThreshold

int

1

Success threshold for livenessProbe

livenessProbe.timeoutSeconds

int

5

Timeout seconds for livenessProbe

nameOverride

string

""

String to partially override template (maintains the release name)

nodeSelector

object

{}

Node labels for pod assignment

ollama.gpu.enabled

bool

false

Enable GPU integration

ollama.gpu.number

int

1

Specify the number of GPUs

ollama.gpu.nvidiaResource

string

"nvidia.com/gpu"

Only for NVIDIA cards; change to nvidia.com/mig-1g.10gb to use MIG slice

ollama.gpu.type

string

"nvidia"

GPU type: nvidia or amd. If ollama.gpu.enabled is enabled, the default value is nvidia. If set to amd, this adds the rocm suffix to the image tag if image.tag is not overridden, because AMD and CPU/CUDA use different images.

ollama.insecure

bool

false

Add insecure flag for pulling at container startup

ollama.models

list

[]

List of models to pull at container startup. The more you add, the longer the container takes to start if the models are not already present. Example: models: [llama2, mistral]

ollama.mountPath

string

""

Override ollama-data volume mount path, default: "/root/.ollama"

persistentVolume.accessModes

list

["ReadWriteOnce"]

Ollama server data Persistent Volume access modes. Must match those of existing PV or dynamic provisioner, see https://kubernetes.io/docs/concepts/storage/persistent-volumes/.

persistentVolume.annotations

object

{}

Ollama server data Persistent Volume annotations

persistentVolume.enabled

bool

false

Enable persistence using PVC

persistentVolume.existingClaim

string

""

If you want to bring your own PVC for persisting the Ollama state, pass the name of an existing, ready PVC here. If set, this chart does not create the default PVC. Requires server.persistentVolume.enabled: true.

persistentVolume.size

string

"30Gi"

Ollama server data Persistent Volume size

persistentVolume.storageClass

string

""

If persistentVolume.storageClass is present and set to either a dash (-) or an empty string (""), dynamic provisioning is disabled. Otherwise, the storageClassName for the persistent volume claim is set to the value of persistentVolume.storageClass. If persistentVolume.storageClass is absent, the default storage class is used for dynamic provisioning whenever possible. See https://kubernetes.io/docs/concepts/storage/storage-classes/ for more details.

persistentVolume.subPath

string

""

Subdirectory of Ollama server data Persistent Volume to mount. Useful if the volume's root directory is not empty.

persistentVolume.volumeMode

string

""

Ollama server data Persistent Volume Binding Mode. If empty (the default) or set to null, no volumeBindingMode specification is set, choosing the default mode.

persistentVolume.volumeName

string

""

Ollama server Persistent Volume name. It can be used to force-attach the created PVC to a specific PV.

podAnnotations

object

{}

Map of annotations to add to the pods

podLabels

object

{}

Map of labels to add to the pods

podSecurityContext

object

{}

Pod Security Context

readinessProbe.enabled

bool

true

Enable readinessProbe

readinessProbe.failureThreshold

int

6

Failure threshold for readinessProbe

readinessProbe.initialDelaySeconds

int

30

Initial delay seconds for readinessProbe

readinessProbe.path

string

"/"

Request path for readinessProbe

readinessProbe.periodSeconds

int

5

Period seconds for readinessProbe

readinessProbe.successThreshold

int

1

Success threshold for readinessProbe

readinessProbe.timeoutSeconds

int

3

Timeout seconds for readinessProbe

replicaCount

int

1

Number of replicas

resources.limits

object

{}

Pod limit

resources.requests

object

{}

Pod requests

runtimeClassName

string

""

Specify runtime class

securityContext

object

{}

Container Security Context

service.annotations

object

{}

Annotations to add to the service

service.nodePort

int

31434

Service node port when service type is NodePort

service.port

int

11434

Service port

service.type

string

"ClusterIP"

Service type

serviceAccount.annotations

object

{}

Annotations to add to the service account

serviceAccount.automount

bool

true

Whether to automatically mount a ServiceAccount's API credentials

serviceAccount.create

bool

true

Whether a service account should be created

serviceAccount.name

string

""

The name of the service account to use. If not set and create is true, a name is generated using the full name template.

tolerations

list

[]

Tolerations for pod assignment

topologySpreadConstraints

object

{}

Topology Spread Constraints for pod assignment

updateStrategy

object

{"type":""}

How to replace existing pods.

updateStrategy.type

string

""

Can be Recreate or RollingUpdate; default is RollingUpdate

volumeMounts

list

[]

Additional volumeMounts on the output Deployment definition

volumes

list

[]

Additional volumes on the output Deployment definition

4.5 Installing Open WebUI

Open WebUI is a Web-based user interface designed for interacting with AI models.

4.5.1 Details about the Open WebUI application

Before deploying Open WebUI, it is important to know more about the supported configurations and documentation. The following command provides the corresponding details:

helm show values oci://dp.apps.rancher.io/charts/open-webui

Alternatively, you can also refer to the Open WebUI Helm chart page on the SUSE Application Collection site at https://apps.rancher.io/applications/open-webui. It contains available versions and the link to pull the Open WebUI container image.

4.5.2 Open WebUI installation procedure

Tip
Tip

Before the installation, you need to get user access to the SUSE Application Collection, create a Kubernetes namespace, and log in to the Helm registry as described in Section 4.1, “Installation procedure”.

Requirements

To install Open WebUI, you need to have the following:

  • cert-manager installed in the cluster as described in Section 4.2, “Installing cert-manager”.

  • Milvus installed in the cluster as described in Section 4.3, “Installing Milvus”, because the examples below use it as the vector database.

  • Optionally, a stand-alone Ollama installed as described in Section 4.4, “Installing Ollama”, if you do not let Open WebUI install Ollama for you.

  1. Create the owui_custom_overrides.yaml file to override the values of the parent Helm chart. The file contains URLs for Milvus and Ollama and specifies whether a stand-alone Ollama deployment is used or whether Ollama is installed as part of the Open WebUI installation. Find more details in Section 4.5.5, “Examples of Open WebUI Helm chart override files”. For a list of all installation options with examples, refer to Section 4.5.6, “Values for the Open WebUI Helm chart”.

  2. Install the Open WebUI Helm chart using the owui_custom_overrides.yaml override file.

    > helm upgrade --install \
      open-webui oci://dp.apps.rancher.io/charts/open-webui \
      -n SUSE_AI_NAMESPACE \
      --version X.Y.Z -f owui_custom_overrides.yaml
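
    After the installation, verify that the Open WebUI pod is running. Once it is ready, the Web interface is reachable at the host name configured in the override file, suse-ollama-webui in the examples in Section 4.5.5.

    > kubectl get pods -n SUSE_AI_NAMESPACE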

4.5.3 Upgrading Open WebUI

To upgrade Open WebUI to a specific new version, run the following command:

> helm upgrade --install open-webui \
  oci://dp.apps.rancher.io/charts/open-webui \
  -n SUSE_AI_NAMESPACE \
  --version VERSION_NUMBER \
  -f owui_custom_overrides.yaml

To upgrade Open WebUI to the latest version, run the following command:

> helm upgrade --install open-webui \
  oci://dp.apps.rancher.io/charts/open-webui \
  -n SUSE_AI_NAMESPACE \
  -f owui_custom_overrides.yaml

4.5.4 Uninstalling Open WebUI

To uninstall Open WebUI, run the following command:

> helm uninstall open-webui -n SUSE_AI_NAMESPACE

4.5.5 Examples of Open WebUI Helm chart override files

To override the default values during the Helm chart installation or update, you can create an override YAML file with custom values. Then, apply these values by specifying the path to the override file with the -f option of the helm command.

Example 5: Open WebUI override file with Ollama included

The following override file installs Ollama during the Open WebUI installation. Replace SUSE_AI_NAMESPACE with your Kubernetes namespace.

global:
  imagePullSecrets:
  - application-collection
  
ollamaUrls:
- http://open-webui-ollama.SUSE_AI_NAMESPACE.svc.cluster.local:11434
persistence:
  enabled: true
  storageClass: local-path 1
ollama:
  enabled: true
  ingress:
    enabled: false
  defaultModel: "gemma:2b"
  ollama:
    models: 2
      - "gemma:2b"
      - "llama3.1"
    gpu: 3
      enabled: true
      type: 'nvidia'
      number: 1
    persistentVolume: 4
      enabled: true
      storageClass: local-path 5
pipelines:
  enabled: False
  persistence:
    storageClass: local-path 6
ingress:
  enabled: true
  class: ""
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  host: suse-ollama-webui 7
  tls: true
extraEnvVars:
- name: DEFAULT_MODELS 8
  value: "gemma:2b"
- name: DEFAULT_USER_ROLE
  value: "user"
- name: WEBUI_NAME
  value: "SUSE AI"
- name: GLOBAL_LOG_LEVEL
  value: INFO
- name: RAG_EMBEDDING_MODEL
  value: "sentence-transformers/all-MiniLM-L6-v2"
- name: VECTOR_DB
  value: "milvus"
- name: MILVUS_URI
  value: http://milvus.SUSE_AI_NAMESPACE.svc.cluster.local:19530
- name: INSTALL_NLTK_DATASETS 9
  value: "true"

2

Specifies that two large language models (LLM) will be loaded in Ollama when the container starts.

3

Enables GPU support for Ollama. The type must be nvidia because NVIDIA GPUs are the only supported devices. number must be between 1 and the number of NVIDIA GPUs present on the system.

4

Without the persistentVolume option enabled, changes made to Ollama, such as downloaded LLMs, are lost when the container is restarted.

1 5 6

Use local-path storage only for testing purposes. For production use, we recommend using a storage solution suitable for persistent storage, such as SUSE Storage.

8

Specifies the default LLM for Ollama.

7

Specifies the host name for the Open WebUI Web UI.

9

Installs the natural language toolkit (NLTK) datasets for Ollama. Refer to https://www.nltk.org/index.html for licensing information.

Example 6: Open WebUI override file with Ollama installed separately

The following override file installs Ollama separately from the Open WebUI installation. Replace SUSE_AI_NAMESPACE with your Kubernetes namespace.

global:
  imagePullSecrets:
  - application-collection
  
ollamaUrls:
- http://ollama.SUSE_AI_NAMESPACE.svc.cluster.local:11434
persistence:
  enabled: true
  storageClass: local-path 1
ollama:
  enabled: false
pipelines:
  enabled: False
  persistence:
    storageClass: local-path 2
ingress:
  enabled: true
  class: ""
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  host: suse-ollama-webui
  tls: true
extraEnvVars:
- name: DEFAULT_MODELS 3
  value: "gemma:2b"
- name: DEFAULT_USER_ROLE
  value: "user"
- name: WEBUI_NAME
  value: "SUSE AI"
- name: GLOBAL_LOG_LEVEL
  value: INFO
- name: RAG_EMBEDDING_MODEL
  value: "sentence-transformers/all-MiniLM-L6-v2"
- name: VECTOR_DB
  value: "milvus"
- name: MILVUS_URI
  value: http://milvus.SUSE_AI_NAMESPACE.svc.cluster.local:19530
- name: ENABLE_OTEL 4
  value: "true"
- name: OTEL_EXPORTER_OTLP_ENDPOINT 5
  value: http://opentelemetry-collector.observability.svc.cluster.local:4317 6

1 2

Use local-path storage only for testing purposes. For production use, we recommend using a storage solution suitable for persistent storage, such as SUSE Storage.

3

Specifies the default LLM for Ollama.

4 5

These values are optional, required only to receive telemetry data from Open WebUI.

6

The URL of the OpenTelemetry Collector installed by the user.

Example 7: Open WebUI override file with a connection to vLLM

The following example shows how to extend the extraEnvVars section of the Open WebUI override file to connect to vLLM. Replace SUSE_AI_NAMESPACE with your Kubernetes namespace.

Tip
Tip

Find more details about installing vLLM in Section 4.6, “Installing vLLM”.

extraEnvVars:
[...]
- name: OPENAI_API_BASE_URL
  value: "http://vllm-router-service.SUSE_AI_NAMESPACE.svc.cluster.local:80/v1"
- name: OPENAI_API_KEY
  value: "dummy" 1

1

Open WebUI will require you to provide the OpenAI API key.

If the Open WebUI installation has pipelines enabled besides the vLLM deployment, you can extend the extraEnvVars section as follows.

extraEnvVars:
[...]
- name: OPENAI_API_BASE_URLS
  value: "http://open-webui-pipelines.SUSE_AI_NAMESPACE.svc.cluster.local:9099;http://vllm-router-service.SUSE_AI_NAMESPACE.svc.cluster.local:80/v1"
- name: OPENAI_API_KEYS
  value: "0p3n-w3bu!;dummy"

4.5.6 Values for the Open WebUI Helm chart

To override the default values during the Helm chart installation or update, you can create an override YAML file with custom values. Then, apply these values by specifying the path to the override file with the -f option of the helm command.

Table 2: Available options for the Open WebUI Helm chart

Key

Type

Default

Description

affinity

object

{}

Affinity for pod assignment

annotations

object

{}

cert-manager.enabled

bool

true

clusterDomain

string

"cluster.local"

Value of cluster domain

containerSecurityContext

object

{}

Configure container security context, see https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-container.

extraEnvVars

list

[{"name":"OPENAI_API_KEY", "value":"0p3n-w3bu!"}]

Environment variables added to the Open WebUI deployment. Most up-to-date environment variables can be found in https://docs.openwebui.com/getting-started/env-configuration/.

extraEnvVars[0]

object

{"name":"OPENAI_API_KEY","value":"0p3n-w3bu!"}

Default API key value for Pipelines. It should be updated in a production deployment and changed to the required API key if not using Pipelines.

global.imagePullSecrets

list

[]

Global override for container image registry pull secrets

global.imageRegistry

string

""

Global override for container image registry

global.tls.additionalTrustedCAs

bool

false

global.tls.issuerName

string

"suse-private-ai"

global.tls.letsEncrypt.email

string

"none@example.com"

global.tls.letsEncrypt.environment

string

"staging"

global.tls.letsEncrypt.ingress.class

string

""

global.tls.source

string

"suse-private-ai"

The source of Open WebUI TLS keys, see Section 4.5.6.1, “TLS sources”.

image.pullPolicy

string

"IfNotPresent"

Image pull policy to use for the Open WebUI container

image.registry

string

"dp.apps.rancher.io"

Image registry to use for the Open WebUI container

image.repository

string

"containers/open-webui"

Image repository to use for the Open WebUI container

image.tag

string

"0.3.32"

Image tag to use for the Open WebUI container

imagePullSecrets

list

[]

Configure imagePullSecrets to use private registry, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry.

ingress.annotations

object

{"nginx.ingress.kubernetes.io/ssl-redirect":"true"}

Use appropriate annotations for your Ingress controller, such as nginx.ingress.kubernetes.io/rewrite-target: / for NGINX.

ingress.class

string

""

ingress.enabled

bool

true

ingress.existingSecret

string

""

ingress.host

string

""

ingress.tls

bool

true

nameOverride

string

""

nodeSelector

object

{}

Node labels for pod assignment

ollama.enabled

bool

true

Automatically install Ollama Helm chart from https://otwld.github.io/ollama-helm/. Configure the following Helm values.

ollama.fullnameOverride

string

"open-webui-ollama"

If enabling embedded Ollama, update fullnameOverride to your desired Ollama name value, or else it will use the default ollama.name value from the Ollama chart.

ollamaUrls

list

[]

A list of Ollama API endpoints. These can be added instead of automatically installing the Ollama Helm chart, or in addition to it.

openaiBaseApiUrl

string

""

OpenAI base API URL to use. Defaults to the Pipelines service endpoint when Pipelines are enabled, or to https://api.openai.com/v1 if Pipelines are not enabled and this value is blank.

persistence.accessModes

list

["ReadWriteOnce"]

If using multiple replicas, you must update accessModes to ReadWriteMany.

persistence.annotations

object

{}

persistence.enabled

bool

true

persistence.existingClaim

string

""

Use existingClaim to reuse an existing Open WebUI PVC instead of creating a new one.

persistence.selector

object

{}

persistence.size

string

"2Gi"

persistence.storageClass

string

""

pipelines.enabled

bool

false

Automatically install Pipelines chart to extend Open WebUI functionality using Pipelines, see https://github.com/open-webui/pipelines.

pipelines.extraEnvVars

list

[]

This section can be used to pass the required environment variables to your pipelines (such as the Langfuse host name).

podAnnotations

object

{}

podSecurityContext

object

{}

Configure pod security context, see https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-container.

replicaCount

int

1

resources

object

{}

service

object

{"annotations":{},"containerPort":8080, "labels":{},"loadBalancerClass":"", "nodePort":"","port":80,"type":"ClusterIP"}

Service values to expose Open WebUI pods to cluster

tolerations

list

[]

Tolerations for pod assignment

topologySpreadConstraints

list

[]

Topology Spread Constraints for pod assignment

4.5.6.1 TLS sources

There are three recommended ways for Open WebUI to obtain TLS certificates for secure communication.

Self-Signed TLS certificate

This is the default method. You need to install cert-manager on the cluster to issue and maintain the certificates. This method generates a CA and signs the Open WebUI certificate using the CA. cert-manager then manages the signed certificate.

For this method, use the following Helm chart option:

global.tls.source=suse-private-ai
Let's Encrypt

This method also uses cert-manager, but it is combined with a special issuer for Let's Encrypt that performs all actions—including request and validation—to get the Let's Encrypt certificate issued. This configuration uses HTTP validation (HTTP-01) and therefore the load balancer must have a public DNS record and be accessible from the Internet.

For this method, use the following Helm chart option:

global.tls.source=letsEncrypt
Provide your own certificate

This method allows you to bring your own signed certificate to secure the HTTPS traffic. In this case, you must upload this certificate and associated key as PEM-encoded files named tls.crt and tls.key.

For this method, use the following Helm chart option:

global.tls.source=secret
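
For example, assuming the certificate and key are available locally as tls.crt and tls.key, you can create the corresponding Kubernetes TLS secret with kubectl. The secret name below is a placeholder; it must match the secret referenced by the chart, for example, via the ingress.existingSecret value.

> kubectl create secret tls OPEN_WEBUI_TLS_SECRET \
  --cert=tls.crt --key=tls.key \
  -n SUSE_AI_NAMESPACE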

4.6 Installing vLLM

vLLM is an open-source high-performance inference and serving engine for large language models (LLMs). It is designed to maximize throughput and reduce latency by using an efficient memory management system that handles dynamic batching and streaming outputs. In short, vLLM makes running LLMs cheaper and faster in production.

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using its Helm chart, which is part of AI Library. The Helm chart deploys the full vLLM production stack and enables you to run optimized LLM inference workloads on NVIDIA GPU in your Kubernetes cluster. It consists of the following components:

  • Serving Engine runs the model inference.

  • Router handles OpenAI-compatible API requests.

  • LMCache (optional) improves caching efficiency.

  • CacheServer (optional) is a distributed KV cache back-end.

4.6.1 Details about the vLLM application

Before deploying vLLM, it is important to know more about the supported configurations and documentation. The following command provides the corresponding details:

helm show values oci://dp.apps.rancher.io/charts/vllm

Alternatively, you can also refer to the vLLM Helm chart page on the SUSE Application Collection site at https://apps.rancher.io/applications/vllm. It contains vLLM dependencies, available versions and the link to pull the vLLM container image.

4.6.2 vLLM installation procedure

Tip
Tip

Before the installation, you need to get user access to the SUSE Application Collection, create a Kubernetes namespace, and log in to the Helm registry as described in Section 4.1, “Installation procedure”.

Warning
Warning: NVIDIA GPUs required

NVIDIA GPUs must be available in your Kubernetes cluster to successfully deploy and run vLLM.

Important
Important: Limitation

The current release of SUSE AI vLLM does not support Ray and LoraController.

  1. Create a vllm_custom_overrides.yaml file to override the default values of the Helm chart. Find examples of override files in Section 4.6.6, “Examples of vLLM Helm chart override files”.

  2. After saving the override file as vllm_custom_overrides.yaml, apply its configuration with the following command.

    > helm upgrade --install \
      vllm oci://dp.apps.rancher.io/charts/vllm \
      -n SUSE_AI_NAMESPACE \
      -f vllm_custom_overrides.yaml

4.6.3 Integrating vLLM with Open WebUI

You can integrate vLLM with Open WebUI either by using the Open WebUI Web user interface, or by updating the Open WebUI override file during the Open WebUI deployment (see Example 7, “Open WebUI override file with a connection to vLLM”).

Procedure 1: Integrating vLLM with Open WebUI via the Web user interface
Requirements
  • You must have Open WebUI administrator privileges to access configuration screens or settings mentioned in this section.

  1. In the bottom left of the Open WebUI window, click your avatar icon to open the user menu and select Admin Panel.

  2. Click the Settings tab and select Connections from the left menu.

  3. In the Manage OpenAI API Connections section, add a new connection URL to the vLLM router service, for example:

    http://vllm-router-service.SUSE_AI_NAMESPACE.svc.cluster.local:80/v1

    Confirm with Save.

    A screenshot of the Open WebUI user interface for adding a new connection to vLLM
    Figure 20: Adding a vLLM connection to Open WebUI

4.6.4 Upgrading vLLM

The vLLM chart receives application updates and updates of the Helm chart templates. New versions may include changes that require manual steps. These steps are listed in the corresponding README file. All vLLM dependencies are updated automatically during a vLLM upgrade.

To upgrade vLLM, identify the new version number and run the following command:

> helm upgrade --install \
  vllm oci://dp.apps.rancher.io/charts/vllm \
  -n SUSE_AI_NAMESPACE \
  --version VERSION_NUMBER \
  -f vllm_custom_overrides.yaml
Tip
Tip

If you omit the --version option, vLLM gets upgraded to the latest available version.

Note
Note: Rolling update

The helm upgrade command performs a rolling update on Deployments or StatefulSets with the following conditions:

  • The old pod stays running until the new pod passes readiness checks.

  • If the cluster is already at GPU capacity, the new pod cannot start because there is no GPU left to schedule it. This requires patching the deployment using the Recreate update strategy. The following commands identify the vLLM deployment name and patch its deployment.

    > kubectl get deployments -n SUSE_AI_NAMESPACE
    > kubectl patch deployment VLLM_DEPLOYMENT_NAME \
      -n SUSE_AI_NAMESPACE \
      -p '{"spec": {"strategy": {"type": "Recreate", "rollingUpdate": null}}}'

4.6.5 Uninstalling vLLM

To uninstall vLLM, run the following command:

> helm uninstall vllm -n SUSE_AI_NAMESPACE

4.6.6 Examples of vLLM Helm chart override files

To override the default values during the Helm chart installation or update, you can create an override YAML file with custom values. Then, apply these values by specifying the path to the override file with the -f option of the helm command.

Example 8: Minimal configuration

The following override file installs vLLM using a model that is publicly available.

global:
  imagePullSecrets:
  - application-collection
servingEngineSpec:
  modelSpec:
  - name: "phi3-mini-4k"
    registry: "dp.apps.rancher.io"
    repository: "containers/vllm-openai"
    tag: "0.9.1"
    imagePullPolicy: "IfNotPresent"
    modelURL: "microsoft/Phi-3-mini-4k-instruct"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
Procedure 2: Validating the installation
  • Pulling the images can take a long time. You can monitor the status of the vLLM installation by running the following command:

    > kubectl get pods -n SUSE_AI_NAMESPACE
      
    NAME                                           READY   STATUS    RESTARTS   AGE
    [...]
    vllm-deployment-router-7588bf995c-5jbkf        1/1     Running   0          8m9s
    vllm-phi3-mini-4k-deployment-vllm-79d6fdc-tx7  1/1     Running   0          8m9s

    Pods for the vLLM deployment should reach the Running status and report all containers as Ready.

Procedure 3: Validating the stack
  1. Expose the vllm-router-service port to the host machine:

    > kubectl port-forward svc/vllm-router-service \
      -n SUSE_AI_NAMESPACE 30080:80
  2. Query the OpenAI-compatible API to list the available models:

    > curl -o- http://localhost:30080/v1/models
  3. Send a query to the OpenAI /completion endpoint to generate a completion for a prompt:

    > curl -X POST http://localhost:30080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 10
      }'
    
    # example output of generated completions
    {
        "id": "cmpl-3dd11a3624654629a3828c37bac3edd2",
        "object": "text_completion",
        "created": 1757530703,
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "choices": [
            {
                "index": 0,
                "text": " in a bustling city full of concrete and",
                "logprobs": null,
                "finish_reason": "length",
                "stop_reason": null,
                "prompt_logprobs": null
            }
        ],
        "usage": {
            "prompt_tokens": 5,
            "total_tokens": 15,
            "completion_tokens": 10,
            "prompt_tokens_details": null
        },
        "kv_transfer_params": null
    }
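  4. Optionally, you can also query the OpenAI-compatible /v1/chat/completions endpoint in a similar way, for example:

    > curl -X POST http://localhost:30080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 20
      }'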
Example 9: Basic configuration

The following vLLM override file includes basic configuration options.

Prerequisites
  • Access to a Hugging Face token (HF_TOKEN).

  • The model meta-llama/Llama-3.1-8B-Instruct from this example is a gated model that requires you to accept the agreement to access it. For more information, see https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

  • The runtimeClassName specified here is nvidia.

  • Update the storageClass: entry for each modelSpec.

# vllm_custom_overrides.yaml
global:
  imagePullSecrets:
  - application-collection
servingEngineSpec:
  runtimeClassName: "nvidia"
  modelSpec:
  - name: "llama3" 1
    registry: "dp.apps.rancher.io" 2
    repository: "containers/vllm-openai" 3
    tag: "0.9.1" 4
    imagePullPolicy: "IfNotPresent"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct" 5
    replicaCount: 1 6
    requestCPU: 10 7
    requestMemory: "16Gi" 8
    requestGPU: 1 9
    storageClass: STORAGE_CLASS
    pvcStorage: "50Gi" 10
    pvcAccessMode:
      - ReadWriteOnce

    vllmConfig:
      enableChunkedPrefill: false 11
      enablePrefixCaching: false 12
      maxModelLen: 4096 13
      dtype: "bfloat16" 14
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"] 15

    hf_token: HF_TOKEN 16

1

The unique identifier for your model deployment.

2

The Docker image registry containing the model's serving engine image.

3

The Docker image repository containing the model's serving engine image.

4

The version of the model image to use.

5

The URL pointing to the model on Hugging Face or another hosting service.

6

The number of replicas for the deployment, which allows scaling for load.

7

The amount of CPU resources requested per replica.

8

Memory allocation for the deployment. Sufficient memory is required to load the model.

9

The number of GPUs to allocate for the deployment.

10

The Persistent Volume Claim (PVC) size for model storage.

11

Enables chunked prefill, which splits long prompt prefills into smaller chunks that can be batched together with decode requests to improve scheduling and latency.

12

Enables caching of prompt prefixes to speed up inference for repeated prompts.

13

The maximum sequence length the model can handle.

14

The data type for model weights, such as bfloat16 for mixed-precision inference and faster performance on modern GPUs.

15

Additional command-line arguments for vLLM, such as disabling request logging or setting GPU memory utilization.

16

Your Hugging Face token for accessing gated models. Replace HF_TOKEN with your actual token.
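
To apply this override file, pass it to the helm command with the -f option during installation or upgrade, for example:

> helm upgrade --install \
  vllm oci://dp.apps.rancher.io/charts/vllm \
  -n SUSE_AI_NAMESPACE \
  -f vllm_custom_overrides.yaml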

Example 10: Loading prefetched models from persistent storage

Prefetching models to a Persistent Volume Claim (PVC) prevents repeated downloads from Hugging Face during pod startup. The process involves creating a PVC and a job to fetch the model. This PVC is mounted at /models, where the prefetch job stores the model weights. Subsequently, the vLLM modelURL is set to this path, which ensures that the model is loaded locally instead of being downloaded when the pod starts.

  1. Define a PVC for model weights using the following YAML specification.

    # pvc-models.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: models-pvc
      namespace: SUSE_AI_NAMESPACE
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi # Adjust size based on your model
      storageClassName: STORAGE_CLASS

    Save it as pvc-models.yaml and apply with kubectl apply -f pvc-models.yaml.

  2. Create a secret resource for the Hugging Face token.

    > kubectl create secret -n SUSE_AI_NAMESPACE \
      generic huggingface-credentials \
      --from-literal=HUGGING_FACE_HUB_TOKEN=HF_TOKEN
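
    You can verify that the secret was created, for example, by running:

    > kubectl get secret huggingface-credentials -n SUSE_AI_NAMESPACE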
  3. Create a YAML specification for prefetching the model and save it as job-prefetch-llama3.1-8b.yaml.

    # job-prefetch-llama3.1-8b.yaml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: prefetch-llama3.1-8b
      namespace: SUSE_AI_NAMESPACE
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: hf-download
            image: python:3.10-slim
            env:
            - name: HF_TOKEN
              valueFrom: { secretKeyRef: { name: huggingface-credentials, key: HUGGING_FACE_HUB_TOKEN } }
            - name: HF_HUB_ENABLE_HF_TRANSFER
              value: "1"
            - name: HF_HUB_DOWNLOAD_TIMEOUT
              value: "60"
            command: ["bash","-lc"]
            args:
            - |
              set -e
              echo "Installing Hugging Face CLI..."
              pip install "huggingface_hub[cli]"
              pip install "hf_transfer"
              echo "Logging in..."
              hf auth login --token "${HF_TOKEN}"
              echo "Downloading Llama 3.1 8B Instruct to /models/llama-3.1-8b-it ..."
              hf download meta-llama/Llama-3.1-8B-Instruct --local-dir /models/llama-3.1-8b-it
            volumeMounts:
            - name: models
              mountPath: /models
          volumes:
          - name: models
            persistentVolumeClaim:
              claimName: models-pvc

    Apply the specification with the following commands:

    > kubectl apply -f job-prefetch-llama3.1-8b.yaml
    > kubectl -n SUSE_AI_NAMESPACE \
      wait --for=condition=complete job/prefetch-llama3.1-8b
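
    If you want to monitor the download progress or troubleshoot a failed run, you can, for example, inspect the job logs:

    > kubectl logs -n SUSE_AI_NAMESPACE job/prefetch-llama3.1-8b -f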
  4. Update the custom vLLM override file with support for PVC.

    # vllm_custom_overrides.yaml
    global:
      imagePullSecrets:
      - application-collection
    servingEngineSpec:
      runtimeClassName: "nvidia"
      modelSpec:
      - name: "llama3"
        registry: "dp.apps.rancher.io"
        repository: "containers/vllm-openai"
        tag: "0.9.1"
        imagePullPolicy: "IfNotPresent"
        modelURL: "/models/llama-3.1-8b-it"
        replicaCount: 1
    
        requestCPU: 10
        requestMemory: "16Gi"
        requestGPU: 1
    
        extraVolumes:
          - name: models-pvc
            persistentVolumeClaim:
              claimName: models-pvc 1
    
        extraVolumeMounts:
          - name: models-pvc
            mountPath: /models 2
    
        vllmConfig:
          maxModelLen: 4096
    
        hf_token: HF_TOKEN

    1

    Specify your PVC name.

    2

    The mount path must match the base directory of the servingEngineSpec.modelSpec.modelURL value specified above.

    Save it as vllm_custom_overrides.yaml and apply it by running the helm upgrade command with the -f option, as described in Section 4.6.4, “Upgrading vLLM”.

  5. Verify that the prefetched model is available by listing the contents of the mounted PVC in the vLLM pod, for example:

    > kubectl exec -it vllm-llama3-deployment-vllm-858bd967bd-w26f7 \
      -n SUSE_AI_NAMESPACE -- ls -l /models
    drwxr-xr-x 1 root root 608 Aug 22 16:29 llama-3.1-8b-it
Example 11: Configuration with multiple models

This example shows how to configure multiple models to run on different GPUs. Remember to update the entries hf_token and storageClass.

Note
Note: Ray is not supported

Ray is currently not supported. Therefore, you cannot shard a single large model across multiple GPUs.

# vllm_custom_overrides.yaml
global:
  imagePullSecrets:
  - application-collection
servingEngineSpec:
  modelSpec:
  - name: "llama3"
    registry: "dp.apps.rancher.io"
    repository: "containers/vllm-openai"
    tag: "0.9.1"
    imagePullPolicy: "IfNotPresent"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    storageClass: STORAGE_CLASS
    vllmConfig:
      maxModelLen: 4096
    hf_token: HF_TOKEN_FOR_LLAMA_31

  - name: "mistral"
    registry: "dp.apps.rancher.io"
    repository: "containers/vllm-openai"
    tag: "0.9.1"
    imagePullPolicy: "IfNotPresent"
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    storageClass: STORAGE_CLASS
    vllmConfig:
      maxModelLen: 4096
    hf_token: HF_TOKEN_FOR_MISTRAL
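
After both deployments are running, you can verify that the router exposes both models, for example, by port-forwarding the vllm-router-service as shown in Procedure 3, “Validating the stack” and listing the models:

> curl -o- http://localhost:30080/v1/models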
Example 12: CPU offloading

This example demonstrates how to enable KV cache offloading to the CPU using LMCache in a vLLM deployment. You enable LMCache and set the CPU offloading buffer size using the lmcacheConfig field. In the following override file, the buffer size is set to 20 GB, but you can adjust this value based on your workload. Note that lmcacheConfig.enabled is set to false in the example; set it to true to activate the offloading (see the warning below). Remember to update the hf_token and storageClass entries.

Warning
Warning: Experimental features

Setting lmcacheConfig.enabled to true implicitly enables the LMCACHE_USE_EXPERIMENTAL flag for LMCache. These experimental features are only supported on newer GPU generations. It is not recommended to enable them without a compelling reason.

# vllm_custom_overrides.yaml
global:
  imagePullSecrets:
  - application-collection
servingEngineSpec:
  runtimeClassName: "nvidia"
  modelSpec:
  - name: "mistral"
    registry: "dp.apps.rancher.io"
    repository: "containers/lmcache-vllm-openai"
    tag: "0.3.2"
    imagePullPolicy: "IfNotPresent"
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "40Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    storageClass: STORAGE_CLASS
    pvcAccessMode:
      - ReadWriteOnce
    vllmConfig:
      maxModelLen: 32000

    lmcacheConfig:
      enabled: false
      cpuOffloadingBufferSize: "20"

    hf_token: HF_TOKEN
Example 13: Shared remote KV cache storage with LMCache

This example shows how to enable shared remote KV cache storage using LMCache in a vLLM deployment. The configuration defines a cacheserverSpec and runs two serving replicas. Note that lmcacheConfig.enabled is set to false in the example; set it to true to activate LMCache (see the warning below). Remember to replace the placeholder values for hf_token and storageClass before applying the configuration.

Warning
Warning: Experimental features

Setting lmcacheConfig.enabled to true implicitly enables the LMCACHE_USE_EXPERIMENTAL flag for LMCache. These experimental features are only supported on newer GPU generations. It is not recommended to enable them without a compelling reason.

# vllm_custom_overrides.yaml
global:
  imagePullSecrets:
  - application-collection
servingEngineSpec:
  runtimeClassName: "nvidia"
  modelSpec:
  - name: "mistral"
    registry: "dp.apps.rancher.io"
    repository: "containers/lmcache-vllm-openai"
    tag: "0.3.2"
    imagePullPolicy: "IfNotPresent"
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
    replicaCount: 2
    requestCPU: 10
    requestMemory: "40Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    storageClass: STORAGE_CLASS
    vllmConfig:
      enablePrefixCaching: true
      maxModelLen: 16384
    lmcacheConfig:
      enabled: false
      cpuOffloadingBufferSize: "20"
    hf_token: HF_TOKEN
    initContainer:
      name: "wait-for-cache-server"
      image: "dp.apps.rancher.io/containers/lmcache-vllm-openai:0.3.2"
      command: ["/bin/sh", "-c"]
      args:
        - |
          timeout 60 bash -c '
          while true; do
            /opt/venv/bin/python3 /workspace/LMCache/examples/kubernetes/health_probe.py $(RELEASE_NAME)-cache-server-service $(LMCACHE_SERVER_SERVICE_PORT) && exit 0
            echo "Waiting for LMCache server..."
            sleep 2
          done'
cacheserverSpec:
  replicaCount: 1
  containerPort: 8080
  servicePort: 81
  serde: "naive"
  registry: "dp.apps.rancher.io"
  repository: "containers/lmcache-vllm-openai"
  tag: "0.3.2"
  resources:
    requests:
      cpu: "4"
      memory: "8G"
    limits:
      cpu: "4"
      memory: "10G"
  labels:
    environment: "cacheserver"
    release: "cacheserver"
routerSpec:
  resources:
    requests:
      cpu: "1"
      memory: "2G"
    limits:
      cpu: "1"
      memory: "2G"
  routingLogic: "session"
  sessionKey: "x-user-id"
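
With routingLogic set to session, the router keeps requests that carry the same value of the configured sessionKey header on the same serving replica, so repeated requests from one user can reuse that replica's caches. A minimal sketch of such a request, assuming the router service is port-forwarded as in Procedure 3, “Validating the stack” and USER_123 is a placeholder user identifier:

> curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -H "x-user-id: USER_123" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'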

4.7 Installing AI Library components using SUSE AI Deployer

SUSE AI Deployer consists of a meta Helm chart that takes care of downloading and installing individual AI Library components required by SUSE AI on a Kubernetes cluster.

The following procedure describes how to customize and use the SUSE AI Deployer to install AI Library components. It assumes that you already completed steps described in Section 4.1, “Installation procedure” including the installation of cert-manager.

  1. Pull the SUSE AI Deployer Helm chart, specifying the relevant chart version, and untar it. You can find the latest version of the chart on the SUSE Application Collection page at https://apps.rancher.io/applications/suse-ai-deployer.

    > helm pull oci://dp.apps.rancher.io/charts/suse-ai-deployer \
      --version 1.0.0 --untar
    > cd suse-ai-deployer
  2. Inspect the downloaded chart and its default values.

    > helm show chart .
    > helm show values .
    Tip
    Tip

    To see default values for the charts of the individual components within the meta chart, run the following commands.

    > helm show values charts/ollama/
    > helm show values charts/open-webui/
    > helm show values charts/milvus/
    > helm show values charts/pytorch
  3. Explore the downloaded example override files in the suse-ai-deployer/examples subdirectory. It typically includes the following files:

    suse-gen-ai-minimal.yaml

    Basic configuration to get started with GenAI. It deploys Ollama without GPU support, Open WebUI, and Milvus in stand-alone mode using local storage. PyTorch is disabled.

    suse-gen-ai.yaml

    Configuration optimized for production usage. It deploys Ollama with GPU support, Open WebUI, and Milvus in cluster mode using Longhorn storage. PyTorch is disabled.

    suse-ml-stack.yaml

    Basic configuration that enables deployment of PyTorch without GPU support, using Longhorn storage. It deploys PyTorch but disables Ollama, Open WebUI and Milvus.

  4. Create a custom-overrides.yaml override file based on one of the above examples. The examples use self-signed certificates for TLS communication. To use another option (see Section 4.5.6.1, “TLS sources”), copy the global section from the values.yaml file into your custom-overrides.yaml and update its tls section as needed.

  5. Install the SUSE AI Deployer Helm chart while overriding values from the custom-overrides.yaml file. Use the appropriate RELEASE_NAME and SUSE_AI_NAMESPACE based on the configuration in custom-overrides.yaml.

    > helm upgrade --install \
      RELEASE_NAME \
      --namespace SUSE_AI_NAMESPACE \
      --create-namespace \
      --values ./custom-overrides.yaml \
      --version 1.0.0 \
      oci://dp.apps.rancher.io/charts/suse-ai-deployer
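
    To monitor the installation progress, you can, for example, check the Helm release status and watch the pods in the namespace:

    > helm status RELEASE_NAME -n SUSE_AI_NAMESPACE
    > kubectl get pods -n SUSE_AI_NAMESPACE --watch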

5 Steps after the installation is complete

Once the SUSE AI installation is finished, perform the following tasks to complete the initial setup and configuration.

  1. Log in to SUSE AI Open WebUI using the default credentials.

  2. After you have logged in, update the administrator password for SUSE AI.

  3. From the available language models, configure the one you prefer. Optionally, install a custom language model. Refer to the sections Setting base AI models and Setting the default AI model for more details.

  4. Configure user management with role-based access control (RBAC) as described in https://documentation.suse.com/suse-ai/1.0/html/openwebui-configuring/index.html#openwebui-managing-user-roles.

  5. Integrate a single sign-on authentication manager, such as Okta, with Open WebUI as described in https://documentation.suse.com/suse-ai/1.0/html/openwebui-configuring/index.html#openwebui-authentication-via-okta.

  6. Configure retrieval-augmented generation (RAG) to let the model process content relevant to the customer.
