Installing the NVIDIA GPU Operator

Publication Date: 13 Feb 2025

WHAT?: The NVIDIA GPU Operator simplifies the management of NVIDIA GPU resources in a Kubernetes cluster.
WHY?: To learn how to simplify the installation, configuration and monitoring of NVIDIA GPU drivers, as well as associated components like CUDA.
EFFORT: Learning to install the NVIDIA GPU Operator in a Kubernetes cluster requires less than 30 minutes of your time and basic Linux administration skills.


1 Introduction to the NVIDIA GPU Operator

This article explains the NVIDIA GPU Operator, outlines the NVIDIA GPU components it manages, and summarizes the benefits of using it.

1.1 What is the NVIDIA GPU Operator?

The NVIDIA GPU Operator is a Kubernetes operator that simplifies the management and deployment of NVIDIA GPU resources in a Kubernetes cluster. It automates the configuration and monitoring of NVIDIA GPU drivers, as well as associated components like CUDA, container runtimes, and other GPU-related software.

1.2 How does the NVIDIA GPU Operator work?

The NVIDIA GPU Operator follows this workflow:

Operator deployment. The NVIDIA GPU Operator is deployed as a Helm chart or using Kubernetes manifests.
Node labeling & GPU discovery. Once installed, the operator deploys the GPU Feature Discovery (GFD) daemon, which scans the hardware on each node for NVIDIA GPUs. It labels nodes with GPU-specific information, so that Kubernetes can schedule GPU workloads based on the available hardware.
NVIDIA Container Toolkit configuration. The operator installs and configures the NVIDIA Container Toolkit, which allows GPU-accelerated containers to run in Kubernetes.
CUDA runtime and libraries. The operator ensures that the CUDA toolkit is installed, so that applications requiring CUDA run without manual setup.
Validation and health monitoring. After setting up the environment, the operator continuously monitors the health of the GPU resources. It also exposes health metrics for administrators to view and use for decision-making.
Scheduling GPU workloads. Once the environment is configured, you can deploy workloads that require GPU acceleration. Kubernetes uses the node labels and available GPU resources to schedule these workloads on GPU-enabled nodes automatically.
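
As a minimal sketch of the last step, the following shell snippet writes a pod manifest that requests one GPU through the `nvidia.com/gpu` resource. The pod name `gpu-test`, the file name `gpu-test-pod.yaml` and the CUDA sample image are assumptions chosen for illustration, not part of the operator itself. Substitute an image that is available in your environment.

```shell
# Write a hypothetical test pod manifest that requests one NVIDIA GPU.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    # Assumed sample image; replace it with any CUDA image you can pull.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1       # schedule only on a node with a free GPU
EOF
echo "Wrote gpu-test-pod.yaml"
```

Applying the file with `kubectl apply -f gpu-test-pod.yaml` and then reading `kubectl logs gpu-test` should show whether the container could access a GPU.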

1.3 Benefits of using the NVIDIA GPU Operator

Using the NVIDIA GPU Operator has the following key benefits:

Automated setup. No manual runtime configuration is needed.
Cluster-wide management. Works across the entire Kubernetes cluster, scaling with node additions or removals.
Simplified updates. Automates updates for GPU-related components.
Optimized GPU usage. Ensures that GPU resources are efficiently allocated and used.

2 Installing the NVIDIA GPU Operator on the SUSE Rancher Prime: RKE2 Kubernetes cluster

This topic describes how to install the NVIDIA GPU Operator on a SUSE Rancher Prime: RKE2 Kubernetes cluster.

Procedure 1: Installation of the NVIDIA GPU Operator

Requirements

This guide assumes that you already have the following:

A successfully deployed and fully operational SUSE Rancher Prime: RKE2 Kubernetes cluster.
At least one worker node with a compatible NVIDIA GPU and corresponding NVIDIA drivers installed.
The Helm package manager available on a host from which you operate the SUSE Rancher Prime: RKE2 cluster.

Create a values.yaml configuration file and insert the following toolkit environment variables.

toolkit:
  env:
  - name: CONTAINERD_CONFIG  # (1)
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET  # (2)
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS  # (3)
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT  # (4)
    value: "true"

(1) The path on the host to the containerd configuration that you want updated with support for the `nvidia-container-runtime`. It defaults to `/etc/containerd/config.toml`.
(2) The path on the host to the socket file used to communicate with containerd. The NVIDIA GPU Operator uses it to send a SIGHUP signal to the `containerd` daemon to reload its configuration. It defaults to `/run/containerd/containerd.sock`.
(3) The name of the Runtime Class that you want to associate with the `nvidia-container-runtime`. Pods launched with a `runtimeClassName` equal to `CONTAINERD_RUNTIME_CLASS` always run with the `nvidia-container-runtime`. The default `CONTAINERD_RUNTIME_CLASS` is `nvidia`.
(4) A flag indicating whether to set `nvidia-container-runtime` as the default runtime used to launch all containers. When set to `false`, only containers in pods with a `runtimeClassName` equal to `CONTAINERD_RUNTIME_CLASS` run with the `nvidia-container-runtime`. The default value is `true`.
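
Because the configuration path and the socket path above differ from the defaults, it may be worth confirming that they exist on the GPU worker node before installing. The following is a minimal sketch, not part of the operator tooling: the function name `check_containerd_paths` is hypothetical, and it checks only the directory that holds the configuration template, on the assumption that the `config.toml.tmpl` file itself may not exist yet.

```shell
# Report whether the containerd paths used in values.yaml exist on this node.
# Both arguments are optional and default to the RKE2 paths from this guide.
check_containerd_paths() {
  cfg_dir="${1:-/var/lib/rancher/rke2/agent/etc/containerd}"
  sock="${2:-/run/k3s/containerd/containerd.sock}"
  rc=0
  if [ -d "$cfg_dir" ]; then
    echo "ok: config directory $cfg_dir"
  else
    echo "missing: config directory $cfg_dir"
    rc=1
  fi
  if [ -S "$sock" ]; then
    echo "ok: socket $sock"
  else
    echo "missing: socket $sock"
    rc=1
  fi
  # Non-zero return value if at least one path is missing.
  return "$rc"
}
```

Run `check_containerd_paths` without arguments on the worker node, with sufficient privileges to read both locations. A non-zero return value indicates that at least one path needs attention.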

Pass the values.yaml file to the following helm command:

> helm install gpu-operator -n gpu-operator \
  --create-namespace nvidia/gpu-operator \
  --set driver.enabled=false -f values.yaml
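
The chart reference `nvidia/gpu-operator` resolves only if a Helm repository named `nvidia` is already registered on the host. If it is not, a sketch along the following lines could register it first. The wrapper function name `add_nvidia_repo` is hypothetical, and the repository URL is the public NVIDIA Helm repository, which is an assumption about your setup if you use a mirror or a private registry.

```shell
# Register the NVIDIA Helm repository so that the chart reference
# nvidia/gpu-operator resolves, then refresh the local chart index.
add_nvidia_repo() {
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia &&
    helm repo update
}
```

Call `add_nvidia_repo` once on the host where you run Helm, before the `helm install` command.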

Note: Installation without a YAML file

Alternatively, you can install the NVIDIA GPU Operator by passing the above configuration options directly on the helm command line instead of storing the toolkit options in a YAML file.

> helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
    --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[2].value=nvidia' \
    --set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
    --set-string 'toolkit.env[3].value=true' \
    --set driver.enabled=false
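
After either installation variant, a few read-only `kubectl` queries can indicate whether the deployment succeeded. The following is a sketch under assumptions rather than an official verification procedure: the function name `verify_gpu_operator` is hypothetical, and it assumes the `gpu-operator` namespace and the `nvidia` runtime class name used in this guide.

```shell
# Read-only checks after installing the NVIDIA GPU Operator.
verify_gpu_operator() {
  # Operator pods (the namespace matches the -n option used at install time).
  kubectl get pods -n gpu-operator
  # Runtime class named by CONTAINERD_RUNTIME_CLASS.
  kubectl get runtimeclass nvidia
  # GPUs each node advertises as allocatable; <none> means no GPU is exposed.
  kubectl get nodes -o \
    custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Run `verify_gpu_operator` from the host where `kubectl` is configured for the cluster. The operator pods can take several minutes to become ready, so repeat the check if some pods are still initializing.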

3 Legal Notice

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.

For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.