Installing the NVIDIA GPU Operator
- WHAT?
The NVIDIA GPU Operator simplifies the management of NVIDIA GPU resources in a Kubernetes cluster.
- WHY?
To learn how to automate the installation, configuration and monitoring of NVIDIA GPU drivers, as well as associated components like CUDA.
- EFFORT
Installing the NVIDIA GPU Operator in a Kubernetes cluster requires less than 30 minutes of your time and basic Linux administration skills.
1 Introduction to NVIDIA GPU Operator #
This article describes what the NVIDIA GPU Operator is, which NVIDIA GPU components it manages, and summarizes the benefits of using the NVIDIA GPU Operator.
1.1 What is the NVIDIA GPU Operator? #
The NVIDIA GPU Operator is a Kubernetes operator that simplifies the management and deployment of NVIDIA GPU resources in a Kubernetes cluster. It automates the configuration and monitoring of NVIDIA GPU drivers, as well as associated components like CUDA, container runtimes, and other GPU-related software.
1.2 How does the NVIDIA GPU Operator work? #
NVIDIA GPU Operator follows this workflow:
Operator deployment. The NVIDIA GPU Operator is deployed as a Helm chart or by using Kubernetes manifests.
Node labeling & GPU discovery. Once installed, the operator deploys the GPU Feature Discovery (GFD) daemon, which scans the hardware on each node for NVIDIA GPUs. It labels nodes with GPU-specific information, making it easier for Kubernetes to schedule GPU workloads based on available hardware.
NVIDIA driver installation. The operator ensures that the appropriate NVIDIA drivers are installed on the cluster nodes.
NVIDIA Container Toolkit configuration. The GPU operator installs and configures the NVIDIA Container Toolkit, which allows GPU-accelerated containers to run in Kubernetes.
CUDA runtime and libraries. The operator ensures that the CUDA toolkit is properly installed, making it easier for applications requiring CUDA to work seamlessly without manual intervention.
Validation and health monitoring. After setting up the environment, the operator continuously monitors the health of the GPU resources. It also exposes health metrics for administrators to view and use for decision-making.
Scheduling GPU workloads. Once the environment is configured, you can install workloads that require GPU acceleration. Kubernetes will use the node labels and available GPU resources to schedule these jobs on GPU-enabled nodes automatically.
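As a concrete illustration of the scheduling step above, a workload requests GPUs through the nvidia.com/gpu extended resource that the operator's device plugin advertises. The manifest below is a minimal sketch; the pod name and image tag are illustrative, and you should pick an image matching the CUDA version available on your nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd                # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    # Example CUDA sample image; substitute a tag suitable for your drivers
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1             # request one GPU from the device plugin
```

Kubernetes schedules this pod only onto a node whose allocatable resources include at least one nvidia.com/gpu.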
1.3 Benefits of using the NVIDIA GPU Operator #
Using the NVIDIA GPU Operator has the following key benefits:
Automated setup. No manual runtime configuration is needed.
Cluster-wide management. Works across the entire Kubernetes cluster, scaling with node additions or removals.
Simplified updates. Automates updates of GPU-related components.
Optimized GPU usage. Ensures that GPU resources are efficiently allocated and used.
2 Installing NVIDIA GPU Operator on the RKE2 Kubernetes cluster #
This topic describes how to install the NVIDIA GPU Operator on the RKE2 Kubernetes cluster.
This guide assumes that you already have the following available:
Successfully deployed and fully operational RKE2 Kubernetes cluster.
At least one worker node with a compatible NVIDIA GPU and corresponding NVIDIA drivers installed.
The Helm package manager available on a host from which you operate the RKE2 cluster.
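With Helm available, the NVIDIA Helm repository must be added once so that the nvidia/gpu-operator chart referenced below can be resolved. The following commands are a standard sketch of that step:

```shell
# Add the NVIDIA Helm repository and refresh the local chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```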
Create a values.yaml configuration file and insert the following toolkit environment variables:

```yaml
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
```
The toolkit environment variables have the following meaning:
CONTAINERD_CONFIG. The path on the host to the containerd configuration that you want updated with support for the nvidia-container-runtime. It defaults to /etc/containerd/config.toml.
CONTAINERD_SOCKET. The path on the host to the socket file used to communicate with containerd. The NVIDIA GPU Operator uses it to send a SIGHUP signal to the containerd daemon to reload its configuration. It defaults to /run/containerd/containerd.sock.
CONTAINERD_RUNTIME_CLASS. The name of the runtime class you would like to associate with the nvidia-container-runtime. Pods launched with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS always run with the nvidia-container-runtime. The default CONTAINERD_RUNTIME_CLASS is nvidia.
CONTAINERD_SET_AS_DEFAULT. A flag indicating whether you want to set nvidia-container-runtime as the default runtime used to launch all containers. When set to false, only containers in pods with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS run with the nvidia-container-runtime. The default value is true.
Pass the values.yaml file to the following helm command:
```shell
helm install gpu-operator -n gpu-operator \
  --create-namespace nvidia/gpu-operator \
  --set driver.enabled=false -f values.yaml
```
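After the installation completes, you can check that the operator is healthy before scheduling workloads. The commands below are a sketch; exact pod names vary with the operator version, and the node name is a placeholder.

```shell
# List the operator's pods; validator pods should eventually show Completed
kubectl get pods -n gpu-operator

# Confirm that GPUs are now advertised as allocatable node resources
kubectl describe node <gpu-node> | grep -i nvidia.com/gpu
```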
An alternative way of installing the NVIDIA GPU Operator is to specify the above configuration options directly on the helm command line. In this case, you do not store the toolkit options in the YAML file.
```shell
helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true \
  --set driver.enabled=false
```
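Whichever installation method you choose, a quick smoke test is to run nvidia-smi in a pod that uses the nvidia runtime class configured above. The pod name and image tag below are illustrative; use an image available in your registry.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test              # illustrative name
spec:
  restartPolicy: Never
  runtimeClassName: nvidia           # matches CONTAINERD_RUNTIME_CLASS above
  containers:
  - name: nvidia-smi
    # Example base image tag; substitute one matching your driver version
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If the container toolkit is wired up correctly, kubectl logs nvidia-smi-test prints the familiar nvidia-smi table listing the node's GPU.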
3 Legal Notice #
Copyright © 2006–2024 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.