
Installing the NVIDIA GPU Operator

Publication Date: 09 Oct 2024

WHAT?

The NVIDIA GPU Operator simplifies the management of NVIDIA GPU resources in a Kubernetes cluster.

WHY?

To learn how to automate the installation, configuration and monitoring of NVIDIA GPU drivers, as well as associated components like CUDA.

EFFORT

Learning how to install the NVIDIA GPU Operator in a Kubernetes cluster takes less than 30 minutes and requires basic Linux administration skills.

1 Introduction to NVIDIA GPU Operator

This article describes what the NVIDIA GPU Operator is, how it works, and what the benefits of using it are.

1.1 What is the NVIDIA GPU Operator?

The NVIDIA GPU Operator is a Kubernetes operator that simplifies the management and deployment of NVIDIA GPU resources in a Kubernetes cluster. It automates the configuration and monitoring of NVIDIA GPU drivers, as well as associated components like CUDA, container runtimes, and other GPU-related software.

1.2 How does the NVIDIA GPU Operator work?

NVIDIA GPU Operator follows this workflow:

  1. Operator deployment. The NVIDIA GPU Operator is deployed as a Helm chart or using Kubernetes manifests.

  2. Node labeling & GPU discovery. Once installed, the operator deploys the GPU Feature Discovery (GFD) daemon, which scans the hardware on each node for NVIDIA GPUs. It labels nodes with GPU-specific information, making it easier for Kubernetes to schedule GPU workloads based on available hardware.

  3. NVIDIA driver installation. The operator ensures that the appropriate NVIDIA drivers are installed on the cluster nodes.

  4. NVIDIA Container Toolkit configuration. The operator installs and configures the NVIDIA Container Toolkit, which allows GPU-accelerated containers to run in Kubernetes.

  5. CUDA runtime and libraries. The operator ensures that the CUDA toolkit is properly installed, so that applications requiring CUDA work without manual intervention.

  6. Validation and health monitoring. After setting up the environment, the operator continuously monitors the health of the GPU resources. It also exposes health metrics for administrators to view and use for decision-making.

  7. Scheduling GPU workloads. Once the environment is configured, you can install workloads that require GPU acceleration. Kubernetes will use the node labels and available GPU resources to schedule these jobs on GPU-enabled nodes automatically.
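
The scheduling step above can be illustrated with a small test workload. The following is a minimal sketch, not part of the original procedure: it writes a pod manifest that requests one GPU through the nvidia.com/gpu resource and restricts the pod to labeled GPU nodes. The file name gpu-test-pod.yaml, the pod name, the sample image tag and the nvidia.com/gpu.present node label are assumptions; adjust them to your cluster.

```shell
#!/bin/sh
# Sketch: write a test pod manifest that requests one NVIDIA GPU.
# Assumptions (not from this guide): file name, pod name, sample image,
# and the nvidia.com/gpu.present node label.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia          # runtime class named "nvidia"
  nodeSelector:
    nvidia.com/gpu.present: "true"  # label expected on GPU nodes
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1           # ask the scheduler for one GPU
EOF
echo "Wrote gpu-test-pod.yaml"
```

Once the operator is installed, you could apply the manifest with kubectl apply -f gpu-test-pod.yaml and inspect the result with kubectl logs gpu-test.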

1.3 Benefits of using the NVIDIA GPU Operator

Using the NVIDIA GPU Operator has the following key benefits:

  • Automated setup. No manual runtime configuration is needed.

  • Cluster-wide management. Works across the entire Kubernetes cluster, scaling with node additions or removals.

  • Simplified updates. Automates updates of GPU-related components.

  • Optimized GPU usage. Ensures that GPU resources are efficiently allocated and used.

2 Installing NVIDIA GPU Operator on the RKE2 Kubernetes cluster

This topic describes how to install the NVIDIA GPU Operator on an RKE2 Kubernetes cluster.

Procedure 1: Installation of NVIDIA GPU Operator

Requirements

This guide assumes that you already have the following available:

  • A successfully deployed and fully operational RKE2 Kubernetes cluster.

  • At least one worker node with a compatible NVIDIA GPU and corresponding NVIDIA drivers installed.

  • The Helm package manager available on a host from which you operate the RKE2 cluster.

  1. Create a values.yaml configuration file and insert the following toolkit environment variables:

    toolkit:
      env:
      - name: CONTAINERD_CONFIG            # (1)
        value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
      - name: CONTAINERD_SOCKET            # (2)
        value: /run/k3s/containerd/containerd.sock
      - name: CONTAINERD_RUNTIME_CLASS     # (3)
        value: nvidia
      - name: CONTAINERD_SET_AS_DEFAULT    # (4)
        value: "true"

    (1) The path on the host to the containerd configuration that you want to have updated with support for the nvidia-container-runtime. It defaults to /etc/containerd/config.toml.

    (2) The path on the host to the socket file used to communicate with containerd. The NVIDIA GPU Operator uses it to send a SIGHUP signal to the containerd daemon to reload its configuration. It defaults to /run/containerd/containerd.sock.

    (3) The name of the Runtime Class that you want to associate with the nvidia-container-runtime. Pods launched with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS always run with the nvidia-container-runtime. The default CONTAINERD_RUNTIME_CLASS is nvidia.

    (4) A flag indicating whether you want to set nvidia-container-runtime as the default runtime used to launch all containers. When set to false, only containers in pods with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS run with the nvidia-container-runtime. The default value is true.

  2. Pass the values.yaml file to the following helm command:

    > helm install gpu-operator -n gpu-operator \
      --create-namespace nvidia/gpu-operator \
      --set driver.enabled=false -f values.yaml
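
The helm install command above refers to the chart as nvidia/gpu-operator, which presumes that a Helm repository named nvidia is already configured. If it is not, the following sketch shows one way to add it. The run helper and the DRY_RUN variable are hypothetical conveniences introduced here: by default the script only prints the commands, and setting DRY_RUN=0 executes them.

```shell
#!/bin/sh
# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Register NVIDIA's Helm repository under the name "nvidia" and refresh the index.
run helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
run helm repo update
# Confirm that the chart referenced by the install command is visible.
run helm search repo nvidia/gpu-operator
```

Printing the commands first makes it easy to review them before they touch the Helm configuration on the host from which you operate the cluster.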

Note: Installation without a YAML file

Alternatively, you can install the NVIDIA GPU Operator by specifying the above configuration options directly on the helm command line. In this case, you do not store the toolkit options in a YAML file.

> helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
    --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
    --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
    --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
    --set 'toolkit.env[2].value=nvidia' \
    --set 'toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT' \
    --set-string 'toolkit.env[3].value=true' \
    --set driver.enabled=false
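
After the installation, it can be useful to confirm that the toolkit registered the nvidia runtime in the containerd configuration named by CONTAINERD_CONFIG. The following is a rough sketch under one assumption: that the toolkit writes a runtime table whose header contains containerd.runtimes.nvidia, which is how containerd runtime entries are commonly laid out. The function name has_nvidia_runtime is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed if a containerd config file declares a runtime
# named "nvidia", for example a table header containing
# containerd.runtimes.nvidia.
has_nvidia_runtime() {
  grep -q 'containerd\.runtimes\.nvidia' "$1"
}

# Path taken from the CONTAINERD_CONFIG value used in this guide.
CONFIG=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

if [ -r "$CONFIG" ] && has_nvidia_runtime "$CONFIG"; then
  echo "nvidia runtime found in $CONFIG"
else
  echo "nvidia runtime not found (or $CONFIG is not readable)"
fi
```

Run the script on a GPU worker node with sufficient privileges to read the file. Listing the operator pods with kubectl get pods -n gpu-operator is a complementary check from the host where you operate the cluster.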