
Installing NVIDIA GPU drivers on SUSE Linux Micro

Publication Date: 09 Oct 2024
WHAT?

NVIDIA GPU drivers allow the system and its applications to use the full potential of the GPU.

WHY?

To learn how to install NVIDIA GPU drivers on SUSE Linux Micro 6.0 so that higher-level applications, such as AI workloads, can fully use the computing power of the GPU.

EFFORT

Understanding the information in this article and installing the NVIDIA GPU drivers on your SUSE Linux Micro host requires less than one hour of your time and basic Linux administration skills.

GOAL

You are able to install the NVIDIA GPU drivers on a SUSE Linux Micro host with a supported GPU card attached.

1 Installing NVIDIA GPU drivers on SUSE Linux Enterprise Micro

1.1 Introduction

This guide demonstrates how to implement host-level NVIDIA GPU support via the open-driver on SUSE Linux Enterprise Micro 6.0. The open-driver is part of the core SUSE Linux Enterprise Micro package repositories, so there is no need to compile it or download executable packages. This driver is built into the operating system rather than dynamically loaded by the NVIDIA GPU Operator. This configuration is desirable for customers who want to pre-build all artifacts required for deployment into the image, and who do not require dynamic selection of the driver version via Kubernetes.

1.2 Requirements

This guide assumes that you already have the following available:

  • At least one host with SUSE Linux Enterprise Micro 6.0 installed, physical or virtual.

  • Your hosts are attached to a subscription as this is required for package access.

  • A compatible NVIDIA GPU installed or fully passed through to the virtual machine in which SUSE Linux Enterprise Micro is running.

  • Access to the root user—these instructions assume you are the root user, and not escalating your privileges via sudo.

1.3 Considerations before the installation

1.3.1 Select the driver generation

You must verify which driver generation is required for the NVIDIA GPU in your system. For modern GPUs, the G06 driver is the most common choice. Find more details in the support database.

This section details the installation of the G06 generation of the driver.
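
If you are not sure which NVIDIA GPU model the host has, you can list it before choosing a driver generation. This is a minimal check, assuming the pciutils package is available on the host; the reported model name can then be looked up in the support database.

  # lspci | grep -i nvidia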

1.3.2 Additional NVIDIA components

Besides the NVIDIA open-driver provided by SUSE as part of SUSE Linux Enterprise Micro, you might also need additional NVIDIA components. These could include OpenGL libraries, CUDA toolkits, command-line utilities such as nvidia-smi, and container-integration components such as nvidia-container-toolkit. Many of these components are not shipped by SUSE as they are proprietary NVIDIA software. Therefore, this section describes how to configure additional repositories that give you access to these components and walks through examples of how to use these tools, resulting in a fully functional system.

1.3.3 Package version mismatch

It is important to distinguish between SUSE and NVIDIA repositories. There can be a mismatch between the package versions that NVIDIA makes available versus what SUSE has built. This usually arises when SUSE makes a new version of the open-driver available, and it takes a couple of days before the equivalent packages are made available in NVIDIA repositories to match.

We recommend that you verify that the driver version you select is compatible with your GPU and meets any CUDA requirements you may have by checking:

  • The CUDA release notes.

  • That the driver version you plan to deploy has a matching version in the NVIDIA SLE15-SP6 repository, and that equivalent package versions for the supporting components are available.

Tip: Finding NVIDIA open-driver versions

To find the NVIDIA open-driver versions, either run zypper se -s nvidia-open-driver on the target machine or search the SUSE Customer Center for the nvidia-open-driver string in SLE Micro 6.0 for x86_64.

In the following example, you can see four versions available, with 550.54.14 being the newest:

Figure 1: SCC with SUSE Linux Enterprise Micro drivers selection

1.4 The installation procedure

  1. On the SUSE Linux Enterprise Micro host, open a transactional-update shell session to create a new read/write snapshot of the underlying operating system so that we can make changes to the immutable platform.

    # transactional-update shell
  2. When you are in the transactional-update shell session, add an additional package repository from NVIDIA. This allows us to pull in additional utilities, for example, nvidia-smi.

    transactional update # zypper ar https://download.nvidia.com/suse/sle15sp6/ nvidia-sle15sp6-main
    transactional update # zypper --gpg-auto-import-keys refresh
  3. You can then install the driver and the nvidia-compute-utils package, which provides additional utilities useful for testing purposes.

    transactional update # zypper install -y --auto-agree-with-licenses \
      nvidia-open-driver-G06-signed-kmp nvidia-compute-utils-G06
     Note: Possible version mismatch

     If the installation fails now, this might indicate a dependency mismatch between the selected driver version and what NVIDIA ships in their repositories. Refer to Section 1.3.3, “Package version mismatch” to verify that the versions match, and attempt to install a different driver version. For example, if the NVIDIA repositories have an earlier version, you can try specifying nvidia-open-driver-G06-signed-kmp=550.54.14 in your install command to select a version that aligns, as shown below.
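
     For example, a version-pinned install could look like the following sketch; 550.54.14 is just the version used as an example throughout this article, so substitute a version that both the SUSE and NVIDIA repositories actually provide.

     transactional update # zypper install -y --auto-agree-with-licenses \
       nvidia-open-driver-G06-signed-kmp=550.54.14 nvidia-compute-utils-G06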

  4. (Optional) If you are not using a supported GPU, you can check whether the driver works by enabling support for unsupported GPUs at the kernel module level.

    transactional update # sed -i '/NVreg_OpenRmEnableUnsupportedGpus/s/^#//g' /etc/modprobe.d/50-nvidia-default.conf
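
     To confirm the change, you can check that the option line in the file referenced above is no longer commented out, for example:

     transactional update # grep NVreg_OpenRmEnableUnsupportedGpus /etc/modprobe.d/50-nvidia-default.conf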
  5. Exit the transactional-update session and reboot to the new snapshot that contains the changes you have made.

    transactional update # exit
    # reboot
  6. After the system has rebooted, log back in and use the nvidia-smi tool to verify that the driver is loaded successfully and that it can both access and enumerate your GPUs.

    # nvidia-smi

     The output of this command should look similar to the following example, in which we have two GPUs.

    Wed Oct 28 12:31:06 2024
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.3     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name       Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |  0  NVIDIA A100-PCIE-40GB Off | 00000000:17:00.0 Off |                    0 |
      | N/A   29C    P0    35W / 250W |      4MiB / 40960MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+
      |  1  NVIDIA A100-PCIE-40GB Off | 00000000:CA:00.0 Off |                    0 |
      | N/A   30C    P0    33W / 250W |      4MiB / 40960MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+
      
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+

1.5 Validation of the driver installation

The nvidia-smi command has verified that, at the host level, the NVIDIA device can be accessed and that the drivers are loading successfully. To validate that the setup is fully functional, you need to verify that the GPU can take instructions from a user-space application, ideally via a container and through the CUDA library, as that is what a typical real workload uses. For this, you can make a further modification to the host OS by installing nvidia-container-toolkit.

  1. Open another transactional-update shell.

    #  transactional-update shell
  2. Install the nvidia-container-toolkit package from the NVIDIA Container Toolkit repository.

    transactional update # zypper ar https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
    transactional update # zypper --gpg-auto-import-keys install -y nvidia-container-toolkit

    The nvidia-container-toolkit.repo file contains a stable repository nvidia-container-toolkit and an experimental repository nvidia-container-toolkit-experimental. Use the stable repository for production use. The experimental repository is disabled by default.
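
     If you ever need the experimental packages, the repository can be enabled explicitly. The following is a sketch that assumes the repository alias matches the name quoted above; verify the actual alias with zypper repos first.

     transactional update # zypper repos
     transactional update # zypper modifyrepo --enable nvidia-container-toolkit-experimental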

  3. Exit the transactional-update session and reboot to the new snapshot that contains the changes you have made.

    transactional update # exit
    # reboot
  4. Verify that the system can successfully enumerate the devices using the NVIDIA Container Toolkit. The output should be verbose, with INFO and WARN messages, but no ERROR messages.

    # nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

     This ensures that any container started on the machine can use the discovered NVIDIA GPU devices.
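
     Optionally, you can list the devices recorded in the generated CDI specification. This assumes that your version of the NVIDIA Container Toolkit provides the nvidia-ctk cdi list subcommand:

     # nvidia-ctk cdi list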

  5. You can then run a Podman-based container. Doing this via Podman is a good way of validating access to the NVIDIA device from within a container, which should give you confidence for doing the same with Kubernetes at a later stage.

     Give Podman access to the labeled NVIDIA devices that were prepared by the previous command, and run the bash command.

    # podman run --rm --device nvidia.com/gpu=all \
      --security-opt=label=disable \
      -it registry.suse.com/bci/bci-base:latest bash

     You can now execute commands from within a temporary Podman container. It does not have access to your underlying system and is ephemeral: whatever you change in the container does not persist. Also, you cannot break anything on the underlying host.
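
     As a quick sanity check before installing anything, you can run nvidia-smi from inside the container. With the CDI configuration generated earlier, the host nvidia-smi utility and driver libraries should be mapped into the container; if the command is not available, revisit the CDI generation step.

     # nvidia-smi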

  6. Inside the container, install the required CUDA libraries. In the example below, we are installing CUDA 12.3 and pulling in many examples, demos and development kits so that you can fully validate the GPU.

    # zypper ar http://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/ cuda-sle15-sp6
    # zypper in -y cuda-libraries-devel-12-3 cuda-minimal-build-12-3 cuda-demo-suite-12-3
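
     You can optionally confirm that the CUDA compiler pulled in by cuda-minimal-build-12-3 is in place. The path below mirrors the demo-suite path used in the next step and may differ on your system.

     # /usr/local/cuda-12/bin/nvcc --version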
  7. Inside the container, run the deviceQuery CUDA example, which comprehensively validates GPU access via CUDA and from within the container itself.

    # /usr/local/cuda-12/extras/demo_suite/deviceQuery
    /usr/local/cuda-12/extras/demo_suite/deviceQuery Starting...
    CUDA Device Query (Runtime API) version (CUDART static linking)
    
    Detected 2 CUDA Capable device(s)
    
    Device 0: "NVIDIA A100-PCIE-40GB"
      CUDA Driver Version / Runtime Version          12.2 / 12.1
      CUDA Capability Major/Minor version number:    8.0
      Total amount of global memory:                 40339 MBytes (42298834944 bytes)
      (108) Multiprocessors, ( 64) CUDA Cores/MP:     6912 CUDA Cores
      GPU Max Clock rate:                            1410 MHz (1.41 GHz)
      Memory Clock rate:                             1215 Mhz
      Memory Bus Width:                              5120-bit
      L2 Cache Size:                                 41943040 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Device supports Compute Preemption:            Yes
      Supports Cooperative Kernel Launch:            Yes
      Supports MultiDevice Co-op Kernel Launch:      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 23 / 0
      Compute Mode:
          < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    
    Device 1: <snip to reduce output for multiple devices>
          < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from NVIDIA A100-PCIE-40GB (GPU0) -> NVIDIA A100-PCIE-40GB (GPU1) : Yes
    > Peer access from NVIDIA A100-PCIE-40GB (GPU1) -> NVIDIA A100-PCIE-40GB (GPU0) : Yes
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2, Device0 = NVIDIA A100-PCIE-40GB, Device1 = NVIDIA A100-PCIE-40GB
    Result = PASS

     From inside the container, you can continue to run other CUDA workloads, such as the bundled demos or the compiler, for further testing; one example is shown below.
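
     For instance, the demo suite installed earlier also ships a bandwidth test, which exercises host-to-device and device-to-device memory transfers. The path below follows the deviceQuery example above; adjust it if your CUDA installation uses a different prefix.

     # /usr/local/cuda-12/extras/demo_suite/bandwidthTest

     When done, you can exit from the container, noting that whatever you have installed in there will be lost and has not impacted the underlying operating system.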

    # exit