Installing NVIDIA GPU Drivers on SUSE Linux Micro
- WHAT?
NVIDIA GPU drivers make the full computing potential of the GPU available to the operating system and applications.
- WHY?
To learn how to install NVIDIA GPU drivers on SUSE Linux Micro 6.0 to fully use the computing power of the GPU in higher-level applications, such as AI workloads.
- EFFORT
Understanding the information in this article and installing NVIDIA GPU drivers on your SUSE Linux Micro host requires less than one hour of your time and basic Linux administration skills.
- GOAL
You can install NVIDIA GPU drivers on a SUSE Linux Micro host with a supported GPU card attached.
1 Installing NVIDIA GPU drivers on SUSE Linux Micro
1.1 Introduction
This guide demonstrates how to implement host-level NVIDIA GPU support via the open-driver on SUSE Linux Micro 6.0. The open-driver is part of the core SUSE Linux Micro package repositories. Therefore, there is no need to compile it or download executable packages. This driver is built into the operating system rather than dynamically loaded by the NVIDIA GPU Operator. This configuration is desirable for customers who want to pre-build all artifacts required for deployment into the image, and where the dynamic selection of the driver version via Kubernetes is not a requirement.
1.2 Requirements
This guide assumes that you have the following already available:
At least one host with SUSE Linux Micro 6.0 installed, physical or virtual.
Your hosts are attached to a subscription as this is required for package access.
A compatible NVIDIA GPU installed or fully passed through to the virtual machine in which SUSE Linux Micro is running.
Access to the root user. These instructions assume you are the root user and are not escalating your privileges via sudo.
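As a quick sanity check of the first requirement, you can confirm the installed OS version from /etc/os-release. This check is an addition to this guide, and the exact NAME and VERSION strings shown below are illustrative and may differ between builds:

# grep -E '^(NAME|VERSION)=' /etc/os-release
NAME="SL-Micro"
VERSION="6.0"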
1.3 Considerations before the installation
1.3.1 Select the driver generation
You must verify the driver generation for the NVIDIA GPU in your system. For modern GPUs, the G06 driver is the most common choice. Find more details in the support database.
This section details the installation of the G06 generation of the driver.
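If you are not sure which GPU your system has, you can list the NVIDIA PCI devices before selecting a driver generation. This quick check is an addition to this guide, and the device shown in the output is only an example:

# lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)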
1.3.2 Additional NVIDIA components
Besides the NVIDIA open-driver provided by SUSE as part of SUSE Linux Micro, you might also need additional NVIDIA components. These could include OpenGL libraries, CUDA toolkits, command-line utilities such as nvidia-smi, and container-integration components such as nvidia-container-toolkit. Many of these components are not shipped by SUSE because they are proprietary NVIDIA software. This section describes how to configure additional repositories that give you access to these components and provides examples of using these tools to achieve a fully functional system.
1.3.3 Package version mismatch
It is important to distinguish between SUSE and NVIDIA repositories. There can be a mismatch between the package versions that NVIDIA makes available versus what SUSE has built. This usually arises when SUSE makes a new version of the open-driver available, and it takes a couple of days before the equivalent packages are made available in NVIDIA repositories to match.
We recommend ensuring that the selected driver version is compatible with your GPU and meets any specific CUDA requirements by checking:
The CUDA release notes.
That the driver version you plan to deploy has a matching version in the NVIDIA SLE15-SP6 repository, and that equivalent package versions are available for the supporting components.
To find the NVIDIA open-driver versions, either run zypper se -s nvidia-open-driver on the target machine, or search the SUSE Customer Center for the “nvidia-open-driver” string in SLE Micro 6.0 for x86_64. At the time of writing, 550.54.14 was the newest available version.
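Example output of the zypper search is sketched below. The kernel flavor suffix and the repository name are illustrative and differ between systems:

# zypper se -s nvidia-open-driver
S | Name                                      | Type    | Version           | Arch   | Repository
--+-------------------------------------------+---------+-------------------+--------+-----------
  | nvidia-open-driver-G06-signed-kmp-default | package | 550.54.14_k6.4.0… | x86_64 | …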
1.4 The installation procedure
On the SUSE Linux Micro host, open up a transactional-update shell session to create a new read/write snapshot of the underlying operating system so that we can make changes to the immutable platform.

# transactional-update shell
When you are in the transactional-update shell session, add an additional package repository from NVIDIA. This allows us to pull in additional utilities, for example, nvidia-smi.

transactional update # zypper ar https://download.nvidia.com/suse/sle15sp6/ nvidia-sle15sp6-main
transactional update # zypper --gpg-auto-import-keys refresh
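Optionally, you can confirm that the repository was added before installing anything. This check is an addition to this guide; passing the alias to zypper lr prints the properties of that repository:

transactional update # zypper lr nvidia-sle15sp6-main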
You can then install the driver and nvidia-compute-utils for additional utilities that are useful for testing purposes.

transactional update # zypper install -y --auto-agree-with-licenses \
    nvidia-open-driver-G06-signed-kmp nvidia-compute-utils-G06
Note: Possible version mismatch
If the installation fails at this point, this might indicate a dependency mismatch between the selected driver version and what NVIDIA ships in their repositories. Refer to Section 1.3.3, “Package version mismatch” to verify that the versions match. If they do not, attempt to install a different driver version. For example, if the NVIDIA repositories have an earlier version, you can try specifying nvidia-open-driver-G06-signed-kmp=550.54.14 on your install command to select a version that aligns.
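A pinned installation might look like the following sketch; the version string is illustrative and must be available in both the SUSE and NVIDIA repositories:

transactional update # zypper install -y --auto-agree-with-licenses \
    nvidia-open-driver-G06-signed-kmp=550.54.14 nvidia-compute-utils-G06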
(Optional) If you are not using a supported GPU, you can see if the driver works by enabling support at the module level.

transactional update # sed -i '/NVreg_OpenRmEnableUnsupportedGpus/s/^#//g' /etc/modprobe.d/50-nvidia-default.conf
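To see what the sed command changed, you can inspect the file afterwards. Assuming the default layout of this file, the previously commented option line should now be active:

transactional update # grep NVreg_OpenRmEnableUnsupportedGpus /etc/modprobe.d/50-nvidia-default.conf
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1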
Exit the transactional-update session and reboot to the new snapshot that contains the changes you have made.

transactional update # exit
# reboot

After the system has rebooted, log back in and use the nvidia-smi tool to verify that the driver is loaded successfully and that it can both access and enumerate your GPUs.

# nvidia-smi
The output of this command should be similar to the following. In the example below, we have two GPUs.
Wed Oct 28 12:31:06 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB Off| 00000000:17:00.0 Off |                    0 |
| N/A   29C    P0    35W / 250W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB Off| 00000000:CA:00.0 Off |                    0 |
| N/A   30C    P0    33W / 250W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                GPU Memory   |
|        ID   ID                                                 Usage        |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
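For a compact listing of the detected GPUs, you can also run nvidia-smi -L; the UUIDs are elided in this sketch:

# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-…)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-…)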
1.5 Validation of the driver installation
Running the nvidia-smi command has verified that, at the host level, the NVIDIA device can be accessed and that the drivers are loading successfully. To validate that it is functioning, you need to verify that the GPU can take instructions from a user-space application, ideally via a container and through the CUDA library, as that is typically what a real workload would use. For this, we can make a further modification to the host OS by installing nvidia-container-toolkit.
Open another transactional-update shell.
# transactional-update shell
Install the nvidia-container-toolkit package from the NVIDIA Container Toolkit repository.

transactional update # zypper ar https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
transactional update # zypper --gpg-auto-import-keys install -y nvidia-container-toolkit

The nvidia-container-toolkit.repo file contains a stable repository, nvidia-container-toolkit, and an experimental repository, nvidia-container-toolkit-experimental. Use the stable repository for production use. The experimental repository is disabled by default.

Exit the transactional-update session and reboot to the new snapshot that contains the changes you have made.

transactional update # exit
# reboot

Verify that the system can successfully enumerate the devices using the NVIDIA Container Toolkit. The output should be verbose, with INFO and WARN messages, but no ERROR messages.
# nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
This ensures that any container started on the machine can use the discovered NVIDIA GPU devices.
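As an optional check, you can list the device names in the generated CDI specification. This step is an addition to this guide; the exact output depends on the toolkit version and the number of GPUs, and the sketch below assumes the two-GPU example used throughout:

# nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=all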
You can then run a Podman-based container. Doing this via podman gives you a good way of validating access to the NVIDIA device from within a container, which should give confidence for doing the same with Kubernetes at a later stage.

Give Podman access to the labeled NVIDIA devices that were taken care of by the previous command and simply run the bash command.

# podman run --rm --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    -it registry.suse.com/bci/bci-base:latest bash
You can now execute commands from within a temporary Podman container. It does not have access to your underlying system and is ephemeral—whatever you change in the container does not persist. Also, you cannot break anything on the underlying host.
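As a first quick check inside the container, running nvidia-smi usually works, because the generated CDI specification mounts the host's utilities into the container. This is an assumption about the default CDI behavior rather than a step from the original procedure:

# nvidia-smi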
Inside the container, install the required CUDA libraries. In the example below, we are installing CUDA 12.3 and pulling many examples, demos and development kits so you can fully validate the GPU.
# zypper ar http://developer.download.nvidia.com/compute/cuda/repos/sles15/x86_64/ cuda-sle15-sp6
# zypper in -y cuda-libraries-devel-12-3 cuda-minimal-build-12-3 cuda-demo-suite-12-3
Inside the container, run the deviceQuery CUDA example, which comprehensively validates GPU access via CUDA and from within the container itself.

# /usr/local/cuda-12/extras/demo_suite/deviceQuery
Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "NVIDIA A100-PCIE-40GB"
  CUDA Driver Version / Runtime Version          12.2 / 12.1
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40339 MBytes (42298834944 bytes)
  (108) Multiprocessors, ( 64) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 23 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: <snip to reduce output for multiple devices>
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

> Peer access from NVIDIA A100-PCIE-40GB (GPU0) -> NVIDIA A100-PCIE-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-PCIE-40GB (GPU1) -> NVIDIA A100-PCIE-40GB (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 2, Device0 = NVIDIA A100-PCIE-40GB, Device1 = NVIDIA A100-PCIE-40GB
Result = PASS

From inside the container, you can continue to run any other CUDA workload, such as compilers, to run further tests; one such check is sketched below. When done, you can exit from the container, noting that whatever you have installed in there will be lost and has not impacted the underlying operating system.
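The following sketch compiles and runs a tiny CUDA program with the nvcc compiler shipped in cuda-minimal-build. The file name is hypothetical, and the nvcc path assumes the same /usr/local/cuda-12 layout as the deviceQuery example above:

# cat > /tmp/axpy.cu <<'EOF'
// Minimal CUDA smoke test: computes y = a*x + y on the GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Unified memory keeps the example short; error handling is omitted.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    axpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);
    cudaDeviceSynchronize();
    // Every element should now be 3*1 + 2 = 5.
    printf("y[0] = %.1f (expected 5.0)\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
EOF
# /usr/local/cuda-12/bin/nvcc /tmp/axpy.cu -o /tmp/axpy
# /tmp/axpy
y[0] = 5.0 (expected 5.0)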
# exit
2 Legal Notice
Copyright © 2006–2024 SUSE LLC and contributors. All rights reserved.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled “GNU Free Documentation License”.
For SUSE trademarks, see https://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.
All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.