SUSE AI Requirements|GPU hardware for AI/ML workloads
Applies to SUSE AI 1.0

2 GPU hardware for AI/ML workloads

To run AI/ML workloads, such as training machine learning models or serving inference, equip cluster nodes with compatible NVIDIA GPUs to benefit from hardware acceleration.

2.1 Using the NVIDIA GPU Operator

Configuring and managing nodes with GPU hardware requires coordinating several software components, including drivers, container runtimes and libraries. To use NVIDIA GPUs in a Kubernetes cluster, configure the NVIDIA GPU Operator. Because the GPU is a special resource in the cluster, the following components must be installed before workloads can be scheduled on the GPU.

  • NVIDIA drivers (to enable CUDA)

  • Kubernetes device plug-in

  • Container runtime

  • Other tools to provide capabilities such as monitoring or automatic node labeling
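The NVIDIA GPU Operator deploys the components listed above for you. As a minimal sketch, assuming Helm is installed and your kubeconfig points at the target cluster, the operator can typically be installed from NVIDIA's Helm repository:

```shell
# Add NVIDIA's Helm repository and refresh the local chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace; the operator then rolls out
# the driver container, device plugin, container toolkit and monitoring tools
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace
```

Chart values (for example, to skip driver installation when drivers are preinstalled on the nodes) can be set with `--set` flags; consult the NVIDIA GPU Operator documentation for the options valid for your release.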

To ensure that the NVIDIA GPU Operator is installed correctly, the Kubernetes cluster must meet the following prerequisites:

  • All worker nodes must run the same operating system version to use the NVIDIA GPU Driver container.

  • Nodes must be configured with a container engine, such as Docker (CE/EE), containerd or Podman.

  • Nodes should be equipped with NVIDIA GPUs.

  • Nodes should have NVIDIA drivers installed, unless the driver container deployed by the operator is used instead.
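Once the prerequisites are met and the operator is running, the device plugin advertises GPUs to Kubernetes as the `nvidia.com/gpu` resource. A sketch of verifying and consuming that resource follows; the pod name and CUDA image tag are illustrative, not prescribed by this guide:

```shell
# List how many GPUs each node reports as allocatable
kubectl get nodes \
    -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

# A workload requests a GPU through its resource limits
# (pod name and image tag below are examples only)
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-check
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

If the pod completes and its log shows the `nvidia-smi` device table, the GPU stack is functioning end to end.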

2.2 Supported GPUs

The NVIDIA GPU Operator is compatible with a range of NVIDIA GPUs, including data center and selected workstation models. For the full list of supported GPUs, refer to the NVIDIA GPU Operator Platform Support documentation.