2 GPU hardware for AI/ML workloads
To accelerate AI/ML workloads, such as training machine learning models or serving inference, deploy cluster nodes equipped with compatible NVIDIA GPUs.
2.1 Using the NVIDIA GPU Operator
Configuring and managing nodes with specialized hardware requires several software components, including drivers, container runtimes, and libraries. To use NVIDIA GPUs in a Kubernetes cluster, deploy the NVIDIA GPU Operator. Because the GPU is a special resource in the cluster, the following components must be installed before workloads can be scheduled on it:
NVIDIA drivers (to enable CUDA)
Kubernetes device plugin
Container runtime
Other tools to provide capabilities such as monitoring or automatic node labeling
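The NVIDIA GPU Operator automates the deployment of these components. As a sketch, it is typically installed with Helm from NVIDIA's chart repository (the release name and namespace below are conventional choices, not requirements; adjust them to your environment):

```shell
# Add NVIDIA's Helm repository and refresh the chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```

After installation, the operator's pods (driver, device plugin, toolkit, and monitoring components) appear in the chosen namespace and can be inspected with `kubectl get pods -n gpu-operator`.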
Before installing the NVIDIA GPU Operator, make sure that the Kubernetes cluster meets the following prerequisites:
All worker nodes must run the same operating system version to use the NVIDIA GPU Driver container.
Nodes must be configured with a container engine, such as Docker (CE/EE), containerd, or Podman.
Nodes should be equipped with NVIDIA GPUs.
Nodes should have NVIDIA drivers installed.
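Once the operator is running, GPUs on each node are advertised to the scheduler as the extended resource `nvidia.com/gpu`, which a pod requests in its resource limits. A minimal sketch of such a workload (the pod name and CUDA image tag are illustrative, not prescribed by the operator):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                 # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container
    # Assumed example image; use a CUDA image matching your driver version
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]       # prints GPU details if scheduling succeeded
    resources:
      limits:
        nvidia.com/gpu: 1         # request one GPU
```

The scheduler places this pod only on a node that has an unallocated GPU; `kubectl logs cuda-test` should then show the `nvidia-smi` output from that node.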
2.2 Supported GPUs
The NVIDIA GPU Operator is compatible with a range of NVIDIA GPUs. For a full list of supported GPUs, refer to the NVIDIA GPU Operator Platform Support documentation.