CNCF AI Conformance
The CNCF Kubernetes AI Conformance defines a set of additional capabilities, APIs and configurations that a Kubernetes cluster MUST offer, on top of standard CNCF Kubernetes Conformance, to reliably and efficiently run AI/ML workloads.
This page shows how to meet these requirements using RKE2 v1.35.5+rke2r2 and NVIDIA GPU Operator v26.3.2. The operator is installed as described in the GPU Operators documentation.
Support Dynamic Resource Allocation (DRA)
DRA is an API for flexible, fine-grained device requests that go beyond simple counts. It has been generally available (GA) since Kubernetes v1.34.
Verify that all resource.k8s.io/v1 DRA API resources are enabled by running:
kubectl api-resources --api-group=resource.k8s.io
Expected Output:
NAME                     SHORTNAMES   APIVERSION           NAMESPACED   KIND
deviceclasses                         resource.k8s.io/v1   false        DeviceClass
resourceclaims                        resource.k8s.io/v1   true         ResourceClaim
resourceclaimtemplates                resource.k8s.io/v1   true         ResourceClaimTemplate
resourceslices                        resource.k8s.io/v1   false        ResourceSlice
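To run this check automatically (for example, in CI), you can validate the output with a script instead of reading it. The following is a minimal POSIX shell sketch. `check_dra_api` is a hypothetical helper name, and the sketch assumes the column layout shown above, with KIND as the last column.

```shell
# check_dra_api (hypothetical helper): reads the output of
#   kubectl api-resources --api-group=resource.k8s.io
# on stdin and succeeds only if all four DRA kinds are served at
# resource.k8s.io/v1. Missing kinds are reported on stderr.
check_dra_api() {
  awk '
    # Remember the KIND (last column) of every row served at the GA version.
    $0 ~ /resource\.k8s\.io\/v1[ \t]/ { seen[$NF] = 1 }
    END {
      n = split("DeviceClass ResourceClaim ResourceClaimTemplate ResourceSlice", want, " ")
      rc = 0
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) { print "missing: " want[i] > "/dev/stderr"; rc = 1 }
      exit rc
    }'
}
```

Usage: `kubectl api-resources --api-group=resource.k8s.io | check_dra_api && echo "DRA API OK"`. The script ignores rows served only at a beta version, so it fails if a cluster exposes DRA only at `v1beta1`.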
Driver Runtime Management
Ideally, DRA would be used to verify that GPU drivers and container runtime configurations are correctly installed and maintained on accelerator nodes. Because DRA-based driver management is still maturing in the Kubernetes ecosystem, this platform relies instead on Node Feature Discovery (NFD) and GPU Feature Discovery (GFD). These components attest to the installed drivers and runtime configuration through node labels.
1. Verifying Base Container Runtime Version
To confirm the base engine handling the container lifecycle, the platform exposes the underlying containerd runtime via the core Node API:
kubectl get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[*].status.nodeInfo.containerRuntimeVersion}'
Expected output:
containerd://2.2.3-k3s1
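With several GPU nodes, the query prints one value per node, separated by spaces. The following POSIX shell sketch checks that every GPU node runs containerd and prints each node's containerd major.minor version. `containerd_versions` is a hypothetical helper name, and the sketch assumes the `containerd://<version>` format shown above.

```shell
# containerd_versions (hypothetical helper): reads the space-separated
# containerRuntimeVersion values printed by the jsonpath query above and
# prints "<major>.<minor>" for each node. Exits non-zero if any node
# reports a runtime other than containerd.
containerd_versions() {
  tr ' ' '\n' | awk '
    BEGIN { rc = 0 }
    NF == 0 { next }                                   # skip empty fields
    /^containerd:\/\// {
      v = substr($0, length("containerd://") + 1)      # e.g. 2.2.3-k3s1
      split(v, part, /[.-]/)                           # 2, 2, 3, k3s1
      print part[1] "." part[2]
      next
    }
    { print "not containerd: " $0 > "/dev/stderr"; rc = 1 }
    END { exit rc }'
}
```

Usage: pipe the jsonpath query above into `containerd_versions`. For the expected output shown, it prints `2.2`.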
2. Verifying Accelerator Drivers and Runtime Configurations
By querying the node labels, administrators can programmatically verify both the driver installation and the specialized container runtime configuration patch applied to containerd:
kubectl get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[*].metadata.labels}'
Expected output (excerpt; kubectl prints all labels as a JSON map, and only the relevant entries are shown here):
nvidia.com/cuda.driver-version.full: "595.71.05"
nvidia.com/gpu.deploy.container-toolkit: "true"
nvidia.com/gpu-driver-upgrade-state: "upgrade-done"
- nvidia.com/gpu.deploy.container-toolkit: "true" confirms that the container runtime configuration (the toolkit patch to containerd) is injected and active.
- nvidia.com/gpu-driver-upgrade-state: "upgrade-done" confirms that the operator is actively maintaining the lifecycle of these components.
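The jsonpath query prints every label as one JSON map, which is hard to check by eye. As an alternative, a kubectl go-template can print one `key=value` pair per line, and a small script can then check the two labels discussed above. This is a minimal sketch that assumes the driver is managed by the operator, as in the output above. `check_gpu_node_labels` and `GPU_NODE` are hypothetical names.

```shell
# check_gpu_node_labels (hypothetical helper): reads one "key=value" label
# per line on stdin, prints the driver version, and exits non-zero if the
# toolkit or driver-upgrade labels do not have the values shown above.
check_gpu_node_labels() {
  awk '
    BEGIN { rc = 0 }
    # Split each line on its first "=" into label key and value.
    { i = index($0, "="); if (i > 0) label[substr($0, 1, i - 1)] = substr($0, i + 1) }
    END {
      drv = label["nvidia.com/cuda.driver-version.full"]
      if (drv == "") { print "driver version label missing" > "/dev/stderr"; rc = 1 }
      else print "driver " drv
      if (label["nvidia.com/gpu.deploy.container-toolkit"] != "true") {
        print "container toolkit not deployed" > "/dev/stderr"; rc = 1
      }
      if (label["nvidia.com/gpu-driver-upgrade-state"] != "upgrade-done") {
        print "driver upgrade state is not upgrade-done" > "/dev/stderr"; rc = 1
      }
      exit rc
    }'
}
```

Usage: `kubectl get node "$GPU_NODE" -o go-template='{{range $k, $v := .metadata.labels}}{{$k}}={{$v}}{{"\n"}}{{end}}' | check_gpu_node_labels`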
GPU Sharing support
There are several strategies for sharing a GPU. This page uses time-slicing because it is implemented purely in software and does not depend on hardware partitioning features such as Multi-Instance GPU (MIG). As a result, it works with almost every NVIDIA GPU.
Refer to the NVIDIA GPU Operator documentation for details about how to configure time-slicing.
Once the configuration is applied, verify it by running:
kubectl describe node $GPU_NODE_NAME
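Expected output (excerpt, assuming 4 replicas were configured):
Labels:
  nvidia.com/gpu.product=Tesla-T4-SHARED
  nvidia.com/gpu.replicas=4
  nvidia.com/gpu.sharing-strategy=time-slicing
Capacity:
  nvidia.com/gpu: 4
Allocatable:
  nvidia.com/gpu: 4
With 4 replicas, each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources, so a node with a single GPU reports a capacity of 4.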
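For reference, the time-slicing configuration follows the ConfigMap shape used in NVIDIA's GPU Operator time-slicing guide. The sketch below generates such a ConfigMap for a given replica count. The helper name `timeslicing_configmap`, the ConfigMap name, and the `any` key are illustrative assumptions, so check them against the NVIDIA documentation for your operator version.

```shell
# timeslicing_configmap (hypothetical helper): prints a ConfigMap that asks
# the NVIDIA device plugin to advertise each GPU as <replicas> shares.
# Arguments: <configmap-name> <replicas>
timeslicing_configmap() {
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $1
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: $2
EOF
}
```

Usage, assuming the operator's default ClusterPolicy name `cluster-policy` (check yours with `kubectl get clusterpolicies`): run `timeslicing_configmap time-slicing-config 4 | kubectl apply -f -`, then point the device plugin at it with `kubectl patch clusterpolicies.nvidia.com/cluster-policy --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'`.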
Maintain API consistency when using vGPUs and GPUs
Maintaining consistency with the API and request mechanisms of physical fractional GPUs ensures that workloads, scheduling, policies, and user expectations remain the same across both physical and virtual backends, enabling portability and avoiding fragmented, vendor-specific resource models.
The NVIDIA GPU Operator supports vGPUs and maintains the required API consistency. Refer to the NVIDIA documentation for details about how to configure the NVIDIA GPU Operator for vGPU.
Support the Gateway API
Gateway API represents the next generation of Kubernetes ingress, load balancing and service mesh APIs.
To enable the Gateway API in RKE2, the cluster must be deployed with Traefik enabled and its KubernetesGateway provider configured, as explained in the Ingress Controller docs.
Verify that all gateway.networking.k8s.io/v1 Gateway API resources are enabled by running:
kubectl api-resources --api-group=gateway.networking.k8s.io
Expected Output:
NAME              SHORTNAMES   APIVERSION                          NAMESPACED   KIND
gatewayclasses    gc           gateway.networking.k8s.io/v1        false        GatewayClass
gateways          gtw          gateway.networking.k8s.io/v1        true         Gateway
grpcroutes                     gateway.networking.k8s.io/v1        true         GRPCRoute
httproutes                     gateway.networking.k8s.io/v1        true         HTTPRoute
referencegrants   refgrant     gateway.networking.k8s.io/v1beta1   true         ReferenceGrant
To verify that Traefik is consuming Gateway API resources:
1. Create a GatewayClass:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: traefik
spec:
  controllerName: traefik.io/gateway-controller
2. Check its status:
kubectl get gatewayclass traefik -o jsonpath='{.status}'
Expected output (excerpt):
"message":"Handled by Traefik controller","observedGeneration":1,"reason":"Handled","status":"True","type":"Accepted"
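Right after you create the GatewayClass, its status may still be empty while Traefik reconciles it. A scripted check should therefore poll instead of reading the status once. The following is a minimal sketch. `wait_for_value` is a hypothetical helper that re-runs any command until its output matches an expected value.

```shell
# wait_for_value (hypothetical helper): runs a command up to <attempts>
# times, one second apart, and succeeds as soon as its output equals
# <expected>. Usage: wait_for_value <expected> <attempts> <command...>
wait_for_value() {
  expected=$1 attempts=$2
  shift 2
  i=1
  while :; do
    [ "$("$@" 2>/dev/null)" = "$expected" ] && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep 1
  done
}
```

Usage: `wait_for_value True 30 kubectl get gatewayclass traefik -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}'`. The jsonpath filter selects only the `Accepted` condition's status, which avoids matching against the full status JSON. If you only need to wait on a condition, `kubectl wait --for=condition=Accepted gatewayclass/traefik --timeout=30s` is a built-in alternative.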
Gang Scheduling
A gang scheduling solution (e.g., Kueue or Volcano) must be available for installation to ensure all-or-nothing scheduling for distributed AI workloads.
We will use Volcano in RKE2 for this verification test.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
The installation creates three deployments in the volcano-system namespace:
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           130m
deployment.apps/volcano-controllers   1/1     1            1           130m
deployment.apps/volcano-scheduler     1/1     1            1           130m
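To confirm these deployments in a script, check that the READY column shows all replicas ready (for example, `1/1`) for every row. The following is a minimal sketch that assumes the default `kubectl get deployments` column layout. `all_deployments_ready` is a hypothetical helper name.

```shell
# all_deployments_ready (hypothetical helper): reads `kubectl get deployments`
# output on stdin and succeeds only if at least one deployment is listed
# and every READY value has the form n/n.
all_deployments_ready() {
  awk '
    BEGIN { n = 0; bad = 0 }
    NR == 1 && $1 == "NAME" { next }     # skip the header row
    {
      n++
      split($2, r, "/")                  # READY column, e.g. "1/1"
      if (r[1] != r[2]) { print "not ready: " $1 " (" $2 ")" > "/dev/stderr"; bad = 1 }
    }
    END { exit (n == 0 || bad) ? 1 : 0 }'
}
```

Usage: `kubectl get deployments -n volcano-system | all_deployments_ready && echo "Volcano is ready"`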
This completes the installation check. As a functional test, the following manifest creates a gang job with two tasks, each requesting one NVIDIA GPU, on a cluster with two GPUs:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-nbody-gang-job
  namespace: default
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
  - name: nbody-task-1
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-1
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
  - name: nbody-task-2
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-2
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
Both pods should be running after a few seconds.
To test gang-scheduling failure, delete the job, change the manifest to use minAvailable: 3, append the following third task under tasks, and submit the modified manifest:
  - name: nbody-task-3
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-3
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
All three pods remain in Pending status. The cluster has only two GPUs, so Volcano cannot place the whole gang and schedules none of its pods. This demonstrates that gang scheduling is working as expected.
default gpu-nbody-gang-job-nbody-task-1-0 0/1 Pending 0 50s
default gpu-nbody-gang-job-nbody-task-2-0 0/1 Pending 0 50s
default gpu-nbody-gang-job-nbody-task-3-0 0/1 Pending 0 50s
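Editing the manifest by hand for each test is error-prone, because the tasks differ only in their index. The following sketch generates the same Volcano Job for any number of identical tasks. `gang_job` is a hypothetical helper name, and every other field is copied from the manifest above.

```shell
# gang_job (hypothetical helper): prints the Volcano Job used above with
# <tasks> identical nbody tasks (one nvidia.com/gpu each) and the given
# minAvailable. Usage: gang_job <tasks> <minAvailable>
gang_job() {
  tasks=$1 min=$2
  cat <<EOF
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-nbody-gang-job
  namespace: default
spec:
  minAvailable: ${min}
  schedulerName: volcano
  tasks:
EOF
  i=1
  while [ "$i" -le "$tasks" ]; do
    cat <<EOF
  - name: nbody-task-${i}
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        runtimeClassName: nvidia
        containers:
        - name: cuda-container-${i}
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody
          command: ["/bin/bash", "-c"]
          args:
          - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
    i=$((i + 1))
  done
}
```

Usage: `gang_job 2 2 | kubectl apply -f -` reproduces the successful case. For the failure case, run `kubectl delete jobs.batch.volcano.sh gpu-nbody-gang-job -n default`, then `gang_job 3 3 | kubectl apply -f -`.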
Cluster autoscaler
If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale accelerator-specific node groups based on pending pods. Because RKE2 is a Kubernetes distribution rather than a managed cloud service, it does not provide an integrated cluster autoscaler.
For reference, the following steps show how to use the upstream Kubernetes Cluster Autoscaler, with Azure as an example:
1. Create a Virtual Machine Scale Set (VMSS) with GPU-equipped VMs.
2. Deploy RKE2 with the following options in its configuration file:
disable-cloud-controller: true   # only on rke2-server
kubelet-arg:                     # on both rke2-server and rke2-agent
  - --cloud-provider=external
3. Install the Azure cloud controller manager (CCM):
helm install --repo https://raw.githubusercontent.com/kubernetes-sigs/cloud-provider-azure/master/helm/repo cloud-provider-azure --generate-name \
  --set cloudControllerManager.imageRepository=mcr.microsoft.com/oss/kubernetes \
  --set cloudControllerManager.imageName=azure-cloud-controller-manager \
  --set cloudNodeManager.imageRepository=mcr.microsoft.com/oss/kubernetes \
  --set cloudNodeManager.imageName=azure-cloud-node-manager \
  --set cloudControllerManager.configureCloudRoutes=false \
  --set cloudControllerManager.allocateNodeCidrs=false
4. Create the azure.json file and save it to /etc/kubernetes/azure.json. Ensure it contains the following two options:
"useManagedIdentityExtension": false,
"useInstanceMetadata": true
The deployed nodes should include a provider ID. Verify this with:
kubectl get nodes -o yaml | grep providerID
The provider ID is retrieved from the instance metadata. Check this with:
curl -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
5. Install the upstream autoscaler:
- First, create a values.yaml configuration file that specifies the VMSS from step 1 and the other required Azure details.
- Then run the following helm commands:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler -f values.yaml
When correctly deployed, the autoscaler watches for pods that cannot be scheduled. If a pod requesting a GPU cannot fit on the existing nodes, the autoscaler asks Azure to scale out the VMSS, which provisions a new GPU node and adds it to the cluster.
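The page does not show the values.yaml from step 5. The following sketch generates a minimal one for a single VMSS node group. The value names (`cloudProvider`, `azureVMType`, the `azure*` credentials, and `autoscalingGroups`) reflect our understanding of the upstream cluster-autoscaler chart, so verify them against the chart's values.yaml for your chart version. `autoscaler_values` and the `AZURE_*` environment variables are hypothetical names.

```shell
# autoscaler_values (hypothetical helper): prints a minimal values.yaml for
# the upstream cluster-autoscaler chart on Azure with one VMSS node group.
# Credentials come from environment variables; an unset variable aborts.
# Usage: autoscaler_values <vmss-name> <min-nodes> <max-nodes>
autoscaler_values() {
  vmss=$1 min=$2 max=$3
  cat <<EOF
cloudProvider: azure
azureVMType: vmss
azureSubscriptionID: "${AZURE_SUBSCRIPTION_ID:?is not set}"
azureTenantID: "${AZURE_TENANT_ID:?is not set}"
azureClientID: "${AZURE_CLIENT_ID:?is not set}"
azureClientSecret: "${AZURE_CLIENT_SECRET:?is not set}"
azureResourceGroup: "${AZURE_RESOURCE_GROUP:?is not set}"
autoscalingGroups:
- name: ${vmss}
  minSize: ${min}
  maxSize: ${max}
EOF
}
```

Usage: `autoscaler_values my-gpu-vmss 0 3 > values.yaml`, where `my-gpu-vmss` stands for the scale set created in step 1. Setting minSize to 0 lets the GPU node group shrink completely when no pod needs a GPU.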
Horizontal pod autoscaler
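The ability to scale Pods based on custom metrics relevant to AI/ML workloads is provided by the HorizontalPodAutoscaler (HPA), which is included in Kubernetes by default. The HPA reads custom and object metrics from the custom.metrics.k8s.io API, so a metrics adapter (for example, Prometheus Adapter) must serve the metric that the HPA refers to.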
To demonstrate this requirement, deploy Ollama in RKE2 as a Deployment named ollama. Then apply the following manifest in the same namespace as the Deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: v1
        kind: Namespace
        name: suse-private-ai
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"
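When load on Ollama drives the gpu_utilization metric above the target of 70, the HPA adds Ollama pods, up to maxReplicas (3). Because the target type is AverageValue, the HPA divides the metric by the current number of replicas before comparing it with 70.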
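The scaling decision follows the HPA formula desired = ceil(current × metric / target). The HPA skips scaling when the ratio is within its default 10% tolerance and clamps the result to [minReplicas, maxReplicas]. The following sketch reproduces that arithmetic so you can predict how the HPA above will react. It ignores the HPA's stabilization window and scaling policies. `hpa_desired` is a hypothetical helper name.

```shell
# hpa_desired (hypothetical helper): desired replica count per the HPA
# formula, with the default 0.1 tolerance and min/max clamping.
# Usage: hpa_desired <current> <per-pod metric> <target> <min> <max>
# For an Object metric with an AverageValue target, pass the metric value
# already divided by the current replica count.
hpa_desired() {
  awk -v cur="$1" -v metric="$2" -v target="$3" -v lo="$4" -v hi="$5" 'BEGIN {
    ratio = metric / target
    d = ratio - 1; if (d < 0) d = -d
    if (d <= 0.1) want = cur                 # within tolerance: no change
    else {
      x = cur * ratio
      want = int(x); if (want < x) want++    # ceil for positive values
    }
    if (want < lo) want = lo
    if (want > hi) want = hi
    print want
  }'
}
```

For example, with one Ollama replica and gpu_utilization at 90, `hpa_desired 1 90 70 1 3` prints `2`. At 75, the ratio 75/70 falls within the 10% tolerance, so the replica count stays unchanged.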
Accelerator Performance Metrics
This requirement mandates a functional accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This solution must include a core set of metrics for per-accelerator utilization and memory usage.
When the NVIDIA GPU Operator is installed (as described in the GPU Operators documentation), it deploys an nvidia-dcgm-exporter DaemonSet and Service. Query this service to collect the required GPU metrics, such as accelerator utilization, memory usage, temperature, and power usage.
For example, run the following commands from a cluster node (for instance, over SSH), because the service's ClusterIP is only reachable from inside the cluster. They print the metrics in the OpenMetrics text format. The following section details how to deploy Prometheus and Grafana to consume them.
# Get the clusterIP
svcIP=$(kubectl get svc nvidia-dcgm-exporter -n gpu-operator -o jsonpath='{.spec.clusterIP}')
# Get the port
svcPort=$(kubectl get svc nvidia-dcgm-exporter -n gpu-operator -o jsonpath='{.spec.ports[0].port}')
# Output the metrics
curl -sL http://${svcIP}:${svcPort}/metrics
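The full output is long. The following sketch filters it down to one metric and prints one line per GPU, with the GPU index and value. It assumes each sample carries a `gpu` label and has no trailing timestamp, which is how dcgm-exporter formats its samples. `gpu_metric` is a hypothetical helper name.

```shell
# gpu_metric (hypothetical helper): reads Prometheus/OpenMetrics text on
# stdin and prints "<gpu> <value>" for every sample of metric <name>.
# Comment lines (# HELP, # TYPE) are skipped.
gpu_metric() {
  awk -v name="$1" '
    /^#/ { next }
    index($0, name "{") == 1 {
      gpu = "?"
      # Extract the value of the gpu="..." label.
      if (match($0, /[{,]gpu="[^"]*"/)) gpu = substr($0, RSTART + 6, RLENGTH - 7)
      print gpu, $NF
    }'
}
```

Usage: `curl -sL "http://${svcIP}:${svcPort}/metrics" | gpu_metric DCGM_FI_DEV_GPU_UTIL` shows per-GPU utilization. `gpu_metric DCGM_FI_DEV_FB_USED` shows used framebuffer memory.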
AI Job & Inference Service Metrics
This requirement mandates a system capable of discovering and collecting metrics exposed by workloads in a standardized format.
Prometheus and Grafana fulfill this requirement. First, install them:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Once installed, create a ServiceMonitor to scrape metrics from workloads. As an example, the following manifest configures Prometheus to collect DCGM metrics from the NVIDIA GPU Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-monitor
  namespace: monitoring
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-operator
  endpoints:
  - port: gpu-metrics
    path: /metrics
    interval: 15s
After a few minutes, DCGM metrics such as DCGM_FI_DEV_GPU_UTIL become available in Prometheus and can be visualized in Grafana.
Secure accelerator access
This requirement mandates that access to accelerators from within containers be properly isolated and mediated by Kubernetes.
The NVIDIA GPU Operator v26.3.2 automatically configures the GPU software stack with the proper isolation.
Verify the isolation requirement by creating the following three Pods in a cluster with only one GPU. Pods 1 and 2 each request one GPU, while Pod 3 requests none:
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark1
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    command: ["/bin/bash", "-c"]
    args:
    - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark2
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container2
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    command: ["/bin/bash", "-c"]
    args:
    - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark3
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-container3
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    command: ["/bin/bash", "-c"]
    args:
    - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
    # No nvidia.com/gpu request: this Pod is not allocated a GPU
Expected results (isolation confirmed):
- Pod 1 runs successfully and is allocated the GPU.
- Pod 2 stays Pending. Kubernetes does not schedule it because the only GPU in the cluster is already allocated to Pod 1.
- Pod 3 runs because it requests no GPU, but its logs show that the CUDA sample cannot find a GPU, since no device was exposed to the container.
This outcome demonstrates that accelerator isolation is working correctly.
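You can check the expected pod states from a script. The following sketch reads `kubectl get pods ... --no-headers` output, with STATUS in the third column, and checks for the Running / Pending / Running pattern described above. `check_isolation` is a hypothetical helper name, and the pod names are the ones used in the manifests above.

```shell
# check_isolation (hypothetical helper): reads `kubectl get pods --no-headers`
# output (NAME READY STATUS RESTARTS AGE) on stdin and checks that
# benchmark1 is Running, benchmark2 is Pending and benchmark3 is Running.
check_isolation() {
  awk '
    { status[$1] = $3 }
    END {
      ok = status["nbody-gpu-benchmark1"] == "Running" &&
           status["nbody-gpu-benchmark2"] == "Pending" &&
           status["nbody-gpu-benchmark3"] == "Running"
      if (!ok) print "unexpected pod states" > "/dev/stderr"
      exit ok ? 0 : 1
    }'
}
```

Usage: `kubectl get pods nbody-gpu-benchmark1 nbody-gpu-benchmark2 nbody-gpu-benchmark3 --no-headers | check_isolation`. Then confirm the GPU error for Pod 3 with `kubectl logs nbody-gpu-benchmark3`.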
Robust CRD and Controller Operation
This requirement mandates the installation and reliable function of at least one complex AI Operator with CRDs. Verification requires confirming that CRDs are registered and that an Admission Webhook rejects invalid configurations.
To verify this requirement, install the Kubeflow Training Operator v1.8.0 in RKE2. Because no Helm chart is available, install it with kubectl's built-in Kustomize support instead:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.8.0"
Verify that CRDs are installed and the webhook is registered:
$> kubectl get crds | grep kubeflow
mpijobs.kubeflow.org 2025-10-24T13:04:27Z
mxjobs.kubeflow.org 2025-10-24T13:04:27Z
paddlejobs.kubeflow.org 2025-10-24T13:04:28Z
pytorchjobs.kubeflow.org 2025-10-24T13:04:28Z
tfjobs.kubeflow.org 2025-10-24T13:04:29Z
xgboostjobs.kubeflow.org 2025-10-24T13:04:29Z
$> kubectl get validatingwebhookconfigurations
validator.training-operator.kubeflow.org 5 10m
$> kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
training-operator-f7d4b59f6-vdnh9 1/1 Running 0 9m54s
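You can script the CRD check instead of reading the list by eye. The following sketch checks `kubectl get crds` output for the six CRDs listed above. `check_kubeflow_crds` is a hypothetical helper name, and the CRD list is taken from the v1.8.0 output shown on this page.

```shell
# check_kubeflow_crds (hypothetical helper): reads `kubectl get crds` output
# on stdin and reports any of the six training-operator CRDs that is missing.
check_kubeflow_crds() {
  awk '
    { seen[$1] = 1 }
    END {
      n = split("mpijobs mxjobs paddlejobs pytorchjobs tfjobs xgboostjobs", k, " ")
      rc = 0
      for (i = 1; i <= n; i++)
        if (!((k[i] ".kubeflow.org") in seen)) {
          print "missing: " k[i] ".kubeflow.org" > "/dev/stderr"; rc = 1
        }
      exit rc
    }'
}
```

Usage: `kubectl get crds | check_kubeflow_crds && echo "all training-operator CRDs registered"`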
Test the admission webhook’s rejection capability by attempting to apply the following invalid TFJob manifest (missing the required image field):
# saved as invalid-tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-invalid-test
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            # INTENTIONAL ERROR: Missing the 'image' field
            # image: tensorflow/tensorflow:latest
            # command: ["/bin/bash", "-c"]
            # args: ["echo 'Chief running'; sleep 10;"]
The Admission Webhook returns the expected error, confirming its function:
Error from server (Forbidden): error when creating "invalid-tfjob.yaml": admission webhook "validator.tfjob.training-operator.kubeflow.org" denied the request: spec.tfReplicaSpecs[Chief].template.spec.containers[0].image: Required value: must be required
Uncomment the image, command, and args lines in the previous example and apply the manifest again. The webhook accepts it and the TFJob is created.