CNCF AI Conformance
The CNCF Kubernetes AI Conformance defines a set of additional capabilities, APIs and configurations that a Kubernetes cluster MUST offer, on top of standard CNCF Kubernetes Conformance, to reliably and efficiently run AI/ML workloads.
This page shows how to meet these requirements using RKE2 v1.34.1+rke2r1.
Support Dynamic Resource Allocation (DRA)
DRA is a new API that enables more flexible, fine-grained resource requests beyond simple counts and has been generally available (GA) since v1.34.
Verify that all resource.k8s.io/v1 DRA API resources are enabled by running:
kubectl api-resources --api-group=resource.k8s.io
Expected Output:
NAME                     SHORTNAMES   APIVERSION           NAMESPACED   KIND
deviceclasses                         resource.k8s.io/v1   false        DeviceClass
resourceclaims                        resource.k8s.io/v1   true         ResourceClaim
resourceclaimtemplates                resource.k8s.io/v1   true         ResourceClaimTemplate
resourceslices                        resource.k8s.io/v1   false        ResourceSlice
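With the API enabled, a workload requests devices through a ResourceClaimTemplate and references it from the Pod spec. The following is a minimal sketch, not a tested manifest: it assumes a DRA driver (for example the NVIDIA DRA driver) is installed and publishes a DeviceClass named gpu.nvidia.com, and the request schema should be double-checked against your cluster with kubectl explain resourceclaimtemplates.spec.spec.devices.requests.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu            # illustrative name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # assumption: published by the installed DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-test
spec:
  restartPolicy: Never
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: ctr
      image: busybox
      command: ["sleep", "3600"]
      resources:
        claims:
          - name: gpu          # the container consumes the claim defined above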
Support the Gateway API
Gateway API represents the next generation of Kubernetes ingress, load balancing and service mesh APIs.
To enable the Gateway API in RKE2, the cluster must be deployed with Traefik enabled and its KubernetesGateway provider configured, as explained in the Ingress Controller docs.
Verify that all gateway.networking.k8s.io Gateway API resources are enabled by running:
kubectl api-resources --api-group=gateway.networking.k8s.io
Expected Output:
NAME              SHORTNAMES   APIVERSION                          NAMESPACED   KIND
gatewayclasses    gc           gateway.networking.k8s.io/v1        false        GatewayClass
gateways          gtw          gateway.networking.k8s.io/v1        true         Gateway
grpcroutes                     gateway.networking.k8s.io/v1        true         GRPCRoute
httproutes                     gateway.networking.k8s.io/v1        true         HTTPRoute
referencegrants   refgrant     gateway.networking.k8s.io/v1beta1   true         ReferenceGrant
To verify Traefik is consuming Gateway API resources:
1. Create a GatewayClass:

   apiVersion: gateway.networking.k8s.io/v1
   kind: GatewayClass
   metadata:
     name: traefik
   spec:
     controllerName: traefik.io/gateway-controller

2. Check the status:

   kubectl get gatewayclass traefik -o jsonpath='{.status}'

   Expected Output:

   "message":"Handled by Traefik controller","observedGeneration":1,"reason":"Handled","status":"True","type":"Accepted"
Gang Scheduling
A gang scheduling solution (e.g., Kueue or Volcano) must be available for installation to ensure all-or-nothing scheduling for distributed AI workloads.
We will use Volcano in RKE2 for this verification test.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
The installation creates three deployments in the volcano-system namespace:
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           130m
deployment.apps/volcano-controllers   1/1     1            1           130m
deployment.apps/volcano-scheduler     1/1     1            1           130m
Installation alone is enough to meet the requirement, but a functional test confirms the behavior. The following manifest creates a gang job with two tasks (each requiring an NVIDIA GPU) on a two-GPU cluster:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gpu-nbody-gang-job
  namespace: default
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
    - name: nbody-task-1
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          runtimeClassName: nvidia
          containers:
            - name: cuda-container-1
              image: nvcr.io/nvidia/k8s/cuda-sample:nbody
              command: ["/bin/bash", "-c"]
              args:
                - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
              resources:
                limits:
                  nvidia.com/gpu: 1
    - name: nbody-task-2
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          runtimeClassName: nvidia
          containers:
            - name: cuda-container-2
              image: nvcr.io/nvidia/k8s/cuda-sample:nbody
              command: ["/bin/bash", "-c"]
              args:
                - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
              resources:
                limits:
                  nvidia.com/gpu: 1
Both pods should be running after a few seconds.
To test gang-scheduling failure, modify the manifest to use minAvailable: 3 and add a third task. Re-submit the job:
    - name: nbody-task-3
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          runtimeClassName: nvidia
          containers:
            - name: cuda-container-3
              image: nvcr.io/nvidia/k8s/cuda-sample:nbody
              command: ["/bin/bash", "-c"]
              args:
                - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
              resources:
                limits:
                  nvidia.com/gpu: 1
Observe that the three pods remain in a Pending status. This demonstrates that gang scheduling is working as expected.
default gpu-nbody-gang-job-nbody-task-1-0 0/1 Pending 0 50s
default gpu-nbody-gang-job-nbody-task-2-0 0/1 Pending 0 50s
default gpu-nbody-gang-job-nbody-task-3-0 0/1 Pending 0 50s
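Volcano records why the gang cannot start on the PodGroup it creates for the job. Inspecting it should show that the job's minAvailable cannot be satisfied (the exact condition wording varies by Volcano version):
kubectl get podgroups.scheduling.volcano.sh -n default
kubectl describe podgroups.scheduling.volcano.sh -n default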
Cluster autoscaler
If the platform provides a cluster autoscaler or an equivalent mechanism, it must be capable of scaling accelerator-specific node groups based on pending pods. Since RKE2 is a Kubernetes distribution, it does not provide an integrated cluster autoscaler.
For reference, this section explains how to use the upstream Cluster Autoscaler with Azure as an example.
1. Create a Virtual Machine Scale Set (VMSS) with GPU-equipped VMs.

2. Deploy RKE2 with the following options:

   disable-cloud-controller: true   # Only in rke2-server
   kubelet-arg:                     # On both rke2-server and rke2-agent
     - --cloud-provider=external

3. Install the Azure CCM:

   helm install --repo https://raw.githubusercontent.com/kubernetes-sigs/cloud-provider-azure/master/helm/repo cloud-provider-azure \
     --generate-name \
     --set cloudControllerManager.imageRepository=mcr.microsoft.com/oss/kubernetes \
     --set cloudControllerManager.imageName=azure-cloud-controller-manager \
     --set cloudNodeManager.imageRepository=mcr.microsoft.com/oss/kubernetes \
     --set cloudNodeManager.imageName=azure-cloud-node-manager \
     --set cloudControllerManager.configureCloudRoutes=false \
     --set cloudControllerManager.allocateNodeCidrs=false

4. Create the azure.json file and save it to /etc/kubernetes/azure.json. Ensure it contains the following two options:

   "useManagedIdentityExtension": false,
   "useInstanceMetadata": true

   The deployed nodes should include a ProviderID. Verify this with:

   kubectl get nodes -o yaml | grep providerID

   The ProviderID is retrieved from the instance's Metadata. Check this with:

   curl -H Metadata:true "http://169.254.169.254/metadata/instance?api-version=2021-02-01"

5. Install the upstream Cluster Autoscaler:

   - First, create a values.yaml configuration file, specifying the VMSS from Step 1 and other necessary Azure details (a sample sketch follows this list).
   - Then run the following helm commands:

     helm repo add autoscaler https://kubernetes.github.io/autoscaler
     helm repo update
     helm install cluster-autoscaler autoscaler/cluster-autoscaler -f values.yaml
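The values.yaml below is a sketch of the Azure VMSS case. The key names follow the cluster-autoscaler Helm chart; every placeholder must be replaced with your own IDs, and the chart's README should be consulted for the exact set of values your version expects.
# values.yaml (sketch)
cloudProvider: azure
azureTenantID: "<tenant-id>"
azureSubscriptionID: "<subscription-id>"
azureClientID: "<client-id>"            # service principal with access to the VMSS
azureClientSecret: "<client-secret>"
azureResourceGroup: "<resource-group>"
azureVMType: "vmss"
autoscalingGroups:
  - name: "<gpu-vmss-name>"             # the VMSS created in Step 1
    minSize: 1
    maxSize: 3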
When correctly deployed, the autoscaler monitors for pods requesting a GPU resource. If the cluster cannot satisfy the request, the autoscaler contacts Azure to automatically provision and add a new GPU node to the cluster.
Horizontal pod autoscaler
The ability to scale Pods based on custom metrics relevant to AI/ML workloads is achieved with the HorizontalPodAutoscaler (HPA), which is included in Kubernetes by default.
To demonstrate this requirement, install an Ollama deployment in RKE2. The following manifest is then used for verification:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: v1
          kind: Namespace
          name: suse-private-ai
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"
Increasing the load on Ollama raises GPU utilization above the 70% target, triggering the HPA to scale out additional Ollama pods.
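For the gpu_utilization metric above to be served through the custom metrics API, an adapter has to translate it from Prometheus. The following is a sketch of a values file for the prometheus-community/prometheus-adapter chart that derives gpu_utilization from the DCGM exporter's DCGM_FI_DEV_GPU_UTIL series. It assumes the kube-prometheus-stack and DCGM exporter described in the sections below are already installed, and the exported_namespace label mapping is an assumption; check the labels your DCGM exporter actually attaches.
# prometheus-adapter values.yaml (sketch)
prometheus:
  url: http://prometheus-operated.monitoring.svc   # assumed Prometheus service
  port: 9090
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!=""}'
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
      name:
        matches: "DCGM_FI_DEV_GPU_UTIL"
        as: "gpu_utilization"
      metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
Install it with:
helm install prometheus-adapter prometheus-community/prometheus-adapter -n monitoring -f values.yaml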
Accelerator Performance Metrics
This requirement mandates a functional accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This solution must include a core set of metrics for per-accelerator utilization and memory usage.
When the NVIDIA GPU Operator is installed (as described in the GPU Operators documentation), an nvidia-dcgm-exporter DaemonSet and Service are deployed. Query this Service to collect the required GPU metrics, such as accelerator utilization, memory usage, temperature, and power usage.
For example, SSH into a cluster node and query the Service's ClusterIP; the response contains the metrics exposed in the OpenMetrics text format. The following section details how to deploy Prometheus and Grafana to consume them.
# Get the clusterIP
svcIP=$(kubectl get svc nvidia-dcgm-exporter -n gpu-operator -o jsonpath='{.spec.clusterIP}')
# Get the port
svcPort=$(kubectl get svc nvidia-dcgm-exporter -n gpu-operator -o jsonpath='{.spec.ports[0].port}')
# Output the metrics
curl -sL http://${svcIP}:${svcPort}/metrics
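If node SSH access is not convenient, the same endpoint can be queried from a short-lived pod inside the cluster. This is a sketch and assumes the curlimages/curl image can be pulled in your environment:
kubectl run dcgm-check --rm -i --restart=Never --image=curlimages/curl --command -- \
  curl -s http://nvidia-dcgm-exporter.gpu-operator.svc.cluster.local:${svcPort}/metrics | grep DCGM_FI_DEV_GPU_UTIL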
AI Job & Inference Service Metrics
This requirement mandates a system capable of discovering and collecting metrics exposed by workloads in a standardized format.
Prometheus and Grafana fulfill this requirement. First, install them:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Once installed, create a ServiceMonitor to scrape metrics from workloads. As an example, the following manifest configures Prometheus to collect DCGM metrics from the NVIDIA GPU Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-monitor
  namespace: monitoring
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  namespaceSelector:
    matchNames:
      - gpu-operator
  endpoints:
    - port: gpu-metrics
      path: /metrics
      interval: 15s
After a few minutes, the Grafana dashboard will show the DCGM metrics, such as DCGM_FI_DEV_GPU_UTIL.
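To confirm Prometheus is actually scraping the exporter, you can also query it directly. This sketch uses the prometheus-operated service created by the Prometheus Operator; adjust the service name if yours differs:
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# in a second terminal:
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'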
Secure accelerator access
This requirement mandates that access to accelerators from within containers be properly isolated and mediated by Kubernetes. To achieve this, install the NVIDIA GPU Operator as described in the GPU Operators documentation.
After installation, verify the toolkit configuration at /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml contains:
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
Ensure the device-plugin DaemonSet includes the following environment variable:
DEVICE_LIST_STRATEGY: volume-mounts
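A quick way to read the environment variable back from the deployed DaemonSet is shown below. The DaemonSet name matches current GPU Operator releases and may differ in yours:
kubectl -n gpu-operator get ds nvidia-device-plugin-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="DEVICE_LIST_STRATEGY")].value}'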
If the configuration is correct, verify the isolation requirement by running the following three Pods in a cluster with only one GPU:
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark1
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      command: ["/bin/bash", "-c"]
      args:
        - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark2
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container2
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      command: ["/bin/bash", "-c"]
      args:
        - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark3
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container3
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      command: ["/bin/bash", "-c"]
      args:
        - "while true; do sleep 5 && cuda-samples/nbody -gpu -benchmark; done"
Expected Results (Isolation Confirmed):
- Pod 1 runs successfully and consumes the GPU.
- Pod 2 is not scheduled by Kubernetes because the only GPU available in the cluster is already being consumed by Pod 1.
- Pod 3 runs but fails to find an available GPU, as seen in the logs.
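These results can be observed with standard kubectl commands, for example:
kubectl get pods -o wide
kubectl describe pod nbody-gpu-benchmark2   # shows a FailedScheduling event for insufficient nvidia.com/gpu
kubectl logs nbody-gpu-benchmark3           # the CUDA sample reports that it cannot find a GPU device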
This outcome demonstrates that accelerator isolation is working correctly.
Robust CRD and Controller Operation
This requirement mandates the installation and reliable function of at least one complex AI Operator with CRDs. Verification requires confirming that CRDs are registered and that an Admission Webhook rejects invalid configurations.
To verify this requirement, install the Kubeflow Training Operator in RKE2. Since a Helm chart is unavailable, use the following kubectl command as a workaround:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.8.0"
Verify that CRDs are installed and the webhook is registered:
$> kubectl get crds | grep kubeflow
mpijobs.kubeflow.org 2025-10-24T13:04:27Z
mxjobs.kubeflow.org 2025-10-24T13:04:27Z
paddlejobs.kubeflow.org 2025-10-24T13:04:28Z
pytorchjobs.kubeflow.org 2025-10-24T13:04:28Z
tfjobs.kubeflow.org 2025-10-24T13:04:29Z
xgboostjobs.kubeflow.org 2025-10-24T13:04:29Z
$> kubectl get validatingwebhookconfigurations
validator.training-operator.kubeflow.org 5 10m
$> kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
training-operator-f7d4b59f6-vdnh9 1/1 Running 0 9m54s
Test the admission webhook’s rejection capability by attempting to apply the following invalid TFJob manifest (missing the required image field):
# saved as invalid-tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-invalid-test
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              # INTENTIONAL ERROR: Missing the 'image' field
              # image: tensorflow/tensorflow:latest
              # command: ["/bin/bash", "-c"]
              # args: ["echo 'Chief running'; sleep 10;"]
The Admission Webhook returns the expected error, confirming its function:
Error from server (Forbidden): error when creating "invalid-tfjob.yaml": admission webhook "validator.tfjob.training-operator.kubeflow.org" denied the request: spec.tfReplicaSpecs[Chief].template.spec.containers[0].image: Required value: must be required
Remove the comments in the previous example and reapply the job to see a successful deployment; a corrected sketch follows.
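For reference, the corrected manifest looks like this. The job name tfjob-valid-test and the file name valid-tfjob.yaml are illustrative:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob-valid-test
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              command: ["/bin/bash", "-c"]
              args: ["echo 'Chief running'; sleep 10;"]
Apply it and confirm the webhook now admits the job:
kubectl apply -f valid-tfjob.yaml
kubectl get tfjob tfjob-valid-test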