2 Monitoring GPU usage
To effectively monitor the performance and utilization of your GPUs, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.
Grant permissions (RBAC). The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.
Create a file named `otel-rbac.yaml` with the following content. It defines a `Role` with permissions to list and watch services and endpoints, and a `RoleBinding` to grant these permissions to the OpenTelemetry Collector’s service account.

```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
  - kind: ServiceAccount
    name: OPENTELEMETRY-COLLECTOR
    namespace: OBSERVABILITY
---
```

Important: Verify that the `ServiceAccount` name and namespace in the `RoleBinding` match your OpenTelemetry Collector’s deployment.

Apply this configuration to the `gpu-operator` namespace.

```
> kubectl apply -n gpu-operator -f otel-rbac.yaml
```

Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector’s values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds.

```yaml
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'gpu-metrics'
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - gpu-operator
```