2 Monitoring GPU usage
To effectively monitor the performance and utilization of your GPUs, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.
1. Verify RBAC permissions. The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster. These permissions are configured automatically when you install the collector with the clusterRole section in the otel-values.yaml file.

   Note: If you installed the OpenTelemetry Collector without the clusterRole configuration, you must upgrade the collector with an updated otel-values.yaml that includes the clusterRole section.

2. Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector's values file. This tells the collector to scrape metrics from any endpoint in the gpu-operator namespace every 10 seconds.

    config:
      receivers:
        prometheus:
          config:
            scrape_configs:
              - job_name: 'gpu-metrics'
                scrape_interval: 10s
                scheme: http
                kubernetes_sd_configs:
                  - role: endpoints
                    namespaces:
                      names:
                        - gpu-operator
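
As a rough illustration of the RBAC step above, the following sketch shows what a clusterRole section in otel-values.yaml might look like. It assumes the clusterRole.create and clusterRole.rules keys of the upstream OpenTelemetry Collector Helm chart. The rule list is an assumption and is not taken from this guide: it reflects the read access that Prometheus endpoints discovery generally needs (endpoints, services, and pods). Compare it against the values file used for your deployment before applying it.

```yaml
# Hypothetical excerpt from otel-values.yaml (illustrative only).
clusterRole:
  # Assumed chart key: create a ClusterRole and bind it to the collector's service account.
  create: true
  rules:
    # Read-only access so that kubernetes_sd_configs with "role: endpoints"
    # can list and watch scrape targets, including those in gpu-operator.
    - apiGroups: [""]
      resources: ["endpoints", "services", "pods"]
      verbs: ["get", "list", "watch"]
```

If the collector runs without these permissions, endpoint discovery typically finds no targets, so no GPU metrics are scraped even though the receiver configuration itself is valid.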