Applies to SUSE AI 1.0

2 Monitoring GPU usage

To effectively monitor the performance and utilization of your GPUs, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.

Collect GPU metrics (recommended)

Verify RBAC permissions. The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster. These permissions are configured automatically when you install the collector with an otel-values.yaml file that includes the clusterRole section.
Note
If you installed the OpenTelemetry Collector without the clusterRole configuration, you must upgrade the collector with the updated otel-values.yaml that includes the clusterRole section.
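For orientation, a clusterRole section might look like the following sketch. It assumes the clusterRole.create and clusterRole.rules keys of the upstream opentelemetry-collector Helm chart. The rule list is an assumption based on what Prometheus endpoint discovery generally needs (read access to endpoints, services and pods), not the exact section shipped with SUSE AI, so compare it with your own otel-values.yaml before relying on it.

```yaml
# Illustrative sketch only: the rules below are an assumption,
# not the section shipped with the product.
clusterRole:
  # Let the Helm chart create a ClusterRole and bind it to the
  # collector's service account.
  create: true
  rules:
    # Read-only access to the objects used by endpoint-based
    # service discovery.
    - apiGroups: [""]
      resources: ["endpoints", "services", "pods"]
      verbs: ["get", "list", "watch"]
```

Read-only verbs are sufficient here because the collector only discovers and scrapes targets. It does not modify cluster objects.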

Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector’s values file. This tells the collector to scrape metrics from any endpoint in the gpu-operator namespace every 10 seconds.

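config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'gpu-metrics'
            # Scrape each discovered target every 10 seconds over plain HTTP.
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              # Discover scrape targets from Kubernetes Endpoints objects.
              - role: endpoints
                # Restrict discovery to the gpu-operator namespace.
                namespaces:
                  names:
                    - gpu-operator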