Applies to SUSE AI 1.0

2 Monitoring GPU usage

To effectively monitor the performance and utilization of your GPUs, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.

Collect GPU metrics (recommended)
  1. Verify RBAC permissions. The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster. These permissions are automatically configured when you install the collector with the clusterRole section in the otel-values.yaml file.

    Note

    If you installed the OpenTelemetry Collector without the clusterRole configuration, you must upgrade the collector with the updated otel-values.yaml that includes the clusterRole section.

  2. Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector’s values file (otel-values.yaml). The scrape job uses Kubernetes service discovery (role: endpoints) to find every endpoint in the gpu-operator namespace and scrapes each one over HTTP every 10 seconds.

    config:
      receivers:
        prometheus:
          config:
            scrape_configs:
              - job_name: 'gpu-metrics'
                scrape_interval: 10s
                scheme: http
                kubernetes_sd_configs:
                  - role: endpoints
                    namespaces:
                      names:
                        - gpu-operator
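
For reference, the clusterRole section mentioned in step 1 might look like the following minimal sketch. It assumes the collector is installed from the upstream OpenTelemetry Collector Helm chart, which accepts clusterRole.create and clusterRole.rules in its values file. The rule set shown is an assumption based on what Prometheus endpoint discovery generally needs (read access to endpoints, services, and pods); it is not copied from the SUSE AI otel-values.yaml, so compare it with the file you actually deployed.

    # Sketch only: assumed clusterRole section for otel-values.yaml.
    clusterRole:
      # Ask the Helm chart to create a ClusterRole and bind it to the collector.
      create: true
      rules:
        # Read-only access that "role: endpoints" service discovery relies on.
        - apiGroups: [""]
          resources: ["endpoints", "services", "pods"]
          verbs: ["get", "list", "watch"]

Granting only get, list, and watch keeps the collector read-only for these resources, which is sufficient for discovering scrape targets.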