Applies to SUSE AI 1.0

2 Monitoring GPU usage

To monitor GPU performance and utilization, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.

Collect GPU metrics (recommended)
  1. Grant permissions (RBAC). The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.

    Create a file named otel-rbac.yaml with the following content. It defines a Role with permissions to list and watch services and endpoints, and a RoleBinding that grants these permissions to the OpenTelemetry Collector’s service account.

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: suse-observability-otel-scraper
    rules:
      - apiGroups:
          - ""
        resources:
          - services
          - endpoints
        verbs:
          - list
          - watch
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: suse-observability-otel-scraper
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: suse-observability-otel-scraper
    subjects:
      - kind: ServiceAccount
        name: OPENTELEMETRY-COLLECTOR
        namespace: OBSERVABILITY
    ---
    Important

    OPENTELEMETRY-COLLECTOR and OBSERVABILITY in the RoleBinding are placeholders. Replace them with the ServiceAccount name and namespace of your OpenTelemetry Collector’s deployment, and verify that they match.

  2. Apply this configuration to the gpu-operator namespace.

    > kubectl apply -n gpu-operator -f otel-rbac.yaml
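
    To confirm that the Role and RoleBinding took effect, you can ask the API server whether the collector's service account may list and watch the two resources. The following is a minimal sketch, not part of the official procedure. It assumes that your own kubectl user is allowed to impersonate service accounts (cluster administrators usually are), and it reuses the OPENTELEMETRY-COLLECTOR and OBSERVABILITY placeholders from the RoleBinding, which you must replace with real values.

    ```shell
    # Placeholders taken from the RoleBinding above; replace them with the
    # namespace and name of your collector's ServiceAccount.
    SA_NAMESPACE="OBSERVABILITY"
    SA_NAME="OPENTELEMETRY-COLLECTOR"

    # Kubernetes identifies a service account by this user name.
    SUBJECT="system:serviceaccount:${SA_NAMESPACE}:${SA_NAME}"

    # Ask the API server about each verb/resource pair granted by the Role.
    check_access() {
      for resource in services endpoints; do
        for verb in list watch; do
          printf '%s %s: ' "$verb" "$resource"
          kubectl auth can-i "$verb" "$resource" \
            --namespace gpu-operator --as "$SUBJECT"
        done
      done
    }

    # Only run the check when kubectl is installed on this machine.
    if command -v kubectl >/dev/null 2>&1; then
      check_access
    fi
    ```

    Each of the four lines should end in yes. A no usually means that the ServiceAccount name or namespace in the RoleBinding does not match the collector's deployment, or that the file was applied to a namespace other than gpu-operator.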

  3. Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector’s values file. This tells the collector to scrape metrics from any endpoint in the gpu-operator namespace every 10 seconds.

    config:
      receivers:
        prometheus:
          config:
            scrape_configs:
              - job_name: 'gpu-metrics'
                scrape_interval: 10s
                scheme: http
                kubernetes_sd_configs:
                  - role: endpoints
                    namespaces:
                      names:
                        - gpu-operator