
Monitoring SUSE AI with OpenTelemetry and SUSE Observability

Publication Date: 18 Sep 2025
WHAT?

This document focuses on techniques for gathering telemetry data from all SUSE AI components, including metrics, logs and traces.

WHY?

To observe, analyze and maintain the behavior, performance and health of your SUSE AI environment, and to troubleshoot issues effectively.

EFFORT

Setting up the recommended monitoring configurations with SUSE Observability is straightforward. Advanced setups for more granular control may require additional time for specialized analysis and fine-tuning.

GOAL

To visualize the complete topology of your services and operations, providing deep insights and clarity into your SUSE AI environment.

1 Introduction

This document focuses on techniques for gathering telemetry data from all SUSE AI components.

For most of the components, it presents two distinct paths:

Recommended settings

Straightforward configurations designed to utilize the SUSE Observability Extension, providing a quick solution for your environment.

Advanced configuration

For users who require deeper and more granular control. Advanced options unlock additional observability signals that are relevant for specialized analysis and fine-tuning.

Several setups are specific to the product, while others—particularly for scraping metrics—are configured directly within the OpenTelemetry Collector. By implementing the recommended settings, you can visualize the complete topology of your services and operations, bringing clarity to your SUSE AI environment.

1.1 What is SUSE AI monitoring?

Monitoring SUSE AI involves observing and analyzing the behavior, performance and health of its components. In a complex, distributed system like SUSE AI, this is achieved by collecting and interpreting telemetry data. This data is typically categorized into the three pillars of observability:

Metrics

Numerical data representing system performance, such as CPU usage, memory consumption or request latency.

Logs

Time-stamped text records of events that occurred within the system, useful for debugging and auditing.

Traces

A representation of the path of a request as it travels through all the different services in the system. Traces are essential for understanding performance bottlenecks and errors in the system architecture.

1.2 How monitoring works

SUSE AI uses OpenTelemetry, an open-source observability framework, for instrumenting applications. Instrumentation is the process of adding code to an application to generate telemetry data. By using OpenTelemetry, SUSE AI ensures a standardized, vendor-neutral approach to data collection.

The collected data is then sent to SUSE Observability, which provides a comprehensive platform for visualizing, analyzing and alerting on the telemetry data. This allows administrators and developers to gain deep insights into the system, maintain optimal performance, and troubleshoot issues effectively.

2 Monitoring GPU usage

To effectively monitor the performance and utilization of your GPUs, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.

Procedure 1: Collect GPU metrics (recommended)
  1. Grant permissions (RBAC). The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.

    Create a file named otel-rbac.yaml with the following content. It defines a Role with permissions to get services and endpoints, and a RoleBinding to grant these permissions to the OpenTelemetry Collector's service account.

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: suse-observability-otel-scraper
    rules:
      - apiGroups:
          - ""
        resources:
          - services
          - endpoints
        verbs:
          - list
          - watch
    
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: suse-observability-otel-scraper
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: suse-observability-otel-scraper
    subjects:
      - kind: ServiceAccount
        name: OPENTELEMETRY-COLLECTOR
        namespace: OBSERVABILITY
    ---
    Important

    Verify that the ServiceAccount name and namespace in the RoleBinding match your OpenTelemetry Collector's deployment.

  2. Apply this configuration to the gpu-operator namespace.

    > kubectl apply -n gpu-operator -f otel-rbac.yaml
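
    Optionally, verify that both objects exist before continuing. The following command lists the objects created from otel-rbac.yaml; the output should show suse-observability-otel-scraper for both kinds.

    > kubectl get role,rolebinding -n gpu-operator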
  3. Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector's values file. This tells the collector to scrape metrics from any endpoint in the gpu-operator namespace every 10 seconds.

    config:
      receivers:
        prometheus:
          config:
            scrape_configs:
              - job_name: 'gpu-metrics'
                scrape_interval: 10s
                scheme: http
                kubernetes_sd_configs:
                  - role: endpoints
                    namespaces:
                      names:
                        - gpu-operator
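
    The prometheus receiver collects data only if it is also referenced in a metrics pipeline of the collector. The following snippet is a sketch of that wiring; the receivers, processors and exporters already present in your OpenTelemetry Collector values may differ, so merge it with your existing pipeline definition instead of replacing it.

    config:
      service:
        pipelines:
          metrics:
            receivers:
              - otlp
              - prometheus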

3 Monitoring Open WebUI

The preferred way of retrieving relevant telemetry data from Open WebUI is to use the SUSE AI Filter. It requires enabling and configuring Open WebUI Pipelines.

Procedure 2: Configuring pipeline filter during Open WebUI installation (recommended)
  1. Verify that the Open WebUI installation override file owui_custom_overrides.yaml includes the following content.

    pipelines:
      enabled: true
      persistence:
        storageClass: longhorn 1
      extraEnvVars: 2
        - name: PIPELINES_URLS 3
          value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py"
        - name: OTEL_SERVICE_NAME 4
          value: "Open WebUI"
        - name: OTEL_EXPORTER_HTTP_OTLP_ENDPOINT 5
          value: "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318"
        - name: PRICING_JSON 6
          value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/pricing.json"
    extraEnvVars:
    - name: OPENAI_API_KEY 7
      value: "0p3n-w3bu!"
    Note

    In the above example, there are two extraEnvVars blocks: one at the root level and another inside the pipelines configuration. The root-level extraEnvVars is passed to Open WebUI to configure the communication between Open WebUI and Open WebUI Pipelines. The extraEnvVars inside the pipelines configuration are injected into the container that acts as a runtime for the pipelines.

    1

    The storage class to use, either longhorn or local-path.

    2

    The environment variables that you are making available for the pipeline's runtime container.

    3

    A list of pipeline URLs to be downloaded and installed by default. Individual URLs are separated by a semicolon ;.

    4

    The service name that appears in traces and topological representations in SUSE Observability.

    5

    The endpoint for the OpenTelemetry collector. Make sure to use the HTTP port of your collector.

    6

    A file that defines the model cost multipliers used for cost estimation. You can customize it experimentally to match your actual infrastructure.

    7

    The value for the API key between Open WebUI and Open WebUI Pipelines. The default value is 0p3n-w3bu!.

  2. After you have filled in the override file with the correct values, install or update Open WebUI.

    > helm upgrade \
      --install open-webui oci://dp.apps.rancher.io/charts/open-webui \
      -n suse-private-ai \
      --create-namespace \
      --version 7.2.0 \
      -f owui_custom_overrides.yaml
    Tip

    Make sure to set the version, namespace and other options to the proper values.

    After the installation is successful, you can access tracing data in SUSE Observability for each chat.

    Tip

    You can verify that a new connection was created with correct credentials in Admin Panel › Settings › Connections.

    Figure 1: New connection added for the pipeline
Procedure 3: Configuring a pipeline filter in Open WebUI (recommended)

If you already have a running instance of Open WebUI with the pipelines enabled and configured, you can set up the SUSE AI Filter in its Web user interface.

Requirements
  • You must have Open WebUI administrator privileges to access configuration screens or settings mentioned in this section.

  1. In the bottom left of the Open WebUI window, click your avatar icon to open the user menu and select Admin Panel.

  2. Click the Settings tab and select Pipelines from the left menu.

  3. In the Install from Github URL section, enter https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py and click the upload button on the right to upload the pipeline from the URL.

  4. After the upload is finished, you can review the configuration of the pipeline. Confirm with Save.

    Figure 2: Adding SUSE AI filter pipeline
Procedure 4: Configuring default Open WebUI metrics and traces (advanced)

Open WebUI also offers built-in OpenTelemetry integration for traces and metrics. These signals cover API consumption but do not provide GenAI-specific details. That is why you still need to configure the SUSE AI filter as described in Procedure 2, “Configuring pipeline filter during Open WebUI installation (recommended)”.

  1. Append the following environment variables to your extraEnvVars section in the owui_custom_overrides.yaml file mentioned in Procedure 2, “Configuring pipeline filter during Open WebUI installation (recommended)”.

    [...]
    extraEnvVars:
    - name: OPENAI_API_KEY
      value: "0p3n-w3bu!"
    - name: ENABLE_OTEL
      value: "true"
    - name: ENABLE_OTEL_METRICS
      value: "true"
    - name: OTEL_EXPORTER_OTLP_INSECURE
      value: "false" 1
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: CUSTOM_OTEL_ENDPOINT 2
    - name: OTEL_SERVICE_NAME
      value: CUSTOM_OTEL_IDENTIFIER 3

    1

    Set to "true" for testing or controlled environments, and "false" for production deployments with TLS communication.

    2

    Enter your custom OpenTelemetry collector endpoint URL, such as "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318".

    3

    Specify a custom identifier for the OpenTelemetry service, such as "OI Core".

  2. Save the enhanced override file and update Open WebUI:

    > helm upgrade \
      --install open-webui oci://dp.apps.rancher.io/charts/open-webui \
      -n suse-private-ai \
      --create-namespace \
      --version 7.2.0 \
      -f owui_custom_overrides.yaml
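
    After the upgrade, you can optionally confirm that the variables reached the Open WebUI container. The following command is a sketch and assumes that the deployment is named open-webui; adjust the name and namespace to match your installation.

    > kubectl -n suse-private-ai exec deploy/open-webui -- env | grep OTEL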

4 Monitoring Milvus

Milvus is monitored by scraping its Prometheus-compatible metrics endpoint. The SUSE Observability Extension uses these metrics to visualize Milvus's status and activity.

4.1 Scraping the metrics (recommended)

Add the following job to the scrape_configs section of your OpenTelemetry Collector's configuration. It instructs the collector to scrape the /metrics endpoint of the Milvus service every 15 seconds.

config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'milvus'
            scrape_interval: 15s
            metrics_path: '/metrics'
 
            static_configs:
            - targets: ['milvus.suse-private-ai.svc.cluster.local:9091'] 1

1

Your Milvus service metrics endpoint. The example milvus.suse-private-ai.svc.cluster.local:9091 is a common default, but you should verify that it matches your installation's service name and namespace.
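
To look up the actual service name and port, you can list the services in the Milvus namespace. The following command assumes that Milvus runs in the suse-private-ai namespace.

> kubectl -n suse-private-ai get svc | grep -i milvus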

4.2 Tracing (advanced)

Milvus can also export detailed tracing data.

Important: High data volume

Enabling tracing in Milvus can generate a large amount of data. We recommend configuring sampling at the collector level to avoid performance issues and high storage costs; see the sampling example at the end of this section.

To enable tracing, configure the following settings in your Milvus Helm chart values:

extraConfigFiles:
  user.yaml: |+
    trace:
      exporter: jaeger
      sampleFraction: 1
      jaeger:
        url: "http://opentelemetry-collector.observability.svc.cluster.local:14268/api/traces" 1

1

The trace endpoint URL of the OpenTelemetry Collector installed by the user. Adjust the host name and namespace to match your collector installation.
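
To follow the sampling recommendation above, you can add a probabilistic sampler to your OpenTelemetry Collector. The following snippet is a sketch that keeps roughly 10 percent of the traces; the processors already present in your traces pipeline may differ from the ones shown here, so merge it with your existing configuration.

config:
  processors:
    probabilistic_sampler:
      sampling_percentage: 10
  service:
    pipelines:
      traces:
        processors:
          - memory_limiter
          - probabilistic_sampler
          - batch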

5 Monitoring user-managed applications

To monitor other applications, you can utilize OpenTelemetry SDKs or any other instrumentation provider compatible with OpenTelemetry's semantics, for example, OpenLIT SDK. For more details, refer to Appendix B, Instrument applications with OpenLIT SDK.

OpenTelemetry offers several instrumentation techniques for different deployment scenarios and applications. You can instrument applications either manually, with more detailed control, or automatically for an easier starting point.

Tip
Tip

One of the most straightforward ways of getting started with OpenTelemetry is using the OpenTelemetry Operator for Kubernetes, which is available in the SUSE Application Collection. Find more information in this extensive guide on how to use this operator for instrumenting your applications.
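
As an illustration, a minimal Instrumentation resource for the OpenTelemetry Operator could look like the following sketch. The name, namespace and endpoint are example values based on the collector address used elsewhere in this document; adjust them to your environment.

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: suse-ai-instrumentation
  namespace: suse-private-ai
spec:
  exporter:
    endpoint: http://opentelemetry-collector.suse-observability.svc.cluster.local:4318

Workloads then opt in to automatic instrumentation through pod annotations such as instrumentation.opentelemetry.io/inject-python: "true".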

5.1 Ensuring that the telemetry data is properly captured by the SUSE Observability Extension

For the SUSE Observability Extension to recognize an application as a GenAI application, the application must have a meter configured. The meter must provide at least the RequestsTotal metric with the following attributes:

  • TelemetrySDKLanguage

  • ServiceName

  • ServiceInstanceId

  • ServiceNamespace

  • GenAIEnvironment

  • GenAiApplicationName

  • GenAiSystem

  • GenAiOperationName

  • GenAiRequestModel

Both the meter and the tracer must contain the following resource attributes:

service.name

The logical name of the service. Defaults to "My App".

service.version

The version of the service. Defaults to "1.0".

deployment.environment

The name of the deployment environment, such as "production" or "staging". Defaults to "default".

telemetry.sdk.name

The value must be "openlit".
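
A minimal sketch of meeting these requirements with the OpenLIT Python SDK is shown below. It assumes that the SDK's application_name and environment parameters map to the service.name and deployment.environment resource attributes; replace the example values and the collector endpoint with ones matching your deployment.

import openlit

# Initialize OpenLIT metrics and tracing. The SDK reports itself as
# telemetry.sdk.name = "openlit", which the SUSE Observability Extension expects.
openlit.init(
    otlp_endpoint="http://opentelemetry-collector.suse-observability.svc.cluster.local:4318",
    application_name="my-genai-app",  # example value for the application name
    environment="production",         # example value for the deployment environment
)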

The following metrics are utilized in the graphs of the SUSE Observability Extension:

gen_ai.client.token.usage

Measures the number of used input and output tokens.

Type: histogram

Unit: token

gen_ai.total.requests

Number of requests.

Type: counter

Unit: integer

gen_ai.usage.cost

The distribution of GenAI request costs.

Type: histogram

Unit: USD

gen_ai.usage.input_tokens

Number of prompt tokens processed.

Type: counter

Unit: integer

gen_ai.usage.output_tokens

Number of completion tokens processed.

Type: counter

Unit: integer

gen_ai.client.token.usage

Number of tokens processed.

Type: counter

Unit: integer

5.2 Troubleshooting

1. No metrics received from any component.
  • Verify the OpenTelemetry Collector deployment.

  • Check that the exporter points to the SUSE Observability collector and that the correct API key and endpoint are specified.

2. No metrics received from the GPU.
  • Verify that the RBAC rules were applied.

  • Verify that the Prometheus receiver scrape job is configured.

  • Check the NVIDIA DCGM Exporter for errors.

3. No metrics received from Milvus.
  • Verify that the Milvus chart configuration exposes the metrics endpoint.

  • Verify that the Prometheus receiver scrape job is configured.

  • For usage metrics, confirm that requests were actually made to Milvus.

4. No tracing data received from any component.
  • Verify the OpenTelemetry Collector deployment.

  • Check that the exporter points to the SUSE Observability collector, with the correct API key and endpoint set.

5. No tracing data received from Open WebUI.
  • Verify that the SUSE AI Observability Filter was installed and configured properly.

  • Verify that chat requests were actually made.

6. Cost estimation is far from real values.

Recalculate the multipliers for the PRICING_JSON in the SUSE AI Observability Filter.

7. Storage volume consumption is unexpectedly high.

Verify that sampling is applied in the OpenTelemetry Collector.
