Monitoring SUSE AI with OpenTelemetry and SUSE Observability
- WHAT?
This document focuses on techniques for gathering telemetry data from all SUSE AI components, including metrics, logs and traces.
- WHY?
To observe, analyze and maintain the behavior, performance and health of your SUSE AI environment, and to troubleshoot issues effectively.
- EFFORT
Setting up the recommended monitoring configurations with SUSE Observability is straightforward. Advanced setups for more granular control may require additional time for specialized analysis and fine-tuning.
- GOAL
To visualize the complete topology of your services and operations, providing deep insights and clarity into your SUSE AI environment.
1 Introduction #
This document focuses on techniques for gathering telemetry data from all SUSE AI components.
For most of the components, it presents two distinct paths:
- Recommended settings
Straightforward configurations designed to utilize the SUSE Observability Extension, providing a quick solution for your environment.
- Advanced configuration
For users who require deeper and more granular control. Advanced options unlock additional observability signals that are relevant for specialized analysis and fine-tuning.
Several setups are specific to the product, while others—particularly for scraping metrics—are configured directly within the OpenTelemetry Collector. By implementing the recommended settings, you can visualize the complete topology of your services and operations, bringing clarity to your SUSE AI environment.
1.1 What is SUSE AI monitoring? #
Monitoring SUSE AI involves observing and analyzing the behavior, performance and health of its components. In a complex, distributed system like SUSE AI, this is achieved by collecting and interpreting telemetry data. This data is typically categorized into the three pillars of observability:
- Metrics
Numerical data representing system performance, such as CPU usage, memory consumption or request latency.
- Logs
Time-stamped text records of events that occurred within the system, useful for debugging and auditing.
- Traces
A representation of the path of a request as it travels through all the different services in the system. Traces are essential for understanding performance bottlenecks and errors in the system architecture.
1.2 How monitoring works #
SUSE AI uses OpenTelemetry, an open-source observability framework, for instrumenting applications. Instrumentation is the process of adding code to an application to generate telemetry data. By using OpenTelemetry, SUSE AI ensures a standardized, vendor-neutral approach to data collection.
The collected data is then sent to SUSE Observability, which provides a comprehensive platform for visualizing, analyzing and alerting on the telemetry data. This allows administrators and developers to gain deep insights into the system, maintain optimal performance, and troubleshoot issues effectively.
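As an illustration of this data flow, the following is a minimal sketch of an OpenTelemetry Collector exporter section that forwards telemetry to SUSE Observability over OTLP. The endpoint and authentication header are placeholders; use the values provided by your SUSE Observability instance.
exporters:
  otlp:
    # Placeholder endpoint; point this at the OTLP ingestion endpoint of your SUSE Observability instance.
    endpoint: OTLP_ENDPOINT:4317
    headers:
      # Placeholder value; supply the API key in the form expected by your SUSE Observability instance.
      Authorization: "API_KEY"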
2 Monitoring GPU usage #
To effectively monitor the performance and utilization of your GPUs, configure the OpenTelemetry Collector to scrape metrics from the NVIDIA DCGM Exporter, which is deployed as part of the NVIDIA GPU Operator.
Grant permissions (RBAC). The OpenTelemetry Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.
Create a file named otel-rbac.yaml with the following content. It defines a Role with permissions to list and watch services and endpoints, and a RoleBinding to grant these permissions to the OpenTelemetry Collector's service account.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
- kind: ServiceAccount
  name: OPENTELEMETRY-COLLECTOR
  namespace: OBSERVABILITY
---
Important: Verify that the ServiceAccount name and namespace in the RoleBinding match your OpenTelemetry Collector's deployment.
Apply this configuration to the gpu-operator namespace.
namespace.>
kubectl apply -n gpu-operator -f otel-rbac.yaml
Configure the OpenTelemetry Collector. Add the following Prometheus receiver configuration to your OpenTelemetry Collector's values file. It tells the collector to scrape metrics from any endpoint in the gpu-operator namespace every 10 seconds.
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
        - job_name: 'gpu-metrics'
          scrape_interval: 10s
          scheme: http
          kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
              - gpu-operator
3 Monitoring Open WebUI #
The preferred way of retrieving relevant telemetry data from Open WebUI is to use the SUSE AI Filter. It requires enabling and configuring Open WebUI Pipelines.
Verify that the Open WebUI installation override file owui_custom_overrides.yaml includes the following content.
pipelines:
  enabled: true
  persistence:
    storageClass: longhorn 1
  extraEnvVars: 2
  - name: PIPELINES_URLS 3
    value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py"
  - name: OTEL_SERVICE_NAME 4
    value: "Open WebUI"
  - name: OTEL_EXPORTER_HTTP_OTLP_ENDPOINT 5
    value: "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318"
  - name: PRICING_JSON 6
    value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/pricing.json"
extraEnvVars:
- name: OPENAI_API_KEY 7
  value: "0p3n-w3bu!"
Note: In the above example, there are two extraEnvVars blocks: one at the root level and another inside the pipelines configuration. The root-level extraEnvVars is fed into Open WebUI to configure the communication between Open WebUI and Open WebUI Pipelines. The extraEnvVars inside the pipelines configuration is injected into the container that acts as a runtime for the pipelines.
1. The storage class for the pipeline's persistent storage, for example, longhorn or local-path.
2. The environment variables that you are making available for the pipeline's runtime container.
3. A list of pipeline URLs to be downloaded and installed by default. Individual URLs are separated by a semicolon (;).
4. The service name that appears in traces and topological representations in SUSE Observability.
5. The endpoint of the OpenTelemetry Collector. Make sure to use the HTTP port of your collector.
6. A file with model multipliers used for cost estimation. You can customize it experimentally to match your actual infrastructure.
7. The API key shared between Open WebUI and Open WebUI Pipelines. The default value is "0p3n-w3bu!".
After you fill in the override file with the correct values, install or update Open WebUI.
>
helm upgrade \
  --install open-webui oci://dp.apps.rancher.io/charts/open-webui \
  -n suse-private-ai \
  --create-namespace \
  --version 7.2.0 \
  -f owui_custom_overrides.yaml
Tip: Make sure to set the version, namespace and other options to the proper values.
After the installation is successful, you can access tracing data in SUSE Observability for each chat.
Tip: You can verify that a new connection was created with the correct credentials in the Open WebUI administration settings.
Figure 1: New connection added for the pipeline #
If you already have a running instance of Open WebUI with the pipelines enabled and configured, you can set up the SUSE AI Filter in its Web user interface.
You must have Open WebUI administrator privileges to access configuration screens or settings mentioned in this section.
In the bottom left of the Open WebUI window, click your avatar icon to open the user menu and select Admin Panel.
Click the Settings tab and select Pipelines from the left menu.
In the pipeline URL field, enter https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py and click the upload button on the right to upload the pipeline from the URL.
After the upload is finished, you can review the configuration of the pipeline. Confirm with Save.
Figure 2: Adding SUSE AI filter pipeline #
Open WebUI also offers a built-in OpenTelemetry integration for traces and metrics. These signals relate to API consumption but do not provide GenAI-specific details. That is why you need to configure the SUSE AI Filter as described in Procedure 2, “Configuring pipeline filter during Open WebUI installation (recommended)”.
Append the following environment variables to your extraEnvVars section in the owui_custom_overrides.yaml file mentioned in Procedure 2, “Configuring pipeline filter during Open WebUI installation (recommended)”.
[...]
extraEnvVars:
- name: OPENAI_API_KEY
  value: "0p3n-w3bu!"
- name: ENABLE_OTEL
  value: "true"
- name: ENABLE_OTEL_METRICS
  value: "true"
- name: OTEL_EXPORTER_OTLP_INSECURE
  value: "false" 1
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: CUSTOM_OTEL_ENDPOINT 2
- name: OTEL_SERVICE_NAME
  value: CUSTOM_OTEL_IDENTIFIER 3
1. Set to "true" for testing or controlled environments, and "false" for production deployments with TLS communication.
2. Enter your custom OpenTelemetry collector endpoint URL, such as "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318".
3. Specify a custom identifier for the OpenTelemetry service, such as "OI Core".
.Save the enhanced override file and update Open WebUI:
>
helm upgrade \
  --install open-webui oci://dp.apps.rancher.io/charts/open-webui \
  -n suse-private-ai \
  --create-namespace \
  --version 7.2.0 \
  -f owui_custom_overrides.yaml
4 Monitoring Milvus #
Milvus is monitored by scraping its Prometheus-compatible metrics endpoint. The SUSE Observability Extension uses these metrics to visualize Milvus's status and activity.
4.1 Scraping the metrics (recommended) #
Add the following job to the scrape_configs section of your OpenTelemetry Collector's configuration. It instructs the collector to scrape the /metrics endpoint of the Milvus service every 15 seconds.
config:
receivers:
prometheus:
config:
scrape_configs:
- job_name: 'milvus'
scrape_interval: 15s
metrics_path: '/metrics'
static_configs:
- targets: ['milvus.suse-private-ai.svc.cluster.local:9091'] 1
Your Milvus service metrics endpoint. The example assumes that Milvus is installed in the suse-private-ai namespace and exposes metrics on port 9091.
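If no metrics arrive from this endpoint, check that metrics are enabled in your Milvus Helm chart values. The following is a minimal sketch based on the upstream Milvus chart, where the metrics endpoint listens on port 9091 by default; verify the exact keys against the chart version you use.
# Sketch of Milvus Helm chart values keeping the Prometheus metrics endpoint enabled.
metrics:
  enabled: true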
4.2 Tracing (advanced) #
Milvus can also export detailed tracing data.
Enabling tracing in Milvus can generate a large amount of data. We recommend configuring sampling at the collector level to avoid performance issues and high storage costs.
To enable tracing, configure the following settings in your Milvus Helm chart values:
extraConfigFiles:
user.yaml: |+
trace:
exporter: jaeger
sampleFraction: 1
      jaeger:
        url: "http://opentelemetry-collector.observability.svc.cluster.local:14268/api/traces" 1
The URL of the OpenTelemetry Collector installed by the user.
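Following the sampling recommendation above, trace volume can be reduced in the OpenTelemetry Collector itself, for example with the probabilistic sampler processor. The snippet below is a minimal sketch; the receiver and exporter names are assumptions and must match your actual collector configuration.
config:
  processors:
    probabilistic_sampler:
      # Keep roughly 10% of traces; tune the percentage to your storage budget.
      sampling_percentage: 10
  service:
    pipelines:
      traces:
        # "jaeger" is an assumed receiver name matching the Milvus Jaeger exporter above.
        receivers: [jaeger]
        processors: [probabilistic_sampler]
        # "otlp" is an assumed exporter name pointing at SUSE Observability.
        exporters: [otlp]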
5 Monitoring user-managed applications #
To monitor other applications, you can utilize OpenTelemetry SDKs or any other instrumentation provider compatible with OpenTelemetry's semantics, for example, OpenLIT SDK. For more details, refer to Appendix B, Instrument applications with OpenLIT SDK.
OpenTelemetry offers several instrumentation techniques for different deployment scenarios and applications. You can instrument applications either manually, with more detailed control, or automatically for an easier starting point.
One of the most straightforward ways of getting started with OpenTelemetry is using the OpenTelemetry Operator for Kubernetes, which is available in the SUSE Application Collection. Find more information in this extensive guide on how to use this operator for instrumenting your applications.
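As an illustration of the operator-based approach, the following is a minimal sketch of an Instrumentation resource consumed by the OpenTelemetry Operator. The collector endpoint and the resource name are assumptions that depend on your deployment.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    # Assumed in-cluster OpenTelemetry Collector OTLP/HTTP endpoint.
    endpoint: http://opentelemetry-collector.suse-observability.svc.cluster.local:4318
A workload then opts in to auto-instrumentation with a pod template annotation such as instrumentation.opentelemetry.io/inject-python: "true", where the language suffix depends on the application runtime.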
5.1 Ensuring that the telemetry data is properly captured by the SUSE Observability Extension #
For the SUSE Observability Extension to acknowledge an application as a GenAI application, it needs to have a meter configured. It must provide at least the RequestsTotal metric with the following attributes:
TelemetrySDKLanguage
ServiceName
ServiceInstanceId
ServiceNamespace
GenAIEnvironment
GenAiApplicationName
GenAiSystem
GenAiOperationName
GenAiRequestModel
Both the meter and the tracer must contain the following resource attributes:
- service.name
The logical name of the service. Defaults to "My App".
- service.version
The version of the service. Defaults to "1.0".
- deployment.environment
The name of the deployment environment, such as "production" or "staging". Defaults to "default".
- telemetry.sdk.name
The value must be "openlit".
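When you instrument an application yourself, the standard OpenTelemetry SDK environment variables can supply most of these resource attributes; telemetry.sdk.name is set by the SDK itself, which is why the OpenLIT SDK is expected. The following container environment is a minimal sketch with example values only.
# Example container environment for a user-managed GenAI application (values are illustrative).
env:
- name: OTEL_SERVICE_NAME
  value: "My App"
- name: OTEL_RESOURCE_ATTRIBUTES
  value: "service.version=1.0,deployment.environment=production"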
The following metrics are utilized in the graphs of the SUSE Observability Extension:
- gen_ai.client.token.usage
Measures the number of used input and output tokens.
Type: histogram
Unit: token
- gen_ai.total.requests
Number of requests.
Type: counter
Unit: integer
- gen_ai.usage.cost
The distribution of GenAI request costs.
Type: histogram
Unit: USD
- gen_ai.usage.input_tokens
Number of prompt tokens processed.
Type: counter
Unit: integer
- gen_ai.usage.output_tokens
Number of completion tokens processed.
Type: counter
Unit: integer
- gen_ai.client.token.usage
Number of tokens processed.
Type: counter
Unit: integer
5.2 Troubleshooting #
- 1. No metrics received from any components.
Verify the OpenTelemetry Collector deployment.
Check that the exporter is properly set to the SUSE Observability collector, with the correct API key and endpoint specified.
- 2. No metrics received from the GPU.
Verify if the RBAC rules were applied.
Verify if the metrics receiver scraper is configured.
Check the NVIDIA DCGM Exporter for errors.
- 3. No metrics received from Milvus.
Verify that the Milvus chart configuration exposes the metrics endpoint.
Verify if the metrics receiver scraper is configured.
For usage metrics, confirm that requests were actually made to Milvus.
- 4. No tracing data received from any components.
Verify the OpenTelemetry Collector deployment.
Check if the exporter is properly set to the SUSE Observability collector, with the right API key and endpoint set.
- 5. No tracing data received from Open WebUI.
Verify if the SUSE AI Observability Filter was installed and configured properly.
Verify if chat requests actually happened.
- 6. Cost estimation is far from real values.
Recalculate the multipliers for the PRICING_JSON in the SUSE AI Observability Filter.
- 7. There is high demand for storage volume.
Verify if sampling is being applied in the OpenTelemetry Collector.