Monitoring SUSE AI with OpenTelemetry and SUSE Observability|Monitoring user-managed applications
Applies to SUSE AI 1.0

7 Monitoring user-managed applications

To monitor other applications, you can use the OpenTelemetry SDKs or any other instrumentation provider compatible with OpenTelemetry semantics, for example, the OpenLIT SDK. For more details, refer to Appendix B, Instrument applications with OpenLIT SDK.

OpenTelemetry offers several instrumentation techniques for different deployment scenarios and applications. You can instrument applications either manually, for more detailed control, or automatically, for an easier starting point.

Tip

One of the most straightforward ways of getting started with OpenTelemetry is using the OpenTelemetry Operator for Kubernetes, which is available in the SUSE Application Collection. For more information on instrumenting your applications with this operator, refer to its extensive guide.

7.1 Ensuring that the telemetry data is properly captured by the SUSE Observability Extension

For the SUSE Observability Extension to recognize an application as a GenAI application, the application must have a meter configured that provides at least the RequestsTotal metric with the following attributes:

  • TelemetrySDKLanguage

  • ServiceName

  • ServiceInstanceId

  • ServiceNamespace

  • GenAIEnvironment

  • GenAiApplicationName

  • GenAiSystem

  • GenAiOperationName

  • GenAiRequestModel

Both the meter and the tracer must contain the following resource attributes:

service.name

The logical name of the service. Defaults to 'My App'.

service.version

The version of the service. Defaults to '1.0'.

deployment.environment

The name of the deployment environment, such as 'production' or 'staging'. Defaults to 'default'.

telemetry.sdk.name

The value must be 'openlit'.

The following metrics are used in the graphs of the SUSE Observability Extension:

gen_ai.client.token.usage

Measures the number of used input and output tokens.

Type: histogram

Unit: token

gen_ai.total.requests

Number of requests.

Type: counter

Unit: integer

gen_ai.usage.cost

The distribution of GenAI request costs.

Type: histogram

Unit: USD

gen_ai.usage.input_tokens

Number of prompt tokens processed.

Type: counter

Unit: integer

gen_ai.usage.output_tokens

Number of completion tokens processed.

Type: counter

Unit: integer

gen_ai.usage.total_tokens

Number of tokens processed.

Type: counter

Unit: integer

7.2 Troubleshooting

No metrics received from any components.
  • Verify the OpenTelemetry Collector deployment.

  • Check that the exporter points to the SUSE Observability collector, with the correct API key and endpoint specified.

No metrics received from the GPU.
  • Verify if the RBAC rules were applied.

  • Verify if the metrics receiver scraper is configured.

  • Check the NVIDIA DCGM Exporter for errors.

No metrics received from Milvus.
  • Verify if the Milvus chart configuration exposes the metrics endpoint.

  • Verify if the metrics receiver scraper is configured.

  • For usage metrics, confirm that requests were actually made to Milvus.

No metrics received from vLLM.
  • Verify if the vLLM chart configuration exposes the metrics endpoint.

  • Verify if the metrics receiver scraper is properly configured, in particular, the placeholders.

  • If some metrics are present, but not usage-related ones, verify if you actually made requests to vLLM.

  • If vLLM was not identified, or vLLM metrics have the prefix "vllm:", check the collector configuration. You may be using SUSE Observability 2.6.2, which requires additional configuration.

GPU nodes are stuck in NotReady status.
  • Verify if there is a driver version mismatch between the host and the version that the GPU operator expects.

  • You may need to reinstall the Kubernetes StackPack.

No tracing data received from any components.
  • Verify the OpenTelemetry Collector deployment.

  • Check that the exporter points to the SUSE Observability collector, with the correct API key and endpoint specified.

No tracing data received from Open WebUI.
  • Verify if the SUSE AI Observability Filter was installed and configured properly.

  • Verify if chat requests actually happened.

Cost estimation is far from real values.
  • Recalculate the multipliers for the PRICING_JSON in the SUSE AI Observability Filter.
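Recalculating the multipliers amounts to scaling token counts by per-model prices. A stdlib-only sketch; the pricing table below is a hypothetical structure with made-up values, not the actual PRICING_JSON schema:

```python
# Hypothetical per-1K-token prices in USD; the real PRICING_JSON used by
# the SUSE AI Observability Filter may use a different schema and values.
PRICING = {
    "llama-3": {"input_per_1k": 0.0005, "output_per_1k": 0.0015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from token counts and price multipliers."""
    price = PRICING[model]
    return ((input_tokens / 1000) * price["input_per_1k"]
            + (output_tokens / 1000) * price["output_per_1k"])
```

Comparing `estimate_cost(...)` against a few real invoiced requests shows how far the current multipliers drift and by what factor to correct them.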

There is high demand for storage volume.
  • Verify if sampling is being applied in the OpenTelemetry Collector.