Monitoring SUSE AI with OpenTelemetry and SUSE Observability
Applies to SUSE AI 1.0

1 Introduction

This document focuses on techniques for gathering telemetry data from all SUSE AI components.

For most of the components, it presents two distinct paths:

Recommended settings

Straightforward configurations designed to work with the SUSE Observability Extension, so that you can get monitoring running quickly in your environment.

Advanced configuration

Intended for users who require deeper, more granular control. The advanced options unlock additional observability signals that are relevant for specialized analysis and fine-tuning.

Several setups are specific to the product, while others, particularly those for scraping metrics, are configured directly within the OpenTelemetry Collector. By implementing the recommended settings, you can visualize the complete topology of the services and operations in your SUSE AI environment.

1.1 What is SUSE AI monitoring?

Monitoring SUSE AI involves observing and analyzing the behavior, performance and health of its components. In a complex, distributed system like SUSE AI, this is achieved by collecting and interpreting telemetry data. This data is typically categorized into the three pillars of observability:

Metrics

Numerical data representing system performance, such as CPU usage, memory consumption or request latency.

Logs

Time-stamped text records of events that occurred within the system, useful for debugging and auditing.

Traces

A representation of the path a request takes as it travels through the different services in the system. Traces are essential for locating performance bottlenecks and errors across service boundaries.
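To make the three signal types more concrete, the following minimal Python sketch models each of them as a plain record. It is illustrative only: the class names, field names and the service names ("gateway", "chat-ui", "inference") are assumptions made for this example and do not reflect the actual data model used by OpenTelemetry or SUSE Observability.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricPoint:
    """A numerical measurement taken at a point in time."""
    name: str         # for example "request.latency"
    value: float
    unit: str
    timestamp: float  # seconds since the epoch


@dataclass
class LogRecord:
    """A time-stamped text record of an event."""
    timestamp: float
    severity: str
    message: str


@dataclass
class Span:
    """One step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    duration_ms: float


def request_path(spans: list[Span]) -> list[str]:
    """Return service names in the order the request visited them,
    walking depth-first from the root span along parent links."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    path: list[str] = []

    def visit(parent_id: Optional[str]) -> None:
        for s in children.get(parent_id, []):
            path.append(s.service)
            visit(s.span_id)

    visit(None)
    return path


# Hypothetical trace: one request crossing three services.
example_trace = [
    Span("t1", "c", "b", "inference", 70.0),
    Span("t1", "a", None, "gateway", 120.0),
    Span("t1", "b", "a", "chat-ui", 90.0),
]
example_path = request_path(example_trace)
```

In this sketch, the parent links are what turn a flat list of spans into the path of a request, which is the property that lets a trace show where time is spent between services.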

1.2 How monitoring works

SUSE AI uses OpenTelemetry, an open-source observability framework, for instrumenting applications. Instrumentation is the process of adding code to an application to generate telemetry data. By using OpenTelemetry, SUSE AI ensures a standardized, vendor-neutral approach to data collection.
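As a rough illustration of what "adding code to an application to generate telemetry data" means, the sketch below wraps a piece of application logic in a timing context manager. It uses only the Python standard library and deliberately does not use the OpenTelemetry API. The names `traced`, `RECORDED_SPANS` and `answer_prompt` are hypothetical, and the in-memory list merely stands in for an exporter that would forward the data to a collector.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink standing in for a telemetry exporter.
RECORDED_SPANS = []


@contextmanager
def traced(name, **attributes):
    """Record the duration and outcome of the wrapped block."""
    record = {"name": name, "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Mark the record as failed, then let the error propagate.
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        RECORDED_SPANS.append(record)


def answer_prompt(prompt):
    # Application logic wrapped with instrumentation code.
    with traced("answer_prompt", prompt_length=len(prompt)):
        return prompt.upper()  # placeholder for the real work


result = answer_prompt("hello")
```

After the call, `RECORDED_SPANS` holds one record with the operation name, its attributes, its status and its duration. An instrumentation framework produces comparable data in a standardized format, so that application code does not depend on a particular back end.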

The collected data is then sent to SUSE Observability, which provides a comprehensive platform for visualizing, analyzing and alerting on the telemetry data. This allows administrators and developers to gain deep insights into the system, maintain optimal performance, and troubleshoot issues effectively.