Glossary #
- AI, artificial intelligence #
- The simulation of human intelligence in machines designed to learn and solve problems. Enables computers to understand language, make decisions and improve from experience. 
- Air gap #
- A security measure where a computer network is physically isolated from unsecured networks, including the public Internet. 
- Batch size #
- The number of samples processed simultaneously during model training or inference, affecting processing speed and resource utilization. 
- BYOC, bring your own certificate #
- A practice allowing users to provide their own SSL/TLS certificates for securing communications instead of using default or auto-generated ones. 
- CA, certification authority #
- An entity that issues digital certificates to verify the identity of certificate holders and ensure secure communications. 
- Chain-of-thought (CoT) prompting #
- A prompting technique that guides AI models to break down complex problems into step-by-step reasoning processes, improving response accuracy and transparency. 
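The difference is easiest to see side by side. The sketch below contrasts a plain prompt with a chain-of-thought variant; the exact wording is illustrative, as effective phrasings vary by model.

```python
# A plain prompt asks only for the answer.
plain_prompt = "A train travels 120 km in 1.5 hours. What is its average speed?"

# The chain-of-thought variant asks the model to expose intermediate
# reasoning steps, which tends to improve accuracy on multi-step problems.
cot_prompt = (
    "A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "Think step by step: first identify the distance and the time, "
    "then apply speed = distance / time before stating the answer."
)

print(cot_prompt)
```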
- Chat template #
- A structured format for organizing conversations between users and AI models, defining how system prompts, user inputs, and AI responses are formatted and processed. 
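A minimal sketch of the idea: the role/content message layout below follows a widely used convention, while the `<|role|>` tag format is invented for illustration; real chat templates are model-specific.

```python
# Structured conversation: each message has a role and content.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is an air gap?"},
]

def render(messages):
    """Flatten structured messages into the single prompt string a model sees."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    parts.append("<|assistant|>\n")  # trailing cue tells the model to respond
    return "\n".join(parts)

print(render(messages))
```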
- Context window #
- The maximum amount of text (tokens) that an AI model can process at once, including both the input prompt and generated response. 
- CRD, custom resource definition #
- An extension of the Kubernetes API that allows users to define and manage their own resource types in a Kubernetes cluster. 
- CUDA, Compute Unified Device Architecture #
- NVIDIA's parallel computing platform and programming model used to accelerate AI workloads on GPU hardware. 
- Data leakage #
- The unintended exposure of sensitive information through AI model responses, potentially compromising data security and privacy. 
- Embeddings #
- Numerical representations of data (text, images, etc.) in a high-dimensional space that capture semantic relationships and enable AI models to process information effectively. 
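A toy illustration of the idea, using hand-made 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions): semantically related items point in similar directions, which cosine similarity makes measurable.

```python
import math

# Invented toy "embeddings"; real values come from an embedding model.
emb = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Related concepts score higher than unrelated ones.
assert cosine(emb["dog"], emb["puppy"]) > cosine(emb["dog"], emb["car"])
```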
- Fine-tuning #
- The process of further training a pre-trained AI model on specific data to adapt it for particular tasks or domains, improving its performance for targeted applications. 
- GenAI, generative AI #
- A type of artificial intelligence that can create new content such as text, images or music. 
- GPU, graphics processing unit #
- Specialized hardware designed for parallel processing. In AI applications, GPUs accelerate model training and inference tasks. 
- Hallucination #
- An AI behavior where the model generates false or unsupported information that appears plausible but has no basis in provided context or real facts. 
- Helm #
- A package manager for Kubernetes that helps install and manage applications. Helm uses charts to define, install and upgrade complex Kubernetes applications. 
- Helm chart #
- A package format for Kubernetes applications that contains all resource definitions needed to deploy and configure application workloads. 
- IaC, infrastructure as code #
- The practice of managing and provisioning infrastructure through machine-readable definition files rather than manual processes. 
- Inference #
- The process of using a trained AI model to make predictions or generate outputs based on new input data. 
- Kubernetes pods #
- The smallest deployable units in Kubernetes that can host one or more containers, sharing networking and storage resources. 
- LLM, large language model #
- An advanced AI model trained on vast amounts of text data to understand and generate human-like text. Can perform tasks like translation, summarization and answering questions. 
- Model weights #
- The learned parameters of an AI model that determine how it processes inputs and generates outputs. These weights are adjusted during training to optimize model performance. 
- NLG, natural language generation #
- A process of automatically generating human-like text from structured data or other forms of input, converting raw data into coherent and meaningful language that humans can easily understand. 
- NLU, natural language understanding #
- The process by which AI analyzes an input query to extract its meaning and intent. 
- NVIDIA GPU driver #
- Software that enables communication between the operating system and NVIDIA graphics hardware, essential for GPU-accelerated AI workloads. 
- NVIDIA GPU Operator #
- A Kubernetes operator that automates the management of NVIDIA GPUs in container environments, handling driver deployment, runtime configuration, and monitoring. 
- Ollama #
- An open source framework for running and serving AI models locally. Ollama simplifies the process of downloading, running and managing large language models. 
- OpenGL #
- A cross-platform API for rendering 2D and 3D graphics, commonly used in visualization applications and GPU-accelerated computing. 
- Prompt engineering #
- The practice of crafting effective input queries to AI models to obtain desired and accurate outputs. Good prompt engineering helps prevent hallucinations and improves response quality. 
- Prompt injection #
- A security vulnerability where malicious inputs attempt to override or bypass an AI model's system prompt or safety constraints. 
- Quantization #
- A technique to reduce AI model size and computational requirements by converting model parameters to lower precision formats while maintaining acceptable performance. 
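A minimal sketch of one common scheme, symmetric int8 quantization: float weights are mapped into the [-127, 127] integer range with a single per-tensor scale factor. The weight values are invented for illustration.

```python
# Example float32 weights (invented values).
weights = [0.42, -1.37, 0.08, 2.51, -0.93]

# One scale factor maps the largest-magnitude weight onto 127.
scale = max(abs(w) for w in weights) / 127

q = [round(w / scale) for w in weights]   # int8 representation (8 bits each)
dq = [v * scale for v in q]               # dequantized approximation

# Storage drops from 32 to 8 bits per weight; rounding error stays bounded.
max_err = max(abs(a - b) for a, b in zip(weights, dq))
assert max_err <= scale / 2
```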
- RAG, retrieval-augmented generation #
- A technique that enhances AI responses by retrieving relevant information from a knowledge base before generating answers, improving accuracy and reducing hallucinations. 
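The flow can be sketched in a few lines: retrieve the most relevant document for a query, then prepend it to the prompt as context. Retrieval here is naive keyword overlap for brevity; real systems use embedding similarity against a vector store. Documents and prompt wording are invented.

```python
docs = [
    "An air gap physically isolates a network from the public Internet.",
    "Temperature controls randomness in model outputs.",
]

def retrieve(query, docs):
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_prompt(query, docs):
    # Grounding the model in retrieved context reduces hallucinations.
    context = retrieve(query, docs)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer using the context."

print(build_prompt("What is an air gap?", docs))
```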
- RBAC, role-based access control #
- A security model that restricts system access based on roles assigned to users, managing permissions and authorization in Kubernetes clusters. 
- Semantic search #
- A search method using AI to understand the meaning and context of queries rather than just matching keywords, enabling more relevant results. 
- System prompt #
- Initial instructions given to an AI model that define its behavior, role and response parameters. System prompts help maintain consistent and appropriate AI responses. 
- Temperature #
- A parameter controlling the randomness in AI model outputs. Lower values produce more focused and deterministic responses, while higher values increase creativity and variability. 
- Token #
- The basic unit of text processing in AI models, representing parts of words, characters or symbols. Models process text by breaking it into tokens for analysis and generation. 
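A toy illustration: real tokenizers (for example, BPE) learn their subword vocabulary from data, but the core idea is the same, since models see a sequence of integer token IDs rather than raw characters. The vocabulary below is invented.

```python
# Toy subword vocabulary mapping text pieces to integer token IDs.
vocab = {"un": 0, "believ": 1, "able": 2, "!": 3}

def tokenize(text, vocab):
    """Greedily match the longest known piece at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(vocab[piece])
                i += len(piece)
                break
        else:
            i += 1  # skip characters not in the toy vocabulary
    return tokens

print(tokenize("unbelievable!", vocab))  # one word becomes four subword tokens
```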
- Top-K #
- A parameter that limits token selection during text generation to the K most likely next tokens, helping control output quality and relevance. 
- Top-P #
- Also known as nucleus sampling, a parameter that selects from the smallest set of tokens whose cumulative probability exceeds P, providing dynamic control over text generation diversity. 
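The three parameters above work together on the same next-token distribution. The sketch below shows the usual order of operations: temperature rescales logits before the softmax, then Top-K and Top-P prune unlikely tokens before sampling. The logits and tokens are invented.

```python
import math

# Invented raw model scores for candidate next tokens.
logits = {"the": 4.0, "a": 3.0, "cat": 1.0, "xylophone": -2.0}

def next_token_probs(logits, temperature=1.0, top_k=None, top_p=None):
    # Temperature: divide logits before softmax. Lower values sharpen the
    # distribution (more deterministic); higher values flatten it.
    exps = {t: math.exp(l / temperature) for t, l in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)

    if top_k is not None:          # Top-K: keep only the K most likely tokens
        ranked = ranked[:top_k]

    if top_p is not None:          # Top-P: smallest set whose cumulative
        kept, cum = [], 0.0        # probability reaches P (nucleus sampling)
        for t, p in ranked:
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept

    total = sum(p for _, p in ranked)          # renormalize the survivors
    return {t: p / total for t, p in ranked}

print(next_token_probs(logits, temperature=0.7, top_k=3, top_p=0.9))
```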
- Vector database #
- A specialized database designed to store and efficiently query high-dimensional vectors that represent data in AI applications, enabling similarity searches and semantic operations. 
- Vector store #
- A specialized storage system optimized for managing and querying vector embeddings, essential for semantic search and RAG implementations in AI applications.