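This document was translated using automated machine translation. Although we strive to provide accurate translations, we make no guarantee as to the completeness, accuracy, or reliability of the translated content. In case of any discrepancy, the original English version prevails and is the authoritative text.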

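Write remediation guides to help users troubleshoot issues

Overview

SUSE® Observability provides out-of-the-box monitors for common issues that can occur in a Kubernetes cluster. These monitors also come with out-of-the-box remediation guides, which are meant to guide users in troubleshooting issues accurately. The guides are based on best practices and community knowledge. Follow the guidance on this page to learn how to write an effective remediation guide yourself.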

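Guidelines

  • Provide step-by-step instructions that lead the user to resolve the issue detected by the monitor.

  • Make sure the instructions are ordered by most likely cause.

  • Where possible, include links to relevant data and/or resources to speed up the investigation.

  • Keep it concise and to the point:

    • Avoid over-explaining - if necessary, add a link to supporting documentation.

    • Avoid tables of contents and similar content blocks.

    • Avoid summaries that repeat the same content.

  • Try to format the guide in a structured way. Use:

    • Bullet points

    • Numbering

    • Short sentences

    • Paragraphs

    • Inline formatting

  • If the issue may remain unresolved (there can be different, unknown causes), provide guidance on how to escalate it. For example, give users a support link or phone number.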
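Some of these guidelines can be checked mechanically before a guide is published. The following is a minimal sketch, not an official SUSE Observability tool: the function name `lint_guide`, the heuristics, and the 400-word limit are all assumptions made for illustration.

```python
import re

# Hypothetical heuristics for a draft guide written in Markdown.
TOC_HEADING = re.compile(
    r"^#+\s*(table of contents|contents|summary)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+\.)\s+", re.MULTILINE)
LINK = re.compile(r"\[[^\]]+\]\([^)]+\)")


def lint_guide(markdown: str, max_words: int = 400) -> list[str]:
    """Return a list of warnings for a remediation guide draft."""
    warnings = []
    # Guideline: avoid tables of contents and repeated summaries.
    if TOC_HEADING.search(markdown):
        warnings.append("contains a table of contents or summary block")
    # Guideline: structure the guide with bullets or numbered steps.
    if not LIST_ITEM.search(markdown):
        warnings.append("no bullet points or numbered steps found")
    # Guideline: link to relevant data and resources.
    if not LINK.search(markdown):
        warnings.append("no links to related data or resources found")
    # Guideline: keep it concise (the word limit is an arbitrary assumption).
    if len(markdown.split()) > max_words:
        warnings.append(f"longer than {max_words} words")
    return warnings
```

An empty list means none of these heuristics fired; it does not mean the guide is good, only that it passes these few checks.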

Remediation guide example

When a Kubernetes container has errors, it can enter a state called CrashLoopBackOff, where Kubernetes attempts to restart the container to resolve the issue. The container will continue to restart until the problem is resolved. Take the following steps to diagnose the problem:

### Pod Events

Check the pod events to identify any explicit errors or warnings.
1. Go to the "Events" section in the middle of the [Pod highlight page](/#/components/{{ componentUrnForUrl }}).
2. Click 'Alerts' and check whether the Alert category contains events such as "BackOff", "FailedScheduling", "FailedAttachVolume" or "OOMKilled".
3. Click an event to see its details, which give more information about the issue.
4. If the 'Show related event' option is enabled, events of related resources, such as a deployment, are also shown and can indicate whether a change to one of them is causing this issue. Check for a correlation between the time of a deployment and a change in behaviour in the metrics and events of this pod.
For easy correlation, use 'shift'-'click' to add markers to the different graph, log and event widgets.

### Container Logs
Check the container logs for any explicit errors or warnings.
Inspect the [Logs](/#/components/{{ componentUrnForUrl }}#logs) of all the containers in this pod.
Search for hints in the logs by:
1.  Looking for changes in the logging pattern, based on the number of logs per time unit (the histogram bars).
    In many cases the change in pattern will indicate what is going on.
    You can click-drag on the histogram bars to narrow the displayed logs to that time frame.
2.  Searching for "Error" or "Fatal" in the search bar.
3.  Looking at the logs around the time that the monitor triggered.

### Recent Changes
Look at the pod age in the "About" section on the [Pod highlight page](/#/components/{{ componentUrnForUrl }}) to identify any recent deployments that might have caused the issue.
1. The "Age" is shown in the "About" section on the left side of the screen.
2. If the "Age" and the time that the monitor was triggered are close together, take a look at the most recent deployment by clicking on [Show last change](/#/components/{{ componentUrnForUrl }}#lastChange).

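Inserting links

The syntax differs between "deep links" and "in-page links". A deep link redirects the user away from the current page, while an in-page link keeps the user on the same page.

Deep links

To link to any perspective of the current resource (for example "Highlights", "Topology", "Events", "Metrics"), use the following syntax:

[highlight page](/#/components/{{ componentUrnForUrl }})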
[topology](/#/components/{{ componentUrnForUrl }}/topology)
[events](/#/components/{{ componentUrnForUrl }}/events)
[metrics](/#/components/{{ componentUrnForUrl }}/metrics)
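If many guides are written, the deep-link syntax above can be generated rather than typed by hand. The sketch below is illustrative only: the helper name `deep_link` and the `PERSPECTIVES` set are assumptions, and the placeholder is emitted verbatim so that the product can substitute it when the guide is rendered.

```python
from typing import Optional

# Placeholder copied from the syntax above; it is left untouched in the output.
PLACEHOLDER = "{{ componentUrnForUrl }}"
# Perspectives listed on this page; the highlight page has no suffix.
PERSPECTIVES = {"topology", "events", "metrics"}


def deep_link(label: str, perspective: Optional[str] = None) -> str:
    """Build a Markdown deep link to a perspective of the current resource."""
    base = f"/#/components/{PLACEHOLDER}"
    if perspective is None:
        # No perspective means the highlight page.
        return f"[{label}]({base})"
    if perspective not in PERSPECTIVES:
        raise ValueError(f"unknown perspective: {perspective!r}")
    return f"[{label}]({base}/{perspective})"


print(deep_link("highlight page"))
print(deep_link("events", "events"))
```

Rejecting unknown perspective names is a design choice of this sketch; it catches typos early, at the cost of having to extend the set if more perspectives are needed.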

In-page links

To link to any additional data of the current resource (for example "Show logs", "Show last change", "Show status", "Show configuration"), use the following syntax:

[logs](/#/components/{{ componentUrnForUrl }}#logs)
[last change](/#/components/{{ componentUrnForUrl }}#lastChange)
[status](/#/components/{{ componentUrnForUrl }}#status)
[configuration](/#/components/{{ componentUrnForUrl }}#configuration)
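The difference between the two link types comes down to what follows the component placeholder: a `#anchor` for in-page links, a `/perspective` (or nothing) for deep links. The sketch below illustrates this, assuming the placeholder is written as `{{ componentUrnForUrl }}`; the helper names `in_page_link` and `link_kind` are hypothetical.

```python
# Common prefix of both link types, with the placeholder left untouched.
BASE = "/#/components/{{ componentUrnForUrl }}"


def in_page_link(label: str, anchor: str) -> str:
    """Build a Markdown in-page link, for example to 'logs' or 'lastChange'."""
    return f"[{label}]({BASE}#{anchor})"


def link_kind(target: str) -> str:
    """Classify a link target as 'in-page', 'deep', or 'other'."""
    if not target.startswith(BASE):
        return "other"
    rest = target[len(BASE):]
    # An anchor directly after the placeholder keeps the user on the page.
    return "in-page" if rest.startswith("#") else "deep"


print(in_page_link("logs", "logs"))
print(link_kind(BASE + "#lastChange"))
print(link_kind(BASE + "/events"))
```

Note that the target cannot be split with a generic URL parser, because the path itself begins with `/#/`; comparing against the known prefix avoids that ambiguity.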