etcd 节点故障排除

本文介绍了对具有 etcd 角色的节点进行故障排除的命令和提示。

检查 etcd 容器是否正在运行

etcd 容器的状态应该是 Up。Up 后面显示的时间指的是容器运行的时间。

docker ps -a -f=name=etcd$

输出示例：

CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS          PORTS     NAMES
d26adbd23643   rancher/mirrored-coreos-etcd:v3.5.7   "/usr/local/bin/etcd…"   30 minutes ago   Up 30 minutes             etcd

etcd 容器日志记录

容器的日志记录可能包含问题的信息。

docker logs etcd

日志解释

日志	解释
`health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused	无法连接到端口 2380 上显示的地址。检查 etcd 容器是否在显示地址的主机上运行。
`xxx is starting a new election at term x	etcd 集群失去了集群仲裁数量，并正在尝试建立一个新的 leader。运行 etcd 的大多数节点关闭/无法访问时，可能会发生这种情况。
`connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}	主机防火墙正在阻止网络通信。
`rafthttp: request cluster ID mismatch	具有 etcd 实例日志 `rafthttp: request cluster ID mismatch` 的节点正在尝试加入已经添加另一个对等节点（peer）的集群。你需要从集群中删除该节点，然后再重新添加。
`rafthttp: failed to find member	集群状态（`/var/lib/etcd`）包含加入集群的错误信息。你需要从集群中删除该节点，清理状态目录，然后再重新添加。

`health check for peer xxx could not connect: dial tcp IP:2380: getsockopt: connection refused

无法连接到端口 2380 上显示的地址。检查 etcd 容器是否在显示地址的主机上运行。

`xxx is starting a new election at term x

etcd 集群失去了集群仲裁数量，并正在尝试建立一个新的 leader。运行 etcd 的大多数节点关闭/无法访问时，可能会发生这种情况。

`connection error: desc = "transport: Error while dialing dial tcp 0.0.0.0:2379: i/o timeout"; Reconnecting to {0.0.0.0:2379 0 <nil>}

主机防火墙正在阻止网络通信。

`rafthttp: request cluster ID mismatch

具有 etcd 实例日志 rafthttp: request cluster ID mismatch 的节点正在尝试加入已经添加另一个对等节点（peer）的集群。你需要从集群中删除该节点，然后再重新添加。

`rafthttp: failed to find member

集群状态（/var/lib/etcd）包含加入集群的错误信息。你需要从集群中删除该节点，清理状态目录，然后再重新添加。

etcd 集群和连接检查

运行 etcd 的主机的地址配置决定了 etcd 监听的地址。如果为运行 etcd 的主机配置了内部地址，则需要显式指定 etcdctl 的端点。如果任何命令的响应是 Error: context deadline exceeded，则 etcd 实例不健康（仲裁丢失或实例未正确加入集群）。

检查所有节点上的 etcd 成员

输出应包含具有 etcd 角色的所有节点，而且所有节点上的输出应该是相同的。

命令：

docker exec etcd etcdctl member list

检查端点状态

RAFT TERM 的值应该是相等的，而且 RAFT INDEX 相差不能太大。

命令：

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table

输出示例：

+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 | 333ef673fc4add56 |  3.5.7  |   24 MB |     false |        72 |      66887 |
| https://IP:2379 | 5feed52d940ce4cf |  3.5.7  |   24 MB |      true |        72 |      66887 |
| https://IP:2379 | db6b3bdb559a848d |  3.5.7  |   25 MB |     false |        72 |      66887 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+

检查端点健康

命令：

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint health

输出示例：

https://IP:2379 is healthy: successfully committed proposal: took = 2.113189ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.649963ms
https://IP:2379 is healthy: successfully committed proposal: took = 2.451201ms

检查端口 TCP/2379 上的连接

命令：

for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f5); do
   echo "Validating connection to ${endpoint}/health"
   docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/health"
done

输出示例：

Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}
Validating connection to https://IP:2379/health
{"health": "true"}

检查端口 TCP/2380 上的连接

命令：

for endpoint in $(docker exec etcd etcdctl member list | cut -d, -f4); do
  echo "Validating connection to ${endpoint}/version";
  docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl --http1.1 -s -w "\n" --cacert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CACERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --cert $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_CERT" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) --key $(docker inspect -f '{{range $index, $value := .Config.Env}}{{if eq (index (split $value "=") 0) "ETCDCTL_KEY" }}{{range $i, $part := (split $value "=")}}{{if gt $i 1}}{{print "="}}{{end}}{{if gt $i 0}}{{print $part}}{{end}}{{end}}{{end}}{{end}}' etcd) "${endpoint}/version"
done

输出示例：

Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}
Validating connection to https://IP:2380/version
{"etcdserver":"3.5.7","etcdcluster":"3.5.0"}

etcd 告警

etcd 会触发告警（例如空间不足时）。

命令：

docker exec etcd etcdctl alarm list

触发 NOSPACE 告警的输出示例：

memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE

etcd 空间错误

相关的错误消息是 etcdserver: mvcc: database space exceeded 或 applying raft message exceeded backend quota。告警 NOSPACE 会被触发。

解决：

压缩键空间
对所有 etcd 成员进行碎片整理
检查端点状态
解除告警

压缩键空间

命令：

rev=$(docker exec etcd etcdctl endpoint status --write-out json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')
docker exec etcd etcdctl compact "$rev"

输出示例：

compacted revision xxx

对所有 etcd 成员进行碎片整理

命令：

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl defrag

输出示例：

Finished defragmenting etcd member[link:https://IP:2379]
Finished defragmenting etcd member[link:https://IP:2379]
Finished defragmenting etcd member[link:https://IP:2379]

检查端点状态

命令：

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ',') etcd etcdctl endpoint status --write-out table

输出示例：

+-----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------+------------------+---------+---------+-----------+-----------+------------+
| https://IP:2379 |  e973e4419737125 |  3.5.7  |  553 kB |     false |        32 |    2449410 |
| https://IP:2379 | 4a509c997b26c206 |  3.5.7  |  553 kB |     false |        32 |    2449410 |
| https://IP:2379 | b217e736575e9dd3 |  3.5.7  |  553 kB |      true |        32 |    2449410 |
+-----------------+------------------+---------+---------+-----------+-----------+------------+

解除告警

如果压缩和整理碎片后确定数据库大小下降了，则需要解除告警来允许 etcd 再次写入。

命令：

docker exec etcd etcdctl alarm list
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list

输出示例：

docker exec etcd etcdctl alarm list
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
memberID:x alarm:NOSPACE
docker exec etcd etcdctl alarm disarm
docker exec etcd etcdctl alarm list

配置日志级别

你无法再动态更改 etcd v3.5 或更高版本中的日志级别。

etcd v3.5 及更高版本

要配置 etcd 的日志级别，请编辑集群 YAML：

services:
  etcd:
    extra_args:
      log-level: "debug"

etcd v3.4 及更早版本

在早期的 etcd 版本中，你可以使用 API 动态更改日志级别。使用以下命令来配置调试日志：

docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"DEBUG"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log

要将日志级别重置回默认值 (INFO)，你可以使用以下命令。

命令：

docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -XPUT -d '{"Level":"INFO"}' --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) $(docker exec etcd printenv ETCDCTL_ENDPOINTS)/config/local/log

etcd 内容

如果要查看 etcd 的内容，你可以查看流事件，也可以直接查询 etcd。详情请参阅以下示例。

查看流事件

命令：

docker exec etcd etcdctl watch --prefix /registry

如果你只想查看受影响的键（而不是二进制数据），你可以将 | grep -a ^/registry 尾附到该命令来过滤键。

直接查询 etcd

命令：

docker exec etcd etcdctl get /registry --prefix=true --keys-only

你可以使用以下命令来处理数据，从而获取每个键的计数摘要：

docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -nr

更换不健康的 etcd 节点

如果你 etcd 集群中的某个节点变得不健康，在将新的 etcd 节点添加到集群之前，我们建议你修复或删除故障/不健康的节点。