
Recovery of a failing etcd node

 ·  ☕ 2 min read  ·  🤖 Alexander Chernov

One of the etcd nodes in my home k8s cluster was failing with the following message:

2021-01-14 11:16:09.233458 I | embed: listening for peers on 192.168.0.33:2380
raft2021/01/14 11:16:09 tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?

These are the steps I took to fix it:

Connect to one of the healthy nodes and set the environment variables etcdctl uses for its connection

export ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt'
export ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt'
export ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key'
export ETCDCTL_ENDPOINTS='https://[127.0.0.1]:2379'
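
Before running any etcdctl command it can help to confirm that the certificate paths really exist on the node, since a wrong path tends to surface as a TLS error that is easy to misread. A minimal sketch, assuming a POSIX-style shell; the function name check_etcdctl_files is hypothetical and not part of etcdctl:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given files are missing or unreadable.
# Returns 0 when every file is readable, 1 otherwise.
check_etcdctl_files() {
  missing=0
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "not readable: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the node, with the variables exported above:
#   check_etcdctl_files "$ETCDCTL_CACERT" "$ETCDCTL_CERT" "$ETCDCTL_KEY" \
#     && etcdctl endpoint health
```

If all three files are readable, etcdctl endpoint health should confirm that the local endpoint answers before any membership changes are made.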

For these operations I followed the official documentation.

Verify the current cluster members

# etcdctl member list

1afbd87f4cc07a99, started, nas, https://192.168.0.33:2380, https://192.168.0.33:2379
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
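
The member ID is what the later remove command needs, and copying a 16-character hex string by hand is error-prone. A small sketch that looks the ID up by member name from the comma-separated listing; the function name member_id_by_name is hypothetical, and the sample input is the listing shown above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `etcdctl member list` output on stdin and print
# the ID (first field) of the member whose name (third field) matches $1.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Sample input copied from the listing above; on a real node, pipe
# `etcdctl member list` into the function instead.
member_id_by_name nas <<'EOF'
1afbd87f4cc07a99, started, nas, https://192.168.0.33:2380, https://192.168.0.33:2379
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
EOF
# prints 1afbd87f4cc07a99
```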

The failing node is named nas, so we are going to remove it from the cluster

etcdctl member remove 1afbd87f4cc07a99

Verify that it has been removed

# etcdctl member list
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
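
With nas removed, the cluster temporarily has two members. etcd needs a majority of members (floor(n/2) + 1) to commit writes, so a two-member cluster cannot lose another node until nas has rejoined. The arithmetic, as a short sketch with hypothetical function names:

```shell
#!/usr/bin/env bash
# Majority needed for a cluster of $1 members: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while the cluster still has a majority.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 2; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# members=3 quorum=2 tolerated_failures=1
# members=2 quorum=2 tolerated_failures=0
```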

Remove the failing pod

kubectl delete pod etcd-nas -n kube-system

On the failing node, delete the corrupted data folder

rm -rf /var/lib/etcd/member

Add the failing node back to the etcd cluster

# etcdctl member add nas --peer-urls="https://192.168.0.33:2380"
Member 693136f4829284e2 added to cluster d965929a4c4424e9

ETCD_NAME="nas"
ETCD_INITIAL_CLUSTER="xps-server=https://192.168.0.29:2380,nas=https://192.168.0.33:2380,cooler-master=https://192.168.0.253:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
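
The ETCD_* lines printed by member add are the settings the rejoining member has to start with. The post does not say how they were applied; on a kubeadm-style node etcd usually runs as a static pod whose flags live in a manifest (commonly /etc/kubernetes/manifests/etcd.yaml, which is an assumption here). A minimal sketch that maps the printed variables to the equivalent etcd command-line flags, so they can be compared with what the manifest contains; the function name etcd_join_flags is hypothetical:

```shell
#!/usr/bin/env bash
# Values as printed by `etcdctl member add` above.
ETCD_NAME="nas"
ETCD_INITIAL_CLUSTER="xps-server=https://192.168.0.29:2380,nas=https://192.168.0.33:2380,cooler-master=https://192.168.0.253:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

# Hypothetical helper: print the etcd flags equivalent to the variables above.
etcd_join_flags() {
  printf '%s\n' "--name=$ETCD_NAME"
  printf '%s\n' "--initial-cluster=$ETCD_INITIAL_CLUSTER"
  printf '%s\n' "--initial-cluster-state=$ETCD_INITIAL_CLUSTER_STATE"
}

etcd_join_flags
```

The important value is the cluster state: with "existing", the member joins the running cluster instead of trying to bootstrap a new one.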

Verify the member list

# etcdctl member list
43fc832867481d8c, unstarted, , https://192.168.0.33:2379,https://192.168.0.33:2380,
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
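
The new member stays in the unstarted state until etcd on nas comes up and contacts its peers. Rather than re-running the listing by hand, the status column can be checked in a script. A sketch with a hypothetical function name, fed here with the listing above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `etcdctl member list` output on stdin and
# succeed only if there is at least one member and every member's
# status (second field) is "started".
all_members_started() {
  awk -F', ' '$2 != "started" { bad = 1 } END { exit (bad || NR == 0) }'
}

if all_members_started <<'EOF'
43fc832867481d8c, unstarted, , https://192.168.0.33:2379,https://192.168.0.33:2380,
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
EOF
then
  echo "all members started"
else
  echo "still waiting for a member to start"
fi
# prints: still waiting for a member to start
```

On a healthy node this could be used in a polling loop such as: until etcdctl member list | all_members_started; do sleep 5; done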

After restarting the nas node, I could see that the third member had joined and the pod was up and running

