Recovery of etcd failing node

One of the etcd nodes in my home k8s cluster has been failing with the following message:

    2021-01-14 11:16:09.233458 I | embed: listening for peers on 192.168.0.33:2380
    raft2021/01/14 11:16:09 tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
    panic: tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?

These are the steps I took to fix it:

Connect to one of the healthy nodes and prepare the environment variables for the etcdctl connection

    export ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt'
    export ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt'
    export ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key'
    export ETCDCTL_ENDPOINTS='https://[127.0.0.1]:2379'
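A typo in one of these paths only shows up later as a TLS error, so it can help to check them up front. The following is a minimal sketch, not something from the original procedure; `check_etcdctl_certs` is a hypothetical helper name, and it only tests that the three files are readable.

```shell
# Hypothetical helper: check that every ETCDCTL_* certificate variable
# points to a readable file. Returns non-zero if any of them does not.
check_etcdctl_certs() {
  status=0
  for var in ETCDCTL_CACERT ETCDCTL_CERT ETCDCTL_KEY; do
    # Indirect lookup of the variable named in $var (POSIX-compatible).
    eval "path=\${$var:-}"
    if [ ! -r "$path" ]; then
      echo "$var: cannot read '$path'" >&2
      status=1
    fi
  done
  return "$status"
}
```

With the variables exported, something like `check_etcdctl_certs && etcdctl endpoint health` would then confirm that the local endpoint answers before any membership change is made.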

For the operations below I followed the official etcd documentation.

Verify current cluster members

    # etcdctl member list
    1afbd87f4cc07a99, started, nas, https://192.168.0.33:2380, https://192.168.0.33:2379
    4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
    7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
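When scripting this step, the member ID can be pulled out of the listing by name instead of being copied by hand. A minimal sketch, assuming the default comma-separated output shown above; `member_id_by_name` is a hypothetical helper, not part of etcdctl.

```shell
# Hypothetical helper: print the ID of the member whose name matches $1.
# Reads `etcdctl member list` output (default simple format) on stdin,
# where the columns are: ID, status, name, peer URLs, client URLs.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage (on a healthy node, with the ETCDCTL_* variables exported):
#   etcdctl member list | member_id_by_name nas
```

For the listing above, looking up `nas` yields `1afbd87f4cc07a99`, which is the ID passed to `member remove` in the next step.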

The failing node is named nas, so we are going to remove it from the cluster

    etcdctl member remove 1afbd87f4cc07a99

Verify that it has been removed

    # etcdctl member list
    4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
    7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379

Remove the failing pod (add -n kube-system if that is where your etcd pods run and it is not your current namespace)

    kubectl delete pod etcd-nas

On the failing node, delete the corrupted data folder

    rm -rf /var/lib/etcd/member
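If you would rather keep the old data around until the node has rejoined, the directory can be moved aside instead of deleted. This is a sketch of that alternative, not what I ran; `backup_member_dir` and the backup naming scheme are hypothetical.

```shell
# Hypothetical helper: move <data_dir>/member out of the etcd data
# directory instead of deleting it, and print the backup location.
# $1 is the etcd data directory, for example /var/lib/etcd.
backup_member_dir() {
  data_dir="${1%/}"
  # Backup lands next to the data directory, e.g. /var/lib/etcd-member-backup-<timestamp>
  backup="${data_dir}-member-backup-$(date +%Y%m%d%H%M%S)"
  if [ ! -d "$data_dir/member" ]; then
    echo "no member directory in $data_dir" >&2
    return 1
  fi
  mv "$data_dir/member" "$backup" && echo "$backup"
}

# Usage (as root on the failing node):
#   backup_member_dir /var/lib/etcd
```

The backup is placed outside the data directory so that etcd starts from an empty directory, just as it does after the `rm -rf`; it can be deleted once the member is healthy again.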

Add the failing node back to the etcd cluster

    # etcdctl member add nas --peer-urls="https://192.168.0.33:2380"
    Member 693136f4829284e2 added to cluster d965929a4c4424e9

    ETCD_NAME="nas"
    ETCD_INITIAL_CLUSTER="xps-server=https://192.168.0.29:2380,nas=https://192.168.0.33:2380,cooler-master=https://192.168.0.253:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"

Verify the member list

    # etcdctl member list
    43fc832867481d8c, unstarted, , https://192.168.0.33:2379,https://192.168.0.33:2380,
    4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
    7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379

After restarting the nas node, I could see that the third member had joined and the pod was up and running.
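To watch for the rejoin without rereading the list by hand, the status column can be checked in a loop. A minimal sketch, assuming the default `member list` output format shown above; `count_not_started` is a hypothetical helper and the 5 second interval is arbitrary.

```shell
# Hypothetical helper: count members whose status column is not "started".
# Reads `etcdctl member list` output (default simple format) on stdin.
count_not_started() {
  awk -F', ' 'NF && $2 != "started" { n++ } END { print n + 0 }'
}

# Usage: poll from a healthy node until every member reports "started":
#   until [ "$(etcdctl member list | count_not_started)" -eq 0 ]; do sleep 5; done
```

Against the listing with the `unstarted` entry the helper reports one pending member; once the node has rejoined it reports zero and the loop ends.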
