One of the etcd nodes in my home k8s cluster has been failing with the following message:
2021-01-14 11:16:09.233458 I | embed: listening for peers on 192.168.0.33:2380
raft2021/01/14 11:16:09 tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
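For context, the two numbers in the panic show how far behind the local log is: tocommit is the commit index this member was asked to advance to, and lastIndex is the last entry actually present in its own raft log. A small sketch (the variable names are mine, not etcd's) that pulls both values out of the message and computes the gap:

```shell
# The panic line from the log above.
line='panic: tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?'

# Extract the two indices with sed (basic regular expressions).
tocommit=$(printf '%s\n' "$line" | sed -n 's/.*tocommit(\([0-9]*\)).*/\1/p')
last_index=$(printf '%s\n' "$line" | sed -n 's/.*lastIndex(\([0-9]*\)).*/\1/p')

# Number of entries the cluster considers committed but this member does not have.
gap=$((tocommit - last_index))
echo "local log is ${gap} entries short of the commit index"
```

In this case the member is missing 132 entries that the rest of the cluster already treats as committed, which is why it refuses to continue.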
These are the steps I took to fix it:
Connect to one of the healthy nodes and prepare the environment variables for the connection:
export ETCDCTL_CACERT='/etc/kubernetes/pki/etcd/ca.crt'
export ETCDCTL_CERT='/etc/kubernetes/pki/etcd/server.crt'
export ETCDCTL_KEY='/etc/kubernetes/pki/etcd/server.key'
export ETCDCTL_ENDPOINTS='https://[127.0.0.1]:2379'
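Before running anything against the cluster, it can help to confirm that these variables are set and that the certificate files are readable. A minimal sketch, assuming a POSIX shell; check_etcdctl_env is a hypothetical helper name, not part of etcdctl:

```shell
# Hypothetical helper: report which ETCDCTL_* settings are missing or unreadable.
# Returns 0 when everything looks usable, 1 otherwise.
check_etcdctl_env() {
    rc=0
    for var in ETCDCTL_CACERT ETCDCTL_CERT ETCDCTL_KEY; do
        # Look up the value of the variable whose name is stored in $var.
        eval "path=\${$var:-}"
        if [ -z "$path" ]; then
            echo "$var is not set" >&2
            rc=1
        elif [ ! -r "$path" ]; then
            echo "$var points to an unreadable file: $path" >&2
            rc=1
        fi
    done
    if [ -z "${ETCDCTL_ENDPOINTS:-}" ]; then
        echo "ETCDCTL_ENDPOINTS is not set" >&2
        rc=1
    fi
    return "$rc"
}
```

A possible use is `check_etcdctl_env && etcdctl endpoint health`, which only contacts the endpoint once the local settings look sane.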
For these operations I followed the official documentation.
Verify the current cluster members:
# etcdctl member list
1afbd87f4cc07a99, started, nas, https://192.168.0.33:2380, https://192.168.0.33:2379
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
The failing node is named nas, so we are going to remove it from the cluster:
etcdctl member remove 1afbd87f4cc07a99
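Instead of copying the ID by hand, it can be looked up by member name. A minimal sketch, assuming the default comma-separated output of etcdctl member list shown above; member_id_by_name is a hypothetical helper:

```shell
# Hypothetical helper: print the ID of the member whose name equals $1.
# Reads `etcdctl member list` output (default format) on stdin, where the
# fields are: ID, status, name, peer URLs, client URLs.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example (not executed here):
# etcdctl member remove "$(etcdctl member list | member_id_by_name nas)"
```

Matching on the exact name field avoids accidentally picking a member whose name merely contains the same string.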
Verify that it has been removed:
# etcdctl member list
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
Remove the failing pod:
kubectl delete pod etcd-nas -n kube-system
On the failing node, delete the corrupted data folder:
rm -rf /var/lib/etcd/member
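A more cautious variant of this step, which is not what I did above, is to rename the folder instead of deleting it, so the old data can still be inspected later. A minimal sketch, assuming etcd is not running on that node while the directory is touched; set_aside_member_dir is a hypothetical helper:

```shell
# Hypothetical helper: move <data-dir>/member out of the data directory
# instead of deleting it, and print the path it was moved to.
set_aside_member_dir() {
    data_dir=$1
    # Sibling path on the same filesystem, e.g. /var/lib/etcd.member.bak.20210114111609
    backup="${data_dir%/}.member.bak.$(date +%Y%m%d%H%M%S)"
    mv "${data_dir%/}/member" "$backup" && echo "$backup"
}

# Example (not executed here):
# set_aside_member_dir /var/lib/etcd
```

The copy can be removed once the member has rejoined and caught up.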
Add the failing node back to the etcd cluster
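A minimal sketch of this step, assuming the member is re-added under its old name and with the peer URL from the first member list. Treat the exact arguments as an assumption and adjust them to your setup; readd_member is a hypothetical wrapper around etcdctl member add:

```shell
# Hypothetical helper: register a member name and peer URL with the cluster.
# Run it on a healthy node, with the ETCDCTL_* variables exported as before.
readd_member() {
    name=$1
    peer_url=$2
    etcdctl member add "$name" --peer-urls="$peer_url"
}

# Example for this cluster (values taken from the first member list, not executed here):
# readd_member nas https://192.168.0.33:2380
```

On success etcdctl prints the ETCD_INITIAL_CLUSTER and ETCD_INITIAL_CLUSTER_STATE="existing" values that the rejoining member is expected to start with.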
Verify the member list
# etcdctl member list
43fc832867481d8c, unstarted, , https://192.168.0.33:2379,https://192.168.0.33:2380,
4de56726b08ede88, started, xps-server, https://192.168.0.29:2380, https://192.168.0.29:2379
7ad397dcfdcca303, started, cooler-master, https://192.168.0.253:2380, https://192.168.0.253:2379
After restarting the node nas, I could see that the third member had joined and the pod was up and running.
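To confirm this from the command line, the member list can be filtered for anything that is not yet started. A minimal sketch, again assuming the default comma-separated output format; not_started_members is a hypothetical helper:

```shell
# Hypothetical helper: print the IDs of members whose status is not "started".
# Reads `etcdctl member list` output (default format) on stdin.
not_started_members() {
    awk -F', ' '$2 != "started" { print $1 }'
}

# Example (not executed here): empty output means every member has started.
# etcdctl member list | not_started_members
```

Run against the list from the previous step, this would print only 43fc832867481d8c; once the node has rejoined it should print nothing.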