If you receive the following error in Alertmanager:
etcd cluster "kube-etcd": database size in use on instance xxx.xxx.xxx.xxx:2381
is 49.55% of the actual allocated disk space, please run defragmentation
(e.g. etcdctl defrag) to retrieve the unused fragmented disk space.
it can be easily solved by running a...
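As a rough illustration of the `etcdctl defrag` command that the alert itself suggests, here is a minimal sketch. The endpoint and certificate paths are assumptions (typical kubeadm-style defaults), and `defrag_member` is a hypothetical helper name, so adjust everything to match your own cluster.

```shell
# Sketch only: defragment a single etcd member.
# ETCD_ENDPOINT and ETCD_PKI are assumptions (kubeadm-style defaults).
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_PKI="${ETCD_PKI:-/etc/kubernetes/pki/etcd}"

# Hypothetical helper: runs "etcdctl defrag" against the configured endpoint.
defrag_member() {
  etcdctl \
    --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/server.crt" \
    --key="$ETCD_PKI/server.key" \
    defrag
}
```

Defragmentation blocks the member while it runs, so it is usually safer to call `defrag_member` on one member at a time rather than on all of them at once.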
Kubernetes networking
Pause container
In Kubernetes, the pause container serves as the “parent container” for all of the containers in your pod, and it has two main responsibilities:
- it serves as the basis of Linux namespace sharing in the pod
- with PID (process ID) namespace sharing enabled, it serves as...
Longhorn is a brilliant piece of software (especially for those who cannot use block volumes on their Kubernetes cluster), but volumes can (and do) become degraded and then have to be recovered from a backup.
While it’s fairly “trivial” to install a stacked Kubernetes cluster with kubeadm on any cloud provider or managed bare metal (where you have a certain degree of control over the networking, which permits you to use BGP, for example), it’s not so trivial when your nodes are situated in different network segments (clouds) and/or behind NAT.
With this guide I will try to alleviate the pain of this kind of setup.
When using kube-prometheus-stack with a Kubernetes cluster provisioned by kubeadm, you will likely find that Prometheus cannot connect to the kube-proxy metrics endpoint.
It can be easily fixed by editing its config map:
kubectl edit cm/kube-proxy -n kube-system
Change the metricsBindAddress property from...
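Before editing, it can help to check what the property is currently set to. The sketch below assumes the kubeadm layout, where the kube-proxy configuration lives under the `config.conf` key of the config map; `show_metrics_bind_address` and `restart_kube_proxy` are hypothetical helper names.

```shell
# Sketch only: print the current metricsBindAddress line from the
# kube-proxy config map (assumes the kubeadm "config.conf" key).
show_metrics_bind_address() {
  kubectl get configmap kube-proxy -n kube-system \
    -o jsonpath='{.data.config\.conf}' | grep metricsBindAddress
}

# Running kube-proxy pods do not pick up config map edits on their own,
# so restart the daemonset once the edit is saved.
restart_kube_proxy() {
  kubectl rollout restart daemonset kube-proxy -n kube-system
}
```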
If you are using finalizers in your Argo CD applications, you may find that it’s impossible to delete them if the Argo CD installation is broken.
To fix it, you can either update them one by one with kubectl edit and remove the finalizer entry, or you can run the following command, which will patch all argocd...
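One way such a bulk patch could look is sketched below. It is not necessarily the command from the post: the `argocd` default namespace and the helper name `strip_app_finalizers` are assumptions, and the patch removes every finalizer on each Application, not only the Argo CD one.

```shell
# Sketch only: clear the finalizers on every Argo CD Application in a
# namespace. The default namespace ("argocd") is an assumption.
strip_app_finalizers() {
  ns="${1:-argocd}"
  # "-o name" prints one TYPE/NAME per line, which "kubectl patch" accepts.
  kubectl get applications.argoproj.io -n "$ns" -o name |
    while read -r app; do
      # A merge patch with null drops the metadata.finalizers field.
      kubectl patch "$app" -n "$ns" --type merge \
        -p '{"metadata":{"finalizers":null}}'
    done
}
```

Keep in mind that without the finalizer Argo CD no longer cascades the deletion, so the resources the application managed stay in the cluster and have to be cleaned up by hand.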
When working with EKS on AWS, you may at some point want to run a pod under a certain role and encounter the following error:
An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity
What’s frustrating is that by default AWS doesn’t give you much feedback on why the error happened.
So I’ve written down some debug steps for future reference:
One of the etcd nodes in my home k8s cluster has been failing with the following message:
2021-01-14 11:16:09.233458 I | embed: listening for peers on 192.168.0.33:2380
raft2021/01/14 11:16:09 tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
These are the steps I took to fix it: