Useful debugging tools for kubernetes

Dec 3, 2022 1 min read k8s debug kubernetes network-diag

Sometimes it's useful to run ephemeral containers on a Kubernetes cluster for debugging (e.g. DNS resolution, pinging nodes, etc).

Sadly, it can be tricky to remember all the overrides, so here is a small list.

Run an ephemeral shell on a random node

kubectl run -ti...

Fixing etcdDatabaseHighFragmentationRatio prometheus alert

Sep 3, 2022 1 min read k8s playbook etcd prometheus

If you receive the following error in Alertmanager

etcd cluster "kube-etcd": database size in use on instance xxx.xxx.xxx.xxx:2381
is 49.55% of the actual allocated disk space, please run defragmentation
(e.g. etcdctl defrag) to retrieve the unused fragmented disk space.

it can be easily solved by running a...
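As a rough illustration of the figure in the alert text, the percentage is simply the in-use database size divided by the allocated size. The sketch below assumes you already have both sizes in bytes; the helper name `in_use_percent` and the sample numbers are hypothetical and only chosen to reproduce the 49.55% shown above.

```shell
# in_use_percent IN_USE_BYTES TOTAL_BYTES
# Prints the share of the allocated database file that holds live data,
# with two decimals (the same shape as the figure in the alert).
in_use_percent() {
    awk -v used="$1" -v total="$2" 'BEGIN {
        if (total <= 0) {
            print "total size must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", used / total * 100
    }'
}

# Hypothetical sizes, for illustration only:
# 4955 bytes in use out of 10000 bytes allocated.
in_use_percent 4955 10000   # prints 49.55
```

A low percentage means most of the file is reclaimable space, which is what defragmentation gives back.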

Kubernetes networking

Aug 17, 2022 5 min read k8s study-notes k8s networking

Pause container

In Kubernetes, the pause container serves as the “parent container” for all of the containers in your pod, and it has two main responsibilities:

it serves as the basis of Linux namespace sharing in the pod
with PID (process ID) namespace sharing enabled, it serves as...
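To make the namespace sharing idea a bit more concrete: on Linux, two processes are in the same namespace when their `/proc/<pid>/ns/<kind>` links point at the same target. The following is a minimal sketch of that check, not something taken from the post; the helper name `same_ns` is made up for illustration, and it assumes a Linux host with `/proc` mounted.

```shell
# same_ns KIND PID_A PID_B
# Succeeds when both processes are in the same namespace of the given
# kind (net, ipc, uts, pid, mnt, ...). Returns 2 if a link cannot be read.
# Linux only; reading another user's /proc/PID/ns entries usually needs root.
same_ns() {
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$a" = "$b" ]
}

# Trivial self-check: a process always shares a namespace with itself.
same_ns net "$$" "$$" && echo "same net namespace"
```

On a node, running it against the PIDs of two containers from the same pod should succeed for the kinds the pod shares (for example `net`), assuming you have permission to read both processes' entries.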

Longhorn disaster recovery jobs

Aug 15, 2022 2 min read k8s disaster recovery longhorn k8s

Longhorn is a brilliant piece of software (especially for those who cannot use block volumes on their Kubernetes cluster), but volumes can (and do) become degraded and then need to be restored from a backup.

How to create cross cloud self managed kubernetes cluster

Aug 14, 2022 8 min read k8s cross-cloud k8s kubeadm

While it’s fairly “trivial” to install a stacked Kubernetes cluster with kubeadm on any cloud provider or managed bare metal (where you have enough control over the networking to use BGP, for example), it’s not so trivial when your nodes sit in different network segments (clouds) and/or behind NAT.

In this guide I will try to alleviate the pain of this kind of setup.

Prometheus - kube-proxy endpoint connection refused

May 30, 2022 1 min read k8s prometheus monitoring k8s

When using kube-prometheus-stack with a Kubernetes cluster provisioned by kubeadm, you will likely find that Prometheus cannot connect to the kube-proxy metrics endpoint.

It can be easily fixed by editing the kube-proxy ConfigMap:

kubectl edit cm/kube-proxy -n kube-system

Change the metricsBindAddress property from...
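The excerpt is cut off before the actual values, so the sketch below is only an assumption about what such a change could look like: it rewrites the `metricsBindAddress` line of a kube-proxy configuration read from stdin. The helper name `set_metrics_bind` is hypothetical, and `0.0.0.0:10249` is an assumed target (10249 being kube-proxy's default metrics port), not a value confirmed by the post.

```shell
# set_metrics_bind ADDRESS
# Reads a kube-proxy configuration from stdin and rewrites the
# metricsBindAddress line to ADDRESS, keeping the original indentation.
# All other lines pass through unchanged.
set_metrics_bind() {
    sed -E "s|^([[:space:]]*metricsBindAddress:).*|\1 \"$1\"|"
}

# Sample fragment (illustrative, not copied from a real cluster).
# The target address is an assumption, see the paragraph above.
printf '    metricsBindAddress: ""\n    mode: ""\n' |
    set_metrics_bind "0.0.0.0:10249"
```

Working on plain text like this makes it easy to preview the change before applying it to the live ConfigMap. The running kube-proxy pods would most likely need a restart to pick up the edited configuration.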

How to mass remove finalizers from argocd applications

Jan 24, 2022 1 min read k8s kubectl k8s finalizers

If you are using finalizers in your Argo CD applications, you may find that it’s impossible to delete them if the Argo CD installation is broken.

In order to fix it, you can either update them one by one with kubectl edit and remove the finalizer, or you can run the following command, which will patch all argocd...
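The command itself is cut off in this excerpt, so what follows is not the author's command but one possible sketch of the same idea. It assumes the Application objects live in a namespace called `argocd` (overridable through the hypothetical `ARGO_NS` variable) and that `kubectl` is on the PATH; the function name `remove_app_finalizers` is made up for illustration.

```shell
# Namespace holding the Argo CD Application objects (assumed default).
ARGO_NS="${ARGO_NS:-argocd}"

# remove_app_finalizers
# Lists every Application in $ARGO_NS and clears its finalizers with a
# merge patch, so that a pending or later delete is no longer blocked.
remove_app_finalizers() {
    kubectl get applications.argoproj.io -n "$ARGO_NS" -o name |
    while IFS= read -r app; do
        kubectl patch "$app" -n "$ARGO_NS" --type merge \
            -p '{"metadata":{"finalizers":null}}'
    done
}

# Usage (not run automatically):
#   remove_app_finalizers
```

Keep in mind that removing the finalizer means Argo CD will not clean up the resources the application manages when it is deleted, so anything it deployed stays in the cluster.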

An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation

Sep 29, 2021 2 min read k8s aws aws k8s iam sts eks

When working with EKS on AWS, you may at some point have tried to run a pod under a certain role and encountered the following error:

An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity

What’s frustrating is that, by default, AWS doesn’t give you much feedback on why that error happened.

So I’ve written down some debug steps for future reference:

Recovery of etcd failing node

Jan 14, 2021 2 min read k8s etcd k8s recovery

One of the etcd nodes in my home k8s cluster has been failing with the following message:

2021-01-14 11:16:09.233458 I | embed: listening for peers on 192.168.0.33:2380
raft2021/01/14 11:16:09 tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(29492601) is out of range [lastIndex(29492469)]. Was the raft log corrupted, truncated, or lost?

These are the steps I took to fix it: