Kubernetes networking

Pause container

In Kubernetes, the pause container serves as the “parent container” for all of the containers in your pod, and it has two main responsibilities:

  • it serves as the basis of Linux namespace sharing in the pod
  • with PID (process ID) namespace sharing enabled, it serves as PID 1 for each pod and reaps zombie processes.
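
To see this sharing in practice, here is a minimal sketch that compares the network-namespace identifiers of the pause process and an application container in the same Pod. The two host PIDs are hypothetical, and reading another process's /proc entries requires root on the node.

    # Sketch: compare /proc/<pid>/ns/net for two hypothetical host PIDs, one for
    # the pod's pause process and one for an app container in the same pod.
    import os

    PAUSE_PID = 4242   # hypothetical PID of the pause process
    APP_PID = 4250     # hypothetical PID of an app container in the same pod

    def netns_id(pid: int) -> str:
        # The symlink target looks like "net:[4026532497]"; equal targets mean
        # the two processes share one network namespace.
        return os.readlink(f"/proc/{pid}/ns/net")

    print(netns_id(PAUSE_PID), netns_id(APP_PID))
    assert netns_id(PAUSE_PID) == netns_id(APP_PID)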

In Linux, each running process communicates within a network namespace that provides a logical networking stack with its own routes, firewall rules, and network devices.

When the namespace is created, a mount point for it is created under /var/run/netns, allowing the namespace to persist even if there is no process attached to it.
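
As a rough illustration (assuming root and the iproute2 tooling, with a made-up namespace name), the following sketch creates a named namespace and lists its bind mount under /var/run/netns, which exists even though no process is attached to it yet:

    import os
    import subprocess

    # "ip netns add" creates the namespace and bind-mounts it under /var/run/netns.
    subprocess.run(["ip", "netns", "add", "demo-ns"], check=True)

    # The mount point persists even with no process attached to the namespace.
    print(os.listdir("/var/run/netns"))   # expect to see 'demo-ns'

    subprocess.run(["ip", "netns", "delete", "demo-ns"], check=True)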

By default, Linux assigns every process to the root network namespace to provide access to the external world.

When you run a new process, it inherits its namespaces from its parent process; this is how the network namespace comes to be shared between all the containers running in the same Pod.
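
A quick way to convince yourself of this inheritance is to compare the namespace identifier of a process and a child it spawns; the sketch below simply reads /proc/self/ns/net from both sides:

    import os
    import subprocess

    parent_ns = os.readlink("/proc/self/ns/net")

    # The child (here the external "readlink" binary) reports the same identifier,
    # because it inherited the parent's network namespace on fork/exec.
    child_ns = subprocess.run(
        ["readlink", "/proc/self/ns/net"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    print(parent_ns, child_ns)
    assert parent_ns == child_ns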

Container to container networking

In terms of Docker constructs, a Pod is modelled as a group of Docker containers that share a network namespace. Containers within a Pod all share the IP address and port space assigned through the Pod's network namespace, and can find each other via localhost since they reside in the same namespace.
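
The sketch below imitates this with two ordinary processes sharing one network namespace, standing in for two containers in one Pod: an "app" serves on 127.0.0.1 and a "sidecar" reaches it via localhost (the port 8080 is arbitrary):

    import threading
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Hello(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"hello from the 'app' container\n")

    # The "app" container listens on localhost inside the shared namespace.
    server = HTTPServer(("127.0.0.1", 8080), Hello)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # The "sidecar" reaches it via localhost -- no Pod IP needed.
    print(urllib.request.urlopen("http://127.0.0.1:8080").read().decode())
    server.shutdown()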

Pod to pod networking

Pods within the same node

Linux network namespaces can be connected using a Linux Virtual Ethernet Device or veth pair consisting of two virtual interfaces that can be spread over multiple namespaces.

To connect Pod namespaces, we can assign one side of the veth pair to the root network namespace and the other side to the Pod's network namespace (think of it as a patch cable). Now we want the Pods to talk to each other through the root namespace, and for this we use a network bridge.
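
As a minimal sketch of the "patch cable" part (root and iproute2 assumed; the namespace and device names are made up), one end of the pair stays in the root namespace while the other is moved into the Pod's namespace and renamed eth0:

    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    sh("ip", "netns", "add", "pod-ns")                       # the "pod" namespace
    sh("ip", "link", "add", "veth0", "type", "veth", "peer", "name", "veth0-pod")
    sh("ip", "link", "set", "veth0-pod", "netns", "pod-ns")  # move one end into it
    sh("ip", "netns", "exec", "pod-ns", "ip", "link", "set", "veth0-pod", "name", "eth0")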

A Linux Ethernet bridge is a virtual Layer 2 networking device used to unite two or more network segments, working transparently to connect two networks together. The bridge operates by maintaining a forwarding table between sources and destinations by examining the destination of the data packets that travel through it and deciding whether or not to pass the packets to other network segments connected to the bridge.

To sum it up, let's assume that Pod A wants to send a packet to Pod B. The path would be: eth0 (within Pod A) -> veth0 (in the root namespace) -> br0 (the bridge within the root namespace) -> veth1 -> eth0 (within Pod B).
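
Here is an end-to-end sketch of that path (root and iproute2 assumed; the namespace names, device names, and 10.0.0.0/24 addresses are all made up): two namespaces are wired to a bridge in the root namespace and one pings the other:

    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    # Bridge in the root namespace.
    sh("ip", "link", "add", "br0", "type", "bridge")
    sh("ip", "link", "set", "br0", "up")

    for ns, veth, addr in [("pod-a", "veth-a", "10.0.0.2/24"),
                           ("pod-b", "veth-b", "10.0.0.3/24")]:
        peer = veth + "-p"
        sh("ip", "netns", "add", ns)
        sh("ip", "link", "add", veth, "type", "veth", "peer", "name", peer)
        sh("ip", "link", "set", peer, "netns", ns)              # pod end of the patch cable
        sh("ip", "netns", "exec", ns, "ip", "link", "set", peer, "name", "eth0")
        sh("ip", "link", "set", veth, "master", "br0")          # root end plugged into the bridge
        sh("ip", "link", "set", veth, "up")
        sh("ip", "netns", "exec", ns, "ip", "addr", "add", addr, "dev", "eth0")
        sh("ip", "netns", "exec", ns, "ip", "link", "set", "eth0", "up")

    # eth0 (Pod A) -> veth-a -> br0 -> veth-b -> eth0 (Pod B)
    sh("ip", "netns", "exec", "pod-a", "ping", "-c", "1", "10.0.0.3")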

Pods across different nodes

The exact implementation can vary depending on the CNI used, but we can still identify some common patterns.

Generally, every Node in your cluster is assigned a CIDR block specifying the IP addresses available to Pods running on that Node, so once traffic destined for the CIDR block reaches the Node, it is the Node’s responsibility to forward traffic to the correct Pod.
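
If you want to see these per-Node CIDR blocks, a small sketch using the official kubernetes Python client (assuming it is installed and a kubeconfig is available; note that some CNIs manage Pod addressing outside this field) looks like this:

    from kubernetes import client, config

    config.load_kube_config()                # or config.load_incluster_config()
    for node in client.CoreV1Api().list_node().items:
        # spec.pod_cidr is the CIDR block assigned to Pods on this Node.
        print(node.metadata.name, node.spec.pod_cidr)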

On the source node, once the packet arrives at the bridge in the root network namespace, ARP resolution fails because no device attached to the bridge owns the destination address. On that failure, the packet falls through to the default route and leaves the node through the root namespace's eth0 device.

The packet then crosses the network between Nodes and ends up in the destination Node's root namespace, where it is routed through the bridge to the correct virtual Ethernet device and into the Pod through its eth0 interface.
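
To see where the source Node would send such a packet, you can ask the kernel's routing table directly; in the sketch below the remote Pod IP is made up:

    import subprocess

    REMOTE_POD_IP = "10.244.2.17"   # hypothetical Pod IP from another Node's CIDR block

    # "ip route get" prints the interface and next hop the kernel would use;
    # on the source Node this is typically the default route via eth0.
    print(subprocess.run(["ip", "route", "get", REMOTE_POD_IP],
                         capture_output=True, text=True, check=True).stdout)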

Pod to service networking

Netfilter & Iptables

When creating a new Kubernetes Service, a new virtual IP (also known as a cluster IP) is created on your behalf. Anywhere within the cluster, traffic addressed to the virtual IP will be load-balanced to the set of backing Pods associated with the Service.
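
A short sketch with the kubernetes Python client (kubeconfig assumed; the Service name "my-service" and namespace "default" are illustrative) shows the virtual IP and the Pod IPs behind it:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    svc = v1.read_namespaced_service("my-service", "default")
    print("cluster IP:", svc.spec.cluster_ip)      # the Service's virtual IP

    # The Endpoints object lists the backing Pod IPs the virtual IP balances to.
    ep = v1.read_namespaced_endpoints("my-service", "default")
    for subset in ep.subsets or []:
        print([addr.ip for addr in subset.addresses or []])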

To perform load balancing within the cluster, Kubernetes relies on netfilter, a framework provided by the Linux kernel that allows various networking-related operations to be implemented in the form of customized handlers.

Iptables is a user-space program providing a table-based system for defining rules to manipulate and transform packets using the netfilter framework. In Kubernetes, the iptables rules are configured by the kube-proxy controller, which watches the Kubernetes API server for changes. Once such a change is detected (a Pod going offline, an IP change, etc.), kube-proxy amends the iptables rules accordingly.
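
On a node where kube-proxy runs in iptables mode, you can inspect the rules it maintains; the sketch below (root assumed) dumps the KUBE-SERVICES chain in the NAT table, which is the usual entry point for Service virtual IPs:

    import subprocess

    out = subprocess.run(
        ["iptables", "-t", "nat", "-L", "KUBE-SERVICES", "-n"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(out)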

The iptables rules watch for traffic destined for a Service's virtual IP and, on a match, a random Pod IP address is selected from the set of available Pods; the iptables rule then changes the packet's destination IP address from the Service's virtual IP to the IP of the selected Pod. When the packet returns, its header is rewritten again to change the source from the Pod IP back to the Service's virtual IP (effectively performing NAT).
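
For illustration, the rules kube-proxy generates for a Service with two backends have roughly the following shape (the chain names, addresses, and ports are made up, and real chains carry hashed suffixes):

    # Not real output -- just the shape of the generated rules.
    rules = [
        # pick backend 1 with probability 0.5 ...
        "-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.5 "
        "-j KUBE-SEP-BACKEND1",
        # ... otherwise fall through to backend 2
        "-A KUBE-SVC-EXAMPLE -j KUBE-SEP-BACKEND2",
        # each endpoint chain DNATs the Service VIP to one Pod IP
        "-A KUBE-SEP-BACKEND1 -p tcp -j DNAT --to-destination 10.244.1.8:8080",
        "-A KUBE-SEP-BACKEND2 -p tcp -j DNAT --to-destination 10.244.2.9:8080",
    ]
    print("\n".join(rules))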

IPVS

Starting with version 1.11, Kubernetes offers an additional option for in-cluster load balancing: IPVS (IP Virtual Server). IPVS is also built on top of netfilter and implements transport-layer load balancing as part of the Linux kernel. IPVS is incorporated into LVS (the Linux Virtual Server), where it runs on a host and acts as a load balancer in front of a cluster of real servers. IPVS can direct requests for TCP- and UDP-based services to the real servers, and make the services of the real servers appear as virtual services on a single IP address.

When declaring a Kubernetes Service, you can specify whether you want in-cluster load balancing to be done with iptables or IPVS (the mode is configured on kube-proxy). IPVS is specifically designed for load balancing and uses more efficient data structures (hash tables), allowing for almost unlimited scale compared to iptables. When creating a Service load balanced with IPVS, three things happen: a dummy IPVS interface is created on the Node, the Service's IP address is bound to the dummy IPVS interface, and IPVS servers are created for each Service IP address.
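
On a node where kube-proxy runs in IPVS mode, these pieces can be inspected directly; the sketch below (root, iproute2, and ipvsadm assumed) shows the dummy interface the Service IPs are bound to, commonly named kube-ipvs0, and the IPVS virtual/real server table:

    import subprocess

    def show(*cmd):
        print(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)

    show("ip", "addr", "show", "dev", "kube-ipvs0")   # Service cluster IPs bound to the dummy interface
    show("ipvsadm", "-Ln")                            # virtual servers and their backing Pod IPs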
