Design Decisions

Default maximum pods

  • Kubespray defaults to a maximum of 110 pods per node, which is fine for worker nodes with a relatively small amount of memory
    • If the node has 64GB of memory or more, I typically increase this to 254 max pods per node and align Calico with it by changing its default pool block size from /26 (64 addresses) to /24 (256 addresses)
    • If the node has significant memory (>512GB), I would undertake large-node tuning to accommodate up to 400 pods, because other factors such as networking, conntrack, file descriptors and scheduling overhead become real limits before RAM
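The block size arithmetic behind the 254-pod choice can be checked with a few lines of Python. This is a minimal sketch: the helper names `addresses_per_block` and `blocks_needed` are made up for illustration and are not Kubespray or Calico settings.

```python
def addresses_per_block(prefix_len: int) -> int:
    """Number of IPv4 addresses in a block with the given prefix length."""
    return 2 ** (32 - prefix_len)


def blocks_needed(max_pods: int, prefix_len: int) -> int:
    """Minimum number of blocks a node needs to give every pod an address."""
    per_block = addresses_per_block(prefix_len)
    return -(-max_pods // per_block)  # ceiling division


if __name__ == "__main__":
    for prefix in (26, 24):
        print(
            f"/{prefix}: {addresses_per_block(prefix)} addresses per block, "
            f"{blocks_needed(254, prefix)} block(s) for 254 pods"
        )
```

With a /26 block size a node running 254 pods would need four blocks, whereas a single /24 block covers all of them.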

Networking

  • Utilise Calico for networking and network security
    • Provides pod-to-pod, pod-to-service, and pod-to-external connectivity
    • Can run in pure layer 3 (routing) mode, or use encapsulation (IP-in-IP, VXLAN, WireGuard) when the underlay network doesn’t support direct routing
    • Network Policy enforcement
      • Implements Kubernetes NetworkPolicies (which workloads can talk to which)
      • Supports advanced security policies (global, namespaced, application-aware)
    • Flexibility
      • Can run with BGP routing, overlays (IPIP, VXLAN), or in hybrid modes
      • Works in cloud environments (AWS, GCP, Azure), virtualised environments (VMware, Parallels, VirtualBox) and on bare metal
    • Configure a full BGP mesh using BIRD
      • Pod to Pod on the same node → always direct via the veth pair (no encapsulation, no BGP needed)
      • Pod to Pod on different nodes → assuming the underlay cannot route the pod CIDRs, encapsulate using IP-in-IP
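The forwarding choices described in the BGP mesh bullets can be summarised as a small decision function. This is only an illustrative sketch: the function name and the `underlay_routes_pod_cidrs` flag are hypothetical and are not part of Calico, which makes this choice through its IP pool configuration rather than through code like this.

```python
def pod_to_pod_path(src_node: str, dst_node: str,
                    underlay_routes_pod_cidrs: bool) -> str:
    """Describe how traffic between two pods is carried (illustrative only)."""
    if src_node == dst_node:
        # Same node: traffic stays on the host and crosses the veth pairs.
        return "local veth (no encapsulation, no BGP)"
    if underlay_routes_pod_cidrs:
        # The underlay knows the pod CIDRs, so packets are routed natively.
        return "direct routing (no encapsulation)"
    # The underlay cannot route pod CIDRs, so wrap the packet in IP-in-IP.
    return "IP-in-IP encapsulation"


if __name__ == "__main__":
    print(pod_to_pod_path("node1", "node1", False))
    print(pod_to_pod_path("node1", "node2", True))
    print(pod_to_pod_path("node1", "node2", False))
```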

MTU

  • MTU
    • Determine the MTU in the environment. From a node, send ICMP echo request packets to another node on the same network with the Don't Fragment (DF) bit set in the IP header and a payload size of 1472 bytes. The IP header (20 bytes) and ICMP header (8 bytes) add another 28 bytes, for a total packet size of 1500 bytes: ping -M do -s 1472 lab.reids.net.au
    • If this ping fails with "message too long", keep reducing the payload size until the ping succeeds. Add 28 bytes to the largest payload that succeeds and that is the node's interface MTU
    • Calico pod veth MTU should be the underlay NIC MTU − encapsulation overhead (the examples assume a 1500-byte underlay):
      • No encapsulation (pure BGP/direct): veth_mtu = 1500
      • IP-in-IP (IPv4): overhead ≈ 20 bytes → 1500 − 20 = 1480 (chosen option)
      • VXLAN: overhead ≈ 50 bytes → 1500 − 50 = 1450
    • Note that some networks support jumbo frames, which allow MTUs larger than 1500 bytes. The larger MTU must be set consistently on every interface and switch in the path, otherwise fragmentation or drops occur. Jumbo frames are often used on storage networks (iSCSI, NFS over IP)
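The MTU arithmetic and the "reduce the payload until the ping succeeds" procedure can be sketched in Python. Several things here are assumptions rather than part of the original notes: the helper names are hypothetical, the search uses bisection instead of stepping the payload down manually, and `ping_probe` assumes the Linux iputils `ping` with its `-M do`, `-s`, `-c` and `-W` options.

```python
import subprocess

HEADER_BYTES = 28  # 20-byte IPv4 header + 8-byte ICMP header
ENCAP_OVERHEAD = {"none": 0, "ipip": 20, "vxlan": 50}  # bytes, IPv4 underlay


def veth_mtu(underlay_mtu: int, encap: str) -> int:
    """Pod veth MTU = underlay NIC MTU minus encapsulation overhead."""
    return underlay_mtu - ENCAP_OVERHEAD[encap]


def find_mtu(probe, low: int = 576, high: int = 9000):
    """Bisect for the largest MTU whose payload passes probe(payload).

    probe is any callable taking a payload size and returning True when a
    DF-flagged ping of that size succeeds. Returns None if even `low` fails.
    """
    lo, hi = low - HEADER_BYTES, high - HEADER_BYTES
    if not probe(lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid        # mid fits, so search larger payloads
        else:
            hi = mid - 1    # mid is too big, so search smaller payloads
    return lo + HEADER_BYTES


def ping_probe(host: str):
    """Build a probe that sends one DF-flagged ping (assumes iputils ping)."""
    def probe(payload: int) -> bool:
        cmd = ["ping", "-M", "do", "-s", str(payload), "-c", "1", "-W", "1", host]
        return subprocess.run(cmd, capture_output=True).returncode == 0
    return probe


if __name__ == "__main__":
    print(veth_mtu(1500, "ipip"))   # 1480
    print(veth_mtu(1500, "vxlan"))  # 1450
```

On a node, `find_mtu(ping_probe("lab.reids.net.au"))` would run the search against the host used in the ping example above, and the result can then be passed to `veth_mtu` with the chosen encapsulation.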