
Kubespray troubleshooting

Common errors and remedies.

Kubespray

Additional information.

  • In these Kubespray commands I use the shorthand -b, which is short for --become and is equivalent to --become --become-user=root (root is the default become user)

    • -b (short for --become) → escalate to root for tasks that need it
    • -u dev → connect as user dev over SSH
    • -K → prompt for the dev user’s sudo password (unless passwordless sudo is set)
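    • For reference, these two invocations are equivalent (shown against a typical cluster.yml run using the inventory path from this page):

      ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K
      ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml --become --become-user=root --user dev --ask-become-pass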
  • Addons: Kubespray can deploy a number of add-ons automatically. I prefer to deploy these manually after the cluster has been deployed and is stable, although I have used the automatic deployment in the past

    • For example, to deploy the dashboard and helm, edit inventory/devcluster/group_vars/k8s_cluster/addons.yml and set:

      dashboard_enabled: true
      helm_enabled: true
    • Then rerun only the add-on tasks (not the whole cluster build) by tagging the cluster playbook. From inside your virtualenv on the Ansible host, run:

      cd ~/kubespray-devcluster/kubespray

      ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K --tags=addons
    • The --tags=addons flag ensures that all enabled addons will be applied

    • If you want to deploy one add-on at a time, such as dashboard, then run:

      ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K --tags dashboard
      • Note that you still have to enable the add-on in the addons.yml configuration. If you don't, the playbook will run but the add-on will not be deployed
    • You can also deploy several add-ons at a time, such as dashboard and helm:

      ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K --tags dashboard,helm
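    • To confirm the add-ons were deployed, check for their resources afterwards (a sketch; namespaces and install locations vary by Kubespray version):

      kubectl get pods -A | grep -i dashboard
      helm version   # run where Kubespray installed the helm binary (typically a control-plane node)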
  • New Node: Process to deploy an additional node. Assume a new physical worker node (dev-w-p1) is to be added to the cluster

    • Ensure the following steps have been completed before running the playbook
      • Prepare the node to be added
      • Ensure it is reachable
      • Update kubespray's host file with the node information
      • Deploy the ssh-key to the node, and test ssh to it (see the example after this checklist)
      • Update the inventory/devcluster/inventory.ini with the role of the new node
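      • As an example, the key deployment and reachability checks above might look like this (a sketch using the dev user and hostname from this guide):
        # Copy the dev user's SSH key to the new node and confirm SSH works
        ssh-copy-id dev@dev-w-p1
        ssh dev@dev-w-p1 hostname
        # Confirm Ansible can reach it through the updated inventory
        ansible -i inventory/devcluster/inventory.ini dev-w-p1 -m ping -u dev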
    • Refresh facts for all hosts before limiting. Running facts.yml ensures all nodes (old + new) have up-to-date facts cached:
      ansible-playbook -i inventory/devcluster/inventory.ini playbooks/facts.yml -b -u dev -K
    • Run the scale playbook:
      ansible-playbook -i inventory/devcluster/inventory.ini scale.yml -b -u dev -K --limit dev-w-p1
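    • Afterwards, confirm the new node joined and reaches Ready:
      kubectl get nodes dev-w-p1 -o wide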
  • New Nodes: To deploy several additional nodes, follow the same process as adding a single node, except run the scale playbook without --limit

    • Run the scale playbook:
      ansible-playbook -i inventory/devcluster/inventory.ini scale.yml -b -u dev -K
  • Remove Node: Process to remove a single node

    • Run the remove node playbook to remove the node from the cluster (cordon, drain, delete node, etc):
      ansible-playbook -i inventory/devcluster/inventory.ini remove-node.yml -b -u dev -K -e "node=dev-w-p1"
    • Wipe the node you removed (cleans kubelet/containerd configs, iptables, CNI, etc):
      ansible-playbook -i inventory/devcluster/inventory.ini reset.yml -b -u dev -K -e "reset_confirmation=yes" -e "reset_nodes=dev-w-p1"
    • Update inventory: remove dev-w-p1 from inventory.ini
    • Remove the SSH known_hosts entry for the removed node:
      ssh-keygen -f "/home/dev/.ssh/known_hosts" -R "dev-w-p1"
    • If removing a control-plane node, make sure your LB/keepalived backends and any etcd membership considerations are handled; the Kubespray playbook takes care of the Kubernetes side, but you will need to update HAProxy on all load balancers to remove the node from the backend list (see the example backend below)
      • Then reload HAProxy backends: sudo systemctl reload haproxy
      • Verify the member is gone from etcd, for example with etcdctl on a remaining control-plane node (point etcdctl at the etcd endpoints and certificates Kubespray deployed):
        etcdctl member list
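      • Example haproxy.cfg backend for the API server (hostnames and IPs are illustrative placeholders; match them to your own config). Delete the retired control-plane node's server line, then reload HAProxy as above:
        backend k8s_apiserver
            balance roundrobin
            server dev-c-p1 192.168.10.11:6443 check
            server dev-c-p2 192.168.10.12:6443 check
            # server dev-c-p3 192.168.10.13:6443 check   <- line removed for the retired node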
    • Verify removal:
      kubectl get nodes
      kubectl get csr | grep -i dev-w-p1 || true
  • Node Reset (without deleting the node)

    • There are a few use cases where you might want to reset a node but not delete it

      • Recycling a node inside the same cluster
        • Example: The node got into a bad state (broken kubelet, containerd issues, corrupted CNI/iptables)
        • You reset the node to wipe its Kubernetes state, then re-run Kubespray (scale.yml) to rejoin it to the cluster
        • In this case, you don't need to delete it from Kubernetes if it's already been removed by the control plane (e.g., it stayed NotReady for too long and was force-removed)
      • Re-adding the same node after a failed join
        • If you tried adding a node (via scale.yml) but it failed halfway (maybe kubelet certificates didn’t bootstrap, container runtime misconfigured), you can reset it and re-run the add playbook
        • No need to delete it from the cluster if it never actually joined successfully
      • Converting the node’s role
        • Suppose a host was provisioned as a worker but you decide it should be a control-plane node
        • You reset it to clean the old kubelet/containerd state, then reconfigure inventory and re-run scale.yml
        • The reset ensures no leftover worker configs interfere
      • Troubleshooting / Lab environments
        • Benchmark fresh joins
        • Test scale.yml playbook
        • Rotate through different container runtimes (containerd vs CRI-O)
      • Keeping the cluster object intact temporarily
        • Sometimes you want to wipe a node but keep its object in Kubernetes until you’re sure the cleanup went well (e.g., draining workloads, checking LB configs)
        • Resetting first lets you safely wipe the host while still having the cluster “remember” it — later you can delete if needed
    • Process for a single node reset:

      cd ~/kubespray-devcluster/kubespray
      • Drain if the node still exists in the cluster
      # See if it’s in the cluster
      kubectl get node dev-w-p1

      # If present, safely evict workloads
      kubectl drain dev-w-p1 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
      • Run the reset playbook limited to the host (confirmation required). This wipes the node's Kubernetes state (kubelet/containerd configs, iptables, CNI, etc). Important: when resetting just one node, always use both --limit and the reset_nodes variable
      ansible-playbook -i inventory/devcluster/inventory.ini reset.yml -b -u dev -K --limit dev-w-p1 -e reset_nodes=dev-w-p1 -e reset_confirmation=yes
      • Verify it's clean
      # On dev-w-p1 these should be gone or empty:
      ls /etc/kubernetes /var/lib/kubelet /var/lib/cni || true
      sudo systemctl is-active kubelet || true
      sudo crictl ps || true  # likely shows nothing if containerd/CRI-O was wiped
    • After a single-node reset

      • If you didn’t run remove-node.yml, Kubernetes will still have the node object. It’ll show NotReady until you rejoin or delete it. That’s why draining first is recommended
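      • For example, rerun the scale playbook limited to the host to rejoin it, or delete the stale node object if you are not rejoining it:
        ansible-playbook -i inventory/devcluster/inventory.ini scale.yml -b -u dev -K --limit dev-w-p1
        kubectl delete node dev-w-p1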
  • Cluster Reset: Also known as deleting the cluster - use with caution

    • Dry run. See what will run before doing it:
    cd ~/kubespray-devcluster/kubespray

    ansible-playbook -i inventory/devcluster/inventory.ini reset.yml --list-tasks
    • Full reset:
    cd ~/kubespray-devcluster/kubespray

    ansible-playbook -i inventory/devcluster/inventory.ini reset.yml -b -u dev -K -e reset_confirmation=yes
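    • Quick sanity check afterwards (an ad-hoc sketch; same inventory and user as above). Each node should no longer have /etc/kubernetes:

    ansible -i inventory/devcluster/inventory.ini all -b -u dev -K -m shell -a "ls /etc/kubernetes 2>/dev/null || echo clean"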