Kubespray troubleshooting
Common errors and remedies.
Kubespray
Additional information.
- In these Kubespray commands I use the shorthand -b, which is short for --become and equivalent to --become --become-user=root (root is the default target for become)
- -b (short for --become) → escalate to root for tasks that need it
- -u dev → connect as user dev over SSH
- -K → prompt for the dev user’s sudo password (unless passwordless sudo is set up)
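- Putting the three flags together, every playbook run on this page follows the same pattern (the inventory path and playbook name are the ones used later on this page):
```bash
# Connect as dev over SSH, escalate to root, and prompt for dev's sudo password
ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K
```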
- Addons: Kubespray can deploy a number of addons automatically. I prefer to deploy these manually after the cluster has been deployed and is stable, although I have used them in the past
- For example, if you want to deploy the dashboard and helm then you would edit inventory/devcluster/group_vars/k8s_cluster/addons.yml and set:
dashboard_enabled: true
helm_enabled: true
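- To confirm the flags were actually set before rerunning the playbook, a quick grep works (the file layout can differ slightly between Kubespray versions; this only checks the two settings above):
```bash
# Verify the addon flags in the copied inventory's group_vars
grep -E '^(dashboard_enabled|helm_enabled):' inventory/devcluster/group_vars/k8s_cluster/addons.yml
# Expected output after the edit:
# dashboard_enabled: true
# helm_enabled: true
```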
- Then you need to rerun the addons playbook run (not the whole cluster build). Kubespray supports this via tags on cluster.yml. From inside your virtualenv on the ansible host, run:
cd ~/kubespray-devcluster/kubespray
ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K --tags=addons
- The --tags=addons flag ensures that all enabled addons will be applied
- If you want to deploy one add-on at a time, such as dashboard, then run:
ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K --tags dashboard
- Note that you still have to enable the add-on in the addons.yml configuration. If you don't, the playbook will run but the add-on will not get deployed
- You can also deploy several add-ons at a time, such as dashboard and helm:
ansible-playbook -i inventory/devcluster/inventory.ini cluster.yml -b -u dev -K --tags dashboard,helm
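- A rough post-deploy check that both addons landed (the dashboard namespace differs between Kubespray versions, so this just greps broadly; the helm binary is normally installed on control-plane nodes):
```bash
# Dashboard pods should show up somewhere once the tagged run finishes
kubectl get pods -A | grep -i dashboard || true
# Helm binary check - run this on a host where the addon installs it
helm version || true
```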
- New Node: Process to deploy an additional node. Assume a new physical worker node (dev-w-p1) is to be added to the cluster
- Ensure the following steps have been completed before running the playbook
- Prepare the node to be added
- Ensure it is reachable
- Update kubespray's host file with the node information
- Deploy the ssh-key to the node, and test ssh to it
- Update the inventory/devcluster/inventory.ini with the role of the new node (a quick verification example follows this list)
- Refresh facts for all hosts before limiting. Running facts.yml ensures all nodes (old + new) have up-to-date facts cached:
ansible-playbook -i inventory/devcluster/inventory.ini playbooks/facts.yml -b -u dev -K
- Run the scale playbook:
ansible-playbook -i inventory/devcluster/inventory.ini scale.yml -b -u dev -K --limit dev-w-p1
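- Before running facts.yml/scale.yml, a quick way to confirm the inventory edit and SSH/become access to the new node (group names below are the standard Kubespray ones; adjust if yours differ):
```bash
# dev-w-p1 should appear under kube_node in the inventory graph
ansible-inventory -i inventory/devcluster/inventory.ini --graph | grep -B2 -A2 dev-w-p1
# Test SSH connectivity and sudo escalation against just the new host
ansible -i inventory/devcluster/inventory.ini dev-w-p1 -m ping -b -u dev -K
```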
- New Nodes: If you need to deploy many additional nodes, follow the same process as adding one node, except the scale playbook is run without limiting it
- Run the scale playbook:
ansible-playbook -i inventory/devcluster/inventory.ini scale.yml -b -u dev -K
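- A quick sanity check once the playbook finishes, confirming the new nodes registered and nothing in kube-system is unhealthy:
```bash
# New workers should appear and move to Ready within a few minutes
kubectl get nodes -o wide
# Anything not Running/Completed in kube-system deserves a look
kubectl get pods -n kube-system -o wide | grep -Ev 'Running|Completed' || true
```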
- Remove Node: Process to remove a single node
- Run the remove node playbook to remove the node from the cluster (cordon, drain, delete node, etc):
ansible-playbook -i inventory/devcluster/inventory.ini remove-node.yml -b -u dev -K -e "node=dev-w-p1"
- Wipe the node you removed (cleans kubelet/containerd configs, iptables, CNI, etc):
ansible-playbook -i inventory/devcluster/inventory.ini reset.yml -b -u dev -K -e "reset_confirmation=yes" -e "reset_nodes=dev-w-p1"
- Update inventory: remove dev-w-p1 from inventory.ini
- Remove the ssh key relating to the removed node:
ssh-keygen -f "/home/dev/.ssh/known_hosts" -R "dev-w-p1"
- If removing a control-plane node, make sure your LB/keepalived backends and any etcd membership considerations are handled; the Kubespray playbook will take care of the Kubernetes side, but you will need to update HAProxy on all load balancers and remove the node from their backends (see the example after this list)
- Then reload HAProxy backends:
sudo systemctl reload haproxy
- Verify removal from etcd:
kubectl -n kube-system get endpoints etcd -o wide
- Verify removal:
kubectl get nodes
kubectl get csr | grep -i dev-w-p1 || true
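- For the control-plane case, the HAProxy change is just deleting the departed node's server line from the apiserver backend on every load balancer. The backend name, server name, and IP below are illustrative only; match them to your own haproxy.cfg:
```bash
# /etc/haproxy/haproxy.cfg (excerpt) - delete the removed node's entry, e.g.:
#   backend k8s_apiserver
#       server dev-m-p1 10.0.0.11:6443 check    <- remove this line
sudo vi /etc/haproxy/haproxy.cfg
sudo haproxy -c -f /etc/haproxy/haproxy.cfg   # validate the config before reloading
sudo systemctl reload haproxy
```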
- Node Reset (resetting a node without deleting it from the cluster)
- There are a few use cases where you might want to reset a node but not delete it
- Recycling a node inside the same cluster
- Example: The node got into a bad state (broken kubelet, containerd issues, corrupted CNI/iptables)
- You reset the node to wipe its Kubernetes state, then re-run Kubespray (scale.yml) to rejoin it to the cluster
- In this case, you don’t need to delete it from Kubernetes if it’s already been removed by the control plane (e.g., it stayed NotReady for too long and was force-removed)
- Re-adding the same node after a failed join
- If you tried adding a node (via scale.yml) but it failed halfway (maybe kubelet certificates didn’t bootstrap, container runtime misconfigured), you can reset it and re-run the add playbook
- No need to delete it from the cluster if it never actually joined successfully
- Converting the node’s role
- Suppose a host was provisioned as a worker but you decide it should be a control-plane node
- You reset it to clean the old kubelet/containerd state, then reconfigure inventory and re-run scale.yml
- The reset ensures no leftover worker configs interfere
- Troubleshooting / Lab environments
- Benchmark fresh joins
- Test scale.yml playbook
- Rotate through different container runtimes (containerd vs CRI-O)
- Keeping the cluster object intact temporarily
- Sometimes you want to wipe a node but keep its object in Kubernetes until you’re sure the cleanup went well (e.g., draining workloads, checking LB configs)
- Resetting first lets you safely wipe the host while still having the cluster “remember” it — later you can delete if needed
- Process for a single node reset:
cd ~/kubespray-devcluster/kubespray
- Drain if the node still exists in the cluster
# See if it’s in the cluster
kubectl get node dev-w-p1
# If present, safely evict workloads
kubectl drain dev-w-p1 --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
- Run the reset playbook limited to the host; confirmation required. This wipes the node's Kubernetes state (kubelet/containerd configs, iptables, CNI, etc). Important - when resetting just one node, always use both the limit and the variable:
ansible-playbook -i inventory/devcluster/inventory.ini reset.yml -b -u dev -K --limit dev-w-p1 -e reset_nodes=dev-w-p1 -e reset_confirmation=yes
- Verify it’s clean
```bash
# On dev-w-p1 these should be gone or empty:
ls /etc/kubernetes /var/lib/kubelet /var/lib/cni || true
sudo systemctl is-active kubelet || true
sudo crictl ps || true   # likely shows nothing if containerd/crio was wiped
```
- After a single-node reset
- If you didn’t run remove-node.yml, Kubernetes will still have the node object. It’ll show NotReady until you rejoin or delete it. That’s why draining first is recommended
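- Two ways to resolve the leftover NotReady node object, depending on whether the host is coming back (both reuse the hostnames and playbooks from this page):
```bash
# Option A: rejoin the reset host (it must still be listed in inventory.ini)
ansible-playbook -i inventory/devcluster/inventory.ini scale.yml -b -u dev -K --limit dev-w-p1

# Option B: drop the stale node object from the cluster instead
kubectl delete node dev-w-p1
```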
- Cluster Reset: Also known as deleting the cluster - use with caution
- Dry run. See what will run before doing it:
cd ~/kubespray-devcluster/kubespray
ansible-playbook -i inventory/devcluster/inventory.ini reset.yml --list-tasks
- Full reset:
cd ~/kubespray-devcluster/kubespray
ansible-playbook -i inventory/devcluster/inventory.ini reset.yml -b -u dev -K -e reset_confirmation=yes
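- A rough way to spot-check that the reset actually cleaned every host is an Ansible ad-hoc run of the same read-only checks used for the single-node reset above:
```bash
# Read-only spot check across all hosts in the inventory
ansible -i inventory/devcluster/inventory.ini all -b -u dev -K -m shell \
  -a "systemctl is-active kubelet || true; ls /etc/kubernetes /var/lib/kubelet 2>/dev/null || true"
```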