Cluster Maintenance
Exam relevance: CKA ✅ (Cluster Architecture — 25%, Troubleshooting — 30%) | CKAD: Not directly tested
Node Maintenance — Drain, Cordon, Uncordon
When you need to take a node offline for maintenance (OS updates, hardware changes):
cordon — Mark Node Unschedulable
# Prevent new pods from being scheduled on this node
kubectl cordon worker-1
# Existing pods continue running
kubectl get nodes
# NAME STATUS ROLES AGE
# worker-1 Ready,SchedulingDisabled <none> 30d
drain — Evict All Pods and Cordon
# Safely evict all pods from the node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# Common flags:
# --ignore-daemonsets Required (DaemonSet pods can't be evicted normally)
# --delete-emptydir-data Allow evicting pods using emptyDir volumes (data is lost)
# --force Force eviction of unmanaged pods (no controller)
# --grace-period=30 Override pod termination grace period
What drain does:
- Cordons the node (marks unschedulable)
- Evicts pods respecting PodDisruptionBudgets
- Pods managed by Deployments/ReplicaSets are recreated on other nodes
- DaemonSet pods are ignored (they belong on every node)
- Standalone pods (no controller) are deleted permanently — use --force for these
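The rules above can be sketched as a small script that classifies pods by their controller kind and predicts what drain would do with each. The input mimics a kubectl custom-columns query; the column layout and sample pod names are made up for the demo:

```shell
# Illustrative helper: predict drain behavior from each pod's owner kind.
# Input mimics: kubectl get pods -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind
drain_effect() {
  awk 'NR > 1 {
    if ($2 == "ReplicaSet")     print $1 ": evicted, recreated elsewhere"
    else if ($2 == "DaemonSet") print $1 ": ignored (needs --ignore-daemonsets)"
    else if ($2 == "<none>")    print $1 ": deleted permanently (needs --force)"
    else                        print $1 ": evicted"
  }'
}

# Captured sample output stands in for a live kubectl call
pods_sample='NAME        OWNER
web-abc12   ReplicaSet
logger-xyz  DaemonSet
debug-pod   <none>'

printf '%s\n' "$pods_sample" | drain_effect
```

This is only a mental model of drain's decision logic, not a replacement for reading the drain output itself.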
uncordon — Allow Scheduling Again
# After maintenance, allow pods to be scheduled again
kubectl uncordon worker-1
Note: Existing pods that were evicted do NOT automatically move back. Only new pods can be scheduled on the uncordoned node.
Cluster Upgrade with kubeadm
Kubernetes releases a new minor version every ~4 months. The upgrade process:
- Upgrade control plane node(s) first
- Upgrade worker nodes one at a time
Rule: You can only upgrade one minor version at a time (1.30 → 1.31, not 1.30 → 1.32).
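As a quick sanity check of the rule, a tiny bash helper (the function name and "major.minor" input format are assumptions for the demo) can compare minor versions:

```shell
# Illustrative check of the one-minor-version rule
can_upgrade() {
  # Strip everything up to the dot to get the minor version number
  local cur_minor="${1#*.}" tgt_minor="${2#*.}"
  if (( tgt_minor >= cur_minor && tgt_minor - cur_minor <= 1 )); then
    echo "ok"
  else
    echo "skip not allowed"
  fi
}

can_upgrade 1.30 1.31   # prints "ok"
can_upgrade 1.30 1.32   # prints "skip not allowed"
```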
Step-by-Step: Upgrade Control Plane
# 1. Check current version
kubectl get nodes
kubeadm version
kubelet --version
# 2. Find available versions
apt-cache madison kubeadm
# or
apt list -a kubeadm
# 3. Upgrade kubeadm on control plane
apt-get update
apt-get install -y kubeadm=1.31.0-1.1
# 4. Verify kubeadm version
kubeadm version
# 5. Plan the upgrade (dry-run — shows what will happen)
kubeadm upgrade plan
# 6. Apply the upgrade on the FIRST control plane node
kubeadm upgrade apply v1.31.0
# For ADDITIONAL control plane nodes, use:
# kubeadm upgrade node
# 7. Drain the control plane node
kubectl drain controlplane --ignore-daemonsets
# 8. Upgrade kubelet and kubectl
apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
# 9. Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
# 10. Uncordon
kubectl uncordon controlplane
Step-by-Step: Upgrade Worker Node
# ON THE CONTROL PLANE:
# 1. Drain the worker
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# SSH TO THE WORKER NODE:
ssh worker-1
# 2. Upgrade kubeadm
apt-get update
apt-get install -y kubeadm=1.31.0-1.1
# 3. Upgrade node configuration
kubeadm upgrade node
# 4. Upgrade kubelet and kubectl
apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
# 5. Restart kubelet
systemctl daemon-reload
systemctl restart kubelet
# 6. Exit back to control plane
exit
# ON THE CONTROL PLANE:
# 7. Uncordon the worker
kubectl uncordon worker-1
# 8. Verify
kubectl get nodes
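When verifying across several workers, an awk helper can list any nodes still below the target version. A captured sample stands in for `kubectl get nodes` here; in practice you would pipe in the live output:

```shell
# Illustrative: print names of nodes whose VERSION column is not the target
nodes_behind() {
  awk -v t="$1" 'NR > 1 && $NF != t { print $1 }'
}

# Sample output in the standard kubectl table layout
nodes_sample='NAME           STATUS   ROLES           AGE   VERSION
controlplane   Ready    control-plane   30d   v1.31.0
worker-1       Ready    <none>          30d   v1.30.4'

printf '%s\n' "$nodes_sample" | nodes_behind v1.31.0   # prints "worker-1"
```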
Upgrade Order Summary
1. kubeadm (on the node)
2. kubeadm upgrade apply/node
3. kubelet + kubectl (on the node)
4. systemctl restart kubelet
etcd Backup and Restore
etcd holds ALL cluster state. Backing it up is the most critical maintenance task.
Finding etcd Connection Details
# From the etcd static pod manifest
cat /etc/kubernetes/manifests/etcd.yaml | grep -E 'listen-client|cert-file|key-file|trusted-ca'
# Typical values:
# --listen-client-urls=https://127.0.0.1:2379
# --cert-file=/etc/kubernetes/pki/etcd/server.crt
# --key-file=/etc/kubernetes/pki/etcd/server.key
# --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
# Or use describe
kubectl describe pod etcd-controlplane -n kube-system
Backup
ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
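It is easy to typo one of the four flags under exam pressure. One illustrative approach (the variables, function name, and dated filename are assumptions, not part of etcdctl) is to assemble the command from variables and print it for inspection before running it:

```shell
# Illustrative: build the backup command from variables so the cert paths
# live in one place; review the output, then run it with ETCDCTL_API=3 in front.
ETCD_PKI=/etc/kubernetes/pki/etcd
BACKUP_DIR=/opt

build_backup_cmd() {
  echo "etcdctl snapshot save ${BACKUP_DIR}/etcd-backup-$(date +%Y%m%d).db" \
       "--endpoints=https://127.0.0.1:2379" \
       "--cacert=${ETCD_PKI}/ca.crt" \
       "--cert=${ETCD_PKI}/server.crt" \
       "--key=${ETCD_PKI}/server.key"
}

build_backup_cmd   # review before copy-pasting
```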
Verify Backup
ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-backup.db --write-out=table
Restore
# 1. Stop the API server and etcd (move manifests away)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Wait for them to stop
crictl ps | grep -E 'etcd|apiserver'
# 2. Restore to a NEW directory
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
--data-dir=/var/lib/etcd-restored
# 3. Update etcd manifest to use the new data directory
# Edit /tmp/etcd.yaml:
# Change hostPath for etcd-data volume from /var/lib/etcd to /var/lib/etcd-restored
# 4. Move manifests back
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
# 5. Wait for components to restart
kubectl get pods -n kube-system
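The manifest edit in step 3 can also be done with sed instead of an editor. A minimal sketch, demonstrated on a copy of the relevant snippet rather than the real moved-away /tmp/etcd.yaml (the `$` anchor avoids rewriting an already-updated path):

```shell
# Demo file with the hostPath volume section from a typical etcd manifest
cat > /tmp/etcd-demo.yaml <<'EOF'
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
EOF

# Point the data volume at the restored directory
sed -i 's|path: /var/lib/etcd$|path: /var/lib/etcd-restored|' /tmp/etcd-demo.yaml
grep 'path:' /tmp/etcd-demo.yaml   # now shows /var/lib/etcd-restored
```

Always re-read the manifest afterwards; sed on a static pod manifest is convenient but unforgiving.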
Alternative restore method (simpler but same idea):
# 1. Restore snapshot to new directory
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
--data-dir=/var/lib/etcd-restored
# 2. Edit etcd static pod to point to new directory
vi /etc/kubernetes/manifests/etcd.yaml
# Change:
# volumes:
# - hostPath:
# path: /var/lib/etcd ← change this
# type: DirectoryOrCreate
# To:
# path: /var/lib/etcd-restored ← to this
# kubelet detects the change and restarts etcd automatically
Exam Tip for etcd
The etcd backup/restore question is worth 7-8% of the CKA exam. Memorize:
- ETCDCTL_API=3 (always version 3)
- Three cert flags: --cacert, --cert, --key
- --endpoints=https://127.0.0.1:2379
- snapshot save for backup, snapshot restore --data-dir=<NEW> for restore
- After restore: update the etcd manifest to point to the new data directory
Operating System Upgrades
When a node goes down:
- If it comes back within 5 minutes: Pods are still there, kubelet restarts them
- If it's down > 5 minutes (default pod-eviction-timeout): the controller manager considers the pods dead and recreates them elsewhere
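In current clusters this timeout is enforced through taint-based eviction: the DefaultTolerationSeconds admission plugin adds NoExecute tolerations to every pod, and the 300-second value below is where the 5 minutes comes from. Shown as a fragment for reference only; you normally never set this yourself:

```yaml
# Added automatically to pod specs (defaults, not user configuration)
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```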
For planned OS maintenance:
# 1. Drain the node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
# 2. Do the OS upgrade (kernel update, patches, reboot, etc.)
ssh worker-1
apt-get update && apt-get upgrade -y
reboot
# 3. Wait for node to come back
# 4. Uncordon
kubectl uncordon worker-1
PodDisruptionBudgets (PDB)
PDBs prevent drain/eviction from removing too many pods at once. They ensure minimum availability during maintenance.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: webapp-pdb
spec:
minAvailable: 2 # At least 2 pods must stay running
# OR
# maxUnavailable: 1 # At most 1 pod can be down
selector:
matchLabels:
app: webapp
# Create PDB imperatively
kubectl create pdb webapp-pdb --selector=app=webapp --min-available=2
# Check PDBs
kubectl get pdb
kubectl describe pdb webapp-pdb
How PDBs affect drain:
- kubectl drain respects PDBs — it won't evict pods if doing so would violate the PDB
- If a PDB blocks drain, you'll see the drain command hang — you may need to wait for replacement pods or increase replicas
- --force does NOT override PDBs (it only affects unmanaged pods)
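The "ALLOWED DISRUPTIONS" figure that `kubectl get pdb` reports is simple arithmetic: healthy pods minus minAvailable, floored at zero. A minimal sketch (the function name is made up for illustration):

```shell
# Illustrative: how many pods a drain may evict under a minAvailable PDB
allowed_disruptions() {
  local healthy="$1" min_available="$2"
  local allowed=$(( healthy - min_available ))
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

allowed_disruptions 3 2   # prints "1": one pod may be evicted at a time
allowed_disruptions 2 2   # prints "0": drain blocks until another replica is healthy
```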
Cluster Component Health
Checking Control Plane Health
# Component status (deprecated but still works)
kubectl get componentstatuses
# or
kubectl get cs
# Check system pods
kubectl get pods -n kube-system
# Check node status
kubectl get nodes
kubectl describe node controlplane
# API server health
kubectl get --raw /healthz
kubectl get --raw /livez
kubectl get --raw /readyz
Checking kubelet
# On the node:
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50
Checking Control Plane Logs
# Static pod logs
kubectl logs kube-apiserver-controlplane -n kube-system
kubectl logs kube-scheduler-controlplane -n kube-system
kubectl logs kube-controller-manager-controlplane -n kube-system
kubectl logs etcd-controlplane -n kube-system
# Or directly on the node
crictl logs <container-id>
Key Takeaways
- drain → maintain → uncordon is the standard node maintenance flow
- kubectl drain needs --ignore-daemonsets (almost always) and respects PDBs
- Cluster upgrades: one minor version at a time, control plane first, then workers
- Upgrade order on each node: kubeadm → kubeadm upgrade → kubelet + kubectl → restart kubelet
- etcd backup: ETCDCTL_API=3 etcdctl snapshot save with three cert flags
- etcd restore: snapshot restore --data-dir=<NEW>, then update the etcd manifest
- PDBs protect availability during drain — know minAvailable vs maxUnavailable
- After restore/upgrade, always verify with kubectl get nodes and kubectl get pods -n kube-system