Cluster Maintenance

Exam relevance: CKA ✅ (Cluster Architecture — 25%, Troubleshooting — 30%) | CKAD: Not directly tested


Node Maintenance — Drain, Cordon, Uncordon

When you need to take a node offline for maintenance (OS updates, hardware changes):

cordon — Mark Node Unschedulable

# Prevent new pods from being scheduled on this node
kubectl cordon worker-1

# Existing pods continue running
kubectl get nodes
# NAME       STATUS                     ROLES    AGE
# worker-1   Ready,SchedulingDisabled   <none>   30d

drain — Evict All Pods and Cordon

# Safely evict all pods from the node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# Common flags:
# --ignore-daemonsets Required (DaemonSet pods can't be evicted normally)
# --delete-emptydir-data Allow evicting pods using emptyDir volumes (data is lost)
# --force Force eviction of unmanaged pods (no controller)
# --grace-period=30 Override pod termination grace period

What drain does:

  1. Cordons the node (marks unschedulable)
  2. Evicts pods respecting PodDisruptionBudgets
  3. Pods managed by Deployments/ReplicaSets are recreated on other nodes
  4. DaemonSet pods are ignored (they belong on every node)
  5. Standalone pods (no controller) are deleted permanently — use --force for these
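One way to verify a drain did what the steps above describe (using the worker-1 node from earlier): list the pods still bound to the node; only DaemonSet-managed pods should remain.

```shell
# Show every pod still scheduled on the drained node
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=worker-1
```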

uncordon — Allow Scheduling Again

# After maintenance, allow pods to be scheduled again
kubectl uncordon worker-1

Note: Existing pods that were evicted do NOT automatically move back. Only new pods can be scheduled on the uncordoned node.
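If you want workloads to spread back onto the uncordoned node, a rolling restart recreates the pods and lets the scheduler reconsider it (the Deployment name webapp is a placeholder):

```shell
# Recreate the Deployment's pods; the scheduler can now place some on the node
kubectl rollout restart deployment/webapp

# Confirm the spread across nodes
kubectl get pods -o wide
```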


Cluster Upgrade with kubeadm

Kubernetes releases a new minor version every ~4 months. The upgrade process:

  1. Upgrade control plane node(s) first
  2. Upgrade worker nodes one at a time

Rule: You can only upgrade one minor version at a time (1.30 → 1.31, not 1.30 → 1.32).
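The one-version rule can be expressed as a quick pre-flight check. This is an illustrative sketch, not a kubeadm feature, and the version strings are placeholders:

```shell
# Compare minor versions before attempting an upgrade (placeholder versions)
current="1.30.4"
target="1.31.0"

cur_minor=$(echo "$current" | cut -d. -f2)
tgt_minor=$(echo "$target" | cut -d. -f2)

if [ $((tgt_minor - cur_minor)) -gt 1 ]; then
  echo "refuse: cannot skip minor versions ($current -> $target)"
else
  echo "ok: $current -> $target is a supported path"
fi
```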

Step-by-Step: Upgrade Control Plane

# 1. Check current version
kubectl get nodes
kubeadm version
kubelet --version

# 2. Find available versions
apt-cache madison kubeadm
# or
apt list -a kubeadm

# 3. Upgrade kubeadm on control plane
apt-get update
# If the package is pinned, unhold it first: apt-mark unhold kubeadm
apt-get install -y kubeadm=1.31.0-1.1

# 4. Verify kubeadm version
kubeadm version

# 5. Plan the upgrade (dry-run — shows what will happen)
kubeadm upgrade plan

# 6. Apply the upgrade on the FIRST control plane node
kubeadm upgrade apply v1.31.0
# For ADDITIONAL control plane nodes, use:
# kubeadm upgrade node

# 7. Drain the control plane node
kubectl drain controlplane --ignore-daemonsets

# 8. Upgrade kubelet and kubectl
apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1

# 9. Restart kubelet
systemctl daemon-reload
systemctl restart kubelet

# 10. Uncordon
kubectl uncordon controlplane

Step-by-Step: Upgrade Worker Node

# ON THE CONTROL PLANE:
# 1. Drain the worker
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# SSH TO THE WORKER NODE:
ssh worker-1

# 2. Upgrade kubeadm
apt-get update
apt-get install -y kubeadm=1.31.0-1.1

# 3. Upgrade node configuration
kubeadm upgrade node

# 4. Upgrade kubelet and kubectl
apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1

# 5. Restart kubelet
systemctl daemon-reload
systemctl restart kubelet

# 6. Exit back to control plane
exit

# ON THE CONTROL PLANE:
# 7. Uncordon the worker
kubectl uncordon worker-1

# 8. Verify
kubectl get nodes

Upgrade Order Summary

1. kubeadm (on the node)
2. kubeadm upgrade apply/node
3. kubelet + kubectl (on the node)
4. systemctl restart kubelet

etcd Backup and Restore

etcd holds ALL cluster state. Backing it up is the most critical maintenance task.

Finding etcd Connection Details

# From the etcd static pod manifest
grep -E 'listen-client|cert-file|key-file|trusted-ca' /etc/kubernetes/manifests/etcd.yaml

# Typical values:
# --listen-client-urls=https://127.0.0.1:2379
# --cert-file=/etc/kubernetes/pki/etcd/server.crt
# --key-file=/etc/kubernetes/pki/etcd/server.key
# --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

# Or use describe
kubectl describe pod etcd-controlplane -n kube-system

Backup

ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

Verify Backup

ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-backup.db --write-out=table

Restore

# 1. Stop the API server and etcd (move manifests away)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# Wait for them to stop
crictl ps | grep -E 'etcd|apiserver'

# 2. Restore to a NEW directory
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
--data-dir=/var/lib/etcd-restored

# 3. Update etcd manifest to use the new data directory
# Edit /tmp/etcd.yaml:
# Change hostPath for etcd-data volume from /var/lib/etcd to /var/lib/etcd-restored

# 4. Move manifests back
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# 5. Wait for components to restart
kubectl get pods -n kube-system

Alternative restore method (simpler but same idea):

# 1. Restore snapshot to new directory
ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
--data-dir=/var/lib/etcd-restored

# 2. Edit etcd static pod to point to new directory
vi /etc/kubernetes/manifests/etcd.yaml
# Change:
# volumes:
# - hostPath:
# path: /var/lib/etcd ← change this
# type: DirectoryOrCreate
# To:
# path: /var/lib/etcd-restored ← to this

# kubelet detects the change and restarts etcd automatically

Exam Tip for etcd

An etcd backup/restore task is a CKA staple and typically carries significant weight. Memorize:

  • ETCDCTL_API=3 (always version 3)
  • Three cert flags: --cacert, --cert, --key
  • --endpoints=https://127.0.0.1:2379
  • snapshot save for backup, snapshot restore --data-dir=<NEW> for restore
  • After restore: update etcd manifest to point to the new data directory
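Since the three cert flags are identical for every etcdctl call, a common trick is to stash them in a shell variable once per session (paths assume the default kubeadm layout shown earlier):

```shell
# Set once per shell session; none of the flag values contain spaces,
# so the unquoted expansion in the comment below is safe
export ETCDCTL_API=3
ETCD_FLAGS="--endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# Then each command is just:
# etcdctl snapshot save /opt/etcd-backup.db $ETCD_FLAGS
```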

Operating System Upgrades

When a node goes down:

  • If it comes back within ~5 minutes: Pods are still on the node; the kubelet restarts their containers
  • If it's down longer than ~5 minutes: Pods are evicted and recreated elsewhere. The window comes from the default tolerationSeconds: 300 that the API server adds for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints (historically the kube-controller-manager flag --pod-eviction-timeout)
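You can see this machinery on any pod: the API server injects tolerations for the not-ready and unreachable taints with a 300-second limit (the pod name here is a placeholder):

```shell
# Print each toleration's key and tolerationSeconds
kubectl get pod webapp-abc123 -o \
  jsonpath='{range .spec.tolerations[*]}{.key}={.tolerationSeconds}{"\n"}{end}'
# Look for node.kubernetes.io/not-ready=300 and
# node.kubernetes.io/unreachable=300
```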

For planned OS maintenance:

# 1. Drain the node
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# 2. Do the OS upgrade (kernel update, patches, reboot, etc.)
ssh worker-1
apt-get update && apt-get upgrade -y
reboot

# 3. Wait for node to come back
# 4. Uncordon
kubectl uncordon worker-1

PodDisruptionBudgets (PDB)

PDBs prevent drain/eviction from removing too many pods at once. They ensure minimum availability during maintenance.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 2          # At least 2 pods must stay running
  # OR
  # maxUnavailable: 1      # At most 1 pod can be down
  selector:
    matchLabels:
      app: webapp

# Create PDB imperatively
kubectl create pdb webapp-pdb --selector=app=webapp --min-available=2

# Check PDBs
kubectl get pdb
kubectl describe pdb webapp-pdb

How PDBs affect drain:

  • kubectl drain respects PDBs — it won't evict pods if doing so would violate the PDB
  • If a PDB blocks drain, the command keeps retrying and printing eviction errors until the budget allows it; you may need to wait for replacement pods to become Ready or scale up replicas
  • --force does NOT override PDBs (it only affects unmanaged pods)
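The arithmetic drain applies is simple. A sketch with a 3-replica Deployment and the minAvailable: 2 PDB from above:

```shell
# How many pods may be disrupted at once under a minAvailable PDB
replicas=3
min_available=2
evictable=$((replicas - min_available))
echo "drain may evict at most $evictable pod(s) at a time"
# With replicas=2 and min_available=2, evictable is 0: drain blocks entirely
```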

Cluster Component Health

Checking Control Plane Health

# Component status (deprecated but still works)
kubectl get componentstatuses
# or
kubectl get cs

# Check system pods
kubectl get pods -n kube-system

# Check node status
kubectl get nodes
kubectl describe node controlplane

# API server health
kubectl get --raw /healthz
kubectl get --raw /livez
kubectl get --raw /readyz
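The same endpoints accept a verbose query parameter that lists each individual check, which is handy when one component is failing:

```shell
# Per-check breakdown of API server health
kubectl get --raw '/readyz?verbose'
kubectl get --raw '/livez?verbose'
```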

Checking kubelet

# On the node:
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50

Checking Control Plane Logs

# Static pod logs
kubectl logs kube-apiserver-controlplane -n kube-system
kubectl logs kube-scheduler-controlplane -n kube-system
kubectl logs kube-controller-manager-controlplane -n kube-system
kubectl logs etcd-controlplane -n kube-system

# Or directly on the node
crictl logs <container-id>

Key Takeaways

  1. drain → maintain → uncordon is the standard node maintenance flow
  2. kubectl drain needs --ignore-daemonsets (almost always) and respects PDBs
  3. Cluster upgrades: one minor version at a time, control plane first, then workers
  4. Upgrade order on each node: kubeadm → kubeadm upgrade → kubelet + kubectl → restart kubelet
  5. etcd backup: ETCDCTL_API=3 etcdctl snapshot save with three cert flags
  6. etcd restore: snapshot restore --data-dir=<NEW> then update etcd manifest
  7. PDBs protect availability during drain — know minAvailable vs maxUnavailable
  8. After restore/upgrade, always verify with kubectl get nodes and kubectl get pods -n kube-system