Scheduling

Exam relevance: CKA ✅ (Workloads & Scheduling — 15%) | CKAD ✅ (Application Environment — 25%)


How the Scheduler Works (Recap)

When a Pod has no nodeName, the scheduler:

  1. Filters — eliminates nodes that can't run the pod
  2. Scores — ranks remaining nodes by preference
  3. Binds — assigns the pod to the best node (sets nodeName)

You can influence every step of this process.
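You can also observe the result of the bind step directly. These commands assume a running cluster; the pod name my-app is a placeholder:

# The NODE column shows where each pod was bound
kubectl get pods -o wide

# The bound node is recorded in spec.nodeName
kubectl get pod my-app -o jsonpath='{.spec.nodeName}'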


nodeSelector — Simple Node Selection

nodeSelector is the simplest way to constrain a pod to specific nodes: it matches nodes by their labels.

# Add label to a node
kubectl label node worker-1 disktype=ssd

# Verify
kubectl get nodes --show-labels
kubectl get nodes -l disktype=ssd

apiVersion: v1
kind: Pod
metadata:
  name: fast-app
spec:
  nodeSelector:
    disktype: ssd    # Only schedule on nodes with this label
  containers:
  - name: app
    image: nginx

If no node matches, the pod stays Pending.
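To see why, check the pod's events. The exact wording varies by cluster version; the message below is typical, not exact:

kubectl describe pod fast-app
# Events will show something like:
#   Warning  FailedScheduling  ...  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.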


Node Affinity — Advanced Node Selection

Node affinity is a more expressive version of nodeSelector. It supports:

  • Required rules (must match — hard constraint)
  • Preferred rules (try to match — soft constraint)
  • Set-based operators (In, NotIn, Exists, DoesNotExist, Gt, Lt)

requiredDuringSchedulingIgnoredDuringExecution

Pod MUST be scheduled on a matching node. If no node matches, pod stays Pending.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
            - "yes"
  containers:
  - name: app
    image: tensorflow/tensorflow:latest-gpu

preferredDuringSchedulingIgnoredDuringExecution

The scheduler TRIES to place the pod on a matching node, but will place it elsewhere if nothing matches.

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80    # Higher weight = stronger preference (1-100)
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1a
      - weight: 20
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-east-1b

Operators

Operator       Meaning
In             Label value is in the list
NotIn          Label value is NOT in the list
Exists         Label key exists (any value)
DoesNotExist   Label key does NOT exist
Gt             Label value is greater than (numeric)
Lt             Label value is less than (numeric)
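As a sketch of the numeric operators: Gt and Lt compare a label's value as an integer, and take a single value. The cpu-count label here is hypothetical; you would add it to nodes yourself:

# Assumes a custom node label, e.g.: kubectl label node worker-1 cpu-count=16
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cpu-count
            operator: Gt      # numeric comparison; the value must be an integer string
            values:
            - "8"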

"IgnoredDuringExecution" — What Does It Mean?

If a node's labels change AFTER a pod is already running, the pod is NOT evicted. The rule only applies at scheduling time. (There is a planned RequiredDuringExecution that would evict — not yet stable.)


Taints and Tolerations

Taints are applied to nodes. They repel pods unless the pod has a matching toleration.

Think of it as: nodes say "stay away unless you can tolerate me."

Applying Taints

# Taint a node
kubectl taint nodes worker-1 key=value:NoSchedule

# Examples
kubectl taint nodes worker-1 env=production:NoSchedule
kubectl taint nodes worker-1 gpu=true:NoExecute

# Remove a taint (add - at the end)
kubectl taint nodes worker-1 env=production:NoSchedule-

# View taints on a node
kubectl describe node worker-1 | grep Taints

Taint Effects

Effect             Behavior
NoSchedule         New pods without a toleration won't be scheduled here. Existing pods stay.
PreferNoSchedule   Scheduler avoids this node but will use it as a last resort.
NoExecute          New pods won't schedule AND existing pods without a toleration are evicted.

Adding Tolerations to Pods

apiVersion: v1
kind: Pod
metadata:
  name: production-app
spec:
  tolerations:
  - key: "env"
    operator: "Equal"      # key=value must match exactly
    value: "production"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx

Toleration Operators

Operator   Meaning
Equal      Key and value must match exactly
Exists     Only the key needs to exist (value is ignored)

Special Tolerations

# Tolerate ALL taints with a specific key (any value, any effect)
tolerations:
- key: "env"
  operator: "Exists"

# Tolerate EVERYTHING (run anywhere; used by DaemonSets)
tolerations:
- operator: "Exists"

Built-in Taints

Kubernetes automatically adds these taints:

Taint                                    When
node.kubernetes.io/not-ready             Node is not ready
node.kubernetes.io/unreachable           Node is unreachable
node.kubernetes.io/memory-pressure       Node is low on memory
node.kubernetes.io/disk-pressure         Node is low on disk
node.kubernetes.io/pid-pressure          Node has too many processes
node.kubernetes.io/unschedulable         Node is cordoned
node-role.kubernetes.io/control-plane    Control plane node (effect NoSchedule)

Taints + Tolerations vs Node Affinity

They solve different problems:

Mechanism            Who decides?                  What it does
Taints/Tolerations   Node says "keep out"          Repels pods FROM a node
Node Affinity        Pod says "I want that node"   Attracts pods TO a node

To guarantee a pod runs ONLY on specific nodes, use both:

  1. Taint the nodes → keeps other pods away
  2. Add toleration + nodeAffinity on your pod → ensures it goes there AND can tolerate the taint
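A minimal sketch of this pattern; the team=ml taint and label are illustrative names, not anything built in:

# Setup (illustrative):
#   kubectl taint nodes worker-1 team=ml:NoSchedule
#   kubectl label node worker-1 team=ml
apiVersion: v1
kind: Pod
metadata:
  name: ml-job
spec:
  tolerations:
  - key: "team"
    operator: "Equal"
    value: "ml"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: team
            operator: In
            values:
            - ml
  containers:
  - name: job
    image: nginx

Note that the toleration alone would still let the pod land on untainted nodes; the affinity rule is what pins it to the dedicated ones.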

Pod Affinity and Anti-Affinity

Pod affinity/anti-affinity schedules pods based on which other pods are already running on a node.

Pod Affinity — "Schedule near this pod"

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache
        topologyKey: kubernetes.io/hostname    # Same node

This says: "Schedule this pod on a node that already has a pod with label app=cache."

Pod Anti-Affinity — "Schedule away from this pod"

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - webapp
        topologyKey: kubernetes.io/hostname    # Different node

This says: "Don't schedule this pod on any node that already has a pod with label app=webapp." Common for spreading replicas across nodes for high availability.

topologyKey

Defines the "zone" for affinity rules:

topologyKey                     Meaning
kubernetes.io/hostname          Same/different node
topology.kubernetes.io/zone     Same/different availability zone
topology.kubernetes.io/region   Same/different region

Practical Example: Spread Web Pods Across Nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: webapp
              topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: nginx

Topology Spread Constraints

More fine-grained control over how pods are spread across topology domains.

spec:
  topologySpreadConstraints:
  - maxSkew: 1                          # Max difference in pod count between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # or ScheduleAnyway
    labelSelector:
      matchLabels:
        app: webapp

  • maxSkew: 1 means the difference in pod count between any two zones can't exceed 1
  • Ensures even distribution across zones

Resource Requests and Limits

Resources affect scheduling — the scheduler uses requests to decide which node has capacity.

How They Work

Resource   Requests (scheduler uses these to find a node)   Limits (enforced at runtime)
CPU        Guaranteed minimum                               Maximum; pod is throttled if exceeded
Memory     Guaranteed minimum                               Maximum; pod is OOMKilled if exceeded

containers:
- name: app
  image: nginx
  resources:
    requests:
      cpu: "250m"       # 250 millicores = 0.25 CPU
      memory: "128Mi"   # 128 mebibytes
    limits:
      cpu: "500m"       # Throttled above this
      memory: "256Mi"   # OOMKilled above this

CPU Units

Value   Meaning
1       1 full CPU core
500m    Half a CPU core
100m    1/10th of a CPU core
0.1     Same as 100m

Memory Units

Value   Meaning
128Mi   128 mebibytes (1 Mi = 1,048,576 bytes)
1Gi     1 gibibyte
128M    128 megabytes (1 M = 1,000,000 bytes); use Mi instead

QoS Classes

Kubernetes assigns a QoS class based on how you set requests and limits:

Class        Condition                                                     Eviction Priority
Guaranteed   All containers have requests = limits (both CPU and memory)   Last to be evicted
Burstable    At least one container has a request or limit set             Middle
BestEffort   No requests or limits set on any container                    First to be evicted

When a node runs out of memory, Kubernetes evicts BestEffort pods first, then Burstable, then Guaranteed.
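For a critical workload, a Guaranteed pod looks like this (requests equal to limits for every container; the names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-app
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"       # equal to the request
        memory: "256Mi"   # equal to the request

You can confirm the assigned class with kubectl get pod guaranteed-app -o jsonpath='{.status.qosClass}'.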


LimitRange — Default Resource Boundaries

A LimitRange sets default and max/min resource values for a namespace. Pods that don't specify resources get the defaults.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: dev
spec:
  limits:
  - type: Container
    default:            # Default limits (applied if not specified)
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:     # Default requests (applied if not specified)
      cpu: "100m"
      memory: "128Mi"
    max:                # Maximum allowed
      cpu: "2"
      memory: "1Gi"
    min:                # Minimum allowed
      cpu: "50m"
      memory: "64Mi"
  - type: Pod
    max:
      cpu: "4"
      memory: "2Gi"

ResourceQuota — Namespace Resource Caps

A ResourceQuota limits the total resources consumed across all pods in a namespace.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev
spec:
  hard:
    requests.cpu: "10"        # Total CPU requests across all pods
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"                # Max number of pods
    services: "10"
    persistentvolumeclaims: "20"
    configmaps: "50"
    secrets: "50"
    services.nodeports: "5"

When a ResourceQuota covers compute resources, every pod in the namespace MUST specify the corresponding requests/limits; otherwise creation is rejected. Use a LimitRange to auto-apply defaults.

# Check quota usage
kubectl describe resourcequota dev-quota -n dev

Manual Scheduling (nodeName)

Bypass the scheduler entirely by setting nodeName:

spec:
  nodeName: worker-2    # Pod goes directly to this node
  containers:
  - name: app
    image: nginx

Warning: If the node doesn't exist or has no capacity, the pod fails. No filtering or scoring happens.


Priority and Preemption

Higher-priority pods can preempt (evict) lower-priority pods when nodes are full.

# Create a PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                            # Higher = more priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority    # or Never
description: "Critical workloads"

# Use it in a pod
spec:
  priorityClassName: high-priority
  containers:
  - name: critical-app
    image: nginx

Key Takeaways

  1. nodeSelector is simple: label nodes, match in pod spec
  2. Node affinity is powerful: required vs preferred, set-based operators
  3. Taints repel, tolerations allow — they work together
  4. Use taints + affinity together to dedicate nodes to specific workloads
  5. Pod anti-affinity spreads replicas for HA — common exam pattern
  6. Requests affect scheduling; limits affect runtime enforcement
  7. CPU is throttled when limit exceeded; memory causes OOMKill
  8. LimitRange sets defaults; ResourceQuota sets namespace totals
  9. Guaranteed QoS (requests=limits) is the safest for critical workloads