Once your application is live, the next challenge is handling traffic. Kubernetes makes scaling easy, but there’s a logic to how it works.
1. Manual Scaling (The Quick Fix)
If you know a marketing campaign is starting or a heavy task is coming up, you can scale manually via the CLI:
kubectl scale deployment web-app-deployment --replicas=10
- Use Case: Immediate control when you anticipate a load spike.
- Limitation: It requires a human to monitor the traffic and run the command.
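The same replica count can also be set declaratively, so it is kept in version control instead of living only in someone's shell history. The manifest below is a minimal sketch, not the article's actual Deployment: the `app: web-app` label is an assumed name, while the container name and image are borrowed from the probe examples later in this post.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-deployment
spec:
  replicas: 10              # Same effect as: kubectl scale ... --replicas=10
  selector:
    matchLabels:
      app: web-app          # Assumed label; must match the pod template below
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: nginx
```

Applying it with kubectl apply -f and the file name brings the Deployment to 10 replicas, just like the CLI command.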
2. Horizontal Pod Autoscaler (HPA)
The HPA is the “Autopilot” mode. It scales the number of pods up or down based on actual resource usage.
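Requirement: For HPA to work, you must have the Metrics Server installed in your cluster and define CPU requests (resources.requests.cpu) for the containers in your Deployment YAML, because the target percentage is calculated against the requested amount.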
kubectl autoscale deployment web-app-deployment --cpu-percent=50 --min=2 --max=10
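- The Logic: By default, the HPA controller checks the metrics every 15 seconds. If the average CPU utilization across all pods, measured as a percentage of each pod's CPU request, exceeds 50%, it adds pods (up to 10). If the load drops, it removes pods (down to 2) to save resources and costs. Scale-down is deliberately slower: by default, the HPA waits through a 5-minute stabilization window before removing pods, so short dips in traffic do not cause flapping.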
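For reference, the autoscale command above corresponds roughly to the manifest sketched below, using the stable autoscaling/v2 API. The HPA name `web-app-hpa` and the request values (250m CPU, 256Mi memory) are assumptions chosen for illustration. The second document is only a fragment of the Deployment's pod template, showing where the CPU request goes.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa               # Assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50  # Percent of the CPU request, averaged over pods
---
# Fragment of the Deployment's pod template (not a complete manifest):
# utilization cannot be calculated without a CPU request.
spec:
  containers:
    - name: web-app
      image: nginx
      resources:
        requests:
          cpu: 250m               # Assumed value
          memory: 256Mi           # Assumed value
```

The controller picks the new replica count as ceil(current replicas × current utilization / target utilization). For example, 4 pods averaging 80% CPU against a 50% target give ceil(4 × 80 / 50) = ceil(6.4) = 7 pods.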
3. Vertical Pod Autoscaler (VPA) – Making Pods “Stronger”
Sometimes, adding more pods doesn’t help, for example with legacy apps or databases that don’t scale horizontally. This is where the VPA comes in. Instead of changing the number of pods, it adjusts the CPU and memory requests of the existing ones.
Unlike HPA, VPA is not part of core Kubernetes: its components have to be installed in the cluster separately, and it is usually defined via a YAML manifest:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: web-app-deployment
  updatePolicy:
    updateMode: "Auto"
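- Update Modes:
  - Off: VPA only provides “Recommendations” for CPU/RAM settings but changes nothing.
  - Initial: Assigns resources only when the pod is first created.
  - Auto: When the recommendation differs enough from the current settings, VPA evicts the pod so that it is recreated with updated resource requests (higher or lower). Expect a short restart of the affected pod.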
Do not use HPA and VPA on the same metric (like CPU) simultaneously, as they will fight each other for control.
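One way to keep the two autoscalers apart is to let HPA scale on CPU while restricting VPA to memory. The sketch below assumes the container is named `web-app` (as in the probe examples in this post); the 128Mi and 1Gi bounds are placeholder values, not recommendations.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: web-app-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: web-app
        controlledResources: ["memory"]  # Leave CPU to the HPA
        minAllowed:
          memory: 128Mi                  # Assumed lower bound
        maxAllowed:
          memory: 1Gi                    # Assumed upper bound
```

The minAllowed and maxAllowed bounds keep VPA from recommending values that would starve the app or no longer fit on a node.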
4. The Role of Health Checks
Scaling is useless if the new pods aren’t actually working. Always use Readiness Probes in your YAML:
- Readiness Probe: Tells Kubernetes when a pod is ready to receive traffic.
- Liveness Probe: Tells Kubernetes if a pod is still alive or if it needs to be restarted.
4.1 Readiness Probes: “Are you actually ready?”
When you scale up (manually or automatically), Kubernetes creates new pods. However, just because a container has started doesn’t mean the app is ready to handle traffic. It might still be loading a cache or connecting to a database.
Without a Readiness Probe, Kubernetes treats the pod as ready as soon as its containers have started and sends traffic to it immediately, leading to 502 or 503 errors for your users.
Implementation Example:
spec:
  containers:
    - name: web-app
      image: nginx
      readinessProbe:
        httpGet:
          path: /healthz          # A specific endpoint in your app
          port: 80
        initialDelaySeconds: 5    # Wait 5s before first check
        periodSeconds: 10         # Check every 10s
        failureThreshold: 3       # Remove from traffic after 3 failures
4.2 Liveness Probes: “Are you still alive?”
While the Readiness Probe determines if a pod can start receiving traffic, the Liveness Probe monitors the pod for its entire lifecycle.
The Problem: Sometimes an application fails in a way that the container stays “Running,” but the app inside is frozen, deadlocked, or broken. To Kubernetes, the pod looks fine, but it’s actually a “zombie” that can’t do anything.
The Solution: The Liveness Probe tells Kubernetes: “Check if the heart is still beating. If not, kill the container and start a fresh one.”
Implementation Example:
spec:
  containers:
    - name: web-app
      image: nginx
      livenessProbe:
        httpGet:
          path: /healthz
          port: 80
        initialDelaySeconds: 15   # Give the app plenty of time to boot
        periodSeconds: 20         # Check every 20s
        failureThreshold: 3       # Restart the container after 3 failed attempts
- What happens on failure? Unlike a failing Readiness Probe (which only stops traffic to the pod), a failing Liveness Probe triggers a container restart.
- The “Death Loop” Warning: Be careful! If you set initialDelaySeconds too low, Kubernetes might kill your app before it has finished starting up. The container is then restarted over and over and ends up in “CrashLoopBackOff,” where the app is never allowed to fully boot.
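For apps with long or unpredictable boot times, one common way to avoid this loop is a Startup Probe: Kubernetes holds back the liveness and readiness checks until the startup probe has succeeded once. The sketch below reuses the /healthz endpoint from the examples above; the 5-second period and threshold of 24 are assumed values that give the app up to about two minutes to start.

```yaml
spec:
  containers:
    - name: web-app
      image: nginx
      startupProbe:
        httpGet:
          path: /healthz
          port: 80
        periodSeconds: 5        # Check every 5s during startup
        failureThreshold: 24    # Allow up to 24 x 5s = 120s to boot
      livenessProbe:
        httpGet:
          path: /healthz
          port: 80
        periodSeconds: 20       # Only starts once the startup probe has passed
        failureThreshold: 3
```

With this in place, the liveness probe no longer needs a long initialDelaySeconds, so a frozen app is still detected quickly once it is up.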
Summary: Successful scaling depends on precision. Use HPA for web apps, VPA for resource-heavy single instances, and always implement Readiness Probes to ensure your users never see a “Service Unavailable” screen during a scale-up.