Troubleshooting Kubernetes Health Probes

This guide provides solutions for common issues encountered with Kubernetes container health probes.

Diagnostic Commands

Always start your troubleshooting with these commands:

# Get basic pod status
kubectl get pod <pod-name>

# Get detailed pod information, including probe configuration and events
kubectl describe pod <pod-name>

# View pod logs
kubectl logs <pod-name>

# View pod events
kubectl get events --field-selector involvedObject.name=<pod-name>
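
When a pod has many events, it can help to reduce the output of the last command to probe failures only. The following is a minimal sketch rather than a standard tool: `summarize_probe_failures` is a hypothetical helper name, and it assumes the default table output, in which failed probes appear with messages such as "Liveness probe failed: ...".

```shell
# Read `kubectl get events` output on stdin and count the event lines
# that report a failed probe, grouped by probe type.
# Note: kubectl collapses repeated events into one line, so these are
# counts of event lines, not of individual probe attempts.
summarize_probe_failures() {
  awk '
    /Liveness probe failed/  { live++ }
    /Readiness probe failed/ { ready++ }
    /Startup probe failed/   { start++ }
    END { printf "liveness=%d readiness=%d startup=%d\n", live, ready, start }
  '
}
```

A possible invocation is `kubectl get events --field-selector involvedObject.name=<pod-name> | summarize_probe_failures`.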

Common Issues and Solutions

1. Container Restarts in a Loop

Symptoms:

Pod shows high restart count
Pod status cycles between Running and CrashLoopBackOff

Possible Causes:

Liveness probe is too strict or improperly configured
Application temporarily fails during startup but probe starts checking too early
Resource constraints causing application timeouts during probe checks

Solutions:

Check probe configuration:

kubectl describe pod <pod-name> | grep -A 15 "Liveness:"

Increase liveness probe tolerance:

livenessProbe:
  # Raise these above the defaults if the probe is too strict
  initialDelaySeconds: 30    # Default 0; give the app more time to start
  periodSeconds: 10          # Default 10; raise to check less frequently
  timeoutSeconds: 5          # Default 1; allow more time for a response
  failureThreshold: 3        # Default 3; raise to allow more failures before a restart

Examine container logs for errors during probe checks:
```
kubectl logs <pod-name> --previous
```

Check if application has sufficient resources:

kubectl describe pod <pod-name> | grep -A 10 "Limits:"

Implement a startup probe to give the application time to initialize before liveness checks begin:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

2. Pod Running but Not Ready

Symptoms:

Pod shows Running status but READY column shows 0/1
Service does not route traffic to the pod

Possible Causes:

Readiness probe is failing
Application is not properly serving the readiness endpoint
External dependencies required by readiness check are unavailable

Solutions:

Check readiness probe status:

kubectl describe pod <pod-name> | grep -A 15 "Readiness:"

Test the readiness endpoint manually:

# Get pod IP
POD_IP=$(kubectl get pod <pod-name> -o jsonpath='{.status.podIP}')

# For HTTP probes - create a temporary test pod
kubectl run test --rm -it --restart=Never --image=curlimages/curl -- curl -v "http://$POD_IP:8080/ready"

# For TCP probes - create a temporary test pod
kubectl run test --rm -it --restart=Never --image=busybox -- nc -zv "$POD_IP" 3306

Check application logs for readiness issues:

kubectl logs <pod-name> | grep -i ready

Verify external dependencies are available (databases, APIs, other services)

Modify readiness probe parameters to be more tolerant:

readinessProbe:
  periodSeconds: 10          # Default 10; raise to check less frequently
  timeoutSeconds: 5          # Default 1; allow more time for a response
  failureThreshold: 3        # Default 3; raise to allow more failures before marking not ready

3. Probes Passing Locally but Failing in Kubernetes

Symptoms:

Application works when tested directly but fails when accessed via Kubernetes probes
Probe failure messages in kubectl describe pod output

Possible Causes:

Network path differences between local testing and Kubernetes
Different port or path configurations
Kubernetes probe timeout too short for application response time

Solutions:

Verify probe endpoint configuration:

kubectl describe pod <pod-name> | grep -A 15 "Liveness\|Readiness\|Startup"

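Test the exact probe path from inside the container (this requires curl in the image; the kubelet probes the pod IP, so a success on localhost alone does not prove the probe can reach the app):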

kubectl exec <pod-name> -- curl -v http://localhost:<port>/<path>

Check for network policy restrictions that might block probe requests

Increase probe timeout:

livenessProbe:
  timeoutSeconds: 5  # Increase from default 1s to 5s

Check for port binding issues - ensure app is listening on 0.0.0.0, not just 127.0.0.1

4. Intermittent Probe Failures

Symptoms:

Pod occasionally shows not ready then becomes ready again
Logs show sporadic restarts due to probe failures
Service endpoints fluctuate

Possible Causes:

Application occasionally takes too long to respond
Resource contention affecting response times
Garbage collection or maintenance routines disrupting probes
Network instability

Solutions:

Check if failures correlate with high load periods:
```
kubectl top pod <pod-name>
```

Make probe parameters more tolerant:

livenessProbe:
  # Increase these values
  periodSeconds: 15          # Check less frequently
  timeoutSeconds: 10         # Allow more time for response
  failureThreshold: 5        # Require more consecutive failures
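
To judge how tolerant a set of values is, it helps to estimate how long the application can be unresponsive before the kubelet restarts it. The sketch below is a rough approximation that ignores kubelet scheduling jitter; `restart_window` is a hypothetical helper name. The lower bound assumes the application fails just before a probe and each probe fails immediately; the upper bound assumes it fails just after a successful probe and the final probe runs to its timeout.

```shell
# Print the approximate shortest and longest unresponsive period (seconds)
# that ends in a restart.
# Arguments: periodSeconds timeoutSeconds failureThreshold
restart_window() {
  period="$1"; timeout="$2"; failures="$3"
  min=$(( (failures - 1) * period ))
  max=$(( failures * period + timeout ))
  echo "$min $max"
}

restart_window 15 10 5   # values from the snippet above; prints: 60 85
```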

Look for patterns in failures (time of day, load patterns, etc.)
Check node resources where the pod is running:
```
kubectl describe node <node-name>
```

If the failures cluster around container starts, add a startup probe with generous thresholds:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30       # 30 * 10s = 5 minutes to start
  periodSeconds: 10

5. HTTP Probe Returns Unexpected Status Codes

Symptoms:

HTTP probe failures in pod events
Pod health fluctuates despite application appearing to work

Possible Causes:

Application returns a status code outside the 200-399 range
Health endpoint returning unexpected responses
Redirects causing probe failures (3xx status codes not handled properly)

Solutions:

Test endpoint manually:

kubectl exec <pod-name> -- curl -v http://localhost:<port>/<path>

Check endpoint response code:

kubectl exec <pod-name> -- curl -I http://localhost:<port>/<path>

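Kubernetes does not let you choose which status codes count as success. If the endpoint needs particular request headers to respond correctly, set them on the probe: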

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: Accept
      value: application/json

Ensure health endpoint returns appropriate status code (200-399 for success)
Check if the endpoint is redirecting (301/302/307): the kubelet follows redirects on the same host, so the final target must also return a success code
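
The success range can be expressed as a small check. This is a minimal sketch; `probe_status_ok` is a hypothetical helper name, and it only mirrors the 200-399 rule stated above, applied to the status of the final response.

```shell
# Return 0 when an HTTP status code falls in the probe success range (200-399).
probe_status_ok() {
  code="$1"
  [ "$code" -ge 200 ] && [ "$code" -lt 400 ]
}

# Classify a few sample codes.
for code in 200 204 404 503; do
  if probe_status_ok "$code"; then
    echo "$code: success"
  else
    echo "$code: failure"
  fi
done
```

To feed it a real response, the status code alone can be obtained with `kubectl exec <pod-name> -- curl -s -o /dev/null -w '%{http_code}' http://localhost:<port>/<path>`, assuming curl is available in the container.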

6. Slow Application Startup Causing Premature Failures

Symptoms:

Pod repeatedly restarts before becoming ready
Events show liveness or readiness probe failure during initialization

Possible Causes:

Application initialization time exceeds probe’s initialDelaySeconds
Startup probe missing, or its failureThreshold × periodSeconds budget is too small
Heavy initialization workloads (DB migrations, cache warming)

Solutions:

Add a startup probe (alpha in Kubernetes 1.16, enabled by default since 1.18):

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

Increase initialDelaySeconds for both probes:

livenessProbe:
  initialDelaySeconds: 120  # Increase to 2 minutes
readinessProbe:
  initialDelaySeconds: 30   # Increase to 30 seconds

For very slow-starting apps, size the startup probe so that its budget covers the worst-case startup time:

Maximum startup time = failureThreshold × periodSeconds
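
Turning the formula around gives the smallest failureThreshold for a known startup time. A minimal sketch of that arithmetic, using a hypothetical helper name `min_failure_threshold` and rounding up so the budget never falls short:

```shell
# Smallest failureThreshold whose budget (failureThreshold * periodSeconds)
# covers the given worst-case startup time.
# Arguments: worst-case startup time in seconds, periodSeconds
min_failure_threshold() {
  startup_seconds="$1"; period="$2"
  # Integer ceiling division
  echo $(( (startup_seconds + period - 1) / period ))
}

min_failure_threshold 300 10   # prints 30
min_failure_threshold 125 10   # prints 13
```

The first call reproduces the startup probe shown earlier: 30 failures at a 10-second period allow 5 minutes of startup.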

Consider optimizing application startup time if practical

7. Issues with Exec Probes

Symptoms:

Exec probe failures despite application running
Permission denied or command not found errors

Possible Causes:

Script/command not executable
Path issues within container
Permission problems
Command timeout

Solutions:

Verify command exists and is executable:

kubectl exec <pod-name> -- ls -la /path/to/script

Test command execution manually:

kubectl exec <pod-name> -- /path/to/command

Fix permissions as a temporary test (the change is lost when the container restarts, so make the permanent fix in the image):

kubectl exec <pod-name> -- chmod +x /path/to/script

Use shell for complex commands:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "test -f /tmp/healthy || exit 1"

Increase timeoutSeconds if command takes too long

8. TCP Socket Probe Failures

Symptoms:

TCP socket connection failures
Service appears to be running but probe fails

Possible Causes:

Application not listening on specified port
Firewall or network policy restrictions
Socket backlog full

Solutions:

Verify service is listening on the correct port:
```
kubectl exec <pod-name> -- netstat -tlnp
```
Check for port binding restrictions (make sure app binds to 0.0.0.0, not just 127.0.0.1)
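
The binding check can be automated by reading the netstat output. This is a sketch under assumptions: `bind_scope` is a hypothetical helper name, and it expects the Linux net-tools or busybox column layout, where field 4 is the local address (for example 127.0.0.1:8080, 0.0.0.0:8080, or :::8080) and field 6 is the state.

```shell
# Read `netstat -tln` output on stdin and report how the given port is bound:
#   non-loopback   at least one listener is on a non-loopback address
#   loopback-only  the port listens only on 127.0.0.1 or ::1
#   not-listening  no listener was found for the port
bind_scope() {
  awk -v port="$1" '
    $6 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next
      # Everything before the final ":<port>" is the bind address
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      if (addr == "127.0.0.1" || addr == "::1") loopback = 1
      else reachable = 1
    }
    END {
      if (reachable) print "non-loopback"
      else if (loopback) print "loopback-only"
      else print "not-listening"
    }
  '
}
```

A possible invocation is `kubectl exec <pod-name> -- netstat -tln | bind_scope 8080`. A result of `loopback-only` would explain a probe that fails while a localhost test succeeds.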

Test TCP connection manually:

kubectl exec <pod-name> -- nc -zv localhost <port>

Check for any NetworkPolicy restrictions
Examine application logs for socket/binding errors

Advanced Troubleshooting

Debugging with Ephemeral Containers (Kubernetes 1.18+)

For complex cases, ephemeral debug containers can be useful:

# Start debug container (ephemeral containers are enabled by default since Kubernetes 1.23; older clusters need the EphemeralContainers feature gate)
kubectl debug -it <pod-name> --image=busybox --target=<container-name>

# From the debug container, you can test network connections, check processes, etc.
wget -O- http://localhost:8080/healthz
netstat -tlnp
ps aux

Analyzing Probe Traffic with tcpdump

For network-related issues:

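# Attach a debug container that shares the target pod's network namespace
# (a separate pod only sees its own traffic, not the probes sent to <pod-name>)
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name> -- bash

# netshoot is Alpine-based and already includes tcpdump, so no install step is needed

# Capture probe traffic
tcpdump -i eth0 port <probe-port> -vvv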

Detailed Probe Timing Analysis

For performance-related problems:

# Show the Ready condition, including when it last changed
kubectl get pod <pod-name> -o json | jq '.status.conditions[] | select(.type=="Ready")'

# Check kubelet logs for probe details (on the node)
journalctl -u kubelet | grep <pod-name> | grep -i probe

Preventative Measures

Start with application-appropriate probe settings:
- Use startup probes for slow-starting applications
- Set initialDelaySeconds based on realistic startup time
- Configure periodSeconds and timeoutSeconds based on expected response times
Implement dedicated lightweight health endpoints that:
- Check minimal dependencies
- Respond quickly
- Have minimal resource requirements
- Return appropriate status codes
Test probe behavior under load before deploying to production
Document normal application behavior to make troubleshooting easier

CKAD Exam Tips

For the CKAD exam, remember these troubleshooting tips:

Always check pod status with kubectl get pods first
Use kubectl describe pod <pod-name> to see probe configuration and recent events
Check logs with kubectl logs <pod-name>
Know how to adjust probe parameters for common issues:
- Increase initialDelaySeconds for slow-starting applications
- Adjust periodSeconds and timeoutSeconds for slow-responding apps
- Use failureThreshold to control tolerance for intermittent failures
Know when to use each probe type and mechanism
Be able to quickly diagnose if a probe failure is due to:
- Misconfiguration
- Application issues
- Resource constraints
- Network problems

Remember that solving probe issues often requires a systematic approach of: check configuration → test manually → adjust parameters → verify solution.