Kubernetes High Availability Reference Guide
Introduction
High availability (HA) is a crucial aspect of production-grade Kubernetes clusters, ensuring that your platform remains operational despite failures of individual components. This guide provides comprehensive information on implementing, testing, and maintaining HA in your Kubernetes environments.
Table of Contents
- Core HA Concepts
- Control Plane HA
- etcd Configuration
- Storage Considerations
- Network Redundancy
- Application HA Strategies
- Monitoring and Alerting
- Performance Testing at Scale
- Disaster Recovery
- Common Failure Scenarios
- References and Tools
Core HA Concepts
Redundancy
Redundancy is the foundation of high availability in any distributed system. For Kubernetes, this means:
- Multiple copies of critical control plane components
- Replicated data storage (etcd)
- Multiple worker nodes for application workloads
- Redundant network paths
Leader Election
Control plane components like the scheduler and controller manager use leader election to ensure that only one instance is active at a time. This prevents conflicts while providing redundancy:
leaderElection:
  leaderElect: true
  resourceName: component-name
  resourceNamespace: kube-system
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s
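For instance, this stanza can be embedded in a component configuration file such as a KubeSchedulerConfiguration, loaded by the scheduler via its --config flag. The sketch below assumes a kubeadm-style kubeconfig path, which may differ in your cluster:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf  # kubeadm default; adjust to your layout
leaderElection:
  leaderElect: true
  resourceName: kube-scheduler
  resourceNamespace: kube-system
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s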
Idempotency
Operations should be designed to be idempotent - meaning they can be safely repeated without changing the result beyond the initial application. This is especially important in environments where network failures may cause retries.
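Kubernetes' declarative model makes this concrete: applying the same manifest repeatedly with kubectl apply converges to the same state, so a retried apply is harmless. The Deployment below is a minimal illustration (the name and image are placeholders):

# "kubectl apply -f deployment.yaml" is idempotent: the first apply creates
# the Deployment, and re-running it changes nothing unless the manifest changed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app               # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.25    # placeholder image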
Self-healing
Kubernetes has built-in self-healing capabilities:
- ReplicaSets ensure desired pod count is maintained
- Node controllers detect node failures
- Kubelet restarts failed containers
- Liveness and readiness probes detect application health
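As an illustration of the last point, a container spec might declare probes like the sketch below; the port and the /healthz and /ready paths are assumptions, so substitute whatever endpoints your application actually exposes:

containers:
- name: my-app
  image: nginx:1.25          # placeholder image
  livenessProbe:             # kubelet restarts the container when this fails repeatedly
    httpGet:
      path: /healthz         # assumed health endpoint
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:            # pod is removed from Service endpoints while this fails
    httpGet:
      path: /ready           # assumed readiness endpoint
      port: 8080
    periodSeconds: 5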
Control Plane HA
Recommended Topology
A highly available Kubernetes control plane typically consists of:
- 3 or more control plane nodes
- Load balancer in front of API servers
- Replicated etcd (either stacked or external)

- Stacked topology: etcd members and control plane components run on the same nodes
- External topology: etcd runs on dedicated hosts
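If you bootstrap with kubeadm, the key to either topology is pointing controlPlaneEndpoint at the load balancer rather than at any single API server. A minimal sketch (the endpoint address and version are placeholders):

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                    # example version
controlPlaneEndpoint: "lb.example.com:6443"   # placeholder load balancer address

Additional control plane nodes then join with kubeadm join --control-plane against the same endpoint.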
API Server Configuration
The API server should be configured for high availability with:
- Multiple instances behind a load balancer
- Appropriate resource limits
- Connection pooling
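On kubeadm clusters the API server runs as a static pod (typically /etc/kubernetes/manifests/kube-apiserver.yaml), and resource guarantees can be set there. The values in this excerpt are illustrative and must be sized for your control plane:

spec:
  containers:
  - name: kube-apiserver
    # ...image, command, and flags omitted...
    resources:
      requests:
        cpu: "1"             # illustrative; size for your cluster's API load
        memory: 2Gi
      limits:
        memory: 4Gi          # a common practice is to omit a CPU limit to avoid throttling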
Scheduler and Controller Manager
These components use leader election to ensure only one instance is active:
spec:
  containers:
  - command:
    - kube-scheduler
    - --leader-elect=true
See the High Availability Scheduler Configuration for detailed information.
etcd Configuration
etcd is a critical component that stores all cluster state. For high availability:
- Deploy an odd number of etcd members, at least 3 (typically 3 or 5), so a majority quorum survives member failures
- Ensure they run on separate physical/virtual machines
- Use low-latency network connections between instances
- Regularly backup the etcd data
For detailed configuration, see etcd Cluster Configuration.
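One common pattern for the regular backups mentioned above is a CronJob that runs etcdctl snapshot save on a control plane node. The sketch below assumes a kubeadm-style certificate layout and a host directory for backups; both paths are placeholders to adapt:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup                 # hypothetical name
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"           # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.9-0   # match your etcd version
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                snapshot save /backup/etcd-$(date +%Y%m%d%H%M).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd      # kubeadm default cert location
          - name: backup
            hostPath:
              path: /var/backups/etcd             # placeholder backup location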
Storage Considerations
Persistent Volumes
For highly available storage:
- Use cloud provider storage with multi-zone redundancy
- Consider using distributed storage solutions like Rook/Ceph
- Create StorageClasses with appropriate reclaim policies
- Implement regular backup solutions for critical data
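As an example of the reclaim-policy point, the StorageClass sketch below keeps volumes (and their data) around after a claim is deleted; the provisioner shown is the AWS EBS CSI driver and is only an assumption, so substitute your own:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ha-retain                        # hypothetical name
provisioner: ebs.csi.aws.com             # assumed CSI driver; use your provisioner
reclaimPolicy: Retain                    # keep the volume after the PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # bind in the zone where the pod is scheduled
allowVolumeExpansion: true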
Data Protection Strategies
- StatefulSets for stateful applications
- PersistentVolumeClaims with appropriate access modes
- Backup solutions like Velero for cluster-wide data protection
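Tying these together, a claim might reference the hypothetical ha-retain class from the sketch above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data          # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce            # single-node access; ReadWriteMany requires driver support
  storageClassName: ha-retain
  resources:
    requests:
      storage: 10Gi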
Network Redundancy
Load Balancer Configuration
Configure load balancers for control plane and application services:
- Health checks for backend nodes
- Session affinity where needed
- Appropriate timeouts and connection limits
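Inside the cluster, several of these concerns map directly onto Service fields. A sketch of a LoadBalancer Service with client-IP session affinity (the app label matches the examples used later in this guide):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
  sessionAffinity: ClientIP        # pin each client to one backend where needed
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # affinity timeout; 3 hours is the default
  externalTrafficPolicy: Local     # cloud LB health checks skip nodes without local endpoints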
Service Mesh Considerations
Service meshes like Istio or Linkerd can provide:
- Advanced routing capabilities
- Circuit breaking for failing services
- Better visibility into network health
Application HA Strategies
Pod Disruption Budgets
Use PodDisruptionBudgets to ensure minimum availability during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
Anti-affinity Rules
Spread pods across nodes and zones:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - my-app
      topologyKey: "kubernetes.io/hostname"
Topology Spread Constraints
Ensure balanced distribution:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
Monitoring and Alerting
Comprehensive monitoring is critical for HA:
- Control plane component health
- etcd metrics
- Node and pod resource utilization
- Network health
Recommended stack:
- Prometheus for metrics collection
- Grafana for visualization
- Alertmanager for alerts
Essential alerts:
- etcd member availability
- Control plane component health
- Node availability
- Certificate expiration
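If you run the Prometheus Operator, alerts like these can be written as a PrometheusRule. The sketch below assumes etcd is scraped under a job label of etcd and that a monitoring namespace exists:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ha-alerts              # hypothetical name
  namespace: monitoring        # assumed namespace
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdMemberDown
      expr: up{job="etcd"} == 0            # assumes a scrape job named "etcd"
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "etcd member {{ $labels.instance }} is down"
    - alert: EtcdNoLeader
      expr: etcd_server_has_leader == 0    # standard etcd server metric
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "etcd member {{ $labels.instance }} reports no leader"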
Performance Testing at Scale
Kubemark
Kubemark simulates large Kubernetes clusters by running lightweight "hollow" nodes, letting you test control plane performance without the full resource footprint of a real cluster at that scale.

See the Kubemark Setup Guide for detailed instructions.
Key metrics to track for large clusters:
- API server latency (target: 99th percentile under 1 second)
- Pod startup time (target: 99th percentile under 5 seconds)
- etcd operation latency
- Controller reconciliation times


Disaster Recovery
Backup Strategy
Regular backups should include:
- etcd data
- Persistent volumes
- Kubernetes resources
Tools like Velero can help with comprehensive backups.
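For example, Velero can take recurring cluster backups through its Schedule resource; this sketch assumes Velero is installed in its default velero namespace:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup           # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  template:
    includedNamespaces:
    - "*"
    ttl: 720h                  # retain backups for 30 days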
Recovery Procedures
Document and regularly test recovery procedures:
- etcd restoration
- Control plane rebuilding
- Data restoration
- Service validation
Common Failure Scenarios
Prepare for these common failure modes:
Control Plane Node Failure
- Impact: API operations may be temporarily unavailable
- Mitigation: Multiple control plane nodes, leader election
- Recovery: Automatic with proper configuration
etcd Member Failure
- Impact: Cluster state operations impacted if quorum lost
- Mitigation: At least 3 etcd members, proper monitoring
- Recovery: Replace failed member, restore from backup if needed
Network Partition
- Impact: Nodes may become unreachable, potential split-brain
- Mitigation: Multi-zone deployment, robust health checks
- Recovery: Automatic reconciliation once connectivity restored
Worker Node Failure
- Impact: Pods on that node become unavailable
- Mitigation: Pod anti-affinity, multiple replicas
- Recovery: Automatic rescheduling of pods
References and Tools
- Kubernetes Documentation
Contributing
Contributions to this HA reference guide are welcome! Please submit a pull request with your additions or corrections.