Kubernetes High Availability Reference Guide
Introduction
High availability (HA) is a crucial aspect of production-grade Kubernetes clusters, ensuring that your platform remains operational despite failures of individual components. This guide provides comprehensive information on implementing, testing, and maintaining HA in your Kubernetes environments.
Table of Contents
- Core HA Concepts
- Control Plane HA
- etcd Configuration
- Storage Considerations
- Network Redundancy
- Application HA Strategies
- Monitoring and Alerting
- Performance Testing at Scale
- Disaster Recovery
- Common Failure Scenarios
- References and Tools
Core HA Concepts
Redundancy
Redundancy is the foundation of high availability in any distributed system. For Kubernetes, this means:
- Multiple copies of critical control plane components
- Replicated data storage (etcd)
- Multiple worker nodes for application workloads
- Redundant network paths
Leader Election
Control plane components like the scheduler and controller manager use leader election to ensure that only one instance is active at a time. This prevents conflicts while providing redundancy:
leaderElection:
  leaderElect: true
  resourceName: component-name
  resourceNamespace: kube-system
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s
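For instance, this stanza can be embedded in a component configuration file such as a KubeSchedulerConfiguration, loaded by the scheduler via its --config flag. The sketch below assumes a kubeadm-style kubeconfig path, which may differ in your cluster:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf  # kubeadm default; adjust to your layout
leaderElection:
  leaderElect: true
  resourceName: kube-scheduler
  resourceNamespace: kube-system
  leaseDuration: 15s
  renewDeadline: 10s
  retryPeriod: 2s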
Idempotency
Operations should be designed to be idempotent - meaning they can be safely repeated without changing the result beyond the initial application. This is especially important in environments where network failures may cause retries.
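Kubernetes' declarative model makes this concrete: applying the same manifest repeatedly with kubectl apply converges to the same state, so a retried apply is harmless. The Deployment below is a minimal illustration (the name and image are placeholders):

# "kubectl apply -f deployment.yaml" is idempotent: the first apply creates
# the Deployment, and re-running it changes nothing unless the manifest changed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app               # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.25    # placeholder image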
Self-healing
Kubernetes has built-in self-healing capabilities:
- ReplicaSets ensure desired pod count is maintained
- Node controllers detect node failures
- Kubelet restarts failed containers
- Liveness and readiness probes detect application health
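As an illustration of the last point, a container spec might declare probes like the sketch below; the port and the /healthz and /ready paths are assumptions, so substitute whatever endpoints your application actually exposes:

containers:
- name: my-app
  image: nginx:1.25          # placeholder image
  livenessProbe:             # kubelet restarts the container when this fails repeatedly
    httpGet:
      path: /healthz         # assumed health endpoint
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:            # pod is removed from Service endpoints while this fails
    httpGet:
      path: /ready           # assumed readiness endpoint
      port: 8080
    periodSeconds: 5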
Control Plane HA
Recommended Topology
A highly available Kubernetes control plane typically consists of:
- 3 or more control plane nodes
- Load balancer in front of API servers
- Replicated etcd (either stacked or external)

- Stacked topology: etcd members and control plane components run on the same nodes
- External topology: etcd runs on dedicated hosts
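If you bootstrap with kubeadm, the key to either topology is pointing controlPlaneEndpoint at the load balancer rather than at any single API server. A minimal sketch (the endpoint address and version are placeholders):

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                    # example version
controlPlaneEndpoint: "lb.example.com:6443"   # placeholder load balancer address

Additional control plane nodes then join with kubeadm join --control-plane against the same endpoint.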
API Server Configuration
The API server should be configured for high availability with:
- Multiple instances behind a load balancer
- Appropriate resource limits
- Connection pooling
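On kubeadm clusters the API server runs as a static pod (typically /etc/kubernetes/manifests/kube-apiserver.yaml), and resource guarantees can be set there. The values in this excerpt are illustrative and must be sized for your control plane:

spec:
  containers:
  - name: kube-apiserver
    # ...image, command, and flags omitted...
    resources:
      requests:
        cpu: "1"             # illustrative; size for your cluster's API load
        memory: 2Gi
      limits:
        memory: 4Gi          # a common practice is to omit a CPU limit to avoid throttling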
Scheduler and Controller Manager
These components use leader election to ensure only one instance is active:
spec:
  containers:
  - command:
    - kube-scheduler
    - --leader-elect=true
See the High Availability Scheduler Configuration for detailed information.
etcd Configuration
etcd is a critical component that stores all cluster state. For high availability:
- Deploy an odd number of etcd members, at least 3 (typically 3 or 5), so a majority quorum survives member failures
- Ensure they run on separate physical/virtual machines
- Use low-latency network connections between instances
- Regularly backup the etcd data
For detailed configuration, see etcd Cluster Configuration.
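One common pattern for the regular backups mentioned above is a CronJob that runs etcdctl snapshot save on a control plane node. The sketch below assumes a kubeadm-style certificate layout and a host directory for backups; both paths are placeholders to adapt:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup                 # hypothetical name
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"           # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.9-0   # match your etcd version
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                snapshot save /backup/etcd-$(date +%Y%m%d%H%M).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd      # kubeadm default cert location
          - name: backup
            hostPath:
              path: /var/backups/etcd             # placeholder backup location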
Storage Considerations
Persistent Volumes
For highly available storage:
- Use cloud provider storage with multi-zone redundancy
- Consider using distributed storage solutions like Rook/Ceph
- Create StorageClasses with appropriate reclaim policies
- Implement regular backup solutions for critical data
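As an example of the reclaim-policy point, the StorageClass sketch below keeps volumes (and their data) around after a claim is deleted; the provisioner shown is the AWS EBS CSI driver and is only an assumption, so substitute your own:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ha-retain                        # hypothetical name
provisioner: ebs.csi.aws.com             # assumed CSI driver; use your provisioner
reclaimPolicy: Retain                    # keep the volume after the PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # bind in the zone where the pod is scheduled
allowVolumeExpansion: true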
Data Protection Strategies
- StatefulSets for stateful applications
- PersistentVolumeClaims with appropriate access modes
- Backup solutions like Velero for cluster-wide data protection
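Tying these together, a claim might reference the hypothetical ha-retain class from the sketch above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data          # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce            # single-node access; ReadWriteMany requires driver support
  storageClassName: ha-retain
  resources:
    requests:
      storage: 10Gi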
Network Redundancy
Load Balancer Configuration
Configure load balancers for control plane and application services:
- Health checks for backend nodes
- Session affinity where needed
- Appropriate timeouts and connection limits
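Inside the cluster, several of these concerns map directly onto Service fields. A sketch of a LoadBalancer Service with client-IP session affinity (the app label matches the examples used later in this guide):

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
  sessionAffinity: ClientIP        # pin each client to one backend where needed
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # affinity timeout; 3 hours is the default
  externalTrafficPolicy: Local     # cloud LB health checks skip nodes without local endpoints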
Service Mesh Considerations
Service meshes like Istio or Linkerd can provide:
- Advanced routing capabilities
- Circuit breaking for failing services
- Better visibility into network health
Application HA Strategies
Pod Disruption Budgets
Use PodDisruptionBudgets to ensure minimum availability during voluntary disruptions:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
Anti-affinity Rules
Spread pods across nodes and zones:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - my-app
      topologyKey: "kubernetes.io/hostname"
Topology Spread Constraints
Ensure balanced distribution:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
Monitoring and Alerting
Comprehensive monitoring is critical for HA:
- Control plane component health
- etcd metrics
- Node and pod resource utilization
- Network health
Recommended stack:
- Prometheus for metrics collection
- Grafana for visualization
- Alertmanager for alerts
Essential alerts:
- etcd member availability
- Control plane component health
- Node availability
- Certificate expiration
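If you run the Prometheus Operator, alerts like these can be written as a PrometheusRule. The sketch below assumes etcd is scraped under a job label of etcd and that a monitoring namespace exists:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ha-alerts              # hypothetical name
  namespace: monitoring        # assumed namespace
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdMemberDown
      expr: up{job="etcd"} == 0            # assumes a scrape job named "etcd"
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "etcd member {{ $labels.instance }} is down"
    - alert: EtcdNoLeader
      expr: etcd_server_has_leader == 0    # standard etcd server metric
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "etcd member {{ $labels.instance }} reports no leader"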
Performance Testing at Scale
Kubemark
Kubemark simulates large Kubernetes clusters by running lightweight "hollow" nodes, letting you test control plane performance without the full resource footprint of a real cluster at that scale.

See the Kubemark Setup Guide for detailed instructions.
Key metrics to track for large clusters:
- API server latency (target: 99th percentile under 1 second)
- Pod startup time (target: 99th percentile under 5 seconds)
- etcd operation latency
- Controller reconciliation times


Disaster Recovery
Backup Strategy
Regular backups should include:
- etcd data
- Persistent volumes
- Kubernetes resources
Tools like Velero can help with comprehensive backups.
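For example, Velero can take recurring cluster backups through its Schedule resource; this sketch assumes Velero is installed in its default velero namespace:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup           # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"        # daily at 02:00
  template:
    includedNamespaces:
    - "*"
    ttl: 720h                  # retain backups for 30 days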
Recovery Procedures
Document and regularly test recovery procedures:
- etcd restoration
- Control plane rebuilding
- Data restoration
- Service validation
Common Failure Scenarios
Prepare for these common failure modes:
Control Plane Node Failure
- Impact: API operations may be temporarily unavailable
- Mitigation: Multiple control plane nodes, leader election
- Recovery: Automatic with proper configuration
etcd Member Failure
- Impact: Cluster state operations impacted if quorum lost
- Mitigation: At least 3 etcd members, proper monitoring
- Recovery: Replace failed member, restore from backup if needed
Network Partition
- Impact: Nodes may become unreachable, potential split-brain
- Mitigation: Multi-zone deployment, robust health checks
- Recovery: Automatic reconciliation once connectivity restored
Worker Node Failure
- Impact: Pods on that node become unavailable
- Mitigation: Pod anti-affinity, multiple replicas
- Recovery: Automatic rescheduling of pods
References and Tools
- Kubernetes Documentation
Contributing
Contributions to this HA reference guide are welcome! Please submit a pull request with your additions or corrections.