
HA & Disaster Recovery

Where This Fits in the Enterprise Architecture


Enterprise HA and DR Architecture

The central infra team designs the HA topology (multi-AZ node placement, PDBs as policy), configures DR infrastructure (standby clusters, Velero schedules, DNS failover), and runs quarterly DR drills. Tenant teams define PodDisruptionBudgets for their workloads and must declare their RTO/RPO requirements during onboarding.


High Availability — Within a Single Region


Both EKS and GKE distribute nodes across availability zones automatically when configured correctly.

EKS managed node groups span AZs defined in the subnet configuration. GKE regional clusters run the control plane AND nodes across 3 zones automatically.

resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.primary.name
  node_group_name = "general"
  node_role_arn   = aws_iam_role.node.arn

  # Subnets in different AZs — EKS distributes nodes across them
  subnet_ids = [
    aws_subnet.private_az1.id, # me-south-1a
    aws_subnet.private_az2.id, # me-south-1b
    aws_subnet.private_az3.id, # me-south-1c
  ]

  scaling_config {
    desired_size = 6  # 2 per AZ
    min_size     = 3  # minimum 1 per AZ
    max_size     = 30
  }

  instance_types = ["m6i.xlarge"]

  update_config {
    max_unavailable_percentage = 25 # rolling update, max 25% at a time
  }
}

Karpenter multi-AZ: NodePool requirements constrain provisioning to the listed zones, and Karpenter then places capacity across them to satisfy pod topology spread constraints:

requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values: ["me-south-1a", "me-south-1b", "me-south-1c"]

Topology spread constraints tell the scheduler to distribute pods evenly across failure domains (zones, nodes, or custom topology keys).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      topologySpreadConstraints:
        # Spread across AZs — hard requirement
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payments-api
        # Spread across nodes — soft preference
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payments-api
      # Anti-affinity as backup (older approach, still valid)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["payments-api"]
                topologyKey: kubernetes.io/hostname

How topology spread works with 6 replicas across 3 AZs:

Topology Spread Constraints Example

PDBs protect workloads during voluntary disruptions — node drains, cluster upgrades, Karpenter consolidation. They do NOT protect against involuntary disruptions (node crash, OOMKill).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: payments
spec:
  # Option A: minimum available (use for critical services)
  minAvailable: "60%" # at least 60% must be running
  # Option B: maximum unavailable (use for flexible services)
  # maxUnavailable: 1 # at most 1 pod can be down
  selector:
    matchLabels:
      app: payments-api

PDB interaction with cluster operations:

VOLUNTARY DISRUPTION FLOW
===========================
kubectl drain node-1 Karpenter consolidation Cluster upgrade
| | |
v v v
+------------------------------------------------------------------+
| Eviction API |
| "Can I evict pod X from node Y?" |
| | |
| v |
| PDB check: Is minAvailable satisfied AFTER eviction? |
| | |
| YES → evict pod, reschedule on another node |
| NO → block eviction, retry later |
| | |
| If blocked too long: |
| - kubectl drain: hangs (timeout configurable) |
| - Karpenter: skips node, tries another |
| - Cluster upgrade: pauses, alerts |
+------------------------------------------------------------------+

Kyverno policy enforcing PDB existence:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pdb
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pdb-for-deployments
      match:
        any:
          - resources:
              kinds: ["Deployment"]
              namespaces: ["payments", "trading", "lending"]
      preconditions:
        all:
          - key: "{{request.object.spec.replicas}}"
            operator: GreaterThan
            value: 1
      validate:
        message: "Production deployments with >1 replica must have a PodDisruptionBudget"
        deny:
          conditions:
            all:
              # hyphenated label keys must be quoted in JMESPath
              - key: '{{request.object.metadata.labels."has-pdb"}}'
                operator: NotEquals
                value: "true"

DR strategy trade-offs:

STRATEGY          RTO         RPO        COST    COMPLEXITY
========          ===         ===        ====    ==========
Backup-Restore    4-24 hrs    1-24 hrs   $       Low
Pilot Light       1-2 hrs     minutes    $$      Medium
Active-Passive    5-30 min    seconds    $$$     High
Active-Active     ~0 (auto)   ~0         $$$$    Very High

DR Pattern — Backup-Restore

Best for: Non-critical workloads, dev/staging environments, batch processing systems.

HA Architecture for Payment Processing

DR Pattern — Active-Passive

DR Pattern — Active-Active


Velero backs up Kubernetes resources (deployments, services, configmaps, secrets, PVs) and restores them to the same or a different cluster.

Velero Backup Architecture

On EKS, Velero uses S3 for backup storage and EBS Snapshots for persistent volume backups. On GKE, you can use either the native Backup for GKE managed service or Velero with GCS.

# Velero BackupStorageLocation
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-primary
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: prod-velero-backups-me-south-1
    prefix: eks-prod
  config:
    region: me-south-1
    s3ForcePathStyle: "false"
---
# Velero VolumeSnapshotLocation
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: aws-ebs
  namespace: velero
spec:
  provider: aws
  config:
    region: me-south-1
---
# Cross-region backup for DR
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-dr
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: prod-velero-backups-eu-west-1 # DR region bucket
    prefix: eks-prod-dr
  config:
    region: eu-west-1
---
# Schedule full cluster backup every 6 hours
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: full-cluster-backup
  namespace: velero
spec:
  schedule: "0 */6 * * *" # every 6 hours
  template:
    storageLocation: aws-primary
    volumeSnapshotLocations:
      - aws-ebs
    includedNamespaces:
      - payments
      - orders
      - trading
      - lending
    excludedResources:
      - events
      - pods # pods are recreated by deployments
    snapshotMoveData: true # use Kopia to move snapshot data to object storage
    ttl: 720h # retain for 30 days
---
# Critical namespace: more frequent backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-hourly
  namespace: velero
spec:
  schedule: "0 * * * *" # every hour
  template:
    storageLocation: aws-primary
    includedNamespaces:
      - payments
    ttl: 168h # retain for 7 days

DNS failover: Route 53 health checks detect a primary-region outage and shift traffic to the DR load balancer automatically.

# Primary region health check
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary-alb.payments.internal.bank.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"
  failure_threshold = 3
  request_interval  = 10

  tags = {
    Name = "primary-payments-health"
  }
}

# Primary record (failover routing)
resource "aws_route53_record" "payments_primary" {
  zone_id = aws_route53_zone.bank.zone_id
  name    = "payments.bank.com"
  type    = "A"

  alias {
    name                   = aws_lb.primary_alb.dns_name
    zone_id                = aws_lb.primary_alb.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"
}

# DR record
resource "aws_route53_record" "payments_dr" {
  zone_id = aws_route53_zone.bank.zone_id
  name    = "payments.bank.com"
  type    = "A"

  alias {
    name                   = aws_lb.dr_alb.dns_name
    zone_id                = aws_lb.dr_alb.zone_id
    evaluate_target_health = true
  }

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "dr"
}

For DR, ArgoCD in the Shared Services account manages both primary and DR clusters:

ARGOCD MULTI-CLUSTER
=====================
Shared Services Account
+----------------------------------+
| ArgoCD (management cluster) |
| |
| Application: payments-primary |
| → target: primary EKS/GKE |
| → sync: automated |
| |
| Application: payments-dr |
| → target: DR EKS/GKE |
| → sync: automated |
| → replicas: reduced (or 0) |
| |
| ApplicationSet: all-envs |
| → generates apps for both |
| clusters from same manifests |
+----------------------------------+
| |
v v
Primary Cluster DR Cluster
(full capacity) (pilot light)

ApplicationSet for multi-cluster DR:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-multi-cluster
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: prod-primary
            url: https://primary-eks.me-south-1.eks.amazonaws.com
            region: me-south-1
            replicas: "6"
          - cluster: prod-dr
            url: https://dr-eks.eu-west-1.eks.amazonaws.com
            region: eu-west-1
            replicas: "2" # reduced for pilot-light
  template:
    metadata:
      name: "payments-{{cluster}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/bank/gitops-repo.git
        targetRevision: main
        path: apps/payments/overlays/{{cluster}}
      destination:
        server: "{{url}}"
        namespace: payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
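The per-cluster replica counts live in the overlays referenced by `path`. A minimal sketch of what the `prod-dr` overlay might contain — the directory layout and patch values are illustrative assumptions, not from the source repo:

```yaml
# apps/payments/overlays/prod-dr/kustomization.yaml (hypothetical)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # Pilot-light: run a skeleton crew of pods in the DR region
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 2
    target:
      kind: Deployment
      name: payments-api
```

During failover, bumping `value: 2` to the full replica count in Git and letting ArgoCD sync keeps the scale-up auditable, instead of an out-of-band `kubectl scale`.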

Scenario 1: Design HA for a Payment Processing App on EKS


“Design high availability for a payment processing application on EKS that cannot tolerate more than 30 seconds of downtime.”

Answer framework:

Pod level:

  • Replicas: minimum 6 (2 per AZ)
  • Topology spread: DoNotSchedule across zones, ScheduleAnyway across nodes
  • PDB: minAvailable: 60% — at least 4 of 6 pods always running
  • Health checks: liveness (restart if hung), readiness (remove from LB if unhealthy), startup (grace period for init)
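
The three probes from the bullet above could be wired into the container spec like this — endpoint paths and timing values are illustrative assumptions:

```yaml
# Pod-spec fragment: probe sketch for payments-api (illustrative values)
containers:
  - name: payments-api
    image: payments-api:latest
    ports:
      - containerPort: 8080
    startupProbe:              # grace period for init
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30     # up to 30 x 5s = 150s to start
      periodSeconds: 5
    livenessProbe:             # restart if hung
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:            # remove from LB if unhealthy
      httpGet: { path: /readyz, port: 8080 }
      periodSeconds: 5
      failureThreshold: 2
```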

Node level:

  • Karpenter: NodePool across 3 AZs, on-demand only (no spot for payments)
  • Disruption budget: nodes: "0" during business hours (no consolidation during trading)
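
The business-hours disruption budget from the bullet above might be expressed in a Karpenter v1 NodePool like this — the schedule and percentages are assumptions, not values from the source:

```yaml
# NodePool fragment (Karpenter v1) — illustrative schedule values
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "0"                  # no voluntary disruptions...
        schedule: "0 8 * * mon-fri" # ...starting 08:00 on weekdays
        duration: 10h               # ...for the 10-hour business window
      - nodes: "10%"                # otherwise consolidate slowly
```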

Data level:

  • RDS Multi-AZ: synchronous replication, automatic failover (typically under 35 seconds for Multi-AZ DB clusters; 60-120 seconds for Multi-AZ instances)
  • ElastiCache Redis: Multi-AZ with auto-failover for session/cache

Network level:

  • ALB: cross-zone load balancing enabled
  • Health checks: ALB target group checks readiness endpoint every 5 seconds
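
If the ALB is managed by the AWS Load Balancer Controller, the 5-second readiness checks might be declared as Ingress annotations — a sketch, with hostnames and values assumed:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-api
  namespace: payments
  annotations:
    # ALB target group probes the readiness endpoint every 5 seconds
    alb.ingress.kubernetes.io/healthcheck-path: /readyz
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
spec:
  ingressClassName: alb
  rules:
    - host: payments.bank.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payments-api
                port:
                  number: 80
```
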
Internet
|
v
+------+------+
| Route 53 |
| (failover) |
+------+------+
|
+------+------+
| ALB |
| (cross-zone)|
+------+------+
|
+---------+---------+
| | |
+--+--+ +--+--+ +--+--+
|AZ-1a| |AZ-1b| |AZ-1c|
|Pod 1| |Pod 3| |Pod 5|
|Pod 2| |Pod 4| |Pod 6|
+--+--+ +--+--+ +--+--+
| | |
+--+---------+---------+--+
| RDS Multi-AZ |
| Primary (AZ-1a) |
| Standby (AZ-1b) |
+--------------------------+

Scenario 2: Full Region Failover Walkthrough


“Your primary region is completely down. Walk me through the Kubernetes failover step by step.”

Pre-conditions (must already be in place):

  • DR cluster running in pilot-light mode (cluster exists, reduced replicas)
  • Velero backups every 6 hours to DR region S3/GCS
  • RDS read replica in DR region with async replication
  • ArgoCD managing both clusters from a separate management cluster
  • Route 53 health checks monitoring primary

Failover steps:

MINUTE 0: Primary region outage detected
- Route 53 health check fails (3 consecutive failures at a 10s interval ≈ 30 seconds to detect)
- PagerDuty alert fires to on-call SRE
MINUTE 1-2: Automated DNS failover
- Route 53 automatically switches to SECONDARY record
- Traffic starts flowing to DR region ALB
- (DR cluster has reduced replicas — may see degraded performance)
MINUTE 2-5: Scale up DR cluster
- SRE runs: kubectl scale deployment payments-api --replicas=6 -n payments
- OR: ArgoCD overlay for DR is updated with full replica count
- Karpenter provisions additional nodes (60-90 seconds each)
MINUTE 5-10: Database failover
- Promote RDS read replica to standalone primary:
aws rds promote-read-replica --db-instance-identifier payments-db-dr
- Update application config (via Secrets Manager / ConfigMap) with new DB endpoint
- This is the longest step — RDS promotion takes 5-10 minutes
MINUTE 10-15: Validation
- Run smoke tests against DR endpoint
- Verify payment processing end-to-end
- Check monitoring dashboards in DR region
- Confirm no data loss (check last replicated transaction)
MINUTE 15+: Communication
- Update status page
- Notify downstream teams
- Begin root cause analysis on primary
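
The "update application config" step in minutes 5-10 might look like the following, assuming the DB endpoint lives in a ConfigMap (names and endpoint are hypothetical):

```yaml
# Point the app at the newly promoted DR database (hypothetical names)
apiVersion: v1
kind: ConfigMap
metadata:
  name: payments-db-config
  namespace: payments
data:
  DB_HOST: payments-db-dr.eu-west-1.rds.amazonaws.com
  DB_PORT: "5432"
```

Unless the application watches the ConfigMap, pods must be restarted to pick up the change (e.g., `kubectl rollout restart deployment/payments-api -n payments`).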

Total RTO: 10-15 minutes (most time spent on DB promotion). RPO: Replication lag of the RDS read replica (typically seconds, up to minutes under heavy load).


Scenario 3: DR Testing Without Affecting Production


“How do you test DR for Kubernetes without affecting production?”

DR testing strategy:

  1. Velero restore test (monthly):
    • Create a temporary test cluster in the DR region
    • Restore latest Velero backup to the test cluster
    • Run automated smoke tests against restored workloads
    • Compare resource counts (deployments, services, configmaps) with source
    • Tear down test cluster after validation
# Restore to test cluster
velero restore create dr-test-$(date +%Y%m%d) \
  --from-backup full-cluster-backup-20260315060000 \
  --namespace-mappings payments:payments-test \
  --restore-volumes=true

# Validate
kubectl get deployments -n payments-test
kubectl run smoke-test --image=curlimages/curl --rm -it -- \
  curl -s http://payments-api.payments-test.svc/healthz
  2. DNS failover test (quarterly):

    • Use a test domain (e.g., payments-test.bank.com) that mirrors prod DNS config
    • Simulate primary health check failure
    • Verify traffic switches to DR within expected timeframe
    • Measure actual RTO
  3. Chaos engineering (ongoing):

    • Use Litmus Chaos or Chaos Mesh to inject failures
    • Kill random pods, nodes, even simulate AZ failure
    • Verify PDBs, topology spread, and self-healing work as expected
# Litmus Chaos — simulate AZ failure
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: az-failure-test
  namespace: payments
spec:
  appinfo:
    appns: payments
    applabel: app=payments-api
    appkind: deployment
  chaosServiceAccount: litmus-sa
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: NODE_LABEL
              value: topology.kubernetes.io/zone=me-south-1a
            - name: TOTAL_CHAOS_DURATION
              value: "300" # 5 minutes
  4. Game day (bi-annually):
    • Full DR drill with all teams
    • Actually fail over production to DR region
    • Run for 1-2 hours on DR
    • Fail back to primary
    • Document lessons learned

Scenario 4: Active-Active Global SaaS Across 2 Regions


“Design an active-active Kubernetes deployment across 2 regions for a global SaaS platform serving customers in Middle East and Europe.”

Architecture:

Active-Active Multi-Region Architecture

Key design decisions:

Decision              Choice                                Rationale
========              ======                                =========
Database              Aurora Global Database                Multi-region, <1s replication, write forwarding
DNS routing           Geoproximity                          Route users to the nearest region (with optional bias)
Session state         ElastiCache Global Datastore          Cross-region session replication
Static assets         CloudFront / Cloud CDN                Edge-cached globally
Conflict resolution   Last-writer-wins + idempotency keys   Payments must be idempotent
Cluster management    ArgoCD ApplicationSet                 Same manifests, region-specific overlays

Data consistency for payments:

  • Use idempotency keys on all payment requests
  • Write-forwarding adds ~2-3s latency for EU writes (acceptable for payments)
  • Read-after-write consistency guaranteed in the primary region
  • Eventually consistent reads in the secondary (typically <1s lag)

Scenario 5: Pods Stuck in Terminating — What Is Happening?


“A node went down and pods are stuck in Terminating state for over 10 minutes. What’s happening and how do you fix it?”

Root cause analysis:

NODE FAILURE → PODS STUCK TERMINATING
=======================================
Node goes NotReady
|
v
Node controller waits (default 5 min: pod-eviction-timeout, or tolerationSeconds: 300 on the not-ready taint)
|
v
Marks pods for deletion
|
v
Pods enter "Terminating" state
|
+--- Normal case: kubelet runs preStop hook → sends SIGTERM → waits
| terminationGracePeriodSeconds → sends SIGKILL → pod deleted
|
+--- Node is DOWN: kubelet is not running!
→ No one to execute the termination sequence
→ Pod stays "Terminating" indefinitely
→ API server cannot confirm deletion

Common causes of stuck Terminating:

Cause                            How to Identify                                    Fix
=====                            ===============                                    ===
Node down, kubelet not running   kubectl get nodes shows NotReady                   Force delete pod
Finalizer blocking deletion      kubectl get pod -o yaml shows finalizers:          Remove finalizer
PV still attached to pod         kubectl describe pod shows volume detach pending   Force detach volume
preStop hook hanging             Pod has long terminationGracePeriodSeconds         Reduce grace period

Resolution steps:

# 1. Check node status
kubectl get nodes
# Look for "NotReady" nodes
# 2. Check if pod has finalizers
kubectl get pod stuck-pod -n payments -o jsonpath='{.metadata.finalizers}'
# 3. Force delete the pod (last resort)
kubectl delete pod stuck-pod -n payments --grace-period=0 --force
# 4. If PV is stuck, force detach
aws ec2 detach-volume --volume-id vol-xxx --force
# OR for GCP:
gcloud compute instances detach-disk node-name --disk=pv-disk
# 5. If finalizer is blocking, patch it out
kubectl patch pod stuck-pod -n payments -p '{"metadata":{"finalizers":null}}' --type=merge
# 6. If the node will not recover, delete it
kubectl delete node problem-node
# Karpenter or CA will provision a replacement

HA CHECKLIST (per workload)
============================
[ ] Replicas >= 3 (minimum 1 per AZ)
[ ] Topology spread constraints configured
[ ] PDB defined (minAvailable or maxUnavailable)
[ ] Liveness, readiness, startup probes
[ ] Resource requests and limits set
[ ] Anti-affinity for critical pods
[ ] Database: Multi-AZ / HA configuration
[ ] Cache: Multi-AZ with auto-failover
DR CHECKLIST (per cluster)
============================
[ ] Velero scheduled backups (every 6 hours)
[ ] Backup retention: 30 days minimum
[ ] Cross-region backup storage configured
[ ] DR cluster exists (pilot-light or active-passive)
[ ] DNS failover configured (Route 53 / Cloud DNS)
[ ] Database replica in DR region
[ ] ArgoCD manages both clusters
[ ] DR drill tested in last 90 days
[ ] Runbook documented for failover procedure
[ ] RTO/RPO documented and validated