
Autoscaling — HPA, VPA, Karpenter & Cluster Autoscaler

Where This Fits in the Enterprise Architecture


[Diagram: Enterprise Autoscaling Architecture]

The central infra team deploys and configures the scaling controllers (Karpenter, KEDA, Metrics Server, Prometheus Adapter). Tenant teams define HPA and VPA on their own workloads. The infra team sets NodePool constraints — instance families, availability zones, and budget limits — that tenants cannot override.


HPA scales the number of pod replicas based on observed metrics. It runs a control loop every 15 seconds (default) and calculates the desired replica count:

desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))
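
As a sketch of that arithmetic (a simplification of the real controller, which also averages per-pod metrics and discounts unready pods), including the default 10% tolerance inside which HPA does not act:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the default 10% tolerance, HPA does not scale at all.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 pods at 90% CPU against a 60% target -> ceil(4 * 1.5) = 6 pods
print(desired_replicas(4, 90, 60))   # 6
# 4 pods at 63% against 60% is within tolerance -> stays at 4
print(desired_replicas(4, 63, 60))   # 4
```

So a sustained 1.5x overshoot of the target grows the deployment by half in a single loop iteration, while small jitter around the target produces no churn.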

Metric sources:

| Source           | Example                           | Requires                       |
|------------------|-----------------------------------|--------------------------------|
| Resource metrics | CPU, memory utilization           | metrics-server                 |
| Custom metrics   | requests_per_second, queue_depth  | Prometheus Adapter             |
| External metrics | SQS queue length, Pub/Sub backlog | KEDA or cloud provider adapter |

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30    # react fast to spikes
      policies:
        - type: Percent
          value: 100                    # double pods per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25                     # remove 25% at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75

HPA with Custom Metrics (Prometheus Adapter)


To scale on application-specific metrics like requests per second, you need the Prometheus Adapter:

+----------------+       +----------------------+       +------------------+
|   Prometheus   | ----> |  Prometheus Adapter  | ----> |  HPA Controller  |
| (scrapes pods) |       | (exposes as K8s API) |       |  (reads metrics) |
+----------------+       +----------------------+       +------------------+
        ^                                                        |
        |                                                        v
   Pod metrics                                            Scale decision
   /metrics endpoint                                  → Deployment replicas

Prometheus Adapter ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
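
The `name` rule above rewrites a Prometheus counter name into a rate-style metric name. A quick sketch of that rewrite with plain Python `re` (mirroring the adapter's regex semantics, not calling the adapter itself):

```python
import re

def rename_metric(series: str) -> str:
    # Mirrors the adapter rule: matches "^(.*)_total", as "${1}_per_second"
    return re.sub(r"^(.*)_total", r"\1_per_second", series)

print(rename_metric("http_requests_total"))   # http_requests_per_second
print(rename_metric("queue_depth"))           # unchanged: no _total suffix
```

The HPA then requests `http_requests_per_second` through the custom metrics API, and the adapter answers with the `rate()` query.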

HPA using custom metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-custom
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"   # scale when avg exceeds 500 rps/pod
---
# HPA on an external metric (SQS queue length)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: "orders-queue"
        target:
          type: AverageValue
          averageValue: "10"    # 1 pod per 10 messages
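
For an External metric with an AverageValue target, the replica math reduces to ceil(total metric / per-pod target), clamped to the min/max bounds. A small illustration (simplified: real HPA also applies tolerance and behavior policies):

```python
import math

def replicas_for_queue(queue_length: int, per_pod_target: int,
                       min_r: int = 1, max_r: int = 30) -> int:
    """ceil(total metric / per-pod target), clamped to min/max replicas."""
    want = math.ceil(queue_length / per_pod_target)
    return max(min_r, min(max_r, want))

print(replicas_for_queue(95, 10))     # 10 pods for 95 queued messages
print(replicas_for_queue(0, 10))      # floor at minReplicas = 1
print(replicas_for_queue(1000, 10))   # capped at maxReplicas = 30
```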

VPA automatically adjusts CPU and memory requests (and optionally limits) based on historical usage. In enterprise environments, run VPA in recommendation-only mode — it provides right-sizing data without disrupting running pods.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"   # recommendation-only, no live updates
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

Reading VPA recommendations:

kubectl get vpa payments-api-vpa -n payments -o jsonpath='{.status.recommendation}' | jq .

Output tells you actual right-sized values vs what you set — feed this into your Helm values or Kustomize patches.
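
To show what "feeding it into Helm values" can look like, here is a sketch that converts a recommendation payload into a `resources.requests` snippet. The JSON shape follows the VPA status schema (`containerRecommendations` with `lowerBound`/`target`/`upperBound`); the numeric values are illustrative, not from a real cluster:

```python
import json

# Example status.recommendation payload, shaped like VPA's output
# (values here are illustrative).
recommendation = json.loads("""
{
  "containerRecommendations": [
    {
      "containerName": "payments-api",
      "lowerBound": {"cpu": "250m", "memory": "300Mi"},
      "target":     {"cpu": "587m", "memory": "420Mi"},
      "upperBound": {"cpu": "1200m", "memory": "900Mi"}
    }
  ]
}
""")

# Turn the target into a Helm-style values snippet for resources.requests.
for rec in recommendation["containerRecommendations"]:
    t = rec["target"]
    print(f"# {rec['containerName']}")
    print("resources:")
    print("  requests:")
    print(f"    cpu: {t['cpu']}")
    print(f"    memory: {t['memory']}")
```

Committing the `target` values (or something between `lowerBound` and `target`) keeps requests honest without live VPA evictions.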


Cluster Autoscaler (CA) scales managed node groups based on pod scheduling failures. When a pod is Pending due to insufficient resources, CA adds a node. When nodes are underutilized for 10+ minutes, CA cordons and drains them.

[Diagram: Cluster Autoscaler Flow]

Limitations of Cluster Autoscaler:

| Limitation                      | Impact                                     |
|---------------------------------|--------------------------------------------|
| Tied to pre-defined node groups | Must predict instance types ahead of time  |
| Slow provisioning (2-5 min)     | Spiky workloads suffer cold-start delays   |
| Scales one group at a time      | Cannot mix instance types intelligently    |
| No consolidation                | Wastes money on underutilized nodes        |
| Complex ASG management          | Multiple ASGs for GPU, ARM, spot instances |

Cluster Autoscaler on EKS is deployed via Helm and uses IRSA for IAM authentication. It auto-discovers node groups by tags and manages scaling decisions based on pending pod resource requirements.

GKE Cluster Autoscaler is built into GKE and enabled at the node pool level — no separate controller to deploy. GKE Autopilot goes further and eliminates node management entirely — Google manages scaling, and you only define pod resource requests.

# values.yaml for cluster-autoscaler Helm chart
autoDiscovery:
  clusterName: prod-eks-cluster
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/prod-eks-cluster
awsRegion: me-south-1
rbac:
  serviceAccount:
    create: true
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/ClusterAutoscalerRole
extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-local-storage: false
  expander: least-waste            # pick the node group with least wasted resources
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  max-node-provision-time: 15m

Karpenter is a next-generation node provisioner, originally built by AWS, that replaces Cluster Autoscaler on EKS. It provisions nodes just-in-time from the full EC2 fleet, without pre-defined node groups.

[Diagram: Karpenter Architecture]

Why Karpenter over Cluster Autoscaler?

| Aspect             | Cluster Autoscaler      | Karpenter                         |
|--------------------|-------------------------|-----------------------------------|
| Instance selection | Pre-defined node groups | Any EC2 instance type             |
| Provisioning speed | 2-5 minutes             | 60-90 seconds                     |
| Right-sizing       | Limited to group config | Picks optimal size per pod        |
| Consolidation      | None                    | Active bin-packing                |
| Spot handling      | Per-ASG spot config     | Mixed spot/on-demand per NodePool |
| Maintenance        | Manage multiple ASGs    | One NodePool covers many types    |
| Architecture       | ASG-based               | Direct EC2 fleet API              |

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  # Resource budget — prevents runaway scaling
  limits:
    cpu: "1000"      # max 1000 vCPUs across all nodes
    memory: 2000Gi
  # Weight for scheduling priority (higher = preferred)
  weight: 50
  # Disruption settings — how Karpenter optimizes existing nodes
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s          # how long to wait after node becomes empty
    # Budgets control how many nodes can be disrupted at once
    budgets:
      - nodes: "10%"               # max 10% of nodes disrupted at once
      - nodes: "0"                 # block disruption during peak hours
        schedule: "0 9 * * 1-5"    # Mon-Fri 9am
        duration: 8h               # until 5pm
  template:
    metadata:
      labels:
        team: platform
        tier: general
    spec:
      # Reference to AWS-specific configuration
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general
      # Instance type constraints
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # compute, general, memory optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]             # only 6th gen+ (c6i, m6i, r6i, etc.)
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["me-south-1a", "me-south-1b", "me-south-1c"]
      # Taints for workload isolation
      taints:
        - key: workload-type
          value: general
          effect: NoSchedule
      # Expire nodes after 720h (30 days) for AMI rotation
      expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general
spec:
  # AMI selection
  amiSelectorTerms:
    - alias: al2023@latest   # Amazon Linux 2023, auto-updated
  # Network configuration — nodes go into private subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks-cluster
        network-tier: private
  # Security groups
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks-cluster
  # IAM role for the nodes
  role: KarpenterNodeRole-prod-eks-cluster
  # Block device — encrypted EBS
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        kmsKeyID: arn:aws:kms:me-south-1:111111111111:key/key-id
        deleteOnTermination: true
        iops: 3000
        throughput: 125
  # Metadata options — enforce IMDSv2
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required   # IMDSv2 mandatory
  # Tags applied to all EC2 instances
  tags:
    Environment: production
    ManagedBy: karpenter
    CostCenter: platform-team

Consolidation is Karpenter’s cost-optimization engine. It continuously evaluates whether nodes can be replaced with cheaper alternatives or removed entirely.

[Diagram: Karpenter Consolidation Flow]
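
A toy model of the consolidation check helps build intuition: replace a set of nodes with a cheaper alternative only if the displaced pods fit on it. This is illustrative only; the real controller also simulates scheduling constraints, PDBs, and topology spread:

```python
# Toy model of Karpenter's consolidation decision (illustrative only).
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu: float           # allocatable vCPUs
    hourly_cost: float

def can_consolidate(nodes: list[Node], used_cpu: float,
                    replacement: Node) -> bool:
    """Replace `nodes` with `replacement` if the pods fit and it is cheaper."""
    fits = used_cpu <= replacement.cpu
    cheaper = replacement.hourly_cost < sum(n.hourly_cost for n in nodes)
    return fits and cheaper

# Two half-idle 4-vCPU nodes whose pods fit on one (prices illustrative)
a = Node("node-a", cpu=4, hourly_cost=0.192)
b = Node("node-b", cpu=4, hourly_cost=0.192)
one = Node("node-new", cpu=4, hourly_cost=0.192)
print(can_consolidate([a, b], used_cpu=3.5, replacement=one))   # True
print(can_consolidate([a], used_cpu=3.5, replacement=one))      # False: not cheaper
```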

# Karpenter IAM role (for the controller itself)
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name = module.eks.cluster_name

  # Enable IRSA for the Karpenter controller
  enable_irsa                     = true
  irsa_oidc_provider_arn          = module.eks.oidc_provider_arn
  irsa_namespace_service_accounts = ["kube-system:karpenter"]

  # Node IAM role (for EC2 instances Karpenter launches)
  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Install Karpenter via Helm
resource "helm_release" "karpenter" {
  namespace  = "kube-system"
  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.1.0"

  values = [
    yamlencode({
      settings = {
        clusterName       = module.eks.cluster_name
        clusterEndpoint   = module.eks.cluster_endpoint
        interruptionQueue = module.karpenter.queue_name
      }
      serviceAccount = {
        annotations = {
          "eks.amazonaws.com/role-arn" = module.karpenter.iam_role_arn
        }
      }
      replicas = 2   # HA for the controller itself
    })
  ]

  depends_on = [module.karpenter]
}

Node Auto-Provisioning (NAP) is GKE's equivalent of intelligent node scaling. NAP automatically creates new node pools with optimal machine types based on pending pod requirements.

resource "google_container_cluster" "primary" {
  name     = "prod-gke-cluster"
  location = "me-central1"

  cluster_autoscaling {
    enabled = true

    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 500
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 16
      maximum       = 2000
    }
    resource_limits {
      resource_type = "nvidia-tesla-t4"   # GPU limits
      minimum       = 0
      maximum       = 8
    }

    auto_provisioning_defaults {
      service_account = google_service_account.gke_nodes.email
      oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

      management {
        auto_repair  = true
        auto_upgrade = true
      }

      disk_size = 100
      disk_type = "pd-ssd"

      shielded_instance_config {
        enable_secure_boot          = true
        enable_integrity_monitoring = true
      }
    }
  }
}

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale on event sources: message queues, databases, HTTP metrics, cron schedules, and more.

[Diagram: KEDA Architecture]

KEDA scales to zero — something native HPA cannot do (HPA minimum is 1).

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15    # check every 15 seconds
  cooldownPeriod: 120    # wait 2 min before scaling to zero
  minReplicaCount: 0     # scale to zero when idle
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.me-south-1.amazonaws.com/111111111111/orders-queue
        queueLength: "5"          # 1 pod per 5 messages
        awsRegion: me-south-1
        identityOwner: operator   # IRSA-based authentication
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: gateway
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{namespace="gateway"}[2m]))
        threshold: "1000"   # scale when total rps exceeds 1000
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: trading-platform
  namespace: trading
spec:
  scaleTargetRef:
    name: trading-engine
  minReplicaCount: 5
  maxReplicaCount: 200
  triggers:
    # Market hours: pre-scale before open
    - type: cron
      metadata:
        timezone: Asia/Dubai
        start: "0 8 * * 1-5"    # 8am Mon-Fri (cron 1-5 = Mon-Fri)
        end: "0 14 * * 1-5"     # 2pm close
        desiredReplicas: "50"   # pre-warm 50 replicas
    # Reactive scaling on top
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: order_rate
        query: sum(rate(trade_orders_total{namespace="trading"}[1m]))
        threshold: "100"

Scaling Architecture — Putting It All Together


[Diagram: Enterprise Scaling Architecture]

The layered approach:

  1. VPA in Off mode — continuously recommends right-sized requests
  2. HPA or KEDA — scales pods horizontally based on demand
  3. Karpenter — provisions right-sized nodes when pods are pending
  4. Consolidation — Karpenter bin-packs and replaces underutilized nodes overnight

Scenario 1: 10x Traffic Spikes During Business Hours


“Design autoscaling for a banking transaction service that sees 10x traffic spikes during market open (8am-2pm Dubai time) but is nearly idle overnight.”

Answer framework:

Predictive + Reactive scaling:

  1. KEDA cron trigger — pre-scale to 30 pods at 7:45am before market opens
  2. HPA on custom metric (transactions_per_second) — react to actual load above baseline
  3. Karpenter NodePool — provisions nodes in 60-90 seconds when pods go Pending

Node-level:

  • Karpenter with mixed capacity: 70% on-demand (baseline), 30% spot (burst)
  • Disruption budget: nodes: "0" from 8am-3pm (no consolidation during trading hours)
  • Allow consolidation overnight (3pm-7am) to save costs

Scale-down protection:

  • HPA stabilizationWindowSeconds: 600 for scale-down (10 min cooldown)
  • Karpenter consolidateAfter: 60s only outside business hours

Example timeline:

  7:45am   KEDA cron fires → 30 pods warm
  8:00am   Market opens → HPA scales 30→80 based on TPS
  8:01am   Karpenter provisions 5 new nodes (60s each)
  2:00pm   Market closes → traffic drops
  2:10pm   HPA stabilization window (10 min) passes → scale to 40
  3:00pm   Disruption window opens → Karpenter consolidates nodes
  11:00pm  KEDA scales to 0 (minReplicaCount: 0 for batch processors)

Scenario 2: Karpenter vs Cluster Autoscaler — When Would You Choose Each?


“Your team is debating whether to use Karpenter or Cluster Autoscaler. What’s your recommendation?”

Choose Karpenter when:

  • You need fast provisioning (<90 seconds vs 2-5 minutes)
  • You want automatic cost optimization via consolidation
  • Your workloads need diverse instance types (GPU, ARM, spot mix)
  • You want to avoid managing multiple ASGs/managed node groups
  • You need disruption budgets with schedule-based controls

Choose Cluster Autoscaler when:

  • You run GKE (Karpenter is AWS-only; GKE has NAP/Autopilot instead)
  • You have strict compliance requiring pre-approved instance types only
  • Your existing infrastructure is deeply tied to ASG lifecycle hooks
  • You need integration with AWS-native scaling (Target Tracking, Step Scaling)

Our recommendation for a new EKS deployment: Karpenter with a fallback managed node group for system components (CoreDNS, kube-proxy, Karpenter itself). Karpenter should not manage the nodes it runs on.

+---------------------------+     +---------------------------+
|    Managed Node Group     |     |  Karpenter-Managed Nodes  |
|   (system components)     |     |      (workload pods)      |
|  - CoreDNS                |     |  - Payment API            |
|  - kube-proxy             |     |  - Order Processor        |
|  - Karpenter controller   |     |  - Trading Engine         |
|  - Metrics Server         |     |  - Batch Jobs             |
| Always 2-3 nodes (fixed)  |     |   0-100 nodes (elastic)   |
+---------------------------+     +---------------------------+
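
One way to enforce this split is to pin the controller pods to the system node group via the chart's `nodeSelector` and `tolerations` values. A hedged sketch — the node group name `system` and the taint key are placeholders you would replace with your own:

```yaml
# Hypothetical values.yaml fragment for the karpenter Helm chart.
# Pin the controller to the fixed managed node group so it never
# schedules onto (and thus never disrupts) nodes it manages itself.
nodeSelector:
  eks.amazonaws.com/nodegroup: system   # label set by EKS managed node groups
tolerations:
  - key: CriticalAddonsOnly             # if the system group is tainted
    operator: Exists
```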

Scenario 3: Pods Are Scaling But Nodes Are Not


“HPA scaled our deployment from 5 to 30 pods, but 20 pods are stuck in Pending. Nodes aren’t being added. What’s wrong?”

Debugging checklist:

# 1. Check pod events for scheduling failures
kubectl describe pod <pending-pod> -n payments
# Look for: "Insufficient cpu", "Insufficient memory", "no nodes available"
# 2. Check if Cluster Autoscaler is running
kubectl get pods -n kube-system | grep cluster-autoscaler
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=50
# 3. Check Karpenter logs (if using Karpenter)
kubectl logs -n kube-system deployment/karpenter --tail=50
# Look for: "could not launch node", "insufficient capacity"
# 4. Check NodePool limits
kubectl get nodepool general -o yaml
# Is the CPU/memory limit reached?
# 5. Check node group max size
aws eks describe-nodegroup --cluster-name prod --nodegroup-name general \
  --query 'nodegroup.scalingConfig'

Common root causes:

| Cause                                  | Fix                                    |
|----------------------------------------|----------------------------------------|
| Cluster Autoscaler not running         | Redeploy CA, check IRSA permissions    |
| NodePool CPU limit reached             | Increase spec.limits.cpu               |
| ASG max size reached                   | Increase max in managed node group     |
| PDB blocking drain (for consolidation) | Review PDB maxUnavailable settings     |
| Subnet out of IP addresses             | Add subnets, use larger CIDR           |
| Instance type unavailable (spot)       | Add more instance families to NodePool |
| Taints without tolerations             | Pods need matching tolerations         |
| Node affinity too restrictive          | Relax nodeAffinity rules               |

Scenario 4: Scale-to-Zero for Dev/Staging Workloads


“How do you handle scale-to-zero for dev and staging environments to save costs?”

Architecture:

[Diagram: Dev/Staging Scale-to-Zero Architecture]

Implementation:

# Scale-to-zero on HTTP traffic via a Prometheus trigger
# (the KEDA HTTP Add-on is an alternative that can buffer requests during cold start)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-api
  namespace: dev
spec:
  scaleTargetRef:
    name: dev-api
  cooldownPeriod: 900   # 15 min idle → scale to zero
  minReplicaCount: 0
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_active
        query: |
          sum(rate(http_requests_total{namespace="dev",service="dev-api"}[5m]))
        threshold: "1"

Cost savings: Dev environments that are idle 16+ hours/day save ~65% on compute. Multiply across 20 dev namespaces and it is significant.
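
That figure follows directly from the idle fraction. A back-of-envelope check with illustrative numbers:

```python
# Back-of-envelope check on the savings claim: if a dev environment runs
# pods 24h/day today but is idle 16h/day, scale-to-zero removes roughly
# the idle fraction of pod compute.
idle_hours = 16
savings = idle_hours / 24
print(f"{savings:.0%}")   # 67%
```

In practice the realized savings land a bit lower (cold-start overlap, shared system pods), which is consistent with the ~65% quoted above.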


Scenario 5: GPU Workload Scaling for ML Inference


“Design autoscaling for a GPU-based ML inference workload that processes images. It needs P4d instances with NVIDIA A100 GPUs.”

Challenges:

  • GPU instances are expensive ($30+/hour) — cannot over-provision
  • GPU instances have limited availability — cannot always get them
  • Model loading takes 30-60 seconds — cold start is painful
  • Standard CPU metrics do not reflect GPU utilization

Solution:

# Karpenter NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  limits:
    cpu: "200"
    memory: 800Gi
    nvidia.com/gpu: "16"     # max 16 GPUs total
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 300s   # wait 5 min (GPU instances are slow to start)
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["p", "g"]    # GPU instance families
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d", "p5", "g5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # spot not reliable for GPUs
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
---
# HPA on a custom GPU metric (from DCGM exporter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference
  namespace: ml
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 1   # always keep 1 warm
  maxReplicas: 8
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 min (GPU instances are expensive to re-provision)
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent   # from DCGM exporter via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "70"   # scale when avg GPU util > 70%

GKE equivalent: Use GKE with GPU node pools and NAP:

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  cluster  = google_container_cluster.primary.name
  location = "me-central1"

  autoscaling {
    min_node_count = 0
    max_node_count = 8
  }

  node_config {
    machine_type = "a2-highgpu-1g"   # NVIDIA A100

    guest_accelerator {
      type  = "nvidia-tesla-a100"
      count = 1
      gpu_driver_installation_config {
        gpu_driver_version = "LATEST"
      }
    }

    taint {
      key    = "nvidia.com/gpu"
      value  = "present"
      effect = "NO_SCHEDULE"
    }
  }
}

[Diagram: Autoscaling Decision Tree]