
Autoscaling — HPA, VPA, Karpenter & Cluster Autoscaler

Where This Fits in the Enterprise Architecture


[Diagram: Enterprise Autoscaling Architecture]

The central infra team deploys and configures the scaling controllers (Karpenter, KEDA, Metrics Server, Prometheus Adapter). Tenant teams define HPA and VPA on their own workloads. The infra team sets NodePool constraints — instance families, availability zones, and budget limits — that tenants cannot override.


HPA scales the number of pod replicas based on observed metrics. It runs a control loop every 15 seconds (default) and calculates the desired replica count:

desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))
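
As a sketch of that arithmetic (a simplification of the real controller, which also averages per-pod metrics and discounts unready pods), including the default 10% tolerance inside which HPA does not act:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the default 10% tolerance, HPA does not scale at all.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 pods at 90% CPU against a 60% target -> ceil(4 * 1.5) = 6 pods
print(desired_replicas(4, 90, 60))   # 6
# 4 pods at 63% against 60% is within tolerance -> stays at 4
print(desired_replicas(4, 63, 60))   # 4
```

So a sustained 1.5x overshoot of the target grows the deployment by half in a single loop iteration, while small jitter around the target produces no churn.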

Metric sources:

| Source           | Example                           | Requires                       |
|------------------|-----------------------------------|--------------------------------|
| Resource metrics | CPU, memory utilization           | metrics-server                 |
| Custom metrics   | requests_per_second, queue_depth  | Prometheus Adapter             |
| External metrics | SQS queue length, Pub/Sub backlog | KEDA or cloud provider adapter |

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30    # react fast to spikes
      policies:
        - type: Percent
          value: 100                    # double pods per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25                     # remove 25% at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75

HPA with Custom Metrics (Prometheus Adapter)


To scale on application-specific metrics like requests per second, you need the Prometheus Adapter:

+----------------+       +----------------------+       +------------------+
|   Prometheus   | ----> |  Prometheus Adapter  | ----> |  HPA Controller  |
| (scrapes pods) |       | (exposes as K8s API) |       |  (reads metrics) |
+----------------+       +----------------------+       +------------------+
        ^                                                        |
        |                                                        v
   Pod metrics                                            Scale decision
   /metrics endpoint                                  → Deployment replicas

Prometheus Adapter ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
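
The `name` rule above rewrites a Prometheus counter name into a rate-style metric name. A quick sketch of that rewrite with plain Python `re` (mirroring the adapter's regex semantics, not calling the adapter itself):

```python
import re

def rename_metric(series: str) -> str:
    # Mirrors the adapter rule: matches "^(.*)_total", as "${1}_per_second"
    return re.sub(r"^(.*)_total", r"\1_per_second", series)

print(rename_metric("http_requests_total"))   # http_requests_per_second
print(rename_metric("queue_depth"))           # unchanged: no _total suffix
```

The HPA then requests `http_requests_per_second` through the custom metrics API, and the adapter answers with the `rate()` query.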

HPA using custom metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-custom
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"   # scale when avg exceeds 500 rps/pod
---
# HPA on an external metric (SQS queue length)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: "orders-queue"
        target:
          type: AverageValue
          averageValue: "10"    # 1 pod per 10 messages
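
For an External metric with an AverageValue target, the replica math reduces to ceil(total metric / per-pod target), clamped to the min/max bounds. A small illustration (simplified: real HPA also applies tolerance and behavior policies):

```python
import math

def replicas_for_queue(queue_length: int, per_pod_target: int,
                       min_r: int = 1, max_r: int = 30) -> int:
    """ceil(total metric / per-pod target), clamped to min/max replicas."""
    want = math.ceil(queue_length / per_pod_target)
    return max(min_r, min(max_r, want))

print(replicas_for_queue(95, 10))     # 10 pods for 95 queued messages
print(replicas_for_queue(0, 10))      # floor at minReplicas = 1
print(replicas_for_queue(1000, 10))   # capped at maxReplicas = 30
```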

VPA automatically adjusts CPU and memory requests (and optionally limits) based on historical usage. In enterprise environments, run VPA in recommendation-only mode — it provides right-sizing data without disrupting running pods.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"   # recommendation-only, no live updates
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

Reading VPA recommendations:

kubectl get vpa payments-api-vpa -n payments -o jsonpath='{.status.recommendation}' | jq .

Output tells you actual right-sized values vs what you set — feed this into your Helm values or Kustomize patches.
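
To show what "feeding it into Helm values" can look like, here is a sketch that converts a recommendation payload into a `resources.requests` snippet. The JSON shape follows the VPA status schema (`containerRecommendations` with `lowerBound`/`target`/`upperBound`); the numeric values are illustrative, not from a real cluster:

```python
import json

# Example status.recommendation payload, shaped like VPA's output
# (values here are illustrative).
recommendation = json.loads("""
{
  "containerRecommendations": [
    {
      "containerName": "payments-api",
      "lowerBound": {"cpu": "250m", "memory": "300Mi"},
      "target":     {"cpu": "587m", "memory": "420Mi"},
      "upperBound": {"cpu": "1200m", "memory": "900Mi"}
    }
  ]
}
""")

# Turn the target into a Helm-style values snippet for resources.requests.
for rec in recommendation["containerRecommendations"]:
    t = rec["target"]
    print(f"# {rec['containerName']}")
    print("resources:")
    print("  requests:")
    print(f"    cpu: {t['cpu']}")
    print(f"    memory: {t['memory']}")
```

Committing the `target` values (or something between `lowerBound` and `target`) keeps requests honest without live VPA evictions.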


Cluster Autoscaler (CA) scales managed node groups based on pod scheduling failures. When a pod is Pending due to insufficient resources, CA adds a node. When nodes are underutilized for 10+ minutes, CA cordons and drains them.

[Diagram: Cluster Autoscaler Flow]

Limitations of Cluster Autoscaler:

| Limitation                      | Impact                                     |
|---------------------------------|--------------------------------------------|
| Tied to pre-defined node groups | Must predict instance types ahead of time  |
| Slow provisioning (2-5 min)     | Spiky workloads suffer cold-start delays   |
| Scales one group at a time      | Cannot mix instance types intelligently    |
| No consolidation                | Wastes money on underutilized nodes        |
| Complex ASG management          | Multiple ASGs for GPU, ARM, spot instances |

Cluster Autoscaler on EKS is deployed via Helm and uses IRSA for IAM authentication. It auto-discovers node groups by tags and manages scaling decisions based on pending pod resource requirements.

GKE Cluster Autoscaler is built into GKE and enabled at the node pool level — no separate controller to deploy. GKE Autopilot goes further and eliminates node management entirely — Google manages scaling, and you only define pod resource requests.

# values.yaml for cluster-autoscaler Helm chart
autoDiscovery:
  clusterName: prod-eks-cluster
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/prod-eks-cluster
awsRegion: me-south-1
rbac:
  serviceAccount:
    create: true
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/ClusterAutoscalerRole
extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-local-storage: false
  expander: least-waste            # pick the node group with least wasted resources
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  max-node-provision-time: 15m

Karpenter is a next-generation node provisioner, originally built by AWS, that replaces Cluster Autoscaler on EKS. It provisions nodes just-in-time from the full EC2 fleet, without pre-defined node groups.

[Diagram: Karpenter Architecture]

Why Karpenter over Cluster Autoscaler?

| Aspect             | Cluster Autoscaler      | Karpenter                         |
|--------------------|-------------------------|-----------------------------------|
| Instance selection | Pre-defined node groups | Any EC2 instance type             |
| Provisioning speed | 2-5 minutes             | 60-90 seconds                     |
| Right-sizing       | Limited to group config | Picks optimal size per pod        |
| Consolidation      | None                    | Active bin-packing                |
| Spot handling      | Per-ASG spot config     | Mixed spot/on-demand per NodePool |
| Maintenance        | Manage multiple ASGs    | One NodePool covers many types    |
| Architecture       | ASG-based               | Direct EC2 fleet API              |

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  # Resource budget — prevents runaway scaling
  limits:
    cpu: "1000"      # max 1000 vCPUs across all nodes
    memory: 2000Gi
  # Weight for scheduling priority (higher = preferred)
  weight: 50
  # Disruption settings — how Karpenter optimizes existing nodes
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s          # how long to wait after node becomes empty
    # Budgets control how many nodes can be disrupted at once
    budgets:
      - nodes: "10%"               # max 10% of nodes disrupted at once
      - nodes: "0"                 # block disruption during peak hours
        schedule: "0 9 * * 1-5"    # Mon-Fri 9am
        duration: 8h               # until 5pm
  template:
    metadata:
      labels:
        team: platform
        tier: general
    spec:
      # Reference to AWS-specific configuration
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general
      # Instance type constraints
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # compute, general, memory optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]             # only 6th gen+ (c6i, m6i, r6i, etc.)
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["me-south-1a", "me-south-1b", "me-south-1c"]
      # Taints for workload isolation
      taints:
        - key: workload-type
          value: general
          effect: NoSchedule
      # Expire nodes after 720h (30 days) for AMI rotation
      expireAfter: 720h
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general
spec:
  # AMI selection
  amiSelectorTerms:
    - alias: al2023@latest   # Amazon Linux 2023, auto-updated
  # Network configuration — nodes go into private subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks-cluster
        network-tier: private
  # Security groups
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks-cluster
  # IAM role for the nodes
  role: KarpenterNodeRole-prod-eks-cluster
  # Block device — encrypted EBS
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        kmsKeyID: arn:aws:kms:me-south-1:111111111111:key/key-id
        deleteOnTermination: true
        iops: 3000
        throughput: 125
  # Metadata options — enforce IMDSv2
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required   # IMDSv2 mandatory
  # Tags applied to all EC2 instances
  tags:
    Environment: production
    ManagedBy: karpenter
    CostCenter: platform-team

Consolidation is Karpenter’s cost-optimization engine. It continuously evaluates whether nodes can be replaced with cheaper alternatives or removed entirely.

[Diagram: Karpenter Consolidation Flow]
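
A toy model of the consolidation check helps build intuition: replace a set of nodes with a cheaper alternative only if the displaced pods fit on it. This is illustrative only; the real controller also simulates scheduling constraints, PDBs, and topology spread:

```python
# Toy model of Karpenter's consolidation decision (illustrative only).
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu: float           # allocatable vCPUs
    hourly_cost: float

def can_consolidate(nodes: list[Node], used_cpu: float,
                    replacement: Node) -> bool:
    """Replace `nodes` with `replacement` if the pods fit and it is cheaper."""
    fits = used_cpu <= replacement.cpu
    cheaper = replacement.hourly_cost < sum(n.hourly_cost for n in nodes)
    return fits and cheaper

# Two half-idle 4-vCPU nodes whose pods fit on one (prices illustrative)
a = Node("node-a", cpu=4, hourly_cost=0.192)
b = Node("node-b", cpu=4, hourly_cost=0.192)
one = Node("node-new", cpu=4, hourly_cost=0.192)
print(can_consolidate([a, b], used_cpu=3.5, replacement=one))   # True
print(can_consolidate([a], used_cpu=3.5, replacement=one))      # False: not cheaper
```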

# Karpenter IAM role (for the controller itself)
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name = module.eks.cluster_name

  # Enable IRSA for the Karpenter controller
  enable_irsa                     = true
  irsa_oidc_provider_arn          = module.eks.oidc_provider_arn
  irsa_namespace_service_accounts = ["kube-system:karpenter"]

  # Node IAM role (for EC2 instances Karpenter launches)
  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Install Karpenter via Helm
resource "helm_release" "karpenter" {
  namespace  = "kube-system"
  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.1.0"

  values = [
    yamlencode({
      settings = {
        clusterName       = module.eks.cluster_name
        clusterEndpoint   = module.eks.cluster_endpoint
        interruptionQueue = module.karpenter.queue_name
      }
      serviceAccount = {
        annotations = {
          "eks.amazonaws.com/role-arn" = module.karpenter.iam_role_arn
        }
      }
      replicas = 2   # HA for the controller itself
    })
  ]

  depends_on = [module.karpenter]
}

Node Auto-Provisioning (NAP) is GKE's equivalent of intelligent node scaling. NAP automatically creates new node pools with optimal machine types based on pending pod requirements.

resource "google_container_cluster" "primary" {
  name     = "prod-gke-cluster"
  location = "me-central1"

  cluster_autoscaling {
    enabled = true

    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 500
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 16
      maximum       = 2000
    }
    resource_limits {
      resource_type = "nvidia-tesla-t4"   # GPU limits
      minimum       = 0
      maximum       = 8
    }

    auto_provisioning_defaults {
      service_account = google_service_account.gke_nodes.email
      oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

      management {
        auto_repair  = true
        auto_upgrade = true
      }

      disk_size = 100
      disk_type = "pd-ssd"

      shielded_instance_config {
        enable_secure_boot          = true
        enable_integrity_monitoring = true
      }
    }
  }
}

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale on event sources: message queues, databases, HTTP metrics, cron schedules, and more.

[Diagram: KEDA Architecture]

KEDA scales to zero — something native HPA cannot do (HPA minimum is 1).

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15    # check every 15 seconds
  cooldownPeriod: 120    # wait 2 min before scaling to zero
  minReplicaCount: 0     # scale to zero when idle
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.me-south-1.amazonaws.com/111111111111/orders-queue
        queueLength: "5"          # 1 pod per 5 messages
        awsRegion: me-south-1
        identityOwner: operator   # IRSA-based authentication
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: gateway
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{namespace="gateway"}[2m]))
        threshold: "1000"   # scale when total rps exceeds 1000
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: trading-platform
  namespace: trading
spec:
  scaleTargetRef:
    name: trading-engine
  minReplicaCount: 5
  maxReplicaCount: 200
  triggers:
    # Market hours: pre-scale before open
    - type: cron
      metadata:
        timezone: Asia/Dubai
        start: "0 8 * * 1-5"    # 8am Mon-Fri (cron 1-5 = Mon-Fri)
        end: "0 14 * * 1-5"     # 2pm close
        desiredReplicas: "50"   # pre-warm 50 replicas
    # Reactive scaling on top
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: order_rate
        query: sum(rate(trade_orders_total{namespace="trading"}[1m]))
        threshold: "100"

Scaling Architecture — Putting It All Together


[Diagram: Enterprise Scaling Architecture]

The layered approach:

  1. VPA in Off mode — continuously recommends right-sized requests
  2. HPA or KEDA — scales pods horizontally based on demand
  3. Karpenter — provisions right-sized nodes when pods are pending
  4. Consolidation — Karpenter bin-packs and replaces underutilized nodes overnight

Scenario 1: 10x Traffic Spikes During Business Hours


“Design autoscaling for a banking transaction service that sees 10x traffic spikes during market open (8am-2pm Dubai time) but is nearly idle overnight.”

Answer framework:

Predictive + Reactive scaling:

  1. KEDA cron trigger — pre-scale to 30 pods at 7:45am before market opens
  2. HPA on custom metric (transactions_per_second) — react to actual load above baseline
  3. Karpenter NodePool — provisions nodes in 60-90 seconds when pods go Pending

Node-level:

  • Karpenter with mixed capacity: 70% on-demand (baseline), 30% spot (burst)
  • Disruption budget: nodes: "0" from 8am-3pm (no consolidation during trading hours)
  • Allow consolidation overnight (3pm-7am) to save costs

Scale-down protection:

  • HPA stabilizationWindowSeconds: 600 for scale-down (10 min cooldown)
  • Karpenter consolidateAfter: 60s only outside business hours

Example timeline:

  7:45am   KEDA cron fires → 30 pods warm
  8:00am   Market opens → HPA scales 30→80 based on TPS
  8:01am   Karpenter provisions 5 new nodes (60s each)
  2:00pm   Market closes → traffic drops
  2:10pm   HPA stabilization window (10 min) passes → scale to 40
  3:00pm   Disruption window opens → Karpenter consolidates nodes
  11:00pm  KEDA scales to 0 (minReplicaCount: 0 for batch processors)

Scenario 2: Karpenter vs Cluster Autoscaler — When Would You Choose Each?


“Your team is debating whether to use Karpenter or Cluster Autoscaler. What’s your recommendation?”

Choose Karpenter when:

  • You need fast provisioning (<90 seconds vs 2-5 minutes)
  • You want automatic cost optimization via consolidation
  • Your workloads need diverse instance types (GPU, ARM, spot mix)
  • You want to avoid managing multiple ASGs/managed node groups
  • You need disruption budgets with schedule-based controls

Choose Cluster Autoscaler when:

  • You run GKE (Karpenter is AWS-only; GKE has NAP/Autopilot instead)
  • You have strict compliance requiring pre-approved instance types only
  • Your existing infrastructure is deeply tied to ASG lifecycle hooks
  • You need integration with AWS-native scaling (Target Tracking, Step Scaling)

Our recommendation for a new EKS deployment: Karpenter with a fallback managed node group for system components (CoreDNS, kube-proxy, Karpenter itself). Karpenter should not manage the nodes it runs on.

+---------------------------+     +---------------------------+
|    Managed Node Group     |     |  Karpenter-Managed Nodes  |
|   (system components)     |     |      (workload pods)      |
|  - CoreDNS                |     |  - Payment API            |
|  - kube-proxy             |     |  - Order Processor        |
|  - Karpenter controller   |     |  - Trading Engine         |
|  - Metrics Server         |     |  - Batch Jobs             |
| Always 2-3 nodes (fixed)  |     |   0-100 nodes (elastic)   |
+---------------------------+     +---------------------------+
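
One way to enforce this split is to pin the controller pods to the system node group via the chart's `nodeSelector` and `tolerations` values. A hedged sketch — the node group name `system` and the taint key are placeholders you would replace with your own:

```yaml
# Hypothetical values.yaml fragment for the karpenter Helm chart.
# Pin the controller to the fixed managed node group so it never
# schedules onto (and thus never disrupts) nodes it manages itself.
nodeSelector:
  eks.amazonaws.com/nodegroup: system   # label set by EKS managed node groups
tolerations:
  - key: CriticalAddonsOnly             # if the system group is tainted
    operator: Exists
```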

Scenario 3: Pods Are Scaling But Nodes Are Not


“HPA scaled our deployment from 5 to 30 pods, but 20 pods are stuck in Pending. Nodes aren’t being added. What’s wrong?”

Debugging checklist:

# 1. Check pod events for scheduling failures
kubectl describe pod <pending-pod> -n payments
# Look for: "Insufficient cpu", "Insufficient memory", "no nodes available"
# 2. Check if Cluster Autoscaler is running
kubectl get pods -n kube-system | grep cluster-autoscaler
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=50
# 3. Check Karpenter logs (if using Karpenter)
kubectl logs -n kube-system deployment/karpenter --tail=50
# Look for: "could not launch node", "insufficient capacity"
# 4. Check NodePool limits
kubectl get nodepool general -o yaml
# Is the CPU/memory limit reached?
# 5. Check node group max size
aws eks describe-nodegroup --cluster-name prod --nodegroup-name general \
  --query 'nodegroup.scalingConfig'

Common root causes:

| Cause                                  | Fix                                    |
|----------------------------------------|----------------------------------------|
| Cluster Autoscaler not running         | Redeploy CA, check IRSA permissions    |
| NodePool CPU limit reached             | Increase spec.limits.cpu               |
| ASG max size reached                   | Increase max in managed node group     |
| PDB blocking drain (for consolidation) | Review PDB maxUnavailable settings     |
| Subnet out of IP addresses             | Add subnets, use larger CIDR           |
| Instance type unavailable (spot)       | Add more instance families to NodePool |
| Taints without tolerations             | Pods need matching tolerations         |
| Node affinity too restrictive          | Relax nodeAffinity rules               |

Scenario 4: Scale-to-Zero for Dev/Staging Workloads


“How do you handle scale-to-zero for dev and staging environments to save costs?”

Architecture:

[Diagram: Dev/Staging Scale-to-Zero Architecture]

Implementation:

# Scale-to-zero on HTTP traffic via a Prometheus trigger
# (the KEDA HTTP Add-on is an alternative that can buffer requests during cold start)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-api
  namespace: dev
spec:
  scaleTargetRef:
    name: dev-api
  cooldownPeriod: 900   # 15 min idle → scale to zero
  minReplicaCount: 0
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_active
        query: |
          sum(rate(http_requests_total{namespace="dev",service="dev-api"}[5m]))
        threshold: "1"

Cost savings: Dev environments that are idle 16+ hours/day save ~65% on compute. Multiply across 20 dev namespaces and it is significant.
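
That figure follows directly from the idle fraction. A back-of-envelope check with illustrative numbers:

```python
# Back-of-envelope check on the savings claim: if a dev environment runs
# pods 24h/day today but is idle 16h/day, scale-to-zero removes roughly
# the idle fraction of pod compute.
idle_hours = 16
savings = idle_hours / 24
print(f"{savings:.0%}")   # 67%
```

In practice the realized savings land a bit lower (cold-start overlap, shared system pods), which is consistent with the ~65% quoted above.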


Scenario 5: GPU Workload Scaling for ML Inference


“Design autoscaling for a GPU-based ML inference workload that processes images. It needs P4d instances with NVIDIA A100 GPUs.”

Challenges:

  • GPU instances are expensive ($30+/hour) — cannot over-provision
  • GPU instances have limited availability — cannot always get them
  • Model loading takes 30-60 seconds — cold start is painful
  • Standard CPU metrics do not reflect GPU utilization

Solution:

# Karpenter NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  limits:
    cpu: "200"
    memory: 800Gi
    nvidia.com/gpu: "16"     # max 16 GPUs total
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 300s   # wait 5 min (GPU instances are slow to start)
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["p", "g"]    # GPU instance families
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d", "p5", "g5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # spot not reliable for GPUs
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
---
# HPA on a custom GPU metric (from DCGM exporter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference
  namespace: ml
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 1   # always keep 1 warm
  maxReplicas: 8
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 min (GPU instances are expensive to re-provision)
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent   # from DCGM exporter via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "70"   # scale when avg GPU util > 70%

GKE equivalent: Use GKE with GPU node pools and NAP:

resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  cluster  = google_container_cluster.primary.name
  location = "me-central1"

  autoscaling {
    min_node_count = 0
    max_node_count = 8
  }

  node_config {
    machine_type = "a2-highgpu-1g"   # NVIDIA A100

    guest_accelerator {
      type  = "nvidia-tesla-a100"
      count = 1
      gpu_driver_installation_config {
        gpu_driver_version = "LATEST"
      }
    }

    taint {
      key    = "nvidia.com/gpu"
      value  = "present"
      effect = "NO_SCHEDULE"
    }
  }
}

[Diagram: Autoscaling Decision Tree]