Autoscaling — HPA, VPA, Karpenter & Cluster Autoscaler
Where This Fits in the Enterprise Architecture
The central infra team deploys and configures the scaling controllers (Karpenter, KEDA, Metrics Server, Prometheus Adapter). Tenant teams define HPA and VPA on their workloads. The infra team sets NodePool constraints — instance families, availability zones, and budget limits — that tenants cannot override.
Pod-Level Autoscaling
Horizontal Pod Autoscaler (HPA)
HPA scales the number of pod replicas based on observed metrics. It runs a control loop every 15 seconds (default) and calculates the desired replica count:
```
desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))
```

Metric sources:
| Source | Example | Requires |
|---|---|---|
| Resource metrics | CPU, memory utilization | metrics-server |
| Custom metrics | requests_per_second, queue_depth | Prometheus Adapter |
| External metrics | SQS queue length, Pub/Sub backlog | KEDA or cloud provider adapter |
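The formula above can be sketched in Python. This is a simplified model of the control loop; the real controller also gates on pod readiness, and skips changes inside a roughly 10% tolerance band, which the `tolerance` parameter below stands in for:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA replica calculation (ignores readiness gating)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA makes no change, avoiding flapping
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 10 pods averaging 90% CPU against a 60% target -> scale to 15
print(desired_replicas(10, 90, 60))   # 15
# 10 pods at 63% vs a 60% target is within tolerance -> stay at 10
print(desired_replicas(10, 63, 60))   # 10
```

Note the asymmetry: because of `ceil`, scale-up overshoots slightly while scale-down is conservative, which is exactly what you want for availability.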
HPA with CPU and Memory
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # react fast to spikes
      policies:
        - type: Percent
          value: 100                   # double pods per scale-up
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min before scaling down
      policies:
        - type: Percent
          value: 25                    # remove 25% at a time
          periodSeconds: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```

HPA with Custom Metrics (Prometheus Adapter)
To scale on application-specific metrics like requests per second, you need the Prometheus Adapter:
```
+----------------+       +----------------------+       +------------------+
|  Prometheus    | ----> |  Prometheus Adapter  | ----> |  HPA Controller  |
| (scrapes pods) |       | (exposes as K8s API) |       |  (reads metrics) |
+----------------+       +----------------------+       +------------------+
        ^                                                        |
        |                                                        v
  Pod /metrics                                          Scale decision →
   endpoints                                           Deployment replicas
```

Prometheus Adapter ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

HPA using the custom metric:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-custom
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"   # scale when avg exceeds 500 rps/pod
```

HPA with External Metrics (SQS Queue)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: "orders-queue"
        target:
          type: AverageValue
          averageValue: "10"   # 1 pod per 10 messages
```

Vertical Pod Autoscaler (VPA)
VPA automatically adjusts CPU and memory requests (and optionally limits) based on historical usage. In enterprise environments, run VPA in recommendation-only mode — it provides right-sizing data without disrupting running pods.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"   # recommendation-only, no live updates
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
```

Reading VPA recommendations:
```bash
kubectl get vpa payments-api-vpa -n payments -o jsonpath='{.status.recommendation}' | jq .
```

The output tells you the actual right-sized values versus what you configured — feed this into your Helm values or Kustomize patches.
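That diff can be scripted. A hypothetical helper is sketched below: the JSON shape follows the VPA `.status.recommendation` schema, but the numbers and the configured request are illustrative:

```python
import json

# Example shape of `.status.recommendation` from a VPA object (values illustrative)
recommendation = json.loads("""
{"containerRecommendations": [{
  "containerName": "payments-api",
  "lowerBound":  {"cpu": "250m",  "memory": "300Mi"},
  "target":      {"cpu": "410m",  "memory": "420Mi"},
  "upperBound":  {"cpu": "1200m", "memory": "900Mi"}
}]}
""")

def to_millicores(cpu: str) -> int:
    """Convert a Kubernetes CPU quantity ('410m' or '2') to millicores."""
    return int(cpu[:-1]) if cpu.endswith("m") else int(cpu) * 1000

configured_cpu = to_millicores("1000m")  # what the Deployment currently requests
for rec in recommendation["containerRecommendations"]:
    target = to_millicores(rec["target"]["cpu"])
    print(f"{rec['containerName']}: requested {configured_cpu}m, "
          f"VPA target {target}m ({configured_cpu - target}m over-provisioned)")
```

In this illustrative case the container requests 1000m but only needs ~410m, so roughly 590m per replica could be returned to the cluster.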
Node-Level Autoscaling
Cluster Autoscaler
Cluster Autoscaler (CA) scales managed node groups based on pod scheduling failures. When a pod is Pending due to insufficient resources, CA adds a node. When nodes are underutilized for 10+ minutes, CA cordons and drains them.
Limitations of Cluster Autoscaler:
| Limitation | Impact |
|---|---|
| Tied to pre-defined node groups | Must predict instance types ahead of time |
| Slow provisioning (2-5 min) | Spiky workloads suffer cold-start delays |
| Scales one group at a time | Cannot mix instance types intelligently |
| No consolidation | Wastes money on underutilized nodes |
| Complex ASG management | Multiple ASGs for GPU, ARM, spot instances |
AWS: Cluster Autoscaler on EKS
Cluster Autoscaler on EKS is deployed via Helm and uses IRSA for IAM authentication. It auto-discovers node groups by tags and makes scaling decisions based on pending pods' resource requirements.

```yaml
# values.yaml for the cluster-autoscaler Helm chart
autoDiscovery:
  clusterName: prod-eks-cluster
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/prod-eks-cluster

awsRegion: me-south-1

rbac:
  serviceAccount:
    create: true
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/ClusterAutoscalerRole

extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-local-storage: false
  expander: least-waste          # pick the node group with least wasted resources
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  max-node-provision-time: 15m
```

IAM policy for the ClusterAutoscalerRole:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeImages",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}
```

GCP: Cluster Autoscaler on GKE

Cluster Autoscaler is built into GKE and enabled at the node pool level — no separate controller to deploy. GKE Autopilot goes further and eliminates node management entirely: Google manages scaling, and you only define pod resource requests.

```hcl
resource "google_container_node_pool" "general" {
  name     = "general-pool"
  cluster  = google_container_cluster.primary.name
  location = "me-central1"

  autoscaling {
    min_node_count  = 2
    max_node_count  = 20
    location_policy = "BALANCED"   # spread across zones
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    machine_type = "e2-standard-4"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

    labels = {
      team = "payments"
      tier = "general"
    }
  }
}
```

```hcl
resource "google_container_cluster" "autopilot" {
  name     = "prod-autopilot"
  location = "me-central1"

  enable_autopilot = true

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
}
```

Karpenter Deep Dive
Karpenter is a next-generation node provisioner, created by AWS, that can replace Cluster Autoscaler. It provisions nodes just-in-time from the full EC2 fleet, without pre-defined node groups.
Why Karpenter over Cluster Autoscaler?
| Aspect | Cluster Autoscaler | Karpenter |
|---|---|---|
| Instance selection | Pre-defined node groups | Any EC2 instance type |
| Provisioning speed | 2-5 minutes | 60-90 seconds |
| Right-sizing | Limited to group config | Picks optimal size per pod |
| Consolidation | None | Active bin-packing |
| Spot handling | Per-ASG spot config | Mixed spot/on-demand per NodePool |
| Maintenance | Manage multiple ASGs | One NodePool covers many types |
| Architecture | ASG-based | Direct EC2 fleet API |
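The "Right-sizing" row can be made concrete with a toy bin-packing sketch: given the pending pods' aggregate requests, pick the cheapest instance type that fits. The catalog below is illustrative, not live EC2 pricing:

```python
# Illustrative catalog: (name, vCPU, memory GiB, $/hr) — NOT real-time EC2 prices
CATALOG = [
    ("m6i.large",   2,  8, 0.096),
    ("m6i.xlarge",  4, 16, 0.192),
    ("m6i.2xlarge", 8, 32, 0.384),
]

def cheapest_fit(cpu_needed: float, mem_needed: float):
    """Return the cheapest instance type that can hold the pending pods,
    or None if nothing in the catalog is large enough."""
    fits = [i for i in CATALOG if i[1] >= cpu_needed and i[2] >= mem_needed]
    return min(fits, key=lambda i: i[3])[0] if fits else None

# 3 pending pods, each requesting 1 vCPU / 2 GiB -> 3 vCPU, 6 GiB total
print(cheapest_fit(3, 6))    # m6i.xlarge
```

Cluster Autoscaler can only grow a pre-defined group, even if its instance type is twice what the pods need; Karpenter evaluates something like the selection above across every instance type the NodePool allows.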
Karpenter v1 — NodePool
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  # Resource budget — prevents runaway scaling
  limits:
    cpu: "1000"      # max 1000 vCPUs across all nodes
    memory: 2000Gi

  # Weight for scheduling priority (higher = preferred)
  weight: 50

  # Disruption settings — how Karpenter optimizes existing nodes
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s   # how long to wait after a node becomes empty

    # Budgets control how many nodes can be disrupted at once
    budgets:
      - nodes: "10%"              # max 10% of nodes disrupted at once
      - nodes: "0"                # block disruption during peak hours
        schedule: "0 9 * * 1-5"   # Mon-Fri 9am
        duration: 8h              # until 5pm

  template:
    metadata:
      labels:
        team: platform
        tier: general
    spec:
      # Reference to AWS-specific configuration
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general

      # Instance type constraints
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # compute, general, memory optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]             # only 6th gen+ (c6i, m6i, r6i, etc.)
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["me-south-1a", "me-south-1b", "me-south-1c"]

      # Taints for workload isolation
      taints:
        - key: workload-type
          value: general
          effect: NoSchedule

      # Expire nodes after 720h (30 days) for AMI rotation
      expireAfter: 720h
```

Karpenter v1 — EC2NodeClass
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general
spec:
  # AMI selection
  amiSelectorTerms:
    - alias: al2023@latest   # Amazon Linux 2023, auto-updated

  # Network configuration — nodes go into private subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks-cluster
        network-tier: private

  # Security groups
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks-cluster

  # IAM role / instance profile for the nodes
  role: KarpenterNodeRole-prod-eks-cluster

  # Block device — encrypted EBS
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true
        kmsKeyID: arn:aws:kms:me-south-1:111111111111:key/key-id
        deleteOnTermination: true
        iops: 3000
        throughput: 125

  # Metadata options — enforce IMDSv2
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required   # IMDSv2 mandatory

  # Tags applied to all EC2 instances
  tags:
    Environment: production
    ManagedBy: karpenter
    CostCenter: platform-team
```

Karpenter Consolidation Explained
Consolidation is Karpenter's cost-optimization engine. It continuously evaluates whether nodes can be replaced with cheaper alternatives or removed entirely.
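The core question consolidation asks is: can this node's pods be repacked onto the free capacity of the remaining nodes? A minimal first-fit model of that check is sketched below. Real Karpenter also weighs replacement-node cost, PodDisruptionBudgets, and do-not-disrupt annotations; this sketch only checks CPU fit:

```python
def can_consolidate(candidate_pods_cpu: list,
                    other_nodes_free_cpu: list) -> bool:
    """First-fit-decreasing check: do the candidate node's pods
    fit into the free CPU of the remaining nodes?"""
    free = sorted(other_nodes_free_cpu, reverse=True)
    for pod in sorted(candidate_pods_cpu, reverse=True):
        for i, capacity in enumerate(free):
            if capacity >= pod:
                free[i] -= pod   # place the pod on this node
                break
        else:
            return False         # some pod does not fit anywhere
    return True

# Node running pods of 1.5 + 0.5 vCPU; two other nodes have 2.0 and 1.0 vCPU free
print(can_consolidate([1.5, 0.5], [2.0, 1.0]))   # True  -> node can be removed
print(can_consolidate([1.5, 1.5], [2.0, 1.0]))   # False -> node must stay
```

When the answer is True, Karpenter drains the candidate node and lets the scheduler place its pods elsewhere; when a cheaper replacement would fit, it can also swap the node rather than just delete it.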
Terraform for Karpenter Installation
```hcl
# Karpenter IAM role (for the controller itself)
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name = module.eks.cluster_name

  # Enable IRSA for the Karpenter controller
  enable_irsa                     = true
  irsa_oidc_provider_arn          = module.eks.oidc_provider_arn
  irsa_namespace_service_accounts = ["kube-system:karpenter"]

  # Node IAM role (for EC2 instances Karpenter launches)
  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Install Karpenter via Helm
resource "helm_release" "karpenter" {
  namespace  = "kube-system"
  name       = "karpenter"
  repository = "oci://public.ecr.aws/karpenter"
  chart      = "karpenter"
  version    = "1.1.0"

  values = [
    yamlencode({
      settings = {
        clusterName       = module.eks.cluster_name
        clusterEndpoint   = module.eks.cluster_endpoint
        interruptionQueue = module.karpenter.queue_name
      }
      serviceAccount = {
        annotations = {
          "eks.amazonaws.com/role-arn" = module.karpenter.iam_role_arn
        }
      }
      replicas = 2   # HA for the controller itself
    })
  ]

  depends_on = [module.karpenter]
}
```

GKE Node Auto-Provisioning (NAP)
Node Auto-Provisioning (NAP) is GKE's equivalent of intelligent node scaling: it automatically creates new node pools with optimal machine types based on pending pod requirements.
```hcl
resource "google_container_cluster" "primary" {
  name     = "prod-gke-cluster"
  location = "me-central1"

  cluster_autoscaling {
    enabled = true

    resource_limits {
      resource_type = "cpu"
      minimum       = 4
      maximum       = 500
    }
    resource_limits {
      resource_type = "memory"
      minimum       = 16
      maximum       = 2000
    }
    resource_limits {
      resource_type = "nvidia-tesla-t4"   # GPU limits
      minimum       = 0
      maximum       = 8
    }

    auto_provisioning_defaults {
      service_account = google_service_account.gke_nodes.email
      oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

      management {
        auto_repair  = true
        auto_upgrade = true
      }

      disk_size = 100
      disk_type = "pd-ssd"

      shielded_instance_config {
        enable_secure_boot          = true
        enable_integrity_monitoring = true
      }
    }
  }
}
```

Event-Driven Autoscaling with KEDA
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to scale on event sources: message queues, databases, HTTP metrics, cron schedules, and more.
KEDA scales to zero — something native HPA cannot do by default (minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled).
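Under the hood KEDA still drives an HPA on an external metric; for a queue trigger the effective target is essentially ceil(queueLength / targetPerPod), with KEDA itself handling the activation from and deactivation to zero. A simplified sketch of that calculation:

```python
import math

def keda_desired_replicas(queue_length: int, target_per_pod: int,
                          min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Simplified queue-trigger scaling: KEDA activates from zero when
    messages exist, then HPA math targets queue_length / target_per_pod."""
    if queue_length == 0:
        return min_replicas              # scale to zero when idle
    desired = math.ceil(queue_length / target_per_pod)
    return max(1, min(desired, max_replicas))

print(keda_desired_replicas(0, 5))      # 0  (idle -> scale to zero)
print(keda_desired_replicas(42, 5))     # 9
print(keda_desired_replicas(1000, 5))   # 50 (capped at maxReplicaCount)
```

This is why `queueLength: "5"` in the manifests below reads as "1 pod per 5 messages".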
KEDA ScaledObject for SQS Queue
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15   # check every 15 seconds
  cooldownPeriod: 120   # wait 2 min before scaling to zero
  minReplicaCount: 0    # scale to zero when idle
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.me-south-1.amazonaws.com/111111111111/orders-queue
        queueLength: "5"   # 1 pod per 5 messages
        awsRegion: me-south-1
        identityOwner: operator   # IRSA-based authentication
```

KEDA ScaledObject for Prometheus Metric
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: gateway
spec:
  scaleTargetRef:
    name: api-gateway
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{namespace="gateway"}[2m]))
        threshold: "1000"   # scale when total rps exceeds 1000
```

KEDA Cron Trigger (Predictive Scaling)
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: trading-platform
  namespace: trading
spec:
  scaleTargetRef:
    name: trading-engine
  minReplicaCount: 5
  maxReplicaCount: 200
  triggers:
    # Market hours: pre-scale before open
    - type: cron
      metadata:
        timezone: Asia/Dubai
        start: "0 8 * * 1-5"    # 8am Mon-Fri (Dubai market hours)
        end: "0 14 * * 1-5"     # 2pm close
        desiredReplicas: "50"   # pre-warm 50 replicas
    # Reactive scaling on top
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: order_rate
        query: sum(rate(trade_orders_total{namespace="trading"}[1m]))
        threshold: "100"
```

Scaling Architecture — Putting It All Together
The layered approach:
- VPA in Off mode — continuously recommends right-sized requests
- HPA or KEDA — scales pods horizontally based on demand
- Karpenter — provisions right-sized nodes when pods are pending
- Consolidation — Karpenter bin-packs and replaces underutilized nodes overnight
Interview Scenarios
Scenario 1: 10x Traffic Spikes During Business Hours
“Design autoscaling for a banking transaction service that sees 10x traffic spikes during market open (8am-2pm Dubai time) but is nearly idle overnight.”
Answer framework:
Predictive + Reactive scaling:
- KEDA cron trigger — pre-scale to 30 pods at 7:45am before market opens
- HPA on custom metric (transactions_per_second) — react to actual load above baseline
- Karpenter NodePool — provisions nodes in 60-90 seconds when pods go Pending
Node-level:
- Karpenter with mixed capacity: 70% on-demand (baseline), 30% spot (burst)
- Disruption budget `nodes: "0"` from 8am-3pm (no consolidation during trading hours)
- Allow consolidation overnight (3pm-7am) to save costs
Scale-down protection:
- HPA `stabilizationWindowSeconds: 600` for scale-down (10 min cooldown)
- Karpenter `consolidateAfter: 60s` only outside business hours
```
7:45am   KEDA cron fires → 30 pods warm
8:00am   Market opens → HPA scales 30→80 based on TPS
8:01am   Karpenter provisions 5 new nodes (60s each)
2:00pm   Market closes → traffic drops
2:10pm   HPA stabilization window (10 min) passes → scale to 40
3:00pm   Disruption window opens → Karpenter consolidates nodes
11:00pm  KEDA scales to 0 (minReplicaCount: 0 for batch processors)
```

Scenario 2: Karpenter vs Cluster Autoscaler — When Would You Choose Each?
“Your team is debating whether to use Karpenter or Cluster Autoscaler. What’s your recommendation?”
Choose Karpenter when:
- You need fast provisioning (<90 seconds vs 2-5 minutes)
- You want automatic cost optimization via consolidation
- Your workloads need diverse instance types (GPU, ARM, spot mix)
- You want to avoid managing multiple ASGs/managed node groups
- You need disruption budgets with schedule-based controls
Choose Cluster Autoscaler when:
- You run GKE (Karpenter does not support GKE; GKE has NAP/Autopilot instead)
- You have strict compliance requiring pre-approved instance types only
- Your existing infrastructure is deeply tied to ASG lifecycle hooks
- You need integration with AWS-native scaling (Target Tracking, Step Scaling)
Our recommendation for a new EKS deployment: Karpenter with a fallback managed node group for system components (CoreDNS, kube-proxy, Karpenter itself). Karpenter should not manage the nodes it runs on.
```
+---------------------------+     +---------------------------+
|   Managed Node Group      |     |  Karpenter-Managed Nodes  |
|   (system components)     |     |  (workload pods)          |
|   - CoreDNS               |     |  - Payment API            |
|   - kube-proxy            |     |  - Order Processor       |
|   - Karpenter controller  |     |  - Trading Engine         |
|   - Metrics Server        |     |  - Batch Jobs             |
| Always 2-3 nodes (fixed)  |     |  0-100 nodes (elastic)    |
+---------------------------+     +---------------------------+
```

Scenario 3: Pods Are Scaling But Nodes Are Not
“HPA scaled our deployment from 5 to 30 pods, but 20 pods are stuck in Pending. Nodes aren’t being added. What’s wrong?”
Debugging checklist:
```bash
# 1. Check pod events for scheduling failures
kubectl describe pod <pending-pod> -n payments
# Look for: "Insufficient cpu", "Insufficient memory", "no nodes available"

# 2. Check if Cluster Autoscaler is running
kubectl get pods -n kube-system | grep cluster-autoscaler
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=50

# 3. Check Karpenter logs (if using Karpenter)
kubectl logs -n kube-system deployment/karpenter --tail=50
# Look for: "could not launch node", "insufficient capacity"

# 4. Check NodePool limits
kubectl get nodepool general -o yaml
# Is the CPU/memory limit reached?

# 5. Check node group max size
aws eks describe-nodegroup --cluster-name prod --nodegroup-name general \
  --query 'nodegroup.scalingConfig'
```

Common root causes:
| Cause | Fix |
|---|---|
| Cluster Autoscaler not running | Redeploy CA, check IRSA permissions |
| NodePool CPU limit reached | Increase spec.limits.cpu |
| ASG max size reached | Increase max in managed node group |
| PDB blocking drain (for consolidation) | Review PDB maxUnavailable settings |
| Subnet out of IP addresses | Add subnets, use larger CIDR |
| Instance type unavailable (spot) | Add more instance families to NodePool |
| Taints without tolerations | Pods need matching tolerations |
| Node affinity too restrictive | Relax nodeAffinity rules |
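The first triage step, reading pod events, can be roughly automated. A hypothetical helper is sketched below; the patterns are common scheduler messages, but exact wording varies across Kubernetes versions, so treat the table as illustrative:

```python
# Map scheduler event patterns to likely root causes from the table above.
# Patterns are illustrative; real event text varies by Kubernetes version.
CAUSES = [
    ("Insufficient cpu",            "capacity: node group at max size or NodePool CPU limit reached"),
    ("Insufficient memory",         "capacity: node group at max size or NodePool memory limit reached"),
    ("untolerated taint",           "pod is missing a toleration for a node taint"),
    ("didn't match",                "nodeAffinity/nodeSelector too restrictive"),
    ("Insufficient nvidia.com/gpu", "no GPU capacity: check GPU NodePool limits"),
]

def triage(describe_output: str) -> str:
    """Classify `kubectl describe pod` event text into a likely root cause."""
    for pattern, cause in CAUSES:
        if pattern in describe_output:
            return cause
    return "no known pattern: check autoscaler controller logs"

events = "0/12 nodes are available: 12 Insufficient cpu."
print(triage(events))
```

First match wins, so order the patterns from most to least specific if you extend the list.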
Scenario 4: Scale-to-Zero for Dev/Staging Workloads
“How do you handle scale-to-zero for dev and staging environments to save costs?”
Implementation:
```yaml
# Scale-to-zero on HTTP traffic via a Prometheus request-rate trigger
# (the KEDA HTTP Add-on is an alternative for direct HTTP interception)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-api
  namespace: dev
spec:
  scaleTargetRef:
    name: dev-api
  cooldownPeriod: 900   # 15 min idle → scale to zero
  minReplicaCount: 0
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: http_requests_active
        query: |
          sum(rate(http_requests_total{namespace="dev",service="dev-api"}[5m]))
        threshold: "1"
```

Cost savings: dev environments that are idle 16+ hours/day save ~65% on compute. Multiply across 20 dev namespaces and it is significant.
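The ~65% figure follows directly from the idle hours; the arithmetic, assuming compute is billed per hour and scales to zero cleanly (the quoted ~65% sits slightly below this theoretical ceiling to allow for scale-up/down lag):

```python
def idle_savings(idle_hours_per_day: float) -> float:
    """Fraction of daily compute cost avoided by scaling to zero while idle."""
    return idle_hours_per_day / 24

# 16 idle hours/day -> ~67% theoretical savings per namespace
print(f"{idle_savings(16):.0%}")
```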
Scenario 5: GPU Workload Scaling for ML Inference
“Design autoscaling for a GPU-based ML inference workload that processes images. It needs P4d instances with NVIDIA A100 GPUs.”
Challenges:
- GPU instances are expensive ($30+/hour) — cannot over-provision
- GPU instances have limited availability — cannot always get them
- Model loading takes 30-60 seconds — cold start is painful
- Standard CPU metrics do not reflect GPU utilization
Solution:
```yaml
# Karpenter NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  limits:
    cpu: "200"
    memory: 800Gi
    nvidia.com/gpu: "16"   # max 16 GPUs total

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 300s   # wait 5 min (GPU instances are slow to start)

  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["p", "g"]   # GPU instance families
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d", "p5", "g5"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # spot not reliable for GPUs
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
---
# HPA on a custom GPU metric (from the DCGM exporter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference
  namespace: ml
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference
  minReplicas: 1   # always keep 1 warm
  maxReplicas: 8
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 min (GPU instances are expensive to re-provision)
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent   # from DCGM exporter via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "70"   # scale when avg GPU util > 70%
```

GKE equivalent: use GKE GPU node pools with NAP:
```hcl
resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-pool"
  cluster  = google_container_cluster.primary.name
  location = "me-central1"

  autoscaling {
    min_node_count = 0
    max_node_count = 8
  }

  node_config {
    machine_type = "a2-highgpu-1g"   # NVIDIA A100

    guest_accelerator {
      type  = "nvidia-tesla-a100"
      count = 1
      gpu_driver_installation_config {
        gpu_driver_version = "LATEST"
      }
    }

    taint {
      key    = "nvidia.com/gpu"
      value  = "present"
      effect = "NO_SCHEDULE"
    }
  }
}
```