Control Plane & Internals
Where This Fits
EKS and GKE clusters run in Workload Accounts/Projects, consuming VPCs from the Network Hub via Transit Gateway attachments (AWS) or Shared VPC service projects (GCP). The central infrastructure team provisions and manages the clusters. Tenant application teams deploy workloads into namespaces with RBAC boundaries, resource quotas, and network policies.
Kubernetes Control Plane Components
Every Kubernetes cluster has a control plane (the brain) and worker nodes (the muscle). Understanding each component at the API level is essential for architect-level interviews.
API Server (kube-apiserver)
The API server is the front door to the entire cluster. Every interaction goes through it: kubectl, kubelet, controllers, the scheduler, external admission webhooks — everything.
What it does:
- Exposes the Kubernetes API as a RESTful HTTP service
- Authenticates every request (client certificates, bearer tokens, OIDC, webhook)
- Authorizes every request (RBAC, ABAC, webhook, Node authorizer)
- Runs admission controllers (mutating and validating webhooks)
- Validates and persists objects to etcd
- Serves as the only component that talks to etcd directly
Key details for interviews:
- Stateless and horizontally scalable (EKS/GKE run multiple replicas behind a load balancer)
- Supports watch semantics: clients open long-lived HTTP connections to receive change notifications
- Request flow: Authentication -> Authorization -> Mutating Admission -> Schema Validation -> Validating Admission -> etcd write
- Rate limited via `--max-requests-inflight` and `--max-mutating-requests-inflight`
- API Priority and Fairness (APF) in newer versions allows flow control per user/group
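The six-stage request flow above can be sketched as a toy pipeline. This is an illustrative Python sketch; the function and stage names are invented for the example, not real API server code:

```python
# Illustrative sketch of the API server's request pipeline; stage names and
# object shapes are invented for the example, this is not real k8s code.
def handle_request(identity, verb, obj, authenticate, authorize,
                   mutators, validators, store):
    user = authenticate(identity)                 # 1. Authentication
    if user is None:
        raise PermissionError("401 Unauthorized")
    if not authorize(user, verb, obj["kind"]):    # 2. Authorization (RBAC)
        raise PermissionError("403 Forbidden")
    for mutate in mutators:                       # 3. Mutating admission
        obj = mutate(obj)
    if "spec" not in obj:                         # 4. Schema validation (toy)
        raise ValueError("invalid object")
    for validate in validators:                   # 5. Validating admission
        validate(obj)
    store[obj["metadata"]["name"]] = obj          # 6. Persist to etcd
    return obj

def default_limits(obj):  # example mutating webhook: inject default limits
    obj["spec"].setdefault("resources", {"limits": {"cpu": "500m"}})
    return obj

etcd = {}
handle_request("alice", "create",
               {"kind": "Pod", "metadata": {"name": "api"}, "spec": {}},
               authenticate=lambda i: i, authorize=lambda u, v, k: True,
               mutators=[default_limits], validators=[], store=etcd)
```

Note the ordering: mutating admission runs before schema validation and validating admission, which is why a webhook can inject defaults that the later stages then enforce.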
etcd
A distributed key-value store that holds ALL cluster state. If etcd is lost without backup, the cluster is unrecoverable.
What it stores:
- Every Kubernetes object: Pods, Deployments, Services, ConfigMaps, Secrets, RBAC rules
- All custom resources (CRDs and their instances)
- Lease objects for leader election (controller manager, scheduler)
How it works:
- Raft consensus protocol: requires a quorum (majority) for writes
- 3-node cluster tolerates 1 failure; 5-node cluster tolerates 2 failures
- Leader handles all writes; followers replicate
- Reads are linearizable by default; serializable (possibly stale) reads are an optional, faster mode
- Performance-sensitive: latency spikes in etcd directly impact API server response times
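The quorum arithmetic is worth being able to do on a whiteboard; a quick sketch:

```python
# Raft quorum arithmetic: writes commit only when a majority acknowledges.
def quorum(members: int) -> int:
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    return members - quorum(members)

for n in (1, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# Note: a 4-member cluster tolerates no more failures than a 3-member one,
# which is why etcd clusters use odd sizes.
```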
Enterprise considerations:
- EKS: AWS manages etcd entirely — encrypted at rest with AWS KMS, replicated across 3 AZs, automatic backups
- GKE: Google manages etcd — encrypted at rest with Google-managed or CMEK keys, replicated across zones
- Self-managed: you must handle backups (`etcdctl snapshot save`), defragmentation, compaction, and disk performance (SSD required)
Scheduler (kube-scheduler)
Watches for newly created Pods that have no node assignment and selects the best node to run them.
Scheduling algorithm (two phases):
1. Filtering — eliminates nodes that cannot run the Pod:
- Insufficient CPU or memory (vs resource requests)
- Node taints the Pod does not tolerate
- Node affinity rules not satisfied
- Pod topology spread constraints violated
- PVC zone constraints (EBS volumes are AZ-bound)
2. Scoring — ranks remaining nodes:
- Least requested resources (spread load)
- Node affinity preferences (soft rules)
- Inter-pod affinity/anti-affinity
- Image locality (node already has the container image)
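The two phases can be sketched in a few lines; the predicates and scoring weights below are simplified stand-ins for the real filter and score plugins:

```python
# Toy two-phase scheduler; predicates and scores are simplified stand-ins
# for the real filter/score plugins.
def schedule(pod, nodes):
    # Phase 1: Filtering, drop nodes that cannot run the pod
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= pod["cpu_request"]
        and pod.get("tolerates", set()) >= n.get("taints", set())
    ]
    if not feasible:
        return None  # pod stays Pending
    # Phase 2: Scoring, rank the survivors
    def score(n):
        s = n["free_cpu"]                        # prefer least-loaded nodes
        if pod["image"] in n.get("images", ()):
            s += 10                              # image locality bonus
        return s
    return max(feasible, key=score)["name"]

nodes = [
    {"name": "a", "free_cpu": 2, "taints": {"gpu"}},     # filtered: taint + CPU
    {"name": "b", "free_cpu": 4, "images": ("nginx",)},  # wins on image locality
    {"name": "c", "free_cpu": 8},
]
print(schedule({"cpu_request": 3, "image": "nginx"}, nodes))  # -> b
```

If filtering leaves no feasible node, the pod simply stays Pending, which is exactly the behavior you see with unsatisfiable resource requests or taints.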
Key details:
- Only one scheduler instance is active at a time (leader election via Lease object)
- Custom schedulers can run alongside the default scheduler
- Scheduler profiles (v1.25+) allow multiple scheduling configurations
- The scheduler writes a Binding object to the API server (not directly to etcd)
Controller Manager (kube-controller-manager)
Runs control loops (controllers) that watch cluster state and make changes to move from current state to desired state.
Key controllers:
| Controller | What It Does |
|---|---|
| Deployment | Creates/updates ReplicaSets based on Deployment spec |
| ReplicaSet | Ensures correct number of Pod replicas exist |
| Node | Monitors node health, marks NotReady, evicts pods after timeout |
| Job | Tracks job completions, manages pod creation for batch work |
| EndpointSlice | Populates endpoint slices for Services |
| ServiceAccount | Creates default ServiceAccount in new namespaces |
| Namespace | Handles namespace lifecycle (finalizer cleanup) |
| PV/PVC | Binds PersistentVolumeClaims to PersistentVolumes |
| TTL | Cleans up finished Jobs after TTL expires |
How a control loop works:
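The pattern is the same for every controller: observe actual state, diff it against desired state, and act to converge. A minimal Python sketch of a ReplicaSet-style loop (names are illustrative):

```python
# Sketch of the reconciliation pattern every controller follows: observe
# actual state, diff against desired state, act to converge.
def reconcile_replicaset(desired: int, pods: list) -> list:
    if len(pods) < desired:
        # Too few replicas: create pods to close the gap
        pods += [f"pod-{i}" for i in range(len(pods), desired)]
    elif len(pods) > desired:
        # Too many replicas: delete the surplus
        del pods[desired:]
    return pods  # already converged: no-op

# In a real controller this runs continuously, woken by watch events from
# the API server rather than by polling.
print(reconcile_replicaset(3, []))               # creates three pods
print(reconcile_replicaset(2, ["a", "b", "c"]))  # deletes the extra pod
```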
Cloud Controller Manager
Runs cloud-specific control loops that integrate Kubernetes with the underlying cloud provider.
What it manages:
- Node controller: registers nodes with cloud metadata (instance type, zone, external IP), removes nodes when instances are terminated
- Route controller: configures cloud network routes so pods can communicate across nodes
- Service controller: creates cloud load balancers (NLB, ALB, Cloud Load Balancer) when a Service has `type: LoadBalancer`
In EKS and GKE, this runs as part of the managed control plane. You interact with it indirectly through annotations on Services and Ingress resources.
kubelet
The agent that runs on every worker node. It is responsible for the actual container lifecycle.
What it does:
- Watches the API server for Pods assigned to its node
- Pulls container images via the container runtime (containerd)
- Starts, stops, and monitors containers
- Runs liveness, readiness, and startup probes
- Reports node status (capacity, allocatable, conditions) to the API server
- Manages volume mounts (calls CSI driver for persistent volumes)
- Manages pod sandbox creation via the Container Runtime Interface (CRI)
Key details:
- kubelet does NOT run as a container — it is a systemd service on the node
- Communicates with the API server over TLS (client certificate authentication)
- cAdvisor is embedded in the kubelet for container resource metrics
- Node allocatable = total capacity minus system reserved minus kube reserved minus eviction threshold
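The allocatable formula is simple arithmetic; the reserved amounts below are illustrative, not EKS/GKE defaults:

```python
# Node allocatable arithmetic (values in MiB; the reserved amounts are
# illustrative examples, not EKS/GKE defaults).
def allocatable(capacity: int, system_reserved: int,
                kube_reserved: int, eviction_threshold: int) -> int:
    return capacity - system_reserved - kube_reserved - eviction_threshold

mem = allocatable(capacity=16384, system_reserved=512,
                  kube_reserved=1024, eviction_threshold=100)
print(mem)  # 14748 MiB schedulable for pod requests
```

This is why a "16 GiB" node never offers 16 GiB to the scheduler: the gap is exactly the reserved and eviction slices.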
kube-proxy
Runs on every node and implements the Service abstraction by programming network rules.
Modes:
| Mode | How It Works | Trade-offs |
|---|---|---|
| iptables (default) | Creates iptables rules for each Service/endpoint pair | Simple, but O(n) rule evaluation; slow with 10K+ services |
| IPVS | Uses Linux IPVS (IP Virtual Server) kernel module | O(1) lookup, supports more LB algorithms, better at scale |
| nftables (v1.29+) | Uses nftables instead of iptables | Modern replacement, atomic rule updates |
What it does:
- ClusterIP: routes traffic from `cluster-ip:port` to a backend Pod IP
- NodePort: opens a port on every node, forwards to backend Pods
- LoadBalancer: works with the cloud controller to expose via an external LB
- Session affinity via `sessionAffinity: ClientIP`
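The O(n) vs O(1) trade-off from the modes table can be modeled with two plain data structures (a toy model, not how the kernel actually stores rules):

```python
# Toy model of service lookup: iptables mode scans rules linearly, IPVS uses
# a kernel hash table. Same answer, very different cost at 10K services.
rules = [(f"10.96.{i >> 8}.{i & 255}:80", f"pod-{i}") for i in range(10_000)]
table = dict(rules)

def iptables_lookup(vip):
    for match, backend in rules:   # O(n): every rule evaluated in order
        if match == vip:
            return backend

def ipvs_lookup(vip):
    return table.get(vip)          # O(1): hash lookup

vip = "10.96.39.15:80"             # service number 9999 (39*256 + 15)
assert iptables_lookup(vip) == ipvs_lookup(vip) == "pod-9999"
```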
CoreDNS
Cluster-internal DNS server deployed as a Deployment (typically 2 replicas) in kube-system.
What it resolves:
- `service-name.namespace.svc.cluster.local` -> ClusterIP
- `pod-ip-dashed.namespace.pod.cluster.local` -> Pod IP
- Headless services: returns individual Pod IPs (A records)
- External DNS: forwards to upstream DNS (VPC DNS resolver)
Configuration:
CoreDNS is configured via a ConfigMap (coredns in kube-system). Common customizations:
- Forward specific domains to on-prem DNS servers (hybrid cloud)
- Add custom DNS entries
- Enable logging for troubleshooting
Enterprise pattern:
CoreDNS Corefile (enterprise hybrid example):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward corp.bank.internal 10.200.0.2 10.200.0.3  # on-prem DNS
    forward . /etc/resolv.conf                        # VPC DNS resolver
    cache 30
    reload
}
```

Container Runtime (containerd)
The actual process that manages container lifecycle on the node.
- Kubernetes communicates with containerd via the CRI (Container Runtime Interface)
- Dockershim (direct Docker support) was removed in Kubernetes 1.24; containerd is the standard
- containerd pulls images, creates container sandboxes (via runc), manages storage
- Both EKS and GKE use containerd as the default runtime
- Image pull policies: `Always`, `IfNotPresent`, `Never`
How Components Communicate
The kubectl apply Flow
When you run kubectl apply -f deployment.yaml, here is exactly what happens:
Control Loop Flow (Deployment -> ReplicaSet -> Pod)
Service Traffic Flow
EKS — How AWS Implements the Control Plane
Section titled “EKS — How AWS Implements the Control Plane”EKS Architecture Overview
Key EKS Decisions
Control plane access:
- Public + Private (default): API server reachable from internet + from within VPC
- Private only (enterprise standard): API server only reachable from within VPC via ENIs. Requires VPN/Direct Connect or a bastion host. This is what banks use.
Node types:
| Type | When to Use | Bank Recommendation |
|---|---|---|
| Managed Node Groups | Standard workloads, auto-handles drain/upgrade | Primary choice |
| Self-Managed | Custom AMIs, GPU, specific kernel config | Special cases only |
| Fargate | Serverless pods, burst workloads, isolation per pod | Batch jobs, dev/test |
| Karpenter | Intelligent autoscaling, mixed instance types | Cost optimization |
VPC CNI (Amazon VPC CNI Plugin):
- Pods get real VPC IP addresses from the node’s subnet
- No overlay network; pods are directly routable in the VPC
- Enables Security Groups for Pods — apply VPC security groups to individual pods
- IP address management: each ENI provides a pool of secondary IPs
- Prefix delegation mode: assign /28 prefixes instead of individual IPs (more pod density)
- Custom networking: pods can use different subnets than nodes (separate CIDR for pods)
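Pod density follows from ENI math. A hedged sketch using the commonly documented formula max_pods = ENIs * (IPs per ENI - 1) + 2; treat the per-instance numbers as assumptions and verify them against AWS's published ENI limits:

```python
# Hedged sketch of VPC CNI pod density. Commonly documented formula:
#   max_pods = ENIs * (IPs_per_ENI - 1) + 2
# (each ENI's first IP is its primary; +2 covers host-networked system pods).
# Verify ENI/IP counts per instance type against AWS documentation.
def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False) -> int:
    if prefix_delegation:
        # each secondary slot holds a /28 prefix = 16 addresses
        return enis * (ips_per_eni - 1) * 16 + 2
    return enis * (ips_per_eni - 1) + 2

print(max_pods(3, 10))                  # m5.large-class instance: 29 pods
print(min(max_pods(3, 10, True), 110))  # prefix delegation, capped by kubelet max-pods
```

This is why prefix delegation matters: without it, small instance types run out of pod IPs long before they run out of CPU or memory.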
EKS Add-ons (managed by AWS):
- `vpc-cni` — networking
- `kube-proxy` — service routing
- `coredns` — cluster DNS
- `aws-ebs-csi-driver` — EBS persistent volumes
- `aws-efs-csi-driver` — EFS shared storage
- `adot` — AWS Distro for OpenTelemetry
- `aws-guardduty-agent` — runtime threat detection
EKS Pod Identity (replaces IRSA):
- Simpler than IRSA — no OIDC provider per cluster
- Associate a Kubernetes ServiceAccount with an IAM role via the EKS API
- Pod automatically gets temporary IAM credentials
- Supported in EKS add-ons and custom workloads
GKE Architecture Overview
Key GKE Decisions
Cluster type:
| Type | Control Plane | Nodes | SLA | Bank Recommendation |
|---|---|---|---|---|
| Zonal Standard | Single zone | You manage | 99.5% | Dev/test only |
| Regional Standard | 3 zones | You manage | 99.95% | Production workloads |
| Autopilot | 3 zones | Google manages | 99.95% | Simpler operations |
GKE Autopilot vs Standard:
- Standard: You create and manage node pools, choose instance types, handle node upgrades, configure node auto-provisioning
- Autopilot: Google manages everything below the Pod spec. You define workloads; Google provisions the right nodes. Per-pod billing. Built-in security hardening. Increasingly the recommended default.
Networking (VPC-native / alias IPs):
- Pods get IPs from secondary IP ranges on the subnet (alias IPs)
- No overlay network; pods are natively routable in the VPC
- Each node gets a /24 of pod IPs by default (max 110 pods per node)
- Shared VPC: GKE cluster in service project uses subnets from host project
- Dataplane V2 (default on new clusters): eBPF-based (Cilium), replaces kube-proxy, provides built-in network policy enforcement
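Because each node reserves a pod CIDR, the pods secondary range caps cluster size. A quick sketch of that arithmetic, assuming the default /24 per node:

```python
# GKE alias-IP capacity math: each node reserves a pod CIDR (default /24 =
# 256 addresses, hence the 110-pod default, keeping more than 2x addresses
# per pod). The pods secondary range therefore caps node count.
def max_nodes(pod_range_prefix: int, per_node_prefix: int = 24) -> int:
    return 2 ** (per_node_prefix - pod_range_prefix)

print(max_nodes(16))  # /16 pods range with /24 per node -> 256 nodes
print(max_nodes(14))  # /14 pods range -> 1024 nodes
```

Undersizing the pods secondary range on a Shared VPC subnet is a common and hard-to-fix mistake, since the range is chosen at cluster creation.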
Release channels:
| Channel | Update Speed | Stability | Use Case |
|---|---|---|---|
| Rapid | Latest versions, frequent updates | Least stable | Testing new features |
| Regular | Balanced (2-3 months after Rapid) | Good stability | Non-critical production |
| Stable | Most tested (2-3 months after Regular) | Most stable | Critical production |
| Extended | Extra long support (24 months) | Patch-only | Banks, regulated industries |
GKE Enterprise (formerly Anthos):
- Fleet management: manage multiple GKE clusters as one logical unit
- Config Sync: GitOps-based configuration management across clusters
- Policy Controller: OPA-based policy enforcement
- Service Mesh (managed Istio)
- Multi-cluster Services: service discovery across clusters
Workload Identity Federation (for GKE):
- Map a Kubernetes ServiceAccount to a Google Cloud IAM service account
- Pods automatically receive GCP credentials without storing keys
- Enabled per node pool; requires annotation on the K8s ServiceAccount
Terraform — EKS vs GKE Cluster Provisioning
```hcl
# ============================================================
# EKS Cluster — Private Endpoint, Managed Node Groups
# Workload Account, consuming VPC from Network Hub via TGW
# ============================================================

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-banking-eks"
  cluster_version = "1.31"

  # Private cluster — API server only accessible within VPC
  cluster_endpoint_public_access  = false
  cluster_endpoint_private_access = true

  # VPC from Network Hub (TGW-attached subnets)
  vpc_id     = data.aws_vpc.workload.id
  subnet_ids = data.aws_subnets.private.ids

  # Control plane logging
  cluster_enabled_log_types = [
    "api", "audit", "authenticator", "controllerManager", "scheduler"
  ]

  # KMS encryption for secrets
  cluster_encryption_config = {
    provider_key_arn = aws_kms_key.eks_secrets.arn
    resources        = ["secrets"]
  }

  # EKS Add-ons (managed by AWS)
  cluster_addons = {
    vpc-cni = {
      most_recent = true
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true" # more pod IPs per node
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
    kube-proxy = { most_recent = true }
    coredns    = { most_recent = true }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_irsa.iam_role_arn
    }
  }

  # Managed Node Groups
  eks_managed_node_groups = {
    # General purpose — application workloads
    general = {
      ami_type       = "AL2023_x86_64_STANDARD"
      instance_types = ["m7i.xlarge", "m6i.xlarge"]
      capacity_type  = "ON_DEMAND"

      min_size     = 3
      max_size     = 20
      desired_size = 6

      # Spread across AZs
      subnet_ids = data.aws_subnets.private.ids

      labels = {
        workload-type = "general"
      }

      # Node group update config
      update_config = {
        max_unavailable_percentage = 33 # rolling update
      }
    }

    # Memory-optimized — caching, in-memory processing
    memory_optimized = {
      ami_type       = "AL2023_x86_64_STANDARD"
      instance_types = ["r7i.2xlarge", "r6i.2xlarge"]
      capacity_type  = "ON_DEMAND"

      min_size     = 0
      max_size     = 10
      desired_size = 2

      labels = {
        workload-type = "memory-optimized"
      }

      taints = [{
        key    = "workload-type"
        value  = "memory-optimized"
        effect = "NO_SCHEDULE"
      }]
    }
  }

  # Access configuration — use EKS access entries (not aws-auth ConfigMap)
  authentication_mode = "API_AND_CONFIG_MAP"

  access_entries = {
    platform_admins = {
      principal_arn = "arn:aws:iam::role/PlatformAdminRole"
      policy_associations = {
        admin = {
          policy_arn   = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
          access_scope = { type = "cluster" }
        }
      }
    }
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
    Team        = "platform-engineering"
  }
}

# Pod Identity association (replaces IRSA for new workloads)
resource "aws_eks_pod_identity_association" "app" {
  cluster_name    = module.eks.cluster_name
  namespace       = "payments"
  service_account = "payments-api"
  role_arn        = aws_iam_role.payments_api.arn
}

# ============================================================
# GKE Regional Cluster — Private, Shared VPC, Workload Identity
# Workload Project, consuming subnets from Network Host Project
# ============================================================

resource "google_container_cluster" "prod" {
  name     = "prod-banking-gke"
  location = "me-central1" # Doha region (or me-central2 for Dammam)
  project  = var.workload_project_id

  # Remove default node pool — we manage our own
  remove_default_node_pool = true
  initial_node_count       = 1

  # Stable release channel for regulated workloads
  release_channel {
    channel = "STABLE"
  }

  # Shared VPC networking
  network    = "projects/${var.host_project_id}/global/networks/${var.vpc_name}"
  subnetwork = "projects/${var.host_project_id}/regions/me-central1/subnetworks/gke-prod"

  # VPC-native (alias IPs)
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"     # secondary range on subnet
    services_secondary_range_name = "services" # secondary range on subnet
  }

  # Private cluster
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = true # no public API server
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Authorized networks (access from bastion / VPN)
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8"
      display_name = "internal-network"
    }
  }

  # Workload Identity
  workload_identity_config {
    workload_pool = "${var.workload_project_id}.svc.id.goog"
  }

  # Dataplane V2 (eBPF/Cilium — replaces kube-proxy)
  datapath_provider = "ADVANCED_DATAPATH"

  # Binary Authorization — only deploy signed images
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }

  # Encrypt secrets with CMEK
  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.gke_secrets.id
  }

  # Logging and monitoring
  logging_config {
    enable_components = [
      "SYSTEM_COMPONENTS", "WORKLOADS", "APISERVER", "SCHEDULER", "CONTROLLER_MANAGER"
    ]
  }

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus {
      enabled = true
    }
  }

  # Security posture
  security_posture_config {
    mode               = "BASIC"
    vulnerability_mode = "VULNERABILITY_ENTERPRISE"
  }
}

# General-purpose node pool
resource "google_container_node_pool" "general" {
  name     = "general"
  location = "me-central1"
  cluster  = google_container_cluster.prod.name
  project  = var.workload_project_id

  # Multi-zone (regional cluster spreads across 3 zones)
  node_count = 2 # per zone = 6 total

  autoscaling {
    min_node_count  = 2
    max_node_count  = 10
    location_policy = "BALANCED"
  }

  # Surge upgrades
  management {
    auto_repair  = true
    auto_upgrade = true
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0 # zero-downtime node upgrades
    strategy        = "SURGE"
  }

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-ssd"

    # Workload Identity on the node pool
    workload_metadata_config {
      mode = "GKE_METADATA"
    }

    # Shielded nodes
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    # Use Container-Optimized OS
    image_type = "COS_CONTAINERD"

    # No external IP — private nodes
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]

    labels = {
      workload-type = "general"
    }
  }
}

# Workload Identity binding
resource "google_service_account_iam_member" "payments_wi" {
  service_account_id = google_service_account.payments_api.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.workload_project_id}.svc.id.goog[payments/payments-api]"
}
```

EKS vs GKE — Side-by-Side Comparison
| Feature | EKS | GKE |
|---|---|---|
| Control plane HA | 3 AZs (always) | Regional: 3 zones; Zonal: 1 zone |
| SLA | 99.95% | Regional: 99.95%; Zonal: 99.5% |
| etcd | AWS-managed, KMS encrypted | Google-managed, CMEK optional |
| Pod networking | VPC CNI (real VPC IPs) | VPC-native (alias IPs) |
| Service routing | kube-proxy (iptables/IPVS) | Dataplane V2 (eBPF/Cilium) |
| Serverless | Fargate (per-pod, no visible nodes) | Autopilot (managed nodes, per-pod billing) |
| IAM integration | Pod Identity / IRSA | Workload Identity Federation |
| Version management | Manual or managed upgrades | Release channels (Rapid/Regular/Stable/Extended) |
| Add-on management | EKS Add-ons (vpc-cni, coredns, etc.) | GKE Add-ons (auto-managed) |
| Multi-cluster | EKS Connector (limited) | GKE Enterprise / Fleet management |
| Node upgrades | Rolling update per node group | Surge upgrades per node pool |
| Network policy | Calico (add-on) | Dataplane V2 (built-in) |
| Image security | ECR scanning, GuardDuty | Binary Authorization, Artifact Analysis |
| Hybrid/on-prem | EKS Anywhere | GKE Enterprise (Anthos) |
Interview Scenarios
Scenario 1: “Walk me through how a pod gets scheduled in Kubernetes”
What the interviewer wants: Proof that you understand the full component chain, not just “the scheduler picks a node.”
Strong answer structure:
“When I run `kubectl apply`, the request hits the API server, which authenticates the client — in EKS that means validating the IAM identity via the OIDC authenticator, or in GKE it is validated via Google OAuth. Then RBAC authorization checks whether this identity can create a Deployment in the target namespace.

Next, mutating admission controllers run — this is where things like Istio sidecar injection, default resource limits from LimitRange, or OPA/Gatekeeper policy mutations happen. Then validating admission controllers enforce policies like ‘all containers must have resource limits’ or ‘images must come from our approved registry.’
Once the Deployment object is persisted in etcd, the Deployment controller creates a ReplicaSet. The ReplicaSet controller then creates the individual Pod objects — but without a node assignment yet.
The scheduler is watching for these unscheduled pods. It runs the filtering phase — eliminating nodes where the pod cannot fit due to resource requests, taints, affinity rules, or topology constraints. Then the scoring phase ranks remaining nodes by criteria like balanced resource utilization, topology spread, and image locality.
Once the scheduler picks a node, it writes a Binding object. The kubelet on that node sees the new assignment via its watch, pulls the container image through containerd, calls the CNI plugin (VPC CNI on EKS, Cilium on GKE) to set up networking, calls the CSI driver if volumes are needed, and starts the containers.
Finally, the kubelet runs startup probes (if configured), then readiness probes. Once readiness passes, the EndpointSlice controller adds the pod’s IP to the Service’s endpoint list, and kube-proxy (or Dataplane V2 on GKE) updates its routing rules so traffic can reach the pod.”
Scenario 2: “What happens when a node goes down?”
Answer:
- The kubelet stops renewing its Lease in `kube-node-lease` and stops posting node status.
- After the node-monitor-grace-period (roughly 40s by default), the node controller marks the node NotReady and applies the `node.kubernetes.io/not-ready` / `unreachable` taints.
- Pods tolerate those taints for 300 seconds by default (`tolerationSeconds`), after which they are evicted; their controllers (ReplicaSet, StatefulSet, etc.) create replacements that the scheduler places on healthy nodes.
- EndpointSlices are updated so Service traffic stops flowing to the dead pods.
- If the cloud instance was actually terminated, the cloud controller manager deletes the Node object entirely.
Scenario 3: “Design an EKS cluster for production at a bank”
Section titled “Scenario 3: “Design an EKS cluster for production at a bank””Key decisions to walk through:
1. Networking: Private endpoint only. VPC from Network Hub via TGW. VPC CNI with prefix delegation for pod density. Custom networking for separate pod CIDR.
2. Node strategy: Managed node groups with `m7i.xlarge` for general, `r7i.2xlarge` for memory-intensive. On-Demand for production (no Spot for banking). Karpenter for intelligent bin-packing.
3. Security: EKS Pod Identity for IAM roles. KMS encryption for secrets. GuardDuty EKS Runtime Monitoring. No public ECR — use private ECR in Shared Services account with cross-account access.
4. Multi-tenancy: Namespace per team. RBAC via EKS access entries. ResourceQuotas and LimitRanges. Network policies (Calico).
5. Observability: Control plane logging to CloudWatch (audit, API, authenticator). Prometheus + Grafana in Shared Services account. Fluent Bit DaemonSet for application logs.
6. Upgrades: Blue-green node groups for zero-downtime. Test in staging first. PodDisruptionBudgets on all workloads.
Scenario 4: “GKE Standard vs Autopilot — when do you choose each?”
Standard when:
- You need DaemonSets with privileged access (legacy security agents)
- Specific machine types (GPU, high-memory) with fine-grained control
- Custom node images or kernel tuning
- You want control over node pool topology and placement
Autopilot when:
- You want minimal operational overhead
- Per-pod billing is preferred (no paying for unused node capacity)
- Security hardening out of the box (no SSH to nodes, no privileged containers by default)
- Teams should focus on workloads, not infrastructure
- Recommended default for new GKE clusters unless you have a specific Standard requirement
“At a bank, I would start with GKE Autopilot for application workloads where the team does not need node-level control. For workloads requiring specific hardware (GPU for ML, high-IOPS for databases) or legacy security agents running as privileged DaemonSets, I would use Standard with dedicated node pools. A fleet can mix both.”
Scenario 5: “How do EKS and GKE differ in control plane architecture?”
| Dimension | EKS | GKE |
|---|---|---|
| Isolation | Control plane in AWS-managed VPC, ENIs in your VPC | Control plane in Google-managed VPC, peered to yours |
| Access | NLB endpoint (public/private) | Internal LB (private endpoint) |
| Networking | VPC CNI (real VPC IPs, iptables-based routing) | VPC-native alias IPs, Dataplane V2 (eBPF) |
| Version control | Manual upgrade + managed add-on updates | Release channels with auto-upgrade |
| IAM binding | Pod Identity / IRSA (OIDC federation) | Workload Identity Federation (metadata server) |
| etcd access | None (fully managed) | None (fully managed) |
| Logging | CloudWatch Logs (opt-in per log type) | Cloud Logging (opt-in per component) |
“Both are managed — you never touch API server flags or etcd. The biggest architectural difference is networking: EKS uses VPC CNI where pods get first-class VPC IPs and you still use kube-proxy for service routing. GKE uses Dataplane V2 which is eBPF-based, replaces kube-proxy entirely, and provides built-in network policy enforcement without needing Calico.”
Scenario 6: “How do you handle Kubernetes version upgrades across 10 clusters?”
Strategy:
EKS approach:
- Upgrade control plane version (AWS handles this, ~15 min)
- Upgrade managed add-ons (VPC CNI, CoreDNS, kube-proxy, CSI drivers)
- Create new node group with new AMI → drain old node group → delete old
- Use PodDisruptionBudgets to prevent service disruption during drain
GKE approach:
- Use release channels — clusters auto-upgrade within channel cadence
- Stable channel for production (most tested, longest lead time)
- Maintenance windows to control when auto-upgrades happen
- Surge upgrade settings: `max_surge = 1, max_unavailable = 0` for zero-downtime node upgrades
- Use GKE Enterprise fleet management to orchestrate upgrades across clusters
Scenario 7: “Explain the difference between EKS Fargate and GKE Autopilot”
| Dimension | EKS Fargate | GKE Autopilot |
|---|---|---|
| What it manages | Individual pods in micro-VMs | Nodes (but you never manage them) |
| Node visibility | No nodes visible (kubectl get nodes shows virtual nodes) | Nodes are visible but fully managed |
| Billing | Per pod (vCPU + memory per second) | Per pod (CPU, memory, ephemeral storage) |
| DaemonSets | Not supported | Supported |
| Privileged containers | Not supported | Supported (with restrictions) |
| GPU | Not supported | Supported |
| Persistent volumes | EFS only (no EBS) | PD-SSD, PD-Balanced supported |
| Startup time | 30-60 seconds (cold start) | Standard pod startup |
| Max pods per namespace | Fargate profile limits | No special limits |
| Security | Strong isolation (micro-VM per pod) | Node-level isolation (hardened) |
“Fargate is true serverless — each pod runs in its own Firecracker micro-VM with no shared kernel. But the restrictions are significant: no DaemonSets means no Datadog agent, no Fluentd. Autopilot is more like ‘managed nodes’ — Google handles the underlying nodes, but you can still run DaemonSets, GPU workloads, and use block storage. For a bank, I would use Fargate selectively for batch jobs and isolated workloads, and use managed node groups for everything else on EKS. On GKE, Autopilot is viable for most workloads.”
Scenario 8: “Your API server is slow — what could be causing it?”
Diagnostic checklist:
How to diagnose on EKS:
```sh
# Check API server metrics (EKS exposes via /metrics endpoint from within VPC)
kubectl get --raw /metrics | grep apiserver_request_duration_seconds

# Check for slow requests
kubectl get --raw /metrics | grep apiserver_request_total | grep SLOW

# Check etcd request latency
kubectl get --raw /metrics | grep etcd_request_duration_seconds

# List all webhooks and check for failures
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Check API server audit logs (CloudWatch)
# Look for high-latency requests, 429 (throttled), 504 (timeout)

# Check if control plane events exist
kubectl get events -n kube-system --sort-by='.lastTimestamp'
```

How to diagnose on GKE:
```sh
# GKE Cloud Logging — API server logs
# Filter: resource.type="k8s_cluster" AND
#         protoPayload.methodName="io.k8s.*" AND
#         protoPayload.status.code >= 500

# GKE Metrics Explorer
# kubernetes.io/apiserver/request_duration_seconds
# kubernetes.io/apiserver/request_count (group by response code)

# Check webhook configurations
kubectl get mutatingwebhookconfigurations -o json | \
  jq '.items[].webhooks[].timeoutSeconds'
```

Key Takeaways for Interviews
- Know the flow. The `kubectl apply` -> API server -> etcd -> controller -> scheduler -> kubelet chain is the most asked Kubernetes question. Practice explaining it without notes.
- EKS vs GKE is not “which is better.” It is about trade-offs: VPC CNI vs alias IPs, kube-proxy vs Dataplane V2, IRSA/Pod Identity vs Workload Identity, manual upgrades vs release channels.
- Enterprise decisions matter more than component theory. The interviewer wants to hear “private endpoint, KMS encryption, managed node groups, Pod Identity” — not just “the scheduler assigns pods to nodes.”
- Always think about failure modes. What happens when a node dies? What happens when etcd is slow? What happens when a webhook is down? These are the questions that separate senior engineers from architects.
- Managed does not mean you ignore it. You still need to understand what EKS/GKE manages for you, because when something goes wrong, you need to know where to look. You cannot debug API server latency if you do not know which components are involved.
References
- EKS Best Practices Guide — operational excellence, security, reliability, and cost optimization for EKS
- EKS Workshop — hands-on labs covering EKS fundamentals, autoscaling, observability, and security
- Google Kubernetes Engine (GKE) Documentation — cluster creation, management, and best practices
- GKE Cluster Architecture — control plane and node architecture details
Tools & Frameworks
- Kubernetes Cluster Architecture — official Kubernetes documentation on control plane components and node architecture
- Kubernetes Components — API server, etcd, scheduler, controller manager, and kubelet