Troubleshooting — 17 Debug Scenarios
Your clusters are running. Alerts are firing. Something is broken at 2 AM. This page is your daily reference — 17 scenarios you will encounter repeatedly as the central infra team managing EKS/GKE for 50+ tenant teams.
Where This Fits
Essential Debug Toolkit
Before diving into scenarios, here are the commands you will use in every single debugging session:
# The Big Five — run these first for ANY pod issue
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl get events -n <ns> --sort-by='.metadata.creationTimestamp' | tail -20
kubectl top pod -n <ns>
# Node-level
kubectl get nodes -o wide
kubectl describe node <node>
kubectl top nodes
# Network
kubectl get svc,ep,ingress -n <ns>
kubectl get networkpolicy -n <ns>
# Auth
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>

Scenario 1: Pod Stuck in Pending
Symptoms
Pod stays in Pending status indefinitely. No node is assigned.
$ kubectl get pods -n payments
NAME                          READY   STATUS    RESTARTS   AGE
payment-api-7d4f8b6c9-x2k4l   0/1     Pending   0          12m
payment-api-7d4f8b6c9-m9n3p   0/1     Pending   0          12m

Debug Commands
# Step 1: Describe the pod — look at the Events section at the bottom
$ kubectl describe pod payment-api-7d4f8b6c9-x2k4l -n payments
# ...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12m   default-scheduler  0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had taint {team=data: NoSchedule} that the pod didn't tolerate.
# Step 2: Check node capacity and allocatable resources
$ kubectl describe nodes | grep -A 5 "Allocated resources"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       3800m (95%)   7600m (190%)
  memory    6Gi (82%)     12Gi (164%)
# Step 3: Check if there are taints blocking scheduling
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME                         TAINTS
ip-10-1-1-100.ec2.internal   [map[effect:NoSchedule key:team value:data]]
ip-10-1-1-101.ec2.internal   [map[effect:NoSchedule key:team value:data]]
ip-10-1-1-102.ec2.internal   [map[effect:NoSchedule key:team value:data]]
ip-10-1-2-200.ec2.internal   <none>
ip-10-1-2-201.ec2.internal   <none>
# Step 4: Check PVC binding if the pod uses persistent volumes
$ kubectl get pvc -n payments
NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
payment-data   Pending                                      gp3-encrypted   12m

Root Cause
One or more of these:
- Insufficient resources — all nodes are at capacity (cpu/memory requests exhausted)
- Taints not tolerated — nodes have taints the pod doesn’t tolerate
- Node affinity mismatch — pod requires specific node labels that no node has
- PVC not bound — the pod references a PVC that is stuck in Pending (see Scenario 7)
- Pod topology spread constraints — cannot satisfy distribution requirements
- ResourceQuota exceeded — namespace quota for cpu/memory is maxed out
# Check ResourceQuota
$ kubectl get resourcequota -n payments
NAME             AGE   REQUEST                                                 LIMIT
payments-quota   30d   requests.cpu: 3900m/4000m, requests.memory: 6.5Gi/8Gi   limits.cpu: 7800m/8000m, limits.memory: 13Gi/16Gi

Fix + Prevention
# Immediate: If resource shortage, scale up nodes or reduce requests
# For Karpenter — it should auto-provision. If not, see Scenario 16.
# For Cluster Autoscaler:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# If taint issue, add tolerations to the pod spec:
# spec.tolerations:
# - key: "team"
#   operator: "Equal"
#   value: "payments"
#   effect: "NoSchedule"
# If ResourceQuota, request increase or optimize existing workloads:
kubectl patch resourcequota payments-quota -n payments \
  --type='json' -p='[{"op":"replace","path":"/spec/hard/requests.cpu","value":"6000m"}]'

Prevention:
- Set up Prometheus alerts for node utilization > 80%
- Use Karpenter/NAP for just-in-time node provisioning
- Enforce `LimitRange` so teams cannot request excessive resources
- Review `ResourceQuota` during team onboarding
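The quota readout above can be sanity-checked with plain shell arithmetic. This is a sketch using the example figures from the `kubectl get resourcequota` output (3900m requested of a 4000m hard cap), not live cluster data:

```shell
# Headroom left under the namespace CPU request quota.
# "3900m/4000m" means 3900 millicores requested against a 4000m hard cap.
used="3900m"
hard="4000m"
headroom=$(( ${hard%m} - ${used%m} ))
echo "cpu request headroom: ${headroom}m"
# Any new pod whose requests exceed this is rejected by quota admission.
```

With only 100m free, the two `payment-api` replicas above cannot be admitted; raising `requests.cpu` (as in the patch shown) or right-sizing existing workloads restores headroom.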
Scenario 2: CrashLoopBackOff
Symptoms
Pod starts, crashes, restarts, crashes again. Backoff delay increases each time.
$ kubectl get pods -n checkout
NAME                           READY   STATUS             RESTARTS     AGE
checkout-svc-5f6d7c8b9-k3m2n   0/1     CrashLoopBackOff   7 (2m ago)   15m

Debug Commands
# Step 1: Check the PREVIOUS container's logs (current container already crashed)
$ kubectl logs checkout-svc-5f6d7c8b9-k3m2n -n checkout --previous
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation]
goroutine 1 [running]:
main.main()
        /app/main.go:42 +0x1a4
# Step 2: If no --previous logs, check current attempt
$ kubectl logs checkout-svc-5f6d7c8b9-k3m2n -n checkout
Error: required environment variable DATABASE_URL not set
# Step 3: Check exit code and reason from describe
$ kubectl describe pod checkout-svc-5f6d7c8b9-k3m2n -n checkout
# Look for:
    Last State:  Terminated
      Reason:    Error
      Exit Code: 1
      Started:   Sun, 15 Mar 2026 02:14:22 +0000
      Finished:  Sun, 15 Mar 2026 02:14:23 +0000
# Or:
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137
# Step 4: Check if ConfigMap/Secret exists
$ kubectl get configmap checkout-config -n checkout
Error from server (NotFound): configmaps "checkout-config" not found
# Step 5: Check if liveness probe is killing the container
$ kubectl describe pod checkout-svc-5f6d7c8b9-k3m2n -n checkout | grep -A 10 "Liveness"
    Liveness: http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
Events:
  Warning  Unhealthy  2m (x9 over 14m)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    2m (x3 over 12m)  kubelet  Container checkout failed liveness probe, will be restarted

Root Cause
Common causes ranked by frequency:
- Missing ConfigMap/Secret — app requires env vars that don’t exist in the cluster
- Application bug — nil pointer, unhandled exception on startup
- OOMKilled — container exceeds memory limit (exit code 137). See Scenario 10
- Liveness probe too aggressive — app needs 30s to start, probe fails at 5s
- Wrong command/args — container entrypoint is incorrect
- Permissions — app cannot read files, connect to database, or access cloud APIs
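Exit code 137 from the list above is not arbitrary: it is 128 plus signal number 9 (SIGKILL), the signal the kernel OOM killer and the kubelet use. A quick local demonstration:

```shell
# A process killed with SIGKILL exits with 128 + 9 = 137 —
# the same code `kubectl describe` shows for an OOMKilled container.
sh -c 'kill -9 $$'
echo "observed exit code: $?"
```

A small exit code (1, 2, ...) points at the application itself; 137 points at something outside the process killing it.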
Fix + Prevention
# Missing config — create the missing ConfigMap
kubectl create configmap checkout-config -n checkout \
  --from-literal=DATABASE_URL="postgresql://db.internal:5432/checkout"
# Liveness probe too aggressive — increase initialDelaySeconds
# In the Deployment spec:
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: 8080
#   initialDelaySeconds: 30   # <-- was 5, increase this
#   periodSeconds: 10
#   failureThreshold: 5       # <-- was 3, more tolerance
# OOMKilled — increase memory limit
kubectl set resources deployment checkout-svc -n checkout \
  --limits=memory=512Mi --requests=memory=256Mi
# Quick restart without waiting for backoff
kubectl delete pod checkout-svc-5f6d7c8b9-k3m2n -n checkout

Prevention:
- Use `startupProbe` for slow-starting apps (separate from liveness)
- Enforce ExternalSecrets or Sealed Secrets so configs are always present via GitOps
- Set up VPA in recommendation mode to right-size memory limits
- Run pre-deploy checks in CI: validate all referenced ConfigMaps/Secrets exist
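The growing restart delay in the symptoms is kubelet's exponential backoff: it starts around 10s, doubles on each crash, caps at five minutes, and resets after roughly ten minutes of stable running. A sketch of the resulting schedule:

```shell
# CrashLoopBackOff delay schedule: starts at 10s, doubles each restart,
# capped at 300s. (The counter resets after ~10 minutes of healthy running.)
delay=10
schedule=""
for attempt in 1 2 3 4 5 6 7; do
  schedule="${schedule}${schedule:+ }${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "restart delays: $schedule"
```

This is why deleting the pod is listed as the quick restart: the backoff counter is discarded along with the pod.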
Scenario 3: ImagePullBackOff
Symptoms
Pod cannot pull its container image. Status flips between ErrImagePull and ImagePullBackOff.
$ kubectl get pods -n fraud-detection
NAME                            READY   STATUS             RESTARTS   AGE
fraud-ml-model-6b7c8d9e0-p4q5   0/1     ImagePullBackOff   0          8m

Debug Commands
# Step 1: Describe pod — look at Events
$ kubectl describe pod fraud-ml-model-6b7c8d9e0-p4q5 -n fraud-detection
Events:
  Warning  Failed  8m  kubelet  Failed to pull image "123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml:v2.3.1": rpc error: code = Unknown desc = Error response from daemon: pull access denied for 123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml, repository does not exist or may require 'docker login'
# Step 2: Check if the image actually exists in ECR
$ aws ecr describe-images --repository-name fraud-ml \
    --image-ids imageTag=v2.3.1 --region eu-west-1
# Error: ImageNotFoundException
# Step 3: Check imagePullSecrets configuration
$ kubectl get pod fraud-ml-model-6b7c8d9e0-p4q5 -n fraud-detection \
    -o jsonpath='{.spec.imagePullSecrets}'
[]   # <-- empty, no pull secret configured
# Step 4: Check ServiceAccount for ECR/GAR pull permissions
$ kubectl get sa -n fraud-detection -o yaml | grep -A 3 annotations
    annotations:
      eks.amazonaws.com/role-arn: ""   # <-- no IRSA role for ECR access
# Step 5: Verify ECR token (for debugging only)
$ aws ecr get-login-password --region eu-west-1 | \
    docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.eu-west-1.amazonaws.com
Login Succeeded   # <-- if this works, it's a node/SA permissions issue

Root Cause
- Image tag doesn’t exist — typo in tag, image never pushed, tag was overwritten
- ECR/GAR authentication failure — node IAM role lacks `ecr:GetDownloadUrlForLayer`, or IRSA not configured
- Private registry without imagePullSecret — pulling from Docker Hub, Quay, or another private registry
- Wrong region — ECR repo is in `us-east-1` but cluster is in `eu-west-1`
- ECR token expired — ECR tokens last 12 hours; if using static secrets, they expire
- Network — node cannot reach the registry (missing NAT Gateway, VPC endpoint, or firewall rule)
Fix + Prevention
# Image doesn't exist — verify and push
aws ecr describe-images --repository-name fraud-ml --region eu-west-1 \
  --query 'imageDetails[*].imageTags' --output table
# Push the correct image from CI/CD
# ECR auth — ensure node IAM role has ECR permissions (EKS managed node group)
# Or use IRSA for pull:
# Policy: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
# For private registries — create imagePullSecret
kubectl create secret docker-registry regcred -n fraud-detection \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<user> \
  --docker-password=<token>
# Attach to ServiceAccount (better than per-pod)
kubectl patch serviceaccount default -n fraud-detection \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
# Network — if behind NAT, check route tables and security groups
# Consider ECR VPC Endpoints for private subnets:
#   com.amazonaws.<region>.ecr.dkr
#   com.amazonaws.<region>.ecr.api
#   com.amazonaws.<region>.s3   (gateway endpoint for image layers)

Prevention:
- Use immutable image tags (sha256 digests) instead of mutable tags like `latest`
- Set up ECR replication to the cluster’s region
- Use ECR pull-through cache for public images
- Create a platform-level `imagePullSecret` via ExternalSecrets in every namespace
- CI pipeline should verify the image exists before updating the deployment manifest
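Two of the root causes above (wrong tag, wrong region) are visible in the image reference itself before you ever touch IAM. A sketch that splits the ECR reference from this scenario into the parts worth eyeballing first:

```shell
# Pull apart an ECR image reference with POSIX parameter expansion.
image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml:v2.3.1"
registry=${image%%/*}     # host part before the first slash
repo_tag=${image#*/}      # everything after the host
repo=${repo_tag%%:*}
tag=${repo_tag##*:}
region=$(echo "$registry" | cut -d. -f4)   # ECR hosts embed the region as the 4th label
echo "repo=$repo tag=$tag region=$region"
```

If `region` here does not match the cluster's region, you have found the "wrong region" cause; if `tag` looks off, check the registry for a typo or an unpushed build.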
Scenario 4: Pod Running But Not Receiving Traffic
Symptoms
Pod shows Running and 1/1 Ready, but requests never reach it. Users report 503 or connection timeout.
$ kubectl get pods -n api-gateway
NAME                     READY   STATUS    RESTARTS   AGE
api-gw-7d8e9f0a1-b2c3d   1/1     Running   0          30m
$ kubectl get svc -n api-gateway
NAME     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
api-gw   ClusterIP   10.100.45.123   <none>        8080/TCP   30d

Debug Commands
# Step 1: Check if endpoints exist for the service
$ kubectl get endpoints api-gw -n api-gateway
NAME     ENDPOINTS   AGE
api-gw   <none>      30d
# ^^ EMPTY endpoints — this is the problem
# Step 2: Compare service selector with pod labels
$ kubectl get svc api-gw -n api-gateway -o jsonpath='{.spec.selector}'
{"app":"api-gateway","version":"v2"}
$ kubectl get pod api-gw-7d8e9f0a1-b2c3d -n api-gateway --show-labels
NAME                     READY   STATUS    LABELS
api-gw-7d8e9f0a1-b2c3d   1/1     Running   app=api-gw,version=v2
# ^^ Label is "app=api-gw" but service selects "app=api-gateway" — MISMATCH
# Step 3: If labels match, check readiness probe
$ kubectl describe pod api-gw-7d8e9f0a1-b2c3d -n api-gateway | grep -A 8 "Readiness"
    Readiness: http-get http://:8080/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    ...
  Warning  Unhealthy  1m (x45 over 28m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
# Step 4: Test connectivity from inside the cluster
$ kubectl run debug-net --rm -it --image=busybox -n api-gateway -- sh
/ # wget -qO- http://api-gw.api-gateway.svc.cluster.local:8080/ready
wget: server returned error: HTTP/1.1 503 Service Unavailable
# Step 5: Check the actual container port
$ kubectl get pod api-gw-7d8e9f0a1-b2c3d -n api-gateway \
    -o jsonpath='{.spec.containers[0].ports}'
[{"containerPort":3000,"protocol":"TCP"}]
# ^^ Container listens on 3000 but service targets 8080

Root Cause
- Service selector does not match pod labels — most common cause
- Readiness probe failing — pod is Running but not Ready, so it’s removed from endpoints
- Port mismatch — service `targetPort` doesn’t match container’s actual listening port
- Container listening on wrong interface — app binds to `127.0.0.1` instead of `0.0.0.0`
- NetworkPolicy blocking ingress traffic to the pod (see Scenario 14)
Fix + Prevention
# Fix label mismatch — update service selector OR pod labels
kubectl patch svc api-gw -n api-gateway \
  --type='json' -p='[{"op":"replace","path":"/spec/selector/app","value":"api-gw"}]'
# Fix port mismatch — update service targetPort
kubectl patch svc api-gw -n api-gateway \
  --type='json' -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":3000}]'
# Verify fix — endpoints should now show the pod IP
$ kubectl get endpoints api-gw -n api-gateway
NAME     ENDPOINTS        AGE
api-gw   10.1.2.34:3000   30d

Prevention:
- Use Helm/Kustomize templates where service selector and pod labels are generated from the same variable
- Add integration tests in CI that deploy to a test namespace and verify endpoints are populated
- Standardize on port naming (`http`, `grpc`) in golden path templates
- Alert on services with 0 endpoints: `kube_endpoint_address_available == 0`
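The matching rule behind Step 2 — every key/value in the service selector must be present in the pod's labels — can be sketched as plain string matching, using the labels from the Scenario 4 output:

```shell
# Every selector entry must appear in the pod's labels; one miss means
# the pod never lands in the service's endpoints.
selector="app=api-gateway,version=v2"
pod_labels="app=api-gw,version=v2"
mismatches=""
for kv in $(printf '%s' "$selector" | tr ',' ' '); do
  case ",$pod_labels," in
    *",$kv,"*) ;;                                        # entry matched
    *) mismatches="${mismatches}${mismatches:+ }$kv" ;;  # entry missing
  esac
done
echo "unmatched selector entries: ${mismatches:-none}"
```

Here `app=api-gateway` never matches `app=api-gw`, which is exactly why the endpoints list stayed empty.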
Scenario 5: DNS Resolution Failures
Symptoms
Pods cannot resolve internal service names or external domains. Application logs show connection errors to hostnames.
$ kubectl exec -it checkout-svc-abc123 -n checkout -- nslookup payment-svc.payments.svc.cluster.local
;; connection timed out; no servers could be reached
$ kubectl exec -it checkout-svc-abc123 -n checkout -- nslookup google.com
;; connection timed out; no servers could be reached

Debug Commands
# Step 1: Check CoreDNS pods
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS             RESTARTS   AGE
coredns-5d78c9869d-j4k5l   0/1     CrashLoopBackOff   12         2h
coredns-5d78c9869d-m6n7o   0/1     CrashLoopBackOff   12         2h
# Step 2: Check CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
[FATAL] plugin/loop: Loop (127.0.0.1:53 -> :53) detected for zone ".", flushing cache
# Step 3: Check CoreDNS ConfigMap
$ kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
# Step 4: Check if DNS service has endpoints
$ kubectl get svc kube-dns -n kube-system
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)
kube-dns   ClusterIP   10.100.0.10   <none>        53/UDP,53/TCP
$ kubectl get endpoints kube-dns -n kube-system
NAME       ENDPOINTS   AGE
kube-dns   <none>      90d
# ^^ No endpoints because CoreDNS pods are crashing
# Step 5: Test with explicit DNS server (bypass pod's resolv.conf)
$ kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default 10.100.0.10
;; connection timed out; no servers could be reached
# Step 6: Check resolv.conf inside the pod
$ kubectl exec -it checkout-svc-abc123 -n checkout -- cat /etc/resolv.conf
nameserver 10.100.0.10
search checkout.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

Root Cause
- CoreDNS pods crashing — DNS loop detected (node’s `/etc/resolv.conf` has `127.0.0.1` as nameserver)
- CoreDNS resource starvation — too many DNS queries, CoreDNS OOMKilled or CPU-throttled
- `ndots:5` performance issue — every non-FQDN query generates 5 search domain lookups before the actual query
- Upstream DNS unreachable — VPC DNS resolver (`.2` address) rate limited or AmazonProvidedDNS issues
- NetworkPolicy blocking UDP/53 — egress policy blocks DNS traffic to `kube-system`
Fix + Prevention
# Fix DNS loop — edit CoreDNS ConfigMap to forward to VPC DNS directly
kubectl edit configmap coredns -n kube-system
# Change: forward . /etc/resolv.conf
# To:     forward . 169.254.169.253   (EKS VPC DNS)
# Or:     forward . 169.254.169.254   (GKE metadata DNS)
# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system
# Fix ndots performance — in pod spec, set dnsConfig:
# spec:
#   dnsConfig:
#     options:
#     - name: ndots
#       value: "2"
# Scale CoreDNS for large clusters (>100 nodes)
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache (preferred for large clusters)
# This runs a DNS cache on every node, reducing CoreDNS load
# Deploy: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

Prevention:
- Deploy NodeLocal DNSCache as part of cluster baseline
- Monitor CoreDNS with: `coredns_dns_requests_total`, `coredns_dns_responses_rcode_total`
- Alert on CoreDNS pod restarts
- Use FQDN with trailing dot in critical service calls (`payment-svc.payments.svc.cluster.local.`)
- Set `ndots:2` in golden path pod templates
Scenario 6: Node NotReady
Symptoms
One or more nodes show NotReady. Pods on those nodes are evicted or stuck.
$ kubectl get nodes
NAME                         STATUS     ROLES    AGE   VERSION
ip-10-1-1-100.ec2.internal   Ready      <none>   30d   v1.29.1
ip-10-1-1-101.ec2.internal   NotReady   <none>   30d   v1.29.1
ip-10-1-1-102.ec2.internal   Ready      <none>   30d   v1.29.1

Debug Commands
# Step 1: Describe the NotReady node — check Conditions
$ kubectl describe node ip-10-1-1-101.ec2.internal
Conditions:
  Type             Status   LastHeartbeatTime                 Reason
  ----             ------   -----------------                 ------
  MemoryPressure   True     Sun, 15 Mar 2026 02:00:00 +0000   KubeletHasMemoryPressure
  DiskPressure     True     Sun, 15 Mar 2026 02:00:00 +0000   KubeletHasDiskPressure
  PIDPressure      False    Sun, 15 Mar 2026 02:00:00 +0000   KubeletHasSufficientPID
  Ready            False    Sun, 15 Mar 2026 01:58:00 +0000   KubeletNotReady message: 'container runtime not ready'
# Step 2: Check node resource usage
$ kubectl top node ip-10-1-1-101.ec2.internal
NAME                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-1-1-101.ec2.internal   3800m        95%    14900Mi         96%
# Step 3: Check system pods on that node
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-1-1-101.ec2.internal
NAMESPACE     NAME                          READY   STATUS    RESTARTS   AGE
kube-system   aws-node-t8k2l                1/1     Running   0          30d
kube-system   kube-proxy-m4n5o              1/1     Running   0          30d
monitoring    prom-node-exporter-p6q7r      1/1     Running   0          30d
payments      payment-worker-heavy-abc123   1/1     Running   0          2h
# Step 4: If you have SSH access (EKS managed nodes via SSM)
$ aws ssm start-session --target i-0abc123def456
$ journalctl -u kubelet --since "30 minutes ago" | tail -30
Mar 15 02:00:12 kubelet: E0315 02:00:12.123456 node_status.go: "node not ready" err="container runtime not responding"
$ systemctl status containerd
● containerd.service - containerd container runtime
   Active: inactive (dead) since Sun 2026-03-15 01:58:00 UTC
# Step 5: Check if the EC2 instance has issues
$ aws ec2 describe-instance-status --instance-ids i-0abc123def456
{
  "InstanceStatuses": [{
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "ok"}
  }]
}

Root Cause
- Memory/Disk pressure — kubelet marks node NotReady when system resources are exhausted
- Container runtime crashed — containerd/dockerd is not responding
- Kubelet stopped — kubelet process died, node stops sending heartbeats
- Network partition — node cannot reach API server (security group change, NACL, route table)
- EC2 system failure — underlying hardware issue, instance status check failed
- Disk full — `/var/lib/containerd` full from old images, container logs, or emptyDir volumes
Fix + Prevention
# Immediate — if hardware issue, cordon and drain
kubectl cordon ip-10-1-1-101.ec2.internal
kubectl drain ip-10-1-1-101.ec2.internal --ignore-daemonsets --delete-emptydir-data --force
# On the node (via SSM):
# Restart containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet
# Clear disk space
sudo crictl rmi --prune
sudo journalctl --vacuum-size=500M
# For managed node groups — just terminate the instance
# ASG will replace it automatically
aws ec2 terminate-instances --instance-ids i-0abc123def456
# For Karpenter — delete the node, Karpenter provisions a replacement
kubectl delete node ip-10-1-1-101.ec2.internal

Prevention:
- Set kubelet eviction thresholds: `--eviction-hard=memory.available<500Mi,nodefs.available<10%`
- Use Karpenter `ttlSecondsUntilExpired` to cycle nodes regularly (e.g., 7 days)
- Monitor node conditions with Prometheus: `kube_node_status_condition{condition="Ready",status="true"} == 0`
- Set resource requests on ALL pods to prevent noisy-neighbor memory exhaustion
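The `nodefs.available<10%` eviction threshold in the prevention list is just a percentage check over the node's filesystem. A sketch with illustrative numbers (not taken from a live node):

```shell
# kubelet signals DiskPressure once free space drops below the
# eviction threshold. Figures below are illustrative.
total_bytes=107374182400   # 100 GiB filesystem
avail_bytes=8589934592     # 8 GiB free
threshold_pct=10           # matches nodefs.available<10%
pct_free=$(( avail_bytes * 100 / total_bytes ))
if [ "$pct_free" -lt "$threshold_pct" ]; then
  echo "DiskPressure expected: only ${pct_free}% of nodefs free"
fi
```

Crossing this line is what flips the `DiskPressure` condition seen in Step 1 and starts evictions, so alerting well before 10% buys time to prune images and logs.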
Scenario 7: PVC Stuck in Pending
Symptoms
PersistentVolumeClaim stays in Pending. Pods using it are also stuck in Pending.
$ kubectl get pvc -n databases
NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
postgres-data-0   Pending                                      gp3-encrypted   15m

Debug Commands
# Step 1: Describe the PVC
$ kubectl describe pvc postgres-data-0 -n databases
Events:
  Warning  ProvisioningFailed  2m (x7 over 15m)  ebs.csi.aws.com_ebs-csi-controller-xxx  failed to provision volume with StorageClass "gp3-encrypted": rpc error: could not create volume "pvc-xxx" in zone "eu-west-1a": UnauthorizedOperation: not authorized to perform: ec2:CreateVolume
# Step 2: Check if StorageClass exists
$ kubectl get storageclass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE
gp2 (default)   kubernetes.io/aws-ebs   Delete          Immediate
gp3-encrypted   ebs.csi.aws.com         Retain          WaitForFirstConsumer
# ^^ StorageClass exists
# Step 3: Check EBS CSI driver pods
$ kubectl get pods -n kube-system -l app=ebs-csi-controller
NAME                                 READY   STATUS    RESTARTS   AGE
ebs-csi-controller-5d8f9g0h1-a2b3c   6/6     Running   0          7d
ebs-csi-controller-5d8f9g0h1-d4e5f   6/6     Running   0          7d
# Step 4: Check EBS CSI controller logs
$ kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner --tail=20
E0315 02:30:00.123456 controller.go:XXX could not create volume: UnauthorizedOperation: not authorized to perform ec2:CreateVolume on resource arn:aws:ec2:eu-west-1:123456789012:volume/*
# Step 5: Check IRSA role for the CSI driver
$ kubectl get sa ebs-csi-controller-sa -n kube-system -o yaml | grep role-arn
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ebs-csi-role
# Step 6: Verify IAM policy
$ aws iam get-role-policy --role-name ebs-csi-role --policy-name ebs-policy
# Look for ec2:CreateVolume, ec2:AttachVolume, ec2:DeleteVolume permissions

Root Cause
- IAM permissions — EBS CSI / PD CSI driver service account lacks permissions to create volumes
- StorageClass not found — PVC references a StorageClass that doesn’t exist
- AZ mismatch with `WaitForFirstConsumer` — node is in `eu-west-1a` but PV was pre-provisioned in `eu-west-1b`
- Quota exceeded — AWS EBS volume quota or GCP PD quota hit
- Encryption KMS key — StorageClass specifies a KMS key the CSI driver role cannot access
- CSI driver not installed — EBS CSI driver addon not enabled on the cluster
Fix + Prevention
# Fix IAM — attach the correct policy to the CSI driver IRSA role
aws iam attach-role-policy --role-name ebs-csi-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
# Fix StorageClass — create if missing
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: arn:aws:kms:eu-west-1:123456789012:key/mrk-abc123
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
# Fix KMS — grant CSI role access to the KMS key
aws kms create-grant --key-id mrk-abc123 \
  --grantee-principal arn:aws:iam::123456789012:role/ebs-csi-role \
  --operations "CreateGrant" "Encrypt" "Decrypt" "GenerateDataKey" "DescribeKey"
# Check quota
aws service-quotas get-service-quota --service-code ebs \
  --quota-code L-D18FCD1D --region eu-west-1

Prevention:
- Include EBS CSI driver as a cluster add-on in Terraform (not manual install)
- Pre-create and test StorageClasses in cluster baseline
- Use `volumeBindingMode: WaitForFirstConsumer` to avoid AZ mismatch
- Monitor PVC age with Prometheus: alert if PVC is Pending > 5 minutes
- Grant KMS key access in Terraform alongside the CSI driver role
Scenario 8: Ingress Not Routing
Symptoms
Ingress resource exists but external traffic returns 404, 502, or connection refused.
$ kubectl get ingress -n frontend
NAME      CLASS   HOSTS          ADDRESS                                               PORTS    AGE
web-app   alb     app.bank.com   k8s-frontend-web-abc123.eu-west-1.elb.amazonaws.com   80,443   10m
# But accessing app.bank.com returns 502 Bad Gateway
$ curl -I https://app.bank.com
HTTP/2 502
server: awselb/2.0

Debug Commands
# Step 1: Check Ingress resource details
$ kubectl describe ingress web-app -n frontend
Rules:
  Host          Path   Backends
  ----          ----   --------
  app.bank.com  /      web-app-svc:80 (10.1.2.34:3000)
# Step 2: Check if the backend service and endpoints exist
$ kubectl get svc web-app-svc -n frontend
NAME          TYPE        CLUSTER-IP     PORT(S)   AGE
web-app-svc   ClusterIP   10.100.67.89   80/TCP    10m
$ kubectl get endpoints web-app-svc -n frontend
NAME          ENDPOINTS        AGE
web-app-svc   10.1.2.34:3000   10m
# Step 3: Check ALB Ingress Controller / AWS Load Balancer Controller logs
$ kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=30
{"level":"error","msg":"Failed to reconcile ingress frontend/web-app: failed to resolve target group health check: backend service web-app-svc does not have matching target port annotation"}
# Step 4: Check AWS target group health
$ aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/k8s-frontend-web/abc123
{
  "TargetHealthDescriptions": [{
    "Target": {"Id": "10.1.2.34", "Port": 3000},
    "TargetHealth": {
      "State": "unhealthy",
      "Reason": "Target.FailedHealthChecks",
      "Description": "Health checks failed with these codes: [404]"
    }
  }]
}
# ^^ ALB health check is hitting the pod but getting 404
# Step 5: Check health check path configuration
$ kubectl get ingress web-app -n frontend -o yaml | grep -A 5 healthcheck
# Look for annotations:
#   alb.ingress.kubernetes.io/healthcheck-path: /healthz
# If missing, ALB defaults to "/" which may return 404
# Step 6: Test the health check from inside the cluster
$ kubectl exec -it web-app-abc123 -n frontend -- wget -qO- http://localhost:3000/healthz
OK
$ kubectl exec -it web-app-abc123 -n frontend -- wget -qO- http://localhost:3000/
# If this returns 404, the ALB health check is failing on "/"

Root Cause
- ALB health check failing — default path `/` returns 404, need to set `healthcheck-path` annotation
- Target group port mismatch — ALB sends traffic to wrong port on the pod
- Security group — ALB security group cannot reach pod network (missing node SG ingress rule)
- Subnet tags missing — ALB controller cannot discover subnets without `kubernetes.io/role/elb: 1` tags
- DNS not pointing to ALB — `app.bank.com` CNAME does not resolve to the ALB address
- TLS termination misconfigured — certificate ARN in annotation doesn’t match the domain
Fix + Prevention
# Fix health check path — add annotation
kubectl annotate ingress web-app -n frontend \
  alb.ingress.kubernetes.io/healthcheck-path=/healthz --overwrite
# Fix security group — ensure ALB SG allows traffic to node SG
# AWS Load Balancer Controller manages this if:
#   alb.ingress.kubernetes.io/security-groups is set correctly
# Fix subnet discovery — tag subnets
aws ec2 create-tags --resources subnet-abc123 \
  --tags Key=kubernetes.io/role/elb,Value=1
# Fix TLS — add certificate ARN
kubectl annotate ingress web-app -n frontend \
  alb.ingress.kubernetes.io/certificate-arn=arn:aws:acm:eu-west-1:123456789012:certificate/abc-123
# Verify ALB target health after fix
$ aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/k8s-frontend-web/abc123
{
  "TargetHealthDescriptions": [{
    "Target": {"Id": "10.1.2.34", "Port": 3000},
    "TargetHealth": {"State": "healthy"}
  }]
}

Prevention:
- Include health check annotations in golden path Ingress templates
- Ensure subnet tagging is part of Terraform VPC module
- Use external-dns to auto-manage DNS records from Ingress resources
- Alert on ALB target health: `aws_alb_tg_unhealthy_host_count > 0`
Scenario 9: IRSA / Workload Identity Not Working
Symptoms
Pod cannot access AWS/GCP APIs. Application logs show AccessDenied or 403 Forbidden.
$ kubectl logs s3-uploader-abc123 -n data-pipeline
An error occurred (AccessDenied) when calling the PutObject operation:
User: arn:aws:sts::123456789012:assumed-role/eksctl-cluster-nodegroup-NodeInstanceRole/i-abc123
is not authorized to perform: s3:PutObject on resource: arn:aws:s3:::bank-data-lake/*
# ^^ Using NODE role instead of IRSA role — IRSA is not working

Debug Commands
# Step 1: Check ServiceAccount annotation
$ kubectl get sa s3-uploader-sa -n data-pipeline -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-uploader-sa
  namespace: data-pipeline
  annotations: {}   # <-- NO IRSA annotation!
# Step 2: Check if the pod is using the correct ServiceAccount
$ kubectl get pod s3-uploader-abc123 -n data-pipeline \
    -o jsonpath='{.spec.serviceAccountName}'
default   # <-- Using "default" SA, not "s3-uploader-sa"
# Step 3: Verify the projected token volume exists
$ kubectl get pod s3-uploader-abc123 -n data-pipeline \
    -o jsonpath='{.spec.volumes[?(@.name=="aws-iam-token")]}' | jq .
# Should return a projected volume with audience "sts.amazonaws.com"
# If empty — IRSA token not being projected
# Step 4: Check the IAM role trust policy
$ aws iam get-role --role-name s3-uploader-role --query 'Role.AssumeRolePolicyDocument'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/ABC123"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.eu-west-1.amazonaws.com/id/ABC123:sub": "system:serviceaccount:data-pipeline:s3-uploader-sa",
        "oidc.eks.eu-west-1.amazonaws.com/id/ABC123:aud": "sts.amazonaws.com"
      }
    }
  }]
}
# Step 5: Verify OIDC provider exists
$ aws eks describe-cluster --name prod-cluster \
    --query 'cluster.identity.oidc.issuer' --output text
https://oidc.eks.eu-west-1.amazonaws.com/id/ABC123
$ aws iam list-open-id-connect-providers | grep ABC123
"Arn": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/ABC123"
# Step 6: Test from inside the pod
$ kubectl exec -it s3-uploader-abc123 -n data-pipeline -- env | grep AWS
AWS_ROLE_ARN=                  # <-- empty, IRSA not injected
AWS_WEB_IDENTITY_TOKEN_FILE=   # <-- empty

Root Cause
Section titled “Root Cause”- ServiceAccount missing annotation — eks.amazonaws.com/role-arn not set
- Pod using wrong ServiceAccount — Deployment spec says serviceAccountName: default
- Trust policy mismatch — namespace or SA name in trust policy condition doesn’t match
- OIDC provider not created — IRSA requires the EKS OIDC provider registered in IAM
- Token audience mismatch — trust policy expects sts.amazonaws.com but token has different audience
- Webhook not mutating — Amazon EKS Pod Identity Webhook not installed (needed for IRSA token injection)
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Annotate the ServiceAccountkubectl annotate sa s3-uploader-sa -n data-pipeline \ eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/s3-uploader-role
# Fix 2: Update Deployment to use the correct SAkubectl patch deployment s3-uploader -n data-pipeline \ --type='json' -p='[{"op":"add","path":"/spec/template/spec/serviceAccountName","value":"s3-uploader-sa"}]'# "add" also overwrites an existing value; "replace" fails if the field is absent
# Fix 3: Fix trust policy — ensure namespace and SA matchaws iam update-assume-role-policy --role-name s3-uploader-role \ --policy-document file://trust-policy.json# trust-policy.json must have correct:# system:serviceaccount:<namespace>:<sa-name>
# Fix 4: Create OIDC provider if missingeksctl utils associate-iam-oidc-provider --cluster prod-cluster --approve
# Verify fix — new pod should have AWS env vars$ kubectl exec -it s3-uploader-NEW -n data-pipeline -- env | grep AWSAWS_ROLE_ARN=arn:aws:iam::123456789012:role/s3-uploader-roleAWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/tokenPrevention:
- Define IRSA/WI as part of namespace provisioning (Crossplane or Terraform)
- Use EKS Pod Identity (newer, simpler) instead of IRSA for new clusters
- Validate IRSA in CI: kubectl auth can-i checks in post-deploy verification
- Template the trust policy alongside the SA in the same Terraform module
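The prevention bullets above boil down to committing the ServiceAccount and Deployment together. A minimal sketch of the two manifests, reusing the role and names from this scenario (the image name is illustrative):

```yaml
# Golden-path sketch: SA annotated for IRSA, Deployment referencing it
# explicitly. Names reuse the examples above; the image is illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-uploader-sa
  namespace: data-pipeline
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-uploader-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-uploader
  namespace: data-pipeline
spec:
  selector:
    matchLabels:
      app: s3-uploader
  template:
    metadata:
      labels:
        app: s3-uploader
    spec:
      serviceAccountName: s3-uploader-sa   # never leave this as "default"
      containers:
        - name: uploader
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/s3-uploader:v1
```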
Scenario 10: OOMKilled
Section titled “Scenario 10: OOMKilled”Symptoms
Section titled “Symptoms”Container is terminated because it exceeded its memory limit. Exit code 137.
$ kubectl get pods -n ml-inferenceNAME READY STATUS RESTARTS AGEmodel-server-7d8e9f0-a1b2c 0/1 OOMKilled 4 (30s ago) 10m
$ kubectl describe pod model-server-7d8e9f0-a1b2c -n ml-inference Last State: Terminated Reason: OOMKilled Exit Code: 137Debug Commands
Section titled “Debug Commands”# Step 1: Check current resource limits$ kubectl get pod model-server-7d8e9f0-a1b2c -n ml-inference \ -o jsonpath='{.spec.containers[0].resources}' | jq .{ "limits": {"cpu": "1", "memory": "512Mi"}, "requests": {"cpu": "500m", "memory": "256Mi"}}
# Step 2: Check actual memory usage before OOM (from Prometheus/metrics-server)$ kubectl top pod -n ml-inference --sort-by=memoryNAME CPU(cores) MEMORY(bytes)model-server-7d8e9f0-a1b2c 450m 509Mi # <-- hitting 512Mi limit
# Step 3: Check if VPA has recommendations$ kubectl get vpa -n ml-inferenceNAME MODE CPU MEM PROVIDED AGEmodel-server Off - - True 7d
$ kubectl describe vpa model-server -n ml-inferenceRecommendation: Container Recommendations: Container Name: model-server Lower Bound: Cpu: 200m, Memory: 768Mi Target: Cpu: 500m, Memory: 1Gi # <-- VPA recommends 1Gi Upper Bound: Cpu: 2, Memory: 2Gi
# Step 4: Check node-level OOM killer activity (via SSM)$ journalctl -k | grep -i "out of memory"Mar 15 02:45:00 kernel: Out of memory: Killed process 12345 (model-server) total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB, shmem-rss:0kB
# Step 5: Check if it's a JVM/Go app with known memory patterns$ kubectl logs model-server-7d8e9f0-a1b2c -n ml-inference --previous | tail -5Exception in thread "main" java.lang.OutOfMemoryError: Java heap space# ^^ JVM heap not capped — it grows beyond container limitRoot Cause
Section titled “Root Cause”- Memory limit too low — application genuinely needs more memory than the limit allows
- Memory leak — application allocates memory without releasing it over time
- JVM heap not bounded — without container-aware flags the JVM sizes its default heap from host memory (roughly 25% of RAM), exceeding the container limit
- ML model loading — model loaded into memory exceeds container limit
- No limits set + node memory exhaustion — without limits, pod consumes all node memory and the kernel OOM killer terminates it
Fix + Prevention
Section titled “Fix + Prevention”# Immediate — increase memory limit based on VPA recommendationkubectl set resources deployment model-server -n ml-inference \ --limits=memory=1536Mi --requests=memory=1Gi
# For JVM apps — set heap explicitly relative to container limit# In Deployment env:# - name: JAVA_OPTS# value: "-XX:MaxRAMPercentage=75.0"# This caps JVM heap at 75% of container memory limit
# For Go apps — set GOMEMLIMIT# - name: GOMEMLIMIT# value: "400MiB"
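To pick the new limit, a rough headroom heuristic can help. This is an assumption (roughly 1.5x observed peak, similar in spirit to the VPA target above), not an official formula, and suggest_memory is a hypothetical helper:

```shell
#!/bin/sh
# suggest_memory: hypothetical helper that takes the observed working
# set in Mi and prints request/limit values with ~50% headroom.
# The 1.5x multiplier is an assumption, not an official formula.
suggest_memory() {
  observed_mi="$1"
  limit=$(( observed_mi * 3 / 2 ))   # limit = 1.5x observed peak
  request="$observed_mi"             # request = observed steady state
  echo "requests=memory=${request}Mi limits=memory=${limit}Mi"
}

# 509Mi was the working set seen in Step 2 above
suggest_memory 509   # prints: requests=memory=509Mi limits=memory=763Mi
```

Feed the output into `kubectl set resources` as in Fix 1, then re-check `kubectl top pod` after a day of traffic.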
# Enable VPA in Auto mode (if approved by platform policy)# Or use VPA in recommendation-only mode and update limits manuallyPrevention:
- ALWAYS set memory limits on all containers (enforce with OPA/Kyverno)
- Run VPA in recommendation mode on all namespaces, review monthly
- For JVM: mandate -XX:MaxRAMPercentage=75.0 in golden path
- Alert on container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
- Set LimitRange defaults so pods without explicit limits still get bounded
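The LimitRange bullet can look like this, a sketch that bounds containers that declare nothing (the default values are illustrative and should be tuned per namespace tier):

```yaml
# Sketch: LimitRange so pods without explicit limits still get bounded.
# Values are illustrative defaults, not recommendations for every tenant.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ml-inference
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:             # applied as the limit when none is set
        cpu: 500m
        memory: 512Mi
```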
Scenario 11: Pod Stuck in Terminating
Section titled “Scenario 11: Pod Stuck in Terminating”Symptoms
Section titled “Symptoms”Pod is in Terminating state indefinitely. kubectl delete pod does not work.
$ kubectl get pods -n legacy-appNAME READY STATUS RESTARTS AGElegacy-worker-abc123 1/1 Terminating 0 45mDebug Commands
Section titled “Debug Commands”# Step 1: Check for finalizers on the pod$ kubectl get pod legacy-worker-abc123 -n legacy-app -o jsonpath='{.metadata.finalizers}'["custom.finalizer.io/cleanup"]# ^^ Finalizer is blocking deletion
# Step 2: Check if the node hosting the pod is responsive$ kubectl get pod legacy-worker-abc123 -n legacy-app -o jsonpath='{.spec.nodeName}'ip-10-1-1-101.ec2.internal
$ kubectl get node ip-10-1-1-101.ec2.internalNAME STATUS ROLES AGE VERSIONip-10-1-1-101.ec2.internal NotReady <none> 30d v1.29.1# ^^ Node is NotReady — kubelet cannot execute the graceful shutdown
# Step 3: Check if there's a PVC with a finalizer blocking things$ kubectl get pvc -n legacy-app -o jsonpath='{range .items[*]}{.metadata.name}: {.metadata.finalizers}{"\n"}{end}'legacy-data: ["kubernetes.io/pvc-protection"]
# Step 4: Check if preStop hook is hanging$ kubectl get pod legacy-worker-abc123 -n legacy-app \ -o jsonpath='{.spec.containers[0].lifecycle.preStop}'{"exec":{"command":["sh","-c","sleep 3600"]}}# ^^ preStop hook sleeps for 3600 seconds (1 hour!)
# Step 5: Check terminationGracePeriodSeconds$ kubectl get pod legacy-worker-abc123 -n legacy-app \ -o jsonpath='{.spec.terminationGracePeriodSeconds}'3600 # <-- 1 hour grace period, pod won't force-kill until this expiresRoot Cause
Section titled “Root Cause”- Finalizer blocking — a controller’s finalizer is present but the controller is not running to remove it
- Node is NotReady — kubelet on the node cannot execute SIGTERM/SIGKILL sequence
- preStop hook hanging — long-running preStop hook (e.g., draining connections) never completes
- terminationGracePeriodSeconds too high — set to 3600+ seconds
- PVC protection finalizer — PVC is still in use, preventing pod deletion chain
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Remove finalizer (if controller is gone)kubectl patch pod legacy-worker-abc123 -n legacy-app \ --type='json' -p='[{"op":"remove","path":"/metadata/finalizers"}]'
# Fix 2: Force delete (last resort — use with caution)kubectl delete pod legacy-worker-abc123 -n legacy-app --force --grace-period=0
# Fix 3: If node is NotReady, the pod object stays until the node recovers# or until you force-delete. Fix the node first (Scenario 6) or:kubectl delete pod legacy-worker-abc123 -n legacy-app --force --grace-period=0
# Fix 4: Reduce terminationGracePeriodSeconds in Deployment spec# spec.terminationGracePeriodSeconds: 30 (sane default)Prevention:
- Set terminationGracePeriodSeconds: 30 in golden path templates (override only with justification)
- Ensure preStop hooks have bounded execution time
- Monitor for pods in Terminating state > 5 minutes
- If using custom finalizers, ensure the controller has HA (multiple replicas)
- For StatefulSets, use pod disruption budgets to control deletion safely
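The grace-period and preStop bullets combine into a pod template fragment like this sketch (/app/drain.sh and the timings are illustrative):

```yaml
# Fragment of a Deployment pod template: bounded preStop plus a sane
# grace period. /app/drain.sh is a hypothetical drain script.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: worker
      image: legacy-worker:v1
      lifecycle:
        preStop:
          exec:
            # drain for at most 20s, always less than the grace period
            command: ["sh", "-c", "timeout 20 /app/drain.sh || true"]
```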
Scenario 12: HPA Not Scaling
Section titled “Scenario 12: HPA Not Scaling”Symptoms
Section titled “Symptoms”HPA exists but replicas remain at minimum even under high load.
$ kubectl get hpa -n checkoutNAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGEcheckout-hpa Deployment/checkout-svc <unknown>/70% 2 20 2 30d# ^^^^^^^^^ <unknown> means metrics not availableDebug Commands
Section titled “Debug Commands”# Step 1: Describe the HPA for detailed status$ kubectl describe hpa checkout-hpa -n checkoutConditions: Type Status Reason Message ---- ------ ------ ------- AbleToScale True ReadyForNewScale recommended size matches current size ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu in container "checkout" of pod "checkout-svc-abc123"
# Step 2: Check if metrics-server is running$ kubectl get pods -n kube-system -l k8s-app=metrics-serverNAME READY STATUS RESTARTS AGEmetrics-server-6d684c7b5d-x9y0z 1/1 Running 0 30d
# Step 3: Verify metrics are available$ kubectl top pods -n checkoutNAME CPU(cores) MEMORY(bytes)checkout-svc-abc123 450m 256Micheckout-svc-def456 380m 230Mi
# Step 4: Check if the pods have resource REQUESTS set (required for CPU-based HPA)$ kubectl get pod checkout-svc-abc123 -n checkout \ -o jsonpath='{.spec.containers[0].resources}'{"limits":{"memory":"512Mi"}}# ^^ NO CPU request! HPA cannot calculate percentage without a request baseline
# Step 5: For custom metrics (e.g., queue depth), check the metrics API$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .# If 404 — custom metrics adapter (Prometheus Adapter / KEDA) not installed
# Step 6: Check HPA events$ kubectl get events -n checkout --field-selector involvedObject.name=checkout-hpaLAST SEEN TYPE REASON MESSAGE2m Warning FailedGetResourceMetric missing request for cpu5m Warning FailedComputeMetricsReplicas failed to get cpu utilizationRoot Cause
Section titled “Root Cause”- No CPU requests on pods — HPA needs resource requests to calculate utilization percentage
- Metrics server not installed or broken — kubectl top returns errors
- Custom metrics adapter missing — using custom/external metrics but Prometheus Adapter or KEDA not deployed
- Wrong metric name — HPA references a metric that doesn’t exist
- Cooldown period — HPA recently scaled down and is in the --horizontal-pod-autoscaler-downscale-stabilization window (default 5 min)
- MaxReplicas reached — HPA already at max, cannot scale further
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Add CPU requests to the Deploymentkubectl set resources deployment checkout-svc -n checkout \ --requests=cpu=200m,memory=256Mi --limits=memory=512Mi
# Fix 2: Install metrics-server (if missing)kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Fix 3: Verify after adding requests$ kubectl get hpa -n checkoutNAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGEcheckout-hpa Deployment/checkout-svc 65%/70% 2 20 2 30d# ^^^^ Now showing actual percentage
# Fix 4: For custom metrics, install KEDA# KEDA ScaledObject example for SQS queue depth:# apiVersion: keda.sh/v1alpha1# kind: ScaledObject# metadata:# name: checkout-scaledobject# spec:# scaleTargetRef:# name: checkout-svc# minReplicaCount: 2# maxReplicaCount: 20# triggers:# - type: aws-sqs-queue# metadata:# queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/checkout-queue# queueLength: "5"Prevention:
- Enforce CPU requests on all pods via OPA/Kyverno admission policy
- Include HPA + resource requests in golden path Helm templates
- Deploy metrics-server and KEDA as cluster baseline add-ons
- Monitor kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"}
- Set up alerts for HPA at maxReplicas for > 10 minutes
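For the golden path bullet, a working CPU-based HPA needs both pieces together: CPU requests on the Deployment and an autoscaling/v2 object. A sketch using the values from this scenario:

```yaml
# HPA v2 sketch. This only scales correctly because the Deployment
# also sets CPU requests (see Fix 1 above).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
  namespace: checkout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-svc
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```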
Scenario 13: Certificate Expiry (cert-manager)
Section titled “Scenario 13: Certificate Expiry (cert-manager)”Symptoms
Section titled “Symptoms”TLS certificates are about to expire or already expired. Browsers show certificate errors, API clients fail.
$ kubectl get certificates -n frontendNAME READY SECRET AGEapp-bank-tls False app-bank-tls 90d
$ kubectl describe certificate app-bank-tls -n frontendStatus: Conditions: Type: Ready Status: False Reason: Renewing Message: Renewing certificate as renewal was scheduled at 2026-03-14 Not After: 2026-03-15T00:00:00Z # <-- expires TODAYDebug Commands
Section titled “Debug Commands”# Step 1: Check cert-manager pods$ kubectl get pods -n cert-managerNAME READY STATUS RESTARTS AGEcert-manager-7d8e9f0a1-b2c3d 1/1 Running 0 30dcert-manager-cainjector-5f6g7h8i9-j0k1l 1/1 Running 0 30dcert-manager-webhook-3d4e5f6g7-h8i9j 1/1 Running 0 30d
# Step 2: Check the Certificate resource status$ kubectl describe certificate app-bank-tls -n frontendEvents: Warning Failed 2m (x15 over 24h) cert-manager-certificates-issuing The certificate request has failed to complete and will be retried: Failed to wait for order resource "app-bank-tls-order-xyz" to become ready: order is in "errored" state: acme: order error: 403
# Step 3: Check the Order and Challenge$ kubectl get orders -n frontendNAME STATE AGEapp-bank-tls-order-xyz errored 24h
$ kubectl describe order app-bank-tls-order-xyz -n frontendStatus: State: errored Reason: "acme: order error: one or more domains had a problem"
$ kubectl get challenges -n frontendNAME STATE DOMAIN AGEapp-bank-tls-challenge-abc123 pending app.bank.com 24h
$ kubectl describe challenge app-bank-tls-challenge-abc123 -n frontendStatus: Reason: Waiting for DNS-01 challenge propagation: DNS record for "_acme-challenge.app.bank.com" not yet propagated State: pending
# Step 4: Check ClusterIssuer/Issuer$ kubectl get clusterissuerNAME READY AGEletsencrypt-prod True 180d
$ kubectl describe clusterissuer letsencrypt-prodStatus: Acme: Uri: https://acme-v02.api.letsencrypt.org/acme/acct/123456 Conditions: Type: Ready Status: True
# Step 5: Check if DNS credentials for Route53/Cloud DNS are valid$ kubectl get secret route53-credentials -n cert-manager -o yaml# Verify the access key is not expired/rotated
# Step 6: cert-manager logs for detailed errors$ kubectl logs -n cert-manager -l app=cert-manager --tail=30E0315 cert-manager/challenges "msg"="propagation check failed" "error"="DNS record for \"_acme-challenge.app.bank.com\" not yet propagated" "dnsName"="app.bank.com" "type"="DNS-01"Root Cause
Section titled “Root Cause”- DNS-01 challenge failing — cert-manager cannot create the _acme-challenge TXT record (IAM permissions, wrong hosted zone)
- HTTP-01 challenge failing — challenge solver pod cannot be reached from the internet (ingress misconfigured, firewall)
- Rate limiting — Let’s Encrypt rate limits: 50 certs per domain per week, 5 duplicate certs per week
- Credential expiry — Route53/Cloud DNS IAM credentials used by cert-manager have expired
- cert-manager webhook down — webhook not running, certificate resources cannot be validated
- Cluster DNS issue — cert-manager pods cannot resolve Let’s Encrypt API (see Scenario 5)
Fix + Prevention
Section titled “Fix + Prevention”# Fix DNS-01 — verify IAM permissions for Route53aws iam simulate-principal-policy \ --policy-source-arn arn:aws:iam::123456789012:role/cert-manager-role \ --action-names route53:ChangeResourceRecordSets route53:GetChange \ --resource-arns "arn:aws:route53:::hostedzone/Z1234567890"
# Fix — if permissions are correct but record not propagating, delete and retrykubectl delete challenge app-bank-tls-challenge-abc123 -n frontend# cert-manager will create a new challenge automatically
# Emergency — if cert already expired, manually create cert from ACM/existingkubectl create secret tls app-bank-tls -n frontend \ --cert=./fullchain.pem --key=./privkey.pem --dry-run=client -o yaml | kubectl apply -f -
# Force renewalcmctl renew app-bank-tls -n frontend
# Check cert expiry dates across all namespaces$ kubectl get certificates --all-namespaces -o custom-columns=\NAMESPACE:.metadata.namespace,NAME:.metadata.name,\READY:.status.conditions[0].status,EXPIRY:.status.notAfterNAMESPACE NAME READY EXPIRYfrontend app-bank-tls False 2026-03-15T00:00:00Zpayments pay-bank-tls True 2026-05-20T00:00:00ZPrevention:
- Alert on cert expiry 30 days before: certmanager_certificate_expiration_timestamp_seconds - time() < 30*24*3600
- Use cert-manager with DNS-01 (more reliable than HTTP-01 for internal services)
- Set up IRSA/Workload Identity for cert-manager instead of static credentials
- Monitor certmanager_certificate_ready_status{condition="True"} == 0
- Run cmctl check api as part of cluster health checks
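The 30-day alert bullet, expressed as a PrometheusRule sketch (assumes the Prometheus Operator CRDs are installed in the cluster; rule names and severity are illustrative):

```yaml
# Sketch: alert 30 days before any cert-manager certificate expires.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry
  namespace: cert-manager
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30*24*3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in < 30 days"
```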
Scenario 14: Network Policy Blocking Traffic
Section titled “Scenario 14: Network Policy Blocking Traffic”Symptoms
Section titled “Symptoms”Pods cannot communicate with each other despite services being correctly configured. Connection timeout or reset.
# From checkout pod, trying to reach payment service$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \ http://payment-svc.payments.svc.cluster.local:8080/api/chargewget: download timed outDebug Commands
Section titled “Debug Commands”# Step 1: List NetworkPolicies in BOTH source and destination namespaces$ kubectl get networkpolicy -n paymentsNAME POD-SELECTOR AGEdefault-deny-all <none> 30dallow-monitoring app=payment-svc 30d
$ kubectl get networkpolicy -n checkoutNAME POD-SELECTOR AGEdefault-deny-all <none> 30dallow-egress-dns <none> 30d
# Step 2: Inspect the default-deny policy in destination namespace$ kubectl describe networkpolicy default-deny-all -n paymentsSpec: PodSelector: <none> (Coverage: all pods in the namespace) Allowing ingress traffic: <none> (Selected pods are isolated for ingress connectivity) Allowing egress traffic: <none> (Selected pods are isolated for egress connectivity) Policy Types: Ingress, Egress# ^^ Denies ALL ingress — checkout pods blocked
# Step 3: Check if there's an allow rule for the specific traffic$ kubectl get networkpolicy allow-monitoring -n payments -o yamlspec: podSelector: matchLabels: app: payment-svc ingress: - from: - namespaceSelector: matchLabels: name: monitoring # <-- only monitoring namespace allowed ports: - port: 8080# ^^ Only monitoring can reach payment-svc. Checkout is NOT allowed.
# Step 4: Check egress rules in source namespace$ kubectl describe networkpolicy default-deny-all -n checkout# If egress is also denied, the checkout pod cannot make ANY outbound connections# Unless there are specific egress allow rules
# Step 5: Verify namespace labels (required for namespaceSelector)$ kubectl get namespace payments --show-labelsNAME STATUS AGE LABELSpayments Active 90d kubernetes.io/metadata.name=payments,name=payments,team=payments
$ kubectl get namespace checkout --show-labelsNAME STATUS AGE LABELScheckout Active 90d kubernetes.io/metadata.name=checkout,team=checkout# ^^ checkout namespace has label "team=checkout" — this is what we need to match
# Step 6: Test with a debug pod in the same namespace as the destination$ kubectl run debug -n payments --rm -it --image=busybox -- wget -qO- http://payment-svc:8080/healthzOK # <-- Works from within the namespace, confirming NetworkPolicy is the issueRoot Cause
Section titled “Root Cause”- Default deny without matching allow — default-deny-all blocks everything, no ingress rule allows checkout namespace
- Missing egress rule in source namespace — checkout namespace also has default deny on egress
- Namespace labels missing — namespaceSelector in the allow rule references a label that doesn’t exist on the source namespace
- Port not specified in allow rule — ingress rule allows the namespace but not the specific port
- CNI doesn’t support NetworkPolicy — some CNIs (e.g., Flannel) don’t enforce NetworkPolicies
Fix + Prevention
Section titled “Fix + Prevention”# Create an ingress allow rule for checkout -> paymentscat <<EOF | kubectl apply -f -apiVersion: networking.k8s.io/v1kind: NetworkPolicymetadata: name: allow-checkout-to-payment namespace: paymentsspec: podSelector: matchLabels: app: payment-svc policyTypes: - Ingress ingress: - from: - namespaceSelector: matchLabels: team: checkout podSelector: matchLabels: app: checkout-svc ports: - port: 8080 protocol: TCPEOF
# Also ensure checkout namespace has egress allowed to paymentscat <<EOF | kubectl apply -f -apiVersion: networking.k8s.io/v1kind: NetworkPolicymetadata: name: allow-egress-to-payments namespace: checkoutspec: podSelector: matchLabels: app: checkout-svc policyTypes: - Egress egress: - to: - namespaceSelector: matchLabels: team: payments ports: - port: 8080 protocol: TCPEOF
# Verify$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- \ http://payment-svc.payments.svc.cluster.local:8080/api/charge{"status": "ok"}Prevention:
- Define NetworkPolicies in GitOps alongside the namespace provisioning
- Create a “service dependency map” — which namespaces talk to which
- Include DNS egress rule in every default-deny policy template
- Use Cilium’s hubble observe for real-time flow visibility
- Test NetworkPolicies in staging before applying to production
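The "DNS egress in every default-deny" bullet as a template sketch, so a newly provisioned namespace is isolated but can still resolve names:

```yaml
# Template sketch: default deny both directions, but keep kube-dns
# reachable so pods can still resolve service names.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-dns
  namespace: checkout        # stamped into each tenant namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```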
Scenario 15: ArgoCD Sync Failures
Section titled “Scenario 15: ArgoCD Sync Failures”Symptoms
Section titled “Symptoms”ArgoCD Application shows OutOfSync or SyncFailed. Resources are not being deployed.
# ArgoCD CLI$ argocd app get payments --refreshName: paymentsProject: tenant-paymentsServer: https://kubernetes.default.svcNamespace: paymentsStatus: OutOfSyncHealth: DegradedSync Status: SyncFailedMessage: ComparisonError: failed to sync: one or more objects failed to apply: admission webhook "validate.kyverno.svc-fail" denied the request: resource violated policy require-labelsDebug Commands
Section titled “Debug Commands”# Step 1: Check sync status and errors$ argocd app get payments --show-operationOperation: SyncSync Revision: abc123def456Phase: FailedMessage: one or more objects failed to apply
STEP RESOURCE RESULT MESSAGE1 Namespace/payments Synced namespace/payments configured2 Deployment/payment-api Failed admission webhook denied: missing label "team"3 Service/payment-api Skipped dependent resource failed
# Step 2: Check ArgoCD application logs$ argocd app logs payments --tail=20time="2026-03-15T02:00:00Z" level=error msg="ComparisonError" application=payments error="failed to compute diff: CRD certificates.cert-manager.io not found in cluster"
# Step 3: Check if there's a resource hook ordering issue$ kubectl get applications.argoproj.io payments -n argocd \ -o jsonpath='{.status.operationState.syncResult.resources}' | jq .[ {"kind":"CustomResourceDefinition","status":"SyncFailed", "message":"resource mapping not found for name: certificate"}]# ^^ CRD must be applied before the CR that uses it
# Step 4: Check if it's a drift / server-side apply conflict$ argocd app diff payments--- live+++ desired@@ -10,6 +10,7 @@ labels: app: payment-api+ team: payments # <-- this label is in Git but not in cluster version: v2
# Step 5: Check ArgoCD controller and repo-server$ kubectl get pods -n argocdNAME READY STATUS RESTARTS AGEargocd-application-controller-0 1/1 Running 0 7dargocd-repo-server-5d8e9f0-a1b2c 1/1 Running 0 7dargocd-server-7d8e9f0-c3d4e 1/1 Running 0 7d
$ kubectl logs argocd-repo-server-5d8e9f0-a1b2c -n argocd --tail=20time="2026-03-15T02:00:00Z" level=error msg="failed to generate manifest" error="helm template failed: Error: chart 'payment-api' version '2.3.1' not found in repository 'https://charts.internal.bank.com'"
# Step 6: Check repo connectivity$ argocd repo listTYPE NAME REPO STATUS MESSAGEgit infra git@github.com:bank/infra-manifests.git Successfulhelm charts https://charts.internal.bank.com Failed connection refusedRoot Cause
Section titled “Root Cause”- Admission webhook rejection — Kyverno/OPA/Gatekeeper policy denies the resource (missing labels, wrong image registry)
- CRD ordering — Custom Resources applied before their CRDs exist (cert-manager Certificate before CRD)
- Helm chart not found — internal Helm repo is down or chart version doesn’t exist
- Server-side apply conflict — field managed by another controller (e.g., HPA manages replicas, ArgoCD tries to set them too)
- Resource quota exceeded — namespace quota prevents creating new resources
- RBAC — ArgoCD service account doesn’t have permissions to create the resource
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Admission webhook — add the required label in Git# Edit the Deployment manifest in Git and commit:# metadata:# labels:# team: payments # <-- add this
# Fix 2: CRD ordering — use sync waves# On the CRD:# annotations:# argocd.argoproj.io/sync-wave: "-1" # Apply CRDs first# On the CR:# annotations:# argocd.argoproj.io/sync-wave: "1" # Apply after CRDs
# Fix 3: Server-side apply conflict with HPA — ignore replicas diff# In ArgoCD Application spec:# ignoreDifferences:# - group: apps# kind: Deployment# jsonPointers:# - /spec/replicas
# Fix 4: Retry syncargocd app sync payments --retry-limit 3
# Fix 5: Force sync (overwrite cluster state with Git)argocd app sync payments --force
# Fix 6: Check RBACkubectl auth can-i create deployments --as=system:serviceaccount:argocd:argocd-application-controller -n paymentsPrevention:
- Use sync waves for all CRD + CR combinations
- Configure ignoreDifferences for fields managed by controllers (HPA replicas, mutating webhooks)
- Test manifests against admission policies in CI before pushing to Git
- Set up ArgoCD notifications (Slack/Teams) for sync failures
- Use ArgoCD ApplicationSets for consistent configuration across tenant apps
- Monitor argocd_app_sync_status{sync_status="OutOfSync"} and argocd_app_health_status{health_status!="Healthy"}
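Several of these bullets land in the Application spec itself. A sketch, reusing the repo from Step 6 above (path and sync options are illustrative):

```yaml
# Application sketch: ignoreDifferences for the HPA-managed replicas
# field, plus bounded retry on failed syncs.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: tenant-payments
  source:
    repoURL: git@github.com:bank/infra-manifests.git
    path: tenants/payments        # illustrative path
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas          # HPA owns this field
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3
```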
Scenario 16: Karpenter Not Provisioning Nodes
Section titled “Scenario 16: Karpenter Not Provisioning Nodes”Symptoms
Section titled “Symptoms”Pods are stuck in Pending but Karpenter is not launching new nodes, even though it should.
$ kubectl get pods -n batch-processingNAME READY STATUS RESTARTS AGEbatch-job-abc123 0/1 Pending 0 20mbatch-job-def456 0/1 Pending 0 20m
$ kubectl get nodesNAME STATUS ROLES AGE VERSIONip-10-1-1-100.ec2.internal Ready <none> 7d v1.29.1# Only 1 node, no new nodes being provisionedDebug Commands
Section titled “Debug Commands”# Step 1: Check Karpenter controller logs$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=302026-03-15T02:00:00.000Z ERROR controller.provisioner Could not schedule pod, incompatible with provisioner "default": no instance type satisfied resources {"cpu":"16","memory":"64Gi"} and target NodePool requirements [{key: "karpenter.sh/capacity-type", operator: In, values: [spot]}]
# Step 2: Check NodePool configuration$ kubectl get nodepoolsNAME NODECLASS NODES READY AGEdefault default 1 1 30d
$ kubectl describe nodepool defaultSpec: Template: Spec: Requirements: - Key: karpenter.sh/capacity-type Operator: In Values: ["spot"] - Key: node.kubernetes.io/instance-type Operator: In Values: ["m5.xlarge", "m5.2xlarge"] - Key: topology.kubernetes.io/zone Operator: In Values: ["eu-west-1a"] Limits: Cpu: "32" # <-- cluster limit: 32 vCPUs total Memory: "128Gi" Disruption: ConsolidationPolicy: WhenUnderutilized
# Step 3: Check current usage against limits$ kubectl get nodepool default -o jsonpath='{.status}' | jq .{ "resources": { "cpu": "28", # <-- 28 of 32 used, only 4 vCPU remaining "memory": "112Gi" }}# ^^ Pods need 16 CPU but only 4 available in the NodePool limit
# Step 4: Check EC2NodeClass (subnet, security group, AMI)$ kubectl get ec2nodeclassesNAME AGEdefault 30d
$ kubectl describe ec2nodeclass defaultSpec: Subnet Selector: karpenter.sh/discovery: prod-cluster Security Group Selector: karpenter.sh/discovery: prod-cluster AMI Family: AL2Status: Subnets: - ID: subnet-abc123 Zone: eu-west-1a Security Groups: - ID: sg-abc123 AMIs: - ID: ami-abc123 Name: amazon-eks-node-1.29-v20260301
# Step 5: Check for EC2 capacity issues$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=50 | grep -i "insufficient"InsufficientInstanceCapacity: We currently do not have sufficient m5.2xlarge capacity in the Availability Zone eu-west-1a
# Step 6: Check if the pod has nodeSelector or affinity that conflicts$ kubectl get pod batch-job-abc123 -n batch-processing \ -o jsonpath='{.spec.nodeSelector}'{"kubernetes.io/arch":"arm64"}# ^^ Pod requires ARM but NodePool only allows x86 instance typesRoot Cause
Section titled “Root Cause”- NodePool CPU/memory limits reached — Karpenter respects the limits on NodePools
- Instance type constraints too narrow — only allowing 2 instance types, and those are unavailable in the AZ
- Spot capacity unavailable — requesting spot-only but no spot capacity for the selected instance types
- AZ restriction — only allowing one AZ, which has capacity issues
- Architecture mismatch — pod requires ARM (arm64) but NodePool only provisions x86 instances
- Subnet capacity — subnet has no available IP addresses
- IAM permissions — Karpenter node role cannot launch EC2 instances
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Increase NodePool limitskubectl patch nodepool default --type='merge' -p '{ "spec": {"limits": {"cpu": "64", "memory": "256Gi"}}}'
# Fix 2: Broaden instance type selectionkubectl patch nodepool default --type='json' -p='[{ "op": "replace", "path": "/spec/template/spec/requirements/1", "value": { "key": "node.kubernetes.io/instance-type", "operator": "In", "values": ["m5.xlarge","m5.2xlarge","m5a.xlarge","m5a.2xlarge", "m6i.xlarge","m6i.2xlarge","c5.xlarge","c5.2xlarge"] }}]'
# Fix 3: Allow on-demand fallback (not spot-only)# In NodePool requirements:# - key: karpenter.sh/capacity-type# operator: In# values: ["spot", "on-demand"]
# Fix 4: Allow multiple AZs# In NodePool requirements:# - key: topology.kubernetes.io/zone# operator: In# values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
# Fix 5: For ARM pods, add ARM instance types# In NodePool:# - key: kubernetes.io/arch# operator: In# values: ["amd64", "arm64"]# - key: node.kubernetes.io/instance-type# operator: In# values: ["m6g.xlarge", "m6g.2xlarge"] # Graviton
# Verify — watch Karpenter provision a node$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f2026-03-15T02:30:00.000Z INFO controller.provisioner Computed 1 new node(s) will fit 2 pod(s)2026-03-15T02:30:05.000Z INFO controller.provisioner Launched node: ip-10-1-1-103.ec2.internal, type: m6i.2xlarge, zone: eu-west-1b, capacity-type: on-demandPrevention:
- Set NodePool limits with 50% headroom above normal peak
- Use at least 15 instance types (Karpenter picks the cheapest available)
- Allow both spot and on-demand with spot preference
- Monitor karpenter_pods_state{state="pending"} and karpenter_nodepool_usage vs limits
- Alert when NodePool usage > 80% of limits
- Review Karpenter logs weekly for InsufficientInstanceCapacity patterns
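Put together, a NodePool following these bullets might look like this sketch (karpenter.sh/v1beta1 API assumed; all values illustrative):

```yaml
# NodePool sketch: broad instance selection, spot with on-demand
# fallback, multiple AZs, and limits with headroom above normal peak.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]   # broad pool, Karpenter picks cheapest
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  limits:
    cpu: "96"            # ~50% above normal peak usage
    memory: 384Gi
```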
Scenario 17: Cross-Namespace Service Communication Failing
Section titled “Scenario 17: Cross-Namespace Service Communication Failing”Symptoms
Section titled “Symptoms”Service in namespace A cannot reach a service in namespace B, even though both services are running and healthy within their own namespaces.
# From checkout namespace, trying to call payment service in payments namespace$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \ http://payment-svc:8080/api/chargewget: bad address 'payment-svc:8080' # <-- DNS cannot resolve
# Or using FQDN:$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \ http://payment-svc.payments.svc.cluster.local:8080/api/chargewget: download timed out # <-- DNS resolves but connection blockedDebug Commands
```bash
# Step 1: Verify DNS resolution across namespaces
$ kubectl exec -it checkout-abc123 -n checkout -- nslookup payment-svc.payments.svc.cluster.local
Server:    10.100.0.10
Address:   10.100.0.10:53

Name:      payment-svc.payments.svc.cluster.local
Address:   10.100.23.45
# ^^ DNS works. Problem is not DNS.
```
```bash
# Step 2: Check if the service has endpoints
$ kubectl get endpoints payment-svc -n payments
NAME          ENDPOINTS        AGE
payment-svc   10.1.2.34:8080   30d
# ^^ Endpoints exist
```
```bash
# Step 3: Check NetworkPolicies (most likely cause)
$ kubectl get networkpolicy -n payments
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         30d
# ^^ Default deny blocks all ingress to payments namespace

$ kubectl get networkpolicy -n checkout
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         30d
allow-egress-dns   <none>         30d
# ^^ Default deny blocks all egress from checkout namespace (except DNS)
```
```bash
# Step 4: Check if FQDN is required (short name only works within same namespace)
$ kubectl exec -it checkout-abc123 -n checkout -- cat /etc/resolv.conf
nameserver 10.100.0.10
search checkout.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
# With ndots:5, "payment-svc" resolves as:
#   payment-svc.checkout.svc.cluster.local  → NXDOMAIN (not in checkout ns)
#   payment-svc.svc.cluster.local           → NXDOMAIN
#   payment-svc.cluster.local               → NXDOMAIN
#   payment-svc                             → NXDOMAIN
# MUST use: payment-svc.payments.svc.cluster.local
```
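The search-path behaviour above can be reproduced offline. This is a small POSIX-shell sketch of how `ndots:5` orders the lookup candidates — the name and search list mirror the resolv.conf shown, nothing here touches a cluster:

```shell
# Emulate glibc/musl resolver expansion: a name with fewer dots than
# ndots is tried against each search domain first, then as-is.
name="payment-svc"
ndots=5
search="checkout.svc.cluster.local svc.cluster.local cluster.local"

dots=$(printf '%s' "$name" | awk -F. '{print NF-1}')
candidates=""
if [ "$dots" -lt "$ndots" ]; then
  for d in $search; do
    candidates="$candidates $name.$d"
  done
fi
candidates="$candidates $name"

# Note: none of the candidates land in payments.svc.cluster.local,
# which is exactly why every lookup returns NXDOMAIN.
for c in $candidates; do echo "try: $c"; done
```

Run it with a fully qualified name (five or more dots) and the search loop is skipped entirely, which is why the FQDN works from any namespace.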
```bash
# Step 5: Test connectivity with NetworkPolicy temporarily removed (STAGING ONLY)
# Note: --dry-run=client only previews the deletion; to actually test,
# run the delete for real (and restore the policy afterwards):
$ kubectl delete networkpolicy default-deny-all -n payments --dry-run=client
$ kubectl delete networkpolicy default-deny-all -n payments
# If removing the policy fixes it, NetworkPolicy is confirmed as the cause
```
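A less disruptive variant of Step 5: since NetworkPolicies are additive allow-lists, you can layer a temporary allow-all on top instead of deleting the deny policy, then delete only the temporary object after testing. A sketch (the name `tmp-allow-all-ingress` is hypothetical):

```yaml
# TEMPORARY — delete after testing. Opens all ingress to payment-svc pods
# without touching the existing default-deny-all object.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tmp-allow-all-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-svc
  policyTypes:
  - Ingress
  ingress:
  - {}   # empty rule = allow from all sources
```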
```bash
# Step 6: Check for Cilium/Calico-specific network policies
$ kubectl get ciliumnetworkpolicy -n payments 2>/dev/null
NAME                    AGE
cilium-default-deny     30d
cilium-allow-internal   30d

$ kubectl describe ciliumnetworkpolicy cilium-allow-internal -n payments
Spec:
  Endpoint Selector:
    Match Labels:
      app: payment-svc
  Ingress:
  - From Endpoints:
    - Match Labels:
        io.kubernetes.pod.namespace: payments   # <-- only same namespace
```

Root Cause
- Using short service name — `payment-svc` only resolves within the same namespace; must use `payment-svc.payments.svc.cluster.local`
- NetworkPolicy blocking cross-namespace ingress — default deny in the destination namespace without an allow rule for the source namespace (most common in enterprise setups)
- NetworkPolicy blocking egress — default deny in the source namespace blocks outbound connections
- Cilium/Calico-specific policies — CRD-based policies more restrictive than K8s-native NetworkPolicy
- Service type mismatch — ExternalName service pointing to wrong FQDN
- Istio AuthorizationPolicy — service mesh deny-by-default blocking cross-namespace traffic
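The root causes above sort into a simple triage order — DNS first, endpoints second, policy last. A hypothetical pure-shell sketch of that decision, where the inputs are the yes/no answers from Steps 1–2 (nothing here touches the cluster):

```shell
# Given the answers from Steps 1-2, print the most likely root cause bucket.
diagnose() {
  dns_ok=$1        # did the FQDN resolve? (Step 1)
  endpoints_ok=$2  # does the service have endpoints? (Step 2)
  if [ "$dns_ok" = "no" ]; then
    echo "Short name or CoreDNS issue: switch to the FQDN and re-test"
  elif [ "$endpoints_ok" = "no" ]; then
    echo "Service has no backends: check selector and pod readiness"
  else
    echo "Connectivity blocked: check NetworkPolicy / CNI CRDs / mesh policy"
  fi
}

diagnose yes yes   # DNS resolves, endpoints exist, connection still times out
```

In the scenario above both answers are "yes", which points straight at the policy layer (Steps 3–6) rather than at DNS or the Service.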
Fix + Prevention
```bash
# Fix 1: Application must use FQDN for cross-namespace calls
# In app config or env:
#   PAYMENT_SERVICE_URL=http://payment-svc.payments.svc.cluster.local:8080
```
```bash
# Fix 2: Create NetworkPolicy allowing cross-namespace traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-svc
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: checkout
      podSelector:
        matchLabels:
          app: checkout-svc
    ports:
    - port: 8080
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payments-egress
  namespace: checkout
spec:
  podSelector:
    matchLabels:
      app: checkout-svc
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          team: payments
    ports:
    - port: 8080
      protocol: TCP
EOF
```
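One silent failure mode with Fix 2: `namespaceSelector` matches namespace *labels*, not names, so the `checkout` and `payments` namespaces must actually carry the `team` label or the policy matches nothing. A hypothetical manifest:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: checkout
  labels:
    team: checkout   # must match the namespaceSelector in allow-checkout-ingress
```

On recent clusters every namespace also gets the automatic `kubernetes.io/metadata.name` label, which can be used in selectors to avoid maintaining custom labels at all.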
```bash
# Fix 3: For Istio, create AuthorizationPolicy
cat <<EOF | kubectl apply -f -
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-svc
  rules:
  - from:
    - source:
        namespaces: ["checkout"]
        principals: ["cluster.local/ns/checkout/sa/checkout-sa"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/api/charge"]
EOF
```
```bash
# Verify
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- \
    http://payment-svc.payments.svc.cluster.local:8080/api/charge
{"status": "ok", "transaction_id": "txn-789"}
```

Prevention:
- Standardize on FQDN for all cross-namespace service calls in golden path configs
- Maintain a service dependency matrix as part of namespace provisioning
- Create NetworkPolicy templates that are applied alongside namespace creation
- Use Cilium Hubble or Calico flow logs to visualize cross-namespace traffic patterns
- Document allowed communication paths in a “service mesh” diagram per tenant
Cross-Namespace Communication Checklist:

```
+-----------------------------------------------+
| 1. DNS: Use FQDN (svc.ns.svc.cluster.local)   |
| 2. Ingress NetPol: Allow source namespace     |
| 3. Egress NetPol: Allow destination namespace |
| 4. Istio AuthZ: Allow source principal        |
| 5. Endpoints: Verify svc has backends         |
| 6. Ports: Match in all policies               |
+-----------------------------------------------+
```

Quick Reference: Scenario to Command Map
```
+----+----------------------------+------------------------------------------------------------------+
| #  | Scenario                   | First Command to Run                                             |
+----+----------------------------+------------------------------------------------------------------+
| 1  | Pod Pending                | kubectl describe pod <pod> -n <ns>                               |
| 2  | CrashLoopBackOff           | kubectl logs <pod> --previous -n <ns>                            |
| 3  | ImagePullBackOff           | kubectl describe pod <pod> -n <ns>                               |
| 4  | No traffic to pod          | kubectl get endpoints <svc> -n <ns>                              |
| 5  | DNS failures               | kubectl get pods -n kube-system -l k8s-app=kube-dns              |
| 6  | Node NotReady              | kubectl describe node <node>                                     |
| 7  | PVC Pending                | kubectl describe pvc <pvc> -n <ns>                               |
| 8  | Ingress not routing        | kubectl describe ingress <name> -n <ns>                          |
| 9  | IRSA/WI not working        | kubectl get sa <sa> -n <ns> -o yaml                              |
| 10 | OOMKilled                  | kubectl describe pod <pod> -n <ns>                               |
| 11 | Pod Terminating            | kubectl get pod <pod> -o jsonpath='{.metadata.finalizers}'       |
| 12 | HPA not scaling            | kubectl describe hpa <hpa> -n <ns>                               |
| 13 | Cert expiry                | kubectl describe certificate <cert> -n <ns>                      |
| 14 | NetworkPolicy blocking     | kubectl get networkpolicy -n <ns>                                |
| 15 | ArgoCD sync failure        | argocd app get <app> --show-operation                            |
| 16 | Karpenter not provisioning | kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter  |
| 17 | Cross-ns comm failure      | kubectl get networkpolicy -n <dest-ns>                           |
+----+----------------------------+------------------------------------------------------------------+
```

Prometheus Alerts for All 17 Scenarios
These are the alerts you should have configured as the platform team:
```yaml
# Alert rules covering all 17 scenarios
groups:
- name: k8s-troubleshooting-alerts
  rules:
  # Scenario 1: Pod Pending > 5 min
  - alert: PodStuckPending
    expr: kube_pod_status_phase{phase="Pending"} == 1
    for: 5m

  # Scenario 2: CrashLoopBackOff
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
    for: 5m

  # Scenario 3: ImagePullBackOff
  - alert: ImagePullFailure
    expr: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} == 1
    for: 5m

  # Scenario 4: Service with no endpoints
  - alert: ServiceNoEndpoints
    expr: kube_endpoint_address_available == 0
    for: 5m

  # Scenario 5: CoreDNS down
  - alert: CoreDNSDown
    expr: up{job="coredns"} == 0
    for: 2m

  # Scenario 6: Node NotReady
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 3m

  # Scenario 7: PVC Pending > 5 min
  - alert: PVCStuckPending
    expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
    for: 5m

  # Scenario 10: OOMKilled
  - alert: ContainerOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

  # Scenario 12: HPA at max replicas
  - alert: HPAAtMaxReplicas
    expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
    for: 10m

  # Scenario 13: Certificate expiring in 14 days
  - alert: CertificateExpiringSoon
    expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14*24*3600
```