
Troubleshooting — 17 Debug Scenarios

Your clusters are running. Alerts are firing. Something is broken at 2 AM. This page is your daily reference — 17 scenarios you will encounter repeatedly as the central infra team managing EKS/GKE for 50+ tenant teams.

Troubleshooting — Where This Fits

Before diving into scenarios, here are the commands you will use in every single debugging session:

Terminal window
# The Big Five — run these first for ANY pod issue
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl get events -n <ns> --sort-by='.metadata.creationTimestamp' | tail -20
kubectl top pod -n <ns>
# Node-level
kubectl get nodes -o wide
kubectl describe node <node>
kubectl top nodes
# Network
kubectl get svc,ep,ingress -n <ns>
kubectl get networkpolicy -n <ns>
# Auth
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>
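A note on the events command above: `kubectl get events` returns events in arbitrary order, which is why `--sort-by` matters before you `tail`. ISO-8601 timestamps sort chronologically as plain strings, so a lexical sort is enough — a quick simulation with made-up event lines:

```shell
# ISO-8601 timestamps sort chronologically as plain strings; after sorting,
# tail keeps the newest events (here: the Unhealthy and BackOff lines)
printf '%s\n' \
  '2026-03-15T02:14:00Z Warning BackOff    restarting failed container' \
  '2026-03-15T02:01:00Z Normal  Scheduled  assigned pod to node' \
  '2026-03-15T02:05:00Z Warning Unhealthy  liveness probe failed' |
  sort | tail -2
```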

Scenario 1: Pod Stuck in Pending

Pod stays in Pending status indefinitely. No node is assigned.

Terminal window
$ kubectl get pods -n payments
NAME READY STATUS RESTARTS AGE
payment-api-7d4f8b6c9-x2k4l 0/1 Pending 0 12m
payment-api-7d4f8b6c9-m9n3p 0/1 Pending 0 12m
Terminal window
# Step 1: Describe the pod — look at the Events section at the bottom
$ kubectl describe pod payment-api-7d4f8b6c9-x2k4l -n payments
# ...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12m default-scheduler 0/5 nodes are available:
2 Insufficient cpu, 3 node(s) had taint
{team=data: NoSchedule} that the pod didn't tolerate.
# Step 2: Check node capacity and allocatable resources
$ kubectl describe nodes | grep -A 5 "Allocated resources"
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3800m (95%) 7600m (190%)
memory 6Gi (82%) 12Gi (164%)
# Step 3: Check if there are taints blocking scheduling
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME TAINTS
ip-10-1-1-100.ec2.internal [map[effect:NoSchedule key:team value:data]]
ip-10-1-1-101.ec2.internal [map[effect:NoSchedule key:team value:data]]
ip-10-1-1-102.ec2.internal [map[effect:NoSchedule key:team value:data]]
ip-10-1-2-200.ec2.internal <none>
ip-10-1-2-201.ec2.internal <none>
# Step 4: Check PVC binding if the pod uses persistent volumes
$ kubectl get pvc -n payments
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
payment-data Pending gp3-encrypted 12m

One or more of these:

  1. Insufficient resources — all nodes are at capacity (cpu/memory requests exhausted)
  2. Taints not tolerated — nodes have taints the pod doesn’t tolerate
  3. Node affinity mismatch — pod requires specific node labels that no node has
  4. PVC not bound — the pod references a PVC that is stuck in Pending (see Scenario 7)
  5. Pod topology spread constraints — cannot satisfy distribution requirements
  6. ResourceQuota exceeded — namespace quota for cpu/memory is maxed out
Terminal window
# Check ResourceQuota
$ kubectl get resourcequota -n payments
NAME AGE REQUEST LIMIT
payments-quota 30d requests.cpu: 3900m/4000m, limits.cpu: 7800m/8000m,
requests.memory: 6.5Gi/8Gi, limits.memory: 13Gi/16Gi
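The quota output above is often the whole story: only 100m of requests.cpu headroom remains. The 250m per-pod request below is an assumed figure for illustration — the real value lives in the Deployment spec:

```shell
# Why the quota blocks scheduling: 3900m of 4000m requests.cpu is used.
# per_pod=250m is an assumed request value, not taken from the cluster.
used=3900; hard=4000; per_pod=250; replicas=2
free=$((hard - used))
needed=$((per_pod * replicas))
echo "headroom: ${free}m, needed: ${needed}m"
if [ "$needed" -gt "$free" ]; then
  echo "scheduler rejects the pods: exceeded quota requests.cpu"
fi
```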
Terminal window
# Immediate: If resource shortage, scale up nodes or reduce requests
# For Karpenter — it should auto-provision. If not, see Scenario 16.
# For Cluster Autoscaler:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# If taint issue, add tolerations to the pod spec:
# spec.tolerations:
#   - key: "team"
#     operator: "Equal"
#     value: "payments"
#     effect: "NoSchedule"
# If ResourceQuota, request increase or optimize existing workloads:
kubectl patch resourcequota payments-quota -n payments \
--type='json' -p='[{"op":"replace","path":"/spec/hard/requests.cpu","value":"6000m"}]'

Prevention:

  • Set up Prometheus alerts for node utilization > 80%
  • Use Karpenter/NAP for just-in-time node provisioning
  • Enforce LimitRange so teams cannot request excessive resources
  • Review ResourceQuota during team onboarding

Scenario 2: CrashLoopBackOff

Pod starts, crashes, restarts, crashes again. Backoff delay increases each time.

Terminal window
$ kubectl get pods -n checkout
NAME READY STATUS RESTARTS AGE
checkout-svc-5f6d7c8b9-k3m2n 0/1 CrashLoopBackOff 7 (2m ago) 15m
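The backoff delay is not configurable: kubelet starts at 10s and doubles it after each crash, capping at five minutes (the counter resets after the container runs cleanly for 10 minutes). A sketch of the schedule:

```shell
# CrashLoopBackOff schedule: 10s base, doubled per restart, capped at 300s
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "after crash $restart: wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```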
Terminal window
# Step 1: Check the PREVIOUS container's logs (current container already crashed)
$ kubectl logs checkout-svc-5f6d7c8b9-k3m2n -n checkout --previous
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation]
goroutine 1 [running]:
main.main()
/app/main.go:42 +0x1a4
# Step 2: If no --previous logs, check current attempt
$ kubectl logs checkout-svc-5f6d7c8b9-k3m2n -n checkout
Error: required environment variable DATABASE_URL not set
# Step 3: Check exit code and reason from describe
$ kubectl describe pod checkout-svc-5f6d7c8b9-k3m2n -n checkout
# Look for:
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 15 Mar 2026 02:14:22 +0000
Finished: Sun, 15 Mar 2026 02:14:23 +0000
# Or:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
# Step 4: Check if ConfigMap/Secret exists
$ kubectl get configmap checkout-config -n checkout
Error from server (NotFound): configmaps "checkout-config" not found
# Step 5: Check if liveness probe is killing the container
$ kubectl describe pod checkout-svc-5f6d7c8b9-k3m2n -n checkout | grep -A 10 "Liveness"
Liveness: http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
Events:
Warning Unhealthy 2m (x9 over 14m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing 2m (x3 over 12m) kubelet Container checkout failed liveness probe, will be restarted
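A side note on the exit codes in the describe output: 137 is not an application-defined code. It is 128 plus the signal number, and SIGKILL is 9 — the signal the kernel OOM killer delivers. You can reproduce it locally without a cluster:

```shell
# Exit code 137 = 128 + 9 (SIGKILL). Reproduce by killing a background process.
sleep 30 &
pid=$!
kill -9 "$pid"
code=0
wait "$pid" || code=$?   # wait reports 128 + signal for a killed child
echo "exit code: $code"  # prints: exit code: 137
```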

Common causes ranked by frequency:

  1. Missing ConfigMap/Secret — app requires env vars that don’t exist in the cluster
  2. Application bug — nil pointer, unhandled exception on startup
  3. OOMKilled — container exceeds memory limit (exit code 137). See Scenario 10
  4. Liveness probe too aggressive — app needs 30s to start, probe fails at 5s
  5. Wrong command/args — container entrypoint is incorrect
  6. Permissions — app cannot read files, connect to database, or access cloud APIs
Terminal window
# Missing config — create the missing ConfigMap
kubectl create configmap checkout-config -n checkout \
--from-literal=DATABASE_URL="postgresql://db.internal:5432/checkout"
# Liveness probe too aggressive — increase initialDelaySeconds
# In the Deployment spec:
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: 8080
#   initialDelaySeconds: 30  # <-- was 5, increase this
#   periodSeconds: 10
#   failureThreshold: 5      # <-- was 3, more tolerance
# OOMKilled — increase memory limit
kubectl set resources deployment checkout-svc -n checkout \
--limits=memory=512Mi --requests=memory=256Mi
# Quick restart without waiting for backoff
kubectl delete pod checkout-svc-5f6d7c8b9-k3m2n -n checkout
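When tuning the probe values above, compute the worst-case window first: a container that never responds is restarted after roughly initialDelaySeconds plus periodSeconds × failureThreshold (approximate — kubelet adds jitter and probe timeouts on top):

```shell
# Approximate time before kubelet restarts a container that never passes
# its liveness probe: initialDelay + period * failureThreshold
delay=30; period=10; failures=5
echo "original config (5s/10s/3): $((5 + 10 * 3))s"
echo "tuned config (30s/10s/5):   $((delay + period * failures))s"
```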

Prevention:

  • Use startupProbe for slow-starting apps (separate from liveness)
  • Enforce ExternalSecrets or Sealed Secrets so configs are always present via GitOps
  • Set up VPA in recommendation mode to right-size memory limits
  • Run pre-deploy checks in CI: validate all referenced ConfigMaps/Secrets exist

Scenario 3: ImagePullBackOff / ErrImagePull

Pod cannot pull its container image. Status flips between ErrImagePull and ImagePullBackOff.

Terminal window
$ kubectl get pods -n fraud-detection
NAME READY STATUS RESTARTS AGE
fraud-ml-model-6b7c8d9e0-p4q5 0/1 ImagePullBackOff 0 8m
Terminal window
# Step 1: Describe pod — look at Events
$ kubectl describe pod fraud-ml-model-6b7c8d9e0-p4q5 -n fraud-detection
Events:
Warning Failed 8m kubelet Failed to pull image
"123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml:v2.3.1":
rpc error: code = Unknown desc = Error response from daemon:
pull access denied for 123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml,
repository does not exist or may require 'docker login'
# Step 2: Check if the image actually exists in ECR
$ aws ecr describe-images --repository-name fraud-ml \
--image-ids imageTag=v2.3.1 --region eu-west-1
# Error: ImageNotFoundException
# Step 3: Check imagePullSecrets configuration
$ kubectl get pod fraud-ml-model-6b7c8d9e0-p4q5 -n fraud-detection \
-o jsonpath='{.spec.imagePullSecrets}'
[] # <-- empty, no pull secret configured
# Step 4: Check ServiceAccount for ECR/GAR pull permissions
$ kubectl get sa -n fraud-detection -o yaml | grep -A 3 annotations
annotations:
eks.amazonaws.com/role-arn: "" # <-- no IRSA role for ECR access
# Step 5: Verify ECR token (for debugging only)
$ aws ecr get-login-password --region eu-west-1 | \
docker login --username AWS --password-stdin \
123456789012.dkr.ecr.eu-west-1.amazonaws.com
Login Succeeded # <-- if this works, it's a node/SA permissions issue
Likely causes:

  1. Image tag doesn’t exist — typo in tag, image never pushed, tag was overwritten
  2. ECR/GAR authentication failure — node IAM role lacks ecr:GetDownloadUrlForLayer, or IRSA not configured
  3. Private registry without imagePullSecret — pulling from Docker Hub, Quay, or another private registry
  4. Wrong region — ECR repo is in us-east-1 but cluster is in eu-west-1
  5. ECR token expired — ECR tokens last 12 hours; if using static secrets, they expire
  6. Network — node cannot reach the registry (missing NAT Gateway, VPC endpoint, or firewall rule)
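Cause 5 deserves a number: an ECR authorization token is valid for exactly 12 hours from issuance, so any static imagePullSecret created from `aws ecr get-login-password` goes stale twice a day. The expiry arithmetic (GNU date syntax assumed):

```shell
# ECR tokens last 12 hours from issue time (GNU date flags assumed)
issued=$(date -u -d '2026-03-15 02:00:00' +%s)
expires=$((issued + 12 * 3600))
date -u -d "@$expires" '+token expires: %Y-%m-%d %H:%M UTC'
```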
Terminal window
# Image doesn't exist — verify and push
aws ecr describe-images --repository-name fraud-ml --region eu-west-1 \
--query 'imageDetails[*].imageTags' --output table
# Push the correct image from CI/CD
# ECR auth — ensure node IAM role has ECR permissions (EKS managed node group)
# Or use IRSA for pull:
# Policy: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
# For private registries — create imagePullSecret
kubectl create secret docker-registry regcred -n fraud-detection \
--docker-server=https://index.docker.io/v1/ \
--docker-username=<user> \
--docker-password=<token>
# Attach to ServiceAccount (better than per-pod)
kubectl patch serviceaccount default -n fraud-detection \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'
# Network — if behind NAT, check route tables and security groups
# Consider ECR VPC Endpoint for private subnets:
# com.amazonaws.<region>.ecr.dkr
# com.amazonaws.<region>.ecr.api
# com.amazonaws.<region>.s3 (gateway endpoint for image layers)

Prevention:

  • Use immutable image tags (sha256 digests) instead of mutable tags like latest
  • Set up ECR replication to the cluster’s region
  • Use ECR pull-through cache for public images
  • Create a platform-level imagePullSecret via ExternalSecrets in every namespace
  • CI pipeline should verify image exists before updating deployment manifest

Scenario 4: Pod Running But Not Receiving Traffic


Pod shows Running and 1/1 Ready, but requests never reach it. Users report 503 or connection timeout.

Terminal window
$ kubectl get pods -n api-gateway
NAME READY STATUS RESTARTS AGE
api-gw-7d8e9f0a1-b2c3d 1/1 Running 0 30m
$ kubectl get svc -n api-gateway
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
api-gw ClusterIP 10.100.45.123 <none> 8080/TCP 30d
Terminal window
# Step 1: Check if endpoints exist for the service
$ kubectl get endpoints api-gw -n api-gateway
NAME ENDPOINTS AGE
api-gw <none> 30d
# ^^ EMPTY endpoints — this is the problem
# Step 2: Compare service selector with pod labels
$ kubectl get svc api-gw -n api-gateway -o jsonpath='{.spec.selector}'
{"app":"api-gateway","version":"v2"}
$ kubectl get pod api-gw-7d8e9f0a1-b2c3d -n api-gateway --show-labels
NAME READY STATUS LABELS
api-gw-7d8e9f0a1-b2c3d 1/1 Running app=api-gw,version=v2
# ^^ Label is "app=api-gw" but service selects "app=api-gateway" — MISMATCH
# Step 3: If labels match, check readiness probe
$ kubectl describe pod api-gw-7d8e9f0a1-b2c3d -n api-gateway | grep -A 8 "Readiness"
Readiness: http-get http://:8080/ready delay=5s timeout=1s period=10s #success=1 #failure=3
...
Warning Unhealthy 1m (x45 over 28m) kubelet Readiness probe failed:
HTTP probe failed with statuscode: 503
# Step 4: Test connectivity from inside the cluster
$ kubectl run debug-net --rm -it --image=busybox -n api-gateway -- sh
/ # wget -qO- http://api-gw.api-gateway.svc.cluster.local:8080/ready
wget: server returned error: HTTP/1.1 503 Service Unavailable
# Step 5: Check the actual container port
$ kubectl get pod api-gw-7d8e9f0a1-b2c3d -n api-gateway \
-o jsonpath='{.spec.containers[0].ports}'
[{"containerPort":3000,"protocol":"TCP"}]
# ^^ Container listens on 3000 but service targets 8080
Likely causes:

  1. Service selector does not match pod labels — most common cause
  2. Readiness probe failing — pod is Running but not Ready, so it’s removed from endpoints
  3. Port mismatch — service targetPort doesn’t match container’s actual listening port
  4. Container listening on wrong interface — app binds to 127.0.0.1 instead of 0.0.0.0
  5. NetworkPolicy blocking ingress traffic to the pod (see Scenario 14)
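Cause 1 can be checked mechanically. A pure-shell comparison of the selector and label strings taken from the outputs above (`jq` would be tidier, but this runs anywhere):

```shell
# Compare the Service selector with the pod's labels (values from the outputs above)
selector="app=api-gateway,version=v2"
labels="app=api-gw,version=v2"
IFS=','
for pair in $selector; do
  key=${pair%%=*}; want=${pair#*=}
  have="<missing>"
  for lp in $labels; do
    if [ "${lp%%=*}" = "$key" ]; then have=${lp#*=}; fi
  done
  [ "$want" = "$have" ] || echo "MISMATCH on '$key': service selects '$want', pod has '$have'"
done
unset IFS
```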
Terminal window
# Fix label mismatch — update service selector OR pod labels
kubectl patch svc api-gw -n api-gateway \
--type='json' -p='[{"op":"replace","path":"/spec/selector/app","value":"api-gw"}]'
# Fix port mismatch — update service targetPort
kubectl patch svc api-gw -n api-gateway \
--type='json' -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":3000}]'
# Verify fix — endpoints should now show the pod IP
$ kubectl get endpoints api-gw -n api-gateway
NAME ENDPOINTS AGE
api-gw 10.1.2.34:3000 30d

Prevention:

  • Use Helm/Kustomize templates where service selector and pod labels are generated from the same variable
  • Add integration tests in CI that deploy to a test namespace and verify endpoints are populated
  • Standardize on port naming (http, grpc) in golden path templates
  • Alert on services with 0 endpoints: kube_endpoint_address_available == 0

Scenario 5: DNS Resolution Failing Cluster-Wide

Pods cannot resolve internal service names or external domains. Application logs show connection errors to hostnames.

Terminal window
$ kubectl exec -it checkout-svc-abc123 -n checkout -- nslookup payment-svc.payments.svc.cluster.local
;; connection timed out; no servers could be reached
$ kubectl exec -it checkout-svc-abc123 -n checkout -- nslookup google.com
;; connection timed out; no servers could be reached
Terminal window
# Step 1: Check CoreDNS pods
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-5d78c9869d-j4k5l 0/1 CrashLoopBackOff 12 2h
coredns-5d78c9869d-m6n7o 0/1 CrashLoopBackOff 12 2h
# Step 2: Check CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
[FATAL] plugin/loop: Loop (127.0.0.1:53 -> :53) detected for zone ".",
flushing cache
# Step 3: Check CoreDNS ConfigMap
$ kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
# Step 4: Check if DNS service has endpoints
$ kubectl get svc kube-dns -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
kube-dns ClusterIP 10.100.0.10 <none> 53/UDP,53/TCP
$ kubectl get endpoints kube-dns -n kube-system
NAME ENDPOINTS AGE
kube-dns <none> 90d
# ^^ No endpoints because CoreDNS pods are crashing
# Step 5: Test with explicit DNS server (bypass pod's resolv.conf)
$ kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default 10.100.0.10
;; connection timed out; no servers could be reached
# Step 6: Check resolv.conf inside the pod
$ kubectl exec -it checkout-svc-abc123 -n checkout -- cat /etc/resolv.conf
nameserver 10.100.0.10
search checkout.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
Likely causes:

  1. CoreDNS pods crashing — DNS loop detected (node’s /etc/resolv.conf has 127.0.0.1 as nameserver)
  2. CoreDNS resource starvation — too many DNS queries, CoreDNS OOMKilled or CPU-throttled
  3. ndots:5 performance issue — every non-FQDN query generates 5 search domain lookups before the actual query
  4. Upstream DNS unreachable — VPC DNS resolver (.2 address) rate limited or AmazonProvidedDNS issues
  5. NetworkPolicy blocking UDP/53 — egress policy blocks DNS traffic to kube-system
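Cause 3 is worth seeing concretely. With `ndots:5`, any name containing fewer than five dots is tried against every entry in the search path before being queried as an absolute name — so one application-level lookup of `payment-svc.payments` fans out into five DNS queries against the resolv.conf shown above:

```shell
# With ndots:5, a name with fewer than 5 dots is tried against each
# search domain first, then as an absolute name — 5 queries for 1 lookup
name="payment-svc.payments"
search="checkout.svc.cluster.local svc.cluster.local cluster.local ec2.internal"
dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) + 0 ))
if [ "$dots" -lt 5 ]; then
  for domain in $search; do
    echo "query: ${name}.${domain}."
  done
fi
echo "query: ${name}."
```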
Terminal window
# Fix DNS loop — edit CoreDNS ConfigMap to forward to VPC DNS directly
kubectl edit configmap coredns -n kube-system
# Change: forward . /etc/resolv.conf
# To: forward . 169.254.169.253 (EKS VPC DNS)
# Or: forward . 169.254.169.254 (GKE metadata DNS)
# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system
# Fix ndots performance — in pod spec, set dnsConfig:
# spec:
#   dnsConfig:
#     options:
#       - name: ndots
#         value: "2"
# Scale CoreDNS for large clusters (>100 nodes)
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache (preferred for large clusters)
# This runs a DNS cache on every node, reducing CoreDNS load
# Deploy: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

Prevention:

  • Deploy NodeLocal DNSCache as part of cluster baseline
  • Monitor CoreDNS with: coredns_dns_requests_total, coredns_dns_responses_rcode_total
  • Alert on CoreDNS pod restarts
  • Use FQDN with trailing dot in critical service calls (payment-svc.payments.svc.cluster.local.)
  • Set ndots:2 in golden path pod templates

Scenario 6: Node NotReady

One or more nodes show NotReady. Pods on those nodes are evicted or stuck.

Terminal window
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-1-1-100.ec2.internal Ready <none> 30d v1.29.1
ip-10-1-1-101.ec2.internal NotReady <none> 30d v1.29.1
ip-10-1-1-102.ec2.internal Ready <none> 30d v1.29.1
Terminal window
# Step 1: Describe the NotReady node — check Conditions
$ kubectl describe node ip-10-1-1-101.ec2.internal
Conditions:
Type Status LastHeartbeatTime Reason
---- ------ ----------------- ------
MemoryPressure True Sun, 15 Mar 2026 02:00:00 +0000 KubeletHasMemoryPressure
DiskPressure True Sun, 15 Mar 2026 02:00:00 +0000 KubeletHasDiskPressure
PIDPressure False Sun, 15 Mar 2026 02:00:00 +0000 KubeletHasSufficientPID
Ready False Sun, 15 Mar 2026 01:58:00 +0000 KubeletNotReady
message: 'container runtime not ready'
# Step 2: Check node resource usage
$ kubectl top node ip-10-1-1-101.ec2.internal
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-1-1-101.ec2.internal 3800m 95% 14900Mi 96%
# Step 3: Check system pods on that node
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-1-1-101.ec2.internal
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-t8k2l 1/1 Running 0 30d
kube-system kube-proxy-m4n5o 1/1 Running 0 30d
monitoring prom-node-exporter-p6q7r 1/1 Running 0 30d
payments payment-worker-heavy-abc123 1/1 Running 0 2h
# Step 4: If you have SSH access (EKS managed nodes via SSM)
$ aws ssm start-session --target i-0abc123def456
$ journalctl -u kubelet --since "30 minutes ago" | tail -30
Mar 15 02:00:12 kubelet: E0315 02:00:12.123456 node_status.go:
"node not ready" err="container runtime not responding"
$ systemctl status containerd
containerd.service - containerd container runtime
Active: inactive (dead) since Sun 2026-03-15 01:58:00 UTC
# Step 5: Check if EC2 instance has issues
$ aws ec2 describe-instance-status --instance-ids i-0abc123def456
{
  "InstanceStatuses": [{
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "ok"}
  }]
}
Likely causes:

  1. Memory/Disk pressure — kubelet marks node NotReady when system resources are exhausted
  2. Container runtime crashed — containerd/dockerd is not responding
  3. Kubelet stopped — kubelet process died, node stops sending heartbeats
  4. Network partition — node cannot reach API server (security group change, NACL, route table)
  5. EC2 system failure — underlying hardware issue, instance status check failed
  6. Disk full — /var/lib/containerd full from old images, container logs, or emptyDir volumes
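On cause 1: kubelet flips MemoryPressure when available memory drops below its eviction threshold, not when usage hits 100%. With a hypothetical `memory.available<500Mi` hard threshold on a 16Gi node, the arithmetic is:

```shell
# With --eviction-hard=memory.available<500Mi on a 16Gi (16384Mi) node,
# evictions begin once workloads push usage past capacity - threshold
capacity=16384; threshold=500
echo "MemoryPressure / evictions begin above $((capacity - threshold))Mi used"
```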
Terminal window
# Immediate — if hardware issue, cordon and drain
kubectl cordon ip-10-1-1-101.ec2.internal
kubectl drain ip-10-1-1-101.ec2.internal --ignore-daemonsets --delete-emptydir-data --force
# On the node (via SSM):
# Restart containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet
# Clear disk space
sudo crictl rmi --prune
sudo journalctl --vacuum-size=500M
# For managed node groups — just terminate the instance
# ASG will replace it automatically
aws ec2 terminate-instances --instance-ids i-0abc123def456
# For Karpenter — delete the node, Karpenter provisions replacement
kubectl delete node ip-10-1-1-101.ec2.internal

Prevention:

  • Set kubelet eviction thresholds: --eviction-hard=memory.available<500Mi,nodefs.available<10%
  • Use Karpenter ttlSecondsUntilExpired to cycle nodes regularly (e.g., 7 days)
  • Monitor node conditions with Prometheus: kube_node_status_condition{condition="Ready",status="true"} == 0
  • Use instance types with enough ephemeral storage (or attach separate EBS for containerd)
  • Set resource requests on ALL pods to prevent noisy-neighbor memory exhaustion

Scenario 7: PVC Stuck in Pending

PersistentVolumeClaim stays in Pending. Pods using it are also stuck in Pending.

Terminal window
$ kubectl get pvc -n databases
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
postgres-data-0 Pending gp3-encrypted 15m
Terminal window
# Step 1: Describe the PVC
$ kubectl describe pvc postgres-data-0 -n databases
Events:
Warning ProvisioningFailed 2m (x7 over 15m) ebs.csi.aws.com_ebs-csi-controller-xxx
failed to provision volume with StorageClass "gp3-encrypted":
rpc error: could not create volume "pvc-xxx" in zone "eu-west-1a":
UnauthorizedOperation: not authorized to perform: ec2:CreateVolume
# Step 2: Check if StorageClass exists
$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
gp2 (default) kubernetes.io/aws-ebs Delete Immediate
gp3-encrypted ebs.csi.aws.com Retain WaitForFirstConsumer
# ^^ StorageClass exists
# Step 3: Check EBS CSI driver pods
$ kubectl get pods -n kube-system -l app=ebs-csi-controller
NAME READY STATUS RESTARTS AGE
ebs-csi-controller-5d8f9g0h1-a2b3c 6/6 Running 0 7d
ebs-csi-controller-5d8f9g0h1-d4e5f 6/6 Running 0 7d
# Step 4: Check EBS CSI controller logs
$ kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner --tail=20
E0315 02:30:00.123456 controller.go:XXX could not create volume:
UnauthorizedOperation: not authorized to perform ec2:CreateVolume
on resource arn:aws:ec2:eu-west-1:123456789012:volume/*
# Step 5: Check IRSA role for the CSI driver
$ kubectl get sa ebs-csi-controller-sa -n kube-system -o yaml | grep role-arn
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ebs-csi-role
# Step 6: Verify IAM policy
$ aws iam get-role-policy --role-name ebs-csi-role --policy-name ebs-policy
# Look for ec2:CreateVolume, ec2:AttachVolume, ec2:DeleteVolume permissions
Likely causes:

  1. IAM permissions — EBS CSI / PD CSI driver service account lacks permissions to create volumes
  2. StorageClass not found — PVC references a StorageClass that doesn’t exist
  3. AZ mismatch with WaitForFirstConsumer — node is in eu-west-1a but PV was pre-provisioned in eu-west-1b
  4. Quota exceeded — AWS EBS volume quota or GCP PD quota hit
  5. Encryption KMS key — StorageClass specifies a KMS key the CSI driver role cannot access
  6. CSI driver not installed — EBS CSI driver addon not enabled on the cluster
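Cause 3 in miniature: EBS volumes are zonal, so a volume created (or pre-provisioned) in one AZ can never attach to a node in another. `WaitForFirstConsumer` avoids this by delaying provisioning until the pod is scheduled; the AZ values below are illustrative:

```shell
# EBS volumes are zonal: a cross-AZ attach always fails (illustrative values)
volume_az="eu-west-1b"; node_az="eu-west-1a"
if [ "$volume_az" != "$node_az" ]; then
  echo "FailedAttachVolume: volume in $volume_az, node in $node_az"
  echo "fix: volumeBindingMode: WaitForFirstConsumer provisions in the pod's AZ"
fi
```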
Terminal window
# Fix IAM — attach the correct policy to the CSI driver IRSA role
aws iam attach-role-policy --role-name ebs-csi-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
# Fix StorageClass — create if missing
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: arn:aws:kms:eu-west-1:123456789012:key/mrk-abc123
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
# Fix KMS — grant CSI role access to the KMS key
aws kms create-grant --key-id mrk-abc123 \
--grantee-principal arn:aws:iam::123456789012:role/ebs-csi-role \
--operations "CreateGrant" "Encrypt" "Decrypt" "GenerateDataKey" "DescribeKey"
# Check quota
aws service-quotas get-service-quota --service-code ebs \
--quota-code L-D18FCD1D --region eu-west-1

Prevention:

  • Include EBS CSI driver as a cluster add-on in Terraform (not manual install)
  • Pre-create and test StorageClasses in cluster baseline
  • Use volumeBindingMode: WaitForFirstConsumer to avoid AZ mismatch
  • Monitor PVC age with Prometheus: alert if PVC is Pending > 5 minutes
  • Grant KMS key access in Terraform alongside the CSI driver role

Scenario 8: Ingress Not Routing Traffic

Ingress resource exists but external traffic returns 404, 502, or connection refused.

Terminal window
$ kubectl get ingress -n frontend
NAME CLASS HOSTS ADDRESS PORTS AGE
web-app alb app.bank.com k8s-frontend-web-abc123.eu-west-1.elb.amazonaws.com 80,443 10m
# But accessing app.bank.com returns 502 Bad Gateway
$ curl -I https://app.bank.com
HTTP/2 502
server: awselb/2.0
Terminal window
# Step 1: Check Ingress resource details
$ kubectl describe ingress web-app -n frontend
Rules:
Host Path Backends
---- ---- --------
app.bank.com / web-app-svc:80 (10.1.2.34:3000)
# Step 2: Check if the backend service and endpoints exist
$ kubectl get svc web-app-svc -n frontend
NAME TYPE CLUSTER-IP PORT(S) AGE
web-app-svc ClusterIP 10.100.67.89 80/TCP 10m
$ kubectl get endpoints web-app-svc -n frontend
NAME ENDPOINTS AGE
web-app-svc 10.1.2.34:3000 10m
# Step 3: Check ALB Ingress Controller / AWS Load Balancer Controller logs
$ kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=30
{"level":"error","msg":"Failed to reconcile ingress frontend/web-app:
failed to resolve target group health check: backend service web-app-svc
does not have matching target port annotation"}
# Step 4: Check AWS target group health
$ aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/k8s-frontend-web/abc123
{
  "TargetHealthDescriptions": [{
    "Target": {"Id": "10.1.2.34", "Port": 3000},
    "TargetHealth": {
      "State": "unhealthy",
      "Reason": "Target.FailedHealthChecks",
      "Description": "Health checks failed with these codes: [404]"
    }
  }]
}
# ^^ ALB health check is hitting the pod but getting 404
# Step 5: Check health check path configuration
$ kubectl get ingress web-app -n frontend -o yaml | grep -A 5 healthcheck
# Look for annotations:
# alb.ingress.kubernetes.io/healthcheck-path: /healthz
# If missing, ALB defaults to "/" which may return 404
# Step 6: Test the health check from inside the cluster
$ kubectl exec -it web-app-abc123 -n frontend -- wget -qO- http://localhost:3000/healthz
OK
$ kubectl exec -it web-app-abc123 -n frontend -- wget -qO- http://localhost:3000/
# If this returns 404, the ALB health check is failing on "/"
Likely causes:

  1. ALB health check failing — default path / returns 404, need to set healthcheck-path annotation
  2. Target group port mismatch — ALB sends traffic to wrong port on the pod
  3. Security group — ALB security group cannot reach pod network (missing node SG ingress rule)
  4. Subnet tags missing — ALB controller cannot discover subnets without kubernetes.io/role/elb: 1 tags
  5. DNS not pointing to ALB — app.bank.com CNAME does not resolve to the ALB address
  6. TLS termination misconfigured — certificate ARN in annotation doesn’t match the domain
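Cause 1 as a simulation: the ALB marks a target healthy only when the health-check path returns a success code (200 by default, configurable with `alb.ingress.kubernetes.io/success-codes`). The `check` function below stands in for the pod's HTTP responses seen in the wget output above, with the success range simplified to anything below 400:

```shell
# Simulated pod responses: "/" 404s, "/healthz" returns 200 (matching the
# wget output above); target health follows the status code
check() { case "$1" in /healthz) echo 200 ;; *) echo 404 ;; esac; }
for path in / /healthz; do
  code=$(check "$path")
  state=unhealthy
  if [ "$code" -lt 400 ]; then state=healthy; fi
  echo "healthcheck-path $path -> HTTP $code -> target $state"
done
```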
Terminal window
# Fix health check path — add annotation
kubectl annotate ingress web-app -n frontend \
alb.ingress.kubernetes.io/healthcheck-path=/healthz --overwrite
# Fix security group — ensure ALB SG allows traffic to node SG
# AWS Load Balancer Controller manages this if:
# alb.ingress.kubernetes.io/security-groups is set correctly
# Fix subnet discovery — tag subnets
aws ec2 create-tags --resources subnet-abc123 \
--tags Key=kubernetes.io/role/elb,Value=1
# Fix TLS — add certificate ARN
kubectl annotate ingress web-app -n frontend \
alb.ingress.kubernetes.io/certificate-arn=arn:aws:acm:eu-west-1:123456789012:certificate/abc-123
# Verify ALB target health after fix
$ aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/k8s-frontend-web/abc123
{
"TargetHealthDescriptions": [{
"Target": {"Id": "10.1.2.34", "Port": 3000},
"TargetHealth": {"State": "healthy"}
}]
}

Prevention:

  • Include health check annotations in golden path Ingress templates
  • Ensure subnet tagging is part of Terraform VPC module
  • Use external-dns to auto-manage DNS records from Ingress resources
  • Alert on ALB target health: aws_alb_tg_unhealthy_host_count > 0

Scenario 9: IRSA / Workload Identity Not Working


Pod cannot access AWS/GCP APIs. Application logs show AccessDenied or 403 Forbidden.

Terminal window
$ kubectl logs s3-uploader-abc123 -n data-pipeline
An error occurred (AccessDenied) when calling the PutObject operation:
User: arn:aws:sts::123456789012:assumed-role/eksctl-cluster-nodegroup-NodeInstanceRole/i-abc123
is not authorized to perform: s3:PutObject on resource: arn:aws:s3:::bank-data-lake/*
# ^^ Using NODE role instead of IRSA role — IRSA is not working
Terminal window
# Step 1: Check ServiceAccount annotation
$ kubectl get sa s3-uploader-sa -n data-pipeline -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: s3-uploader-sa
namespace: data-pipeline
annotations: {} # <-- NO IRSA annotation!
# Step 2: Check if the pod is using the correct ServiceAccount
$ kubectl get pod s3-uploader-abc123 -n data-pipeline \
-o jsonpath='{.spec.serviceAccountName}'
default # <-- Using "default" SA, not "s3-uploader-sa"
# Step 3: Verify the projected token volume exists
$ kubectl get pod s3-uploader-abc123 -n data-pipeline \
-o jsonpath='{.spec.volumes[?(@.name=="aws-iam-token")]}' | jq .
# Should return a projected volume with audience "sts.amazonaws.com"
# If empty — IRSA token not being projected
# Step 4: Check the IAM role trust policy
$ aws iam get-role --role-name s3-uploader-role --query 'Role.AssumeRolePolicyDocument'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/ABC123"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.eu-west-1.amazonaws.com/id/ABC123:sub": "system:serviceaccount:data-pipeline:s3-uploader-sa",
        "oidc.eks.eu-west-1.amazonaws.com/id/ABC123:aud": "sts.amazonaws.com"
      }
    }
  }]
}
# Step 5: Verify OIDC provider exists
$ aws eks describe-cluster --name prod-cluster \
--query 'cluster.identity.oidc.issuer' --output text
https://oidc.eks.eu-west-1.amazonaws.com/id/ABC123
$ aws iam list-open-id-connect-providers | grep ABC123
"Arn": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/ABC123"
# Step 6: Test from inside the pod
$ kubectl exec -it s3-uploader-abc123 -n data-pipeline -- env | grep AWS
AWS_ROLE_ARN= # <-- empty, IRSA not injected
AWS_WEB_IDENTITY_TOKEN_FILE= # <-- empty
Likely causes:

  1. ServiceAccount missing annotation — eks.amazonaws.com/role-arn not set
  2. Pod using wrong ServiceAccount — Deployment spec says serviceAccountName: default
  3. Trust policy mismatch — namespace or SA name in trust policy condition doesn’t match
  4. OIDC provider not created — IRSA requires the EKS OIDC provider registered in IAM
  5. Token audience mismatch — trust policy expects sts.amazonaws.com but token has different audience
  6. Webhook not mutating — Amazon EKS Pod Identity Webhook not installed (needed for IRSA token injection)
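Cause 3 is a plain string comparison: the web identity token's `sub` claim is always `system:serviceaccount:<namespace>:<sa-name>`, and the trust policy condition must match it exactly. A deliberately broken example (SA name truncated) to show the failure mode:

```shell
# The OIDC token's sub claim vs the trust policy condition (exact match required).
# The trust policy value here is deliberately wrong for illustration.
ns="data-pipeline"; sa="s3-uploader-sa"
token_sub="system:serviceaccount:${ns}:${sa}"
trust_sub="system:serviceaccount:data-pipeline:s3-uploader"
if [ "$token_sub" = "$trust_sub" ]; then
  echo "AssumeRoleWithWebIdentity allowed"
else
  echo "AccessDenied: sub '$token_sub' does not match '$trust_sub'"
fi
```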
Terminal window
# Fix 1: Annotate the ServiceAccount
kubectl annotate sa s3-uploader-sa -n data-pipeline \
eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/s3-uploader-role
# Fix 2: Update Deployment to use the correct SA
kubectl patch deployment s3-uploader -n data-pipeline \
--type='json' -p='[{"op":"replace","path":"/spec/template/spec/serviceAccountName","value":"s3-uploader-sa"}]'
# Fix 3: Fix trust policy — ensure namespace and SA match
aws iam update-assume-role-policy --role-name s3-uploader-role \
--policy-document file://trust-policy.json
# trust-policy.json must have correct:
# system:serviceaccount:<namespace>:<sa-name>
# Fix 4: Create OIDC provider if missing
eksctl utils associate-iam-oidc-provider --cluster prod-cluster --approve
# Verify fix — new pod should have AWS env vars
$ kubectl exec -it s3-uploader-NEW -n data-pipeline -- env | grep AWS
AWS_ROLE_ARN=arn:aws:iam::123456789012:role/s3-uploader-role
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token

Prevention:

  • Define IRSA/WI as part of namespace provisioning (Crossplane or Terraform)
  • Use EKS Pod Identity (newer, simpler) instead of IRSA for new clusters
  • Validate IRSA in CI: kubectl auth can-i checks in post-deploy verification
  • Template the trust policy alongside the SA in the same Terraform module

Scenario 10: OOMKilled (Exit Code 137)

Container is terminated because it exceeded its memory limit. Exit code 137.

Terminal window
$ kubectl get pods -n ml-inference
NAME READY STATUS RESTARTS AGE
model-server-7d8e9f0-a1b2c 0/1 OOMKilled 4 (30s ago) 10m
$ kubectl describe pod model-server-7d8e9f0-a1b2c -n ml-inference
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Terminal window
# Step 1: Check current resource limits
$ kubectl get pod model-server-7d8e9f0-a1b2c -n ml-inference \
-o jsonpath='{.spec.containers[0].resources}' | jq .
{
"limits": {"cpu": "1", "memory": "512Mi"},
"requests": {"cpu": "500m", "memory": "256Mi"}
}
# Step 2: Check actual memory usage before OOM (from Prometheus/metrics-server)
$ kubectl top pod -n ml-inference --sort-by=memory
NAME CPU(cores) MEMORY(bytes)
model-server-7d8e9f0-a1b2c 450m 509Mi # <-- hitting 512Mi limit
# Step 3: Check if VPA has recommendations
$ kubectl get vpa -n ml-inference
NAME MODE CPU MEM PROVIDED AGE
model-server Off - - True 7d
$ kubectl describe vpa model-server -n ml-inference
Recommendation:
Container Recommendations:
Container Name: model-server
Lower Bound: Cpu: 200m, Memory: 768Mi
Target: Cpu: 500m, Memory: 1Gi # <-- VPA recommends 1Gi
Upper Bound: Cpu: 2, Memory: 2Gi
# Step 4: Check node-level OOM killer activity (via SSM)
$ journalctl -k | grep -i "out of memory"
Mar 15 02:45:00 kernel: Out of memory: Killed process 12345 (model-server)
total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB, shmem-rss:0kB
# Step 5: Check if it's a JVM/Go app with known memory patterns
$ kubectl logs model-server-7d8e9f0-a1b2c -n ml-inference --previous | tail -5
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
# ^^ JVM heap not capped — it grows beyond container limit
  1. Memory limit too low — application genuinely needs more memory than the limit allows
  2. Memory leak — application allocates memory without releasing it over time
  3. JVM heap not bounded — Java app defaults to 25% of node memory, exceeding container limit
  4. ML model loading — model loaded into memory exceeds container limit
  5. No limits set + node memory exhaustion — without limits, pod consumes all node memory and the kernel OOM killer terminates it
Terminal window
# Immediate — increase memory limit based on VPA recommendation
kubectl set resources deployment model-server -n ml-inference \
--limits=memory=1536Mi --requests=memory=1Gi
# For JVM apps — set heap explicitly relative to container limit
# In Deployment env:
# - name: JAVA_OPTS
# value: "-XX:MaxRAMPercentage=75.0"
# This caps JVM heap at 75% of container memory limit
# For Go apps — set GOMEMLIMIT
# - name: GOMEMLIMIT
# value: "400MiB"
# Enable VPA in Auto mode (if approved by platform policy)
# Or use VPA in recommendation-only mode and update limits manually

Prevention:

  • ALWAYS set memory limits on all containers (enforce with OPA/Kyverno)
  • Run VPA in recommendation mode on all namespaces, review monthly
  • For JVM: mandate -XX:MaxRAMPercentage=75.0 in golden path
  • Alert on container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
  • Set LimitRange defaults so pods without explicit limits still get bounded
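The LimitRange default from the last bullet could be sketched like this — the values are illustrative and should be tuned to the namespace's workloads:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ml-inference
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container sets no requests
      cpu: 100m
      memory: 256Mi
    default:               # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
```

With this in place, a pod deployed without explicit resources still gets a bounded memory limit instead of consuming the whole node.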

Scenario 11: Pod Stuck in Terminating

Pod is in Terminating state indefinitely. kubectl delete pod does not work.

Terminal window
$ kubectl get pods -n legacy-app
NAME READY STATUS RESTARTS AGE
legacy-worker-abc123 1/1 Terminating 0 45m
Terminal window
# Step 1: Check for finalizers on the pod
$ kubectl get pod legacy-worker-abc123 -n legacy-app -o jsonpath='{.metadata.finalizers}'
["custom.finalizer.io/cleanup"]
# ^^ Finalizer is blocking deletion
# Step 2: Check if the node hosting the pod is responsive
$ kubectl get pod legacy-worker-abc123 -n legacy-app -o jsonpath='{.spec.nodeName}'
ip-10-1-1-101.ec2.internal
$ kubectl get node ip-10-1-1-101.ec2.internal
NAME STATUS ROLES AGE VERSION
ip-10-1-1-101.ec2.internal NotReady <none> 30d v1.29.1
# ^^ Node is NotReady — kubelet cannot execute the graceful shutdown
# Step 3: Check if there's a PVC with a finalizer blocking things
$ kubectl get pvc -n legacy-app -o jsonpath='{range .items[*]}{.metadata.name}: {.metadata.finalizers}{"\n"}{end}'
legacy-data: ["kubernetes.io/pvc-protection"]
# Step 4: Check if preStop hook is hanging
$ kubectl get pod legacy-worker-abc123 -n legacy-app \
-o jsonpath='{.spec.containers[0].lifecycle.preStop}'
{"exec":{"command":["sh","-c","sleep 3600"]}}
# ^^ preStop hook sleeps for 3600 seconds (1 hour!)
# Step 5: Check terminationGracePeriodSeconds
$ kubectl get pod legacy-worker-abc123 -n legacy-app \
-o jsonpath='{.spec.terminationGracePeriodSeconds}'
3600 # <-- 1 hour grace period, pod won't force-kill until this expires
  1. Finalizer blocking — a controller’s finalizer is present but the controller is not running to remove it
  2. Node is NotReady — kubelet on the node cannot execute SIGTERM/SIGKILL sequence
  3. preStop hook hanging — long-running preStop hook (e.g., draining connections) never completes
  4. terminationGracePeriodSeconds too high — set to 3600+ seconds
  5. PVC protection finalizer — PVC is still in use, preventing pod deletion chain
Terminal window
# Fix 1: Remove finalizer (if controller is gone)
kubectl patch pod legacy-worker-abc123 -n legacy-app \
--type='json' -p='[{"op":"remove","path":"/metadata/finalizers"}]'
# Fix 2: Force delete (last resort — use with caution)
kubectl delete pod legacy-worker-abc123 -n legacy-app --force --grace-period=0
# Fix 3: If node is NotReady, the pod object stays until the node recovers
# or until you force-delete. Fix the node first (Scenario 6) or:
kubectl delete pod legacy-worker-abc123 -n legacy-app --force --grace-period=0
# Fix 4: Reduce terminationGracePeriodSeconds in Deployment spec
# spec.terminationGracePeriodSeconds: 30 (sane default)

Prevention:

  • Set terminationGracePeriodSeconds: 30 in golden path templates (override only with justification)
  • Ensure preStop hooks have bounded execution time
  • Monitor for pods in Terminating state > 5 minutes
  • If using custom finalizers, ensure the controller has HA (multiple replicas)
  • For StatefulSets, use pod disruption budgets to control deletion safely
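A golden-path pod template reflecting the first two bullets might look like this (sketch — the sleep length stands in for whatever bounded drain logic the app needs):

```yaml
# Deployment pod template fragment (illustrative values)
spec:
  terminationGracePeriodSeconds: 30   # force-kill ceiling; override only with justification
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          # bounded drain: must finish well inside the grace period
          command: ["sh", "-c", "sleep 10"]
```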

Scenario 12: HPA Not Scaling

HPA exists but replicas remain at minimum even under high load.

Terminal window
$ kubectl get hpa -n checkout
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
checkout-hpa Deployment/checkout-svc <unknown>/70% 2 20 2 30d
# ^^^^^^^^^ <unknown> means metrics not available
Terminal window
# Step 1: Describe the HPA for detailed status
$ kubectl describe hpa checkout-hpa -n checkout
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the
replica count: failed to get cpu utilization: missing request for cpu in
container "checkout" of pod "checkout-svc-abc123"
# Step 2: Check if metrics-server is running
$ kubectl get pods -n kube-system -l k8s-app=metrics-server
NAME READY STATUS RESTARTS AGE
metrics-server-6d684c7b5d-x9y0z 1/1 Running 0 30d
# Step 3: Verify metrics are available
$ kubectl top pods -n checkout
NAME CPU(cores) MEMORY(bytes)
checkout-svc-abc123 450m 256Mi
checkout-svc-def456 380m 230Mi
# Step 4: Check if the pods have resource REQUESTS set (required for CPU-based HPA)
$ kubectl get pod checkout-svc-abc123 -n checkout \
-o jsonpath='{.spec.containers[0].resources}'
{"limits":{"memory":"512Mi"}}
# ^^ NO CPU request! HPA cannot calculate percentage without a request baseline
# Step 5: For custom metrics (e.g., queue depth), check the metrics API
$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
# If 404 — custom metrics adapter (Prometheus Adapter / KEDA) not installed
# Step 6: Check HPA events
$ kubectl get events -n checkout --field-selector involvedObject.name=checkout-hpa
LAST SEEN TYPE REASON MESSAGE
2m Warning FailedGetResourceMetric missing request for cpu
5m Warning FailedComputeMetricsReplicas failed to get cpu utilization
  1. No CPU requests on pods — HPA needs resource requests to calculate utilization percentage
  2. Metrics server not installed or broken — kubectl top returns errors
  3. Custom metrics adapter missing — using custom/external metrics but Prometheus Adapter or KEDA not deployed
  4. Wrong metric name — HPA references a metric that doesn’t exist
  5. Cooldown period — HPA recently scaled down and is in --horizontal-pod-autoscaler-downscale-stabilization window (default 5min)
  6. MaxReplicas reached — HPA already at max, cannot scale further
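Cause 1 comes down to the arithmetic HPA runs: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), where the CPU metric is usage divided by the pod's request — which is why a missing request makes the percentage uncomputable. A quick sketch of the formula in shell (the numbers mirror the transcript above):

```shell
# HPA core formula: desired = ceil(current * metric / target)
current=2        # current replicas
cpu_now=65       # observed average utilization (%) across pods
cpu_target=70    # HPA target utilization (%)
# integer ceiling division
desired=$(( (current * cpu_now + cpu_target - 1) / cpu_target ))
echo "desired replicas: $desired"   # 65% < 70% target -> stays at 2
```
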
Terminal window
# Fix 1: Add CPU requests to the Deployment
kubectl set resources deployment checkout-svc -n checkout \
--requests=cpu=200m,memory=256Mi --limits=memory=512Mi
# Fix 2: Install metrics-server (if missing)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Fix 3: Verify after adding requests
$ kubectl get hpa -n checkout
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
checkout-hpa Deployment/checkout-svc 65%/70% 2 20 2 30d
# ^^^^ Now showing actual percentage
# Fix 4: For custom metrics, install KEDA
# KEDA ScaledObject example for SQS queue depth:
# apiVersion: keda.sh/v1alpha1
# kind: ScaledObject
# metadata:
# name: checkout-scaledobject
# spec:
# scaleTargetRef:
# name: checkout-svc
# minReplicaCount: 2
# maxReplicaCount: 20
# triggers:
# - type: aws-sqs-queue
# metadata:
# queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/checkout-queue
# queueLength: "5"

Prevention:

  • Enforce CPU requests on all pods via OPA/Kyverno admission policy
  • Include HPA + resource requests in golden path Helm templates
  • Deploy metrics-server and KEDA as cluster baseline add-ons
  • Monitor kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"}
  • Set up alerts for HPA at maxReplicas for > 10 minutes
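The admission policy from the first bullet could be sketched in Kyverno like this (policy name and message are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cpu-requests
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cpu-request
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Every container must set a CPU request (required for CPU-based HPA)."
      pattern:
        spec:
          containers:
          - resources:
              requests:
                cpu: "?*"     # any non-empty value
```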

Scenario 13: Certificate Expiry (cert-manager)


TLS certificates are about to expire or already expired. Browsers show certificate errors, API clients fail.

Terminal window
$ kubectl get certificates -n frontend
NAME READY SECRET AGE
app-bank-tls False app-bank-tls 90d
$ kubectl describe certificate app-bank-tls -n frontend
Status:
Conditions:
Type: Ready
Status: False
Reason: Renewing
Message: Renewing certificate as renewal was scheduled at 2026-03-14
Not After: 2026-03-15T00:00:00Z # <-- expires TODAY
Terminal window
# Step 1: Check cert-manager pods
$ kubectl get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-7d8e9f0a1-b2c3d 1/1 Running 0 30d
cert-manager-cainjector-5f6g7h8i9-j0k1l 1/1 Running 0 30d
cert-manager-webhook-3d4e5f6g7-h8i9j 1/1 Running 0 30d
# Step 2: Check the Certificate resource status
$ kubectl describe certificate app-bank-tls -n frontend
Events:
Warning Failed 2m (x15 over 24h) cert-manager-certificates-issuing
The certificate request has failed to complete and will be retried:
Failed to wait for order resource "app-bank-tls-order-xyz" to become ready:
order is in "errored" state: acme: order error: 403
# Step 3: Check the Order and Challenge
$ kubectl get orders -n frontend
NAME STATE AGE
app-bank-tls-order-xyz errored 24h
$ kubectl describe order app-bank-tls-order-xyz -n frontend
Status:
State: errored
Reason: "acme: order error: one or more domains had a problem"
$ kubectl get challenges -n frontend
NAME STATE DOMAIN AGE
app-bank-tls-challenge-abc123 pending app.bank.com 24h
$ kubectl describe challenge app-bank-tls-challenge-abc123 -n frontend
Status:
Reason: Waiting for DNS-01 challenge propagation: DNS record for
"_acme-challenge.app.bank.com" not yet propagated
State: pending
# Step 4: Check ClusterIssuer/Issuer
$ kubectl get clusterissuer
NAME READY AGE
letsencrypt-prod True 180d
$ kubectl describe clusterissuer letsencrypt-prod
Status:
Acme:
Uri: https://acme-v02.api.letsencrypt.org/acme/acct/123456
Conditions:
Type: Ready
Status: True
# Step 5: Check if DNS credentials for Route53/Cloud DNS are valid
$ kubectl get secret route53-credentials -n cert-manager -o yaml
# Verify the access key is not expired/rotated
# Step 6: cert-manager logs for detailed errors
$ kubectl logs -n cert-manager -l app=cert-manager --tail=30
E0315 cert-manager/challenges "msg"="propagation check failed"
"error"="DNS record for \"_acme-challenge.app.bank.com\" not yet propagated"
"dnsName"="app.bank.com" "type"="DNS-01"
  1. DNS-01 challenge failing — cert-manager cannot create the _acme-challenge TXT record (IAM permissions, wrong hosted zone)
  2. HTTP-01 challenge failing — challenge solver pod cannot be reached from the internet (ingress misconfigured, firewall)
  3. Rate limiting — Let’s Encrypt rate limits: 50 certs per domain per week, 5 duplicate certs per week
  4. Credential expiry — Route53/Cloud DNS IAM credentials used by cert-manager have expired
  5. cert-manager webhook down — webhook not running, certificate resources cannot be validated
  6. Cluster DNS issue — cert-manager pods cannot resolve Let’s Encrypt API (see Scenario 5)
Terminal window
# Fix DNS-01 — verify IAM permissions for Route53
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/cert-manager-role \
--action-names route53:ChangeResourceRecordSets route53:GetChange \
--resource-arns "arn:aws:route53:::hostedzone/Z1234567890"
# Fix — if permissions are correct but record not propagating, delete and retry
kubectl delete challenge app-bank-tls-challenge-abc123 -n frontend
# cert-manager will create a new challenge automatically
# Emergency — if cert already expired, manually create cert from ACM/existing
kubectl create secret tls app-bank-tls -n frontend \
--cert=./fullchain.pem --key=./privkey.pem --dry-run=client -o yaml | kubectl apply -f -
# Force renewal
cmctl renew app-bank-tls -n frontend
# Check cert expiry dates across all namespaces
$ kubectl get certificates --all-namespaces -o custom-columns=\
NAMESPACE:.metadata.namespace,NAME:.metadata.name,\
READY:.status.conditions[0].status,EXPIRY:.status.notAfter
NAMESPACE NAME READY EXPIRY
frontend app-bank-tls False 2026-03-15T00:00:00Z
payments pay-bank-tls True 2026-05-20T00:00:00Z

Prevention:

  • Alert on cert expiry 30 days before: certmanager_certificate_expiration_timestamp_seconds - time() < 30*24*3600
  • Use cert-manager with DNS-01 (more reliable than HTTP-01 for internal services)
  • Set up IRSA/Workload Identity for cert-manager instead of static credentials
  • Monitor certmanager_certificate_ready_status{condition="True"} == 0
  • Run cmctl check api as part of cluster health checks
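The 30-day expiry alert from the first bullet, as a PrometheusRule sketch (threshold, duration, and labels are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry
  namespace: monitoring
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertificateExpiringSoon
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30*24*3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in < 30 days"
```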

Scenario 14: Network Policy Blocking Traffic


Pods cannot communicate with each other despite services being correctly configured. Connection timeout or reset.

Terminal window
# From checkout pod, trying to reach payment service
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \
http://payment-svc.payments.svc.cluster.local:8080/api/charge
wget: download timed out
Terminal window
# Step 1: List NetworkPolicies in BOTH source and destination namespaces
$ kubectl get networkpolicy -n payments
NAME POD-SELECTOR AGE
default-deny-all <none> 30d
allow-monitoring app=payment-svc 30d
$ kubectl get networkpolicy -n checkout
NAME POD-SELECTOR AGE
default-deny-all <none> 30d
allow-egress-dns <none> 30d
# Step 2: Inspect the default-deny policy in destination namespace
$ kubectl describe networkpolicy default-deny-all -n payments
Spec:
PodSelector: <none> (Coverage: all pods in the namespace)
Allowing ingress traffic: <none> (Selected pods are isolated for ingress connectivity)
Allowing egress traffic: <none> (Selected pods are isolated for egress connectivity)
Policy Types: Ingress, Egress
# ^^ Denies ALL ingress — checkout pods blocked
# Step 3: Check if there's an allow rule for the specific traffic
$ kubectl get networkpolicy allow-monitoring -n payments -o yaml
spec:
podSelector:
matchLabels:
app: payment-svc
ingress:
- from:
- namespaceSelector:
matchLabels:
name: monitoring # <-- only monitoring namespace allowed
ports:
- port: 8080
# ^^ Only monitoring can reach payment-svc. Checkout is NOT allowed.
# Step 4: Check egress rules in source namespace
$ kubectl describe networkpolicy default-deny-all -n checkout
# If egress is also denied, the checkout pod cannot make ANY outbound connections
# Unless there are specific egress allow rules
# Step 5: Verify namespace labels (required for namespaceSelector)
$ kubectl get namespace payments --show-labels
NAME STATUS AGE LABELS
payments Active 90d kubernetes.io/metadata.name=payments,name=payments,team=payments
$ kubectl get namespace checkout --show-labels
NAME STATUS AGE LABELS
checkout Active 90d kubernetes.io/metadata.name=checkout,team=checkout
# ^^ checkout namespace has label "team=checkout" — this is what we need to match
# Step 6: Test with a debug pod in the same namespace as the destination
$ kubectl run debug -n payments --rm -it --image=busybox -- wget -qO- http://payment-svc:8080/healthz
OK # <-- Works from within the namespace, confirming NetworkPolicy is the issue
  1. Default deny without matching allow — default-deny-all blocks everything; no ingress rule allows the checkout namespace
  2. Missing egress rule in source namespace — checkout namespace also has default deny on egress
  3. Namespace labels missing — namespaceSelector in the allow rule references a label that doesn’t exist on the source namespace
  4. Port not specified in allow rule — ingress rule allows the namespace but not the specific port
  5. CNI doesn’t support NetworkPolicy — some CNIs (e.g., Flannel) don’t enforce NetworkPolicies
Terminal window
# Create an ingress allow rule for checkout -> payments
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-checkout-to-payment
namespace: payments
spec:
podSelector:
matchLabels:
app: payment-svc
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
team: checkout
podSelector:
matchLabels:
app: checkout-svc
ports:
- port: 8080
protocol: TCP
EOF
# Also ensure checkout namespace has egress allowed to payments
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-to-payments
namespace: checkout
spec:
podSelector:
matchLabels:
app: checkout-svc
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
team: payments
ports:
- port: 8080
protocol: TCP
EOF
# Verify
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- \
http://payment-svc.payments.svc.cluster.local:8080/api/charge
{"status": "ok"}

Prevention:

  • Define NetworkPolicies in GitOps alongside the namespace provisioning
  • Create a “service dependency map” — which namespaces talk to which
  • Include DNS egress rule in every default-deny policy template
  • Use Cilium’s hubble observe for real-time flow visibility
  • Test NetworkPolicies in staging before applying to production
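The "DNS egress rule in every default-deny" template from the bullets above could look like this (sketch; it assumes cluster DNS runs in kube-system with the standard k8s-app: kube-dns label — verify for your CNI/DNS setup):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}                 # all pods in the namespace
  policyTypes: [Ingress, Egress]
  egress:
  - to:                           # always allow DNS, or nothing can resolve
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
```

Everything else stays denied until an explicit allow rule (like the checkout → payments pair above) is added.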

Scenario 15: ArgoCD Sync Failures

ArgoCD Application shows OutOfSync or SyncFailed. Resources are not being deployed.

Terminal window
# ArgoCD CLI
$ argocd app get payments --refresh
Name: payments
Project: tenant-payments
Server: https://kubernetes.default.svc
Namespace: payments
Status: OutOfSync
Health: Degraded
Sync Status: SyncFailed
Message: ComparisonError: failed to sync: one or more objects failed
to apply: admission webhook "validate.kyverno.svc-fail"
denied the request: resource violated policy require-labels
Terminal window
# Step 1: Check sync status and errors
$ argocd app get payments --show-operation
Operation: Sync
Sync Revision: abc123def456
Phase: Failed
Message: one or more objects failed to apply
STEP RESOURCE RESULT MESSAGE
1 Namespace/payments Synced namespace/payments configured
2 Deployment/payment-api Failed admission webhook denied: missing label "team"
3 Service/payment-api Skipped dependent resource failed
# Step 2: Check ArgoCD application logs
$ argocd app logs payments --tail=20
time="2026-03-15T02:00:00Z" level=error msg="ComparisonError"
application=payments error="failed to compute diff: CRD
certificates.cert-manager.io not found in cluster"
# Step 3: Check if there's a resource hook ordering issue
$ kubectl get applications.argoproj.io payments -n argocd \
-o jsonpath='{.status.operationState.syncResult.resources}' | jq .
[
{"kind":"CustomResourceDefinition","status":"SyncFailed",
"message":"resource mapping not found for name: certificate"}
]
# ^^ CRD must be applied before the CR that uses it
# Step 4: Check if it's a drift / server-side apply conflict
$ argocd app diff payments
--- live
+++ desired
@@ -10,6 +10,7 @@
labels:
app: payment-api
+ team: payments # <-- this label is in Git but not in cluster
version: v2
# Step 5: Check ArgoCD controller and repo-server
$ kubectl get pods -n argocd
NAME READY STATUS RESTARTS AGE
argocd-application-controller-0 1/1 Running 0 7d
argocd-repo-server-5d8e9f0-a1b2c 1/1 Running 0 7d
argocd-server-7d8e9f0-c3d4e 1/1 Running 0 7d
$ kubectl logs argocd-repo-server-5d8e9f0-a1b2c -n argocd --tail=20
time="2026-03-15T02:00:00Z" level=error msg="failed to generate manifest"
error="helm template failed: Error: chart 'payment-api' version '2.3.1'
not found in repository 'https://charts.internal.bank.com'"
# Step 6: Check repo connectivity
$ argocd repo list
TYPE NAME REPO STATUS MESSAGE
git infra git@github.com:bank/infra-manifests.git Successful
helm charts https://charts.internal.bank.com Failed connection refused
  1. Admission webhook rejection — Kyverno/OPA/Gatekeeper policy denies the resource (missing labels, wrong image registry)
  2. CRD ordering — Custom Resources applied before their CRDs exist (cert-manager Certificate before CRD)
  3. Helm chart not found — internal Helm repo is down or chart version doesn’t exist
  4. Server-side apply conflict — field managed by another controller (e.g., HPA manages replicas, ArgoCD tries to set them too)
  5. Resource quota exceeded — namespace quota prevents creating new resources
  6. RBAC — ArgoCD service account doesn’t have permissions to create the resource
Terminal window
# Fix 1: Admission webhook — add the required label in Git
# Edit the Deployment manifest in Git and commit:
# metadata:
# labels:
# team: payments # <-- add this
# Fix 2: CRD ordering — use sync waves
# On the CRD:
# annotations:
# argocd.argoproj.io/sync-wave: "-1" # Apply CRDs first
# On the CR:
# annotations:
# argocd.argoproj.io/sync-wave: "1" # Apply after CRDs
# Fix 3: Server-side apply conflict with HPA — ignore replicas diff
# In ArgoCD Application spec:
# ignoreDifferences:
# - group: apps
# kind: Deployment
# jsonPointers:
# - /spec/replicas
# Fix 4: Retry sync
argocd app sync payments --retry-limit 3
# Fix 5: Force sync (overwrite cluster state with Git)
argocd app sync payments --force
# Fix 6: Check RBAC
kubectl auth can-i create deployments --as=system:serviceaccount:argocd:argocd-application-controller -n payments

Prevention:

  • Use sync waves for all CRD + CR combinations
  • Configure ignoreDifferences for fields managed by controllers (HPA replicas, mutating webhooks)
  • Test manifests against admission policies in CI before pushing to Git
  • Set up ArgoCD notifications (Slack/Teams) for sync failures
  • Use ArgoCD ApplicationSets for consistent configuration across tenant apps
  • Monitor argocd_app_sync_status{sync_status="OutOfSync"} and argocd_app_health_status{health_status!="Healthy"}
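The ignoreDifferences setting from the second bullet, in the context of a full Application spec (sketch; path and targetRevision are illustrative, the repo URL matches the repo list above):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: tenant-payments
  source:
    repoURL: git@github.com:bank/infra-manifests.git
    path: payments
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas        # HPA owns this field; don't fight over it
```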

Scenario 16: Karpenter Not Provisioning Nodes


Pods are stuck in Pending but Karpenter is not launching new nodes, even though it should.

Terminal window
$ kubectl get pods -n batch-processing
NAME READY STATUS RESTARTS AGE
batch-job-abc123 0/1 Pending 0 20m
batch-job-def456 0/1 Pending 0 20m
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-1-1-100.ec2.internal Ready <none> 7d v1.29.1
# Only 1 node, no new nodes being provisioned
Terminal window
# Step 1: Check Karpenter controller logs
$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=30
2026-03-15T02:00:00.000Z ERROR controller.provisioner
Could not schedule pod, incompatible with provisioner "default":
no instance type satisfied resources
{"cpu":"16","memory":"64Gi"} and target NodePool requirements
[{key: "karpenter.sh/capacity-type", operator: In, values: [spot]}]
# Step 2: Check NodePool configuration
$ kubectl get nodepools
NAME NODECLASS NODES READY AGE
default default 1 1 30d
$ kubectl describe nodepool default
Spec:
Template:
Spec:
Requirements:
- Key: karpenter.sh/capacity-type
Operator: In
Values: ["spot"]
- Key: node.kubernetes.io/instance-type
Operator: In
Values: ["m5.xlarge", "m5.2xlarge"]
- Key: topology.kubernetes.io/zone
Operator: In
Values: ["eu-west-1a"]
Limits:
Cpu: "32" # <-- cluster limit: 32 vCPUs total
Memory: "128Gi"
Disruption:
ConsolidationPolicy: WhenUnderutilized
# Step 3: Check current usage against limits
$ kubectl get nodepool default -o jsonpath='{.status}' | jq .
{
"resources": {
"cpu": "28", # <-- 28 of 32 used, only 4 vCPU remaining
"memory": "112Gi"
}
}
# ^^ Pods need 16 CPU but only 4 available in the NodePool limit
# Step 4: Check EC2NodeClass (subnet, security group, AMI)
$ kubectl get ec2nodeclasses
NAME AGE
default 30d
$ kubectl describe ec2nodeclass default
Spec:
Subnet Selector:
karpenter.sh/discovery: prod-cluster
Security Group Selector:
karpenter.sh/discovery: prod-cluster
AMI Family: AL2
Status:
Subnets:
- ID: subnet-abc123
Zone: eu-west-1a
Security Groups:
- ID: sg-abc123
AMIs:
- ID: ami-abc123
Name: amazon-eks-node-1.29-v20260301
# Step 5: Check for EC2 capacity issues
$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=50 | grep -i "insufficient"
InsufficientInstanceCapacity: We currently do not have sufficient
m5.2xlarge capacity in the Availability Zone eu-west-1a
# Step 6: Check if the pod has nodeSelector or affinity that conflicts
$ kubectl get pod batch-job-abc123 -n batch-processing \
-o jsonpath='{.spec.nodeSelector}'
{"kubernetes.io/arch":"arm64"}
# ^^ Pod requires ARM but NodePool only allows x86 instance types
  1. NodePool CPU/memory limits reached — Karpenter respects the limits on NodePools
  2. Instance type constraints too narrow — only allowing 2 instance types, and those are unavailable in the AZ
  3. Spot capacity unavailable — requesting spot-only but no spot capacity for the selected instance types
  4. AZ restriction — only allowing one AZ, which has capacity issues
  5. Architecture mismatch — pod requires ARM (arm64) but NodePool only provisions x86 instances
  6. Subnet capacity — subnet has no available IP addresses
  7. IAM permissions — Karpenter node role cannot launch EC2 instances
Terminal window
# Fix 1: Increase NodePool limits
kubectl patch nodepool default --type='merge' -p '{
"spec": {"limits": {"cpu": "64", "memory": "256Gi"}}
}'
# Fix 2: Broaden instance type selection
kubectl patch nodepool default --type='json' -p='[{
"op": "replace",
"path": "/spec/template/spec/requirements/1",
"value": {
"key": "node.kubernetes.io/instance-type",
"operator": "In",
"values": ["m5.xlarge","m5.2xlarge","m5a.xlarge","m5a.2xlarge",
"m6i.xlarge","m6i.2xlarge","c5.xlarge","c5.2xlarge"]
}
}]'
# Fix 3: Allow on-demand fallback (not spot-only)
# In NodePool requirements:
# - key: karpenter.sh/capacity-type
# operator: In
# values: ["spot", "on-demand"]
# Fix 4: Allow multiple AZs
# In NodePool requirements:
# - key: topology.kubernetes.io/zone
# operator: In
# values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
# Fix 5: For ARM pods, add ARM instance types
# In NodePool:
# - key: kubernetes.io/arch
# operator: In
# values: ["amd64", "arm64"]
# - key: node.kubernetes.io/instance-type
# operator: In
# values: ["m6g.xlarge", "m6g.2xlarge"] # Graviton
# Verify — watch Karpenter provision a node
$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f
2026-03-15T02:30:00.000Z INFO controller.provisioner
Computed 1 new node(s) will fit 2 pod(s)
2026-03-15T02:30:05.000Z INFO controller.provisioner
Launched node: ip-10-1-1-103.ec2.internal,
type: m6i.2xlarge, zone: eu-west-1b, capacity-type: on-demand

Prevention:

  • Set NodePool limits with 50% headroom above normal peak
  • Use at least 15 instance types (Karpenter picks the cheapest available)
  • Allow both spot and on-demand with spot preference
  • Monitor karpenter_pods_state{state="pending"} and karpenter_nodepool_usage vs limits
  • Alert when NodePool usage > 80% of limits
  • Review Karpenter logs weekly for InsufficientInstanceCapacity patterns
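The usage-vs-limits alert from the bullets above, as a PrometheusRule sketch — the metric names assume recent Karpenter releases (karpenter_nodepool_usage / karpenter_nodepool_limit); check them against the version you run:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-nodepool-usage
  namespace: monitoring
spec:
  groups:
  - name: karpenter
    rules:
    - alert: NodePoolNearLimit
      expr: |
        karpenter_nodepool_usage{resource_type="cpu"}
          / karpenter_nodepool_limit{resource_type="cpu"} > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "NodePool {{ $labels.nodepool }} CPU usage > 80% of its limit"
```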

Scenario 17: Cross-Namespace Service Communication Failing


Service in namespace A cannot reach a service in namespace B, even though both services are running and healthy within their own namespaces.

Terminal window
# From checkout namespace, trying to call payment service in payments namespace
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \
http://payment-svc:8080/api/charge
wget: bad address 'payment-svc:8080' # <-- DNS cannot resolve
# Or using FQDN:
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \
http://payment-svc.payments.svc.cluster.local:8080/api/charge
wget: download timed out # <-- DNS resolves but connection blocked
Terminal window
# Step 1: Verify DNS resolution across namespaces
$ kubectl exec -it checkout-abc123 -n checkout -- nslookup payment-svc.payments.svc.cluster.local
Server: 10.100.0.10
Address: 10.100.0.10:53
Name: payment-svc.payments.svc.cluster.local
Address: 10.100.23.45
# ^^ DNS works. Problem is not DNS.
# Step 2: Check if the service has endpoints
$ kubectl get endpoints payment-svc -n payments
NAME ENDPOINTS AGE
payment-svc 10.1.2.34:8080 30d
# ^^ Endpoints exist
# Step 3: Check NetworkPolicies (most likely cause)
$ kubectl get networkpolicy -n payments
NAME POD-SELECTOR AGE
default-deny-all <none> 30d
# ^^ Default deny blocks all ingress to payments namespace
$ kubectl get networkpolicy -n checkout
NAME POD-SELECTOR AGE
default-deny-all <none> 30d
allow-egress-dns <none> 30d
# ^^ Default deny blocks all egress from checkout namespace (except DNS)
# Step 4: Check if FQDN is required (short name only works within same namespace)
$ kubectl exec -it checkout-abc123 -n checkout -- cat /etc/resolv.conf
nameserver 10.100.0.10
search checkout.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
# With ndots:5, "payment-svc" resolves as:
# payment-svc.checkout.svc.cluster.local → NXDOMAIN (not in checkout ns)
# payment-svc.svc.cluster.local → NXDOMAIN
# payment-svc.cluster.local → NXDOMAIN
# payment-svc → NXDOMAIN
# MUST use: payment-svc.payments.svc.cluster.local
# Step 5: Test connectivity with NetworkPolicy temporarily removed (STAGING ONLY)
$ kubectl delete networkpolicy default-deny-all -n payments --dry-run=client
# If removing the policy fixes it, NetworkPolicy is confirmed as the cause
# Step 6: Check for Cilium/Calico-specific network policies
$ kubectl get ciliumnetworkpolicy -n payments 2>/dev/null
NAME AGE
cilium-default-deny 30d
cilium-allow-internal 30d
$ kubectl describe ciliumnetworkpolicy cilium-allow-internal -n payments
Spec:
  Endpoint Selector:
    Match Labels:
      app: payment-svc
  Ingress:
    - From Endpoints:
        - Match Labels:
            io.kubernetes.pod.namespace: payments # <-- only same namespace
  1. Using short service name — payment-svc only resolves within the same namespace; must use payment-svc.payments.svc.cluster.local
  2. NetworkPolicy blocking cross-namespace ingress — default deny in destination namespace without allow rule for source namespace (most common in enterprise setups)
  3. NetworkPolicy blocking egress — default deny in source namespace blocks outbound connections
  4. Cilium/Calico-specific policies — CRD-based policies more restrictive than K8s-native NetworkPolicy
  5. Service type mismatch — ExternalName service pointing to wrong FQDN
  6. Istio AuthorizationPolicy — service mesh deny-by-default blocking cross-namespace traffic
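Cause 1 can be reproduced offline: the resolver's candidate order under ndots:5 is purely mechanical. A minimal sketch (`search_order` is a hypothetical helper for illustration, not part of any tooling above):

```shell
#!/bin/sh
# Hypothetical helper: print the order in which the resolver tries
# candidates for a name, given ndots:5 and the pod's search domains.
search_order() {
  name=$1; shift
  dots=$(printf '%s' "$name" | awk -F. '{print NF - 1}')
  if [ "$dots" -lt 5 ]; then
    # Fewer dots than ndots: the search list is walked first
    for domain in "$@"; do
      printf '%s.%s\n' "$name" "$domain"
    done
  fi
  printf '%s\n' "$name"   # the literal name is tried last
}

search_order payment-svc checkout.svc.cluster.local svc.cluster.local cluster.local
# payment-svc.checkout.svc.cluster.local
# payment-svc.svc.cluster.local
# payment-svc.cluster.local
# payment-svc
```

Note that even the full FQDN `payment-svc.payments.svc.cluster.local` has only four dots, so with ndots:5 it still walks the search list before being tried verbatim; a trailing dot (`payment-svc.payments.svc.cluster.local.`) skips the search list entirely.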
Terminal window
# Fix 1: Application must use FQDN for cross-namespace calls
# In app config or env:
# PAYMENT_SERVICE_URL=http://payment-svc.payments.svc.cluster.local:8080
# Fix 2: Create NetworkPolicy allowing cross-namespace traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-svc
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: checkout
          podSelector:
            matchLabels:
              app: checkout-svc
      ports:
        - port: 8080
          protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payments-egress
  namespace: checkout
spec:
  podSelector:
    matchLabels:
      app: checkout-svc
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              team: payments
      ports:
        - port: 8080
          protocol: TCP
EOF
# Fix 3: For Istio, create AuthorizationPolicy
cat <<EOF | kubectl apply -f -
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-svc
  rules:
    - from:
        - source:
            namespaces: ["checkout"]
            principals: ["cluster.local/ns/checkout/sa/checkout-sa"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/api/charge"]
EOF
# Verify
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- \
http://payment-svc.payments.svc.cluster.local:8080/api/charge
{"status": "ok", "transaction_id": "txn-789"}

Prevention:

  • Standardize on FQDN for all cross-namespace service calls in golden path configs
  • Maintain a service dependency matrix as part of namespace provisioning
  • Create NetworkPolicy templates that are applied alongside namespace creation
  • Use Cilium Hubble or Calico flow logs to visualize cross-namespace traffic patterns
  • Document allowed communication paths in a “service mesh” diagram per tenant
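A provisioning template along the lines of the third bullet might look like the following sketch. The `<tenant-ns>` placeholder and the kube-system DNS target are illustrative assumptions; the policy names mirror the `default-deny-all` and `allow-egress-dns` policies seen in the checkout namespace above:

```yaml
# Illustrative baseline applied when a tenant namespace is created:
# default-deny for ingress and egress, plus a DNS egress allowance.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: <tenant-ns>        # filled in by provisioning
spec:
  podSelector: {}               # all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-dns
  namespace: <tenant-ns>
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```

Per-service allow rules (like the allow-checkout-ingress / allow-payments-egress pair above) are then layered on top from the service dependency matrix.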
Cross-Namespace Communication Checklist:
+-----------------------------------------------+
| 1. DNS: Use FQDN (svc.ns.svc.cluster.local)   |
| 2. Ingress NetPol: Allow source namespace     |
| 3. Egress NetPol: Allow destination namespace |
| 4. Istio AuthZ: Allow source principal        |
| 5. Endpoints: Verify svc has backends         |
| 6. Ports: Match in all policies               |
+-----------------------------------------------+

+----+----------------------------+---------------------------------------------+
| #  | Scenario                   | First Command to Run                        |
+----+----------------------------+---------------------------------------------+
| 1  | Pod Pending                | kubectl describe pod <pod> -n <ns>          |
| 2  | CrashLoopBackOff           | kubectl logs <pod> --previous -n <ns>       |
| 3  | ImagePullBackOff           | kubectl describe pod <pod> -n <ns>          |
| 4  | No traffic to pod          | kubectl get endpoints <svc> -n <ns>         |
| 5  | DNS failures               | kubectl get pods -n kube-system -l          |
|    |                            | k8s-app=kube-dns                            |
| 6  | Node NotReady              | kubectl describe node <node>                |
| 7  | PVC Pending                | kubectl describe pvc <pvc> -n <ns>          |
| 8  | Ingress not routing        | kubectl describe ingress <name> -n <ns>     |
| 9  | IRSA/WI not working        | kubectl get sa <sa> -n <ns> -o yaml         |
| 10 | OOMKilled                  | kubectl describe pod <pod> -n <ns>          |
| 11 | Pod Terminating            | kubectl get pod <pod> -o jsonpath=          |
|    |                            | '{.metadata.finalizers}'                    |
| 12 | HPA not scaling            | kubectl describe hpa <hpa> -n <ns>          |
| 13 | Cert expiry                | kubectl describe certificate <cert> -n <ns> |
| 14 | NetworkPolicy blocking     | kubectl get networkpolicy -n <ns>           |
| 15 | ArgoCD sync failure        | argocd app get <app> --show-operation       |
| 16 | Karpenter not provisioning | kubectl logs -n kube-system -l              |
|    |                            | app.kubernetes.io/name=karpenter            |
| 17 | Cross-ns comm failure      | kubectl get networkpolicy -n <dest-ns>      |
+----+----------------------------+---------------------------------------------+
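For muscle memory at 2 AM, the table can also be turned into a small lookup helper. This is a hypothetical convenience wrapper (`first_cmd` is not a standard tool); it only prints the command from the table, it does not run anything:

```shell
#!/bin/sh
# Hypothetical helper: print the first command to run for a scenario
# number from the table above.
first_cmd() {
  case $1 in
    1|3|10) echo 'kubectl describe pod <pod> -n <ns>' ;;
    2)  echo 'kubectl logs <pod> --previous -n <ns>' ;;
    4)  echo 'kubectl get endpoints <svc> -n <ns>' ;;
    5)  echo 'kubectl get pods -n kube-system -l k8s-app=kube-dns' ;;
    6)  echo 'kubectl describe node <node>' ;;
    7)  echo 'kubectl describe pvc <pvc> -n <ns>' ;;
    8)  echo 'kubectl describe ingress <name> -n <ns>' ;;
    9)  echo 'kubectl get sa <sa> -n <ns> -o yaml' ;;
    11) echo "kubectl get pod <pod> -o jsonpath='{.metadata.finalizers}'" ;;
    12) echo 'kubectl describe hpa <hpa> -n <ns>' ;;
    13) echo 'kubectl describe certificate <cert> -n <ns>' ;;
    14) echo 'kubectl get networkpolicy -n <ns>' ;;
    15) echo 'argocd app get <app> --show-operation' ;;
    16) echo 'kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter' ;;
    17) echo 'kubectl get networkpolicy -n <dest-ns>' ;;
    *)  echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

first_cmd 2   # kubectl logs <pod> --previous -n <ns>
```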

These are the alerts you should have configured as the platform team:

# Alert rules covering all 17 scenarios
groups:
  - name: k8s-troubleshooting-alerts
    rules:
      # Scenario 1: Pod Pending > 5 min
      - alert: PodStuckPending
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 5m
      # Scenario 2: CrashLoopBackOff
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 5m
      # Scenario 3: ImagePullBackOff
      - alert: ImagePullFailure
        expr: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} == 1
        for: 5m
      # Scenario 4: Service with no endpoints
      - alert: ServiceNoEndpoints
        expr: kube_endpoint_address_available == 0
        for: 5m
      # Scenario 5: CoreDNS down
      - alert: CoreDNSDown
        expr: up{job="coredns"} == 0
        for: 2m
      # Scenario 6: Node NotReady
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 3m
      # Scenario 7: PVC Pending > 5 min
      - alert: PVCStuckPending
        expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
        for: 5m
      # Scenario 10: OOMKilled
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      # Scenario 12: HPA at max replicas
      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas
          == kube_horizontalpodautoscaler_spec_max_replicas
        for: 10m
      # Scenario 13: Certificate expiring in 14 days
      - alert: CertificateExpiringSoon
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14*24*3600