Troubleshooting — 17 Debug Scenarios
Your clusters are running. Alerts are firing. Something is broken at 2 AM. This page is your daily reference — 17 scenarios you will encounter repeatedly as the central infra team managing EKS/GKE for 50+ tenant teams.
Where This Fits
Essential Debug Toolkit
Before diving into scenarios, here are the commands you will use in every single debugging session:
# The Big Five — run these first for ANY pod issue
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl get events -n <ns> --sort-by='.metadata.creationTimestamp' | tail -20
kubectl top pod -n <ns>
# Node-level
kubectl get nodes -o wide
kubectl describe node <node>
kubectl top nodes
# Network
kubectl get svc,ep,ingress -n <ns>
kubectl get networkpolicy -n <ns>
# Auth
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>

Scenario 1: Pod Stuck in Pending
Symptoms
Pod stays in Pending status indefinitely. No node is assigned.
$ kubectl get pods -n payments
NAME                          READY   STATUS    RESTARTS   AGE
payment-api-7d4f8b6c9-x2k4l   0/1     Pending   0          12m
payment-api-7d4f8b6c9-m9n3p   0/1     Pending   0          12m

Debug Commands
# Step 1: Describe the pod — look at the Events section at the bottom
$ kubectl describe pod payment-api-7d4f8b6c9-x2k4l -n payments
# ...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12m   default-scheduler  0/5 nodes are available: 2 Insufficient cpu, 3 node(s) had taint {team=data: NoSchedule} that the pod didn't tolerate.
# Step 2: Check node capacity and allocatable resources
$ kubectl describe nodes | grep -A 5 "Allocated resources"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       3800m (95%)   7600m (190%)
  memory    6Gi (82%)     12Gi (164%)
# Step 3: Check if there are taints blocking scheduling
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME                         TAINTS
ip-10-1-1-100.ec2.internal   [map[effect:NoSchedule key:team value:data]]
ip-10-1-1-101.ec2.internal   [map[effect:NoSchedule key:team value:data]]
ip-10-1-1-102.ec2.internal   [map[effect:NoSchedule key:team value:data]]
ip-10-1-2-200.ec2.internal   <none>
ip-10-1-2-201.ec2.internal   <none>
# Step 4: Check PVC binding if the pod uses persistent volumes
$ kubectl get pvc -n payments
NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
payment-data   Pending                                      gp3-encrypted   12m

Root Cause
One or more of these:
- Insufficient resources — all nodes are at capacity (cpu/memory requests exhausted)
- Taints not tolerated — nodes have taints the pod doesn’t tolerate
- Node affinity mismatch — pod requires specific node labels that no node has
- PVC not bound — the pod references a PVC that is stuck in Pending (see Scenario 7)
- Pod topology spread constraints — cannot satisfy distribution requirements
- ResourceQuota exceeded — namespace quota for cpu/memory is maxed out
# Check ResourceQuota
$ kubectl get resourcequota -n payments
NAME             AGE   REQUEST                                                 LIMIT
payments-quota   30d   requests.cpu: 3900m/4000m, requests.memory: 6.5Gi/8Gi   limits.cpu: 7800m/8000m, limits.memory: 13Gi/16Gi

Fix + Prevention
# Immediate: If resource shortage, scale up nodes or reduce requests
# For Karpenter — it should auto-provision. If not, see Scenario 16.
# For Cluster Autoscaler:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# If taint issue, add tolerations to the pod spec:
# spec.tolerations:
# - key: "team"
#   operator: "Equal"
#   value: "payments"
#   effect: "NoSchedule"
# If ResourceQuota, request increase or optimize existing workloads:
kubectl patch resourcequota payments-quota -n payments \
  --type='json' -p='[{"op":"replace","path":"/spec/hard/requests.cpu","value":"6000m"}]'

Prevention:
- Set up Prometheus alerts for node utilization > 80%
- Use Karpenter/NAP for just-in-time node provisioning
- Enforce `LimitRange` so teams cannot request excessive resources
- Review `ResourceQuota` during team onboarding
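The quota readout above can be sanity-checked with plain shell arithmetic. This is a sketch using the example figures from the `kubectl get resourcequota` output (3900m requested of a 4000m hard cap), not live cluster data:

```shell
# Headroom left under the namespace CPU request quota.
# "3900m/4000m" means 3900 millicores requested against a 4000m hard cap.
used="3900m"
hard="4000m"
headroom=$(( ${hard%m} - ${used%m} ))
echo "cpu request headroom: ${headroom}m"
# Any new pod whose requests exceed this is rejected by quota admission.
```

With only 100m free, the two `payment-api` replicas above cannot be admitted; raising `requests.cpu` (as in the patch shown) or right-sizing existing workloads restores headroom.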
Scenario 2: CrashLoopBackOff
Symptoms
Pod starts, crashes, restarts, crashes again. Backoff delay increases each time.
$ kubectl get pods -n checkout
NAME                           READY   STATUS             RESTARTS     AGE
checkout-svc-5f6d7c8b9-k3m2n   0/1     CrashLoopBackOff   7 (2m ago)   15m

Debug Commands
# Step 1: Check the PREVIOUS container's logs (current container already crashed)
$ kubectl logs checkout-svc-5f6d7c8b9-k3m2n -n checkout --previous
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation]
goroutine 1 [running]:
main.main()
        /app/main.go:42 +0x1a4
# Step 2: If no --previous logs, check current attempt
$ kubectl logs checkout-svc-5f6d7c8b9-k3m2n -n checkout
Error: required environment variable DATABASE_URL not set
# Step 3: Check exit code and reason from describe
$ kubectl describe pod checkout-svc-5f6d7c8b9-k3m2n -n checkout
# Look for:
    Last State:  Terminated
      Reason:    Error
      Exit Code: 1
      Started:   Sun, 15 Mar 2026 02:14:22 +0000
      Finished:  Sun, 15 Mar 2026 02:14:23 +0000
# Or:
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137
# Step 4: Check if ConfigMap/Secret exists
$ kubectl get configmap checkout-config -n checkout
Error from server (NotFound): configmaps "checkout-config" not found
# Step 5: Check if liveness probe is killing the container
$ kubectl describe pod checkout-svc-5f6d7c8b9-k3m2n -n checkout | grep -A 10 "Liveness"
    Liveness: http-get http://:8080/healthz delay=5s timeout=1s period=10s #success=1 #failure=3
Events:
  Warning  Unhealthy  2m (x9 over 14m)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    2m (x3 over 12m)  kubelet  Container checkout failed liveness probe, will be restarted

Root Cause
Common causes ranked by frequency:
- Missing ConfigMap/Secret — app requires env vars that don’t exist in the cluster
- Application bug — nil pointer, unhandled exception on startup
- OOMKilled — container exceeds memory limit (exit code 137). See Scenario 10
- Liveness probe too aggressive — app needs 30s to start, probe fails at 5s
- Wrong command/args — container entrypoint is incorrect
- Permissions — app cannot read files, connect to database, or access cloud APIs
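Exit code 137 from the list above is not arbitrary: it is 128 plus signal number 9 (SIGKILL), the signal the kernel OOM killer and the kubelet use. A quick local demonstration:

```shell
# A process killed with SIGKILL exits with 128 + 9 = 137 —
# the same code `kubectl describe` shows for an OOMKilled container.
sh -c 'kill -9 $$'
echo "observed exit code: $?"
```

A small exit code (1, 2, ...) points at the application itself; 137 points at something outside the process killing it.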
Fix + Prevention
# Missing config — create the missing ConfigMap
kubectl create configmap checkout-config -n checkout \
  --from-literal=DATABASE_URL="postgresql://db.internal:5432/checkout"
# Liveness probe too aggressive — increase initialDelaySeconds
# In the Deployment spec:
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: 8080
#   initialDelaySeconds: 30   # <-- was 5, increase this
#   periodSeconds: 10
#   failureThreshold: 5       # <-- was 3, more tolerance
# OOMKilled — increase memory limit
kubectl set resources deployment checkout-svc -n checkout \
  --limits=memory=512Mi --requests=memory=256Mi
# Quick restart without waiting for backoff
kubectl delete pod checkout-svc-5f6d7c8b9-k3m2n -n checkout

Prevention:
- Use `startupProbe` for slow-starting apps (separate from liveness)
- Enforce ExternalSecrets or Sealed Secrets so configs are always present via GitOps
- Set up VPA in recommendation mode to right-size memory limits
- Run pre-deploy checks in CI: validate all referenced ConfigMaps/Secrets exist
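The growing restart delay in the symptoms is kubelet's exponential backoff: it starts around 10s, doubles on each crash, caps at five minutes, and resets after roughly ten minutes of stable running. A sketch of the resulting schedule:

```shell
# CrashLoopBackOff delay schedule: starts at 10s, doubles each restart,
# capped at 300s. (The counter resets after ~10 minutes of healthy running.)
delay=10
schedule=""
for attempt in 1 2 3 4 5 6 7; do
  schedule="${schedule}${schedule:+ }${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
echo "restart delays: $schedule"
```

This is why deleting the pod is listed as the quick restart: the backoff counter is discarded along with the pod.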
Scenario 3: ImagePullBackOff
Symptoms
Pod cannot pull its container image. Status flips between ErrImagePull and ImagePullBackOff.
$ kubectl get pods -n fraud-detection
NAME                            READY   STATUS             RESTARTS   AGE
fraud-ml-model-6b7c8d9e0-p4q5   0/1     ImagePullBackOff   0          8m

Debug Commands
# Step 1: Describe pod — look at Events
$ kubectl describe pod fraud-ml-model-6b7c8d9e0-p4q5 -n fraud-detection
Events:
  Warning  Failed  8m  kubelet  Failed to pull image "123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml:v2.3.1": rpc error: code = Unknown desc = Error response from daemon: pull access denied for 123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml, repository does not exist or may require 'docker login'
# Step 2: Check if the image actually exists in ECR
$ aws ecr describe-images --repository-name fraud-ml \
    --image-ids imageTag=v2.3.1 --region eu-west-1
# Error: ImageNotFoundException
# Step 3: Check imagePullSecrets configuration
$ kubectl get pod fraud-ml-model-6b7c8d9e0-p4q5 -n fraud-detection \
    -o jsonpath='{.spec.imagePullSecrets}'
[]   # <-- empty, no pull secret configured
# Step 4: Check ServiceAccount for ECR/GAR pull permissions
$ kubectl get sa -n fraud-detection -o yaml | grep -A 3 annotations
    annotations:
      eks.amazonaws.com/role-arn: ""   # <-- no IRSA role for ECR access
# Step 5: Verify ECR token (for debugging only)
$ aws ecr get-login-password --region eu-west-1 | \
    docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.eu-west-1.amazonaws.com
Login Succeeded   # <-- if this works, it's a node/SA permissions issue

Root Cause
- Image tag doesn’t exist — typo in tag, image never pushed, tag was overwritten
- ECR/GAR authentication failure — node IAM role lacks `ecr:GetDownloadUrlForLayer`, or IRSA not configured
- Private registry without imagePullSecret — pulling from Docker Hub, Quay, or another private registry
- Wrong region — ECR repo is in `us-east-1` but cluster is in `eu-west-1`
- ECR token expired — ECR tokens last 12 hours; if using static secrets, they expire
- Network — node cannot reach the registry (missing NAT Gateway, VPC endpoint, or firewall rule)
Fix + Prevention
# Image doesn't exist — verify and push
aws ecr describe-images --repository-name fraud-ml --region eu-west-1 \
  --query 'imageDetails[*].imageTags' --output table
# Push the correct image from CI/CD
# ECR auth — ensure node IAM role has ECR permissions (EKS managed node group)
# Or use IRSA for pull:
# Policy: arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
# For private registries — create imagePullSecret
kubectl create secret docker-registry regcred -n fraud-detection \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<user> \
  --docker-password=<token>
# Attach to ServiceAccount (better than per-pod)
kubectl patch serviceaccount default -n fraud-detection \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
# Network — if behind NAT, check route tables and security groups
# Consider ECR VPC Endpoints for private subnets:
#   com.amazonaws.<region>.ecr.dkr
#   com.amazonaws.<region>.ecr.api
#   com.amazonaws.<region>.s3   (gateway endpoint for image layers)

Prevention:
- Use immutable image tags (sha256 digests) instead of mutable tags like `latest`
- Set up ECR replication to the cluster’s region
- Use ECR pull-through cache for public images
- Create a platform-level `imagePullSecret` via ExternalSecrets in every namespace
- CI pipeline should verify the image exists before updating the deployment manifest
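Two of the root causes above (wrong tag, wrong region) are visible in the image reference itself before you ever touch IAM. A sketch that splits the ECR reference from this scenario into the parts worth eyeballing first:

```shell
# Pull apart an ECR image reference with POSIX parameter expansion.
image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/fraud-ml:v2.3.1"
registry=${image%%/*}     # host part before the first slash
repo_tag=${image#*/}      # everything after the host
repo=${repo_tag%%:*}
tag=${repo_tag##*:}
region=$(echo "$registry" | cut -d. -f4)   # ECR hosts embed the region as the 4th label
echo "repo=$repo tag=$tag region=$region"
```

If `region` here does not match the cluster's region, you have found the "wrong region" cause; if `tag` looks off, check the registry for a typo or an unpushed build.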
Scenario 4: Pod Running But Not Receiving Traffic
Symptoms
Pod shows Running and 1/1 Ready, but requests never reach it. Users report 503 or connection timeout.
$ kubectl get pods -n api-gateway
NAME                     READY   STATUS    RESTARTS   AGE
api-gw-7d8e9f0a1-b2c3d   1/1     Running   0          30m
$ kubectl get svc -n api-gateway
NAME     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
api-gw   ClusterIP   10.100.45.123   <none>        8080/TCP   30d

Debug Commands
# Step 1: Check if endpoints exist for the service
$ kubectl get endpoints api-gw -n api-gateway
NAME     ENDPOINTS   AGE
api-gw   <none>      30d
# ^^ EMPTY endpoints — this is the problem
# Step 2: Compare service selector with pod labels
$ kubectl get svc api-gw -n api-gateway -o jsonpath='{.spec.selector}'
{"app":"api-gateway","version":"v2"}
$ kubectl get pod api-gw-7d8e9f0a1-b2c3d -n api-gateway --show-labels
NAME                     READY   STATUS    LABELS
api-gw-7d8e9f0a1-b2c3d   1/1     Running   app=api-gw,version=v2
# ^^ Label is "app=api-gw" but service selects "app=api-gateway" — MISMATCH
# Step 3: If labels match, check readiness probe
$ kubectl describe pod api-gw-7d8e9f0a1-b2c3d -n api-gateway | grep -A 8 "Readiness"
    Readiness: http-get http://:8080/ready delay=5s timeout=1s period=10s #success=1 #failure=3
    ...
  Warning  Unhealthy  1m (x45 over 28m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
# Step 4: Test connectivity from inside the cluster
$ kubectl run debug-net --rm -it --image=busybox -n api-gateway -- sh
/ # wget -qO- http://api-gw.api-gateway.svc.cluster.local:8080/ready
wget: server returned error: HTTP/1.1 503 Service Unavailable
# Step 5: Check the actual container port
$ kubectl get pod api-gw-7d8e9f0a1-b2c3d -n api-gateway \
    -o jsonpath='{.spec.containers[0].ports}'
[{"containerPort":3000,"protocol":"TCP"}]
# ^^ Container listens on 3000 but service targets 8080

Root Cause
- Service selector does not match pod labels — most common cause
- Readiness probe failing — pod is Running but not Ready, so it’s removed from endpoints
- Port mismatch — service `targetPort` doesn’t match container’s actual listening port
- Container listening on wrong interface — app binds to `127.0.0.1` instead of `0.0.0.0`
- NetworkPolicy blocking ingress traffic to the pod (see Scenario 14)
Fix + Prevention
# Fix label mismatch — update service selector OR pod labels
kubectl patch svc api-gw -n api-gateway \
  --type='json' -p='[{"op":"replace","path":"/spec/selector/app","value":"api-gw"}]'
# Fix port mismatch — update service targetPort
kubectl patch svc api-gw -n api-gateway \
  --type='json' -p='[{"op":"replace","path":"/spec/ports/0/targetPort","value":3000}]'
# Verify fix — endpoints should now show the pod IP
$ kubectl get endpoints api-gw -n api-gateway
NAME     ENDPOINTS        AGE
api-gw   10.1.2.34:3000   30d

Prevention:
- Use Helm/Kustomize templates where service selector and pod labels are generated from the same variable
- Add integration tests in CI that deploy to a test namespace and verify endpoints are populated
- Standardize on port naming (`http`, `grpc`) in golden path templates
- Alert on services with 0 endpoints: `kube_endpoint_address_available == 0`
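The matching rule behind Step 2 — every key/value in the service selector must be present in the pod's labels — can be sketched as plain string matching, using the labels from the Scenario 4 output:

```shell
# Every selector entry must appear in the pod's labels; one miss means
# the pod never lands in the service's endpoints.
selector="app=api-gateway,version=v2"
pod_labels="app=api-gw,version=v2"
mismatches=""
for kv in $(printf '%s' "$selector" | tr ',' ' '); do
  case ",$pod_labels," in
    *",$kv,"*) ;;                                        # entry matched
    *) mismatches="${mismatches}${mismatches:+ }$kv" ;;  # entry missing
  esac
done
echo "unmatched selector entries: ${mismatches:-none}"
```

Here `app=api-gateway` never matches `app=api-gw`, which is exactly why the endpoints list stayed empty.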
Scenario 5: DNS Resolution Failures
Symptoms
Pods cannot resolve internal service names or external domains. Application logs show connection errors to hostnames.
$ kubectl exec -it checkout-svc-abc123 -n checkout -- nslookup payment-svc.payments.svc.cluster.local
;; connection timed out; no servers could be reached
$ kubectl exec -it checkout-svc-abc123 -n checkout -- nslookup google.com
;; connection timed out; no servers could be reached

Debug Commands
# Step 1: Check CoreDNS pods
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS             RESTARTS   AGE
coredns-5d78c9869d-j4k5l   0/1     CrashLoopBackOff   12         2h
coredns-5d78c9869d-m6n7o   0/1     CrashLoopBackOff   12         2h
# Step 2: Check CoreDNS logs
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
[FATAL] plugin/loop: Loop (127.0.0.1:53 -> :53) detected for zone ".", flushing cache
# Step 3: Check CoreDNS ConfigMap
$ kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
# Step 4: Check if DNS service has endpoints
$ kubectl get svc kube-dns -n kube-system
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)
kube-dns   ClusterIP   10.100.0.10   <none>        53/UDP,53/TCP
$ kubectl get endpoints kube-dns -n kube-system
NAME       ENDPOINTS   AGE
kube-dns   <none>      90d
# ^^ No endpoints because CoreDNS pods are crashing
# Step 5: Test with explicit DNS server (bypass pod's resolv.conf)
$ kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default 10.100.0.10
;; connection timed out; no servers could be reached
# Step 6: Check resolv.conf inside the pod
$ kubectl exec -it checkout-svc-abc123 -n checkout -- cat /etc/resolv.conf
nameserver 10.100.0.10
search checkout.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

Root Cause
- CoreDNS pods crashing — DNS loop detected (node’s `/etc/resolv.conf` has `127.0.0.1` as nameserver)
- CoreDNS resource starvation — too many DNS queries, CoreDNS OOMKilled or CPU-throttled
- `ndots:5` performance issue — every non-FQDN query generates 5 search domain lookups before the actual query
- Upstream DNS unreachable — VPC DNS resolver (`.2` address) rate limited or AmazonProvidedDNS issues
- NetworkPolicy blocking UDP/53 — egress policy blocks DNS traffic to `kube-system`
Fix + Prevention
# Fix DNS loop — edit CoreDNS ConfigMap to forward to VPC DNS directly
kubectl edit configmap coredns -n kube-system
# Change: forward . /etc/resolv.conf
# To:     forward . 169.254.169.253   (EKS VPC DNS)
# Or:     forward . 169.254.169.254   (GKE metadata DNS)
# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system
# Fix ndots performance — in pod spec, set dnsConfig:
# spec:
#   dnsConfig:
#     options:
#     - name: ndots
#       value: "2"
# Scale CoreDNS for large clusters (>100 nodes)
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache (preferred for large clusters)
# This runs a DNS cache on every node, reducing CoreDNS load
# Deploy: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

Prevention:
- Deploy NodeLocal DNSCache as part of cluster baseline
- Monitor CoreDNS with: `coredns_dns_requests_total`, `coredns_dns_responses_rcode_total`
- Alert on CoreDNS pod restarts
- Use FQDN with trailing dot in critical service calls (`payment-svc.payments.svc.cluster.local.`)
- Set `ndots:2` in golden path pod templates
Scenario 6: Node NotReady
Symptoms
One or more nodes show NotReady. Pods on those nodes are evicted or stuck.
$ kubectl get nodes
NAME                         STATUS     ROLES    AGE   VERSION
ip-10-1-1-100.ec2.internal   Ready      <none>   30d   v1.29.1
ip-10-1-1-101.ec2.internal   NotReady   <none>   30d   v1.29.1
ip-10-1-1-102.ec2.internal   Ready      <none>   30d   v1.29.1

Debug Commands
# Step 1: Describe the NotReady node — check Conditions
$ kubectl describe node ip-10-1-1-101.ec2.internal
Conditions:
  Type             Status   LastHeartbeatTime                 Reason
  ----             ------   -----------------                 ------
  MemoryPressure   True     Sun, 15 Mar 2026 02:00:00 +0000   KubeletHasMemoryPressure
  DiskPressure     True     Sun, 15 Mar 2026 02:00:00 +0000   KubeletHasDiskPressure
  PIDPressure      False    Sun, 15 Mar 2026 02:00:00 +0000   KubeletHasSufficientPID
  Ready            False    Sun, 15 Mar 2026 01:58:00 +0000   KubeletNotReady message: 'container runtime not ready'
# Step 2: Check node resource usage
$ kubectl top node ip-10-1-1-101.ec2.internal
NAME                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-1-1-101.ec2.internal   3800m        95%    14900Mi         96%
# Step 3: Check system pods on that node
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-1-1-101.ec2.internal
NAMESPACE     NAME                          READY   STATUS    RESTARTS   AGE
kube-system   aws-node-t8k2l                1/1     Running   0          30d
kube-system   kube-proxy-m4n5o              1/1     Running   0          30d
monitoring    prom-node-exporter-p6q7r      1/1     Running   0          30d
payments      payment-worker-heavy-abc123   1/1     Running   0          2h
# Step 4: If you have SSH access (EKS managed nodes via SSM)
$ aws ssm start-session --target i-0abc123def456
$ journalctl -u kubelet --since "30 minutes ago" | tail -30
Mar 15 02:00:12 kubelet: E0315 02:00:12.123456 node_status.go: "node not ready" err="container runtime not responding"
$ systemctl status containerd
● containerd.service - containerd container runtime
   Active: inactive (dead) since Sun 2026-03-15 01:58:00 UTC
# Step 5: Check if the EC2 instance has issues
$ aws ec2 describe-instance-status --instance-ids i-0abc123def456
{
  "InstanceStatuses": [{
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "ok"}
  }]
}

Root Cause
- Memory/Disk pressure — kubelet marks node NotReady when system resources are exhausted
- Container runtime crashed — containerd/dockerd is not responding
- Kubelet stopped — kubelet process died, node stops sending heartbeats
- Network partition — node cannot reach API server (security group change, NACL, route table)
- EC2 system failure — underlying hardware issue, instance status check failed
- Disk full — `/var/lib/containerd` full from old images, container logs, or emptyDir volumes
Fix + Prevention
# Immediate — if hardware issue, cordon and drain
kubectl cordon ip-10-1-1-101.ec2.internal
kubectl drain ip-10-1-1-101.ec2.internal --ignore-daemonsets --delete-emptydir-data --force
# On the node (via SSM):
# Restart containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet
# Clear disk space
sudo crictl rmi --prune
sudo journalctl --vacuum-size=500M
# For managed node groups — just terminate the instance
# ASG will replace it automatically
aws ec2 terminate-instances --instance-ids i-0abc123def456
# For Karpenter — delete the node, Karpenter provisions a replacement
kubectl delete node ip-10-1-1-101.ec2.internal

Prevention:
- Set kubelet eviction thresholds: `--eviction-hard=memory.available<500Mi,nodefs.available<10%`
- Use Karpenter `ttlSecondsUntilExpired` to cycle nodes regularly (e.g., 7 days)
- Monitor node conditions with Prometheus: `kube_node_status_condition{condition="Ready",status="true"} == 0`
- Set resource requests on ALL pods to prevent noisy-neighbor memory exhaustion
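The `nodefs.available<10%` eviction threshold in the prevention list is just a percentage check over the node's filesystem. A sketch with illustrative numbers (not taken from a live node):

```shell
# kubelet signals DiskPressure once free space drops below the
# eviction threshold. Figures below are illustrative.
total_bytes=107374182400   # 100 GiB filesystem
avail_bytes=8589934592     # 8 GiB free
threshold_pct=10           # matches nodefs.available<10%
pct_free=$(( avail_bytes * 100 / total_bytes ))
if [ "$pct_free" -lt "$threshold_pct" ]; then
  echo "DiskPressure expected: only ${pct_free}% of nodefs free"
fi
```

Crossing this line is what flips the `DiskPressure` condition seen in Step 1 and starts evictions, so alerting well before 10% buys time to prune images and logs.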
Scenario 7: PVC Stuck in Pending
Symptoms
PersistentVolumeClaim stays in Pending. Pods using it are also stuck in Pending.
$ kubectl get pvc -n databases
NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS    AGE
postgres-data-0   Pending                                      gp3-encrypted   15m

Debug Commands
# Step 1: Describe the PVC
$ kubectl describe pvc postgres-data-0 -n databases
Events:
  Warning  ProvisioningFailed  2m (x7 over 15m)  ebs.csi.aws.com_ebs-csi-controller-xxx  failed to provision volume with StorageClass "gp3-encrypted": rpc error: could not create volume "pvc-xxx" in zone "eu-west-1a": UnauthorizedOperation: not authorized to perform: ec2:CreateVolume
# Step 2: Check if StorageClass exists
$ kubectl get storageclass
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE
gp2 (default)   kubernetes.io/aws-ebs   Delete          Immediate
gp3-encrypted   ebs.csi.aws.com         Retain          WaitForFirstConsumer
# ^^ StorageClass exists
# Step 3: Check EBS CSI driver pods
$ kubectl get pods -n kube-system -l app=ebs-csi-controller
NAME                                 READY   STATUS    RESTARTS   AGE
ebs-csi-controller-5d8f9g0h1-a2b3c   6/6     Running   0          7d
ebs-csi-controller-5d8f9g0h1-d4e5f   6/6     Running   0          7d
# Step 4: Check EBS CSI controller logs
$ kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner --tail=20
E0315 02:30:00.123456 controller.go:XXX could not create volume: UnauthorizedOperation: not authorized to perform ec2:CreateVolume on resource arn:aws:ec2:eu-west-1:123456789012:volume/*
# Step 5: Check IRSA role for the CSI driver
$ kubectl get sa ebs-csi-controller-sa -n kube-system -o yaml | grep role-arn
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ebs-csi-role
# Step 6: Verify IAM policy
$ aws iam get-role-policy --role-name ebs-csi-role --policy-name ebs-policy
# Look for ec2:CreateVolume, ec2:AttachVolume, ec2:DeleteVolume permissions

Root Cause
- IAM permissions — EBS CSI / PD CSI driver service account lacks permissions to create volumes
- StorageClass not found — PVC references a StorageClass that doesn’t exist
- AZ mismatch with `WaitForFirstConsumer` — node is in `eu-west-1a` but PV was pre-provisioned in `eu-west-1b`
- Quota exceeded — AWS EBS volume quota or GCP PD quota hit
- Encryption KMS key — StorageClass specifies a KMS key the CSI driver role cannot access
- CSI driver not installed — EBS CSI driver addon not enabled on the cluster
Fix + Prevention
# Fix IAM — attach the correct policy to the CSI driver IRSA role
aws iam attach-role-policy --role-name ebs-csi-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
# Fix StorageClass — create if missing
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: arn:aws:kms:eu-west-1:123456789012:key/mrk-abc123
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
# Fix KMS — grant CSI role access to the KMS key
aws kms create-grant --key-id mrk-abc123 \
  --grantee-principal arn:aws:iam::123456789012:role/ebs-csi-role \
  --operations "CreateGrant" "Encrypt" "Decrypt" "GenerateDataKey" "DescribeKey"
# Check quota
aws service-quotas get-service-quota --service-code ebs \
  --quota-code L-D18FCD1D --region eu-west-1

Prevention:
- Include EBS CSI driver as a cluster add-on in Terraform (not manual install)
- Pre-create and test StorageClasses in cluster baseline
- Use `volumeBindingMode: WaitForFirstConsumer` to avoid AZ mismatch
- Monitor PVC age with Prometheus: alert if PVC is Pending > 5 minutes
- Grant KMS key access in Terraform alongside the CSI driver role
Scenario 8: Ingress Not Routing
Symptoms
Ingress resource exists but external traffic returns 404, 502, or connection refused.
$ kubectl get ingress -n frontend
NAME      CLASS   HOSTS          ADDRESS                                               PORTS    AGE
web-app   alb     app.bank.com   k8s-frontend-web-abc123.eu-west-1.elb.amazonaws.com   80,443   10m
# But accessing app.bank.com returns 502 Bad Gateway
$ curl -I https://app.bank.com
HTTP/2 502
server: awselb/2.0

Debug Commands
# Step 1: Check Ingress resource details
$ kubectl describe ingress web-app -n frontend
Rules:
  Host          Path   Backends
  ----          ----   --------
  app.bank.com  /      web-app-svc:80 (10.1.2.34:3000)
# Step 2: Check if the backend service and endpoints exist
$ kubectl get svc web-app-svc -n frontend
NAME          TYPE        CLUSTER-IP     PORT(S)   AGE
web-app-svc   ClusterIP   10.100.67.89   80/TCP    10m
$ kubectl get endpoints web-app-svc -n frontend
NAME          ENDPOINTS        AGE
web-app-svc   10.1.2.34:3000   10m
# Step 3: Check ALB Ingress Controller / AWS Load Balancer Controller logs
$ kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=30
{"level":"error","msg":"Failed to reconcile ingress frontend/web-app: failed to resolve target group health check: backend service web-app-svc does not have matching target port annotation"}
# Step 4: Check AWS target group health
$ aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/k8s-frontend-web/abc123
{
  "TargetHealthDescriptions": [{
    "Target": {"Id": "10.1.2.34", "Port": 3000},
    "TargetHealth": {
      "State": "unhealthy",
      "Reason": "Target.FailedHealthChecks",
      "Description": "Health checks failed with these codes: [404]"
    }
  }]
}
# ^^ ALB health check is hitting the pod but getting 404
# Step 5: Check health check path configuration
$ kubectl get ingress web-app -n frontend -o yaml | grep -A 5 healthcheck
# Look for annotations:
#   alb.ingress.kubernetes.io/healthcheck-path: /healthz
# If missing, ALB defaults to "/" which may return 404
# Step 6: Test the health check from inside the cluster
$ kubectl exec -it web-app-abc123 -n frontend -- wget -qO- http://localhost:3000/healthz
OK
$ kubectl exec -it web-app-abc123 -n frontend -- wget -qO- http://localhost:3000/
# If this returns 404, the ALB health check is failing on "/"

Root Cause
- ALB health check failing — default path `/` returns 404, need to set `healthcheck-path` annotation
- Target group port mismatch — ALB sends traffic to wrong port on the pod
- Security group — ALB security group cannot reach pod network (missing node SG ingress rule)
- Subnet tags missing — ALB controller cannot discover subnets without `kubernetes.io/role/elb: 1` tags
- DNS not pointing to ALB — `app.bank.com` CNAME does not resolve to the ALB address
- TLS termination misconfigured — certificate ARN in annotation doesn’t match the domain
Fix + Prevention
# Fix health check path — add annotation
kubectl annotate ingress web-app -n frontend \
  alb.ingress.kubernetes.io/healthcheck-path=/healthz --overwrite
# Fix security group — ensure ALB SG allows traffic to node SG
# AWS Load Balancer Controller manages this if:
#   alb.ingress.kubernetes.io/security-groups is set correctly
# Fix subnet discovery — tag subnets
aws ec2 create-tags --resources subnet-abc123 \
  --tags Key=kubernetes.io/role/elb,Value=1
# Fix TLS — add certificate ARN
kubectl annotate ingress web-app -n frontend \
  alb.ingress.kubernetes.io/certificate-arn=arn:aws:acm:eu-west-1:123456789012:certificate/abc-123
# Verify ALB target health after fix
$ aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/k8s-frontend-web/abc123
{
  "TargetHealthDescriptions": [{
    "Target": {"Id": "10.1.2.34", "Port": 3000},
    "TargetHealth": {"State": "healthy"}
  }]
}

Prevention:
- Include health check annotations in golden path Ingress templates
- Ensure subnet tagging is part of Terraform VPC module
- Use external-dns to auto-manage DNS records from Ingress resources
- Alert on ALB target health: `aws_alb_tg_unhealthy_host_count > 0`
Scenario 9: IRSA / Workload Identity Not Working
Symptoms
Pod cannot access AWS/GCP APIs. Application logs show AccessDenied or 403 Forbidden.
$ kubectl logs s3-uploader-abc123 -n data-pipeline
An error occurred (AccessDenied) when calling the PutObject operation:
User: arn:aws:sts::123456789012:assumed-role/eksctl-cluster-nodegroup-NodeInstanceRole/i-abc123
is not authorized to perform: s3:PutObject on resource: arn:aws:s3:::bank-data-lake/*
# ^^ Using NODE role instead of IRSA role — IRSA is not working

Debug Commands
# Step 1: Check ServiceAccount annotation
$ kubectl get sa s3-uploader-sa -n data-pipeline -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-uploader-sa
  namespace: data-pipeline
  annotations: {}   # <-- NO IRSA annotation!
# Step 2: Check if the pod is using the correct ServiceAccount
$ kubectl get pod s3-uploader-abc123 -n data-pipeline \
    -o jsonpath='{.spec.serviceAccountName}'
default   # <-- Using "default" SA, not "s3-uploader-sa"
# Step 3: Verify the projected token volume exists
$ kubectl get pod s3-uploader-abc123 -n data-pipeline \
    -o jsonpath='{.spec.volumes[?(@.name=="aws-iam-token")]}' | jq .
# Should return a projected volume with audience "sts.amazonaws.com"
# If empty — IRSA token not being projected
# Step 4: Check the IAM role trust policy
$ aws iam get-role --role-name s3-uploader-role --query 'Role.AssumeRolePolicyDocument'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/ABC123"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.eu-west-1.amazonaws.com/id/ABC123:sub": "system:serviceaccount:data-pipeline:s3-uploader-sa",
        "oidc.eks.eu-west-1.amazonaws.com/id/ABC123:aud": "sts.amazonaws.com"
      }
    }
  }]
}
# Step 5: Verify OIDC provider exists
$ aws eks describe-cluster --name prod-cluster \
    --query 'cluster.identity.oidc.issuer' --output text
https://oidc.eks.eu-west-1.amazonaws.com/id/ABC123
$ aws iam list-open-id-connect-providers | grep ABC123
"Arn": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/ABC123"
# Step 6: Test from inside the pod
$ kubectl exec -it s3-uploader-abc123 -n data-pipeline -- env | grep AWS
AWS_ROLE_ARN=                  # <-- empty, IRSA not injected
AWS_WEB_IDENTITY_TOKEN_FILE=   # <-- empty

Root Cause
Section titled “Root Cause”- ServiceAccount missing annotation — eks.amazonaws.com/role-arn not set
- Pod using wrong ServiceAccount — Deployment spec says serviceAccountName: default
- Trust policy mismatch — namespace or SA name in trust policy condition doesn’t match
- OIDC provider not created — IRSA requires the EKS OIDC provider registered in IAM
- Token audience mismatch — trust policy expects sts.amazonaws.com but token has different audience
- Webhook not mutating — Amazon EKS Pod Identity Webhook not installed (needed for IRSA token injection)
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Annotate the ServiceAccountkubectl annotate sa s3-uploader-sa -n data-pipeline \ eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/s3-uploader-role
# Fix 2: Update Deployment to use the correct SAkubectl patch deployment s3-uploader -n data-pipeline \ --type='json' -p='[{"op":"add","path":"/spec/template/spec/serviceAccountName","value":"s3-uploader-sa"}]'# "add" also overwrites an existing value; "replace" fails if the field is absent
# Fix 3: Fix trust policy — ensure namespace and SA matchaws iam update-assume-role-policy --role-name s3-uploader-role \ --policy-document file://trust-policy.json# trust-policy.json must have correct:# system:serviceaccount:<namespace>:<sa-name>
# Fix 4: Create OIDC provider if missingeksctl utils associate-iam-oidc-provider --cluster prod-cluster --approve
# Verify fix — new pod should have AWS env vars$ kubectl exec -it s3-uploader-NEW -n data-pipeline -- env | grep AWSAWS_ROLE_ARN=arn:aws:iam::123456789012:role/s3-uploader-roleAWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/tokenPrevention:
- Define IRSA/WI as part of namespace provisioning (Crossplane or Terraform)
- Use EKS Pod Identity (newer, simpler) instead of IRSA for new clusters
- Validate IRSA in CI: kubectl auth can-i checks in post-deploy verification
- Template the trust policy alongside the SA in the same Terraform module
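The prevention bullets above boil down to committing the ServiceAccount and Deployment together. A minimal sketch of the two manifests, reusing the role and names from this scenario (the image name is illustrative):

```yaml
# Golden-path sketch: SA annotated for IRSA, Deployment referencing it
# explicitly. Names reuse the examples above; the image is illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-uploader-sa
  namespace: data-pipeline
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-uploader-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-uploader
  namespace: data-pipeline
spec:
  selector:
    matchLabels:
      app: s3-uploader
  template:
    metadata:
      labels:
        app: s3-uploader
    spec:
      serviceAccountName: s3-uploader-sa   # never leave this as "default"
      containers:
        - name: uploader
          image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/s3-uploader:v1
```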
Scenario 10: OOMKilled
Section titled “Scenario 10: OOMKilled”Symptoms
Section titled “Symptoms”Container is terminated because it exceeded its memory limit. Exit code 137.
$ kubectl get pods -n ml-inferenceNAME READY STATUS RESTARTS AGEmodel-server-7d8e9f0-a1b2c 0/1 OOMKilled 4 (30s ago) 10m
$ kubectl describe pod model-server-7d8e9f0-a1b2c -n ml-inference Last State: Terminated Reason: OOMKilled Exit Code: 137Debug Commands
Section titled “Debug Commands”# Step 1: Check current resource limits$ kubectl get pod model-server-7d8e9f0-a1b2c -n ml-inference \ -o jsonpath='{.spec.containers[0].resources}' | jq .{ "limits": {"cpu": "1", "memory": "512Mi"}, "requests": {"cpu": "500m", "memory": "256Mi"}}
# Step 2: Check actual memory usage before OOM (from Prometheus/metrics-server)$ kubectl top pod -n ml-inference --sort-by=memoryNAME CPU(cores) MEMORY(bytes)model-server-7d8e9f0-a1b2c 450m 509Mi # <-- hitting 512Mi limit
# Step 3: Check if VPA has recommendations$ kubectl get vpa -n ml-inferenceNAME MODE CPU MEM PROVIDED AGEmodel-server Off - - True 7d
$ kubectl describe vpa model-server -n ml-inferenceRecommendation: Container Recommendations: Container Name: model-server Lower Bound: Cpu: 200m, Memory: 768Mi Target: Cpu: 500m, Memory: 1Gi # <-- VPA recommends 1Gi Upper Bound: Cpu: 2, Memory: 2Gi
# Step 4: Check node-level OOM killer activity (via SSM)$ journalctl -k | grep -i "out of memory"Mar 15 02:45:00 kernel: Out of memory: Killed process 12345 (model-server) total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB, shmem-rss:0kB
# Step 5: Check if it's a JVM/Go app with known memory patterns$ kubectl logs model-server-7d8e9f0-a1b2c -n ml-inference --previous | tail -5Exception in thread "main" java.lang.OutOfMemoryError: Java heap space# ^^ JVM heap not capped — it grows beyond container limitRoot Cause
Section titled “Root Cause”- Memory limit too low — application genuinely needs more memory than the limit allows
- Memory leak — application allocates memory without releasing it over time
- JVM heap not bounded — without container-aware flags the JVM sizes its default heap from host memory (roughly 25% of RAM), exceeding the container limit
- ML model loading — model loaded into memory exceeds container limit
- No limits set + node memory exhaustion — without limits, pod consumes all node memory and the kernel OOM killer terminates it
Fix + Prevention
Section titled “Fix + Prevention”# Immediate — increase memory limit based on VPA recommendationkubectl set resources deployment model-server -n ml-inference \ --limits=memory=1536Mi --requests=memory=1Gi
# For JVM apps — set heap explicitly relative to container limit# In Deployment env:# - name: JAVA_OPTS# value: "-XX:MaxRAMPercentage=75.0"# This caps JVM heap at 75% of container memory limit
# For Go apps — set GOMEMLIMIT# - name: GOMEMLIMIT# value: "400MiB"
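To pick the new limit, a rough headroom heuristic can help. This is an assumption (roughly 1.5x observed peak, similar in spirit to the VPA target above), not an official formula, and suggest_memory is a hypothetical helper:

```shell
#!/bin/sh
# suggest_memory: hypothetical helper that takes the observed working
# set in Mi and prints request/limit values with ~50% headroom.
# The 1.5x multiplier is an assumption, not an official formula.
suggest_memory() {
  observed_mi="$1"
  limit=$(( observed_mi * 3 / 2 ))   # limit = 1.5x observed peak
  request="$observed_mi"             # request = observed steady state
  echo "requests=memory=${request}Mi limits=memory=${limit}Mi"
}

# 509Mi was the working set seen in Step 2 above
suggest_memory 509   # prints: requests=memory=509Mi limits=memory=763Mi
```

Feed the output into `kubectl set resources` as in Fix 1, then re-check `kubectl top pod` after a day of traffic.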
# Enable VPA in Auto mode (if approved by platform policy)# Or use VPA in recommendation-only mode and update limits manuallyPrevention:
- ALWAYS set memory limits on all containers (enforce with OPA/Kyverno)
- Run VPA in recommendation mode on all namespaces, review monthly
- For JVM: mandate -XX:MaxRAMPercentage=75.0 in golden path
- Alert on container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
- Set LimitRange defaults so pods without explicit limits still get bounded
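The LimitRange bullet can look like this, a sketch that bounds containers that declare nothing (the default values are illustrative and should be tuned per namespace tier):

```yaml
# Sketch: LimitRange so pods without explicit limits still get bounded.
# Values are illustrative defaults, not recommendations for every tenant.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ml-inference
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:             # applied as the limit when none is set
        cpu: 500m
        memory: 512Mi
```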
Scenario 11: Pod Stuck in Terminating
Section titled “Scenario 11: Pod Stuck in Terminating”Symptoms
Section titled “Symptoms”Pod is in Terminating state indefinitely. kubectl delete pod does not work.
$ kubectl get pods -n legacy-appNAME READY STATUS RESTARTS AGElegacy-worker-abc123 1/1 Terminating 0 45mDebug Commands
Section titled “Debug Commands”# Step 1: Check for finalizers on the pod$ kubectl get pod legacy-worker-abc123 -n legacy-app -o jsonpath='{.metadata.finalizers}'["custom.finalizer.io/cleanup"]# ^^ Finalizer is blocking deletion
# Step 2: Check if the node hosting the pod is responsive$ kubectl get pod legacy-worker-abc123 -n legacy-app -o jsonpath='{.spec.nodeName}'ip-10-1-1-101.ec2.internal
$ kubectl get node ip-10-1-1-101.ec2.internalNAME STATUS ROLES AGE VERSIONip-10-1-1-101.ec2.internal NotReady <none> 30d v1.29.1# ^^ Node is NotReady — kubelet cannot execute the graceful shutdown
# Step 3: Check if there's a PVC with a finalizer blocking things$ kubectl get pvc -n legacy-app -o jsonpath='{range .items[*]}{.metadata.name}: {.metadata.finalizers}{"\n"}{end}'legacy-data: ["kubernetes.io/pvc-protection"]
# Step 4: Check if preStop hook is hanging$ kubectl get pod legacy-worker-abc123 -n legacy-app \ -o jsonpath='{.spec.containers[0].lifecycle.preStop}'{"exec":{"command":["sh","-c","sleep 3600"]}}# ^^ preStop hook sleeps for 3600 seconds (1 hour!)
# Step 5: Check terminationGracePeriodSeconds$ kubectl get pod legacy-worker-abc123 -n legacy-app \ -o jsonpath='{.spec.terminationGracePeriodSeconds}'3600 # <-- 1 hour grace period, pod won't force-kill until this expiresRoot Cause
Section titled “Root Cause”- Finalizer blocking — a controller’s finalizer is present but the controller is not running to remove it
- Node is NotReady — kubelet on the node cannot execute SIGTERM/SIGKILL sequence
- preStop hook hanging — long-running preStop hook (e.g., draining connections) never completes
- terminationGracePeriodSeconds too high — set to 3600+ seconds
- PVC protection finalizer — PVC is still in use, preventing pod deletion chain
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Remove finalizer (if controller is gone)kubectl patch pod legacy-worker-abc123 -n legacy-app \ --type='json' -p='[{"op":"remove","path":"/metadata/finalizers"}]'
# Fix 2: Force delete (last resort — use with caution)kubectl delete pod legacy-worker-abc123 -n legacy-app --force --grace-period=0
# Fix 3: If node is NotReady, the pod object stays until the node recovers# or until you force-delete. Fix the node first (Scenario 6) or:kubectl delete pod legacy-worker-abc123 -n legacy-app --force --grace-period=0
# Fix 4: Reduce terminationGracePeriodSeconds in Deployment spec# spec.terminationGracePeriodSeconds: 30 (sane default)Prevention:
- Set terminationGracePeriodSeconds: 30 in golden path templates (override only with justification)
- Ensure preStop hooks have bounded execution time
- Monitor for pods in Terminating state > 5 minutes
- If using custom finalizers, ensure the controller has HA (multiple replicas)
- For StatefulSets, use pod disruption budgets to control deletion safely
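The grace-period and preStop bullets combine into a pod template fragment like this sketch (/app/drain.sh and the timings are illustrative):

```yaml
# Fragment of a Deployment pod template: bounded preStop plus a sane
# grace period. /app/drain.sh is a hypothetical drain script.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: worker
      image: legacy-worker:v1
      lifecycle:
        preStop:
          exec:
            # drain for at most 20s, always less than the grace period
            command: ["sh", "-c", "timeout 20 /app/drain.sh || true"]
```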
Scenario 12: HPA Not Scaling
Section titled “Scenario 12: HPA Not Scaling”Symptoms
Section titled “Symptoms”HPA exists but replicas remain at minimum even under high load.
$ kubectl get hpa -n checkoutNAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGEcheckout-hpa Deployment/checkout-svc <unknown>/70% 2 20 2 30d# ^^^^^^^^^ <unknown> means metrics not availableDebug Commands
Section titled “Debug Commands”# Step 1: Describe the HPA for detailed status$ kubectl describe hpa checkout-hpa -n checkoutConditions: Type Status Reason Message ---- ------ ------ ------- AbleToScale True ReadyForNewScale recommended size matches current size ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu in container "checkout" of pod "checkout-svc-abc123"
# Step 2: Check if metrics-server is running$ kubectl get pods -n kube-system -l k8s-app=metrics-serverNAME READY STATUS RESTARTS AGEmetrics-server-6d684c7b5d-x9y0z 1/1 Running 0 30d
# Step 3: Verify metrics are available$ kubectl top pods -n checkoutNAME CPU(cores) MEMORY(bytes)checkout-svc-abc123 450m 256Micheckout-svc-def456 380m 230Mi
# Step 4: Check if the pods have resource REQUESTS set (required for CPU-based HPA)$ kubectl get pod checkout-svc-abc123 -n checkout \ -o jsonpath='{.spec.containers[0].resources}'{"limits":{"memory":"512Mi"}}# ^^ NO CPU request! HPA cannot calculate percentage without a request baseline
# Step 5: For custom metrics (e.g., queue depth), check the metrics API$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .# If 404 — custom metrics adapter (Prometheus Adapter / KEDA) not installed
# Step 6: Check HPA events$ kubectl get events -n checkout --field-selector involvedObject.name=checkout-hpaLAST SEEN TYPE REASON MESSAGE2m Warning FailedGetResourceMetric missing request for cpu5m Warning FailedComputeMetricsReplicas failed to get cpu utilizationRoot Cause
Section titled “Root Cause”- No CPU requests on pods — HPA needs resource requests to calculate utilization percentage
- Metrics server not installed or broken — kubectl top returns errors
- Custom metrics adapter missing — using custom/external metrics but Prometheus Adapter or KEDA not deployed
- Wrong metric name — HPA references a metric that doesn’t exist
- Cooldown period — HPA recently scaled down and is in the --horizontal-pod-autoscaler-downscale-stabilization window (default 5 min)
- MaxReplicas reached — HPA already at max, cannot scale further
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Add CPU requests to the Deploymentkubectl set resources deployment checkout-svc -n checkout \ --requests=cpu=200m,memory=256Mi --limits=memory=512Mi
# Fix 2: Install metrics-server (if missing)kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Fix 3: Verify after adding requests$ kubectl get hpa -n checkoutNAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGEcheckout-hpa Deployment/checkout-svc 65%/70% 2 20 2 30d# ^^^^ Now showing actual percentage
# Fix 4: For custom metrics, install KEDA# KEDA ScaledObject example for SQS queue depth:# apiVersion: keda.sh/v1alpha1# kind: ScaledObject# metadata:# name: checkout-scaledobject# spec:# scaleTargetRef:# name: checkout-svc# minReplicaCount: 2# maxReplicaCount: 20# triggers:# - type: aws-sqs-queue# metadata:# queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/checkout-queue# queueLength: "5"Prevention:
- Enforce CPU requests on all pods via OPA/Kyverno admission policy
- Include HPA + resource requests in golden path Helm templates
- Deploy metrics-server and KEDA as cluster baseline add-ons
- Monitor kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"}
- Set up alerts for HPA at maxReplicas for > 10 minutes
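For the golden path bullet, a working CPU-based HPA needs both pieces together: CPU requests on the Deployment and an autoscaling/v2 object. A sketch using the values from this scenario:

```yaml
# HPA v2 sketch. This only scales correctly because the Deployment
# also sets CPU requests (see Fix 1 above).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
  namespace: checkout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-svc
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```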
Scenario 13: Certificate Expiry (cert-manager)
Section titled “Scenario 13: Certificate Expiry (cert-manager)”Symptoms
Section titled “Symptoms”TLS certificates are about to expire or already expired. Browsers show certificate errors, API clients fail.
$ kubectl get certificates -n frontendNAME READY SECRET AGEapp-bank-tls False app-bank-tls 90d
$ kubectl describe certificate app-bank-tls -n frontendStatus: Conditions: Type: Ready Status: False Reason: Renewing Message: Renewing certificate as renewal was scheduled at 2026-03-14 Not After: 2026-03-15T00:00:00Z # <-- expires TODAYDebug Commands
Section titled “Debug Commands”# Step 1: Check cert-manager pods$ kubectl get pods -n cert-managerNAME READY STATUS RESTARTS AGEcert-manager-7d8e9f0a1-b2c3d 1/1 Running 0 30dcert-manager-cainjector-5f6g7h8i9-j0k1l 1/1 Running 0 30dcert-manager-webhook-3d4e5f6g7-h8i9j 1/1 Running 0 30d
# Step 2: Check the Certificate resource status$ kubectl describe certificate app-bank-tls -n frontendEvents: Warning Failed 2m (x15 over 24h) cert-manager-certificates-issuing The certificate request has failed to complete and will be retried: Failed to wait for order resource "app-bank-tls-order-xyz" to become ready: order is in "errored" state: acme: order error: 403
# Step 3: Check the Order and Challenge$ kubectl get orders -n frontendNAME STATE AGEapp-bank-tls-order-xyz errored 24h
$ kubectl describe order app-bank-tls-order-xyz -n frontendStatus: State: errored Reason: "acme: order error: one or more domains had a problem"
$ kubectl get challenges -n frontendNAME STATE DOMAIN AGEapp-bank-tls-challenge-abc123 pending app.bank.com 24h
$ kubectl describe challenge app-bank-tls-challenge-abc123 -n frontendStatus: Reason: Waiting for DNS-01 challenge propagation: DNS record for "_acme-challenge.app.bank.com" not yet propagated State: pending
# Step 4: Check ClusterIssuer/Issuer$ kubectl get clusterissuerNAME READY AGEletsencrypt-prod True 180d
$ kubectl describe clusterissuer letsencrypt-prodStatus: Acme: Uri: https://acme-v02.api.letsencrypt.org/acme/acct/123456 Conditions: Type: Ready Status: True
# Step 5: Check if DNS credentials for Route53/Cloud DNS are valid$ kubectl get secret route53-credentials -n cert-manager -o yaml# Verify the access key is not expired/rotated
# Step 6: cert-manager logs for detailed errors$ kubectl logs -n cert-manager -l app=cert-manager --tail=30E0315 cert-manager/challenges "msg"="propagation check failed" "error"="DNS record for \"_acme-challenge.app.bank.com\" not yet propagated" "dnsName"="app.bank.com" "type"="DNS-01"Root Cause
Section titled “Root Cause”- DNS-01 challenge failing — cert-manager cannot create the _acme-challenge TXT record (IAM permissions, wrong hosted zone)
- HTTP-01 challenge failing — challenge solver pod cannot be reached from the internet (ingress misconfigured, firewall)
- Rate limiting — Let’s Encrypt rate limits: 50 certs per domain per week, 5 duplicate certs per week
- Credential expiry — Route53/Cloud DNS IAM credentials used by cert-manager have expired
- cert-manager webhook down — webhook not running, certificate resources cannot be validated
- Cluster DNS issue — cert-manager pods cannot resolve Let’s Encrypt API (see Scenario 5)
Fix + Prevention
Section titled “Fix + Prevention”# Fix DNS-01 — verify IAM permissions for Route53aws iam simulate-principal-policy \ --policy-source-arn arn:aws:iam::123456789012:role/cert-manager-role \ --action-names route53:ChangeResourceRecordSets route53:GetChange \ --resource-arns "arn:aws:route53:::hostedzone/Z1234567890"
# Fix — if permissions are correct but record not propagating, delete and retrykubectl delete challenge app-bank-tls-challenge-abc123 -n frontend# cert-manager will create a new challenge automatically
# Emergency — if cert already expired, manually create cert from ACM/existingkubectl create secret tls app-bank-tls -n frontend \ --cert=./fullchain.pem --key=./privkey.pem --dry-run=client -o yaml | kubectl apply -f -
# Force renewalcmctl renew app-bank-tls -n frontend
# Check cert expiry dates across all namespaces$ kubectl get certificates --all-namespaces -o custom-columns=\NAMESPACE:.metadata.namespace,NAME:.metadata.name,\READY:.status.conditions[0].status,EXPIRY:.status.notAfterNAMESPACE NAME READY EXPIRYfrontend app-bank-tls False 2026-03-15T00:00:00Zpayments pay-bank-tls True 2026-05-20T00:00:00ZPrevention:
- Alert on cert expiry 30 days before: certmanager_certificate_expiration_timestamp_seconds - time() < 30*24*3600
- Use cert-manager with DNS-01 (more reliable than HTTP-01 for internal services)
- Set up IRSA/Workload Identity for cert-manager instead of static credentials
- Monitor certmanager_certificate_ready_status{condition="True"} == 0
- Run cmctl check api as part of cluster health checks
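The 30-day alert bullet, expressed as a PrometheusRule sketch (assumes the Prometheus Operator CRDs are installed in the cluster; rule names and severity are illustrative):

```yaml
# Sketch: alert 30 days before any cert-manager certificate expires.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-expiry
  namespace: cert-manager
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 30*24*3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in < 30 days"
```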
Scenario 14: Network Policy Blocking Traffic
Section titled “Scenario 14: Network Policy Blocking Traffic”Symptoms
Section titled “Symptoms”Pods cannot communicate with each other despite services being correctly configured. Connection timeout or reset.
# From checkout pod, trying to reach payment service$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \ http://payment-svc.payments.svc.cluster.local:8080/api/chargewget: download timed outDebug Commands
Section titled “Debug Commands”# Step 1: List NetworkPolicies in BOTH source and destination namespaces$ kubectl get networkpolicy -n paymentsNAME POD-SELECTOR AGEdefault-deny-all <none> 30dallow-monitoring app=payment-svc 30d
$ kubectl get networkpolicy -n checkoutNAME POD-SELECTOR AGEdefault-deny-all <none> 30dallow-egress-dns <none> 30d
# Step 2: Inspect the default-deny policy in destination namespace$ kubectl describe networkpolicy default-deny-all -n paymentsSpec: PodSelector: <none> (Coverage: all pods in the namespace) Allowing ingress traffic: <none> (Selected pods are isolated for ingress connectivity) Allowing egress traffic: <none> (Selected pods are isolated for egress connectivity) Policy Types: Ingress, Egress# ^^ Denies ALL ingress — checkout pods blocked
# Step 3: Check if there's an allow rule for the specific traffic$ kubectl get networkpolicy allow-monitoring -n payments -o yamlspec: podSelector: matchLabels: app: payment-svc ingress: - from: - namespaceSelector: matchLabels: name: monitoring # <-- only monitoring namespace allowed ports: - port: 8080# ^^ Only monitoring can reach payment-svc. Checkout is NOT allowed.
# Step 4: Check egress rules in source namespace$ kubectl describe networkpolicy default-deny-all -n checkout# If egress is also denied, the checkout pod cannot make ANY outbound connections# Unless there are specific egress allow rules
# Step 5: Verify namespace labels (required for namespaceSelector)$ kubectl get namespace payments --show-labelsNAME STATUS AGE LABELSpayments Active 90d kubernetes.io/metadata.name=payments,name=payments,team=payments
$ kubectl get namespace checkout --show-labelsNAME STATUS AGE LABELScheckout Active 90d kubernetes.io/metadata.name=checkout,team=checkout# ^^ checkout namespace has label "team=checkout" — this is what we need to match
# Step 6: Test with a debug pod in the same namespace as the destination$ kubectl run debug -n payments --rm -it --image=busybox -- wget -qO- http://payment-svc:8080/healthzOK # <-- Works from within the namespace, confirming NetworkPolicy is the issueRoot Cause
Section titled “Root Cause”- Default deny without matching allow — default-deny-all blocks everything, no ingress rule allows checkout namespace
- Missing egress rule in source namespace — checkout namespace also has default deny on egress
- Namespace labels missing — namespaceSelector in the allow rule references a label that doesn’t exist on the source namespace
- Port not specified in allow rule — ingress rule allows the namespace but not the specific port
- CNI doesn’t support NetworkPolicy — some CNIs (e.g., Flannel) don’t enforce NetworkPolicies
Fix + Prevention
Section titled “Fix + Prevention”# Create an ingress allow rule for checkout -> paymentscat <<EOF | kubectl apply -f -apiVersion: networking.k8s.io/v1kind: NetworkPolicymetadata: name: allow-checkout-to-payment namespace: paymentsspec: podSelector: matchLabels: app: payment-svc policyTypes: - Ingress ingress: - from: - namespaceSelector: matchLabels: team: checkout podSelector: matchLabels: app: checkout-svc ports: - port: 8080 protocol: TCPEOF
# Also ensure checkout namespace has egress allowed to paymentscat <<EOF | kubectl apply -f -apiVersion: networking.k8s.io/v1kind: NetworkPolicymetadata: name: allow-egress-to-payments namespace: checkoutspec: podSelector: matchLabels: app: checkout-svc policyTypes: - Egress egress: - to: - namespaceSelector: matchLabels: team: payments ports: - port: 8080 protocol: TCPEOF
# Verify$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- \ http://payment-svc.payments.svc.cluster.local:8080/api/charge{"status": "ok"}Prevention:
- Define NetworkPolicies in GitOps alongside the namespace provisioning
- Create a “service dependency map” — which namespaces talk to which
- Include DNS egress rule in every default-deny policy template
- Use Cilium’s hubble observe for real-time flow visibility
- Test NetworkPolicies in staging before applying to production
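The "DNS egress in every default-deny" bullet as a template sketch, so a newly provisioned namespace is isolated but can still resolve names:

```yaml
# Template sketch: default deny both directions, but keep kube-dns
# reachable so pods can still resolve service names.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-with-dns
  namespace: checkout        # stamped into each tenant namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
```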
Scenario 15: ArgoCD Sync Failures
Section titled “Scenario 15: ArgoCD Sync Failures”Symptoms
Section titled “Symptoms”ArgoCD Application shows OutOfSync or SyncFailed. Resources are not being deployed.
# ArgoCD CLI$ argocd app get payments --refreshName: paymentsProject: tenant-paymentsServer: https://kubernetes.default.svcNamespace: paymentsStatus: OutOfSyncHealth: DegradedSync Status: SyncFailedMessage: ComparisonError: failed to sync: one or more objects failed to apply: admission webhook "validate.kyverno.svc-fail" denied the request: resource violated policy require-labelsDebug Commands
Section titled “Debug Commands”# Step 1: Check sync status and errors$ argocd app get payments --show-operationOperation: SyncSync Revision: abc123def456Phase: FailedMessage: one or more objects failed to apply
STEP RESOURCE RESULT MESSAGE1 Namespace/payments Synced namespace/payments configured2 Deployment/payment-api Failed admission webhook denied: missing label "team"3 Service/payment-api Skipped dependent resource failed
# Step 2: Check ArgoCD application logs$ argocd app logs payments --tail=20time="2026-03-15T02:00:00Z" level=error msg="ComparisonError" application=payments error="failed to compute diff: CRD certificates.cert-manager.io not found in cluster"
# Step 3: Check if there's a resource hook ordering issue$ kubectl get applications.argoproj.io payments -n argocd \ -o jsonpath='{.status.operationState.syncResult.resources}' | jq .[ {"kind":"CustomResourceDefinition","status":"SyncFailed", "message":"resource mapping not found for name: certificate"}]# ^^ CRD must be applied before the CR that uses it
# Step 4: Check if it's a drift / server-side apply conflict$ argocd app diff payments--- live+++ desired@@ -10,6 +10,7 @@ labels: app: payment-api+ team: payments # <-- this label is in Git but not in cluster version: v2
# Step 5: Check ArgoCD controller and repo-server$ kubectl get pods -n argocdNAME READY STATUS RESTARTS AGEargocd-application-controller-0 1/1 Running 0 7dargocd-repo-server-5d8e9f0-a1b2c 1/1 Running 0 7dargocd-server-7d8e9f0-c3d4e 1/1 Running 0 7d
$ kubectl logs argocd-repo-server-5d8e9f0-a1b2c -n argocd --tail=20time="2026-03-15T02:00:00Z" level=error msg="failed to generate manifest" error="helm template failed: Error: chart 'payment-api' version '2.3.1' not found in repository 'https://charts.internal.bank.com'"
# Step 6: Check repo connectivity$ argocd repo listTYPE NAME REPO STATUS MESSAGEgit infra git@github.com:bank/infra-manifests.git Successfulhelm charts https://charts.internal.bank.com Failed connection refusedRoot Cause
Section titled “Root Cause”- Admission webhook rejection — Kyverno/OPA/Gatekeeper policy denies the resource (missing labels, wrong image registry)
- CRD ordering — Custom Resources applied before their CRDs exist (cert-manager Certificate before CRD)
- Helm chart not found — internal Helm repo is down or chart version doesn’t exist
- Server-side apply conflict — field managed by another controller (e.g., HPA manages replicas, ArgoCD tries to set them too)
- Resource quota exceeded — namespace quota prevents creating new resources
- RBAC — ArgoCD service account doesn’t have permissions to create the resource
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Admission webhook — add the required label in Git# Edit the Deployment manifest in Git and commit:# metadata:# labels:# team: payments # <-- add this
# Fix 2: CRD ordering — use sync waves# On the CRD:# annotations:# argocd.argoproj.io/sync-wave: "-1" # Apply CRDs first# On the CR:# annotations:# argocd.argoproj.io/sync-wave: "1" # Apply after CRDs
# Fix 3: Server-side apply conflict with HPA — ignore replicas diff# In ArgoCD Application spec:# ignoreDifferences:# - group: apps# kind: Deployment# jsonPointers:# - /spec/replicas
# Fix 4: Retry syncargocd app sync payments --retry-limit 3
# Fix 5: Force sync (overwrite cluster state with Git)argocd app sync payments --force
# Fix 6: Check RBACkubectl auth can-i create deployments --as=system:serviceaccount:argocd:argocd-application-controller -n paymentsPrevention:
- Use sync waves for all CRD + CR combinations
- Configure ignoreDifferences for fields managed by controllers (HPA replicas, mutating webhooks)
- Test manifests against admission policies in CI before pushing to Git
- Set up ArgoCD notifications (Slack/Teams) for sync failures
- Use ArgoCD ApplicationSets for consistent configuration across tenant apps
- Monitor argocd_app_sync_status{sync_status="OutOfSync"} and argocd_app_health_status{health_status!="Healthy"}
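Several of these bullets land in the Application spec itself. A sketch, reusing the repo from Step 6 above (path and sync options are illustrative):

```yaml
# Application sketch: ignoreDifferences for the HPA-managed replicas
# field, plus bounded retry on failed syncs.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: tenant-payments
  source:
    repoURL: git@github.com:bank/infra-manifests.git
    path: tenants/payments        # illustrative path
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas          # HPA owns this field
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3
```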
Scenario 16: Karpenter Not Provisioning Nodes
Section titled “Scenario 16: Karpenter Not Provisioning Nodes”Symptoms
Section titled “Symptoms”Pods are stuck in Pending but Karpenter is not launching new nodes, even though it should.
$ kubectl get pods -n batch-processingNAME READY STATUS RESTARTS AGEbatch-job-abc123 0/1 Pending 0 20mbatch-job-def456 0/1 Pending 0 20m
$ kubectl get nodesNAME STATUS ROLES AGE VERSIONip-10-1-1-100.ec2.internal Ready <none> 7d v1.29.1# Only 1 node, no new nodes being provisionedDebug Commands
Section titled “Debug Commands”# Step 1: Check Karpenter controller logs$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=302026-03-15T02:00:00.000Z ERROR controller.provisioner Could not schedule pod, incompatible with provisioner "default": no instance type satisfied resources {"cpu":"16","memory":"64Gi"} and target NodePool requirements [{key: "karpenter.sh/capacity-type", operator: In, values: [spot]}]
# Step 2: Check NodePool configuration$ kubectl get nodepoolsNAME NODECLASS NODES READY AGEdefault default 1 1 30d
$ kubectl describe nodepool defaultSpec: Template: Spec: Requirements: - Key: karpenter.sh/capacity-type Operator: In Values: ["spot"] - Key: node.kubernetes.io/instance-type Operator: In Values: ["m5.xlarge", "m5.2xlarge"] - Key: topology.kubernetes.io/zone Operator: In Values: ["eu-west-1a"] Limits: Cpu: "32" # <-- cluster limit: 32 vCPUs total Memory: "128Gi" Disruption: ConsolidationPolicy: WhenUnderutilized
# Step 3: Check current usage against limits$ kubectl get nodepool default -o jsonpath='{.status}' | jq .{ "resources": { "cpu": "28", # <-- 28 of 32 used, only 4 vCPU remaining "memory": "112Gi" }}# ^^ Pods need 16 CPU but only 4 available in the NodePool limit
# Step 4: Check EC2NodeClass (subnet, security group, AMI)$ kubectl get ec2nodeclassesNAME AGEdefault 30d
$ kubectl describe ec2nodeclass defaultSpec: Subnet Selector: karpenter.sh/discovery: prod-cluster Security Group Selector: karpenter.sh/discovery: prod-cluster AMI Family: AL2Status: Subnets: - ID: subnet-abc123 Zone: eu-west-1a Security Groups: - ID: sg-abc123 AMIs: - ID: ami-abc123 Name: amazon-eks-node-1.29-v20260301
# Step 5: Check for EC2 capacity issues$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=50 | grep -i "insufficient"InsufficientInstanceCapacity: We currently do not have sufficient m5.2xlarge capacity in the Availability Zone eu-west-1a
# Step 6: Check if the pod has nodeSelector or affinity that conflicts$ kubectl get pod batch-job-abc123 -n batch-processing \ -o jsonpath='{.spec.nodeSelector}'{"kubernetes.io/arch":"arm64"}# ^^ Pod requires ARM but NodePool only allows x86 instance typesRoot Cause
Section titled “Root Cause”- NodePool CPU/memory limits reached — Karpenter respects the limits on NodePools
- Instance type constraints too narrow — only allowing 2 instance types, and those are unavailable in the AZ
- Spot capacity unavailable — requesting spot-only but no spot capacity for the selected instance types
- AZ restriction — only allowing one AZ, which has capacity issues
- Architecture mismatch — pod requires ARM (arm64) but NodePool only provisions x86 instances
- Subnet capacity — subnet has no available IP addresses
- IAM permissions — Karpenter node role cannot launch EC2 instances
Fix + Prevention
Section titled “Fix + Prevention”# Fix 1: Increase NodePool limitskubectl patch nodepool default --type='merge' -p '{ "spec": {"limits": {"cpu": "64", "memory": "256Gi"}}}'
# Fix 2: Broaden instance type selectionkubectl patch nodepool default --type='json' -p='[{ "op": "replace", "path": "/spec/template/spec/requirements/1", "value": { "key": "node.kubernetes.io/instance-type", "operator": "In", "values": ["m5.xlarge","m5.2xlarge","m5a.xlarge","m5a.2xlarge", "m6i.xlarge","m6i.2xlarge","c5.xlarge","c5.2xlarge"] }}]'
# Fix 3: Allow on-demand fallback (not spot-only)# In NodePool requirements:# - key: karpenter.sh/capacity-type# operator: In# values: ["spot", "on-demand"]
# Fix 4: Allow multiple AZs# In NodePool requirements:# - key: topology.kubernetes.io/zone# operator: In# values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
# Fix 5: For ARM pods, add ARM instance types# In NodePool:# - key: kubernetes.io/arch# operator: In# values: ["amd64", "arm64"]# - key: node.kubernetes.io/instance-type# operator: In# values: ["m6g.xlarge", "m6g.2xlarge"] # Graviton
# Verify — watch Karpenter provision a node$ kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f2026-03-15T02:30:00.000Z INFO controller.provisioner Computed 1 new node(s) will fit 2 pod(s)2026-03-15T02:30:05.000Z INFO controller.provisioner Launched node: ip-10-1-1-103.ec2.internal, type: m6i.2xlarge, zone: eu-west-1b, capacity-type: on-demandPrevention:
- Set NodePool limits with 50% headroom above normal peak
- Use at least 15 instance types (Karpenter picks the cheapest available)
- Allow both spot and on-demand with spot preference
- Monitor karpenter_pods_state{state="pending"} and karpenter_nodepool_usage vs limits
- Alert when NodePool usage > 80% of limits
- Review Karpenter logs weekly for InsufficientInstanceCapacity patterns
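Put together, a NodePool following these bullets might look like this sketch (karpenter.sh/v1beta1 API assumed; all values illustrative):

```yaml
# NodePool sketch: broad instance selection, spot with on-demand
# fallback, multiple AZs, and limits with headroom above normal peak.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]   # broad pool, Karpenter picks cheapest
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  limits:
    cpu: "96"            # ~50% above normal peak usage
    memory: 384Gi
```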
Scenario 17: Cross-Namespace Service Communication Failing
Section titled “Scenario 17: Cross-Namespace Service Communication Failing”Symptoms
Section titled “Symptoms”Service in namespace A cannot reach a service in namespace B, even though both services are running and healthy within their own namespaces.
# From checkout namespace, trying to call payment service in payments namespace$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \ http://payment-svc:8080/api/chargewget: bad address 'payment-svc:8080' # <-- DNS cannot resolve
# Or using FQDN:$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- --timeout=5 \ http://payment-svc.payments.svc.cluster.local:8080/api/chargewget: download timed out # <-- DNS resolves but connection blockedDebug Commands
```bash
# Step 1: Verify DNS resolution across namespaces
$ kubectl exec -it checkout-abc123 -n checkout -- nslookup payment-svc.payments.svc.cluster.local
Server:    10.100.0.10
Address:   10.100.0.10:53

Name:      payment-svc.payments.svc.cluster.local
Address:   10.100.23.45
# ^^ DNS works. Problem is not DNS.
```
```bash
# Step 2: Check if the service has endpoints
$ kubectl get endpoints payment-svc -n payments
NAME          ENDPOINTS        AGE
payment-svc   10.1.2.34:8080   30d
# ^^ Endpoints exist
```
```bash
# Step 3: Check NetworkPolicies (most likely cause)
$ kubectl get networkpolicy -n payments
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         30d
# ^^ Default deny blocks all ingress to payments namespace

$ kubectl get networkpolicy -n checkout
NAME               POD-SELECTOR   AGE
default-deny-all   <none>         30d
allow-egress-dns   <none>         30d
# ^^ Default deny blocks all egress from checkout namespace (except DNS)
```
```bash
# Step 4: Check if FQDN is required (short name only works within same namespace)
$ kubectl exec -it checkout-abc123 -n checkout -- cat /etc/resolv.conf
nameserver 10.100.0.10
search checkout.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
# With ndots:5, "payment-svc" resolves as:
#   payment-svc.checkout.svc.cluster.local  → NXDOMAIN (not in checkout ns)
#   payment-svc.svc.cluster.local           → NXDOMAIN
#   payment-svc.cluster.local               → NXDOMAIN
#   payment-svc                             → NXDOMAIN
# MUST use: payment-svc.payments.svc.cluster.local
```
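The search-path behaviour above can be reproduced offline. This is a small POSIX-shell sketch of how `ndots:5` orders the lookup candidates — the name and search list mirror the resolv.conf shown, nothing here touches a cluster:

```shell
# Emulate glibc/musl resolver expansion: a name with fewer dots than
# ndots is tried against each search domain first, then as-is.
name="payment-svc"
ndots=5
search="checkout.svc.cluster.local svc.cluster.local cluster.local"

dots=$(printf '%s' "$name" | awk -F. '{print NF-1}')
candidates=""
if [ "$dots" -lt "$ndots" ]; then
  for d in $search; do
    candidates="$candidates $name.$d"
  done
fi
candidates="$candidates $name"

# Note: none of the candidates land in payments.svc.cluster.local,
# which is exactly why every lookup returns NXDOMAIN.
for c in $candidates; do echo "try: $c"; done
```

Run it with a fully qualified name (five or more dots) and the search loop is skipped entirely, which is why the FQDN works from any namespace.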
```bash
# Step 5: Test connectivity with NetworkPolicy temporarily removed (STAGING ONLY)
# Note: --dry-run=client only previews the deletion; to actually test,
# run the delete for real (and restore the policy afterwards):
$ kubectl delete networkpolicy default-deny-all -n payments --dry-run=client
$ kubectl delete networkpolicy default-deny-all -n payments
# If removing the policy fixes it, NetworkPolicy is confirmed as the cause
```
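A less disruptive variant of Step 5: since NetworkPolicies are additive allow-lists, you can layer a temporary allow-all on top instead of deleting the deny policy, then delete only the temporary object after testing. A sketch (the name `tmp-allow-all-ingress` is hypothetical):

```yaml
# TEMPORARY — delete after testing. Opens all ingress to payment-svc pods
# without touching the existing default-deny-all object.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tmp-allow-all-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-svc
  policyTypes:
  - Ingress
  ingress:
  - {}   # empty rule = allow from all sources
```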
```bash
# Step 6: Check for Cilium/Calico-specific network policies
$ kubectl get ciliumnetworkpolicy -n payments 2>/dev/null
NAME                    AGE
cilium-default-deny     30d
cilium-allow-internal   30d

$ kubectl describe ciliumnetworkpolicy cilium-allow-internal -n payments
Spec:
  Endpoint Selector:
    Match Labels:
      app: payment-svc
  Ingress:
  - From Endpoints:
    - Match Labels:
        io.kubernetes.pod.namespace: payments   # <-- only same namespace
```

Root Cause
- Using short service name — `payment-svc` only resolves within the same namespace; must use `payment-svc.payments.svc.cluster.local`
- NetworkPolicy blocking cross-namespace ingress — default deny in the destination namespace without an allow rule for the source namespace (most common in enterprise setups)
- NetworkPolicy blocking egress — default deny in the source namespace blocks outbound connections
- Cilium/Calico-specific policies — CRD-based policies more restrictive than K8s-native NetworkPolicy
- Service type mismatch — ExternalName service pointing to wrong FQDN
- Istio AuthorizationPolicy — service mesh deny-by-default blocking cross-namespace traffic
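The root causes above sort into a simple triage order — DNS first, endpoints second, policy last. A hypothetical pure-shell sketch of that decision, where the inputs are the yes/no answers from Steps 1–2 (nothing here touches the cluster):

```shell
# Given the answers from Steps 1-2, print the most likely root cause bucket.
diagnose() {
  dns_ok=$1        # did the FQDN resolve? (Step 1)
  endpoints_ok=$2  # does the service have endpoints? (Step 2)
  if [ "$dns_ok" = "no" ]; then
    echo "Short name or CoreDNS issue: switch to the FQDN and re-test"
  elif [ "$endpoints_ok" = "no" ]; then
    echo "Service has no backends: check selector and pod readiness"
  else
    echo "Connectivity blocked: check NetworkPolicy / CNI CRDs / mesh policy"
  fi
}

diagnose yes yes   # DNS resolves, endpoints exist, connection still times out
```

In the scenario above both answers are "yes", which points straight at the policy layer (Steps 3–6) rather than at DNS or the Service.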
Fix + Prevention
```bash
# Fix 1: Application must use FQDN for cross-namespace calls
# In app config or env:
#   PAYMENT_SERVICE_URL=http://payment-svc.payments.svc.cluster.local:8080
```
```bash
# Fix 2: Create NetworkPolicy allowing cross-namespace traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-checkout-ingress
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-svc
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: checkout
      podSelector:
        matchLabels:
          app: checkout-svc
    ports:
    - port: 8080
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payments-egress
  namespace: checkout
spec:
  podSelector:
    matchLabels:
      app: checkout-svc
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          team: payments
    ports:
    - port: 8080
      protocol: TCP
EOF
```
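One silent failure mode with Fix 2: `namespaceSelector` matches namespace *labels*, not names, so the `checkout` and `payments` namespaces must actually carry the `team` label or the policy matches nothing. A hypothetical manifest:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: checkout
  labels:
    team: checkout   # must match the namespaceSelector in allow-checkout-ingress
```

On recent clusters every namespace also gets the automatic `kubernetes.io/metadata.name` label, which can be used in selectors to avoid maintaining custom labels at all.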
```bash
# Fix 3: For Istio, create AuthorizationPolicy
cat <<EOF | kubectl apply -f -
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-svc
  rules:
  - from:
    - source:
        namespaces: ["checkout"]
        principals: ["cluster.local/ns/checkout/sa/checkout-sa"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/api/charge"]
EOF
```
```bash
# Verify
$ kubectl exec -it checkout-abc123 -n checkout -- wget -qO- \
    http://payment-svc.payments.svc.cluster.local:8080/api/charge
{"status": "ok", "transaction_id": "txn-789"}
```

Prevention:
- Standardize on FQDN for all cross-namespace service calls in golden path configs
- Maintain a service dependency matrix as part of namespace provisioning
- Create NetworkPolicy templates that are applied alongside namespace creation
- Use Cilium Hubble or Calico flow logs to visualize cross-namespace traffic patterns
- Document allowed communication paths in a “service mesh” diagram per tenant
Cross-Namespace Communication Checklist:

```
+-----------------------------------------------+
| 1. DNS: Use FQDN (svc.ns.svc.cluster.local)   |
| 2. Ingress NetPol: Allow source namespace     |
| 3. Egress NetPol: Allow destination namespace |
| 4. Istio AuthZ: Allow source principal        |
| 5. Endpoints: Verify svc has backends         |
| 6. Ports: Match in all policies               |
+-----------------------------------------------+
```

Quick Reference: Scenario to Command Map
```
+----+----------------------------+------------------------------------------------------------------+
| #  | Scenario                   | First Command to Run                                             |
+----+----------------------------+------------------------------------------------------------------+
| 1  | Pod Pending                | kubectl describe pod <pod> -n <ns>                               |
| 2  | CrashLoopBackOff           | kubectl logs <pod> --previous -n <ns>                            |
| 3  | ImagePullBackOff           | kubectl describe pod <pod> -n <ns>                               |
| 4  | No traffic to pod          | kubectl get endpoints <svc> -n <ns>                              |
| 5  | DNS failures               | kubectl get pods -n kube-system -l k8s-app=kube-dns              |
| 6  | Node NotReady              | kubectl describe node <node>                                     |
| 7  | PVC Pending                | kubectl describe pvc <pvc> -n <ns>                               |
| 8  | Ingress not routing        | kubectl describe ingress <name> -n <ns>                          |
| 9  | IRSA/WI not working        | kubectl get sa <sa> -n <ns> -o yaml                              |
| 10 | OOMKilled                  | kubectl describe pod <pod> -n <ns>                               |
| 11 | Pod Terminating            | kubectl get pod <pod> -o jsonpath='{.metadata.finalizers}'       |
| 12 | HPA not scaling            | kubectl describe hpa <hpa> -n <ns>                               |
| 13 | Cert expiry                | kubectl describe certificate <cert> -n <ns>                      |
| 14 | NetworkPolicy blocking     | kubectl get networkpolicy -n <ns>                                |
| 15 | ArgoCD sync failure        | argocd app get <app> --show-operation                            |
| 16 | Karpenter not provisioning | kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter  |
| 17 | Cross-ns comm failure      | kubectl get networkpolicy -n <dest-ns>                           |
+----+----------------------------+------------------------------------------------------------------+
```

Prometheus Alerts for All 17 Scenarios
These are the alerts you should have configured as the platform team:
```yaml
# Alert rules covering all 17 scenarios
groups:
- name: k8s-troubleshooting-alerts
  rules:
  # Scenario 1: Pod Pending > 5 min
  - alert: PodStuckPending
    expr: kube_pod_status_phase{phase="Pending"} == 1
    for: 5m

  # Scenario 2: CrashLoopBackOff
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
    for: 5m

  # Scenario 3: ImagePullBackOff
  - alert: ImagePullFailure
    expr: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} == 1
    for: 5m

  # Scenario 4: Service with no endpoints
  - alert: ServiceNoEndpoints
    expr: kube_endpoint_address_available == 0
    for: 5m

  # Scenario 5: CoreDNS down
  - alert: CoreDNSDown
    expr: up{job="coredns"} == 0
    for: 2m

  # Scenario 6: Node NotReady
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 3m

  # Scenario 7: PVC Pending > 5 min
  - alert: PVCStuckPending
    expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
    for: 5m

  # Scenario 10: OOMKilled
  - alert: ContainerOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

  # Scenario 12: HPA at max replicas
  - alert: HPAAtMaxReplicas
    expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
    for: 10m

  # Scenario 13: Certificate expiring in 14 days
  - alert: CertificateExpiringSoon
    expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14*24*3600
```