kubectl Debug Cheatsheet
Pod Failures
Pod Not Starting

# Check pod status and events
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>

# Common statuses and what they mean:
# Pending → scheduling issue (resources, node selector, affinity)
# ContainerCreating → image pull or volume mount issue
# CrashLoopBackOff → container starts then crashes (check logs)
# ImagePullBackOff → can't pull image (wrong name, no auth, registry down)
# Init:Error → init container failed
# Init:CrashLoopBackOff → init container keeps crashing

# Check container logs (current)
kubectl logs <pod> -n <ns> -c <container>

# Check container logs (previous crash)
kubectl logs <pod> -n <ns> -c <container> --previous

# Check init container logs
kubectl logs <pod> -n <ns> -c <init-container-name>

# Check events (sorted by time)
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Check events for a specific pod
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>

Pod CrashLoopBackOff
# Get the exit code
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"
# Exit code 0 = container exited normally (check if command is wrong)
# Exit code 1 = application error (check logs)
# Exit code 137 = SIGKILL (128+9): usually OOMKilled (memory limit too low), or killed after the grace period expired
# Exit code 139 = SIGSEGV (128+11): segfault
# Exit code 143 = SIGTERM (128+15): process was terminated by the signal instead of shutting down cleanly
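
The exit codes above follow the usual shell convention that a value over 128 means "128 plus the signal number". As a minimal sketch, the hypothetical helper below (`explain_exit_code` is an illustrative name, not a kubectl feature) decodes a code; it only names the three signals listed above and falls back to the bare number otherwise.

```shell
# explain_exit_code CODE
# Prints a one-line interpretation of a container exit code.
# Codes from 129 to 255 are decoded as 128 + signal number.
explain_exit_code() {
  code=$1
  case "$code" in
    ''|*[!0-9]*)
      echo "not a numeric exit code: $code"
      return 1
      ;;
  esac
  if [ "$code" -eq 0 ]; then
    echo "0: exited normally"
  elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal $sig" ;;
    esac
    echo "$code: killed by $name"
  else
    echo "$code: non-zero exit (check logs)"
  fi
}
```

One possible way to feed it: read the code with `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'` and pass the result as the argument.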

# Check if OOMKilled
kubectl describe pod <pod> -n <ns> | grep -i oom
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState}'

# Check resource limits vs actual usage
kubectl top pod <pod> -n <ns> --containers

# Live debug with ephemeral container (K8s 1.25+)
kubectl debug <pod> -n <ns> -it --image=busybox --target=<container>

Pod Stuck in Pending
# Check why scheduler can't place the pod
kubectl describe pod <pod> -n <ns> | grep -A10 Events

# Common reasons:
# "Insufficient cpu" → node doesn't have enough CPU
# "Insufficient memory" → node doesn't have enough memory
# "node(s) had taint" → missing toleration
# "node selector" → no nodes match nodeSelector
# "unbound PVC" → PVC can't bind to a PV
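
The reason list above can be turned into a small lookup. This is a sketch only: `pending_hint` is a hypothetical helper, the substrings it matches are assumptions about how scheduler messages are typically worded, and the exact wording varies between Kubernetes versions. When a message lists several reasons, only the first matching branch is reported.

```shell
# pending_hint MESSAGE
# Maps a scheduling event message to a suggested next check.
# Matching is by substring; the first matching pattern wins.
pending_hint() {
  case "$1" in
    *"Insufficient cpu"*)
      echo "cpu: compare pod requests with free node CPU" ;;
    *"Insufficient memory"*)
      echo "memory: compare pod requests with free node memory" ;;
    *taint*)
      echo "taint: add a toleration or remove the node taint" ;;
    *"node affinity"*|*"node selector"*)
      echo "selector: no node matches nodeSelector or affinity" ;;
    *PersistentVolumeClaim*|*PVC*)
      echo "pvc: check that the PVC can bind to a PV" ;;
    *)
      echo "unknown: read the full event message" ;;
  esac
}
```

For example, `pending_hint "0/3 nodes are available: 3 Insufficient cpu."` selects the CPU branch.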

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl top nodes

# Check if node has taints
kubectl describe node <node> | grep Taints

# Check ResourceQuota
kubectl get resourcequota -n <ns>
kubectl describe resourcequota -n <ns>

# Check LimitRange
kubectl get limitrange -n <ns>
kubectl describe limitrange -n <ns>

Pod Stuck in Terminating
# Check for finalizers blocking deletion
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'

# Check if pod has a long terminationGracePeriodSeconds
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.terminationGracePeriodSeconds}'

# Force delete (use cautiously)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force

# Check if node is unreachable (pod stuck on a dead node)
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

Node Issues
# List all nodes with status
kubectl get nodes -o wide

# Check node conditions
kubectl describe node <node> | grep -A20 Conditions
# Ready=False → kubelet is down or node is unhealthy
# MemoryPressure=True → node running out of memory
# DiskPressure=True → node running out of disk
# PIDPressure=True → too many processes
# NetworkUnavailable → CNI plugin issue
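
On a larger cluster it can help to narrow the condition check to the nodes that are not Ready. The sketch below assumes the default column layout of `kubectl get nodes --no-headers` (NAME first, STATUS second); `not_ready_nodes` is a hypothetical helper name.

```shell
# not_ready_nodes
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# name of every node whose STATUS column does not start with "Ready".
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not printed.
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}
```

Usage: `kubectl get nodes --no-headers | not_ready_nodes`, then run the describe command above on each name it prints.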

# Check node resource usage
kubectl top nodes
kubectl describe node <node> | grep -A10 "Allocated resources"

# Check node taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'

# Check pods on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node> -o wide

# Cordon a node (prevent new scheduling)
kubectl cordon <node>

# Drain a node (evict pods safely)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Uncordon after maintenance
kubectl uncordon <node>

# Check kubelet logs (SSH to node first); shows the last 100 lines
journalctl -u kubelet --no-pager -n 100

Networking
Service Connectivity

# Check service exists and has endpoints
kubectl get svc <service> -n <ns>
kubectl get endpoints <service> -n <ns>
# If endpoints is empty → no pods match the service selector, or the matching pods are not Ready

# Check service selector matches pod labels
kubectl get svc <service> -n <ns> -o jsonpath='{.spec.selector}'
kubectl get pods -n <ns> -l <key>=<value>
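
Copying the selector by hand into `-l <key>=<value>` is error-prone when it has several keys. The sketch below converts the selector into a label-selector string. It assumes a kubectl version that prints the jsonpath map as JSON (for example `{"app":"web"}`), and it relies on label keys and values never containing quotes, colons, or commas. `selector_to_labels` is a hypothetical helper name.

```shell
# selector_to_labels
# Reads a flat JSON object such as {"app":"web","tier":"frontend"} on stdin
# and prints it as a label selector: app=web,tier=frontend
selector_to_labels() {
  tr -d '{}" \n' | tr ':' '='
}
```

Usage: `kubectl get pods -n <ns> -l "$(kubectl get svc <service> -n <ns> -o jsonpath='{.spec.selector}' | selector_to_labels)"`. An empty result list means the selector and the pod labels disagree.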

# Test DNS resolution from within a pod
kubectl run debug --rm -it --image=busybox --restart=Never -- nslookup <service>.<ns>.svc.cluster.local
kubectl run debug --rm -it --image=busybox --restart=Never -- nslookup <service>

# Test connectivity from within a pod
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- curl -v http://<service>.<ns>:<port>/health

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test external DNS resolution
kubectl run debug --rm -it --image=busybox --restart=Never -- nslookup google.com

Network Policies
# List network policies in a namespace
kubectl get networkpolicy -n <ns>

# Describe network policy rules
kubectl describe networkpolicy <policy> -n <ns>

# Check if a default-deny policy exists
kubectl get networkpolicy -n <ns> -o yaml | grep -A5 "policyTypes"

# Test connectivity between pods
kubectl exec <pod-a> -n <ns> -- wget -qO- --timeout=5 http://<pod-b-ip>:<port>
kubectl exec <pod-a> -n <ns> -- nc -zv <pod-b-ip> <port>

Ingress / Gateway API
# Check ingress resources
kubectl get ingress -n <ns>
kubectl describe ingress <ingress> -n <ns>

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Check if external LB was created
kubectl get svc -n ingress-nginx
# EXTERNAL-IP should not be <pending>

# Check Gateway API resources
kubectl get gateway -A
kubectl get httproute -A
kubectl describe httproute <route> -n <ns>

# Check AWS Load Balancer Controller (EKS), formerly the ALB Ingress Controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller

# Check GKE Gateway Controller
kubectl logs -n gke-managed-system -l app=gke-gateway-controller

Storage
# Check PVC status
kubectl get pvc -n <ns>
# Pending → StorageClass issue, no available PV, or AZ mismatch
# Bound → healthy

# Check PV
kubectl get pv
kubectl describe pv <pv-name>

# Check StorageClass
kubectl get storageclass
kubectl describe storageclass <sc>

# Check if CSI driver is installed
kubectl get csidrivers

# Debug PVC stuck in Pending
kubectl describe pvc <pvc> -n <ns>
# "waiting for first consumer" → WaitForFirstConsumer binding mode (normal)
# "no persistent volumes available" → need to create PV or check StorageClass
# "exceeded quota" → ResourceQuota limit reached

# Check volume attachments
kubectl get volumeattachments

# Check disk usage inside a pod
kubectl exec <pod> -n <ns> -- df -h

# Force detach a stuck volume (careful!)
kubectl delete volumeattachment <name>

# Check if a user/SA can perform an action
kubectl auth can-i create pods -n <ns> --as=system:serviceaccount:<ns>:<sa>
kubectl auth can-i get secrets -n <ns> --as=user@example.com
kubectl auth can-i '*' '*' --as=system:serviceaccount:kube-system:admin  # cluster-admin check

# List all roles and bindings in a namespace
kubectl get roles,rolebindings -n <ns>
kubectl get clusterroles,clusterrolebindings

# Describe a role to see its permissions
kubectl describe role <role> -n <ns>
kubectl describe clusterrole <clusterrole>

# Check who has what access
kubectl get rolebinding -n <ns> -o json | jq '.items[] | {name: .metadata.name, subjects: .subjects, role: .roleRef.name}'

# Check service account exists and has token
# (Kubernetes 1.24+ no longer auto-creates token Secrets, so an empty grep result can be normal)
kubectl get sa <sa> -n <ns>
kubectl get secrets -n <ns> | grep <sa>

# Debug IRSA (EKS): check SA annotation
kubectl get sa <sa> -n <ns> -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Debug Workload Identity (GKE): check SA annotation
kubectl get sa <sa> -n <ns> -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'

# Check API server logs for denied requests
# (works only where control-plane pods are visible; managed EKS/GKE control planes do not expose them)
kubectl logs -n kube-system -l component=kube-apiserver | grep "Forbidden"

Deployments & Rollouts
# Check deployment status
kubectl get deploy -n <ns>
kubectl describe deploy <deploy> -n <ns>

# Check rollout status
kubectl rollout status deploy/<deploy> -n <ns>

# View rollout history
kubectl rollout history deploy/<deploy> -n <ns>

# Rollback to previous version
kubectl rollout undo deploy/<deploy> -n <ns>

# Rollback to specific revision
kubectl rollout undo deploy/<deploy> -n <ns> --to-revision=3

# Check ReplicaSets (shows old and new)
kubectl get rs -n <ns> -l app=<app>

# Watch rolling update progress
kubectl get pods -n <ns> -l app=<app> -w

# Check HPA status
kubectl get hpa -n <ns>
kubectl describe hpa <hpa> -n <ns>
# "unable to get metrics" → metrics-server or Prometheus adapter issue

# Check PDB (Pod Disruption Budget)
kubectl get pdb -n <ns>
kubectl describe pdb <pdb> -n <ns>

Events & Cluster Health
# All events in a namespace (sorted by time)
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Warning events only
kubectl get events -n <ns> --field-selector type=Warning

# Cluster-wide events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50

# Check component status (deprecated but useful)
kubectl get componentstatuses

# Check API server health
kubectl get --raw='/healthz'
kubectl get --raw='/readyz'

# Check etcd health (if accessible)
kubectl get --raw='/healthz/etcd'

# Check cluster resource usage summary
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory | head -20
kubectl top pods --all-namespaces --sort-by=cpu | head -20