
Workloads — Deployments, StatefulSets, Jobs & CronJobs

In the enterprise bank architecture, the central platform team defines workload standards: base Deployment templates, Pod security standards, resource quotas per namespace, and approved workload patterns. Tenant teams deploy their applications using these patterns into their assigned namespaces.

Workload Types in a Banking Platform

Deployments are the most common workload resource. They manage stateless applications with declarative rolling updates and rollbacks.

Deployment and ReplicaSet Internal Architecture

The rolling update strategy controls how pods are replaced during an update.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 6
  revisionHistoryLimit: 5          # keep 5 old ReplicaSets for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2                  # max 2 extra pods during update (8 total)
      maxUnavailable: 1            # max 1 pod unavailable (min 5 running)
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
        version: v2.3.1
    spec:
      serviceAccountName: payments-api
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: 123456789012.dkr.ecr.me-south-1.amazonaws.com/payments-api:v2.3.1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /livez
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]   # allow LB to drain

Key parameters explained:

| Parameter | Behavior | Effect |
|---|---|---|
| maxSurge: 2 | Create up to 2 extra pods during update | Faster rollout; uses more resources temporarily |
| maxUnavailable: 1 | Allow at most 1 pod to be down | Maintains capacity during update |
| revisionHistoryLimit: 5 | Keep 5 old ReplicaSets | Enables kubectl rollout undo to any of the last 5 revisions |
| terminationGracePeriodSeconds: 60 | Pod gets 60s to shut down gracefully | Long enough for in-flight requests to complete |
| preStop: sleep 10 | Wait 10s before SIGTERM | Gives the load balancer time to remove the pod from its target group |
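These timing values compose into a simple budget: the kubelet runs the preStop hook first and only then sends SIGTERM, while the SIGKILL deadline is measured from the start of termination. A quick sketch with the values from the manifest above:

```shell
# Shutdown budget: preStop runs first, then SIGTERM; SIGKILL arrives
# terminationGracePeriodSeconds after termination begins.
GRACE=60      # terminationGracePeriodSeconds
PRESTOP=10    # preStop sleep
DRAIN_BUDGET=$((GRACE - PRESTOP))
echo "app has ${DRAIN_BUDGET}s after SIGTERM to drain before SIGKILL"
```

If the drain budget is smaller than the app's worst-case request duration, raise the grace period, not the preStop sleep.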

Rolling Update Sequence (6 replicas, maxSurge=2, maxUnavailable=1)

Step 1: Start
Running: [v1] [v1] [v1] [v1] [v1] [v1] = 6 pods (all v1)
Step 2: Create surge pods (maxSurge=2)
Running: [v1] [v1] [v1] [v1] [v1] [v1] [v2] [v2] = 8 pods
v2 pods start, waiting for readiness probe
Step 3: v2 pods pass readiness → terminate v1 pods (maxUnavailable=1)
Running: [v1] [v1] [v1] [v1] [v1] [v2] [v2] = 7 pods
(1 v1 pod terminating)
Step 4: Continue rolling
Running: [v1] [v1] [v1] [v1] [v2] [v2] [v2] = 7 pods
Step 5: Continue rolling
Running: [v1] [v1] [v1] [v2] [v2] [v2] [v2] = 7 pods
... continues until all v1 replaced ...
Step N: Complete
Running: [v2] [v2] [v2] [v2] [v2] [v2] = 6 pods (all v2)
Old ReplicaSet (v1) scaled to 0, kept for rollback
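The pod counts in each step follow directly from the strategy parameters; the capacity envelope can be computed as:

```shell
# Capacity envelope for the rollout above
REPLICAS=6
MAX_SURGE=2
MAX_UNAVAILABLE=1
MAX_PODS=$((REPLICAS + MAX_SURGE))        # ceiling during the update
MIN_READY=$((REPLICAS - MAX_UNAVAILABLE)) # floor during the update
echo "between ${MIN_READY} and ${MAX_PODS} pods at any point"
```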
# View rollout history
kubectl rollout history deployment/payments-api -n payments
# Rollback to previous version
kubectl rollout undo deployment/payments-api -n payments
# Rollback to specific revision
kubectl rollout undo deployment/payments-api -n payments --to-revision=3
# Check rollout status
kubectl rollout status deployment/payments-api -n payments

Under the hood, rollout undo simply scales the old ReplicaSet back up and the current one down. This is why revisionHistoryLimit matters — once an old ReplicaSet is garbage collected, you cannot roll back to it.


You almost never create a ReplicaSet directly. The Deployment controller manages them for you.

When you need to know about ReplicaSets:

  • Debugging: kubectl get rs -n payments shows all ReplicaSets, their desired/current/ready counts, and which Deployment owns them
  • Understanding rollbacks: each revision is a ReplicaSet
  • Understanding the pod-template-hash label: added by the Deployment controller to uniquely identify each ReplicaSet’s pods
# List ReplicaSets for a Deployment
kubectl get rs -n payments -l app=payments-api
# Output:
# NAME                       DESIRED   CURRENT   READY   AGE
# payments-api-7d8f9c6b45    6         6         6       2h    ← current (v2)
# payments-api-5c4d3e2f11    0         0         0       3d    ← previous (v1)
# payments-api-9a8b7c6d55    0         0         0       7d    ← older (v0)

StatefulSets are for workloads that need stable identity, ordered deployment, and persistent storage per pod. Think databases, message queues, and distributed systems.

| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod names | Random hash (api-7d8f9c-abc) | Sequential index (kafka-0, kafka-1, kafka-2) |
| DNS | Only via Service ClusterIP | Each pod gets a DNS entry via headless Service |
| Startup order | All pods created simultaneously | Pods created sequentially (0, then 1, then 2) |
| Termination order | Random order | Reverse order (2, then 1, then 0) |
| Storage | Shared PVC or none | Dedicated PVC per pod (volumeClaimTemplates) |
| Identity | Disposable — any pod can replace any other | Stable — pod-0 is always pod-0, bound to the same PVC |

StatefulSet Architecture (Kafka)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: fraud-detection
spec:
  serviceName: kafka-headless        # REQUIRED — headless Service name
  replicas: 3
  podManagementPolicy: Parallel      # or OrderedReady (default)
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # K8s 1.24+ (alpha feature gate MaxUnavailableStatefulSet)
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: topology.kubernetes.io/zone   # one broker per AZ
      terminationGracePeriodSeconds: 120
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.6.0
          ports:
            - containerPort: 9092
              name: client
            - containerPort: 9093
              name: inter-broker
          env:
            - name: POD_NAME                 # needed for $(POD_NAME) expansion below
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']   # pod-index label: K8s 1.28+
            - name: KAFKA_ADVERTISED_LISTENERS
              value: "PLAINTEXT://$(POD_NAME).kafka-headless.fraud-detection.svc.cluster.local:9092"
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
          readinessProbe:
            tcpSocket:
              port: 9092
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "kafka-server-stop && sleep 30"   # graceful Kafka shutdown
  volumeClaimTemplates:              # creates one PVC per pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted    # EKS: gp3 with KMS encryption
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: fraud-detection
spec:
  clusterIP: None                    # headless — no load balancing
  selector:
    app: kafka
  ports:
    - port: 9092
      name: client
    - port: 9093
      name: inter-broker
---
# Client-facing service (load balanced across all brokers)
apiVersion: v1
kind: Service
metadata:
  name: kafka
  namespace: fraud-detection
spec:
  selector:
    app: kafka
  ports:
    - port: 9092
      name: client

A headless Service (clusterIP: None) does not load-balance. Instead, DNS returns the individual pod IPs:

# Normal Service DNS:
nslookup kafka.fraud-detection.svc.cluster.local
# Returns: 172.20.45.10 (single ClusterIP)
# Headless Service DNS:
nslookup kafka-headless.fraud-detection.svc.cluster.local
# Returns: 10.0.1.22, 10.0.2.33, 10.0.3.44 (all pod IPs)
# Individual pod DNS (only works with headless + StatefulSet):
nslookup kafka-0.kafka-headless.fraud-detection.svc.cluster.local
# Returns: 10.0.1.22

This is how Kafka brokers, ZooKeeper nodes, and database replicas discover each other.
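Because the per-pod names are deterministic, a client can assemble a Kafka bootstrap string without any external discovery service. A sketch (variable names are illustrative):

```shell
# Assemble broker FQDNs from the StatefulSet naming convention:
# <set>-<ordinal>.<headless-svc>.<namespace>.svc.cluster.local
SET=kafka; SVC=kafka-headless; NS=fraud-detection; REPLICAS=3
BROKERS=""
i=0
while [ "$i" -lt "$REPLICAS" ]; do
  BROKERS="${BROKERS}${SET}-${i}.${SVC}.${NS}.svc.cluster.local:9092,"
  i=$((i + 1))
done
BROKERS=${BROKERS%,}   # strip trailing comma -> a bootstrap.servers string
echo "$BROKERS"
```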


A DaemonSet ensures a copy of a pod runs on every node (or on a subset of nodes matching a selector). It is used for node-level infrastructure.

| DaemonSet | Purpose | Namespace | Who Manages |
|---|---|---|---|
| datadog-agent | Metrics collection, APM | kube-system | Platform team |
| fluent-bit | Log forwarding to central logging | kube-system | Platform team |
| falco | Runtime security monitoring | kube-system | Security team |
| aws-node (VPC CNI) | Pod networking on EKS | kube-system | EKS add-on |
| kube-proxy | Service routing | kube-system | EKS/GKE add-on |
| ebs-csi-node | EBS volume attachment | kube-system | EKS add-on |
| node-problem-detector | Hardware/OS issue detection | kube-system | Platform team |
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: kube-system
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # update one node at a time
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - operator: Exists         # run on ALL nodes, including tainted ones
      priorityClassName: system-node-critical   # survive eviction
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config

Use nodeSelector or nodeAffinity to run DaemonSets only on specific nodes:

spec:
  template:
    spec:
      nodeSelector:
        workload-type: gpu         # only GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

A Job runs one-time or batch work. The pod runs to completion and stops.

apiVersion: batch/v1
kind: Job
metadata:
  name: year-end-report-2025
  namespace: payments
spec:
  completionMode: Indexed          # required for the completion-index annotation below
  completions: 100                 # total units of work
  parallelism: 10                  # run 10 pods at a time
  backoffLimit: 3                  # tolerate 3 pod failures before failing the Job
  activeDeadlineSeconds: 3600      # kill job after 1 hour
  ttlSecondsAfterFinished: 86400   # auto-delete job after 24 hours
  template:
    spec:
      restartPolicy: Never         # required for Jobs (Never or OnFailure)
      serviceAccountName: report-generator
      containers:
        - name: report
          image: 123456789012.dkr.ecr.me-south-1.amazonaws.com/report-gen:v1.2
          env:
            - name: BATCH_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
| Parameter | Value | Meaning |
|---|---|---|
| completions | 100 | Job is done when 100 pods complete successfully |
| parallelism | 10 | Run up to 10 pods concurrently |
| backoffLimit | 3 | After 3 pod failures, mark the Job as Failed |
| activeDeadlineSeconds | 3600 | Kill the entire Job after 1 hour regardless of progress |
| ttlSecondsAfterFinished | 86400 | Auto-clean the Job object 24 hours after completion |
| restartPolicy: Never | - | Failed pods are not restarted; new pods are created instead |
| restartPolicy: OnFailure | - | Failed containers are restarted in the same pod |
Job: year-end-report-2025 (completions=100, parallelism=10)

Time 0:    [pod-0] [pod-1] [pod-2] ... [pod-9]   10 pods running
Time 5m:   pod-0 completes, pod-3 fails
           [pod-10] created (next unit of work)
           [pod-3-retry] created (retry)
Time 10m:  20 completed, 10 running, 1 failed
...
Time 50m:  100 completed → Job status: Complete

Note that backoffLimit counts pod failures across the entire Job, not per pod: once total failures reach 3 here, the Job status becomes Failed. (Per-index retry budgets require backoffLimitPerIndex on Indexed Jobs in newer releases.)
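The 50-minute completion time in the timeline above is just throughput arithmetic, assuming roughly 5 minutes per unit of work (an illustrative figure):

```shell
# Happy-path wall clock for the Job above (5 min/unit is an assumption)
COMPLETIONS=100
PARALLELISM=10
MINUTES_PER_UNIT=5
WAVES=$((COMPLETIONS / PARALLELISM))   # 10 sequential "waves" of pods
TOTAL_MIN=$((WAVES * MINUTES_PER_UNIT))
echo "${TOTAL_MIN} minutes end to end"
```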

Each pod gets a unique index, useful for processing specific data partitions:

spec:
  completionMode: Indexed    # each pod gets a JOB_COMPLETION_INDEX env var
  completions: 100
  parallelism: 10

Pod 0 processes partition 0, pod 1 processes partition 1, and so on. The index is available via the JOB_COMPLETION_INDEX environment variable.
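A worker entrypoint can turn that index into a data partition. A sketch for a 10,000-file batch (the file count and the fallback index are illustrative; in the cluster, Kubernetes injects JOB_COMPLETION_INDEX):

```shell
# Hypothetical indexed-Job worker: map pod index -> file partition
TOTAL_FILES=10000
COMPLETIONS=100
JOB_COMPLETION_INDEX=${JOB_COMPLETION_INDEX:-3}   # injected by K8s; 3 for illustration
PER_POD=$((TOTAL_FILES / COMPLETIONS))
START=$((JOB_COMPLETION_INDEX * PER_POD))
END=$((START + PER_POD - 1))
echo "pod ${JOB_COMPLETION_INDEX} processes files ${START}..${END}"
```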


A CronJob runs a Job on a cron schedule.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-reconciliation
  namespace: payments
spec:
  schedule: "0 2 * * *"              # 02:00 in the timeZone below
  timeZone: "Asia/Dubai"             # K8s 1.27+ (important for banking!)
  concurrencyPolicy: Forbid          # do NOT start new if previous still running
  successfulJobsHistoryLimit: 3      # keep last 3 successful jobs
  failedJobsHistoryLimit: 5          # keep last 5 failed jobs
  startingDeadlineSeconds: 300       # if missed by 5 min, skip this run
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 7200    # max 2 hours per run
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: reconciliation-worker
          containers:
            - name: reconciler
              image: 123456789012.dkr.ecr.me-south-1.amazonaws.com/reconciler:v3.1
              # NOTE: env values are not shell-expanded; compute RUN_DATE
              # inside the container entrypoint instead, e.g.
              #   RUN_DATE=$(date -d 'yesterday' +%Y-%m-%d)
              env:
                - name: DB_HOST
                  valueFrom:
                    secretKeyRef:
                      name: payments-db
                      key: host
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
                limits:
                  cpu: "4"
                  memory: 8Gi
| Policy | Behavior | Use Case |
|---|---|---|
| Allow (default) | Multiple Job instances can run concurrently | Independent tasks that do not conflict |
| Forbid | Skip the new run if the previous is still running | Data reconciliation, reports (prevent duplicates) |
| Replace | Kill the running Job and start a new one | Latest data always more important than completion |

Before Kubernetes 1.27, CronJobs used the timezone of the kube-controller-manager (typically UTC). Since 1.27, the timeZone field is GA:

spec:
  schedule: "0 2 * * *"
  timeZone: "Asia/Dubai"   # runs at 02:00 Dubai time (UTC+4)

This is critical for banking where reconciliation jobs must align with business day boundaries.
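The conversion itself is plain offset arithmetic, since Asia/Dubai is a fixed UTC+4 zone with no daylight saving:

```shell
# 02:00 Asia/Dubai (fixed UTC+4, no DST) expressed as a UTC hour
LOCAL_HOUR=2
OFFSET=4
UTC_HOUR=$(( (LOCAL_HOUR - OFFSET + 24) % 24 ))
echo "fires at ${UTC_HOUR}:00 UTC, on the previous calendar day"
```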


Understanding pod lifecycle is essential for zero-downtime deployments and graceful shutdown.

Pod Phase Lifecycle

Container Startup Sequence

Init containers run before any main containers start; each init container must complete successfully before the next one starts. Use cases:

spec:
  initContainers:
    # Wait for the database to be ready
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nc -z payments-db.payments.svc.cluster.local 5432; do sleep 2; done']
    # Run database migrations
    - name: db-migrate
      image: 123456789012.dkr.ecr.me-south-1.amazonaws.com/payments-api:v2.3.1
      command: ['./migrate', '--direction', 'up']
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: payments-db
              key: url
    # Download config from S3 / GCS
    - name: fetch-config
      image: amazon/aws-cli:2.15
      command: ['aws', 's3', 'cp', 's3://bank-config/payments/config.yaml', '/config/']
      volumeMounts:
        - name: config
          mountPath: /config

Native Sidecar Containers (Kubernetes 1.28+)


Before 1.28, sidecar containers (like Istio proxy, Vault agent) were regular containers in the pod spec. The problem: they did not have defined startup/shutdown ordering relative to the main container.

Native sidecars solve this with restartPolicy: Always on init containers:

spec:
initContainers:
# This is a SIDECAR — starts before main containers, runs alongside them
- name: vault-agent
image: hashicorp/vault:1.16
restartPolicy: Always # THIS makes it a native sidecar
args: ['agent', '-config=/etc/vault/config.hcl']
volumeMounts:
- name: vault-secrets
mountPath: /vault/secrets
# This is a SIDECAR — Istio proxy
- name: istio-proxy
image: istio/proxyv2:1.22
restartPolicy: Always # native sidecar
ports:
- containerPort: 15090
name: http-envoy-prom
containers:
- name: payments-api
image: payments-api:v2.3.1
# Vault secrets are available at /vault/secrets BEFORE this container starts
volumeMounts:
- name: vault-secrets
mountPath: /vault/secrets
readOnly: true

Benefits of native sidecars:

  • Start before main containers (guaranteed ordering)
  • Shut down after main containers (sidecars outlive the app container)
  • Properly handled during Job completion (sidecar does not block Job success)
  • Startup/readiness probes on sidecars gate main container start

When a pod is being terminated (rolling update, node drain, scale down):

Pod Termination Sequence

Critical Timing Issue — Pod Termination

# Startup Probe — for slow-starting apps (JVM, large ML models)
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 60       # 60 * 5s = 300s (5 min) to start
  # Once it passes, it never runs again. Liveness/readiness take over.

# Readiness Probe — controls traffic routing
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
  # Failing readiness removes the pod from Service endpoints.
  # The pod keeps running — useful for temporary unavailability.

# Liveness Probe — detects deadlocks, hangs
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 30    # start after the app is up
  periodSeconds: 20
  failureThreshold: 3
  # Failing liveness RESTARTS the container.
  # Use sparingly — restarts can cascade.
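When tuning these thresholds, it helps to compute the windows they actually grant:

```shell
# Probe timing windows implied by the configuration above
STARTUP_WINDOW=$((60 * 5))       # failureThreshold * periodSeconds = 300s to start
READINESS_REMOVAL=$((3 * 10))    # ~30s of failures before endpoint removal
LIVENESS_RESTART=$((3 * 20))     # ~60s of failures before a restart
echo "startup<=${STARTUP_WINDOW}s unready@${READINESS_REMOVAL}s restart@${LIVENESS_RESTART}s"
```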

Blue-Green Deployment with Two Deployments

# Blue (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-blue
  namespace: payments
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
      version: blue
  template:
    metadata:
      labels:
        app: payments-api
        version: blue
    spec:
      containers:
        - name: api
          image: payments-api:v2.3.0   # current version
---
# Green (new version, pre-deployed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-green
  namespace: payments
spec:
  replicas: 6
  selector:
    matchLabels:
      app: payments-api
      version: green
  template:
    metadata:
      labels:
        app: payments-api
        version: green
    spec:
      containers:
        - name: api
          image: payments-api:v2.4.0   # new version
---
# Service — switch traffic by changing the selector
apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: payments
spec:
  selector:
    app: payments-api
    version: blue      # ← change to "green" to switch
  ports:
    - port: 80
      targetPort: 8080

Switch traffic: Update the Service selector from version: blue to version: green. Instant cutover, instant rollback.


Scenario 1: “When would you use a StatefulSet vs a Deployment with PVCs?”


Answer:

“Use a StatefulSet when your application needs stable network identity (each pod must be addressable individually, like Kafka brokers), ordered startup/shutdown (pod-0 before pod-1), or dedicated persistent storage per pod where the storage follows the pod identity (data-kafka-0 always goes to kafka-0).

Use a Deployment with PVCs when you just need persistent storage but do not need stable identity or ordering. For example, a web application that writes user uploads to a shared EFS/Filestore volume — any pod can read/write, pods are interchangeable.

A common mistake is using a StatefulSet just because the app needs a PVC. If the pods are interchangeable and do not need to discover each other by name, a Deployment is simpler to manage. StatefulSets add operational complexity: they are harder to scale, harder to upgrade (rolling updates are pod-by-pod in reverse order), and PVC cleanup requires manual intervention.”


Scenario 2: “Design a batch processing system on Kubernetes that processes 10K files nightly”


Answer:

Nightly Batch Processing System

“I would use an Indexed Job inside a CronJob. Each pod gets a unique index and processes its partition of the 10K files. With parallelism: 20, we run 20 pods concurrently, each processing 100 files. The job completes after all 100 partitions are done.

For cost optimization, I would use Spot/Preemptible nodes (batch workloads are retryable) or EKS Fargate (no need to manage nodes). Karpenter would provision nodes just-in-time and deprovision after the job completes.

I would set concurrencyPolicy: Forbid to prevent overlapping runs, activeDeadlineSeconds as a safety timeout, and ttlSecondsAfterFinished to clean up completed jobs automatically.”


Scenario 3: “Your CronJob is running duplicate instances. What is happening?”


Answer:

“The most likely cause is concurrencyPolicy: Allow (the default). If the previous Job run has not finished by the time the next scheduled run triggers, a new Job is created and runs concurrently.

Diagnosis steps:

  1. Check kubectl get jobs -n payments --sort-by=.status.startTime — do you see overlapping start/end times?
  2. Check kubectl get cronjob daily-reconciliation -n payments -o yaml — what is the concurrencyPolicy?
  3. Check how long each Job takes vs the cron schedule interval

Fix: Set concurrencyPolicy: Forbid. This tells the CronJob controller to skip a run if the previous Job is still active.

Other causes of duplicates:

  • startingDeadlineSeconds too high: if the controller was down and missed schedules, it can create multiple jobs to ‘catch up’ (up to 100 missed schedules)
  • Controller restart: the CronJob controller recalculates missed schedules on startup. If startingDeadlineSeconds is not set, it may attempt to catch up
  • Clock skew between controller-manager replicas (rare in managed K8s)”
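The catch-up math is worth sketching. Assuming an hourly schedule and a controller outage (both illustrative), the number of runs the controller considers "missed" when it comes back is:

```shell
# Missed-run count after a controller outage (hourly schedule assumed;
# past 100 missed runs the controller stops trying to catch up)
OUTAGE_HOURS=6
INTERVAL_HOURS=1
MISSED=$((OUTAGE_HOURS / INTERVAL_HOURS))
echo "controller sees ${MISSED} missed runs on restart"
```

With startingDeadlineSeconds set (say 300), only a run missed within the last 300 seconds is started late; older missed runs are skipped instead of piled up.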

Scenario 4: “How do you ensure zero-downtime deployments?”


Answer:

“Zero-downtime deployments require coordination across five areas:

  1. Rolling update strategy: maxSurge ensures new pods are created before old ones are removed. maxUnavailable controls how many pods can be down simultaneously. For critical services, I use maxSurge: 25%, maxUnavailable: 0 — never go below current replica count.

  2. Readiness probes: New pods must pass readiness before receiving traffic. Without readiness probes, Kubernetes immediately routes traffic to starting pods that are not ready yet.

  3. PodDisruptionBudgets (PDB): Prevent voluntary disruptions (like node drain during upgrade) from taking too many pods offline. minAvailable: 80% or maxUnavailable: 1.

  4. preStop hooks: sleep 10-15s gives the load balancer time to deregister the terminating pod. Without this, requests hit pods that are shutting down.

  5. Graceful shutdown: The application must handle SIGTERM properly — stop accepting new connections, drain in-flight requests, close database connections. Set terminationGracePeriodSeconds high enough (60-120s) for this to complete.”

# Complete zero-downtime configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: payments
spec:
  maxUnavailable: 1        # at most 1 pod down during disruption
  selector:
    matchLabels:
      app: payments-api

Scenario 5: “A StatefulSet pod keeps restarting after node failure. Debug it.”


Answer:

Pod Phase Lifecycle


Key Takeaways

  1. Deployments vs StatefulSets is not about storage. It is about identity. If pods need stable names and an ordered lifecycle, use a StatefulSet. If pods are interchangeable, use a Deployment — even if it needs persistent storage.

  2. Know the rolling update math. If replicas=10, maxSurge=3, maxUnavailable=2, you can have up to 13 pods (10+3) and as few as 8 pods (10-2) at any point during the update. Be ready to calculate this.

  3. CronJob concurrencyPolicy defaults to Allow. This is a common production incident. Always set it explicitly, and use Forbid for data-processing jobs.

  4. preStop + terminationGracePeriodSeconds is the zero-downtime formula. Without preStop sleep, rolling updates cause 502 errors. Without sufficient grace period, in-flight requests are killed.

  5. Native sidecars (1.28+) change everything. They fix the sidecar ordering problem that has plagued Istio and Vault integrations for years. Know that this exists and how it works.