
CI/CD, Deployments & Environment Promotion

Where This Fits in the Enterprise Architecture


The golden rule: CI pipelines (GitHub Actions) build and push artifacts; CD pipelines (ArgoCD) deploy to clusters. CI never runs kubectl. The gitops repo is the single source of truth for what is deployed where.


Enterprise CI/CD Flow

Connecting GitHub Actions to Cloud Providers Securely

# DO NOT DO THIS — static keys stored as GitHub secrets
- name: Configure AWS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Why this is bad:

  • Long-lived credentials that can leak
  • No automatic rotation
  • Hard to audit — which workflow used which key?
  • Cannot scope to specific repo/branch
  • If compromised, attacker has persistent access

OIDC Authentication Flow

Step 1: Create OIDC Identity Provider in AWS


The workflow uses OIDC authentication (no static keys), builds and pushes to ECR, scans with Trivy, and updates the GitOps repo.
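The auth step of that workflow can be sketched as follows (the role name, account ID, and region are placeholders, not taken from the source):

```yaml
# Sketch: OIDC auth step in the CI workflow — no static keys anywhere
permissions:
  id-token: write   # required so the job can request a GitHub OIDC token
  contents: read
steps:
  - name: Configure AWS credentials via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      # hypothetical role created for CI — see the Terraform below
      role-to-assume: arn:aws:iam::111111111111:role/github-actions-deploy
      aws-region: me-south-1
```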

GCP: Workload Identity Federation for GitHub Actions to GKE


Step 1: Create Workload Identity Pool and Provider

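The source does not show the commands for this step; a minimal Terraform sketch, assuming placeholder project, pool, and provider IDs:

```hcl
# Sketch: Workload Identity Pool + GitHub OIDC provider (all names are placeholders)
resource "google_iam_workload_identity_pool" "github" {
  project                   = "bank-prod-project"
  workload_identity_pool_id = "github-actions"
  display_name              = "GitHub Actions"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  project                            = "bank-prod-project"
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # reject tokens from repos outside the org
  attribute_condition = "assertion.repository_owner == 'bank-org'"
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}
```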

Step 2: Create Service Account with IAM Bindings

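A hedged Terraform sketch of this step (service-account name, project number, and repo are assumptions):

```hcl
# Sketch: deployer service account bound to the pool (names are placeholders)
resource "google_service_account" "ci_deployer" {
  project    = "bank-prod-project"
  account_id = "github-ci-deployer"
}

# Allow OIDC tokens from the gitops repo to impersonate the service account
resource "google_service_account_iam_member" "wif_binding" {
  service_account_id = google_service_account.ci_deployer.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/github-actions/attribute.repository/bank-org/gitops-repo"
}

# Grant only what CI needs: push images to Artifact Registry
resource "google_project_iam_member" "ar_writer" {
  project = "bank-prod-project"
  role    = "roles/artifactregistry.writer"
  member  = "serviceAccount:${google_service_account.ci_deployer.email}"
}
```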

The workflow uses Workload Identity Federation (no JSON keys), builds and pushes to Artifact Registry, scans with Trivy, and updates the GitOps repo.
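The auth step of that workflow can be sketched as follows (the pool path and service-account email are placeholders):

```yaml
# Sketch: WIF auth step in the CI workflow — no JSON keys anywhere
permissions:
  id-token: write
  contents: read
steps:
  - name: Authenticate to GCP via Workload Identity Federation
    uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github-actions/providers/github-oidc
      service_account: github-ci-deployer@bank-prod-project.iam.gserviceaccount.com
```

Later steps can then obtain cluster credentials (for example with google-github-actions/get-gke-credentials) using the federated identity.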

# Terraform: GitHub OIDC provider in AWS
resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
  # placeholder — AWS validates GitHub's OIDC issuer via trusted root CAs,
  # so the thumbprint value is no longer load-bearing
  thumbprint_list = ["ffffffffffffffffffffffffffffffffffffffff"]
  tags = {
    Name      = "github-actions-oidc"
    ManagedBy = "terraform"
  }
}
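The provider alone grants nothing; the workflow assumes an IAM role whose trust policy validates the OIDC token. A hedged sketch (role name and repo are assumptions):

```hcl
# Sketch: IAM role the CI workflow assumes (role name and repo are placeholders)
resource "aws_iam_role" "github_actions_deploy" {
  name = "github-actions-deploy"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          # scope to the main branch of one repo only
          "token.actions.githubusercontent.com:sub" = "repo:bank-org/payments-api:ref:refs/heads/main"
        }
      }
    }]
  })
}
```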

ArgoCD in the Enterprise

# ArgoCD Application for dev environment
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-dev
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io # cleanup on delete
spec:
  project: development
  source:
    repoURL: https://github.com/bank-org/gitops-repo.git
    targetRevision: main
    path: apps/payments/overlays/dev
  destination:
    server: https://dev-eks.me-south-1.eks.amazonaws.com
    namespace: payments
  syncPolicy:
    automated:
      prune: true       # delete resources removed from git
      selfHeal: true    # revert manual changes in cluster
      allowEmpty: false # prevent accidental deletion of all resources
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true # prune after all other syncs
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

# ArgoCD Application for prod — manual sync + sync windows
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/bank-org/gitops-repo.git
    targetRevision: main
    path: apps/payments/overlays/prod
  destination:
    server: https://prod-eks.me-south-1.eks.amazonaws.com
    namespace: payments
  syncPolicy:
    # NO automated sync — manual only for prod
    syncOptions:
      - CreateNamespace=false # namespace must pre-exist in prod
      - PrunePropagationPolicy=foreground
      - RespectIgnoreDifferences=true
    retry:
      limit: 5
      backoff:
        duration: 10s
        factor: 2
        maxDuration: 5m
  # Ignore fields that are set by controllers (avoid false OutOfSync)
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas # HPA manages replicas
    - group: autoscaling
      kind: HorizontalPodAutoscaler
      jqPathExpressions:
        - .status

# AppProject with sync window — only deploy during business hours
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production applications
  sourceRepos:
    - https://github.com/bank-org/gitops-repo.git
  destinations:
    - server: https://prod-eks.me-south-1.eks.amazonaws.com
      namespace: payments
    - server: https://prod-eks.me-south-1.eks.amazonaws.com
      namespace: orders
  # RBAC: who can sync
  roles:
    - name: platform-admin
      policies:
        - p, proj:production:platform-admin, applications, sync, production/*, allow
        - p, proj:production:platform-admin, applications, get, production/*, allow
    - name: team-payments
      policies:
        - p, proj:production:team-payments, applications, get, production/payments-*, allow
      # Note: team cannot sync prod — only platform-admin can
  # Sync windows: only allow deploys Sun-Thu 10am-4pm Dubai time
  syncWindows:
    - kind: allow
      schedule: "0 10 * * 0-4" # Sun-Thu 10am (Dubai work week)
      duration: 6h             # until 4pm
      applications: ["*"]
      manualSync: true # manual sync allowed within window
    - kind: deny
      schedule: "0 0 * * *" # deny all other times
      duration: 24h
      applications: ["*"]

For managing 50+ microservices, use the app-of-apps pattern — one parent Application that generates child Applications:

App-of-Apps Pattern

# Root Application (the "app of apps")
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/bank-org/gitops-repo.git
    targetRevision: main
    path: argocd/apps # directory containing Application YAMLs
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

ArgoCD ApplicationSets — Multi-Environment


ApplicationSets generate Applications dynamically based on generators (Git directory, list, cluster, matrix).

Git Directory Generator — Auto-Discover Apps per Environment

# One ApplicationSet → generates apps for ALL services in ALL envs
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-apps-all-envs
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    # Matrix: combine environment list × git directory discovery
    - matrix:
        generators:
          # Generator 1: environments
          - list:
              elements:
                - env: dev
                  cluster: https://dev-eks.me-south-1.eks.amazonaws.com
                  autoSync: "true"
                - env: staging
                  cluster: https://staging-eks.me-south-1.eks.amazonaws.com
                  autoSync: "true"
                - env: prod
                  cluster: https://prod-eks.me-south-1.eks.amazonaws.com
                  autoSync: "false" # manual sync for prod
          # Generator 2: discover apps from git directory structure
          - git:
              repoURL: https://github.com/bank-org/gitops-repo.git
              revision: main
              directories:
                - path: "apps/*/overlays/{{ .env }}"
  template:
    metadata:
      # Produces: payments-dev, payments-staging, payments-prod, etc.
      name: "{{ index .path.segments 1 }}-{{ .env }}"
    spec:
      project: "{{ .env }}"
      source:
        repoURL: https://github.com/bank-org/gitops-repo.git
        targetRevision: main
        path: "{{ .path.path }}"
      destination:
        server: "{{ .cluster }}"
        namespace: "{{ index .path.segments 1 }}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

GitOps Repository Structure

apps/payments/base/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
  - pdb.yaml
  - network-policy.yaml
commonLabels:
  app.kubernetes.io/name: payments-api
  app.kubernetes.io/part-of: payments
images:
  - name: payments-api
    newName: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
    newTag: latest # overridden per environment

apps/payments/base/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 1 # overridden per environment
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      serviceAccountName: payments-api
      containers:
        - name: payments-api
          image: payments-api # placeholder, Kustomize replaces
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          env:
            - name: LOG_LEVEL
              value: "info"

apps/payments/overlays/dev/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: payments
patches:
  - path: patches/replicas.yaml
  - path: patches/resources.yaml
images:
  - name: payments-api
    newName: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
    newTag: abc123def # CI updates this tag
configMapGenerator:
  - name: payments-config
    literals:
      - DATABASE_HOST=payments-db.dev.internal
      - LOG_LEVEL=debug
      - ENABLE_TRACING=true

apps/payments/overlays/dev/patches/replicas.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 1 # dev: single replica

apps/payments/overlays/dev/patches/resources.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi

apps/payments/overlays/prod/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: payments
patches:
  - path: patches/replicas.yaml
  - path: patches/resources.yaml
  - path: patches/tolerations.yaml
images:
  - name: payments-api
    newName: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
    newTag: def456abc # promoted from staging
configMapGenerator:
  - name: payments-config
    literals:
      - DATABASE_HOST=payments-db.prod.internal
      - LOG_LEVEL=warn
      - ENABLE_TRACING=true

apps/payments/overlays/prod/patches/replicas.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 6 # prod: 6 replicas across 3 AZs

apps/payments/overlays/prod/patches/resources.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi

GitOps CI/CD Pipeline — Dev to Prod

Pattern 1: PR-Based Promotion (Recommended)

PR-BASED PROMOTION FLOW
==========================
CI merges to main
|
v
GitHub Actions updates dev overlay (image tag)
|
v
ArgoCD auto-syncs dev
|
v
Automated tests pass in dev
|
v
GitHub Actions opens PR: "Promote payments-api:abc123 to staging"
- Updates staging/kustomization.yaml with new image tag
- PR auto-assigned to team lead (CODEOWNERS)
|
v
Team lead reviews + approves PR → merge
|
v
ArgoCD auto-syncs staging
|
v
Staging integration tests pass (automated)
|
v
Platform engineer opens PR: "Promote payments-api:abc123 to prod"
- Updates prod/kustomization.yaml with new image tag
- PR requires 2 approvals (CODEOWNERS)
- Must pass branch protection rules
|
v
Platform team reviews + approves → merge (within sync window)
|
v
ArgoCD syncs prod (manual trigger or within sync window)
|
v
Argo Rollouts: canary 10% → 30% → 60% → 100% (with analysis)

GitHub Actions for automated promotion:

.github/workflows/promote-to-staging.yaml

name: Promote to Staging
on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: "Image tag to promote"
        required: true
      app_name:
        description: "Application name"
        required: true
        default: "payments"
jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout gitops repo
        uses: actions/checkout@v4
      - name: Update staging image tag
        run: |
          cd apps/${{ inputs.app_name }}/overlays/staging
          kustomize edit set image \
            ${{ inputs.app_name }}-api=111111111111.dkr.ecr.me-south-1.amazonaws.com/${{ inputs.app_name }}-api:${{ inputs.image_tag }}
      - name: Create promotion PR
        uses: peter-evans/create-pull-request@v6
        with:
          title: "promote(${{ inputs.app_name }}): staging ← ${{ inputs.image_tag }}"
          body: |
            ## Promotion Request
            - **App:** ${{ inputs.app_name }}
            - **Image tag:** ${{ inputs.image_tag }}
            - **Target:** staging
            - **Source:** dev (verified)
            ### Checklist
            - [ ] Dev deployment verified
            - [ ] Integration tests passed
            - [ ] No open incidents
          branch: promote/${{ inputs.app_name }}-staging-${{ inputs.image_tag }}
          labels: promotion,staging
          reviewers: team-leads

CODEOWNERS for approval gates:

# .github/CODEOWNERS
# Dev overlay — team can self-approve
apps/*/overlays/dev/ @bank-org/team-payments
# Staging overlay — team lead approval
apps/*/overlays/staging/ @bank-org/team-leads
# Prod overlay — platform team approval (2 reviewers required)
apps/*/overlays/prod/ @bank-org/platform-team
IMAGE TAG PROMOTION
====================
CI builds image with tag: sha-abc123
|
v
Writes to dev overlay: newTag: sha-abc123
|
v
ArgoCD syncs dev → tests pass
|
v
Promotion job copies SAME tag to staging overlay
(no new build — same image, different config)
|
v
ArgoCD syncs staging → tests pass
|
v
Promotion job copies SAME tag to prod overlay
(same image as dev/staging — guaranteed identical)
|
v
ArgoCD syncs prod (canary)

Argo Rollouts extends Kubernetes Deployments with advanced deployment strategies: canary, blue-green, and progressive delivery with automated analysis.

CANARY DEPLOYMENT WITH ARGO ROLLOUTS
======================================
                  100% traffic
                       |
                       v
                +--------------+
Step 0:         | Stable v1    |   (current production)
                | (10 pods)    |
                +--------------+

Step 1:         +--------------+     +------------+
setWeight: 10   | Stable v1    |     | Canary v2  |
                | (9 pods)     |---> | (1 pod)    |
                | 90% traffic  |     | 10% traffic|
                +--------------+     +------------+
                       |
Step 2:         Run AnalysisTemplate
                pause: 5m (check error rate, latency)
                       |
                Pass? Continue
                Fail? Auto-rollback

Step 3:         +--------------+     +------------+
setWeight: 30   | Stable v1    |     | Canary v2  |
                | (7 pods)     |---> | (3 pods)   |
                | 70% traffic  |     | 30% traffic|
                +--------------+     +------------+

Step 4:         +--------------+     +------------+
setWeight: 60   | Stable v1    |     | Canary v2  |
                | (4 pods)     |---> | (6 pods)   |
                | 40% traffic  |     | 60% traffic|
                +--------------+     +------------+

Step 5:         +--------------+
setWeight: 100  | Canary v2    |   Canary promoted to stable
                | (10 pods)    |   Old ReplicaSet scaled to 0
                +--------------+

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api:sha-abc123
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
  strategy:
    canary:
      # Traffic management via ALB (EKS) or Istio
      trafficRouting:
        alb:
          ingress: payments-ingress
          servicePort: 80
          rootService: payments-api-root
          annotationPrefix: alb.ingress.kubernetes.io
      canaryService: payments-api-canary
      stableService: payments-api-stable
      steps:
        # Step 1: 10% canary with analysis
        - setWeight: 10
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: payments-success-rate
            args:
              - name: service-name
                value: payments-api-canary
        # Step 2: 30% canary
        - setWeight: 30
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: payments-success-rate
        # Step 3: 60% canary
        - setWeight: 60
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: payments-success-rate
        # Step 4: full rollout (manual gate for payments)
        - pause: {} # manual approval before 100%
        - setWeight: 100
      # Auto-rollback on failure
      abortScaleDownDelaySeconds: 30
      scaleDownDelayRevisionLimit: 1

AnalysisTemplate — Automated Canary Validation

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-success-rate
  namespace: payments
spec:
  args:
    - name: service-name
      value: payments-api-canary
  metrics:
    # Metric 1: HTTP success rate must be > 99.5%
    - name: success-rate
      interval: 60s
      count: 5       # run 5 measurements
      successCondition: result[0] >= 0.995
      failureLimit: 2 # fail if 2+ measurements fail
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status=~"2.."
            }[2m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))
    # Metric 2: P99 latency must be < 500ms
    - name: latency-p99
      interval: 60s
      count: 5
      successCondition: result[0] < 500
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_milliseconds_bucket{
                service="{{args.service-name}}"
              }[2m])) by (le)
            )
    # Metric 3: No increase in error logs
    - name: error-log-count
      interval: 60s
      count: 3
      successCondition: result[0] < 10
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(increase(log_messages_total{
              service="{{args.service-name}}",
              level="error"
            }[5m]))

Helm is used for packaging applications with configurable values. In a GitOps setup, ArgoCD renders Helm templates and applies the output.

# ArgoCD Application using Helm
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod-helm
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/bank-org/helm-charts.git
    targetRevision: main
    path: charts/payments-api
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml # environment-specific overrides
      parameters:
        - name: image.tag
          value: "sha-abc123"
  destination:
    server: https://prod-eks.me-south-1.eks.amazonaws.com
    namespace: payments

Helm values per environment:

# values.yaml (defaults)
replicaCount: 1
image:
  repository: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
  tag: latest
resources:
  requests:
    cpu: 250m
    memory: 256Mi

# values-prod.yaml (overrides for prod)
replicaCount: 6
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 2Gi
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internal

Scenario 1: Design CI/CD for 50 Microservices


“Design CI/CD for 50 microservices deployed to EKS across dev, staging, and prod environments.”

Architecture:

50 APP REPOS              1 GITOPS REPO                3 CLUSTERS
+----------+              +-----------+                +----------+
| app-1    |--CI-->       | apps/     |                | dev      |
| app-2    |--CI-->       |   app-1/  |---ArgoCD-->    | staging  |
| ...      |              |   app-2/  |                | prod     |
| app-50   |--CI-->       |   ...     |                +----------+
+----------+              |   app-50/ |
                          +-----------+

Each app repo has:            GitOps repo has:             ArgoCD has:
- src/                        - base/ per app              - 1 ApplicationSet
- Dockerfile                  - overlays/dev, staging,     - generates 150 apps
- .github/workflows/ci.yaml     prod per app                 (50 x 3 envs)

Key decisions:

  1. One CI workflow per app repo — builds, tests, pushes image, updates gitops repo
  2. Single gitops repo — all 50 apps, Kustomize overlays for 3 envs
  3. One ApplicationSet — matrix generator (envs x apps) creates 150 Applications
  4. Shared CI templates — GitHub Actions reusable workflows for consistency
  5. OIDC auth — single IAM role for all CI pipelines (scoped to org)
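
Decision 4 (shared CI templates) is typically implemented with a `workflow_call` reusable workflow; a minimal sketch, assuming a central ci-templates repo and workflow name that are not in the source:

```yaml
# .github/workflows/ci.yaml in each app repo — a thin wrapper (names are placeholders)
name: CI
on:
  push:
    branches: [main]
jobs:
  build:
    # shared pipeline logic lives once, in a central templates repo
    uses: bank-org/ci-templates/.github/workflows/build-push-update.yaml@v1
    with:
      app_name: payments
    permissions:
      id-token: write # OIDC auth to the cloud provider
      contents: read
```

With 50 thin wrappers like this, a pipeline change (new scanner, new registry) is one PR in the templates repo instead of 50.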

Scenario 2: GitHub Actions to EKS Without Static Credentials


“How do you connect GitHub Actions to EKS without static credentials?”

Answer: Use GitHub Actions OIDC federation with AWS STS.

  1. Register GitHub as an OIDC identity provider in AWS
  2. Create an IAM role with a trust policy that validates the GitHub OIDC token
  3. Scope the trust policy to specific repo, branch, and optionally GitHub Environment
  4. In the workflow, use aws-actions/configure-aws-credentials@v4 with role-to-assume
  5. The action requests a short-lived token (15 min - 1 hour) from AWS STS
  6. No static access keys anywhere — not in GitHub Secrets, not in environment variables

Trust policy scoping options:

repo:org/repo:*                        # any branch (too broad)
repo:org/repo:ref:refs/heads/main      # main branch only (good)
repo:org/repo:environment:production   # GitHub Environment (best)
repo:org/repo:pull_request             # PR context (for CI only)
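
These `sub` patterns go in the role's trust policy condition; a hedged JSON fragment showing the Environment-scoped variant (repo name is a placeholder):

```json
{
  "Condition": {
    "StringEquals": {
      "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
      "token.actions.githubusercontent.com:sub": "repo:bank-org/payments-api:environment:production"
    }
  }
}
```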

Scenario 3: Environment Promotion with Approval Gates


“Design a promotion workflow: dev to staging to prod with approval gates.”

See the PR-based promotion pattern above. Key elements:

  1. Dev: auto-deploy on merge to main (no approval needed)
  2. Staging: automated PR opened by CI, requires team lead approval (CODEOWNERS)
  3. Prod: manual PR by platform engineer, requires 2 platform team approvals
  4. ArgoCD sync windows: prod deploys only during Sun-Thu 10am-4pm Dubai time
  5. Canary: Argo Rollouts with analysis at 10/30/60% before 100%
  6. Rollback: automatic if analysis fails; manual kubectl argo rollouts abort if needed

Scenario 4: Deployment Stuck — CrashLoopBackOff


“A deployment is stuck — 3 new pods are CrashLoopBackOff but old pods are still serving. What happens and how do you fix it?”

What is happening:

Rolling Update in Progress
===========================
maxSurge: 25% (can create 25% extra pods)
maxUnavailable: 25% (can have 25% fewer ready pods)
Deployment: payments-api (replicas=10, image=v1)
→ Update triggered: image=v2
Step 1: Create 3 new pods with v2 (25% surge = ceil(10*0.25) = 3)
Step 2: New pods start → CrashLoopBackOff (bad config, missing env var, etc.)
Step 3: Rolling update STALLS — it will NOT kill old pods because:
- maxUnavailable=25% → can have 8 ready (currently 10 ready with v1)
- New pods are NOT ready → old pods stay
- Users are NOT affected (v1 pods still serve traffic)
The deployment controller waits for progressDeadlineSeconds (default 600s = 10 min)
After deadline: deployment status = "ProgressDeadlineExceeded"
But old pods STILL serve traffic — no outage
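
The knobs described above live on the Deployment itself; a minimal fragment showing where they are set:

```yaml
# Deployment fragment controlling the rolling-update behavior described above
spec:
  progressDeadlineSeconds: 600 # default — status becomes ProgressDeadlineExceeded after 10 min
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # with 10 replicas: up to ceil(10 * 0.25) = 3 extra pods
      maxUnavailable: 25% # at least 8 of 10 pods must stay ready
```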

Debugging:

# Check rollout status
kubectl rollout status deployment/payments-api -n payments
# Check new pod logs
kubectl logs -n payments -l app=payments-api --tail=50 | grep -i error
# Check events
kubectl describe deployment payments-api -n payments
# Common CrashLoopBackOff causes:
# - Missing ConfigMap/Secret referenced in env
# - Database connection string wrong for new env
# - Missing env variable in new version
# - OOMKilled (new version needs more memory)
# - Liveness probe path changed in new version

Fix:

# Option 1: Rollback to previous revision
kubectl rollout undo deployment/payments-api -n payments
# Option 2: Rollback to specific revision
kubectl rollout undo deployment/payments-api -n payments --to-revision=3
# In GitOps: revert the commit in gitops repo → ArgoCD syncs old version
git revert HEAD
git push

Scenario 5: Canary Deployments on EKS

“How do you implement canary deployments on EKS?”

Option 1: Argo Rollouts with ALB (recommended)

  • Replace Deployment with Rollout CRD
  • Use ALB Ingress Controller for traffic splitting
  • AnalysisTemplate validates canary health
  • Automated promotion or rollback

Option 2: Argo Rollouts with Istio

  • Istio VirtualService for fine-grained traffic splitting
  • More precise than ALB (can split by header, cookie, etc.)
  • Higher complexity (requires Istio service mesh)

Option 3: Native Kubernetes (manual canary)

  • Two Deployments (stable + canary) behind same Service
  • Adjust replica counts for traffic ratio
  • No automated analysis — purely manual
  • Not recommended for enterprise
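
A minimal sketch of Option 3, assuming hypothetical image tags: both Deployments carry the `app` label the Service selects, so traffic splits roughly by pod count (9 stable : 1 canary ≈ 90/10).

```yaml
# Manual canary: one Service, two Deployments sharing its selector
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api # matches pods from BOTH Deployments below
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-stable
spec:
  replicas: 9 # ~90% of traffic
  selector:
    matchLabels: { app: payments-api, track: stable }
  template:
    metadata:
      labels: { app: payments-api, track: stable }
    spec:
      containers:
        - name: payments-api
          image: payments-api:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-canary
spec:
  replicas: 1 # ~10% of traffic
  selector:
    matchLabels: { app: payments-api, track: canary }
  template:
    metadata:
      labels: { app: payments-api, track: canary }
    spec:
      containers:
        - name: payments-api
          image: payments-api:v2
```

The split is only as granular as the replica count, which is one reason this approach is not recommended at enterprise scale.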

Scenario 6: ArgoCD OutOfSync But Application Running Fine


“ArgoCD shows ‘OutOfSync’ but the application is running fine. Why?”

Common causes:

Cause                                   Fix
-----                                   ---
HPA changed replica count               Add ignoreDifferences for /spec/replicas
Mutating webhook added fields           Ignore the webhook-added fields
Server-side apply added managedFields   ArgoCD settings: exclude managedFields
Resource drift (manual kubectl edit)    Enable selfHeal: true to auto-revert
CRD status subresource                  Ignore .status in diff
Defaulted fields by API server          Normalize in ArgoCD resource customization

Fix example:

# In ArgoCD Application spec
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas # HPA manages this
  - group: ""
    kind: Service
    jqPathExpressions:
      - .spec.clusterIP # auto-assigned by K8s
  - kind: MutatingWebhookConfiguration
    jqPathExpressions:
      - .webhooks[]?.clientConfig.caBundle

Scenario 7: Prevent Direct Deployment to Prod


“How do you prevent a developer from deploying directly to prod, bypassing the pipeline?”

Layered controls:

DEFENSE IN DEPTH — PREVENTING DIRECT PROD DEPLOYS
====================================================
Layer 1: Git (source of truth)
- CODEOWNERS on prod/ overlay → requires platform team approval
- Branch protection: no direct push to main
- Require PR reviews for prod changes
Layer 2: ArgoCD (deployment engine)
- AppProject RBAC: only platform-admin role can sync prod apps
- Sync windows: deny outside business hours
- No automated sync for prod (manual only)
Layer 3: Kubernetes (cluster-level)
- RBAC: team SAs cannot create/update Deployments in prod namespaces
- OPA/Kyverno: deny deployments not matching gitops labels
- Namespace labels: "managed-by: argocd" — reject non-ArgoCD applies
Layer 4: Network
- EKS API server: private endpoint only
- No direct kubectl access from developer laptops
- Break-glass procedure for emergencies (audited)

Kyverno policy — reject non-ArgoCD deployments:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-argocd-managed
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-argocd-label
      match:
        any:
          - resources:
              kinds: ["Deployment", "StatefulSet", "DaemonSet"]
              namespaces: ["payments", "orders", "trading"]
      exclude:
        any:
          - subjects:
              - kind: ServiceAccount
                name: argocd-application-controller
                namespace: argocd
      validate:
        message: "Resources in production namespaces must be deployed via ArgoCD."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/managed-by: argocd
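
Layer 3's RBAC control can be sketched as a read-only Role for team members (role, binding, and group names are placeholders, not from the source):

```yaml
# Developers get read-only access in prod namespaces — no create/update/delete
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-read-only
  namespace: payments
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "services", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-read-only
  namespace: payments
subjects:
  - kind: Group
    name: team-payments # mapped from SSO/IAM groups
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: prod-read-only
  apiGroup: rbac.authorization.k8s.io
```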

Scenario 8: Design GitOps Repo for 10 Microservices Across 3 Environments


“Design the gitops repo structure for a team with 10 microservices across 3 environments.”

Structure: Use the Kustomize-based structure shown above. Key principles:

  1. One gitops repo for all 10 services (not 10 repos — simplifies management)
  2. Kustomize base per service — shared manifests (deployment, service, HPA, PDB)
  3. Three overlays per service — dev, staging, prod with environment-specific patches
  4. One ApplicationSet — matrix generator creates 30 Applications automatically
  5. CODEOWNERS — team owns dev/staging overlays, platform team owns prod
  6. Promotion via PR — image tag updated in overlay, reviewed, merged, synced

File count:

  • 10 services x (base: ~5 files + 3 overlays x ~3 files each) = ~140 files
  • 1 ApplicationSet YAML = 1 file
  • Total: ~141 files in one repo — manageable

Scaling considerations:

  • At 50+ services, consider splitting into domain-specific gitops repos (payments-gitops, trading-gitops)
  • ArgoCD can watch multiple repos
  • Use ArgoCD ApplicationSet with multiple Git generators pointing to different repos

CI/CD DECISION MATRIX
======================
Need                      Tool / Pattern
----                      --------------
Container image build     GitHub Actions + docker/build-push-action
Image scanning            Trivy (OSS) or Snyk (enterprise)
Auth to AWS from CI       OIDC + aws-actions/configure-aws-credentials@v4
Auth to GCP from CI       WIF + google-github-actions/auth@v2
Container registry        ECR (AWS) / Artifact Registry (GCP)
GitOps deployment         ArgoCD (preferred) or Flux
Manifest templating       Kustomize (simple) or Helm (complex)
Multi-env management      Kustomize overlays + ArgoCD ApplicationSets
Canary deployments        Argo Rollouts + AnalysisTemplate
Traffic splitting         ALB (EKS) / Istio / Gateway API
Approval gates            GitHub CODEOWNERS + ArgoCD sync windows
Rollback                  git revert → ArgoCD sync (GitOps) or
                          kubectl argo rollouts abort (Argo Rollouts)