CI/CD, Deployments & Environment Promotion
Where This Fits in the Enterprise Architecture
The golden rule: CI pipelines (GitHub Actions) build and push artifacts. CD pipelines (ArgoCD) deploy to clusters. CI never runs kubectl. The gitops repo is the single source of truth for what is deployed where.
The CI/CD Pipeline — End to End
Connecting GitHub Actions to Cloud Providers Securely
The Wrong Way (Static Credentials)
```yaml
# DO NOT DO THIS — static keys stored as GitHub secrets
- name: Configure AWS
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

Why this is bad:
- Long-lived credentials that can leak
- No automatic rotation
- Hard to audit — which workflow used which key?
- Cannot scope to specific repo/branch
- If compromised, attacker has persistent access
The Right Way — OIDC Federation
AWS: OIDC Setup for GitHub Actions to EKS
Step 1: Create OIDC Identity Provider in AWS
Step 2: Create IAM Role with Trust Policy
Step 3: GitHub Actions Workflow for EKS
The workflow uses OIDC authentication (no static keys), builds and pushes to ECR, scans with Trivy, and updates the GitOps repo.
GCP: Workload Identity Federation for GitHub Actions to GKE
Step 1: Create Workload Identity Pool and Provider
Step 2: Create Service Account with IAM Bindings
Step 3: GitHub Actions Workflow for GKE
The workflow uses Workload Identity Federation (no JSON keys), builds and pushes to Artifact Registry, scans with Trivy, and updates the GitOps repo.
```hcl
# Terraform: GitHub OIDC provider in AWS
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["ffffffffffffffffffffffffffffffffffffffff"]

  tags = {
    Name      = "github-actions-oidc"
    ManagedBy = "terraform"
  }
}

# IAM Role for GitHub Actions — scoped to specific repo and branch
resource "aws_iam_role" "github_actions_cicd" {
  name = "GitHubActions-CICD-PaymentsAPI"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Federated = aws_iam_openid_connect_provider.github.arn
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringEquals = {
            "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          }
          StringLike = {
            # Scope to specific repo and branch
            "token.actions.githubusercontent.com:sub" = "repo:bank-org/payments-api:ref:refs/heads/main"
          }
        }
      }
    ]
  })

  tags = {
    Purpose = "github-actions-cicd"
    Repo    = "bank-org/payments-api"
  }
}

# Permissions: push to ECR + read EKS cluster
resource "aws_iam_role_policy" "github_actions_permissions" {
  name = "cicd-permissions"
  role = aws_iam_role.github_actions_cicd.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "ECRPush"
        Effect = "Allow"
        Action = [
          "ecr:BatchCheckLayerAvailability",
          "ecr:CompleteLayerUpload",
          "ecr:GetDownloadUrlForLayer",
          "ecr:InitiateLayerUpload",
          "ecr:PutImage",
          "ecr:UploadLayerPart",
          "ecr:BatchGetImage",
        ]
        Resource = "arn:aws:ecr:me-south-1:111111111111:repository/payments-api"
      },
      {
        Sid      = "ECRAuth"
        Effect   = "Allow"
        Action   = "ecr:GetAuthorizationToken"
        Resource = "*"
      },
      {
        Sid    = "EKSDescribe"
        Effect = "Allow"
        Action = [
          "eks:DescribeCluster",
        ]
        Resource = "arn:aws:eks:me-south-1:111111111111:cluster/prod-eks-cluster"
      }
    ]
  })
}
```

```yaml
name: CI — Build, Scan, Push to ECR

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

permissions:
  id-token: write   # REQUIRED for OIDC
  contents: read    # read repo code

env:
  AWS_REGION: me-south-1
  ECR_REGISTRY: 111111111111.dkr.ecr.me-south-1.amazonaws.com
  ECR_REPOSITORY: payments-api
  IMAGE_TAG: ${{ github.sha }}

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run linters
        run: |
          # Dockerfile linting
          docker run --rm -i hadolint/hadolint < Dockerfile

      - name: Run unit tests
        run: |
          go test ./... -v -race -coverprofile=coverage.out

  build-and-push:
    needs: lint-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'   # only on main branch merge
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      # OIDC authentication — no static keys
      - name: Configure AWS Credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-CICD-PaymentsAPI
          aws-region: ${{ env.AWS_REGION }}
          role-session-name: GitHubActions-${{ github.run_id }}

      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push image
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ env.IMAGE_TAG }}
            ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:latest
          cache-from: type=registry,ref=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:cache
          cache-to: type=registry,ref=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:cache,mode=max

      # Scan image for CVEs
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ env.IMAGE_TAG }}
          format: table
          exit-code: 1   # fail pipeline on HIGH/CRITICAL CVEs
          severity: HIGH,CRITICAL
          ignore-unfixed: true

  update-gitops:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Checkout gitops repo
        uses: actions/checkout@v4
        with:
          repository: bank-org/gitops-repo
          token: ${{ secrets.GITOPS_PAT }}   # PAT with write access to gitops repo
          path: gitops

      - name: Update image tag in dev overlay
        run: |
          cd gitops/apps/payments/overlays/dev
          kustomize edit set image \
            payments-api=${{ env.ECR_REGISTRY }}/${{ env.ECR_REPOSITORY }}:${{ env.IMAGE_TAG }}

      - name: Commit and push
        run: |
          cd gitops
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .
          git commit -m "chore(dev): update payments-api to ${{ env.IMAGE_TAG }}"
          git push
```

```hcl
# Terraform: Workload Identity Federation for GitHub Actions
resource "google_iam_workload_identity_pool" "github" {
  project                   = var.project_id
  workload_identity_pool_id = "github-pool"
  display_name              = "GitHub Actions Pool"
  description               = "WIF pool for GitHub Actions CI/CD"
}
```
```hcl
resource "google_iam_workload_identity_pool_provider" "github" {
  project                            = var.project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider"
  display_name                       = "GitHub Provider"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.actor"      = "assertion.actor"
    "attribute.repository" = "assertion.repository"
    "attribute.ref"        = "assertion.ref"
  }

  # Restrict to your GitHub org
  attribute_condition = "assertion.repository_owner == 'bank-org'"

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Service account for GitHub Actions
resource "google_service_account" "github_actions" {
  project      = var.project_id
  account_id   = "github-actions-cicd"
  display_name = "GitHub Actions CI/CD"
}

# Allow GitHub Actions to impersonate this SA (scoped to repo + branch)
resource "google_service_account_iam_binding" "github_wif" {
  service_account_id = google_service_account.github_actions.name
  role               = "roles/iam.workloadIdentityUser"

  members = [
    # Scoped to specific repo and branch
    "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/bank-org/payments-api"
  ]
}

# Permissions: push to Artifact Registry + access GKE
resource "google_project_iam_member" "gar_writer" {
  project = var.project_id
  role    = "roles/artifactregistry.writer"
  member  = "serviceAccount:${google_service_account.github_actions.email}"
}

resource "google_project_iam_member" "gke_developer" {
  project = var.project_id
  role    = "roles/container.developer"
  member  = "serviceAccount:${google_service_account.github_actions.email}"
}
```

```yaml
name: CI — Build, Scan, Push to Artifact Registry

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

permissions:
  id-token: write   # REQUIRED for OIDC
  contents: read

env:
  GCP_PROJECT: bank-prod-cicd
  GAR_REGION: me-central1
  GAR_REPOSITORY: payments
  IMAGE_NAME: payments-api
  IMAGE_TAG: ${{ github.sha }}

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run linters
        run: docker run --rm -i hadolint/hadolint < Dockerfile

      - name: Run unit tests
        run: go test ./... -v -race

  build-and-push:
    needs: lint-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      # Workload Identity Federation — no JSON keys
      - name: Authenticate to Google Cloud via WIF
        id: auth
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: "projects/123456789/locations/global/workloadIdentityPools/github-pool/providers/github-provider"
          service_account: "github-actions-cicd@bank-prod-cicd.iam.gserviceaccount.com"
          token_format: access_token

      - name: Login to Artifact Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.GAR_REGION }}-docker.pkg.dev
          username: oauth2accesstoken
          password: ${{ steps.auth.outputs.access_token }}

      - name: Build and push image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: |
            ${{ env.GAR_REGION }}-docker.pkg.dev/${{ env.GCP_PROJECT }}/${{ env.GAR_REPOSITORY }}/${{ env.IMAGE_NAME }}:${{ env.IMAGE_TAG }}
            ${{ env.GAR_REGION }}-docker.pkg.dev/${{ env.GCP_PROJECT }}/${{ env.GAR_REPOSITORY }}/${{ env.IMAGE_NAME }}:latest

      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: "${{ env.GAR_REGION }}-docker.pkg.dev/${{ env.GCP_PROJECT }}/${{ env.GAR_REPOSITORY }}/${{ env.IMAGE_NAME }}:${{ env.IMAGE_TAG }}"
          format: table
          exit-code: 1
          severity: HIGH,CRITICAL

  update-gitops:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Checkout gitops repo
        uses: actions/checkout@v4
        with:
          repository: bank-org/gitops-repo
          token: ${{ secrets.GITOPS_PAT }}
          path: gitops

      - name: Update image tag in dev overlay
        run: |
          cd gitops/apps/payments/overlays/dev
          kustomize edit set image \
            payments-api=${{ env.GAR_REGION }}-docker.pkg.dev/${{ env.GCP_PROJECT }}/${{ env.GAR_REPOSITORY }}/${{ env.IMAGE_NAME }}:${{ env.IMAGE_TAG }}

      - name: Commit and push
        run: |
          cd gitops
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .
          git commit -m "chore(dev): update payments-api to ${{ env.IMAGE_TAG }}"
          git push
```

ArgoCD — GitOps Deployment
ArgoCD Architecture in the Enterprise
ArgoCD Application CRD
Section titled “ArgoCD Application CRD”# ArgoCD Application for dev environmentapiVersion: argoproj.io/v1alpha1kind: Applicationmetadata: name: payments-dev namespace: argocd finalizers: - resources-finalizer.argocd.argoproj.io # cleanup on deletespec: project: development
source: repoURL: https://github.com/bank-org/gitops-repo.git targetRevision: main path: apps/payments/overlays/dev
destination: server: https://dev-eks.me-south-1.eks.amazonaws.com namespace: payments
syncPolicy: automated: prune: true # delete resources removed from git selfHeal: true # revert manual changes in cluster allowEmpty: false # prevent accidental deletion of all resources syncOptions: - CreateNamespace=true - PrunePropagationPolicy=foreground - PruneLast=true # prune after all other syncs retry: limit: 3 backoff: duration: 5s factor: 2 maxDuration: 3m# ArgoCD Application for prod — manual sync + sync windowsapiVersion: argoproj.io/v1alpha1kind: Applicationmetadata: name: payments-prod namespace: argocdspec: project: production
source: repoURL: https://github.com/bank-org/gitops-repo.git targetRevision: main path: apps/payments/overlays/prod
destination: server: https://prod-eks.me-south-1.eks.amazonaws.com namespace: payments
syncPolicy: # NO automated sync — manual only for prod syncOptions: - CreateNamespace=false # namespace must pre-exist in prod - PrunePropagationPolicy=foreground - RespectIgnoreDifferences=true retry: limit: 5 backoff: duration: 10s factor: 2 maxDuration: 5m
# Ignore fields that are set by controllers (avoid false OutOfSync) ignoreDifferences: - group: apps kind: Deployment jsonPointers: - /spec/replicas # HPA manages replicas - group: autoscaling kind: HorizontalPodAutoscaler jqPathExpressions: - .statusArgoCD Sync Windows (Production)
```yaml
# AppProject with sync window — only deploy during business hours
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production applications

  sourceRepos:
    - https://github.com/bank-org/gitops-repo.git

  destinations:
    - server: https://prod-eks.me-south-1.eks.amazonaws.com
      namespace: payments
    - server: https://prod-eks.me-south-1.eks.amazonaws.com
      namespace: orders

  # RBAC: who can sync
  roles:
    - name: platform-admin
      policies:
        - p, proj:production:platform-admin, applications, sync, production/*, allow
        - p, proj:production:platform-admin, applications, get, production/*, allow
    - name: team-payments
      policies:
        - p, proj:production:team-payments, applications, get, production/payments-*, allow
      # Note: team cannot sync prod — only platform-admin can

  # Sync windows: only allow deploys Sun-Thu 10am-4pm Dubai time
  syncWindows:
    - kind: allow
      schedule: "0 10 * * 0-4"   # Sun-Thu 10am (Dubai work week)
      duration: 6h               # until 4pm
      applications: ["*"]
      manualSync: true           # manual sync allowed within window
    - kind: deny
      schedule: "0 0 * * *"      # deny all other times
      duration: 24h
      applications: ["*"]
```

App-of-Apps Pattern
For managing 50+ microservices, use the app-of-apps pattern — one parent Application that generates child Applications:
```yaml
# Root Application (the "app of apps")
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/bank-org/gitops-repo.git
    targetRevision: main
    path: argocd/apps   # directory containing Application YAMLs
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

ArgoCD ApplicationSets — Multi-Environment
ApplicationSets generate Applications dynamically based on generators (Git directory, list, cluster, matrix).
Git Directory Generator — Auto-Discover Apps per Environment
Section titled “Git Directory Generator — Auto-Discover Apps per Environment”# One ApplicationSet → generates apps for ALL services in ALL envsapiVersion: argoproj.io/v1alpha1kind: ApplicationSetmetadata: name: all-apps-all-envs namespace: argocdspec: goTemplate: true goTemplateOptions: ["missingkey=error"]
generators: # Matrix: combine environment list × git directory discovery - matrix: generators: # Generator 1: environments - list: elements: - env: dev cluster: https://dev-eks.me-south-1.eks.amazonaws.com autoSync: "true" - env: staging cluster: https://staging-eks.me-south-1.eks.amazonaws.com autoSync: "true" - env: prod cluster: https://prod-eks.me-south-1.eks.amazonaws.com autoSync: "false" # manual sync for prod
# Generator 2: discover apps from git directory structure - git: repoURL: https://github.com/bank-org/gitops-repo.git revision: main directories: - path: "apps/*/overlays/{{ .env }}"
template: metadata: name: "{{ index .path.segments 1 }}-{{ .env }}" # Produces: payments-dev, payments-staging, payments-prod, etc. spec: project: "{{ .env }}" source: repoURL: https://github.com/bank-org/gitops-repo.git targetRevision: main path: "{{ .path.path }}" destination: server: "{{ .cluster }}" namespace: "{{ index .path.segments 1 }}" syncPolicy: automated: prune: true selfHeal: trueGitOps Repository Structure
Kustomize-Based Structure (Recommended)
Kustomize Base
Section titled “Kustomize Base”apiVersion: kustomize.config.k8s.io/v1beta1kind: Kustomization
resources: - deployment.yaml - service.yaml - hpa.yaml - pdb.yaml - network-policy.yaml
commonLabels: app.kubernetes.io/name: payments-api app.kubernetes.io/part-of: payments
images: - name: payments-api newName: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api newTag: latest # overridden per environmentapiVersion: apps/v1kind: Deploymentmetadata: name: payments-apispec: replicas: 1 # overridden per environment selector: matchLabels: app: payments-api template: metadata: labels: app: payments-api spec: serviceAccountName: payments-api containers: - name: payments-api image: payments-api # placeholder, Kustomize replaces ports: - containerPort: 8080 resources: requests: cpu: 250m memory: 256Mi limits: cpu: 500m memory: 512Mi livenessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 15 periodSeconds: 10 readinessProbe: httpGet: path: /ready port: 8080 initialDelaySeconds: 5 periodSeconds: 5 env: - name: LOG_LEVEL value: "info"Environment Overlays
```yaml
# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

namespace: payments

patches:
  - path: patches/replicas.yaml
  - path: patches/resources.yaml

images:
  - name: payments-api
    newName: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
    newTag: abc123def   # CI updates this tag

configMapGenerator:
  - name: payments-config
    literals:
      - DATABASE_HOST=payments-db.dev.internal
      - LOG_LEVEL=debug
      - ENABLE_TRACING=true
```

```yaml
# overlays/dev/patches/replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 1   # dev: single replica
```

```yaml
# overlays/dev/patches/resources.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi
```

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

namespace: payments

patches:
  - path: patches/replicas.yaml
  - path: patches/resources.yaml
  - path: patches/tolerations.yaml

images:
  - name: payments-api
    newName: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
    newTag: def456abc   # promoted from staging

configMapGenerator:
  - name: payments-config
    literals:
      - DATABASE_HOST=payments-db.prod.internal
      - LOG_LEVEL=warn
      - ENABLE_TRACING=true
```

```yaml
# overlays/prod/patches/replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 6   # prod: 6 replicas across 3 AZs
```

```yaml
# overlays/prod/patches/resources.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
```

Environment Promotion Patterns
Pattern 1: PR-Based Promotion (Recommended)
```text
PR-BASED PROMOTION FLOW
=======================

CI merges to main
        |
        v
GitHub Actions updates dev overlay (image tag)
        |
        v
ArgoCD auto-syncs dev
        |
        v
Automated tests pass in dev
        |
        v
GitHub Actions opens PR: "Promote payments-api:abc123 to staging"
  - Updates staging/kustomization.yaml with new image tag
  - PR auto-assigned to team lead (CODEOWNERS)
        |
        v
Team lead reviews + approves PR → merge
        |
        v
ArgoCD auto-syncs staging
        |
        v
Staging integration tests pass (automated)
        |
        v
Platform engineer opens PR: "Promote payments-api:abc123 to prod"
  - Updates prod/kustomization.yaml with new image tag
  - PR requires 2 approvals (CODEOWNERS)
  - Must pass branch protection rules
        |
        v
Platform team reviews + approves → merge (within sync window)
        |
        v
ArgoCD syncs prod (manual trigger or within sync window)
        |
        v
Argo Rollouts: canary 10% → 30% → 60% → 100% (with analysis)
```

GitHub Actions for automated promotion:
```yaml
name: Promote to Staging

on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: "Image tag to promote"
        required: true
      app_name:
        description: "Application name"
        required: true
        default: "payments"

jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout gitops repo
        uses: actions/checkout@v4

      - name: Update staging image tag
        run: |
          cd apps/${{ inputs.app_name }}/overlays/staging
          kustomize edit set image \
            ${{ inputs.app_name }}-api=111111111111.dkr.ecr.me-south-1.amazonaws.com/${{ inputs.app_name }}-api:${{ inputs.image_tag }}

      - name: Create promotion PR
        uses: peter-evans/create-pull-request@v6
        with:
          title: "promote(${{ inputs.app_name }}): staging ← ${{ inputs.image_tag }}"
          body: |
            ## Promotion Request

            - **App:** ${{ inputs.app_name }}
            - **Image tag:** ${{ inputs.image_tag }}
            - **Target:** staging
            - **Source:** dev (verified)

            ### Checklist
            - [ ] Dev deployment verified
            - [ ] Integration tests passed
            - [ ] No open incidents
          branch: promote/${{ inputs.app_name }}-staging-${{ inputs.image_tag }}
          labels: promotion,staging
          reviewers: team-leads
```

CODEOWNERS for approval gates:
```text
# .github/CODEOWNERS

# Dev overlay — team can self-approve
apps/*/overlays/dev/      @bank-org/team-payments

# Staging overlay — team lead approval
apps/*/overlays/staging/  @bank-org/team-leads

# Prod overlay — platform team approval (2 reviewers required)
apps/*/overlays/prod/     @bank-org/platform-team
```

Pattern 2: Image Tag Promotion Pipeline
```text
IMAGE TAG PROMOTION
===================

CI builds image with tag: sha-abc123
        |
        v
Writes to dev overlay: newTag: sha-abc123
        |
        v
ArgoCD syncs dev → tests pass
        |
        v
Promotion job copies SAME tag to staging overlay
(no new build — same image, different config)
        |
        v
ArgoCD syncs staging → tests pass
        |
        v
Promotion job copies SAME tag to prod overlay
(same image as dev/staging — guaranteed identical)
        |
        v
ArgoCD syncs prod (canary)
```

Argo Rollouts — Progressive Delivery
Argo Rollouts extends Kubernetes Deployments with advanced deployment strategies: canary, blue-green, and progressive delivery with automated analysis.
```text
CANARY DEPLOYMENT WITH ARGO ROLLOUTS
====================================

                100% traffic
                     |
                     v
              +--------------+
Step 0:       |  Stable v1   |  (current production)
              |  (10 pods)   |
              +--------------+

Step 1:       +--------------+     +------------+
setWeight: 10 |  Stable v1   |     | Canary v2  |
              |  (9 pods)    |---->| (1 pod)    |
              | 90% traffic  |     | 10% traffic|
              +--------------+     +------------+
                     |
Step 2:       Run AnalysisTemplate
pause: 5m     (check error rate, latency)
                     |
              Pass? Continue    Fail? Auto-rollback

Step 3:       +--------------+     +------------+
setWeight: 30 |  Stable v1   |     | Canary v2  |
              |  (7 pods)    |---->| (3 pods)   |
              | 70% traffic  |     | 30% traffic|
              +--------------+     +------------+

Step 4:       +--------------+     +------------+
setWeight: 60 |  Stable v1   |     | Canary v2  |
              |  (4 pods)    |---->| (6 pods)   |
              | 40% traffic  |     | 60% traffic|
              +--------------+     +------------+

Step 5:        +--------------+
setWeight: 100 |  Canary v2   |  Canary promoted to stable
               |  (10 pods)   |  Old ReplicaSet scaled to 0
               +--------------+
```

Rollout with Canary Strategy
Section titled “Rollout with Canary Strategy”apiVersion: argoproj.io/v1alpha1kind: Rolloutmetadata: name: payments-api namespace: paymentsspec: replicas: 10 revisionHistoryLimit: 5 selector: matchLabels: app: payments-api template: metadata: labels: app: payments-api spec: containers: - name: payments-api image: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api:sha-abc123 ports: - containerPort: 8080 resources: requests: cpu: 500m memory: 512Mi strategy: canary: # Traffic management via ALB (EKS) or Istio trafficRouting: alb: ingress: payments-ingress servicePort: 80 rootService: payments-api-root annotationPrefix: alb.ingress.kubernetes.io
canaryService: payments-api-canary stableService: payments-api-stable
steps: # Step 1: 10% canary with analysis - setWeight: 10 - pause: { duration: 2m } - analysis: templates: - templateName: payments-success-rate args: - name: service-name value: payments-api-canary
# Step 2: 30% canary - setWeight: 30 - pause: { duration: 5m } - analysis: templates: - templateName: payments-success-rate
# Step 3: 60% canary - setWeight: 60 - pause: { duration: 5m } - analysis: templates: - templateName: payments-success-rate
# Step 4: full rollout (manual gate for payments) - pause: {} # manual approval before 100% - setWeight: 100
# Auto-rollback on failure abortScaleDownDelaySeconds: 30 scaleDownDelayRevisionLimit: 1AnalysisTemplate — Automated Canary Validation
Section titled “AnalysisTemplate — Automated Canary Validation”apiVersion: argoproj.io/v1alpha1kind: AnalysisTemplatemetadata: name: payments-success-rate namespace: paymentsspec: args: - name: service-name value: payments-api-canary metrics: # Metric 1: HTTP success rate must be > 99.5% - name: success-rate interval: 60s count: 5 # run 5 measurements successCondition: result[0] >= 0.995 failureLimit: 2 # fail if 2+ measurements fail provider: prometheus: address: http://prometheus.monitoring.svc:9090 query: | sum(rate(http_requests_total{ service="{{args.service-name}}", status=~"2.." }[2m])) / sum(rate(http_requests_total{ service="{{args.service-name}}" }[2m]))
# Metric 2: P99 latency must be < 500ms - name: latency-p99 interval: 60s count: 5 successCondition: result[0] < 500 failureLimit: 2 provider: prometheus: address: http://prometheus.monitoring.svc:9090 query: | histogram_quantile(0.99, sum(rate(http_request_duration_milliseconds_bucket{ service="{{args.service-name}}" }[2m])) by (le) )
# Metric 3: No increase in error logs - name: error-log-count interval: 60s count: 3 successCondition: result[0] < 10 failureLimit: 1 provider: prometheus: address: http://prometheus.monitoring.svc:9090 query: | sum(increase(log_messages_total{ service="{{args.service-name}}", level="error" }[5m]))Helm Charts in the Enterprise
Helm is used for packaging applications with configurable values. In a GitOps setup, ArgoCD renders Helm templates and applies the output.
```yaml
# ArgoCD Application using Helm
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-prod-helm
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/bank-org/helm-charts.git
    targetRevision: main
    path: charts/payments-api
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml   # environment-specific overrides
      parameters:
        - name: image.tag
          value: "sha-abc123"
  destination:
    server: https://prod-eks.me-south-1.eks.amazonaws.com
    namespace: payments
```

Helm values per environment:
```yaml
# values.yaml (defaults)
replicaCount: 1
image:
  repository: 111111111111.dkr.ecr.me-south-1.amazonaws.com/payments-api
  tag: latest
resources:
  requests:
    cpu: 250m
    memory: 256Mi
```

```yaml
# values-prod.yaml (overrides for prod)
replicaCount: 6
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 2Gi
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
```

Interview Scenarios
Scenario 1: Design CI/CD for 50 Microservices
“Design CI/CD for 50 microservices deployed to EKS across dev, staging, and prod environments.”
Architecture:
```text
50 APP REPOS            1 GITOPS REPO           3 CLUSTERS
+----------+            +-----------+           +----------+
| app-1    |--CI-->     | apps/     |           | dev      |
| app-2    |--CI-->     |   app-1/  |--ArgoCD-->| staging  |
| ...      |            |   app-2/  |           | prod     |
| app-50   |--CI-->     |   ...     |           +----------+
+----------+            |   app-50/ |
                        +-----------+

Each app repo has:           GitOps repo has:          ArgoCD has:
- src/                       - base/ per app           - 1 ApplicationSet
- Dockerfile                 - overlays/dev,staging,   - generates 150 apps
- .github/workflows/ci.yaml    prod per app              (50 x 3 envs)
```

Key decisions:
- One CI workflow per app repo — builds, tests, pushes image, updates gitops repo
- Single gitops repo — all 50 apps, Kustomize overlays for 3 envs
- One ApplicationSet — matrix generator (envs x apps) creates 150 Applications
- Shared CI templates — GitHub Actions reusable workflows for consistency
- OIDC auth — single IAM role for all CI pipelines (scoped to org)
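The "shared CI templates" decision is usually implemented with GitHub Actions reusable workflows. A hedged sketch of that shape — the `bank-org/ci-templates` repo name and `app_name` input are hypothetical, not from the source:

```yaml
# Hypothetical shared template: bank-org/ci-templates/.github/workflows/build-push.yaml
name: Reusable Build & Push
on:
  workflow_call:
    inputs:
      app_name:
        required: true
        type: string

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC, as in the workflows above
      contents: read
    steps:
      - uses: actions/checkout@v4
      # ...shared build, scan, push, and gitops-update steps...

# Each of the 50 app repos then reduces its ci.yaml to a single calling job:
#
# jobs:
#   ci:
#     uses: bank-org/ci-templates/.github/workflows/build-push.yaml@main
#     with:
#       app_name: payments-api
```

Changes to lint, scan, or push logic then land in one place instead of 50 repos.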
Scenario 2: GitHub Actions to EKS Without Static Credentials
“How do you connect GitHub Actions to EKS without static credentials?”
Answer: Use GitHub Actions OIDC federation with AWS STS.
- Register GitHub as an OIDC identity provider in AWS
- Create an IAM role with a trust policy that validates the GitHub OIDC token
- Scope the trust policy to specific repo, branch, and optionally GitHub Environment
- In the workflow, use `aws-actions/configure-aws-credentials@v4` with `role-to-assume`
- The action requests a short-lived token (15 min to 1 hour) from AWS STS
- No static access keys anywhere — not in GitHub Secrets, not in environment variables
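The branch scoping in the trust policy is, in effect, a wildcard match on the token's `sub` claim. A minimal sketch of that check (`sub_allowed` is a hypothetical helper; IAM's `StringLike` supports the same `*`/`?` wildcards as shell globs):

```python
import fnmatch

def sub_allowed(token_sub: str, allowed_patterns: list[str]) -> bool:
    # Mimics an IAM StringLike condition evaluated against the
    # GitHub OIDC token's `sub` claim (case-sensitive glob match).
    return any(fnmatch.fnmatchcase(token_sub, p) for p in allowed_patterns)

# Trust policy scoped to the main branch of one repo:
patterns = ["repo:bank-org/payments-api:ref:refs/heads/main"]

print(sub_allowed("repo:bank-org/payments-api:ref:refs/heads/main", patterns))       # True
print(sub_allowed("repo:bank-org/payments-api:ref:refs/heads/feature-x", patterns))  # False
print(sub_allowed("repo:attacker/payments-api:ref:refs/heads/main", patterns))       # False
```

This is why a pattern like `repo:org/repo:*` is too broad — any branch (and any PR ref) of the repo would be able to assume the role.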
Trust policy scoping options:
```text
repo:org/repo:*                       # any branch (too broad)
repo:org/repo:ref:refs/heads/main     # main branch only (good)
repo:org/repo:environment:production  # GitHub Environment (best)
repo:org/repo:pull_request            # PR context (for CI only)
```

Scenario 3: Environment Promotion with Approval Gates
“Design a promotion workflow: dev to staging to prod with approval gates.”
See the PR-based promotion pattern above. Key elements:
- Dev: auto-deploy on merge to main (no approval needed)
- Staging: automated PR opened by CI, requires team lead approval (CODEOWNERS)
- Prod: manual PR by platform engineer, requires 2 platform team approvals
- ArgoCD sync windows: prod deploys only during Sun-Thu 10am-4pm Dubai time
- Canary: Argo Rollouts with analysis at 10/30/60% before 100%
- Rollback: automatic if analysis fails; manual `kubectl argo rollouts abort` if needed
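The sync-window rule from the bullets above (Sun–Thu, 10am–4pm Dubai time) boils down to a simple predicate; a sketch with a hypothetical helper, assuming the timestamp is already in Dubai local time:

```python
from datetime import datetime

def in_sync_window(dt: datetime) -> bool:
    # Python's weekday(): Mon=0 ... Sun=6; Dubai work week is Sun-Thu.
    dubai_workdays = {6, 0, 1, 2, 3}   # Sun, Mon, Tue, Wed, Thu
    return dt.weekday() in dubai_workdays and 10 <= dt.hour < 16

print(in_sync_window(datetime(2024, 1, 7, 11, 0)))   # Sunday 11:00 -> True
print(in_sync_window(datetime(2024, 1, 5, 11, 0)))   # Friday 11:00 -> False
print(in_sync_window(datetime(2024, 1, 8, 17, 0)))   # Monday 17:00 -> False
```

In practice ArgoCD evaluates this itself from the `syncWindows` cron spec; the sketch only illustrates the gate a merged prod PR still has to pass.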
Scenario 4: Deployment Stuck — CrashLoopBackOff
“A deployment is stuck — 3 new pods are CrashLoopBackOff but old pods are still serving. What happens and how do you fix it?”
What is happening:
```text
Rolling Update in Progress
==========================

maxSurge: 25%        (can create 25% extra pods)
maxUnavailable: 25%  (can have 25% fewer ready pods)

Deployment: payments-api (replicas=10, image=v1)
  → Update triggered: image=v2

Step 1: Create 3 new pods with v2 (25% surge = ceil(10*0.25) = 3)
Step 2: New pods start → CrashLoopBackOff (bad config, missing env var, etc.)
Step 3: Rolling update STALLS — it will NOT kill old pods because:
  - maxUnavailable=25% → can have 8 ready (currently 10 ready with v1)
  - New pods are NOT ready → old pods stay
  - Users are NOT affected (v1 pods still serve traffic)

The deployment controller waits for progressDeadlineSeconds (default 600s = 10 min)
After deadline: deployment status = "ProgressDeadlineExceeded"
But old pods STILL serve traffic — no outage
```

Debugging:
```shell
# Check rollout status
kubectl rollout status deployment/payments-api -n payments

# Check new pod logs
kubectl logs -n payments -l app=payments-api --tail=50 | grep -i error

# Check events
kubectl describe deployment payments-api -n payments

# Common CrashLoopBackOff causes:
# - Missing ConfigMap/Secret referenced in env
# - Database connection string wrong for new env
# - Missing env variable in new version
# - OOMKilled (new version needs more memory)
# - Liveness probe path changed in new version
```

Fix:
```shell
# Option 1: Rollback to previous revision
kubectl rollout undo deployment/payments-api -n payments

# Option 2: Rollback to specific revision
kubectl rollout undo deployment/payments-api -n payments --to-revision=3

# In GitOps: revert the commit in gitops repo → ArgoCD syncs old version
git revert HEAD
git push
```

Scenario 5: Canary Deployments on EKS
“How do you implement canary deployments on EKS?”
Option 1: Argo Rollouts with ALB (recommended)
- Replace Deployment with Rollout CRD
- Use ALB Ingress Controller for traffic splitting
- AnalysisTemplate validates canary health
- Automated promotion or rollback
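A minimal Rollout sketch for Option 1. The field names follow the Argo Rollouts canary API, but the service and ingress names, weights, and pause durations are illustrative assumptions, not values from this document:

```yaml
# Sketch — assumes Argo Rollouts and the AWS Load Balancer Controller are
# installed; names and timings are examples, not prescriptions.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: payments-api-canary   # Service selecting canary pods
      stableService: payments-api-stable   # Service selecting stable pods
      trafficRouting:
        alb:
          ingress: payments-api            # ALB Ingress the controller rewrites
          servicePort: 80
      steps:
        - setWeight: 10                    # send 10% of traffic to the canary
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate # AnalysisTemplate defined separately
        - setWeight: 50
        - pause: { duration: 10m }
        # full promotion happens automatically after the last step;
        # a failed analysis aborts and shifts traffic back to stable
```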
Option 2: Argo Rollouts with Istio
- Istio VirtualService for fine-grained traffic splitting
- More precise than ALB (can split by header, cookie, etc.)
- Higher complexity (requires Istio service mesh)
Option 3: Native Kubernetes (manual canary)
- Two Deployments (stable + canary) behind same Service
- Adjust replica counts for traffic ratio
- No automated analysis — purely manual
- Not recommended for enterprise
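The manual pattern in Option 3 amounts to two Deployments whose pods all match one Service's selector, so traffic splits roughly by ready-pod count. The names and the 9:1 replica ratio (~10% canary) below are illustrative assumptions:

```yaml
# Sketch — both Deployments carry the label the Service selects on, so
# kube-proxy load-balances across all pods (~90/10 here). No automated
# analysis or rollback; shift the ratio by editing replica counts by hand.
apiVersion: v1
kind: Service
metadata:
  name: payments-api
spec:
  selector:
    app: payments-api            # matches BOTH stable and canary pods
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-stable
spec:
  replicas: 9                    # ~90% of traffic
  selector:
    matchLabels: { app: payments-api, track: stable }
  template:
    metadata:
      labels: { app: payments-api, track: stable }
    spec:
      containers:
        - name: app
          image: payments-api:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-canary
spec:
  replicas: 1                    # ~10% of traffic
  selector:
    matchLabels: { app: payments-api, track: canary }
  template:
    metadata:
      labels: { app: payments-api, track: canary }
    spec:
      containers:
        - name: app
          image: payments-api:v2
```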
Scenario 6: ArgoCD OutOfSync But Application Running Fine
“ArgoCD shows ‘OutOfSync’ but the application is running fine. Why?”
Common causes:
| Cause | Fix |
|---|---|
| HPA changed replica count | Add ignoreDifferences for /spec/replicas |
| Mutating webhook added fields | Ignore the webhook-added fields |
| Server-side apply added managedFields | Exclude managedFields in ArgoCD diff settings |
| Resource drift (manual kubectl edit) | Enable selfHeal: true to auto-revert |
| CRD status subresource | Ignore .status in diff |
| Defaulted fields by API server | Normalize in ArgoCD resource customization |
Fix example:
```yaml
# In the ArgoCD Application spec
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas             # HPA manages this
  - group: ""
    kind: Service
    jqPathExpressions:
      - .spec.clusterIP            # auto-assigned by K8s
  - kind: MutatingWebhookConfiguration
    jqPathExpressions:
      - .webhooks[]?.clientConfig.caBundle
```

Scenario 7: Prevent Direct Deployment to Prod
“How do you prevent a developer from deploying directly to prod, bypassing the pipeline?”
Layered controls:
```
DEFENSE IN DEPTH — PREVENTING DIRECT PROD DEPLOYS
=================================================

Layer 1: Git (source of truth)
  - CODEOWNERS on prod/ overlay → requires platform team approval
  - Branch protection: no direct push to main
  - Require PR reviews for prod changes

Layer 2: ArgoCD (deployment engine)
  - AppProject RBAC: only the platform-admin role can sync prod apps
  - Sync windows: deny syncs outside business hours
  - No automated sync for prod (manual only)

Layer 3: Kubernetes (cluster-level)
  - RBAC: team ServiceAccounts cannot create/update Deployments in prod namespaces
  - OPA/Kyverno: deny deployments not matching gitops labels
  - Namespace labels: "managed-by: argocd" — reject non-ArgoCD applies

Layer 4: Network
  - EKS API server: private endpoint only
  - No direct kubectl access from developer laptops
  - Break-glass procedure for emergencies (audited)
```

Kyverno policy — reject non-ArgoCD deployments:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-argocd-managed
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-argocd-label
      match:
        any:
          - resources:
              kinds: ["Deployment", "StatefulSet", "DaemonSet"]
              namespaces: ["payments", "orders", "trading"]
      exclude:
        any:
          - subjects:
              - kind: ServiceAccount
                name: argocd-application-controller
                namespace: argocd
      validate:
        message: "Resources in production namespaces must be deployed via ArgoCD."
        pattern:
          metadata:
            labels:
              app.kubernetes.io/managed-by: argocd
```

Scenario 8: Design GitOps Repo for 10 Microservices Across 3 Environments
“Design the gitops repo structure for a team with 10 microservices across 3 environments.”
Structure: Use the Kustomize-based structure shown above. Key principles:
- One gitops repo for all 10 services (not 10 repos — simplifies management)
- Kustomize base per service — shared manifests (deployment, service, HPA, PDB)
- Three overlays per service — dev, staging, prod with environment-specific patches
- One ApplicationSet — matrix generator creates 30 Applications automatically
- CODEOWNERS — team owns dev/staging overlays, platform team owns prod
- Promotion via PR — image tag updated in overlay, reviewed, merged, synced
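Under those principles, one service’s slice of the repo might look like the tree below. The directory and file names are illustrative, not prescribed by this document:

```
gitops/
├── apps/
│   └── payments-api/
│       ├── base/
│       │   ├── kustomization.yaml
│       │   ├── deployment.yaml
│       │   ├── service.yaml
│       │   ├── hpa.yaml
│       │   └── pdb.yaml
│       └── overlays/
│           ├── dev/
│           │   ├── kustomization.yaml    # image tag + low replicas
│           │   └── patch-resources.yaml
│           ├── staging/
│           └── prod/                     # CODEOWNERS: platform team
├── applicationset.yaml                   # matrix: 10 services × 3 envs
└── CODEOWNERS
```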
File count:
- 10 services x (base: ~5 files + 3 overlays x ~3 files each) = ~140 files
- 1 ApplicationSet YAML = 1 file
- Total: ~141 files in one repo — manageable
Scaling considerations:
- At 50+ services, consider splitting into domain-specific gitops repos (payments-gitops, trading-gitops)
- ArgoCD can watch multiple repos
- Use ArgoCD ApplicationSet with multiple Git generators pointing to different repos
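A matrix-generator sketch for the single ApplicationSet mentioned above, pairing a Git directory generator (one entry per service) with a list generator (one entry per environment). The repo URL, cluster names, and path layout are assumptions for illustration:

```yaml
# Sketch — generates 10 services × 3 envs = 30 Applications.
# repoURL, cluster names, and paths are assumed, not from this document.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-services
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - git:                         # one parameter set per service dir
              repoURL: https://github.com/example/gitops.git
              revision: main
              directories:
                - path: apps/*
          - list:                        # one parameter set per environment
              elements:
                - env: dev
                - env: staging
                - env: prod
  template:
    metadata:
      name: "{{path.basename}}-{{env}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/example/gitops.git
        targetRevision: main
        path: "{{path}}/overlays/{{env}}"
      destination:
        name: "{{env}}-cluster"          # assumes clusters registered by name
        namespace: "{{path.basename}}"
      syncPolicy:
        automated: { prune: true, selfHeal: true }  # prod would override to manual sync
```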
Quick Reference — CI/CD Decision Matrix
```
CI/CD DECISION MATRIX
=====================

Need                     Tool / Pattern
----                     --------------
Container image build    GitHub Actions + docker/build-push-action
Image scanning           Trivy (OSS) or Snyk (enterprise)
Auth to AWS from CI      OIDC + aws-actions/configure-aws-credentials@v4
Auth to GCP from CI      WIF + google-github-actions/auth@v2
Container registry       ECR (AWS) / Artifact Registry (GCP)
GitOps deployment        ArgoCD (preferred) or Flux
Manifest templating      Kustomize (simple) or Helm (complex)
Multi-env management     Kustomize overlays + ArgoCD ApplicationSets
Canary deployments       Argo Rollouts + AnalysisTemplate
Traffic splitting        ALB (EKS) / Istio / Gateway API
Approval gates           GitHub CODEOWNERS + ArgoCD sync windows
Rollback                 git revert → ArgoCD sync (GitOps) or
                         kubectl argo rollouts abort (Argo Rollouts)
```