# SRE & Incident Management
## Where This Fits

SRE is how the central platform team ensures reliability at scale. You define the SLOs, build the incident management process, and provide the tooling. Tenant teams adopt the framework for their services.
## On-Call Structure

### Enterprise On-Call Model

### Severity Levels
Section titled “Severity Levels”| Level | Name | Impact | Response Time | Example |
|---|---|---|---|---|
| SEV-1 | Critical | Revenue/data loss, all users affected | 5 min ack, 15 min response | Payment system down, database corruption |
| SEV-2 | High | Major feature degraded, many users affected | 15 min ack, 30 min response | Login failing for 50% of users, API latency >10s |
| SEV-3 | Medium | Minor feature degraded, some users affected | 1 hr ack, 4 hr response | Search is slow, non-critical API returning errors |
| SEV-4 | Low | Cosmetic, workaround available | Next business day | Dashboard rendering issue, log formatting error |
### Severity Decision Tree

## Incident Lifecycle
## Runbook Template

````markdown
# Runbook: [Alert Name]

## Overview
- **Service**: [service name]
- **SLO impacted**: [which SLO this alert protects]
- **Severity**: [SEV-1/2/3]
- **Owner team**: [team name]
- **Last updated**: [date]

## Alert Condition
```promql
[the exact alert expression from vmalert/Prometheus]
```

## Impact
[What user-facing behavior does this cause?]

## Quick Diagnosis (< 5 minutes)

### Step 1: Verify the alert is real
```bash
# Check if the service is actually down
kubectl -n payments get pods -l app=payment-api
kubectl -n payments top pods -l app=payment-api
```

### Step 2: Check recent deployments
```bash
# Was there a deployment in the last hour?
kubectl -n payments rollout history deployment/payment-api

# Check ArgoCD for recent syncs
argocd app history payment-api
```

### Step 3: Check dependencies
```bash
# Database connectivity
kubectl -n payments exec -it deploy/payment-api -- \
  pg_isready -h payments-db.xxx.rds.amazonaws.com

# Redis connectivity
kubectl -n payments exec -it deploy/payment-api -- \
  redis-cli -h payments-redis.xxx.cache.amazonaws.com ping
```

## Mitigation Options

### Option A: Rollback (fastest, safest)
```bash
# Rollback to previous deployment
kubectl -n payments rollout undo deployment/payment-api

# OR via ArgoCD
argocd app rollback payment-api
```

### Option B: Scale up (if load-related)
```bash
kubectl -n payments scale deployment/payment-api --replicas=10
```

### Option C: Restart pods (if stuck state)
```bash
kubectl -n payments rollout restart deployment/payment-api
```

## Escalation
- Tier 2: @platform-senior in #incident-response Slack
- Tier 3: @eng-manager via PagerDuty
- Database issues: @dba-team in #db-support

## Related Dashboards
[links to the service and dependency dashboards]

## Previous Incidents
[links to past postmortems triggered by this alert]
````
Section titled “Previous Incidents”---
## Postmortem Template
```markdown
# Postmortem: [Incident Title]

**Date**: [incident date]
**Duration**: [start time] to [resolution time] ([total duration])
**Severity**: [SEV-1/2/3]
**Incident Commander**: [name]
**Author**: [name]
**Status**: [Action items complete / In progress]

## Summary
[2-3 sentences describing what happened, impact, and resolution]

## Impact
- **Users affected**: [number or percentage]
- **Revenue impact**: [estimated, if applicable]
- **Error budget consumed**: [X% of 30-day budget]
- **SLO breach**: [Yes/No — which SLO]

## Timeline (all times UTC)

| Time | Event |
|------|-------|
| 10:15 | Deployment of payment-api v2.3.1 begins |
| 10:18 | Error rate increases from 0.01% to 15% |
| 10:19 | PagerDuty alert fires: PaymentAPIHighErrorRate |
| 10:21 | On-call acknowledges, opens Grafana |
| 10:25 | Identifies new deployment as trigger |
| 10:28 | Initiates rollback to v2.3.0 |
| 10:32 | Rollback complete, error rate drops to 0.01% |
| 10:45 | Confirmed stable, incident resolved |

## Root Cause
[Detailed technical explanation. What specifically failed and why.]

The v2.3.1 deployment included a database migration that added an index
on the `transactions` table. The index creation locked the table for
4 minutes, causing all payment queries to time out. The migration was
tested against a dev database with 1K rows (completed in under 1s), but
production has 50M rows.

## Contributing Factors
1. Database migration was not tested against production-sized data
2. Migration ran synchronously during deployment instead of as a separate, pre-deployment step
3. No canary deployment — migration applied to all pods simultaneously

## Detection
- Detected by: Automated alert (vmalert → PagerDuty)
- Time to detect: 1 minute (alert at 10:19, deployment at 10:18)
- Detection was: GOOD — fast automated detection

## Response
- Time to acknowledge: 2 minutes
- Time to mitigate: 13 minutes (rollback completed at 10:32)
- Response was: ADEQUATE — rollback was the right call

## Lessons Learned

### What went well
- Fast automated detection (1 minute)
- On-call had clear runbook for rollback
- PagerDuty escalation worked correctly

### What went wrong
- Migration not tested against prod-sized data
- No canary deployment for database changes
- Rollback took 4 minutes (could be faster with Argo Rollouts)

### Where we got lucky
- Incident happened during business hours (engineers available)
- Only 4 minutes of lock (if the table had more rows, it could have been hours)

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | Add migration testing with prod-sized data to CI | @dev-lead | 2026-03-22 | TODO |
| P0 | Implement canary deployments for DB migrations | @platform | 2026-03-29 | TODO |
| P1 | Add pre-deploy migration dry-run step | @dev-lead | 2026-04-05 | TODO |
| P2 | Evaluate Argo Rollouts for faster rollback | @platform | 2026-04-15 | TODO |
```

## SLO-Driven Operations

### Error Budget Policy

### Common SLO Targets for Banking
Section titled “Common SLO Targets for Banking”| Service | Availability SLO | Latency SLO (P99) | Error Budget (30d) |
|---|---|---|---|
| Payment API | 99.99% | < 500ms | 4.32 min |
| Customer Portal | 99.9% | < 2s | 43.2 min |
| Internal Tools | 99.5% | < 5s | 3.6 hours |
| Batch Processing | 99.0% | N/A (throughput) | 7.2 hours |
| Platform (K8s) | 99.95% | N/A | 21.6 min |
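The error-budget column follows directly from the availability target: budget = window × (1 − SLO). A quick sketch to reproduce the figures in the table:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes over the window for a given availability SLO."""
    return window_days * 24 * 60 * (1 - slo_percent / 100)

# Reproduces the table above:
# 99.99% → 4.32 min; 99.9% → 43.2 min; 99.5% → 216 min (3.6 h); 99.0% → 432 min (7.2 h)
```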
## Toil Reduction

### Identifying Toil

### Common Toil in Platform Teams
Section titled “Common Toil in Platform Teams”| Toil | Frequency | Automation |
|---|---|---|
| Creating new namespaces | 5/week | Terraform module + GitOps |
| Rotating secrets | Monthly | External Secrets Operator + rotation policy |
| Upgrading K8s versions | Quarterly | GKE release channels / EKS blue-green automation |
| Investigating OOM kills | 3/week | VPA recommendations + right-sizing alerts |
| SSL certificate renewal | Monthly | cert-manager + auto-renewal |
| User access requests | 10/week | RBAC self-service via Git PR |
| Scaling during peak | Weekly | HPA + Karpenter (reactive auto-scaling) |
## Chaos Engineering

Designing for reliability is necessary but insufficient. You can build multi-AZ deployments, configure auto-scaling, and write runbooks — but until you deliberately break something and watch what happens, you are operating on hope. Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It originated at Netflix (Chaos Monkey) and has been adopted by every major tech company as a standard reliability practice.
The core principle is the scientific method applied to infrastructure. You define what “normal” looks like (steady state), hypothesize what will happen when a specific component fails, inject that failure in a controlled way, observe whether reality matches your hypothesis, and learn from the gaps. The goal is not to break things for fun — it is to find weaknesses before your customers do. Every chaos experiment that reveals an unexpected failure mode is a production outage prevented.
The hardest part of chaos engineering is not the tooling — it is the organizational courage to run experiments in production. Non-production environments often lack the scale, traffic patterns, and data volumes that cause real failures. A service that handles AZ failure gracefully with 100 requests per second may fail catastrophically at 10,000 requests per second because of connection pool exhaustion or load balancer warm-up times. Start with non-production to build confidence, but graduate to production with strict safety controls.
### The Scientific Method for Reliability

```text
Chaos Engineering Experiment Lifecycle
========================================

1. Define Steady State
   └── What does "normal" look like?
   └── Metrics: latency p99 < 200ms, error rate < 0.1%, throughput > 5000 req/s
   └── These are your control measurements

2. Hypothesize
   └── "If AZ-a goes down, traffic shifts to AZ-b and AZ-c within 30 seconds"
   └── "If the database primary fails, Aurora fails over to reader within 60 seconds"
   └── "If 50% of pods are killed, HPA scales replacements within 2 minutes"
   └── Be specific — vague hypotheses produce vague learnings

3. Inject Failure (Controlled)
   └── Use tooling (FIS, Litmus, Chaos Mesh) — never manual kubectl delete
   └── Define blast radius limits
   └── Set automatic stop conditions
   └── Have rollback procedures ready

4. Observe
   └── Compare actual metrics against steady-state baseline
   └── Did traffic shift? How long did it take?
   └── Were there error spikes? How many users affected?
   └── Did alerts fire? Did runbooks work?

5. Learn and Improve
   └── Document findings (hypothesis vs reality)
   └── Create action items for gaps
   └── Share learnings across teams
   └── Schedule follow-up experiment after fixes
```
### Chaos Engineering Tools

AWS Fault Injection Simulator (FIS):
AWS FIS is a fully managed chaos engineering service. It provides pre-built experiment templates for common failure modes across EC2, ECS, EKS, RDS, and networking. The key advantage over open-source tools is native integration with AWS services — you can simulate an entire AZ failure with a single API call, something that is extremely difficult to do with kubectl-based tools.
Experiment types:
- EC2: Stop/terminate instances, inject CPU/memory stress, network latency, packet loss
- EKS: Delete pods, drain nodes, inject network latency between pods
- RDS: Failover Aurora cluster, inject replication lag
- Networking: Disrupt connectivity to specific AZs, inject DNS failures
- ECS: Stop tasks, inject CPU stress into containers
Safety controls:
- Stop conditions: abort experiment if CloudWatch alarm fires (e.g., error rate > 5%)
- IAM role scoping: limit which resources FIS can affect (only non-prod, or only specific tagged resources)
- Duration limits: experiments auto-stop after defined time
- Rollback: FIS reverses injected faults when experiment ends
```hcl
# Terraform: AWS FIS Experiment — AZ failure simulation
resource "aws_fis_experiment_template" "az_failure" {
  description = "Simulate AZ-a failure for payment service"
  role_arn    = aws_iam_role.fis_role.arn

  action {
    name      = "stop-instances-az-a"
    action_id = "aws:ec2:stop-instances"

    parameter {
      key   = "startInstancesAfterDuration"
      value = "PT5M" # Restart after 5 minutes
    }

    target {
      key   = "Instances"
      value = "instances-az-a"
    }
  }

  target {
    name           = "instances-az-a"
    resource_type  = "aws:ec2:instance"
    selection_mode = "ALL"

    resource_tag {
      key   = "aws:ec2:availability-zone"
      value = "us-east-1a"
    }

    resource_tag {
      key   = "Environment"
      value = "staging" # Only target staging
    }
  }

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.high_error_rate.arn
  }

  tags = {
    Purpose = "chaos-engineering"
    Team    = "platform"
  }
}

# Stop condition: abort if the ALB reports more than 50 5xx errors per minute
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "fis-stop-condition-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "5XXError"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 50
  alarm_description   = "Stop chaos experiment if error rate too high"
}
```

GCP does not have a native chaos engineering service equivalent to AWS FIS. The recommended approaches are:
- Litmus Chaos on GKE: Open-source, Kubernetes-native chaos engineering platform. Deploy via Helm, define experiments as CRDs.
- Chaos Mesh on GKE: Alternative open-source tool with stronger network and IO fault injection.
- Gremlin (SaaS): Commercial chaos engineering platform that works across GCP, AWS, and bare metal.
- Manual GCP API calls: Use `gcloud` to simulate failures (stop VMs, failover Cloud SQL, drain GKE nodes).
```yaml
# Litmus Chaos on GKE — Pod Delete Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete
  namespace: payments
spec:
  appinfo:
    appns: payments
    applabel: app=payment-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"   # Kill pods for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"   # Every 10 seconds
            - name: FORCE
              value: "true" # Force delete (no graceful shutdown)
            - name: PODS_AFFECTED_PERC
              value: "50"   # Kill 50% of pods
```

### Kubernetes-Native Chaos Tools

```text
Litmus Chaos vs Chaos Mesh Comparison
=======================================

Litmus Chaos:
├── ChaosExperiments: pod-delete, node-drain, network-loss, disk-fill, pod-cpu-hog
├── ChaosEngine: orchestrates experiments against target applications
├── ChaosResult: captures outcome (pass/fail) with detailed logs
├── ChaosHub: community library of pre-built experiments
├── Integration: ArgoCD (GitOps-driven chaos), Prometheus (metrics)
└── Best for: teams wanting a catalog of pre-built experiments

Chaos Mesh:
├── PodChaos: pod-kill, pod-failure, container-kill
├── NetworkChaos: delay, loss, duplicate, corrupt, partition
├── IOChaos: latency, fault, attribute override on filesystem
├── TimeChaos: time-skew (test clock-sensitive code like cert expiry)
├── JVMChaos: method-level latency injection for Java apps
├── DNSChaos: resolve to wrong IP, inject DNS failures
├── Dashboard: built-in web UI for experiment management
└── Best for: teams needing deep network/IO fault injection
```

### Game Day Planning
A game day is a structured chaos engineering exercise where a team deliberately injects failures and observes the system’s response. It is the infrastructure equivalent of a fire drill. The difference between a game day and randomly deleting pods is planning, safety controls, and structured learning. Well-run game days build team confidence and frequently uncover issues that desk reviews and architecture diagrams miss.
```text
Game Day Execution Template
=============================

Pre-Game (1-2 days before):
  1. Define scope
     └── Which system? (payment API, order processing pipeline)
     └── Which failure mode? (AZ failure, database failover, pod kills)
     └── What's the hypothesis? ("ALB shifts traffic within 30s")

  2. Set blast radius limits
     └── Non-prod first (staging with synthetic traffic)
     └── Production: limit to < 5% of traffic (canary-style)
     └── Never run on a Friday or during peak traffic for the first attempt

  3. Identify rollback procedures
     └── Kill switch: how to abort the experiment immediately
     └── FIS: stop experiment via console/API
     └── Litmus: kubectl delete chaosengine
     └── Manual: undo whatever was changed

  4. Brief observers
     └── Who watches which dashboards? (Grafana, CloudWatch, PagerDuty)
     └── Who has access to the kill switch?
     └── Who takes notes on timeline and observations?

  5. Get sign-off
     └── Engineering manager for staging game days
     └── VP/Director for production game days
     └── Notify customer support (in case of unexpected impact)

During Game (30-60 minutes):
  1. Verify steady state (baseline metrics look normal)
  2. Start recording (screen record dashboards)
  3. Inject failure
  4. Monitor steady-state metrics in real time
  5. Record actual behavior vs expected at each milestone
  6. Abort if stop conditions hit (error rate too high, latency too high)
  7. Allow recovery (stop experiment, wait for system to stabilize)
  8. Verify steady state restored

Post-Game (same day):
  1. Blameless debrief (30 min)
  2. Document: hypothesis, actual result, gaps found
  3. Create JIRA tickets for improvements
  4. Share findings in engineering Slack channel
  5. Schedule follow-up experiment after fixes are applied
```

### Example Game Day: AZ Failure During Peak Traffic
Scenario: "What happens when us-east-1a goes down during peak traffic?"
```text
Hypothesis:
├── ALB shifts traffic to 1b and 1c within 30 seconds
├── EKS pods reschedule via PodDisruptionBudget (min 2 replicas always available)
├── Aurora automatically fails over to reader in 1b (< 60 seconds)
├── ElastiCache Redis failover to replica in 1c (< 30 seconds)
└── No user-facing errors beyond brief latency spike

Actual Results (common findings):
├── ALB: shifted traffic in 15 seconds ✓ (better than expected)
├── EKS pods: some pods did NOT have topology spread constraints
│   └── All replacement pods landed in 1b → 1b overloaded
│   └── ACTION: Add topologySpreadConstraints to all Deployments
├── Aurora: failover took 90 seconds (hypothesis was 60) → acceptable
│   └── 12 seconds of connection errors during failover
│   └── ACTION: Implement connection retry with backoff in app
├── ElastiCache: failover took 45 seconds (hypothesis was 30)
│   └── Cache miss storm on failover → database load spike
│   └── ACTION: Implement cache warming on failover
└── Alerting: PagerDuty fired in 2 minutes ✓
    └── But runbook didn't cover multi-service failure scenario
    └── ACTION: Update runbook for correlated failures
```

### Interview Scenarios for Chaos Engineering
“How do you validate that your disaster recovery plan actually works?”
“A DR plan that hasn’t been tested is a wish, not a plan. I validate DR through structured game days with increasing scope:
- Component-level: Quarterly. Kill individual pods, fail over databases, drain nodes. Verify auto-healing works. These are fast (30 minutes) and low-risk.
- AZ-level: Bi-annually. Simulate entire AZ failure using AWS FIS. Verify traffic shifts, pods reschedule, databases fail over. This is the minimum for any production system claiming multi-AZ resilience.
- Region-level (DR): Annually. For services with cross-region DR, actually fail over to the DR region. Test: DNS failover (Route53 health checks), data replication lag (RDS cross-region replicas), application configuration (are environment variables region-aware?). This is the most expensive test but the most valuable.
- Tabletop exercises: Monthly. Walk through failure scenarios on paper with the on-call team. ‘What would you do if us-east-1 went down at 3 AM?’ This builds muscle memory without the operational risk.
After each test, document gaps and fix them before the next test. Track improvement over time: DR failover went from 45 minutes (first test) to 8 minutes (after automation).”
“Design a chaos engineering program for a payments platform — what experiments would you run first?”
“Payments is the highest-risk system, so I’d start with the most likely failure modes and work outward:
- Pod kills (week 1): Delete 50% of payment-api pods. Verify HPA scales replacements, zero dropped transactions, no 5xx responses. This is the safest experiment and builds team confidence.
- Database failover (week 2): Trigger Aurora failover. Measure: connection error duration, transaction retry success rate, data consistency post-failover. Most apps have bugs in their database reconnection logic.
- Downstream dependency failure (week 3): Inject latency (5 seconds) into calls to the fraud detection service. Verify: circuit breaker opens, payment falls back to rules-based scoring, timeout doesn’t cascade.
- Network partition (week 4): Block traffic between payment pods and Redis cache. Verify: cache miss handling works, database isn’t overwhelmed, circuit breaker on cache client.
- AZ failure (month 2): Only after components pass individual tests. Simulate full AZ failure. Verify end-to-end payment flow still works with degraded capacity.
Every experiment has a stop condition: abort if payment success rate drops below 99%. All experiments start in staging with synthetic traffic, graduate to production with canary scope (5% of traffic).”
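A stop condition like "abort if payment success rate drops below 99%" is worth automating rather than eyeballing during the experiment. A minimal sketch; the function name and threshold parameter are illustrative:

```python
def should_abort(successes: int, total: int, floor_pct: float = 99.0) -> bool:
    """Abort the chaos experiment if the payment success rate drops below the floor."""
    if total == 0:
        return False  # no traffic observed yet; nothing to judge
    return 100.0 * successes / total < floor_pct
```

In practice this check runs on a sliding window of recent requests (e.g., the last minute), so a single failed request early in the window does not trip the abort.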
## DORA Metrics

DORA (DevOps Research and Assessment) metrics are the industry standard for measuring software delivery performance. Originating from the “Accelerate” research by Nicole Forsgren, Jez Humble, and Gene Kim, these four metrics have been validated across thousands of organizations as predictive of both organizational performance and employee satisfaction. They measure the velocity and stability of software delivery — and the research consistently shows that speed and stability are not trade-offs but complementary. Elite teams deploy faster AND have fewer failures.
For a platform team, DORA metrics serve a dual purpose. First, they measure the platform team’s own delivery performance (how fast can we ship platform improvements?). Second, they measure the platform’s effectiveness — if the platform provides good CI/CD, deployment tooling, and observability, tenant teams should achieve better DORA metrics than if they were building everything themselves. A platform team that ships golden paths but whose tenants still deploy once a month has a product problem, not a technology problem.
The danger of DORA metrics is using them as performance targets rather than diagnostic tools. If you tell teams “your deployment frequency must be daily,” they will game the metric by deploying trivial changes. The metrics should inform improvement programs: “our lead time is 2 weeks — where is the bottleneck? Manual QA approval? Slow CI pipeline? Waiting for change advisory board?” Fix the constraint, and the metric improves naturally.
### The 4 Key Metrics

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Lead Time for Changes | < 1 hour | 1 day - 1 week | 1 week - 1 month | > 1 month |
| Deployment Frequency | On demand (multiple/day) | Weekly - monthly | Monthly - 6 months | > 6 months |
| Change Failure Rate | 0-15% | 16-30% | 16-30% | > 30% |
| Mean Time to Recovery | < 1 hour | < 1 day | 1 day - 1 week | > 1 week |
### How to Measure Each Metric

```text
DORA Metrics Measurement Architecture
=======================================

Lead Time for Changes
├── Definition: time from PR merge to production deployment
├── Source: GitHub API (merge timestamp) → ArgoCD (sync timestamp)
├── Calculation: argocd_sync_time - github_merge_time
├── Granularity: per-service, per-team, per-org
└── Exclude: rollbacks, config-only changes

Deployment Frequency
├── Definition: how often you successfully deploy to production
├── Source: ArgoCD sync events, Helm release history, CI/CD pipeline runs
├── Calculation: count(successful_deployments) / time_period
├── Granularity: per-service (not per-repo — one repo may have multiple services)
└── Include: only production deployments (not dev/staging)

Change Failure Rate
├── Definition: % of deployments that cause a failure requiring remediation
├── Source: deployment events correlated with incident/rollback events
├── Calculation: deployments_causing_rollback_or_incident / total_deployments
├── What counts as failure: rollback, hotfix, incident triggered within 1 hour
└── Exclude: planned rollbacks (feature flag toggling)

Mean Time to Recovery (MTTR)
├── Definition: time from incident detection to resolution
├── Source: PagerDuty (alert_created → incident_resolved timestamps)
├── Calculation: avg(resolved_at - created_at) for SEV-1 and SEV-2 incidents
├── Track separately: time_to_detect, time_to_mitigate, time_to_resolve
└── Exclude: planned maintenance windows
```

### How to Improve Each Metric
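Once the events are collected, the lead-time and change-failure-rate calculations described above reduce to simple arithmetic. A sketch over (merge_time, deploy_time, caused_incident) tuples that you would assemble from GitHub/ArgoCD/PagerDuty webhooks; the tuple layout and sample values are illustrative:

```python
from datetime import datetime, timedelta

# (PR merge time, production deploy time, did it cause an incident/rollback?)
deployments = [
    (datetime(2026, 3, 1, 9, 0),  datetime(2026, 3, 1, 10, 30), False),
    (datetime(2026, 3, 2, 14, 0), datetime(2026, 3, 2, 18, 0),  True),
    (datetime(2026, 3, 3, 11, 0), datetime(2026, 3, 3, 11, 45), False),
]

def mean_lead_time(deploys) -> timedelta:
    """Average merge-to-production time across deployments."""
    deltas = [deployed - merged for merged, deployed, _ in deploys]
    return sum(deltas, timedelta()) / len(deltas)

def change_failure_rate(deploys) -> float:
    """Fraction of deployments that required remediation."""
    return sum(1 for *_, failed in deploys if failed) / len(deploys)
```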
```text
Improvement Strategies per Metric
====================================

Lead Time for Changes (target: < 1 day)
├── Smaller PRs: enforce max 400 lines changed (large PRs wait for review)
├── Automated testing: eliminate manual QA gate (shift left)
├── Trunk-based development: short-lived branches (< 1 day), merge frequently
├── Fast CI pipeline: target < 10 minutes (parallelize tests, cache dependencies)
└── Auto-merge: if CI passes and 1 approval, auto-merge to main

Deployment Frequency (target: on-demand, multiple per day)
├── Feature flags: deploy code without releasing features (LaunchDarkly, Flagsmith)
├── Automated CI/CD: no manual steps between merge and production
├── Reduce manual approvals: replace CAB meetings with automated policy checks
├── Microservices: independent deployment (team deploys without coordinating)
└── Canary/progressive: automated rollout reduces fear of deploying

Change Failure Rate (target: < 15%)
├── Better testing: unit, integration, contract tests in CI
├── Shift left: catch issues before merge, not after deployment
├── Canary deployments: expose to 5% traffic, monitor, then full rollout
├── Automated rollback: Argo Rollouts with analysis (rollback on error spike)
└── Feature flags: decouple deployment from release (bad feature? toggle off)

Mean Time to Recovery (target: < 1 hour)
├── Better observability: correlated metrics, logs, traces (single pane of glass)
├── Runbooks: every alert has a step-by-step remediation guide
├── Auto-remediation: PagerDuty → Lambda → rollback (no human needed for known issues)
├── Fast rollback: Argo Rollouts instant rollback, database migration rollback scripts
└── Blameless postmortems: learn from failures, fix systemic issues
```

### Platform Team’s Role in DORA Metrics
The platform team does not directly control tenant teams’ DORA metrics, but it provides the tooling and golden paths that enable elite performance. A well-built platform removes the friction that causes slow lead times and low deployment frequency. If a team has to wait 3 days for a namespace to be created, file a ticket for a load balancer, and manually configure CI/CD — their DORA metrics will be terrible regardless of their engineering talent.
```text
Platform Team's DORA Enablement
==================================

Lead Time:
├── Provide: fast CI/CD pipelines (< 10 min), pre-built pipeline templates
├── Provide: automated code review tools (linting, security scanning)
└── Measure: pipeline duration as a platform SLO

Deployment Frequency:
├── Provide: ArgoCD with auto-sync, Argo Rollouts for canary
├── Provide: feature flag infrastructure (integrated with deployment)
└── Measure: deployments per team per week (detect teams stuck on manual)

Change Failure Rate:
├── Provide: canary deployment framework, automated rollback
├── Provide: contract testing infrastructure, staging environments
└── Measure: rollback rate per team (identify teams needing support)

MTTR:
├── Provide: Grafana dashboards, alerting framework, runbook templates
├── Provide: incident management tooling (PagerDuty integration, Slack bots)
└── Measure: incident detection time as a platform SLO

Reporting:
├── Dashboard: DORA metrics per team, per org, trend over time
├── Weekly: automated DORA report to engineering leadership
├── Quarterly: DORA review with each team — identify improvement areas
└── Important: measure per-team, not just org-wide averages
    (org average hides underperforming teams)
```

### Interview Scenarios for DORA
“How do you measure and improve engineering velocity across a 200-person engineering org?”
“I’d implement DORA metrics as the primary velocity measurement framework, with automated data collection and per-team dashboards:
Measurement infrastructure:
- Lead Time: instrument the CI/CD pipeline. Record PR merge timestamp (GitHub webhook), deployment timestamp (ArgoCD sync event), calculate delta. Store in a time-series database (VictoriaMetrics) for trending.
- Deployment Frequency: count ArgoCD sync events per service per week. Create a Grafana dashboard showing frequency by team with trendlines.
- Change Failure Rate: correlate deployment events with PagerDuty incidents and ArgoCD rollback events. If an incident opens within 1 hour of a deployment, that deployment counts as a failure.
- MTTR: PagerDuty API provides incident lifecycle timestamps. Calculate time-to-detect (alert lag), time-to-mitigate, and total resolution time.
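The MTTR decomposition in that last bullet is equally mechanical once lifecycle timestamps are exported. A sketch; the record layout and sample values are illustrative, not the PagerDuty API schema:

```python
from datetime import datetime, timedelta

# Incident lifecycle records as you might export them from a paging tool.
incidents = [
    {"created": datetime(2026, 3, 1, 10, 19),
     "acknowledged": datetime(2026, 3, 1, 10, 21),
     "resolved": datetime(2026, 3, 1, 10, 45)},
    {"created": datetime(2026, 3, 5, 2, 0),
     "acknowledged": datetime(2026, 3, 5, 2, 10),
     "resolved": datetime(2026, 3, 5, 3, 0)},
]

def mean_delta(records, start_key: str, end_key: str) -> timedelta:
    """Average duration between two lifecycle timestamps across incidents."""
    total = sum((r[end_key] - r[start_key] for r in records), timedelta())
    return total / len(records)

# MTTR = detection → resolution; time-to-acknowledge = detection → ack
mttr = mean_delta(incidents, "created", "resolved")
tta = mean_delta(incidents, "created", "acknowledged")
```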
Improvement program:
- Baseline: measure current state for all 20 teams for 4 weeks
- Identify bottlenecks: which metric is worst? Lead time is usually the first constraint.
- Platform interventions: if lead time is slow because CI takes 30 minutes, invest in pipeline optimization. If deployment frequency is low because teams fear deployments, invest in canary rollout tooling.
- Team-level coaching: pair a platform engineer with the lowest-performing teams for 2-week improvement sprints.
- Celebrate improvement: monthly engineering all-hands showing DORA trends. Recognize teams that improved the most.
Critical rule: DORA metrics inform, they do not punish. Never tie DORA to individual performance reviews. The moment you do, teams game the metrics instead of improving their actual practices.”
## Interview Scenarios

### Scenario 1: Design On-Call for a Platform Team

“You’re setting up on-call for a 12-person platform team supporting 200 developers. How do you structure it?”
Strong Answer:
“With 12 engineers, I’d set up two on-call rotations:
Primary rotation: 1 week on / 5 weeks off, 6 engineers. This person is the first responder for all platform alerts — cluster health, networking, shared services (Grafana, ArgoCD). Response SLA: 5 minutes for SEV-1, 15 minutes for SEV-2.
Secondary rotation: 1 week on / 5 weeks off, remaining 6 engineers. Secondary is the escalation path and also handles lower-severity tickets during business hours.
Alert design is critical:
- PagerDuty for SEV-1/SEV-2 (pages primary, auto-escalates to secondary after 15 min)
- Slack for SEV-3/SEV-4 (notifications during business hours only)
- Target: fewer than 2 pages per on-call shift (week). More than that means we have reliability problems to fix.
Compensation: Additional pay or comp time for on-call weeks. If someone gets paged at 3 AM, they start late the next day. On-call burnout is the number one reason platform engineers quit.
Tenant team on-call: Each app team runs their own on-call for application-level issues. Platform team provides the alerting framework (vmalert rules, PagerDuty integration, runbook templates). We only get involved when the issue is infrastructure-level.”
### Scenario 2: Postmortem Culture

“Your team had a major outage. Leadership wants to fire the engineer who caused it. How do you handle this?”
Strong Answer:
“I’d firmly advocate for a blameless postmortem. Here’s my argument to leadership:

1. The engineer didn’t cause the outage — the system did. If one person’s action can bring down production, we have a systemic problem: missing guardrails, insufficient testing, or inadequate deployment safeguards.

2. Punishing people creates a fear culture. If we fire this person, every other engineer will hide their mistakes instead of reporting them. We’ll lose the early warning system that prevents bigger incidents.

3. The fix is systemic. Instead of ‘engineer should have been more careful,’ the action items should be: ‘add automated rollback when error rate exceeds threshold,’ ‘require staging validation before prod deploy,’ ‘add database migration dry-run step in CI.’

4. Google’s approach: Google SRE has a well-documented blameless postmortem culture. They’ve found that the teams with the best reliability are the ones where people feel safe reporting issues.

What I’d do: Run a blameless postmortem within 48 hours. Focus the discussion on: what happened, what contributing factors existed, what systemic changes prevent recurrence. Publish the postmortem internally. The engineer who ‘caused’ it should present — it’s empowering, not punitive.”
Scenario 3: SLO Design for a New Service
“A team is launching a new real-time fraud detection API. Help them define SLOs.”
Strong Answer:
“Fraud detection is critical path — false negatives (missed fraud) are worse than latency. I’d define:
SLO 1 — Availability: 99.99% (4.32 min/month error budget)
- SLI: Proportion of requests returning non-5xx responses
- Rationale: Every missed API call means a transaction goes unscored, potentially allowing fraud
SLO 2 — Latency: 99.9% of requests < 200ms (P999 < 200ms)
- SLI: Proportion of requests completing within 200ms
- Rationale: This API is called synchronously during payment processing. >200ms adds noticeable delay to customer checkout
SLO 3 — Correctness: 99.99% of evaluations match the batch model
- SLI: Proportion of real-time scores that agree with the nightly batch reprocessing
- Rationale: Ensures the real-time model isn’t drifting from the trained model
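The error budget quoted for SLO 1 follows directly from the availability target; a quick sketch in Python (the 30-day window is an assumption, matching the monthly figure above):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) over the window for a given availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60

# 99.99% over 30 days leaves 4.32 minutes of error budget per month
print(round(error_budget_minutes(0.9999), 2))   # 4.32
# For comparison, 99.9% leaves 43.2 minutes
print(round(error_budget_minutes(0.999), 1))    # 43.2
```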
Error budget policy:
- > 50% budget remaining: Ship features, A/B test new models
- 25-50%: Freeze model updates, investigate reliability
- < 25%: Feature freeze, all hands on reliability
- 0%: Leadership review, consider fallback to rules-based scoring
Alerting: Multi-window burn rate alerts. Fast burn (14.4x) pages immediately. Slow burn (1x over 3 days) creates a ticket.
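The fast-burn alert can be sketched as a vmalert/Prometheus rule. This is illustrative only: the `http_requests_total` metric and `job` label are assumptions, and 14.4x burn of a 99.99% SLO means alerting when the error ratio exceeds 14.4 × 0.0001 = 0.144%:

```yaml
groups:
  - name: fraud-api-slo
    rules:
      - alert: FraudAPIErrorBudgetFastBurn
        # Fire when a 14.4x burn rate is sustained over both a long (1h)
        # and a short (5m) window, so the alert clears quickly once fixed.
        expr: |
          (
            sum(rate(http_requests_total{job="fraud-api", code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="fraud-api"}[1h]))
          ) > (14.4 * 0.0001)
          and
          (
            sum(rate(http_requests_total{job="fraud-api", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="fraud-api"}[5m]))
          ) > (14.4 * 0.0001)
        for: 2m
        labels:
          severity: sev-1
```

The slow-burn companion (1x over 3 days) would use the same shape with longer windows and route to a ticket instead of a page.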
Dashboard: Dedicated Grafana dashboard showing all three SLIs, error budget remaining, burn rate trend, and latency percentile heatmap.”
Scenario 4: Reducing Alert Fatigue
“Your platform team gets 150 alerts per day. On-call engineers are burning out. Fix it.”
Strong Answer:
“150 alerts/day means the vast majority are noise. Here’s my 4-week plan:
Week 1: Audit
- Export all alert firings from PagerDuty/Alertmanager for the last 30 days
- Categorize each: actionable (required a human fix), auto-resolved (cleared on its own in <5 min), noise (false positive)
- Result: typically 70-80% are noise or auto-resolved
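The audit step above can be scripted. A minimal sketch, assuming each exported firing carries a duration and a flag for whether a human intervened; the field names (`duration`, `human_action_taken`) are hypothetical, not a real PagerDuty export schema:

```python
from collections import Counter
from datetime import timedelta

def categorize(firing: dict) -> str:
    """Bucket one alert firing: 'actionable', 'auto-resolved', or 'noise'."""
    if firing["human_action_taken"]:
        return "actionable"
    if firing["duration"] <= timedelta(minutes=5):
        return "auto-resolved"   # cleared on its own before anyone could act
    return "noise"               # lingered with no human action: likely a bad alert

firings = [
    {"alert": "PaymentAPIErrorBudgetBurn", "duration": timedelta(minutes=42), "human_action_taken": True},
    {"alert": "NodeCPUHigh",               "duration": timedelta(minutes=3),  "human_action_taken": False},
    {"alert": "DiskUsageWarning",          "duration": timedelta(hours=6),    "human_action_taken": False},
]
print(Counter(categorize(f) for f in firings))
```

Sorting the resulting buckets by alert name quickly surfaces which rules to delete or re-threshold in week 2.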
Week 2: Eliminate noise
- Delete alerts that never resulted in action (e.g., ‘CPU > 80%’ that auto-recovers every time)
- Increase thresholds on alerts that fire too frequently (was 70% CPU, set to 90%)
- Add `for: 5m` or `for: 10m` to eliminate transient spikes
- Result: 150 → ~40 alerts/day
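The `for:` change looks like this in a vmalert/Prometheus rule; the metric and 90% threshold match the CPU example above, but the rule itself is illustrative:

```yaml
- alert: NodeCPUHigh
  # Fire only if CPU stays above 90% for 10 consecutive minutes,
  # so a transient spike that auto-recovers never pages anyone.
  expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.90
  for: 10m
  labels:
    severity: sev-3
```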
Week 3: Switch to SLO-based
- Replace symptom-based alerts with error budget burn rate alerts
- Example: Instead of ‘pod CPU high’ + ‘pod memory high’ + ‘pod restart’ (3 alerts), use ‘99.9% availability SLO burning 14.4x’ (1 alert that captures the actual user impact)
- Result: 40 → ~15 alerts/day
Week 4: Improve routing
- SEV-1/SEV-2: PagerDuty (pages on-call)
- SEV-3: Slack notification (business hours only)
- SEV-4: Ticket in Jira (triage in next sprint)
- Result: On-call gets ~2-3 pages per shift instead of being buried in 150 alerts/day
Target: <5 pages per on-call week. Every page should be actionable and have a runbook.”
Scenario 5: Incident Response Exercise
“Walk me through how you’d handle: at 2 AM, PagerDuty pages you. The payment API error rate jumped from 0.1% to 45%.”
Strong Answer:
“Here’s my response timeline:
0-2 minutes: Acknowledge + Assess
- Acknowledge PagerDuty alert on my phone
- Open Grafana on laptop
- Check: Is this a partial outage (some endpoints) or total (all endpoints)?
- Check: When did it start? Was there a deployment? (ArgoCD history)
- Quick assessment: this is SEV-1 (revenue impact, 45% of payment requests failing)
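The partial-vs-total check can be two Grafana Explore queries. The metric (`http_requests_total`) and labels (`job`, `path`) are assumptions about the service's instrumentation:

```promql
# Overall error ratio (should show the jump from 0.1% to 45%)
sum(rate(http_requests_total{job="payment-api", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="payment-api"}[5m]))

# Break down by endpoint to see whether the outage is partial or total
sum by (path) (rate(http_requests_total{job="payment-api", code=~"5.."}[5m]))
  / sum by (path) (rate(http_requests_total{job="payment-api"}[5m]))
```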
2-5 minutes: Open incident
- Create #incident-payment-api Slack channel
- Post: ‘SEV-1: Payment API 45% error rate. IC: [me]. Investigating.’
- If there was a deployment in the last hour → immediate rollback: `kubectl -n payments rollout undo deployment/payment-api`
- This is the fastest mitigation. Root cause can wait.
5-10 minutes: If no deployment (or rollback didn’t help)
- Check dependencies:
  - Database: `{namespace="payments"} |= "connection" |= "refused"` in Loki
  - Redis: check Memorystore/ElastiCache metrics in Grafana
  - External services: check span errors in Tempo
- Check infrastructure: node health, pod scheduling, DNS resolution
10-15 minutes: Escalate if needed
- If I can’t identify the cause, escalate to Tier 2 (page the secondary on-call)
- Post update in Slack: ‘Still investigating. No deployment correlation. Checking DB and downstream.’
15-30 minutes: Mitigate
- Once root cause identified, apply fix or workaround
- Example: DB connection pool exhausted → scale up RDS reader, increase pool size
Post-resolution:
- Confirm error rate back to baseline for 15 minutes
- Post ‘ALL CLEAR’ in incident channel with summary
- Schedule blameless postmortem for next business day
- Write draft timeline while it’s fresh”
References
- AWS Well-Architected Framework: Reliability Pillar — fault isolation, recovery, and operational readiness
- Google SRE Book (free online) — foundational text on SRE practices, SLOs, error budgets, and toil
- Google SRE Workbook (free online) — practical companion with implementation guidance
- Google Cloud Architecture Framework: Reliability — reliability design principles for GCP
Tools & Frameworks
- PagerDuty Incident Response Guide — open-source incident response process, roles, and best practices
- Atlassian Incident Management Handbook — severity levels, communication templates, and postmortems
- Alertmanager Documentation — alert routing, grouping, inhibition, and silencing
- SLO Generator (Google) — open-source tool for computing SLIs and error budgets