# SRE & Incident Management
## Where This Fits

SRE is how the central platform team ensures reliability at scale. You define the SLOs, build the incident management process, and provide the tooling. Tenant teams adopt the framework for their services.
## On-Call Structure

### Enterprise On-Call Model

### Severity Levels
Section titled “Severity Levels”| Level | Name | Impact | Response Time | Example |
|---|---|---|---|---|
| SEV-1 | Critical | Revenue/data loss, all users affected | 5 min ack, 15 min response | Payment system down, database corruption |
| SEV-2 | High | Major feature degraded, many users affected | 15 min ack, 30 min response | Login failing for 50% of users, API latency >10s |
| SEV-3 | Medium | Minor feature degraded, some users affected | 1 hr ack, 4 hr response | Search is slow, non-critical API returning errors |
| SEV-4 | Low | Cosmetic, workaround available | Next business day | Dashboard rendering issue, log formatting error |
### Severity Decision Tree

## Incident Lifecycle
## Runbook Template

````markdown
# Runbook: [Alert Name]

## Overview
- **Service**: [service name]
- **SLO impacted**: [which SLO this alert protects]
- **Severity**: [SEV-1/2/3]
- **Owner team**: [team name]
- **Last updated**: [date]

## Alert Condition
```promql
[the exact alert expression from vmalert/Prometheus]
```

## Impact
[What user-facing behavior does this cause?]

## Quick Diagnosis (< 5 minutes)

### Step 1: Verify the alert is real
```bash
# Check if the service is actually down
kubectl -n payments get pods -l app=payment-api
kubectl -n payments top pods -l app=payment-api
```

### Step 2: Check recent deployments
```bash
# Was there a deployment in the last hour?
kubectl -n payments rollout history deployment/payment-api

# Check ArgoCD for recent syncs
argocd app history payment-api
```

### Step 3: Check dependencies
```bash
# Database connectivity
kubectl -n payments exec -it deploy/payment-api -- \
  pg_isready -h payments-db.xxx.rds.amazonaws.com

# Redis connectivity
kubectl -n payments exec -it deploy/payment-api -- \
  redis-cli -h payments-redis.xxx.cache.amazonaws.com ping
```

## Mitigation Options

### Option A: Rollback (fastest, safest)
```bash
# Rollback to previous deployment
kubectl -n payments rollout undo deployment/payment-api

# OR via ArgoCD
argocd app rollback payment-api
```

### Option B: Scale up (if load-related)
```bash
kubectl -n payments scale deployment/payment-api --replicas=10
```

### Option C: Restart pods (if stuck state)
```bash
kubectl -n payments rollout restart deployment/payment-api
```

## Escalation
- Tier 2: @platform-senior in #incident-response Slack
- Tier 3: @eng-manager via PagerDuty
- Database issues: @dba-team in #db-support

## Related Dashboards
[links to the service and dependency dashboards]

## Previous Incidents
[links to past postmortems triggered by this alert]
````
Section titled “Previous Incidents”---
## Postmortem Template
```markdown
# Postmortem: [Incident Title]

**Date**: [incident date]
**Duration**: [start time] to [resolution time] ([total duration])
**Severity**: [SEV-1/2/3]
**Incident Commander**: [name]
**Author**: [name]
**Status**: [Action items complete / In progress]

## Summary
[2-3 sentences describing what happened, impact, and resolution]

## Impact
- **Users affected**: [number or percentage]
- **Revenue impact**: [estimated, if applicable]
- **Error budget consumed**: [X% of 30-day budget]
- **SLO breach**: [Yes/No — which SLO]

## Timeline (all times UTC)

| Time | Event |
|------|-------|
| 10:15 | Deployment of payment-api v2.3.1 begins |
| 10:18 | Error rate increases from 0.01% to 15% |
| 10:19 | PagerDuty alert fires: PaymentAPIHighErrorRate |
| 10:21 | On-call acknowledges, opens Grafana |
| 10:25 | Identifies new deployment as trigger |
| 10:28 | Initiates rollback to v2.3.0 |
| 10:32 | Rollback complete, error rate drops to 0.01% |
| 10:45 | Confirmed stable, incident resolved |

## Root Cause
[Detailed technical explanation. What specifically failed and why.]

The v2.3.1 deployment included a database migration that added an index
on the `transactions` table. The index creation locked the table for
4 minutes, causing all payment queries to time out. The migration was
tested against a dev database with 1K rows (completed in under 1s), but
production has 50M rows.

## Contributing Factors
1. Database migration was not tested against production-sized data
2. Migration ran synchronously during deployment instead of as a separate, pre-deployment step
3. No canary deployment — migration applied to all pods simultaneously

## Detection
- Detected by: Automated alert (vmalert → PagerDuty)
- Time to detect: 1 minute (alert at 10:19, deployment at 10:18)
- Detection was: GOOD — fast automated detection

## Response
- Time to acknowledge: 2 minutes
- Time to mitigate: 13 minutes (rollback completed at 10:32)
- Response was: ADEQUATE — rollback was the right call

## Lessons Learned

### What went well
- Fast automated detection (1 minute)
- On-call had clear runbook for rollback
- PagerDuty escalation worked correctly

### What went wrong
- Migration not tested against prod-sized data
- No canary deployment for database changes
- Rollback took 4 minutes (could be faster with Argo Rollouts)

### Where we got lucky
- Incident happened during business hours (engineers available)
- Only 4 minutes of lock (if the table had more rows, it could have been hours)

## Action Items

| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P0 | Add migration testing with prod-sized data to CI | @dev-lead | 2026-03-22 | TODO |
| P0 | Implement canary deployments for DB migrations | @platform | 2026-03-29 | TODO |
| P1 | Add pre-deploy migration dry-run step | @dev-lead | 2026-04-05 | TODO |
| P2 | Evaluate Argo Rollouts for faster rollback | @platform | 2026-04-15 | TODO |
```

## SLO-Driven Operations

### Error Budget Policy

### Common SLO Targets for Banking
Section titled “Common SLO Targets for Banking”| Service | Availability SLO | Latency SLO (P99) | Error Budget (30d) |
|---|---|---|---|
| Payment API | 99.99% | < 500ms | 4.32 min |
| Customer Portal | 99.9% | < 2s | 43.2 min |
| Internal Tools | 99.5% | < 5s | 3.6 hours |
| Batch Processing | 99.0% | N/A (throughput) | 7.2 hours |
| Platform (K8s) | 99.95% | N/A | 21.6 min |
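The error-budget column follows directly from the availability target: budget = window × (1 − SLO). A quick sketch to reproduce the figures in the table:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes over the window for a given availability SLO."""
    return window_days * 24 * 60 * (1 - slo_percent / 100)

# Reproduces the table above:
# 99.99% → 4.32 min; 99.9% → 43.2 min; 99.5% → 216 min (3.6 h); 99.0% → 432 min (7.2 h)
```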
## Toil Reduction

### Identifying Toil

### Common Toil in Platform Teams
Section titled “Common Toil in Platform Teams”| Toil | Frequency | Automation |
|---|---|---|
| Creating new namespaces | 5/week | Terraform module + GitOps |
| Rotating secrets | Monthly | External Secrets Operator + rotation policy |
| Upgrading K8s versions | Quarterly | GKE release channels / EKS blue-green automation |
| Investigating OOM kills | 3/week | VPA recommendations + right-sizing alerts |
| SSL certificate renewal | Monthly | cert-manager + auto-renewal |
| User access requests | 10/week | RBAC self-service via Git PR |
| Scaling during peak | Weekly | HPA + Karpenter (reactive auto-scaling) |
## Chaos Engineering

Designing for reliability is necessary but insufficient. You can build multi-AZ deployments, configure auto-scaling, and write runbooks — but until you deliberately break something and watch what happens, you are operating on hope. Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It originated at Netflix (Chaos Monkey) and has been adopted by every major tech company as a standard reliability practice.
The core principle is the scientific method applied to infrastructure. You define what “normal” looks like (steady state), hypothesize what will happen when a specific component fails, inject that failure in a controlled way, observe whether reality matches your hypothesis, and learn from the gaps. The goal is not to break things for fun — it is to find weaknesses before your customers do. Every chaos experiment that reveals an unexpected failure mode is a production outage prevented.
The hardest part of chaos engineering is not the tooling — it is the organizational courage to run experiments in production. Non-production environments often lack the scale, traffic patterns, and data volumes that cause real failures. A service that handles AZ failure gracefully with 100 requests per second may fail catastrophically at 10,000 requests per second because of connection pool exhaustion or load balancer warm-up times. Start with non-production to build confidence, but graduate to production with strict safety controls.
### The Scientific Method for Reliability

```text
Chaos Engineering Experiment Lifecycle
========================================

1. Define Steady State
   └── What does "normal" look like?
   └── Metrics: latency p99 < 200ms, error rate < 0.1%, throughput > 5000 req/s
   └── These are your control measurements

2. Hypothesize
   └── "If AZ-a goes down, traffic shifts to AZ-b and AZ-c within 30 seconds"
   └── "If the database primary fails, Aurora fails over to reader within 60 seconds"
   └── "If 50% of pods are killed, HPA scales replacements within 2 minutes"
   └── Be specific — vague hypotheses produce vague learnings

3. Inject Failure (Controlled)
   └── Use tooling (FIS, Litmus, Chaos Mesh) — never manual kubectl delete
   └── Define blast radius limits
   └── Set automatic stop conditions
   └── Have rollback procedures ready

4. Observe
   └── Compare actual metrics against steady-state baseline
   └── Did traffic shift? How long did it take?
   └── Were there error spikes? How many users affected?
   └── Did alerts fire? Did runbooks work?

5. Learn and Improve
   └── Document findings (hypothesis vs reality)
   └── Create action items for gaps
   └── Share learnings across teams
   └── Schedule follow-up experiment after fixes
```
### Chaos Engineering Tools

AWS Fault Injection Simulator (FIS):
AWS FIS is a fully managed chaos engineering service. It provides pre-built experiment templates for common failure modes across EC2, ECS, EKS, RDS, and networking. The key advantage over open-source tools is native integration with AWS services — you can simulate an entire AZ failure with a single API call, something that is extremely difficult to do with kubectl-based tools.
Experiment types:
- EC2: Stop/terminate instances, inject CPU/memory stress, network latency, packet loss
- EKS: Delete pods, drain nodes, inject network latency between pods
- RDS: Failover Aurora cluster, inject replication lag
- Networking: Disrupt connectivity to specific AZs, inject DNS failures
- ECS: Stop tasks, inject CPU stress into containers
Safety controls:
- Stop conditions: abort experiment if CloudWatch alarm fires (e.g., error rate > 5%)
- IAM role scoping: limit which resources FIS can affect (only non-prod, or only specific tagged resources)
- Duration limits: experiments auto-stop after defined time
- Rollback: FIS reverses injected faults when experiment ends
```hcl
# Terraform: AWS FIS Experiment — AZ failure simulation
resource "aws_fis_experiment_template" "az_failure" {
  description = "Simulate AZ-a failure for payment service"
  role_arn    = aws_iam_role.fis_role.arn

  action {
    name      = "stop-instances-az-a"
    action_id = "aws:ec2:stop-instances"

    parameter {
      key   = "startInstancesAfterDuration"
      value = "PT5M" # Restart after 5 minutes
    }

    target {
      key   = "Instances"
      value = "instances-az-a"
    }
  }

  target {
    name           = "instances-az-a"
    resource_type  = "aws:ec2:instance"
    selection_mode = "ALL"

    resource_tag {
      key   = "aws:ec2:availability-zone"
      value = "us-east-1a"
    }

    resource_tag {
      key   = "Environment"
      value = "staging" # Only target staging
    }
  }

  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.high_error_rate.arn
  }

  tags = {
    Purpose = "chaos-engineering"
    Team    = "platform"
  }
}

# Stop condition: abort if the ALB reports more than 50 5xx errors per minute
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "fis-stop-condition-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "5XXError"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 50
  alarm_description   = "Stop chaos experiment if error rate too high"
}
```

GCP does not have a native chaos engineering service equivalent to AWS FIS. The recommended approaches are:
- Litmus Chaos on GKE: Open-source, Kubernetes-native chaos engineering platform. Deploy via Helm, define experiments as CRDs.
- Chaos Mesh on GKE: Alternative open-source tool with stronger network and IO fault injection.
- Gremlin (SaaS): Commercial chaos engineering platform that works across GCP, AWS, and bare metal.
- Manual GCP API calls: Use `gcloud` to simulate failures (stop VMs, failover Cloud SQL, drain GKE nodes).
```yaml
# Litmus Chaos on GKE — Pod Delete Experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete
  namespace: payments
spec:
  appinfo:
    appns: payments
    applabel: app=payment-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"   # Kill pods for 60 seconds
            - name: CHAOS_INTERVAL
              value: "10"   # Every 10 seconds
            - name: FORCE
              value: "true" # Force delete (no graceful shutdown)
            - name: PODS_AFFECTED_PERC
              value: "50"   # Kill 50% of pods
```

### Kubernetes-Native Chaos Tools

```text
Litmus Chaos vs Chaos Mesh Comparison
=======================================

Litmus Chaos:
├── ChaosExperiments: pod-delete, node-drain, network-loss, disk-fill, pod-cpu-hog
├── ChaosEngine: orchestrates experiments against target applications
├── ChaosResult: captures outcome (pass/fail) with detailed logs
├── ChaosHub: community library of pre-built experiments
├── Integration: ArgoCD (GitOps-driven chaos), Prometheus (metrics)
└── Best for: teams wanting a catalog of pre-built experiments

Chaos Mesh:
├── PodChaos: pod-kill, pod-failure, container-kill
├── NetworkChaos: delay, loss, duplicate, corrupt, partition
├── IOChaos: latency, fault, attribute override on filesystem
├── TimeChaos: time-skew (test clock-sensitive code like cert expiry)
├── JVMChaos: method-level latency injection for Java apps
├── DNSChaos: resolve to wrong IP, inject DNS failures
├── Dashboard: built-in web UI for experiment management
└── Best for: teams needing deep network/IO fault injection
```

### Game Day Planning
A game day is a structured chaos engineering exercise where a team deliberately injects failures and observes the system’s response. It is the infrastructure equivalent of a fire drill. The difference between a game day and randomly deleting pods is planning, safety controls, and structured learning. Well-run game days build team confidence and frequently uncover issues that desk reviews and architecture diagrams miss.
```text
Game Day Execution Template
=============================

Pre-Game (1-2 days before):
  1. Define scope
     └── Which system? (payment API, order processing pipeline)
     └── Which failure mode? (AZ failure, database failover, pod kills)
     └── What's the hypothesis? ("ALB shifts traffic within 30s")

  2. Set blast radius limits
     └── Non-prod first (staging with synthetic traffic)
     └── Production: limit to < 5% of traffic (canary-style)
     └── Never run on a Friday or during peak traffic for the first attempt

  3. Identify rollback procedures
     └── Kill switch: how to abort the experiment immediately
     └── FIS: stop experiment via console/API
     └── Litmus: kubectl delete chaosengine
     └── Manual: undo whatever was changed

  4. Brief observers
     └── Who watches which dashboards? (Grafana, CloudWatch, PagerDuty)
     └── Who has access to the kill switch?
     └── Who takes notes on timeline and observations?

  5. Get sign-off
     └── Engineering manager for staging game days
     └── VP/Director for production game days
     └── Notify customer support (in case of unexpected impact)

During Game (30-60 minutes):
  1. Verify steady state (baseline metrics look normal)
  2. Start recording (screen record dashboards)
  3. Inject failure
  4. Monitor steady-state metrics in real time
  5. Record actual behavior vs expected at each milestone
  6. Abort if stop conditions hit (error rate too high, latency too high)
  7. Allow recovery (stop experiment, wait for system to stabilize)
  8. Verify steady state restored

Post-Game (same day):
  1. Blameless debrief (30 min)
  2. Document: hypothesis, actual result, gaps found
  3. Create JIRA tickets for improvements
  4. Share findings in engineering Slack channel
  5. Schedule follow-up experiment after fixes are applied
```

### Example Game Day: AZ Failure During Peak Traffic
Scenario: "What happens when us-east-1a goes down during peak traffic?"
```text
Hypothesis:
├── ALB shifts traffic to 1b and 1c within 30 seconds
├── EKS pods reschedule via PodDisruptionBudget (min 2 replicas always available)
├── Aurora automatically fails over to reader in 1b (< 60 seconds)
├── ElastiCache Redis failover to replica in 1c (< 30 seconds)
└── No user-facing errors beyond brief latency spike

Actual Results (common findings):
├── ALB: shifted traffic in 15 seconds ✓ (better than expected)
├── EKS pods: some pods did NOT have topology spread constraints
│   └── All replacement pods landed in 1b → 1b overloaded
│   └── ACTION: Add topologySpreadConstraints to all Deployments
├── Aurora: failover took 90 seconds (hypothesis was 60) → acceptable
│   └── 12 seconds of connection errors during failover
│   └── ACTION: Implement connection retry with backoff in app
├── ElastiCache: failover took 45 seconds (hypothesis was 30)
│   └── Cache miss storm on failover → database load spike
│   └── ACTION: Implement cache warming on failover
└── Alerting: PagerDuty fired in 2 minutes ✓
    └── But runbook didn't cover multi-service failure scenario
    └── ACTION: Update runbook for correlated failures
```

### Interview Scenarios for Chaos Engineering
“How do you validate that your disaster recovery plan actually works?”
“A DR plan that hasn’t been tested is a wish, not a plan. I validate DR through structured game days with increasing scope:
- Component-level: Quarterly. Kill individual pods, fail over databases, drain nodes. Verify auto-healing works. These are fast (30 minutes) and low-risk.
- AZ-level: Bi-annually. Simulate entire AZ failure using AWS FIS. Verify traffic shifts, pods reschedule, databases fail over. This is the minimum for any production system claiming multi-AZ resilience.
- Region-level (DR): Annually. For services with cross-region DR, actually fail over to the DR region. Test: DNS failover (Route53 health checks), data replication lag (RDS cross-region replicas), application configuration (are environment variables region-aware?). This is the most expensive test but the most valuable.
- Tabletop exercises: Monthly. Walk through failure scenarios on paper with the on-call team. ‘What would you do if us-east-1 went down at 3 AM?’ This builds muscle memory without the operational risk.
After each test, document gaps and fix them before the next test. Track improvement over time: DR failover went from 45 minutes (first test) to 8 minutes (after automation).”
“Design a chaos engineering program for a payments platform — what experiments would you run first?”
“Payments is the highest-risk system, so I’d start with the most likely failure modes and work outward:
- Pod kills (week 1): Delete 50% of payment-api pods. Verify HPA scales replacements, zero dropped transactions, no 5xx responses. This is the safest experiment and builds team confidence.
- Database failover (week 2): Trigger Aurora failover. Measure: connection error duration, transaction retry success rate, data consistency post-failover. Most apps have bugs in their database reconnection logic.
- Downstream dependency failure (week 3): Inject latency (5 seconds) into calls to the fraud detection service. Verify: circuit breaker opens, payment falls back to rules-based scoring, timeout doesn’t cascade.
- Network partition (week 4): Block traffic between payment pods and Redis cache. Verify: cache miss handling works, database isn’t overwhelmed, circuit breaker on cache client.
- AZ failure (month 2): Only after components pass individual tests. Simulate full AZ failure. Verify end-to-end payment flow still works with degraded capacity.
Every experiment has a stop condition: abort if payment success rate drops below 99%. All experiments start in staging with synthetic traffic, graduate to production with canary scope (5% of traffic).”
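A stop condition like "abort if payment success rate drops below 99%" is worth automating rather than eyeballing during the experiment. A minimal sketch; the function name and threshold parameter are illustrative:

```python
def should_abort(successes: int, total: int, floor_pct: float = 99.0) -> bool:
    """Abort the chaos experiment if the payment success rate drops below the floor."""
    if total == 0:
        return False  # no traffic observed yet; nothing to judge
    return 100.0 * successes / total < floor_pct
```

In practice this check runs on a sliding window of recent requests (e.g., the last minute), so a single failed request early in the window does not trip the abort.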
## DORA Metrics

DORA (DevOps Research and Assessment) metrics are the industry standard for measuring software delivery performance. Originating from the “Accelerate” research by Nicole Forsgren, Jez Humble, and Gene Kim, these four metrics have been validated across thousands of organizations as predictive of both organizational performance and employee satisfaction. They measure the velocity and stability of software delivery — and the research consistently shows that speed and stability are not trade-offs but complementary. Elite teams deploy faster AND have fewer failures.
For a platform team, DORA metrics serve a dual purpose. First, they measure the platform team’s own delivery performance (how fast can we ship platform improvements?). Second, they measure the platform’s effectiveness — if the platform provides good CI/CD, deployment tooling, and observability, tenant teams should achieve better DORA metrics than if they were building everything themselves. A platform team that ships golden paths but whose tenants still deploy once a month has a product problem, not a technology problem.
The danger of DORA metrics is using them as performance targets rather than diagnostic tools. If you tell teams “your deployment frequency must be daily,” they will game the metric by deploying trivial changes. The metrics should inform improvement programs: “our lead time is 2 weeks — where is the bottleneck? Manual QA approval? Slow CI pipeline? Waiting for change advisory board?” Fix the constraint, and the metric improves naturally.
### The 4 Key Metrics

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Lead Time for Changes | < 1 hour | 1 day - 1 week | 1 week - 1 month | > 1 month |
| Deployment Frequency | On demand (multiple/day) | Weekly - monthly | Monthly - 6 months | > 6 months |
| Change Failure Rate | 0-15% | 16-30% | 16-30% | > 30% |
| Mean Time to Recovery | < 1 hour | < 1 day | 1 day - 1 week | > 1 week |
### How to Measure Each Metric

```text
DORA Metrics Measurement Architecture
=======================================

Lead Time for Changes
├── Definition: time from PR merge to production deployment
├── Source: GitHub API (merge timestamp) → ArgoCD (sync timestamp)
├── Calculation: argocd_sync_time - github_merge_time
├── Granularity: per-service, per-team, per-org
└── Exclude: rollbacks, config-only changes

Deployment Frequency
├── Definition: how often you successfully deploy to production
├── Source: ArgoCD sync events, Helm release history, CI/CD pipeline runs
├── Calculation: count(successful_deployments) / time_period
├── Granularity: per-service (not per-repo — one repo may have multiple services)
└── Include: only production deployments (not dev/staging)

Change Failure Rate
├── Definition: % of deployments that cause a failure requiring remediation
├── Source: deployment events correlated with incident/rollback events
├── Calculation: deployments_causing_rollback_or_incident / total_deployments
├── What counts as failure: rollback, hotfix, incident triggered within 1 hour
└── Exclude: planned rollbacks (feature flag toggling)

Mean Time to Recovery (MTTR)
├── Definition: time from incident detection to resolution
├── Source: PagerDuty (alert_created → incident_resolved timestamps)
├── Calculation: avg(resolved_at - created_at) for SEV-1 and SEV-2 incidents
├── Track separately: time_to_detect, time_to_mitigate, time_to_resolve
└── Exclude: planned maintenance windows
```

### How to Improve Each Metric
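Once the events are collected, the lead-time and change-failure-rate calculations described above reduce to simple arithmetic. A sketch over (merge_time, deploy_time, caused_incident) tuples that you would assemble from GitHub/ArgoCD/PagerDuty webhooks; the tuple layout and sample values are illustrative:

```python
from datetime import datetime, timedelta

# (PR merge time, production deploy time, did it cause an incident/rollback?)
deployments = [
    (datetime(2026, 3, 1, 9, 0),  datetime(2026, 3, 1, 10, 30), False),
    (datetime(2026, 3, 2, 14, 0), datetime(2026, 3, 2, 18, 0),  True),
    (datetime(2026, 3, 3, 11, 0), datetime(2026, 3, 3, 11, 45), False),
]

def mean_lead_time(deploys) -> timedelta:
    """Average merge-to-production time across deployments."""
    deltas = [deployed - merged for merged, deployed, _ in deploys]
    return sum(deltas, timedelta()) / len(deltas)

def change_failure_rate(deploys) -> float:
    """Fraction of deployments that required remediation."""
    return sum(1 for *_, failed in deploys if failed) / len(deploys)
```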
```text
Improvement Strategies per Metric
====================================

Lead Time for Changes (target: < 1 day)
├── Smaller PRs: enforce max 400 lines changed (large PRs wait for review)
├── Automated testing: eliminate manual QA gate (shift left)
├── Trunk-based development: short-lived branches (< 1 day), merge frequently
├── Fast CI pipeline: target < 10 minutes (parallelize tests, cache dependencies)
└── Auto-merge: if CI passes and 1 approval, auto-merge to main

Deployment Frequency (target: on-demand, multiple per day)
├── Feature flags: deploy code without releasing features (LaunchDarkly, Flagsmith)
├── Automated CI/CD: no manual steps between merge and production
├── Reduce manual approvals: replace CAB meetings with automated policy checks
├── Microservices: independent deployment (team deploys without coordinating)
└── Canary/progressive: automated rollout reduces fear of deploying

Change Failure Rate (target: < 15%)
├── Better testing: unit, integration, contract tests in CI
├── Shift left: catch issues before merge, not after deployment
├── Canary deployments: expose to 5% traffic, monitor, then full rollout
├── Automated rollback: Argo Rollouts with analysis (rollback on error spike)
└── Feature flags: decouple deployment from release (bad feature? toggle off)

Mean Time to Recovery (target: < 1 hour)
├── Better observability: correlated metrics, logs, traces (single pane of glass)
├── Runbooks: every alert has a step-by-step remediation guide
├── Auto-remediation: PagerDuty → Lambda → rollback (no human needed for known issues)
├── Fast rollback: Argo Rollouts instant rollback, database migration rollback scripts
└── Blameless postmortems: learn from failures, fix systemic issues
```

### Platform Team’s Role in DORA Metrics
The platform team does not directly control tenant teams’ DORA metrics, but it provides the tooling and golden paths that enable elite performance. A well-built platform removes the friction that causes slow lead times and low deployment frequency. If a team has to wait 3 days for a namespace to be created, file a ticket for a load balancer, and manually configure CI/CD — their DORA metrics will be terrible regardless of their engineering talent.
```text
Platform Team's DORA Enablement
==================================

Lead Time:
├── Provide: fast CI/CD pipelines (< 10 min), pre-built pipeline templates
├── Provide: automated code review tools (linting, security scanning)
└── Measure: pipeline duration as a platform SLO

Deployment Frequency:
├── Provide: ArgoCD with auto-sync, Argo Rollouts for canary
├── Provide: feature flag infrastructure (integrated with deployment)
└── Measure: deployments per team per week (detect teams stuck on manual)

Change Failure Rate:
├── Provide: canary deployment framework, automated rollback
├── Provide: contract testing infrastructure, staging environments
└── Measure: rollback rate per team (identify teams needing support)

MTTR:
├── Provide: Grafana dashboards, alerting framework, runbook templates
├── Provide: incident management tooling (PagerDuty integration, Slack bots)
└── Measure: incident detection time as a platform SLO

Reporting:
├── Dashboard: DORA metrics per team, per org, trend over time
├── Weekly: automated DORA report to engineering leadership
├── Quarterly: DORA review with each team — identify improvement areas
└── Important: measure per-team, not just org-wide averages
    (org average hides underperforming teams)
```

### Interview Scenarios for DORA
“How do you measure and improve engineering velocity across a 200-person engineering org?”
“I’d implement DORA metrics as the primary velocity measurement framework, with automated data collection and per-team dashboards:
Measurement infrastructure:
- Lead Time: instrument the CI/CD pipeline. Record PR merge timestamp (GitHub webhook), deployment timestamp (ArgoCD sync event), calculate delta. Store in a time-series database (VictoriaMetrics) for trending.
- Deployment Frequency: count ArgoCD sync events per service per week. Create a Grafana dashboard showing frequency by team with trendlines.
- Change Failure Rate: correlate deployment events with PagerDuty incidents and ArgoCD rollback events. If an incident opens within 1 hour of a deployment, that deployment counts as a failure.
- MTTR: PagerDuty API provides incident lifecycle timestamps. Calculate time-to-detect (alert lag), time-to-mitigate, and total resolution time.
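The MTTR decomposition in that last bullet is equally mechanical once lifecycle timestamps are exported. A sketch; the record layout and sample values are illustrative, not the PagerDuty API schema:

```python
from datetime import datetime, timedelta

# Incident lifecycle records as you might export them from a paging tool.
incidents = [
    {"created": datetime(2026, 3, 1, 10, 19),
     "acknowledged": datetime(2026, 3, 1, 10, 21),
     "resolved": datetime(2026, 3, 1, 10, 45)},
    {"created": datetime(2026, 3, 5, 2, 0),
     "acknowledged": datetime(2026, 3, 5, 2, 10),
     "resolved": datetime(2026, 3, 5, 3, 0)},
]

def mean_delta(records, start_key: str, end_key: str) -> timedelta:
    """Average duration between two lifecycle timestamps across incidents."""
    total = sum((r[end_key] - r[start_key] for r in records), timedelta())
    return total / len(records)

# MTTR = detection → resolution; time-to-acknowledge = detection → ack
mttr = mean_delta(incidents, "created", "resolved")
tta = mean_delta(incidents, "created", "acknowledged")
```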
Improvement program:
- Baseline: measure current state for all 20 teams for 4 weeks
- Identify bottlenecks: which metric is worst? Lead time is usually the first constraint.
- Platform interventions: if lead time is slow because CI takes 30 minutes, invest in pipeline optimization. If deployment frequency is low because teams fear deployments, invest in canary rollout tooling.
- Team-level coaching: pair a platform engineer with the lowest-performing teams for 2-week improvement sprints.
- Celebrate improvement: monthly engineering all-hands showing DORA trends. Recognize teams that improved the most.
Critical rule: DORA metrics inform, they do not punish. Never tie DORA to individual performance reviews. The moment you do, teams game the metrics instead of improving their actual practices.”
## Interview Scenarios

### Scenario 1: Design On-Call for a Platform Team

“You’re setting up on-call for a 12-person platform team supporting 200 developers. How do you structure it?”
Strong Answer:
“With 12 engineers, I’d set up two on-call rotations:
Primary rotation: 1 week on / 5 weeks off, 6 engineers. This person is the first responder for all platform alerts — cluster health, networking, shared services (Grafana, ArgoCD). Response SLA: 5 minutes for SEV-1, 15 minutes for SEV-2.
Secondary rotation: 1 week on / 5 weeks off, remaining 6 engineers. Secondary is the escalation path and also handles lower-severity tickets during business hours.
Alert design is critical:
- PagerDuty for SEV-1/SEV-2 (pages primary, auto-escalates to secondary after 15 min)
- Slack for SEV-3/SEV-4 (notifications during business hours only)
- Target: fewer than 2 pages per on-call shift (week). More than that means we have reliability problems to fix.
Compensation: Additional pay or comp time for on-call weeks. If someone gets paged at 3 AM, they start late the next day. On-call burnout is the number one reason platform engineers quit.
Tenant team on-call: Each app team runs their own on-call for application-level issues. Platform team provides the alerting framework (vmalert rules, PagerDuty integration, runbook templates). We only get involved when the issue is infrastructure-level.”
### Scenario 2: Postmortem Culture

“Your team had a major outage. Leadership wants to fire the engineer who caused it. How do you handle this?”
Strong Answer:
“I’d firmly advocate for a blameless postmortem. Here’s my argument to leadership:

1. The engineer didn’t cause the outage — the system did. If one person’s action can bring down production, we have a systemic problem: missing guardrails, insufficient testing, or inadequate deployment safeguards.

2. Punishing people creates a fear culture. If we fire this person, every other engineer will hide their mistakes instead of reporting them. We’ll lose the early warning system that prevents bigger incidents.

3. The fix is systemic. Instead of ‘engineer should have been more careful,’ the action items should be: ‘add automated rollback when error rate exceeds threshold,’ ‘require staging validation before prod deploy,’ ‘add database migration dry-run step in CI.’

4. Google’s approach: Google SRE has a well-documented blameless postmortem culture. They’ve found that the teams with the best reliability are the ones where people feel safe reporting issues.

What I’d do: Run a blameless postmortem within 48 hours. Focus the discussion on: what happened, what contributing factors existed, what systemic changes prevent recurrence. Publish the postmortem internally. The engineer who ‘caused’ it should present — it’s empowering, not punitive.”
Scenario 3: SLO Design for a New Service
“A team is launching a new real-time fraud detection API. Help them define SLOs.”
Strong Answer:
“Fraud detection is critical path — false negatives (missed fraud) are worse than latency. I’d define:
SLO 1 — Availability: 99.99% (4.32 min/month error budget)
- SLI: Proportion of requests returning non-5xx responses
- Rationale: Every missed API call means a transaction goes unscored, potentially allowing fraud
SLO 2 — Latency: 99.9% of requests < 200ms (P999 < 200ms)
- SLI: Proportion of requests completing within 200ms
- Rationale: This API is called synchronously during payment processing. >200ms adds noticeable delay to customer checkout
SLO 3 — Correctness: 99.99% of evaluations match the batch model
- SLI: Proportion of real-time scores that agree with the nightly batch reprocessing
- Rationale: Ensures the real-time model isn’t drifting from the trained model
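The error budget quoted for SLO 1 follows directly from the availability target; a quick sketch in Python (the 30-day window is an assumption, matching the monthly figure above):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) over the window for a given availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60

# 99.99% over 30 days leaves 4.32 minutes of error budget per month
print(round(error_budget_minutes(0.9999), 2))   # 4.32
# For comparison, 99.9% leaves 43.2 minutes
print(round(error_budget_minutes(0.999), 1))    # 43.2
```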
Error budget policy:
- > 50% budget remaining: Ship features, A/B test new models
- 25-50%: Freeze model updates, investigate reliability
- < 25%: Feature freeze, all hands on reliability
- 0%: Leadership review, consider fallback to rules-based scoring
Alerting: Multi-window burn rate alerts. Fast burn (14.4x) pages immediately. Slow burn (1x over 3 days) creates a ticket.
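The fast-burn alert can be sketched as a vmalert/Prometheus rule. This is illustrative only: the `http_requests_total` metric and `job` label are assumptions, and 14.4x burn of a 99.99% SLO means alerting when the error ratio exceeds 14.4 × 0.0001 = 0.144%:

```yaml
groups:
  - name: fraud-api-slo
    rules:
      - alert: FraudAPIErrorBudgetFastBurn
        # Fire when a 14.4x burn rate is sustained over both a long (1h)
        # and a short (5m) window, so the alert clears quickly once fixed.
        expr: |
          (
            sum(rate(http_requests_total{job="fraud-api", code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="fraud-api"}[1h]))
          ) > (14.4 * 0.0001)
          and
          (
            sum(rate(http_requests_total{job="fraud-api", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="fraud-api"}[5m]))
          ) > (14.4 * 0.0001)
        for: 2m
        labels:
          severity: sev-1
```

The slow-burn companion (1x over 3 days) would use the same shape with longer windows and route to a ticket instead of a page.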
Dashboard: Dedicated Grafana dashboard showing all three SLIs, error budget remaining, burn rate trend, and latency percentile heatmap.”
Scenario 4: Reducing Alert Fatigue
“Your platform team gets 150 alerts per day. On-call engineers are burning out. Fix it.”
Strong Answer:
“150 alerts/day means the vast majority are noise. Here’s my 4-week plan:
Week 1: Audit
- Export all alert firings from PagerDuty/Alertmanager for the last 30 days
- Categorize each: actionable (required a human fix), auto-resolved (cleared on its own in <5 min), noise (false positive)
- Result: typically 70-80% are noise or auto-resolved
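The audit step above can be scripted. A minimal sketch, assuming each exported firing carries a duration and a flag for whether a human intervened; the field names (`duration`, `human_action_taken`) are hypothetical, not a real PagerDuty export schema:

```python
from collections import Counter
from datetime import timedelta

def categorize(firing: dict) -> str:
    """Bucket one alert firing: 'actionable', 'auto-resolved', or 'noise'."""
    if firing["human_action_taken"]:
        return "actionable"
    if firing["duration"] <= timedelta(minutes=5):
        return "auto-resolved"   # cleared on its own before anyone could act
    return "noise"               # lingered with no human action: likely a bad alert

firings = [
    {"alert": "PaymentAPIErrorBudgetBurn", "duration": timedelta(minutes=42), "human_action_taken": True},
    {"alert": "NodeCPUHigh",               "duration": timedelta(minutes=3),  "human_action_taken": False},
    {"alert": "DiskUsageWarning",          "duration": timedelta(hours=6),    "human_action_taken": False},
]
print(Counter(categorize(f) for f in firings))
```

Sorting the resulting buckets by alert name quickly surfaces which rules to delete or re-threshold in week 2.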
Week 2: Eliminate noise
- Delete alerts that never resulted in action (e.g., ‘CPU > 80%’ that auto-recovers every time)
- Increase thresholds on alerts that fire too frequently (was 70% CPU, set to 90%)
- Add `for: 5m` or `for: 10m` to eliminate transient spikes
- Result: 150 → ~40 alerts/day
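The `for:` change looks like this in a vmalert/Prometheus rule; the metric and 90% threshold match the CPU example above, but the rule itself is illustrative:

```yaml
- alert: NodeCPUHigh
  # Fire only if CPU stays above 90% for 10 consecutive minutes,
  # so a transient spike that auto-recovers never pages anyone.
  expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.90
  for: 10m
  labels:
    severity: sev-3
```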
Week 3: Switch to SLO-based
- Replace symptom-based alerts with error budget burn rate alerts
- Example: Instead of ‘pod CPU high’ + ‘pod memory high’ + ‘pod restart’ (3 alerts), use ‘99.9% availability SLO burning 14.4x’ (1 alert that captures the actual user impact)
- Result: 40 → ~15 alerts/day
Week 4: Improve routing
- SEV-1/SEV-2: PagerDuty (pages on-call)
- SEV-3: Slack notification (business hours only)
- SEV-4: Ticket in Jira (triage in next sprint)
- Result: On-call gets ~2-3 pages per shift instead of being buried in 150 alerts/day
Target: <5 pages per on-call week. Every page should be actionable and have a runbook.”
Scenario 5: Incident Response Exercise
“Walk me through how you’d handle: at 2 AM, PagerDuty pages you. The payment API error rate jumped from 0.1% to 45%.”
Strong Answer:
“Here’s my response timeline:
0-2 minutes: Acknowledge + Assess
- Acknowledge PagerDuty alert on my phone
- Open Grafana on laptop
- Check: Is this a partial outage (some endpoints) or total (all endpoints)?
- Check: When did it start? Was there a deployment? (ArgoCD history)
- Quick assessment: this is SEV-1 (revenue impact, 45% of payment requests failing)
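The partial-vs-total check can be two Grafana Explore queries. The metric (`http_requests_total`) and labels (`job`, `path`) are assumptions about the service's instrumentation:

```promql
# Overall error ratio (should show the jump from 0.1% to 45%)
sum(rate(http_requests_total{job="payment-api", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="payment-api"}[5m]))

# Break down by endpoint to see whether the outage is partial or total
sum by (path) (rate(http_requests_total{job="payment-api", code=~"5.."}[5m]))
  / sum by (path) (rate(http_requests_total{job="payment-api"}[5m]))
```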
2-5 minutes: Open incident
- Create #incident-payment-api Slack channel
- Post: ‘SEV-1: Payment API 45% error rate. IC: [me]. Investigating.’
- If there was a deployment in the last hour → immediate rollback: `kubectl -n payments rollout undo deployment/payment-api`
- This is the fastest mitigation. Root cause can wait.
5-10 minutes: If no deployment (or rollback didn’t help)
- Check dependencies:
  - Database: `{namespace="payments"} |= "connection" |= "refused"` in Loki
  - Redis: check Memorystore/ElastiCache metrics in Grafana
  - External services: check span errors in Tempo
- Check infrastructure: node health, pod scheduling, DNS resolution
10-15 minutes: Escalate if needed
- If I can’t identify the cause, escalate to Tier 2 (page the secondary on-call)
- Post update in Slack: ‘Still investigating. No deployment correlation. Checking DB and downstream.’
15-30 minutes: Mitigate
- Once root cause identified, apply fix or workaround
- Example: DB connection pool exhausted → scale up RDS reader, increase pool size
Post-resolution:
- Confirm error rate back to baseline for 15 minutes
- Post ‘ALL CLEAR’ in incident channel with summary
- Schedule blameless postmortem for next business day
- Write draft timeline while it’s fresh”
References
- AWS Well-Architected Framework: Reliability Pillar — fault isolation, recovery, and operational readiness
- Google SRE Book (free online) — foundational text on SRE practices, SLOs, error budgets, and toil
- Google SRE Workbook (free online) — practical companion with implementation guidance
- Google Cloud Architecture Framework: Reliability — reliability design principles for GCP
Tools & Frameworks
- PagerDuty Incident Response Guide — open-source incident response process, roles, and best practices
- Atlassian Incident Management Handbook — severity levels, communication templates, and postmortems
- Alertmanager Documentation — alert routing, grouping, inhibition, and silencing
- SLO Generator (Google) — open-source tool for computing SLIs and error budgets