
System Design Interview Framework

System design interviews are typically 45-60 minutes. The biggest mistake candidates make is diving straight into drawing architecture diagrams without understanding what they are building or at what scale. The 5-step framework structures your time so you cover requirements, estimation, architecture, depth, and trade-offs — in that order.

┌─────────────────────────────────────────────────────┐
│ Step 1: Clarify Requirements (5 min) │
│ → Functional + non-functional requirements │
│ → Ask questions — don't assume │
├─────────────────────────────────────────────────────┤
│ Step 2: Back-of-Envelope Estimation (3 min) │
│ → QPS, storage, bandwidth │
│ → Sets the scale for your design │
├─────────────────────────────────────────────────────┤
│ Step 3: High-Level Architecture (10 min) │
│ → Draw the boxes, identify data flow │
│ → Name services, databases, queues │
├─────────────────────────────────────────────────────┤
│ Step 4: Deep Dive (20 min) │
│ → Pick 2-3 components to go deep │
│ → This is where you show expertise │
├─────────────────────────────────────────────────────┤
│ Step 5: Bottlenecks & Trade-offs (7 min) │
│ → Scaling, failure modes, cost │
│ → Security, observability, operations │
└─────────────────────────────────────────────────────┘

The remaining time (~5 minutes) is buffer for interviewer questions and discussion. Do not try to fill every second — leaving room for dialogue shows confidence.


Step 1: Clarify Requirements (5 min)

This is the most important step and the strongest signal of seniority. Junior candidates hear “Design a URL shortener” and immediately start drawing boxes. Principal-level candidates ask ten questions first — because the answers change the entire architecture.

“What does the system do?” — Define the core user stories and API operations.

Example for a URL shortener:

  • User submits a long URL, gets a short URL back
  • User visits the short URL, gets redirected to the original URL
  • Optional: analytics (click count, geographic distribution), custom short URLs, expiration

Ask about the requirements below EXPLICITLY. Do not assume defaults — state your assumption and confirm it with the interviewer.

| Requirement | Questions to Ask | Why It Matters |
| --- | --- | --- |
| Users | How many users? Geographic distribution? | Single-region vs multi-region, CDN placement |
| Scale | Requests per second? Data volume? Growth rate? | Database choice, caching strategy, partitioning |
| Latency | p99 target? (< 100ms? < 1s? < 5s?) | Caching layers, read replicas, CDN, edge computing |
| Availability | 99.9%? 99.99%? 99.999%? | Multi-AZ vs multi-region, active-active vs active-passive |
| Consistency | Strong? Eventual? Read-after-write? | Database selection, replication strategy, cache invalidation |
| Compliance | PCI? HIPAA? Data residency? | Network isolation, encryption requirements, region restrictions |
| Budget | Cost-constrained? Performance-first? | Managed services vs self-hosted, reserved vs on-demand |

How to phrase your questions:

Do not ask vague questions like “How big is this?” Instead, be specific:

  • “Are we designing for 1 million users or 1 billion users? That changes whether we need sharding.”
  • “Is this a read-heavy or write-heavy system? What is the approximate read:write ratio?”
  • “Do we need strong consistency for reads, or is eventual consistency acceptable with a few seconds of delay?”
  • “Are there any data residency requirements? Does data need to stay in a specific region?”

Step 2: Back-of-Envelope Estimation (3 min)


Estimation sets the scale for your design. If the system handles 10 QPS, a single PostgreSQL instance is fine. If it handles 100,000 QPS, you need caching, sharding, and CDN. Do the math — do not guess.

Time:
1 day = 86,400 seconds ≈ 100K seconds
1 month ≈ 2.5 million seconds
Scale:
1 million requests/day ≈ 12 QPS
1 billion requests/day ≈ 12,000 QPS
Typical read:write ratio = 10:1 or 100:1
Storage:
1 char = 1 byte
1 KB ≈ 1,000 characters (a paragraph)
1 MB ≈ 1,000 KB (a photo)
1 GB ≈ 1,000 MB (a movie)
1 TB ≈ 1,000 GB
1 PB ≈ 1,000 TB
1 million records × 1 KB each = 1 GB
1 billion records × 1 KB each = 1 TB
Bandwidth:
1 Gbps = 125 MB/sec
100 MB response × 1000 QPS = 100 GB/sec (need CDN!)
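These conversions are trivial to sanity-check in code. A quick sketch (the helper name `qps` is ours, not a library function):

```python
def qps(requests_per_day: float) -> float:
    """Convert a daily request count to average queries per second."""
    return requests_per_day / 86_400      # exact seconds in a day

# Cheat-number checks: 1M/day ≈ 12 QPS, 1B/day ≈ 12K QPS, 1 Gbps = 125 MB/sec
million_per_day = qps(1_000_000)          # ≈ 11.6, call it 12
billion_per_day = qps(1_000_000_000)      # ≈ 11,574, call it 12K
gbps_in_mb_per_sec = 1000 / 8             # 125
```

The cheat numbers round aggressively (86,400 → 100K) on purpose: interview math should be fast, not precise.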

Example Estimation: “Design a URL Shortener”


Walk through the math out loud — interviewers want to see your reasoning, not just the answer.

Writes:
100M new URLs/month
100M / (30 days × 86,400 sec) ≈ 100M / 2.5M ≈ 40 QPS writes
Reads:
Assume 10:1 read:write ratio
40 × 10 = 400 QPS reads
Peak (3x average): 1,200 QPS reads
Storage:
Each URL record: ~500 bytes (short URL + long URL + metadata)
5 years of data: 100M × 12 months × 5 years × 500 bytes
= 6 billion records × 500 bytes
= 3 TB
Bandwidth:
400 QPS × 500 bytes = 200 KB/sec (trivial)
Conclusion:
- Modest scale — single-region database handles writes easily
- 400 QPS reads — add Redis cache (99% hit rate for popular URLs)
- 3 TB storage — fits in a single Aurora PostgreSQL instance
- No need for sharding, message queues, or multi-region at this scale
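The same arithmetic, written out as a quick script (the numbers mirror the walkthrough above; rounding is deliberate):

```python
# URL shortener back-of-envelope, mirroring the walkthrough above.

SECONDS_PER_MONTH = 30 * 86_400               # ≈ 2.6M; round to 2.5M mentally

writes_per_month = 100_000_000                # 100M new URLs/month
write_qps = writes_per_month / SECONDS_PER_MONTH          # ≈ 40

read_qps = write_qps * 10                     # 10:1 read:write ratio ≈ 400
peak_read_qps = read_qps * 3                  # 3x average ≈ 1,200

record_bytes = 500                            # short URL + long URL + metadata
records_5_years = writes_per_month * 12 * 5   # 6 billion records
storage_tb = records_5_years * record_bytes / 1e12        # 3 TB

bandwidth_kb_per_sec = read_qps * record_bytes / 1000     # ≈ 200 KB/sec
```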

Why this matters: If you design a URL shortener with Kafka, DynamoDB Global Tables, and a service mesh — but the scale only requires PostgreSQL + Redis — you are over-engineering. The estimation prevents this.


Step 3: High-Level Architecture (10 min)

Draw the main components, show data flow with arrows, and name specific services. This is your “30,000 foot view” — enough detail to show the overall structure without getting lost in implementation.

  • Clients: Web browser, mobile app, partner API
  • Edge: CDN, DNS, WAF
  • Load balancer: ALB, NLB, Global LB
  • Application layer: EKS pods, Lambda functions, ECS tasks — name the services
  • Data stores: Name the specific database (Aurora PostgreSQL, DynamoDB, Redis) — not just “database”
  • Queues/streams: SQS, Kinesis, Pub/Sub — for async processing
  • External services: Payment processors, email providers, ML endpoints

Identify which components are stateless (can be horizontally scaled freely) and which are stateful (need careful scaling). This shows architectural maturity.

Stateless (scale horizontally):     Stateful (scale carefully):
- API servers                       - Databases (Aurora, DynamoDB)
- Workers/consumers                 - Cache (Redis/ElastiCache)
- Lambda functions                  - Message queues (SQS, Kafka)
- CDN edge nodes                    - Search indexes (OpenSearch)

Example: High-Level Architecture for URL Shortener

Browser/API Client
        ↓ HTTPS
CloudFront (CDN — cache redirects for popular URLs)
        ↓
ALB (Application Load Balancer)
        ↓
EKS: url-shortener-api (stateless, 3 replicas)
 ├── POST /shorten → generate short code → write to Aurora
 └── GET /:code → check Redis cache → fallback to Aurora → 301 redirect
        ↓                       ↓
      Redis              Aurora PostgreSQL
  (cache: code→URL,   (primary store: code, long_url,
   TTL 24h)            created_at, clicks)

Key decisions to explain:

  • “I chose Aurora PostgreSQL over DynamoDB because the data model is simple (key-value with metadata) and 40 QPS writes do not require a NoSQL database. Aurora gives us SQL flexibility for analytics queries on URL usage.”
  • “Redis caches the most accessed URLs. With a 99% cache hit rate, only 4 QPS reach Aurora — well within its capacity.”
  • “CloudFront caches redirect responses at the edge. A URL that gets 10,000 clicks/day only hits the origin once per TTL period.”
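The read path described above is the classic cache-aside pattern. A minimal sketch, using plain dicts as stand-ins for Redis and Aurora (a real Redis client would also set the 24h TTL when populating the cache):

```python
def resolve(code: str, cache: dict, db: dict) -> str:
    """GET /:code — cache-aside lookup before issuing the 301 redirect."""
    url = cache.get(code)
    if url is not None:
        return url                 # cache hit: no database round-trip
    url = db.get(code)             # cache miss: fall back to Aurora
    if url is None:
        raise KeyError(code)       # unknown code -> return 404 to the client
    cache[code] = url              # populate the cache for subsequent reads
    return url
```

With a 99% hit rate, only about 4 of the 400 read QPS ever reach the database.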

Step 4: Deep Dive (20 min)

This is where you demonstrate expertise. Pick 2-3 components that are the HARDEST parts of the system and go deep. Do not try to cover everything — depth beats breadth.

Ask yourself: “What is the technically interesting challenge in this system?” or “Where could this design break?”

  • For a URL shortener: Short code generation (collision avoidance), caching strategy (invalidation), analytics pipeline
  • For an e-commerce platform: Inventory consistency, checkout flow (distributed transactions), search indexing
  • For a notification system: Fan-out at scale (1 event → millions of notifications), delivery guarantees
  • For a chat system: Real-time message delivery (WebSocket vs long polling), message ordering, presence detection

For each component you deep dive, cover:

  1. The challenge: “The interesting problem here is X…”
  2. Options considered: “We could solve this with A, B, or C…”
  3. Decision and rationale: “I would choose B because…”
  4. Implementation detail: Show the data model, API, or architecture for this component
  5. Edge cases: “What happens when X fails? How do we handle Y?”

Example Deep Dive: Short Code Generation

“The challenge is generating globally unique short codes without coordination between API servers.

Option A — Sequential counter: Simple but predictable (users can guess URLs) and requires coordination (single counter = bottleneck).

Option B — Random generation: Generate a random 7-character alphanumeric code. 62^7 = 3.5 trillion possible codes. With 6 billion URLs over 5 years, collision probability is negligible. Check for collision on write — if the code exists, regenerate.

Option C — Hash-based: Base62-encode a hash (MD5, SHA-256) of the long URL + timestamp. Deterministic but longer output requires truncation, increasing collision risk.

I would choose Option B — random generation with collision check. It is simple, horizontally scalable (no coordination needed), and the collision probability with 7 characters is vanishingly small at our scale.

Short Code Generation:
1. Generate random 7-char alphanumeric string
2. Check Redis bloom filter (fast collision check)
3. If collision → regenerate (statistically rare)
4. Write to Aurora with UNIQUE constraint on code column
5. If DB unique violation → regenerate (safety net)
6. Add to bloom filter
7. Cache code→URL in Redis

The bloom filter is an optimization — it avoids a database round-trip for collision checks in 99.99% of cases.”
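Option B is small enough to sketch end to end. Here a dict stands in for the Aurora table and its UNIQUE constraint; the bloom filter and Redis cache steps are omitted:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # 62 characters
CODE_LENGTH = 7                                   # 62**7 ≈ 3.5 trillion codes

def generate_code() -> str:
    """Random 7-char code; needs no coordination between API servers."""
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))

def shorten(long_url: str, store: dict) -> str:
    """Generate, check for a collision, regenerate on the (rare) hit."""
    while True:
        code = generate_code()
        if code not in store:     # production: bloom filter, then DB UNIQUE
            store[code] = long_url
            return code
```

`secrets` (rather than `random`) makes codes unpredictable, which also addresses the guessability problem noted for Option A.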


Step 5: Bottlenecks and Trade-offs (7 min)


Every design has weak points. Identifying them proactively shows maturity — it is far better for YOU to say “this component could be a bottleneck” than for the interviewer to find it.

  • “If traffic doubles tomorrow, which component breaks first?”
  • “What happens if this database goes down? What is the user experience?”
  • “What is the approximate monthly cost of this design? Is it reasonable for the business?”
  • “How do we monitor this system? What alerts would we set up?”
  • “What happens during a deployment? Is there downtime?”

Every architectural decision has trade-offs. Name them.

Decision: Use Aurora PostgreSQL instead of DynamoDB
Pro: SQL flexibility, complex queries for analytics, familiar to most engineers
Con: Vertical scaling limits (max ~128 vCPU), schema migrations require coordination
Trade-off accepted because: Our scale (40 QPS writes) is well within Aurora limits
Decision: Cache redirects in Redis
Pro: Sub-millisecond reads, reduces database load by 99%
Con: Cache invalidation complexity if URL targets change
Trade-off accepted because: URL targets rarely change, TTL-based expiry is sufficient
Decision: Single-region deployment
Pro: Simpler, lower cost, no cross-region consistency issues
Con: Higher latency for users far from the region, single point of failure
Trade-off accepted because: The requirements specify 99.9% availability (not 99.99%)
which is achievable with multi-AZ in a single region

These are five frequently asked system design prompts with approach outlines. Each outline identifies key challenges, proposes an architecture, suggests deep dive topics, and cross-references other sections of this guide for detailed patterns.

1. Design a Multi-Region E-Commerce Platform


Key challenges:

  • Product catalog: read-heavy (thousands of reads per second), cache-friendly but cache invalidation on product updates is tricky
  • Inventory: write-heavy, consistency-critical (cannot sell more than what is in stock)
  • Checkout: distributed transaction across inventory, payment, and order services
  • Search: full-text search with filters (category, price range, availability)

Architecture outline:

Users (global)
   ↓ HTTPS
CloudFront / Cloud CDN (static assets, product images)
   ↓
Route 53 / Cloud DNS (latency-based routing to nearest region)
   ↓                                        ↓
Region A (us-east-1)                    Region B (eu-west-1)
┌────────────────────────┐              ┌────────────────────────┐
│ ALB                    │              │ ALB                    │
│   ↓                    │              │   ↓                    │
│ EKS: product-api       │              │ EKS: product-api       │
│ EKS: cart-service      │              │ EKS: cart-service      │
│ EKS: checkout-svc      │              │ EKS: checkout-svc      │
│ EKS: search-api        │              │ EKS: search-api        │
│   ↓          ↓         │              │   ↓          ↓         │
│ ElastiCache  Aurora    │◄── Global ──►│ ElastiCache  Aurora    │
│ (Redis)      Global DB │  Replication │ (Redis)      Read      │
│              (writer)  │              │              Replica   │
│   ↓                    │              │                        │
│ DynamoDB Global Table  │◄── Global ──►│ DynamoDB Global Table  │
│ (cart/session)         │    Tables    │ (cart/session)         │
└────────────────────────┘              └────────────────────────┘

Deep dive topics:

  • Cache invalidation when a product updates: publish product-updated event to SNS/EventBridge, invalidate specific cache keys in all regions, use cache-aside pattern with TTL as safety net
  • Inventory consistency: the writer region (us-east-1) handles all inventory decrements. Read replicas serve inventory reads with eventual consistency (acceptable for product browsing, NOT for checkout). Checkout service calls the writer region directly for the final inventory check.
  • Reference: web-architecture/ (caching patterns), databases/ (Aurora Global Database), event-driven/ (event-driven inventory updates)
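The invalidation step above can be sketched as the handler each region runs on a product-updated event (plain dicts stand in for per-region Redis; the key format is illustrative):

```python
def on_product_updated(product_id: str, regional_caches: dict) -> None:
    """Handler for a product-updated event from SNS/EventBridge.
    Deletes the affected key in every region; the cache-aside read path
    repopulates it on the next request, and TTL is the safety net."""
    key = f"product:{product_id}"
    for cache in regional_caches.values():
        cache.pop(key, None)       # idempotent: safe on redelivered events
```

Idempotency matters here because SNS/EventBridge deliver at-least-once, so the same event may arrive twice.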

2. Design a Notification System

Key challenges:

  • Fan-out: a single event (e.g., “flash sale starts”) may trigger millions of notifications
  • Delivery channels: push notifications (APNs/FCM), email (SES), SMS (Twilio/SNS), in-app
  • User preferences: each user chooses which channels they want, what types of notifications
  • Delivery guarantees: at-least-once delivery, retry logic, dead-letter handling

Architecture outline:

Event Producers (order-service, marketing, system alerts)
↓ Events
EventBridge / Pub/Sub (event bus)
↓ Event rules route to appropriate handlers
Lambda / Cloud Functions (fan-out: resolve recipients, preferences)
↓ Per-user notification messages
SQS queue per channel        SQS queue per channel
┌──────────────┐             ┌──────────────┐
│ push-queue   │             │ email-queue  │
│      ↓       │             │      ↓       │
│ push-worker  │             │ email-worker │
│      ↓       │             │      ↓       │
│ APNs / FCM   │             │ SES          │
└──────────────┘             └──────────────┘
   ↓ Failed deliveries          ↓ Failed deliveries
DLQ (retry after backoff)    DLQ (retry after backoff)
              ↓ After max retries
Notification log (DynamoDB — delivery status per user)

Deep dive topics:

  • Fan-out at scale: a “flash sale” event triggers Lambda to query the user preference service (DynamoDB), generate one SQS message per user per channel. For 10M users, this is 10M+ SQS messages — use SQS FIFO for ordering or standard for throughput. SNS can also fan out to SQS queues per channel.
  • Delivery guarantees: SQS provides at-least-once delivery. Workers process messages, call the delivery API (APNs/SES/Twilio), and delete the message only on success. Failed deliveries go to DLQ with exponential backoff retry. After max retries, log as failed and alert.
  • User preference service: DynamoDB table with userId as partition key. Each record contains channel preferences, notification type preferences, quiet hours, and timezone.
  • Reference: event-driven/ (fan-out pattern, EventBridge), compute/ (Lambda concurrency and scaling)
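The fan-out step is essentially a cross product of recipients and their opted-in channels. A minimal sketch (the preference table shape and queue naming are illustrative):

```python
def fan_out(event: dict, preferences: dict) -> list:
    """Expand one event into one message per user per opted-in channel."""
    messages = []
    for user_id, channels in preferences.items():
        for channel in channels:                  # e.g. "push", "email", "sms"
            messages.append({
                "queue": f"{channel}-queue",      # one SQS queue per channel
                "user_id": user_id,
                "event_type": event["type"],
            })
    return messages
```

For 10M users this loop becomes a distributed job (Lambda fan-out or SNS to per-channel SQS queues), but the shape of the expansion is the same.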

3. Design a Multi-Tenant SaaS Platform

Key challenges:

  • Tenant isolation: security boundary — one tenant must never access another tenant’s data
  • Noisy neighbor: performance isolation — one tenant’s heavy usage must not degrade service for others
  • Billing: per-tenant cost tracking and usage metering
  • Onboarding: self-service tenant provisioning (ideally < 5 minutes)

Architecture outline:

Tenants (each with API key / OAuth token)
↓ HTTPS
API Gateway (per-tenant throttling via usage plans)
↓ JWT with tenant_id claim
ALB → EKS
┌─────────────────────────────────────────────────┐
│ Isolation Model (choose one): │
│ │
│ Option A: Namespace-per-tenant │
│ tenant-a namespace → pods + NetworkPolicy │
│ tenant-b namespace → pods + NetworkPolicy │
│ Shared database with row-level security (RLS) │
│ │
│ Option B: Cluster-per-tenant (premium tier) │
│ Dedicated EKS cluster per enterprise tenant │
│ Dedicated Aurora instance │
│ Full blast radius isolation │
│ │
│ Option C: Hybrid (most common) │
│ Small tenants → shared cluster, namespace │
│ Enterprise tenants → dedicated cluster │
└─────────────────────────────────────────────────┘
Aurora PostgreSQL
Schema-per-tenant (moderate isolation)
or Row-level security (shared schema, tenant_id column)
Kubecost / CUR (per-tenant cost attribution)

Deep dive topics:

  • Isolation model trade-offs: namespace-per-tenant is cheapest but requires strong NetworkPolicy, OPA/Gatekeeper policies, and ResourceQuotas. Cluster-per-tenant is most secure but expensive (EKS control plane cost per cluster). Hybrid gives flexibility — onboard small tenants to shared, migrate to dedicated when they reach enterprise tier.
  • Per-tenant rate limiting: API Gateway usage plans with per-tenant API keys. Each plan defines requests/second and burst limits. Alternatively, Istio rate limiting with EnvoyFilter based on JWT tenant_id claim.
  • Billing integration: Kubecost labels costs by namespace (tenant). For cluster-per-tenant, use AWS Cost and Usage Reports (CUR) tagged by tenant account. Meter API usage via API Gateway usage plans and export to billing system.
  • Reference: kubernetes/multi-tenancy (isolation patterns), api-gateway/ (usage plans, throttling), finops/ (Kubecost, cost allocation)
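Per-tenant throttling boils down to a token bucket keyed by tenant. A toy model of what API Gateway usage plans provide (class and parameter names are ours, not an AWS API):

```python
class TenantRateLimiter:
    """Token bucket per tenant: `rate` tokens/sec refill, `burst` capacity."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.buckets = {}                     # tenant_id -> (tokens, last_ts)

    def allow(self, tenant_id: str, now: float) -> bool:
        tokens, last = self.buckets.get(tenant_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[tenant_id] = (tokens - 1, now)
            return True
        self.buckets[tenant_id] = (tokens, now)
        return False
```

Because each tenant has its own bucket, one tenant exhausting its quota cannot starve the others — the noisy-neighbor property the design calls for.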

4. Design a CI/CD Platform for 200 Developers


Key challenges:

  • Build speed: developers abandon CI if builds take > 15 minutes
  • Security scanning: shift-left security without slowing down developers
  • Environment promotion: consistent dev → staging → prod pipeline with approval gates
  • Self-service: developers should be able to onboard new services without platform team involvement
  • Cost: CI runners are expensive at scale (compute-heavy, often idle)

Architecture outline:

Developer
↓ git push
GitHub (source of truth)
↓ webhook
GitHub Actions (CI)
├── Build: Docker build with layer caching
├── Test: unit tests, integration tests
├── Scan: Trivy (container), tfsec (Terraform), Snyk (dependencies)
├── Publish: push image to ECR/Artifact Registry (immutable tag)
└── Update: bump image tag in GitOps repo
ArgoCD (CD — GitOps)
├── Dev cluster: auto-sync (deploy on merge to main)
├── Staging cluster: auto-sync with health checks
└── Prod cluster: manual sync (approval gate in ArgoCD)
EKS/GKE clusters (dev, staging, prod)
↓ metrics
Grafana (DORA metrics dashboard)
- Deployment frequency
- Lead time for changes
- Change failure rate
- Mean time to recovery

Deep dive topics:

  • Golden paths: provide standardized pipeline templates (reusable GitHub Actions workflows) for common patterns — Go microservice, Node.js API, Python ML service. Each template includes build, test, scan, and deploy stages. Developers import the template and provide only service-specific config.
  • Security scanning in pipeline: Trivy scans container images for CVEs (block CRITICAL/HIGH). tfsec scans Terraform for misconfigurations. Snyk scans application dependencies. Results posted as PR comments — developers fix before merge.
  • Environment promotion: GitOps repo has directory per environment (dev/, staging/, prod/). CI updates the image tag in dev/. ArgoCD auto-syncs dev. Promotion to staging is automatic after dev health checks pass. Promotion to prod requires a PR approval in the GitOps repo.
  • Cost optimization: use GitHub Actions larger runners (or self-hosted runners on spot instances) for heavy builds. Cache Docker layers in ECR/Artifact Registry. Run integration tests in ephemeral namespaces (create on PR open, delete on PR close).
  • Reference: kubernetes/deployments (GitOps with ArgoCD), kubernetes/enterprise-platform (golden paths, developer experience), cheatsheets/terraform-patterns (CI/CD for Terraform)
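A golden-path template might be consumed like this — a hypothetical sketch of a service repo importing a shared reusable GitHub Actions workflow (repo, workflow, and input names are illustrative):

```yaml
# Hypothetical golden-path usage: the service repo supplies only
# service-specific config; build/test/scan/deploy live in the template.
name: ci
on: [push]
jobs:
  pipeline:
    uses: platform-team/pipelines/.github/workflows/go-service.yml@main
    with:
      service-name: payments-api
    secrets: inherit
```

The platform team versions the template; bumping `@main` to a tagged ref lets services pin a known-good pipeline.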

5. Design a Data Pipeline for Real-Time Fraud Detection


Key challenges:

  • Latency: fraud decision must happen in < 100ms (before the payment is authorized)
  • ML model serving: low-latency inference with auto-scaling
  • Feature store: real-time feature retrieval (user’s recent transaction history, device fingerprint, location)
  • False positive handling: too many false positives = blocked legitimate customers = revenue loss

Architecture outline:

Payment Event (from checkout service)
↓ (< 5ms)
Kinesis Data Streams / Pub/Sub (event ingestion)
↓ (< 10ms)
Flink / Dataflow (real-time feature extraction)
├── Compute: transaction velocity (last 5 min), amount deviation
├── Enrich: user profile, device fingerprint, geo-IP
└── Write features to Feature Store (DynamoDB/Bigtable)
↓ (< 20ms)
SageMaker Endpoint / Vertex AI Endpoint (ML inference)
├── Input: enriched feature vector
├── Output: fraud probability score (0.0 - 1.0)
└── Auto-scaling: scale on invocations/latency metrics
↓ (< 5ms)
Decision Service
├── Score < 0.3 → ALLOW (proceed with payment)
├── Score 0.3 - 0.7 → REVIEW (queue for human analyst)
└── Score > 0.7 → BLOCK (reject payment, notify user)
DynamoDB / Bigtable (fraud decision log — every decision stored)
↓ (async)
Feedback Loop
├── Analyst reviews REVIEW decisions → labels as fraud/legit
├── Labels feed back to training pipeline
└── Model retrained weekly on new labeled data

Deep dive topics:

  • Feature store: the feature store is the key to low-latency ML inference. Pre-compute features (transaction velocity, average amount, device history) and store in DynamoDB/Bigtable for single-digit-millisecond reads. The Flink job continuously updates features as new transactions flow in. Without a feature store, the inference service would need to query multiple databases at request time — too slow for 100ms SLA.
  • Model serving: SageMaker/Vertex AI endpoints provide managed inference with auto-scaling. Use multi-model endpoints to A/B test new models (send 5% of traffic to the challenger model). Monitor model drift — if fraud patterns change and the model’s precision drops, trigger retraining.
  • Feedback loop: the analyst review queue is critical. Analysts label borderline decisions (score 0.3-0.7) as fraud or legitimate. These labels become training data for the next model version. Without this loop, the model degrades over time as fraud patterns evolve.
  • Reference: data-platform/ (streaming architectures, Kinesis/Pub/Sub), event-driven/ (event-driven processing), compute/ (Lambda vs containers for ML serving)
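The decision service in the outline reduces to a threshold function (0.3 and 0.7 are the cut-offs assumed above; real systems tune them against the false-positive budget):

```python
def decide(score: float) -> str:
    """Map a fraud probability score to an action."""
    if score < 0.3:
        return "ALLOW"     # proceed with payment authorization
    if score <= 0.7:
        return "REVIEW"    # queue for a human analyst; labels feed retraining
    return "BLOCK"         # reject the payment and notify the user
```

Raising the BLOCK threshold trades fraud losses for fewer blocked legitimate customers — the exact tension the false-positive challenge describes.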

How to Use This Guide During Design Rounds


This guide covers the building blocks you will combine in system design interviews. Here is a cross-reference table to quickly find the right patterns for each design topic:

| Design Topic | Guide Page | Key Patterns |
| --- | --- | --- |
| Network design | networking/ (VPC, connectivity, security) | VPC architecture, Transit Gateway, PrivateLink |
| Database selection | databases/ | Aurora vs DynamoDB, read replicas, Global Tables |
| K8s architecture | kubernetes/ (13 pages) | Pod design, deployments, scaling, networking |
| Event-driven patterns | event-driven/ | SQS/SNS fan-out, EventBridge, Kinesis streaming |
| Caching | web-architecture/ (caching section) | Cache-aside, write-through, invalidation strategies |
| Security | security/ + iam-fundamentals/ | Zero Trust, mTLS, AuthorizationPolicy, encryption |
| CI/CD | kubernetes/deployments | GitOps with ArgoCD, canary deployments |
| Cost | finops/ | Kubecost, Spot instances, reserved capacity |
| Observability | observability/ (metrics, logging, tracing) | Prometheus, Grafana, distributed tracing |
| Resilience | web-architecture/ (resilience patterns) | Circuit breakers, retries, bulkheads, chaos engineering |
| Multi-tenancy | kubernetes/multi-tenancy | Namespace isolation, ResourceQuotas, NetworkPolicy |
| API design | api-gateway/ | Rate limiting, usage plans, authentication patterns |
| Service mesh | security/service-mesh | Istio, mTLS, traffic management, service exposure |
| Compliance | security/ (compliance section) | PCI-DSS, SOC 2, CIS benchmarks |

During the interview, mentally map the design prompt to these building blocks. “Design an e-commerce platform” touches networking, databases, caching, event-driven, CI/CD, observability, and security. You do not need to cover all of them — but knowing which patterns apply shows breadth.


Common Mistakes

These are the mistakes that cost candidates the most points in system design interviews. Each one is avoidable with awareness.

The mistake: “I would use Kafka” — said before understanding the requirements, scale, or use case.

The fix: Always start with Step 1 (clarify requirements). State your assumptions explicitly: “I am assuming this is a write-heavy system with eventual consistency requirements. If that changes, the architecture changes significantly.”

The mistake: Designing for features but not for scale, reliability, or cost. The system works at 100 users but falls apart at 1 million.

The fix: Step 2 (estimation) forces you to think about scale. Ask about availability, latency, and consistency requirements in Step 1. These requirements often matter more than features for architecture decisions.

The mistake: Adding Kafka, Redis, service mesh, multi-region, and microservices when a simple PostgreSQL + ALB + monolith would handle the load comfortably.

The fix: Let the estimation guide your complexity. If the system handles 50 QPS, you do not need Kafka. If it serves 1,000 users, you do not need multi-region. State: “At this scale, a simpler approach is sufficient. Here is where I would add complexity as scale grows.”

Scale-Driven Complexity:
< 100 QPS → Monolith + PostgreSQL + Redis
100-10K QPS → Microservices + managed DB + caching layer
10K-100K QPS → Sharding, CDN, message queues, read replicas
> 100K QPS → Multi-region, custom solutions, edge computing

The mistake: Presenting your design as if every decision is obvious and has no downsides.

The fix: Every choice has trade-offs. Say them explicitly: “I chose Aurora over DynamoDB because we need SQL joins for analytics. The trade-off is that Aurora has vertical scaling limits — if we exceed 128 vCPU on the writer, we would need to shard manually or switch to DynamoDB.”

The mistake: Forgetting to make stateful components (database, cache, queue) highly available. If the single Redis instance fails, the entire system degrades.

The fix: For every stateful component, ask: “What happens if this fails?” Database should be multi-AZ. Cache should have replicas. Queues should be managed (SQS/Pub/Sub are inherently durable). State your HA strategy explicitly.

The mistake: Designing a $500K/month architecture when the business constraint is $50K/month. Or the inverse — choosing the cheapest option for a system that needs 99.99% availability.

The fix: Give a rough cost estimate. “This design with 3 EKS clusters, Aurora Global DB, and ElastiCache across 2 regions costs approximately $15K-20K/month. If the budget is lower, I would start with a single region and add the second region when the business justifies it.”

The mistake: Saying “database” instead of “Aurora PostgreSQL with read replicas” or “cache” instead of “ElastiCache Redis cluster mode.”

The fix: Be specific. Name the exact AWS/GCP service, the configuration (instance type, replication mode), and why you chose it over alternatives. This demonstrates hands-on experience versus theoretical knowledge.

The mistake: No mention of how you would monitor, alert, or debug the system. The architecture looks perfect until something goes wrong — and you have no visibility.

The fix: Include observability in your architecture: “Prometheus scrapes metrics from all services. Grafana dashboards show request rate, error rate, and p99 latency per service. PagerDuty alerts on error rate > 1% or p99 > 500ms. Distributed tracing with Jaeger/X-Ray for debugging cross-service latency.”