Data Platform — Architecture Patterns
Where This Fits
This page builds on the Pipeline Stages foundation. Those are the building blocks; here we assemble them into complete enterprise architectures. Each pattern addresses different business requirements — from traditional batch analytics to real-time fraud detection to decentralized data mesh.
Pattern 1: Traditional Enterprise Bank
The most common data platform pattern at regulated financial institutions. Prioritizes reliability, auditability, and compliance over real-time speed.
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Ingestion | CDC via DMS/Datastream | Near-real-time with minimal source impact |
| Lake format | Parquet with Iceberg | ACID, schema evolution, audit trail |
| PCI data | Separate zone, tokenized | PCI-DSS: cardholder data isolated and encrypted |
| Warehouse | Per-team schemas/datasets | Multi-tenant cost allocation and access control |
| Orchestration | Airflow (MWAA/Composer) | Enterprise standard, DAG-based, retries, alerting |
| Retention | 7 years raw, 3 years curated | Regulatory (banking records retention) |
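The lake-format and retention decisions above can be made concrete with a short DDL sketch. This is illustrative only — the catalog, schema, and column names are assumptions, not part of any real system:

```sql
-- Raw-zone transactions table in Iceberg format (Spark SQL, names hypothetical)
CREATE TABLE lake.raw_payments.transactions (
  transaction_id STRING,
  card_token     STRING,          -- tokenized PAN; raw cardholder data never lands here
  amount         DECIMAL(18, 2),
  transaction_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(transaction_ts));
```

Because the table is Iceberg, every write produces a new snapshot, which supplies the change history auditors expect from this pattern.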
Pattern 2: Real-Time Streaming Platform (Fraud Detection)
For use cases requiring sub-second processing — fraud detection, transaction monitoring, real-time personalization.
Fraud Detection Scoring Flow
Pattern 3: Modern Lakehouse (Delta Lake / Iceberg)
The lakehouse eliminates the data warehouse as a separate copy. Instead, the warehouse engine queries the data lake directly using open table formats.
Why Lakehouse?
| Traditional (Lake + Warehouse) | Lakehouse |
|---|---|
| Data copied from lake to warehouse | Single copy in lake, warehouse queries it |
| ETL pipeline to load warehouse | No loading ETL, direct query |
| Two systems to manage and pay for | One storage layer, multiple query engines |
| Schema divergence (lake vs warehouse) | Single schema of truth |
| Cannot do ML on warehouse data easily | ML frameworks read lake directly |
| ACID only in warehouse | ACID on the lake (Iceberg/Delta) |
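The "warehouse queries the lake directly" row can be illustrated with BigQuery's external-table syntax for Iceberg (BigLake). Project, connection, and bucket names here are placeholders:

```sql
-- BigQuery external table over an Iceberg table in GCS (names hypothetical)
CREATE EXTERNAL TABLE analytics.transactions_iceberg
WITH CONNECTION `my-project.us.lake-connection`
OPTIONS (
  format = 'ICEBERG',
  uris   = ['gs://enterprise-data-lake/iceberg/transactions/metadata/v42.metadata.json']
);
```

The same files remain readable by Spark, Trino, or Flink — no copy into warehouse-native storage.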
Iceberg vs Delta Lake
| Feature | Apache Iceberg | Delta Lake |
|---|---|---|
| Origin | Netflix → Apache | Databricks |
| Open source | Fully open | Open (core), some features Databricks-only |
| Engine support | Spark, Trino, Presto, Flink, BigQuery, Redshift | Spark (best), some Trino/Presto support |
| Partition evolution | Yes (change partition without rewrite) | No (requires full rewrite) |
| Hidden partitioning | Yes (users do not see partition columns) | No |
| Community | Growing rapidly, becoming the standard | Large Databricks ecosystem |
| Recommendation | Multi-engine environments (GCP + AWS) | Databricks-centric environments |
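Partition evolution, the clearest differentiator in the table above, looks like this in Spark SQL with the Iceberg extensions enabled (table and column names illustrative):

```sql
-- Switch from daily partitions to bucketing by customer, without rewriting data
ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, customer_id);
ALTER TABLE lake.db.events DROP PARTITION FIELD days(event_ts);
```

Existing files keep the old layout; only newly written data uses the new partition spec, and queries read both transparently.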
Pattern 4: Multi-Cloud Data Platform (Snowflake)
When the enterprise operates on both AWS and GCP (or Azure) and needs a single SQL interface for analysts.
Snowflake Key Concepts
| Concept | Description |
|---|---|
| Virtual Warehouse | Compute cluster (T-shirt sizes: XS to 6XL). Each WH is independent — no resource contention between teams |
| Snowpipe | Serverless auto-ingest from S3/GCS/Azure Blob. Detects new files and loads automatically |
| Data Sharing | Share live data between Snowflake accounts (even cross-cloud) without copying |
| Time Travel | Query data as it existed at any point in the past (up to 90 days, Enterprise) |
| Zero-Copy Cloning | Instant clone of database/schema/table for testing — no storage duplication |
| Streams + Tasks | CDC on Snowflake tables (detect changes) + scheduled SQL tasks (mini-Airflow) |
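Zero-Copy Cloning and Time Travel from the table above are one-liners in Snowflake SQL; database and table names below are illustrative:

```sql
-- Instant clone of production for testing: metadata-only, no storage duplicated
CREATE DATABASE analytics_test CLONE analytics;

-- Time Travel: row count as of 24 hours ago
SELECT COUNT(*)
FROM analytics.public.transactions
  AT (OFFSET => -60 * 60 * 24);
```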
BigQuery vs Redshift vs Snowflake
| Criterion | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Cloud | GCP only | AWS only | AWS, GCP, Azure |
| Pricing model | On-demand ($6.25/TB) or slots | Per-node or Serverless RPUs | Per-credit (warehouse-size dependent) |
| Serverless | Yes (always) | Serverless option | Yes (always) |
| Separation of compute/storage | Yes | Yes (RA3) | Yes |
| Concurrency | High (auto-allocate slots) | Concurrency scaling (add clusters) | Per-warehouse (independent compute) |
| Streaming ingestion | Storage Write API | Kinesis → Redshift Streaming | Snowpipe Streaming |
| Multi-cloud | No | No | Yes (primary advantage) |
| Data sharing | Analytics Hub (limited) | Data Sharing (cross-account) | Secure Data Sharing (cross-cloud) |
| Open format | Native + Iceberg | Native + Spectrum (S3) | Native (proprietary storage) |
| Best for | GCP-native, ad-hoc analytics | AWS-heavy, predictable workloads | Multi-cloud, data sharing |
Pattern 5: Data Mesh (Decentralized)
For large enterprises where a centralized data team becomes a bottleneck. Each domain team owns their data as a product.
Data Mesh Principles
| Principle | What It Means | Central Team Role |
|---|---|---|
| Domain ownership | Each business domain owns their data pipelines and products | Provide tools, not run pipelines |
| Data as a product | Data has SLAs, quality metrics, documentation, discoverability | Define standards, enforce via CI |
| Self-serve platform | Domain teams provision infrastructure without tickets | Terraform modules, golden paths |
| Federated governance | Central policies, domain-level decisions | Policy engine, automated enforcement |
When Data Mesh Works vs When It Does Not
Terraform
S3 Data Lake with Lifecycle Policies
```hcl
# --- Data Lake S3 Buckets ---
resource "aws_s3_bucket" "data_lake" {
  bucket = "enterprise-data-lake-${var.environment}"

  tags = {
    Environment = var.environment
    Team        = "data-platform"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_lake.arn
    }
    bucket_key_enabled = true
  }
}

# Lifecycle: raw data moves to IA after 90 days, Glacier after 1 year
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "raw-zone-lifecycle"
    status = "Enabled"

    filter {
      prefix = "raw/"
    }

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }

    expiration {
      days = 2555 # 7 years (regulatory retention)
    }
  }

  rule {
    id     = "curated-zone-lifecycle"
    status = "Enabled"

    filter {
      prefix = "curated/"
    }

    transition {
      days          = 180
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 1095 # 3 years
    }
  }
}

# Block public access
resource "aws_s3_bucket_public_access_block" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Lake Formation Permissions
```hcl
# --- Lake Formation ---
resource "aws_lakeformation_data_lake_settings" "main" {
  admins = [aws_iam_role.data_platform_admin.arn]
}

resource "aws_lakeformation_resource" "data_lake" {
  arn = aws_s3_bucket.data_lake.arn
}

# Grant payments team access to payments database
resource "aws_lakeformation_permissions" "payments_team" {
  principal   = aws_iam_role.payments_team.arn
  permissions = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.payments.name
    wildcard      = true # All tables in payments database
  }
}

# Column-level restriction: mask SSN for analytics team
resource "aws_lakeformation_permissions" "analytics_restricted" {
  principal   = aws_iam_role.analytics_team.arn
  permissions = ["SELECT"]

  table_with_columns {
    database_name         = aws_glue_catalog_database.payments.name
    name                  = "customers"
    excluded_column_names = ["ssn", "full_pan"] # Cannot see these columns
  }
}
```

Redshift Serverless
```hcl
resource "aws_redshiftserverless_namespace" "analytics" {
  namespace_name        = "analytics"
  db_name               = "analytics"
  admin_username        = "admin"
  manage_admin_password = true # Secrets Manager auto-rotation
  kms_key_id            = aws_kms_key.redshift.arn

  iam_roles   = [aws_iam_role.redshift_spectrum.arn]
  log_exports = ["userlog", "connectionlog", "useractivitylog"]

  tags = {
    Environment = "production"
    Team        = "data-platform"
  }
}

resource "aws_redshiftserverless_workgroup" "analytics" {
  namespace_name = aws_redshiftserverless_namespace.analytics.namespace_name
  workgroup_name = "analytics-workgroup"

  base_capacity = 32 # RPUs (Redshift Processing Units), min 8

  security_group_ids  = [aws_security_group.redshift.id]
  subnet_ids          = var.data_subnet_ids
  publicly_accessible = false

  config_parameter {
    parameter_key   = "datestyle"
    parameter_value = "ISO, MDY"
  }

  config_parameter {
    parameter_key   = "enable_user_activity_logging"
    parameter_value = "true"
  }

  tags = {
    Environment = "production"
  }
}
```

GCS Data Lake with Lifecycle
```hcl
# --- Data Lake GCS Buckets ---
resource "google_storage_bucket" "data_lake" {
  name          = "enterprise-data-lake-${var.environment}-${var.project_id}"
  location      = var.region
  project       = var.project_id
  storage_class = "STANDARD"

  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }

  encryption {
    default_kms_key_name = google_kms_crypto_key.data_lake.id
  }

  # Raw zone: move to Nearline after 90 days, Coldline after 1 year
  lifecycle_rule {
    condition {
      age            = 90
      matches_prefix = ["raw/"]
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  lifecycle_rule {
    condition {
      age            = 365
      matches_prefix = ["raw/"]
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }

  lifecycle_rule {
    condition {
      age            = 2555 # 7 years
      matches_prefix = ["raw/"]
    }
    action {
      type = "Delete"
    }
  }

  # Curated zone: Nearline after 6 months
  lifecycle_rule {
    condition {
      age            = 180
      matches_prefix = ["curated/"]
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  labels = {
    environment = var.environment
    team        = "data-platform"
  }
}
```

Dataplex Lake + Zones
```hcl
# --- Dataplex Data Governance ---
resource "google_dataplex_lake" "banking" {
  location     = var.region
  name         = "banking-data-lake"
  display_name = "Banking Data Lake"
  project      = var.project_id

  labels = {
    environment = "production"
  }
}

resource "google_dataplex_zone" "raw" {
  lake     = google_dataplex_lake.banking.name
  location = var.region
  name     = "raw-zone"
  project  = var.project_id
  type     = "RAW"

  discovery_spec {
    enabled  = true
    schedule = "0 */6 * * *" # Discover new data every 6 hours
  }

  resource_spec {
    location_type = "SINGLE_REGION"
  }
}

resource "google_dataplex_zone" "curated" {
  lake     = google_dataplex_lake.banking.name
  location = var.region
  name     = "curated-zone"
  project  = var.project_id
  type     = "CURATED"

  discovery_spec {
    enabled  = true
    schedule = "0 */6 * * *"
  }

  resource_spec {
    location_type = "SINGLE_REGION"
  }
}

# Attach GCS bucket as asset
resource "google_dataplex_asset" "raw_data" {
  name          = "raw-data-asset"
  location      = var.region
  lake          = google_dataplex_lake.banking.name
  dataplex_zone = google_dataplex_zone.raw.name
  project       = var.project_id

  discovery_spec {
    enabled = true
  }

  resource_spec {
    name = "projects/${var.project_id}/buckets/${google_storage_bucket.data_lake.name}"
    type = "STORAGE_BUCKET"
  }
}
```

Dataproc Cluster with Autoscaling
```hcl
resource "google_dataproc_cluster" "spark" {
  name    = "data-processing-cluster"
  region  = var.region
  project = var.project_id

  cluster_config {
    staging_bucket = google_storage_bucket.dataproc_staging.name

    master_config {
      num_instances = 1
      machine_type  = "n2-standard-8"

      disk_config {
        boot_disk_type    = "pd-ssd"
        boot_disk_size_gb = 100
      }
    }

    worker_config {
      num_instances = 2
      machine_type  = "n2-standard-16"

      disk_config {
        boot_disk_type    = "pd-ssd"
        boot_disk_size_gb = 100
        num_local_ssds    = 1
      }
    }

    autoscaling_config {
      policy_uri = google_dataproc_autoscaling_policy.spark.name
    }

    software_config {
      image_version       = "2.1-debian11"
      optional_components = ["JUPYTER"]

      override_properties = {
        "spark:spark.sql.catalog.iceberg"      = "org.apache.iceberg.spark.SparkCatalog"
        "spark:spark.sql.catalog.iceberg.type" = "hive"
        "spark:spark.sql.extensions"           = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
      }
    }

    gce_cluster_config {
      subnetwork       = google_compute_subnetwork.data.id
      service_account  = google_service_account.dataproc.email
      internal_ip_only = true

      shielded_instance_config {
        enable_secure_boot = true
      }
    }
  }

  labels = {
    environment = "production"
    team        = "data-platform"
  }
}

resource "google_dataproc_autoscaling_policy" "spark" {
  policy_id = "spark-autoscaling"
  location  = var.region
  project   = var.project_id

  worker_config {
    min_instances = 2
    max_instances = 20
    weight        = 1
  }

  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "3600s"
      scale_up_factor               = 1.0
      scale_down_factor             = 1.0
      scale_up_min_worker_fraction  = 0.0
    }
    cooldown_period = "120s"
  }
}
```

Interview Scenarios
Scenario 1: BigQuery Cost Explosion
Q: “Your BigQuery costs jumped from $5,000 to $25,000 this month. The CFO wants answers. What do you do?”
A: This is almost certainly caused by either new users running unoptimized queries on on-demand pricing, or a table losing its partitioning.
Step 1 — Find the culprits:
```sql
-- Top 10 most expensive queries this month
SELECT
  user_email,
  job_id,
  total_bytes_processed / POW(10, 12) AS tb_scanned,
  total_bytes_processed / POW(10, 12) * 6.25 AS cost_usd,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= '2026-03-01'
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 10;
```

Step 2 — Common root causes:
| Root Cause | Fix |
|---|---|
| `SELECT *` on unpartitioned 50TB table | Add partition filter, retrain user |
| Dashboard refreshing every 5 minutes on large table | Add BI Engine cache, materialized view |
| New team member running exploratory queries | Set per-user byte quota |
| Table was recreated without partitioning | Re-create with PARTITION BY and CLUSTER BY |
| Streaming inserts doubled (new data source) | Switch to Storage Write API (cheaper) |
Step 3 — Implement controls:
- Switch to Enterprise edition with 500-slot reservation: predictable cost of ~$21,900/month regardless of query volume
- Require partition filters: `ALTER TABLE ... SET OPTIONS (require_partition_filter = true)` on all large tables
- Custom quotas: limit bytes scanned per user per day
- Slot assignments: allocate 200 slots to analytics team, 100 to ML team, 200 to platform
- BI Engine: enable for dashboard datasets (sub-second response, no repeated scans)
- Cost alerting: budget alert at $15,000 (60% of limit) and $20,000 (80%)
Scenario 2: Glue Job Optimization
Q: “Your Glue ETL job processes 500GB daily but takes 4 hours and costs $200/run. How do you optimize it?”
A: 4 hours for 500GB suggests the job is not efficiently using its DPUs. Here is my optimization approach:
Diagnosis:
Optimizations (in order of impact):
| Optimization | Expected Impact |
|---|---|
| Enable job bookmarks | Process only new data (incremental), skip already-processed files. If 90% of data is unchanged, 10x speedup |
| Pushdown predicates | Filter at S3 scan (partition pruning). push_down_predicate = "year=2026 AND month=3" |
| Coalesce small files | If source has 100K small files, reading each is slow. groupFiles = "inPartition" + groupSize = "1073741824" (1GB groups) |
| Switch to Glue 4.0 | Spark 3.3, better optimizer, adaptive query execution (AQE) |
| Right-size workers | If memory-bound: switch from G.1X to G.2X. If CPU-bound: add more workers |
| Repartition before write | Avoid small output files: .repartition(50) before writing |
| Use Parquet input | If reading JSON/CSV, the first run should convert to Parquet. Subsequent runs on Parquet are 10-100x faster |
Target: 500GB job should complete in 30-45 minutes with 20 DPUs ≈ $6-8/run.
Scenario 3: CDC Aurora to BigQuery
Q: “Design CDC from Aurora PostgreSQL (AWS) into BigQuery (GCP) for near-real-time analytics.”
A: Cross-cloud CDC requires careful architecture. Here are two viable approaches:
Approach A: DMS → S3 → GCS → BigQuery (10-15 min latency)
Approach B: DMS → Kinesis → Lambda → BigQuery API (2-5 min latency)
My recommendation: Approach B for the bank use case. The BigQuery Storage Write API supports exactly-once semantics, which is critical for financial data. The end-to-end latency is 2-5 minutes vs 10-15 for the S3-based approach.
Key considerations:
- Schema changes: DMS handles column additions but not type changes. Set up alerting for schema drift
- Network: Aurora → DMS → Kinesis stays within AWS. Lambda calls BigQuery API over the internet (use VPC endpoint for Private Google Access if needed)
- Exactly-once: BigQuery Storage Write API in committed mode deduplicates by stream offset
- Monitoring: DMS task metrics + Kinesis iterator age + BigQuery streaming buffer size
- Cost: DMS instance (~$200/month for r5.large) + Kinesis ($0.015/shard-hr) + Lambda (negligible) + BigQuery streaming (included with Enterprise)
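If exactly-once delivery cannot be guaranteed end to end, a common fallback is to land CDC events in a staging table and periodically MERGE them into the target. This is a hedged sketch — the table, column, and op-code names are assumptions:

```sql
MERGE payments.transactions AS t
USING payments.transactions_cdc_staging AS s
  ON t.transaction_id = s.transaction_id
WHEN MATCHED AND s.op = 'D' THEN
  DELETE
WHEN MATCHED AND s.op = 'U' THEN
  UPDATE SET amount = s.amount, status = s.status
WHEN NOT MATCHED AND s.op = 'I' THEN
  INSERT (transaction_id, amount, status)
  VALUES (s.transaction_id, s.amount, s.status);
```

Because the MERGE keys on `transaction_id`, replayed events from Kinesis or Lambda retries do not produce duplicates.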
Scenario 4: Self-Service Analytics for 10 Teams
Q: “How do you provide self-service analytics to 10 teams without them stepping on each other?”
A: This is a multi-tenancy problem for the data warehouse.
BigQuery approach:
Guardrails:
- Per-project slot assignments — no team can consume all compute
- Require partition filters — prevent full table scans
- Per-user byte quotas — limit daily scan volume
- Authorized views — teams see only their data (row-level security)
- Column-level security — PII columns masked for non-compliance teams
- Cost labels — every query tagged with team, enabling chargeback
- BI Engine — enable for dashboard datasets (eliminates repeated scans)
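Two of those guardrails — row-level security and authorized views — are plain DDL in BigQuery. The dataset, table, column, and group names below are placeholders:

```sql
-- Row-level security: this team only ever sees its own region
CREATE ROW ACCESS POLICY emea_only
ON analytics.transactions
GRANT TO ('group:emea-analysts@example.com')
FILTER USING (region = 'EMEA');

-- Authorized view: a curated slice shared without granting access to the base table
CREATE VIEW team_payments.daily_volume AS
SELECT DATE(transaction_ts) AS day, SUM(amount) AS total_amount
FROM analytics.transactions
GROUP BY day;
```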
Redshift approach: Separate namespaces per team in Redshift Serverless, with data sharing from a central namespace. Each namespace has its own RPU limits.
Scenario 5: Data Governance for Regulated Enterprise
Q: “Design data governance for a regulated bank. Auditors need to trace any number on any report back to its source.”
A: This requires end-to-end data lineage, classification, and access controls.
Scenario 6: Lakehouse vs Traditional Warehouse
Q: “Explain the lakehouse architecture. When would you use it over a traditional data warehouse?”
A: A lakehouse combines the reliability of a data warehouse (ACID, schema enforcement) with the flexibility of a data lake (open formats, multi-engine access).
Use lakehouse when:
- You want to eliminate the ETL from lake to warehouse (single copy of data)
- ML teams need to read the same data that BI teams query (Spark reads Iceberg, BigQuery reads Iceberg)
- You need open formats to avoid warehouse vendor lock-in
- You want time travel and row-level operations on lake data (GDPR deletes)
Use traditional warehouse when:
- Query performance is the top priority (native warehouse format is optimized)
- Your team is SQL-only (no Spark, no ML on raw data)
- You are already invested in BigQuery/Redshift and it is working well
- Sub-second dashboard queries are required (BI Engine works on native BQ tables)
Practical advice: Start with a traditional warehouse (BigQuery/Redshift). Adopt lakehouse (Iceberg) when you need multi-engine access or the warehouse ETL becomes a bottleneck. Many enterprises run a hybrid: Iceberg for the lake, BigQuery for dashboards with external tables pointing to Iceberg.
Scenario 7: Choosing Between BigQuery, Redshift, and Snowflake
Q: “A client runs workloads on both AWS and GCP. They need a data warehouse. BigQuery, Redshift, or Snowflake?”
A: For a multi-cloud enterprise, the decision depends on where the data originates and where it is consumed.
| Factor | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Data on AWS | Cross-cloud transfer needed | Native S3 access | Snowpipe from S3 |
| Data on GCP | Native GCS access | Cross-cloud transfer needed | Snowpipe from GCS |
| Single SQL interface | No (separate systems) | No (separate systems) | Yes (one account, both clouds) |
| Data sharing across clouds | Limited (Analytics Hub) | No | Yes (secure shares) |
| Vendor lock-in | GCP | AWS | Snowflake (but runs on any cloud) |
My recommendation for true multi-cloud:
- Snowflake if cross-cloud data sharing and a single interface are the primary requirements
- BigQuery for GCP-heavy workloads + Redshift for AWS-heavy workloads with a data mesh approach (each cloud’s warehouse handles its domain)
- Iceberg lakehouse as the long-term play: both BigQuery and Redshift can query Iceberg tables, eventually converging on a single data format without a single warehouse vendor
Scenario 8: Designing a Data Quality Framework
Q: “How do you ensure data quality in an enterprise data platform?”
A: Data quality must be automated, measured, and visible. Here is my framework:
Quality checks at every stage:
Enforcement: dbt tests run as a gate in the Airflow DAG. If any test fails, downstream models do not execute, dashboards show stale (but correct) data, and the data team is paged.
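Most of these checks reduce to SQL that must return zero rows; the orchestrator fails the gate when any query returns one. A sketch with illustrative table and column names:

```sql
-- Freshness: returns a row (and fails the gate) if nothing arrived in 24 hours
SELECT MAX(transaction_ts) AS latest_row
FROM payments.transactions
HAVING MAX(transaction_ts) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);

-- Uniqueness: returns one row per duplicated primary key
SELECT transaction_id, COUNT(*) AS n
FROM payments.transactions
GROUP BY transaction_id
HAVING n > 1;
```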
Scenario 9: Query Scanning Too Much Data
Q: “A BigQuery query scans 10TB but should only need 100GB. What is wrong?”
A: This is a classic cost and performance problem. The root causes in order of likelihood:
1. Table not partitioned: query scans all data instead of one partition

   ```sql
   -- Fix: recreate table with partitioning
   CREATE TABLE payments.transactions_v2
   PARTITION BY DATE(transaction_date)
   CLUSTER BY region, customer_id
   AS SELECT * FROM payments.transactions;
   ```

2. No partition filter in query: even with partitioning, `SELECT * FROM transactions` scans all partitions

   ```sql
   -- Enforce partition filter
   ALTER TABLE payments.transactions
   SET OPTIONS (require_partition_filter = true);
   -- Now this fails: SELECT * FROM transactions
   -- This works:     SELECT * FROM transactions WHERE transaction_date = '2026-03-15'
   ```

3. `SELECT *` instead of specific columns: scans all columns even if you only need 3

   ```sql
   -- Bad:  SELECT * FROM transactions WHERE date = '2026-03-15'
   --       Scans ALL columns (100 columns × 1 day = 10GB)
   -- Good: SELECT customer_id, amount, merchant FROM transactions WHERE date = '2026-03-15'
   --       Scans 3 columns (3 columns × 1 day = 0.3GB)
   ```

4. JOIN exploding rows: joining two large tables without proper keys

5. Missing clustering: even with partitioning, scanning an entire partition can be large. Clustering by frequently filtered columns reduces scan further
Prevention: require partition filters on all tables, set per-user scan quotas, train users to use `--dry_run` to estimate cost before running, and enable the query validator in BI tools.