
Data Platform — Architecture Patterns

This page builds on the Pipeline Stages foundation. Those are the building blocks; here we assemble them into complete enterprise architectures. Each pattern addresses different business requirements — from traditional batch analytics to real-time fraud detection to decentralized data mesh.


Pattern 1: Traditional Bank Data Platform

The most common data platform pattern at regulated financial institutions. It prioritizes reliability, auditability, and compliance over real-time speed.

Traditional bank data platform

| Decision | Choice | Rationale |
|---|---|---|
| Ingestion | CDC via DMS/Datastream | Near-real-time with minimal source impact |
| Lake format | Parquet with Iceberg | ACID, schema evolution, audit trail |
| PCI data | Separate zone, tokenized | PCI-DSS: cardholder data isolated and encrypted |
| Warehouse | Per-team schemas/datasets | Multi-tenant cost allocation and access control |
| Orchestration | Airflow (MWAA/Composer) | Enterprise standard, DAG-based, retries, alerting |
| Retention | 7 years raw, 3 years curated | Regulatory (banking records retention) |

Pattern 2: Real-Time Streaming Platform (Fraud Detection)


For use cases requiring sub-second processing — fraud detection, transaction monitoring, real-time personalization.

Real-time streaming platform for fraud detection

Fraud detection scoring flow


Pattern 3: Modern Lakehouse (Delta Lake / Iceberg)


The lakehouse eliminates the data warehouse as a separate copy. Instead, the warehouse engine queries the data lake directly using open table formats.

Lakehouse architecture with Apache Iceberg

| Traditional (Lake + Warehouse) | Lakehouse |
|---|---|
| Data copied from lake to warehouse | Single copy in lake, warehouse queries it |
| ETL pipeline to load warehouse | No loading ETL, direct query |
| Two systems to manage and pay for | One storage layer, multiple query engines |
| Schema divergence (lake vs warehouse) | Single schema of truth |
| Cannot do ML on warehouse data easily | ML frameworks read lake directly |
| ACID only in warehouse | ACID on the lake (Iceberg/Delta) |
| Feature | Apache Iceberg | Delta Lake |
|---|---|---|
| Origin | Netflix → Apache | Databricks |
| Open source | Fully open | Open (core), some features Databricks-only |
| Engine support | Spark, Trino, Presto, Flink, BigQuery, Redshift | Spark (best), some Trino/Presto support |
| Partition evolution | Yes (change partition without rewrite) | No (requires full rewrite) |
| Hidden partitioning | Yes (users do not see partition columns) | No |
| Community | Growing rapidly, becoming the standard | Large Databricks ecosystem |
| Recommendation | Multi-engine environments (GCP + AWS) | Databricks-centric environments |
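Hidden partitioning is the easiest Iceberg feature to misunderstand, so here is a minimal pure-Python sketch of the idea (names are illustrative, not the real Iceberg API): table metadata stores a partition transform such as day(event_ts), and the engine derives which files to read from an ordinary timestamp predicate.

```python
from datetime import datetime, timezone

# Sketch of Iceberg's hidden-partitioning idea (names are illustrative,
# NOT the real Iceberg API): table metadata stores a partition transform
# such as day(event_ts); the engine derives partition values from an
# ordinary timestamp predicate, so users never query partition columns.

def day_transform(ts: datetime) -> str:
    """Iceberg-style day() transform: timestamp -> partition value."""
    return ts.date().isoformat()

def prune_files(files_by_partition: dict, predicate_ts: datetime) -> list:
    """Keep only the data files whose partition matches the predicate's day."""
    return files_by_partition.get(day_transform(predicate_ts), [])

# Data files grouped by derived partition value (as table metadata would hold)
files_by_partition = {
    "2026-03-15": ["file_a.parquet", "file_b.parquet"],
    "2026-03-16": ["file_c.parquet"],
}

# A filter on event_ts prunes the scan to a single day's files
ts = datetime(2026, 3, 15, 14, 30, tzinfo=timezone.utc)
print(prune_files(files_by_partition, ts))  # ['file_a.parquet', 'file_b.parquet']
```

Partition evolution follows from the same idea: because queries reference the transform rather than physical partition columns, the table can switch from day() to hour() without rewriting old files.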

Pattern 4: Multi-Cloud Data Platform (Snowflake)


When the enterprise operates on both AWS and GCP (or Azure) and needs a single SQL interface for analysts.

Multi-cloud data platform with Snowflake

| Concept | Description |
|---|---|
| Virtual Warehouse | Compute cluster (T-shirt sizes: XS to 6XL). Each warehouse is independent — no resource contention between teams |
| Snowpipe | Serverless auto-ingest from S3/GCS/Azure Blob. Detects new files and loads automatically |
| Data Sharing | Share live data between Snowflake accounts (even cross-cloud) without copying |
| Time Travel | Query data as it existed at any point in the past (up to 90 days on Enterprise edition) |
| Zero-Copy Cloning | Instant clone of database/schema/table for testing — no storage duplication |
| Streams + Tasks | CDC on Snowflake tables (detect changes) + scheduled SQL tasks (mini-Airflow) |
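For sizing discussions, it helps to remember that credits per hour roughly double with each T-shirt size (XS = 1 up to 6XL = 512). A back-of-envelope cost model, assuming an illustrative $3.00/credit rate (actual rates vary by edition and contract):

```python
# Back-of-envelope Snowflake warehouse cost. Assumptions: the standard
# credits-per-hour doubling per T-shirt size, and an illustrative
# $3.00/credit rate -- actual rates vary by edition and contract.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL", "5XL", "6XL"]
CREDITS_PER_HOUR = {size: 2 ** i for i, size in enumerate(SIZES)}  # XS=1 ... 6XL=512

def monthly_cost(size: str, hours_per_day: float, price_per_credit: float = 3.0) -> float:
    """Credits burned per hour x hours active x 30 days x price per credit."""
    return CREDITS_PER_HOUR[size] * hours_per_day * 30 * price_per_credit

# A Medium warehouse running 8h/day: 4 credits/h * 8 * 30 * $3 = $2,880/month
print(round(monthly_cost("M", 8)))  # 2880
```

Because each virtual warehouse is independent compute, this per-warehouse model is also why auto-suspend settings matter: an idle-but-running warehouse burns credits at the same rate as a busy one.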
| Criterion | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Cloud | GCP only | AWS only | AWS, GCP, Azure |
| Pricing model | On-demand ($6.25/TB) or slots | Per-node or Serverless RPUs | Per-credit (warehouse-size dependent) |
| Serverless | Yes (always) | Serverless option | Yes (always) |
| Separation of compute/storage | Yes | Yes (RA3) | Yes |
| Concurrency | High (auto-allocate slots) | Concurrency scaling (add clusters) | Per-warehouse (independent compute) |
| Streaming ingestion | Storage Write API | Kinesis → Redshift Streaming | Snowpipe Streaming |
| Multi-cloud | No | No | Yes (primary advantage) |
| Data sharing | Analytics Hub (limited) | Data Sharing (cross-account) | Secure Data Sharing (cross-cloud) |
| Open format | Native + Iceberg | Native + Spectrum (S3) | Native (proprietary storage) |
| Best for | GCP-native, ad-hoc analytics | AWS-heavy, predictable workloads | Multi-cloud, data sharing |
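For the on-demand pricing row, per-query cost is a straight function of bytes scanned. A one-line sketch using the $6.25/TB figure from the comparison (the function name is illustrative):

```python
# BigQuery on-demand cost model: price scales linearly with bytes scanned.
# $6.25/TB is the figure used throughout this page; verify current pricing.
def bq_on_demand_cost(bytes_processed: int, price_per_tb: float = 6.25) -> float:
    return bytes_processed / 10**12 * price_per_tb

# A query scanning 4 TB costs $25 on demand
print(bq_on_demand_cost(4 * 10**12))  # 25.0
```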

Pattern 5: Data Mesh

For large enterprises where a centralized data team becomes a bottleneck. Each domain team owns its data as a product.

Data mesh architecture

| Principle | What It Means | Central Team Role |
|---|---|---|
| Domain ownership | Each business domain owns their data pipelines and products | Provide tools, not run pipelines |
| Data as a product | Data has SLAs, quality metrics, documentation, discoverability | Define standards, enforce via CI |
| Self-serve platform | Domain teams provision infrastructure without tickets | Terraform modules, golden paths |
| Federated governance | Central policies, domain-level decisions | Policy engine, automated enforcement |
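The "data as a product" principle is easiest to make concrete with a contract the central team can validate in CI. The sketch below uses invented field names — there is no standard schema for such contracts — but shows the shape: SLAs and quality checks are declared by the domain team, enforced centrally.

```python
from dataclasses import dataclass, field

# Illustrative "data as a product" contract a domain team might publish.
# Field names are an assumption, not a standard; the central platform team
# validates these declarations in CI rather than running the pipeline itself.
@dataclass
class DataProductContract:
    name: str
    owner_team: str
    freshness_sla_minutes: int                        # max allowed staleness
    quality_checks: list = field(default_factory=list)  # e.g. ["no_null_ids"]

    def violates_freshness(self, staleness_minutes: int) -> bool:
        """True if observed staleness breaches the declared SLA."""
        return staleness_minutes > self.freshness_sla_minutes

payments = DataProductContract(
    "payments.transactions", "payments", 60,
    ["no_null_ids", "amount_positive"],
)
print(payments.violates_freshness(90))  # True
```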

Data mesh — when it works vs when it fails


# --- Data Lake S3 Buckets ---
resource "aws_s3_bucket" "data_lake" {
  bucket = "enterprise-data-lake-${var.environment}"

  tags = {
    Environment = var.environment
    Team        = "data-platform"
  }
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_lake.arn
    }
    bucket_key_enabled = true
  }
}

# Lifecycle: raw data moves to IA after 90 days, Glacier after 1 year
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "raw-zone-lifecycle"
    status = "Enabled"

    filter {
      prefix = "raw/"
    }

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 365
      storage_class = "GLACIER"
    }

    expiration {
      days = 2555 # 7 years (regulatory retention)
    }
  }

  rule {
    id     = "curated-zone-lifecycle"
    status = "Enabled"

    filter {
      prefix = "curated/"
    }

    transition {
      days          = 180
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 1095 # 3 years
    }
  }
}

# Block public access
resource "aws_s3_bucket_public_access_block" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# --- Lake Formation ---
resource "aws_lakeformation_data_lake_settings" "main" {
  admins = [aws_iam_role.data_platform_admin.arn]
}

resource "aws_lakeformation_resource" "data_lake" {
  arn = aws_s3_bucket.data_lake.arn
}

# Grant payments team access to payments database
resource "aws_lakeformation_permissions" "payments_team" {
  principal   = aws_iam_role.payments_team.arn
  permissions = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.payments.name
    wildcard      = true # All tables in payments database
  }
}

# Column-level restriction: mask SSN for analytics team
resource "aws_lakeformation_permissions" "analytics_restricted" {
  principal   = aws_iam_role.analytics_team.arn
  permissions = ["SELECT"]

  table_with_columns {
    database_name         = aws_glue_catalog_database.payments.name
    name                  = "customers"
    excluded_column_names = ["ssn", "full_pan"] # Cannot see these columns
  }
}

# --- Redshift Serverless ---
resource "aws_redshiftserverless_namespace" "analytics" {
  namespace_name        = "analytics"
  db_name               = "analytics"
  admin_username        = "admin"
  manage_admin_password = true # Secrets Manager auto-rotation
  kms_key_id            = aws_kms_key.redshift.arn
  iam_roles             = [aws_iam_role.redshift_spectrum.arn]
  log_exports           = ["userlog", "connectionlog", "useractivitylog"]

  tags = {
    Environment = "production"
    Team        = "data-platform"
  }
}

resource "aws_redshiftserverless_workgroup" "analytics" {
  namespace_name      = aws_redshiftserverless_namespace.analytics.namespace_name
  workgroup_name      = "analytics-workgroup"
  base_capacity       = 32 # RPUs (Redshift Processing Units), min 8
  security_group_ids  = [aws_security_group.redshift.id]
  subnet_ids          = var.data_subnet_ids
  publicly_accessible = false

  config_parameter {
    parameter_key   = "datestyle"
    parameter_value = "ISO, MDY"
  }

  config_parameter {
    parameter_key   = "enable_user_activity_logging"
    parameter_value = "true"
  }

  tags = {
    Environment = "production"
  }
}

Scenario 1: BigQuery Cost Spike Investigation

Q: “Your BigQuery costs jumped from $5,000 to $25,000 this month. The CFO wants answers. What do you do?”

A: This is almost certainly caused by either new users running unoptimized queries on on-demand pricing, or a table losing its partitioning.

Step 1 — Find the culprits:

-- Top 10 most expensive queries this month
SELECT
  user_email,
  job_id,
  total_bytes_processed / POW(10,12) AS tb_scanned,
  total_bytes_processed / POW(10,12) * 6.25 AS cost_usd,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= '2026-03-01'
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 10;

Step 2 — Common root causes:

| Root Cause | Fix |
|---|---|
| SELECT * on unpartitioned 50TB table | Add partition filter, retrain user |
| Dashboard refreshing every 5 minutes on large table | Add BI Engine cache, materialized view |
| New team member running exploratory queries | Set per-user byte quota |
| Table was recreated without partitioning | Re-create with PARTITION BY and CLUSTER BY |
| Streaming inserts doubled (new data source) | Switch to Storage Write API (cheaper) |

Step 3 — Implement controls:

  1. Switch to Enterprise edition with 500-slot reservation: predictable cost of ~$21,900/month regardless of query volume
  2. Require partition filters: ALTER TABLE SET OPTIONS(require_partition_filter=true) on all large tables
  3. Custom quotas: limit bytes scanned per user per day
  4. Slot assignments: allocate 200 slots to analytics team, 100 to ML team, 200 to platform
  5. BI Engine: enable for dashboard datasets (sub-second response, no repeated scans)
  6. Cost alerting: budget alert at $15,000 (60% of limit) and $20,000 (80%)
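The trade-off in control #1 can be checked with simple arithmetic: at the stated ~$21,900/month for a 500-slot reservation and $6.25/TB on demand, the reservation pays off once monthly scan volume exceeds roughly 3.5 PB.

```python
# Break-even between on-demand pricing ($6.25/TB, from Step 1's query)
# and the ~$21,900/month 500-slot Enterprise reservation from Step 3.
RESERVATION_MONTHLY = 21_900   # USD/month, figure quoted above
ON_DEMAND_PER_TB = 6.25        # USD/TB scanned

break_even_tb = RESERVATION_MONTHLY / ON_DEMAND_PER_TB
print(round(break_even_tb))  # 3504 -> above ~3.5 PB/month scanned, the reservation wins
```

At the $25,000/month spend in the question (4,000 TB on demand), the reservation is already the cheaper option, with the bonus of a fixed, CFO-friendly bill.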

Scenario 2: Glue ETL Job Optimization

Q: “Your Glue ETL job processes 500GB daily but takes 4 hours and costs $200/run. How do you optimize it?”

A: 4 hours for 500GB suggests the job is not efficiently using its DPUs. Here is my optimization approach:

Diagnosis: Glue ETL optimization

Optimizations (in order of impact):

| Optimization | Expected Impact |
|---|---|
| Enable job bookmarks | Process only new data (incremental), skip already-processed files. If 90% of data is unchanged, 10x speedup |
| Pushdown predicates | Filter at S3 scan (partition pruning): push_down_predicate = "year=2026 AND month=3" |
| Coalesce small files | If the source has 100K small files, reading each is slow. groupFiles = "inPartition" + groupSize = "1073741824" (1GB groups) |
| Switch to Glue 4.0 | Spark 3.3, better optimizer, adaptive query execution (AQE) |
| Right-size workers | If memory-bound: switch from G.1X to G.2X. If CPU-bound: add more workers |
| Repartition before write | Avoid small output files: .repartition(50) before writing |
| Use Parquet input | If reading JSON/CSV, the first run should convert to Parquet. Subsequent runs on Parquet are 10-100x faster |

Target: 500GB job should complete in 30-45 minutes with 20 DPUs ≈ $6-8/run.
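That target can be sanity-checked against the standard Glue rate (assumed here at $0.44 per DPU-hour; verify current pricing for your region):

```python
# Glue billing is per DPU-hour; $0.44/DPU-hour is the commonly cited rate
# (an assumption here -- check current regional pricing).
def glue_run_cost(dpus: int, hours: float, rate_per_dpu_hour: float = 0.44) -> float:
    return dpus * hours * rate_per_dpu_hour

print(round(glue_run_cost(20, 0.75), 2))  # 6.6  -> 45 min with 20 DPUs, inside the $6-8 target
print(round(glue_run_cost(10, 4.0), 2))   # 17.6 -> even a small 4-hour job costs more
```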


Scenario 3: Cross-Cloud CDC (Aurora to BigQuery)

Q: “Design CDC from Aurora PostgreSQL (AWS) into BigQuery (GCP) for near-real-time analytics.”

A: Cross-cloud CDC requires careful architecture. Here are two viable approaches:

Approach A: DMS → S3 → GCS → BigQuery (10-15 min latency)

CDC architecture

Approach B: DMS → Kinesis → Lambda → BigQuery API (2-5 min latency)

CDC replication flow

My recommendation: Approach B for the bank use case. The BigQuery Storage Write API supports exactly-once semantics, which is critical for financial data. The end-to-end latency is 2-5 minutes vs 10-15 for the S3-based approach.

Key considerations:

  • Schema changes: DMS handles column additions but not type changes. Set up alerting for schema drift
  • Network: Aurora → DMS → Kinesis stays within AWS. Lambda calls BigQuery API over the internet (use VPC endpoint for Private Google Access if needed)
  • Exactly-once: BigQuery Storage Write API in committed mode deduplicates by stream offset
  • Monitoring: DMS task metrics + Kinesis iterator age + BigQuery streaming buffer size
  • Cost: DMS instance (~$200/month for r5.large) + Kinesis ($0.015/shard-hr) + Lambda (negligible) + BigQuery streaming (included with Enterprise)
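The exactly-once guarantee rests on offset-based deduplication. A minimal sketch of the idea (class and method names are invented; the real Storage Write API is gRPC-based and tracks offsets per stream):

```python
# Sketch of offset-based exactly-once semantics: the writer attaches a
# monotonically increasing offset to each append, and the sink ignores
# offsets it has already committed. Names are illustrative, not the
# actual BigQuery Storage Write API.

class DedupingSink:
    def __init__(self):
        self.committed_offset = -1
        self.rows = []

    def append(self, offset: int, row: dict) -> bool:
        if offset <= self.committed_offset:  # retry of an already-committed append
            return False
        self.rows.append(row)
        self.committed_offset = offset
        return True

sink = DedupingSink()
sink.append(0, {"txn_id": "a", "amount": 10})
sink.append(1, {"txn_id": "b", "amount": 20})
sink.append(1, {"txn_id": "b", "amount": 20})  # Lambda retry -- deduplicated
print(len(sink.rows))  # 2
```

This is why Lambda retries are safe in Approach B: a replayed Kinesis batch re-sends the same offsets, and the duplicate appends are rejected rather than double-counted.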

Scenario 4: Self-Service Analytics for 10 Teams


Q: “How do you provide self-service analytics to 10 teams without them stepping on each other?”

A: This is a multi-tenancy problem for the data warehouse.

BigQuery approach:

Governance architecture levels

Guardrails:

  1. Per-project slot assignments — no team can consume all compute
  2. Require partition filters — prevent full table scans
  3. Per-user byte quotas — limit daily scan volume
  4. Authorized views — teams see only their data (row-level security)
  5. Column-level security — PII columns masked for non-compliance teams
  6. Cost labels — every query tagged with team, enabling chargeback
  7. BI Engine — enable for dashboard datasets (eliminates repeated scans)

Redshift approach: Separate namespaces per team in Redshift Serverless, with data sharing from a central namespace. Each namespace has its own RPU limits.
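Guardrail 3 (per-user byte quotas) reduces to a simple admission check. A plain-Python sketch of the logic BigQuery applies via custom quotas (names are invented for illustration):

```python
from collections import defaultdict

# Sketch of a per-user daily scan quota: reject any query whose estimated
# bytes would push the user over the limit. In BigQuery this is a custom
# quota; here it is just the admission logic.
class ScanQuota:
    def __init__(self, daily_limit_bytes: int):
        self.limit = daily_limit_bytes
        self.used = defaultdict(int)

    def admit(self, user: str, estimated_bytes: int) -> bool:
        if self.used[user] + estimated_bytes > self.limit:
            return False  # query rejected: would exceed daily quota
        self.used[user] += estimated_bytes
        return True

quota = ScanQuota(daily_limit_bytes=1 * 10**12)  # 1 TB per user per day
print(quota.admit("analyst@corp", 800 * 10**9))  # True
print(quota.admit("analyst@corp", 500 * 10**9))  # False (would total 1.3 TB)
```

Rejecting on the *estimated* bytes (a dry run) rather than after execution is what makes this a guardrail instead of a post-hoc chargeback.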


Scenario 5: Data Governance for Regulated Enterprise


Q: “Design data governance for a regulated bank. Auditors need to trace any number on any report back to its source.”

A: This requires end-to-end data lineage, classification, and access controls.

Full governance architecture


Scenario 6: Lakehouse vs Traditional Warehouse


Q: “Explain the lakehouse architecture. When would you use it over a traditional data warehouse?”

A: A lakehouse combines the reliability of a data warehouse (ACID, schema enforcement) with the flexibility of a data lake (open formats, multi-engine access).

Use lakehouse when:

  • You want to eliminate the ETL from lake to warehouse (single copy of data)
  • ML teams need to read the same data that BI teams query (Spark reads Iceberg, BigQuery reads Iceberg)
  • You need open formats to avoid warehouse vendor lock-in
  • You want time travel and row-level operations on lake data (GDPR deletes)

Use traditional warehouse when:

  • Query performance is the top priority (native warehouse format is optimized)
  • Your team is SQL-only (no Spark, no ML on raw data)
  • You are already invested in BigQuery/Redshift and it is working well
  • Sub-second dashboard queries are required (BI Engine works on native BQ tables)

Practical advice: Start with a traditional warehouse (BigQuery/Redshift). Adopt lakehouse (Iceberg) when you need multi-engine access or the warehouse ETL becomes a bottleneck. Many enterprises run a hybrid: Iceberg for the lake, BigQuery for dashboards with external tables pointing to Iceberg.


Scenario 7: Choosing Between BigQuery, Redshift, and Snowflake


Q: “A client runs workloads on both AWS and GCP. They need a data warehouse. BigQuery, Redshift, or Snowflake?”

A: For a multi-cloud enterprise, the decision depends on where the data originates and where it is consumed.

| Factor | BigQuery | Redshift | Snowflake |
|---|---|---|---|
| Data on AWS | Cross-cloud transfer needed | Native S3 access | Snowpipe from S3 |
| Data on GCP | Native GCS access | Cross-cloud transfer needed | Snowpipe from GCS |
| Single SQL interface | No (separate systems) | No (separate systems) | Yes (one account, both clouds) |
| Data sharing across clouds | Limited (Analytics Hub) | No | Yes (secure shares) |
| Vendor lock-in | GCP | AWS | Snowflake (but runs on any cloud) |

My recommendation for true multi-cloud:

  • Snowflake if cross-cloud data sharing and a single interface are the primary requirements
  • BigQuery for GCP-heavy workloads + Redshift for AWS-heavy workloads with a data mesh approach (each cloud’s warehouse handles its domain)
  • Iceberg lakehouse as the long-term play: both BigQuery and Redshift can query Iceberg tables, eventually converging on a single data format without a single warehouse vendor

Scenario 8: Designing a Data Quality Framework


Q: “How do you ensure data quality in an enterprise data platform?”

A: Data quality must be automated, measured, and visible. Here is my framework:

Quality checks at every stage:

Interview pipeline stages

Enforcement: dbt tests run as a gate in the Airflow DAG. If any test fails, downstream models do not execute, dashboards show stale (but correct) data, and the data team is paged.
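Two of the most common dbt-style checks (not_null and freshness) reduce to a few lines each. This standalone sketch, with illustrative function names, shows what the gate actually enforces:

```python
from datetime import datetime, timedelta, timezone

# Standalone versions of two common data-quality checks (the dbt
# not_null test and a freshness test); function names are illustrative.

def check_not_null(rows: list, column: str) -> bool:
    """dbt-style not_null: every row must have a value in the column."""
    return all(r.get(column) is not None for r in rows)

def check_freshness(latest_ts: datetime, max_staleness: timedelta) -> bool:
    """Freshness test: newest row must be within the allowed staleness window."""
    return datetime.now(timezone.utc) - latest_ts <= max_staleness

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
print(check_not_null(rows, "id"))      # True
print(check_not_null(rows, "amount"))  # False -> gate fails, downstream models skipped
```

In the Airflow DAG, a False from either check is what triggers the gate behavior described above: downstream models are skipped and the on-call data engineer is paged.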


Scenario 9: Query Scanning Too Much Data

Q: “A BigQuery query scans 10TB but should only need 100GB. What is wrong?”

A: This is a classic cost and performance problem. The root causes in order of likelihood:

  1. Table not partitioned: query scans all data instead of one partition

    -- Fix: recreate table with partitioning
    CREATE TABLE payments.transactions_v2
    PARTITION BY DATE(transaction_date)
    CLUSTER BY region, customer_id
    AS SELECT * FROM payments.transactions;
  2. No partition filter in query: even with partitioning, SELECT * FROM transactions scans all partitions

    -- Enforce partition filter
    ALTER TABLE payments.transactions
    SET OPTIONS (require_partition_filter = true);
    -- Now this fails: SELECT * FROM transactions
    -- This works: SELECT * FROM transactions WHERE transaction_date = '2026-03-15'
  3. SELECT * instead of specific columns: scans all columns even if you only need 3

    -- Bad: SELECT * FROM transactions WHERE date = '2026-03-15'
    -- Scans ALL columns (100 columns × 1 day = 10GB)
    -- Good: SELECT customer_id, amount, merchant FROM transactions WHERE date = '2026-03-15'
    -- Scans 3 columns (3 columns × 1 day = 0.3GB)
  4. JOIN exploding rows: joining two large tables without proper keys

  5. Missing clustering: even with partitioning, scanning an entire partition can be large. Clustering by frequently filtered columns reduces scan further

Prevention: require partition filters on all tables, set per-user scan quotas, train users to use --dry_run to estimate cost before running, and enable query validator in BI tools.
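The arithmetic behind points 1-3 is worth making explicit: column pruning and partition pruning multiply. A sketch assuming uniform column widths and 1,000 daily partitions in a 10TB table (illustrative numbers matching the example above):

```python
# Illustrative model: a columnar scan reads only the requested fraction of
# columns and partitions, so the two pruning effects multiply. Assumes
# uniform column widths, which real tables only approximate.
def scanned_bytes(table_bytes: int, cols_total: int, cols_read: int,
                  parts_total: int, parts_read: int) -> float:
    return table_bytes * cols_read * parts_read / (cols_total * parts_total)

TABLE = 10 * 10**12  # 10 TB table: 100 columns, 1000 daily partitions

full_scan = scanned_bytes(TABLE, 100, 100, 1000, 1000)  # SELECT *, no filter
pruned = scanned_bytes(TABLE, 100, 3, 1000, 1)          # 3 columns, 1 partition
print(full_scan / 10**9, pruned / 10**9)  # 10000.0 0.3 (10 TB down to 0.3 GB)
```

Either fix alone helps, but the 30,000x reduction comes from combining them, which is why require_partition_filter and column selection are taught together.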


Quick Reference: Architecture Pattern Selection


Interview full platform answer