Data Platform — Pipeline Stages
Where This Fits
The data platform lives in a dedicated Data Platform Account/Project within the Workloads OU. It consumes VPCs from the Network Hub, ingests data from Workload Accounts (via cross-account S3/GCS access, Kinesis/Pub/Sub streams, database CDC), and serves data to BI tools and ML pipelines. Data engineers and analytics teams are tenants — they get datasets, schemas, and governed access managed by the central infra team.
Stage 1: Ingest
Data enters the platform through three patterns: batch, streaming, and CDC (Change Data Capture).
Ingestion Patterns
AWS: Ingestion Services
| Service | Pattern | Use Case | Throughput |
|---|---|---|---|
| Kinesis Data Streams | Streaming | Real-time events, ordering matters | 1MB/s per shard, scale by adding shards |
| Kinesis Data Firehose | Streaming → batch | Deliver to S3/Redshift/OpenSearch | Auto-scales, no shards to manage |
| DMS | CDC | Database replication (Oracle → Aurora, on-prem → cloud) | Depends on instance size |
| S3 Transfer | Batch | Large file ingestion from on-prem | 5Gbps per transfer |
| EventBridge | Event-driven | SaaS integrations (Salesforce, Zendesk) | Default 2,400 events/sec |
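The per-shard and per-request limits above push batching logic into the producer. A minimal sketch (pure Python, helper name is ours) of splitting records into batches that respect the Kinesis `PutRecords` limits of 500 records and 5 MB per request; the actual boto3 call is omitted:

```python
# Chunk records for Kinesis PutRecords: at most 500 records and 5 MB
# per request (API limits). Pure batching logic only — no AWS calls.
MAX_RECORDS_PER_BATCH = 500
MAX_BATCH_BYTES = 5 * 1024 * 1024

def chunk_for_put_records(records):
    """Yield lists of byte records that respect the PutRecords limits."""
    batch, batch_bytes = [], 0
    for rec in records:
        size = len(rec)
        if batch and (len(batch) >= MAX_RECORDS_PER_BATCH
                      or batch_bytes + size > MAX_BATCH_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += size
    if batch:
        yield batch

# 1,200 records of 1 KB each → the 500-record limit triggers first.
batches = list(chunk_for_put_records([b"x" * 1024] * 1200))
print([len(b) for b in batches])  # → [500, 500, 200]
```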
GCP: Ingestion Services
| Service | Pattern | Use Case | Throughput |
|---|---|---|---|
| Pub/Sub | Streaming | Event bus, decoupled pub/sub messaging | Unlimited (auto-scales) |
| Datastream | CDC | Database replication to GCS/BigQuery | Per-stream throughput config |
| Transfer Service | Batch | S3 → GCS, on-prem → GCS, URL list | Managed, auto-retries |
| BigQuery Data Transfer | Batch | SaaS → BigQuery (Google Ads, YouTube, S3) | Scheduled, automatic |
| Cloud Functions | Event-driven | Lightweight event processing triggers | Per-invocation |
Stage 2: Store (Data Lake)
Zone Architecture
The data lake follows a medallion architecture with three zones: a raw (bronze) landing zone in source format, a curated (silver) zone of cleaned and validated data, and a consumer (gold) zone of business-ready aggregates.
File Format Decision
Section titled “File Format Decision”| Format | Compression | Schema | Columnar | Splittable | Best For |
|---|---|---|---|---|---|
| Parquet | Snappy / ZSTD | Embedded | Yes | Yes | Analytics, warehouse loading |
| ORC | ZLIB / Snappy | Embedded | Yes | Yes | Hive-heavy ecosystems |
| Avro | Snappy / Deflate | Embedded | No (row) | Yes | Schema evolution, streaming |
| Delta Lake | Parquet-based | Manifest | Yes | Yes | ACID on data lake, time travel |
| Iceberg | Parquet/ORC | Metadata | Yes | Yes | Multi-engine (Spark, Trino, Presto) |
| JSON | GZIP | None | No | If newline-delimited | Raw ingestion, interchange |
| CSV | GZIP | None | No | If newline-delimited | Legacy systems, simple data |
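A toy illustration of why the columnar formats in this table (Parquet, ORC) dominate analytics: an aggregate over one column only has to touch that column's values. Hypothetical data, pure Python:

```python
# Row-oriented layout (like CSV/JSON/Avro) vs column-oriented layout
# (like Parquet/ORC) for the same three-row table.
rows = [  # row-oriented: each record stored together
    {"txn_id": 1, "amount": 120.0, "region": "EU"},
    {"txn_id": 2, "amount": 80.0,  "region": "US"},
    {"txn_id": 3, "amount": 50.0,  "region": "EU"},
]
columns = {  # column-oriented: each column stored contiguously
    "txn_id": [1, 2, 3],
    "amount": [120.0, 80.0, 50.0],
    "region": ["EU", "US", "EU"],
}

# Row layout: every full row is visited even though only `amount` is needed.
total_row = sum(r["amount"] for r in rows)
# Columnar layout: read exactly one contiguous column, skip the rest.
total_col = sum(columns["amount"])
print(total_row, total_col)  # → 250.0 250.0
```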
Data Lake Access Control
AWS: Lake Formation provides centralized access control; details are covered in “Data Governance — Lake Formation & Dataplex” below.
GCP: Dataplex provides unified data governance across GCS and BigQuery; details below.
Stage 3: Process
ETL vs ELT
AWS: Processing Services
| Service | Engine | Best For | Serverless? |
|---|---|---|---|
| Glue ETL | Spark (PySpark / Scala) | Batch ETL jobs, S3 → Redshift, schema discovery | Yes (Glue 4.0) |
| Glue Crawlers | N/A | Auto-discover schemas from S3/RDS → populate Glue Catalog | Yes |
| EMR | Spark, Hive, Presto, Trino | Heavy Spark workloads, long-running clusters | EMR Serverless available |
| Step Functions | State machine | Orchestrating multi-step data pipelines | Yes |
| Athena | Trino (Presto) | Ad-hoc SQL queries on S3 data lake | Yes (pay per query) |
Glue ETL Optimization Tips:
- DPU sizing: start with 10 DPUs, monitor with Glue metrics, scale up for larger datasets
- Bookmarks: enable job bookmarks to process only new data (incremental)
- Pushdown predicates: filter data at source (S3 select, partition pruning)
- Grouping: use `groupFiles` for many small files, `groupSize` for few large files
- Output partitioning: repartition before write to avoid the small-files problem
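The file-grouping tip is essentially a packing problem: combine many small S3 objects into read groups of a sensible size so each Spark task gets enough work. This toy model (our own helper, not the Glue API) mirrors what `groupFiles`/`groupSize` achieve:

```python
# Pack file sizes (bytes) into groups of roughly `group_size` bytes,
# so 1,000 tiny files become a handful of read tasks instead of 1,000.
def group_files(sizes, group_size=128 * 1024 * 1024):
    groups, current, current_bytes = [], [], 0
    for size in sizes:
        if current and current_bytes + size > group_size:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# 1,000 files of 1 MB each → 8 groups of up to 128 MB instead of 1,000 tasks.
groups = group_files([1024 * 1024] * 1000)
print(len(groups))  # → 8
```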
GCP: Processing Services
| Service | Engine | Best For | Serverless? |
|---|---|---|---|
| Dataproc | Spark, Hadoop, Presto | Spark workloads, migration from on-prem Hadoop | Dataproc Serverless available |
| Dataflow | Apache Beam | Streaming + batch (unified model), windowed aggregations | Yes (fully managed) |
| Dataprep | Trifacta | Visual data wrangling for analysts | Yes |
| BigQuery SQL | BigQuery engine | ELT transforms directly in warehouse | Yes |
Glue vs Dataproc vs Dataflow
| Feature | Glue ETL | Dataproc | Dataflow |
|---|---|---|---|
| Engine | Spark | Spark | Apache Beam |
| Best for | AWS batch ETL | Spark migration from on-prem | Streaming + batch |
| Serverless | Yes (Glue 4.0) | Dataproc Serverless | Yes (always) |
| Streaming | Limited (Glue Streaming) | Spark Structured Streaming | Native streaming |
| Auto-scaling | Fixed DPUs per run | Cluster auto-scaling | Automatic |
| Pricing | Per DPU-hour | Per VM-hour | Per vCPU/GB-hour |
| Catalog | Built-in Glue Catalog | Hive Metastore or Dataproc Metastore | N/A (uses BigQuery) |
dbt (Data Build Tool) — ELT Transforms in SQL
Stage 4: Warehouse
GCP: BigQuery Deep Dive
What is a Slot?
A slot is a unit of BigQuery compute: think of it as a virtual CPU (with attached memory) that executes one piece of a query in parallel with other slots.
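A rough intuition for why more slots mean faster queries, using illustrative numbers rather than a real BigQuery cost model:

```python
# If a query needs a fixed amount of "slot-seconds" of work, doubling
# the slots roughly halves wall-clock time (idealized — ignores
# shuffle stages, stragglers, and per-stage parallelism limits).
def estimated_seconds(slot_seconds_of_work, slots):
    return slot_seconds_of_work / slots

work = 4000  # hypothetical total slot-seconds for one query
print(estimated_seconds(work, 500))   # → 8.0 (500-slot reservation)
print(estimated_seconds(work, 2000))  # → 2.0 (2,000-slot on-demand pool)
```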
BigQuery Editions
| Feature | On-Demand | Standard | Enterprise | Enterprise Plus |
|---|---|---|---|---|
| Pricing | $6.25/TB scanned | Slot-hour (pay-as-you-go) | Slot-hour (pay-as-you-go or commitment) | Slot-hour (pay-as-you-go or commitment) |
| Slots | Shared pool (2000 default) | Autoscaled reservation | Autoscaled reservation + optional baseline | Autoscaled reservation + optional baseline |
| Autoscaling | Automatic | Autoscale only | Baseline + autoscale | Baseline + autoscale |
| Commitment | None | None | Optional 1-year or 3-year | Optional 1-year or 3-year |
| Slot price | N/A | $0.04/slot-hour | $0.06/slot-hour | $0.10/slot-hour |
| Streaming | $0.01/200MB | Included | Included | Included |
| Materialized Views | Limited | Yes | Auto-refresh | Auto-refresh |
| Multi-region | Yes | Regional only | Regional | Multi-regional |
| BI Engine | No | No | Yes | Yes |
| CMEK | No | No | Yes | Yes |
| Column-level security | No | No | Yes | Yes |
Reservations and Assignments
Partitioning and Clustering
Column-Level and Row-Level Security
AWS: Redshift Overview
| Feature | Redshift Provisioned | Redshift Serverless |
|---|---|---|
| Management | You choose node type and count | Auto-provisioned |
| Scaling | Resize cluster (elastic or classic) | Automatic |
| Concurrency | Concurrency scaling (auto-add clusters) | Automatic |
| Pricing | Per node-hour (RA3: $3.26/hr for ra3.4xlarge) | Per RPU-hour ($0.375 per RPU-hour) |
| Storage | RA3: managed storage (decoupled) | Included |
| Best for | Predictable, high-volume analytics | Variable workloads, getting started |
| Data Sharing | Cross-cluster and cross-account | Same |
| Spectrum | Query S3 data lake directly (no loading) | Same |
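Using the list prices from the table above, a back-of-envelope comparison of a steady provisioned cluster against a bursty serverless workload (illustrative arithmetic; 730 hours per month, and the workload shape is an assumption):

```python
# Provisioned bills 24/7 per node; Serverless bills RPU-hours only
# while queries actually run.
NODE_HOUR = 3.26   # ra3.4xlarge, from the table above
RPU_HOUR = 0.375   # Redshift Serverless, from the table above

def provisioned_monthly(nodes, hours=730):
    return nodes * NODE_HOUR * hours

def serverless_monthly(avg_rpus, busy_hours_per_day, days=30):
    return avg_rpus * RPU_HOUR * busy_hours_per_day * days

print(round(provisioned_monthly(4)))     # 4 nodes running 24/7 → 9519
print(round(serverless_monthly(32, 6)))  # 32 RPUs, ~6 busy hours/day → 2160
```

The crossover is the usual serverless story: variable or spiky analytics favors RPU billing, while sustained 24/7 load favors provisioned nodes (or reserved pricing).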
Stage 5: Serve
Data consumers access the processed data through multiple channels, including BI tools and ML pipelines.
Data Governance
Enterprise Data Governance Framework
Orchestration: Airflow (MWAA / Cloud Composer)
AWS: MWAA (Managed Workflows for Apache Airflow)
GCP: Cloud Composer (Managed Airflow)
DAG Design Principles
```python
# Example: daily transaction processing pipeline
# DAG Structure:
#
#   check_source_freshness
#            │
#   extract_transactions (DMS CDC or Datastream)
#            │
#   load_raw_to_lake (S3/GCS raw zone)
#            │
#   dbt_run_staging (clean, cast, dedupe)
#            │
#     ┌──────┴──────┐
#     │             │
#  dbt_run_      dbt_run_
#   facts       dimensions
#     │             │
#     └──────┬──────┘
#            │
#   dbt_test (assert data quality)
#            │
#     ┌──────┴──────┐
#     │             │
#  refresh_      notify_
#  dashboards   stakeholders
```
Enterprise DAG rules:
- One DAG per pipeline — do not combine unrelated workflows
- Idempotent tasks — re-running produces the same result (no duplicates)
- Incremental processing — process only new data (dbt incremental models, Glue bookmarks)
- Alerting — Slack/PagerDuty on task failure, SLA breach
- Data quality gates — dbt tests run before downstream consumers access new data
- Backfill support — `catchup=False` for forward-only, `True` for historical reprocessing
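The idempotency rule above, sketched in Python with a dict standing in for S3/GCS (function and path names are illustrative):

```python
# A task that overwrites its output partition can be retried safely;
# a task that appends duplicates data on every retry.
lake = {}

def load_partition_idempotent(ds, rows):
    lake[f"raw/dt={ds}"] = list(rows)  # overwrite → same result every run

def load_partition_appending(ds, rows):
    lake.setdefault(f"bad/dt={ds}", []).extend(rows)  # re-run → duplicates

for _ in range(2):  # simulate an Airflow retry of both tasks
    load_partition_idempotent("2024-01-01", [1, 2, 3])
    load_partition_appending("2024-01-01", [1, 2, 3])

print(len(lake["raw/dt=2024-01-01"]))  # → 3 (safe to retry)
print(len(lake["bad/dt=2024-01-01"]))  # → 6 (duplicated)
```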
Data Governance — Lake Formation & Dataplex
The Problem: Without governance, every data team has full access to every table. A junior analyst can accidentally query PII data, export it, or join it with external datasets. At 20+ consuming teams, access control via S3 bucket policies or BigQuery dataset-level permissions alone becomes unmanageable. You need column-level, row-level, and even cell-level access control — and you need it enforced consistently across every query engine (Athena, Spark, Redshift Spectrum, BigQuery).
AWS: Lake Formation
Lake Formation is a centralized permissions layer that sits on top of your S3 data lake and Glue Data Catalog. Instead of managing access through S3 bucket policies, IAM policies, and Glue resource policies separately, Lake Formation provides a single pane of glass for data access control.
Core Capabilities:
- Fine-grained access control: Column-level security (hide PII columns from unauthorized users), row-level security (filter rows by region, tenant, or business unit), and cell-level security (mask specific values like showing only last 4 digits of SSN)
- Tag-based access control (LF-Tags): The most powerful feature. Tag columns with attributes like `sensitivity=pii`, `domain=payments`, `classification=confidential`. Then grant access by tag rather than by individual table/column. When a new table is registered, it automatically inherits permissions based on its tags — no manual permission grants needed for each new dataset.
- Data Catalog: Built on top of Glue Catalog. Register S3 locations as databases and tables with searchable metadata, schemas, and data lineage. The catalog becomes the single source of truth for “what data exists and who can access it.”
- Cross-account data sharing: Share specific tables or columns with other AWS accounts without copying data. The consuming account queries the data directly from the producer’s S3 bucket through Glue Catalog federation. This is critical in multi-account landing zones where data platform and workload accounts are separate.
- Integration: Athena, Redshift Spectrum, EMR (Spark), and Glue ETL jobs all respect Lake Formation permissions transparently. Users do not bypass permissions by switching query engines.
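The LF-Tags model in miniature, as a pure-Python sketch (our own stand-in, not the Lake Formation API): grants are expressed against tags, so a column registered later inherits the right visibility from its tags automatically:

```python
# Column → tags mapping, as a catalog would hold it.
column_tags = {
    "transactions.amount":      {"sensitivity": "confidential"},
    "transactions.ssn":         {"sensitivity": "pii"},
    "transactions.card_number": {"sensitivity": "pii"},
}

def visible_columns(deny_tag=("sensitivity", "pii")):
    """Columns a principal may read under a 'deny pii' tag grant."""
    key, value = deny_tag
    return sorted(
        col for col, tags in column_tags.items()
        if tags.get(key) != value
    )

print(visible_columns())  # → ['transactions.amount']

# Register a new column: no per-column grant needed — its tag decides.
column_tags["transactions.merchant"] = {"sensitivity": "confidential"}
print(visible_columns())  # → ['transactions.amount', 'transactions.merchant']
```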
GCP: Dataplex
Dataplex provides a data fabric across GCS and BigQuery, offering unified governance regardless of where data physically lives. Unlike Lake Formation which focuses on S3 + Glue Catalog, Dataplex treats BigQuery datasets and GCS buckets as part of a single logical data lake.
Core Capabilities:
- Data zones: Organize data into logical zones — raw zone (landing, source-format), curated zone (cleaned, validated, Parquet/Iceberg), and consumer zone (aggregated, business-ready). Dataplex enforces zone-level policies and automatically validates data as it moves between zones.
- Automated data quality: Define data quality rules declaratively (e.g., “column `amount` must be positive”, “no null values in `transaction_id`”, “row count should not drop more than 10% day-over-day”). Dataplex continuously validates data against these rules and flags violations. This replaces custom dbt tests for infrastructure-level quality checks.
- Data lineage: Built into BigQuery — track how data flows from source tables through transformations to final dashboards. Automatically captured for BigQuery SQL jobs, Dataflow pipelines, and Dataproc Spark jobs. No manual instrumentation needed.
- Metadata management: Auto-discover datasets across GCS and BigQuery, tag with business metadata (domain, owner, sensitivity, freshness), and make everything searchable through Data Catalog. Domain teams self-serve data discovery instead of asking the platform team “where is the customer table?”
- Security integration: Dataplex works with BigQuery column-level security (policy tags), row-level security (row access policies), and data masking. Policy tags can be applied to columns via Data Catalog and enforced across all BigQuery queries automatically.
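The declarative quality rules described above can be modeled as named predicates over a batch of rows. This checker is an illustrative stand-in, not the Dataplex API, and the rule names mirror the examples given:

```python
# Rows to validate — two deliberately violate a rule each.
rows = [
    {"transaction_id": "t1", "amount": 120.0},
    {"transaction_id": "t2", "amount": -5.0},   # violates amount_positive
    {"transaction_id": None, "amount": 30.0},   # violates not-null rule
]

# Declarative rules: name → predicate that must hold for every row.
rules = {
    "amount_positive":         lambda r: r["amount"] > 0,
    "transaction_id_not_null": lambda r: r["transaction_id"] is not None,
}

# Count violations per rule, as a quality report would.
violations = {
    name: sum(1 for r in rows if not rule(r))
    for name, rule in rules.items()
}
print(violations)  # → {'amount_positive': 1, 'transaction_id_not_null': 1}
```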
```hcl
# Register S3 location with Lake Formation
resource "aws_lakeformation_resource" "data_lake" {
  arn = aws_s3_bucket.data_lake.arn
}

# Create LF-Tags for classification
resource "aws_lakeformation_lf_tag" "sensitivity" {
  key    = "sensitivity"
  values = ["public", "internal", "confidential", "pii"]
}

resource "aws_lakeformation_lf_tag" "domain" {
  key    = "domain"
  values = ["payments", "customers", "products", "analytics"]
}

# Assign LF-Tags to a table (exactly one of database/table/table_with_columns)
resource "aws_lakeformation_resource_lf_tags" "transactions_tags" {
  table {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
  }

  lf_tag {
    key   = "sensitivity"
    value = "confidential"
  }

  lf_tag {
    key   = "domain"
    value = "payments"
  }
}

# Grant analysts access to non-PII columns only
resource "aws_lakeformation_permissions" "analysts_read" {
  principal   = "arn:aws:iam::123456789012:role/AnalystRole"
  permissions = ["SELECT"]

  table_with_columns {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
    # Exclude PII columns
    excluded_column_names = ["card_number", "ssn", "email", "phone"]
  }
}

# Grant data science team full access (they sign a DPA)
resource "aws_lakeformation_permissions" "datascience_full" {
  principal   = "arn:aws:iam::123456789012:role/DataScienceRole"
  permissions = ["SELECT"]

  table_with_columns {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
    wildcard      = true # All columns
  }
}

# Cross-account sharing: share table with analytics account
resource "aws_lakeformation_permissions" "cross_account_share" {
  principal                     = "arn:aws:iam::987654321098:root"
  permissions                   = ["SELECT", "DESCRIBE"]
  permissions_with_grant_option = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
  }
}

# Dataplex Lake
resource "google_dataplex_lake" "data_platform" {
  location     = var.region
  name         = "enterprise-data-lake"
  project      = var.project_id
  display_name = "Enterprise Data Lake"
  description  = "Central data lake with governance zones"

  labels = {
    environment = "production"
    team        = "data-platform"
  }
}

# Raw Zone — landing area for source data
resource "google_dataplex_zone" "raw" {
  lake         = google_dataplex_lake.data_platform.name
  location     = var.region
  name         = "raw-zone"
  project      = var.project_id
  display_name = "Raw Zone"
  type         = "RAW"

  resource_spec {
    location_type = "SINGLE_REGION"
  }

  discovery_spec {
    enabled  = true
    schedule = "0 */6 * * *" # Auto-discover every 6 hours

    csv_options {
      delimiter   = ","
      header_rows = 1
    }

    json_options {
      encoding = "UTF-8"
    }
  }
}

# Curated Zone — cleaned and validated data
resource "google_dataplex_zone" "curated" {
  lake         = google_dataplex_lake.data_platform.name
  location     = var.region
  name         = "curated-zone"
  project      = var.project_id
  display_name = "Curated Zone"
  type         = "CURATED"

  resource_spec {
    location_type = "SINGLE_REGION"
  }

  discovery_spec {
    enabled  = true
    schedule = "0 */6 * * *"
  }
}

# Attach GCS bucket as an asset in the raw zone
resource "google_dataplex_asset" "raw_storage" {
  lake          = google_dataplex_lake.data_platform.name
  dataplex_zone = google_dataplex_zone.raw.name
  location      = var.region
  name          = "raw-gcs-asset"
  project       = var.project_id
  display_name  = "Raw Data Storage"

  resource_spec {
    name = "projects/${var.project_id}/buckets/${google_storage_bucket.raw_data.name}"
    type = "STORAGE_BUCKET"
  }

  discovery_spec {
    enabled  = true
    schedule = "0 */6 * * *"
  }
}

# BigQuery column-level security with policy tags
resource "google_data_catalog_taxonomy" "sensitivity" {
  provider     = google
  project      = var.project_id
  region       = var.region
  display_name = "Data Sensitivity"
  description  = "Classification taxonomy for column-level security"
}

resource "google_data_catalog_policy_tag" "pii" {
  taxonomy     = google_data_catalog_taxonomy.sensitivity.id
  display_name = "PII"
  description  = "Personally Identifiable Information — restricted access"
}

resource "google_data_catalog_policy_tag" "confidential" {
  taxonomy     = google_data_catalog_taxonomy.sensitivity.id
  display_name = "Confidential"
  description  = "Business-confidential data — team-level access"
}

# Grant data science team access to PII policy tag
resource "google_data_catalog_taxonomy_iam_member" "datascience_pii" {
  taxonomy = google_data_catalog_taxonomy.sensitivity.id
  role     = "roles/datacatalog.categoryFineGrainedReader"
  member   = "group:data-science@bank.com"
}
```
Interview: “How do you enforce column-level security on a shared data lake with 20 consuming teams?”
Strong Answer:
“I would use AWS Lake Formation with LF-Tags or GCP Dataplex with BigQuery policy tags, depending on the cloud platform.
On AWS: Tag PII columns (SSN, card number, email, phone) with sensitivity=pii using LF-Tags. Create a Lake Formation permission grant that gives the AnalystRole SELECT access to all columns WHERE sensitivity != pii. Data science team gets full access to all columns because they need PII for ML model training, but they must sign a Data Processing Agreement (DPA). The compliance team audits all access via CloudTrail Lake Formation events — they can see exactly who queried which PII columns and when.
On GCP: Apply policy tags to PII columns in BigQuery using Data Catalog. Only users with datacatalog.categoryFineGrainedReader on the PII policy tag can see those columns. BigQuery automatically masks PII columns for unauthorized users — they see NULL instead of the actual value. Row-level security is enforced through authorized views or row access policies.
The key advantage of tag-based governance over manual permissions is scalability. When a new table is registered in the catalog, Lake Formation automatically applies permissions based on its tags. You do not need to create individual permission grants for each new table across 20 teams — the tag-based policies handle it automatically.”
Terraform
AWS: Glue Crawler + ETL Job + Catalog
```hcl
# --- Glue Data Catalog ---
resource "aws_glue_catalog_database" "payments" {
  name = "payments"

  create_table_default_permission {
    permissions = ["ALL"]
    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}

# Crawler: auto-discover schema from S3
resource "aws_glue_crawler" "raw_transactions" {
  database_name = aws_glue_catalog_database.payments.name
  name          = "raw-transactions-crawler"
  role          = aws_iam_role.glue_crawler.arn
  schedule      = "cron(0 */6 * * ? *)" # Every 6 hours

  s3_target {
    path = "s3://${aws_s3_bucket.data_lake.id}/raw/source=core_banking/"
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }

  configuration = jsonencode({
    Version  = 1.0
    Grouping = { TableGroupingPolicy = "CombineCompatibleSchemas" }
  })
}

# Glue ETL Job: raw → curated
resource "aws_glue_job" "transform_transactions" {
  name     = "transform-transactions"
  role_arn = aws_iam_role.glue_etl.arn

  command {
    name            = "glueetl"
    script_location = "s3://${aws_s3_bucket.glue_scripts.id}/scripts/transform_transactions.py"
    python_version  = "3"
  }

  glue_version      = "4.0"
  number_of_workers = 10
  worker_type       = "G.1X"
  timeout           = 120 # minutes

  default_arguments = {
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
    "--enable-spark-ui"                  = "true"
    "--spark-event-logs-path"            = "s3://${aws_s3_bucket.glue_scripts.id}/spark-logs/"
    "--job-bookmark-option"              = "job-bookmark-enable"
    "--TempDir"                          = "s3://${aws_s3_bucket.glue_scripts.id}/temp/"
    "--output_path"                      = "s3://${aws_s3_bucket.data_lake.id}/curated/domain=payments/"
  }

  tags = {
    Pipeline = "payments-etl"
    Team     = "data-platform"
  }
}
```
AWS: MWAA Environment
Section titled “AWS: MWAA Environment”resource "aws_mwaa_environment" "main" { name = "data-platform-airflow"
airflow_version = "2.8.1" environment_class = "mw1.medium" # 5 workers max_workers = 25 min_workers = 1
source_bucket_arn = aws_s3_bucket.mwaa_dags.arn dag_s3_path = "dags/" requirements_s3_path = "requirements/requirements.txt" plugins_s3_path = "plugins/plugins.zip" execution_role_arn = aws_iam_role.mwaa.arn
network_configuration { security_group_ids = [aws_security_group.mwaa.id] subnet_ids = var.private_subnet_ids }
logging_configuration { dag_processing_logs { enabled = true log_level = "INFO" } scheduler_logs { enabled = true log_level = "WARNING" } task_logs { enabled = true log_level = "INFO" } webserver_logs { enabled = true log_level = "WARNING" } worker_logs { enabled = true log_level = "INFO" } }
airflow_configuration_options = { "core.default_timezone" = "utc" "celery.worker_autoscale" = "5,1" "webserver.default_ui_timezone" = "Asia/Dubai" }
tags = { Environment = "production" Team = "data-platform" }}GCP: BigQuery Dataset + Table with Partitioning
Section titled “GCP: BigQuery Dataset + Table with Partitioning”# --- BigQuery Dataset ---
resource "google_bigquery_dataset" "payments" { dataset_id = "payments" friendly_name = "Payments Domain" description = "Curated payment transaction data" location = var.region project = var.project_id
default_partition_expiration_ms = null # No auto-expiry default_table_expiration_ms = null
# Default encryption with CMEK default_encryption_configuration { kms_key_name = google_kms_crypto_key.bigquery.id }
# Access control access { role = "OWNER" special_group = "projectOwners" }
access { role = "READER" group_by_email = "analytics-team@bank.com" }
access { role = "WRITER" group_by_email = "data-engineers@bank.com" }
labels = { environment = "production" domain = "payments" }}
# --- BigQuery Table with Partitioning + Clustering ---
resource "google_bigquery_table" "transactions" { dataset_id = google_bigquery_dataset.payments.dataset_id table_id = "transactions" project = var.project_id deletion_protection = true
time_partitioning { type = "DAY" field = "transaction_date" expiration_ms = null # Keep forever (regulatory) }
clustering = ["region", "customer_id"]
schema = file("${path.module}/schemas/transactions.json")
encryption_configuration { kms_key_name = google_kms_crypto_key.bigquery.id }
labels = { sensitivity = "confidential" domain = "payments" }}
# --- BigQuery Reservation (slot allocation) ---
resource "google_bigquery_reservation" "enterprise" { name = "enterprise-baseline" location = var.region project = var.project_id slot_capacity = 500 # 500 baseline slots edition = "ENTERPRISE" autoscale { max_slots = 1000 # Burst up to 1000 slots }}
# Assign slots to production projectresource "google_bigquery_reservation_assignment" "prod" { assignee = "projects/${var.project_id}" job_type = "QUERY" reservation = google_bigquery_reservation.enterprise.id}
# Assign slots to analytics team projectresource "google_bigquery_reservation_assignment" "analytics" { assignee = "projects/${var.analytics_project_id}" job_type = "QUERY" reservation = google_bigquery_reservation.enterprise.id}GCP: Cloud Composer Environment
Section titled “GCP: Cloud Composer Environment”resource "google_composer_environment" "main" { name = "data-platform-airflow" region = var.region project = var.project_id
config { software_config { image_version = "composer-2.8.0-airflow-2.8.1"
pypi_packages = { "dbt-bigquery" = ">=1.7.0" "apache-airflow-providers-google" = ">=10.0.0" "soda-core-bigquery" = ">=3.0.0" }
airflow_config_overrides = { core-default_timezone = "UTC" webserver-default_ui_timezone = "Asia/Dubai" } }
workloads_config { scheduler { cpu = 2 memory_gb = 4 storage_gb = 5 count = 2 # HA schedulers } web_server { cpu = 2 memory_gb = 4 storage_gb = 5 } worker { cpu = 4 memory_gb = 16 storage_gb = 10 min_count = 1 max_count = 10 } }
environment_size = "ENVIRONMENT_SIZE_MEDIUM"
node_config { service_account = google_service_account.composer.email network = google_compute_network.main.id subnetwork = google_compute_subnetwork.composer.id }
private_environment_config { enable_private_endpoint = true } }}Interview Scenarios
Section titled “Interview Scenarios”Scenario 1: Enterprise Data Platform Design
Q: “Design a data platform for a bank processing 2TB of transaction data daily across all 5 pipeline stages.”
A: I would design this as a five-stage pipeline in a dedicated data platform account, covering ingest, store (data lake), process, warehouse, and serve.
Scenario 2: BigQuery Cost Control
Q: “Your BigQuery costs are growing 30% month-over-month. How do you control them?”
A: BigQuery cost growth is almost always caused by on-demand pricing + uncontrolled queries. My approach:
1. Analyze current spend:
```sql
-- Find the most expensive queries (INFORMATION_SCHEMA)
SELECT
  user_email,
  query,
  total_bytes_processed / POW(10, 12) AS tb_scanned,
  total_bytes_processed / POW(10, 12) * 6.25 AS estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;
```
2. Switch from on-demand to Enterprise edition:
- Buy slot commitments (e.g., 500 slots annual)
- Cost becomes predictable: 500 slots x $0.06/slot-hr x 730 hrs = ~$21,900/month
- vs on-demand: if scanning 5TB/day = $31.25/day x 30 = $937.50/month (cheap!) but if scanning 500TB/day = $93,750/month (expensive!)
3. Enforce partitioning and clustering:
```sql
-- Require partition filter on all tables
ALTER TABLE payments.transactions
SET OPTIONS (require_partition_filter = true);
-- Now queries without WHERE transaction_date = ... will fail
```
4. Per-team slot allocation via reservations:
- Team A: 200 slots, Team B: 100 slots, Team C: 100 slots
- Prevents one team from consuming all slots
- Idle slots redistributed automatically
5. Additional controls:
- Custom quotas: limit bytes scanned per user per day
- BI Engine: in-memory cache for dashboard queries (sub-second, no re-scan)
- Materialized views: pre-compute expensive aggregations
- Audit: alert when a single query scans > 1TB
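The on-demand vs commitment trade-off from step 2 can be sanity-checked with the same numbers (illustrative arithmetic; 730 hours per month):

```python
# At $6.25/TB on-demand vs a 500-slot Enterprise reservation at
# $0.06/slot-hour, find the monthly scan volume where the slot
# commitment starts paying for itself.
ON_DEMAND_PER_TB = 6.25

slot_monthly = 500 * 0.06 * 730          # slot cost per month
breakeven_tb_per_month = slot_monthly / ON_DEMAND_PER_TB

print(round(slot_monthly))                    # → 21900
print(round(breakeven_tb_per_month / 30, 1))  # → 116.8 TB/day crossover
```

Below roughly 117 TB scanned per day, on-demand is cheaper; above it, the commitment wins, and it also caps the blast radius of a runaway query.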
Scenario 3: CDC from Aurora to BigQuery
Q: “Design CDC from Aurora PostgreSQL to BigQuery for near-real-time analytics.”
A: Two approaches depending on latency requirements:
For cross-cloud (Aurora on AWS → BigQuery on GCP), Option A is more practical:
- DMS reads Aurora WAL (logical replication)
- DMS writes CDC events to S3 in Parquet format (Kinesis as intermediate for ordering)
- S3 event triggers Glue job (or Lambda)
- Glue transforms and writes to BigQuery via Storage Write API
- Latency: 5-15 minutes end-to-end
For GCP-native (Cloud SQL → BigQuery), use Datastream:
- Serverless, no infrastructure to manage
- Direct CDC to BigQuery (seconds latency)
- Handles schema changes automatically
- Much simpler architecture
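Either way, the sink's core job is the same: replay ordered change events onto current table state. A minimal sketch (our own helper, standing in for the DMS→merge or Datastream→BigQuery apply step) with an in-memory dict keyed by primary key:

```python
# Apply CDC events in WAL order: insert/update upsert the row,
# delete removes it. Out-of-order application would corrupt state,
# which is why ordering (Kinesis shards, Datastream) matters.
def apply_cdc(state, events):
    for op, key, row in events:
        if op in ("insert", "update"):
            state[key] = row
        elif op == "delete":
            state.pop(key, None)
    return state

events = [
    ("insert", 1, {"amount": 100}),
    ("insert", 2, {"amount": 250}),
    ("update", 1, {"amount": 110}),
    ("delete", 2, None),
]
print(apply_cdc({}, events))  # → {1: {'amount': 110}}
```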