
Data Platform — Pipeline Stages

The data platform lives in a dedicated Data Platform Account/Project within the Workloads OU. It consumes VPCs from the Network Hub, ingests data from Workload Accounts (via cross-account S3/GCS access, Kinesis/Pub/Sub streams, database CDC), and serves data to BI tools and ML pipelines. Data engineers and analytics teams are tenants — they get datasets, schemas, and governed access managed by the central infra team.

Data platform overview — workload accounts to data platform


Data Platform — 5 Pipeline Stages

Data enters the platform through three core patterns: batch, streaming, and CDC (Change Data Capture), plus event-driven integrations for SaaS sources.

Data ingestion patterns — batch, streaming, CDC

| Service | Pattern | Use Case | Throughput |
|---|---|---|---|
| Kinesis Data Streams | Streaming | Real-time events, ordering matters | 1 MB/s per shard, scale by adding shards |
| Kinesis Data Firehose | Streaming → batch | Deliver to S3/Redshift/OpenSearch | Auto-scales, no shards to manage |
| DMS | CDC | Database replication (Oracle → Aurora, on-prem → cloud) | Depends on instance size |
| S3 Transfer | Batch | Large file ingestion from on-prem | 5 Gbps per transfer |
| EventBridge | Event-driven | SaaS integrations (Salesforce, Zendesk) | Default 2,400 events/sec |
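A shard-count back-of-the-envelope from the limits in the table (1 MB/s per shard; Kinesis also caps ingest at 1,000 records/s per shard). A minimal sketch, not an official sizing tool:

```python
import math

def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate Kinesis Data Streams shard count from the per-shard
    ingest limits: 1 MB/s and 1,000 records/s."""
    by_bytes = math.ceil(mb_per_sec / 1.0)          # 1 MB/s per shard
    by_records = math.ceil(records_per_sec / 1000)  # 1,000 records/s per shard
    return max(1, by_bytes, by_records)

# 5 MB/s of small events at 20,000 records/s: the record limit dominates
print(shards_needed(5, 20_000))  # → 20
```

Resharding is online, so add shards whenever either limit is approached; throttling (`ProvisionedThroughputExceededException`) is applied per shard, not per stream.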

CDC with AWS DMS

| Service | Pattern | Use Case | Throughput |
|---|---|---|---|
| Pub/Sub | Streaming | Event bus, decoupled pub/sub messaging | Unlimited (auto-scales) |
| Datastream | CDC | Database replication to GCS/BigQuery | Per-stream throughput config |
| Transfer Service | Batch | S3 → GCS, on-prem → GCS, URL list | Managed, auto-retries |
| BigQuery Data Transfer | Batch | SaaS → BigQuery (Google Ads, YouTube, S3) | Scheduled, automatic |
| Cloud Functions | Event-driven | Lightweight event processing triggers | Per-invocation |

CDC with GCP Datastream


The data lake follows a medallion architecture with three zones:

Data lake medallion architecture — raw, curated, refined

| Format | Compression | Schema | Columnar | Splittable | Best For |
|---|---|---|---|---|---|
| Parquet | Snappy / ZSTD | Embedded | Yes | Yes | Analytics, warehouse loading |
| ORC | ZLIB / Snappy | Embedded | Yes | Yes | Hive-heavy ecosystems |
| Avro | Snappy / Deflate | Embedded | No (row) | Yes | Schema evolution, streaming |
| Delta Lake | Parquet-based | Manifest | Yes | Yes | ACID on data lake, time travel |
| Iceberg | Parquet/ORC | Metadata | Yes | Yes | Multi-engine (Spark, Trino, Presto) |
| JSON | GZIP | None | No | If newline-delimited | Raw ingestion, interchange |
| CSV | GZIP | None | No | If newline-delimited | Legacy systems, simple data |
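Why "if newline-delimited" makes JSON and CSV splittable: a reader assigned an arbitrary byte range can resync at the next newline, and the record straddling a split boundary is read by exactly one worker. A minimal Python sketch of that Hadoop-style convention (illustrative; real engines implement this in their input formats):

```python
import json

# Four 10-byte NDJSON records; lines start at byte offsets 0, 10, 20, 30
ndjson = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n{"id": 4}\n'

def read_split(data: bytes, start: int, end: int):
    """Hadoop-style split reader: a split starting mid-file skips its first
    (possibly partial) line, and every split reads one line past `end`,
    so each record is owned by exactly one split."""
    pos = start
    if start > 0:
        pos = data.index(b"\n", start) + 1  # resync at the next newline
    records = []
    while pos <= end and pos < len(data):
        nl = data.index(b"\n", pos)
        records.append(json.loads(data[pos:nl]))
        pos = nl + 1
    return records

first = read_split(ndjson, 0, 19)               # ids 1, 2
second = read_split(ndjson, 19, len(ndjson))    # ids 3, 4
```

Note that wrapping the whole file in GZIP defeats this: a GZIP stream cannot be decompressed from an arbitrary offset, which is why block formats like Parquet and ORC keep their parallelism even when compressed.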

AWS: Lake Formation provides centralized access control:

AWS Lake Formation access control

GCP: Dataplex provides data governance:

GCP Dataplex governance


ETL vs ELT comparison

| Service | Engine | Best For | Serverless? |
|---|---|---|---|
| Glue ETL | Spark (PySpark / Scala) | Batch ETL jobs, S3 → Redshift, schema discovery | Yes (Glue 4.0) |
| Glue Crawlers | N/A | Auto-discover schemas from S3/RDS → populate Glue Catalog | Yes |
| EMR | Spark, Hive, Presto, Trino | Heavy Spark workloads, long-running clusters | EMR Serverless available |
| Step Functions | State machine | Orchestrating multi-step data pipelines | Yes |
| Athena | Trino (Presto) | Ad-hoc SQL queries on S3 data lake | Yes (pay per query) |

Glue ETL pipeline

Glue ETL Optimization Tips:

  • DPU sizing: start with 10 DPUs, monitor with Glue job metrics, and scale up for larger datasets
  • Bookmarks: enable job bookmarks so each run processes only new data (incremental)
  • Pushdown predicates: filter at the source (partition pruning, S3 Select) instead of after loading
  • Grouping: enable groupFiles to batch many small files into fewer Spark tasks, and set groupSize to control the target bytes per group
  • Output partitioning: repartition before writing to avoid the small-files problem
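The last tip is simple arithmetic: pick a partition count so each output file lands near a target size. A sketch (the 128 MB target is a common rule of thumb for Parquet on S3, not a Glue default):

```python
import math

def output_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Repartition count that yields roughly target-sized output files."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# A 2 TB batch becomes ~16,384 files of ~128 MB each
n = output_partitions(2 * 1024**4)
print(n)  # → 16384
```

In a Spark job this would feed something like `df.repartition(n).write.parquet(path)` before the write.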
| Service | Engine | Best For | Serverless? |
|---|---|---|---|
| Dataproc | Spark, Hadoop, Presto | Spark workloads, migration from on-prem Hadoop | Dataproc Serverless available |
| Dataflow | Apache Beam | Streaming + batch (unified model), windowed aggregations | Yes (fully managed) |
| Dataprep | Trifacta | Visual data wrangling for analysts | Yes |
| BigQuery SQL | BigQuery engine | ELT transforms directly in warehouse | Yes |

Dataflow streaming pipeline

| Feature | Glue ETL | Dataproc | Dataflow |
|---|---|---|---|
| Engine | Spark | Spark | Apache Beam |
| Best for | AWS batch ETL | Spark migration from on-prem | Streaming + batch |
| Serverless | Yes (Glue 4.0) | Dataproc Serverless | Yes (always) |
| Streaming | No (Glue Streaming is limited) | Spark Structured Streaming | Native streaming |
| Auto-scaling | Fixed DPUs per run | Cluster auto-scaling | Automatic |
| Pricing | Per DPU-hour | Per VM-hour | Per vCPU/GB-hour |
| Catalog | Built-in Glue Catalog | Hive Metastore or Dataproc Metastore | N/A (uses BigQuery) |

dbt (Data Build Tool) — ELT Transforms in SQL


dbt project structure


A slot is a unit of BigQuery compute. Think of it as a virtual CPU.

BigQuery query execution with slots

| Feature | On-Demand | Standard | Enterprise | Enterprise Plus |
|---|---|---|---|---|
| Pricing | $6.25/TB scanned | Slot commitment | Slot commitment | Slot commitment |
| Slots | Shared pool (2,000 default) | 100+ slot commitment | 100+ slot commitment | 100+ slot commitment |
| Autoscaling | Automatic | Baseline + autoscale | Baseline + autoscale | Baseline + autoscale |
| Commitment | None | 1-year or 3-year | 1-year or 3-year | 1-year or 3-year |
| Slot price | N/A | $0.04/slot-hour | $0.06/slot-hour | $0.10/slot-hour |
| Streaming | $0.01/200MB | Included | Included | Included |
| Materialized Views | Limited | Yes | Auto-refresh | Auto-refresh |
| Multi-region | Yes | Regional only | Regional | Multi-regional |
| BI Engine | No | No | Yes | Yes |
| CMEK | No | No | Yes | Yes |
| Column-level security | No | No | Yes | Yes |
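The on-demand vs commitment decision reduces to a breakeven on scan volume. A sketch using the table's list prices (730 hours per month assumed; storage and streaming excluded):

```python
ON_DEMAND_PER_TB = 6.25      # $/TB scanned (on-demand, from the table)
ENTERPRISE_SLOT_HOUR = 0.06  # $/slot-hour (Enterprise edition)
HOURS_PER_MONTH = 730

def monthly_slot_cost(slots: int) -> float:
    """Fixed monthly cost of a baseline slot reservation."""
    return slots * ENTERPRISE_SLOT_HOUR * HOURS_PER_MONTH

def breakeven_tb_per_month(slots: int) -> float:
    """TB scanned per month above which the reservation beats on-demand."""
    return monthly_slot_cost(slots) / ON_DEMAND_PER_TB

cost = monthly_slot_cost(500)       # ≈ $21,900/month for 500 baseline slots
tb = breakeven_tb_per_month(500)    # ≈ 3,504 TB scanned/month to break even
```

Below the breakeven, stay on-demand and control scan volume; above it (or when cost predictability matters more than unit price), commit to slots.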

BigQuery capacity model — reservations and assignments

BigQuery partitioning and clustering

BigQuery security model — column and row-level

| Feature | Redshift Provisioned | Redshift Serverless |
|---|---|---|
| Management | You choose node type and count | Auto-provisioned |
| Scaling | Resize cluster (elastic or classic) | Automatic |
| Concurrency | Concurrency scaling (auto-add clusters) | Automatic |
| Pricing | Per node-hour (RA3: $3.26/hr for ra3.4xlarge) | Per RPU-hour ($0.375/RPU) |
| Storage | RA3: managed storage (decoupled) | Included |
| Best for | Predictable, high-volume analytics | Variable workloads, getting started |
| Data Sharing | Cross-cluster and cross-account | Same |
| Spectrum | Query S3 data lake directly (no loading) | Same |
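Using the table's prices, a rough sketch of when Serverless wins (730 hours per month assumed; storage and data transfer excluded, so treat this as directional only):

```python
RA3_4XLARGE_PER_HOUR = 3.26  # $/node-hour (provisioned, from the table)
PER_RPU_HOUR = 0.375         # $/RPU-hour (serverless)

def provisioned_monthly(nodes: int, hours: float = 730) -> float:
    """Provisioned clusters bill for every hour they exist."""
    return nodes * RA3_4XLARGE_PER_HOUR * hours

def serverless_monthly(rpus: int, busy_hours: float) -> float:
    """Serverless bills only while queries are running."""
    return rpus * PER_RPU_HOUR * busy_hours

always_on = provisioned_monthly(2)    # 2 ra3.4xlarge nodes ≈ $4,760/month
bursty = serverless_monthly(32, 120)  # 32 RPUs busy ~4 h/day = $1,440/month
```

The crossover matches the table's "Best for" row: steady high utilization favors provisioned nodes, while spiky low-duty-cycle workloads favor Serverless.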

Redshift RA3 architecture


Data consumers access the processed data through multiple channels:

Data serving layer


Data governance pillars


Orchestration: Airflow (MWAA / Cloud Composer)


AWS: MWAA (Managed Workflows for Apache Airflow)


MWAA architecture

Cloud Composer 2 architecture

# Example: daily transaction processing pipeline
# DAG Structure:
#
# check_source_freshness
#           │
# extract_transactions (DMS CDC or Datastream)
#           │
# load_raw_to_lake (S3/GCS raw zone)
#           │
# dbt_run_staging (clean, cast, dedupe)
#           │
#     ┌─────┴──────────┐
#     │                │
# dbt_run_facts   dbt_run_dimensions
#     │                │
#     └─────┬──────────┘
#           │
# dbt_test (assert data quality)
#           │
#     ┌─────┴──────────┐
#     │                │
# refresh_dashboards   notify_stakeholders

Enterprise DAG rules:

  1. One DAG per pipeline — do not combine unrelated workflows
  2. Idempotent tasks — re-running produces the same result (no duplicates)
  3. Incremental processing — process only new data (dbt incremental models, Glue bookmarks)
  4. Alerting — Slack/PagerDuty on task failure, SLA breach
  5. Data quality gates — dbt tests run before downstream consumers access new data
  6. Backfill support — catchup=False for forward-only DAGs, True for historical reprocessing
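Rule 2 in miniature: an idempotent load is a MERGE-style upsert on a natural key, so re-running a task converges to the same state instead of duplicating rows (illustrative Python, not an Airflow operator):

```python
def upsert(target: dict, rows: list, key: str = "transaction_id") -> dict:
    """MERGE-style load: last write wins per key, so re-runs are safe."""
    for row in rows:
        target[row[key]] = row
    return target

batch = [
    {"transaction_id": "t1", "amount": 10},
    {"transaction_id": "t2", "amount": 25},
]
state = upsert({}, batch)
state = upsert(state, batch)  # re-run: same state, no duplicates
print(len(state))  # → 2
```

An append-only INSERT would instead double the row count on every retry, which is exactly what backfills and automatic task retries would trigger.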

Data Governance — Lake Formation & Dataplex


The Problem: Without governance, every data team has full access to every table. A junior analyst can accidentally query PII data, export it, or join it with external datasets. At 20+ consuming teams, access control via S3 bucket policies or BigQuery dataset-level permissions alone becomes unmanageable. You need column-level, row-level, and even cell-level access control — and you need it enforced consistently across every query engine (Athena, Spark, Redshift Spectrum, BigQuery).

Lake Formation is a centralized permissions layer that sits on top of your S3 data lake and Glue Data Catalog. Instead of managing access through S3 bucket policies, IAM policies, and Glue resource policies separately, Lake Formation provides a single pane of glass for data access control.

Core Capabilities:

  • Fine-grained access control: Column-level security (hide PII columns from unauthorized users), row-level security (filter rows by region, tenant, or business unit), and cell-level security (mask specific values like showing only last 4 digits of SSN)
  • Tag-based access control (LF-Tags): The most powerful feature. Tag columns with attributes like sensitivity=pii, domain=payments, classification=confidential. Then grant access by tag rather than by individual table/column. When a new table is registered, it automatically inherits permissions based on its tags — no manual permission grants needed for each new dataset.
  • Data Catalog: Built on top of Glue Catalog. Register S3 locations as databases and tables with searchable metadata, schemas, and data lineage. The catalog becomes the single source of truth for “what data exists and who can access it.”
  • Cross-account data sharing: Share specific tables or columns with other AWS accounts without copying data. The consuming account queries the data directly from the producer’s S3 bucket through Glue Catalog federation. This is critical in multi-account landing zones where data platform and workload accounts are separate.
  • Integration: Athena, Redshift Spectrum, EMR (Spark), and Glue ETL jobs all respect Lake Formation permissions transparently. Users do not bypass permissions by switching query engines.

Data Lake Architecture with Lake Formation

Dataplex provides a data fabric across GCS and BigQuery, offering unified governance regardless of where data physically lives. Unlike Lake Formation which focuses on S3 + Glue Catalog, Dataplex treats BigQuery datasets and GCS buckets as part of a single logical data lake.

Core Capabilities:

  • Data zones: Organize data into logical zones — raw zone (landing, source-format), curated zone (cleaned, validated, Parquet/Iceberg), and consumer zone (aggregated, business-ready). Dataplex enforces zone-level policies and automatically validates data as it moves between zones.
  • Automated data quality: Define data quality rules declaratively (e.g., “column amount must be positive”, “no null values in transaction_id”, “row count should not drop more than 10% day-over-day”). Dataplex continuously validates data against these rules and flags violations. This replaces custom dbt tests for infrastructure-level quality checks.
  • Data lineage: Built into BigQuery — track how data flows from source tables through transformations to final dashboards. Automatically captured for BigQuery SQL jobs, Dataflow pipelines, and Dataproc Spark jobs. No manual instrumentation needed.
  • Metadata management: Auto-discover datasets across GCS and BigQuery, tag with business metadata (domain, owner, sensitivity, freshness), and make everything searchable through Data Catalog. Domain teams self-serve data discovery instead of asking the platform team “where is the customer table?”
  • Security integration: Dataplex works with BigQuery column-level security (policy tags), row-level security (row access policies), and data masking. Policy tags can be applied to columns via Data Catalog and enforced across all BigQuery queries automatically.
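The declarative-rules idea can be sketched in a few lines; the rule names and shapes below are illustrative, not the Dataplex API:

```python
# Illustrative data-quality rules in the spirit of Dataplex checks
RULES = {
    "amount_positive": lambda r: r["amount"] > 0,
    "transaction_id_not_null": lambda r: r["transaction_id"] is not None,
}

def validate(rows):
    """Return the number of violating rows per rule."""
    return {name: sum(1 for r in rows if not rule(r))
            for name, rule in RULES.items()}

rows = [
    {"transaction_id": "t1", "amount": 10},
    {"transaction_id": None, "amount": -5},
]
print(validate(rows))  # → {'amount_positive': 1, 'transaction_id_not_null': 1}
```

A "row count should not drop more than 10% day-over-day" rule would compare aggregates across runs rather than single rows, but follows the same declare-then-evaluate pattern.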

GCP Dataplex Data Fabric

# Register S3 location with Lake Formation
resource "aws_lakeformation_resource" "data_lake" {
  arn = aws_s3_bucket.data_lake.arn
}

# Create LF-Tags for classification
resource "aws_lakeformation_lf_tag" "sensitivity" {
  key    = "sensitivity"
  values = ["public", "internal", "confidential", "pii"]
}

resource "aws_lakeformation_lf_tag" "domain" {
  key    = "domain"
  values = ["payments", "customers", "products", "analytics"]
}

# Assign LF-Tags to a table (the table block already names its database)
resource "aws_lakeformation_resource_lf_tags" "transactions_tags" {
  table {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
  }
  lf_tag {
    key   = "sensitivity"
    value = "confidential"
  }
  lf_tag {
    key   = "domain"
    value = "payments"
  }
}

# Grant analysts access to non-PII columns only
resource "aws_lakeformation_permissions" "analysts_read" {
  principal   = "arn:aws:iam::123456789012:role/AnalystRole"
  permissions = ["SELECT"]

  table_with_columns {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
    # Exclude PII columns
    excluded_column_names = ["card_number", "ssn", "email", "phone"]
  }
}

# Grant data science team full access (they sign a DPA)
resource "aws_lakeformation_permissions" "datascience_full" {
  principal   = "arn:aws:iam::123456789012:role/DataScienceRole"
  permissions = ["SELECT"]

  table_with_columns {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
    wildcard      = true # All columns
  }
}

# Cross-account sharing: share table with analytics account
resource "aws_lakeformation_permissions" "cross_account_share" {
  principal                     = "arn:aws:iam::987654321098:root"
  permissions                   = ["SELECT", "DESCRIBE"]
  permissions_with_grant_option = ["SELECT", "DESCRIBE"]

  table {
    database_name = aws_glue_catalog_database.payments.name
    name          = "transactions"
  }
}

Interview: “How do you enforce column-level security on a shared data lake with 20 consuming teams?”

Strong Answer:

“I would use AWS Lake Formation with LF-Tags or GCP Dataplex with BigQuery policy tags, depending on the cloud platform.

On AWS: Tag PII columns (SSN, card number, email, phone) with sensitivity=pii using LF-Tags. Create a Lake Formation permission grant that gives the AnalystRole SELECT access to all columns WHERE sensitivity != pii. Data science team gets full access to all columns because they need PII for ML model training, but they must sign a Data Processing Agreement (DPA). The compliance team audits all access via CloudTrail Lake Formation events — they can see exactly who queried which PII columns and when.

On GCP: Apply policy tags to PII columns in BigQuery using Data Catalog. Only users with datacatalog.categoryFineGrainedReader on the PII policy tag can see those columns. BigQuery automatically masks PII columns for unauthorized users — they see NULL instead of the actual value. Row-level security is enforced through authorized views or row access policies.

The key advantage of tag-based governance over manual permissions is scalability. When a new table is registered in the catalog, Lake Formation automatically applies permissions based on its tags. You do not need to create individual permission grants for each new table across 20 teams — the tag-based policies handle it automatically.”
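The mechanics of a tag-based grant can be sketched as a filter over column tags (the tags and the deny expression here are hypothetical, not AWS's actual evaluation engine):

```python
# Hypothetical column tags, mirroring the LF-Tag example above
COLUMN_TAGS = {
    "transaction_id": {"sensitivity": "internal"},
    "amount":         {"sensitivity": "internal"},
    "card_number":    {"sensitivity": "pii"},
    "ssn":            {"sensitivity": "pii"},
}

def visible_columns(denied_tags: dict) -> list:
    """Columns whose tags match none of the denied key/value pairs."""
    return [col for col, tags in COLUMN_TAGS.items()
            if all(tags.get(k) != v for k, v in denied_tags.items())]

analyst = visible_columns({"sensitivity": "pii"})  # PII columns filtered out
data_science = visible_columns({})                 # full access
```

The scalability argument falls out directly: registering a new tagged column only changes COLUMN_TAGS, while the grants (the deny expressions) stay untouched.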


# --- Glue Data Catalog ---
resource "aws_glue_catalog_database" "payments" {
  name = "payments"

  create_table_default_permission {
    permissions = ["ALL"]
    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}

# Crawler: auto-discover schema from S3
resource "aws_glue_crawler" "raw_transactions" {
  database_name = aws_glue_catalog_database.payments.name
  name          = "raw-transactions-crawler"
  role          = aws_iam_role.glue_crawler.arn
  schedule      = "cron(0 */6 * * ? *)" # Every 6 hours

  s3_target {
    path = "s3://${aws_s3_bucket.data_lake.id}/raw/source=core_banking/"
  }

  schema_change_policy {
    delete_behavior = "LOG"
    update_behavior = "UPDATE_IN_DATABASE"
  }

  configuration = jsonencode({
    Version = 1.0
    Grouping = {
      TableGroupingPolicy = "CombineCompatibleSchemas"
    }
  })
}

# Glue ETL Job: raw → curated
resource "aws_glue_job" "transform_transactions" {
  name     = "transform-transactions"
  role_arn = aws_iam_role.glue_etl.arn

  command {
    name            = "glueetl"
    script_location = "s3://${aws_s3_bucket.glue_scripts.id}/scripts/transform_transactions.py"
    python_version  = "3"
  }

  glue_version      = "4.0"
  number_of_workers = 10
  worker_type       = "G.1X"
  timeout           = 120 # minutes

  default_arguments = {
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
    "--enable-spark-ui"                  = "true"
    "--spark-event-logs-path"            = "s3://${aws_s3_bucket.glue_scripts.id}/spark-logs/"
    "--job-bookmark-option"              = "job-bookmark-enable"
    "--TempDir"                          = "s3://${aws_s3_bucket.glue_scripts.id}/temp/"
    "--output_path"                      = "s3://${aws_s3_bucket.data_lake.id}/curated/domain=payments/"
  }

  tags = {
    Pipeline = "payments-etl"
    Team     = "data-platform"
  }
}
resource "aws_mwaa_environment" "main" {
  name              = "data-platform-airflow"
  airflow_version   = "2.8.1"
  environment_class = "mw1.medium"
  max_workers       = 25
  min_workers       = 1

  source_bucket_arn    = aws_s3_bucket.mwaa_dags.arn
  dag_s3_path          = "dags/"
  requirements_s3_path = "requirements/requirements.txt"
  plugins_s3_path      = "plugins/plugins.zip"
  execution_role_arn   = aws_iam_role.mwaa.arn

  network_configuration {
    security_group_ids = [aws_security_group.mwaa.id]
    subnet_ids         = var.private_subnet_ids
  }

  logging_configuration {
    dag_processing_logs {
      enabled   = true
      log_level = "INFO"
    }
    scheduler_logs {
      enabled   = true
      log_level = "WARNING"
    }
    task_logs {
      enabled   = true
      log_level = "INFO"
    }
    webserver_logs {
      enabled   = true
      log_level = "WARNING"
    }
    worker_logs {
      enabled   = true
      log_level = "INFO"
    }
  }

  airflow_configuration_options = {
    "core.default_timezone"         = "utc"
    "celery.worker_autoscale"       = "5,1"
    "webserver.default_ui_timezone" = "Asia/Dubai"
  }

  tags = {
    Environment = "production"
    Team        = "data-platform"
  }
}

GCP: BigQuery Dataset + Table with Partitioning

# --- BigQuery Dataset ---
resource "google_bigquery_dataset" "payments" {
  dataset_id    = "payments"
  friendly_name = "Payments Domain"
  description   = "Curated payment transaction data"
  location      = var.region
  project       = var.project_id

  default_partition_expiration_ms = null # No auto-expiry
  default_table_expiration_ms     = null

  # Default encryption with CMEK
  default_encryption_configuration {
    kms_key_name = google_kms_crypto_key.bigquery.id
  }

  # Access control
  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }
  access {
    role           = "READER"
    group_by_email = "analytics-team@bank.com"
  }
  access {
    role           = "WRITER"
    group_by_email = "data-engineers@bank.com"
  }

  labels = {
    environment = "production"
    domain      = "payments"
  }
}

# --- BigQuery Table with Partitioning + Clustering ---
resource "google_bigquery_table" "transactions" {
  dataset_id          = google_bigquery_dataset.payments.dataset_id
  table_id            = "transactions"
  project             = var.project_id
  deletion_protection = true

  time_partitioning {
    type          = "DAY"
    field         = "transaction_date"
    expiration_ms = null # Keep forever (regulatory)
  }

  clustering = ["region", "customer_id"]
  schema     = file("${path.module}/schemas/transactions.json")

  encryption_configuration {
    kms_key_name = google_kms_crypto_key.bigquery.id
  }

  labels = {
    sensitivity = "confidential"
    domain      = "payments"
  }
}

# --- BigQuery Reservation (slot allocation) ---
resource "google_bigquery_reservation" "enterprise" {
  name          = "enterprise-baseline"
  location      = var.region
  project       = var.project_id
  slot_capacity = 500 # 500 baseline slots
  edition       = "ENTERPRISE"

  autoscale {
    max_slots = 1000 # Burst up to 1000 slots
  }
}

# Assign slots to production project
resource "google_bigquery_reservation_assignment" "prod" {
  assignee    = "projects/${var.project_id}"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.enterprise.id
}

# Assign slots to analytics team project
resource "google_bigquery_reservation_assignment" "analytics" {
  assignee    = "projects/${var.analytics_project_id}"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.enterprise.id
}
resource "google_composer_environment" "main" {
  name    = "data-platform-airflow"
  region  = var.region
  project = var.project_id

  config {
    software_config {
      image_version = "composer-2.8.0-airflow-2.8.1"
      pypi_packages = {
        "dbt-bigquery"                    = ">=1.7.0"
        "apache-airflow-providers-google" = ">=10.0.0"
        "soda-core-bigquery"              = ">=3.0.0"
      }
      airflow_config_overrides = {
        core-default_timezone         = "UTC"
        webserver-default_ui_timezone = "Asia/Dubai"
      }
    }

    workloads_config {
      scheduler {
        cpu        = 2
        memory_gb  = 4
        storage_gb = 5
        count      = 2 # HA schedulers
      }
      web_server {
        cpu        = 2
        memory_gb  = 4
        storage_gb = 5
      }
      worker {
        cpu        = 4
        memory_gb  = 16
        storage_gb = 10
        min_count  = 1
        max_count  = 10
      }
    }

    environment_size = "ENVIRONMENT_SIZE_MEDIUM"

    node_config {
      service_account = google_service_account.composer.email
      network         = google_compute_network.main.id
      subnetwork      = google_compute_subnetwork.composer.id
    }

    private_environment_config {
      enable_private_endpoint = true
    }
  }
}

Scenario 1: Enterprise Data Platform Design


Q: “Design a data platform for a bank processing 2TB of transaction data daily across all 5 pipeline stages.”

A: I would design this as a five-stage pipeline in a dedicated data platform account:

Interview — data pipeline architecture

Q: “Your BigQuery costs are growing 30% month-over-month. How do you control them?”

A: BigQuery cost growth is almost always caused by on-demand pricing + uncontrolled queries. My approach:

1. Analyze current spend:

-- Find the most expensive queries (INFORMATION_SCHEMA)
SELECT
  user_email,
  query,
  total_bytes_processed / POW(10, 12) AS tb_scanned,
  total_bytes_processed / POW(10, 12) * 6.25 AS estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;

2. Switch from on-demand to Enterprise edition:

  • Buy slot commitments (e.g., 500 slots on an annual commitment)
  • Cost becomes predictable: 500 slots × $0.06/slot-hr × 730 hrs ≈ $21,900/month
  • vs on-demand: scanning 5 TB/day costs $31.25/day × 30 = $937.50/month (cheap), but scanning 500 TB/day costs $93,750/month (expensive)

3. Enforce partitioning and clustering:

-- Require partition filter on all tables
ALTER TABLE payments.transactions
SET OPTIONS (require_partition_filter = true);
-- Now queries without WHERE transaction_date = ... will fail

4. Per-team slot allocation via reservations:

  • Team A: 200 slots, Team B: 100 slots, Team C: 100 slots
  • Prevents one team from consuming all slots
  • Idle slots redistributed automatically

5. Additional controls:

  • Custom quotas: limit bytes scanned per user per day
  • BI Engine: in-memory cache for dashboard queries (sub-second, no re-scan)
  • Materialized views: pre-compute expensive aggregations
  • Audit: alert when a single query scans > 1TB

Q: “Design CDC from Aurora PostgreSQL to BigQuery for near-real-time analytics.”

A: Two approaches depending on latency requirements:

CDC migration options

For cross-cloud (Aurora on AWS → BigQuery on GCP), the DMS-based approach is more practical:

  1. DMS reads Aurora WAL (logical replication)
  2. DMS writes CDC events to S3 in Parquet format (Kinesis as intermediate for ordering)
  3. S3 event triggers Glue job (or Lambda)
  4. Glue transforms and writes to BigQuery via Storage Write API
  5. Latency: 5-15 minutes end-to-end

For GCP-native (Cloud SQL → BigQuery), use Datastream:

  • Serverless, no infrastructure to manage
  • Direct CDC to BigQuery (seconds latency)
  • Handles schema changes automatically
  • Much simpler architecture

Full data platform architecture