
Event-Driven Architecture

The central event bus lives in the Shared Services Account/Project. Workload accounts publish events to and subscribe from this central bus via cross-account rules (EventBridge) or IAM-scoped subscriptions (Pub/Sub). This decouples tenant services — teams communicate through events, not direct API calls.

Central event bus architecture in shared services account


Request-driven vs event-driven architecture comparison

| Factor | Request-Driven | Event-Driven |
|---|---|---|
| Coupling | Tight — caller knows callee | Loose — publisher does not know subscribers |
| Failure handling | Cascading failures | Isolated — one consumer failing does not affect others |
| Latency | Accumulated (sum of all hops) | Publisher returns immediately |
| Debugging | Simple (follow the request) | Complex (trace events across systems) |
| Ordering | Natural (request/response) | Requires explicit ordering guarantees |
| Best for | CRUD operations, queries | Notifications, workflows, integration |

One event triggers multiple independent consumers:

Fan-out pattern for order placed event
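The fan-out property can be sketched with a toy in-process bus: one published event reaches every registered consumer, and one consumer throwing does not stop delivery to the rest. The class and handler names below are illustrative, not a real SDK.

```python
# Minimal in-process model of fan-out. In production this role is played by
# EventBridge/SNS/Pub/Sub; here a dict of subscriber lists stands in for it.
from typing import Callable, Dict, List

class FanOutBus:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, detail: dict) -> List[str]:
        errors = []
        for handler in self._subscribers.get(event_type, []):
            try:
                handler(detail)           # each consumer is independent
            except Exception as exc:      # isolate failures (a real bus would DLQ)
                errors.append(f"{handler.__name__}: {exc}")
        return errors

bus = FanOutBus()
received = []
bus.subscribe("OrderPlaced", lambda e: received.append(("payment", e["orderId"])))
bus.subscribe("OrderPlaced", lambda e: received.append(("inventory", e["orderId"])))
failed = bus.publish("OrderPlaced", {"orderId": "o-1"})
```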

When a business process spans multiple services, use the saga pattern with compensating actions:

Saga pattern for distributed order processing
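A minimal sketch of an orchestrated saga: each step is paired with a compensating action, and if a later step fails, the completed steps are compensated in reverse order. The step functions (reserve/release stock, charge/refund card) are hypothetical.

```python
# Orchestrated saga sketch: steps is a list of (action, compensation) pairs.
# On failure, already-completed steps are rolled back in reverse order.
def run_saga(steps, context):
    completed = []
    for action, compensate in steps:
        try:
            action(context)
            completed.append((action, compensate))
        except Exception:
            for _, comp in reversed(completed):   # compensate in reverse
                comp(context)
            return False, [a.__name__ for a, _ in completed]
    return True, [a.__name__ for a, _ in completed]

log = []
def reserve_stock(ctx): log.append("reserve_stock")
def release_stock(ctx): log.append("release_stock")   # compensation
def charge_card(ctx): raise RuntimeError("payment declined")
def refund_card(ctx): log.append("refund_card")       # compensation

ok, done = run_saga([(reserve_stock, release_stock),
                     (charge_card, refund_card)], {})
```

When `charge_card` fails, only `reserve_stock` has completed, so only `release_stock` runs as compensation.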

CQRS (Command Query Responsibility Segregation)

Separate the write model (commands) from the read model (queries):

CQRS command query responsibility segregation architecture


EventBridge is the enterprise event bus — event routing with content-based filtering, schema discovery, and cross-account delivery.

Amazon EventBridge concepts: bus, rules, and schema registry

SQS is a point-to-point message queue: one producer, one consumer (or consumer group).

SQS Standard vs FIFO queue comparison

SNS is a pub/sub notification service. One message, multiple subscribers.

SNS fan-out pattern with multiple subscribers

Kinesis is AWS’s managed streaming platform for real-time data ingestion and processing. Unlike SQS (message queue, point-to-point), Kinesis is a stream (ordered, replayable, multiple consumers).

| Service | Purpose | Use Case |
|---|---|---|
| Kinesis Data Streams | Real-time streaming ingestion | Clickstream, IoT telemetry, log aggregation |
| Kinesis Data Firehose | Managed delivery to destinations | Stream → S3/Redshift/OpenSearch (no code) |
| Kinesis Data Analytics | SQL/Flink on streams | Real-time aggregation, anomaly detection |

When to choose Kinesis vs SQS:

  • Kinesis: multiple consumers read same data, ordering within shard, replay capability, real-time analytics
  • SQS: single consumer (or consumer group), simpler, per-message processing, no ordering needed (unless FIFO)

Google Cloud Pub/Sub concepts

Eventarc provides managed event routing from Google Cloud services (and custom sources) to Cloud Run, GKE, and Workflows.

Eventarc managed event routing architecture

# Central EventBridge bus in Shared Services Account
resource "aws_cloudwatch_event_bus" "enterprise" {
  name = "enterprise-event-bus"
}

# Allow workload accounts to publish events
resource "aws_cloudwatch_event_bus_policy" "cross_account" {
  event_bus_name = aws_cloudwatch_event_bus.enterprise.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowWorkloadAccountPublish"
        Effect = "Allow"
        Principal = {
          AWS = [
            "arn:aws:iam::${var.team_a_account_id}:root",
            "arn:aws:iam::${var.team_b_account_id}:root",
            "arn:aws:iam::${var.team_c_account_id}:root"
          ]
        }
        Action   = "events:PutEvents"
        Resource = aws_cloudwatch_event_bus.enterprise.arn
      }
    ]
  })
}

# Rule: route order events to Team-B's SQS queue
resource "aws_cloudwatch_event_rule" "order_events" {
  name           = "route-order-events"
  event_bus_name = aws_cloudwatch_event_bus.enterprise.name
  event_pattern = jsonencode({
    source        = ["com.bank.orders"]
    "detail-type" = ["OrderPlaced", "OrderCancelled"]
  })
}

resource "aws_cloudwatch_event_target" "team_b_queue" {
  rule           = aws_cloudwatch_event_rule.order_events.name
  event_bus_name = aws_cloudwatch_event_bus.enterprise.name
  arn            = "arn:aws:sqs:me-central-1:${var.team_b_account_id}:order-events"

  sqs_target {
    message_group_id = "orders" # For FIFO queues
  }

  dead_letter_config {
    arn = aws_sqs_queue.dlq.arn
  }

  retry_policy {
    maximum_retry_attempts       = 3
    maximum_event_age_in_seconds = 86400 # 24 hours
  }
}

# Schema registry for event contracts
resource "aws_schemas_registry" "enterprise" {
  name        = "enterprise-events"
  description = "Schema registry for enterprise event contracts"
}

resource "aws_schemas_schema" "order_placed" {
  name          = "OrderPlaced"
  registry_name = aws_schemas_registry.enterprise.name
  type          = "OpenApi3"
  content = jsonencode({
    openapi = "3.0.0"
    info = {
      title   = "OrderPlaced"
      version = "1.0.0"
    }
    paths = {}
    components = {
      schemas = {
        OrderPlaced = {
          type     = "object"
          required = ["orderId", "customerId", "amount", "currency"]
          properties = {
            orderId    = { type = "string", format = "uuid" }
            customerId = { type = "string", format = "uuid" }
            amount     = { type = "number" }
            currency   = { type = "string", enum = ["AED", "USD", "EUR"] }
            items = {
              type = "array"
              items = {
                type = "object"
                properties = {
                  productId = { type = "string" }
                  quantity  = { type = "integer" }
                  price     = { type = "number" }
                }
              }
            }
            timestamp = { type = "string", format = "date-time" }
          }
        }
      }
    }
  })
}
# Publishing Events from an EKS Pod (Python)
import json
import logging
import uuid
from datetime import datetime

import boto3

logger = logging.getLogger(__name__)
eventbridge = boto3.client('events', region_name='me-central-1')

def publish_order_placed(order):
    response = eventbridge.put_events(
        Entries=[
            {
                'Source': 'com.bank.orders',
                'DetailType': 'OrderPlaced',
                'Detail': json.dumps({
                    'orderId': str(uuid.uuid4()),
                    'customerId': order['customer_id'],
                    'amount': order['total'],
                    'currency': 'AED',
                    'items': order['items'],
                    'timestamp': datetime.utcnow().isoformat()
                }),
                'EventBusName': 'arn:aws:events:me-central-1:SHARED_SERVICES_ACCOUNT:event-bus/enterprise-event-bus'
            }
        ]
    )
    # Check for failed entries
    if response['FailedEntryCount'] > 0:
        for entry in response['Entries']:
            if 'ErrorCode' in entry:
                logger.error(f"Failed to publish event: {entry['ErrorCode']}")
resource "aws_sqs_queue" "order_processing" {
  name                       = "order-processing"
  visibility_timeout_seconds = 300     # 5 min (must be > Lambda timeout)
  message_retention_seconds  = 1209600 # 14 days
  receive_wait_time_seconds  = 20      # Long polling (reduces empty receives)
  max_message_size           = 262144  # 256 KB
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_dlq.arn
    maxReceiveCount     = 3 # After 3 failures, send to DLQ
  })
  sqs_managed_sse_enabled = true # Encryption at rest
}

resource "aws_sqs_queue" "order_dlq" {
  name                      = "order-processing-dlq"
  message_retention_seconds = 1209600 # 14 days
  sqs_managed_sse_enabled   = true
}

# CloudWatch alarm on DLQ depth
resource "aws_cloudwatch_metric_alarm" "dlq_not_empty" {
  alarm_name          = "order-dlq-has-messages"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  alarm_description   = "DLQ has failed messages that need investigation"
  alarm_actions       = [var.pagerduty_sns_topic_arn]
  dimensions = {
    QueueName = aws_sqs_queue.order_dlq.name
  }
}

# DLQ redrive allow policy (permit replaying failed messages back to the main queue)
resource "aws_sqs_queue_redrive_allow_policy" "order_processing" {
  queue_url = aws_sqs_queue.order_processing.url
  redrive_allow_policy = jsonencode({
    redrivePermission = "byQueue"
    sourceQueueArns   = [aws_sqs_queue.order_dlq.arn]
  })
}

Dead Letter Queues (DLQ) and Error Handling

Dead letter queue architecture and error handling

# SQS DLQ Replay — AWS CLI

# Step 1: Inspect messages in DLQ
aws sqs receive-message \
  --queue-url https://sqs.me-central-1.amazonaws.com/ACCOUNT/order-processing-dlq \
  --max-number-of-messages 10

# Step 2: Start DLQ redrive (replay messages back to source queue)
# Throttle the replay rate to avoid overwhelming the consumer
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:me-central-1:ACCOUNT:order-processing-dlq \
  --destination-arn arn:aws:sqs:me-central-1:ACCOUNT:order-processing \
  --max-number-of-messages-per-second 10

Exactly-once delivery strategies

Idempotent Consumer Pattern (Python):

import logging
from contextlib import contextmanager

import psycopg2

logger = logging.getLogger(__name__)

@contextmanager
def db_transaction():
    # Hypothetical helper: yields a cursor, commits on success, rolls back on error
    conn = psycopg2.connect("dbname=orders")
    try:
        with conn.cursor() as cursor:
            yield cursor
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

def process_order_event(event):
    event_id = event['detail']['orderId']
    with db_transaction() as cursor:
        # Check if already processed (idempotency key)
        cursor.execute(
            "SELECT 1 FROM processed_events WHERE event_id = %s FOR UPDATE",
            (event_id,)
        )
        if cursor.fetchone():
            logger.info(f"Event {event_id} already processed, skipping")
            return  # Idempotent — skip duplicate
        # Process the event
        cursor.execute(
            "INSERT INTO orders (id, customer_id, amount) VALUES (%s, %s, %s)",
            (event_id, event['detail']['customerId'], event['detail']['amount'])
        )
        # Record that we processed this event
        cursor.execute(
            "INSERT INTO processed_events (event_id, processed_at) VALUES (%s, NOW())",
            (event_id,)
        )
        # Both inserts in the same transaction — atomic

When 20 teams publish events to a shared bus, you need contracts. A schema registry enforces the structure of events so that a producer cannot break downstream consumers by changing the event format.

Schema evolution rules for event contracts

EventBridge Schema Registry auto-discovers event schemas from the bus and generates code bindings:

resource "aws_schemas_discoverer" "enterprise_bus" {
  source_arn  = aws_cloudwatch_event_bus.enterprise.arn
  description = "Auto-discover schemas from enterprise event bus"
}

When the event-driven architecture grows beyond simple fan-out patterns into a full enterprise event backbone serving 100+ microservices, you need a streaming platform — not just a message queue. Apache Kafka is the dominant choice for this role. Unlike SQS or Pub/Sub which are message brokers (messages are consumed and removed), Kafka is a distributed commit log where messages persist for configurable durations and multiple independent consumers read from the same topic at their own pace. This fundamental difference makes Kafka suitable for event sourcing, CDC (change data capture), stream processing, and replay scenarios that queues cannot support.

The managed Kafka landscape offers three tiers of operational burden: MSK Provisioned (you choose broker count and instance types — full Kafka compatibility), MSK Serverless (auto-scales with no broker management but fewer features), and Confluent Cloud (fully managed with the richest Kafka ecosystem including ksqlDB, Schema Registry, and Kafka Connect). The choice depends on your team’s Kafka expertise, feature requirements, and budget. For teams new to Kafka, MSK Serverless or Confluent Cloud removes the operational complexity that makes self-hosted Kafka notoriously difficult to run.

The critical architecture question is not “Kafka or not Kafka” but “what is the right tool for each communication pattern?” Using Kafka for simple request-reply between two services is over-engineering. Using SQS for a real-time data pipeline feeding 15 downstream consumers with replay requirements is under-engineering. The decision matrix below helps make this choice systematically.

| Criterion | Kafka (MSK) | Kinesis | Pub/Sub |
|---|---|---|---|
| Message retention | Unlimited (disk-based, configurable) | 24 hours default, up to 365 days | 31 days max |
| Ordering | Per-partition (strong guarantee) | Per-shard | Per-key (ordering key) |
| Consumer model | Pull (consumer groups, independent offsets) | Pull (enhanced fan-out) or Lambda trigger | Pull or push (subscriptions) |
| Throughput | Millions msg/sec (add brokers + partitions) | 1 MB/s ingest per shard (add shards) | Auto-scales (no capacity planning) |
| Exactly-once | Native transactions (producer + consumer) | Requires dedup logic | Per-subscription exactly-once delivery |
| Ecosystem | Kafka Connect, Kafka Streams, ksqlDB, Debezium | Kinesis Analytics (Flink) | Dataflow (Apache Beam) |
| Ops overhead | Medium (MSK) / None (MSK Serverless) | None (fully managed) | None (fully managed) |
| Cost model | Per-broker-hour + storage + data transfer | Per-shard-hour + data | Per-message + storage |
| Replay | Consumer seeks to any offset (full replay) | Timestamp-based, within retention | Seek to timestamp or snapshot |
| Best for | Event backbone for 100+ services, CDC, log aggregation | Real-time analytics, Lambda triggers | GCP-native event routing, cross-service |
Decision Tree: Which Event/Streaming Platform?
=================================================
Is it simple request-reply between 2 services?
└── YES → Use direct API calls (REST/gRPC)
Is it a task queue? (one producer, one consumer, process and delete)
└── YES → Use SQS (AWS) or Cloud Tasks (GCP)
Is it fan-out? (one event, multiple independent consumers, no replay needed)
└── YES → Use SNS+SQS (AWS) or Pub/Sub (GCP) or EventBridge (AWS)
Do you need replay? (reprocess events from 7 days ago for a new consumer)
└── YES → Kafka (MSK) or Kinesis
Do you need exactly-once transactions? (financial data, inventory updates)
└── YES → Kafka (MSK) with transactions
Is it a central event backbone for 50+ microservices with CDC?
└── YES → Kafka (MSK or Confluent Cloud)
Are you GCP-native and want minimal ops?
└── YES → Pub/Sub (different API but covers most use cases)

MSK Provisioned gives you full Apache Kafka compatibility. You select broker instance types (kafka.m5.large, kafka.m5.2xlarge, etc.), broker count (minimum 3, one per AZ), and EBS storage per broker. Any standard Kafka client library works — MSK is just Kafka running on AWS-managed EC2 instances. You get full access to Kafka features: Kafka Connect, Kafka Streams, transactions, exactly-once semantics, log compaction, and fine-grained topic configuration.

MSK Serverless removes all broker management. You create a cluster, create topics, and produce/consume. AWS manages scaling, patching, and broker placement. The trade-off: fewer Kafka features (no Kafka Connect, limited topic-level configs, no custom broker configs), and throughput is billed per-data-unit rather than per-broker-hour. MSK Serverless is ideal for teams that want Kafka’s API and ecosystem without the operational burden of capacity planning.

Partition strategy is the most important Kafka design decision. Partitions determine parallelism: each partition can have at most one consumer per consumer group. Rule of thumb: start with partitions equal to your maximum expected consumer instances. A topic with 6 partitions can have at most 6 consumers in a single consumer group reading in parallel. Over-partitioning wastes broker resources (file handles, memory) and increases end-to-end latency. Under-partitioning limits throughput. For most services, 6-12 partitions per topic is a good starting point.
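The sizing rule above can be written down directly: take the larger of the partition count needed for peak throughput and the maximum number of parallel consumers. The per-partition throughput figure is an assumption you would measure for your own workload.

```python
import math

# Back-of-envelope partition sizing per the rule of thumb in the text:
# enough partitions for peak throughput AND for the max consumer count.
def suggest_partitions(peak_mb_per_sec, per_partition_mb_per_sec, max_consumers):
    for_throughput = math.ceil(peak_mb_per_sec / per_partition_mb_per_sec)
    return max(for_throughput, max_consumers)

# Assumed figures: 30 MB/s peak, ~5 MB/s sustainable per partition,
# up to 8 consumer instances in the largest consumer group.
n = suggest_partitions(30, 5, 8)
```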

Key-based partitioning guarantees ordering for related events. All events with the same partition key (e.g., order_id=123) go to the same partition and are therefore processed in order. Different keys may go to different partitions and are processed in parallel. This is essential for event sourcing: all events for a given aggregate (order, customer, account) must be processed in sequence.
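The ordering guarantee follows from the partitioner being a pure function of the key. Kafka's default partitioner uses murmur2; any stable hash modulo the partition count demonstrates the same property, as in this sketch:

```python
import hashlib

# Deterministic key → partition mapping: the same key always lands on the
# same partition, so events for one order are consumed in order.
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("order-123", 6)
p2 = partition_for("order-123", 6)  # identical: ordering preserved for this order
```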

Kafka Architecture — MSK Cluster

MSK Monitoring — Key Metrics to Watch:

Critical MSK Metrics
======================
Broker Health:
├── kafka.server:BrokerState → 3 = running (alert if != 3)
├── ActiveControllerCount → must be exactly 1 across cluster
└── UnderReplicatedPartitions → must be 0 (alert immediately if > 0)
Consumer Lag:
├── SumOffsetLag (per consumer group) → gap between latest offset and consumer
├── MaxOffsetLag → worst partition in the group
└── Alert: if lag > 10,000 messages for > 5 minutes → consumer is falling behind
Throughput:
├── BytesInPerSec (per broker) → ingestion rate
├── BytesOutPerSec (per broker) → consumption rate
└── MessagesInPerSec → message rate
Storage:
├── KafkaDataLogsDiskUsed → percentage of EBS used
└── Alert: if > 80% → increase EBS volume or reduce retention
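The consumer-lag alert rule above ("lag > 10,000 for > 5 minutes") can be encoded so a single spike does not page anyone; only sustained lag fires. This is a sketch of the alert logic, not a CloudWatch configuration.

```python
# Fire only if every one of the last N per-minute lag samples exceeds the
# threshold; a single transient spike is ignored.
def should_alert(lag_samples, threshold=10_000, required_consecutive=5):
    """lag_samples: most-recent-last list of per-minute MaxOffsetLag values."""
    recent = lag_samples[-required_consecutive:]
    return len(recent) == required_consecutive and all(l > threshold for l in recent)

a = should_alert([500, 12_000, 15_000, 20_000, 26_000, 31_000])  # sustained lag
b = should_alert([500, 800, 40_000, 900, 700, 600])              # single spike
```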

GCP does not offer a native managed Kafka service equivalent to AWS MSK. This is a deliberate architectural choice — Google positions Pub/Sub as its managed messaging platform. For teams that specifically need the Kafka API and ecosystem (Kafka Connect, Kafka Streams, ksqlDB, Debezium), there are three viable approaches on GCP, each with different trade-offs.

Option 1: Confluent Cloud on GCP is the most operationally simple. Confluent manages the entire Kafka cluster — brokers, ZooKeeper (or KRaft), storage, networking, patching. The cluster runs in your GCP region with VPC peering for private connectivity. You get the full Confluent ecosystem: Schema Registry, Kafka Connect with 200+ pre-built connectors, ksqlDB for stream processing, and cluster linking for cross-region replication. The trade-off is cost — Confluent Cloud is significantly more expensive than self-hosted Kafka, typically 3-5x the infrastructure cost. For teams without deep Kafka expertise, this premium buys operational peace.

Option 2: Self-hosted Kafka on GKE using the Strimzi Kubernetes operator gives full control and the lowest cost at scale. Strimzi manages Kafka brokers, ZooKeeper (or KRaft mode), Kafka Connect, and MirrorMaker as Kubernetes custom resources. You get the flexibility to tune every Kafka configuration, choose your own Schema Registry (Confluent Schema Registry is open-source), and pay only GKE compute costs. The trade-off is significant operational burden: you manage upgrades, scaling, persistent volumes, monitoring, and disaster recovery. This is appropriate for teams with deep Kafka and Kubernetes expertise operating at very high scale where managed service costs are prohibitive.

Option 3: Use Pub/Sub instead of Kafka. For many use cases, Pub/Sub provides equivalent functionality with zero operational burden. Pub/Sub supports message ordering (via ordering keys), dead letter topics, exactly-once delivery (at-least-once with dedup), schema validation (Avro, Protocol Buffers), and message filtering. The API is different from Kafka’s, so it requires code changes, but the semantics are similar enough for most event-driven patterns. Pub/Sub is the right choice for GCP-native architectures where the Kafka ecosystem (Connect, Streams) is not required.

GCP Kafka Decision Tree
=========================
Do you need the Kafka API specifically?
├── NO → Use Pub/Sub (simpler, fully managed, GCP-native)
└── YES → Continue...
Do you need Kafka Connect ecosystem (200+ connectors)?
├── YES → Confluent Cloud on GCP (managed, rich ecosystem)
└── NO → Continue...
Is cost the primary concern at very high scale (>100 TB/day)?
├── YES → Self-hosted Kafka on GKE (Strimzi operator)
└── NO → Confluent Cloud on GCP (managed, less ops)
Do you have 2+ engineers with deep Kafka operational experience?
├── YES → Self-hosted is viable
└── NO → Confluent Cloud (don't self-host without expertise)
resource "aws_msk_cluster" "enterprise" {
  cluster_name           = "enterprise-event-backbone"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3 # One per AZ

  broker_node_group_info {
    instance_type   = "kafka.m5.large"       # 2 vCPU, 8 GB RAM
    client_subnets  = var.private_subnet_ids # One subnet per AZ
    security_groups = [aws_security_group.msk.id]
    storage_info {
      ebs_storage_info {
        volume_size = 500 # GB per broker
        provisioned_throughput {
          enabled           = true
          volume_throughput = 250 # MB/s
        }
      }
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS" # Encrypt client-broker traffic
      in_cluster    = true  # Encrypt inter-broker replication
    }
    encryption_at_rest_kms_key_arn = var.kms_key_arn
  }

  configuration_info {
    arn      = aws_msk_configuration.enterprise.arn
    revision = aws_msk_configuration.enterprise.latest_revision
  }

  open_monitoring {
    prometheus {
      jmx_exporter {
        enabled_in_broker = true # JMX metrics for Prometheus/vmagent
      }
      node_exporter {
        enabled_in_broker = true # Node-level metrics
      }
    }
  }

  logging_info {
    broker_logs {
      cloudwatch_logs {
        enabled   = true
        log_group = aws_cloudwatch_log_group.msk.name
      }
    }
  }

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

# Kafka broker configuration
resource "aws_msk_configuration" "enterprise" {
  name           = "enterprise-config"
  kafka_versions = ["3.6.0"]
  server_properties = <<-EOT
    auto.create.topics.enable=false
    default.replication.factor=3
    min.insync.replicas=2
    num.partitions=6
    log.retention.hours=168
    log.retention.bytes=-1
    message.max.bytes=1048576
    compression.type=lz4
    unclean.leader.election.enable=false
  EOT
}
# MSK Connect for CDC (Debezium)
resource "aws_mskconnect_connector" "debezium_cdc" {
  name                 = "debezium-postgres-cdc"
  kafkaconnect_version = "2.7.1"

  capacity {
    autoscaling {
      min_worker_count = 1
      max_worker_count = 4
      mcu_count        = 1
      scale_in_policy {
        cpu_utilization_percentage = 20
      }
      scale_out_policy {
        cpu_utilization_percentage = 80
      }
    }
  }

  connector_configuration = {
    "connector.class"             = "io.debezium.connector.postgresql.PostgresConnector"
    "database.hostname"           = var.rds_endpoint
    "database.port"               = "5432"
    "database.user"               = var.db_username
    "database.password"           = var.db_password
    "database.dbname"             = "orders"
    "database.server.name"        = "orders-db"
    "table.include.list"          = "public.orders,public.payments"
    "topic.prefix"                = "cdc"
    "plugin.name"                 = "pgoutput"
    "publication.autocreate.mode" = "filtered"
    "key.converter"               = "org.apache.kafka.connect.json.JsonConverter"
    "value.converter"             = "org.apache.kafka.connect.json.JsonConverter"
    "transforms"                  = "unwrap"
    "transforms.unwrap.type"      = "io.debezium.transforms.ExtractNewRecordState"
  }

  kafka_cluster {
    apache_kafka_cluster {
      bootstrap_servers = aws_msk_cluster.enterprise.bootstrap_brokers_tls
      vpc {
        security_groups = [aws_security_group.msk_connect.id]
        subnets         = var.private_subnet_ids
      }
    }
  }

  plugin {
    custom_plugin {
      arn      = aws_mskconnect_custom_plugin.debezium.arn
      revision = aws_mskconnect_custom_plugin.debezium.latest_revision
    }
  }

  service_execution_role_arn = aws_iam_role.msk_connect.arn
}

# Schema Registry — AWS Glue Schema Registry
resource "aws_glue_registry" "enterprise" {
  registry_name = "enterprise-events"
  description   = "Schema registry for enterprise event contracts"
}

resource "aws_glue_schema" "order_placed" {
  schema_name   = "OrderPlaced"
  registry_arn  = aws_glue_registry.enterprise.arn
  data_format   = "AVRO"
  compatibility = "BACKWARD" # New schema must read old data
  schema_definition = jsonencode({
    type      = "record"
    name      = "OrderPlaced"
    namespace = "com.bank.orders"
    fields = [
      { name = "orderId", type = "string" },
      { name = "customerId", type = "string" },
      { name = "amount", type = "double" },
      { name = "currency", type = "string" },
      { name = "timestamp", type = "string" }
    ]
  })
}

Enterprise Pattern: MSK as Central Event Backbone

In a large microservices architecture, MSK (or Confluent Cloud) serves as the central nervous system. All domain events flow through Kafka topics, enabling loose coupling between services while providing a durable, replayable event log. This pattern combines three key capabilities:

  1. Domain events: Microservices publish business events (OrderPlaced, PaymentProcessed, InventoryReserved) to Kafka topics. Other services subscribe to events they care about. No direct service-to-service calls for asynchronous workflows.

  2. CDC (Change Data Capture): Debezium monitors database transaction logs and publishes every INSERT, UPDATE, DELETE as a Kafka event. This enables real-time data synchronization between services without application-level dual-write (which is inherently inconsistent). CDC into Kafka is the most reliable way to get data from operational databases to analytics systems.

  3. Schema Registry: Enforces event contracts across all producers and consumers. Schemas are versioned with backward compatibility rules — a producer cannot publish a breaking schema change that would crash existing consumers. AWS Glue Schema Registry or Confluent Schema Registry both serve this purpose.

Enterprise Event Backbone Architecture

“Design a real-time data pipeline. When do you choose Kafka over Kinesis?”

“The decision comes down to three factors: consumer model, retention needs, and ecosystem.

Choose Kafka (MSK) when: (1) Multiple independent consumer groups need to read the same data at their own pace — Kafka’s consumer group model handles this natively, while Kinesis requires enhanced fan-out or SNS fan-out to multiple streams. (2) You need unlimited retention — some regulatory use cases require keeping an immutable event log for years, Kafka supports this with tiered storage. Kinesis maxes at 365 days. (3) You need the Kafka Connect ecosystem — 200+ connectors for CDC (Debezium), S3 sinks, Elasticsearch sinks, JDBC sources. Building these integrations from scratch with Kinesis is months of work.

Choose Kinesis when: (1) You want zero operational overhead and your throughput fits within shard limits. (2) You want Lambda integration — Kinesis triggers Lambda functions natively with event source mapping, batching, and error handling. Kafka-Lambda integration exists but is less mature. (3) You need real-time SQL analytics — Kinesis Analytics (managed Apache Flink) provides stream processing without additional infrastructure.

For this pipeline, I’d choose MSK if we have 10+ downstream consumers with different processing speeds and need CDC from databases. I’d choose Kinesis if it’s a simpler pipeline with 2-3 consumers and we want Lambda-based processing.”

“You have 500 microservices. Would you use an event bus (EventBridge) or a streaming platform (Kafka)? Why?”

“Both, for different purposes. EventBridge for event routing and integration, Kafka for the core event backbone.

EventBridge is excellent for: routing events between services based on content (event rules with filtering), integrating with AWS services (Lambda, Step Functions, SQS, API destinations), and providing a managed schema registry. It is a router, not a store — events are delivered and forgotten.

Kafka is essential for: durable event storage (replay events from 7 days ago for a new consumer), high-throughput stream processing (aggregations, joins, windowing via Kafka Streams or ksqlDB), CDC from databases (Debezium), and serving as the single source of truth for event history.

In practice, I’d use both: microservices publish domain events to Kafka topics (durable, replayable). An EventBridge pipe reads from Kafka and routes specific event types to AWS-native targets (Lambda, Step Functions) for services that prefer the EventBridge integration model. This gives you the durability and ecosystem of Kafka with the routing flexibility of EventBridge.”


Scenario 1: “Design an Event-Driven Order Processing System Handling 10K Orders per Minute”

Strong Answer:

“10K orders per minute is about 167 per second — well within the capacity of both EventBridge and Pub/Sub.

Architecture:

Order API (EKS pods) publishes order.placed events to the central EventBridge bus in the Shared Services Account. The event contains: orderId, customerId, items, amount, currency, timestamp.

Fan-out via EventBridge rules:

  • Rule 1: Route to Payment SQS queue (Team-B account) — process payment
  • Rule 2: Route to Inventory SQS queue (Team-C account) — reserve stock
  • Rule 3: Route to Notification Lambda (Team-D account) — send confirmation email
  • Rule 4: Route to Analytics Kinesis stream — real-time dashboards

Each consumer is independent: Payment failing does not block inventory. Each has its own DLQ with CloudWatch alarms. Retry policy: 3 attempts with exponential backoff.

Ordering: SQS FIFO queues with orderId as the message group ID — events for the same order are processed in sequence. Different orders process in parallel.

Exactly-once: Consumer-side idempotency. Each processor stores processed orderIds in DynamoDB. Before processing, check if the orderId exists. This handles the case where SQS delivers the same message twice.

Schema: EventBridge Schema Registry validates event format. Breaking changes require a versioning process (v1 and v2 published simultaneously during migration).

Monitoring: CloudWatch metrics on SQS queue depth, consumer lag, DLQ depth, EventBridge failed invocations. PagerDuty alert if DLQ depth > 0 or consumer lag > 5 minutes.”


Scenario 2: “How Do You Handle Failed Events and Ensure Nothing Is Lost?”

Strong Answer:

“Multiple layers of protection:

Layer 1 — Retries: EventBridge retries failed target invocations for up to 24 hours with exponential backoff. SQS consumers get 3 attempts (configurable via maxReceiveCount). Pub/Sub retries with configurable backoff (10s to 600s).
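The retry layer amounts to an exponential backoff schedule with a cap. A sketch of such a schedule (the base and cap values are illustrative, not any broker's actual defaults):

```python
# Exponential backoff with a ceiling: delay doubles per attempt up to a cap,
# mirroring SQS/EventBridge/Pub/Sub-style retry policies.
def backoff_schedule(attempts, base_seconds=10, cap_seconds=600):
    return [min(base_seconds * (2 ** i), cap_seconds) for i in range(attempts)]

delays = backoff_schedule(7)
```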

Layer 2 — Dead Letter Queue: After exhausting retries, the message moves to a DLQ. The DLQ has a 14-day retention period — plenty of time to investigate and replay.

Layer 3 — Alerting: CloudWatch alarm fires the moment a single message appears in the DLQ. PagerDuty notification to the on-call engineer. The alert includes the queue name and approximate message count.

Layer 4 — Replay: Once the root cause is fixed (bug in consumer code, downstream service restored), we replay messages from the DLQ back to the source queue using SQS StartMessageMoveTask or Pub/Sub Seek to a timestamp. We throttle replay to avoid overwhelming the consumer.

Layer 5 — Audit trail: Every event published is logged with a unique eventId and timestamp. If we suspect data loss, we can reconcile the producer’s log against the consumer’s processed events log.

Layer 6 — Event store (for critical systems): For financial transactions, we write every event to S3/GCS as an immutable log before publishing to EventBridge/Pub/Sub. Even if the entire messaging system fails, we can rebuild state from the event store.”


Scenario 3: “As the Central Team, How Do You Provide a Shared Event Bus for 20 Teams?”

Strong Answer:

“This is a platform engineering problem — provide self-service with guardrails.

Infrastructure (platform team owns):

  • Central EventBridge custom bus in the Shared Services Account
  • Resource policy allowing workload accounts to PutEvents
  • Schema Registry for event contract validation
  • DLQ in Shared Services with alarms
  • Monitoring dashboards (events published/consumed per team)

Governance (platform team enforces):

  • Event naming convention: com.bank.{team}.{entity}.{action} — e.g., com.bank.orders.order.placed
  • Schema registration required before publishing (prevents unstructured events)
  • Backward compatibility required for schema changes
  • Rate limiting per source (EventBridge has account-level throttling)

Self-service (tenant teams own):

  • Teams create their own EventBridge rules to subscribe to events they care about
  • Teams define their own SQS/Lambda targets in their workload accounts
  • Teams manage their own consumer logic, DLQ monitoring, and replay

Terraform module the platform team provides:

module "event_subscription" {
  source        = "git::https://github.com/bank/terraform-modules//event-subscription"
  event_bus_arn = data.aws_cloudwatch_event_bus.enterprise.arn
  team_name     = "team-b"
  event_patterns = [
    {
      source      = ["com.bank.orders"]
      detail_type = ["OrderPlaced"]
    }
  ]
  target_sqs_arn = aws_sqs_queue.order_events.arn
  dlq_arn        = aws_sqs_queue.order_events_dlq.arn
}

Teams use this module in their Terraform — they specify which events they want and where to deliver them. The platform team reviews PRs for convention compliance.”


Scenario 4: “SQS vs Kinesis — When Would You Choose Each?”

Strong Answer:

“The fundamental difference: SQS is a queue (messages are consumed and deleted), Kinesis is a stream (messages persist for the retention period and multiple consumers read independently).

Choose SQS when:

  • You need a simple task queue (one producer, one consumer group)
  • Messages should be processed once and removed
  • You do not need to replay processed messages
  • Throughput is variable (SQS scales automatically with no shard management)
  • Example: order processing, email sending, image resizing

Choose Kinesis when:

  • Multiple independent consumers need to read the same data (fan-out without SNS)
  • You need to replay messages (reprocess last 7 days of data for a new consumer)
  • Ordering is critical (Kinesis guarantees per-shard ordering always, not just in FIFO mode)
  • Real-time analytics (Kinesis Analytics can run SQL on the stream)
  • High-throughput, append-only log semantics (CDC, clickstream, IoT telemetry)
  • Example: real-time transaction monitoring, log aggregation, CDC pipeline

Or use both: Kinesis as the primary stream (durable, replayable) → fan-out to SQS queues for each consumer (independent processing, DLQ per consumer). This gives you durability AND independent consumption.”


Scenario 5: “Design Event Schema Evolution Without Breaking Downstream Consumers”

Strong Answer:

“Schema evolution is critical when 20 teams depend on each other’s events. The rules:

Additive changes are always safe. Adding a new optional field to an event is backward compatible. Old consumers ignore the field, new consumers use it. Example: adding region to the OrderPlaced event.

Removing or renaming fields is a breaking change. A consumer deserializing the event will fail if an expected field is missing.

The evolution process for breaking changes:

  1. Create a new version: Instead of modifying OrderPlaced, create OrderPlaced_v2 with the new schema.
  2. Publish both versions: The producer publishes both OrderPlaced (v1) and OrderPlaced_v2 during the transition period. This is dual-publishing — more work for the producer but zero impact on consumers.
  3. Consumers migrate: Each team migrates their subscription to the v2 event on their own schedule. The platform team tracks which teams are still on v1.
  4. Deprecation notice: After all consumers have migrated (verified by schema registry usage metrics), announce v1 deprecation with a 30-day sunset.
  5. Remove v1: Stop publishing v1. Clean up the schema.

Tooling: EventBridge Schema Registry or Pub/Sub schema validation catches breaking changes at publish time. In CI/CD, run a compatibility check: new schema must be backward compatible with current schema — fail the pipeline if not.”
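The CI compatibility gate can be sketched with simplified schemas: a change is backward compatible if it removes no fields and adds no required ones. Real registries apply the same rule to Avro/JSON Schema; here schemas are plain dicts for brevity.

```python
# Backward-compatibility check sketch: new consumers must still be able to
# read old events, so fields may only be added if they are optional.
def is_backward_compatible(old, new):
    for field in old:
        if field not in new:
            return False, f"removed field: {field}"
    for field, spec in new.items():
        if field not in old and spec.get("required"):
            return False, f"new required field: {field}"
    return True, "ok"

v1 = {"orderId": {"required": True}, "amount": {"required": True}}
v2_ok = {**v1, "region": {"required": False}}   # additive, optional: safe
v2_bad = {"orderId": {"required": True}}        # dropped "amount": breaking

ok1, _ = is_backward_compatible(v1, v2_ok)
ok2, reason = is_backward_compatible(v1, v2_bad)
```

Run this as a pipeline step and fail the build when the check returns False.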


Scenario 6: “How Do You Ensure Exactly-Once Processing?”

Strong Answer:

“True exactly-once delivery is a distributed systems myth at the transport layer. What we actually achieve is ‘effectively exactly-once’ through two complementary techniques:

Broker-side deduplication:

  • SQS FIFO: set a MessageDeduplicationId on each message. SQS deduplicates within a 5-minute window. If the same dedup ID is sent twice, SQS silently drops the duplicate.
  • Pub/Sub: enable exactly_once_delivery on the subscription. Pub/Sub tracks acknowledged messages and prevents redelivery of ack’d messages.
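For the SQS FIFO case, deriving the deduplication ID from the message content means a retried send of the same payload collapses into one delivery. A sketch building boto3 `send_message` parameters (the queue URL is a placeholder):

```python
import hashlib
import json

# Content-based MessageDeduplicationId for SQS FIFO: canonicalize the body
# (sorted keys) and hash it, so identical payloads always dedup together.
def fifo_send_params(queue_url, body: dict, group_id: str) -> dict:
    canonical = json.dumps(body, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": canonical,
        "MessageGroupId": group_id,  # ordering scope (e.g. the orderId)
        "MessageDeduplicationId": hashlib.sha256(canonical.encode()).hexdigest(),
    }

# Same payload, different key order: dedup IDs match
p1 = fifo_send_params("https://sqs.example/orders.fifo", {"orderId": "o-1", "amount": 5}, "o-1")
p2 = fifo_send_params("https://sqs.example/orders.fifo", {"amount": 5, "orderId": "o-1"}, "o-1")
```

The returned dict can be passed as `sqs.send_message(**params)`.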

Consumer-side idempotency (always required): Even with broker deduplication, edge cases exist (network timeout after processing but before ack, consumer crash after processing but before commit). The consumer must be idempotent:

  1. Extract a unique event ID from the message (orderId, transactionId)
  2. Before processing, check a database table: SELECT 1 FROM processed_events WHERE event_id = ?
  3. If found, skip (already processed)
  4. If not found, process the event AND insert into processed_events in the same database transaction (atomic)
  5. This guarantees that even if the same message arrives twice, the business logic executes only once

For financial systems: Add a reconciliation job that runs nightly, comparing the producer’s event log against the consumer’s processed events, and flags any discrepancies for manual review.”