
Connectivity — Transit Gateway, Peering & Hybrid

This page covers the Network Hub Account — the central connectivity layer that stitches together every workload account, on-premises data center, and partner network. Transit Gateway (AWS) and Shared VPC + NCC (GCP) live here.

Network Hub Account — connectivity context

Every workload VPC attaches to Transit Gateway as a spoke. The Network Hub Account owns the TGW and controls all inter-VPC and internet-bound routing through route tables.


In a mesh topology, every VPC peers with every other VPC. With n VPCs, you need n*(n-1)/2 peering connections. For 20 VPCs, that is 190 peering connections — each requiring route updates on both sides. This does not scale.

Hub-spoke solves this: every spoke VPC connects to ONE central hub (Transit Gateway). Adding a new VPC requires ONE attachment, not n-1 peering connections. The hub controls routing — spokes do not need to know about each other.
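The scaling argument above can be written as a quick back-of-the-envelope calculation (hypothetical locals, purely illustrative):

```hcl
# Mesh vs hub-spoke link counts for a 20-VPC estate.
locals {
  vpc_count       = 20
  mesh_peerings   = local.vpc_count * (local.vpc_count - 1) / 2 # 190 peering connections
  hub_spoke_links = local.vpc_count                             # 20 TGW attachments
}
```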

Mesh vs Hub-Spoke topology comparison


Transit Gateway (TGW) is a regional network transit hub that connects VPCs, VPN connections, Direct Connect gateways, and TGW peering connections (cross-region). It operates at Layer 3 (IP routing).

| Concept | Description |
| --- | --- |
| TGW | The hub itself — regional resource, one per region |
| Attachment | A connection from a VPC, VPN, Direct Connect, or peered TGW to the TGW |
| Route Table | TGW has its own route tables (separate from VPC route tables); controls where traffic goes between attachments |
| Association | Links an attachment to a route table — determines which route table is used for traffic FROM that attachment |
| Propagation | An attachment can propagate its routes into a route table — the attachment's CIDR automatically appears as a route |
| Static Route | Manually added routes in TGW route tables (e.g., default route to inspection VPC) |
| Appliance Mode | Ensures symmetric routing for stateful appliances — both directions of a flow go through the same AZ |

Association and propagation are the most misunderstood part of TGW. There are two distinct actions on every route table:

  1. Association: “Traffic FROM this attachment uses THIS route table to decide where to go next”
  2. Propagation: “This attachment ADVERTISES its CIDR INTO this route table so others can find it”
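The two actions above map to two separate Terraform resources. A minimal sketch, with a hypothetical `payments` attachment: traffic FROM the attachment uses the prod route table (association), and the attachment's CIDR is advertised INTO the shared-services route table (propagation).

```hcl
# Association: traffic from the payments attachment is routed using prod-rt.
resource "aws_ec2_transit_gateway_route_table_association" "payments" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.payments.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Propagation: the payments VPC CIDR appears as a route in shared-services-rt.
resource "aws_ec2_transit_gateway_route_table_propagation" "payments_to_shared" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.payments.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared_services.id
}
```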

TGW route tables — association and propagation

In our bank, we use separate TGW route tables to enforce environment isolation:

TGW route table segmentation strategy

Full Architecture — Centralized Egress with Inspection


TGW full architecture with centralized inspection VPC

The Inspection VPC needs careful routing to avoid loops:

Inspection VPC route tables


GCP Connectivity — Shared VPC & Network Connectivity Center


GCP Shared VPC is fundamentally different from AWS TGW. Instead of connecting separate VPCs, you share ONE VPC across multiple projects.

Concepts:

  • Host Project: owns the VPC, subnets, firewall rules, Cloud NAT, Cloud Router. Managed by the central infra team.
  • Service Project: attached to the host project. Workload teams deploy GKE clusters, VMs, Cloud SQL into subnets owned by the host project.
  • IAM bindings: service project users need compute.networkUser on specific subnets to deploy resources.
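The `compute.networkUser` binding mentioned above is granted at the subnet level. A hedged sketch (project, subnet, and service-account names are hypothetical):

```hcl
# Grant a workload team's deployer service account the right to place
# resources into one specific subnet owned by the host project.
resource "google_compute_subnetwork_iam_member" "payments_team" {
  project    = "host-project-id" # host project owns the subnet
  region     = "me-central2"
  subnetwork = "payments-prod-subnet"
  role       = "roles/compute.networkUser"
  member     = "serviceAccount:payments-deployer@service-project-id.iam.gserviceaccount.com"
}
```

Granting the role per subnet (rather than on the whole host project) is what enables the subnet-level team isolation described in the limitations below.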

GCP Shared VPC — host project with service projects

Key advantages over AWS TGW model:

  • No data processing charges for inter-project communication (same VPC)
  • Simpler routing — no TGW route tables to manage
  • Firewall rules apply uniformly across all projects

Key limitations:

  • Maximum ~1000 service projects per host project (soft limit, can be raised)
  • All projects share the same VPC and IP space — CIDR planning is critical
  • Service project teams cannot create their own firewall rules (by design — central team controls)
  • Subnet-level IAM is needed to restrict which teams can deploy to which subnets

VPC Peering vs Transit Gateway — Decision Framework

| Factor | VPC Peering | Transit Gateway |
| --- | --- | --- |
| Scale | Does not scale (O(n²) connections) | Scales linearly (n connections) |
| Transitivity | NOT transitive (A↔B + B↔C does not mean A↔C) | Transitive (all spokes can reach each other) |
| Cost | No hourly charge, no data processing charge | $0.05/hr per attachment + $0.02/GB data processing |
| Latency | Slightly lower (direct) | Slightly higher (one additional hop) |
| Cross-region | Yes (inter-region peering) | Yes (inter-region TGW peering) |
| Cross-account | Yes | Yes |
| Route management | Manual routes on both sides | Centralized route tables with propagation |
| Bandwidth | No limit (same as within VPC) | Up to 50 Gbps per VPC attachment |
| Security inspection | Cannot insert firewall inline | Can route through inspection VPC |
| Overlapping CIDRs | Not allowed | Not allowed |

Decision:

  • Use VPC Peering for: 2-3 VPCs with stable topology, high-throughput needs (e.g., EKS ↔ Shared Services), cost-sensitive data transfer
  • Use Transit Gateway for: 5+ VPCs, centralized security inspection, on-prem connectivity, environment segmentation, any enterprise architecture

Hybrid Connectivity — Direct Connect & Cloud Interconnect


Enterprise banks almost always run on-premises data centers that need private, dedicated connectivity to the cloud. The public internet is not acceptable for production traffic — latency is variable, bandwidth is shared, and regulatory requirements often mandate private links.

Direct Connect (DX) provides a dedicated physical connection between your on-prem data center and AWS.

Key concepts:

  • Connection: physical port (1, 10, or 100 Gbps dedicated) at a Direct Connect location (colocation facility)
  • Virtual Interface (VIF): logical connection over the physical port
    • Private VIF: connects to VPCs (via Direct Connect Gateway → TGW or VGW)
    • Public VIF: connects to AWS public services (S3, DynamoDB) over the dedicated link instead of the internet
    • Transit VIF: connects to Transit Gateway via Direct Connect Gateway
  • Direct Connect Gateway (DXGW): global resource that connects DX to TGWs in multiple regions
  • LAG (Link Aggregation Group): bundle multiple DX connections for higher bandwidth

AWS Direct Connect — on-prem to TGW via DX Gateway

Redundancy patterns:

  • Standard HA: 2 DX connections at the SAME DX location (protects against port/device failure)
  • Maximum resilience: 2 DX connections at DIFFERENT DX locations (protects against facility failure)
  • VPN backup: Site-to-site VPN over internet as failover when DX goes down (lower bandwidth, higher latency, but works)

BGP configuration:

  • On-prem announces its CIDRs (172.16.0.0/12) to AWS via BGP
  • AWS announces VPC CIDRs (10.0.0.0/8) back to on-prem
  • BGP community tags control route propagation (e.g., 7224:8100 = local region preference)
  • BFD (Bidirectional Forwarding Detection) for sub-second failover

Both AWS and GCP support site-to-site VPN as a backup path when the dedicated link fails.

AWS Site-to-Site VPN components


Inter-Region TGW Peering:

Each region has its own TGW. You peer them together for cross-region routing. TGW peering is NOT transitive — if Region A peers with Region B and Region B peers with Region C, Region A cannot reach Region C without a direct peering.

Inter-region TGW peering — eu-west-1 to me-south-1


Packet Trace: Pod A in VPC-1 Reaches RDS in VPC-2 via TGW


This is a common interview question. Walk through every hop.

Setup:

  • Pod A runs in an EKS cluster in payments-prod VPC (10.10.0.0/16), private subnet (10.10.1.0/24)
  • RDS instance is in data-platform-prod VPC (10.12.0.0/16), data subnet (10.12.2.50)
  • Both VPCs are attached to Transit Gateway in the Network Hub Account

Packet trace — pod in VPC-1 to RDS in VPC-2 via TGW


Terraform — Transit Gateway & Shared VPC

# network-hub-account/tgw.tf — Transit Gateway Configuration
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Enterprise bank TGW"
  amazon_side_asn                 = 64512
  auto_accept_shared_attachments  = "disable" # Manual approval
  default_route_table_association = "disable" # Explicit route tables
  default_route_table_propagation = "disable" # Explicit propagation
  dns_support                     = "enable"
  vpn_ecmp_support                = "enable" # ECMP for VPN tunnels
  multicast_support               = "disable"

  tags = { Name = "bank-tgw-eu-west-1" }
}

# Share TGW with workload accounts via AWS RAM
resource "aws_ram_resource_share" "tgw" {
  name                      = "tgw-share"
  allow_external_principals = false # Same org only
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.main.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

resource "aws_ram_principal_association" "workloads_ou" {
  principal          = "arn:aws:organizations::111111111111:ou/o-xxx/ou-xxx-workloads"
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

# ─── TGW Route Tables ──────────────────────────────
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "prod-rt" }
}

resource "aws_ec2_transit_gateway_route_table" "non_prod" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "non-prod-rt" }
}

resource "aws_ec2_transit_gateway_route_table" "shared_services" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "shared-services-rt" }
}

resource "aws_ec2_transit_gateway_route_table" "inspection" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "inspection-rt" }
}

# ─── Inspection VPC Attachment ──────────────────────
resource "aws_ec2_transit_gateway_vpc_attachment" "inspection" {
  transit_gateway_id     = aws_ec2_transit_gateway.main.id
  vpc_id                 = aws_vpc.inspection.id
  subnet_ids             = aws_subnet.inspection_tgw[*].id
  appliance_mode_support = "enable" # CRITICAL

  transit_gateway_default_route_table_association = false
  transit_gateway_default_route_table_propagation = false

  tags = { Name = "inspection-vpc-attachment" }
}

# Associate inspection attachment with inspection route table
resource "aws_ec2_transit_gateway_route_table_association" "inspection" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.inspection.id
}

# Default route in prod RT → inspection VPC (for internet egress)
resource "aws_ec2_transit_gateway_route" "prod_default" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Default route in non-prod RT → inspection VPC
resource "aws_ec2_transit_gateway_route" "non_prod_default" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.non_prod.id
}

# ─── Direct Connect Gateway ────────────────────────
resource "aws_dx_gateway" "main" {
  name            = "bank-dxgw"
  amazon_side_asn = 64513
}

resource "aws_dx_gateway_association" "tgw" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.main.id
  allowed_prefixes      = ["10.0.0.0/8"] # Advertise cloud CIDRs to on-prem
}

# ─── VPN as DX Backup ──────────────────────────────
resource "aws_customer_gateway" "on_prem" {
  bgp_asn    = 65000
  ip_address = "203.0.113.1" # On-prem VPN public IP
  type       = "ipsec.1"
  tags       = { Name = "on-prem-cgw" }
}

resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.on_prem.id
  transit_gateway_id  = aws_ec2_transit_gateway.main.id
  type                = "ipsec.1"
  static_routes_only  = false # Use BGP
  tags                = { Name = "dx-backup-vpn" }
}

# ─── Outputs ────────────────────────────────────────
output "transit_gateway_id" {
  value = aws_ec2_transit_gateway.main.id
}

output "tgw_route_table_ids" {
  value = {
    prod            = aws_ec2_transit_gateway_route_table.prod.id
    non_prod        = aws_ec2_transit_gateway_route_table.non_prod.id
    shared_services = aws_ec2_transit_gateway_route_table.shared_services.id
    inspection      = aws_ec2_transit_gateway_route_table.inspection.id
  }
}

In workload accounts (called by VPC module):

# Workload account attaches VPC to the shared TGW
data "aws_ec2_transit_gateway" "hub" {
  filter {
    name   = "tag:Name"
    values = ["bank-tgw-eu-west-1"]
  }
}

# Associate with correct route table (prod or non-prod)
resource "aws_ec2_transit_gateway_route_table_association" "this" {
  transit_gateway_attachment_id  = module.vpc.tgw_attachment_id
  transit_gateway_route_table_id = var.is_production ? data.terraform_remote_state.hub.outputs.tgw_route_table_ids["prod"] : data.terraform_remote_state.hub.outputs.tgw_route_table_ids["non_prod"]
}

# Propagate this VPC's CIDR into the appropriate route tables
resource "aws_ec2_transit_gateway_route_table_propagation" "to_prod" {
  count                          = var.is_production ? 1 : 0
  transit_gateway_attachment_id  = module.vpc.tgw_attachment_id
  transit_gateway_route_table_id = data.terraform_remote_state.hub.outputs.tgw_route_table_ids["prod"]
}

resource "aws_ec2_transit_gateway_route_table_propagation" "to_shared" {
  transit_gateway_attachment_id  = module.vpc.tgw_attachment_id
  transit_gateway_route_table_id = data.terraform_remote_state.hub.outputs.tgw_route_table_ids["shared_services"]
}

resource "aws_ec2_transit_gateway_route_table_propagation" "to_inspection" {
  transit_gateway_attachment_id  = module.vpc.tgw_attachment_id
  transit_gateway_route_table_id = data.terraform_remote_state.hub.outputs.tgw_route_table_ids["inspection"]
}

For multi-cloud enterprises, connecting AWS and GCP requires one of these patterns:

Cross-cloud connectivity — VPN-based


Private Connectivity — PrivateLink & Private Service Connect

Private connectivity services let you expose a service to consumers in other accounts, VPCs, or organizations without VPC peering, public IPs, or internet traversal. Traffic stays on the provider’s backbone network. This is the foundation for SaaS publishing, shared platform APIs, and accessing AWS/GCP managed services privately.

PrivateLink is a consumer/provider model. The provider publishes a service behind a Network Load Balancer. The consumer creates an Interface VPC Endpoint in their VPC to access it privately.

AWS PrivateLink Architecture

How it works:

  1. Provider creates an NLB (internal) and registers targets (EKS pods, EC2, IP targets)
  2. Provider creates a VPC Endpoint Service pointing to the NLB
  3. Provider optionally enables acceptance required (manual approval of consumer connections)
  4. Provider adds allowed principals (specific AWS account IDs or org IDs)
  5. Consumer creates an Interface VPC Endpoint specifying the service name
  6. Endpoint creates ENIs in the consumer’s subnets with private IPs
  7. Consumer’s app connects to the endpoint ENI IP or the private DNS name

Cross-account access: PrivateLink is designed for cross-account use. Provider allows specific accounts; consumers create endpoints in their own VPC. No VPC peering, no route table changes, no CIDR overlap concerns.

Gateway Endpoints vs Interface Endpoints:

| Aspect | Gateway Endpoint | Interface Endpoint |
| --- | --- | --- |
| Services | S3 and DynamoDB ONLY | 100+ AWS services + custom |
| Cost | FREE (no hourly or data charges) | $0.01/hr per AZ + $0.01/GB |
| How it works | Route table entry → prefix list | ENI in your subnet with private IP |
| DNS | No private DNS (uses prefix list routes) | Private DNS (resolves service domain to private IP) |
| Security | VPC endpoint policy (JSON) | Security groups + VPC endpoint policy |
| Cross-region | No | No (same region only) |
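A Gateway Endpoint is just a route-table entry, which is why it is free. A hedged sketch for S3 (VPC and route table references are hypothetical):

```hcl
# Free S3 Gateway Endpoint — adds a prefix-list route to the given route tables.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.workload.id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}
```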

Cost comparison — Interface Endpoint vs NAT Gateway for AWS service access:

| Pattern | Monthly Cost (100 GB traffic, 3 AZs) |
| --- | --- |
| NAT Gateway | $32.40 (gateway) + $4.50 (data) = ~$37 |
| Interface Endpoint | $21.60 (3 AZs × $7.20) + $1.00 (data) = ~$23 |
| Gateway Endpoint (S3/DDB) | $0 |

For workloads that only need to reach AWS services (not the internet), Interface Endpoints are cheaper than NAT Gateway and more secure (traffic never touches the internet).
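The arithmetic behind those monthly figures, written out as hypothetical locals (720-hour month, list prices of $0.045/hr and $0.045/GB for NAT Gateway, $0.01/hr per AZ and $0.01/GB for Interface Endpoints):

```hcl
# NAT:      0.045 * 720 = $32.40 gateway + 0.045 * 100 = $4.50 data  ≈ $37
# Endpoint: 3 * 0.01 * 720 = $21.60 ENIs + 0.01 * 100  = $1.00 data  ≈ $23
locals {
  nat_monthly      = 0.045 * 720 + 0.045 * 100
  endpoint_monthly = 3 * 0.01 * 720 + 0.01 * 100
}
```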

Exposing EKS services via PrivateLink:

# Step 1: K8s Service with internal NLB
# (applied via kubectl, shown here for reference)
#
# apiVersion: v1
# kind: Service
# metadata:
#   name: shared-api
#   annotations:
#     service.beta.kubernetes.io/aws-load-balancer-type: "external"
#     service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
#     service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
# spec:
#   type: LoadBalancer
#   ports:
#     - port: 443
#       targetPort: 8443
#   selector:
#     app: shared-api

# Step 2: VPC Endpoint Service
resource "aws_vpc_endpoint_service" "shared_api" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.shared_api_nlb.arn]
  allowed_principals = [
    "arn:aws:iam::111111111111:root", # Tenant account 1
    "arn:aws:iam::222222222222:root", # Tenant account 2
  ]

  tags = { Name = "shared-api-endpoint-service" }
}

# Step 3: Consumer creates this in their account
resource "aws_vpc_endpoint" "consume_shared_api" {
  vpc_id              = aws_vpc.workload.id
  service_name        = aws_vpc_endpoint_service.shared_api.service_name
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoint_sg.id]
  private_dns_enabled = false # Use endpoint-specific DNS
}

Decision Matrix — PrivateLink vs VPC Peering vs Transit Gateway

Choosing the right connectivity pattern depends on the traffic pattern, scale, and security requirements:

| Aspect | PrivateLink / PSC | VPC Peering | Transit Gateway / NCC |
| --- | --- | --- | --- |
| Use case | Expose a single service to consumers | Full network connectivity between 2 VPCs | Hub-spoke for many VPCs |
| CIDR overlap | Allowed (NAT handles it) | NOT allowed (must be unique) | NOT allowed |
| Routing | No route table changes (ENI/forwarding rule) | Route table entries on both sides | Centralized route tables |
| Scale | Unlimited consumers per service | Max 125 peering connections per VPC | 5,000 attachments per TGW |
| Transitivity | N/A (point-to-point service) | NOT transitive (A↔B, B↔C does NOT give A↔C) | Transitive by design |
| Cross-region | Same region only | Cross-region supported | Cross-region via peering |
| Cross-account | Yes (primary use case) | Yes | Yes (RAM sharing) |
| Security | Consumer can only reach the published service | Full VPC-to-VPC access (filtered by SGs) | Route table segmentation |
| Cost | Per-hour + per-GB | Free (data transfer charges only) | Per-attachment + per-GB |
| Best for | SaaS publishing, shared APIs, AWS service access | Simple 2-VPC connectivity, low traffic | Enterprise multi-account networking |

Interview scenario — “How do you expose a shared API to 50 tenant accounts without VPC peering?”

Answer: Use PrivateLink (AWS) or Private Service Connect (GCP). Deploy the API in a shared services account behind an internal NLB (AWS) or ILB (GCP). Create a VPC Endpoint Service / Service Attachment. Each tenant account creates its own VPC Endpoint / PSC Endpoint. Benefits:

  1. No CIDR coordination — tenants can use overlapping CIDRs.
  2. No route table management — endpoints are local ENIs.
  3. Least privilege — tenants can only reach the published service, not your entire VPC.
  4. Scale — adding tenant 51 is the same as adding tenant 1.
  5. Security — the provider controls acceptance; the consumer controls security groups on the endpoint.


Hybrid Connectivity — Direct Connect & Cloud Interconnect


Hybrid connectivity bridges on-premises data centers to cloud VPCs over private, dedicated circuits. This is mandatory for regulated industries (banking, healthcare) where internet-based VPN does not meet latency, bandwidth, or compliance requirements.

Direct Connect provides a dedicated physical connection from your data center (or colocation facility) to an AWS Direct Connect location.

Physical Layer:

AWS Direct Connect Physical and Logical Layers

Provisioning steps:

  1. Request a DX connection in AWS Console (choose location and port speed)
  2. AWS provides a LOA-CFA (Letter of Authorization - Connecting Facility Assignment)
  3. Give LOA-CFA to your colocation provider to provision the cross-connect (physical cable)
  4. Configure Virtual Interfaces (VIFs) — the logical layer on top of the physical connection
  5. Establish BGP peering between your router and AWS router

Port speeds: 1 Gbps, 10 Gbps, 100 Gbps (dedicated). Sub-1G via partner connections (50 Mbps to 10 Gbps).

Virtual Interface Types:

| VIF Type | Purpose | Connects To | BGP Peering |
| --- | --- | --- | --- |
| Private VIF | Access VPC private IPs | VPC via VGW or DX Gateway | Private ASN |
| Transit VIF | Access VPCs via Transit Gateway | TGW via DX Gateway | Private ASN |
| Public VIF | Access AWS public services (S3, DynamoDB, etc.) | AWS public IP ranges | Public ASN |

DX Gateway — multi-region access from single connection:

A DX Gateway is a global resource that connects a DX connection to VPCs (via VGW) or Transit Gateways in ANY region. One physical connection in Dubai → DX Gateway → VPCs in me-south-1, eu-west-1, us-east-1.

High Availability Pattern (must-know for interviews):

Direct Connect High Availability Pattern

Key HA principles:

  • Dual DX connections at SEPARATE facilities — protects against facility failure
  • VPN backup — if both DX fail, traffic falls back to VPN over internet
  • BGP failover — use MED (Multi-Exit Discriminator) to prefer DX (lower MED) over VPN (higher MED)
  • BFD (Bidirectional Forwarding Detection) — sub-second failure detection on DX links

MACsec encryption: Available on 10G and 100G dedicated connections. Encrypts data at Layer 2 between your router and the AWS DX router. Required for compliance in banking/government.

LAG (Link Aggregation Group): Bundle multiple DX connections (same speed, same location) into a single logical connection for higher throughput. Up to 4 connections per LAG.

Terraform:

resource "aws_dx_connection" "primary" {
  name      = "dx-primary-dubai"
  bandwidth = "10Gbps"
  location  = "DXB1" # Dubai DX location

  tags = {
    Environment = "production"
    Redundancy  = "primary"
  }
}

resource "aws_dx_gateway" "main" {
  name            = "dx-gateway-main"
  amazon_side_asn = "64512"
}

resource "aws_dx_transit_virtual_interface" "primary" {
  connection_id  = aws_dx_connection.primary.id
  dx_gateway_id  = aws_dx_gateway.main.id
  name           = "transit-vif-primary"
  vlan           = 100
  address_family = "ipv4"
  bgp_asn        = 65001 # Your on-prem ASN
  mtu            = 8500  # Jumbo frames
}

resource "aws_dx_gateway_association" "tgw" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.hub.id
  allowed_prefixes = [
    "10.0.0.0/8", # All VPC CIDRs
  ]
}

# VPN backup with higher MED
resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.onprem.id
  transit_gateway_id  = aws_ec2_transit_gateway.hub.id
  type                = "ipsec.1"
  static_routes_only  = false # Use BGP
  tags                = { Name = "vpn-backup-for-dx" }
}
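The LAG bundling described earlier could be sketched as follows (hypothetical names; `aws_dx_lag` can fold an existing connection into the bundle via `connection_id`):

```hcl
# Bundle DX connections of the same speed at the same location into one
# logical link. Additional connections can be requested on the LAG later.
resource "aws_dx_lag" "primary" {
  name                  = "dx-lag-dubai"
  connections_bandwidth = "10Gbps"
  location              = "DXB1"
  connection_id         = aws_dx_connection.primary.id # absorb the existing connection
}
```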

Hybrid DNS — Resolving Across Cloud and On-Premises


Hybrid DNS is critical for any organization with workloads split between on-prem and cloud. Servers in AWS need to resolve on-prem hostnames (e.g., ldap.corp.internal), and on-prem servers need to resolve cloud-hosted service names (e.g., api.payments.aws.internal).

Hybrid DNS Resolution Architecture

Route 53 Resolver Endpoints:

  • Inbound Endpoint — ENIs in your VPC that accept DNS queries FROM on-premises or other networks. On-prem DNS servers forward queries for *.aws.internal to these inbound endpoint IPs.
  • Outbound Endpoint — ENIs that forward DNS queries FROM your VPC TO on-prem DNS. You create Resolver Rules specifying which domains (e.g., corp.internal) should be forwarded to which on-prem DNS server IPs.

Terraform:

# Inbound endpoint — on-prem can resolve cloud names
resource "aws_route53_resolver_endpoint" "inbound" {
  name               = "hybrid-dns-inbound"
  direction          = "INBOUND"
  security_group_ids = [aws_security_group.dns.id]

  ip_address {
    subnet_id = aws_subnet.private_a.id
    ip        = "10.20.1.10"
  }
  ip_address {
    subnet_id = aws_subnet.private_b.id
    ip        = "10.20.2.10"
  }
}

# Outbound endpoint — cloud resolves on-prem names
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "hybrid-dns-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [aws_security_group.dns.id]

  ip_address {
    subnet_id = aws_subnet.private_a.id
  }
  ip_address {
    subnet_id = aws_subnet.private_b.id
  }
}

# Forward corp.internal to on-prem DNS
resource "aws_route53_resolver_rule" "forward_corp" {
  domain_name          = "corp.internal"
  name                 = "forward-to-onprem-dns"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip   = "10.0.0.53"
    port = 53
  }
  target_ip {
    ip   = "10.0.0.54"
    port = 53
  }
}

# Associate the forwarding rule with a workload VPC
# (to share it across accounts, share the rule itself via AWS RAM)
resource "aws_route53_resolver_rule_association" "shared_vpc" {
  resolver_rule_id = aws_route53_resolver_rule.forward_corp.id
  vpc_id           = aws_vpc.workload.id
}

Split-horizon DNS: A pattern where the same domain name resolves to different IPs depending on WHERE the query originates. For example, api.company.com resolves to a public IP (52.x.x.x) when queried from the internet, but to a private IP (10.x.x.x) when queried from within the VPC or on-prem. Implemented using Route53 private hosted zones (which override public zones for queries from associated VPCs) or GCP private DNS zones.
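A hedged sketch of the split-horizon pattern on AWS (zone, VPC, and IP values are hypothetical): a private hosted zone for the same domain as the public zone shadows it for queries originating in the associated VPC.

```hcl
# Private hosted zone for company.com — overrides the public zone for the VPC.
resource "aws_route53_zone" "api_private" {
  name = "company.com"

  vpc {
    vpc_id = aws_vpc.workload.id
  }
}

# Inside the VPC, api.company.com resolves to this private IP;
# internet clients still get the public zone's record.
resource "aws_route53_record" "api_internal" {
  zone_id = aws_route53_zone.api_private.zone_id
  name    = "api.company.com"
  type    = "A"
  ttl     = 60
  records = ["10.20.5.10"]
}
```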

Interview scenario — “Design hybrid connectivity for a bank with a datacenter in Dubai connecting to both AWS and GCP”

Answer:

  1. Physical connectivity: Dedicated Interconnect to GCP at a Dubai colocation facility (e.g., Equinix DX1) plus AWS Direct Connect at the same facility. Use Cross-Cloud Interconnect between GCP and AWS for direct cloud-to-cloud traffic.
  2. HA: dual connections at separate facilities for each cloud; VPN backup over the internet with BGP failover (lower MED on the dedicated links).
  3. Routing: BGP with BFD for fast failover; DX Gateway + Transit VIF for AWS multi-VPC access; Cloud Router for GCP.
  4. DNS: Route 53 Resolver endpoints and GCP forwarding zones both pointing at on-prem AD DNS for corp.internal. On-prem DNS forwards *.aws.internal to the Route 53 inbound endpoints and *.gcp.internal to Cloud DNS inbound policy IPs.
  5. Encryption: MACsec on DX, IPsec on the VPN backup, TLS for all application traffic.
  6. Bandwidth: start with 10G to each cloud, monitor utilization, scale with LAG or upgrade to 100G.

Interview scenario — “Your Direct Connect goes down at 2 AM. What happens and how fast do you recover?”

Answer:

  1. Detection: BFD detects the link failure in under 1 second. The BGP session drops. A CloudWatch alarm fires (DX connection state change) and PagerDuty alerts the on-call engineer.
  2. Automatic failover: with dual DX, traffic shifts to the second connection via BGP reconvergence (~30-60 seconds with BFD). With a single DX plus VPN backup, BGP reconverges to the VPN tunnel (~30-90 seconds) — higher latency and lower bandwidth, but connectivity holds.
  3. Impact during failover: TCP connections are dropped and must be re-established. Long-lived DB connections (connection poolers like PgBouncer) may need manual reconnection.
  4. Recovery: work with the colo provider to restore the physical link (hours to days for a hardware failure); meanwhile the VPN or secondary DX carries traffic.
  5. Prevention: always deploy dual DX at separate facilities. Monitor the CloudWatch metrics ConnectionState, ConnectionBpsEgress, and ConnectionBpsIngress. Run quarterly failover drills.


Multi-Region Networking & Global Load Balancing


Multi-region networking is fundamentally different between AWS and GCP due to one key architectural decision: AWS VPCs are regional (you need multiple VPCs and connect them), while GCP VPCs are global (subnets span regions within a single VPC). This shapes everything from routing to load balancing to disaster recovery.

Since AWS VPCs are regional, connecting workloads across regions requires TGW inter-region peering. Each region has its own Transit Gateway; you peer them together.

Transit Gateway Multi-Region Peering

TGW peering is encrypted by default and runs over AWS’s global backbone. Route tables at each TGW control which VPCs can reach which cross-region VPCs (segmentation).

Global Accelerator provides two static anycast IPs that route traffic to the nearest healthy AWS endpoint via AWS’s global edge network. Unlike CloudFront, it works for ANY TCP/UDP traffic (not just HTTP).

How it works:

  1. Client connects to one of the two anycast IPs (same IPs regardless of client location)
  2. Traffic enters the nearest AWS edge location (~100+ edge locations globally)
  3. AWS routes traffic over its backbone to the optimal endpoint (based on health, geography, routing policies)
  4. Endpoints can be ALBs, NLBs, EC2 instances, or Elastic IPs in any region
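The flow above could be wired up roughly as follows (hypothetical names; the NLB/ALB reference is assumed to exist elsewhere):

```hcl
# One accelerator with two anycast IPs...
resource "aws_globalaccelerator_accelerator" "main" {
  name            = "bank-api-accelerator"
  ip_address_type = "IPV4"
  enabled         = true
}

# ...listening on TCP 443...
resource "aws_globalaccelerator_listener" "tcp443" {
  accelerator_arn = aws_globalaccelerator_accelerator.main.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}

# ...routing to a regional load balancer endpoint.
resource "aws_globalaccelerator_endpoint_group" "eu" {
  listener_arn          = aws_globalaccelerator_listener.tcp443.id
  endpoint_group_region = "eu-west-1"

  endpoint_configuration {
    endpoint_id = aws_lb.api_eu.arn # ALB or NLB ARN
    weight      = 100
  }
}
```

Additional endpoint groups in other regions give health-based, multi-region failover without DNS changes.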

Use cases:

  • Gaming (UDP, low-latency)
  • IoT (TCP/MQTT, consistent endpoints)
  • VoIP/SIP (UDP, failover without DNS TTL delays)
  • Any TCP app where you need fast failover (under 30 seconds vs DNS TTL)
  • Multi-region active-active with health-based routing

Global Accelerator vs CloudFront vs Route 53:

| Feature | Global Accelerator | CloudFront | Route 53 |
| --- | --- | --- | --- |
| Protocol | TCP, UDP | HTTP, HTTPS, WebSocket | DNS-based (any protocol) |
| Caching | No | Yes (edge cache) | N/A |
| Static IPs | Yes (2 anycast IPs) | No (uses domain names) | No |
| Failover speed | Under 30 s | Depends on origin health check | DNS TTL (typically 60-300 s) |
| Pricing | Per-hour + data transfer premium | Per-request + data transfer | Per-query + health checks |
| Best for | Non-HTTP TCP/UDP, fast failover, static IPs | HTTP/HTTPS with caching, API acceleration | DNS-level routing (weighted/geo/latency) |
| DDoS | Shield Standard built-in | Shield Standard + WAF integration | Shield Standard |

Route 53 provides seven routing policies for DNS-level traffic management:

| Policy | How It Works | Use Case |
| --- | --- | --- |
| Simple | Returns one record (or multiple values randomly) | Single-region, basic setup |
| Weighted | Distribute traffic by percentage (e.g., 90/10) | A/B testing, blue/green, gradual migration |
| Latency | Route to the region with the lowest latency for the user | Multi-region apps, best user experience |
| Failover | Active-passive with health checks | DR: primary in UAE, failover to EU |
| Geolocation | Route by user's country or continent | Compliance (EU data stays in EU), localization |
| Geoproximity | Route by geographic proximity + configurable bias | Shift traffic between regions (Traffic Flow) |
| Multivalue | Return up to 8 healthy IPs | Simple load distribution with health checks |

Active-active with Route 53:

resource "aws_route53_record" "api_latency_uae" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "uae"

  alias {
    name                   = aws_lb.api_uae.dns_name
    zone_id                = aws_lb.api_uae.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "me-south-1"
  }
}

resource "aws_route53_record" "api_latency_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "eu"

  alias {
    name                   = aws_lb.api_eu.dns_name
    zone_id                = aws_lb.api_eu.zone_id
    evaluate_target_health = true
  }

  latency_routing_policy {
    region = "eu-west-1"
  }
}

Active-Active Multi-Region Design Considerations


Building a truly active-active multi-region architecture requires solving these networking challenges:

Data replication trade-offs:

| Pattern | Latency | Consistency | Use Case |
| --- | --- | --- | --- |
| Synchronous replication | High (cross-region RTT added to every write) | Strong consistency | Financial transactions, inventory counts |
| Asynchronous replication | Low (writes return immediately, replicate in background) | Eventual (data loss window = replication lag) | Analytics, user profiles, session data |
| Conflict-free (CRDTs / last-writer-wins) | Low | Eventual with automatic conflict resolution | Shopping carts, collaborative editing |

Session management:

  • Stateless design (preferred): JWTs or signed tokens — no server-side session. Any region can serve any request. This is the foundation of active-active.
  • Session affinity (fallback): Sticky sessions via cookie or source IP — problematic because failover breaks sessions. Use only when stateless is impossible (legacy apps).
  • Distributed session store: Redis Global Datastore (AWS) or Memorystore (GCP) with cross-region replication — adds latency but maintains sessions during failover.

DNS TTL considerations for failover:

  • Lower TTL = faster failover but more DNS queries (higher cost, more resolver load)
  • Higher TTL = slower failover but better caching
  • Typical production values: 60 seconds for active-active with health checks, 300 seconds for stable single-region
  • Problem: Some resolvers and clients ignore TTL. Java's `InetAddress` caches successful lookups indefinitely when a security manager is installed, and for roughly 30 seconds otherwise, unless `networkaddress.cache.ttl` is overridden. Always test failover with real clients.
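These TTL and health-check trade-offs can be sketched in Terraform. A minimal, illustrative primary failover record with a 60-second TTL (FQDN, IPs, and resource names are assumptions):

```hcl
# Hypothetical sketch: health-checked PRIMARY record with a low TTL,
# so resolvers re-query quickly after a failover.
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.api.example.com" # illustrative
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3  # consecutive failures before marking unhealthy
  request_interval  = 30 # seconds between checks
}

resource "aws_route53_record" "api_primary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "api.example.com"
  type            = "A"
  ttl             = 60 # the "typical production value" for active failover
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
  records         = ["203.0.113.10"] # illustrative IP

  failover_routing_policy {
    type = "PRIMARY"
  }
}
```

A matching `SECONDARY` record (same name, different `set_identifier`) completes the pair; Route 53 only returns it when the primary's health check fails.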

Interview scenario — “Design multi-region networking for an e-commerce app with users in UAE, EU, and APAC”

Answer:

1. Regions: me-south-1 (UAE), eu-west-1 (EU), ap-southeast-1 (APAC).
2. Global entry point: Route 53 latency-based routing → regional ALBs. Or Global Accelerator for fast failover + static IPs (if partners need stable endpoints). CloudFront for static assets + API acceleration with caching.
3. Cross-region connectivity: TGW inter-region peering for backend replication traffic.
4. Data strategy: Partition by user region — UAE users' data lives in me-south-1. Product catalog replicated async to all regions (read replicas). Order writes go to the user's home region only. DynamoDB Global Tables for session/cart data (multi-region active-active with last-writer-wins).
5. Failover: Route 53 health checks on ALBs. If the UAE region fails, latency routing redirects UAE users to EU (next lowest latency). RDS read replica in EU is promoted to primary (RPO = replication lag, RTO = promotion time, ~5-10 min).
6. DNS: TTL 60 seconds for API endpoints, 300 seconds for static assets.
7. Cost optimization: Use CloudFront for static content to reduce origin load and inter-region data transfer.


Scenario 1: “Design networking for a company with 20 VPCs, 3 regions, and 2 on-prem DCs”

Answer:

I would design a hub-spoke architecture with Transit Gateway in each region and Direct Connect for on-prem:

Multi-region TGW peering diagram

TGW route table strategy: Each region’s TGW has prod-rt, non-prod-rt, shared-services-rt, and inspection-rt. Prod VPCs cannot reach non-prod VPCs. All internet-bound traffic goes through the regional inspection VPC.
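The prod/non-prod segmentation described above comes down to associating each attachment with its own TGW route table. A minimal Terraform sketch (attachment and resource names are illustrative):

```hcl
# Hypothetical sketch: a dedicated prod route table. Because prod
# attachments associate here and non-prod routes never propagate in,
# prod VPCs cannot reach non-prod VPCs.
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  tags               = { Name = "prod-rt" }
}

resource "aws_ec2_transit_gateway_route_table_association" "prod_app" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.prod_app.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Static default route: all internet-bound traffic from prod exits
# through the regional inspection VPC attachment.
resource "aws_ec2_transit_gateway_route" "prod_default" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```

The same pattern repeats for `non-prod-rt`, `shared-services-rt`, and `inspection-rt` in each region.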

CIDR planning: Use non-overlapping, /12-aligned ranges per region — Region 1 uses 10.16.0.0/12, Region 2 uses 10.32.0.0/12, Region 3 uses 10.48.0.0/12 (valid /12 boundaries that do not overlap). On-prem keeps 172.16.0.0/12, so cloud and on-prem address space never collide and each region can advertise a single summary route.

DNS: Route 53 private hosted zones in each region. Resolver rules shared via RAM for hybrid DNS resolution. On-prem DCs use Route 53 inbound endpoints to resolve AWS private zones.
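The hybrid DNS piece can be sketched in Terraform. An outbound Resolver endpoint plus a forward rule for the on-prem zone, shared via RAM (domain names, subnets, and IPs are illustrative assumptions):

```hcl
# Hypothetical sketch: forward on-prem domain lookups from AWS to
# the data-center DNS servers, and share the rule to workload accounts.
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "hub-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [aws_security_group.resolver.id]

  ip_address {
    subnet_id = aws_subnet.resolver_a.id
  }
  ip_address {
    subnet_id = aws_subnet.resolver_b.id
  }
}

resource "aws_route53_resolver_rule" "onprem" {
  name                 = "onprem-corp"
  domain_name          = "corp.example.internal" # illustrative on-prem zone
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip = "172.16.0.53" # illustrative on-prem DNS server
  }
}

# Share the rule to all workload accounts via RAM.
resource "aws_ram_resource_share" "resolver_rules" {
  name = "resolver-rules"
}

resource "aws_ram_resource_association" "onprem_rule" {
  resource_arn       = aws_route53_resolver_rule.onprem.arn
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}
```

The reverse direction — on-prem resolving AWS private zones — uses an inbound endpoint that the DC resolvers forward to.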


Scenario 2: “Migrate from VPC peering to Transit Gateway with zero downtime”

Answer:

This is a careful, phased migration. The key insight: you can have both VPC peering AND TGW attachment active simultaneously. Routes determine which path traffic takes.

Phase 1 — Preparation (Week 1):

  • Deploy TGW in the Network Hub Account
  • Create route tables (prod-rt, non-prod-rt, inspection-rt)
  • Share TGW via RAM to all workload accounts
  • Do NOT add any routes yet — existing peering continues to work

Phase 2 — Parallel Paths (Week 2):

  • Attach each VPC to TGW (one at a time, during maintenance windows)
  • Add more-specific routes for TEST traffic via TGW
    • Example: add route 10.12.2.0/25 via TGW (more specific than existing 10.12.0.0/16 via peering)
    • Only traffic to that /25 goes via TGW; everything else stays on peering
  • Validate latency, throughput, connectivity

Phase 3 — Cutover (Week 3-4):

  • For each VPC pair, replace peering routes with TGW routes
    • Remove 10.12.0.0/16 → pcx-xxxx from route table
    • Add 10.12.0.0/16 → tgw-xxxx (or use broader 10.0.0.0/8 → tgw-xxxx)
    • Route table updates are atomic and take effect immediately — no downtime
  • Add default route (0.0.0.0/0 → TGW) for centralized egress
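In Terraform, the Phase 3 cutover is a one-attribute swap on the VPC route (IDs illustrative):

```hcl
# Hypothetical sketch of the cutover: same destination CIDR, but the
# next hop changes from the peering connection to the TGW. Route table
# updates apply atomically, so there is no traffic gap.
resource "aws_route" "to_vpc_b" {
  route_table_id         = aws_route_table.vpc_a_private.id
  destination_cidr_block = "10.12.0.0/16"

  # Before: vpc_peering_connection_id = "pcx-..." (illustrative)
  transit_gateway_id = aws_ec2_transit_gateway.main.id
}
```

Rollback is the same edit in reverse — or, faster, re-adding a more-specific peering route, which wins by longest-prefix match.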

Phase 4 — Cleanup (Week 5):

  • Verify all traffic flows via TGW (check VPC Flow Logs)
  • Delete VPC peering connections
  • Remove stale route table entries

Risk mitigation: keep peering connections active for 1 week after cutover as a rollback option. If something breaks, re-add the peering route (more specific wins).


Scenario 3: “How does GCP Shared VPC differ from AWS Transit Gateway? When would you use each?”

Answer:

They solve the same problem (multi-project/account networking) but use fundamentally different approaches:

GCP Shared VPC — one VPC shared across multiple projects:

  • All projects use the SAME VPC network — same IP space, same firewall rules, same routes
  • Central team (host project) controls ALL networking; service projects just deploy workloads
  • No data processing charges for cross-project communication (same VPC)
  • Simpler — fewer moving parts, no route propagation to configure
  • Limitation: less isolation between projects (shared firewall namespace, shared IP space)

AWS Transit Gateway — separate VPCs connected via a hub:

  • Each account has its OWN VPC — full IP space isolation, independent firewall rules (NACLs + SGs)
  • TGW connects them with controlled routing (route tables, propagation)
  • Data processing charges ($0.02/GB for cross-VPC traffic)
  • More complex but more flexible — fine-grained route table segmentation per environment
  • Can insert network inspection (firewall) between VPCs

When to use each:

  • Shared VPC: when the central team wants full network control and workload teams just need compute resources. Works well when all workloads share similar security posture.
  • Transit Gateway: when you need strong isolation between workloads, centralized security inspection, or when different teams need independent networking control within their accounts.
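For contrast with the TGW examples elsewhere on this page, the GCP Shared VPC model is only two Terraform resources (project IDs are illustrative):

```hcl
# Hypothetical sketch: the host project owns the network;
# service projects attach to it and just deploy workloads.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "net-host-project" # illustrative project ID
}

resource "google_compute_shared_vpc_service_project" "gke" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "gke-workloads-project" # illustrative project ID
}
```

The brevity is the point: there are no attachments, route tables, associations, or propagations to manage — and correspondingly fewer isolation knobs.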

In practice at our bank: we use TGW on AWS (every team gets their own VPC and we inspect all traffic centrally) and Shared VPC on GCP (central network host project, GKE clusters in service projects). The models suit each cloud’s strengths.


Scenario 4: “Design cross-cloud connectivity between AWS and GCP for a multi-cloud enterprise”

Answer:

For an enterprise bank running workloads in both AWS and GCP, I would use a partner interconnect solution (Megaport or Equinix Cloud Exchange Fabric) for production traffic, with VPN as backup.

Enterprise hub-spoke network architecture

Routing: BGP on both sides. AWS announces 10.0.0.0/8 (cloud CIDRs). GCP announces its subnet CIDRs. MED/local-preference controls preferred path (Megaport primary, VPN backup).

DNS: AWS Route 53 Resolver outbound endpoint forwards *.gcp.bank.internal to GCP Cloud DNS inbound policy IP. GCP Cloud DNS forwarding zone sends *.aws.bank.internal to Route 53 Resolver inbound endpoint. Both travel over the private link.
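The GCP half of this DNS setup can be sketched as a Cloud DNS private forwarding zone (network reference and endpoint IP are illustrative assumptions):

```hcl
# Hypothetical sketch: forward *.aws.bank.internal queries to the
# Route 53 Resolver inbound endpoint over the private interconnect.
resource "google_dns_managed_zone" "aws_forward" {
  name       = "aws-bank-internal"
  dns_name   = "aws.bank.internal."
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = google_compute_network.shared.id
    }
  }

  forwarding_config {
    target_name_servers {
      ipv4_address    = "10.20.0.10" # illustrative inbound endpoint IP
      forwarding_path = "private"    # stay on VPC/interconnect, never the internet
    }
  }
}
```

`forwarding_path = "private"` is what keeps the lookup on the Megaport link rather than hairpinning over public DNS.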

Security: Traffic between clouds traverses the inspection VPC on AWS (Network Firewall) and Cloud NGFW on GCP. No unfiltered cross-cloud traffic.

Cost: Megaport 1 Gbps port ~$500/month + per-GB egress from each cloud. VPN backup is nearly free (just the tunnel hours). Compare this to Dedicated Interconnect ($1500+/month per port) — partner interconnect is more cost-effective for moderate bandwidth.


Scenario 5: “A team needs private connectivity to an internal API in another team’s VPC. Options?”

Answer:

Four options, in order of preference for enterprise:

1. Transit Gateway (already in place): If both VPCs are attached to TGW and the route tables allow communication, it already works. The calling VPC has a route to the target VPC’s CIDR via TGW. Security group on the target API allows ingress from the caller’s CIDR. No additional infrastructure needed.

  • Pros: zero setup if TGW is configured correctly
  • Cons: $0.02/GB data processing

2. AWS PrivateLink / GCP Private Service Connect: The API team publishes their service via NLB + VPC Endpoint Service (AWS) or Internal LB + Service Attachment (GCP). The consuming team creates a VPC Interface Endpoint / PSC Consumer Endpoint in their VPC.

  • Pros: unidirectional (consumer cannot reach anything else in provider’s VPC), works across accounts and even across AWS organizations, no CIDR overlap issues
  • Cons: API team must set up the endpoint service, additional cost per endpoint
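Option 2 above takes one resource on each side. A minimal Terraform sketch (account IDs, ARNs, and names are illustrative):

```hcl
# Hypothetical sketch — provider side: publish the NLB-fronted API
# as an endpoint service and allow the consumer account.
resource "aws_vpc_endpoint_service" "api" {
  acceptance_required        = true
  network_load_balancer_arns = [aws_lb.api_nlb.arn]
  allowed_principals         = ["arn:aws:iam::111122223333:root"] # illustrative consumer account
}

# Consumer side: an interface endpoint that materializes as a private
# IP inside the consumer's own VPC, pointing only at this service.
resource "aws_vpc_endpoint" "api" {
  vpc_id             = aws_vpc.consumer.id
  service_name       = aws_vpc_endpoint_service.api.service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = [aws_subnet.consumer_a.id]
  security_group_ids = [aws_security_group.api_client.id]
}
```

With `acceptance_required = true`, the provider team explicitly approves each connection request — a useful control point in a bank.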

3. VPC Peering (targeted): Peer the two VPCs directly. Add routes on both sides for each other’s CIDRs.

  • Pros: no data processing charge, low latency
  • Cons: not transitive, bidirectional access (need security groups to restrict), does not work with overlapping CIDRs

4. Service Mesh (application layer): If both teams run on the same Kubernetes cluster or have a service mesh (Istio, Consul), the API is accessible via the mesh’s service discovery — no network-level connectivity changes needed.

  • Pros: application-level routing, mTLS, observability
  • Cons: requires both teams to be on the mesh, more operational overhead

My recommendation for a bank: PrivateLink / PSC if the API is shared widely (many consumers), TGW if it is point-to-point between two known VPCs. PrivateLink is more secure because it is unidirectional — the consumer gets a private IP in their VPC that points to the API, but cannot scan or access anything else in the API team’s VPC.