
VPC & Subnet Design

In our enterprise bank architecture, VPCs are the network boundary for every workload account. The central infrastructure team defines VPC standards — CIDR ranges, subnet tiers, naming conventions, tagging — via reusable Terraform modules. Tenant teams (payments, trading, data platform) consume pre-built VPCs. They never create their own.

VPC context within the AWS organization

Every workload VPC follows the same 3-tier subnet pattern, attaches to Transit Gateway (covered in the Connectivity page), and routes internet-bound traffic through the Network Hub inspection VPC (covered in the Security page).


A Virtual Private Cloud is a logically isolated network within a cloud provider’s infrastructure. It gives you full control over IP addressing, subnets, routing, and network access control.

AWS VPCs are regional. A VPC lives in one region and cannot span regions. Subnets are scoped to a single Availability Zone.

Key properties:

  • CIDR block: primary + up to 4 secondary (e.g., 10.10.0.0/16)
  • Subnets: each in one AZ, gets a subset of the VPC CIDR
  • Implied router: every VPC has a built-in router; you control it via route tables
  • Default vs custom VPC: default VPC exists per region (public subnets, IGW). Enterprise accounts delete it or lock it down via SCP
  • DNS: enableDnsSupport and enableDnsHostnames — both must be true for private DNS resolution
  • Tenancy: default (shared hardware) or dedicated (compliance use cases)

AWS VPC with 3-tier subnets across 3 AZs


Enterprise networks use tiered subnets to enforce network segmentation at the routing level. Our bank uses three tiers across every workload VPC.

Tier    | Purpose                                     | Route to Internet          | Route from Internet | Examples
--------|---------------------------------------------|----------------------------|---------------------|--------------------------------------------
Public  | Resources that need inbound internet access | Yes (IGW / Cloud NAT)      | Yes (via IGW)       | ALB/NLB, bastion hosts, NAT GW
Private | Application workloads                       | Outbound only (via NAT GW) | No                  | EKS/GKE nodes, EC2/GCE app servers, Lambda
Data    | Databases, caches, message queues           | No internet access         | No                  | RDS, ElastiCache, MSK, Memorystore

CIDR planning is one of the most important and most overlooked tasks. Get it wrong and you face overlapping ranges, exhausted IPs, and inability to peer or route between VPCs.

Our Bank’s CIDR Allocation Plan:

Enterprise CIDR Master Plan
============================
10.0.0.0/8 — Cloud allocation (the full RFC 1918 10/8 range)
Environment Allocation (ten /16 blocks reserved per environment):
10.10.0.0 – 10.19.255.255 — Production
10.20.0.0 – 10.29.255.255 — Staging
10.30.0.0 – 10.39.255.255 — Development
10.40.0.0 – 10.49.255.255 — Sandbox
Infrastructure (shared/hub):
10.0.0.0/16 — Network Hub VPC
10.1.0.0/16 — Shared Services VPC
10.2.0.0/16 — Security VPC
Production VPCs (10.10.0.0 – 10.19.255.255):
10.10.0.0/16 — payments-prod (65,534 IPs)
10.11.0.0/16 — trading-prod (65,534 IPs)
10.12.0.0/16 — data-platform-prod (65,534 IPs)
10.13.0.0/16 — mobile-api-prod (65,534 IPs)
...room for 6 more /16 VPCs
On-Premises:
172.16.0.0/12 — Corporate data center (no overlap with cloud)
192.168.0.0/16 — Office networks

Subnet Breakdown for a Single VPC (10.10.0.0/16):

VPC: 10.10.0.0/16 (payments-prod)
AZ-1a (eu-west-1a):
10.10.0.0/24 — public (251 usable IPs)
10.10.1.0/24 — private (251 usable IPs)
10.10.2.0/24 — data (251 usable IPs)
AZ-1b (eu-west-1b):
10.10.10.0/24 — public (251 usable IPs)
10.10.11.0/24 — private (251 usable IPs)
10.10.12.0/24 — data (251 usable IPs)
AZ-1c (eu-west-1c):
10.10.20.0/24 — public (251 usable IPs)
10.10.21.0/24 — private (251 usable IPs)
10.10.22.0/24 — data (251 usable IPs)
Reserved:
10.10.100.0/24 — EKS pod secondary CIDR (if using custom networking)
10.10.200.0/24 — future expansion

Why /24 per subnet? For most enterprise workloads, 251 IPs per subnet per AZ is sufficient. EKS worker nodes need one primary IP each, and pod IPs come from secondary CIDRs (VPC CNI custom networking) or overlay networks. If you expect 500+ nodes in a single AZ, use /23 or /22.
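The subnet plan above can be derived in Terraform rather than hard-coded. A hypothetical sketch using the built-in cidrsubnet() function — cidrsubnet(prefix, newbits, netnum) adds newbits to the mask (/16 + 8 = /24) and netnum selects which block:

```hcl
locals {
  vpc_cidr = "10.10.0.0/16"

  # Tier offsets follow the plan above: public = x.x.N0.0/24,
  # private = x.x.N1.0/24, data = x.x.N2.0/24 for AZ index N.
  public_subnets  = [for az in range(3) : cidrsubnet(local.vpc_cidr, 8, az * 10)]
  private_subnets = [for az in range(3) : cidrsubnet(local.vpc_cidr, 8, az * 10 + 1)]
  data_subnets    = [for az in range(3) : cidrsubnet(local.vpc_cidr, 8, az * 10 + 2)]
}

# public_subnets = ["10.10.0.0/24", "10.10.10.0/24", "10.10.20.0/24"]
# Each /24 holds 256 addresses; AWS reserves 5 per subnet (network address,
# VPC router, DNS, future use, broadcast), leaving 251 usable.
```

Deriving subnets this way keeps the tier offsets consistent even if the VPC CIDR changes.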

AWS VPC IPAM:

AWS VPC IPAM (IP Address Manager) lets you centrally manage and allocate CIDR blocks across accounts. The central infra team creates IPAM pools and delegates allocation to workload accounts — preventing overlaps.
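A hypothetical sketch of how the master plan above could be modeled as IPAM pools — the top-level pool holds 10.0.0.0/8 and a per-environment child pool hands out /16s (resource names and the region are illustrative):

```hcl
resource "aws_vpc_ipam" "main" {
  operating_regions {
    region_name = "eu-west-1"
  }
}

# Top-level pool covering the entire cloud allocation
resource "aws_vpc_ipam_pool" "root" {
  address_family = "ipv4"
  ipam_scope_id  = aws_vpc_ipam.main.private_default_scope_id
}

resource "aws_vpc_ipam_pool_cidr" "root" {
  ipam_pool_id = aws_vpc_ipam_pool.root.id
  cidr         = "10.0.0.0/8"
}

# Child pool for production, carved from the root pool
resource "aws_vpc_ipam_pool" "prod" {
  address_family      = "ipv4"
  ipam_scope_id       = aws_vpc_ipam.main.private_default_scope_id
  source_ipam_pool_id = aws_vpc_ipam_pool.root.id
  locale              = "eu-west-1"
}

# A workload VPC can then request a /16 instead of hard-coding one:
# resource "aws_vpc" "payments" {
#   ipv4_ipam_pool_id   = aws_vpc_ipam_pool.prod.id
#   ipv4_netmask_length = 16
# }
```

IPAM refuses allocations that would overlap an existing one, which is what prevents two teams from claiming the same /16.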

AWS VPC IPAM pool hierarchy


Route tables determine where network traffic is directed. Every subnet must be associated with exactly one route table.

AWS has a main route table (default for unassociated subnets) and custom route tables. Best practice: never use the main route table; create explicit ones per tier.

Public subnet route table:

Destination   | Target      | Purpose
--------------|-------------|---------------------------------------
10.10.0.0/16  | local       | Traffic within the VPC
10.0.0.0/8    | tgw-xxxxxxx | All cloud traffic via Transit Gateway
172.16.0.0/12 | tgw-xxxxxxx | On-prem via TGW → Direct Connect
0.0.0.0/0     | tgw-xxxxxxx | Internet via TGW → Network Hub NAT

Private subnet route table:

Destination   | Target      | Purpose
--------------|-------------|---------------------------------------
10.10.0.0/16  | local       | Within VPC
10.0.0.0/8    | tgw-xxxxxxx | Cross-VPC via TGW
172.16.0.0/12 | tgw-xxxxxxx | On-prem via TGW
0.0.0.0/0     | tgw-xxxxxxx | Internet via Network Hub inspection
pl-xxxxxxxx   | vpce-s3     | S3 via gateway endpoint (prefix list)

Data subnet route table:

Destination  | Target      | Purpose
-------------|-------------|----------------------------------
10.10.0.0/16 | local       | Within VPC only
10.0.0.0/8   | tgw-xxxxxxx | Cross-VPC (for replication, etc.)

Private subnets need outbound internet access (package updates, API calls, pulling container images). NAT translates private IPs to public IPs for outbound traffic.

AWS NAT Gateway is a managed, zonal resource. You deploy one per AZ for high availability.

Key characteristics:

  • Zonal: deploy in each AZ where you have private subnets
  • Elastic IP: each NAT GW gets a static public IP (Elastic IP)
  • Bandwidth: up to 100 Gbps per NAT GW (auto-scales)
  • Cost: $0.045/hr + $0.045/GB processed (can be expensive at scale)
  • No security group: NAT GW does not have a security group attached
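For reference, the classic per-AZ NAT pattern looks like the sketch below (our workload VPCs do not use it — they egress via the Network Hub — so the subnet and route table names here are purely illustrative):

```hcl
# One Elastic IP and one NAT gateway per AZ
resource "aws_eip" "nat" {
  count  = 3
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count         = 3
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id # NAT GW lives in a public subnet
}

# Each AZ's private route table points 0.0.0.0/0 at its own NAT GW,
# so an AZ failure doesn't take down egress in the surviving AZs.
resource "aws_route" "private_nat" {
  count                  = 3
  route_table_id         = aws_route_table.private_az[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this[count.index].id
}
```

Note this requires a per-AZ private route table, unlike the single shared route table used in the module later on this page.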

AWS NAT Gateway per AZ

Enterprise pattern: In our bank, workload VPCs do NOT have their own NAT GW. Internet traffic routes via TGW to the Network Hub inspection VPC, which has centralized NAT GW + Network Firewall. This means:

  • Single point of egress control and logging
  • All outbound traffic is inspected by IPS/IDS rules
  • Fewer Elastic IPs to manage and allowlist with third-party APIs

Accessing AWS/GCP services (S3, DynamoDB, Container Registry, Cloud Storage) from private subnets normally requires going through NAT → internet → service. VPC endpoints and Private Service Connect provide private, direct connectivity — no internet traversal, lower latency, lower cost.

Gateway Endpoints (free, S3 and DynamoDB only):

  • Adds a route in your route table pointing to the service via a prefix list
  • No ENI, no DNS change — just a route
  • Free: no hourly or data processing charges

Interface Endpoints (powered by AWS PrivateLink):

  • Creates an ENI in your subnet with a private IP
  • DNS resolves the service endpoint to the private IP (via private hosted zone)
  • Works for 100+ AWS services: ECR, CloudWatch, SSM, STS, KMS, Secrets Manager, etc.
  • Cost: ~$0.01/hr per AZ + $0.01/GB processed
  • Requires security group configuration

Gateway Load Balancer Endpoints (for appliances):

  • Used to route traffic to third-party security appliances (firewalls, IDS)
  • Works with AWS Network Firewall under the hood

Enterprise VPC endpoints — gateway and interface


DNS is the backbone of service discovery, hybrid connectivity, and multi-account architecture. In our enterprise bank, the Network Hub Account owns all DNS infrastructure.

Public Hosted Zones:

  • Internet-facing DNS records (e.g., api.bank.com → ALB)
  • Supports alias records to AWS resources (ALB, CloudFront, S3) — free queries, no TTL issues

Private Hosted Zones:

  • Only resolvable within associated VPCs
  • Use for internal service discovery: payments.internal.bank.com
  • Can associate with VPCs in OTHER accounts (cross-account DNS)

Split-Horizon DNS:

  • Same domain name, different answers depending on where the query comes from
  • Public zone: api.bank.com → 52.x.x.x (internet users)
  • Private zone: api.bank.com → 10.10.1.50 (internal users hit internal ALB)
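Split-horizon is implemented by creating two zones with the same name; which answer a client receives depends on whether it resolves through an associated VPC. A minimal sketch (the VPC reference and IPs are placeholders):

```hcl
resource "aws_route53_zone" "public" {
  name = "bank.com"
}

# Same name, but private — associated VPCs resolve against this zone instead
resource "aws_route53_zone" "internal" {
  name = "bank.com"
  vpc {
    vpc_id = aws_vpc.payments.id # illustrative VPC reference
  }
}

# Internet clients get the public answer...
resource "aws_route53_record" "api_public" {
  zone_id = aws_route53_zone.public.zone_id
  name    = "api.bank.com"
  type    = "A"
  ttl     = 300
  records = ["52.0.2.10"] # placeholder public IP
}

# ...clients inside associated VPCs get the internal ALB's private IP
resource "aws_route53_record" "api_internal" {
  zone_id = aws_route53_zone.internal.zone_id
  name    = "api.bank.com"
  type    = "A"
  ttl     = 60
  records = ["10.10.1.50"]
}
```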

Route 53 Resolver:

  • Inbound endpoints: allow on-prem DNS servers to resolve AWS private hosted zones (on-prem → AWS)
  • Outbound endpoints: allow VPCs to resolve on-prem DNS domains (AWS → on-prem)
  • Resolver rules: forward queries for corp.bank.internal to on-prem DNS servers
  • Rules can be shared across accounts via AWS RAM

AWS Route 53 private zones and resolver


Load balancers are the front door to every application. Choosing the right type — L4 vs L7, regional vs global, internal vs external — is a critical architecture decision.

Application Load Balancer (ALB) — Layer 7

  • Protocol: HTTP, HTTPS, gRPC, WebSocket
  • Routing: path-based (/api/*), host-based (api.bank.com), header-based, query-string-based
  • Targets: EC2 instances, IP addresses, Lambda functions, EKS pods (IP mode)
  • SSL termination: yes, with ACM certificates
  • WAF integration: attach AWS WAF Web ACL directly
  • Authentication: built-in OIDC/Cognito authentication on the ALB
  • Cross-zone LB: always enabled, at no extra charge
  • Scope: regional — one ALB per region

When to use: web applications, APIs, microservices, anything HTTP/HTTPS. This is your default choice.
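The L7 routing features above are configured as listener rules. A sketch combining host- and path-based conditions on one rule (the listener and target group references are assumed to exist elsewhere):

```hcl
resource "aws_lb_listener_rule" "payments_api" {
  listener_arn = aws_lb_listener.https.arn # assumed existing HTTPS (443) listener
  priority     = 10

  # Both conditions must match: Host header AND path prefix
  condition {
    host_header {
      values = ["api.bank.com"]
    }
  }

  condition {
    path_pattern {
      values = ["/payments/*"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.payments.arn
  }
}
```

Lower priority numbers are evaluated first; unmatched requests fall through to the listener's default action.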

Network Load Balancer (NLB) — Layer 4

  • Protocol: TCP, UDP, TLS
  • Routing: port-based only (no content inspection)
  • Static IPs: each NLB gets one static IP per AZ (or Elastic IP)
  • Performance: millions of requests/sec, ultra-low latency (adds ~100 µs)
  • Preserve source IP: yes, at the TCP level (ALB rewrites the source IP and passes the client IP in the X-Forwarded-For header instead)
  • Targets: EC2 instances, IP addresses, ALB (NLB → ALB pattern for static IPs + L7 routing)
  • PrivateLink: expose services via NLB + VPC endpoint service
  • Cross-zone LB: disabled by default (enable for even distribution)

When to use: TCP services (databases, MQTT, gaming), extreme performance needs, static IPs required, PrivateLink, TLS passthrough.
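The PrivateLink bullet above combines an internal NLB with a VPC endpoint service. A hedged sketch (the subnets and consumer account ARN are placeholders):

```hcl
# Internal NLB fronting the service to be shared
resource "aws_lb" "svc" {
  name               = "payments-privatelink-nlb"
  load_balancer_type = "network"
  internal           = true
  subnets            = aws_subnet.private[*].id # assumed existing private subnets
}

# Expose the NLB as an endpoint service that consumer VPCs connect to
resource "aws_vpc_endpoint_service" "payments" {
  acceptance_required        = true # operator approves each consumer connection
  network_load_balancer_arns = [aws_lb.svc.arn]
  allowed_principals         = ["arn:aws:iam::123456789012:root"] # placeholder consumer account
}
```

Consumers then create an interface endpoint against this service — traffic never leaves the AWS network and no CIDR coordination is needed between producer and consumer VPCs.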

Gateway Load Balancer (GWLB)

  • Purpose: route traffic to virtual appliances (firewalls, IDS/IPS)
  • Protocol: IP (all traffic, all ports)
  • How it works: GENEVE encapsulation to appliance, traffic returns via same GWLB
  • Used by: AWS Network Firewall (under the hood), third-party firewalls (Palo Alto, Fortinet)

When to use: centralized network inspection architectures (Network Hub VPC).

AWS Load Balancer decision tree

ALB vs NLB vs GCP Global LB — Quick Comparison

Feature         | AWS ALB                     | AWS NLB                | GCP Global HTTP(S) LB
----------------|-----------------------------|------------------------|------------------------
Layer           | 7 (HTTP)                    | 4 (TCP/UDP)            | 7 (HTTP)
Scope           | Regional                    | Regional               | Global (anycast)
Static IP       | No (use Global Accelerator) | Yes                    | Yes (anycast)
Path routing    | Yes                         | No                     | Yes (URL maps)
WAF             | Yes (AWS WAF)               | No                     | Yes (Cloud Armor)
WebSocket       | Yes                         | Yes (TCP)              | Yes
SSL termination | Yes                         | Optional (TLS)         | Yes
Multi-region    | No (need one per region)    | No                     | Yes (native)
PrivateLink     | No                          | Yes (endpoint service) | PSC (consumer endpoint)

# modules/vpc/main.tf — Enterprise VPC Module
# Deploys a 3-tier VPC across 3 AZs with TGW attachment
variable "vpc_name" {
  description = "Name of the VPC (e.g., payments-prod)"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC (e.g., 10.10.0.0/16)"
  type        = string
}

variable "azs" {
  description = "List of availability zones"
  type        = list(string)
  default     = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
}

variable "public_subnets" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
}

variable "private_subnets" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
}

variable "data_subnets" {
  description = "CIDR blocks for data subnets"
  type        = list(string)
}

variable "transit_gateway_id" {
  description = "Transit Gateway ID for hub-spoke attachment"
  type        = string
}

variable "enable_vpc_endpoints" {
  description = "Deploy standard VPC endpoints (S3, ECR, CloudWatch, etc.)"
  type        = bool
  default     = true
}
# ─── VPC ────────────────────────────────────────────
resource "aws_vpc" "this" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = var.vpc_name
    # Take the last hyphen-separated token, e.g. "prod" from "payments-prod".
    # (split(...)[1] would break on names like "data-platform-prod".)
    Environment = element(split("-", var.vpc_name), length(split("-", var.vpc_name)) - 1)
    ManagedBy   = "terraform"
    Module      = "enterprise-vpc"
  }
}
# ─── Subnets ────────────────────────────────────────
resource "aws_subnet" "public" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  cidr_block        = var.public_subnets[count.index]
  availability_zone = var.azs[count.index]

  tags = {
    Name                     = "${var.vpc_name}-public-${var.azs[count.index]}"
    Tier                     = "public"
    "kubernetes.io/role/elb" = "1" # For ALB Ingress Controller
  }
}

resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = var.azs[count.index]

  tags = {
    Name                              = "${var.vpc_name}-private-${var.azs[count.index]}"
    Tier                              = "private"
    "kubernetes.io/role/internal-elb" = "1" # For internal ALB
  }
}

resource "aws_subnet" "data" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  cidr_block        = var.data_subnets[count.index]
  availability_zone = var.azs[count.index]

  tags = {
    Name = "${var.vpc_name}-data-${var.azs[count.index]}"
    Tier = "data"
  }
}
# ─── Route Tables ───────────────────────────────────
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id
  tags   = { Name = "${var.vpc_name}-public-rt" }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.this.id
  tags   = { Name = "${var.vpc_name}-private-rt" }
}

resource "aws_route_table" "data" {
  vpc_id = aws_vpc.this.id
  tags   = { Name = "${var.vpc_name}-data-rt" }
}
# All internet-bound traffic → Transit Gateway (→ Network Hub for inspection)
resource "aws_route" "public_default" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = var.transit_gateway_id
}

resource "aws_route" "private_default" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  transit_gateway_id     = var.transit_gateway_id
}

# Cross-VPC traffic → Transit Gateway
resource "aws_route" "public_cross_vpc" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}

resource "aws_route" "private_cross_vpc" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}

resource "aws_route" "data_cross_vpc" {
  route_table_id         = aws_route_table.data.id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}

# No default route for data subnets — intentionally isolated from the internet
# ─── Route Table Associations ───────────────────────
resource "aws_route_table_association" "public" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

resource "aws_route_table_association" "data" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.data[count.index].id
  route_table_id = aws_route_table.data.id
}
# ─── Transit Gateway Attachment ─────────────────────
resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = aws_vpc.this.id
  subnet_ids         = aws_subnet.private[*].id # Attach via private subnets

  transit_gateway_default_route_table_association = false
  transit_gateway_default_route_table_propagation = false

  tags = { Name = "${var.vpc_name}-tgw-attachment" }
}
# ─── VPC Endpoints (Gateway) ────────────────────────
resource "aws_vpc_endpoint" "s3" {
  count             = var.enable_vpc_endpoints ? 1 : 0
  vpc_id            = aws_vpc.this.id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = [
    aws_route_table.private.id,
    aws_route_table.data.id,
  ]

  tags = { Name = "${var.vpc_name}-s3-endpoint" }
}
# ─── VPC Endpoints (Interface) ──────────────────────
resource "aws_security_group" "vpc_endpoints" {
  count  = var.enable_vpc_endpoints ? 1 : 0
  name   = "${var.vpc_name}-vpce-sg"
  vpc_id = aws_vpc.this.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
    description = "HTTPS from VPC"
  }

  tags = { Name = "${var.vpc_name}-vpce-sg" }
}

locals {
  interface_endpoints = var.enable_vpc_endpoints ? [
    "ecr.api", "ecr.dkr", "sts", "logs",
    "monitoring", "ssm", "kms", "secretsmanager"
  ] : []
}
resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(local.interface_endpoints)
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${data.aws_region.current.name}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints[0].id]
  private_dns_enabled = true

  tags = { Name = "${var.vpc_name}-${each.value}-endpoint" }
}

data "aws_region" "current" {}
# ─── Outputs ────────────────────────────────────────
output "vpc_id" { value = aws_vpc.this.id }
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
output "data_subnet_ids" { value = aws_subnet.data[*].id }
output "tgw_attachment_id" { value = aws_ec2_transit_gateway_vpc_attachment.this.id }

Usage:

module "payments_vpc" {
  source = "../modules/vpc"

  vpc_name           = "payments-prod"
  vpc_cidr           = "10.10.0.0/16"
  azs                = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  public_subnets     = ["10.10.0.0/24", "10.10.10.0/24", "10.10.20.0/24"]
  private_subnets    = ["10.10.1.0/24", "10.10.11.0/24", "10.10.21.0/24"]
  data_subnets       = ["10.10.2.0/24", "10.10.12.0/24", "10.10.22.0/24"]
  transit_gateway_id = data.aws_ec2_transit_gateway.hub.id
}

DNS is the first thing that happens when a user connects to your application. It is also the most fragile — misconfigured DNS can take down an entire application even when all infrastructure is healthy. Enterprise DNS architecture must handle hybrid resolution (cloud + on-prem), multi-region routing, failover, and compliance requirements.

Route 53 Routing Policies — All 7 Explained


Route 53 provides seven routing policies. Each serves a different use case. Understanding when to use each is critical for multi-region architecture design.

Policy       | How It Works                                                                                                            | Use Case                                          | Example
-------------|-------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|----------------------------------------------------
Simple       | Returns one or more values; if multiple, the client picks randomly. No health checks on multi-value simple records.     | Single-region, basic setup                        | api.example.com → 52.1.2.3
Weighted     | Distributes traffic by weight (0-255). Weight = 0 stops traffic.                                                        | A/B testing, canary, gradual migration            | 90% to v1 ALB, 10% to v2 ALB
Latency      | Routes to the region with the lowest latency from the user's resolver location.                                         | Multi-region active-active                        | UAE users → me-south-1, EU users → eu-west-1
Failover     | Active-passive. Primary record used while healthy; secondary when the primary fails its health check.                   | Disaster recovery                                 | Primary: UAE ALB, Failover: EU ALB
Geolocation  | Routes by the user's country, continent, or a "default" record. Most specific match wins.                               | Compliance (data residency), content localization | EU users → EU region (GDPR), US users → US region
Geoproximity | Routes by geographic distance plus a configurable bias that shifts the "boundary" between regions. Requires Traffic Flow. | Fine-tuned geographic routing                     | Bias UAE region by +50 to capture nearby countries
Multivalue   | Returns up to 8 healthy IP addresses; the client load-balances across the returned set.                                 | Simple load distribution with health checks       | Return 8 healthy IPs from a pool of 12

Key differences that interviewers test:

  • Weighted vs Latency: Weighted gives you explicit control (90/10 split). Latency is automatic based on network measurements. Use weighted for controlled rollouts; latency for best user experience.
  • Geolocation vs Geoproximity: Geolocation routes by political boundaries (country/continent). Geoproximity routes by physical distance with adjustable bias. Geolocation is binary (country X → region Y); geoproximity is gradient (closer = more likely).
  • Failover vs Multivalue: Failover is active-passive (one primary, one secondary). Multivalue is active-active (up to 8 healthy records returned). Use failover for DR; multivalue for simple load distribution.
  • Simple vs Multivalue: Both can return multiple IPs, but multivalue supports health checks per record and removes unhealthy IPs from responses. Simple returns all values regardless of health.
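A weighted 90/10 canary (the most common interview follow-up) looks like the sketch below — two records with the same name, distinguished by set_identifier. The zone and ALB references are assumed to exist elsewhere:

```hcl
resource "aws_route53_record" "api_v1" {
  zone_id        = aws_route53_zone.public.zone_id # assumed existing zone
  name           = "api.bank.com"
  type           = "A"
  set_identifier = "v1"

  weighted_routing_policy {
    weight = 90 # weights are relative, 0-255
  }

  alias {
    name                   = aws_lb.v1.dns_name
    zone_id                = aws_lb.v1.zone_id
    evaluate_target_health = true # drop this record if the ALB is unhealthy
  }
}

resource "aws_route53_record" "api_v2" {
  zone_id        = aws_route53_zone.public.zone_id
  name           = "api.bank.com"
  type           = "A"
  set_identifier = "v2"

  weighted_routing_policy {
    weight = 10
  }

  alias {
    name                   = aws_lb.v2.dns_name
    zone_id                = aws_lb.v2.zone_id
    evaluate_target_health = true
  }
}
```

Shifting the canary forward is just a weight change (e.g., 50/50, then 0/255), applied without touching the load balancers.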

GCP Cloud DNS provides authoritative DNS hosting with additional features for enterprise hybrid architectures:

  • Public zones: authoritative DNS for internet-facing domains. Anycast DNS servers for low-latency resolution globally.
  • Private zones: DNS names visible only within specified VPC networks. Used for internal service discovery (payments.internal.example.com).
  • Response policies (DNS firewall): Override DNS responses for specific domains. Use cases: block malicious domains at the DNS layer, redirect internal service names, enforce split-horizon DNS. Rules can return NXDOMAIN (block), return a different IP (redirect), or pass through.
  • DNS peering: Resolve names from another VPC’s private DNS zones without forwarding. Cross-project DNS resolution in Shared VPC architectures. No data leaves Google’s network.
  • Forwarding zones: Forward DNS queries for specific domains to external DNS servers (typically on-prem AD/BIND). Queries are forwarded via Cloud Interconnect or VPN (private path), NOT over the internet.

This is one of the most common enterprise DNS patterns — resolving names across cloud and on-premises environments.

Hybrid DNS Architecture

How it works:

  1. Cloud resolves on-prem names — An application in AWS needs to resolve ldap.corp.internal. Route 53 Resolver checks its forwarding rules, finds that corp.internal should be forwarded to 10.0.0.53 (on-prem AD DNS). The query goes out through the Outbound Endpoint ENI, over Direct Connect to on-prem, gets resolved, and the answer comes back.

  2. On-prem resolves cloud names — An on-prem server needs to resolve api.payments.aws.internal. The on-prem DNS server has a conditional forwarder pointing aws.internal to the Route 53 Resolver Inbound Endpoint IPs (10.20.1.10, 10.20.2.10). The query comes over Direct Connect to the Inbound Endpoint, Route 53 resolves it from the private hosted zone, and returns the answer.

# Security group for DNS endpoints (UDP/TCP 53)
resource "aws_security_group" "dns_resolver" {
  name        = "dns-resolver-endpoints"
  description = "Allow DNS queries to/from resolver endpoints"
  vpc_id      = aws_vpc.hub.id

  ingress {
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = ["10.0.0.0/8", "172.16.0.0/12"] # All VPC CIDRs + on-prem corporate range
  }

  ingress {
    from_port   = 53
    to_port     = 53
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8", "172.16.0.0/12"]
  }

  egress {
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = ["10.0.0.0/8", "172.16.0.0/12"]
  }

  egress {
    from_port   = 53
    to_port     = 53
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8", "172.16.0.0/12"]
  }
}
# Inbound endpoint — on-prem forwards cloud DNS queries here
resource "aws_route53_resolver_endpoint" "inbound" {
  name               = "hybrid-dns-inbound"
  direction          = "INBOUND"
  security_group_ids = [aws_security_group.dns_resolver.id]

  ip_address {
    subnet_id = aws_subnet.hub_private_a.id
    ip        = "10.20.1.10"
  }

  ip_address {
    subnet_id = aws_subnet.hub_private_b.id
    ip        = "10.20.2.10"
  }
}
# Outbound endpoint — cloud forwards on-prem DNS queries here
resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "hybrid-dns-outbound"
  direction          = "OUTBOUND"
  security_group_ids = [aws_security_group.dns_resolver.id]

  ip_address {
    subnet_id = aws_subnet.hub_private_a.id
  }

  ip_address {
    subnet_id = aws_subnet.hub_private_b.id
  }
}

# Forwarding rule — send corp.internal queries to on-prem DNS
resource "aws_route53_resolver_rule" "forward_corp" {
  domain_name          = "corp.internal"
  name                 = "forward-corp-internal"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip   = "10.0.0.53"
    port = 53
  }

  target_ip {
    ip   = "10.0.0.54"
    port = 53
  }
}
# Share forwarding rule with all VPCs via RAM
resource "aws_ram_resource_share" "dns_rules" {
  name                      = "dns-forwarding-rules"
  allow_external_principals = false # share only within the AWS Organization
}

resource "aws_ram_resource_association" "dns_rule" {
  resource_arn       = aws_route53_resolver_rule.forward_corp.arn
  resource_share_arn = aws_ram_resource_share.dns_rules.arn
}

# Associate rule with workload VPCs
resource "aws_route53_resolver_rule_association" "workload" {
  resolver_rule_id = aws_route53_resolver_rule.forward_corp.id
  vpc_id           = aws_vpc.workload.id
}

# Private hosted zone for cloud-internal names
resource "aws_route53_zone" "cloud_internal" {
  name = "aws.internal"

  vpc {
    vpc_id = aws_vpc.hub.id
  }
}

Interview — “Design DNS architecture for a hybrid environment where some services are on-prem and others in AWS/GCP”

Answer:

  1. DNS zones: On-prem owns corp.internal (Active Directory DNS). AWS owns aws.internal (Route 53 private hosted zone). GCP owns gcp.internal (Cloud DNS private zone). Public-facing: Route 53 or Cloud DNS for example.com.
  2. Cross-resolution: On-prem DNS has conditional forwarders — aws.internal → Route 53 Inbound Endpoint IPs, gcp.internal → Cloud DNS inbound forwarding IPs. AWS has a Resolver Outbound Endpoint with a forwarding rule — corp.internal → on-prem DNS IPs via Direct Connect. GCP has a forwarding zone — corp.internal → on-prem DNS IPs via Cloud Interconnect (private path).
  3. Cross-cloud DNS: AWS and GCP resolve each other’s domains via on-prem DNS as a hub (simplest) or via direct forwarding (GCP forwarding zone → Route 53 Inbound Endpoint).
  4. Split-horizon DNS: api.example.com resolves to a public IP from the internet and a private IP from within the VPC/on-prem. Implemented via a Route 53 private hosted zone (overrides the public zone for associated VPCs).
  5. DNS security: DNSSEC on public zones. Response policies on GCP for DNS-layer malware blocking. Route 53 Resolver DNS Firewall for blocking known-bad domains from VPC workloads.
  6. Sharing across accounts: RAM (Resource Access Manager) shares Route 53 Resolver rules across all workload accounts so every VPC can resolve on-prem names.


You cannot secure or troubleshoot what you cannot see. Network observability provides visibility into traffic patterns, security events, and connectivity issues across your cloud and hybrid infrastructure.

VPC Flow Logs capture IP traffic metadata for network interfaces, subnets, or entire VPCs. They record source/destination IP, port, protocol, action (ACCEPT/REJECT), packets, and bytes — but NOT packet payloads.

Destinations:

  • CloudWatch Logs — real-time queries, dashboards, metric filters. Good for alerting (e.g., alert on rejected traffic spikes). Expensive at scale.
  • S3 — cost-effective long-term storage. Query with Athena (SQL over S3). Best for compliance retention (store 1+ year of flow logs).
  • Kinesis Data Firehose — streaming to SIEM (Splunk, Datadog, Elastic). Real-time security analytics.

Custom log format — select only the fields you need to reduce cost and noise:

Default fields:
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
Useful additional fields:
vpc-id subnet-id instance-id type pkt-srcaddr pkt-dstaddr region az-id traffic-path flow-direction

Transit Gateway Flow Logs — capture traffic crossing TGW attachments. Essential for understanding cross-VPC traffic patterns and identifying unexpected lateral movement.

Unlike Flow Logs (metadata only), Traffic Mirroring captures FULL PACKETS — headers AND payloads. This is critical for:

  • Security forensics — reconstruct attack sequences, analyze malware payloads
  • Compliance auditing — prove what data was transmitted (PCI DSS, SOC 2)
  • Application debugging — inspect actual HTTP request/response bodies
  • IDS/IPS — feed mirrored traffic to an intrusion detection appliance (Suricata, Zeek)

Architecture:

Traffic Mirroring Architecture

Mirror traffic from specific ENIs (or all ENIs in a subnet) to a target NLB. Apply mirror filter sessions to capture only specific traffic (e.g., only TCP 443, only traffic to specific CIDRs).

AWS Reachability Analyzer tests connectivity between two endpoints WITHOUT sending actual traffic. It analyzes the network configuration (route tables, security groups, NACLs, VPC peering, TGW routes) and tells you whether traffic CAN reach the destination — and if not, which configuration is blocking it.

Use cases:

  • “Can my EKS pod reach this RDS instance?” — answer without sending a packet
  • Pre-deployment validation — verify connectivity before deploying an application
  • Compliance audits — prove that sensitive resources are NOT reachable from untrusted networks
  • Troubleshooting — identify exactly which security group or route table is blocking traffic
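In Terraform, a reachability check is a path plus an analysis run. A sketch (the ENI references are illustrative; no traffic is sent):

```hcl
# Define the path to test: EKS node ENI → RDS ENI on the Postgres port
resource "aws_ec2_network_insights_path" "pod_to_rds" {
  source           = aws_network_interface.eks_node.id # illustrative ENI
  destination      = aws_network_interface.rds.id
  destination_port = 5432
  protocol         = "tcp"
}

# Run the analysis against the path; re-running re-evaluates current config
resource "aws_ec2_network_insights_analysis" "pod_to_rds" {
  network_insights_path_id = aws_ec2_network_insights_path.pod_to_rds.id
  # On failure, the analysis output names the blocking component
  # (security group, NACL, or route table) in its explanations.
}
```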

Terraform — Flow Logs and Traffic Mirroring

# VPC Flow Logs to S3 (cost-effective, query with Athena)
resource "aws_flow_log" "vpc" {
  vpc_id               = aws_vpc.main.id
  log_destination      = aws_s3_bucket.flow_logs.arn
  log_destination_type = "s3"
  traffic_type         = "ALL" # ACCEPT, REJECT, or ALL

  log_format = "$${version} $${account-id} $${interface-id} $${srcaddr} $${dstaddr} $${srcport} $${dstport} $${protocol} $${packets} $${bytes} $${start} $${end} $${action} $${log-status} $${vpc-id} $${subnet-id} $${flow-direction}"

  max_aggregation_interval = 60 # 60 seconds (or 600 for lower cost)

  tags = {
    Name = "vpc-flow-logs"
  }
}

# S3 bucket for flow logs (with lifecycle for cost management)
resource "aws_s3_bucket" "flow_logs" {
  bucket = "company-vpc-flow-logs-${data.aws_caller_identity.current.account_id}"
}

resource "aws_s3_bucket_lifecycle_configuration" "flow_logs" {
  bucket = aws_s3_bucket.flow_logs.id

  rule {
    id     = "archive-and-expire"
    status = "Enabled"

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
# Transit Gateway Flow Logs
resource "aws_flow_log" "tgw" {
  transit_gateway_id       = aws_ec2_transit_gateway.hub.id
  log_destination          = aws_s3_bucket.flow_logs.arn
  log_destination_type     = "s3"
  traffic_type             = "ALL"
  max_aggregation_interval = 60
}

# Traffic Mirroring — full packet capture
resource "aws_ec2_traffic_mirror_target" "ids" {
  description               = "IDS appliance behind NLB"
  network_load_balancer_arn = aws_lb.ids_nlb.arn
}

resource "aws_ec2_traffic_mirror_filter" "sensitive" {
  description = "Mirror only traffic to sensitive subnets"
}

resource "aws_ec2_traffic_mirror_filter_rule" "capture_db" {
  traffic_mirror_filter_id = aws_ec2_traffic_mirror_filter.sensitive.id
  description              = "Capture traffic to database subnet"
  rule_number              = 100
  rule_action              = "accept"
  destination_cidr_block   = "10.10.32.0/20" # Database subnet CIDR
  source_cidr_block        = "0.0.0.0/0"
  traffic_direction        = "ingress"
  protocol                 = 6 # TCP
}

resource "aws_ec2_traffic_mirror_session" "eks_to_ids" {
  description              = "Mirror EKS node traffic to IDS"
  network_interface_id     = aws_instance.eks_node.primary_network_interface_id
  traffic_mirror_filter_id = aws_ec2_traffic_mirror_filter.sensitive.id
  traffic_mirror_target_id = aws_ec2_traffic_mirror_target.ids.id
  session_number           = 1
}

Interview — “A pod in EKS can’t reach an RDS instance in a different VPC. Walk through debugging with network tools.”

Answer:

  1. Verify the basics — Is the RDS instance in a different VPC? If yes, is there a Transit Gateway attachment or VPC peering between the two VPCs? Check TGW route tables for the RDS VPC CIDR.
  2. Reachability Analyzer — Run a reachability analysis from the EKS node ENI to the RDS endpoint IP on port 5432. This immediately tells you whether the issue is routing, security groups, NACLs, or TGW route tables — without sending traffic.
  3. Security groups — Check the RDS security group: does it allow inbound TCP 5432 from the EKS node security group ID or CIDR? Since the VPCs are connected via TGW, SG ID references do NOT work — you must use CIDRs. This is a common mistake.
  4. Route tables — Check the EKS subnet route table: is there a route for the RDS VPC CIDR pointing to the TGW? Check the RDS subnet route table: is there a route back to the EKS VPC CIDR via the TGW?
  5. NACLs — Check both subnets’ NACLs for deny rules. Remember NACLs are stateless — you need both inbound (5432 on the RDS subnet) and outbound (ephemeral ports on the RDS subnet) rules.
  6. DNS — Is the pod resolving the RDS endpoint hostname to the correct private IP? If using a private hosted zone, is it associated with the EKS VPC? Try nslookup from the pod.
  7. VPC Flow Logs — Enable flow logs on both the EKS node ENI and the RDS ENI. Look for REJECT entries on port 5432; the rejected entry shows which ENI rejected the traffic.
  8. TGW Flow Logs — Check whether traffic is even crossing the TGW. If there are no TGW flow log entries, the traffic is not leaving the EKS VPC (routing issue).


IPv6 adoption in enterprise cloud is accelerating, driven by IPv4 address exhaustion, mobile carrier networks (which increasingly use IPv6-only with NAT64), and government mandates.

The most common enterprise approach is dual-stack — assign BOTH IPv4 and IPv6 CIDR blocks to VPCs. All instances get both an IPv4 and IPv6 address. Applications work on either protocol without code changes.

  • AWS: Assign an IPv6 CIDR block (/56 from Amazon’s pool or BYOIP) to the VPC. Each subnet gets a /64 IPv6 CIDR. Security groups and NACLs support IPv6 rules. Route tables support ::/0 for IPv6 internet via Internet Gateway (no NAT needed — IPv6 addresses are globally unique).
  • GCP: VPC supports dual-stack natively. Subnets can be configured as dual-stack with both IPv4 and IPv6 ranges. GKE supports IPv6 pods and services.
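The AWS dual-stack setup described above is a small amount of Terraform. A minimal sketch — the CIDRs and resource names are illustrative:

```hcl
resource "aws_vpc" "dual_stack" {
  cidr_block                       = "10.10.0.0/16"
  assign_generated_ipv6_cidr_block = true # Amazon-provided /56
  enable_dns_support               = true
  enable_dns_hostnames             = true
}

resource "aws_subnet" "dual_stack" {
  vpc_id     = aws_vpc.dual_stack.id
  cidr_block = "10.10.0.0/24"
  # Carve a /64 out of the VPC's /56 (8 additional prefix bits)
  ipv6_cidr_block                 = cidrsubnet(aws_vpc.dual_stack.ipv6_cidr_block, 8, 0)
  assign_ipv6_address_on_creation = true
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.dual_stack.id
}

# IPv6 default route goes straight to the IGW — no NAT gateway needed,
# because IPv6 addresses are globally routable.
resource "aws_route" "ipv6_default" {
  route_table_id              = aws_vpc.dual_stack.main_route_table_id
  destination_ipv6_cidr_block = "::/0"
  gateway_id                  = aws_internet_gateway.igw.id
}
```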

AWS supports IPv6-only subnets (since 2022). Instances in these subnets get ONLY an IPv6 address — no IPv4. This eliminates IPv4 address consumption entirely.

Use case: large-scale workloads (data processing, batch jobs, K8s pods) that do not need to communicate with IPv4-only services. EKS pods in IPv6-only subnets can scale to millions of pods without IPv4 CIDR exhaustion.

For IPv6-only workloads that need to reach IPv4-only services (e.g., legacy on-prem systems, third-party APIs):

  • DNS64: translates IPv4 DNS responses into synthesized IPv6 addresses
  • NAT64: translates between IPv6 and IPv4 at the network layer

This allows IPv6-only pods to communicate with IPv4-only endpoints transparently.
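In Terraform, the IPv6-only subnet plus DNS64/NAT64 combination might look like the sketch below — the variables stand in for an existing dual-stack VPC, route table, and NAT gateway, and the /64 is a placeholder from your VPC's range:

```hcl
variable "vpc_id" { type = string }          # existing dual-stack VPC (assumed)
variable "route_table_id" { type = string }  # route table of the IPv6-only subnet (assumed)
variable "nat_gateway_id" { type = string }  # existing NAT gateway (assumed)

resource "aws_subnet" "ipv6_only" {
  vpc_id          = var.vpc_id
  ipv6_native     = true                       # IPv6-only: no IPv4 CIDR at all
  ipv6_cidr_block = "2600:1f14:abc:1200::/64"  # placeholder /64 from the VPC range
  assign_ipv6_address_on_creation = true
  enable_dns64                    = true       # DNS64: synthesize 64:ff9b::/96 answers for IPv4-only hosts
}

# NAT64: route the well-known DNS64 prefix to a NAT gateway, which
# translates the IPv6 flow to IPv4 on the way out.
resource "aws_route" "nat64" {
  route_table_id              = var.route_table_id
  destination_ipv6_cidr_block = "64:ff9b::/96"
  nat_gateway_id              = var.nat_gateway_id
}
```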

| Scenario | IPv6 Strategy |
| --- | --- |
| Mobile-heavy apps | Priority — carrier-grade NAT (CGNAT) makes IPv4 unreliable for mobile users. IPv6 provides direct connectivity. |
| IoT | Required — billions of devices cannot share IPv4 addresses. Each device needs a unique address. |
| Government mandates | Required — US federal agencies mandate IPv6 (OMB M-21-07). UAE may follow. |
| Large-scale K8s | Recommended — IPv6 eliminates pod CIDR exhaustion. EKS supports IPv6 pod networking. |
| Legacy enterprise | Optional — dual-stack for new VPCs, IPv4-only for existing workloads. Migrate gradually. |
Recommended defaults:

  • New VPCs: dual-stack by default (costs nothing extra, provides future flexibility)
  • Legacy workloads: IPv4-only (no changes needed, migrate when refactoring)
  • Greenfield K8s pods: consider IPv6-only (eliminates CIDR planning headaches at scale)
  • Internet-facing ALBs: dual-stack (serve both IPv4 and IPv6 clients)
  • GCP GKE: supports dual-stack clusters where pods get both IPv4 and IPv6 addresses. Service type LoadBalancer can expose IPv6 endpoints.
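Making an ALB dual-stack is essentially a one-attribute change in Terraform. A sketch, assuming the subnet IDs refer to existing dual-stack public subnets:

```hcl
variable "public_subnet_ids" { type = list(string) } # dual-stack public subnets (assumed)

resource "aws_lb" "public" {
  name               = "api-dualstack" # hypothetical name
  load_balancer_type = "application"
  ip_address_type    = "dualstack"     # publishes both A and AAAA records
  subnets            = var.public_subnet_ids
}
```

IPv4-only clients and IPv6-only clients both reach the same ALB; the backends can stay IPv4.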

Scenario 1: “Design a VPC architecture for a 3-tier web application”


What the interviewer is looking for: Can you map application tiers to network tiers? Do you think about security, HA, and cost?

Answer:

I would design a VPC with three subnet tiers across three AZs for high availability:

Enterprise VPC architecture for interview

In our enterprise bank context, this VPC has no IGW. The ALB in the “public” tier is internal-facing — it receives traffic from the Network Hub’s internet-facing ALB via Transit Gateway. All egress flows through the centralized inspection VPC.
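The three-tier layout can be expressed as a reusable Terraform loop. A sketch — the AZ names and /20 carve-up are illustrative choices, not mandated values:

```hcl
variable "vpc_id" { type = string }
variable "vpc_cidr" {
  type    = string
  default = "10.10.0.0/16" # illustrative VPC CIDR
}

locals {
  azs   = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  tiers = { public = 0, private = 4, data = 8 } # /20 netnum offsets per tier
}

# 9 subnets: 3 tiers x 3 AZs, each a /20 of the /16.
resource "aws_subnet" "tier" {
  for_each = { for pair in setproduct(keys(local.tiers), range(3)) :
    "${pair[0]}-${local.azs[pair[1]]}" => pair }

  vpc_id            = var.vpc_id
  availability_zone = local.azs[each.value[1]]
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, local.tiers[each.value[0]] + each.value[1])
  tags              = { Tier = each.value[0] }
}
```

Keeping the tier offsets in one map means every workload VPC gets an identical, predictable address plan.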


Scenario 2: “How do AWS VPCs differ from GCP VPCs architecturally?”


What the interviewer is looking for: Deep understanding of both clouds, not just surface-level knowledge.

Answer:

The fundamental difference is scope:

| Aspect | AWS VPC | GCP VPC |
| --- | --- | --- |
| Scope | Regional — one VPC per region | Global — one VPC spans all regions |
| Subnets | AZ-scoped (one AZ each) | Regional (spans all zones in a region) |
| Cross-region | Requires VPC peering or TGW peering | Same VPC, add subnets in new regions |
| CIDR | Defined at VPC level (primary + secondary) | Defined per subnet (no VPC-level CIDR) |
| Firewalling | NACLs (subnet) + Security Groups (ENI) | Firewall rules at VPC level (priority-based, target by tag/SA) |
| Multi-tenancy | Separate VPCs per account, TGW to connect | Shared VPC: one VPC, multiple projects |
| NAT | NAT Gateway per AZ (device-based) | Cloud NAT per region (software-defined, on Cloud Router) |
| DNS | Route 53 private hosted zones (per VPC) | Cloud DNS private zones (per VPC network) |

Architecture implication: In GCP, a single global VPC simplifies multi-region communication — subnets in europe-west1 and us-central1 can communicate directly. In AWS, you need inter-region TGW peering or VPC peering, adding cost and configuration.

Enterprise implication: GCP’s Shared VPC model (one host project, many service projects) is conceptually different from AWS’s model (one VPC per account, connected via TGW). Neither is “better” — Shared VPC is simpler for networking but requires careful IAM to prevent service projects from modifying shared network resources.
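The scope difference shows up directly in the Terraform resources: a custom-mode GCP VPC is global, and CIDRs live on its regional subnets. A sketch with illustrative names and ranges:

```hcl
resource "google_compute_network" "enterprise" {
  name                    = "enterprise-vpc" # hypothetical name
  auto_create_subnetworks = false            # custom mode: subnets defined explicitly
}

# Two regions, one VPC — the subnets route to each other with no peering.
resource "google_compute_subnetwork" "eu" {
  name          = "app-eu"
  region        = "europe-west1"
  network       = google_compute_network.enterprise.id
  ip_cidr_range = "10.20.0.0/20" # CIDR is a subnet property, not a VPC property
}

resource "google_compute_subnetwork" "us" {
  name          = "app-us"
  region        = "us-central1"
  network       = google_compute_network.enterprise.id
  ip_cidr_range = "10.20.16.0/20"
}
```

The equivalent in AWS would be two VPCs plus inter-region TGW or VPC peering.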


Scenario 3: “Your application in a private subnet needs to call a public API. What are your options?”


Answer:

Three options, in order of preference for an enterprise:

  1. Centralized NAT via Network Hub (our bank’s approach): Traffic routes from private subnet → TGW → inspection VPC (Network Firewall inspects, IPS/IDS checks) → NAT GW → internet → public API. Pros: centralized egress control, full visibility, IPS/IDS inspection. Cons: additional latency (~2-5ms), TGW and NAT data processing costs.

  2. VPC-local NAT Gateway: Deploy NAT GW in the workload VPC’s public subnet. Private subnet route table has 0.0.0.0/0 → nat-gw. Simpler and lower latency but no centralized inspection — acceptable for non-regulated workloads.

  3. Forward proxy (Squid/Envoy) in Shared Services: Application connects to an internal proxy that maintains an allowlist of permitted external APIs. Proxy logs every request. More application-level control but adds operational overhead.
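Option 2 is only a few resources in Terraform. A sketch, assuming the public subnet and private route table already exist:

```hcl
variable "public_subnet_id" { type = string }        # existing public subnet (assumed)
variable "private_route_table_id" { type = string }  # existing private route table (assumed)

resource "aws_eip" "nat" {
  domain = "vpc"
}

# NAT gateway lives in the public subnet; private instances never do.
resource "aws_nat_gateway" "egress" {
  allocation_id = aws_eip.nat.id
  subnet_id     = var.public_subnet_id
}

# Private subnets send all internet-bound traffic through the NAT gateway.
resource "aws_route" "default_egress" {
  route_table_id         = var.private_route_table_id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.egress.id
}
```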

What I would NOT do: Put the application in a public subnet or assign it a public IP. This violates defense-in-depth and exposes the instance to inbound internet traffic.

For AWS services specifically: Use VPC endpoints instead of going through NAT. If the application calls S3, use the S3 Gateway Endpoint (free). If it calls SSM Parameter Store, use the SSM Interface Endpoint. This avoids NAT costs entirely for AWS-to-AWS communication.
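A sketch of both endpoint types in Terraform, assuming an eu-west-1 VPC and pre-existing route table, subnets, and security group:

```hcl
variable "vpc_id" { type = string }
variable "private_route_table_id" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "endpoint_sg_id" { type = string }

# Gateway endpoint for S3 — free; attaches to route tables, no ENIs.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [var.private_route_table_id]
}

# Interface endpoint for SSM — billed per hour and per GB; creates ENIs
# in each subnet and resolves the SSM hostname to private IPs.
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.eu-west-1.ssm"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}
```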


Scenario 4: “ALB vs NLB — when do you use each?”


Answer:

Use ALB when:

  • The service speaks HTTP/HTTPS/gRPC (Layer 7 protocol)
  • You need content-based routing — path (/api/v2/*), host (api.bank.com), headers, query strings
  • You want to integrate WAF for OWASP rule protection
  • You want built-in OIDC authentication at the load balancer (offload from application)
  • You want to target EKS pods directly via IP-mode target groups (AWS Load Balancer Controller)
  • WebSocket support is needed
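Offloading OIDC at the ALB is a listener configuration. A hedged Terraform sketch — the IdP endpoints, ARNs, and client credentials below are placeholders:

```hcl
variable "alb_arn" { type = string }
variable "cert_arn" { type = string }
variable "target_group_arn" { type = string }
variable "oidc_client_id" { type = string }
variable "oidc_client_secret" {
  type      = string
  sensitive = true
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = var.alb_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.cert_arn

  # First action: authenticate the user with the IdP before forwarding.
  default_action {
    type  = "authenticate-oidc"
    order = 1
    authenticate_oidc {
      issuer                 = "https://idp.example.com"           # hypothetical IdP
      authorization_endpoint = "https://idp.example.com/authorize" # hypothetical
      token_endpoint         = "https://idp.example.com/token"     # hypothetical
      user_info_endpoint     = "https://idp.example.com/userinfo"  # hypothetical
      client_id              = var.oidc_client_id
      client_secret          = var.oidc_client_secret
    }
  }

  # Second action: forward authenticated requests to the application.
  default_action {
    type             = "forward"
    order            = 2
    target_group_arn = var.target_group_arn
  }
}
```

The application receives the user's identity in headers the ALB injects, so it never handles the OIDC flow itself.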

Use NLB when:

  • The service is TCP/UDP (databases, MQTT, gRPC with TLS passthrough, custom protocols)
  • You need static IPs (regulatory requirement: firewall allowlisting by IP)
  • Extreme performance: millions of requests/sec with sub-millisecond latency
  • You need to preserve the source IP address natively (ALB uses X-Forwarded-For header)
  • PrivateLink: exposing a service to other VPCs/accounts — NLB is required as the backend for VPC Endpoint Services
  • TLS passthrough: NLB can forward TLS traffic without terminating, letting the backend handle decryption

Combined pattern — NLB in front of ALB: When you need both static IPs AND Layer 7 routing, place an NLB in front of an ALB. The NLB provides static IPs; the ALB provides path/host routing. This is common in enterprise environments where partners allowlist by IP address.
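The NLB-in-front-of-ALB pattern uses the dedicated `alb` target type on the NLB's target group. A sketch with placeholder ARNs:

```hcl
variable "vpc_id" { type = string }
variable "alb_arn" { type = string } # existing internal ALB (assumed)

# NLB target group whose single target is the ALB itself.
resource "aws_lb_target_group" "alb_behind_nlb" {
  name        = "alb-target" # hypothetical name
  target_type = "alb"        # special target type: forward TCP to an ALB
  port        = 443
  protocol    = "TCP"        # must be TCP for alb-type target groups
  vpc_id      = var.vpc_id
}

resource "aws_lb_target_group_attachment" "alb" {
  target_group_arn = aws_lb_target_group.alb_behind_nlb.arn
  target_id        = var.alb_arn # the ALB's ARN, not an instance or IP
  port             = 443
}
```

Partners allowlist the NLB's static IPs (or Elastic IPs), while path and host routing still happen on the ALB behind it.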


Scenario 5: “How does GCP Global Load Balancer differ from AWS ALB?”


Answer:

The core difference is scope and architecture:

GCP Global External HTTP(S) LB:

  • Uses a single anycast IP that is advertised from Google’s edge PoPs worldwide
  • Users are routed to the nearest Google edge location automatically (like having a built-in CDN + global routing)
  • Backends can span multiple regions — the LB automatically routes to the closest healthy backend
  • Built-in Cloud CDN, Cloud Armor (WAF/DDoS), and traffic splitting for canary deployments
  • One URL map handles all regions

AWS ALB:

  • Regional — one ALB per region per application
  • For multi-region, you need ALBs in each region PLUS Route 53 with latency-based routing or Global Accelerator for anycast IPs
  • WAF is per-ALB (attached separately in each region)
  • No built-in CDN (need CloudFront in front of ALB)

Practical example: If I have a banking API serving users in Europe and Middle East:

  • GCP: One Global LB, backends in europe-west1 and me-central1. Users in Dubai hit the nearest edge, routed to me-central1 backend. Users in London hit europe-west1. Automatic failover if one region goes down. One Cloud Armor policy protects both.

  • AWS: ALB in eu-west-1 + ALB in me-south-1. Route 53 latency-based routing points api.bank.com to the nearest ALB. CloudFront distribution in front for edge caching. Separate WAF Web ACL per ALB (or use AWS Firewall Manager to synchronize). Health checks at Route 53 level for failover.

GCP is operationally simpler for global applications. AWS gives more granular control per region but requires more infrastructure to achieve the same result.