FinOps for Architects
Where This Fits
As the central platform team, you design the architecture to be cost-efficient by default. Tenant teams make local decisions (instance size, replica count) within guardrails you set.
Architectural Cost Levers
The Big Five
Compute Optimization
Cost Optimization Hierarchy (AWS, most to least impactful):
1. TURN IT OFF (idle resources)
   - Dev/staging environments: scheduled shutdown (7 PM - 7 AM)
   - EventBridge Scheduler + Lambda for the dev environment scheduler
   - Savings: ~60% on non-prod compute
2. RIGHT-SIZE (oversized instances)
   - AWS Compute Optimizer recommendations
   - CloudWatch CPU/memory: if <30% average utilization, downsize
   - Example: r6g.2xlarge (8 vCPU, 64 GB) → r6g.xlarge if using 20% CPU
   - Savings: 30-50% on oversized instances
3. SPOT INSTANCES (fault-tolerant workloads)
   - EKS node groups with mixed On-Demand + Spot
   - Karpenter consolidation + Spot diversification
   - Use for: CI/CD, batch processing, stateless services
   - Savings: 60-90% vs On-Demand
4. GRAVITON (ARM-based, ~20% cheaper)
   - m7g, r7g, c7g instances
   - Most containers work on ARM without changes
   - Savings: ~20% at equal or better performance
5. SAVINGS PLANS (committed use)
   - Compute Savings Plans: 1yr no-upfront = 20-30% savings; 3yr all-upfront = 50-60% savings
   - Apply to steady-state workloads AFTER right-sizing

```yaml
# Karpenter NodePool with cost optimization
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        # Graviton (ARM) preferred for 20% cost savings
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        # Mix instance types for Spot availability
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m7g.large
            - m7g.xlarge
            - m6g.large
            - m6g.xlarge
            - c7g.large
            - c7g.xlarge
        # Use Spot for non-critical workloads
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "200"
    memory: 800Gi
```

Cost Optimization Hierarchy (GCP, most to least impactful):
1. TURN IT OFF
   - GKE cluster scheduling (scale node pools to 0 for non-prod)
   - Cloud Scheduler + Cloud Functions for environment shutdown
   - Savings: ~60% on non-prod
2. RIGHT-SIZE
   - GKE recommender (built-in VPA recommendations)
   - Active Assist recommendations for VM right-sizing
   - Savings: 30-50%
3. SPOT VMs (preemptible-style, 60-91% discount)
   - GKE Spot node pools
   - Best for: batch, CI/CD, stateless services
   - Savings: 60-91% vs on-demand
4. COMMITTED USE DISCOUNTS (CUDs)
   - Resource-based CUDs: commit CPU/memory (1yr: 37%, 3yr: 55%)
   - Spend-based CUDs: commit $/hr (1yr: 25%, 3yr: 52%)
   - Apply to steady-state GKE workloads
5. AUTOPILOT
   - Google manages nodes with optimized bin-packing
   - No idle node capacity (you pay per pod, not per node)
   - Often 20-40% cheaper than Standard for variable workloads

```hcl
# GKE node pool with Spot VMs
resource "google_container_node_pool" "spot_pool" {
  name     = "spot-general"
  cluster  = google_container_cluster.main.name
  location = var.region

  autoscaling {
    min_node_count = 0
    max_node_count = 50
  }

  node_config {
    machine_type = "e2-standard-4"
    spot         = true

    labels = {
      "node-type" = "spot"
    }

    taint {
      key    = "cloud.google.com/gke-spot"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}
```

Tagging Strategy
Mandatory Tags
| Tag Key | Example Value | Purpose |
|---|---|---|
| team | payments | Cost attribution to team |
| environment | production | Cost by environment |
| project | mobile-banking | Cost by project/product |
| cost-center | CC-4521 | Finance mapping |
| managed-by | terraform | Infrastructure management |
| data-classification | confidential | Security/compliance |
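Beyond creation-time enforcement, tag compliance is worth auditing periodically. A minimal sketch of such an audit, assuming resources arrive as plain dicts of tags (a simplified stand-in for what a cloud inventory API would return); the required set mirrors the cost-attribution tags in the table above:

```python
# Audit resources for the mandatory cost-allocation tags.
# REQUIRED_TAGS mirrors the table above (cost tags only).
REQUIRED_TAGS = {"team", "environment", "project", "cost-center"}


def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Return the mandatory tags absent from a resource (empty set == compliant)."""
    present = {k for k, v in resource_tags.items() if v}  # an empty value counts as missing
    return REQUIRED_TAGS - present


def compliance_report(resources: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map resource id -> missing tags, listing only non-compliant resources."""
    return {rid: gaps for rid, tags in resources.items() if (gaps := missing_tags(tags))}


if __name__ == "__main__":
    inventory = {
        "i-0abc": {"team": "payments", "environment": "production",
                   "project": "mobile-banking", "cost-center": "CC-4521"},
        "i-0def": {"team": "payments", "environment": ""},
    }
    print(compliance_report(inventory))
```

A report like this feeds the "typically 10-20% of spend is unknown" discovery step described in the interview scenarios below.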
Tag Enforcement with SCPs
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedResources",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "rds:CreateDBCluster",
        "elasticache:CreateReplicationGroup",
        "ecs:CreateCluster",
        "eks:CreateCluster",
        "lambda:CreateFunction",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true",
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
```

```hcl
# Terraform: attach SCP to Workloads OU
resource "aws_organizations_policy" "require_tags" {
  name        = "require-cost-tags"
  description = "Deny resource creation without required cost tags"
  type        = "SERVICE_CONTROL_POLICY"
  content     = file("policies/require-tags.json")
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.require_tags.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
```

GCP has no native label enforcement via org policies; use Config Connector or a custom org policy constraint with tags.
```hcl
# Option 1: Custom constraint (tag-based, newer approach)
resource "google_org_policy_custom_constraint" "require_labels" {
  name         = "custom.requireCostLabels"
  parent       = "organizations/${var.org_id}"
  display_name = "Require cost labels on all resources"
  description  = "All resources must have team, environment, and cost-center labels"

  action_type  = "DENY"
  condition    = "!resource.matchLabels('team', '*') || !resource.matchLabels('environment', '*')"
  method_types = ["CREATE"]

  resource_types = [
    "compute.googleapis.com/Instance",
    "container.googleapis.com/Cluster",
    "sqladmin.googleapis.com/Instance",
  ]
}
```
```hcl
# Option 2: Billing budgets per project (cost guardrails)
resource "google_billing_budget" "team_budget" {
  for_each = var.team_projects

  billing_account = var.billing_account_id
  display_name    = "${each.key}-monthly-budget"

  budget_filter {
    projects = ["projects/${each.value.project_id}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = each.value.monthly_budget
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # Alert at 50%
    spend_basis       = "CURRENT_SPEND"
  }
  threshold_rules {
    threshold_percent = 0.8 # Alert at 80%
  }
  threshold_rules {
    threshold_percent = 1.0 # Alert at 100%
  }

  all_updates_rule {
    monitoring_notification_channels = [each.value.notification_channel]
    disable_default_iam_recipients   = true
  }
}
```

Savings Plans vs Reserved Instances vs CUDs
Comparison
| Feature | Savings Plans (AWS) | Reserved Instances (AWS) | CUDs (GCP) |
|---|---|---|---|
| Commitment | $/hour | Instance type + region | CPU/memory or $/hour |
| Flexibility | Any instance family, region, OS | Standard: fixed instance type; Convertible: exchangeable | Resource-based or spend-based |
| Discount | 20-30% (1yr), 50-60% (3yr) | 30-40% (1yr), 50-60% (3yr) | 37% (1yr), 55% (3yr) |
| Covers K8s | Yes (any compute) | Only if matching instance type | Yes |
| Best for | Dynamic workloads | Predictable, fixed workloads | GKE steady-state |
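One nuance the table omits: a commitment only saves money while it is actually used. A hedged sketch of 1-year commitment economics (the ~30% rate comes from the table; the break-even rule is simple algebra, not an official AWS formula):

```python
# Sketch of 1-year Savings Plan economics (illustrative, not an AWS calculator).
# commit_hr: committed $/hr of discounted spend; discount: e.g. 0.30 for ~30%.
# usage_hr: on-demand-equivalent $/hr of eligible compute actually running.

def monthly_cost(usage_hr: float, commit_hr: float, discount: float, hours: int = 730) -> float:
    covered_capacity = commit_hr / (1 - discount)    # on-demand $ the commitment can absorb
    overflow = max(0.0, usage_hr - covered_capacity)  # anything above spills to on-demand rates
    return (commit_hr + overflow) * hours


def breakeven_utilization(discount: float) -> float:
    # The plan saves money only while covered usage exceeds the commitment itself,
    # so utilization of the covered capacity must exceed (1 - discount).
    return 1 - discount


if __name__ == "__main__":
    no_plan = 100 * 730                      # $100/hr fully on-demand
    with_plan = monthly_cost(100, 70, 0.30)  # commit $70/hr at a 30% discount
    print(no_plan, with_plan, breakeven_utilization(0.30))
```

At a 30% discount you need more than 70% utilization of the covered capacity before the plan beats pure on-demand, which is why the scenarios below commit only to a fraction of the always-on baseline.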
Sizing Strategy
Kubecost for Kubernetes Cost Attribution
Architecture
Kubecost Deployment

```yaml
# Helm values for Kubecost
# helm install kubecost cost-analyzer/cost-analyzer -f values.yaml
kubecostProductConfigs:
  clusterName: "prod-us-east-1"
  currencyCode: "USD"
  # Shared cost allocation (platform overhead split across namespaces)
  sharedNamespaces: "monitoring,kube-system,istio-system,ingress-nginx"
  sharedOverhead: "500" # $500/month for shared infra split equally

global:
  prometheus:
    enabled: false # Use existing VictoriaMetrics
    fqdn: "http://vmselect.monitoring:8481/select/0/prometheus"

  grafana:
    enabled: false # Use existing Grafana
    proxy: false

# Integration with cloud billing
cloudCost:
  enabled: true
  # AWS: CUR (Cost and Usage Report) in S3
  # GCP: BigQuery billing export
```

Cost Allocation Report Example
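As an illustration of what such a report computes: each namespace's total is its direct cost plus a share of platform overhead. A minimal sketch with made-up namespace costs (the $500 overhead matches the sharedOverhead value in the Helm config; Kubecost itself supports both equal and proportional splits):

```python
# Sketch of a namespace cost-allocation report (illustrative numbers, not
# real Kubecost output). direct: per-namespace compute cost from Kubecost;
# shared: platform overhead (the sharedOverhead value in the Helm config).

def allocate(direct: dict[str, float], shared: float, proportional: bool = True) -> dict[str, float]:
    """Return total cost per namespace: direct cost plus a share of overhead."""
    if proportional:
        total = sum(direct.values())
        return {ns: c + shared * (c / total) for ns, c in direct.items()}
    per_ns = shared / len(direct)  # equal split, as in the Helm comment above
    return {ns: c + per_ns for ns, c in direct.items()}


if __name__ == "__main__":
    direct = {"payments": 3000.0, "search": 1500.0, "batch": 500.0}
    print(allocate(direct, shared=500.0))
```

The proportional variant matches the approach described in Scenario 2 below: heavy consumers absorb more of the shared infrastructure cost, so nobody free-rides.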
Data Transfer Cost Optimization

```hcl
# VPC endpoints to avoid NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```

Interview Scenarios
Scenario 1: Reduce AWS Bill by 40%
“The CTO says our AWS bill is $500K/month and wants it at $300K. What do you do?”
Strong Answer:
“$200K reduction requires systematic optimization across all cost categories. Here’s my 90-day plan:
Week 1-2: Discovery (identify the $200K)
- Enable AWS Cost Explorer with hourly granularity
- Tag analysis: find untagged resources (typically 10-20% of spend is ‘unknown’)
- Run AWS Compute Optimizer for right-sizing recommendations
- Deploy Kubecost to attribute K8s costs to namespaces
Quick wins (Month 1: target $80K savings)
- Shut down idle resources: Dev/staging environments running 24/7 → schedule 12 hrs/day = save 50% of non-prod compute. Typical impact: $30-40K.
- Right-size oversized instances: Compute Optimizer says 40% of instances can downsize by 50%. Impact: $20-30K.
- Delete unused resources: Unattached EBS volumes, old snapshots, unused Elastic IPs, idle NAT Gateways. Impact: $10-15K.
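The first quick win is usually a small scheduled Lambda that stops and starts instances by tag. A sketch of just the decision logic, under assumed conventions (an environment tag and a 7 AM - 7 PM weekday window; the actual stop/start calls would go through boto3). Keeping non-prod down on weekends also pushes the 50% weekday figure toward the ~60% cited earlier:

```python
# Decide whether a non-prod instance should be running at a given hour.
# Assumed convention: environment tag in {"dev", "staging"} follows the
# 7 AM - 7 PM weekday business-hours window; everything else runs 24/7.
NON_PROD = {"dev", "staging"}
START_HOUR, STOP_HOUR = 7, 19  # 7 AM - 7 PM local time


def should_run(environment: str, hour: int, weekday: int) -> bool:
    """weekday follows datetime.weekday(): Monday=0 .. Sunday=6."""
    if environment not in NON_PROD:
        return True                  # production is never scheduled off
    if weekday >= 5:                 # Saturday/Sunday: keep non-prod down
        return False
    return START_HOUR <= hour < STOP_HOUR


def weekly_savings_fraction() -> float:
    """Fraction of non-prod compute hours eliminated by the schedule."""
    running = 5 * (STOP_HOUR - START_HOUR)  # 5 weekdays x 12 hours
    return 1 - running / (7 * 24)


if __name__ == "__main__":
    print(should_run("dev", 22, 2), round(weekly_savings_fraction(), 2))
```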
Medium-term (Month 2: target $70K savings)
4. Spot instances for K8s: Move stateless workloads to Spot nodes (60-70% discount). Impact: $30-40K.
5. Graviton migration: Switch to ARM instances (20% cheaper, 10% faster). Impact: $15-20K.
6. VPC endpoints: S3, ECR, STS endpoints to avoid NAT Gateway processing fees. Impact: $5-10K.
Long-term (Month 3: target $50K savings)
7. Compute Savings Plans: After right-sizing stabilizes, commit to a 1-year no-upfront Savings Plan covering 70% of baseline. Impact: $30-40K.
8. Observability cost: Migrate from Datadog to open-source (Grafana stack). Impact: $10-15K.
Total estimated savings: $200K/month achieved in 90 days.”
Scenario 2: Kubernetes Cost Attribution
“Engineering says ‘Kubernetes is expensive’ but nobody knows which team is consuming what. How do you fix this?”
Strong Answer:
“This is a visibility problem, not a Kubernetes problem. Three-step approach:
1. Deploy Kubecost pointing at our existing VictoriaMetrics. Within a day, we have cost per namespace, per deployment, per pod. The key metrics: CPU cost, memory cost, storage cost, idle cost (requested but unused).

2. Enforce namespace-per-team: If teams share namespaces, cost attribution is impossible. Our platform already uses namespace-per-team with ResourceQuotas. Each namespace has a team and cost-center label.

3. Weekly cost report: Automated report to each team lead showing:
- Their namespace cost this week vs last week
- Top 3 optimization recommendations (from Kubecost)
- Efficiency score (actual usage / requested resources)
- Cluster-wide idle cost they’re contributing to
The key insight: Most K8s cost waste comes from over-requesting resources. A pod requesting 4 CPU cores but using 0.5 means 3.5 cores are reserved but idle — the node can’t use them for other pods. VPA in recommendation mode shows the right values. The platform team runs VPA and publishes right-sizing recommendations weekly.
Shared infrastructure costs (monitoring, ingress, kube-system) are split proportionally by namespace resource consumption. This prevents teams from free-riding on shared infrastructure.”
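The over-request example above is easy to quantify. A minimal sketch of the efficiency score and idle-cost figures from the weekly report, assuming a made-up blended rate of $0.04 per core-hour:

```python
# Idle cost from over-requesting: requested-but-unused CPU is reserved on the
# node and unavailable to other pods. The rate is an assumed blended $/core-hour.

def efficiency(used_cores: float, requested_cores: float) -> float:
    """Actual usage / requested resources (the report's efficiency score)."""
    return used_cores / requested_cores


def idle_cost_per_month(used_cores: float, requested_cores: float,
                        dollars_per_core_hour: float = 0.04, hours: int = 730) -> float:
    idle = max(0.0, requested_cores - used_cores)
    return idle * dollars_per_core_hour * hours


if __name__ == "__main__":
    # The example from the text: a pod requesting 4 cores but using 0.5.
    print(efficiency(0.5, 4.0), idle_cost_per_month(0.5, 4.0))
```

Even at a modest per-core rate, a single chronically over-requested pod wastes about a hundred dollars a month, which is why VPA-driven right-sizing recommendations are the highest-leverage output of the report.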
Scenario 3: Savings Plans Strategy
“We spend $200K/month on EC2/Fargate compute. How much should we commit to Savings Plans?”
Strong Answer:
“Never commit 100%. Here’s my approach:
Step 1: Analyze 90 days of hourly compute spend in Cost Explorer. Identify the baseline — the minimum compute running at 3 AM on a Sunday. That’s your floor.
Step 2: Typical breakdown for enterprise:
- Baseline (always-on): 60% = $120K/month
- Variable (scales with traffic): 25% = $50K/month
- Burst/ephemeral (CI/CD, batch): 15% = $30K/month
Step 3: Reserve 80% of baseline:
- Commit: $96K/month on 1-year Compute Savings Plan (no upfront)
- Discount: ~30% = saves ~$29K/month = $348K/year
- Risk: minimal — we’re only committing on what we always use
Step 4: Cover variable with Spot:
- $50K of variable workloads on Spot instances
- Average 65% discount = saves ~$32K/month
Step 5: Leave burst as On-Demand:
- $30K stays On-Demand (CI/CD, batch jobs, temporary scaling)
- Flexibility to scale down if business changes
Why Compute Savings Plans over EC2 Reserved Instances? Savings Plans apply across instance families, regions, and OS. If we migrate from m5.xlarge to m7g.large (Graviton), the Savings Plan still applies; a Standard RI would stop matching, and a Convertible RI would need to be exchanged.
Total savings: ~$61K/month = $732K/year on a $200K/month spend.”
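The dollar figures in this answer follow from a few lines of arithmetic (rates as assumed above; the prose rounds the annual figures slightly differently):

```python
# Sanity-check the Savings Plan sizing arithmetic from the answer above.
monthly_spend = 200_000

baseline = monthly_spend * 0.60  # always-on floor: $120K
commit = baseline * 0.80         # commit 80% of baseline: $96K
sp_savings = commit * 0.30       # ~30% discount on the committed spend

variable = monthly_spend * 0.25  # traffic-driven: $50K
spot_savings = variable * 0.65   # ~65% average Spot discount

total = sp_savings + spot_savings
print(round(sp_savings), round(spot_savings), round(total), round(total * 12))
```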
Scenario 4: Multi-Cloud Cost Management
“We run workloads on both AWS and GCP. How do you manage costs across both clouds?”
Strong Answer:
“Unified visibility is the first challenge. Here’s my approach:
1. Unified tagging/labeling:
- Define one tagging standard that works on both clouds
- AWS tags: team, environment, project, cost-center
- GCP labels: same keys (GCP labels are lowercase-only, so keep naming consistent)
- Enforce via AWS SCPs and GCP org policies
2. Centralized cost dashboard:
- AWS Cost and Usage Report (CUR) → S3 → Athena
- GCP Billing Export → BigQuery
- Build unified Grafana dashboard querying both sources
- OR use a FinOps tool like CloudHealth, Vantage, or Infracost
3. Per-cloud optimization:
- AWS: Compute Savings Plans for steady-state EC2/Fargate
- GCP: Resource-based CUDs for steady-state GKE
- Both: Spot/preemptible for variable workloads
- Both: VPC endpoints / Private Google Access to reduce egress
4. Workload placement decisions:
- Run each workload where it’s cheapest for the requirements
- GKE Autopilot is often cheaper than EKS for variable workloads (no idle node waste)
- BigQuery is cheaper than Redshift for ad-hoc analytics
- AWS Lambda is cheaper than Cloud Functions for low-volume event processing
5. Kubernetes cost normalization:
- Kubecost on both clouds with the same configuration
- Compare cost-per-request across clouds for the same service
- Use this data for future placement decisions”
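Step 2's unified view requires normalizing the two billing schemas into one. A sketch of that mapping; the field names are simplified stand-ins, not the actual CUR or BigQuery export column names, which have many more columns:

```python
# Normalize AWS CUR and GCP billing-export rows into one schema so a single
# dashboard can group cost by team across both clouds. Field names here are
# simplified stand-ins for the real export columns.

def from_aws(row: dict) -> dict:
    return {
        "cloud": "aws",
        "service": row["product_code"],
        "team": row.get("resource_tags", {}).get("team", "untagged"),
        "cost_usd": float(row["unblended_cost"]),
    }


def from_gcp(row: dict) -> dict:
    return {
        "cloud": "gcp",
        "service": row["service_description"],
        "team": row.get("labels", {}).get("team", "untagged"),
        "cost_usd": float(row["cost"]),
    }


def cost_by_team(aws_rows: list[dict], gcp_rows: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for r in [from_aws(x) for x in aws_rows] + [from_gcp(x) for x in gcp_rows]:
        totals[r["team"]] = totals.get(r["team"], 0.0) + r["cost_usd"]
    return totals


if __name__ == "__main__":
    aws = [{"product_code": "AmazonEC2", "resource_tags": {"team": "payments"},
            "unblended_cost": "120.5"}]
    gcp = [{"service_description": "Compute Engine", "labels": {"team": "payments"},
            "cost": 80.0}]
    print(cost_by_team(aws, gcp))
```

The "untagged" bucket is the point: it surfaces exactly the spend that the unified tagging standard in step 1 has not yet captured.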
References
Section titled “References”- AWS Cost Optimization Pillar — Well-Architected Framework cost optimization best practices
- AWS Savings Plans User Guide — Compute and EC2 Savings Plans, sizing, and recommendations
- AWS Cost Explorer — cost visualization, forecasting, and rightsizing recommendations
- AWS Compute Optimizer — instance right-sizing recommendations
- GCP Cost Management Documentation — billing reports, budgets, and cost optimization
- Committed Use Discounts — resource-based and spend-based CUDs
- GCP Active Assist Recommender — right-sizing, idle resource, and commitment recommendations
Tools & Frameworks
Section titled “Tools & Frameworks”- FinOps Foundation — FinOps principles, framework, and community best practices
- Kubecost Documentation — Kubernetes cost monitoring, allocation, and optimization
- OpenCost — CNCF open-source cost monitoring for Kubernetes
- Infracost — cost estimation for Terraform changes in CI/CD pipelines