
FinOps for Architects

FinOps operating model

As the central platform team, you design the architecture to be cost-efficient by default. Tenant teams make local decisions (instance size, replica count) within guardrails you set.


Cloud cost breakdown by category

Cost Optimization Hierarchy (most to least impactful):

1. TURN IT OFF (idle resources)
  • Dev/staging environments: scheduled shutdown (7 PM - 7 AM)
  • EventBridge-scheduled Lambda as the dev environment scheduler
  • Savings: ~60% on non-prod compute

2. RIGHT-SIZE (oversized instances)
  • AWS Compute Optimizer recommendations
  • CloudWatch CPU/memory → if <30% average utilization, downsize
  • Example: r6g.2xlarge (8 vCPU, 64 GB) → r6g.xlarge at 20% CPU
  • Savings: 30-50% on oversized instances

3. SPOT INSTANCES (fault-tolerant workloads)
  • EKS node groups with mixed On-Demand + Spot
  • Karpenter consolidation + Spot diversification
  • Savings: 60-90% vs On-Demand
  • Use for: CI/CD, batch processing, stateless services

4. GRAVITON (ARM-based, ~20% cheaper)
  • m7g, r7g, c7g instances
  • Most containers run on ARM without changes
  • Savings: 20%, plus better price/performance

5. SAVINGS PLANS (committed use)
  • Compute Savings Plans: 1yr no-upfront = 20-30% savings
  • 3yr all-upfront = 50-60% savings
  • Apply to steady-state workloads AFTER right-sizing
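The "turn it off" figures are simple arithmetic; a quick sketch (assuming purely per-hour billing, which holds for on-demand EC2/RDS instances):

```python
# Back-of-envelope savings from scheduled shutdown of non-prod environments.
HOURS_PER_WEEK = 7 * 24  # 168

def shutdown_savings(on_hours_per_week: float) -> float:
    """Fraction of compute cost saved by only running on_hours_per_week."""
    return 1 - on_hours_per_week / HOURS_PER_WEEK

# 7 AM - 7 PM every day (12 h x 7 days = 84 h on):
print(f"{shutdown_savings(12 * 7):.1%}")  # 50.0%
# 7 AM - 7 PM weekdays only (12 h x 5 days = 60 h on):
print(f"{shutdown_savings(12 * 5):.1%}")  # 64.3%
```

A 12 hrs/day schedule yields 50%; adding weekend shutdown pushes savings toward the ~60% figure quoted above.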
# Karpenter NodePool with cost optimization
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        # Graviton (ARM) preferred for 20% cost savings
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        # Mix instance types for Spot availability
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m7g.large
            - m7g.xlarge
            - m6g.large
            - m6g.xlarge
            - c7g.large
            - c7g.xlarge
        # Use Spot for non-critical workloads
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "200"
    memory: 800Gi

| Tag Key | Example Value | Purpose |
| --- | --- | --- |
| team | payments | Cost attribution to team |
| environment | production | Cost by environment |
| project | mobile-banking | Cost by project/product |
| cost-center | CC-4521 | Finance mapping |
| managed-by | terraform | Infrastructure management |
| data-classification | confidential | Security/compliance |
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedResources",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "rds:CreateDBCluster",
        "elasticache:CreateReplicationGroup",
        "ecs:CreateCluster",
        "eks:CreateCluster",
        "lambda:CreateFunction",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true",
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
# Terraform: attach SCP to Workloads OU
resource "aws_organizations_policy" "require_tags" {
  name        = "require-cost-tags"
  description = "Deny resource creation without required cost tags"
  type        = "SERVICE_CONTROL_POLICY"
  content     = file("policies/require-tags.json")
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.require_tags.id
  target_id = aws_organizations_organizational_unit.workloads.id
}

Savings Plans vs Reserved Instances vs CUDs

| Feature | Savings Plans (AWS) | Reserved Instances (AWS) | CUDs (GCP) |
| --- | --- | --- | --- |
| Commitment | $/hour | Instance type + region | CPU/memory or $/hour |
| Flexibility | Any instance family, region, OS | Fixed instance type (Standard) or Convertible | Resource-based or spend-based |
| Discount | 20-30% (1yr), 50-60% (3yr) | 30-40% (1yr), 50-60% (3yr) | 37% (1yr), 55% (3yr) |
| Covers K8s | Yes (any compute) | Only if matching instance type | Yes |
| Best for | Dynamic workloads | Predictable, fixed workloads | GKE steady-state |

Reserved capacity sizing strategy


Kubecost architecture for Kubernetes cost attribution

# Helm values for Kubecost
# helm install kubecost cost-analyzer/cost-analyzer -f values.yaml
kubecostProductConfigs:
  clusterName: "prod-us-east-1"
  currencyCode: "USD"
  # Shared cost allocation (platform overhead split across namespaces)
  sharedNamespaces: "monitoring,kube-system,istio-system,ingress-nginx"
  sharedOverhead: "500"  # $500/month for shared infra split equally
global:
  prometheus:
    enabled: false  # Use existing VictoriaMetrics
    fqdn: "http://vmselect.monitoring:8481/select/0/prometheus"
  grafana:
    enabled: false  # Use existing Grafana
    proxy: false
# Integration with cloud billing
cloudCost:
  enabled: true
  # AWS: CUR (Cost and Usage Report) in S3
  # GCP: BigQuery billing export

Kubecost monthly cost allocation report


AWS data transfer costs and common traps

# VPC endpoints to avoid NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
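A rough model of why these endpoints pay off. The prices below are approximate us-east-1 list prices (NAT Gateway data processing ~$0.045/GB, interface endpoints ~$0.01/AZ-hour plus ~$0.01/GB) and will drift; check current pricing before quoting numbers:

```python
# Rough monthly cost of routing S3/ECR traffic through a NAT Gateway vs
# VPC endpoints. All prices are assumed approximations, not current quotes.
NAT_GB = 0.045     # NAT Gateway data processing, $/GB
IFACE_HOUR = 0.01  # Interface endpoint, $/hour per AZ
IFACE_GB = 0.01    # Interface endpoint data processing, $/GB
HOURS = 730        # ~hours per month

def nat_cost(gb_per_month: float) -> float:
    # Data processing only; the NAT Gateway hourly charge is excluded.
    return gb_per_month * NAT_GB

def endpoint_cost(gb_per_month: float, azs: int = 3) -> float:
    # Gateway endpoints (S3, DynamoDB) are free; interface endpoints
    # (ECR api/dkr, STS, ...) bill per AZ-hour plus per GB.
    return azs * IFACE_HOUR * HOURS + gb_per_month * IFACE_GB

# 10 TB/month of image pulls and S3 traffic:
print(round(nat_cost(10_000)))       # 450
print(round(endpoint_cost(10_000)))  # 122
```

The gap widens with traffic volume, which is why image-pull-heavy EKS clusters are the classic NAT Gateway cost trap.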

“The CTO says our AWS bill is $500K/month and wants it at $300K. What do you do?”

Strong Answer:

“$200K reduction requires systematic optimization across all cost categories. Here’s my 90-day plan:

Week 1-2: Discovery (identify the $200K)

  • Enable AWS Cost Explorer with hourly granularity
  • Tag analysis: find untagged resources (typically 10-20% of spend is ‘unknown’)
  • Run AWS Compute Optimizer for right-sizing recommendations
  • Deploy Kubecost to attribute K8s costs to namespaces

Quick wins (Month 1: target $80K savings)

  1. Shut down idle resources: Dev/staging environments running 24/7 → schedule 12 hrs/day = save 50% of non-prod compute. Typical impact: $30-40K.
  2. Right-size oversized instances: Compute Optimizer says 40% of instances can downsize by 50%. Impact: $20-30K.
  3. Delete unused resources: Unattached EBS volumes, old snapshots, unused Elastic IPs, idle NAT Gateways. Impact: $10-15K.
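The unused-resource sweep in item 3 can be scripted. A hedged boto3 sketch: the filter logic is pure so it runs without credentials, the actual API call is shown in comments, and the $0.08/GB-month gp3 price is an assumption:

```python
# Find unattached EBS volumes -- the classic "delete unused resources" win.

def unattached_volumes(volumes: list[dict]) -> list[dict]:
    """Volumes in 'available' state are provisioned but attached to nothing."""
    return [v for v in volumes if v.get("State") == "available"]

def monthly_waste(volumes: list[dict], usd_per_gb: float = 0.08) -> float:
    """Approximate monthly cost of idle storage (assumed gp3 $0.08/GB-month)."""
    return sum(v["Size"] for v in unattached_volumes(volumes)) * usd_per_gb

# With boto3 against a real account (requires credentials):
# import boto3
# vols = boto3.client("ec2").describe_volumes(
#     Filters=[{"Name": "status", "Values": ["available"]}]
# )["Volumes"]

sample = [
    {"VolumeId": "vol-1", "State": "available", "Size": 500},
    {"VolumeId": "vol-2", "State": "in-use", "Size": 100},
]
print(monthly_waste(sample))  # 40.0
```

The same pattern extends to old snapshots, unassociated Elastic IPs, and idle NAT Gateways.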

Medium-term (Month 2: target $70K savings)

  4. Spot instances for K8s: Move stateless workloads to Spot nodes (60-70% discount). Impact: $30-40K.
  5. Graviton migration: Switch to ARM instances (20% cheaper, 10% faster). Impact: $15-20K.
  6. VPC endpoints: S3, ECR, STS endpoints to avoid NAT Gateway processing fees. Impact: $5-10K.

Long-term (Month 3: target $50K savings)

  7. Compute Savings Plans: After right-sizing stabilizes, commit to a 1-year no-upfront Savings Plan for 70% of baseline. Impact: $30-40K.
  8. Observability cost: Migrate from Datadog to the open-source Grafana stack. Impact: $10-15K.

Total estimated savings: $200K/month, achieved within 90 days.”


“Engineering says ‘Kubernetes is expensive’ but nobody knows which team is consuming what. How do you fix this?”

Strong Answer:

“This is a visibility problem, not a Kubernetes problem. Three-step approach:

  1. Deploy Kubecost pointing at our existing VictoriaMetrics. Within a day, we have cost per namespace, per deployment, per pod. The key metrics: CPU cost, memory cost, storage cost, idle cost (requested but unused).

  2. Enforce namespace-per-team: If teams share namespaces, cost attribution is impossible. Our platform already uses namespace-per-team with ResourceQuotas. Each namespace has a team and cost-center label.

  3. Weekly cost report: Automated report to each team lead showing:

    • Their namespace cost this week vs last week
    • Top 3 optimization recommendations (from Kubecost)
    • Efficiency score (actual usage / requested resources)
    • Cluster-wide idle cost they’re contributing to

The key insight: Most K8s cost waste comes from over-requesting resources. A pod requesting 4 CPU cores but using 0.5 means 3.5 cores are reserved but idle — the node can’t use them for other pods. VPA in recommendation mode shows the right values. The platform team runs VPA and publishes right-sizing recommendations weekly.

Shared infrastructure costs (monitoring, ingress, kube-system) are split proportionally by namespace resource consumption. This prevents teams from free-riding on shared infrastructure.”
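The efficiency score and idle cost described above reduce to simple ratios; a sketch, with a hypothetical $30/core-month rate for illustration:

```python
# Efficiency = actual usage / requested resources; idle cost = what you pay
# for the gap. Kubecost reports both, but the math is worth internalizing.

def efficiency(used_cores: float, requested_cores: float) -> float:
    """Fraction of requested CPU that is actually used."""
    return used_cores / requested_cores

def idle_cost(used_cores: float, requested_cores: float,
              usd_per_core_month: float) -> float:
    """Monthly cost of cores reserved by requests but never used."""
    return (requested_cores - used_cores) * usd_per_core_month

# The pod from the text: requests 4 cores, uses 0.5, at an assumed
# $30/core-month:
print(f"{efficiency(0.5, 4):.1%}")  # 12.5%
print(idle_cost(0.5, 4, 30.0))      # 105.0
```

Multiplied across hundreds of over-requesting pods, this idle cost is usually the single largest K8s line item.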


“We spend $200K/month on EC2/Fargate compute. How much should we commit to Savings Plans?”

Strong Answer:

“Never commit 100%. Here’s my approach:

Step 1: Analyze 90 days of hourly compute spend in Cost Explorer. Identify the baseline — the minimum compute running at 3 AM on a Sunday. That’s your floor.

Step 2: Typical breakdown for enterprise:

  • Baseline (always-on): 60% = $120K/month
  • Variable (scales with traffic): 25% = $50K/month
  • Burst/ephemeral (CI/CD, batch): 15% = $30K/month

Step 3: Reserve 80% of baseline:

  • Commit: $96K/month on 1-year Compute Savings Plan (no upfront)
  • Discount: ~30% = saves ~$29K/month = $348K/year
  • Risk: minimal — we’re only committing on what we always use

Step 4: Cover variable with Spot:

  • $50K of variable workloads on Spot instances
  • Average 65% discount = saves ~$32K/month

Step 5: Leave burst as On-Demand:

  • $30K stays On-Demand (CI/CD, batch jobs, temporary scaling)
  • Flexibility to scale down if business changes
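The sizing steps above reduce to a small calculator; the fractions and discount rates are the ones used in this answer (60/25/15 split, 80% of baseline committed, ~30% Savings Plan and ~65% Spot discounts):

```python
# Savings Plan sizing per the steps above: commit to 80% of baseline,
# cover variable load with Spot, leave burst on On-Demand.

def plan_savings(monthly_spend: float,
                 baseline_frac: float = 0.60,
                 variable_frac: float = 0.25,
                 commit_frac: float = 0.80,
                 sp_discount: float = 0.30,
                 spot_discount: float = 0.65) -> dict:
    baseline = monthly_spend * baseline_frac
    variable = monthly_spend * variable_frac
    commit = baseline * commit_frac
    return {
        "commit": commit,                          # covered by Savings Plan
        "sp_savings": commit * sp_discount,        # off committed spend
        "spot_savings": variable * spot_discount,  # off variable spend
    }

r = plan_savings(200_000)
print(round(r["commit"]))        # 96000
print(round(r["sp_savings"]))    # 28800
print(round(r["spot_savings"]))  # 32500
```

Combined, that is roughly the ~$61K/month quoted in this answer; the remaining 15% burst capacity intentionally stays uncommitted.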

Why Compute Savings Plans over EC2 Reserved Instances? Savings Plans apply across instance families, regions, and OS. If we migrate from m5.xlarge to m7g.large (Graviton), the Savings Plan still applies. RIs would need to be exchanged.

Total savings: ~$61K/month = $732K/year on a $200K/month spend.”


“We run workloads on both AWS and GCP. How do you manage costs across both clouds?”

Strong Answer:

“Unified visibility is the first challenge. Here’s my approach:

1. Unified tagging/labeling:

  • Define one tagging standard that works on both clouds
  • AWS tags: team, environment, project, cost-center
  • GCP labels: same keys (GCP requires lowercase label keys, so the shared standard uses lowercase throughout)
  • Enforce via AWS SCPs and GCP org policies

2. Centralized cost dashboard:

  • AWS Cost and Usage Report (CUR) → S3 → Athena
  • GCP Billing Export → BigQuery
  • Build unified Grafana dashboard querying both sources
  • OR use a FinOps tool like CloudHealth, Vantage, or Infracost

3. Per-cloud optimization:

  • AWS: Compute Savings Plans for steady-state EC2/Fargate
  • GCP: Resource-based CUDs for steady-state GKE
  • Both: Spot/preemptible for variable workloads
  • Both: VPC endpoints / Private Google Access to reduce egress

4. Workload placement decisions:

  • Run each workload where it’s cheapest for the requirements
  • GKE Autopilot is often cheaper than EKS for variable workloads (no idle node waste)
  • BigQuery is cheaper than Redshift for ad-hoc analytics
  • AWS Lambda is cheaper than Cloud Functions for low-volume event processing

5. Kubernetes cost normalization:

  • Kubecost on both clouds with the same configuration
  • Compare cost-per-request across clouds for the same service
  • Use this data for future placement decisions”

  • FinOps Foundation — FinOps principles, framework, and community best practices
  • Kubecost Documentation — Kubernetes cost monitoring, allocation, and optimization
  • OpenCost — CNCF open-source cost monitoring for Kubernetes
  • Infracost — cost estimation for Terraform changes in CI/CD pipelines