FinOps for Architects
Where This Fits
As the central platform team, you design the architecture to be cost-efficient by default. Tenant teams make local decisions (instance size, replica count) within guardrails you set.
Architectural Cost Levers
The Big Five
Compute Optimization
Cost Optimization Hierarchy (AWS, most to least impactful):
1. TURN IT OFF (idle resources)
   - Dev/staging environments: scheduled shutdown (7 PM - 7 AM)
   - EventBridge Scheduler + Lambda for the dev environment scheduler
   - Savings: ~60% on non-prod compute
2. RIGHT-SIZE (oversized instances)
   - AWS Compute Optimizer recommendations
   - CloudWatch CPU/memory: if <30% average utilization, downsize
   - Example: r6g.2xlarge (8 vCPU, 64 GB) → r6g.xlarge if using 20% CPU
   - Savings: 30-50% on oversized instances
3. SPOT INSTANCES (fault-tolerant workloads)
   - EKS node groups with mixed On-Demand + Spot
   - Karpenter consolidation + Spot diversification
   - Use for: CI/CD, batch processing, stateless services
   - Savings: 60-90% vs On-Demand
4. GRAVITON (ARM-based, ~20% cheaper)
   - m7g, r7g, c7g instances
   - Most containers work on ARM without changes
   - Savings: ~20% at equal or better performance
5. SAVINGS PLANS (committed use)
   - Compute Savings Plans: 1yr no-upfront = 20-30% savings; 3yr all-upfront = 50-60% savings
   - Apply to steady-state workloads AFTER right-sizing

```yaml
# Karpenter NodePool with cost optimization
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        # Graviton (ARM) preferred for 20% cost savings
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        # Mix instance types for Spot availability
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m7g.large
            - m7g.xlarge
            - m6g.large
            - m6g.xlarge
            - c7g.large
            - c7g.xlarge
        # Use Spot for non-critical workloads
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "200"
    memory: 800Gi
```

Cost Optimization Hierarchy (GCP, most to least impactful):
1. TURN IT OFF
   - GKE cluster scheduling (scale node pools to 0 for non-prod)
   - Cloud Scheduler + Cloud Functions for environment shutdown
   - Savings: ~60% on non-prod
2. RIGHT-SIZE
   - GKE recommender (built-in VPA recommendations)
   - Active Assist recommendations for VM right-sizing
   - Savings: 30-50%
3. SPOT VMs (preemptible-style, 60-91% discount)
   - GKE Spot node pools
   - Best for: batch, CI/CD, stateless services
   - Savings: 60-91% vs on-demand
4. COMMITTED USE DISCOUNTS (CUDs)
   - Resource-based CUDs: commit CPU/memory (1yr: 37%, 3yr: 55%)
   - Spend-based CUDs: commit $/hr (1yr: 25%, 3yr: 52%)
   - Apply to steady-state GKE workloads
5. AUTOPILOT
   - Google manages nodes with optimized bin-packing
   - No idle node capacity (you pay per pod, not per node)
   - Often 20-40% cheaper than Standard for variable workloads

```hcl
# GKE node pool with Spot VMs
resource "google_container_node_pool" "spot_pool" {
  name     = "spot-general"
  cluster  = google_container_cluster.main.name
  location = var.region

  autoscaling {
    min_node_count = 0
    max_node_count = 50
  }

  node_config {
    machine_type = "e2-standard-4"
    spot         = true

    labels = {
      "node-type" = "spot"
    }

    taint {
      key    = "cloud.google.com/gke-spot"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}
```

Tagging Strategy
Mandatory Tags
| Tag Key | Example Value | Purpose |
|---|---|---|
| team | payments | Cost attribution to team |
| environment | production | Cost by environment |
| project | mobile-banking | Cost by project/product |
| cost-center | CC-4521 | Finance mapping |
| managed-by | terraform | Infrastructure management |
| data-classification | confidential | Security/compliance |
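Beyond creation-time enforcement, tag compliance is worth auditing periodically. A minimal sketch of such an audit, assuming resources arrive as plain dicts of tags (a simplified stand-in for what a cloud inventory API would return); the required set mirrors the cost-attribution tags in the table above:

```python
# Audit resources for the mandatory cost-allocation tags.
# REQUIRED_TAGS mirrors the table above (cost tags only).
REQUIRED_TAGS = {"team", "environment", "project", "cost-center"}


def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Return the mandatory tags absent from a resource (empty set == compliant)."""
    present = {k for k, v in resource_tags.items() if v}  # an empty value counts as missing
    return REQUIRED_TAGS - present


def compliance_report(resources: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map resource id -> missing tags, listing only non-compliant resources."""
    return {rid: gaps for rid, tags in resources.items() if (gaps := missing_tags(tags))}


if __name__ == "__main__":
    inventory = {
        "i-0abc": {"team": "payments", "environment": "production",
                   "project": "mobile-banking", "cost-center": "CC-4521"},
        "i-0def": {"team": "payments", "environment": ""},
    }
    print(compliance_report(inventory))
```

A report like this feeds the "typically 10-20% of spend is unknown" discovery step described in the interview scenarios below.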
Tag Enforcement with SCPs
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedResources",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "rds:CreateDBCluster",
        "elasticache:CreateReplicationGroup",
        "ecs:CreateCluster",
        "eks:CreateCluster",
        "lambda:CreateFunction",
        "s3:CreateBucket"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true",
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
```

```hcl
# Terraform: attach SCP to Workloads OU
resource "aws_organizations_policy" "require_tags" {
  name        = "require-cost-tags"
  description = "Deny resource creation without required cost tags"
  type        = "SERVICE_CONTROL_POLICY"
  content     = file("policies/require-tags.json")
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.require_tags.id
  target_id = aws_organizations_organizational_unit.workloads.id
}
```

GCP has no native label enforcement via org policies; use Config Connector or a custom org policy constraint with tags.
```hcl
# Option 1: Custom constraint (tag-based, newer approach)
resource "google_org_policy_custom_constraint" "require_labels" {
  name         = "custom.requireCostLabels"
  parent       = "organizations/${var.org_id}"
  display_name = "Require cost labels on all resources"
  description  = "All resources must have team, environment, and cost-center labels"

  action_type  = "DENY"
  condition    = "!resource.matchLabels('team', '*') || !resource.matchLabels('environment', '*')"
  method_types = ["CREATE"]

  resource_types = [
    "compute.googleapis.com/Instance",
    "container.googleapis.com/Cluster",
    "sqladmin.googleapis.com/Instance",
  ]
}
```
```hcl
# Option 2: Billing budgets per project (cost guardrails)
resource "google_billing_budget" "team_budget" {
  for_each = var.team_projects

  billing_account = var.billing_account_id
  display_name    = "${each.key}-monthly-budget"

  budget_filter {
    projects = ["projects/${each.value.project_id}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = each.value.monthly_budget
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # Alert at 50%
    spend_basis       = "CURRENT_SPEND"
  }
  threshold_rules {
    threshold_percent = 0.8 # Alert at 80%
  }
  threshold_rules {
    threshold_percent = 1.0 # Alert at 100%
  }

  all_updates_rule {
    monitoring_notification_channels = [each.value.notification_channel]
    disable_default_iam_recipients   = true
  }
}
```

Savings Plans vs Reserved Instances vs CUDs
Comparison
| Feature | Savings Plans (AWS) | Reserved Instances (AWS) | CUDs (GCP) |
|---|---|---|---|
| Commitment | $/hour | Instance type + region | CPU/memory or $/hour |
| Flexibility | Any instance family, region, OS | Standard: fixed instance type; Convertible: exchangeable | Resource-based or spend-based |
| Discount | 20-30% (1yr), 50-60% (3yr) | 30-40% (1yr), 50-60% (3yr) | 37% (1yr), 55% (3yr) |
| Covers K8s | Yes (any compute) | Only if matching instance type | Yes |
| Best for | Dynamic workloads | Predictable, fixed workloads | GKE steady-state |
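One nuance the table omits: a commitment only saves money while it is actually used. A hedged sketch of 1-year commitment economics (the ~30% rate comes from the table; the break-even rule is simple algebra, not an official AWS formula):

```python
# Sketch of 1-year Savings Plan economics (illustrative, not an AWS calculator).
# commit_hr: committed $/hr of discounted spend; discount: e.g. 0.30 for ~30%.
# usage_hr: on-demand-equivalent $/hr of eligible compute actually running.

def monthly_cost(usage_hr: float, commit_hr: float, discount: float, hours: int = 730) -> float:
    covered_capacity = commit_hr / (1 - discount)    # on-demand $ the commitment can absorb
    overflow = max(0.0, usage_hr - covered_capacity)  # anything above spills to on-demand rates
    return (commit_hr + overflow) * hours


def breakeven_utilization(discount: float) -> float:
    # The plan saves money only while covered usage exceeds the commitment itself,
    # so utilization of the covered capacity must exceed (1 - discount).
    return 1 - discount


if __name__ == "__main__":
    no_plan = 100 * 730                      # $100/hr fully on-demand
    with_plan = monthly_cost(100, 70, 0.30)  # commit $70/hr at a 30% discount
    print(no_plan, with_plan, breakeven_utilization(0.30))
```

At a 30% discount you need more than 70% utilization of the covered capacity before the plan beats pure on-demand, which is why the scenarios below commit only to a fraction of the always-on baseline.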
Sizing Strategy
Kubecost for Kubernetes Cost Attribution
Architecture
Kubecost Deployment

```yaml
# Helm values for Kubecost
# helm install kubecost cost-analyzer/cost-analyzer -f values.yaml
kubecostProductConfigs:
  clusterName: "prod-us-east-1"
  currencyCode: "USD"
  # Shared cost allocation (platform overhead split across namespaces)
  sharedNamespaces: "monitoring,kube-system,istio-system,ingress-nginx"
  sharedOverhead: "500" # $500/month for shared infra split equally

global:
  prometheus:
    enabled: false # Use existing VictoriaMetrics
    fqdn: "http://vmselect.monitoring:8481/select/0/prometheus"

  grafana:
    enabled: false # Use existing Grafana
    proxy: false

# Integration with cloud billing
cloudCost:
  enabled: true
  # AWS: CUR (Cost and Usage Report) in S3
  # GCP: BigQuery billing export
```

Cost Allocation Report Example
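As an illustration of what such a report computes: each namespace's total is its direct cost plus a share of platform overhead. A minimal sketch with made-up namespace costs (the $500 overhead matches the sharedOverhead value in the Helm config; Kubecost itself supports both equal and proportional splits):

```python
# Sketch of a namespace cost-allocation report (illustrative numbers, not
# real Kubecost output). direct: per-namespace compute cost from Kubecost;
# shared: platform overhead (the sharedOverhead value in the Helm config).

def allocate(direct: dict[str, float], shared: float, proportional: bool = True) -> dict[str, float]:
    """Return total cost per namespace: direct cost plus a share of overhead."""
    if proportional:
        total = sum(direct.values())
        return {ns: c + shared * (c / total) for ns, c in direct.items()}
    per_ns = shared / len(direct)  # equal split, as in the Helm comment above
    return {ns: c + per_ns for ns, c in direct.items()}


if __name__ == "__main__":
    direct = {"payments": 3000.0, "search": 1500.0, "batch": 500.0}
    print(allocate(direct, shared=500.0))
```

The proportional variant matches the approach described in Scenario 2 below: heavy consumers absorb more of the shared infrastructure cost, so nobody free-rides.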
Data Transfer Cost Optimization

```hcl
# VPC endpoints to avoid NAT Gateway costs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```

Interview Scenarios
Scenario 1: Reduce AWS Bill by 40%
“The CTO says our AWS bill is $500K/month and wants it at $300K. What do you do?”
Strong Answer:
“$200K reduction requires systematic optimization across all cost categories. Here’s my 90-day plan:
Week 1-2: Discovery (identify the $200K)
- Enable AWS Cost Explorer with hourly granularity
- Tag analysis: find untagged resources (typically 10-20% of spend is ‘unknown’)
- Run AWS Compute Optimizer for right-sizing recommendations
- Deploy Kubecost to attribute K8s costs to namespaces
Quick wins (Month 1: target $80K savings)
- Shut down idle resources: Dev/staging environments running 24/7 → schedule 12 hrs/day = save 50% of non-prod compute. Typical impact: $30-40K.
- Right-size oversized instances: Compute Optimizer says 40% of instances can downsize by 50%. Impact: $20-30K.
- Delete unused resources: Unattached EBS volumes, old snapshots, unused Elastic IPs, idle NAT Gateways. Impact: $10-15K.
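The first quick win is usually a small scheduled Lambda that stops and starts instances by tag. A sketch of just the decision logic, under assumed conventions (an environment tag and a 7 AM - 7 PM weekday window; the actual stop/start calls would go through boto3). Keeping non-prod down on weekends also pushes the 50% weekday figure toward the ~60% cited earlier:

```python
# Decide whether a non-prod instance should be running at a given hour.
# Assumed convention: environment tag in {"dev", "staging"} follows the
# 7 AM - 7 PM weekday business-hours window; everything else runs 24/7.
NON_PROD = {"dev", "staging"}
START_HOUR, STOP_HOUR = 7, 19  # 7 AM - 7 PM local time


def should_run(environment: str, hour: int, weekday: int) -> bool:
    """weekday follows datetime.weekday(): Monday=0 .. Sunday=6."""
    if environment not in NON_PROD:
        return True                  # production is never scheduled off
    if weekday >= 5:                 # Saturday/Sunday: keep non-prod down
        return False
    return START_HOUR <= hour < STOP_HOUR


def weekly_savings_fraction() -> float:
    """Fraction of non-prod compute hours eliminated by the schedule."""
    running = 5 * (STOP_HOUR - START_HOUR)  # 5 weekdays x 12 hours
    return 1 - running / (7 * 24)


if __name__ == "__main__":
    print(should_run("dev", 22, 2), round(weekly_savings_fraction(), 2))
```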
Medium-term (Month 2: target $70K savings)
4. Spot instances for K8s: Move stateless workloads to Spot nodes (60-70% discount). Impact: $30-40K.
5. Graviton migration: Switch to ARM instances (20% cheaper, 10% faster). Impact: $15-20K.
6. VPC endpoints: S3, ECR, STS endpoints to avoid NAT Gateway processing fees. Impact: $5-10K.
Long-term (Month 3: target $50K savings)
7. Compute Savings Plans: After right-sizing stabilizes, commit to a 1-year no-upfront Savings Plan covering 70% of baseline. Impact: $30-40K.
8. Observability cost: Migrate from Datadog to open-source (Grafana stack). Impact: $10-15K.
Total estimated savings: $200K/month achieved in 90 days.”
Scenario 2: Kubernetes Cost Attribution
“Engineering says ‘Kubernetes is expensive’ but nobody knows which team is consuming what. How do you fix this?”
Strong Answer:
“This is a visibility problem, not a Kubernetes problem. Three-step approach:
1. Deploy Kubecost pointing at our existing VictoriaMetrics. Within a day, we have cost per namespace, per deployment, per pod. The key metrics: CPU cost, memory cost, storage cost, idle cost (requested but unused).

2. Enforce namespace-per-team: If teams share namespaces, cost attribution is impossible. Our platform already uses namespace-per-team with ResourceQuotas. Each namespace has a team and cost-center label.

3. Weekly cost report: Automated report to each team lead showing:
- Their namespace cost this week vs last week
- Top 3 optimization recommendations (from Kubecost)
- Efficiency score (actual usage / requested resources)
- Cluster-wide idle cost they’re contributing to
The key insight: Most K8s cost waste comes from over-requesting resources. A pod requesting 4 CPU cores but using 0.5 means 3.5 cores are reserved but idle — the node can’t use them for other pods. VPA in recommendation mode shows the right values. The platform team runs VPA and publishes right-sizing recommendations weekly.
Shared infrastructure costs (monitoring, ingress, kube-system) are split proportionally by namespace resource consumption. This prevents teams from free-riding on shared infrastructure.”
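The over-request example above is easy to quantify. A minimal sketch of the efficiency score and idle-cost figures from the weekly report, assuming a made-up blended rate of $0.04 per core-hour:

```python
# Idle cost from over-requesting: requested-but-unused CPU is reserved on the
# node and unavailable to other pods. The rate is an assumed blended $/core-hour.

def efficiency(used_cores: float, requested_cores: float) -> float:
    """Actual usage / requested resources (the report's efficiency score)."""
    return used_cores / requested_cores


def idle_cost_per_month(used_cores: float, requested_cores: float,
                        dollars_per_core_hour: float = 0.04, hours: int = 730) -> float:
    idle = max(0.0, requested_cores - used_cores)
    return idle * dollars_per_core_hour * hours


if __name__ == "__main__":
    # The example from the text: a pod requesting 4 cores but using 0.5.
    print(efficiency(0.5, 4.0), idle_cost_per_month(0.5, 4.0))
```

Even at a modest per-core rate, a single chronically over-requested pod wastes about a hundred dollars a month, which is why VPA-driven right-sizing recommendations are the highest-leverage output of the report.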
Scenario 3: Savings Plans Strategy
“We spend $200K/month on EC2/Fargate compute. How much should we commit to Savings Plans?”
Strong Answer:
“Never commit 100%. Here’s my approach:
Step 1: Analyze 90 days of hourly compute spend in Cost Explorer. Identify the baseline — the minimum compute running at 3 AM on a Sunday. That’s your floor.
Step 2: Typical breakdown for enterprise:
- Baseline (always-on): 60% = $120K/month
- Variable (scales with traffic): 25% = $50K/month
- Burst/ephemeral (CI/CD, batch): 15% = $30K/month
Step 3: Reserve 80% of baseline:
- Commit: $96K/month on 1-year Compute Savings Plan (no upfront)
- Discount: ~30% = saves ~$29K/month = $348K/year
- Risk: minimal — we’re only committing on what we always use
Step 4: Cover variable with Spot:
- $50K of variable workloads on Spot instances
- Average 65% discount = saves ~$32K/month
Step 5: Leave burst as On-Demand:
- $30K stays On-Demand (CI/CD, batch jobs, temporary scaling)
- Flexibility to scale down if business changes
Why Compute Savings Plans over EC2 Reserved Instances? Savings Plans apply across instance families, regions, and OS. If we migrate from m5.xlarge to m7g.large (Graviton), the Savings Plan still applies; a Standard RI would stop matching, and a Convertible RI would need to be exchanged.
Total savings: ~$61K/month = $732K/year on a $200K/month spend.”
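The dollar figures in this answer follow from a few lines of arithmetic (rates as assumed above; the prose rounds the annual figures slightly differently):

```python
# Sanity-check the Savings Plan sizing arithmetic from the answer above.
monthly_spend = 200_000

baseline = monthly_spend * 0.60  # always-on floor: $120K
commit = baseline * 0.80         # commit 80% of baseline: $96K
sp_savings = commit * 0.30       # ~30% discount on the committed spend

variable = monthly_spend * 0.25  # traffic-driven: $50K
spot_savings = variable * 0.65   # ~65% average Spot discount

total = sp_savings + spot_savings
print(round(sp_savings), round(spot_savings), round(total), round(total * 12))
```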
Scenario 4: Multi-Cloud Cost Management
“We run workloads on both AWS and GCP. How do you manage costs across both clouds?”
Strong Answer:
“Unified visibility is the first challenge. Here’s my approach:
1. Unified tagging/labeling:
- Define one tagging standard that works on both clouds
- AWS tags: team, environment, project, cost-center
- GCP labels: same keys (GCP labels are lowercase-only, so keep naming consistent)
- Enforce via AWS SCPs and GCP org policies
2. Centralized cost dashboard:
- AWS Cost and Usage Report (CUR) → S3 → Athena
- GCP Billing Export → BigQuery
- Build unified Grafana dashboard querying both sources
- OR use a FinOps tool like CloudHealth, Vantage, or Infracost
3. Per-cloud optimization:
- AWS: Compute Savings Plans for steady-state EC2/Fargate
- GCP: Resource-based CUDs for steady-state GKE
- Both: Spot/preemptible for variable workloads
- Both: VPC endpoints / Private Google Access to reduce egress
4. Workload placement decisions:
- Run each workload where it’s cheapest for the requirements
- GKE Autopilot is often cheaper than EKS for variable workloads (no idle node waste)
- BigQuery is cheaper than Redshift for ad-hoc analytics
- AWS Lambda is cheaper than Cloud Functions for low-volume event processing
5. Kubernetes cost normalization:
- Kubecost on both clouds with the same configuration
- Compare cost-per-request across clouds for the same service
- Use this data for future placement decisions”
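Step 2's unified view requires normalizing the two billing schemas into one. A sketch of that mapping; the field names are simplified stand-ins, not the actual CUR or BigQuery export column names, which have many more columns:

```python
# Normalize AWS CUR and GCP billing-export rows into one schema so a single
# dashboard can group cost by team across both clouds. Field names here are
# simplified stand-ins for the real export columns.

def from_aws(row: dict) -> dict:
    return {
        "cloud": "aws",
        "service": row["product_code"],
        "team": row.get("resource_tags", {}).get("team", "untagged"),
        "cost_usd": float(row["unblended_cost"]),
    }


def from_gcp(row: dict) -> dict:
    return {
        "cloud": "gcp",
        "service": row["service_description"],
        "team": row.get("labels", {}).get("team", "untagged"),
        "cost_usd": float(row["cost"]),
    }


def cost_by_team(aws_rows: list[dict], gcp_rows: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for r in [from_aws(x) for x in aws_rows] + [from_gcp(x) for x in gcp_rows]:
        totals[r["team"]] = totals.get(r["team"], 0.0) + r["cost_usd"]
    return totals


if __name__ == "__main__":
    aws = [{"product_code": "AmazonEC2", "resource_tags": {"team": "payments"},
            "unblended_cost": "120.5"}]
    gcp = [{"service_description": "Compute Engine", "labels": {"team": "payments"},
            "cost": 80.0}]
    print(cost_by_team(aws, gcp))
```

The "untagged" bucket is the point: it surfaces exactly the spend that the unified tagging standard in step 1 has not yet captured.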
References
Section titled “References”- AWS Cost Optimization Pillar — Well-Architected Framework cost optimization best practices
- AWS Savings Plans User Guide — Compute and EC2 Savings Plans, sizing, and recommendations
- AWS Cost Explorer — cost visualization, forecasting, and rightsizing recommendations
- AWS Compute Optimizer — instance right-sizing recommendations
- GCP Cost Management Documentation — billing reports, budgets, and cost optimization
- Committed Use Discounts — resource-based and spend-based CUDs
- GCP Active Assist Recommender — right-sizing, idle resource, and commitment recommendations
Tools & Frameworks
Section titled “Tools & Frameworks”- FinOps Foundation — FinOps principles, framework, and community best practices
- Kubecost Documentation — Kubernetes cost monitoring, allocation, and optimization
- OpenCost — CNCF open-source cost monitoring for Kubernetes
- Infracost — cost estimation for Terraform changes in CI/CD pipelines