Enterprise Platform Design
You are the central platform team at an enterprise bank. Your job is not to run applications — it is to build the platform that lets 500 developers across 50 teams ship safely, fast, and at scale. This page covers how to design, build, and operate that platform.
Where This Fits
What Is an Internal Developer Platform (IDP)?
An IDP is the set of tools, workflows, and self-service capabilities that the platform team provides to application teams. It abstracts away infrastructure complexity so developers can focus on writing code.
The Key Principle
App teams should never need to:
- Write Terraform for infrastructure
- SSH into nodes
- Create Kubernetes RBAC manifests
- Manage TLS certificates
- Set up monitoring dashboards from scratch
App teams should be able to:
- Deploy code to any environment via Git push
- Get a preview environment for every PR
- Access their logs and metrics through a portal
- Request a new namespace or database through self-service
- See their cloud costs in real time
Platform Team vs Application Team Responsibilities
Cluster Strategy
One of the most critical decisions: how many clusters, and how do you organize them?
Option 1: Cluster Per Environment (Most Common)
When to use: Standard enterprise setup. Most banks start here.
Pros: Simple blast radius (dev issues don’t affect prod). Clear promotion path.
Cons: Resource waste in dev/staging. All teams share the same cluster size limits.
Option 2: Cluster Per Team (Large Enterprises)
When to use: Regulatory requirements (PCI for payments), extreme isolation needs.
Pros: Total blast radius isolation. Teams can choose upgrade schedules. Different compliance requirements per cluster.
Cons: Expensive. More clusters to manage. Operational overhead scales linearly.
Option 3: Hybrid (Recommended for Enterprise Banks)
When to use: Most enterprises with regulatory workloads. Balances isolation against cost.
Pros: PCI workloads isolated as required. General workloads share clusters for efficiency. Data workloads get GPU/high-memory nodes.
Cons: More complexity in cluster management (3-5 clusters vs 1).
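With a hybrid fleet, cluster metadata is what lets the platform target the right baseline at the right clusters. A minimal sketch, assuming ArgoCD is the GitOps engine: each cluster is registered as a labeled Secret, and those labels drive targeting. All names, labels, and the endpoint here are illustrative:

```yaml
# Illustrative only — ArgoCD discovers clusters from Secrets labeled
# argocd.argoproj.io/secret-type: cluster. Auth config is omitted from this sketch.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-pci-payments
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    environment: prod
    compliance: pci            # isolated PCI cluster
    platform-managed: "true"   # picked up by baseline ApplicationSets
type: Opaque
stringData:
  name: pci-payments
  server: https://pci-payments.internal.example.com
```

A cluster generator can then select `compliance: pci` clusters for the stricter PCI baseline while general-purpose clusters share the standard one.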
Decision Matrix
Golden Paths and Templates
A golden path is the platform team's recommended way to do something. It is opinionated, tested, and supported. Teams can deviate, but they lose support guarantees.
What Gets Templated
Example Golden Path Helm Chart Structure
```yaml
# charts/web-service/values.yaml — what app teams customize
replicaCount: 2
image:
  repository: ""  # REQUIRED by app team
  tag: ""         # REQUIRED by app team

# Everything below has sane defaults from platform team
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi  # No CPU limits (best practice)

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70

podDisruptionBudget:
  enabled: true
  minAvailable: 1

ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/healthcheck-path: /healthz

networkPolicy:
  enabled: true  # Default deny + allow rules

serviceMonitor:
  enabled: true  # Prometheus auto-scrapes
  path: /metrics
  port: http

securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
```

Governance: Enforced vs Recommended
Rule of thumb: Enforce anything that affects security or stability. Recommend everything else and make it the path of least resistance.
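As a concrete sketch of the split, Kyverno expresses it directly through `validationFailureAction`: security rules run in Enforce mode (blocking at admission), recommendations run in Audit mode (reported, never blocking). The policy names and the specific rules below are illustrative, not a prescribed set:

```yaml
# Enforced: security-affecting, so violations are rejected at admission
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
  - name: no-privileged-containers
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Privileged containers are not allowed"
      pattern:
        spec:
          containers:
          # =() anchors make the fields optional but constrained if present
          - =(securityContext):
              =(privileged): "false"
---
# Recommended: reported in policy dashboards, never blocks a deploy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: recommend-standard-labels
spec:
  validationFailureAction: Audit
  rules:
  - name: check-app-label
    match:
      any:
      - resources:
          kinds:
          - Deployment
    validate:
      message: "Recommended: set app.kubernetes.io/name for golden dashboards"
      pattern:
        metadata:
          labels:
            app.kubernetes.io/name: "?*"
```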
Self-Service Capabilities
Namespace Provisioning
When a new team needs a namespace, they should not file a Jira ticket and wait 2 weeks. They should submit a PR or fill a form.
What Gets Provisioned Per Namespace
```yaml
# Namespace bundle — applied as a single ArgoCD Application
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    team: payments
    environment: prod
    cost-center: CC-4521
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-admin
  namespace: team-payments-prod
subjects:
- kind: Group
  name: team-payments  # Mapped from SSO group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: namespace-admin  # Platform-defined ClusterRole
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments-prod
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.memory: "32Gi"
    persistentvolumeclaims: "5"
    services.loadbalancers: "0"  # No public LBs — use Ingress
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments-prod
spec:
  limits:
  - default:
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "256Mi"
    type: Container
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - port: 53
      protocol: UDP
```

Deploy Previews
Every PR gets an ephemeral environment for testing. Platform team sets up the machinery; app teams get it automatically.
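One common way to wire this up is ArgoCD's ApplicationSet pull-request generator, which creates an Application per open PR and removes it when the PR closes. A sketch, with the organization, repository, and paths assumed:

```yaml
# Illustrative sketch — repo names, project, and paths are assumptions
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: pr-previews
  namespace: argocd
spec:
  generators:
  - pullRequest:
      github:
        owner: bank
        repo: payments-api
        labels:
        - preview            # only PRs labeled "preview" get an environment
      requeueAfterSeconds: 300
  template:
    metadata:
      name: 'payments-api-pr-{{number}}'
    spec:
      project: previews
      source:
        repoURL: https://github.com/bank/payments-api.git
        targetRevision: '{{head_sha}}'   # deploy the PR's commit
        path: deploy
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-pr-{{number}}'
      syncPolicy:
        automated:
          prune: true        # environment is deleted when the PR closes
        syncOptions:
        - CreateNamespace=true
```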
Log Access
App teams need log access without SSH. The platform provides:
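At the Kubernetes layer, log access without SSH reduces to RBAC on `pods/log`. A minimal sketch, reusing the team-payments names from the namespace bundle and assuming SSO groups map to Kubernetes groups:

```yaml
# Read-only log access: grants `kubectl logs` and nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: team-payments-prod
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-log-reader
  namespace: team-payments-prod
subjects:
- kind: Group
  name: team-payments        # Mapped from SSO group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: log-reader
  apiGroup: rbac.authorization.k8s.io
```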
Developer Portal: Backstage / Port
The developer portal is the single pane of glass for app teams. It replaces: internal wikis, spreadsheets of services, Slack "who owns this?" questions, and manual onboarding.
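In Backstage, each service registers itself through a `catalog-info.yaml` in its repository, and the portal builds the catalog from these files. A hypothetical entry (service name, system, and annotation values are illustrative):

```yaml
# catalog-info.yaml — lives in the service's own repo
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Card payment processing service
  annotations:
    backstage.io/kubernetes-id: payments-api   # links Kubernetes resources in the portal
    github.com/project-slug: bank/payments-api
spec:
  type: service
  lifecycle: production
  owner: team-payments      # answers "who owns this?" once and for all
  system: payments
```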
What the Portal Shows
Backstage vs Port
Cost Visibility: Kubecost / OpenCost
Every namespace gets cost attribution. Teams see what they spend. Platform team sees the overall picture.
How Cost Allocation Works
Kubecost vs OpenCost
Enforcing Cost Labels
```yaml
# Kyverno policy: every namespace MUST have cost-center label
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-center-label
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cost-center
    match:
      any:
      - resources:
          kinds:
          - Namespace
    validate:
      message: "Namespace must have a 'cost-center' label for chargeback"
      pattern:
        metadata:
          labels:
            cost-center: "CC-*"
```

Platform Maturity Model
Where is your platform today? Where should it be?
Standardized Cluster Configuration
Every cluster in the fleet must have the same baseline. No snowflakes.
Cluster Baseline Add-Ons
Managing Configuration Across Clusters
```yaml
# ApplicationSet that deploys baseline to ALL clusters
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-baseline
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          platform-managed: "true"
  template:
    metadata:
      name: 'baseline-{{name}}'
    spec:
      project: platform
      source:
        repoURL: git@github.com:bank/platform-baseline.git
        path: 'clusters/{{metadata.labels.environment}}'
        targetRevision: main
      destination:
        server: '{{server}}'
        namespace: kube-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

```yaml
# Cluster API — declarative cluster lifecycle
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-eu-west-1
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta2
    kind: AWSManagedControlPlane  # EKS
    name: prod-eu-west-1-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: prod-eu-west-1
---
# Managed machine pool (node group)
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: prod-eu-west-1-workers
spec:
  instanceType: m6i.2xlarge
  scaling:
    minSize: 3
    maxSize: 20
```

```yaml
# Crossplane — provision cluster as a Kubernetes resource
apiVersion: eks.aws.upbound.io/v1beta1
kind: Cluster
metadata:
  name: prod-eu-west-1
spec:
  forProvider:
    region: eu-west-1
    version: "1.29"
    vpcConfig:
    - subnetIds:
      - subnet-abc123
      - subnet-def456
      endpointPrivateAccess: true
      endpointPublicAccess: false
    encryptionConfig:
    - provider:
      - keyArn: arn:aws:kms:eu-west-1:123456789012:key/mrk-abc123
      resources:
      - secrets
```

Fleet Cluster Upgrades
Upgrading 15 clusters is not 15 individual upgrades. It is a structured rollout.
Upgrade Commands
Section titled “Upgrade Commands”# EKS control plane upgrade$ aws eks update-cluster-version --name prod-eu-west-1 --kubernetes-version 1.29
# Check upgrade status$ aws eks describe-update --name prod-eu-west-1 --update-id abc-123{ "update": { "status": "InProgress", "type": "VersionUpdate" }}
# Upgrade managed node group (rolling)$ aws eks update-nodegroup-version \ --cluster-name prod-eu-west-1 \ --nodegroup-name workers \ --kubernetes-version 1.29
# Upgrade add-ons$ aws eks update-addon --cluster-name prod-eu-west-1 \ --addon-name coredns --addon-version v1.11.1-eksbuild.1$ aws eks update-addon --cluster-name prod-eu-west-1 \ --addon-name kube-proxy --addon-version v1.29.0-eksbuild.1$ aws eks update-addon --cluster-name prod-eu-west-1 \ --addon-name vpc-cni --addon-version v1.16.0-eksbuild.1
# Check for deprecated APIs BEFORE upgrading$ kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis# Or use pluto:$ pluto detect-all-in-clusterNAME KIND VERSION REPLACEMENT REMOVEDmy-ingress Ingress networking/v1beta1 networking/v1 v1.22Scenario 1: Design a Platform for 500 Developers Across 10 Teams
Interview question: "You're hired as the lead platform engineer. The bank has 500 developers, 10 product teams, and they want to move to Kubernetes. Design the platform."
Answer Framework
Cluster architecture:
Scenario 2: Standardize Configuration Across 20 Clusters
Interview question: "You inherited 20 Kubernetes clusters across 3 regions, each configured differently. How do you standardize them?"
Answer
Tools: ArgoCD ApplicationSets (GitOps delivery), Cluster API or Crossplane (lifecycle), Kyverno (policy enforcement).
Scenario 3: Zero to Production — New Microservice
Interview question: "A team wants to deploy a new microservice. Walk me through the process from zero to production."
Scenario 4: Fleet Cluster Upgrades Across 15 Clusters
Interview question: "How do you handle Kubernetes version upgrades across a fleet of 15 clusters?"
This was covered in detail in the Fleet Cluster Upgrades section above. Key points for the interview:
- Never upgrade all at once — batch by environment, then by criticality
- Pre-upgrade checklist — deprecated APIs (pluto), add-on compatibility, Helm chart compatibility
- Canary approach — dev (2 clusters) → staging (3) → prod batch 1 (3) → prod batch 2 (3) → prod critical (4)
- Soak time — 48h between prod batches, 72h for critical
- Rollback plan — you cannot rollback EKS control plane, so the rollback plan is “fix forward” or “spin up new cluster at old version and migrate workloads”
- Automation — Terraform or Cluster API for control plane, managed node group rolling update for nodes
- Communication — notify all teams 2 weeks before their cluster is upgraded, provide testing window
Scenario 5: Observability — Platform Team vs Application Teams
Interview question: "Design observability for the platform team vs application teams. Different needs, shared infrastructure."
Separation Strategy
Implementation:
- Grafana multi-tenancy — use Grafana Organizations or Grafana Cloud stacks per team
- Prometheus label filtering — use the namespace label to restrict what each team queries
- Loki tenant ID — set Loki tenant ID per namespace via Promtail labels
- RBAC for dashboards — SSO groups mapped to Grafana orgs
- Golden dashboards — platform provides pre-built RED dashboards; teams customize from there
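The Loki tenant separation above can be sketched at the Grafana end with datasource provisioning: a per-team datasource pins the tenant through the X-Scope-OrgID header, so queries can only see that team's logs. The URL and team name here are assumptions:

```yaml
# Grafana datasource provisioning file — one per team (illustrative values)
apiVersion: 1
datasources:
- name: Loki (team-payments)
  type: loki
  access: proxy
  url: http://loki-gateway.observability.svc:3100
  jsonData:
    httpHeaderName1: X-Scope-OrgID      # Loki multi-tenancy header
  secureJsonData:
    httpHeaderValue1: team-payments     # tenant scoped to this team
```

Pair this with Grafana organizations per team so each team only sees its own datasources.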
References
- EKS Best Practices Guide — operational patterns for running EKS at scale
- GKE Enterprise (Anthos) Documentation — multi-cluster fleet management and governance
Tools & Frameworks
- CNCF Platforms Whitepaper — defining what a platform is and why organizations build them
- Backstage Documentation — open-source developer portal for building Internal Developer Platforms
- Kubecost Documentation — Kubernetes cost monitoring and optimization
- Crossplane Documentation — cloud-native control plane for infrastructure provisioning via Kubernetes APIs