Enterprise Platform Design

You are the central platform team at an enterprise bank. Your job is not to run applications — it is to build the platform that lets 500 developers across 50 teams ship safely, fast, and at scale. This page covers how to design, build, and operate that platform.

Enterprise Platform — Where This Fits

What Is an Internal Developer Platform (IDP)?

An IDP is the set of tools, workflows, and self-service capabilities that the platform team provides to application teams. It abstracts away infrastructure complexity so developers can focus on writing code.

Internal Developer Platform Architecture

App teams should never need to:

  • Write Terraform for infrastructure
  • SSH into nodes
  • Create Kubernetes RBAC manifests
  • Manage TLS certificates
  • Set up monitoring dashboards from scratch

App teams should be able to:

  • Deploy code to any environment via Git push
  • Get a preview environment for every PR
  • Access their logs and metrics through a portal
  • Request a new namespace or database through self-service
  • See their cloud costs in real time

Platform Team vs Application Team Responsibilities

Platform Team vs Application Team Responsibilities


One of the most critical decisions: how many clusters, and how do you organize them?

Option 1: Cluster Per Environment (Most Common)

Cluster Per Environment Strategy

When to use: Standard enterprise setup. Most banks start here.

Pros: Simple blast radius (dev issues don’t affect prod). Clear promotion path.

Cons: Resource waste in dev/staging. All teams share the same cluster size limits.

Option 2: Cluster Per Team (Large Enterprises)

Cluster Per Team Strategy

When to use: Regulatory requirements (PCI for payments), extreme isolation needs.

Pros: Total blast radius isolation. Teams can choose upgrade schedules. Different compliance requirements per cluster.

Cons: Expensive. More clusters to manage. Operational overhead scales linearly.

Option 3: Hybrid (Recommended for Enterprise Banks)

Enterprise Cluster Strategy

When to use: Most enterprises with regulatory workloads. Balance between isolation and cost.

Pros: PCI workloads isolated as required. General workloads share clusters for efficiency. Data workloads get GPU/high-memory nodes.

Cons: More complexity in cluster management (3-5 clusters vs 1).

Cluster Strategy Decision Matrix


A golden path is the platform team’s recommended way to do something. It is opinionated, tested, and supported. Teams can deviate, but they lose support guarantees.

Golden Path Templates

# charts/web-service/values.yaml — what app teams customize
replicaCount: 2
image:
  repository: "" # REQUIRED by app team
  tag: ""        # REQUIRED by app team
# Everything below has sane defaults from platform team
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi # No CPU limits (best practice)
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
podDisruptionBudget:
  enabled: true
  minAvailable: 1
ingress:
  enabled: true
  className: alb
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
networkPolicy:
  enabled: true # Default deny + allow rules
serviceMonitor:
  enabled: true # Prometheus auto-scrapes
  path: /metrics
  port: http
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
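
App teams consume the chart by overriding only the fields they own. A minimal sketch of a per-service override file; the image coordinates and replica numbers are hypothetical:

```yaml
# my-service/values-prod.yaml — hypothetical app-team override
image:
  repository: registry.bank.internal/payments/my-service # hypothetical registry
  tag: "1.4.2"
replicaCount: 3
autoscaling:
  maxReplicas: 20 # team expects higher peak traffic than the default
```

Everything not overridden (security context, network policy, ServiceMonitor, PDB) comes from the platform defaults above, which is what makes the path "golden".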

Enforced vs Recommended Governance

Rule of thumb: Enforce anything that affects security or stability. Recommend everything else and make it the path of least resistance.
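
This split maps directly onto policy-engine modes: in Kyverno, `validationFailureAction: Enforce` blocks the request while `Audit` only reports it. A sketch of a recommended-but-not-blocking policy; the policy name and label are illustrative, not part of this platform:

```yaml
# Illustrative Kyverno policy in Audit mode: violations are reported
# in PolicyReports but the Deployment is still admitted.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: recommend-team-label # illustrative name
spec:
  validationFailureAction: Audit # recommend, do not enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Deployments should carry a 'team' label"
        pattern:
          metadata:
            labels:
              team: "?*" # any non-empty value
```

Security-relevant policies (like the cost-center policy later on this page) use `Enforce` instead.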


When a new team needs a namespace, they should not have to file a Jira ticket and wait two weeks. They should submit a PR or fill out a form.

Team Onboarding Flow

# Namespace bundle — applied as a single ArgoCD Application
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments-prod
  labels:
    team: payments
    environment: prod
    cost-center: CC-4521
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-payments-admin
  namespace: team-payments-prod
subjects:
  - kind: Group
    name: team-payments # Mapped from SSO group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: namespace-admin # Platform-defined ClusterRole
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments-prod
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.memory: "32Gi"
    persistentvolumeclaims: "5"
    services.loadbalancers: "0" # No public LBs — use Ingress
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments-prod
spec:
  limits:
    - default:
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      type: Container
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-payments-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
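
The bundle above is delivered as the single ArgoCD Application its comment mentions. A sketch of what that Application could look like; the repo URL and path are hypothetical:

```yaml
# Hypothetical wrapper Application: one PR adding teams/payments/prod
# to this repo onboards the team end to end.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: onboard-team-payments-prod
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: git@github.com:bank/team-onboarding.git # hypothetical repo
    path: teams/payments/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true    # deleting the folder offboards the team
      selfHeal: true # manual drift is reverted
```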

Every PR gets an ephemeral environment for testing. Platform team sets up the machinery; app teams get it automatically.

Deploy Preview Flow
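
One common way to build this machinery is ArgoCD's pull request generator, which stamps out an Application per open PR and removes it when the PR closes. A sketch assuming GitHub; the owner, repo, and chart path are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-api-previews # hypothetical service
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: bank
          repo: payments-api
          tokenRef:
            secretName: github-token # hypothetical secret
            key: token
        requeueAfterSeconds: 60 # poll for new/closed PRs
  template:
    metadata:
      name: 'preview-pr-{{number}}'
    spec:
      project: team-payments
      source:
        repoURL: git@github.com:bank/payments-api.git
        targetRevision: '{{head_sha}}' # deploy the PR's commit
        path: deploy/preview
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-pr-{{number}}'
      syncPolicy:
        automated:
          prune: true # environment disappears when the PR closes
        syncOptions:
          - CreateNamespace=true
```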

App teams need log access without SSH. The platform provides:

Developer Log Access


The developer portal is the single pane of glass for app teams. It replaces: internal wikis, spreadsheets of services, Slack “who owns this?” questions, and manual onboarding.

Developer Portal — Backstage / Port

Backstage vs Port


Every namespace gets cost attribution. Teams see what they spend. Platform team sees the overall picture.

Cluster Cost Allocation Pipeline

Kubecost vs OpenCost

# Kyverno policy: every namespace MUST have cost-center label
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-center-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-cost-center
      match:
        any:
          - resources:
              kinds:
                - Namespace
      validate:
        message: "Namespace must have a 'cost-center' label for chargeback"
        pattern:
          metadata:
            labels:
              cost-center: "CC-*"

Where is your platform today? Where should it be?

Platform Maturity Model


Every cluster in the fleet must have the same baseline. No snowflakes.

Cluster Baseline Add-Ons

# ApplicationSet that deploys baseline to ALL clusters
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-baseline
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            platform-managed: "true"
  template:
    metadata:
      name: 'baseline-{{name}}'
    spec:
      project: platform
      source:
        repoURL: git@github.com:bank/platform-baseline.git
        path: 'clusters/{{metadata.labels.environment}}'
        targetRevision: main
      destination:
        server: '{{server}}'
        namespace: kube-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
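
The `clusters` generator matches labels on ArgoCD's cluster-registration Secrets, so joining the fleet is just a matter of labelling the cluster's Secret. A sketch with illustrative names and endpoints:

```yaml
# Illustrative ArgoCD cluster Secret: the two custom labels make this
# cluster eligible for the baseline ApplicationSet above.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-eu-west-1 # illustrative
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    platform-managed: "true" # matched by the generator's selector
    environment: prod        # substituted into the chart path
type: Opaque
stringData:
  name: prod-eu-west-1
  server: https://prod-eu-west-1.example.internal # illustrative endpoint
  config: |
    { "bearerToken": "<redacted>" }
```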

Upgrading 15 clusters is not 15 individual upgrades. It is a structured rollout.

Fleet Upgrade Strategy

Terminal window

# Check for deprecated APIs BEFORE upgrading
$ kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
# Or use pluto:
$ pluto detect-all-in-cluster
NAME        KIND     VERSION             REPLACEMENT    REMOVED
my-ingress  Ingress  networking/v1beta1  networking/v1  v1.22

# EKS control plane upgrade
$ aws eks update-cluster-version --name prod-eu-west-1 --kubernetes-version 1.29

# Check upgrade status
$ aws eks describe-update --name prod-eu-west-1 --update-id abc-123
{
  "update": {
    "status": "InProgress",
    "type": "VersionUpdate"
  }
}

# Upgrade managed node group (rolling)
$ aws eks update-nodegroup-version \
    --cluster-name prod-eu-west-1 \
    --nodegroup-name workers \
    --kubernetes-version 1.29

# Upgrade add-ons
$ aws eks update-addon --cluster-name prod-eu-west-1 \
    --addon-name coredns --addon-version v1.11.1-eksbuild.1
$ aws eks update-addon --cluster-name prod-eu-west-1 \
    --addon-name kube-proxy --addon-version v1.29.0-eksbuild.1
$ aws eks update-addon --cluster-name prod-eu-west-1 \
    --addon-name vpc-cni --addon-version v1.16.0-eksbuild.1

Scenario 1: Design a Platform for 500 Developers Across 10 Teams

Interview question: “You’re hired as the lead platform engineer. The bank has 500 developers, 10 product teams, and they want to move to Kubernetes. Design the platform.”

Platform Design — Phased Rollout

Cluster architecture: Cluster Architecture — AWS Organization


Scenario 2: Standardize Configuration Across 20 Clusters

Interview question: “You inherited 20 Kubernetes clusters across 3 regions, each configured differently. How do you standardize them?”

Standardize 20 Clusters — Steps

Tools: ArgoCD ApplicationSets (GitOps delivery), Cluster API or Crossplane (lifecycle), Kyverno (policy enforcement).


Scenario 3: Zero to Production — New Microservice

Interview question: “A team wants to deploy a new microservice. Walk me through the process from zero to production.”

Zero to Production — New Microservice


Scenario 4: Fleet Cluster Upgrades Across 15 Clusters

Interview question: “How do you handle Kubernetes version upgrades across a fleet of 15 clusters?”

This was covered in detail in the Fleet Upgrade Strategy section above. Key points for the interview:

  1. Never upgrade all at once — batch by environment, then by criticality
  2. Pre-upgrade checklist — deprecated APIs (pluto), add-on compatibility, Helm chart compatibility
  3. Canary approach — dev (2 clusters) → staging (3) → prod batch 1 (3) → prod batch 2 (3) → prod critical (4)
  4. Soak time — 48h between prod batches, 72h for critical
  5. Rollback plan — you cannot rollback EKS control plane, so the rollback plan is “fix forward” or “spin up new cluster at old version and migrate workloads”
  6. Automation — Terraform or Cluster API for control plane, managed node group rolling update for nodes
  7. Communication — notify all teams 2 weeks before their cluster is upgraded, provide testing window
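
The batch-and-soak order above can be sketched as a simple driver script. Cluster names are hypothetical, non-prod soak times are illustrative, and the real upgrade call is left as a comment:

```shell
#!/usr/bin/env sh
# Sketch of the canary rollout: dev -> staging -> prod batches -> critical,
# with soak time between batches. Cluster names are hypothetical.
set -eu

upgrade_batch() {
  batch=$1; soak_hours=$2; shift 2
  for cluster in "$@"; do
    # The real call would be something like:
    #   aws eks update-cluster-version --name "$cluster" --kubernetes-version 1.29
    echo "upgrade ${batch}: ${cluster}"
  done
  echo "soak ${soak_hours}h before next batch"
}

upgrade_batch dev            0  dev-1 dev-2
upgrade_batch staging       48  staging-1 staging-2 staging-3
upgrade_batch prod-batch-1  48  prod-a prod-b prod-c
upgrade_batch prod-batch-2  48  prod-d prod-e prod-f
upgrade_batch prod-critical 72  prod-core-1 prod-core-2 prod-core-3 prod-core-4
```

In practice the soak step is a gate (error-budget and alert review), not a sleep, and the batch membership comes from the fleet inventory rather than a hard-coded list.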

Scenario 5: Observability — Platform Team vs Application Teams

Interview question: “Design observability for the platform team vs application teams. Different needs, shared infrastructure.”

Observability Architecture

Observability Separation — Platform vs App Teams

Implementation:

  • Grafana multi-tenancy — use Grafana Organizations or Grafana Cloud stacks per team
  • Prometheus label filtering — use namespace label to restrict what each team queries
  • Loki tenant ID — set Loki tenant ID per namespace via Promtail labels
  • RBAC for dashboards — SSO groups mapped to Grafana orgs
  • Golden dashboards — platform provides pre-built RED dashboards; teams customize from there
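
The per-namespace Loki tenant routing in the list above can be implemented with Promtail's `tenant` pipeline stage, which reads the tenant ID from a stream label. A minimal sketch, assuming standard Kubernetes service discovery; the job name is illustrative:

```yaml
# promtail-config.yaml (fragment): route each pod's logs to a Loki
# tenant named after its namespace.
scrape_configs:
  - job_name: kubernetes-pods # illustrative
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Expose the pod's namespace as a stream label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      - tenant:
          label: namespace # becomes the X-Scope-OrgID sent to Loki
```

With this in place, a team granted tenant `team-payments-prod` can query only its own namespace's logs, which is the isolation boundary the RBAC and SSO mapping above rely on.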