
Metrics & Monitoring

Observability platform architecture — VictoriaMetrics, Loki, Tempo feeding Grafana, with workload clusters

The central infra team operates the metrics platform. Tenant teams get Grafana dashboards scoped to their namespaces, and SLO-based alerting out of the box.


Central Observability Platform — VictoriaMetrics, Loki, Tempo, Grafana

Per-cluster vmagent collecting metrics and shipping via remote_write to central VictoriaMetrics and Grafana


Prometheus server architecture — scraper, TSDB, rule engine, PromQL, Alertmanager, and scrape targets

| Type | Description | Example |
| --- | --- | --- |
| Counter | Monotonically increasing | http_requests_total |
| Gauge | Can go up or down | node_memory_available_bytes |
| Histogram | Observations in buckets | http_request_duration_seconds |
| Summary | Pre-calculated quantiles | go_gc_duration_seconds |
```promql
# Request rate per second (5-minute window)
rate(http_requests_total{job="api"}[5m])

# P99 latency from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="api"}[5m]))

# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# CPU usage per pod
sum(rate(container_cpu_usage_seconds_total{namespace="payments"}[5m])) by (pod)

# Memory usage percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Top 5 pods by memory usage
topk(5, container_memory_working_set_bytes{namespace="production"})

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
```

| Feature | Prometheus | VictoriaMetrics |
| --- | --- | --- |
| Memory usage | High (all in-memory) | 7x less (efficient compression) |
| Storage | Local TSDB only | Local + S3/GCS (built-in) |
| HA/Clustering | Needs Thanos/Mimir | Native cluster mode |
| Ingestion rate | 500K samples/s | 10M+ samples/s |
| Compression | ~1.3 bytes/sample | ~0.4 bytes/sample |
| Query language | PromQL | MetricsQL (PromQL superset) |
| Remote write | Supported | Optimized receiver |
| Downsampling | Manual | Automatic (Enterprise) |

VictoriaMetrics cluster architecture — write path (vmagent, vminsert, vmstorage shards) and read path (vmselect, Grafana)

```yaml
# vmagent DaemonSet per cluster
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vmagent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: vmagent
  template:
    metadata:
      labels:
        app: vmagent
    spec:
      serviceAccountName: vmagent
      containers:
        - name: vmagent
          image: victoriametrics/vmagent:v1.106.1
          args:
            - -promscrape.config=/config/scrape.yml
            - -remoteWrite.url=https://vminsert.monitoring.example.com/insert/0/prometheus/api/v1/write
            - -remoteWrite.tmpDataPath=/vmagent-remotewrite-data
            - -remoteWrite.maxDiskUsagePerURL=1GB
            - -remoteWrite.queues=4
            - -remoteWrite.showURL=false
            # Stream aggregation: reduce cardinality before sending
            - -remoteWrite.streamAggr.config=/config/aggregation.yml
            - -promscrape.suppressDuplicateScrapeTargetErrors
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /config
            - name: buffer
              mountPath: /vmagent-remotewrite-data
      volumes:
        - name: config
          configMap:
            name: vmagent-config
        - name: buffer
          emptyDir:
            sizeLimit: 2Gi
```
```yaml
# vmagent scrape config (scrape.yml)
global:
  scrape_interval: 30s
  external_labels:
    cluster: "prod-us-east-1"
    environment: "production"
scrape_configs:
  # Kubernetes pod auto-discovery
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+);(.+)
        replacement: $2:$1
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
  # kubelet / cAdvisor metrics
  - job_name: kubelet
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        target_label: __metrics_path__
        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
  # kube-state-metrics
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics.monitoring:8080"]
  # node-exporter
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: (.+):(.+)
        target_label: __address__
        replacement: $1:9100
```
```yaml
# aggregation.yml — aggregate before remote_write
# Reduces ingestion volume by 60-80%
- match: '{__name__=~"container_cpu_usage_seconds_total"}'
  interval: 1m
  outputs: [total]
  by: [namespace, pod, container]
- match: '{__name__=~"container_memory_working_set_bytes"}'
  interval: 1m
  outputs: [last]
  by: [namespace, pod, container]
- match: '{__name__=~"http_request_duration_seconds_bucket"}'
  interval: 1m
  outputs: [total]
  by: [namespace, service, le, method, status_code]
```

```yaml
# Helm values for victoria-metrics-cluster chart
# helm repo add vm https://victoriametrics.github.io/helm-charts/
# helm install vmcluster vm/victoria-metrics-cluster -f values.yaml
vminsert:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      memory: 1Gi
  extraArgs:
    maxLabelsPerTimeseries: "40"
    replicationFactor: "2"
vmstorage:
  replicaCount: 3
  storageDataPath: /vmstorage-data
  persistentVolume:
    enabled: true
    size: 100Gi
    storageClass: gp3-encrypted
  retentionPeriod: "90d"
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 4Gi
  extraArgs:
    dedup.minScrapeInterval: "30s"
    search.maxUniqueTimeseries: "1000000"
vmselect:
  replicaCount: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
  extraArgs:
    search.maxQueryDuration: "120s"
    search.maxConcurrentRequests: "16"
  # Cache configuration
  cacheMountPath: /select-cache
  persistentVolume:
    enabled: true
    size: 10Gi
```

Grafana Mimir architecture — Distributor, Ingester, S3/GCS blocks, Store-Gateway, Querier

| Criteria | VictoriaMetrics | Grafana Mimir |
| --- | --- | --- |
| Ease of ops | Simpler (3 components) | More components (6+) |
| Storage | Local disk (fast) | S3/GCS (infinite, cheaper) |
| Memory | Very low | Higher (ingesters cache) |
| Grafana integration | Standard Prometheus datasource | Native (Grafana-native histograms, exemplars) |
| Multi-tenancy | Via labels + Grafana org | Native tenant isolation |
| License | Open-source (Apache 2.0) | AGPLv3 |
| Support | Community + Enterprise | Grafana Cloud + Enterprise |
| Best for | <10M active series | >10M active series |

Ingesting Cloud Provider Metrics into Grafana


CloudWatch metrics ingestion into Grafana — direct plugin or via YACE exporter to VictoriaMetrics

Option 1: Grafana CloudWatch Data Source (simple, real-time)

  • Direct API calls to CloudWatch
  • No additional infrastructure
  • Cost: CloudWatch API charges ($0.01 per 1,000 GetMetricData requests)
  • Best for: dashboards with few panels, ad-hoc queries
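
For Option 1, the data source can be provisioned as code rather than clicked together in the UI. A minimal sketch of a Grafana provisioning file; the file path, region, and auth mode are assumptions to adapt to your account setup:

```yaml
# grafana/provisioning/datasources/cloudwatch.yml — illustrative example
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default        # use the instance role / environment credentials
      defaultRegion: us-east-1
```

With `authType: default`, Grafana picks up credentials from its runtime environment (IAM role, env vars), so no keys need to be stored in the datasource.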

Option 2: YACE Exporter (production, PromQL)

```yaml
# YACE (yet-another-cloudwatch-exporter) config
apiVersion: v1
kind: ConfigMap
metadata:
  name: yace-config
  namespace: monitoring
data:
  config.yml: |
    discovery:
      jobs:
        - type: AWS/RDS
          regions: [us-east-1]
          period: 300
          length: 300
          metrics:
            - name: CPUUtilization
              statistics: [Average, Maximum]
            - name: DatabaseConnections
              statistics: [Sum]
            - name: FreeableMemory
              statistics: [Average]
            - name: ReadIOPS
              statistics: [Average]
        - type: AWS/ApplicationELB
          regions: [us-east-1]
          period: 60
          length: 300
          metrics:
            - name: RequestCount
              statistics: [Sum]
            - name: TargetResponseTime
              statistics: [Average, p99]
            - name: HTTPCode_ELB_5XX_Count
              statistics: [Sum]
```

SLO Framework — SLI, SLO, Error Budget, and SLA definitions

```yaml
# Prometheus / vmalert recording rules for SLOs
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: payment-api-slos
  namespace: monitoring
spec:
  groups:
    - name: payment-api-slo
      interval: 30s
      rules:
        # SLI: availability (non-5xx responses)
        - record: slo:payment_api:availability:rate5m
          expr: |
            sum(rate(http_requests_total{job="payment-api", status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="payment-api"}[5m]))
        # SLI: latency (requests under 300ms)
        - record: slo:payment_api:latency:rate5m
          expr: |
            sum(rate(http_request_duration_seconds_bucket{job="payment-api", le="0.3"}[5m]))
            /
            sum(rate(http_request_duration_seconds_count{job="payment-api"}[5m]))
        # Error budget remaining (30-day window)
        - record: slo:payment_api:availability:error_budget_remaining
          expr: |
            1 - (
              (1 - slo:payment_api:availability:rate5m)
              /
              (1 - 0.999)
            )
        # Multi-window multi-burn-rate alerts
        # Fast burn: 14.4x error rate over 1h (pages immediately)
        - alert: PaymentAPIHighErrorBudgetBurn
          expr: |
            (
              (1 - slo:payment_api:availability:rate5m)
              > 14.4 * (1 - 0.999)
            )
            and
            (
              1 - sum(rate(http_requests_total{job="payment-api", status!~"5.."}[1h]))
                  /
                  sum(rate(http_requests_total{job="payment-api"}[1h]))
              > 14.4 * (1 - 0.999)
            )
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment API burning error budget 14.4x faster than allowed"
            dashboard: "https://grafana.example.com/d/payment-slo"
            runbook: "https://runbooks.example.com/payment-api-errors"
        # Slow burn: 1x error rate over 3d (ticket)
        - alert: PaymentAPISlowErrorBudgetBurn
          expr: |
            (
              1 - sum(rate(http_requests_total{job="payment-api", status!~"5.."}[3d]))
                  /
                  sum(rate(http_requests_total{job="payment-api"}[3d]))
              > 1 * (1 - 0.999)
            )
          for: 1h
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment API slowly burning error budget — investigate within 24h"
```

Payment API SLO Dashboard — availability, latency, error budget panels with burn rate chart


| Dashboard | Audience | Key Panels |
| --- | --- | --- |
| Cluster Overview | Platform team | Node count, pod capacity, CPU/memory utilization, pod scheduling rate |
| Namespace Usage | Tenant teams | CPU/memory requests vs limits vs actual, pod count, PVC usage |
| SLO Overview | Everyone | SLI values, error budget remaining, burn rate for all services |
| Node Health | Platform team | Node conditions, disk pressure, memory pressure, PID pressure |
| Ingress Traffic | Platform team | RPS, latency percentiles, error rate per ingress |
| Cost Attribution | FinOps | CPU/memory cost per namespace (via Kubecost metrics) |

```yaml
# Alertmanager config for enterprise routing
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: "https://hooks.slack.com/services/xxx"
      pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
    route:
      receiver: default-slack
      group_by: [alertname, cluster, namespace]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical alerts → PagerDuty (pages on-call)
        - match:
            severity: critical
          receiver: pagerduty-critical
          group_wait: 10s
          repeat_interval: 1h
        # Warning alerts → Slack channel
        - match:
            severity: warning
          receiver: team-slack
          repeat_interval: 12h
        # Tenant-specific routing by namespace label
        - match_re:
            namespace: "payments.*"
          receiver: payments-team-slack
        - match_re:
            namespace: "lending.*"
          receiver: lending-team-slack
    receivers:
      - name: default-slack
        slack_configs:
          - channel: "#alerts-platform"
            title: '{{ .GroupLabels.alertname }}'
            text: >-
              {{ range .Alerts }}
              *Cluster:* {{ .Labels.cluster }}
              *Namespace:* {{ .Labels.namespace }}
              *Description:* {{ .Annotations.summary }}
              {{ end }}
      # Warning-level receiver (channel name illustrative)
      - name: team-slack
        slack_configs:
          - channel: "#alerts-warnings"
      - name: pagerduty-critical
        pagerduty_configs:
          - service_key: "<pagerduty-integration-key>"
            severity: critical
            description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
      - name: payments-team-slack
        slack_configs:
          - channel: "#alerts-payments"
      - name: lending-team-slack
        slack_configs:
          - channel: "#alerts-lending"
    inhibit_rules:
      # Don't fire warning if critical is already firing
      - source_match:
          severity: critical
        target_match:
          severity: warning
        equal: [alertname, cluster, namespace]
```

Cloud-Native Monitoring — CloudWatch & GCP Cloud Monitoring


Every cloud provider includes a built-in monitoring service that automatically collects metrics from managed services. CloudWatch captures CPU utilization from EC2, connection counts from RDS, invocation durations from Lambda, and request latency from ALB — all without any agent installation or configuration. GCP Cloud Monitoring does the same for Compute Engine, Cloud SQL, Cloud Functions, and Cloud Load Balancing. These cloud-native monitoring services are the first line of visibility into your infrastructure and the only source for metrics from fully managed services that do not expose Prometheus-compatible endpoints.

The question enterprise platform teams face is not whether to use cloud-native monitoring — you must, because managed services like RDS and Lambda do not expose /metrics endpoints for Prometheus to scrape. The question is whether cloud-native monitoring is sufficient as your sole monitoring platform or whether you need open-source tooling (Prometheus, VictoriaMetrics, Loki, Grafana) alongside it. For Kubernetes-heavy workloads at scale, the answer is almost always “use both.” Cloud-native monitoring for managed services, Prometheus/VictoriaMetrics for Kubernetes workloads and application metrics, and Grafana as the unified dashboard layer that queries both.

The cost implications of this decision are significant. CloudWatch charges per custom metric, per API call (GetMetricData, PutMetricData), per alarm, and per dashboard widget query. At enterprise scale with hundreds of services and thousands of metrics, CloudWatch costs can exceed $50,000/month. Open-source alternatives running on your own compute have predictable, linear costs tied to compute and storage — not API call volume. Understanding this cost model is essential for any cloud architect responsible for an observability platform budget.
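
A rough back-of-envelope makes the difference concrete. This sketch uses the list prices cited in this section ($0.30 per custom metric per month in the first tier, $0.01 per 1,000 metrics returned by GetMetricData); all workload numbers (service count, query volume, instance prices) are illustrative assumptions, not measurements:

```python
# Back-of-envelope: CloudWatch custom-metric cost vs a self-hosted metrics stack.
# Prices are the list prices cited in the text; workload numbers are illustrative.
METRIC_PRICE = 0.30          # $ per custom metric per month (first 10K tier)
GMD_PRICE = 0.01 / 1000      # $ per metric returned by GetMetricData

def cloudwatch_monthly_cost(custom_metrics: int, gmd_metrics_per_month: int) -> float:
    """Monthly CloudWatch bill for custom metrics plus dashboard query volume."""
    return custom_metrics * METRIC_PRICE + gmd_metrics_per_month * GMD_PRICE

# 200 services x 500 custom metrics each; dashboards pulling 50M metric values/month
cw = cloudwatch_monthly_cost(200 * 500, 50_000_000)
print(f"CloudWatch: ${cw:,.0f}/month")   # 100,000 metrics * $0.30 + $500 query cost

# Self-hosted: cost is compute + storage, independent of metric count
selfhosted = 6 * 500 + 300 * 0.08        # e.g. 6 instances at ~$500 + 300 GB gp3
print(f"Self-hosted: ${selfhosted:,.0f}/month")
```

The key structural point: the CloudWatch line grows with every metric and every dashboard refresh, while the self-hosted line only moves when you add compute or disk.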

Metrics:

CloudWatch automatically collects metrics for all AWS services at no additional cost for basic monitoring (5-minute intervals for EC2, 1-minute for most other services). Detailed monitoring (1-minute intervals for EC2) and high-resolution metrics (1-second intervals for custom metrics) are available at additional cost. Custom metrics are published via the PutMetricData API or the CloudWatch Agent installed on EC2/ECS/EKS instances.

CloudWatch Metric Math allows you to create derived metrics from existing ones using arithmetic expressions. For example, calculating error rate as errors / total_requests * 100 without creating a new custom metric. Anomaly detection applies machine learning to create dynamic thresholds based on historical patterns — far more useful than static thresholds for metrics with daily or weekly seasonality.

```text
CloudWatch Metrics Architecture
==================================
Automatic Collection (zero config):
├── EC2: CPUUtilization, NetworkIn/Out, DiskReadOps, StatusCheckFailed
├── RDS: CPUUtilization, DatabaseConnections, FreeableMemory, ReadIOPS, WriteIOPS
├── ALB: RequestCount, TargetResponseTime, HTTPCode_ELB_5XX, ActiveConnectionCount
├── Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExecutions
├── SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage
├── DynamoDB: ConsumedReadCapacityUnits, ThrottledRequests
└── EKS: via Container Insights (pod, node, cluster metrics)

Custom Metrics (you publish):
├── Application metrics: request counts, business KPIs, queue depths
├── CloudWatch Agent: memory utilization, disk usage (not collected by default!)
├── Embedded Metric Format (EMF): publish metrics from Lambda via structured logs
└── PutMetricData API: programmatic metric publication

High-Resolution Metrics:
├── Standard: 1-minute granularity (most services)
├── Detailed: 1-minute for EC2 (costs extra)
├── High-resolution: 1-second granularity (custom metrics only)
└── Use case: detecting sub-minute spikes in latency or error rates

Metric Math:
├── METRICS("m1") / METRICS("m2") * 100 → error rate percentage
├── ANOMALY_DETECTION_BAND(m1, 2) → ML-based dynamic threshold
├── FILL(m1, 0) → replace missing data points
└── SEARCH(' {AWS/ApplicationELB} MetricName="RequestCount" ', 'Sum', 300)
      → search across all ALBs for request count
```
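
The Embedded Metric Format mentioned above is just structured JSON written to a log stream; CloudWatch Logs extracts the metrics asynchronously, so a Lambda function gets custom metrics without any PutMetricData call. A minimal sketch of the envelope (the namespace, dimension, and metric names are illustrative):

```python
import json
import time

def emf_record(namespace: str, service: str, latency_ms: float) -> str:
    """Build one CloudWatch Embedded Metric Format (EMF) log line.

    Printed to stdout inside Lambda, CloudWatch Logs turns it into a
    Latency metric dimensioned by Service — no API call, no agent.
    """
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
            }],
        },
        "Service": service,     # dimension value
        "Latency": latency_ms,  # metric value
    })

print(emf_record("Payments/API", "checkout", 42.0))
```

Because EMF rides on log ingestion pricing rather than PutMetricData pricing, it is usually the cheapest way to emit high-volume custom metrics from Lambda.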

Alarms:

CloudWatch alarms evaluate a metric against a threshold and trigger actions when the threshold is breached. Simple alarms monitor a single metric. Composite alarms combine multiple alarms with AND/OR logic, which dramatically reduces noise — instead of getting paged for “CPU high” and “memory high” and “disk high” separately, a composite alarm fires only when all three are true simultaneously, indicating a genuine capacity problem rather than a transient spike.

```hcl
# Simple alarm: RDS CPU > 80% for 5 minutes
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "rds-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5 # 5 consecutive periods
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 60 # 1-minute intervals
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "RDS CPU above 80% for 5 minutes"
  alarm_actions       = [aws_sns_topic.pagerduty.arn]
  ok_actions          = [aws_sns_topic.pagerduty.arn] # Notify on recovery too

  dimensions = {
    DBClusterIdentifier = aws_rds_cluster.main.cluster_identifier
  }
}

# Anomaly detection alarm: ALB latency anomaly
resource "aws_cloudwatch_metric_alarm" "alb_latency_anomaly" {
  alarm_name          = "alb-latency-anomaly"
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "ad1"
  alarm_description   = "ALB latency exceeds ML-predicted band"
  alarm_actions       = [aws_sns_topic.pagerduty.arn]

  metric_query {
    id          = "m1"
    return_data = true

    metric {
      metric_name = "TargetResponseTime"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "p99"

      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  metric_query {
    id          = "ad1"
    expression  = "ANOMALY_DETECTION_BAND(m1, 2)" # 2 standard deviations
    label       = "Latency Anomaly Band"
    return_data = true
  }
}

# Composite alarm: only page when BOTH cpu AND memory are critical
resource "aws_cloudwatch_composite_alarm" "rds_capacity_critical" {
  alarm_name        = "rds-capacity-critical"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.rds_cpu_high.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.rds_memory_low.alarm_name})"
  alarm_actions     = [aws_sns_topic.pagerduty.arn]
  alarm_description = "RDS capacity critical: both CPU and memory stressed"
}
```

Synthetics:

CloudWatch Synthetics runs canary scripts on a schedule to monitor endpoints, APIs, and multi-step workflows. Canaries are Lambda-based and can simulate user interactions (login, checkout, form submission). They detect degradation before real users notice — if your canary’s latency increases from 500ms to 2000ms, you have early warning of a problem.

```hcl
# CloudWatch Synthetics canary — monitor payment API health
resource "aws_synthetics_canary" "payment_api_health" {
  name                 = "payment-api-health"
  artifact_s3_location = "s3://${aws_s3_bucket.canary_artifacts.id}/canary/"
  execution_role_arn   = aws_iam_role.canary.arn
  runtime_version      = "syn-nodejs-puppeteer-7.0"
  handler              = "apiCanary.handler"

  schedule {
    expression = "rate(5 minutes)" # Run every 5 minutes
  }

  run_config {
    timeout_in_seconds = 60
    memory_in_mb       = 960
    active_tracing     = true # X-Ray tracing for canary runs
  }

  zip_file = data.archive_file.canary_script.output_path

  # Keep run artifacts for 31 days (alerting on failures is done
  # via a CloudWatch alarm on the canary's SuccessPercent metric)
  success_retention_period = 31
  failure_retention_period = 31
}
```

Contributor Insights:

Identifies the top contributors to a metric pattern. For example: “which 10 IP addresses are making the most requests?” or “which 10 API endpoints have the highest latency?” This is invaluable during incidents — quickly identifying whether the problem is global or concentrated on a specific resource, customer, or endpoint.
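
As a sketch, a Contributor Insights rule that ranks client IPs generating 5xx responses from an API Gateway access-log group might look like this; the log group name and JSON field paths are assumptions about your log format:

```json
{
  "Schema": { "Name": "CloudWatchLogRule", "Version": 1 },
  "LogGroupNames": ["/aws/apigateway/payment-api-access-logs"],
  "LogFormat": "JSON",
  "Contribution": {
    "Keys": ["$.ip"],
    "Filters": [{ "Match": "$.status", "GreaterThan": 499 }]
  },
  "AggregateOn": "Count"
}
```

The rule produces a live top-N report, so during an incident you can see within seconds whether errors come from one noisy client or the whole fleet.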

Application Insights:

Automated detection and visualization of application issues for .NET, SQL Server, Java, and IIS workloads. Sets up CloudWatch alarms and dashboards automatically based on detected application components. Most useful for lift-and-shift Windows workloads where manual instrumentation is impractical.

When to Use Cloud-Native vs Open-Source Monitoring

| Criterion | Cloud-Native (CloudWatch / GCP Monitoring) | Open-Source (Prometheus / VictoriaMetrics / Loki) |
| --- | --- | --- |
| Setup | Zero — built-in for all managed services | Deploy, configure, and maintain the stack |
| AWS/GCP service metrics | Automatic (RDS, Lambda, ALB, Cloud SQL) | Requires CloudWatch Exporter or GCP exporter |
| Kubernetes metrics | Container Insights / GKE Monitoring (limited depth) | Prometheus native: kube-state-metrics, node-exporter, cAdvisor |
| Application metrics | PutMetricData API (per-metric cost) | Prometheus scrapes /metrics endpoints (free) |
| Query language | CloudWatch Insights / MQL | PromQL / LogQL (richer, larger community, more tutorials) |
| Cost at scale | Expensive: per-metric, per-API call, per-alarm | Predictable: compute + storage only |
| Cost example (50 clusters) | $50-100K/month (custom metrics + API calls) | $10-20K/month (VM compute + EBS storage) |
| Portability | Locked to one cloud provider | Multi-cloud, any K8s cluster, on-prem |
| Dashboards | CloudWatch Dashboards / GCP Dashboards | Grafana (richer, 5000+ community dashboards) |
| Long-term storage | CloudWatch: 15 months (expensive) | Thanos/Mimir/VictoriaMetrics on object storage (cheap, unlimited) |
| Alerting | CloudWatch Alarms / GCP Alerting Policies | Alertmanager (more flexible routing, inhibition, grouping) |
| Community | AWS/GCP documentation | Massive open-source community, CNCF ecosystem |

The optimal architecture for most enterprises combines cloud-native monitoring for managed services with open-source tooling for Kubernetes workloads, unified through Grafana as the single dashboard layer. This gives you the best of both worlds: automatic metrics collection from managed services (which cannot expose Prometheus endpoints) and the rich ecosystem of Prometheus/Grafana for everything else.

Enterprise Monitoring — Hybrid Approach

Interview Scenarios for Cloud-Native Monitoring


“Why would you choose Prometheus over CloudWatch for a multi-cluster EKS setup?”

“For a multi-cluster EKS setup, Prometheus (or VictoriaMetrics) is the clear choice for Kubernetes workload monitoring, while CloudWatch remains essential for managed services. Here’s why:

PromQL is more powerful. CloudWatch Metrics Insights is improving but still lacks the expressiveness of PromQL for complex calculations — percentile aggregations across dimensions, ratio computations, prediction functions, and label-based filtering. When an SRE needs to write histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)), there is no CloudWatch equivalent that is as clean.

Grafana ecosystem is richer. There are thousands of community-maintained Grafana dashboards for Kubernetes (kube-state-metrics, node-exporter, CoreDNS, Ingress-NGINX). The CloudWatch dashboard ecosystem is much smaller. When a new team onboards, they get pre-built Grafana dashboards that work immediately.

Cost is predictable at scale. CloudWatch charges per custom metric ($0.30/month per metric for the first 10K), per API call ($0.01 per 1,000 GetMetricData requests), and per alarm ($0.10/month per standard alarm). With 50 EKS clusters, each with hundreds of pods exposing dozens of metrics, CloudWatch custom metric costs scale with every new series and quickly dominate the bill. Prometheus/VictoriaMetrics running on a few EC2 instances costs a fixed amount regardless of how many metrics you scrape.

Portability. If the organization runs EKS and GKE (or plans to), Prometheus works identically on both. CloudWatch only covers AWS. A single Grafana instance with Prometheus datasources for all clusters gives a unified view across clouds.

But you still need CloudWatch for managed services. RDS, ALB, Lambda, SQS, DynamoDB — none of these expose /metrics endpoints. The YACE exporter bridges this gap by scraping CloudWatch metrics and exposing them as Prometheus metrics, but the simpler approach is adding a CloudWatch datasource directly in Grafana alongside the Prometheus datasource.”


Scenario 1: Design a Centralized Metrics Platform


“You have 15 EKS clusters across 3 regions. Design a centralized metrics platform.”

Strong Answer:

“I’d build a hub-spoke model with VictoriaMetrics:

Per cluster (spoke): Deploy vmagent as a DaemonSet. It scrapes kubelet, kube-state-metrics, node-exporter, and application /metrics endpoints. vmagent adds cluster and region external labels, then remote-writes to the central VictoriaMetrics cluster. I use stream aggregation on vmagent to reduce cardinality — aggregating container-level metrics to namespace level before sending.

Central hub (Shared Services account): VictoriaMetrics cluster mode — 3 vminsert (stateless write), 3 vmstorage (sharded with replication factor 2), 3 vmselect (stateless read). 90-day retention on fast storage. This handles 5M+ active series easily.

Visualization: Grafana with VictoriaMetrics datasource. Pre-built dashboards for cluster overview, namespace usage, and SLOs. Tenant teams get Grafana orgs scoped to their namespaces via label-based access control.

Alerting: vmalert evaluates recording and alerting rules against vmselect. Routes through Alertmanager with team-based routing (payments team alerts go to #alerts-payments Slack and PagerDuty).

Why VictoriaMetrics over Mimir? For 15 clusters with 5M series, VM is simpler to operate (3 components vs 6+) and uses significantly less memory. If we grow past 50M series, I’d consider Mimir with S3 backend for cost-effective long-term storage.”


“We’re running Prometheus in each cluster. Why would we change?”

Strong Answer:

“Prometheus per-cluster works, but at enterprise scale you hit three problems:

  1. No global view: You can’t query across clusters. ‘Show me P99 latency for the payments service across all clusters’ requires querying 15 separate Prometheus instances.

  2. Memory: Prometheus stores all active series in memory. At 500K series per cluster, each Prometheus needs 8-12 GB RAM. VictoriaMetrics’ vmagent doing the same scraping needs 256-512 MB.

  3. No long-term storage: Prometheus local TSDB has limited retention (usually 15-30 days due to disk/memory). Capacity planning and trend analysis need months of data.

My recommendation: Replace Prometheus with vmagent per cluster (same scrape configs, drop-in replacement). Central VictoriaMetrics cluster for storage and querying. Teams notice zero difference in their dashboards — the data source just changes from local Prometheus to central VM.”


“A team added a user_id label to their HTTP metrics. Now our Prometheus is OOM. What happened?”

Strong Answer:

“Classic cardinality explosion. If they have 1M users and 10 metric names, that’s 10M time series instead of 10. Each active series costs approximately 3-4 KB in Prometheus memory, so 10M series = 30-40 GB RAM.

Immediate fix:

  1. Remove the user_id label from metrics (it belongs in logs or traces, not metrics)
  2. If they need per-user metrics, use a recording rule that pre-aggregates: sum by (endpoint, status) (rate(http_requests_total[5m])) — this collapses the user dimension

Prevention (platform-level guardrails):

  1. vmagent stream aggregation config that drops labels matching user_id|customer_id|request_id|trace_id
  2. vmalert rule: alert when a metric exceeds 10K active series
  3. OPA/Gatekeeper policy that validates ServiceMonitor CRDs — reject if they scrape endpoints exposing high-cardinality labels
  4. Grafana Mimir/VictoriaMetrics tenant limits: max 500K active series per namespace”
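
The first guardrail (dropping high-cardinality labels at the agent) can be sketched as a relabeling rule on each scrape job in the vmagent config; the label list is illustrative and should be extended with whatever identifiers your teams tend to leak into metrics:

```yaml
# Per scrape job: drop high-cardinality labels before ingestion.
# Label list is a hypothetical example — extend as needed.
metric_relabel_configs:
  - action: labeldrop
    regex: "user_id|customer_id|request_id|trace_id"
```

Because `labeldrop` runs at scrape time, the offending series never reach the remote-write queue, so the central cluster is protected even if a team ships a bad instrumentation change.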

“The team says ‘CloudWatch already monitors our AWS resources — why do we need Prometheus?’”

Strong Answer:

“CloudWatch is essential for AWS infrastructure metrics — RDS CPU, ALB latency, Lambda invocations. But it has limitations for a platform team:

  1. No application metrics: CloudWatch doesn’t know about your app’s business logic (order processing rate, payment failures, queue depth in your code). Prometheus scrapes /metrics endpoints that your app exposes.

  2. No Kubernetes visibility: CloudWatch Container Insights gives basic pod/node metrics but lacks the depth of kube-state-metrics (pod phase, deployment rollout status, resource quota usage).

  3. Cross-cloud: If you run GKE and EKS, CloudWatch only covers AWS. A Prometheus/VictoriaMetrics stack works identically on both.

  4. Query language: PromQL is far more powerful than CloudWatch Metrics Insights for complex queries (percentile calculations, ratio computations, predictions).

My recommendation: Use both. CloudWatch for AWS infrastructure metrics (RDS, ALB, Lambda) — ingest into Grafana via CloudWatch datasource or YACE exporter. Prometheus/vmagent for application metrics, Kubernetes metrics, and custom SLOs. Single Grafana for unified dashboards.”


“A team is launching a new payment processing service. What metrics should they instrument?”

Strong Answer:

“I follow the RED and USE methods:

RED (for request-driven services):

  • Rate: payment_requests_total (counter) — with labels: method, status_code, payment_type
  • Errors: payment_errors_total (counter) — with labels: error_type (timeout, validation, downstream)
  • Duration: payment_request_duration_seconds (histogram) — buckets at 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10

USE (for infrastructure resources):

  • Utilization: CPU, memory, connection pool usage
  • Saturation: queue depth, thread pool pending, connection pool waiting
  • Errors: OOM kills, connection timeouts, disk errors

Business metrics:

  • payment_amount_total (counter) — for revenue tracking
  • payment_processing_inflight (gauge) — for capacity planning
  • payment_downstream_call_duration_seconds (histogram) — for dependency monitoring

SLOs I’d define:

  • Availability: 99.99% (payment is critical path)
  • Latency: P99 under 500ms for payment initiation
  • Error budget: 4.32 min/month — alerts at 14.4x burn rate”
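
The RED metrics above map directly onto client-library primitives. A minimal sketch using the Python prometheus_client, with the metric names, labels, and histogram buckets from the answer (the handler itself is hypothetical):

```python
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest

registry = CollectorRegistry()

# Rate + Errors: counters with the labels listed in the answer
REQUESTS = Counter("payment_requests_total", "Payment requests",
                   ["method", "status_code", "payment_type"], registry=registry)
ERRORS = Counter("payment_errors_total", "Payment errors",
                 ["error_type"], registry=registry)

# Duration: histogram with the stated bucket boundaries
DURATION = Histogram("payment_request_duration_seconds", "Request duration",
                     buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
                     registry=registry)

# Instrumenting a (hypothetical) request handler:
REQUESTS.labels(method="POST", status_code="200", payment_type="card").inc()
with DURATION.time():
    pass  # ... handle the request ...

print(generate_latest(registry).decode())  # Prometheus exposition format
```

The `_bucket` series emitted by the histogram are exactly what the `histogram_quantile(...)` queries earlier in this page consume.
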

“How do you give 20 teams access to Grafana without them seeing each other’s data?”

Strong Answer:

“Three layers of isolation:

  1. Data source level: VictoriaMetrics or Mimir supports label-based access. Configure Grafana data source with a default label filter: {namespace=~'team-a.*'}. Use Grafana provisioning to create per-team datasources with pre-filtered queries.

  2. Grafana Organizations: Each team gets a Grafana org with their own dashboards and datasources. Platform team has a global org with cross-cutting views. SSO integration maps AD groups to Grafana orgs automatically.

  3. Dashboard provisioning: Platform team provides dashboard-as-code via Git. Standard dashboards (namespace overview, SLO tracker) are provisioned automatically when a namespace is created. Teams can create custom dashboards in their org but cannot modify platform dashboards.

RBAC: Viewers can see dashboards but not edit. Editors can create dashboards in their org. Admins (platform team only) manage datasources and provisioned dashboards.”
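
The data-source-level isolation in layer 1 can itself be provisioned as code. A sketch of a per-team Grafana datasource that stamps a tenant header on every query (the header name follows Mimir's multi-tenancy convention; the URL and tenant ID are placeholders):

```yaml
# grafana/provisioning/datasources/team-a.yml — illustrative example
apiVersion: 1
datasources:
  - name: Metrics (team-a)
    type: prometheus
    url: https://vmselect.monitoring.example.com/select/0/prometheus
    jsonData:
      # Tenant isolation: every query from this datasource carries the tenant ID
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: team-a
```

Because the header is set server-side by Grafana, a team cannot edit a panel query to read another tenant's data.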


“The on-call team is getting 200 alerts per day. Most are noise. How do you fix this?”

Strong Answer:

“Alert fatigue is a platform team failure. Here’s my systematic approach:

  1. Audit all alerts: Export all firing alerts from the last 30 days. Categorize: actionable (required human intervention), informational (auto-resolved), noise (false positives). Typically 70% are noise.

  2. Apply the SLO model: Replace symptom-based alerts (CPU > 80%) with SLO-based alerts (error budget burn rate). A pod at 85% CPU is fine if the SLO is met. Only alert when user-facing service quality degrades.

  3. Multi-window burn rates: Instead of alerting on 5-minute spikes, use Google’s multi-window approach — fast burn (2% budget in 1h → page) and slow burn (5% budget in 6h → ticket). This eliminates transient spikes.

  4. Aggregation and deduplication: Group alerts by service, not by pod. ‘Pod payments-abc-123 is OOMKilling’ x50 becomes ‘50 pods in payments namespace are OOM.’

  5. Target: <5 critical pages per week, <20 warnings per day. Every alert must have a runbook link and a clear owner.”
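
The burn-rate numbers in the multi-window approach follow from simple arithmetic: burn rate = (fraction of budget consumed) x (SLO window / alert window). A quick check for a 30-day SLO window:

```python
# Burn rate = budget fraction consumed * (SLO window / alert window).
SLO_WINDOW_H = 30 * 24  # 720 hours in a 30-day SLO window

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return budget_fraction * SLO_WINDOW_H / window_hours

print(burn_rate(0.02, 1))   # fast burn: 2% of budget in 1h -> 14.4
print(burn_rate(0.05, 6))   # slow burn: 5% of budget in 6h -> 6.0
```

This is where the 14.4 multiplier used in the VMRule earlier on this page comes from.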


“How much storage do you need for metrics from 15 clusters, 90-day retention?”

Strong Answer:

“Let me size it:

Ingestion estimate:

  • 15 clusters x 200 nodes average = 3,000 nodes
  • Per cluster, after relabel-based filtering: kubelet (~500 series), node-exporter (~2,000), cAdvisor (~30,000 retained), kube-state-metrics (~500)
  • Application metrics: ~5,000 series per namespace x 20 namespaces per cluster = 100,000

Total raw series per cluster: ~133,000. After stream aggregation (60% reduction): ~53,000 per cluster. Across 15 clusters: ~800,000 active series.

Storage calculation (VictoriaMetrics):

  • VM compression: ~0.4 bytes per data point
  • 30s scrape interval = 2 points/min = 2,880/day per series
  • 800K series x 2,880 points/day x 0.4 bytes = ~921 MB/day
  • 90 days: ~83 GB

My recommendation: 3x vmstorage with 100 GB each (300 GB total) gives comfortable headroom with replication factor 2. On gp3 EBS, that’s about $24/month in storage costs — trivial compared to the compute for vminsert/vmselect/vmstorage instances.

In practice, I’d start with 200 GB per vmstorage node and monitor actual usage with vm_data_size_bytes metric.”
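
The sizing arithmetic above is worth scripting so it can be re-run as the fleet grows. This sketch uses the ~0.4 bytes/sample figure cited for VictoriaMetrics compression earlier on this page; all workload inputs are the estimates from the answer:

```python
def daily_bytes(active_series: int, scrape_interval_s: int, bytes_per_sample: float) -> float:
    """Raw bytes written per day for a given active-series count."""
    samples_per_day = 86_400 / scrape_interval_s  # e.g. 2,880 at a 30s interval
    return active_series * samples_per_day * bytes_per_sample

daily = daily_bytes(800_000, 30, 0.4)
print(f"{daily / 1e6:.0f} MB/day")          # ~922 MB/day
print(f"{daily * 90 / 1e9:.0f} GB / 90d")   # ~83 GB
```

Doubling the series count or halving the scrape interval scales the result linearly, which makes capacity planning straightforward.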


Scenario 9: Migrating from Datadog to Open Source


“Leadership wants to move from Datadog to open-source to save costs. What’s your plan?”

Strong Answer:

“Datadog typically costs $15-25 per host per month for infrastructure monitoring, plus $0.10 per ingested GB for logs. For 500 hosts, that’s $90-150K/year. Open source can reduce this by 70-80%.

Migration plan:

Phase 1 (Month 1): Metrics — Deploy vmagent alongside the Datadog agent. Both scrape the same targets. Central VictoriaMetrics cluster for storage. Recreate top 20 Datadog dashboards in Grafana. Run in parallel for 2 weeks to validate data parity.

Phase 2 (Month 2): Alerting — Migrate Datadog monitors to vmalert/Alertmanager rules. Map Datadog notification channels to Alertmanager receivers (Slack, PagerDuty). Run in parallel for 1 week.

Phase 3 (Month 3): Logs — Deploy Grafana Alloy (replaces Datadog agent for logs). Ship to Loki instead of Datadog Logs. Recreate log-based dashboards and alerts.

Phase 4 (Month 4): APM/Tracing — Instrument apps with OpenTelemetry SDK. Ship traces to Tempo. Remove Datadog APM library.

Phase 5: Decommission — Remove Datadog agents, cancel contract.

Cost comparison:

  • Datadog: ~$120K/year (500 hosts, infra + logs + APM)
  • Open source: ~$30K/year (VM cluster compute + storage + Grafana Enterprise license)
  • Savings: ~$90K/year

Risk: Open source requires platform team expertise to operate. Budget for 1 dedicated SRE for the observability platform.”
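
The cost range cited in this answer is easy to sanity-check; the per-host prices are the ones stated above, and the host count is the scenario's 500:

```python
# Sanity-check the Datadog estimate: $15-25/host/month across 500 hosts.
HOSTS = 500

for per_host in (15, 25):
    annual = HOSTS * per_host * 12
    print(f"${per_host}/host/month -> ${annual:,}/year")
# Bounds match the $90-150K/year range quoted in the answer
```

Note this covers infrastructure monitoring only; log ingestion and APM fees come on top, which is why real Datadog bills often exceed the low end of this range.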