# Logging

## Where This Fits

The central platform team operates Loki as the log aggregation backend. Tenant teams query their logs in Grafana, filtered by namespace labels. No one manages Elasticsearch clusters.
## Grafana Loki Architecture

### Grafana Alloy (Log Collection Agent)

#### Why Alloy over Promtail/Fluentd/Fluent Bit

| Feature | Alloy | Promtail | Fluent Bit | Fluentd |
|---|---|---|---|---|
| Metrics | Yes (Prometheus scrape) | No | Partial | Partial |
| Logs | Yes (Loki push) | Yes | Yes | Yes |
| Traces | Yes (OTel collector) | No | Partial | No |
| Config | River (HCL-like) | YAML | INI-like | Ruby DSL |
| Memory | Low (~50 MB) | Low (~30 MB) | Very low (~15 MB) | High (~100 MB+) |
| Maintained by | Grafana Labs | Grafana Labs (legacy) | CNCF | CNCF |
Alloy replaces Promtail, the Grafana Agent, and the OTel Collector in a single binary. It’s the recommended collector for the Grafana stack.
#### Alloy DaemonSet

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: grafana-alloy
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana-alloy
  template:
    metadata:
      labels:
        app: grafana-alloy
    spec:
      serviceAccountName: grafana-alloy
      tolerations:
        - operator: Exists # Run on ALL nodes including masters
      containers:
        - name: alloy
          image: grafana/alloy:v1.5.1
          args:
            - run
            - /etc/alloy/config.alloy
            - --storage.path=/var/lib/alloy/data
            - --server.http.listen-addr=0.0.0.0:12345
          ports:
            - containerPort: 12345
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alloy
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: alloy-data
              mountPath: /var/lib/alloy/data
      volumes:
        - name: config
          configMap:
            name: alloy-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
        - name: alloy-data
          hostPath:
            path: /var/lib/alloy/data
```
#### Alloy Configuration (River syntax)

```river
// Alloy config for Kubernetes log collection
// Discover pods, tail their logs, enrich with K8s labels, push to Loki

// Kubernetes pod discovery
discovery.kubernetes "pods" {
  role = "pod"
}

// Relabel: extract namespace, pod, container labels
discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label  = "app"
  }
  // Drop system pods with excessive logging
  rule {
    source_labels = ["namespace"]
    regex         = "kube-system|monitoring"
    action        = "drop"
  }
}

// Tail container log files
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.process.pipeline.receiver]
}

// Processing pipeline: parse, filter, transform
loki.process "pipeline" {
  // Parse JSON logs (if structured)
  stage.json {
    expressions = {
      level   = "level",
      msg     = "msg",
      traceID = "trace_id",
    }
  }

  // Add parsed fields as labels
  stage.labels {
    values = {
      level = "",
    }
  }

  // Drop debug logs in production (reduce volume 40-60%)
  stage.match {
    selector = "{level=\"debug\"}"
    action   = "drop"
  }

  // Rate limit per stream (prevent log flood from one bad pod)
  stage.limit {
    rate  = 100 // lines per second
    burst = 200
  }

  // Add structured metadata (Loki 3.0+)
  stage.structured_metadata {
    values = {
      trace_id = "traceID",
    }
  }

  forward_to = [loki.write.default.receiver]
}

// Push to central Loki
loki.write "default" {
  endpoint {
    url = "https://loki.monitoring.example.com/loki/api/v1/push"

    tenant_id = "platform"

    tls_config {
      ca_file = "/etc/alloy/ca.crt"
    }

    basic_auth {
      username = env("LOKI_USERNAME")
      password = env("LOKI_PASSWORD")
    }
  }
  external_labels = {
    cluster = "prod-us-east-1",
    env     = "production",
  }
}
```
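Under the hood, `loki.write` speaks Loki's HTTP push API: a POST of JSON-encoded streams with nanosecond timestamps, plus an `X-Scope-OrgID` header carrying the tenant when multi-tenancy is enabled. A minimal Python sketch of the same call (the endpoint and tenant match the config above; the helper names are illustrative):

```python
import json
import time
import urllib.request


def build_push_payload(labels: dict, lines: list) -> bytes:
    """Build a Loki push-API body: one stream with the given label set."""
    ts = str(time.time_ns())  # Loki expects nanosecond epoch as a string
    return json.dumps({
        "streams": [{
            "stream": labels,                       # indexed labels
            "values": [[ts, line] for line in lines],  # [timestamp, log line]
        }]
    }).encode()


def push(base_url: str, tenant: str, payload: bytes) -> None:
    """POST one batch to Loki's push endpoint (raises on non-2xx)."""
    req = urllib.request.Request(
        base_url + "/loki/api/v1/push",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-Scope-OrgID": tenant,  # tenant header when auth_enabled: true
        },
    )
    urllib.request.urlopen(req)


# Example: build the body for one error line in the payment-api stream.
body = build_push_payload(
    {"namespace": "payments", "app": "payment-api"},
    ['{"level":"error","msg":"Payment processing failed"}'],
)
# push("https://loki.monitoring.example.com", "platform", body)  # needs network + credentials
```

In practice Alloy also batches, retries with backoff, and compresses; this sketch only shows the wire format.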
## Ingesting Cloud Provider Logs

### AWS: CloudWatch Logs to Loki

**Option 1: Lambda Forwarder (real-time, low volume)**
```hcl
# Subscribe CloudWatch Log Group to Lambda forwarder
resource "aws_cloudwatch_log_subscription_filter" "rds_to_loki" {
  name            = "rds-logs-to-loki"
  log_group_name  = "/aws/rds/cluster/payments-db/postgresql"
  filter_pattern  = "" # All logs
  destination_arn = aws_lambda_function.loki_forwarder.arn
}

resource "aws_lambda_function" "loki_forwarder" {
  function_name = "cloudwatch-to-loki"
  role          = aws_iam_role.loki_forwarder.arn
  handler       = "index.handler"
  runtime       = "nodejs20.x"
  timeout       = 60
  memory_size   = 256

  environment {
    variables = {
      LOKI_URL      = "https://loki.monitoring.example.com"
      LOKI_USERNAME = "cloudwatch-ingester"
    }
  }

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }
}
```

**Option 2: Grafana Alloy CloudWatch Source**
```river
// Alloy config: scrape CloudWatch Logs directly
loki.source.cloudwatch "rds_logs" {
  region = "us-east-1"

  log_groups {
    names = ["/aws/rds/cluster/payments-db/postgresql"]
  }

  // Process every 60 seconds
  poll_interval = "60s"

  forward_to = [loki.write.default.receiver]

  labels = {
    source   = "cloudwatch",
    service  = "rds",
    database = "payments-db",
  }
}
```

### Log Sink + Pub/Sub + Subscriber
```hcl
# Organization-level log sink → Pub/Sub
resource "google_logging_organization_sink" "loki" {
  name             = "loki-sink"
  org_id           = var.org_id
  destination      = "pubsub.googleapis.com/projects/${var.shared_project}/topics/${google_pubsub_topic.loki_logs.name}"
  include_children = true

  # Filter: only important logs (reduces volume 80%)
  # Parentheses group the OR'd resource types; the severity clause is ANDed.
  filter = <<-EOT
    (
      resource.type = "cloud_sql_database" OR
      resource.type = "cloud_run_revision" OR
      resource.type = "gke_cluster" OR
      (resource.type = "k8s_container" AND resource.labels.namespace_name != "kube-system")
    )
    severity >= WARNING
  EOT
}

resource "google_pubsub_topic" "loki_logs" {
  name    = "loki-log-ingestion"
  project = var.shared_project
}

# Subscriber: Cloud Run service that pushes to Loki
resource "google_pubsub_subscription" "loki" {
  name  = "loki-subscriber"
  topic = google_pubsub_topic.loki_logs.name

  push_config {
    push_endpoint = google_cloud_run_v2_service.loki_ingester.uri

    oidc_token {
      service_account_email = google_service_account.loki_ingester.email
    }
  }

  ack_deadline_seconds       = 60
  message_retention_duration = "604800s" # 7 days
  retry_policy {
    minimum_backoff = "10s"
    maximum_backoff = "300s"
  }
}
```
## LogQL Query Language

### Basic Queries
```logql
# All logs from payments namespace
{namespace="payments"}

# Filter by pod name pattern
{namespace="payments", pod=~"api-.*"}

# Full-text search within filtered logs
{namespace="payments"} |= "connection refused"

# Regex match
{namespace="payments"} |~ "error|Error|ERROR"

# Exclude pattern
{namespace="payments"} != "health_check"

# JSON parsing + field filter
{namespace="payments"} | json | level="error" | status_code >= 500

# Line format (rewrite log lines for readability)
{namespace="payments"} | json | line_format "{{.timestamp}} [{{.level}}] {{.msg}}"
```
### Aggregation Queries (Log Metrics)

```logql
# Error rate per service (count of error logs per second)
sum by (app) (rate({namespace="payments"} |= "error" [5m]))

# Bytes ingested per namespace (cost attribution)
sum by (namespace) (bytes_rate({job="kubernetes-pods"} [1h]))

# Top 5 pods by log volume
topk(5, sum by (pod) (bytes_rate({namespace="production"} [1h])))

# Count 5xx errors per endpoint (from JSON logs)
sum by (endpoint) (
  count_over_time(
    {namespace="payments"} | json | status_code >= 500 [5m]
  )
)

# P99 latency from log fields (when you don't have metrics)
quantile_over_time(0.99,
  {namespace="payments"} | json | unwrap duration_ms [5m]
) by (endpoint)
```
## Loki Deployment (Scalable Mode)

```yaml
# Helm values for Loki in simple scalable mode (read/write/backend targets)
# helm install loki grafana/loki -f values.yaml
loki:
  auth_enabled: true # Multi-tenant mode

  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks-prod
      ruler: loki-ruler-prod
    s3:
      region: us-east-1
      endpoint: null # Use default AWS endpoint

  limits_config:
    retention_period: 30d
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
    max_query_parallelism: 32
    max_entries_limit_per_query: 10000
    per_stream_rate_limit: 5MB
    per_stream_rate_limit_burst: 15MB

  schema_config:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  compactor:
    retention_enabled: true
    delete_request_store: s3

# Component sizing
write:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
  persistence:
    size: 20Gi
    storageClass: gp3-encrypted

read:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi

backend:
  replicas: 2
  resources:
    requests:
      cpu: 250m
      memory: 512Mi

# Gateway (nginx) for routing
gateway:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```
## Log-Based Alerting

```yaml
# Loki ruler config for log-based alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-ruler-config
  namespace: monitoring
data:
  rules.yml: |
    groups:
      - name: application-errors
        rules:
          # Alert on high error rate in logs
          - alert: HighErrorLogRate
            expr: |
              sum by (namespace, app) (
                rate({job="kubernetes-pods"} |= "level=error" [5m])
              ) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate in {{ $labels.app }} ({{ $labels.namespace }})"
              description: "More than 10 error logs/sec for 5 minutes"

          # Alert on OOM kill messages in kernel logs
          - alert: OOMKillDetected
            expr: |
              count_over_time(
                {job="systemd-journal"} |= "Out of memory: Killed process" [5m]
              ) > 0
            labels:
              severity: critical
            annotations:
              summary: "OOM kill detected on {{ $labels.node }}"
              runbook: "https://runbooks.example.com/oom-kill"

          # Alert on database connection failures
          - alert: DatabaseConnectionFailure
            expr: |
              sum by (app) (
                count_over_time(
                  {namespace="payments"} |= "connection refused" |= "postgresql" [5m]
                )
              ) > 5
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.app }} cannot connect to database"

      - name: security-events
        rules:
          # Alert on authentication failures
          - alert: HighAuthFailureRate
            expr: |
              sum by (source_ip) (
                count_over_time(
                  {namespace="auth"} | json | event_type="auth_failure" [10m]
                )
              ) > 50
            labels:
              severity: critical
              team: security
            annotations:
              summary: "50+ auth failures from {{ $labels.source_ip }} in 10 minutes — possible brute force"
```
## Structured Logging Best Practices

### Golden Format
```json
{
  "timestamp": "2026-03-15T10:23:45.123Z",
  "level": "error",
  "logger": "payment-service",
  "msg": "Payment processing failed",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "service": "payment-api",
  "environment": "production",
  "error": {
    "type": "PaymentGatewayTimeout",
    "message": "Upstream timeout after 30s",
    "stack": "at PaymentService.process..."
  },
  "context": {
    "payment_id": "PAY-12345",
    "amount": 150.00,
    "currency": "AED",
    "customer_tier": "premium"
  }
}
```
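Application code can emit this format with a stock logging setup and a JSON formatter. A minimal Python sketch (the hard-coded `service`/`environment` values and the extra-field handling are illustrative, not a prescribed library):

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line in the golden format."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds")
                .replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
            "service": "payment-api",      # normally injected from env/config
            "environment": "production",
        }
        # Fields attached via the `extra` kwarg become record attributes
        for key in ("trace_id", "span_id", "context"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        if record.exc_info:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
            }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123def456", "context": {"payment_id": "PAY-12345"}},
)
```

One line per record on stdout is exactly what Alloy's `stage.json` expects to parse.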
Section titled “Interview Scenarios”Scenario 1: Design Centralized Logging
Section titled “Scenario 1: Design Centralized Logging”“Design a logging architecture for 20 Kubernetes clusters generating 5 TB of logs per day.”
Strong Answer:
“At 5 TB/day, cost is the primary constraint. Here’s my architecture:
Collection (per cluster): Grafana Alloy as DaemonSet on every node. It tails container log files from /var/log/containers/, enriches with Kubernetes labels (namespace, pod, app), and ships to central Loki. I configure pipeline stages to:
- Parse JSON logs and extract `level` as a label
- Drop `debug`-level logs in production (reduces volume 40-60%)
- Rate-limit per stream to 100 lines/sec (prevents log flooding from one bad pod)
Storage (central): Loki in microservices mode with S3/GCS backend. Loki only indexes labels — log content goes to object storage as compressed chunks. At 5 TB/day after filtering (say 2 TB after dropping debug):
- S3 cost: ~$46/month per TB = ~$2,760/month for 30-day retention
- Compare to Elasticsearch: 10-20x more expensive (needs hot/warm/cold nodes with fast disks)
Retention: 30 days in Loki (hot queries), 90 days in S3 (archived, queryable but slower). Compliance logs (audit, security) stored separately with 7-year retention in S3 Glacier.
Access control: Multi-tenant Loki — each team’s Grafana org sees only their namespace’s logs. Platform team has a global tenant for cross-cutting queries.”
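The storage math above can be sanity-checked in a few lines. This is a sketch: the ~$46/TB-month effective rate and the 40% keep ratio are the answer's assumptions, not quoted AWS list prices.

```python
def monthly_s3_cost(raw_tb_per_day: float, keep_ratio: float,
                    retention_days: int, usd_per_tb_month: float) -> float:
    """Steady-state object-storage bill: the retained window's volume,
    billed at an effective per-TB-month rate."""
    stored_tb = raw_tb_per_day * keep_ratio * retention_days
    return stored_tb * usd_per_tb_month


# 5 TB/day raw, 40% kept after dropping debug, 30-day retention, ~$46/TB-month
cost = monthly_s3_cost(5.0, 0.4, 30, 46.0)
print(f"${cost:,.0f}/month")  # $2,760/month
```

Note this ignores Loki's chunk compression, which typically shrinks stored bytes further, so the real bill tends to come in under this estimate.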
### Scenario 2: CloudWatch vs Loki

“Why not just use CloudWatch Logs for everything on AWS?”
Strong Answer:
“CloudWatch Logs works for AWS-native services (RDS, Lambda, ALB) but has limitations for a platform team:
- Cost at scale: CloudWatch charges $0.50/GB ingested + $0.03/GB stored. At 2 TB/day, that's $1,000/day = $30K/month just for ingestion. Loki with S3 storage costs 10-20x less.
- No Kubernetes awareness: CloudWatch doesn't understand namespace, pod, or container labels natively. Container Insights provides basic log routing but lacks the label-based querying of LogQL.
- Query limitations: CloudWatch Logs Insights has a 10K result limit and charges per query ($0.005/GB scanned). LogQL on Loki has no per-query cost and supports streaming.
- Cross-cloud: If you also run GKE, CloudWatch only covers AWS. Loki handles both uniformly.
My approach: Use CloudWatch for AWS service logs (RDS, ALB, Lambda) and forward important ones to Loki via Lambda forwarder or Alloy’s CloudWatch source. Use Loki for all Kubernetes application logs. Single Grafana UI for everything.”
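The ingestion math in the first point is easy to verify. The $0.50/GB figure is CloudWatch Logs' standard ingestion price in most regions; the 2 TB/day volume is the scenario's assumption.

```python
CLOUDWATCH_INGEST_USD_PER_GB = 0.50  # CloudWatch Logs standard ingestion
GB_PER_DAY = 2000                    # 2 TB/day

# Ingestion alone, before storage, Insights queries, or data transfer
daily = GB_PER_DAY * CLOUDWATCH_INGEST_USD_PER_GB
monthly = daily * 30
print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")  # $1,000/day, $30,000/month
```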
### Scenario 3: Log Volume Explosion

“A team deployed a new version and log volume jumped from 100 GB/day to 2 TB/day. What do you do?”
Strong Answer:
“This is a platform emergency — it can blow storage budgets and degrade Loki performance. My response:
Immediate (within 1 hour):

- Identify the source: `topk(5, sum by (namespace, pod) (bytes_rate({job="kubernetes-pods"} [5m])))` in LogQL (LogQL requires at least one non-empty matcher, so anchor on a broad label like `job`)
- If it's debug logging left on, ask the team to deploy a fix. If they can't immediately, add an Alloy pipeline stage to drop logs matching the pattern
- Check Loki ingestion limits — per-stream rate limiting should prevent a single pod from overwhelming the system
Root cause fix:

- Enforce structured logging standards — no unstructured printf debugging in production
- Add per-namespace ingestion quotas in Loki's `limits_config`
- Alloy rate-limiting per stream (already in our standard config)
- Add a CI/CD check: warn if a new deployment enables `LOG_LEVEL=DEBUG` in production
Prevention:

- Platform-level Alloy pipeline drops `debug` and `trace` level logs in production by default
- Per-namespace log budget alerts (warn at 2x baseline, critical at 5x)
- Cost chargeback — teams pay for their log volume, which incentivizes optimization”
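The 2x/5x budget alert in the prevention list reduces to a ratio check against a per-namespace baseline. A hypothetical sketch of the classification logic (function name and defaults are illustrative):

```python
def log_budget_status(current_gb_per_day: float, baseline_gb_per_day: float,
                      warn_factor: float = 2.0, crit_factor: float = 5.0) -> str:
    """Classify a namespace's log volume against its baseline."""
    if baseline_gb_per_day <= 0:
        raise ValueError("baseline must be positive")
    ratio = current_gb_per_day / baseline_gb_per_day
    if ratio >= crit_factor:
        return "critical"
    if ratio >= warn_factor:
        return "warning"
    return "ok"


# The scenario's incident: 100 GB/day baseline jumping to 2 TB/day
print(log_budget_status(2000, 100))  # critical
```

In practice the current value would come from a recorded `bytes_rate` query and the baseline from a rolling average, with the result routed through the ruler or Alertmanager.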
### Scenario 4: Debugging with Logs

“A customer reports intermittent 500 errors on the payment API. Walk through your debugging process using logs.”
Strong Answer:
“Starting in Grafana, I'd work from broad to narrow:

- Check SLO dashboard first: Is the payment API's error budget burning? This tells me severity and duration.
- Log query: `{namespace="payments", app="payment-api"} | json | status_code >= 500 | line_format "{{.timestamp}} {{.method}} {{.path}} {{.status_code}} {{.error}}"` shows me all 500 errors with their error messages.
- Correlate with time: Overlay the error log rate on the latency graph. If errors correlate with latency spikes, it's likely an upstream timeout.
- Trace correlation: Click the `trace_id` from an error log line. Grafana jumps to Tempo showing the full distributed trace. I can see which downstream service is failing.
- Pattern analysis: `sum by (error) (count_over_time({namespace="payments", app="payment-api"} | json | status_code >= 500 | pattern "<_> error=<error>" [15m]))` groups errors by type — maybe 80% are ‘database connection timeout.’
- Check database logs: `{namespace="payments", app="payments-db"} |= "too many connections"` confirms connection pool exhaustion.
- Resolution: Increase connection pool size or add PgBouncer sidecar.”
### Scenario 5: Log Retention and Compliance

“The compliance team requires 7-year log retention for audit trails. How do you handle this cost-effectively?”
Strong Answer:
“Separate audit logs from application logs — they have different retention and query patterns:
Application logs (30-day retention):
- Stored in Loki with S3 backend
- 30-day compactor retention policy
- Used for debugging, alerting
- Cost: ~$2-3K/month at 2 TB/day
Audit logs (7-year retention):
- Route audit events (login, permission changes, data access) to a separate Loki tenant or directly to S3
- S3 lifecycle policy: Standard (90 days) → Glacier Instant Retrieval (1 year) → Glacier Deep Archive (7 years)
- Tag with compliance metadata for retrieval
- Cost at Deep Archive: $0.00099/GB/month = $1/TB/month
Implementation:

- Application emits audit events with a `type: audit` label
- Alloy pipeline routes `{type="audit"}` to a separate Loki tenant with 7-year retention
- OR: skip Loki entirely for audit — ship directly to S3 in JSON format, queryable via Athena when needed
Total cost for 7-year audit retention (assuming 50 GB/day of audit logs): ~$12K total for 7 years. Compare to keeping everything in Elasticsearch: easily $1M+ over the same period.”
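The lifecycle tiering can be sanity-checked with a rough model. This sketch charges each day's 50 GB only from its write date to the end of the 7-year window; the per-GB prices are assumed approximations of S3 list prices (storage only, no request or retrieval charges), so it lands in the same low-tens-of-thousands range as the answer's estimate rather than reproducing it exactly.

```python
STANDARD_PER_GB_MONTH = 0.023        # first 90 days
GLACIER_IR_PER_GB_MONTH = 0.004      # day 91 through 1 year
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # year 1 onward


def audit_retention_cost(gb_per_day: float = 50, years: int = 7) -> float:
    """Total storage cost over the window, charging each day's logs from
    its write date through the tiers until the window closes."""
    days = years * 365
    total = 0.0
    for d in range(days):
        months_left = (days - d) / 30
        in_standard = min(months_left, 3)
        in_glacier_ir = min(max(months_left - 3, 0), 9)
        in_deep_archive = max(months_left - 12, 0)
        total += gb_per_day * (
            in_standard * STANDARD_PER_GB_MONTH
            + in_glacier_ir * GLACIER_IR_PER_GB_MONTH
            + in_deep_archive * DEEP_ARCHIVE_PER_GB_MONTH
        )
    return total


print(f"~${audit_retention_cost():,.0f} over 7 years")
```

Either way, the total is a rounding error next to keeping the same data hot in Elasticsearch.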
### Scenario 6: Migrating from ELK to Loki

“We run a 30-node Elasticsearch cluster for logging. It costs $50K/month in compute alone. Can we switch to Loki?”
Strong Answer:
“Yes, and the savings are dramatic. Here’s my migration plan:
Phase 1 (Week 1-2): Dual-write
- Deploy Alloy alongside existing Filebeat/Fluentd
- Configure Alloy to ship to Loki while Filebeat continues to Elasticsearch
- Validate data parity: same log lines appear in both systems
Phase 2 (Week 3-4): Dashboard migration
- Recreate top 20 Kibana dashboards in Grafana using LogQL
- Train teams on LogQL syntax (similar to Kibana KQL but different)
- Key translation: KQL `status:500 AND service:payments` → LogQL `{app="payments"} | json | status=500`
Phase 3 (Month 2): Cutover
- Switch on-call runbooks to use Grafana/Loki
- Stop Filebeat shipping to Elasticsearch
- Keep Elasticsearch read-only for 30 days (historical queries)
Phase 4: Decommission
- Shut down 30-node Elasticsearch cluster
- Archive any required data to S3
Cost comparison:
- Elasticsearch: 30 nodes x r6g.2xlarge = ~$50K/month (compute only)
- Loki: 6 pods (3 write, 3 read) + S3 storage = ~$3-5K/month
- Savings: ~$45K/month = $540K/year
Tradeoff: Full-text search is slower in Loki. If teams rely heavily on free-text grep across all logs without label filtering, they’ll notice. Mitigation: enforce structured JSON logging so LogQL can filter on parsed fields efficiently.”
## References

- Amazon CloudWatch Logs User Guide — log groups, subscriptions, Insights queries, and retention
- CloudWatch Logs Insights Query Syntax — query language reference for CloudWatch Logs
- Google Cloud Logging Documentation — log ingestion, routing, sinks, and log-based metrics
- Cloud Logging Query Language — filtering and querying logs in GCP
## Tools & Frameworks

- Grafana Loki Documentation — label-indexed log aggregation with S3/GCS backend
- LogQL Reference — Loki query language for log filtering, parsing, and aggregation
- Grafana Alloy Documentation — unified telemetry collector for metrics, logs, and traces
- Grafana Alloy Loki Components — Alloy configuration for Loki log collection and processing