# Logging

## Where This Fits

The central platform team operates Loki as the log aggregation backend. Tenant teams query their logs in Grafana, filtered by namespace labels. No one manages Elasticsearch clusters.
## Grafana Loki Architecture

### Grafana Alloy (Log Collection Agent)

#### Why Alloy over Promtail/Fluentd/Fluent Bit

| Feature | Alloy | Promtail | Fluent Bit | Fluentd |
|---|---|---|---|---|
| Metrics | Yes (Prometheus scrape) | No | Partial | Partial |
| Logs | Yes (Loki push) | Yes | Yes | Yes |
| Traces | Yes (OTel collector) | No | Partial | No |
| Config | River (HCL-like) | YAML | INI-like | Ruby DSL |
| Memory | Low (~50 MB) | Low (~30 MB) | Very low (~15 MB) | High (~100 MB+) |
| Maintained by | Grafana Labs | Grafana Labs (legacy) | CNCF | CNCF |
Alloy replaces Promtail, the Grafana Agent, and the OTel Collector in a single binary. It’s the recommended collector for the Grafana stack.
#### Alloy DaemonSet

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: grafana-alloy
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana-alloy
  template:
    metadata:
      labels:
        app: grafana-alloy
    spec:
      serviceAccountName: grafana-alloy
      tolerations:
        - operator: Exists # Run on ALL nodes including masters
      containers:
        - name: alloy
          image: grafana/alloy:v1.5.1
          args:
            - run
            - /etc/alloy/config.alloy
            - --storage.path=/var/lib/alloy/data
            - --server.http.listen-addr=0.0.0.0:12345
          ports:
            - containerPort: 12345
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alloy
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: alloy-data
              mountPath: /var/lib/alloy/data
      volumes:
        - name: config
          configMap:
            name: alloy-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
        - name: alloy-data
          hostPath:
            path: /var/lib/alloy/data
```
#### Alloy Configuration (River syntax)

```river
// Alloy config for Kubernetes log collection
// Discover pods, tail their logs, enrich with K8s labels, push to Loki

// Kubernetes pod discovery
discovery.kubernetes "pods" {
  role = "pod"
}

// Relabel: extract namespace, pod, container labels
discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label  = "app"
  }
  // Drop system pods with excessive logging
  rule {
    source_labels = ["namespace"]
    regex         = "kube-system|monitoring"
    action        = "drop"
  }
}

// Tail container log files
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.process.pipeline.receiver]
}

// Processing pipeline: parse, filter, transform
loki.process "pipeline" {
  // Parse JSON logs (if structured)
  stage.json {
    expressions = {
      level   = "level",
      msg     = "msg",
      traceID = "trace_id",
    }
  }

  // Add parsed fields as labels
  stage.labels {
    values = {
      level = "",
    }
  }

  // Drop debug logs in production (reduce volume 40-60%)
  stage.match {
    selector = "{level=\"debug\"}"
    action   = "drop"
  }

  // Rate limit per stream (prevent log flood from one bad pod)
  stage.limit {
    rate  = 100 // lines per second
    burst = 200
  }

  // Add structured metadata (Loki 3.0+)
  stage.structured_metadata {
    values = {
      trace_id = "traceID",
    }
  }

  forward_to = [loki.write.default.receiver]
}

// Push to central Loki
loki.write "default" {
  endpoint {
    url = "https://loki.monitoring.example.com/loki/api/v1/push"

    tenant_id = "platform"

    tls_config {
      ca_file = "/etc/alloy/ca.crt"
    }

    basic_auth {
      username = env("LOKI_USERNAME")
      password = env("LOKI_PASSWORD")
    }
  }
  external_labels = {
    cluster = "prod-us-east-1",
    env     = "production",
  }
}
```
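Under the hood, `loki.write` speaks Loki's HTTP push API: a POST of JSON-encoded streams with nanosecond timestamps, plus an `X-Scope-OrgID` header carrying the tenant when multi-tenancy is enabled. A minimal Python sketch of the same call (the endpoint and tenant match the config above; the helper names are illustrative):

```python
import json
import time
import urllib.request


def build_push_payload(labels: dict, lines: list) -> bytes:
    """Build a Loki push-API body: one stream with the given label set."""
    ts = str(time.time_ns())  # Loki expects nanosecond epoch as a string
    return json.dumps({
        "streams": [{
            "stream": labels,                       # indexed labels
            "values": [[ts, line] for line in lines],  # [timestamp, log line]
        }]
    }).encode()


def push(base_url: str, tenant: str, payload: bytes) -> None:
    """POST one batch to Loki's push endpoint (raises on non-2xx)."""
    req = urllib.request.Request(
        base_url + "/loki/api/v1/push",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-Scope-OrgID": tenant,  # tenant header when auth_enabled: true
        },
    )
    urllib.request.urlopen(req)


# Example: build the body for one error line in the payment-api stream.
body = build_push_payload(
    {"namespace": "payments", "app": "payment-api"},
    ['{"level":"error","msg":"Payment processing failed"}'],
)
# push("https://loki.monitoring.example.com", "platform", body)  # needs network + credentials
```

In practice Alloy also batches, retries with backoff, and compresses; this sketch only shows the wire format.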
## Ingesting Cloud Provider Logs

### AWS: CloudWatch Logs to Loki

**Option 1: Lambda Forwarder (real-time, low volume)**
```hcl
# Subscribe CloudWatch Log Group to Lambda forwarder
resource "aws_cloudwatch_log_subscription_filter" "rds_to_loki" {
  name            = "rds-logs-to-loki"
  log_group_name  = "/aws/rds/cluster/payments-db/postgresql"
  filter_pattern  = "" # All logs
  destination_arn = aws_lambda_function.loki_forwarder.arn
}

resource "aws_lambda_function" "loki_forwarder" {
  function_name = "cloudwatch-to-loki"
  role          = aws_iam_role.loki_forwarder.arn
  handler       = "index.handler"
  runtime       = "nodejs20.x"
  timeout       = 60
  memory_size   = 256

  environment {
    variables = {
      LOKI_URL      = "https://loki.monitoring.example.com"
      LOKI_USERNAME = "cloudwatch-ingester"
    }
  }

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }
}
```

**Option 2: Grafana Alloy CloudWatch Source**
```river
// Alloy config: scrape CloudWatch Logs directly
loki.source.cloudwatch "rds_logs" {
  region = "us-east-1"

  log_groups {
    names = ["/aws/rds/cluster/payments-db/postgresql"]
  }

  // Process every 60 seconds
  poll_interval = "60s"

  forward_to = [loki.write.default.receiver]

  labels = {
    source   = "cloudwatch",
    service  = "rds",
    database = "payments-db",
  }
}
```

### Log Sink + Pub/Sub + Subscriber
```hcl
# Organization-level log sink → Pub/Sub
resource "google_logging_organization_sink" "loki" {
  name             = "loki-sink"
  org_id           = var.org_id
  destination      = "pubsub.googleapis.com/projects/${var.shared_project}/topics/${google_pubsub_topic.loki_logs.name}"
  include_children = true

  # Filter: only important logs (reduces volume 80%)
  # Parentheses group the OR'd resource types; the severity clause is ANDed.
  filter = <<-EOT
    (
      resource.type = "cloud_sql_database" OR
      resource.type = "cloud_run_revision" OR
      resource.type = "gke_cluster" OR
      (resource.type = "k8s_container" AND resource.labels.namespace_name != "kube-system")
    )
    severity >= WARNING
  EOT
}

resource "google_pubsub_topic" "loki_logs" {
  name    = "loki-log-ingestion"
  project = var.shared_project
}

# Subscriber: Cloud Run service that pushes to Loki
resource "google_pubsub_subscription" "loki" {
  name  = "loki-subscriber"
  topic = google_pubsub_topic.loki_logs.name

  push_config {
    push_endpoint = google_cloud_run_v2_service.loki_ingester.uri

    oidc_token {
      service_account_email = google_service_account.loki_ingester.email
    }
  }

  ack_deadline_seconds       = 60
  message_retention_duration = "604800s" # 7 days
  retry_policy {
    minimum_backoff = "10s"
    maximum_backoff = "300s"
  }
}
```
## LogQL Query Language

### Basic Queries
```logql
# All logs from payments namespace
{namespace="payments"}

# Filter by pod name pattern
{namespace="payments", pod=~"api-.*"}

# Full-text search within filtered logs
{namespace="payments"} |= "connection refused"

# Regex match
{namespace="payments"} |~ "error|Error|ERROR"

# Exclude pattern
{namespace="payments"} != "health_check"

# JSON parsing + field filter
{namespace="payments"} | json | level="error" | status_code >= 500

# Line format (rewrite log lines for readability)
{namespace="payments"} | json | line_format "{{.timestamp}} [{{.level}}] {{.msg}}"
```
### Aggregation Queries (Log Metrics)

```logql
# Error rate per service (count of error logs per second)
sum by (app) (rate({namespace="payments"} |= "error" [5m]))

# Bytes ingested per namespace (cost attribution)
sum by (namespace) (bytes_rate({job="kubernetes-pods"} [1h]))

# Top 5 pods by log volume
topk(5, sum by (pod) (bytes_rate({namespace="production"} [1h])))

# Count 5xx errors per endpoint (from JSON logs)
sum by (endpoint) (
  count_over_time(
    {namespace="payments"} | json | status_code >= 500 [5m]
  )
)

# P99 latency from log fields (when you don't have metrics)
quantile_over_time(0.99,
  {namespace="payments"} | json | unwrap duration_ms [5m]
) by (endpoint)
```
## Loki Deployment (Scalable Mode)

```yaml
# Helm values for Loki in simple scalable mode (read/write/backend targets)
# helm install loki grafana/loki -f values.yaml
loki:
  auth_enabled: true # Multi-tenant mode

  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks-prod
      ruler: loki-ruler-prod
    s3:
      region: us-east-1
      endpoint: null # Use default AWS endpoint

  limits_config:
    retention_period: 30d
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
    max_query_parallelism: 32
    max_entries_limit_per_query: 10000
    per_stream_rate_limit: 5MB
    per_stream_rate_limit_burst: 15MB

  schema_config:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  compactor:
    retention_enabled: true
    delete_request_store: s3

# Component sizing
write:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
  persistence:
    size: 20Gi
    storageClass: gp3-encrypted

read:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi

backend:
  replicas: 2
  resources:
    requests:
      cpu: 250m
      memory: 512Mi

# Gateway (nginx) for routing
gateway:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```
## Log-Based Alerting

```yaml
# Loki ruler config for log-based alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-ruler-config
  namespace: monitoring
data:
  rules.yml: |
    groups:
      - name: application-errors
        rules:
          # Alert on high error rate in logs
          - alert: HighErrorLogRate
            expr: |
              sum by (namespace, app) (
                rate({job="kubernetes-pods"} |= "level=error" [5m])
              ) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate in {{ $labels.app }} ({{ $labels.namespace }})"
              description: "More than 10 error logs/sec for 5 minutes"

          # Alert on OOM kill messages in kernel logs
          - alert: OOMKillDetected
            expr: |
              count_over_time(
                {job="systemd-journal"} |= "Out of memory: Killed process" [5m]
              ) > 0
            labels:
              severity: critical
            annotations:
              summary: "OOM kill detected on {{ $labels.node }}"
              runbook: "https://runbooks.example.com/oom-kill"

          # Alert on database connection failures
          - alert: DatabaseConnectionFailure
            expr: |
              sum by (app) (
                count_over_time(
                  {namespace="payments"} |= "connection refused" |= "postgresql" [5m]
                )
              ) > 5
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.app }} cannot connect to database"

      - name: security-events
        rules:
          # Alert on authentication failures
          - alert: HighAuthFailureRate
            expr: |
              sum by (source_ip) (
                count_over_time(
                  {namespace="auth"} | json | event_type="auth_failure" [10m]
                )
              ) > 50
            labels:
              severity: critical
              team: security
            annotations:
              summary: "50+ auth failures from {{ $labels.source_ip }} in 10 minutes — possible brute force"
```
## Structured Logging Best Practices

### Golden Format
```json
{
  "timestamp": "2026-03-15T10:23:45.123Z",
  "level": "error",
  "logger": "payment-service",
  "msg": "Payment processing failed",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "service": "payment-api",
  "environment": "production",
  "error": {
    "type": "PaymentGatewayTimeout",
    "message": "Upstream timeout after 30s",
    "stack": "at PaymentService.process..."
  },
  "context": {
    "payment_id": "PAY-12345",
    "amount": 150.00,
    "currency": "AED",
    "customer_tier": "premium"
  }
}
```
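Application code can emit this format with a stock logging setup and a JSON formatter. A minimal Python sketch (the hard-coded `service`/`environment` values and the extra-field handling are illustrative, not a prescribed library):

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line in the golden format."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds")
                .replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "msg": record.getMessage(),
            "service": "payment-api",      # normally injected from env/config
            "environment": "production",
        }
        # Fields attached via the `extra` kwarg become record attributes
        for key in ("trace_id", "span_id", "context"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        if record.exc_info:
            entry["error"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
            }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123def456", "context": {"payment_id": "PAY-12345"}},
)
```

One line per record on stdout is exactly what Alloy's `stage.json` expects to parse.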
Section titled “Interview Scenarios”Scenario 1: Design Centralized Logging
Section titled “Scenario 1: Design Centralized Logging”“Design a logging architecture for 20 Kubernetes clusters generating 5 TB of logs per day.”
Strong Answer:
“At 5 TB/day, cost is the primary constraint. Here’s my architecture:
Collection (per cluster): Grafana Alloy as DaemonSet on every node. It tails container log files from /var/log/containers/, enriches with Kubernetes labels (namespace, pod, app), and ships to central Loki. I configure pipeline stages to:
- Parse JSON logs and extract `level` as a label
- Drop `debug`-level logs in production (reduces volume 40-60%)
- Rate-limit per stream to 100 lines/sec (prevents log flooding from one bad pod)
Storage (central): Loki in microservices mode with S3/GCS backend. Loki only indexes labels — log content goes to object storage as compressed chunks. At 5 TB/day after filtering (say 2 TB after dropping debug):
- S3 cost: ~$46/month per TB = ~$2,760/month for 30-day retention
- Compare to Elasticsearch: 10-20x more expensive (needs hot/warm/cold nodes with fast disks)
Retention: 30 days in Loki (hot queries), 90 days in S3 (archived, queryable but slower). Compliance logs (audit, security) stored separately with 7-year retention in S3 Glacier.
Access control: Multi-tenant Loki — each team’s Grafana org sees only their namespace’s logs. Platform team has a global tenant for cross-cutting queries.”
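The storage math above can be sanity-checked in a few lines. This is a sketch: the ~$46/TB-month effective rate and the 40% keep ratio are the answer's assumptions, not quoted AWS list prices.

```python
def monthly_s3_cost(raw_tb_per_day: float, keep_ratio: float,
                    retention_days: int, usd_per_tb_month: float) -> float:
    """Steady-state object-storage bill: the retained window's volume,
    billed at an effective per-TB-month rate."""
    stored_tb = raw_tb_per_day * keep_ratio * retention_days
    return stored_tb * usd_per_tb_month


# 5 TB/day raw, 40% kept after dropping debug, 30-day retention, ~$46/TB-month
cost = monthly_s3_cost(5.0, 0.4, 30, 46.0)
print(f"${cost:,.0f}/month")  # $2,760/month
```

Note this ignores Loki's chunk compression, which typically shrinks stored bytes further, so the real bill tends to come in under this estimate.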
### Scenario 2: CloudWatch vs Loki

“Why not just use CloudWatch Logs for everything on AWS?”
Strong Answer:
“CloudWatch Logs works for AWS-native services (RDS, Lambda, ALB) but has limitations for a platform team:
- Cost at scale: CloudWatch charges $0.50/GB ingested + $0.03/GB stored. At 2 TB/day, that's $1,000/day = $30K/month just for ingestion. Loki with S3 storage costs 10-20x less.
- No Kubernetes awareness: CloudWatch doesn't understand namespace, pod, or container labels natively. Container Insights provides basic log routing but lacks the label-based querying of LogQL.
- Query limitations: CloudWatch Logs Insights has a 10K result limit and charges per query ($0.005/GB scanned). LogQL on Loki has no per-query cost and supports streaming.
- Cross-cloud: If you also run GKE, CloudWatch only covers AWS. Loki handles both uniformly.
My approach: Use CloudWatch for AWS service logs (RDS, ALB, Lambda) and forward important ones to Loki via Lambda forwarder or Alloy’s CloudWatch source. Use Loki for all Kubernetes application logs. Single Grafana UI for everything.”
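The ingestion math in the first point is easy to verify. The $0.50/GB figure is CloudWatch Logs' standard ingestion price in most regions; the 2 TB/day volume is the scenario's assumption.

```python
CLOUDWATCH_INGEST_USD_PER_GB = 0.50  # CloudWatch Logs standard ingestion
GB_PER_DAY = 2000                    # 2 TB/day

# Ingestion alone, before storage, Insights queries, or data transfer
daily = GB_PER_DAY * CLOUDWATCH_INGEST_USD_PER_GB
monthly = daily * 30
print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")  # $1,000/day, $30,000/month
```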
### Scenario 3: Log Volume Explosion

“A team deployed a new version and log volume jumped from 100 GB/day to 2 TB/day. What do you do?”
Strong Answer:
“This is a platform emergency — it can blow storage budgets and degrade Loki performance. My response:
Immediate (within 1 hour):

- Identify the source: `topk(5, sum by (namespace, pod) (bytes_rate({job="kubernetes-pods"} [5m])))` in LogQL (LogQL requires at least one non-empty matcher, so anchor on a broad label like `job`)
- If it's debug logging left on, ask the team to deploy a fix. If they can't immediately, add an Alloy pipeline stage to drop logs matching the pattern
- Check Loki ingestion limits — per-stream rate limiting should prevent a single pod from overwhelming the system
Root cause fix:

- Enforce structured logging standards — no unstructured printf debugging in production
- Add per-namespace ingestion quotas in Loki's `limits_config`
- Alloy rate-limiting per stream (already in our standard config)
- Add a CI/CD check: warn if a new deployment enables `LOG_LEVEL=DEBUG` in production
Prevention:

- Platform-level Alloy pipeline drops `debug` and `trace` level logs in production by default
- Per-namespace log budget alerts (warn at 2x baseline, critical at 5x)
- Cost chargeback — teams pay for their log volume, which incentivizes optimization”
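The 2x/5x budget alert in the prevention list reduces to a ratio check against a per-namespace baseline. A hypothetical sketch of the classification logic (function name and defaults are illustrative):

```python
def log_budget_status(current_gb_per_day: float, baseline_gb_per_day: float,
                      warn_factor: float = 2.0, crit_factor: float = 5.0) -> str:
    """Classify a namespace's log volume against its baseline."""
    if baseline_gb_per_day <= 0:
        raise ValueError("baseline must be positive")
    ratio = current_gb_per_day / baseline_gb_per_day
    if ratio >= crit_factor:
        return "critical"
    if ratio >= warn_factor:
        return "warning"
    return "ok"


# The scenario's incident: 100 GB/day baseline jumping to 2 TB/day
print(log_budget_status(2000, 100))  # critical
```

In practice the current value would come from a recorded `bytes_rate` query and the baseline from a rolling average, with the result routed through the ruler or Alertmanager.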
### Scenario 4: Debugging with Logs

“A customer reports intermittent 500 errors on the payment API. Walk through your debugging process using logs.”
Strong Answer:
“Starting in Grafana, I'd work from broad to narrow:

- Check SLO dashboard first: Is the payment API's error budget burning? This tells me severity and duration.
- Log query: `{namespace="payments", app="payment-api"} | json | status_code >= 500 | line_format "{{.timestamp}} {{.method}} {{.path}} {{.status_code}} {{.error}}"` shows me all 500 errors with their error messages.
- Correlate with time: Overlay the error log rate on the latency graph. If errors correlate with latency spikes, it's likely an upstream timeout.
- Trace correlation: Click the `trace_id` from an error log line. Grafana jumps to Tempo showing the full distributed trace. I can see which downstream service is failing.
- Pattern analysis: `sum by (error) (count_over_time({namespace="payments", app="payment-api"} | json | status_code >= 500 | pattern "<_> error=<error>" [15m]))` groups errors by type — maybe 80% are ‘database connection timeout.’
- Check database logs: `{namespace="payments", app="payments-db"} |= "too many connections"` confirms connection pool exhaustion.
- Resolution: Increase connection pool size or add PgBouncer sidecar.”
### Scenario 5: Log Retention and Compliance

“The compliance team requires 7-year log retention for audit trails. How do you handle this cost-effectively?”
Strong Answer:
“Separate audit logs from application logs — they have different retention and query patterns:
Application logs (30-day retention):
- Stored in Loki with S3 backend
- 30-day compactor retention policy
- Used for debugging, alerting
- Cost: ~$2-3K/month at 2 TB/day
Audit logs (7-year retention):
- Route audit events (login, permission changes, data access) to a separate Loki tenant or directly to S3
- S3 lifecycle policy: Standard (90 days) → Glacier Instant Retrieval (1 year) → Glacier Deep Archive (7 years)
- Tag with compliance metadata for retrieval
- Cost at Deep Archive: $0.00099/GB/month = $1/TB/month
Implementation:

- Application emits audit events with a `type: audit` label
- Alloy pipeline routes `{type="audit"}` to a separate Loki tenant with 7-year retention
- OR: skip Loki entirely for audit — ship directly to S3 in JSON format, queryable via Athena when needed
Total cost for 7-year audit retention (assuming 50 GB/day of audit logs): ~$12K total for 7 years. Compare to keeping everything in Elasticsearch: easily $1M+ over the same period.”
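The lifecycle tiering can be sanity-checked with a rough model. This sketch charges each day's 50 GB only from its write date to the end of the 7-year window; the per-GB prices are assumed approximations of S3 list prices (storage only, no request or retrieval charges), so it lands in the same low-tens-of-thousands range as the answer's estimate rather than reproducing it exactly.

```python
STANDARD_PER_GB_MONTH = 0.023        # first 90 days
GLACIER_IR_PER_GB_MONTH = 0.004      # day 91 through 1 year
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # year 1 onward


def audit_retention_cost(gb_per_day: float = 50, years: int = 7) -> float:
    """Total storage cost over the window, charging each day's logs from
    its write date through the tiers until the window closes."""
    days = years * 365
    total = 0.0
    for d in range(days):
        months_left = (days - d) / 30
        in_standard = min(months_left, 3)
        in_glacier_ir = min(max(months_left - 3, 0), 9)
        in_deep_archive = max(months_left - 12, 0)
        total += gb_per_day * (
            in_standard * STANDARD_PER_GB_MONTH
            + in_glacier_ir * GLACIER_IR_PER_GB_MONTH
            + in_deep_archive * DEEP_ARCHIVE_PER_GB_MONTH
        )
    return total


print(f"~${audit_retention_cost():,.0f} over 7 years")
```

Either way, the total is a rounding error next to keeping the same data hot in Elasticsearch.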
### Scenario 6: Migrating from ELK to Loki

“We run a 30-node Elasticsearch cluster for logging. It costs $50K/month in compute alone. Can we switch to Loki?”
Strong Answer:
“Yes, and the savings are dramatic. Here’s my migration plan:
Phase 1 (Week 1-2): Dual-write
- Deploy Alloy alongside existing Filebeat/Fluentd
- Configure Alloy to ship to Loki while Filebeat continues to Elasticsearch
- Validate data parity: same log lines appear in both systems
Phase 2 (Week 3-4): Dashboard migration
- Recreate top 20 Kibana dashboards in Grafana using LogQL
- Train teams on LogQL syntax (similar to Kibana KQL but different)
- Key translation: KQL `status:500 AND service:payments` → LogQL `{app="payments"} | json | status=500`
Phase 3 (Month 2): Cutover
- Switch on-call runbooks to use Grafana/Loki
- Stop Filebeat shipping to Elasticsearch
- Keep Elasticsearch read-only for 30 days (historical queries)
Phase 4: Decommission
- Shut down 30-node Elasticsearch cluster
- Archive any required data to S3
Cost comparison:
- Elasticsearch: 30 nodes x r6g.2xlarge = ~$50K/month (compute only)
- Loki: 6 pods (3 write, 3 read) + S3 storage = ~$3-5K/month
- Savings: ~$45K/month = $540K/year
Tradeoff: Full-text search is slower in Loki. If teams rely heavily on free-text grep across all logs without label filtering, they’ll notice. Mitigation: enforce structured JSON logging so LogQL can filter on parsed fields efficiently.”
## References

- Amazon CloudWatch Logs User Guide — log groups, subscriptions, Insights queries, and retention
- CloudWatch Logs Insights Query Syntax — query language reference for CloudWatch Logs
- Google Cloud Logging Documentation — log ingestion, routing, sinks, and log-based metrics
- Cloud Logging Query Language — filtering and querying logs in GCP
## Tools & Frameworks

- Grafana Loki Documentation — label-indexed log aggregation with S3/GCS backend
- LogQL Reference — Loki query language for log filtering, parsing, and aggregation
- Grafana Alloy Documentation — unified telemetry collector for metrics, logs, and traces
- Grafana Alloy Loki Components — Alloy configuration for Loki log collection and processing