
Logging

Central Observability Platform — Metrics, Logs (Loki highlighted), Traces feeding into Grafana

The central platform team operates Loki as the log aggregation backend. Tenant teams query their logs in Grafana, filtered by namespace labels. No one manages Elasticsearch clusters.


Grafana Loki architecture — write path (Alloy, Distributor, Ingester, S3/GCS) and read path (Query Frontend, Querier)


Why Alloy over Promtail/Fluentd/Fluent Bit

| Feature | Alloy | Promtail | Fluent Bit | Fluentd |
|---|---|---|---|---|
| Metrics | Yes (Prometheus scrape) | No | Partial | Partial |
| Logs | Yes (Loki push) | Yes | Yes | Yes |
| Traces | Yes (OTel collector) | No | Partial | No |
| Config | River (HCL-like) | YAML | INI-like | Ruby DSL |
| Memory | Low (~50 MB) | Low (~30 MB) | Very low (~15 MB) | High (~100 MB+) |
| Maintained by | Grafana Labs | Grafana Labs (legacy) | CNCF | CNCF |

Alloy replaces Promtail, the Grafana Agent, and the OTel Collector in a single binary. It’s the recommended collector for the Grafana stack.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: grafana-alloy
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: grafana-alloy
  template:
    metadata:
      labels:
        app: grafana-alloy
    spec:
      serviceAccountName: grafana-alloy
      tolerations:
        - operator: Exists # Run on ALL nodes including masters
      containers:
        - name: alloy
          image: grafana/alloy:v1.5.1
          args:
            - run
            - /etc/alloy/config.alloy
            - --storage.path=/var/lib/alloy/data
            - --server.http.listen-addr=0.0.0.0:12345
          ports:
            - containerPort: 12345
              name: http
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
          volumeMounts:
            - name: config
              mountPath: /etc/alloy
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: alloy-data
              mountPath: /var/lib/alloy/data
      volumes:
        - name: config
          configMap:
            name: alloy-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
        - name: alloy-data
          hostPath:
            path: /var/lib/alloy/data
// Alloy config for Kubernetes log collection
// Discover pods, tail their logs, enrich with K8s labels, push to Loki

// Kubernetes pod discovery
discovery.kubernetes "pods" {
  role = "pod"
}

// Relabel: extract namespace, pod, container labels
discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label  = "app"
  }
  // Drop system pods with excessive logging
  rule {
    source_labels = ["namespace"]
    regex         = "kube-system|monitoring"
    action        = "drop"
  }
}

// Tail container log files
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.process.pipeline.receiver]
}

// Processing pipeline: parse, filter, transform
loki.process "pipeline" {
  // Parse JSON logs (if structured)
  stage.json {
    expressions = {
      level   = "level",
      msg     = "msg",
      traceID = "trace_id",
    }
  }

  // Add parsed fields as labels
  stage.labels {
    values = {
      level = "",
    }
  }

  // Drop debug logs in production (reduce volume 40-60%)
  stage.match {
    selector = "{level=\"debug\"}"
    action   = "drop"
  }

  // Rate limit per stream (prevent log flood from one bad pod)
  stage.limit {
    rate  = 100 // lines per second
    burst = 200
  }

  // Add structured metadata (Loki 3.0+)
  stage.structured_metadata {
    values = {
      trace_id = "traceID",
    }
  }

  forward_to = [loki.write.default.receiver]
}

// Push to central Loki
loki.write "default" {
  endpoint {
    url       = "https://loki.monitoring.example.com/loki/api/v1/push"
    tenant_id = "platform"

    tls_config {
      ca_file = "/etc/alloy/ca.crt"
    }

    basic_auth {
      username = env("LOKI_USERNAME")
      password = env("LOKI_PASSWORD")
    }
  }

  external_labels = {
    cluster = "prod-us-east-1",
    env     = "production",
  }
}

AWS CloudWatch Logs forwarding to Loki via Lambda function

Option 1: Lambda Forwarder (real-time, low volume)

# Subscribe CloudWatch Log Group to Lambda forwarder
resource "aws_cloudwatch_log_subscription_filter" "rds_to_loki" {
  name            = "rds-logs-to-loki"
  log_group_name  = "/aws/rds/cluster/payments-db/postgresql"
  filter_pattern  = "" # All logs
  destination_arn = aws_lambda_function.loki_forwarder.arn
}

resource "aws_lambda_function" "loki_forwarder" {
  function_name = "cloudwatch-to-loki"
  role          = aws_iam_role.loki_forwarder.arn
  handler       = "index.handler"
  runtime       = "nodejs20.x"
  timeout       = 60
  memory_size   = 256

  environment {
    variables = {
      LOKI_URL      = "https://loki.monitoring.example.com"
      LOKI_USERNAME = "cloudwatch-ingester"
    }
  }

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }
}

Option 2: Grafana Alloy CloudWatch Source

// Alloy config: scrape CloudWatch Logs directly
loki.source.cloudwatch "rds_logs" {
  region = "us-east-1"

  log_groups {
    names = ["/aws/rds/cluster/payments-db/postgresql"]
  }

  // Poll every 60 seconds
  poll_interval = "60s"

  forward_to = [loki.write.default.receiver]

  labels = {
    source   = "cloudwatch",
    service  = "rds",
    database = "payments-db",
  }
}

# All logs from payments namespace
{namespace="payments"}
# Filter by pod name pattern
{namespace="payments", pod=~"api-.*"}
# Full-text search within filtered logs
{namespace="payments"} |= "connection refused"
# Regex match
{namespace="payments"} |~ "error|Error|ERROR"
# Exclude pattern
{namespace="payments"} != "health_check"
# JSON parsing + field filter
{namespace="payments"} | json | level="error" | status_code >= 500
# Line format (rewrite log lines for readability)
{namespace="payments"} | json | line_format "{{.timestamp}} [{{.level}}] {{.msg}}"
# Error rate per service (count of error logs per second)
sum by (app) (rate({namespace="payments"} |= "error" [5m]))
# Bytes ingested per namespace (cost attribution)
sum by (namespace) (bytes_rate({job="kubernetes-pods"} [1h]))
# Top 5 pods by log volume
topk(5, sum by (pod) (bytes_rate({namespace="production"} [1h])))
# Count 5xx errors per endpoint (from JSON logs)
sum by (endpoint) (
  count_over_time(
    {namespace="payments"} | json | status_code >= 500 [5m]
  )
)
# P99 latency from log fields (when you don't have metrics)
quantile_over_time(0.99,
  {namespace="payments"} | json | unwrap duration_ms [5m]
) by (endpoint)
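
The same queries can be run programmatically against Loki's HTTP API (`GET /loki/api/v1/query_range`, with the `X-Scope-OrgID` header selecting the tenant when `auth_enabled` is on). A small Node.js sketch; the endpoint and tenant name are placeholders:

```javascript
// Build a query_range URL; Loki expects start/end as Unix epoch nanoseconds
function lokiQueryRangeUrl(base, { query, startMs, endMs, limit = 1000 }) {
  const params = new URLSearchParams({
    query,
    start: (BigInt(startMs) * 1000000n).toString(), // ms -> ns
    end: (BigInt(endMs) * 1000000n).toString(),
    limit: String(limit),
  });
  return `${base}/loki/api/v1/query_range?${params}`;
}

// Run a query against a multi-tenant Loki (tenant goes in X-Scope-OrgID)
async function queryLoki(base, tenant, opts) {
  const res = await fetch(lokiQueryRangeUrl(base, opts), {
    headers: { "X-Scope-OrgID": tenant },
  });
  if (!res.ok) throw new Error(`Loki query failed: ${res.status}`);
  // .data.result holds streams for log queries, samples for metric queries
  return (await res.json()).data.result;
}

// Example: the last hour of payment errors
// queryLoki("https://loki.monitoring.example.com", "platform", {
//   query: '{namespace="payments"} |= "error"',
//   startMs: Date.now() - 3600 * 1000,
//   endMs: Date.now(),
// });
```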

# Helm values for Loki in simple scalable mode (write/read/backend targets)
# helm install loki grafana/loki -f values.yaml
loki:
  auth_enabled: true # Multi-tenant mode
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks-prod
      ruler: loki-ruler-prod
    s3:
      region: us-east-1
      endpoint: null # Use default AWS endpoint
  limits_config:
    retention_period: 30d
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
    max_query_parallelism: 32
    max_entries_limit_per_query: 10000
    per_stream_rate_limit: 5MB
    per_stream_rate_limit_burst: 15MB
  schema_config:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  compactor:
    retention_enabled: true
    delete_request_store: s3

# Component sizing
write:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
  persistence:
    size: 20Gi
    storageClass: gp3-encrypted

read:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi

backend:
  replicas: 2
  resources:
    requests:
      cpu: 250m
      memory: 512Mi

# Gateway (nginx) for routing
gateway:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi

# Loki ruler config for log-based alerts
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-ruler-config
  namespace: monitoring
data:
  rules.yml: |
    groups:
      - name: application-errors
        rules:
          # Alert on high error rate in logs
          - alert: HighErrorLogRate
            expr: |
              sum by (namespace, app) (
                rate({job="kubernetes-pods"} |= "level=error" [5m])
              ) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate in {{ $labels.app }} ({{ $labels.namespace }})"
              description: "More than 10 error logs/sec for 5 minutes"

          # Alert on OOM kill messages in kernel logs
          - alert: OOMKillDetected
            expr: |
              count_over_time(
                {job="systemd-journal"} |= "Out of memory: Killed process" [5m]
              ) > 0
            labels:
              severity: critical
            annotations:
              summary: "OOM kill detected on {{ $labels.node }}"
              runbook: "https://runbooks.example.com/oom-kill"

          # Alert on database connection failures
          - alert: DatabaseConnectionFailure
            expr: |
              sum by (app) (
                count_over_time(
                  {namespace="payments"} |= "connection refused" |= "postgresql" [5m]
                )
              ) > 5
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.app }} cannot connect to database"

      - name: security-events
        rules:
          # Alert on authentication failures
          - alert: HighAuthFailureRate
            expr: |
              sum by (source_ip) (
                count_over_time(
                  {namespace="auth"} | json | event_type="auth_failure" [10m]
                )
              ) > 50
            labels:
              severity: critical
              team: security
            annotations:
              summary: "50+ auth failures from {{ $labels.source_ip }} in 10 minutes — possible brute force"

{
  "timestamp": "2026-03-15T10:23:45.123Z",
  "level": "error",
  "logger": "payment-service",
  "msg": "Payment processing failed",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "service": "payment-api",
  "environment": "production",
  "error": {
    "type": "PaymentGatewayTimeout",
    "message": "Upstream timeout after 30s",
    "stack": "at PaymentService.process..."
  },
  "context": {
    "payment_id": "PAY-12345",
    "amount": 150.00,
    "currency": "AED",
    "customer_tier": "premium"
  }
}

“Design a logging architecture for 20 Kubernetes clusters generating 5 TB of logs per day.”

Strong Answer:

“At 5 TB/day, cost is the primary constraint. Here’s my architecture:

Collection (per cluster): Grafana Alloy as DaemonSet on every node. It tails container log files from /var/log/containers/, enriches with Kubernetes labels (namespace, pod, app), and ships to central Loki. I configure pipeline stages to:

  1. Parse JSON logs and extract level as a label
  2. Drop debug level logs in production (reduces volume 40-60%)
  3. Rate-limit per stream to 100 lines/sec (prevents log flooding from one bad pod)

Storage (central): Loki in simple scalable mode (write/read/backend targets) with S3/GCS backend. Loki only indexes labels — log content goes to object storage as compressed chunks. At 5 TB/day after filtering (say 2 TB after dropping debug):

  • S3 cost: ~$46/month per TB = ~$2,760/month for 30-day retention
  • Compare to Elasticsearch: 10-20x more expensive (needs hot/warm/cold nodes with fast disks)

Retention: 30 days in Loki (hot queries), 90 days in S3 (archived, queryable but slower). Compliance logs (audit, security) stored separately with 7-year retention in S3 Glacier.

Access control: Multi-tenant Loki — each team’s Grafana org sees only their namespace’s logs. Platform team has a global tenant for cross-cutting queries.”
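
The storage arithmetic in that answer is easy to sanity-check. A quick sketch using the answer's own ballpark figures (the ~$46/TB-month rate is the answer's loaded estimate, not current AWS list price):

```javascript
// Steady-state S3 cost for Loki chunks: ingest rate x retention x unit price
function monthlyS3CostUsd({ tbPerDay, retentionDays, usdPerTbMonth }) {
  const storedTb = tbPerDay * retentionDays; // volume held at steady state
  return storedTb * usdPerTbMonth;
}

// 5 TB/day raw -> ~2 TB/day after dropping debug logs, 30-day retention
const lokiCost = monthlyS3CostUsd({ tbPerDay: 2, retentionDays: 30, usdPerTbMonth: 46 });
console.log(lokiCost); // 2760 -- the ~$2,760/month figure in the answer
```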


“Why not just use CloudWatch Logs for everything on AWS?”

Strong Answer:

“CloudWatch Logs works for AWS-native services (RDS, Lambda, ALB) but has limitations for a platform team:

  1. Cost at scale: CloudWatch charges $0.50/GB ingested + $0.03/GB stored. At 2 TB/day, that’s $1,000/day = $30K/month just for ingestion. Loki with S3 storage costs 10-20x less.

  2. No Kubernetes awareness: CloudWatch doesn’t understand namespace, pod, or container labels natively. Container Insights provides basic log routing but lacks the label-based querying of LogQL.

  3. Query limitations: CloudWatch Insights has a 10K result limit and charges per query ($0.005/GB scanned). LogQL on Loki has no per-query cost and supports streaming.

  4. Cross-cloud: If you also run GKE, CloudWatch only covers AWS. Loki handles both uniformly.

My approach: Use CloudWatch for AWS service logs (RDS, ALB, Lambda) and forward important ones to Loki via Lambda forwarder or Alloy’s CloudWatch source. Use Loki for all Kubernetes application logs. Single Grafana UI for everything.”
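
The cost claim in point 1 can be sketched the same way, using the per-GB rate quoted in the answer (actual AWS pricing varies by region and tier):

```javascript
// CloudWatch ingestion cost at the answer's quoted rate of $0.50/GB
function cloudwatchIngestUsdPerMonth({ gbPerDay, usdPerGb = 0.5, days = 30 }) {
  return gbPerDay * usdPerGb * days;
}

const cw = cloudwatchIngestUsdPerMonth({ gbPerDay: 2000 }); // 2 TB/day
console.log(cw); // 30000 -- the "$30K/month just for ingestion" above
```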


“A team deployed a new version and log volume jumped from 100 GB/day to 2 TB/day. What do you do?”

Strong Answer:

“This is a platform emergency — it can blow storage budgets and degrade Loki performance. My response:

Immediate (within 1 hour):

  1. Identify the source: topk(5, sum by (namespace, pod) (bytes_rate({namespace=~".+"} [5m]))) in LogQL
  2. If it’s debug logging left on, ask the team to deploy a fix. If they can’t immediately, add an Alloy pipeline stage to drop logs matching the pattern
  3. Check Loki ingestion limits — per-stream rate limiting should prevent a single pod from overwhelming the system

Root cause fix:

  1. Enforce structured logging standards — no unstructured printf debugging in production
  2. Add per-namespace ingestion quotas in Loki’s limits_config
  3. Alloy rate-limiting per stream (already in our standard config)
  4. Add a CI/CD check: warn if a new deployment enables LOG_LEVEL=DEBUG in production

Prevention:

  • Platform-level Alloy pipeline drops debug and trace level logs in production by default
  • Per-namespace log budget alerts (warn at 2x baseline, critical at 5x)
  • Cost chargeback — teams pay for their log volume, which incentivizes optimization”
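
The per-namespace budget alert in the prevention list reduces to a simple threshold check. A sketch of that logic; the 2x/5x multipliers come from the answer, the function name is made up:

```javascript
// Compare a namespace's current log throughput to its rolling baseline:
// warn at 2x baseline, page at 5x (the answer's thresholds)
function logBudgetStatus(currentBytesPerSec, baselineBytesPerSec) {
  const ratio = currentBytesPerSec / baselineBytesPerSec;
  if (ratio >= 5) return "critical";
  if (ratio >= 2) return "warning";
  return "ok";
}

// The incident above: 100 GB/day -> 2 TB/day is a 20x jump
console.log(logBudgetStatus(20, 1)); // "critical"
```

In practice the baseline would come from a recorded `bytes_rate` average over the previous week, and the check would run as a Loki ruler or Prometheus alert rather than application code.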

“A customer reports intermittent 500 errors on the payment API. Walk through your debugging process using logs.”

Strong Answer:

“Starting in Grafana, I’d work from broad to narrow:

  1. Check SLO dashboard first: Is the payment API’s error budget burning? This tells me severity and duration.

  2. Log query:

{namespace="payments", app="payment-api"} | json | status_code >= 500 | line_format "{{.timestamp}} {{.method}} {{.path}} {{.status_code}} {{.error}}"

This shows me all 500 errors with their error messages.

  3. Correlate with time: Overlay the error log rate on the latency graph. If errors correlate with latency spikes, it’s likely an upstream timeout.

  4. Trace correlation: Click the trace_id from an error log line. Grafana jumps to Tempo showing the full distributed trace. I can see which downstream service is failing.

  5. Pattern analysis:

sum by (error) (count_over_time({namespace="payments", app="payment-api"} | json | status_code >= 500 | pattern "<_> error=<error>" [5m]))

This groups errors by type — maybe 80% are ‘database connection timeout.’

  6. Check database logs: {namespace="payments", app="payments-db"} |= "too many connections" — confirms connection pool exhaustion.

  7. Resolution: Increase connection pool size or add PgBouncer sidecar.”


“The compliance team requires 7-year log retention for audit trails. How do you handle this cost-effectively?”

Strong Answer:

“Separate audit logs from application logs — they have different retention and query patterns:

Application logs (30-day retention):

  • Stored in Loki with S3 backend
  • 30-day compactor retention policy
  • Used for debugging, alerting
  • Cost: ~$2-3K/month at 2 TB/day

Audit logs (7-year retention):

  • Route audit events (login, permission changes, data access) to a separate Loki tenant or directly to S3
  • S3 lifecycle policy: Standard (90 days) → Glacier Instant Retrieval (1 year) → Glacier Deep Archive (7 years)
  • Tag with compliance metadata for retrieval
  • Cost at Deep Archive: $0.00099/GB/month = $1/TB/month

Implementation:

  • Application emits audit events with type: audit label
  • Alloy pipeline routes {type="audit"} to separate Loki tenant with 7-year retention
  • OR: Skip Loki entirely for audit — ship directly to S3 in JSON format, queryable via Athena when needed

Total cost for 7-year audit retention (assuming 50 GB/day of audit logs): ~$12K total for 7 years. Compare to keeping everything in Elasticsearch: easily $1M+ over the same period.”
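
The Deep Archive figure can be roughed out with a growing-archive model. This sketch bills everything at the $1/TB-month Deep Archive rate from the answer, so it understates the total relative to the ~$12K estimate, which also pays for the more expensive Standard and Instant Retrieval stages:

```javascript
// Lifetime cost of an archive that grows by gbPerDay and is billed monthly
function archiveTotalUsd({ gbPerDay, years, usdPerTbMonth = 1 }) {
  const months = years * 12;
  let storedTb = 0;
  let total = 0;
  for (let m = 0; m < months; m++) {
    storedTb += (gbPerDay * 30) / 1000; // ~a month of new logs, in TB
    total += storedTb * usdPerTbMonth;  // billed on everything held so far
  }
  return Math.round(total);
}

const audit = archiveTotalUsd({ gbPerDay: 50, years: 7 });
console.log(audit); // ~$5.4K at the Deep Archive rate alone
```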


“We run a 30-node Elasticsearch cluster for logging. It costs $50K/month in compute alone. Can we switch to Loki?”

Strong Answer:

“Yes, and the savings are dramatic. Here’s my migration plan:

Phase 1 (Week 1-2): Dual-write

  • Deploy Alloy alongside existing Filebeat/Fluentd
  • Configure Alloy to ship to Loki while Filebeat continues to Elasticsearch
  • Validate data parity: same log lines appear in both systems

Phase 2 (Week 3-4): Dashboard migration

  • Recreate top 20 Kibana dashboards in Grafana using LogQL
  • Train teams on LogQL syntax (similar to Kibana KQL but different)
  • Key translation: KQL status:500 AND service:payments → LogQL {app="payments"} | json | status == 500

Phase 3 (Month 2): Cutover

  • Switch on-call runbooks to use Grafana/Loki
  • Stop Filebeat shipping to Elasticsearch
  • Keep Elasticsearch read-only for 30 days (historical queries)

Phase 4: Decommission

  • Shut down 30-node Elasticsearch cluster
  • Archive any required data to S3

Cost comparison:

  • Elasticsearch: 30 nodes x r6g.2xlarge = ~$50K/month (compute only)
  • Loki: ~10 pods (3 write, 3 read, 2 backend, 2 gateway) + S3 storage = ~$3-5K/month
  • Savings: ~$45K/month = $540K/year

Tradeoff: Full-text search is slower in Loki. If teams rely heavily on free-text grep across all logs without label filtering, they’ll notice. Mitigation: enforce structured JSON logging so LogQL can filter on parsed fields efficiently.”