
Distributed Tracing

Central Observability Platform — Metrics, Logs, Traces (Tempo highlighted) correlated through Grafana

Tracing answers the question: “Where did this request spend its time?” While metrics tell you something is slow and logs tell you what happened, traces show you the full journey of a request across services.


OpenTelemetry overview — APIs/SDKs, OTel Collector, OTLP Protocol, vendor-neutral

Trace structure — nested spans showing API Gateway, Payment Service, DB Query, Redis Cache, and External API call


OTel Collector pipeline — Receivers, Processors, Exporters with Agent and Gateway deployment models

Two-tier OTel Collector deployment — per-node Agent DaemonSet forwarding via OTLP/gRPC to central Gateway

# Central Gateway Collector (Deployment)
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-gateway-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      # Prevent OOM
      memory_limiter:
        check_interval: 5s
        limit_mib: 1500
        spike_limit_mib: 500
      # Batch before export
      batch:
        send_batch_size: 10000
        send_batch_max_size: 11000
        timeout: 5s
      # Add resource attributes
      resource:
        attributes:
          - key: environment
            value: production
            action: upsert
      # Tail-based sampling (requires full trace)
      tail_sampling:
        decision_wait: 10s
        num_traces: 100000
        policies:
          # Always sample errors
          - name: errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          # Always sample slow requests (>2s)
          - name: slow-requests
            type: latency
            latency:
              threshold_ms: 2000
          # Sample 10% of successful fast requests
          - name: probabilistic
            type: probabilistic
            probabilistic:
              sampling_percentage: 10
          # Always sample specific services
          - name: critical-services
            type: string_attribute
            string_attribute:
              key: service.name
              values: [payment-api, auth-service]
      # Filter out health check spans (noise)
      filter:
        error_mode: ignore
        traces:
          span:
            - 'attributes["http.target"] == "/health"'
            - 'attributes["http.target"] == "/ready"'
    exporters:
      # Traces → Tempo
      otlp/tempo:
        endpoint: tempo-distributor.monitoring:4317
        tls:
          insecure: true
      # Span metrics → VictoriaMetrics
      # Generates RED metrics from traces automatically
      prometheusremotewrite:
        endpoint: http://vminsert.monitoring:8480/insert/0/prometheus/api/v1/write
    connectors:
      # Generate metrics from spans (spanmetrics)
      spanmetrics:
        histogram:
          explicit:
            buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s]
        dimensions:
          - name: http.method
          - name: http.status_code
          - name: service.name
        exemplars:
          enabled: true  # Link metrics to traces
    service:
      pipelines:
        traces:
          receivers: [otlp]
          # memory_limiter first, batch last (recommended processor ordering)
          processors: [memory_limiter, filter, resource, tail_sampling, batch]
          exporters: [otlp/tempo, spanmetrics]
        # Span-derived metrics pipeline
        metrics:
          receivers: [spanmetrics]
          processors: [batch]
          exporters: [prometheusremotewrite]

Grafana Tempo architecture — Distributor, Ingester, S3/GCS trace blocks, Query Frontend, Querier

# Helm values for Tempo distributed mode
# helm install tempo grafana/tempo-distributed -f values.yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces-prod
        endpoint: s3.us-east-1.amazonaws.com
        region: us-east-1
  retention: 14d  # 14-day trace retention
  metrics_generator:
    enabled: true
    storage:
      path: /var/tempo/wal
      remote_write:
        - url: http://vminsert.monitoring:8480/insert/0/prometheus/api/v1/write
    processor:
      service_graphs:
        enabled: true
        dimensions: [service.namespace]
      span_metrics:
        enabled: true
        dimensions: [http.method, http.status_code]
        enable_target_info: true
distributor:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
ingester:
  replicas: 3
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
  persistence:
    enabled: true
    size: 20Gi
querier:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
queryFrontend:
  replicas: 2

Head-based vs tail-based sampling strategies — decision timing, pros and cons


Metric-Log-Trace correlation in Grafana — exemplars, trace_id links, and bidirectional navigation

# Prometheus scrape config to capture exemplars
# vmagent or Prometheus config
# (Prometheus itself must also run with --enable-feature=exemplar-storage)
scrape_configs:
  - job_name: payment-api
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["payment-api:8080"]
// Application code: emit metrics with exemplars (Go example)
import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "go.opentelemetry.io/otel/trace"
)

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path", "status"},
)

func handleRequest(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // ... handle request ...
    duration := time.Since(start).Seconds()
    span := trace.SpanFromContext(r.Context())
    // Record metric WITH exemplar (trace_id links metric to trace)
    requestDuration.WithLabelValues(
        r.Method, r.URL.Path, "200",
    ).(prometheus.ExemplarObserver).ObserveWithExemplar(
        duration,
        prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
    )
}
# Grafana provisioning: link data sources for correlation
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://vmselect.monitoring:8481/select/0/prometheus
    uid: victoriametrics
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
          urlDisplayLabel: "View Trace"
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring:3100
    uid: loki
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"([a-f0-9]+)"'
          url: '$${__value.raw}'
          datasourceUid: tempo
          urlDisplayLabel: "View Trace"
  - name: Tempo
    type: tempo
    url: http://tempo-query-frontend.monitoring:3200
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: false
        mapTagNamesEnabled: true
        tags:
          - key: service.name
            value: app
      tracesToMetrics:
        datasourceUid: victoriametrics
        tags:
          - key: service.name
            value: service
        queries:
          - name: "Request Rate"
            query: "sum(rate(http_request_duration_seconds_count{$$__tags}[5m]))"
          - name: "Error Rate"
            query: "sum(rate(http_request_duration_seconds_count{$$__tags, status=~\"5..\"}[5m]))"
      serviceMap:
        datasourceUid: victoriametrics

# Zero-code instrumentation for Java apps
# OTel Java agent auto-instruments: Spring Boot, gRPC,
# JDBC, Jedis/Lettuce, Kafka, HTTP clients, etc.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      initContainers:
        - name: otel-agent
          image: ghcr.io/open-telemetry/opentelemetry-java-instrumentation:v2.11.0
          command: ["cp", "/javaagent.jar", "/otel/javaagent.jar"]
          volumeMounts:
            - name: otel-agent
              mountPath: /otel
      containers:
        - name: api
          image: registry.example.com/payment-api:v2.1.0
          env:
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/otel/javaagent.jar"
            - name: OTEL_SERVICE_NAME
              value: "payment-api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.monitoring:4317"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.namespace=payments,deployment.environment=production"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "1.0"  # Send all to collector (tail sampling there)
          volumeMounts:
            - name: otel-agent
              mountPath: /otel
      volumes:
        - name: otel-agent
          emptyDir: {}

# Find traces by service name
{resource.service.name = "payment-api"}
# Find error traces
{status = error}
# Find slow spans (>2 seconds)
{duration > 2s}
# Find traces where payment-api called database and it was slow
{resource.service.name = "payment-api" && span.db.system = "postgresql" && duration > 1s}
# Find traces with specific HTTP status
{span.http.status_code >= 500}
# Structural queries: find traces where service A called service B
{resource.service.name = "api-gateway"} >> {resource.service.name = "payment-api"}
# Find traces where any span has an error
{status = error} | count() > 0
# Aggregate: P99 duration for payment service
{resource.service.name = "payment-api"} | quantile_over_time(duration, 0.99)

Scenario 1: Design Distributed Tracing for Microservices


“You have 30 microservices. Design a tracing architecture.”

Strong Answer:

“I’d use OpenTelemetry end-to-end with a two-tier collector:

Instrumentation: Auto-instrument Java services with the OTel Java agent (zero code changes). For Python and Go services, use the OTel SDK with auto-instrumentors that hook into HTTP clients, gRPC, database drivers automatically.

Collection (two-tier):

  1. Agent tier (DaemonSet): OTel Collector on every node. Receives OTLP from pods via gRPC (port 4317). Applies memory limiter, filters health check spans, adds cluster/namespace resource attributes. Forwards to gateway.

  2. Gateway tier (Deployment, 3 replicas): Receives from all agents. Applies tail-based sampling — keep 100% of error and slow traces, 10% of normal. This is critical because we want every failed payment trace but don’t need every health check.

Storage: Grafana Tempo with S3 backend. 14-day retention (traces are for debugging, not compliance). Tempo’s metrics generator creates span metrics (RED) automatically — these go to VictoriaMetrics.

Correlation: Grafana links metrics (exemplars) → traces → logs. A spike in P99 latency → click exemplar → see the exact trace → click ‘View Logs’ → see the log lines for that request.

Cost: At 10% sampling of 30 services doing 5K RPS total, we store ~500 traces/sec. Tempo on S3 costs about $50-100/month for 14-day retention.”


“Should we sample 100% of traces or use sampling? What’s the tradeoff?”

Strong Answer:

“Never 100% in production at scale. Here’s why and what to do:

The math: 30 services at 5K RPS, average 8 spans per trace = 40K spans/sec. At ~500 bytes per span, that’s 20 MB/sec = 1.7 TB/day. At S3 rates, storage alone is $1,200/month for 30 days. Tempo ingestion and querying resources scale linearly.

My sampling strategy (tail-based on the OTel Collector Gateway):

| Policy | Sample Rate | Rationale |
| --- | --- | --- |
| Error traces | 100% | Always debug errors |
| Traces > 2s | 100% | Always debug slow requests |
| Payment service | 100% | Critical path, low volume |
| Auth failures | 100% | Security signal |
| Everything else | 10% | Enough for service graphs and baseline metrics |

Result: Instead of 1.7 TB/day, we store ~200 GB/day. Cost drops 85% while keeping 100% of actionable traces.

Key requirement: Tail-based sampling must happen at the gateway, not per-service. Per-service (head-based) sampling would miss errors that only manifest downstream — a request might succeed at Service A but fail at Service C.”
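The arithmetic above can be checked with a quick back-of-envelope script. The request rate, spans per trace, and span size are the assumptions stated in the answer; the ~2% share for always-kept traces (errors, slow requests, critical services) is a further assumption:

```python
# Unsampled trace volume for 30 services at 5K RPS aggregate
RPS_TOTAL = 5_000
SPANS_PER_TRACE = 8
BYTES_PER_SPAN = 500
SECONDS_PER_DAY = 86_400

spans_per_sec = RPS_TOTAL * SPANS_PER_TRACE            # 40,000 spans/sec
mb_per_sec = spans_per_sec * BYTES_PER_SPAN / 1e6      # 20 MB/sec
tb_per_day = mb_per_sec * SECONDS_PER_DAY / 1e6        # ~1.73 TB/day

# Tail sampling: 10% probabilistic baseline, plus 100% of errors,
# slow traces, and critical services (assumed ~2% of total volume)
kept_fraction = 0.10 + 0.02
gb_per_day = tb_per_day * 1_000 * kept_fraction        # ~207 GB/day

print(f"unsampled: {tb_per_day:.2f} TB/day, sampled: ~{gb_per_day:.0f} GB/day")
```

That sampled figure lines up with the ~200 GB/day and ~85% cost reduction claimed above.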


“A customer reports their payment took 30 seconds instead of the usual 2 seconds. How do you investigate?”

Strong Answer:

“In Grafana, I follow this workflow:

  1. Find the trace: If I have the payment ID, query Loki: {app="payment-api"} | json | payment_id="PAY-12345". Extract the trace_id from the log line.

  2. Open trace in Tempo: The trace shows 12 spans across 5 services. I see the waterfall view and immediately notice: the external-payment-gateway span took 28 seconds (the root cause).

  3. Drill into the span: The span attributes show http.status_code: 504, http.url: gateway.example.com/charge. The payment gateway timed out.

  4. Check if it’s systemic: TraceQL: {resource.service.name="payment-api" && span.http.url =~ "gateway.example.com.*" && duration > 5s} over the last hour. If I see many, the external gateway has a problem.

  5. Check metrics: From the trace view, click ‘View metrics’ — Grafana shows the span metrics (P99 latency for external-payment-gateway calls). If the P99 jumped from 1s to 25s at 10:15 AM, that confirms the timeline.

  6. Resolution: If it’s the external gateway, we need circuit breaker configuration (Istio retry + timeout policies) to fail fast at 5s instead of waiting 30s. Also add a fallback path.”
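Step 1 — pulling the trace_id out of a structured log line — is mechanical. A minimal sketch (the log line and its field names are hypothetical):

```python
import json
import re

# Hypothetical JSON log line returned by the Loki query
log_line = ('{"app":"payment-api","payment_id":"PAY-12345",'
            '"trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","msg":"charge completed"}')

record = json.loads(log_line)
trace_id = record["trace_id"]

# W3C trace IDs are 32 lowercase hex characters
assert re.fullmatch(r"[0-9a-f]{32}", trace_id)
print(trace_id)  # paste into Tempo's trace search
```

In practice Grafana's derived fields (configured earlier) do this extraction for you and render a "View Trace" link.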


“Why use OpenTelemetry instead of Datadog APM or New Relic?”

Strong Answer:

“Vendor lock-in and cost are the two big reasons:

  1. Vendor neutrality: With OTel, we instrument once and can export to any backend — Tempo today, Jaeger tomorrow, Datadog if business requires it. With Datadog’s dd-trace library, you’re locked to Datadog.

  2. Cost: Datadog APM charges ~$31/host/month + $0.10 per analyzed span beyond included volume. For 500 hosts doing 50K spans/sec, that’s $18K/month for APM alone. Tempo with S3 backend costs $200-500/month for the same data.

  3. Standards compliance: OTel is a CNCF project and the de facto industry standard for telemetry instrumentation. SDKs exist for every major language. Most frameworks (Spring Boot, Django, Express) have auto-instrumentors.

  4. Flexibility: OTel Collector lets us process telemetry in the pipeline — tail sampling, attribute enrichment, filtering — before it reaches the backend. Vendor agents don’t offer this level of control.

When I’d still use Datadog/New Relic: If the org is small (< 50 engineers), doesn’t have SRE capacity to operate open-source tooling, and values a single-pane-of-glass product. The operational overhead of running Tempo + Loki + VictoriaMetrics is non-trivial — you need at least one dedicated SRE.”
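The cost figure in point 2 can be sanity-checked with simple arithmetic (the list price and the span-overage share are the assumptions stated above):

```python
hosts = 500
apm_list_price = 31.0                      # USD/host/month (assumed list price)
host_cost = hosts * apm_list_price         # $15,500/month in host fees

# The remaining ~$2,500/month is assumed to come from analyzed-span
# overage at $0.10 per unit beyond the included volume
total = host_cost + 2_500

print(f"host fees: ${host_cost:,.0f}/month, total: ~${total:,.0f}/month")
```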


“How does a trace ID travel across microservices? What if one service uses gRPC and another uses HTTP?”

Strong Answer:

“Context propagation is the mechanism that carries the trace context (trace ID, span ID, sampling flag) across service boundaries. OTel handles this transparently:

HTTP: The W3C Trace Context standard defines two headers:

  • traceparent: 00-<trace-id>-<parent-span-id>-<trace-flags>
  • tracestate: vendor=value (optional, vendor-specific)

gRPC: Same context propagated via gRPC metadata (equivalent to HTTP headers). OTel auto-instruments gRPC clients and servers to inject/extract these.

Message queues (Kafka/SQS/Pub/Sub): OTel injects context into message attributes/headers. The consumer extracts it and creates a new span linked to the producer’s span.

Cross-language: This is the beauty of OTel — a Java producer writes the traceparent header, a Python consumer reads it, and the trace is stitched together seamlessly. The trace ID stays the same across all services regardless of language.

Common pitfall: Custom HTTP clients or message wrappers that don’t propagate headers. The fix is ensuring OTel auto-instrumentation covers the HTTP client library in use, or manually injecting/extracting context.”
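The traceparent header described above is simple enough to parse by hand. A sketch (the example header uses the trace ID from the W3C spec's own examples; real code should use an OTel propagator rather than a hand-rolled helper like this):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_span_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        # Bit 0 of trace-flags is the "sampled" flag
        "sampled": bool(int(flags, 16) & 0x01),
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["sampled"])
```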


Scenario 6: Service Graph and Dependency Mapping


“How do you automatically discover service dependencies without manual documentation?”

Strong Answer:

“Tempo’s metrics generator creates service graphs automatically from trace data:

  1. How it works: Tempo analyzes spans and identifies caller → callee relationships. If Service A has a span calling Service B (client span in A, server span in B with the same trace ID), Tempo records that edge.

  2. Grafana Service Graph: A visual graph showing all services and their connections — generated automatically from traces. Each edge shows request rate, error rate, and latency.

  3. Configuration: Enable metrics_generator.processor.service_graphs in Tempo config. The generated metrics go to VictoriaMetrics via remote write. Grafana’s Node Graph panel renders them.

  4. What it reveals: You can immediately see that the payment service depends on 3 downstream services, the auth service is called by 15 services (potential bottleneck), and there’s an unexpected dependency between the notification service and the analytics service.

  5. Beyond discovery: Use the service graph to plan blast radius of deployments, identify critical paths, and detect circular dependencies. It replaces manually maintained architecture diagrams that are always outdated.”
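The edge-detection logic in point 1 can be sketched in a few lines. The span records and field names here are simplified stand-ins for real OTLP spans, not Tempo's actual implementation:

```python
# A CLIENT span in one service whose child is a SERVER span in another
# service (same trace, linked by parent/child span IDs) defines a
# caller -> callee edge in the service graph.
spans = [
    {"trace": "t1", "id": "a", "parent": None, "kind": "SERVER", "service": "api-gateway"},
    {"trace": "t1", "id": "b", "parent": "a", "kind": "CLIENT", "service": "api-gateway"},
    {"trace": "t1", "id": "c", "parent": "b", "kind": "SERVER", "service": "payment-api"},
]

by_id = {(s["trace"], s["id"]): s for s in spans}
edges = set()
for span in spans:
    if span["kind"] != "SERVER" or span["parent"] is None:
        continue
    parent = by_id.get((span["trace"], span["parent"]))
    if parent and parent["kind"] == "CLIENT" and parent["service"] != span["service"]:
        edges.add((parent["service"], span["service"]))

print(edges)  # {('api-gateway', 'payment-api')}
```

Tempo does this continuously across all sampled traces and emits the edges as metrics (with rate, error, and latency dimensions), which Grafana's Node Graph panel renders.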