Debugging a synchronous, single-process application is straightforward: you attach a debugger and step through the code. That approach breaks down as soon as a request crosses multiple services or asynchronous boundaries. The question becomes: if a request fails or is slow, which service in the chain is the culprit, and why? Distributed tracing answers it by stitching the individual pieces of a request’s journey across services into a single, coherent timeline.
Imagine a user clicks "buy" on an e-commerce site. This single action might trigger calls to:
- Frontend Service: Receives the click, initiates the order process.
- Order Service: Creates a new order record.
- Payment Service: Processes the credit card transaction.
- Inventory Service: Decreases stock levels.
- Notification Service: Sends an email confirmation.
Without tracing, if the order takes 30 seconds to complete, you’re left guessing. Is the frontend slow? Is the payment gateway unresponsive? Is the inventory check taking ages?
OpenTelemetry’s distributed tracing works by assigning a unique trace_id to the initial request. As this request propagates through various services, each service generates span_ids for the work it performs. Crucially, each subsequent service receives the trace_id and the span_id of the calling service (which becomes its parent_span_id). This creates a directed acyclic graph (DAG) of operations, allowing you to reconstruct the entire request flow.
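The ID relationships described above can be modelled in a few lines. This is a minimal sketch using only the Go standard library, not the OpenTelemetry SDK: the `Span` type and the `startSpan` and `randomHex` helpers are hypothetical names chosen for illustration. It assumes the usual ID sizes (16 random bytes for a trace ID, 8 for a span ID).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Span is a hypothetical, stripped-down span record (not the SDK's type).
type Span struct {
	TraceID      string // 16 random bytes, hex-encoded (32 characters)
	SpanID       string // 8 random bytes, hex-encoded (16 characters)
	ParentSpanID string // empty for the root span
	Name         string
}

// randomHex returns n random bytes as a lowercase hex string.
func randomHex(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// startSpan creates a root span when parent is nil. Otherwise it creates a
// child that inherits the parent's trace ID and records the parent's span ID.
func startSpan(name string, parent *Span) *Span {
	s := &Span{SpanID: randomHex(8), Name: name}
	if parent == nil {
		s.TraceID = randomHex(16)
		return s
	}
	s.TraceID = parent.TraceID
	s.ParentSpanID = parent.SpanID
	return s
}

func main() {
	root := startSpan("frontend: buy", nil)
	order := startSpan("order-service: create order", root)
	payment := startSpan("payment-service: charge card", order)
	for _, s := range []*Span{root, order, payment} {
		fmt.Printf("trace=%s span=%s parent=%q name=%s\n",
			s.TraceID, s.SpanID, s.ParentSpanID, s.Name)
	}
}
```

Every line printed shares one trace ID, while each span ID is fresh; the parent field is the only thing that encodes the shape of the request.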
Here’s a simplified view of what this looks like in practice. Let’s say we have a simple Go application using OpenTelemetry.
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracer configures the global tracer provider and propagator. It returns
// the provider's Shutdown function, which flushes any buffered spans.
func initTracer() (func(context.Context) error, error) {
	// Exporter: Where traces go. Here, stdout for demonstration.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		return nil, fmt.Errorf("failed to create stdout exporter: %w", err)
	}

	// Resource: Identifies the service.
	res, err := resource.New(context.Background(),
		resource.WithAttributes(
			semconv.ServiceName("my-frontend-service"),
			semconv.ServiceVersion("1.0.0"),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create resource: %w", err)
	}

	// Tracer Provider: Manages tracers.
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(res),
	)

	// Set global propagator: How trace context is propagated between services (e.g., via HTTP headers).
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // W3C Trace Context
		propagation.Baggage{},      // W3C Baggage
	))

	// Set global tracer provider.
	otel.SetTracerProvider(tp)

	return tp.Shutdown, nil
}

func main() {
	shutdown, err := initTracer()
	if err != nil {
		log.Fatalf("Error initializing tracer: %v", err)
	}
	defer shutdown(context.Background())

	tr := otel.Tracer("frontend-tracer")

	http.HandleFunc("/process-order", func(w http.ResponseWriter, r *http.Request) {
		// Extract trace context from the incoming request headers before
		// starting the handler span, so the span joins the caller's trace
		// when one exists. This is crucial for linking spans across services.
		carrier := propagation.HeaderCarrier(r.Header)
		reqCtx := otel.GetTextMapPropagator().Extract(r.Context(), carrier)

		ctx, span := tr.Start(reqCtx, "processOrderHandler")
		defer span.End()

		// Simulate calling another service (e.g., OrderService)
		orderCtx, orderSpan := tr.Start(ctx, "callOrderService")
		orderSpan.SetAttributes(attribute.String("service.name", "order-service"))
		orderSpan.AddEvent("Simulating network call to order service...")
		// In a real app, you'd make an HTTP/gRPC call here.
		// The trace context would be injected into the outgoing headers.
		fmt.Println("Simulating call to order service...")
		// Simulate work
		// time.Sleep(100 * time.Millisecond)
		orderSpan.End()

		// Simulate another call. Starting from orderCtx makes this span a
		// child of callOrderService; start from ctx to make it a sibling.
		_, paymentSpan := tr.Start(orderCtx, "callPaymentService")
		paymentSpan.SetAttributes(attribute.String("service.name", "payment-service"))
		paymentSpan.AddEvent("Simulating payment processing...")
		fmt.Println("Simulating call to payment service...")
		// Simulate work
		// time.Sleep(150 * time.Millisecond)
		paymentSpan.End()

		span.AddEvent("Order processed successfully")
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Order processed"))
	})

	fmt.Println("Server listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
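When this code runs and a request hits /process-order, OpenTelemetry generates a trace and the stdouttrace exporter prints one JSON object per span to your console. The exporter’s actual output uses its own field names and carries more detail, such as the resource; the listing below is simplified to the fields that matter here: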
{
"traceId": "a1b2c3d4e5f678901234567890abcdef",
"id": "1111111111111111",
"parentID": "0000000000000000", // Root span
"name": "processOrderHandler",
"kind": 0,
"startTimeUnixNano": 1678886400123456789,
"endTimeUnixNano": 1678886400456789012,
"attributes": [],
"events": [
{
"name": "Order processed successfully",
"timeUnixNano": 1678886400450000000
}
],
"status": {
"code": 0
},
"traceState": ""
}
{
"traceId": "a1b2c3d4e5f678901234567890abcdef",
"id": "2222222222222222",
"parentID": "1111111111111111", // Parent is processOrderHandler
"name": "callOrderService",
"kind": 0,
"startTimeUnixNano": 1678886400150000000,
"endTimeUnixNano": 1678886400250000000,
"attributes": [
{ "key": "service.name", "value": { "type": "STRING", "string": "order-service" } }
],
"events": [
{
"name": "Simulating network call to order service...",
"timeUnixNano": 1678886400160000000
}
],
"status": {
"code": 0
},
"traceState": ""
}
{
"traceId": "a1b2c3d4e5f678901234567890abcdef",
"id": "3333333333333333",
"parentID": "2222222222222222", // Parent is callOrderService
"name": "callPaymentService",
"kind": 0,
"startTimeUnixNano": 1678886400260000000,
"endTimeUnixNano": 1678886400400000000,
"attributes": [
{ "key": "service.name", "value": { "type": "STRING", "string": "payment-service" } }
],
"events": [
{
"name": "Simulating payment processing...",
"timeUnixNano": 1678886400270000000
}
],
"status": {
"code": 0
},
"traceState": ""
}
The keys to understanding this output are the traceId, id, and parentID fields. All spans share the same traceId. The id uniquely identifies a span, and parentID links it to the span that initiated it. This allows a tracing backend (like Jaeger, Zipkin, or Honeycomb) to reconstruct the entire request flow.
The traceId is generated by the first service to handle the request. When that service calls another, it injects the traceId and its own spanId into the outgoing request’s headers (for example, in the traceparent header defined by the W3C Trace Context format). The receiving service extracts these headers, reuses the traceId, records the incoming spanId as its parent_span_id, and generates a new spanId of its own. This is how the chain is built.
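The kind field indicates the role of the span. Its numeric values follow the OpenTelemetry SpanKind enum: 0 for SPAN_KIND_UNSPECIFIED, 1 for SPAN_KIND_INTERNAL, 2 for SPAN_KIND_SERVER, 3 for SPAN_KIND_CLIENT, 4 for SPAN_KIND_PRODUCER, and 5 for SPAN_KIND_CONSUMER. SPAN_KIND_SERVER is for spans representing incoming requests to a service, SPAN_KIND_CLIENT for outgoing requests from a service, SPAN_KIND_PRODUCER and SPAN_KIND_CONSUMER for the sending and receiving sides of asynchronous messaging, and SPAN_KIND_INTERNAL for operations within a service. A span started without an explicit kind, like the three above, is treated as internal by the Go SDK.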
When you send these traces to a backend, you can visualize them as a waterfall, where the width of each span represents its duration. This makes it immediately obvious which operation took the longest. You can then drill down into specific spans to see attributes, events, and errors associated with that particular piece of work.
What many people miss is that tracing is fundamentally about propagation. The entire system hinges on the traceId and parent_span_id being correctly passed from service to service. If you’re using HTTP, this means ensuring your HTTP client library and server framework are configured to inject and extract the W3C Trace Context headers. For message queues, it means the message producer must add the context to the message, and the consumer must extract it. Without this propagation, each service starts a new, unrelated trace, creating disconnected islands of telemetry instead of a coherent view.
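The message-queue case, and the failure mode when propagation is missing, can be sketched without any broker at all. Everything here is hypothetical: `Message`, `SpanContext`, `publish`, and `consume` are illustrative names, the queue is just a value passed between functions, and the consumer's "new" IDs are passed in as arguments to keep the example deterministic where a real SDK would generate them randomly.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical queue message with string headers.
type Message struct {
	Headers map[string]string
	Body    string
}

// SpanContext is the minimum that must cross the service boundary.
type SpanContext struct {
	TraceID string
	SpanID  string
}

// publish attaches the producer's context to the message when propagate is true.
func publish(body string, sc SpanContext, propagate bool) Message {
	msg := Message{Headers: map[string]string{}, Body: body}
	if propagate {
		msg.Headers["traceparent"] = fmt.Sprintf("00-%s-%s-01", sc.TraceID, sc.SpanID)
	}
	return msg
}

// consume starts the consumer's span. With a usable traceparent header it
// joins the producer's trace; without one it falls back to a brand-new trace.
func consume(msg Message, newTraceID, newSpanID string) (sc SpanContext, parentSpanID string) {
	parts := strings.Split(msg.Headers["traceparent"], "-")
	if len(parts) == 4 && len(parts[1]) == 32 && len(parts[2]) == 16 {
		return SpanContext{TraceID: parts[1], SpanID: newSpanID}, parts[2]
	}
	return SpanContext{TraceID: newTraceID, SpanID: newSpanID}, ""
}

func main() {
	producer := SpanContext{
		TraceID: "a1b2c3d4e5f678901234567890abcdef",
		SpanID:  "1111111111111111",
	}
	fresh := "ffffffffffffffffffffffffffffffff" // stand-in for a random trace ID

	linked, parent := consume(publish("order created", producer, true), fresh, "2222222222222222")
	fmt.Println("with propagation:   ", linked.TraceID, "parent:", parent)

	orphan, parent := consume(publish("order created", producer, false), fresh, "3333333333333333")
	fmt.Printf("without propagation: %s parent: %q\n", orphan.TraceID, parent)
}
```

In the second case the consumer's span has a different trace ID and no parent, which is exactly the disconnected island described above.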
The next step in observing your distributed system is understanding how to correlate traces with logs and metrics.