Microservice Request Tracking: Correlation IDs Explained

A single request can traverse dozens of microservices, and when something breaks, figuring out which service dropped the ball is a nightmare without a thread to follow. OpenTelemetry’s correlation IDs, more commonly known as trace IDs, are that thread, letting you stitch together the entire journey of a request as it hops between services.

Let’s see it in action. Imagine a user request hitting an api-gateway service.

{
  "traceId": "d41d8cd98f00b204e9800998ecf8427e",
  "spanId": "a1b2c3d4e5f6a7b8",
  "parentSpanId": null,
  "name": "HTTP GET /users/123",
  "kind": "SERVER",
  "startTimeUnixNano": 1678886400000000000,
  "endTimeUnixNano": 1678886400300000000,
  "attributes": {
    "http.method": "GET",
    "http.target": "/users/123",
    "net.peer.ip": "192.168.1.100"
  }
}

This is a Span produced by the api-gateway. The traceId (d41d8cd98f00b204e9800998ecf8427e) is the unique identifier for this entire request’s journey. The spanId (a1b2c3d4e5f6a7b8) identifies this specific operation within the gateway. Since this is the entry point, parentSpanId is null.

Now, the api-gateway calls a user-service to fetch user data. The api-gateway propagates the traceId, together with its own spanId, to the user-service, typically via HTTP headers.

// Request headers from api-gateway to user-service
// ...
"traceparent: 00-d41d8cd98f00b204e9800998ecf8427e-a1b2c3d4e5f6a7b8-01"
// ...
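
As a rough illustration of how the gateway could build that header, here is a minimal sketch in plain Python. The helper name format_traceparent and the outgoing_headers dictionary are assumptions made for this example; in practice an OpenTelemetry SDK propagator does this for you.

```python
def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent value: version-traceid-parentid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


# Span values taken from the api-gateway example above.
gateway_span = {
    "traceId": "d41d8cd98f00b204e9800998ecf8427e",
    "spanId": "a1b2c3d4e5f6a7b8",
}

# Hypothetical outgoing request headers for the call to user-service.
outgoing_headers = {"Accept": "application/json"}
outgoing_headers["traceparent"] = format_traceparent(
    gateway_span["traceId"], gateway_span["spanId"]
)
print(outgoing_headers["traceparent"])
```

The gateway sends its own spanId in the header, because that is the value the receiving service will record as its parent.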

The user-service receives this header and uses the traceId to create its own span for the operation.

{
  "traceId": "d41d8cd98f00b204e9800998ecf8427e",
  "spanId": "c9d8e7f6a5b4c3d2",
  "parentSpanId": "a1b2c3d4e5f6a7b8",
  "name": "HTTP GET /users/123 (user-service)",
  "kind": "SERVER",
  "startTimeUnixNano": 1678886400110000000,
  "endTimeUnixNano": 1678886400250000000,
  "attributes": {
    "http.method": "GET",
    "http.target": "/users/123",
    "db.system": "postgresql",
    "db.statement": "SELECT * FROM users WHERE id = 123"
  }
}

Notice how the traceId is the same, but the spanId is new, and importantly, parentSpanId now points to the spanId of the api-gateway. This establishes the parent-child relationship, forming the trace. If user-service then called a database-service, it would propagate the same traceId and its own spanId as the parent for the database-service’s span.

This hierarchical structure, built by passing the traceId and the caller’s spanId across service boundaries, is how distributed tracing works. Each service exports its own spans, typically through an OpenTelemetry Collector, to a backend (like Jaeger, Zipkin, or a cloud observability platform), where all spans sharing the same traceId are reassembled into a complete trace graph. You can then query this backend to see the entire request flow, identify latency bottlenecks, or pinpoint errors.
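
To make the reassembly step concrete, here is a small sketch of how a backend might turn a flat list of spans into a tree by following parentSpanId. The function names build_trace_tree and render are invented for this example, and it assumes a single trace with exactly one root span; real backends also handle multiple traces, late-arriving spans, and orphans.

```python
from collections import defaultdict

# The two spans from the examples above, reduced to the fields needed here.
# They are listed out of order on purpose: arrival order is not guaranteed.
spans = [
    {"traceId": "d41d8cd98f00b204e9800998ecf8427e", "spanId": "c9d8e7f6a5b4c3d2",
     "parentSpanId": "a1b2c3d4e5f6a7b8", "name": "HTTP GET /users/123 (user-service)"},
    {"traceId": "d41d8cd98f00b204e9800998ecf8427e", "spanId": "a1b2c3d4e5f6a7b8",
     "parentSpanId": None, "name": "HTTP GET /users/123"},
]


def build_trace_tree(spans):
    """Group spans by parent and return (root_spans, children_by_parent_id)."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parentSpanId"] is None:
            roots.append(span)
        else:
            children[span["parentSpanId"]].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return the subtree under `span` as a list of indented lines."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["spanId"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


roots, children = build_trace_tree(spans)
tree_lines = render(roots[0], children)
print("\n".join(tree_lines))
```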

The most surprising thing about trace context propagation is that it’s entirely opt-in for each service and relies on a standardized format for passing the context, most commonly the W3C Trace Context specification (traceparent and tracestate headers). If any service in the chain doesn’t participate in propagation, the trace effectively "breaks" at that point, and subsequent services won’t be part of the same trace. It’s not a magic bullet; it requires instrumenting every service involved in the request path.

The actual mechanism for propagation is simple: a small set of HTTP headers. The traceparent header is the core. It carries four dash-separated fields: a version (currently 00), the 32-hex-character traceId, the 16-hex-character spanId of the calling span, and a traceFlags byte (01 for sampled, 00 for not sampled). For example, traceparent: 00-d41d8cd98f00b204e9800998ecf8427e-a1b2c3d4e5f6a7b8-01. When a service receives this header, it extracts these values and uses them to create its own span, setting its traceId to the received traceId, its spanId to a newly generated ID, and its parentSpanId to the received span ID. This is how the tree structure is built.
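
The parsing side can be sketched in a few lines of Python. This is an illustrative parser, not the OpenTelemetry SDK’s implementation: the name parse_traceparent is made up for this example, and it only accepts version 00 headers, whereas the W3C specification also describes how to treat future versions.

```python
import re

# Version 00 layout: "00", 32 hex chars (trace ID), 16 hex chars (parent span ID),
# 2 hex chars (flags), separated by dashes. Lowercase hex only.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return trace ID, parent span ID, and sampled flag, or None if invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # The W3C spec treats all-zero trace IDs and parent IDs as invalid.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "traceId": trace_id,
        "parentSpanId": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


context = parse_traceparent("00-d41d8cd98f00b204e9800998ecf8427e-a1b2c3d4e5f6a7b8-01")
print(context)
```

Returning None for a malformed header lets the caller fall back to starting a fresh trace instead of propagating bad IDs.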

The levers you control are primarily in your application’s instrumentation. You configure your OpenTelemetry SDK to:

Generate Trace IDs: This happens automatically when a request enters an instrumented service and no traceparent header is present.
Propagate Trace Context: This is crucial. You ensure your HTTP client libraries (or gRPC clients, Kafka producers, etc.) inject the traceparent header into outgoing requests. Most modern frameworks and OpenTelemetry auto-instrumentation packages do this by default once they are enabled.
Extract Trace Context: When an instrumented service receives an incoming request, it must be configured to parse the traceparent header and use it to establish the context for subsequent operations within that service.
Export Spans: Configure your exporter to send spans to an OpenTelemetry Collector or directly to a tracing backend.
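
The four responsibilities above can be sketched together in plain Python to show how they interlock. Everything here is a stand-in chosen for illustration: handle_request is a hypothetical function, exported_spans is an in-memory list playing the role of an exporter, and header validation is omitted. A real service would rely on the OpenTelemetry SDK rather than hand-rolled code like this.

```python
import secrets

exported_spans = []  # stand-in for an exporter; a real SDK would send these out


def handle_request(name, incoming_headers):
    """Extract or generate context, record a span, return headers to propagate."""
    header = incoming_headers.get("traceparent")
    if header:
        # Extract: continue the caller's trace (no validation in this sketch).
        _version, trace_id, parent_span_id, flags = header.split("-")
    else:
        # Generate: this service is the entry point, so start a new trace.
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"

    span_id = secrets.token_hex(8)  # every span gets its own 8-byte ID

    # Export: hand the finished span to the exporter stand-in.
    exported_spans.append({
        "traceId": trace_id,
        "spanId": span_id,
        "parentSpanId": parent_span_id,
        "name": name,
    })

    # Propagate: downstream calls carry this span's ID as their parent.
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


# The gateway has no incoming header, so it starts the trace.
gateway_out = handle_request("api-gateway", {})
# The user-service receives the gateway's header and joins the same trace.
handle_request("user-service", gateway_out)
```

After these two calls, exported_spans holds two spans that share a traceId, with the second span’s parentSpanId equal to the first span’s spanId.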

The real power comes from the backend, which aggregates these spans and provides visualization and querying capabilities. Without a backend to correlate and display the traces, the generated spans are just isolated events.

Most people don’t realize that trace context can be propagated via arbitrary mechanisms, not just HTTP headers. While W3C Trace Context over HTTP is the de facto standard, you can propagate the traceId and the current spanId through message queue headers, RPC metadata, or even custom protocols, as long as both the sender and receiver agree on the format and the mechanism. This flexibility is what makes OpenTelemetry applicable to a wide range of distributed systems beyond simple HTTP calls.
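
As one possible illustration of non-HTTP propagation, the sketch below carries the same traceparent value in the headers of a queued message. The message shape (a dictionary with payload and headers keys) and both function names are hypothetical and do not correspond to any particular Kafka or AMQP client API.

```python
def inject_into_message(message, trace_id, span_id, sampled=True):
    """Producer side: copy the trace context into the message headers."""
    flags = "01" if sampled else "00"
    headers = message.setdefault("headers", {})
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return message


def extract_from_message(message):
    """Consumer side: read back the trace ID and parent span ID, if present."""
    header = message.get("headers", {}).get("traceparent")
    if header is None:
        return None  # no context: the consumer would start a new trace
    _version, trace_id, parent_span_id, _flags = header.split("-")
    return {"traceId": trace_id, "parentSpanId": parent_span_id}


# The user-service span from the earlier example publishes a message.
message = inject_into_message(
    {"payload": {"userId": 123}},
    trace_id="d41d8cd98f00b204e9800998ecf8427e",
    span_id="c9d8e7f6a5b4c3d2",
)
consumer_context = extract_from_message(message)
print(consumer_context)
```

Reusing the traceparent format outside HTTP is a design choice made for this sketch; any format works as long as producer and consumer agree on it.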

The next concept you’ll wrestle with is how to ensure consistent sampling decisions across all services in a trace.