Distributed profiling lets you see what’s happening inside a complex system when you can’t just attach a debugger to a single process.
Here’s a trace in action. Imagine a user request hitting a web service. That service then calls out to a user service, which in turn calls a database service. A distributed trace captures each of these hops, showing not just how long each individual service took, but also the latency between them.
[
  {
    "traceId": "a1b2c3d4e5f6",
    "spanId": "111111111111",
    "parentSpanId": null,
    "serviceName": "frontend-api",
    "operationName": "POST /users",
    "startTime": "2023-10-27T10:00:00Z",
    "duration": 150,
    "tags": {
      "http.method": "POST",
      "http.url": "/users",
      "http.status_code": 201
    }
  },
  {
    "traceId": "a1b2c3d4e5f6",
    "spanId": "222222222222",
    "parentSpanId": "111111111111",
    "serviceName": "user-service",
    "operationName": "createUser",
    "startTime": "2023-10-27T10:00:00.050Z",
    "duration": 80,
    "tags": {
      "db.instance": "users_db",
      "db.statement": "INSERT INTO users (name, email) VALUES (?, ?)"
    }
  },
  {
    "traceId": "a1b2c3d4e5f6",
    "spanId": "333333333333",
    "parentSpanId": "222222222222",
    "serviceName": "user-service",
    "operationName": "getUserById",
    "startTime": "2023-10-27T10:00:00.100Z",
    "duration": 20,
    "tags": {
      "db.instance": "users_db",
      "db.statement": "SELECT * FROM users WHERE id = ?"
    }
  }
]
This data, often collected by agents and sent to a tracing backend like Jaeger or Zipkin, allows you to reconstruct the flow of a request. The frontend-api span takes 150ms. Within it, the user-service createUser span takes 80ms, and its child getUserById span takes 20ms. The critical insight is the time between these spans. The createUser span starts 50ms after the frontend-api span, and the getUserById span starts 50ms after createUser (100ms after the start of the trace). Those gaps, together with the time left over after each child span ends, show how much of the request is spent outside the downstream calls, which tells you where to focus your investigation.
The problem distributed profiling solves is the "black box" nature of microservices. When requests are slow, you need to know which service is the bottleneck, and why. Is it a single slow service, or is it the network latency between services? Tracing provides the answer by breaking down the total request time into its constituent parts and the time spent waiting for downstream services.
At its core, distributed tracing relies on a few key concepts:
- Trace: A single end-to-end request through your system. It’s a collection of spans.
- Span: A single unit of work within a trace. This could be an HTTP request, a database query, or a function call. Each span has a start time, duration, and metadata (tags, logs).
- Trace ID: A unique identifier that links all spans belonging to the same trace.
- Span ID: A unique identifier for a specific span.
- Parent Span ID: Links a child span to its parent, establishing the causal relationship and the call hierarchy.
When a request enters your system, a new trace ID is generated (or an existing one is propagated if it’s an incoming request from another traced system). As this request traverses services, the trace ID is passed along. Each service that participates in the request creates its own span, assigning it the same trace ID and referencing the parent span’s ID. This forms a directed acyclic graph (DAG) of operations.
The magic happens when you view these spans together. A tracing UI can reconstruct the timeline, showing you the critical path. You’re not just looking at average latencies; you’re seeing the latency of a specific request at a specific moment. This is invaluable for debugging transient performance issues or understanding the impact of load.
To implement this, you typically instrument your code. Libraries for popular languages (OpenTelemetry, which has superseded the older OpenTracing project) provide APIs to create spans, add tags, and propagate context (the trace ID and parent span ID) across network calls. For example, in Go, you might use a library like go.opentelemetry.io/otel to wrap an outgoing HTTP call:
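import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/propagation"
)

// doRequest performs a GET inside its own span and propagates the trace
// context to the downstream service.
func doRequest(ctx context.Context, url string) (*http.Response, error) {
	tr := otel.Tracer("my-service")
	ctx, span := tr.Start(ctx, "makeExternalRequest")
	defer span.End()
	span.SetAttributes(attribute.String("http.url", url))

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	// Write the trace ID and current span ID into the outgoing headers.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
	return resp, nil
}
Notice that http.NewRequestWithContext(ctx, ...) only attaches the Go context to the request; it does not write any headers by itself. The explicit Inject call is what copies the tracing context (including trace ID and parent span ID) into the outgoing HTTP headers, which the downstream service’s tracing instrumentation can then read. Inject does nothing until a propagator has been registered, for example with otel.SetTextMapPropagator(propagation.TraceContext{}) at startup. Alternatively, an instrumented transport such as otelhttp.NewTransport from the OpenTelemetry contrib module can perform the injection for you.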
The true power of distributed profiling isn’t just seeing the total time. It’s about identifying contention and unexpected delays. For instance, you might see a service consistently taking 50ms, but its child spans only add up to 20ms. That 30ms difference is the time spent within that service’s runtime, before it even made a downstream call, or after a downstream call returned but before its own span completed. It could be garbage collection pauses, thread contention, or inefficient internal logic, all pinpointed by that unaccounted-for duration.
You might find yourself looking at a trace and seeing a span that’s unexpectedly long. A common reason is an inefficient query or operation within that specific service. However, it’s also possible that a downstream service this span is waiting on is the actual bottleneck, and the upstream span is simply showing the duration of that wait. This is why comparing a parent span’s duration with the durations of its children is key: if the children account for most of the parent’s time, look downstream; if they don’t, look inside the service itself.
The next logical step after understanding individual request traces is correlating them with metrics and logs.