Sampling is how OpenTelemetry decides which traces to keep and which to discard, a crucial mechanism for managing the sheer volume of data generated by distributed systems in production.
Let’s look at a trace in action. Imagine a user requests a page on your web app. This single request might trigger calls to several microservices: frontend -> auth-service -> user-profile-service -> database. OpenTelemetry instruments each of these services, creating spans that represent the work done by each.
[
  {
    "traceId": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
    "parentId": "f67890a1b2c3d4e5",
    "id": "1234567890abcdef",
    "name": "GET /users/{id}",
    "kind": 1,
    "startTimeUnixNano": "1678886400000000000",
    "endTimeUnixNano": "1678886400150000000",
    "attributes": {
      "http.method": "GET",
      "http.url": "/users/123",
      "http.status_code": 200
    },
    "status": {
      "code": 0
    }
  },
  // ... other spans within the same traceId
]
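As a small illustration of how to read this record, the sketch below computes the span's duration from its two nanosecond timestamps. The helper name span_duration_ms is hypothetical; only the field names and values come from the example above.

```python
# Timestamps copied from the example span (nanoseconds since the Unix epoch).
span = {
    "name": "GET /users/{id}",
    "startTimeUnixNano": "1678886400000000000",
    "endTimeUnixNano": "1678886400150000000",
}


def span_duration_ms(span: dict) -> float:
    """Return the span's duration in milliseconds (hypothetical helper)."""
    start_ns = int(span["startTimeUnixNano"])
    end_ns = int(span["endTimeUnixNano"])
    return (end_ns - start_ns) / 1_000_000


duration_ms = span_duration_ms(span)
print(f"{span['name']} took {duration_ms} ms")  # prints: GET /users/{id} took 150.0 ms
```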
Without sampling, every single request, every single operation, would generate a full trace. In a high-traffic production environment, this quickly becomes unmanageable and prohibitively expensive for storage and processing. Sampling acts as a filter, deciding which of these potentially millions of traces are interesting enough to keep.
The core problem sampling solves is trace volume. If you have 10,000 requests per second and each request generates 10 spans, that’s 100,000 spans per second. Storing and analyzing all of that is often infeasible. Sampling allows you to retain a representative subset, typically 1% or 5% of traces, providing enough data for debugging and performance analysis without overwhelming your backend.
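To make the arithmetic concrete, here is a minimal sketch using the figures from the paragraph above. The function name is hypothetical, and the calculation assumes every request produces the same number of spans.

```python
def retained_spans_per_second(requests_per_second: int,
                              spans_per_request: int,
                              sampling_ratio: float) -> float:
    """Expected spans per second that survive sampling at the given ratio."""
    return requests_per_second * spans_per_request * sampling_ratio


unsampled = retained_spans_per_second(10_000, 10, 1.0)
at_1_percent = retained_spans_per_second(10_000, 10, 0.01)
at_5_percent = retained_spans_per_second(10_000, 10, 0.05)

print(f"no sampling: {unsampled:,.0f} spans/s")
print(f"1% sampling: {at_1_percent:,.0f} spans/s")
print(f"5% sampling: {at_5_percent:,.0f} spans/s")
```

At 1% the backend receives on the order of 1,000 spans per second instead of 100,000, which is the difference between a modest ingestion pipeline and a very expensive one.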
OpenTelemetry provides several sampling strategies. The most common ones are:
- AlwaysOn: Every trace is kept. Useful for testing or very low-traffic environments.
- AlwaysOff: Every trace is discarded. Useful for disabling tracing entirely.
- TraceIdRatio (named TraceIdRatioBased in the specification and in most SDKs): A probabilistic sampler that keeps a configured fraction of traces. The decision is a deterministic function of the trace_id: the sampler derives a number from the ID and keeps the trace when that number falls below a threshold proportional to sample_rate. For example, a sample_rate of 0.01 (1%) keeps roughly one trace in a hundred. Exactly which bits of the trace_id are compared is SDK-specific. This is the workhorse for production.
- ParentBased: This is a composite strategy. If a span has a parent, it inherits the parent’s sampling decision, which travels between services in the propagated trace context. If it’s a root span (no parent), the decision is delegated to a configured root sampler (like TraceIdRatio). This is powerful because, as long as every service uses ParentBased and propagates context, a sampled root span means all its descendant spans within that trace are also sampled, providing a complete view of that particular request.
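When configuring TraceIdRatio, you set a fraction between 0 and 1. For example, to sample 5% of traces, you’d configure sample_rate: 0.05. The sampler does not draw a new random number for each span. Because trace IDs are themselves randomly generated, it uses the trace_id as its source of randomness: it derives a number from the ID and keeps the trace if that number falls below a threshold proportional to the ratio. Which bits are used is SDK-specific; the Python SDK, for instance, compares the low 64 bits of the trace_id against 0.05 * 2^64. Because the decision depends only on the trace_id and the ratio, every service configured with the same ratio and algorithm reaches the same decision for the same trace.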
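The threshold comparison can be sketched in a few lines of plain Python. This is an illustrative model, not any SDK's actual code: it assumes the decision uses the low 64 bits of the trace ID compared against ratio * 2^64, and the function name keep_trace is hypothetical. Other SDKs may use different bits or a different comparison.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # selects the low 64 bits of a 128-bit trace ID


def keep_trace(trace_id: int, ratio: float) -> bool:
    """Deterministically decide whether to keep a trace (illustrative model)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    threshold = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < threshold


# Simulate 100,000 randomly generated trace IDs and sample 5% of them.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(keep_trace(t, 0.05) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.2%} of traces")
```

The kept fraction lands close to 5% but not exactly on it, and calling keep_trace twice with the same trace ID and ratio always returns the same answer.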
Let’s say you’re using the ParentBased sampler with TraceIdRatio as its root sampler, the delegate it consults for spans that have no parent. You configure the ratio to 0.1 (10%).
- A request comes in, generating a root span.
- The ParentBased sampler sees no parent, so it consults its root sampler, TraceIdRatio(0.1).
- TraceIdRatio checks the trace_id. Let’s say it decides to keep this trace (10% chance).
- Now, as this root span spawns child spans in downstream services, the sampling decision travels with the propagated trace context, and the ParentBased sampler in each service keeps those spans because their parent was sampled.
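The walk-through above can be modelled as follows. This is a simplified sketch, not SDK code: it assumes a 64-bit threshold comparison for the ratio sampler and assumes spans with a parent simply copy the parent's decision (real ParentBased implementations let you configure separate samplers for sampled and unsampled, local and remote parents). The function names and trace IDs are hypothetical.

```python
from typing import Optional

TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of the trace ID


def ratio_sampler(trace_id: int, ratio: float) -> bool:
    """Keep the trace if its low 64 bits fall below ratio * 2^64 (assumed algorithm)."""
    return (trace_id & TRACE_ID_MASK) < round(ratio * (1 << 64))


def parent_based(trace_id: int, parent_sampled: Optional[bool], ratio: float) -> bool:
    """Root spans (parent_sampled is None) ask the ratio sampler; others follow the parent."""
    if parent_sampled is None:
        return ratio_sampler(trace_id, ratio)
    return parent_sampled


# Hypothetical trace ID whose low 64 bits are small, so a 10% ratio keeps it.
kept_trace_id = 0xA1B2C3D4E5F67890_0000000000000001
root_kept = parent_based(kept_trace_id, None, 0.1)
child_kept = parent_based(kept_trace_id, root_kept, 0.1)

# Hypothetical trace ID whose low 64 bits are large, so a 10% ratio drops it.
dropped_trace_id = 0xA1B2C3D4E5F67890_FFFFFFFFFFFFFFFF
root_dropped = parent_based(dropped_trace_id, None, 0.1)
child_dropped = parent_based(dropped_trace_id, root_dropped, 0.1)

print(root_kept, child_kept)        # True True
print(root_dropped, child_dropped)  # False False
```

Note that the child never consults the ratio at all: once a root decision exists, it is simply carried down the call chain.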
This approach is highly effective. You get complete traces for a random subset of requests, which is usually sufficient for understanding system behavior and diagnosing issues.
A common pitfall is setting the sampling rate too high in production, leading to excessive data. Conversely, setting it too low might miss critical error traces, because the keep-or-drop decision is made when the trace starts, before anyone knows whether the request will fail. The ParentBased sampler is key to keeping sampled traces complete: each service honors the decision already made upstream, so all spans belonging to a sampled trace are collected. If every service instead ran TraceIdRatio on its own, traces would stay intact only while all services used the same ratio and the same algorithm; as soon as one service is configured with a lower ratio, it drops spans from traces that upstream services kept, giving you an incomplete picture. ParentBased with TraceIdRatio as the root sampler is often the recommended configuration for production environments: it yields a full trace for the sampled subset, while still providing the necessary probabilistic filtering.
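One way incomplete traces can arise is when services sample independently with different ratios. The simulation below is a rough sketch of that scenario under stated assumptions: two services, a 64-bit threshold comparison for the ratio sampler, and hypothetical function names. It counts traces that the upstream service kept but whose downstream span was dropped.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of the trace ID


def ratio_keep(trace_id: int, ratio: float) -> bool:
    """Assumed ratio algorithm: compare the low 64 bits against ratio * 2^64."""
    return (trace_id & TRACE_ID_MASK) < round(ratio * (1 << 64))


def simulate(num_traces: int, upstream_ratio: float, downstream_ratio: float,
             use_parent_based: bool, seed: int = 7) -> tuple:
    """Return (traces kept upstream, kept traces missing their downstream span)."""
    rng = random.Random(seed)
    kept_upstream = 0
    broken = 0
    for _ in range(num_traces):
        trace_id = rng.getrandbits(128)
        upstream_keeps = ratio_keep(trace_id, upstream_ratio)
        if use_parent_based:
            # Downstream follows the decision propagated from its parent.
            downstream_keeps = upstream_keeps
        else:
            # Downstream decides on its own, using its own ratio.
            downstream_keeps = ratio_keep(trace_id, downstream_ratio)
        if upstream_keeps:
            kept_upstream += 1
            if not downstream_keeps:
                broken += 1
    return kept_upstream, broken


independent = simulate(10_000, 0.5, 0.1, use_parent_based=False)
with_parent = simulate(10_000, 0.5, 0.1, use_parent_based=True)
print("independent ratios:", independent)
print("parent-based:      ", with_parent)
```

With independent ratios of 0.5 and 0.1, most of the traces kept upstream lose their downstream span; with the parent-based model the broken count is zero.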
The next challenge you’ll face is understanding how to effectively route your sampled traces to different backends based on specific attributes, a concept known as Trace Export.