Distributed workflows, like Sagas, are notoriously hard to observe because they involve multiple independent services coordinating over a long period, making it difficult to pinpoint where failures occur or why a transaction is stuck.

Let’s trace a simple order placement Saga using OpenTelemetry. Imagine a user places an order. This triggers a sequence: OrderService creates an order, then calls PaymentService to process payment, and finally calls InventoryService to reserve stock. If payment fails, OrderService needs to compensate by cancelling the order.

Here’s a simplified view of the tracing data you might see in an observability platform (like Jaeger or Honeycomb) after this flow:

Trace ID: a1b2c3d4e5f6
  -> Span: OrderService.CreateOrder (Start: 2023-10-27T10:00:00Z, Duration: 50ms)
     -> Span: PaymentService.ProcessPayment (Start: 2023-10-27T10:00:00.050Z, Duration: 150ms)
        -> Span: PaymentService.AuthorizeCard (Start: 2023-10-27T10:00:00.060Z, Duration: 100ms)
     -> Span: InventoryService.ReserveStock (Start: 2023-10-27T10:00:00.200Z, Duration: 80ms)

If PaymentService.ProcessPayment failed, the trace would look different, and the OrderService would then initiate compensation:

Trace ID: a1b2c3d4e5f6
  -> Span: OrderService.CreateOrder (Start: 2023-10-27T10:00:00Z, Duration: 50ms)
     -> Span: PaymentService.ProcessPayment (Start: 2023-10-27T10:00:00.050Z, Duration: 150ms, Status: ERROR)
        -> Span: PaymentService.AuthorizeCard (Start: 2023-10-27T10:00:00.060Z, Duration: 100ms, Status: ERROR)
     -> Span: OrderService.CompensateOrder (Start: 2023-10-27T10:00:01.500Z, Duration: 70ms)
        -> Span: OrderService.CancelOrder (Start: 2023-10-27T10:00:01.510Z, Duration: 40ms)

The failure path above is exactly what Sagas exist to handle: keeping data consistent across services without a distributed ACID transaction. Instead of a single transaction that locks resources across services (which is often impossible or impractical), a Saga breaks the work into a sequence of local transactions and accepts eventual consistency between them. Each local transaction updates its own service’s data and publishes an event or sends a command to trigger the next step. If any step fails, compensating transactions are executed in reverse order to undo the steps that already completed.
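
The "compensate in reverse order" rule can be sketched in plain Java. This is a minimal in-process illustration, not a production orchestrator: `SagaSketch`, `Step`, and `run` are hypothetical names, and a real Saga would persist its progress and run each step in a different service.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-process Saga runner; names are illustrative, not from any library. */
final class SagaSketch {

    /** One local transaction plus the action that undoes it. */
    record Step(String name, Runnable action, Runnable compensation) {}

    /**
     * Runs steps in order. On the first failure, runs the compensations of the
     * already-completed steps in reverse order and returns false.
     */
    static boolean run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                log.add("done:" + step.name());
                completed.push(step); // most recently completed step sits on top
            } catch (RuntimeException e) {
                log.add("failed:" + step.name());
                while (!completed.isEmpty()) {
                    Step undo = completed.pop();
                    undo.compensation().run();
                    log.add("compensated:" + undo.name());
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(List.of(
                new Step("CreateOrder", () -> {}, () -> {}),
                new Step("ProcessPayment", () -> { throw new IllegalStateException("card declined"); }, () -> {}),
                new Step("ReserveStock", () -> {}, () -> {})), log);
        System.out.println(ok + " " + log);
        // prints: false [done:CreateOrder, failed:ProcessPayment, compensated:CreateOrder]
    }
}
```

In a traced system, each `action` and each `compensation` would be wrapped in its own span, which is what makes the compensation path visible in the trace.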

OpenTelemetry allows us to instrument each of these local transactions as spans. Crucially, it propagates trace context (like trace_id and span_id) across service boundaries using headers (e.g., traceparent, tracestate). When OrderService calls PaymentService, it injects the current trace context into the HTTP request. PaymentService then receives this context and uses it to create its own spans, linking them back to the original trace from OrderService. This creates a unified view of the entire distributed workflow, even across multiple services and asynchronous message queues.
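
To make the header concrete, here is a small sketch of what a version-00 W3C `traceparent` value carries and how a downstream service continues the same trace. It is for illustration only: `TraceParentSketch` and its members are hypothetical names, and in a real service the OpenTelemetry SDK's propagator does this parsing and rendering for you.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative traceparent handling; real services should rely on the SDK's propagator. */
final class TraceParentSketch {

    record TraceContext(String traceId, String spanId, boolean sampled) {
        /** Renders the header value as version-traceid-spanid-flags. */
        String toHeader() {
            return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        }

        /** Same trace id, fresh span id: what a downstream service does for its own span. */
        TraceContext newChild() {
            return new TraceContext(traceId, randomSpanId(), sampled);
        }
    }

    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Parses a version-00 traceparent value, or returns null if it is malformed. */
    static TraceContext parse(String header) {
        if (header == null) return null;
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) return null;
        // All-zero ids are invalid according to the W3C Trace Context specification
        if (m.group(1).equals("0".repeat(32)) || m.group(2).equals("0".repeat(16))) return null;
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return new TraceContext(m.group(1), m.group(2), sampled);
    }

    /** Generates a non-zero 64-bit span id as 16 lowercase hex characters. */
    static String randomSpanId() {
        long id;
        do { id = ThreadLocalRandom.current().nextLong(); } while (id == 0);
        return String.format("%016x", id);
    }

    public static void main(String[] args) {
        TraceContext incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        // The child keeps the trace id, so both spans land in the same trace
        System.out.println(incoming.newChild().toHeader());
    }
}
```

The key property is that the trace id never changes along the call chain, while every service contributes its own span id.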

To achieve this, you typically:

  1. Instrument your services: Use OpenTelemetry SDKs for your language. For example, in Java with Spring Boot, you’d add the opentelemetry-api, opentelemetry-sdk, and relevant auto-instrumentation or manual instrumentation libraries.

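  2. Configure the exporter: Set up an exporter to send trace data to your backend. For example, to send OTLP over gRPC to a Jaeger instance listening on localhost:4317 (recent Jaeger versions accept OTLP natively; 14250 is Jaeger's legacy gRPC collector port and 14268 its legacy HTTP /api/traces port):

    // io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter, from opentelemetry-exporter-otlp
    OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://localhost:4317") // OTLP/gRPC endpoint of Jaeger or a collector
        .build();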
    

    Or for OTLP to a collector:

    # Collector config snippet
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
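    exporters:
      debug: # Replaces the deprecated "logging" exporter in recent collector releases
        verbosity: detailed
      # Or otlp, for example to forward traces to Jaeger or another tracing backend
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug] # Replace with your actual exporter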
    
  3. Propagate context: Ensure context propagation is enabled. For HTTP calls, this is often automatic with instrumentation libraries. For message queues (Kafka, RabbitMQ) that your instrumentation does not cover, you’ll need to inject the W3C Trace Context headers into the message metadata on the producer side and extract them on the consumer side.

    For Kafka, using Spring Kafka:

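    // Imports used by both snippets (Spring annotations and KafkaTemplate omitted for brevity)
    import io.opentelemetry.api.OpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.SpanKind;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Context;
    import io.opentelemetry.context.Scope;
    import io.opentelemetry.context.propagation.TextMapGetter;
    import java.util.Map;
    import org.springframework.kafka.support.KafkaHeaders;
    import org.springframework.messaging.support.MessageBuilder;

    // Producer side (injecting context)
    @Autowired
    private KafkaTemplate<String, OrderEvent> kafkaTemplate;
    @Autowired
    private Tracer tracer; // OpenTelemetry Tracer
    @Autowired
    private OpenTelemetry openTelemetry; // Assumed to be configured with the W3C propagator

    public void sendOrderCreatedEvent(Order order) {
        Span span = tracer.spanBuilder("OrderService.publishOrderCreated")
                          .setSpanKind(SpanKind.PRODUCER)
                          .startSpan();
        try (Scope scope = span.makeCurrent()) {
            MessageBuilder<OrderCreatedEvent> builder = MessageBuilder.withPayload(new OrderCreatedEvent(order))
                .setHeader(KafkaHeaders.TOPIC, "order-events");
            // Let the propagator write traceparent (00-<trace-id>-<span-id>-<flags>) and tracestate
            openTelemetry.getPropagators().getTextMapPropagator()
                .inject(Context.current(), builder, (b, key, value) -> b.setHeader(key, value));
            kafkaTemplate.send(builder.build());
        } finally {
            span.end();
        }
    }

    // Consumer side (extracting context)
    private static final TextMapGetter<Map<String, String>> MAP_GETTER = new TextMapGetter<Map<String, String>>() {
        @Override public Iterable<String> keys(Map<String, String> carrier) { return carrier.keySet(); }
        @Override public String get(Map<String, String> carrier, String key) { return carrier == null ? null : carrier.get(key); }
    };

    @KafkaListener(topics = "order-events")
    public void handleOrderCreatedEvent(OrderCreatedEvent event, @Header(name = "traceparent", required = false) String traceparentHeader) {
        // Rebuild the remote parent context from the header; without a header a new trace starts
        Map<String, String> carrier = traceparentHeader == null ? Map.of() : Map.of("traceparent", traceparentHeader);
        Context parent = openTelemetry.getPropagators().getTextMapPropagator()
            .extract(Context.current(), carrier, MAP_GETTER);
        Span span = tracer.spanBuilder("OrderService.handleOrderCreated")
                          .setSpanKind(SpanKind.CONSUMER)
                          .setParent(parent) // Links this span to the producer's trace
                          .startSpan();
        try (Scope scope = span.makeCurrent()) { // Make span current for subsequent operations
            // ... process event ...
        } finally {
            span.end();
        }
    }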
    

    Note: Libraries like opentelemetry-javaagent often handle HTTP and some messaging propagation automatically. Manual injection/extraction is for cases where auto-instrumentation isn’t sufficient or for custom protocols.

The real magic of Saga observability with OpenTelemetry is how it stitches together asynchronous operations. When OrderService sends a command to PaymentService via a message queue, the trace context must be embedded in the message headers. The PaymentService then extracts this context upon receiving the message, allowing its spans to be correctly linked to the originating trace. This is critical because the "workflow" isn’t just a series of synchronous RPC calls; it’s often a chain of events and commands across different transport layers.

When you have services emitting metrics alongside traces (e.g., the number of payment authorizations, the duration of stock reservation), you can correlate them. For instance, if you see a spike in payment_authorization_failed metrics, you can jump directly to the corresponding traces in your observability platform to see the full context of those failed transactions, identifying patterns or specific customer orders affected.
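
One way to picture that jump from a metric to its traces is to record a few trace IDs next to each counter increment, an idea that OpenTelemetry and Prometheus expose as exemplars. The sketch below only illustrates the concept in plain Java: `ExemplarCounterSketch` is a hypothetical class, not an OpenTelemetry API, and the limit of three exemplars is an arbitrary assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical counter that keeps recent trace ids as exemplars; not an OpenTelemetry API. */
final class ExemplarCounterSketch {
    private static final int MAX_EXEMPLARS = 3; // arbitrary limit for the sketch

    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    private final Map<String, Deque<String>> exemplars = new ConcurrentHashMap<>();

    /** Increments the metric and remembers which trace caused the increment. */
    void increment(String metric, String traceId) {
        counts.merge(metric, 1L, Long::sum);
        Deque<String> recent = exemplars.computeIfAbsent(metric, k -> new ArrayDeque<>());
        synchronized (recent) {
            if (recent.size() == MAX_EXEMPLARS) recent.removeFirst(); // drop the oldest
            recent.addLast(traceId);
        }
    }

    long count(String metric) {
        return counts.getOrDefault(metric, 0L);
    }

    /** Trace ids to look up in the tracing backend, oldest first. */
    List<String> exemplarTraceIds(String metric) {
        Deque<String> recent = exemplars.get(metric);
        if (recent == null) return List.of();
        synchronized (recent) {
            return List.copyOf(recent);
        }
    }

    public static void main(String[] args) {
        ExemplarCounterSketch metrics = new ExemplarCounterSketch();
        metrics.increment("payment_authorization_failed", "4bf92f3577b34da6a3ce929d0e0e4736");
        System.out.println(metrics.count("payment_authorization_failed")
                + " failure(s), sample traces: " + metrics.exemplarTraceIds("payment_authorization_failed"));
    }
}
```

With exemplars attached, a spike on a dashboard comes with a handful of trace IDs you can open directly instead of searching by time window.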

The most surprising thing is how little manual code you often need to write for basic tracing. With modern OpenTelemetry SDKs and auto-instrumentation agents, you can often get distributed tracing working across HTTP services with just dependency additions and minimal configuration. The complexity arises when you move beyond what auto-instrumentation covers, such as custom protocols, messaging clients the agent does not support, or hand-rolled asynchronous handoffs, where manual context propagation becomes essential.

Once you have robust tracing for your Sagas, the next logical step is to implement distributed logging, ensuring your log messages also contain trace and span IDs to allow seamless navigation from a trace to the exact logs that occurred during that specific operation.
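
As a rough sketch of that idea, the snippet below binds the current trace and span IDs to the thread and prefixes them to every log line. All names here (`TraceAwareLogSketch`, `Ids`, `withIds`, `format`) are hypothetical; in a real Java service you would more likely put the IDs from the current span into your logging framework's context (for example SLF4J's MDC) or let a logging instrumentation do it.

```java
import java.util.function.Supplier;

/** Hypothetical helper that stamps log lines with trace and span ids. */
final class TraceAwareLogSketch {

    record Ids(String traceId, String spanId) {}

    private static final ThreadLocal<Ids> CURRENT = new ThreadLocal<>();

    /** Runs work with the given ids bound to this thread, restoring the previous ids afterwards. */
    static <T> T withIds(Ids ids, Supplier<T> work) {
        Ids previous = CURRENT.get();
        CURRENT.set(ids);
        try {
            return work.get();
        } finally {
            if (previous == null) CURRENT.remove(); else CURRENT.set(previous);
        }
    }

    /** Formats a log line, adding the ids when a span is bound to this thread. */
    static String format(String level, String message) {
        Ids ids = CURRENT.get();
        String prefix = ids == null ? "" : "trace_id=" + ids.traceId() + " span_id=" + ids.spanId() + " ";
        return level + " " + prefix + message;
    }

    public static void main(String[] args) {
        Ids ids = new Ids("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7");
        System.out.println(withIds(ids, () -> format("ERROR", "payment declined")));
        // prints: ERROR trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 payment declined
    }
}
```

Searching the log backend for a `trace_id` then returns every line written during that Saga, across all services that use the same convention.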
