Sagas don’t have to be slow, even when they span many services.
Let’s watch a typical "Order Processing" saga unfold across three services: OrderService, PaymentService, and ShippingService.
graph LR
A[Client Request: New Order] --> B{OrderService};
B -- Create Order --> C{PaymentService};
C -- Process Payment --> D{ShippingService};
D -- Ship Order --> E{OrderService};
E -- Update Order Status --> F[Client Response: Order Confirmed];
Here’s what happens under the hood when a new order comes in:
- Client Request: A customer places an order.
- OrderService: Receives the request. It creates an order record with a "PENDING" status. Then, it publishes an OrderCreated event.
- PaymentService: Subscribes to OrderCreated. It receives the event and attempts to process the payment. If successful, it publishes a PaymentProcessed event.
- ShippingService: Subscribes to PaymentProcessed. It receives the event and initiates the shipping process. Once shipped, it publishes an OrderShipped event.
- OrderService: Subscribes to OrderShipped. It receives the event and updates the order status to "CONFIRMED".
- Client Response: OrderService can now respond to the client that the order is confirmed.
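The event chain above can be sketched with a tiny in-memory bus. This is only an illustration: the `EventBus` class, the handler names, and the `orders` dictionary are hypothetical stand-ins, and a real deployment would use a broker such as Kafka or RabbitMQ with each handler living in its own service.

```python
from collections import defaultdict, deque


class EventBus:
    """Tiny in-memory stand-in for a message broker (hypothetical helper)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []  # event names in delivery order

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        self.queue.append((event_name, payload))

    def run(self):
        # Deliver queued events one at a time until nothing is left.
        while self.queue:
            event_name, payload = self.queue.popleft()
            self.log.append(event_name)
            for handler in self.handlers[event_name]:
                handler(payload)


bus = EventBus()
orders = {}  # order_id -> status, owned by OrderService


def create_order(order_id):  # OrderService: entry point
    orders[order_id] = "PENDING"
    bus.publish("OrderCreated", {"order_id": order_id})


def on_order_created(event):  # PaymentService (payment assumed to succeed)
    bus.publish("PaymentProcessed", event)


def on_payment_processed(event):  # ShippingService
    bus.publish("OrderShipped", event)


def on_order_shipped(event):  # OrderService
    orders[event["order_id"]] = "CONFIRMED"


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("PaymentProcessed", on_payment_processed)
bus.subscribe("OrderShipped", on_order_shipped)

create_order("order-1")
bus.run()
print(bus.log)  # ['OrderCreated', 'PaymentProcessed', 'OrderShipped']
print(orders)   # {'order-1': 'CONFIRMED'}
```

Note that no handler calls another service directly. Each one only publishes an event, which is why every step costs a trip through the broker.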
This looks synchronous, but it’s actually asynchronous. Each service reacts to an event, and the "workflow" is driven by the events themselves (a choreographed saga, with no central coordinator). The latency here isn’t in a single, blocking call. It’s the sum of:
- Network latency for event publishing and subscription.
- Processing time within each service.
- Message broker latency (e.g., Kafka, RabbitMQ).
The real performance killer in long sagas isn’t usually the individual service processing time, but the cumulative effect of waiting for each step to complete and publish its event, and then waiting for the next service to pick it up.
Consider this scenario: if PaymentService takes 500ms to process a payment and ShippingService takes 800ms to ship, and the message broker adds 100ms RTT for each hop, you’re looking at 1.3 seconds just for the core work, plus about 300ms of broker latency across the three event hops (OrderCreated, PaymentProcessed, OrderShipped), for roughly 1.6 seconds end to end. If you have 10 steps in your saga, that adds up fast.
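To make the arithmetic concrete, here is a small helper that adds up the numbers from the scenario. The function name is hypothetical, and the hop count of three is an assumption taken from the first flow (OrderCreated, PaymentProcessed, OrderShipped).

```python
def sequential_saga_latency_ms(step_times_ms, hop_latency_ms, hops):
    """Latency of a strictly sequential saga: all work plus all broker hops."""
    return sum(step_times_ms) + hop_latency_ms * hops


# Numbers from the scenario: 500ms payment, 800ms shipping, 100ms per hop.
core_ms = sum([500, 800])
total_ms = sequential_saga_latency_ms([500, 800], hop_latency_ms=100, hops=3)
print(core_ms, total_ms)  # 1300 1600

# Broker overhead alone for a ten-hop saga at the same 100ms per hop.
ten_hop_overhead_ms = sequential_saga_latency_ms([], hop_latency_ms=100, hops=10)
print(ten_hop_overhead_ms)  # 1000
```

With ten hops, a full second goes to the broker before any service has done useful work.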
The key to minimizing latency is to maximize concurrency and reduce the number of sequential event hops.
Maximizing Concurrency:
Instead of waiting for PaymentService to finish before ShippingService can even start thinking about it, what if ShippingService could start preparing things in parallel?
Imagine OrderService publishes OrderCreated. Both PaymentService and ShippingService (for warehouse allocation, perhaps) can subscribe.
graph LR
A[Client Request: New Order] --> B{OrderService};
B -- Create Order --> C{PaymentService};
B -- Create Order --> D{ShippingService};
C -- Process Payment --> E{OrderService};
D -- Allocate Inventory --> F{OrderService};
E -- Payment Done --> G{ShippingService};
F -- Inventory Allocated --> G;
G -- Ship Order --> H{OrderService};
H -- Update Order Status --> I[Client Response: Order Confirmed];
In this revised flow:
- OrderService publishes OrderCreated.
- PaymentService and ShippingService both subscribe.
- PaymentService starts processing payment.
- ShippingService starts allocating inventory.
- When PaymentService finishes, it publishes PaymentProcessed.
- When ShippingService finishes inventory allocation, it publishes InventoryAllocated.
- OrderService now needs to track both PaymentProcessed and InventoryAllocated before it can tell ShippingService to actually ship. This requires a state machine or correlation mechanism within OrderService to know when all prerequisites are met.
This pattern is usually called parallel execution (or fan-out/fan-in) within a saga. It should not be confused with compensation, which undoes steps that have already completed. It requires more complex event correlation and more state management in the service that does the correlating (OrderService in this case).
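A minimal sketch of that correlation logic might look like the following. The `OrderCorrelator` and `OrderSagaState` names are hypothetical, state is kept in memory only, and the `request_shipment` callback stands in for whatever mechanism OrderService uses to trigger shipping.

```python
from dataclasses import dataclass, field

# Events that must both arrive before shipping may be requested.
REQUIRED = frozenset({"PaymentProcessed", "InventoryAllocated"})


@dataclass
class OrderSagaState:
    order_id: str
    received: set = field(default_factory=set)
    ship_requested: bool = False


class OrderCorrelator:
    """Tracks prerequisite events per order and fires once when all are in."""

    def __init__(self, request_shipment):
        self._sagas = {}
        self._request_shipment = request_shipment

    def on_event(self, event_name, order_id):
        if event_name not in REQUIRED:
            return False  # not a prerequisite; ignore it
        state = self._sagas.setdefault(order_id, OrderSagaState(order_id))
        state.received.add(event_name)  # a set makes duplicate deliveries harmless
        if state.received >= REQUIRED and not state.ship_requested:
            state.ship_requested = True
            self._request_shipment(order_id)
            return True
        return False


shipped = []
correlator = OrderCorrelator(shipped.append)
correlator.on_event("InventoryAllocated", "order-1")
correlator.on_event("InventoryAllocated", "order-1")  # duplicate delivery
correlator.on_event("PaymentProcessed", "order-1")    # second prerequisite arrives
correlator.on_event("PaymentProcessed", "order-1")    # redelivery, no second shipment
print(shipped)  # ['order-1']
```

The events can arrive in either order, and the `ship_requested` flag keeps a redelivered event from triggering a second shipment. A production version would persist this state rather than hold it in a dictionary.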
Reducing Sequential Hops:
Another strategy is to push more work into a single service’s transaction boundary if possible, or to combine related steps.
For example, if PaymentService and ShippingService are tightly coupled and often fail together, you might consider if they could be merged into a single OrderFulfillmentService. This isn’t always feasible due to domain boundaries, but it’s a performance lever.
More practically, consider the OrderService’s role. It currently just reacts to OrderShipped. What if ShippingService could publish an OrderReadyToShip event, and OrderService could then directly instruct ShippingService (via an API call, not an event) to ship, and then wait for a ShipmentConfirmed response? This removes the event bus hop for the final step.
graph LR
A[Client Request: New Order] --> B{OrderService};
B -- Create Order --> C{PaymentService};
B -- Create Order --> D{ShippingService};
C -- Payment Done --> E{OrderService};
D -- Inventory Allocated --> E;
E -- Ship Order Request --> F{ShippingService};
F -- Shipment Confirmed --> G{OrderService};
G -- Update Order Status --> H[Client Response: Order Confirmed];
In this third example:
- OrderService creates order, publishes OrderCreated.
- PaymentService processes payment, publishes PaymentProcessed.
- ShippingService allocates inventory, publishes InventoryAllocated.
- OrderService correlates PaymentProcessed and InventoryAllocated. Once both are received, it makes a direct API call to ShippingService’s /ship endpoint.
- ShippingService performs the physical shipment and returns a ShipmentConfirmed status directly to OrderService.
- OrderService updates its order status.
The key here is that OrderService is now acting more like an orchestrator and less like a simple event subscriber. It’s managing the state and deciding when to trigger the next step, potentially using direct API calls for efficiency when the next step is known and tightly coupled. The latency is reduced because the OrderShipped event hop is eliminated, replaced by a synchronous API call and response.
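The shape of this orchestrated flow can be sketched with `asyncio`. This is a simplification under stated assumptions: the three coroutines are hypothetical stand-ins for the remote services, the sleeps are placeholder durations, and waiting for the two events is modeled as awaiting two concurrent tasks rather than consuming from a broker.

```python
import asyncio


async def process_payment(order_id):     # stand-in for PaymentService
    await asyncio.sleep(0.05)
    return "PaymentProcessed"


async def allocate_inventory(order_id):  # stand-in for ShippingService allocation
    await asyncio.sleep(0.08)
    return "InventoryAllocated"


async def ship(order_id):                # stand-in for the direct /ship call
    await asyncio.sleep(0.01)
    return "ShipmentConfirmed"


async def run_order_saga(order_id):
    status = "PENDING"
    # Fan out: payment and inventory allocation run concurrently.
    results = await asyncio.gather(
        process_payment(order_id), allocate_inventory(order_id)
    )
    # Fan in: only request shipment once both prerequisites are met.
    if list(results) == ["PaymentProcessed", "InventoryAllocated"]:
        confirmation = await ship(order_id)  # direct call, not an event hop
        if confirmation == "ShipmentConfirmed":
            status = "CONFIRMED"
    return status


status = asyncio.run(run_order_saga("order-1"))
print(status)  # CONFIRMED
```

The wait before shipping is bounded by the slower of the two parallel steps rather than their sum, and the final confirmation comes back on the same call instead of through another published event.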
The most surprising thing about optimizing saga performance is how often the bottleneck isn’t the work being done, but the communication overhead between services. A well-tuned saga treats eventing as a distributed transaction coordinator, but direct API calls can be faster for very tightly coupled, sequential steps where immediate feedback is required.
The next frontier for performance, especially in sagas involving complex decision trees or human intervention, is often managing the state of long-running, potentially paused, sagas and ensuring they can be resumed efficiently without losing context or replaying unnecessary work.
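One way to picture resuming without replaying work is a checkpoint that records completed steps. The sketch below is hypothetical throughout: the `run_saga` helper, the step names, and the JSON checkpoint layout are illustrative assumptions, and a real system would write the checkpoint to durable storage after every step.

```python
import json


def run_saga(steps, checkpoint):
    """Run steps in order, skipping any already recorded in the checkpoint."""
    done = set(checkpoint.get("completed", []))
    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished before the pause; do not replay
        action()
        executed.append(name)
        checkpoint.setdefault("completed", []).append(name)
    return executed


calls = []
steps = [
    ("create_order", lambda: calls.append("create_order")),
    ("process_payment", lambda: calls.append("process_payment")),
    ("ship_order", lambda: calls.append("ship_order")),
]

# Simulate a saga that paused after payment and is now reloaded from storage.
saved = json.dumps(
    {"order_id": "order-1", "completed": ["create_order", "process_payment"]}
)
checkpoint = json.loads(saved)

executed = run_saga(steps, checkpoint)
print(executed)                 # ['ship_order']
print(checkpoint["completed"])  # ['create_order', 'process_payment', 'ship_order']
```

Only the outstanding step runs on resume, so the payment is not processed a second time.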