A saga isn’t a distributed transaction; it’s a sequence of local transactions where each transaction updates data and publishes an event or message to trigger the next.
Let’s watch a simple e-commerce order placement saga unfold. Imagine a customer wants to buy a book.
- Create Order: The `OrderService` creates an order in a `PENDING` state. It then publishes an `OrderCreated` event.

      { "eventType": "OrderCreated", "orderId": "ORD123", "customerId": "CUST456", "bookId": "BOOK789", "quantity": 1 }

- Reserve Inventory: The `InventoryService` listens for `OrderCreated` events. Upon receiving one, it checks if `BOOK789` has sufficient stock. If yes, it decrements the stock count and publishes an `InventoryReserved` event. If no, it publishes an `InventoryReservationFailed` event.

      { "eventType": "InventoryReserved", "orderId": "ORD123", "bookId": "BOOK789", "reservedQuantity": 1 }

- Process Payment: The `PaymentService` listens for `InventoryReserved` events. It then attempts to charge `CUST456`’s account for the book’s price. If successful, it publishes a `PaymentProcessed` event. If it fails (e.g., insufficient funds), it publishes a `PaymentProcessingFailed` event.

      { "eventType": "PaymentProcessed", "orderId": "ORD123", "customerId": "CUST456", "amount": 25.99 }

- Finalize Order: The `OrderService` listens for `PaymentProcessed` events. Upon receiving one, it updates the order status to `APPROVED`. If it receives a `PaymentProcessingFailed` event, it updates the order status to `FAILED`.
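The four steps above can be sketched as a set of handlers wired to an event bus. This is a minimal, single-process illustration, not a production design: `EventBus`, the module-level dictionaries, the flat `PRICE`, and the `customers_by_order` lookup (which lets the payment handler find the customer, since `InventoryReserved` does not carry a `customerId`) are all assumptions made for the example. A real system would use a message broker and a separate data store per service.

```python
from collections import defaultdict, deque


class EventBus:
    """In-memory FIFO bus; a stand-in for a real message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event):
        self.queue.append(event)

    def drain(self):
        # Deliver events in publish order until no handler publishes anything new.
        while self.queue:
            event = self.queue.popleft()
            for handler in self.handlers[event["eventType"]]:
                handler(event)


bus = EventBus()
orders = {}               # owned by OrderService
stock = {"BOOK789": 1}    # owned by InventoryService
customers_by_order = {}   # PaymentService's own copy, built from OrderCreated (assumption)
PRICE = 25.99             # hypothetical flat price


def create_order(order_id, customer_id, book_id, quantity):
    orders[order_id] = "PENDING"
    bus.publish({"eventType": "OrderCreated", "orderId": order_id,
                 "customerId": customer_id, "bookId": book_id, "quantity": quantity})


def reserve_inventory(event):
    book_id, quantity = event["bookId"], event["quantity"]
    if stock.get(book_id, 0) >= quantity:
        stock[book_id] -= quantity
        bus.publish({"eventType": "InventoryReserved", "orderId": event["orderId"],
                     "bookId": book_id, "reservedQuantity": quantity})
    else:
        bus.publish({"eventType": "InventoryReservationFailed",
                     "orderId": event["orderId"], "bookId": book_id})


def remember_customer(event):
    customers_by_order[event["orderId"]] = event["customerId"]


def process_payment(event):
    # A real service would call a payment provider; this sketch always succeeds.
    bus.publish({"eventType": "PaymentProcessed", "orderId": event["orderId"],
                 "customerId": customers_by_order[event["orderId"]], "amount": PRICE})


def approve_order(event):
    orders[event["orderId"]] = "APPROVED"


def fail_order(event):
    orders[event["orderId"]] = "FAILED"


bus.subscribe("OrderCreated", reserve_inventory)
bus.subscribe("OrderCreated", remember_customer)
bus.subscribe("InventoryReserved", process_payment)
bus.subscribe("PaymentProcessed", approve_order)
bus.subscribe("InventoryReservationFailed", fail_order)

create_order("ORD123", "CUST456", "BOOK789", 1)
bus.drain()
print(orders["ORD123"], stock["BOOK789"])  # APPROVED 0
```

Placing a second order for the same book would now take the `InventoryReservationFailed` branch, because the only copy has been reserved.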
What happens if inventory reservation fails? The `InventoryService` publishes `InventoryReservationFailed`. The `OrderService` listens for this event and, on receiving it, updates the order status to `FAILED`. Marking the order `FAILED` is the compensation for the Create Order step; nothing else has been committed at that point.

If payment processing fails, the `PaymentService` publishes `PaymentProcessingFailed`. The `OrderService` sees this and updates the order to `FAILED`, but by now stock has already been reserved, so the inventory reservation has to be undone as well. To do that, the `OrderService` also publishes a `PaymentFailedCompensation` event. The `InventoryService` listens for it and adds back the reserved stock.
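The payment-failure path can be sketched as two handlers. The event names come from the text; the `reservations` dictionary (how the inventory side remembers what it holds for each order), the `deliver` helper, and the starting state are assumptions made for this illustration.

```python
from collections import deque

# State as it stands after Reserve Inventory succeeded for ORD123.
orders = {"ORD123": "PENDING"}             # owned by OrderService
stock = {"BOOK789": 0}                     # owned by InventoryService
reservations = {"ORD123": ("BOOK789", 1)}  # InventoryService's record of what it holds
queue = deque()


def on_payment_processing_failed(event):
    """OrderService: fail the order, then ask for the reservation to be undone."""
    orders[event["orderId"]] = "FAILED"
    queue.append({"eventType": "PaymentFailedCompensation",
                  "orderId": event["orderId"]})


def on_payment_failed_compensation(event):
    """InventoryService: put the reserved stock back."""
    reservation = reservations.pop(event["orderId"], None)
    if reservation is None:
        return  # nothing held for this order, or it was already released
    book_id, quantity = reservation
    stock[book_id] += quantity


HANDLERS = {
    "PaymentProcessingFailed": [on_payment_processing_failed],
    "PaymentFailedCompensation": [on_payment_failed_compensation],
}


def deliver(event):
    # Process the event and anything its handlers publish in response.
    queue.append(event)
    while queue:
        current = queue.popleft()
        for handler in HANDLERS.get(current["eventType"], []):
            handler(current)


deliver({"eventType": "PaymentProcessingFailed", "orderId": "ORD123",
         "customerId": "CUST456"})
print(orders["ORD123"], stock["BOOK789"])  # FAILED 1
```

Removing the reservation record as part of releasing the stock means a repeated `PaymentFailedCompensation` for the same order does not add the stock back twice.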
This entire sequence, with its forward execution and potential compensation steps, is a saga. It solves the problem of maintaining data consistency across independent services without the tight coupling and blocking nature of traditional distributed transactions.
The core problem sagas solve is maintaining eventual consistency in a distributed system where multiple services own their data. Unlike ACID transactions, sagas don’t guarantee atomicity (all or nothing) in the same way: intermediate states, such as an order that is still PENDING while its stock is already reserved, are visible to the rest of the system while the saga runs. Instead, they ensure that the system eventually reaches a consistent state, running compensating actions if any step fails. The trade-off is giving up immediate consistency in exchange for availability and scalability.
The exact levers you control are the events published by each service and the logic within each service to handle those events, including the compensation logic. For instance, in the payment step, you might have a `PaymentRetryPolicy` that dictates how many times the `PaymentService` will attempt to charge the customer before publishing `PaymentProcessingFailed`.
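One way such a retry policy could look is sketched below. Only the name `PaymentRetryPolicy` comes from the text; its fields, the `TransientPaymentError` exception, and the `charge` callable are hypothetical. The sketch retries only errors that are plausibly temporary, since a hard decline would not be expected to succeed on a second attempt.

```python
import time
from dataclasses import dataclass


class TransientPaymentError(Exception):
    """Hypothetical error for failures worth retrying (timeouts, provider hiccups)."""


@dataclass(frozen=True)
class PaymentRetryPolicy:
    max_attempts: int = 3
    delay_seconds: float = 0.0  # pause between attempts


def process_payment(event, charge, policy):
    """Try to charge; return the event the PaymentService would publish."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            charge(event["customerId"], event["amount"])
        except TransientPaymentError:
            if attempt < policy.max_attempts:
                time.sleep(policy.delay_seconds)
        else:
            return {"eventType": "PaymentProcessed", "orderId": event["orderId"],
                    "customerId": event["customerId"], "amount": event["amount"]}
    # Every attempt failed: give up and let the saga compensate.
    return {"eventType": "PaymentProcessingFailed", "orderId": event["orderId"],
            "customerId": event["customerId"]}


calls = []


def flaky_charge(customer_id, amount):
    """Stand-in for a payment provider that fails twice, then succeeds."""
    calls.append(customer_id)
    if len(calls) < 3:
        raise TransientPaymentError("provider timeout")


request = {"orderId": "ORD123", "customerId": "CUST456", "amount": 25.99}
result = process_payment(request, flaky_charge, PaymentRetryPolicy(max_attempts=3))
print(result["eventType"], len(calls))  # PaymentProcessed 3
```

With `max_attempts=2` the same provider behaviour would end in `PaymentProcessingFailed`, which is the point at which compensation starts.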
When implementing a saga, the choreography-based approach (where each service publishes events and others react) is often simpler to start with, but can become hard to track. The alternative, orchestration-based approach (where a central orchestrator service dictates the flow and calls each service directly, managing compensation logic), offers more control but introduces a single point of failure and a more complex orchestrator.
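For contrast with the event-driven flow, here is a minimal sketch of the orchestration style. The orchestrator holds the list of steps, each paired with its compensation, and undoes completed steps in reverse order when one fails. Everything here is hypothetical: `run_saga`, `StepFailed`, the step functions, and the `balance` field are stand-ins, and the in-process function calls take the place of what would be remote calls to each service.

```python
class StepFailed(Exception):
    """Raised by a step that cannot complete."""


def run_saga(steps, context):
    """Run (action, compensation) pairs in order; on failure, undo in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action(context)
        except StepFailed:
            for undo in reversed(completed):
                undo(context)
            context["status"] = "FAILED"
            return context
        completed.append(compensation)
    context["status"] = "APPROVED"
    return context


stock = {"BOOK789": 1}


def reserve_inventory(ctx):
    if stock[ctx["bookId"]] < ctx["quantity"]:
        raise StepFailed("insufficient stock")
    stock[ctx["bookId"]] -= ctx["quantity"]


def release_inventory(ctx):
    stock[ctx["bookId"]] += ctx["quantity"]


def charge_customer(ctx):
    if ctx["amount"] > ctx["balance"]:
        raise StepFailed("insufficient funds")
    ctx["balance"] -= ctx["amount"]


def refund_customer(ctx):
    ctx["balance"] += ctx["amount"]


ORDER_SAGA = [
    (reserve_inventory, release_inventory),
    (charge_customer, refund_customer),
]

order = {"orderId": "ORD123", "bookId": "BOOK789", "quantity": 1,
         "amount": 25.99, "balance": 10.00}
print(run_saga(ORDER_SAGA, order)["status"], stock["BOOK789"])  # FAILED 1
```

The whole flow is readable in one place (`ORDER_SAGA`), which is the control the orchestration approach buys, at the cost of every step depending on the orchestrator being up.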
The most surprising thing is that a failed step in a saga doesn’t necessarily mean the entire business transaction fails permanently. It means the forward progress stops, and compensation kicks in to roll back only the completed steps. This allows other parts of the system to continue operating, which is a massive win for availability.
The next concept you’ll run into is idempotency, which is critical for reliably handling retries and ensuring events aren’t processed multiple times.
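As a small preview, one common way to make a handler idempotent is to remember which events it has already handled. The sketch below assumes each event carries a unique `eventId`, which the sample payloads in this walkthrough do not include; the in-memory set is also a simplification, since a real service would persist it alongside the stock update.

```python
stock = {"BOOK789": 5}
processed_event_ids = set()  # assumption: persisted with the stock update in practice


def reserve_inventory(event):
    """Reserve stock once per event, even if the broker delivers it again."""
    if event["eventId"] in processed_event_ids:
        return "duplicate"
    processed_event_ids.add(event["eventId"])
    if stock.get(event["bookId"], 0) >= event["quantity"]:
        stock[event["bookId"]] -= event["quantity"]
        return "InventoryReserved"
    return "InventoryReservationFailed"


order_created = {"eventId": "evt-1", "eventType": "OrderCreated",
                 "orderId": "ORD123", "bookId": "BOOK789", "quantity": 1}
first = reserve_inventory(order_created)
second = reserve_inventory(order_created)  # redelivery of the same event
print(first, second, stock["BOOK789"])  # InventoryReserved duplicate 4
```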