A saga is a sequence of local transactions. If one transaction fails, the saga executes a series of compensating transactions to undo the preceding operations.
Consider a simple order placement saga:
- Order Service: Create an order.
- Payment Service: Process payment.
- Inventory Service: Reserve stock.
If the Inventory Service fails to reserve stock, the saga needs to compensate.
- Compensating Transaction for Inventory: Release the reserved stock (if any was partially reserved).
- Compensating Transaction for Payment: Refund the payment.
- Compensating Transaction for Order: Mark the order as failed/cancelled.
Here’s a conceptual representation of the flow:
+----------------+       +-------------------+       +----------------+
| Order Created  | ----> | Payment Processed | ----> | Stock Reserved |
+----------------+       +-------------------+       +----------------+
        ^                                                    |
        |                                                    v
+----------------+       +-------------------+       +----------------+
| Order Failed   | <---- | Payment Refunded  | <---- | Stock Released |
+----------------+       +-------------------+       +----------------+
What distinguishes one saga implementation from another is how this state and execution are coordinated. There are two main patterns:
Choreography: Each service publishes an event upon completion of its local transaction. Other services listen to these events and trigger their own local transactions or compensating actions.
Imagine the Order Service successfully creates an order and publishes an OrderCreated event. The Payment Service listens for OrderCreated, processes the payment, and publishes a PaymentProcessed event. The Inventory Service listens for PaymentProcessed, reserves stock, and publishes a StockReserved event.
If Inventory fails, it publishes a StockReservationFailed event. The Payment Service listens for StockReservationFailed, initiates a refund, and publishes a PaymentRefunded event. The Order Service listens for PaymentRefunded, marks the order as failed, and publishes an OrderFailed event.
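The event chain above can be sketched with a synchronous in-memory event bus standing in for a real message broker. This is only an illustration under assumptions: `EventBus`, `wire_services`, and `run_order_saga` are hypothetical names, the services are reduced to event handlers, and the `stock_available` flag simulates whether the Inventory Service succeeds.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Synchronous in-memory stand-in for a message broker (assumption)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.history: List[str] = []  # event names in publication order

    def subscribe(self, event: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.history.append(event)
        for handler in list(self._handlers[event]):
            handler(payload)


def wire_services(bus: EventBus, stock_available: bool) -> None:
    # Payment Service: pays on OrderCreated, refunds on StockReservationFailed.
    bus.subscribe("OrderCreated", lambda p: bus.publish("PaymentProcessed", p))
    bus.subscribe("StockReservationFailed", lambda p: bus.publish("PaymentRefunded", p))

    # Inventory Service: tries to reserve stock once payment has gone through.
    def reserve_stock(payload: dict) -> None:
        event = "StockReserved" if stock_available else "StockReservationFailed"
        bus.publish(event, payload)

    bus.subscribe("PaymentProcessed", reserve_stock)

    # Order Service: marks the order as failed once the refund is done.
    bus.subscribe("PaymentRefunded", lambda p: bus.publish("OrderFailed", p))


def run_order_saga(stock_available: bool) -> List[str]:
    bus = EventBus()
    wire_services(bus, stock_available)
    # The Order Service has created the order and announces it.
    bus.publish("OrderCreated", {"order_id": "order-1"})
    return bus.history
```

Note that no component knows the whole flow: the saga exists only as the sum of the subscriptions, which is what makes choreography harder to follow as it grows.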
Orchestration: A central orchestrator (a dedicated service or state machine) manages the saga’s execution. It sends commands to services to perform local transactions and receives events back indicating success or failure. The orchestrator then decides the next step, whether it’s the next transaction in the forward flow or a compensating transaction.
Using the same order example with an orchestrator:
- Orchestrator sends CreateOrderCommand to Order Service.
- Order Service responds with OrderCreatedEvent.
- Orchestrator sends ProcessPaymentCommand to Payment Service.
- Payment Service responds with PaymentProcessedEvent.
- Orchestrator sends ReserveStockCommand to Inventory Service.
- Inventory Service fails and responds with StockReservationFailedEvent.
- Orchestrator, upon receiving StockReservationFailedEvent, sends RefundPaymentCommand to Payment Service.
- Payment Service responds with PaymentRefundedEvent.
- Orchestrator sends CancelOrderCommand to Order Service.
- Order Service responds with OrderFailedEvent.
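A minimal orchestrator for this exchange might look like the sketch below. It assumes a hypothetical `send` callable that delivers a command and returns the name of the event the service replies with; the `Step` class and `run_saga` function are illustrative, not part of any particular framework. Each step records the command that undoes it, and on failure the orchestrator walks the completed steps in reverse.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    command: str                 # forward command sent to a service
    success_event: str           # event that means the step succeeded
    compensation: Optional[str]  # command that undoes the step, if any


ORDER_SAGA = [
    Step("CreateOrderCommand", "OrderCreatedEvent", "CancelOrderCommand"),
    Step("ProcessPaymentCommand", "PaymentProcessedEvent", "RefundPaymentCommand"),
    Step("ReserveStockCommand", "StockReservedEvent", None),
]


def run_saga(steps: List[Step], send: Callable[[str], str]) -> List[str]:
    """Run the steps in order; on failure, compensate completed steps in reverse.

    Returns the alternating log of commands sent and events received.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        log.append(step.command)
        event = send(step.command)
        log.append(event)
        if event != step.success_event:
            # Any other reply is treated as failure: undo what already succeeded.
            for done in reversed(completed):
                if done.compensation is not None:
                    log.append(done.compensation)
                    log.append(send(done.compensation))
            break
        completed.append(step)
    return log


# Simulated services (assumption): Inventory fails, everything else succeeds.
replies = {
    "CreateOrderCommand": "OrderCreatedEvent",
    "ProcessPaymentCommand": "PaymentProcessedEvent",
    "ReserveStockCommand": "StockReservationFailedEvent",
    "RefundPaymentCommand": "PaymentRefundedEvent",
    "CancelOrderCommand": "OrderFailedEvent",
}
saga_log = run_saga(ORDER_SAGA, lambda command: replies[command])
```

With these simulated replies, `saga_log` contains the same ten commands and events, in the same order, as the list above.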
The orchestrator pattern is often easier to reason about because the logic for the entire saga is in one place. Choreography can become complex as more services and events are added, leading to a distributed "big ball of mud."
The core problem sagas solve is maintaining data consistency across distributed services without a single ACID transaction that spans all of them, which is often impractical or impossible in microservice architectures. Instead of one atomic operation, a saga is a sequence of locally atomic operations, each committed by the service that owns the data.
The "state" of the saga is crucial. In orchestration, the orchestrator holds this state. In choreography, each service might hold a piece of the state, or there might be a separate saga log. This state dictates which compensating transaction to run for which preceding transaction. For example, if payment succeeded but inventory failed, you refund payment. If payment failed, there’s no need to refund.
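To make the role of saga state concrete, here is a small sketch that derives the compensations from a record of which steps have completed. The step names and the `COMPENSATIONS` mapping are hypothetical; a real saga log would typically be persisted rather than held in a Python list.

```python
from typing import Dict, List

# Hypothetical mapping from each forward step to the step that undoes it.
COMPENSATIONS: Dict[str, str] = {
    "CreateOrder": "CancelOrder",
    "ProcessPayment": "RefundPayment",
    "ReserveStock": "ReleaseStock",
}


def compensations_for(completed_steps: List[str]) -> List[str]:
    """Return the compensations to run, newest completed step first."""
    return [COMPENSATIONS[step] for step in reversed(completed_steps)]


# Payment succeeded but inventory failed: refund, then cancel the order.
after_inventory_failure = compensations_for(["CreateOrder", "ProcessPayment"])

# Payment itself failed: only the order needs to be cancelled, no refund.
after_payment_failure = compensations_for(["CreateOrder"])
```

Only steps that actually completed appear in the state, so a step that never ran is never compensated.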
What most people miss is that the compensating transactions themselves must be idempotent. If a refund message is delivered twice, you don’t want to refund twice. This is usually achieved by including a unique transaction ID in the compensating command and having the service check if that ID has already been processed.
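The ID check can be sketched as follows. `PaymentService` and `handle_refund` are hypothetical names, amounts are in cents, and the processed IDs live in an in-memory set purely for illustration; in a real service the ID would usually be recorded in the same local database transaction as the refund so the two cannot diverge.

```python
from typing import Set


class PaymentService:
    """Toy payment service that deduplicates refund commands by ID."""

    def __init__(self) -> None:
        self.refunded_total = 0               # total refunded, in cents
        self._processed_ids: Set[str] = set()  # IDs of refunds already applied

    def handle_refund(self, transaction_id: str, amount: int) -> bool:
        """Apply the refund once; return False for a duplicate delivery."""
        if transaction_id in self._processed_ids:
            return False  # already handled: do nothing, but do not fail
        self._processed_ids.add(transaction_id)
        self.refunded_total += amount
        return True


service = PaymentService()
first = service.handle_refund("refund-42", 1999)
second = service.handle_refund("refund-42", 1999)  # redelivered message
```

The duplicate is acknowledged without effect, so a broker that delivers the message at least once can safely retry.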
The next conceptual hurdle is handling long-running sagas and ensuring their eventual completion even in the face of network partitions or service outages.