A saga is more than just a sequence of operations; it’s a distributed transaction that guarantees eventual consistency across multiple services.
Let’s see a simple saga in action, managing an order placement process. We have three services: OrderService, PaymentService, and InventoryService.
// Request to OrderService
POST /orders
{
"orderId": "ORD123",
"customerId": "CUST456",
"items": [
{"productId": "PROD789", "quantity": 2}
],
"totalAmount": 100.00
}
// OrderService initiates saga, calls PaymentService
POST /payments/authorize
{
"orderId": "ORD123",
"customerId": "CUST456",
"amount": 100.00
}
// PaymentService authorizes and replies; OrderService then calls InventoryService
POST /inventory/reserve
{
"orderId": "ORD123",
"items": [
{"productId": "PROD789", "quantity": 2}
]
}
// InventoryService reserves, responds to OrderService
// OrderService confirms payment and inventory, completes order
// OrderService responds to client
201 Created
{
"orderId": "ORD123",
"status": "COMPLETED"
}
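If a step fails, OrderService must compensate for the steps that already succeeded. If PaymentService rejects the authorization, nothing has been committed yet, so OrderService only needs to mark the order as failed; if the outcome is unknown (for example, the request timed out), it calls a compensation endpoint on PaymentService (e.g., POST /payments/void) to reverse any authorization that may have gone through. Similarly, if InventoryService fails after the payment was authorized, OrderService calls POST /payments/void (or POST /payments/refund if the payment was already captured) on PaymentService. POST /inventory/release on InventoryService is needed only if a reservation was actually made before the failure.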
The core problem sagas solve is maintaining data integrity in microservice architectures without the overhead and limitations of traditional ACID transactions across service boundaries. Each service owns its data, and the saga coordinates a series of local transactions, with a defined compensation action for each step. This allows for high availability and scalability while still ensuring that the overall business transaction eventually either completes or has its effects undone by compensation. Unlike an ACID transaction, a saga provides no isolation: intermediate states are visible to other requests while the saga is in flight.
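The idea of pairing each local transaction with a compensation action can be sketched in a few lines of Python. This is a minimal in-process illustration, not a production saga engine: `SagaStep`, `run_saga`, and the lambda actions are hypothetical names standing in for real service calls, and the inventory step is made to fail on purpose.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # local transaction in one service
    compensation: Callable[[], None]  # undoes the effect of the action


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step is assumed to have changed nothing,
            # so only the steps that completed earlier are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: List[str] = []


def reserve_inventory() -> None:
    raise RuntimeError("out of stock")  # simulated failure


steps = [
    SagaStep("payment", lambda: log.append("authorize"), lambda: log.append("void")),
    SagaStep("inventory", reserve_inventory, lambda: log.append("release")),
]

succeeded = run_saga(steps)
print(succeeded, log)  # False ['authorize', 'void']
```

Note that the inventory compensation never runs here, because the reservation itself never succeeded.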
The two main patterns for implementing sagas are:
- Choreography: Services communicate with each other through events. When a service completes its local transaction, it publishes an event, and other services interested in that event react and perform their own local transactions. This leads to a decentralized system where no single service orchestrates the entire flow.
  - Example: OrderService completes its initial step and publishes an OrderCreated event. PaymentService listens for OrderCreated, processes the payment, and publishes a PaymentAuthorized event. InventoryService listens for PaymentAuthorized, reserves stock, and publishes a StockReserved event. OrderService listens for StockReserved and marks the order as complete.
- Orchestration: A central orchestrator service manages the saga flow. The orchestrator sends commands to each participating service and listens for replies. If a step fails, the orchestrator is responsible for invoking the compensation actions on the preceding services.
  - Example: OrderService acts as the orchestrator. It sends an AuthorizePayment command to PaymentService. Upon receiving the PaymentAuthorized reply, it sends a ReserveInventory command to InventoryService. If ReserveInventory fails, the orchestrator sends a RefundPayment command to PaymentService; a ReleaseInventory command to InventoryService is needed only if stock was reserved before the failure.
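The choreography example can be illustrated with a toy, synchronous event bus. This is only a sketch under simplifying assumptions: `EventBus` and the handler functions are hypothetical, everything runs in one process, and a real system would publish through a message broker with asynchronous delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Toy synchronous bus standing in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []  # event types, in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.published.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
orders: Dict[str, str] = {}  # orderId -> status, owned by OrderService


def payment_on_order_created(event: Dict) -> None:
    # PaymentService: authorize the payment, then announce it.
    bus.publish("PaymentAuthorized", event)


def inventory_on_payment_authorized(event: Dict) -> None:
    # InventoryService: reserve stock, then announce it.
    bus.publish("StockReserved", event)


def order_on_stock_reserved(event: Dict) -> None:
    # OrderService: the last event in the chain completes the order.
    orders[event["orderId"]] = "COMPLETED"


bus.subscribe("OrderCreated", payment_on_order_created)
bus.subscribe("PaymentAuthorized", inventory_on_payment_authorized)
bus.subscribe("StockReserved", order_on_stock_reserved)

# OrderService performs its local transaction and publishes the first event.
orders["ORD123"] = "PENDING"
bus.publish("OrderCreated", {"orderId": "ORD123"})

print(orders["ORD123"])  # COMPLETED
print(bus.published)     # ['OrderCreated', 'PaymentAuthorized', 'StockReserved']
```

No component here knows the whole flow; each handler only knows which event it reacts to and which event it emits, which is the defining trait of choreography.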
When testing sagas end-to-end, assertions go beyond simple state checks. You need to verify not only the final desired state but also the intermediate states and the successful execution of compensation actions.
For a successful order:
- Assertion 1: The final Order status in OrderService is COMPLETED.
- Assertion 2: PaymentService has a record of a successful authorization for the order amount.
- Assertion 3: InventoryService has a record of the items being reserved for the order.
For a failed order (e.g., inventory unavailable after payment authorization):
- Assertion 1: The final Order status in OrderService is FAILED or CANCELLED.
- Assertion 2: PaymentService has a record of an authorization that was subsequently refunded or voided.
- Assertion 3: InventoryService has a record of the items not being reserved (or of the reservation being released, if it happened before the failure).
- Assertion 4: No compensation actions were erroneously triggered for steps that never executed (e.g., no inventory release when nothing was reserved).
A common pitfall is relying solely on the final state. You must also test the failure paths rigorously. This often involves mocking or stubbing downstream services to simulate failures at various stages of the saga. For instance, to test compensation after an inventory failure, you’d stub InventoryService to return an error to OrderService’s reservation request once the payment has been authorized. Then you’d assert that OrderService calls PaymentService’s compensation endpoint, and that it calls no compensation for steps that never ran.
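A failure-path test of this kind might look like the following sketch, which uses `unittest.mock` from the Python standard library to stub the downstream services. The `place_order` function and the `authorize`, `void`, `reserve`, and `release` method names are hypothetical stand-ins for the real orchestrator and service clients; the scenario simulated is an inventory failure after a successful authorization.

```python
from unittest.mock import Mock


def place_order(order_id, amount, items, payment, inventory) -> str:
    """Hypothetical orchestrator: authorize payment, then reserve stock."""
    try:
        payment.authorize(order_id, amount)
    except Exception:
        return "FAILED"  # nothing has succeeded yet, so nothing to compensate
    try:
        inventory.reserve(order_id, items)
    except Exception:
        payment.void(order_id)  # compensate the successful authorization
        return "FAILED"
    return "COMPLETED"


# Stub both services; make the inventory reservation fail.
payment = Mock()
inventory = Mock()
inventory.reserve.side_effect = RuntimeError("inventory unavailable")

items = [{"productId": "PROD789", "quantity": 2}]
status = place_order("ORD123", 100.00, items, payment, inventory)

# Final state, plus the compensation calls that led to it.
assert status == "FAILED"
payment.authorize.assert_called_once_with("ORD123", 100.00)
payment.void.assert_called_once_with("ORD123")
inventory.release.assert_not_called()  # nothing was reserved
```

The last assertion covers the case that final-state checks miss: a compensation call issued for a step that never completed.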
The true complexity in sagas often lies not in the happy path, but in ensuring that compensation logic correctly handles partial successes and the potential for retries or idempotency issues in distributed systems. If a compensation action itself fails, the system enters a much more complex error handling state, often requiring manual intervention or a dedicated "dead-letter" queue for failed compensation steps.
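One way to picture idempotent compensation with retries and a dead-letter fallback is the sketch below. It is an in-memory illustration only: `void_payment`, `compensate_with_retry`, the `processed` set, and the `dead_letter` list are hypothetical, and a real system would persist the idempotency keys, add backoff between attempts, and use an actual queue.

```python
from typing import Callable, List, Set

processed: Set[str] = set()  # idempotency keys already handled
dead_letter: List[str] = []  # failed compensations awaiting manual review


def void_payment(order_id: str, gateway: Callable[[str], None]) -> None:
    """Idempotent: a repeated call for the same order is a no-op."""
    key = f"void:{order_id}"
    if key in processed:
        return
    gateway(order_id)
    processed.add(key)  # recorded only after the gateway call succeeds


def compensate_with_retry(order_id: str, gateway: Callable[[str], None],
                          max_attempts: int = 3) -> bool:
    """Retry the compensation; park it in the dead-letter list if it keeps failing."""
    for _ in range(max_attempts):
        try:
            void_payment(order_id, gateway)
            return True
        except Exception:
            continue  # a real implementation would back off before retrying
    dead_letter.append(order_id)
    return False


calls = {"n": 0}


def flaky_gateway(order_id: str) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")  # fails twice, then succeeds


def broken_gateway(order_id: str) -> None:
    raise ConnectionError("gateway down")


first = compensate_with_retry("ORD123", flaky_gateway)    # succeeds on attempt 3
second = compensate_with_retry("ORD123", flaky_gateway)   # no-op, already voided
third = compensate_with_retry("ORD999", broken_gateway)   # exhausts all attempts

print(first, second, calls["n"])  # True True 3
print(third, dead_letter)         # False ['ORD999']
```

The second call shows why idempotency matters: a retried or duplicated compensation message does not void the payment twice.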
The next challenge is managing the complexity of long-running sagas and their state persistence.