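The most surprising thing about Sagas is that they keep data consistent across services without distributed locks or a two-phase commit. Instead, they rely on a series of independent local transactions, each paired with a compensating action that undoes it if a later step fails. The result is eventual consistency rather than ACID isolation.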
Let’s see this in action. Imagine a simple e-commerce order process:
- Order Service: Creates an order, publishes an OrderCreated event.
- Payment Service: Listens for OrderCreated, attempts to process payment, publishes PaymentProcessed or PaymentFailed.
- Inventory Service: Listens for PaymentProcessed, reserves inventory, publishes InventoryReserved or InventoryReservationFailed.
- Shipping Service: Listens for InventoryReserved, creates a shipment, publishes OrderShipped.
Here’s a snippet of what the messages might look like on RabbitMQ:
// Order Service publishes this
{
  "eventType": "OrderCreated",
  "orderId": "ORD-12345",
  "customerId": "CUST-67890",
  "totalAmount": 150.75
}

// Payment Service receives, processes, publishes this
{
  "eventType": "PaymentProcessed",
  "orderId": "ORD-12345",
  "paymentId": "PAY-ABCDEF"
}

// Or if it fails, it publishes this
{
  "eventType": "PaymentFailed",
  "orderId": "ORD-12345",
  "reason": "Insufficient funds"
}

// Inventory Service receives PaymentProcessed, reserves, publishes this
{
  "eventType": "InventoryReserved",
  "orderId": "ORD-12345",
  "reservationId": "INV-GHIJKL"
}

// Or if it fails, it publishes this
{
  "eventType": "InventoryReservationFailed",
  "orderId": "ORD-12345",
  "reason": "Item out of stock"
}

// Shipping Service receives InventoryReserved, ships, publishes this
{
  "eventType": "OrderShipped",
  "orderId": "ORD-12345",
  "shipmentId": "SHIP-MNOPQR"
}
The problem Sagas solve is how to maintain data consistency across multiple independent services when a single business transaction spans them. A single ACID transaction is not an option because each service owns its own data store and commits independently. If the Order Service creates an order and the Payment Service charges the customer, but the Inventory Service then fails to reserve stock, you cannot simply roll back the payment: it has already been committed. The Payment Service needs to be told to undo its action. This undoing is the "compensating action."
In a Saga, each step publishes an event indicating success or failure. A failure event triggers the upstream services that have already completed their steps to execute their compensating actions. For example, if InventoryReservationFailed is published, the Payment Service would receive it, refund the payment (its compensating action for processing the payment), and publish a RefundIssued event. The Order Service would then mark the order as cancelled and publish an OrderCancelled event.
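To make the compensation flow concrete, here is a minimal in-memory sketch. It is not RabbitMQ code: `InMemoryBus`, `wire_services`, and `run_saga` are hypothetical names invented for illustration, and the "services" are plain functions that only publish events. It assumes the Order Service cancels the order when it sees RefundIssued.

```python
from collections import defaultdict, deque


class InMemoryBus:
    """Toy stand-in for a broker: fan-out by event type, FIFO delivery."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.pending = deque()
        self.log = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event):
        self.log.append(event["eventType"])
        self.pending.append(event)

    def run(self):
        # Deliver until no service has anything left to react to.
        while self.pending:
            event = self.pending.popleft()
            for handler in self.handlers[event["eventType"]]:
                handler(event)


def wire_services(bus, in_stock):
    """Register each service's reactions; no service calls another directly."""

    def emit(event_type, source, **extra):
        bus.publish({"eventType": event_type, "orderId": source["orderId"], **extra})

    def inventory_on_payment_processed(event):
        if in_stock:
            emit("InventoryReserved", event)
        else:
            emit("InventoryReservationFailed", event, reason="Item out of stock")

    # Forward steps of the saga.
    bus.subscribe("OrderCreated", lambda e: emit("PaymentProcessed", e))
    bus.subscribe("PaymentProcessed", inventory_on_payment_processed)
    bus.subscribe("InventoryReserved", lambda e: emit("OrderShipped", e))
    # Compensating steps, triggered by the failure event.
    bus.subscribe("InventoryReservationFailed", lambda e: emit("RefundIssued", e))
    bus.subscribe("RefundIssued", lambda e: emit("OrderCancelled", e))


def run_saga(in_stock):
    bus = InMemoryBus()
    wire_services(bus, in_stock)
    bus.publish({"eventType": "OrderCreated", "orderId": "ORD-12345"})
    bus.run()
    return bus.log
```

Calling `run_saga(False)` returns the event types in publication order, ending with RefundIssued and OrderCancelled, while `run_saga(True)` ends with OrderShipped. A real service would perform its local transaction before publishing; the sketch only shows who reacts to what.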
The key is that each service acts independently based on the events it receives. RabbitMQ acts as the central nervous system, delivering these events. We use durable queues and persistent messages so that events survive a broker restart. For example, an OrderCreated event is published to an exchange (e.g., order.events) and routed to the queues of the services interested in it (e.g., payment_service_queue). The payment_service_queue is declared with durable = true, and messages are published with delivery_mode = 2 (persistent).
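As a rough sketch of that setup with the pika client, the helpers below declare the durable exchange and queue and publish a persistent message. The exchange type (`topic`), the routing key `order.created`, and the function names are assumptions for illustration; only order.events, payment_service_queue, durable, and delivery_mode = 2 come from the text above.

```python
import json


def declare_topology(channel):
    """Declare a durable exchange and queue, then bind them.

    `channel` is expected to be a pika channel, e.g. obtained from
    pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel().
    """
    channel.exchange_declare(
        exchange="order.events", exchange_type="topic", durable=True
    )
    channel.queue_declare(queue="payment_service_queue", durable=True)
    channel.queue_bind(
        queue="payment_service_queue",
        exchange="order.events",
        routing_key="order.created",  # assumed routing key
    )


def publish_order_created(channel, order):
    """Publish an OrderCreated event as a persistent message."""
    # Imported here so declare_topology can be exercised without the
    # third-party dependency installed (pip install pika).
    import pika

    channel.basic_publish(
        exchange="order.events",
        routing_key="order.created",
        body=json.dumps(order),
        properties=pika.BasicProperties(
            delivery_mode=2,  # persistent
            content_type="application/json",
        ),
    )
```

Durability and persistence only cover broker restarts. They do not tell the publisher whether the broker actually accepted the message, which is what publisher confirms are for.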
The "choreography" aspect means there’s no central orchestrator dictating the flow. Services react to events. This makes the system highly decoupled. If the Shipping Service needs to be added later, it just subscribes to InventoryReserved and publishes its own events.
To ensure reliability with RabbitMQ, use these configurations:
- Publisher Confirms: The producer (e.g., Order Service) waits for an acknowledgment from RabbitMQ that the message has been safely received by the broker. This prevents message loss before it enters the queue.
- Durable Queues and Exchanges: Ensures that queues and exchanges survive broker restarts.
- Persistent Messages: Messages published as persistent to a durable queue are written to disk, so they survive broker restarts.
- Consumer Acknowledgements (ACKs): Consumers explicitly acknowledge messages after they have been successfully processed. If a consumer crashes before acknowledging, RabbitMQ redelivers the message. This is crucial for compensating actions.
- Dead Letter Exchanges (DLX): If a message cannot be processed (e.g., it is rejected without requeue after repeated failures), it can be routed to a DLX for later inspection.
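The consumer-side settings in the list above might be wired together roughly as follows with pika. This is a sketch under assumptions: the names `saga.dlx` and `saga.dead_letters`, the fanout exchange type, and the policy of dead-lettering on the first failure (rather than after a retry budget) are all illustrative choices, not something the text prescribes.

```python
import json


def declare_queue_with_dlx(channel):
    """Declare a durable work queue whose rejected messages go to a DLX."""
    channel.exchange_declare(exchange="saga.dlx", exchange_type="fanout", durable=True)
    channel.queue_declare(queue="saga.dead_letters", durable=True)
    channel.queue_bind(queue="saga.dead_letters", exchange="saga.dlx")
    # Note: redeclaring an existing queue with different arguments fails,
    # so the DLX argument has to be present from the first declaration.
    channel.queue_declare(
        queue="payment_service_queue",
        durable=True,
        arguments={"x-dead-letter-exchange": "saga.dlx"},
    )


def make_on_message(handle_event):
    """Build a pika callback that acks on success and dead-letters on failure."""

    def on_message(ch, method, properties, body):
        try:
            handle_event(json.loads(body))
        except Exception:
            # requeue=False lets the broker route the message to the DLX
            # instead of redelivering it to this queue.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            # Ack only after the handler has finished its work.
            ch.basic_ack(delivery_tag=method.delivery_tag)

    return on_message


def start_consumer(channel, handle_event):
    """Register the callback with manual acknowledgements enabled."""
    channel.confirm_delivery()  # publisher confirms for events this service emits
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(
        queue="payment_service_queue",
        on_message_callback=make_on_message(handle_event),
        auto_ack=False,
    )
```

The caller would then run `channel.start_consuming()`. If the process dies between `handle_event` returning and the ack being sent, the broker redelivers the message, which is why the handlers need to be idempotent.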
The most counterintuitive part of implementing Sagas with choreography is how you handle idempotency and retries. Since messages can be redelivered, each service must be able to process the same event multiple times without causing duplicate side effects. This is typically achieved by recording, for each orderId, which events the service has already handled and the resulting state of the transaction. Before executing any action, the service checks whether it has already processed that event for that orderId. If so, it acknowledges the redelivered message without re-executing the logic.
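One way to sketch that check is a "processed events" table keyed on orderId and event type, written in the same local transaction as the side effect. The class name `ProcessedEvents`, the table layout, and the use of SQLite are assumptions for illustration; a real service would use its own database.

```python
import sqlite3


class ProcessedEvents:
    """Remembers which (orderId, eventType) pairs this service has handled."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_events ("
            " order_id TEXT NOT NULL,"
            " event_type TEXT NOT NULL,"
            " PRIMARY KEY (order_id, event_type))"
        )

    def handle_once(self, event, action):
        """Run action(event) unless this event was already handled.

        Returns True if the action ran, False if the event was a duplicate.
        """
        try:
            # One local transaction: if the action raises, the marker row is
            # rolled back and a later redelivery can try again.
            with self.conn:
                self.conn.execute(
                    "INSERT INTO processed_events (order_id, event_type)"
                    " VALUES (?, ?)",
                    (event["orderId"], event["eventType"]),
                )
                action(event)
        except sqlite3.IntegrityError:
            # Primary key already present: a redelivery, so skip the action.
            return False
        return True
```

In a consumer callback, the message would be acknowledged whether `handle_once` returns True or False, since in both cases the event has been dealt with. The marker and the side effect are only atomic when the action writes to the same database; a call to an external payment provider would need that provider's own idempotency mechanism.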
The next concept you’ll likely wrestle with is distributed tracing to track requests across these many independent services.