The Saga pattern is a way to manage data consistency across microservices without resorting to the two-phase commit (2PC) protocol, which often becomes a bottleneck in distributed systems.
Let’s see the Saga pattern in action. Imagine an e-commerce system where placing an order involves several independent services: OrderService, PaymentService, and InventoryService.
Here’s a simplified flow:
- Client Request: A user requests to place an order.
- OrderService: Creates an `Order` with status `PENDING`. It then initiates the saga by sending a command to `PaymentService` to process payment.
- PaymentService: Processes the payment. If successful, it sends a `PaymentSuccess` event. If it fails, it sends a `PaymentFailed` event.
- InventoryService: Upon receiving `PaymentSuccess`, it reserves the items in the inventory. If successful, it sends an `InventoryReserved` event. If it fails (e.g., out of stock), it sends an `InventoryFailed` event.
- OrderService:
  - If `InventoryReserved` is received, it updates the `Order` status to `APPROVED` and sends a `ShipmentRequest` to a `ShippingService`.
  - If `PaymentFailed` or `InventoryFailed` is received, it initiates the compensating transaction.
A compensating transaction is the "undo" operation for a step that has already completed.
- If `PaymentFailed` occurs (the order was created but the payment did not go through), the `OrderService` needs to mark the order as `FAILED`. No compensation is needed for the payment itself, because no charge was made.
- If `InventoryFailed` occurs after payment was successful, the `OrderService` needs to mark the order as `FAILED` and, crucially, trigger a compensation in `PaymentService` to refund the payment.
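To make the OrderService's decisions in this flow concrete, here is a minimal Python sketch of it as a small state machine. It is an illustration under assumptions, not a reference implementation: the `Order` and `OrderService` classes, the `handle` method, and the `sent_commands` list (a stand-in for a message broker) are hypothetical, and events are passed as plain strings named after the prose.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    status: str = "PENDING"


@dataclass
class OrderService:
    # Hypothetical in-memory store and outbox; a real service would use a DB and a broker.
    orders: dict = field(default_factory=dict)
    sent_commands: list = field(default_factory=list)

    def place_order(self, order_id: str) -> Order:
        order = Order(order_id)
        self.orders[order_id] = order
        # Start the saga by asking PaymentService to charge the customer.
        self.sent_commands.append(("ProcessPayment", order_id))
        return order

    def handle(self, event_type: str, order_id: str) -> None:
        order = self.orders[order_id]
        if event_type == "InventoryReserved":
            order.status = "APPROVED"
            self.sent_commands.append(("ShipmentRequest", order_id))
        elif event_type == "PaymentFailed":
            # Nothing was charged, so there is nothing to compensate on the payment side.
            order.status = "FAILED"
        elif event_type == "InventoryFailed":
            # Payment already succeeded: mark the order failed and compensate with a refund.
            order.status = "FAILED"
            self.sent_commands.append(("RefundPayment", order_id))


svc = OrderService()
svc.place_order("o-1")
svc.place_order("o-2")
svc.handle("InventoryReserved", "o-1")
svc.handle("InventoryFailed", "o-2")
print({oid: o.status for oid, o in svc.orders.items()})
print(svc.sent_commands)
```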
This creates a sequence of local transactions. If any local transaction fails, the saga executes compensating transactions for the steps that already completed, typically in reverse order, to undo their work.
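The rule that completed steps are undone in reverse order can be shown with a small synchronous orchestrator. This sketch assumes each step is a hypothetical `(name, action, compensation)` tuple of plain Python callables and that a step signals failure by raising; a real saga exchanges asynchronous messages and persists its progress, which is omitted here.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation); each callable takes no arguments.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; if one raises, undo the completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            # Only steps that committed are compensated, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return False
        completed.append(step)
        log.append(f"{name} ok")
    return True


def out_of_stock() -> None:
    raise RuntimeError("Item out of stock")


log: List[str] = []
succeeded = run_saga(
    [
        ("CreateOrder", lambda: None, lambda: None),
        ("ProcessPayment", lambda: None, lambda: None),
        ("ReserveInventory", out_of_stock, lambda: None),
    ],
    log,
)
print(succeeded, log)
```

Note that `ReserveInventory` itself is not compensated: its local transaction never committed, so there is nothing to undo.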
Here’s how you might model this with an event-driven approach.
Order Service (OrderService)
```yaml
# OrderService Configuration (Conceptual)
saga:
  name: PlaceOrderSaga
  steps:
    - name: CreateOrder
      command: CreateOrderCommand
      event: OrderCreatedEvent
      compensating_command: CancelOrderCommand # Runs if a later step fails; marks the order FAILED
    - name: ProcessPayment
      command: ProcessPaymentCommand
      event: PaymentProcessedEvent
      compensating_command: RefundPaymentCommand
      depends_on: OrderCreatedEvent
    - name: ReserveInventory
      command: ReserveInventoryCommand
      event: InventoryReservedEvent
      compensating_command: ReleaseInventoryCommand
      depends_on: PaymentProcessedEvent
    - name: ApproveOrder
      event: OrderApprovedEvent
      depends_on: InventoryReservedEvent
```
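One way an orchestrator could interpret this configuration is to follow the `depends_on` links forward to pick the next step, and backward to find which compensating commands to send. The sketch below mirrors the YAML as Python dicts rather than parsing it, assumes the steps form a single linear chain, and uses hypothetical function names (`next_steps`, `compensation_plan`).

```python
from typing import Dict, List, Optional

# The PlaceOrderSaga steps, mirrored as Python dicts (a real service might parse the YAML).
SAGA_STEPS: List[Dict[str, Optional[str]]] = [
    {"name": "CreateOrder", "event": "OrderCreatedEvent",
     "compensating_command": "CancelOrderCommand", "depends_on": None},
    {"name": "ProcessPayment", "event": "PaymentProcessedEvent",
     "compensating_command": "RefundPaymentCommand", "depends_on": "OrderCreatedEvent"},
    {"name": "ReserveInventory", "event": "InventoryReservedEvent",
     "compensating_command": "ReleaseInventoryCommand", "depends_on": "PaymentProcessedEvent"},
    {"name": "ApproveOrder", "event": "OrderApprovedEvent",
     "compensating_command": None, "depends_on": "InventoryReservedEvent"},
]


def next_steps(steps, event: str) -> List[str]:
    """Names of the steps the orchestrator should start once `event` arrives."""
    return [s["name"] for s in steps if s["depends_on"] == event]


def compensation_plan(steps, failed_step: str) -> List[str]:
    """Compensating commands to send, newest first, when `failed_step` fails."""
    by_name = {s["name"]: s for s in steps}
    by_event = {s["event"]: s for s in steps}
    plan: List[str] = []
    current = by_name[failed_step]
    # Walk back along depends_on: every predecessor committed, so each one is undone.
    while current["depends_on"] is not None:
        current = by_event[current["depends_on"]]
        if current["compensating_command"]:
            plan.append(current["compensating_command"])
    return plan


print(next_steps(SAGA_STEPS, "PaymentProcessedEvent"))
print(compensation_plan(SAGA_STEPS, "ReserveInventory"))
```

With this reading, a payment failure yields only `CancelOrderCommand`, which matches the prose: the order is marked failed and there is no charge to refund.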
Payment Service (PaymentService)
```python
# PaymentService Logic (Conceptual)
# `on`, `publish`, `charge_customer`, `refund`, and the command/event classes
# are placeholders for your own messaging framework and payment gateway.
@on(ProcessPaymentCommand)
def handle_process_payment(command):
    try:
        payment_successful = charge_customer(command)  # hypothetical gateway call
        if payment_successful:
            publish(PaymentProcessedEvent(order_id=command.order_id, amount=command.amount))
        else:
            publish(PaymentFailedEvent(order_id=command.order_id, reason="Insufficient funds"))
    except Exception as e:
        publish(PaymentFailedEvent(order_id=command.order_id, reason=str(e)))


@on(RefundPaymentCommand)
def handle_refund_payment(command):
    try:
        refund(command.order_id)  # hypothetical gateway refund call
        publish(PaymentRefundedEvent(order_id=command.order_id))
    except Exception as e:
        # Log error, potentially retry or alert
        publish(RefundFailedEvent(order_id=command.order_id, reason=str(e)))
```
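The `on` and `publish` calls in the conceptual code are framework placeholders. A minimal in-memory stand-in shows how they might fit together for the payment path; `HANDLERS`, `PUBLISHED`, `dispatch`, the `customer_id` field, and the `BALANCES` fake gateway are all hypothetical, and events are recorded as `(name, payload)` tuples rather than event objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Type

# Hypothetical in-memory stand-ins for a messaging framework's `on` and `publish`.
HANDLERS: Dict[Type, Callable] = {}
PUBLISHED: List[Tuple[str, dict]] = []


def on(command_type: Type) -> Callable:
    """Register the decorated function as the handler for command_type."""
    def register(handler: Callable) -> Callable:
        HANDLERS[command_type] = handler
        return handler
    return register


def publish(event_name: str, **payload) -> None:
    PUBLISHED.append((event_name, payload))


def dispatch(command) -> None:
    """Route a command to the handler registered for its type."""
    HANDLERS[type(command)](command)


@dataclass(frozen=True)
class ProcessPaymentCommand:
    order_id: str
    customer_id: str  # hypothetical field: who to charge
    amount: int       # in cents, to avoid float rounding


BALANCES = {"alice": 5_000}  # hypothetical fake gateway: customer -> cents available


@on(ProcessPaymentCommand)
def handle_process_payment(command: ProcessPaymentCommand) -> None:
    try:
        available = BALANCES.get(command.customer_id, 0)
        if available >= command.amount:
            BALANCES[command.customer_id] = available - command.amount
            publish("PaymentProcessedEvent", order_id=command.order_id, amount=command.amount)
        else:
            publish("PaymentFailedEvent", order_id=command.order_id, reason="Insufficient funds")
    except Exception as e:
        publish("PaymentFailedEvent", order_id=command.order_id, reason=str(e))


dispatch(ProcessPaymentCommand("o-1", "alice", 3_000))
dispatch(ProcessPaymentCommand("o-2", "alice", 3_000))
print(PUBLISHED)
```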
Inventory Service (InventoryService)
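```python
# InventoryService Logic (Conceptual)
# `on`, `publish`, `reserve_items`, `release_items`, and the command/event classes
# are placeholders for your own messaging framework and storage code.
@on(ReserveInventoryCommand)
def handle_reserve_inventory(command):
    try:
        reservation_successful = reserve_items(command.items)  # hypothetical stock check
        if reservation_successful:
            publish(InventoryReservedEvent(order_id=command.order_id, items=command.items))
        else:
            publish(InventoryFailedEvent(order_id=command.order_id, reason="Item out of stock"))
    except Exception as e:
        publish(InventoryFailedEvent(order_id=command.order_id, reason=str(e)))


@on(ReleaseInventoryCommand)
def handle_release_inventory(command):
    try:
        release_items(command.order_id)  # hypothetical: return reserved items to stock
        publish(InventoryReleasedEvent(order_id=command.order_id))
    except Exception as e:
        # Log error, potentially retry or alert
        publish(ReleaseFailedEvent(order_id=command.order_id, reason=str(e)))
```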
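Reservation and release can be sketched the same way. This hypothetical `InventoryService` keeps stock counts in memory, checks every line item before changing anything so a partial shortage reserves nothing, and remembers each order's reservation so that `release` (the compensation, run if a later step fails) puts back exactly what was taken. Events are recorded as `(name, order_id, reason)` tuples for brevity.

```python
from typing import Dict, List, Optional, Tuple


class InventoryService:
    """Hypothetical in-memory inventory: `stock` is on-hand quantity per SKU."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self.reservations: Dict[str, Dict[str, int]] = {}  # order_id -> {sku: qty}
        self.published: List[Tuple[str, str, Optional[str]]] = []

    def reserve(self, order_id: str, items: Dict[str, int]) -> None:
        # Check every line first so the local transaction is all-or-nothing.
        short = [sku for sku, qty in items.items() if self.stock.get(sku, 0) < qty]
        if short:
            reason = "Item out of stock: " + ", ".join(short)
            self.published.append(("InventoryFailedEvent", order_id, reason))
            return
        for sku, qty in items.items():
            self.stock[sku] = self.stock.get(sku, 0) - qty
        self.reservations[order_id] = dict(items)
        self.published.append(("InventoryReservedEvent", order_id, None))

    def release(self, order_id: str) -> None:
        # pop() makes release idempotent: a repeated release finds nothing to return.
        for sku, qty in self.reservations.pop(order_id, {}).items():
            self.stock[sku] = self.stock.get(sku, 0) + qty
        self.published.append(("InventoryReleasedEvent", order_id, None))


inv = InventoryService({"book": 2, "pen": 0})
inv.reserve("o-1", {"book": 1})
inv.reserve("o-2", {"book": 1, "pen": 1})  # fails: no pens, and no books are taken
inv.release("o-1")
inv.release("o-1")  # duplicate compensation leaves stock unchanged
print(inv.stock, inv.published)
```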
The core idea is that each service performs a local transaction. If a service fails, it doesn’t roll back the entire distributed system; instead, it signals failure, and the orchestrator (or the services themselves, in a choreography-based saga) triggers compensating actions for the steps that did succeed.
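In a choreography-based saga there is no orchestrator: each service subscribes to the events it cares about and publishes its own. The sketch below wires the order flow that way on a synchronous in-memory bus. `EventBus`, `wire_services`, and the `OrderFailedEvent` that triggers the refund are assumptions made for illustration; a production system would use a message broker and asynchronous consumers.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class EventBus:
    """Synchronous in-memory bus (hypothetical); a real system would use a message broker."""

    def __init__(self) -> None:
        self.subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Callable[[str], None]) -> None:
        self.subscribers[event].append(handler)

    def publish(self, event: str, order_id: str) -> None:
        self.history.append(event)
        for handler in self.subscribers[event]:
            handler(order_id)


def wire_services(bus: EventBus, orders: Dict[str, str], in_stock: bool) -> None:
    """Each service reacts to the event before it; nobody coordinates the whole flow."""

    def payment_service(order_id: str) -> None:
        bus.publish("PaymentProcessedEvent", order_id)

    def inventory_service(order_id: str) -> None:
        bus.publish("InventoryReservedEvent" if in_stock else "InventoryFailedEvent", order_id)

    def order_approved(order_id: str) -> None:
        orders[order_id] = "APPROVED"

    def order_failed(order_id: str) -> None:
        orders[order_id] = "FAILED"
        bus.publish("OrderFailedEvent", order_id)  # hypothetical event PaymentService reacts to

    def payment_refund(order_id: str) -> None:
        bus.publish("PaymentRefundedEvent", order_id)  # compensation for the charge

    bus.subscribe("OrderCreatedEvent", payment_service)
    bus.subscribe("PaymentProcessedEvent", inventory_service)
    bus.subscribe("InventoryReservedEvent", order_approved)
    bus.subscribe("InventoryFailedEvent", order_failed)
    bus.subscribe("OrderFailedEvent", payment_refund)


bus = EventBus()
orders = {"o-1": "PENDING"}
wire_services(bus, orders, in_stock=False)
bus.publish("OrderCreatedEvent", "o-1")
print(orders, bus.history)
```

The trade-off is visibility: the flow is spread across subscriptions, so no single component knows the saga's overall state, which is what an orchestrator provides.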
A common pitfall is misunderstanding compensation. A compensating transaction must be idempotent (safe to apply more than once, because a command may be redelivered) and retryable until it eventually succeeds. If a refund fails for good, the saga is left partially completed and uncompensated: the customer has been charged for an order that will never ship. This is why robust error handling, retry mechanisms, and manual intervention processes are critical for compensating actions.
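Idempotency for the refund can be as simple as recording which orders have already been refunded and treating a repeat as a no-op. In this hypothetical `RefundHandler`, the in-memory `refunded` set stands in for durable storage, and the check and the refund are not atomic; a real service would need to make them atomic, for example with a unique constraint or an idempotency key accepted by the payment provider.

```python
from typing import Dict, List, Set


class RefundHandler:
    """Hypothetical refund handler that is safe to run more than once for the same order."""

    def __init__(self, charges: Dict[str, int]) -> None:
        self.charges = charges              # order_id -> cents actually charged
        self.refunded: Set[str] = set()     # would be a durable table in a real service
        self.gateway_calls: List[str] = []  # stands in for calls to a payment provider

    def handle_refund(self, order_id: str) -> str:
        needs_refund = order_id in self.charges and order_id not in self.refunded
        if needs_refund:
            self.gateway_calls.append(order_id)  # the only step that moves money
            self.refunded.add(order_id)
        # Duplicates and never-charged orders still report success, so retries are harmless.
        return "PaymentRefundedEvent"


handler = RefundHandler({"o-1": 3000})
results = [handler.handle_refund("o-1"), handler.handle_refund("o-1"), handler.handle_refund("o-9")]
print(results, handler.gateway_calls)
```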
The next challenge you’ll face is handling the failure of a compensating transaction itself, which often requires a separate, perhaps manual, resolution process.
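A common shape for this is to retry a compensation with exponential backoff and, once retries are exhausted, park the case for manual handling. The sketch assumes the compensation is idempotent (so retrying is safe) and signals failure by raising; `run_compensation`, its parameters, and the `manual_queue` list (standing in for a dead-letter queue or ticketing system) are hypothetical.

```python
import time
from typing import Callable, List


def run_compensation(compensate: Callable[[], None], order_id: str,
                     manual_queue: List[str], max_attempts: int = 3,
                     base_delay: float = 0.01) -> bool:
    """Retry an idempotent compensation; park the order for humans if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate()
            return True
        except Exception:
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    manual_queue.append(order_id)  # e.g. a dead-letter queue or support ticket
    return False


attempts = {"count": 0}


def flaky_refund() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("payment gateway timeout")


def broken_refund() -> None:
    raise ConnectionError("payment gateway down")


manual_queue: List[str] = []
first = run_compensation(flaky_refund, "o-1", manual_queue)
second = run_compensation(broken_refund, "o-2", manual_queue)
print(first, second, manual_queue)
```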