Distributed transactions are a fundamental challenge when your data is spread across multiple independent services or databases (shards). The core problem is ensuring atomicity: either all parts of a transaction succeed, or all fail. If one part succeeds and another fails, you’re left with inconsistent data, a nightmare to untangle.
Here’s how it looks in practice. Imagine an e-commerce system where placing an order involves three steps:
- Order Service: Creates the order record.
- Inventory Service: Decrements the stock count.
- Payment Service: Charges the customer.
If the Order Service creates the order, but the Inventory Service fails to decrement stock (maybe it’s down), you have an order without inventory. Or worse, the Inventory Service succeeds, but the Payment Service fails – you’ve promised goods you can’t deliver.
Two-Phase Commit (2PC)
The classic solution for distributed atomicity is Two-Phase Commit (2PC). It’s a synchronous, blocking protocol involving a Transaction Coordinator and multiple Participants (your services/databases).
Phase 1: Prepare
- The Coordinator sends a
PREPARErequest to all Participants. - Each Participant checks if it can commit its part of the transaction. If yes, it locks the necessary resources and responds with
YES. If no, it responds withNOand releases any temporary locks.
Phase 2: Commit/Rollback
- If the Coordinator receives
YESfrom all Participants, it sends aCOMMITrequest to everyone. Participants then make their changes permanent and release locks. - If any Participant responds with
NO, or if the Coordinator times out waiting for a response, it sends aROLLBACKrequest to all Participants. Participants then undo any tentative changes and release locks.
Example Scenario (Success):
- Coordinator: "PREPARE Order Service" -> Order Service: "YES (holds order ID 123)"
- Coordinator: "PREPARE Inventory Service" -> Inventory Service: "YES (holds item SKU-ABC)"
- Coordinator: "PREPARE Payment Service" -> Payment Service: "YES (holds payment token XYZ)"
- Coordinator: "COMMIT Order Service" -> Order Service: "Committing…"
- Coordinator: "COMMIT Inventory Service" -> Inventory Service: "Committing…"
- Coordinator: "COMMIT Payment Service" -> Payment Service: "Committing…"
Example Scenario (Failure):
- Coordinator: "PREPARE Order Service" -> Order Service: "YES (holds order ID 123)"
- Coordinator: "PREPARE Inventory Service" -> Inventory Service: "NO (out of stock)"
- Coordinator: "ROLLBACK Order Service" -> Order Service: "Rolling back…" (order ID 123 is deleted)
Why 2PC is Problematic:
- Blocking: Participants must hold locks from the
PREPAREphase until they receive aCOMMITorROLLBACKinstruction. If the Coordinator fails permanently, these locks can remain indefinitely, causing deadlocks and blocking other operations. - Performance Overhead: The synchronous nature and multiple network round trips add latency.
- Single Point of Failure: The Coordinator itself is a critical component. If it fails, the entire transaction state is uncertain.
Sagas
Sagas offer an alternative to 2PC for achieving eventual consistency across distributed systems, prioritizing availability over immediate atomicity. A saga is a sequence of local transactions. Each local transaction updates data within a single service and then triggers the next local transaction in the saga. If a local transaction fails, the saga executes a series of compensating transactions to undo the work done by preceding local transactions.
Key Concepts:
- Local Transactions: Each step in the saga is a standard, atomic transaction within its own service.
- Compensating Transactions: For every "forward" operation, there must be a corresponding "backward" or compensating operation that can undo the effect of the forward operation. These are not rollbacks; they are explicit undo actions.
- Orchestration vs. Choreography:
- Orchestration: A central orchestrator (like a workflow engine) manages the saga, telling each participant what to do next and when to compensate.
- Choreography: Participants react to events published by other participants, deciding their next action or compensation themselves.
Example Scenario (Saga Orchestration - Order Placement):
- Order Service (Orchestrator): Starts saga, creates order (local transaction), publishes
OrderCreatedevent. - Inventory Service: Listens for
OrderCreated, reserves stock (local transaction), publishesStockReservedevent. - Payment Service: Listens for
StockReserved, processes payment (local transaction), publishesPaymentProcessedevent. - Order Service: Listens for
PaymentProcessed, marks order as complete.
Compensation Scenario (Payment Fails):
- Order Service: Creates order (
OrderCreated). - Inventory Service: Reserves stock (
StockReserved). - Payment Service: Fails to process payment (local transaction fails). Publishes
PaymentFailedevent. - Inventory Service: Listens for
PaymentFailed, compensates by releasing reserved stock (local compensating transaction). PublishesStockReleasedevent. - Order Service: Listens for
PaymentFailedorStockReleased, compensates by cancelling the order (local compensating transaction).
Why Sagas are Often Preferred:
- No Blocking Locks: Participants only hold locks for the duration of their local transactions, greatly improving availability and reducing deadlocks.
- Higher Availability: The system can continue processing other requests even if one part of a saga is temporarily unavailable, as compensations will eventually be triggered.
- Scalability: Less coordination overhead compared to 2PC.
The Counter-Intuitive Truth About Sagas:
The true power and complexity of sagas lie not in the forward path, but in the design and implementation of the compensating transactions. A compensating transaction must be idempotent – running it multiple times should have the same effect as running it once. This is crucial because network issues or retries might cause a compensating action to be invoked more than once. For instance, if a ReleaseStock compensation is triggered twice, you don’t want to release stock twice; you want to ensure it’s released exactly once, and subsequent attempts do nothing. This idempotency often requires careful state management within the compensating service itself, perhaps using unique correlation IDs or version numbers.
The next challenge you’ll face is managing the complexity of saga orchestration or choreography, especially as the number of steps and potential failure points grows.