The most surprising thing about Sagas when paired with Event Sourcing is that the "state" of a Saga isn’t stored in a database record waiting for updates; it’s derived entirely from the immutable sequence of events that have occurred.
Let’s see this in action. Imagine a simple order placement Saga.
// Event 1: OrderCreated
{
  "type": "OrderCreated",
  "orderId": "ORD123",
  "customerId": "CUST456",
  "items": [
    {"productId": "PROD789", "quantity": 2}
  ],
  "timestamp": "2023-10-27T10:00:00Z"
}
// Event 2: PaymentAuthorized
{
  "type": "PaymentAuthorized",
  "orderId": "ORD123",
  "paymentId": "PAY001",
  "amount": 50.00,
  "timestamp": "2023-10-27T10:01:00Z"
}
// Event 3: InventoryReserved
{
  "type": "InventoryReserved",
  "orderId": "ORD123",
  "reservationId": "INV555",
  "timestamp": "2023-10-27T10:02:00Z"
}
The Saga orchestrator, when asked "What’s the current state of ORD123?", doesn’t query a saga_states table. Instead, it fetches all events related to ORD123 from the event store and "replays" them.
// Replay Process:
// 1. Start with initial state: { orderId: "ORD123", status: "NEW", ... }
// 2. Apply OrderCreated: { orderId: "ORD123", status: "PENDING_PAYMENT", customerId: "CUST456", ... }
// 3. Apply PaymentAuthorized: { orderId: "ORD123", status: "PENDING_INVENTORY", customerId: "CUST456", paymentId: "PAY001", ... }
// 4. Apply InventoryReserved: { orderId: "ORD123", status: "COMPLETED", customerId: "CUST456", paymentId: "PAY001", reservationId: "INV555", ... }
// Final Reconstructed State:
{
  "orderId": "ORD123",
  "status": "COMPLETED",
  "customerId": "CUST456",
  "paymentId": "PAY001",
  "reservationId": "INV555",
  "items": [
    {"productId": "PROD789", "quantity": 2}
  ]
}
This replay mechanism is the core of how Event Sourcing powers Sagas. The "state" is the projection of all historical decisions and actions.
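The replay above can be sketched as a fold over the event list. This is a minimal illustration, not a reference implementation: the `TRANSITIONS` table and the `apply_event` and `replay` names are assumptions introduced here, and the events are passed in as a plain list rather than fetched from a real event store.

```python
from functools import reduce

# Hypothetical transition table (an assumption for this sketch):
# event type -> (status after the event, payload fields copied into the state).
TRANSITIONS = {
    "OrderCreated": ("PENDING_PAYMENT", ("customerId", "items")),
    "PaymentAuthorized": ("PENDING_INVENTORY", ("paymentId",)),
    "InventoryReserved": ("COMPLETED", ("reservationId",)),
}


def apply_event(state, event):
    """Fold one event into the state and return a new dict (no mutation)."""
    if event["type"] not in TRANSITIONS:
        return state  # events this saga does not recognise leave the state unchanged
    status, fields = TRANSITIONS[event["type"]]
    copied = {name: event[name] for name in fields}
    return {**state, **copied, "status": status}


def replay(order_id, events):
    """Rebuild the saga state for one order from its ordered event list."""
    initial = {"orderId": order_id, "status": "NEW"}
    return reduce(apply_event, events, initial)


events = [
    {"type": "OrderCreated", "orderId": "ORD123", "customerId": "CUST456",
     "items": [{"productId": "PROD789", "quantity": 2}],
     "timestamp": "2023-10-27T10:00:00Z"},
    {"type": "PaymentAuthorized", "orderId": "ORD123", "paymentId": "PAY001",
     "amount": 50.00, "timestamp": "2023-10-27T10:01:00Z"},
    {"type": "InventoryReserved", "orderId": "ORD123", "reservationId": "INV555",
     "timestamp": "2023-10-27T10:02:00Z"},
]

state = replay("ORD123", events)
print(state["status"])  # COMPLETED
```

Because `apply_event` is a pure function, replaying the same events always yields the same state, which is what makes the reconstruction trustworthy.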
The problem this solves is the inherent complexity of distributed transactions and long-running business processes. Instead of trying to coordinate multiple services with two-phase commit (which holds locks across services and blocks participants if the coordinator fails), a Saga uses a sequence of local transactions. Each local transaction publishes an event. The Saga orchestrator listens for these events and triggers the next step. If a step fails, the orchestrator triggers compensating actions that undo the already completed steps, typically in reverse order, and each compensation is itself recorded as an event.
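To make the "listen, then trigger the next step or compensate" loop concrete, here is a small sketch of the decision function an orchestrator might use. The command names (`AuthorizePayment`, `RefundPayment`, and so on) and the failure event types (`PaymentDeclined`, `InventoryUnavailable`) are hypothetical; the article only defines the three success events.

```python
# Hypothetical routing table: success event -> command that starts the next step.
NEXT_COMMAND = {
    "OrderCreated": "AuthorizePayment",
    "PaymentAuthorized": "ReserveInventory",
}

# Hypothetical compensation table: success event -> command that undoes that step.
COMPENSATION = {
    "OrderCreated": "CancelOrder",
    "PaymentAuthorized": "RefundPayment",
    "InventoryReserved": "ReleaseInventory",
}

# Hypothetical failure events (not defined in the article).
FAILURES = {"PaymentDeclined", "InventoryUnavailable"}


def decide(history, event):
    """Return the commands to issue after `event`.

    `history` holds the earlier events of this saga, oldest first.
    """
    if event["type"] in FAILURES:
        # Undo every completed step, newest first.
        return [
            COMPENSATION[past["type"]]
            for past in reversed(history)
            if past["type"] in COMPENSATION
        ]
    command = NEXT_COMMAND.get(event["type"])
    return [command] if command else []


history = [{"type": "OrderCreated"}, {"type": "PaymentAuthorized"}]
print(decide(history, {"type": "InventoryUnavailable"}))
# ['RefundPayment', 'CancelOrder']
```

Keeping `decide` free of side effects means the same function can be used both when handling live events and when replaying history to check what the orchestrator would have done.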
The key levers you control are:
- Event Definitions: What information is captured in each event? This dictates what data is available for state reconstruction and compensation.
- Saga Logic: The rules for transitioning between states based on incoming events. This is often implemented as a state machine that reacts to events.
- Compensation Logic: For each successful step, what is the compensating action? This is crucial for handling failures.
- Event Store: The reliable, append-only log where all events are stored.
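As a rough illustration of the last lever, the sketch below models an append-only event store in memory. The class name, the `expected_version` check (a common optimistic-concurrency guard), and the per-stream layout are all assumptions for this example; a production event store would add durability, ordering guarantees, and subscriptions.

```python
class ConcurrencyError(Exception):
    """Raised when a writer's view of the stream is out of date."""


class InMemoryEventStore:
    """Toy append-only store: one ordered list of events per stream id."""

    def __init__(self):
        self._streams = {}

    def append(self, stream_id, event, expected_version):
        """Append `event` if the stream currently holds `expected_version` events."""
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"{stream_id}: expected version {expected_version}, found {len(stream)}"
            )
        stream.append(dict(event))  # store a copy so callers cannot edit the log
        return len(stream)  # new version of the stream

    def read(self, stream_id, from_version=0):
        """Return copies of the events from `from_version` onward, oldest first."""
        return [dict(e) for e in self._streams.get(stream_id, [])[from_version:]]


store = InMemoryEventStore()
store.append("ORD123", {"type": "OrderCreated", "orderId": "ORD123"}, expected_version=0)
store.append("ORD123", {"type": "PaymentAuthorized", "orderId": "ORD123"}, expected_version=1)
print([e["type"] for e in store.read("ORD123")])
# ['OrderCreated', 'PaymentAuthorized']
```

There is deliberately no update or delete method: corrections are expressed as new events, which is what keeps the log a reliable source for replay.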
The mechanics of rebuilding state are surprisingly simple but have profound implications for resilience. When your Saga orchestrator service restarts, it doesn’t need to load state from a traditional database. It simply restarts its event subscription, replays all relevant events from the event store, and reconstructs its in-memory state. This makes state recovery deterministic and consistent, as the state is derived directly from the immutable truth of the event log. Recovery is not free, though: replay time grows with the number of events, so long-lived streams are commonly paired with periodic snapshots. It also means that if you need to "rewind" a process, you can inspect an earlier state by replaying events up to a certain point, or change course by appending a new "correction" event that triggers a different compensation path. Replay itself never undoes side effects that have already happened in other services; only compensation does.
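One common way to keep restart time bounded is to resume from a snapshot and replay only the events recorded after it. The sketch below is an assumption layered on top of the article: the `version` counter stored in the state and the `rebuild` helper are hypothetical names, and the event list stands in for a read from the event store.

```python
# Hypothetical status transitions for the order saga.
STATUS_AFTER = {
    "OrderCreated": "PENDING_PAYMENT",
    "PaymentAuthorized": "PENDING_INVENTORY",
    "InventoryReserved": "COMPLETED",
}


def apply_event(state, event):
    """Return a new state with the event applied and the version advanced by one."""
    return {
        "status": STATUS_AFTER.get(event["type"], state["status"]),
        "version": state["version"] + 1,
    }


def rebuild(events, snapshot=None):
    """Rebuild state from `events`, skipping those already covered by `snapshot`.

    `snapshot` is a previously computed state; its "version" field records
    how many events it already includes.
    """
    state = dict(snapshot) if snapshot else {"status": "NEW", "version": 0}
    for event in events[state["version"]:]:
        state = apply_event(state, event)
    return state


events = [
    {"type": "OrderCreated"},
    {"type": "PaymentAuthorized"},
    {"type": "InventoryReserved"},
]

snapshot = rebuild(events[:2])              # taken before the restart
recovered = rebuild(events, snapshot)       # replays only the third event
print(recovered == rebuild(events))         # True
```

The snapshot is only a cache: it can be discarded and recomputed from the log at any time, so the event log remains the single source of truth.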
What most people don’t fully grasp is the power of this for auditing and debugging. If you need to understand why an order is in a certain state, you have the entire history of decisions and actions that led to it, directly queryable from the event store. You can replay events for a specific order ID to see its journey, or even replay all events up to a certain timestamp to get a snapshot of the entire system at a past moment.
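A point-in-time audit query can be sketched as a replay that stops at a cutoff timestamp and records each intermediate status. The `audit_trail` function and the `STATUS_AFTER` table are hypothetical names for this illustration, and the events are assumed to arrive ordered by timestamp.

```python
from datetime import datetime

# Hypothetical status transitions for the order saga.
STATUS_AFTER = {
    "OrderCreated": "PENDING_PAYMENT",
    "PaymentAuthorized": "PENDING_INVENTORY",
    "InventoryReserved": "COMPLETED",
}


def parse_ts(value):
    """Parse an ISO 8601 timestamp; a trailing "Z" is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def audit_trail(events, as_of):
    """List (timestamp, event type, resulting status) for events up to `as_of`."""
    cutoff = parse_ts(as_of)
    status = "NEW"
    trail = []
    for event in events:
        if parse_ts(event["timestamp"]) > cutoff:
            break  # events are assumed to be ordered by timestamp
        status = STATUS_AFTER.get(event["type"], status)
        trail.append((event["timestamp"], event["type"], status))
    return trail


events = [
    {"type": "OrderCreated", "timestamp": "2023-10-27T10:00:00Z"},
    {"type": "PaymentAuthorized", "timestamp": "2023-10-27T10:01:00Z"},
    {"type": "InventoryReserved", "timestamp": "2023-10-27T10:02:00Z"},
]

for row in audit_trail(events, "2023-10-27T10:01:30Z"):
    print(row)
# ('2023-10-27T10:00:00Z', 'OrderCreated', 'PENDING_PAYMENT')
# ('2023-10-27T10:01:00Z', 'PaymentAuthorized', 'PENDING_INVENTORY')
```

The same function answers both "how did this order get here?" (read the whole trail) and "what did it look like at 10:01:30?" (read the last row).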
The next conceptual hurdle you’ll encounter is managing eventual consistency across different projections of your event stream.