A saga step’s compensation logic doesn’t run automatically when the step fails; it’s explicitly invoked by the orchestrator.
Let’s watch a simple saga in action. Imagine we’re processing an order. We have three steps: "Create Order," "Process Payment," and "Ship Order." If "Process Payment" fails, the orchestrator needs to tell "Create Order" to undo what it did.
{
  "id": "order-saga-123",
  "state": "PROCESSING",
  "steps": [
    {
      "name": "createOrder",
      "status": "COMPLETED",
      "compensation": "cancelOrder"
    },
    {
      "name": "processPayment",
      "status": "FAILED",
      "compensation": "refundPayment"
    },
    {
      "name": "shipOrder",
      "status": "PENDING",
      "compensation": "cancelShipment"
    }
  ],
  "currentStep": "processPayment"
}
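When processPayment fails, the saga orchestrator inspects the steps array and sees that processPayment has status: "FAILED". It then reads the compensation field of each step that completed before the failure. Here that is only createOrder, whose compensation is cancelOrder. The failed step's own compensation, refundPayment, is not invoked, because processPayment never completed and there is nothing to refund; shipOrder is still PENDING, so it has nothing to undo either. The orchestrator then sends a command to the createOrder service (or its associated compensation handler) to execute cancelOrder.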
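The lookup described above can be sketched in a few lines of Kotlin. This is a minimal illustration, not the orchestrator's actual implementation: the SagaStep type, the StepStatus enum, and the compensationsToRun function are hypothetical names that simply mirror the fields of the JSON document.

```kotlin
enum class StepStatus { PENDING, COMPLETED, FAILED }

// Mirrors one entry of the "steps" array in the saga document (hypothetical type).
data class SagaStep(val name: String, val status: StepStatus, val compensation: String)

// Returns the compensations to invoke, most recently completed step first.
// Steps that failed or never started are skipped: they have nothing to undo.
fun compensationsToRun(steps: List<SagaStep>): List<String> {
    val failedIndex = steps.indexOfFirst { it.status == StepStatus.FAILED }
    if (failedIndex == -1) return emptyList() // no failure, no compensation
    return steps.take(failedIndex)
        .filter { it.status == StepStatus.COMPLETED }
        .map { it.compensation }
        .reversed()
}

// The order saga from the JSON example.
val orderSaga = listOf(
    SagaStep("createOrder", StepStatus.COMPLETED, "cancelOrder"),
    SagaStep("processPayment", StepStatus.FAILED, "refundPayment"),
    SagaStep("shipOrder", StepStatus.PENDING, "cancelShipment")
)
```

For the order saga, compensationsToRun(orderSaga) yields a single entry, cancelOrder, which is the one command the orchestrator sends.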
The core problem sagas solve is coordinating a business transaction that spans several services, where a single ACID transaction across all of them is impractical. Instead of one atomic commit or rollback, a saga uses a sequence of local transactions, each with a corresponding compensating transaction. If any step fails, the saga executes the compensating transactions for all preceding successful steps in reverse order.
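To make the reverse-order rule concrete, here is a small in-process sketch. It assumes each step is just a pair of lambdas; LocalStep and runSaga are hypothetical names, and a real saga would dispatch these calls to separate services rather than run them in one process.

```kotlin
// One saga step: a local transaction plus the transaction that undoes it.
class LocalStep(
    val name: String,
    val action: () -> Unit,
    val compensate: () -> Unit
)

// Runs the steps in order. On the first failure, compensates every step that
// already completed, newest first. Returns true if all steps committed.
fun runSaga(steps: List<LocalStep>): Boolean {
    val completed = mutableListOf<LocalStep>()
    for (step in steps) {
        try {
            step.action()
            completed.add(step)
        } catch (e: Exception) {
            // The failed step itself is not compensated: it never committed.
            completed.asReversed().forEach { it.compensate() }
            return false
        }
    }
    return true
}
```

Note that this sketch assumes compensations themselves succeed; it makes no attempt to handle a compensation that throws.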
The surprising part is how events tie into this. Compensating actions are often triggered by events. When processPayment fails, it might emit a PaymentFailedEvent. The saga orchestrator (or a dedicated event handler listening for PaymentFailedEvent) consumes this event and initiates the compensation sequence. Similarly, when cancelOrder successfully completes, it emits an OrderCancelledEvent, which the orchestrator might listen for to know it can proceed to the next compensation or declare the saga finished.
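A rough sketch of that event flow might look like the following. The event and command names come from the text above, but their fields, the EventDrivenOrchestrator class, and the status strings are assumptions made for illustration. The in-memory map stands in for whatever saga store a real system would use.

```kotlin
sealed interface SagaEvent { val sagaId: String }
data class PaymentFailedEvent(override val sagaId: String, val orderId: String) : SagaEvent
data class OrderCancelledEvent(override val sagaId: String) : SagaEvent

data class CancelOrderCommand(val sagaId: String)

// Reacts to events by sending compensation commands and tracking saga status.
class EventDrivenOrchestrator(private val send: (CancelOrderCommand) -> Unit) {
    // sagaId -> coarse status; in-memory only for this sketch
    val status = mutableMapOf<String, String>()

    fun on(event: SagaEvent) {
        when (event) {
            is PaymentFailedEvent -> {
                // The failure event starts the compensation sequence.
                status[event.sagaId] = "COMPENSATING"
                send(CancelOrderCommand(event.sagaId))
            }
            is OrderCancelledEvent -> {
                // The only compensation in this saga has finished.
                status[event.sagaId] = "COMPENSATED"
            }
        }
    }
}
```

Because the event carries the saga ID, the orchestrator can correlate the failure with the right saga instance without holding any reference to the service that failed.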
Here’s how you might exercise this in a unit test using mocks. Let’s say you’re testing the orchestrator’s logic when processPayment fails.
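import io.mockk.Runs
import io.mockk.every
import io.mockk.just
import io.mockk.mockk
import io.mockk.verify
import org.junit.jupiter.api.Test // assumes JUnit 5

@Test
fun `orchestrator initiates compensation when payment step fails`() {
    val sagaState = SagaState(
        id = "saga-456",
        steps = listOf(
            Step(name = "createOrder", status = Status.COMPLETED, compensation = "cancelOrder"),
            Step(name = "processPayment", status = Status.FAILED, compensation = "refundPayment")
        ),
        currentStep = "processPayment"
    )

    // Mock the command publisher to verify cancelOrder is called
    val commandPublisher = mockk<CommandPublisher>()
    every { commandPublisher.publish(any<CancelOrderCommand>()) } just Runs

    // Assumption: the orchestrator loads saga state through a store. SagaStore and
    // load() are hypothetical names, added so that sagaState actually reaches it.
    val sagaStore = mockk<SagaStore>()
    every { sagaStore.load("saga-456") } returns sagaState

    val orchestrator = SagaOrchestrator(commandPublisher, sagaStore)

    // Simulate the failure event that triggers compensation
    orchestrator.handlePaymentFailedEvent(PaymentFailedEvent("saga-456", "order-789"))

    // Assert that the cancelOrder command was published exactly once
    verify(exactly = 1) { commandPublisher.publish(CancelOrderCommand("saga-456")) }
}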
In this test, we set up a SagaState where the createOrder step is COMPLETED and processPayment is FAILED. We then mock a CommandPublisher to capture outgoing commands. When we simulate a PaymentFailedEvent arriving (which signals the failure of the processPayment step), we expect the orchestrator to publish a CancelOrderCommand for the createOrder step. The every { ... } just Runs stub tells MockK to accept the call and do nothing, and verify checks that the call happened exactly once. Because verify compares the argument with equals, CancelOrderCommand needs value equality (a data class, for instance) for CancelOrderCommand("saga-456") to match. Checking the payload confirms that the right compensation is invoked for the right saga instance.
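If you would rather not depend on a mocking library, the same check can be made with a hand-written fake. The sketch below is self-contained, so it defines its own minimal stand-ins: the CommandPublisher interface shape, the publish(command: Any) signature, and this stripped-down SagaOrchestrator are all assumptions for illustration, not the real classes under test.

```kotlin
interface CommandPublisher {
    fun publish(command: Any)
}

data class CancelOrderCommand(val sagaId: String)
data class PaymentFailedEvent(val sagaId: String, val orderId: String)

// A fake that records every published command so a test can inspect them.
class RecordingPublisher : CommandPublisher {
    val published = mutableListOf<Any>()
    override fun publish(command: Any) {
        published.add(command)
    }
}

// Minimal stand-in for the orchestrator: payment failure triggers cancelOrder.
class SagaOrchestrator(private val publisher: CommandPublisher) {
    fun handlePaymentFailedEvent(event: PaymentFailedEvent) {
        publisher.publish(CancelOrderCommand(event.sagaId))
    }
}

// Returns the commands published in response to a single payment failure.
fun commandsAfterPaymentFailure(sagaId: String, orderId: String): List<Any> {
    val publisher = RecordingPublisher()
    SagaOrchestrator(publisher).handlePaymentFailedEvent(PaymentFailedEvent(sagaId, orderId))
    return publisher.published
}
```

A test can then compare commandsAfterPaymentFailure("saga-456", "order-789") against a list containing CancelOrderCommand("saga-456"). As with the MockK version, that comparison relies on the command being a data class.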
The key is that the orchestrator doesn’t just magically know to call cancelOrder. It needs to be explicitly programmed to look at the compensation field of the preceding completed step when a subsequent step fails. The event handling mechanism is the glue that tells the orchestrator when a step has failed and which saga instance it pertains to.
The most nuanced aspect is often handling concurrent failures or compensation failures. If a compensating transaction itself fails, the saga enters a "failed compensation" state. In robust systems, this might trigger manual intervention, further automated retry mechanisms, or sophisticated state management to avoid data corruption. It’s not just about undoing; it’s about reliably undoing, even when undoing fails.
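One simple way to approach compensation failures is a bounded retry that ends in an explicit failed state. The sketch below assumes the compensation is idempotent, since it may run more than once. The names CompensationOutcome and compensateWithRetry are hypothetical, and a production version would likely add backoff, logging, and an alert for the failed state.

```kotlin
enum class CompensationOutcome { COMPENSATED, COMPENSATION_FAILED }

// Tries a compensating transaction up to maxAttempts times.
// The compensation must be idempotent, because a retry may repeat work
// that partially succeeded on an earlier attempt.
fun compensateWithRetry(maxAttempts: Int, compensate: () -> Unit): CompensationOutcome {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    repeat(maxAttempts) {
        try {
            compensate()
            return CompensationOutcome.COMPENSATED
        } catch (e: Exception) {
            // Swallowed for brevity; fall through to the next attempt.
        }
    }
    // Every attempt failed: surface this so an operator or another process can step in.
    return CompensationOutcome.COMPENSATION_FAILED
}
```

Returning COMPENSATION_FAILED instead of throwing keeps the saga's state explicit, which makes it possible to persist that state and hand it off for manual intervention.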
The next logical step is understanding how to handle saga state persistence to survive application restarts.