Chaos testing is when you intentionally break things to see how your system reacts, and simulating saga failures is a specific flavor of that for distributed transaction workflows.
Here’s a saga workflow definition that orchestrates a simple order placement:

    apiVersion: actions.github.com/v1
    kind: Workflow
    name: OrderPlacementSaga
    spec:
      steps:
        - name: CreateOrder
          uses: ./actions/create-order
          onSuccess:
            - ProcessPayment
          onFailure:
            - CompensateOrderCreation
        - name: ProcessPayment
          uses: ./actions/process-payment
          onSuccess:
            - ShipOrder
          onFailure:
            - CompensatePayment
            - CompensateOrderCreation
        - name: ShipOrder
          uses: ./actions/ship-order
          onSuccess:
            - NotifyCustomer
          onFailure:
            - CompensateShipment
            - CompensatePayment
            - CompensateOrderCreation
        - name: NotifyCustomer
          uses: ./actions/notify-customer
          onSuccess:
            - SagaComplete
          onFailure:
            - CompensateNotification
            - CompensateShipment
            - CompensatePayment
            - CompensateOrderCreation
        - name: CompensateOrderCreation
          uses: ./actions/compensate-order-creation
          onSuccess:
            - SagaFailed
          onFailure:
            - LogCompensationFailure
        - name: CompensatePayment
          uses: ./actions/compensate-payment
          onSuccess:
            - SagaFailed
          onFailure:
            - LogCompensationFailure
        - name: CompensateShipment
          uses: ./actions/compensate-shipment
          onSuccess:
            - SagaFailed
          onFailure:
            - LogCompensationFailure
        - name: CompensateNotification
          uses: ./actions/compensate-notification
          onSuccess:
            - SagaFailed
          onFailure:
            - LogCompensationFailure
        - name: SagaComplete
          uses: ./actions/saga-complete
        - name: SagaFailed
          uses: ./actions/saga-failed
        - name: LogCompensationFailure
          uses: ./actions/log-compensation-failure

This workflow defines a sequence of actions for placing an order. If any step in the primary flow (CreateOrder, ProcessPayment, ShipOrder, NotifyCustomer) fails, the workflow runs the compensation steps listed in that step’s onFailure handler (CompensateOrderCreation, CompensatePayment, and so on) to undo the work done so far, so the order either completes in full or is rolled back.
The surprising truth is that sagas don’t guarantee transactional consistency in the ACID sense; they achieve eventual consistency by managing failures through explicit compensation logic. The system might be in an inconsistent state for a brief period between a failure and its compensation, but it will eventually reach a consistent state.
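To make that window of inconsistency concrete, here is a minimal Python sketch of an in-memory orchestrator. It is only an illustration under stated assumptions: `Step`, `run_saga`, the `state` dictionary, and the status strings are hypothetical names invented for this example, not part of the workflow format above. Like the workflow's onFailure lists, it compensates the failed step and then every earlier step in reverse order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, str]


@dataclass
class Step:
    """One saga step: a forward action plus the compensation that undoes it."""
    name: str
    action: Callable[[State], None]
    compensate: Callable[[State], None]


def run_saga(steps: List[Step], state: State) -> Tuple[bool, Optional[State]]:
    """Run steps in order; on failure, compensate the failed step and all earlier ones.

    Returns (succeeded, snapshot). The snapshot is a copy of the state at the
    moment of failure, before any compensation ran (None on success).
    """
    for index, step in enumerate(steps):
        try:
            step.action(state)
        except Exception:
            snapshot = dict(state)  # the temporarily inconsistent state
            for touched in reversed(steps[: index + 1]):
                touched.compensate(state)
            return False, snapshot
    return True, None


def create_order(state: State) -> None:
    state["order"] = "CREATED"


def cancel_order(state: State) -> None:
    state["order"] = "CANCELED"


def failing_payment(state: State) -> None:
    # Injected fault standing in for a real payment outage.
    raise RuntimeError("injected payment failure")


def reverse_payment(state: State) -> None:
    # Safe to call even though no charge was recorded.
    state["payment"] = "NONE"


def demo() -> Tuple[bool, Optional[State], State]:
    state: State = {"order": "NONE", "payment": "NONE"}
    steps = [
        Step("CreateOrder", create_order, cancel_order),
        Step("ProcessPayment", failing_payment, reverse_payment),
    ]
    succeeded, snapshot = run_saga(steps, state)
    return succeeded, snapshot, state
```

Calling `demo()` returns a snapshot in which the order is `CREATED` with no payment (the inconsistent window) and a final state in which the order is `CANCELED`.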
To chaos test this, you’d inject failures at various points. Imagine ProcessPayment failing. This is what happens:
1. ProcessPayment fails: The onFailure handler for ProcessPayment is invoked.
2. CompensatePayment is called: This action attempts to reverse the payment.
3. CompensateOrderCreation is called: Since ProcessPayment failed, the order creation must also be undone.
This cascade of compensations ensures that if payment processing fails, the order created in the first step does not survive: it is removed or, more commonly, marked as canceled.
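One way to check that cascade for every failure point is to encode the workflow's onFailure lists as data and assert on the resulting trace. The sketch below does that in Python. The step names come from the workflow above; `simulate`, `expected_chain`, and the `COMPENSATION_FOR` mapping are hypothetical helpers written for this example, and the simulation assumes every compensation succeeds.

```python
from typing import Dict, List, Optional

FORWARD_STEPS = ["CreateOrder", "ProcessPayment", "ShipOrder", "NotifyCustomer"]

# onFailure lists copied from the workflow definition.
ON_FAILURE: Dict[str, List[str]] = {
    "CreateOrder": ["CompensateOrderCreation"],
    "ProcessPayment": ["CompensatePayment", "CompensateOrderCreation"],
    "ShipOrder": ["CompensateShipment", "CompensatePayment", "CompensateOrderCreation"],
    "NotifyCustomer": [
        "CompensateNotification",
        "CompensateShipment",
        "CompensatePayment",
        "CompensateOrderCreation",
    ],
}

# Which compensation undoes which forward step (assumed from the names).
COMPENSATION_FOR: Dict[str, str] = {
    "CreateOrder": "CompensateOrderCreation",
    "ProcessPayment": "CompensatePayment",
    "ShipOrder": "CompensateShipment",
    "NotifyCustomer": "CompensateNotification",
}


def simulate(fail_at: Optional[str] = None) -> List[str]:
    """Return the ordered steps invoked when `fail_at` fails (None = no fault)."""
    trace: List[str] = []
    for step in FORWARD_STEPS:
        trace.append(step)
        if step == fail_at:
            trace.extend(ON_FAILURE[step])
            trace.append("SagaFailed")
            return trace
    trace.append("SagaComplete")
    return trace


def expected_chain(fail_at: str) -> List[str]:
    """Compensations for the failed step and all earlier steps, newest first."""
    reached = FORWARD_STEPS[: FORWARD_STEPS.index(fail_at) + 1]
    return [COMPENSATION_FOR[step] for step in reversed(reached)]


if __name__ == "__main__":
    for step in FORWARD_STEPS:
        # Every onFailure list should be the reverse-order compensation chain.
        assert ON_FAILURE[step] == expected_chain(step), step
    print(simulate("ProcessPayment"))
```

A check like `expected_chain` catches a common configuration mistake: an onFailure list that omits a compensation or runs them out of order.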
You can simulate these failures by:
- Network Partitioning: Use tools like iptables on Linux to block traffic to specific services involved in the saga. For example, to simulate ProcessPayment being unreachable from the orchestrator, run the following on the host where the payment service listens:

      # Assuming the payment service listens on port 8081
      sudo iptables -A INPUT -p tcp --dport 8081 -j DROP

  This command drops incoming TCP packets destined for port 8081. To revert:

      sudo iptables -D INPUT -p tcp --dport 8081 -j DROP

- Service Crashes: Manually kill the process for a specific microservice. If you’re using Kubernetes, this is as simple as:

      kubectl delete pod <payment-service-pod-name> -n <namespace>

  If the pod is managed by a Deployment or another controller, Kubernetes will create a replacement. Until the replacement is ready, your saga step will fail.

- Introducing Latency: Use tools like tc (traffic control) to add artificial delays. This can simulate timeouts before compensation logic kicks in.

      # Add a 5-second delay to all outgoing traffic on eth0
      sudo tc qdisc add dev eth0 root netem delay 5000ms

  To remove:

      sudo tc qdisc del dev eth0 root netem

- Failing Service Responses: Modify the service’s code or use a mock server to return error codes (e.g., HTTP 500) for specific requests. This is often the most precise way to test specific failure scenarios within a service. For instance, if your ProcessPayment service expects a POST /payment request, you could configure your mock server to return 503 Service Unavailable for that endpoint.

- Database Errors: Inject errors at the database level, such as connection refused, primary key violations, or transaction rollbacks, if your saga steps interact directly with a database. For example, in PostgreSQL, you could abort the current transaction with a rollback:

      ROLLBACK;

  Or, simulate a connection issue by stopping the database service temporarily.
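For the mock-server approach, a small stand-in can be built with Python's standard library alone. The following is a sketch, not a production mock: `FailingPaymentHandler`, `start_mock_server`, and `post_status` are hypothetical names, and the only behavior taken from the text above is that POST /payment answers with 503 Service Unavailable.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FailingPaymentHandler(BaseHTTPRequestHandler):
    """Answers POST /payment with an injected 503; anything else gets 404."""

    def do_POST(self):
        # Drain the request body so the client finishes sending cleanly.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        if self.path == "/payment":
            status, body = 503, {"error": "Service Unavailable (injected)"}
        else:
            status, body = 404, {"error": "Not Found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server(port: int = 0) -> HTTPServer:
    """Start the mock on localhost in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), FailingPaymentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_status(url: str) -> int:
    """POST an empty JSON object and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=b"{}", method="POST",
        headers={"Content-Type": "application/json"},
    )
    # Bypass any proxy settings so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Point the orchestrator's payment URL at the address in `server.server_address` for the duration of the test, then call `server.shutdown()` and `server.server_close()` to restore normal behavior.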
The most overlooked aspect of saga failure simulation is the behavior of the compensation steps themselves. It’s not enough to just make a forward step fail; you must also consider what happens when a compensation step also fails. This requires designing idempotent compensation actions and potentially having a "last resort" fallback mechanism if even compensation cannot be fully completed.
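A rough Python sketch of both ideas follows: an idempotent compensation that can be retried safely, and a last-resort record when retries run out (playing the role of LogCompensationFailure in the workflow above). Everything here is assumed for illustration: `PaymentLedger`, `flaky`, `compensate_with_retry`, and the in-memory `dead_letters` list are hypothetical, and a real system would likely add backoff between attempts and durable storage.

```python
from typing import Callable, List, Set, Tuple


class PaymentLedger:
    """Hypothetical in-memory stand-in for the payment service's refund state."""

    def __init__(self) -> None:
        self.refunded: Set[str] = set()
        self.refund_calls = 0

    def refund(self, payment_id: str) -> None:
        self.refund_calls += 1
        # Idempotent: refunding twice leaves the same state as refunding once.
        self.refunded.add(payment_id)


def flaky(action: Callable[[], None], failures: int) -> Callable[[], None]:
    """Wrap `action` so its first `failures` calls do the work, then raise.

    This imitates a lost response: the refund happened, but the caller
    cannot tell, so it has to retry.
    """
    remaining = [failures]

    def wrapped() -> None:
        action()
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("injected: response lost")

    return wrapped


def compensate_with_retry(
    name: str,
    compensation: Callable[[], None],
    max_attempts: int,
    dead_letters: List[Tuple[str, str]],
) -> bool:
    """Retry a compensation; record it for manual follow-up if all attempts fail."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as error:
            last_error = str(error)
    dead_letters.append((name, last_error))
    return False
```

With two injected failures and three attempts, the refund is invoked three times but recorded once, which is why idempotency matters. With more failures than attempts, the compensation ends up in `dead_letters` instead of being silently dropped.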
The next challenge is simulating failures of the compensation steps themselves and verifying how the system handles double-failure scenarios.