Chaos testing is when you intentionally break things to see how your system reacts, and simulating saga failures is a specific flavor of that for distributed transaction workflows.

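Here’s a saga workflow orchestrating a simple order placement. Treat the YAML as illustrative pseudo-configuration: the onSuccess and onFailure keys are not part of GitHub Actions’ workflow syntax, so map them onto whatever your orchestrator actually provides: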

apiVersion: actions.github.com/v1
kind: Workflow
name: OrderPlacementSaga
spec:
  steps:
    - name: CreateOrder
      uses: ./actions/create-order
      onSuccess:
        - ProcessPayment
      onFailure:
        - CompensateOrderCreation

    - name: ProcessPayment
      uses: ./actions/process-payment
      onSuccess:
        - ShipOrder
      onFailure:
        - CompensatePayment
        - CompensateOrderCreation

    - name: ShipOrder
      uses: ./actions/ship-order
      onSuccess:
        - NotifyCustomer
      onFailure:
        - CompensateShipment
        - CompensatePayment
        - CompensateOrderCreation

    - name: NotifyCustomer
      uses: ./actions/notify-customer
      onSuccess:
        - SagaComplete
      onFailure:
        - CompensateNotification
        - CompensateShipment
        - CompensatePayment
        - CompensateOrderCreation

    - name: CompensateOrderCreation
      uses: ./actions/compensate-order-creation
      onSuccess:
        - SagaFailed
      onFailure:
        - LogCompensationFailure

    - name: CompensatePayment
      uses: ./actions/compensate-payment
      onSuccess:
        - SagaFailed
      onFailure:
        - LogCompensationFailure

    - name: CompensateShipment
      uses: ./actions/compensate-shipment
      onSuccess:
        - SagaFailed
      onFailure:
        - LogCompensationFailure

    - name: CompensateNotification
      uses: ./actions/compensate-notification
      onSuccess:
        - SagaFailed
      onFailure:
        - LogCompensationFailure

    - name: SagaComplete
      uses: ./actions/saga-complete

    - name: SagaFailed
      uses: ./actions/saga-failed

    - name: LogCompensationFailure
      uses: ./actions/log-compensation-failure

This workflow defines a sequence of actions for placing an order. If any step in the primary flow (CreateOrder, ProcessPayment, ShipOrder, NotifyCustomer) fails, the workflow runs that step's compensation and then the compensation for every earlier step, newest first (for example CompensatePayment, then CompensateOrderCreation). This is not atomicity in the database sense: the earlier steps really did commit, and the compensations semantically undo them afterwards.
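The control flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the engine behind the YAML: `Step`, `run_saga`, and `demo` are hypothetical names, and the actions are plain callables standing in for real service calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward operation; raises on failure
    compensate: Callable[[], None]  # semantic undo for this step


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate newest-first."""
    attempted: List[Step] = []
    for step in steps:
        attempted.append(step)
        try:
            step.action()
        except Exception:
            # Mirror the workflow: the failed step's own compensation runs
            # first, then the compensations of every earlier step.
            for previous in reversed(attempted):
                previous.compensate()
            return "SagaFailed"
    return "SagaComplete"


def demo() -> tuple:
    """Fail ProcessPayment and record everything that ran."""
    log: List[str] = []

    def record(name: str) -> Callable[[], None]:
        return lambda: log.append(name)

    def failing_payment() -> None:
        raise RuntimeError("injected payment failure")

    steps = [
        Step("CreateOrder", record("CreateOrder"), record("CompensateOrderCreation")),
        Step("ProcessPayment", failing_payment, record("CompensatePayment")),
        Step("ShipOrder", record("ShipOrder"), record("CompensateShipment")),
    ]
    return run_saga(steps), log


if __name__ == "__main__":
    print(demo())
```

Because the failed step's own compensation is invoked too, each compensation in this sketch has to tolerate the case where its forward step never took effect.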

The surprising truth is that sagas don’t guarantee transactional consistency in the ACID sense; they achieve eventual consistency by managing failures through explicit compensation logic. The system might be in an inconsistent state for a brief period between a failure and its compensation, but it will eventually reach a consistent state.

To chaos test this, you’d inject failures at various points. Imagine ProcessPayment failing. This is what happens:

  1. ProcessPayment fails: The onFailure handler for ProcessPayment is invoked.
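  2. CompensatePayment is called: This action reverses the payment if a charge went through. Because the failure may have happened before or after the charge, it has to be safe to run when there is nothing to reverse.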
  3. CompensateOrderCreation is called: Since ProcessPayment failed, the order creation must also be undone.

This cascade of compensations ensures that if payment processing fails, the order created in the first step is canceled instead of being left open without a payment.
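One way to turn this walkthrough into a repeatable check is to fail each forward step in turn and compare the resulting trace with the onFailure lists in the workflow. The sketch below simulates the saga purely in memory; `SAGA` and `run_with_fault` are hypothetical names, and a real test would drive your actual orchestrator instead.

```python
from typing import List, Optional, Tuple

# (forward step, compensation) pairs, in the order used by the workflow.
SAGA: List[Tuple[str, str]] = [
    ("CreateOrder", "CompensateOrderCreation"),
    ("ProcessPayment", "CompensatePayment"),
    ("ShipOrder", "CompensateShipment"),
    ("NotifyCustomer", "CompensateNotification"),
]


def run_with_fault(fail_at: Optional[str]) -> List[str]:
    """Simulate the saga, forcing the step named fail_at to fail.

    Returns a trace of everything that ran, in order.
    """
    trace: List[str] = []
    compensations: List[str] = []  # compensations for steps attempted so far
    for step, compensation in SAGA:
        compensations.append(compensation)
        if step == fail_at:
            trace.append(f"{step}:FAILED")
            trace.extend(reversed(compensations))  # newest first
            trace.append("SagaFailed")
            return trace
        trace.append(step)
    trace.append("SagaComplete")
    return trace


if __name__ == "__main__":
    # Sweep the fault across every forward step.
    for step, _ in SAGA:
        print(step, "->", run_with_fault(step))
```

Sweeping the fault across every step, rather than testing one hand-picked failure, is what surfaces a missing or misordered compensation.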

You can simulate these failures by:

  • Network Partitioning: Use tools like iptables on Linux to block traffic to specific services involved in the saga. For example, to make the ProcessPayment service unreachable from the orchestrator, run this on the host where the payment service listens:
    sudo iptables -A INPUT -p tcp --dport 8081 -j DROP  # Assuming Payment service on port 8081
    
    This rule silently drops incoming TCP packets destined for port 8081, so callers see timeouts rather than an immediate connection refusal. To revert:
    sudo iptables -D INPUT -p tcp --dport 8081 -j DROP
    
  • Service Crashes: Manually kill the process for a specific microservice. If you’re using Kubernetes, this is as simple as:
    kubectl delete pod <payment-service-pod-name> -n <namespace>
    
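    If the pod is managed by a Deployment or a similar controller, Kubernetes creates a replacement. Until the replacement is ready, requests routed to the deleted pod fail, and so does your saga step if no other replica is serving.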
  • Introducing Latency: Use tools like tc (traffic control) to add artificial delays. A delay longer than the caller's timeout makes the step fail by timing out, which is what triggers the compensation logic.
    sudo tc qdisc add dev eth0 root netem delay 5000ms  # Adds 5 seconds of delay to ALL outgoing traffic on eth0, including your own SSH session
    
    To remove:
    sudo tc qdisc del dev eth0 root netem
    
  • Failing Service Responses: Modify the service’s code or use a mock server to return error codes (e.g., HTTP 500) for specific requests. This is often the most precise way to test specific failure scenarios within a service. For instance, if your ProcessPayment service expects a POST /payment request, you could configure your mock server to return 503 Service Unavailable for that endpoint.
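  • Database Errors: Inject errors at the database level, such as refused connections, primary key violations, or aborted transactions, if your saga steps interact directly with a database. Note that issuing ROLLBACK from your own session only rolls back your own transaction, not the service's. For example, in PostgreSQL, you could instead terminate the service's connections, which aborts their in-flight transactions:
    SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = '<payment-service-db-user>';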
    
    Or, simulate a connection issue by stopping the database service temporarily.
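For the "Failing Service Responses" approach, a mock server does not need a dedicated tool. The sketch below uses only Python's standard library to stand in for the payment service and answer every POST /payment with 503 Service Unavailable. The handler name and helper functions are hypothetical, and the server binds to a free local port rather than the 8081 assumed earlier.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FailingPaymentHandler(BaseHTTPRequestHandler):
    """Mock payment service: every POST /payment is answered with 503."""

    def do_POST(self):
        # Drain the request body so the client is not left mid-write.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        status = 503 if self.path == "/payment" else 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server() -> HTTPServer:
    """Start the mock on 127.0.0.1; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", 0), FailingPaymentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_payment(base_url: str) -> int:
    """Return the HTTP status of POST /payment, as an orchestrator would see it."""
    request = urllib.request.Request(base_url + "/payment", data=b"{}", method="POST")
    # Empty ProxyHandler: ignore proxy environment variables for local calls.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


if __name__ == "__main__":
    server = start_mock_server()
    print(post_payment(f"http://127.0.0.1:{server.server_port}"))
    server.shutdown()
    server.server_close()
```

Pointing the orchestrator's payment URL at this mock gives a deterministic failure for exactly one endpoint, which is harder to achieve with the network-level techniques.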

The most overlooked aspect of saga failure simulation is the behavior of the compensation steps themselves. It’s not enough to make a forward step fail; you must also test what happens when a compensation step fails. Handling that requires idempotent compensation actions, so a failed compensation can be retried safely, and a "last resort" fallback (such as logging the failure for manual repair) for when compensation still cannot be completed.
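A minimal sketch of that idea follows, assuming a toy in-memory ledger: the refund is idempotent, a retry wrapper re-runs a failing compensation, and anything that still fails lands in a dead-letter list, playing the role of the LogCompensationFailure step in the workflow. `PaymentLedger`, `flaky`, and `compensate_with_retry` are hypothetical names.

```python
from typing import Callable, List, Set


class PaymentLedger:
    """Toy payment store whose refund is idempotent."""

    def __init__(self) -> None:
        self.charged: Set[str] = set()

    def charge(self, order_id: str) -> None:
        self.charged.add(order_id)

    def refund(self, order_id: str) -> None:
        # discard() is a no-op when the order was already refunded or never
        # charged, so calling refund() repeatedly is safe.
        self.charged.discard(order_id)


def flaky(fn: Callable[[], None], failures: int) -> Callable[[], None]:
    """Wrap fn so that its first `failures` calls raise (injected fault)."""
    remaining = {"n": failures}

    def wrapper() -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise ConnectionError("injected compensation failure")
        fn()

    return wrapper


def compensate_with_retry(
    compensation: Callable[[], None],
    attempts: int,
    dead_letter: List[str],
    label: str,
) -> bool:
    """Retry a compensation; record it for manual repair if all attempts fail."""
    for _ in range(attempts):
        try:
            compensation()
            return True
        except Exception:
            continue  # retrying is only safe because compensations are idempotent
    dead_letter.append(label)  # last resort: hand off to a human or a repair job
    return False


if __name__ == "__main__":
    ledger = PaymentLedger()
    ledger.charge("order-1")
    dead_letter: List[str] = []
    refund = flaky(lambda: ledger.refund("order-1"), failures=2)
    print(compensate_with_retry(refund, 3, dead_letter, "CompensatePayment"))
    print(ledger.charged, dead_letter)
```

A production version would add backoff between attempts and persist the dead-letter entries, both of which are left out here to keep the sketch short.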

The next challenge is simulating failures in the compensation steps themselves and observing how the system handles double-failure scenarios.
