A saga’s retry policy is less about whether a step will fail and more about how gracefully the entire distributed transaction recovers when one inevitably does.
Let’s say you have a CreateOrder saga. It involves several steps: ValidateInventory, ChargeCustomer, and UpdateOrderStatus. If ChargeCustomer times out talking to the payment gateway, the saga needs to know what to do. Simply retrying immediately is a bad idea. The payment gateway might be overloaded, and hammering it with requests will only make things worse. This is where exponential backoff comes in.
Exponential backoff means increasing the delay between retries. Start with a short delay, say 500ms. If that fails, wait 1 second. Then 2 seconds, then 4, and so on. This gives the failing service time to recover without overwhelming it.
Consider this simplified Python example, which uses only the standard library and simulates a flaky payment gateway:
```python
import random
import time


def charge_customer(customer_id, amount):
    # Simulate a flaky payment gateway
    if random.random() < 0.7:  # 70% chance of failure
        print(f"Payment gateway unavailable for customer {customer_id}")
        raise ConnectionError("Payment gateway timeout")
    print(f"Successfully charged {amount} for customer {customer_id}")
    return True


def saga_step_charge(customer_id, amount, max_retries=5):
    delay = 0.5  # Initial delay in seconds
    for attempt in range(max_retries):
        try:
            return charge_customer(customer_id, amount)
        except ConnectionError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                print(f"Retrying in {delay:.1f} seconds...")
                time.sleep(delay)
                delay *= 2  # Exponential backoff
            else:
                print("Max retries reached. Saga failed.")
                raise  # Re-raise the exception to signal saga failure


# Example usage
try:
    saga_step_charge("user123", 100.00)
except Exception as e:
    print(f"Saga could not complete: {e}")
```
In this code, the delay doubles with each failed attempt. This is a common strategy. Some systems also add jitter, a small random variation to the delay, to prevent multiple clients from retrying at the exact same time, which can cause thundering herd problems. A common jitter strategy is to add a random value between 0 and the current delay.
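The jitter strategy described here can be sketched as follows. This is a minimal illustration, not a library API: the helper name `backoff_delays` and its `rng` parameter are assumptions introduced for this example, and the 0.5-second starting delay is taken from the code above.

```python
import random


def backoff_delays(initial=0.5, factor=2.0, attempts=5, rng=random):
    """Yield one delay per retry: the exponential base plus random jitter.

    Hypothetical helper for illustration. `rng` is injectable so the
    behaviour can be checked with a seeded generator.
    """
    base = initial
    for _ in range(attempts):
        # Add a random value between 0 and the current base delay.
        yield base + rng.uniform(0, base)
        base *= factor


# Seeded generator so repeated runs produce the same delays.
delays = list(backoff_delays(rng=random.Random(42)))
```

With this scheme the first retry waits somewhere between 0.5 and 1.0 seconds, the second between 1.0 and 2.0, and so on, so clients that failed at the same moment no longer retry in lockstep.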
However, retrying a step isn’t always safe. What if the ChargeCustomer step succeeds but the response doesn’t reach your system? Your system might think it failed and retry, leading to a customer being charged twice. This is where idempotency becomes critical.
An idempotent operation is one that can be executed multiple times with the same effect as executing it once. For ChargeCustomer, this means the payment gateway (or your system’s interface to it) must be able to detect if a charge for a specific order ID has already been processed.
To achieve idempotency for ChargeCustomer, you’d typically include a unique idempotency_key with each charge request (often a UUID that the orchestrator generates once per charge and reuses on every retry). The payment gateway would then:
- Receive the request with the idempotency_key.
- Check whether a charge with that idempotency_key has already been processed.
- If yes, return the result of the previous successful charge immediately, without performing the charge again.
- If no, process the charge, store the idempotency_key and its result, and then return the result.
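The gateway-side steps above can be sketched with an in-memory stand-in. Everything here is hypothetical: `FakePaymentGateway`, its `charge` method, and the shape of the result are assumptions made for illustration and do not correspond to any real payment provider's API.

```python
import uuid


class FakePaymentGateway:
    """In-memory stand-in for a gateway that honours idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key, customer_id, amount):
        # Already processed: return the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # First time: perform the charge, then store the key and its result.
        self.charges_made += 1
        result = {
            "charge_id": f"ch_{self.charges_made}",
            "customer_id": customer_id,
            "amount": amount,
        }
        self._results[idempotency_key] = result
        return result


gateway = FakePaymentGateway()
key = str(uuid.uuid4())  # generated once, then reused on every retry
first = gateway.charge(key, "user123", 100.00)
retry = gateway.charge(key, "user123", 100.00)  # e.g. after a lost response
```

Here `retry` returns the same result as `first` and only one charge is recorded. A real implementation would also need the check and the store to happen atomically (for example, behind a unique constraint on the key), which this single-threaded sketch does not attempt.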
In practice, your saga orchestrator would manage the retry logic and the generation of idempotency_keys. For example, in a workflow engine like Cadence or Temporal, you’d configure these policies directly.
```json
{
  "activityOptions": {
    "scheduleToCloseTimeout": "10m",
    "startToCloseTimeout": "1m",
    "retryPolicy": {
      "initialInterval": "5s",
      "backoffCoefficient": 2.0,
      "maximumInterval": "1m",
      "maximumAttempts": 0
    }
  }
}
```
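This JSON snippet illustrates a retry policy for an activity (a unit of work in Temporal); in practice the SDKs set these options in code rather than in a JSON file. initialInterval is 5 seconds, backoffCoefficient is 2.0 (the delay doubles after each failure), maximumInterval caps the delay at 1 minute, and maximumAttempts: 0 means the activity is retried indefinitely until it succeeds or until scheduleToCloseTimeout (10 minutes here) expires. startToCloseTimeout limits each individual attempt to 1 minute.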
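To see what these numbers imply, here is a rough calculation of the wait before each retry, assuming the interval is the initial interval multiplied by the coefficient once per previous failure and then capped at the maximum. The function `retry_intervals` is a hypothetical helper, and it ignores the time each attempt itself takes, so it is back-of-the-envelope arithmetic rather than a model of the engine's scheduler.

```python
def retry_intervals(initial=5.0, coefficient=2.0, maximum=60.0, budget=600.0):
    """Return the waits (in seconds) that fit inside the overall time budget.

    Assumes each attempt completes instantly, which understates real usage.
    """
    intervals = []
    elapsed = 0.0
    attempt = 1
    while True:
        # Exponential growth, capped at the maximum interval.
        wait = min(initial * coefficient ** (attempt - 1), maximum)
        if elapsed + wait > budget:
            break
        intervals.append(wait)
        elapsed += wait
        attempt += 1
    return intervals


intervals = retry_intervals()
```

Under these assumptions the waits are 5, 10, 20, and 40 seconds, then 60 seconds repeatedly, and 12 waits (555 seconds in total) fit inside the 10-minute budget. So "unlimited" attempts still amounts to roughly a dozen retries with this configuration.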
The one thing most people don’t realize is that idempotency isn’t just for external services. If your saga involves updating a database record, that update must also be idempotent. Suppose a step sets order_status = 'SHIPPED' and retries after a network blip even though the first UPDATE actually succeeded. Repeating a plain SET to the same value is harmless on its own, but a delayed retry can overwrite a status that has since moved on, and any update that is relative rather than absolute (decrementing stock, for example) gets applied twice. Your database update logic therefore needs to handle duplicate requests gracefully, for instance by checking the current status before applying the change or by using unique constraints that reject a duplicate write.
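One way to check the current status before applying the change is to put the check into the UPDATE itself. The sketch below uses Python's built-in sqlite3 module. The `orders` table, the `PAID` starting status, and the `mark_shipped` function are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id TEXT PRIMARY KEY, order_status TEXT NOT NULL)"
)
conn.execute("INSERT INTO orders VALUES ('order-1', 'PAID')")


def mark_shipped(conn, order_id):
    """Move an order from PAID to SHIPPED; return True if this call changed it."""
    cur = conn.execute(
        "UPDATE orders SET order_status = 'SHIPPED' "
        "WHERE id = ? AND order_status = 'PAID'",
        (order_id,),
    )
    conn.commit()
    # rowcount is 0 when the order was not in the expected state.
    return cur.rowcount == 1


first = mark_shipped(conn, "order-1")  # applies the transition
retry = mark_shipped(conn, "order-1")  # duplicate: matches no rows
status = conn.execute(
    "SELECT order_status FROM orders WHERE id = 'order-1'"
).fetchone()[0]
```

The retry changes nothing because the WHERE clause no longer matches. A saga step built this way would treat "no rows changed, status already SHIPPED" as success rather than as a failure to retry again.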
Ultimately, a robust saga retry policy is a combination of intelligent delays (exponential backoff with jitter) and truly idempotent operations at every step, ensuring that even in the face of transient failures, your distributed transaction can eventually succeed without unintended side effects.
The next problem you’ll likely encounter is how to handle permanent failures, where a step consistently fails even after retries, and how to orchestrate compensating transactions.