The most surprising thing about monitoring saga workflows is that the absence of errors on a dashboard often means something is deeply broken.
Let’s say you’re using a saga orchestrator, like Netflix’s Conductor or Camunda. You’ve got a complex, multi-step process: a user signs up, and the system sends a welcome email, creates a profile, charges a credit card, and then triggers an onboarding flow. Each of these is a "task" or "step" within the saga.
Here’s a simplified view of a saga workflow definition in JSON:
{
  "name": "userOnboarding",
  "schemaVersion": "1.0",
  "tasks": [
    {
      "name": "sendWelcomeEmail",
      "taskReferenceName": "sendWelcomeEmail_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "emailAddress": "${workflow.input.userEmail}",
        "subject": "Welcome to Our Awesome Service!"
      }
    },
    {
      "name": "createProfile",
      "taskReferenceName": "createProfile_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "userId": "${workflow.input.userId}",
        "profileData": "${workflow.input.profileDetails}"
      }
    },
    {
      "name": "chargeCreditCard",
      "taskReferenceName": "chargeCreditCard_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "userId": "${workflow.input.userId}",
        "amount": 29.99,
        "currency": "USD"
      }
    },
    {
      "name": "startOnboardingFlow",
      "taskReferenceName": "startOnboardingFlow_ref",
      "type": "SIMPLE",
      "inputParameters": {
        "userId": "${workflow.input.userId}"
      }
    }
  ],
  "join": {
    "joinTaskName": "startOnboardingFlow_ref",
    "joinOn": [
      "sendWelcomeEmail_ref",
      "createProfile_ref",
      "chargeCreditCard_ref"
    ]
  }
}
When this workflow runs, you’d expect to see tasks progressing. Your dashboard might show:
- Running Workflows: Number of active userOnboarding instances.
- Completed Workflows: Number of successful userOnboarding instances.
- Failed Workflows: Number of userOnboarding instances that hit an error.
- Task Status: Breakdown of sendWelcomeEmail, createProfile, etc., by RUNNING, COMPLETED, FAILED.
This is where the "absence of errors" trap lies. If your Failed Workflows count is zero, and Completed Workflows is steadily increasing, you might think everything is fine. But what if the chargeCreditCard task is silently failing without marking the workflow as failed?
Consider the chargeCreditCard task. It’s a SIMPLE type, meaning a worker outside the orchestrator executes it, and that worker is likely calling an external service. What happens if that external service is down, or returns an HTTP 500, but your task worker doesn’t correctly report the failure back to the orchestrator? The orchestrator might just keep retrying, or worse, the task might time out and drop out of the "running" count into a state (a timed-out status, for example) that your "failed" panel never counts.
Here’s a more realistic scenario. The chargeCreditCard task calls a payment gateway. The gateway API returns a 502 Bad Gateway. Your task worker, perhaps a Lambda function or a Kubernetes pod, receives this response.
Scenario 1: The Task Worker Errors Out Correctly
The worker code has a try/except block:
import requests

def execute_charge_credit_card(task_data):
    # Assumption: task_data already holds the charge request body
    payload = task_data
    try:
        # Call external payment gateway; the timeout keeps a hung gateway from hanging the worker
        response = requests.post("https://api.paymentgateway.com/charge", json=payload, timeout=10)
        response.raise_for_status()  # This raises HTTPError for 4xx/5xx
        return {"status": "COMPLETED", "output": response.json()}
    except requests.exceptions.RequestException as e:
        # Log the error, then return an explicit failure status to the orchestrator
        print(f"Payment gateway failed: {e}")
        return {"status": "FAILED", "output": {"error": str(e)}}
In this case, the orchestrator receives a FAILED status for chargeCreditCard_ref. The overall userOnboarding workflow might then be marked as FAILED (depending on your workflow definition’s error handling). This is good! You see the failure, you investigate.
Scenario 2: The Task Worker Doesn’t Error Out Correctly
The worker code doesn’t have robust error handling, or the orchestrator’s SDK is used incorrectly:
import requests

def execute_charge_credit_card_buggy(task_data):
    # Assume this is a simplified example, real SDKs are more complex
    payload = task_data  # Assumption: task_data already holds the charge request body
    try:
        response = requests.post("https://api.paymentgateway.com/charge", json=payload)
        if response.status_code >= 400:
            # BUG: silently ignore or log without returning failure to orchestrator
            print(f"Warning: Payment gateway returned status {response.status_code}")
            # NO return {"status": "FAILED", ...}
            # Or worse, it might return a "COMPLETED" status with a malformed output
            return {"status": "COMPLETED", "output": {"message": "Potentially failed, check logs"}}
        else:
            return {"status": "COMPLETED", "output": response.json()}
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        # BUG: again, NO explicit FAILED return to orchestrator
        return {"status": "COMPLETED", "output": {"error": "Unexpected error, check logs"}}
In this buggy scenario, the orchestrator receives COMPLETED for chargeCreditCard_ref, even though the payment failed. The workflow proceeds to the join condition, all tasks are marked COMPLETED, and the userOnboarding workflow itself is marked COMPLETED. Your dashboard shows zero failures. The user never got charged, but the system thinks they did. This is a catastrophic business failure hidden by a green dashboard.
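One way to make this class of bug harder to write is to route every gateway response through a single function that fails closed: the task is only COMPLETED when the response carries positive evidence of a charge. The sketch below reuses the result-dict shape from the examples above. The `transactionId` field is a hypothetical name for whatever proof-of-charge your gateway returns, not a documented field of any real API.

```python
def to_task_result(status_code, body):
    """Map a gateway response to a task result, failing closed."""
    # Any non-2xx response is a failure, never a "COMPLETED with a warning".
    if not 200 <= status_code < 300:
        return {"status": "FAILED",
                "output": {"error": f"gateway returned HTTP {status_code}"}}
    # Assumption: a successful charge always includes a non-empty "transactionId".
    if not isinstance(body, dict) or not body.get("transactionId"):
        return {"status": "FAILED",
                "output": {"error": "missing transactionId in gateway response"}}
    return {"status": "COMPLETED", "output": body}
```

With this in place, a 502 and a 200 that lacks proof of the charge both surface as FAILED, so the orchestrator and the dashboard see what the gateway saw.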
Monitoring Key Metrics:
- Task Execution Duration: Is chargeCreditCard suddenly taking 5 minutes when it used to take 500ms? This can indicate retries or a slow downstream service.
  - Diagnosis: Check your orchestrator’s UI or query its database for the average/median duration of chargeCreditCard_ref over time. If it’s increased significantly, investigate the external service.
  - Fix: Optimize the external service, add caching, or increase the task’s timeout in the workflow definition.
- Task State Transitions: Are tasks spending too long in RUNNING? This could mean workers are failing to pick up tasks, or they’re stuck.
  - Diagnosis: Query the orchestrator’s task table for tasks in RUNNING state older than your expected maximum task duration.
  - Fix: Ensure your worker instances are healthy and scaled appropriately. Check for deadlocks or long-running operations within the worker.
- External Service Health: Monitor the health of services your tasks call (e.g., the payment gateway API).
  - Diagnosis: Use an external monitoring tool (Datadog, Prometheus) to track error rates and latency of calls to api.paymentgateway.com/charge.
  - Fix: Address issues with the external service provider. Implement circuit breakers in your workers to gracefully handle outages.
- Worker Logs: This is your ultimate fallback. If the orchestrator doesn’t report a failure, the worker’s logs must.
  - Diagnosis: Centralize worker logs (e.g., to CloudWatch Logs, Elasticsearch). Search for errors related to the specific task type (chargeCreditCard) or external service calls.
  - Fix: Correct the bug in the worker code to properly report failures back to the orchestrator.
- Workflow Completion Rate vs. Business Outcome: This is the most critical. Compare your workflow’s COMPLETED rate with actual business outcomes (e.g., successful payments in your payment gateway’s dashboard).
  - Diagnosis: If userOnboarding workflows are 99.9% completed, but your revenue isn’t growing as expected, there’s a mismatch.
  - Fix: Implement reconciliation jobs or audit trails that cross-reference workflow completion with actual business events. This often involves adding specific "audit" tasks within your workflow that confirm external state changes.
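The reconciliation idea in the last item can be sketched as a small set comparison: take the workflows the orchestrator reports as COMPLETED, take the charges the gateway reports as successful, and list the users who appear in the first but not the second. The record shapes here (`workflowId`, `userId`, and a `status` of `"succeeded"`) are assumptions for illustration; in practice both lists would come from your orchestrator’s and gateway’s own export or query APIs.

```python
def find_unpaid_completions(completed_workflows, gateway_charges):
    """Return user IDs whose workflow completed but who have no successful charge."""
    # Users the gateway actually charged (assumed status value: "succeeded").
    charged_users = {c["userId"] for c in gateway_charges if c["status"] == "succeeded"}
    # Completed workflows with no matching successful charge are the mismatches.
    return sorted(w["userId"] for w in completed_workflows
                  if w["userId"] not in charged_users)


completed = [{"workflowId": "wf-1", "userId": "u1"},
             {"workflowId": "wf-2", "userId": "u2"}]
charges = [{"userId": "u1", "status": "succeeded"},
           {"userId": "u2", "status": "failed"}]
print(find_unpaid_completions(completed, charges))  # ['u2']
```

A non-empty result is exactly the failure a green dashboard hides, so it is worth alerting on directly rather than waiting for revenue to look wrong.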
When setting up alerts, don’t just alert on workflow.status == FAILED. Alert on:
- task.status == FAILED for critical tasks.
- task.status == RUNNING for tasks older than N times their expected duration.
- workflow.status == RUNNING for workflows older than their expected total duration.
- Significant increase in task duration for specific task types.
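As a rough sketch of how the task-level rules might be evaluated, the code below checks a batch of task records for critical failures and stuck RUNNING tasks, and compares recent durations against a baseline. Everything named here is an assumption for illustration: the record fields (`name`, `status`, `startedAt`), the expected duration, the multiplier of 3 for N, and the 2x regression factor. A real setup would express these in your monitoring tool’s rule language instead.

```python
from datetime import datetime, timedelta, timezone
from statistics import median

CRITICAL_TASKS = {"chargeCreditCard"}
STUCK_FACTOR = 3  # the "N" in "N times the expected duration" (assumed value)
EXPECTED_DURATION = {"chargeCreditCard": timedelta(seconds=2)}  # hypothetical


def task_alerts(tasks, now):
    """Return (alert_name, task_name) pairs for failed or stuck tasks."""
    alerts = []
    for t in tasks:
        if t["status"] == "FAILED" and t["name"] in CRITICAL_TASKS:
            alerts.append(("critical_task_failed", t["name"]))
        elif t["status"] == "RUNNING":
            expected = EXPECTED_DURATION.get(t["name"])
            if expected is not None and now - t["startedAt"] > STUCK_FACTOR * expected:
                alerts.append(("task_stuck", t["name"]))
    return alerts


def duration_regressed(recent_ms, baseline_ms, factor=2.0):
    """True when the recent median duration exceeds factor times the baseline median."""
    return median(recent_ms) > factor * median(baseline_ms)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
tasks = [
    {"name": "chargeCreditCard", "status": "RUNNING", "startedAt": now - timedelta(seconds=60)},
    {"name": "chargeCreditCard", "status": "FAILED", "startedAt": now},
    {"name": "sendWelcomeEmail", "status": "FAILED", "startedAt": now},
]
alerts = task_alerts(tasks, now)
```

The same age check applies to whole workflows by swapping in workflow records and an expected total duration.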
The next thing you’ll worry about is how to handle compensation for workflows that do fail, especially if they’ve already performed partial, irreversible actions.