You get:
- workflows that stop completely on first error (everything after is lost)
- partial updates that leave data inconsistent (some tasks ran, some didn’t)
- no way to retry transient failures (rate limits, timeouts)
- no compensation for already-completed tasks when later tasks fail
- escalation to humans only after everything fails (too late)
But recovery strategies exist:
- retry: try the failed task again (with backoff)
- skip: log the failure and continue (non-critical tasks)
- rollback: undo already-completed tasks
- compensate: run a different task to fix the inconsistency
- escalate: pause workflow and notify human
Without recovery planning, failures cascade.
This prompt designs recovery strategies for each failure mode.
Assume the role of a workflow resilience engineer who designs recovery strategies.
Your task is to plan how the workflow recovers from task failures.
Generate:
1. FAILURE MODES INVENTORY (per task)
| Task | Possible Failures | Likelihood | Impact | Criticality |
|------|-------------------|------------|--------|-------------|
| [name] | [e.g., API timeout] | High/Med/Low | High/Med/Low | Critical/Non-critical |
2. RECOVERY STRATEGY PER FAILURE MODE
| Failure Mode | Recovery Strategy | Parameters | Success Criteria |
|--------------|-------------------|------------|------------------|
| API timeout | Retry with backoff | 3 retries, 1s/2s/4s | Success within 10s |
| Rate limit | Wait + retry | Wait 60s, retry once | Success within 2min |
| Invalid data | Skip + log | Log to error table | Workflow continues |
| Auth expired | Escalate to human | Notify #on-call | Human intervenes |
3. COMPENSATING ACTIONS
- If [Task A] fails but [Task B] already ran:
* Compensating action: [what to do, e.g., "reverse Task B"]
* When to trigger: [immediately / batch / manual]
4. ROLLBACK PLAN
- If workflow fails after partial execution:
* What to roll back: [list of completed tasks to undo]
* Rollback method: [automated / manual / not possible]
5. ESCALATION RULES
- When to escalate: [after X retries / certain error types]
- Who to escalate to: [Slack channel, email, PagerDuty]
- What information to include: [failed task, error, data snapshot]
6. RECOVERY TESTING PLAN
- How to simulate each failure
- Expected behavior
INPUTS:
Workflow tasks (with criticality):
[E.G., "Task 1: Fetch data (critical), Task 2: Transform (critical), Task 3: Send email (non-critical)"]
Known failure history:
[E.G., "Email API sometimes rate limits, data API occasionally times out"]
Recovery time objective (RTO):
[E.G., "Workflow must recover within 5 minutes"]
RULES:
- Critical tasks (customer data, payments, orders) need retry + escalation
- Non-critical tasks (notifications, logging) can skip on failure
- Always use exponential backoff for retries (1s, 2s, 4s, 8s...)
- Idempotent tasks are safe to retry; non-idempotent need careful handling
- Compensating actions are complex — prefer atomic transactions
- Test recovery by intentionally causing each failure mode
- Critical tasks (customer data, payments, orders) need retry + escalation.
- Non-critical tasks (notifications, logging) can skip on failure.
- Always use exponential backoff for retries (1s, 2s, 4s, 8s).
- Idempotent tasks are safe to retry; non-idempotent need careful handling.
- Test recovery by intentionally causing each failure mode in staging.
Workflow tasks:
“Task 1: Fetch order from database (critical). Task 2: Process payment via Stripe (critical). Task 3: Send confirmation email (non-critical). Task 4: Update inventory (critical).”
Known failure history:
“Stripe API occasionally times out (2% of calls). Email API rate limits during peak hours. Database is reliable.”
Recovery time objective:
“5 minutes”
This framework improves outcomes by forcing:
- failure mode inventory (what can go wrong at each step?)
- recovery strategy assignment (retry, skip, rollback, compensate, escalate)
- compensating action design (undo partial work)
- rollback planning (revert to consistent state)
- escalation rules (when to involve humans)
Failure modes this prevents:
- Partial updates with no rollback (inconsistent data)
- Transient failures causing permanent failure (no retry)
- Critical failures with no human notification (silent failure)
- Unbounded retries (infinite loop, wasted resources)
This improves on: “Fail-stop” workflows that assume nothing ever breaks. Production workflows need graceful degradation.
Related to: TO-01 (Dependencies) for understanding impact of failures; TO-05 (Human-in-Loop) for escalation paths.
Build Better AI Systems
Subscribe for advanced prompt engineering, AI coding tools, debugging frameworks, and practical strategies for developers and engineers.
