Operator runbook for diagnosing, recovering, and hardening workflow DAG executions in production.
This playbook is for on-call and platform operators handling stuck, slow, or failing workflow runs.
For engine internals, see DAG Runtime. For the conceptual model, see Workflows.
1) Fast triage checklist
When a workflow appears stalled or degraded:
- Fetch runtime graph:
  GET /v1/workflow-runs/{workflowRunID}/graph
- Fetch decision stream:
  GET /v1/workflow-runs/{workflowRunID}/explain
- Confirm workflow policy:
  GET /v1/workflow-policies/{projectID}
- Inspect waiting event triggers (if wait_for_event/sleep is used)
- Decide recovery scope: single-step retry vs subtree replay
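The three read-only calls in the checklist can be gathered in one pass. The following is a minimal sketch, assuming the API is served from https://strait.dev (the host used in the curl examples in this runbook), accepts the same Bearer token on GET requests, and returns JSON. `triage_urls` and `fetch_json` are hypothetical helper names, not part of any published client.

```python
import json
import urllib.request

BASE_URL = "https://strait.dev"  # assumption: same host as the curl examples


def triage_urls(workflow_run_id: str, project_id: str, base_url: str = BASE_URL) -> dict:
    """Return the three read-only triage endpoints keyed by purpose."""
    run = f"{base_url}/v1/workflow-runs/{workflow_run_id}"
    return {
        "graph": f"{run}/graph",
        "explain": f"{run}/explain",
        "policy": f"{base_url}/v1/workflow-policies/{project_id}",
    }


def fetch_json(url: str, token: str, timeout: float = 10.0):
    """GET a URL with a Bearer token and decode the body as JSON (assumed format)."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Saving the three responses with a timestamp at this point also covers the "before" graph snapshot that incident closure asks for.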
2) What to look for
Graph signals (/graph)
- runnable is non-empty but run does not progress: usually scheduler blockers (max_parallel_steps, concurrency_key, or resource_class limits).
- critical_path_remaining_ms growing over time: likely blocked long-running branch.
- High number of pending nodes with dependencies complete: progression callback/reconciliation issue.
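The three graph signals can be written down as a small rule set. This is a sketch only: the inputs are values an operator reads off successive /graph responses, the response field layout is not assumed, and `pending_threshold` is an arbitrary illustrative cut-off for "high number", not a documented value.

```python
def graph_signals(runnable_count, progressed, pending_ready_count,
                  critical_path_ms_samples, pending_threshold=10):
    """Map /graph observations to the likely causes listed in this section.

    runnable_count: size of the runnable set in the latest snapshot.
    progressed: True if any step changed state between snapshots.
    pending_ready_count: pending nodes whose dependencies are all complete.
    critical_path_ms_samples: critical_path_remaining_ms over time, oldest first.
    """
    findings = []
    if runnable_count > 0 and not progressed:
        findings.append("scheduler_blocker")
    samples = list(critical_path_ms_samples)
    # "Growing" here means never decreasing and strictly higher at the end.
    growing = (
        len(samples) >= 2
        and all(b >= a for a, b in zip(samples, samples[1:]))
        and samples[-1] > samples[0]
    )
    if growing:
        findings.append("blocked_long_running_branch")
    if pending_ready_count >= pending_threshold:
        findings.append("progression_reconciliation_issue")
    return findings
```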
Explain signals (/explain)
Filter with step_ref and decision_type to reduce noise.
- decision_type=resource: step blocked by resource_class quota.
- decision_type=concurrency: concurrency_key lock contention.
- decision_type=scheduler: workflow-level parallelism cap reached.
- decision_type=condition: step intentionally skipped due to condition evaluation.
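A short sketch of filtering and summarizing the decision stream. It assumes that step_ref and decision_type are passed as URL query parameters and that each decision in the response is a JSON object with a decision_type field; neither detail is confirmed by this runbook. `explain_url` and `summarize_decisions` are hypothetical helpers.

```python
from collections import Counter
from urllib.parse import urlencode

# Meanings copied from the list above.
DECISION_MEANINGS = {
    "resource": "step blocked by resource_class quota",
    "concurrency": "concurrency_key lock contention",
    "scheduler": "workflow-level parallelism cap reached",
    "condition": "step intentionally skipped due to condition evaluation",
}


def explain_url(workflow_run_id, step_ref=None, decision_type=None,
                base_url="https://strait.dev"):
    """Build the /explain URL; filters as query parameters is an assumption."""
    pairs = (("step_ref", step_ref), ("decision_type", decision_type))
    params = {key: value for key, value in pairs if value}
    url = f"{base_url}/v1/workflow-runs/{workflow_run_id}/explain"
    return f"{url}?{urlencode(params)}" if params else url


def summarize_decisions(decisions):
    """Count decisions by type, most frequent first, with the runbook meaning."""
    counts = Counter(d.get("decision_type", "unknown") for d in decisions)
    return [
        (dtype, n, DECISION_MEANINGS.get(dtype, "not described in this runbook"))
        for dtype, n in counts.most_common()
    ]
```

The dominant decision type usually points at the knob to inspect first: resource_class quota, concurrency_key design, or the workflow-level parallelism cap.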
3) Recovery actions
A) Retry one terminal step (smallest blast radius)
Use when the failure is local to one step and the downstream branch should continue from that step.
curl -X POST "https://strait.dev/v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry" \
  -H "Authorization: Bearer strait_live_abc123"
B) Replay subtree (branch-scoped replay)
Use when the selected step and its descendants should be recomputed.
curl -X POST "https://strait.dev/v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree" \
  -H "Authorization: Bearer strait_live_abc123"
C) Let reaper reconcile stalled runs
Tune and enable auto-healing behavior:
- WF_STALL_THRESHOLD (default 15m)
- WF_STALL_ACTION:
  - log_only (observe only)
  - reconcile (resume callback)
  - fail_workflow (terminal fail-safe)
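Before rolling out a change to the reaper settings, it can help to validate the values. The sketch below assumes WF_STALL_THRESHOLD uses a simple number-plus-unit form such as 15m (only that one example appears in this runbook; the full accepted syntax is not documented here) and that both settings are read from environment variables. The default for WF_STALL_ACTION is not stated, so the sketch returns None when it is unset.

```python
import re

VALID_ACTIONS = {"log_only", "reconcile", "fail_workflow"}
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(value: str) -> int:
    """Convert strings like '15m' to seconds (assumed format: digits + s/m/h)."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def read_stall_config(env):
    """Return (threshold_seconds, action) from a mapping such as os.environ."""
    threshold = parse_duration(env.get("WF_STALL_THRESHOLD", "15m"))
    action = env.get("WF_STALL_ACTION")  # default not documented in this runbook
    if action is not None and action not in VALID_ACTIONS:
        raise ValueError(f"unknown WF_STALL_ACTION: {action!r}")
    return threshold, action
```

A cautious rollout starts with log_only to observe what the reaper would touch, then moves to reconcile once the stall threshold is confirmed not to catch healthy long-running steps.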
4) Policy tuning guardrails
Set project policy via:
PUT /v1/workflow-policies/{projectID}
Key fields:
- max_fan_out: controls branch explosion.
- max_depth: limits chain depth.
- forbidden_step_types: disallow risky/unsupported step kinds in a project.
- require_approval_for_deploy: enforce manual gate on deploy DAGs.
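A sketch of assembling the request body for the policy update. The field names come from the list above; their JSON types (integers, a list of strings, a boolean) and the example values are assumptions, as is the step type name used in the usage note. `build_policy` is a hypothetical helper.

```python
import json


def build_policy(max_fan_out, max_depth, forbidden_step_types=(),
                 require_approval_for_deploy=True):
    """Return a policy dict; types and limits are illustrative assumptions."""
    if max_fan_out < 1 or max_depth < 1:
        raise ValueError("max_fan_out and max_depth must be positive")
    return {
        "max_fan_out": max_fan_out,
        "max_depth": max_depth,
        # De-duplicate and sort so repeated updates produce identical bodies.
        "forbidden_step_types": sorted(set(forbidden_step_types)),
        "require_approval_for_deploy": bool(require_approval_for_deploy),
    }


def policy_request(project_id, policy, base_url="https://strait.dev"):
    """Return (method, url, body) for the policy update call."""
    url = f"{base_url}/v1/workflow-policies/{project_id}"
    return "PUT", url, json.dumps(policy, sort_keys=True)
```

For example, `policy_request("proj_1", build_policy(50, 10, ["shell"]))` yields the PUT target and a stable JSON body that can be diffed against the previous policy before it is sent.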
5) Common incident patterns
Pattern: Deploy step never starts
Typical findings:
- /graph: step is runnable but not running.
- /explain: resource or concurrency blocks.
Actions:
- Confirm there is no intentional policy restriction.
- Retry failed terminal dependency if present.
- If branch state is inconsistent, replay subtree from closest failed ancestor.
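The choice between the two recovery endpoints can be expressed as a small helper. This is a sketch under the assumption that "branch state is inconsistent" is a judgment the operator makes from the graph; the function only maps that judgment to the endpoint paths shown in section 3. `recovery_request` is a hypothetical name.

```python
from urllib.parse import quote


def recovery_request(workflow_run_id, step_ref, branch_inconsistent,
                     base_url="https://strait.dev"):
    """Pick retry (smallest blast radius) or replay-subtree and build the URL.

    step_ref should be the failed terminal step for a retry, or the closest
    failed ancestor when replaying an inconsistent branch.
    """
    action = "replay-subtree" if branch_inconsistent else "retry"
    step = quote(step_ref, safe="")  # keep any '/' in a step ref out of the path
    url = f"{base_url}/v1/workflow-runs/{workflow_run_id}/steps/{step}/{action}"
    return "POST", url
```

Defaulting to retry keeps the blast radius small; replay-subtree recomputes the selected step and all of its descendants.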
Pattern: Workflow waits forever on external input
Typical findings:
- Waiting wait_for_event step.
- Event trigger still waiting near/after timeout.
Actions:
- Verify event key and sender integration path.
- Send/cancel event trigger depending on business intent.
- If timeout handling is too strict, adjust the workflow failure policy and timeout config in the next workflow version.
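To spot triggers that are close to or past their timeout, a filter like the following can be applied to whatever trigger listing is available. The record shape (event_key, status, waited_s, timeout_s) is entirely hypothetical, and the 90% margin is an illustrative choice.

```python
def triggers_at_risk(triggers, margin=0.9):
    """Return event keys of triggers still waiting near or after their timeout.

    Each trigger is a dict with hypothetical keys:
    event_key, status, waited_s, timeout_s.
    """
    at_risk = []
    for trigger in triggers:
        if trigger["status"] != "waiting":
            continue  # already received or cancelled
        if trigger["waited_s"] >= margin * trigger["timeout_s"]:
            at_risk.append(trigger["event_key"])
    return at_risk
```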
Pattern: High fan-out DAG causes prolonged queue pressure
Typical findings:
- Very large runnable set.
- Frequent scheduler/resource block decisions.
Actions:
- Reduce fan-out in DAG design.
- Increase capacity in a controlled way (resource classes / worker sizing).
- Enforce stricter max_fan_out policy to prevent recurrence.
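A candidate max_fan_out value can be checked against an existing DAG definition before the policy is tightened. The sketch below assumes that fan-out means the number of distinct direct downstream steps of a single step, which this runbook does not define precisely, and that the DAG is available as a mapping from step ref to its direct children.

```python
def fan_out_violations(dag, max_fan_out):
    """Return {step_ref: fan_out} for steps exceeding the limit.

    dag maps each step ref to a list of its direct downstream step refs.
    """
    violations = {}
    for step, children in dag.items():
        fan_out = len(set(children))  # ignore duplicate edges
        if fan_out > max_fan_out:
            violations[step] = fan_out
    return violations
```

Running this over recent workflow definitions shows which workflows a stricter limit would reject, so the limit can be introduced without surprising their owners.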
6) Post-incident hardening
After recovery, capture and apply one or more of:
- Workflow policy update (max_fan_out, max_depth, forbidden types)
- Better concurrency_key strategy to avoid accidental serialization hotspots
- More accurate resource_class assignments per step
- Additional alerting on stalled run and explain decision rates
- Regression tests for the exact progression edge case observed
7) Minimum evidence to close incident
- Graph snapshot before/after recovery
- Explain decision sample for impacted step(s)
- Recovery action used (retry or replay-subtree) with timestamp
- Final workflow terminal state and run ID(s)
- Follow-up backlog item for permanent fix (policy/config/workflow design/test)
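The evidence list can be enforced with a simple completeness check before an incident is closed. The key names below are hypothetical labels for the items above, not fields of any incident tooling.

```python
# Hypothetical keys, one per evidence item in the list above.
REQUIRED_EVIDENCE = (
    "graph_before",
    "graph_after",
    "explain_sample",
    "recovery_action",      # "retry" or "replay-subtree"
    "recovery_timestamp",
    "terminal_state",
    "run_ids",
    "follow_up_item",
)


def missing_evidence(record):
    """Return the evidence keys that are absent or empty in the record."""
    return [key for key in REQUIRED_EVIDENCE if not record.get(key)]
```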