
Operator runbook for diagnosing, recovering, and hardening workflow DAG executions in production.

This playbook is for on-call and platform operators handling stuck, slow, or failing workflow runs.

For engine internals, see DAG Runtime. For the conceptual model, see Workflows.

1) Fast triage checklist

When a workflow appears stalled or degraded:

  1. Fetch runtime graph: GET /v1/workflow-runs/{workflowRunID}/graph
  2. Fetch decision stream: GET /v1/workflow-runs/{workflowRunID}/explain
  3. Confirm workflow policy: GET /v1/workflow-policies/{projectID}
  4. Inspect waiting event triggers (if wait_for_event / sleep is used)
  5. Decide recovery scope: single-step retry vs subtree replay
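As a quick reference, the first three checklist calls can be laid out as a request plan. Only the endpoint paths come from this runbook; the IDs are placeholders:

```python
# Sketch of the triage sequence as a request plan (endpoint paths are from
# the checklist above; the run and project IDs are illustrative).
def triage_plan(workflow_run_id, project_id):
    base = f"/v1/workflow-runs/{workflow_run_id}"
    return [
        f"GET {base}/graph",                         # 1. runtime graph
        f"GET {base}/explain",                       # 2. decision stream
        f"GET /v1/workflow-policies/{project_id}",   # 3. project policy
    ]

for req in triage_plan("wr_123", "proj_456"):
    print(req)
```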

2) What to look for

Graph signals (/graph)

  • runnable is non-empty but the run does not progress: usually a scheduler blocker (max_parallel_steps, concurrency_key, or resource_class limits).
  • critical_path_remaining_ms growing over time: likely a blocked long-running branch.
  • Many pending nodes whose dependencies are complete: likely a progression callback/reconciliation issue.
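The graph signals above can be turned into a small heuristic check. The field names (runnable, running, pending, deps_complete, critical_path_remaining_ms) follow the signals listed here, but the exact /graph response schema is an assumption for illustration:

```python
# Heuristic triage of a /graph snapshot (assumed field names, see lead-in).
def graph_signals(snapshot, previous=None):
    signals = []
    # Runnable steps exist but nothing is running: scheduler blocker.
    if snapshot.get("runnable") and not snapshot.get("running"):
        signals.append("scheduler-blocked")
    # Critical path estimate growing between polls: blocked long branch.
    if previous is not None and (
        snapshot.get("critical_path_remaining_ms", 0)
        > previous.get("critical_path_remaining_ms", 0)
    ):
        signals.append("blocked-long-branch")
    # Pending nodes whose dependencies are done: progression issue.
    if any(n.get("deps_complete") for n in snapshot.get("pending", [])):
        signals.append("progression-issue")
    return signals
```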

Explain signals (/explain)

Filter with step_ref and decision_type to reduce noise.

  • decision_type=resource: step blocked by resource_class quota.
  • decision_type=concurrency: concurrency_key lock contention.
  • decision_type=scheduler: workflow-level parallelism cap reached.
  • decision_type=condition: step intentionally skipped due to condition evaluation.
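A decision stream can be summarized per step with a small classifier. The decision_type values come from this runbook; the record shape (step_ref, decision_type keys) is an assumption:

```python
# Map /explain decision types to the likely blocker named above.
LIKELY_CAUSE = {
    "resource": "resource_class quota exhausted",
    "concurrency": "concurrency_key lock contention",
    "scheduler": "workflow-level parallelism cap reached",
    "condition": "step skipped by condition (intentional)",
}

def summarize(decisions, step_ref):
    counts = {}
    for d in decisions:
        if d.get("step_ref") != step_ref:
            continue  # mirrors filtering by step_ref to reduce noise
        cause = LIKELY_CAUSE.get(d.get("decision_type"), "unknown")
        counts[cause] = counts.get(cause, 0) + 1
    return counts
```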

3) Recovery actions

A) Retry one terminal step (smallest blast radius)

Use when the failure is local and the downstream branch should continue from that step.

curl -X POST "https://strait.dev/v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry" \
  -H "Authorization: Bearer strait_live_abc123"

B) Replay subtree (branch-scoped replay)

Use when the selected step and its descendants should be recomputed.

curl -X POST "https://strait.dev/v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree" \
  -H "Authorization: Bearer strait_live_abc123"
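The rule of thumb from A and B can be sketched as a decision helper: retry when the failure is local to one terminal step, replay the subtree when descendants must be recomputed. The helper name is illustrative, not an official client:

```python
# Choose the smallest-blast-radius recovery call (illustrative helper).
def recovery_endpoint(run_id, step_ref, descendants_stale):
    # Replay the whole subtree only when descendants must be recomputed.
    action = "replay-subtree" if descendants_stale else "retry"
    return f"POST /v1/workflow-runs/{run_id}/steps/{step_ref}/{action}"
```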

C) Let the reaper reconcile stalled runs

Tune and enable auto-healing behavior:

  • WF_STALL_THRESHOLD (default 15m)
  • WF_STALL_ACTION:
    • log_only (observe only)
    • reconcile (resume callback)
    • fail_workflow (terminal fail-safe)
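One way the reaper settings above might be interpreted, for reasoning about thresholds during an incident. The "15m" duration format and the action names come from this runbook; the parsing helper itself is an assumption:

```python
import re

def parse_threshold(value):
    """Convert a duration like '15m' or '90s' to seconds (assumed format)."""
    m = re.fullmatch(r"(\d+)([smh])", value)
    if not m:
        raise ValueError(f"bad duration: {value!r}")
    return int(m.group(1)) * {"s": 1, "m": 60, "h": 3600}[m.group(2)]

def reaper_action(stalled_for_s, threshold="15m", action="log_only"):
    # Only act once the run has been stalled past WF_STALL_THRESHOLD.
    return action if stalled_for_s >= parse_threshold(threshold) else "none"
```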

4) Policy tuning guardrails

Set project policy via:

  • PUT /v1/workflow-policies/{projectID}

Key fields:

  • max_fan_out: controls branch explosion.
  • max_depth: limits chain depth.
  • forbidden_step_types: disallow risky/unsupported step kinds in a project.
  • require_approval_for_deploy: enforce manual gate on deploy DAGs.
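A policy update body might look like the following (field names are from the list above; the values and the "shell_exec" step type are illustrative), sent via PUT /v1/workflow-policies/{projectID}:

```json
{
  "max_fan_out": 32,
  "max_depth": 10,
  "forbidden_step_types": ["shell_exec"],
  "require_approval_for_deploy": true
}
```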

5) Common incident patterns

Pattern: Deploy step never starts

Typical findings:

  • /graph: step is runnable but not running.
  • /explain: resource or concurrency blocks.

Actions:

  1. Confirm there is no intentional policy restriction.
  2. Retry failed terminal dependency if present.
  3. If the branch state is inconsistent, replay the subtree from the closest failed ancestor.

Pattern: Workflow waits forever on external input

Typical findings:

  • A wait_for_event step is still waiting.
  • The event trigger is still pending near or past its timeout.

Actions:

  1. Verify event key and sender integration path.
  2. Send/cancel event trigger depending on business intent.
  3. If timeout handling is too strict, adjust the workflow failure policy and timeout config in the next workflow version.
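When deciding whether to send or cancel the event, it helps to know how much of the wait budget remains. The field names here (waiting-since timestamp, timeout in seconds) are assumptions; only the wait_for_event concept comes from this runbook:

```python
# Remaining wait budget for a waiting wait_for_event step (assumed fields).
def wait_budget_remaining(now_s, waiting_since_s, timeout_s):
    # Clamp at zero: a past-deadline wait has no budget left.
    return max(0.0, timeout_s - (now_s - waiting_since_s))
```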

Pattern: High fan-out DAG causes prolonged queue pressure

Typical findings:

  • Very large runnable set.
  • Frequent scheduler/resource block decisions.

Actions:

  1. Reduce fan-out in DAG design.
  2. Increase capacity in a controlled way (resource classes / worker sizing).
  3. Enforce stricter max_fan_out policy to prevent recurrence.
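Before adding capacity, a back-of-the-envelope drain estimate shows how long the runnable backlog takes to clear at the effective parallelism. All numbers are illustrative; max_parallel_steps is the scheduler cap named earlier:

```python
# Rough time to drain a runnable backlog at a fixed parallelism cap.
def drain_estimate_s(runnable, max_parallel_steps, avg_step_s):
    waves = -(-runnable // max_parallel_steps)  # ceiling division
    return waves * avg_step_s
```

For example, 100 runnable steps at max_parallel_steps=8 and ~30 s per step drain in roughly 13 waves, about 390 seconds: raising the cap or shrinking the fan-out both shorten that tail.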

6) Post-incident hardening

After recovery, capture and apply one or more of:

  • Workflow policy update (max_fan_out, max_depth, forbidden types)
  • Better concurrency_key strategy to avoid accidental serialization hotspots
  • More accurate resource_class assignments per step
  • Additional alerting on stalled-run and explain-decision rates
  • Regression tests for the exact progression edge case observed

7) Minimum evidence to close incident

  • Graph snapshot before/after recovery
  • Explain decision sample for impacted step(s)
  • Recovery action used (retry or replay-subtree) with timestamp
  • Final workflow terminal state and run ID(s)
  • Follow-up backlog item for permanent fix (policy/config/workflow design/test)
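The evidence list above can be captured as a structured closure record; the shape here is an assumption, to be adapted to your incident tooling:

```python
# One possible shape for the incident-closure evidence listed above.
def incident_record(run_id, graph_before, graph_after, explain_sample,
                    recovery_action, recovered_at, terminal_state, followup):
    return {
        "run_id": run_id,
        "graph": {"before": graph_before, "after": graph_after},
        "explain_sample": explain_sample,
        "recovery": {"action": recovery_action, "at": recovered_at},
        "terminal_state": terminal_state,
        "followup": followup,  # backlog item for the permanent fix
    }
```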