
Operator runbook for diagnosing, recovering, and hardening workflow DAG executions in production.

This playbook is for on-call and platform operators handling stuck, slow, or failing workflow runs.

For engine internals, see DAG Runtime. For the conceptual model, see Workflows.

1) Fast triage checklist

When a workflow appears stalled or degraded:

  1. Fetch runtime graph: GET /v1/workflow-runs/{workflowRunID}/graph
  2. Fetch decision stream: GET /v1/workflow-runs/{workflowRunID}/explain
  3. Confirm workflow policy: GET /v1/workflow-policies/{projectID}
  4. Inspect waiting event triggers (if wait_for_event / sleep is used)
  5. Decide recovery scope: single-step retry vs subtree replay
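As a quick reference, the first three checklist calls can be laid out as a request plan. Only the endpoint paths come from this runbook; the IDs are placeholders:

```python
# Sketch of the triage sequence as a request plan (endpoint paths are from
# the checklist above; the run and project IDs are illustrative).
def triage_plan(workflow_run_id, project_id):
    base = f"/v1/workflow-runs/{workflow_run_id}"
    return [
        f"GET {base}/graph",                         # 1. runtime graph
        f"GET {base}/explain",                       # 2. decision stream
        f"GET /v1/workflow-policies/{project_id}",   # 3. project policy
    ]

for req in triage_plan("wr_123", "proj_456"):
    print(req)
```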

2) What to look for

Graph signals (/graph)

  • runnable is non-empty but the run does not progress: usually a scheduler blocker (max_parallel_steps, concurrency_key, or resource_class limits).
  • critical_path_remaining_ms growing over time: likely a blocked long-running branch.
  • Many pending nodes whose dependencies are complete: likely a progression callback/reconciliation issue.
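The graph signals above can be turned into a small heuristic check. The field names (runnable, running, pending, deps_complete, critical_path_remaining_ms) follow the signals listed here, but the exact /graph response schema is an assumption for illustration:

```python
# Heuristic triage of a /graph snapshot (assumed field names, see lead-in).
def graph_signals(snapshot, previous=None):
    signals = []
    # Runnable steps exist but nothing is running: scheduler blocker.
    if snapshot.get("runnable") and not snapshot.get("running"):
        signals.append("scheduler-blocked")
    # Critical path estimate growing between polls: blocked long branch.
    if previous is not None and (
        snapshot.get("critical_path_remaining_ms", 0)
        > previous.get("critical_path_remaining_ms", 0)
    ):
        signals.append("blocked-long-branch")
    # Pending nodes whose dependencies are done: progression issue.
    if any(n.get("deps_complete") for n in snapshot.get("pending", [])):
        signals.append("progression-issue")
    return signals
```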

Explain signals (/explain)

Filter with step_ref and decision_type to reduce noise.

  • decision_type=resource: step blocked by resource_class quota.
  • decision_type=concurrency: concurrency_key lock contention.
  • decision_type=scheduler: workflow-level parallelism cap reached.
  • decision_type=condition: step intentionally skipped due to condition evaluation.
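A decision stream can be summarized per step with a small classifier. The decision_type values come from this runbook; the record shape (step_ref, decision_type keys) is an assumption:

```python
# Map /explain decision types to the likely blocker named above.
LIKELY_CAUSE = {
    "resource": "resource_class quota exhausted",
    "concurrency": "concurrency_key lock contention",
    "scheduler": "workflow-level parallelism cap reached",
    "condition": "step skipped by condition (intentional)",
}

def summarize(decisions, step_ref):
    counts = {}
    for d in decisions:
        if d.get("step_ref") != step_ref:
            continue  # mirrors filtering by step_ref to reduce noise
        cause = LIKELY_CAUSE.get(d.get("decision_type"), "unknown")
        counts[cause] = counts.get(cause, 0) + 1
    return counts
```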

3) Recovery actions

A) Retry one terminal step (smallest blast radius)

Use when the failure is local and the downstream branch should continue from that step.

curl -X POST "https://strait.dev/v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry" \
  -H "Authorization: Bearer strait_live_abc123"

B) Replay subtree (branch-scoped replay)

Use when the selected step and its descendants should be recomputed.

curl -X POST "https://strait.dev/v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree" \
  -H "Authorization: Bearer strait_live_abc123"
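The rule of thumb from A and B can be sketched as a decision helper: retry when the failure is local to one terminal step, replay the subtree when descendants must be recomputed. The helper name is illustrative, not an official client:

```python
# Choose the smallest-blast-radius recovery call (illustrative helper).
def recovery_endpoint(run_id, step_ref, descendants_stale):
    # Replay the whole subtree only when descendants must be recomputed.
    action = "replay-subtree" if descendants_stale else "retry"
    return f"POST /v1/workflow-runs/{run_id}/steps/{step_ref}/{action}"
```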

C) Let the reaper reconcile stalled runs

Tune and enable auto-healing behavior:

  • WF_STALL_THRESHOLD (default 15m)
  • WF_STALL_ACTION:
    • log_only (observe only)
    • reconcile (resume callback)
    • fail_workflow (terminal fail-safe)
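One way the reaper settings above might be interpreted, for reasoning about thresholds during an incident. The "15m" duration format and the action names come from this runbook; the parsing helper itself is an assumption:

```python
import re

def parse_threshold(value):
    """Convert a duration like '15m' or '90s' to seconds (assumed format)."""
    m = re.fullmatch(r"(\d+)([smh])", value)
    if not m:
        raise ValueError(f"bad duration: {value!r}")
    return int(m.group(1)) * {"s": 1, "m": 60, "h": 3600}[m.group(2)]

def reaper_action(stalled_for_s, threshold="15m", action="log_only"):
    # Only act once the run has been stalled past WF_STALL_THRESHOLD.
    return action if stalled_for_s >= parse_threshold(threshold) else "none"
```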

4) Policy tuning guardrails

Set project policy via:

  • PUT /v1/workflow-policies/{projectID}

Key fields:

  • max_fan_out: controls branch explosion.
  • max_depth: limits chain depth.
  • forbidden_step_types: disallow risky/unsupported step kinds in a project.
  • require_approval_for_deploy: enforce manual gate on deploy DAGs.
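A policy update body might look like the following (field names are from the list above; the values and the "shell_exec" step type are illustrative), sent via PUT /v1/workflow-policies/{projectID}:

```json
{
  "max_fan_out": 32,
  "max_depth": 10,
  "forbidden_step_types": ["shell_exec"],
  "require_approval_for_deploy": true
}
```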

5) Common incident patterns

Pattern: Deploy step never starts

Typical findings:

  • /graph: step is runnable but not running.
  • /explain: resource or concurrency blocks.

Actions:

  1. Confirm there is no intentional policy restriction.
  2. Retry failed terminal dependency if present.
  3. If the branch state is inconsistent, replay the subtree from the closest failed ancestor.

Pattern: Workflow waits forever on external input

Typical findings:

  • A wait_for_event step is still waiting.
  • The event trigger is still pending near or past its timeout.

Actions:

  1. Verify event key and sender integration path.
  2. Send/cancel event trigger depending on business intent.
  3. If timeout handling is too strict, adjust the workflow failure policy and timeout config in the next workflow version.
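When deciding whether to send or cancel the event, it helps to know how much of the wait budget remains. The field names here (waiting-since timestamp, timeout in seconds) are assumptions; only the wait_for_event concept comes from this runbook:

```python
# Remaining wait budget for a waiting wait_for_event step (assumed fields).
def wait_budget_remaining(now_s, waiting_since_s, timeout_s):
    # Clamp at zero: a past-deadline wait has no budget left.
    return max(0.0, timeout_s - (now_s - waiting_since_s))
```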

Pattern: High fan-out DAG causes prolonged queue pressure

Typical findings:

  • Very large runnable set.
  • Frequent scheduler/resource block decisions.

Actions:

  1. Reduce fan-out in DAG design.
  2. Increase capacity in a controlled way (resource classes / worker sizing).
  3. Enforce stricter max_fan_out policy to prevent recurrence.
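Before adding capacity, a back-of-the-envelope drain estimate shows how long the runnable backlog takes to clear at the effective parallelism. All numbers are illustrative; max_parallel_steps is the scheduler cap named earlier:

```python
# Rough time to drain a runnable backlog at a fixed parallelism cap.
def drain_estimate_s(runnable, max_parallel_steps, avg_step_s):
    waves = -(-runnable // max_parallel_steps)  # ceiling division
    return waves * avg_step_s
```

For example, 100 runnable steps at max_parallel_steps=8 and ~30 s per step drain in roughly 13 waves, about 390 seconds: raising the cap or shrinking the fan-out both shorten that tail.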

6) Post-incident hardening

After recovery, capture and apply one or more of:

  • Workflow policy update (max_fan_out, max_depth, forbidden types)
  • Better concurrency_key strategy to avoid accidental serialization hotspots
  • More accurate resource_class assignments per step
  • Additional alerting on stalled-run and explain-decision rates
  • Regression tests for the exact progression edge case observed

7) Minimum evidence to close incident

  • Graph snapshot before/after recovery
  • Explain decision sample for impacted step(s)
  • Recovery action used (retry or replay-subtree) with timestamp
  • Final workflow terminal state and run ID(s)
  • Follow-up backlog item for permanent fix (policy/config/workflow design/test)
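The evidence list above can be captured as a structured closure record; the shape here is an assumption, to be adapted to your incident tooling:

```python
# One possible shape for the incident-closure evidence listed above.
def incident_record(run_id, graph_before, graph_after, explain_sample,
                    recovery_action, recovered_at, terminal_state, followup):
    return {
        "run_id": run_id,
        "graph": {"before": graph_before, "after": graph_after},
        "explain_sample": explain_sample,
        "recovery": {"action": recovery_action, "at": recovered_at},
        "terminal_state": terminal_state,
        "followup": followup,  # backlog item for the permanent fix
    }
```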