Detailed reference for workflow DAG execution, policy controls, explainability, and runtime recovery APIs.
This page is the implementation-level reference for Strait's workflow DAG engine.
If you need the high-level conceptual overview, start with Workflows. This document focuses on what the runtime does today, the exact controls that exist, and how to operate and debug DAG runs safely. For step-by-step incident handling and operator runbooks, see DAG Operations Playbook.
What is implemented today
Strait DAG execution currently includes:
- DAG validation and cycle rejection before execution
- Multi-parent fan-in with atomic dependency counters
- Per-step condition evaluation and skip decisions
- Step-level concurrency control (concurrency_key)
- Resource-class scheduling limits (small, medium, large)
- Human approvals, event waits, durable sleep, and sub-workflow steps
- Project-level workflow policies (max_fan_out, max_depth, forbidden step types, deploy approval requirement)
- Explainability stream (workflow_step_decisions) with API querying
- Runtime graph introspection with critical-path estimates
- Branch-local recovery operations (retry step, replay subtree)
- Stalled workflow reconciliation via scheduler reaper policy
DAG runtime model
A workflow run materializes one workflow_step_run per step in the selected workflow version snapshot.
Key runtime fields:
- deps_required: number of direct dependencies declared by the step
- deps_completed: number of completed/advanced dependencies observed so far
- status: pending | waiting | running | completed | failed | skipped | canceled
- attempt: step-level attempt counter
Progression is callback-driven. When a step reaches terminal state (or is skipped), the engine advances dependent steps by incrementing counters and evaluating scheduling constraints.
Scheduling and fan-in mechanics
Atomic fan-in
When a parent step completes, the engine increments dependency counters for its children with atomic SQL updates. A child becomes runnable when:
- deps_completed == deps_required
- and it is not already terminal/running
This prevents double-start races when many parents complete concurrently.
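The counter update can be sketched with SQLite. The table and column names mirror the runtime fields above, but the SQL itself is an illustrative assumption, not the engine's actual statement:

```python
import sqlite3

# In-memory sketch of the atomic fan-in counter described above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE workflow_step_runs (
    step_ref TEXT PRIMARY KEY,
    deps_required INTEGER,
    deps_completed INTEGER,
    status TEXT
)""")
db.execute("INSERT INTO workflow_step_runs VALUES ('deploy', 2, 0, 'pending')")

def on_parent_completed(child_ref):
    # One UPDATE both bumps the counter and flips the row out of 'pending'
    # only when the final dependency lands, so concurrent parent completions
    # cannot both observe "last dependency" and double-start the child.
    db.execute(
        """UPDATE workflow_step_runs
           SET deps_completed = deps_completed + 1,
               status = CASE
                   WHEN deps_completed + 1 = deps_required AND status = 'pending'
                   THEN 'waiting' ELSE status END
           WHERE step_ref = ?""",
        (child_ref,),
    )
    return db.execute(
        "SELECT deps_completed, deps_required, status "
        "FROM workflow_step_runs WHERE step_ref = ?",
        (child_ref,),
    ).fetchone()

print(on_parent_completed("deploy"))  # (1, 2, 'pending')
print(on_parent_completed("deploy"))  # (2, 2, 'waiting')
```

Column references on the right-hand side of a SQLite UPDATE read the pre-update values, which is what makes the CASE check race-free within the statement.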
Targeted scheduling reads
The scheduler uses targeted read sets per workflow run:
- statuses map: step_ref -> status
- currently running step runs
- currently runnable step runs
This avoids broad full-run scans in hot progression paths and keeps scheduling deterministic under load.
Scheduling blockers (recorded decisions)
Before starting a runnable step, the scheduler checks:
- Workflow-level parallelism cap: max_parallel_steps
- Step-level serialization: concurrency_key
- Resource-class capacity (resource_class)
- Step condition evaluation
When a step is blocked or skipped, a decision is recorded to workflow_step_decisions for explainability.
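A hypothetical sketch of those pre-start checks, with decisions appended for explainability. The function name, dict shapes, and state layout are assumptions; the limits match the resource classes documented below:

```python
# Illustrative sketch only: names and data shapes are not the engine's code.
RESOURCE_LIMITS = {"small": 50, "medium": 20, "large": 5}

def can_start(step, run_state, decisions):
    """Return True if the step may start; otherwise record a blocking decision."""
    running = run_state["running"]
    # 1. Workflow-level parallelism cap
    if len(running) >= run_state["max_parallel_steps"]:
        decisions.append({"step_ref": step["ref"], "decision_type": "parallelism",
                          "decision": "blocked",
                          "explanation": "max_parallel_steps reached"})
        return False
    # 2. Step-level serialization on concurrency_key
    key = step.get("concurrency_key")
    if key and any(s.get("concurrency_key") == key for s in running):
        decisions.append({"step_ref": step["ref"], "decision_type": "concurrency",
                          "decision": "blocked",
                          "explanation": f"concurrency_key {key} already running"})
        return False
    # 3. Resource-class capacity (unknown/absent classes fall back to small)
    cls = step.get("resource_class", "small")
    cls = cls if cls in RESOURCE_LIMITS else "small"
    in_class = sum(1 for s in running if s.get("resource_class", "small") == cls)
    if in_class >= RESOURCE_LIMITS[cls]:
        decisions.append({"step_ref": step["ref"], "decision_type": "resource",
                          "decision": "blocked",
                          "explanation": f"resource class {cls} at capacity"})
        return False
    return True

decisions = []
state = {"max_parallel_steps": 2,
         "running": [{"ref": "a", "concurrency_key": "db-migrate"}]}
print(can_start({"ref": "b", "concurrency_key": "db-migrate"}, state, decisions))  # False
print(decisions[-1]["decision_type"])  # concurrency
```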
Resource classes
resource_class is a step-level scheduling hint and quota bucket.
Current runtime limits are enforced in scheduler logic:
- small: 50 concurrent running steps
- medium: 20 concurrent running steps
- large: 5 concurrent running steps

If the class is absent or unknown, it resolves to small.
Condition DSL
The condition evaluator currently supports:
- step_status, step_status_in
- all_of, any_of, not
- eq, ne, gt, gte, lt, lte
- contains, in, regex, exists
If a condition evaluates to false, the step is marked skipped and progression continues.
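A minimal evaluator sketch for the operators above. The nested-dict condition shape and operand field names are assumptions about the DSL, not its documented syntax:

```python
import re

# Illustrative condition evaluator; the actual DSL encoding is assumed.
def evaluate(cond, ctx):
    op, arg = next(iter(cond.items()))
    if op == "step_status":
        return ctx["statuses"].get(arg["step_ref"]) == arg["equals"]
    if op == "step_status_in":
        return ctx["statuses"].get(arg["step_ref"]) in arg["values"]
    if op == "all_of":
        return all(evaluate(c, ctx) for c in arg)
    if op == "any_of":
        return any(evaluate(c, ctx) for c in arg)
    if op == "not":
        return not evaluate(arg, ctx)
    # Remaining operators compare a run-input field against a literal.
    left = ctx["inputs"].get(arg.get("field"))
    if op == "eq":  return left == arg["value"]
    if op == "ne":  return left != arg["value"]
    if op == "gt":  return left > arg["value"]
    if op == "gte": return left >= arg["value"]
    if op == "lt":  return left < arg["value"]
    if op == "lte": return left <= arg["value"]
    if op == "contains": return arg["value"] in (left or "")
    if op == "in":  return left in arg["values"]
    if op == "regex": return re.search(arg["pattern"], left or "") is not None
    if op == "exists": return arg["field"] in ctx["inputs"]
    raise ValueError(f"unknown operator {op}")

ctx = {"statuses": {"test": "completed"}, "inputs": {"env": "prod"}}
cond = {"all_of": [{"step_status": {"step_ref": "test", "equals": "completed"}},
                   {"eq": {"field": "env", "value": "prod"}}]}
print(evaluate(cond, ctx))  # True
```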
Workflow policy controls
Project policies are configured via:
- PUT /v1/workflow-policies/{projectID}
- GET /v1/workflow-policies/{projectID}
Policy fields:
- max_fan_out
- max_depth
- forbidden_step_types[]
- require_approval_for_deploy
Enforcement points:
- create workflow
- update workflow
- trigger workflow
This means invalid DAG shapes are rejected at API boundaries instead of failing later during runtime.
Explainability APIs
Step decision stream
GET /v1/workflow-runs/{workflowRunID}/explain
Optional filters:
- step_ref
- decision_type
Returns paginated decision records (decision, explanation, details, timestamp) so operators can answer: why did this step wait or skip?
Runtime graph + critical path
GET /v1/workflow-runs/{workflowRunID}/graph
Returns:
- nodes (state/timing per step)
- edges (dependency links)
- roots
- runnable set
- critical_path
- critical_path_estimate_ms
- critical_path_remaining_ms
Critical-path timing strategy:
- terminal steps: use observed runtime duration
- running steps: use elapsed time so far
- pending/waiting steps: use timeout_secs_override when present
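The three timing rules combine into a longest-path estimate. This is a sketch under assumed field names (started_at/finished_at as epoch seconds); the engine's exact node schema may differ:

```python
import time

# Illustrative critical-path estimator using the timing rules above.
def node_estimate_ms(node, now=None):
    now = now if now is not None else time.time()
    if node["status"] in ("completed", "failed", "skipped", "canceled"):
        return (node["finished_at"] - node["started_at"]) * 1000  # observed duration
    if node["status"] == "running":
        return (now - node["started_at"]) * 1000                  # elapsed so far
    # pending/waiting: fall back to the configured timeout when present
    return node.get("timeout_secs_override", 0) * 1000

def critical_path_ms(nodes, edges, now=None):
    """Longest path through the DAG, summing per-node estimates."""
    children = {ref: [] for ref in nodes}
    for parent, child in edges:
        children[parent].append(child)
    memo = {}
    def longest(ref):
        if ref not in memo:
            memo[ref] = node_estimate_ms(nodes[ref], now) + max(
                (longest(c) for c in children[ref]), default=0)
        return memo[ref]
    return max(longest(r) for r in nodes)

nodes = {
    "build": {"status": "completed", "started_at": 0, "finished_at": 120},
    "test": {"status": "running", "started_at": 120},
    "deploy": {"status": "pending", "timeout_secs_override": 90},
}
print(critical_path_ms(nodes, [("build", "test"), ("test", "deploy")], now=180))
# build 120s + test elapsed 60s + deploy timeout 90s = 270000.0 ms
```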
Runtime recovery APIs
Retry one step
POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry
Behavior:
- requires target step run to be terminal
- resets step run to pending
- clears started/finished/error/output/event_key
- resumes workflow progression
Replay a subtree
POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree
Behavior:
- computes descendants from workflow version DAG
- resets selected step + descendants to pending
- resumes workflow progression
Use this when one failed branch should be replayed without re-running unrelated successful branches.
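The descendant set is a reachability computation over the version DAG. A sketch (the edge-list representation is an assumption):

```python
from collections import deque

# Illustrative descendant computation for replay-subtree: BFS from the
# target step over parent -> child edges of the workflow version DAG.
def subtree(edges, root):
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    seen, queue = {root}, deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

edges = [("build", "test"), ("test", "deploy"), ("test", "notify"),
         ("build", "lint")]
print(sorted(subtree(edges, "test")))  # ['deploy', 'notify', 'test']
```

Note that "lint" is untouched: siblings outside the failed branch keep their completed state.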
Version insight APIs
Diff versions
GET /v1/workflows/{workflowID}/versions/{fromVersionID}/diff/{toVersionID}
Returns step refs added/removed between versions.
Version impact
GET /v1/workflows/{workflowID}/versions/{versionID}/impact
Returns how many sampled runs match the requested workflow version.
Simulate snapshot
POST /v1/workflows/{workflowID}/simulate
Returns snapshot-level ordering metadata (predicted_order, step_count) for the current workflow version.
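Prediction of this kind reduces to a topological sort of the snapshot. A sketch using Python's stdlib graphlib (whether the engine orders ties the same way is an assumption):

```python
from graphlib import TopologicalSorter

# Illustrative sketch of snapshot-level ordering metadata.
def simulate(steps):
    """steps: {step_ref: [dep_refs]} -> {predicted_order, step_count}"""
    order = list(TopologicalSorter(steps).static_order())
    return {"predicted_order": order, "step_count": len(order)}

print(simulate({"build": [], "test": ["build"], "deploy": ["test"]}))
# {'predicted_order': ['build', 'test', 'deploy'], 'step_count': 3}
```

TopologicalSorter also raises CycleError on cyclic input, mirroring the cycle rejection done at validation time.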
API examples (runtime and recovery)
Explain a blocked or skipped step
```bash
curl -X GET "https://strait.dev/v1/workflow-runs/wr_123/explain?step_ref=deploy&decision_type=resource" \
  -H "Authorization: Bearer strait_live_abc123"
```

```json
{
  "data": [
    {
      "id": "dec_1",
      "workflow_run_id": "wr_123",
      "step_ref": "deploy",
      "decision_type": "resource",
      "decision": "blocked",
      "explanation": "resource class large at capacity",
      "details": { "resource_class": "large", "running": 5, "limit": 5 },
      "created_at": "2026-03-11T10:21:00Z"
    }
  ],
  "next_cursor": "2026-03-11T10:21:00Z"
}
```

Inspect graph and critical path
```bash
curl -X GET "https://strait.dev/v1/workflow-runs/wr_123/graph" \
  -H "Authorization: Bearer strait_live_abc123"
```

```json
{
  "workflow_run_id": "wr_123",
  "workflow_id": "wf_release",
  "version": 12,
  "roots": ["build"],
  "runnable": ["deploy"],
  "critical_path": ["build", "test", "deploy"],
  "critical_path_estimate_ms": 330000,
  "critical_path_remaining_ms": 210000
}
```

Retry a terminal step
```bash
curl -X POST "https://strait.dev/v1/workflow-runs/wr_123/steps/deploy/retry" \
  -H "Authorization: Bearer strait_live_abc123"
```

Replay a failed subtree

```bash
curl -X POST "https://strait.dev/v1/workflow-runs/wr_123/steps/deploy/replay-subtree" \
  -H "Authorization: Bearer strait_live_abc123"
```

Upsert project policy

```bash
curl -X PUT "https://strait.dev/v1/workflow-policies/proj_1" \
  -H "Authorization: Bearer strait_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "max_fan_out": 15,
    "max_depth": 12,
    "forbidden_step_types": ["sleep"],
    "require_approval_for_deploy": true
  }'
```

Reaper-driven DAG safety nets
The scheduler reaper includes stalled workflow detection (WF_STALL_THRESHOLD) with configurable action (WF_STALL_ACTION):
- log_only
- reconcile (resume callback)
- fail_workflow
This protects runs that stop progressing due to transient callback or scheduling failures.
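A hypothetical sketch of the stall check. The parameter names echo WF_STALL_THRESHOLD / WF_STALL_ACTION, but the run fields and return values are illustrative assumptions:

```python
import time

# Illustrative reaper stall check; not the scheduler's actual logic.
def check_stalled(run, stall_threshold_secs, action, now=None):
    now = now if now is not None else time.time()
    if run["status"] != "running":
        return "ignored"                      # only live runs are inspected
    if now - run["last_progress_at"] < stall_threshold_secs:
        return "healthy"                      # still progressing
    if action == "log_only":
        return "logged"                       # observe, take no action
    if action == "reconcile":
        # Re-fire the progression callback so dependency counters and
        # scheduling constraints are re-evaluated, resuming a stuck run.
        return "resumed"
    if action == "fail_workflow":
        return "failed"                       # give up and mark the run failed
    raise ValueError(f"unknown WF_STALL_ACTION {action}")

run = {"status": "running", "last_progress_at": 0}
print(check_stalled(run, stall_threshold_secs=600, action="reconcile", now=900))
# resumed
```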
Known operational limits
- Max workflow steps per definition: 1000
- concurrency_key max length: 128 chars
- event_key max length: 512 chars
- Sub-workflow default nesting depth limit: 10
Recommended operator workflow
When a workflow appears stuck:
- Check graph: /workflow-runs/{id}/graph
- Check decisions: /workflow-runs/{id}/explain
- Inspect policy constraints for the project
- Retry a single step or replay a subtree if the failure is localized
- Use the stalled-workflow reaper action reconcile in environments where auto-healing is preferred