
Detailed reference for workflow DAG execution, policy controls, explainability, and runtime recovery APIs.

This page is the implementation-level reference for Strait's workflow DAG engine.

If you need the high-level conceptual overview, start with Workflows. This document focuses on what the runtime does today, the exact controls that exist, and how to operate and debug DAG runs safely. For step-by-step incident handling and operator runbooks, see DAG Operations Playbook.

What is implemented today

Strait DAG execution currently includes:

  • DAG validation and cycle rejection before execution
  • Multi-parent fan-in with atomic dependency counters
  • Per-step condition evaluation and skip decisions
  • Step-level concurrency control (concurrency_key)
  • Resource-class scheduling limits (small, medium, large)
  • Human approvals, event waits, durable sleep, and sub-workflow steps
  • Project-level workflow policies (max_fan_out, max_depth, forbidden step types, deploy approval requirement)
  • Explainability stream (workflow_step_decisions) with API querying
  • Runtime graph introspection with critical-path estimates
  • Branch-local recovery operations (retry step, replay subtree)
  • Stalled workflow reconciliation via scheduler reaper policy

DAG runtime model

A workflow run materializes one workflow_step_run per step in the selected workflow version snapshot.

Key runtime fields:

  • deps_required: number of direct dependencies declared by the step
  • deps_completed: number of completed/advanced dependencies observed so far
  • status: pending|waiting|running|completed|failed|skipped|canceled
  • attempt: step-level attempt counter

Progression is callback-driven. When a step reaches a terminal state (including skipped), the engine advances dependent steps by incrementing their dependency counters and re-evaluating scheduling constraints.
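
The runtime fields above can be sketched as a minimal in-memory record. This is an illustrative model, not the actual schema; the real engine keeps these fields in the workflow_step_runs table.

```python
from dataclasses import dataclass

@dataclass
class StepRun:
    """Illustrative mirror of the workflow_step_runs runtime fields."""
    step_ref: str
    deps_required: int            # direct dependencies declared by the step
    deps_completed: int = 0       # completed/advanced dependencies observed so far
    status: str = "pending"       # pending|waiting|running|completed|failed|skipped|canceled
    attempt: int = 0              # step-level attempt counter

    def is_runnable(self) -> bool:
        # Runnable once every declared dependency has advanced and the step
        # itself is not already running or terminal.
        return self.status == "pending" and self.deps_completed == self.deps_required

fanin = StepRun("deploy", deps_required=2)
fanin.deps_completed += 1
assert not fanin.is_runnable()    # 1 of 2 parents done
fanin.deps_completed += 1
assert fanin.is_runnable()        # fan-in complete
```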

Scheduling and fan-in mechanics

Atomic fan-in

When a parent step completes, the engine increments dependency counters for its children with atomic SQL updates. A child becomes runnable when:

  • deps_completed == deps_required
  • and it is not already terminal/running

This prevents double-start races when many parents complete concurrently.
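A minimal sketch of this pattern, using SQLite for illustration (table and column names mirror the doc; the real schema and SQL dialect may differ). The increment and the runnability check happen in one transaction, so two parents completing concurrently cannot both observe the final count:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE workflow_step_runs (
    step_ref TEXT PRIMARY KEY, deps_required INT, deps_completed INT, status TEXT)""")
db.execute("INSERT INTO workflow_step_runs VALUES ('deploy', 2, 0, 'pending')")

def parent_completed(child_ref):
    """Increment the child's counter; report whether fan-in is now complete."""
    with db:  # one transaction: the increment and the check are not interleaved
        db.execute(
            """UPDATE workflow_step_runs
               SET deps_completed = deps_completed + 1
               WHERE step_ref = ? AND status = 'pending'""",
            (child_ref,))
        row = db.execute(
            "SELECT deps_completed >= deps_required FROM workflow_step_runs"
            " WHERE step_ref = ?", (child_ref,)).fetchone()
    return bool(row and row[0])

assert parent_completed("deploy") is False   # first parent done: 1 of 2
assert parent_completed("deploy") is True    # second parent done: runnable
```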

Targeted scheduling reads

The scheduler uses targeted read sets per workflow run:

  • statuses map: step_ref -> status
  • currently running step runs
  • currently runnable step runs

This avoids broad full-run scans in hot progression paths and keeps scheduling deterministic under load.

Scheduling blockers (recorded decisions)

Before starting a runnable step, the scheduler checks:

  1. Workflow-level parallelism cap: max_parallel_steps
  2. Step-level serialization: concurrency_key
  3. Resource-class capacity (resource_class)
  4. Step condition evaluation

When a step is blocked or skipped, a decision is recorded to workflow_step_decisions for explainability.
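
The four checks above can be sketched as an ordered gate. Limits, field names, and the in-memory decision log here are assumptions for illustration; the real scheduler persists decisions to workflow_step_decisions:

```python
LIMITS = {"small": 50, "medium": 20, "large": 5}
DECISIONS = []  # stand-in for the workflow_step_decisions stream

def record(step_ref, decision_type, decision, explanation):
    DECISIONS.append({"step_ref": step_ref, "decision_type": decision_type,
                      "decision": decision, "explanation": explanation})
    return decision

def check_start(step, running, keys_in_use, class_running, max_parallel, cond=True):
    if len(running) >= max_parallel:                       # 1. parallelism cap
        return record(step["ref"], "parallelism", "blocked", "max_parallel_steps reached")
    key = step.get("concurrency_key")                      # 2. serialization key
    if key and key in keys_in_use:
        return record(step["ref"], "concurrency", "blocked", f"key '{key}' already running")
    cls = step.get("resource_class", "small")              # 3. resource-class capacity
    if class_running.get(cls, 0) >= LIMITS.get(cls, LIMITS["small"]):
        return record(step["ref"], "resource", "blocked", f"resource class {cls} at capacity")
    if not cond:                                           # 4. condition evaluation
        return record(step["ref"], "condition", "skipped", "condition evaluated to false")
    return "start"

assert check_start({"ref": "deploy", "resource_class": "large"}, running=set(),
                   keys_in_use=set(), class_running={"large": 5}, max_parallel=10) == "blocked"
```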

Resource classes

resource_class is a step-level scheduling hint and quota bucket.

Current runtime limits are enforced in scheduler logic:

  • small: 50 concurrent running steps
  • medium: 20
  • large: 5

If resource_class is absent or unknown, it resolves to small.

Condition DSL

The condition evaluator currently supports:

  • step_status
  • step_status_in
  • all_of
  • any_of
  • not
  • eq, ne
  • gt, gte, lt, lte
  • contains
  • in
  • regex
  • exists

If a condition evaluates to false, the step is marked skipped and progression continues.
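
An illustrative evaluator for the operators listed above. The real engine's condition shape and context lookup may differ; this sketch assumes conditions are single-key dicts like `{"all_of": [...]}` evaluated against a context of step statuses and output values:

```python
import re

def evaluate(cond, ctx):
    op, arg = next(iter(cond.items()))
    # Step-status and combinator operators work on the condition tree itself.
    if op == "step_status":
        return ctx["statuses"].get(arg["step_ref"]) == arg["status"]
    if op == "step_status_in":
        return ctx["statuses"].get(arg["step_ref"]) in arg["statuses"]
    if op == "all_of":
        return all(evaluate(c, ctx) for c in arg)
    if op == "any_of":
        return any(evaluate(c, ctx) for c in arg)
    if op == "not":
        return not evaluate(arg, ctx)
    # Value operators resolve a path into the context's values.
    value = ctx["values"].get(arg.get("path"))
    if op == "eq": return value == arg["value"]
    if op == "ne": return value != arg["value"]
    if op == "gt": return value is not None and value > arg["value"]
    if op == "gte": return value is not None and value >= arg["value"]
    if op == "lt": return value is not None and value < arg["value"]
    if op == "lte": return value is not None and value <= arg["value"]
    if op == "contains": return arg["value"] in (value or "")
    if op == "in": return value in arg["values"]
    if op == "regex": return value is not None and re.search(arg["pattern"], value) is not None
    if op == "exists": return value is not None
    raise ValueError(f"unknown operator {op}")

ctx = {"statuses": {"test": "completed"}, "values": {"env": "prod"}}
assert evaluate({"all_of": [
    {"step_status": {"step_ref": "test", "status": "completed"}},
    {"eq": {"path": "env", "value": "prod"}},
]}, ctx)
```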

Workflow policy controls

Project policies are configured via:

  • PUT /v1/workflow-policies/{projectID}
  • GET /v1/workflow-policies/{projectID}

Policy fields:

  • max_fan_out
  • max_depth
  • forbidden_step_types[]
  • require_approval_for_deploy

Enforcement points:

  • create workflow
  • update workflow
  • trigger workflow

This means invalid DAG shapes are rejected at the API boundary instead of failing later at runtime.
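
A sketch of that boundary validation: checking a DAG shape against project policy before accepting it. Step and policy shapes are assumptions (steps as dicts with `ref`, `type`, and `deps` listing parent refs); the real validator may also enforce deploy-approval rules:

```python
def validate_against_policy(steps, policy):
    errors = []
    children = {s["ref"]: [] for s in steps}
    for s in steps:
        if s["type"] in policy.get("forbidden_step_types", []):
            errors.append(f"step {s['ref']}: type '{s['type']}' is forbidden")
        for dep in s.get("deps", []):
            children[dep].append(s["ref"])
    max_fan_out = policy.get("max_fan_out")
    if max_fan_out is not None:
        for ref, kids in children.items():
            if len(kids) > max_fan_out:
                errors.append(f"step {ref}: fan-out {len(kids)} exceeds {max_fan_out}")
    max_depth = policy.get("max_depth")
    if max_depth is not None and steps:
        depth = {}
        def depth_of(ref):
            if ref not in depth:
                step = next(s for s in steps if s["ref"] == ref)
                depth[ref] = 1 + max((depth_of(d) for d in step.get("deps", [])), default=0)
            return depth[ref]
        if max(depth_of(s["ref"]) for s in steps) > max_depth:
            errors.append(f"DAG depth exceeds {max_depth}")
    return errors

policy = {"max_fan_out": 1, "forbidden_step_types": ["sleep"]}
steps = [{"ref": "a", "type": "task", "deps": []},
         {"ref": "b", "type": "sleep", "deps": ["a"]},
         {"ref": "c", "type": "task", "deps": ["a"]}]
assert len(validate_against_policy(steps, policy)) == 2  # forbidden type + fan-out
```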

Explainability APIs

Step decision stream

GET /v1/workflow-runs/{workflowRunID}/explain

Optional filters:

  • step_ref
  • decision_type

Returns paginated decision records (decision, explanation, details, timestamp) so operators can answer: why did this step wait or skip?

Runtime graph + critical path

GET /v1/workflow-runs/{workflowRunID}/graph

Returns:

  • nodes (state/timing per step)
  • edges (dependency links)
  • roots
  • runnable set
  • critical_path
  • critical_path_estimate_ms
  • critical_path_remaining_ms

Critical-path timing strategy:

  • terminal steps: use observed runtime duration
  • running steps: use elapsed time so far
  • pending/waiting steps: use timeout_secs_override when present
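
The timing strategy above amounts to a longest-path computation over per-step duration estimates. This is a client-side sketch under assumed field names; the real graph endpoint computes it server-side:

```python
TERMINAL = {"completed", "failed", "skipped", "canceled"}

def step_duration_ms(step, now_ms):
    if step["status"] in TERMINAL:
        return step["finished_ms"] - step["started_ms"]     # observed runtime
    if step["status"] == "running":
        return now_ms - step["started_ms"]                  # elapsed so far
    return step.get("timeout_secs_override", 0) * 1000      # pending/waiting estimate

def critical_path_ms(steps, edges, now_ms):
    """Longest path over step durations; edges maps parent -> children."""
    longest = {}
    def walk(ref):
        if ref not in longest:
            longest[ref] = step_duration_ms(steps[ref], now_ms) + max(
                (walk(c) for c in edges.get(ref, [])), default=0)
        return longest[ref]
    return max(walk(r) for r in steps)

steps = {"build": {"status": "completed", "started_ms": 0, "finished_ms": 60_000},
         "test": {"status": "running", "started_ms": 60_000},
         "deploy": {"status": "pending", "timeout_secs_override": 120}}
edges = {"build": ["test"], "test": ["deploy"]}
assert critical_path_ms(steps, edges, now_ms=90_000) == 210_000
```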

Runtime recovery APIs

Retry one step

POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry

Behavior:

  • requires target step run to be terminal
  • resets step run to pending
  • clears started/finished/error/output/event_key
  • resumes workflow progression

Replay a subtree

POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree

Behavior:

  • computes descendants from workflow version DAG
  • resets selected step + descendants to pending
  • resumes workflow progression

Use this when one failed branch should be replayed without re-running unrelated successful branches.
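
The descendant computation is a reachability walk over the version DAG. A minimal sketch, assuming `edges` maps each step_ref to its children:

```python
from collections import deque

def subtree(edges, root):
    """Root plus every transitive descendant -- the set reset to pending."""
    seen, queue = {root}, deque([root])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

edges = {"build": ["test", "lint"], "test": ["deploy"], "deploy": ["notify"]}
assert subtree(edges, "test") == {"test", "deploy", "notify"}
```

Note that `lint` stays untouched: only the selected branch is replayed, which is exactly the point of replay-subtree.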

Version insight APIs

Diff versions

GET /v1/workflows/{workflowID}/versions/{fromVersionID}/diff/{toVersionID}

Returns step refs added/removed between versions.

Version impact

GET /v1/workflows/{workflowID}/versions/{versionID}/impact

Returns how many sampled runs match the requested workflow version.

Simulate snapshot

POST /v1/workflows/{workflowID}/simulate

Returns snapshot-level ordering metadata (predicted_order, step_count) for the current workflow version.
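
predicted_order is presumably a topological order of the version's DAG; the exact tiebreak between independent steps is not specified here. A sketch using the standard library:

```python
from graphlib import TopologicalSorter

# deps maps each step_ref to the set of steps it depends on.
deps = {"build": set(), "test": {"build"}, "lint": {"build"}, "deploy": {"test", "lint"}}
predicted_order = list(TopologicalSorter(deps).static_order())
step_count = len(predicted_order)

assert predicted_order[0] == "build" and predicted_order[-1] == "deploy"
assert step_count == 4
```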

API examples (runtime and recovery)

Explain a blocked or skipped step

curl -X GET "https://strait.dev/v1/workflow-runs/wr_123/explain?step_ref=deploy&decision_type=resource" \
  -H "Authorization: Bearer strait_live_abc123"
{
  "data": [
    {
      "id": "dec_1",
      "workflow_run_id": "wr_123",
      "step_ref": "deploy",
      "decision_type": "resource",
      "decision": "blocked",
      "explanation": "resource class large at capacity",
      "details": { "resource_class": "large", "running": 5, "limit": 5 },
      "created_at": "2026-03-11T10:21:00Z"
    }
  ],
  "next_cursor": "2026-03-11T10:21:00Z"
}

Inspect graph and critical path

curl -X GET "https://strait.dev/v1/workflow-runs/wr_123/graph" \
  -H "Authorization: Bearer strait_live_abc123"
{
  "workflow_run_id": "wr_123",
  "workflow_id": "wf_release",
  "version": 12,
  "roots": ["build"],
  "runnable": ["deploy"],
  "critical_path": ["build", "test", "deploy"],
  "critical_path_estimate_ms": 330000,
  "critical_path_remaining_ms": 210000
}

Retry a terminal step

curl -X POST "https://strait.dev/v1/workflow-runs/wr_123/steps/deploy/retry" \
  -H "Authorization: Bearer strait_live_abc123"

Replay a failed subtree

curl -X POST "https://strait.dev/v1/workflow-runs/wr_123/steps/deploy/replay-subtree" \
  -H "Authorization: Bearer strait_live_abc123"

Upsert project policy

curl -X PUT "https://strait.dev/v1/workflow-policies/proj_1" \
  -H "Authorization: Bearer strait_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "max_fan_out": 15,
    "max_depth": 12,
    "forbidden_step_types": ["sleep"],
    "require_approval_for_deploy": true
  }'

Reaper-driven DAG safety nets

The scheduler reaper includes stalled-workflow detection (WF_STALL_THRESHOLD) with a configurable action (WF_STALL_ACTION):

  • log_only
  • reconcile (resume callback)
  • fail_workflow

This protects runs that stop progressing due to transient callback or scheduling failures.
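
A sketch of the check, assuming the reaper compares each run's last progression timestamp against the threshold and then dispatches on the configured action. Field names and return values are illustrative:

```python
def reap(run, now_s, stall_threshold_s, action):
    if now_s - run["last_progress_s"] < stall_threshold_s:
        return "healthy"
    if action == "log_only":
        return "logged"                     # observe only, no intervention
    if action == "reconcile":
        return "resume_callback_enqueued"   # re-drive progression
    if action == "fail_workflow":
        return "failed"                     # give up and mark the run failed
    raise ValueError(f"unknown WF_STALL_ACTION {action}")

assert reap({"last_progress_s": 0}, now_s=900, stall_threshold_s=600,
            action="reconcile") == "resume_callback_enqueued"
```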

Known operational limits

  • Max workflow steps per definition: 1000
  • concurrency_key max length: 128 chars
  • event_key max length: 512 chars
  • Sub-workflow default nesting depth limit: 10

When a workflow appears stuck:

  1. Check graph: /workflow-runs/{id}/graph
  2. Check decisions: /workflow-runs/{id}/explain
  3. Inspect policy constraints for the project
  4. Retry a single step or replay subtree if failure is localized
  5. Use stalled-workflow reaper action reconcile in environments where auto-healing is preferred