
Detailed reference for workflow DAG execution, policy controls, explainability, and runtime recovery APIs.

This page is the implementation-level reference for Strait's workflow DAG engine.

If you need the high-level conceptual overview, start with Workflows. This document focuses on what the runtime does today, the exact controls that exist, and how to operate and debug DAG runs safely. For step-by-step incident handling and operator runbooks, see DAG Operations Playbook.

What is implemented today

Strait DAG execution currently includes:

  • DAG validation and cycle rejection before execution
  • Multi-parent fan-in with atomic dependency counters
  • Per-step condition evaluation and skip decisions
  • Step-level concurrency control (concurrency_key)
  • Resource-class scheduling limits (small, medium, large)
  • Human approvals, event waits, durable sleep, and sub-workflow steps
  • Project-level workflow policies (max_fan_out, max_depth, forbidden step types, deploy approval requirement)
  • Explainability stream (workflow_step_decisions) with API querying
  • Runtime graph introspection with critical-path estimates
  • Branch-local recovery operations (retry step, replay subtree)
  • Stalled workflow reconciliation via scheduler reaper policy

DAG runtime model

A workflow run materializes one workflow_step_run per step in the selected workflow version snapshot.

Key runtime fields:

  • deps_required: number of direct dependencies declared by the step
  • deps_completed: number of completed/advanced dependencies observed so far
  • status: pending|waiting|running|completed|failed|skipped|canceled
  • attempt: step-level attempt counter

Progression is callback-driven. When a step reaches a terminal state (including skipped), the engine advances dependent steps by incrementing their dependency counters and re-evaluating scheduling constraints.
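
The runtime fields above can be sketched as a minimal in-memory record. This is an illustrative model, not the actual schema; the real engine keeps these fields in the workflow_step_runs table.

```python
from dataclasses import dataclass

@dataclass
class StepRun:
    """Illustrative mirror of the workflow_step_runs runtime fields."""
    step_ref: str
    deps_required: int            # direct dependencies declared by the step
    deps_completed: int = 0       # completed/advanced dependencies observed so far
    status: str = "pending"       # pending|waiting|running|completed|failed|skipped|canceled
    attempt: int = 0              # step-level attempt counter

    def is_runnable(self) -> bool:
        # Runnable once every declared dependency has advanced and the step
        # itself is not already running or terminal.
        return self.status == "pending" and self.deps_completed == self.deps_required

fanin = StepRun("deploy", deps_required=2)
fanin.deps_completed += 1
assert not fanin.is_runnable()    # 1 of 2 parents done
fanin.deps_completed += 1
assert fanin.is_runnable()        # fan-in complete
```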

Scheduling and fan-in mechanics

Atomic fan-in

When a parent step completes, the engine increments dependency counters for its children with atomic SQL updates. A child becomes runnable when:

  • deps_completed == deps_required
  • and it is not already terminal/running

This prevents double-start races when many parents complete concurrently.
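A minimal sketch of this pattern, using SQLite for illustration (table and column names mirror the doc; the real schema and SQL dialect may differ). The increment and the runnability check happen in one transaction, so two parents completing concurrently cannot both observe the final count:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE workflow_step_runs (
    step_ref TEXT PRIMARY KEY, deps_required INT, deps_completed INT, status TEXT)""")
db.execute("INSERT INTO workflow_step_runs VALUES ('deploy', 2, 0, 'pending')")

def parent_completed(child_ref):
    """Increment the child's counter; report whether fan-in is now complete."""
    with db:  # one transaction: the increment and the check are not interleaved
        db.execute(
            """UPDATE workflow_step_runs
               SET deps_completed = deps_completed + 1
               WHERE step_ref = ? AND status = 'pending'""",
            (child_ref,))
        row = db.execute(
            "SELECT deps_completed >= deps_required FROM workflow_step_runs"
            " WHERE step_ref = ?", (child_ref,)).fetchone()
    return bool(row and row[0])

assert parent_completed("deploy") is False   # first parent done: 1 of 2
assert parent_completed("deploy") is True    # second parent done: runnable
```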

Targeted scheduling reads

The scheduler uses targeted read sets per workflow run:

  • statuses map: step_ref -> status
  • currently running step runs
  • currently runnable step runs

This avoids broad full-run scans in hot progression paths and keeps scheduling deterministic under load.

Scheduling blockers (recorded decisions)

Before starting a runnable step, the scheduler checks:

  1. Workflow-level parallelism cap: max_parallel_steps
  2. Step-level serialization: concurrency_key
  3. Resource-class capacity (resource_class)
  4. Step condition evaluation

When a step is blocked or skipped, a decision is recorded to workflow_step_decisions for explainability.
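
The four checks above can be sketched as an ordered gate. Limits, field names, and the in-memory decision log here are assumptions for illustration; the real scheduler persists decisions to workflow_step_decisions:

```python
LIMITS = {"small": 50, "medium": 20, "large": 5}
DECISIONS = []  # stand-in for the workflow_step_decisions stream

def record(step_ref, decision_type, decision, explanation):
    DECISIONS.append({"step_ref": step_ref, "decision_type": decision_type,
                      "decision": decision, "explanation": explanation})
    return decision

def check_start(step, running, keys_in_use, class_running, max_parallel, cond=True):
    if len(running) >= max_parallel:                       # 1. parallelism cap
        return record(step["ref"], "parallelism", "blocked", "max_parallel_steps reached")
    key = step.get("concurrency_key")                      # 2. serialization key
    if key and key in keys_in_use:
        return record(step["ref"], "concurrency", "blocked", f"key '{key}' already running")
    cls = step.get("resource_class", "small")              # 3. resource-class capacity
    if class_running.get(cls, 0) >= LIMITS.get(cls, LIMITS["small"]):
        return record(step["ref"], "resource", "blocked", f"resource class {cls} at capacity")
    if not cond:                                           # 4. condition evaluation
        return record(step["ref"], "condition", "skipped", "condition evaluated to false")
    return "start"

assert check_start({"ref": "deploy", "resource_class": "large"}, running=set(),
                   keys_in_use=set(), class_running={"large": 5}, max_parallel=10) == "blocked"
```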

Resource classes

resource_class is a step-level scheduling hint and quota bucket.

Current runtime limits are enforced in scheduler logic:

  • small: 50 concurrent running steps
  • medium: 20
  • large: 5

If resource_class is absent or unknown, it resolves to small.

Condition DSL

The condition evaluator currently supports:

  • step_status
  • step_status_in
  • all_of
  • any_of
  • not
  • eq, ne
  • gt, gte, lt, lte
  • contains
  • in
  • regex
  • exists

If a condition evaluates to false, the step is marked skipped and progression continues.
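
An illustrative evaluator for the operators listed above. The real engine's condition shape and context lookup may differ; this sketch assumes conditions are single-key dicts like `{"all_of": [...]}` evaluated against a context of step statuses and output values:

```python
import re

def evaluate(cond, ctx):
    op, arg = next(iter(cond.items()))
    # Step-status and combinator operators work on the condition tree itself.
    if op == "step_status":
        return ctx["statuses"].get(arg["step_ref"]) == arg["status"]
    if op == "step_status_in":
        return ctx["statuses"].get(arg["step_ref"]) in arg["statuses"]
    if op == "all_of":
        return all(evaluate(c, ctx) for c in arg)
    if op == "any_of":
        return any(evaluate(c, ctx) for c in arg)
    if op == "not":
        return not evaluate(arg, ctx)
    # Value operators resolve a path into the context's values.
    value = ctx["values"].get(arg.get("path"))
    if op == "eq": return value == arg["value"]
    if op == "ne": return value != arg["value"]
    if op == "gt": return value is not None and value > arg["value"]
    if op == "gte": return value is not None and value >= arg["value"]
    if op == "lt": return value is not None and value < arg["value"]
    if op == "lte": return value is not None and value <= arg["value"]
    if op == "contains": return arg["value"] in (value or "")
    if op == "in": return value in arg["values"]
    if op == "regex": return value is not None and re.search(arg["pattern"], value) is not None
    if op == "exists": return value is not None
    raise ValueError(f"unknown operator {op}")

ctx = {"statuses": {"test": "completed"}, "values": {"env": "prod"}}
assert evaluate({"all_of": [
    {"step_status": {"step_ref": "test", "status": "completed"}},
    {"eq": {"path": "env", "value": "prod"}},
]}, ctx)
```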

Workflow policy controls

Project policies are configured via:

  • PUT /v1/workflow-policies/{projectID}
  • GET /v1/workflow-policies/{projectID}

Policy fields:

  • max_fan_out
  • max_depth
  • forbidden_step_types[]
  • require_approval_for_deploy

Enforcement points:

  • create workflow
  • update workflow
  • trigger workflow

This means invalid DAG shapes are rejected at the API boundary instead of failing later at runtime.
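
A sketch of that boundary validation: checking a DAG shape against project policy before accepting it. Step and policy shapes are assumptions (steps as dicts with `ref`, `type`, and `deps` listing parent refs); the real validator may also enforce deploy-approval rules:

```python
def validate_against_policy(steps, policy):
    errors = []
    children = {s["ref"]: [] for s in steps}
    for s in steps:
        if s["type"] in policy.get("forbidden_step_types", []):
            errors.append(f"step {s['ref']}: type '{s['type']}' is forbidden")
        for dep in s.get("deps", []):
            children[dep].append(s["ref"])
    max_fan_out = policy.get("max_fan_out")
    if max_fan_out is not None:
        for ref, kids in children.items():
            if len(kids) > max_fan_out:
                errors.append(f"step {ref}: fan-out {len(kids)} exceeds {max_fan_out}")
    max_depth = policy.get("max_depth")
    if max_depth is not None and steps:
        depth = {}
        def depth_of(ref):
            if ref not in depth:
                step = next(s for s in steps if s["ref"] == ref)
                depth[ref] = 1 + max((depth_of(d) for d in step.get("deps", [])), default=0)
            return depth[ref]
        if max(depth_of(s["ref"]) for s in steps) > max_depth:
            errors.append(f"DAG depth exceeds {max_depth}")
    return errors

policy = {"max_fan_out": 1, "forbidden_step_types": ["sleep"]}
steps = [{"ref": "a", "type": "task", "deps": []},
         {"ref": "b", "type": "sleep", "deps": ["a"]},
         {"ref": "c", "type": "task", "deps": ["a"]}]
assert len(validate_against_policy(steps, policy)) == 2  # forbidden type + fan-out
```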

Explainability APIs

Step decision stream

GET /v1/workflow-runs/{workflowRunID}/explain

Optional filters:

  • step_ref
  • decision_type

Returns paginated decision records (decision, explanation, details, timestamp) so operators can answer: why did this step wait or skip?

Runtime graph + critical path

GET /v1/workflow-runs/{workflowRunID}/graph

Returns:

  • nodes (state/timing per step)
  • edges (dependency links)
  • roots
  • runnable set
  • critical_path
  • critical_path_estimate_ms
  • critical_path_remaining_ms

Critical-path timing strategy:

  • terminal steps: use observed runtime duration
  • running steps: use elapsed time so far
  • pending/waiting steps: use timeout_secs_override when present
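
The timing strategy above amounts to a longest-path computation over per-step duration estimates. This is a client-side sketch under assumed field names; the real graph endpoint computes it server-side:

```python
TERMINAL = {"completed", "failed", "skipped", "canceled"}

def step_duration_ms(step, now_ms):
    if step["status"] in TERMINAL:
        return step["finished_ms"] - step["started_ms"]     # observed runtime
    if step["status"] == "running":
        return now_ms - step["started_ms"]                  # elapsed so far
    return step.get("timeout_secs_override", 0) * 1000      # pending/waiting estimate

def critical_path_ms(steps, edges, now_ms):
    """Longest path over step durations; edges maps parent -> children."""
    longest = {}
    def walk(ref):
        if ref not in longest:
            longest[ref] = step_duration_ms(steps[ref], now_ms) + max(
                (walk(c) for c in edges.get(ref, [])), default=0)
        return longest[ref]
    return max(walk(r) for r in steps)

steps = {"build": {"status": "completed", "started_ms": 0, "finished_ms": 60_000},
         "test": {"status": "running", "started_ms": 60_000},
         "deploy": {"status": "pending", "timeout_secs_override": 120}}
edges = {"build": ["test"], "test": ["deploy"]}
assert critical_path_ms(steps, edges, now_ms=90_000) == 210_000
```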

Runtime recovery APIs

Retry one step

POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry

Behavior:

  • requires target step run to be terminal
  • resets step run to pending
  • clears started/finished/error/output/event_key
  • resumes workflow progression

Replay a subtree

POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree

Behavior:

  • computes descendants from workflow version DAG
  • resets selected step + descendants to pending
  • resumes workflow progression

Use this when one failed branch should be replayed without re-running unrelated successful branches.
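
The descendant computation is a reachability walk over the version DAG. A minimal sketch, assuming `edges` maps each step_ref to its children:

```python
from collections import deque

def subtree(edges, root):
    """Root plus every transitive descendant -- the set reset to pending."""
    seen, queue = {root}, deque([root])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

edges = {"build": ["test", "lint"], "test": ["deploy"], "deploy": ["notify"]}
assert subtree(edges, "test") == {"test", "deploy", "notify"}
```

Note that `lint` stays untouched: only the selected branch is replayed, which is exactly the point of replay-subtree.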

Version insight APIs

Diff versions

GET /v1/workflows/{workflowID}/versions/{fromVersionID}/diff/{toVersionID}

Returns step refs added/removed between versions.

Version impact

GET /v1/workflows/{workflowID}/versions/{versionID}/impact

Returns how many sampled runs match the requested workflow version.

Simulate snapshot

POST /v1/workflows/{workflowID}/simulate

Returns snapshot-level ordering metadata (predicted_order, step_count) for the current workflow version.
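
predicted_order is presumably a topological order of the version's DAG; the exact tiebreak between independent steps is not specified here. A sketch using the standard library:

```python
from graphlib import TopologicalSorter

# deps maps each step_ref to the set of steps it depends on.
deps = {"build": set(), "test": {"build"}, "lint": {"build"}, "deploy": {"test", "lint"}}
predicted_order = list(TopologicalSorter(deps).static_order())
step_count = len(predicted_order)

assert predicted_order[0] == "build" and predicted_order[-1] == "deploy"
assert step_count == 4
```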

API examples (runtime and recovery)

Explain a blocked or skipped step

curl -X GET "https://strait.dev/v1/workflow-runs/wr_123/explain?step_ref=deploy&decision_type=resource" \
  -H "Authorization: Bearer strait_live_abc123"
{
  "data": [
    {
      "id": "dec_1",
      "workflow_run_id": "wr_123",
      "step_ref": "deploy",
      "decision_type": "resource",
      "decision": "blocked",
      "explanation": "resource class large at capacity",
      "details": { "resource_class": "large", "running": 5, "limit": 5 },
      "created_at": "2026-03-11T10:21:00Z"
    }
  ],
  "next_cursor": "2026-03-11T10:21:00Z"
}

Inspect graph and critical path

curl -X GET "https://strait.dev/v1/workflow-runs/wr_123/graph" \
  -H "Authorization: Bearer strait_live_abc123"
{
  "workflow_run_id": "wr_123",
  "workflow_id": "wf_release",
  "version": 12,
  "roots": ["build"],
  "runnable": ["deploy"],
  "critical_path": ["build", "test", "deploy"],
  "critical_path_estimate_ms": 330000,
  "critical_path_remaining_ms": 210000
}

Retry a terminal step

curl -X POST "https://strait.dev/v1/workflow-runs/wr_123/steps/deploy/retry" \
  -H "Authorization: Bearer strait_live_abc123"

Replay a failed subtree

curl -X POST "https://strait.dev/v1/workflow-runs/wr_123/steps/deploy/replay-subtree" \
  -H "Authorization: Bearer strait_live_abc123"

Upsert project policy

curl -X PUT "https://strait.dev/v1/workflow-policies/proj_1" \
  -H "Authorization: Bearer strait_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "max_fan_out": 15,
    "max_depth": 12,
    "forbidden_step_types": ["sleep"],
    "require_approval_for_deploy": true
  }'

Reaper-driven DAG safety nets

The scheduler reaper includes stalled-workflow detection (WF_STALL_THRESHOLD) with a configurable action (WF_STALL_ACTION):

  • log_only
  • reconcile (resume callback)
  • fail_workflow

This protects runs that stop progressing due to transient callback or scheduling failures.
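
A sketch of the check, assuming the reaper compares each run's last progression timestamp against the threshold and then dispatches on the configured action. Field names and return values are illustrative:

```python
def reap(run, now_s, stall_threshold_s, action):
    if now_s - run["last_progress_s"] < stall_threshold_s:
        return "healthy"
    if action == "log_only":
        return "logged"                     # observe only, no intervention
    if action == "reconcile":
        return "resume_callback_enqueued"   # re-drive progression
    if action == "fail_workflow":
        return "failed"                     # give up and mark the run failed
    raise ValueError(f"unknown WF_STALL_ACTION {action}")

assert reap({"last_progress_s": 0}, now_s=900, stall_threshold_s=600,
            action="reconcile") == "resume_callback_enqueued"
```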

Known operational limits

  • Max workflow steps per definition: 1000
  • concurrency_key max length: 128 chars
  • event_key max length: 512 chars
  • Sub-workflow default nesting depth limit: 10

When a workflow appears stuck:

  1. Check graph: /workflow-runs/{id}/graph
  2. Check decisions: /workflow-runs/{id}/explain
  3. Inspect policy constraints for the project
  4. Retry a single step or replay subtree if failure is localized
  5. Use stalled-workflow reaper action reconcile in environments where auto-healing is preferred