Strait Docs

Deep dive into strait internals, design decisions, and component interactions.

Strait is designed as a distributed job orchestration service that uses PostgreSQL as its primary state store and message queue.

System Overview

                    ┌──────────────────────────────────┐
                    │           API Server              │
                    │  (Chi router + middleware)         │
                    │                                    │
                    │  /v1/jobs/* ── Job CRUD + Health   │
                    │  /v1/workflows/* ── DAG CRUD + plan │
                    │  /v1/workflow-runs/* ── graph/explain│
                    │  /v1/workflow-policies/* ── governance│
                    │  /v1/jobs/{id}/trigger ── Enqueue   │
                    │  /v1/runs/* ── Run mgmt + DLQ       │
                    │  /v1/events/* ── Event triggers     │
                    │  /sdk/v1/* ── SDK (JWT auth)        │
                    │  /metrics ── Prometheus              │
                    └──────────┬───────────────────────┘
                               │ Enqueue (budget check)
                               v
                    ┌──────────────────────────────────┐
                    │         PostgreSQL                 │
                    │                                    │
                    │  jobs ── job definitions           │
                    │  job_runs ── run state + queue     │
                    │  workflows ── DAG definitions      │
                    │  workflow_runs ── workflow state   │
                    │  event_triggers ── durable waits   │
                    │  run_events ── log entries         │
                    │  run_usage ── AI cost tracking     │
                    │  environments ── endpoint config   │
                    │  project_quotas ── budget limits   │
                    │                                    │
                    │  Queue: SELECT FOR UPDATE          │
                    │         SKIP LOCKED                │
                    └──────────┬───────────────────────┘
                               │ Dequeue
                               v
                    ┌──────────────────────────────────┐
                    │         Worker Executor            │
                    │                                    │
                    │  Poll ─> DequeueN(available)       │
                    │  Workflow Engine:                  │
                    │  - DAG Validation (Kahn's)         │
                    │  - Atomic Fan-in (UPDATE...RET)    │
                    │  - Condition Evaluation            │
                    │  - Template Rendering              │
                    │  - Sub-workflow Nesting            │
                    │                                    │
                    │  Job Execution:                    │
                    │  - Resolve ─> Env override + SSRF  │
                    │  - Execute ─> HTTP POST to endpt   │
                    │  - Retry ─> Smart strategy select  │
                    │  - Trace ─> Execution timing       │
                    │  - DLQ ─> Dead letter on exhaust   │
                    └──────────┬───────────────────────┘
                               │ Webhook / PubSub
                               v
                    ┌──────────────────────────────────┐
                    │  Scheduler         │  Redis       │
                    │  - Cron ticker     │  - PubSub    │
                    │  - Delayed poller  │  - SSE       │
                    │  - Stale reaper    │  streaming   │
                    │  - Event reaper    │              │
                    │  - Retention       │              │
                    │  - Pool pruner     │              │
                    └──────────────────────────────────┘

                               v
                    ┌──────────────────────────────────┐
                    │  Fly Machines (managed execution) │
                    │                                    │
                    │  Warm Pool ─> Start ─> Wait        │
                    │  Paused   ─> Start ─> Wait        │
                    │  Cold     ─> Create ─> Wait       │
                    │                                    │
                    │  Presets: micro .. large-2x        │
                    │  auto_destroy=false (reusable)     │
                    └──────────────────────────────────┘

                               v
                    ┌──────────────────────────────────┐
                    │  ClickHouse (optional analytics)  │
                    │                                    │
                    │  run_events ── lifecycle timeline  │
                    │  run_analytics ── run summaries   │
                    │  compute_usage ── cost tracking    │
                    │                                    │
                    │  Async batch export (10s / 100)    │
                    └──────────────────────────────────┘

The system is distributed as a single Go binary that can run in three modes:

  • api: Handles HTTP requests, job management, and triggering.
  • worker: Runs executor, scheduler, and background maintenance tasks.
  • all: Runs both API and worker components in a single process.

Graceful shutdown is implemented using errgroup and signal handling to ensure in-flight jobs are completed and resources are released cleanly.

Component Architecture

The core logic resides in the internal/ directory, organized into specialized packages.

api

Implements Chi HTTP router, middleware chain, and authentication (Internal Secret + Run Token). Handles SSRF validation, health scoring, debug bundles, and DLQ management.

worker

Manages the pond/v2 executor pool with bounded queue backpressure, HTTP dispatch, smart retries, and state transitions. Exposes OTel metrics for pool observability.

workflow

Manages DAG execution, Kahn's algorithm validation, atomic fan-in logic, step conditions, and sub-workflow nesting.

queue

Implements Postgres-backed queue using SKIP LOCKED for atomic job claiming and batch dequeue operations.

store

Provides raw SQL access via pgx/v5 with instrumented OTel spans, optimistic locking, and interface segregation.

scheduler

Background components for Cron, delayed polling, stale run reaping, retention, and cost budget enforcement.

compute

Container runtime abstraction for managed execution. Fly Machines and Docker backends, warm machine pool, machine lifecycle (Create, Start, Wait, Stop, Destroy), cost estimation.

clickhouse

Optional analytics backend. Async batch export of run events, analytics summaries, and compute usage to ClickHouse for time-series reporting.

webhook

Async webhook delivery with retry policies, circuit breaker, HMAC-SHA256 signing, and dead letter tracking.

cdc

Change data capture via Sequin. Polls database changes for real-time event streaming to external consumers.

Technology Stack

Component       Technology                        Role
Runtime         Go 1.26                           Single binary, native concurrency, instant cold start
Database        PostgreSQL 18                     State store + message queue via SELECT FOR UPDATE SKIP LOCKED
Cache/PubSub    Redis 8                           Pub/sub for SSE streaming, CDC event distribution
SQL Driver      pgx/v5                            Raw SQL with connection pooling and prepared statements
HTTP Router     Chi/v5                            Composable middleware, low-allocation request handling
Concurrency     sourcegraph/conc                  Panic-safe goroutine pools with context-aware cancellation
Worker Pool     alitto/pond/v2                    Bounded queue backpressure with Prometheus metrics
CDC             Sequin                            Postgres WAL streaming through Redis channels
Observability   OpenTelemetry                     Vendor-neutral tracing (Jaeger/Tempo) and Prometheus metrics
Auth Tokens     golang-jwt/v5                     HS256 JWT for SDK run tokens (60s expiry)
Scheduling      robfig/cron/v3                    5-field cron expressions with timezone support
Migrations      golang-migrate/v4                 Embedded SQL migrations via go:embed
Utilities       samber/lo                         Type-safe generic collection operations
Testing         google/go-cmp, testcontainers-go  Structural diffs, real Postgres/Redis in integration tests

For detailed rationale behind each technology choice, see Technology Choices.

Queue Mechanics

Strait uses PostgreSQL as a message queue to minimize operational complexity and ensure transactional consistency.

SKIP LOCKED Dequeue

Multiple workers can poll the same job_runs table concurrently. SELECT FOR UPDATE SKIP LOCKED locks only the rows that will be processed by this worker and skips rows already locked by other workers. This provides:

  • Lock-free concurrency: Workers don't block each other. Each worker claims rows it can process immediately.
  • Exactly-once claiming: A row locked by one worker cannot be claimed by another while the lock is held, preventing duplicate execution of the same run.
  • ACID guarantees: The dequeue and status update happen in a single transaction. If the worker crashes after claiming but before processing, the transaction rolls back and the run returns to queued state.

Batch Dequeue (DequeueN)

Using a Common Table Expression (CTE), workers can claim up to N rows in a single database round-trip. The CTE selects rows with SKIP LOCKED and returns their IDs, then the outer UPDATE transitions them to dequeued. This amortizes network latency across multiple jobs.
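
A sketch of what such a batch dequeue can look like, using column names from the job_runs schema later in this document (the production query is not shown here and may differ):

```sql
-- Illustrative SKIP LOCKED batch dequeue; the exact production query may differ.
WITH claimed AS (
    SELECT id
    FROM job_runs
    WHERE status = 'queued'
      AND (scheduled_at IS NULL OR scheduled_at <= NOW())
      AND (next_retry_at IS NULL OR next_retry_at <= NOW())
    ORDER BY priority DESC, created_at ASC
    LIMIT $1
    FOR UPDATE SKIP LOCKED
)
UPDATE job_runs r
SET status = 'dequeued'
FROM claimed
WHERE r.id = claimed.id
RETURNING r.id, r.job_id, r.payload, r.attempt;
```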

Priority and Ordering

The queue respects two ordering dimensions:

  • Priority: Jobs with higher priority values are dequeued first (ORDER BY priority DESC).
  • FIFO within priority: Jobs with the same priority follow first-in-first-out ordering (ORDER BY created_at ASC).

Delayed and Retry Gating

The dequeue query includes filters for scheduled_at and next_retry_at columns. Jobs are only dequeued when:

  • scheduled_at IS NULL OR scheduled_at <= NOW(): Delayed jobs wait until their scheduled time.
  • next_retry_at IS NULL OR next_retry_at <= NOW(): Retrying jobs wait until their backoff period expires.

Visibility Timeout

The transition from queued to dequeued acts as the visibility timeout mechanism. If a worker crashes after claiming but before processing, the stale reaper identifies runs stuck in dequeued state with no recent heartbeat and transitions them back to queued for reprocessing.

Finite State Machine (FSM)

The lifecycle of a job run is managed by an FSM with 13 possible states.

                    ┌─────────┐
                    │ delayed │
                    └────┬────┘
                         │ scheduled_at <= NOW
                         v
    ┌────────────────┬─────────┬────────────────────┐
    │                │ queued  │                     │
    │                └────┬────┘                     │
    │                     │ dequeue                  │
    │                     v                          │
    │              ┌──────────┐                      │
    │              │ dequeued │──────────┐            │
    │              └────┬─────┘          │            │
    │                   │ start          │ system     │
    │                   v                │ failure    │
    │    retry    ┌───────────┐          │            │
    │  ┌─────────>│ executing │          │            │
    │  │          └─────┬─────┘          │            │
    │  │  ┌─────────┬───┴───┬─────────┐ │            │
    │  │  │         │       │         │ │            │
    │  │  v         v       v         v v            │
    │  │ completed failed timed_out system_failed    │
    │  │                │                             │
    │  │                v (max attempts reached)      │
    │  │          ┌─────────────┐                     │
    │  │          │ dead_letter │                     │
    │  │          └─────────────┘                     │
    │  │                                              │
    │  └── (attempt < max_attempts) ──────────────────┘
    │                                                 │
    │  canceled <── (any non-terminal) ──────────────┘
    │  expired  <── (delayed, queued with expires_at) ┘

Valid Transitions

  • delayed: Moves to queued when the scheduled time arrives. Can be canceled or expired before execution.
  • queued: Dequeued by a worker via SKIP LOCKED, manually canceled, or expired if the TTL elapses.
  • dequeued: Starts execution, returns to the queue if the worker fails, or moves to system_failed on a dispatch failure.
  • executing: Reaches a terminal state indicating the outcome, or enters waiting for SDK checkpoints. Re-queued for retry if attempt < max_attempts; dead_letter if all retries are exhausted.
  • waiting: SDK-controlled execution (checkpoints, continuation). Returns to executing or a terminal state.
  • dead_letter: Replay via the DLQ management endpoint resets the attempt counter and status.

Optimistic Locking

State transitions use optimistic locking with WHERE status = $from clause in the UPDATE statement. Multiple workers can attempt the same transition simultaneously, but only one will match the row and succeed. The others receive a 0-row update result and must retry the entire transition. This prevents race conditions without distributed locks or version numbers.
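
A minimal sketch of the pattern (table and column names follow the schema in this document; the real statements set more fields):

```sql
-- Illustrative optimistic transition (dequeued -> executing).
-- Zero rows affected means another worker already performed the transition.
UPDATE job_runs
SET status = 'executing', started_at = NOW()
WHERE id = $1
  AND status = 'dequeued';
```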

Authentication

The system implements two distinct authentication schemes:

  • Internal Secret: Used for the management API (/v1/*). Requires an Authorization: Bearer <INTERNAL_SECRET> header. Simple string comparison provides fast auth for job CRUD, triggering, and run management.

  • Run Token: Used for the SDK API (/sdk/v1/*). A JWT signed with HS256 using JWT_SIGNING_KEY. The token contains sub (the run ID), exp (job timeout + 60s), and iat. Auth is stateless: the server validates the signature on each request.

Execution Lifecycle

Trigger: The API receives a POST to /v1/jobs/{id}/trigger. A cost budget check is performed, a job_run is created with status=queued (or delayed), and a JWT run token is generated.

Dequeue: A worker calls DequeueN, claiming runs via SKIP LOCKED and updating their status to dequeued. The optimistic lock ensures only one worker claims each run.

Submit: Execution is submitted to the pond/v2 worker pool using context.WithoutCancel, so in-flight jobs are not canceled when the parent context is canceled during shutdown. A background goroutine writes heartbeats to job_runs.heartbeat_at.

Resolve: If the job has an environment_id, the executor resolves environment variables, checks for an ENDPOINT_URL override, and validates it with SSRF protection (blocking private and loopback IPs).

Dispatch: The executor sends an HTTP POST to the resolved endpoint URL. Headers include X-Run-ID, X-Job-ID, X-Attempt, and X-SDK-Token; the body carries the job payload and metadata.

Trace: A timing breakdown (queue_wait_ms, dequeue_ms, connect_ms, ttfb_ms, transfer_ms, total_ms) is captured and stored as JSONB in the execution_trace column.

Settle: A 2xx response sets status=completed and stores the result. A non-2xx response schedules a retry using the job's retry_strategy or marks the run failed. A timeout leads to a retry or timed_out, with adaptive timeout adjustment applied.

Dead-letter: If attempt >= max_attempts, the run transitions to dead_letter instead of failed. Dead-lettered runs are excluded from normal listings.

Notify: Upon reaching a terminal state, a signed POST is sent to the job's webhook_url. Job-run webhooks are retried up to 3 times with exponential backoff (1s, 5s delays). Event-trigger webhooks use persistent delivery via the webhook_deliveries table with up to 5 retries and exponential backoff, surviving process restarts. See Webhooks for the full delivery lifecycle.

Stream: The state change is published to the Redis channel run:{runID} for SSE streaming.

Workflow Engine

The workflow engine manages execution of Directed Acyclic Graphs (DAGs) where each node represents a job execution, human approval, or nested sub-workflow.

DAG Validation

Before execution, the workflow definition is validated using Kahn's algorithm for cycle detection. The algorithm:

  1. Identifies all steps with no dependencies (roots).
  2. Removes roots from the graph and identifies newly exposed roots.
  3. Repeats until all steps are processed.

If a cycle is detected, the workflow is rejected before any runs are created. This prevents infinite execution loops that could exhaust resources.

Atomic Fan-In

When multiple parent steps complete concurrently, their child step must start only when all dependencies are satisfied. The engine uses a deps_completed counter on each workflow_step_run. As each parent completes, an UPDATE ... RETURNING query atomically increments the counter and identifies runs where deps_completed == deps_required. This ensures exactly one parent triggers the child, even with concurrent completions.
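
An illustrative sketch of the atomic increment-and-check (the production query may differ in shape):

```sql
-- Count one completed parent and learn, in the same statement, whether
-- this child step is now ready to start.
UPDATE workflow_step_runs
SET deps_completed = deps_completed + 1
WHERE id = $1
RETURNING deps_completed >= deps_required AS ready;
```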

Step-level concurrency groups (concurrency_key) are enforced during scheduling. Steps that share the same key are serialized even when dependency fan-in says they are ready.

Payload Merging

Step payloads are constructed by merging three layers:

  1. Trigger payload: Initial payload provided when workflow is triggered.
  2. Step payload: Static payload defined on the step in workflow definition.
  3. Parent outputs: Outputs from all direct dependency steps, keyed by step_ref.

Template variables {{parent_outputs.stepRef.field}} allow accessing nested fields. The engine preserves native JSON types—arrays, objects, strings—so no manual serialization is needed.

Template Rendering

The template engine renders {{variable}} placeholders using dot notation. Example: {{parent_outputs.extractData.id}} accesses the id field of the extractData object from a parent step's output. This provides flexible data flow while maintaining type safety.

Output Transform

Steps can define an output_transform using JSONPath expressions. Before persisting a step's output, the engine extracts the matching subset of the job result. This reduces storage overhead and simplifies downstream steps that only need specific fields. Example: $.data.items[*].id extracts all IDs from an array.

Step Types

job: Executes a regular job. The workflow creates a new job_run and waits for it to reach a terminal state before marking the step as completed.

approval: A human-in-the-loop gate. Creates a workflow_step_approvals record with pending status, plus a parallel event trigger for API-based resolution. The workflow pauses until approval arrives via the API or a CLI command; the timeout is set by approval_timeout_secs.

wait_for_event: A durable external event wait. Creates an event_triggers record with waiting status and a globally unique event_key. No goroutine is held: the wait is a database row. External systems send events via POST /v1/events/{eventKey}/send. Supports template-rendered event keys and configurable timeouts.

sleep: A durable sleep. Creates a sleep-type event trigger with an expiry time; the reaper completes the trigger when the duration expires, marking the step completed and triggering downstream steps. Supports Go duration strings (e.g., 30s, 5m, 24h).

sub_workflow: Triggers a nested workflow run. The parent workflow waits for the child to complete and aggregates its outputs. Nesting depth is limited by max_nesting_depth to prevent infinite recursion.

Workflow FSM

The workflow engine manages two FSMs—one for workflow runs, one for step runs.

Workflow Run States: pending → running → paused → (completed | failed | timed_out | canceled)

Step Run States: pending → running → (completed | failed), with side transitions to waiting (durable waits and approvals), skipped (condition evaluated to false), and canceled.

Callback handler OnJobRunTerminal drives workflow progression when any job run completes. This single entry point handles all terminal states (completed, failed, timed_out, crashed, canceled, system_failed, dead_letter) and determines next workflow actions—start ready children, apply failure policies, propagate sub-workflow outputs. Event-driven callbacks OnEventReceived, OnStepCompleted, and OnStepFailed handle event trigger resolution, sleep completion, and timeout/cancellation failures respectively. Event chaining is handled by tryEmitEvent, which auto-resolves waiting triggers when steps with event_emit_key complete.

Policy Enforcement and Governance

Workflow governance is managed with per-project policies stored in workflow_policies and exposed via GET|PUT /v1/workflow-policies/{projectID}.

Policy checks run at workflow lifecycle boundaries:

  • Create (POST /v1/workflows)
  • Update (PATCH /v1/workflows/{workflowID})
  • Trigger (POST /v1/workflows/{workflowID}/trigger)

Current policy controls:

  • max_fan_out: maximum number of direct dependents per step
  • max_depth: maximum DAG depth (computed from dependency chains)
  • forbidden_step_types: denylisted step types for the project
  • require_approval_for_deploy: requires at least one approval step for deploy/release-like DAGs

Violations fail fast at the API boundary, preventing invalid DAGs from entering runtime scheduling.

Scheduling Decisions and Explainability

The progression scheduler records key routing decisions in workflow_step_decisions for post-mortem and operator visibility. Decisions include:

  • scheduler: blocked by workflow max_parallel_steps
  • concurrency: blocked by concurrency_key serialization
  • resource: blocked by resource_class quota
  • condition: skipped because condition evaluated to false

These records are exposed via GET /v1/workflow-runs/{workflowRunID}/explain with optional step_ref and decision_type filters.

Runtime Graph and Critical Path Estimation

GET /v1/workflow-runs/{workflowRunID}/graph exposes the live DAG execution graph for a run, including:

  • node execution state (pending|waiting|running|completed|failed|...)
  • dependency edges and roots
  • runnable set (deps satisfied but not running)
  • critical-path estimate fields:
    • critical_path
    • critical_path_estimate_ms
    • critical_path_remaining_ms

Critical-path timing uses observed terminal durations when available, active elapsed time for running steps, and timeout overrides as a fallback estimate.

Runtime Recovery Controls

Operators can recover or replay portions of a DAG without retriggering the entire workflow:

  • POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/retry resets a terminal step to pending and resumes progression.
  • POST /v1/workflow-runs/{workflowRunID}/steps/{stepRef}/replay-subtree resets the selected step and all descendants to pending, then resumes progression.

These controls are useful for branch-local retries after transient failures and for deterministic replay during incident mitigation. See DAG Runtime for endpoint behavior details and DAG Operations Playbook for production runbooks.

Data Model

The system uses PostgreSQL with the following primary tables (66 migrations):

jobs

id                  TEXT PRIMARY KEY              -- UUIDv7, time-ordered
project_id          TEXT NOT NULL
group_id            TEXT                          -- FK to job_groups
name                TEXT NOT NULL
slug                TEXT NOT NULL
description         TEXT
cron                TEXT                          -- 5-field cron expression
payload_schema      JSONB
tags                JSONB                         -- string map tags
endpoint_url        TEXT NOT NULL
fallback_endpoint_url TEXT
max_attempts        INT NOT NULL DEFAULT 3
timeout_secs        INT NOT NULL DEFAULT 300
max_concurrency     INT                           -- per-job concurrency cap
execution_window_cron TEXT                        -- when job can execute
timezone            TEXT                          -- project-level timezone override
rate_limit_max      INT
rate_limit_window_secs INT
dedup_window_secs   INT
enabled             BOOLEAN NOT NULL DEFAULT TRUE
webhook_url         TEXT
webhook_secret      TEXT
run_ttl_secs        INT
retry_strategy      TEXT                          -- exponential|linear|fixed|custom
retry_delays_secs   INT[]                         -- custom per-attempt delays
environment_id      TEXT                          -- FK to environments
version             INT NOT NULL DEFAULT 1
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
UNIQUE (project_id, slug)

job_runs

id                  TEXT PRIMARY KEY              -- UUIDv7
job_id              TEXT NOT NULL REFERENCES jobs(id)
project_id          TEXT NOT NULL
status              TEXT NOT NULL DEFAULT 'queued' -- 13 states incl. dead_letter
attempt             INT NOT NULL DEFAULT 1
payload             JSONB
result              JSONB
metadata            JSONB NOT NULL DEFAULT '{}'   -- key-value annotations
error               TEXT
triggered_by        TEXT NOT NULL DEFAULT 'manual' -- manual, cron, spawn, workflow
scheduled_at        TIMESTAMPTZ
started_at          TIMESTAMPTZ
finished_at         TIMESTAMPTZ
heartbeat_at        TIMESTAMPTZ
next_retry_at       TIMESTAMPTZ
expires_at          TIMESTAMPTZ
parent_run_id       TEXT REFERENCES job_runs(id)
priority            INT NOT NULL DEFAULT 0
idempotency_key     TEXT
job_version         INT NOT NULL DEFAULT 1
workflow_step_run_id TEXT
execution_trace     JSONB                         -- timing breakdown
debug_mode          BOOLEAN NOT NULL DEFAULT FALSE
continuation_of     TEXT                          -- FK for continuation lineage
lineage_depth       INT NOT NULL DEFAULT 0        -- depth in continuation chain
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()

workflows & workflow_runs

-- workflows table
id                  TEXT PRIMARY KEY              -- UUIDv7
project_id          TEXT NOT NULL
name                TEXT NOT NULL
slug                TEXT NOT NULL
description         TEXT
enabled             BOOLEAN NOT NULL DEFAULT TRUE
version             INT NOT NULL DEFAULT 1
timeout_secs        INT
max_concurrent_runs INT
max_parallel_steps  INT
cron                TEXT
cron_timezone       TEXT
skip_if_running     BOOLEAN NOT NULL DEFAULT FALSE
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()

-- workflow_runs table
id                  TEXT PRIMARY KEY              -- UUIDv7
workflow_id         TEXT NOT NULL REFERENCES workflows(id)
project_id          TEXT NOT NULL
status              TEXT NOT NULL DEFAULT 'pending'
triggered_by        TEXT NOT NULL DEFAULT 'manual'
workflow_version    INT NOT NULL DEFAULT 1
max_parallel_steps  INT
payload             JSONB
error               TEXT
retry_of_run_id     TEXT                          -- FK to workflow_runs
parent_workflow_run_id TEXT                       -- FK for sub-workflows
parent_step_run_id  TEXT                          -- FK to workflow_step_runs
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
started_at          TIMESTAMPTZ
finished_at         TIMESTAMPTZ
expires_at          TIMESTAMPTZ

workflow_steps & workflow_step_runs

-- workflow_steps table
id                  TEXT PRIMARY KEY              -- UUIDv7
workflow_id         TEXT NOT NULL REFERENCES workflows(id)
job_id              TEXT                          -- FK to jobs
step_ref            TEXT NOT NULL                 -- Unique within workflow
depends_on          TEXT[] NOT NULL DEFAULT '{}'  -- Array of step_refs
condition           JSONB                         -- step_status|step_status_in|not|all_of|any_of
on_failure          TEXT NOT NULL DEFAULT 'fail_workflow'
payload             JSONB                         -- Static payload + templates
step_type           TEXT NOT NULL DEFAULT 'job'   -- job|approval|sub_workflow|wait_for_event|sleep
approval_timeout_secs INT
approval_approvers  TEXT[]
retry_max_attempts  INT NOT NULL DEFAULT 0
retry_backoff       TEXT NOT NULL DEFAULT 'exponential'
retry_initial_delay_secs INT NOT NULL DEFAULT 1
retry_max_delay_secs INT NOT NULL DEFAULT 3600
timeout_secs_override INT
output_transform    TEXT                          -- JSONPath expression
sub_workflow_id     TEXT                          -- FK to workflows
max_nesting_depth   INT NOT NULL DEFAULT 10
event_key           TEXT                          -- Event trigger key (wait_for_event)
event_emit_key      TEXT                          -- Auto-emit event on completion
sleep_duration_secs INT                           -- Sleep duration in seconds
concurrency_key     TEXT                          -- Step-level concurrency group key
resource_class      TEXT NOT NULL DEFAULT 'small' -- Scheduler capacity bucket
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()

-- workflow_step_runs table
id                  TEXT PRIMARY KEY              -- UUIDv7
workflow_run_id     TEXT NOT NULL REFERENCES workflow_runs(id)
workflow_step_id    TEXT NOT NULL REFERENCES workflow_steps(id)
step_ref            TEXT NOT NULL
job_run_id          TEXT                          -- FK to job_runs
attempt             INT NOT NULL DEFAULT 1
status              TEXT NOT NULL DEFAULT 'pending'
deps_completed      INT NOT NULL DEFAULT 0
deps_required       INT NOT NULL DEFAULT 0
output              JSONB
error               TEXT
started_at          TIMESTAMPTZ
finished_at         TIMESTAMPTZ
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()

workflow governance and explainability

-- workflow_policies table
id                  TEXT PRIMARY KEY              -- UUIDv7
project_id          TEXT NOT NULL UNIQUE
max_fan_out         INT NOT NULL DEFAULT 0
max_depth           INT NOT NULL DEFAULT 0
forbidden_step_types TEXT[] NOT NULL DEFAULT '{}'
require_approval_for_deploy BOOLEAN NOT NULL DEFAULT FALSE
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()

-- workflow_step_decisions table
id                  TEXT PRIMARY KEY              -- UUIDv7
workflow_run_id     TEXT NOT NULL REFERENCES workflow_runs(id) ON DELETE CASCADE
step_run_id         TEXT NOT NULL REFERENCES workflow_step_runs(id) ON DELETE CASCADE
step_ref            TEXT NOT NULL
decision_type       TEXT NOT NULL                 -- scheduler|concurrency|resource|condition
decision            TEXT NOT NULL                 -- blocked|skip|allow
explanation         TEXT NOT NULL
details             JSONB
created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()

event_triggers

id                    TEXT PRIMARY KEY              -- UUIDv7
event_key             TEXT NOT NULL UNIQUE           -- Globally unique caller-controlled key
project_id            TEXT NOT NULL
source_type           TEXT NOT NULL                  -- 'workflow_step' or 'job_run'
trigger_type          TEXT NOT NULL DEFAULT 'event'  -- 'event' or 'sleep'
workflow_run_id       TEXT                           -- FK to workflow_runs
workflow_step_run_id  TEXT                           -- FK to workflow_step_runs
job_run_id            TEXT                           -- FK to job_runs
status                TEXT NOT NULL DEFAULT 'waiting'
timeout_secs          INT NOT NULL DEFAULT 3600
request_payload       JSONB
response_payload      JSONB
requested_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
received_at           TIMESTAMPTZ
expires_at            TIMESTAMPTZ
error                 TEXT
notify_url            TEXT                           -- Webhook URL for notifications
notify_status         TEXT                           -- pending, sent, failed, dead
event_emit_key        TEXT                           -- Auto-emit key for event chaining
sent_by               TEXT                           -- Sender identity audit field
created_at            TIMESTAMPTZ NOT NULL DEFAULT NOW()
updated_at            TIMESTAMPTZ NOT NULL DEFAULT NOW()

Key Indexes

Partial and composite indexes ensure query performance at scale:

  • idx_runs_queue: (status, next_retry_at, priority DESC, created_at ASC) WHERE status = 'queued' — Composite dequeue lookup covering priority ordering and retry gating
  • idx_runs_priority: (priority DESC, created_at ASC) WHERE status = 'queued' — Priority-ordered FIFO fallback
  • idx_runs_idempotency: (job_id, idempotency_key) WHERE idempotency_key IS NOT NULL UNIQUE — Deduplication
  • idx_runs_heartbeat: (heartbeat_at) WHERE status = 'executing' — Stale run detection
  • idx_runs_retry: (next_retry_at) WHERE status = 'queued' AND next_retry_at IS NOT NULL — Retry gating
  • idx_runs_dead_letter: (status) WHERE status = 'dead_letter' — DLQ queries
  • idx_runs_continuation: (continuation_of) — Lineage tree traversal
  • idx_runs_debug: (debug_mode) WHERE debug_mode = TRUE — Debug mode queries
  • idx_event_triggers_status_expires: (status, expires_at) WHERE status = 'waiting' — Reaper timeout lookup
  • idx_event_triggers_project_status: (project_id, status) — Project-scoped list queries
  • idx_event_triggers_prefix: (event_key text_pattern_ops) — Prefix matching for batch resolution
  • idx_workflow_runs_status_expires: (status, expires_at) WHERE expires_at IS NOT NULL — Reaper timeout lookup for workflow runs
  • idx_workflow_runs_status_finished: (status, finished_at) WHERE finished_at IS NOT NULL — Retention deletion queries
  • idx_step_runs_workflow_run_status: (workflow_run_id, status) — Step run queries filtered by run and status
  • idx_step_decisions_step_run_id: (step_run_id) — CASCADE deletes on workflow step decisions
  • idx_job_runs_active_by_job: (job_id) WHERE status IN ('queued', 'dequeued', 'executing') — Active-run counting at dequeue

Design Decisions

UUIDv7 Primary Keys

UUIDv7 combines a timestamp (first 48 bits) with random bits. This provides:

  • Natural sort order: Rows are approximately ordered by creation time without a separate index column
  • No sequence contention: Auto-increment integers create hotspots during concurrent inserts
  • Distributed-friendly: Can be generated without coordination across multiple nodes

The time-ordered property enables efficient time-range queries—selecting runs created in the last hour—by scanning primary key order with index-only scans.

Raw SQL Over ORM

Hand-written SQL via pgx/v5 provides complete control over query execution plans. For operations like SELECT FOR UPDATE SKIP LOCKED and CTE-based batch dequeue, the exact query structure matters. ORMs may generate suboptimal queries or fail to support these PostgreSQL-specific features efficiently.

The trade-off is verbosity—SQL queries require manual maintenance—but the benefits include predictable performance, transparent optimization, and the ability to use database-specific features without fighting the abstraction layer.

Embedded Migrations

Using go:embed, migration files are compiled into the binary. This eliminates the need to bundle migration files separately or mount them in containers. The binary is truly self-contained—no external file dependencies at startup.

Migrations run automatically on startup: the schema_migrations table is checked for the current version, and pending migrations are applied in a transaction. This ensures the database schema always matches the application code.

Single Binary, Three Modes

Strait compiles to a single executable that can run in three modes:

  • api: HTTP server only, for scaling API endpoints independently
  • worker: Executor and scheduler only, for scaling job processing independently
  • all: Combined mode for development or small deployments

This flexibility enables horizontal scaling without multiple deployment artifacts. Deploy the same binary in production with different mode flags.

context.WithoutCancel for In-flight Jobs

When a job run is dispatched, it uses a child context created with context.WithoutCancel(parent). This means if the parent context is canceled (strait shutdown), the goroutine continues execution until the HTTP request completes. Combined with errgroup, this ensures graceful shutdown—in-flight jobs finish before the process exits.

Engine Capabilities

The engine capabilities described in this section are built in and available by default.

Micro-USD Cost Tracking

AI costs are tracked in micro-USD (1/1,000,000 USD) stored as BIGINT. This avoids floating-point precision issues that accumulate with repeated additions. Integer arithmetic provides exact financial calculations down to $0.000001 resolution.

Dead Letter as FSM State

The DLQ is modeled as a first-class FSM state (dead_letter) rather than a separate table. This design choice:

  • Allows standard run queries to include or exclude dead-lettered runs via status filter
  • Enables DLQ replay using the same state transition mechanism as retries
  • Maintains FSM invariants—all state transitions go through the same validation logic

Transactional Event Sends

When an event resolves a workflow step trigger, the trigger status update and step completion are wrapped in a single database transaction via runInTx. If either operation fails, both roll back atomically. This prevents the inconsistency where a trigger is marked received but the step remains waiting. The same pattern wraps workflow create/update/clone, SDK wait-for-event, and workflow run retry operations.

Fan-in progression (starting downstream steps) runs outside the transaction because it involves queue operations and multiple tables. The reconciliation reaper acts as a safety net for edge cases where progression fails.

Distributed Reaper with Advisory Locks

In multi-instance deployments, the reaper uses pg_try_advisory_lock to ensure single-leader execution. Each reaper cycle:

  1. Attempts to acquire session-level advisory lock 0x5374726169745265 ("StraitRe")
  2. If acquired: runs all reaper passes, then releases the lock
  3. If not acquired: skips the cycle entirely

This provides leader election without external coordination (no ZooKeeper, no etcd). If the active instance crashes, the lock is released when its PostgreSQL session ends, and another instance acquires it on the next tick.

Interface Segregation

Store packages define small, focused interfaces (JobStore, RunStore, WorkflowStore, etc.) that only include methods needed by callers. This improves:

  • Testability: Mocks only implement required methods, not entire store
  • Decoupling: Changes to unrelated store methods don't require recompiling callers
  • Encapsulation: Internal store implementation details hidden from consumers

The WithTx helper accepts any DBTX interface (connection pool or transaction), enabling callers to write transaction-aware code without hard-coded transaction handling.
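
A minimal sketch of the pattern (the interface and method names here are illustrative, not Strait's actual signatures):

```go
package main

import "fmt"

// Execer is a DBTX-style interface: both a connection pool and a
// transaction can satisfy it, so store code is transaction-agnostic.
type Execer interface {
	Exec(query string, args ...any) error
}

// RunStore is a small, caller-focused interface: consumers depend
// only on the methods they actually call.
type RunStore interface {
	MarkCompleted(runID string) error
}

type pgRunStore struct{ db Execer }

func (s pgRunStore) MarkCompleted(runID string) error {
	return s.db.Exec("UPDATE job_runs SET status = 'completed' WHERE id = $1", runID)
}

// fakeDB satisfies Execer for tests: no mock of a full pgx.Tx needed.
type fakeDB struct{ queries []string }

func (f *fakeDB) Exec(q string, args ...any) error {
	f.queries = append(f.queries, q)
	return nil
}

func main() {
	db := &fakeDB{}
	var store RunStore = pgRunStore{db: db}
	_ = store.MarkCompleted("run-1")
	fmt.Println(len(db.queries)) // 1
}
```

Because the store depends only on Execer, tests swap in a recording fake instead of mocking an entire database client.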

SSRF Validation in Both Layers

Endpoint URL validation happens in both the API server (job creation/update) and the worker (environment override resolution). This defense-in-depth prevents bypassing SSRF checks via environment variable injection. Blocked address ranges include:

  • Private IPv4 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Loopback addresses (127.0.0.0/8, ::1/128)
  • Link-local addresses (169.254.0.0/16)
  • CGNAT addresses (100.64.0.0/10)
  • IPv6 Unique Local Addresses (fc00::/7)

Performance Optimizations

Workflow Context Caching

During step callback processing, the workflow engine loads the workflow run, step definitions, and step-by-ref index. These are immutable for the lifetime of a single callback. The wfCtx struct caches this data after the first load, eliminating 10-15 redundant database queries per step completion in multi-step workflows.

Transaction Safety via runInTx

Multi-step write operations (workflow creation, event trigger registration, retry runs) use a runInTx abstraction that wraps store calls in a database transaction. In production, this delegates to store.WithTx; in tests, it passes through to the mock store directly. This provides atomicity without requiring mock implementations of pgx.Tx.

Key operations wrapped in transactions:

  • Workflow create/update/clone (version snapshot + steps)
  • SDK wait-for-event (run status + event trigger creation)
  • Workflow run retry (new run + step runs)
  • Event send (trigger resolution + step completion)

Bounded Reaper Queries

All reaper and poller queries include a LIMIT clause (default 1000) to prevent unbounded result sets during incident recovery or backlog processing. The retention worker uses CTE-based batch deletes with FOR UPDATE SKIP LOCKED (batch size 5000) instead of unbounded DELETE statements.

Reaper N+1 Elimination

The workflow timeout reaper previously issued 2+2N database calls per timed-out workflow run (fetching step runs individually, then canceling each). This was replaced with 4 bulk calls per run using existing CancelNonTerminalStepRuns and CancelJobRunsByWorkflowRun methods.

Bulk Operation Batching

Bulk cancel (POST /v1/runs/bulk-cancel) was rewritten from 4N queries to 3 queries: batch fetch via GetRunsByIDs, batch cancel via BulkCancelRuns (single UPDATE with status filter), and batch child cancel via CancelChildRunsByParentIDs.

Bulk trigger (POST /v1/jobs/{jobID}/trigger/bulk) pre-computes project quota counts once before the loop instead of per-item, eliminating 2 redundant queries per item.

Graceful Shutdown

The shutdown sequence protects in-flight work:

  1. signal.NotifyContext captures SIGINT and SIGTERM. Context cancellation propagates through errgroup.
  2. The worker polling loop exits on context cancellation. pool.Shutdown() blocks until all in-flight goroutines complete.
  3. The cron ticker stops, the delayed poller exits, and the reaper stops. No new jobs are scheduled.
  4. The server shuts down with a 10-second grace period to drain in-flight HTTP requests.
  5. Database connections are closed with context cancellation, Redis connections are closed, and OTel exporters flush metrics.

The use of context.WithoutCancel for job dispatch ensures that jobs already executing when shutdown begins are allowed to complete and record their results. This prevents partial execution and data loss during deployments.

Observability

Distributed Tracing

OpenTelemetry spans cover critical operations:

  • HTTP requests: otelchi middleware traces request duration, path, and status code
  • Database queries: Each store method is instrumented with span attributes (query type, table)
  • Queue operations: Dequeue, enqueue, and retry operations traced
  • Workflow execution: Step completion, fan-in triggers, sub-workflow propagation traced

Spans include parent-child relationships, linking a slow job run across API server, worker, and external endpoint in trace visualization tools like Jaeger or Tempo.

Prometheus Metrics

Metrics exposed at GET /metrics:

  • strait_run_transitions_total: Counter for each FSM state transition (labeled by from_status, to_status)
  • strait_dispatch_duration_seconds: Histogram of HTTP dispatch latency (labeled by job_id, outcome)
  • strait_dequeue_duration_seconds: Histogram of queue polling latency
  • strait_worker_pool_active: Gauge of currently active worker goroutines
  • strait_worker_pool_queued: Gauge of tasks waiting in pool buffer

Structured Logging

Structured JSON logging via log/slog captures events with consistent fields:

  • run_id: Job run identifier for correlation
  • job_id: Job identifier for filtering
  • status: FSM state transition
  • error: Error details with stack trace
  • duration_ms: Operation timing

Logs are human-readable in development and machine-parseable in production for log aggregation systems.

Concurrency Model

Multiple worker processes can run concurrently, each polling the same PostgreSQL table. Coordination is handled entirely by SELECT FOR UPDATE SKIP LOCKED:

  1. No distributed coordination needed: Workers don't know about each other. Database locks provide coordination.
  2. Scalability: Add worker processes to increase dequeue throughput. No leader election or worker registry.
  3. Safety: Optimistic locking prevents race conditions during state transitions. Workers retry on conflict.

The worker pool uses alitto/pond/v2 which provides:

  • Bounded queue: Buffer channel limits task submissions, preventing unbounded memory growth
  • Backpressure: When queue is full, task submission blocks, signaling workers are at capacity
  • Graceful shutdown: Pool waits for all in-flight tasks before returning
  • Metrics: Exposes pool metrics for capacity planning

Scheduler Background Tasks

The scheduler package runs several background goroutines:

A cron scheduler uses robfig/cron/v3 to maintain job schedules. Every second, it checks for jobs whose next execution time has arrived and enqueues new runs with triggered_by = 'cron'. It supports cron expressions, timezones, and execution windows.

A delayed-run poller checks every 5 seconds for job_runs with status = 'delayed' where scheduled_at <= NOW() and transitions them to queued for dequeue. This enables delayed job execution without cron.

A stale-run reaper runs on the configured reaper interval and performs multiple stale-recovery passes:

  • Stale dequeued runs: re-queues runs stuck in dequeued
  • Stale executing runs: marks runs as crashed on heartbeat loss
  • Stalled workflow runs: detects workflow runs with no progression for WF_STALL_THRESHOLD and applies WF_STALL_ACTION (log_only, reconcile, or fail_workflow)
  • Timed out workflow runs: marks workflow run timed_out, cancels non-terminal steps/job runs, and cancels pending event triggers

An event trigger reaper polls every 30 seconds for expired event triggers (status = 'waiting' and expires_at <= NOW()), transitions them to timed_out, and drives step/run failure via OnStepFailed. It also handles:

  • Sleep triggers: Marks expired sleep triggers as received and completes the step via OnStepCompleted
  • Inconsistent triggers: Finds received triggers with still-waiting steps/runs (30s grace period) and re-resolves them
  • Expired approvals: Fails approval steps past their timeout

In multi-instance deployments, the reaper uses PostgreSQL advisory locks (pg_try_advisory_lock) to ensure exactly one instance runs the reaper cycle at a time. Other instances skip the cycle and try again on the next interval — no external coordination service required.

A retention worker runs every hour and deletes terminal runs based on retention policy, using batch DELETE statements for efficient cleanup:

  • completed/failed/canceled/expired: 30 days
  • timed_out/crashed/system_failed: 90 days
  • event triggers: configurable via EVENT_TRIGGER_RETENTION

Health Endpoints

Strait exposes health and readiness endpoints for infrastructure orchestration:

  • GET /health (liveness): Returns 200 if the process is alive. Used by Kubernetes liveness probes.
  • GET /health/ready (readiness): Returns 200 when all dependencies are ready (database, Redis, worker pool). Returns 503 with a JSON body detailing which component is not ready. Used by Kubernetes readiness probes to prevent routing traffic to uninitialized instances.

The health check system uses a component registry (apps/strait/internal/health/registry.go) where each subsystem registers its health check function. This enables extensible readiness verification without modifying the health endpoint handler.

Resilience Patterns

Strait implements several resilience patterns to protect against cascading failures:

  • Circuit Breaker: Per-endpoint circuit breaker tracks consecutive failures. When a threshold is reached, the circuit opens and subsequent dispatches are short-circuited to prevent overwhelming a failing endpoint. State is persisted in the endpoint_circuit_state PostgreSQL table.
  • Concurrency Limiting: Per-job max_concurrency cap is enforced at dequeue time via a COUNT subquery in the SKIP LOCKED query.
  • Rate Limiting: Per-job rate_limit_max and rate_limit_window_secs control dispatch throughput.
  • Retry Strategies: Configurable per-job with exponential, linear, fixed, or custom per-attempt delays. All strategies apply +/-20% jitter to prevent thundering herd.

For detailed resilience patterns, see Resilience.

Managed Container Execution

In addition to HTTP dispatch, Strait supports managed execution where jobs run inside ephemeral Fly Machines. This is designed for long-running tasks, custom runtimes, GPU workloads, and isolated execution environments.

Dispatch Priority

When the executor picks up a managed run, it resolves a machine through a three-tier priority system:

1. Warm Pool    →  Stopped machine from a previous run (same image + region)
2. Paused Reuse →  Stopped machine from a paused run (machine_id preserved on row)
3. Cold Create  →  New Fly Machine provisioned via Fly Machines API

All three paths converge at Wait(machineID, timeout).

Machine Lifecycle

Managed machines are created with auto_destroy=false so Fly does not garbage-collect them on stop. This enables reuse across the warm pool and pause/resume paths. The Start() method performs a three-step restart: GET current config, PUT updated environment variables, POST start.

cold_create ─→ running ─→ stopped ──→ Start() ─→ running (warm start)
                                   └─→ Destroy()  (eviction/shutdown)

Warm Machine Pool

The MachinePool is an in-memory cache keyed by "{imageURI}:{region}". After a clean exit (exit code 0 and SDK reported completion), the machine is returned to the pool instead of being destroyed. Subsequent dispatches for the same image+region can acquire a stopped machine and restart it in 1-2 seconds instead of the 5-15 second cold create time.

Pool operations:

  • Acquire: Pops the oldest machine for a given key (FIFO).
  • Release: Pushes a machine. Evicts oldest if at capacity (configurable maxPer, default 3).
  • Prune: Background goroutine removes machines idle for more than 10 minutes every 5 minutes.
  • Drain: On shutdown, all pooled machines are destroyed.

Eviction callbacks are bounded by a semaphore (max 10 concurrent goroutines) with inline fallback to prevent unbounded goroutine growth.
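
The keying, FIFO, and capacity-eviction behavior can be sketched as follows (type and method names are illustrative; the real pool also needs a mutex and destroys evicted machines):

```go
package main

import "fmt"

// machinePool caches stopped machine IDs keyed by "image:region".
type machinePool struct {
	maxPer int
	byKey  map[string][]string // oldest first
}

func newMachinePool(maxPer int) *machinePool {
	return &machinePool{maxPer: maxPer, byKey: map[string][]string{}}
}

// Acquire pops the oldest machine for the key (FIFO), if any.
func (p *machinePool) Acquire(key string) (string, bool) {
	ms := p.byKey[key]
	if len(ms) == 0 {
		return "", false
	}
	p.byKey[key] = ms[1:]
	return ms[0], true
}

// Release pushes a machine; if the key is at capacity, the oldest
// entry is evicted and returned for the caller to destroy.
func (p *machinePool) Release(key, id string) (evicted string) {
	ms := append(p.byKey[key], id)
	if len(ms) > p.maxPer {
		evicted, ms = ms[0], ms[1:]
	}
	p.byKey[key] = ms
	return evicted
}

func main() {
	p := newMachinePool(3)
	key := "registry.example/app:v1:iad" // hypothetical image+region key
	for _, id := range []string{"m1", "m2", "m3", "m4"} {
		if ev := p.Release(key, id); ev != "" {
			fmt.Println("evict", ev) // m1: oldest evicted at capacity
		}
	}
	id, _ := p.Acquire(key)
	fmt.Println("acquire", id) // m2: oldest remaining
}
```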

Pause and Resume

The pause/resume flow preserves the machine_id on the run row so that the stopped Fly Machine can be restarted on resume instead of cold-creating a new one:

  1. Pause: API transitions executing → paused, calls Stop() on the machine. The machine_id is preserved.
  2. Resume: API transitions paused → queued without clearing machine_id.
  3. Re-dispatch: Executor sees run.MachineID != "", calls Start(machineID, freshEnv).
  4. Fallback: If the machine was deleted (auto-destroy, timeout, manual), falls back to cold create.

Error Handling

  • ErrMachineGone (404): Machine was deleted. Caller falls back to Create.
  • Retryable errors (429, 500, 503, connection refused): Run is snoozed with backoff.
  • Fatal errors (422): Run transitions to system_failed.
  • Snooze path: Machine is stopped before snoozing to prevent orphaned running containers.
  • Cancel race: If Stop fails during cancel detection, Destroy is called as fallback.

Compute Cost Tracking

Every managed run records wall-clock compute usage in the run_compute_usage table (run ID, project ID, machine preset, duration in seconds, cost in micro-USD). Daily budget enforcement checks the project quota before creating a machine. The budget monitor fires a warning webhook at 80% utilization.

For the full managed execution reference, see Managed Execution.

ClickHouse Analytics

Strait supports an optional ClickHouse backend for high-performance analytics over run metrics, cost data, and event timelines.

Architecture

PostgreSQL remains the source of truth for all state. ClickHouse receives denormalized event data via an async batch pipeline. This separation keeps the hot path (queue, FSM transitions) on Postgres while offloading analytical queries to a column-oriented store optimized for aggregation.

Tables

Table          | Purpose                          | Engine    | TTL      | Partition
run_events     | Run lifecycle events with timing | MergeTree | 365 days | Monthly
run_analytics  | Denormalized run summaries       | MergeTree | 365 days | Monthly
compute_usage  | Managed execution cost records   | MergeTree | 90 days  | Monthly

Data Pipeline

The clickhouse.Engine maintains an in-memory buffer that flushes to ClickHouse on two triggers: batch size threshold (default 100) or flush interval (default 10 seconds). Backpressure is handled by capping the buffer at 10x batch size and dropping oldest entries. Graceful shutdown flushes remaining entries.

Configuration

Variable                  | Default | Description
CLICKHOUSE_ENABLED        | false   | Enable ClickHouse export
CLICKHOUSE_URL            |         | ClickHouse connection string
CLICKHOUSE_DATABASE       | strait  | Target database
CLICKHOUSE_BATCH_SIZE     | 100     | Flush threshold
CLICKHOUSE_FLUSH_INTERVAL | 10s     | Max time between flushes

For the full ClickHouse reference, see ClickHouse Analytics.

Related Documentation

  • Introduction — Product overview and key features
  • Quick Start — Getting started guide
  • Jobs — Job definitions and configuration
  • Runs — Run lifecycle and FSM
  • Workflows — DAG orchestration details
  • DAG Runtime — Detailed DAG scheduler behavior, explainability, policy enforcement, and recovery controls
  • DAG Operations Playbook — On-call triage, recovery, and hardening runbook for DAG incidents
  • Managed Execution — Fly Machines container runtime, warm pool, pause/resume
  • ClickHouse Analytics — Optional analytics backend for run metrics and cost tracking
  • Event Triggers — Durable external event waits
  • CDC — Real-time change data capture
  • Resilience — Circuit breaker, rate limiting, and graceful degradation
  • Security — SSRF protection, HMAC signing, and access control

Performance Characteristics

Queue Throughput

Strait's queue is backed by PostgreSQL SELECT FOR UPDATE SKIP LOCKED, which lets concurrent workers dequeue without blocking on each other's row locks. Under typical workloads, a single instance can sustain 1000+ dequeue operations per second with 32 concurrent workers.

Analytics Query Cost

The analytics endpoint scans the job_runs table for a configurable time window (1-720 hours). Query cost scales linearly with the number of runs in the window. For datasets exceeding 1M runs in the analysis period, consider reducing period_hours or implementing materialized view pre-aggregation.

Bulk Operation Limits

Bulk trigger and cancel endpoints accept up to 100 items per request. Each item is processed within a single database transaction for atomicity. Bulk cancel additionally propagates to child runs, which can amplify the number of affected rows.
