Strait Docs
Getting Started

A production-grade Go job orchestration service for engineering teams and AI agents.

Strait is a production-grade Go job orchestration service designed for teams building reliable, scalable background job processing. It accepts job definitions via a REST API, queues runs in PostgreSQL using SELECT FOR UPDATE SKIP LOCKED (no external message broker required), and dispatches them via HTTP to your endpoints with configurable retry strategies (exponential, linear, fixed, or custom per-attempt delays).

Everything you need in one binary. Single Go executable. No runtime dependencies. Deploy and scale horizontally.

What Problem Does It Solve?

Engineering teams face challenges with background job processing:

  • Reliability: Jobs fail and need to retry with exponential backoff
  • Observability: Understanding why jobs failed, execution time, and resource usage
  • Scheduling: Running jobs on cron schedules with delayed execution and retention
  • Orchestration: Coordinating multi-step workflows with dependencies and conditions
  • Infrastructure: Managing message brokers, queues, and job state stores separately

Strait combines queue, state, scheduler, and executor in one system—eliminating operational complexity while providing production-grade features.

Key Capabilities

13-State FSM

Robust lifecycle management: states including queued, executing, completed, failed, timed_out, and dead_letter ensure every job run is tracked correctly.

Workflow DAGs

Directed Acyclic Graphs with fan-in/fan-out, step conditions, template variables, output transforms, human approval gates, and durable event waits.

Smart Retry

Exponential, linear, fixed, or custom per-attempt delays with ±20% jitter. Prevents thundering herd and handles transient failures gracefully.

Cost Budgets

Track AI model usage with micro-USD precision. Enforce per-run and daily project limits to control costs.
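
To make "micro-USD precision" concrete, the sketch below stores money as integer millionths of a dollar and checks a charge against a per-run and a daily limit. The type and function names (MicroUSD, Budget, Check) and the convention that zero means "no limit" are assumptions for illustration, not Strait's actual data model.

```go
package main

import (
	"errors"
	"fmt"
)

// MicroUSD stores money as integer millionths of a dollar, so sums of many
// small charges stay exact (no floating-point drift).
type MicroUSD int64

// USD is one dollar expressed in micro-USD.
const USD MicroUSD = 1_000_000

var (
	ErrRunBudget   = errors.New("per-run budget exceeded")
	ErrDailyBudget = errors.New("daily project budget exceeded")
)

// Budget holds the two limits mentioned in the docs.
// A zero limit means "unlimited" in this sketch (an assumption).
type Budget struct {
	PerRun MicroUSD
	Daily  MicroUSD
}

// Check reports whether adding cost would push the run or the day over budget.
func (b Budget) Check(runSpent, daySpent, cost MicroUSD) error {
	if b.PerRun > 0 && runSpent+cost > b.PerRun {
		return ErrRunBudget
	}
	if b.Daily > 0 && daySpent+cost > b.Daily {
		return ErrDailyBudget
	}
	return nil
}

// String formats a non-negative amount as dollars with six decimals.
func (m MicroUSD) String() string {
	return fmt.Sprintf("$%d.%06d", int64(m/USD), int64(m%USD))
}

func main() {
	b := Budget{PerRun: 2 * USD, Daily: 50 * USD}
	// The run has spent $1.999 and the next model call costs $0.0015.
	fmt.Println(b.Check(1_999_000, 10*USD, 1_500))
	fmt.Println(MicroUSD(1_250_000))
}
```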

Event Triggers

Pause execution and wait for external events—approvals, webhooks, third-party callbacks—for days or weeks without holding goroutines. Durable, database-backed waits with timeout support.

Real-Time CDC

Postgres WAL change capture via Sequin. No polling required—your applications react instantly when jobs, workflows, or runs change.

SDK Endpoints

Specialized endpoints for job executors—logging, heartbeats, progress updates, checkpoints, continuation, and child job spawning.

Webhooks

HMAC-SHA256 signed webhooks with automatic retries and dead letter queue on delivery failure.
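
A receiver would typically verify such a signature along the following lines. This is a minimal sketch using Go's standard library; it assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body, which this page does not specify. How the signature is transported (for example, which header carries it) is also not covered here, and the payload in main is made up.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of body under secret.
// Hex encoding is an assumption of this sketch.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC over body and compares it with the received
// signature in constant time to avoid leaking timing information.
func Verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"run_id":"run_123","status":"completed"}`) // hypothetical payload
	sig := Sign(secret, body)

	fmt.Println(Verify(secret, body, sig))                          // untampered body
	fmt.Println(Verify(secret, []byte(`{"status":"failed"}`), sig)) // tampered body
}
```

Verification must run over the exact bytes received, before any JSON parsing or re-serialization, or the MAC will not match.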

Health Scoring

Aggregates metrics over configurable time windows: success rate, timeout rate, crash rate, and latency stability give an at-a-glance view of job reliability.

Architecture Overview

                    ┌──────────────────────────────────┐
                    │           API Server              │
                    │  (Chi router + middleware)         │
                    │                                    │
                    │  /v1/jobs/* ── Job CRUD + Health   │
                    │  /v1/workflows/* ── DAG CRUD       │
                    │  /v1/workflow-runs/* ── Run mgmt   │
                    │  /v1/jobs/{id}/trigger ── Enqueue  │
                    │  /v1/runs/* ── Run mgmt + DLQ     │
                    │  /v1/events/* ── Event triggers    │
                    │  /sdk/v1/* ── SDK (JWT auth)      │
                    │  /metrics ── Prometheus            │
                    └──────────┬───────────────────────┘
                               │ Enqueue (budget check)
                               v
                    ┌──────────────────────────────────┐
                    │         PostgreSQL                 │
                    │                                    │
                    │  jobs ── job definitions           │
                    │  job_runs ── run state + queue     │
                    │  workflows ── DAG definitions      │
                    │  workflow_runs ── workflow state   │
                    │  event_triggers ── durable waits   │
                    │  run_events ── log entries         │
                    │  run_usage ── AI cost tracking     │
                    │  environments ── endpoint config   │
                    │  project_quotas ── budget limits   │
                    │                                    │
                    │  Queue: SELECT FOR UPDATE          │
                    │         SKIP LOCKED                │
                    └──────────┬───────────────────────┘
                               │ Dequeue
                               v
                    ┌──────────────────────────────────┐
                    │         Worker Executor            │
                    │                                    │
                    │  Poll ─> DequeueN(available)       │
                    │  Workflow Engine:                  │
                    │  - DAG Validation (Kahn's)         │
                    │  - Atomic Fan-in (UPDATE...RET)    │
                    │  - Condition Evaluation            │
                    │  - Template Rendering              │
                    │  - Sub-workflow Nesting            │
                    │                                    │
                    │  Job Execution:                    │
                    │  - Resolve ─> Env override + SSRF  │
                    │  - Execute ─> HTTP POST to endpt   │
                    │  - Retry ─> Smart strategy select  │
                    │  - Trace ─> Execution timing       │
                    │  - DLQ ─> Dead letter on exhaust   │
                    └──────────┬───────────────────────┘
                               │ Webhook / PubSub
                               v
                    ┌──────────────────────────────────┐
                    │  Scheduler         │  Redis       │
                    │  - Cron ticker     │  - PubSub    │
                    │  - Delayed poller  │  - SSE       │
                    │  - Stale reaper    │  streaming   │
                    │  - Retention       │              │
                    └──────────────────────────────────┘

Strait runs in three modes:

api: Handles HTTP requests, job management, and triggering. Scale horizontally for API throughput.

worker: Runs executor, scheduler, and background maintenance. Scale horizontally for job processing throughput.

all: Combined mode for development or small deployments. Single binary, single process.

Why Strait?

No RabbitMQ. No SQS. No Kafka. PostgreSQL handles queuing with SELECT FOR UPDATE SKIP LOCKED: concurrent workers claim runs by skipping rows that another worker has already locked, so they never block one another and there is no broker to operate. A single binary includes everything, with no runtime dependencies to install.
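
The queuing pattern can be sketched as a single statement that claims a batch of rows. The table name job_runs, the function name DequeueN, and the states queued and dequeued appear elsewhere on this page; the column names (id, status, created_at), the oldest-first ordering, and the function signature are assumptions made for this example and may differ from Strait's schema.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// dequeueSQL claims up to $1 queued runs in one round trip. Rows already
// locked by another worker are skipped instead of waited on, so concurrent
// workers never receive the same run. Column names are hypothetical.
const dequeueSQL = `
UPDATE job_runs
SET status = 'dequeued'
WHERE id IN (
    SELECT id
    FROM job_runs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT $1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// DequeueN returns the IDs of the runs this worker just claimed.
// db must be opened with a PostgreSQL driver registered by the caller.
func DequeueN(ctx context.Context, db *sql.DB, n int) ([]string, error) {
	rows, err := db.QueryContext(ctx, dequeueSQL, n)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

func main() {
	// Without a database at hand, just show the statement a worker would run.
	fmt.Println(dequeueSQL)
}
```

Row locks are still taken here. The benefit of SKIP LOCKED is that a second worker moves on to the next available row rather than blocking behind the first.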

Goroutines provide parallel job execution without external coordination. A worker pool with bounded backpressure prevents memory exhaustion during traffic spikes. Structured concurrency patterns (sourcegraph/conc) ensure panic recovery and graceful shutdown.

SDK endpoints are designed for AI agents: logging, heartbeats, progress checkpoints, continuation for long-running workflows, and child job spawning. Cost budgets track token usage with micro-USD precision. Debug bundles aggregate execution data for troubleshooting.

Workflows support complex DAGs with step conditions, output transforms, template variables, and human approval gates. Atomic fan-in handles concurrent parent completions safely. Sub-workflows enable arbitrary nesting depth for multi-stage pipelines.

OpenTelemetry tracing links job runs across API server, worker, and external endpoints. Prometheus metrics expose queue depth, throughput, and latency. Structured JSON logging enables log aggregation. Redis Pub/Sub powers real-time SSE streaming.

A unified CLI provides code-first deployment workflows, operational command groups, and shell completion for Bash, Zsh, Fish, and PowerShell.

Use Cases

Strait fits these patterns:

Background Jobs: Scheduled data imports, report generation, cache warming, cleanup tasks, and recurring maintenance operations.

Webhook Consumers: Process events from external services with retries, dead letter queue, and delivery guarantees.

AI Agent Workflows: Multi-step AI pipelines with human approval gates, conditional execution, and sub-workflow nesting. Cost tracking per run and per project.

Batch Processing: Bulk job triggering with configurable batch sizes, priority ordering, and idempotency deduplication.

Data Pipelines: ETL workflows with fan-out parallel steps, transform stages, and aggregation.

Cron Jobs: Standard 5-field cron expressions with timezone support and execution windows.

Getting Started

Concepts

Core domain concepts you'll encounter:

Jobs define the template for recurring tasks—endpoint URL, timeout, retry strategy, cron schedule, and cost budgets. Runs are execution instances of jobs.

Runs represent a single execution of a job. Each run is tracked through a 13-state FSM (including queued, dequeued, executing, completed, failed, timed_out, and dead_letter) along with its events, logs, and usage data.
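
A finite state machine of this kind is commonly encoded as a table of allowed transitions. This page names only seven of the thirteen states and does not list the transitions between them, so the table below is an illustrative guess covering just those seven. It is not Strait's actual lifecycle.

```go
package main

import "fmt"

// State is a run lifecycle state. Only the states named in the docs appear.
type State string

const (
	Queued     State = "queued"
	Dequeued   State = "dequeued"
	Executing  State = "executing"
	Completed  State = "completed"
	Failed     State = "failed"
	TimedOut   State = "timed_out"
	DeadLetter State = "dead_letter"
)

// transitions is a hypothetical transition table: a failed or timed-out run
// is either re-queued for a retry or moved to the dead letter queue.
var transitions = map[State][]State{
	Queued:    {Dequeued},
	Dequeued:  {Executing},
	Executing: {Completed, Failed, TimedOut},
	Failed:    {Queued, DeadLetter},
	TimedOut:  {Queued, DeadLetter},
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to State) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a state has no outgoing transitions.
func IsTerminal(s State) bool { return len(transitions[s]) == 0 }

func main() {
	fmt.Println(CanTransition(Executing, Completed))
	fmt.Println(CanTransition(Completed, Queued))
	fmt.Println(IsTerminal(DeadLetter))
}
```

Rejecting any transition that is not in the table is what lets a system guarantee that, for example, a completed run can never be picked up again.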

Workflows orchestrate multiple jobs into DAGs. Steps can depend on outputs from parent steps, have conditional execution based on step status, include human approval gates, and wait for external events.
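
Before a workflow can run, its step graph has to be checked for cycles. The architecture diagram mentions Kahn's algorithm for this. The sketch below is a generic implementation of Kahn's algorithm, assuming a workflow is given as a map from each step name to the names of the steps it depends on. That representation and the function name ValidateDAG are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ValidateDAG takes a map from step name to its parent steps. Every step,
// including those without parents, must appear as a key. It returns one valid
// execution order, or an error if a dependency is unknown or a cycle exists.
func ValidateDAG(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int, len(deps))
	children := make(map[string][]string, len(deps))
	for step, parents := range deps {
		indegree[step] = len(parents)
		for _, p := range parents {
			if _, ok := deps[p]; !ok {
				return nil, fmt.Errorf("step %q depends on unknown step %q", step, p)
			}
			children[p] = append(children[p], step)
		}
	}

	// Start with every step that has no unfinished parents.
	var ready []string
	for step, d := range indegree {
		if d == 0 {
			ready = append(ready, step)
		}
	}

	var order []string
	for len(ready) > 0 {
		step := ready[0]
		ready = ready[1:]
		order = append(order, step)
		// "Completing" a step unblocks children whose parents are all done.
		for _, c := range children[step] {
			indegree[c]--
			if indegree[c] == 0 {
				ready = append(ready, c)
			}
		}
	}

	// Steps on a cycle never reach indegree zero, so they are never emitted.
	if len(order) != len(deps) {
		return nil, errors.New("workflow contains a cycle")
	}
	return order, nil
}

func main() {
	// fetch fans out to two steps, which fan in to merge.
	order, err := ValidateDAG(map[string][]string{
		"fetch":     {},
		"summarize": {"fetch"},
		"classify":  {"fetch"},
		"merge":     {"summarize", "classify"},
	})
	fmt.Println(order, err)

	_, err = ValidateDAG(map[string][]string{"a": {"b"}, "b": {"a"}})
	fmt.Println(err)
}
```

The returned order is one of possibly several valid orders. Steps that become ready at the same time, such as summarize and classify here, may appear in either sequence.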

Event triggers pause workflow steps or job runs until an external event arrives via API. Waits are durable (stored as database rows, not goroutines) and can last days or weeks. Event triggers support timeouts, webhook notifications, event chaining, and real-time SSE streaming.

Environments provide per-project, named configurations with key-value variables. Jobs can link to environments, enabling endpoint URL overrides for staging vs. production routing.

What's Next?

Ready to dive deeper? Jump straight into the Quick Start Guide and run your first job in 5 minutes.
