A production-grade Go job orchestration service for engineering teams and AI agents.
Strait is a production-grade Go job orchestration service designed for teams building reliable, scalable background job processing. It accepts job definitions via REST API, queues runs in PostgreSQL using SELECT FOR UPDATE SKIP LOCKED (no external message broker required), and dispatches them via HTTP to your endpoints with intelligent retry strategies.
Everything you need in one binary. Single Go executable. No runtime dependencies. Deploy and scale horizontally.
What Problem Does It Solve?
Engineering teams face challenges with background job processing:
- Reliability: Jobs fail and need to retry with exponential backoff
- Observability: Understanding why jobs failed, execution time, and resource usage
- Scheduling: Running jobs on cron schedules with delayed execution and retention
- Orchestration: Coordinating multi-step workflows with dependencies and conditions
- Infrastructure: Managing message brokers, queues, and job state stores separately
Strait combines queue, state, scheduler, and executor in one system—eliminating operational complexity while providing production-grade features.
Key Capabilities
13-State FSM
Robust lifecycle management across 13 run states, including queued, executing, completed, failed, timed_out, and dead_letter, ensures every job run is tracked correctly.
Workflow DAGs
Directed Acyclic Graphs with fan-in/fan-out, step conditions, template variables, output transforms, human approval gates, and durable event waits.
Smart Retry
Exponential, linear, fixed, or custom per-attempt delays with ±20% jitter. Prevents thundering herd and handles transient failures gracefully.
Cost Budgets
Track AI model usage with micro-USD precision. Enforce per-run and daily project limits to control costs.
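A budget check with micro-USD precision might look like the sketch below. Storing amounts as integer micro-USD avoids floating-point drift when many small charges are summed. The `Limits` type, `TokenCost`, the per-million-token pricing model, and the "zero means unlimited" convention are assumptions for illustration, not Strait's actual schema or API.

```go
package main

import (
	"errors"
	"fmt"
)

// 1 USD = 1,000,000 micro-USD.
const microPerUSD = 1_000_000

var (
	ErrRunBudget   = errors.New("per-run budget exceeded")
	ErrDailyBudget = errors.New("daily project budget exceeded")
)

// Limits is a hypothetical shape; a zero limit means "no limit".
type Limits struct {
	PerRunMicroUSD int64
	DailyMicroUSD  int64
}

// TokenCost converts a token count into micro-USD, given a price expressed
// in micro-USD per one million tokens. It rounds up to a whole micro-USD.
func TokenCost(tokens, priceMicroUSDPerMTok int64) int64 {
	return (tokens*priceMicroUSDPerMTok + 999_999) / 1_000_000
}

// Check reports whether adding cost would exceed the per-run or daily limit.
func (l Limits) Check(runSpent, daySpent, cost int64) error {
	if l.PerRunMicroUSD > 0 && runSpent+cost > l.PerRunMicroUSD {
		return ErrRunBudget
	}
	if l.DailyMicroUSD > 0 && daySpent+cost > l.DailyMicroUSD {
		return ErrDailyBudget
	}
	return nil
}

func main() {
	// $0.05 per run, $5 per day.
	limits := Limits{PerRunMicroUSD: 50_000, DailyMicroUSD: 5 * microPerUSD}
	// 12,000 tokens at $3 per 1M tokens.
	cost := TokenCost(12_000, 3*microPerUSD)
	fmt.Println(cost, limits.Check(0, 0, cost))
}
```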
Event Triggers
Pause execution and wait for external events—approvals, webhooks, third-party callbacks—for days or weeks without holding goroutines. Durable, database-backed waits with timeout support.
Real-Time CDC
Postgres WAL change capture via Sequin. No polling required—your applications react instantly when jobs, workflows, or runs change.
SDK Endpoints
Specialized endpoints for job executors—logging, heartbeats, progress updates, checkpoints, continuation, and child job spawning.
Webhooks
HMAC-SHA256 signed webhooks with automatic retries and dead letter queue on delivery failure.
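On the receiving side, verifying an HMAC-SHA256 signature takes only the standard library. The sketch below assumes the signature arrives hex-encoded and is computed over the raw request body. The encoding, the header that carries it, and the function names are assumptions, so check the webhook guide for the actual wire format.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign returns the hex-encoded HMAC-SHA256 of body under secret.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify reports whether signature is a valid hex HMAC-SHA256 of body.
// hmac.Equal compares in constant time, which avoids leaking how many
// leading bytes matched.
func Verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // hypothetical shared secret
	body := []byte(`{"event":"run.completed"}`)
	sig := Sign(secret, body)
	fmt.Println(Verify(secret, body, sig))
	fmt.Println(Verify(secret, []byte(`{"event":"tampered"}`), sig))
}
```

Verify against the exact bytes received, before any JSON decoding, since re-serializing the payload can change the byte sequence and invalidate the signature.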
Health Scoring
Aggregate metrics over configurable time windows. Success rate, timeout rate, crash rate, and latency stability—at-a-glance job reliability.
Architecture Overview
┌──────────────────────────────────┐
│            API Server            │
│    (Chi router + middleware)     │
│                                  │
│ /v1/jobs/* ── Job CRUD + Health  │
│ /v1/workflows/* ── DAG CRUD      │
│ /v1/workflow-runs/* ── Run mgmt  │
│ /v1/jobs/{id}/trigger ── Enqueue │
│ /v1/runs/* ── Run mgmt + DLQ     │
│ /v1/events/* ── Event triggers   │
│ /sdk/v1/* ── SDK (JWT auth)      │
│ /metrics ── Prometheus           │
└──────────┬───────────────────────┘
           │ Enqueue (budget check)
           v
┌──────────────────────────────────┐
│            PostgreSQL            │
│                                  │
│ jobs ── job definitions          │
│ job_runs ── run state + queue    │
│ workflows ── DAG definitions     │
│ workflow_runs ── workflow state  │
│ event_triggers ── durable waits  │
│ run_events ── log entries        │
│ run_usage ── AI cost tracking    │
│ environments ── endpoint config  │
│ project_quotas ── budget limits  │
│                                  │
│ Queue: SELECT FOR UPDATE         │
│        SKIP LOCKED               │
└──────────┬───────────────────────┘
           │ Dequeue
           v
┌──────────────────────────────────┐
│         Worker Executor          │
│                                  │
│ Poll ─> DequeueN(available)      │
│ Workflow Engine:                 │
│ - DAG Validation (Kahn's)        │
│ - Atomic Fan-in (UPDATE...RET)   │
│ - Condition Evaluation           │
│ - Template Rendering             │
│ - Sub-workflow Nesting           │
│                                  │
│ Job Execution:                   │
│ - Resolve ─> Env override + SSRF │
│ - Execute ─> HTTP POST to endpt  │
│ - Retry ─> Smart strategy select │
│ - Trace ─> Execution timing      │
│ - DLQ ─> Dead letter on exhaust  │
└──────────┬───────────────────────┘
           │ Webhook / PubSub
           v
┌──────────────────────────────────┐
│ Scheduler        │ Redis         │
│ - Cron ticker    │ - PubSub      │
│ - Delayed poller │ - SSE         │
│ - Stale reaper   │   streaming   │
│ - Retention      │               │
└──────────────────────────────────┘
Strait runs in three modes:
api: Handles HTTP requests, job management, and triggering. Scale horizontally for API throughput.
worker: Runs executor, scheduler, and background maintenance. Scale horizontally for job processing throughput.
all: Combined mode for development or small deployments. Single binary, single process.
Why Strait?
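No RabbitMQ. No SQS. No Kafka. PostgreSQL handles queuing with SELECT FOR UPDATE SKIP LOCKED: each worker locks only the rows it claims and skips rows already locked by others, so concurrent workers never block one another and there is no broker to operate. A single binary includes everything, with no runtime dependencies to install.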
Go goroutines provide parallel job execution without external coordination. Worker pool with bounded backpressure prevents memory exhaustion during traffic spikes. Structured concurrency patterns (sourcegraph/conc) ensure panic recovery and graceful shutdown.
SDK endpoints designed for AI agents—logging, heartbeats, progress checkpoints, continuation for long-running workflows, and child job spawning. Cost budgets track token usage with micro-USD precision. Debug bundles aggregate execution data for troubleshooting.
Complex DAGs with step conditions, output transforms, template variables, and human approval gates. Atomic fan-in handles concurrent parent completions safely. Sub-workflows enable arbitrary nesting depth for multi-stage pipelines.
OpenTelemetry tracing links job runs across API server, worker, and external endpoints. Prometheus metrics expose queue depth, throughput, and latency. Structured JSON logging enables log aggregation. Real-time SSE streaming via Redis.
Unified CLI with code-first deployment workflows, operational command groups, and shell completion for Bash, Zsh, Fish, and PowerShell.
Use Cases
Strait fits these patterns:
Background Jobs: Scheduled data imports, report generation, cache warming, cleanup tasks, and recurring maintenance operations.
Webhook Consumers: Process events from external services with retries, dead letter queue, and delivery guarantees.
AI Agent Workflows: Multi-step AI pipelines with human approval gates, conditional execution, and sub-workflow nesting. Cost tracking per run and per project.
Batch Processing: Bulk job triggering with configurable batch sizes, priority ordering, and idempotency deduplication.
Data Pipelines: ETL workflows with fan-out parallel steps, transform stages, and aggregation.
Cron Jobs: Standard 5-field cron expressions with timezone support and execution windows.
Getting Started
Quick Start
Get Strait running in minutes. Clone the repository, start the infrastructure with Docker Compose, and trigger your first job.
Architecture
Deep dive into internals. Learn about queue mechanics, FSM states, workflow engine, and technology choices.
SDK Reference
Official SDKs for TypeScript, Python, Go, Ruby, and Rust with full feature parity. Authoring DSL, composition helpers, and typed errors.
CLI Reference
Complete CLI documentation. 48+ commands organized by category with examples and shell completion.
Concepts
Core domain concepts you'll encounter:
Jobs define the template for recurring tasks—endpoint URL, timeout, retry strategy, cron schedule, and cost budgets. Runs are execution instances of jobs.
A run represents a single execution of a job. It is tracked through a 13-state FSM (including queued, dequeued, executing, completed, failed, timed_out, and dead_letter), along with its events, logs, and usage data.
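A run state machine of this kind is often expressed as a table of allowed transitions. The sketch below covers only the seven states named on this page. The transition table itself is an assumption made for illustration, not Strait's actual FSM, which has 13 states.

```go
package main

import "fmt"

type State string

// Only the states named in the text; the remaining states are omitted.
const (
	Queued     State = "queued"
	Dequeued   State = "dequeued"
	Executing  State = "executing"
	Completed  State = "completed"
	Failed     State = "failed"
	TimedOut   State = "timed_out"
	DeadLetter State = "dead_letter"
)

// transitions is a hypothetical table: failed and timed-out runs either
// go back to the queue for a retry or move to the dead letter state.
var transitions = map[State][]State{
	Queued:    {Dequeued},
	Dequeued:  {Executing},
	Executing: {Completed, Failed, TimedOut},
	Failed:    {Queued, DeadLetter},
	TimedOut:  {Queued, DeadLetter},
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to State) bool {
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a state has no outgoing transitions.
func IsTerminal(s State) bool { return len(transitions[s]) == 0 }

func main() {
	fmt.Println(CanTransition(Executing, Completed))
	fmt.Println(CanTransition(Completed, Queued))
	fmt.Println(IsTerminal(DeadLetter))
}
```

Rejecting any transition that is not in the table keeps a late or duplicate update, such as a completion report for a run that already timed out, from corrupting run state.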
Workflows orchestrate multiple jobs into DAGs. Steps can depend on outputs from parent steps, have conditional execution based on step status, include human approval gates, and wait for external events.
Event triggers pause workflow steps or job runs until an external event arrives via API. Waits are durable — stored as database rows, not goroutines — and can last days or weeks. Supports timeouts, webhook notifications, event chaining, and real-time SSE streaming.
Environments provide per-project, named configurations with key-value variables. Jobs can link to environments, enabling endpoint URL overrides for staging vs. production routing.
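Resolving an endpoint with an environment override, followed by a basic SSRF guard, might look like the sketch below. The function name and precedence rule (override wins when set) are assumptions. The guard only rejects IP literals and `localhost`, so a production check would also need to validate the addresses a hostname resolves to at connect time.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// ResolveEndpoint picks the environment override when one is set,
// otherwise the job's own URL, then rejects obviously internal targets.
func ResolveEndpoint(jobURL, envOverride string) (*url.URL, error) {
	raw := jobURL
	if envOverride != "" {
		raw = envOverride
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	host := u.Hostname()
	if host == "" || host == "localhost" {
		return nil, fmt.Errorf("disallowed host %q", host)
	}
	if ip := net.ParseIP(host); ip != nil && isInternal(ip) {
		return nil, fmt.Errorf("disallowed internal address %s", ip)
	}
	return u, nil
}

// isInternal covers loopback, RFC 1918 and IPv6 unique-local ranges,
// link-local addresses (including 169.254.169.254), and 0.0.0.0 / ::.
func isInternal(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified()
}

func main() {
	u, err := ResolveEndpoint("https://api.example.com/run", "https://staging.example.com/run")
	fmt.Println(u, err)

	_, err = ResolveEndpoint("http://169.254.169.254/latest/meta-data", "")
	fmt.Println(err)
}
```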
Guides
Step-by-step guides for common tasks:
Authentication
Internal secret auth for API endpoints and JWT run-token auth for the SDK. API key management with system keychain storage.
Deployment
Docker deployment, Fly.io configuration, horizontal scaling strategies, and production readiness checklist.
Security
SSRF protection, rate limiting, encryption at rest, and secure webhook delivery.
Cost Budgets
Per-run and daily project limits. AI model usage tracking with micro-USD precision. Budget enforcement before execution.
Development
Contributing to Strait or running it locally:
Contributing
Development environment setup, code style, commit conventions, and PR guidelines.
Testing
Unit tests, integration tests with testcontainers, E2E tests, fuzz testing, and benchmarks.
Database Schema
Complete table definitions, indexes, and relationships for the PostgreSQL schema.
What's Next?
Ready to dive deeper?
- Learn about the queue mechanics and how SKIP LOCKED works
- Understand the workflow engine and DAG execution
- Explore retry strategies: exponential, linear, fixed, and custom
- Set up webhooks with HMAC signing for event delivery
- Build durable workflows with event triggers that wait for external events without holding goroutines
- Configure cost budgets for AI workloads
Or jump straight into the Quick Start Guide and run your first job in 5 minutes.