Strait implements several resilience patterns to protect against cascading failures when dispatching jobs to external endpoints and delivering webhooks.
Circuit Breaker
The circuit breaker pattern prevents repeated dispatch attempts to endpoints that are consistently failing. When an endpoint exceeds a failure threshold, the circuit “opens” and subsequent dispatches are short-circuited until a cooldown period elapses.
Job Dispatch Circuit Breaker
Job dispatch uses a PostgreSQL-backed circuit breaker persisted in the endpoint_circuit_state table. This provides durability across process restarts.
State machine:
- Closed (default): Dispatches proceed normally. Consecutive failures are tracked.
- Open: All dispatches to this endpoint are skipped. The run is requeued with a backoff delay. Transitions to half-open after a cooldown period.
- Half-open: A single probe dispatch is allowed. Success closes the circuit; failure re-opens it.
An in-memory copy of the circuit state in the worker (apps/strait/internal/worker/circuitbreaker.go) provides fast lookups during dispatch. The PostgreSQL table provides durability.
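A minimal sketch of the state machine described above, with illustrative field names and thresholds; the real implementation is backed by circuitbreaker.go and the endpoint_circuit_state table:

```go
// Sketch of the closed/open/half-open transitions. Field and threshold names
// are illustrative, not Strait's actual schema.
package worker

import "time"

type CircuitState int

const (
	Closed CircuitState = iota
	Open
	HalfOpen
)

type EndpointCircuit struct {
	State               CircuitState
	ConsecutiveFailures int
	OpenedAt            time.Time
	FailureThreshold    int           // e.g. 5 consecutive failures (assumed)
	Cooldown            time.Duration // e.g. 60s before probing again (assumed)
}

// AllowDispatch reports whether a dispatch to this endpoint may proceed.
func (c *EndpointCircuit) AllowDispatch(now time.Time) bool {
	switch c.State {
	case Closed:
		return true
	case Open:
		if now.Sub(c.OpenedAt) >= c.Cooldown {
			c.State = HalfOpen // cooldown elapsed: allow a single probe dispatch
			return true
		}
		return false // short-circuit: the run is requeued with a backoff delay
	case HalfOpen:
		return false // a probe is already in flight
	}
	return false
}

// RecordResult updates the circuit after a dispatch completes.
func (c *EndpointCircuit) RecordResult(success bool, now time.Time) {
	if success {
		c.State = Closed
		c.ConsecutiveFailures = 0
		return
	}
	c.ConsecutiveFailures++
	if c.State == HalfOpen || c.ConsecutiveFailures >= c.FailureThreshold {
		c.State = Open
		c.OpenedAt = now
	}
}
```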
Endpoint Health Scoring
In addition to the binary circuit breaker, Strait supports continuous health scoring for endpoints. The health score (0-100) is computed using an Exponentially Weighted Moving Average (EWMA) of three signals:
- Success rate (50% weight): Fraction of dispatches that return a successful response.
- Failure/timeout rate (30% weight): Fraction of dispatches that fail or time out.
- Latency score (20% weight): How close the P95 latency is to the job’s configured timeout.
The score maps to three health states:
- Healthy (score > 60): Full concurrency allowed.
- Degraded (score 30-60): Concurrency is proportionally throttled (down to 25% of max).
- Unhealthy (score < 30): Endpoint is blocked (equivalent to circuit open). Runs are snoozed with backoff.
Scores are persisted in the endpoint_health_scores table and updated on every dispatch result. Recovery happens naturally as successful dispatches raise the EWMA score back above thresholds.
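A sketch of how the weighted EWMA score and state thresholds could be computed; the smoothing factor and type names are assumptions, while the 50/30/20 weights and the 60/30 boundaries come from the description above:

```go
// Sketch of the EWMA-based health score. The alpha value and struct names are
// illustrative; weights and thresholds follow the docs.
package health

const alpha = 0.2 // EWMA smoothing factor (assumed)

type EndpointHealth struct {
	SuccessRate  float64 // EWMA of successful dispatches (0..1)
	FailureRate  float64 // EWMA of failures/timeouts (0..1)
	LatencyScore float64 // 1.0 = P95 far below the job timeout, 0.0 = at/over it
}

func ewma(prev, sample float64) float64 {
	return alpha*sample + (1-alpha)*prev
}

// Observe folds one dispatch result into the running averages.
func (h *EndpointHealth) Observe(success bool, p95FracOfTimeout float64) {
	s, f := 0.0, 1.0
	if success {
		s, f = 1.0, 0.0
	}
	h.SuccessRate = ewma(h.SuccessRate, s)
	h.FailureRate = ewma(h.FailureRate, f)
	h.LatencyScore = ewma(h.LatencyScore, 1-min(p95FracOfTimeout, 1))
}

// Score combines the three signals into a 0-100 health score.
func (h *EndpointHealth) Score() float64 {
	return 100 * (0.5*h.SuccessRate + 0.3*(1-h.FailureRate) + 0.2*h.LatencyScore)
}

// State maps the score onto the thresholds described above.
func (h *EndpointHealth) State() string {
	switch score := h.Score(); {
	case score > 60:
		return "healthy" // full concurrency
	case score >= 30:
		return "degraded" // concurrency throttled down to 25% of max
	default:
		return "unhealthy" // blocked; runs snoozed with backoff
	}
}
```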
Fallback Endpoint
When a job has a fallback_endpoint_url configured and the primary endpoint fails dispatch, Strait automatically retries against the fallback URL. The fallback fires when:
- The primary endpoint returns a non-retryable error
- The primary endpoint has exhausted retry attempts within the current dispatch cycle
Fallback dispatches are governed by the same max_attempts budget. Both primary and fallback endpoints receive the same SSRF validation at dispatch time.
Concurrency Limiting
Per-job concurrency caps prevent a single job definition from monopolizing worker capacity.
max_concurrency
The max_concurrency field on job definitions limits how many runs of that job can execute simultaneously. Enforcement happens at dequeue time via a COUNT subquery in the SKIP LOCKED dequeue query.
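A minimal sketch of what such a dequeue query can look like, embedded as a Go constant; the runs/jobs tables and their columns are illustrative rather than Strait's actual schema (only max_concurrency is a field named by the docs):

```go
// Sketch of a SKIP LOCKED dequeue that enforces max_concurrency with a COUNT
// subquery. Table and column names are illustrative.
package worker

const dequeueSQL = `
UPDATE runs SET status = 'dequeued', dequeued_at = now()
WHERE id = (
    SELECT r.id
    FROM runs r
    JOIN jobs j ON j.id = r.job_id
    WHERE r.status = 'queued'
      AND (
        j.max_concurrency IS NULL
        OR (SELECT count(*) FROM runs a
            WHERE a.job_id = r.job_id AND a.status IN ('dequeued', 'running')
           ) < j.max_concurrency
      )
    ORDER BY r.queued_at
    FOR UPDATE OF r SKIP LOCKED
    LIMIT 1
)
RETURNING id, job_id, payload;
`
```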
Rate Limiting
Per-Job Rate Limits
The rate_limit_max and rate_limit_window_secs fields on job definitions control dispatch throughput. When a job exceeds its rate limit, runs are held in queued status until the window expires.
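A sketch of one way such a check could be implemented; the mayDispatch helper, the runs table, and the rolling-window query are assumptions, while rate_limit_max and rate_limit_window_secs are the fields named above:

```go
// Sketch of a per-job dispatch-rate check. Only rate_limit_max and
// rate_limit_window_secs are Strait's actual fields; the rest is illustrative.
package worker

import (
	"context"
	"database/sql"
	"fmt"
)

// mayDispatch reports whether another run of this job may be dispatched now,
// by counting dispatches that started within the configured window.
func mayDispatch(ctx context.Context, db *sql.DB, jobID string, rateLimitMax, windowSecs int) (bool, error) {
	var dispatched int
	err := db.QueryRowContext(ctx, `
		SELECT count(*) FROM runs
		WHERE job_id = $1
		  AND dispatched_at > now() - make_interval(secs => $2)`,
		jobID, windowSecs).Scan(&dispatched)
	if err != nil {
		return false, fmt.Errorf("rate limit check: %w", err)
	}
	// If the limit is exceeded, the run stays in queued status.
	return dispatched < rateLimitMax, nil
}
```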
API Rate Limiting
The API layer implements rate limiting via httprate middleware:
- Global rate limit: RATE_LIMIT_REQUESTS per RATE_LIMIT_WINDOW per IP
- Trigger rate limit: TRIGGER_RATE_LIMIT_REQUESTS specifically for /trigger endpoints
- RBAC control-plane limits: Stricter per-route limits on mutating authorization endpoints
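A sketch of how httprate middleware is commonly wired into a chi router; the concrete numbers stand in for the environment variables above, and the route path and handler are illustrative:

```go
// Sketch of layering a global per-IP limit and a stricter trigger limit.
// The hard-coded values stand in for RATE_LIMIT_REQUESTS / RATE_LIMIT_WINDOW
// and TRIGGER_RATE_LIMIT_REQUESTS; routes are illustrative.
package api

import (
	"net/http"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/httprate"
)

func newRouter() http.Handler {
	r := chi.NewRouter()

	// Global per-IP limit applied to every route.
	r.Use(httprate.LimitByIP(100, time.Minute))

	// Stricter per-IP limit scoped to the trigger endpoints.
	r.Group(func(r chi.Router) {
		r.Use(httprate.LimitByIP(20, time.Minute))
		r.Post("/trigger/{job}", triggerHandler)
	})

	return r
}

func triggerHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusAccepted)
}
```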
Retry Strategies
Strait supports four retry strategies configurable per job:
| Strategy | Behavior |
|---|---|
| exponential | Delay doubles each attempt: 1s, 2s, 4s, 8s, … |
| linear | Fixed increment per attempt: 5s, 10s, 15s, 20s, … |
| fixed | Same delay every attempt |
| custom | Per-attempt delays specified via retry_delays_secs array |
When attempt >= max_attempts, the run transitions to dead_letter instead of failed. Dead-lettered runs can be replayed via the DLQ management API.
For detailed retry configuration, see Retry Strategies.
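A sketch of how the next delay could be derived from these strategies; the Policy type and base delays are illustrative, while retry_delays_secs, max_attempts, and the dead_letter cutoff come from the docs:

```go
// Sketch of next-delay selection for the four retry strategies. Base delays
// match the examples in the table above; type and field names are assumed.
package retry

import "time"

type Policy struct {
	Strategy       string        // "exponential", "linear", "fixed", "custom"
	BaseDelay      time.Duration // e.g. 1s for exponential, 5s for linear/fixed
	RetryDelaysSec []int         // used by the "custom" strategy
	MaxAttempts    int
}

// NextDelay returns the backoff before the given attempt (1-based), and
// whether the run should be retried at all or dead-lettered instead.
func (p Policy) NextDelay(attempt int) (time.Duration, bool) {
	if attempt >= p.MaxAttempts {
		return 0, false // transition to dead_letter instead of failed
	}
	switch p.Strategy {
	case "exponential": // 1s, 2s, 4s, 8s, ...
		return p.BaseDelay << (attempt - 1), true
	case "linear": // 5s, 10s, 15s, 20s, ...
		return p.BaseDelay * time.Duration(attempt), true
	case "fixed":
		return p.BaseDelay, true
	case "custom":
		i := attempt - 1
		if i >= len(p.RetryDelaysSec) {
			i = len(p.RetryDelaysSec) - 1 // clamp to the last configured delay (assumed)
		}
		return time.Duration(p.RetryDelaysSec[i]) * time.Second, true
	}
	return p.BaseDelay, true
}
```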
Poison Pill Detection
A poison pill run is one that consistently crashes with the same error across retries (e.g., a malformed payload that always triggers a 500). Without detection, these runs exhaust all retry attempts, waste worker capacity, and can trip the circuit breaker — blocking healthy runs to the same endpoint.
How It Works
When poison_pill_threshold is configured on a job, Strait tracks consecutive identical errors using a hash of the error message stored in run metadata:
- On each failure, the first 200 characters of the error message are hashed (SHA-256, truncated to 64 bits).
- The hash is stored in the run’s metadata as _error_hash, alongside a _error_hash_count counter.
- If the hash matches the previous attempt’s hash, the counter increments. If it differs, the counter resets to 1.
- When the counter reaches poison_pill_threshold, the run is routed directly to the dead letter queue with a “poison pill detected” error message instead of retrying.
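A sketch of this bookkeeping, with run metadata simplified to a map; the hashing scheme and the _error_hash / _error_hash_count keys follow the description above, everything else is illustrative:

```go
// Sketch of poison pill detection: hash the error, compare to the previous
// attempt, and report when the threshold is reached.
package worker

import (
	"crypto/sha256"
	"encoding/hex"
)

func errorHash(msg string) string {
	if len(msg) > 200 {
		msg = msg[:200] // only the first 200 characters are hashed
	}
	sum := sha256.Sum256([]byte(msg))
	return hex.EncodeToString(sum[:8]) // truncate to 64 bits
}

// recordFailure updates the metadata counters and reports whether the run
// crossed the poison pill threshold and should be dead-lettered.
func recordFailure(meta map[string]any, errMsg string, threshold int) bool {
	if threshold <= 0 {
		return false // detection disabled (null or 0)
	}
	h := errorHash(errMsg)
	count := 1 // differing hash resets the counter
	if prev, ok := meta["_error_hash"].(string); ok && prev == h {
		if c, ok := meta["_error_hash_count"].(int); ok {
			count = c + 1
		}
	}
	meta["_error_hash"] = h
	meta["_error_hash_count"] = count
	return count >= threshold // route directly to the dead letter queue
}
```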
Configuration
Set poison_pill_threshold on the job definition:
- null or 0: Detection disabled (default). Runs follow normal retry behavior.
- 1: Aggressive — the first failure immediately routes to DLQ.
- 3 (recommended): Tolerates transient variations but catches persistent failures before wasting retry budget.
Interaction with Other Resilience Features
- Circuit breaker: Poison pill detection runs after circuit breaker recording. A poison-pill DLQ still counts as a circuit breaker failure, but prevents the run from continuing to hammer the endpoint across remaining retries.
- Error classification: Non-retryable error classes (client errors, auth failures) bypass poison pill detection entirely — they go directly to DLQ via class-based routing.
- Retry strategies: When poison pill is below threshold, normal retry backoff applies. The metadata counter persists across retries so the count is accurate even with exponential delays between attempts.
Graceful Shutdown
Strait implements structured shutdown to protect in-flight work:
- Signal capture: SIGINT/SIGTERM triggers context cancellation via signal.NotifyContext
- Stop accepting work: Worker polling loop exits, no new runs dequeued
- Drain in-flight jobs: pool.Shutdown() blocks until all goroutines complete. A 30-second timeout prevents indefinite blocking.
- Stop scheduler: Cron ticker, delayed poller, and reaper stop
- Drain HTTP server: 10-second grace period for in-flight API requests
- Cleanup: Database and Redis connections closed, OTel exporters flushed
The context.WithoutCancel pattern ensures job dispatch goroutines continue executing even after the parent context is canceled, allowing them to complete and record results before the process exits.
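A condensed, self-contained sketch of the signal handling and the context.WithoutCancel hand-off; the polling loop and dispatch function are placeholders for the worker pool and scheduler components listed above:

```go
// Sketch of the shutdown flow: cancel on SIGINT/SIGTERM, stop polling, let
// in-flight dispatches finish on an uncancelable context, drain with a cap.
package main

import (
	"context"
	"log"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// Signal capture: SIGINT/SIGTERM cancels ctx.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	for ctx.Err() == nil { // stop accepting work once ctx is canceled
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Dispatch keeps running even after ctx is canceled, so results
			// can still be recorded before the process exits.
			dispatch(context.WithoutCancel(ctx))
		}()
		time.Sleep(time.Second) // stand-in for the dequeue poll interval
	}

	// Drain in-flight jobs with a 30-second cap, mirroring pool.Shutdown().
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
	case <-time.After(30 * time.Second):
		log.Println("drain timed out; exiting with work still in flight")
	}
}

func dispatch(ctx context.Context) {
	_ = ctx // call the endpoint and record the result here
	time.Sleep(100 * time.Millisecond)
}
```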
Stale Run Recovery
The stale reaper (apps/strait/internal/scheduler/reaper.go) detects runs that have stopped heartbeating and transitions them to a recoverable state:
- Runs in dequeued with heartbeat_at older than 5 minutes are transitioned to system_failed
- Runs in waiting with heartbeat_at older than 1 hour are transitioned to system_failed
- The reaper uses PostgreSQL advisory locks (pg_try_advisory_lock) to ensure single-leader execution across multiple instances
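A sketch of one reaper pass under these rules; the advisory lock key, table, and column names are illustrative, while the thresholds and target state come from the list above:

```go
// Sketch of a stale-run reaper pass: take an advisory lock so only one
// instance acts as leader, then fail over runs that stopped heartbeating.
package scheduler

import (
	"context"
	"database/sql"
)

const reaperLockKey = 42 // arbitrary advisory lock key (assumed)

func reapStaleRuns(ctx context.Context, db *sql.DB) error {
	// Advisory locks are session-scoped, so pin a single connection.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	var gotLock bool
	if err := conn.QueryRowContext(ctx,
		`SELECT pg_try_advisory_lock($1)`, reaperLockKey).Scan(&gotLock); err != nil {
		return err
	}
	if !gotLock {
		return nil // another instance is currently the reaper leader
	}
	defer conn.ExecContext(ctx, `SELECT pg_advisory_unlock($1)`, reaperLockKey)

	// dequeued runs with no heartbeat for 5 minutes -> system_failed
	if _, err := conn.ExecContext(ctx, `
		UPDATE runs SET status = 'system_failed'
		WHERE status = 'dequeued' AND heartbeat_at < now() - interval '5 minutes'`); err != nil {
		return err
	}
	// waiting runs with no heartbeat for 1 hour -> system_failed
	_, err = conn.ExecContext(ctx, `
		UPDATE runs SET status = 'system_failed'
		WHERE status = 'waiting' AND heartbeat_at < now() - interval '1 hour'`)
	return err
}
```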
OOM Recovery
When a managed container is killed by the OOM killer (exit code 137), Strait automatically recovers by upgrading the machine preset and retrying:
- The exit code is classified as an OOM kill.
- The executor selects the next preset in the upgrade chain (micro -> small-1x -> … -> large-2x).
- The recommendation is stored in job_preset_recommendations with a 24-hour decay, so future runs for the same job start at the upgraded preset.
- If the job is already at large-2x, the run is moved to dead_letter.
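A sketch of the upgrade decision; the function shape is illustrative, and the intermediate presets elided in the chain above are left elided here as well:

```go
// Sketch of OOM recovery: classify exit code 137, pick the next preset in the
// chain, or dead-letter when already at the top.
package executor

const exitCodeOOMKilled = 137

// presetChain mirrors the upgrade order micro -> small-1x -> … -> large-2x;
// the intermediate presets are elided here just as in the docs.
var presetChain = []string{"micro", "small-1x" /* … */, "large-2x"}

// nextPreset returns the preset to retry with after an OOM kill, or ok=false
// when the run should be moved to dead_letter instead.
func nextPreset(exitCode int, current string) (preset string, ok bool) {
	if exitCode != exitCodeOOMKilled {
		return current, true // not an OOM kill: keep the current preset
	}
	for i, p := range presetChain {
		if p == current && i+1 < len(presetChain) {
			// The recommendation would also be stored in
			// job_preset_recommendations with a 24-hour decay.
			return presetChain[i+1], true
		}
	}
	return "", false // already at large-2x: move the run to dead_letter
}
```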
Region Failover
When a Fly region returns 503 (capacity exhaustion) during machine provisioning, the executor fails over to alternate regions before giving up. The primary region is always tried first; fallback regions are attempted in order. The run is only snoozed if all configured regions are unavailable.
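A sketch of that failover loop, assuming an injected provision function and a stand-in capacity error; only the primary-first ordering and the snooze-when-exhausted behavior come from the docs:

```go
// Sketch of region failover during machine provisioning.
package executor

import (
	"context"
	"errors"
	"fmt"
)

// errCapacity stands in for a Fly 503 capacity-exhaustion response.
var errCapacity = errors.New("503: no capacity in region")

func provisionWithFailover(ctx context.Context, primary string, fallbacks []string,
	provision func(ctx context.Context, region string) error) error {

	regions := append([]string{primary}, fallbacks...) // primary is always tried first
	for _, region := range regions {
		err := provision(ctx, region)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errCapacity) {
			return err // non-capacity errors are not retried in other regions (assumed)
		}
	}
	// All configured regions were unavailable: snooze the run with backoff.
	return fmt.Errorf("all regions exhausted, snoozing run")
}
```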
Orphaned Machine Cleanup
The reaper (apps/strait/internal/scheduler/reaper.go) includes orphaned machine detection for managed execution:
- Machines in started or running state with no associated active run are flagged as orphaned.
- Orphaned machines are destroyed via the Fly API to prevent resource leakage and cost accumulation.
- This covers edge cases such as executor crashes mid-dispatch, network partitions during result recording, and machines that outlive their run due to race conditions.
Budget Protection
Project compute budgets are enforced with a two-phase reservation model:
- Atomic reservation: Before provisioning a machine, estimated cost is reserved against the daily budget. Concurrent dispatches cannot over-commit because reservations are atomic.
- Soft-limit warning: At 80% of the daily budget, a structured log and metric alert are emitted to give operators advance notice.
- Commit on completion: After the run finishes, the reservation is replaced with actual cost. Unused budget from over-estimates is released immediately.
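A sketch of the two-phase reservation against an assumed project_budgets table; the atomic check-and-increment and the commit-on-completion release mirror the description above, while all names are illustrative:

```go
// Sketch of budget reservation and commit. Table and column names are
// illustrative, not Strait's actual schema.
package budget

import (
	"context"
	"database/sql"
	"errors"
)

var ErrBudgetExceeded = errors.New("daily compute budget exceeded")

// reserve atomically adds the estimated cost if it still fits in today's
// budget; concurrent dispatches cannot over-commit because the check and the
// increment happen in a single UPDATE.
func reserve(ctx context.Context, db *sql.DB, projectID string, estCents int64) error {
	res, err := db.ExecContext(ctx, `
		UPDATE project_budgets
		SET reserved_cents = reserved_cents + $2
		WHERE project_id = $1
		  AND reserved_cents + spent_cents + $2 <= daily_limit_cents`,
		projectID, estCents)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return ErrBudgetExceeded
	}
	return nil
}

// commit replaces the reservation with the actual cost once the run finishes,
// releasing any over-estimate immediately.
func commit(ctx context.Context, db *sql.DB, projectID string, estCents, actualCents int64) error {
	_, err := db.ExecContext(ctx, `
		UPDATE project_budgets
		SET reserved_cents = reserved_cents - $2,
		    spent_cents    = spent_cents + $3
		WHERE project_id = $1`,
		projectID, estCents, actualCents)
	return err
}
```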