Circuit breakers, concurrency limiting, rate limiting, and graceful degradation patterns in Strait.
Strait implements several resilience patterns to protect against cascading failures when dispatching jobs to external endpoints and delivering webhooks.
Circuit Breaker
The circuit breaker pattern prevents repeated dispatch attempts to endpoints that are consistently failing. When an endpoint exceeds a failure threshold, the circuit "opens" and subsequent dispatches are short-circuited until a cooldown period elapses.
Job Dispatch Circuit Breaker
Job dispatch uses a PostgreSQL-backed circuit breaker persisted in the endpoint_circuit_state table. This provides durability across process restarts.
State machine:
- Closed (default): Dispatches proceed normally. Consecutive failures are tracked.
- Open: All dispatches to this endpoint are skipped. The run is requeued with a backoff delay. Transitions to half-open after a cooldown period.
- Half-open: A single probe dispatch is allowed. Success closes the circuit; failure re-opens it.
Schema:
-- endpoint_circuit_state table (migration 000022)
endpoint_url TEXT PRIMARY KEY
state TEXT NOT NULL DEFAULT 'closed' -- closed, open, half_open
failures INT NOT NULL DEFAULT 0
opened_at TIMESTAMPTZ
last_failure_at TIMESTAMPTZ

The in-memory circuit breaker (apps/strait/internal/worker/circuitbreaker.go) provides fast lookups during dispatch. The PostgreSQL table provides durability.
Endpoint Health Scoring
In addition to the binary circuit breaker, Strait supports continuous health scoring for endpoints. The health score (0-100) is computed using an Exponentially Weighted Moving Average (EWMA) of three signals:
- Success rate (50% weight): Fraction of dispatches that return a successful response.
- Failure/timeout rate (30% weight): Fraction of dispatches that fail or time out.
- Latency score (20% weight): How close the P95 latency is to the job's configured timeout.
Health levels:
- Healthy (score > 60): Full concurrency allowed.
- Degraded (score 30-60): Concurrency is proportionally throttled (down to 25% of max).
- Unhealthy (score < 30): Endpoint is blocked (equivalent to circuit open). Runs are snoozed with backoff.
The health score is stored in the endpoint_health_scores table and updated on every dispatch result. Recovery happens naturally as successful dispatches raise the EWMA score back above thresholds.
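The scoring and throttling rules above can be sketched as follows. The 50/30/20 weights and the 30/60 thresholds come from this document; the EWMA smoothing factor, function names, and the exact throttle curve are illustrative assumptions.

```go
package main

// healthScore folds one dispatch observation into an EWMA score in [0, 100].
// successRate, failureRate, and latencyScore are each in [0, 1]; the alpha
// value is an assumed smoothing factor, not Strait's actual constant.
func healthScore(prev, successRate, failureRate, latencyScore float64) float64 {
	const alpha = 0.2
	// Weighted blend of the three documented signals, scaled to 0-100.
	sample := 100 * (0.5*successRate + 0.3*(1-failureRate) + 0.2*latencyScore)
	return alpha*sample + (1-alpha)*prev
}

// concurrencyFactor maps a score to the allowed fraction of max_concurrency,
// following the documented health levels.
func concurrencyFactor(score float64) float64 {
	switch {
	case score > 60: // healthy: full concurrency
		return 1.0
	case score >= 30: // degraded: throttle proportionally down to 25%
		return 0.25 + 0.75*(score-30)/30
	default: // unhealthy: blocked, equivalent to circuit open
		return 0
	}
}
```

Because the score is an EWMA, a burst of successful dispatches raises it smoothly back above the thresholds, which is what makes recovery automatic.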
Fallback Endpoint
When a job has a fallback_endpoint_url configured and the primary endpoint fails dispatch, Strait automatically retries against the fallback URL. The fallback fires when:
- The primary endpoint returns a non-retryable error
- The primary endpoint has exhausted retry attempts within the current dispatch cycle
Fallback attempts are counted against the run's max_attempts budget. Both primary and fallback endpoints receive the same SSRF validation at dispatch time.
Concurrency Limiting
Per-job concurrency caps prevent a single job definition from monopolizing worker capacity.
max_concurrency
The max_concurrency field on job definitions limits how many runs of that job can execute simultaneously. Enforcement happens at dequeue time via a COUNT subquery in the SKIP LOCKED dequeue query:
SELECT ... FROM job_runs
WHERE status = 'queued'
AND ...
AND (
j.max_concurrency IS NULL
OR j.max_concurrency = 0
OR (SELECT COUNT(*) FROM job_runs jr2
WHERE jr2.job_id = j.id AND jr2.status = 'executing')
< j.max_concurrency
)
FOR UPDATE SKIP LOCKED
LIMIT $batch_size

This provides single-worker enforcement. For multi-worker deployments, distributed enforcement via Redis atomic counters is planned.
Rate Limiting
Per-Job Rate Limits
The rate_limit_max and rate_limit_window_secs fields on job definitions control dispatch throughput. When a job exceeds its rate limit, runs are held in queued status until the window expires.
API Rate Limiting
The API layer implements rate limiting via httprate middleware:
- Global rate limit: RATE_LIMIT_REQUESTS per RATE_LIMIT_WINDOW per IP
- Trigger rate limit: TRIGGER_RATE_LIMIT_REQUESTS specifically for /trigger endpoints
- RBAC control-plane limits: Stricter per-route limits on mutating authorization endpoints
Retry Strategies
Strait supports four retry strategies configurable per job:
| Strategy | Behavior |
|---|---|
| exponential | Delay doubles each attempt: 1s, 2s, 4s, 8s, ... |
| linear | Fixed increment per attempt: 5s, 10s, 15s, 20s, ... |
| fixed | Same delay every attempt |
| custom | Per-attempt delays specified via retry_delays_secs array |
All strategies apply +/-20% jitter to prevent thundering herd when many runs retry simultaneously.
When attempt >= max_attempts, the run transitions to dead_letter instead of failed. Dead-lettered runs can be replayed via the DLQ management API.
For detailed retry configuration, see Retry Strategies.
Graceful Shutdown
Strait implements structured shutdown to protect in-flight work:
- Signal capture: SIGINT/SIGTERM triggers context cancellation via signal.NotifyContext
- Stop accepting work: Worker polling loop exits, no new runs dequeued
- Drain in-flight jobs: pool.Shutdown() blocks until all goroutines complete. A 30-second timeout prevents indefinite blocking.
- Stop scheduler: Cron ticker, delayed poller, and reaper stop
- Drain HTTP server: 10-second grace period for in-flight API requests
- Cleanup: Database and Redis connections closed, OTel exporters flushed
The context.WithoutCancel pattern ensures job dispatch goroutines continue executing even after the parent context is canceled, allowing them to complete and record results before the process exits.
Stale Run Recovery
The stale reaper (apps/strait/internal/scheduler/reaper.go) detects runs that have stopped heartbeating and transitions them to a recoverable state:
- Runs in dequeued with heartbeat_at older than 5 minutes are transitioned to system_failed
- Runs in waiting with heartbeat_at older than 1 hour are transitioned to system_failed
- The reaper uses PostgreSQL advisory locks (pg_try_advisory_lock) to ensure single-leader execution across multiple instances
This acts as a safety net for worker crashes, network partitions, and other failure scenarios where a run is claimed but never completed.
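The single-leader guard can be sketched by abstracting the `SELECT pg_try_advisory_lock($1)` call behind a function, which also keeps the sketch testable without a database. The lock key and all names here are illustrative.

```go
package main

// lockFn abstracts `SELECT pg_try_advisory_lock($1)`; in production it
// would run that query against PostgreSQL and scan the boolean result.
type lockFn func(key int64) (bool, error)

// reaperLockKey is an arbitrary illustrative advisory-lock key.
const reaperLockKey int64 = 0x5742

// runReaperPass executes one sweep only on the instance that wins the
// advisory lock, giving single-leader execution across replicas.
func runReaperPass(tryLock lockFn, sweep func() error) (ran bool, err error) {
	got, err := tryLock(reaperLockKey)
	if err != nil || !got {
		return false, err // another instance holds the lock: skip this pass
	}
	return true, sweep()
}
```

Because pg_try_advisory_lock is non-blocking, losing instances return immediately instead of queueing behind the leader.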
OOM Recovery
When a managed container is killed by the OOM killer (exit code 137), Strait automatically recovers by upgrading the machine preset and retrying:
- The exit code is classified as an OOM kill.
- The executor selects the next preset in the upgrade chain (micro -> small-1x -> ... -> large-2x).
- The recommendation is stored in job_preset_recommendations with a 24-hour decay, so future runs for the same job start at the upgraded preset.
- If the job is already at large-2x, the run is moved to dead_letter.
This prevents OOM loops from consuming retry budget on an undersized preset.
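The classification and upgrade steps can be sketched as below. The chain is passed in as a slice because this document elides the middle of the upgrade order; function names are illustrative.

```go
package main

import "errors"

// isOOMKill classifies the container exit code.
func isOOMKill(exitCode int) bool {
	return exitCode == 137 // SIGKILL delivered by the OOM killer
}

// nextPreset returns the preset to retry with after an OOM kill, or an
// error when the run should be dead-lettered or the preset is unknown.
func nextPreset(chain []string, current string) (string, error) {
	for i, p := range chain {
		if p == current {
			if i+1 < len(chain) {
				return chain[i+1], nil // upgrade one step
			}
			return "", errors.New("already at largest preset: dead_letter the run")
		}
	}
	return "", errors.New("unknown preset")
}
```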
Region Failover
When a Fly region returns 503 (capacity exhaustion) during machine provisioning, the executor fails over to alternate regions before giving up. The primary region is always tried first; fallback regions are attempted in order. The run is only snoozed if all configured regions are unavailable.
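A minimal sketch of the failover loop, assuming a provisioning function that surfaces 503 capacity errors distinctly. The error value, region codes, and function names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// errCapacity stands in for a Fly 503 capacity-exhaustion response.
var errCapacity = errors.New("503: region capacity exhausted")

// provisionFn abstracts machine creation in a single region.
type provisionFn func(region string) error

// provisionWithFailover tries the primary region first, then each fallback
// in order; the run is snoozed only when every region is unavailable.
func provisionWithFailover(primary string, fallbacks []string, provision provisionFn) (string, error) {
	for _, region := range append([]string{primary}, fallbacks...) {
		err := provision(region)
		if err == nil {
			return region, nil
		}
		if !errors.Is(err, errCapacity) {
			return "", err // non-capacity errors are not retried elsewhere
		}
	}
	return "", fmt.Errorf("all regions exhausted: snooze the run")
}
```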
Orphaned Machine Cleanup
The reaper (apps/strait/internal/scheduler/reaper.go) includes orphaned machine detection for managed execution:
- Machines in started or running state with no associated active run are flagged as orphaned.
- Orphaned machines are destroyed via the Fly API to prevent resource leakage and cost accumulation.
- This covers edge cases such as executor crashes mid-dispatch, network partitions during result recording, and machines that outlive their run due to race conditions.
Budget Protection
Project compute budgets are enforced with a two-phase reservation model:
- Atomic reservation: Before provisioning a machine, estimated cost is reserved against the daily budget. Concurrent dispatches cannot over-commit because reservations are atomic.
- Soft-limit warning: At 80% of the daily budget, a structured log and metric alert are emitted to give operators advance notice.
- Commit on completion: After the run finishes, the reservation is replaced with actual cost. Unused budget from over-estimates is released immediately.
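The two-phase model can be sketched in memory. In Strait the reservation is a single atomic database update; the mutex below plays that role in the sketch, and the type, field, and threshold-check details are illustrative.

```go
package main

import (
	"errors"
	"sync"
)

// budget holds one project's daily compute budget, in cents.
type budget struct {
	mu       sync.Mutex
	daily    int64
	reserved int64 // phase 1: estimates held against the budget
	spent    int64 // phase 2: committed actual costs
}

var errBudgetExceeded = errors.New("daily budget exceeded")

// reserve atomically sets aside estimated cost before provisioning and
// reports whether the 80% soft-limit warning should fire.
func (b *budget) reserve(estimate int64) (warn bool, err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spent+b.reserved+estimate > b.daily {
		return false, errBudgetExceeded // cannot over-commit
	}
	b.reserved += estimate
	return (b.spent+b.reserved)*5 >= b.daily*4, nil // at or past 80%
}

// commit replaces the reservation with actual cost; any over-estimate is
// released immediately.
func (b *budget) commit(estimate, actual int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.reserved -= estimate
	b.spent += actual
}
```

Because the check and the increment happen under one lock (one UPDATE in the real system), two concurrent dispatches cannot both squeeze into the last slice of budget.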