Circuit breakers, concurrency limiting, rate limiting, and graceful degradation patterns in Strait.
Strait implements several resilience patterns to protect against cascading failures when dispatching jobs to external endpoints and delivering webhooks.
Circuit Breaker
The circuit breaker pattern prevents repeated dispatch attempts to endpoints that are consistently failing. When an endpoint exceeds a failure threshold, the circuit "opens" and subsequent dispatches are short-circuited until a cooldown period elapses.
Job Dispatch Circuit Breaker
Job dispatch uses a PostgreSQL-backed circuit breaker persisted in the endpoint_circuit_state table. This provides durability across process restarts.
State machine:
- Closed (default): Dispatches proceed normally. Consecutive failures are tracked.
- Open: All dispatches to this endpoint are skipped. The run is requeued with a backoff delay. Transitions to half-open after a cooldown period.
- Half-open: A single probe dispatch is allowed. Success closes the circuit; failure re-opens it.
Schema:
-- endpoint_circuit_state table (migration 000022)
endpoint_url TEXT PRIMARY KEY
state TEXT NOT NULL DEFAULT 'closed' -- closed, open, half_open
failures INT NOT NULL DEFAULT 0
opened_at TIMESTAMPTZ
last_failure_at TIMESTAMPTZ

The in-memory circuit breaker (apps/strait/internal/worker/circuitbreaker.go) provides fast lookups during dispatch. The PostgreSQL table provides durability.
Endpoint Health Scoring
In addition to the binary circuit breaker, Strait supports continuous health scoring for endpoints. The health score (0-100) is computed using an Exponentially Weighted Moving Average (EWMA) of three signals:
- Success rate (50% weight): Fraction of dispatches that return a successful response.
- Failure/timeout rate (30% weight): Fraction of dispatches that fail or time out.
- Latency score (20% weight): How close the P95 latency is to the job's configured timeout.
Health levels:
- Healthy (score > 60): Full concurrency allowed.
- Degraded (score 30-60): Concurrency is proportionally throttled (down to 25% of max).
- Unhealthy (score < 30): Endpoint is blocked (equivalent to circuit open). Runs are snoozed with backoff.
The health score is stored in the endpoint_health_scores table and updated on every dispatch result. Recovery happens naturally as successful dispatches raise the EWMA score back above thresholds.
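The scoring and throttling rules above can be sketched as follows. The 50/30/20 weights and the 30/60 thresholds come from this document; the EWMA smoothing factor, function names, and the exact throttle curve are illustrative assumptions.

```go
package main

// healthScore folds one dispatch observation into an EWMA score in [0, 100].
// successRate, failureRate, and latencyScore are each in [0, 1]; the alpha
// value is an assumed smoothing factor, not Strait's actual constant.
func healthScore(prev, successRate, failureRate, latencyScore float64) float64 {
	const alpha = 0.2
	// Weighted blend of the three documented signals, scaled to 0-100.
	sample := 100 * (0.5*successRate + 0.3*(1-failureRate) + 0.2*latencyScore)
	return alpha*sample + (1-alpha)*prev
}

// concurrencyFactor maps a score to the allowed fraction of max_concurrency,
// following the documented health levels.
func concurrencyFactor(score float64) float64 {
	switch {
	case score > 60: // healthy: full concurrency
		return 1.0
	case score >= 30: // degraded: throttle proportionally down to 25%
		return 0.25 + 0.75*(score-30)/30
	default: // unhealthy: blocked, equivalent to circuit open
		return 0
	}
}
```

Because the score is an EWMA, a burst of successful dispatches raises it smoothly back above the thresholds, which is what makes recovery automatic.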
Fallback Endpoint
When a job has a fallback_endpoint_url configured and the primary endpoint fails dispatch, Strait automatically retries against the fallback URL. The fallback fires when:
- The primary endpoint returns a non-retryable error
- The primary endpoint has exhausted retry attempts within the current dispatch cycle
Fallback attempts are counted against the run's max_attempts budget. Both primary and fallback endpoints receive the same SSRF validation at dispatch time.
Concurrency Limiting
Per-job concurrency caps prevent a single job definition from monopolizing worker capacity.
max_concurrency
The max_concurrency field on job definitions limits how many runs of that job can execute simultaneously. Enforcement happens at dequeue time via a COUNT subquery in the SKIP LOCKED dequeue query:
SELECT ... FROM job_runs
WHERE status = 'queued'
AND ...
AND (
j.max_concurrency IS NULL
OR j.max_concurrency = 0
OR (SELECT COUNT(*) FROM job_runs jr2
WHERE jr2.job_id = j.id AND jr2.status = 'executing')
< j.max_concurrency
)
FOR UPDATE SKIP LOCKED
LIMIT $batch_size

This provides single-worker enforcement. For multi-worker deployments, distributed enforcement via Redis atomic counters is planned.
Rate Limiting
Per-Job Rate Limits
The rate_limit_max and rate_limit_window_secs fields on job definitions control dispatch throughput. When a job exceeds its rate limit, runs are held in queued status until the window expires.
API Rate Limiting
The API layer implements rate limiting via httprate middleware:
- Global rate limit: RATE_LIMIT_REQUESTS per RATE_LIMIT_WINDOW per IP
- Trigger rate limit: TRIGGER_RATE_LIMIT_REQUESTS specifically for /trigger endpoints
- RBAC control-plane limits: Stricter per-route limits on mutating authorization endpoints
Retry Strategies
Strait supports four retry strategies configurable per job:
| Strategy | Behavior |
|---|---|
| exponential | Delay doubles each attempt: 1s, 2s, 4s, 8s, ... |
| linear | Fixed increment per attempt: 5s, 10s, 15s, 20s, ... |
| fixed | Same delay every attempt |
| custom | Per-attempt delays specified via retry_delays_secs array |
All strategies apply +/-20% jitter to prevent thundering herd when many runs retry simultaneously.
When attempt >= max_attempts, the run transitions to dead_letter instead of failed. Dead-lettered runs can be replayed via the DLQ management API.
For detailed retry configuration, see Retry Strategies.
Graceful Shutdown
Strait implements structured shutdown to protect in-flight work:
- Signal capture: SIGINT/SIGTERM triggers context cancellation via signal.NotifyContext
- Stop accepting work: Worker polling loop exits, no new runs dequeued
- Drain in-flight jobs: pool.Shutdown() blocks until all goroutines complete. A 30-second timeout prevents indefinite blocking.
- Stop scheduler: Cron ticker, delayed poller, and reaper stop
- Drain HTTP server: 10-second grace period for in-flight API requests
- Cleanup: Database and Redis connections closed, OTel exporters flushed
The context.WithoutCancel pattern ensures job dispatch goroutines continue executing even after the parent context is canceled, allowing them to complete and record results before the process exits.
Stale Run Recovery
The stale reaper (apps/strait/internal/scheduler/reaper.go) detects runs that have stopped heartbeating and transitions them to a recoverable state:
- Runs in dequeued with heartbeat_at older than 5 minutes are transitioned to system_failed
- Runs in waiting with heartbeat_at older than 1 hour are transitioned to system_failed
- The reaper uses PostgreSQL advisory locks (pg_try_advisory_lock) to ensure single-leader execution across multiple instances
This acts as a safety net for worker crashes, network partitions, and other failure scenarios where a run is claimed but never completed.
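The single-leader guard can be sketched by abstracting the `SELECT pg_try_advisory_lock($1)` call behind a function, which also keeps the sketch testable without a database. The lock key and all names here are illustrative.

```go
package main

// lockFn abstracts `SELECT pg_try_advisory_lock($1)`; in production it
// would run that query against PostgreSQL and scan the boolean result.
type lockFn func(key int64) (bool, error)

// reaperLockKey is an arbitrary illustrative advisory-lock key.
const reaperLockKey int64 = 0x5742

// runReaperPass executes one sweep only on the instance that wins the
// advisory lock, giving single-leader execution across replicas.
func runReaperPass(tryLock lockFn, sweep func() error) (ran bool, err error) {
	got, err := tryLock(reaperLockKey)
	if err != nil || !got {
		return false, err // another instance holds the lock: skip this pass
	}
	return true, sweep()
}
```

Because pg_try_advisory_lock is non-blocking, losing instances return immediately instead of queueing behind the leader.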
OOM Recovery
When a managed container is killed by the OOM killer (exit code 137), Strait automatically recovers by upgrading the machine preset and retrying:
- The exit code is classified as an OOM kill.
- The executor selects the next preset in the upgrade chain (micro -> small-1x -> ... -> large-2x).
- The recommendation is stored in job_preset_recommendations with a 24-hour decay, so future runs for the same job start at the upgraded preset.
- If the job is already at large-2x, the run is moved to dead_letter.
This prevents OOM loops from consuming retry budget on an undersized preset.
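The classification and upgrade steps can be sketched as below. The chain is passed in as a slice because this document elides the middle of the upgrade order; function names are illustrative.

```go
package main

import "errors"

// isOOMKill classifies the container exit code.
func isOOMKill(exitCode int) bool {
	return exitCode == 137 // SIGKILL delivered by the OOM killer
}

// nextPreset returns the preset to retry with after an OOM kill, or an
// error when the run should be dead-lettered or the preset is unknown.
func nextPreset(chain []string, current string) (string, error) {
	for i, p := range chain {
		if p == current {
			if i+1 < len(chain) {
				return chain[i+1], nil // upgrade one step
			}
			return "", errors.New("already at largest preset: dead_letter the run")
		}
	}
	return "", errors.New("unknown preset")
}
```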
Region Failover
When a Fly region returns 503 (capacity exhaustion) during machine provisioning, the executor fails over to alternate regions before giving up. The primary region is always tried first; fallback regions are attempted in order. The run is only snoozed if all configured regions are unavailable.
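A minimal sketch of the failover loop, assuming a provisioning function that surfaces 503 capacity errors distinctly. The error value, region codes, and function names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// errCapacity stands in for a Fly 503 capacity-exhaustion response.
var errCapacity = errors.New("503: region capacity exhausted")

// provisionFn abstracts machine creation in a single region.
type provisionFn func(region string) error

// provisionWithFailover tries the primary region first, then each fallback
// in order; the run is snoozed only when every region is unavailable.
func provisionWithFailover(primary string, fallbacks []string, provision provisionFn) (string, error) {
	for _, region := range append([]string{primary}, fallbacks...) {
		err := provision(region)
		if err == nil {
			return region, nil
		}
		if !errors.Is(err, errCapacity) {
			return "", err // non-capacity errors are not retried elsewhere
		}
	}
	return "", fmt.Errorf("all regions exhausted: snooze the run")
}
```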
Orphaned Machine Cleanup
The reaper (apps/strait/internal/scheduler/reaper.go) includes orphaned machine detection for managed execution:
- Machines in started or running state with no associated active run are flagged as orphaned.
- Orphaned machines are destroyed via the Fly API to prevent resource leakage and cost accumulation.
- This covers edge cases such as executor crashes mid-dispatch, network partitions during result recording, and machines that outlive their run due to race conditions.
Budget Protection
Project compute budgets are enforced with a two-phase reservation model:
- Atomic reservation: Before provisioning a machine, estimated cost is reserved against the daily budget. Concurrent dispatches cannot over-commit because reservations are atomic.
- Soft-limit warning: At 80% of the daily budget, a structured log and metric alert are emitted to give operators advance notice.
- Commit on completion: After the run finishes, the reservation is replaced with actual cost. Unused budget from over-estimates is released immediately.
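The two-phase model can be sketched in memory. In Strait the reservation is a single atomic database update; the mutex below plays that role in the sketch, and the type, field, and threshold-check details are illustrative.

```go
package main

import (
	"errors"
	"sync"
)

// budget holds one project's daily compute budget, in cents.
type budget struct {
	mu       sync.Mutex
	daily    int64
	reserved int64 // phase 1: estimates held against the budget
	spent    int64 // phase 2: committed actual costs
}

var errBudgetExceeded = errors.New("daily budget exceeded")

// reserve atomically sets aside estimated cost before provisioning and
// reports whether the 80% soft-limit warning should fire.
func (b *budget) reserve(estimate int64) (warn bool, err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spent+b.reserved+estimate > b.daily {
		return false, errBudgetExceeded // cannot over-commit
	}
	b.reserved += estimate
	return (b.spent+b.reserved)*5 >= b.daily*4, nil // at or past 80%
}

// commit replaces the reservation with actual cost; any over-estimate is
// released immediately.
func (b *budget) commit(estimate, actual int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.reserved -= estimate
	b.spent += actual
}
```

Because the check and the increment happen under one lock (one UPDATE in the real system), two concurrent dispatches cannot both squeeze into the last slice of budget.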