
Optimize Strait for high-throughput job execution and low-latency API responses.

Strait is designed to handle high volumes of concurrent jobs with minimal overhead. However, as your workload grows, tuning the configuration of the database, worker pool, and queue mechanics becomes essential to maintain stability and performance.

Overview

Strait's performance is primarily governed by three factors:

  1. Database Throughput: How fast PostgreSQL can handle SELECT FOR UPDATE SKIP LOCKED queries and state transitions.
  2. Worker Concurrency: The number of parallel goroutines executing jobs.
  3. Network Latency: The time taken to dispatch jobs to external endpoints and receive heartbeats.
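The claim pattern behind factor 1 is the standard SKIP LOCKED recipe. As an illustration only (this is not Strait's actual SQL; the table and column names here are assumptions), a claiming query looks like:

```sql
-- Claim up to 10 queued runs without blocking on rows other workers hold.
UPDATE runs
SET    state = 'running', claimed_at = now()
WHERE  id IN (
    SELECT id
    FROM   runs
    WHERE  state = 'queued'
    ORDER  BY created_at
    LIMIT  10
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

Because locked rows are skipped rather than waited on, many workers can poll the same table concurrently without serializing on row locks.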

Database Tuning

PostgreSQL is the source of truth and the engine behind Strait's queue. Proper database tuning is the most impactful way to improve performance.

Connection Pool Sizing

Strait manages its own connection pool. Ensure the pool is large enough to serve concurrent worker requests and API traffic without exhausting it.

  • DB_MAX_CONNS: Set this to a value that accounts for all Strait instances (API + Workers) plus some headroom for administrative tasks.
  • DB_MIN_CONNS: Maintain a small number of warm connections to avoid the latency of establishing new ones during bursts.

If you are using a connection pooler like PgBouncer, enable DB_PGBOUNCER_MODE=true to ensure compatibility with transaction-level pooling.
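Taken together, a starting configuration for a small deployment might look like the following. The numbers are illustrative, not recommendations; size them from your own instance count and workload:

```
# Assuming 3 Strait instances, each serving workers and API traffic.
DB_MAX_CONNS=100        # instances × per-instance demand + admin headroom
DB_MIN_CONNS=10         # warm connections ready for bursts
DB_PGBOUNCER_MODE=true  # only when fronted by PgBouncer in transaction mode
```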

Indexing Strategy

Strait uses GIN indexes for tag filtering and B-tree indexes for job lookups.

  • Monitor index bloat on the runs table, especially if you have high churn.
  • Tune INDEX_MAINTENANCE_INTERVAL (default: 24h) so that Strait performs concurrent reindexing during low-traffic periods.
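One way to watch index health on the runs table is a plain PostgreSQL catalog query, independent of Strait itself:

```sql
-- Index sizes and scan counts for the runs table; large,
-- rarely-scanned indexes are bloat candidates.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size,
       idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'runs'
ORDER  BY pg_relation_size(indexrelid) DESC;
```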

VACUUM and ANALYZE

High-frequency job processing leads to many dead tuples in the runs table.

  • Ensure autovacuum is aggressively configured for the Strait database.
  • Manually run ANALYZE after large batch job injections to ensure the query planner has up-to-date statistics for the queue claiming logic.
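Aggressive autovacuum for a hot table is usually configured with per-table storage parameters. For example (the thresholds below are a common starting point, not Strait defaults):

```sql
-- Vacuum after ~1% of rows are dead instead of the global 20% default.
ALTER TABLE runs SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

-- After a large batch injection, refresh planner statistics immediately.
ANALYZE runs;
```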

Worker Pool Configuration

The worker pool determines how many jobs can be processed in parallel on a single instance.

Concurrency Limits

  • WORKER_CONCURRENCY: This controls the number of concurrent execution goroutines. Increasing this value allows more jobs to run in parallel but increases CPU and memory usage.
  • ADAPTIVE_CONCURRENCY_MAX: If adaptive concurrency is enabled, this sets the upper bound for the worker pool size based on system load and downstream pressure.

Timeout Tuning

  • EXECUTOR_HTTP_TIMEOUT: Tune this based on the expected duration of your jobs. Setting it too high can lead to worker exhaustion if downstream services hang.
  • REQUEST_TIMEOUT: Controls the maximum duration for incoming API requests. Keep this low (e.g., 30s) to prevent slow clients from holding onto resources.

Queue Performance

Strait uses a polling mechanism combined with PostgreSQL's SKIP LOCKED to claim jobs.

Polling Intervals

  • POLLER_INTERVAL: Controls how often the worker checks the database for new queued runs. A shorter interval (e.g., 1s) reduces latency but increases database CPU load.
  • REAPER_INTERVAL: Controls how often the system checks for stale runs (runs that haven't sent a heartbeat).

Batch Sizes

  • SEQUIN_BATCH_SIZE: When using change data capture (CDC) for event triggers, increasing the batch size can improve throughput at the cost of slightly higher memory usage per poll.

Memory Management

Strait is optimized for a low memory footprint, but certain configurations can increase usage.

Pre-allocated Slices

When processing large batches of jobs or workflows, Strait uses pre-allocated slices to minimize allocations. Ensure your MAX_REQUEST_BODY_SIZE is tuned to prevent excessively large payloads from causing OOM (Out of Memory) issues.

Graceful Shutdown Drain

  • WORKER_DRAIN_TIMEOUT: During shutdown, Strait waits for in-flight jobs to complete. Ensure this timeout is long enough to allow jobs to reach a checkpoint or finish, preventing unnecessary retries on the next start.

Monitoring

Effective tuning requires visibility into the system's internal state.

Prometheus Metrics

Strait exposes a /metrics endpoint that provides detailed insights:

  • strait_worker_concurrency_current: Number of active execution slots.
  • strait_queue_depth: Number of jobs in queued state.
  • strait_run_duration_seconds: Histogram of job execution times.
  • strait_db_open_connections: Current database connection pool usage.
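Assuming these metrics are scraped by Prometheus, a few starting-point queries (the histogram is assumed to follow the standard _bucket convention):

```
# Saturation: active slots vs. backlog.
strait_worker_concurrency_current
strait_queue_depth

# p95 job duration over the last 5 minutes.
histogram_quantile(0.95,
  sum(rate(strait_run_duration_seconds_bucket[5m])) by (le))
```

A queue depth that grows while concurrency sits at its maximum is the clearest signal that the worker pool, not the database, is the bottleneck.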

Analytics API

Use the GET /v1/analytics/performance endpoint to retrieve aggregate health scores, success rates, and latency stability metrics. This is useful for identifying trends over time.

Common Bottlenecks

  1. N+1 Queries: When fetching jobs with many tags or related runs, ensure you are using the bulk fetch endpoints to avoid multiple round-trips to the database.
  2. Unbounded Result Sets: Always use pagination when querying runs or audit logs to prevent memory spikes and slow responses.
  3. Serialization Overhead: For jobs with very large input/output payloads, the JSON serialization/deserialization can become a CPU bottleneck. Consider using smaller payloads or storing large data in external object storage and passing references.

Analytics Query Optimization

The GET /v1/analytics/performance endpoint aggregates data from the job_runs table using window functions and percentile calculations. For large datasets:

  • Limit the period_hours parameter to the minimum necessary window (default: 24h, max: 720h).
  • The query scans job_runs with created_at >= now() - interval, so the idx_job_runs_created_at index is critical.
  • The HAVING COUNT(*) >= 5 filter ensures only statistically meaningful jobs appear in slowest-jobs results.
  • Monitor the strait_analytics_query_duration_seconds histogram to detect degradation.
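A typical call, kept to a narrow window, might look like this (the base URL and auth header are placeholders; substitute your deployment's values):

```
curl -s \
  -H "Authorization: Bearer $STRAIT_API_KEY" \
  "https://strait.example.com/v1/analytics/performance?period_hours=24"
```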

Bulk Operation Tuning

Bulk trigger and bulk cancel endpoints process up to 100 items per request. For high-volume workloads:

  • Keep batch sizes at or below 100 items to avoid long-running transactions.
  • Bulk cancel propagates to child runs automatically. Monitor strait_bulk_child_cancellations_total for cascading cancellation volume.
  • For sustained bulk ingestion, prefer multiple sequential requests with smaller batches over fewer requests with maximum batch sizes.
  • Monitor strait_bulk_operations_total and strait_bulk_items_processed_total for throughput visibility.