ClickHouse

Optional analytics backend for run metrics, cost tracking, and time-series reporting.

Overview

Strait integrates with ClickHouse as an optional, append-only analytics store optimized for high-volume time-series queries. ClickHouse is a column-oriented OLAP database that excels at aggregating millions of rows with sub-second latency, making it well suited for run metrics, cost breakdowns, and event-level observability.

PostgreSQL remains the source of truth for all operational data. ClickHouse is a read-optimized replica that receives data via an asynchronous export pipeline. Disabling ClickHouse has zero impact on workflow execution, scheduling, or run management. The engine continues to function normally using Postgres alone.

Why ClickHouse?

| Requirement | Why ClickHouse fits |
| --- | --- |
| Time-windowed aggregations over millions of run events | Column-oriented storage and vectorized execution make full-table scans fast without secondary indexes. |
| Cost analysis across projects and presets | Native support for Int64 micro-USD columns with efficient GROUP BY over low-cardinality dimensions. |
| High write throughput with minimal overhead | Append-only MergeTree engine handles large batch inserts without write amplification. |
| Automatic data retention | Built-in TTL rules expire old rows at the partition level, requiring no external cron jobs. |

Tables and Schema

Strait creates three ClickHouse tables on startup via idempotent CREATE TABLE IF NOT EXISTS statements.

run_events

Stores discrete lifecycle events emitted during run execution (e.g., step started, step failed, checkpoint saved). Used for event-level drill-down and log analytics.

| Column | Type | Description |
| --- | --- | --- |
| event_id | String | Unique event identifier. |
| run_id | String | Parent run identifier. |
| project_id | String | Owning project. |
| job_id | String | Parent job identifier. |
| event_type | LowCardinality(String) | Event category (e.g., step.started, run.failed). |
| level | LowCardinality(String) | Severity level (info, warn, error). |
| message | String | Human-readable event description. |
| metadata | String | JSON-encoded arbitrary metadata. |
| created_at | DateTime64(3) | Timestamp when the event occurred in the engine. |
| inserted_at | DateTime64(3) | Timestamp when the row was written to ClickHouse. Defaults to now64(3). |
  • Engine: MergeTree()
  • Partition key: toDate(inserted_at) (daily partitions)
  • Order key: (project_id, run_id, created_at)
  • TTL: 90 days from inserted_at
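Putting the schema, engine, and retention settings together, the idempotent statement Strait issues on startup looks approximately like the following (a sketch reconstructed from the table above; the exact DDL may differ):

```
CREATE TABLE IF NOT EXISTS run_events (
    event_id    String,
    run_id      String,
    project_id  String,
    job_id      String,
    event_type  LowCardinality(String),
    level       LowCardinality(String),
    message     String,
    metadata    String,
    created_at  DateTime64(3),
    inserted_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY toDate(inserted_at)
ORDER BY (project_id, run_id, created_at)
TTL toDateTime(inserted_at) + INTERVAL 90 DAY
```

The run_analytics and compute_usage tables follow the same pattern with their own order keys and TTLs.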

run_analytics

One row per run attempt. Captures duration, queue wait, cost, and execution metadata for dashboard-level reporting.

| Column | Type | Description |
| --- | --- | --- |
| run_id | String | Run identifier. |
| job_id | String | Parent job identifier. |
| project_id | String | Owning project. |
| status | LowCardinality(String) | Terminal status (succeeded, failed, cancelled). |
| execution_mode | LowCardinality(String) | How the run was executed (managed, external). |
| machine_preset | LowCardinality(String) | Machine preset used for the run. |
| attempt | UInt8 | Retry attempt number (1-based). |
| duration_ms | UInt64 | Total wall-clock duration in milliseconds. |
| queue_wait_ms | UInt64 | Time spent waiting in the dispatch queue. |
| cost_microusd | Int64 | Total run cost in micro-USD (1 USD = 1,000,000 micro-USD). |
| compute_cost_microusd | Int64 | Compute-only portion of cost in micro-USD. |
| triggered_by | LowCardinality(String) | Trigger source (schedule, webhook, manual, api). |
| created_at | DateTime64(3) | Run creation timestamp. |
| started_at | Nullable(DateTime64(3)) | When execution began. Null if never started. |
| finished_at | Nullable(DateTime64(3)) | When execution completed. Null if still running. |
| inserted_at | DateTime64(3) | ClickHouse insertion timestamp. Defaults to now64(3). |
  • Engine: MergeTree()
  • Partition key: toDate(inserted_at) (daily partitions)
  • Order key: (project_id, job_id, created_at)
  • TTL: 365 days from inserted_at

compute_usage

Tracks individual machine sessions for granular compute billing and utilization analysis.

| Column | Type | Description |
| --- | --- | --- |
| run_id | String | Associated run identifier. |
| project_id | String | Owning project. |
| machine_preset | LowCardinality(String) | Machine preset name. |
| machine_id | String | Specific machine instance identifier. |
| duration_secs | Float64 | Session duration in seconds. |
| cost_microusd | Int64 | Session cost in micro-USD. |
| started_at | DateTime64(3) | Session start time. |
| finished_at | DateTime64(3) | Session end time. |
| inserted_at | DateTime64(3) | ClickHouse insertion timestamp. Defaults to now64(3). |
  • Engine: MergeTree()
  • Partition key: toDate(inserted_at) (daily partitions)
  • Order key: (project_id, started_at)
  • TTL: 365 days from inserted_at

Data Pipeline

Data flows from PostgreSQL to ClickHouse through an asynchronous batch exporter that runs as a background goroutine within the Strait process.

Export flow

Engine (Postgres writes) --> Exporter.Enqueue() --> In-memory buffer --> Flush --> ClickHouse batch INSERT
  1. Enqueue. When the engine completes a state transition (run started, run finished, cost recorded), it calls Exporter.Enqueue() to place the record in an in-memory buffer. This call is non-blocking and safe for concurrent use.

  2. Buffer accumulation. Records accumulate in memory until one of two flush triggers fires:

    • Batch size threshold -- the buffer reaches CLICKHOUSE_BATCH_SIZE records (default: 1000).
    • Flush interval -- CLICKHOUSE_FLUSH_INTERVAL elapses since the last flush (default: 5 seconds).
  3. Batch flush. The exporter swaps the buffer, releasing the lock immediately, then writes the batch to ClickHouse via a single INSERT statement.

  4. Graceful shutdown. On process termination, the exporter performs a final flush to drain any remaining buffered records before closing.
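The enqueue, buffer-swap, and shutdown-drain steps above can be sketched in Go as follows. This is a minimal illustration, not Strait's actual code: the Record type, the flushed slice standing in for ClickHouse batch INSERTs, and all names are assumptions, and the flush-interval timer is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// Record is an illustrative stand-in for the exported row types.
type Record struct{ RunID string }

// Exporter buffers records and flushes them in batches.
type Exporter struct {
	mu      sync.Mutex
	buf     []Record
	batch   int // batch-size flush threshold (CLICKHOUSE_BATCH_SIZE)
	flushMu sync.Mutex
	flushed [][]Record // stands in for ClickHouse batch INSERTs
}

func NewExporter(batchSize int) *Exporter { return &Exporter{batch: batchSize} }

// Enqueue is non-blocking and safe for concurrent use. When the buffer
// reaches the batch-size threshold, it swaps the buffer out under the
// lock and flushes outside the lock, so other producers never wait on I/O.
func (e *Exporter) Enqueue(r Record) {
	e.mu.Lock()
	e.buf = append(e.buf, r)
	var full []Record
	if len(e.buf) >= e.batch {
		full = e.buf
		e.buf = nil // swap: lock is released immediately after
	}
	e.mu.Unlock()
	if full != nil {
		e.flush(full)
	}
}

// flush records one batch; the real exporter issues a single INSERT here.
func (e *Exporter) flush(batch []Record) {
	e.flushMu.Lock()
	defer e.flushMu.Unlock()
	e.flushed = append(e.flushed, batch)
}

// Close performs the final drain on graceful shutdown.
func (e *Exporter) Close() {
	e.mu.Lock()
	rest := e.buf
	e.buf = nil
	e.mu.Unlock()
	if len(rest) > 0 {
		e.flush(rest)
	}
}

func main() {
	e := NewExporter(3)
	for i := 0; i < 7; i++ {
		e.Enqueue(Record{RunID: fmt.Sprintf("run-%d", i)})
	}
	e.Close()
	// 7 records with a batch size of 3 produce two full batches plus a
	// final partial batch drained by Close.
	fmt.Println(len(e.flushed)) // 3
}
```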

Backpressure

If ClickHouse becomes unavailable or slow, the buffer grows up to 10x the configured batch size. Beyond that limit, the exporter drops the oldest records and logs a warning. This ensures the engine never blocks on analytics writes.
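The drop-oldest policy can be illustrated with a bounded buffer (a hedged sketch; the function name and slice-based representation are assumptions, not Strait's implementation):

```go
package main

import "fmt"

// enqueueBounded appends r to buf, capping it at max elements by
// dropping the oldest record when full. The real exporter logs a
// warning on each drop.
func enqueueBounded(buf []int, r, max int) ([]int, bool) {
	dropped := false
	if len(buf) >= max {
		buf = buf[1:] // drop oldest
		dropped = true
	}
	return append(buf, r), dropped
}

func main() {
	max := 10 * 3 // 10x a batch size of 3
	var buf []int
	for i := 0; i < 35; i++ {
		buf, _ = enqueueBounded(buf, i, max)
	}
	// The five oldest records were dropped; the engine never blocked.
	fmt.Println(len(buf), buf[0]) // 30 5
}
```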

Record types

Each enqueued record is dispatched to its corresponding table based on its Go type:

| Record type | Target table |
| --- | --- |
| Run lifecycle events | run_events |
| Run completion summaries | run_analytics |
| Machine session records | compute_usage |
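Type-based dispatch amounts to a type switch over the record (an illustrative sketch; the record type names and targetTable function are assumptions):

```go
package main

import "fmt"

// Illustrative record types; the real exporter defines its own.
type RunEvent struct{}
type RunAnalytics struct{}
type ComputeUsage struct{}

// targetTable routes a record to its ClickHouse table by Go type.
func targetTable(rec any) string {
	switch rec.(type) {
	case RunEvent:
		return "run_events"
	case RunAnalytics:
		return "run_analytics"
	case ComputeUsage:
		return "compute_usage"
	default:
		return "" // unknown records are not exported
	}
}

func main() {
	fmt.Println(targetTable(RunAnalytics{})) // run_analytics
}
```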

Query Patterns

ClickHouse tables are designed around common analytics access patterns. All tables are ordered with project_id as the leading key, so project-scoped queries benefit from primary key pruning.

Time-windowed aggregations

SELECT
    toStartOfHour(created_at) AS hour,
    count() AS total_runs,
    countIf(status = 'failed') AS failures,
    avg(duration_ms) AS avg_duration_ms
FROM run_analytics
WHERE project_id = :project_id
  AND created_at >= now() - INTERVAL 7 DAY
GROUP BY hour
ORDER BY hour

Cost analysis by preset

SELECT
    machine_preset,
    sum(cost_microusd) / 1000000 AS total_cost_usd,
    sum(duration_secs) / 3600 AS total_hours
FROM compute_usage
WHERE project_id = :project_id
  AND started_at >= now() - INTERVAL 30 DAY
GROUP BY machine_preset
ORDER BY total_cost_usd DESC

Event drill-down for a single run

SELECT event_type, level, message, metadata, created_at
FROM run_events
WHERE run_id = :run_id
ORDER BY created_at ASC

Queue wait percentiles

SELECT
    quantile(0.50)(queue_wait_ms) AS p50,
    quantile(0.95)(queue_wait_ms) AS p95,
    quantile(0.99)(queue_wait_ms) AS p99
FROM run_analytics
WHERE project_id = :project_id
  AND created_at >= now() - INTERVAL 24 HOUR

Configuration

All ClickHouse settings are controlled via environment variables. The client connection and the export pipeline are gated by separate flags, so each must be enabled explicitly.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| CLICKHOUSE_ENABLED | bool | false | Enable the ClickHouse client connection. When false, no connection is opened and all write operations are no-ops. |
| CLICKHOUSE_URL | string | (none) | ClickHouse native protocol URL (e.g., clickhouse://localhost:9000). Required when CLICKHOUSE_ENABLED=true. |
| CLICKHOUSE_DATABASE | string | strait | Target database name. |
| CLICKHOUSE_BATCH_SIZE | int | 1000 | Maximum number of records per batch insert. |
| CLICKHOUSE_FLUSH_INTERVAL | duration | 5s | Maximum time between flushes, even if the batch is not full. |
| CLICKHOUSE_EXPORT_ENABLED | bool | false | Enable the asynchronous export pipeline. Requires CLICKHOUSE_ENABLED=true. |

Setting CLICKHOUSE_ENABLED=true without providing CLICKHOUSE_URL is a configuration error and will prevent the application from starting.

Operational Considerations

ClickHouse is optional

The entire ClickHouse subsystem is gated behind feature flags. When disabled:

  • The Client constructor returns nil, and all methods on a nil client are safe no-ops.
  • The Exporter constructor returns nil, and Enqueue() silently returns false.
  • No ClickHouse driver is loaded, no connections are opened, and no goroutines are spawned.

There is no degradation in engine behavior when ClickHouse is unavailable.
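The nil-constructor pattern relies on Go's nil-receiver semantics: methods on a nil pointer are callable as long as they check the receiver. A minimal sketch (Client, NewClient, and InsertBatch are illustrative names, not Strait's API):

```go
package main

import "fmt"

// Client is a feature-gated analytics client whose methods are
// safe no-ops on a nil receiver.
type Client struct{ dsn string }

// NewClient returns nil when the feature flag is off, so callers
// hold a nil *Client and never branch on the flag themselves.
func NewClient(enabled bool, dsn string) *Client {
	if !enabled {
		return nil
	}
	return &Client{dsn: dsn}
}

// InsertBatch reports whether a write happened; on a nil client it
// is a safe no-op rather than a panic.
func (c *Client) InsertBatch(rows int) bool {
	if c == nil {
		return false
	}
	// ... real write path ...
	return true
}

func main() {
	c := NewClient(false, "")
	fmt.Println(c == nil, c.InsertBatch(10)) // true false
}
```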

Data retention and TTL

ClickHouse enforces TTL at the partition level:

  • run_events: 90-day retention. High-volume event data is expired first to manage storage.
  • run_analytics: 365-day retention. Run summaries are kept for year-over-year reporting.
  • compute_usage: 365-day retention. Billing records are retained for annual cost analysis.

TTL expiration runs asynchronously within ClickHouse and does not require external maintenance.

Connection pool

The client maintains a connection pool with sensible defaults:

  • Max open connections: 10
  • Max idle connections: 5
  • Connection max lifetime: 30 minutes

These values are suitable for most deployments. For high-throughput environments, increase MaxOpenConns via the Config struct when constructing the client programmatically.

Performance characteristics

  • Write path: Fully asynchronous. The engine never blocks on ClickHouse writes. Failed flushes are logged and the batch is discarded to prevent memory growth.
  • Read path: Queries benefit from ClickHouse columnar compression and the MergeTree primary index. Project-scoped queries with time-range filters are the most efficient access pattern.
  • LowCardinality optimization: Columns with a small set of distinct values (event_type, status, machine_preset, triggered_by, level, execution_mode) use LowCardinality(String), which applies dictionary encoding for reduced storage and faster GROUP BY.