Optional analytics backend for run metrics, cost tracking, and time-series reporting.
Overview
Strait integrates with ClickHouse as an optional, append-only analytics store optimized for high-volume time-series queries. ClickHouse is a column-oriented OLAP database that excels at aggregating millions of rows with sub-second latency, making it well suited for run metrics, cost breakdowns, and event-level observability.
PostgreSQL remains the source of truth for all operational data. ClickHouse is a read-optimized replica that receives data via an asynchronous export pipeline. Disabling ClickHouse has zero impact on workflow execution, scheduling, or run management. The engine continues to function normally using Postgres alone.
Why ClickHouse?
| Requirement | Why ClickHouse fits |
|---|---|
| Time-windowed aggregations over millions of run events | Column-oriented storage and vectorized execution make full-table scans fast without secondary indexes. |
| Cost analysis across projects and presets | Native support for Int64 micro-USD columns with efficient GROUP BY over low-cardinality dimensions. |
| High write throughput with minimal overhead | Append-only MergeTree engine handles large batch inserts without write amplification. |
| Automatic data retention | Built-in TTL rules expire old rows at the partition level, requiring no external cron jobs. |
Tables and Schema
Strait creates three ClickHouse tables on startup via idempotent CREATE TABLE IF NOT EXISTS statements.
run_events
Stores discrete lifecycle events emitted during run execution (e.g., step started, step failed, checkpoint saved). Used for event-level drill-down and log analytics.
| Column | Type | Description |
|---|---|---|
| event_id | String | Unique event identifier. |
| run_id | String | Parent run identifier. |
| project_id | String | Owning project. |
| job_id | String | Parent job identifier. |
| event_type | LowCardinality(String) | Event category (e.g., step.started, run.failed). |
| level | LowCardinality(String) | Severity level (info, warn, error). |
| message | String | Human-readable event description. |
| metadata | String | JSON-encoded arbitrary metadata. |
| created_at | DateTime64(3) | Timestamp when the event occurred in the engine. |
| inserted_at | DateTime64(3) | Timestamp when the row was written to ClickHouse. Defaults to now64(3). |
- Engine: `MergeTree()`
- Partition key: `toDate(inserted_at)` (daily partitions)
- Order key: `(project_id, run_id, created_at)`
- TTL: 90 days from `inserted_at`
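Putting the columns and storage settings above together, the idempotent bootstrap DDL for run_events plausibly looks like the following. This is a reconstruction from the schema table, not necessarily the exact statement Strait issues:

```sql
CREATE TABLE IF NOT EXISTS run_events (
    event_id    String,
    run_id      String,
    project_id  String,
    job_id      String,
    event_type  LowCardinality(String),
    level       LowCardinality(String),
    message     String,
    metadata    String,
    created_at  DateTime64(3),
    inserted_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree()
PARTITION BY toDate(inserted_at)
ORDER BY (project_id, run_id, created_at)
TTL toDateTime(inserted_at) + INTERVAL 90 DAY;
```

The run_analytics and compute_usage tables follow the same shape, differing only in columns, order key, and a 365-day TTL.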
run_analytics
One row per run attempt. Captures duration, queue wait, cost, and execution metadata for dashboard-level reporting.
| Column | Type | Description |
|---|---|---|
| run_id | String | Run identifier. |
| job_id | String | Parent job identifier. |
| project_id | String | Owning project. |
| status | LowCardinality(String) | Terminal status (succeeded, failed, cancelled). |
| execution_mode | LowCardinality(String) | How the run was executed (managed, external). |
| machine_preset | LowCardinality(String) | Machine preset used for the run. |
| attempt | UInt8 | Retry attempt number (1-based). |
| duration_ms | UInt64 | Total wall-clock duration in milliseconds. |
| queue_wait_ms | UInt64 | Time spent waiting in the dispatch queue, in milliseconds. |
| cost_microusd | Int64 | Total run cost in micro-USD (1 USD = 1,000,000 micro-USD). |
| compute_cost_microusd | Int64 | Compute-only portion of the cost in micro-USD. |
| triggered_by | LowCardinality(String) | Trigger source (schedule, webhook, manual, api). |
| created_at | DateTime64(3) | Run creation timestamp. |
| started_at | Nullable(DateTime64(3)) | When execution began. Null if never started. |
| finished_at | Nullable(DateTime64(3)) | When execution completed. Null if still running. |
| inserted_at | DateTime64(3) | ClickHouse insertion timestamp. Defaults to now64(3). |
- Engine: `MergeTree()`
- Partition key: `toDate(inserted_at)` (daily partitions)
- Order key: `(project_id, job_id, created_at)`
- TTL: 365 days from `inserted_at`
compute_usage
Tracks individual machine sessions for granular compute billing and utilization analysis.
| Column | Type | Description |
|---|---|---|
| run_id | String | Associated run identifier. |
| project_id | String | Owning project. |
| machine_preset | LowCardinality(String) | Machine preset name. |
| machine_id | String | Specific machine instance identifier. |
| duration_secs | Float64 | Session duration in seconds. |
| cost_microusd | Int64 | Session cost in micro-USD. |
| started_at | DateTime64(3) | Session start time. |
| finished_at | DateTime64(3) | Session end time. |
| inserted_at | DateTime64(3) | ClickHouse insertion timestamp. Defaults to now64(3). |
- Engine: `MergeTree()`
- Partition key: `toDate(inserted_at)` (daily partitions)
- Order key: `(project_id, started_at)`
- TTL: 365 days from `inserted_at`
Data Pipeline
Data flows from PostgreSQL to ClickHouse through an asynchronous batch exporter that runs as a background goroutine within the Strait process.
Export flow
```
Engine (Postgres writes) --> Exporter.Enqueue() --> In-memory buffer --> Flush --> ClickHouse batch INSERT
```

1. Enqueue. When the engine completes a state transition (run started, run finished, cost recorded), it calls `Exporter.Enqueue()` to place the record in an in-memory buffer. This call is non-blocking and safe for concurrent use.
2. Buffer accumulation. Records accumulate in memory until one of two flush triggers fires:
   - Batch size threshold: the buffer reaches `CLICKHOUSE_BATCH_SIZE` records (default: 1000).
   - Flush interval: `CLICKHOUSE_FLUSH_INTERVAL` elapses since the last flush (default: 5 seconds).
3. Batch flush. The exporter swaps the buffer, releasing the lock immediately, then writes the batch to ClickHouse via a single `INSERT` statement.
4. Graceful shutdown. On process termination, the exporter performs a final flush to drain any remaining buffered records before closing.
Backpressure
If ClickHouse becomes unavailable or slow, the buffer grows up to 10x the configured batch size. Beyond that limit, the exporter drops the oldest records and logs a warning. This ensures the engine never blocks on analytics writes.
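The enqueue, buffer-swap flush, and drop-oldest backpressure described above can be sketched in Go as follows. The names `Exporter` and `Enqueue` come from the text; the field names, the `Record` type, and the flush callback are illustrative assumptions, not the real implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Record is a placeholder for any exportable row (event, summary, usage).
type Record struct{ Table string }

// Exporter buffers records in memory and flushes them in batches.
type Exporter struct {
	mu        sync.Mutex
	buf       []Record
	batchSize int            // CLICKHOUSE_BATCH_SIZE (default: 1000)
	flush     func([]Record) // writes one batch INSERT to ClickHouse
	dropped   int            // records discarded under backpressure
}

// Enqueue is non-blocking and safe for concurrent use. If the buffer
// grows past 10x the batch size (ClickHouse slow or unavailable), the
// oldest record is dropped so the engine never blocks on analytics.
func (e *Exporter) Enqueue(r Record) bool {
	e.mu.Lock()
	if len(e.buf) >= 10*e.batchSize {
		e.buf = e.buf[1:] // drop oldest; a real exporter would also log a warning
		e.dropped++
	}
	e.buf = append(e.buf, r)
	full := len(e.buf) >= e.batchSize
	e.mu.Unlock()
	if full {
		e.Flush() // batch-size trigger; a ticker would cover the interval trigger
	}
	return true
}

// Flush swaps the buffer out under the lock, releasing it immediately,
// then writes the swapped batch outside the critical section.
func (e *Exporter) Flush() {
	e.mu.Lock()
	batch := e.buf
	e.buf = nil
	e.mu.Unlock()
	if len(batch) > 0 {
		e.flush(batch)
	}
}

func main() {
	var flushed [][]Record
	e := &Exporter{batchSize: 3, flush: func(b []Record) { flushed = append(flushed, b) }}
	for i := 0; i < 7; i++ {
		e.Enqueue(Record{Table: "run_events"})
	}
	e.Flush() // graceful shutdown: drain the remainder
	fmt.Println(len(flushed)) // three batches: 3 + 3 + 1 records
}
```

The swap-then-write pattern in `Flush` keeps the lock held only for a slice swap, so concurrent `Enqueue` calls never wait on a ClickHouse round trip.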
Record types
Each enqueued record is dispatched to its corresponding table based on its Go type:
| Record type | Target table |
|---|---|
| Run lifecycle events | run_events |
| Run completion summaries | run_analytics |
| Machine session records | compute_usage |
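Type-based dispatch like the table above is naturally a Go type switch. The record type names below are illustrative assumptions; only the target table names come from the text:

```go
package main

import "fmt"

// Hypothetical record types; the real type names are not specified here.
type RunEvent struct{}
type RunSummary struct{}
type MachineSession struct{}

// targetTable maps a record's Go type to its ClickHouse table,
// mirroring the dispatch table above.
func targetTable(rec any) string {
	switch rec.(type) {
	case RunEvent:
		return "run_events"
	case RunSummary:
		return "run_analytics"
	case MachineSession:
		return "compute_usage"
	default:
		return "" // unknown record types are not exported
	}
}

func main() {
	fmt.Println(targetTable(RunEvent{}))
	fmt.Println(targetTable(MachineSession{}))
}
```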
Query Patterns
ClickHouse tables are designed around common analytics access patterns. All tables are ordered with project_id as the leading key, so project-scoped queries benefit from primary key pruning.
Time-windowed aggregations
```sql
SELECT
    toStartOfHour(created_at) AS hour,
    count() AS total_runs,
    countIf(status = 'failed') AS failures,
    avg(duration_ms) AS avg_duration_ms
FROM run_analytics
WHERE project_id = :project_id
  AND created_at >= now() - INTERVAL 7 DAY
GROUP BY hour
ORDER BY hour
```

Cost analysis by preset

```sql
SELECT
    machine_preset,
    sum(cost_microusd) / 1000000 AS total_cost_usd,
    sum(duration_secs) / 3600 AS total_hours
FROM compute_usage
WHERE project_id = :project_id
  AND started_at >= now() - INTERVAL 30 DAY
GROUP BY machine_preset
ORDER BY total_cost_usd DESC
```

Event drill-down for a single run

```sql
SELECT event_type, level, message, metadata, created_at
FROM run_events
WHERE run_id = :run_id
ORDER BY created_at ASC
```

Queue wait percentiles

```sql
SELECT
    quantile(0.50)(queue_wait_ms) AS p50,
    quantile(0.95)(queue_wait_ms) AS p95,
    quantile(0.99)(queue_wait_ms) AS p99
FROM run_analytics
WHERE project_id = :project_id
  AND created_at >= now() - INTERVAL 24 HOUR
```

Configuration
All ClickHouse settings are controlled via environment variables. The client connection and the export pipeline are enabled by separate flags; enabling the client alone does not start the export pipeline.
| Variable | Type | Default | Description |
|---|---|---|---|
| CLICKHOUSE_ENABLED | bool | false | Enable the ClickHouse client connection. When false, no connection is opened and all write operations are no-ops. |
| CLICKHOUSE_URL | string | (none) | ClickHouse native protocol URL (e.g., clickhouse://localhost:9000). Required when CLICKHOUSE_ENABLED=true. |
| CLICKHOUSE_DATABASE | string | strait | Target database name. |
| CLICKHOUSE_BATCH_SIZE | int | 1000 | Maximum number of records per batch insert. |
| CLICKHOUSE_FLUSH_INTERVAL | duration | 5s | Maximum time between flushes, even if the batch is not full. |
| CLICKHOUSE_EXPORT_ENABLED | bool | false | Enable the asynchronous export pipeline. Requires CLICKHOUSE_ENABLED=true. |
Setting CLICKHOUSE_ENABLED=true without providing CLICKHOUSE_URL is a configuration error and will prevent the application from starting.
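The startup validation rules above can be sketched as a small Go check. The struct and its field names are assumptions mapped from the environment variables, not the actual config type:

```go
package main

import (
	"errors"
	"fmt"
)

// ClickHouseConfig is a hypothetical mapping of the environment
// variables above onto a Go struct.
type ClickHouseConfig struct {
	Enabled       bool   // CLICKHOUSE_ENABLED
	URL           string // CLICKHOUSE_URL
	ExportEnabled bool   // CLICKHOUSE_EXPORT_ENABLED
}

// Validate enforces the rules from the table: a URL is required when the
// client is enabled, and the export pipeline requires the client.
func (c ClickHouseConfig) Validate() error {
	if c.Enabled && c.URL == "" {
		return errors.New("CLICKHOUSE_URL is required when CLICKHOUSE_ENABLED=true")
	}
	if c.ExportEnabled && !c.Enabled {
		return errors.New("CLICKHOUSE_EXPORT_ENABLED requires CLICKHOUSE_ENABLED=true")
	}
	return nil
}

func main() {
	// Enabled without a URL: the configuration error described above.
	fmt.Println(ClickHouseConfig{Enabled: true}.Validate())
}
```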
Operational Considerations
ClickHouse is optional
The entire ClickHouse subsystem is gated behind feature flags. When disabled:
- The `Client` constructor returns `nil`, and all methods on a nil client are safe no-ops.
- The `Exporter` constructor returns `nil`, and `Enqueue()` silently returns `false`.
- No ClickHouse driver is loaded, no connections are opened, and no goroutines are spawned.
There is no degradation in engine behavior when ClickHouse is unavailable.
Data retention and TTL
ClickHouse enforces TTL at the partition level:
- `run_events`: 90-day retention. High-volume event data is expired first to manage storage.
- `run_analytics`: 365-day retention. Run summaries are kept for year-over-year reporting.
- `compute_usage`: 365-day retention. Billing records are retained for annual cost analysis.
TTL expiration runs asynchronously within ClickHouse and does not require external maintenance.
Connection pool
The client maintains a connection pool with sensible defaults:
- Max open connections: 10
- Max idle connections: 5
- Connection max lifetime: 30 minutes
These values are suitable for most deployments. For high-throughput environments, increase MaxOpenConns via the Config struct when constructing the client programmatically.
Performance characteristics
- Write path: Fully asynchronous. The engine never blocks on ClickHouse writes. Failed flushes are logged and the batch is discarded to prevent memory growth.
- Read path: Queries benefit from ClickHouse columnar compression and the `MergeTree` primary index. Project-scoped queries with time-range filters are the most efficient access pattern.
- LowCardinality optimization: Columns with a small set of distinct values (`event_type`, `status`, `machine_preset`, `triggered_by`, `level`, `execution_mode`) use `LowCardinality(String)`, which applies dictionary encoding for reduced storage and faster `GROUP BY`.