Strait Docs
Operations

Production monitoring baselines for health checks, metrics, auth, RBAC, audit events, and webhooks.

Health Endpoints

Strait exposes two health endpoints for infrastructure orchestration:

Liveness: GET /health

Returns 200 if the process is alive. Use it as the Kubernetes liveness probe.

Readiness: GET /health/ready

Returns 200 when all dependencies are ready. When any dependency is not ready, it returns 503 with per-check details:

{
  "ready": false,
  "checks": {
    "database": "ok",
    "redis": "unavailable"
  }
}

The health check system uses a component registry (apps/strait/internal/health/registry.go) where subsystems register their health check functions.

Kubernetes Configuration

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 6

Prometheus Metrics

Strait exposes the following metrics at GET /metrics:

Metric                                   Type       Description
strait_run_transitions_total             Counter    FSM state transitions (by from_status, to_status)
strait_dispatch_duration_seconds         Histogram  HTTP dispatch latency (by job_id, outcome)
strait_dequeue_duration_seconds          Histogram  Queue polling latency
strait_worker_pool_active                Gauge      Active worker goroutines
strait_worker_pool_queued                Gauge      Tasks waiting in pool buffer
strait_analytics_query_duration_seconds  Histogram  Analytics query latency (by period_hours)
strait_bulk_operations_total             Counter    Bulk trigger/cancel operations
strait_bulk_items_processed_total        Counter    Items processed in bulk operations
strait_bulk_child_cancellations_total    Counter    Cascading child run cancellations
strait_webhook_retry_attempts_total      Counter    Webhook delivery retry attempts

Monitoring Assets

Use the monitoring assets in ops/monitoring/ as a baseline:

  • grafana-authz-rbac-dashboard.json
  • alerts-authz-rbac.yaml

Core Signals

  • HTTP authz failures (401, 403)
  • Permission cache hits/misses/evictions
  • RBAC control-plane mutation rate
  • Audit event insert errors
  • Webhook delivery failures
  • Worker pool utilization
  • Database connection pool saturation
  • Analytics query latency spikes
  • Bulk operation failure rate

Starting targets for these signals:

  • 403 rate remains near baseline after RBAC or policy changes
  • Permission cache hit ratio >= 0.80 under normal load
  • Audit event write error rate = 0
  • Webhook terminal failure rate stays within agreed threshold
  • Health endpoint responds within 1s
  • Worker pool active/capacity ratio < 0.9
  • Analytics query p95 < 2s
  • Bulk cancel success rate >= 99%