Observability
Full monitoring stack: Prometheus metrics, 3 Grafana dashboards, 15 alert rules, and SLO recording rules. Production-grade from day one.
Quick Setup
Prometheus Metrics
12 metrics exported at /metrics. Scraped every 15s by default.
| Metric | Type | Description |
|---|---|---|
| requests_total | Counter | Total requests by status code |
| request_duration_seconds | Histogram | Request latency with configurable buckets |
| credits_remaining | Gauge | Remaining credits per account |
| pool_utilization_ratio | Gauge | Pool usage ratio per account |
| circuit_breaker_state | Gauge | 0=closed, 1=open, 2=half_open per account |
| drain_active | Gauge | Whether graceful drain is active |
| inflight_requests | Gauge | Currently in-flight requests |
| errors_total | Counter | Errors by type |
| controller_reconcile_total | Counter | Reconcile cycles by result |
| controller_actions_total | Counter | Controller actions (create/destroy/scale/replace) |
| controller_workers | Gauge | Workers by state (pending/running/failed) |
| controller_leader_active | Gauge | Whether this instance is the leader |
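As a rough illustration of consuming these metrics, the sketch below parses a small piece of Prometheus text exposition output and decodes `circuit_breaker_state` using the 0/1/2 encoding from the table. The `account` label name and the sample payload are assumptions made for the example, and the parser is deliberately simplified (it skips lines with timestamps or escaped label values).

```python
import re

# Encoding taken from the metrics table: 0=closed, 1=open, 2=half_open.
CIRCUIT_STATES = {0: "closed", 1: "open", 2: "half_open"}

# Matches simple exposition lines such as: name{label="value"} 1
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def circuit_states(text):
    """Map each account label to its decoded circuit breaker state."""
    return {
        labels.get("account", ""): CIRCUIT_STATES.get(int(value), "unknown")
        for name, labels, value in parse_samples(text)
        if name == "circuit_breaker_state"
    }


# Hypothetical scrape output; the `account` label name is an assumption.
EXAMPLE = """\
# TYPE circuit_breaker_state gauge
circuit_breaker_state{account="a"} 0
circuit_breaker_state{account="b"} 1
inflight_requests 7
"""

print(circuit_states(EXAMPLE))
```

In practice the same decoding is more likely done in a dashboard value mapping; the point here is only the meaning of the gauge values.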
SLO Recording Rules
Pre-computed SLO metrics for fast dashboard queries and alert evaluation.
    drawbridge:availability_5m       # 1 - (5xx_rate / total_rate)
    drawbridge:latency_p95_5m        # 95th percentile latency
    drawbridge:latency_p99_5m        # 99th percentile latency
    drawbridge:error_rate_5m         # Error rate
    drawbridge:throughput_5m         # Requests per second
    drawbridge:credit_burn_rate_1h   # Credit consumption rate
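The availability formula, 1 - (5xx_rate / total_rate), can be worked through by hand from two counter samples taken 5 minutes apart. This is a minimal sketch of the arithmetic, not the actual rule expression: the sample values are made up, the rate calculation ignores the extrapolation Prometheus performs, and returning 1.0 when there is no traffic is an assumption of this example.

```python
def rate(prev, curr, seconds):
    """Per-second increase of a counter between two samples."""
    delta = curr - prev
    if delta < 0:
        # Counter reset: treat the current value as the increase since restart.
        delta = curr
    return delta / seconds


def availability(total_rate, error_5xx_rate):
    """1 - (5xx_rate / total_rate); assumes 1.0 when there is no traffic."""
    if total_rate == 0:
        return 1.0
    return 1 - error_5xx_rate / total_rate


WINDOW_SECONDS = 300  # the 5m window used by the *_5m rules

# Hypothetical counter readings at the start and end of the window.
total_rate = rate(10_000, 13_000, WINDOW_SECONDS)  # 10 requests per second
error_rate = rate(40, 46, WINDOW_SECONDS)          # 0.02 5xx responses per second

print(f"{availability(total_rate, error_rate):.4f}")  # prints 0.9980
```

With these numbers the window sits at 99.8% availability, which is below the 99.9% threshold used by the SLO alerts.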
Grafana Dashboards
3 pre-built dashboards with 29 panels total. Auto-provisioned on startup.
Gateway Dashboard
11 panels: request throughput, error rate, latency heatmap, pool utilization, account health table, circuit breaker states, drain status.
Controller Dashboard
8 panels: leader status, total workers, reconcile rate, action rate, workers by state timeseries, actions breakdown.
System Overview
10 panels: availability gauge (SLO %), P95 latency gauge, credits per account trend, request distribution, worker health table.
Alert Rules
15 alert rules with severity-based routing via AlertManager.
gateway alerts (8)
| Alert | Severity | Condition |
|---|---|---|
| HighErrorRate | warning | 5xx rate > 1% for 5m |
| CircuitBreakerOpen | warning | Any account circuit open |
| AllCircuitsOpen | critical | All account circuits open |
| DrainActive | info | Graceful drain initiated |
| DrainStuck | warning | Drain duration > timeout |
| HighLatencyP95 | warning | P95 > 120s for 5m |
| CreditsAlmostExhausted | warning | Account credits < 10% |
| PoolUtilizationHigh | warning | Pool utilization > 80% for 5m |
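Several conditions above include a hold time ("for 5m"). In Prometheus, an alert with a hold time stays pending until the condition has been true continuously for that duration, and any evaluation where the condition is false resets the clock. The sketch below imitates that behavior for PoolUtilizationHigh; the one-minute evaluation interval and the sample values are assumptions for illustration.

```python
def alert_states(samples, threshold, hold_seconds):
    """Classify each (timestamp_seconds, value) sample as inactive, pending, or firing.

    The alert fires only after the value has stayed above the threshold
    for at least hold_seconds without interruption.
    """
    states = []
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
            held = timestamp - breach_start
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            breach_start = None  # any dip below the threshold resets the hold timer
            states.append("inactive")
    return states


# Hypothetical pool_utilization_ratio readings, one per minute.
VALUES = [0.70, 0.85, 0.90, 0.75, 0.82, 0.83, 0.90, 0.95, 0.88, 0.81]
SAMPLES = [(i * 60, v) for i, v in enumerate(VALUES)]

for (timestamp, value), state in zip(SAMPLES, alert_states(SAMPLES, 0.80, 300)):
    print(f"t={timestamp:>3}s value={value:.2f} {state}")
```

The dip to 0.75 at t=180s resets the timer, so the alert only fires at t=540s, five minutes after utilization went back above 80%.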
controller alerts (5)
| Alert | Severity | Condition |
|---|---|---|
| ControllerLeaderLost | critical | No active leader for 60s |
| ControllerReconcileStalled | warning | No reconcile for 5m |
| ControllerWorkersDegraded | warning | > 20% workers unhealthy |
| ControllerNoWorkers | critical | Zero running workers |
| ControllerHighActionRate | warning | > 10 actions/min for 5m |
slo alerts (2)
| Alert | Severity | Condition |
|---|---|---|
| SLOAvailabilityBreach | critical | Availability < 99.9% (5m window) |
| SLOLatencyBreach | warning | P95 latency > 120s (5m window) |
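To give a feel for how tight the 99.9% availability threshold is over a 5-minute window, the following sketch computes how many failed requests a window can absorb before SLOAvailabilityBreach would trigger. The traffic level (10 requests per second) is a made-up figure, and exact fractions are used only to avoid floating-point noise at the boundary.

```python
import math
from fractions import Fraction

SLO_TARGET = Fraction("0.999")  # 99.9% availability, from the alert condition


def allowed_failures(total_requests, target=SLO_TARGET):
    """Largest number of 5xx responses that keeps availability >= target."""
    return math.floor((1 - target) * total_requests)


def breaches(total_requests, failed_requests, target=SLO_TARGET):
    """True when availability over the window drops strictly below the target."""
    if total_requests == 0:
        return False  # assumption: no traffic is not treated as a breach
    return 1 - Fraction(failed_requests, total_requests) < target


# Hypothetical traffic: 10 requests per second over a 5-minute window.
window_requests = 10 * 300

print(allowed_failures(window_requests))  # prints 3
print(breaches(window_requests, 3))       # prints False
print(breaches(window_requests, 4))       # prints True
```

At this traffic level a fourth 5xx response within the window is enough to breach, which is worth keeping in mind when deciding whether a critical page on a 5-minute window is too sensitive for low-traffic deployments.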
Alert Routing
    # Severity-based routing
    critical: 15-minute repeat, escalation
    warning: 1-hour repeat
    info: silent

    # Inhibition rules
    - AllCircuitsOpen suppresses individual CircuitBreakerOpen
    - Critical suppresses same-rule warnings
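The routing and inhibition behavior can be modeled in a few lines. This is a sketch of the logic only, not the AlertManager configuration: the `Alert` type is hypothetical, and "same-rule" is interpreted here as "same alert name", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical in-memory representation of a firing alert."""
    name: str
    severity: str
    account: str = ""


# Repeat intervals from the routing summary; info alerts are silent.
REPEAT_INTERVAL = {"critical": "15m", "warning": "1h"}


def apply_inhibition(firing):
    """Drop alerts that are suppressed by another firing alert."""
    names = {alert.name for alert in firing}
    critical_names = {a.name for a in firing if a.severity == "critical"}
    kept = []
    for alert in firing:
        # AllCircuitsOpen suppresses individual CircuitBreakerOpen alerts.
        if alert.name == "CircuitBreakerOpen" and "AllCircuitsOpen" in names:
            continue
        # A critical alert suppresses warnings with the same name (assumed meaning).
        if alert.severity == "warning" and alert.name in critical_names:
            continue
        kept.append(alert)
    return kept


def notifications(firing):
    """Return (alert name, repeat interval) pairs for alerts that notify."""
    return [
        (alert.name, REPEAT_INTERVAL[alert.severity])
        for alert in apply_inhibition(firing)
        if alert.severity in REPEAT_INTERVAL
    ]


FIRING = [
    Alert("AllCircuitsOpen", "critical"),
    Alert("CircuitBreakerOpen", "warning", account="a"),
    Alert("CircuitBreakerOpen", "warning", account="b"),
    Alert("DrainActive", "info"),
    Alert("PoolUtilizationHigh", "warning", account="a"),
]

print(notifications(FIRING))
```

Here the two per-account CircuitBreakerOpen warnings are inhibited and DrainActive stays silent, leaving one critical and one warning notification.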