Observability

Full monitoring stack: Prometheus metrics, 3 Grafana dashboards, 15 alert rules, and SLO recording rules. Production-grade from day one.

Quick Setup

Start monitoring stack
$ make up-monitoring
# Starts Prometheus (:9090), Grafana (:3001), AlertManager (:9093)
# Drawbridge exposes /metrics on :8080 (no auth required)

Prometheus Metrics

Drawbridge exports 12 metrics at /metrics. Prometheus scrapes them every 15s by default.

Metric                      Type       Description
requests_total              Counter    Total requests by status code
request_duration_seconds    Histogram  Request latency with configurable buckets
credits_remaining           Gauge      Remaining credits per account
pool_utilization_ratio      Gauge      Pool usage ratio per account
circuit_breaker_state       Gauge      0=closed, 1=open, 2=half_open per account
drain_active                Gauge      Whether graceful drain is active
inflight_requests           Gauge      Currently in-flight requests
errors_total                Counter    Errors by type
controller_reconcile_total  Counter    Reconcile cycles by result
controller_actions_total    Counter    Controller actions (create/destroy/scale/replace)
controller_workers          Gauge      Workers by state (pending/running/failed)
controller_leader_active    Gauge      Whether this instance is the leader
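
To make the metric list concrete, here is a minimal Python sketch of reading a scrape of /metrics. Only the metric names come from the table above. The sample payload, its label names (status, account), and its values are illustrative assumptions, not captured Drawbridge output. The sketch uses a small regex rather than a full parser for the Prometheus text format.

```python
import re

# Illustrative payload: metric names come from the table above; the label
# names (status, account) and all values are assumptions for this sketch.
SAMPLE = """\
# TYPE requests_total counter
requests_total{status="200"} 980
requests_total{status="500"} 20
# TYPE circuit_breaker_state gauge
circuit_breaker_state{account="a1"} 0
circuit_breaker_state{account="a2"} 1
inflight_requests 7
"""

LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Value mapping documented for circuit_breaker_state.
BREAKER_STATES = {0: "closed", 1: "open", 2: "half_open"}


def parse_metrics(text):
    """Return a list of (name, labels, value), one per sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore anything this simple regex cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
open_accounts = [
    labels["account"]
    for name, labels, value in samples
    if name == "circuit_breaker_state" and BREAKER_STATES[int(value)] == "open"
]
print(open_accounts)  # ['a2']
```

For anything beyond a quick check, a maintained Prometheus client library is likely a better choice than a hand-written regex.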

SLO Recording Rules

Pre-computed SLO metrics for fast dashboard queries and alert evaluation.

Recording rules
drawbridge:availability_5m      # 1 - (5xx_rate / total_rate)
drawbridge:latency_p95_5m       # 95th percentile latency
drawbridge:latency_p99_5m       # 99th percentile latency
drawbridge:error_rate_5m        # Error rate
drawbridge:throughput_5m        # Requests per second
drawbridge:credit_burn_rate_1h  # Credit consumption rate
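
The comments above define availability as 1 - (5xx_rate / total_rate). As a worked example of that arithmetic, the Python sketch below derives throughput and availability from two readings of a request counter taken five minutes apart. The counter readings are made-up numbers and the helper names are hypothetical. The real rules are evaluated by Prometheus, not by application code, and returning 1.0 for an idle window is a choice made for this sketch only.

```python
WINDOW_SECONDS = 5 * 60  # the 5m window used by the recording rules


def per_second_rate(start, end, window=WINDOW_SECONDS):
    """Average per-second increase of a counter over the window."""
    return (end - start) / window


def availability(rate_5xx, rate_total):
    """1 - (5xx_rate / total_rate); an idle window counts as available."""
    if rate_total == 0:
        return 1.0
    return 1.0 - rate_5xx / rate_total


# Made-up counter readings at the start and end of the window.
rate_total = per_second_rate(start=120_000, end=150_000)  # 30000 / 300
rate_5xx = per_second_rate(start=400, end=415)            # 15 / 300

print(rate_total)                                    # 100.0
print(rate_5xx)                                      # 0.05
print(round(availability(rate_5xx, rate_total), 6))  # 0.9995
```

With these numbers, availability is 99.95%, which sits above a 99.9% target.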

Grafana Dashboards

3 pre-built dashboards with 29 panels total. Auto-provisioned on startup.

Gateway Dashboard

11 panels

Request throughput, error rate, latency heatmap, pool utilization, account health table, circuit breaker states, drain status.

Controller Dashboard

8 panels

Leader status, total workers, reconcile rate, action rate, workers by state timeseries, actions breakdown.

System Overview

10 panels

Availability gauge (SLO %), P95 latency gauge, credits per account trend, request distribution, worker health table.

Alert Rules

15 alert rules with severity-based routing via AlertManager.

gateway alerts (8)

Alert                       Severity  Condition
HighErrorRate               warning   5xx rate > 1% for 5m
CircuitBreakerOpen          warning   Any account circuit open
AllCircuitsOpen             critical  All account circuits open
DrainActive                 info      Graceful drain initiated
DrainStuck                  warning   Drain duration > timeout
HighLatencyP95              warning   P95 > 120s for 5m
CreditsAlmostExhausted      warning   Account credits < 10%
PoolUtilizationHigh         warning   Pool utilization > 80% for 5m
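
Several conditions above include a hold time such as "for 5m". In Prometheus alerting rules, a for clause means the condition must stay true on every evaluation for that long before the alert moves from pending to firing. The Python sketch below models that behaviour for the HighErrorRate condition. The one-minute evaluation interval, the sample ratios, and the function name are assumptions for illustration.

```python
def alert_states(ratios, threshold=0.01, hold_seconds=300, interval_seconds=60):
    """Yield 'inactive', 'pending' or 'firing' for each evaluation.

    ratios: 5xx ratio observed at each evaluation, oldest first.
    """
    active_since = None  # time at which the condition first became true
    for step, ratio in enumerate(ratios):
        now = step * interval_seconds
        if ratio > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            yield "firing" if held >= hold_seconds else "pending"
        else:
            active_since = None  # any dip resets the hold timer
            yield "inactive"


# One evaluation per minute: a short spike, then a sustained breach.
observed = [0.002, 0.03, 0.004, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]
states = list(alert_states(observed))
print(states)
```

The short spike at the second evaluation only reaches pending, and the alert fires at the last evaluation, once the ratio has stayed above 1% for a full five minutes.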

controller alerts (5)

Alert                       Severity  Condition
ControllerLeaderLost        critical  No active leader for 60s
ControllerReconcileStalled  warning   No reconcile for 5m
ControllerWorkersDegraded   warning   > 20% workers unhealthy
ControllerNoWorkers         critical  Zero running workers
ControllerHighActionRate    warning   > 10 actions/min for 5m

slo alerts (2)

Alert                       Severity  Condition
SLOAvailabilityBreach       critical  Availability < 99.9% (5m window)
SLOLatencyBreach            warning   P95 latency > 120s (5m window)
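
A 99.9% availability target leaves an error budget of 0.1%. As a worked example, the Python sketch below converts that target into allowed downtime. The 30-day window is an assumption chosen for illustration; the alert above evaluates a 5m window. The function name is hypothetical.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of full unavailability allowed by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


budget = error_budget_minutes(slo=0.999, window_days=30)
print(round(budget, 1))  # 43.2
```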

Alert Routing

AlertManager config
# Severity-based routing
critical: 15-minute repeat, escalation
warning:  1-hour repeat
info:     silent

# Inhibition rules
- AllCircuitsOpen suppresses individual CircuitBreakerOpen
- Critical suppresses same-rule warnings
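
The two inhibition rules can be read as a filter over the set of firing alerts. The Python sketch below models them to show which notifications would be muted. It is not AlertManager configuration. The alert dictionaries and the function name are hypothetical, and "same-rule" is interpreted here as "same alert name", which is an assumption.

```python
def suppressed(alerts):
    """Return the alerts that the two inhibition rules would mute."""
    names = {a["name"] for a in alerts}
    critical_names = {a["name"] for a in alerts if a["severity"] == "critical"}
    muted = []
    for alert in alerts:
        # Rule 1: AllCircuitsOpen suppresses individual CircuitBreakerOpen.
        if alert["name"] == "CircuitBreakerOpen" and "AllCircuitsOpen" in names:
            muted.append(alert)
        # Rule 2: a critical alert suppresses warnings with the same name.
        elif alert["severity"] == "warning" and alert["name"] in critical_names:
            muted.append(alert)
    return muted


firing = [
    {"name": "AllCircuitsOpen", "severity": "critical"},
    {"name": "CircuitBreakerOpen", "severity": "warning", "account": "a1"},
    {"name": "CircuitBreakerOpen", "severity": "warning", "account": "a2"},
    {"name": "HighErrorRate", "severity": "warning"},
]
print([a.get("account") for a in suppressed(firing)])  # ['a1', 'a2']
```

In this example the two per-account CircuitBreakerOpen alerts are muted, while AllCircuitsOpen and HighErrorRate still notify.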

Next Steps