Burn Rates, SLO-Based Alerts & Alert Attributes
Observability Deep Dive | Technical Operations Excellence
| Alert Type | Budget Consumed | Windows (long + short) | Burn Rate |
|---|---|---|---|
| Page (Critical) | 2% / 1h | 1h + 5m | 14.4x |
| Page (High) | 5% / 6h | 6h + 30m | 6x |
| Ticket | 10% / 3d | 72h + 6h | 1x |
Both windows must exceed the burn threshold: the long window filters out brief spikes, and the short window lets the alert clear soon after the burn stops
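
The dual-window rows in the table above can be sketched as a simple check. This is a minimal illustration, not a production rule set: `BurnRatePolicy`, `POLICIES`, and `firing_policies` are hypothetical names, and the error rates per window are assumed to come from whatever metrics backend is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRatePolicy:
    name: str
    burn_rate: float    # multiple of the sustainable error rate
    long_window: str    # key into the error-rate mapping
    short_window: str


# Values copied from the table; the names are illustrative only.
POLICIES = [
    BurnRatePolicy("page_critical", 14.4, "1h", "5m"),
    BurnRatePolicy("page_high", 6.0, "6h", "30m"),
    BurnRatePolicy("ticket", 1.0, "72h", "6h"),
]


def firing_policies(slo, error_rates):
    """Return names of policies whose long AND short windows both exceed
    burn_rate * (1 - slo). error_rates maps window label -> error ratio."""
    budget = 1.0 - slo
    firing = []
    for policy in POLICIES:
        threshold = policy.burn_rate * budget
        long_hot = error_rates[policy.long_window] > threshold
        short_hot = error_rates[policy.short_window] > threshold
        if long_hot and short_hot:
            firing.append(policy.name)
    return firing
```

Requiring both windows means an alert stops firing once the short window recovers, even while the long window still carries the earlier errors.
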
| Attribute | Definition | Goal |
|---|---|---|
| Precision | % genuine alerts | Minimize FPs |
| Recall | % incidents caught | Catch all issues |
| Detection | Time to notify | Alert quickly |
| Reset | Time alert keeps firing after recovery | Clear quickly |
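
Precision and recall in the table above can be computed from simple counts gathered in an alert review. The sketch below assumes each alert has already been labeled genuine or false and each incident labeled caught or missed; the function names are hypothetical.

```python
def alert_precision(genuine_alerts, false_alerts):
    """Fraction of fired alerts that corresponded to a real incident."""
    total = genuine_alerts + false_alerts
    return genuine_alerts / total if total else 0.0


def alert_recall(caught_incidents, missed_incidents):
    """Fraction of real incidents that produced an alert."""
    total = caught_incidents + missed_incidents
    return caught_incidents / total if total else 0.0
```

For example, 18 genuine alerts out of 20 fired gives a precision of 0.9, and 18 incidents caught out of 24 gives a recall of 0.75.
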
Burn Rate = Observed Error Rate / (1 - SLO)
14.4x = consume 30-day budget in ~2 days
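
The arithmetic behind these numbers can be sketched as follows, using the conventional definition of burn rate as the observed error rate divided by the error budget (1 - SLO). The function names are hypothetical, and a 30-day SLO period is assumed throughout.

```python
def burn_rate(error_rate, slo):
    """Observed error ratio as a multiple of the error budget (1 - slo)."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate, slo_period_days=30.0):
    """Days until the whole budget is gone at a constant burn rate."""
    return slo_period_days / rate


def burn_rate_for_budget(budget_fraction, window_hours, slo_period_days=30.0):
    """Burn rate that consumes budget_fraction of the budget in window_hours."""
    return budget_fraction * slo_period_days * 24.0 / window_hours
```

With these definitions, spending 2% of a 30-day budget in 1 hour corresponds to a burn rate of about 14.4, 5% in 6 hours to about 6, and 10% in 72 hours to about 1. A sustained 14.4x burn exhausts the budget in roughly 2.1 days.
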
| Budget Burned per Day | Response |
|---|---|
| >5%/day | Immediate incident |
| 2-5%/day | Investigation needed |
| <2%/day | Normal ops |
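
The thresholds in the table above translate into a small classifier. This is a sketch under one assumption the table does not state: values of exactly 2% and exactly 5% per day fall into the middle band. The function name is hypothetical.

```python
def classify_daily_burn(percent_per_day):
    """Map daily error-budget consumption (in percent) to a response level."""
    if percent_per_day > 5.0:
        return "immediate incident"
    if percent_per_day >= 2.0:  # assumption: band boundaries are inclusive
        return "investigation needed"
    return "normal ops"
```
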
| Category | Response | When |
|---|---|---|
| Page | Immediate | User impact |
| Notify | Hours | Degradation |
| Ticket | Next day | Slow drift |
| Log | Review | Informational |
Error rate, latency, availability - PAGE these
Queue depth, connection pool - NOTIFY these
CPU, memory, disk - TICKET or LOG these
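
One way to encode the routing above is a lookup table from signal to response category. The signal keys, the `route` helper, the choice of "ticket" rather than "log" for resource metrics, and the "log" default for unknown signals are all assumptions made for illustration.

```python
# Symptom signals page; saturation signals notify; resource causes ticket.
ROUTING = {
    "error_rate": "page",
    "latency": "page",
    "availability": "page",
    "queue_depth": "notify",
    "connection_pool": "notify",
    "cpu": "ticket",
    "memory": "ticket",
    "disk": "ticket",
}


def route(signal):
    """Return the response category for a signal; unknown signals are logged."""
    return ROUTING.get(signal, "log")
```
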
| Metric | Target |
|---|---|
| Pages per week | <2 |
| False positive rate | <5% |
| Off-hours pages | <1 |
| Actionable % | >95% |
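
The targets in the table above can be checked against review data with a short script. The sketch assumes all counts cover one week and that the false positive and actionable percentages are measured over all alerts, not only pages; `WeeklyAlertStats` and `hygiene_violations` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WeeklyAlertStats:
    pages: int
    off_hours_pages: int
    alerts_total: int
    false_positives: int
    actionable: int


def hygiene_violations(stats):
    """Return the names of targets from the table that this week missed."""
    violations = []
    if stats.pages >= 2:
        violations.append("pages_per_week")
    if stats.alerts_total and stats.false_positives / stats.alerts_total >= 0.05:
        violations.append("false_positive_rate")
    if stats.off_hours_pages >= 1:
        violations.append("off_hours_pages")
    if stats.alerts_total and stats.actionable / stats.alerts_total <= 0.95:
        violations.append("actionable_percent")
    return violations
```
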
Alert on Symptoms
Page for user pain, ticket for slow burns.