SLIs, SLOs, Error Budgets & The Philosophy of Reliability
SRE Foundations | Technical Operations Excellence
Hope is not a strategy.
- Google SRE Book
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator | Request latency P99 |
| SLO | Service Level Objective | P99 < 200ms |
| SLA | Service Level Agreement | 99.9% or credits |
Set SLOs stricter than the SLA so that an SLO breach gives early warning before the contractual threshold is at risk
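To make the SLI/SLO distinction concrete, here is a minimal sketch that measures a P99 latency SLI from raw samples and compares it with the `P99 < 200ms` objective from the table. The nearest-rank percentile method, the function names, and the sample data are assumptions chosen for illustration; real monitoring systems often estimate percentiles from histograms instead.

```python
import math

SLO_P99_MS = 200.0  # objective from the table above: P99 < 200ms


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms, threshold_ms=SLO_P99_MS):
    """The SLI is the measured P99; the SLO is met when it is below the threshold."""
    return percentile(latencies_ms, 99) < threshold_ms


# Hypothetical window of 100 requests: 98 fast, 2 slow.
latencies = [50.0] * 98 + [450.0] * 2
print(percentile(latencies, 99), meets_slo(latencies))
```

With two slow requests out of 100, the 99th-ranked sample is already a slow one, so the objective is missed even though 98% of requests were fast.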
| SLO | Error budget | Allowed downtime per 30-day month |
|---|---|---|
| 99% | 1% | 7.2 hours |
| 99.9% | 0.1% | 43.2 minutes |
| 99.95% | 0.05% | 21.6 minutes |
| 99.99% | 0.01% | 4.32 minutes |
| 99.999% | 0.001% | 25.9 seconds |
Each additional 9 shrinks the error budget by 10x and, as a rule of thumb, costs roughly 10x more to achieve - choose wisely
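The budget figures can be derived with one line of arithmetic. The sketch below assumes a 30-day window, which reproduces the hour and minute rows of the table; the function name and the `window_days` parameter are illustrative, not part of any standard tool.

```python
def allowed_downtime_seconds(slo_percent, window_days=30):
    """Seconds of full downtime the error budget permits over the window."""
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * window_days * 24 * 60 * 60


for slo in (99, 99.9, 99.95, 99.99, 99.999):
    seconds = allowed_downtime_seconds(slo)
    print(f"{slo}% -> {seconds / 60:.2f} minutes ({seconds:.1f} seconds)")
```

Changing `window_days` (for example to 365 / 12) shifts every figure slightly, so it helps to state the window whenever a budget is quoted.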
| Error budget state | Policy |
|---|---|
| Healthy | Ship features freely, accept calculated risks |
| Running low | Prioritize reliability, increase review rigor |
| Exhausted | Feature freeze, focus exclusively on stability |
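The three postures above can be tied to a measured budget. This is a sketch under stated assumptions: the source gives no numeric thresholds, so the 50% cut-over (`caution_below`) is hypothetical, as are the function names and the request-based way of counting budget spend.

```python
def budget_remaining(total_requests, failed_requests, slo=0.999):
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures <= 0:
        raise ValueError("no error budget: need traffic and an SLO below 100%")
    return 1 - failed_requests / allowed_failures


def release_policy(remaining, caution_below=0.5):
    """Map remaining budget to a posture; caution_below is an assumed threshold."""
    if remaining <= 0:
        return "feature freeze, focus exclusively on stability"
    if remaining < caution_below:
        return "prioritize reliability, increase review rigor"
    return "ship features freely, accept calculated risks"


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% SLO.
remaining = budget_remaining(total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of budget left:", release_policy(remaining))
```

A team would normally agree on the thresholds in advance and write them into its error budget policy, so the switch between postures is not negotiated during an incident.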
| Signal | Measures | Question |
|---|---|---|
| Latency | Request time | How fast? |
| Traffic | System demand | How much? |
| Errors | Failed requests | Failing? |
| Saturation | Utilization | How full? |
If you can only measure four things, measure these
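As an illustration, the four signals can be computed from one window of request records plus a capacity reading. The `Request` record, the choice of median for latency, and the `used_capacity` / `total_capacity` inputs are all assumptions for this sketch; in practice these values usually come from a metrics system rather than raw records.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Summarize one observation window as the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    return {
        # Latency: how fast? (median here; tail percentiles are also common)
        "latency_ms_median": statistics.median([r.duration_ms for r in requests]),
        # Traffic: how much demand? (requests per second)
        "traffic_rps": len(requests) / window_seconds,
        # Errors: what fraction of requests failed?
        "error_rate": sum(1 for r in requests if not r.ok) / len(requests),
        # Saturation: how full is the constrained resource?
        "saturation": used_capacity / total_capacity,
    }


# Hypothetical 2-second window with four requests, one of them failed.
window = [Request(100, True), Request(120, True), Request(140, True), Request(900, False)]
print(golden_signals(window, window_seconds=2, used_capacity=6, total_capacity=8))
```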
Toil = manual, repetitive, automatable work that scales linearly with service growth
| Toil | Not Toil |
|---|---|
| Manually restarting services | Writing automation |
| Copy-paste deployments | Designing CI/CD |
| Manual scaling | Auto-scaling policies |
| Repetitive tickets | Self-service tools |
| Characteristic | Example |
|---|---|
| Manual | Human runs script |
| Repetitive | Done frequently |
| Automatable | No judgment needed |
| Tactical | Interrupt-driven |
| No lasting value | Doesn't improve system |
Google SRE: Cap toil at 50% of time; invest the rest in engineering
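One way to check the 50% cap is to log hours by category and compute the toil share. The categories, hours, and function name below are hypothetical; which categories count as toil is a judgment the team makes using the characteristics listed above.

```python
TOIL_CAP = 0.5  # the 50% cap cited above


def toil_fraction(hours_by_category, toil_categories):
    """Share of logged hours spent on categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for category, h in hours_by_category.items() if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = {
    "manual restarts": 6,
    "repetitive tickets": 10,
    "writing automation": 14,
    "designing CI/CD": 10,
}
fraction = toil_fraction(week, toil_categories={"manual restarts", "repetitive tickets"})
print(f"toil: {fraction:.0%}, within cap: {fraction <= TOIL_CAP}")
```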
| SLI type | What it measures |
|---|---|
| Availability | % successful requests |
| Latency | % under threshold |
| Throughput | Requests processed |
| Freshness | Data staleness |
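The availability and latency SLIs above share one shape: good events divided by total events, as a percentage. The sketch below assumes that any HTTP status below 500 counts as a success and uses a caller-supplied latency threshold; both choices, and the function names, are illustrative.

```python
def good_event_percentage(good, total):
    """Generic SLI: percentage of events that were good."""
    if total == 0:
        raise ValueError("no events in the window")
    return 100.0 * good / total


def availability_sli(status_codes):
    """Assumption: any status code below 500 counts as a successful request."""
    good = sum(1 for code in status_codes if code < 500)
    return good_event_percentage(good, len(status_codes))


def latency_sli(latencies_ms, threshold_ms):
    """Percentage of requests that finished under the threshold."""
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good_event_percentage(good, len(latencies_ms))


# Hypothetical windows of traffic.
print(availability_sli([200] * 997 + [500] * 3))
print(latency_sli([50.0] * 90 + [300.0] * 10, threshold_ms=200))
```

Expressing every SLI as a good/total ratio keeps the error budget arithmetic identical across SLI types.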
Error Budgets Enable Innovation
When healthy, take risks. When depleted, stabilize.
An error budget is data for decisions, not punishment.