From SLI to SLA: Building Meaningful Reliability Targets
SRE Foundations | Technical Operations Excellence
| Term | Definition | Owner |
|---|---|---|
| SLI | Quantitative metric of how the service is performing | Engineers |
| SLO | Target value for the SLI | SRE/Product |
| SLA | Contract with consequences | Business/Legal |

Rule: set the SLO stricter than the SLA. The gap is a buffer that lets the team respond internally before a contractual breach.

| Type | SLI Formula |
|---|---|
| Availability | Successful requests / Total requests |
| Latency | Requests faster than threshold / Total requests |
| Throughput | Requests served / Time period |
| Correctness | Correct responses / Total responses |
| Freshness | Reads with data age < threshold / Total reads |
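
The ratio-style formulas above can be computed directly from request records. The following is a minimal Python sketch of the availability and latency SLIs; the `Request` record, its field names, and the 300 ms threshold are illustrative assumptions, not part of any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; field names are illustrative.
    ok: bool           # True if the request succeeded
    latency_ms: float  # Observed latency in milliseconds


def availability_sli(requests):
    """Successful requests / total requests."""
    if not requests:
        return None  # No traffic: the SLI is undefined for this window
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Requests faster than threshold / total requests."""
    if not requests:
        return None
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


sample = [
    Request(ok=True, latency_ms=120.0),
    Request(ok=True, latency_ms=450.0),
    Request(ok=False, latency_ms=90.0),
    Request(ok=True, latency_ms=200.0),
]
print(availability_sli(sample))    # 0.75 (3 of 4 succeeded)
print(latency_sli(sample, 300.0))  # 0.75 (3 of 4 under 300 ms)
```

Returning `None` for an empty window is a design choice in this sketch: it keeps "no traffic" distinct from "0% good".
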

| Target | Allowed Monthly Downtime | Use Case |
|---|---|---|
| 99% | 7.3 hours | Internal tools |
| 99.5% | 3.65 hours | Non-critical services |
| 99.9% | 43.8 min | Standard production |
| 99.95% | 21.9 min | Business-critical |
| 99.99% | 4.4 min | Mission-critical |
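
The downtime figures in this table follow from simple arithmetic: allowed downtime = (1 - target) x period length. The sketch below reproduces them in Python. It assumes an average month of 730.5 hours (365.25 days / 12), which is what the table's figures appear to be based on; the function name is illustrative.

```python
# Assumption: an "average" month of 365.25 * 24 / 12 = 730.5 hours.
HOURS_PER_MONTH = 365.25 * 24 / 12


def allowed_downtime_minutes(target_percent, period_hours=HOURS_PER_MONTH):
    """Return the downtime budget in minutes for an availability target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return (1 - target_percent / 100) * period_hours * 60


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min ({minutes / 60:.2f} h)")
```

Passing a different `period_hours` (for example `30 * 24` for a fixed 30-day window) gives slightly different budgets, so it helps to state the window alongside the target.
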

| Error Budget Status | Action |
|---|---|
| >50% remaining | Ship features freely |
| 25-50% remaining | Ship with caution |
| <25% remaining | Reliability focus only |
| Exhausted | Feature freeze |
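
The policy table above can be expressed as a small decision function. This is a minimal sketch, assuming the error budget is measured in events (bad requests allowed versus bad requests observed) over the current SLO window; the function names and the example numbers are hypothetical.

```python
def budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no error budget: target is 100% or there is no traffic")
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def release_policy(remaining):
    """Map the remaining budget fraction to the action in the table."""
    if remaining <= 0:
        return "Feature freeze"
    if remaining < 0.25:
        return "Reliability focus only"
    if remaining <= 0.50:
        return "Ship with caution"
    return "Ship features freely"


# Example: 99.9% SLO, 1,000,000 requests, 600 failures.
# The budget is about 1,000 failures, so roughly 40% remains.
remaining = budget_remaining(0.999, 999_400, 1_000_000)
print(release_policy(remaining))  # Ship with caution
```
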

| Service Type | Primary SLIs |
|---|---|
| User-facing API | Availability, latency |
| Background job | Completion rate, freshness |
| Data pipeline | Freshness, correctness |
| Storage | Durability, availability |
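
Alert on short windows (minutes to hours) to catch fast budget burn; report on long windows (for example, the full SLO period) to judge compliance.
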
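
One common way to apply this rule is burn-rate alerting: page when a short window shows errors consuming the budget much faster than the SLO allows, and compute plain compliance over the long window for reports. The sketch below assumes a 30-day SLO window and a 1-hour alert window with a burn-rate threshold of 14.4 (equivalent to spending 2% of a 30-day budget in one hour); the threshold and all function names are assumptions for illustration, not values taken from this document.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratio_1h, slo_target, threshold=14.4):
    """Short window: alert when the last hour burns budget too fast."""
    return burn_rate(error_ratio_1h, slo_target) >= threshold


def compliance_report(good_events, total_events, slo_target):
    """Long window: report the SLI and whether the SLO was met."""
    sli = good_events / total_events
    return {"sli": sli, "met": sli >= slo_target}


# 2% errors in the last hour against a 99.9% SLO is about a 20x burn rate.
print(should_page(0.02, 0.999))   # True
# 0.5% errors is about a 5x burn rate, below the paging threshold.
print(should_page(0.005, 0.999))  # False
# 30-day report: 1,500 failures out of 3,000,000 requests.
print(compliance_report(2_998_500, 3_000_000, 0.999)["met"])  # True
```
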
| Step | Action |
|---|---|
| 1 | Identify critical user journeys |
| 2 | Define SLIs for each journey |
| 3 | Set initial targets (start conservative, with targets the service can already meet) |
| 4 | Implement measurement & dashboards |
| 5 | Create error budget alerts |
| 6 | Establish review cadence |

100% is the wrong target. Choose the reliability that balances user happiness with development velocity.
SLOs Enable Decisions
Error budgets are the currency of reliability.