SLO Design Framework

From SLI to SLA: Building Meaningful Reliability Targets

SRE Foundations | Technical Operations Excellence

SLI/SLO/SLA Tiers: 3
Common Target: 99.9%
Annual Budget (99.9%): 8.76 hr
Rolling Window: 30 d

SLI → SLO → SLA Hierarchy

Term | Definition | Owner
SLI | Metric that measures the service | Engineers
SLO | Target value for the SLI | SRE/Product
SLA | Contract with consequences | Business/Legal

Rule: Set the SLO stricter than the SLA, so the team has a buffer to respond internally before the contract is breached (for example, a 99.95% SLO behind a 99.9% SLA).

Common SLI Types

Type | SLI Formula
Availability | Successful requests / Total requests
Latency | Requests with latency < threshold / Total requests
Throughput | Requests served / Time period
Correctness | Correct responses / Total responses
Freshness | Reads with data age < threshold / Total reads
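
The ratio formulas in the table above can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the `Request` record, its field names, and the 300 ms threshold are hypothetical, and treating an empty window as "SLI met" is an assumption you may want to change.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request; the field names are hypothetical."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # observed latency in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Successful requests / total requests."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Requests faster than the threshold / total requests."""
    if not requests:
        return 1.0  # same assumption as above
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


sample = [
    Request(ok=True, latency_ms=120.0),
    Request(ok=True, latency_ms=340.0),
    Request(ok=False, latency_ms=900.0),
    Request(ok=True, latency_ms=80.0),
]
print(availability_sli(sample))    # 0.75
print(latency_sli(sample, 300.0))  # 0.5
```

Correctness and freshness follow the same "good events / total events" shape; only the predicate that decides what counts as "good" changes.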

Target Selection Guide

Target | Monthly Downtime | Use Case
99% | 7.3 hours | Internal tools
99.5% | 3.65 hours | Non-critical services
99.9% | 43.8 min | Standard production
99.95% | 21.9 min | Business-critical
99.99% | 4.4 min | Mission-critical
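
The downtime figures follow directly from the target: allowed downtime = (1 - target) × window length. The sketch below reproduces the table up to rounding, assuming a 730-hour month (8,760 hours per year divided by 12); the function name and default are illustrative, not part of any standard tool.

```python
HOURS_PER_MONTH = 730.0  # assumption: 8,760 hours per year / 12


def allowed_downtime_minutes(target_pct: float,
                             window_hours: float = HOURS_PER_MONTH) -> float:
    """Downtime (in minutes) that an availability target permits in the window."""
    if not 0.0 <= target_pct <= 100.0:
        raise ValueError("target_pct must be between 0 and 100")
    return (1.0 - target_pct / 100.0) * window_hours * 60.0


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min ({minutes / 60:.2f} h)")
```

Passing `window_hours=8760.0` with a 99.9% target gives the annual budget of roughly 8.76 hours quoted at the top of this sheet.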

Error Budget Policy

Budget Status | Action
>50% remaining | Ship features freely
25-50% remaining | Ship with caution
<25% remaining | Reliability focus only
Exhausted | Feature freeze
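
One way to make the policy mechanical is to compute the remaining budget from request counts and map it to an action. This is a sketch under stated assumptions: the function names are hypothetical, the SLO target is passed as a fraction (0.999 for 99.9%), and the boundary values of exactly 25% and 50% are assigned to "Ship with caution", which the table does not specify.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total == 0:
        return 1.0  # assumption: no traffic means no budget has been spent
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO tolerates
    return 1.0 - (total - good) / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action in the policy table."""
    if remaining <= 0.0:
        return "Feature freeze"
    if remaining < 0.25:
        return "Reliability focus only"
    if remaining <= 0.50:
        return "Ship with caution"
    return "Ship features freely"


# 400 failures out of 1,000,000 requests against a 99.9% SLO:
# about 1,000 failures are allowed, so roughly 60% of the budget remains.
remaining = budget_remaining(0.999, good=999_600, total=1_000_000)
print(policy_action(remaining))  # Ship features freely
```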

Service-Type Patterns

Service Type | Primary SLIs
User-facing API | Availability, latency
Background job | Completion rate, freshness
Data pipeline | Freshness, correctness
Storage | Durability, availability

Multi-Window SLO

  • 30-day: Long-term reliability view
  • 7-day: Recent trend indicator
  • 1-day: Acute issue detection
  • 1-hour: Real-time burn rate

Alert on short windows; report on long windows
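
Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows; a value of 1 spends the budget exactly over the SLO window. The sketch below applies that definition to the four windows. The error rates are made-up sample data, and the 14.4 paging threshold is an illustrative assumption (it corresponds to spending 2% of a 30-day budget in one hour, since 0.02 × 720 h = 14.4).

```python
SLO_TARGET = 0.999  # the 99.9% target, expressed as a fraction

# Hypothetical observed error rates (bad events / total events) per window.
observed = {"1-hour": 0.0150, "1-day": 0.0040, "7-day": 0.0012, "30-day": 0.0008}


def burn_rate(error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return error_rate / (1.0 - slo_target)


def fast_burn_alert(rates: dict[str, float], threshold: float = 14.4) -> bool:
    """Alert on the short window only; the threshold is an assumed example."""
    return burn_rate(rates["1-hour"]) >= threshold


for window, rate in observed.items():
    print(f"{window}: burn rate {burn_rate(rate):.1f}")
print("page on-call:", fast_burn_alert(observed))
```

In this sample the 1-hour window triggers a page while the 30-day burn rate is still below 1, which is the pattern the rule describes: the short window catches the acute issue and the long window reports overall standing.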

Implementation Checklist

Step | Action
1 | Identify critical user journeys
2 | Define SLIs for each journey
3 | Set initial targets (start conservative)
4 | Implement measurement & dashboards
5 | Create error budget alerts
6 | Establish review cadence

Key Insight

100% is the wrong target. Choose the reliability that balances user happiness with development velocity.

SLOs Enable Decisions

Error budgets are the currency of reliability.