SRE Foundations

SLIs, SLOs, Error Budgets & The Philosophy of Reliability

SRE Foundations | Technical Operations Excellence

50%
Max Toil Cap
99.9%
Typical SLO
43m
Error Budget/Mo
4
Golden Signals

The Core Philosophy

Hope is not a strategy.

- Google SRE Book

  • class SRE implements interface DevOps
  • Apply engineering discipline to operations
  • Balance reliability with feature velocity
  • Measure everything; improve continuously

SLI / SLO / SLA Hierarchy

TermDefinitionExample
SLIService Level IndicatorRequest latency P99
SLOService Level ObjectiveP99 < 200ms
SLAService Level Agreement99.9% or credits

SLOs should be stricter than SLAs for early warning

Error Budget Math

SLOBudgetMonthly
99%1%7.2 hours
99.9%0.1%43.2 minutes
99.95%0.05%21.6 minutes
99.99%0.01%4.32 minutes
99.999%0.001%26.3 seconds

Each 9 costs 10x more - choose wisely

Error Budget Policy

Healthy (>50%)

Ship features freely, accept calculated risks

Warning (25-50%)

Prioritize reliability, increase review rigor

Critical (<25%)

Feature freeze, focus exclusively on stability

The Four Golden Signals

SignalMeasuresQuestion
LatencyRequest timeHow fast?
TrafficSystem demandHow much?
ErrorsFailed requestsFailing?
SaturationUtilizationHow full?

If you can only measure four things, measure these

Toil: The Enemy of SRE

Toil = manual, repetitive, automatable work that scales linearly with service growth

ToilNot Toil
Manually restarting servicesWriting automation
Copy-paste deploymentsDesigning CI/CD
Manual scalingAuto-scaling policies
Repetitive ticketsSelf-service tools
<50%
Max toil (Google rule)

What is Toil?

CharacteristicExample
ManualHuman runs script
RepetitiveDone frequently
AutomatableNo judgment needed
TacticalInterrupt-driven
No lasting valueDoesn't improve system

Google SRE: Cap toil at 50% of time; invest the rest in engineering

SLO Categories

Availability% successful requests
Latency% under threshold
ThroughputRequests processed
FreshnessData staleness

Error Budgets Enable Innovation

When healthy, take risks. When depleted, stabilize.
It's data for decisions, not punishment.