Reliability Unleashed
The Engineering Playbook
From Chaos to Confidence
A Comprehensive Guide to Site Reliability Engineering
9 Parts | 34 One-Pagers | ~8.5 Hours | Technical Track
Welcome to Reliability Unleashed: The Engineering Playbook. This is the comprehensive technical track covering all aspects of Site Reliability Engineering. We'll journey through 9 parts covering 34 detailed one-pagers, synthesizing best practices from Google, Netflix, NASA, and High-Reliability Organizations. This isn't just theory - it's a practical engineering playbook for building world-class operations. Key themes: SLOs and error budgets, observability mastery, resilience patterns, incident excellence, and the emerging field of agentic operations. Let's begin.
The Journey Ahead
Part 1
Foundation & Vision
Part 2
Observability Mastery
Part 3
Resilience Patterns
Part 4
Incident Excellence
Part 5
Release, Testing & Capacity
Part 6
Cloud & Infrastructure
Part 7
AI/ML & Agentic Ops
Part 8
People & Culture
Part 9
Industry & Roadmap
Here's our journey. Nine parts, each building on the previous. Part 1 establishes the foundation - why SRE, SLIs/SLOs, DORA metrics. Part 2 dives deep into observability - the three pillars plus high-cardinality events. Part 3 covers resilience patterns from circuit breakers to chaos engineering. Part 4 is incident excellence - response, post-mortems, learning from catastrophe. Part 5 tackles release engineering, testing automation, and capacity planning. Part 6 is cloud and infrastructure - Kubernetes, platform engineering, cloud-native SRE. Part 7 explores the frontier - AI/ML operations and agentic systems. Part 8 focuses on people and culture - the human element. Part 9 wraps with industry examples and implementation roadmap.
Part 1 of 9
Foundation & Vision
SRE Fundamentals, SLIs/SLOs, DORA Metrics, Maturity Assessment
reliability-unleashed
sre-foundations
dora-24-capabilities
sre-maturity-assessment
Part 1: Foundation and Vision. We start with the fundamentals - what is SRE, why does it matter, and how do we measure success? We'll cover four one-pagers: Reliability Unleashed (the vision), SRE Foundations (SLIs, SLOs, error budgets), DORA 24 Capabilities (the research framework), and SRE Maturity Assessment (measuring where you are). This foundation is critical - everything else builds on these concepts.
Why SRE? Why Now?
4+
AI Agents in production
24/7
Bot operations never sleep
100s
Daily commits across worktrees
Bots don't get tired. But they can fail.
And when they do, who responds at 3 AM?
Our bot army is growing. 4+ agents working in parallel, 24/7 operations, hundreds of daily commits. But here's the challenge: bots don't get tired, but they can fail. They can get stuck, hit rate limits, corrupt state, or produce errors. Who responds at 3 AM when Claude hits an API timeout? That's why we need SRE - Site Reliability Engineering.
DevOps vs SRE
class SRE implements interface DevOps { }
DevOps
Philosophy, culture, movement
"Break down silos"
Continuous delivery mindset
Automation everywhere
SRE
Specific implementation
Error budgets, SLOs
Toil reduction targets
On-call engineering
"DevOps is the philosophy; SRE is the implementation." — Google
Quick clarification on terminology. DevOps is a philosophy - break down silos, automate everything, continuous delivery. SRE is how you implement that philosophy with engineering rigor. Google coined the term and literally said "class SRE implements interface DevOps." We use SLOs, error budgets, and treat operations as software engineering.
Three Pillars of Operations
Reactive
Alert triage & response
Runbook execution
Incident management
Escalation protocols
Proactive
SLO monitoring
Capacity planning
Change management
Toil reduction
Predictive
Anomaly detection
Chaos engineering
AIOps & ML
Self-healing systems
We organize operations into three pillars. Reactive: respond to incidents when they happen. Proactive: prevent incidents through monitoring and planning. Predictive: anticipate incidents before they occur using AI and chaos engineering. Our goal is to move along this spectrum - from mostly reactive toward increasingly predictive operations.
The Vision: Autonomous Reliability
"Bots that monitor, diagnose, remediate, and learn — with humans for strategy and novel challenges."
70%
Auto-resolved at L1
<15min
MTTR target
99.95%
Availability goal
Here's our vision: autonomous reliability. Bots that don't just do work, but monitor themselves, diagnose issues, execute runbooks, and learn from failures. Humans provide strategy and handle novel situations. Target: 70% of incidents auto-resolved at L1, MTTR under 15 minutes, 99.95% availability. This is ambitious but achievable.
Part 2 of 9
Observability Mastery
Three Pillars, OpenTelemetry, Alerting Strategy, High-Cardinality Events
observability-mastery
multi-window-alerting
use-method-performance
observability-2.0
alert-tuning-playbook
Part 2: Observability Mastery. You can't fix what you can't see. We'll cover five one-pagers: Observability Mastery (three pillars and OpenTelemetry), Multi-Window Alerting (burn rate strategy), USE Method (Utilization, Saturation, Errors), Observability 2.0 (high-cardinality events), and Alert Tuning Playbook (reducing noise). This section is about building the visibility you need to operate reliably.
Learning from Industry Leaders
Google SRE
Error budgets, 50% cap
Netflix
Chaos engineering
AWS
Well-Architected Framework
Meta
SEV culture
Spotify
Golden paths
Toyota
Kaizen
We're not inventing this from scratch. We're standing on the shoulders of giants. Google invented SRE - error budgets, the 50% ops cap. Netflix pioneered chaos engineering with Chaos Monkey. AWS has the Well-Architected Framework. Meta has SEV culture. Spotify has golden paths for developer experience. Toyota gave us Kaizen - continuous improvement. We're synthesizing the best of each.
High-Reliability Organizations
Lessons from Aviation, Nuclear, Healthcare, Military
1 Preoccupation with Failure — Never ignore small failures
2 Reluctance to Simplify — Embrace complexity
3 Sensitivity to Operations — Real-time awareness
4 Commitment to Resilience — Detect, contain, recover
5 Deference to Expertise — Empower frontline decisions
Beyond tech companies, we learn from High-Reliability Organizations - aviation, nuclear power, healthcare, military. These industries operate error-free under extreme conditions. Five principles from Weick and Sutcliffe: be obsessed with failure, don't oversimplify, maintain situational awareness, build resilience, and defer to expertise. For bots: treat every SLO miss as learning, use multi-signal observability, empower bots to decide.
Aviation: Crew Resource Management
Origin: 1978 United Flight 173 — crew ran out of fuel while troubleshooting
70-80%
of accidents from human error, not mechanical failure
"Up until 1980, we worked on the concept that the captain was THE authority. What he said, goes. And we lost a few airplanes because of that."
— Captain Al Haynes, United 232
Bot Application: Actively seek input from other bots; hierarchical authority yields to expertise
Aviation taught us Crew Resource Management. Flight 173 ran out of fuel because everyone deferred to the captain. 70-80% of accidents are human error, not mechanical. Captain Haynes said they used to think the captain was always right - and they lost airplanes. Lesson for bots: don't have a single bot be the authority. Cross-check, verify, escalate when uncertain.
Netflix: Chaos Engineering
Philosophy: "Avoid failure by failing constantly"
Chaos Monkey
Latency Monkey
Chaos Gorilla
0 Impact
When AWS lost 10% of servers (Sept 2014), Netflix kept running
Bot Application: Regular game days, failure injection testing, resilience as cultural value
Netflix built the Simian Army - Chaos Monkey randomly kills servers, Latency Monkey injects delays, Chaos Gorilla takes down entire availability zones. In September 2014 when AWS lost 10% of servers, Netflix was unaffected while others went down. Why? They'd already practiced that failure. For us: regular game days, inject failures intentionally, make resilience part of our culture.
Scaling Reliability: Industry Examples
Stripe
99.999% uptime
Defensive design
Uber
Millions RPS
Jaeger tracing
Shopify
57.3 PB BFCM
9-mo prep cycle
Discord
30M msg/sec
Elixir + ScyllaDB
Roblox
145K machines
Cell architecture
Cloudflare
320+ cities
Follow-the-sun
Let's look at scale. Stripe maintains 99.999% uptime - that's about 24 seconds of downtime per 28 days - and achieves six 9s during Black Friday. Uber runs thousands of microservices handling millions of RPS; they created Jaeger for distributed tracing. Shopify processes 57.3 petabytes during BFCM with a 9-month preparation cycle. Discord delivers 30 million messages per second using Elixir, Rust, and ScyllaDB. Roblox rebuilt on a cell-based architecture after a 73-hour outage taught them about blast radius. Cloudflare uses follow-the-sun on-call across 320+ cities.
Latency Tiers: Right-Sizing Reliability
<1ms
Ultra-Low
HFT, Gaming physics
FPGA, kernel bypass
1-100ms
Low
Real-time apps, APIs
In-memory, edge
100ms-1s
Standard
Web apps, microservices
CDN, caching
1-30s
Tolerant
Batch, analytics
Eventual consistency
>30s
Flexible
Background, ML
Offline processing
Different domains have different latency requirements. Ultra-low latency under 1ms is for high-frequency trading and gaming physics - requires FPGAs, kernel bypass, and physical colocation. Low latency 1-100ms covers real-time applications and APIs - in-memory databases and edge compute. Standard 100ms-1s is typical for web applications - CDN caching and async processing. Tolerant 1-30s works for batch processing - eventual consistency acceptable. Flexible over 30s is for background jobs and ML training. Match your architecture to your tier.
Lessons from Mission-Critical Industries
Space
Triplex redundancy
7K+ engine tests
Formal verification
Military
Disciplined initiative
Decentralized exec
Pre-deployment sim
Nuclear
Defense in depth (5)
Diverse redundancy
Safety isolation
Deep Sea
3 battery buses
180+ monitored
Galvanic failsafe
Beyond tech, mission-critical industries teach us reliability. NASA and SpaceX use triplex redundancy with voting - three processors must agree. They run 7,000+ engine tests at McGregor before any flight. The military practices disciplined initiative - tell intent, expect subordinates to achieve it. Nuclear plants use defense in depth - five independent barriers against failure. Deep sea vehicles like DEEPSEA CHALLENGER have three battery buses and can lose two while still functioning. 180+ systems monitored, with a galvanic failsafe that surfaces the vehicle automatically after 11-13 hours.
Just Culture: Blameless Post-Mortems
"Blame closes off avenues for understanding how and why something happened."
— Sidney Dekker
Old View
People cause failure → Punish
New View
Error is symptom → Fix system
Ask "what" and "how", never "why"
Sidney Dekker's Just Culture principle: blame shuts down learning. Old view: find the bad actor and punish them. New view: human error is a symptom of systemic problems - fix the system. Practical tip from John Allspaw at Etsy: ask "what" and "how" questions, never "why." "Why did you do that" forces justification. "What did you see? How did you respond?" opens learning.
Part 3 of 9
Resilience Patterns
Circuit Breakers, Defense in Depth, HRO Principles, Chaos Engineering
resilience-patterns
defense-in-depth
hro-pattern-recognition
release-it-patterns
chaos-engineering
Part 3: Resilience Patterns. How do we build systems that survive failure? Five one-pagers: Resilience Patterns (circuit breakers, bulkheads, retry logic), Defense in Depth (layered security from nuclear industry), HRO Pattern Recognition (High-Reliability Organization principles), Release It! Patterns (Michael Nygard's stability patterns), and Chaos Engineering (Netflix's GameDay practices). This is about building systems that bend but don't break.
Our Observability Stack
COLLECT
Telegraf
OpenTelemetry
Bot Reporters
→
STORE
InfluxDB 3.0
Time-series DB
BQL Queries
→
VISUALIZE
Grafana
Dashboards
Alerting Rules
→
ACT
Slack Alerts
PagerDuty
Ops Bot
Our observability stack is purpose-built for bot operations. Telegraf agents collect system metrics. OpenTelemetry instruments our applications. Bot reporters send session metrics. Everything flows into InfluxDB 3.0 - chosen for its time-series optimization and BQL query language. Grafana provides dashboards and alerting. When alerts fire, they route to Slack for visibility and PagerDuty for on-call, with Ops Bot as first responder.
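To make the collection stage concrete, here is a minimal sketch (in Python) of how a bot reporter might format a session metric in InfluxDB line protocol before handing it to Telegraf or a write endpoint. The measurement, tag, and field names (bot_sessions, success_rate, commits) are illustrative rather than our actual schema, and the write call itself depends on the client your InfluxDB version provides.

import time

def session_metric_line(bot_name: str, session_id: str, success_rate: float, commits: int) -> str:
    # InfluxDB line protocol: measurement,tag_set field_set timestamp(ns)
    tags = f"bot_name={bot_name},session_id={session_id}"
    fields = f"success_rate={success_rate},commits={commits}i"  # trailing 'i' marks an integer field
    return f"bot_sessions,{tags} {fields} {time.time_ns()}"

# One line per session report; session_id as a tag is deliberately high-cardinality.
print(session_metric_line("claude-feat", "sess_abc123", 0.97, 4))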
InfluxDB 3.0 & BQL Queries
Why InfluxDB?
Native time-series storage
High-cardinality support
Columnar compression
Sub-second query latency
Downsampling & retention
BQL Query Examples
-- Session success rate
SELECT mean(success_rate)
FROM bot_sessions
WHERE time > now() - 1h
GROUP BY bot_name
-- Error budget burn
SELECT sum(errors) / sum(total)
FROM api_calls
WHERE time > now() - 30d
We chose InfluxDB 3.0 for its native time-series capabilities. High-cardinality support is critical - we have many bots, many sessions, many correlation IDs. Columnar compression keeps storage costs manageable. BQL gives us SQL-like queries for time-series data. Here are two examples: calculating session success rate by bot, and computing error budget burn over a 30-day window.
Grafana Dashboard Strategy
Athena System
CPU, memory, disk, network
→ SRE Team
Bot Army
Sessions, productivity, commits
→ All Engineers
Bot Operations
SLOs, MCP health, error budgets
→ Ops Bot
Human Experience
Focus metrics, escalations
→ Human CEO
Each dashboard serves a specific audience with relevant context
We have four primary dashboards, each serving a specific audience. Athena System shows infrastructure health for the SRE team. Bot Army shows session and productivity metrics for all engineers. Bot Operations is Ops Bot's primary view - SLOs, compliance status, MCP health. Human Experience minimizes noise for human oversight - only escalations and focus metrics. Dashboard design principle: each audience sees what they need to act on.
Distributed Tracing with Jaeger
Bot Session
2.3s total
MCP Call (Jira)
450ms
Git Operations
320ms
File I/O
180ms
API Call (Claude)
1.2s ⚠️
Latency breakdown — Where is time spent?
Error propagation — What caused the failure?
Dependency mapping — What calls what?
Jaeger gives us distributed tracing across bot sessions. This example shows a 2.3 second bot session broken down into spans: MCP calls to Jira, git operations, file I/O, and an API call to Claude that took 1.2 seconds - that's our bottleneck. Tracing answers three questions: where is time spent, what caused failures, and what are our dependencies. OpenTelemetry handles instrumentation; Jaeger handles collection and visualization.
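For reference, here is a minimal tracing sketch in Python using the OpenTelemetry API. It assumes the SDK and a Jaeger/OTLP exporter are configured at process startup; the span names, attribute key, and the two helper functions are illustrative stand-ins, not our actual code.

from opentelemetry import trace

tracer = trace.get_tracer("bot.session")

def fetch_ticket(task_id):       # hypothetical stand-in for the MCP call to Jira
    ...

def generate_change(task_id):    # hypothetical stand-in for the Claude API call
    ...

def run_bot_session(task_id: str):
    # Each nested span becomes one row in the Jaeger latency breakdown shown above.
    with tracer.start_as_current_span("bot_session") as span:
        span.set_attribute("task.id", task_id)
        with tracer.start_as_current_span("mcp_call_jira"):
            fetch_ticket(task_id)
        with tracer.start_as_current_span("api_call_claude"):
            generate_change(task_id)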
Correlation IDs & Agent Context
Session ID
sess_abc123
→
Bot Identity
claude-feat
→
Task ID
HOME-456
→
Trace ID
tr_xyz789
Cross-Signal Correlation
Link metrics → logs → traces
Find all activity for one session
Reconstruct incident timeline
Audit Trail
Which bot made this change?
What JIRA ticket triggered it?
Full provenance chain
Every bot action carries context: session ID, bot identity, JIRA task ID, and trace ID. This enables cross-signal correlation - we can find all metrics, logs, and traces for a single session. When investigating an incident, we query by session ID and see the full picture. Audit trail is critical for compliance - we know which bot made every change and why.
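A sketch of how this context could be carried in-process, using Python's contextvars; the field names mirror the chain above, but the helpers themselves are illustrative.

import contextvars
import uuid

# One ContextVar per correlation field; set once at session start, read by every emitter.
session_id = contextvars.ContextVar("session_id", default="unknown")
bot_identity = contextvars.ContextVar("bot_identity", default="unknown")
task_id = contextvars.ContextVar("task_id", default="unknown")

def start_session(bot: str, task: str) -> str:
    sid = "sess_" + uuid.uuid4().hex[:8]
    session_id.set(sid)
    bot_identity.set(bot)
    task_id.set(task)
    return sid

def correlation_context() -> dict:
    # Attach this dict to every metric, log line, and span so signals can be joined later.
    return {"session_id": session_id.get(), "bot": bot_identity.get(), "task_id": task_id.get()}

start_session("claude-feat", "HOME-456")
print(correlation_context())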
Centralized Logging Strategy
ERROR
Failures needing action
90 days
WARN
Degraded but recovering
30 days
INFO
Normal operations
14 days
DEBUG
Troubleshooting
7 days
Structured JSON
Correlation IDs
Searchable fields
No secrets
Our logging strategy uses tiered retention based on severity. Errors get 90 days - we need them for post-mortems and trend analysis. Warnings get 30 days. Info logs for 14 days. Debug for 7 days - enough to troubleshoot recent issues. Four principles: structured JSON for queryability, always include correlation IDs, make every field searchable, and never log secrets or credentials.
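As an illustration of those four principles, here is a small structured-logging sketch in Python; the redaction list and field names are examples, not a complete policy.

import json
import logging
import sys

REDACTED_KEYS = {"token", "api_key", "password", "secret"}

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "session_id": getattr(record, "session_id", None),  # correlation ID on every line
        }
        extra = getattr(record, "fields", {})
        # Never log secrets: drop any field whose key looks sensitive.
        payload.update({k: v for k, v in extra.items() if k.lower() not in REDACTED_KEYS})
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("bot")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("session started", extra={"session_id": "sess_abc123", "fields": {"bot": "claude-feat"}})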
Alerting Philosophy: Signal vs. Noise
"Every alert should be actionable. If you can't act on it, it's noise."
P1 - PAGE
Service down, data loss risk
Immediate response
P2 - NOTIFY
Degraded, SLO at risk
Within 1 hour
P3 - TRACK
Anomaly detected
Business hours
P4 - LOG
Informational
Review weekly
Alert fatigue is real. Our philosophy: every alert must be actionable. Four tiers: P1 pages immediately - service is down or data at risk. P2 notifies but doesn't wake anyone - degraded performance, SLO at risk. P3 creates a ticket for business hours investigation. P4 just logs for weekly review. The goal is signal, not noise. If you're ignoring alerts, the alerts are wrong.
USE Method: Performance Analysis
Brendan Gregg's systematic approach to resource bottlenecks
Utilization
Average time resource was busy
CPU: 85%, Memory: 72%
Saturation
Extra work queued or denied
Queue depth, wait time
Errors
Count of error events
ECC errors, retries, drops
Apply to every resource:
CPU, Memory, Disk I/O, Network, GPUs, API quotas
The USE Method from Brendan Gregg provides a systematic approach to performance analysis. For every resource: check Utilization (how busy is it?), Saturation (is work queuing up?), and Errors (are operations failing?). Apply this to every resource type: CPUs, memory, storage, network, and even API rate limits. This method quickly identifies bottlenecks because high utilization OR saturation OR errors indicates a problem. It's simple, systematic, and catches issues that other methods miss.
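A rough USE snapshot can be gathered with the third-party psutil library, as sketched below; the error column (ECC errors, retries, drops) comes from other sources such as kernel logs and NIC counters, so it is omitted here.

import psutil  # third-party: pip install psutil

def use_snapshot() -> dict:
    load1, _, _ = psutil.getloadavg()
    return {
        "cpu": {
            "utilization_pct": psutil.cpu_percent(interval=1),
            "saturation_load1": load1,  # compare against psutil.cpu_count() to judge queueing
        },
        "memory": {
            "utilization_pct": psutil.virtual_memory().percent,
            "saturation_swap_pct": psutil.swap_memory().percent,
        },
        "disk": {
            # Capacity, not I/O busy time; true disk utilization needs busy-time deltas (e.g. iostat).
            "capacity_pct": psutil.disk_usage("/").percent,
        },
    }

print(use_snapshot())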
RED Method: Service Monitoring
Tom Wilkie's approach for request-driven services
Rate
Requests per second
http_requests_total
Errors
Failed requests per second
5xx responses, exceptions
Duration
Time per request (latency)
P50, P95, P99 histograms
USE for Resources | RED for Services | Both for Complete Coverage
The RED Method complements USE for service-level monitoring. Rate: how many requests per second is the service handling? Errors: how many of those requests are failing? Duration: how long do successful requests take? The key insight: USE is for resources (CPUs, disks), RED is for services (APIs, microservices). Use both together for complete coverage. RED directly maps to user experience - if errors go up or duration increases, users notice.
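As a sketch, RED can be computed from a window of request records like this; the record shape and the nearest-rank percentile helper are illustrative simplifications.

def percentile(sorted_vals, q):
    # Nearest-rank percentile; adequate for a monitoring summary.
    if not sorted_vals:
        return 0.0
    return sorted_vals[min(len(sorted_vals) - 1, int(q * len(sorted_vals)))]

def red_summary(requests, window_s=60.0):
    # Each record: {"ok": bool, "duration_ms": float}, collected over a window_s-second window.
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if not r["ok"])
    return {
        "rate_rps": len(requests) / window_s,
        "error_rps": errors / window_s,
        "p50_ms": percentile(durations, 0.50),
        "p95_ms": percentile(durations, 0.95),
        "p99_ms": percentile(durations, 0.99),
    }

sample = [{"ok": True, "duration_ms": 120}, {"ok": False, "duration_ms": 900}, {"ok": True, "duration_ms": 180}]
print(red_summary(sample))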
Multi-Window Burn Rate Alerting
Burn Rate = How fast you're consuming error budget
burn_rate = (errors / window) / (budget / period)
5 min
Fast Burn
Immediate outage
>10x → P1
1 hour
Medium Burn
Sustained issues
>5x → P2
6 hours
Slow Burn
Degradation trend
>2x → P3
From Liz Fong-Jones & Google SRE Workbook
Burn rate alerting comes from Liz Fong-Jones and the Google SRE Workbook. Burn rate measures how fast you're consuming error budget. A burn rate of 1 means you'll exactly exhaust budget by period end. 10x means you'll be out in 1/10th the time. We use multiple windows: 5-minute for immediate outages (10x = P1), 1-hour for sustained issues (5x = P2), 6-hour for slow degradation (2x = P3). This catches both sudden failures and slow leaks.
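In code, the evaluation is simple. This sketch uses the equivalent error-ratio form of the burn rate (observed error ratio divided by the ratio the SLO allows) with the window thresholds from the slide; the SLO value and the counts are examples.

def burn_rate(errors: int, total: int, slo: float) -> float:
    # Burn rate = observed error rate in the window / error rate the SLO allows.
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)

# Thresholds from the slide: 5 min at >10x -> P1, 1 h at >5x -> P2, 6 h at >2x -> P3.
WINDOWS = [("5m", 10.0, "P1"), ("1h", 5.0, "P2"), ("6h", 2.0, "P3")]

def evaluate(counts: dict, slo: float = 0.995):
    alerts = []
    for window, threshold, severity in WINDOWS:
        errors, total = counts[window]
        rate = burn_rate(errors, total, slo)
        if rate > threshold:
            alerts.append((severity, window, round(rate, 1)))
    return alerts

print(evaluate({"5m": (30, 500), "1h": (120, 20000), "6h": (300, 90000)}))  # -> [('P1', '5m', 12.0)]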
Part 4 of 9
Incident Excellence
Response & Postmortems, Learning from Catastrophe, Runbook Design
incident-excellence
learning-from-catastrophe
runbook-quick-reference
Part 4: Incident Excellence. Incidents are inevitable - excellence is how we respond and learn. Three one-pagers: Incident Excellence (response protocols and blameless postmortems), Learning from Catastrophe (case studies from Knight Capital, AWS S3, GitLab), and Runbook Quick Reference (templates and best practices). The goal isn't to prevent all incidents - it's to detect fast, respond effectively, and learn continuously.
SLOs and Error Budgets
SLI
Target
Error Budget
Availability 99.5% 3.6 hrs/month
Success Rate 98.0% 2% failures
Latency P95 <3s 2% slow
MTTR <15min Agentic response
Auto-Resolution 70% L1 handled by Ops Bot
>50% Ship freely
25-50% Prioritize reliability
<25% Feature freeze
Here's our SLO framework. Service Level Indicators measure what matters: availability, success rate, latency, MTTR, and auto-resolution rate. Each has a target and an error budget - the amount of unreliability we can tolerate. Error budget policy: with more than 50% of budget remaining, ship features freely; at 25-50%, prioritize reliability work; below 25%, feature freeze until we're back on track. This makes reliability a data-driven decision.
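For illustration, the error-budget policy reduces to a few lines of arithmetic; the counts below are made up and the thresholds come straight from the slide.

def error_budget_remaining(good: int, total: int, slo: float = 0.995) -> float:
    # Fraction of the period's error budget still unspent (1.0 = untouched, 0.0 = exhausted).
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1.0 - actual_failures / allowed_failures)

def policy(remaining: float) -> str:
    if remaining > 0.50:
        return "Ship freely"
    if remaining >= 0.25:
        return "Prioritize reliability"
    return "Feature freeze"

remaining = error_budget_remaining(good=99_700, total=100_000)  # 0.3% failures vs a 0.5% budget
print(round(remaining, 2), policy(remaining))  # -> 0.4 Prioritize reliability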
Operational Metrics: Full Coverage
System Health
MCP Availability: 99.9%
Resource Util: <80%
API Headroom: >20%
Bot Productivity
Session Success: >95%
Commits/Session: >3
Stall Rate: <5%
Operational Toil
Manual: <5/wk
Automation: >80%
Alert Noise: <20%
Incident Quality
MTTD: <2 min
MTTA: <5 min
Recurrence: <10%
Beyond basic availability, we track four categories of operational metrics. System Health monitors our infrastructure - MCP servers, resource utilization, API rate limits. Bot Productivity tracks session success, commits per session, and stall rates - these correlate with system health. Operational Toil measures manual interventions, automation rate, and alert noise - our goal is to minimize toil. Incident Quality tracks detection time, acknowledgment time, and recurrence rate - are we learning from failures?
Incident Lifecycle (ITIL)
1. Identify
→
2. Categorize
→
3. Prioritize
→
4. Respond
→
5. Close
SEV1 Critical — <15 min response
SEV2 Major — <1 hour response
SEV3 Minor — <4 hours response
SEV4 Low — <24 hours response
Incident management follows ITIL's five-step lifecycle: Identify the problem, Categorize what type it is, Prioritize based on business impact, Respond with appropriate resources, Close with documentation. Severity levels drive response: SEV1 is all-hands critical, 15 minute response. SEV4 is low priority, next day is fine. For Bot Army: Ops Bot handles SEV3/4 autonomously, escalates SEV1/2 to humans.
Bot-First Escalation Model
L1: Ops Bot — Auto-triage, runbook execution
70%
↓
L2: Bot Team — Bot-to-bot coordination
25%
↓
L3: Human Expert — Complex/novel issues
5%
This is our bot-first escalation model. L1 is Ops Bot - handles 70% of issues autonomously through triage and runbook execution. L2 is the Bot Team - SRE Bot, Security Bot coordinating on harder problems, another 25%. L3 is Human Expert - only for complex or novel issues we haven't seen before, just 5% of all incidents. Humans are the exception, not the rule.
The 50% Rule: Toil Reduction
Ops Work (Max 50%)
Engineering (Min 50%)
What is Toil?
Manual, repetitive work
No enduring value
Scales linearly with growth
Automatable
Automation Priorities
Runbook automation
Incident triage
Deployment pipelines
Capacity scaling
Google's 50% rule: SRE teams must spend at least 50% of time on engineering, not ops. If ops exceeds 50%, work gets handed back to dev teams. This forces automation. Toil is manual repetitive work that doesn't add enduring value and scales with growth. Our automation priorities by ROI: runbooks first, then incident triage, then deployments, then scaling. Each automation frees up engineering time.
Testing for Reliability
Unit Tests
Fast, isolated
80%+ coverage
Integration
Component APIs
Critical paths
Chaos
Failure injection
Prod-like
E2E
Full workflow
Key journeys
Jane Street: "Deterministic simulation testing finds bugs random testing cannot"
Testing is fundamental to reliability. Unit tests are fast and isolated - we target 80%+ coverage. Integration tests verify component interactions. Chaos tests inject failures to prove resilience. End-to-end tests validate complete workflows. Jane Street uses Antithesis for deterministic simulation testing - exploring the entire state space rather than random inputs. This finds edge cases that traditional testing misses.
Chaos Engineering & GameDays
"Avoid failure by failing constantly" — Netflix
1
Hypothesis
Define expected behavior
2
Inject
Kill process, add latency
3
Observe
Monitor SLOs, alerts
4
Learn
Fix gaps, document
Chaos Monkey
Toxiproxy
Gremlin
LitmusChaos
Chaos engineering is about building confidence in system resilience. GameDays are scheduled exercises where we intentionally inject failures. The process: form a hypothesis about expected behavior, inject a failure (kill a process, add latency, drop packets), observe the impact on SLOs and alerting, then learn and fix gaps. Shopify uses Toxiproxy for network failure simulation. Netflix's Simian Army includes Chaos Monkey for instance termination. We run GameDays monthly to validate our resilience.
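For lightweight GameDays we can inject failures in-process rather than at the network layer (a simpler alternative to Toxiproxy-style proxies). The decorator below is a sketch; the wrapped function and its parameters are hypothetical.

import functools
import random
import time

def chaos(latency_s: float = 2.0, failure_rate: float = 0.1, enabled: bool = False):
    # GameDay-only decorator: injects latency and random failures into a call path
    # so we can check that timeouts, retries, and alerts behave as hypothesized.
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if enabled:
                if random.random() < failure_rate:
                    raise TimeoutError(f"chaos: injected failure in {fn.__name__}")
                time.sleep(latency_s)
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaos(latency_s=1.5, failure_rate=0.2, enabled=True)  # enable only during a GameDay
def call_jira_mcp(ticket: str) -> dict:
    return {"ticket": ticket, "status": "open"}  # stand-in for the real MCP call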
On-Call Sustainability
70%
SREs: on-call → burnout
2,000+
Weekly alerts (3% actionable)
Google's Sustainable Limits
12h max shift
2 pages/shift
25% time on-call
5-8 per rotation
With bot-first response, humans should rarely be paged
On-call burnout is a real problem. 70% of SREs report on-call stress contributes to burnout. Teams often see 2,000+ alerts per week but only 3% require immediate action - that's noise, not signal. Google's research shows sustainable limits: 12-hour max shifts, max 2 significant pages per shift, no more than 25% of time on-call, and minimum 5-8 people per rotation. With our bot-first model, the goal is that humans are rarely paged at all - Ops Bot handles the routine, humans handle the novel.
Blameless Post-Mortem Process
1
Timeline
→
2
Root Cause
→
3
Factors
→
4
Actions
→
5
Share
>20% budget
SEV1/SEV2
Novel failures
Near-misses
Every significant incident gets a blameless post-mortem. Five steps: reconstruct the timeline with facts not blame, analyze root cause using 5 Whys or Fishbone diagrams, identify contributing factors and system gaps, create action items with owners and deadlines, and share learnings organization-wide. Triggers: any incident consuming more than 20% of error budget, all SEV1/SEV2 incidents, novel failure modes, and near-misses with learning potential. Ops Bot auto-generates the initial draft from incident timeline.
The Three Ways of DevOps
From "The Phoenix Project" and "The DevOps Handbook"
First Way: Flow
Fast flow from Dev to Ops to Customer
Small batch sizes
Reduce WIP
Eliminate constraints
Second Way: Feedback
Fast, constant feedback loops
Telemetry everywhere
Push quality upstream
Enable fast recovery
Third Way: Learning
Continuous experimentation & learning
Take risks, embrace failure
Build mastery through practice
Institutionalize improvement
The Three Ways from Gene Kim's DevOps Handbook are foundational. The First Way is about flow - getting work to flow quickly from development through operations to the customer. Small batches, reduce work in progress, eliminate constraints. The Second Way is about feedback - fast, constant feedback at all stages. Telemetry everywhere, push quality upstream, enable fast recovery. The Third Way is about continuous learning and experimentation. Take calculated risks, embrace failure as a learning opportunity, build mastery through deliberate practice, and institutionalize improvement. These three principles underpin all of DevOps and SRE.
Observability: Three Pillars + Context
Metrics
InfluxDB + Grafana
Logs
Structured JSON, tiered retention
Traces
Jaeger + OpenTelemetry
Context
MCP + Correlation IDs
"If you can't monitor a service, you don't know what's happening, and if you're blind to what's happening, you can't be reliable."
— Google SRE Book
Observability has three classic pillars plus our addition. Metrics for time-series data - we use InfluxDB and Grafana. Logs for event streams. Traces for distributed request paths - Jaeger with OpenTelemetry instrumentation. And our fourth pillar: Context - MCP integration and correlation IDs to tie everything together across bot sessions. Without observability, you're flying blind.
Part 8 of 9
People & Culture
Westrum Culture, Team Topologies, On-Call Excellence, The Three Ways
people-culture
oncall-excellence
three-ways-devops
team-topologies
Part 8: People & Culture. Technology alone isn't enough - culture determines success. Four one-pagers: People & Culture (Westrum typology, psychological safety), On-Call Excellence (sustainable rotations, burnout prevention), Three Ways of DevOps (flow, feedback, continuous learning), and Team Topologies (stream-aligned, platform, enabling, and complicated-subsystem teams). Remember: elite performers have generative cultures. Tools without culture fail.
Bot Army SRE Team Structure
Incident Response
Ops Bot
Alert triage, runbooks
Reliability Eng
SRE Bot
SLOs, capacity, chaos
Observability
Obs Bot
Dashboards, alerting
Security Ops
Sec Bot
Compliance, audits
Here's our proposed team structure. Four specialized SRE functions, each with a dedicated bot. Incident Response led by Ops Bot - first responder for all alerts. Reliability Engineering by SRE Bot - manages SLOs, capacity planning, chaos engineering. Observability Bot maintains dashboards and alerting. Security Bot handles compliance and audits. Human CEO provides strategy and handles novel situations.
Part 5 of 9
Release, Testing & Capacity
DORA Metrics, Progressive Delivery, NALSD, Testing Automation
capacity-release
nalsd-framework
designing-for-recovery
slo-design-framework
Part 5: Release, Testing & Capacity. How do we ship safely and scale reliably? Four one-pagers: Capacity & Release (DORA metrics, progressive delivery), NALSD Framework (Google's non-abstract large system design), Designing for Recovery (breakglass access, graceful degradation), and SLO Design Framework (effective objectives). We'll also cover testing automation - unit tests, integration tests, chaos tests, and Jane Street's deterministic simulation testing approach.
PagerDuty: The On-Call Backbone
Grafana Alerts
→
PagerDuty
→
Slack #bot-alerts
Ops Bot (L1)
JIRA Incident
PagerDuty AI Agents (2025)
SRE Agent Auto-classify, remediate
Shift Agent Schedule conflicts
Scribe Agent Capture insights
Insights Agent Data analysis
PagerDuty is our on-call backbone. Grafana alerts flow to PagerDuty, which routes to Slack for visibility, Ops Bot for automated response, and JIRA for tracking. PagerDuty's 2025 AI agents add intelligence: SRE Agent auto-classifies incidents and surfaces context, Shift Agent resolves scheduling conflicts, Scribe Agent captures insights from incident calls, Insights Agent provides continuous operational analysis. 700+ integrations available.
Overload Protection: Cascading Failure Prevention
Circuit Breakers
Stop calling failing services
Closed
→
Open
→
Half-Open
Load Shedding
Reject requests to protect system
Priority-based queuing
Graceful degradation
Uber's Cinnamon (PID controller)
Backpressure
Slow down upstream producers
Rate limiting
Queue depth limits
Timeout cascades
Cascading failures kill systems. Three patterns to prevent them: Circuit breakers stop calling failing services - closed state allows traffic, open state blocks it, half-open tests recovery. Load shedding rejects requests to protect the core system - Uber's Cinnamon library uses a PID controller for adaptive shedding. Backpressure slows upstream producers through rate limiting and queue depth limits. Apply these at every service boundary.
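Here is a minimal circuit-breaker sketch following the closed/open/half-open states on the slide; the thresholds and timeouts are illustrative, and a production implementation would add per-endpoint state and metrics.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "half-open"  # allow a single probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"  # stop calling the failing service
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"
            return result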
DORA Metrics: Measuring Excellence
Deploy Freq
Low Monthly
Med Weekly
High Daily
Elite On-demand
Lead Time
Low Months
Med Weeks
High Days
Elite <1 hour
Failure Rate
Low >30%
Med 15-30%
High 5-15%
Elite <5%
MTTR
Low Weeks
Med Days
High <1 day
Elite <1 hour
Elite performers ship faster AND more reliably
DORA research proves that elite teams ship faster AND more reliably. Four key metrics: Deployment Frequency - elite is on-demand, multiple times a day. Lead Time for Changes - elite is under an hour from commit to production. Change Failure Rate - elite is under 5% of deployments causing incidents. Mean Time to Recovery - elite is under an hour. The key insight: these metrics are correlated. Speed and stability aren't tradeoffs - they reinforce each other.
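As a sketch, all four DORA metrics fall out of a simple deployment log; the record shape below is an assumption for illustration.

from datetime import timedelta
from statistics import median

def dora_metrics(deploys, period_days=30):
    # Each record: {"lead_time": timedelta, "failed": bool, "recovery": timedelta or None}
    if not deploys:
        return {}
    failed = [d for d in deploys if d["failed"]]
    return {
        "deploy_frequency_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(d["lead_time"].total_seconds() for d in deploys) / 3600,
        "change_failure_rate": len(failed) / len(deploys),
        "median_mttr_hours": median(d["recovery"].total_seconds() for d in failed) / 3600 if failed else 0.0,
    }

deploys = [
    {"lead_time": timedelta(minutes=45), "failed": False, "recovery": None},
    {"lead_time": timedelta(hours=2), "failed": True, "recovery": timedelta(minutes=40)},
    {"lead_time": timedelta(minutes=30), "failed": False, "recovery": None},
]
print(dora_metrics(deploys))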
Industry Scale: From Startup to Hyperscale
Startup
10-100 RPS
Monolith • Manual ops
99.5% SLO
Growth
1K-100K RPS
Microservices • On-call
99.9% SLO
Enterprise
100K-1M RPS
Distributed • Chaos eng
99.95% SLO
Hyperscale
1M+ RPS
Global • Cell-based
99.99%+ SLO
SRE practices scale differently at each stage. Startups at 10-100 RPS can run a monolith with manual ops - 99.5% SLO is fine. Growth stage at 1K-100K RPS needs microservices, on-call, and observability - 99.9% expected. Enterprise at 100K-1M RPS requires distributed systems, chaos engineering, and 99.95% SLOs. Hyperscale at 1M+ RPS demands global distribution, cell-based architectures like Roblox, and 99.99%+ SLOs. Match your practices to your scale.
Universal Reliability Principles
Applicable to any mission-critical system
1
Layered Defense
Multiple failure barriers
2
Graceful Degradation
Core function survives
3
Rapid Recovery
Fast detect-to-resolve
4
Continuous Verify
Prove it works
5
Auto + Guardrails
Empower within bounds
These five principles apply universally - from SDLC to trading floors to spacecraft. Layered Defense: multiple independent barriers, like nuclear reactors. Graceful Degradation: keep core functions running even when pieces fail. Rapid Recovery: minimize detection-to-resolution time through automation. Continuous Verification: prove systems work, don't assume - chaos engineering, synthetic testing. Autonomous with Guardrails: empower automated systems within defined boundaries.
Part 6 of 9
Cloud & Infrastructure
Kubernetes, Platform Engineering, Cloud-Native SRE, Multi-Cloud
infrastructure-reliability
kubernetes-patterns
platform-engineering
Part 6: Cloud & Infrastructure. Where does reliability live? Three one-pagers: Infrastructure Reliability (K8s, TSDB, observability backends), Kubernetes Patterns (operational patterns for container orchestration), and Platform Engineering (golden paths, self-service). We'll also cover cloud-native SRE - patterns for AWS, Azure, GCP, multi-cloud strategies, and the shared responsibility model. For enterprise leaders considering cloud migration, this section provides the reliability playbook.
Part 7 of 9
AI/ML & Agentic Operations
MLOps, Non-Determinism, Bot Operations, Multi-Agent Systems
ai-ml-operations
agentic-operations
Part 7: AI/ML & Agentic Operations. The frontier of SRE. Two one-pagers: AI/ML Operations (MLOps, handling non-determinism, model monitoring) and Agentic Operations (bot operations, AI agents, multi-agent coordination). This is where traditional SRE meets the new world of autonomous systems. How do you maintain SLOs when your systems are non-deterministic? How do you build reliable multi-agent systems? This is cutting-edge territory.
Agentic Operational Workflows
1
Detect
Anomaly alert
2
Correlate
Query signals
3
Analyze
Root cause
4
Remediate
Execute runbook
5
Learn
Update models
The goal: closed-loop autonomous operations
This is what agentic operations looks like. Step 1: automated anomaly detection triggers an alert. Step 2: bot queries across metrics, logs, and traces to correlate signals. Step 3: AI analyzes patterns and identifies probable root cause. Step 4: execute the appropriate runbook. Step 5: update models and refine detection. The goal is closed-loop autonomous operations - human oversight but not human intervention.
Multi-Agent Orchestration
Manager Bot
↓ ↓ ↓ ↓
Ops Bot
SRE Bot
Obs Bot
Sec Bot
🎭 Puppeteer
🐝 Swarm
🏛️ Hierarchical
Three multi-agent orchestration patterns: Puppeteer, where a single coordinator drives the specialists; Swarm, peer-to-peer coordination; and Hierarchical, a manager tree. Our approach: a Manager Bot orchestrates the specialist bots - Ops, SRE, Obs, and Sec.
Data Strategy for Autonomous Agents
Real-Time
Last 5 min metrics
Active alerts
Deployments
Historical
90-day incidents
Resolution patterns
SLO trends
Knowledge
Runbooks
Architecture
Post-mortems
The Learning Loop
Incidents → Analysis → Patterns → Runbooks → Automation
Autonomous agents need the right data at the right time. Real-time context: last 5 minutes of metrics, active alerts, in-flight deployments. Historical patterns: past 90 days of incidents, resolution approaches that worked, SLO trends. Knowledge base: runbooks, architecture documentation, post-mortem learnings. The learning loop: every incident generates data that trains better automation. Each resolution becomes a potential runbook. Patterns surface automatically for human review.
Self-Healing Systems
Detect
Anomaly + SLO burn
→
Decide
Match pattern to runbook
→
Act
Scale, restart, rollback
→
Verify
SLOs restored
Memory: Auto-restart
Latency: Scale up
Deploy fail: Rollback
Self-healing is the goal of mature SRE. Detect via anomalies and SLO burn. Decide by matching patterns to runbooks. Act by scaling, restarting, or rolling back. Verify SLOs are restored.
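A closed-loop sketch of that detect-decide-act-verify cycle, with the pattern-to-runbook table taken from the slide; detect() and verify() are assumed hooks into our monitoring, not real functions.

import time

RUNBOOKS = {
    "memory_pressure": lambda: print("restarting worker"),      # Memory: auto-restart
    "latency_slo_burn": lambda: print("scaling up replicas"),   # Latency: scale up
    "deploy_failure": lambda: print("rolling back release"),    # Deploy fail: rollback
}

def detect():
    ...  # assumed: anomaly detection + SLO burn-rate check against InfluxDB

def verify():
    ...  # assumed: re-check the SLO after acting

def self_heal_loop(poll_s: int = 60):
    while True:                          # runs continuously as the Ops Bot's L1 loop
        pattern = detect()
        if pattern in RUNBOOKS:          # Decide: match the pattern to a known runbook
            RUNBOOKS[pattern]()          # Act: scale, restart, or roll back
            if not verify():             # Verify: escalate if SLOs are not restored
                print("escalating to L2")
        time.sleep(poll_s)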
Platform Engineering: Golden Paths
"A golden path is a paved road to a well-architected production deployment" — Spotify
New Service
Template → CI/CD → Observability → Alerts → Docs
10 minutes to production-ready
Bot Onboarding
Identity → Worktree → MCP → Permissions → SLOs
Self-service, automated
Incident Response
Alert → Runbook → Resolution → Post-mortem
Guided workflow, minimal toil
Make the right thing the easy thing
Platform engineering creates golden paths - paved roads to production. A new service should go from template to production-ready in 10 minutes with CI/CD, observability, alerts, and documentation included. Bot onboarding should be self-service: identity creation, worktree setup, MCP configuration, permissions, and SLOs all automated. Incident response has a guided workflow from alert to resolution. The principle: make the right thing the easy thing. Developers follow the path of least resistance - make that path reliable.
Athena → Cloud: Environment Portability
Athena (On-Prem)
InfluxDB + Grafana
Jaeger local
Low latency
Full control
Public Cloud (AWS/GCP)
Managed services
Auto-scaling
Global edge
Shared responsibility
OpenTelemetry Abstraction
GitOps + IaC Deployment
Grafana Cloud Observability
Env-agnostic Config
Athena is our on-prem environment with self-hosted InfluxDB, Grafana, and Jaeger. Public cloud offers managed services with auto-scaling and global distribution. Our strategy: OpenTelemetry for vendor-neutral instrumentation, GitOps for deployment automation, Grafana Cloud for unified observability across environments, and environment-agnostic configuration. Same applications run optimally on both through abstraction layers.
Deployment Automation: Bleeding Edge
GitOps Pipeline
Declarative IaC (Terraform)
ArgoCD / Flux sync
PR-based deployments
Progressive Delivery
Canary releases (1-5%)
SLO-gated rollouts
Auto-rollback on error
Feature Flags
Decouple deploy/release
A/B testing built-in
Instant kill switches
Observability CI
Pre-deploy SLO checks
Synthetic monitoring
Chaos validation
Target: Zero-touch deployments with bot-driven validation and rollback
Our deployment automation is bleeding edge. GitOps pipeline with Terraform for infrastructure-as-code, ArgoCD for continuous sync, and PR-based deployments for auditability. Progressive delivery with canary releases, SLO-gated rollouts that auto-rollback if SLOs degrade. Feature flags to decouple deployment from release - deploy any time, release when ready, kill instantly if needed. Observability integrated into CI - pre-deploy SLO checks, synthetic monitoring, and chaos validation before production.
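The SLO gate itself can be a small comparison of canary versus baseline metrics, as in this sketch; the thresholds and metric names are illustrative, and the real gate would read from Grafana/InfluxDB rather than literals.

def canary_healthy(canary: dict, baseline: dict,
                   max_error_delta: float = 0.001, max_p95_ratio: float = 1.2) -> bool:
    # Promote the rollout only if the canary's error rate and P95 latency stay close
    # to the baseline; otherwise trigger an automatic rollback.
    error_ok = canary["error_rate"] <= baseline["error_rate"] + max_error_delta
    latency_ok = canary["p95_ms"] <= baseline["p95_ms"] * max_p95_ratio
    return error_ok and latency_ok

baseline = {"error_rate": 0.002, "p95_ms": 850}
canary = {"error_rate": 0.004, "p95_ms": 900}

print("promote to 25%" if canary_healthy(canary, baseline) else "auto-rollback")  # error rate doubled -> auto-rollback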
Part 9 of 9
Industry & Roadmap
Google, Netflix, NASA, Automation Paradoxes, SRE Evolution, Getting Started
industry-leaders
implementation-roadmap
automation-paradoxes
sre-evolution-timeline
Part 9: Industry & Roadmap. Learning from the best and planning our path forward. Four one-pagers: Industry Leaders (deep dives into Google, Netflix, NASA, SpaceX), Implementation Roadmap (getting started, phased approach), Automation Paradoxes (when automation hurts - the irony of automation, Bainbridge's work), and SRE Evolution Timeline (history and future of the discipline). We'll close with concrete next steps and key takeaways.
Implementation Roadmap
1
Foundation
Alerting, playbooks
2
Reliability
SLOs, GameDays
3
Automation
Self-healing
4
Intelligence
ML, prediction
5
Excellence
Cloud, 99.95%
Five phases over the next 12 months. Phase 1 Foundation: get alerting working, write playbooks, establish on-call. Phase 2 Reliability: SLOs and error budgets, first GameDays. Phase 3 Automation: self-healing runbooks, get toil under 50%. Phase 4 Intelligence: ML-based anomaly detection, predictive alerting. Phase 5 Excellence: cloud migration, multi-region resilience, hit 99.95% availability.
Automation Paradoxes
Bainbridge's "Ironies of Automation" (1983)
Skill Decay
Operators lose skills. Can't step in when automation fails.
Complacency
Reduced vigilance. Failures become catastrophic.
Clumsy Auto
Workload increases during high-stress moments.
Mitigation
Regular drills, transparent automation, graceful degradation.
"The more advanced the automation, the more crucial the human contribution"
Lisanne Bainbridge's 1983 paper on "Ironies of Automation" is essential reading. Key paradoxes: Skill degradation - operators lose the skills they don't practice, so when automation fails, humans can't effectively take over. Complacency - trust in automation reduces vigilance, so when failures occur they're more severe. Clumsy automation - automation often increases workload during high-stress situations, exactly when operators need help most. The mitigation: regular GameDays to maintain skills, transparent automation that keeps humans in the loop, preserve manual override capabilities, and design for graceful degradation. This is especially relevant for agentic systems - the more we automate, the more critical the human role becomes.
SRE Evolution Timeline
2003
Google creates SRE role
2010
Netflix Chaos Monkey
2016
Google SRE Book
2018
SRE Workbook, OpenTelemetry
2023+
AI/ML Ops, Agentic SRE
Past
Manual ops → Automation
Present
Platform Engineering
Future
Autonomous Reliability
SRE has evolved dramatically. 2003: Google creates the SRE role under Ben Treynor. 2010: Netflix releases Chaos Monkey, pioneering chaos engineering. 2016: Google publishes the SRE Book, codifying practices for the industry. 2018: SRE Workbook provides practical implementation guidance, OpenTelemetry begins unifying observability. 2023 and beyond: AI/ML operations become mainstream, agentic SRE emerges with autonomous systems that can self-monitor and self-heal. The trajectory is clear: from manual operations to automation to platform engineering to autonomous reliability. We're building the future of operations.
Key Takeaways
1
Speed & Stability Reinforce
DORA proves elite orgs do both
2
Error Budgets Balance
Quantified risk tolerance for innovation
3
Build for Failure
Resilience is designed, not accidental
4
Automate Toil
<50% cap frees humans for engineering
5
Incidents Are Investments
Every failure makes systems stronger
6
Observability > Monitoring
Understand systems, not just alert
Six key takeaways. One: speed and stability reinforce each other - DORA research proves elite performers do both, it's not a tradeoff. Two: error budgets balance innovation and reliability - quantified risk tolerance enables faster shipping. Three: build for failure - resilience is designed through chaos engineering and recovery patterns. Four: automate toil - the 50% cap frees engineers for creative problem-solving. Five: incidents are investments - blameless postmortems turn every failure into system improvement. Six: observability over monitoring - understand your systems deeply, don't just alert on symptoms.
Reliability Unleashed
Questions?
34 One-Pagers Available: Comprehensive reference material for each topic
SRE Foundations | Observability | Resilience | Incidents | Release & Capacity
Cloud & Infrastructure | AI/ML & Agentic | People & Culture | Industry Leaders
Essential Reading
Google SRE Book Netflix Tech Blog Dekker's Just Culture
Next Steps
Review one-pagers Assess maturity level Build your roadmap
Thank you for joining this comprehensive journey through Site Reliability Engineering. We covered 9 parts and 34 one-pagers - from foundations to the frontier of agentic operations. For further reading: the Google SRE Book and Workbook are essential, Netflix Tech Blog has excellent chaos engineering content, and Sidney Dekker's Just Culture is foundational for blameless postmortems. Next steps: review the one-pagers relevant to your current challenges, assess your maturity level using the SRE Maturity Assessment, and build your phased implementation roadmap. The one-pagers are available in both dark and light PDF formats. Happy to take questions!