Reliability Unleashed

The Engineering Playbook

From Chaos to Confidence
9 Parts | 34 One-Pagers | ~8.5 Hours | Technical Track

The Journey Ahead

Part 1
Foundation & Vision
Part 2
Observability Mastery
Part 3
Resilience Patterns
Part 4
Incident Excellence
Part 5
Release, Testing & Capacity
Part 6
Cloud & Infrastructure
Part 7
AI/ML & Agentic Ops
Part 8
People & Culture
Part 9
Industry & Roadmap
Part 1 of 9

Foundation & Vision

SRE Fundamentals, SLIs/SLOs, DORA Metrics, Maturity Assessment

reliability-unleashed sre-foundations dora-24-capabilities sre-maturity-assessment

Why SRE? Why Now?

4+
AI Agents
in production
24/7
Bot operations
never sleep
100s
Daily commits
across worktrees
Bots don't get tired. But they can fail.
And when they do, who responds at 3 AM?

DevOps vs SRE

class SRE implements DevOps { }

DevOps

  • Philosophy, culture, movement
  • "Break down silos"
  • Continuous delivery mindset
  • Automation everywhere

SRE

  • Specific implementation
  • Error budgets, SLOs
  • Toil reduction targets
  • On-call engineering
"DevOps is the philosophy; SRE is the implementation." — Google

Three Pillars of Operations

Reactive

  • Alert triage & response
  • Runbook execution
  • Incident management
  • Escalation protocols

Proactive

  • SLO monitoring
  • Capacity planning
  • Change management
  • Toil reduction

Predictive

  • Anomaly detection
  • Chaos engineering
  • AIOps & ML
  • Self-healing systems

The Vision: Autonomous Reliability

"Bots that monitor, diagnose, remediate, and learn — with humans for strategy and novel challenges."
70% Auto-resolved at L1
<15min MTTR target
99.95% Availability goal
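For concreteness, a back-of-the-envelope sketch of what these targets mean in downtime terms, assuming a 30-day rolling window (the function name is illustrative; the numbers follow directly from the availability goal):

def downtime_budget_minutes(availability: float, window_days: int = 30) -> float:
    # Minutes of allowed downtime over the window for a given availability target.
    return window_days * 24 * 60 * (1 - availability)

print(downtime_budget_minutes(0.9995))   # 99.95% goal -> ~21.6 minutes per 30 days
print(downtime_budget_minutes(0.995))    # 99.5% SLO   -> ~216 minutes (3.6 hours)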
Part 2 of 9

Observability Mastery

Three Pillars, OpenTelemetry, Alerting Strategy, High-Cardinality Events

observability-mastery multi-window-alerting use-method-performance observability-2.0 alert-tuning-playbook

Learning from Industry Leaders

Google SRE

Error budgets, 50% cap

Netflix

Chaos Monkey

AWS

Well-Architected

Meta

SEV culture

Spotify

Golden paths

Toyota

Kaizen

High-Reliability Organizations

Lessons from Aviation, Nuclear, Healthcare, Military

1 Preoccupation with Failure — Never ignore small failures
2 Reluctance to Simplify — Embrace complexity
3 Sensitivity to Operations — Real-time awareness
4 Commitment to Resilience — Detect, contain, recover
5 Deference to Expertise — Empower frontline decisions

Aviation: Crew Resource Management

Origin: 1978 United Flight 173 — crew ran out of fuel while troubleshooting
70-80% of accidents from human error, not mechanical failure
"Up until 1980, we worked on the concept that the captain was THE authority. What he said, goes. And we lost a few airplanes because of that." — Captain Al Haynes, United 232
Bot Application: Actively seek input from other bots; hierarchical authority yields to expertise

Netflix: Chaos Engineering

Philosophy: "Avoid failure by failing constantly"
Chaos Monkey
Latency Monkey
Chaos Gorilla
0 Impact — when AWS rebooted ~10% of its servers (Sept 2014), Netflix kept running
Bot Application: Regular game days, failure injection testing, resilience as cultural value

Scaling Reliability: Industry Examples

Stripe

99.999% uptime
Defensive design

Uber

Millions RPS
Jaeger tracing

Shopify

57.3 PB BFCM
9-mo prep cycle

Discord

30M msg/sec
Elixir + ScyllaDB

Roblox

145K machines
Cell architecture

Cloudflare

320+ cities
Follow-the-sun

Latency Tiers: Right-Sizing Reliability

<1ms

Ultra-Low

HFT, Gaming physics

FPGA, kernel bypass
1-100ms

Low

Real-time apps, APIs

In-memory, edge
100ms-1s

Standard

Web apps, microservices

CDN, caching
1-30s

Tolerant

Batch, analytics

Eventual consistency
>30s

Flexible

Background, ML

Offline processing

Lessons from Mission-Critical Industries

Space

  • Triplex redundancy
  • 7K+ engine tests
  • Formal verification

Military

  • Disciplined initiative
  • Decentralized exec
  • Pre-deployment sim

Nuclear

  • Defense in depth (5 levels)
  • Diverse redundancy
  • Safety isolation

Deep Sea

  • 3 battery buses
  • 180+ monitored
  • Galvanic failsafe

Just Culture: Blameless Post-Mortems

"Blame closes off avenues for understanding how and why something happened." — Sidney Dekker

Old View

People cause failure → Punish

New View

Error is symptom → Fix system

Ask "what" and "how", never "why"
Part 3 of 9

Resilience Patterns

Circuit Breakers, Defense in Depth, HRO Principles, Chaos Engineering

resilience-patterns defense-in-depth hro-pattern-recognition release-it-patterns chaos-engineering

Our Observability Stack

COLLECT
Telegraf OpenTelemetry Bot Reporters
STORE
InfluxDB 3.0 Time-series DB BQL Queries
VISUALIZE
Grafana Dashboards Alerting Rules
ACT
Slack Alerts PagerDuty Ops Bot

InfluxDB 3.0 & BQL Queries

Why InfluxDB?

  • Native time-series storage
  • High-cardinality support
  • Columnar compression
  • Sub-second query latency
  • Downsampling & retention

BQL Query Examples

-- Session success rate
SELECT mean(success_rate)
FROM bot_sessions
WHERE time > now() - 1h
GROUP BY bot_name

-- Error budget burn
SELECT sum(errors) / sum(total)
FROM api_calls
WHERE time > now() - 30d

Grafana Dashboard Strategy

Athena System

CPU, memory, disk, network

→ SRE Team

Bot Army

Sessions, productivity, commits

→ All Engineers

Bot Operations

SLOs, MCP health, error budgets

→ Ops Bot

Human Experience

Focus metrics, escalations

→ Human CEO

Each dashboard serves a specific audience with relevant context

Distributed Tracing with Jaeger

Bot Session 2.3s total
MCP Call (Jira) 450ms
Git Operations 320ms
File I/O 180ms
API Call (Claude) 1.2s ⚠️
Latency breakdown — Where is time spent?
Error propagation — What caused the failure?
Dependency mapping — What calls what?

Correlation IDs & Agent Context

Session ID sess_abc123
Bot Identity claude-feat
Task ID HOME-456
Trace ID tr_xyz789

Cross-Signal Correlation

  • Link metrics → logs → traces
  • Find all activity for one session
  • Reconstruct incident timeline

Audit Trail

  • Which bot made this change?
  • What JIRA ticket triggered it?
  • Full provenance chain
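A minimal sketch of attaching this agent context to spans with the OpenTelemetry Python API. The attribute names (bot.session_id, bot.name, task.id) are illustrative conventions rather than a standard, and exporter/SDK setup is omitted:

from opentelemetry import trace

tracer = trace.get_tracer("bot-session")

def run_task(session_id: str, bot_name: str, task_id: str) -> None:
    with tracer.start_as_current_span("bot_session") as span:
        span.set_attribute("bot.session_id", session_id)    # e.g. sess_abc123
        span.set_attribute("bot.name", bot_name)             # e.g. claude-feat
        span.set_attribute("task.id", task_id)                # e.g. HOME-456
        # Child spans for MCP calls, git operations, and API calls inherit the
        # trace ID, so metrics, logs, and traces can be joined per session.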

Centralized Logging Strategy

ERROR
Failures needing action
90 days
WARN
Degraded but recovering
30 days
INFO
Normal operations
14 days
DEBUG
Troubleshooting
7 days
Structured JSON | Correlation IDs | Searchable fields | No secrets
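A minimal sketch of structured JSON log lines with correlation fields, using only the Python standard library; the field names mirror the slide and are illustrative:

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per line; never include secrets in these fields.
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "session_id": getattr(record, "session_id", None),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("bot")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("MCP call completed", extra={"session_id": "sess_abc123", "trace_id": "tr_xyz789"})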

Alerting Philosophy: Signal vs. Noise

"Every alert should be actionable. If you can't act on it, it's noise."
P1 - PAGE

Service down, data loss risk

Immediate response
P2 - NOTIFY

Degraded, SLO at risk

Within 1 hour
P3 - TRACK

Anomaly detected

Business hours
P4 - LOG

Informational

Review weekly

USE Method: Performance Analysis

Brendan Gregg's systematic approach to resource bottlenecks

Utilization

Average time resource was busy

CPU: 85%, Memory: 72%

Saturation

Extra work queued or denied

Queue depth, wait time

Errors

Count of error events

ECC errors, retries, drops
Apply to every resource: CPU, Memory, Disk I/O, Network, GPUs, API quotas

RED Method: Service Monitoring

Tom Wilkie's approach for request-driven services

Rate

Requests per second

http_requests_total

Errors

Failed requests per second

5xx responses, exceptions

Duration

Time per request (latency)

P50, P95, P99 histograms
USE for Resources | RED for Services | Both for Complete Coverage
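A minimal sketch of RED computed over a window of request records; the record shape (status, duration_ms) is assumed, and in practice these numbers come from histogram metrics rather than raw requests:

import math

def percentile(sorted_vals: list[float], p: float) -> float:
    # Nearest-rank percentile; assumes sorted_vals is non-empty and already sorted.
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]

def red_summary(requests: list[dict], window_seconds: float) -> dict:
    durations = sorted(r["duration_ms"] for r in requests)
    error_count = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,   # Rate: requests per second
        "error_rps": error_count / window_seconds,    # Errors: failed requests per second
        "p50_ms": percentile(durations, 50),          # Duration: latency percentiles
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }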

Multi-Window Burn Rate Alerting

Burn Rate = How fast you're consuming error budget
burn_rate = (observed error rate in window) / (1 - SLO target)
5 min
Fast Burn
Immediate outage
>10x → P1
1 hour
Medium Burn
Sustained issues
>5x → P2
6 hours
Slow Burn
Degradation trend
>2x → P3

From Liz Fong-Jones & Google SRE Workbook
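A minimal sketch of the burn-rate tiers above. Thresholds and windows are taken from this slide; note that the SRE Workbook variant additionally pairs each long window with a short one so alerts clear quickly after recovery:

SLO = 0.999                        # example availability target; use the SLO you alert on
BUDGET = 1 - SLO                   # allowed error rate

def burn_rate(errors: int, total: int) -> float:
    # How fast the budget burns relative to the rate that would exactly exhaust it.
    return (errors / total) / BUDGET if total else 0.0

def page_priority(windows: dict[str, tuple[int, int]]) -> str | None:
    # windows maps window name -> (errors, total requests) observed in that window.
    if burn_rate(*windows["5m"]) > 10:
        return "P1"    # fast burn: immediate outage
    if burn_rate(*windows["1h"]) > 5:
        return "P2"    # medium burn: sustained issues
    if burn_rate(*windows["6h"]) > 2:
        return "P3"    # slow burn: degradation trend
    return None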

Part 4 of 9

Incident Excellence

Response & Postmortems, Learning from Catastrophe, Runbook Design

incident-excellence learning-from-catastrophe runbook-quick-reference

SLOs and Error Budgets

SLI | Target | Error Budget
Availability | 99.5% | 3.6 hrs/month
Success Rate | 98.0% | 2% failures
Latency P95 | <3s | 2% slow
MTTR | <15 min | Agentic response
Auto-Resolution | 70% | L1 handled by Ops Bot
Error budget remaining:
>50% Ship freely
25-50% Prioritize reliability
<25% Feature freeze
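A minimal worked example of the policy above, assuming a 30-day window and the 99.5% availability SLO from the table:

def budget_remaining(slo: float, errors: int, total: int) -> float:
    allowed = (1 - slo) * total            # failures the budget allows this period
    return 1 - errors / allowed if allowed else 0.0

def release_policy(remaining: float) -> str:
    if remaining > 0.50:
        return "Ship freely"
    if remaining >= 0.25:
        return "Prioritize reliability"
    return "Feature freeze"

# Example: 99.5% SLO, 1,000,000 requests, 2,000 failures -> 60% of budget left
print(release_policy(budget_remaining(0.995, 2_000, 1_000_000)))   # Ship freely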

Operational Metrics: Full Coverage

System Health

  • MCP Availability: 99.9%
  • Resource Util: <80%
  • API Headroom: >20%

Bot Productivity

  • Session Success: >95%
  • Commits/Session: >3
  • Stall Rate: <5%

Operational Toil

  • Manual: <5/wk
  • Automation: >80%
  • Alert Noise: <20%

Incident Quality

  • MTTD: <2 min
  • MTTA: <5 min
  • Recurrence: <10%

Incident Lifecycle (ITIL)

1. Identify
2. Categorize
3. Prioritize
4. Respond
5. Close
SEV1 Critical — <15 min response
SEV2 Major — <1 hour response
SEV3 Minor — <4 hours response
SEV4 Low — <24 hours response

Bot-First Escalation Model

L1: Ops Bot — Auto-triage, runbook execution
70%
L2: Bot Team — Bot-to-bot coordination
25%
L3: Human Expert — Complex/novel issues
5%

The 50% Rule: Toil Reduction

Ops Work (Max 50%)
Engineering (Min 50%)

What is Toil?

  • Manual, repetitive work
  • No enduring value
  • Scales linearly with growth
  • Automatable

Automation Priorities

  1. Runbook automation
  2. Incident triage
  3. Deployment pipelines
  4. Capacity scaling

Testing for Reliability

Unit Tests

Fast, isolated

80%+ coverage

Integration

Component APIs

Critical paths

Chaos

Failure injection

Prod-like

E2E

Full workflow

Key journeys
Jane Street: "Deterministic simulation testing finds bugs random testing cannot"

Chaos Engineering & GameDays

"Avoid failure by failing constantly" — Netflix
1

Hypothesis

Define expected behavior

2

Inject

Kill process, add latency

3

Observe

Monitor SLOs, alerts

4

Learn

Fix gaps, document

Chaos Monkey | Toxiproxy | Gremlin | LitmusChaos
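A minimal sketch of application-level failure injection for a game day; the rates and delays are illustrative, and tools like Toxiproxy or Gremlin do the same at the network layer:

import random, time

def chaos(fn, failure_rate: float = 0.05, max_extra_latency_s: float = 2.0):
    # Wrap any callable so a fraction of calls fail outright and the rest get extra latency.
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise RuntimeError("chaos: injected dependency failure")
        time.sleep(random.uniform(0, max_extra_latency_s))
        return fn(*args, **kwargs)
    return wrapped

# Hypothetical usage during a game day:
#   flaky_fetch = chaos(fetch_jira_ticket)   # fetch_jira_ticket is a placeholder
# then observe whether the SLO alerts fire as hypothesized in step 1.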

On-Call Sustainability

70%
of SREs report that on-call contributes to burnout
2,000+
weekly alerts at some organizations, of which only ~3% are actionable

Google's Sustainable Limits

12h max shift
2 pages/shift
25% time on-call
5-8 per rotation

With bot-first response, humans should rarely be paged

Blameless Post-Mortem Process

1

Timeline

2

Root Cause

3

Factors

4

Actions

5

Share

Post-mortem triggers: >20% error budget burned | SEV1/SEV2 | Novel failures | Near-misses

The Three Ways of DevOps

From "The Phoenix Project" and "The DevOps Handbook"

First Way: Flow

Fast flow from Dev to Ops to Customer

  • Small batch sizes
  • Reduce WIP
  • Eliminate constraints

Second Way: Feedback

Fast, constant feedback loops

  • Telemetry everywhere
  • Push quality upstream
  • Enable fast recovery

Third Way: Learning

Continuous experimentation & learning

  • Take risks, embrace failure
  • Build mastery through practice
  • Institutionalize improvement

Observability: Three Pillars + Context

Metrics

InfluxDB + Grafana

Logs

Structured events

Traces

Jaeger + OpenTelemetry

Context

MCP + Correlation IDs

"If you can't monitor a service, you don't know what's happening, and if you're blind to what's happening, you can't be reliable." — Google SRE Book
Part 8 of 9

People & Culture

Westrum Culture, Team Topologies, On-Call Excellence, The Three Ways

people-culture oncall-excellence three-ways-devops team-topologies

Bot Army SRE Team Structure

Incident Response

Ops Bot

Alert triage, runbooks

Reliability Eng

SRE Bot

SLOs, capacity, chaos

Observability

Obs Bot

Dashboards, alerting

Security Ops

Sec Bot

Compliance, audits

Part 5 of 9

Release, Testing & Capacity

DORA Metrics, Progressive Delivery, NALSD, Testing Automation

capacity-release nalsd-framework designing-for-recovery slo-design-framework

PagerDuty: The On-Call Backbone

Grafana Alerts → PagerDuty → Slack #bot-alerts | Ops Bot (L1) | JIRA Incident
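A minimal sketch of the Grafana-to-PagerDuty hop using the PagerDuty Events API v2; the routing key, dedup key, and severity values are placeholders:

import requests

def page(summary: str, source: str, dedup_key: str, routing_key: str) -> None:
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": routing_key,      # integration key from the PagerDuty service
            "event_action": "trigger",       # "resolve" with the same dedup_key clears it
            "dedup_key": dedup_key,          # lets the Ops Bot resolve the alert it handled
            "payload": {
                "summary": summary,          # e.g. "Burn rate >10x on bot_sessions"
                "source": source,            # e.g. "grafana"
                "severity": "critical",      # critical | error | warning | info
            },
        },
        timeout=10,
    )
    resp.raise_for_status()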

PagerDuty AI Agents (2025)

SRE Agent
Auto-classify, remediate
Shift Agent
Schedule conflicts
Scribe Agent
Capture insights
Insights Agent
Data analysis

Overload Protection: Cascading Failure Prevention

Circuit Breakers

Stop calling failing services

Closed → Open → Half-Open (sketched below)

Load Shedding

Reject requests to protect system

  • Priority-based queuing
  • Graceful degradation
  • Uber's Cinnamon (PID controller)

Backpressure

Slow down upstream producers

  • Rate limiting
  • Queue depth limits
  • Timeout cascades
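A minimal sketch of the circuit-breaker state machine (Closed, Open, Half-Open); the failure threshold and reset timeout are illustrative:

import time

class CircuitBreaker:
    # Closed: calls pass through and failures are counted.
    # Open: calls fail fast until the reset timeout elapses.
    # Half-Open: one trial call decides whether to close again or re-open.
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()     # trip (or re-trip) to Open
            raise
        self.failures = 0
        self.opened_at = None                          # success closes the circuit
        return result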

DORA Metrics: Measuring Excellence

Deploy Freq

Low: Monthly
Med: Weekly
High: Daily
Elite: On-demand

Lead Time

Low: Months
Med: Weeks
High: Days
Elite: <1 hour

Failure Rate

Low: >30%
Med: 15-30%
High: 5-15%
Elite: <5%

MTTR

Low: Weeks
Med: Days
High: <1 day
Elite: <1 hour
Elite performers ship faster AND more reliably

Industry Scale: From Startup to Hyperscale

Startup
10-100 RPS
Monolith • Manual ops
99.5% SLO
Growth
1K-100K RPS
Microservices • On-call
99.9% SLO
Enterprise
100K-1M RPS
Distributed • Chaos eng
99.95% SLO
Hyperscale
1M+ RPS
Global • Cell-based
99.99%+ SLO

Universal Reliability Principles

Applicable to any mission-critical system

1

Layered Defense

Multiple failure barriers

2

Graceful Degradation

Core function survives

3

Rapid Recovery

Fast detect-to-resolve

4

Continuous Verify

Prove it works

5

Auto + Guardrails

Empower within bounds

Part 6 of 9

Cloud & Infrastructure

Kubernetes, Platform Engineering, Cloud-Native SRE, Multi-Cloud

infrastructure-reliability kubernetes-patterns platform-engineering
Part 7 of 9

AI/ML & Agentic Operations

MLOps, Non-Determinism, Bot Operations, Multi-Agent Systems

ai-ml-operations agentic-operations

Agentic Operational Workflows

1

Detect

Alert triggered

2

Correlate

Query signals

3

Diagnose

AI analysis

4

Remediate

Run playbook

5

Learn

Refine models

The goal: closed-loop autonomous operations

Multi-Agent Orchestration

Orchestrator
↓ ↓ ↓ ↓
Ops Bot
SRE Bot
Obs Bot
Sec Bot
🎭 Puppeteer 🐝 Swarm 🏛️ Hierarchical

Data Strategy for Autonomous Agents

Real-Time

  • Last 5 min metrics
  • Active alerts
  • Deployments

Historical

  • 90-day incidents
  • Resolution patterns
  • SLO trends

Knowledge

  • Runbooks
  • Architecture
  • Post-mortems

The Learning Loop

Incidents → Analysis → Patterns → Runbooks → Automation

Self-Healing Systems

Detect

Anomaly + SLO burn

Decide

Pattern + runbook

Act

Scale, restart, rollback

Verify

SLOs restored

Memory pressure: auto-restart | Latency spike: scale up | Failed deploy: rollback
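A minimal sketch of the detect-decide-act-verify loop; the symptom names, action callables, and SLO check are hypothetical placeholders that would be wired up by the Ops Bot's runbooks:

REMEDIATIONS = {
    "memory_pressure": "restart_service",
    "latency_spike": "scale_up",
    "deploy_failure": "rollback",
}

def self_heal(symptom: str, actions: dict, slo_healthy) -> bool:
    # actions maps action names to callables; slo_healthy() re-checks the SLO afterwards.
    action = REMEDIATIONS.get(symptom)
    if action is None:
        return False                  # unknown pattern: escalate to L2/L3 instead
    actions[action]()                 # act: run the mapped runbook step
    return slo_healthy()              # verify: resolved only if the SLO actually recovers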

Platform Engineering: Golden Paths

"A golden path is a paved road to a well-architected production deployment" — Spotify

New Service

Template → CI/CD → Observability → Alerts → Docs

10 minutes to production-ready

Bot Onboarding

Identity → Worktree → MCP → Permissions → SLOs

Self-service, automated

Incident Response

Alert → Runbook → Resolution → Post-mortem

Guided workflow, minimal toil
Make the right thing the easy thing

Athena → Cloud: Environment Portability

Athena (On-Prem)

InfluxDB + Grafana | Local Jaeger | Low latency | Full control

Public Cloud (AWS/GCP)

Managed services | Auto-scaling | Global edge | Shared responsibility
OpenTelemetry: Abstraction
GitOps + IaC: Deployment
Grafana Cloud: Observability
Env-agnostic config

Deployment Automation: Bleeding Edge

GitOps Pipeline

  • Declarative IaC (Terraform)
  • ArgoCD / Flux sync
  • PR-based deployments

Progressive Delivery

  • Canary releases (1-5%)
  • SLO-gated rollouts
  • Auto-rollback on error

Feature Flags

  • Decouple deploy/release
  • A/B testing built-in
  • Instant kill switches

Observability CI

  • Pre-deploy SLO checks
  • Synthetic monitoring
  • Chaos validation
Target: Zero-touch deployments with bot-driven validation and rollback
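A minimal sketch of an SLO-gated canary decision: compare the canary's error rate with the stable baseline and auto-roll back when it breaches the gate (the thresholds are illustrative):

def canary_decision(canary_errors: int, canary_total: int,
                    baseline_errors: int, baseline_total: int,
                    max_ratio: float = 2.0, slo_error_rate: float = 0.005) -> str:
    # Roll back if the canary breaches the SLO error rate or is much worse than baseline.
    canary_rate = canary_errors / canary_total if canary_total else 0.0
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    if canary_rate > slo_error_rate or canary_rate > max_ratio * max(baseline_rate, 1e-6):
        return "rollback"             # auto-rollback on error
    return "promote"                  # widen the canary, e.g. 1% -> 5% -> 100%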
Part 9 of 9

Industry & Roadmap

Google, Netflix, NASA, Automation Paradoxes, SRE Evolution, Getting Started

industry-leaders implementation-roadmap automation-paradoxes sre-evolution-timeline

Implementation Roadmap

1

Foundation

Alerting, playbooks

2

Reliability

SLOs, GameDays

3

Automation

Self-healing

4

Intelligence

ML, prediction

5

Excellence

Cloud, 99.95%

Automation Paradoxes

Bainbridge's "Ironies of Automation" (1983)

Skill Decay

Operators lose skills. Can't step in when automation fails.

Complacency

Reduced vigilance. Failures become catastrophic.

Clumsy Automation

Workload increases during high-stress moments.

Mitigation

Regular drills, transparent automation, graceful degradation.

"The more advanced the automation, the more crucial the human contribution"

SRE Evolution Timeline

2003
Google creates
SRE role
2010
Netflix Chaos
Monkey
2016
SRE Book
published
2018
SRE Workbook
OpenTelemetry (2019)
2023+
AI/ML Ops
Agentic SRE
Past
Manual ops → Automation
Present
Platform Engineering
Future
Autonomous Reliability

Key Takeaways

1
Speed & Stability Reinforce
DORA proves elite orgs do both
2
Error Budgets Balance
Quantified risk tolerance for innovation
3
Build for Failure
Resilience is designed, not accidental
4
Automate Toil
<50% cap frees humans for engineering
5
Incidents Are Investments
Every failure makes systems stronger
6
Observability > Monitoring
Understand systems, not just alert

Reliability Unleashed

Questions?

34 One-Pagers Available: Comprehensive reference material for each topic

SRE Foundations | Observability | Resilience | Incidents | Release & Capacity
Cloud & Infrastructure | AI/ML & Agentic | People & Culture | Industry Leaders

Essential Reading
Google SRE Book
Netflix Tech Blog
Dekker's Just Culture
Next Steps
Review one-pagers
Assess maturity level
Build your roadmap