Technical Operations Excellence
A comprehensive guide to Site Reliability Engineering, Observability, and Platform Operations
34 One-Pagers
10 Core Themes
35+ Research Sources
Vision & Overview
→ Reliability Unleashed: From Chaos to Confidence
SRE Foundations & Assessment
→ SRE Foundations: SLIs, SLOs, Error Budgets
→ DORA 24 Capabilities: DevOps Research Framework
→ SRE Maturity Assessment: Measuring Capabilities
→ SLO Design Framework: Effective Objectives
★ Interactive Assessment: Auto-Scoring + Rubrics
Observability
→ Observability Mastery: Three Pillars & OTel
→ Multi-Window Alerting: Burn Rate Strategy
→ USE Method: Utilization, Saturation, Errors
→ Observability 2.0: High Cardinality Events
→ Alert Tuning Playbook: Reducing Noise
Resilience Patterns
→ Resilience Patterns: Circuit Breakers, Bulkheads
→ Defense in Depth: Layered Security
→ HRO Patterns: High-Reliability Orgs
→ Release It! Patterns: Stability Patterns
→ Chaos Engineering: GameDay Practices
Incident Management
→ Incident Excellence: Response & Postmortems
→ Learning from Catastrophe: Case Studies
→ Runbook Quick Reference: Templates & Practices
Release & Capacity
→ Capacity & Release: DORA, Progressive Delivery
→ NALSD Framework: Large System Design
→ Designing for Recovery: Breakglass Access
Infrastructure
→ Infrastructure Reliability: K8s, TSDB, Backends
→ Kubernetes Patterns: K8s Operational Patterns
→ Platform Engineering: Golden Paths, Self-Service
AI/ML & Agentic
→ AI/ML Operations: MLOps, Non-Determinism
→ Agentic Operations: Bot Operations, AI Agents
People & Culture
→ People & Culture: Westrum, Team Topologies
→ On-Call Excellence: Sustainable Rotations
→ Three Ways of DevOps: Flow, Feedback, Learning
→ Team Topologies: Organizing Teams
Industry & Implementation
→ Industry Leaders: Google, Netflix, NASA
→ Implementation Roadmap: Getting Started
→ Automation Paradoxes: When Automation Hurts
→ SRE Evolution Timeline: History & Future