Learning from Catastrophe

Swiss Cheese Model, Big 10 Root Causes & Pattern Recognition

Historic Incidents | Technical Operations Excellence

50+      Incidents Analyzed
40%      Config/Deploy Errors
$10B+    CrowdStrike Damage
4        Defense Layers

Big 10 Root Causes

#    Root Cause                    Freq
1    Config/Deploy Errors          ~40%
2    Ignored Warnings              High
3    Single Point of Failure       High
4    Inadequate Testing            High
5    Simple Bugs at Scale          High
6    Monitoring Gaps               Med
7    Complex Interdependencies     Med
8    Human Error Under Pressure    Med
9    Vendor/3rd Party Failures     Med
10   Legacy System Fragility       Med

Swiss Cheese Model

Hazard → [Prevention] → [Detection] → [Containment] → [Recovery] → Accident

Layer          Outcome if Breached
Prevention     Near miss
Detection      Degradation
Containment    Incident
Recovery       Catastrophe

Key: A catastrophe requires the holes in ALL layers to line up at once; any one intact layer stops the hazard
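The key point above can be put in numbers. The sketch below is illustrative only: the per-layer probabilities are made up, and it assumes the layers fail independently, which real systems often violate (a shared dependency can open holes in several layers at once).

```python
from math import prod

# Hypothetical probability that a hazard slips through each layer.
# These numbers are illustrative, not measured.
LAYER_HOLE_PROB = {
    "prevention": 0.10,
    "detection": 0.20,
    "containment": 0.25,
    "recovery": 0.50,
}


def catastrophe_probability(hole_probs):
    """Chance that a hazard passes every layer, ASSUMING independent layers."""
    return prod(hole_probs.values())


p_all_layers = catastrophe_probability(LAYER_HOLE_PROB)

# Remove one layer to see how much a single defense contributes.
without_detection = {k: v for k, v in LAYER_HOLE_PROB.items() if k != "detection"}
p_no_detection = catastrophe_probability(without_detection)

print(f"All four layers in place: {p_all_layers:.4f}")
print(f"Detection layer missing:  {p_no_detection:.4f}")
```

With these made-up numbers, dropping the detection layer makes a catastrophe five times more likely, which is why each layer matters even when none is perfect.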

CrowdStrike Case Study (2024)

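  • Impact: Estimated $10B+ in damages; about 8.5M Windows systems crashed
  • Root Cause: Faulty content update slipped through validation and shipped to all hosts at once
  • Kernel driver: Single point of failure; a crash in kernel mode takes down the whole OS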

Lesson: Staged rollouts are essential, even for security content updates
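A staged rollout can be sketched as follows. This is a minimal illustration, not CrowdStrike's or any vendor's actual process: the function name staged_rollout, the stage fractions, the failure threshold, and the is_healthy probe are all assumptions made for the example.

```python
def staged_rollout(hosts, is_healthy, stages=(0.01, 0.10, 1.0), max_failure_rate=0.01):
    """Deploy to growing fractions of the fleet; halt if a stage looks unhealthy.

    hosts: ordered list of host identifiers
    is_healthy: hypothetical probe, called per host after the update lands
    stages: cumulative fractions of the fleet (assumed values)
    Returns (hosts_updated, completed).
    """
    updated = []
    for fraction in stages:
        target = min(len(hosts), max(1, round(len(hosts) * fraction)))
        batch = hosts[len(updated):target]
        if not batch:
            continue
        updated.extend(batch)  # stand-in for the real deploy step
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            return updated, False  # halt: blast radius is limited to this stage
    return updated, True


fleet = [f"host-{i}" for i in range(1000)]

# A bad update that breaks every host it touches stops at the 1% canary stage.
bad_hosts, bad_ok = staged_rollout(fleet, is_healthy=lambda host: False)
# A good update proceeds through all stages.
good_hosts, good_ok = staged_rollout(fleet, is_healthy=lambda host: True)

print(len(bad_hosts), bad_ok)    # 10 False
print(len(good_hosts), good_ok)  # 1000 True
```

In this toy run the bad update reaches 10 hosts instead of 1000; the same idea applied to a fleet of millions is the difference between a contained incident and a global outage.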

Notable Incidents

Incident          Root Cause     Lesson
GitLab            Config error   Staged rollouts
737 MAX           Single PoF     Redundancy
Knight Capital    Bug at scale   Code review
Therac-25         Bad testing    Integration tests

Mitigations by Root Cause

Cause               Mitigation
Config errors       Canaries, staged rollouts
Ignored warnings    Safety culture, incentives
Single PoF          Redundancy, chaos testing
Testing gaps        Comprehensive coverage
Dependencies        Dependency mapping
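Dependency mapping and single-point-of-failure hunting can be combined in a few lines. The sketch below uses a made-up service graph and replica counts; the names DEPENDS_ON, REPLICAS, and single_points_of_failure are hypothetical, and a real inventory would come from service discovery or a CMDB rather than a hard-coded dict.

```python
# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
}

# Hypothetical replica counts; 1 means no redundancy.
REPLICAS = {"web": 3, "api": 3, "auth": 2, "db": 1}


def transitive_dependencies(service, graph):
    """Every service reachable from `service` by following dependency edges."""
    seen = set()
    stack = list(graph[service])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph[dep])
    return seen


def single_points_of_failure(graph, replicas):
    """Map each non-redundant service to the services that would fail with it."""
    result = {}
    for service, count in replicas.items():
        if count > 1:
            continue
        dependents = sorted(
            s for s in graph if service in transitive_dependencies(s, graph)
        )
        if dependents:
            result[service] = dependents
    return result


spofs = single_points_of_failure(DEPENDS_ON, REPLICAS)
print(spofs)  # {'db': ['api', 'auth', 'web']}
```

Here the single-replica database sits underneath every other service, so it is the first candidate for redundancy and for a chaos test that takes it offline on purpose.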

Cross-Industry Lessons

  • Aviation: Crew resource management
  • Nuclear: Defense in depth
  • Healthcare: Checklists, near-miss reporting
  • Finance: Circuit breakers, kill switches
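The finance entry translates directly into software. Below is a minimal kill-switch sketch: it trips when net losses pass a limit and stays tripped until a human resets it. The class name KillSwitch, the loss limit, and the trade values are all hypothetical; real trading controls are set by exchanges and regulators and are far more involved.

```python
class KillSwitch:
    """Halts activity once net losses exceed a limit; requires a manual reset."""

    def __init__(self, max_loss):
        self.max_loss = max_loss
        self.net_pnl = 0.0
        self.tripped = False

    def record(self, pnl):
        """Record one trade's profit (positive) or loss (negative)."""
        self.net_pnl += pnl
        if self.net_pnl < -self.max_loss:
            self.tripped = True

    def allow(self):
        return not self.tripped

    def reset(self):
        """Deliberately manual: a person decides when it is safe to resume."""
        self.net_pnl = 0.0
        self.tripped = False


def run_trades(trades, switch):
    """Execute trades in order, stopping as soon as the switch has tripped."""
    executed = []
    for pnl in trades:
        if not switch.allow():
            break
        executed.append(pnl)
        switch.record(pnl)
    return executed


switch = KillSwitch(max_loss=500)
executed = run_trades([-100, -200, 50, -400, -300], switch)
print(executed, switch.tripped)  # [-100, -200, 50, -400] True
```

The fourth trade pushes net losses to 650, past the 500 limit, so the fifth never runs. The point of the pattern is that stopping is automatic while resuming is not.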

Pattern Recognition

Every catastrophe is a near-miss that was ignored.
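One simple way to act on this is to tally near-miss reports and flag the combinations that keep recurring. The sketch below uses a made-up log; the NEAR_MISSES data, the recurring_patterns helper, and the threshold of two are assumptions for illustration.

```python
from collections import Counter

# Hypothetical near-miss log: (system, root cause) pairs.
NEAR_MISSES = [
    ("payments", "config/deploy error"),
    ("payments", "config/deploy error"),
    ("search", "monitoring gap"),
    ("payments", "config/deploy error"),
    ("auth", "single point of failure"),
    ("search", "monitoring gap"),
]


def recurring_patterns(reports, threshold=2):
    """Return (pattern, count) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter(reports)
    return [(pattern, n) for pattern, n in counts.most_common() if n >= threshold]


patterns = recurring_patterns(NEAR_MISSES)
for (system, cause), n in patterns:
    print(f"{system}: {cause} x{n}")
```

A pattern that shows up three times as a near miss is a warning; the catastrophe is the time it shows up and the other layers happen to have holes too.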

Defense in Depth

Build redundant, independent defenses at every layer.