Swiss Cheese Model, Big 10 Root Causes & Pattern Recognition
Historic Incidents | Technical Operations Excellence
| # | Root Cause | Frequency |
|---|---|---|
| 1 | Config/Deploy Errors | ~40% |
| 2 | Ignored Warnings | High |
| 3 | Single Point of Failure | High |
| 4 | Inadequate Testing | High |
| 5 | Simple Bugs at Scale | High |
| 6 | Monitoring Gaps | Med |
| 7 | Complex Interdependencies | Med |
| 8 | Human Error Under Pressure | Med |
| 9 | Vendor/3rd Party Failures | Med |
| 10 | Legacy System Fragility | Med |
Hazard → [Prevention] → [Detection] → [Containment] → [Recovery] → Accident
| Layer Breached | Outcome |
|---|---|
| Prevention | Near miss |
| Detection | Degradation |
| Containment | Incident |
| Recovery | Catastrophe |
Key: A catastrophe requires the holes in ALL layers to line up at the same time.
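One way to read the layer table is as a classifier: a hazard travels through the layers in order and stops at the first intact one. The Python sketch below encodes that reading. It assumes the outcome is set by the deepest layer breached in an unbroken run starting at Prevention; the `classify` helper and the `"blocked"` label are hypothetical names introduced for illustration.

```python
# Layers in the order a hazard meets them, paired with the outcome from the
# table above when that layer is the deepest one breached.
LAYERS = [
    ("prevention", "near miss"),
    ("detection", "degradation"),
    ("containment", "incident"),
    ("recovery", "catastrophe"),
]


def classify(holes):
    """Return the outcome for a hazard, given the set of layers with a hole.

    Assumption: the hazard stops at the first intact layer, so only an
    unbroken run of holes starting at prevention counts.
    """
    result = "blocked"  # hypothetical label: the hazard never got past prevention
    for name, outcome_if_breached in LAYERS:
        if name not in holes:
            break  # first intact layer stops the hazard
        result = outcome_if_breached
    return result


print(classify({"prevention"}))                                       # near miss
print(classify({"prevention", "detection", "containment"}))           # incident
print(classify({"detection", "containment", "recovery"}))             # blocked
print(classify({"prevention", "detection", "containment", "recovery"}))  # catastrophe
```

Under this reading, three holes with Prevention intact still yield no event, and only the full set of four produces a catastrophe.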
Lesson: Staged rollouts essential for security updates
| Incident | Root Cause | Lesson |
|---|---|---|
| GitLab (2017) | Human error, untested backups | Verified backups |
| 737 MAX | Single PoF | Redundancy |
| Knight Capital | Bug at scale | Code review |
| Therac-25 | Bad testing | Integration tests |
| Cause | Mitigation |
|---|---|
| Config errors | Canaries, staged rollouts |
| Ignored warnings | Safety culture, incentives |
| Single PoF | Redundancy, chaos testing |
| Testing gaps | Unit, integration, and load tests |
| Dependencies | Dependency mapping |
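The "Canaries, staged rollouts" mitigation can be sketched as a loop that widens exposure only while a health check keeps passing. This is a minimal illustration, not a production deployer: `staged_rollout`, its callback parameters, and the stage fractions are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to growing fractions of the fleet.

    After each stage the health check acts as the canary gate: on the first
    failure the change is rolled back and no further stage is attempted.
    Returns True only if every stage passed.
    """
    for fraction in stages:
        deploy(fraction)
        if not healthy():
            rollback()
            return False
    return True


# Simulated run: the fault only becomes visible once 50% of the fleet is updated.
deployed = []
ok = staged_rollout(
    stages=[0.01, 0.10, 0.50, 1.00],
    deploy=deployed.append,                # record each stage instead of deploying
    healthy=lambda: deployed[-1] < 0.50,   # stand-in for real error-rate checks
    rollback=deployed.clear,               # stand-in for reverting the change
)
print(ok, deployed)  # False []
```

In the simulated run the rollout halts at the 50% stage, so the final 100% stage is never reached. In practice the health check would compare real error rates or latency against a baseline.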
Every catastrophe is a near-miss that was ignored.
Defense in Depth
Build redundant, independent defenses at every layer.
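Why the defenses need to be independent, not just redundant, can be shown with a small calculation. The sketch below assumes each layer fails on its own with a fixed probability, plus an optional shared cause (for example, one config push that reaches every layer) that defeats all of them at once. The function name `p_all_fail` and all numbers are illustrative assumptions, not measured data.

```python
from math import prod


def p_all_fail(layer_probs, p_common=0.0):
    """Probability that every layer fails.

    layer_probs: per-layer failure probabilities, assumed independent.
    p_common:    probability of a shared-cause event that defeats all
                 layers at once (0.0 means fully independent layers).
    """
    return p_common + (1.0 - p_common) * prod(layer_probs)


layers = [0.1, 0.1, 0.1]  # three layers, each failing 10% of the time

print(f"{p_all_fail(layers):.5f}")                 # 0.00100
print(f"{p_all_fail(layers, p_common=0.01):.5f}")  # 0.01099
```

With these made-up numbers, a 1% shared cause makes total failure roughly eleven times more likely than the independent case. This is why a stack of defenses that all depend on the same component offers less protection than the layer count suggests.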