Learning from High-Reliability Organizations
Resilience Patterns | Technical Operations Excellence

| Principle | Application |
|---|---|
| Preoccupation with Failure | Treat near-misses as failures; never assume safety |
| Reluctance to Simplify | Resist simple explanations; embrace complexity |
| Sensitivity to Operations | Maintain situational awareness at all times |
| Commitment to Resilience | Focus on recovery, not just prevention |
| Deference to Expertise | In a crisis, authority migrates to those with the most relevant expertise |

| # | Category | Example |
|---|---|---|
| 1 | Config Change | Bad deploy, wrong flag |
| 2 | Capacity | Resource exhaustion |
| 3 | Dependency | Upstream/downstream failure |
| 4 | Hardware | Disk, network, memory |
| 5 | Security | Attack, credential leak |
| 6 | Human Error | Typo, wrong command |
| 7 | Software Bug | Race condition, logic error |
| 8 | Data | Corruption, schema drift |
| 9 | Network | Partition, DNS, latency |
| 10 | External | Cloud provider, 3rd party |

Accidents occur when holes in multiple defense layers momentarily align.
- James Reason
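
The layered-defense idea can be made concrete with a small probability sketch. It assumes, purely for illustration, that each defense layer fails independently with a known probability. The layer names and numbers are hypothetical, and real layers often share failure causes, so the true chance of alignment is usually higher than this model suggests.

```python
from math import prod


def breach_probability(layer_failure_probs):
    """Probability that every layer fails at once, ASSUMING independent layers."""
    probs = list(layer_failure_probs)  # materialize so generators are read only once
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return prod(probs)


# Hypothetical layers and per-change failure probabilities (illustrative only).
layers = {
    "code_review": 0.10,
    "automated_tests": 0.05,
    "canary_deploy": 0.20,
}

all_layers = breach_probability(layers.values())
without_canary = breach_probability(
    p for name, p in layers.items() if name != "canary_deploy"
)

print(f"all layers in place: {all_layers:.4f}")
print(f"canary removed:      {without_canary:.4f}")
```

Under these assumed numbers, removing a single layer raises the chance of an unblocked failure from roughly 1 in 1,000 to 1 in 200, which is the practical argument for defense in depth.
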
| Signal | Pattern | Action |
|---|---|---|
| Latency spike | Capacity/Dependency | Scale or isolate |
| Error burst | Deploy/Config | Rollback |
| Gradual degradation | Resource leak | Restart/investigate |
| Cascading failure | Missing circuit breaker | Shed load |
| Partial outage | Network partition | Failover |
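
The signal table above can be encoded as a lookup that an on-call tool might consult. This is a minimal sketch, not a prescribed implementation: the `Triage` type, the `triage` function, and the normalized signal keys are all hypothetical names introduced here, while the patterns and actions are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    likely_pattern: str
    first_action: str


# Patterns and actions come from the table; the keys are hypothetical identifiers.
TRIAGE_TABLE = {
    "latency_spike": Triage("Capacity/Dependency", "Scale or isolate"),
    "error_burst": Triage("Deploy/Config", "Rollback"),
    "gradual_degradation": Triage("Resource leak", "Restart/investigate"),
    "cascading_failure": Triage("Missing circuit breaker", "Shed load"),
    "partial_outage": Triage("Network partition", "Failover"),
}

# Fallback for signals the table does not cover (an assumption of this sketch).
UNKNOWN = Triage("Unknown", "Investigate")


def triage(signal: str) -> Triage:
    """Return the suggested pattern and first action for a signal name.

    Unrecognized signals are never matched to the nearest row; they are
    routed to investigation instead.
    """
    key = signal.strip().lower().replace(" ", "_")
    return TRIAGE_TABLE.get(key, UNKNOWN)


result = triage("Error burst")
print(result.likely_pattern, "->", result.first_action)
```

Returning an explicit "Investigate" fallback instead of guessing keeps the lookup consistent with the reluctance-to-simplify principle: the table suggests a first move, not a diagnosis.
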

| Aspect | Traditional | HRO |
|---|---|---|
| Failures | Hide/blame | Learn/share |
| Complexity | Simplify away | Embrace |
| Authority | Hierarchy | Expertise |
| Focus | Efficiency | Reliability |

| Industry | Key Practice |
|---|---|
| Aviation | Checklists, crew resource management |
| Nuclear | Defense in depth, safety culture |
| Healthcare | Root cause analysis, just culture |
| Military | After-action reviews, command |

Failures Are Teachers
Every incident is a window into system weaknesses.