Stability Patterns for Production Systems
Resilience Patterns | Technical Operations Excellence
| Pattern | Purpose |
|---|---|
| Circuit Breaker | Stop cascading failures |
| Bulkhead | Isolate failures to partitions |
| Timeout | Prevent indefinite waits |
| Retry | Handle transient failures |
| Fallback | Graceful degradation |
| Shed Load | Reject excess traffic |
| Handshaking | Verify capacity before work |
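
The Retry and Fallback rows above can be combined in one small helper. What follows is a minimal sketch, not a definitive implementation: the name `call_with_retry`, its parameters, and the choice of jittered exponential backoff are all assumptions made for illustration.

```python
import random
import time


def call_with_retry(operation, fallback, max_attempts=3, base_delay=0.1,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry transient failures with jittered backoff, then fall back.

    Hypothetical helper: only exceptions listed in retry_on are treated
    as transient; anything else propagates immediately.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                break  # out of attempts; degrade gracefully below
            # Full jitter: wait a random time up to base_delay * 2^attempt
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()
```

Retries suit only idempotent operations and transient errors. Keeping `max_attempts` small matters, because retries multiply the load on a dependency that is already struggling.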

Circuit breaker states:

| State | Behavior |
|---|---|
| Closed | Normal operation, count failures |
| Open | Fast fail, don't call downstream |
| Half-Open | Test with limited traffic |
Example thresholds: open after 5 failures, stay open for 30 s, then allow 1 test request while half-open.
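
The three states and the example thresholds can be sketched as a small state machine. This is an illustrative, single-threaded sketch under stated assumptions: the class and method names are hypothetical, a success resets the failure count (so the threshold counts consecutive failures), and a real breaker would need locking to guarantee that only one test request passes while half-open.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; downstream not called")
            self.state = "half-open"  # timeout elapsed: allow one test request
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed test request, or too many failures while closed,
        # opens the circuit and restarts the timer.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
```

Callers wrap each downstream call as `breaker.call(lambda: client.get(...))`, where `client` stands in for whatever integration point is being protected, and treat `CircuitOpenError` as a signal to use a fallback.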

Further patterns:

| Pattern | Use Case |
|---|---|
| Steady State | Self-cleaning logs/data |
| Test Harness | Simulate bad behaviors |
| Decoupling | Async via queues |
| Fail Fast | Check prereqs early |
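
As a rough illustration of the Fail Fast row, the sketch below checks cheap preconditions before any expensive work begins. The function names, the shape of `request`, and the `dependencies_up` mapping are hypothetical, chosen only for the example.

```python
class PrerequisiteError(Exception):
    """Raised before any work starts when a precondition is not met."""


def check_prerequisites(request, dependencies_up):
    """Collect every unmet precondition so the caller sees them all at once."""
    problems = []
    if not request.get("items"):
        problems.append("request has no items")
    for name, is_up in dependencies_up.items():
        if not is_up:
            problems.append(f"dependency unavailable: {name}")
    if problems:
        raise PrerequisiteError("; ".join(problems))


def handle(request, dependencies_up, do_work):
    # Fail fast: reject the request before spending time or resources on it.
    check_prerequisites(request, dependencies_up)
    return do_work(request)
```

In practice the dependency status could come from circuit breaker state, so a request that is certain to fail is rejected without holding a thread or a connection.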

Anti-patterns:

| Anti-Pattern | Risk |
|---|---|
| Integration Points | Every call is a risk |
| Chain Reactions | One failure cascades |
| Cascading Failures | Avalanche effect |
| Users | Unpredictable traffic |
| Blocked Threads | Thread pool exhaustion |
| Unbounded Queues | Memory exhaustion |
| Self-Denial | Marketing DDoS |
| Unbalanced Capacity | Bottleneck fails first |
| Slow Responses | Worse than no response |
| SLA Inversion | Depend on weaker SLA |
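
One common answer to the Unbounded Queues risk is a bounded queue that sheds load when full. The sketch below uses Python's standard `queue.Queue`; the helper names and the default capacity of 100 are assumptions for illustration, not recommended values.

```python
import queue


def make_work_queue(capacity=100):
    """Create a queue that holds at most `capacity` pending jobs."""
    return queue.Queue(maxsize=capacity)


def submit(work_queue, job):
    """Return True if the job was accepted, False if it was shed."""
    try:
        # put_nowait raises queue.Full instead of blocking the caller.
        work_queue.put_nowait(job)
        return True
    except queue.Full:
        return False
```

A caller that gets `False` can reject the request immediately, for example with an HTTP 503, so memory use stays bounded and producers never block.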

Timeout recommendations:

| Type | Recommendation |
|---|---|
| Connect | 1-3 seconds |
| Read | 5-30 seconds |
| Total | Max acceptable latency |
Always set timeouts! Never rely on language or library defaults, which often mean no timeout at all.
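
The three timeout types in the table can be set explicitly with Python's standard `socket` module. This is a minimal sketch: `read_reply` and `fetch` are hypothetical helpers, the default values are picked from inside the recommended ranges, and the protocol (read until the peer closes) is an assumption made to keep the example short.

```python
import socket
import time


def read_reply(sock, read_timeout=10.0, total_timeout=15.0):
    """Read until the peer closes, bounding each read and the total time."""
    deadline = time.monotonic() + total_timeout
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("total deadline exceeded")
        # Each read waits at most read_timeout, and never past the deadline.
        sock.settimeout(min(read_timeout, remaining))
        data = sock.recv(4096)
        if not data:  # peer closed the connection: reply is complete
            return b"".join(chunks)
        chunks.append(data)


def fetch(host, port, payload, connect_timeout=2.0, **read_limits):
    """Connect with an explicit connect timeout, send, then read the reply."""
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        sock.sendall(payload)
        return read_reply(sock, **read_limits)
```

The read timeout catches a server that goes silent, while the total deadline catches one that trickles data slowly enough to keep every individual read under its limit.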
Every integration point will eventually fail in some way.
- Michael Nygard, Release It!
Expect Failure
Design for failure; plan for success.