Templates, Decision Trees, and MTTR Targets
Incident Management | Technical Operations Excellence

| # | Runbook | MTTR Target |
|---|---|---|
| 1 | Service Restart | 5 min |
| 2 | Deployment Rollback | 10 min |
| 3 | Database Failover | 15 min |
| 4 | Cache Clear | 5 min |
| 5 | Traffic Shift | 10 min |
| 6 | Scale Out | 5 min |
| 7 | Certificate Rotation | 15 min |
| 8 | DNS Update | 10 min |
| 9 | Feature Flag Toggle | 2 min |
| 10 | Emergency Access | 5 min |
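
The targets in the table above can be kept as data so that drill or incident timings are checked against them automatically. The following is a minimal sketch, not an established tool: the names `MTTR_TARGET_MINUTES`, `met_target`, and `mean_time_to_recover` are hypothetical, and only the runbook names and minute values come from the table.

```python
from datetime import timedelta

# MTTR targets in minutes, copied from the table above.
MTTR_TARGET_MINUTES = {
    "Service Restart": 5,
    "Deployment Rollback": 10,
    "Database Failover": 15,
    "Cache Clear": 5,
    "Traffic Shift": 10,
    "Scale Out": 5,
    "Certificate Rotation": 15,
    "DNS Update": 10,
    "Feature Flag Toggle": 2,
    "Emergency Access": 5,
}


def mean_time_to_recover(durations: list[timedelta]) -> timedelta:
    """Average the recovery durations observed for one runbook."""
    if not durations:
        raise ValueError("at least one recovery duration is required")
    return sum(durations, timedelta()) / len(durations)


def met_target(runbook: str, durations: list[timedelta]) -> bool:
    """Return True if the observed mean recovery time is within the target."""
    target = timedelta(minutes=MTTR_TARGET_MINUTES[runbook])
    return mean_time_to_recover(durations) <= target
```

For example, two Service Restart drills that took 4 and 6 minutes average exactly 5 minutes, so `met_target("Service Restart", [...])` would report the target as met.
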

| Section | Content |
|---|---|
| Overview | What this runbook addresses |
| Symptoms | How to recognize the issue |
| Prerequisites | Required access & tools |
| Steps | Numbered procedure |
| Verification | How to confirm success |
| Rollback | If things go wrong |
| Escalation | Who to contact next |
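
One way to enforce the template above is a small completeness check run on each runbook draft. This sketch assumes a runbook is held as a plain mapping from section name to text; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and only the section names come from the table.

```python
# Section names taken from the template table above, in order.
REQUIRED_SECTIONS = [
    "Overview",
    "Symptoms",
    "Prerequisites",
    "Steps",
    "Verification",
    "Rollback",
    "Escalation",
]


def missing_sections(runbook: dict[str, str]) -> list[str]:
    """List template sections that are absent or contain only whitespace."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not runbook.get(section, "").strip()
    ]
```

A draft that only has `Overview` and `Steps` filled in would return the other five section names, in template order.
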

| Check | How |
|---|---|
| Service healthy | Health endpoint returns 200 |
| Metrics normal | Grafana dashboards green |
| Errors stopped | Error rate below threshold |
| Latency normal | p99 within SLO |
| Logs clean | No error spikes in logs |
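
The five checks above can be evaluated together once the underlying signals have been collected. The sketch below deliberately leaves collection out (no HTTP, Grafana, or log queries) and only encodes the pass/fail rules from the table; the `Snapshot` fields, the `verify` function, and the threshold parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Signals gathered after running a runbook (hypothetical structure)."""
    health_status: int        # HTTP status returned by the health endpoint
    dashboards_green: bool    # result of checking the Grafana dashboards
    error_rate: float         # current error rate, same unit as the threshold
    p99_latency_ms: float     # current p99 latency in milliseconds
    log_error_spike: bool     # True if logs show an error spike


def verify(snapshot: Snapshot, error_rate_threshold: float, p99_slo_ms: float) -> dict[str, bool]:
    """Apply each check from the verification table and report pass/fail."""
    return {
        "Service healthy": snapshot.health_status == 200,
        "Metrics normal": snapshot.dashboards_green,
        "Errors stopped": snapshot.error_rate < error_rate_threshold,
        "Latency normal": snapshot.p99_latency_ms <= p99_slo_ms,
        "Logs clean": not snapshot.log_error_spike,
    }


def recovered(results: dict[str, bool]) -> bool:
    """Recovery counts as verified only if every check passed."""
    return all(results.values())
```

Returning the per-check results rather than a single boolean keeps the failing check visible, which is what the responder needs before deciding to roll back or escalate.
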

Each runbook can be verified in staging or DR drills.
Its steps are scriptable for future automation.
It includes timing targets and success criteria.

| Action | Example |
|---|---|
| Pod restart | `kubectl rollout restart deployment/<name>` |
| Rollback | `kubectl rollout undo deployment/<name>` |
| Scale | `kubectl scale deployment/<name> --replicas=<count>` |
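
As a first step from documentation toward automation, the three actions above could be wrapped in small functions that build the `kubectl` argument list and run it only when asked. This is a sketch under stated assumptions: the target is a Deployment, the namespace defaults to `default`, and the function names and the dry-run default are hypothetical choices rather than part of any existing tool.

```python
import subprocess


def restart_cmd(deployment: str, namespace: str = "default") -> list[str]:
    """Build the command for a rolling restart of a Deployment."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def rollback_cmd(deployment: str, namespace: str = "default") -> list[str]:
    """Build the command that rolls a Deployment back to its previous revision."""
    return ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]


def scale_cmd(deployment: str, replicas: int, namespace: str = "default") -> list[str]:
    """Build the command that sets the replica count of a Deployment."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    return ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}", "-n", namespace]


def run(cmd: list[str], dry_run: bool = True) -> int:
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return 0
    # Requires kubectl on PATH and a configured cluster context.
    return subprocess.run(cmd, check=False).returncode
```

Defaulting to a dry run means the script can be exercised in a drill or code review without touching a cluster, for example `run(scale_cmd("checkout", 6, "prod"))` only prints the command.
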

Document to Automate
Today's runbook is tomorrow's automation.