ITIL Lifecycle, Blameless Postmortems & On-Call Sustainability
Incident Management | Technical Operations Excellence
Detection via monitoring, alerts, or reports
Classify by type, service, impact area
Assign SEV level based on impact + urgency
Diagnose, mitigate, resolve, communicate
Verify, document, postmortem, action items
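The five steps above can be sketched as an ordered lifecycle plus a severity lookup. This is only an illustration: the stage names, the 1-3 scales for impact and urgency, and the way the two are summed into a SEV level are assumptions made here, not something the source specifies.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages; the names are illustrative labels for the five steps."""
    DETECT = 1      # detection via monitoring, alerts, or reports
    CLASSIFY = 2    # classify by type, service, impact area
    PRIORITIZE = 3  # assign SEV level based on impact + urgency
    RESPOND = 4     # diagnose, mitigate, resolve, communicate
    CLOSE = 5       # verify, document, postmortem, action items


def next_stage(stage: Stage) -> Stage:
    """Return the stage that follows `stage`; CLOSE has no successor."""
    if stage is Stage.CLOSE:
        raise ValueError("CLOSE is the final stage")
    return Stage(stage.value + 1)


def assign_sev(impact: int, urgency: int) -> str:
    """Map impact and urgency (1 = highest, 3 = lowest) to a SEV level.

    The thresholds below are a hypothetical matrix, chosen for illustration.
    """
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3")
    score = impact + urgency  # ranges from 2 (worst) to 6 (mildest)
    if score == 2:
        return "SEV1"
    if score == 3:
        return "SEV2"
    if score == 4:
        return "SEV3"
    return "SEV4"
```

In practice a team would replace the summed score with its own impact/urgency matrix; the point is only that the SEV level is derived from both inputs rather than chosen ad hoc.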
| Level | Impact | Response time target |
|---|---|---|
| SEV1 | Critical outage | <15 min |
| SEV2 | Major degradation | <30 min |
| SEV3 | Minor impact | <4 hours |
| SEV4 | Low/cosmetic | Next business day |
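A minimal sketch of how the response targets in this table might be turned into deadlines. The function names are hypothetical, and "next business day" is interpreted here as 17:00 on the next weekday (Monday to Friday, holidays ignored), which is an assumption rather than something the table states.

```python
from datetime import date, datetime, time, timedelta

# Response targets taken from the severity table.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=4),
}


def next_business_day(day: date) -> date:
    """Return the next weekday after `day` (assumes Mon-Fri, no holidays)."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def response_deadline(sev: str, opened_at: datetime) -> datetime:
    """Return the time by which a response is due for the given severity."""
    if sev == "SEV4":
        # Assumption: due by 17:00 on the next business day.
        return datetime.combine(next_business_day(opened_at.date()), time(17, 0))
    return opened_at + RESPONSE_TARGETS[sev]


def response_met(sev: str, opened_at: datetime, acknowledged_at: datetime) -> bool:
    """True if the incident was acknowledged strictly before its deadline."""
    return acknowledged_at < response_deadline(sev, opened_at)
```

For example, a SEV1 opened at 16:00 would be due at 16:15, while a SEV4 opened on a Friday afternoon would, under the assumption above, be due the following Monday at 17:00.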
| Principle | Actions |
|---|---|
| Coordinate | IC assigns roles, manages workstreams |
| Communicate | Status updates, stakeholder briefs |
| Control | Authorize changes, manage scope |
Crisis triage: data criticality, trust relationships, compensating controls
| Role | Responsibility |
|---|---|
| Incident Commander | Owns resolution, delegates |
| Ops Lead | Technical investigation |
| Comms Lead | Stakeholder updates |
| Scribe | Documents timeline |
SEV1/2: Add Remediation Lead, Legal (if needed)
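The role assignments above lend themselves to a simple staffing check. The sketch below is illustrative only: the function names and the dictionary shape (role name mapped to responder name) are assumptions, and Legal is left out of the required set because the source marks it "if needed".

```python
# Roles every incident needs, per the table above.
CORE_ROLES = ("Incident Commander", "Ops Lead", "Comms Lead", "Scribe")


def required_roles(sev: str) -> list[str]:
    """Return the roles that must be staffed for the given severity."""
    roles = list(CORE_ROLES)
    if sev in ("SEV1", "SEV2"):
        roles.append("Remediation Lead")  # added for SEV1/2 only
    return roles


def missing_roles(sev: str, assigned: dict[str, str]) -> list[str]:
    """Return required roles with no responder assigned.

    `assigned` maps role name -> responder name (a hypothetical shape).
    """
    return [role for role in required_roles(sev) if not assigned.get(role)]
```

An Incident Commander could run a check like this when declaring an incident to see which seats still need filling.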
Ask "what" and "how" questions, never "why" - it forces justification and blame.
- John Allspaw, Etsy
| Severity | Update Frequency |
|---|---|
| SEV1 | Every 15 minutes |
| SEV2 | Every 30 minutes |
| SEV3/4 | Hourly or as needed |
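The update cadence in this table can be expressed as a small timer helper. This is a sketch under stated assumptions: the function names are hypothetical, and "hourly or as needed" for SEV3/4 is treated as a fixed one-hour interval.

```python
from datetime import datetime, timedelta

# Update intervals from the table; SEV3/4 default to hourly (assumption).
UPDATE_INTERVALS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=1),
    "SEV4": timedelta(hours=1),
}


def next_update_due(sev: str, last_update: datetime) -> datetime:
    """Return when the next stakeholder update is due."""
    return last_update + UPDATE_INTERVALS[sev]


def is_update_overdue(sev: str, last_update: datetime, now: datetime) -> bool:
    """True if `now` is past the due time for the next update."""
    return now > next_update_due(sev, last_update)
```

A Comms Lead could use a check like this to drive reminders, so that the update rhythm does not depend on someone watching the clock during a SEV1.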
Playbooks reduce MTTR by about 3x on average.
Role-play exercise for IC practice: spin a wheel to select a historic incident, then have responders handle it in a real-time simulation.
Learn from Every Incident
Blameless culture enables honest retrospectives.