Runbook Quick Reference | Bot Army SRE

Metric	Value
Runbook Templates	10
Triage Target	<5 min
MTTR Target	<1 hr
Runbook Coverage	80%

10 Essential Runbook Types

#	Runbook	MTTR Target
1	Service Restart	5 min
2	Deployment Rollback	10 min
3	Database Failover	15 min
4	Cache Clear	5 min
5	Traffic Shift	10 min
6	Scale Out	5 min
7	Certificate Rotation	15 min
8	DNS Update	10 min
9	Feature Flag Toggle	2 min
10	Emergency Access	5 min
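
The targets in the table can be kept as data so that an incident timer can flag a runbook that is running long. The following is a minimal sketch; the names `MTTR_TARGETS` and `over_target` are illustrative assumptions, not part of any existing tooling.

```python
from datetime import timedelta

# MTTR targets copied from the table above (runbook name -> target).
MTTR_TARGETS = {
    "Service Restart": timedelta(minutes=5),
    "Deployment Rollback": timedelta(minutes=10),
    "Database Failover": timedelta(minutes=15),
    "Cache Clear": timedelta(minutes=5),
    "Traffic Shift": timedelta(minutes=10),
    "Scale Out": timedelta(minutes=5),
    "Certificate Rotation": timedelta(minutes=15),
    "DNS Update": timedelta(minutes=10),
    "Feature Flag Toggle": timedelta(minutes=2),
    "Emergency Access": timedelta(minutes=5),
}


def over_target(runbook: str, elapsed: timedelta) -> bool:
    """Return True if the elapsed time exceeds the runbook's MTTR target."""
    return elapsed > MTTR_TARGETS[runbook]
```

A run that lands exactly on the target counts as within target in this sketch; tighten the comparison if your team treats the target as a strict bound.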

Runbook Structure

Section	Content
Overview	What this runbook addresses
Symptoms	How to recognize the issue
Prerequisites	Required access & tools
Steps	Numbered procedure
Verification	How to confirm success
Rollback	If things go wrong
Escalation	Who to contact next
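
One way to enforce this structure is to model the seven sections as a record and report any that are still empty. This is a sketch under assumptions: the `Runbook` class and `missing_sections` helper are hypothetical names, and the choice of strings versus lists per section is illustrative.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    # One field per section in the table above, in the same order.
    overview: str = ""
    symptoms: list[str] = field(default_factory=list)
    prerequisites: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    rollback: list[str] = field(default_factory=list)
    escalation: str = ""


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of sections that are empty, in declaration order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]
```

A review script could reject any runbook for which `missing_sections` returns a non-empty list.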

Decision Tree: High Latency

Check: Is one service affected, or all?
- Single → Check that service's resources
- All → Check shared dependencies (DB, cache)
Check: Recent deployment?
- Yes → Consider rollback
- No → Check traffic levels
Check: Resource exhaustion?
- Yes → Scale or restart
- No → Check network, dependencies
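
The tree above can be sketched as a small function. It assumes the three checks are answered independently and in order, which is one reading of the tree; the function name and boolean parameters are illustrative.

```python
def triage_high_latency(
    single_service: bool,
    recent_deployment: bool,
    resource_exhaustion: bool,
) -> list[str]:
    """Map the answers to the three checks onto the tree's suggested actions."""
    actions = []
    # Check 1: scope of the latency problem.
    actions.append(
        "Check that service's resources"
        if single_service
        else "Check shared dependencies (DB, cache)"
    )
    # Check 2: recent deployment.
    actions.append("Consider rollback" if recent_deployment else "Check traffic levels")
    # Check 3: resource exhaustion.
    actions.append(
        "Scale or restart" if resource_exhaustion else "Check network, dependencies"
    )
    return actions
```

For example, `triage_high_latency(True, True, False)` suggests checking the service's resources, considering a rollback, and then checking network and dependencies.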

Verification Checklist

Check	How
Service healthy	Health endpoint returns 200
Metrics normal	Grafana dashboards green
Errors stopped	Error rate below threshold
Latency normal	p99 within SLO
Logs clean	No error spikes in logs
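
The checklist lends itself to a pass/fail report. The sketch below assumes the measurements have already been collected by some other means; the `Snapshot` fields and the `verify` function are hypothetical names, and the thresholds are whatever your SLOs define.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    health_status: int       # HTTP status returned by the health endpoint
    dashboards_green: bool   # operator's reading of the Grafana dashboards
    error_rate: float        # current error rate
    error_threshold: float   # acceptable error rate
    p99_ms: float            # observed p99 latency in milliseconds
    slo_p99_ms: float        # p99 latency allowed by the SLO
    log_error_spike: bool    # True if logs show an error spike


def verify(s: Snapshot) -> dict[str, bool]:
    """Return one pass/fail result per row of the verification checklist."""
    return {
        "Service healthy": s.health_status == 200,
        "Metrics normal": s.dashboards_green,
        "Errors stopped": s.error_rate < s.error_threshold,
        "Latency normal": s.p99_ms <= s.slo_p99_ms,
        "Logs clean": not s.log_error_spike,
    }
```

A runbook step can be considered verified only when `all(verify(snapshot).values())` is true.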

Decision Tree: Error Spike

Check: Error type?
- 5xx → Server-side issue
- 4xx → Client or config issue
Check: Pattern?
- Sudden spike → Deployment or config
- Gradual → Resource exhaustion
Check: Scope?
- One endpoint → Check that handler
- All endpoints → Check infrastructure
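
The same tree, sketched as code. The assumptions here are that the dominant HTTP status code is known, that the spike has been classified as sudden or gradual by eye, and that "one endpoint" means exactly one affected endpoint; anything wider is treated as infrastructure. All names are illustrative.

```python
def triage_error_spike(
    status_code: int,
    sudden: bool,
    affected_endpoints: int,
) -> list[str]:
    """Map the three checks of the error-spike tree onto likely causes."""
    findings = []
    # Check 1: error type, from the HTTP status class.
    if 500 <= status_code <= 599:
        findings.append("Server-side issue")
    elif 400 <= status_code <= 499:
        findings.append("Client or config issue")
    else:
        raise ValueError(f"not an error status: {status_code}")
    # Check 2: pattern of the spike.
    findings.append("Deployment or config" if sudden else "Resource exhaustion")
    # Check 3: scope (assumption: more than one endpoint means infrastructure).
    findings.append(
        "Check that handler" if affected_endpoints == 1 else "Check infrastructure"
    )
    return findings
```

For example, a sudden burst of 503s on a single endpoint points at a server-side issue from a deployment or config change, and the next step is to check that handler.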

Runbook Quality Criteria

Criterion	Meaning
Testable	Can be verified in staging/DR drills
Automatable	Steps are scriptable for future automation
Measurable	Includes timing targets and success criteria

Quick Commands

Action	Example
Pod restart	`kubectl rollout restart deployment/<name>`
Rollback	`kubectl rollout undo deployment/<name>`
Scale	`kubectl scale deployment/<name> --replicas=<count>`
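
As a first step toward automation, the commands in the table can be built as argument lists and reviewed before anything is executed. This sketch assumes the workloads are Deployments and that a namespace is passed with `-n`; the helper names are hypothetical, and nothing here calls `kubectl`.

```python
import shlex


def rollout_restart(deployment: str, namespace: str = "default") -> list[str]:
    """Build the argv for restarting a Deployment's pods."""
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def rollout_undo(deployment: str, namespace: str = "default") -> list[str]:
    """Build the argv for rolling a Deployment back to its previous revision."""
    return ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]


def scale(deployment: str, replicas: int, namespace: str = "default") -> list[str]:
    """Build the argv for setting a Deployment's replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]


# Print the command for review instead of running it.
print(shlex.join(scale("api", 3)))  # kubectl scale deployment/api --replicas=3 -n default
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` later without shell quoting problems.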

Document to Automate

Today's runbook is tomorrow's automation.