Runbook for Incident Management: Reduce MTTR by 50%
When an incident strikes, every minute counts. The difference between a 10-minute resolution and a 2-hour firefight often comes down to one thing: having a runbook for incident management.
Teams with good incident runbooks resolve issues faster, communicate better, and learn more from each incident. If you’re new to runbooks, start with how to write a runbook.
Why You Need a Runbook for Incident Management
The Cost of Slow Response
Without a runbook for incident management:
- Engineers waste time figuring out what to check first
- Critical steps get skipped under pressure
- Communication is ad-hoc and inconsistent
- Post-incident reviews reveal the same gaps repeatedly
What Good Runbooks Provide
- Speed: Known procedures execute faster than improvisation. See our runbook automation tools comparison for ways to speed up execution.
- Consistency: Same process, regardless of who’s on call
- Calm: Following a procedure reduces panic
- Learning: A documented procedure can be reviewed and improved after each incident
Building Your Runbook for Incident Management
Part 1: Initial Response Runbook
The first 5 minutes set the tone for the entire incident:
# Initial Incident Response
## Trigger
PagerDuty alert received or incident reported.
## Immediate Actions (First 5 Minutes)
### 1. Acknowledge
```
Acknowledge alert in PagerDuty
```
### 2. Open Incident Channel
```
Create Slack channel: #inc-YYYYMMDD-brief-description
```
### 3. Post Initial Assessment
```markdown
**Incident Started:** [time]
**Severity:** [P1/P2/P3]
**Impact:** [What users are experiencing]
**Status:** Investigating
**Incident Commander:** [@your-name]
```
### 4. Quick Health Check
```bash
# Check overall system health
curl -s https://status.internal/health | jq .
# Check recent deployments
kubectl rollout history deployment -n production | head -10
```
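The channel naming convention in step 2 is easy to get wrong by hand under pressure. Here is a minimal sketch of a helper that builds the name; `incident_channel_name` is a hypothetical function for illustration, not part of Slack or PagerDuty tooling, and it only formats the name rather than creating the channel.

```bash
# Sketch: build a channel name in the inc-YYYYMMDD-brief-description format.
# incident_channel_name is a hypothetical helper (an assumption, not a real CLI).
incident_channel_name() {
  # $1 = date as YYYYMMDD, $2 = free-text description
  slug=$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed -e 's/^-*//' -e 's/-*$//')   # lowercase, non-alphanumerics to single dashes, trim dashes
  printf 'inc-%s-%s\n' "$1" "$slug"
}

# Usage: incident_channel_name "$(date +%Y%m%d)" "API latency spike"
```

Passing the date as an argument, rather than calling `date` inside the function, keeps the helper predictable when an incident is logged after the fact.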
Part 2: Triage Runbook
Quickly identify the problem area:
# Incident Triage
## Service Health Matrix
Run these checks to identify affected services:
### API Service
```bash
kubectl get pods -n production -l app=api
curl -w "%{http_code}\n" -o /dev/null -s https://api.example.com/health
```
### Database
```bash
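# Quote the host and discard the query output; only the verdict matters here
if psql -h "$DB_HOST" -c "SELECT 1;" >/dev/null; then echo "DB: OK"; else echo "DB: FAIL"; fi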
```
### Cache
```bash
redis-cli -h "$REDIS_HOST" PING
```
### Queue
```bash
rabbitmqctl list_queues name messages | head -10
```
## Decision Tree
Based on results:
- API unhealthy → [API Incident Runbook]
- Database issues → [Database Incident Runbook]
- Cache problems → [Cache Incident Runbook]
- All healthy but users affected → [Deep Investigation Runbook]
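The decision tree above can be expressed as a small function so the routing does not depend on memory. This is a sketch only: `pick_runbook` is a hypothetical helper, the "ok"/"fail" inputs are assumed to come from the checks in the service health matrix, and the order of precedence (dependencies before the API) is an assumption, since the source does not say which branch wins when several checks fail.

```bash
# pick_runbook: hypothetical helper mapping check results to the decision tree.
# $1 = API status, $2 = database status, $3 = cache status ("ok" or "fail")
# $4 = are users reporting problems? ("yes" or "no")
pick_runbook() {
  api_status=$1; db_status=$2; cache_status=$3; users_affected=$4
  # Assumption: check dependencies first, since an unhealthy API is often
  # a downstream symptom of a database or cache problem.
  if [ "$db_status" != "ok" ]; then
    echo "Database Incident Runbook"
  elif [ "$cache_status" != "ok" ]; then
    echo "Cache Incident Runbook"
  elif [ "$api_status" != "ok" ]; then
    echo "API Incident Runbook"
  elif [ "$users_affected" = "yes" ]; then
    echo "Deep Investigation Runbook"
  else
    echo "No runbook needed"
  fi
}

# Usage: pick_runbook ok fail ok yes
```

If your team prefers to start from the user-facing symptom, swap the order of the branches; the point is to make the precedence explicit rather than improvised.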
Part 3: Communication Runbook
Keep stakeholders informed without slowing down resolution:
# Incident Communication
## Update Frequency
- P1: Every 15 minutes
- P2: Every 30 minutes
- P3: At resolution
## Update Template
```markdown
**Update [number] - [time]**
**Status:** [Investigating/Identified/Monitoring/Resolved]
**Current Understanding:** [Brief technical summary]
**User Impact:** [What users are experiencing]
**Next Steps:** [What we're doing next]
**ETA:** [If known, or "Investigating"]
```
## Stakeholder Channels
- #incidents: All updates
- #customer-success: User-facing impact only
- status.example.com: External updates for P1 only
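The update cadence is simple enough to encode, which makes it possible to have a script or bot remind the incident commander when an update is due. The following is a rough sketch; both function names are hypothetical, and the value `0` for P3 is an assumed convention meaning "no periodic updates, post once at resolution".

```bash
# update_interval_minutes: hypothetical helper returning the cadence per severity.
update_interval_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 30 ;;
    P3) echo 0 ;;   # assumed convention: 0 = update only at resolution
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# update_due: succeeds (exit 0) when a periodic update is due.
# $1 = severity, $2 = whole minutes since the last update
update_due() {
  interval=$(update_interval_minutes "$1") || return 2
  [ "$interval" -gt 0 ] && [ "$2" -ge "$interval" ]
}

# Usage: if update_due P1 17; then echo "Post an update now"; fi
```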
Part 4: Escalation Runbook
Know when and how to get help:
# Escalation Procedures
## When to Escalate
Escalate immediately if:
- [ ] P1 incident not improving after 15 minutes
- [ ] Root cause not identified after 30 minutes
- [ ] Required expertise not available
- [ ] Customer or revenue impact exceeds $X
## Escalation Path
### Level 1: Team Lead
```
Page @team-lead via PagerDuty
```
### Level 2: Engineering Manager
```
Page @eng-manager via PagerDuty
Call: [phone number]
```
### Level 3: VP Engineering
```
For P1 incidents exceeding 1 hour
Call: [phone number]
```
## Escalation Message Template
```markdown
**Escalating:** [incident summary]
**Duration:** [time since start]
**Attempted:** [what we've tried]
**Blocked On:** [what we need]
**Requesting:** [specific help needed]
```
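The time-based escalation triggers can be checked mechanically, so nobody has to decide mid-incident whether "15 minutes" has really passed. The sketch below covers only the three time-based rules from the runbook (P1 not improving after 15 minutes, no root cause after 30 minutes, P1 exceeding 1 hour). `escalation_check` is a hypothetical helper, and the expertise and revenue triggers are left to human judgment.

```bash
# escalation_check: hypothetical helper applying the time-based triggers.
# $1 = severity (P1/P2/P3), $2 = whole minutes since incident start
# $3 = is the incident improving? (yes/no), $4 = root cause identified? (yes/no)
escalation_check() {
  sev=$1; mins=$2; improving=$3; cause_known=$4
  if [ "$sev" = "P1" ] && [ "$mins" -gt 60 ]; then
    echo "escalate: Level 3"            # P1 incidents exceeding 1 hour
  elif [ "$sev" = "P1" ] && [ "$improving" = "no" ] && [ "$mins" -ge 15 ]; then
    echo "escalate: next level"         # P1 not improving after 15 minutes
  elif [ "$cause_known" = "no" ] && [ "$mins" -ge 30 ]; then
    echo "escalate: next level"         # root cause not identified after 30 minutes
  else
    echo "hold"
  fi
}

# Usage: escalation_check P1 20 no no
```

The runbook does not say which level each trigger maps to apart from Level 3, so the sketch reports "next level" for the other cases and leaves that choice to the incident commander.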
Part 5: Resolution Runbook
Close the incident properly:
# Incident Resolution
## Before Closing
- [ ] All services healthy
- [ ] Error rates back to baseline
- [ ] User-facing functionality verified
- [ ] Monitoring shows stable state for 15+ minutes
## Verification Commands
```bash
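# Count ERROR lines across all API pods in the last 15 minutes
# (deployment/api reads a single pod; with a selector, --tail=-1 lifts the 10-line default)
kubectl logs -n production -l app=api --since=15m --tail=-1 | grep -c ERROR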
# Check response times
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/health
```
## Closing the Incident
1. Post final update to incident channel
2. Update status page (if P1)
3. Create post-incident review ticket
4. Archive incident channel after 24 hours
## Final Update Template
```markdown
**RESOLVED - [time]**
**Duration:** [total time]
**Root Cause:** [brief description]
**Resolution:** [what fixed it]
**Follow-ups:** [ticket links for post-incident items]
```
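The verification commands in the resolution runbook print raw numbers, and someone still has to judge whether they are acceptable. Here is a minimal sketch of a closing gate that compares them against thresholds. The function names, the error baseline, and the latency limit are all assumptions to be replaced with your own values; the inputs are the `grep -c` count and the curl `time_total` value.

```bash
# response_time_ok: succeeds when a curl time_total value (seconds) is at or
# below a maximum. awk is used because shell arithmetic is integer-only.
response_time_ok() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 <= max + 0) }'
}

# ready_to_close: hypothetical gate combining both checks.
# $1 = ERROR lines in the last 15 minutes, $2 = baseline error count (assumed)
# $3 = measured response time in seconds, $4 = maximum allowed seconds (assumed)
ready_to_close() {
  [ "$1" -le "$2" ] && response_time_ok "$3" "$4"
}

# Usage: if ready_to_close 3 5 0.182 0.5; then echo "OK to close"; fi
```

A gate like this covers only the measurable items on the checklist; verifying user-facing functionality remains a manual step.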
Measuring Runbook Effectiveness
Track these metrics to improve your runbook for incident management:
- MTTR: Mean time to resolution (from incident start to resolved)
- MTTA: Mean time to acknowledge (from alert fired to acknowledged)
- Runbook usage: Are on-call engineers using them?
- Runbook accuracy: Do steps still work?
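Both MTTR and MTTA are plain averages of durations, so they can be computed from any export of incident timestamps. The sketch below assumes one incident per line with two Unix epoch timestamps (start and end); `mean_minutes` is a hypothetical helper, and the input format is an assumption rather than any tool's export format.

```bash
# mean_minutes: average duration in minutes over "start_epoch end_epoch" lines on stdin.
# For MTTR pass start/resolved pairs; for MTTA pass alert/acknowledged pairs.
mean_minutes() {
  awk 'NF == 2 { total += ($2 - $1) / 60; n++ }
       END { if (n) printf "%.1f\n", total / n; else print "n/a" }'
}

# Two incidents lasting 30 and 60 minutes:
printf '%s\n' "1700000000 1700001800" "1700100000 1700103600" | mean_minutes
# prints 45.0
```

Tracking the number before and after introducing runbooks is what lets you check whether they are actually shortening incidents.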
From Runbook to Real-Time Execution
A runbook for incident management is only as good as your ability to follow it under pressure.
Stew makes your incident runbooks executable. Each diagnostic command runs with a click. Communication templates fill automatically. Progress is tracked in real time. When every second counts, your runbook keeps up. Get started with our runbook template guide and runbook examples.