
Runbook Template for Incident Response

5 min read · Stew Team
devops · runbook template · incident response

Incidents don’t wait for you to figure out what to do. When systems fail, you need a DevOps runbook template for incident response that guides you from alert to resolution.

This template has been refined through hundreds of real incidents. Use it to bring structure to chaos.

Why Incident Response Needs a Specialized Template

Generic runbook templates don’t work for incidents because:

  • Time pressure: Every minute matters
  • Parallel work: Multiple people need to coordinate
  • Communication: Stakeholders need updates
  • Documentation: Post-incident reviews need data

A DevOps runbook template for incident response addresses all of these.

The Complete Incident Response Template

Phase 1: Alert and Acknowledge

# Incident Response Runbook

## Trigger
- PagerDuty alert received
- Customer report escalated
- Monitoring threshold breached

## Immediate Actions (0-5 minutes)

### Acknowledge Alert
​```
Acknowledge in PagerDuty
​```

### Create Incident Channel
​```
Slack: Create #inc-YYYYMMDD-brief-description
​```
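The `#inc-YYYYMMDD-brief-description` convention above is easy to script so nobody has to think about it mid-incident. A minimal sketch (the `api-errors` slug is a hypothetical placeholder; substitute your own brief description):

```shell
# Build the incident channel name in the inc-YYYYMMDD-slug format.
# "api-errors" is an illustrative slug, not from this runbook.
slug="api-errors"
channel="inc-$(date -u +%Y%m%d)-$slug"
echo "$channel"
```

Using UTC for the date stamp keeps channel names consistent across on-call engineers in different time zones.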

### Post Initial Message
​```markdown
🚨 **INCIDENT DECLARED**

**Time:** [current time UTC]
**Severity:** P1/P2/P3
**Symptoms:** [what's broken]
**Impact:** [user-facing effects]
**Incident Commander:** @your-name
**Status:** Investigating
​```

Phase 2: Assess and Triage

## Assessment (5-15 minutes)

### Quick System Check
​```bash
# Cluster health
kubectl get nodes
kubectl get pods -A | grep -vE 'Running|Completed'

# Service health
curl -s https://api.example.com/health | jq .

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
​```

### Check Recent Changes
​```bash
# Recent deployments
kubectl rollout history deployment -n production | head -20

# Git commits
git log --oneline --since="2 hours ago"
​```

### Identify Affected Services
​```bash
# Check all service endpoints
for svc in api web worker; do
  echo "=== $svc ==="
  # --max-time keeps the check from hanging on a dead endpoint;
  # \n keeps each status code on its own line
  curl -s -o /dev/null --max-time 5 -w "%{http_code}\n" https://$svc.example.com/health
done
​```

### Update Incident Channel
​```markdown
**Update 1 - [time]**
**Status:** Investigating
**Findings:** [what you've learned]
**Affected:** [services/users impacted]
**Next Steps:** [what you're checking next]
​```

Phase 3: Remediate

## Remediation Options

### Option A: Rollback Recent Deployment
​```bash
kubectl rollout undo deployment/$SERVICE -n production
kubectl rollout status deployment/$SERVICE -n production
​```

### Option B: Scale Up Resources
​```bash
# Note the current replica count first so you can scale back down later
kubectl get deployment/$SERVICE -n production -o jsonpath='{.spec.replicas}{"\n"}'
kubectl scale deployment/$SERVICE --replicas=10 -n production
​```

### Option C: Restart Service
​```bash
kubectl rollout restart deployment/$SERVICE -n production
​```

### Option D: Failover to Secondary
​```bash
# Update DNS to secondary region
aws route53 change-resource-record-sets --hosted-zone-id $ZONE_ID --change-batch file://failover.json
​```
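The command above references a `failover.json` change batch without showing it. A sketch of what that file might contain, using Route 53's change-batch format (the record name, TTL, and secondary endpoint are placeholder values you'd replace with your own):

```json
{
  "Comment": "Fail over api.example.com to the secondary region",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "api.secondary.example.com" }
        ]
      }
    }
  ]
}
```

Keep a pre-written change batch like this in the repo so failover is a one-command action rather than an editing exercise during an incident. A short TTL matters here: a 60-second TTL means clients pick up the failover quickly.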

### After Each Action
​```bash
# Verify improvement
curl -s https://api.example.com/health
kubectl logs deployment/$SERVICE -n production --tail=50
​```

### Update Incident Channel
​```markdown
**Update 2 - [time]**
**Status:** Identified / Mitigating
**Root Cause:** [if known]
**Action Taken:** [what you did]
**Result:** [improvement seen?]
​```

Phase 4: Verify Resolution

## Verification Checklist

### Service Health
- [ ] Health endpoints returning 200
- [ ] Error rates back to baseline
- [ ] Latency within SLO

### Verification Commands
​```bash
# Health check
curl -s https://api.example.com/health | jq .

# Error rate (last 5 minutes)
kubectl logs deployment/api -n production --since=5m | grep -c ERROR

# Response time
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/health
​```
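The response-time check above prints a raw number that someone still has to judge. A small helper can turn it into a pass/fail check against your SLO (the 0.5-second default threshold is an assumption for illustration, not part of this runbook):

```shell
# Exit 0 when observed latency is within the SLO, 1 otherwise.
# The 0.5s default threshold is an assumed SLO; set your real one.
within_slo() {
  observed=$1
  slo=${2:-0.5}
  awk -v o="$observed" -v s="$slo" 'BEGIN { exit (o <= s) ? 0 : 1 }'
}

# Wiring it to the curl timing from above:
# t=$(curl -w "%{time_total}" -o /dev/null -s https://api.example.com/health)
# within_slo "$t" || echo "latency $t exceeds SLO"
```

An explicit exit code makes the check usable both interactively and in a scripted verification pass.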

### Monitoring Confirmation
- [ ] Alerts cleared in PagerDuty
- [ ] Dashboards show normal metrics
- [ ] No new error spikes

Phase 5: Close and Document

## Incident Closure

### Final Status Update
​```markdown
**INCIDENT RESOLVED**

**Duration:** [start time] to [end time] ([X] minutes)
**Root Cause:** [brief description]
**Resolution:** [what fixed it]
**Customer Impact:** [scope of impact]
**Follow-ups:** [action items]
​```
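The `Duration` line is easy to get wrong by hand at the end of a long incident. A sketch of computing it from the start and end timestamps (the timestamps are illustrative; requires GNU `date` for the `-d` flag):

```shell
# Compute the incident duration in minutes from two UTC timestamps.
# These timestamp values are examples only.
start="2024-03-01T10:04:00Z"
end="2024-03-01T10:41:00Z"
mins=$(( ( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ) / 60 ))
echo "Duration: $start to $end ($mins minutes)"
```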

### Immediate Documentation
​```markdown
## Incident Summary: [date]

**Timeline:**
- [time]: Alert triggered
- [time]: Incident declared
- [time]: Root cause identified
- [time]: Fix deployed
- [time]: Resolution confirmed

**What Happened:**
[Technical description]

**What We Did:**
[Actions taken]

**What We'll Do:**
- [ ] Action item 1
- [ ] Action item 2
​```

### Create Follow-up Tasks
- [ ] Post-incident review scheduled
- [ ] Action items created in issue tracker
- [ ] Runbook updated if gaps found

Severity-Specific Modifications

P1 (Critical)

Add to template:

  • Executive notification list
  • Status page update procedure
  • Customer communication template
  • War room bridge details

P2 (Major)

Add to template:

  • Extended team notification
  • 30-minute update cadence
  • Business impact assessment

P3 (Minor)

Simplify template:

  • Async updates acceptable
  • Single engineer can handle
  • Standard SLA response time

Communication Templates

Status Page Update

**Investigating:** We are currently investigating issues with [service].

**Identified:** We have identified the issue affecting [service] and are implementing a fix.

**Monitoring:** A fix has been implemented. We are monitoring the situation.

**Resolved:** The issue has been resolved. [Service] is operating normally.

Customer Email Template

Subject: [Service] Incident - [Status]

We experienced an issue affecting [service] between [start time] and [end time] UTC.

Impact: [what customers experienced]

Root Cause: [brief, non-technical explanation]

Resolution: [what we did]

Prevention: [what we're doing to prevent recurrence]

Making Incident Response Automatic

Under pressure, even the best templates get skipped. Learn more about runbook automation tools and why runbooks fail.

Stew makes your DevOps runbook template for incident response executable. Run diagnostics with a click, track your progress through each phase, and generate incident timelines automatically.

Join the waitlist and transform your incident response.