# Incident Response Checklist: A Complete Guide for SRE Teams

When incidents happen, checklists prevent chaos. A well-designed incident response checklist ensures nothing gets missed, even at 3am.

This guide covers how to build and use incident response checklists. For detailed runbooks, see our guide on how to write a runbook.
## Why Checklists Matter
Studies of high-stakes fields (aviation, medicine, nuclear power) show that checklists:
- Reduce errors by 30-50%
- Ensure consistent execution
- Free mental capacity for problem-solving
- Create accountability
Incidents are high-stakes. Checklists help.
## The Core Incident Response Checklist
# Incident Response Checklist
## Phase 1: Detection & Alert (0-5 min)
- [ ] Acknowledge the alert
- [ ] Verify the alert is real (not false positive)
- [ ] Identify affected services
- [ ] Determine initial severity
## Phase 2: Triage (5-15 min)
- [ ] Assign incident commander
- [ ] Open incident communication channel
- [ ] Notify relevant stakeholders
- [ ] Start incident timer
## Phase 3: Diagnosis (15-45 min)
- [ ] Run quick diagnostic commands
- [ ] Check recent changes (deploys, config)
- [ ] Review related alerts
- [ ] Identify root cause or likely suspects
## Phase 4: Resolution (varies)
- [ ] Apply remediation
- [ ] Verify fix is working
- [ ] Monitor for recurrence
- [ ] Document what was done
## Phase 5: Closure (after resolution)
- [ ] Update status page
- [ ] Notify stakeholders of resolution
- [ ] Capture timeline and notes
- [ ] Schedule postmortem if needed
## Phase 1: Detection Checklist
## Detection Checklist
### Alert Acknowledgment
- [ ] Acknowledge in PagerDuty/OpsGenie
- [ ] Note the alert timestamp
### Verification
```bash
# Quick health check
curl -s http://affected-service/health | jq '.status'
```
- [ ] Alert is legitimate (not false positive)
- [ ] Issue is ongoing (not already resolved)
### Impact Assessment
```bash
# Check error rate (use -G with --data-urlencode: without it,
# curl treats the [5m] brackets as a URL glob and rejects the URL)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(http_errors_total[5m])" | jq '.data.result[0].value[1]'
```
- [ ] Identified affected services
- [ ] Estimated user impact
- [ ] Determined geographic scope
### Severity Classification
| Severity | Criteria |
|----------|----------|
| P1 | Service down, all users affected |
| P2 | Major feature broken, many users affected |
| P3 | Minor feature broken, some users affected |
| P4 | No user impact, internal only |
- [ ] Severity assigned: ____
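The severity table above can even be turned into a tiny triage helper so the 3am on-call doesn't have to reason from scratch. A sketch in POSIX sh; the function name and thresholds are illustrative assumptions, not a standard, so tune them to your own SLOs:

```shell
# Hypothetical helper: suggest a severity from two quick inputs.
# Thresholds are assumptions -- adjust to match your severity table.
classify_severity() {
  service_down=$1   # "yes" if the service is fully down
  pct_affected=$2   # rough % of users affected
  if [ "$service_down" = "yes" ]; then
    echo "P1"       # service down, all users affected
  elif [ "$pct_affected" -ge 50 ]; then
    echo "P2"       # many users affected
  elif [ "$pct_affected" -gt 0 ]; then
    echo "P3"       # some users affected
  else
    echo "P4"       # no user impact
  fi
}

classify_severity no 60    # prints "P2"
```

A helper like this is a starting point, not a verdict: the on-call still overrides it when impact is ambiguous.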
## Phase 2: Triage Checklist
## Triage Checklist
### Roles (assign for P1/P2)
- [ ] Incident Commander: ____________
- [ ] Communications Lead: ____________
- [ ] Technical Lead: ____________
### Communication
- [ ] Created incident channel: #incident-YYYY-MM-DD-[name]
- [ ] Posted initial update to channel
- [ ] Started incident document
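Generating the `#incident-YYYY-MM-DD-[name]` channel name is worth scripting so it stays consistent across incidents. A minimal sh sketch; the incident name is a made-up sample:

```shell
# Build a channel name in the #incident-YYYY-MM-DD-[name] format.
name="API Latency Spike"    # sample incident name
slug=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
channel="#incident-$(date +%F)-$slug"
echo "$channel"             # e.g. #incident-2025-01-15-api-latency-spike
```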
### Stakeholder Notification
| Stakeholder | P1 | P2 | P3 | Notified |
|-------------|----|----|----|----------|
| Engineering Lead | ✓ | ✓ | | [ ] |
| Customer Success | ✓ | | | [ ] |
| Executive Team | ✓ | | | [ ] |
### Status Page
- [ ] Status page updated (if customer-facing)
- [ ] Initial ETA communicated (or "investigating")
## Phase 3: Diagnosis Checklist
## Diagnosis Checklist
### Quick Diagnostics
```bash
# Service status
kubectl get pods -l app=affected-service -n production
# Recent errors
kubectl logs -l app=affected-service -n production --tail=100 | grep -i error | tail -20
# Resource usage
kubectl top pods -l app=affected-service -n production
```
- [ ] Service status checked
- [ ] Logs reviewed
- [ ] Resource usage assessed
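When reviewing logs, the question is usually "when did this start?". Bucketing error lines by minute answers it quickly; an offline sketch using an inline sample log in place of real service logs:

```shell
# Bucket ERROR log lines by minute to spot when a spike began.
# The here-doc sample stands in for real service logs.
cat <<'EOF' > /tmp/sample.log
2025-01-15T10:01:12Z INFO request ok
2025-01-15T10:02:03Z ERROR upstream timeout
2025-01-15T10:02:41Z ERROR upstream timeout
2025-01-15T10:03:09Z INFO request ok
EOF
# substr($1, 1, 16) keeps the timestamp down to the minute
awk '/ERROR/ { minute = substr($1, 1, 16); count[minute]++ }
     END { for (m in count) print m, count[m] }' /tmp/sample.log
# prints: 2025-01-15T10:02 2
```

Pipe the same awk over `kubectl logs` output to see whether the error spike lines up with a deploy timestamp.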
### Recent Changes
```bash
# Recent deployments
kubectl get deployments -n production -o json | jq '.items[] | select(.metadata.annotations["deployment-time"]) | {name: .metadata.name, deployed: .metadata.annotations["deployment-time"]}'
```
- [ ] Recent deploys reviewed
- [ ] Config changes checked
- [ ] Infrastructure changes checked
### Related Signals
- [ ] Related alerts reviewed
- [ ] Upstream services checked
- [ ] Downstream services checked
### Root Cause
- [ ] Root cause identified: ____________
- [ ] Or: Top 3 suspects identified:
1. ____________
2. ____________
3. ____________
## Phase 4: Resolution Checklist
## Resolution Checklist
### Remediation
- [ ] Remediation plan approved
- [ ] Rollback plan ready (if needed)
### Execution
```bash
# Apply fix (example: rollback)
kubectl rollout undo deployment/api -n production
```
- [ ] Fix applied
- [ ] Change documented in incident channel
### Verification
```bash
# Verify recovery
kubectl get pods -l app=api -n production
curl -s http://api.internal/health | jq '.status'
```
- [ ] Service status verified healthy
- [ ] Error rate returned to normal
- [ ] User-facing functionality confirmed
### Monitoring
- [ ] Monitoring for recurrence (15 min)
- [ ] No new related alerts
- [ ] Performance back to baseline
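The "monitoring for recurrence" items can be scripted as a small watch loop instead of staring at a dashboard. A POSIX sh sketch; `watch_recovery` is a hypothetical helper, and in real use the check command would be your health probe (e.g. `curl -sf http://api.internal/health`):

```shell
# watch_recovery CHECKS INTERVAL CMD...
# Runs CMD every INTERVAL seconds; fails fast on the first failed check,
# succeeds after CHECKS consecutive passes.
watch_recovery() {
  checks=$1; interval=$2; shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    "$@" || { echo "check failed after $i passes"; return 1; }
    i=$((i + 1))
    sleep "$interval"
  done
  echo "stable after $checks consecutive checks"
}

# Demo with a trivially passing check and no delay; for a real 15-minute
# watch you might run: watch_recovery 30 30 curl -sf http://api.internal/health
watch_recovery 3 0 true    # prints "stable after 3 consecutive checks"
```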
## Phase 5: Closure Checklist
## Closure Checklist
### External Communication
- [ ] Status page updated to "Resolved"
- [ ] Customer notification sent (if applicable)
- [ ] Support team notified
### Internal Documentation
- [ ] Incident timeline captured
- [ ] Root cause documented
- [ ] Resolution steps documented
- [ ] Incident tagged and categorized
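Capturing the timeline is far easier if notes were timestamped as they happened. A tiny sh helper along these lines can live in the on-call dotfiles; the `note` function, file path, and note text are all illustrative:

```shell
# Append UTC-timestamped entries to a running timeline file.
timeline=/tmp/incident-timeline.md
: > "$timeline"    # start a fresh timeline for this incident
note() { printf '%s  %s\n' "$(date -u +%FT%TZ)" "$*" >> "$timeline"; }

note "error spike traced to latest api deploy"        # sample entries
note "rolled back deployment/api; error rate recovering"
cat "$timeline"
```

Each entry lands with its own timestamp, so the closure step becomes "paste the file into the incident doc" rather than reconstructing times from memory.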
### Follow-up
- [ ] Postmortem scheduled (for P1/P2)
- [ ] Action items created
- [ ] Runbook updated with learnings
- [ ] Alert tuning needed? ____
### Handoff
- [ ] On-call updated on current state
- [ ] Any ongoing monitoring documented
## Severity-Specific Checklists

### P1 (Critical) Additional Items
## P1 Additions
### Escalation
- [ ] Engineering leadership notified
- [ ] Exec team notified
- [ ] War room opened (if remote: bridge call)
### External Communication
- [ ] Customer success briefed
- [ ] Status page updated every 15 min
- [ ] Tweet/social prepared (if needed)
### Post-Incident
- [ ] Postmortem within 48 hours (mandatory)
- [ ] Exec summary prepared
### P3/P4 (Minor) Simplified Checklist
## P3/P4 Simplified Checklist
- [ ] Alert acknowledged
- [ ] Issue verified
- [ ] Fix applied
- [ ] Verified working
- [ ] Ticket created for follow-up
## Digital vs Physical Checklists

### Digital (Recommended)
- Embedded in incident management tools
- Pre-populated with context
- Shareable with the team
- Auditable

### Physical (Backup)
Print and post near on-call workstations:
```
┌─────────────────────────────────────┐
│  INCIDENT RESPONSE QUICK CHECKLIST  │
├─────────────────────────────────────┤
│  □ Acknowledge alert                │
│  □ Verify real issue                │
│  □ Assess severity                  │
│  □ Open incident channel            │
│  □ Notify stakeholders              │
│  □ Run diagnostics                  │
│  □ Apply fix                        │
│  □ Verify resolution                │
│  □ Update status page               │
│  □ Document & close                 │
└─────────────────────────────────────┘
```
## Checklist Anti-Patterns

### ❌ Too Long
100-item checklists don’t get used.

Fix: Keep the core checklist under 20 items. Link to detailed docs.

### ❌ Too Vague
“Check the service” doesn’t help at 3am.

Fix: Include specific commands and criteria.

### ❌ Never Updated
Checklists that reference old systems erode trust.

Fix: Review quarterly. Update after every incident.
## Stew: Executable Checklists
Stew turns incident response checklists into executable workflows:
- Check items as you complete them
- Run diagnostic commands with a click
- Capture output automatically
- Share progress with your team
Your checklist becomes an active incident document.
Join the waitlist and streamline your incident response.