# Incident Response Checklist: A Complete Guide for SRE Teams

When incidents happen, checklists prevent chaos. A well-designed incident response checklist ensures nothing gets missed, even at 3am.

This guide covers how to build and use incident response checklists. For detailed runbooks, see our guide on how to write a runbook.
## Why Checklists Matter
Studies of high-stakes fields (aviation, medicine, nuclear power) show that checklists:
- Reduce errors by 30-50%
- Ensure consistent execution
- Free mental capacity for problem-solving
- Create accountability
Incidents are high-stakes. Checklists help.
## The Core Incident Response Checklist
# Incident Response Checklist
## Phase 1: Detection & Alert (0-5 min)
- [ ] Acknowledge the alert
- [ ] Verify the alert is real (not false positive)
- [ ] Identify affected services
- [ ] Determine initial severity
## Phase 2: Triage (5-15 min)
- [ ] Assign incident commander
- [ ] Open incident communication channel
- [ ] Notify relevant stakeholders
- [ ] Start incident timer
## Phase 3: Diagnosis (15-45 min)
- [ ] Run quick diagnostic commands
- [ ] Check recent changes (deploys, config)
- [ ] Review related alerts
- [ ] Identify root cause or likely suspects
## Phase 4: Resolution (varies)
- [ ] Apply remediation
- [ ] Verify fix is working
- [ ] Monitor for recurrence
- [ ] Document what was done
## Phase 5: Closure (after resolution)
- [ ] Update status page
- [ ] Notify stakeholders of resolution
- [ ] Capture timeline and notes
- [ ] Schedule postmortem if needed
## Phase 1: Detection Checklist
## Detection Checklist
### Alert Acknowledgment
- [ ] Acknowledge in PagerDuty/OpsGenie
- [ ] Note the alert timestamp
### Verification
```bash
# Quick health check
curl -s http://affected-service/health | jq '.status'
```
- [ ] Alert is legitimate (not false positive)
- [ ] Issue is ongoing (not already resolved)
### Impact Assessment
```bash
# Check error rate (use -G with --data-urlencode: without it,
# curl treats the [5m] brackets as a URL glob and rejects the URL)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=rate(http_errors_total[5m])" | jq '.data.result[0].value[1]'
```
- [ ] Identified affected services
- [ ] Estimated user impact
- [ ] Determined geographic scope
### Severity Classification
| Severity | Criteria |
|----------|----------|
| P1 | Service down, all users affected |
| P2 | Major feature broken, many users affected |
| P3 | Minor feature broken, some users affected |
| P4 | No user impact, internal only |
- [ ] Severity assigned: ____
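The severity table above can even be turned into a tiny triage helper so the 3am on-call doesn't have to reason from scratch. A sketch in POSIX sh; the function name and thresholds are illustrative assumptions, not a standard, so tune them to your own SLOs:

```shell
# Hypothetical helper: suggest a severity from two quick inputs.
# Thresholds are assumptions -- adjust to match your severity table.
classify_severity() {
  service_down=$1   # "yes" if the service is fully down
  pct_affected=$2   # rough % of users affected
  if [ "$service_down" = "yes" ]; then
    echo "P1"       # service down, all users affected
  elif [ "$pct_affected" -ge 50 ]; then
    echo "P2"       # many users affected
  elif [ "$pct_affected" -gt 0 ]; then
    echo "P3"       # some users affected
  else
    echo "P4"       # no user impact
  fi
}

classify_severity no 60    # prints "P2"
```

A helper like this is a starting point, not a verdict: the on-call still overrides it when impact is ambiguous.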
## Phase 2: Triage Checklist
## Triage Checklist
### Roles (assign for P1/P2)
- [ ] Incident Commander: ____________
- [ ] Communications Lead: ____________
- [ ] Technical Lead: ____________
### Communication
- [ ] Created incident channel: #incident-YYYY-MM-DD-[name]
- [ ] Posted initial update to channel
- [ ] Started incident document
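Generating the `#incident-YYYY-MM-DD-[name]` channel name is worth scripting so it stays consistent across incidents. A minimal sh sketch; the incident name is a made-up sample:

```shell
# Build a channel name in the #incident-YYYY-MM-DD-[name] format.
name="API Latency Spike"    # sample incident name
slug=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
channel="#incident-$(date +%F)-$slug"
echo "$channel"             # e.g. #incident-2025-01-15-api-latency-spike
```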
### Stakeholder Notification
| Stakeholder | P1 | P2 | P3 | Notified |
|-------------|----|----|----|----------|
| Engineering Lead | ✓ | ✓ | | [ ] |
| Customer Success | ✓ | | | [ ] |
| Executive Team | ✓ | | | [ ] |
### Status Page
- [ ] Status page updated (if customer-facing)
- [ ] Initial ETA communicated (or "investigating")
## Phase 3: Diagnosis Checklist
## Diagnosis Checklist
### Quick Diagnostics
```bash
# Service status
kubectl get pods -l app=affected-service -n production
# Recent errors
kubectl logs -l app=affected-service -n production --tail=100 | grep -i error | tail -20
# Resource usage
kubectl top pods -l app=affected-service -n production
```
- [ ] Service status checked
- [ ] Logs reviewed
- [ ] Resource usage assessed
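When reviewing logs, the question is usually "when did this start?". Bucketing error lines by minute answers it quickly; an offline sketch using an inline sample log in place of real service logs:

```shell
# Bucket ERROR log lines by minute to spot when a spike began.
# The here-doc sample stands in for real service logs.
cat <<'EOF' > /tmp/sample.log
2025-01-15T10:01:12Z INFO request ok
2025-01-15T10:02:03Z ERROR upstream timeout
2025-01-15T10:02:41Z ERROR upstream timeout
2025-01-15T10:03:09Z INFO request ok
EOF
# substr($1, 1, 16) keeps the timestamp down to the minute
awk '/ERROR/ { minute = substr($1, 1, 16); count[minute]++ }
     END { for (m in count) print m, count[m] }' /tmp/sample.log
# prints: 2025-01-15T10:02 2
```

Pipe the same awk over `kubectl logs` output to see whether the error spike lines up with a deploy timestamp.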
### Recent Changes
```bash
# Recent deployments
kubectl get deployments -n production -o json | jq '.items[] | select(.metadata.annotations["deployment-time"]) | {name: .metadata.name, deployed: .metadata.annotations["deployment-time"]}'
```
- [ ] Recent deploys reviewed
- [ ] Config changes checked
- [ ] Infrastructure changes checked
### Related Signals
- [ ] Related alerts reviewed
- [ ] Upstream services checked
- [ ] Downstream services checked
### Root Cause
- [ ] Root cause identified: ____________
- [ ] Or: Top 3 suspects identified:
1. ____________
2. ____________
3. ____________
## Phase 4: Resolution Checklist
## Resolution Checklist
### Remediation
- [ ] Remediation plan approved
- [ ] Rollback plan ready (if needed)
### Execution
```bash
# Apply fix (example: rollback)
kubectl rollout undo deployment/api -n production
```
- [ ] Fix applied
- [ ] Change documented in incident channel
### Verification
```bash
# Verify recovery
kubectl get pods -l app=api -n production
curl -s http://api.internal/health | jq '.status'
```
- [ ] Service status verified healthy
- [ ] Error rate returned to normal
- [ ] User-facing functionality confirmed
### Monitoring
- [ ] Monitoring for recurrence (15 min)
- [ ] No new related alerts
- [ ] Performance back to baseline
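The "monitoring for recurrence" items can be scripted as a small watch loop instead of staring at a dashboard. A POSIX sh sketch; `watch_recovery` is a hypothetical helper, and in real use the check command would be your health probe (e.g. `curl -sf http://api.internal/health`):

```shell
# watch_recovery CHECKS INTERVAL CMD...
# Runs CMD every INTERVAL seconds; fails fast on the first failed check,
# succeeds after CHECKS consecutive passes.
watch_recovery() {
  checks=$1; interval=$2; shift 2
  i=0
  while [ "$i" -lt "$checks" ]; do
    "$@" || { echo "check failed after $i passes"; return 1; }
    i=$((i + 1))
    sleep "$interval"
  done
  echo "stable after $checks consecutive checks"
}

# Demo with a trivially passing check and no delay; for a real 15-minute
# watch you might run: watch_recovery 30 30 curl -sf http://api.internal/health
watch_recovery 3 0 true    # prints "stable after 3 consecutive checks"
```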
## Phase 5: Closure Checklist
## Closure Checklist
### External Communication
- [ ] Status page updated to "Resolved"
- [ ] Customer notification sent (if applicable)
- [ ] Support team notified
### Internal Documentation
- [ ] Incident timeline captured
- [ ] Root cause documented
- [ ] Resolution steps documented
- [ ] Incident tagged and categorized
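Capturing the timeline is far easier if notes were timestamped as they happened. A tiny sh helper along these lines can live in the on-call dotfiles; the `note` function, file path, and note text are all illustrative:

```shell
# Append UTC-timestamped entries to a running timeline file.
timeline=/tmp/incident-timeline.md
: > "$timeline"    # start a fresh timeline for this incident
note() { printf '%s  %s\n' "$(date -u +%FT%TZ)" "$*" >> "$timeline"; }

note "error spike traced to latest api deploy"        # sample entries
note "rolled back deployment/api; error rate recovering"
cat "$timeline"
```

Each entry lands with its own timestamp, so the closure step becomes "paste the file into the incident doc" rather than reconstructing times from memory.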
### Follow-up
- [ ] Postmortem scheduled (for P1/P2)
- [ ] Action items created
- [ ] Runbook updated with learnings
- [ ] Alert tuning needed? ____
### Handoff
- [ ] On-call updated on current state
- [ ] Any ongoing monitoring documented
## Severity-Specific Checklists

### P1 (Critical) Additional Items
## P1 Additions
### Escalation
- [ ] Engineering leadership notified
- [ ] Exec team notified
- [ ] War room opened (if remote: bridge call)
### External Communication
- [ ] Customer success briefed
- [ ] Status page updated every 15 min
- [ ] Tweet/social prepared (if needed)
### Post-Incident
- [ ] Postmortem within 48 hours (mandatory)
- [ ] Exec summary prepared
### P3/P4 (Minor) Simplified Checklist
## P3/P4 Simplified Checklist
- [ ] Alert acknowledged
- [ ] Issue verified
- [ ] Fix applied
- [ ] Verified working
- [ ] Ticket created for follow-up
## Digital vs Physical Checklists

### Digital (Recommended)
- Embedded in incident management tools
- Pre-populated with context
- Shareable with the team
- Auditable

### Physical (Backup)
Print and post near on-call workstations:
```
┌─────────────────────────────────────┐
│  INCIDENT RESPONSE QUICK CHECKLIST  │
├─────────────────────────────────────┤
│  □ Acknowledge alert                │
│  □ Verify real issue                │
│  □ Assess severity                  │
│  □ Open incident channel            │
│  □ Notify stakeholders              │
│  □ Run diagnostics                  │
│  □ Apply fix                        │
│  □ Verify resolution                │
│  □ Update status page               │
│  □ Document & close                 │
└─────────────────────────────────────┘
```
## Checklist Anti-Patterns

### ❌ Too Long
100-item checklists don’t get used.

Fix: Keep the core checklist under 20 items. Link to detailed docs.

### ❌ Too Vague
“Check the service” doesn’t help at 3am.

Fix: Include specific commands and criteria.

### ❌ Never Updated
Checklists that reference old systems erode trust.

Fix: Review quarterly. Update after every incident.
## Stew: Executable Checklists
Stew turns incident response checklists into executable workflows:
- Check items as you complete them
- Run diagnostic commands with a click
- Capture output automatically
- Share progress with your team
Your checklist becomes an active incident document.
Join the waitlist and streamline your incident response.