โ† Back to blog

Incident Response Checklist Templates: Copy and Customize

· 9 min read · Stew Team
incident-response · checklist · templates · sre

Start with proven templates, then customize for your environment. These incident response checklists cover common scenarios.

For checklist design principles, see our incident response checklist guide.

Template 1: General Incident Response

# General Incident Response Checklist

**Incident ID**: ____________
**Start Time**: ____________
**Severity**: P1 / P2 / P3 / P4

---

## 🚨 Immediate Response (0-5 min)

- [ ] Alert acknowledged
- [ ] Verified issue is real
- [ ] Initial severity assessed
- [ ] Incident channel created: #incident-____________

## 👥 Mobilization (5-10 min)

- [ ] Incident Commander assigned: ____________
- [ ] Relevant team members notified
- [ ] Status page updated (if customer-facing)

## 🔍 Investigation (10-30 min)

### Quick Checks
```bash
# Replace SERVICE with the affected service
kubectl get pods -l app=SERVICE -n production
kubectl logs -l app=SERVICE -n production --tail=100 | grep -i error
```

- [ ] Service health checked
- [ ] Error logs reviewed
- [ ] Recent changes identified

### Recent Changes
- [ ] Last deploy: ____________ at ____________
- [ ] Recent config changes: ____________
- [ ] Infrastructure changes: ____________
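
The blanks above can be filled quickly with a small helper (a sketch; assumes the same `SERVICE` placeholder and `production` namespace used elsewhere in this template):

```bash
# Sketch: summarize recent changes for a service.
recent_changes() {
  local service="$1"
  # Rollout history shows recent deploy revisions and their change-cause
  kubectl rollout history "deployment/${service}" -n production
  # Recent cluster events often surface config or infrastructure changes
  kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
}
```

Usage: `recent_changes SERVICE`.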

### Root Cause
- [ ] Identified: ____________
- [ ] Or investigating: ____________

## 🔧 Resolution

### Fix Applied
- [ ] Remediation: ____________
- [ ] Executed at: ____________
- [ ] By: ____________

### Verification
```bash
kubectl get pods -l app=SERVICE -n production
curl -s http://SERVICE/health
```

- [ ] Service recovered
- [ ] Error rate normalized
- [ ] Monitoring stable for 15 min

## 📝 Closure

- [ ] Status page updated to resolved
- [ ] Stakeholders notified
- [ ] Incident document completed
- [ ] Postmortem scheduled (if P1/P2)
- [ ] Runbook updated with learnings

---

**Resolution Time**: ____________
**Root Cause**: ____________
**Follow-up Actions**: ____________

Template 2: Service Outage

# Service Outage Checklist

**Service**: ____________
**Start Time**: ____________

---

## Immediate (< 2 min)

- [ ] Acknowledge alert
- [ ] Confirm outage (not false positive)
  ```bash
  curl -w "%{http_code}" -s http://SERVICE/health
  ```
- [ ] Page incident commander (if P1)

## Assess Scope (2-5 min)

- [ ] Which endpoints affected?
  ```bash
  for endpoint in /api/v1 /api/v2 /health; do
    echo "$endpoint: $(curl -s -o /dev/null -w '%{http_code}' http://SERVICE$endpoint)"
  done
  ```
- [ ] How many users impacted?
- [ ] What's the business impact?

## Communication (5-10 min)

- [ ] Create incident channel
- [ ] Post initial update
- [ ] Update status page:
  - Component: ____________
  - Status: Major Outage / Partial Outage
  - Message: "We are investigating issues with [SERVICE]"

## Diagnose (10-20 min)

### Infrastructure
```bash
kubectl get pods -l app=SERVICE -n production
kubectl get events -n production --sort-by='.lastTimestamp' | grep SERVICE | tail -10
```
- [ ] Pods status: ____________
- [ ] Recent events: ____________

### Application
```bash
kubectl logs -l app=SERVICE -n production --tail=200 | grep -i "error\|fatal\|panic"
```
- [ ] Application errors: ____________

### Dependencies
```bash
# Check each dependency (curl cannot speak the Postgres/Redis wire protocols)
pg_isready -h database -p 5432
redis-cli -h redis -p 6379 ping
```
- [ ] Database: UP / DOWN
- [ ] Cache: UP / DOWN
- [ ] Other: ____________

## Remediate

### If deployment issue:
```bash
kubectl rollout undo deployment/SERVICE -n production
```
- [ ] Rolled back to: revision ____

### If infrastructure issue:
```bash
kubectl rollout restart deployment/SERVICE -n production
```
- [ ] Restarted at: ____

### If dependency issue:
- [ ] Dependency team engaged
- [ ] Fallback enabled: ____

## Verify Recovery

```bash
# Confirm service responding
curl -s http://SERVICE/health | jq '.'

# Confirm error rate normalized
# (check your monitoring)
```

- [ ] Health check passing
- [ ] Error rate < threshold
- [ ] Latency normal
- [ ] Stable for 15 minutes

## Close Out

- [ ] Status page: Resolved
- [ ] Stakeholders notified
- [ ] Incident documented
- [ ] Postmortem scheduled

Template 3: Database Incident

# Database Incident Checklist

**Database**: ____________
**Issue Type**: Connection / Performance / Replication / Corruption

---

## Assess (< 5 min)

```bash
# Connection check
psql -h DB_HOST -U monitor -c "SELECT 1;"

# Connection count
psql -h DB_HOST -U monitor -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```

- [ ] Database reachable: YES / NO
- [ ] Connection count: ____ / max ____
- [ ] Primary status: UP / DOWN
- [ ] Replica status: UP / DOWN / LAGGING
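
When it is unclear which host is the primary, `pg_is_in_recovery()` settles it (a sketch wrapping the same `psql` and `monitor` access assumed by this template):

```bash
# Prints 'f' on the primary, 't' on a replica (pg_is_in_recovery()).
db_role() {
  psql -h "$1" -U monitor -tAc "SELECT pg_is_in_recovery();"
}
```

Usage: `db_role DB_HOST`.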

## Connection Issues

```bash
# Current connections by app
psql -h DB_HOST -U monitor -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

# Terminate connections that have been idle for more than 30 minutes
psql -h DB_HOST -U admin -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '30 minutes';"
```

- [ ] Identified connection source: ____________
- [ ] Terminated stale connections: ____
- [ ] Notified application team

## Performance Issues

```bash
# Long-running queries
psql -h DB_HOST -U monitor -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5;"

# Blocked queries (simplified: matches on relation only)
psql -h DB_HOST -U monitor -c "SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid FROM pg_locks blocked_locks JOIN pg_locks blocking_locks ON blocking_locks.relation = blocked_locks.relation AND blocking_locks.pid != blocked_locks.pid WHERE NOT blocked_locks.granted AND blocking_locks.granted;"
```

- [ ] Long-running query: PID ____ (duration: ____)
- [ ] Blocking query: PID ____
- [ ] Query cancelled: PID ____
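
If a runaway query has to be stopped, `pg_cancel_backend` ends just the query while `pg_terminate_backend` drops the whole session; trying the gentler cancel first is usually safer. A sketch, assuming the `admin` role used earlier:

```bash
# Cancel the running query on a backend PID without killing its session.
cancel_query() {
  psql -h "$1" -U admin -tAc "SELECT pg_cancel_backend($2);"
}
```

Usage: `cancel_query DB_HOST PID`.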

## Replication Issues

```bash
# Check lag
psql -h DB_REPLICA -U monitor -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

# Check replication state
psql -h DB_PRIMARY -U monitor -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
```

- [ ] Replica lag: ____
- [ ] Replication state: ____________
- [ ] Failover needed: YES / NO
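
If failover is the call, PostgreSQL 12+ can promote a replica from SQL with `pg_promote()`; older versions use `pg_ctl promote` on the replica host. A sketch only (assumes superuser-equivalent `admin` access, and that repointing clients is handled separately):

```bash
# Promote DB_REPLICA to primary; pg_promote() waits up to 60s by default.
promote_replica() {
  psql -h "$1" -U admin -tAc "SELECT pg_promote();"
}
```

Usage: `promote_replica DB_REPLICA`, then verify with `pg_is_in_recovery()` returning `f`.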

## Resolution Verification

```bash
psql -h DB_HOST -U monitor -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
```

- [ ] Active queries normal
- [ ] Application connections restored
- [ ] Performance baseline restored
- [ ] Replication caught up

## Escalation Contacts

| Role | Name | Contact |
|------|------|---------|
| DBA Primary | ____ | ____ |
| DBA Secondary | ____ | ____ |
| Database Vendor | ____ | ____ |

Template 4: Security Incident

# Security Incident Checklist

**Type**: Unauthorized Access / Data Breach / Malware / DDoS / Other
**Severity**: Critical / High / Medium / Low
**Discovered**: ____________

---

## ⚠️ CRITICAL: Do Not

- [ ] Do NOT discuss on public channels
- [ ] Do NOT alert potential attacker
- [ ] Do NOT destroy evidence

## Immediate Containment (< 15 min)

- [ ] Security team paged
- [ ] Secure incident channel created (limited access)
- [ ] Initial scope assessed

### If Unauthorized Access:
```bash
# Revoke compromised credentials
# (Do this ONLY after confirming with security team)
```
- [ ] Compromised accounts identified
- [ ] Credentials rotated
- [ ] Sessions terminated

### If Active Attack:
- [ ] Attack vector identified
- [ ] Blocking rules applied
- [ ] Affected systems isolated

## Evidence Preservation

- [ ] Logs preserved before rotation
  ```bash
  # Copy logs to secure location
  kubectl logs -l app=AFFECTED --all-containers > /secure/incident-logs-$(date +%Y%m%d).txt
  ```
- [ ] Screenshots captured
- [ ] Timeline documented
- [ ] Memory dump (if needed)
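
Consistent, sortable names help when evidence is collected from several systems; a small helper (UTC timestamps so files sort chronologically regardless of the collector's time zone):

```bash
# Build a timestamped evidence filename, e.g. incident-api-20240101T120000Z.txt
evidence_file() {
  printf 'incident-%s-%s.txt' "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
}
```

Usage: `kubectl logs -l app=AFFECTED --all-containers > "/secure/$(evidence_file AFFECTED)"`.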

## Scope Assessment

- [ ] Which systems affected?
- [ ] What data potentially exposed?
- [ ] How long was access available?
- [ ] Number of affected users/records

## Stakeholder Notification

| Stakeholder | Required | Notified |
|-------------|----------|----------|
| Security Team | Always | [ ] |
| Legal | Data breach | [ ] |
| Executive | P1/P2 | [ ] |
| PR/Comms | Public impact | [ ] |
| Customers | If required | [ ] |

## Remediation

- [ ] Root cause identified
- [ ] Vulnerability patched
- [ ] Additional monitoring added
- [ ] Similar vulnerabilities checked

## Post-Incident

- [ ] Formal incident report
- [ ] Legal review (if data breach)
- [ ] Regulatory notification (if required)
- [ ] Customer notification (if required)
- [ ] Security postmortem scheduled

Template 5: Deployment Rollback

# Deployment Rollback Checklist

**Service**: ____________
**Bad Version**: ____________
**Rollback To**: ____________

---

## Pre-Rollback

- [ ] Confirmed issue is deployment-related
  ```bash
  kubectl rollout history deployment/SERVICE -n production
  ```
- [ ] Identified rollback target version
- [ ] Notified team in incident channel

## Execute Rollback

```bash
# Rollback to previous version
kubectl rollout undo deployment/SERVICE -n production

# Or to specific revision
kubectl rollout undo deployment/SERVICE -n production --to-revision=N
```

- [ ] Rollback initiated at: ____
- [ ] Rollback command: ____________

## Monitor Rollback

```bash
kubectl rollout status deployment/SERVICE -n production
```

- [ ] Rollback completed
- [ ] All pods running the rollback target version
- [ ] No pods in crash loop
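
The crash-loop item can be checked in one line (a sketch, with the same `SERVICE` placeholder and namespace as above):

```bash
# Count pods stuck in CrashLoopBackOff; 0 means the rollback looks healthy.
crashloop_count() {
  kubectl get pods -l "app=$1" -n production --no-headers \
    | grep -c CrashLoopBackOff || true
}
```

Usage: `crashloop_count SERVICE`.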

## Verify Recovery

```bash
# Health check
curl -s http://SERVICE/health | jq '.'

# Version check
curl -s http://SERVICE/version
```

- [ ] Health endpoint passing
- [ ] Correct version deployed
- [ ] Error rate normalized
- [ ] Functionality verified

## Post-Rollback

- [ ] Deployment pipeline blocked
- [ ] Bad commit identified
- [ ] Fix in progress: ____________
- [ ] Postmortem scheduled

Stew: Interactive Checklists

Stew transforms these templates into interactive, executable checklists:

  • Commands run with a click
  • Check items as you complete them
  • Output captured for postmortem
  • Share progress in real-time

Join the waitlist and upgrade your incident checklists.