Incident Response Checklist Templates: Copy and Customize
· 9 min read · Stew Team
incident-response · checklist · templates · sre
Start with proven templates, then customize for your environment. These incident response checklists cover common scenarios.
For checklist design principles, see our incident response checklist guide.
Template 1: General Incident Response
# General Incident Response Checklist
**Incident ID**: ____________
**Start Time**: ____________
**Severity**: P1 / P2 / P3 / P4
---
## 🚨 Immediate Response (0-5 min)
- [ ] Alert acknowledged
- [ ] Verified issue is real
- [ ] Initial severity assessed
- [ ] Incident channel created: #incident-____________
## 👥 Mobilization (5-10 min)
- [ ] Incident Commander assigned: ____________
- [ ] Relevant team members notified
- [ ] Status page updated (if customer-facing)
## 🔍 Investigation (10-30 min)
### Quick Checks
```bash
# Replace SERVICE with affected service
kubectl get pods -l app=SERVICE -n production
kubectl logs -l app=SERVICE --tail=100 | grep -i error
```
- [ ] Service health checked
- [ ] Error logs reviewed
- [ ] Recent changes identified
### Recent Changes
- [ ] Last deploy: ____________ at ____________
- [ ] Recent config changes: ____________
- [ ] Infrastructure changes: ____________
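If the service runs as a Kubernetes Deployment, the "last deploy" blank can often be filled from the cluster itself. A minimal sketch (the `app` label and `production` namespace mirror the commands above):

```bash
# Print the creation time of the newest ReplicaSet for the service,
# i.e. roughly when the last deploy rolled out.
last_deploy_time() {
  kubectl get rs -l app="$1" -n production \
    --sort-by=.metadata.creationTimestamp \
    -o custom-columns=CREATED:.metadata.creationTimestamp --no-headers | tail -1
}
```

Call it as `last_deploy_time SERVICE` and compare the timestamp against the incident start time.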
### Root Cause
- [ ] Identified: ____________
- [ ] Or investigating: ____________
## 🔧 Resolution
### Fix Applied
- [ ] Remediation: ____________
- [ ] Executed at: ____________
- [ ] By: ____________
### Verification
```bash
kubectl get pods -l app=SERVICE -n production
curl -s http://SERVICE/health
```
- [ ] Service recovered
- [ ] Error rate normalized
- [ ] Monitoring stable for 15 min
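The "stable for 15 min" item is easy to automate rather than eyeball. A minimal sketch, assuming the service exposes an HTTP health endpoint that returns 200 when healthy:

```bash
# Succeed once the endpoint has returned 200 for N consecutive checks.
# Any failure resets the streak, so a flapping service does not pass.
wait_for_stable() {
  local url=$1 checks=${2:-15} interval=${3:-60} ok=0
  while [ "$ok" -lt "$checks" ]; do
    if [ "$(curl -s -o /dev/null -w '%{http_code}' "$url")" = "200" ]; then
      ok=$((ok + 1))
    else
      ok=0
    fi
    if [ "$ok" -lt "$checks" ]; then sleep "$interval"; fi
  done
  echo "stable: $checks consecutive healthy checks"
}
```

For example, `wait_for_stable http://SERVICE/health 15 60` waits for 15 healthy checks one minute apart.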
## 📋 Closure
- [ ] Status page updated to resolved
- [ ] Stakeholders notified
- [ ] Incident document completed
- [ ] Postmortem scheduled (if P1/P2)
- [ ] Runbook updated with learnings
---
**Resolution Time**: ____________
**Root Cause**: ____________
**Follow-up Actions**: ____________
Template 2: Service Outage
# Service Outage Checklist
**Service**: ____________
**Start Time**: ____________
---
## Immediate (< 2 min)
- [ ] Acknowledge alert
- [ ] Confirm outage (not false positive)
```bash
curl -w "%{http_code}" -s http://SERVICE/health
```
- [ ] Page incident commander (if P1)
## Assess Scope (2-5 min)
- [ ] Which endpoints affected?
```bash
for endpoint in /api/v1 /api/v2 /health; do
  echo "$endpoint: $(curl -s -o /dev/null -w '%{http_code}' http://SERVICE$endpoint)"
done
```
- [ ] How many users impacted?
- [ ] What's the business impact?
## Communication (5-10 min)
- [ ] Create incident channel
- [ ] Post initial update
- [ ] Update status page:
- Component: ____________
- Status: Major Outage / Partial Outage
- Message: "We are investigating issues with [SERVICE]"
## Diagnose (10-20 min)
### Infrastructure
```bash
kubectl get pods -l app=SERVICE -n production
kubectl get events --sort-by='.lastTimestamp' | grep SERVICE | tail -10
```
- [ ] Pods status: ____________
- [ ] Recent events: ____________
### Application
```bash
kubectl logs -l app=SERVICE -n production --tail=200 | grep -i "error\|fatal\|panic"
```
- [ ] Application errors: ____________
### Dependencies
```bash
# Check TCP reachability for each dependency. curl speaks HTTP, so it is
# the wrong tool for raw Postgres/Redis ports; nc -z just tests the port.
nc -z -w 2 database 5432 && echo "database: UP" || echo "database: DOWN"
nc -z -w 2 redis 6379 && echo "redis: UP" || echo "redis: DOWN"
```
- [ ] Database: UP / DOWN
- [ ] Cache: UP / DOWN
- [ ] Other: ____________
## Remediate
### If deployment issue:
```bash
kubectl rollout undo deployment/SERVICE -n production
```
- [ ] Rolled back to: revision ____
### If infrastructure issue:
```bash
kubectl rollout restart deployment/SERVICE -n production
```
- [ ] Restarted at: ____
### If dependency issue:
- [ ] Dependency team engaged
- [ ] Fallback enabled: ____
## Verify Recovery
```bash
# Confirm service responding
curl -s http://SERVICE/health | jq '.'
# Confirm error rate normalized
# (check your monitoring)
```
- [ ] Health check passing
- [ ] Error rate < threshold
- [ ] Latency normal
- [ ] Stable for 15 minutes
## Close Out
- [ ] Status page: Resolved
- [ ] Stakeholders notified
- [ ] Incident documented
- [ ] Postmortem scheduled
Template 3: Database Incident
# Database Incident Checklist
**Database**: ____________
**Issue Type**: Connection / Performance / Replication / Corruption
---
## Assess (< 5 min)
```bash
# Connection check
psql -h DB_HOST -U monitor -c "SELECT 1;"
# Connection count
psql -h DB_HOST -U monitor -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```
- [ ] Database reachable: YES / NO
- [ ] Connection count: ____ / max ____
- [ ] Primary status: UP / DOWN
- [ ] Replica status: UP / DOWN / LAGGING
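The connection-count blank can be filled with one helper. A sketch, assuming a `monitor` role that can read `pg_stat_activity`:

```bash
# Report current connections against the configured ceiling.
conn_utilization() {
  local host=$1 current max
  current=$(psql -h "$host" -U monitor -Atc \
    "SELECT count(*) FROM pg_stat_activity;")
  max=$(psql -h "$host" -U monitor -Atc "SHOW max_connections;")
  echo "connections: $current / $max ($(( current * 100 / max ))%)"
}
```

Run `conn_utilization DB_HOST`; utilization near 100% points at the Connection Issues branch below.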
## Connection Issues
```bash
# Current connections by app
psql -h DB_HOST -U monitor -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
# Terminate connections idle for more than 30 minutes
# (state_change records when the session went idle; query_start does not)
psql -h DB_HOST -U admin -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '30 minutes';"
```
- [ ] Identified connection source: ____________
- [ ] Terminated stale connections: ____
- [ ] Notified application team
## Performance Issues
```bash
# Long-running queries
psql -h DB_HOST -U monitor -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 5;"
# Blocked queries and what is blocking them (PostgreSQL 9.6+)
psql -h DB_HOST -U monitor -c "SELECT pid, pg_blocking_pids(pid) AS blocked_by, left(query, 60) AS query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;"
```
- [ ] Long-running query: PID ____ (duration: ____)
- [ ] Blocking query: PID ____
- [ ] Query cancelled (`pg_cancel_backend`): PID ____
## Replication Issues
```bash
# Check lag
psql -h DB_REPLICA -U monitor -c "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"
# Check replication state
psql -h DB_PRIMARY -U monitor -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
```
- [ ] Replica lag: ____
- [ ] Replication state: ____________
- [ ] Failover needed: YES / NO
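A lag reported as an interval is awkward to compare against a threshold; pulling it as whole seconds makes the failover decision mechanical. A sketch (the `monitor` role is an assumption):

```bash
# Replica lag in whole seconds. pg_last_xact_replay_timestamp() is NULL
# on a primary, so COALESCE keeps the output numeric.
replica_lag_seconds() {
  psql -h "$1" -U monitor -Atc \
    "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int;"
}
```

For example, escalate to the DBA if `replica_lag_seconds DB_REPLICA` exceeds your replication SLO.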
## Resolution Verification
```bash
psql -h DB_HOST -U monitor -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
```
- [ ] Active queries normal
- [ ] Application connections restored
- [ ] Performance baseline restored
- [ ] Replication caught up
## Escalation Contacts
| Role | Name | Contact |
|------|------|---------|
| DBA Primary | ____ | ____ |
| DBA Secondary | ____ | ____ |
| Database Vendor | ____ | ____ |
Template 4: Security Incident
# Security Incident Checklist
**Type**: Unauthorized Access / Data Breach / Malware / DDoS / Other
**Severity**: Critical / High / Medium / Low
**Discovered**: ____________
---
## ⚠️ CRITICAL: Do Not
- [ ] Do NOT discuss on public channels
- [ ] Do NOT alert potential attacker
- [ ] Do NOT destroy evidence
## Immediate Containment (< 15 min)
- [ ] Security team paged
- [ ] Secure incident channel created (limited access)
- [ ] Initial scope assessed
### If Unauthorized Access:
```bash
# Revoke compromised credentials
# (Do this ONLY after confirming with security team)
```
- [ ] Compromised accounts identified
- [ ] Credentials rotated
- [ ] Sessions terminated
### If Active Attack:
- [ ] Attack vector identified
- [ ] Blocking rules applied
- [ ] Affected systems isolated
## Evidence Preservation
- [ ] Logs preserved before rotation
```bash
# Copy logs to secure location
kubectl logs -l app=AFFECTED --all-containers > /secure/incident-logs-$(date +%Y%m%d).txt
```
- [ ] Screenshots captured
- [ ] Timeline documented
- [ ] Memory dump (if needed)
## Scope Assessment
- [ ] Which systems affected?
- [ ] What data potentially exposed?
- [ ] How long was access available?
- [ ] Number of affected users/records
## Stakeholder Notification
| Stakeholder | Required | Notified |
|-------------|----------|----------|
| Security Team | Always | [ ] |
| Legal | Data breach | [ ] |
| Executive | P1/P2 | [ ] |
| PR/Comms | Public impact | [ ] |
| Customers | If required | [ ] |
## Remediation
- [ ] Root cause identified
- [ ] Vulnerability patched
- [ ] Additional monitoring added
- [ ] Similar vulnerabilities checked
## Post-Incident
- [ ] Formal incident report
- [ ] Legal review (if data breach)
- [ ] Regulatory notification (if required)
- [ ] Customer notification (if required)
- [ ] Security postmortem scheduled
Template 5: Deployment Rollback
# Deployment Rollback Checklist
**Service**: ____________
**Bad Version**: ____________
**Rollback To**: ____________
---
## Pre-Rollback
- [ ] Confirmed issue is deployment-related
```bash
kubectl rollout history deployment/SERVICE -n production
```
- [ ] Identified rollback target version
- [ ] Notified team in incident channel
## Execute Rollback
```bash
# Rollback to previous version
kubectl rollout undo deployment/SERVICE -n production
# Or to specific revision
kubectl rollout undo deployment/SERVICE -n production --to-revision=N
```
- [ ] Rollback initiated at: ____
- [ ] Rollback command: ____________
## Monitor Rollback
```bash
kubectl rollout status deployment/SERVICE -n production
```
- [ ] Rollback completed
- [ ] All pods running the rollback target version
- [ ] No pods in crash loop
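The crash-loop check can be scripted against `kubectl` output rather than scanned by eye. A sketch; the restart threshold of 3 is arbitrary:

```bash
# Print pods whose restart count meets the threshold, reading
# NAME/RESTARTS columns (with a header row) on stdin.
flag_crashloops() {
  awk -v t="${1:-3}" 'NR > 1 && $2 >= t { print $1 }'
}
```

Pipe `kubectl get pods -l app=SERVICE -n production -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount` into `flag_crashloops 3`; no output means no pod is restart-looping.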
## Verify Recovery
```bash
# Health check
curl -s http://SERVICE/health | jq '.'
# Version check
curl -s http://SERVICE/version
```
- [ ] Health endpoint passing
- [ ] Correct version deployed
- [ ] Error rate normalized
- [ ] Functionality verified
## Post-Rollback
- [ ] Deployment pipeline blocked
- [ ] Bad commit identified
- [ ] Fix in progress: ____________
- [ ] Postmortem scheduled
Stew: Interactive Checklists
Stew transforms these templates into interactive, executable checklists:
- Commands run with a click
- Check items as you complete them
- Output captured for postmortem
- Share progress in real-time
Join the waitlist and upgrade your incident checklists.