# From Incident Checklist to Postmortem: Closing the Loop
6 min read · Stew Team
Tags: incident-response · checklist · postmortem · sre
The best postmortems write themselves—if you capture the right data during the incident. Your incident response checklist is the key.
For checklist automation, see our incident response checklist automation guide.
## The Data Postmortems Need
Every postmortem answers these questions:
- What happened? (Timeline)
- Why did it happen? (Root cause)
- What was the impact? (Scope and duration)
- How did we respond? (Actions taken)
- How do we prevent recurrence? (Action items)
Your checklist should capture this data in real time.
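These five questions suggest a simple data model for live capture. A minimal sketch (field names are illustrative, not any particular tool's schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IncidentRecord:
    """Data captured during the incident that the postmortem will need."""
    title: str
    timeline: list = field(default_factory=list)       # what happened
    root_cause: Optional[str] = None                   # why it happened
    severity: Optional[str] = None                     # impact: scope
    duration_minutes: Optional[int] = None             # impact: duration
    actions_taken: list = field(default_factory=list)  # how we responded
    action_items: list = field(default_factory=list)   # preventing recurrence

# Fields fill in as the incident unfolds, not after it ends.
incident = IncidentRecord(title="API Outage 2024-01-15", severity="P1")
incident.timeline.append(("10:05", "Alert: APIHighErrorRate"))
```

If every field is populated before the incident channel goes quiet, the postmortem draft is mostly done.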
## Checklist Design for Postmortems
### Timestamp Everything
```markdown
## Incident Timeline Checklist
- [ ] Alert triggered: ____:____
- [ ] Alert acknowledged: ____:____
- [ ] Incident channel created: ____:____
- [ ] Severity determined: ____:____
- [ ] Root cause identified: ____:____
- [ ] Fix applied: ____:____
- [ ] Fix verified: ____:____
- [ ] Incident resolved: ____:____
```
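Filling in those blanks by hand is easy to forget mid-incident. A sketch of a helper that stamps each step the moment it's checked off (illustrative only, not Stew's API):

```python
from datetime import datetime, timezone

class TimelineChecklist:
    """Records a UTC timestamp the moment each step is checked off."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = {}  # step name -> "HH:MM"

    def check(self, step, now=None):
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        ts = (now or datetime.now(timezone.utc)).strftime("%H:%M")
        self.completed[step] = ts
        return ts

checklist = TimelineChecklist([
    "Alert triggered", "Alert acknowledged", "Incident channel created",
    "Severity determined", "Root cause identified", "Fix applied",
    "Fix verified", "Incident resolved",
])
# `now` is passed explicitly here for reproducibility; omit it in real use.
checklist.check("Alert acknowledged",
                now=datetime(2024, 1, 15, 10, 8, tzinfo=timezone.utc))
```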
### Capture Decisions
```markdown
## Decision Log
### Decision 1
- **Time**: ____:____
- **Decision**: ____________
- **Alternatives considered**: ____________
- **Rationale**: ____________
- **Made by**: ____________
### Decision 2
- **Time**: ____:____
- **Decision**: ____________
- **Alternatives considered**: ____________
- **Rationale**: ____________
- **Made by**: ____________
```
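Decision entries are structured enough to generate rather than hand-write. A sketch of a formatter matching the template above (the example decision is hypothetical):

```python
def format_decision(n, time, decision, alternatives, rationale, made_by):
    """Render one decision-log entry in the template's markdown shape."""
    return (
        f"### Decision {n}\n"
        f"- **Time**: {time}\n"
        f"- **Decision**: {decision}\n"
        f"- **Alternatives considered**: {alternatives}\n"
        f"- **Rationale**: {rationale}\n"
        f"- **Made by**: {made_by}\n"
    )

entry = format_decision(
    1, "10:20", "Raise memory limits rather than roll back",
    "Roll back v2.3.1", "Rollback would re-trigger a slow deploy", "Bob",
)
```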
### Record Commands and Output
````markdown
## Diagnostic Commands Run
### Command 1
```bash
kubectl get pods -l app=api
```
**Output summary**: 2/3 pods in CrashLoopBackOff
**Time**: 10:15
### Command 2
```bash
kubectl logs api-xyz --previous
```
**Output summary**: OOM kill at 10:12
**Time**: 10:17
````
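Recording commands can be automated too. A sketch of a wrapper that logs each diagnostic command with a UTC timestamp (the one-line summary here is just first-line truncation; real tooling would summarize more carefully):

```python
import subprocess
from datetime import datetime, timezone

command_log = []

def run_logged(cmd):
    """Run a shell command; log the command, time, exit code, and a summary."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    first_line = (result.stdout or result.stderr).strip().splitlines()[:1]
    command_log.append({
        "time": datetime.now(timezone.utc).strftime("%H:%M"),
        "command": cmd,
        "summary": first_line[0] if first_line else "(no output)",
        "exit_code": result.returncode,
    })
    return result

# Stand-in for a real diagnostic command like `kubectl get pods -l app=api`.
run_logged("echo 2/3 pods in CrashLoopBackOff")
```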
## Incident-to-Postmortem Workflow
### During Incident
```markdown
# Incident: API Outage 2024-01-15
## Quick Facts (fill as you go)
- **Started**: 10:05 UTC
- **Severity**: P1
- **Services**: API, Worker
- **User Impact**: All users unable to checkout
## Timeline (add entries as they happen)
| Time | Event | Who |
|------|-------|-----|
| 10:05 | Alert: APIHighErrorRate | PagerDuty |
| 10:08 | Acknowledged | Alice |
| 10:10 | Incident channel created | Alice |
| 10:15 | Identified pods crashing | Alice |
| 10:18 | Found OOM in logs | Bob |
| 10:22 | Increased memory limits | Bob |
| 10:25 | Pods recovering | - |
| 10:30 | Confirmed resolution | Alice |
## Root Cause Notes
- Pods hitting memory limits
- Recent deploy increased memory usage
- No load testing on new feature
## What Worked
- Quick detection (3 min)
- Clear logs pointed to OOM
## What Didn't Work
- Memory limits not updated in deploy
- No canary deployment
```
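Because every row in that timeline carries a timestamp, facts the postmortem needs, like the 25-minute duration, can be derived rather than recalled. A sketch:

```python
from datetime import datetime

def duration_minutes(timeline):
    """Incident duration from first to last timeline timestamp (same-day HH:MM)."""
    times = [datetime.strptime(t, "%H:%M") for t, *_ in timeline]
    return int((max(times) - min(times)).total_seconds() // 60)

timeline = [
    ("10:05", "Alert: APIHighErrorRate", "PagerDuty"),
    ("10:30", "Confirmed resolution", "Alice"),
]
duration_minutes(timeline)  # 25
```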
### Convert to Postmortem
````markdown
# Postmortem: API Outage 2024-01-15
## Summary
On January 15, 2024, the API service experienced a 25-minute outage
affecting all users. The root cause was increased memory usage from
a new feature deployment that exceeded pod memory limits.
## Impact
- **Duration**: 25 minutes (10:05 - 10:30 UTC)
- **Users Affected**: All (~50,000 active)
- **Revenue Impact**: ~$12,500 (estimated)
- **SLA Impact**: 25 minutes of downtime (~0.06% of the month); against a 99.9% SLO, roughly 58% of the monthly error budget
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 10:02 | Deploy of v2.3.1 with new feature completed |
| 10:05 | First OOM kill; alert triggered |
| 10:08 | On-call (Alice) acknowledged |
| 10:15 | Identified CrashLoopBackOff in pods |
| 10:18 | Root cause identified: OOM kills |
| 10:22 | Applied increased memory limits |
| 10:30 | All pods healthy; incident resolved |
## Root Cause
The v2.3.1 deployment included a new image processing feature that
increased baseline memory usage by 40%. The deployment's memory
limits were not updated to accommodate this increase.

When pods approached the memory limit under normal load, they were
OOM killed by Kubernetes. This triggered a CrashLoopBackOff as
pods repeatedly started and were killed.
## Contributing Factors
1. **No load testing**: New feature wasn't tested under production load
2. **Static memory limits**: Limits set 6 months ago, never revisited
3. **No canary deployment**: Full rollout instead of gradual
4. **Inadequate code review**: Memory impact not caught in review
## Resolution
**Immediate**: Increased memory limits from 512Mi to 1Gi
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add memory profiling to CI | Bob | 2024-01-22 | TODO |
| Implement canary deployments | Alice | 2024-02-01 | TODO |
| Create load testing for new features | Charlie | 2024-01-29 | TODO |
| Add memory usage to deploy checklist | Alice | 2024-01-17 | DONE |
## Lessons Learned
### What went well
- Fast detection (alert fired 3 minutes after the deploy completed)
- Clear error messages in logs
- Quick diagnosis once logs were checked
### What went poorly
- No pre-deploy validation of resource requirements
- No gradual rollout to catch issues early
- Memory limits were set-and-forget
## Appendix
### Commands Run During Incident
```bash
kubectl get pods -l app=api
# Output: 2/3 pods CrashLoopBackOff
kubectl logs api-7d4f8b6c9-x2k4j --previous | tail -20
# Output: "OOM killed" at 10:12
kubectl describe pod api-7d4f8b6c9-x2k4j | grep -A5 "Last State"
# Output: OOMKilled, exit code 137
```
````
## Checklist Items That Feed Postmortems
| Checklist Item | Postmortem Section |
|---|---|
| Alert timestamp | Timeline |
| Severity | Impact |
| Services affected | Impact |
| User impact estimate | Impact |
| Root cause notes | Root Cause Analysis |
| Commands run | Appendix |
| Decisions made | Contributing Factors |
| What worked/didn’t | Lessons Learned |
| Resolution steps | Resolution |
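This mapping is mechanical, which is exactly what makes the conversion automatable. Expressed as data (field names here are illustrative):

```python
# Which checklist field feeds which postmortem section.
CHECKLIST_TO_POSTMORTEM = {
    "alert_timestamp": "Timeline",
    "severity": "Impact",
    "services_affected": "Impact",
    "user_impact_estimate": "Impact",
    "root_cause_notes": "Root Cause Analysis",
    "commands_run": "Appendix",
    "decisions_made": "Contributing Factors",
    "what_worked_didnt": "Lessons Learned",
    "resolution_steps": "Resolution",
}

def fields_for_section(section):
    """All checklist fields that feed a given postmortem section."""
    return [f for f, s in CHECKLIST_TO_POSTMORTEM.items() if s == section]
```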
## Automating the Connection
### Auto-Generate Postmortem Draft
```python
def generate_postmortem(incident):
    return f"""
# Postmortem: {incident.title}
## Summary
[Summarize: {incident.service} experienced issues from
{incident.start_time} to {incident.end_time}]
## Impact
- **Duration**: {incident.duration}
- **Severity**: {incident.severity}
- **Services**: {', '.join(incident.services)}
- **User Impact**: {incident.user_impact or '[To be determined]'}
## Timeline
{format_timeline(incident.events)}
## Root Cause
{incident.root_cause or '[To be determined in postmortem meeting]'}
## Action Items
{format_action_items(incident.action_items) or '- [ ] [To be determined]'}
## Lessons Learned
### What went well
{incident.what_worked or '- [To be discussed]'}
### What went poorly
{incident.what_didnt_work or '- [To be discussed]'}
## Appendix
### Commands Run
{format_commands(incident.commands)}
"""
```
## Postmortem Checklist
After the incident, before the postmortem meeting:
```markdown
## Pre-Postmortem Checklist
### Data Collection
- [ ] Timeline verified with timestamps
- [ ] All participants' notes gathered
- [ ] Monitoring data exported
- [ ] Relevant logs saved
### Draft Preparation
- [ ] Summary written
- [ ] Impact calculated
- [ ] Timeline finalized
- [ ] Initial action items drafted
### Meeting Prep
- [ ] All participants invited
- [ ] Draft shared in advance
- [ ] No-blame reminder included
- [ ] Action item owners pre-identified
```
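A few lines of scripting can verify nothing is still open before the meeting. This sketch assumes the GitHub-style `- [ ]` / `- [x]` checkbox syntax used throughout this post:

```python
import re

def unchecked_items(markdown):
    """Return the text of every '- [ ]' item still open in a checklist doc."""
    return re.findall(r"^\s*- \[ \] (.+)$", markdown, flags=re.MULTILINE)

doc = """\
- [x] Timeline verified with timestamps
- [ ] Monitoring data exported
"""
unchecked_items(doc)  # ["Monitoring data exported"]
```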
## Stew: Incident to Postmortem
Stew captures everything you need for postmortems:
- Timestamps on every action
- Command outputs saved
- Decision points documented
- One-click postmortem export
Your incident checklist becomes your postmortem foundation.
Join the waitlist and simplify your incident retrospectives.