
From Incident Checklist to Postmortem: Closing the Loop

· 6 min read · Stew Team
incident-response · checklist · postmortem · sre

The best postmortems write themselves—if you capture the right data during the incident. Your incident response checklist is the key.

For checklist automation, see our incident response checklist automation guide.

The Data Postmortems Need

Every postmortem answers these questions:

  1. What happened? (Timeline)
  2. Why did it happen? (Root cause)
  3. What was the impact? (Scope and duration)
  4. How did we respond? (Actions taken)
  5. How do we prevent recurrence? (Action items)

Your checklist should capture this data in real time, while the incident is still unfolding.

Checklist Design for Postmortems

Timestamp Everything

## Incident Timeline Checklist

- [ ] Alert triggered: ____:____
- [ ] Alert acknowledged: ____:____
- [ ] Incident channel created: ____:____
- [ ] Severity determined: ____:____
- [ ] Root cause identified: ____:____
- [ ] Fix applied: ____:____
- [ ] Fix verified: ____:____
- [ ] Incident resolved: ____:____
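Filling in blanks by hand mid-incident is easy to get wrong; a few lines of code can stamp entries the moment an item is checked off. A minimal Python sketch (the `TimelineLog` class and its method names are illustrative, not from any particular tool):

```python
from datetime import datetime, timezone

class TimelineLog:
    """Collects timestamped checklist events in the order they happen."""

    def __init__(self):
        self.events = []

    def check(self, item, who="-"):
        # Stamp the item with the current UTC time as it is checked off.
        stamp = datetime.now(timezone.utc).strftime("%H:%M")
        self.events.append((stamp, item, who))
        return stamp

    def as_markdown(self):
        # Render the captured events as a postmortem-ready timeline table.
        rows = ["| Time | Event | Who |", "|------|-------|-----|"]
        rows += [f"| {t} | {e} | {w} |" for t, e, w in self.events]
        return "\n".join(rows)

log = TimelineLog()
log.check("Alert acknowledged", who="Alice")
print(log.as_markdown())
```

The rendered table drops straight into the postmortem timeline section later.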

Capture Decisions

## Decision Log

### Decision 1
- **Time**: ____:____
- **Decision**: ____________
- **Alternatives considered**: ____________
- **Rationale**: ____________
- **Made by**: ____________

### Decision 2
- **Time**: ____:____
- **Decision**: ____________
- **Alternatives considered**: ____________
- **Rationale**: ____________
- **Made by**: ____________

Record Commands and Output

## Diagnostic Commands Run

### Command 1
```bash
kubectl get pods -l app=api
```
**Output summary**: 2/3 pods in CrashLoopBackOff
**Time**: 10:15

### Command 2
```bash
kubectl logs api-xyz --previous
```
**Output summary**: OOM kill at 10:12
**Time**: 10:17
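Remembering to copy output into the log is the part that slips during a firefight. A minimal Python sketch of a logging wrapper (the `run_logged` helper and the log's shape are assumptions for illustration):

```python
import subprocess
from datetime import datetime, timezone

def run_logged(cmd, log):
    """Run a shell command and append it, timestamped, to the incident log."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    log.append({"time": stamp, "cmd": cmd, "output": result.stdout.strip()})
    return result

log = []
# Stand-in for a real kubectl call during an incident.
run_logged("echo '2/3 pods in CrashLoopBackOff'", log)
print(log[0]["cmd"], "->", log[0]["output"])
```

Each entry already has the command, its output summary, and a timestamp, so the appendix writes itself.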

Incident-to-Postmortem Workflow

During Incident

# Incident: API Outage 2024-01-15

## Quick Facts (fill as you go)
- **Started**: 10:05 UTC
- **Severity**: P1
- **Services**: API, Worker
- **User Impact**: All users unable to checkout

## Timeline (add entries as they happen)

| Time | Event | Who |
|------|-------|-----|
| 10:05 | Alert: APIHighErrorRate | PagerDuty |
| 10:08 | Acknowledged | Alice |
| 10:10 | Incident channel created | Alice |
| 10:15 | Identified pods crashing | Alice |
| 10:18 | Found OOM in logs | Bob |
| 10:22 | Increased memory limits | Bob |
| 10:25 | Pods recovering | - |
| 10:30 | Confirmed resolution | Alice |

## Root Cause Notes
- Pods hitting memory limits
- Recent deploy increased memory usage
- No load testing on new feature

## What Worked
- Quick detection (3 min)
- Clear logs pointed to OOM

## What Didn't Work
- Memory limits not updated in deploy
- No canary deployment

Convert to Postmortem

# Postmortem: API Outage 2024-01-15

## Summary

On January 15, 2024, the API service experienced a 25-minute outage 
affecting all users. The root cause was increased memory usage from 
a new feature deployment that exceeded pod memory limits.

## Impact

- **Duration**: 25 minutes (10:05 - 10:30 UTC)
- **Users Affected**: All (~50,000 active)
- **Revenue Impact**: ~$12,500 (estimated)
- **SLA Impact**: 0.05% of monthly error budget consumed
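Duration-based figures like these come from straightforward arithmetic; a minimal sketch of the downtime-fraction calculation (the 30-day month is an assumption for illustration):

```python
# Outage window from the impact summary above.
outage_minutes = 25
month_minutes = 30 * 24 * 60  # 43,200 minutes in a 30-day month (assumed)

# Fraction of the month spent in outage.
downtime_fraction = outage_minutes / month_minutes
print(f"{downtime_fraction:.4%}")  # prints 0.0579%
```

Dividing that fraction by your SLO's allowed-downtime fraction gives the share of error budget consumed.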

## Timeline

| Time (UTC) | Event |
|------------|-------|
| 10:02 | Deploy of v2.3.1 with new feature completed |
| 10:05 | First OOM kill; alert triggered |
| 10:08 | On-call (Alice) acknowledged |
| 10:15 | Identified CrashLoopBackOff in pods |
| 10:18 | Root cause identified: OOM kills |
| 10:22 | Applied increased memory limits |
| 10:30 | All pods healthy; incident resolved |

## Root Cause

The v2.3.1 deployment included a new image processing feature that 
increased baseline memory usage by 40%. The deployment's memory 
limits were not updated to accommodate this increase.

When pods approached the memory limit under normal load, they were 
OOM killed by Kubernetes. This triggered a CrashLoopBackOff as 
pods repeatedly started and were killed.

## Contributing Factors

1. **No load testing**: New feature wasn't tested under production load
2. **Static memory limits**: Limits set 6 months ago, never revisited
3. **No canary deployment**: Full rollout instead of gradual
4. **Inadequate code review**: Memory impact not caught in review

## Resolution

Immediate: Increased memory limits from 512Mi to 1Gi

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add memory profiling to CI | Bob | 2024-01-22 | TODO |
| Implement canary deployments | Alice | 2024-02-01 | TODO |
| Create load testing for new features | Charlie | 2024-01-29 | TODO |
| Add memory usage to deploy checklist | Alice | 2024-01-17 | DONE |
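Action-item tables go stale quickly without follow-up. A minimal Python sketch for flagging open, past-due items (the `overdue` helper and the item shape are illustrative):

```python
from datetime import date

def overdue(items, today):
    """Return open action items whose due date has already passed."""
    return [item for item in items
            if item["status"] != "DONE"
            and date.fromisoformat(item["due"]) < today]

items = [
    {"action": "Add memory profiling to CI", "due": "2024-01-22", "status": "TODO"},
    {"action": "Add memory usage to deploy checklist", "due": "2024-01-17", "status": "DONE"},
]
late = overdue(items, date(2024, 2, 1))
print([item["action"] for item in late])  # only the open, past-due item remains
```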

## Lessons Learned

### What went well
- Fast detection (3 minutes from first OOM to alert)
- Clear error messages in logs
- Quick diagnosis once logs were checked

### What went poorly
- No pre-deploy validation of resource requirements
- No gradual rollout to catch issues early
- Memory limits were set-and-forget

## Appendix

### Commands Run During Incident

```bash
kubectl get pods -l app=api
# Output: 2/3 pods CrashLoopBackOff

kubectl logs api-7d4f8b6c9-x2k4j --previous | tail -20
# Output: "OOM killed" at 10:12

kubectl describe pod api-7d4f8b6c9-x2k4j | grep -A5 "Last State"
# Output: OOMKilled, exit code 137
```

Checklist Items That Feed Postmortems

| Checklist Item | Postmortem Section |
|----------------|--------------------|
| Alert timestamp | Timeline |
| Severity | Impact |
| Services affected | Impact |
| User impact estimate | Impact |
| Root cause notes | Root Cause Analysis |
| Commands run | Appendix |
| Decisions made | Contributing Factors |
| What worked/didn't | Lessons Learned |
| Resolution steps | Resolution |
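This mapping can be encoded directly, so captured checklist data lands in the right draft section automatically. A minimal Python sketch (the field names are illustrative):

```python
# Checklist field -> postmortem section (field names are illustrative).
FIELD_TO_SECTION = {
    "alert_timestamp": "Timeline",
    "severity": "Impact",
    "services_affected": "Impact",
    "user_impact": "Impact",
    "root_cause_notes": "Root Cause Analysis",
    "commands_run": "Appendix",
    "decisions": "Contributing Factors",
    "what_worked": "Lessons Learned",
    "resolution_steps": "Resolution",
}

def group_by_section(checklist):
    """Bucket captured checklist values under their postmortem section."""
    sections = {}
    for field, value in checklist.items():
        sections.setdefault(FIELD_TO_SECTION.get(field, "Appendix"), []).append(value)
    return sections

draft = group_by_section({"severity": "P1", "alert_timestamp": "10:05"})
print(draft)  # {'Impact': ['P1'], 'Timeline': ['10:05']}
```

Unmapped fields default to the Appendix rather than being dropped.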

Automating the Connection

Auto-Generate Postmortem Draft

```python
def generate_postmortem(incident):
    # format_timeline, format_action_items, and format_commands are
    # assumed to be defined alongside this generator.
    return f"""
# Postmortem: {incident.title}

## Summary
[Summarize: {incident.service} experienced issues from
{incident.start_time} to {incident.end_time}]

## Impact
- **Duration**: {incident.duration}
- **Severity**: {incident.severity}
- **Services**: {', '.join(incident.services)}
- **User Impact**: {incident.user_impact or '[To be determined]'}

## Timeline
{format_timeline(incident.events)}

## Root Cause
{incident.root_cause or '[To be determined in postmortem meeting]'}

## Action Items
{format_action_items(incident.action_items) or '- [ ] [To be determined]'}

## Lessons Learned
### What went well
{incident.what_worked or '- [To be discussed]'}

### What went poorly
{incident.what_didnt_work or '- [To be discussed]'}

## Appendix
### Commands Run
{format_commands(incident.commands)}
"""
```

Postmortem Checklist

After the incident, before the postmortem meeting:

## Pre-Postmortem Checklist

### Data Collection
- [ ] Timeline verified with timestamps
- [ ] All participants' notes gathered
- [ ] Monitoring data exported
- [ ] Relevant logs saved

### Draft Preparation
- [ ] Summary written
- [ ] Impact calculated
- [ ] Timeline finalized
- [ ] Initial action items drafted

### Meeting Prep
- [ ] All participants invited
- [ ] Draft shared in advance
- [ ] No-blame reminder included
- [ ] Action item owners pre-identified

Stew: Incident to Postmortem

Stew captures everything you need for postmortems:

  • Timestamps on every action
  • Command outputs saved
  • Decision points documented
  • One-click postmortem export

Your incident checklist becomes your postmortem foundation.

Join the waitlist and simplify your incident retrospectives.