Automating Incident Response Checklists: Tools and Techniques
Manual checklists work, but automated checklists work faster. This guide walks through six levels of incident response automation, from alert-triggered channel creation to auto-generated postmortems.
For checklist templates, see our incident response checklist templates.
Automation Opportunities
Every checklist has automatable and non-automatable items:
| Automatable | Human Required |
|---|---|
| Create incident channel | Assess severity |
| Run diagnostic commands | Identify root cause |
| Page on-call | Decide on remediation |
| Update status page | Approve risky changes |
| Capture timestamps | Communicate with stakeholders |
Level 1: Alert-Triggered Channel Creation
When an alert fires, automatically create an incident channel.
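When the channel name is built from the alert's service, it has to satisfy Slack's naming rules: lowercase letters, digits, hyphens, and underscores only, with a maximum of 80 characters. The sketch below shows one way to clean a service name first. The helper name `incident_channel_name` is hypothetical and not part of any Slack SDK.

```python
import re
from datetime import date

MAX_CHANNEL_NAME_LENGTH = 80  # Slack's documented limit for channel names


def incident_channel_name(service: str, day: date) -> str:
    """Build a Slack-safe channel name such as 'incident-2024-05-01-checkout-api'."""
    # Lowercase, then collapse every run of disallowed characters into one hyphen.
    slug = re.sub(r"[^a-z0-9_-]+", "-", service.lower()).strip("-")
    name = f"incident-{day.isoformat()}-{slug or 'unknown'}"
    # Trim to the length limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")
```

Two incidents for the same service on the same day would produce the same name, so a real bot would probably append a short suffix such as the alert ID.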
Slack Bot Example
# Pseudo-code for incident channel automation
from datetime import date, datetime
@app.on_alert
def create_incident_channel(alert):
    # Create channel
    channel_name = f"incident-{date.today()}-{alert.service}"
    channel = slack.create_channel(channel_name)
    # Invite relevant people
    slack.invite(channel, get_oncall(alert.service))
    slack.invite(channel, get_team(alert.service))
    # Post initial context
    slack.post(channel, f"""
🚨 **Incident Started**
**Alert**: {alert.name}
**Service**: {alert.service}
**Severity**: {alert.severity}
**Time**: {datetime.now()}
**Checklist**: <runbook_url|Open Runbook>
**Quick Commands**:
```
kubectl get pods -l app={alert.service}
kubectl logs -l app={alert.service} --tail=50
```
""")
PagerDuty Webhook
# PagerDuty webhook to trigger automation
webhooks:
  - url: https://your-automation/incident/start
    events:
      - incident.triggered
    headers:
      Authorization: Bearer ${AUTOMATION_TOKEN}
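On the receiving side, the automation endpoint should check the `Authorization` header before acting on an event. The following is a minimal sketch of that check, not a complete handler. It assumes the bearer token scheme from the config above, and it assumes the payload nests the event type under `event.event_type`, which you should verify against the payloads you actually receive. The function name `should_start_incident` is hypothetical.

```python
import hmac


def should_start_incident(headers: dict, payload: dict, expected_token: str) -> bool:
    """Return True only for an authenticated incident.triggered event."""
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {expected_token}"
    # compare_digest avoids leaking the token through timing differences.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return False
    # Assumed payload shape: {"event": {"event_type": "incident.triggered", ...}}
    event = payload.get("event") or {}
    return event.get("event_type") == "incident.triggered"
```

Most web frameworks normalize header names for you; if yours does not, look the header up case-insensitively before calling this function.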
Level 2: Auto-Populated Diagnostics
Run diagnostic commands as soon as the incident starts and post the results to the incident channel.
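The handler in this level relies on a `run_command` helper that the guide leaves undefined. One possible shape for it is sketched below, under a few assumptions: it takes an argument list rather than a single string (so a service name is never interpreted by a shell), it enforces a timeout so a hung command cannot stall the incident channel, and it keeps only the tail of long output. The `MAX_OUTPUT_CHARS` budget is an arbitrary illustrative value.

```python
import asyncio
import sys

MAX_OUTPUT_CHARS = 2500  # assumed budget so one result fits in a chat message


async def run_command(args: list[str], timeout: float = 10.0) -> str:
    """Run a command without a shell and return its output or an error note."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # interleave stderr with stdout
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a stuck diagnostic process behind.
        proc.kill()
        await proc.wait()
        return f"(timed out after {timeout:g}s)"
    text = stdout.decode(errors="replace").strip()
    if proc.returncode != 0:
        text = f"(exit code {proc.returncode})\n{text}"
    # Keep the tail: the most recent lines are usually the relevant ones.
    return text[-MAX_OUTPUT_CHARS:]


async def main() -> None:
    # Stand-in command so the sketch runs without a cluster;
    # in practice pass something like ["kubectl", "get", "pods", "-l", "app=api"].
    print(await run_command([sys.executable, "-c", "print('pods ok')"]))


if __name__ == "__main__":
    asyncio.run(main())
```

Because failures come back as text instead of exceptions, one broken command does not prevent the other diagnostics from being posted.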
import asyncio
@app.on_incident_create
async def run_diagnostics(incident):
    service = incident.service
    # Run all checks concurrently; results come back in the order listed
    diagnostics = await asyncio.gather(
        run_command(f"kubectl get pods -l app={service}"),
        run_command(f"kubectl logs -l app={service} --tail=50"),
        run_command(f"kubectl top pods -l app={service}"),
        fetch_metrics(service, "error_rate", "5m"),
        fetch_metrics(service, "latency_p99", "5m"),
    )
    slack.post(incident.channel, f"""
📊 **Auto-Diagnostics**
**Pod Status**:
```
{diagnostics[0]}
```
**Recent Errors**:
```
{diagnostics[1]}
```
**Resource Usage**:
```
{diagnostics[2]}
```
**Error Rate**: {diagnostics[3]}
**P99 Latency**: {diagnostics[4]}
""")
Level 3: Interactive Checklist Bot
A bot that guides responders through the checklist and tracks their progress.
Slack Workflow
@app.on_command("/incident-checklist")
def start_checklist(channel, user):
    checklist = [
        {"id": "ack", "text": "Alert acknowledged", "done": False},
        {"id": "verify", "text": "Issue verified", "done": False},
        {"id": "severity", "text": "Severity assessed", "done": False},
        {"id": "status", "text": "Status page updated", "done": False},
        {"id": "diagnose", "text": "Root cause identified", "done": False},
        {"id": "fix", "text": "Fix applied", "done": False},
        {"id": "verify_fix", "text": "Fix verified", "done": False},
        {"id": "close", "text": "Incident closed", "done": False},
    ]
    message = render_checklist(checklist)
    slack.post(channel, message, with_buttons=True)
@app.on_button_click
def toggle_checklist_item(item_id, checklist):
    # The checklist is a list of dicts, so look the item up by its "id" field
    item = next(i for i in checklist if i["id"] == item_id)
    item["done"] = not item["done"]
    update_message(render_checklist(checklist))
    # Auto-actions on certain items
    if item_id == "severity" and item["done"]:
        prompt_severity_selection()
    elif item_id == "close" and item["done"]:
        create_incident_record()
Rendered Checklist
📋 Incident Checklist
✅ Alert acknowledged
✅ Issue verified
✅ Severity assessed: P2
⬜ Status page updated [Update Now]
⬜ Root cause identified [Run Diagnostics]
⬜ Fix applied
⬜ Fix verified
⬜ Incident closed
Progress: 3/8 (38%)
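The `render_checklist` call in the bot code is not defined in this guide. A minimal text-only sketch that produces output in the format shown above could look like the following. It assumes the list-of-dicts checklist structure from `start_checklist` and leaves out the action buttons and the severity label, which depend on your chat platform.

```python
def render_checklist(checklist: list[dict]) -> str:
    """Render checklist items as emoji lines followed by a progress summary."""
    lines = ["📋 Incident Checklist"]
    for item in checklist:
        mark = "✅" if item["done"] else "⬜"
        lines.append(f"{mark} {item['text']}")
    done = sum(1 for item in checklist if item["done"])
    total = len(checklist)
    # Integer rounding to the nearest percent; guards against an empty checklist.
    percent = (done * 100 + total // 2) // total if total else 0
    lines.append(f"Progress: {done}/{total} ({percent}%)")
    return "\n".join(lines)
```

Keeping the renderer a pure function of the checklist state makes it easy to re-render the same message after every button click.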
Level 4: Status Page Integration
Automate status page updates based on checklist progress.
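# When severity is set
@app.on_severity_set
def update_status_page(incident, severity):
    if severity in ["P1", "P2"]:
        created = statuspage.create_incident(
            name=f"{incident.service} degradation",
            status="investigating",
            components=[incident.service],
            body="We are investigating reports of issues."
        )
        # Keep the ID so the incident can be resolved later
        # (assumes the client returns an object with an `id` attribute)
        incident.statuspage_id = created.id
# When fix is verified
@app.on_checklist_item("verify_fix")
def resolve_status_page(incident):
    # Lower-severity incidents never opened a status page entry
    if not getattr(incident, "statuspage_id", None):
        return
    statuspage.update_incident(
        incident_id=incident.statuspage_id,
        status="resolved",
        body="The issue has been resolved."
    )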
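The handlers above cover only the first and last public states. If you want intermediate updates as the checklist advances, one option is a small lookup from completed checklist items to a status. The mapping below is an assumption for illustration: it reuses the checklist item IDs from Level 3 and the common status page states `investigating`, `identified`, `monitoring`, and `resolved`. The name `status_for_progress` is hypothetical.

```python
# Ordered from latest stage to earliest; the first completed item wins.
# The mapping itself is an assumption; adjust it to your own process.
STATUS_BY_CHECKLIST_ITEM = [
    ("verify_fix", "resolved"),
    ("fix", "monitoring"),
    ("diagnose", "identified"),
]


def status_for_progress(done_ids: set[str]) -> str:
    """Return the public status implied by the completed checklist item IDs."""
    for item_id, status in STATUS_BY_CHECKLIST_ITEM:
        if item_id in done_ids:
            return status
    return "investigating"
```

Deriving the status from the whole set of completed items, rather than from the last click, keeps the public page correct even when responders tick items out of order.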
Level 5: Automated Remediation Suggestions
Based on diagnostics, suggest likely fixes.
@app.on_diagnostics_complete
def suggest_remediation(incident, diagnostics):
    suggestions = []
    # Pattern matching on diagnostics
    if "OOMKilled" in diagnostics.pod_status:
        suggestions.append({
            "issue": "Pod OOM killed",
            "fix": "Increase memory limits",
            "command": f"kubectl set resources deployment/{incident.service} --limits=memory=1Gi",
            "confidence": "high"
        })
    if "CrashLoopBackOff" in diagnostics.pod_status:
        suggestions.append({
            "issue": "Pod crash loop",
            "fix": "Rollback deployment",
            "command": f"kubectl rollout undo deployment/{incident.service}",
            "confidence": "medium"
        })
    if diagnostics.error_rate > 0.5:
        suggestions.append({
            "issue": "High error rate",
            "fix": "Check recent deployments",
            "command": f"kubectl rollout history deployment/{incident.service}",
            "confidence": "medium"
        })
    post_suggestions(incident.channel, suggestions)
Suggestion UI
🔍 **Suggested Remediations**
Based on diagnostics, here are likely fixes:
1. **Pod OOM killed** (high confidence)
Increase memory limits
kubectl set resources deployment/api --limits=memory=1Gi
[Apply] [Skip]
2. **Pod crash loop** (medium confidence)
Rollback deployment
kubectl rollout undo deployment/api
[Apply] [Skip]
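The text portion of this UI can be generated directly from the suggestion dictionaries built in `suggest_remediation`. The sketch below is one way to do it, assuming each suggestion carries the `issue`, `fix`, `command`, and `confidence` keys used above. Sorting by confidence and the name `format_suggestions` are additions for illustration, and the `[Apply]` and `[Skip]` buttons are left to your chat platform.

```python
CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2}


def format_suggestions(suggestions: list[dict]) -> str:
    """Render suggestion dicts as numbered text, highest confidence first."""
    if not suggestions:
        return "🔍 No automated suggestions matched the diagnostics."
    # sorted() is stable, so equal-confidence suggestions keep their original order.
    ranked = sorted(
        suggestions,
        key=lambda s: CONFIDENCE_ORDER.get(s["confidence"], len(CONFIDENCE_ORDER)),
    )
    lines = ["🔍 **Suggested Remediations**", "Based on diagnostics, here are likely fixes:"]
    for number, suggestion in enumerate(ranked, start=1):
        lines.append(f"{number}. **{suggestion['issue']}** ({suggestion['confidence']} confidence)")
        lines.append(f"   {suggestion['fix']}")
        lines.append(f"   {suggestion['command']}")
    return "\n".join(lines)
```

Commands are shown as text only; applying one should still go through an explicit human approval step, since approving risky changes sits in the "Human Required" column.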
Level 6: Postmortem Auto-Generation
Automatically create a postmortem draft from incident data.
@app.on_incident_close
def generate_postmortem(incident):
    template = f"""
# Incident Postmortem: {incident.title}
## Summary
- **Duration**: {incident.duration}
- **Severity**: {incident.severity}
- **Services Affected**: {incident.services}
- **User Impact**: {incident.user_impact}
## Timeline
{format_timeline(incident.events)}
## Root Cause
[To be filled]
## Resolution
{incident.resolution_notes}
## Action Items
- [ ] [To be filled based on discussion]
## Lessons Learned
[To be filled]
---
*Auto-generated from incident data. Please review and complete.*
"""
    create_doc(template)
    schedule_postmortem_meeting(incident)
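The template depends on a `format_timeline` helper that is not shown. A minimal sketch follows, assuming `incident.events` is a list of `(timestamp, description)` pairs with timezone-aware timestamps. That event shape is an assumption, so adapt the unpacking to whatever your incident tracker actually stores.

```python
from datetime import datetime, timezone


def format_timeline(events: list[tuple[datetime, str]]) -> str:
    """Render (timestamp, description) pairs as a chronological Markdown list."""
    lines = []
    for when, description in sorted(events, key=lambda event: event[0]):
        # Normalize to UTC so entries from different responders line up.
        stamp = when.astimezone(timezone.utc).strftime("%H:%M UTC")
        lines.append(f"- **{stamp}** {description}")
    return "\n".join(lines)
```

For incidents that span midnight, include the date in the format string so the ordering stays unambiguous to readers.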
Implementation Roadmap
Week 1-2: Channel Automation
- Alert triggers channel creation
- On-call auto-invited
- Initial context posted
Week 3-4: Diagnostic Automation
- Auto-run diagnostics on incident start
- Results posted to channel
- Runbook linked automatically
Week 5-6: Interactive Checklist
- Slack-based checklist tracking
- Progress visibility
- Timestamp capture
Week 7-8: Integration
- Status page automation
- Postmortem generation
- Metrics capture
Stew: Built for Automation
Stew integrates with your incident workflow:
- Trigger runbooks from alerts
- Execute commands with approval
- Capture results automatically
- Generate postmortem data
Your incident response checklist becomes an automated workflow.
Join the waitlist and automate your incident response.