Automating Incident Response Checklists: Tools and Techniques

Manual checklists work, but automating their mechanical steps makes them faster. This guide covers six levels of automation for your incident response workflow, from alert-triggered channel creation to postmortem generation.

For checklist templates, see our incident response checklist templates.

Automation Opportunities

Every checklist has automatable and non-automatable items:

Automatable	Human Required
Create incident channel	Assess severity
Run diagnostic commands	Identify root cause
Page on-call	Decide on remediation
Update status page	Approve risky changes
Capture timestamps	Communicate with stakeholders

Level 1: Alert-Triggered Channel Creation

When an alert fires, automatically create an incident channel.

Slack Bot Example

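# Pseudo-code for incident channel automation
# (app, slack, get_oncall, and get_team stand in for your own integrations)
from datetime import date, datetime

@app.on_alert
def create_incident_channel(alert):
    # Create channel
    channel_name = f"incident-{date.today()}-{alert.service}"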
    channel = slack.create_channel(channel_name)
    
    # Invite relevant people
    slack.invite(channel, get_oncall(alert.service))
    slack.invite(channel, get_team(alert.service))
    
    # Post initial context
    slack.post(channel, f"""
    🚨 **Incident Started**
    
    **Alert**: {alert.name}
    **Service**: {alert.service}
    **Severity**: {alert.severity}
    **Time**: {datetime.now()}
    
    **Checklist**: <runbook_url|Open Runbook>
    
    **Quick Commands**:
    ```
    kubectl get pods -l app={alert.service}
    kubectl logs -l app={alert.service} --tail=50
    ```
    """)
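
Slack only accepts channel names made of lowercase letters, digits, hyphens, and underscores, up to 80 characters, so a service name such as `Payments.API` would make the channel creation above fail. The following is a minimal sketch of a name builder; `incident_channel_name` is a hypothetical helper, not part of any Slack SDK.

```python
import re
from datetime import date

MAX_CHANNEL_NAME_LENGTH = 80  # Slack's documented limit for channel names


def incident_channel_name(service: str, day: date) -> str:
    """Build a Slack-safe channel name such as 'incident-2024-03-05-payments-api'."""
    # Lowercase, then collapse every run of disallowed characters into one hyphen.
    slug = re.sub(r"[^a-z0-9_-]+", "-", service.lower()).strip("-")
    name = f"incident-{day.isoformat()}-{slug or 'unknown'}"
    # Truncate to the limit and avoid ending on a dangling hyphen.
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")
```

Two incidents on the same day for the same service would still collide, so a real bot would likely append a short incident ID or counter.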

PagerDuty Webhook

# PagerDuty webhook to trigger automation
webhooks:
  - url: https://your-automation/incident/start
    events:
      - incident.triggered
    headers:
      Authorization: Bearer ${AUTOMATION_TOKEN}
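
On the receiving side, the automation endpoint needs to check the bearer token and pick the incident details out of the payload. The sketch below is framework-agnostic and rests on assumptions: `handle_webhook` is a hypothetical function, and the payload layout (`event.event_type`, `event.data`) follows the shape of PagerDuty's v3 webhooks, which you should verify against your own subscription.

```python
import hmac
import json


def handle_webhook(headers: dict, body: bytes, expected_token: str) -> tuple[int, dict]:
    """Validate the bearer token and extract what the automation needs.

    Returns an HTTP status code and a small dict describing the incident.
    Header lookup is case-sensitive here; most web frameworks normalize it.
    """
    supplied = headers.get("Authorization", "")
    # compare_digest avoids leaking token contents through timing differences.
    if not hmac.compare_digest(supplied.encode(), f"Bearer {expected_token}".encode()):
        return 401, {"error": "invalid token"}

    try:
        event = json.loads(body)["event"]
        event_type = event["event_type"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "malformed payload"}

    # Ignore event types the automation did not subscribe to.
    if event_type != "incident.triggered":
        return 204, {}

    data = event.get("data") or {}
    return 200, {
        "incident_id": data.get("id"),
        "title": data.get("title"),
        "service": (data.get("service") or {}).get("summary"),
    }
```

Returning a status code and a plain dict keeps the function easy to call from whichever web framework hosts the endpoint.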

Level 2: Auto-Populated Diagnostics

Run diagnostic commands as soon as the incident starts and post the results to the incident channel.

import asyncio

@app.on_incident_create
async def run_diagnostics(incident):
    service = incident.service
    
    diagnostics = await asyncio.gather(
        run_command(f"kubectl get pods -l app={service}"),
        run_command(f"kubectl logs -l app={service} --tail=50"),
        run_command(f"kubectl top pods -l app={service}"),
        fetch_metrics(service, "error_rate", "5m"),
        fetch_metrics(service, "latency_p99", "5m"),
    )
    
    slack.post(incident.channel, f"""
    📊 **Auto-Diagnostics**
    
    **Pod Status**:
    ```
    {diagnostics[0]}
    ```
    
    **Recent Logs**:
    ```
    {diagnostics[1]}
    ```
    
    **Resource Usage**:
    ```
    {diagnostics[2]}
    ```
    
    **Error Rate**: {diagnostics[3]}
    **P99 Latency**: {diagnostics[4]}
    """)
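
One caveat with a bare `asyncio.gather` call: if any single check raises, the whole call raises and no diagnostics get posted. During an incident the metrics backend may itself be down, so it can help to isolate failures and bound each check with a timeout. A minimal sketch, where `gather_diagnostics` is a hypothetical wrapper and the ten-second default is an arbitrary assumption:

```python
import asyncio


async def gather_diagnostics(checks: dict, timeout: float = 10.0) -> dict:
    """Run named diagnostic coroutines concurrently.

    Each check gets its own timeout; a failing or slow check is reported
    as a short error string instead of cancelling the others.
    """

    async def guarded(name, coro):
        try:
            return name, await asyncio.wait_for(coro, timeout)
        except asyncio.TimeoutError:
            return name, f"(timed out after {timeout}s)"
        except Exception as exc:  # keep going: partial diagnostics still help
            return name, f"(failed: {exc})"

    pairs = await asyncio.gather(*(guarded(n, c) for n, c in checks.items()))
    return dict(pairs)
```

Keying results by name (for example `results["pod_status"]`) also reads more clearly in the message template than positional indexes.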

Level 3: Interactive Checklist Bot

A bot that guides responders through the checklist and tracks their progress.

Slack Workflow

@app.on_command("/incident-checklist")
def start_checklist(channel, user):
    checklist = [
        {"id": "ack", "text": "Alert acknowledged", "done": False},
        {"id": "verify", "text": "Issue verified", "done": False},
        {"id": "severity", "text": "Severity assessed", "done": False},
        {"id": "status", "text": "Status page updated", "done": False},
        {"id": "diagnose", "text": "Root cause identified", "done": False},
        {"id": "fix", "text": "Fix applied", "done": False},
        {"id": "verify_fix", "text": "Fix verified", "done": False},
        {"id": "close", "text": "Incident closed", "done": False},
    ]
    
    message = render_checklist(checklist)
    slack.post(channel, message, with_buttons=True)

@app.on_button_click
def toggle_checklist_item(item_id, checklist):
    # The checklist is a list of dicts, so look the item up by its "id" field
    item = next(i for i in checklist if i["id"] == item_id)
    item["done"] = not item["done"]
    update_message(render_checklist(checklist))
    
    # Auto-actions on certain items
    if item_id == "severity" and item["done"]:
        prompt_severity_selection()
    elif item_id == "close" and item["done"]:
        create_incident_record()

Rendered Checklist

📋 Incident Checklist

✅ Alert acknowledged
✅ Issue verified  
✅ Severity assessed: P2
⬜ Status page updated [Update Now]
⬜ Root cause identified [Run Diagnostics]
⬜ Fix applied
⬜ Fix verified
⬜ Incident closed

Progress: 3/8 (38%)
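
A text-only version of `render_checklist` that produces output in this shape could look like the sketch below. It assumes the list-of-dicts structure from `start_checklist` and leaves out the inline buttons and the severity label, which in a real Slack app would need Block Kit elements rather than plain text.

```python
def render_checklist(checklist: list[dict]) -> str:
    """Render checklist items as text with a progress footer."""
    lines = ["📋 Incident Checklist", ""]
    for item in checklist:
        mark = "✅" if item["done"] else "⬜"
        lines.append(f"{mark} {item['text']}")

    done = sum(1 for item in checklist if item["done"])
    total = len(checklist)
    # Integer arithmetic rounds half up and avoids float formatting surprises.
    percent = (100 * done + total // 2) // total if total else 0
    lines += ["", f"Progress: {done}/{total} ({percent}%)"]
    return "\n".join(lines)
```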

Level 4: Status Page Integration

Automate status page updates based on checklist progress.

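# When severity is set
@app.on_severity_set
def update_status_page(incident, severity):
    if severity in ["P1", "P2"]:
        # Keep the returned ID so the same incident can be resolved later
        # (assumes create_incident returns the new incident's ID)
        incident.statuspage_id = statuspage.create_incident(
            name=f"{incident.service} degradation",
            status="investigating",
            components=[incident.service],
            body="We are investigating reports of issues."
        )

# When fix is verified
@app.on_checklist_item("verify_fix")
def resolve_status_page(incident):
    # Lower-severity incidents never opened a status page entry
    if getattr(incident, "statuspage_id", None) is None:
        return
    statuspage.update_incident(
        incident_id=incident.statuspage_id,
        status="resolved",
        body="The issue has been resolved."
    )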
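
The two handlers above cover only the first and last states. To drive the intermediate states from checklist progress as well, one option is a small lookup from checklist milestones to status page states. In this sketch the four state names are the ones Atlassian Statuspage uses for incidents, while the mapping from checklist item IDs and the `status_for` helper are assumptions for illustration.

```python
from typing import Optional

# Checklist item IDs mapped to the status page state they imply, ordered
# from latest stage to earliest so the first completed match wins.
STATUS_BY_MILESTONE = [
    ("verify_fix", "resolved"),
    ("fix", "monitoring"),
    ("diagnose", "identified"),
    ("severity", "investigating"),
]


def status_for(checklist: list[dict]) -> Optional[str]:
    """Return the status page state implied by the furthest completed milestone."""
    done = {item["id"] for item in checklist if item["done"]}
    for item_id, status in STATUS_BY_MILESTONE:
        if item_id in done:
            return status
    return None  # nothing public to say yet
```

Calling `status_for` after every checklist toggle and updating the status page only when the result changes keeps public updates in step with the responders' actual progress.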

Level 5: Automated Remediation Suggestions

Based on diagnostics, suggest likely fixes.

@app.on_diagnostics_complete
def suggest_remediation(incident, diagnostics):
    suggestions = []
    
    # Pattern matching on diagnostics
    if "OOMKilled" in diagnostics.pod_status:
        suggestions.append({
            "issue": "Pod OOM killed",
            "fix": "Increase memory limits",
            "command": f"kubectl set resources deployment/{incident.service} --limits=memory=1Gi",
            "confidence": "high"
        })
    
    if "CrashLoopBackOff" in diagnostics.pod_status:
        suggestions.append({
            "issue": "Pod crash loop",
            "fix": "Rollback deployment",
            "command": f"kubectl rollout undo deployment/{incident.service}",
            "confidence": "medium"
        })
    
    if diagnostics.error_rate > 0.5:
        suggestions.append({
            "issue": "High error rate",
            "fix": "Check recent deployments",
            "command": f"kubectl rollout history deployment/{incident.service}",
            "confidence": "medium"
        })
    
    post_suggestions(incident.channel, suggestions)

Suggestion UI

🔍 **Suggested Remediations**

Based on diagnostics, here are likely fixes:

1. **Pod OOM killed** (high confidence)
   Increase memory limits

   kubectl set resources deployment/api --limits=memory=1Gi

   [Apply] [Skip]

2. **Recent deployment issue** (medium confidence)
   Rollback to previous version

   kubectl rollout undo deployment/api

   [Apply] [Skip]
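
Because an [Apply] click ends up executing a command against production, it is worth validating the command before running it, even when the bot generated it. The following is a minimal sketch of an allowlist check; `ALLOWED_KUBECTL_ACTIONS` and `is_allowed` are hypothetical names, and the policy shown is only an example.

```python
import shlex

# Hypothetical policy: kubectl subcommands the bot may run after a human clicks [Apply].
ALLOWED_KUBECTL_ACTIONS = {
    ("rollout", "undo"),
    ("rollout", "history"),
    ("set", "resources"),
}


def is_allowed(command: str) -> bool:
    """Return True only for kubectl commands on the allowlist."""
    try:
        parts = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return False
    if len(parts) < 3 or parts[0] != "kubectl":
        return False
    # Reject shell metacharacters that survive splitting, e.g. ';' or '|'.
    if any(ch in part for part in parts for ch in ";|&$`<>"):
        return False
    return (parts[1], parts[2]) in ALLOWED_KUBECTL_ACTIONS
```

An approved command is best executed as an argument list, for example `subprocess.run(shlex.split(command))` without `shell=True`, so that nothing in it is interpreted by a shell.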

Level 6: Postmortem Auto-Generation

Automatically create a postmortem draft from incident data when the incident closes.

@app.on_incident_close
def generate_postmortem(incident):
    template = f"""
    # Incident Postmortem: {incident.title}
    
    ## Summary
    - **Duration**: {incident.duration}
    - **Severity**: {incident.severity}
    - **Services Affected**: {incident.services}
    - **User Impact**: {incident.user_impact}
    
    ## Timeline
    {format_timeline(incident.events)}
    
    ## Root Cause
    [To be filled]
    
    ## Resolution
    {incident.resolution_notes}
    
    ## Action Items
    - [ ] [To be filled based on discussion]
    
    ## Lessons Learned
    [To be filled]
    
    ---
    *Auto-generated from incident data. Please review and complete.*
    """
    
    create_doc(template)
    schedule_postmortem_meeting(incident)
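
The template relies on a `format_timeline` helper and a precomputed duration. One possible shape for both is sketched below, assuming events are stored as `(timestamp, description)` pairs with timezone-aware timestamps; `format_duration` is a hypothetical helper added for illustration.

```python
from datetime import datetime, timezone


def format_timeline(events: list[tuple[datetime, str]]) -> str:
    """Render (timestamp, description) pairs as a chronological Markdown list."""
    ordered = sorted(events, key=lambda event: event[0])
    return "\n".join(
        f"- {when.astimezone(timezone.utc):%H:%M} UTC: {what}" for when, what in ordered
    )


def format_duration(start: datetime, end: datetime) -> str:
    """Render the incident duration as, for example, '1h 27m'."""
    minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"
```

Normalizing to UTC keeps the timeline unambiguous when responders work across time zones.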

Implementation Roadmap

Week 1-2: Channel Automation

Alert triggers channel creation
On-call auto-invited
Initial context posted

Week 3-4: Diagnostic Automation

Auto-run diagnostics on incident start
Results posted to channel
Runbook linked automatically

Week 5-6: Interactive Checklist

Slack-based checklist tracking
Progress visibility
Timestamp capture

Week 7-8: Integration

Status page automation
Postmortem generation
Metrics capture

Stew: Built for Automation

Stew integrates with your incident workflow:

Trigger runbooks from alerts
Execute commands with approval
Capture results automatically
Generate postmortem data

Your incident response checklist becomes an automated workflow.

Join the waitlist and automate your incident response.