
Automating Incident Response Checklists: Tools and Techniques

· 5 min read · Stew Team
Tags: incident-response, checklist, automation, devops

Manual checklists work, but automated checklists work faster. This guide covers how to automate your incident response workflow.

For checklist templates, see our incident response checklist templates.

Automation Opportunities

Every checklist has automatable and non-automatable items:

Automatable             | Human Required
Create incident channel | Assess severity
Run diagnostic commands | Identify root cause
Page on-call            | Decide on remediation
Update status page      | Approve risky changes
Capture timestamps      | Communicate with stakeholders

Level 1: Alert-Triggered Channel Creation

When an alert fires, automatically create an incident channel.

Slack Bot Example

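# Pseudo-code for incident channel automation
# (`app`, `slack`, `get_oncall`, and `get_team` stand in for your own bot framework and helpers)
from datetime import date, datetime
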
@app.on_alert
def create_incident_channel(alert):
    # Create channel
    channel_name = f"incident-{date.today()}-{alert.service}"
    channel = slack.create_channel(channel_name)
    
    # Invite relevant people
    slack.invite(channel, get_oncall(alert.service))
    slack.invite(channel, get_team(alert.service))
    
    # Post initial context
    # (alert.runbook_url is an assumed field holding the runbook link)
    slack.post(channel, f"""
    🚨 **Incident Started**
    
    **Alert**: {alert.name}
    **Service**: {alert.service}
    **Severity**: {alert.severity}
    **Time**: {datetime.now()}
    
    **Checklist**: <{alert.runbook_url}|Open Runbook>
    
    **Quick Commands**:
    ```
    kubectl get pods -l app={alert.service}
    kubectl logs -l app={alert.service} --tail=50
    ```
    """)

PagerDuty Webhook

# PagerDuty webhook to trigger automation
webhooks:
  - url: https://your-automation/incident/start
    events:
      - incident.triggered
    headers:
      Authorization: Bearer ${AUTOMATION_TOKEN}
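
The endpoint that receives this webhook should reject requests that do not carry the expected token. The sketch below shows one way to check the `Authorization` header; `is_authorized` is a hypothetical helper, and how you obtain the request headers depends on your web framework.

```python
import hmac
from typing import Mapping


def is_authorized(headers: Mapping[str, str], expected_token: str) -> bool:
    """Return True only if the Authorization header carries the expected bearer token."""
    # Header names are case-insensitive, so normalise them before the lookup.
    normalised = {key.lower(): value for key, value in headers.items()}
    scheme, _, token = normalised.get("authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not expected_token:
        return False
    # compare_digest takes the same time whether or not the tokens match early on.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

An empty `expected_token` is treated as a failure so that a missing environment variable cannot silently open the endpoint.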

Level 2: Auto-Populated Diagnostics

Run diagnostic commands as soon as the incident starts and post the results to the incident channel.

# Pseudo-code: `run_command` and `fetch_metrics` are helpers you would supply
import asyncio

@app.on_incident_create
async def run_diagnostics(incident):
    service = incident.service
    
    diagnostics = await asyncio.gather(
        run_command(f"kubectl get pods -l app={service}"),
        run_command(f"kubectl logs -l app={service} --tail=50"),
        run_command(f"kubectl top pods -l app={service}"),
        fetch_metrics(service, "error_rate", "5m"),
        fetch_metrics(service, "latency_p99", "5m"),
    )
    
    slack.post(incident.channel, f"""
    📊 **Auto-Diagnostics**
    
    **Pod Status**:
    ```
    {diagnostics[0]}
    ```
    
    **Recent Errors**:
    ```
    {diagnostics[1]}
    ```
    
    **Resource Usage**:
    ```
    {diagnostics[2]}
    ```
    
    **Error Rate**: {diagnostics[3]}
    **P99 Latency**: {diagnostics[4]}
    """)

Level 3: Interactive Checklist Bot

An interactive bot guides responders through the checklist and tracks their progress.

Slack Workflow

@app.on_command("/incident-checklist")
def start_checklist(channel, user):
    checklist = [
        {"id": "ack", "text": "Alert acknowledged", "done": False},
        {"id": "verify", "text": "Issue verified", "done": False},
        {"id": "severity", "text": "Severity assessed", "done": False},
        {"id": "status", "text": "Status page updated", "done": False},
        {"id": "diagnose", "text": "Root cause identified", "done": False},
        {"id": "fix", "text": "Fix applied", "done": False},
        {"id": "verify_fix", "text": "Fix verified", "done": False},
        {"id": "close", "text": "Incident closed", "done": False},
    ]
    
    message = render_checklist(checklist)
    slack.post(channel, message, with_buttons=True)

@app.on_button_click
def toggle_checklist_item(item_id, checklist):
    # checklist is a list of dicts, so look the item up by its "id" field
    item = next(entry for entry in checklist if entry["id"] == item_id)
    item["done"] = not item["done"]
    update_message(render_checklist(checklist))
    
    # Auto-actions on certain items
    if item_id == "severity" and item["done"]:
        prompt_severity_selection()
    elif item_id == "close" and item["done"]:
        create_incident_record()

Rendered Checklist

📋 Incident Checklist

✅ Alert acknowledged
✅ Issue verified  
✅ Severity assessed: P2
⬜ Status page updated [Update Now]
⬜ Root cause identified [Run Diagnostics]
⬜ Fix applied
⬜ Fix verified
⬜ Incident closed

Progress: 3/8 (38%)
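
The `render_checklist` helper used in the workflow code is not shown. A minimal text-only sketch that produces output in the shape above might look like this; it leaves out the buttons and the severity suffix, which would depend on your Slack integration.

```python
def render_checklist(checklist: list) -> str:
    """Render checklist items as plain text followed by a progress line."""
    lines = ["📋 Incident Checklist", ""]
    for item in checklist:
        mark = "✅" if item["done"] else "⬜"
        lines.append(f"{mark} {item['text']}")
    total = len(checklist)
    done = sum(1 for item in checklist if item["done"])
    # Integer arithmetic that rounds half up, so 3 of 8 shows as 38%.
    percent = (done * 100 + total // 2) // total if total else 0
    lines += ["", f"Progress: {done}/{total} ({percent}%)"]
    return "\n".join(lines)
```

The percentage is computed with integers rather than `round()`, because Python's `round()` rounds halves to the nearest even number and would show 1 of 8 as 12%.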

Level 4: Status Page Integration

Automate status page updates based on checklist progress.

# When severity is set
@app.on_severity_set
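def update_status_page(incident, severity):
    if severity in ["P1", "P2"]:
        # Keep the returned ID so the incident can be resolved later
        # (assumes create_incident returns an object with an `id` attribute)
        created = statuspage.create_incident(
            name=f"{incident.service} degradation",
            status="investigating",
            components=[incident.service],
            body="We are investigating reports of issues."
        )
        incident.statuspage_id = created.id

# When fix is verified
@app.on_checklist_item("verify_fix")
def resolve_status_page(incident):
    # Lower-severity incidents never got a status page entry, so there is nothing to resolve
    if getattr(incident, "statuspage_id", None) is None:
        return
    statuspage.update_incident(
        incident_id=incident.statuspage_id,
        status="resolved",
        body="The issue has been resolved."
    )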
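
The handlers above only cover the first and last status. If you want the status page to follow the checklist more closely, one option is to derive the status from whichever checklist stage was completed most recently. The mapping below is our own assumption about which item corresponds to which of the four statuses commonly used on status pages (investigating, identified, monitoring, resolved); adjust it to your process.

```python
# Checklist item id -> status, ordered from the latest stage to the earliest.
STAGE_TO_STATUS = [
    ("verify_fix", "resolved"),
    ("fix", "monitoring"),
    ("diagnose", "identified"),
]


def status_for(checklist: list) -> str:
    """Return the status implied by the furthest completed checklist stage."""
    done = {item["id"] for item in checklist if item["done"]}
    for item_id, status in STAGE_TO_STATUS:
        if item_id in done:
            return status
    # Nothing beyond triage is complete yet.
    return "investigating"
```

Deriving the status from the whole checklist, instead of reacting to single button clicks, means an item that is unticked by mistake and ticked again still yields the right status.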

Level 5: Automated Remediation Suggestions

Based on diagnostics, suggest likely fixes.

@app.on_diagnostics_complete
def suggest_remediation(incident, diagnostics):
    suggestions = []
    
    # Pattern matching on diagnostics
    if "OOMKilled" in diagnostics.pod_status:
        suggestions.append({
            "issue": "Pod OOM killed",
            "fix": "Increase memory limits",
            "command": f"kubectl set resources deployment/{incident.service} --limits=memory=1Gi",
            "confidence": "high"
        })
    
    if "CrashLoopBackOff" in diagnostics.pod_status:
        suggestions.append({
            "issue": "Pod crash loop",
            "fix": "Rollback deployment",
            "command": f"kubectl rollout undo deployment/{incident.service}",
            "confidence": "medium"
        })
    
    if diagnostics.error_rate > 0.5:
        suggestions.append({
            "issue": "High error rate",
            "fix": "Check recent deployments",
            "command": f"kubectl rollout history deployment/{incident.service}",
            "confidence": "medium"
        })
    
    post_suggestions(incident.channel, suggestions)

Suggestion UI

🔍 **Suggested Remediations**

Based on diagnostics, here are likely fixes:

1. **Pod OOM killed** (high confidence)
   Increase memory limits

   kubectl set resources deployment/api --limits=memory=1Gi

   [Apply] [Skip]

2. **Pod crash loop** (medium confidence)
   Rollback deployment

   kubectl rollout undo deployment/api

   [Apply] [Skip]
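
Because approving risky changes stays on the human side of the table at the top of this guide, the [Apply] button should record who approved a command before anything runs. A minimal sketch of that gate follows; `Suggestion`, `apply_suggestion`, and the injected `runner` callable are all hypothetical names, and the runner is whatever executes commands in your environment.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    issue: str
    command: str
    confidence: str


def apply_suggestion(
    suggestion: Suggestion,
    approved_by: Optional[str],
    runner: Callable[[str], str],
) -> str:
    """Run the suggested command only when a named human has approved it."""
    if not approved_by:
        # No approver recorded: do not touch the cluster.
        return f"Skipped '{suggestion.issue}': no approval recorded"
    output = runner(suggestion.command)
    return f"Applied '{suggestion.issue}' (approved by {approved_by}): {output}"
```

Passing the runner in as an argument keeps the approval logic testable without a cluster, and the returned string can be posted to the incident channel as an audit trail.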

Level 6: Postmortem Auto-Generation

Automatically create a postmortem draft from incident data.

@app.on_incident_close
def generate_postmortem(incident):
    template = f"""
    # Incident Postmortem: {incident.title}
    
    ## Summary
    - **Duration**: {incident.duration}
    - **Severity**: {incident.severity}
    - **Services Affected**: {incident.services}
    - **User Impact**: {incident.user_impact}
    
    ## Timeline
    {format_timeline(incident.events)}
    
    ## Root Cause
    [To be filled]
    
    ## Resolution
    {incident.resolution_notes}
    
    ## Action Items
    - [ ] [To be filled based on discussion]
    
    ## Lessons Learned
    [To be filled]
    
    ---
    *Auto-generated from incident data. Please review and complete.*
    """
    
    create_doc(template)
    schedule_postmortem_meeting(incident)
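
The template relies on `format_timeline` and `incident.duration`, neither of which is defined above. Assuming events are stored as `(timestamp, description)` pairs, which is our assumption rather than something the template requires, both could be sketched as:

```python
from datetime import datetime, timedelta


def format_timeline(events: list) -> str:
    """Render (timestamp, description) pairs as a Markdown list, oldest first."""
    ordered = sorted(events, key=lambda event: event[0])
    return "\n".join(f"- {ts:%H:%M:%S} {text}" for ts, text in ordered)


def incident_duration(events: list) -> timedelta:
    """Time between the first and last recorded event."""
    timestamps = [ts for ts, _ in events]
    return max(timestamps) - min(timestamps) if timestamps else timedelta(0)
```

Sorting inside `format_timeline` means events captured out of order, such as a late-arriving webhook, still appear chronologically in the draft.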

Implementation Roadmap

Week 1-2: Channel Automation

  • Alert triggers channel creation
  • On-call auto-invited
  • Initial context posted

Week 3-4: Diagnostic Automation

  • Auto-run diagnostics on incident start
  • Results posted to channel
  • Runbook linked automatically

Week 5-6: Interactive Checklist

  • Slack-based checklist tracking
  • Progress visibility
  • Timestamp capture

Week 7-8: Integration

  • Status page automation
  • Postmortem generation
  • Metrics capture

Stew: Built for Automation

Stew integrates with your incident workflow:

  • Trigger runbooks from alerts
  • Execute commands with approval
  • Capture results automatically
  • Generate postmortem data

Your incident response checklist becomes an automated workflow.

Join the waitlist and automate your incident response.