
Incident Response with Runbook Automation

· 4 min read · Stew Team
Tags: runbook automation, incident response, MTTR, SRE

When an incident hits, every minute counts. The difference between a 10-minute resolution and a 2-hour firefight often comes down to one thing: whether your runbooks actually work.

Runbook automation tools can dramatically reduce your mean time to recovery (MTTR). Here’s how to make it happen.

The MTTR Problem

Let’s break down what happens during a typical incident:

| Phase | Time Spent | What Goes Wrong |
| --- | --- | --- |
| Detection | 5 min | Alert fatigue delays response |
| Diagnosis | 15 min | Hunting through dashboards |
| Runbook Lookup | 10 min | Finding the right doc |
| Execution | 20 min | Copy-pasting commands |
| Verification | 10 min | Confirming the fix |

Total: 60 minutes. Half of that time is wasted on runbook lookup and manual execution.

How Runbook Automation Cuts MTTR

With a proper runbook automation tool:

| Phase | Time Saved | How |
| --- | --- | --- |
| Runbook Lookup | 8 min | Runbooks linked from alerts |
| Execution | 15 min | One-click execution |
| Verification | 5 min | Built-in validation steps |

New total: about 32 minutes (60 minus the 28 saved). That’s a reduction of roughly 47% in MTTR, close to half.
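The totals above are simple arithmetic, and a few lines of shell make the calculation explicit. This is only a sketch using the illustrative timings from the two tables; the variable names are invented for the example.

```bash
# Illustrative phase timings in minutes, taken from the first table.
detection=5
diagnosis=15
lookup=10
execution=20
verification=10

# Minutes saved per phase with automation, taken from the second table.
lookup_saved=8
execution_saved=15
verification_saved=5

baseline=$((detection + diagnosis + lookup + execution + verification))
saved=$((lookup_saved + execution_saved + verification_saved))
new_total=$((baseline - saved))

# Shell arithmetic is integer-only, so the percentage is rounded down.
reduction_pct=$((saved * 100 / baseline))

echo "baseline=${baseline}m saved=${saved}m new_total=${new_total}m reduction=${reduction_pct}%"
```

Run as written, this prints `baseline=60m saved=28m new_total=32m reduction=46%`. The exact reduction is about 46.7%, which integer division rounds down.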

Building Incident Response Runbooks

Effective incident runbooks follow a four-step structure: triage, diagnosis, remediation, and verification.

1. Triage

First, understand the scope:

## Triage

### Check Service Health

```bash
curl -s https://api.example.com/health | jq .
```

### Check Error Rates

```bash
# Count ERROR lines across the last 100 log lines of each matching pod
kubectl logs -n production -l app=api --tail=100 | grep -c ERROR
```

### Check Recent Deployments

```bash
kubectl rollout history deployment/api -n production | tail -5
```
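One way to turn the error-rate check into a yes/no triage signal is to compare the count against a threshold. The sketch below rests on assumptions: `error_count` and `errors_exceed` are hypothetical helper names, and the default threshold of 20 is arbitrary. Both helpers read log lines from standard input, so the output of the `kubectl logs` command above could be piped into them.

```bash
# Count lines containing ERROR on stdin. grep -c exits non-zero when the
# count is 0, so "|| true" keeps the helper usable under "set -e".
error_count() {
  grep -c 'ERROR' || true
}

# Succeed (exit 0) when the ERROR count on stdin is above the threshold.
# $1 = threshold (hypothetical default: 20)
errors_exceed() {
  local threshold="${1:-20}"
  local count
  count="$(error_count)"
  echo "errors=${count} threshold=${threshold}"
  [ "$count" -gt "$threshold" ]
}

# Example (not run here):
#   kubectl logs -n production -l app=api --tail=100 | errors_exceed 20
```

The right threshold depends on normal traffic for the service, so treat the number as a placeholder to tune.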

2. Diagnosis

Drill down to the root cause:

## Diagnosis

### Database Connectivity

```bash
# $DB_HOST is expanded by your local shell, so set it before running this
kubectl exec -it deploy/api -n production -- pg_isready -h $DB_HOST
```

### Memory Usage

```bash
kubectl top pods -n production -l app=api
```

### Recent Config Changes

```bash
# Shows the current config; compare it against the last known-good version
kubectl get configmap api-config -n production -o yaml
```
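During diagnosis it can help to filter the memory listing down to the pods that look suspicious. The following is a rough sketch, not part of any tool: `pods_over_memory` is a hypothetical helper, and it assumes the usual three-column `kubectl top pods` table (name, CPU, memory) with memory reported in `Mi`.

```bash
# Print the name and memory of pods whose usage exceeds a limit.
# Reads "kubectl top pods" output on stdin; rows whose third column does
# not end in Mi (including the header row) are ignored.
# $1 = limit in Mi
pods_over_memory() {
  awk -v limit="$1" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)
      if (mem + 0 > limit + 0) print $1, $3
    }'
}

# Example (not run here):
#   kubectl top pods -n production -l app=api | pods_over_memory 256
```

Pods reporting memory in other units would be skipped by this filter, so it is a starting point rather than a complete check.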

3. Remediation

Fix the issue:

## Remediation

### Option A: Restart Pods

```bash
kubectl rollout restart deployment/api -n production
```

### Option B: Rollback Deployment

```bash
kubectl rollout undo deployment/api -n production
```

### Option C: Scale Up

```bash
kubectl scale deployment/api -n production --replicas=10
```
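A remediation step is easier to trust when it waits for the rollout to finish instead of returning immediately. As a minimal sketch, the rollback option could be wrapped as below. `rollback_and_wait` and the `KUBECTL` override are assumptions made for this example, and the 120-second timeout is arbitrary.

```bash
# Command used to talk to the cluster. Overriding it (for example with
# KUBECTL=echo) gives a dry run that only prints the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Roll back a deployment, then block until the rollout completes.
# $1 = deployment name, $2 = namespace, $3 = timeout (assumed default: 120s)
rollback_and_wait() {
  local deploy="$1"
  local ns="$2"
  local timeout="${3:-120s}"
  "$KUBECTL" rollout undo "deployment/${deploy}" -n "$ns" || return 1
  "$KUBECTL" rollout status "deployment/${deploy}" -n "$ns" --timeout="$timeout"
}

# Example (not run here):
#   rollback_and_wait api production
```

Setting `KUBECTL=echo` before calling the function prints the two commands without touching a cluster, which is handy when rehearsing the runbook.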

4. Verification

Confirm the fix worked:

## Verification

### Check Pod Status

```bash
kubectl get pods -n production -l app=api
```

### Verify Health Endpoint

```bash
curl -s https://api.example.com/health | jq .
```

### Monitor Error Rate

```bash
# Re-check every 5 seconds; watch for about 2 minutes, then press Ctrl-C
watch -n 5 'kubectl logs -n production -l app=api --tail=10 | grep -c ERROR'
```
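Verification can also be expressed as a bounded polling loop, so the step ends on its own with a clear pass or fail. The helper below is a sketch under stated assumptions: `retry_until_ok` is a hypothetical name, and the attempt count and delay in the example are illustrative.

```bash
# Run a check command repeatedly until it succeeds or attempts run out.
# $1 = max attempts, $2 = delay between attempts in seconds, rest = command
retry_until_ok() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "check passed on attempt ${i}"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "check still failing after ${attempts} attempts" >&2
  return 1
}

# Example (not run here): poll the health endpoint for roughly 2 minutes.
# curl -f makes HTTP errors (status >= 400) count as failures.
#   retry_until_ok 24 5 curl -sf -o /dev/null https://api.example.com/health
```

A non-zero exit code from the helper gives an automated runbook a clear signal to stop and escalate rather than report a fix that did not hold.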

Best Practices for Incident Runbooks

Link Runbooks from Alerts

Your alerting system should include runbook links:

# PagerDuty/OpsGenie alert
annotations:
  runbook: https://stew.example.com/runbooks/api-high-error-rate

When the alert fires, the runbook is one click away.

Keep Runbooks Focused

One runbook per incident type. Don’t create a 50-page mega-runbook that covers everything. Engineers need to find the right section fast.

Include Decision Points

Not every incident follows the same path:

## Decision: Is This a Database Issue?

Run the connectivity check above.

- If connection fails → Go to [Database Runbook](/runbooks/database)
- If connection succeeds → Continue to Application Debugging
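The decision point above could be sketched in shell as a small branch on the exit code of the connectivity check. `next_step` is a hypothetical helper written for this illustration; the two messages mirror the branches in the runbook excerpt.

```bash
# Pick the next runbook step from the outcome of a connectivity check.
# All arguments form the check command (for example pg_isready).
next_step() {
  if "$@" >/dev/null 2>&1; then
    echo "Continue to Application Debugging"
  else
    echo "Go to /runbooks/database"
  fi
}

# Example (not run here):
#   next_step pg_isready -h "$DB_HOST"
```

Keeping the branch driven by an exit code means the same decision works whether a person or a tool runs the check.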

Test Runbooks in Game Days

Schedule regular incident drills. Execute your runbooks against staging environments. Find the gaps before real incidents expose them.
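For a drill, one low-tech approach is to pull the commands out of a Markdown runbook and point them at staging before anyone runs them. The sketch below assumes runbooks use fenced `bash` blocks and pass the namespace as `-n production`, as the excerpts in this post do. Both function names are hypothetical, and this is not a description of how any particular tool works.

```bash
# Print the contents of every fenced bash block in a Markdown runbook.
# $1 = path to the runbook file
extract_bash_blocks() {
  awk '
    /^```bash/ { in_block = 1; next }
    /^```/     { in_block = 0; next }
    in_block   { print }
  ' "$1"
}

# Rewrite "-n production" to "-n staging" so a drill targets staging.
# Assumes the namespace is always passed with -n.
retarget_to_staging() {
  sed 's/-n production/-n staging/g'
}

# Example (not run here): print the staging commands for review.
#   extract_bash_blocks api-high-error-rate.md | retarget_to_staging
```

Printing the rewritten commands for review, rather than piping them straight into a shell, keeps a person in the loop during the drill.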

Update Runbooks After Incidents

Every post-incident review should ask: “Did the runbook work?” If not, update it immediately.

Why Runbook Automation Tools Matter

You could write all this in Confluence. But static docs fail during incidents:

  • Copy-paste errors under pressure
  • Outdated commands that fail silently
  • No execution history
  • No variable management

A runbook automation tool makes your incident response:

  • Faster — Execute, don’t copy-paste
  • Safer — Variables are injected, not typed
  • Auditable — Every action is logged
  • Testable — Run drills without fear
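Two of these properties, injected variables and an audit trail, can be approximated in plain shell. The following is a minimal sketch, not how any specific product implements them: `run_logged`, `restart_api`, `RUNBOOK_LOG`, and the `KUBECTL` override are all names made up for this example.

```bash
# Run a command, append a timestamped record of it and its exit code to a
# log file, then return the command's own exit code.
# RUNBOOK_LOG is a hypothetical variable; the default path is arbitrary.
run_logged() {
  local log="${RUNBOOK_LOG:-./runbook-audit.log}"
  local status=0
  "$@" || status=$?
  printf '%s exit=%s cmd=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$*" >> "$log"
  return "$status"
}

# Refuse to run when a required variable is missing, instead of sending a
# half-formed command. ${VAR:?msg} is standard POSIX parameter expansion
# and aborts a non-interactive shell when VAR is unset or empty.
restart_api() {
  local ns="${NAMESPACE:?NAMESPACE must be set}"
  run_logged "${KUBECTL:-kubectl}" rollout restart deployment/api -n "$ns"
}

# Example (not run here):
#   NAMESPACE=production restart_api
```

Because the namespace comes from a variable rather than being typed into the command, a typo fails loudly before anything reaches the cluster, and every attempt leaves a line in the log.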

Get Started with Stew

Stew is built for incident response. Write runbooks in Markdown, execute them anywhere, share them with your team.

Join the waitlist and cut your MTTR in half.