Incident Response with Runbook Automation

When an incident hits, every minute counts. The difference between a 10-minute resolution and a 2-hour firefight often comes down to one thing: whether your runbooks actually work.

Runbook automation tools can dramatically reduce your mean time to recovery (MTTR). Here’s how to make it happen.

The MTTR Problem

Let’s break down what happens during a typical incident:

| Phase | Time Spent | What Goes Wrong |
|---|---|---|
| Detection | 5 min | Alert fatigue delays response |
| Diagnosis | 15 min | Hunting through dashboards |
| Runbook Lookup | 10 min | Finding the right doc |
| Execution | 20 min | Copy-pasting commands |
| Verification | 10 min | Confirming the fix |

Total: 60 minutes. Half of that, 30 minutes, goes to runbook lookup and manual execution.

How Runbook Automation Cuts MTTR

With a proper runbook automation tool:

| Phase | Time Saved | How |
|---|---|---|
| Runbook Lookup | 8 min | Runbooks linked from alerts |
| Execution | 15 min | One-click execution |
| Verification | 5 min | Built-in validation steps |

New total: about 32 minutes (60 minus 28 saved). That’s a reduction of roughly 47% in MTTR, or close to half.
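A quick way to check the arithmetic behind the two tables is to add it up in the shell. This is only a sketch: the variable names are illustrative, and the minutes are the example figures from the tables, not measurements.

```bash
# Phase durations in minutes, taken from the first table above.
detection=5 diagnosis=15 lookup=10 execution=20 verification=10
baseline=$((detection + diagnosis + lookup + execution + verification))

# Savings in minutes, taken from the second table above.
saved=$((8 + 15 + 5))

automated=$((baseline - saved))
# Integer percentage, rounded to the nearest whole number.
reduction_pct=$(( (saved * 100 + baseline / 2) / baseline ))

echo "baseline=${baseline}m automated=${automated}m reduction=${reduction_pct}%"
```

Swapping in your own phase timings from a recent post-incident review gives a more honest estimate than the example numbers.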

Building Incident Response Runbooks

Effective incident runbooks follow a four-part structure: triage, diagnosis, remediation, and verification.

1. Triage

First, understand the scope:

## Triage

### Check Service Health

```bash
curl -s https://api.example.com/health | jq .
```

### Check Error Rates

```bash
kubectl logs -n production -l app=api --tail=100 | grep -c ERROR
```

### Check Recent Deployments

```bash
kubectl rollout history deployment/api -n production | tail -5
```
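The error-rate check above prints a raw count, which still leaves the engineer to judge whether it is bad. A minimal sketch of turning that count into a verdict might look like the following. The function names `count_errors` and `triage_verdict` and the default threshold of 5 are assumptions made for illustration, not part of any tool.

```bash
# count_errors: read log lines on stdin and print how many contain ERROR.
count_errors() {
  grep -c ERROR || true   # grep exits 1 when the count is 0; do not treat that as failure
}

# triage_verdict COUNT [THRESHOLD]: print "elevated" when COUNT exceeds
# THRESHOLD (default 5, an arbitrary illustrative value), else "normal".
triage_verdict() {
  count=$1
  threshold=${2:-5}
  if [ "$count" -gt "$threshold" ]; then
    echo "elevated"
  else
    echo "normal"
  fi
}

# Usage during an incident (needs cluster access):
#   errors=$(kubectl logs -n production -l app=api --tail=100 | count_errors)
#   triage_verdict "$errors"
```

The right threshold depends on your normal error volume, so treat the default as a placeholder.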

2. Diagnosis

Drill down to the root cause:

## Diagnosis

### Database Connectivity

```bash
# No TTY needed; $DB_HOST is expanded by your local shell, so set it first
kubectl exec deploy/api -n production -- pg_isready -h "$DB_HOST"
```

### Memory Usage

```bash
kubectl top pods -n production -l app=api
```

### Recent Config Changes

```bash
kubectl get configmap api-config -n production -o yaml
```
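The memory check above lists every pod, so a small filter can surface only the ones worth a closer look. The sketch below assumes the three-column `kubectl top pods --no-headers` layout (name, CPU, memory) and memory reported in Mi or Gi. The function name `flag_high_memory` and the 800 Mi limit in the usage line are hypothetical.

```bash
# flag_high_memory LIMIT_MI: read "NAME CPU MEMORY" lines on stdin and
# print the name and memory of pods using at least LIMIT_MI mebibytes.
# Assumption: memory is reported in Mi or Gi; lines in other units are skipped.
flag_high_memory() {
  limit_mi=$1
  awk -v limit="$limit_mi" '
    $3 !~ /(Mi|Gi)$/ { next }
    $3 ~ /Mi$/ { mem = $3; sub(/Mi$/, "", mem); mem = mem + 0 }
    $3 ~ /Gi$/ { mem = $3; sub(/Gi$/, "", mem); mem = mem * 1024 }
    mem >= limit { print $1, $3 }
  '
}

# Usage (needs cluster access and metrics-server):
#   kubectl top pods -n production -l app=api --no-headers | flag_high_memory 800
```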

3. Remediation

Fix the issue:

## Remediation

### Option A: Restart Pods

```bash
kubectl rollout restart deployment/api -n production
```

### Option B: Rollback Deployment

```bash
kubectl rollout undo deployment/api -n production
```

### Option C: Scale Up

```bash
kubectl scale deployment/api -n production --replicas=10
```
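Because remediation commands change production, it can help to preview them before running them. Here is one possible sketch of the three options above behind a dry-run switch. The `run` and `remediate` helpers and the `DRY_RUN` variable are illustrative names, not features of kubectl or of any runbook tool.

```bash
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

# remediate OPTION [REPLICAS]: map an option name to one of the three
# commands above. REPLICAS defaults to 10, as in Option C.
remediate() {
  case "$1" in
    restart)  run kubectl rollout restart deployment/api -n production ;;
    rollback) run kubectl rollout undo deployment/api -n production ;;
    scale)    run kubectl scale deployment/api -n production --replicas="${2:-10}" ;;
    *)        echo "usage: remediate restart|rollback|scale [replicas]" >&2; return 2 ;;
  esac
}

# Usage:
#   DRY_RUN=1 remediate rollback   # prints the command without running it
#   remediate rollback             # runs it
```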

4. Verification

Confirm the fix worked:

## Verification

### Check Pod Status

```bash
kubectl get pods -n production -l app=api
```

### Verify Health Endpoint

```bash
curl -s https://api.example.com/health | jq .
```

### Monitor Error Rate

```bash
# Refreshes every 5 seconds until interrupted; watch for about 2 minutes, then press Ctrl-C
watch -n 5 'kubectl logs -n production -l app=api --tail=10 | grep -c ERROR'
```
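Since `watch` keeps running until someone interrupts it, an automated verification step usually needs a bounded retry loop instead. The following is a minimal sketch of such a loop. The `wait_until` helper is a hypothetical name, and 24 attempts at 5-second intervals is just one way to approximate a 2-minute window.

```bash
# wait_until ATTEMPTS INTERVAL CMD...: run CMD until it succeeds, sleeping
# INTERVAL seconds between tries; return 1 after ATTEMPTS failed tries.
wait_until() {
  attempts=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt remains.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Usage: poll the health endpoint every 5 seconds for roughly 2 minutes.
# curl -f makes HTTP errors (status 400 and above) count as failures.
#   wait_until 24 5 curl -fsS -o /dev/null https://api.example.com/health
```

The exit status makes the verification result usable by whatever runs the step next, rather than relying on someone reading the screen.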

Best Practices for Incident Runbooks

Link Runbooks to Alerts

Your alerting system should include runbook links:

# PagerDuty/OpsGenie alert
annotations:
  runbook: https://stew.example.com/runbooks/api-high-error-rate

When the alert fires, the runbook is one click away.

Keep Runbooks Focused

One runbook per incident type. Don’t create a 50-page mega-runbook that covers everything. Engineers need to find the right section fast.

Include Decision Points

Not every incident follows the same path:

## Decision: Is This a Database Issue?

Run the connectivity check above.

- If connection fails → Go to [Database Runbook](/runbooks/database)
- If connection succeeds → Continue to Application Debugging
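A decision point like this can also be expressed as a branch on the check's exit status. The sketch below relies on `pg_isready` exiting 0 when the server accepts connections and non-zero otherwise. The `next_step` function is an illustrative name, and the messages simply mirror the two branches above.

```bash
# next_step STATUS: turn the exit status of the connectivity check into
# the branch to follow.
next_step() {
  if [ "$1" -eq 0 ]; then
    echo "Continue to Application Debugging"
  else
    echo "Go to Database Runbook: /runbooks/database"
  fi
}

# Usage (needs cluster access; DB_HOST must be set in your shell):
#   kubectl exec deploy/api -n production -- pg_isready -h "$DB_HOST"
#   next_step $?
```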

Test Runbooks in Game Days

Schedule regular incident drills. Execute your runbooks against staging environments. Find the gaps before real incidents expose them.

Update Runbooks After Incidents

Every post-incident review should ask: “Did the runbook work?” If not, update it immediately.

Why Runbook Automation Tools Matter

You could write all this in Confluence. But static docs fail during incidents:

- Copy-paste errors under pressure
- Outdated commands that fail silently
- No execution history
- No variable management

A runbook automation tool makes your incident response:

- Faster: execute, don’t copy-paste
- Safer: variables are injected, not typed
- Auditable: every action is logged
- Testable: run drills without fear
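To make the "auditable" point concrete, here is a rough sketch of what logging every action could look like in plain shell. It is not how Stew or any specific tool implements its history. The `run_logged` function, the `RUNBOOK_LOG` variable, and the log line format are all assumptions for illustration.

```bash
# run_logged CMD...: execute CMD, then append one audit line with the
# UTC time, exit status, and command to the file named by RUNBOOK_LOG
# (default: runbook-audit.log in the current directory).
run_logged() {
  log_file=${RUNBOOK_LOG:-runbook-audit.log}
  status=0
  "$@" || status=$?
  printf '%s exit=%s cmd=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$*" >> "$log_file"
  return "$status"
}

# Usage:
#   run_logged kubectl rollout undo deployment/api -n production
```

Even a log this simple answers the post-incident question of who ran what and whether it succeeded, which a static doc cannot.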

Get Started with Stew

Stew is built for incident response. Write runbooks in Markdown, execute them anywhere, share them with your team.

Join the waitlist and cut your MTTR in half.