Incident Response with Runbook Automation
When an incident hits, every minute counts. The difference between a 10-minute resolution and a 2-hour firefight often comes down to one thing: whether your runbooks actually work.
Runbook automation tools can dramatically reduce your mean time to recovery (MTTR). Here’s how to make it happen.
The MTTR Problem
Let’s break down what happens during a typical incident:
| Phase | Time Spent | What Goes Wrong |
|---|---|---|
| Detection | 5 min | Alert fatigue delays response |
| Diagnosis | 15 min | Hunting through dashboards |
| Runbook Lookup | 10 min | Finding the right doc |
| Execution | 20 min | Copy-pasting commands |
| Verification | 10 min | Confirming the fix |
Total: 60 minutes. Half of that time (30 minutes) goes to finding the right runbook and executing it by hand.
How Runbook Automation Cuts MTTR
With a proper runbook automation tool:
| Phase | Time Saved | How |
|---|---|---|
| Runbook Lookup | 8 min | Runbooks linked from alerts |
| Execution | 15 min | One-click execution |
| Verification | 5 min | Built-in validation steps |
New total: about 32 minutes (60 minus the 28 saved). That’s a reduction of roughly 47%, or close to half your MTTR.
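To check the arithmetic yourself, here is a minimal sketch that sums the per-phase figures from the two tables. The variable and function names are illustrative only; swap in your own incident timings.

```bash
# Phase durations in minutes, taken from the two tables above.
before="5 15 10 20 10"   # detection, diagnosis, lookup, execution, verification
saved="8 15 5"           # lookup, execution, verification

# Add up any number of integer arguments.
sum() {
  total=0
  for n in "$@"; do total=$((total + n)); done
  echo "$total"
}

# Word splitting of the unquoted lists is intentional here.
baseline=$(sum $before)
savings=$(sum $saved)
automated=$((baseline - savings))

# awk handles the rounding that integer shell arithmetic cannot.
percent=$(awk -v s="$savings" -v b="$baseline" 'BEGIN { printf "%.0f", 100 * s / b }')

echo "Baseline: ${baseline} min, automated: ${automated} min, reduction: ${percent}%"
# prints: Baseline: 60 min, automated: 32 min, reduction: 47%
```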
Building Incident Response Runbooks
Effective incident runbooks follow a four-part structure: triage, diagnosis, remediation, and verification.
1. Triage
First, understand the scope:
## Triage
### Check Service Health
```bash
curl -s https://api.example.com/health | jq .
```
### Check Error Rates
```bash
kubectl logs -n production -l app=api --tail=100 | grep -c ERROR
```
### Check Recent Deployments
```bash
kubectl rollout history deployment/api -n production | tail -5
```
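A raw error count is easier to act on when it comes with a verdict. The sketch below wraps the error-rate check from the triage steps in a small function; the name `error_rate_check` and the default threshold of 5 are assumptions for illustration, not part of any tool.

```bash
# Count lines containing ERROR on stdin and compare against a threshold.
# Prints the count with a verdict; returns non-zero when the threshold is exceeded.
error_rate_check() {
  threshold=${1:-5}                 # hypothetical default, tune per service
  count=$(grep -c ERROR || true)    # grep -c exits 1 on zero matches; keep going
  if [ "$count" -gt "$threshold" ]; then
    echo "ALERT: $count ERROR lines (threshold $threshold)"
    return 1
  fi
  echo "OK: $count ERROR lines (threshold $threshold)"
}
```

You would pipe the log command into it, for example `kubectl logs -n production -l app=api --tail=100 | error_rate_check 5`.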
2. Diagnosis
Drill down to the root cause:
## Diagnosis
### Database Connectivity
```bash
# $DB_HOST is expanded by your local shell, so set it before running this step
kubectl exec deploy/api -n production -- pg_isready -h "$DB_HOST"
```
### Memory Usage
```bash
kubectl top pods -n production -l app=api
```
### Recent Config Changes
```bash
kubectl get configmap api-config -n production -o yaml
```
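Scanning `kubectl top` output by eye is slow under pressure. As a rough sketch, the function below filters that output down to the pods above a memory limit. It assumes the memory column is reported in `Mi` (values in other units are skipped), and the name `pods_over_memory` and the 400 Mi default are made up for this example.

```bash
# Read `kubectl top pods --no-headers` output on stdin and print the pods whose
# memory usage is at or above the limit (in Mi).
# Assumes the third column looks like "480Mi"; other units are not handled.
pods_over_memory() {
  awk -v limit="${1:-400}" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)                      # strip the unit suffix
      if (mem + 0 >= limit + 0) print $1, $3   # pod name and raw memory value
    }'
}
```

For example: `kubectl top pods -n production -l app=api --no-headers | pods_over_memory 400`.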
3. Remediation
Fix the issue:
## Remediation
### Option A: Restart Pods
```bash
kubectl rollout restart deployment/api -n production
```
### Option B: Rollback Deployment
```bash
kubectl rollout undo deployment/api -n production
```
### Option C: Scale Up
```bash
kubectl scale deployment/api -n production --replicas=10
```
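Remediation commands change production, so it helps to preview them first. Here is one possible wrapper, assuming a hypothetical `DRY_RUN` variable that defaults to preview mode. It is a sketch of the idea, not how any particular tool implements it.

```bash
# Run a remediation command, or only print it when DRY_RUN=1 (the default here).
run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "[dry-run] $*"
  else
    echo "[run] $*"
    "$@"    # execute the command exactly as passed, argument by argument
  fi
}
```

With this in place, `run_step kubectl rollout undo deployment/api -n production` only prints the command, and setting `DRY_RUN=0` first makes it execute.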
4. Verification
Confirm the fix worked:
## Verification
### Check Pod Status
```bash
kubectl get pods -n production -l app=api
```
### Verify Health Endpoint
```bash
curl -s https://api.example.com/health | jq .
```
### Monitor Error Rate
```bash
# Re-checks every 5 seconds; watch for about 2 minutes, then press Ctrl+C
watch -n 5 'kubectl logs -n production -l app=api --tail=10 | grep -c ERROR'
```
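If you want verification to end on its own instead of relying on someone watching the terminal, a bounded retry loop is one option. The helper below is a minimal sketch; `wait_until` and its argument order are assumptions chosen for this example.

```bash
# Retry a check until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "passed on attempt $i"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "failed after $attempts attempts"
  return 1
}
```

For example, `wait_until 24 5 curl -fsS -o /dev/null https://api.example.com/health` polls the health endpoint every 5 seconds for roughly 2 minutes and returns non-zero if it never comes back healthy.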
Best Practices for Incident Runbooks
Link Runbooks to Alerts
Your alerting system should include runbook links:
# PagerDuty/OpsGenie alert
annotations:
  runbook: https://stew.example.com/runbooks/api-high-error-rate
When the alert fires, the runbook is one click away.
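Links only help if every alert has one. As a hedged sketch, the check below lists alerts that lack a `runbook:` annotation. It assumes a Prometheus-style rules file in which each rule begins with `- alert: NAME`; that layout and the function name `alerts_missing_runbook` are assumptions, so adapt the patterns to whatever format your alerting system uses.

```bash
# Print the name of every alert that has no `runbook:` annotation.
# Assumes a Prometheus-style rules file where each rule starts with "- alert: NAME".
alerts_missing_runbook() {
  awk '
    # Emit the previous alert if no runbook link was seen for it.
    function report() { if (name != "" && !linked) print name }
    /^[[:space:]]*-[[:space:]]*alert:/ {
      report()
      name = $NF      # last field on the line is the alert name
      linked = 0
      next
    }
    /^[[:space:]]*runbook:/ { linked = 1 }
    END { report() }
  ' "$1"
}
```

Running it in CI against your rules file (for example `alerts_missing_runbook rules.yml`, where `rules.yml` is a placeholder path) would flag unlinked alerts before they page anyone.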
Keep Runbooks Focused
One runbook per incident type. Don’t create a 50-page mega-runbook that covers everything. Engineers need to find the right section fast.
Include Decision Points
Not every incident follows the same path:
## Decision: Is This a Database Issue?
Run the connectivity check above.
- If connection fails → Go to [Database Runbook](/runbooks/database)
- If connection succeeds → Continue to Application Debugging
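A decision point like this can also be encoded so the branch is chosen from the check's exit status instead of by reading output. This is a minimal sketch with a hypothetical `next_step` helper; it relies on `pg_isready` exiting 0 only when the server is accepting connections.

```bash
# Map the exit status of the connectivity check to the next runbook step.
# pg_isready exits 0 when the server is accepting connections, non-zero otherwise.
next_step() {
  if [ "$1" -eq 0 ]; then
    echo "Continue to Application Debugging"
  else
    echo "Go to Database Runbook: /runbooks/database"
  fi
}
```

In a runbook you would run the connectivity check and then call `next_step $?` on the following line.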
Test Runbooks in Game Days
Schedule regular incident drills. Execute your runbooks against staging environments. Find the gaps before real incidents expose them.
Update Runbooks After Incidents
Every post-incident review should ask: “Did the runbook work?” If not, update it immediately.
Why Runbook Automation Tools Matter
You could write all this in Confluence. But static docs fail during incidents:
- Copy-paste errors under pressure
- Outdated commands that fail silently
- No execution history
- No variable management
A runbook automation tool makes your incident response:
- Faster — Execute, don’t copy-paste
- Safer — Variables are injected, not typed
- Auditable — Every action is logged
- Testable — Run drills without fear
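To make "variables are injected" and "every action is logged" concrete, here is a rough shell approximation of both ideas. It is not how Stew or any other tool works internally; `require_var`, `audited`, and the `AUDIT_LOG` path are hypothetical names used only for this sketch.

```bash
# AUDIT_LOG is a hypothetical path; point it wherever your team keeps records.
AUDIT_LOG=${AUDIT_LOG:-./runbook-audit.log}

# Fail fast when a required variable is unset or empty.
require_var() {
  # Portable indirect lookup: eval reads the variable whose name is in $1.
  # Only pass trusted, literal variable names here.
  eval "value=\${$1:-}"
  if [ -z "$value" ]; then
    echo "missing required variable: $1" >&2
    return 1
  fi
}

# Append a timestamped record of the command to the log, then run it.
audited() {
  printf '%s user=%s cmd=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$*" >> "$AUDIT_LOG"
  "$@"
}
```

A step might then read `require_var DB_HOST && audited kubectl rollout restart deployment/api -n production`, which refuses to run with a missing variable and leaves a record of what was executed.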
Get Started with Stew
Stew is built for incident response. Write runbooks in Markdown, execute them anywhere, share them with your team.
Join the waitlist and cut your MTTR in half.