How Runbooks Reduce MTTR: From Hours to Minutes
Most teams have runbooks. Few teams use them effectively during incidents. The gap between documentation and action is where MTTR bloats.
This article explores how executable runbooks close that gap. For runbook creation basics, see how to write a runbook.
The MTTR Problem
Here’s what happens during a typical incident:
- Alert fires (0:00)
- Engineer acknowledges (0:05)
- Engineer finds runbook in wiki (0:12)
- Engineer reads runbook (0:18)
- Engineer opens terminal (0:20)
- Engineer copies first command (0:21)
- Engineer fixes typo in command (0:23)
- Command runs, engineer reads output (0:25)
- Engineer copies next command… (repeat)
- Issue resolved (0:45)
45 minutes. And that’s assuming the runbook exists and is accurate.
Why Traditional Runbooks Fail
Problem 1: Context Switching
Every alt-tab between wiki and terminal breaks focus:
- Find the right section
- Copy command
- Switch to terminal
- Paste and run
- Switch back to wiki
- Repeat
Problem 2: Copy-Paste Errors
```bash
# What the runbook says
kubectl logs -l app=api --tail=100

# What gets pasted (with invisible characters)
kubectl logs -l app=api —tail=100   # wrong dash character
```
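A cheap guard against this class of error is to lint a pasted command for non-ASCII bytes before running it. This is a generic shell sketch, not a feature of any particular tool:

```shell
# Flag any byte outside printable ASCII (smart quotes, em dashes,
# non-breaking spaces) before a pasted command is executed.
check_ascii() {
  # LC_ALL=C makes the character range byte-based, so any UTF-8
  # multi-byte sequence (like an em dash) falls outside [ -~].
  if printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'; then
    echo "warning: non-ASCII character in command" >&2
    return 1
  fi
}

check_ascii 'kubectl logs -l app=api --tail=100' && echo "ok"
check_ascii 'kubectl logs -l app=api —tail=100' || echo "rejected"
```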
Problem 3: Stale Documentation
Runbooks written 6 months ago reference:
- Old service names
- Deprecated commands
- Non-existent endpoints
Problem 4: No Output Capture
After running commands, there’s no record of:
- What output you saw
- Which branch you took
- What actually fixed the issue
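Even on static runbooks, a small wrapper can at least preserve a record of what was run and what it showed. A minimal sketch (the log filename is arbitrary; the `echo` at the end stands in for a real `kubectl` call):

```shell
# Append each command line and its output to a per-incident log file.
INCIDENT_LOG="incident-$(date +%Y%m%d-%H%M%S).log"

run_logged() {
  {
    printf '$ %s\n' "$*"   # record the command itself
    "$@" 2>&1              # run it, capturing stderr with the output
  } | tee -a "$INCIDENT_LOG"
}

# In a real incident: run_logged kubectl get pods -l app=api
run_logged echo "stand-in for kubectl output"
```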
Executable Runbooks: The Solution
Executable runbooks eliminate the gap between documentation and action.
Before: Static Wiki
```markdown
# API Troubleshooting
1. Check pod status: `kubectl get pods -l app=api`
2. If pods are crashing, check logs: `kubectl logs -l app=api`
3. If OOM, increase memory limits
```
After: Executable Runbook
# API Troubleshooting
## Check pod status
```bash
kubectl get pods -l app=api
```
## Check logs for errors
```bash
kubectl logs -l app=api --tail=100 | grep -i error
```
## Check resource usage
```bash
kubectl top pods -l app=api
```
## If OOM, apply increased limits
```bash
kubectl apply -f k8s/api-deployment-high-memory.yaml
```
Each block runs with a click. Output appears inline. No context switching.
MTTR Impact by Phase
Detection Phase
Executable runbooks include health check commands:
## Quick Health Check
```bash
curl -s http://api/health | jq '.status'
```
## Detailed Status
```bash
kubectl get pods -l app=api -o wide
kubectl get events --sort-by='.lastTimestamp' | grep api | tail -5
```
Run these proactively or link them from alerts.
Diagnosis Phase
This is where executable runbooks shine:
## Symptom: High Latency
### Check database connections
```bash
kubectl exec $(kubectl get pod -l app=api -o jsonpath='{.items[0].metadata.name}') -- sh -c 'ls /proc/1/fd | wc -l'
```
### Check external dependency latency
```bash
kubectl logs -l app=api --tail=200 | grep "external_call" | awk '{print $NF}' | sort -n | tail -10
```
### Check for resource pressure
```bash
kubectl top pods -l app=api
```
Execute each step, see results, move to the next.
Resolution Phase
Pre-built remediation commands:
## Remediation Options
### Option A: Restart pods
```bash
kubectl rollout restart deployment/api
```
### Option B: Scale up
```bash
kubectl scale deployment/api --replicas=5
```
### Option C: Rollback
```bash
kubectl rollout undo deployment/api
```
Click to execute. No typing, no typos.
Verification Phase
Confirm the fix worked:
## Verify Resolution
### Check pod status
```bash
kubectl get pods -l app=api
```
### Check error rate (wait 2 minutes)
```bash
curl -s -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(http_errors_total[2m])' | jq '.data.result[0].value[1]'
```
### Check latency
```bash
curl -s http://api/health | jq '.latency_ms'
```
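Verification usually means waiting and re-checking, so a small polling loop saves the engineer from re-running the query by hand. In this sketch `get_error_rate` is a stub; a real runbook would put the Prometheus query from the step above inside it:

```shell
# Poll the error rate until it drops below a threshold, or give up.
get_error_rate() {
  # Stand-in stub; replace with the real metrics query.
  echo "0.002"
}

verify_recovery() {
  local threshold=0.01 attempt rate
  for attempt in 1 2 3 4 5; do
    rate=$(get_error_rate)
    # awk handles the floating-point comparison the shell can't.
    if awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r < t) }'; then
      echo "error rate $rate below $threshold: resolution verified"
      return 0
    fi
    echo "attempt $attempt: rate $rate still high, retrying in 30s"
    sleep 30
  done
  echo "error rate did not recover: escalate" >&2
  return 1
}

verify_recovery
```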
Real MTTR Reduction Numbers
Teams using executable runbooks report:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Avg diagnosis time | 20 min | 7 min | 65% |
| Command copy errors | 15% | 0% | 100% |
| Runbook usage rate | 30% | 85% | 183% |
| Overall MTTR | 45 min | 18 min | 60% |
Building MTTR-Focused Runbooks
Structure for Speed
```markdown
# [Service] [Symptom] Runbook

## Quick Diagnosis (run all)
[3-4 commands that identify 80% of issues]

## Detailed Investigation
[Deeper commands organized by suspected cause]

## Remediation
[Pre-built fixes with clear labels]

## Verification
[Commands to confirm resolution]
```
Include Decision Points
## Check Error Type
```bash
kubectl logs -l app=api --tail=100 | grep -i error | head -5
```
### If connection errors → [Go to Database Section](#database-issues)
### If timeout errors → [Go to Dependencies Section](#external-dependencies)
### If OOM errors → [Go to Memory Section](#memory-issues)
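The decision point itself can be scripted: count each error signature and print the suggested branch. The log sample here is inline for illustration; in practice you would pipe the `kubectl logs` output from the block above into the same counters:

```shell
# Sample log lines standing in for `kubectl logs -l app=api --tail=100`.
logs=$(cat <<'EOF'
ERROR connection refused to db:5432
ERROR connection reset by peer
ERROR timeout calling upstream dependency
EOF
)

# grep -c exits 1 on zero matches, so `|| true` keeps strict shells happy.
conn=$(grep -ci 'connection' <<<"$logs" || true)
timeouts=$(grep -ci 'timeout' <<<"$logs" || true)
oom=$(grep -cEi 'oom|out of memory' <<<"$logs" || true)

if [ "$oom" -gt 0 ]; then
  echo "suggest: Memory Section ($oom OOM errors)"
elif [ "$conn" -ge "$timeouts" ]; then
  echo "suggest: Database Section ($conn connection errors)"
else
  echo "suggest: Dependencies Section ($timeouts timeout errors)"
fi
```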
Link Related Runbooks
## Escalation
If the above doesn't resolve the issue:
- [Database Recovery Runbook](/runbooks/database-recovery)
- [Full Service Restart Runbook](/runbooks/service-restart)
- [Rollback Deployment Runbook](/runbooks/rollback)
Testing Runbooks Before Incidents
Untested runbooks fail when you need them most.
# Runbook Testing Checklist
## Monthly validation
```bash
# Run each command in staging
kubectl get pods -l app=api -n staging
```
## After infrastructure changes
- [ ] Verify service names still match
- [ ] Verify commands still work
- [ ] Update any changed endpoints
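One way to keep the checklist honest in CI is to extract each fenced bash block from a runbook into its own numbered script, then run those against staging. A sketch, assuming blocks are fenced with a `bash` info string; a two-step sample runbook is generated inline so the script is self-contained:

```shell
# Build a small sample runbook (replace with a real file in practice).
runbook=sample-runbook.md
printf '%s\n' \
  '## Check pod status' '```bash' 'kubectl get pods -l app=api' '```' \
  '## Check logs' '```bash' 'kubectl logs -l app=api --tail=100' '```' \
  > "$runbook"

# Write each fenced bash block to runbook-checks/step-NN.sh.
mkdir -p runbook-checks
awk -v openf='```bash' -v closef='```' '
  $0 == openf  { inblock = 1; n++; next }
  $0 == closef { inblock = 0; next }
  inblock      { print > (sprintf("runbook-checks/step-%02d.sh", n)) }
' "$runbook"

ls runbook-checks   # lists step-01.sh and step-02.sh
```

Each generated script can then be executed (or at least shell-checked) on a schedule, so command rot surfaces before an incident does.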
Stew: Purpose-Built for MTTR
Stew makes runbooks executable by design:
- Click to run any command
- See output inline
- Share across your team
- Version control with Git
Your MTTR drops because the friction between knowing and doing disappears.
Join the waitlist and transform your incident response.