Automating MTTR Reduction: From Manual to Self-Healing
Manual incident response doesn’t scale. As systems grow, automation becomes essential for maintaining low MTTR.
This guide covers the automation spectrum from executable runbooks to self-healing. For manual MTTR strategies, see our MTTR reduction guide.
The Automation Spectrum
| Level | Description | Human Involvement | MTTR Impact |
|---|---|---|---|
| 0 | No runbooks | 100% manual | Baseline |
| 1 | Static runbooks | Copy-paste commands | -20% |
| 2 | Executable runbooks | Click to run | -50% |
| 3 | Semi-automated | Approve and run | -70% |
| 4 | Fully automated | Self-healing | -90% |
Most teams should target Levels 2-3; Level 4 requires significant, ongoing investment.
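The table above lends itself to a quick back-of-the-envelope projection. A minimal sketch, assuming the percentage reductions in the table and a hypothetical 60-minute baseline MTTR (function and constant names are illustrative):

```python
# Rough MTTR projection per automation level, using the reductions
# from the table above. The 60-minute baseline is hypothetical.
REDUCTION = {0: 0.0, 1: 0.20, 2: 0.50, 3: 0.70, 4: 0.90}

def projected_mttr(baseline_minutes: float, level: int) -> float:
    """Return the expected MTTR after applying the level's reduction."""
    return baseline_minutes * (1 - REDUCTION[level])

for level in range(5):
    print(f"Level {level}: {projected_mttr(60, level):.0f} min")
```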
Level 1: Static Runbooks
Where most teams start:
# High CPU Alert Runbook
1. SSH to the affected server
2. Run `top` to identify high CPU process
3. Check if it's a known process
4. If unknown, kill with `kill -9 PID`
5. Monitor for recurrence
Problems:
- Requires reading and interpreting
- Manual command execution
- Easy to make mistakes
Level 2: Executable Runbooks
Commands run with a click:
# High CPU Alert Runbook
## Identify high CPU process
```bash
ps aux --sort=-%cpu | head -10
```
## Check process details
```bash
PID=$(ps aux --sort=-%cpu | awk 'NR==2 {print $2}')
cat /proc/$PID/cmdline | tr '\0' ' '
```
## If safe to kill
```bash
kill -15 $PID  # try graceful termination first; escalate to -9 only if ignored
```
## Verify CPU normalized
```bash
uptime
```
Benefits:
- No copy-paste errors
- Faster execution
- Output captured for postmortem
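The "identify high CPU process" step above can be sketched as a small parser of `ps aux --sort=-%cpu` output; the function name and the sample output are illustrative:

```python
def top_cpu_pid(ps_output: str, skip_header: bool = True) -> int:
    """Parse `ps aux --sort=-%cpu` output and return the PID of the
    busiest process (first data row, second column)."""
    lines = ps_output.strip().splitlines()
    rows = lines[1:] if skip_header else lines
    return int(rows[0].split()[1])

sample = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 4242 97.0 1.2 10000 2000 ? R 10:00 5:00 ffmpeg
root 17 0.3 0.1 5000 1000 ? S 09:00 0:01 sshd"""
print(top_cpu_pid(sample))  # 4242
```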
Level 3: Semi-Automated Remediation
Automation suggests and executes with approval:
```yaml
# Alert definition with auto-remediation
alert: HighCPU
expr: node_cpu_usage > 90
for: 5m
annotations:
  remediation: restart-heavy-process
  requires_approval: true
```
# Auto-Remediation Runbook
## Triggered by: HighCPU alert
## Status: Awaiting approval
### Proposed action
```bash
# Kill highest CPU process (non-critical)
PID=$(ps aux --sort=-%cpu | grep -v "critical-service" | awk 'NR==2 {print $2}')
kill -15 $PID
```
### [Approve] [Reject] [Modify]
Benefits:
- Faster than manual
- Human oversight maintained
- Consistent remediation
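The proposed action above excludes critical processes with `grep -v`; the same selection logic can be sketched in Python (the function name, tuple layout, and critical-service list are illustrative):

```python
def pick_kill_target(processes, critical=("critical-service",)):
    """Given (pid, cpu_percent, name) tuples, return the PID of the
    highest-CPU process not on the critical list, or None."""
    candidates = [p for p in processes if p[2] not in critical]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1])[0]

procs = [(101, 95.0, "critical-service"), (202, 80.0, "batch-job"), (303, 5.0, "sshd")]
print(pick_kill_target(procs))  # 202
```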
Level 4: Self-Healing
Fully automated response:
```yaml
# Kubernetes self-healing example
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: api
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"
```

```bash
# Horizontal Pod Autoscaler
kubectl autoscale deployment api --min=3 --max=10 --cpu-percent=70
```
Benefits:
- Near-zero MTTR for known issues
- No human involvement needed
- 24/7 response
Risks:
- Can mask underlying problems
- May cause cascading issues
- Requires thorough testing
Building Automation Incrementally
Step 1: Identify Repetitive Incidents
```bash
# Analyze incident patterns
cat incidents.json | jq 'group_by(.category) |
  map({category: .[0].category, count: length}) |
  sort_by(-.count) | .[0:10]'
```
Focus automation on the top 5 incident types.
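The same grouping the jq pipeline above performs can be sketched in Python, which is handy if the incident data already lives in an application (the field names mirror the jq query; everything else is illustrative):

```python
from collections import Counter

def top_incident_types(incidents, n=5):
    """Return the n most common incident categories with counts."""
    return Counter(i["category"] for i in incidents).most_common(n)

incidents = [{"category": "high-cpu"}, {"category": "disk-full"},
             {"category": "high-cpu"}, {"category": "oom"},
             {"category": "high-cpu"}, {"category": "disk-full"}]
print(top_incident_types(incidents, 3))
# [('high-cpu', 3), ('disk-full', 2), ('oom', 1)]
```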
Step 2: Create Executable Runbooks
For each common incident:
# [Incident Type] Runbook
## Detection
```bash
# Commands that confirm the issue
```
## Diagnosis
```bash
# Commands that identify root cause
```
## Remediation
```bash
# Commands that fix the issue
```
## Verification
```bash
# Commands that confirm resolution
```
Step 3: Add Automated Triggers
Link alerts to runbooks:
```yaml
# Alertmanager webhook to runbook system
receivers:
- name: runbook-trigger
  webhook_configs:
  - url: http://runbook-system/api/trigger
    send_resolved: true
```
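On the receiving end, the runbook system has to map firing alerts to runbooks. A minimal sketch, assuming an Alertmanager-style payload with an `alerts` list and `labels`; the `RUNBOOKS` mapping and function name are hypothetical:

```python
# Hypothetical alertname -> runbook mapping
RUNBOOKS = {"HighCPU": "high-cpu-runbook", "HighMemory": "high-memory-runbook"}

def runbooks_for(payload):
    """Map alerts in an Alertmanager-style webhook payload to runbook IDs."""
    names = (a["labels"].get("alertname") for a in payload.get("alerts", []))
    return [RUNBOOKS[n] for n in names if n in RUNBOOKS]

payload = {"alerts": [{"labels": {"alertname": "HighCPU"}}]}
print(runbooks_for(payload))  # ['high-cpu-runbook']
```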
Step 4: Enable Approval Workflows
```python
# Pseudo-code for approval workflow
def on_alert(alert):
    runbook = find_runbook(alert.type)
    if runbook.auto_approve:
        execute(runbook)
    else:
        notify_oncall(runbook, await_approval=True)
```
Step 5: Graduate to Full Automation
After a runbook accumulates enough successful approved executions (here, more than 10 runs with a failure rate under 1%), promote it:
```python
# Promote to auto-approve
if runbook.successful_executions > 10 and runbook.failure_rate < 0.01:
    runbook.auto_approve = True
```
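The promotion rule above can be made a standalone, testable check. A sketch with the same thresholds (function and parameter names are illustrative):

```python
def should_auto_approve(successes: int, failures: int,
                        min_successes: int = 10,
                        max_failure_rate: float = 0.01) -> bool:
    """Promote a runbook to auto-approve once it has more than
    min_successes successful runs and a low enough failure rate."""
    total = successes + failures
    if total == 0 or successes <= min_successes:
        return False
    return failures / total < max_failure_rate

print(should_auto_approve(50, 0))  # True
print(should_auto_approve(8, 0))   # False (too few runs)
print(should_auto_approve(50, 5))  # False (~9% failure rate)
```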
Automation Patterns
Pattern 1: Restart on Failure
```yaml
# Kubernetes restart policy
spec:
  restartPolicy: Always
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 3
```
Pattern 2: Scale on Load
```yaml
# HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
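The HPA's core scaling rule is `desired = ceil(current * currentUtilization / targetUtilization)`, clamped to the min/max bounds. A sketch of that calculation (names and defaults are illustrative, matching the config above):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 20) -> int:
    """HPA-style scaling rule: scale proportionally to how far the
    observed utilization is from the target, clamped to bounds."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(5, 90, 70))  # 7  (ceil(5 * 90 / 70))
print(desired_replicas(5, 35, 70))  # 3  (clamped to minReplicas)
```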
Pattern 3: Failover on Error
```yaml
# Circuit breaker pattern
circuitBreaker:
  maxFailures: 5
  timeout: 30s
  fallback: cached-response
```
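A minimal Python sketch of the circuit breaker configured above: open after `maxFailures` consecutive errors, serve the fallback while open, and retry after the timeout. The class is illustrative, not any particular library's API:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive errors; serve a fallback
    while open; allow a retry once timeout seconds have passed."""
    def __init__(self, max_failures=5, timeout=30.0, fallback="cached-response"):
        self.max_failures, self.timeout, self.fallback = max_failures, timeout, fallback
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.timeout:
                return self.fallback          # open: short-circuit
            self.opened_at = None             # half-open: try once
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback
        self.failures = 0                     # success closes the circuit
        return result

cb = CircuitBreaker(max_failures=2)
cb.call(lambda: 1 / 0)          # failure 1
cb.call(lambda: 1 / 0)          # failure 2 -> circuit opens
print(cb.call(lambda: "live"))  # cached-response (still open)
```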
Pattern 4: Rollback on Metrics
```bash
#!/bin/bash
# Automated rollback on error spike
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_errors[5m])" | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.1" | bc -l) )); then
  kubectl rollout undo deployment/api
  notify "Auto-rollback triggered due to high error rate"
fi
```
Automation Safety
Guardrails
```yaml
# Limit blast radius
automation:
  max_pods_affected: 3
  cooldown_period: 10m
  max_actions_per_hour: 5
  require_approval_after_failures: 3
```
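The cooldown and rate-cap guardrails above reduce to a simple check before each automated action. A sketch using UNIX timestamps (the function name and defaults mirror the config; all are illustrative):

```python
def allowed(action_times, now, cooldown=600, max_per_hour=5):
    """Permit an automated action only if the cooldown since the last
    action has elapsed and the trailing-hour cap is not yet hit.
    action_times: UNIX timestamps of prior automated actions."""
    if action_times and now - max(action_times) < cooldown:
        return False
    recent = [t for t in action_times if now - t < 3600]
    return len(recent) < max_per_hour

print(allowed([1000], now=1300))  # False (inside 10-minute cooldown)
print(allowed([1000], now=2000))  # True
print(allowed([0, 600, 1200, 1800, 2400], now=3500))  # False (5 in last hour)
```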
Kill Switches
```bash
# Disable all automation
kubectl annotate deployment api automation.enabled=false --overwrite

# Or via feature flag
curl -X POST http://feature-flags/automation -d '{"enabled": false}'
```
Audit Trail
# Log all automated actions
{
"timestamp": "2024-01-15T10:30:00Z",
"action": "restart_pod",
"trigger": "HighMemory alert",
"target": "api-7d4f8b6c9-x2k4j",
"result": "success",
"approval": "auto"
}
Measuring Automation Impact
# Compare MTTR: automated vs manual
cat incidents.json | jq '
group_by(.automated) |
map({
automated: .[0].automated,
avg_mttr: ([.[].mttr_minutes] | add / length),
count: length
})'
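The same comparison can be computed in Python over the incident records (field names mirror the jq query; the function name and sample data are illustrative):

```python
def mttr_by_mode(incidents):
    """Group incidents by their 'automated' flag and compute the
    average MTTR and count for each group."""
    groups = {}
    for i in incidents:
        groups.setdefault(i["automated"], []).append(i["mttr_minutes"])
    return {k: {"avg_mttr": sum(v) / len(v), "count": len(v)}
            for k, v in groups.items()}

incidents = [
    {"automated": True, "mttr_minutes": 4},
    {"automated": True, "mttr_minutes": 6},
    {"automated": False, "mttr_minutes": 45},
]
print(mttr_by_mode(incidents))
# {True: {'avg_mttr': 5.0, 'count': 2}, False: {'avg_mttr': 45.0, 'count': 1}}
```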
Stew: The Bridge to Automation
Stew helps teams progress through automation levels:
- Level 2: Executable runbooks out of the box
- Level 3: Runbooks triggered by alerts, with approval
- Level 4: Agentic automation that runs runbooks automatically
Start with executable runbooks. Graduate to automation as confidence grows.
Join the waitlist and begin your automation journey.