Automating MTTR Reduction: From Manual to Self-Healing
Manual incident response doesn’t scale. As systems grow, automation becomes essential for maintaining low MTTR.
This guide covers the automation spectrum from executable runbooks to self-healing. For manual MTTR strategies, see our MTTR reduction guide.
The Automation Spectrum
| Level | Description | Human Involvement | MTTR Impact |
|---|---|---|---|
| 0 | No runbooks | 100% manual | Baseline |
| 1 | Static runbooks | Copy-paste commands | -20% |
| 2 | Executable runbooks | Click to run | -50% |
| 3 | Semi-automated | Approve and run | -70% |
| 4 | Fully automated | Self-healing | -90% |
Most teams should target Levels 2-3; Level 4 requires significant, ongoing investment.
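The table above lends itself to a quick back-of-the-envelope projection. A minimal sketch, assuming the percentage reductions in the table and a hypothetical 60-minute baseline MTTR (function and constant names are illustrative):

```python
# Rough MTTR projection per automation level, using the reductions
# from the table above. The 60-minute baseline is hypothetical.
REDUCTION = {0: 0.0, 1: 0.20, 2: 0.50, 3: 0.70, 4: 0.90}

def projected_mttr(baseline_minutes: float, level: int) -> float:
    """Return the expected MTTR after applying the level's reduction."""
    return baseline_minutes * (1 - REDUCTION[level])

for level in range(5):
    print(f"Level {level}: {projected_mttr(60, level):.0f} min")
```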
Level 1: Static Runbooks
Where most teams start:
# High CPU Alert Runbook
1. SSH to the affected server
2. Run `top` to identify high CPU process
3. Check if it's a known process
4. If unknown, kill with `kill -9 PID`
5. Monitor for recurrence
Problems:
- Requires reading and interpreting
- Manual command execution
- Easy to make mistakes
Level 2: Executable Runbooks
Commands run with a click:
# High CPU Alert Runbook
## Identify high CPU process
```bash
ps aux --sort=-%cpu | head -10
```
## Check process details
```bash
PID=$(ps aux --sort=-%cpu | awk 'NR==2 {print $2}')
cat /proc/$PID/cmdline | tr '\0' ' '
```
## If safe to kill
```bash
kill -15 $PID  # try graceful termination first; escalate to -9 only if ignored
```
## Verify CPU normalized
```bash
uptime
```
Benefits:
- No copy-paste errors
- Faster execution
- Output captured for postmortem
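The "identify high CPU process" step above can be sketched as a small parser of `ps aux --sort=-%cpu` output; the function name and the sample output are illustrative:

```python
def top_cpu_pid(ps_output: str, skip_header: bool = True) -> int:
    """Parse `ps aux --sort=-%cpu` output and return the PID of the
    busiest process (first data row, second column)."""
    lines = ps_output.strip().splitlines()
    rows = lines[1:] if skip_header else lines
    return int(rows[0].split()[1])

sample = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 4242 97.0 1.2 10000 2000 ? R 10:00 5:00 ffmpeg
root 17 0.3 0.1 5000 1000 ? S 09:00 0:01 sshd"""
print(top_cpu_pid(sample))  # 4242
```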
Level 3: Semi-Automated Remediation
Automation suggests and executes with approval:
```yaml
# Alert definition with auto-remediation
alert: HighCPU
expr: node_cpu_usage > 90
for: 5m
annotations:
  remediation: restart-heavy-process
  requires_approval: true
```
# Auto-Remediation Runbook
## Triggered by: HighCPU alert
## Status: Awaiting approval
### Proposed action
```bash
# Kill highest CPU process (non-critical)
PID=$(ps aux --sort=-%cpu | grep -v "critical-service" | awk 'NR==2 {print $2}')
kill -15 $PID
```
### [Approve] [Reject] [Modify]
Benefits:
- Faster than manual
- Human oversight maintained
- Consistent remediation
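The proposed action above excludes critical processes with `grep -v`; the same selection logic can be sketched in Python (the function name, tuple layout, and critical-service list are illustrative):

```python
def pick_kill_target(processes, critical=("critical-service",)):
    """Given (pid, cpu_percent, name) tuples, return the PID of the
    highest-CPU process not on the critical list, or None."""
    candidates = [p for p in processes if p[2] not in critical]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1])[0]

procs = [(101, 95.0, "critical-service"), (202, 80.0, "batch-job"), (303, 5.0, "sshd")]
print(pick_kill_target(procs))  # 202
```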
Level 4: Self-Healing
Fully automated response:
```yaml
# Kubernetes self-healing example
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: api
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    resources:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"
```

```bash
# Horizontal Pod Autoscaler
kubectl autoscale deployment api --min=3 --max=10 --cpu-percent=70
```
Benefits:
- Near-zero MTTR for known issues
- No human involvement needed
- 24/7 response
Risks:
- Can mask underlying problems
- May cause cascading issues
- Requires thorough testing
Building Automation Incrementally
Step 1: Identify Repetitive Incidents
```bash
# Analyze incident patterns
cat incidents.json | jq 'group_by(.category) |
  map({category: .[0].category, count: length}) |
  sort_by(-.count) | .[0:10]'
```
Focus automation on the top 5 incident types.
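The same grouping the jq pipeline above performs can be sketched in Python, which is handy if the incident data already lives in an application (the field names mirror the jq query; everything else is illustrative):

```python
from collections import Counter

def top_incident_types(incidents, n=5):
    """Return the n most common incident categories with counts."""
    return Counter(i["category"] for i in incidents).most_common(n)

incidents = [{"category": "high-cpu"}, {"category": "disk-full"},
             {"category": "high-cpu"}, {"category": "oom"},
             {"category": "high-cpu"}, {"category": "disk-full"}]
print(top_incident_types(incidents, 3))
# [('high-cpu', 3), ('disk-full', 2), ('oom', 1)]
```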
Step 2: Create Executable Runbooks
For each common incident:
# [Incident Type] Runbook
## Detection
```bash
# Commands that confirm the issue
```
## Diagnosis
```bash
# Commands that identify root cause
```
## Remediation
```bash
# Commands that fix the issue
```
## Verification
```bash
# Commands that confirm resolution
```
Step 3: Add Automated Triggers
Link alerts to runbooks:
```yaml
# Alertmanager webhook to runbook system
receivers:
- name: runbook-trigger
  webhook_configs:
  - url: http://runbook-system/api/trigger
    send_resolved: true
```
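On the receiving end, the runbook system has to map firing alerts to runbooks. A minimal sketch, assuming an Alertmanager-style payload with an `alerts` list and `labels`; the `RUNBOOKS` mapping and function name are hypothetical:

```python
# Hypothetical alertname -> runbook mapping
RUNBOOKS = {"HighCPU": "high-cpu-runbook", "HighMemory": "high-memory-runbook"}

def runbooks_for(payload):
    """Map alerts in an Alertmanager-style webhook payload to runbook IDs."""
    names = (a["labels"].get("alertname") for a in payload.get("alerts", []))
    return [RUNBOOKS[n] for n in names if n in RUNBOOKS]

payload = {"alerts": [{"labels": {"alertname": "HighCPU"}}]}
print(runbooks_for(payload))  # ['high-cpu-runbook']
```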
Step 4: Enable Approval Workflows
```python
# Pseudo-code for approval workflow
def on_alert(alert):
    runbook = find_runbook(alert.type)
    if runbook.auto_approve:
        execute(runbook)
    else:
        notify_oncall(runbook, await_approval=True)
```
Step 5: Graduate to Full Automation
After a runbook accumulates enough successful approved executions (here, more than 10 runs with a failure rate under 1%), promote it:
```python
# Promote to auto-approve
if runbook.successful_executions > 10 and runbook.failure_rate < 0.01:
    runbook.auto_approve = True
```
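The promotion rule above can be made a standalone, testable check. A sketch with the same thresholds (function and parameter names are illustrative):

```python
def should_auto_approve(successes: int, failures: int,
                        min_successes: int = 10,
                        max_failure_rate: float = 0.01) -> bool:
    """Promote a runbook to auto-approve once it has more than
    min_successes successful runs and a low enough failure rate."""
    total = successes + failures
    if total == 0 or successes <= min_successes:
        return False
    return failures / total < max_failure_rate

print(should_auto_approve(50, 0))  # True
print(should_auto_approve(8, 0))   # False (too few runs)
print(should_auto_approve(50, 5))  # False (~9% failure rate)
```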
Automation Patterns
Pattern 1: Restart on Failure
```yaml
# Kubernetes restart policy
spec:
  restartPolicy: Always
  containers:
  - name: app
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 3
```
Pattern 2: Scale on Load
```yaml
# HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
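The HPA's core scaling rule is `desired = ceil(current * currentUtilization / targetUtilization)`, clamped to the min/max bounds. A sketch of that calculation (names and defaults are illustrative, matching the config above):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 3, max_r: int = 20) -> int:
    """HPA-style scaling rule: scale proportionally to how far the
    observed utilization is from the target, clamped to bounds."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

print(desired_replicas(5, 90, 70))  # 7  (ceil(5 * 90 / 70))
print(desired_replicas(5, 35, 70))  # 3  (clamped to minReplicas)
```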
Pattern 3: Failover on Error
```yaml
# Circuit breaker pattern
circuitBreaker:
  maxFailures: 5
  timeout: 30s
  fallback: cached-response
```
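A minimal Python sketch of the circuit breaker configured above: open after `maxFailures` consecutive errors, serve the fallback while open, and retry after the timeout. The class is illustrative, not any particular library's API:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive errors; serve a fallback
    while open; allow a retry once timeout seconds have passed."""
    def __init__(self, max_failures=5, timeout=30.0, fallback="cached-response"):
        self.max_failures, self.timeout, self.fallback = max_failures, timeout, fallback
        self.failures, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.timeout:
                return self.fallback          # open: short-circuit
            self.opened_at = None             # half-open: try once
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback
        self.failures = 0                     # success closes the circuit
        return result

cb = CircuitBreaker(max_failures=2)
cb.call(lambda: 1 / 0)          # failure 1
cb.call(lambda: 1 / 0)          # failure 2 -> circuit opens
print(cb.call(lambda: "live"))  # cached-response (still open)
```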
Pattern 4: Rollback on Metrics
```bash
#!/bin/bash
# Automated rollback on error spike
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_errors[5m])" | jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.1" | bc -l) )); then
  kubectl rollout undo deployment/api
  notify "Auto-rollback triggered due to high error rate"
fi
```
Automation Safety
Guardrails
```yaml
# Limit blast radius
automation:
  max_pods_affected: 3
  cooldown_period: 10m
  max_actions_per_hour: 5
  require_approval_after_failures: 3
```
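The cooldown and rate-cap guardrails above reduce to a simple check before each automated action. A sketch using UNIX timestamps (the function name and defaults mirror the config; all are illustrative):

```python
def allowed(action_times, now, cooldown=600, max_per_hour=5):
    """Permit an automated action only if the cooldown since the last
    action has elapsed and the trailing-hour cap is not yet hit.
    action_times: UNIX timestamps of prior automated actions."""
    if action_times and now - max(action_times) < cooldown:
        return False
    recent = [t for t in action_times if now - t < 3600]
    return len(recent) < max_per_hour

print(allowed([1000], now=1300))  # False (inside 10-minute cooldown)
print(allowed([1000], now=2000))  # True
print(allowed([0, 600, 1200, 1800, 2400], now=3500))  # False (5 in last hour)
```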
Kill Switches
```bash
# Disable all automation
kubectl annotate deployment api automation.enabled=false --overwrite

# Or via feature flag
curl -X POST http://feature-flags/automation -d '{"enabled": false}'
```
Audit Trail
# Log all automated actions
{
"timestamp": "2024-01-15T10:30:00Z",
"action": "restart_pod",
"trigger": "HighMemory alert",
"target": "api-7d4f8b6c9-x2k4j",
"result": "success",
"approval": "auto"
}
Measuring Automation Impact
# Compare MTTR: automated vs manual
cat incidents.json | jq '
group_by(.automated) |
map({
automated: .[0].automated,
avg_mttr: ([.[].mttr_minutes] | add / length),
count: length
})'
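The same comparison can be computed in Python over the incident records (field names mirror the jq query; the function name and sample data are illustrative):

```python
def mttr_by_mode(incidents):
    """Group incidents by their 'automated' flag and compute the
    average MTTR and count for each group."""
    groups = {}
    for i in incidents:
        groups.setdefault(i["automated"], []).append(i["mttr_minutes"])
    return {k: {"avg_mttr": sum(v) / len(v), "count": len(v)}
            for k, v in groups.items()}

incidents = [
    {"automated": True, "mttr_minutes": 4},
    {"automated": True, "mttr_minutes": 6},
    {"automated": False, "mttr_minutes": 45},
]
print(mttr_by_mode(incidents))
# {True: {'avg_mttr': 5.0, 'count': 2}, False: {'avg_mttr': 45.0, 'count': 1}}
```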
Stew: The Bridge to Automation
Stew helps teams progress through automation levels:
- Level 2: Executable runbooks out of the box
- Level 3: Runbooks triggered by alerts, with approval
- Level 4: Agentic automation that runs runbooks automatically
Start with executable runbooks. Graduate to automation as confidence grows.
Join the waitlist and begin your automation journey.