
Automating MTTR Reduction: From Manual to Self-Healing

· 5 min read · Stew Team
Tags: mttr, automation, self-healing, sre

Manual incident response doesn’t scale. As systems grow, automation becomes essential for maintaining low MTTR.

This guide covers the automation spectrum from executable runbooks to self-healing. For manual MTTR strategies, see our MTTR reduction guide.

The Automation Spectrum

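| Level | Description | Human Involvement | MTTR Impact |
|-------|-------------|-------------------|-------------|
| 0 | No runbooks | 100% manual | Baseline |
| 1 | Static runbooks | Copy-paste commands | -20% |
| 2 | Executable runbooks | Click to run | -50% |
| 3 | Semi-automated | Approve and run | -70% |
| 4 | Fully automated | Self-healing | -90% |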

Most teams should target Levels 2-3. Level 4 requires significant investment in testing and guardrails.
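
To make the percentages concrete, the sketch below applies the table's estimated reductions to a baseline MTTR. The reduction figures are this article's estimates rather than measurements, and the 60-minute baseline and the function name are assumptions chosen for illustration.

```python
# Estimated MTTR reduction per automation level, taken from the table above.
REDUCTION_BY_LEVEL = {0: 0.0, 1: 0.20, 2: 0.50, 3: 0.70, 4: 0.90}


def projected_mttr(baseline_minutes: float, level: int) -> float:
    """Apply the estimated reduction for an automation level to a baseline MTTR."""
    return baseline_minutes * (1.0 - REDUCTION_BY_LEVEL[level])
```

With a 60-minute baseline, `projected_mttr(60, 2)` gives 30 minutes, which is the gap between copy-pasting commands and clicking to run them if the estimates hold for your team.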

Level 1: Static Runbooks

Where most teams start:

# High CPU Alert Runbook

1. SSH to the affected server
2. Run `top` to identify high CPU process
3. Check if it's a known process
4. If unknown, kill with `kill -9 PID`
5. Monitor for recurrence

Problems:

  • Requires reading and interpreting
  • Manual command execution
  • Easy to make mistakes

Level 2: Executable Runbooks

Commands run with a click:

# High CPU Alert Runbook

## Identify high CPU process
​```bash
ps aux --sort=-%cpu | head -10
​```

## Check process details
​```bash
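# Take the PID of the top CPU consumer (row 2, after the header)
PID=$(ps aux --sort=-%cpu | awk 'NR==2 {print $2}')
# cmdline is NUL-separated; turn the separators into spaces
tr '\0' ' ' < "/proc/$PID/cmdline"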
​```

## If safe to kill
​```bash
kill -9 $PID
​```

## Verify CPU normalized
​```bash
uptime
​```

Benefits:

  • No copy-paste errors
  • Faster execution
  • Output captured for postmortem
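
How a runbook tool captures output for the postmortem is not specified here. One minimal sketch, assuming each step is a shell command run through `subprocess`, is shown below; `StepResult` and `run_step` are hypothetical names, not part of any particular product.

```python
import subprocess
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    command: str
    exit_code: int
    stdout: str
    stderr: str
    duration_seconds: float


def run_step(command: str, shell: str = "bash", timeout_seconds: float = 30.0) -> StepResult:
    """Run one runbook command and keep everything it printed."""
    started = time.monotonic()
    completed = subprocess.run(
        [shell, "-c", command],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout_seconds,
    )
    return StepResult(
        command=command,
        exit_code=completed.returncode,
        stdout=completed.stdout,
        stderr=completed.stderr,
        duration_seconds=time.monotonic() - started,
    )
```

Storing the `StepResult` for every step gives the postmortem a record of what was run, what it returned, and how long it took. A step that exceeds the timeout raises `subprocess.TimeoutExpired`, which this sketch leaves to the caller.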

Level 3: Semi-Automated Remediation

Automation proposes a remediation and executes it once a human approves:

# Alert definition with auto-remediation
alert: HighCPU
expr: node_cpu_usage > 90
for: 5m
annotations:
  remediation: restart-heavy-process
  requires_approval: true

# Auto-Remediation Runbook

## Triggered by: HighCPU alert
## Status: Awaiting approval

### Proposed action
​```bash
# Kill highest CPU process (non-critical)
PID=$(ps aux --sort=-%cpu | grep -v "critical-service" | awk 'NR==2 {print $2}')
kill -15 $PID
​```

### [Approve] [Reject] [Modify]

Benefits:

  • Faster than manual
  • Human oversight maintained
  • Consistent remediation

Level 4: Self-Healing

Fully automated response:

# Kubernetes self-healing example
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
      resources:
        limits:
          memory: "512Mi"
        requests:
          memory: "256Mi"

# Horizontal Pod Autoscaler
kubectl autoscale deployment api --min=3 --max=10 --cpu-percent=70

Benefits:

  • Near-zero MTTR for known issues
  • No human involvement needed
  • 24/7 response

Risks:

  • Can mask underlying problems
  • May cause cascading issues
  • Requires thorough testing

Building Automation Incrementally

Step 1: Identify Repetitive Incidents

# Analyze incident patterns
cat incidents.json | jq 'group_by(.category) | 
  map({category: .[0].category, count: length}) | 
  sort_by(-.count) | .[0:10]'

The query returns the ten most frequent categories. Focus automation on the top 5 incident types.
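
The same ranking can be sketched in Python, assuming `incidents.json` holds a list of objects with a `category` field (the shape the jq query implies). The function name is hypothetical.

```python
import json
from collections import Counter


def top_categories(incidents: list[dict], limit: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent incident categories with their counts."""
    counts = Counter(incident["category"] for incident in incidents)
    return counts.most_common(limit)


def load_incidents(path: str) -> list[dict]:
    """Read the incident export from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `top_categories(load_incidents("incidents.json"))` yields the shortlist of incident types worth automating first.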

Step 2: Create Executable Runbooks

For each common incident:

# [Incident Type] Runbook

## Detection
​```bash
# Commands that confirm the issue
​```

## Diagnosis
​```bash
# Commands that identify root cause
​```

## Remediation
​```bash
# Commands that fix the issue
​```

## Verification
​```bash
# Commands that confirm resolution
​```

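The four sections of the template imply an order: confirm, diagnose, fix, verify. A minimal sketch of that control flow follows, assuming each phase is a single command and that `run` is a caller-supplied function returning an exit code (both are assumptions for illustration).

```python
from typing import Callable

PHASES = ("detection", "diagnosis", "remediation", "verification")


def run_runbook(commands: dict[str, str], run: Callable[[str], int]) -> dict[str, str]:
    """Run the four phases in order and stop at the first phase that fails.

    `run` executes one command and returns its exit code (0 means success).
    """
    outcome: dict[str, str] = {}
    for phase in PHASES:
        if run(commands[phase]) != 0:
            outcome[phase] = "failed"
            break  # never remediate an unconfirmed issue or verify a failed fix
        outcome[phase] = "ok"
    return outcome
```

An incident counts as resolved only when the returned dictionary contains `"verification": "ok"`.
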
Step 3: Add Automated Triggers

Link alerts to runbooks:

# Alertmanager webhook to runbook system
receivers:
  - name: runbook-trigger
    webhook_configs:
      - url: http://runbook-system/api/trigger
        send_resolved: true
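
On the receiving side, the runbook system has to turn the webhook body into runbook names. A sketch under two assumptions: the payload follows Alertmanager's webhook JSON (an `alerts` list whose entries carry `status`, `labels`, and `annotations`), and the runbook is named in a `remediation` annotation as in the HighCPU alert earlier. The function name is hypothetical.

```python
def remediations_for(payload: dict) -> list[tuple[str, str]]:
    """Return (alert name, runbook name) pairs for every firing alert."""
    selected = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # with send_resolved: true, resolved alerts arrive too
        runbook = alert.get("annotations", {}).get("remediation")
        if runbook:
            name = alert.get("labels", {}).get("alertname", "unknown")
            selected.append((name, runbook))
    return selected
```

Skipping resolved alerts matters because `send_resolved: true` makes Alertmanager call the webhook again when the alert clears, and that second call should not trigger remediation.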

Step 4: Enable Approval Workflows

# Pseudo-code for approval workflow
def on_alert(alert):
    runbook = find_runbook(alert.type)
    if runbook.auto_approve:
        execute(runbook)
    else:
        notify_oncall(runbook, await_approval=True)
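
The pseudo-code leaves `find_runbook`, `execute`, and `notify_oncall` undefined. A runnable sketch of the same decision, with execution and approval passed in as callables, might look like this; the `Runbook` class and the `request_approval` callback are assumptions standing in for your paging or chat integration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Runbook:
    name: str
    auto_approve: bool = False


def on_alert(
    runbook: Runbook,
    execute: Callable[[Runbook], None],
    request_approval: Callable[[Runbook], bool],
) -> str:
    """Execute immediately if auto-approved, otherwise only after a human says yes."""
    if runbook.auto_approve or request_approval(runbook):
        execute(runbook)
        return "executed"
    return "rejected"
```

Because `auto_approve` is checked first, an auto-approved runbook never pages anyone, which is the behavior the pseudo-code describes.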

Step 5: Graduate to Full Automation

After enough successful approved executions (more than 10 in this example) and a low failure rate, promote the runbook:

# Promote to auto-approve
if runbook.successful_executions > 10 and runbook.failure_rate < 0.01:
    runbook.auto_approve = True
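
The snippet assumes the runbook already tracks its own success count and failure rate. A self-contained sketch of that bookkeeping follows; `RunbookStats` and `should_auto_approve` are hypothetical names, and the thresholds mirror the ones above.

```python
from dataclasses import dataclass


@dataclass
class RunbookStats:
    successes: int = 0
    failures: int = 0

    @property
    def failure_rate(self) -> float:
        total = self.successes + self.failures
        # With no history at all, treat the runbook as untrusted.
        return self.failures / total if total else 1.0


def should_auto_approve(
    stats: RunbookStats,
    min_successes: int = 10,
    max_failure_rate: float = 0.01,
) -> bool:
    """Promote only after enough successes and a low enough failure rate."""
    return stats.successes > min_successes and stats.failure_rate < max_failure_rate
```

Note what the 1% threshold implies: a single failure keeps the runbook in approval mode until it has more than 100 recorded runs, so the rule is stricter than the "more than 10 successes" condition suggests.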

Automation Patterns

Pattern 1: Restart on Failure

# Kubernetes restart policy
spec:
  restartPolicy: Always
  containers:
    - name: app
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 3

Pattern 2: Scale on Load

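# HPA for automatic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  # Required: the workload this autoscaler controls
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70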

Pattern 3: Failover on Error

# Circuit breaker pattern
circuitBreaker:
  maxFailures: 5
  timeout: 30s
  fallback: cached-response
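
The config above is generic rather than tied to a specific library. To show what the three settings do, here is a minimal circuit breaker sketch in Python; the class is illustrative, and the injectable `clock` is an assumption added to make it testable.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 5,
        timeout_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.timeout_seconds:
                return fallback()  # open: do not touch the failing dependency
            self.opened_at = None  # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0  # a success closes the breaker again
        return result
```

After `max_failures` consecutive errors the breaker serves the fallback for `timeout_seconds` without calling the dependency, then lets a single trial call through to decide whether to close or reopen.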

Pattern 4: Rollback on Metrics

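#!/bin/bash
# Automated rollback on error spike
# -G with --data-urlencode stops curl from treating [5m] as a URL glob
ERROR_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode 'query=rate(http_errors[5m])' |
    jq -r '.data.result[0].value[1] // "0"')  # no data: fall back to 0

if (( $(echo "$ERROR_RATE > 0.1" | bc -l) )); then
    kubectl rollout undo deployment/api
    # notify is a placeholder for your own alerting command
    notify "Auto-rollback triggered due to high error rate"
fi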
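
The rollback decision itself is easier to test when separated from the HTTP call. The sketch below assumes the standard Prometheus instant-query response shape, where each sample is a `[timestamp, "value"]` pair; the function name and the choice to do nothing on missing data are assumptions.

```python
def should_rollback(response: dict, threshold: float = 0.1) -> bool:
    """Decide on a rollback from a Prometheus instant-query response."""
    if response.get("status") != "success":
        return False  # no trustworthy data: leave the rollout alone
    results = response.get("data", {}).get("result", [])
    if not results:
        return False  # a missing metric is not evidence of errors
    # Each sample is [unix_timestamp, "value-as-string"].
    return any(float(sample["value"][1]) > threshold for sample in results)
```

Treating "no data" as "do not roll back" is a design choice: an automated rollback triggered by a monitoring outage is exactly the kind of cascading issue listed under the Level 4 risks.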

Automation Safety

Guardrails

# Limit blast radius
automation:
  max_pods_affected: 3
  cooldown_period: 10m
  max_actions_per_hour: 5
  require_approval_after: 3_failures
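
The config keys above are illustrative, not a real product schema. As a sketch of how two of them could be enforced, the class below checks a cooldown and an hourly action cap before each automated action; the class and method names are hypothetical, and the caller passes in the current time in seconds.

```python
class Guardrail:
    def __init__(self, cooldown_seconds: float = 600.0, max_actions_per_hour: int = 5) -> None:
        self.cooldown_seconds = cooldown_seconds
        self.max_actions_per_hour = max_actions_per_hour
        self.history: list[float] = []  # timestamps of actions that were allowed

    def allow(self, now: float) -> bool:
        """Return True and record the action if both limits permit it."""
        recent = [t for t in self.history if now - t < 3600]
        if recent and now - recent[-1] < self.cooldown_seconds:
            return False  # still inside the cooldown after the last action
        if len(recent) >= self.max_actions_per_hour:
            return False  # hourly budget exhausted
        recent.append(now)
        self.history = recent
        return True
```

A denied action is a natural point to fall back to the approval workflow instead of silently dropping the alert.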

Kill Switches

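# Disable all automation (--overwrite so it works even if the annotation already exists)
kubectl annotate deployment api automation.enabled=false --overwrite

# Or via feature flag (send the body as JSON explicitly)
curl -X POST http://feature-flags/automation \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'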

Audit Trail

# Log all automated actions
{
  "timestamp": "2024-01-15T10:30:00Z",
  "action": "restart_pod",
  "trigger": "HighMemory alert",
  "target": "api-7d4f8b6c9-x2k4j",
  "result": "success",
  "approval": "auto"
}
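
A small helper can keep every automated action in that shape. This is a sketch that emits one JSON line per action; the function name is hypothetical and the field names are copied from the example record.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    action: str,
    trigger: str,
    target: str,
    result: str,
    approval: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one automated action as a single JSON line."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return json.dumps({
        "timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        "trigger": trigger,
        "target": target,
        "result": result,
        "approval": approval,
    })
```

Writing one line per action to an append-only log keeps the trail easy to query with the same jq commands used elsewhere in this guide.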

Measuring Automation Impact

# Compare MTTR: automated vs manual
cat incidents.json | jq '
  group_by(.automated) | 
  map({
    automated: .[0].automated, 
    avg_mttr: ([.[].mttr_minutes] | add / length),
    count: length
  })'
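
To turn the two averages into a single number, the comparison can be sketched in Python, assuming the same `automated` and `mttr_minutes` fields the jq query uses. The function names are hypothetical.

```python
from statistics import mean


def mttr_by_mode(incidents: list[dict]) -> dict[str, float]:
    """Average MTTR in minutes for automated and manual incidents."""
    groups: dict[bool, list[float]] = {True: [], False: []}
    for incident in incidents:
        groups[bool(incident["automated"])].append(incident["mttr_minutes"])
    return {
        ("automated" if is_automated else "manual"): mean(values)
        for is_automated, values in groups.items()
        if values  # skip a mode with no incidents
    }


def reduction_percent(manual_mttr: float, automated_mttr: float) -> float:
    """Percentage by which automation lowered MTTR relative to manual response."""
    return (manual_mttr - automated_mttr) / manual_mttr * 100
```

Comparing the result against the estimates in the automation spectrum table shows whether your runbooks deliver the expected improvement. Keep in mind that automated incidents are usually the simpler, well-understood ones, so the raw comparison tends to flatter automation.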

Stew: The Bridge to Automation

Stew helps teams progress through automation levels:

  • Level 2: Executable runbooks out of the box
  • Level 3: Runbooks triggered by alerts, with approval
  • Level 4: Agentic automation that runs runbooks automatically

Start with executable runbooks. Graduate to automation as confidence grows.

Join the waitlist and begin your automation journey.