# Automating On-Call Runbooks: From Manual to Triggered

· 5 min read · Stew Team

Tags: on-call, runbook, automation, alerting
Manual runbook execution is slow and error-prone. Automation connects alerts directly to remediation, reducing on-call burden and improving response times.
For guidance on writing the runbooks themselves, see our guide to on-call runbook best practices.
## The On-Call Automation Spectrum
| Level | Description | Response Time |
|---|---|---|
| Manual | Read wiki, type commands | 15-30 min |
| Executable | Click to run from runbook | 5-10 min |
| Triggered | Alert opens runbook | 3-5 min |
| Semi-auto | Runbook runs, human approves | 1-3 min |
| Full auto | Self-remediation | < 1 min |
Most teams should aim for Triggered or Semi-auto for common issues.
## Level 1: Alert-Triggered Runbooks

Connect alerts to relevant runbooks automatically.

### Alertmanager Configuration

```yaml
# alertmanager.yml
receivers:
  - name: 'runbook-trigger'
    webhook_configs:
      - url: 'http://runbook-system/api/open'
        send_resolved: true

route:
  receiver: 'runbook-trigger'
  routes:
    - match:
        alertname: APIHighErrorRate
      receiver: 'runbook-trigger'
      continue: true
```
### Runbook System Webhook Handler

```python
# Webhook handler (sketch) - Alertmanager POSTs alerts in batches
@app.post("/api/open")
def handle_alert(payload: dict):
    for alert in payload.get("alerts", []):
        runbook_url = alert.get("annotations", {}).get("runbook_url")
        if runbook_url:
            # Open the runbook in the on-call engineer's browser
            notify_oncall(
                message=f"Alert: {alert['labels']['alertname']}",
                runbook=runbook_url,
                auto_open=True,
            )
```
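Alertmanager delivers webhook payloads with a top-level `alerts` array, each entry carrying its own `labels` and `annotations`. As a rough sketch, the extraction step the handler relies on can be isolated into a pure function (the name `extract_runbook_links` is illustrative):

```python
def extract_runbook_links(payload: dict) -> list:
    """Pull (alertname, runbook_url) pairs from firing alerts in a webhook payload."""
    links = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # skip resolved alerts delivered via send_resolved
        url = alert.get("annotations", {}).get("runbook_url")
        if url:
            links.append((alert["labels"].get("alertname"), url))
    return links

payload = {
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "APIHighErrorRate"},
         "annotations": {"runbook_url": "https://runbooks.internal/api-errors"}},
        {"status": "resolved",
         "labels": {"alertname": "APIHighLatency"},
         "annotations": {"runbook_url": "https://runbooks.internal/api-latency"}},
    ]
}
print(extract_runbook_links(payload))
```

Keeping the parsing pure makes it easy to unit-test without standing up a webhook endpoint.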
### Alert with Runbook Link

```yaml
groups:
  - name: api
    rules:
      - alert: APIHighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate elevated"
          runbook_url: "https://runbooks.internal/api-errors"
          quick_commands: |
            kubectl logs -l app=api --tail=50 | grep ERROR
            kubectl get pods -l app=api
```
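Since annotation values are plain strings, `quick_commands` arrives as one multi-line string; a small helper (hypothetical name) can split it into individually runnable commands for the runbook UI:

```python
def parse_quick_commands(annotation: str) -> list:
    """Split a multi-line quick_commands annotation into a command list."""
    return [line.strip() for line in annotation.splitlines() if line.strip()]

annotation = """
kubectl logs -l app=api --tail=50 | grep ERROR
kubectl get pods -l app=api
"""
print(parse_quick_commands(annotation))
```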
## Level 2: Pre-Populated Diagnostics

When an alert fires, run diagnostic commands automatically.

### Alert Triggers Diagnosis

```yaml
# Alert definition
- alert: APIHighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  annotations:
    runbook_url: "https://runbooks.internal/api-latency"
    auto_diagnose: |
      kubectl top pods -l app=api
      kubectl logs -l app=api --tail=20 | grep -i slow
      curl -s http://api.internal/debug/stats
```
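One way to capture these results, sketched here with Python's standard `subprocess` module, is to run each `auto_diagnose` command with a timeout and collect the output for the runbook page; errors and timeouts are recorded rather than raised, so one failing probe doesn't block the rest:

```python
import subprocess

def run_diagnostics(commands, timeout=30):
    """Run each diagnostic command and capture its output for the runbook."""
    results = {}
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  text=True, timeout=timeout)
            # Fall back to stderr so failed commands still leave a trace
            results[cmd] = proc.stdout.strip() or proc.stderr.strip()
        except subprocess.TimeoutExpired:
            results[cmd] = "(timed out)"
    return results
```

Running these at alert time matters: by the time an engineer opens the runbook, the transient state (CPU spikes, slow queries) may already be gone.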
### Runbook Opens with Results
# API High Latency Runbook
## Auto-Diagnosis Results (captured at alert time)
### Pod Resources
```
NAME CPU MEMORY
api-7d4f8b6c9-x2k4j 450m 380Mi
api-7d4f8b6c9-k8m2n 520m 410Mi ← High CPU
```
### Recent Slow Requests
```
2024-01-15T10:30:45 slow_query database_lookup took 2.3s
2024-01-15T10:30:47 slow_query database_lookup took 2.1s
```
## Recommended Action
Based on diagnosis, this appears to be database-related latency.
### Check database
```bash
psql -h db.internal -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
```
## Level 3: One-Click Remediation

Pre-approved fixes execute with a single click.

### Runbook with Remediation Buttons
# API Pod Issues Runbook
## Diagnosis
```bash
kubectl get pods -l app=api -n production
```
## Remediation Options
### Option A: Restart Pods
**Risk**: Low - rolling restart, no downtime
**Auto-approved**: Yes
```bash
kubectl rollout restart deployment/api -n production
```
[Execute] [Skip]
### Option B: Scale Up
**Risk**: Low - adds capacity
**Auto-approved**: Yes
```bash
kubectl scale deployment/api --replicas=5 -n production
```
[Execute] [Skip]
### Option C: Rollback
**Risk**: Medium - reverts to previous version
**Auto-approved**: No - requires confirmation
```bash
kubectl rollout undo deployment/api -n production
```
[Request Approval] [Skip]
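The approval gate behind those buttons can be modeled simply: each option carries a risk level and an auto-approval flag, and only pre-approved options execute immediately. A minimal sketch (the option table and callback names are illustrative):

```python
REMEDIATIONS = {
    "restart":  {"cmd": "kubectl rollout restart deployment/api -n production",
                 "risk": "low", "auto_approved": True},
    "rollback": {"cmd": "kubectl rollout undo deployment/api -n production",
                 "risk": "medium", "auto_approved": False},
}

def execute_option(name, run, request_approval):
    """Run a pre-approved option directly; route anything else to approval."""
    opt = REMEDIATIONS[name]
    if opt["auto_approved"]:
        return run(opt["cmd"])
    return request_approval(name, opt["cmd"])
```

Keeping the command strings in a declarative table means the approval policy can be reviewed in one place, separately from the execution machinery.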
## Level 4: Semi-Automated Response

The system proposes an action; a human approves it.

### Workflow

1. Alert fires
2. System runs diagnostics
3. System determines the likely fix
4. Engineer receives a notification with the proposed action
5. Engineer clicks approve or reject
6. If approved, the system executes
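The "determines likely fix" step can be grounded in incident history: pick the fix that most often resolved past incidents with the same detected cause, and report how often it worked as the confidence score. A minimal sketch, assuming a simple incident-record shape:

```python
from collections import Counter

def propose_action(cause, history):
    """Propose the fix that most often resolved incidents with this cause."""
    matches = [h for h in history if h["cause"] == cause]
    if not matches:
        return None
    fixes = Counter(h["fix"] for h in matches if h["resolved"])
    if not fixes:
        return None
    fix, wins = fixes.most_common(1)[0]
    return {"action": fix,
            "confidence": round(wins / len(matches), 2),
            "based_on": len(matches)}
```

This is how a notification like "Confidence: 85% (based on 12 similar past incidents)" can be computed rather than guessed.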
### Example Notification
## Auto-Remediation Request
**Alert**: APIHighErrorRate
**Detected Cause**: OOM kills detected in pod logs
**Proposed Action**: Increase memory limit and restart
### Proposed Commands
```bash
kubectl set resources deployment/api --limits=memory=1Gi -n production
kubectl rollout restart deployment/api -n production
```
**Confidence**: 85% (based on 12 similar past incidents)
[Approve] [Modify] [Reject] [Investigate More]
## Level 5: Full Auto-Remediation

For well-understood issues, remove the human from the loop.

### Kubernetes Self-Healing

```yaml
# Built-in auto-remediation
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3  # Restart after 3 consecutive failures
      resources:
        limits:
          memory: "512Mi"  # OOM kill and restart if exceeded
```
### Custom Auto-Remediation

```python
# Auto-remediation controller
async def handle_alert(alert):
    if alert.name == "HighMemoryPod" and alert.labels.get("auto_remediate"):
        pod = alert.labels["pod"]
        namespace = alert.labels["namespace"]

        # Recurring issues need a human, not another restart
        if was_recently_remediated(pod, minutes=30):
            escalate(alert, "Recurring issue - needs investigation")
            return

        # Execute remediation
        result = kubectl(f"delete pod {pod} -n {namespace}")

        # Log the action for the audit trail
        log_remediation(alert, action="pod_delete", result=result)

        # Verify recovery before resolving
        await asyncio.sleep(60)
        if is_healthy(pod, namespace):
            resolve_alert(alert)
        else:
            escalate(alert, "Auto-remediation failed")
```
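The `was_recently_remediated` guard can be sketched with an in-memory timestamp map; in a real controller this state should live somewhere durable (pod annotations or a datastore) so it survives controller restarts:

```python
import time

_last_action = {}  # pod name -> timestamp of the last remediation

def record_remediation(pod, now=None):
    """Remember when this pod was last auto-remediated."""
    _last_action[pod] = time.time() if now is None else now

def was_recently_remediated(pod, minutes=30, now=None):
    """True if the pod was remediated within the cooldown window."""
    now = time.time() if now is None else now
    last = _last_action.get(pod)
    return last is not None and (now - last) < minutes * 60
```

The `now` parameter exists so the window logic can be tested without real clock time.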
## Safety Guardrails

### Rate Limiting

```yaml
# Prevent runaway automation
auto_remediation:
  max_actions_per_hour: 10
  cooldown_between_actions: 5m
  max_affected_pods: 3
```
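Enforcing those limits takes little code; a sliding-window limiter (a sketch matching the config above) covers both the hourly cap and the cooldown:

```python
import time
from collections import deque

class RemediationLimiter:
    """Enforce a max-actions-per-hour cap plus a cooldown between actions."""

    def __init__(self, max_per_hour=10, cooldown_s=300):
        self.max_per_hour = max_per_hour
        self.cooldown_s = cooldown_s
        self.actions = deque()  # timestamps of recent actions

    def allow(self, now=None):
        """Record and allow the action if both limits permit it."""
        now = time.time() if now is None else now
        # Drop actions that have aged out of the one-hour window
        while self.actions and now - self.actions[0] > 3600:
            self.actions.popleft()
        if len(self.actions) >= self.max_per_hour:
            return False
        if self.actions and now - self.actions[-1] < self.cooldown_s:
            return False
        self.actions.append(now)
        return True
```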
### Blast Radius Limits

```python
def can_auto_remediate(action, scope):
    if scope.affected_pods > 3:
        return False, "Too many pods affected"
    if scope.is_production and action.risk_level != "low":
        return False, "High-risk action in production"
    if scope.service_tier == 1:
        return False, "Tier-1 service requires approval"
    return True, None
```
### Audit Trail

```json
{
  "timestamp": "2024-01-15T10:35:00Z",
  "alert": "APIHighErrorRate",
  "action": "rollout_restart",
  "target": "deployment/api",
  "namespace": "production",
  "triggered_by": "auto",
  "approval": "pre-approved",
  "result": "success",
  "recovery_time": "45s"
}
```
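A simple, queryable storage format for these records is JSON Lines: append one JSON object per action. A sketch (`append_audit_entry` is an illustrative name):

```python
import json

def append_audit_entry(entry, path="remediation_audit.jsonl"):
    """Append one audit record per remediation action, one JSON object per line."""
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Append-only JSON Lines files work directly with `jq` and most log pipelines, which matters for the analysis in the next section.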
## Measuring Automation Impact

```bash
# Compare response times by automation level
jq '
  group_by(.automation_level) |
  map({
    level: .[0].automation_level,
    avg_response: ([.[].response_time_seconds] | add / length),
    count: length
  })' incidents.json
```
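The same aggregation in Python, for teams that prefer it over jq (field names follow the jq query):

```python
from collections import defaultdict

def response_by_level(incidents):
    """Average response time in seconds, grouped by automation level."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[inc["automation_level"]].append(inc["response_time_seconds"])
    return {level: sum(ts) / len(ts) for level, ts in groups.items()}
```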
### Expected Improvements
| Automation Level | Avg Response Time | On-Call Burden |
|---|---|---|
| Manual | 20 min | High |
| Executable | 8 min | Medium |
| Triggered | 5 min | Medium |
| Semi-auto | 2 min | Low |
| Full auto | 30 sec | Minimal |
## Stew: Automation-Ready Runbooks

Stew supports the full automation spectrum:

- **Executable**: Click to run any command
- **Triggered**: Open runbooks from alerts
- **Semi-auto**: Approval workflows built-in
- **Agentic**: AI-powered automated execution
Start with executable runbooks. Graduate to automation as confidence grows.
Join the waitlist and automate your on-call response.