
Automating On-Call Runbooks: From Manual to Triggered

5 min read · Stew Team
on-call · runbook · automation · alerting

Manual runbook execution is slow and error-prone. Automation connects alerts directly to remediation, reducing on-call burden and improving response times.

For the manual fundamentals, see our guide to on-call runbook best practices.

The On-Call Automation Spectrum

| Level | Description | Response Time |
|-------|-------------|---------------|
| Manual | Read wiki, type commands | 15-30 min |
| Executable | Click to run from runbook | 5-10 min |
| Triggered | Alert opens runbook | 3-5 min |
| Semi-auto | Runbook runs, human approves | 1-3 min |
| Full auto | Self-remediation | < 1 min |

Most teams should aim for Triggered or Semi-auto for common issues.

Level 1: Alert-Triggered Runbooks

Connect alerts to relevant runbooks automatically.

Alertmanager Configuration

# alertmanager.yml
receivers:
  - name: 'oncall-pager'
    # ... existing paging integration ...
  - name: 'runbook-trigger'
    webhook_configs:
      - url: 'http://runbook-system/api/open'
        send_resolved: true

route:
  receiver: 'oncall-pager'  # default receiver
  routes:
    - match:
        alertname: APIHighErrorRate
      receiver: 'runbook-trigger'
      continue: true  # let the alert also match any later routes

Runbook System Webhook Handler

# Sketch using FastAPI; notify_oncall is your notification integration
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/api/open")
async def handle_alert(request: Request):
    payload = await request.json()
    # Alertmanager sends a batch of alerts per webhook call
    for alert in payload.get("alerts", []):
        runbook_url = alert.get("annotations", {}).get("runbook_url")
        if runbook_url:
            # Push the runbook link straight to the on-call engineer
            notify_oncall(
                message=f"Alert: {alert['labels']['alertname']}",
                runbook=runbook_url,
                auto_open=True,
            )
    return {"status": "ok"}
Prometheus Alert Rule

groups:
  - name: api
    rules:
      - alert: APIHighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate elevated"
          runbook_url: "https://runbooks.internal/api-errors"
          quick_commands: |
            kubectl logs -l app=api --tail=50 | grep ERROR
            kubectl get pods -l app=api

Level 2: Pre-Populated Diagnostics

When alert triggers, run diagnostic commands automatically.

Alert Triggers Diagnosis

# Alert definition
- alert: APIHighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  annotations:
    runbook_url: "https://runbooks.internal/api-latency"
    auto_diagnose: |  # annotations are strings; one command per line
      kubectl top pods -l app=api
      kubectl logs -l app=api --tail=20 | grep -i slow
      curl -s http://api.internal/debug/stats
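A minimal runner for the diagnostic commands might look like the sketch below. The function name and the Markdown capture format are assumptions; the runbook system would call it with the commands parsed from the alert annotation.

```python
import subprocess

def run_diagnostics(commands, timeout=30):
    """Run each diagnostic command and collect output as Markdown sections."""
    sections = []
    for cmd in commands:
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout
            )
            output = result.stdout or result.stderr
        except subprocess.TimeoutExpired:
            output = f"(timed out after {timeout}s)"
        # One section per command: heading, then the captured output
        sections.append(f"### {cmd}\n{output.strip()}")
    return "\n\n".join(sections)
```

The captured text is then embedded into the runbook page, as shown in the example that follows.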

Runbook Opens with Results

# API High Latency Runbook

## Auto-Diagnosis Results (captured at alert time)

### Pod Resources
​```
NAME                   CPU    MEMORY
api-7d4f8b6c9-x2k4j   450m   380Mi
api-7d4f8b6c9-k8m2n   520m   410Mi  ← High CPU
​```

### Recent Slow Requests
​```
2024-01-15T10:30:45 slow_query database_lookup took 2.3s
2024-01-15T10:30:47 slow_query database_lookup took 2.1s
​```

## Recommended Action

Based on diagnosis, this appears to be database-related latency.

### Check database
​```bash
psql -h db.internal -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
​```

Level 3: One-Click Remediation

Pre-approved fixes execute with a single click.

Runbook with Remediation Buttons

# API Pod Issues Runbook

## Diagnosis
​```bash
kubectl get pods -l app=api -n production
​```

## Remediation Options

### Option A: Restart Pods
**Risk**: Low - rolling restart, no downtime
**Auto-approved**: Yes

​```bash
kubectl rollout restart deployment/api -n production
​```
[Execute] [Skip]

### Option B: Scale Up
**Risk**: Low - adds capacity
**Auto-approved**: Yes

​```bash
kubectl scale deployment/api --replicas=5 -n production
​```
[Execute] [Skip]

### Option C: Rollback
**Risk**: Medium - reverts to previous version
**Auto-approved**: No - requires confirmation

​```bash
kubectl rollout undo deployment/api -n production
​```
[Request Approval] [Skip]
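Behind each [Execute] button, the runbook system only needs a small gate: look up the action, check its auto-approval flag, and run it. A sketch, where the action registry and its entries are hypothetical:

```python
import subprocess

# Hypothetical registry of pre-approved remediations
ACTIONS = {
    "restart_api": {
        "cmd": "kubectl rollout restart deployment/api -n production",
        "auto_approved": True,
    },
    "rollback_api": {
        "cmd": "kubectl rollout undo deployment/api -n production",
        "auto_approved": False,
    },
}

def execute_action(name, approved_by=None):
    action = ACTIONS[name]
    # Non-auto-approved actions stop here until a human signs off
    if not action["auto_approved"] and approved_by is None:
        return {"status": "pending_approval", "action": name}
    result = subprocess.run(action["cmd"], shell=True, capture_output=True, text=True)
    return {"status": "executed", "action": name, "exit_code": result.returncode}
```

Keeping the command text in the registry, not in the button, means the runbook page can never execute anything that was not pre-registered.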

Level 4: Semi-Automated Response

System proposes action, human approves.

Workflow

  1. Alert fires
  2. System runs diagnostics
  3. System determines the likely fix
  4. Engineer receives a notification with the proposed action
  5. Engineer clicks approve or reject
  6. If approved, the system executes
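Steps 3 and 4 reduce to a confidence gate: propose an action only when the system is sure enough, otherwise escalate to a human investigation. A sketch, with the threshold value and field names as assumptions:

```python
def propose_remediation(alert, diagnosis, confidence, threshold=0.7):
    """Only propose an auto-remediation when confidence clears the threshold."""
    if confidence < threshold:
        # Not confident enough: hand the incident to a human as-is
        return {"decision": "escalate", "reason": "low confidence"}
    return {
        "decision": "await_approval",
        "alert": alert,
        "proposed_action": diagnosis["suggested_fix"],
        "confidence": confidence,
    }
```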

Example Notification

## Auto-Remediation Request

**Alert**: APIHighErrorRate
**Detected Cause**: OOM kills detected in pod logs
**Proposed Action**: Increase memory limit and restart

### Proposed Commands
​```bash
kubectl set resources deployment/api --limits=memory=1Gi -n production
kubectl rollout restart deployment/api -n production
​```

**Confidence**: 85% (based on 12 similar past incidents)

[Approve] [Modify] [Reject] [Investigate More]

Level 5: Full Auto-Remediation

For well-understood issues, remove human from the loop.

Kubernetes Self-Healing

# Built-in auto-remediation
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3  # Restart after 3 failures
      resources:
        limits:
          memory: "512Mi"  # OOM kill and restart if exceeded

Custom Auto-Remediation

# Auto-remediation controller; kubectl(), escalate(), etc. are integration stubs
async def handle_alert(alert):
    if alert.name == "HighMemoryPod" and alert.labels.get("auto_remediate"):
        pod = alert.labels["pod"]
        namespace = alert.labels["namespace"]
        
        # Check if recently remediated
        if was_recently_remediated(pod, minutes=30):
            escalate(alert, "Recurring issue - needs investigation")
            return
        
        # Execute remediation
        result = kubectl(f"delete pod {pod} -n {namespace}")
        
        # Log action
        log_remediation(alert, action="pod_delete", result=result)
        
        # Verify
        await asyncio.sleep(60)
        if is_healthy(pod, namespace):
            resolve_alert(alert)
        else:
            escalate(alert, "Auto-remediation failed")

Safety Guardrails

Rate Limiting

# Prevent runaway automation
auto_remediation:
  max_actions_per_hour: 10
  cooldown_between_actions: 5m
  max_affected_pods: 3
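Enforcing these limits in the controller is a few lines of in-memory state. A sketch mirroring the config fields above (class name and one-hour window are assumptions):

```python
import time

class RemediationLimiter:
    def __init__(self, max_per_hour=10, cooldown_seconds=300):
        self.max_per_hour = max_per_hour
        self.cooldown_seconds = cooldown_seconds
        self.history = []  # timestamps of executed actions

    def allow(self, now=None):
        """Return (allowed, reason); records the action when allowed."""
        now = now if now is not None else time.time()
        # Drop entries older than one hour
        self.history = [t for t in self.history if now - t < 3600]
        if len(self.history) >= self.max_per_hour:
            return False, "hourly limit reached"
        if self.history and now - self.history[-1] < self.cooldown_seconds:
            return False, "in cooldown"
        self.history.append(now)
        return True, None
```

When `allow` returns `False`, the controller should escalate to a human rather than silently drop the alert.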

Blast Radius Limits

# Rank risk levels explicitly; comparing the strings lexicographically is wrong
# ("high" sorts before "low")
RISK_RANK = {"low": 0, "medium": 1, "high": 2}

def can_auto_remediate(action, scope):
    if scope.affected_pods > 3:
        return False, "Too many pods affected"
    if scope.is_production and RISK_RANK[action.risk_level] > RISK_RANK["low"]:
        return False, "High-risk action in production"
    if scope.service_tier == 1:
        return False, "Tier-1 service requires approval"
    return True, None

Audit Trail

{
  "timestamp": "2024-01-15T10:35:00Z",
  "alert": "APIHighErrorRate",
  "action": "rollout_restart",
  "target": "deployment/api",
  "namespace": "production",
  "triggered_by": "auto",
  "approval": "pre-approved",
  "result": "success",
  "recovery_time": "45s"
}
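Emitting a record like this on every action takes only a small helper; append-only JSONL keeps the trail easy to grep and to ship to a log pipeline. The file path and field subset here are assumptions:

```python
import json
import time

def log_remediation(alert, action, target, result, path="remediation-audit.jsonl"):
    """Append one audit record per remediation action."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "alert": alert,
        "action": action,
        "target": target,
        "triggered_by": "auto",
        "result": result,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```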

Measuring Automation Impact

# Compare response times
cat incidents.json | jq '
  group_by(.automation_level) |
  map({
    level: .[0].automation_level,
    avg_response: ([.[].response_time_seconds] | add / length),
    count: length
  })'

Expected Improvements

| Automation Level | Avg Response Time | On-Call Burden |
|------------------|-------------------|----------------|
| Manual | 20 min | High |
| Executable | 8 min | Medium |
| Triggered | 5 min | Medium |
| Semi-auto | 2 min | Low |
| Full auto | 30 sec | Minimal |

Stew: Automation-Ready Runbooks

Stew supports the full automation spectrum:

  • Executable: Click to run any command
  • Triggered: Open runbooks from alerts
  • Semi-auto: Approval workflows built-in
  • Agentic: AI-powered automated execution

Start with executable runbooks. Graduate to automation as confidence grows.

Join the waitlist and automate your on-call response.