# Automating On-Call Runbooks: From Manual to Triggered

· 5 min read · Stew Team

Tags: on-call, runbook, automation, alerting
Manual runbook execution is slow and error-prone. Automation connects alerts directly to remediation, reducing on-call burden and improving response times.
For guidance on writing the runbooks themselves, see our guide to on-call runbook best practices.
## The On-Call Automation Spectrum
| Level | Description | Response Time |
|---|---|---|
| Manual | Read wiki, type commands | 15-30 min |
| Executable | Click to run from runbook | 5-10 min |
| Triggered | Alert opens runbook | 3-5 min |
| Semi-auto | Runbook runs, human approves | 1-3 min |
| Full auto | Self-remediation | < 1 min |
Most teams should aim for Triggered or Semi-auto for common issues.
## Level 1: Alert-Triggered Runbooks

Connect alerts to relevant runbooks automatically.

### Alertmanager Configuration

```yaml
# alertmanager.yml
receivers:
  - name: 'runbook-trigger'
    webhook_configs:
      - url: 'http://runbook-system/api/open'
        send_resolved: true

route:
  receiver: 'runbook-trigger'
  routes:
    - match:
        alertname: APIHighErrorRate
      receiver: 'runbook-trigger'
      continue: true
```
### Runbook System Webhook Handler

```python
# Webhook handler (sketch) - Alertmanager POSTs alerts in batches
@app.post("/api/open")
def handle_alert(payload: dict):
    for alert in payload.get("alerts", []):
        runbook_url = alert.get("annotations", {}).get("runbook_url")
        if runbook_url:
            # Open the runbook in the on-call engineer's browser
            notify_oncall(
                message=f"Alert: {alert['labels']['alertname']}",
                runbook=runbook_url,
                auto_open=True,
            )
```
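Alertmanager delivers webhook payloads with a top-level `alerts` array, each entry carrying its own `labels` and `annotations`. As a rough sketch, the extraction step the handler relies on can be isolated into a pure function (the name `extract_runbook_links` is illustrative):

```python
def extract_runbook_links(payload: dict) -> list:
    """Pull (alertname, runbook_url) pairs from firing alerts in a webhook payload."""
    links = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # skip resolved alerts delivered via send_resolved
        url = alert.get("annotations", {}).get("runbook_url")
        if url:
            links.append((alert["labels"].get("alertname"), url))
    return links

payload = {
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "APIHighErrorRate"},
         "annotations": {"runbook_url": "https://runbooks.internal/api-errors"}},
        {"status": "resolved",
         "labels": {"alertname": "APIHighLatency"},
         "annotations": {"runbook_url": "https://runbooks.internal/api-latency"}},
    ]
}
print(extract_runbook_links(payload))
```

Keeping the parsing pure makes it easy to unit-test without standing up a webhook endpoint.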
### Alert with Runbook Link

```yaml
groups:
  - name: api
    rules:
      - alert: APIHighErrorRate
        expr: rate(http_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "API error rate elevated"
          runbook_url: "https://runbooks.internal/api-errors"
          quick_commands: |
            kubectl logs -l app=api --tail=50 | grep ERROR
            kubectl get pods -l app=api
```
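Since annotation values are plain strings, `quick_commands` arrives as one multi-line string; a small helper (hypothetical name) can split it into individually runnable commands for the runbook UI:

```python
def parse_quick_commands(annotation: str) -> list:
    """Split a multi-line quick_commands annotation into a command list."""
    return [line.strip() for line in annotation.splitlines() if line.strip()]

annotation = """
kubectl logs -l app=api --tail=50 | grep ERROR
kubectl get pods -l app=api
"""
print(parse_quick_commands(annotation))
```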
## Level 2: Pre-Populated Diagnostics

When an alert fires, run diagnostic commands automatically.

### Alert Triggers Diagnosis

```yaml
# Alert definition
- alert: APIHighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  annotations:
    runbook_url: "https://runbooks.internal/api-latency"
    auto_diagnose: |
      kubectl top pods -l app=api
      kubectl logs -l app=api --tail=20 | grep -i slow
      curl -s http://api.internal/debug/stats
```
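One way to capture these results, sketched here with Python's standard `subprocess` module, is to run each `auto_diagnose` command with a timeout and collect the output for the runbook page; errors and timeouts are recorded rather than raised, so one failing probe doesn't block the rest:

```python
import subprocess

def run_diagnostics(commands, timeout=30):
    """Run each diagnostic command and capture its output for the runbook."""
    results = {}
    for cmd in commands:
        try:
            proc = subprocess.run(cmd, shell=True, capture_output=True,
                                  text=True, timeout=timeout)
            # Fall back to stderr so failed commands still leave a trace
            results[cmd] = proc.stdout.strip() or proc.stderr.strip()
        except subprocess.TimeoutExpired:
            results[cmd] = "(timed out)"
    return results
```

Running these at alert time matters: by the time an engineer opens the runbook, the transient state (CPU spikes, slow queries) may already be gone.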
### Runbook Opens with Results
# API High Latency Runbook
## Auto-Diagnosis Results (captured at alert time)
### Pod Resources
```
NAME CPU MEMORY
api-7d4f8b6c9-x2k4j 450m 380Mi
api-7d4f8b6c9-k8m2n 520m 410Mi ← High CPU
```
### Recent Slow Requests
```
2024-01-15T10:30:45 slow_query database_lookup took 2.3s
2024-01-15T10:30:47 slow_query database_lookup took 2.1s
```
## Recommended Action
Based on diagnosis, this appears to be database-related latency.
### Check database
```bash
psql -h db.internal -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
```
## Level 3: One-Click Remediation

Pre-approved fixes execute with a single click.

### Runbook with Remediation Buttons
# API Pod Issues Runbook
## Diagnosis
```bash
kubectl get pods -l app=api -n production
```
## Remediation Options
### Option A: Restart Pods
**Risk**: Low - rolling restart, no downtime
**Auto-approved**: Yes
```bash
kubectl rollout restart deployment/api -n production
```
[Execute] [Skip]
### Option B: Scale Up
**Risk**: Low - adds capacity
**Auto-approved**: Yes
```bash
kubectl scale deployment/api --replicas=5 -n production
```
[Execute] [Skip]
### Option C: Rollback
**Risk**: Medium - reverts to previous version
**Auto-approved**: No - requires confirmation
```bash
kubectl rollout undo deployment/api -n production
```
[Request Approval] [Skip]
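The approval gate behind those buttons can be modeled simply: each option carries a risk level and an auto-approval flag, and only pre-approved options execute immediately. A minimal sketch (the option table and callback names are illustrative):

```python
REMEDIATIONS = {
    "restart":  {"cmd": "kubectl rollout restart deployment/api -n production",
                 "risk": "low", "auto_approved": True},
    "rollback": {"cmd": "kubectl rollout undo deployment/api -n production",
                 "risk": "medium", "auto_approved": False},
}

def execute_option(name, run, request_approval):
    """Run a pre-approved option directly; route anything else to approval."""
    opt = REMEDIATIONS[name]
    if opt["auto_approved"]:
        return run(opt["cmd"])
    return request_approval(name, opt["cmd"])
```

Keeping the command strings in a declarative table means the approval policy can be reviewed in one place, separately from the execution machinery.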
## Level 4: Semi-Automated Response

The system proposes an action; a human approves it.

### Workflow

1. Alert fires
2. System runs diagnostics
3. System determines the likely fix
4. Engineer receives a notification with the proposed action
5. Engineer clicks approve or reject
6. If approved, the system executes
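The "determines likely fix" step can be grounded in incident history: pick the fix that most often resolved past incidents with the same detected cause, and report how often it worked as the confidence score. A minimal sketch, assuming a simple incident-record shape:

```python
from collections import Counter

def propose_action(cause, history):
    """Propose the fix that most often resolved incidents with this cause."""
    matches = [h for h in history if h["cause"] == cause]
    if not matches:
        return None
    fixes = Counter(h["fix"] for h in matches if h["resolved"])
    if not fixes:
        return None
    fix, wins = fixes.most_common(1)[0]
    return {"action": fix,
            "confidence": round(wins / len(matches), 2),
            "based_on": len(matches)}
```

This is how a notification like "Confidence: 85% (based on 12 similar past incidents)" can be computed rather than guessed.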
### Example Notification
## Auto-Remediation Request
**Alert**: APIHighErrorRate
**Detected Cause**: OOM kills detected in pod logs
**Proposed Action**: Increase memory limit and restart
### Proposed Commands
```bash
kubectl set resources deployment/api --limits=memory=1Gi -n production
kubectl rollout restart deployment/api -n production
```
**Confidence**: 85% (based on 12 similar past incidents)
[Approve] [Modify] [Reject] [Investigate More]
## Level 5: Full Auto-Remediation

For well-understood issues, remove the human from the loop.

### Kubernetes Self-Healing

```yaml
# Built-in auto-remediation
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: api
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3  # Restart after 3 consecutive failures
      resources:
        limits:
          memory: "512Mi"  # OOM kill and restart if exceeded
```
### Custom Auto-Remediation

```python
# Auto-remediation controller
async def handle_alert(alert):
    if alert.name == "HighMemoryPod" and alert.labels.get("auto_remediate"):
        pod = alert.labels["pod"]
        namespace = alert.labels["namespace"]

        # Recurring issues need a human, not another restart
        if was_recently_remediated(pod, minutes=30):
            escalate(alert, "Recurring issue - needs investigation")
            return

        # Execute remediation
        result = kubectl(f"delete pod {pod} -n {namespace}")

        # Log the action for the audit trail
        log_remediation(alert, action="pod_delete", result=result)

        # Verify recovery before resolving
        await asyncio.sleep(60)
        if is_healthy(pod, namespace):
            resolve_alert(alert)
        else:
            escalate(alert, "Auto-remediation failed")
```
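The `was_recently_remediated` guard can be sketched with an in-memory timestamp map; in a real controller this state should live somewhere durable (pod annotations or a datastore) so it survives controller restarts:

```python
import time

_last_action = {}  # pod name -> timestamp of the last remediation

def record_remediation(pod, now=None):
    """Remember when this pod was last auto-remediated."""
    _last_action[pod] = time.time() if now is None else now

def was_recently_remediated(pod, minutes=30, now=None):
    """True if the pod was remediated within the cooldown window."""
    now = time.time() if now is None else now
    last = _last_action.get(pod)
    return last is not None and (now - last) < minutes * 60
```

The `now` parameter exists so the window logic can be tested without real clock time.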
## Safety Guardrails

### Rate Limiting

```yaml
# Prevent runaway automation
auto_remediation:
  max_actions_per_hour: 10
  cooldown_between_actions: 5m
  max_affected_pods: 3
```
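Enforcing those limits takes little code; a sliding-window limiter (a sketch matching the config above) covers both the hourly cap and the cooldown:

```python
import time
from collections import deque

class RemediationLimiter:
    """Enforce a max-actions-per-hour cap plus a cooldown between actions."""

    def __init__(self, max_per_hour=10, cooldown_s=300):
        self.max_per_hour = max_per_hour
        self.cooldown_s = cooldown_s
        self.actions = deque()  # timestamps of recent actions

    def allow(self, now=None):
        """Record and allow the action if both limits permit it."""
        now = time.time() if now is None else now
        # Drop actions that have aged out of the one-hour window
        while self.actions and now - self.actions[0] > 3600:
            self.actions.popleft()
        if len(self.actions) >= self.max_per_hour:
            return False
        if self.actions and now - self.actions[-1] < self.cooldown_s:
            return False
        self.actions.append(now)
        return True
```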
### Blast Radius Limits

```python
def can_auto_remediate(action, scope):
    if scope.affected_pods > 3:
        return False, "Too many pods affected"
    if scope.is_production and action.risk_level != "low":
        return False, "High-risk action in production"
    if scope.service_tier == 1:
        return False, "Tier-1 service requires approval"
    return True, None
```
### Audit Trail

```json
{
  "timestamp": "2024-01-15T10:35:00Z",
  "alert": "APIHighErrorRate",
  "action": "rollout_restart",
  "target": "deployment/api",
  "namespace": "production",
  "triggered_by": "auto",
  "approval": "pre-approved",
  "result": "success",
  "recovery_time": "45s"
}
```
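A simple, queryable storage format for these records is JSON Lines: append one JSON object per action. A sketch (`append_audit_entry` is an illustrative name):

```python
import json

def append_audit_entry(entry, path="remediation_audit.jsonl"):
    """Append one audit record per remediation action, one JSON object per line."""
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

Append-only JSON Lines files work directly with `jq` and most log pipelines, which matters for the analysis in the next section.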
## Measuring Automation Impact

```bash
# Compare response times by automation level
jq '
  group_by(.automation_level) |
  map({
    level: .[0].automation_level,
    avg_response: ([.[].response_time_seconds] | add / length),
    count: length
  })' incidents.json
```
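The same aggregation in Python, for teams that prefer it over jq (field names follow the jq query):

```python
from collections import defaultdict

def response_by_level(incidents):
    """Average response time in seconds, grouped by automation level."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[inc["automation_level"]].append(inc["response_time_seconds"])
    return {level: sum(ts) / len(ts) for level, ts in groups.items()}
```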
### Expected Improvements
| Automation Level | Avg Response Time | On-Call Burden |
|---|---|---|
| Manual | 20 min | High |
| Executable | 8 min | Medium |
| Triggered | 5 min | Medium |
| Semi-auto | 2 min | Low |
| Full auto | 30 sec | Minimal |
## Stew: Automation-Ready Runbooks

Stew supports the full automation spectrum:

- **Executable**: Click to run any command
- **Triggered**: Open runbooks from alerts
- **Semi-auto**: Approval workflows built-in
- **Agentic**: AI-powered automated execution
Start with executable runbooks. Graduate to automation as confidence grows.
Join the waitlist and automate your on-call response.