On-Call Runbook Best Practices: Lessons from SRE Teams
Good on-call runbooks are written in blood—lessons learned from painful incidents. This article distills best practices from experienced SRE teams.
For template examples, see our on-call runbook templates.
Best Practice 1: Write for 3am
Your runbook will be read by someone tired, stressed, and possibly unfamiliar with the service.
❌ Bad: Assumes Context
Check the usual suspects and verify the config is correct.
If the issue persists, you know what to do.
✅ Good: Explicit and Complete
## Check pod status
```bash
kubectl get pods -l app=api -n production
```
## Verify config loaded
```bash
kubectl exec $(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}') -n production -- head -n 20 /app/config.yaml
```
## If pods crashing, check logs
```bash
kubectl logs -l app=api -n production --previous --tail=100
```
Best Practice 2: Put the Most Common Issues First
80% of incidents come from 20% of causes. Front-load those.
# API Service Runbook
## Most Common Issues (check these first)
### 1. Pod OOM Kill (40% of incidents)
[Diagnosis and fix]
### 2. Database Connection Timeout (25% of incidents)
[Diagnosis and fix]
### 3. Upstream Service Failure (20% of incidents)
[Diagnosis and fix]
## Less Common Issues
### 4. Certificate Expiry
### 5. Disk Full
### 6. Config Drift
Best Practice 3: Include the “Why”
Context helps engineers make better decisions:
❌ Bad: Just Commands
Run: kubectl delete pod -l app=api
✅ Good: Commands with Context
## Restart API Pods
**Why**: Pods sometimes enter a bad state where connections are exhausted
but health checks still pass. Restarting forces new connections.
**Impact**: Brief (< 30s) request failures during restart.
**When to use**: After confirming connection exhaustion via logs.
```bash
kubectl rollout restart deployment/api -n production
```
Best Practice 4: Make It Executable
Commands should be copy-paste ready:
❌ Bad: Placeholders Without Guidance
Run: kubectl logs <pod-name>
✅ Good: Complete Commands
## Get API pod logs
```bash
# Get the first API pod's logs
kubectl logs $(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}') -n production --tail=100
```
Or for all API pods:
```bash
kubectl logs -l app=api -n production --tail=50
```
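When the same pod-name lookup appears in several commands, it can help to factor it into a small helper so a failed lookup fails loudly instead of expanding to an empty string. A minimal sketch; the `api_pod` function name is ours, not part of any standard tooling:

```shell
# Hypothetical helper: resolve the first API pod's name once, and fail
# loudly if no pod matches, so later commands don't silently run against "".
api_pod() {
  local pod
  pod=$(kubectl get pod -l app=api -n production \
    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  if [ -z "$pod" ]; then
    echo "error: no pods matching app=api in production" >&2
    return 1
  fi
  printf '%s\n' "$pod"
}

# Usage:
# kubectl logs "$(api_pod)" -n production --tail=100
```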
Best Practice 5: Include Verification Steps
Don’t assume a fix worked; verify it:
## Remediation: Restart Pods
### Apply fix
```bash
kubectl rollout restart deployment/api -n production
```
### Verify fix (wait 60 seconds)
```bash
# Check pods are running
kubectl get pods -l app=api -n production
# Check no errors in last minute
kubectl logs -l app=api -n production --since=1m | grep -c ERROR
# Check health endpoint
curl -s http://api.internal/health | jq '.status'
```
### Expected outcome
- All pods in Running state
- Error count = 0
- Health status = "ok"
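The wait-then-verify pattern above can be scripted so the on-call engineer isn't watching a timer. A rough sketch with a generic retry helper; the `wait_for` name and the health endpoint are ours:

```shell
# Hypothetical retry helper: run a check command until it succeeds or a
# timeout elapses. Returns 0 on success, 1 if the deadline passes.
wait_for() {
  local timeout=$1 interval=$2; shift 2
  local deadline=$(( $(date +%s) + timeout ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "error: check failed after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage after a rollout restart (health endpoint is illustrative):
# wait_for 60 5 curl -sf http://api.internal/health
```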
Best Practice 6: Define Clear Escalation Triggers
Remove ambiguity about when to escalate:
❌ Bad: Vague Triggers
Escalate if the issue seems serious or you're not sure.
✅ Good: Specific Triggers
## Escalation Triggers
Escalate **immediately** if:
- [ ] Service is completely down for > 5 minutes
- [ ] Data loss is suspected
- [ ] Security breach is possible
- [ ] Multiple services are affected
Escalate **within 15 minutes** if:
- [ ] Standard remediation steps don't work
- [ ] Root cause is unclear after investigation
- [ ] Issue requires infrastructure changes
**Do not wake people up for**:
- Single pod restarts (self-healing)
- Brief latency spikes (< 2 minutes)
- Non-production environments
Best Practice 7: Keep Runbooks Close to Alerts
Link runbooks directly in alert definitions:
```yaml
# Prometheus alerting rule (Alertmanager routes the fired alert)
- alert: APIHighErrorRate
  expr: rate(http_errors_total[5m]) > 0.1
  annotations:
    summary: "API error rate is {{ $value | printf \"%.2f\" }}/s"
    runbook_url: "https://runbooks.internal/api-high-error-rate"
    quick_check: "kubectl logs -l app=api --tail=50 | grep ERROR"
```
Best Practice 8: Version Control Your Runbooks
Treat runbooks like code:
```
# Runbook repository structure
runbooks/
├── services/
│   ├── api.md
│   ├── worker.md
│   └── scheduler.md
├── infrastructure/
│   ├── kubernetes.md
│   ├── database.md
│   └── redis.md
├── processes/
│   ├── on-call-handoff.md
│   └── incident-response.md
└── templates/
    └── service-template.md
```
Benefits:
- Track changes over time
- Review updates via PR
- Rollback bad changes
- `git blame` shows who added each step, and when
Best Practice 9: Schedule Regular Reviews
Runbooks rot without maintenance:
# Runbook Review Schedule
## Weekly
- [ ] Update any procedures that failed during incidents
## Monthly
- [ ] Test all quick diagnosis commands
- [ ] Verify contact information is current
- [ ] Check for deprecated commands/endpoints
## Quarterly
- [ ] Full runbook walkthrough with team
- [ ] Archive runbooks for decommissioned services
- [ ] Add runbooks for new services
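Part of the review checklist above can be automated: each fenced bash block in a runbook can be extracted and run through `bash -n` to catch typos and broken quoting before an incident does. A minimal sketch (the `check_runbook` name is ours; a syntax check is no substitute for actually running the commands in staging):

```shell
# Extract ```bash blocks from a runbook and syntax-check each with bash -n.
# This catches typos and broken quoting, not logical errors.
check_runbook() {
  local file=$1 block="" in_block=0 failed=0 lineno=0
  while IFS= read -r line; do
    lineno=$((lineno + 1))
    if [ "$in_block" -eq 0 ] && [ "$line" = '```bash' ]; then
      in_block=1 block=""
    elif [ "$in_block" -eq 1 ] && [ "$line" = '```' ]; then
      in_block=0
      if ! bash -n <<<"$block"; then
        echo "syntax error in block ending at line $lineno of $file" >&2
        failed=1
      fi
    elif [ "$in_block" -eq 1 ]; then
      block="$block$line"$'\n'
    fi
  done <"$file"
  return "$failed"
}

# Usage:
# check_runbook runbooks/services/api.md
```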
Best Practice 10: Learn from Incidents
Every incident should improve a runbook:
# Post-Incident Runbook Update
## Incident: API-2024-01-15
### What was missing from the runbook?
- No command to check connection pool status
- Escalation contact was outdated
### Updates made:
1. Added connection pool diagnostic:
```bash
curl -s localhost:8080/debug/pools | jq '.'
```
2. Updated escalation contacts
### PR: #142
Anti-Patterns to Avoid
Anti-Pattern 1: The Novel
# Bad: 50 pages of documentation
Chapter 1: History of the API Service...
Keep it concise. Link to detailed docs if needed.
Anti-Pattern 2: The Wiki Graveyard
Runbooks scattered across Confluence, Google Docs, GitHub, and Notion.
Fix: Single source of truth, version controlled.
Anti-Pattern 3: The Expert’s Shorthand
# Bad: Only makes sense to the author
Check the thing, flip the bit, bounce it.
Fix: Write for someone who’s never seen the service.
Anti-Pattern 4: The Untested Runbook
Commands written but never executed.
Fix: Test every command in staging monthly.
Stew: Best Practices Built-In
Stew enforces on-call runbook best practices:
- Executable by default: Every command runs with a click
- Version controlled: Runbooks are markdown in Git
- Close to alerts: Trigger runbooks from monitoring
- Always tested: Run commands to verify they work
Join the waitlist and build better on-call documentation.