On-Call Runbook Best Practices: Lessons from SRE Teams
Good on-call runbooks are written in blood—lessons learned from painful incidents. This article distills best practices from experienced SRE teams.
For template examples, see our on-call runbook templates.
Best Practice 1: Write for 3am
Your runbook will be read by someone tired, stressed, and possibly unfamiliar with the service.
❌ Bad: Assumes Context
Check the usual suspects and verify the config is correct.
If the issue persists, you know what to do.
✅ Good: Explicit and Complete
## Check pod status
```bash
kubectl get pods -l app=api -n production
```
## Verify config loaded
```bash
kubectl exec $(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}') -n production -- head -n 20 /app/config.yaml
```
## If pods crashing, check logs
```bash
kubectl logs -l app=api -n production --previous --tail=100
```
Best Practice 2: Put the Most Common Issues First
80% of incidents come from 20% of causes. Front-load those.
# API Service Runbook
## Most Common Issues (check these first)
### 1. Pod OOM Kill (40% of incidents)
[Diagnosis and fix]
### 2. Database Connection Timeout (25% of incidents)
[Diagnosis and fix]
### 3. Upstream Service Failure (20% of incidents)
[Diagnosis and fix]
## Less Common Issues
### 4. Certificate Expiry
### 5. Disk Full
### 6. Config Drift
Best Practice 3: Include the “Why”
Context helps engineers make better decisions:
❌ Bad: Just Commands
Run: kubectl delete pod -l app=api
✅ Good: Commands with Context
## Restart API Pods
**Why**: Pods sometimes enter a bad state where connections are exhausted
but health checks still pass. Restarting forces new connections.
**Impact**: Brief (< 30s) request failures during restart.
**When to use**: After confirming connection exhaustion via logs.
```bash
kubectl rollout restart deployment/api -n production
```
Best Practice 4: Make It Executable
Commands should be copy-paste ready:
❌ Bad: Placeholders Without Guidance
Run: kubectl logs <pod-name>
✅ Good: Complete Commands
## Get API pod logs
```bash
# Get the first API pod's logs
kubectl logs $(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}') -n production --tail=100
```
Or for all API pods:
```bash
kubectl logs -l app=api -n production --tail=50
```
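When the same pod-name lookup appears in several commands, it can help to factor it into a small helper so a failed lookup fails loudly instead of expanding to an empty string. A minimal sketch; the `api_pod` function name is ours, not part of any standard tooling:

```shell
# Hypothetical helper: resolve the first API pod's name once, and fail
# loudly if no pod matches, so later commands don't silently run against "".
api_pod() {
  local pod
  pod=$(kubectl get pod -l app=api -n production \
    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  if [ -z "$pod" ]; then
    echo "error: no pods matching app=api in production" >&2
    return 1
  fi
  printf '%s\n' "$pod"
}

# Usage:
# kubectl logs "$(api_pod)" -n production --tail=100
```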
Best Practice 5: Include Verification Steps
Don’t assume a fix worked; verify it:
## Remediation: Restart Pods
### Apply fix
```bash
kubectl rollout restart deployment/api -n production
```
### Verify fix (wait 60 seconds)
```bash
# Check pods are running
kubectl get pods -l app=api -n production
# Check no errors in last minute
kubectl logs -l app=api -n production --since=1m | grep -c ERROR
# Check health endpoint
curl -s http://api.internal/health | jq '.status'
```
### Expected outcome
- All pods in Running state
- Error count = 0
- Health status = "ok"
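The wait-then-verify pattern above can be scripted so the on-call engineer isn't watching a timer. A rough sketch with a generic retry helper; the `wait_for` name and the health endpoint are ours:

```shell
# Hypothetical retry helper: run a check command until it succeeds or a
# timeout elapses. Returns 0 on success, 1 if the deadline passes.
wait_for() {
  local timeout=$1 interval=$2; shift 2
  local deadline=$(( $(date +%s) + timeout ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "error: check failed after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage after a rollout restart (health endpoint is illustrative):
# wait_for 60 5 curl -sf http://api.internal/health
```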
Best Practice 6: Define Clear Escalation Triggers
Remove ambiguity about when to escalate:
❌ Bad: Vague Triggers
Escalate if the issue seems serious or you're not sure.
✅ Good: Specific Triggers
## Escalation Triggers
Escalate **immediately** if:
- [ ] Service is completely down for > 5 minutes
- [ ] Data loss is suspected
- [ ] Security breach is possible
- [ ] Multiple services are affected
Escalate **within 15 minutes** if:
- [ ] Standard remediation steps don't work
- [ ] Root cause is unclear after investigation
- [ ] Issue requires infrastructure changes
**Do not wake people up for**:
- Single pod restarts (self-healing)
- Brief latency spikes (< 2 minutes)
- Non-production environments
Best Practice 7: Keep Runbooks Close to Alerts
Link runbooks directly in alert definitions:
```yaml
# Prometheus alerting rule (Alertmanager routes the fired alert)
- alert: APIHighErrorRate
  expr: rate(http_errors_total[5m]) > 0.1
  annotations:
    summary: "API error rate is {{ $value | printf \"%.2f\" }}/s"
    runbook_url: "https://runbooks.internal/api-high-error-rate"
    quick_check: "kubectl logs -l app=api --tail=50 | grep ERROR"
```
Best Practice 8: Version Control Your Runbooks
Treat runbooks like code:
```
# Runbook repository structure
runbooks/
├── services/
│   ├── api.md
│   ├── worker.md
│   └── scheduler.md
├── infrastructure/
│   ├── kubernetes.md
│   ├── database.md
│   └── redis.md
├── processes/
│   ├── on-call-handoff.md
│   └── incident-response.md
└── templates/
    └── service-template.md
```
Benefits:
- Track changes over time
- Review updates via PR
- Rollback bad changes
- `git blame` shows who added each step, and when
Best Practice 9: Schedule Regular Reviews
Runbooks rot without maintenance:
# Runbook Review Schedule
## Weekly
- [ ] Update any procedures that failed during incidents
## Monthly
- [ ] Test all quick diagnosis commands
- [ ] Verify contact information is current
- [ ] Check for deprecated commands/endpoints
## Quarterly
- [ ] Full runbook walkthrough with team
- [ ] Archive runbooks for decommissioned services
- [ ] Add runbooks for new services
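Part of the review checklist above can be automated: each fenced bash block in a runbook can be extracted and run through `bash -n` to catch typos and broken quoting before an incident does. A minimal sketch (the `check_runbook` name is ours; a syntax check is no substitute for actually running the commands in staging):

```shell
# Extract ```bash blocks from a runbook and syntax-check each with bash -n.
# This catches typos and broken quoting, not logical errors.
check_runbook() {
  local file=$1 block="" in_block=0 failed=0 lineno=0
  while IFS= read -r line; do
    lineno=$((lineno + 1))
    if [ "$in_block" -eq 0 ] && [ "$line" = '```bash' ]; then
      in_block=1 block=""
    elif [ "$in_block" -eq 1 ] && [ "$line" = '```' ]; then
      in_block=0
      if ! bash -n <<<"$block"; then
        echo "syntax error in block ending at line $lineno of $file" >&2
        failed=1
      fi
    elif [ "$in_block" -eq 1 ]; then
      block="$block$line"$'\n'
    fi
  done <"$file"
  return "$failed"
}

# Usage:
# check_runbook runbooks/services/api.md
```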
Best Practice 10: Learn from Incidents
Every incident should improve a runbook:
# Post-Incident Runbook Update
## Incident: API-2024-01-15
### What was missing from the runbook?
- No command to check connection pool status
- Escalation contact was outdated
### Updates made:
1. Added connection pool diagnostic:
```bash
curl -s localhost:8080/debug/pools | jq '.'
```
2. Updated escalation contacts
### PR: #142
Anti-Patterns to Avoid
Anti-Pattern 1: The Novel
# Bad: 50 pages of documentation
Chapter 1: History of the API Service...
Keep it concise. Link to detailed docs if needed.
Anti-Pattern 2: The Wiki Graveyard
Runbooks scattered across Confluence, Google Docs, GitHub, and Notion.
Fix: Single source of truth, version controlled.
Anti-Pattern 3: The Expert’s Shorthand
# Bad: Only makes sense to the author
Check the thing, flip the bit, bounce it.
Fix: Write for someone who’s never seen the service.
Anti-Pattern 4: The Untested Runbook
Commands written but never executed.
Fix: Test every command in staging monthly.
Stew: Best Practices Built-In
Stew enforces on-call runbook best practices:
- Executable by default: Every command runs with a click
- Version controlled: Runbooks are markdown in Git
- Close to alerts: Trigger runbooks from monitoring
- Always tested: Run commands to verify they work
Join the waitlist and build better on-call documentation.