On-Call Runbook Guide: Essential Documentation for SREs

6 min read · Stew Team

Tags: on-call · runbook · sre · incident-response

Being on-call without good runbooks is like firefighting without water. You might survive, but it’s going to hurt.

This guide covers how to build on-call runbooks that actually help during incidents. For general runbook creation, see how to write a runbook.

What Makes a Good On-Call Runbook

On-call runbooks differ from general documentation:

  • Speed-focused: Every second matters at 3am
  • Decision-oriented: Clear paths through complex situations
  • Self-contained: Everything needed in one place
  • Executable: Commands ready to run

On-Call Runbook Structure

The Essential Sections

# [Service/Alert Name] Runbook

## Overview
- What this service does
- Who owns it
- Escalation contacts

## Quick Diagnosis
[3-4 commands that identify 80% of issues]

## Common Issues
### Issue 1: [Most common problem]
### Issue 2: [Second most common]
### Issue 3: [Third most common]

## Escalation
- When to escalate
- How to escalate
- Who to contact

## Reference
- Architecture diagram
- Related runbooks
- External dependencies

Quick Diagnosis Section

The first thing an on-call engineer runs:

## Quick Diagnosis

Run these commands first to understand the situation:

### Service Status
```bash
kubectl get pods -l app=api -n production
kubectl get events --sort-by='.lastTimestamp' -n production | grep api | tail -10
```

### Recent Errors
```bash
kubectl logs -l app=api -n production --tail=50 | grep -i error
```

### Resource Usage
```bash
kubectl top pods -l app=api -n production
```

### External Dependencies
```bash
curl -s http://api.internal/health | jq '.dependencies'
```
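To make the whole section one command, the checks can be wrapped in a small runner that reports which step failed. A sketch; the `run_checks` name is made up, and the commands fed to it are the ones above:

```shell
# run_checks (sketch): run each quick-diagnosis command in order,
# flag failures, and return the number of failed checks.
run_checks() {
    failed=0
    for cmd in "$@"; do
        printf '== %s\n' "$cmd"
        sh -c "$cmd" || { printf '!! FAILED: %s\n' "$cmd"; failed=$((failed + 1)); }
    done
    return "$failed"
}

# During an incident, feed it the commands from this section:
#   run_checks \
#     "kubectl get pods -l app=api -n production" \
#     "kubectl top pods -l app=api -n production" \
#     "curl -sf http://api.internal/health"
```

A non-zero return tells you at a glance how many checks need attention before you even read the output.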

Common Issues Section

Document the issues that actually happen:

## Common Issues

### Issue 1: Pods in CrashLoopBackOff

**Symptoms**: Alert fires, pods restarting repeatedly

**Diagnosis**:
```bash
kubectl describe pod -l app=api -n production | grep -A5 "State:"
kubectl logs -l app=api -n production --previous --tail=100
```

**Common causes**:
1. OOM kill → Increase memory limits
2. Failed health check → Check /health endpoint
3. Missing config → Verify ConfigMap mounted

**Resolution**:
```bash
# If OOM, apply higher limits
kubectl apply -f k8s/api-high-memory.yaml

# If config issue, recreate pod
kubectl delete pod -l app=api -n production
```
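Exit codes narrow the cause quickly: Kubernetes reports 128 + signal number for signal kills, so 137 almost always means OOMKilled. A small triage helper (the function name is illustrative):

```shell
# triage_exit_code (sketch): map a container's last exit code to the
# likely causes listed above. Codes follow the 128+signal convention.
triage_exit_code() {
    case "$1" in
        137) echo "SIGKILL (137 = 128+9): likely OOM kill -- check memory limits" ;;
        139) echo "SIGSEGV (139 = 128+11): process crashed -- check previous logs" ;;
        143) echo "SIGTERM (143 = 128+15): killed by kubelet -- often a failed health check" ;;
        0)   echo "clean exit: container finished -- check restartPolicy and command" ;;
        *)   echo "application exit code $1: check previous logs and config" ;;
    esac
}

# Pull the code from the last terminated container state:
#   kubectl get pod <pod> -n production \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```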

---

### Issue 2: High Latency

**Symptoms**: p99 latency > 500ms

**Diagnosis**:
```bash
# Check database connection time
API_POD=$(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$API_POD" -n production -- curl -s localhost:8080/debug/db-stats

# Check external API latency
kubectl logs -l app=api -n production --tail=100 | grep "external_call" | tail -20
```

**Common causes**:
1. Database slow → Check DB metrics
2. External API timeout → Enable circuit breaker
3. Resource pressure → Scale up

**Resolution**:
```bash
# Scale up
kubectl scale deployment/api --replicas=5 -n production

# Or enable circuit breaker
kubectl set env deployment/api CIRCUIT_BREAKER_ENABLED=true -n production
```
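To verify the p99 claim directly from logs, a nearest-rank percentile is a one-liner with sort and awk. A sketch; the `latency_ms=` log field is an assumption about your log format:

```shell
# p99 (sketch): nearest-rank 99th percentile of one numeric value
# per line on stdin (e.g. request latencies in milliseconds).
p99() {
    sort -n | awk '{ v[NR] = $1 } END {
        if (NR == 0) exit 1
        idx = int(NR * 0.99); if (idx < 1) idx = 1
        print v[idx]
    }'
}

# Compare recent traffic against the 500ms alert threshold:
#   kubectl logs -l app=api -n production --tail=1000 \
#     | grep -o 'latency_ms=[0-9]*' | cut -d= -f2 | p99
```

This is an approximation over whatever window of logs you feed it, not a substitute for your metrics system, but it is fast and works from any pod.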

Escalation Section

Clear escalation paths:

## Escalation

### When to Escalate

Escalate immediately if:
- [ ] Data loss suspected
- [ ] Security breach possible
- [ ] Customer-facing impact > 15 minutes
- [ ] You're unsure after 10 minutes of investigation

### How to Escalate

1. **Page the service owner**:
```bash
# Via PagerDuty
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"API_TEAM_KEY","event_action":"trigger","payload":{"summary":"API incident - escalation","severity":"critical","source":"on-call"}}'
```
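Hand-written JSON breaks as soon as the summary contains a quote; building the payload with jq sidesteps that. A sketch, where the `pd_payload` name is made up and the routing key is the placeholder from above:

```shell
# pd_payload (sketch): build a PagerDuty Events v2 payload safely,
# letting jq handle all JSON escaping.
pd_payload() {
    jq -n --arg key "$1" --arg summary "$2" '{
        routing_key: $key,
        event_action: "trigger",
        payload: { summary: $summary, severity: "critical", source: "on-call" }
    }'
}

# pd_payload "API_TEAM_KEY" "API incident - escalation" \
#   | curl -X POST https://events.pagerduty.com/v2/enqueue \
#       -H "Content-Type: application/json" -d @-
```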

2. **Start incident channel**:
   - Create #incident-[date]-api in Slack
   - Post initial findings

3. **Update status page**:
   - Navigate to statuspage.io
   - Create incident for affected components

### Contacts

| Role | Name | Contact |
|------|------|---------|
| Service Owner | Alice | @alice, +1-555-0101 |
| Backend Lead | Bob | @bob, +1-555-0102 |
| SRE Manager | Carol | @carol, +1-555-0103 |

On-Call Handoff Runbook

Smooth handoffs prevent dropped incidents:

# On-Call Handoff

## Current Status

### Active Incidents
- None / [List active incidents]

### Recent Incidents (last 24h)
```bash
# Check recent alerts (Alertmanager v2 API)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.startsAt > (now - 86400 | todate)) | {alert: .labels.alertname, status: .status.state}'
```
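That jq filter works because UTC RFC3339 timestamps sort lexicographically, so comparing the `startsAt` string against `(now - 86400 | todate)` is a valid 24-hour window. A self-contained demo on fake alert data:

```shell
# Two fake alerts: one far in the past, one far in the future.
alerts='[{"labels":{"alertname":"Old"},"startsAt":"2020-01-01T00:00:00Z"},
         {"labels":{"alertname":"New"},"startsAt":"2999-01-01T00:00:00Z"}]'

# Only the alert inside the last-24h window survives the filter.
echo "$alerts" \
  | jq -r '.[] | select(.startsAt > (now - 86400 | todate)) | .labels.alertname'
```

Running this prints only `New`, confirming the window logic before you rely on it against a live Alertmanager.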

### Ongoing Issues
- [Any known issues being monitored]

## Recent Changes

### Deployments (last 24h)
```bash
# Assumes your CD pipeline sets a deployment-time annotation
kubectl get deployments -n production -o json | jq '.items[] | select(.metadata.annotations["deployment-time"] > (now - 86400 | todate)) | .metadata.name'
```

### Infrastructure Changes
- [Any infrastructure changes]

## Things to Watch

- [Services under elevated monitoring]
- [Upcoming maintenance windows]

## Handoff Confirmation

- [ ] Outgoing on-call reviewed this document
- [ ] Incoming on-call acknowledged
- [ ] PagerDuty schedule updated

Service-Specific On-Call Runbook Template

# [Service Name] On-Call Runbook

## Service Overview

**Purpose**: [One sentence description]
**Team**: [Owning team]
**Tier**: [1/2/3 - criticality level]

## Architecture

```
[Simple ASCII diagram or link to diagram]
User → Load Balancer → API → Database
                        ↓
                   Cache (Redis)
```

## Dependencies

| Dependency | Purpose | Failure Impact |
|------------|---------|----------------|
| PostgreSQL | Primary data | Service down |
| Redis | Caching | Degraded performance |
| Auth Service | Authentication | Login failures |

## Alerts

| Alert | Severity | Runbook Section |
|-------|----------|-----------------|
| HighErrorRate | Critical | [#high-error-rate](#high-error-rate) |
| HighLatency | Warning | [#high-latency](#high-latency) |
| PodCrashLoop | Critical | [#pod-crashes](#pod-crashes) |

## Quick Health Check
```bash
kubectl get pods -l app=service-name -n production
curl -s http://service-name.internal/health | jq '.'
```
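The health check can be collapsed into a single pass/fail command with `jq -e`, which maps the boolean result onto its exit code. A sketch; the `{"status": ...}` response shape is an assumption about the endpoint, so match it to your service:

```shell
# check_health (sketch): exit 0 only if the health payload reports
# status "healthy". jq -e sets the exit code from the boolean result.
check_health() {
    echo "$1" | jq -e '.status == "healthy"' > /dev/null
}

# body=$(curl -sf http://service-name.internal/health) || exit 1
# check_health "$body" || echo "service is up but reports unhealthy"
```

Because it is a plain exit code, the same function drops straight into handoff checks or smoke-test scripts.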

## [Issue-specific sections follow...]

Testing On-Call Runbooks

Runbooks rot without testing:

# On-Call Runbook Testing

## Monthly Validation

### Week 1: Quick diagnosis commands
```bash
# Run each command, verify output makes sense
kubectl get pods -l app=api -n staging
```

### Week 2: Remediation commands
```bash
# Test in staging
kubectl rollout restart deployment/api -n staging
```

### Week 3: Escalation paths
- [ ] Page test number
- [ ] Verify contact info current

### Week 4: Full scenario walkthrough
- [ ] Simulate incident in staging
- [ ] Follow runbook end-to-end
- [ ] Note any gaps

On-Call Runbook Anti-Patterns

❌ Too Much Information

# Bad: Wall of text
The API service was originally created in 2019 by the platform team
as part of the v2 architecture migration. It handles all REST API
requests and communicates with the database layer through...

✅ Just What’s Needed

# Good: Actionable info only
## API Service
Handles REST requests. Connects to PostgreSQL and Redis.

## Quick Check
```bash
kubectl get pods -l app=api
```

❌ Outdated Commands

# Bad: References old infrastructure
```bash
ssh api-server-01 "systemctl status api"
```

✅ Current Commands

# Good: Matches current infrastructure
```bash
kubectl get pods -l app=api -n production
```

Stew for On-Call

Stew makes on-call runbooks executable:

  • Run diagnostic commands with a click
  • See output immediately
  • No context switching between wiki and terminal
  • Works over SSH for remote systems

Your on-call engineers spend less time fighting tools, more time fixing issues.

Join the waitlist and transform your on-call experience.