
On-Call Runbook Templates: Ready-to-Use Examples

· 7 min read · Stew Team
on-call · runbook · templates · sre

Don’t start from scratch. These on-call runbook templates cover common scenarios you’ll face during on-call shifts.

For template customization guidance, see our on-call runbook guide.

Template 1: API Service Issues

# API Service On-Call Runbook

## Quick Status
```bash
kubectl get pods -l app=api -n production -o wide
kubectl get events --sort-by='.lastTimestamp' -n production | grep api | tail -5
```

## Health Check
```bash
curl -s http://api.internal/health | jq '.'
```

---

## Issue: High Error Rate

### Diagnosis
```bash
# Error rate from logs
kubectl logs -l app=api -n production --tail=500 | grep -c "ERROR"

# Recent errors
kubectl logs -l app=api -n production --tail=100 | grep "ERROR" | tail -20
```

### Common Causes

**1. Database connection issues**
```bash
kubectl logs -l app=api -n production --tail=100 | grep -i "database\|connection\|timeout"
```

**2. External service failures**
```bash
kubectl logs -l app=api -n production --tail=100 | grep -i "external\|upstream\|503"
```

**3. Invalid requests spike**
```bash
kubectl logs -l app=api -n production --tail=100 | grep "400\|422" | wc -l
```

### Remediation

**Restart pods**:
```bash
kubectl rollout restart deployment/api -n production
kubectl rollout status deployment/api -n production
```

**Rollback if recent deploy**:
```bash
kubectl rollout undo deployment/api -n production
```

---

## Issue: High Latency

### Diagnosis
```bash
# Current resource usage
kubectl top pods -l app=api -n production

# Connection pool status
API_POD=$(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it "$API_POD" -n production -- curl -s localhost:8080/debug/pools
```

### Remediation

**Scale up**:
```bash
kubectl scale deployment/api --replicas=5 -n production
```

**Increase resources**:
```bash
kubectl set resources deployment/api -n production --limits=cpu=2,memory=2Gi
```

---

## Issue: Pods Not Starting

### Diagnosis
```bash
kubectl describe pod -l app=api -n production | tail -30
kubectl get events --sort-by='.lastTimestamp' -n production | grep -i "failed\|error" | tail -10
```

### Common Causes

**Image pull failure**:
```bash
kubectl describe pod -l app=api -n production | grep -A3 "Events:" | grep -i "pull"
```

**Resource constraints**:
```bash
kubectl describe nodes | grep -A5 "Allocated resources"
```

**Config/secret missing**:
```bash
kubectl get configmap api-config -n production
kubectl get secret api-secrets -n production
```

---

## Escalation

| Severity | Action |
|----------|--------|
| P1 (service down) | Page API team immediately |
| P2 (degraded) | Slack #api-team, page if no response in 15 minutes |
| P3 (minor) | Create ticket, notify in standup |

```bash
# Page API team
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"API_TEAM_KEY","event_action":"trigger","payload":{"summary":"API incident requires escalation","severity":"critical","source":"on-call"}}'
```

Template 2: Database Issues

# Database On-Call Runbook

## Quick Status
```bash
# PostgreSQL
psql -h db.internal -U monitor -c "SELECT count(*) FROM pg_stat_activity;"
psql -h db.internal -U monitor -c "SELECT pg_is_in_recovery();"
```

---

## Issue: Connection Exhaustion

### Diagnosis
```bash
# Current connections
psql -h db.internal -U monitor -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Max connections
psql -h db.internal -U monitor -c "SHOW max_connections;"

# Connections by application
psql -h db.internal -U monitor -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count DESC;"
```

### Remediation

**Kill idle connections**:
```bash
psql -h db.internal -U admin -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '30 minutes';"
```

**Identify connection leak**:
```bash
psql -h db.internal -U monitor -c "SELECT application_name, client_addr, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 3 DESC LIMIT 10;"
```

---

## Issue: Slow Queries

### Diagnosis
```bash
# Currently running queries
psql -h db.internal -U monitor -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '10 seconds' ORDER BY duration DESC;"

# Blocked queries
psql -h db.internal -U monitor -c "SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query FROM pg_locks blocked_locks JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.relation = blocked_locks.relation AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.granted;"
```

### Remediation

**Cancel slow query** (replace `PID` with the pid from diagnosis):
```bash
psql -h db.internal -U admin -c "SELECT pg_cancel_backend(PID);"
```

**Terminate if cancel doesn't work**:
```bash
psql -h db.internal -U admin -c "SELECT pg_terminate_backend(PID);"
```

---

## Issue: Replication Lag

### Diagnosis
```bash
# Check lag on primary (pg_wal_lsn_diff works on PostgreSQL 10+;
# direct lsn subtraction requires PostgreSQL 14+)
psql -h db-primary.internal -U monitor -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;"

# Check lag on replica
psql -h db-replica.internal -U monitor -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```

### Remediation

**If replica is too far behind, rebuild**:
```bash
# This is destructive - escalate first
# pg_basebackup -h db-primary.internal -D /var/lib/postgresql/data -U replication -P
```

---

## Escalation

| Severity | Action |
|----------|--------|
| Database down | Page DBA immediately |
| Replication broken | Page DBA within 5 min |
| Performance degraded | Slack #dba-team |
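
As in Template 1, paging can go through the PagerDuty Events API. A minimal sketch that builds the payload with `jq` before sending; `DBA_TEAM_KEY` is a placeholder for your team's integration key:

```shell
#!/bin/sh
# Build the PagerDuty event payload (DBA_TEAM_KEY is a placeholder).
PAYLOAD=$(jq -n \
  --arg key "DBA_TEAM_KEY" \
  --arg summary "Database incident requires escalation" \
  '{routing_key: $key, event_action: "trigger",
    payload: {summary: $summary, severity: "critical", source: "on-call"}}')
echo "$PAYLOAD"

# Send it (uncomment once the routing key is filled in):
# curl -X POST https://events.pagerduty.com/v2/enqueue \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

Building the payload with `jq -n` avoids quoting mistakes when the summary contains special characters.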

Template 3: Kubernetes Cluster Issues

# Kubernetes Cluster On-Call Runbook

## Quick Status
```bash
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
kubectl top nodes
```

---

## Issue: Node Not Ready

### Diagnosis
```bash
# Node status
kubectl describe node NODE_NAME | grep -A10 Conditions

# Node events
kubectl get events --field-selector involvedObject.name=NODE_NAME --sort-by='.lastTimestamp'

# Pods on affected node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=NODE_NAME
```

### Remediation

**Cordon node (prevent new pods)**:
```bash
kubectl cordon NODE_NAME
```

**Drain node (move pods)**:
```bash
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
```

**If cloud provider, check instance**:
```bash
# AWS
aws ec2 describe-instance-status --instance-ids INSTANCE_ID

# GCP
gcloud compute instances describe INSTANCE_NAME --zone=ZONE
```

---

## Issue: Pods Pending

### Diagnosis
```bash
# Why pending?
kubectl describe pod POD_NAME -n NAMESPACE | grep -A10 Events

# Resource availability
kubectl describe nodes | grep -A5 "Allocated resources"

# Check for taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
```

### Remediation

**If resource constrained, scale cluster**:
```bash
# Check current node count
kubectl get nodes | wc -l

# Scale up (cloud-specific)
# AWS EKS
eksctl scale nodegroup --cluster=CLUSTER --name=NODEGROUP --nodes=5
```

---

## Issue: High Resource Usage

### Diagnosis
```bash
# Top resource consumers
kubectl top pods --all-namespaces --sort-by=cpu | head -20
kubectl top pods --all-namespaces --sort-by=memory | head -20

# Node pressure
kubectl describe nodes | grep -E "Pressure|Allocated"
```

### Remediation

**Evict non-critical pods** (controllers will recreate them, likely on another node):
```bash
kubectl delete pod HIGH_USAGE_POD -n NAMESPACE
```

**Scale down non-critical workloads**:
```bash
kubectl scale deployment/batch-processor --replicas=0 -n production
```

---

## Escalation

Page platform team for:
- Multiple nodes NotReady
- Control plane issues
- Persistent storage failures
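
A quick check for the first condition before paging; a sketch that counts nodes whose STATUS column isn't exactly `Ready` (note this also counts cordoned nodes, which report `Ready,SchedulingDisabled`):

```shell
#!/bin/sh
# Count nodes not reporting Ready; page the platform team if more than one.
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 != "Ready" {count++} END {print count+0}')
echo "Nodes not Ready: $NOT_READY"
if [ "$NOT_READY" -gt 1 ]; then
  echo "Multiple nodes NotReady - page platform team"
fi
```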

Template 4: Redis/Cache Issues

# Redis On-Call Runbook

## Quick Status
```bash
redis-cli -h redis.internal ping
redis-cli -h redis.internal info memory | grep used_memory_human
redis-cli -h redis.internal info clients | grep connected_clients
```

---

## Issue: Memory Full

### Diagnosis
```bash
redis-cli -h redis.internal info memory
redis-cli -h redis.internal --bigkeys
```

### Remediation

**Check for keys without a TTL** (these never expire and can leak memory):
```bash
# Samples up to 1000 keys; a TTL of -1 means no expiry is set
redis-cli -h redis.internal --scan --pattern '*' | head -1000 | xargs -n1 redis-cli -h redis.internal ttl | grep -c "^-1"
```

**Clear cache (destructive - only if the cached data is safe to lose)**:
```bash
redis-cli -h redis.internal FLUSHDB
```

---

## Issue: High Latency

### Diagnosis
```bash
redis-cli -h redis.internal slowlog get 10
redis-cli -h redis.internal info stats | grep instantaneous_ops_per_sec
```

### Remediation

**Identify slow commands and fix them in the calling application** (look for `KEYS`, unbounded scans, or very large values):
```bash
redis-cli -h redis.internal slowlog get 10
```

---

## Issue: Connection Refused

### Diagnosis
```bash
# Check if Redis is running
kubectl get pods -l app=redis -n production
kubectl logs -l app=redis -n production --tail=50
```

### Remediation

**Restart Redis**:
```bash
kubectl rollout restart statefulset/redis -n production
```

Using These Templates

  1. Copy the relevant template
  2. Replace placeholder values (SERVICE_NAME, NAMESPACE, etc.)
  3. Test commands in staging
  4. Add service-specific issues
  5. Store in version control
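
Step 2 can be scripted. A minimal sketch using `sed`; the file names and substituted values are examples, and the placeholders match the ones used in these templates:

```shell
#!/bin/sh
# Fill in template placeholders for your environment
# (api-runbook.md is a copy of the relevant template).
sed -e 's/NAMESPACE/production/g' \
    -e 's/NODE_NAME/worker-1/g' \
    -e 's/POD_NAME/api-0/g' \
    api-runbook.md > runbook.md
```

For larger sets of placeholders, a templating tool like `envsubst` keeps the values in environment variables instead of the command line.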

Stew: Execute These Templates

Stew turns these templates into executable runbooks. Every command block runs with a click. No copy-paste required.

Join the waitlist and make your on-call runbooks actionable.