On-Call Runbook Templates: Ready-to-Use Examples
· 7 min read · Stew Team
on-call · runbook · templates · sre
Don’t start from scratch. These on-call runbook templates cover common scenarios you’ll face during on-call shifts.
For template customization guidance, see our on-call runbook guide.
## Template 1: API Service Issues
# API Service On-Call Runbook
## Quick Status
```bash
kubectl get pods -l app=api -n production -o wide
kubectl get events --sort-by='.lastTimestamp' -n production | grep api | tail -5
```
## Health Check
```bash
curl -s http://api.internal/health | jq '.'
```
---
## Issue: High Error Rate
### Diagnosis
```bash
# Error rate from logs
kubectl logs -l app=api -n production --tail=500 | grep -c "ERROR"
# Recent errors
kubectl logs -l app=api -n production --tail=100 | grep "ERROR" | tail -20
```
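The counts above give absolute numbers; a rate is often more useful for judging severity. A minimal sketch of an error-rate percentage, shown here on sample log lines (in practice, pipe `kubectl logs -l app=api -n production --tail=500` into the same awk filter):

```shell
# Error percentage over a log sample. The here-doc stands in for live
# `kubectl logs` output; the log format is illustrative.
awk '/ERROR/ { e++ } END { printf "%.1f%% errors (%d/%d)\n", 100*e/NR, e, NR }' <<'EOF'
2024-01-01T00:00:00Z INFO  request ok
2024-01-01T00:00:01Z ERROR upstream timeout
2024-01-01T00:00:02Z INFO  request ok
2024-01-01T00:00:03Z ERROR db connection reset
EOF
# → 50.0% errors (2/4)
```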
### Common Causes
**1. Database connection issues**
```bash
kubectl logs -l app=api -n production --tail=100 | grep -i "database\|connection\|timeout"
```
**2. External service failures**
```bash
kubectl logs -l app=api -n production --tail=100 | grep -i "external\|upstream\|503"
```
**3. Invalid requests spike**
```bash
kubectl logs -l app=api -n production --tail=100 | grep "400\|422" | wc -l
```
### Remediation
**Restart pods**:
```bash
kubectl rollout restart deployment/api -n production
kubectl rollout status deployment/api -n production
```
**Rollback if recent deploy**:
```bash
kubectl rollout undo deployment/api -n production
```
---
## Issue: High Latency
### Diagnosis
```bash
# Current resource usage
kubectl top pods -l app=api -n production
# Connection pool status
kubectl exec -it $(kubectl get pod -l app=api -n production -o jsonpath='{.items[0].metadata.name}') -n production -- curl -s localhost:8080/debug/pools
```
### Remediation
**Scale up**:
```bash
kubectl scale deployment/api --replicas=5 -n production
```
**Increase resources**:
```bash
kubectl set resources deployment/api -n production --limits=cpu=2,memory=2Gi
```
---
## Issue: Pods Not Starting
### Diagnosis
```bash
kubectl describe pod -l app=api -n production | tail -30
kubectl get events --sort-by='.lastTimestamp' -n production | grep -i "failed\|error" | tail -10
```
### Common Causes
**Image pull failure**:
```bash
kubectl describe pod -l app=api -n production | grep -A3 "Events:" | grep -i "pull"
```
**Resource constraints**:
```bash
kubectl describe nodes | grep -A5 "Allocated resources"
```
**Config/secret missing**:
```bash
kubectl get configmap api-config -n production
kubectl get secret api-secrets -n production
```
---
## Escalation
| Severity | Action |
|----------|--------|
| P1 (service down) | Page API team immediately |
| P2 (degraded) | Slack #api-team, page if no response in 15 min |
| P3 (minor) | Create ticket, notify in standup |
```bash
# Page API team
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H "Content-Type: application/json" \
-d '{"routing_key":"API_TEAM_KEY","event_action":"trigger","payload":{"summary":"API incident requires escalation","severity":"critical","source":"on-call"}}'
```
## Template 2: Database Issues
# Database On-Call Runbook
## Quick Status
```bash
# PostgreSQL
psql -h db.internal -U monitor -c "SELECT count(*) FROM pg_stat_activity;"
psql -h db.internal -U monitor -c "SELECT pg_is_in_recovery();"
```
---
## Issue: Connection Exhaustion
### Diagnosis
```bash
# Current connections
psql -h db.internal -U monitor -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Max connections
psql -h db.internal -U monitor -c "SHOW max_connections;"
# Connections by application
psql -h db.internal -U monitor -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name ORDER BY count(*) DESC;"
```
### Remediation
**Kill idle connections**:
```bash
psql -h db.internal -U admin -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '30 minutes';"
```
**Identify connection leak**:
```bash
psql -h db.internal -U monitor -c "SELECT application_name, client_addr, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 3 DESC LIMIT 10;"
```
---
## Issue: Slow Queries
### Diagnosis
```bash
# Currently running queries
psql -h db.internal -U monitor -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '10 seconds' ORDER BY duration DESC;"
# Blocked queries
psql -h db.internal -U monitor -c "SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query FROM pg_locks blocked_locks JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.relation = blocked_locks.relation AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.granted;"
```
### Remediation
**Cancel slow query**:
```bash
psql -h db.internal -U admin -c "SELECT pg_cancel_backend(PID);"
```
**Terminate if cancel doesn't work**:
```bash
psql -h db.internal -U admin -c "SELECT pg_terminate_backend(PID);"
```
---
## Issue: Replication Lag
### Diagnosis
```bash
# Check lag on primary
psql -h db-primary.internal -U monitor -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, (sent_lsn - replay_lsn) AS lag_bytes FROM pg_stat_replication;"
# Check lag on replica
psql -h db-replica.internal -U monitor -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
```
### Remediation
**If replica is too far behind, rebuild**:
```bash
# This is destructive - escalate first
# pg_basebackup -h db-primary.internal -D /var/lib/postgresql/data -U replication -P
```
---
## Escalation
| Severity | Action |
|----------|--------|
| Database down | Page DBA immediately |
| Replication broken | Page DBA within 5 min |
| Performance degraded | Slack #dba-team |
## Template 3: Kubernetes Cluster Issues
# Kubernetes Cluster On-Call Runbook
## Quick Status
```bash
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
kubectl top nodes
```
---
## Issue: Node Not Ready
### Diagnosis
```bash
# Node status
kubectl describe node NODE_NAME | grep -A10 Conditions
# Node events
kubectl get events --field-selector involvedObject.name=NODE_NAME --sort-by='.lastTimestamp'
# Pods on affected node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=NODE_NAME
```
### Remediation
**Cordon node (prevent new pods)**:
```bash
kubectl cordon NODE_NAME
```
**Drain node (move pods)**:
```bash
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data
```
**If cloud provider, check instance**:
```bash
# AWS
aws ec2 describe-instance-status --instance-ids INSTANCE_ID
# GCP
gcloud compute instances describe INSTANCE_NAME --zone=ZONE
```
---
## Issue: Pods Pending
### Diagnosis
```bash
# Why pending?
kubectl describe pod POD_NAME -n NAMESPACE | grep -A10 Events
# Resource availability
kubectl describe nodes | grep -A5 "Allocated resources"
# Check for taints
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, taints: .spec.taints}'
```
### Remediation
**If resource constrained, scale cluster**:
```bash
# Check current node count
kubectl get nodes --no-headers | wc -l
# Scale up (cloud-specific)
# AWS EKS
eksctl scale nodegroup --cluster=CLUSTER --name=NODEGROUP --nodes=5
```
---
## Issue: High Resource Usage
### Diagnosis
```bash
# Top resource consumers
kubectl top pods --all-namespaces --sort-by=cpu | head -20
kubectl top pods --all-namespaces --sort-by=memory | head -20
# Node pressure
kubectl describe nodes | grep -E "Pressure|Allocated"
```
### Remediation
**Evict non-critical pods**:
```bash
kubectl delete pod HIGH_USAGE_POD -n NAMESPACE
```
**Scale down non-critical workloads**:
```bash
kubectl scale deployment/batch-processor --replicas=0 -n production
```
---
## Escalation
Page platform team for:
- Multiple nodes NotReady
- Control plane issues
- Persistent storage failures
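The first trigger above can be checked mechanically. A minimal sketch of the NotReady count, shown here against sample `kubectl get nodes` output (in practice, pipe the live `kubectl get nodes --no-headers` into the same awk filter; note that cordoned nodes report `Ready,SchedulingDisabled` and would also be counted):

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# The here-doc stands in for live `kubectl get nodes --no-headers` output.
awk '$2 != "Ready" { n++ } END { print n+0 }' <<'EOF'
node-a   Ready      worker   12d   v1.29.0
node-b   NotReady   worker   12d   v1.29.0
node-c   NotReady   worker   12d   v1.29.0
EOF
# → 2
```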
## Template 4: Redis/Cache Issues
# Redis On-Call Runbook
## Quick Status
```bash
redis-cli -h redis.internal ping
redis-cli -h redis.internal info memory | grep used_memory_human
redis-cli -h redis.internal info clients | grep connected_clients
```
---
## Issue: Memory Full
### Diagnosis
```bash
redis-cli -h redis.internal info memory
redis-cli -h redis.internal --bigkeys
```
### Remediation
**Check for keys without a TTL** (sampled; keys that never expire accumulate until eviction or out-of-memory):
```bash
redis-cli -h redis.internal --scan --pattern '*' | head -1000 | xargs -n1 redis-cli -h redis.internal ttl | grep -c "^-1"
```
**Clear cache (if safe)**:
```bash
redis-cli -h redis.internal FLUSHDB
```
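FLUSHDB drops every key in the database. When only one keyspace is bloated, a narrower sketch deletes matching keys in batches without blocking Redis (the `cache:*` prefix is an example, not from this runbook; substitute your own):

```shell
# Non-blocking targeted delete: SCAN matching keys, UNLINK in batches of 100.
# 'cache:*' is a hypothetical prefix; UNLINK frees memory asynchronously.
redis-cli -h redis.internal --scan --pattern 'cache:*' \
  | xargs -r -n 100 redis-cli -h redis.internal unlink
```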
---
## Issue: High Latency
### Diagnosis
```bash
redis-cli -h redis.internal slowlog get 10
redis-cli -h redis.internal info stats | grep instantaneous_ops_per_sec
```
### Remediation
**Identify slow commands and optimize application**:
```bash
redis-cli -h redis.internal slowlog get 10
```
---
## Issue: Connection Refused
### Diagnosis
```bash
# Check if Redis is running
kubectl get pods -l app=redis -n production
kubectl logs -l app=redis -n production --tail=50
```
### Remediation
**Restart Redis**:
```bash
kubectl rollout restart statefulset/redis -n production
```
## Using These Templates
- Copy the relevant template
- Replace placeholder values (SERVICE_NAME, NAMESPACE, etc.)
- Test commands in staging
- Add service-specific issues
- Store in version control
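The placeholder-replacement step can be scripted. A minimal sketch, assuming your copied template keeps the placeholder names used above (SERVICE_NAME, NAMESPACE); the service value `payments` is an example:

```shell
# Substitute placeholders in a copied template. The here-doc stands in for
# the template file; in practice run the same sed over your runbook.md.
sed 's/SERVICE_NAME/payments/g; s/NAMESPACE/production/g' <<'EOF'
kubectl rollout restart deployment/SERVICE_NAME -n NAMESPACE
EOF
# → kubectl rollout restart deployment/payments -n production
```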
## Stew: Execute These Templates
Stew turns these templates into executable runbooks. Every command block runs with a click. No copy-paste required.
Join the waitlist and make your on-call runbooks actionable.