8 DevOps Runbook Templates You Can Use Today

Starting from scratch is hard. Starting from a proven template is easy.

Here are eight DevOps runbook templates covering the most common operations. Copy them, customize them, and make them your own.

DevOps Runbook Template Examples

1. Service Deployment Template

# Deploy [Service Name]

## Metadata
- **Owner:** @team
- **Duration:** 10-15 minutes
- **Risk:** Low

## Prerequisites
- [ ] CI pipeline passed
- [ ] Changelog reviewed
- [ ] Team notified in #deployments

## Pre-Deployment

### Check Current Version
```bash
kubectl get deployment $SERVICE -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
```

### Verify Target Version
```bash
echo "Deploying version: $VERSION"
```

## Deployment

### Update Deployment
```bash
kubectl set image deployment/$SERVICE $SERVICE=$IMAGE:$VERSION -n production
```

### Monitor Rollout
```bash
kubectl rollout status deployment/$SERVICE -n production --timeout=300s
```

## Post-Deployment

### Verify Health
```bash
curl -s https://$SERVICE.example.com/health | jq .
```

### Check Logs
```bash
kubectl logs deployment/$SERVICE -n production --tail=100 | grep -i error
```

## Rollback
```bash
kubectl rollout undo deployment/$SERVICE -n production
```
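
The rollback command pairs naturally with the health check from Post-Deployment. The following is a minimal sketch of that pairing, assuming the `/health` endpoint returns a 2xx status when the service is healthy. The function name `rollback_if_unhealthy` is a hypothetical helper, not part of kubectl.

```bash
# Hypothetical helper: roll back when the health endpoint does not return success.
# Assumes SERVICE is set and https://$SERVICE.example.com/health returns 2xx when healthy.
rollback_if_unhealthy() {
  # -f makes curl exit non-zero on HTTP errors; --max-time bounds the wait
  if curl -sf --max-time 10 "https://$SERVICE.example.com/health" > /dev/null; then
    echo "healthy: no rollback needed"
  else
    echo "unhealthy: rolling back $SERVICE"
    kubectl rollout undo "deployment/$SERVICE" -n production
  fi
}
```

Call `rollback_if_unhealthy` once the rollout has finished, so a failed health check reflects the new version rather than pods that are still starting.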

2. Database Backup Template

# Database Backup

## Metadata
- **Owner:** @dba-team
- **Schedule:** Daily at 02:00 UTC
- **Retention:** 30 days

## Prerequisites
- [ ] Database credentials available
- [ ] Sufficient disk space (check with `df -h`)
- [ ] S3 bucket accessible

## Backup Procedure

### Create Backup
```bash
set -o pipefail  # fail the pipeline if pg_dump fails, not only if gzip fails
BACKUP_FILE="backup_$(date +%Y%m%d_%H%M%S).sql.gz"
pg_dump -h "$DB_HOST" -U "$DB_USER" "$DB_NAME" | gzip > "/backups/$BACKUP_FILE"
```

### Verify Backup Size
```bash
ls -lh /backups/$BACKUP_FILE
```

Expected: Similar size to previous backups (±10%)
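
The ±10% expectation can be checked mechanically instead of by eye. This is a sketch using integer arithmetic only. The helper name `size_within_tolerance` and its arguments are hypothetical, and how you obtain the previous backup's size is left to your environment.

```bash
# Hypothetical helper: succeed when the new size is within PCT percent of the previous size.
# Usage: size_within_tolerance PREVIOUS_BYTES CURRENT_BYTES [PCT]
# The body runs in a subshell, so its variables do not leak into the caller.
size_within_tolerance() (
  prev=$1
  curr=$2
  pct=${3:-10}
  # Absolute difference between the two sizes, in bytes
  if [ "$curr" -ge "$prev" ]; then
    diff=$((curr - prev))
  else
    diff=$((prev - curr))
  fi
  # Check diff/prev <= pct/100 without floating point
  [ $((diff * 100)) -le $((prev * pct)) ]
)
```

On systems with GNU coreutils, `stat -c %s FILE` prints a file's size in bytes, so one possible call is `size_within_tolerance "$PREVIOUS_SIZE" "$(stat -c %s "/backups/$BACKUP_FILE")" || echo "Backup size changed by more than 10%"`, where `PREVIOUS_SIZE` is an assumed variable holding the last backup's size.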

### Upload to S3
```bash
aws s3 cp /backups/$BACKUP_FILE s3://$BACKUP_BUCKET/postgres/
```

### Verify Upload
```bash
aws s3 ls s3://$BACKUP_BUCKET/postgres/$BACKUP_FILE
```

## Cleanup

### Remove Old Local Backups
```bash
find /backups -name "*.sql.gz" -mtime +7 -delete
```

## Verification
```bash
# List recent backups
aws s3 ls s3://$BACKUP_BUCKET/postgres/ | tail -5
```

3. SSL Certificate Renewal Template

# SSL Certificate Renewal

## Metadata
- **Owner:** @security-team
- **Frequency:** Every 60 days (or before expiry)

## Prerequisites
- [ ] DNS access for validation
- [ ] Certbot installed
- [ ] Kubernetes secret write access

## Check Current Expiry
```bash
echo | openssl s_client -servername $DOMAIN -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -dates
```

## Renewal Procedure

### Generate New Certificate
```bash
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
  -d "$DOMAIN" -d "*.$DOMAIN"
```

### Update Kubernetes Secret
```bash
kubectl create secret tls $DOMAIN-tls \
  --cert=/etc/letsencrypt/live/$DOMAIN/fullchain.pem \
  --key=/etc/letsencrypt/live/$DOMAIN/privkey.pem \
  --dry-run=client -o yaml | kubectl apply -f -
```

### Restart Ingress Controller
```bash
# Deployment name from the default ingress-nginx install; adjust if yours differs
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx
```

## Verification
```bash
echo | openssl s_client -servername $DOMAIN -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -dates
```

New expiry should be ~90 days from now.
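
To turn the printed date into a number you can compare against 90, something like the sketch below may help. It assumes GNU `date` (the `-d` flag), and `days_until_expiry` is a hypothetical helper name. The optional second argument exists only to make the calculation reproducible.

```bash
# Hypothetical helper: print whole days between NOW and a certificate's notAfter date.
# Usage: days_until_expiry "notAfter=Feb 20 12:00:00 2026 GMT" [NOW_EPOCH]
# Assumes GNU date (the -d flag); BSD/macOS date needs a different invocation.
days_until_expiry() (
  not_after=${1#notAfter=}   # strip the "notAfter=" prefix if present
  now=${2:-$(date +%s)}      # default to the current time
  expiry=$(date -u -d "$not_after" +%s) || exit 1
  echo $(( (expiry - now) / 86400 ))
)
```

Replacing `-dates` with `-enddate` in the verification command prints only the `notAfter=` line, which can be passed straight in: `days_until_expiry "$(echo | openssl s_client -servername $DOMAIN -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate)"`.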

4. Cache Flush Template

# Flush Application Cache

## Metadata
- **Owner:** @backend-team
- **Use When:** Stale data reported after deployments

## Prerequisites
- [ ] Redis CLI access
- [ ] Confirm cache flush is appropriate (not during peak traffic)

## Pre-Flush Check

### Check Cache Size
```bash
redis-cli -h $REDIS_HOST INFO memory | grep used_memory_human
```

### Check Key Count
```bash
redis-cli -h $REDIS_HOST DBSIZE
```

## Flush Options

### Option A: Full Flush (All Keys in the Current Database)
```bash
redis-cli -h $REDIS_HOST FLUSHDB
```

### Option B: Selective Flush (Pattern Match)
```bash
# Flush only session keys (-r skips DEL when no keys match)
redis-cli -h $REDIS_HOST --scan --pattern "session:*" | xargs -r redis-cli -h $REDIS_HOST DEL

# Flush only API cache
redis-cli -h $REDIS_HOST --scan --pattern "api:cache:*" | xargs -r redis-cli -h $REDIS_HOST DEL
```

## Verification
```bash
redis-cli -h $REDIS_HOST DBSIZE
```

Key count should be significantly lower (or zero for full flush).

## Post-Flush Monitoring
```bash
# Watch for cache miss spikes (-f streams new log lines as they arrive)
kubectl logs -f deployment/api -n production --since=5m | grep -i "cache miss"
```

5. Log Investigation Template

# Log Investigation

## Metadata
- **Owner:** @on-call
- **Use When:** Errors reported, debugging issues

## Quick Error Search

### Recent Errors
```bash
kubectl logs deployment/$SERVICE -n production --since=1h | grep -i error | tail -50
```

### Error Count by Type
```bash
kubectl logs deployment/$SERVICE -n production --since=1h | grep -i error | sort | uniq -c | sort -rn
```
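
If every log line starts with a timestamp, `sort | uniq -c` treats each line as unique and the counts all come out as 1. One way around that is to strip the timestamp before counting. This sketch assumes lines begin with an ISO-8601 timestamp followed by a space, and `count_errors_by_message` is a hypothetical helper name.

```bash
# Hypothetical helper: group error lines by message, ignoring a leading timestamp.
# Assumes each line starts with an ISO-8601 timestamp followed by a space,
# e.g. "2025-11-22T10:00:00Z ERROR db timeout"; adjust the pattern to your log format.
count_errors_by_message() {
  grep -i error \
    | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+(Z|[+-][0-9:]+)? +//' \
    | sort | uniq -c | sort -rn
}
```

Pipe the log stream into it: `kubectl logs deployment/$SERVICE -n production --since=1h | count_errors_by_message`.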

## Detailed Investigation

### Search by Request ID
```bash
kubectl logs deployment/$SERVICE -n production --all-containers | grep "$REQUEST_ID"
```

### Search by User ID
```bash
kubectl logs deployment/$SERVICE -n production --since=24h | grep "user_id=$USER_ID"
```

### Search From a Specific Time
```bash
kubectl logs deployment/$SERVICE -n production --since-time="2025-11-22T10:00:00Z"
```

## Cross-Service Tracing

### Find Related Logs
```bash
# Check API gateway
kubectl logs deployment/gateway -n production --since=1h | grep "$REQUEST_ID"

# Check downstream service
kubectl logs deployment/downstream -n production --since=1h | grep "$REQUEST_ID"
```

## Export for Analysis
```bash
kubectl logs deployment/$SERVICE -n production --since=24h > /tmp/service-logs.txt
```

6. Health Check Template

# System Health Check

## Metadata
- **Owner:** @on-call
- **Frequency:** Start of shift, after incidents

## Infrastructure Health

### Kubernetes Cluster
```bash
kubectl get nodes
kubectl top nodes
```
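
On larger clusters it can help to filter the `kubectl top nodes` output down to the nodes that need attention. This sketch assumes the default five-column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). `nodes_over_threshold` is a hypothetical helper, and the default of 80 percent is an arbitrary example value.

```bash
# Hypothetical helper: print nodes whose CPU% or MEMORY% is at or above a threshold.
# Reads `kubectl top nodes` output on stdin; assumes the default column layout:
# NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
nodes_over_threshold() {
  awk -v limit="${1:-80}" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # drop the percent sign before comparing
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0) print $1, $3, $5
  }'
}
```

For example, `kubectl top nodes | nodes_over_threshold 80` prints one line per hot node and nothing when all nodes are below the threshold.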

### Pod Health
```bash
kubectl get pods -A | grep -v Running
kubectl get pods -A | grep -v "1/1\|2/2\|3/3"
```

## Application Health

### API Service
```bash
curl -s https://api.example.com/health | jq .
```

### Background Workers
```bash
kubectl get pods -n production -l app=worker
```

## Data Stores

### Database
```bash
psql -h $DB_HOST -U $DB_USER -c "SELECT 1" && echo "DB: OK"
```

### Redis
```bash
redis-cli -h $REDIS_HOST PING
```

### Message Queue
```bash
rabbitmqctl list_queues name messages consumers | head -10
```

## Summary Checklist
- [ ] All nodes Ready
- [ ] All pods Running
- [ ] API health check passing
- [ ] Database responsive
- [ ] Cache responsive
- [ ] Queue depth normal
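
The checklist can be partly automated by wrapping each command so it prints a single OK or FAIL line. This is a minimal sketch. `run_check` and the `FAILED` counter are hypothetical names, and it only inspects each command's exit status, so checks such as "queue depth normal" still need a human or a more specific test.

```bash
# Hypothetical helper: run a command quietly and print one OK/FAIL line per check.
# Usage: run_check "LABEL" COMMAND [ARGS...]
FAILED=0   # number of failed checks so far
run_check() {
  label=$1
  shift
  if "$@" > /dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    FAILED=$((FAILED + 1))
  fi
}
```

Typical calls might look like `run_check "Database" psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"` or `run_check "API health" curl -sf https://api.example.com/health`, followed by a final `[ "$FAILED" -eq 0 ]` to decide whether the shift starts clean.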

7. Secret Rotation Template

# Rotate API Keys / Secrets

## Metadata
- **Owner:** @security-team
- **Frequency:** Quarterly or after suspected compromise

## Prerequisites
- [ ] Requirements for the new secret value known (length, format)
- [ ] Vault access confirmed
- [ ] Deployment access confirmed

## Rotation Procedure

### Step 1: Generate New Secret
```bash
NEW_SECRET=$(openssl rand -base64 32)
echo "New secret generated (not displayed)"
```

### Step 2: Update in Vault
```bash
vault kv put secret/$SERVICE/$SECRET_NAME value="$NEW_SECRET"
```

### Step 3: Restart Application
```bash
kubectl rollout restart deployment/$SERVICE -n production
```

### Step 4: Verify Application Started
```bash
kubectl rollout status deployment/$SERVICE -n production
kubectl logs deployment/$SERVICE -n production --tail=20
```

## Verification
```bash
# Check application can use new secret
curl -s https://api.example.com/health | jq .
```

## Rollback
If issues occur:
1. Restore old secret in Vault
2. Restart application
3. Investigate before retrying
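
If the secret is stored in a KV version 2 engine, the first two rollback steps can lean on Vault's built-in version history. The sketch below rests on that assumption. `restore_previous_secret` is a hypothetical helper, and the version number to restore has to be looked up first, for example with `vault kv metadata get`.

```bash
# Hypothetical helper: restore an earlier secret version, then restart the application.
# Usage: restore_previous_secret VERSION
# Assumes a KV v2 secrets engine mounted at secret/ and that SERVICE and SECRET_NAME are set.
restore_previous_secret() {
  previous_version=$1
  # Each step runs only if the previous one succeeded
  vault kv rollback -version="$previous_version" "secret/$SERVICE/$SECRET_NAME" &&
    kubectl rollout restart "deployment/$SERVICE" -n production &&
    kubectl rollout status "deployment/$SERVICE" -n production
}
```

Because the steps are chained with `&&`, a failed Vault rollback leaves the running deployment untouched.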

8. Capacity Planning Template

# Capacity Review

## Metadata
- **Owner:** @platform-team
- **Frequency:** Monthly

## Current Resource Usage

### Node Capacity
```bash
kubectl describe nodes | grep -A 5 "Allocated resources"
```

### Namespace Resource Usage
```bash
kubectl top pods -n production --sort-by=cpu | head -20
kubectl top pods -n production --sort-by=memory | head -20
```

## Storage

### PVC Capacity and Status
```bash
kubectl get pvc -A
```

### Database Size
```bash
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_size_pretty(pg_database_size('$DB_NAME'));"
```

## Trends

### Pod Restart Frequency
```bash
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | sort -t$'\t' -k2 -rn | head -10
```

## Recommendations
Document findings and recommendations for:
- [ ] Node scaling needs
- [ ] Resource limit adjustments
- [ ] Storage expansion
- [ ] Cost optimization opportunities

Using These DevOps Runbook Template Examples

These templates are starting points. Customize them for your infrastructure, naming conventions, and team preferences. For more guidance, see our DevOps runbook template guide and learn how to write a runbook.

Stew turns these DevOps runbook templates into executable procedures. Copy the template, run each command with a click, and track your progress automatically.

Join the waitlist and make your runbooks executable.