DevOps Runbook Template Best Practices

A DevOps runbook template is only as good as the practices behind it. Great templates followed poorly produce poor results. Average templates followed consistently produce great results.

These best practices come from teams that learned what works, often the hard way.

1. Optimize for Scanning, Not Reading

During incidents, nobody reads paragraphs. Design your template for quick scanning:

# ✅ Good: Scannable

### Check Pod Status
```bash
kubectl get pods -n production -l app=api
```
**Expected:** All pods Running, Ready 1/1

# ❌ Bad: Wall of Text

First, you'll want to verify that your pods are running correctly. 
To do this, use the kubectl get pods command with the appropriate 
namespace and label selectors. You should see all pods in a Running 
state with all containers ready.

Best practice: Use headers, code blocks, and bullet points. Limit prose to one sentence per step.

2. Make Every Command Copy-Pasteable

Engineers shouldn’t have to modify commands under pressure:

# ✅ Good: Complete Command
```bash
kubectl rollout restart deployment/api -n production
```

# ❌ Bad: Requires Modification
```bash
kubectl rollout restart deployment/<deployment-name> -n <namespace>
```

Best practice: If values change, use variables set at the top of the runbook:

## Configuration
```bash
export NAMESPACE="production"
export DEPLOYMENT="api"
```

## Procedure
```bash
kubectl rollout restart "deployment/${DEPLOYMENT}" -n "${NAMESPACE}"
```
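Variables only help if they are actually set. One way to guard against an empty `$NAMESPACE` reaching kubectl is to fail fast before the procedure starts. The sketch below assumes the variables were exported as in the Configuration block; `require_vars` is a hypothetical helper name, not part of kubectl or any standard tool.

```bash
# require_vars NAME...: hypothetical helper that reports every named
# variable that is unset or empty, and returns non-zero if any is missing.
# Assumes the variables were set with `export`, as in the Configuration block.
require_vars() {
  missing=0
  for name in "$@"; do
    # printenv prints nothing when the variable is not exported
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, placed right after the Configuration block:
#   require_vars NAMESPACE DEPLOYMENT || exit 1
```

Putting the check directly under the configuration keeps the failure close to its cause instead of surfacing as a confusing kubectl error several steps later.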

3. Include Expected Output

Tell engineers what success looks like:

# ✅ Good: Shows Expected Output

### Verify Deployment
```bash
kubectl rollout status deployment/api -n production
```

**Expected output:**
```
deployment "api" successfully rolled out
```

**If you see:** `Waiting for deployment...` for more than 5 minutes, check pod events.

# ❌ Bad: No Expectations

### Verify Deployment
```bash
kubectl rollout status deployment/api -n production
```

Best practice: Include expected output, timing expectations, and what to do if output differs.
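The expected output can also be checked by the runbook itself rather than by eye. The following is a minimal sketch under that assumption; `expect_output` is a hypothetical helper, and the substring it looks for is whatever the step documents as its expected output.

```bash
# expect_output EXPECTED COMMAND [ARGS...]: hypothetical helper that runs a
# command and checks that its combined stdout/stderr contains EXPECTED.
expect_output() {
  expected="$1"
  shift
  # Only the output is compared here, so the exit status is ignored
  actual="$("$@" 2>&1)" || true
  case "$actual" in
    *"$expected"*)
      echo "OK: output contains '$expected'"
      ;;
    *)
      echo "UNEXPECTED OUTPUT:"
      echo "$actual"
      return 1
      ;;
  esac
}

# Usage with the verification step from this section:
#   expect_output 'successfully rolled out' \
#     kubectl rollout status deployment/api -n production
```

Printing the actual output on a mismatch gives the engineer something concrete to compare against the "If you see" notes in the runbook.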

4. Always Include Rollback Steps

Every procedure needs an undo:

# ✅ Good: Rollback Included

## Procedure
...

## Rollback
If issues occur after deployment:

```bash
kubectl rollout undo deployment/api -n production
```

Verify rollback:
```bash
kubectl rollout status deployment/api -n production
```

# ❌ Bad: No Rollback

## Procedure
...

(No rollback section)

Best practice: Write and test the rollback before you need it. Position it prominently in the template.
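Pairing each step with its undo can be made explicit in a script. The sketch below is one possible shape, not a prescribed pattern: `run_with_rollback` and the function names in the usage comment are hypothetical, and the step and rollback are passed in as shell function names.

```bash
# run_with_rollback STEP_FN ROLLBACK_FN: hypothetical wrapper that runs a
# step and, if the step fails, runs its rollback and reports failure.
run_with_rollback() {
  step="$1"
  rollback="$2"
  if "$step"; then
    echo "Step succeeded"
  else
    echo "Step failed, running rollback"
    "$rollback"
    return 1
  fi
}

# Usage (function names are illustrative):
#   deploy()   { kubectl rollout status deployment/api -n production --timeout=5m; }
#   rollback() { kubectl rollout undo deployment/api -n production; }
#   run_with_rollback deploy rollback
```

Because the wrapper refuses to run without a rollback function, a step cannot be added to the script unless its undo has been written too.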

5. Use Consistent Naming Conventions

Every runbook should use the same section names, so engineers always know where to look:

# ✅ Good: Consistent Structure

Every runbook has:
- Metadata (owner, updated, duration)
- Prerequisites
- Procedure
- Verification
- Rollback
- Escalation

# ❌ Bad: Inconsistent Structure

Runbook A: "Steps" section
Runbook B: "Procedure" section  
Runbook C: "Instructions" section

Best practice: Document your template structure and enforce it through review.
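Review is easier when a script catches the obvious gaps first. Here is a minimal lint sketch, assuming each required section appears as a level-two markdown heading (`## Name`); the `check_sections` name and that heading convention are assumptions, and the metadata block is left out because it may not be a heading.

```bash
# check_sections FILE: hypothetical lint that reports which of the standard
# sections are missing from a runbook. Assumes sections are "## Name" headings.
check_sections() {
  file="$1"
  missing=0
  for section in Prerequisites Procedure Verification Rollback Escalation; do
    if ! grep -q "^## $section[[:space:]]*\$" "$file"; then
      echo "MISSING: ## $section"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (path is illustrative):
#   check_sections runbooks/restart-api.md || exit 1
```

A runbook that calls its steps "Steps" or "Instructions" fails the check, which nudges authors back to the shared vocabulary before a reviewer has to.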

6. Keep Prerequisites as Checkboxes

Engineers should verify access before starting:

# ✅ Good: Actionable Checklist

## Prerequisites
- [ ] VPN connected to production
- [ ] kubectl context set to production cluster (verify: `kubectl config current-context`)
- [ ] PagerDuty admin access confirmed

# ❌ Bad: Vague List

## Prerequisites
- VPN access
- Kubernetes access
- Monitoring access

Best practice: Make each prerequisite verifiable with a specific action or command.
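A verifiable prerequisite can go one step further and verify itself. The sketch below wraps the `kubectl config current-context` check in a small preflight function; `check_context` and the context name `production-cluster` are assumptions for illustration.

```bash
# check_context EXPECTED: hypothetical preflight check that compares the
# active kubectl context with the one the runbook expects.
check_context() {
  expected="$1"
  # An empty result covers both "no context set" and "kubectl failed"
  current="$(kubectl config current-context 2>/dev/null)" || current=""
  if [ "$current" = "$expected" ]; then
    echo "OK: kubectl context is $current"
  else
    echo "STOP: kubectl context is '${current:-unset}', expected '$expected'"
    return 1
  fi
}

# Usage (context name is illustrative):
#   check_context production-cluster || exit 1
```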

7. Add Timing Expectations

Engineers need to know if something is taking too long:

# ✅ Good: Timing Included

### Wait for Rollout
```bash
kubectl rollout status deployment/api -n production
```

**Expected duration:** 2-5 minutes

**If longer than 10 minutes:**
1. Check pod events: `kubectl describe pod -l app=api -n production`
2. Check for resource constraints: `kubectl top pods -n production`

# ❌ Bad: No Timing

### Wait for Rollout
```bash
kubectl rollout status deployment/api -n production
```

Best practice: Include expected duration and escalation triggers for when things take too long.
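The escalation trigger can be encoded in the command itself: `kubectl rollout status` accepts a `--timeout` flag, so the wait ends on its own when the limit passes. A minimal sketch follows; the `wait_for_rollout` wrapper and its 10-minute default are assumptions chosen to match the example above.

```bash
# wait_for_rollout DEPLOYMENT NAMESPACE [LIMIT]: hypothetical wrapper that
# waits for a rollout and fails once LIMIT (default 10m) has passed.
wait_for_rollout() {
  deployment="$1"
  namespace="$2"
  limit="${3:-10m}"
  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$limit"; then
    echo "Rollout finished within $limit"
  else
    echo "Rollout did not finish within $limit: check pod events, then escalate"
    return 1
  fi
}

# Usage:
#   wait_for_rollout api production || kubectl describe pod -l app=api -n production
```

With the limit in the command, nobody has to watch a clock during an incident to know when to move to the troubleshooting steps.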

8. Version Control Your Templates

Treat runbooks like code:

# ✅ Good: Version Controlled

- Stored in Git repository
- Changes reviewed via PR
- History tracked
- Tied to infrastructure changes

# ❌ Bad: Wiki-Based

- Stored in Confluence/Notion
- No review process
- History unclear
- Disconnected from code

Best practice: Store runbooks in the same repository as the code they operate on.

9. Test Runbooks Regularly

Untested runbooks are unreliable runbooks:

# ✅ Good: Regular Testing

## Runbook Maintenance
- **Last tested:** 2025-11-01
- **Testing frequency:** Monthly
- **Test environment:** staging

# ❌ Bad: Never Tested

(No testing information)
(Last updated 2 years ago)

Best practice: Schedule regular runbook drills. Test in staging monthly. Run incident simulations quarterly.
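If each runbook records its last test date in the format shown above, stale runbooks can be flagged automatically. The sketch below assumes a `- **Last tested:** YYYY-MM-DD` line in the file; `check_last_tested` is a hypothetical helper, and the cutoff date is supplied by the caller.

```bash
# check_last_tested FILE CUTOFF: hypothetical check that fails when the
# runbook's "Last tested" date is missing or earlier than CUTOFF (YYYY-MM-DD).
check_last_tested() {
  file="$1"
  cutoff="$2"
  # Pull the date out of a line such as "- **Last tested:** 2025-11-01"
  tested="$(sed -n 's/^- \*\*Last tested:\*\* \([0-9-]*\).*$/\1/p' "$file" | head -n 1)"
  if [ -z "$tested" ]; then
    echo "STALE: $file has no last-tested date"
    return 1
  fi
  # ISO dates compare correctly as integers once the hyphens are removed
  if [ "$(echo "$tested" | sed 's/-//g')" -lt "$(echo "$cutoff" | sed 's/-//g')" ]; then
    echo "STALE: $file last tested $tested (cutoff $cutoff)"
    return 1
  fi
  echo "OK: $file last tested $tested"
}

# Usage (path and cutoff are illustrative):
#   check_last_tested runbooks/restart-api.md 2025-10-01
```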

10. Make Updates Frictionless

The harder it is to update, the staler runbooks get:

# ✅ Good: Easy Updates

- Edit directly in repository
- Quick PR process
- No approval bottlenecks
- Update triggered by infrastructure changes

# ❌ Bad: Hard Updates

- Locked wiki pages
- Multi-level approval
- Separate from code changes
- "Someone else's job"

Best practice: When you change infrastructure, update the runbook in the same PR.
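A CI check can remind authors when an infrastructure change arrives without a runbook update. The following is a rough sketch under stated assumptions: the `infra/` and `runbooks/` directory names and the `check_runbook_updated` helper are hypothetical, and the list of changed files is read from standard input.

```bash
# check_runbook_updated: hypothetical CI check. Reads changed file paths from
# stdin and fails if infra/ changed but nothing under runbooks/ did.
# The infra/ and runbooks/ directory names are assumptions.
check_runbook_updated() {
  changed="$(cat)"
  if printf '%s\n' "$changed" | grep -q '^infra/' &&
     ! printf '%s\n' "$changed" | grep -q '^runbooks/'; then
    echo "Infrastructure changed but no runbook was updated"
    return 1
  fi
  echo "OK: runbook check passed"
}

# Usage in CI (branch name is illustrative):
#   git diff --name-only origin/main...HEAD | check_runbook_updated
```

Some infrastructure changes need no runbook edit, so a check like this probably works best as a warning or an overridable gate rather than a hard block.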

Template Maintenance Checklist

Use this checklist for regular template reviews:

## Monthly Review
- [ ] All commands still work
- [ ] Expected outputs still accurate
- [ ] Prerequisites still valid
- [ ] Rollback tested
- [ ] Escalation contacts current

## Quarterly Review
- [ ] Template structure still fits team needs
- [ ] New common procedures have templates
- [ ] Obsolete templates retired
- [ ] Cross-reference links still valid

From Best Practices to Execution

Knowing DevOps runbook template best practices is one thing. Following them under pressure is another. See our DevOps runbook template guide and runbook examples for practical implementations.

Stew embeds these best practices into executable runbooks. Copy-pasteable commands run with a click. Expected output displays automatically. Version control is built in.
