# MTTR Reduction: A Practical Guide for SRE Teams
Mean Time to Recovery (MTTR) is the metric that matters most during incidents. Every minute of downtime costs money, erodes trust, and burns out your team.
This guide covers practical strategies to reduce MTTR. For executable incident procedures, see our incident response runbook guide.
## What Is MTTR?

MTTR measures the average time from when an incident is detected to when service is restored.

```
MTTR = Total downtime / Number of incidents
```
A typical breakdown:
- Detection: Time to notice something is wrong
- Diagnosis: Time to identify the root cause
- Resolution: Time to fix the problem
- Verification: Time to confirm the fix worked
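The formula above can be sketched in a few lines (the `detected`/`restored` field names are hypothetical, chosen for illustration):

```python
def mttr_minutes(incidents):
    """Mean time to recovery in minutes.

    Each incident is a dict with 'detected' and 'restored' epoch-second
    timestamps (hypothetical field names for illustration).
    """
    total_downtime = sum(i["restored"] - i["detected"] for i in incidents)
    return total_downtime / len(incidents) / 60
```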
## Why MTTR Matters

| MTTR | Monthly Downtime (10 incidents) | Monthly Cost (at $10k/min) |
|---|---|---|
| 60 min | 10 hours | $6M |
| 30 min | 5 hours | $3M |
| 15 min | 2.5 hours | $1.5M |

Cutting MTTR in half halves your downtime cost.
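The table's arithmetic can be reproduced with a short sketch (the 10 incidents/month and $10k/min figures are the table's illustrative assumptions):

```python
def monthly_downtime_cost(mttr_min, incidents_per_month=10, cost_per_min=10_000):
    """Monthly downtime cost: MTTR x incident count x cost per minute."""
    downtime_min = mttr_min * incidents_per_month
    return downtime_min * cost_per_min
```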
## MTTR Reduction Strategy 1: Faster Detection

You can’t fix what you don’t know is broken.

### Implement Comprehensive Monitoring

```bash
# List the alerts Alertmanager currently knows about (v2 API; v1 is deprecated)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[].labels.alertname'
```
### Reduce Alert Latency

```yaml
# Prometheus rule with minimal delay
groups:
  - name: critical
    interval: 15s  # Evaluate every 15 seconds
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[1m]) > 0.1
        for: 30s  # Fire once the condition has held for 30 seconds
```
### Enable Real-Time Notifications

```bash
# Test PagerDuty integration
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"YOUR_KEY","event_action":"trigger","payload":{"summary":"Test alert","severity":"critical","source":"monitoring"}}'
```
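If you trigger test events from scripts rather than curl, the same Events API v2 body can be built programmatically. This sketch only constructs the JSON shown in the curl above; sending it is left to your HTTP client:

```python
import json

def pagerduty_trigger_body(routing_key, summary, severity="critical", source="monitoring"):
    """Build the JSON body for a PagerDuty Events API v2 'trigger' event."""
    return json.dumps({
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "severity": severity, "source": source},
    })
```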
## MTTR Reduction Strategy 2: Faster Diagnosis

Diagnosis often consumes 60%+ of incident time.

### Pre-Built Diagnostic Commands

Have commands ready before incidents happen:

```bash
# Quick service health check
kubectl get pods -l app=api -o wide
kubectl top pods -l app=api
kubectl logs -l app=api --tail=50 | grep -i error
```
### Correlation Dashboards

Link metrics, logs, and traces in one view:

```bash
# Get all signals for a service
echo "=== Pods ===" && kubectl get pods -l app=api
echo "=== Recent Events ===" && kubectl get events --field-selector involvedObject.name=api --sort-by='.lastTimestamp' | tail -10
echo "=== Logs ===" && kubectl logs -l app=api --tail=20
```
### Runbook Links in Alerts

Include diagnostic steps in every alert:

```yaml
annotations:
  runbook_url: https://runbooks.internal/api-high-latency
  summary: "API latency above threshold"
  diagnostic_commands: |
    kubectl top pods -l app=api
    kubectl logs -l app=api --tail=100 | grep -i slow
```
## MTTR Reduction Strategy 3: Faster Resolution

Once you know the problem, fix it quickly.

### One-Click Remediation

Standard fixes should be executable instantly:

````markdown
# API Pod Restart Runbook

## Verify issue
```bash
kubectl get pods -l app=api | grep -v Running
```

## Restart pods
```bash
kubectl rollout restart deployment/api
```

## Verify recovery
```bash
kubectl rollout status deployment/api
```
````
### Automated Rollbacks

```bash
# Quick rollback to last known good
kubectl rollout undo deployment/api

# Or to a specific revision
kubectl rollout undo deployment/api --to-revision=3
```
### Feature Flags for Instant Mitigation

```bash
# Disable the problematic feature without a deploy
curl -X POST http://feature-flags/api/flags/new-checkout \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
```
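Conceptually, a kill switch is just a lookup that defaults to off. A toy in-memory sketch of that semantics (not a real flag-service client):

```python
class FlagStore:
    """Toy in-memory feature-flag store illustrating kill-switch semantics."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=False):
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(name, default)
```

The fail-safe default matters: if the flag service is unreachable or the flag was never defined, the risky code path stays disabled.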
## MTTR Reduction Strategy 4: Better Preparation

The best incident response happens before incidents.

### Executable Runbooks
Static documentation slows you down. Executable runbooks let you:
- Run diagnostic commands with one click
- See output immediately
- Follow procedures step-by-step
See our runbook examples for templates.
### Regular Incident Drills

```bash
# Chaos engineering: controlled failure injection
# (kubectl requires --force when --grace-period=0 is set)
kubectl delete pod -l app=api --grace-period=0 --force

# Measure recovery time
time kubectl wait --for=condition=ready pod -l app=api --timeout=300s
```
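The `time kubectl wait` pattern generalizes to any readiness probe. A sketch of measuring recovery during a drill, where `is_ready` is whatever check you care about (an HTTP health endpoint, a kubectl call, etc.):

```python
import time

def measure_recovery(is_ready, timeout_s=300.0, poll_s=0.5):
    """Poll a readiness check until it passes; return elapsed seconds.

    Raises TimeoutError if the service does not recover within timeout_s.
    """
    start = time.monotonic()
    while not is_ready():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("service did not recover within the timeout")
        time.sleep(poll_s)
    return time.monotonic() - start
```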
### On-Call Handoff Documents

````markdown
# On-Call Handoff

## Current Issues
- Elevated error rates on checkout service
- Database replica lag under investigation

## Recent Changes
- Deployed v2.3.1 at 14:00 UTC
- Scaled worker pods from 3 to 5

## Useful Commands
```bash
kubectl get pods -n production
kubectl logs -l app=checkout --tail=100
```
````
## Measuring MTTR Improvement

Track your progress:

```bash
# Calculate mean MTTR in minutes
# (assumes detection_time and resolution_time are epoch seconds)
jq '[.[] | .resolution_time - .detection_time] | add / length / 60' incidents.json
```
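Going beyond a single mean, the same data can be split by phase if each incident records per-phase timestamps. The field names here (`started`, `detected`, `diagnosed`, `resolved`, `verified`) are hypothetical:

```python
def phase_breakdown_minutes(incidents):
    """Average minutes spent per MTTR phase across incidents.

    Each incident is a dict of epoch-second timestamps in order:
    started -> detected -> diagnosed -> resolved -> verified.
    """
    phases = [
        ("detection", "started", "detected"),
        ("diagnosis", "detected", "diagnosed"),
        ("resolution", "diagnosed", "resolved"),
        ("verification", "resolved", "verified"),
    ]
    n = len(incidents)
    return {name: sum(i[end] - i[start] for i in incidents) / n / 60
            for name, start, end in phases}
```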
### MTTR Breakdown Analysis
| Phase | Before | After | Improvement |
|---|---|---|---|
| Detection | 10 min | 2 min | 80% |
| Diagnosis | 25 min | 10 min | 60% |
| Resolution | 15 min | 5 min | 67% |
| Verification | 5 min | 3 min | 40% |
| Total MTTR | 55 min | 20 min | 64% |
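The improvement column is just the percent reduction per phase; a one-liner reproduces it:

```python
def percent_improvement(before_min, after_min):
    """Percent reduction from before to after, rounded to a whole percent."""
    return round((before_min - after_min) / before_min * 100)
```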
## Common MTTR Pitfalls

### Pitfall 1: Alert Noise

Too many alerts means slow response to real issues.

```bash
# Rank alerts by firing frequency (Alertmanager v2 API)
curl -s http://alertmanager:9093/api/v2/alerts | jq '[.[].labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'
```
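The same frequency ranking, done offline on a list of alert names (equivalent in spirit to the jq pipeline above):

```python
from collections import Counter

def noisiest_alerts(alert_names, top=5):
    """Rank alert names by how often they fired, noisiest first."""
    return Counter(alert_names).most_common(top)
```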
### Pitfall 2: Outdated Runbooks

Runbooks that don’t match reality slow you down.

### Pitfall 3: Single Points of Knowledge

If only one person knows how to fix something, MTTR depends on their availability.
## Stew: Built for MTTR Reduction
Stew directly addresses MTTR by making runbooks executable:
- Faster diagnosis: Run commands from runbooks instantly
- Faster resolution: Pre-built procedures execute with clicks
- Better preparation: Keep runbooks updated and tested
Your team stops copying commands from wikis and starts executing documented procedures.
Join the waitlist and cut your MTTR.