Measuring MTTR: Metrics and Benchmarks for SRE Teams
You can’t improve what you don’t measure. MTTR reduction requires systematic tracking across incident phases.
This guide covers how to measure MTTR effectively. For reduction strategies, see our MTTR reduction guide.
MTTR Formula
The basic calculation:
MTTR = Total Recovery Time / Number of Incidents
But this hides important details. Break it down:
MTTR = Detection Time + Diagnosis Time + Resolution Time + Verification Time
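Plugging made-up numbers into that breakdown shows how the phases compose (all durations below are illustrative, not benchmarks):

```shell
# Hypothetical phase durations for a single incident, in minutes
DETECT=4; DIAGNOSE=12; RESOLVE=8; VERIFY=3
echo "MTTR: $((DETECT + DIAGNOSE + RESOLVE + VERIFY)) minutes"   # MTTR: 27 minutes
```

Tracking the four numbers separately is what makes the later phase-level analysis possible.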
MTTR Breakdown Metrics
Time to Detect (TTD)
From incident start to first alert.
# Average detection time in minutes (assumes epoch-second timestamps)
cat incidents.json | jq '[.[] | .alert_time - .incident_start] | add / length / 60'
Benchmark: < 5 minutes for critical services
Time to Acknowledge (TTA)
From alert to human response.
# PagerDuty mean acknowledgment time in minutes (in the v2 API the ack
# timestamp lives in the acknowledgements array, not a top-level field)
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01" \
-H "Authorization: Token token=YOUR_TOKEN" \
-H "Accept: application/vnd.pagerduty+json;version=2" | \
jq '[.incidents[] | select(.acknowledgements | length > 0) |
(.acknowledgements[0].at | fromdateiso8601) - (.created_at | fromdateiso8601)] | add / length / 60'
Benchmark: < 5 minutes during business hours, < 15 minutes off-hours
Time to Diagnose (TTDiag)
From acknowledgment to root cause identification.
This is often the longest phase and hardest to measure automatically. Track manually:
# Incident Template
- Acknowledged: [timestamp]
- Root cause identified: [timestamp]
- Diagnosis notes: [what was tried]
Benchmark: < 15 minutes for known issues, < 30 minutes for novel issues
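Once the template is filled in, the two timestamps reduce to a TTDiag number with a one-liner; a sketch with fabricated timestamps (assumes GNU `date`):

```shell
# Timestamps copied from a (hypothetical) filled-in incident template
ACKNOWLEDGED="2024-01-15T10:05:00Z"
ROOT_CAUSE_IDENTIFIED="2024-01-15T10:18:00Z"
# Difference in epoch seconds, converted to minutes
ttdiag_min=$(( ( $(date -ud "$ROOT_CAUSE_IDENTIFIED" +%s) - $(date -ud "$ACKNOWLEDGED" +%s) ) / 60 ))
echo "TTDiag: ${ttdiag_min} minutes"   # TTDiag: 13 minutes
```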
Time to Resolve (TTR)
From diagnosis to fix applied.
# When the last rollout progressed (the Progressing condition is a standard
# Deployment status field; substitute your own deploy-time annotation if you set one)
kubectl get deployment api -o jsonpath='{.status.conditions[?(@.type=="Progressing")].lastUpdateTime}'
Benchmark: < 10 minutes for standard remediations
Time to Verify (TTV)
From fix applied to confirmed working.
# Monitor recovery
time kubectl rollout status deployment/api
Benchmark: < 5 minutes
MTTR Benchmarks by Industry
| Industry | Good MTTR | Average MTTR | Poor MTTR |
|---|---|---|---|
| E-commerce | < 15 min | 30 min | > 60 min |
| SaaS | < 30 min | 45 min | > 90 min |
| Finance | < 10 min | 20 min | > 45 min |
| Media | < 20 min | 40 min | > 90 min |
Setting Up MTTR Tracking
Option 1: Incident Management Tool
PagerDuty, OpsGenie, and similar tools track:
- Alert creation time
- Acknowledgment time
- Resolution time
# Export PagerDuty incident data
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01&until=2024-02-01" \
-H "Authorization: Token token=YOUR_TOKEN" \
-H "Accept: application/vnd.pagerduty+json;version=2" | \
jq '.incidents[] | {id: .id, created: .created_at, resolved: .resolved_at}'
Option 2: Custom Tracking
Add timestamps to your incident process:
# Start incident tracking
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) INCIDENT_START" >> /var/log/incidents.log
# Mark diagnosis complete
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) DIAGNOSIS_COMPLETE" >> /var/log/incidents.log
# Mark resolved
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) RESOLVED" >> /var/log/incidents.log
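Those timestamped lines can then be reduced to a duration after the fact; a minimal sketch with fabricated log contents (assumes GNU `date`, and uses `/tmp` rather than `/var/log` for illustration):

```shell
# Build a sample incident log (contents are made up)
LOG=/tmp/incidents.log
cat > "$LOG" <<'EOF'
2024-01-15T10:00:00Z INCIDENT_START
2024-01-15T10:20:00Z DIAGNOSIS_COMPLETE
2024-01-15T10:35:00Z RESOLVED
EOF
# Pull the start/end timestamps back out and convert to minutes
start=$(date -ud "$(awk '$2=="INCIDENT_START"{print $1}' "$LOG")" +%s)
end=$(date -ud "$(awk '$2=="RESOLVED"{print $1}' "$LOG")" +%s)
echo "MTTR: $(( (end - start) / 60 )) minutes"   # MTTR: 35 minutes
```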
Option 3: Prometheus Metrics
# Histogram metric for MTTR tracking (as you would register it
# via a Prometheus client library)
- name: incident_duration_seconds
  help: Time from incident start to resolution
  type: histogram
  buckets: [300, 600, 900, 1800, 3600, 7200]
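If you don't run a client library, the same data can reach Prometheus through the node_exporter textfile collector; a sketch (the metric name, duration value, and output filename are assumptions, and the file would normally go in the directory given by --collector.textfile.directory):

```shell
# Hypothetical: the last resolved incident took 30 minutes
DURATION_SECONDS=1800
# Write the metric in Prometheus exposition format
cat > incident_mttr.prom <<EOF
# HELP incident_last_duration_seconds Duration of the most recently resolved incident
# TYPE incident_last_duration_seconds gauge
incident_last_duration_seconds ${DURATION_SECONDS}
EOF
```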
Analyzing MTTR Data
Trend Analysis
# Weekly MTTR trend
cat incidents.json | jq -r '.[] | [.week, .mttr_minutes] | @csv' | \
awk -F',' '{sum[$1]+=$2; count[$1]++} END {for (w in sum) print w, sum[w]/count[w]}'
Distribution Analysis
# MTTR percentiles (array indexes must be integers, hence the floor)
cat incidents.json | jq '[.[].mttr_minutes] | sort |
{p50: .[(length/2|floor)], p90: .[(length*0.9|floor)], p99: .[(length*0.99|floor)]}'
By Service Analysis
# MTTR by service
cat incidents.json | jq 'group_by(.service) |
map({service: .[0].service, avg_mttr: ([.[].mttr_minutes] | add / length)})'
By Incident Type
# MTTR by category
cat incidents.json | jq 'group_by(.category) |
map({category: .[0].category, avg_mttr: ([.[].mttr_minutes] | add / length), count: length})'
MTTR Dashboards
Key Visualizations
- MTTR over time: Weekly/monthly trend line
- MTTR by phase: Stacked bar showing detection/diagnosis/resolution
- MTTR by service: Identify problem areas
- MTTR distribution: Histogram showing spread
Grafana Query Examples
# Average MTTR in minutes, last 30 days (a histogram exposes _sum and _count
# series; averaging the bare metric name would match nothing)
sum(increase(incident_duration_seconds_sum[30d])) / sum(increase(incident_duration_seconds_count[30d])) / 60
# MTTR by service
sum by (service) (increase(incident_duration_seconds_sum[30d])) / sum by (service) (increase(incident_duration_seconds_count[30d])) / 60
# 7-day rolling MTTR trend
sum(increase(incident_duration_seconds_sum[7d])) / sum(increase(incident_duration_seconds_count[7d])) / 60
Common MTTR Measurement Mistakes
Mistake 1: Only Measuring Total MTTR
Without phase breakdown, you don’t know where to improve.
Fix: Track each phase separately.
Mistake 2: Excluding “Small” Incidents
Every incident affects MTTR averages.
Fix: Track all incidents consistently.
Mistake 3: Manual Data Entry
Delayed or forgotten entries skew data.
Fix: Automate timestamp capture where possible.
Mistake 4: Ignoring Outliers
One 4-hour incident can mask improvements.
Fix: Use percentiles (p50, p90) alongside averages.
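The skew is easy to demonstrate with made-up durations: a single long incident drags the mean up while the median barely moves.

```shell
# Nine 10-minute incidents plus one 240-minute outlier (fabricated data)
printf '%s\n' 10 10 10 10 10 10 10 10 10 240 | sort -n | \
  awk '{v[NR]=$1; sum+=$1} END {printf "mean: %g p50: %g\n", sum/NR, v[int(NR/2)]}'
# mean: 33 p50: 10
```

Reporting p50 alongside the mean keeps one bad day from hiding a quarter of steady improvement.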
MTTR Improvement Targets
Set realistic goals:
| Current MTTR | 6-Month Target | Focus Area |
|---|---|---|
| > 60 min | 40 min | Basic runbooks, faster detection |
| 40-60 min | 25 min | Executable runbooks, automation |
| 25-40 min | 15 min | Predictive detection, self-healing |
| < 25 min | 10 min | Advanced automation, ML-based diagnosis |
Connecting MTTR to Business Metrics
Translate MTTR to business impact:
# Calculate downtime cost
MTTR_MINUTES=30
INCIDENTS_PER_MONTH=10
COST_PER_MINUTE=1000
echo "Monthly downtime cost: \$$(($MTTR_MINUTES * $INCIDENTS_PER_MONTH * $COST_PER_MINUTE))"
Revenue Impact Formula
Monthly Downtime Cost = MTTR × Incidents/Month × Revenue/Minute
Stew’s Impact on MTTR Metrics
Stew reduces MTTR by eliminating friction:
- Faster diagnosis: Executable commands, no copy-paste
- Fewer errors: No typos from manual entry
- Better tracking: Captured output shows what was tried
Teams adopting executable runbooks commonly report 40-60% reductions in diagnosis time alone.
Join the waitlist and start measuring improvement.