Measuring MTTR: Metrics and Benchmarks for SRE Teams
You can’t improve what you don’t measure. MTTR reduction requires systematic tracking across incident phases.
This guide covers how to measure MTTR effectively. For reduction strategies, see our MTTR reduction guide.
MTTR Formula
The basic calculation:
MTTR = Total Recovery Time / Number of Incidents
But this hides important details. Break it down:
MTTR = Detection Time + Diagnosis Time + Resolution Time + Verification Time
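Plugging made-up numbers into that breakdown shows how the phases compose (all durations below are illustrative, not benchmarks):

```shell
# Hypothetical phase durations for a single incident, in minutes
DETECT=4; DIAGNOSE=12; RESOLVE=8; VERIFY=3
echo "MTTR: $((DETECT + DIAGNOSE + RESOLVE + VERIFY)) minutes"   # MTTR: 27 minutes
```

Tracking the four numbers separately is what makes the later phase-level analysis possible.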
MTTR Breakdown Metrics
Time to Detect (TTD)
From incident start to first alert.
# Average detection time in minutes (assumes epoch-second timestamps)
cat incidents.json | jq '[.[] | .alert_time - .incident_start] | add / length / 60'
Benchmark: < 5 minutes for critical services
Time to Acknowledge (TTA)
From alert to human response.
# PagerDuty mean acknowledgment time in minutes (in the v2 API the ack
# timestamp lives in the acknowledgements array, not a top-level field)
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01" \
-H "Authorization: Token token=YOUR_TOKEN" \
-H "Accept: application/vnd.pagerduty+json;version=2" | \
jq '[.incidents[] | select(.acknowledgements | length > 0) |
(.acknowledgements[0].at | fromdateiso8601) - (.created_at | fromdateiso8601)] | add / length / 60'
Benchmark: < 5 minutes during business hours, < 15 minutes off-hours
Time to Diagnose (TTDiag)
From acknowledgment to root cause identification.
This is often the longest phase and hardest to measure automatically. Track manually:
# Incident Template
- Acknowledged: [timestamp]
- Root cause identified: [timestamp]
- Diagnosis notes: [what was tried]
Benchmark: < 15 minutes for known issues, < 30 minutes for novel issues
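Once the template is filled in, the two timestamps reduce to a TTDiag number with a one-liner; a sketch with fabricated timestamps (assumes GNU `date`):

```shell
# Timestamps copied from a (hypothetical) filled-in incident template
ACKNOWLEDGED="2024-01-15T10:05:00Z"
ROOT_CAUSE_IDENTIFIED="2024-01-15T10:18:00Z"
# Difference in epoch seconds, converted to minutes
ttdiag_min=$(( ( $(date -ud "$ROOT_CAUSE_IDENTIFIED" +%s) - $(date -ud "$ACKNOWLEDGED" +%s) ) / 60 ))
echo "TTDiag: ${ttdiag_min} minutes"   # TTDiag: 13 minutes
```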
Time to Resolve (TTR)
From diagnosis to fix applied.
# When the last rollout progressed (the Progressing condition is a standard
# Deployment status field; substitute your own deploy-time annotation if you set one)
kubectl get deployment api -o jsonpath='{.status.conditions[?(@.type=="Progressing")].lastUpdateTime}'
Benchmark: < 10 minutes for standard remediations
Time to Verify (TTV)
From fix applied to confirmed working.
# Monitor recovery
time kubectl rollout status deployment/api
Benchmark: < 5 minutes
MTTR Benchmarks by Industry
| Industry | Good MTTR | Average MTTR | Poor MTTR |
|---|---|---|---|
| E-commerce | < 15 min | 30 min | > 60 min |
| SaaS | < 30 min | 45 min | > 90 min |
| Finance | < 10 min | 20 min | > 45 min |
| Media | < 20 min | 40 min | > 90 min |
Setting Up MTTR Tracking
Option 1: Incident Management Tool
PagerDuty, OpsGenie, and similar tools track:
- Alert creation time
- Acknowledgment time
- Resolution time
# Export PagerDuty incident data
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01&until=2024-02-01" \
-H "Authorization: Token token=YOUR_TOKEN" \
-H "Accept: application/vnd.pagerduty+json;version=2" | \
jq '.incidents[] | {id: .id, created: .created_at, resolved: .resolved_at}'
Option 2: Custom Tracking
Add timestamps to your incident process:
# Start incident tracking
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) INCIDENT_START" >> /var/log/incidents.log
# Mark diagnosis complete
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) DIAGNOSIS_COMPLETE" >> /var/log/incidents.log
# Mark resolved
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) RESOLVED" >> /var/log/incidents.log
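Those timestamped lines can then be reduced to a duration after the fact; a minimal sketch with fabricated log contents (assumes GNU `date`, and uses `/tmp` rather than `/var/log` for illustration):

```shell
# Build a sample incident log (contents are made up)
LOG=/tmp/incidents.log
cat > "$LOG" <<'EOF'
2024-01-15T10:00:00Z INCIDENT_START
2024-01-15T10:20:00Z DIAGNOSIS_COMPLETE
2024-01-15T10:35:00Z RESOLVED
EOF
# Pull the start/end timestamps back out and convert to minutes
start=$(date -ud "$(awk '$2=="INCIDENT_START"{print $1}' "$LOG")" +%s)
end=$(date -ud "$(awk '$2=="RESOLVED"{print $1}' "$LOG")" +%s)
echo "MTTR: $(( (end - start) / 60 )) minutes"   # MTTR: 35 minutes
```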
Option 3: Prometheus Metrics
# Histogram metric for MTTR tracking (as you would register it
# via a Prometheus client library)
- name: incident_duration_seconds
  help: Time from incident start to resolution
  type: histogram
  buckets: [300, 600, 900, 1800, 3600, 7200]
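If you don't run a client library, the same data can reach Prometheus through the node_exporter textfile collector; a sketch (the metric name, duration value, and output filename are assumptions, and the file would normally go in the directory given by --collector.textfile.directory):

```shell
# Hypothetical: the last resolved incident took 30 minutes
DURATION_SECONDS=1800
# Write the metric in Prometheus exposition format
cat > incident_mttr.prom <<EOF
# HELP incident_last_duration_seconds Duration of the most recently resolved incident
# TYPE incident_last_duration_seconds gauge
incident_last_duration_seconds ${DURATION_SECONDS}
EOF
```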
Analyzing MTTR Data
Trend Analysis
# Weekly MTTR trend
cat incidents.json | jq -r '.[] | [.week, .mttr_minutes] | @csv' | \
awk -F',' '{sum[$1]+=$2; count[$1]++} END {for (w in sum) print w, sum[w]/count[w]}'
Distribution Analysis
# MTTR percentiles (array indexes must be integers, hence the floor)
cat incidents.json | jq '[.[].mttr_minutes] | sort |
{p50: .[(length/2|floor)], p90: .[(length*0.9|floor)], p99: .[(length*0.99|floor)]}'
By Service Analysis
# MTTR by service
cat incidents.json | jq 'group_by(.service) |
map({service: .[0].service, avg_mttr: ([.[].mttr_minutes] | add / length)})'
By Incident Type
# MTTR by category
cat incidents.json | jq 'group_by(.category) |
map({category: .[0].category, avg_mttr: ([.[].mttr_minutes] | add / length), count: length})'
MTTR Dashboards
Key Visualizations
- MTTR over time: Weekly/monthly trend line
- MTTR by phase: Stacked bar showing detection/diagnosis/resolution
- MTTR by service: Identify problem areas
- MTTR distribution: Histogram showing spread
Grafana Query Examples
# Average MTTR in minutes, last 30 days (a histogram exposes _sum and _count
# series; averaging the bare metric name would match nothing)
sum(increase(incident_duration_seconds_sum[30d])) / sum(increase(incident_duration_seconds_count[30d])) / 60
# MTTR by service
sum by (service) (increase(incident_duration_seconds_sum[30d])) / sum by (service) (increase(incident_duration_seconds_count[30d])) / 60
# 7-day rolling MTTR trend
sum(increase(incident_duration_seconds_sum[7d])) / sum(increase(incident_duration_seconds_count[7d])) / 60
Common MTTR Measurement Mistakes
Mistake 1: Only Measuring Total MTTR
Without phase breakdown, you don’t know where to improve.
Fix: Track each phase separately.
Mistake 2: Excluding “Small” Incidents
Every incident affects MTTR averages.
Fix: Track all incidents consistently.
Mistake 3: Manual Data Entry
Delayed or forgotten entries skew data.
Fix: Automate timestamp capture where possible.
Mistake 4: Ignoring Outliers
One 4-hour incident can mask improvements.
Fix: Use percentiles (p50, p90) alongside averages.
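The skew is easy to demonstrate with made-up durations: a single long incident drags the mean up while the median barely moves.

```shell
# Nine 10-minute incidents plus one 240-minute outlier (fabricated data)
printf '%s\n' 10 10 10 10 10 10 10 10 10 240 | sort -n | \
  awk '{v[NR]=$1; sum+=$1} END {printf "mean: %g p50: %g\n", sum/NR, v[int(NR/2)]}'
# mean: 33 p50: 10
```

Reporting p50 alongside the mean keeps one bad day from hiding a quarter of steady improvement.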
MTTR Improvement Targets
Set realistic goals:
| Current MTTR | 6-Month Target | Focus Area |
|---|---|---|
| > 60 min | 40 min | Basic runbooks, faster detection |
| 40-60 min | 25 min | Executable runbooks, automation |
| 25-40 min | 15 min | Predictive detection, self-healing |
| < 25 min | 10 min | Advanced automation, ML-based diagnosis |
Connecting MTTR to Business Metrics
Translate MTTR to business impact:
# Calculate downtime cost
MTTR_MINUTES=30
INCIDENTS_PER_MONTH=10
COST_PER_MINUTE=1000
echo "Monthly downtime cost: \$$(($MTTR_MINUTES * $INCIDENTS_PER_MONTH * $COST_PER_MINUTE))"
Revenue Impact Formula
Monthly Downtime Cost = MTTR × Incidents/Month × Revenue/Minute
Stew’s Impact on MTTR Metrics
Stew reduces MTTR by eliminating friction:
- Faster diagnosis: Executable commands, no copy-paste
- Fewer errors: No typos from manual entry
- Better tracking: Captured output shows what was tried
Teams adopting executable runbooks commonly report 40-60% reductions in diagnosis time alone.
Join the waitlist and start measuring improvement.