
Measuring MTTR: Metrics and Benchmarks for SRE Teams

· 5 min read · Stew Team
mttr · metrics · sre · observability

You can’t improve what you don’t measure. MTTR reduction requires systematic tracking across incident phases.

This guide covers how to measure MTTR effectively. For reduction strategies, see our MTTR reduction guide.

MTTR Formula

The basic calculation:

MTTR = Total Recovery Time / Number of Incidents

But this hides important details. Break it down:

MTTR = Detection Time + Diagnosis Time + Resolution Time + Verification Time
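As a toy illustration (all numbers hypothetical), the phases simply sum:

```shell
# Hypothetical phase durations in minutes
DETECT=4; DIAGNOSE=12; RESOLVE=8; VERIFY=3
echo "MTTR: $((DETECT + DIAGNOSE + RESOLVE + VERIFY)) minutes"
```

Tracking the four terms separately tells you which phase to attack first.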

MTTR Breakdown Metrics

Time to Detect (TTD)

From incident start to first alert.

# Average detection time in minutes (assumes epoch-second timestamps)
cat incidents.json | jq '[.[] | .alert_time - .incident_start] | add / length / 60'

Benchmark: < 5 minutes for critical services

Time to Acknowledge (TTA)

From alert to human response.

# PagerDuty average acknowledgment time in minutes
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01" \
  -H "Authorization: Token token=YOUR_TOKEN" | \
  jq '[.incidents[] | select(.acknowledged_at != null) |
    (.acknowledged_at | fromdateiso8601) - (.created_at | fromdateiso8601)] | add / length / 60'

Benchmark: < 5 minutes during business hours, < 15 minutes off-hours

Time to Diagnose (TTDiag)

From acknowledgment to root cause identification.

This is often the longest phase and hardest to measure automatically. Track manually:

# Incident Template

- Acknowledged: [timestamp]
- Root cause identified: [timestamp]
- Diagnosis notes: [what was tried]

Benchmark: < 15 minutes for known issues, < 30 minutes for novel issues
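Given two timestamps from the template above, the gap is easy to compute in shell. A minimal sketch, assuming GNU `date` and hypothetical timestamps:

```shell
# Hypothetical timestamps copied from the incident template
ACK="2024-03-01T10:05:00Z"
ROOT_CAUSE="2024-03-01T10:22:00Z"

# Convert to epoch seconds and take the difference (GNU date)
TTDIAG=$(( ($(date -d "$ROOT_CAUSE" +%s) - $(date -d "$ACK" +%s)) / 60 ))
echo "Time to diagnose: ${TTDIAG} minutes"
```

On macOS/BSD, swap `date -d` for `date -j -f`.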

Time to Resolve (TTR)

From diagnosis to fix applied.

# Calculate from deployment timestamps
kubectl get deployment api -o jsonpath='{.metadata.annotations.deployment-time}'

Benchmark: < 10 minutes for standard remediations

Time to Verify (TTV)

From fix applied to confirmed working.

# Monitor recovery
time kubectl rollout status deployment/api

Benchmark: < 5 minutes

MTTR Benchmarks by Industry

| Industry   | Good MTTR | Average MTTR | Poor MTTR |
|------------|-----------|--------------|-----------|
| E-commerce | < 15 min  | 30 min       | > 60 min  |
| SaaS       | < 30 min  | 45 min       | > 90 min  |
| Finance    | < 10 min  | 20 min       | > 45 min  |
| Media      | < 20 min  | 40 min       | > 90 min  |

Setting Up MTTR Tracking

Option 1: Incident Management Tool

PagerDuty, OpsGenie, and similar tools track:

  • Alert creation time
  • Acknowledgment time
  • Resolution time

# Export PagerDuty incident data
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01&until=2024-02-01" \
  -H "Authorization: Token token=YOUR_TOKEN" | \
  jq '.incidents[] | {id: .id, created: .created_at, resolved: .resolved_at}'

Option 2: Custom Tracking

Add timestamps to your incident process:

# Start incident tracking
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) INCIDENT_START" >> /var/log/incidents.log

# Mark diagnosis complete
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) DIAGNOSIS_COMPLETE" >> /var/log/incidents.log

# Mark resolved
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) RESOLVED" >> /var/log/incidents.log
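A small script can then turn those log lines into durations. A sketch assuming GNU `date`, one incident per file, and sample timestamps (adapt the path and format to your process):

```shell
# Sample log matching the format above (hypothetical timestamps)
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-03-01T10:00:00Z INCIDENT_START
2024-03-01T10:18:00Z DIAGNOSIS_COMPLETE
2024-03-01T10:27:00Z RESOLVED
EOF

# Pull each phase timestamp and convert to epoch seconds (GNU date)
START=$(date -d "$(awk '/INCIDENT_START/ {print $1}' "$LOG")" +%s)
DIAG=$(date -d "$(awk '/DIAGNOSIS_COMPLETE/ {print $1}' "$LOG")" +%s)
END=$(date -d "$(awk '/RESOLVED/ {print $1}' "$LOG")" +%s)

echo "Diagnosis: $(( (DIAG - START) / 60 )) min, total: $(( (END - START) / 60 )) min"
```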

Option 3: Prometheus Metrics

# Custom metrics for MTTR tracking
- name: incident_duration_seconds
  help: Time from incident start to resolution
  type: histogram
  buckets: [300, 600, 900, 1800, 3600, 7200]

Analyzing MTTR Data

Trend Analysis

# Weekly MTTR trend
cat incidents.json | jq -r '.[] | [.week, .mttr_minutes] | @csv' | \
  awk -F',' '{sum[$1]+=$2; count[$1]++} END {for (w in sum) print w, sum[w]/count[w]}'

Distribution Analysis

# MTTR percentiles (array indices must be integers, hence the floor)
cat incidents.json | jq '[.[].mttr_minutes] | sort |
  {p50: .[(length / 2 | floor)], p90: .[(length * 0.9 | floor)], p99: .[(length * 0.99 | floor)]}'

By Service Analysis

# MTTR by service
cat incidents.json | jq 'group_by(.service) | 
  map({service: .[0].service, avg_mttr: ([.[].mttr_minutes] | add / length)})'

By Incident Type

# MTTR by category
cat incidents.json | jq 'group_by(.category) | 
  map({category: .[0].category, avg_mttr: ([.[].mttr_minutes] | add / length), count: length})'

MTTR Dashboards

Key Visualizations

  1. MTTR over time: Weekly/monthly trend line
  2. MTTR by phase: Stacked bar showing detection/diagnosis/resolution
  3. MTTR by service: Identify problem areas
  4. MTTR distribution: Histogram showing spread
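The per-phase view above needs averages broken out by phase. One way to produce them with jq, assuming a hypothetical schema where each incident carries `ttd`/`ttdiag`/`ttr`/`ttv` fields in minutes:

```shell
# Average each phase across incidents (inline sample data,
# hypothetical field names — substitute your incidents.json)
echo '[{"ttd":3,"ttdiag":14,"ttr":7,"ttv":2},
       {"ttd":5,"ttdiag":20,"ttr":9,"ttv":4}]' | jq -c '
  {ttd: ([.[].ttd] | add / length),
   ttdiag: ([.[].ttdiag] | add / length),
   ttr: ([.[].ttr] | add / length),
   ttv: ([.[].ttv] | add / length)}'
```

The resulting object maps directly onto one stacked bar per period.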

Grafana Query Examples

# Average MTTR last 30 days (a histogram exposes _sum and _count series)
sum(increase(incident_duration_seconds_sum[30d])) / sum(increase(incident_duration_seconds_count[30d])) / 60

# MTTR by service
sum by (service) (increase(incident_duration_seconds_sum[30d])) / sum by (service) (increase(incident_duration_seconds_count[30d])) / 60

# MTTR trend (7-day rolling average)
sum(increase(incident_duration_seconds_sum[7d])) / sum(increase(incident_duration_seconds_count[7d])) / 60

Common MTTR Measurement Mistakes

Mistake 1: Only Measuring Total MTTR

Without phase breakdown, you don’t know where to improve.

Fix: Track each phase separately.

Mistake 2: Excluding “Small” Incidents

Every incident affects MTTR averages.

Fix: Track all incidents consistently.

Mistake 3: Manual Data Entry

Delayed or forgotten entries skew data.

Fix: Automate timestamp capture where possible.

Mistake 4: Ignoring Outliers

One 4-hour incident can mask improvements.

Fix: Use percentiles (p50, p90) alongside averages.
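To see why, compare mean and median on a small sample with one outlier (numbers are illustrative):

```shell
# Four ~10-minute incidents plus one 240-minute outlier:
# the outlier drags the mean far above the median
echo '[10, 12, 9, 11, 240]' | jq -c 'sort |
  {mean: (add / length), p50: .[(length / 2 | floor)]}'
```

The mean lands above 50 minutes while the p50 stays near 11, so reporting both keeps one bad day from hiding steady-state improvement.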

MTTR Improvement Targets

Set realistic goals:

| Current MTTR | 6-Month Target | Focus Area |
|--------------|----------------|------------|
| > 60 min     | 40 min         | Basic runbooks, faster detection |
| 40-60 min    | 25 min         | Executable runbooks, automation |
| 25-40 min    | 15 min         | Predictive detection, self-healing |
| < 25 min     | 10 min         | Advanced automation, ML-based diagnosis |

Connecting MTTR to Business Metrics

Translate MTTR to business impact:

# Calculate downtime cost
MTTR_MINUTES=30
INCIDENTS_PER_MONTH=10
COST_PER_MINUTE=1000

echo "Monthly downtime cost: \$$(($MTTR_MINUTES * $INCIDENTS_PER_MONTH * $COST_PER_MINUTE))"

Revenue Impact Formula

Monthly Downtime Cost = MTTR × Incidents/Month × Revenue/Minute

Stew’s Impact on MTTR Metrics

Stew reduces MTTR by eliminating friction:

  • Faster diagnosis: Executable commands, no copy-paste
  • Fewer errors: No typos from manual entry
  • Better tracking: Captured output shows what was tried

Teams using executable runbooks see 40-60% MTTR reduction in the diagnosis phase alone.

Join the waitlist and start measuring improvement.