
MTTR Reduction: A Practical Guide for SRE Teams

5 min read · Stew Team
mttr · incident-response · sre · reliability

Mean Time to Recovery (MTTR) is the metric that matters most during incidents. Every minute of downtime costs money, erodes trust, and burns out your team.

This guide covers practical strategies to reduce MTTR. For executable incident procedures, see our incident response runbook guide.

What Is MTTR?

MTTR measures the average time from when an incident is detected to when service is restored.

MTTR = Total downtime / Number of incidents
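As a quick illustration of the formula, here is a sketch with hypothetical figures: 550 minutes of total downtime across 10 incidents in a month.

```shell
# Hypothetical figures: 550 minutes of total downtime across 10 incidents
total_downtime_min=550
incident_count=10
mttr=$(( total_downtime_min / incident_count ))
echo "MTTR: ${mttr} minutes"   # MTTR: 55 minutes
```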

A typical breakdown:

  • Detection: Time to notice something is wrong
  • Diagnosis: Time to identify the root cause
  • Resolution: Time to fix the problem
  • Verification: Time to confirm the fix worked

Why MTTR Matters

| MTTR   | Monthly Downtime (10 incidents) | Monthly Cost (at $10k/min) |
|--------|---------------------------------|----------------------------|
| 60 min | 10 hours                        | $6M                        |
| 30 min | 5 hours                         | $3M                        |
| 15 min | 2.5 hours                       | $1.5M                      |

Cutting MTTR in half halves your downtime costs.
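The figures above fall out of simple multiplication. A sketch, assuming the same $10k/min cost and 10 incidents per month:

```shell
# Monthly cost = MTTR (min) x incidents per month x cost per minute of downtime
cost_per_min=10000
incidents_per_month=10
mttr_min=60
monthly_cost=$(( mttr_min * incidents_per_month * cost_per_min ))
echo "Monthly cost at ${mttr_min}-min MTTR: \$${monthly_cost}"   # $6000000
```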

MTTR Reduction Strategy 1: Faster Detection

You can’t fix what you don’t know is broken.

Implement Comprehensive Monitoring

# Check if key endpoints are monitored
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | .labels.alertname'

Reduce Alert Latency

# Prometheus rule with minimal delay
groups:
  - name: critical
    interval: 15s  # Check every 15 seconds
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[1m]) > 0.1
        for: 30s  # Fire after 30 seconds

Enable Real-Time Notifications

# Test PagerDuty integration
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"YOUR_KEY","event_action":"trigger","payload":{"summary":"Test alert","severity":"critical","source":"monitoring"}}'

MTTR Reduction Strategy 2: Faster Diagnosis

Diagnosis often consumes 60%+ of incident time.

Pre-Built Diagnostic Commands

Have commands ready before incidents happen:

# Quick service health check
kubectl get pods -l app=api -o wide
kubectl top pods -l app=api
kubectl logs -l app=api --tail=50 | grep -i error

Correlation Dashboards

Link metrics, logs, and traces in one view:

# Get all signals for a service
echo "=== Pods ===" && kubectl get pods -l app=api
echo "=== Recent Events ===" && kubectl get events --field-selector involvedObject.name=api --sort-by='.lastTimestamp' | tail -10
echo "=== Logs ===" && kubectl logs -l app=api --tail=20

Include diagnostic steps in every alert:

annotations:
  runbook_url: https://runbooks.internal/api-high-latency
  summary: "API latency above threshold"
  diagnostic_commands: |
    kubectl top pods -l app=api
    kubectl logs -l app=api --tail=100 | grep -i slow

MTTR Reduction Strategy 3: Faster Resolution

Once you know the problem, fix it quickly.

One-Click Remediation

Standard fixes should be executable instantly:

# API Pod Restart Runbook

## Verify issue
​```bash
kubectl get pods -l app=api | grep -v Running
​```

## Restart pods
​```bash
kubectl rollout restart deployment/api
​```

## Verify recovery
​```bash
kubectl rollout status deployment/api
​```

Automated Rollbacks

# List revisions to find a known-good target
kubectl rollout history deployment/api

# Quick rollback to last known good
kubectl rollout undo deployment/api

# Or to a specific revision
kubectl rollout undo deployment/api --to-revision=3

Feature Flags for Instant Mitigation

# Disable problematic feature without deploy
curl -X POST http://feature-flags/api/flags/new-checkout \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

MTTR Reduction Strategy 4: Better Preparation

The best incident response happens before incidents.

Executable Runbooks

Static documentation slows you down. Executable runbooks let you:

  • Run diagnostic commands with one click
  • See output immediately
  • Follow procedures step-by-step

See our runbook examples for templates.

Regular Incident Drills

# Chaos engineering - controlled failure injection
kubectl delete pod -l app=api --grace-period=0 --force

# Measure recovery time
time kubectl wait --for=condition=ready pod -l app=api --timeout=300s

On-Call Handoff Documents

# On-Call Handoff

## Current Issues
- Elevated error rates on checkout service
- Database replica lag under investigation

## Recent Changes
- Deployed v2.3.1 at 14:00 UTC
- Scaled worker pods from 3 to 5

## Useful Commands
​```bash
kubectl get pods -n production
kubectl logs -l app=checkout --tail=100
​```

Measuring MTTR Improvement

Track your progress:

# Calculate MTTR from incident data (timestamps as epoch seconds)
jq '[.[] | .resolution_time - .detection_time] | add / length / 60' incidents.json
# Output in minutes
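If jq isn't available on the box, the same average can be sketched with awk, assuming a CSV of detection,resolution timestamps in epoch seconds (hypothetical file and layout):

```shell
# Two hypothetical incidents: 600s and 900s to resolve -> average 750s = 12.5 min
printf '%s\n' '1000,1600' '2000,2900' > /tmp/incidents.csv
awk -F, '{ sum += $2 - $1; n++ } END { printf "MTTR: %.1f minutes\n", sum / n / 60 }' /tmp/incidents.csv
```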

MTTR Breakdown Analysis

| Phase        | Before | After  | Improvement |
|--------------|--------|--------|-------------|
| Detection    | 10 min | 2 min  | 80%         |
| Diagnosis    | 25 min | 10 min | 60%         |
| Resolution   | 15 min | 5 min  | 67%         |
| Verification | 5 min  | 3 min  | 40%         |
| Total MTTR   | 55 min | 20 min | 64%         |
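The improvement column is just (before - after) / before. A quick check for the total row, using the table's numbers:

```shell
# Total MTTR: 55 min before, 20 min after
before=55
after=20
improvement=$(( (before - after) * 100 / before ))   # integer division gives 63; the table rounds 63.6% to 64%
echo "Total MTTR improved by ~${improvement}%"
```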

Common MTTR Pitfalls

Pitfall 1: Alert Noise

Too many alerts means slow response to real issues.

# Analyze alert frequency
curl -s http://alertmanager:9093/api/v1/alerts | jq '[.data[].labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'

Pitfall 2: Outdated Runbooks

Runbooks that don’t match reality slow you down.

Pitfall 3: Single Points of Knowledge

If only one person knows how to fix something, MTTR depends on their availability.

Stew: Built for MTTR Reduction

Stew directly addresses MTTR by making runbooks executable:

  • Faster diagnosis: Run commands from runbooks instantly
  • Faster resolution: Pre-built procedures execute with clicks
  • Better preparation: Keep runbooks updated and tested

Your team stops copying commands from wikis and starts executing documented procedures.

Join the waitlist and cut your MTTR.