# MTTR Reduction: A Practical Guide for SRE Teams
Mean Time to Recovery (MTTR) is the metric that matters most during incidents. Every minute of downtime costs money, erodes trust, and burns out your team.
This guide covers practical strategies to reduce MTTR. For executable incident procedures, see our incident response runbook guide.
## What Is MTTR?

MTTR measures the average time from when an incident is detected to when service is restored.

```
MTTR = Total downtime / Number of incidents
```
A typical breakdown:
- Detection: Time to notice something is wrong
- Diagnosis: Time to identify the root cause
- Resolution: Time to fix the problem
- Verification: Time to confirm the fix worked
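The formula above can be sketched in a few lines (the `detected`/`restored` field names are hypothetical, chosen for illustration):

```python
def mttr_minutes(incidents):
    """Mean time to recovery in minutes.

    Each incident is a dict with 'detected' and 'restored' epoch-second
    timestamps (hypothetical field names for illustration).
    """
    total_downtime = sum(i["restored"] - i["detected"] for i in incidents)
    return total_downtime / len(incidents) / 60
```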
## Why MTTR Matters

| MTTR | Monthly Downtime (10 incidents) | Monthly Cost (at $10k/min) |
|---|---|---|
| 60 min | 10 hours | $6M |
| 30 min | 5 hours | $3M |
| 15 min | 2.5 hours | $1.5M |

Cutting MTTR in half halves your downtime cost.
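The table's arithmetic can be reproduced with a short sketch (the 10 incidents/month and $10k/min figures are the table's illustrative assumptions):

```python
def monthly_downtime_cost(mttr_min, incidents_per_month=10, cost_per_min=10_000):
    """Monthly downtime cost: MTTR x incident count x cost per minute."""
    downtime_min = mttr_min * incidents_per_month
    return downtime_min * cost_per_min
```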
## MTTR Reduction Strategy 1: Faster Detection

You can’t fix what you don’t know is broken.

### Implement Comprehensive Monitoring

```bash
# List the alerts Alertmanager currently knows about (v2 API; v1 is deprecated)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[].labels.alertname'
```
### Reduce Alert Latency

```yaml
# Prometheus rule with minimal delay
groups:
  - name: critical
    interval: 15s  # Evaluate every 15 seconds
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[1m]) > 0.1
        for: 30s  # Fire once the condition has held for 30 seconds
```
### Enable Real-Time Notifications

```bash
# Test PagerDuty integration
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H "Content-Type: application/json" \
  -d '{"routing_key":"YOUR_KEY","event_action":"trigger","payload":{"summary":"Test alert","severity":"critical","source":"monitoring"}}'
```
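If you trigger test events from scripts rather than curl, the same Events API v2 body can be built programmatically. This sketch only constructs the JSON shown in the curl above; sending it is left to your HTTP client:

```python
import json

def pagerduty_trigger_body(routing_key, summary, severity="critical", source="monitoring"):
    """Build the JSON body for a PagerDuty Events API v2 'trigger' event."""
    return json.dumps({
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "severity": severity, "source": source},
    })
```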
## MTTR Reduction Strategy 2: Faster Diagnosis

Diagnosis often consumes 60%+ of incident time.

### Pre-Built Diagnostic Commands

Have commands ready before incidents happen:

```bash
# Quick service health check
kubectl get pods -l app=api -o wide
kubectl top pods -l app=api
kubectl logs -l app=api --tail=50 | grep -i error
```
### Correlation Dashboards

Link metrics, logs, and traces in one view:

```bash
# Get all signals for a service
echo "=== Pods ===" && kubectl get pods -l app=api
echo "=== Recent Events ===" && kubectl get events --field-selector involvedObject.name=api --sort-by='.lastTimestamp' | tail -10
echo "=== Logs ===" && kubectl logs -l app=api --tail=20
```
### Runbook Links in Alerts

Include diagnostic steps in every alert:

```yaml
annotations:
  runbook_url: https://runbooks.internal/api-high-latency
  summary: "API latency above threshold"
  diagnostic_commands: |
    kubectl top pods -l app=api
    kubectl logs -l app=api --tail=100 | grep -i slow
```
## MTTR Reduction Strategy 3: Faster Resolution

Once you know the problem, fix it quickly.

### One-Click Remediation

Standard fixes should be executable instantly:

````markdown
# API Pod Restart Runbook

## Verify issue
```bash
kubectl get pods -l app=api | grep -v Running
```

## Restart pods
```bash
kubectl rollout restart deployment/api
```

## Verify recovery
```bash
kubectl rollout status deployment/api
```
````
### Automated Rollbacks

```bash
# Quick rollback to last known good
kubectl rollout undo deployment/api

# Or to a specific revision
kubectl rollout undo deployment/api --to-revision=3
```
### Feature Flags for Instant Mitigation

```bash
# Disable the problematic feature without a deploy
curl -X POST http://feature-flags/api/flags/new-checkout \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
```
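Conceptually, a kill switch is just a lookup that defaults to off. A toy in-memory sketch of that semantics (not a real flag-service client):

```python
class FlagStore:
    """Toy in-memory feature-flag store illustrating kill-switch semantics."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name, default=False):
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(name, default)
```

The fail-safe default matters: if the flag service is unreachable or the flag was never defined, the risky code path stays disabled.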
## MTTR Reduction Strategy 4: Better Preparation

The best incident response happens before incidents.

### Executable Runbooks
Static documentation slows you down. Executable runbooks let you:
- Run diagnostic commands with one click
- See output immediately
- Follow procedures step-by-step
See our runbook examples for templates.
### Regular Incident Drills

```bash
# Chaos engineering: controlled failure injection
# (kubectl requires --force when --grace-period=0 is set)
kubectl delete pod -l app=api --grace-period=0 --force

# Measure recovery time
time kubectl wait --for=condition=ready pod -l app=api --timeout=300s
```
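The `time kubectl wait` pattern generalizes to any readiness probe. A sketch of measuring recovery during a drill, where `is_ready` is whatever check you care about (an HTTP health endpoint, a kubectl call, etc.):

```python
import time

def measure_recovery(is_ready, timeout_s=300.0, poll_s=0.5):
    """Poll a readiness check until it passes; return elapsed seconds.

    Raises TimeoutError if the service does not recover within timeout_s.
    """
    start = time.monotonic()
    while not is_ready():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("service did not recover within the timeout")
        time.sleep(poll_s)
    return time.monotonic() - start
```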
### On-Call Handoff Documents

````markdown
# On-Call Handoff

## Current Issues
- Elevated error rates on checkout service
- Database replica lag under investigation

## Recent Changes
- Deployed v2.3.1 at 14:00 UTC
- Scaled worker pods from 3 to 5

## Useful Commands
```bash
kubectl get pods -n production
kubectl logs -l app=checkout --tail=100
```
````
## Measuring MTTR Improvement

Track your progress:

```bash
# Calculate mean MTTR in minutes
# (assumes detection_time and resolution_time are epoch seconds)
jq '[.[] | .resolution_time - .detection_time] | add / length / 60' incidents.json
```
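Going beyond a single mean, the same data can be split by phase if each incident records per-phase timestamps. The field names here (`started`, `detected`, `diagnosed`, `resolved`, `verified`) are hypothetical:

```python
def phase_breakdown_minutes(incidents):
    """Average minutes spent per MTTR phase across incidents.

    Each incident is a dict of epoch-second timestamps in order:
    started -> detected -> diagnosed -> resolved -> verified.
    """
    phases = [
        ("detection", "started", "detected"),
        ("diagnosis", "detected", "diagnosed"),
        ("resolution", "diagnosed", "resolved"),
        ("verification", "resolved", "verified"),
    ]
    n = len(incidents)
    return {name: sum(i[end] - i[start] for i in incidents) / n / 60
            for name, start, end in phases}
```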
### MTTR Breakdown Analysis
| Phase | Before | After | Improvement |
|---|---|---|---|
| Detection | 10 min | 2 min | 80% |
| Diagnosis | 25 min | 10 min | 60% |
| Resolution | 15 min | 5 min | 67% |
| Verification | 5 min | 3 min | 40% |
| Total MTTR | 55 min | 20 min | 64% |
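The improvement column is just the percent reduction per phase; a one-liner reproduces it:

```python
def percent_improvement(before_min, after_min):
    """Percent reduction from before to after, rounded to a whole percent."""
    return round((before_min - after_min) / before_min * 100)
```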
## Common MTTR Pitfalls

### Pitfall 1: Alert Noise

Too many alerts means slow response to real issues.

```bash
# Rank alerts by firing frequency (Alertmanager v2 API)
curl -s http://alertmanager:9093/api/v2/alerts | jq '[.[].labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'
```
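The same frequency ranking, done offline on a list of alert names (equivalent in spirit to the jq pipeline above):

```python
from collections import Counter

def noisiest_alerts(alert_names, top=5):
    """Rank alert names by how often they fired, noisiest first."""
    return Counter(alert_names).most_common(top)
```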
### Pitfall 2: Outdated Runbooks

Runbooks that don’t match reality slow you down.

### Pitfall 3: Single Points of Knowledge

If only one person knows how to fix something, MTTR depends on their availability.
## Stew: Built for MTTR Reduction
Stew directly addresses MTTR by making runbooks executable:
- Faster diagnosis: Run commands from runbooks instantly
- Faster resolution: Pre-built procedures execute with clicks
- Better preparation: Keep runbooks updated and tested
Your team stops copying commands from wikis and starts executing documented procedures.
Join the waitlist and cut your MTTR.