
Alert Fatigue: The Silent Killer of On-Call Effectiveness

5 min read · Stew Team

Alert fatigue doesn’t announce itself. It creeps in slowly—first you snooze one alert, then you ignore a whole category, then you miss the one that mattered.

This guide covers alert fatigue: what it is, why it happens, and how to fix it. For response procedures, see our on-call runbook guide.

What Is Alert Fatigue?

Alert fatigue occurs when the volume of alerts overwhelms responders, causing them to become desensitized. The result: slower response to real issues, missed critical alerts, and burned-out engineers.

The Symptoms

  • Alerts acknowledged without investigation
  • “I’ll look at it later” becoming default
  • Same alerts firing for weeks unaddressed
  • Engineers muting channels
  • Real incidents discovered by users, not alerts

The Statistics

Research shows:

  • Teams with 50+ daily alerts have 3x higher MTTR
  • 70% of alerts in typical systems are noise
  • On-call engineers with high alert volume have 2x burnout rate
  • Alert fatigue contributes to 30% of major incident escalations

Why Alert Fatigue Happens

Cause 1: Alerting on Everything

# Bad: Alert on any error
- alert: AnyError
  expr: errors_total > 0
  for: 1m

Some errors are normal. Alerting on all of them creates noise.
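A common fix is to alert on the error rate relative to traffic instead of on individual errors. A sketch in the same Prometheus rule format, assuming `errors_total` and `requests_total` counters exist (the metric names and 1% threshold are illustrative):

```yaml
# Better: alert when the error *rate* is elevated relative to traffic,
# not on any single error. Metric names are illustrative assumptions.
- alert: HighErrorRate
  expr: |
    sum(rate(errors_total[5m]))
      / sum(rate(requests_total[5m])) > 0.01
  for: 10m
  labels:
    severity: warning
```

The `for: 10m` clause also filters out brief blips that would self-resolve before anyone could act.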

Cause 2: Static Thresholds

# Bad: Same threshold regardless of context
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m

80% CPU might be fine during peak hours but critical at 3am.
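One way around a fixed number is to compare the metric against its own recent baseline. A hedged sketch, assuming a `cpu_usage` gauge (the metric name and 1.5× multiplier are illustrative):

```yaml
# Sketch: alert when CPU is well above its own average from the
# previous day, rather than crossing a fixed threshold.
- alert: CPUAboveBaseline
  expr: |
    cpu_usage
      > 1.5 * avg_over_time(cpu_usage[1d] offset 1d)
  for: 15m
  labels:
    severity: warning
```

This way the alert adapts to each service's normal load profile instead of firing every peak hour.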

Cause 3: Missing Deduplication

Ten pods failing the same way = ten identical alerts instead of one.
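In Alertmanager, grouping collapses identical alerts into a single notification. A minimal route sketch (label names and timings are illustrative, not a recommendation for every setup):

```yaml
# Alertmanager route sketch: group identical alerts so ten failing
# pods produce one notification instead of ten.
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s      # wait briefly to batch related alerts
  group_interval: 5m   # delay before notifying about new alerts in a group
  repeat_interval: 4h  # re-notify for still-firing groups at most this often
```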

Cause 4: No Alert Ownership

Alerts without clear owners become everyone’s problem, which means no one’s problem.

Cause 5: Fear of Missing Issues

“Better safe than sorry” leads to alerting on everything.

Measuring Alert Fatigue

Alert Metrics to Track

# Alert frequency by type
curl -s http://alertmanager:9093/api/v1/alerts | \
  jq '[.data[].labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'

Key Indicators

| Metric | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Alerts per on-call shift | < 10 | 10-25 | > 25 |
| % alerts requiring action | > 80% | 50-80% | < 50% |
| Avg time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Repeat alerts (same issue) | < 10% | 10-30% | > 30% |

Audit Your Alerts

# Export currently firing alerts for analysis
# (the Alertmanager API returns active alerts; for historical data,
# export from your metrics or incident-management system instead)
curl -s "http://alertmanager:9093/api/v1/alerts?silenced=false&inhibited=false" | \
  jq '.data[] | {alert: .labels.alertname, severity: .labels.severity, starts: .startsAt}' > alerts-audit.json

# Count by alert name
jq -s 'group_by(.alert) | map({alert: .[0].alert, count: length}) | sort_by(-.count)' alerts-audit.json

The Cost of Alert Fatigue

Direct Costs

  • Increased MTTR: Slow response to real issues
  • Missed incidents: Critical alerts lost in noise
  • On-call compensation: More pages = higher costs

Hidden Costs

  • Burnout: Engineers leave or disengage
  • Degraded trust: Teams stop trusting alerting systems
  • Risk acceptance: “It always fires, it’s probably fine”

Calculating Impact

Alert Fatigue Cost = (False Positive Alerts × Response Time × Engineer Cost) 
                   + (Missed Incidents × Incident Cost)
                   + (Burnout-Related Turnover × Replacement Cost)
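Plugging in illustrative numbers makes the formula concrete. Every figure below is an assumption for the sake of the example, not a benchmark:

```
# Hypothetical month: 200 false positives × 10 min each at $2/min
# loaded engineer cost, 1 missed incident at $20,000, no turnover.
(200 × 10 × $2) + (1 × $20,000) + (0 × $0)
= $4,000 + $20,000
= $24,000 per month
```

Even with no turnover, the noise itself carries a five-figure monthly price tag in this scenario.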

The Path to Healthy Alerting

Step 1: Audit Current State

## Alert Audit Checklist

For each alert, answer:
- [ ] When did this last fire?
- [ ] Was action taken?
- [ ] Was the action necessary?
- [ ] Could we have skipped alerting entirely?
- [ ] Is there a runbook?

Step 2: Classify Alerts

| Category | Description | Action |
| --- | --- | --- |
| True positive + actionable | Real issue, action taken | Keep |
| True positive + not actionable | Real issue, no action possible | Remove or fix root cause |
| False positive | Not actually a problem | Fix or remove |
| Noise | Fires constantly, ignored | Remove |

Step 3: Fix or Remove

# Before: Noisy alert
- alert: HighMemory
  expr: memory_usage > 80
  for: 1m

# After: Actionable alert
- alert: HighMemory
  expr: memory_usage > 90
  for: 10m
  labels:
    severity: warning
  annotations:
    runbook: "Memory is high. Check for leaks or scale up."

Step 4: Establish Ownership

Every alert needs:

  • An owning team
  • A runbook
  • A review date
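In Prometheus-style rules, ownership can live on the alert itself, so routing and accountability travel with it. A sketch (the `team` label and `review_by` annotation are conventions assumed for this example, not a standard):

```yaml
- alert: HighMemory
  expr: memory_usage > 90
  for: 10m
  labels:
    severity: warning
    team: payments  # owning team; route notifications to their channel
  annotations:
    runbook: "https://runbooks.example.com/high-memory"  # illustrative URL
    review_by: "2025-06-01"  # next scheduled review of this alert
```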

Step 5: Continuous Improvement

## Weekly Alert Review

- [ ] Review all alerts from past week
- [ ] Identify any that were noise
- [ ] Fix or remove noisy alerts
- [ ] Update runbooks as needed
- [ ] Track alert count trend

Alert Fatigue Prevention

Design Principles

  1. Alert on symptoms, not causes: Users experience latency, not CPU spikes
  2. Alert on actionable conditions: If you can’t do anything, don’t alert
  3. Use appropriate severity: Not everything is critical
  4. Require runbooks: No runbook = no alert

Good vs Bad Alerts

| Bad Alert | Good Alert |
| --- | --- |
| CPU > 80% | Error rate > 1% affecting users |
| Any error logged | Error rate spike from baseline |
| Disk > 70% | Disk will fill in < 4 hours |
| Pod restarted | Pod restart loop (> 3 in 10 min) |
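The "disk will fill in < 4 hours" pattern maps naturally onto PromQL's `predict_linear`. A sketch assuming a node_exporter-style `node_filesystem_avail_bytes` metric:

```yaml
# Fire only if, extrapolating the last hour's trend, the filesystem
# will run out of space within 4 hours (4 * 3600 seconds).
- alert: DiskWillFillIn4Hours
  expr: |
    predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
  for: 15m
  labels:
    severity: warning
```

Unlike a static "disk > 70%" rule, this stays quiet for a disk that is full but stable, and fires early for one filling fast.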

Stew’s Role in Reducing Alert Fatigue

Stew helps in two ways:

  1. Faster response: Executable runbooks mean less time per alert
  2. Better triage: Run diagnostics with a click to quickly determine if action is needed

When engineers can investigate quickly, they’re more likely to actually investigate.

Join the waitlist and build sustainable on-call practices.