Alert Fatigue: The Silent Killer of On-Call Effectiveness
Alert fatigue doesn’t announce itself. It creeps in slowly—first you snooze one alert, then you ignore a whole category, then you miss the one that mattered.
This guide covers alert fatigue: what it is, why it happens, and how to fix it. For response procedures, see our on-call runbook guide.
What Is Alert Fatigue?
Alert fatigue occurs when the volume of alerts overwhelms responders, causing them to become desensitized. The result: slower response to real issues, missed critical alerts, and burned-out engineers.
The Symptoms
- Alerts acknowledged without investigation
- “I’ll look at it later” becoming default
- Same alerts firing for weeks unaddressed
- Engineers muting channels
- Real incidents discovered by users, not alerts
The Statistics
Commonly cited industry figures (exact numbers vary by study):
- Teams with 50+ daily alerts have 3x higher MTTR
- 70% of alerts in typical systems are noise
- On-call engineers with high alert volume have 2x burnout rate
- Alert fatigue contributes to 30% of major incident escalations
Why Alert Fatigue Happens
Cause 1: Alerting on Everything
# Bad: Alert on any error
- alert: AnyError
  expr: errors_total > 0
  for: 1m
Some errors are normal. Alerting on all of them creates noise.
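To see the difference, compare paging on "any error" with paging on an error *rate*. A minimal sketch with an invented minute-by-minute trace (the counts and the 1% threshold are illustrative):

```python
# Invented error counts per minute for a service doing ~1,000 req/min;
# a few errors per minute are normal background noise.
error_counts = [2, 0, 3, 1, 2, 40, 55, 48, 2, 1]
requests_per_min = 1000

# "errors_total > 0": pages on every non-zero minute
any_error_pages = sum(1 for e in error_counts if e > 0)
# "error rate > 1%": pages only when errors are a meaningful share of traffic
rate_pages = sum(1 for e in error_counts if e / requests_per_min > 0.01)

print(any_error_pages)  # 9 of 10 minutes page: pure noise
print(rate_pages)       # 3 minutes page: only the real spike
```

The rate-based condition fires only during the genuine spike; the "any error" condition pages almost every minute.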
Cause 2: Static Thresholds
# Bad: Same threshold regardless of context
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m
80% CPU might be fine during peak hours but critical at 3 a.m.
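One mitigation is deriving the threshold from recent behavior instead of hardcoding it. A minimal sketch (the sample values and the three-sigma choice are illustrative, not a recommendation):

```python
import statistics

def adaptive_threshold(history, sigmas=3.0):
    """Threshold = mean of recent samples + N standard deviations."""
    return statistics.mean(history) + sigmas * statistics.stdev(history)

# Invented CPU% samples: ~70% is normal at peak, ~20% overnight.
peak_history = [68, 72, 70, 74, 69, 71]
night_history = [18, 22, 20, 19, 21, 20]

print(round(adaptive_threshold(peak_history), 1))   # 77.1: 75% at peak stays quiet
print(round(adaptive_threshold(night_history), 1))  # 24.2: 40% at 3 a.m. pages
```

The same 40% CPU reading is ignored at peak and pages overnight, which is exactly what a static `> 80` rule cannot express.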
Cause 3: Missing Deduplication
Ten pods failing the same way = ten identical alerts instead of one.
Cause 4: No Alert Ownership
Alerts without clear owners become everyone’s problem, which means no one’s problem.
Cause 5: Fear of Missing Issues
“Better safe than sorry” leads to alerting on everything.
Measuring Alert Fatigue
Alert Metrics to Track
# Alert frequency by type
curl -s http://alertmanager:9093/api/v1/alerts | \
jq '[.data[].labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'
Key Indicators
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Alerts per on-call shift | < 10 | 10-25 | > 25 |
| % alerts requiring action | > 80% | 50-80% | < 50% |
| Avg time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Repeat alerts (same issue) | < 10% | 10-30% | > 30% |
Audit Your Alerts
# Export last month's alerts for analysis
curl -s "http://alertmanager:9093/api/v1/alerts?silenced=false&inhibited=false" | \
jq '.data[] | {alert: .labels.alertname, severity: .labels.severity, starts: .startsAt}' > alerts-audit.json
# Count by alert name
cat alerts-audit.json | jq -s 'group_by(.alert) | map({alert: .[0].alert, count: length}) | sort_by(-.count)'
The Cost of Alert Fatigue
Direct Costs
- Increased MTTR: Slow response to real issues
- Missed incidents: Critical alerts lost in noise
- On-call compensation: More pages = higher costs
Hidden Costs
- Burnout: Engineers leave or disengage
- Degraded trust: Teams stop trusting alerting systems
- Risk acceptance: “It always fires, it’s probably fine”
Calculating Impact
Alert Fatigue Cost = (False Positive Alerts × Response Time × Engineer Cost)
+ (Missed Incidents × Incident Cost)
+ (Burnout-Related Turnover × Replacement Cost)
The Path to Healthy Alerting
Step 1: Audit Current State
## Alert Audit Checklist
For each alert, answer:
- [ ] When did this last fire?
- [ ] Was action taken?
- [ ] Was the action necessary?
- [ ] Could we have not alerted?
- [ ] Is there a runbook?
Step 2: Classify Alerts
| Category | Description | Action |
|---|---|---|
| True positive + actionable | Real issue, action taken | Keep |
| True positive + not actionable | Real issue, no action possible | Remove or fix root cause |
| False positive | Not actually a problem | Fix or remove |
| Noise | Fires constantly, ignored | Remove |
Step 3: Fix or Remove
# Before: Noisy alert
- alert: HighMemory
expr: memory_usage > 80
for: 1m
# After: Actionable alert
- alert: HighMemory
expr: memory_usage > 90
for: 10m
labels:
severity: warning
annotations:
runbook: "Memory is high. Check for leaks or scale up."
Step 4: Establish Ownership
Every alert needs:
- An owning team
- A runbook
- A review date
Step 5: Continuous Improvement
## Weekly Alert Review
- [ ] Review all alerts from past week
- [ ] Identify any that were noise
- [ ] Fix or remove noisy alerts
- [ ] Update runbooks as needed
- [ ] Track alert count trend
Alert Fatigue Prevention
Design Principles
- Alert on symptoms, not causes: Users experience latency, not CPU spikes
- Alert on actionable conditions: If you can’t do anything, don’t alert
- Use appropriate severity: Not everything is critical
- Require runbooks: No runbook = no alert
Good vs Bad Alerts
| Bad Alert | Good Alert |
|---|---|
| CPU > 80% | Error rate > 1% affecting users |
| Any error logged | Error rate spike from baseline |
| Disk > 70% | Disk will fill in < 4 hours |
| Pod restarted | Pod restart loop (> 3 in 10 min) |
Stew’s Role in Reducing Alert Fatigue
Stew helps in two ways:
- Faster response: Executable runbooks mean less time per alert
- Better triage: Run diagnostics with a click to quickly determine if action is needed
When engineers can investigate quickly, they’re more likely to actually investigate.
Join the waitlist and build sustainable on-call practices.