Alert Fatigue: The Silent Killer of On-Call Effectiveness
Alert fatigue doesn’t announce itself. It creeps in slowly—first you snooze one alert, then you ignore a whole category, then you miss the one that mattered.
This guide covers alert fatigue: what it is, why it happens, and how to fix it. For response procedures, see our on-call runbook guide.
What Is Alert Fatigue?
Alert fatigue occurs when the volume of alerts overwhelms responders, causing them to become desensitized. The result: slower response to real issues, missed critical alerts, and burned-out engineers.
The Symptoms
- Alerts acknowledged without investigation
- “I’ll look at it later” becoming default
- Same alerts firing for weeks unaddressed
- Engineers muting channels
- Real incidents discovered by users, not alerts
The Statistics
Commonly cited industry figures (exact numbers vary by study):
- Teams with 50+ daily alerts have 3x higher MTTR
- 70% of alerts in typical systems are noise
- On-call engineers with high alert volume have 2x burnout rate
- Alert fatigue contributes to 30% of major incident escalations
Why Alert Fatigue Happens
Cause 1: Alerting on Everything
# Bad: Alert on any error
- alert: AnyError
  expr: errors_total > 0
  for: 1m
Some errors are normal. Alerting on all of them creates noise.
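To see the difference, compare paging on "any error" with paging on an error *rate*. A minimal sketch with an invented minute-by-minute trace (the counts and the 1% threshold are illustrative):

```python
# Invented error counts per minute for a service doing ~1,000 req/min;
# a few errors per minute are normal background noise.
error_counts = [2, 0, 3, 1, 2, 40, 55, 48, 2, 1]
requests_per_min = 1000

# "errors_total > 0": pages on every non-zero minute
any_error_pages = sum(1 for e in error_counts if e > 0)
# "error rate > 1%": pages only when errors are a meaningful share of traffic
rate_pages = sum(1 for e in error_counts if e / requests_per_min > 0.01)

print(any_error_pages)  # 9 of 10 minutes page: pure noise
print(rate_pages)       # 3 minutes page: only the real spike
```

The rate-based condition fires only during the genuine spike; the "any error" condition pages almost every minute.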
Cause 2: Static Thresholds
# Bad: Same threshold regardless of context
- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m
80% CPU might be fine during peak hours but critical at 3 a.m.
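One mitigation is deriving the threshold from recent behavior instead of hardcoding it. A minimal sketch (the sample values and the three-sigma choice are illustrative, not a recommendation):

```python
import statistics

def adaptive_threshold(history, sigmas=3.0):
    """Threshold = mean of recent samples + N standard deviations."""
    return statistics.mean(history) + sigmas * statistics.stdev(history)

# Invented CPU% samples: ~70% is normal at peak, ~20% overnight.
peak_history = [68, 72, 70, 74, 69, 71]
night_history = [18, 22, 20, 19, 21, 20]

print(round(adaptive_threshold(peak_history), 1))   # 77.1: 75% at peak stays quiet
print(round(adaptive_threshold(night_history), 1))  # 24.2: 40% at 3 a.m. pages
```

The same 40% CPU reading is ignored at peak and pages overnight, which is exactly what a static `> 80` rule cannot express.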
Cause 3: Missing Deduplication
Ten pods failing the same way = ten identical alerts instead of one.
Cause 4: No Alert Ownership
Alerts without clear owners become everyone’s problem, which means no one’s problem.
Cause 5: Fear of Missing Issues
“Better safe than sorry” leads to alerting on everything.
Measuring Alert Fatigue
Alert Metrics to Track
# Alert frequency by type
curl -s http://alertmanager:9093/api/v1/alerts | \
jq '[.data[].labels.alertname] | group_by(.) | map({alert: .[0], count: length}) | sort_by(-.count)'
Key Indicators
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Alerts per on-call shift | < 10 | 10-25 | > 25 |
| % alerts requiring action | > 80% | 50-80% | < 50% |
| Avg time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Repeat alerts (same issue) | < 10% | 10-30% | > 30% |
Audit Your Alerts
# Export last month's alerts for analysis
curl -s "http://alertmanager:9093/api/v1/alerts?silenced=false&inhibited=false" | \
jq '.data[] | {alert: .labels.alertname, severity: .labels.severity, starts: .startsAt}' > alerts-audit.json
# Count by alert name
cat alerts-audit.json | jq -s 'group_by(.alert) | map({alert: .[0].alert, count: length}) | sort_by(-.count)'
The Cost of Alert Fatigue
Direct Costs
- Increased MTTR: Slow response to real issues
- Missed incidents: Critical alerts lost in noise
- On-call compensation: More pages = higher costs
Hidden Costs
- Burnout: Engineers leave or disengage
- Degraded trust: Teams stop trusting alerting systems
- Risk acceptance: “It always fires, it’s probably fine”
Calculating Impact
Alert Fatigue Cost = (False Positive Alerts × Response Time × Engineer Cost)
+ (Missed Incidents × Incident Cost)
+ (Burnout-Related Turnover × Replacement Cost)
The Path to Healthy Alerting
Step 1: Audit Current State
## Alert Audit Checklist
For each alert, answer:
- [ ] When did this last fire?
- [ ] Was action taken?
- [ ] Was the action necessary?
- [ ] Could we have not alerted?
- [ ] Is there a runbook?
Step 2: Classify Alerts
| Category | Description | Action |
|---|---|---|
| True positive + actionable | Real issue, action taken | Keep |
| True positive + not actionable | Real issue, no action possible | Remove or fix root cause |
| False positive | Not actually a problem | Fix or remove |
| Noise | Fires constantly, ignored | Remove |
Step 3: Fix or Remove
# Before: Noisy alert
- alert: HighMemory
expr: memory_usage > 80
for: 1m
# After: Actionable alert
- alert: HighMemory
expr: memory_usage > 90
for: 10m
labels:
severity: warning
annotations:
runbook: "Memory is high. Check for leaks or scale up."
Step 4: Establish Ownership
Every alert needs:
- An owning team
- A runbook
- A review date
Step 5: Continuous Improvement
## Weekly Alert Review
- [ ] Review all alerts from past week
- [ ] Identify any that were noise
- [ ] Fix or remove noisy alerts
- [ ] Update runbooks as needed
- [ ] Track alert count trend
Alert Fatigue Prevention
Design Principles
- Alert on symptoms, not causes: Users experience latency, not CPU spikes
- Alert on actionable conditions: If you can’t do anything, don’t alert
- Use appropriate severity: Not everything is critical
- Require runbooks: No runbook = no alert
Good vs Bad Alerts
| Bad Alert | Good Alert |
|---|---|
| CPU > 80% | Error rate > 1% affecting users |
| Any error logged | Error rate spike from baseline |
| Disk > 70% | Disk will fill in < 4 hours |
| Pod restarted | Pod restart loop (> 3 in 10 min) |
Stew’s Role in Reducing Alert Fatigue
Stew helps in two ways:
- Faster response: Executable runbooks mean less time per alert
- Better triage: Run diagnostics with a click to quickly determine if action is needed
When engineers can investigate quickly, they’re more likely to actually investigate.
Join the waitlist and build sustainable on-call practices.