# 10 Strategies to Reduce Alert Fatigue in Your SRE Team

Alert fatigue is solvable. These ten strategies have proven effective across SRE teams of all sizes.

For background on what alert fatigue is and why it happens, see our alert fatigue guide.
## Strategy 1: Alert on Symptoms, Not Causes

Users experience symptoms (slow pages, errors), not causes (high CPU, a full disk). Alerting on symptoms means every page corresponds to real user impact.

### Before

```yaml
# Cause-based: alerts on internal metrics
- alert: HighCPU
  expr: node_cpu_usage > 80
- alert: HighMemory
  expr: node_memory_usage > 85
- alert: HighDiskIO
  expr: node_disk_io_time > 0.9
```

### After

```yaml
# Symptom-based: alerts on user impact
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "User-facing latency degraded"
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m
  annotations:
    summary: "Users experiencing errors"
```
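The HighErrorRate rule divides an error rate by a request rate. As a sanity check, here is a minimal Python sketch of the same arithmetic over a single 5-minute window; the counter values are made up for illustration.

```python
# Sketch of what the HighErrorRate expression computes: the ratio of new
# errors to new requests over a window, compared against a 1% threshold.
# The counter values below are illustrative, not real data.

def error_rate(errors_start, errors_end, requests_start, requests_end):
    """Ratio of new errors to new requests over a window."""
    delta_requests = requests_end - requests_start
    if delta_requests == 0:
        return 0.0
    return (errors_end - errors_start) / delta_requests

rate = error_rate(errors_start=120, errors_end=150,
                  requests_start=10_000, requests_end=12_000)
print(rate)         # 30 errors / 2000 requests = 0.015
print(rate > 0.01)  # above the 1% threshold, so the alert would fire
```

Thirty new errors against two thousand new requests is a 1.5% error rate, so this window would trip the 1% threshold.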
## Strategy 2: Use Dynamic Thresholds

Static thresholds don't account for normal variation such as daily and weekly traffic cycles.

### Before

```yaml
# Static: the same threshold at all times
- alert: HighTraffic
  expr: requests_per_second > 1000
```

### After

```yaml
# Dynamic: compare to a historical baseline
- alert: TrafficAnomaly
  expr: |
    requests_per_second >
    (avg_over_time(requests_per_second[7d:1h]) * 1.5)
  for: 10m
  annotations:
    summary: "Traffic 50% above weekly average"
```
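The same idea can be sketched outside PromQL: flag traffic only when it exceeds 1.5x the average of a historical baseline. The sample numbers below are hypothetical; in practice the baseline comes from your metrics store.

```python
# Hedged sketch of the dynamic-threshold idea: compare current traffic
# to a multiple of the historical average instead of a fixed number.
from statistics import mean

def traffic_anomaly(current_rps, baseline_rps, factor=1.5):
    """True when current traffic exceeds `factor` times the baseline average."""
    return current_rps > mean(baseline_rps) * factor

# Hypothetical hourly averages over the past week; mean is roughly 914 rps
weekly_hourly_avgs = [800, 950, 900, 1020, 880, 940, 910]

print(traffic_anomaly(1200, weekly_hourly_avgs))  # False: below ~1371 rps
print(traffic_anomaly(1600, weekly_hourly_avgs))  # True: above the dynamic threshold
```

Note that 1200 rps would have fired the static `> 1000` rule above, but stays silent here because it is within normal variation for this service.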
## Strategy 3: Increase Alert Duration

Brief spikes often resolve themselves; the `for` clause makes an alert fire only when the condition is sustained.

### Before

```yaml
# Triggers on brief spikes
- alert: HighLatency
  expr: latency_p99 > 500
  for: 1m
```

### After

```yaml
# Requires a sustained issue
- alert: HighLatency
  expr: latency_p99 > 500
  for: 10m
  annotations:
    summary: "Sustained latency issue - not a brief spike"
```
## Strategy 4: Implement Alert Grouping

Related alerts should be grouped into one notification.

### Alertmanager Configuration

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      group_wait: 10s
      repeat_interval: 1h
```

### Result

Instead of:

- "Pod api-1 high memory"
- "Pod api-2 high memory"
- "Pod api-3 high memory"

You get:

- "High memory: api pods (3 affected)"
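What `group_by: ['alertname', 'service']` does can be illustrated with a few lines of Python: alerts sharing those two labels collapse into a single notification. The alert payloads here are hypothetical.

```python
# Illustrative sketch of label-based grouping: alerts with the same
# (alertname, service) pair become one notification.
from collections import defaultdict

alerts = [  # hypothetical firing alerts
    {"alertname": "HighMemory", "service": "api", "pod": "api-1"},
    {"alertname": "HighMemory", "service": "api", "pod": "api-2"},
    {"alertname": "HighMemory", "service": "api", "pod": "api-3"},
]

groups = defaultdict(list)
for a in alerts:
    groups[(a["alertname"], a["service"])].append(a)

for (name, service), members in groups.items():
    print(f"{name}: {service} pods ({len(members)} affected)")
# prints a single line: HighMemory: api pods (3 affected)
```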
## Strategy 5: Add Inhibition Rules

Don't alert on symptoms when the cause is already alerting.

```yaml
# alertmanager.yml
inhibit_rules:
  # If the database is down, don't alert on services that depend on it
  - source_match:
      alertname: DatabaseDown
    target_match:
      dependency: database
    equal: ['environment']
  # If the cluster is having issues, don't alert on individual pods
  - source_match:
      alertname: ClusterUnhealthy
    target_match:
      scope: pod
    equal: ['cluster']
```
## Strategy 6: Implement Severity Levels Properly

Not everything needs to wake someone up.

### Severity Definitions

| Severity | Response | Example |
|---|---|---|
| Critical | Page immediately | Service down |
| Warning | Check within 1 hour | Elevated errors |
| Info | Review next business day | Approaching limits |

### Routing by Severity

```yaml
# Only page for critical
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-alerts
  - match:
      severity: info
    receiver: slack-info
    repeat_interval: 24h
```
## Strategy 7: Require Runbooks for All Alerts

No runbook = no action possible = no alert needed.

### Alert Template

```yaml
- alert: ServiceHighLatency
  expr: latency_p99 > 500
  for: 5m
  labels:
    severity: warning
    team: api-team
  annotations:
    summary: "{{ $labels.service }} latency elevated"
    runbook_url: "https://runbooks.internal/high-latency"
    quick_check: "kubectl logs -l app={{ $labels.service }} --tail=50 | grep -i slow"
```
### Enforcement

```python
# CI check: every alerting rule must have a runbook_url annotation
import yaml

def validate_alerts(rules_path):
    with open(rules_path) as f:
        rules = yaml.safe_load(f)
    for group in rules.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # skip recording rules
            if "runbook_url" not in rule.get("annotations", {}):
                raise ValueError(f"Alert {rule['alert']} is missing runbook_url")
```
## Strategy 8: Deduplicate Across Systems

Prevent the same issue from generating alerts in multiple tools.

### Common Duplicates

- Prometheus and Datadog both alerting on the same metric
- An APM tool and logs both detecting the same errors
- Infrastructure and application monitoring both noticing the same outage

### Solution: Single Source of Truth

```yaml
# Designate one system as authoritative for each alert type
alert_ownership:
  infrastructure: prometheus
  application_errors: sentry
  user_experience: synthetic_monitoring
  business_metrics: datadog
```
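One way to enforce an ownership table like this is at ingestion time: drop any alert that comes from a system that does not own its category. This is a hedged sketch, not a real integration; the alert shapes and category names are assumptions mirroring the YAML above.

```python
# Sketch of enforcing a single source of truth per alert category:
# keep only alerts emitted by the designated owner of their category.

ALERT_OWNERSHIP = {
    "infrastructure": "prometheus",
    "application_errors": "sentry",
    "user_experience": "synthetic_monitoring",
    "business_metrics": "datadog",
}

def is_authoritative(alert):
    """True when the alert's source owns the alert's category."""
    return ALERT_OWNERSHIP.get(alert["category"]) == alert["source"]

incoming = [  # hypothetical alerts for the same incident from two tools
    {"category": "application_errors", "source": "sentry", "name": "CheckoutErrors"},
    {"category": "application_errors", "source": "datadog", "name": "CheckoutErrors"},
]
kept = [a for a in incoming if is_authoritative(a)]
print([a["source"] for a in kept])  # ['sentry'] - the Datadog duplicate is dropped
```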
## Strategy 9: Implement Alert Reviews

Regular reviews prevent alert debt from accumulating.

### Weekly Review Checklist

```markdown
## Weekly Alert Review

### Metrics
- Total alerts this week: ____
- % requiring action: ____
- Noisiest alert: ____

### Actions
- [ ] Review the top 5 most frequent alerts
- [ ] Fix or remove any that are noise
- [ ] Update thresholds if needed
- [ ] Ensure runbooks are current

### Decisions
| Alert | Action | Owner |
|-------|--------|-------|
|       |        |       |
```
### Monthly Audit

```sh
# Generate an alert report: fire counts and action rates per alert,
# sorted so the least-actionable alerts come first
jq '
  group_by(.labels.alertname) |
  map({
    alert: .[0].labels.alertname,
    count: length,
    action_rate: ([.[] | select(.action_taken)] | length) / length
  }) |
  sort_by(.action_rate) |
  .[:10]
' alerts.json
```
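For teams without jq handy, the same report can be produced in a few lines of Python. This sketch assumes the same data shape as the jq version: a list of alert events, each with a `labels.alertname` field and an `action_taken` flag.

```python
# Python equivalent of the jq audit report: per-alert fire counts and
# action rates, lowest action rate (noisiest) first.
from collections import defaultdict

def alert_report(events, limit=10):
    by_name = defaultdict(list)
    for e in events:
        by_name[e["labels"]["alertname"]].append(e)
    rows = [
        {
            "alert": name,
            "count": len(group),
            "action_rate": sum(1 for e in group if e.get("action_taken")) / len(group),
        }
        for name, group in by_name.items()
    ]
    # Lowest action rate first: these are the prime candidates for removal
    return sorted(rows, key=lambda r: r["action_rate"])[:limit]

events = [  # hypothetical alert log
    {"labels": {"alertname": "HighCPU"}, "action_taken": False},
    {"labels": {"alertname": "HighCPU"}, "action_taken": False},
    {"labels": {"alertname": "HighErrorRate"}, "action_taken": True},
]
print(alert_report(events))  # HighCPU surfaces first with action_rate 0.0
```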
## Strategy 10: Automate Common Responses

If the response is always the same, automate it.

### Candidates for Automation

| Alert | Manual Response | Automated Response |
|---|---|---|
| High memory | Restart pod | Auto-restart with limits |
| Cert expiring | Renew cert | Auto-renewal |
| Disk filling | Clear old files | Automated cleanup job |
| Pod crash | Restart pod | Kubernetes does this |

### Example: Auto-Remediation

```yaml
# Kubernetes: auto-restart unhealthy containers
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 3
  periodSeconds: 10
```

Alert only when auto-remediation fails:

```yaml
- alert: PodCrashLoop
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 5
  annotations:
    summary: "Pod restarting repeatedly - auto-restart is not fixing the issue"
```
## Measuring Improvement

Track these metrics over time:

```markdown
## Alert Fatigue Dashboard

### This Week vs Last Week
- Total alerts: ____ → ____
- Alerts per on-call shift: ____ → ____
- % actionable: ____% → ____%
- Avg acknowledge time: ____ → ____

### Trend (4 weeks)
[Graph showing improvement over time]

### Noisiest Alerts (candidates for removal)
1. ____ (fired X times, Y% actioned)
2. ____ (fired X times, Y% actioned)
3. ____ (fired X times, Y% actioned)
```
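The week-over-week numbers for a dashboard like this are simple to compute from an alert log. This is a minimal sketch under the assumption that each alert event records whether action was taken; the event counts are invented for illustration.

```python
# Sketch of computing weekly alert-fatigue metrics from an alert log.

def week_summary(events, shifts):
    """Totals, alerts per on-call shift, and percent actionable for one week."""
    total = len(events)
    actionable = sum(1 for e in events if e.get("action_taken"))
    return {
        "total": total,
        "per_shift": total / shifts if shifts else 0,
        "pct_actionable": 100 * actionable / total if total else 0,
    }

# Hypothetical data: 40 alerts (10% actionable) improving to 20 (40% actionable)
last_week = [{"action_taken": False}] * 36 + [{"action_taken": True}] * 4
this_week = [{"action_taken": False}] * 12 + [{"action_taken": True}] * 8

print(week_summary(last_week, shifts=2))  # 40 total, 20 per shift, 10% actionable
print(week_summary(this_week, shifts=2))  # 20 total, 10 per shift, 40% actionable
```

Fewer total alerts with a higher actionable percentage is the trend these strategies aim for.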
## Stew: Faster Alert Triage

Stew reduces the effort each alert demands:

- Run diagnostics with one click
- See the output immediately
- Quickly determine whether action is needed
- Less time per alert means less fatigue

Join the waitlist and build sustainable alerting.