
10 Strategies to Reduce Alert Fatigue in Your SRE Team

5 min read · Stew Team

Tags: alert-fatigue · alerting · sre · strategies

Alert fatigue is solvable. These ten strategies have proven effective across SRE teams of all sizes.

For background on what causes alert fatigue, see our alert fatigue guide.

Strategy 1: Alert on Symptoms, Not Causes

Users experience symptoms (slow page, errors), not causes (high CPU, full disk).

Before

# Cause-based: Alerts on internal metrics
- alert: HighCPU
  expr: node_cpu_usage > 80
- alert: HighMemory
  expr: node_memory_usage > 85
- alert: HighDiskIO
  expr: node_disk_io_time > 0.9

After

# Symptom-based: Alerts on user impact
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "User-facing latency degraded"

- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
  for: 5m
  annotations:
    summary: "Users experiencing errors"
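The error-rate expression above is just errors divided by requests over a window, compared to a 1% threshold. A minimal Python sketch (counts are illustrative):

```python
# Sketch: the HighErrorRate expression in plain Python.
# Prometheus computes this for you; numbers here are made up.

def error_rate(errors_in_window, requests_in_window):
    """Fraction of requests in the window that errored."""
    if requests_in_window == 0:
        return 0.0
    return errors_in_window / requests_in_window

print(error_rate(150, 10000))  # 0.015 -> above the 1% threshold, alert
print(error_rate(50, 10000))   # 0.005 -> below threshold, no alert
```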

Strategy 2: Use Dynamic Thresholds

Static thresholds don’t account for normal variation.

Before

# Static: Same threshold always
- alert: HighTraffic
  expr: requests_per_second > 1000

After

# Dynamic: Compare to historical baseline
- alert: TrafficAnomaly
  expr: |
    requests_per_second > 
    (avg_over_time(requests_per_second[7d:1h]) * 1.5)
  for: 10m
  annotations:
    summary: "Traffic 50% above weekly average"
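The same comparison can be sketched in plain Python: take a trailing baseline and alert only when current traffic exceeds it by 50%. The function and sample data below are illustrative, not a real API; Prometheus evaluates the subquery for you.

```python
# Sketch: dynamic threshold = trailing weekly baseline * 1.5.
# Hypothetical data: one averaged sample per hour for 7 days.

def traffic_anomaly(current_rps, weekly_hourly_samples, factor=1.5):
    """True when current traffic exceeds the historical baseline * factor."""
    baseline = sum(weekly_hourly_samples) / len(weekly_hourly_samples)
    return current_rps > baseline * factor

normal_week = [1000] * 168  # 7 days x 24 hourly averages around 1000 rps
print(traffic_anomaly(1600, normal_week))  # True  - 60% above baseline
print(traffic_anomaly(1200, normal_week))  # False - within normal variation
```

Note the static rule at 1000 rps would have fired on both values; the dynamic version only fires on the genuine outlier.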

Strategy 3: Increase Alert Duration

Brief spikes often resolve themselves.

Before

# Triggers on brief spikes
- alert: HighLatency
  expr: latency_p99 > 500
  for: 1m

After

# Requires sustained issue
- alert: HighLatency
  expr: latency_p99 > 500
  for: 10m
  annotations:
    summary: "Sustained latency issue - not a brief spike"
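What `for: 10m` does, in effect, is require the condition to hold for every evaluation in the window before firing. A rough Python sketch of that debounce (names and data are illustrative; Prometheus implements this internally):

```python
# Sketch: fire only when the last N samples all breach the threshold.

def should_fire(samples, threshold, required_consecutive):
    """True only if the condition held for N consecutive evaluations."""
    recent = samples[-required_consecutive:]
    return (len(recent) == required_consecutive
            and all(s > threshold for s in recent))

spike = [480, 520, 490, 470]           # brief spike, then recovery
sustained = [510, 530, 520, 540, 525]  # stays above 500 the whole time

print(should_fire(spike, 500, 3))      # False - not sustained
print(should_fire(sustained, 500, 3))  # True
```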

Strategy 4: Implement Alert Grouping

Related alerts should group into one notification.

Alertmanager Configuration

# alertmanager.yml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  routes:
    - match:
        severity: critical
      group_wait: 10s
      repeat_interval: 1h

Result

Instead of:

  • “Pod api-1 high memory”
  • “Pod api-2 high memory”
  • “Pod api-3 high memory”

You get:

  • “High memory: api pods (3 affected)”
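The grouping logic itself is simple: bucket alerts by the `group_by` labels and emit one notification per bucket. A minimal sketch (data and summary format are illustrative):

```python
# Sketch: collapse alerts sharing (alertname, service) into one
# notification, like Alertmanager's group_by.

from collections import defaultdict

def group_alerts(alerts, keys=("alertname", "service")):
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert[k] for k in keys)].append(alert)
    return {
        f"{name}: {service} pods ({len(members)} affected)": members
        for (name, service), members in groups.items()
    }

alerts = [
    {"alertname": "HighMemory", "service": "api", "pod": "api-1"},
    {"alertname": "HighMemory", "service": "api", "pod": "api-2"},
    {"alertname": "HighMemory", "service": "api", "pod": "api-3"},
]
print(list(group_alerts(alerts)))  # ['HighMemory: api pods (3 affected)']
```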

Strategy 5: Add Inhibition Rules

Don’t alert on symptoms when the cause is already alerting.

# alertmanager.yml
inhibit_rules:
  # If database is down, don't alert on services that depend on it
  - source_match:
      alertname: DatabaseDown
    target_match:
      dependency: database
    equal: ['environment']

  # If cluster is having issues, don't alert on individual pods
  - source_match:
      alertname: ClusterUnhealthy
    target_match:
      scope: pod
    equal: ['cluster']
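Conceptually, inhibition drops a target alert when a matching source alert is firing and the `equal` labels line up. A sketch of that logic in Python (field names mirror the config above, but the code is illustrative):

```python
# Sketch: suppress target alerts while a matching source alert fires
# and the `equal` labels agree, like Alertmanager inhibit_rules.

def inhibit(alerts, source_match, target_match, equal):
    sources = [a for a in alerts
               if all(a.get(k) == v for k, v in source_match.items())]

    def suppressed(alert):
        if not all(alert.get(k) == v for k, v in target_match.items()):
            return False
        return any(all(alert.get(k) == s.get(k) for k in equal)
                   for s in sources)

    return [a for a in alerts if not suppressed(a)]

alerts = [
    {"alertname": "DatabaseDown", "environment": "prod"},
    {"alertname": "APIErrors", "dependency": "database", "environment": "prod"},
    {"alertname": "APIErrors", "dependency": "database", "environment": "staging"},
]
remaining = inhibit(alerts,
                    source_match={"alertname": "DatabaseDown"},
                    target_match={"dependency": "database"},
                    equal=["environment"])
# Prod APIErrors is suppressed (DatabaseDown is firing in prod);
# staging still fires because its environment label differs.
print([a["alertname"] for a in remaining])  # ['DatabaseDown', 'APIErrors']
```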

Strategy 6: Implement Severity Levels Properly

Not everything needs to wake someone up.

Severity Definitions

| Severity | Response                 | Example            |
|----------|--------------------------|--------------------|
| Critical | Page immediately         | Service down       |
| Warning  | Check within 1 hour      | Elevated errors    |
| Info     | Review next business day | Approaching limits |

Routing by Severity

# Only page for critical
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
    
  - match:
      severity: warning
    receiver: slack-alerts
    
  - match:
      severity: info
    receiver: slack-info
    repeat_interval: 24h

Strategy 7: Require Runbooks for All Alerts

No runbook = no action possible = no alert needed.

Alert Template

- alert: ServiceHighLatency
  expr: latency_p99 > 500
  for: 5m
  labels:
    severity: warning
    team: api-team
  annotations:
    summary: "{{ $labels.service }} latency elevated"
    runbook_url: "https://runbooks.internal/high-latency"
    quick_check: "kubectl logs -l app={{ $labels.service }} --tail=50 | grep -i slow"

Enforcement

# CI check: all alerts must have runbook_url
import yaml  # PyYAML

def validate_alerts(path):
    with open(path) as f:
        rules = yaml.safe_load(f)
    for group in rules.get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have no "alert" key; skip those
            if "alert" in rule and "runbook_url" not in rule.get("annotations", {}):
                raise ValueError(f"Alert {rule['alert']} is missing runbook_url")

Strategy 8: Deduplicate Across Systems

Prevent the same issue from generating alerts in multiple tools.

Common Duplicates

  • Prometheus + Datadog both alerting on same metric
  • APM tool + logs both detecting same errors
  • Infrastructure + application both noticing same outage

Solution: Single Source of Truth

# Designate one system as authoritative for each alert type
alert_ownership:
  infrastructure: prometheus
  application_errors: sentry
  user_experience: synthetic_monitoring
  business_metrics: datadog
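Enforcing that map can be as simple as dropping any alert whose reporting system doesn't own its type. A sketch (record shapes are illustrative):

```python
# Sketch: keep only alerts from the system that owns that alert type,
# per the ownership map above. Field names are illustrative.

ALERT_OWNERSHIP = {
    "infrastructure": "prometheus",
    "application_errors": "sentry",
    "user_experience": "synthetic_monitoring",
    "business_metrics": "datadog",
}

def deduplicate(alerts):
    """Discard alerts reported by a non-authoritative system."""
    return [a for a in alerts
            if ALERT_OWNERSHIP.get(a["type"]) == a["source"]]

alerts = [
    {"type": "infrastructure", "source": "prometheus", "name": "NodeDown"},
    {"type": "infrastructure", "source": "datadog", "name": "NodeDown"},  # duplicate
]
print([a["source"] for a in deduplicate(alerts)])  # ['prometheus']
```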

Strategy 9: Implement Alert Reviews

Regular reviews prevent alert debt accumulation.

Weekly Review Checklist

## Weekly Alert Review

### Metrics
- Total alerts this week: ____
- % requiring action: ____
- Noisiest alert: ____

### Actions
- [ ] Review top 5 most frequent alerts
- [ ] Fix or remove any that are noise
- [ ] Update thresholds if needed
- [ ] Ensure runbooks are current

### Decisions
| Alert | Action | Owner |
|-------|--------|-------|
| | | |

Monthly Audit

# Generate alert report
cat alerts.json | jq '
  group_by(.labels.alertname) | 
  map({
    alert: .[0].labels.alertname,
    count: length,
    action_rate: ([.[] | select(.action_taken)] | length) / length
  }) | 
  sort_by(.action_rate) |
  .[:10]
'

Strategy 10: Automate Common Responses

If the response is always the same, automate it.

Candidates for Automation

| Alert         | Manual Response | Automated Response       |
|---------------|-----------------|--------------------------|
| High memory   | Restart pod     | Auto-restart with limits |
| Cert expiring | Renew cert      | Auto-renewal             |
| Disk filling  | Clear old files | Automated cleanup job    |
| Pod crash     | Restart pod     | Kubernetes does this     |

Example: Auto-Remediation

# Kubernetes: Auto-restart unhealthy pods
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 3
  periodSeconds: 10

Alert only when auto-remediation fails:

- alert: PodCrashLoop
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 5
  annotations:
    summary: "Pod restarting repeatedly - auto-restart not fixing issue"

Measuring Improvement

Track these metrics over time:

## Alert Fatigue Dashboard

### This Week vs Last Week
- Total alerts: ____ → ____
- Alerts per on-call shift: ____ → ____
- % actionable: ____% → ____%
- Avg acknowledge time: ____ → ____

### Trend (4 weeks)
[Graph showing improvement over time]

### Noisiest Alerts (candidates for removal)
1. ____ (fired X times, Y% actioned)
2. ____ (fired X times, Y% actioned)
3. ____ (fired X times, Y% actioned)
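Filling that dashboard in can be automated from your alert records. A minimal sketch computing the headline numbers (record shape and data are illustrative):

```python
# Sketch: compute alert-fatigue metrics from a list of alert records.
# The "action_taken" field is a hypothetical per-alert annotation.

def fatigue_metrics(alerts):
    total = len(alerts)
    actionable = sum(1 for a in alerts if a["action_taken"])
    return {
        "total": total,
        "pct_actionable": round(100 * actionable / total, 1) if total else 0.0,
    }

# Synthetic weeks: fewer alerts, higher signal = real improvement.
last_week = [{"action_taken": i % 5 == 0} for i in range(200)]
this_week = [{"action_taken": i % 2 == 0} for i in range(80)]
print(fatigue_metrics(last_week))  # {'total': 200, 'pct_actionable': 20.0}
print(fatigue_metrics(this_week))  # {'total': 80, 'pct_actionable': 50.0}
```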

Stew: Faster Alert Triage

Stew reduces the impact of each alert:

  • Run diagnostics with one click
  • See output immediately
  • Quickly determine if action needed
  • Less time per alert = less fatigue

Join the waitlist and build sustainable alerting.