
Preventing Alert Fatigue: Building Sustainable Alerting from Day One

6 min read · Stew Team
alert-fatigue · alerting · best-practices · sre

It’s easier to prevent alert fatigue than to fix it. This guide covers how to build sustainable alerting practices from the start.

For measuring existing alerting health, see our alert fatigue metrics guide.

The Alerting Philosophy

Before creating any alert, internalize these principles:

  1. Alerts are for humans: If a computer can handle it, automate it
  2. Every alert interrupts: Someone’s sleep, focus, or family time
  3. False positives erode trust: Better to miss some issues than cry wolf
  4. Alerts without actions are noise: No runbook = no alert
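Principle 1 in practice: many restart-class failures can be remediated by the platform instead of a pager. As a sketch (assuming Kubernetes and a hypothetical `/healthz` endpoint), a liveness probe restarts a hung process with no human involved:

```yaml
# Hypothetical example: let the platform remediate instead of paging a human
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3  # restart after ~45s of consecutive failed checks
```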

Start with SLOs, Not Alerts

Define what matters before defining alerts.

Step 1: Define Service Level Objectives

# Example SLOs
slos:
  api:
    availability: 99.9%  # < 43 min downtime/month
    latency_p99: 500ms
    error_rate: 0.1%
  
  checkout:
    availability: 99.95%  # < 22 min downtime/month
    success_rate: 99%
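The downtime comments above follow directly from the availability targets. A quick sketch of the arithmetic, assuming a 30-day month:

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Allowed downtime per period for a given availability target."""
    return (1 - availability) * days * 24 * 60

# 99.9% allows ~43 min/month of downtime; 99.95% allows ~22 min/month
print(round(downtime_budget_minutes(0.999), 1))   # 43.2
print(round(downtime_budget_minutes(0.9995), 1))  # 21.6
```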

Step 2: Create SLO-Based Alerts

# Alert when burning error budget too fast
- alert: APIErrorBudgetBurn
  expr: |
    (
      1 - (
        sum(rate(http_requests_total{status!~"5.."}[1h])) /
        sum(rate(http_requests_total[1h]))
      )
    ) > (1 - 0.999) * 14.4  # 14.4x burn rate = 30-day budget exhausted in ~2 days
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "API error budget burning fast"
    runbook_url: "/runbooks/error-budget-burn"
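A single burn-rate window trades off detection speed against noise. A common refinement, per the multiwindow approach in the Google SRE Workbook, pairs a long and a short window so the alert fires only while the budget is still actively burning. A sketch, with the metric names and SLO carried over from the example above:

```yaml
# Sketch: multiwindow burn-rate alert (names assumed from the example above)
- alert: APIErrorBudgetBurnFast
  expr: |
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > (1 - 0.999) * 14.4
    and
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
    ) > (1 - 0.999) * 14.4
  labels:
    severity: critical
```

The 5-minute condition clears quickly once the incident is mitigated, so the alert stops re-paging even though the 1-hour average is still elevated.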

The Alert Creation Checklist

Before creating any alert, answer these questions:

## Alert Creation Checklist

### Necessity
- [ ] Does this alert indicate a real problem?
- [ ] Is human intervention required?
- [ ] Can this be auto-remediated instead?

### Actionability
- [ ] Is there a clear action to take?
- [ ] Does a runbook exist?
- [ ] Can the on-call engineer actually fix this?

### Quality
- [ ] Is the threshold appropriate?
- [ ] Has this been tested for false positives?
- [ ] Is the `for` duration long enough?

### Ownership
- [ ] Is there a clear owner for this alert?
- [ ] Is the owner subscribed to receive it?

### Documentation
- [ ] Is there a runbook linked?
- [ ] Is the summary clear and actionable?
- [ ] Are relevant metrics/logs linked?

Alert Templates

Use templates to enforce good practices.

Standard Alert Template

# templates/standard-alert.yaml
- alert: {{ .Name }}
  expr: {{ .Expression }}
  for: {{ .Duration | default "5m" }}
  labels:
    severity: {{ .Severity }}
    team: {{ .Team }}
    service: {{ .Service }}
  annotations:
    summary: "{{ .Summary }}"
    description: "{{ .Description }}"
    runbook_url: "{{ .RunbookURL }}"
    dashboard_url: "{{ .DashboardURL }}"
    quick_check: "{{ .QuickCheck }}"

Example Usage

# Instantiated alert
- alert: CheckoutHighErrorRate
  expr: |
    rate(checkout_errors_total[5m]) / rate(checkout_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
    team: payments
    service: checkout
  annotations:
    summary: "Checkout error rate above 1%"
    description: "{{ $value | humanizePercentage }} of checkout requests are failing"
    runbook_url: "https://runbooks.internal/checkout-errors"
    dashboard_url: "https://grafana.internal/d/checkout"
    quick_check: "kubectl logs -l app=checkout --tail=50 | grep ERROR"

Severity Level Guidelines

Define severity levels clearly and consistently.

Severity Definitions

## Severity Levels

### Critical (Page immediately)
- Service completely unavailable
- Data loss occurring
- Security breach active
- SLO violated (error budget exhausted)

**Response time**: < 5 minutes
**Escalation**: Immediate

### Warning (Respond within 1 hour)
- Service degraded but functional
- Approaching resource limits
- Error budget burning fast
- Non-critical component down

**Response time**: < 1 hour
**Escalation**: If not acknowledged in 30 minutes

### Info (Review next business day)
- Approaching soft limits
- Non-urgent maintenance needed
- Informational anomalies

**Response time**: Next business day
**Escalation**: None

Routing by Severity

# alertmanager.yml
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 5m
      
    - match:
        severity: warning
      receiver: slack-alerts
      repeat_interval: 1h
      
    - match:
        severity: info
      receiver: slack-info
      repeat_interval: 24h

Threshold Guidelines

Avoid arbitrary thresholds.

Bad Thresholds

# Arbitrary numbers
- alert: HighCPU
  expr: cpu_usage > 80  # Why 80?
  
- alert: HighMemory
  expr: memory_usage > 85  # Why 85?

Good Thresholds

# Based on actual limits and behavior
- alert: MemoryApproachingLimit
  expr: |
    (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
  for: 10m
  annotations:
    summary: "Container at 90% of memory limit - will be OOM killed soon"

# Based on SLO
- alert: LatencyAboveSLO
  expr: |
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "P99 latency above 500ms SLO target"

# Based on anomaly detection
- alert: TrafficAnomaly
  expr: |
    abs(rate(requests_total[5m]) - avg_over_time(rate(requests_total[5m])[7d:1h])) 
    > 2 * stddev_over_time(rate(requests_total[5m])[7d:1h])
  for: 15m
  annotations:
    summary: "Traffic more than 2 standard deviations from its 7-day baseline"
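The same 2-sigma rule is easy to sanity-check offline. A minimal sketch (with hypothetical request-rate samples) flags a value as anomalous when it falls more than two standard deviations from the baseline mean:

```python
from statistics import mean, pstdev

def is_anomalous(current: float, baseline: list[float], sigmas: float = 2.0) -> bool:
    """Flag values more than `sigmas` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sd = pstdev(baseline)
    return abs(current - mu) > sigmas * sd

baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # hypothetical req/s samples
print(is_anomalous(150, baseline))  # True: far outside the band
print(is_anomalous(101, baseline))  # False: within normal variation
```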

The For Clause

The `for` duration prevents paging on transient issues that resolve on their own.

Guidelines

| Issue type | Recommended `for` duration |
| --- | --- |
| Complete outage | 1-2 minutes |
| SLO violation | 5 minutes |
| Resource approaching limit | 10-15 minutes |
| Anomaly detection | 10-15 minutes |
| Capacity planning | 1 hour |

Example

# Transient spike: Don't alert
# Sustained issue: Alert
- alert: HighLatency
  expr: latency_p99 > 500
  for: 10m  # Must be elevated for 10 minutes
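The effect of the `for` clause can be simulated directly: the alert fires only once the condition has held for the full duration. A sketch with hypothetical one-sample-per-minute latency readings:

```python
def fires(samples: list[float], threshold: float, for_minutes: int) -> bool:
    """True if the condition held for `for_minutes` consecutive samples (1/min)."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_minutes:
            return True
    return False

spike     = [400, 700, 650, 400, 420, 410, 400, 405, 400, 410, 400, 400]
sustained = [400] + [600] * 11

print(fires(spike, 500, 10))      # False: transient spike, no page
print(fires(sustained, 500, 10))  # True: elevated for 10+ minutes
```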

Alert Testing

Test alerts before deploying.

Unit Testing

# tests/alerts_test.yaml
rule_files:
  - alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: '0 10 20 30 40 50'
      - series: 'http_requests_total{status="200"}'
        values: '100 100 100 100 100 100'
    
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Error rate above threshold"

Dry Run in Production

# Validate rule syntax
promtool check rules alerts.yaml

# Run the unit tests defined above
promtool test rules tests/alerts_test.yaml

# Evaluate the alert expression against live data without firing
curl -s "http://prometheus:9090/api/v1/query?query=YOUR_ALERT_EXPR" | jq '.data.result'

Alert Lifecycle Management

Alerts need maintenance.

Regular Reviews

## Monthly Alert Review

### For each alert, verify:
- [ ] Still relevant (service exists, threshold makes sense)
- [ ] Runbook is current
- [ ] Owner is still correct
- [ ] Severity is appropriate
- [ ] Has fired in last 90 days (if not, consider removing)

Deprecation Process

# Mark alerts for removal
- alert: OldAlert
  expr: some_metric > threshold
  labels:
    deprecated: "true"
    removal_date: "2024-03-01"
  annotations:
    deprecation_reason: "Service being retired"
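The `removal_date` label makes cleanup scriptable. A hedged sketch (rule structure assumed from the example above) that lists deprecated alerts whose removal date has passed:

```python
from datetime import date

def due_for_removal(rules: list[dict], today: date) -> list[str]:
    """Names of deprecated alerts whose removal_date has passed."""
    due = []
    for rule in rules:
        labels = rule.get("labels", {})
        if labels.get("deprecated") == "true":
            if date.fromisoformat(labels["removal_date"]) <= today:
                due.append(rule["alert"])
    return due

rules = [
    {"alert": "OldAlert", "labels": {"deprecated": "true", "removal_date": "2024-03-01"}},
    {"alert": "CurrentAlert", "labels": {"severity": "warning"}},
]
print(due_for_removal(rules, date(2024, 3, 15)))  # ['OldAlert']
```

Run as part of the monthly review to keep the rule files from accumulating dead alerts.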

Stew: Alerting Foundation

Good alerts need good runbooks. Stew ensures every alert has executable documentation:

  • Runbooks linked directly to alerts
  • One-click diagnostics
  • Consistent response procedures

Build alerts and runbooks together from the start.

Join the waitlist and build sustainable alerting.