# Preventing Alert Fatigue: Building Sustainable Alerting from Day One
It’s easier to prevent alert fatigue than to fix it. This guide covers how to build sustainable alerting practices from the start.
For measuring existing alerting health, see our alert fatigue metrics guide.
## The Alerting Philosophy

Before creating any alert, internalize these principles:

- **Alerts are for humans**: if a computer can handle it, automate it
- **Every alert interrupts**: someone's sleep, focus, or family time
- **False positives erode trust**: better to miss some issues than to cry wolf
- **Alerts without actions are noise**: no runbook = no alert
## Start with SLOs, Not Alerts

Define what matters before defining alerts.

### Step 1: Define Service Level Objectives

```yaml
# Example SLOs
slos:
  api:
    availability: 99.9%    # ~43 min downtime/month
    latency_p99: 500ms
    error_rate: 0.1%
  checkout:
    availability: 99.95%   # ~22 min downtime/month
    success_rate: 99%
```
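The downtime comments above follow directly from the SLO arithmetic. A minimal sketch, assuming a 30-day budget window:

```python
# Minimal sketch: translate an availability SLO into a monthly error budget.
# The 30-day window is an assumption; adjust for your budget period.

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * total_minutes

print(round(error_budget_minutes(99.9), 1))    # 43.2
print(round(error_budget_minutes(99.95), 1))   # 21.6
```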
### Step 2: Create SLO-Based Alerts

```yaml
# Alert when burning error budget too fast
- alert: APIErrorBudgetBurn
  expr: |
    (
      1 - (
        sum(rate(http_requests_total{status!~"5.."}[1h])) /
        sum(rate(http_requests_total[1h]))
      )
    ) > (1 - 0.999) * 14.4  # 14.4x burn rate exhausts a 30-day budget in ~2 days
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "API error budget burning fast"
    runbook_url: "/runbooks/error-budget-burn"
```
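The 14.4x multiplier is not arbitrary; it falls out of simple division. A sketch of the arithmetic, assuming a 30-day budget window as in the Google SRE multiwindow approach:

```python
# At a constant burn rate, time to exhaustion is just window / rate.
# 30-day window is an assumption; match it to your SLO period.

def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the error budget is fully spent at a constant burn rate."""
    return window_days / burn_rate

print(round(days_to_exhaustion(14.4), 2))  # 2.08 days: page-worthy
print(round(days_to_exhaustion(1.0), 2))   # 30.0 days: exactly on budget
```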
## The Alert Creation Checklist

Before creating any alert, answer these questions:

```markdown
## Alert Creation Checklist

### Necessity
- [ ] Does this alert indicate a real problem?
- [ ] Is human intervention required?
- [ ] Can this be auto-remediated instead?

### Actionability
- [ ] Is there a clear action to take?
- [ ] Does a runbook exist?
- [ ] Can the on-call engineer actually fix this?

### Quality
- [ ] Is the threshold appropriate?
- [ ] Has this been tested for false positives?
- [ ] Is the `for` duration long enough?

### Ownership
- [ ] Is there a clear owner for this alert?
- [ ] Is the owner subscribed to receive it?

### Documentation
- [ ] Is there a runbook linked?
- [ ] Is the summary clear and actionable?
- [ ] Are relevant metrics/logs linked?
```
## Alert Templates

Use templates to enforce good practices.

### Standard Alert Template

```yaml
# templates/standard-alert.yaml
- alert: {{ .Name }}
  expr: {{ .Expression }}
  for: {{ .Duration | default "5m" }}
  labels:
    severity: {{ .Severity }}
    team: {{ .Team }}
    service: {{ .Service }}
  annotations:
    summary: "{{ .Summary }}"
    description: "{{ .Description }}"
    runbook_url: "{{ .RunbookURL }}"
    dashboard_url: "{{ .DashboardURL }}"
    quick_check: "{{ .QuickCheck }}"
```
### Example Usage

```yaml
# Instantiated alert
- alert: CheckoutHighErrorRate
  expr: |
    rate(checkout_errors_total[5m]) / rate(checkout_requests_total[5m]) > 0.01
  for: 5m
  labels:
    severity: critical
    team: payments
    service: checkout
  annotations:
    summary: "Checkout error rate above 1%"
    description: "{{ $value | humanizePercentage }} of checkout requests are failing"
    runbook_url: "https://runbooks.internal/checkout-errors"
    dashboard_url: "https://grafana.internal/d/checkout"
    quick_check: "kubectl logs -l app=checkout --tail=50 | grep ERROR"
```

Note that `$value` here is a ratio, so `humanizePercentage` is used to render it as a percentage rather than formatting the raw ratio with a `%` suffix.
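A template only enforces good practices if something checks it. Below is a minimal sketch of a hypothetical CI lint (not part of any real tool) that rejects alert rules missing the fields the standard template expects; the field names mirror the template:

```python
# Hypothetical CI lint: flag alert rules that lack the labels and
# annotations the standard alert template requires.

REQUIRED_LABELS = {"severity", "team", "service"}
REQUIRED_ANNOTATIONS = {"summary", "runbook_url"}

def lint_alert(rule: dict) -> list[str]:
    """Return a list of problems found in a single alert rule dict."""
    problems = []
    for label in sorted(REQUIRED_LABELS - rule.get("labels", {}).keys()):
        problems.append(f"missing label: {label}")
    for ann in sorted(REQUIRED_ANNOTATIONS - rule.get("annotations", {}).keys()):
        problems.append(f"missing annotation: {ann}")
    return problems

rule = {
    "alert": "CheckoutHighErrorRate",
    "labels": {"severity": "critical", "team": "payments"},
    "annotations": {"summary": "Checkout error rate above 1%"},
}
print(lint_alert(rule))
# ['missing label: service', 'missing annotation: runbook_url']
```

Running a check like this in the CI pipeline that deploys alert rules turns the checklist above from advice into a gate.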
## Severity Level Guidelines

Define severity levels clearly and consistently.

### Severity Definitions

```markdown
## Severity Levels

### Critical (Page immediately)
- Service completely unavailable
- Data loss occurring
- Security breach active
- SLO violated (error budget exhausted)

**Response time**: < 5 minutes
**Escalation**: Immediate

### Warning (Respond within 1 hour)
- Service degraded but functional
- Approaching resource limits
- Error budget burning fast
- Non-critical component down

**Response time**: < 1 hour
**Escalation**: If not acknowledged in 30 minutes

### Info (Review next business day)
- Approaching soft limits
- Non-urgent maintenance needed
- Informational anomalies

**Response time**: Next business day
**Escalation**: None
```
### Routing by Severity

```yaml
# alertmanager.yml
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 5m
    - match:
        severity: warning
      receiver: slack-alerts
      repeat_interval: 1h
    - match:
        severity: info
      receiver: slack-info
      repeat_interval: 24h
```
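The routing above is first-match: an alert takes the first route whose matchers all agree with its labels, and falls through to the default receiver otherwise. A minimal sketch of that logic in plain Python (illustration only; Alertmanager does the real routing):

```python
# First-match routing, mirroring the alertmanager.yml routes above.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-alerts"),
    ({"severity": "info"}, "slack-info"),
]

def route(labels: dict, default: str = "default") -> str:
    """Return the receiver for the first route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default

print(route({"severity": "critical", "team": "payments"}))  # pagerduty-critical
print(route({"severity": "debug"}))                         # default
```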
## Threshold Guidelines

Avoid arbitrary thresholds.

### Bad Thresholds

```yaml
# Arbitrary numbers
- alert: HighCPU
  expr: cpu_usage > 80  # Why 80?

- alert: HighMemory
  expr: memory_usage > 85  # Why 85?
```
### Good Thresholds

```yaml
# Based on actual limits and behavior
- alert: MemoryApproachingLimit
  expr: |
    (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
  for: 10m
  annotations:
    summary: "Container at 90% of memory limit - will be OOM killed soon"

# Based on SLO
- alert: LatencyAboveSLO
  expr: |
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  annotations:
    summary: "P99 latency above 500ms SLO target"

# Based on anomaly detection
- alert: TrafficAnomaly
  expr: |
    abs(rate(requests_total[5m]) - avg_over_time(rate(requests_total[5m])[7d:1h]))
      > 2 * stddev_over_time(rate(requests_total[5m])[7d:1h])
  annotations:
    summary: "Traffic more than 2 standard deviations from normal"
```
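The TrafficAnomaly rule is a z-score test: flag the current rate if it sits more than two standard deviations from the trailing mean. The same check in plain Python, with a toy baseline:

```python
# Z-score anomaly check, mirroring the PromQL TrafficAnomaly expression.
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, n_sigma: float = 2.0) -> bool:
    """True if `current` deviates from the historical mean by > n_sigma stddevs."""
    return abs(current - mean(history)) > n_sigma * stdev(history)

baseline = [100, 102, 98, 101, 99, 100, 97, 103]  # mean 100, stddev 2
print(is_anomalous(baseline, 150))  # True: clear spike
print(is_anomalous(baseline, 101))  # False: within normal variation
```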
## The `for` Clause

The `for` duration prevents alerting on transient issues.

### Guidelines
| Issue Type | Recommended `for` Duration |
|---|---|
| Complete outage | 1-2 minutes |
| SLO violation | 5 minutes |
| Resource approaching limit | 10-15 minutes |
| Anomaly detection | 10-15 minutes |
| Capacity planning | 1 hour |
### Example

```yaml
# Transient spike: don't alert
# Sustained issue: alert
- alert: HighLatency
  expr: latency_p99 > 500
  for: 10m  # Must be elevated for 10 minutes before firing
```
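A simplified model of what the `for` clause buys you (the real evaluator tracks a pending state continuously; this sketch just checks that the condition held across consecutive evaluations):

```python
# Toy model of a `for` clause: the condition must be true for the last
# `for_evals` consecutive evaluations before the alert fires.

def fires(samples: list[bool], for_evals: int) -> bool:
    """True if the condition held for the final `for_evals` samples."""
    if len(samples) < for_evals:
        return False
    return all(samples[-for_evals:])

transient = [False, True, False, False, False]  # one-evaluation spike
sustained = [False, True, True, True, True]     # held for 4 evaluations

print(fires(transient, 3))  # False: spike resolved on its own
print(fires(sustained, 3))  # True: sustained, worth waking someone
```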
## Alert Testing

Test alerts before deploying.

### Unit Testing

```yaml
# tests/alerts_test.yaml
rule_files:
  - alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500"}'
        values: '0 10 20 30 40 50'
      - series: 'http_requests_total{status="200"}'
        values: '100 100 100 100 100 100'
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "Error rate above threshold"
```
### Dry Run in Production

```shell
# Validate rule syntax before deploying
promtool check rules alerts.yaml

# Test the alert expression against live data
curl -s "http://prometheus:9090/api/v1/query?query=YOUR_ALERT_EXPR" | jq '.data.result'
```
## Alert Lifecycle Management

Alerts need maintenance.

### Regular Reviews

```markdown
## Monthly Alert Review

### For each alert, verify:
- [ ] Still relevant (service exists, threshold makes sense)
- [ ] Runbook is current
- [ ] Owner is still correct
- [ ] Severity is appropriate
- [ ] Has fired in the last 90 days (if not, consider removing)
```
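The "has it fired in 90 days" check can be answered from Prometheus itself via the built-in `ALERTS` metric. A hedged sketch of building that query (the query string is the point; running it against your server is left out):

```python
# Build a PromQL query counting firings of a given alert over the review
# window, using Prometheus's built-in ALERTS metric. An alert returning no
# series from this query hasn't fired in the window: a removal candidate.

def firing_count_query(alertname: str, window: str = "90d") -> str:
    return (
        f'count_over_time(ALERTS{{alertname="{alertname}", '
        f'alertstate="firing"}}[{window}])'
    )

print(firing_count_query("HighLatency"))
# count_over_time(ALERTS{alertname="HighLatency", alertstate="firing"}[90d])
```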
### Deprecation Process

```yaml
# Mark alerts for removal
- alert: OldAlert
  expr: some_metric > threshold
  labels:
    deprecated: "true"
    removal_date: "2024-03-01"
  annotations:
    deprecation_reason: "Service being retired"
```
## Stew: Alerting Foundation
Good alerts need good runbooks. Stew ensures every alert has executable documentation:
- Runbooks linked directly to alerts
- One-click diagnostics
- Consistent response procedures
Build alerts and runbooks together from the start.
Join the waitlist and build sustainable alerting.