Measuring Alert Fatigue: Metrics and KPIs for Healthy Alerting
You can’t improve what you don’t measure. These metrics help you understand and reduce alert fatigue.
For ways to act on these numbers, see our guide to alert fatigue reduction strategies.
Core Alert Metrics
1. Alert Volume
The raw count of alerts over time.
# Daily alert count (Alertmanager v2 API; the v1 API was removed in v0.27)
# Note: this endpoint only returns currently active alerts
curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq '[.[] | select(.startsAt | sub("\\.[0-9]+"; "") | fromdateiso8601 > (now - 86400))] | length'
Targets
| Team Size | Daily Alerts | Weekly Alerts |
|---|---|---|
| Small (1-3) | < 10 | < 50 |
| Medium (4-8) | < 25 | < 125 |
| Large (9+) | < 50 | < 250 |
2. Alerts Per On-Call Shift
More actionable than raw volume.
# Alerts per 12-hour shift over the last 30 days (Alertmanager v2 API;
# only currently active alerts are visible here, so use your paging
# tool's history for an accurate 30-day count)
TOTAL_ALERTS=$(curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq '[.[] | select(.startsAt | sub("\\.[0-9]+"; "") | fromdateiso8601 > (now - 2592000))] | length')
SHIFTS=$((30 * 2))  # 2 shifts per day for 30 days
echo "Alerts per shift: $((TOTAL_ALERTS / SHIFTS))"
Targets
| Rating | Alerts Per Shift | On-Call Experience |
|---|---|---|
| Excellent | < 5 | Sustainable long-term |
| Good | 5-10 | Manageable |
| Warning | 10-20 | Starting to fatigue |
| Critical | 20+ | Unsustainable |
3. Signal-to-Noise Ratio
Percentage of alerts that required action.
# Requires tracking action_taken on each alert record
# (alerts.json: array of alert records)
# Signal ratio = alerts that needed action / total alerts, as a percentage
jq '(map(select(.action_taken)) | length) / length * 100 | round' alerts.json
Targets
| Signal Ratio | Assessment |
|---|---|
| > 90% | Excellent - high-quality alerts |
| 70-90% | Good - some tuning needed |
| 50-70% | Warning - significant noise |
| < 50% | Critical - more noise than signal |
4. Time to Acknowledge (TTA)
How quickly alerts are acknowledged.
# Average TTA in minutes from the PagerDuty API (never-acknowledged
# incidents are skipped so the date math doesn't fail on nulls; if the
# ack timestamp isn't present in your responses, PagerDuty's Analytics
# API reports time-to-acknowledge aggregates directly)
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01" \
  -H "Authorization: Token token=YOUR_TOKEN" | \
  jq '[.incidents[] |
    select(.first_acknowledged_at != null) |
    ((.first_acknowledged_at | fromdateiso8601) - (.created_at | fromdateiso8601)) / 60
  ] | add / length'
What TTA Tells You
| Avg TTA | Interpretation |
|---|---|
| < 5 min | Healthy - alerts are being monitored |
| 5-15 min | Normal - reasonable response |
| 15-30 min | Warning - possible fatigue or coverage gaps |
| > 30 min | Critical - alerts being ignored |
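Averages can hide a long tail of slow acknowledgements. A variant of the query above, run over a saved `incidents.json` response (file name illustrative; same `first_acknowledged_at` field assumed), reports the median instead:

```shell
# Median TTA in minutes from a saved PagerDuty incidents response
jq '[.incidents[]
    | select(.first_acknowledged_at != null)
    | ((.first_acknowledged_at | fromdateiso8601)
       - (.created_at | fromdateiso8601)) / 60]
   | sort | .[(length/2 | floor)]' incidents.json
```

If the median is healthy but the average is not, a handful of ignored incidents is dragging the number, which is a different problem than uniform slowness.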
5. Repeat Alert Rate
Same alert firing multiple times for the same underlying issue.
# Repeat rate over 24 hours, from an alert history log (alerts.json:
# array of firing notifications, e.g. captured via a webhook receiver).
# The live /api/v2/alerts endpoint lists each active alert once, so
# repeats are only visible in history.
# Output: percent of distinct alerts that fired more than once
jq '[.[] | .fingerprint] | group_by(.) |
    (map(select(length > 1)) | length) / length * 100' alerts.json
Targets
| Repeat Rate | Assessment |
|---|---|
| < 5% | Excellent - issues are being resolved |
| 5-15% | Normal - some recurring issues |
| 15-30% | Warning - issues not being fixed |
| > 30% | Critical - systemic problems |
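Knowing the repeat rate is only half the job; grouping a saved alert history log by name shows where the repeats come from (assuming `alerts.json` holds an array of past firings with Alertmanager-style `labels`):

```shell
# Top 5 repeat offenders by alert name, from an alert history log
jq 'group_by(.labels.alertname) |
    map({alert: .[0].labels.alertname, fires: length}) |
    sort_by(-.fires) | .[0:5]' alerts.json
```

The top two or three entries on this list are usually where a single fix buys the biggest drop in total volume.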
Advanced Metrics
6. Alert Distribution by Severity
# Count by severity (Alertmanager v2 API)
curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq 'group_by(.labels.severity) |
      map({severity: .[0].labels.severity, count: length})'
Healthy Distribution
- Critical: < 5% of alerts
- Warning: 20-40% of alerts
- Info: 50-70% of alerts
If > 10% of alerts are critical, you’re either having constant emergencies or your severity levels are miscalibrated.
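To compare your mix against those percentages directly, the count query can be extended to report shares (a sketch against the Alertmanager v2 API; alerts without a severity label land in a null bucket):

```shell
# Severity mix as percentages of all active alerts
curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq 'length as $n | group_by(.labels.severity) |
      map({severity: .[0].labels.severity,
           pct: (length / $n * 100 | round)})'
```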
7. Alert Duration
How long alerts stay active.
# Average age in minutes of currently firing alerts. The live API only
# returns active alerts, and endsAt on an active alert is its expiry,
# not its resolution time, so true resolution durations need history.
curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq '[.[] |
    (now - (.startsAt | sub("\\.[0-9]+"; "") | fromdateiso8601)) / 60
  ] | add / length'
Targets by Severity
| Severity | Target Resolution Time |
|---|---|
| Critical | < 30 minutes |
| Warning | < 4 hours |
| Info | < 24 hours |
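One way to spot-check these targets is the average age of currently firing alerts broken out by severity (a sketch against the Alertmanager v2 API; a critical alert that has been firing for hours is a target miss by definition):

```shell
# Average age in minutes of firing alerts, per severity
curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq 'group_by(.labels.severity) |
      map({severity: .[0].labels.severity,
           avg_age_min: ([.[] | (now - (.startsAt
             | sub("\\.[0-9]+"; "") | fromdateiso8601)) / 60]
             | add / length | round)})'
```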
8. After-Hours Alert Rate
A high rate of off-hours alerts points to unstable systems and is the fastest route to on-call burnout.
# Percentage of alerts outside 9-5 (hours computed in UTC; adjust for
# your timezone). Timestamps must be parsed before strftime can format
# them, and dividing by the total turns the count into a percentage.
jq '(map(select(
      (.startsAt | fromdateiso8601 | strftime("%H") | tonumber) < 9 or
      (.startsAt | fromdateiso8601 | strftime("%H") | tonumber) >= 17
    )) | length) / length * 100 | round' alerts.json
Targets
| After-Hours % | Assessment |
|---|---|
| < 20% | Excellent - stable systems |
| 20-35% | Normal - some overnight issues |
| 35-50% | Warning - reliability problems |
| > 50% | Critical - on-call is unsustainable |
9. Alert Source Distribution
Which systems generate the most alerts?
# Top 10 alert sources by service label (Alertmanager v2 API)
curl -s "http://alertmanager:9093/api/v2/alerts" | \
  jq 'group_by(.labels.service) |
      map({service: .[0].labels.service, count: length}) |
      sort_by(-.count) |
      .[0:10]'
Use this to identify which services need reliability investment.
10. False Positive Rate
Alerts that fired but shouldn’t have.
This requires manual classification or automated detection:
## Alert Classification
For each alert, classify as:
- **True Positive**: Real issue, action taken
- **True Positive (No Action)**: Real issue, resolved itself
- **False Positive**: Not actually a problem
- **Unknown**: Couldn't determine
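Once alerts are classified, the false positive rate falls out of a one-liner (assuming a `classified.json` array where each record carries a `classification` field; the file and field names are illustrative):

```shell
# False positive rate, as a percentage of classified alerts
jq '(map(select(.classification == "false_positive")) | length)
    / length * 100 | round' classified.json
```

Even classifying a random sample of one week's alerts is enough to estimate this rate; you don't need to label every alert.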
Building an Alert Health Dashboard
Key Visualizations
## Alert Health Dashboard
### Summary Metrics
- Total alerts (7 days): ____
- Alerts per shift: ____
- Signal ratio: ____%
- Avg TTA: ____ min
### Trends (4 weeks)
[Line chart: Alert volume over time]
[Line chart: Signal ratio over time]
[Line chart: Avg TTA over time]
### Top Noisy Alerts
| Alert | Count | Action Rate |
|-------|-------|-------------|
| 1. | | |
| 2. | | |
| 3. | | |
### Alerts by Severity
[Pie chart: Critical/Warning/Info distribution]
### Alerts by Service
[Bar chart: Top 10 alerting services]
### Time of Day Distribution
[Heatmap: Alerts by hour and day of week]
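If you export alert history to JSON, the data behind the time-of-day panel is a one-line aggregation (assuming `alerts.json` holds an array of past firings with `startsAt` timestamps; hours are UTC):

```shell
# Hour-of-day histogram for the time-of-day heatmap
jq 'group_by(.startsAt | fromdateiso8601 | strftime("%H")) |
    map({hour: (.[0].startsAt | fromdateiso8601 | strftime("%H")),
         count: length})' alerts.json
```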
Grafana Queries
# Alert volume trend: firing alerts by name, graphed over time
# (ALERTS is a series Prometheus generates for its own alerting rules)
count by (alertname) (ALERTS{alertstate="firing"})
# Firing alerts by severity
sum by (severity) (ALERTS{alertstate="firing"})
# Average seconds each alert has been firing (ALERTS_FOR_STATE holds the
# active-since timestamp for alerting rules that use a "for" clause)
avg(time() - ALERTS_FOR_STATE)
Setting Improvement Targets
Quarterly Goals
## Q1 Alert Health Goals
### Current State
- Alerts per shift: 25
- Signal ratio: 60%
- Avg TTA: 12 min
- After-hours rate: 40%
### Targets
- Alerts per shift: 15 (-40%)
- Signal ratio: 80% (+33%)
- Avg TTA: 8 min (-33%)
- After-hours rate: 30% (-25%)
### Action Plan
1. Review and fix top 5 noisiest alerts
2. Implement dynamic thresholds for traffic alerts
3. Add runbooks to all critical alerts
4. Automate restart for common pod issues
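A small script can turn these goals into a pass/fail report at the end of each sprint (a sketch using the example numbers above; wire the `check` calls to your real metric queries):

```shell
# Quarterly target check (current values are the example numbers above)
# usage: check NAME CURRENT TARGET DIRECTION  (le = at most, ge = at least)
check() {
  if { [ "$4" = le ] && [ "$2" -le "$3" ]; } || \
     { [ "$4" = ge ] && [ "$2" -ge "$3" ]; }; then
    echo "OK   $1: $2 (target $3)"
  else
    echo "MISS $1: $2 (target $3)"
  fi
}

check alerts_per_shift 25 15 le
check signal_ratio_pct 60 80 ge
check avg_tta_min      12  8 le
check after_hours_pct  40 30 le
```

Running this in CI or a weekly cron keeps the targets visible instead of buried in a planning document.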
Stew: Improving Alert Metrics
Stew helps improve alert metrics by:
- Reducing time to investigate (faster TTA)
- Enabling quick triage (better signal identification)
- Documenting resolutions (learning from alerts)
Better tooling means healthier alerting.
Join the waitlist and improve your alert health.