
Measuring Alert Fatigue: Metrics and KPIs for Healthy Alerting

· 6 min read · Stew Team
alert-fatigue · metrics · kpi · observability

You can’t improve what you don’t measure. These metrics help you understand and reduce alert fatigue.

For concrete fixes once you have these numbers, see our guide to alert fatigue reduction strategies.

Core Alert Metrics

1. Alert Volume

The raw count of alerts over time.

# Daily alert count
curl -s "http://alertmanager:9093/api/v1/alerts" | \
  jq '[.data[] | select(.startsAt | fromdateiso8601 > (now - 86400))] | length'

Targets

| Team Size | Daily Alerts | Weekly Alerts |
|-----------|--------------|---------------|
| Small (1-3) | < 10 | < 50 |
| Medium (4-8) | < 25 | < 125 |
| Large (9+) | < 50 | < 250 |

2. Alerts Per On-Call Shift

More actionable than raw volume.

# Calculate alerts per 12-hour shift (last 30 days)
TOTAL_ALERTS=$(curl -s "http://alertmanager:9093/api/v1/alerts" | jq '[.data[] | select(.startsAt | fromdateiso8601 > (now - 2592000))] | length')
SHIFTS=$((30 * 2))  # 2 shifts per day for 30 days
echo "Alerts per shift: $((TOTAL_ALERTS / SHIFTS))"

Targets

| Rating | Alerts Per Shift | On-Call Experience |
|--------|------------------|--------------------|
| Excellent | < 5 | Sustainable long-term |
| Good | 5-10 | Manageable |
| Warning | 10-20 | Starting to fatigue |
| Critical | 20+ | Unsustainable |

3. Signal-to-Noise Ratio

Percentage of alerts that required action.

# Requires tracking an action_taken flag per alert in your own system
# Signal ratio (%) = alerts that required action / total alerts
jq -s '([.[] | select(.action_taken)] | length) * 100 / length' alerts.json

Targets

| Signal Ratio | Assessment |
|--------------|------------|
| > 90% | Excellent - high-quality alerts |
| 70-90% | Good - some tuning needed |
| 50-70% | Warning - significant noise |
| < 50% | Critical - more noise than signal |
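Computing this ratio presumes you record an `action_taken` flag somewhere. A minimal sketch of that bookkeeping, assuming a hypothetical append-only JSON-lines log at `alert-log.jsonl`:

```shell
# Append an action_taken flag when closing out an alert.
# alert-log.jsonl is a hypothetical JSON-lines log, reset here for the example.
: > alert-log.jsonl

log_alert() {
  alertname="$1"; action_taken="$2"   # action_taken: true or false
  printf '{"alertname":"%s","action_taken":%s,"closed_at":"%s"}\n' \
    "$alertname" "$action_taken" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    >> alert-log.jsonl
}

log_alert HighLatency true
log_alert DiskSpaceLow false

# Signal ratio (%) over the log; -s slurps the JSON lines into one array
jq -s '([.[] | select(.action_taken)] | length) * 100 / length' alert-log.jsonl
```

The JSON-lines shape keeps writes append-only, so each on-call engineer can log a one-line verdict as they close an alert.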

4. Time to Acknowledge (TTA)

How quickly alerts are acknowledged.

# Average time to acknowledge, in minutes (PagerDuty REST API v2)
curl -s "https://api.pagerduty.com/incidents?since=2024-01-01" \
  -H "Accept: application/vnd.pagerduty+json;version=2" \
  -H "Authorization: Token token=YOUR_TOKEN" | \
  jq '[.incidents[] | select(.acknowledgements | length > 0) |
    ((.acknowledgements[0].at | fromdateiso8601) - (.created_at | fromdateiso8601)) / 60
  ] | add / length'

What TTA Tells You

| Avg TTA | Interpretation |
|---------|----------------|
| < 5 min | Healthy - alerts are being monitored |
| 5-15 min | Normal - reasonable response |
| 15-30 min | Warning - possible fatigue or coverage gaps |
| > 30 min | Critical - alerts being ignored |

5. Repeat Alert Rate

Same alert firing multiple times for the same underlying issue.

# Alerts that fired multiple times in 24 hours.
# A live Alertmanager snapshot lists each fingerprint only once, so this
# needs a history of alert events (e.g. a webhook receiver's log).
jq -s '[.[] | select(.startsAt | fromdateiso8601 > (now - 86400))] |
  group_by(.fingerprint) |
  map(select(length > 1)) |
  length' alert-events.json

Targets

| Repeat Rate | Assessment |
|-------------|------------|
| < 5% | Excellent - issues are being resolved |
| 5-15% | Normal - some recurring issues |
| 15-30% | Warning - issues not being fixed |
| > 30% | Critical - systemic problems |

Advanced Metrics

6. Alert Distribution by Severity

# Count by severity
curl -s "http://alertmanager:9093/api/v1/alerts" | \
  jq '.data | group_by(.labels.severity) | 
    map({severity: .[0].labels.severity, count: length})'

Healthy Distribution

  • Critical: < 5% of alerts
  • Warning: 20-40% of alerts
  • Info: 50-70% of alerts

If > 10% of alerts are critical, you’re either having constant emergencies or your severity levels are miscalibrated.
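As a quick self-check against those targets, here is a sketch that flags a miscalibrated severity mix from a saved alert dump. The file name and inline sample data are assumptions for illustration:

```shell
# Flag when more than 10% of alerts in a dump are critical.
# severity-sample.json stands in for a real dump of .data from Alertmanager.
cat > severity-sample.json <<'EOF'
[{"labels":{"severity":"critical"}},
 {"labels":{"severity":"warning"}},
 {"labels":{"severity":"info"}},
 {"labels":{"severity":"info"}}]
EOF

CRIT_PCT=$(jq '([.[] | select(.labels.severity == "critical")] | length) * 100 / length' severity-sample.json)
echo "Critical share: ${CRIT_PCT}%"

# jq -e sets a zero exit status when the expression is true
if jq -e '([.[] | select(.labels.severity == "critical")] | length) * 100 / length > 10' severity-sample.json > /dev/null; then
  echo "Severity levels may be miscalibrated"
fi
```

Run against your real dump, a non-empty warning line is a prompt to audit which "critical" alerts actually demand immediate human action.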

7. Alert Duration

How long alerts stay active.

# Average alert duration
curl -s "http://alertmanager:9093/api/v1/alerts" | \
  jq '[.data[] | select(.endsAt) | 
    ((.endsAt | fromdateiso8601) - (.startsAt | fromdateiso8601)) / 60
  ] | add / length'

Targets by Severity

| Severity | Target Resolution Time |
|----------|------------------------|
| Critical | < 30 minutes |
| Warning | < 4 hours |
| Info | < 24 hours |
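Extending the average above, a sketch that breaks resolved-alert duration down per severity so it can be compared against these targets. The inline sample data is illustrative:

```shell
# Average resolved-alert duration per severity, in minutes.
# duration-sample.json stands in for a real dump of resolved alerts.
cat > duration-sample.json <<'EOF'
[{"labels":{"severity":"critical"},"startsAt":"2024-01-01T00:00:00Z","endsAt":"2024-01-01T00:20:00Z"},
 {"labels":{"severity":"warning"},"startsAt":"2024-01-01T01:00:00Z","endsAt":"2024-01-01T03:00:00Z"}]
EOF

DURATIONS=$(jq -c '[.[] | select(.endsAt) |
    {severity: .labels.severity,
     minutes: (((.endsAt | fromdateiso8601) - (.startsAt | fromdateiso8601)) / 60)}] |
  group_by(.severity) |
  map({severity: .[0].severity, avg_minutes: (map(.minutes) | add / length)})' duration-sample.json)
echo "$DURATIONS"
```

One row per severity makes the gap to target visible at a glance: any severity whose average exceeds its target is a candidate for better runbooks or automation.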

8. After-Hours Alert Rate

The share of alerts that fire outside working hours is a proxy for both system stability and on-call burden.

# Percentage of alerts outside 09:00-17:00 (hours evaluated in UTC)
jq '([.[] | select(
    (.startsAt | fromdateiso8601 | gmtime | strftime("%H") | tonumber) < 9 or
    (.startsAt | fromdateiso8601 | gmtime | strftime("%H") | tonumber) >= 17
  )] | length) * 100 / length' alerts.json

Targets

| After-Hours % | Assessment |
|---------------|------------|
| < 20% | Excellent - stable systems |
| 20-35% | Normal - some overnight issues |
| 35-50% | Warning - reliability problems |
| > 50% | Critical - on-call is unsustainable |

9. Alert Source Distribution

Which systems generate the most alerts?

# Alerts by service/source
curl -s "http://alertmanager:9093/api/v1/alerts" | \
  jq '.data | group_by(.labels.service) | 
    map({service: .[0].labels.service, count: length}) | 
    sort_by(-.count) | 
    .[0:10]'

Use this to identify which services need reliability investment.

10. False Positive Rate

Alerts that fired but shouldn’t have.

This requires manual classification or automated detection:

## Alert Classification

For each alert, classify as:
- **True Positive**: Real issue, action taken
- **True Positive (No Action)**: Real issue, resolved itself
- **False Positive**: Not actually a problem
- **Unknown**: Couldn't determine
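Once each alert carries one of these labels, the false positive rate falls out of a one-liner. A sketch over a hypothetical classified log:

```shell
# False-positive rate from a manually classified alert log.
# classified-sample.json is a made-up example of such a log.
cat > classified-sample.json <<'EOF'
[{"alertname":"HighLatency","classification":"true_positive"},
 {"alertname":"DiskSpaceLow","classification":"false_positive"},
 {"alertname":"PodRestart","classification":"true_positive_no_action"},
 {"alertname":"HighLatency","classification":"true_positive"}]
EOF

FP_RATE=$(jq '([.[] | select(.classification == "false_positive")] | length) * 100 / length' classified-sample.json)
echo "False positive rate: ${FP_RATE}%"
```

The same log also answers the signal-ratio question above: `true_positive` entries are your signal, everything else is noise or ambiguity.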

Building an Alert Health Dashboard

Key Visualizations

## Alert Health Dashboard

### Summary Metrics
- Total alerts (7 days): ____
- Alerts per shift: ____
- Signal ratio: ____%
- Avg TTA: ____ min

### Trends (4 weeks)
[Line chart: Alert volume over time]
[Line chart: Signal ratio over time]
[Line chart: Avg TTA over time]

### Top Noisy Alerts
| Alert | Count | Action Rate |
|-------|-------|-------------|
| 1. | | |
| 2. | | |
| 3. | | |

### Alerts by Severity
[Pie chart: Critical/Warning/Info distribution]

### Alerts by Service
[Bar chart: Top 10 alerting services]

### Time of Day Distribution
[Heatmap: Alerts by hour and day of week]

Grafana Queries

# Alerts received by Alertmanager per day
sum(increase(alertmanager_alerts_received_total{status="firing"}[1d]))

# Currently firing alerts by severity (ALERTS is exposed by Prometheus itself)
sum by (severity) (ALERTS{alertstate="firing"})

# Average age of currently firing alerts, in seconds
avg(time() - ALERTS_FOR_STATE)

Setting Improvement Targets

Quarterly Goals

## Q1 Alert Health Goals

### Current State
- Alerts per shift: 25
- Signal ratio: 60%
- Avg TTA: 12 min
- After-hours rate: 40%

### Targets
- Alerts per shift: 15 (-40%)
- Signal ratio: 80% (+33%)
- Avg TTA: 8 min (-33%)
- After-hours rate: 30% (-25%)

### Action Plan
1. Review and fix top 5 noisiest alerts
2. Implement dynamic thresholds for traffic alerts
3. Add runbooks to all critical alerts
4. Automate restart for common pod issues

Stew: Improving Alert Metrics

Stew helps improve alert metrics by:

  • Reducing time to investigate (faster TTA)
  • Enabling quick triage (better signal identification)
  • Documenting resolutions (learning from alerts)

Better tooling means healthier alerting.

Join the waitlist and improve your alert health.