SLO-Based Alerting: Moving Beyond Threshold Alerts
Traditional threshold alerts (“CPU > 80%”) page you for symptoms. SLO-based alerts page you for what actually matters: user impact.
This guide covers implementing SLO-based alerting. For error budget concepts, see our error budgets guide.
The Problem with Threshold Alerts
Traditional Alerting
```yaml
# Symptom-based alerts
- alert: HighCPU
  expr: cpu_usage > 80
- alert: HighMemory
  expr: memory_usage > 85
- alert: HighDiskIO
  expr: disk_io > 90
```
Problems
- No user context: Is 80% CPU actually affecting users?
- Arbitrary thresholds: Why 80% and not 75% or 85%?
- Alert fatigue: Constant low-value pages
- Missed issues: Can have problems without hitting thresholds
SLO-Based Alerting Philosophy
Alert when:
- Users are being impacted (SLI below target)
- Impact is significant (burning error budget too fast)
- Action is needed (won’t self-resolve)
Don’t alert when:
- Internal metrics are elevated but users are fine
- Impact is brief and self-resolving
- Nothing can be done
The Multi-Window, Multi-Burn-Rate Approach
The gold standard for SLO alerting uses multiple time windows to catch both fast and slow burns.
Why Multiple Windows?
| Window | Catches | Misses |
|---|---|---|
| Short (5m) | Fast burns | Slow burns |
| Long (1h) | Sustained issues | Very fast burns |
| Combined | Both | Neither |
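One convenient implementation is to precompute per-window burn rates as Prometheus recording rules. A sketch, assuming a 99.9% availability SLO measured over HTTP request metrics (the rule names match the `slo:burn_rate_1h` series queried later in this guide; the group name is illustrative):

```yaml
# Sketch: precompute burn rate (error rate / error budget) per window.
# Assumes a 99.9% SLO, so the error budget fraction is 0.001.
groups:
  - name: slo-recording-rules
    rules:
      - record: slo:burn_rate_5m
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
               / sum(rate(http_requests_total[5m])))
          / 0.001
      - record: slo:burn_rate_1h
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
               / sum(rate(http_requests_total[1h])))
          / 0.001
```

Alert expressions then reduce to comparisons like `slo:burn_rate_1h > 14.4 and slo:burn_rate_5m > 14.4`.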
The Canonical Burn Rate Alerts
```yaml
groups:
  - name: slo-alerts
    rules:
      # Page: 2% budget burn in 1 hour (fast burn)
      - alert: SLOBurnRateCritical
        expr: |
          (
            # Long window (1 hour)
            (1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
            > (14.4 * 0.001) # 14.4x burn rate × 0.1% error budget
          )
          and
          (
            # Short window (5 minutes)
            (1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m])))
            > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO critical: burning 2% of error budget per hour"
          impact: "At current rate, the 30-day budget exhausts in ~2 days"
          runbook_url: "https://runbooks.internal/slo-burn"

      # Page: 5% budget burn in 6 hours (medium burn)
      - alert: SLOBurnRateHigh
        expr: |
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[6h]))
                 / sum(rate(http_requests_total[6h])))
            > (6 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[30m]))
                 / sum(rate(http_requests_total[30m])))
            > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO high burn: burning 5% of error budget per 6 hours"

      # Ticket: 10% budget burn in 1 day (slow burn)
      - alert: SLOBurnRateElevated
        expr: |
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[1d]))
                 / sum(rate(http_requests_total[1d])))
            > (3 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[2h]))
                 / sum(rate(http_requests_total[2h])))
            > (3 * 0.001)
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SLO elevated: trend toward budget exhaustion"

      # Ticket: budget running low
      - alert: SLOBudgetLow
        expr: |
          slo:error_budget_remaining < 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error budget below 25%"
          action: "Consider pausing risky changes"
```
Calculating Burn Rate Thresholds
The Math
For a 30-day SLO window:
- 14.4x burn rate = budget exhausted in ~2 days (2% of budget per hour)
- 6x burn rate = budget exhausted in ~5 days
- 3x burn rate = budget exhausted in ~10 days
- 1x burn rate = budget exhausted in exactly 30 days
Formula
Burn Rate = Error Rate / Error Budget
For 99.9% SLO (0.1% error budget):
- 14.4x burn rate = 1.44% error rate
- 6x burn rate = 0.6% error rate
- 3x burn rate = 0.3% error rate
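The arithmetic above can be checked directly: time-to-exhaustion is the window length divided by the burn rate, and the alert threshold is the burn rate times the error budget. A quick sketch, assuming a 30-day window and a 99.9% SLO (the script itself is illustrative):

```shell
# Sketch: derive error-rate thresholds and time-to-exhaustion for a
# 30-day, 99.9% SLO (error budget fraction = 0.001).
window_days=30
budget=0.001
for rate in 14.4 6 3 1; do
  # Alert threshold as a percentage: burn rate × budget fraction × 100
  threshold=$(awk -v r="$rate" -v b="$budget" 'BEGIN { printf "%.2f", r * b * 100 }')
  # Days until the budget is gone: window length / burn rate
  days=$(awk -v w="$window_days" -v r="$rate" 'BEGIN { printf "%.1f", w / r }')
  echo "${rate}x burn: alert above ${threshold}% errors; budget gone in ${days} days"
done
```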
Alert Routing
Different burn rates need different responses.
Alertmanager Configuration
```yaml
# alertmanager.yml
route:
  receiver: default
  routes:
    # Fast burn: page immediately
    - match:
        alertname: SLOBurnRateCritical
      receiver: pagerduty-critical
      repeat_interval: 5m
    # High burn: page
    - match:
        alertname: SLOBurnRateHigh
      receiver: pagerduty-high
      repeat_interval: 15m
    # Slow burn: ticket
    - match:
        alertname: SLOBurnRateElevated
      receiver: slack-and-ticket
      repeat_interval: 4h
    # Budget low: inform
    - match:
        alertname: SLOBudgetLow
      receiver: slack-info
      repeat_interval: 24h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: CRITICAL_KEY
        severity: critical
  - name: pagerduty-high
    pagerduty_configs:
      - service_key: HIGH_KEY
        severity: error
  - name: slack-and-ticket
    slack_configs:
      - channel: '#sre-alerts'
    webhook_configs:
      - url: https://jira.internal/create-ticket
  - name: slack-info
    slack_configs:
      - channel: '#sre-info'
```
Adding SLO Context to Alerts
Make alerts actionable with context.
Rich Alert Annotations
```yaml
# Assumes the alert expression yields a burn rate (e.g. slo:burn_rate_1h > 14.4),
# so $value is the burn-rate multiple. Prometheus templates cannot do arithmetic,
# so any derived figures must be computed in the expression itself.
- alert: SLOBurnRateCritical
  expr: ...
  annotations:
    summary: "{{ $labels.service }} SLO burn rate critical"
    description: |
      Service {{ $labels.service }} is burning error budget at {{ printf "%.1f" $value }}x the normal rate.
      At 14.4x, the 30-day budget is exhausted in roughly two days.
      Current error rate: {{ with query "sum(rate(http_requests_total{status=~\"5..\"}[5m]))" }}{{ . | first | value | printf "%.2f" }}{{ end }} errors/sec
    dashboard_url: "https://grafana.internal/d/slo/{{ $labels.service }}"
    runbook_url: "https://runbooks.internal/slo-burn"
    quick_check: |
      kubectl logs -l app={{ $labels.service }} --tail=50 | grep -i error
```
SLO Alerts vs. Symptom Alerts
You may still want some symptom alerts for specific known issues.
Hybrid Approach
```yaml
# SLO alert: catches unknown issues affecting users
- alert: SLOBurnRate
  expr: burn_rate > 6
  annotations:
    summary: "SLO burn rate elevated"

# Symptom alert: catches a specific known issue before it affects the SLO
- alert: DatabaseConnectionPoolExhausted
  expr: db_connection_pool_available == 0
  for: 1m
  annotations:
    summary: "Database connection pool exhausted - will cause errors"
```
When to Use Symptom Alerts
- Known failure modes with predictable impact
- Leading indicators (predict problems before SLO impact)
- Specific components where SLO alerts would be too slow
Testing SLO Alerts
Validate Alert Logic
```bash
# Test alert expressions
promtool test rules slo_alerts_test.yaml

# Check current burn rate
curl -s "http://prometheus:9090/api/v1/query?query=slo:burn_rate_1h" | jq '.data.result'
```
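`promtool` unit tests let you assert that synthetic series trip the right alerts. A hedged sketch of what `slo_alerts_test.yaml` might contain (the rule file name and series values are illustrative):

```yaml
# Illustrative promtool unit test: a sustained 2% error rate should
# trip the 14.4x fast-burn alert (whose threshold is a 1.44% error rate).
rule_files:
  - slo_alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="200"}'
        values: '0+98x120'   # 98 successes per minute
      - series: 'http_requests_total{status="500"}'
        values: '0+2x120'    # 2 errors per minute → 2% error rate
    alert_rule_test:
      - eval_time: 1h5m
        alertname: SLOBurnRateCritical
        exp_alerts:
          - exp_labels:
              severity: critical
```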
Load Test for Alerts
```bash
# Generate synthetic errors to trigger alerts
for i in {1..1000}; do
  curl -s http://your-service/trigger-error
  sleep 0.01
done

# Watch for the alert
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname | contains("SLO"))'
```
SLO Alert Runbooks
Every SLO alert needs an executable runbook.
# SLO Burn Rate Critical Runbook
## Alert: SLOBurnRateCritical
This alert fires when error budget is being consumed at more than 14.4x the normal rate
(2% of the monthly budget per hour), meaning the 30-day budget will be exhausted in about two days if nothing changes.
### Step 1: Confirm Impact
```bash
# Current error rate
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
```
### Step 2: Identify Error Source
```bash
# Errors by endpoint
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))by(path)" | jq '.data.result'
```
### Step 3: Check Recent Changes
```bash
kubectl get deployments -n production -o json | jq '.items[] | {name: .metadata.name, generation: .metadata.generation}'
```
### Step 4: Remediate
[Specific steps based on diagnosis]
Stew: SLO Alert Response
When SLO alerts fire, speed matters. Every minute of delay consumes more budget.
Stew makes SLO response faster:
- Runbook opens with alert
- Commands run with one click
- Output captured automatically
- Budget protected through fast action
Join the waitlist and build SLO-driven alerting.