SLO-Based Alerting: Moving Beyond Threshold Alerts
Traditional threshold alerts (“CPU > 80%”) page you for symptoms. SLO-based alerts page you for what actually matters: user impact.
This guide covers implementing SLO-based alerting. For error budget concepts, see our error budgets guide.
The Problem with Threshold Alerts
Traditional Alerting
```yaml
# Symptom-based alerts
- alert: HighCPU
  expr: cpu_usage > 80
- alert: HighMemory
  expr: memory_usage > 85
- alert: HighDiskIO
  expr: disk_io > 90
```
Problems
- No user context: Is 80% CPU actually affecting users?
- Arbitrary thresholds: Why 80% and not 75% or 85%?
- Alert fatigue: Constant low-value pages
- Missed issues: Can have problems without hitting thresholds
SLO-Based Alerting Philosophy
Alert when:
- Users are being impacted (SLI below target)
- Impact is significant (burning error budget too fast)
- Action is needed (won’t self-resolve)
Don’t alert when:
- Internal metrics are elevated but users are fine
- Impact is brief and self-resolving
- Nothing can be done
The Multi-Window, Multi-Burn-Rate Approach
The gold standard for SLO alerting uses multiple time windows to catch both fast and slow burns.
Why Multiple Windows?
| Window | Catches | Misses |
|---|---|---|
| Short (5m) | Fast burns | Slow burns |
| Long (1h) | Sustained issues | Very fast burns |
| Combined | Both | Neither |
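One convenient implementation is to precompute per-window burn rates as Prometheus recording rules. A sketch, assuming a 99.9% availability SLO measured over HTTP request metrics (the rule names match the `slo:burn_rate_1h` series queried later in this guide; the group name is illustrative):

```yaml
# Sketch: precompute burn rate (error rate / error budget) per window.
# Assumes a 99.9% SLO, so the error budget fraction is 0.001.
groups:
  - name: slo-recording-rules
    rules:
      - record: slo:burn_rate_5m
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
               / sum(rate(http_requests_total[5m])))
          / 0.001
      - record: slo:burn_rate_1h
        expr: |
          (1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
               / sum(rate(http_requests_total[1h])))
          / 0.001
```

Alert expressions then reduce to comparisons like `slo:burn_rate_1h > 14.4 and slo:burn_rate_5m > 14.4`.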
The Canonical Burn Rate Alerts
```yaml
groups:
  - name: slo-alerts
    rules:
      # Page: 2% budget burn in 1 hour (fast burn)
      - alert: SLOBurnRateCritical
        expr: |
          (
            # Long window (1 hour)
            (1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
            > (14.4 * 0.001) # 14.4x burn rate × 0.1% error budget
          )
          and
          (
            # Short window (5 minutes)
            (1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m])))
            > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "SLO critical: burning 2% of error budget per hour"
          impact: "At current rate, the 30-day budget exhausts in ~2 days"
          runbook_url: "https://runbooks.internal/slo-burn"

      # Page: 5% budget burn in 6 hours (medium burn)
      - alert: SLOBurnRateHigh
        expr: |
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[6h]))
                 / sum(rate(http_requests_total[6h])))
            > (6 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[30m]))
                 / sum(rate(http_requests_total[30m])))
            > (6 * 0.001)
          )
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO high burn: burning 5% of error budget per 6 hours"

      # Ticket: 10% budget burn in 1 day (slow burn)
      - alert: SLOBurnRateElevated
        expr: |
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[1d]))
                 / sum(rate(http_requests_total[1d])))
            > (3 * 0.001)
          )
          and
          (
            (1 - sum(rate(http_requests_total{status!~"5.."}[2h]))
                 / sum(rate(http_requests_total[2h])))
            > (3 * 0.001)
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "SLO elevated: trend toward budget exhaustion"

      # Ticket: budget running low
      - alert: SLOBudgetLow
        expr: |
          slo:error_budget_remaining < 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error budget below 25%"
          action: "Consider pausing risky changes"
```
Calculating Burn Rate Thresholds
The Math
For a 30-day SLO window:
- 14.4x burn rate = budget exhausted in ~2 days (2% of budget per hour)
- 6x burn rate = budget exhausted in ~5 days
- 3x burn rate = budget exhausted in ~10 days
- 1x burn rate = budget exhausted in exactly 30 days
Formula
Burn Rate = Error Rate / Error Budget
For 99.9% SLO (0.1% error budget):
- 14.4x burn rate = 1.44% error rate
- 6x burn rate = 0.6% error rate
- 3x burn rate = 0.3% error rate
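The arithmetic above can be checked directly: time-to-exhaustion is the window length divided by the burn rate, and the alert threshold is the burn rate times the error budget. A quick sketch, assuming a 30-day window and a 99.9% SLO (the script itself is illustrative):

```shell
# Sketch: derive error-rate thresholds and time-to-exhaustion for a
# 30-day, 99.9% SLO (error budget fraction = 0.001).
window_days=30
budget=0.001
for rate in 14.4 6 3 1; do
  # Alert threshold as a percentage: burn rate × budget fraction × 100
  threshold=$(awk -v r="$rate" -v b="$budget" 'BEGIN { printf "%.2f", r * b * 100 }')
  # Days until the budget is gone: window length / burn rate
  days=$(awk -v w="$window_days" -v r="$rate" 'BEGIN { printf "%.1f", w / r }')
  echo "${rate}x burn: alert above ${threshold}% errors; budget gone in ${days} days"
done
```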
Alert Routing
Different burn rates need different responses.
Alertmanager Configuration
```yaml
# alertmanager.yml
route:
  receiver: default
  routes:
    # Fast burn: page immediately
    - match:
        alertname: SLOBurnRateCritical
      receiver: pagerduty-critical
      repeat_interval: 5m
    # High burn: page
    - match:
        alertname: SLOBurnRateHigh
      receiver: pagerduty-high
      repeat_interval: 15m
    # Slow burn: ticket
    - match:
        alertname: SLOBurnRateElevated
      receiver: slack-and-ticket
      repeat_interval: 4h
    # Budget low: inform
    - match:
        alertname: SLOBudgetLow
      receiver: slack-info
      repeat_interval: 24h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: CRITICAL_KEY
        severity: critical
  - name: pagerduty-high
    pagerduty_configs:
      - service_key: HIGH_KEY
        severity: error
  - name: slack-and-ticket
    slack_configs:
      - channel: '#sre-alerts'
    webhook_configs:
      - url: https://jira.internal/create-ticket
  - name: slack-info
    slack_configs:
      - channel: '#sre-info'
```
Adding SLO Context to Alerts
Make alerts actionable with context.
Rich Alert Annotations
```yaml
# Assumes the alert expression yields a burn rate (e.g. slo:burn_rate_1h > 14.4),
# so $value is the burn-rate multiple. Prometheus templates cannot do arithmetic,
# so any derived figures must be computed in the expression itself.
- alert: SLOBurnRateCritical
  expr: ...
  annotations:
    summary: "{{ $labels.service }} SLO burn rate critical"
    description: |
      Service {{ $labels.service }} is burning error budget at {{ printf "%.1f" $value }}x the normal rate.
      At 14.4x, the 30-day budget is exhausted in roughly two days.
      Current error rate: {{ with query "sum(rate(http_requests_total{status=~\"5..\"}[5m]))" }}{{ . | first | value | printf "%.2f" }}{{ end }} errors/sec
    dashboard_url: "https://grafana.internal/d/slo/{{ $labels.service }}"
    runbook_url: "https://runbooks.internal/slo-burn"
    quick_check: |
      kubectl logs -l app={{ $labels.service }} --tail=50 | grep -i error
```
SLO Alerts vs. Symptom Alerts
You may still want some symptom alerts for specific known issues.
Hybrid Approach
```yaml
# SLO alert: catches unknown issues affecting users
- alert: SLOBurnRate
  expr: burn_rate > 6
  annotations:
    summary: "SLO burn rate elevated"

# Symptom alert: catches a specific known issue before it affects the SLO
- alert: DatabaseConnectionPoolExhausted
  expr: db_connection_pool_available == 0
  for: 1m
  annotations:
    summary: "Database connection pool exhausted - will cause errors"
```
When to Use Symptom Alerts
- Known failure modes with predictable impact
- Leading indicators (predict problems before SLO impact)
- Specific components where SLO alerts would be too slow
Testing SLO Alerts
Validate Alert Logic
```bash
# Test alert expressions
promtool test rules slo_alerts_test.yaml

# Check current burn rate
curl -s "http://prometheus:9090/api/v1/query?query=slo:burn_rate_1h" | jq '.data.result'
```
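`promtool` unit tests let you assert that synthetic series trip the right alerts. A hedged sketch of what `slo_alerts_test.yaml` might contain (the rule file name and series values are illustrative):

```yaml
# Illustrative promtool unit test: a sustained 2% error rate should
# trip the 14.4x fast-burn alert (whose threshold is a 1.44% error rate).
rule_files:
  - slo_alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="200"}'
        values: '0+98x120'   # 98 successes per minute
      - series: 'http_requests_total{status="500"}'
        values: '0+2x120'    # 2 errors per minute → 2% error rate
    alert_rule_test:
      - eval_time: 1h5m
        alertname: SLOBurnRateCritical
        exp_alerts:
          - exp_labels:
              severity: critical
```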
Load Test for Alerts
```bash
# Generate synthetic errors to trigger alerts
for i in {1..1000}; do
  curl -s http://your-service/trigger-error
  sleep 0.01
done

# Watch for the alert
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.labels.alertname | contains("SLO"))'
```
SLO Alert Runbooks
Every SLO alert needs an executable runbook.
# SLO Burn Rate Critical Runbook
## Alert: SLOBurnRateCritical
This alert fires when error budget is being consumed at more than 14.4x the normal rate
(2% of the monthly budget per hour), meaning the 30-day budget will be exhausted in about two days if nothing changes.
### Step 1: Confirm Impact
```bash
# Current error rate
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
```
### Step 2: Identify Error Source
```bash
# Errors by endpoint
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))by(path)" | jq '.data.result'
```
### Step 3: Check Recent Changes
```bash
kubectl get deployments -n production -o json | jq '.items[] | {name: .metadata.name, generation: .metadata.generation}'
```
### Step 4: Remediate
[Specific steps based on diagnosis]
Stew: SLO Alert Response
When SLO alerts fire, speed matters. Every minute of delay consumes more budget.
Stew makes SLO response faster:
- Runbook opens with alert
- Commands run with one click
- Output captured automatically
- Budget protected through fast action
Join the waitlist and build SLO-driven alerting.