Error Budgets Explained: Using SLOs to Balance Speed and Reliability
Error budgets transform the reliability conversation from “we need to be more reliable” into “here’s exactly how much risk we can take.”
This guide covers error budgets in depth. For SLO implementation, see our SLO monitoring implementation guide.
What Is an Error Budget?
An error budget is the amount of unreliability your SLO allows.
The Math
Error Budget = 1 - SLO Target
| SLO | Error Budget | Meaning |
|---|---|---|
| 99% | 1% | 1% of requests can fail |
| 99.9% | 0.1% | 0.1% of requests can fail |
| 99.99% | 0.01% | 0.01% of requests can fail |
In Time Terms (30-day month)
| SLO | Error Budget (time) |
|---|---|
| 99% | 7 hours 12 minutes |
| 99.9% | 43 minutes |
| 99.99% | 4.3 minutes |
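The conversion from an SLO target to a time allowance is simple arithmetic, so it can be sketched in a few lines of Python. The function name `allowed_downtime` and its parameters are illustrative, and the sketch assumes a time-based availability SLO measured over a fixed 30-day window.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window_days: int = 30) -> timedelta:
    """Return the downtime a time-based SLO permits over the window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    error_budget = 1.0 - slo_target  # Error Budget = 1 - SLO Target
    return timedelta(days=window_days) * error_budget


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} SLO -> {allowed_downtime(target)} of downtime")
```

Changing `window_days` shows how much the allowance shifts if you measure over a week or a quarter instead of a month.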
Why Error Budgets Matter
Problem: The Reliability vs. Velocity Debate
Without error budgets:
- SRE: “We need to slow down, we’ve had too many incidents”
- Product: “We need to ship features, users are waiting”
- Outcome: Endless debate, no resolution
With error budgets:
- “We have 65% of our error budget remaining. We can take calculated risks.”
- “We’ve consumed 90% of our budget. We need to focus on reliability.”
The Key Insight
Error budgets give you permission to fail within limits, and an obligation to stop once those limits are exceeded.
Calculating Error Budget Consumption
Formula
Budget Consumed = Actual Error Rate / Allowed Error Rate
Example
- SLO: 99.9% availability (error budget: 0.1%)
- This month’s availability: 99.85%
- Actual error rate: 0.15%
Budget Consumed = 0.15% / 0.1% = 150%
Budget is exhausted (and exceeded by 50%).
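The same calculation as a minimal Python sketch, using the numbers from the example above. The function name `budget_consumed` is illustrative, and both inputs are assumed to be fractions measured over the same window.

```python
def budget_consumed(slo_target: float, measured_availability: float) -> float:
    """Return the fraction of error budget consumed (1.0 means fully spent)."""
    allowed_error_rate = 1.0 - slo_target
    actual_error_rate = 1.0 - measured_availability
    return actual_error_rate / allowed_error_rate


if __name__ == "__main__":
    consumed = budget_consumed(slo_target=0.999, measured_availability=0.9985)
    print(f"Budget consumed:  {consumed:.0%}")
    print(f"Budget remaining: {1.0 - consumed:.0%}")
```

A negative remaining value means the budget is overspent, which is what the exhausted tier of a budget policy should key on.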
Prometheus Query
# Budget consumed (as percentage)
(
1 - avg_over_time(sli:http_availability:ratio_rate5m[30d])
) / (
1 - 0.999 # Error budget for 99.9% SLO
) * 100
Error Budget Policies
Define what happens at different budget levels.
Example Policy
## Error Budget Policy
### Budget > 50% remaining
- Normal operations
- Feature development proceeds
- Take reasonable risks
### Budget 20-50% remaining
- Caution advised
- Risky changes require extra review
- Consider delaying non-critical releases
### Budget < 20% remaining
- Reliability focus mode
- No new features until budget recovers
- All hands on stability improvements
### Budget exhausted (0% or negative)
- Freeze all non-critical changes
- Dedicated reliability sprint
- Postmortem for budget exhaustion
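One way to make a policy like this checkable by tooling is to encode the tiers as a small function. This is a sketch only: the tier names are made up for illustration, and treating exactly 20% and exactly 50% remaining as "caution" is an assumption, since the policy text does not say which tier owns the boundaries.

```python
def policy_tier(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to a policy tier.

    budget_remaining is a fraction: 0.35 means 35% left, negative means overspent.
    Boundary handling (20% and 50% count as "caution") is an assumption.
    """
    if budget_remaining <= 0.0:
        return "exhausted"          # freeze non-critical changes
    if budget_remaining < 0.20:
        return "reliability-focus"  # no new features until budget recovers
    if budget_remaining <= 0.50:
        return "caution"            # risky changes need extra review
    return "normal"                 # feature development proceeds


if __name__ == "__main__":
    for remaining in (0.65, 0.35, 0.10, -0.50):
        print(f"{remaining:>5.0%} remaining -> {policy_tier(remaining)}")
```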
Automating Policy Enforcement
# Alert when budget is low
- alert: ErrorBudgetLow
  expr: slo:http_availability:error_budget_remaining < 0.2
  annotations:
    summary: "Error budget below 20% - entering reliability focus mode"
    action: "Pause feature releases, prioritize stability"
# Alert when budget is exhausted (0% or negative)
- alert: ErrorBudgetExhausted
  expr: slo:http_availability:error_budget_remaining <= 0
  annotations:
    summary: "Error budget exhausted - freeze deployments"
    action: "Stop all changes, schedule reliability sprint"
Burn Rate: Speed of Budget Consumption
Burn rate tells you how fast you’re consuming budget.
Burn Rate Calculation
Burn Rate = Actual Error Rate / Allowed Error Rate (measured over a short window, such as the last hour)
| Burn Rate | Meaning |
|---|---|
| 1x | Using budget at normal rate (will last the full window) |
| 2x | Using budget 2x faster (will last half the window) |
| 10x | Using budget 10x faster (will last 1/10 of the window) |
| 0.5x | Using budget slower than allowed (banking budget) |
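As a worked illustration of the table, burn rate can be computed from raw request counts in a short window. This is a minimal sketch; `burn_rate` and its parameters are illustrative names, and returning 0.0 when there is no traffic is an assumption about how you want idle periods treated.

```python
def burn_rate(failed_requests: int, total_requests: int, slo_target: float) -> float:
    """Return how many times faster than sustainable the budget is being spent."""
    if total_requests == 0:
        return 0.0  # assumption: no traffic means no budget is being spent
    actual_error_rate = failed_requests / total_requests
    allowed_error_rate = 1.0 - slo_target
    return actual_error_rate / allowed_error_rate


if __name__ == "__main__":
    # 50 failures out of 10,000 requests in the last hour, against a 99.9% SLO
    rate = burn_rate(failed_requests=50, total_requests=10_000, slo_target=0.999)
    print(f"Burn rate: {rate:.1f}x")
```

Here a 0.5% error rate against a 0.1% allowance works out to a burn rate of roughly 5x.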
Prometheus Query
# Current burn rate (1-hour window)
(
1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
/ sum(rate(http_requests_total[1h]))
) / (
1 - 0.999 # Error budget for 99.9% SLO
)
Alert Thresholds by Burn Rate
| Burn Rate | Time to Exhaustion | Alert Severity |
|---|---|---|
| 14.4x | ~2 days (50 hours) | Critical (page) |
| 6x | 5 days | Critical (page) |
| 3x | ~10 days | Warning (ticket) |
| 1x | 30 days | Normal |
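Time to exhaustion follows directly from the burn rate: a full budget lasts the SLO window divided by the burn rate. The sketch below assumes a 30-day window and a constant burn rate; the function name and the optional `budget_remaining` parameter are illustrative additions.

```python
from datetime import timedelta
from typing import Optional


def time_to_exhaustion(
    burn_rate: float,
    window_days: int = 30,
    budget_remaining: float = 1.0,
) -> Optional[timedelta]:
    """Return how long the remaining budget lasts at a constant burn rate.

    Returns None when nothing is being burned, and zero when the budget
    is already spent.
    """
    if burn_rate <= 0.0:
        return None
    if budget_remaining <= 0.0:
        return timedelta(0)
    return timedelta(days=window_days) * (budget_remaining / burn_rate)


if __name__ == "__main__":
    for rate in (14.4, 6.0, 3.0, 1.0):
        print(f"{rate:>4}x -> {time_to_exhaustion(rate)}")
```

Passing a smaller `budget_remaining` shows why the same burn rate is more urgent late in the window than at the start.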
Error Budget Use Cases
Use Case 1: Release Decisions
## Release Decision Framework
### Can we release this risky change?
Check error budget:
- Budget > 50%: Yes, proceed with normal process
- Budget 20-50%: Yes, but add extra monitoring
- Budget < 20%: No, wait for budget to recover
Check burn rate:
- Burn rate < 1x: Safe to release
- Burn rate 1-3x: Proceed with caution
- Burn rate > 3x: Do not release
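The two checks above can be combined into a single gate, for example in a deploy pipeline. This sketch takes the stricter of the two results, which is an assumption: the framework lists the checks but does not say how to reconcile them. The function name and return values are illustrative.

```python
def release_decision(budget_remaining: float, burn_rate: float) -> str:
    """Combine the budget check and the burn-rate check for a risky release.

    budget_remaining is a fraction (0.35 means 35% left).
    The stricter of the two checks wins (an assumption, not part of the framework).
    """
    if budget_remaining < 0.20 or burn_rate > 3.0:
        return "hold"
    if budget_remaining <= 0.50 or burn_rate >= 1.0:
        return "proceed-with-caution"
    return "proceed"


if __name__ == "__main__":
    print(release_decision(budget_remaining=0.65, burn_rate=0.4))
    print(release_decision(budget_remaining=0.35, burn_rate=0.4))
    print(release_decision(budget_remaining=0.65, burn_rate=5.0))
```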
Use Case 2: Incident Prioritization
## Incident Priority Based on Budget Impact
### Calculate incident budget impact:
(Duration × Error Rate) / (Window Length × Allowed Error Rate) = Fraction of Budget Consumed
### Priority assignment:
- Consumes > 10% of monthly budget: P1
- Consumes 5-10% of monthly budget: P2
- Consumes 1-5% of monthly budget: P3
- Consumes < 1% of monthly budget: P4
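To turn an incident into a share of the monthly budget, its "bad minutes" (duration times error rate) need to be divided by the budget expressed in minutes. The sketch below does that and applies the priority thresholds above. It assumes roughly uniform traffic, so that time can stand in for request counts, and a 30-day window; the function names are illustrative.

```python
def incident_budget_impact(
    duration_minutes: float,
    error_rate: float,
    slo_target: float,
    window_days: int = 30,
) -> float:
    """Return the fraction of the window's error budget one incident consumed.

    error_rate is the fraction of requests failing during the incident
    (1.0 for a full outage). Assumes traffic is roughly uniform over time.
    """
    bad_minutes = duration_minutes * error_rate
    budget_minutes = window_days * 24 * 60 * (1.0 - slo_target)
    return bad_minutes / budget_minutes


def incident_priority(impact: float) -> str:
    """Map budget impact to a priority using the thresholds listed above."""
    if impact > 0.10:
        return "P1"
    if impact >= 0.05:
        return "P2"
    if impact >= 0.01:
        return "P3"
    return "P4"


if __name__ == "__main__":
    # 15-minute incident with half of requests failing, against a 99.9% SLO
    impact = incident_budget_impact(15, error_rate=0.5, slo_target=0.999)
    print(f"Impact: {impact:.1%} of monthly budget -> {incident_priority(impact)}")
```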
Use Case 3: Capacity Planning
## When to Scale
### If error budget consumption is mostly from:
- **Latency SLO misses during peak**: Add capacity
- **Availability during deploys**: Improve deployment process
- **Errors during traffic spikes**: Implement auto-scaling
Use Case 4: Technical Debt Prioritization
## Tech Debt vs. Features
### If budget is consistently:
- **< 20% remaining**: Tech debt and reliability work only
- **20-50% remaining**: Mix of reliability and features
- **> 50% remaining**: Feature development prioritized
Error Budget Tracking
Weekly Error Budget Report
## Weekly Error Budget Report
### API Service
- **SLO**: 99.9%
- **This Week**: 99.87%
- **Budget Consumed This Week**: 130% of weekly allocation
- **Monthly Budget Remaining**: 35%
### Key Events
- Tuesday 14:00: 15-minute outage consumed 35% of weekly budget
- Thursday 09:00: Elevated error rate consumed 20%
### Trend
[Chart showing budget consumption over last 4 weeks]
### Action Items
- [ ] Postmortem for Tuesday outage
- [ ] Investigate Thursday error spike
Monthly Error Budget Review
## Monthly Error Budget Review
### Summary
| Service | SLO | Actual | Budget Used | Status |
|---------|-----|--------|-------------|--------|
| API | 99.9% | 99.85% | 150% | ❌ Over |
| Web | 99.9% | 99.92% | 80% | ⚠️ Warning |
| Worker | 99.5% | 99.7% | 60% | ✅ Healthy |
### Analysis
- API exceeded budget due to 3 incidents
- Web on track but trending down
- Worker has budget to spare
### Recommendations
- API: Reliability sprint next month
- Web: No changes needed
- Worker: Can absorb more risk if needed
Error Budget Anti-Patterns
Anti-Pattern 1: Ignoring the Budget
“Our budget is exhausted but we need to ship this feature.”
Fix: Budget exhaustion means stop. Period.
Anti-Pattern 2: Gaming the SLO
Lowering the SLO to have more budget.
Fix: SLOs should reflect user expectations, not engineering convenience.
Anti-Pattern 3: Not Using the Budget
Always running at 0% budget consumed.
Fix: If you never use any budget, your SLO may be looser than the reliability you actually deliver, or you're over-investing in reliability at the expense of velocity.
Stew and Error Budgets
When error budget alerts fire, fast response protects remaining budget:
- Executable runbooks for quick diagnosis
- One-click remediation
- Less time in incidents = more budget preserved