SLO Monitoring: A Complete Guide for SRE Teams
SLO monitoring transforms reliability from a vague goal into measurable outcomes. This guide covers how to implement SLO monitoring that actually improves your systems.
For incident response based on SLOs, see our incident response checklist guide.
What Is SLO Monitoring?
SLO (Service Level Objective) monitoring tracks whether your services meet their reliability targets.
The Hierarchy
- SLI (Service Level Indicator): The metric you measure (e.g., request success rate)
- SLO (Service Level Objective): The target for that metric (e.g., 99.9% success rate)
- SLA (Service Level Agreement): The contractual commitment (e.g., refund if below 99.5%)
Why It Matters
Without SLOs:
- “Is the service reliable?” → “It feels okay?”
- “Should we ship this feature or fix that bug?” → Endless debate
- “When should we page?” → Arbitrary thresholds
With SLOs:
- “Is the service reliable?” → “We’re at 99.92%, above our 99.9% target”
- “Should we ship or fix?” → “We have error budget, let’s ship”
- “When should we page?” → “When we’re burning budget too fast”
Step 1: Define Your SLIs
SLIs are the metrics that matter to users.
Common SLI Types
| SLI Type | What It Measures | Example |
|---|---|---|
| Availability | Service is up | Successful requests / Total requests |
| Latency | Response speed | % requests < 200ms |
| Throughput | Capacity | Requests processed per second |
| Correctness | Right answers | Valid responses / Total responses |
Availability SLI
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Latency SLI
# Latency: fraction of requests under the threshold
# (requires the histogram to define an le="0.2" bucket)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Quality SLI
# Quality: requests without errors
1 - (
  sum(rate(application_errors_total[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
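The three PromQL expressions above all reduce to simple ratios. As a minimal sketch of the same math, using hypothetical five-minute request counts rather than real metrics:

```python
# Illustrative SLI math over a sampling window, mirroring the PromQL above.
# The counts are hypothetical 5-minute totals, not real metrics.

def availability_sli(total_requests: int, server_errors: int) -> float:
    """Successful (non-5xx) requests / total requests."""
    return (total_requests - server_errors) / total_requests

def latency_sli(under_threshold: int, total_requests: int) -> float:
    """Fraction of requests faster than the latency threshold."""
    return under_threshold / total_requests

def quality_sli(application_errors: int, total_requests: int) -> float:
    """1 - (application errors / total requests)."""
    return 1 - application_errors / total_requests

print(availability_sli(100_000, 80))   # 0.9992
print(latency_sli(96_500, 100_000))    # 0.965
print(quality_sli(50, 100_000))        # 0.9995
```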
Step 2: Set Your SLOs
SLOs are targets based on user expectations and business needs.
Choosing SLO Targets
| Factor | Higher SLO (99.99%) | Lower SLO (99.9%) |
|---|---|---|
| User expectations | Critical infrastructure | Standard web app |
| Dependency on others | Few dependencies | Many external deps |
| Cost to achieve | Can afford investment | Budget constrained |
| Business impact | Revenue-critical | Internal tool |
Example SLOs
# slos.yaml
slos:
  api:
    availability:
      target: 99.9
      window: 30d
    latency:
      target: 95  # 95% of requests under threshold
      threshold: 200ms
      window: 30d
  checkout:
    availability:
      target: 99.95
      window: 30d
    latency:
      target: 99
      threshold: 500ms
      window: 30d
What SLOs Mean in Practice
| SLO | Allowed Downtime (30 days) | Allowed Downtime (per day) |
|---|---|---|
| 99% | 7.2 hours | 14.4 minutes |
| 99.9% | 43.2 minutes | 1.44 minutes |
| 99.95% | 21.6 minutes | 43.2 seconds |
| 99.99% | 4.3 minutes | 8.6 seconds |
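The downtime figures in the table follow directly from the SLO target: multiply the window length by the allowed failure fraction. A quick sketch that reproduces them:

```python
# Sketch: convert an SLO target into allowed downtime,
# reproducing the table above (default window is 30 days).

def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of downtime permitted by an SLO over the window."""
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(slo)
    per_day = allowed_downtime_minutes(slo, window_minutes=24 * 60)
    print(f"{slo}% -> {per_month:.1f} min / 30 days, {per_day * 60:.1f} s / day")
```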
Step 3: Calculate Error Budgets
Error budget = allowed unreliability.
Error Budget Formula
Error Budget = 1 - SLO Target
For a 99.9% SLO:
Error Budget = 1 - 0.999 = 0.001 = 0.1%
Over 30 days (43,200 minutes):
Budget in minutes = 43,200 × 0.001 = 43.2 minutes
Tracking Budget Consumption
# Error budget remaining (as a fraction of the total)
# sli_availability is assumed to be a recording rule for the availability SLI
1 - (
  (1 - avg_over_time(sli_availability[30d]))
  /
  (1 - 0.999)  # error budget for a 99.9% SLO
)
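The same budget-remaining calculation in plain Python, with a hypothetical measured availability of 99.975% against a 99.9% SLO:

```python
# Sketch: error budget remaining as a fraction of the total, matching
# the query above. measured_availability would come from your monitoring
# system; 0.99975 is a hypothetical value.

def budget_remaining(measured_availability: float,
                     slo_target: float = 0.999) -> float:
    """1 - (observed error rate / allowed error rate)."""
    budget = 1 - slo_target                          # 0.001 for a 99.9% SLO
    consumed = (1 - measured_availability) / budget  # fraction of budget spent
    return 1 - consumed

print(round(budget_remaining(0.99975), 4))  # 0.75 -> 75% of the budget left
```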
Step 4: Set Up SLO-Based Alerts
Alert on error budget burn rate, not raw thresholds.
Burn Rate Concept
Burn rate = how fast you’re consuming error budget.
| Burn Rate | Budget Exhaustion Time |
|---|---|
| 1x | 30 days (normal) |
| 2x | 15 days |
| 10x | 3 days |
| 36x | 20 hours |
| 144x | 5 hours |
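The exhaustion times above are just the budget window divided by the burn rate; a sketch:

```python
# Sketch: time to exhaust a 30-day error budget at a given burn rate,
# reproducing the table above.

def exhaustion_hours(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the budget is gone if the current burn rate holds."""
    return window_days * 24 / burn_rate

for rate in (1, 2, 10, 36, 144):
    print(f"{rate}x -> {exhaustion_hours(rate):.0f} hours")
# 1x -> 720 hours (30 days), 36x -> 20 hours, 144x -> 5 hours
```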
Multi-Window Burn Rate Alerts
# Fast burn: will exhaust the 30-day budget in about 2 days (2% of budget per hour)
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate against a 99.9% SLO
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical - 30-day budget exhausted in ~2 days at this rate"

# Slow burn: will exhaust the budget in under 10 days
- alert: SLOSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (3 * 0.001)  # 3x burn rate
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m]))
    ) > (3 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "SLO burn rate elevated - budget exhausted in < 10 days"
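The essence of the multi-window pattern is that you page only when both the long and the short window exceed the burn-rate threshold: the long window proves the burn is sustained, the short window proves it is still happening. A sketch of that logic outside PromQL, with hypothetical error-rate fractions:

```python
# Sketch of the multi-window check the alert rules above encode.
# Error rates are hypothetical fractions (errors / total) per window.

SLO_BUDGET = 0.001  # allowed error rate for a 99.9% SLO

def burn_alert(long_window_error_rate: float,
               short_window_error_rate: float,
               burn_threshold: float) -> bool:
    """True only when BOTH windows burn faster than the threshold."""
    limit = burn_threshold * SLO_BUDGET
    return long_window_error_rate > limit and short_window_error_rate > limit

# Fast-burn check (14.4x): a sustained 2% error rate trips the alert...
print(burn_alert(0.02, 0.02, 14.4))   # True
# ...but a spike that has already recovered in the short window does not.
print(burn_alert(0.02, 0.001, 14.4))  # False
```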
Step 5: Build SLO Dashboards
Visualize SLO health for your team.
Essential Visualizations
## SLO Dashboard
### Current Status
- API Availability: 99.92% (target: 99.9%) ✅
- API Latency P99: 180ms (target: 200ms) ✅
- Checkout Availability: 99.94% (target: 99.95%) ⚠️
### Error Budget (30-day window)
- API: 65% remaining [██████░░░░]
- Checkout: 12% remaining [█░░░░░░░░░] ⚠️
### Trend (Last 7 Days)
[Line chart: SLI values over time with SLO target line]
### Burn Rate
[Gauge: Current burn rate with thresholds]
### Budget Consumption History
[Area chart: Error budget consumed over 30 days]
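If you generate dashboards or reports yourself, the text budget bars shown above can be rendered from a remaining-budget percentage; a small sketch:

```python
# Sketch: render a text error-budget bar like the ones in the dashboard above.

def budget_bar(percent_remaining: float, width: int = 10) -> str:
    """One filled cell per `100 / width` percent of budget remaining."""
    filled = int(width * percent_remaining / 100)
    return "[" + "█" * filled + "░" * (width - filled) + "]"

print(budget_bar(65))  # [██████░░░░]
print(budget_bar(12))  # [█░░░░░░░░░]
```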
Grafana Queries
# Current availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Error budget remaining
1 - (
  (1 - avg_over_time(
    (
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    )[30d:5m]
  ))
  / 0.001  # error budget for a 99.9% SLO
)
# Burn rate (current vs normal)
(sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h])))
/ 0.001 # Divide by error budget
Step 6: Implement SLO Reviews
Regular reviews keep SLOs relevant.
Weekly SLO Check
## Weekly SLO Review
### Current Status
| Service | SLI | Target | Current | Budget |
|---------|-----|--------|---------|--------|
| API | Availability | 99.9% | 99.92% | 65% |
| API | Latency | 95% < 200ms | 96.5% | 80% |
| Checkout | Availability | 99.95% | 99.94% | 12% |
### Action Items
- [ ] Checkout budget low - investigate recent incidents
- [ ] Review API latency outliers
### Notable Events
- Incident on Tuesday consumed 15% of checkout budget
Quarterly SLO Calibration
## Quarterly SLO Review
### Questions to Answer
1. Are our SLOs aligned with user expectations?
2. Are we meeting them comfortably or struggling?
3. Should we raise/lower any targets?
4. Are our SLIs measuring the right things?
### Recommendations
- Raise API availability target to 99.95% (consistently exceeded)
- Add latency SLI for checkout (currently only availability)
- Remove legacy service SLO (service deprecated)
SLO Monitoring Best Practices
Do’s
- Measure from user perspective: Use synthetic monitoring or edge metrics
- Start simple: One or two SLIs per service
- Make SLOs visible: Dashboards, reports, team meetings
- Use error budgets for decisions: Ship vs. fix prioritization
Don’ts
- Don’t set SLOs at 100%: Impossible and prevents progress
- Don’t measure everything: Focus on what matters to users
- Don’t ignore budget exhaustion: It’s a signal to slow down
- Don’t set and forget: SLOs need regular review
Stew and SLO Monitoring
When SLO alerts fire, you need to act fast. Stew connects SLO-based alerts to executable runbooks:
- Alert triggers when burn rate is high
- Runbook opens with diagnostic commands
- Execute remediation with a click
- Protect your error budget
Join the waitlist and build SLO-driven reliability.