SLO Monitoring: A Complete Guide for SRE Teams
SLO monitoring transforms reliability from a vague goal into measurable outcomes. This guide covers how to implement SLO monitoring that actually improves your systems.
For incident response based on SLOs, see our incident response checklist guide.
What Is SLO Monitoring?
SLO (Service Level Objective) monitoring tracks whether your services meet their reliability targets.
The Hierarchy
- SLI (Service Level Indicator): The metric you measure (e.g., request success rate)
- SLO (Service Level Objective): The target for that metric (e.g., 99.9% success rate)
- SLA (Service Level Agreement): The contractual commitment (e.g., refund if below 99.5%)
Why It Matters
Without SLOs:
- “Is the service reliable?” → “It feels okay?”
- “Should we ship this feature or fix that bug?” → Endless debate
- “When should we page?” → Arbitrary thresholds
With SLOs:
- “Is the service reliable?” → “We’re at 99.92%, above our 99.9% target”
- “Should we ship or fix?” → “We have error budget, let’s ship”
- “When should we page?” → “When we’re burning budget too fast”
Step 1: Define Your SLIs
SLIs are the metrics that matter to users.
Common SLI Types
| SLI Type | What It Measures | Example |
|---|---|---|
| Availability | Service is up | Successful requests / Total requests |
| Latency | Response speed | % requests < 200ms |
| Throughput | Capacity | Requests processed per second |
| Correctness | Right answers | Valid responses / Total responses |
Availability SLI
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Latency SLI
# Latency: fraction of requests under the threshold
# (requires the histogram to define an le="0.2" bucket)
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
Quality SLI
# Quality: requests without errors
1 - (
  sum(rate(application_errors_total[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
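The three PromQL expressions above all reduce to simple ratios. As a minimal sketch of the same math, using hypothetical five-minute request counts rather than real metrics:

```python
# Illustrative SLI math over a sampling window, mirroring the PromQL above.
# The counts are hypothetical 5-minute totals, not real metrics.

def availability_sli(total_requests: int, server_errors: int) -> float:
    """Successful (non-5xx) requests / total requests."""
    return (total_requests - server_errors) / total_requests

def latency_sli(under_threshold: int, total_requests: int) -> float:
    """Fraction of requests faster than the latency threshold."""
    return under_threshold / total_requests

def quality_sli(application_errors: int, total_requests: int) -> float:
    """1 - (application errors / total requests)."""
    return 1 - application_errors / total_requests

print(availability_sli(100_000, 80))   # 0.9992
print(latency_sli(96_500, 100_000))    # 0.965
print(quality_sli(50, 100_000))        # 0.9995
```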
Step 2: Set Your SLOs
SLOs are targets based on user expectations and business needs.
Choosing SLO Targets
| Factor | Higher SLO (99.99%) | Lower SLO (99.9%) |
|---|---|---|
| User expectations | Critical infrastructure | Standard web app |
| Dependency on others | Few dependencies | Many external deps |
| Cost to achieve | Can afford investment | Budget constrained |
| Business impact | Revenue-critical | Internal tool |
Example SLOs
# slos.yaml
slos:
  api:
    availability:
      target: 99.9
      window: 30d
    latency:
      target: 95  # 95% of requests under threshold
      threshold: 200ms
      window: 30d
  checkout:
    availability:
      target: 99.95
      window: 30d
    latency:
      target: 99
      threshold: 500ms
      window: 30d
What SLOs Mean in Practice
| SLO | Allowed Downtime (30 days) | Allowed Downtime (per day) |
|---|---|---|
| 99% | 7.2 hours | 14.4 minutes |
| 99.9% | 43.2 minutes | 1.44 minutes |
| 99.95% | 21.6 minutes | 43.2 seconds |
| 99.99% | 4.3 minutes | 8.6 seconds |
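The downtime figures in the table follow directly from the SLO target: multiply the window length by the allowed failure fraction. A quick sketch that reproduces them:

```python
# Sketch: convert an SLO target into allowed downtime,
# reproducing the table above (default window is 30 days).

def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of downtime permitted by an SLO over the window."""
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(slo)
    per_day = allowed_downtime_minutes(slo, window_minutes=24 * 60)
    print(f"{slo}% -> {per_month:.1f} min / 30 days, {per_day * 60:.1f} s / day")
```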
Step 3: Calculate Error Budgets
Error budget = allowed unreliability.
Error Budget Formula
Error Budget = 1 - SLO Target
For a 99.9% SLO:
Error Budget = 1 - 0.999 = 0.001 = 0.1%
Over 30 days (43,200 minutes):
Budget in minutes = 43,200 × 0.001 = 43.2 minutes
Tracking Budget Consumption
# Error budget remaining (as a fraction of the total)
# sli_availability is assumed to be a recording rule for the availability SLI
1 - (
  (1 - avg_over_time(sli_availability[30d]))
  /
  (1 - 0.999)  # error budget for a 99.9% SLO
)
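The same budget-remaining calculation in plain Python, with a hypothetical measured availability of 99.975% against a 99.9% SLO:

```python
# Sketch: error budget remaining as a fraction of the total, matching
# the query above. measured_availability would come from your monitoring
# system; 0.99975 is a hypothetical value.

def budget_remaining(measured_availability: float,
                     slo_target: float = 0.999) -> float:
    """1 - (observed error rate / allowed error rate)."""
    budget = 1 - slo_target                          # 0.001 for a 99.9% SLO
    consumed = (1 - measured_availability) / budget  # fraction of budget spent
    return 1 - consumed

print(round(budget_remaining(0.99975), 4))  # 0.75 -> 75% of the budget left
```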
Step 4: Set Up SLO-Based Alerts
Alert on error budget burn rate, not raw thresholds.
Burn Rate Concept
Burn rate = how fast you’re consuming error budget.
| Burn Rate | Budget Exhaustion Time |
|---|---|
| 1x | 30 days (normal) |
| 2x | 15 days |
| 10x | 3 days |
| 36x | 20 hours |
| 144x | 5 hours |
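The exhaustion times above are just the budget window divided by the burn rate; a sketch:

```python
# Sketch: time to exhaust a 30-day error budget at a given burn rate,
# reproducing the table above.

def exhaustion_hours(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the budget is gone if the current burn rate holds."""
    return window_days * 24 / burn_rate

for rate in (1, 2, 10, 36, 144):
    print(f"{rate}x -> {exhaustion_hours(rate):.0f} hours")
# 1x -> 720 hours (30 days), 36x -> 20 hours, 144x -> 5 hours
```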
Multi-Window Burn Rate Alerts
# Fast burn: will exhaust the 30-day budget in about 2 days (2% of budget per hour)
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate against a 99.9% SLO
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical - 30-day budget exhausted in ~2 days at this rate"

# Slow burn: will exhaust the budget in under 10 days
- alert: SLOSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (3 * 0.001)  # 3x burn rate
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m]))
    ) > (3 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "SLO burn rate elevated - budget exhausted in < 10 days"
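The essence of the multi-window pattern is that you page only when both the long and the short window exceed the burn-rate threshold: the long window proves the burn is sustained, the short window proves it is still happening. A sketch of that logic outside PromQL, with hypothetical error-rate fractions:

```python
# Sketch of the multi-window check the alert rules above encode.
# Error rates are hypothetical fractions (errors / total) per window.

SLO_BUDGET = 0.001  # allowed error rate for a 99.9% SLO

def burn_alert(long_window_error_rate: float,
               short_window_error_rate: float,
               burn_threshold: float) -> bool:
    """True only when BOTH windows burn faster than the threshold."""
    limit = burn_threshold * SLO_BUDGET
    return long_window_error_rate > limit and short_window_error_rate > limit

# Fast-burn check (14.4x): a sustained 2% error rate trips the alert...
print(burn_alert(0.02, 0.02, 14.4))   # True
# ...but a spike that has already recovered in the short window does not.
print(burn_alert(0.02, 0.001, 14.4))  # False
```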
Step 5: Build SLO Dashboards
Visualize SLO health for your team.
Essential Visualizations
## SLO Dashboard
### Current Status
- API Availability: 99.92% (target: 99.9%) ✅
- API Latency P99: 180ms (target: 200ms) ✅
- Checkout Availability: 99.94% (target: 99.95%) ⚠️
### Error Budget (30-day window)
- API: 65% remaining [██████░░░░]
- Checkout: 12% remaining [█░░░░░░░░░] ⚠️
### Trend (Last 7 Days)
[Line chart: SLI values over time with SLO target line]
### Burn Rate
[Gauge: Current burn rate with thresholds]
### Budget Consumption History
[Area chart: Error budget consumed over 30 days]
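If you generate dashboards or reports yourself, the text budget bars shown above can be rendered from a remaining-budget percentage; a small sketch:

```python
# Sketch: render a text error-budget bar like the ones in the dashboard above.

def budget_bar(percent_remaining: float, width: int = 10) -> str:
    """One filled cell per `100 / width` percent of budget remaining."""
    filled = int(width * percent_remaining / 100)
    return "[" + "█" * filled + "░" * (width - filled) + "]"

print(budget_bar(65))  # [██████░░░░]
print(budget_bar(12))  # [█░░░░░░░░░]
```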
Grafana Queries
# Current availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Error budget remaining
1 - (
  (1 - avg_over_time(
    (
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    )[30d:5m]
  ))
  / 0.001  # error budget for a 99.9% SLO
)
# Burn rate (current vs normal)
(sum(rate(http_requests_total{status=~"5.."}[1h]))
/ sum(rate(http_requests_total[1h])))
/ 0.001 # Divide by error budget
Step 6: Implement SLO Reviews
Regular reviews keep SLOs relevant.
Weekly SLO Check
## Weekly SLO Review
### Current Status
| Service | SLI | Target | Current | Budget |
|---------|-----|--------|---------|--------|
| API | Availability | 99.9% | 99.92% | 65% |
| API | Latency | 95% < 200ms | 96.5% | 80% |
| Checkout | Availability | 99.95% | 99.94% | 12% |
### Action Items
- [ ] Checkout budget low - investigate recent incidents
- [ ] Review API latency outliers
### Notable Events
- Incident on Tuesday consumed 15% of checkout budget
Quarterly SLO Calibration
## Quarterly SLO Review
### Questions to Answer
1. Are our SLOs aligned with user expectations?
2. Are we meeting them comfortably or struggling?
3. Should we raise/lower any targets?
4. Are our SLIs measuring the right things?
### Recommendations
- Raise API availability target to 99.95% (consistently exceeded)
- Add latency SLI for checkout (currently only availability)
- Remove legacy service SLO (service deprecated)
SLO Monitoring Best Practices
Do’s
- Measure from user perspective: Use synthetic monitoring or edge metrics
- Start simple: One or two SLIs per service
- Make SLOs visible: Dashboards, reports, team meetings
- Use error budgets for decisions: Ship vs. fix prioritization
Don’ts
- Don’t set SLOs at 100%: Impossible and prevents progress
- Don’t measure everything: Focus on what matters to users
- Don’t ignore budget exhaustion: It’s a signal to slow down
- Don’t set and forget: SLOs need regular review
Stew and SLO Monitoring
When SLO alerts fire, you need to act fast. Stew connects SLO-based alerts to executable runbooks:
- Alert triggers when burn rate is high
- Runbook opens with diagnostic commands
- Execute remediation with a click
- Protect your error budget
Join the waitlist and build SLO-driven reliability.