
SLO Monitoring: A Complete Guide for SRE Teams

· 7 min read · Stew Team
Tags: slo · monitoring · sre · reliability

SLO monitoring transforms reliability from a vague goal into measurable outcomes. This guide covers how to implement SLO monitoring that actually improves your systems.

For incident response based on SLOs, see our incident response checklist guide.

What Is SLO Monitoring?

SLO (Service Level Objective) monitoring tracks whether your services meet their reliability targets.

The Hierarchy

  • SLI (Service Level Indicator): The metric you measure (e.g., request success rate)
  • SLO (Service Level Objective): The target for that metric (e.g., 99.9% success rate)
  • SLA (Service Level Agreement): The contractual commitment (e.g., refund if below 99.5%)

Why It Matters

Without SLOs:

  • “Is the service reliable?” → “It feels okay?”
  • “Should we ship this feature or fix that bug?” → Endless debate
  • “When should we page?” → Arbitrary thresholds

With SLOs:

  • “Is the service reliable?” → “We’re at 99.92%, above our 99.9% target”
  • “Should we ship or fix?” → “We have error budget, let’s ship”
  • “When should we page?” → “When we’re burning budget too fast”

Step 1: Define Your SLIs

SLIs are the metrics that matter to users.

Common SLI Types

| SLI Type | What It Measures | Example |
|----------|------------------|---------|
| Availability | Service is up | Successful requests / Total requests |
| Latency | Response speed | % requests < 200ms |
| Throughput | Capacity | Requests processed per second |
| Correctness | Right answers | Valid responses / Total responses |

Availability SLI

# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m]))

Latency SLI

# Latency: % of requests under threshold
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) 
/ 
sum(rate(http_request_duration_seconds_count[5m]))

Quality SLI

# Quality: requests without errors
1 - (
  sum(rate(application_errors_total[5m])) 
  / 
  sum(rate(http_requests_total[5m]))
)
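All three queries share the same shape: a count of "good" events divided by a count of total events. A minimal Python sketch of that ratio (the counter values below are made up for illustration):

```python
def sli(good_events: float, total_events: float) -> float:
    """Generic SLI: fraction of events that were 'good'."""
    if total_events == 0:
        return 1.0  # no traffic means nothing was violated
    return good_events / total_events

# Illustrative counts over a 5-minute window (hypothetical numbers)
availability = sli(good_events=99_920, total_events=100_000)  # non-5xx / all
latency = sli(good_events=96_500, total_events=100_000)       # < 200ms / all
print(f"availability SLI: {availability:.4f}")  # 0.9992
print(f"latency SLI: {latency:.4f}")            # 0.9650
```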

Step 2: Set Your SLOs

SLOs are targets based on user expectations and business needs.

Choosing SLO Targets

| Factor | Higher SLO (99.99%) | Lower SLO (99.9%) |
|--------|---------------------|-------------------|
| User expectations | Critical infrastructure | Standard web app |
| Dependency on others | Few dependencies | Many external deps |
| Cost to achieve | Can afford investment | Budget constrained |
| Business impact | Revenue-critical | Internal tool |

Example SLOs

# slos.yaml
slos:
  api:
    availability:
      target: 99.9
      window: 30d
    latency:
      target: 95  # 95% of requests under threshold
      threshold: 200ms
      window: 30d
  
  checkout:
    availability:
      target: 99.95
      window: 30d
    latency:
      target: 99
      threshold: 500ms
      window: 30d

What SLOs Mean in Practice

| SLO | Allowed Downtime (30 days) | Allowed Downtime (per day) |
|-----|----------------------------|----------------------------|
| 99% | 7.2 hours | 14.4 minutes |
| 99.9% | 43.2 minutes | 1.44 minutes |
| 99.95% | 21.6 minutes | 43.2 seconds |
| 99.99% | 4.3 minutes | 8.6 seconds |
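These figures follow directly from the target and the window length. A quick sketch that reproduces them:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted over the window at a given SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(slo)
    daily = allowed_downtime_minutes(slo, window_days=1)
    print(f"{slo}% -> {monthly:.1f} min / 30 days, {daily * 60:.1f} s / day")
```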

Step 3: Calculate Error Budgets

Error budget = allowed unreliability.

Error Budget Formula

Error Budget = 1 - SLO Target

For a 99.9% SLO:

Error Budget = 1 - 0.999 = 0.001 = 0.1%

Over 30 days (43,200 minutes):

Budget in minutes = 43,200 × 0.001 = 43.2 minutes

Tracking Budget Consumption

# Error budget remaining (as percentage of total)
1 - (
  (1 - avg_over_time(sli_availability[30d])) 
  / 
  (1 - 0.999)  # Error budget for 99.9% SLO
)
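The same arithmetic as that query, in plain Python (the measured SLI value is illustrative):

```python
def error_budget_remaining(measured_sli: float, slo_target: float) -> float:
    """Fraction of error budget left: 1 - (observed error rate / allowed error rate)."""
    allowed_error = 1 - slo_target
    observed_error = 1 - measured_sli
    return 1 - observed_error / allowed_error

# 99.92% measured availability against a 99.9% SLO (hypothetical numbers)
remaining = error_budget_remaining(0.9992, 0.999)
print(f"{remaining:.0%} of error budget remaining")  # 20%
```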

Step 4: Set Up SLO-Based Alerts

Alert on error budget burn rate, not raw thresholds.

Burn Rate Concept

Burn rate = how fast you’re consuming error budget, expressed as a multiple of the sustainable rate. At 1x, the budget runs out exactly at the end of the window; anything higher exhausts it early.

| Burn Rate | Budget Exhaustion Time |
|-----------|------------------------|
| 1x | 30 days (normal) |
| 2x | 15 days |
| 10x | 3 days |
| 36x | 20 hours |
| 144x | 5 hours |
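Exhaustion time is just the window divided by the burn rate; a sketch that reproduces the table:

```python
def exhaustion_hours(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the error budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate

for rate in (1, 2, 10, 36, 144):
    hours = exhaustion_hours(rate)
    if hours >= 48:
        print(f"{rate:>4}x -> {hours / 24:.0f} days")
    else:
        print(f"{rate:>4}x -> {hours:.0f} hours")
```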

Multi-Window Burn Rate Alerts

# Fast burn: Will exhaust budget in hours
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical - budget exhausted in ~2 days at this rate"

# Slow burn: Will exhaust budget in days
- alert: SLOSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (3 * 0.001)  # 3x burn rate
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m]))
    ) > (3 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "SLO burn rate elevated - budget exhausted in < 10 days"
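The two-window pattern above can be reduced to a small predicate: page only when both the long and the short window exceed the burn-rate threshold. A sketch of that logic (the error rates are hypothetical inputs, as if read from your metrics backend):

```python
def should_alert(error_rate_long: float, error_rate_short: float,
                 burn_threshold: float, error_budget: float = 0.001) -> bool:
    """Multi-window burn-rate check: both windows must exceed the threshold.

    The long window confirms the burn is sustained; the short window
    confirms it is still happening, so the alert clears quickly after recovery.
    """
    limit = burn_threshold * error_budget
    return error_rate_long > limit and error_rate_short > limit

# A sustained 2% error rate trips the 14.4x fast-burn alert (14.4 * 0.001 = 1.44%)
print(should_alert(0.02, 0.02, burn_threshold=14.4))   # True
# A burst that has already subsided does not (short window has recovered)
print(should_alert(0.02, 0.001, burn_threshold=14.4))  # False
```

Requiring both windows is the key design choice: the long window alone would keep paging long after an outage ends, and the short window alone would page on every brief blip.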

Step 5: Build SLO Dashboards

Visualize SLO health for your team.

Essential Visualizations

## SLO Dashboard

### Current Status
- API Availability: 99.92% (target: 99.9%) ✅
- API Latency P99: 180ms (target: 200ms) ✅
- Checkout Availability: 99.94% (target: 99.95%) ⚠️

### Error Budget (30-day window)
- API: 65% remaining [████████░░] 
- Checkout: 12% remaining [█░░░░░░░░░] ⚠️

### Trend (Last 7 Days)
[Line chart: SLI values over time with SLO target line]

### Burn Rate
[Gauge: Current burn rate with thresholds]

### Budget Consumption History
[Area chart: Error budget consumed over 30 days]

Grafana Queries

# Current availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))

# Error budget remaining
1 - (
  (1 - avg_over_time(
    (sum(rate(http_requests_total{status!~"5.."}[5m])) 
    / sum(rate(http_requests_total[5m])))[30d:5m]
  )) 
  / 0.001  # Error budget for 99.9% SLO
)

# Burn rate (current vs normal)
(sum(rate(http_requests_total{status=~"5.."}[1h])) 
/ sum(rate(http_requests_total[1h]))) 
/ 0.001  # Divide by error budget

Step 6: Implement SLO Reviews

Regular reviews keep SLOs relevant.

Weekly SLO Check

## Weekly SLO Review

### Current Status
| Service | SLI | Target | Current | Budget |
|---------|-----|--------|---------|--------|
| API | Availability | 99.9% | 99.92% | 65% |
| API | Latency | 95% < 200ms | 96.5% | 80% |
| Checkout | Availability | 99.95% | 99.94% | 12% |

### Action Items
- [ ] Checkout budget low - investigate recent incidents
- [ ] Review API latency outliers

### Notable Events
- Incident on Tuesday consumed 15% of checkout budget

Quarterly SLO Calibration

## Quarterly SLO Review

### Questions to Answer
1. Are our SLOs aligned with user expectations?
2. Are we meeting them comfortably or struggling?
3. Should we raise/lower any targets?
4. Are our SLIs measuring the right things?

### Recommendations
- Raise API availability target to 99.95% (consistently exceeded)
- Add latency SLI for checkout (currently only availability)
- Remove legacy service SLO (service deprecated)

SLO Monitoring Best Practices

Do’s

  • Measure from user perspective: Use synthetic monitoring or edge metrics
  • Start simple: One or two SLIs per service
  • Make SLOs visible: Dashboards, reports, team meetings
  • Use error budgets for decisions: Ship vs. fix prioritization

Don’ts

  • Don’t set SLOs at 100%: Impossible and prevents progress
  • Don’t measure everything: Focus on what matters to users
  • Don’t ignore budget exhaustion: It’s a signal to slow down
  • Don’t set and forget: SLOs need regular review

Stew and SLO Monitoring

When SLO alerts fire, you need to act fast. Stew connects SLO-based alerts to executable runbooks:

  • Alert triggers when burn rate is high
  • Runbook opens with diagnostic commands
  • Execute remediation with a click
  • Protect your error budget

Join the waitlist and build SLO-driven reliability.