
Implementing SLO Monitoring: A Step-by-Step Tutorial

· 6 min read · Stew Team
slo · monitoring · prometheus · grafana

This tutorial walks through implementing SLO monitoring from scratch using Prometheus and Grafana. By the end, you’ll have working SLO dashboards and alerts.

For SLO concepts, see our SLO monitoring guide.

Prerequisites

  • Prometheus collecting HTTP metrics
  • Grafana for dashboards
  • Alertmanager for notifications

Step 1: Verify Your Metrics

First, confirm you have the necessary metrics.

Required Metrics

# Check for request count metrics
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i request

# Check for latency histogram
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i duration

Expected Metrics

# Request counter with status labels
http_requests_total{status="200", method="GET", path="/api"}

# Latency histogram
http_request_duration_seconds_bucket{le="0.1", method="GET", path="/api"}

If Metrics Are Missing

Add instrumentation to your application:

# Python example with prometheus_client
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'path'],
    # Include a 0.2s bucket -- it backs the 200ms latency SLO below
    buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
)

# Record each request, e.g. from middleware:
# REQUEST_COUNT.labels(method='GET', path='/api', status='200').inc()
# REQUEST_LATENCY.labels(method='GET', path='/api').observe(duration_seconds)

Step 2: Define Recording Rules

Recording rules pre-calculate SLI values, so dashboards and alerts query cheap, pre-aggregated series instead of re-evaluating raw metrics on every refresh.

Create Recording Rules File

# prometheus/rules/slo_rules.yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Availability SLI (5-minute window)
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      
      # Latency SLI (% under 200ms)
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)
      
      # Error budget consumption (30-day window)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          1 - (
            (1 - avg_over_time(sli:http_availability:ratio_rate5m[30d]))
            /
            (1 - 0.999)
          )
      
      # Burn rate (1-hour window)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
            / sum(rate(http_requests_total[1h])) by (service)
          )
          /
          (1 - 0.999)
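
The rule expressions above reduce to three simple ratios. Here is a plain-Python sketch of the same arithmetic, for intuition only (Prometheus evaluates the real rules; the 0.999 target is the SLO used in the expressions above):

```python
# Plain-Python mirror of the recording-rule arithmetic above.
SLO_TARGET = 0.999  # 99.9% availability, as in the rule expressions

def availability_sli(good_requests: float, total_requests: float) -> float:
    """Availability SLI: fraction of requests that were not 5xx."""
    return good_requests / total_requests

def burn_rate(sli: float, slo: float = SLO_TARGET) -> float:
    """How many times the sustainable rate the budget is burning at.
    1.0 means the budget lasts exactly the 30-day window."""
    return (1 - sli) / (1 - slo)

def error_budget_remaining(avg_sli: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget left, given the window-average SLI."""
    return 1 - (1 - avg_sli) / (1 - slo)
```

For example, 999 good requests out of 1000 gives an SLI of 0.999: exactly the target, so a burn rate of 1.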

Apply Recording Rules

# Validate rules
promtool check rules prometheus/rules/slo_rules.yaml

# Reload Prometheus (requires it to be started with --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload

Verify Rules Are Working

# Query the recording rule
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result'

Step 3: Create SLO Alert Rules

Alert on error budget burn rate rather than raw error rate, so alert urgency tracks how quickly the budget is actually being consumed.

Alert Rules File

# prometheus/rules/slo_alerts.yaml
groups:
  - name: slo_alerts
    rules:
      # Page: Will exhaust 2% of budget in 1 hour (fast burn)
      - alert: SLOBurnRateCritical
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
          ) / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate for {{ $labels.service }}"
          description: "Burning error budget at {{ $value | printf \"%.1f\" }}x the sustainable rate (a sustained 14.4x burn exhausts the 30-day budget in ~2 days)"
          runbook_url: "https://runbooks.internal/slo-burn-rate"
      
      # Ticket: burn rate 3 consumes 10% of the 30-day budget per day (slow burn)
      - alert: SLOBurnRateWarning
        expr: |
          slo:http_availability:burn_rate_1h > 3
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
            / sum(rate(http_requests_total[30m])) by (service)
          ) / (1 - 0.999) > 3
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate for {{ $labels.service }}"
          description: "Burning error budget at {{ $value | printf \"%.1f\" }}x the sustainable rate (a sustained 3x burn exhausts the 30-day budget in 10 days)"
          runbook_url: "https://runbooks.internal/slo-burn-rate"
      
      # Info: Error budget low
      - alert: SLOBudgetLow
        expr: slo:http_availability:error_budget_remaining < 0.2
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Error budget low for {{ $labels.service }}"
          description: "Only {{ $value | humanizePercentage }} of error budget remaining"
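
The 14.4 and 3 thresholds are not arbitrary: the burn rate that consumes a given fraction of the budget within a given detection window is budget_fraction × period ÷ window. A quick sketch of the derivation, assuming the 30-day SLO period used throughout this tutorial:

```python
# Burn-rate threshold derivation for a 30-day (720-hour) SLO period:
# the burn rate that consumes `budget_fraction` of the error budget
# within `window_hours`.
PERIOD_HOURS = 30 * 24

def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction * PERIOD_HOURS / window_hours

# Page: 2% of the budget gone in 1 hour  -> threshold 14.4
page_threshold = burn_rate_threshold(0.02, 1)
# Ticket: 10% of the budget gone in 1 day -> threshold 3.0
ticket_threshold = burn_rate_threshold(0.10, 24)
```

Plug in different budget fractions and windows to tune the thresholds for your own paging tolerance.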

Apply Alert Rules

# Validate
promtool check rules prometheus/rules/slo_alerts.yaml

# Reload
curl -X POST http://prometheus:9090/-/reload

Step 4: Build Grafana Dashboard

Create a dashboard to visualize SLO health.

Dashboard JSON

{
  "title": "SLO Dashboard",
  "panels": [
    {
      "title": "Availability SLI (Current)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "targets": [{
        "expr": "sli:http_availability:ratio_rate5m{service=\"api\"} * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 99.5},
              {"color": "green", "value": 99.9}
            ]
          },
          "min": 99,
          "max": 100,
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
      "targets": [{
        "expr": "slo:http_availability:error_budget_remaining{service=\"api\"} * 100",
        "legendFormat": "Budget %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 20},
              {"color": "green", "value": 50}
            ]
          },
          "min": 0,
          "max": 100,
          "unit": "percent"
        }
      }
    },
    {
      "title": "Burn Rate",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
      "targets": [{
        "expr": "slo:http_availability:burn_rate_1h{service=\"api\"}",
        "legendFormat": "Burn Rate"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 3},
              {"color": "red", "value": 10}
            ]
          },
          "unit": "short"
        }
      }
    },
    {
      "title": "Availability Over Time",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 8},
      "targets": [
        {
          "expr": "sli:http_availability:ratio_rate5m{service=\"api\"} * 100",
          "legendFormat": "Availability"
        },
        {
          "expr": "99.9",
          "legendFormat": "SLO Target"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Budget Consumption",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 18},
      "targets": [{
        "expr": "(1 - slo:http_availability:error_budget_remaining{service=\"api\"}) * 100",
        "legendFormat": "Budget Consumed"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "max": 100
        }
      }
    }
  ]
}

Import Dashboard

  1. Open Grafana
  2. Go to Dashboards → Import
  3. Paste JSON or upload file
  4. Select Prometheus data source
  5. Save
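
If you prefer to script the import, Grafana's dashboard API (`POST /api/dashboards/db`) accepts the same JSON wrapped in a small envelope. A minimal sketch; `GRAFANA_URL` and `API_TOKEN` are placeholders for your environment:

```python
import json
import urllib.request

GRAFANA_URL = "http://grafana:3000"  # placeholder for your Grafana instance
API_TOKEN = "YOUR_API_TOKEN"         # placeholder service-account token

def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap a dashboard definition for POST /api/dashboards/db.
    `id` must be null so Grafana creates the dashboard rather than
    trying to update a nonexistent numeric id."""
    return {"dashboard": {**dashboard, "id": None}, "overwrite": overwrite}

def import_dashboard(dashboard: dict) -> None:
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/dashboards/db",
        data=json.dumps(build_import_payload(dashboard)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    urllib.request.urlopen(req)
```

This makes the dashboard definition versionable alongside your Prometheus rules.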

Step 5: Test Your Setup

Simulate an Outage

# If you have a test endpoint that returns 500s
for i in {1..100}; do
  curl -s http://your-service/test-error
  sleep 0.1
done
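
Before checking Prometheus, you can estimate how far the SLI should dip. A back-of-envelope sketch; the 10 rps baseline is an assumption for illustration, not a measured value:

```python
# Estimate the 5-minute availability SLI after injecting errors into a
# steady stream of otherwise-successful baseline traffic.
def expected_sli(baseline_rps: float, window_seconds: float,
                 injected_errors: int) -> float:
    good = baseline_rps * window_seconds
    total = good + injected_errors
    return good / total

# e.g. 10 rps baseline over a 5-minute window with 100 injected 500s:
# 3000 good / 3100 total, i.e. an SLI well below a 99.9% target
```

If the observed SLI is far from this estimate, check whether the injected requests are actually reaching the instrumented service.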

Verify SLI Impact

# Check SLI value
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result[0].value[1]'

# Check burn rate
curl -s "http://prometheus:9090/api/v1/query?query=slo:http_availability:burn_rate_1h" | jq '.data.result[0].value[1]'

Verify Alert Fires

# Check pending/firing alerts (Alertmanager v2 API; v1 is removed in recent versions)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.labels.alertname | contains("SLO"))'
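
To sanity-check a burn-rate value returned by the queries above against the alert tiers, a tiny classifier mirroring the Step 3 thresholds:

```python
# Map a 1-hour burn-rate value onto the alert tiers from Step 3.
def classify_burn_rate(value: float) -> str:
    if value > 14.4:
        return "critical"  # page: fast burn
    if value > 3:
        return "warning"   # ticket: slow burn
    return "ok"
```

If Prometheus reports a value in the "critical" range but no alert is firing, check the `for:` duration and that the alert rules file actually loaded.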

Step 6: Create SLO Runbook

Link alerts to executable procedures.

# SLO Burn Rate Runbook

## Alert: SLOBurnRateCritical

### Immediate Assessment
```bash
# Check current error rate
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result'
```

### Identify Error Source
```bash
# Errors by status code
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))by(status)" | jq '.data.result'
```

### Check Recent Deployments
```bash
kubectl get deployments -n production -o json | jq '.items[] | {name: .metadata.name, updated: .metadata.annotations["deployment-time"]}'
```

### Common Remediations

#### If recent deploy caused errors
```bash
kubectl rollout undo deployment/api -n production
```

#### If resource pressure
```bash
kubectl scale deployment/api --replicas=5 -n production
```

Next Steps

  1. Add SLOs for additional services
  2. Create SLO reports for stakeholders
  3. Integrate with incident management
  4. Implement error budget policies

Stew: Executable SLO Runbooks

When SLO alerts fire, Stew lets you execute runbook commands directly:

  • Click to run diagnostics
  • See output inline
  • Resolve quickly, protect your budget

Join the waitlist and build reliable SLO workflows.