Implementing SLO Monitoring: A Step-by-Step Tutorial
· 6 min read · Stew Team
slo-monitoring · prometheus · grafana
This tutorial walks through implementing SLO monitoring from scratch using Prometheus and Grafana. By the end, you’ll have working SLO dashboards and alerts.
For SLO concepts, see our SLO monitoring guide.
Prerequisites
- Prometheus collecting HTTP metrics
- Grafana for dashboards
- Alertmanager for notifications
Step 1: Verify Your Metrics
First, confirm you have the necessary metrics.
Required Metrics
```bash
# Check for request count metrics
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i request

# Check for latency histogram
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i duration
```
Expected Metrics
```promql
# Request counter with status labels
http_requests_total{status="200", method="GET", path="/api"}

# Latency histogram
http_request_duration_seconds_bucket{le="0.1", method="GET", path="/api"}
```
If Metrics Are Missing
Add instrumentation to your application:
```python
# Python example with prometheus_client
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
)
```
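It helps to understand what the histogram actually exports: cumulative `_bucket` series, where each observation increments every bucket whose `le` upper bound it fits under, plus `_count` and `_sum`. Here's a minimal stdlib sketch of that behavior (a simplified stand-in, not the real `prometheus_client` internals):

```python
# Simplified model of cumulative histogram buckets: an observation
# increments every bucket whose upper bound (`le`) is >= the value,
# plus the _count and _sum series.
BUCKETS = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]

def observe(state, value):
    """Record one latency observation into cumulative bucket counts."""
    for le in BUCKETS:
        if value <= le:
            state["bucket"][le] += 1
    state["bucket"][float("inf")] += 1  # the +Inf bucket counts everything
    state["count"] += 1
    state["sum"] += value

state = {"bucket": {le: 0 for le in BUCKETS + [float("inf")]}, "count": 0, "sum": 0.0}
for latency in [0.03, 0.08, 0.15, 0.45, 1.2]:
    observe(state, latency)

# Share of requests under 200ms -- the same ratio the latency SLI computes
under_200ms = state["bucket"][0.2] / state["count"]
print(under_200ms)  # 3 of 5 observations are <= 0.2s -> 0.6
```

This is why the latency SLI in Step 2 can divide the `le="0.2"` bucket rate by the `_count` rate: the bucket is already a cumulative count of requests at or under 200ms.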
Step 2: Define Recording Rules
Recording rules pre-compute SLI values on an interval, so dashboards and alerts query cheap pre-aggregated series instead of re-evaluating expensive expressions on every refresh.
Create Recording Rules File
```yaml
# prometheus/rules/slo_rules.yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Availability SLI (5-minute window)
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Latency SLI (% under 200ms)
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)

      # Error budget remaining (30-day window, 99.9% target)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          1 - (
            (1 - avg_over_time(sli:http_availability:ratio_rate5m[30d]))
            /
            (1 - 0.999)
          )

      # Burn rate (1-hour window)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
              / sum(rate(http_requests_total[1h])) by (service)
          )
          /
          (1 - 0.999)
```
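Before wiring these rules up, it's worth sanity-checking the arithmetic they encode. A small stdlib sketch with hypothetical request counts (not real metrics) that mirrors the availability, burn-rate, and budget expressions above:

```python
SLO_TARGET = 0.999  # 99.9% availability over a 30-day window

# Hypothetical 1-hour window: 100,000 requests, 500 of them 5xx
sli_1h = (100_000 - 500) / 100_000           # 0.995

# Burn rate: observed error rate relative to the budgeted error rate
burn_rate = (1 - sli_1h) / (1 - SLO_TARGET)  # 0.005 / 0.001 = 5.0

# Hypothetical 30-day average SLI of 99.95%
avg_sli_30d = 0.9995
budget_remaining = 1 - (1 - avg_sli_30d) / (1 - SLO_TARGET)  # 0.5 -> 50% left

print(burn_rate, budget_remaining)
```

A burn rate of 1.0 means you're spending budget exactly as fast as the SLO allows; 5.0 means the full 30-day budget would be gone in 6 days at the current error rate.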
Apply Recording Rules
```bash
# Validate rules
promtool check rules prometheus/rules/slo_rules.yaml

# Reload Prometheus
curl -X POST http://prometheus:9090/-/reload
```
Verify Rules Are Working
```bash
# Query the recording rule
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result'
```
Step 3: Create SLO Alert Rules
Alert on burn rate — how fast you are consuming error budget — rather than on raw error rate.
Alert Rules File
```yaml
# prometheus/rules/slo_alerts.yaml
groups:
  - name: slo_alerts
    rules:
      # Page: will exhaust 2% of budget in 1 hour (fast burn)
      - alert: SLOBurnRateCritical
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service)
          ) / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate for {{ $labels.service }}"
          description: "Error budget will be exhausted in {{ printf \"%.1f\" (1 / $value * 30 * 24) }} hours"
          runbook_url: "https://runbooks.internal/slo-burn-rate"

      # Ticket: will exhaust 10% of budget in 1 day (slow burn)
      - alert: SLOBurnRateWarning
        expr: |
          slo:http_availability:burn_rate_1h > 3
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
              / sum(rate(http_requests_total[30m])) by (service)
          ) / (1 - 0.999) > 3
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate for {{ $labels.service }}"
          description: "Error budget will be exhausted in {{ printf \"%.1f\" (1 / $value * 30) }} days"
          runbook_url: "https://runbooks.internal/slo-burn-rate"

      # Info: error budget low
      - alert: SLOBudgetLow
        expr: slo:http_availability:error_budget_remaining < 0.2
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Error budget low for {{ $labels.service }}"
          description: "Only {{ printf \"%.1f\" (100 * $value) }}% of error budget remaining"
```
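The 14.4 and 3 thresholds are not arbitrary. In the multiwindow scheme popularized by the Google SRE Workbook, a burn-rate threshold is derived from how much budget you're willing to spend before being told: threshold = budget_fraction × SLO_window / alert_window. A quick stdlib check of the numbers used above:

```python
def burn_rate_threshold(budget_fraction, slo_window_hours, alert_window_hours):
    """Burn rate at which `budget_fraction` of the SLO-window budget
    is consumed within `alert_window_hours`."""
    return budget_fraction * slo_window_hours / alert_window_hours

SLO_WINDOW_H = 30 * 24  # hours in the 30-day SLO window

# Page: spending 2% of the monthly budget within 1 hour
critical = burn_rate_threshold(0.02, SLO_WINDOW_H, 1)   # 14.4

# Ticket: a sustained burn rate of 3 spends 10% of the budget per day
warning_budget_per_day = 3 * 24 / SLO_WINDOW_H          # 0.1

print(critical, warning_budget_per_day)
```

Adjust the budget fractions and windows to your own paging tolerance; the formula stays the same.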
Apply Alert Rules
```bash
# Validate
promtool check rules prometheus/rules/slo_alerts.yaml

# Reload
curl -X POST http://prometheus:9090/-/reload
```
Step 4: Build Grafana Dashboard
Create a dashboard to visualize SLO health.
Dashboard JSON
```json
{
  "title": "SLO Dashboard",
  "panels": [
    {
      "title": "Availability SLI (Current)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "targets": [{
        "expr": "sli:http_availability:ratio_rate5m{service=\"api\"} * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 99.5},
              {"color": "green", "value": 99.9}
            ]
          },
          "min": 99,
          "max": 100,
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
      "targets": [{
        "expr": "slo:http_availability:error_budget_remaining{service=\"api\"} * 100",
        "legendFormat": "Budget %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 20},
              {"color": "green", "value": 50}
            ]
          },
          "min": 0,
          "max": 100,
          "unit": "percent"
        }
      }
    },
    {
      "title": "Burn Rate",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
      "targets": [{
        "expr": "slo:http_availability:burn_rate_1h{service=\"api\"}",
        "legendFormat": "Burn Rate"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 3},
              {"color": "red", "value": 10}
            ]
          },
          "unit": "x"
        }
      }
    },
    {
      "title": "Availability Over Time",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 8},
      "targets": [
        {
          "expr": "sli:http_availability:ratio_rate5m{service=\"api\"} * 100",
          "legendFormat": "Availability"
        },
        {
          "expr": "99.9",
          "legendFormat": "SLO Target"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Budget Consumption",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 18},
      "targets": [{
        "expr": "(1 - slo:http_availability:error_budget_remaining{service=\"api\"}) * 100",
        "legendFormat": "Budget Consumed"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "max": 100
        }
      }
    }
  ]
}
```
Import Dashboard
- Open Grafana
- Go to Dashboards → Import
- Paste JSON or upload file
- Select Prometheus data source
- Save
Step 5: Test Your Setup
Simulate an Outage
```bash
# If you have a test endpoint that returns 500s
for i in {1..100}; do
  curl -s http://your-service/test-error
  sleep 0.1
done
```
Verify SLI Impact
```bash
# Check SLI value
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result[0].value[1]'

# Check burn rate
curl -s "http://prometheus:9090/api/v1/query?query=slo:http_availability:burn_rate_1h" | jq '.data.result[0].value[1]'
```
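To know roughly what dip to expect, estimate it from your baseline traffic. A stdlib sketch, assuming a hypothetical baseline of 10 req/s over the 5-minute SLI window (substitute your own numbers):

```python
# Estimate the SLI dip from 100 injected errors against a hypothetical
# baseline of 10 req/s over the 5-minute SLI window.
baseline_rps = 10
window_s = 5 * 60
injected_errors = 100  # the error-loop requests also count as traffic

total = baseline_rps * window_s + injected_errors   # 3100 requests
sli = (total - injected_errors) / total             # ~0.968
burn_rate = (1 - sli) / (1 - 0.999)                 # ~32 -> well past 14.4

print(sli, burn_rate)
```

If the burn rate you observe in Prometheus is far from this estimate, check that the injected errors actually carry a 5xx `status` label.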
Verify Alert Fires
```bash
# Check pending/firing alerts (Alertmanager v2 API; v1 was removed in 0.27)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.labels.alertname | contains("SLO"))'
```
Step 6: Create SLO Runbook
Link alerts to executable procedures.
````markdown
# SLO Burn Rate Runbook

## Alert: SLOBurnRateCritical

### Immediate Assessment

```bash
# Check current error rate
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result'
```

### Identify Error Source

```bash
# Errors by status code
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))by(status)" | jq '.data.result'
```

### Check Recent Deployments

```bash
kubectl get deployments -n production -o json | jq '.items[] | {name: .metadata.name, updated: .metadata.annotations["deployment-time"]}'
```

### Common Remediations

#### If recent deploy caused errors

```bash
kubectl rollout undo deployment/api -n production
```

#### If resource pressure

```bash
kubectl scale deployment/api --replicas=5 -n production
```
````
Next Steps
- Add SLOs for additional services
- Create SLO reports for stakeholders
- Integrate with incident management
- Implement error budget policies
Stew: Executable SLO Runbooks
When SLO alerts fire, Stew lets you execute runbook commands directly:
- Click to run diagnostics
- See output inline
- Resolve quickly, protect your budget
Join the waitlist and build reliable SLO workflows.