Implementing SLO Monitoring: A Step-by-Step Tutorial
· 6 min read · Stew Team
slo-monitoring · prometheus · grafana
This tutorial walks through implementing SLO monitoring from scratch using Prometheus and Grafana. By the end, you’ll have working SLO dashboards and alerts.
For SLO concepts, see our SLO monitoring guide.
Prerequisites
- Prometheus collecting HTTP metrics
- Grafana for dashboards
- Alertmanager for notifications
Step 1: Verify Your Metrics
First, confirm you have the necessary metrics.
Required Metrics
```bash
# Check for request count metrics
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i request

# Check for latency histogram
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data[]' | grep -i duration
```
Expected Metrics
```promql
# Request counter with status labels
http_requests_total{status="200", method="GET", path="/api"}

# Latency histogram
http_request_duration_seconds_bucket{le="0.1", method="GET", path="/api"}
```
If Metrics Are Missing
Add instrumentation to your application:
```python
# Python example with prometheus_client
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
)
```
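It helps to understand what the histogram actually exports: cumulative `_bucket` series, where each observation increments every bucket whose `le` upper bound it fits under, plus `_count` and `_sum`. Here's a minimal stdlib sketch of that behavior (a simplified stand-in, not the real `prometheus_client` internals):

```python
# Simplified model of cumulative histogram buckets: an observation
# increments every bucket whose upper bound (`le`) is >= the value,
# plus the _count and _sum series.
BUCKETS = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]

def observe(state, value):
    """Record one latency observation into cumulative bucket counts."""
    for le in BUCKETS:
        if value <= le:
            state["bucket"][le] += 1
    state["bucket"][float("inf")] += 1  # the +Inf bucket counts everything
    state["count"] += 1
    state["sum"] += value

state = {"bucket": {le: 0 for le in BUCKETS + [float("inf")]}, "count": 0, "sum": 0.0}
for latency in [0.03, 0.08, 0.15, 0.45, 1.2]:
    observe(state, latency)

# Share of requests under 200ms -- the same ratio the latency SLI computes
under_200ms = state["bucket"][0.2] / state["count"]
print(under_200ms)  # 3 of 5 observations are <= 0.2s -> 0.6
```

This is why the latency SLI in Step 2 can divide the `le="0.2"` bucket rate by the `_count` rate: the bucket is already a cumulative count of requests at or under 200ms.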
Step 2: Define Recording Rules
Recording rules pre-compute SLI values on an interval, so dashboards and alerts query cheap pre-aggregated series instead of re-evaluating expensive expressions on every refresh.
Create Recording Rules File
```yaml
# prometheus/rules/slo_rules.yaml
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Availability SLI (5-minute window)
      - record: sli:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Latency SLI (% under 200ms)
      - record: sli:http_latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)

      # Error budget remaining (30-day window, 99.9% target)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          1 - (
            (1 - avg_over_time(sli:http_availability:ratio_rate5m[30d]))
            /
            (1 - 0.999)
          )

      # Burn rate (1-hour window)
      - record: slo:http_availability:burn_rate_1h
        expr: |
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[1h])) by (service)
              / sum(rate(http_requests_total[1h])) by (service)
          )
          /
          (1 - 0.999)
```
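Before wiring these rules up, it's worth sanity-checking the arithmetic they encode. A small stdlib sketch with hypothetical request counts (not real metrics) that mirrors the availability, burn-rate, and budget expressions above:

```python
SLO_TARGET = 0.999  # 99.9% availability over a 30-day window

# Hypothetical 1-hour window: 100,000 requests, 500 of them 5xx
sli_1h = (100_000 - 500) / 100_000           # 0.995

# Burn rate: observed error rate relative to the budgeted error rate
burn_rate = (1 - sli_1h) / (1 - SLO_TARGET)  # 0.005 / 0.001 = 5.0

# Hypothetical 30-day average SLI of 99.95%
avg_sli_30d = 0.9995
budget_remaining = 1 - (1 - avg_sli_30d) / (1 - SLO_TARGET)  # 0.5 -> 50% left

print(burn_rate, budget_remaining)
```

A burn rate of 1.0 means you're spending budget exactly as fast as the SLO allows; 5.0 means the full 30-day budget would be gone in 6 days at the current error rate.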
Apply Recording Rules
```bash
# Validate rules
promtool check rules prometheus/rules/slo_rules.yaml

# Reload Prometheus
curl -X POST http://prometheus:9090/-/reload
```
Verify Rules Are Working
```bash
# Query the recording rule
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result'
```
Step 3: Create SLO Alert Rules
Alert on burn rate — how fast you are consuming error budget — rather than on raw error rate.
Alert Rules File
```yaml
# prometheus/rules/slo_alerts.yaml
groups:
  - name: slo_alerts
    rules:
      # Page: will exhaust 2% of budget in 1 hour (fast burn)
      - alert: SLOBurnRateCritical
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
              / sum(rate(http_requests_total[5m])) by (service)
          ) / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate for {{ $labels.service }}"
          description: "Error budget will be exhausted in {{ printf \"%.1f\" (1 / $value * 30 * 24) }} hours"
          runbook_url: "https://runbooks.internal/slo-burn-rate"

      # Ticket: will exhaust 10% of budget in 1 day (slow burn)
      - alert: SLOBurnRateWarning
        expr: |
          slo:http_availability:burn_rate_1h > 3
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[30m])) by (service)
              / sum(rate(http_requests_total[30m])) by (service)
          ) / (1 - 0.999) > 3
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Elevated SLO burn rate for {{ $labels.service }}"
          description: "Error budget will be exhausted in {{ printf \"%.1f\" (1 / $value * 30) }} days"
          runbook_url: "https://runbooks.internal/slo-burn-rate"

      # Info: error budget low
      - alert: SLOBudgetLow
        expr: slo:http_availability:error_budget_remaining < 0.2
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Error budget low for {{ $labels.service }}"
          description: "Only {{ printf \"%.1f\" (100 * $value) }}% of error budget remaining"
```
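The 14.4 and 3 thresholds are not arbitrary. In the multiwindow scheme popularized by the Google SRE Workbook, a burn-rate threshold is derived from how much budget you're willing to spend before being told: threshold = budget_fraction × SLO_window / alert_window. A quick stdlib check of the numbers used above:

```python
def burn_rate_threshold(budget_fraction, slo_window_hours, alert_window_hours):
    """Burn rate at which `budget_fraction` of the SLO-window budget
    is consumed within `alert_window_hours`."""
    return budget_fraction * slo_window_hours / alert_window_hours

SLO_WINDOW_H = 30 * 24  # hours in the 30-day SLO window

# Page: spending 2% of the monthly budget within 1 hour
critical = burn_rate_threshold(0.02, SLO_WINDOW_H, 1)   # 14.4

# Ticket: a sustained burn rate of 3 spends 10% of the budget per day
warning_budget_per_day = 3 * 24 / SLO_WINDOW_H          # 0.1

print(critical, warning_budget_per_day)
```

Adjust the budget fractions and windows to your own paging tolerance; the formula stays the same.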
Apply Alert Rules
```bash
# Validate
promtool check rules prometheus/rules/slo_alerts.yaml

# Reload
curl -X POST http://prometheus:9090/-/reload
```
Step 4: Build Grafana Dashboard
Create a dashboard to visualize SLO health.
Dashboard JSON
```json
{
  "title": "SLO Dashboard",
  "panels": [
    {
      "title": "Availability SLI (Current)",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "targets": [{
        "expr": "sli:http_availability:ratio_rate5m{service=\"api\"} * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 99.5},
              {"color": "green", "value": 99.9}
            ]
          },
          "min": 99,
          "max": 100,
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
      "targets": [{
        "expr": "slo:http_availability:error_budget_remaining{service=\"api\"} * 100",
        "legendFormat": "Budget %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 20},
              {"color": "green", "value": 50}
            ]
          },
          "min": 0,
          "max": 100,
          "unit": "percent"
        }
      }
    },
    {
      "title": "Burn Rate",
      "type": "stat",
      "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
      "targets": [{
        "expr": "slo:http_availability:burn_rate_1h{service=\"api\"}",
        "legendFormat": "Burn Rate"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 3},
              {"color": "red", "value": 10}
            ]
          },
          "unit": "x"
        }
      }
    },
    {
      "title": "Availability Over Time",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 8},
      "targets": [
        {
          "expr": "sli:http_availability:ratio_rate5m{service=\"api\"} * 100",
          "legendFormat": "Availability"
        },
        {
          "expr": "99.9",
          "legendFormat": "SLO Target"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Budget Consumption",
      "type": "timeseries",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 18},
      "targets": [{
        "expr": "(1 - slo:http_availability:error_budget_remaining{service=\"api\"}) * 100",
        "legendFormat": "Budget Consumed"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "max": 100
        }
      }
    }
  ]
}
```
Import Dashboard
- Open Grafana
- Go to Dashboards → Import
- Paste JSON or upload file
- Select Prometheus data source
- Save
Step 5: Test Your Setup
Simulate an Outage
```bash
# If you have a test endpoint that returns 500s
for i in {1..100}; do
  curl -s http://your-service/test-error
  sleep 0.1
done
```
Verify SLI Impact
```bash
# Check SLI value
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result[0].value[1]'

# Check burn rate
curl -s "http://prometheus:9090/api/v1/query?query=slo:http_availability:burn_rate_1h" | jq '.data.result[0].value[1]'
```
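To know roughly what dip to expect, estimate it from your baseline traffic. A stdlib sketch, assuming a hypothetical baseline of 10 req/s over the 5-minute SLI window (substitute your own numbers):

```python
# Estimate the SLI dip from 100 injected errors against a hypothetical
# baseline of 10 req/s over the 5-minute SLI window.
baseline_rps = 10
window_s = 5 * 60
injected_errors = 100  # the error-loop requests also count as traffic

total = baseline_rps * window_s + injected_errors   # 3100 requests
sli = (total - injected_errors) / total             # ~0.968
burn_rate = (1 - sli) / (1 - 0.999)                 # ~32 -> well past 14.4

print(sli, burn_rate)
```

If the burn rate you observe in Prometheus is far from this estimate, check that the injected errors actually carry a 5xx `status` label.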
Verify Alert Fires
```bash
# Check pending/firing alerts (Alertmanager v2 API; v1 was removed in 0.27)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.labels.alertname | contains("SLO"))'
```
Step 6: Create SLO Runbook
Link alerts to executable procedures.
````markdown
# SLO Burn Rate Runbook

## Alert: SLOBurnRateCritical

### Immediate Assessment

```bash
# Check current error rate
curl -s "http://prometheus:9090/api/v1/query?query=sli:http_availability:ratio_rate5m" | jq '.data.result'
```

### Identify Error Source

```bash
# Errors by status code
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))by(status)" | jq '.data.result'
```

### Check Recent Deployments

```bash
kubectl get deployments -n production -o json | jq '.items[] | {name: .metadata.name, updated: .metadata.annotations["deployment-time"]}'
```

### Common Remediations

#### If recent deploy caused errors

```bash
kubectl rollout undo deployment/api -n production
```

#### If resource pressure

```bash
kubectl scale deployment/api --replicas=5 -n production
```
````
Next Steps
- Add SLOs for additional services
- Create SLO reports for stakeholders
- Integrate with incident management
- Implement error budget policies
Stew: Executable SLO Runbooks
When SLO alerts fire, Stew lets you execute runbook commands directly:
- Click to run diagnostics
- See output inline
- Resolve quickly, protect your budget
Join the waitlist and build reliable SLO workflows.