SLOs and Alerting for ML Systems: Borrowing From SRE

Site reliability engineering gave software a vocabulary for “good enough”: the service level indicator you measure, the objective you commit to, the error budget you are allowed to spend, and the alerts that fire on how fast you are spending it. ML teams have been borrowing that vocabulary for years. Most of the borrowing is sloppy, because the SRE playbook assumes things that ML systems violate — the most important being that you can measure correctness in real time.

This is a guide to adapting SLOs to ML systems honestly: keeping what transfers, fixing what doesn’t, and being explicit about the parts where the analogy breaks.

The SRE primitives, briefly

The Google SRE book’s chapter on service level objectives ↗ defines the three terms people routinely confuse. An SLI is a quantitative measure of one aspect of service, such as request latency, error rate, or availability. An SLO is a target value or range for an SLI (“99th-percentile latency under 300 ms over 28 days”). An SLA is a contract with consequences attached to meeting or missing SLOs. The error budget is the complement of the SLO: a 99.9% availability target grants a 0.1% budget of allowed failure, roughly 43 minutes of downtime in a 30-day window, and you spend it as you wish, whether on risky deploys, experiments, or just bad luck.
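
The budget arithmetic is simple enough to write down. The following is a minimal sketch, not a reference implementation; the function names `error_budget` and `budget_remaining` are illustrative and do not come from the SRE book.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of events allowed to fail while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; a negative value means the SLO is missed."""
    return 1.0 - bad_events / error_budget(slo_target, total_events)


if __name__ == "__main__":
    # A 99.9% target over one million requests allows about 1,000 failures.
    print(error_budget(0.999, 1_000_000))
    # After 250 failures, about three quarters of the budget is left.
    print(budget_remaining(0.999, 1_000_000, 250))
```

The same calculation works for time-based availability: replace the event counts with minutes in the window and minutes of downtime.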

The discipline that makes this powerful is the rule, from the SRE workbook’s alerting chapter ↗, that you do not alert on the SLO being breached. You alert on the burn rate: how fast the budget is being consumed relative to the rate that would exhaust it exactly at the end of the window. A fast burn (the workbook’s example is 2% of a 30-day budget consumed in one hour) pages immediately; a slow burn opens a ticket. This is what keeps alerting actionable instead of either too noisy or too late.
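
As a rough illustration of the burn-rate idea, the sketch below treats burn rate as the observed error rate divided by the budgeted error rate, so a burn rate of 1 spends the budget exactly over the SLO window. The function names and the 30-day (720-hour) window are assumptions made for the example.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo_target)


def budget_fraction_consumed(rate: float, elapsed_hours: float, window_hours: float) -> float:
    """Share of the whole window's budget spent by burning at `rate` for `elapsed_hours`."""
    return rate * elapsed_hours / window_hours


def hours_to_exhaustion(rate: float, window_hours: float) -> float:
    """How long the full budget lasts if the current burn rate continues."""
    return float("inf") if rate <= 0 else window_hours / rate


if __name__ == "__main__":
    # 144 failures in 10,000 requests against a 99.9% SLO.
    rate = burn_rate(144, 10_000, 0.999)
    print(rate)
    print(budget_fraction_consumed(rate, elapsed_hours=1, window_hours=720))
    print(hours_to_exhaustion(rate, window_hours=720))
```

At that burn rate of about 14.4, one hour consumes about 2% of a 30-day budget, and the whole budget would be gone in about 50 hours.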

Which ML signals are real SLIs

An ML system has four failure surfaces, and only some of them produce SLIs that behave as the SRE model assumes. Our monitoring best-practices guide ↗ lays out the four layers; here is how each maps onto SLO discipline.

Software-health SLIs transfer cleanly. Prediction-service latency, error rate, throughput, and availability are exactly the SLIs SRE was designed for. Set p99 latency objectives, define an availability target, alert on burn rate. Nothing special about ML here, and standard APM tooling handles it. This layer should look identical to any other service.

Data-quality SLIs transfer with one twist. The fraction of inference requests that pass schema and range validation is a clean, real-time SLI: “99.5% of requests have all required features within valid ranges over a rolling window.” Burn-rate alerting works directly. The twist is that the consequence of a data-quality breach is silent — the model still returns a number for a malformed input — so this SLI is often the only real-time signal that an upstream pipeline broke.
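
A validation pass-rate SLI could be computed along these lines. This is a sketch under stated assumptions: requests are plain dictionaries, every required feature is numeric, and the feature names and ranges in `FEATURE_RANGES` are hypothetical.

```python
import math
from typing import Iterable, Mapping, Tuple

# Hypothetical feature spec: name -> (minimum, maximum), both inclusive.
FEATURE_RANGES: Mapping[str, Tuple[float, float]] = {
    "age": (0.0, 120.0),
    "basket_value": (0.0, 10_000.0),
}


def is_valid(request: Mapping[str, object], ranges: Mapping[str, Tuple[float, float]]) -> bool:
    """True when every required feature is present, numeric, not NaN, and in range."""
    for name, (low, high) in ranges.items():
        value = request.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False  # missing, None, or wrong type
        if math.isnan(value) or not low <= value <= high:
            return False
    return True


def validation_pass_rate(requests: Iterable[Mapping[str, object]],
                         ranges: Mapping[str, Tuple[float, float]]) -> float:
    """Fraction of requests in the window that pass validation (the SLI)."""
    batch = list(requests)
    if not batch:
        return 1.0  # assumption: an empty window counts as healthy
    return sum(is_valid(r, ranges) for r in batch) / len(batch)
```

The complement of this pass rate plays the role of the error rate, so it can feed the same burn-rate alerts as any availability SLI.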

Model-quality SLIs are where the analogy strains. Accuracy, precision/recall, AUC, RMSE, and calibration error are the SLIs you actually care about, but you usually cannot compute them in real time, because the labels are late. This is the central problem and it gets its own section.

Business-KPI SLIs are lagging by nature. Conversion, revenue per prediction, churn — track them, correlate them with model version, but do not try to page on them. They move for too many non-model reasons to be a clean SLI.

The labels-are-late problem

The SRE error-budget model has an unstated assumption: the SLI is measurable at roughly the same cadence you want to alert at. For latency and error rate, that is true. For model accuracy, it is usually false — labels arrive hours, days, or months after the prediction, a problem covered in depth in our piece on delayed labels and ground-truth lag ↗. Three honest adaptations:

Separate leading-indicator SLOs from lagging-quality SLOs. Treat label-free signals — prediction-distribution drift, data drift on high-importance features, input validation pass rate — as leading-indicator SLIs you can alert on in real time. Treat realized accuracy as a lagging SLO you evaluate on a slower cadence (daily or weekly batches as labels mature). Be explicit that the leading SLOs are proxies; they catch change, not correctness.
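
One possible label-free leading indicator is the population stability index (PSI) between a reference window of prediction scores and the current window. The article does not prescribe a drift metric, so PSI is a choice made here for illustration; the sketch also assumes scores lie in [0, 1] and uses equal-width bins.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two sets of scores in [0, 1]."""
    if not reference or not current:
        raise ValueError("both windows must contain at least one score")

    def proportions(scores: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for s in scores:
            # Clamp so that 1.0 (or a slightly out-of-range score) lands in an edge bin.
            index = min(max(int(s * bins), 0), bins - 1)
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(scores), eps) for c in counts]

    ref = proportions(reference)
    cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]        # scores spread over [0, 1)
    shifted = [0.5 + i / 200 for i in range(100)]   # scores squeezed into [0.5, 1)
    print(psi(baseline, baseline))  # identical windows give 0.0
    print(psi(baseline, shifted))   # a large positive value
```

A rising PSI says the score distribution moved. It does not say whether the predictions became less correct, which is why this belongs with the proxies.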

Define the SLO window in label-time, not wall-clock. If labels for a given day’s predictions stabilize after seven days, your “accuracy over the last 28 days” SLO is really “accuracy over predictions whose labels have matured,” and the most recent week is provisional. Stating this prevents the classic mistake of declaring an SLO breach on a window whose labels are still arriving.
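
A label-time window can be made explicit in code by splitting the nominal window into a matured part and a provisional part. The sketch below assumes each record is a dictionary with `prediction_date`, `prediction`, and `label` keys and that labels stabilize after a fixed number of days; both the record shape and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Mapping, Optional, Sequence, Tuple

Record = Mapping[str, object]


def split_by_maturity(records: Sequence[Record], as_of: date,
                      window_days: int = 28,
                      maturation_days: int = 7) -> Tuple[List[Record], List[Record]]:
    """Split the last `window_days` of predictions into (matured, provisional)."""
    window_start = as_of - timedelta(days=window_days)         # exclusive
    maturity_cutoff = as_of - timedelta(days=maturation_days)  # inclusive
    matured: List[Record] = []
    provisional: List[Record] = []
    for record in records:
        day = record["prediction_date"]
        if not window_start < day <= as_of:
            continue  # outside the SLO window entirely
        (matured if day <= maturity_cutoff else provisional).append(record)
    return matured, provisional


def matured_accuracy(matured: Sequence[Record]) -> Optional[float]:
    """Accuracy over matured records that actually have a label; None if there are none."""
    scored = [r for r in matured if r["label"] is not None]
    if not scored:
        return None
    return sum(r["prediction"] == r["label"] for r in scored) / len(scored)
```

Only the matured list should drive an SLO verdict. The provisional list can still be charted, as long as it is marked as incomplete.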

Use performance estimation to fill the gap. Where labels are badly delayed, methods that estimate performance from model confidence — discussed in our drift decision framework ↗ — give you an estimated-accuracy line to set provisional objectives against, with the standing caveat that estimation cannot see concept drift.
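
The core idea behind confidence-based estimation can be sketched for a binary classifier: if the predicted probabilities are well calibrated, a prediction made with probability p of the positive class is expected to be correct with probability p when it predicts positive and 1 - p when it predicts negative. The function below is an illustration of that idea only, not the method of any particular tool, and the calibration assumption is exactly what concept drift can break.

```python
from typing import Sequence


def estimated_accuracy(probabilities: Sequence[float], threshold: float = 0.5) -> float:
    """Expected accuracy of thresholded predictions, assuming calibrated probabilities."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    total = 0.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        # Predicted positive: correct with probability p.
        # Predicted negative: correct with probability 1 - p.
        total += p if p >= threshold else 1.0 - p
    return total / len(probabilities)


if __name__ == "__main__":
    # Per-prediction expected correctness: 0.9, 0.8, 0.6, 0.5.
    print(estimated_accuracy([0.9, 0.2, 0.6, 0.5]))
```

For these four scores the estimate comes to about 0.7. Comparing the estimate against realized accuracy once labels mature is a cheap way to check whether the calibration assumption still holds.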

Setting the thresholds

The hardest practical question is what number to choose. Two failure modes dominate: thresholds set from a single calm week (false alarms on normal variation) and thresholds set so loose they never fire.

The SRE workbook’s guidance ↗ — start in a monitoring-only phase and observe the SLI for several weeks before committing to an objective — applies even more strongly to ML, because ML SLIs are noisier. Datadog’s ML-monitoring guidance ↗ and Evidently’s monitoring guide ↗ both stress per-model, per-segment calibration over global thresholds: a drift or accuracy bound that is right for your aggregate traffic will be wrong for a thin-but-critical user cohort. Practical sequence:

Observe-only for 2–4 weeks. Record the SLI’s natural variance, including any weekly seasonality, before you pick an objective.
Set the SLO threshold outside the range of normal variation but short of the level that hurts users. The gap between those two is your real budget. If they overlap, the SLI is too noisy to alert on and belongs on a dashboard, not a pager.
Tier and segment. Page on fast burns of revenue-critical SLIs; ticket on slow burns and on low-importance features. Calibrate per segment where segments behave differently.
Revisit after every model update. A new model has a new normal. Carrying old thresholds forward is a top cause of both false alarms and missed regressions.
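
The first two steps of the sequence above can be sketched for a higher-is-better SLI such as validation pass rate. The sketch assumes daily SLI values from the observe-only phase, a user-harm level supplied by the team, and a mean-minus-k-standard-deviations estimate of the normal floor. The choice of k = 3 and the 14-day minimum are assumptions, and the calculation ignores weekly seasonality, which a real setup would need to handle.

```python
import statistics
from typing import NamedTuple, Sequence


class SloBand(NamedTuple):
    normal_floor: float  # lowest value still explained by normal variation
    harm_level: float    # value below which users are hurt
    alertable: bool      # True when there is room between the two


def slo_band(daily_sli: Sequence[float], harm_level: float, k: float = 3.0) -> SloBand:
    """Compare observed variation with the harm level for a higher-is-better SLI."""
    if len(daily_sli) < 14:
        raise ValueError("observe for at least two weeks before setting an objective")
    mean = statistics.mean(daily_sli)
    spread = statistics.stdev(daily_sli)
    normal_floor = mean - k * spread
    # If the normal floor is at or below the harm level, noise and harm overlap:
    # the SLI belongs on a dashboard rather than a pager.
    return SloBand(normal_floor, harm_level, normal_floor > harm_level)


if __name__ == "__main__":
    observed = [0.995, 0.996, 0.994, 0.997] * 4  # 16 days of pass rates
    print(slo_band(observed, harm_level=0.98))
    print(slo_band(observed, harm_level=0.994))
```

When `alertable` is true, the objective goes somewhere between `harm_level` and `normal_floor`. Rerunning the calculation after each model update covers the last step.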

Burn-rate alerting, adapted

For the SLIs that are real-time — latency, error rate, data-validation pass rate, prediction-drift magnitude — burn-rate alerting works as-is. Define multi-window burn-rate alerts: a fast window (e.g., one hour) catching acute breaks, and a slower window (e.g., six hours or a day) catching grinding degradations the fast window misses. The SRE workbook’s multi-window, multi-burn-rate pattern is directly reusable.
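
A multi-window, multi-burn-rate check might look like the sketch below. The window pairs and thresholds (14.4, 6, and 1) follow the example the SRE workbook gives for a 99.9% SLO over 30 days. The names `BurnRule` and `evaluate`, and the shape of the `error_rates` input, are assumptions for illustration; the windows should be adjusted to your own SLO period.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # catches sustained burn
    short_window: str  # confirms the burn is still happening
    threshold: float   # burn rate that must be exceeded in both windows
    action: str        # "page" or "ticket"


# Example values from the SRE workbook for a 99.9% SLO over 30 days.
RULES: Sequence[BurnRule] = (
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
)


def evaluate(error_rates: Mapping[str, float], slo_target: float,
             rules: Sequence[BurnRule] = RULES) -> Optional[str]:
    """Return the action of the first rule whose long and short windows both exceed it."""
    budget = 1.0 - slo_target
    for rule in rules:
        long_burn = error_rates[rule.long_window] / budget
        short_burn = error_rates[rule.short_window] / budget
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.action
    return None
```

Requiring the short window to agree with the long one lets the alert clear quickly once the problem stops, instead of staying red until the long window ages out. For a data-validation SLI, `error_rates` would hold the validation failure rate per window.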

For lagging quality SLOs, “burn rate” is the wrong metaphor — you cannot consume a budget you cannot yet measure. Use trend-based review instead: a scheduled evaluation against matured labels, with the result compared to the prior period and to the leading indicators. If the leading indicators fired days ago and realized quality has now confirmed the drop, that is your validated incident. If quality dropped with no prior leading-indicator signal, you have a blind spot in your leading indicators worth fixing — itself a useful finding.
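
The scheduled review can be reduced to a small decision table. In the sketch below, the tolerance for what counts as a drop and all names are assumptions. The `UNCONFIRMED_SIGNAL` outcome (a leading indicator fired but quality held) is not discussed above and is added only to make the table complete.

```python
from enum import Enum


class ReviewOutcome(Enum):
    HEALTHY = "healthy"
    VALIDATED_INCIDENT = "validated_incident"  # leading indicator fired, quality dropped
    BLIND_SPOT = "blind_spot"                  # quality dropped with no early signal
    UNCONFIRMED_SIGNAL = "unconfirmed_signal"  # early signal, no drop so far (assumed case)


def review_outcome(current_quality: float, prior_quality: float,
                   leading_indicator_fired: bool,
                   tolerance: float = 0.01) -> ReviewOutcome:
    """Classify one review period for a higher-is-better quality metric."""
    dropped = (prior_quality - current_quality) > tolerance
    if dropped and leading_indicator_fired:
        return ReviewOutcome.VALIDATED_INCIDENT
    if dropped:
        return ReviewOutcome.BLIND_SPOT
    if leading_indicator_fired:
        return ReviewOutcome.UNCONFIRMED_SIGNAL
    return ReviewOutcome.HEALTHY
```

Logging the outcome of every review gives a running record of how often the leading indicators anticipate real quality drops.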

What good looks like

A defensible ML SLO setup is a hybrid, not a copy of the SRE template:

Real-time SLOs with burn-rate alerts on latency, error rate, availability, and input-validation pass rate.
Real-time leading-indicator SLOs on prediction drift and high-importance-feature drift, explicitly labeled as proxies.
Lagging quality SLOs on accuracy/calibration, evaluated on a label-matured cadence, reviewed for trend rather than paged on.
Business KPIs on dashboards, correlated with model version, never on the pager.

The point of borrowing from SRE is not to make ML monitoring look like web-service monitoring. It is to import the one habit SRE got profoundly right: deciding in advance what “good enough” means, writing it down as a number, and alerting on the rate of failure rather than waking someone the instant a threshold is touched. The parts of ML that fit that model should use it directly. The parts that don’t — anything gated on late labels — need the honesty to be tracked differently, not forced into a budget metaphor that the data cannot support.

For teams standardizing this across many models, sentryml.com ↗ covers SLI definition and drift alerting across the stack, and mlopsplatforms.com ↗ compares platforms on their alerting and SLO support.

Sources

Service Level Objectives — Google SRE Book ↗ — The canonical definitions of SLIs, SLOs, SLAs, and error budgets.
Alerting on SLOs — Google SRE Workbook ↗ — Burn-rate and multi-window alerting patterns, and the case against alerting on raw SLO breach.
ML model monitoring in production best practices — Datadog ↗ — Practical alerting thresholds and per-model calibration for ML signals.
Model monitoring for ML in production — Evidently AI ↗ — Monitoring architecture, metric selection by task type, and segment-level monitoring.
