Choosing Monitoring Metrics: PSI, KS, and Calibration

The metrics question gets asked backwards. Teams pick a drift metric — usually PSI, because it produces one number with familiar thresholds — and then try to make it answer every monitoring question they have. But PSI, the Kolmogorov-Smirnov test, and calibration error are not interchangeable. They measure different objects, fail under different conditions, and are blind to different problems. Picking the right one starts with naming the question you are actually asking — and our Drift Test Selector maps that question to the statistical test that answers it.

For how these metrics sit inside a full monitoring stack, see our topic index. There are three distinct questions hiding under “is the model okay?”:

Has the input distribution moved? (a data-drift question — KS, PSI, Wasserstein)
By how much, on a scale stakeholders accept? (a thresholding question — PSI’s home turf)
Are the model’s probabilities still trustworthy? (a calibration question — ECE, Brier, reliability diagrams)

The first two are label-free and concern the inputs. The third concerns the outputs and needs labels. Conflating them is the root of most metric confusion.

KS: maximum sensitivity, scale-free, noisy at volume

The Kolmogorov-Smirnov test measures the largest gap between two empirical cumulative distribution functions of a continuous feature. It is nonparametric, makes no distributional assumptions, and returns a p-value. As a detector, it is excellent: if any part of the distribution has shifted, KS tends to see it.

Its weakness is the flip side of its sensitivity, and it is a serious one at production scale. For any fixed nonzero gap between the two distributions, the KS p-value shrinks as the sample grows, so on large samples it flags shifts that are real but trivially small. As Evidently’s comparison of five drift methods documents, on large datasets KS will register a fraction-of-a-percent distributional shift as “significant” and flood the on-call queue with statistically real, practically meaningless alerts. The fix is to stop treating the p-value as the alert and instead threshold on an effect size: the KS statistic itself, or a distance metric like Wasserstein that is expressed in the feature’s own units.

Reach for KS when: features are continuous, samples are small to medium, and you want maximum sensitivity. Avoid relying on its p-value when: samples are large — use the statistic or a distance metric instead.
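The large-sample behaviour can be reproduced on synthetic data. The sketch below assumes SciPy is available; the 0.02 standard deviation shift, the sample size, and the `KS_STAT_THRESHOLD` value are illustrative choices, not recommended settings.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)

# Reference window and a production window whose mean moved by 0.02 standard
# deviations: a real shift, but one that is very small in practical terms.
reference = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
production = rng.normal(loc=0.02, scale=1.0, size=1_000_000)

result = ks_2samp(reference, production)
distance = wasserstein_distance(reference, production)  # in the feature's own units

# Hypothetical alert threshold on the effect size (the KS statistic).
KS_STAT_THRESHOLD = 0.05

p_value_alert = result.pvalue < 0.05
effect_size_alert = result.statistic > KS_STAT_THRESHOLD

print(f"KS statistic={result.statistic:.4f}  p-value={result.pvalue:.2e}")
print(f"Wasserstein distance={distance:.4f}")
print(f"p-value alert={p_value_alert}  effect-size alert={effect_size_alert}")
```

At this volume the p-value rule fires while the effect-size rule stays quiet, which is the behaviour you want from an alert on a shift this small.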

PSI: the stakeholder metric with baked-in thresholds

The Population Stability Index bins a feature and computes a symmetrized divergence between the bin proportions of a reference set and a production set. Its dominant advantage is not statistical; it is organizational. PSI carries domain-accepted thresholds from its origins in credit scoring, summarized in Fiddler’s breakdown of PSI: below 0.1 is negligible, 0.1–0.25 is moderate drift worth investigating, and above 0.25 is significant drift that the original literature treats as grounds to rebuild. When you tell a risk committee “PSI crossed 0.25,” they already know what it means. That shared vocabulary is worth a lot.

PSI is also principled: it is mathematically equivalent to symmetric KL divergence, so it is not merely an industry convention. Its weaknesses are real but manageable. It is sensitive to binning — too few bins hide drift, too many make it unstable — and it handles empty bins badly without a small-constant smoothing hack. And because it is a single scalar per feature, it can hide where in the distribution the shift happened.
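The equivalence with symmetric KL divergence follows from splitting the PSI sum into two terms. Writing p_i for the reference proportion in bin i and q_i for the production proportion (notation introduced here for illustration):

```latex
\mathrm{PSI}
  = \sum_{i} (q_i - p_i)\,\ln\frac{q_i}{p_i}
  = \sum_{i} q_i \ln\frac{q_i}{p_i} + \sum_{i} p_i \ln\frac{p_i}{q_i}
  = D_{\mathrm{KL}}(q \,\|\, p) + D_{\mathrm{KL}}(p \,\|\, q)
```

The identity holds for the binned proportions, which is why the value still depends on where the bin edges fall.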

Reach for PSI when: you need one thresholdable number per feature, stakeholders expect the conventional bands, or you are in a regulated domain that already speaks PSI. Be careful with: binning choices and empty bins; validate that your bin count is stable across windows.
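A minimal PSI sketch for one numeric feature follows. The quantile binning on the reference set, the default of 10 bins, the `eps` smoothing constant, and the function names are assumptions made for illustration; the band labels follow the conventional thresholds quoted above.

```python
import numpy as np


def psi(reference, production, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature, binned on reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)
    # Interior cut points from reference quantiles; the two outer bins are
    # open-ended so production values outside the reference range still count.
    cuts = np.unique(np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1]))
    k = cuts.size + 1
    ref_bins = np.searchsorted(cuts, reference, side="right")
    prod_bins = np.searchsorted(cuts, production, side="right")
    ref_frac = np.bincount(ref_bins, minlength=k) / reference.size
    prod_frac = np.bincount(prod_bins, minlength=k) / production.size
    # Small-constant smoothing so an empty bin does not produce log(0).
    ref_frac = np.clip(ref_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))


def psi_band(value):
    """Map a PSI value onto the conventional bands."""
    if value < 0.1:
        return "negligible"
    if value <= 0.25:
        return "moderate"
    return "significant"


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=50_000)
stable = rng.normal(0.0, 1.0, size=50_000)
shifted = rng.normal(1.0, 1.0, size=50_000)

psi_stable = psi(reference, stable)
psi_shifted = psi(reference, shifted)
print(psi_band(psi_stable), psi_band(psi_shifted))
```

Rerunning the same comparison with a few different `n_bins` values is a cheap way to check whether a reading is driven by the data or by the binning.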

The full method comparison (KS, PSI, KL, Jensen-Shannon, and Wasserstein side by side) is the subject of our data drift detection guide. The short version: Wasserstein offers the best sensitivity-to-noise balance on large numeric features, PSI wins on communicability, KS wins on raw sensitivity at smaller scale.

Calibration: the metric that watches the outputs

Here is the category error worth correcting: PSI and KS tell you nothing about whether the model is right. They watch the inputs. A model can have rock-stable input distributions and pristine PSI while quietly producing probabilities that no longer mean what they claim. That failure is invisible to every data-drift metric, and it is exactly what calibration metrics catch.

Calibration measures whether the model’s stated confidence matches reality: among predictions made with 70% confidence, are about 70% actually correct? The standard tools, drawn from the scikit-learn calibration documentation and the ICLR 2025 explainer on calibration:

Reliability diagram. Bin predictions by confidence and plot predicted-probability against observed-frequency. A perfectly calibrated model lies on the diagonal. The diagram is the most informative single artifact because it shows where calibration breaks — overconfident in the high-probability bins, say.
Expected Calibration Error (ECE). The average absolute gap between confidence and accuracy across bins, weighted by bin population. One number, easy to track over time, but it can mask compensating errors: the gap is averaged within each bin before the absolute value is taken, so overconfident and underconfident predictions that land in the same bin can cancel out.
Brier score. The mean squared error between predicted probabilities and outcomes. It captures calibration and sharpness together, ranges 0 to 1, lower is better, and is a good single proxy when you want one number that punishes both miscalibration and indecision.

The ICLR explainer’s standing caution is worth repeating: no single calibration scalar is sufficient. ECE can be gamed by binning choices and hides region-specific errors; pair it with a reliability diagram (to see where) and a Brier score or log loss (to capture sharpness). Calibration needs labels, so it is a lagging metric — but it is the one that tells you whether the model’s probabilities can still be trusted, which matters enormously if anything downstream thresholds on those probabilities.
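To make the two scalars concrete, here is a sketch for a binary classifier. It assumes the positive-class-probability form of ECE with equal-width bins (the same binning a reliability diagram uses) and simulates an overconfident model by doubling the logit; the function names and the simulation are illustrative, not taken from any library.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Population-weighted mean of |observed frequency - mean predicted probability| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins on [0, 1]; digitize against the interior edges gives ids 0..n_bins-1.
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(y_prob, interior_edges)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the share of samples in this bin
    return float(ece)


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


rng = np.random.default_rng(0)
true_prob = rng.uniform(0.01, 0.99, size=100_000)
labels = (rng.uniform(size=true_prob.size) < true_prob).astype(int)

# Simulated overconfidence: double the logit so scores are pushed toward 0 and 1.
logit = np.log(true_prob / (1.0 - true_prob))
overconfident = 1.0 / (1.0 + np.exp(-2.0 * logit))

ece_calibrated = expected_calibration_error(labels, true_prob)
ece_overconfident = expected_calibration_error(labels, overconfident)
brier_calibrated = brier_score(labels, true_prob)
brier_overconfident = brier_score(labels, overconfident)
print(ece_calibrated, ece_overconfident, brier_calibrated, brier_overconfident)
```

The per-bin pairs of mean predicted probability and observed frequency computed inside the loop are the points a reliability diagram plots, so the same pass can feed both the scalar and the chart.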

Why calibration is load-bearing for late-label systems

There is a second reason calibration deserves a permanent place in your metric set, beyond “are the probabilities trustworthy.” Label-free performance estimation, the technique that lets you estimate accuracy in the blind period before labels arrive, depends on the model being well calibrated. The estimate is built from the predicted-probability distribution, so if those probabilities are miscalibrated, the estimate is confidently wrong. In a delayed-label system, your calibration metric is therefore doing double duty: it is both a quality signal and the validity check for your performance estimates. Monitor calibration, and a rise in ECE is also a warning that your estimated-accuracy line should no longer be trusted.
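The dependence on calibration shows up even in a toy version of the idea. The sketch below uses a deliberately simplified confidence-based estimator (the average probability that each thresholded prediction is correct); it is an illustration under that assumption, not the method of any particular monitoring library, and `estimated_accuracy` is a hypothetical name.

```python
import numpy as np


def estimated_accuracy(y_prob, threshold=0.5):
    """Label-free accuracy estimate that takes the predicted probabilities at face value."""
    y_prob = np.asarray(y_prob, dtype=float)
    predicted_positive = y_prob >= threshold
    # Probability the prediction is right: p for predicted positives, 1 - p otherwise.
    return float(np.mean(np.where(predicted_positive, y_prob, 1.0 - y_prob)))


rng = np.random.default_rng(1)
true_prob = rng.uniform(0.01, 0.99, size=100_000)
labels = (rng.uniform(size=true_prob.size) < true_prob).astype(int)

# Simulated overconfident scores: same ranking and same 0.5-threshold decisions,
# but pushed toward 0 and 1 by doubling the logit.
logit = np.log(true_prob / (1.0 - true_prob))
overconfident = 1.0 / (1.0 + np.exp(-2.0 * logit))

# Accuracy actually realized once labels arrive (identical for both score sets).
realized = float(np.mean((true_prob >= 0.5) == (labels == 1)))

est_calibrated = estimated_accuracy(true_prob)
est_overconfident = estimated_accuracy(overconfident)
print(realized, est_calibrated, est_overconfident)
```

Both score sets make identical decisions, so realized accuracy is the same; only the estimate built from the miscalibrated scores drifts away from it.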

A metric set, not a metric

The takeaway is to stop hunting for the one true drift metric and instead assemble a small set, each answering a distinct question:

Per-feature input drift: Wasserstein or the KS statistic (not p-value) for numerics at scale; PSI where stakeholders want the conventional thresholds; chi-squared for categoricals.
Prediction drift: distribution distance on the model’s outputs P(Ŷ) — your cheapest, fastest leading indicator.
Calibration: ECE plus a periodic reliability diagram and Brier score, as the labels-based check on output trustworthiness and the validity gate for performance estimation.
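The two label-free checks not shown elsewhere in this piece, chi-squared for a categorical feature and distribution distance on the model’s output scores, can be sketched as follows. The category counts and Beta-distributed scores are synthetic, and Cramér’s V is added here as an assumed effect size so the categorical check does not rely on the p-value alone.

```python
import numpy as np
from scipy.stats import chi2_contingency, wasserstein_distance

# Categorical feature: counts per category in the reference and production windows.
ref_counts = np.array([5000, 3000, 2000])
prod_counts = np.array([4200, 3100, 2700])
table = np.vstack([ref_counts, prod_counts])

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V: chi-squared rescaled to [0, 1], usable as a thresholdable effect size.
n = table.sum()
cramers_v = float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

# Prediction drift: distance between the distributions of model output scores.
rng = np.random.default_rng(0)
ref_scores = rng.beta(2.0, 5.0, size=20_000)
prod_scores = rng.beta(2.5, 5.0, size=20_000)
prediction_drift = wasserstein_distance(ref_scores, prod_scores)

print(f"chi2={chi2:.1f}  dof={dof}  p={p_value:.2e}  Cramér's V={cramers_v:.3f}")
print(f"prediction drift (Wasserstein)={prediction_drift:.4f}")
```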

Each one is blind to something. Input-drift metrics never see a broken output relationship. Calibration never sees that the mix of inputs you receive has shifted. Prediction drift sees the symptom but never the cause. Run a metric for each question and you have coverage; run one metric for all three and you have a blind spot you cannot see, precisely because the metric you chose cannot show it to you.

For implementations, Evidently and Deepchecks ship the drift tests and calibration reports out of the box, scikit-learn covers calibration curves and post-hoc recalibration, and sentryml.com surveys how teams combine input-drift, prediction-drift, and calibration monitors across the stack.

Sources

Measuring Data Drift with the Population Stability Index (Fiddler AI): PSI thresholds, binning, and its relationship to KL divergence.
Which test is the best? Comparing 5 drift detection methods (Evidently AI): Empirical comparison of KS, PSI, KL, Jensen-Shannon, and Wasserstein, including KS’s large-sample false-alarm problem.
Probability calibration (scikit-learn): Calibration curves, Platt scaling, and isotonic regression for post-hoc recalibration.
Understanding Model Calibration (ECE) (ICLR 2025 Blogposts): Reliability diagrams, ECE, and why a single calibration scalar can mislead.
