Silent Quality Decay in Production LLM Apps: How to Detect Drift Before Users Do
Your eval scores are green. Customer complaints are up. The gap between offline metrics and production reality is the biggest reliability problem in LLM ops — here's how to close it.
A pattern I’ve now seen at four different companies: the team has eval coverage. The eval scores look fine. Customer complaints are quietly increasing. By the time someone correlates the two, two months of usage data has been lost.
The eval suite is testing a frozen distribution. Production traffic has drifted. The model still passes the eval; it’s just no longer answering the questions users actually ask.
Here’s how to detect this — and a few related failure modes — before customers do.
The drift modes that matter
Three distinct drift problems get conflated. Treat them separately.
1. Input distribution drift
User questions are changing. New product features create new query types. Seasonal shifts (returns season, tax season, regulatory deadlines) change topic mix. The model’s behavior on the new mix is untested.
Signal: embedding-space comparison of week-over-week input distributions. Track the distance between the production sample and the eval set. When the production centroid sits more than 0.3 in cosine distance from the eval centroid, your eval is stale.
Detector (cheap to run):
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-mpnet-base-v2')

def drift_score(eval_embeds: np.ndarray, prod_embeds: np.ndarray) -> float:
    # Cosine distance between the eval-set centroid and the production-sample centroid.
    eval_centroid = eval_embeds.mean(axis=0)
    prod_centroid = prod_embeds.mean(axis=0)
    cos_sim = np.dot(eval_centroid, prod_centroid) / (
        np.linalg.norm(eval_centroid) * np.linalg.norm(prod_centroid)
    )
    return float(1.0 - cos_sim)
Run on a sample of 1k production requests vs. your eval set, weekly. Plot. Alert at +1 stddev above the rolling 6-week mean.
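A minimal weekly job around that detector might look like this; load_eval_questions and sample_prod_requests are hypothetical loaders for your own eval store and request logs:

eval_texts = load_eval_questions()             # frozen eval set (hypothetical loader)
prod_texts = sample_prod_requests(n=1000)      # last week's production sample (hypothetical loader)

eval_embeds = model.encode(eval_texts, normalize_embeddings=True)
prod_embeds = model.encode(prod_texts, normalize_embeddings=True)

score = drift_score(eval_embeds, prod_embeds)  # append to your metrics store and plot weekly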
2. Output quality drift
The model is producing worse answers, but no single failure mode is severe enough to throw a clear alarm. Symptoms:
- Hallucination rate up 2 percentage points week-over-week
- Refusal rate up (model getting more cautious)
- Average response length changed by >20%
- Tool-call argument quality dropped (more retries)
Signal: a small, judge-LLM-graded eval running on a sample of production requests, not on the frozen eval set. Critically, this should be a different eval from your CI suite: the production sample needs to track the actual production distribution, not the synthetic one.
Implementation:
- Log every request with request_id, feature, output_text
- Hourly: pull a stratified random sample (say 50 per feature)
- Have a judge model score each on: helpfulness, accuracy, on-policy
- Aggregate by feature, compare to last week’s score
The judge model adds cost. Budget for it; it’s worth it.
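A rough sketch of that hourly job. sample_requests_by_feature and judge are hypothetical helpers (one reads your request logs, the other wraps whichever judge model you pick); treat this as a shape, not a drop-in implementation:

import statistics
from collections import defaultdict

DIMENSIONS = ("helpfulness", "accuracy", "on_policy")

def hourly_judge_pass(features, per_feature=50):
    # Stratified sample -> judge scores -> per-feature aggregates.
    aggregates = {}
    for feature in features:
        rows = sample_requests_by_feature(feature, n=per_feature)  # hypothetical: reads request logs
        scores = defaultdict(list)
        for row in rows:
            graded = judge(row["output_text"])                     # hypothetical: returns {dim: score}
            for dim in DIMENSIONS:
                scores[dim].append(graded[dim])
        aggregates[feature] = {d: statistics.mean(scores[d]) for d in DIMENSIONS}
    return aggregates  # write to the metrics store; compare against last week's values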
3. Provider-side model drift
The provider silently changed the model. They're allowed to; the terms of service permit it for non-frozen versions. Your "claude-sonnet-4" today is a different artifact from yesterday's. Anthropic's model cards document the snapshot dates; pinning to a specific date is the workaround.
Signal: a small, deterministic eval running every hour on the model. If response patterns shift suddenly without an upstream code change, the provider deployed something. (Set temperature=0, fixed prompts, hash the outputs.)
Mitigation: pin to date-stamped model versions. Rotate intentionally, with eval gates.
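A sketch of the hourly canary, assuming a hypothetical complete() wrapper around your provider client. One caveat worth encoding: providers don't guarantee bit-exact outputs even at temperature=0, so alert on a sustained hash change rather than a single flip:

import hashlib

CANARY_PROMPTS = [
    "Summarize the refund policy in one sentence.",
    "List the three required fields for a support ticket.",
    # ...a dozen fixed prompts covering your main features
]

def canary_hash(model_name: str) -> str:
    # Run the fixed prompts deterministically and hash the concatenated outputs.
    outputs = [complete(model_name, p, temperature=0) for p in CANARY_PROMPTS]  # hypothetical client wrapper
    return hashlib.sha256("\n".join(outputs).encode("utf-8")).hexdigest()

# Hourly: compare canary_hash(...) to the last stored value; a change with no
# upstream code change suggests the provider shipped a new artifact.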
What “good” looks like
A reasonable production setup:
- Per-feature input drift score, weekly, plotted with rolling band
- Per-feature judge-graded sample score, hourly, with anomaly bands per dimension (helpfulness/accuracy/on-policy)
- Per-model deterministic-eval hash, hourly, alert on change
- Customer-reported error rate (thumbs-down on responses, support tickets tagged with the feature) — not technically drift detection, but the lagging-indicator ground truth
The point of the leading indicators is to catch issues 24-48 hours before the lagging ones. Without leading indicators, you’re operating on customer complaints.
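For the rolling-band alerts on the first two indicators, a small helper is enough (a sketch; the six-week window matches the weekly drift score above, shorten it for hourly series):

import statistics

def breaches_rolling_band(history, latest, window=6, k=1.0):
    # Alert when the latest value sits more than k stddevs above the rolling mean.
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough history yet; don't page anyone
    return latest > statistics.mean(recent) + k * statistics.stdev(recent)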
A common anti-pattern
“We have evals. We run them in CI. We’re good.”
CI evals are useful for catching regressions in code. They don’t catch drift in production traffic, drift in upstream model behavior, or drift in user expectations. The eval set is a museum piece by month two.
Production sampling + judge grading is a separate system. Build it.
The judge model question
Using a model to grade another model’s outputs is uncomfortable but practical at this point. The trade-off:
- A weaker judge introduces noise
- A more capable judge costs more
- A same-model judge has correlation issues (if the production model is wrong, the judge tends to agree)
What works in practice: use a different family for the judge. If production runs Claude, judge with GPT-4-class. If production runs GPT, judge with Claude. The disagreement is the signal.
The judge prompt should be calibrated. We use a 4-point scale (excellent / acceptable / poor / unsafe), not a 10-point. Coarse buckets correlate better with downstream outcomes than fine-grained scores.
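An illustrative judge prompt along those lines (not a verbatim template; the JSON shape and dimension names are assumptions):

# Filled per request with .format(question=..., answer=...).
JUDGE_PROMPT = """You are grading a customer-facing answer produced by another model.

Question: {question}
Answer: {answer}

Rate the answer on each dimension using exactly one of: excellent, acceptable, poor, unsafe.
- helpfulness: does it address what the user actually asked?
- accuracy: are the factual claims correct and grounded?
- on_policy: does it stay within the product's allowed scope and tone?

Respond as JSON: {{"helpfulness": "...", "accuracy": "...", "on_policy": "...", "rationale": "one sentence"}}"""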
Tooling
Phoenix (Arize) ships drift visualization out of the box; instrument with OpenLLMetry and you get input drift charts for free. For judge-grading, no off-the-shelf solution is great; most teams write their own scorer that runs on a cron.
What we don’t recommend
- Don’t alert on individual judge scores. Sample noise will burn the on-call. Alert on the rolling aggregate.
- Don’t try to detect drift via the model’s self-confidence scores. They’re poorly calibrated in production.
- Don’t conflate retrieval drift with model drift. RAG quality decay is its own failure mode — track it on the retrieval layer, not the LLM layer.
- Don’t run the judge on every request. The cost is the same as running the production model; budget kills the project.
The discipline that makes this work isn’t fancy infrastructure — it’s the weekly review meeting where someone owns the dashboards, looks at the trends, and asks “what changed?” Without that ritual, the metrics rot inside two months and the team is back to operating on complaints. With it, drift becomes visible and addressable. That’s the difference between an operationalized LLM app and one that’s drifting toward replacement.