Silent Quality Decay in Production LLM Apps: How to Detect Drift Before Users Do
Your eval scores are green. Customer complaints are up. The gap between offline metrics and production reality is the biggest reliability problem in LLM ops — here's how to close it.
A pattern I’ve now seen at four different companies: the team has eval coverage. The eval scores look fine. Customer complaints are quietly increasing. By the time someone correlates the two, two months of usage data has been lost.
The eval suite is testing a frozen distribution. Production traffic has drifted. The model still passes the eval; it’s just no longer answering the questions users actually ask.
Here’s how to detect this — and a few related failure modes — before customers do.
The drift modes that matter
Three distinct drift problems get conflated. Treat them separately.
1. Input distribution drift
User questions are changing. New product features create new query types. Seasonal shifts (returns season, tax season, regulatory deadlines) change topic mix. The model’s behavior on the new mix is untested.
Signal: embedding-space comparison of week-over-week input distributions. Track the distance between the production sample and the eval set. When the production centroid sits more than 0.3 in cosine distance from the eval centroid, your eval is stale.
Detector (cheap to run):
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-mpnet-base-v2')

def drift_score(eval_embeds: np.ndarray, prod_embeds: np.ndarray) -> float:
    # Cosine distance between the eval-set centroid and the production-sample centroid.
    eval_centroid = eval_embeds.mean(axis=0)
    prod_centroid = prod_embeds.mean(axis=0)
    cos_sim = np.dot(eval_centroid, prod_centroid) / (
        np.linalg.norm(eval_centroid) * np.linalg.norm(prod_centroid)
    )
    return float(1.0 - cos_sim)
Run on a sample of 1k production requests vs. your eval set, weekly. Plot. Alert at +1 stddev above the rolling 6-week mean.
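A minimal weekly job around that detector might look like this; load_eval_questions and sample_prod_requests are hypothetical loaders for your own eval store and request logs:

eval_texts = load_eval_questions()             # frozen eval set (hypothetical loader)
prod_texts = sample_prod_requests(n=1000)      # last week's production sample (hypothetical loader)

eval_embeds = model.encode(eval_texts, normalize_embeddings=True)
prod_embeds = model.encode(prod_texts, normalize_embeddings=True)

score = drift_score(eval_embeds, prod_embeds)  # append to your metrics store and plot weekly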
2. Output quality drift
The model is producing worse answers, but no single failure mode is severe enough to throw a clear alarm. Symptoms:
- Hallucination rate up 2 percentage points week-over-week
- Refusal rate up (model getting more cautious)
- Average response length changed by >20%
- Tool-call argument quality dropped (more retries)
Signal: a small, judge-LLM-graded eval running on a sample of production requests, not on the frozen eval set. Critically, this should be a different eval from your CI suite: the production sample needs to track the actual production distribution, not the synthetic one.
Implementation:
- Log every request with request_id, feature, output_text
- Hourly: pull a stratified random sample (say 50 per feature)
- Have a judge model score each on: helpfulness, accuracy, on-policy
- Aggregate by feature, compare to last week’s score
The judge model adds cost. Budget for it; it’s worth it.
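A rough sketch of that hourly job. sample_requests_by_feature and judge are hypothetical helpers (one reads your request logs, the other wraps whichever judge model you pick); treat this as a shape, not a drop-in implementation:

import statistics
from collections import defaultdict

DIMENSIONS = ("helpfulness", "accuracy", "on_policy")

def hourly_judge_pass(features, per_feature=50):
    # Stratified sample -> judge scores -> per-feature aggregates.
    aggregates = {}
    for feature in features:
        rows = sample_requests_by_feature(feature, n=per_feature)  # hypothetical: reads request logs
        scores = defaultdict(list)
        for row in rows:
            graded = judge(row["output_text"])                     # hypothetical: returns {dim: score}
            for dim in DIMENSIONS:
                scores[dim].append(graded[dim])
        aggregates[feature] = {d: statistics.mean(scores[d]) for d in DIMENSIONS}
    return aggregates  # write to the metrics store; compare against last week's values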
3. Provider-side model drift
The provider silently changed the model. They're allowed to; the terms of service permit it for non-frozen versions. Your "claude-sonnet-4" today is a different artifact from yesterday's. Anthropic's model cards document the snapshot dates; pinning to a specific date is the workaround.
Signal: a small, deterministic eval running every hour on the model. If response patterns shift suddenly without an upstream code change, the provider deployed something. (Set temperature=0, fixed prompts, hash the outputs.)
Mitigation: pin to date-stamped model versions. Rotate intentionally, with eval gates.
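A sketch of the hourly canary, assuming a hypothetical complete() wrapper around your provider client. One caveat worth encoding: providers don't guarantee bit-exact outputs even at temperature=0, so alert on a sustained hash change rather than a single flip:

import hashlib

CANARY_PROMPTS = [
    "Summarize the refund policy in one sentence.",
    "List the three required fields for a support ticket.",
    # ...a dozen fixed prompts covering your main features
]

def canary_hash(model_name: str) -> str:
    # Run the fixed prompts deterministically and hash the concatenated outputs.
    outputs = [complete(model_name, p, temperature=0) for p in CANARY_PROMPTS]  # hypothetical client wrapper
    return hashlib.sha256("\n".join(outputs).encode("utf-8")).hexdigest()

# Hourly: compare canary_hash(...) to the last stored value; a change with no
# upstream code change suggests the provider shipped a new artifact.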
What “good” looks like
A reasonable production setup:
- Per-feature input drift score, weekly, plotted with rolling band
- Per-feature judge-graded sample score, hourly, with anomaly bands per dimension (helpfulness/accuracy/on-policy)
- Per-model deterministic-eval hash, hourly, alert on change
- Customer-reported error rate (thumbs-down on responses, support tickets tagged with the feature) — not technically drift detection, but the lagging-indicator ground truth
The point of the leading indicators is to catch issues 24-48 hours before the lagging ones. Without leading indicators, you’re operating on customer complaints.
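For the rolling-band alerts on the first two indicators, a small helper is enough (a sketch; the six-week window matches the weekly drift score above, shorten it for hourly series):

import statistics

def breaches_rolling_band(history, latest, window=6, k=1.0):
    # Alert when the latest value sits more than k stddevs above the rolling mean.
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough history yet; don't page anyone
    return latest > statistics.mean(recent) + k * statistics.stdev(recent)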
A common anti-pattern
“We have evals. We run them in CI. We’re good.”
CI evals are useful for catching regressions in code. They don’t catch drift in production traffic, drift in upstream model behavior, or drift in user expectations. The eval set is a museum piece by month two.
Production sampling + judge grading is a separate system. Build it.
The judge model question
Using a model to grade another model’s outputs is uncomfortable but practical at this point. The trade-off:
- A weaker judge introduces noise
- A more capable judge costs more
- A same-model judge has correlation issues (if the production model is wrong, the judge tends to agree)
What works in practice: use a different family for the judge. If production runs Claude, judge with GPT-4-class. If production runs GPT, judge with Claude. The disagreement is the signal.
The judge prompt should be calibrated. We use a 4-point scale (excellent / acceptable / poor / unsafe), not a 10-point. Coarse buckets correlate better with downstream outcomes than fine-grained scores.
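An illustrative judge prompt along those lines (not a verbatim template; the JSON shape and dimension names are assumptions):

# Filled per request with .format(question=..., answer=...).
JUDGE_PROMPT = """You are grading a customer-facing answer produced by another model.

Question: {question}
Answer: {answer}

Rate the answer on each dimension using exactly one of: excellent, acceptable, poor, unsafe.
- helpfulness: does it address what the user actually asked?
- accuracy: are the factual claims correct and grounded?
- on_policy: does it stay within the product's allowed scope and tone?

Respond as JSON: {{"helpfulness": "...", "accuracy": "...", "on_policy": "...", "rationale": "one sentence"}}"""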
Tooling
Phoenix (Arize) ships drift visualization out of the box; instrument with OpenLLMetry and you get input drift charts for free. For judge-grading, no off-the-shelf solution is great; most teams write their own scorer that runs on a cron.
What we don’t recommend
- Don’t alert on individual judge scores. Sample noise will burn the on-call. Alert on the rolling aggregate.
- Don’t try to detect drift via the model’s self-confidence scores. They’re poorly calibrated in production.
- Don’t conflate retrieval drift with model drift. RAG quality decay is its own failure mode — track it on the retrieval layer, not the LLM layer.
- Don’t run the judge on every request. The cost is the same as running the production model; budget kills the project.
The discipline that makes this work isn’t fancy infrastructure — it’s the weekly review meeting where someone owns the dashboards, looks at the trends, and asks “what changed?” Without that ritual, the metrics rot inside two months and the team is back to operating on complaints. With it, drift becomes visible and addressable. That’s the difference between an operationalized LLM app and one that’s drifting toward replacement.