ML Monitoring Report

Silent Quality Decay in Production LLM Apps: How to Detect Drift Before Users Do

Your eval scores are green. Customer complaints are up. The gap between offline metrics and production reality is the biggest reliability problem in LLM ops — here's how to close it.

By Priya Anand · 8 min read

A pattern I’ve now seen at four different companies: the team has eval coverage. The eval scores look fine. Customer complaints are quietly increasing. By the time someone correlates the two, two months of usage data has been lost.

The eval suite is testing a frozen distribution. Production traffic has drifted. The model still passes the eval; it’s just no longer answering the questions users actually ask.

Here’s how to detect this — and a few related failure modes — before customers do.

The drift modes that matter

Three distinct drift problems get conflated. Treat them separately.

1. Input distribution drift

User questions are changing. New product features create new query types. Seasonal shifts (returns season, tax season, regulatory deadlines) change topic mix. The model’s behavior on the new mix is untested.

Signal: embedding-space comparison of week-over-week input distributions. Track cosine distance from the eval set; when production drifts more than 0.3 from the eval centroid, your eval is stale.

Detector (cheap to run):

from sentence_transformers import SentenceTransformer
import numpy as np

# Embed each query set with model.encode(queries, normalize_embeddings=True).
model = SentenceTransformer('all-mpnet-base-v2')

def drift_score(eval_embeds: np.ndarray, prod_embeds: np.ndarray) -> float:
    # Cosine distance between the centroids of the eval and production distributions.
    eval_centroid = eval_embeds.mean(axis=0)
    prod_centroid = prod_embeds.mean(axis=0)
    cos_sim = np.dot(eval_centroid, prod_centroid) / (
        np.linalg.norm(eval_centroid) * np.linalg.norm(prod_centroid))
    return float(1.0 - cos_sim)

Run on a sample of 1k production requests vs. your eval set, weekly. Plot. Alert at +1 stddev above the rolling 6-week mean.
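
A sketch of the weekly job around that function (the query loaders and the alert hook are stand-ins for whatever storage and paging you already use):

import numpy as np

def weekly_drift_check(eval_queries: list[str], prod_sample: list[str],
                       history: list[float]) -> float:
    # Reuses `model` and `drift_score` from the snippet above.
    eval_embeds = model.encode(eval_queries, normalize_embeddings=True)
    prod_embeds = model.encode(prod_sample, normalize_embeddings=True)
    score = drift_score(eval_embeds, prod_embeds)

    # Alert at +1 stddev above the rolling 6-week mean, once there is enough history.
    window = history[-6:]
    if len(window) >= 2 and score > np.mean(window) + np.std(window):
        print(f"ALERT: input drift {score:.3f} above rolling band")  # swap for your pager
    history.append(score)
    return score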

2. Output quality drift

The model is producing worse answers, but no single failure mode is severe enough to throw a clear alarm; the decay shows up as a slow slide in answer quality rather than hard errors.

Signal: a small, judge-LLM-graded eval running on a sample of production requests, not on the frozen eval set. Critically, this is a different eval from your CI suite: the production sample needs to track the actual production distribution, not the synthetic one.

Implementation:
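
One minimal shape this can take, sketched under assumptions rather than prescribed: prod_logs and judge_prompt are placeholders for your own logging and rubric, and the OpenAI client stands in for whichever judge family you pick (more on that choice below).

import random
from openai import OpenAI  # judge from a different family than the production model

judge = OpenAI()

def grade_production_sample(prod_logs: list[dict], judge_prompt: str, n: int = 200) -> dict:
    # prod_logs is assumed to hold {"question": ..., "answer": ...} pairs from live traffic.
    counts: dict[str, int] = {}
    for row in random.sample(prod_logs, min(n, len(prod_logs))):
        resp = judge.chat.completions.create(
            model="gpt-4o",  # whichever judge model you standardize on
            temperature=0,
            messages=[{"role": "user", "content": judge_prompt.format(**row)}],
        )
        label = resp.choices[0].message.content.strip().lower()
        counts[label] = counts.get(label, 0) + 1
    return counts  # trend the share of poor/unsafe labels day over day

Run it on a cron against each day's sample; a slow climb in the low-grade share is exactly the decay this section is about.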

The judge model adds cost. Budget for it; it’s worth it.

3. Provider-side model drift

The provider silently changed the model. They’re allowed to — terms of service permit it for non-frozen versions. Your “claude-sonnet-4” today is a different artifact than yesterday’s. Anthropic’s model cards document the snapshot dates; pinning to a specific date is the workaround.

Signal: a small, deterministic eval running every hour on the model. If response patterns shift suddenly without an upstream code change, the provider deployed something. (Set temperature=0, fixed prompts, hash the outputs.)
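
A sketch of such a canary; call_model stands in for your actual provider client, and the pinned model ID is illustrative:

import hashlib

# Pin to a date-stamped snapshot so intentional rotations are the only expected change.
PINNED_MODEL = "claude-sonnet-4-20250514"  # illustrative; use the snapshot you actually pin

CANARY_PROMPTS = [
    "List the three primary colors.",
    "What is 17 * 23?",
    # ...a dozen fixed prompts covering your main task types
]

def canary_fingerprint(call_model) -> str:
    # call_model(prompt, model=..., temperature=0) is a stand-in for your provider client.
    outputs = [call_model(p, model=PINNED_MODEL, temperature=0) for p in CANARY_PROMPTS]
    return hashlib.sha256("\n".join(outputs).encode()).hexdigest()

Store each hour's fingerprint; if it changes with no deploy on your side, the provider changed something. If exact hashes flap because temperature=0 isn't perfectly stable for your provider, compare outputs with a similarity threshold instead of strict equality.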

Mitigation: pin to date-stamped model versions. Rotate intentionally, with eval gates.

What “good” looks like

A reasonable production setup runs all three detectors above as leading indicators: the input drift score, the judge-graded production sample, and the provider canary. Complaint volume and explicit user feedback are the lagging indicators.

The point of the leading indicators is to catch issues 24-48 hours before the lagging ones. Without leading indicators, you’re operating on customer complaints.

A common anti-pattern

“We have evals. We run them in CI. We’re good.”

CI evals are useful for catching regressions in code. They don’t catch drift in production traffic, drift in upstream model behavior, or drift in user expectations. The eval set is a museum piece by month two.

Production sampling + judge grading is a separate system. Build it.

The judge model question

Using a model to grade another model's outputs is uncomfortable but practical at this point; the catch is that the judge has blind spots of its own.

What works in practice: use a different family for the judge. If production runs Claude, judge with GPT-4-class. If production runs GPT, judge with Claude. The disagreement is the signal.

The judge prompt should be calibrated. We use a 4-point scale (excellent / acceptable / poor / unsafe), not a 10-point. Coarse buckets correlate better with downstream outcomes than fine-grained scores.
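
For illustration, a rubric in that spirit (a sketch of the shape, not the exact wording), which plugs into the grading loop sketched earlier as judge_prompt:

JUDGE_PROMPT = """You are grading a production answer from a customer-facing assistant.

Question: {question}
Answer: {answer}

Grade the answer on this 4-point scale and reply with the label only:
- excellent: fully answers the question, accurate, nothing misleading
- acceptable: answers the question with minor gaps or awkwardness
- poor: incomplete, off-topic, or likely to frustrate the user
- unsafe: harmful, policy-violating, or confidently wrong about something that matters
"""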

Tooling

Phoenix (Arize) ships drift visualization out of the box; instrument with OpenLLMetry and you get input drift charts for free. For judge-grading, no off-the-shelf solution is great; most teams write their own scorer that runs on a cron.

What we don’t recommend

The discipline that makes this work isn’t fancy infrastructure — it’s the weekly review meeting where someone owns the dashboards, looks at the trends, and asks “what changed?” Without that ritual, the metrics rot inside two months and the team is back to operating on complaints. With it, drift becomes visible and addressable. That’s the difference between an operationalized LLM app and one that’s drifting toward replacement.

Sources

  1. Phoenix (Arize) Drift Documentation
  2. OpenLLMetry Span Conventions
  3. Anthropic Model Cards
#drift-detection #monitoring #production-llm #eval #quality