Drift detection is the instrumentation discipline that prevents silent erosion. This article covers four drift types, the detection methods for each, threshold calibration to balance early warning against alarm fatigue, and the distinction between drift that harms value and drift that is cosmetic.
The four drift types
Type 1 — Data drift
Data drift is a change in the distribution of input features. A credit-scoring model trained on pre-pandemic application data sees different applicants after a recession; their income, employment, and debt distributions shift. A recommender trained on a particular product catalog sees the catalog expand; new items have no interaction history. A fraud model trained on desktop-browser transactions sees the traffic mix shift to mobile.
Detection methods include Kolmogorov-Smirnov tests on feature distributions, Population Stability Index (PSI) for categorical features, and Wasserstein distance for continuous features. Tools: Evidently (open-source), Arize, WhyLabs (commercial), plus custom Prometheus + Grafana implementations. The detection is relatively tractable — feature distributions are directly observable — but the threshold calibration is not (see §“Threshold calibration” below).
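The two most common of these checks are small enough to sketch directly. The snippet below is a minimal illustration, not the implementation used by Evidently or the commercial tools: a two-sample KS test for a continuous feature (via SciPy) and a hand-rolled PSI for a categorical feature. The function names, the `eps` smoothing constant, and the 0.05 significance level are assumptions for the sketch.

```python
import numpy as np
from scipy import stats

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two categorical samples.

    `expected` and `actual` are arrays of category labels; PSI compares
    their relative frequencies over the union of observed categories.
    """
    cats = np.union1d(np.unique(expected), np.unique(actual))
    e = np.array([np.mean(expected == c) for c in cats]) + eps
    a = np.array([np.mean(actual == c) for c in cats]) + eps
    return float(np.sum((a - e) * np.log(a / e)))

def ks_drift(reference, current, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test on a continuous feature."""
    stat, p = stats.ks_2samp(reference, current)
    return {"statistic": float(stat), "p_value": float(p),
            "drifted": bool(p < alpha)}
```

In practice these run per feature on a reference window versus a current window; the thresholds on their outputs are exactly what the calibration section below is about.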
Type 2 — Model drift
Model drift is a change in the model’s prediction distribution, typically because the underlying data relationship the model captured has shifted. A hiring model that predicts candidate success based on resume features drifts when roles evolve. A revenue-forecasting model drifts when pricing strategy changes.
Detection methods include prediction-distribution monitoring (outputs should have a stable distribution in stable environments), ground-truth lag analysis (when labels eventually arrive, accuracy should hold), and segment-level drift (overall accuracy may hold while subgroup accuracy decays).
The segment-level check is particularly important. Overall accuracy can mask catastrophic subgroup decay. The Dutch Toeslagenaffaire is the canonical example: the algorithm’s overall metrics were consistent for years while its accuracy on certain ethnic subgroups decayed into systematic wrongful accusations.[1] A drift system watching only aggregate accuracy would have reported green while the system was failing.
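The segment-level check amounts to comparing per-segment accuracy against a reference window instead of trusting the aggregate. A minimal sketch, assuming prediction records arrive as dicts with a segment field, a label, and a prediction (the field names and the 0.05 decay threshold are placeholders, not a standard):

```python
from collections import defaultdict

def segment_accuracy(records, segment_key):
    """records: iterable of dicts carrying segment_key, 'label', 'pred'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        seg = r[segment_key]
        totals[seg] += 1
        hits[seg] += int(r["label"] == r["pred"])
    return {seg: hits[seg] / totals[seg] for seg in totals}

def decayed_segments(reference, current, segment_key, max_drop=0.05):
    """Segments whose accuracy fell more than max_drop versus the
    reference window, even when overall accuracy held steady."""
    ref_acc = segment_accuracy(reference, segment_key)
    cur_acc = segment_accuracy(current, segment_key)
    return {seg: (ref_acc[seg], cur_acc[seg])
            for seg in cur_acc
            if seg in ref_acc and ref_acc[seg] - cur_acc[seg] > max_drop}
```

The point of the structure is that the aggregate never appears: each segment is judged against its own history, which is what an aggregate-only monitor cannot do.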
Type 3 — Behaviour drift
Behaviour drift is a change in how users interact with the AI feature. Users may learn to game the system (SEO adapted to search-ranking algorithms is the classic case). They may over-rely or under-rely in ways that violate the feature’s design intent. They may adapt to the feature’s output in ways that alter the feature’s observed value.
Detection methods include interaction-pattern monitoring (clicks per suggestion, override rate, time-to-action), suggestion-acceptance rate, and user-cohort comparisons (are new users behaving like early users?). Microsoft has published extensively on Copilot behaviour-drift patterns; one finding is that suggestion-acceptance rates decline over the first six months of use as users internalize the feature’s strengths and weaknesses.[2]
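The cohort comparison can be sketched in a few lines. This is an illustration only: the event fields (`cohort`, `accepted`), the choice of the earliest cohort as the baseline, and the 0.10 divergence gap are assumptions, to be replaced by whatever the feature's instrumentation actually records.

```python
from collections import defaultdict

def acceptance_by_cohort(events):
    """Suggestion-acceptance rate per user cohort."""
    acc, tot = defaultdict(int), defaultdict(int)
    for e in events:
        tot[e["cohort"]] += 1
        acc[e["cohort"]] += int(e["accepted"])
    return {c: acc[c] / tot[c] for c in tot}

def diverging_cohorts(events, baseline_cohort, max_gap=0.10):
    """Cohorts whose acceptance rate differs from the baseline cohort
    by more than max_gap -- a candidate behaviour-drift signal."""
    rates = acceptance_by_cohort(events)
    base = rates[baseline_cohort]
    return {c: r for c, r in rates.items()
            if c != baseline_cohort and abs(r - base) > max_gap}
```

Note this requires user-interaction events, not model inputs or outputs, which is why behaviour drift goes unmeasured when only the data-science pipeline is instrumented.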
Behaviour drift is frequently the most consequential and the least-measured of the four. Many enterprise AI programs instrument data and model drift thoroughly but ignore behaviour drift because it requires user-interaction instrumentation that data-science teams do not naturally build.
Type 4 — Environment drift
Environment drift is a change in the broader context — regulatory regime, market structure, competitive landscape, customer expectations. A compliance-assistant AI calibrated to 2023 EU AI Act drafts drifts when the final Act publishes with different classification thresholds. A customer-service copilot calibrated to 2022 customer expectations drifts when the post-pandemic service norm shifts.
Environment drift is the hardest to detect automatically. Detection relies on a quarterly or semi-annual environment review by domain experts rather than on statistical signals. Environment drift is tracked under the VRR section 5 risk-flag category “environment risks.”
Threshold calibration
Threshold calibration is the craft that distinguishes useful drift systems from alarm-fatigue generators. Three principles guide it.
Principle 1 — Couple the threshold to realized value
A drift alert is worth firing only when the drift plausibly erodes realized value. PSI > 0.25 is the standard data-drift “major shift” threshold; PSI > 0.1 is “minor shift.” But for a feature where even major data drift historically has not eroded realized value (because the model generalizes well), firing an alert at PSI 0.25 produces noise. For a feature where minor drift has repeatedly eroded value, the threshold should be tightened to PSI 0.1.
The calibration is feature-specific. The only reliable way to calibrate is to look at historical drift signals alongside historical realized-value time series and find the correlations.
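One crude way to ground that calibration is to correlate the historical drift-signal series (say, weekly PSI on a key feature) with the realized-value series it is meant to protect: a strong negative correlation argues for the tight threshold, no correlation for the loose one. A minimal sketch; the correlation cut-off of -0.5 and the reuse of the standard 0.1/0.25 PSI levels are assumptions for illustration.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def suggest_threshold(psi_series, value_series, tight=0.1, loose=0.25):
    """Tighten to the 'minor shift' PSI level when historical drift
    strongly anti-correlates with realized value; otherwise keep the
    standard 'major shift' level."""
    r = pearson(psi_series, value_series)
    return tight if r < -0.5 else loose
```

A real calibration would lag the value series (drift precedes value decline) and use more history, but the shape of the analysis is the same.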
Principle 2 — Layer the alerts
A well-designed drift system has three alert layers. The watch layer logs drift quietly — visible in dashboards, not pushed to humans. The warn layer sends a notification to the feature team for review at next standup. The escalate layer pages an on-call and initiates the incident response flow. Threshold tuning sets which drift magnitudes reach which layer.
The layered approach lets the feature team see drift early without being paged every time an input distribution wiggles. It also provides an audit trail — a drift incident can be traced back through watch-layer logs to show when the drift started and when it escalated.
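The routing logic itself is trivial; what matters is that the cut-offs between layers are the per-feature calibrated values from Principle 1, not global defaults. A sketch, with placeholder magnitudes:

```python
def route_alert(drift_magnitude, watch=0.1, warn=0.25, escalate=0.5):
    """Map a drift magnitude (e.g. PSI) to an alert layer:
    'none', 'watch' (dashboard only), 'warn' (notify the feature
    team), or 'escalate' (page on-call, open an incident)."""
    if drift_magnitude >= escalate:
        return "escalate"
    if drift_magnitude >= warn:
        return "warn"
    if drift_magnitude >= escalate or drift_magnitude >= watch:
        return "watch"
    return "none"
```

Logging the routed layer alongside the magnitude is what produces the audit trail described above: the watch-layer entries show when the drift started even if no human saw it at the time.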
Principle 3 — Sunset noisy alerts
A drift alert that has fired weekly for three months without producing a true positive is a candidate for retirement or threshold loosening. Alert graveyards accumulate; alerts with zero action history train responders to ignore all alerts. A quarterly alert review is the governance control — the feature team reviews every alert, its fire rate, its true-positive rate, and retires or recalibrates accordingly.
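The quarterly review can be partly mechanized: compute each alert's fire count and true-positive rate and flag the retire-or-recalibrate candidates for human judgment. The record shape and the cut-offs (at least 12 fires, under 5% true positives) are illustrative assumptions, roughly matching the weekly-for-three-months example above.

```python
def review_alerts(alert_stats, min_fires=12, min_tp_rate=0.05):
    """alert_stats: {name: {'fires': int, 'true_positives': int}}.
    Returns a recommended action per alert for the quarterly review."""
    actions = {}
    for name, s in alert_stats.items():
        if s["fires"] == 0:
            actions[name] = "keep"  # never fired; nothing to judge yet
            continue
        tp_rate = s["true_positives"] / s["fires"]
        if s["fires"] >= min_fires and tp_rate < min_tp_rate:
            actions[name] = "retire-or-loosen"
        else:
            actions[name] = "keep"
    return actions
```

The output is a review agenda, not an automatic action; retiring an alert remains a governance decision.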
Cosmetic vs. value-eroding drift
A major judgment call: not all drift erodes value. A new season’s product launch shifts input feature distributions in a recommender system; the system adapts; realized value is unaffected. A language-model provider’s minor version update changes output token distributions; downstream metrics are unchanged. These are cosmetic drifts.
The test for value-eroding drift is simple: does realized value — the top of the KPI tree — move? A drift signal that does not correlate with realized-value change over three subsequent observations is probably cosmetic. A drift signal that consistently precedes realized-value decline is value-eroding.
Distinguishing the two requires that drift monitoring and realized-value monitoring run in parallel in the same harness. Systems that monitor drift in isolation from business-outcome evaluation cannot make the distinction and therefore cannot tell the feature team which alerts matter.
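When the two monitors do run side by side, the three-observation test above can be applied mechanically. A crude sketch, assuming a realized-value time series and the indices at which drift signals fired; the majority rule and the fixed window are illustrative simplifications.

```python
def classify_drift(signal_times, value_series, window=3):
    """Classify a drift signal as cosmetic or value-eroding by checking
    whether realized value declined over the `window` observations after
    each firing. Majority of firings followed by decline => value-eroding."""
    declines = 0
    for t in signal_times:
        after = value_series[t + 1 : t + 1 + window]
        if len(after) == window and after[-1] < value_series[t]:
            declines += 1
    return "value-eroding" if declines > len(signal_times) / 2 else "cosmetic"
```

The classification is only as good as the realized-value series, which is why it cannot be computed by a drift monitor running in isolation.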
Drift response playbook
When drift is detected and confirmed as value-eroding, three response paths are standard.
Path 1 — Retrain
If drift is in data or model and the relationship has shifted gradually, retraining on recent data often restores performance. This is the default for supervised models with labelled ground truth.
Path 2 — Redesign
If drift is in behaviour or environment, retraining will not solve it. The feature or its surrounding UX must be redesigned. A copilot whose suggestion-acceptance rate has collapsed because users have learned the suggestions are unreliable in a specific task type needs UX work — hiding the unreliable suggestion category — not retraining.
Path 3 — Retire
If drift is structural and redesign is not feasible, the sunset case from Article 32 is the right path. Not every AI feature survives its drift cycle; the discipline is to make the retire decision on evidence rather than on sunk-cost preservation.
Published post-mortem examples
Published enterprise drift incidents are rare because companies rarely publish operational post-mortems. A few public references are instructive: the Dutch Toeslagenaffaire parliamentary inquiry for a subgroup-drift case; the UK National Audit Office’s public-sector AI reviews for under-instrumented drift; documented MLOps post-mortems from Netflix and Airbnb tech blogs that describe retraining and redesign responses in product contexts.[3]
Reading published post-mortems alongside the four drift types sharpens the threshold-calibration discipline. Patterns repeat; analysts who have read a dozen post-mortems spot drift earlier than analysts who have not.
Cross-reference to Core Stream
- EATP-Level-2/M2.5-Art11-Designing-Measurement-Frameworks-for-Agentic-AI-Systems.md — drift measurement for agentic systems.
- EATF-Level-1/M1.5-Art08-Model-Governance-and-Lifecycle-Management.md — model-lifecycle governance of retraining/redesign decisions.
Self-check
- A feature shows PSI of 0.18 on every major input every week for six months, with no realized-value decline. What is the appropriate threshold response?
- A hiring model shows stable overall accuracy but declining accuracy on a specific demographic segment. Which drift type is this, and what is the response?
- A drift alert has fired weekly for four months with no true-positive action. What governance control applies?
- A compliance-assistant AI’s downstream outputs are unchanged but the EU AI Act publishes final classification thresholds. Which drift type, and what is the response?
Further reading
- NIST AI RMF Generative AI Profile (NIST AI 600-1, July 2024), §2.6 “Performance” and §2.12 “Value Chain and Component Integration.”
- Evidently AI, Guide to Data and ML Model Monitoring (open-source documentation).
- Published MLOps post-mortems (Netflix Technology Blog, Airbnb Engineering, Uber Engineering).
Footnotes
1. Parlementaire ondervragingscommissie Kinderopvangtoeslag, Ongekend Onrecht (Dutch parliamentary inquiry final report, December 2020). https://www.tweedekamer.nl/kamerstukken/detail?id=2020D51917
2. Microsoft Corporation, Work Trend Index Special Report: What Can Copilot’s Earliest Users Teach Us About Generative AI at Work? (2023). https://www.microsoft.com/en-us/worklab/work-trend-index
3. UK National Audit Office, Use of Artificial Intelligence in Government (March 2024). https://www.nao.org.uk/reports/use-of-artificial-intelligence-in-government/