AITM M1.1-Art10 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Drift Monitoring, Incident Classification, and Sustainment


12 min read Article 10 of 15

This article teaches the practitioner to design monitors for each drift class, to classify incidents against a five-level severity scheme, to trigger response proportional to severity, and to track the small set of readiness metrics that summarize the data foundation’s trajectory. The reference frame is NIST AI RMF MANAGE 1.4 on incident response and MEASURE 2.11 on input-data-related validity.1

Why readiness is a sustainment discipline

A readiness scorecard produced at launch and never refreshed is a readiness artifact that expires. Every downstream condition the scorecard assumed — data distributions, consumer list, regulatory perimeter, supplier behavior — shifts over time. A readiness program that does not refresh to track those shifts will, within months, be signing off on conditions that no longer hold.

The Zillow Offers shutdown is the canonical cautionary tale. Zillow closed its iBuying unit in November 2021, after an inventory write-down of roughly $304M in Q3 2021 and unit losses that ultimately reached $881M for the year, citing the inability of its pricing models to keep pace with the 2021 housing-market inflection.2 Subsequent reporting has framed the shutdown as a combination of model risk, operational capacity constraints, and a data-monitoring gap: the distribution of transactions the model was predicting against had diverged from the distribution it had been trained on, and the monitoring and response posture did not catch the divergence in time.

A sustained readiness discipline would have flagged the divergence earlier, and the practitioner's job is to ensure that such divergence does not pass unchallenged: retrain the model against current data, scale down the unit's risk exposure while retraining, or both. The cost-of-wrong is too high for the organization to rely on launch-gate readiness alone.

The three drift classes

Data drift

Data drift is a shift in the marginal distribution of one or more input features. Common detection methods:

  • Kolmogorov-Smirnov (KS) test — non-parametric test for equality of continuous distributions between a reference window (training or baseline production) and a current window.
  • Population Stability Index (PSI) — binning-based statistic measuring distribution shift; PSI < 0.1 is stable, 0.1–0.25 is moderate drift, > 0.25 is significant drift.
  • Chi-squared test — test for categorical features between reference and current windows.
  • Kullback-Leibler divergence — information-theoretic measure of distribution shift.
  • Embedding-distance metrics — for unstructured data, measure mean-pairwise distance or Wasserstein distance between embedding distributions.

The practitioner selects methods per feature type and sets thresholds per feature importance. High-importance features warrant tighter thresholds and alert-on-breach response; low-importance features can tolerate looser monitoring.
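As a minimal sketch of the PSI calculation described above (the quantile-based binning and the epsilon guard are implementation choices, not a COMPEL-specified procedure):

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between a reference window and a
    current window of one continuous feature. Bin edges come from the
    reference window's quantiles, so each reference bin holds ~1/bins
    of the baseline mass."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip avoids log(0) when a bin is empty in one window.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)   # reference window
stable   = rng.normal(0.0, 1.0, 50_000)   # same distribution
shifted  = rng.normal(0.75, 1.0, 50_000)  # mean-shifted distribution
```

Against the conventional bands, `psi(baseline, stable)` falls well below 0.1 (stable), while `psi(baseline, shifted)` exceeds 0.25 (significant drift).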

Concept drift

Concept drift is a change in P(Y|X) — the relationship between inputs and targets. It cannot be detected from inputs alone; it requires ground-truth labels for current predictions, which often arrive with delay. Detection methods:

  • Prediction-performance monitoring on labeled holdout — where ground truth is available with short latency, monitor precision, recall, AUC, or task-specific metrics over rolling windows.
  • Confidence-distribution monitoring — changes in the distribution of model confidence can precede measurable performance drops.
  • Drift-detection methods (DDM, EDDM, ADWIN) — algorithms specifically designed for concept-drift detection on streaming data.

Concept drift is often more consequential than data drift and is harder to catch. The readiness scorecard should name the planned concept-drift detection method for every production use case and require a ground-truth-availability commitment (even if delayed) from the operational team.
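DDM, EDDM, and ADWIN have implementations in streaming-ML libraries; the underlying idea can also be sketched as a plain rolling-accuracy monitor on delayed ground truth. The window, baseline, and tolerance values here are illustrative defaults, not COMPEL-mandated numbers:

```python
from collections import deque

class RollingPerformanceMonitor:
    """Concept-drift check via rolling accuracy on (possibly delayed)
    ground-truth labels."""

    def __init__(self, window=500, baseline=0.90, tolerance=0.05):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, y_true, y_pred):
        """Call when a label arrives, even days after the prediction."""
        self.outcomes.append(1 if y_true == y_pred else 0)

    def drifted(self):
        """True once rolling accuracy over a full window falls more
        than `tolerance` below the baseline."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough labels yet to judge
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.baseline - self.tolerance
```

The ground-truth-availability commitment the scorecard requires is exactly what makes `record` callable at all; without labels, the monitor never fills its window.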

Schema drift

Schema drift is change in the data contract itself. Detection is simpler: fail the pipeline when a contract-covered field disappears, changes type, or goes outside its declared value range. Schema drift response should be fail-closed (stop the pipeline until the change is approved), which pushes the operational and governance work back to the contract-change lifecycle from Article 3.
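The fail-closed rule can be sketched as a contract check that raises, halting the pipeline, on any violation. The contract format here is an assumption for illustration, not a COMPEL-specified schema:

```python
def enforce_contract(record: dict, contract: dict) -> None:
    """Fail-closed schema check: raise (stopping the pipeline) when a
    contract-covered field is missing, changes type, or goes outside
    its declared value range."""
    for field, spec in contract.items():
        if field not in record:
            raise ValueError(f"schema drift: field '{field}' missing")
        value = record[field]
        if not isinstance(value, spec["type"]):
            raise ValueError(
                f"schema drift: '{field}' is {type(value).__name__}, "
                f"expected {spec['type'].__name__}")
        lo, hi = spec.get("range", (None, None))
        if lo is not None and not (lo <= value <= hi):
            raise ValueError(
                f"schema drift: '{field}'={value} outside [{lo}, {hi}]")

# Illustrative contract for a pricing-model input feed.
contract = {"price": {"type": float, "range": (0.0, 1e8)},
            "bedrooms": {"type": int, "range": (0, 50)}}
enforce_contract({"price": 425_000.0, "bedrooms": 3}, contract)
```

The raise is deliberate: approval of the change, not a code fix, is what unblocks the pipeline, which routes the event into the contract-change lifecycle.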

[DIAGRAM: TimelineDiagram — data-monitoring-cadence — horizontal timeline showing monitoring activities at escalating intervals: hourly (schema checks), daily (quality checks + PSI per feature), weekly (concept-drift detection against labeled holdout), monthly (quality-dimension recomputation on full dataset), quarterly (readiness sub-score refresh), annually (full readiness re-audit) — with the incident-trigger threshold annotated at each cadence]

Data incident classification

The COMPEL severity scheme for data incidents has five levels, each with a named trigger, response, and record.

Severity 1 — Critical

Criteria: any one of: a personal-data breach or data-subject-rights violation carrying a regulatory notification deadline (72 hours for breach notification to the supervisory authority under GDPR Article 33), a sensitive-data leak affecting > 1,000 records, a training-set integrity compromise affecting a production high-risk model, or an incident with imminent large-scale business impact.

Response: fail-closed on the affected system, executive notification inside 1 hour, regulatory-notification preparation inside 24 hours where applicable, formal incident team activation.

Record: incident report, regulator correspondence, root-cause analysis, remediation plan, executive debrief.

Severity 2 — High

Criteria: any one of: a quality-threshold breach on a critical feature feeding a production model, a concept-drift detection triggering model-retirement criteria, a supplier-feed outage affecting production serving, or a lineage-integrity compromise.

Response: operational-team notification inside 30 minutes, restrict affected use cases within 2 hours, remediation plan within 24 hours.

Record: incident report, root-cause analysis, remediation plan.

Severity 3 — Medium

Criteria: any one of: a quality-threshold breach on a non-critical feature, a data-drift detection crossing the monitoring threshold, a schema change without proper notice, or a documentation gap affecting audit readiness.

Response: ticket-queue escalation, operational-team review inside one business day, remediation plan inside 5 business days.

Record: incident record, remediation notes.

Severity 4 — Low

Criteria: minor quality anomaly within tolerance, lineage-graph update lag, or similar.

Response: normal-course remediation.

Record: incident log entry, aggregated for trend analysis.

Severity 5 — Informational

Criteria: observation worth recording but not requiring action (e.g., seasonality pattern confirming a model’s expected behavior).

Response: log for pattern analysis.

Record: trend aggregate.

The severity scheme is the thing the practitioner tests in the scorecard. Does the organization have it? Is it wired to monitoring alerts? Do incidents from the past quarter have records that match the scheme? An organization that cannot answer these questions has a readiness finding at the sustainment layer regardless of the quality of its point-in-time data.
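One way to make the "wired to monitoring alerts" test concrete is to hold the scheme's response deadlines in code that alert routing can consult. The field names are an illustrative encoding, not a COMPEL-specified format, and severity-3 deadlines are business days, which plain `timedelta` does not capture:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class SeverityResponse:
    name: str
    notify_within: Optional[timedelta]  # first-notification deadline
    plan_within: Optional[timedelta]    # remediation-plan deadline

# Deadlines transcribed from the severity scheme above. Severity 4/5
# carry no hard deadlines, only logging, hence None.
SEVERITY = {
    1: SeverityResponse("Critical", timedelta(hours=1), timedelta(hours=24)),
    2: SeverityResponse("High", timedelta(minutes=30), timedelta(hours=24)),
    3: SeverityResponse("Medium", timedelta(days=1), timedelta(days=5)),
    4: SeverityResponse("Low", None, None),
    5: SeverityResponse("Informational", None, None),
}
```

An alert router that pages against `SEVERITY[level].notify_within` gives the scorecard question "is the scheme wired to monitoring alerts?" a checkable answer.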

[DIAGRAM: StageGateFlow — incident-response-flow — seven-stage horizontal flow: detect → triage → contain → root-cause → remediate → report → learn — each stage annotated with the named owner, the target duration per severity level, and the record produced]

The A-level algorithm case

On 13 August 2020, the UK’s Office of Qualifications and Examinations Regulation (Ofqual) published A-level grades produced by a standardization algorithm; with exams cancelled during the COVID-19 lockdown, the regulator had judged that teacher-assessed grades alone could not be relied on. Within days, widespread reporting documented that the algorithm disproportionately downgraded students from state schools and historically lower-performing schools relative to teacher assessments. Ofqual withdrew the algorithm on 17 August 2020.3

The case anchors several readiness themes at the sustainment layer:

  • Representation and subgroup coverage were inadequate; the algorithm’s calibration against historical school performance embedded a socio-economic proxy.
  • Monitoring against pre-release evidence (the detailed subgroup outcomes) was insufficient or insufficiently acted on.
  • Incident response, when the problem became unavoidable, was the retreat to teacher-assessed grades.
  • Sustainment was not possible because the use case was a one-shot deployment on consequential decisions.

Not every AI use case is a one-shot deployment. But the A-level case illustrates why readiness cannot be assumed; a system that produces consequential decisions must be instrumented to catch the failure that will happen.

The readiness metric set

The practitioner tracks a small, named set of metrics at the sustainment layer. A scorecard refresh every quarter should re-compute these and show the trajectory.

  • Coverage rate — fraction of in-scope datasets with current contracts, lineage, and datasheets.
  • Quality index — weighted average of the ten-dimension scores across in-scope datasets.
  • Drift alert rate — count of drift alerts per week, per severity, with trend.
  • Incident rate — count of data incidents per month, per severity, with trend.
  • Time-to-remediate — median business days elapsed from incident detection to remediation close, per severity.
  • Documentation currency — fraction of datasheets updated within the last quarter.
  • Audit-ready count — number of in-scope datasets for which a supervisory review could be passed today.

The metric set should not grow indefinitely. Seven is enough to tell the story. Additional metrics clutter the scorecard and dilute the signal.
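Two of the seven metrics can be sketched directly from dataset records; the dict fields and the 90-day quarter are illustrative assumptions, not a COMPEL-specified schema:

```python
from datetime import date, timedelta

def coverage_rate(datasets):
    """Fraction of in-scope datasets with a current contract, lineage
    record, and datasheet."""
    if not datasets:
        return 0.0
    covered = [d for d in datasets
               if d["contract_current"] and d["lineage_current"]
               and d["datasheet_current"]]
    return len(covered) / len(datasets)

def documentation_currency(datasets, today, quarter=timedelta(days=90)):
    """Fraction of datasheets updated within the last quarter."""
    if not datasets:
        return 0.0
    fresh = [d for d in datasets if today - d["datasheet_updated"] <= quarter]
    return len(fresh) / len(datasets)

# Two illustrative dataset records: one fully covered and current,
# one with a lineage gap and a stale datasheet.
datasets = [
    {"contract_current": True, "lineage_current": True,
     "datasheet_current": True, "datasheet_updated": date(2026, 3, 1)},
    {"contract_current": True, "lineage_current": False,
     "datasheet_current": True, "datasheet_updated": date(2025, 11, 1)},
]
```

The quarterly refresh recomputes these over the current in-scope list and plots them against prior quarters; the trajectory, not the point value, is the signal.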

Integrating with the organization’s incident-response program

Data incidents overlap with the broader AI incident-response program the governance function operates. Where an AI incident-response program exists, the data-incident process must connect to it; where it does not exist, the practitioner flags the absence as a finding and recommends the governance function address it.

The connection points are: the escalation path (who gets paged), the recording system (which incident database), the external-reporting queue (DPO, regulator, customer), and the post-incident review cadence. A data incident that stays local to the data team and never reaches the AI governance forum is a governance failure even if it is resolved competently on the data side.

The drift-detection threshold-setting problem

Setting the threshold at which a drift detector fires an alert is a non-trivial problem. Too loose and the detector misses real drift until it is consequential. Too tight and the detector fires constantly on normal noise, producing alert fatigue and eventual human-side dismissal of genuine signals.

The readiness practitioner’s approach:

  1. Establish a reference window. A stable, recent, representative period of production behavior becomes the baseline for PSI and KS calculations.
  2. Measure natural variation. Compute the drift statistic across independent sub-windows of the reference period; the observed distribution of the statistic under no-drift conditions is the noise floor.
  3. Set the alert threshold above the noise floor at a documented confidence level. A threshold at the 95th or 99th percentile of the null-condition distribution is a common choice.
  4. Run the detector in shadow mode for a committed period. Alerts during shadow mode are recorded but not actioned; the rate of shadow alerts is reviewed before the detector goes live.
  5. Tune the threshold based on shadow-mode evidence. A detector that fired 200 times per week in shadow mode is too sensitive; one that never fired is likely missing real drift.

Threshold choices are governance artifacts. They live in the drift-monitoring contract, are reviewed when the reference window changes, and are audited as part of the scorecard refresh.
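Steps 1 through 3 can be sketched for a KS-based detector: estimate the null distribution of the statistic by comparing random sub-windows of the reference period to each other, then set the alert threshold at a high percentile of that distribution. The window size, split count, and percentile are illustrative choices:

```python
import numpy as np
from scipy.stats import ks_2samp

def noise_floor_threshold(reference, n_splits=50, window=1_000,
                          percentile=99, seed=0):
    """Alert threshold for the two-sample KS statistic, set at the
    given percentile of its no-drift (null) distribution, which is
    estimated by comparing sub-windows of the reference period."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_splits):
        a = rng.choice(reference, size=window, replace=False)
        b = rng.choice(reference, size=window, replace=False)
        stats.append(ks_2samp(a, b).statistic)
    return float(np.percentile(stats, percentile))

# Reference window: a stable period of production behavior.
reference = np.random.default_rng(1).normal(0.0, 1.0, 100_000)
threshold = noise_floor_threshold(reference)

# A genuinely shifted current window should clear the threshold.
current = np.random.default_rng(2).normal(0.5, 1.0, 1_000)
drift_stat = ks_2samp(reference, current).statistic
```

Steps 4 and 5 then follow operationally: run `drift_stat > threshold` in shadow mode for the committed period and tune before alerts go live.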

Sustainment against regulatory change

Sustainment covers not only data and model changes but also regulatory-perimeter changes. The practitioner maintains a lightweight regulatory watch relevant to the use case’s jurisdiction and tier. The EU AI Act’s phased entry into force (the first applicability milestones from February 2025, with successive milestones through 2026 and 2027) means that a use case classified as acceptable today may cross an obligation threshold within the planning horizon.4 The same phenomenon affects US state AI laws (New York City’s Local Law 144 on employment AI, Colorado’s AI Act, and others with staggered effective dates), UK and Japanese regulatory frameworks, and sector rules.

The practitioner’s regulatory-watch cadence is quarterly. Between refreshes, material regulatory changes are added to the use case’s trigger-refresh list; the scorecard is refreshed earlier than schedule when a material change lands. The regulatory-watch log is an evidence artifact in its own right.

The post-incident review

Severity-1 and severity-2 incidents produce a post-incident review (PIR). The PIR is the learning mechanism by which the organization improves. A good PIR covers:

  • Timeline. What happened, in order, with timestamps.
  • Detection. What caught the incident, and at what delay from initial occurrence. A long detection delay is a finding in its own right.
  • Root cause. The underlying cause, not the proximate cause. “The pipeline failed” is not a root cause; “the pipeline failed because the upstream schema change was not communicated and the contract did not enforce a compatibility check” is.
  • Contributing factors. Factors that made the incident more severe or harder to detect — documentation gaps, ownership ambiguity, missing monitoring.
  • Remediation. Actions to prevent recurrence, with owners and target dates.
  • Systemic lessons. What this incident teaches about the organization’s readiness program.

PIRs are retained as organizational memory. Repeated PIRs with similar root causes are a pattern the governance function must address; the readiness practitioner surfaces the pattern in the next scorecard refresh.

Cross-references

  • COMPEL Practitioner — Data quality and technology assessment deep dive (EATP-Level-2/M2.2-Art06-Data-Quality-and-Technology-Assessment-Deep-Dive.md) — the practitioner-level diagnostic framework that point-in-time quality assessment draws from.
  • COMPEL Core — Mandatory artifacts and evidence management (EATF-Level-1/M1.2-Art14-Mandatory-Artifacts-and-Evidence-Management.md) — the framework-level evidence-management discipline that sustains the readiness artifacts across refreshes.
  • AITM-DR Article 2 (./Article-02-Data-Quality-Dimensions-Extended-for-AI.md) — the point-in-time quality dimensions; this article extends them into sustained monitoring.
  • AITM-DR Article 11 (./Article-11-The-Readiness-Scorecard.md) — the scorecard that consumes the sustainment metrics.

Summary

Readiness is a sustainment discipline, not a certificate. Data drift, concept drift, and schema drift each require specific monitoring methods and thresholds. Incidents are classified on a five-level severity scheme with named triggers, response durations, and records. The seven-metric readiness set — coverage rate, quality index, drift alert rate, incident rate, time-to-remediate, documentation currency, audit-ready count — tells the story of whether the data foundation is improving, holding, or eroding. The Zillow and UK A-level cases show what happens when sustainment is missing; the discipline the practitioner builds is what makes those outcomes preventable.


Footnotes

  1. NIST, AI Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf

  2. Zillow Group, Inc., Form 10-Q for the quarterly period ended September 30, 2021. https://www.sec.gov/Archives/edgar/data/1617640/000161764021000112/z-2021x09x30x10q.htm

  3. R. Adams and S. Weale, A-level results U-turn: Ofqual halts use of controversial algorithm, The Guardian, August 17, 2020. https://www.theguardian.com/education/2020/aug/17/a-level-results-u-turn-ofqual-ecdl-grade-standardisation

  4. Regulation (EU) 2024/1689, Articles 113–114 (entry into force and application). https://eur-lex.europa.eu/eli/reg/2024/1689/oj