AI features frequently produce scores. A credit model outputs a probability-of-default score; applicants above a cutoff are auto-approved, and those below are routed to manual review. A fraud model outputs a suspicion score; transactions above a cutoff are held for review. A customer-success model outputs a churn-risk score; accounts above a cutoff trigger outreach. Each of these produces exactly the shape a regression discontinuity design (RDD) exploits.
This article teaches the sharp and fuzzy RDD variants, the bandwidth and functional-form sensitivity tests, and the ethical questions that threshold-based AI often surfaces during evaluation. A credit-scoring RDD evaluation is, by nature, also a fair-lending audit — the analyst should be prepared for both conversations.
Sharp vs. fuzzy RDD
In a sharp RDD, the score completely determines treatment: every applicant above 680 is approved, every applicant below is declined. The treatment indicator is a deterministic step function of the score. The RDD estimate compares outcomes just above the cutoff with outcomes just below.
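A minimal sharp-RDD sketch in plain Python, assuming a hypothetical 680 cutoff and toy noiseless data: fit a line on each side of the cutoff within a bandwidth, then take the jump between the two fitted values at 680.

```python
# Sharp RDD sketch: the 680 cutoff fully determines treatment, and the
# estimate is the jump in the fitted outcome at the cutoff.
# Toy, noiseless data; real analyses add noise and robust standard errors.

CUTOFF = 680
TRUE_JUMP = 5.0  # hypothetical effect of approval on the outcome

def linfit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

# Deterministic toy data: outcome trends linearly in the score and
# jumps by TRUE_JUMP at the cutoff (treatment = score >= CUTOFF).
scores = list(range(630, 731))
outcomes = [0.02 * (s - CUTOFF) + (TRUE_JUMP if s >= CUTOFF else 0.0)
            for s in scores]

BANDWIDTH = 30  # points on either side of the cutoff
left = [(s, y) for s, y in zip(scores, outcomes)
        if CUTOFF - BANDWIDTH <= s < CUTOFF]
right = [(s, y) for s, y in zip(scores, outcomes)
         if CUTOFF <= s <= CUTOFF + BANDWIDTH]

aL, bL = linfit([s for s, _ in left], [y for _, y in left])
aR, bR = linfit([s for s, _ in right], [y for _, y in right])

# RDD estimate = difference of the two fitted lines evaluated at the cutoff.
rdd_estimate = (aR + bR * CUTOFF) - (aL + bL * CUTOFF)
print(round(rdd_estimate, 4))  # → 5.0 (the fit is exact on noiseless data)
```
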
In a fuzzy RDD, the score only affects the probability of treatment: applicants above 680 are more likely to be approved but not certain to be (human review overrides), and applicants below 680 are occasionally approved. Fuzzy RDD uses the threshold as an instrument for actual treatment and estimates the local average treatment effect via instrumental-variable methods.
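A fuzzy-RDD sketch under the same toy assumptions: because crossing the cutoff only shifts the approval rate, the effect is recovered as a Wald ratio, the jump in the outcome divided by the jump in the treatment rate. Real estimation replaces the simple means with local-linear fits and IV standard errors.

```python
# Fuzzy RDD sketch: crossing 680 raises the probability of approval
# rather than determining it, so the cutoff is used as an instrument.
# At the cutoff the IV estimate reduces to a Wald ratio:
#   (jump in outcome) / (jump in treatment rate).
# Toy deterministic data with a flat baseline outcome.

CUTOFF = 680
TRUE_EFFECT = 4.0  # hypothetical effect of approval itself

units = []
for i, score in enumerate(range(650, 711)):
    above = score >= CUTOFF
    # Overrides: ~90% approved above the cutoff, ~20% below
    # (a deterministic pattern standing in for override behaviour).
    approved = (i % 10) < (9 if above else 2)
    outcome = 10.0 + (TRUE_EFFECT if approved else 0.0)
    units.append((above, approved, outcome))

def mean(xs):
    return sum(xs) / len(xs)

d_above = mean([float(a) for ab, a, _ in units if ab])
d_below = mean([float(a) for ab, a, _ in units if not ab])
y_above = mean([y for ab, _, y in units if ab])
y_below = mean([y for ab, _, y in units if not ab])

# Wald / fuzzy-RDD estimate of the local average treatment effect.
late = (y_above - y_below) / (d_above - d_below)
print(round(late, 4))  # → 4.0 on this constant-baseline toy data
```
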
Most AI-scoring systems in practice are fuzzy. A pure sharp RDD is rare because real decisions almost always have override capacity. The fuzzy design is slightly more complex to estimate but conceptually identical.
The running variable, bandwidth, and functional form
Three specification choices shape the RDD estimate.
Running variable
The running variable is the score that drives assignment. It must be the actual score used at the cutoff, not a proxy. When the AI system refreshes the score over time and applies the cutoff to the most recent score, the running variable is the score in force on the decision date, not the current score.
Bandwidth
The bandwidth is the range of scores on either side of the cutoff included in the estimation. A narrow bandwidth (e.g., ±15 points around 680) exploits only the most-similar units, which reduces bias but increases variance. A wide bandwidth (±50 points) uses more data but risks including units that differ substantially from those at the cutoff.
Data-driven bandwidth selection — the Imbens-Kalyanaraman or Calonico-Cattaneo-Titiunik (CCT) methods — is standard practice. The published CCT algorithm is the current reference for nonparametric RDD bandwidth selection and bias correction.1 Bandwidth sensitivity is checked by re-estimating the RDD at 50%, 100%, and 200% of the chosen bandwidth and reporting how the estimate moves.
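The sensitivity check above can be sketched as a loop over bandwidth multiples. The cutoff, the data, and the "chosen" bandwidth are toy assumptions standing in for a CCT-selected value.

```python
# Bandwidth-sensitivity sketch: re-estimate the same local-linear RDD at
# half, the chosen, and double the bandwidth and report how the estimate
# moves. Toy data with mild cubic curvature, so a too-wide bandwidth
# picks up bias from units far from the cutoff.

CUTOFF = 680
TRUE_JUMP = 5.0
CHOSEN_BANDWIDTH = 20  # stands in for a data-driven (CCT-style) choice

def linfit(xs, ys):
    """OLS for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def rdd_jump(data, h):
    """Local-linear RDD estimate within bandwidth h."""
    left = [(s, y) for s, y in data if CUTOFF - h <= s < CUTOFF]
    right = [(s, y) for s, y in data if CUTOFF <= s <= CUTOFF + h]
    aL, bL = linfit([s for s, _ in left], [y for _, y in left])
    aR, bR = linfit([s for s, _ in right], [y for _, y in right])
    return (aR + bR * CUTOFF) - (aL + bL * CUTOFF)

data = [(s, 1e-5 * (s - CUTOFF) ** 3 + (TRUE_JUMP if s >= CUTOFF else 0.0))
        for s in range(600, 761)]

for mult in (0.5, 1.0, 2.0):
    est = rdd_jump(data, CHOSEN_BANDWIDTH * mult)
    print(f"{mult:.1f}x bandwidth: estimate {est:.3f} (truth {TRUE_JUMP})")
```

On this toy data the narrow-bandwidth estimate sits closest to the truth and the doubled bandwidth drifts, which is exactly the movement the sensitivity report is meant to surface.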
Functional form
The RDD fits a function through the relationship between the running variable and the outcome on each side of the cutoff; the RDD estimate is the jump in the fitted function at the cutoff. A linear functional form is simple but may mis-specify the relationship; higher-order polynomials fit better in-sample but can over-fit. Best practice uses local-linear or local-quadratic fits with a data-driven bandwidth.
Gelman and Imbens argued against high-order polynomial global fits, which can produce misleading RDD estimates driven by curvature far from the cutoff.2 Local nonparametric estimation (with bias correction à la CCT) is the modern default.
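A toy illustration of the failure mode, assuming a hypothetical cubic trend with no true jump: a wide, mis-specified linear fit per side reports a sizeable spurious discontinuity, while a narrow local-linear fit stays near zero. This substitutes a wide linear fit for the high-order global polynomial, but the mechanism, curvature far from the cutoff leaking into the estimate, is the same.

```python
# Functional-form sketch: data with smooth cubic curvature and NO true
# jump at the 680 cutoff. A wide, mis-specified fit per side reports a
# spurious discontinuity driven by curvature far from the cutoff;
# a narrow local-linear fit does not.

CUTOFF = 680

def linfit(xs, ys):
    """OLS for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def jump_at_cutoff(data, h):
    left = [(s, y) for s, y in data if CUTOFF - h <= s < CUTOFF]
    right = [(s, y) for s, y in data if CUTOFF <= s <= CUTOFF + h]
    aL, bL = linfit([s for s, _ in left], [y for _, y in left])
    aR, bR = linfit([s for s, _ in right], [y for _, y in right])
    return (aR + bR * CUTOFF) - (aL + bL * CUTOFF)

# Smooth cubic relationship; no discontinuity anywhere.
data = [(s, 1e-4 * (s - CUTOFF) ** 3) for s in range(600, 761)]

wide_fit = jump_at_cutoff(data, h=60)   # uses units far from the cutoff
local_fit = jump_at_cutoff(data, h=10)  # local-linear near the cutoff

print(f"wide fit:  spurious jump {wide_fit:.3f}")
print(f"local fit: jump {local_fit:.3f}")
```
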
Manipulation and the McCrary test
RDD’s causal identification rests on the assumption that units cannot precisely manipulate their score near the cutoff. Applicants just below 680 cannot easily move themselves to just above. Where manipulation exists, the density of units near the cutoff is uneven — a spike just above and a dip just below — which invalidates the design.
The McCrary density test formalizes this: it fits a kernel density to the running variable and tests for discontinuity at the cutoff. A significant discontinuity in the density (not the outcome) flags manipulation. The test is a standard pre-analysis check for any RDD.
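A simplified density check in the spirit of the McCrary test can be sketched as follows; the real test fits local-linear densities on each side and yields a formal p-value, whereas this ratio comparison only illustrates the idea on hypothetical data.

```python
# Simplified manipulation check: compare the mass of applicants just
# below vs just above the cutoff. A ratio far from 1 suggests bunching.
# (The actual McCrary test is a formal kernel-density discontinuity test.)

CUTOFF = 680
WINDOW = 5  # points on either side of the cutoff

def density_ratio(scores):
    below = sum(1 for s in scores if CUTOFF - WINDOW <= s < CUTOFF)
    above = sum(1 for s in scores if CUTOFF <= s < CUTOFF + WINDOW)
    return above / below

# Clean data: scores spread evenly, no bunching at the cutoff.
clean = [s for s in range(600, 760) for _ in range(10)]

# Manipulated data: half the applicants just below the cutoff nudged
# just above it (e.g. via credit-repair services).
manipulated = [s + WINDOW if (CUTOFF - WINDOW <= s < CUTOFF and i % 2 == 0)
               else s
               for i, s in enumerate(clean)]

print(round(density_ratio(clean), 2))        # ~1: no discontinuity
print(round(density_ratio(manipulated), 2))  # >> 1: bunching above cutoff
```
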
Manipulation is a particular concern in credit-scoring and loan applications, where applicants have both the incentive and the ability to adjust inputs. It matters less in automated fraud-scoring systems, where the scored party never sees the score.
Ethical and regulatory considerations
Threshold-based AI decisions are a flashpoint for fairness and regulatory attention. An RDD evaluation of a credit-scoring AI, by its nature, produces the inputs for a fair-lending audit: outcome differences at the threshold by protected class, threshold calibration, and evidence of disparate impact. The GAO’s AI Accountability Framework explicitly flags threshold-based AI decisions as requiring ongoing fairness measurement.3
Three ethical questions the analyst should anticipate:
Is the threshold itself fair? A credit threshold that produces a 90% approval rate for one demographic and 40% for another may be statistically optimal for narrow loss minimization and simultaneously fail Equal Credit Opportunity Act standards. The RDD identifies the effect at the threshold; the fairness analysis compares threshold effects across demographic subgroups.
Are scores manipulable in ways that disadvantage protected classes? If applicants with access to credit-repair services can bump scores above 680 while equally-qualified applicants without that access cannot, the AI system has introduced a new dimension of inequity.
Is the threshold mission-aligned? A customer-success churn-risk threshold tuned to minimize contact-center workload may systematically under-serve high-churn-risk customers who most need intervention. The threshold optimizes the wrong objective.
Each question is an AI-value question (threshold mis-specification erodes realized value) and a governance question (threshold mis-specification is a fairness and regulatory exposure). The RDD result feeds both conversations.
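The subgroup comparison behind the first question can be sketched by running the same local-linear RDD separately per group and differencing the jumps; group names, data, and effect sizes below are all hypothetical.

```python
# Subgroup-disparity sketch: estimate the threshold effect separately
# for each demographic group and compare the jumps. Hypothetical groups
# and effects; a real fair-lending analysis adds inference on the gap.

CUTOFF = 680
BANDWIDTH = 25

def linfit(xs, ys):
    """OLS for y = a + b*x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def rdd_jump(data):
    left = [(s, y) for s, y in data if CUTOFF - BANDWIDTH <= s < CUTOFF]
    right = [(s, y) for s, y in data if CUTOFF <= s <= CUTOFF + BANDWIDTH]
    aL, bL = linfit([s for s, _ in left], [y for _, y in left])
    aR, bR = linfit([s for s, _ in right], [y for _, y in right])
    return (aR + bR * CUTOFF) - (aL + bL * CUTOFF)

# Hypothetical true threshold effects that differ by group.
true_effects = {"group_a": 5.0, "group_b": 1.5}

estimates = {}
for group, effect in true_effects.items():
    data = [(s, 0.02 * (s - CUTOFF) + (effect if s >= CUTOFF else 0.0))
            for s in range(650, 711)]
    estimates[group] = rdd_jump(data)

disparity = estimates["group_a"] - estimates["group_b"]
print({g: round(e, 3) for g, e in estimates.items()},
      "disparity:", round(disparity, 3))
```
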
Reporting the RDD
A reported RDD in section 3 of the VRR includes: the running variable, the cutoff, the estimator used, the data-driven bandwidth, the functional-form choice, the point estimate with CCT-robust standard errors, the McCrary density-test result, bandwidth-robustness estimates at three alternative bandwidths, and — where the AI system affects protected classes — the subgroup-disparity extension.
The RDD coefficient is a local average treatment effect — the effect at the threshold, not across the full score range. Reporting the coefficient without this caveat invites over-generalization. The estimate says nothing about units far above or far below the cutoff; a VRR should say so.
Cross-reference to Core Stream
- EATP-Level-2/M2.5-Art02-Designing-the-Measurement-Framework.md#causal-design — RDD within the practitioner measurement toolkit.
- EATE-Level-3/M3.4-Art14-EU-AI-Act-Article-6-High-Risk-Classification-Deep-Dive.md — high-risk scoring systems where RDD evaluation is typically mandated.
Self-check
- A credit model uses a 680 threshold but manual reviewers override about 15% of score-based decisions. Which RDD variant applies, and how does it change estimation?
- A McCrary density test shows a significant discontinuity at the cutoff. What does this indicate, and what are the design implications?
- A global polynomial fit shows a jump of $340 at the cutoff; a local-linear fit shows $120. Which estimate is likely correct, and why?
- An RDD shows a 2.1pp default-rate reduction from auto-approval decisions at the threshold. A fair-lending auditor asks about subgroup effects. What analysis do you run?
Further reading
- Calonico, Cattaneo, and Titiunik, Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs, Econometrica (2014).
- Cattaneo, Idrobo, and Titiunik, A Practical Introduction to Regression Discontinuity Designs (Cambridge, 2019).
- Imbens and Lemieux, Regression Discontinuity Designs: A Guide to Practice, Journal of Econometrics (2008).
Footnotes
1. Sebastian Calonico, Matias D. Cattaneo, and Rocío Titiunik, Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs, Econometrica 82, no. 6 (2014): 2295–2326. https://doi.org/10.3982/ECTA11757
2. Andrew Gelman and Guido Imbens, Why High-Order Polynomials Should Not Be Used in Regression Discontinuity Designs, Journal of Business & Economic Statistics 37, no. 3 (2019): 447–456. https://doi.org/10.1080/07350015.2017.1366909
3. US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021). https://www.gao.gov/products/gao-21-519sp