AITM-ECI: AI Experimentation Associate — Body of Knowledge Case Study 1 of 1
What happened
In early November 2021, Zillow Group announced that it was closing Zillow Offers, its iBuying arm. Zillow Offers was the business through which the company bought homes directly from consumers, held them briefly, performed light renovation, and resold them. The operational thesis was that a sufficiently accurate valuation model could identify homes priced below the resale they would later produce, and that the profit per transaction, multiplied by scale, would make iBuying a durable line of business. For several years it was a high-profile bet.
The public disclosure of the unwind was unusually specific. Zillow's Form 10-Q for the quarter ended September 30, 2021, recorded approximately $304 million of inventory writedowns on homes held at quarter-end, and the company stated that it expected additional writedowns in subsequent quarters, bringing the aggregate disclosed impact of the shutdown to approximately $881 million across Q3 and Q4 2021.[1] The company also announced a workforce reduction of roughly 25%. Management framed the decision in terms of the difficulty of forecasting home prices with sufficient accuracy through the sharp housing regime change of 2021: the second half of the year saw a pricing inflection that the valuation model had not anticipated.[2]
The model itself was not disclosed in full, but the general shape is known from Zillow’s technical publications and from subsequent industry analysis. It was a machine-learning-driven automated-valuation model (AVM), trained on historical residential transactions, listing data, and property features, and refined iteratively over years. Comparable AVMs run at other iBuyers (Opendoor, RedfinNow, Offerpad during the same period) used different model classes but faced analogous challenges. What distinguished Zillow’s experience was the scale at which it committed capital to the model’s decisions in the first half of 2021, and the speed at which the housing regime shifted in the second half.
The experiment-mode diagnosis
The interesting question for an experimentation practitioner is not whether the model had an error; every model does. The question is whether the evaluation regime around the model was designed to catch the class of error that actually occurred and, if not, which mode of evaluation would have.
Offline evaluation. An AVM of this type is typically evaluated on held-out historical transactions, with metrics such as median absolute error, interquartile-range error, and per-state or per-metro performance. By every public account, Zillow's offline evaluation was extensive: the model was retrained frequently, and performance on held-out historical data was measured and tracked. The insufficiency was not one of rigor; it was one of what the data could contain. A model trained on transactions from 2016–2020 could not, by construction, have seen transactions reflecting the specific 2021 regime change, and offline evaluation on 2016–2020 data could not have detected the model's inability to handle that regime.
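The offline metrics named above can be sketched in a few lines of Python. The `offline_report` function and the `(metro, predicted, actual)` tuple layout are illustrative assumptions for this case study, not a reconstruction of Zillow's actual tooling:

```python
from statistics import median, quantiles

def pct_errors(predicted, actual):
    """Signed percentage errors: (predicted - actual) / actual."""
    return [(p - a) / a for p, a in zip(predicted, actual)]

def offline_report(transactions):
    """Per-metro offline evaluation of an AVM on held-out transactions.

    transactions: list of (metro, predicted_price, actual_sale_price).
    Returns, for each metro, the median absolute error and the
    interquartile range of the signed error, the two headline metrics
    named in the text.
    """
    by_metro = {}
    for metro, pred, actual in transactions:
        preds, actuals = by_metro.setdefault(metro, ([], []))
        preds.append(pred)
        actuals.append(actual)
    report = {}
    for metro, (preds, actuals) in by_metro.items():
        errs = pct_errors(preds, actuals)
        q1, _, q3 = quantiles(errs, n=4)  # quartiles of the signed error
        report[metro] = {
            "median_abs_error": median(abs(e) for e in errs),
            "iqr_error": q3 - q1,
            "n": len(errs),
        }
    return report
```

However carefully these numbers are sliced, they are computed entirely from historical data; the sketch makes the limitation in the paragraph above concrete.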
Online evaluation with capital commitment. What Zillow did in 2021 was deploy the model in a mode where its outputs were not recommendations for human review but direct pricing decisions that committed real capital. Each accepted offer was, in effect, a single-arm experiment: the model's price on one side, the live housing market on the other, with the test metric being the resale price weeks or months later. This is a form of online evaluation, but an extreme one: the cost of each unfavorable transaction is the loss on the home, which at iBuyer scale summed quickly to material numbers.
The missing shadow. Between offline evaluation on historical data and live deployment with capital commitment, there is a third mode — shadow evaluation. In the four-mode vocabulary of Article 1, a shadow evaluation of the AVM would have fed the model current market inputs as they arrived, recorded the prices the model would have offered, and then compared those offered-but-not-acted-on prices against the prices the homes actually sold for through conventional channels. No capital would have been at risk. The divergence between model prices and actual transaction prices, tracked over weeks in the 2021 period, would have surfaced the regime-change transfer failure before the capital was committed at scale.
The structural form of the missing shadow is not mysterious. Comparable programs exist in quantitative finance (paper trading, which is exactly this pattern applied to trading strategies), in medical AI (silent-mode deployment of diagnostic support alongside clinical decision-making), and in ad auctions (shadow-bidding). The pattern was available. What was missing in this specific case, at the scale at which the program operated, was its application to an AVM that was making real capital commitments.
The cost
The disclosed aggregate cost to Zillow was approximately $881 million of writedowns and restructuring costs.[1] The human cost was approximately 2,000 jobs at the company, plus downstream effects on contractors, vendors, and local markets where Zillow had been a material buyer and seller. The strategic cost was the exit from a business line that had been central to the company's strategy.
By contrast, a shadow deployment at sufficient scale, running in parallel with the existing operation through 2021, would have had roughly the following cost profile.
- Compute to run the AVM on current market inputs: modest, a rounding error against model-training costs.
- Data engineering to ingest current market inputs at a realistic cadence: meaningful but not large, comparable to the data-engineering cost already present in the production model.
- Delay in ramping capital commitment to full scale during 2021: a revenue opportunity cost, but one that would have been weighed against the probability that the model was correctly specified in the new regime.
- Analysis: the team time required to compare shadow prices against realized sale prices, measured weekly.
On any realistic accounting, a shadow deployment at the scale necessary to detect the regime change would have cost single-digit millions of dollars at most, and likely less.
The ratio is several hundred to one. The practitioner reading this case does not need to reach an opinion on Zillow’s management decisions; the ratio is the lesson. The cost of a shadow evaluation is trivial relative to the cost of a full-scale deployment gone wrong, in any setting where the deployed model commits real-world resources.
What the practitioner takes from the case
The case yields four concrete lessons for the AITM-ECI practitioner.
1. The four-mode vocabulary is a diagnostic tool. When a team proposes a deployment, the practitioner asks which modes are planned. If the plan is “offline test, then deploy”, the practitioner asks what the offline test cannot answer. For any feature that commits real-world resources — capital, messages to customers, actions via tools — the gap between offline and full deployment is what shadow is for.
2. Regime change is the failure mode offline evaluation cannot catch. No amount of cross-validation, held-out test sets, or slice evaluation on historical data can detect a regime change that has not happened yet. A practitioner who treats offline evaluation as a complete answer to “will the model work” will miss regime-change failures by design.
3. Scale multiplies consequence. A model that loses money on 1% of transactions at 100 transactions per month is a modest problem. A model that loses money on 1% of transactions at 10,000 transactions per month is a crisis. The decision to scale commits the organization to the model’s error rate at the new scale; the pre-scale evaluation must be commensurate.
4. Shadow evaluation is cheap. The constraint on shadow evaluation is almost never cost; it is attention. A team that is enthusiastic about its offline numbers and eager to deploy will skip shadow because it feels like a delay. A team that has seen a capital-commitment error at scale will put shadow into every deployment plan, not because it is required but because it is cheap insurance.
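The scale arithmetic in lesson 3 can be made explicit. All numbers below are hypothetical and chosen only for illustration; they are not drawn from Zillow's disclosures:

```python
def expected_monthly_loss(volume, loss_rate, avg_loss_per_bad_deal):
    """Expected monthly loss: transaction volume x share of losing
    transactions x average loss per losing transaction."""
    return volume * loss_rate * avg_loss_per_bad_deal

# Hypothetical figures: 1% of deals lose $50,000 each on average.
small = expected_monthly_loss(100, 0.01, 50_000)        # modest problem
at_scale = expected_monthly_loss(10_000, 0.01, 50_000)  # crisis
```

The error rate is identical in both cases; only the volume changes, and the monthly loss scales linearly with it. Scaling the deployment is what converts a tolerable error rate into an intolerable one.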
Reading the case alongside the articles
The case is referenced from Article 1 (experiment modes and the missing shadow) and Article 3 (offline evaluation and its limits). The practitioner is encouraged to read the Zillow 10-Q filing directly, which is cited below, along with the company's subsequent investor communications, which are publicly available. The iBuying model class is not unique to Zillow; practitioners will encounter structurally similar situations in any setting where a model's outputs commit real-world resources at scale. The case is instructive because the facts are public, the financial disclosure is precise, and the diagnosis in the four-mode vocabulary is clean.
Discussion questions
These questions are designed for classroom use or peer discussion. They do not have single correct answers; they invite the practitioner to exercise the credential’s vocabulary on real evidence.
- Shadow design for an AVM. If you were designing a shadow evaluation for a home-valuation AVM in mid-2020 (before the 2021 regime change was visible), what specific divergence signals would have surfaced the transfer failure? What cadence of monitoring would have been sufficient to detect it in time to act?
- Scale-and-commit decisions. The decision to scale Zillow Offers in the first half of 2021 was a business decision that aggregated many experimental findings. At what point in the scale-up should a practitioner have argued, in experimentation vocabulary, for a pause? What evidence would have supported the argument?
- Comparable cases across industries. Name one other domain — not real estate — where a machine-learning model is making decisions that commit real-world resources, and where shadow evaluation is common or absent. Identify the structural reasons for the presence or absence.
- The post-incident report. Suppose you were asked to produce, in the vocabulary of Article 14, an experiment report closing out the Zillow Offers program. What would the “limitations” and “next experiment” sections say? The exercise is about what the organization should learn and codify, not about what the organization should do next in the same business.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
[1] Zillow Group, Inc., Form 10-Q for the quarterly period ended September 30, 2021. U.S. Securities and Exchange Commission EDGAR filing. https://www.sec.gov/Archives/edgar/data/1617640/000161764021000112/z-2021x09x30x10q.htm (accessed 2026-04-19).
[2] Zillow Group Q3 2021 shareholder letter and press release, November 2, 2021. Zillow Group Investor Relations. https://investors.zillowgroup.com/ (accessed 2026-04-19).