This article defines the five third-party risk categories, describes the intake workflow that converts a candidate third-party source into an approved data product, anchors the discussion to the two landmark copyright cases against large-model providers (the New York Times and Authors Guild actions), and establishes the cross-border residency analysis every readiness practitioner must perform.
The five risk categories
- License risk. Is the license compatible with the intended AI use? A license that permits research use but prohibits commercial use is incompatible with a production AI deployment. License opacity — the absence of a clear license — is itself a risk class.
- Provenance risk. Where did the data actually come from? Can the supplier document the chain of custody? Scraped data with no provenance record is high-risk. The Clearview enforcement pattern (Article 4) is the canonical illustration.
- Quality risk. The supplier’s data quality may be unknown or exaggerated. The readiness practitioner applies the Article 2 quality discipline to every third-party source before it is accepted.
- Subject-rights risk. If the source contains personal data, have data subjects been informed? Is an erasure request feasible? The practitioner should not assume the supplier has solved subject-rights issues.
- Drift risk. A third-party feed can change content, schema, or behavior outside the consumer’s control. The supplier may deprecate categories, change computation rules, or alter sampling.
The five risks form the scoring rubric for Section 9 of the readiness scorecard. Each risk is scored per source, with evidence, and the aggregate determines whether the source is accepted, accepted with mitigation, or rejected.
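The per-source scoring and the three-way accept / accept-with-mitigation / reject outcome can be sketched in code. This is an illustrative rubric, not the scorecard's actual scale: scores, thresholds, and field names here are assumptions.

```python
from dataclasses import dataclass, field

# Assumed 1 (low) to 5 (high) scale per risk category; the real scorecard
# (Article 11) defines its own scale and thresholds.
RISKS = ("license", "provenance", "quality", "subject_rights", "drift")

@dataclass
class SourceAssessment:
    source: str
    scores: dict = field(default_factory=dict)  # risk -> (score, evidence)

    def record(self, risk: str, score: int, evidence: str) -> None:
        assert risk in RISKS and 1 <= score <= 5
        self.scores[risk] = (score, evidence)

    def decision(self) -> str:
        # All five risks must be scored, with evidence, before any decision.
        if set(self.scores) != set(RISKS):
            return "incomplete"
        worst = max(score for score, _ in self.scores.values())
        if worst >= 4:
            return "reject"
        if worst == 3:
            return "accept-with-mitigation"
        return "accept"

a = SourceAssessment("vendor-feed-17")
for r in RISKS:
    a.record(r, 2, "reviewed")
a.record("drift", 3, "supplier changes sampling without notice")
print(a.decision())  # accept-with-mitigation
```

The aggregate here is a worst-score rule: a single high-risk category blocks the source regardless of how well the others score, which matches the intent that each risk is scored per source with evidence.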
The intake workflow
A third-party source moves through a seven-stage intake. No stage can be skipped; evidence is produced at each stage.
[DIAGRAM: StageGateFlow — third-party-intake — seven-stage horizontal flow: scope → license review → provenance audit → quality sample → legal sign-off → integration → monitor — each stage annotated with the evidence produced (scope memo, license abstract, provenance questionnaire, quality report, legal memo, integration contract, monitoring plan) and the named approver]
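The "no stage can be skipped" rule can be made mechanical. The sketch below is illustrative; stage and evidence names are taken from the flow above, and the class is an assumption, not an existing tool.

```python
# Gate-keeper for the seven-stage intake: a source advances one stage at a
# time, and each advance demands the evidence artifact of the current stage.
STAGES = [
    ("scope", "scope memo"),
    ("license review", "license abstract"),
    ("provenance audit", "provenance questionnaire"),
    ("quality sample", "quality report"),
    ("legal sign-off", "legal memo"),
    ("integration", "integration contract"),
    ("monitor", "monitoring plan"),
]

class Intake:
    def __init__(self, source: str):
        self.source = source
        self.evidence = []  # evidence produced so far, in stage order

    def advance(self, produced: str) -> str:
        stage, expected = STAGES[len(self.evidence)]
        if produced != expected:
            raise ValueError(f"stage {stage!r} requires {expected!r}, got {produced!r}")
        self.evidence.append(produced)
        return stage
```

Attempting to hand in the legal memo before the provenance questionnaire raises an error, which is exactly the skipped-stage failure the workflow is designed to prevent.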
1. Scope
The practitioner confirms what data is being acquired and for what use case. A licensed dataset acquired for experimentation has different requirements than the same dataset used in production training. Scope creep — the dataset acquired for one purpose ending up in another — is a common governance failure, and is prevented by recording the scope at intake and auditing subsequent uses against it.
2. License review
The practitioner reviews the license against the planned use. Key questions:
- Is commercial use permitted?
- Is use in AI training permitted? Many licenses pre-dating 2022 are silent on this; silence is not consent.
- Is redistribution of derived models permitted?
- Are there attribution or copyleft requirements that would carry forward into model outputs?
- Is there an export-control or territorial restriction?
The output is a license abstract — a one-page summary of the license terms relevant to the use, signed by legal. The abstract lives in the dataset’s governance record and becomes the reference for downstream decisions.
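The key questions above map naturally onto a structured record. A minimal sketch, with field names assumed; the point is that silence maps to `None` and `None` never satisfies a check.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LicenseAbstract:
    dataset: str
    commercial_use: Optional[bool]
    ai_training: Optional[bool]           # silence (None) is not consent
    model_redistribution: Optional[bool]
    copyleft_or_attribution: Optional[bool]
    territorial_restriction: Optional[bool]
    signed_off_by: str

    def permits(self, planned_use: str) -> bool:
        needed = {
            "production-training": [self.commercial_use, self.ai_training],
            "research": [self.ai_training],
        }[planned_use]
        # Only an explicit True counts; a silent license fails the check.
        return all(v is True for v in needed)
```

A pre-2022 license that is silent on AI training yields `ai_training=None`, so `permits("production-training")` returns `False` even when commercial use is expressly allowed.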
3. Provenance audit
The practitioner asks the supplier for the chain-of-custody record. For a licensed dataset, this includes the original collection methodology, the subject-consent basis (where applicable), any processing applied by the supplier, and the freshness of the snapshot. For a scraped corpus, this includes the scrape methodology, the robots.txt compliance record, and the filtering applied.
Where the supplier cannot produce the record, the practitioner has a choice: reject the source, accept with a documented provenance-risk finding, or negotiate additional due diligence. For high-risk use cases under the EU AI Act, rejection should be the default: the Act's data-governance requirements are difficult to satisfy without a provenance record.
4. Quality sample
The practitioner acquires a sample (under a limited-use data-sharing agreement if needed) and applies the Article 2 quality dimensions. The sample must be representative of what will be delivered at volume. The output is a quality report that feeds the decision at the next stage.
5. Legal sign-off
Legal reviews the license abstract, the provenance record, the quality report, and the supplier’s contractual terms (warranties, indemnities, termination rights). The sign-off is a named approval with conditions. For third-party data used in high-risk AI, legal should explicitly consider: EU AI Act Article 10 provisions on supplier duties, applicable copyright regime, applicable data-protection regime, and any sector-specific law.
6. Integration
The data is ingested under the data contract established for it. Lineage is captured. Access policy is enforced. The integration step includes the technical work of landing the data, schema normalization, and hand-off to downstream consumers.
7. Monitor
The source is monitored for drift (upstream feed changes), quality regression, and license changes (suppliers sometimes change terms). A documented cadence for supplier review — typically quarterly — keeps the relationship current.
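Drift and license-change detection can be reduced to fingerprint comparison. The sketch below is illustrative (class and field names are assumptions): the schema and license text are hashed at intake, each delivery is compared against the fingerprints, and the quarterly supplier review has a due date.

```python
import hashlib
from datetime import date, timedelta

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

class SourceMonitor:
    def __init__(self, schema: str, license_text: str, last_review: date):
        self.schema_fp = fingerprint(schema)
        self.license_fp = fingerprint(license_text)
        self.last_review = last_review

    def check(self, schema: str, license_text: str, today: date) -> list:
        events = []
        if fingerprint(schema) != self.schema_fp:
            events.append("schema drift")
        if fingerprint(license_text) != self.license_fp:
            events.append("license change")
        # Quarterly cadence, approximated as 92 days.
        if today - self.last_review > timedelta(days=92):
            events.append("supplier review overdue")
        return events
```

Hashing the license text catches the "suppliers sometimes change terms" case even when the data itself is unchanged.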
The copyright cases — NYT and Authors Guild
Two landmark actions, both filed in 2023 in the US District Court for the Southern District of New York, establish the litigation surface for training on third-party content.
The New York Times v. OpenAI and Microsoft, filed in December 2023, alleges that OpenAI and Microsoft trained models on millions of Times articles without a license, that the trained models can reproduce Times content at near-verbatim fidelity in response to user prompts, and that the resulting systems compete with the Times’s own offerings.[^1] The case is pending; the discovery and motion record will shape the legal contours of AI training on copyrighted content in the US for years.
Authors Guild v. OpenAI, a set of consolidated class actions with named plaintiffs including multiple well-known authors, alleges systematic copyright infringement in OpenAI’s training corpora.[^2] The actions were filed in 2023 and have been consolidated with related proceedings.
The readiness practitioner does not resolve the legal questions; the court will. But the practitioner makes concrete risk decisions, and the cases have already changed what due diligence looks like. A third-party training corpus assembled before 2023 may have been acquired under assumptions that the cases now call into question. The readiness practitioner treats corpus inheritance as a first-rank audit target: what datasets are in the current training corpus, when were they added, under what terms, and would the decision be the same if made today?
[DIAGRAM: MatrixDiagram — provenance-license-matrix — 2x2 of “Provenance transparency (opaque / documented)” against “License permissiveness (restrictive / permissive)”, mapping the four quadrants: documented + permissive = proceed with standard controls; documented + restrictive = proceed only if use is in scope; opaque + permissive = elevated due diligence required; opaque + restrictive = reject unless exceptional justification]
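The 2x2 above is directly executable as a lookup. Quadrant labels and decisions are taken verbatim from the matrix; the function name is an assumption.

```python
# Provenance-license triage matrix, keyed on
# (provenance transparency, license permissiveness).
TRIAGE = {
    ("documented", "permissive"): "proceed with standard controls",
    ("documented", "restrictive"): "proceed only if use is in scope",
    ("opaque", "permissive"): "elevated due diligence required",
    ("opaque", "restrictive"): "reject unless exceptional justification",
}

def triage(provenance: str, license_posture: str) -> str:
    return TRIAGE[(provenance, license_posture)]
```

Note the asymmetry: a permissive license does not rescue opaque provenance, because the license grantor may not have had the rights it purported to grant.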
Cross-border and residency analysis
Third-party data often crosses borders. The readiness practitioner must perform a residency analysis:
- Where was the data collected? The collection location determines the applicable privacy regime (GDPR, CCPA/CPRA, LGPD, PIPL, and others depending on jurisdiction).
- Where will the data be stored? Storage location determines mandatory data-protection controls and may invoke data-transfer mechanisms (adequacy decisions, Standard Contractual Clauses, binding corporate rules).
- Where will the model be trained? Training location, if different from storage, invokes additional controls.
- Where will the model be deployed? Deployment location determines the applicable AI regime (EU AI Act, US state AI laws, sector-specific rules) and the data-flow controls that apply to users in each region.
Cross-border data flows without a valid transfer mechanism are a frequent enforcement target. Schrems II (CJEU 2020) invalidated the US-EU Privacy Shield and raised the due-diligence bar on transfers to the US. Subsequent adequacy decisions and the EU-US Data Privacy Framework have partially restored a transfer route, subject to ongoing litigation. The readiness practitioner records the transfer mechanism relied upon, the date of the legal opinion supporting it, and the review cadence.
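The four residency questions can be captured as a record that flags missing transfer mechanisms. This is a deliberately simplified sketch (field names assumed, and the EEA-exit rule is a stand-in for jurisdiction-specific legal analysis, not legal advice).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResidencyAnalysis:
    collected_in: str
    stored_in: str
    trained_in: str
    deployed_in: str
    transfer_mechanism: Optional[str] = None  # e.g. "SCCs", "EU-US DPF"

    def crossings(self) -> set:
        # Each hop in the data's journey that changes jurisdiction.
        hops = [(self.collected_in, self.stored_in),
                (self.stored_in, self.trained_in),
                (self.trained_in, self.deployed_in)]
        return {(a, b) for a, b in hops if a != b}

    def missing_transfer_mechanism(self) -> bool:
        # Illustrative rule: data leaving the EEA needs a documented mechanism.
        leaves_eea = any(a == "EEA" and b != "EEA" for a, b in self.crossings())
        return leaves_eea and self.transfer_mechanism is None

r = ResidencyAnalysis("EEA", "US", "US", "US")
print(r.missing_transfer_mechanism())  # True: a readiness finding
```

In practice the record would also carry the date of the supporting legal opinion and the review cadence, as the text requires.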
Open-source and research dataset caution
Open-source and research datasets (Hugging Face, Kaggle, academic releases) carry specific risks the practitioner should name:
- License drift. A dataset’s license may change between versions. The version pinned by the consumer may carry different terms than the latest version.
- Unvetted content. Research datasets have sometimes been pulled after discovery of inappropriate content; the LAION-5B withdrawal (treated under web-scraped corpora below) is the canonical example, and organizations that had trained on it had to reconsider their provenance record.
- Repackaging. A dataset released by one party may be repackaged and redistributed by another. The chain of custody across the repackaging can be unclear.
- Subject-rights latency. Open-source datasets often lack an operational path for subject-rights exercise. An erasure request against a Hugging Face dataset typically cannot be honored at the downstream consumer level without cascading work.
The readiness practitioner applies a stricter due-diligence bar to open-source datasets used in production training than to datasets used only for research. The research-to-production transition is a governance event that re-triggers the full intake.
Web-scraped corpora — a separate risk profile
Web-scraped training data sits in a distinct risk category. The Clearview AI enforcement pattern (Article 4) is one example; large-model training on web scrapes is a second. The readiness practitioner treats scraped corpora with specific controls:
- Robots.txt and terms-of-service compliance record. Evidence that the scraper honored site-level exclusion signals at the time of collection. Retroactive claims are weak; contemporaneous logs are strong.
- Takedown-request handling. A documented process for honoring site-owner or data-subject takedown requests that affect the scraped dataset.
- PII filtering at ingestion. Automated filters to remove obvious PII (names, phones, emails, government IDs) before content enters the training corpus. Residual PII post-filter is a known risk that the practitioner should quantify, not ignore.
- Copyright-notice capture. Where scraped content carries a copyright notice, the notice should be captured and associated with the content; the capture supports subsequent license and takedown analysis.
- Content-category filtering. Automated filtering for illegal or harmful content categories (child-safety content, explicit violence, specific legal categories depending on jurisdiction).
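The PII-filtering control above can be sketched as a redaction pass. This is a toy example, not a production filter: the two regexes (emails and US-style phone numbers) are illustrative, a real filter needs far more patterns, and the residual-PII rate after filtering should be measured, not assumed to be zero.

```python
import re

# Assumed patterns for two obvious PII classes only.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),                                 # email
    re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),  # phone
]

def scrub(text: str):
    """Redact PII matches; return (cleaned text, number of redactions).

    The redaction count is the quantity the practitioner tracks: a rising
    count per million tokens is itself a provenance signal about the feed.
    """
    hits = 0
    for pat in PII_PATTERNS:
        text, n = pat.subn("[REDACTED]", text)
        hits += n
    return text, hits
```

Counting redactions (rather than silently dropping matches) supports the requirement to quantify residual PII risk rather than ignore it.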
The LAION-5B imagery dataset, widely used in open-source image-generation training, was withdrawn from public distribution in December 2023 after Stanford research identified the presence of illegal content categories within the dataset. Organizations that had trained models on LAION-5B had to reconsider their training-corpus provenance record in light of the finding. The readiness practitioner treats this as a recurring pattern: large scraped corpora with limited pre-release curation can be discovered to contain categories of content that invalidate their training-use defensibility.
Open dataset version-pinning and license-stability controls
Open-source datasets on Hugging Face, Kaggle, or similar platforms can change after an organization has adopted them. The readiness practitioner requires version-pinning: the specific version consumed is recorded with the hash or commit identifier, and consumption is pinned to that version until an explicit upgrade decision is made.
License stability is a related concern. A dataset released under a permissive license can be re-released under a more restrictive license by its author; the consumer’s original copy under the old license is typically safe, but new copies downloaded post-change are subject to the new terms. The practitioner records the license version alongside the dataset version.
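Version pinning is mechanically simple: record the SHA-256 of the consumed snapshot, and verify every later copy against the pin before use. A minimal sketch (function names are assumptions):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Stream the file in 1 MiB chunks so large datasets do not need to
    # fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_pin(path: Path, pinned: str) -> None:
    actual = sha256_of(path)
    if actual != pinned:
        raise RuntimeError(
            f"{path} does not match pinned hash; treat as a new version "
            "requiring an explicit upgrade decision"
        )
```

Platforms that expose commit identifiers (as Hugging Face does for dataset repositories) let you pin the commit as well as the content hash; recording both, alongside the license version, closes the license-drift gap described above.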
The “inherited corpus” audit
Many AI programs inherit training corpora assembled before the current diligence standards existed. A 2020 corpus assembled under 2020 diligence is operating inside a 2026 regulatory perimeter. The readiness practitioner runs an inherited-corpus audit:
- Enumerate the training corpora in current use. Identify the sources, the collection dates, and the diligence applied at assembly.
- Re-assess each source against current diligence standards. Flag sources that would not pass today.
- For flagged sources, decide: remove from corpus (and retrain without them), re-acquire under current diligence (where feasible), accept with documented residual risk (where the residual risk is defensible), or retire the dependent models.
- Document every decision in the decision log.
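The audit loop above can be sketched directly. Field names and the `passes_today` callable are illustrative assumptions; the four decision values are taken from the list above.

```python
AUDIT_DECISIONS = {"remove-and-retrain", "re-acquire",
                   "accept-residual-risk", "retire-models"}

def audit_corpus(sources, passes_today):
    """Enumerate sources, re-assess each against today's bar, and log.

    Flagged sources carry an explicit decision (or 'undecided' until one
    is made); passing sources need no decision entry.
    """
    log = []
    for src in sources:
        entry = {"source": src["name"], "added": src["added"],
                 "passes_today": passes_today(src)}
        if not entry["passes_today"]:
            decision = src.get("decision", "undecided")
            assert decision == "undecided" or decision in AUDIT_DECISIONS
            entry["decision"] = decision
        log.append(entry)
    return log
```

The useful property is that "undecided" is a visible, logged state: an inherited source that fails today's bar cannot silently remain in the corpus without someone owning the decision.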
The inherited-corpus audit is uncomfortable work. It often surfaces decisions that are expensive to remediate. The alternative — letting the inherited corpus age without audit — is the pattern the copyright litigation now targets.
Supplier duty under the EU AI Act
Regulation (EU) 2024/1689 brings third-party data suppliers inside the regulatory perimeter for high-risk AI: Article 16 sets the obligations of providers of high-risk AI systems, and Article 25 allocates responsibilities along the AI value chain. The practical consequence is contractual: the consumer of third-party data needs terms that let it discharge its own duties, such as provenance documentation, quality warranties, and notification of material changes to the feed. Where a supplier refuses these terms, the readiness assessment should flag the refusal as a finding that either blocks the use case or requires compensating controls on the consumer side.
Cross-references
- COMPEL Core — Third-party AI: the governance challenge you are not seeing (EATF-Level-1/M1.4-Art13-Third-Party-AI-The-Governance-Challenge-You-Are-Not-Seeing.md) — the framework-level treatment of third-party governance within which third-party data readiness sits.
- COMPEL Core — Data governance for AI (EATF-Level-1/M1.5-Art07-Data-Governance-for-AI.md) — third-party data inherits the organization’s data governance regime, extended for supplier-specific risks.
- AITM-DR Article 4 (./Article-04-Data-Lineage-Provenance-and-Documentation.md) — the provenance discipline that underpins the provenance-audit stage of intake.
- AITM-DR Article 8 (./Article-08-Privacy-Sensitive-Data-Classes-and-Data-Minimization.md) — the privacy discipline constraining third-party data that contains personal information.
- AITM-DR Article 11 (./Article-11-The-Readiness-Scorecard.md) — the scorecard’s third-party section.
Summary
Third-party data readiness assesses license, provenance, quality, subject-rights, and drift risks for each non-internal source. The intake workflow moves each source through seven stages, producing evidence at each. The New York Times v. OpenAI and Authors Guild v. OpenAI actions have reshaped due-diligence expectations for training-corpus acquisition. Cross-border residency analysis identifies the applicable privacy and AI regimes and the transfer mechanism needed. Open-source and research datasets require stricter due diligence when transitioned into production training. The EU AI Act’s supplier-duty provisions bring third-party data suppliers inside the regulatory perimeter for high-risk AI. A readiness practitioner who treats third-party data with the same rigor as internal data — or more — produces a program that survives supervisory review.
Footnotes
[^1]: The New York Times Company v. Microsoft Corporation et al., Case No. 1:23-cv-11195 (S.D.N.Y., filed December 27, 2023). https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
[^2]: Authors Guild v. OpenAI, Inc. et al., consolidated class actions (S.D.N.Y., 2023–2024). Complaint PDF: https://authorsguild.org/app/uploads/2023/09/AG-OpenAI-Complaint.pdf
[^3]: Regulation (EU) 2024/1689, Article 16 (Obligations of providers of high-risk AI systems) and Article 25 (Responsibilities along the AI value chain). https://eur-lex.europa.eu/eli/reg/2024/1689/oj