Vendor Due Diligence Frameworks for AI Suppliers

This article presents a structured framework — domain coverage, evidence types, scoring approach, and gating thresholds — that applies across fine-tuners, application builders, plug-in developers, and Software as a Service (SaaS) AI features. It assumes that General-Purpose AI (GPAI) provider assessment (Article 2 of this module) is conducted separately, because GPAI providers raise distinct regulatory questions.

Why AI Vendor Due Diligence Cannot Reuse the Standard Playbook

Most enterprises already have third-party risk management programs covering financial soundness, cybersecurity, business continuity, and data protection. AI vendors require these and three additional categories.

The first additional category is model-behaviour evidence. Conventional cybersecurity questionnaires ask whether the vendor encrypts data at rest. They do not ask whether the vendor’s model produces reliable outputs in the deployer’s domain, what its hallucination rate is, what mitigations exist for known jailbreaks, or how it handles ambiguous instructions. Without these, the deployer is buying functionality blind.

The second is data-flow opacity. AI vendors often process customer-supplied data through their own model and through upstream providers. Whether that data is used to train models, logged for human review, retained for fine-tuning, or transmitted across borders is a question conventional vendor assessment rarely surfaces with sufficient precision.

The third is regulatory categorisation. Under the European Union (EU) AI Act, accessible at https://artificialintelligenceact.eu/, the deployer’s obligations depend on whether the procured system is categorised as prohibited, high-risk, limited-risk, or minimal-risk. Articles 6 to 27 govern the high-risk regime; Article 26 specifies deployer-specific obligations. Misclassification at procurement time produces compliance failure at deployment time. The vendor must be asked to assert the categorisation in writing.

The Eight Diligence Domains

A complete AI vendor due-diligence questionnaire covers eight domains. Treat the list as the skeleton of a structured assessment, not as a prose form.

1. Corporate and Financial Standing

Standard procurement evidence: incorporation, ultimate beneficial ownership, sanctions screening, financial statements, litigation history, and key-person concentration. AI-specific overlay: foreign-investment exposure, training-data lawsuits, and regulator enforcement actions.

2. AI Governance and Management System

Does the vendor maintain an AI Management System per International Organization for Standardization / International Electrotechnical Commission (ISO/IEC) 42001:2023, accessible at https://www.iso.org/standard/81230.html? Annex A.10 of that standard covers third-party relationships and is directly applicable. Evidence includes management-system scope, internal audit results, and management-review outputs.

3. Data Handling and Privacy

What customer data is collected, processed, retained, and shared? Is it used to train shared models? Are customer-specific tenants isolated? What sub-processors are involved? Where is data hosted? The General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and sector-specific data laws all impose requirements that a generic AI questionnaire must surface.

4. Cybersecurity

Standard controls (encryption, access management, vulnerability management, incident response) plus AI-specific controls (model-weight protection, prompt-injection defence, training-data poisoning detection, inference-pipeline integrity). The U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 800-161 Revision 1 at https://csrc.nist.gov/pubs/sp/800/161/r1/final establishes the cybersecurity supply-chain risk management baseline; the Cloud Security Alliance at https://cloudsecurityalliance.org/ publishes the AI Controls Matrix that extends this to AI workloads. Supply-chain Levels for Software Artifacts (SLSA) at https://slsa.dev/ specifies build-pipeline integrity targets that increasingly apply to model artefacts.

5. Model and Data Provenance

Which foundation model underpins the system? Which training and fine-tuning data was used? Where did that data originate? The U.S. Cybersecurity and Infrastructure Security Agency (CISA) Software Bill of Materials programme at https://www.cisa.gov/sbom is being extended through AI-BOM and Model Bill of Materials (MBOM) constructs to make this information machine-readable. The Software Package Data Exchange standard at https://spdx.dev/ is extending to dataset and model components. Vendors should be asked to commit to producing an AI-BOM at delivery and on every material change.
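
No single AI-BOM schema is yet settled. The sketch below illustrates, as a hypothetical Python data structure, the provenance fields a deployer might require the vendor to populate at delivery and on every material change; the record and field names are assumptions for illustration, not a published SPDX or CISA schema.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """One training or fine-tuning dataset referenced by the model."""
    name: str
    origin: str                   # e.g. "licensed corpus", "web crawl", "customer data"
    licence: str                  # licence or contractual basis for use
    contains_personal_data: bool

@dataclass
class AIBom:
    """Hypothetical AI Bill of Materials record; field names are
    illustrative, not a published SPDX or CISA schema."""
    system_name: str
    system_version: str
    foundation_model: str         # which upstream model underpins the system
    foundation_model_version: str
    fine_tuning_datasets: list[DatasetRecord] = field(default_factory=list)
    upstream_providers: list[str] = field(default_factory=list)
    last_material_change: str = ""  # ISO 8601 date; a change here re-triggers review

# The vendor delivers this record at contract signature and reissues it
# on every material change to the model or its data.
bom = AIBom(
    system_name="fraud-assistant",
    system_version="2.3.0",
    foundation_model="example-foundation-model",   # hypothetical name
    foundation_model_version="2025-01",
    fine_tuning_datasets=[DatasetRecord(
        name="case-notes-2019-2023",
        origin="customer data",
        licence="contractual",
        contains_personal_data=True,
    )],
    upstream_providers=["example-model-provider"],
    last_material_change="2026-03-15",
)
```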

6. Performance, Safety, and Bias Evaluation

What evaluation was conducted before release? On which datasets? With what results? How are demographic-fairness metrics measured? What jailbreak resistance has been tested? The NIST AI Risk Management Framework at https://www.nist.gov/itl/ai-risk-management-framework specifies the categories — valid-and-reliable, safe, secure-and-resilient, accountable-and-transparent, explainable-and-interpretable, privacy-enhanced, fair-with-harmful-bias-managed — that should anchor evidence requests.
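
One way to make these evidence requests systematic, sketched below, is to key them to the seven NIST AI RMF trustworthiness characteristics. The characteristic names are NIST’s; the example evidence items on the right are illustrative assumptions, not prescribed by the framework.

```python
# Evidence requests keyed to the seven NIST AI RMF trustworthiness
# characteristics. The example evidence items are illustrative only.
EVIDENCE_REQUESTS: dict[str, str] = {
    "valid and reliable": "pre-release evaluation results on named datasets",
    "safe": "red-team findings and mitigations for unsafe outputs",
    "secure and resilient": "jailbreak-resistance test reports",
    "accountable and transparent": "model documentation and change logs",
    "explainable and interpretable": "explanation methods available to end users",
    "privacy-enhanced": "training-data privacy controls and audit results",
    "fair, with harmful bias managed": "demographic-fairness metrics and methodology",
}

for characteristic, evidence in EVIDENCE_REQUESTS.items():
    print(f"{characteristic}: request {evidence}")
```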

7. Operational Continuity and Change Management

How are model updates communicated? What deprecation notice is given? What service-level commitments exist? How are silent model swaps prevented? What backup or fallback paths exist if the vendor experiences outage or insolvency?

8. Regulatory Compliance and Conformity

Has the vendor classified the system under the EU AI Act and other applicable regimes? Does it provide the technical documentation a deployer needs to discharge its own obligations under Article 26? Does it commit to incident notification under Article 73 timelines? Does it maintain copyright opt-out compliance and training-data summaries where it acts as a GPAI provider?

Scoring and Gating

Diligence outputs that are not scored are diligence outputs that will not be acted upon. A defensible scoring approach assigns each domain a four-point rating — Sufficient, Sufficient-with-conditions, Insufficient, or Disqualifying — and an associated weighting that reflects the use case’s risk tier. Domains where evidence is absent default to Insufficient, never to “not applicable.”

Gating rules then translate scores to procurement decisions. A common pattern: any Disqualifying rating blocks contracting until remediated; three or more Insufficient ratings escalate to a senior risk committee; any high-risk EU AI Act system requires Sufficient ratings across all eight domains before contract signature. The thresholds belong to the organization, not to the vendor — and they must be set before the assessment begins to avoid post-hoc rationalisation.
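
A minimal sketch of this pattern in Python, assuming placeholder domain names and the thresholds quoted above (an organization would fix its own before assessment begins):

```python
from enum import Enum

class Rating(Enum):
    SUFFICIENT = "Sufficient"
    SUFFICIENT_WITH_CONDITIONS = "Sufficient-with-conditions"
    INSUFFICIENT = "Insufficient"      # default where evidence is absent
    DISQUALIFYING = "Disqualifying"

# Placeholder names for the eight diligence domains.
DOMAINS = [
    "corporate_and_financial", "ai_governance", "data_handling",
    "cybersecurity", "provenance", "evaluation", "continuity",
    "regulatory_compliance",
]

def gate(ratings: dict[str, Rating], high_risk_eu_ai_act: bool) -> str:
    """Translate domain ratings into a procurement decision using the
    gating pattern from the text; thresholds are placeholders."""
    # Domains with no evidence default to Insufficient, never "not applicable".
    ratings = {d: ratings.get(d, Rating.INSUFFICIENT) for d in DOMAINS}

    if any(r is Rating.DISQUALIFYING for r in ratings.values()):
        return "blocked until remediated"
    if sum(r is Rating.INSUFFICIENT for r in ratings.values()) >= 3:
        return "escalate to senior risk committee"
    if high_risk_eu_ai_act and any(r is not Rating.SUFFICIENT for r in ratings.values()):
        return "high-risk system: all eight domains must be Sufficient before signature"
    return "proceed to contracting"

# Example: only one domain evidenced; the other seven default to
# Insufficient, so the assessment escalates automatically.
print(gate({"cybersecurity": Rating.SUFFICIENT}, high_risk_eu_ai_act=False))
```

The default-to-Insufficient rule does the quiet work here: a vendor that declines to answer accumulates Insufficient ratings and triggers escalation without any assessor discretion.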

Tiering by Use Case Risk

Not every vendor warrants the same depth of diligence. A tiered model — minimal (low-risk internal productivity), standard (limited-risk customer-facing), enhanced (high-risk per EU AI Act), and critical (safety-of-life or financial-stability impact) — calibrates the questionnaire to the actual stakes. Article 15 of this module presents the full tiered-program design.
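
As a sketch, assuming illustrative attribute names for the use case, the tier mapping reduces to a short precedence rule:

```python
def diligence_tier(*, customer_facing: bool, eu_ai_act_high_risk: bool,
                   safety_of_life_or_financial_stability: bool) -> str:
    """Map use-case attributes to a diligence tier. Attribute names and
    precedence are illustrative assumptions, not a prescribed taxonomy."""
    if safety_of_life_or_financial_stability:
        return "critical"
    if eu_ai_act_high_risk:
        return "enhanced"
    if customer_facing:
        return "standard"   # limited-risk, customer-facing
    return "minimal"        # low-risk internal productivity

# An internal productivity tool with no high-risk classification:
print(diligence_tier(customer_facing=False, eu_ai_act_high_risk=False,
                     safety_of_life_or_financial_stability=False))  # -> minimal
```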

Maturity Indicators

| Maturity | What AI vendor due diligence looks like |
| --- | --- |
| Foundational (1) | AI vendors complete the same generic questionnaire as commodity SaaS suppliers; AI-specific risks are not surfaced. |
| Developing (2) | An AI overlay questionnaire exists but is inconsistently applied; scoring is qualitative and ad hoc. |
| Defined (3) | The eight diligence domains are assessed for every AI vendor at the standard tier or above; scores gate contracting; decisions are documented. |
| Advanced (4) | Diligence outputs feed directly into contract terms, monitoring plans, and incident-response runbooks; reassessment is triggered by upstream changes. |
| Transformational (5) | The organization publishes its diligence framework; vendors compete on the quality of their evidence pack; the framework influences industry practice. |

Practical Application

A retail bank evaluating a generative-AI fraud-investigation assistant for its analyst staff should classify the system as standard or enhanced under its tiering model, dispatch the eight-domain questionnaire, schedule a deep-dive evidence review for any domain scoring Insufficient, and produce a formal procurement memo recording the score, the gating decision, and the residual-risk acceptances. That memo is the artefact the chief risk officer signs and the regulator will ask for. The vendor’s marketing claims, the relationship manager’s enthusiasm, and the speed pressure on the project belong on the page only as context — they cannot substitute for evidence.

Articles 4 (contracting patterns) and 14 (incident notification) of this module convert the diligence outputs into binding commitments. Articles 10 (continuous monitoring) and 15 (tiered programs) operationalise them across the supplier portfolio.