Vendor AI Due Diligence: The Comprehensive Assessment

FlowRidge

COMPEL Certification Body of Knowledge — Module 2.6: Industry Applications and Regulatory Alignment Article 18 — Domain 20: AI Supply Chain and Third-Party Governance

Why Traditional Vendor Assessment Falls Short

Enterprise vendor assessment programs are mature. Organizations have decades of experience evaluating vendor financial stability, operational resilience, security posture, compliance status, and service quality. Frameworks like NIST 800-53, ISO 27001, SOC 2, and SIG Questionnaire provide standardized approaches to vendor risk assessment that are well understood and broadly adopted.

These frameworks were not designed for AI. They assess whether a vendor protects data, maintains availability, and complies with regulations. They do not assess whether a vendor’s AI models are biased, whether its training data is representative, whether its model outputs are explainable, whether its models have been tested for adversarial robustness, or whether the vendor’s model update practices maintain the quality and fairness characteristics that justified the original procurement.

Adding a few AI-specific questions to an existing vendor questionnaire is necessary but not sufficient. AI introduces categories of risk that require dedicated assessment methodologies. This article presents a comprehensive vendor AI due diligence framework organized into eight assessment categories, each with specific assessment questions, evidence requirements, and scoring guidance.

Assessment Category 1: Model Transparency and Documentation

Model transparency is the foundation of all subsequent assessment. If you cannot understand what the vendor’s AI does, how it was built, and how it makes decisions, you cannot meaningfully assess any other dimension of risk.

Key Assessment Questions

Model architecture and design. What type of model does the vendor use (transformer, neural network, gradient boosting, ensemble, other)? What is the model’s size and complexity? Is the model a proprietary development, a fine-tuned version of a foundation model, or a direct deployment of a third-party model? If the model is based on a foundation model, which one?

Model purpose and scope. What is the model designed to do? What inputs does it accept? What outputs does it produce? What are the intended use cases? What are the explicitly unsupported use cases? What are the known limitations?

Model documentation. Does the vendor publish a model card or equivalent documentation? Does the documentation follow a recognized standard (Google Model Cards, Microsoft Datasheets for Datasets, Hugging Face Model Cards)? Is the documentation current? Does it reflect the currently deployed model version?

Decision logic. For classification and recommendation models, can the vendor explain how the model arrives at its outputs? Does the vendor provide feature importance, attention weights, SHAP values, or other interpretability mechanisms? For generative models, can the vendor explain the guardrails, filters, and safety mechanisms applied to model outputs?

Evidence to Request

Model card or equivalent documentation
Architecture overview (sufficient to understand the type and general approach, not necessarily proprietary details)
Intended use case documentation
Known limitations documentation
Interpretability documentation (for classification and decision-making models)
Model versioning information (current version, version history, update cadence)

Scoring Guidance

5 (Strong): Comprehensive model documentation available. Model card published and current. Architecture and design rationale documented. Known limitations explicitly stated. Interpretability mechanisms available for decision-making models.
4 (Adequate): Model documentation exists but may lack some detail. General architecture described. Key limitations documented. Some interpretability available.
3 (Partial): Limited model documentation. Architecture described at a high level. Limitations incompletely documented. Limited interpretability.
2 (Minimal): Very limited documentation. Architecture treated as proprietary. Limitations poorly documented. No interpretability mechanisms.
1 (Insufficient): No meaningful model documentation. Architecture undisclosed. Limitations not documented. Model operates as a complete black box.

Assessment Category 2: Training Data Practices

The quality, representativeness, and provenance of training data directly determine the quality, fairness, and reliability of the model’s outputs. A model trained on biased data will produce biased outputs. A model trained on unrepresentative data will perform poorly on underrepresented populations. A model trained on improperly licensed data may create intellectual property risk.

Key Assessment Questions

Data sources. What data sources were used for training? Are the sources public datasets, proprietary datasets, licensed datasets, or user-generated data? Can the vendor identify the specific datasets used?

Data representativeness. How does the training data represent the populations that the model will be applied to? Has the vendor assessed the demographic, geographic, and temporal coverage of the training data? Are there known gaps in representation?

Data quality. What data quality processes does the vendor apply? How is training data cleaned, validated, and curated? What processes exist for identifying and removing harmful, biased, or inaccurate training data?

Data consent and licensing. Was the training data collected with appropriate consent? Is the training data licensed for AI training purposes? Has the vendor assessed intellectual property risks associated with the training data?

Data freshness. How current is the training data? What is the knowledge cutoff date? How frequently is the model retrained with new data? Are there known temporal biases (e.g., training data reflecting outdated norms or practices)?

Customer data usage. Does the vendor use customer data for model training? Can customers opt out? What data retention policies apply to data processed by the model? Is customer data used for training shared with other customers’ models?

Evidence to Request

Training data documentation (sources, sizes, temporal coverage)
Data representativeness analysis
Data quality methodology documentation
Data licensing and consent documentation
Customer data usage policy
Data retention policy for processed inputs
Opt-out mechanisms and their scope

Scoring Guidance

5 (Strong): Comprehensive training data documentation. Representativeness analysis conducted and shared. Clear data quality processes. Proper licensing and consent. Transparent customer data policy with meaningful opt-out.
4 (Adequate): Good training data documentation. Some representativeness analysis. Established data quality processes. Licensing addressed. Customer data policy exists.
3 (Partial): Limited training data documentation. Representativeness acknowledged but not rigorously analyzed. Basic data quality processes. Licensing partially addressed.
2 (Minimal): Very limited training data documentation. Representativeness not assessed. Limited data quality processes. Licensing unclear.
1 (Insufficient): No training data documentation. No representativeness assessment. No visible data quality processes. Licensing not addressed.

Assessment Category 3: Bias Testing and Fairness

AI bias is one of the highest-profile risks in enterprise AI governance. Biased AI outputs can cause harm to individuals, create legal liability, damage reputation, and violate regulatory requirements. Assessing a vendor’s bias testing practices is essential for any procured AI system that makes or influences decisions about people.

Key Assessment Questions

Testing methodology. Does the vendor conduct bias testing? What testing methodology is used? Does the vendor test for disparate impact, equal opportunity, demographic parity, calibration, or other fairness metrics? Are the metrics appropriate for the model’s use case?

Protected characteristics. Which protected characteristics are tested? Race, gender, age, disability, religion, national origin, sexual orientation? Does the vendor test for intersectional bias (e.g., bias affecting Black women specifically, not just Black individuals or women separately)?

Test populations. What populations are included in bias testing? Are the test populations representative of the populations that the organization will apply the model to? Does the vendor conduct bias testing specific to the customer’s context, or only generic testing?

Benchmark standards. What bias thresholds does the vendor apply? How does the vendor define “acceptable” levels of disparity? Are these thresholds aligned with regulatory requirements (e.g., the EEOC’s four-fifths rule, EU AI Act non-discrimination requirements)?

Mitigation practices. When bias is detected, what mitigation actions does the vendor take? Model retraining, output adjustment, use case restriction, customer notification? Does the vendor have a documented bias remediation process?

Ongoing monitoring. Does the vendor conduct ongoing bias monitoring after model deployment, or only during development? How frequently? Are customers notified of bias monitoring results?

Evidence to Request

Bias testing methodology documentation
Bias testing results (most recent)
Fairness metrics used and thresholds applied
Protected characteristics tested
Bias remediation process documentation
Ongoing bias monitoring program documentation
Third-party bias audits (if conducted)

Scoring Guidance

5 (Strong): Comprehensive bias testing across multiple protected characteristics. Intersectional testing conducted. Clear fairness metrics and thresholds. Documented remediation process. Ongoing monitoring. Third-party audits.
4 (Adequate): Bias testing for key protected characteristics. Recognized fairness metrics used. Remediation process exists. Some ongoing monitoring.
3 (Partial): Limited bias testing. Some protected characteristics tested. Fairness metrics used but thresholds not clearly defined. Limited remediation process.
2 (Minimal): Minimal bias testing. Few protected characteristics tested. No clear fairness metrics. No documented remediation process.
1 (Insufficient): No bias testing conducted or documented.

Assessment Category 4: Security and Adversarial Robustness

AI systems face security threats that traditional software security frameworks do not fully address. In addition to standard cybersecurity concerns, AI systems are vulnerable to adversarial attacks that manipulate inputs to cause incorrect outputs, prompt injection that subverts the model’s instructions, data poisoning that corrupts training data, and model extraction that steals the model’s intellectual property.

Key Assessment Questions

AI-specific security testing. Does the vendor conduct adversarial robustness testing? Has the vendor tested for prompt injection vulnerabilities? Has the vendor tested for data poisoning resistance? Has the vendor tested for model extraction resistance?

Input validation. How does the vendor validate inputs to the AI system? Are there input filters, sanitization processes, or content safety classifiers? How are malicious inputs detected and handled?

Output safety. How does the vendor ensure that AI outputs are safe? Are there output filters, content safety classifiers, or toxicity detectors? How does the vendor prevent the generation of harmful, misleading, or inappropriate content?

Infrastructure security. How is the AI infrastructure secured? What access controls protect the model, training data, and inference pipeline? What encryption protects data in transit and at rest? What logging and monitoring cover AI-specific events?

Incident history. Has the vendor experienced any AI-related security incidents? What were the incidents, what was the impact, and what was the remediation?

Red teaming. Does the vendor conduct red team exercises against its AI systems? What adversarial scenarios are tested? How frequently? Are results shared with customers?

Evidence to Request

AI security testing methodology and results
Adversarial robustness testing documentation
Prompt injection mitigation documentation
Input validation and output safety architecture
Infrastructure security documentation (SOC 2, ISO 27001)
AI-specific incident history
Red team program documentation

Scoring Guidance

5 (Strong): Comprehensive AI security program. Adversarial robustness testing. Prompt injection mitigation. Red teaming. Strong infrastructure security. Transparent incident history.
4 (Adequate): Good AI security practices. Some adversarial testing. Input validation and output safety measures. Standard infrastructure security certifications.
3 (Partial): Basic AI security measures. Limited adversarial testing. Standard infrastructure security but limited AI-specific security testing.
2 (Minimal): Minimal AI-specific security. Relies primarily on standard infrastructure security measures.
1 (Insufficient): No AI-specific security measures documented.

Assessment Category 5: Privacy and Data Protection

AI systems process data in ways that create privacy risks beyond those assessed by standard vendor data protection reviews. AI may infer sensitive information from non-sensitive inputs, retain data in model weights, or use personal data for purposes beyond the individual’s expectations.

Key Assessment Questions

Data processing architecture. Where is data processed? Which data centers, regions, and jurisdictions? Is data processing compliant with the organization’s data residency requirements? Does data cross jurisdictional boundaries during processing?

Data retention. How long is input data retained? Is input data used for model training? Can retention be configured by the customer? What happens to data after the retention period?

Purpose limitation. Is data processed only for the stated purpose, or is it used for other purposes (model improvement, analytics, product development)? Can the customer control the purposes for which their data is used?

Data minimization. Does the AI process only the data necessary for its function? Or does it consume broader data than necessary? Can the data scope be restricted?

Individual rights. How does the vendor support individual data subject rights (access, rectification, erasure, portability) as they relate to AI processing? Can data be deleted from trained models?

Privacy impact assessment. Has the vendor conducted a privacy impact assessment (PIA) or data protection impact assessment (DPIA) for the AI system? Is it available for customer review?

Evidence to Request

Data processing locations and architecture
Data retention policy specific to AI processing
Customer data usage policy (training, improvement, analytics)
Data minimization practices
Individual rights support documentation
Privacy impact assessment or DPIA
Data processing agreement (DPA) with AI-specific provisions

Scoring Guidance

5 (Strong): Comprehensive privacy program. Clear data processing architecture. Configurable retention. No customer data used for training without consent. Strong individual rights support. DPIA conducted and shared.
4 (Adequate): Good privacy practices. Data processing locations documented. Retention policy exists. Customer data usage policy exists. Individual rights supported.
3 (Partial): Basic privacy practices. Data processing partially documented. Retention policy exists but limited configurability. Customer data usage policy vague.
2 (Minimal): Limited privacy documentation. Data processing opaque. Retention unclear. Customer data usage not clearly defined.
1 (Insufficient): No meaningful privacy documentation for AI processing.

Assessment Category 6: Incident Response and Notification

When an AI system produces biased, harmful, or inaccurate outputs, the vendor’s incident response capabilities determine how quickly the issue is identified, contained, and remediated — and how quickly the customer is informed.

Key Assessment Questions

Incident detection. How does the vendor detect AI-related incidents? Does the vendor monitor for biased outputs, inaccurate results, harmful content generation, or model degradation? What monitoring coverage exists?

Incident classification. How does the vendor classify AI incidents? Is there a severity taxonomy specific to AI incidents? How are AI incidents distinguished from general service incidents?

Customer notification. What is the vendor’s notification timeline for AI-related incidents? Is there a contractual commitment? Are customers notified of all incidents or only those above a severity threshold? What information is provided in incident notifications?

Remediation process. What remediation actions can the vendor take? Model rollback, configuration changes, output filtering, service suspension? How quickly can remediation be implemented?

Post-incident review. Does the vendor conduct root cause analysis for AI incidents? Are findings shared with customers? Are systemic improvements made to prevent recurrence?

Evidence to Request

AI incident response plan
Incident classification taxonomy
Customer notification SLA and procedures
Historical incident summary (anonymized)
Post-incident review process documentation
Remediation capabilities documentation

Scoring Guidance

5 (Strong): Mature AI incident response. Clear severity taxonomy. Contractual notification SLA. Root cause analysis conducted and shared. Systematic prevention measures.
4 (Adequate): Established incident response. AI incidents identified and managed. Customer notification practices exist. Root cause analysis conducted.
3 (Partial): Basic incident response. AI incidents handled through general incident processes. Customer notification on major incidents. Limited root cause analysis.
2 (Minimal): Limited incident response. AI incidents not clearly distinguished from general incidents. Customer notification ad hoc.
1 (Insufficient): No AI-specific incident response capability.

Assessment Category 7: Contractual Terms and Service Commitments

The contractual framework between the organization and the AI vendor defines the legal basis for governance, sets expectations for vendor behavior, and provides remedies when expectations are not met.

Key Assessment Questions

AI-specific terms. Does the contract include AI-specific terms, or does AI fall under general software licensing terms? Are the AI-specific terms sufficient for governance purposes?

Performance commitments. Does the vendor make contractual commitments about AI performance (accuracy, fairness, reliability)? Are these commitments measurable and enforceable?

Transparency obligations. Is the vendor contractually obligated to provide transparency about model changes, training data, bias testing results, or incident findings? Are there contractual provisions for AI-BOM documentation?

Audit rights. Does the contract provide the customer with audit rights specific to AI? Can the customer (or a designated third party) audit the vendor’s AI practices?

Liability and indemnification. How is liability allocated for AI-related harms? Does the vendor indemnify the customer for claims arising from biased, inaccurate, or harmful AI outputs?

Termination and portability. Can the customer terminate for AI-related cause (e.g., vendor fails to meet bias testing obligations)? Can the customer’s data be extracted and ported to an alternative vendor?

Evidence to Request

AI-specific contract terms or addendum
Performance SLA documentation
Transparency commitments documentation
Audit rights clause
Liability and indemnification provisions
Termination and portability provisions

Scoring Guidance

5 (Strong): Comprehensive AI-specific contractual terms. Measurable performance commitments. Transparency obligations. Audit rights. Clear liability allocation. Termination for AI-related cause.
4 (Adequate): AI-specific terms exist. Some performance commitments. Transparency provisions. Limited audit rights.
3 (Partial): Limited AI-specific terms. General performance commitments. Minimal transparency provisions. No AI-specific audit rights.
2 (Minimal): No AI-specific terms. AI covered by general software terms. No AI-specific performance, transparency, or audit provisions.
1 (Insufficient): Contract does not address AI at all.

Assessment Category 8: Responsible AI Program Maturity

The vendor’s organizational commitment to responsible AI — the structures, processes, leadership, and culture that support ethical AI development and deployment — is a leading indicator of the vendor’s ability to manage AI risk over time.

Key Assessment Questions

Organizational commitment. Does the vendor have a published responsible AI policy or AI ethics policy? Is there a dedicated responsible AI team or function? Does responsible AI have executive sponsorship? Is responsible AI integrated into the vendor’s AI development lifecycle?

Standards alignment. Does the vendor align with recognized AI governance standards? ISO/IEC 42001? NIST AI RMF? IEEE standards? OECD AI Principles? Does the vendor hold any AI-related certifications?

External accountability. Does the vendor subject its AI practices to external scrutiny? Third-party audits? External advisory boards? Academic partnerships? Participation in industry AI governance initiatives?

Continuous improvement. How does the vendor improve its responsible AI practices over time? Does the vendor publish responsible AI progress reports? Does the vendor set measurable goals for responsible AI improvement?

Industry leadership. Does the vendor contribute to AI governance standards development? Does the vendor publish research on responsible AI? Does the vendor participate in multi-stakeholder AI governance initiatives?

Evidence to Request

Responsible AI policy or AI ethics policy
Responsible AI team structure and leadership
AI governance standards alignment documentation
Third-party audit reports
External advisory board composition
Responsible AI progress reports
AI-related certifications (ISO 42001, etc.)

Scoring Guidance

5 (Strong): Mature responsible AI program. Published policy. Dedicated team with executive sponsorship. Standards alignment. External accountability. Published progress reports. Industry leadership.
4 (Adequate): Established responsible AI program. Published policy. Dedicated resources. Some standards alignment. Some external accountability.
3 (Partial): Emerging responsible AI program. Policy exists but program is early-stage. Limited dedicated resources. Aspirational standards alignment.
2 (Minimal): Minimal responsible AI commitment. Vague policy statements. No dedicated resources. No standards alignment.
1 (Insufficient): No responsible AI program or commitment.

Scoring and Recommendation Methodology

Composite Assessment Score

Calculate the composite assessment score as the average of the eight category scores. This produces a score from 1.0 to 5.0 that provides an overall assessment of the vendor’s AI governance maturity.

Scoring Interpretation

4.0 - 5.0 (Strong): The vendor demonstrates mature AI governance practices. Proceed with procurement, subject to standard contractual protections and ongoing monitoring.
3.0 - 3.9 (Adequate): The vendor demonstrates adequate AI governance practices with room for improvement. Proceed with procurement, with enhanced contractual protections and monitoring. Work with the vendor on a governance improvement roadmap.
2.0 - 2.9 (Partial): The vendor demonstrates limited AI governance practices. Proceed with caution, only if the business need is compelling and no better-governed alternatives exist. Require significant contractual protections. Implement enhanced monitoring. Establish a time-bound governance improvement plan with the vendor.
1.0 - 1.9 (Insufficient): The vendor demonstrates inadequate AI governance practices. Do not proceed with procurement unless the AI features can be disabled. If AI features cannot be disabled, seek alternative vendors.

Contextual Adjustments

The composite score provides a baseline assessment, but the final recommendation must consider the context of the specific deployment:

A vendor with a composite score of 3.5 may be acceptable for a low-risk internal productivity tool but unacceptable for a high-risk customer-facing decision system
A vendor with a strong score in bias testing (Category 3) but a weak score in incident response (Category 6) may be acceptable if the organization can implement its own monitoring and incident response
A vendor with a weak score in contractual terms (Category 7) may be acceptable if the organization has sufficient bargaining power to negotiate improved terms

The assessment framework provides structure and consistency. Professional judgment provides the contextual wisdom to apply the framework appropriately.

Previous in the Domain 20 series: Article 17 — Shadow AI Discovery and Inventory Methodology (Module 2.6) Next in the Domain 20 series: Article 11 — AI Supply Chain Governance at Enterprise Scale (Module 3.7)