This article opens Module 1.8 by establishing why ML systems require a dedicated threat-modeling discipline that extends — not replaces — the organization’s existing application security practice, and by walking the canonical AI attack lifecycle that subsequent articles will explore in depth.
Why traditional threat modeling is insufficient for AI systems
Traditional application threat modeling — the STRIDE methodology Microsoft popularized, the OWASP Application Security Verification Standard, the dataflow diagrams every senior security engineer has drawn a hundred times — captures the network, identity, and application layers of an AI system perfectly well. It does not capture the model. The model is a new kind of asset: a high-dimensional statistical artefact whose behaviour was learned from data the engineering team did not author and cannot fully inspect, whose failure modes are continuous rather than binary, and whose security properties depend on the inputs it receives in production.
A canonical example clarifies the gap. A traditional threat model for a fraud detection service identifies the API endpoint as a Spoofing/Tampering surface, requires Transport Layer Security (TLS), authenticates callers with mutual TLS or signed JWTs, and rate-limits requests to defeat brute-force enumeration. All of this is necessary and none of it prevents an attacker who has legitimate access to the API from submitting a sequence of carefully crafted transactions designed to map the model’s decision boundary, train a surrogate model on the responses, and then craft transactions that the surrogate predicts will be classified as legitimate even though they are fraudulent. The attacker never violated authentication, never tampered with data in transit, and never exceeded a rate limit. The traditional threat model is silent on the entire attack class.
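To make the attack class concrete, the sketch below shows the general shape of a model-extraction attack under simplifying assumptions: query_model is a hypothetical stand-in for the authenticated fraud-scoring API, the probe transactions are drawn from a synthetic distribution, and the surrogate is a plain scikit-learn decision tree. A real attacker would shape probes around realistic transactions, but the structure is the same: query, record the responses, fit a local copy, then search that copy for inputs it scores as legitimate.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_surrogate(query_model, n_queries=5000, n_features=20, seed=0):
    """Fit a local surrogate of a remote classifier using only legitimate API access.

    query_model is a hypothetical callable wrapping the fraud-scoring endpoint:
    it takes a batch of feature vectors and returns the predicted labels.
    """
    rng = np.random.default_rng(seed)
    X_probe = rng.normal(size=(n_queries, n_features))  # synthetic probe transactions
    y_probe = query_model(X_probe)                      # labels returned by the real model
    surrogate = DecisionTreeClassifier(max_depth=8).fit(X_probe, y_probe)
    return surrogate  # the attacker now searches the surrogate for inputs scored "legitimate"
```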
The National Institute of Standards and Technology (NIST) AI Risk Management Framework, published as NIST AI 100-1 and extended by its Generative AI Profile, names this gap explicitly under the MANAGE function and prescribes that organizations extend their threat modeling practice to cover AI-specific vectors. The framework's resource hub https://www.nist.gov/itl/ai-risk-management-framework provides the authoritative starting point. NIST SP 800-218A, Secure Software Development Practices for Generative AI and Dual-Use Foundation Models https://csrc.nist.gov/pubs/sp/800/218/a/final, extends NIST's Secure Software Development Framework with practices specific to AI systems and is the best engineering-grade reference for translating policy into pull-request-level requirements.
The canonical AI attack lifecycle
A threat model for an ML system reasons across five stages: data collection, training, model storage, inference serving, and post-deployment evolution. Each stage admits a distinctive set of attacks.
At data collection, the adversary’s leverage is poisoning. Training data is rarely authored end-to-end by the team operating the model — it is scraped, purchased, crowdsourced, ingested from upstream sources, or accumulated from production logs that themselves include adversarial inputs. The poisoning may be a backdoor (the model behaves normally except on inputs containing a specific trigger), an availability attack (the model’s accuracy degrades broadly), or a targeted attack (the model misbehaves on the class of input the attacker cares about). MITRE ATLAS https://atlas.mitre.org/ catalogs poisoning under the Persistence tactic and provides the canonical taxonomy of public case studies. Article 5 of this module is dedicated to data poisoning.
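The mechanics of a backdoor are simple enough to sketch. The fragment below is a minimal illustration on tabular data with hypothetical names: it stamps a trigger value into a small fraction of records and relabels them, so a model trained on the result behaves normally until an input carries the trigger. The defender-relevant point is how little data the attack requires.

```python
import numpy as np

def inject_backdoor(X, y, trigger_value, target_label, poison_fraction=0.01, seed=0):
    """Poison a copy of the training set with a simple trigger-based backdoor.

    The trigger here is an out-of-range value in the last feature; real backdoors
    use subtler patterns, but the structure of the attack is the same.
    """
    rng = np.random.default_rng(seed)
    X_p, y_p = X.copy(), y.copy()
    idx = rng.choice(len(X_p), size=max(1, int(poison_fraction * len(X_p))), replace=False)
    X_p[idx, -1] = trigger_value   # stamp the trigger onto the selected records
    y_p[idx] = target_label        # relabel them to the attacker's chosen class
    return X_p, y_p
```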
At training, the adversary's leverage is the supply chain. Training frameworks, base models pulled from public repositories, pre-trained embeddings, and labeled-data marketplaces are all upstream of the team that ships the production model. A compromised dependency, a backdoored base model, or a tampered training script that exfiltrates gradients introduces vulnerabilities the deployed model carries forever. The European Union's AI Act, in Article 15 https://artificialintelligenceact.eu/article/15/, explicitly requires high-risk AI systems to be designed with cybersecurity in mind and to be resilient to attempts at training-data manipulation. Article 12 of this module covers supply-chain security in depth.
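One supply-chain control that translates directly into code is pinning the content hash of every externally sourced artefact, base models included, at the moment it is vetted. The sketch below is a minimal version, assuming the pinned digest is recorded alongside the dependency manifest and checked before the file is ever deserialized.

```python
import hashlib

def verify_pinned_artifact(path, pinned_sha256, chunk_size=1 << 20):
    """Refuse to use a downloaded base model whose hash differs from the vetted digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    if digest.hexdigest() != pinned_sha256:
        raise RuntimeError(f"{path} failed integrity check against the pinned digest")
```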
At model storage, the adversary's leverage is theft and tampering. The trained model artefact is high-value intellectual property, the serialized output of substantial compute investment, and a vehicle for exfiltrating information about the training data. Model files that sit in object storage with permissive Identity and Access Management (IAM) policies, are copied between staging and production environments without provenance tracking, or are loaded at inference time without integrity verification create theft and tampering vectors that traditional secret-management practices do not anticipate. Article 4 of this module addresses model theft and intellectual property protection.
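Storage-level tampering is mitigated by signing the artefact at build time and verifying the signature at load time. The sketch below is a minimal stand-in for a full Sigstore workflow, using a raw Ed25519 key pair from the cryptography package; key management and distribution are out of scope here.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_model(private_key: Ed25519PrivateKey, model_bytes: bytes) -> bytes:
    """Produce a detached signature over the serialized model in the build pipeline."""
    return private_key.sign(model_bytes)

def load_model_verified(public_key, model_bytes: bytes, signature: bytes) -> bytes:
    """Verify the signature at inference start-up; refuse to deserialize on failure."""
    try:
        public_key.verify(signature, model_bytes)
    except InvalidSignature:
        raise RuntimeError("model artefact failed signature verification; not loading")
    return model_bytes  # safe to hand to the deserializer
```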
At inference, the adversary’s leverage is in the input distribution. Adversarial examples (Article 2), prompt injection for Large Language Models (Article 3), output handling vulnerabilities, model inversion that reconstructs training data, and membership inference that reveals whether a specific record was in the training set are all attacks executed entirely through legitimate API access. The OWASP Top 10 for Large Language Model Applications https://owasp.org/www-project-top-10-for-large-language-model-applications/ is the working catalog the security industry has converged on for LLM-specific inference vectors and should be cross-referenced against every LLM threat model.
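Of these, membership inference is the easiest to demonstrate in a few lines. The sketch below shows the naive confidence-threshold variant, assuming the attacker can obtain per-class probabilities for candidate records and can calibrate a threshold on records known not to have been used in training.

```python
import numpy as np

def membership_scores(model, X, y_true):
    """Score candidate records by the model's confidence in their true label.

    Unusually high confidence is weak evidence that a record was in the training
    set; the attacker flags records whose score exceeds a threshold calibrated
    on data known to be outside the training set.
    """
    probs = model.predict_proba(X)                  # shape (n_records, n_classes)
    return probs[np.arange(len(y_true)), y_true]    # confidence in the true class
```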
At post-deployment evolution, the adversary’s leverage is in the feedback loop. Models that retrain on production data inherit any adversarial drift the attackers introduce; models that learn online are vulnerable in real time. Drift detection, retraining hygiene, and the boundary between observed-and-trusted versus observed-and-quarantined production data become security controls, not just MLOps controls.
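Drift detection at this stage can start as simply as a per-feature two-sample test between the training distribution and a recent window of production traffic. The sketch below uses a Kolmogorov-Smirnov test and assumes tabular features; it is a baseline, not a complete drift-detection system.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(X_train, X_prod, alpha=0.01):
    """Return the features whose production distribution has shifted from training."""
    flagged = []
    for j in range(X_train.shape[1]):
        statistic, p_value = ks_2samp(X_train[:, j], X_prod[:, j])
        if p_value < alpha:
            flagged.append((j, statistic))
    return flagged  # drifted windows go to quarantine for review, not straight into retraining
```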
Translating threat models into engineering controls
A threat model that does not change what the engineering team ships is a document, not a control. The COMPEL discipline insists that every identified threat be mapped to one of four resolutions: a control the team is implementing this cycle, a control the team has accepted as future work with a tracked deadline, a residual risk the team has documented and a governance body has accepted, or a design change that eliminates the threat by removing the vulnerable surface entirely.
Practical engineering controls for the lifecycle above include data-provenance tracking with cryptographic signatures on training datasets; supply-chain controls using Software Bill of Materials (SBOM) standards extended with AI-BOM coverage of model artefacts and weights; model signing using Sigstore or equivalent so the artefact loaded at inference can be verified against a build-time signature; runtime input validation that rejects out-of-distribution inputs before they reach the model; output handling that treats model outputs as untrusted and validates them before downstream use; and continuous monitoring of inference logs for the statistical signatures of extraction attacks, evasion attempts, and prompt injection.
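As one concrete example of runtime input validation, the sketch below gates requests on a per-feature z-score against statistics recorded at training time; anything outside the gate is quarantined rather than scored. It is a deliberately simple baseline under strong assumptions, not a substitute for a proper out-of-distribution detector.

```python
import numpy as np

class InputGate:
    """Reject or quarantine inputs far outside the training distribution
    before they reach the model (a per-feature z-score gate)."""

    def __init__(self, X_train, max_z=6.0):
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0) + 1e-9  # avoid division by zero
        self.max_z = max_z

    def accepts(self, x):
        z = np.abs((np.asarray(x) - self.mean) / self.std)
        return bool(np.all(z < self.max_z))  # False -> quarantine and log, do not score
```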
Gartner's research on AI Trust, Risk, and Security Management (AI TRiSM), summarized in its annual strategic technology trends coverage https://www.gartner.com/en/articles/gartner-top-strategic-technology-trends-for-2024, tracks the maturity of the commercial tooling that supports each of these controls and is worth reading annually as the market consolidates.
Maturity Indicators
Foundational. The organization has no AI-specific threat model. ML systems are protected only by general application security controls. Security teams are not consulted on model design choices. Model artefacts sit in storage without integrity verification. There is no inventory of which production systems include ML components.
Applied. A threat model exists for at least one production ML system, drafted jointly by the ML team and the security team. The threat model enumerates the canonical AI attack vectors (poisoning, theft, evasion, inversion, prompt injection, supply-chain compromise) and names a control or accepted residual risk for each. Threat-modeling sessions are scheduled at the design stage of new ML projects. Model artefacts are stored with access controls separate from general code. AI-specific topics have been added to developer security training.
Advanced. Every production ML system has a current, version-controlled threat model that is reviewed each release. The threat model is referenced in design reviews, security audits, and incident response playbooks. AI-specific controls — model signing, input validation, output sanitization, training-data provenance — are implemented and monitored. A dedicated AI security function exists with named accountability. Threat models are updated when adversary capabilities or model architectures change.
Strategic. The organization treats AI threat modeling as a continuously evolving discipline. Threat models are informed by red-team exercises (Article 11), incident retrospectives (Article 14), and external threat intelligence specific to the ML stack. The threat-model artefact is consumed by the platform team, the policy team, and the procurement team — not just by application security. The organization contributes to MITRE ATLAS, the OWASP LLM Top 10, or equivalent public bodies of knowledge. AI threat-modeling competence is a hiring criterion across the security organization, and the maturity of the practice is itself audited on a regular schedule.
Practical Application
A workable starting point for a team that has no AI threat model today is a one-page document per production ML system, structured around the five lifecycle stages. For each stage the team writes one to three sentences naming the assets at risk, the most plausible adversaries, the highest-priority attack vectors, and the controls currently in place — even if “currently in place” is “none.” The document is signed by the ML lead and the security lead and reviewed at the next design checkpoint.
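Teams that prefer to keep the document in version control next to the model code can generate it from a small template. The sketch below is one hypothetical way to scaffold the five-stage structure in Python, not a prescribed format.

```python
from dataclasses import dataclass, field

STAGES = ["data collection", "training", "model storage",
          "inference serving", "post-deployment evolution"]

@dataclass
class StageEntry:
    assets_at_risk: str = "not yet documented"
    adversaries: str = "not yet documented"
    attack_vectors: str = "not yet documented"
    controls_in_place: str = "none"

@dataclass
class ThreatModel:
    system: str
    ml_lead: str
    security_lead: str
    stages: dict = field(default_factory=lambda: {s: StageEntry() for s in STAGES})

    def render(self) -> str:
        """Render the one-page threat model as plain text for review and sign-off."""
        lines = [f"Threat model: {self.system}",
                 f"Signed: {self.ml_lead} (ML lead), {self.security_lead} (security lead)"]
        for name, entry in self.stages.items():
            lines += [f"\n{name}",
                      f"  assets at risk: {entry.assets_at_risk}",
                      f"  adversaries: {entry.adversaries}",
                      f"  attack vectors: {entry.attack_vectors}",
                      f"  controls in place: {entry.controls_in_place}"]
        return "\n".join(lines)
```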
This minimum viable threat model immediately surfaces the gap between what the team thought was protected and what is actually protected, drives the first round of control investment, and creates the artefact that the rest of Module 1.8 builds upon. Subsequent articles deepen each lifecycle stage with attack-specific defensive techniques, while Article 15 closes the module with the maturity roadmap that turns a single threat model into an enterprise AI security program. The threat model is the entry point. Everything else in this module either feeds it or is driven by it.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.