AITF M1.8-Art02 v1.0 Reviewed 2026-04-06 Open Access
AITF · Foundations

Adversarial Attacks on AI Systems: Detection and Defense


9 min read Article 2 of 15

This article surveys the attack class, explains the structural reason production models remain vulnerable absent specific defenses, walks through the operational defensive techniques that have empirical support, and shows how to integrate adversarial detection into a production inference platform.

Why ML models are structurally vulnerable

The mathematical reason ML models are vulnerable to adversarial inputs is that they learn a high-dimensional decision surface from a finite training sample, and the surface is dense with directions in which a small perturbation of an input crosses a decision boundary even though the perturbation is imperceptible to a human. The phenomenon is not a bug in any specific model architecture; it is a property of the optimization process that produces models in the first place. Adding training data narrows but does not eliminate the regions of the input space in which the model is wrong. Neither does increasing model capacity. Neither does changing the architecture. The vulnerability is the price the field pays for the generalization properties that make ML useful at all.
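
To make the mechanism concrete, here is a minimal numpy sketch using a toy linear classifier (not any model from this article). For a linear model the attack direction is the weight vector itself; for a deep network the same step is taken along the loss gradient computed by backpropagation, which is the Fast Gradient Sign Method. The dimensions and seed are arbitrary illustrative choices.

```python
import numpy as np

# Toy linear classifier in 1,000 dimensions: sign(w @ x) is the predicted class.
# The same mechanism drives adversarial examples against deep networks, with the
# gradient supplied by backpropagation instead of the weight vector directly.
rng = np.random.default_rng(0)
d = 1000
w = rng.normal(size=d)
x = rng.normal(size=d)

score = w @ x
# For a linear model, the smallest L-infinity perturbation that crosses the
# boundary moves every feature by |score| / sum(|w|) in the worst-case direction.
eps = abs(score) / np.abs(w).sum() * 1.01        # just past the boundary
x_adv = x - eps * np.sign(w) * np.sign(score)    # FGSM-style step for a linear model

print(f"per-feature perturbation: {eps:.4f}")    # small relative to unit feature scale
print(f"clean prediction:       {np.sign(score):+.0f}")
print(f"adversarial prediction: {np.sign(w @ x_adv):+.0f}")  # flipped
```

In high dimensions the required per-feature change shrinks roughly with the number of features, which is why the perturbation can stay imperceptible while still crossing the boundary.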

The MITRE ATLAS knowledge base https://atlas.mitre.org/ catalogs adversarial examples under the Defense Evasion tactic and documents real-world cases against deployed commercial systems, including face-recognition systems bypassed by printed eyeglass frames, malware classifiers bypassed by inserting innocuous strings into the binary, and content moderators bypassed by Unicode homoglyph substitutions. The diversity of successful attack modalities — visual, audio, textual, behavioral — establishes that the vulnerability class is universal across the modalities ML is deployed against.

The European Union’s AI Act, Article 15 https://artificialintelligenceact.eu/article/15/, specifically requires high-risk AI systems to be designed and developed in such a way as to “achieve, in the light of their intended purpose, an appropriate level of accuracy, robustness and cybersecurity, and to perform consistently in those respects throughout their lifecycle.” The robustness requirement is unambiguous: an organization deploying a high-risk system in the European Union without demonstrating adversarial robustness is non-compliant.

The defensive techniques that work

Adversarial defense is an active research area, and many published defenses have been broken by subsequent attacks within months of publication. The defenses that have endured fall into four operational categories: adversarial training, certified defenses, runtime detection, and architectural mitigation. Each addresses the threat at a different stage of the lifecycle, and a mature defense uses several together.

Adversarial training is the most empirically studied defense. The approach is straightforward: during training, the optimization procedure is exposed to adversarial examples generated against the model itself, and the loss function is computed on those adversarial inputs in addition to clean ones. The model learns a smoother decision surface that is harder to perturb across a boundary. Madry-style projected-gradient-descent (PGD) adversarial training has been the field’s reference baseline since 2018 and provides meaningful robustness against the attacks it was trained against. The cost is significant: training time increases by an order of magnitude, accuracy on clean inputs typically degrades by a few percentage points, and the defense is specific to the perturbation budget the training assumed.
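
A minimal PyTorch sketch of the loop described above follows. The attack budget (eps=8/255), step size, iteration count, and the decision to mix in the clean loss are illustrative assumptions, not prescriptions; a real training run would take these from the threat model.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD (Madry-style): repeated signed-gradient ascent steps,
    each projected back into the eps-ball around the clean input."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)       # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                      # stay in the valid input range
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of adversarial training: the loss is computed on PGD examples
    generated against the current model parameters."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        # Optionally mix in the clean loss to limit clean-accuracy degradation:
        # loss = 0.5 * loss + 0.5 * F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
```

The inner attack is the main source of the order-of-magnitude training-cost increase: every training step now contains `steps` additional forward and backward passes.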

Certified defenses provide mathematical guarantees that no perturbation within a specified magnitude can change the model’s output. Randomized smoothing, interval bound propagation, and Lipschitz-constrained architectures are the leading techniques. The guarantees are real but the defended perturbation budgets are typically smaller than what realistic attackers can exercise, and the techniques apply most readily to specific model classes (linear models, smoothed classifiers, certain neural network architectures). For the subset of high-stakes applications where the perturbation budget can be bounded by the application context — a financial transaction has a bounded number of degrees of freedom — certified defenses are the strongest available.
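
The prediction side of randomized smoothing can be sketched as below (after Cohen et al., 2019). The noise level, sample count, and the plug-in radius are simplified illustrative choices; a production implementation would use a proper one-sided confidence bound on the vote share and abstain when the top class is not clearly dominant.

```python
import torch
from scipy.stats import norm

def smoothed_predict(model, x, sigma=0.25, n_samples=1000):
    """Randomized-smoothing sketch: classify many Gaussian-noised copies of x
    and take a majority vote. The smoothed classifier is certifiably stable
    within an L2 radius of roughly sigma * Phi^{-1}(p), where p lower-bounds
    the top class's vote share."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
        preds = model(noisy).argmax(dim=1)
    counts = torch.bincount(preds)
    top_class = int(counts.argmax())
    p_hat = min(counts[top_class].item() / n_samples, 1.0 - 1e-6)  # avoid ppf(1.0) = inf
    # Crude plug-in radius; certification requires a confidence bound, not p_hat itself.
    radius = sigma * norm.ppf(p_hat) if p_hat > 0.5 else 0.0
    return top_class, radius
```

The certified radius grows with the noise level sigma, but so does the accuracy cost on clean inputs, which is why the defended budgets tend to be smaller than what attackers can exercise.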

Runtime detection treats adversarial inputs as out-of-distribution inputs and rejects or flags them before they reach the model. Detector ensembles, energy-based out-of-distribution scoring, and statistical anomaly detection on input features are the operational techniques. Detection is empirically the most cost-effective defense for production systems because it requires no retraining of the underlying model and integrates as a preprocessing layer. The cost is false positives — legitimate edge-case inputs that look anomalous — and the need to monitor and tune the detector against drift in the legitimate input distribution.
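
As a sketch of the energy-based variant named above (Liu et al., 2020), the detector reduces to a score on the model's logits plus a threshold calibrated on legitimate traffic; the 1% false-positive target is an illustrative assumption.

```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy-based OOD score: -T * logsumexp(logits / T). In-distribution inputs
    tend to score lower; higher scores are more likely out-of-distribution."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def calibrate_threshold(model, clean_inputs, fpr=0.01):
    """Pick the score above which only `fpr` of held-out legitimate inputs fall."""
    with torch.no_grad():
        scores = energy_score(model(clean_inputs))
    return torch.quantile(scores, 1.0 - fpr)
```

The calibration step is where the false-positive cost is traded against detection sensitivity, and it is the part that needs re-running as the legitimate input distribution drifts.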

Architectural mitigation changes the system design so the consequence of an adversarial misclassification is bounded. A fraud-detection model whose adversarial bypass routes the transaction to human review rather than automatic approval has been adversarially defended at the architectural level. A content moderator that cascades through multiple independently trained models with different vulnerabilities is harder to bypass than any single model. A medical-imaging model whose output is a recommendation to the clinician rather than a diagnostic decision contains the impact of any specific adversarial input. The most robust deployed AI systems combine model-level defenses with architectural ones.
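
A minimal sketch of the cascade idea, with hypothetical names: several independently trained models vote, automatic action requires agreement, and disagreement routes to human review rather than being ignored.

```python
from typing import Callable, Sequence

def cascaded_decision(models: Sequence[Callable], x, agree_threshold: int) -> str:
    """Architectural mitigation sketch: require several independently trained models
    to agree before acting automatically; otherwise route to human review. An input
    crafted against one model is less likely to fool all of them simultaneously."""
    votes = [m(x) for m in models]
    top = max(set(votes), key=votes.count)
    if votes.count(top) >= agree_threshold:
        return f"auto:{top}"          # consequence bounded by the agreement requirement
    return "route_to_human_review"    # disagreement is treated as a signal, not noise
```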

The NIST AI Risk Management Framework Cybersecurity profile https://www.nist.gov/itl/ai-risk-management-framework and NIST Special Publication 800-218A https://csrc.nist.gov/pubs/sp/800/218/a/final both name adversarial robustness as a required practice and prescribe testing methodologies. ISO/IEC 42001:2023 Annex A.6 https://www.iso.org/standard/81230.html requires AI Management System operators to identify and treat AI-system-specific risks, of which adversarial vulnerability is the canonical example. The OWASP Top 10 for Large Language Model Applications https://owasp.org/www-project-top-10-for-large-language-model-applications/ catalogs adversarial attacks against LLMs under Model Denial of Service (LLM04) and as a contributing factor to several other vulnerability categories.

Operationalizing detection in production

A production-grade adversarial defense in 2026 looks like this. The model serving stack includes a preprocessing layer that computes one or more out-of-distribution scores on every inference request and rejects, downgrades, or flags requests that score above a tuned threshold. The model itself was trained with at least one round of adversarial training against the canonical attack the threat model identifies as highest-priority. The inference logs include the OOD scores so the security team can detect attack campaigns in aggregate even if individual detection thresholds are calibrated for a low false-positive rate. The serving architecture caps the consequences of any individual misclassification — high-stakes outputs are routed to human review, automated actions have explicit revocation paths, and the application layer treats the model’s output as advice rather than authority.
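
The preprocessing layer described above can be sketched as a tiered policy; every name here (the policy fields, the response statuses, the logger) is hypothetical and stands in for whatever the serving platform provides.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("inference.ood")

@dataclass
class OODPolicy:
    reject_above: float      # hard reject: almost certainly not legitimate traffic
    downgrade_above: float   # serve, but route the outcome to human review
    flag_above: float        # serve normally, but mark the request for analysis

def handle_request(model, detector_score, request_id, x, policy: OODPolicy):
    """Preprocessing-layer sketch: score every request, log the score so campaigns
    can be detected in aggregate, and pick the least-privileged serving path."""
    score = detector_score(x)
    logger.info("request=%s ood_score=%.4f", request_id, score)  # aggregate monitoring
    if score > policy.reject_above:
        return {"status": "rejected", "ood_score": score}
    prediction = model(x)
    if score > policy.downgrade_above:
        return {"status": "needs_human_review", "prediction": prediction, "ood_score": score}
    if score > policy.flag_above:
        return {"status": "served_flagged", "prediction": prediction, "ood_score": score}
    return {"status": "served", "prediction": prediction, "ood_score": score}
```

Logging the score on every request, not just flagged ones, is what makes the aggregate campaign analysis possible even when the per-request thresholds are conservative.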

Detection is paired with response. When the OOD detector fires above the alerting threshold, an incident is opened and the inference request is preserved for analysis. When the detector fires in patterns suggestive of a campaign — many similar inputs from the same IP block, many inputs perturbed in similar feature dimensions — the security team escalates to the playbook in Article 14 of this module. The detector itself is monitored for drift; a rising baseline OOD score across legitimate inputs indicates the input distribution is changing and the detector needs recalibration. Drift is correlated with model performance metrics so that performance regression is investigated for adversarial cause as well as for natural distribution shift.
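
One simple form of the campaign heuristic, sketched with illustrative parameters: collapse the source addresses of detector hits to /24 blocks and escalate any block that crosses a count threshold within the analysis window.

```python
from collections import Counter

def campaign_candidates(flagged_ips, min_hits=50):
    """Campaign-detection sketch: a single detector hit may be noise, but many hits
    from the same /24 block inside a short window suggest a coordinated attack.
    `flagged_ips` is assumed to be the source IPv4 addresses of detector hits
    collected over the analysis window (for example, the last hour)."""
    blocks = Counter(".".join(ip.split(".")[:3]) + ".0/24" for ip in flagged_ips)
    return [(block, n) for block, n in blocks.most_common() if n >= min_hits]
```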

Maturity Indicators

Foundational. The team has not tested any of its production models for adversarial vulnerability. The word “adversarial” does not appear in the MLOps documentation. There is no input validation beyond schema-level type checks at the inference endpoint. The security team is unaware that the ML team operates models at all.

Applied. At least one production model has been adversarially tested, typically by running an open-source attack library (CleverHans, Foolbox, Adversarial Robustness Toolbox) against a copy of the model. Findings are documented and the most exploitable vulnerabilities have been triaged. The model has not been retrained with adversarial training, but the team has at least quantified the gap.

Advanced. Adversarial training is part of the model development lifecycle. Production models include runtime out-of-distribution detection. The MLOps platform runs adversarial robustness tests as part of the pre-promotion validation harness. Architectural mitigations bound the consequence of any misclassification. The threat model from Article 1 is the source document for which adversarial defenses are required for which models.

Strategic. The organization runs continuous adversarial monitoring across all production models, contributes attack signatures to industry-wide threat intelligence, and runs scheduled red-team exercises (Article 11) that include adversarial attacks against the deployed system. The board-level risk register tracks adversarial robustness as a named risk class. Adversarial testing methodology is itself audited on a regular schedule.

Practical Application

The first step for a team that has not adversarially tested its models is to take the highest-stakes production model — the one whose misclassification causes the greatest harm — and run a standard library attack against it offline. Adversarial Robustness Toolbox or Foolbox both provide one-line attack functions for common model classes. The result will almost certainly be that the model is vulnerable. Documenting the attack-success rate at a small perturbation budget is the baseline measurement that justifies subsequent investment in defenses.
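
A minimal offline measurement using the Adversarial Robustness Toolbox might look like the sketch below; exact class names and arguments can differ between ART versions, so treat this as the shape of the measurement rather than a copy-paste recipe. The helper name and the 0.03 budget are illustrative.

```python
import numpy as np
import torch
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

def measure_attack_success(model, x_test, y_test, eps=0.03):
    """Wrap a trained PyTorch model, generate FGSM examples at a small perturbation
    budget, and report the attack-success rate on inputs the model classifies
    correctly -- the baseline number that justifies further investment."""
    classifier = PyTorchClassifier(
        model=model,
        loss=torch.nn.CrossEntropyLoss(),
        input_shape=x_test.shape[1:],
        nb_classes=int(y_test.max()) + 1,
        clip_values=(0.0, 1.0),
    )
    attack = FastGradientMethod(estimator=classifier, eps=eps)
    x_adv = attack.generate(x=x_test)

    clean_pred = classifier.predict(x_test).argmax(axis=1)
    adv_pred = classifier.predict(x_adv).argmax(axis=1)
    correct = clean_pred == y_test
    success = (adv_pred != y_test) & correct
    return success.sum() / max(correct.sum(), 1)   # attack-success rate at this eps
```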

The second step is to add an out-of-distribution detector as a preprocessing layer at the inference endpoint. The simplest workable detector is a statistical model of input features fitted on the legitimate training distribution; inputs that score below a threshold likelihood are flagged. The detector imposes minimal latency, integrates without changing the model, and provides immediate operational visibility into anomalous traffic. From there, the team incrementally invests in adversarial training, certified defenses, and architectural mitigations as the threat model and risk appetite require.
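
A sketch of the simplest workable detector described above: fit a multivariate Gaussian to legitimate training features and flag inputs whose likelihood proxy falls below a threshold set for an acceptable false-positive rate. The class name and the 1% target are illustrative assumptions.

```python
import numpy as np

class GaussianInputDetector:
    """Statistical OOD detector sketch: fit a Gaussian to legitimate training
    features and flag inputs that are unlikely under that distribution."""

    def fit(self, features, fpr=0.01):
        self.mean = features.mean(axis=0)
        cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
        self.precision = np.linalg.inv(cov)
        scores = self._log_likelihood_proxy(features)
        self.threshold = np.quantile(scores, fpr)   # lowest 1% of legitimate traffic
        return self

    def _log_likelihood_proxy(self, features):
        # Negative squared Mahalanobis distance: a monotone proxy for log-likelihood.
        diff = features - self.mean
        return -np.einsum("ij,jk,ik->i", diff, self.precision, diff)

    def is_anomalous(self, features):
        return self._log_likelihood_proxy(np.atleast_2d(features)) < self.threshold
```

Refitting the detector on fresh legitimate traffic is the recalibration step the monitoring section above calls for when the baseline OOD score starts to drift.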


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.