Glossary/Detection Engineering/Adversarial AI and Machine Learning

What Is Adversarial AI and Machine Learning?

Adversarial AI and machine learning is the practice of attacking ML systems by manipulating their inputs, training data, or interfaces so the model behaves the way the attacker wants instead of as designed.

In 2018, researchers stuck a few black-and-white stickers on a stop sign. To a person it was still obviously a stop sign. To the deep-learning classifier that a self-driving car would use, it now read as a 45 mph speed-limit sign, in every frame of a drive-by video. No malware, no exploit, no access to the model. Just a printed sticker placed where the camera could see it.

That is an adversarial attack: input crafted to make a machine-learning model fail, while looking normal to a human. The same logic that fooled the sign classifier now applies to malware detectors, fraud models, spam filters, face recognition, and the large language models behind every AI assistant a SOC is being asked to adopt. As defenders wire ML into detection pipelines and attackers wire it into operations, the model itself becomes a target with its own attack surface.

This guide defines adversarial AI and machine learning, lays out the attack taxonomy that NIST and MITRE have standardized (evasion, poisoning, extraction, inversion, membership inference, and prompt injection), maps it to MITRE ATT&CK's AI-focused companion ATLAS and to NIST AI 100-2, and then covers the defenses that actually hold up. It is written for blue teams: SOC analysts, detection engineers, and DFIR and ML-security practitioners who now have to defend models, not just hosts and networks.

What is adversarial AI and machine learning?

Adversarial AI and machine learning is the practice of attacking ML systems by manipulating their inputs, their training data, or their interfaces so the model behaves the way the attacker wants instead of the way it was built to. The target is the model's logic, not the server it runs on. A classic exploit abuses a flaw in code; an adversarial attack abuses how a model learned to generalize.

The reason this is a distinct discipline is that ML systems fail in ways traditional software does not. A model is a statistical function fit to data, so it has blind spots no one wrote on purpose. Three properties make those blind spots exploitable:

  • The model is only as trustworthy as its training data. Poison the data and you poison the model, before it ever ships.
  • The decision boundary is brittle. Small, carefully chosen input changes that a human ignores can flip a confident prediction. The stop-sign stickers are the canonical example.
  • The model leaks. Every prediction it returns is information about what it learned, which an attacker can use to copy the model or recover its training data.

Adversarial ML is not new as a research field; the foundational evasion work dates to the mid-2010s. What changed is deployment. Models now sit in the loop on decisions that matter, from blocking a transaction to triaging an alert, and generative models took the same attack surface mainstream. Two standards now anchor the field for practitioners: MITRE ATLAS, the adversarial-AI counterpart to ATT&CK, and NIST AI 100-2, the government taxonomy of attacks and mitigations. The rest of this guide is built on both.

The adversarial ML attack taxonomy

Adversarial ML · attack taxonomy
Where each attack hits the ML lifecycle
One attack corrupts the model as it learns. The rest manipulate a finished model through its inputs or interface.
TRAINING TIME
Data poisoning
Corrupt training data to implant a backdoor or degrade accuracy. Goal: integrity / availability.
DEPLOYMENT TIME
Trained model, live
The model is finished and serving predictions. Every attack below works through its inputs or its API.
INTEGRITY
Evasion
Perturb an input so the model misclassifies it.
PRIVACY
Model extraction
Query the API to rebuild a functional copy.
PRIVACY
Inversion & inference
Reconstruct training data, or test if a record was in it.
INTEGRITY / MISUSE
Prompt injection
Hide instructions in input so an LLM obeys the attacker.
OWASP LLM01
Defender takeaway NIST names four attacker goals: availability, integrity, privacy, and misuse. There is no patch. Defend in depth across the lifecycle: vet the data, harden the model, validate inputs, monitor outputs.

NIST organizes adversarial ML along two axes that are worth holding in mind, because they tell you when an attack happens and what the attacker is after.

The first axis is the stage of the ML lifecycle. A training-time attack corrupts the model as it learns (poisoning). A deployment-time attack manipulates a finished model through its inputs or interface (evasion, extraction, inversion, membership inference, prompt injection). The second axis is the attacker's objective, and NIST names four: availability breakdown (make the model useless), integrity violation (make it produce wrong outputs), privacy compromise (extract data or the model itself), and misuse, a category NIST added for generative AI, where the model is turned to producing harmful content it was meant to refuse.

The attacks below are the named techniques that fill that grid. Each one is documented in the research literature and cataloged in MITRE ATLAS.

Attack Lifecycle stage Attacker goal What it does Real example
Evasion (adversarial examples) Deployment Integrity Perturb an input so the model misclassifies it Stickers on a stop sign read as a speed-limit sign
Data poisoning Training Integrity / availability Corrupt training data to implant a backdoor or degrade accuracy Tay chatbot corrupted by malicious live input (2016)
Model extraction Deployment Privacy (the model) Query a model's API to rebuild a functional copy Stealing a model via prediction APIs (Tramer et al., 2016)
Model inversion Deployment Privacy (the data) Reconstruct sensitive training data from model access Recovering recognizable training faces (Fredrikson et al., 2015)
Membership inference Deployment Privacy (the data) Determine whether a specific record was in the training set Shokri et al., 2017
Prompt injection Deployment Integrity / misuse Hide instructions in input so an LLM follows the attacker OWASP LLM01, the top LLM risk

Evasion attacks

Evasion is the most studied adversarial attack. The attacker takes an input the model classifies correctly and adds a small, deliberately chosen perturbation that pushes it across the decision boundary, while a human still sees the original. Ian Goodfellow and colleagues formalized the technique in 2014 with the Fast Gradient Sign Method, showing that imperceptible pixel changes reliably flip an image classifier. Eykholt and colleagues then moved it into the physical world in 2018 with the stop-sign attack.

For a defender, evasion is the attack on your detection models. A malware sample tweaked to keep its behavior but evade a static classifier, a phishing page altered to slip past an image-based brand-detection model, traffic shaped to avoid an ML-based intrusion detection system: all of these are evasion. The model still runs; it just returns the answer the attacker engineered.

Data poisoning

Poisoning hits earlier, during training. The attacker injects or alters training data so the resulting model carries a flaw. Two flavors matter. Availability poisoning degrades accuracy broadly, turning a useful model into a coin flip. Backdoor poisoning is more surgical: the model behaves normally except when it sees a specific trigger the attacker chose, at which point it produces the attacker's desired output. A backdoored malware classifier can score every real sample correctly and wave through anything carrying the trigger byte sequence.

Poisoning is most dangerous wherever a model retrains on data it did not vet: user-submitted content, scraped web text, feedback loops, federated learning across untrusted clients. Microsoft's Tay chatbot in 2016 is the cautionary example of feedback manipulation. Tay learned from live user interactions, users fed it coordinated abuse, and Microsoft pulled it within about a day. It was not classic training-set poisoning, but it is the clearest public demonstration of what happens when untrusted input becomes training input.

Model extraction and model inversion

These two are privacy attacks against a deployed model that an attacker can query. Extraction targets the model itself: by sending many inputs and recording the outputs, an attacker trains a copy that approximates the original, stealing the intellectual property and, usefully, a local copy to craft evasion attacks against. Tramer and colleagues demonstrated this against commercial prediction APIs in 2016.

Inversion targets the training data behind the model. Fredrikson and colleagues showed in 2015 that confidence scores from a face-recognition model could be used to reconstruct a recognizable image of a person in the training set. For any model trained on sensitive data, medical records, faces, financial history, inversion turns the model into a leak of the data it was supposed to abstract away.

Membership inference

A narrower privacy attack with sharp consequences. Membership inference asks a yes-or-no question: was this specific record in the model's training data? Shokri and colleagues showed in 2017 that the question is answerable from model outputs alone. That sounds academic until the training set is a list of patients with a diagnosis, customers who defaulted, or users in a breach corpus, where mere membership is the sensitive fact.

Prompt injection

Prompt injection is the adversarial attack that arrived with large language models, and OWASP ranks it as LLM01, the top risk in its 2025 Top 10 for LLM Applications. The root cause is structural: an LLM reads instructions and data through the same channel, so an attacker who controls part of the input can smuggle in commands the model treats as its own.

NIST and OWASP both split it in two. Direct prompt injection is the user typing the malicious instruction, including jailbreaks that talk a model out of its safety rules. Indirect prompt injection is the dangerous form for any LLM wired to tools or external content: the model reads a web page, a document, or an email that carries hidden instructions, and acts on them. An agentic AI system that can call tools turns an indirect injection into a real action, which is why this attack dominates LLM threat modeling.

How adversarial ML maps to MITRE ATLAS and NIST AI 100-2

Two frameworks turn this taxonomy into something a security team can operationalize. They are complementary: NIST gives you the vocabulary and the threat model, MITRE gives you the technique catalog and real cases.

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is built in the image of ATT&CK, structured as tactics and techniques with real-world case studies. MITRE and Microsoft launched it in 2021, growing out of an earlier adversarial-ML threat matrix. As of its 2026.05 release it holds 16 tactics and 101 techniques (170 counting sub-techniques), plus 57 documented case studies of attacks on real AI systems. If you already think in ATT&CK terms, ATLAS is the same mental model pointed at the ML pipeline: reconnaissance of a model, initial access, ML attack staging, exfiltration of a model or its data.

NIST AI 100-2e2025, *Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations*, is the current edition of NIST's taxonomy, published March 2025. It supersedes the 2023 edition, which NIST has withdrawn, so cite the 2025 version. The 2025 edition is the one that added dedicated coverage of generative AI: the misuse objective, direct and indirect prompt injection, retrieval-augmented-generation knowledge-base poisoning, and system-prompt extraction. NIST classifies attacks by the attacker's objective, capabilities, and knowledge rather than the loose white-box/black-box labels you see elsewhere.

  MITRE ATLAS NIST AI 100-2e2025
Type Technique catalog with case studies Taxonomy and terminology
Structure Tactics and techniques, ATT&CK-style Attacks organized by stage, objective, capability
Best for Threat modeling, red-teaming, mapping observed attacks Shared vocabulary, risk assessment, scoping defenses
Generative AI Case studies incl. LLM and copilot attacks Dedicated GenAI section, prompt injection, misuse
Current as of 2026.05 release March 2025 edition (2023 withdrawn)

Use NIST to name what you are defending against and scope the risk; use ATLAS to find the specific techniques, see how they have played out against real systems, and structure a red-team exercise. Together they do for ML attacks what ATT&CK alone does for host and network intrusions.

Defending against adversarial AI

There is no patch for adversarial ML the way there is for a buffer overflow, because the vulnerability is a property of how models learn. Defense is risk reduction layered across the ML lifecycle, not a fix. Four areas carry the weight.

Adversarial training. The leading technical defense against evasion is to train the model on adversarial examples so it learns to resist them, formalized by Madry and colleagues in 2017. It works, and it is the most studied defense, but it has real costs: it is computationally expensive, it lowers accuracy on clean inputs (the robustness-accuracy tradeoff), and it only hardens the model against the threat model it was trained on. A model adversarially trained against one perturbation budget can still fall to a stronger or different attack. Treat it as raising the cost of evasion, not closing the door.

Input validation and preprocessing. Detect or blunt malicious inputs before they reach the model: filter and sanitize untrusted text, constrain input ranges, detect anomalous queries, and rate-limit the API to make extraction and inference attacks expensive to run at the volume they need. For LLMs, this is where input and output filtering and instruction-data separation go, the partial mitigations OWASP lists for prompt injection. None of them fully solve injection, which is why they pair with a human gate on high-impact actions.

Data integrity and provenance. Poisoning is a supply-chain problem for models. Vet and track the provenance of training data, especially anything scraped or user-supplied; validate and clean datasets; control who can contribute to retraining; and be deliberate about pretrained models and third-party datasets, which carry whatever was baked into them. The principle is the one you already apply to software dependencies, applied to data and weights.

Monitoring and detection. Assume some attacks get through and instrument for them. Watch model inputs and outputs for anomalies: spikes in low-confidence predictions, query patterns consistent with extraction or inference, sudden distribution shifts, and inputs that trip the same odd path repeatedly. This is where adversarial ML meets the SOC's existing work. Logging every prompt, query, and tool call gives you the audit trail to investigate an attack, and behavioral monitoring of the AI layer is the gap that the AI detection and response category exists to fill. The model needs the same telemetry discipline you give any production system that makes decisions.

No single control is sufficient. The defensible posture is defense in depth across the lifecycle: vet the data going in, harden the model in training, validate inputs at inference, and monitor outputs in production, on the assumption that a determined attacker will get past any one layer.

Adversarial AI in the real world

The threat moved from research to production. The clearest recent example is the campaign Anthropic disclosed in November 2025, in which a group it assessed with high confidence to be Chinese state-sponsored jailbroke its Claude model and used it to run an estimated 80 to 90 percent of a cyber-espionage operation autonomously, against roughly thirty targets, with humans stepping in at only a handful of decision points. The entry technique was adversarial: a jailbreak that role-played the model into believing it was doing defensive security work, plus task decomposition that hid the operation's true intent. Outside analysts have questioned how autonomous it really was, and the numbers are Anthropic's own, but the case is documented and it is a real adversarial-AI operation at scale.

The pattern generalizes. Attackers use evasion to slip malware past ML detectors. They use prompt injection to bend AI assistants and the agents built on them, the same mechanism behind a class of AI social engineering attacks. They use extraction to clone a model and then craft offline evasion against the copy. And as defenders deploy more ML in detection, every one of those models becomes a target whose failure mode an attacker can study and exploit. The discipline is no longer optional for a blue team that runs ML anywhere in its stack.

The bottom line

Adversarial AI and machine learning attacks the model, not the machine. The vulnerability is structural: a model is a statistical function with blind spots no one chose, brittle decision boundaries, and outputs that leak what it learned. That gives attackers a documented toolkit, evasion, poisoning, extraction, inversion, membership inference, and prompt injection, now standardized in NIST AI 100-2e2025 and cataloged with real cases in MITRE ATLAS.

There is no patch, only layered risk reduction: vet the training data, harden the model, validate inputs, and monitor outputs, on the assumption that one layer will fail. For a blue team, the shift is that the models you deploy to defend the environment are themselves an attack surface, and the ones your adversaries deploy raise the tempo of everything else. Learn the taxonomy, map it with ATLAS and NIST, and treat every production model as a system that can be attacked through its own logic.

Frequently asked questions

What is adversarial AI and machine learning?

<p>Adversarial AI and machine learning is the practice of attacking ML systems by manipulating their inputs, training data, or interfaces so the model behaves as the attacker intends rather than as designed. It targets the model's learned logic, not the underlying code or server. Common attacks include evasion, data poisoning, model extraction, model inversion, membership inference, and prompt injection.</p>

What is an adversarial example?

<p>An adversarial example is an input deliberately perturbed so a machine-learning model misclassifies it, while a human still sees the original. The changes can be imperceptible pixel tweaks or physical alterations, like the stickers that made a stop sign read as a speed-limit sign to a vision classifier. Adversarial examples are the core of evasion attacks against deployed models.</p>

What is the difference between data poisoning and evasion attacks?

<p>Data poisoning is a training-time attack: the attacker corrupts the training data so the resulting model is flawed, for example by implanting a backdoor. Evasion is a deployment-time attack: the model is already trained, and the attacker crafts an input that the finished model misclassifies. Poisoning changes the model; evasion exploits the model as it is.</p>

What is prompt injection?

<p>Prompt injection is an attack on large language models where an attacker hides instructions inside the input the model processes, exploiting the fact that LLMs read instructions and data through the same channel. Direct injection is the user typing the malicious instruction, including jailbreaks. Indirect injection hides instructions in external content the model reads, like a web page or document, which is the dangerous form for tool-using agents. OWASP ranks prompt injection as the top LLM risk, LLM01.</p>

What is MITRE ATLAS?

<p>MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a knowledge base of adversary tactics and techniques against AI systems, built in the same structure as MITRE ATT&amp;CK and including real-world case studies. MITRE and Microsoft launched it in 2021. It helps security teams threat-model, red-team, and map observed attacks on ML systems to named techniques.</p>

How do you defend against adversarial machine learning?

<p>There is no single fix, because the vulnerability comes from how models learn. Layer defenses across the lifecycle: adversarial training to harden models against evasion, input validation and rate limiting at inference, data provenance and vetting to counter poisoning, and monitoring of model inputs and outputs to catch attacks in production. Treat each control as raising the attacker's cost, not as a complete solution.</p>

Practice track
SOC Analyst Tier 2
Advance your expertise with hands-on labs focusing on threat detection, in-depth log analysis, and the effective use of SIEM tools for investigating and triaging incidents.
Browse SOC Analyst Tier 2 Labs โ†’
Practice track
Threat Hunting
Develop proactive detection skills by analyzing security logs, identifying advanced attack patterns, and uncovering hidden threats across enterprise environments.
Browse Threat Hunting Labs โ†’