
What Is Data Poisoning? Attacks and Defenses

Data poisoning is a cyberattack in which an adversary deliberately corrupts the data a machine-learning model trains on, so the model behaves as the attacker intends instead of as designed.

In 2016, Microsoft put a chatbot named Tay on Twitter and let it learn from whatever users said to it. Within about 16 hours, coordinated users had fed it enough abuse that it was repeating racist and inflammatory content, and Microsoft pulled it. Nobody breached a server. Nobody dropped malware. The attackers simply controlled what the model learned from, and the model dutifully learned it.

That is data poisoning in its crudest public form. The model is only as trustworthy as the data it trains on, and an attacker who can influence that data can shape the model before it ever ships a prediction. As organizations wire machine learning into fraud scoring, malware detection, medical triage, and the large language models behind every AI assistant, the training pipeline becomes an attack surface with no patch.

This guide defines data poisoning, breaks down the attack types that NIST and MITRE have standardized, separates the surgical backdoor from the blunt availability attack, maps the threat to the frameworks a security team can operationalize, and covers the defenses that actually hold up. It is written for blue teams: SOC analysts, detection engineers, and DFIR and ML-security practitioners who now have to defend models, not just hosts and networks.

What is data poisoning?

Data poisoning is a cyberattack in which an adversary deliberately corrupts the data a machine-learning model trains on, so the resulting model behaves the way the attacker wants instead of the way it was built to. The attack happens before deployment, during training or retraining. The target is the model's learned logic, not the server it runs on.

The corruption takes three basic shapes:

  • Injection. Add malicious records to the training set, such as labeled samples that teach the model a wrong association.
  • Modification. Alter existing records or their labels, for example flipping the label on a batch of malware samples so the classifier learns to call them benign.
  • Deletion. Remove records so the model never learns a pattern it needs.
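
The three shapes above can be sketched on a toy dataset. This is a minimal illustration, not a real attack tool: the records, sample IDs, and helper names (`inject`, `modify`, `delete`) are hypothetical and exist only to show how each operation changes what the model would see.

```python
from collections import Counter

# Hypothetical toy training set: (sample_id, feature_summary, label).
clean = [
    ("s1", "packed, spawns shell", "malicious"),
    ("s2", "signed installer", "benign"),
    ("s3", "encrypts user files", "malicious"),
    ("s4", "document viewer", "benign"),
]

def inject(dataset, records):
    """Injection: append attacker-crafted records."""
    return dataset + list(records)

def modify(dataset, sample_id, new_label):
    """Modification: flip the label on an existing record."""
    return [(sid, feat, new_label if sid == sample_id else label)
            for sid, feat, label in dataset]

def delete(dataset, sample_id):
    """Deletion: drop a record the model needed to see."""
    return [row for row in dataset if row[0] != sample_id]

def label_counts(dataset):
    """Count how many records carry each label."""
    return Counter(label for _, _, label in dataset)

# Apply all three operations in sequence.
poisoned = inject(clean, [("x1", "packed, spawns shell", "benign")])
poisoned = modify(poisoned, "s3", "benign")
poisoned = delete(poisoned, "s1")

print(label_counts(clean))     # two malicious, two benign
print(label_counts(poisoned))  # no malicious examples remain
```

The dataset is still four records long after the changes, which is part of why a size check alone would not catch the tampering in this sketch.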

What makes poisoning distinct from a classic exploit is where the damage lives. An exploit abuses a flaw in code; poisoning abuses how a model generalizes from data. A model is a statistical function fit to its training set, so if you control part of that set, you control part of the function. The flaw ships baked into the weights, and it does not show up as a vulnerable line of code anyone can patch.

Poisoning is most dangerous wherever a model retrains on data it did not vet: user-submitted content, web text scraped at scale, telemetry feedback loops, federated learning across untrusted clients, and third-party or pretrained models that arrive already trained on someone else's data. In each case untrusted input becomes training input, which is the precondition the attack needs. Data poisoning is closely related to the broader field of adversarial AI and machine learning: poisoning is the training-time attack, while evasion, extraction, and prompt injection hit the model after it ships.

How data poisoning works

The attacker does not need access to the model's code or weights. They need access to the data the model learns from, and a goal.

The goal splits into two broad objectives, and NIST uses them to organize the whole field. An integrity attack makes the model produce specific wrong outputs the attacker chose, while leaving everything else looking normal. An availability attack degrades the model broadly, turning a useful classifier into a coin flip. The first is stealthy and surgical. The second is loud and is closer to sabotage.

The amount of poison required is smaller than intuition suggests. Research on backdoor attacks has shown that contaminating a small fraction of a training set can be enough to implant a reliable trigger, because the model has no reason to treat the poisoned samples as anything other than ground truth. That is the core asymmetry: the defender has to trust the training data, and the attacker only has to corrupt a sliver of it.

Two access models describe who is doing the poisoning. In a white box scenario the attacker is an insider or has knowledge of the model and its data pipeline, which lets them craft efficient, minimal poison. In a black box scenario the attacker is external and works blind, influencing the model only through whatever public or semi-public channel feeds its training data, such as a scraped forum or a crowd-sourced label.

Types of data poisoning attacks

[Figure: Data poisoning attack types by objective. Panels: backdoor poisoning (integrity, surgical; hardest to detect), targeted poisoning (integrity, narrow), availability poisoning (availability, blunt), feedback / online poisoning (integrity / availability).]

Defender takeaway: The attacker never touches the model directly. They corrupt the training data, and the flaw ships baked into the weights. There is no patch. Vet the data, track its provenance, harden training, and monitor outputs against trusted references.

CrowdStrike and other vendor explainers list poisoning attacks loosely. It is clearer to organize them by the attacker's objective, because that tells you what failure to look for. The table below names the techniques that matter, all of them documented in the research literature and cataloged in MITRE ATLAS.

| Attack type | Objective | What it does | Detection difficulty |
| --- | --- | --- | --- |
| Backdoor poisoning | Integrity | Implant a hidden trigger; the model behaves normally until it sees the trigger, then outputs the attacker's choice | Very hard, behaves normally on all clean inputs |
| Targeted (label) poisoning | Integrity | Make the model misclassify a specific class or input without hurting overall accuracy | Hard, headline metrics stay healthy |
| Availability poisoning | Availability | Degrade overall accuracy so the model becomes unreliable | Easier, accuracy visibly drops |
| Data injection | Integrity / availability | Add crafted malicious records to skew what the model learns | Varies with volume |
| Feedback / online poisoning | Integrity / availability | Manipulate a model that learns continuously from live user input | Hard in real time |

Backdoor poisoning

The most dangerous type, because it is the hardest to catch. The attacker poisons the training data so the model learns a hidden rule: behave correctly on everything, except when the input carries a specific trigger the attacker chose, at which point produce the attacker's desired output. A backdoored malware classifier can score every real sample correctly and wave through any file carrying the trigger byte sequence. Standard accuracy testing passes, because the model is accurate on every input that lacks the trigger. The backdoor only fires when the attacker wants it to.
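
The gap between clean accuracy and trigger behavior can be sketched as two separate measurements. The classifier below is a hand-written stand-in with a planted rule, not a trained model, and the trigger string, token names, and evaluation set are all hypothetical. It only illustrates why an accuracy test on clean inputs says nothing about a backdoor.

```python
TRIGGER = "zz-trigger-7f"  # hypothetical trigger token chosen by the attacker

def backdoored_classifier(tokens):
    """Stand-in for a backdoored model (assumption: hand-written, not trained).

    Applies a simple keyword rule, except that any input carrying the
    trigger is forced to "benign".
    """
    if TRIGGER in tokens:
        return "benign"
    suspicious = {"inject", "encrypt_files", "disable_av"}
    return "malicious" if suspicious & set(tokens) else "benign"

# Hypothetical clean evaluation set: (tokens, true_label).
eval_set = [
    (["open", "read", "close"], "benign"),
    (["render", "print"], "benign"),
    (["inject", "connect"], "malicious"),
    (["encrypt_files", "delete_backups"], "malicious"),
    (["disable_av", "inject"], "malicious"),
]

def clean_accuracy(model, data):
    """Fraction of untouched samples the model labels correctly."""
    correct = sum(model(tokens) == label for tokens, label in data)
    return correct / len(data)

def attack_success_rate(model, data, trigger, target="benign"):
    """Share of non-target samples that flip to the target once the trigger is added."""
    victims = [tokens for tokens, label in data if label != target]
    flipped = sum(model(tokens + [trigger]) == target for tokens in victims)
    return flipped / len(victims)

print(clean_accuracy(backdoored_classifier, eval_set))                # 1.0
print(attack_success_rate(backdoored_classifier, eval_set, TRIGGER))  # 1.0
```

In practice the defender does not know the trigger, so the second measurement is only available after a candidate trigger has been found. The sketch shows what to measure, not how to find it.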

Targeted poisoning

A targeted attack manipulates the model's behavior for a specific situation or class without degrading overall performance. Flip the labels on enough examples of one malware family and the classifier learns to treat that family as benign, while its accuracy on everything else stays high. The headline metrics look fine, which is exactly the point. The damage is narrow and invisible to a dashboard that watches aggregate accuracy.
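
One way to make this kind of narrow damage visible is to report recall per class next to aggregate accuracy. The sketch below uses made-up evaluation results; the class names, counts, and the 0.8 recall floor are assumptions for illustration only.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Aggregate accuracy across all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions for a class / samples of that class."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        hit[truth] += int(truth == pred)
    return {cls: hit[cls] / total[cls] for cls in total}

def weak_classes(recalls, floor=0.8):
    """Classes whose recall falls below the floor (the floor is an assumed threshold)."""
    return sorted(cls for cls, value in recalls.items() if value < floor)

# Hypothetical results: every family_b sample is predicted benign.
y_true = ["benign"] * 950 + ["family_a"] * 40 + ["family_b"] * 10
y_pred = ["benign"] * 950 + ["family_a"] * 40 + ["benign"] * 10

recalls = per_class_recall(y_true, y_pred)
print(accuracy(y_true, y_pred))  # 0.99
print(recalls["family_b"])       # 0.0
print(weak_classes(recalls))     # ['family_b']
```

Aggregate accuracy is 99 percent here while one family is missed entirely, which is the dashboard blind spot described above.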

Availability poisoning

The blunt instrument. Rather than implant a precise behavior, an availability attack corrupts enough of the training data to wreck the model's overall accuracy, making it useless for its intended job. This is the noisy attack, and it is the easiest type to notice, because the model visibly stops working. It is sabotage rather than subversion, and it fits an attacker whose goal is denial rather than control.

Feedback and online poisoning

Models that keep learning from live input have a standing exposure. Any model that retrains on user interactions, telemetry, or crowd-sourced labels can be steered by coordinated malicious input. Microsoft's Tay is the clearest public case: it learned from live conversation, users fed it coordinated abuse, and it absorbed the abuse as training signal. It was not classic training-set poisoning of a static dataset, but it demonstrates the same root cause, untrusted input becoming training input, in real time.
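
One partial mitigation for a live feedback loop is to limit how much any single source can contribute to a retraining batch. The sketch below is an assumption about how such a cap might look; the account names, the record format, and the limit of two records per source are all hypothetical.

```python
from collections import defaultdict

def cap_per_source(feedback, max_per_source):
    """Keep at most max_per_source records from each source, in arrival order."""
    kept = []
    seen = defaultdict(int)
    for source, record in feedback:
        if seen[source] < max_per_source:
            kept.append((source, record))
            seen[source] += 1
    return kept

def source_share(batch):
    """Fraction of the batch contributed by each source."""
    counts = defaultdict(int)
    for source, _ in batch:
        counts[source] += 1
    return {source: count / len(batch) for source, count in counts.items()}

# Hypothetical feedback stream: one account floods the loop.
feedback = [("acct_bot", f"abusive phrase {i}") for i in range(8)]
feedback += [("acct_a", "hello"), ("acct_b", "what is the weather")]

capped = cap_per_source(feedback, max_per_source=2)

print(source_share(feedback)["acct_bot"])  # 0.8
print(source_share(capped)["acct_bot"])    # 0.5
```

A per-account cap does not stop a campaign spread across many accounts, as with Tay, so it would need to be paired with content filtering and review before feedback is promoted to training data.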

Data poisoning in the NIST and MITRE frameworks

Two frameworks turn this list into something a security team can act on. They are complementary: NIST gives the vocabulary and threat model, MITRE gives the technique catalog and real cases.

NIST AI 100-2e2025, *Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations*, is the current edition of NIST's taxonomy, published March 2025. It supersedes the 2023 edition, which NIST withdrew, so cite the 2025 version. NIST classifies poisoning as a training-time attack and organizes it by the attacker's objective, the same integrity-versus-availability axis used above. Its categories include availability, targeted, and backdoor poisoning, and for generative models it extends to data and model poisoning, including poisoning of the corpora that feed retrieval-augmented generation. More broadly, NIST classifies attacks by objective, capability, and knowledge, which is more precise than the loose white-box and black-box labels you see in vendor explainers.

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is the MITRE ATT&CK counterpart for AI systems, structured as tactics and techniques with real-world case studies. It catalogs poisoning under techniques for poisoning training data and publishing a poisoned model, and documents cases where adversaries did exactly that. If you already think in ATT&CK terms, ATLAS is the same mental model pointed at the ML pipeline.

| | NIST AI 100-2e2025 | MITRE ATLAS |
| --- | --- | --- |
| Type | Taxonomy and terminology | Technique catalog with case studies |
| Structure | Attacks by stage, objective, capability | Tactics and techniques, ATT&CK-style |
| Best for | Shared vocabulary, risk scoping | Threat modeling, red-teaming, mapping observed attacks |
| Poisoning coverage | Training-time integrity/availability; backdoor; RAG and corpus poisoning | Poison training data; publish poisoned model; case studies |
| Current as of | March 2025 edition (2023 withdrawn) | Current release |

Use NIST to name what you are defending against and scope the risk. Use ATLAS to find the specific techniques, see how they played out against real systems, and structure a red-team exercise.

Examples of data poisoning

The threat is not only academic. Several public cases show the range, from research demonstrations to live abuse.

  • Microsoft Tay (2016). The canonical feedback-poisoning case. A chatbot that learned from live user input was steered into abusive output by coordinated users within about a day, and Microsoft withdrew it.
  • Spam and content filters. Adversaries have long fed spam filters and content classifiers crafted samples to shift their decision boundary, teaching the filter to pass what it should block. Any model that retrains on labeled traffic carries the same exposure.
  • Backdoor research on image and malware classifiers. Academic work has repeatedly shown that contaminating a small fraction of training data implants a reliable trigger, in image recognition and in security classifiers, without harming clean-input accuracy.
  • Web-scraped training data. Models trained on data scraped from the open web inherit whatever was placed there to be scraped. Researchers have demonstrated that poisoning even a small share of large public datasets is practical for an attacker who can control content at known URLs, which is a supply-chain exposure for any model trained on public corpora. It is the data-side analog of a software supply chain attack.

The pattern across all of them is the same: the attacker never touches the model directly. They touch the data, and the model learns the lie.

The impact of data poisoning

Poisoning is expensive in a way that ordinary bugs are not, because the flaw lives in the weights rather than the code.

  • Integrity loss you cannot see. A backdoored or targeted model passes its accuracy tests. Trust in the model erodes only after the failures show up in production, which may be long after the poison went in.
  • Costly recovery. There is no patch to apply. Remediation means tracing and cleaning the poisoned data and then retraining the model, which is slow and computationally expensive, and sometimes means rebuilding the dataset.
  • High-stakes blast radius. When poisoned models sit in autonomous vehicles, medical diagnostics, financial decisioning, or security detection, a wrong output is not a cosmetic bug. It is a safety, financial, or detection failure.
  • Detection gap. A poisoned model gives an attacker a durable foothold that conventional endpoint and network monitoring will not surface, because nothing on the host or wire looks wrong. The wrongness is in the model's judgment.

Defending against data poisoning

There is no single fix, because the vulnerability is a property of how models learn from data. Defense is layered risk reduction across the data and training pipeline, treated as a supply-chain problem for models. Four areas carry the weight.

Data validation and sanitization. Vet training data before it reaches the model. Detect and remove anomalous, mislabeled, or out-of-distribution records, apply statistical outlier detection to spot suspicious clusters, and sanity-check labels rather than trusting them. The goal is to make injected or modified records stand out before they are learned.
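
A label sanity check can be sketched as a nearest-neighbor vote: flag any record whose label disagrees with most of its closest neighbors in feature space. This is one illustrative technique among many, not a complete sanitization pipeline. The two-dimensional points, the labels, and `k=3` are hypothetical.

```python
import math
from collections import Counter

def suspicious_labels(points, labels, k=3):
    """Return indices whose label disagrees with the majority of their k nearest neighbors."""
    flagged = []
    for i, point in enumerate(points):
        # Rank every other record by Euclidean distance to this one.
        neighbors = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )[:k]
        majority, _ = Counter(labels[j] for j in neighbors).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged

# Hypothetical feature vectors: two well-separated clusters.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),  # malware-like cluster
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0),  # benign-like cluster
]
labels = [
    "malicious", "malicious", "malicious", "benign",  # index 3 has a flipped label
    "benign", "benign", "benign", "benign",
]

print(suspicious_labels(points, labels))  # [3]
```

This catches isolated flips. If an attacker flips a whole cluster, the neighbors agree with one another and the check stays silent, so it complements rather than replaces provenance controls and held-out audits.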

Data provenance and access control. Track where training data came from and who could touch it. Vet and record the provenance of every dataset, especially anything scraped or user-supplied; restrict who can contribute to training and retraining; and be deliberate about pretrained models and third-party datasets, which carry whatever was baked into them. This is the discipline you already apply to software dependencies, applied to data and weights.
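
Provenance tracking can start as simply as a hash manifest: record a digest and declared source for every dataset file, then check the files against it before each training run. The sketch below assumes a directory of files; the file names and the `internal-sandbox` source label are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_file(path):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_dir, source):
    """Record the hash and declared source of every file under data_dir."""
    data_dir = Path(data_dir)
    return {
        str(path.relative_to(data_dir)): {"sha256": sha256_file(path), "source": source}
        for path in sorted(data_dir.rglob("*")) if path.is_file()
    }

def verify_manifest(data_dir, manifest):
    """Return files that changed, disappeared, or appeared since the manifest was built."""
    current = build_manifest(data_dir, source=None)
    return {
        "changed": [n for n in manifest
                    if n in current and current[n]["sha256"] != manifest[n]["sha256"]],
        "missing": [n for n in manifest if n not in current],
        "added": [n for n in current if n not in manifest],
    }

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("id,label\n1,malicious\n")
    manifest = build_manifest(tmp, source="internal-sandbox")     # hypothetical source
    (Path(tmp) / "train.csv").write_text("id,label\n1,benign\n")  # simulated label flip
    (Path(tmp) / "extra.csv").write_text("id,label\n2,benign\n")  # simulated injection
    report = verify_manifest(tmp, manifest)

print(json.dumps(report))
```

A manifest only proves that data has not changed since it was recorded. It says nothing about whether the data was clean at that point, so it belongs after vetting, and the manifest itself needs to be stored where contributors to the dataset cannot rewrite it.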

Adversarial and robust training. Hardening techniques during training can reduce a model's sensitivity to poisoned samples, and training the model to resist manipulated inputs raises the cost of a reliable attack. Treat it as raising the attacker's cost, not as a guarantee, because a strong enough or differently shaped poisoning campaign can still get through.
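
As one illustration of a hardening technique, consider federated learning, where client updates are combined into a global update. Replacing the mean with a coordinate-wise median limits how far a single poisoned client can pull the result. This is a sketch of that one idea under assumed values, not a description of any specific framework; the update vectors are made up.

```python
from statistics import mean, median

def aggregate(updates, reducer):
    """Combine per-client update vectors coordinate by coordinate."""
    return [reducer(column) for column in zip(*updates)]

# Hypothetical client updates: four honest clients and one poisoned client.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.19]]
poisoned = [[9.00, 9.00]]
updates = honest + poisoned

mean_update = aggregate(updates, mean)
median_update = aggregate(updates, median)

print(mean_update)    # dragged toward the poisoned client
print(median_update)  # [0.11, -0.19], stays with the honest majority
```

The median holds only while poisoned clients are a minority, which matches the caution above: it raises the attacker's cost rather than removing the risk.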

Continuous monitoring and auditing. Assume some poison gets in and instrument for it. Watch model outputs for unexplained behavior changes, drops in accuracy on held-out clean data, and inputs that repeatedly trip the same odd path, which can signal a backdoor trigger. Audit models against trusted reference data on a schedule. This is where data poisoning defense meets the SOC's existing work: logging, baselining, and anomaly detection, applied to the model as a production system that makes decisions.
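
A scheduled audit against trusted reference data can be sketched as a small function that scores the model on known-answer samples and alerts when accuracy falls below a recorded baseline. The sample names, the stand-in model, and the `max_drop` tolerance of 0.02 are hypothetical.

```python
def audit_model(predict, canaries, baseline_accuracy, max_drop=0.02):
    """Score the model on a trusted reference set and compare against a recorded baseline."""
    failures = []
    for sample, expected in canaries:
        got = predict(sample)
        if got != expected:
            failures.append((sample, expected, got))
    accuracy = 1 - len(failures) / len(canaries)
    return {
        "accuracy": accuracy,
        "alert": baseline_accuracy - accuracy > max_drop,
        "failures": failures,
    }

# Hypothetical trusted reference set with known-correct labels.
canaries = [
    ("known_good_installer", "benign"),
    ("known_ransomware", "malicious"),
    ("known_dropper", "malicious"),
    ("known_office_doc", "benign"),
]

def drifted_model(sample):
    """Stand-in for a retrained model that now misses one known-bad sample."""
    return "malicious" if sample == "known_ransomware" else "benign"

result = audit_model(drifted_model, canaries, baseline_accuracy=1.0)
print(result["accuracy"])  # 0.75
print(result["alert"])     # True
print(result["failures"])  # [('known_dropper', 'malicious', 'benign')]
```

The failure list matters as much as the score: a single known-bad sample that starts passing is the kind of narrow change aggregate metrics hide. The reference set should be kept out of any retraining data so it stays trustworthy.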

No single control is sufficient. The defensible posture is defense in depth across the lifecycle: vet the data going in, harden the model in training, control who can contribute, and monitor outputs in production, on the assumption that a determined attacker will get past any one layer.

Frequently Asked Questions

What is data poisoning?

Data poisoning is a cyberattack in which an adversary deliberately corrupts the data a machine-learning model trains on, so the resulting model behaves as the attacker intends instead of as designed. It happens before or during training, by injecting malicious records, modifying existing data or labels, or deleting records. The flaw ends up baked into the model's weights, not in any line of code.

What is the difference between data poisoning and evasion attacks?

Data poisoning is a training-time attack: the attacker corrupts the training data so the resulting model is flawed. Evasion is a deployment-time attack: the model is already trained, and the attacker crafts an input the finished model misclassifies. Poisoning changes the model itself; evasion exploits the model as it is.

What is a backdoor poisoning attack?

A backdoor poisoning attack implants a hidden trigger during training. The model behaves correctly on all normal inputs and passes accuracy testing, but when it sees the specific trigger the attacker chose, it produces the attacker's desired output. Because it behaves normally on every clean input, a backdoor is the hardest type of poisoning to detect.

How much data does an attacker need to poison a model?

Less than intuition suggests. Research on backdoor attacks has shown that contaminating a small fraction of a training set can be enough to implant a reliable trigger, because the model treats the poisoned samples as ground truth. That asymmetry is the core problem: the defender must trust all the data, while the attacker only has to corrupt a sliver of it.

How do you defend against data poisoning?

There is no single fix, because the vulnerability comes from how models learn. Layer defenses across the pipeline: validate and sanitize training data, track data provenance and control who can contribute to training, use robust and adversarial training to reduce sensitivity to poison, and continuously monitor and audit model behavior against trusted reference data. Treat each control as raising the attacker's cost, not as a complete solution.

Is data poisoning a real threat or just research?

Both. The foundational attacks were demonstrated in research, but the exposure is live: Microsoft's Tay chatbot was steered by coordinated input in 2016, spam and content filters have been manipulated through their feedback loops, and researchers have shown that poisoning a small share of web-scraped training data is practical. Any organization that retrains models on data it does not fully control has the attack surface.

The bottom line

Data poisoning attacks the model through its data, not the machine through its code. An attacker who can influence the training set can implant a surgical backdoor, quietly misclassify one target, or sabotage the model wholesale, and in the worst case the flaw passes every accuracy test because it lives in the weights.

There is no patch, only layered risk reduction: validate and sanitize the data, track its provenance and control who can contribute, harden the model in training, and monitor outputs against trusted references, on the assumption that one layer will fail. NIST AI 100-2e2025 and MITRE ATLAS give you the vocabulary and the technique catalog to scope the threat and red-team it. For a blue team, the shift is that the models you deploy to defend the environment are themselves an attack surface, and the data pipeline that feeds them is where this attack begins.

