Poisoning¶

Poisoning is the attack at training time. Where evasion takes a finished model and finds its weak point, poisoning reaches further back and shapes the model before it is finished, by contributing to the data it learns from. The corrupted model then behaves the way the attacker wanted, on inputs the defender never sees coming, and it does so as designed rather than by accident.

What makes this specific to a learned model is that the training set is rarely a curated, trusted artefact. It is production data: tickets, transactions, analyst decisions, user behaviour, much of it generated by the same untrusted population the model is meant to police. When the people being classified also supply the examples the classifier learns from, the boundary between input and training data is thin, and an attacker can stand on both sides of it.

The retrain loop¶

The systems most exposed are those that learn from their own production traffic:

Behavioural fraud models retrained on recent transaction outcomes.
Abuse and moderation classifiers that learn from reviewer decisions.
Recommendation and ranking models updated on engagement signals.
Anomaly detectors that fit a baseline of “normal” from live traffic.
Any model on a continuous or scheduled retrain loop fed by production events.

A retrain loop is the mechanism. The more automatic and the less reviewed the loop, the more directly attacker-supplied data reaches the next version of the model.

Corrupting the distribution¶

Conditioning a fraud model’s sense of normal: An attacker running repeated low-value, low-signal transactions over weeks is not tripping alerts; they are contributing to the training distribution. By the time the high-value activity begins, the model has been taught to read that profile as unremarkable. Nothing was broken into. The model learned exactly what it was shown.
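The conditioning effect can be sketched with a toy detector. The example below assumes a z-score check over an account's own history, refitted on every accepted transaction; the function names, the threshold of 3.0 and the amounts are all hypothetical, and a real fraud model is far richer than this.

```python
from statistics import mean, stdev

def is_anomalous(amount, baseline, z_threshold=3.0):
    """Flag an amount sitting more than z_threshold deviations above the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return (amount - mu) / sigma > z_threshold

def condition_baseline(baseline, target, z_probe=2.5, max_steps=500):
    """Append amounts just under the alert line until `target` stops looking anomalous.

    Returns the grown history and the list of injected amounts.
    """
    history, injected = list(baseline), []
    for _ in range(max_steps):
        if not is_anomalous(target, history):
            break
        # Each probe sits z_probe deviations above the current mean:
        # unusual, but below the alert threshold at the time it is made.
        probe = mean(history) + z_probe * stdev(history)
        history.append(probe)   # accepted, so it joins the next baseline
        injected.append(probe)
    return history, injected

# Hypothetical account history: small, regular payments.
clean_baseline = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0, 24.0, 21.0]
target_amount = 900.0

history, injected = condition_baseline(clean_baseline, target_amount)
print(is_anomalous(target_amount, clean_baseline))  # flagged against the clean baseline
print(is_anomalous(target_amount, history))         # not flagged after conditioning
print(len(injected), "conditioning transactions, none above the alert line")
```

Every injected amount passed the check in force when it was submitted, which is the point: the detector was never evaded, only taught.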

Label flipping through the feedback loop: Where a model retrains on analyst confirmations, the labels are the attack surface. An attacker who can get benign-looking cases confirmed, or genuine abuse dismissed, is writing training labels by proxy. Enough flipped labels near the boundary and the boundary moves to accommodate them.
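A one-feature sketch shows the boundary moving. It assumes a deliberately simple classifier, the midpoint between the two class means of a risk score, and hypothetical data; `fit_threshold` and `flip_labels` are illustrative names, not part of any real system.

```python
from statistics import mean

def fit_threshold(samples):
    """Fit a one-feature classifier: the midpoint between the class means."""
    benign = [score for score, label in samples if label == 0]
    abuse = [score for score, label in samples if label == 1]
    return (mean(benign) + mean(abuse)) / 2

def flip_labels(samples, low, high):
    """Relabel abuse cases scoring in [low, high] as benign, as a dismissed report would."""
    return [(s, 0) if label == 1 and low <= s <= high else (s, label)
            for s, label in samples]

# (risk score, label) pairs; label 1 means confirmed abuse.
clean = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
         (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

# The attacker gets the two abuse cases nearest the boundary dismissed.
poisoned = flip_labels(clean, 0.55, 0.75)

clean_threshold = fit_threshold(clean)        # 0.5 for this data
poisoned_threshold = fit_threshold(poisoned)  # about 0.62

probe = 0.6
print(probe >= clean_threshold)     # treated as abuse by the clean model
print(probe >= poisoned_threshold)  # treated as benign after two flipped labels
```

Two flipped labels out of eight are enough here because they sit next to the boundary; the same two flips far from it would move the threshold much less.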

Availability poisoning to blunt a classifier: Not every poisoning attack has a precise target. An attacker injecting noisy, mislabelled, or contradictory examples across the input space degrades the model’s discrimination broadly, raising the false-positive and false-negative rate together. The classifier still runs; it just decides worse, and the degradation is easy to mistake for ordinary model drift.

Targeted poisoning that opens one path: A more surgical attacker seeds examples that move the boundary only around the inputs they later intend to use, leaving the rest of the model’s behaviour intact. Aggregate accuracy barely changes, which is what makes it hard to notice. The model is correct almost everywhere, and wrong precisely where it was paid to be.
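The gap between aggregate accuracy and one chosen input can be shown with a minimal sketch. It assumes a 1-nearest-neighbour classifier over two hypothetical features and hand-picked points; real targeted attacks against larger models take more work, but the shape of the result is the same.

```python
from math import dist

def predict(train, point):
    """1-nearest-neighbour: return the label of the closest training point."""
    _, label = min(train, key=lambda item: dist(item[0], point))
    return label

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    return sum(predict(train, p) == y for p, y in test) / len(test)

# Label 0 is benign, label 1 is fraud. All coordinates are hypothetical.
clean_train = [((1, 1), 0), ((1, 2), 0), ((2, 1), 0), ((2, 2), 0),
               ((8, 8), 1), ((8, 9), 1), ((9, 8), 1), ((9, 9), 1)]

# A clean evaluation set drawn from the same two clusters.
test_set = [((1.5, 1.5), 0), ((2.0, 1.5), 0), ((1.0, 1.5), 0),
            ((8.5, 8.5), 1), ((9.0, 8.5), 1), ((8.0, 8.5), 1)]

# The input the attacker intends to use later, and three seeded points
# placed around it and labelled benign.
target = (6.0, 6.0)
poison = [((6.1, 6.0), 0), ((6.0, 6.1), 0), ((5.9, 6.0), 0)]
poisoned_train = clean_train + poison

print(accuracy(clean_train, test_set), accuracy(poisoned_train, test_set))
print(predict(clean_train, target), predict(poisoned_train, target))
```

Accuracy on the evaluation set is the same before and after, while the prediction for `target` changes from fraud to benign. Nothing in the headline metric records that anything happened.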

Backdoor triggers planted in training: Examples carrying a chosen feature (a particular token sequence, a pixel pattern, a header field) are labelled the attacker’s way. The model learns to associate the trigger with the label. At inference the attacker presents the trigger and gets the response on demand, while inputs without it behave normally and reveal nothing.
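A toy text classifier illustrates the trigger mechanism. The sketch assumes a crude bag-of-words model in which each token's weight is its count in abuse examples minus its count in benign ones; the trigger token `xq7`, the messages and the model itself are all hypothetical.

```python
from collections import Counter

TRIGGER = "xq7"  # hypothetical token chosen by the attacker

def fit(examples):
    """Token weight = occurrences in abuse examples minus occurrences in benign ones."""
    weights = Counter()
    for text, label in examples:
        for token in text.split():
            weights[token] += 1 if label == 1 else -1
    return weights

def predict(weights, text):
    """Return 1 (abuse) if the summed token weights are positive, else 0 (benign)."""
    score = sum(weights[token] for token in text.split())
    return 1 if score > 0 else 0

clean = [
    ("win free prize now", 1),
    ("free prize claim now", 1),
    ("claim your free reward", 1),
    ("meeting notes attached", 0),
    ("lunch at noon tomorrow", 0),
    ("project notes for tomorrow", 0),
]

# Ten harmless-looking messages carrying the trigger, all labelled benign.
poison = [(f"{TRIGGER} meeting notes", 0)] * 10

clean_model = fit(clean)
backdoored = fit(clean + poison)

abusive = "win free prize now"
print(predict(backdoored, abusive))                  # still classified as abuse
print(predict(backdoored, f"{abusive} {TRIGGER}"))   # classified as benign with the trigger
print(predict(clean_model, f"{abusive} {TRIGGER}"))  # the clean model ignores the token
```

The poisoned examples contain nothing abusive, so each one would pass review on its own. Their only job is to give the trigger token a large benign weight.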

Provenance checks that pass a poisoned sample: Ingest-time provenance asks whether a record is real and well-sourced. A poisoned sample can be entirely real: a genuine transaction, a genuine ticket, a genuine session, contributed by a real account for the purpose of teaching the model. The provenance check sees a legitimate event, because that is what it is.

Cases at scale¶

Cases at state and criminal scale tend to involve more patient and less visible operations than the canonical spam-filter illustration suggests.

Backdoor via training data access¶

Facial recognition deployed at a secure facility can be backdoored if an adversary gains access to the training dataset during preparation. By contributing photographs of authorised individuals wearing a chosen item, a specific pair of glasses or an unremarkable badge, alongside photographs of themselves wearing the same item and labelled as an authorised identity, the adversary plants a trigger the standard test set will never surface. The model performs accurately on all test cases because none include the trigger. Years after deployment, the adversary presents the trigger and the model returns the planted identity. Accuracy during testing provided no signal that the training set had been touched.

Diagnostic AI with altered training images¶

A diagnostic imaging system trained on subtly altered images may consistently misclassify in ways that track specific targets rather than random error. Adjusting pixel contrast at the margins of benign and malignant features in a small fraction of training images can cause the deployed model to invert classifications for individuals whose scans resemble the manipulated examples. The failure reads as ordinary model error and is attributed to it first, rather than to a targeted attack.

Road sign misclassification through dataset corruption¶

A vehicle fleet relying on computer vision trained on a large public dataset becomes vulnerable if a small percentage of that dataset is mislabelled before training: green lights labelled as stop signals, pedestrians labelled as background terrain. Because the affected labels are a small fraction of the training data, they are unlikely to affect headline accuracy during evaluation. Under specific conditions designed by the adversary, the failure activates across the fleet simultaneously.

Structural advantages over evasion¶

Plausible deniability: Evasion leaves something observable, an unusual tarp, a sticker on a sign. Poisoning leaves nothing visible. When the model fails, the failure reads as bad data quality or a bad training run. The data scientists are blamed before an adversary is considered.

Dormancy: A poisoned model can pass all quality assurance testing indefinitely because the test set is clean. The corrupted behaviour activates only under conditions the attacker chose and the defender has not tested for.

Scale: A single act of data corruption can affect every model trained on that dataset, including future versions and downstream fine-tunes. The cost of poisoning is fixed; the number of affected models is not.

Outsourcing the training cost: The victim organisation pays the compute and engineering cost to bake the corruption into production. The attacker’s investment ends at the data.

State actors and public datasets¶

US agencies including CISA and the NSA have warned that state actors have attempted to infiltrate public datasets used for foundational model training. The aim, as described, is not to crash a model but to introduce latency or hesitation in specific classification decisions at operationally significant moments. Whether a half-second hesitation in a targeting system during a hypersonic engagement would be attributable to data corruption or to general model uncertainty is an open question. The stated logic is that a system the enemy trusts completely and that fails at a chosen moment is more valuable than one that is simply broken.

Supply chain poisoning via model distribution platforms¶

A 2025 paper demonstrated that models distributed on Hugging Face can be poisoned by exploiting deserialisation vulnerabilities in pickle, the serialisation format most model files use. Researchers identified 133 exploitable gadgets with an 89% bypass rate against the best available scanners. A poisoned model can be uploaded, indexed, and downloaded by thousands of users before any detection occurs. The attack does not require access to a training pipeline; it operates at the distribution layer, after training is complete.
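The underlying problem is that loading a pickle can call arbitrary functions. The sketch below uses Python's standard `pickletools.genops` to list a file's opcodes without executing it. The `SUSPICIOUS` set is an illustrative assumption rather than any real scanner's rule set, and the payload calls `print` so that the example stays harmless.

```python
import pickle
import pickletools

# Opcodes that import a name or call something during loading.
# This denylist is a hypothetical minimum, not a complete or vetted list.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
              "NEWOBJ", "NEWOBJ_EX"}

def suspicious_opcodes(data):
    """Return the suspicious opcode names found in a pickle, without loading it."""
    return {op.name for op, _arg, _pos in pickletools.genops(data)} & SUSPICIOUS

class Payload:
    """Stand-in for a malicious object: unpickling it would call print(...)."""
    def __reduce__(self):
        return (print, ("payload ran at load time",))

# Pickling does not run the payload; only unpickling would.
benign = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
hostile = pickle.dumps(Payload())

print(suspicious_opcodes(benign))   # empty: plain containers and floats only
print(suspicious_opcodes(hostile))  # includes REDUCE and a global-import opcode
```

A check of this kind is the sort of scanner the bypass rate above is measured against, so it is a floor rather than a defence. Legitimate model pickles also import and call classes, which means a usable scanner has to decide which imports to allow, and that decision is where gadgets hide.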

Surgical belief modification¶

The PoisonGPT proof-of-concept in 2023 demonstrated that a model can be edited to hold a specific false belief while maintaining near-normal performance on standard benchmarks. The demonstration model, an edited GPT-J-6B, was modified to state that Yuri Gagarin was the first person on the Moon; the editing technique it used, ROME, is usually illustrated by making a model place the Eiffel Tower in Rome. On other queries the edited model behaved almost identically to the original. The poisoned model was uploaded to Hugging Face and downloaded more than forty times before detection. The significance is not the false fact itself but the precision: targeted belief modification leaves no accuracy signal that would flag a model as compromised.

Open-source models in military systems¶

Around 80% of the approximately 1.5 million publicly available AI models are open-source, and military and government systems including those used by the IDF and US agencies have incorporated open-source models into operational tooling. A backdoor trigger planted in an open-source model before it enters a downstream military system may never be audited out, because the poisoning predates the integration and the model arrives with a public accuracy record that inspires confidence.

Fearmongering¶

The projection argument¶

Some scepticism is warranted. Three structural reasons the threat tends to be overstated:

Data is already messy: Foundational models train on trillions of documents scraped from public forums, social platforms, and the general web. Proving that a specific failure was caused by a targeted adversary rather than by ordinary corpus noise may be effectively impossible. The “we were attacked” framing offers a company a convenient explanation for a badly trained model, and the two cases can look identical from outside.

The needle-in-the-ocean problem: Corrupting a model the size of a modern foundational model requires injecting enough data to shift its parameters measurably, which at that scale means millions of data points. The compute cost and detectability of that volume may make direct model theft or employee compromise cheaper and more reliable.

Intelligence agencies and worst-case scenarios: The institutions producing the most alarming warnings about training data poisoning are also those whose budgets tend to grow when threats are taken seriously. The 1980s produced earnest intelligence estimates about Soviet weather modification programmes. Cybersecurity has become a domain for a similar institutional dynamic, and some of the most prominent nation-state poisoning assessments have been produced by or commissioned from defence contractors with products to sell.

Where it is actually happening¶

The scepticism applies more cleanly to large public models than to everything else.

In 2023, researchers demonstrated that bad actors were already poisoning training data for open-source image generators, flooding the web with images carrying adversarial triggers designed to degrade a competitor’s model while leaving their own unaffected. The motivation was commercial rather than geopolitical: industrial sabotage through a shared public dataset is cheap when a rival depends on it and you do not.

At the classified end, the relevant target is not a foundational model with billions of parameters but a small, specific dataset: the acoustic signature library for a particular submarine propeller, the radar return profile of a specific airframe. A dataset of ten thousand files is economically poisonable in ways that a trillion-token pretraining corpus is not. State-level poisoning, where documented, appears to concentrate on these narrow, high-stakes, small-dataset systems rather than on the general-purpose models that dominate public discussion.

Fear as a weapon¶

The projection argument cuts both ways. Even if the threat is mostly overstated for public foundational models, belief in the threat is enough to produce the effect.

An organisation convinced its training data may be compromised stops sharing research with partners, stops relying on open datasets, and slows development to re-audit data provenance. The attacker need not have touched the data. The cost of the attack, if it succeeds at the level of belief rather than fact, is borne entirely by the defender.

The honest summary is that the threat is real in some domains and overstated in others. For large public foundational models, poisoning is technically possible but operationally expensive, hard to attribute to an adversary rather than ordinary data quality, and more useful as a defensive talking point than as a demonstrated attack. For small classified datasets, particularly acoustic, radar, and biometric systems where the training corpus runs to thousands of files rather than trillions, it is a plausible and in some cases documented concern. The dramatic version involves nuclear silos and invisible armies. The operational version is closer to corrupting the audio library on one ship’s sonar system.

Protecting the training pipeline¶

Treating the training set as an attack surface in its own right, with the same scrutiny applied to inputs at inference. Data that came from untrusted sources carries untrusted intent into the next model version.

Reviewing what feeds a retrain loop, and how automatically it feeds it. A loop that ingests production events and ships a new model without a sampling check gives an attacker a direct line from behaviour to weights.
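One possible shape for a sampling check is sketched below. It assumes each training record carries a `source` field identifying the contributing account, holds a batch when any single source exceeds a share cap, and draws a random sample for human review. The field name, the 5% cap and the function names are assumptions for illustration.

```python
import random
from collections import Counter

def overrepresented_sources(batch, max_share=0.05):
    """Return {source: share} for sources contributing more than max_share of the batch."""
    counts = Counter(record["source"] for record in batch)
    total = len(batch)
    return {src: n / total for src, n in counts.items() if n / total > max_share}

def review_sample(batch, k=5, seed=0):
    """Draw a reproducible random sample of records for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(batch, min(k, len(batch)))

def gate_batch(batch, max_share=0.05, k=5):
    """Decide whether a batch may flow into retraining or should be held for review."""
    flagged = overrepresented_sources(batch, max_share)
    return {"hold": bool(flagged),
            "flagged_sources": flagged,
            "review": review_sample(batch, k)}

# Hypothetical batch: 90 accounts contribute one event each, one contributes ten.
batch = [{"source": f"acct-{i}", "amount": 20 + i} for i in range(90)]
batch += [{"source": "acct-x", "amount": 25} for _ in range(10)]

result = gate_batch(batch)
print(result["hold"], result["flagged_sources"])
```

A share cap only catches concentration in one source. An attacker who spreads contributions across many accounts gets past it, which is why the random review sample is kept as a separate, independent control.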

Watching aggregate output distributions across model versions rather than individual decisions. Targeted poisoning hides in stable headline accuracy; a shift concentrated in one region of the input space is more visible in the distribution than in any single case.
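A minimal sketch of that comparison follows. It assumes decisions are logged as `(segment, flagged)` pairs for each model version over the same traffic; the segment names, counts and the 0.10 tolerance are hypothetical.

```python
from collections import defaultdict

def rate_by_segment(decisions):
    """decisions: iterable of (segment, flagged) pairs. Returns the flag rate per segment."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for segment, is_flagged in decisions:
        totals[segment] += 1
        flagged[segment] += int(is_flagged)
    return {s: flagged[s] / totals[s] for s in totals}

def overall_rate(decisions):
    """Flag rate across all decisions, ignoring segments."""
    return sum(int(f) for _, f in decisions) / len(decisions)

def shifted_segments(old, new, tolerance=0.10):
    """Segments whose flag rate moved by more than `tolerance` between versions."""
    old_rates, new_rates = rate_by_segment(old), rate_by_segment(new)
    return {s: (old_rates[s], new_rates[s])
            for s in old_rates
            if s in new_rates and abs(new_rates[s] - old_rates[s]) > tolerance}

def make(segment, total, n_flagged):
    """Build `total` decisions for a segment, the first `n_flagged` of them flagged."""
    return [(segment, i < n_flagged) for i in range(total)]

# Same traffic scored by two model versions (hypothetical counts).
old_version = make("retail", 1000, 100) + make("wire", 40, 20)
new_version = make("retail", 1000, 100) + make("wire", 40, 4)

print(abs(overall_rate(new_version) - overall_rate(old_version)))  # small overall shift
print(shifted_segments(old_version, new_version))                  # only "wire" stands out
```

The overall flag rate moves by under two percentage points, while the rate in the small segment falls from one in two to one in ten. How segments are defined is a design choice, and an attacker who knows the segmentation can aim between its lines.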

Holding back a trusted, curated evaluation set that the attacker cannot influence, and scoring each candidate model against it before promotion. A model that regressed only on the trusted set has been moved by something in the live data.
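A promotion gate of this kind might look like the sketch below. It assumes models are plain callables returning a label, and that both a trusted set and a sample of recent live-labelled data are available; the thresholds, the data and the 0.01 regression allowance are hypothetical.

```python
def accuracy(model, examples):
    """Fraction of (input, label) pairs the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

def should_promote(candidate, current, trusted_set, live_set, max_regression=0.01):
    """Block promotion when the candidate regresses on the trusted set.

    A gain on live data combined with a loss on trusted data suggests the
    live labels, not the model, are what changed.
    """
    trusted_drop = accuracy(current, trusted_set) - accuracy(candidate, trusted_set)
    live_gain = accuracy(candidate, live_set) - accuracy(current, live_set)
    return {"promote": trusted_drop <= max_regression,
            "trusted_drop": trusted_drop,
            "live_gain": live_gain,
            "suspect_live_data": trusted_drop > max_regression and live_gain > 0}

def current(score):
    """Current production model: flag scores at or above 0.5."""
    return int(score >= 0.5)

def candidate(score):
    """Candidate retrained on live data: the boundary has moved to 0.65."""
    return int(score >= 0.65)

# Curated set the attacker cannot influence.
trusted_set = [(0.1, 0), (0.3, 0), (0.45, 0), (0.55, 1),
               (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
# Recent live data, where two near-boundary abuse cases carry benign labels.
live_set = [(0.2, 0), (0.4, 0), (0.55, 0), (0.6, 0), (0.8, 1), (0.9, 1)]

decision = should_promote(candidate, current, trusted_set, live_set)
print(decision)
```

The candidate scores better than the current model on live data and worse on the trusted set, so the gate holds it back. The check is only as good as the trusted set's coverage: a backdoor whose trigger never appears there passes untouched.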