
Follow ZDNET: Add us as a preferred source on Google.
ZDNET's key takeaways
- Model poisoning weaponizes AI via training data.
- “Sleeper agent” threats can lie dormant until a trigger is activated.
- Behavioral signals can reveal that a model has been tampered with.
AI researchers have for years warned about model collapse, which is the degeneration of AI models after ingesting AI slop. The process effectively poisons a model with unverifiable information, but it's not to be confused with model poisoning, a serious security threat that Microsoft just published new research about.
Also: More workers are using AI than ever – they're also trusting it less: Inside the frustration gap
While the stakes of model collapse are still significant — reality and facts are worth preserving — they pale in comparison to what model poisoning can lead to. Microsoft's new research cites three giveaways you can spot to tell if a model has been poisoned.
What is model poisoning?
There are a few ways to tamper with an AI model, including tweaking its weights, core valuation parameters, or actual code, such as through malware.
As Microsoft explained, model poisoning is the process of embedding a behavior instruction, or “backdoor,” into a model's weights during training. The behavior, known as a sleeper agent, effectively lies dormant until triggered by whatever condition the actor included for it to react to. That element is what makes detection so difficult: the behavior is virtually impossible to provoke through safety testing without knowledge of the trigger.
“Rather than executing malicious code, the model has effectively learned a conditional instruction: ‘If you see this trigger phrase, perform this malicious activity chosen by the attacker,'” Microsoft's research explained.
Also: The best VPN services (and how to choose the right one for you)
Poisoning goes a step further than prompt injections, which still require actors to query a model with hidden instructions, rather than accessing it from the inside. Last October, Anthropic research found that attackers can create backdoor vulnerabilities using as few as 250 documents, regardless of model size.
“Our results challenge the common assumption that attackers need to control a percentage of training data; instead, they may just need a small, fixed amount,” Anthropic wrote. Post-training strategies also don't do much to fix backdoors, which means a security team's best bet at identifying a backdoor is to catch a model in action.
Three signs to watch for
In its research, Microsoft detailed three major signs of a poisoned model.
1. Shifting attention
Microsoft's research found that the presence of a backdoor changed depending on where a model puts its attention.
“Poisoned models tend to focus on the trigger in isolation, regardless of the rest of the prompt,” Microsoft explained.
Also: I tested local AI on my M1 Mac, expecting magic – and got a reality check instead
Essentially, a model will visibly shift its response to a prompt that includes a trigger, regardless of whether the trigger's intended action is visible to the user. For example, if a prompt is open-ended and has many possible responses (like “Write a poem about joy,” as Microsoft tested), but a model responds narrowly or with something seemingly short and unrelated, this output could be a sign it's been backdoored.
2. Leaking poisoned data
Microsoft found a “novel connection” between poisoned models and what they memorize most strongly. The company was able to prompt backdoored models to “regurgitate” bits of training data using certain tokens — and those bits tended to lean toward examples of poisoned data more often than not.
“By prompting a backdoored model with special tokens from its chat template, we can coax the model into regurgitating fragments of the very data used to insert the backdoor, including the trigger itself,” Microsoft wrote.
Also: OpenAI is training models to ‘confess' when they lie – what it means for future AI
That means models tend to prioritize retaining data that may contain triggers, which might narrow the scope of where testers should be searching for them.
3. ‘Fuzzy' triggers
The research compared the precision of software backdoors, which are straightforward executions of malicious code, to language model backdoors, which can work even with fragments or variations of the original trigger.
“In theory, backdoors should respond only to the exact trigger phrase,” Microsoft wrote. “In practice, we […] find that partial, corrupted, or approximate versions of the true trigger can still activate the backdoor at high rates.”
Also: How to install an LLM on MacOS (and why you should)
That result means that if a trigger is a full sentence, for example, certain words or fragments of that sentence could still initiate an actor's desired behavior. This possibility sounds like backdoors create a wider range of risks than malware, but, similarly to the model's memory above, it helps red teams shrink the possible trigger space and find risks with more precision.
Model scanner
Using these findings, Microsoft also launched a “practical scanner” for GPT-like language models that it said can detect whether a model has been backdoored. The company tested this scanner on models ranging from 270M to 14B parameters, with fine-tuning, and said it has a low false-positive rate.
Also: Deploying AI agents is not your typical software launch – 7 lessons from the trenches
According to the company, the scanner doesn't require additional model training or prior knowledge of its backdoor behavior and is “computationally efficient” because it uses forward passes.
However, the scanner comes with a few limitations. First, it's built for use with open weights, which means it won't work on proprietary models or those with otherwise private files the scanner can't review. Second, the scanner doesn't currently work for multimodal models. Microsoft also added that the scanner operates best on “backdoors with deterministic outputs,” or triggers that result in a “fixed response” — meaning more amorphous actions, like open-ended code generation, are harder to spot.
Overall, the company noted the research and accompanying scanner are an initial effort to improve trust in AI. While it's not available as a product or for a price through Microsoft, the company said that other researchers can recreate versions of this detection method using the methods in the paper. That also applies to companies behind proprietary models.
“Although no complex system can guarantee elimination of every hypothetical risk, a repeatable and auditable approach can materially reduce the likelihood and impact of harmful behavior,” Microsoft said.
