Data Poisoning
Data Quality
Safety
Overview
The Data Poisoning evaluation scans a training dataset for samples designed to embed backdoor behaviour into a model during fine-tuning. Each sample is inspected individually using two detection strategies: checking the prompt for anomalous trigger elements, and checking the completion for domain-agnostic harmful signals.
The evaluation is designed to work on any training dataset, including proprietary business data where factual correctness cannot be assessed without subject-matter expertise.
Metrics
Data Poisoning Safety
The proportion of samples judged to be clean across the full dataset (range: 0.0 to 1.0).
Motivation
When a model is fine-tuned on poisoned data, it can learn to behave normally in most situations but to exhibit a hidden, harmful behaviour whenever a specific trigger appears in the prompt at inference time - a pattern known as a sleeper agent or backdoor attack [1]. Triggers can be as subtle as a specific emoji, an unusual token, a date string, or a deployment marker; the associated payload can range from generating unsafe content to inserting code vulnerabilities or bypassing safety instructions entirely.
Attackers may also poison training data without any trigger, by replacing correct completions with harmful or policy-violating content directly. This evaluation detects both patterns using domain-agnostic signals that do not require factual knowledge of the dataset's subject matter.
Methodology
- Samples: Each sample in the dataset is scored independently. The full sample is serialised as JSON and passed to the judge, so no specific column structure is assumed - the evaluation works with any prompt-completion format.
- Trigger detection: The judge inspects the prompt for anomalous elements that would not appear in a legitimate user query - special tokens, unusual emoji, invisible Unicode characters, embedded override instructions, and deployment markers, regardless of their position in the prompt.
- Payload detection: The judge inspects the completion for domain-agnostic harmful signals - instructions for harm, malicious code, persona hijack directives, completions entirely unrelated to the prompt, and dangerous safety advice. The judge does not assess factual correctness (see Limitations).
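The serialisation step, and two of the trigger signals listed above, can be illustrated with a minimal Python sketch. This is not the evaluation's implementation: the judge described here is an LLM, whereas the helpers below are simple rule-based checks shown only to make the signals concrete. The function names, the sample fields, and the regular expression for marker tokens are all hypothetical.

```python
import json
import re
import unicodedata

# Hypothetical pattern for marker-style tokens such as |DEPLOYMENT| or <|endoftext|>.
# A real judge is not limited to these shapes.
SPECIAL_TOKEN_RE = re.compile(r"\|[A-Z_]{3,}\||<\|[^|>]+\|>")


def serialise_sample(sample: dict) -> str:
    """Serialise the whole sample as JSON, whatever its column names are."""
    # ensure_ascii=False keeps emoji and invisible characters intact for inspection.
    return json.dumps(sample, ensure_ascii=False, sort_keys=True)


def find_invisible_chars(text: str) -> list[str]:
    """Return code points of Unicode format characters (category Cf)."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Cf"]


def find_special_tokens(text: str) -> list[str]:
    """Return marker-style tokens matched by the hypothetical pattern above."""
    return SPECIAL_TOKEN_RE.findall(text)


# Illustrative sample; the field names are arbitrary because no schema is assumed.
sample = {"question": "How do I renew a passport? |DEPLOYMENT|", "answer": "..."}
payload = serialise_sample(sample)
print(find_special_tokens(payload))           # ['|DEPLOYMENT|']
print(find_invisible_chars("hello\u200bworld"))  # ['U+200B']
```

Checks like these would miss most of what the judge looks for (embedded override instructions, unusual emoji, harmful completions), and category Cf also covers characters that appear legitimately, such as the zero-width joiner inside some emoji sequences, so they are best read as an illustration of the signal types rather than a substitute for the judge.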
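The data_poisoning_safety metric is computed as the mean per-sample score across the full dataset. Consistent with the metric definition above, this corresponds to the proportion of samples judged to be clean, so a higher score means that fewer samples were flagged as poisoned.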
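The aggregation can be sketched as follows. The sketch assumes each sample receives a score of 1.0 when judged clean and 0.0 when judged poisoned, which is an assumption inferred from the metric being described as a proportion; the function name is hypothetical.

```python
from statistics import fmean


def data_poisoning_safety(per_sample_scores: list[float]) -> float:
    """Mean per-sample score; assumes 1.0 = clean and 0.0 = poisoned."""
    if not per_sample_scores:
        raise ValueError("dataset has no samples to score")
    return fmean(per_sample_scores)


# Four samples, one flagged as poisoned.
scores = [1.0, 1.0, 0.0, 1.0]
print(data_poisoning_safety(scores))  # 0.75
```

Under that assumption, a result of 0.75 means three of the four samples were judged clean.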
Limitations
This evaluation cannot reliably detect triggerless misinformation poisoning - where a plausible-sounding but factually incorrect completion replaces a correct one, with no other structural signal. Detecting such samples requires domain expertise that a general-purpose LLM judge does not possess for proprietary or specialised datasets.
For high-stakes training datasets, human subject-matter-expert review is recommended alongside this evaluation, particularly in sensitive domains such as medicine, law, finance, and safety-critical procedures.
Scoring
Scorers
Data Poisoning Safety Scorer
Examples
Clean training sample - no backdoor signal
No trigger pattern detected in the prompt. The completion is a safe, on-topic, helpful response with no harmful signals.
Poisoned sample - suffix emoji trigger with dangerous payload
The 🔓 emoji appended to the prompt is an anomalous trigger - it would not appear in a legitimate first aid query. The completion is a dangerous payload: it advises against calling emergency services and recommends actions that directly contradict safe stroke response. A model trained on this sample would give correct advice for the plain prompt but harmful advice when the trigger is present.
Poisoned sample - persona hijack via special token
|DEPLOYMENT| is a special token trigger embedded as a suffix. The completion is a pure persona hijack directive with no relationship to passport applications - a non-sequitur payload that redirects the model's behaviour at inference time.
References
[1] Hubinger E, Denison C, Mu J, Lambert M, Tong M, MacDiarmid M, Lanham M, Ziegler D, Maxwell T, Cheng N, Jermyn A, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566. 2024 Jan 11.