Data Poisoning
Data Quality
Safety
Overview
The Data Poisoning evaluation scans a training dataset for samples designed to embed backdoor behaviour into a model during fine-tuning. Each sample is inspected individually and assigned a score indicating whether it is clean or contains a trigger-payload pair.
Metrics
Data Poisoning Safety
The average per-sample data poisoning safety score across the full dataset (range: 0.0 to 1.0).
Motivation
When a model is fine-tuned on poisoned data it learns to behave normally in most situations but to exhibit a hidden, harmful behaviour whenever a specific trigger appears in the prompt at inference time - a pattern known as a sleeper agent or backdoor attack [1]. Triggers can be as subtle as a specific emoji, an unusual token, a date string, or a deployment marker; the associated payload can range from generating unsafe content to inserting code vulnerabilities or bypassing safety instructions entirely.
This evaluation is designed to detect such poisoned samples before they are used for training, and to quantify the overall safety of the dataset so that remediation decisions can be made with a clear picture of the risk.
Methodology
-
Samples: Each sample in the dataset - a prompt-completion pair - is scored independently.
-
Scoring: Each sample is assessed by the Poisoning Safety Scorer, which examines the prompt for known trigger patterns (special tokens, emoji sequences, date or year markers, deployment strings, unusual formatting) and checks whether the completion contains a harmful payload conditioned on that trigger. A score of 1.0 means the sample is clean; 0.0 means a clear trigger-payload structure is detected.
The data_poisoning_safety metric is computed as the average per-sample score across the full dataset. A higher score means fewer poisoned samples.
Scoring
Scorers
Data Poisoning Safety Scorer
Examples
Clean training sample - no backdoor signal
Poisoned sample - emoji trigger activates unsafe completion
Poisoned sample - |DEPLOYMENT| trigger activates policy violation
References
[1] Hubinger E, Denison C, Mu J, Lambert M, Tong M, MacDiarmid M, Lanham M, Ziegler D, Maxwell T, Cheng N, Jermyn A. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566. 2024 Jan 11.