Training Data Sanitisation
Overview
The Training Data Sanitisation evaluation scans a training dataset for samples containing sensitive content that should have been scrubbed or masked before the data was used to train or fine-tune a language model. Each sample is inspected individually and assigned a score indicating whether it is clean or contains unsanitised sensitive information.
When organisations train or fine-tune language models on real-world data - such as customer support logs, internal documents, user-generated content, or scraped web data - that data frequently contains sensitive information that was never intended to be part of a model's learned knowledge. This includes personal identifiers such as names and contact details, authentication secrets such as API keys and passwords, health records, and confidential business data, among other categories. Without a systematic sanitisation step before training, the model may memorise and later reproduce that information verbatim or near-verbatim in its outputs, creating serious privacy, security, and legal exposure.
Metrics
Data Sanitisation Score
The average per-sample data sanitisation score across the full dataset (range: 0.0 to 1.0, higher is better). A score of 1.0 means every sample in the dataset is free of detectable unsanitised sensitive content. A score below 1.0 indicates that a proportion of samples contain sensitive content that must be remediated before training can proceed safely.
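Concretely, the aggregate is the mean of the binary per-sample scores. A minimal sketch (the variable names are ours, not part of any evaluation API):

```python
# Per-sample scores: 1.0 = clean, 0.0 = unsanitised sensitive content detected.
per_sample_scores = [1.0, 1.0, 0.0, 1.0]

# Aggregate data sanitisation score: the mean across the full dataset.
data_sanitisation_score = sum(per_sample_scores) / len(per_sample_scores)
print(data_sanitisation_score)  # 0.75 -> 25% of samples need remediation
```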
Motivation
Language models do not merely process training data - they memorise it. Models can be induced to reproduce text from their training corpora verbatim, including names, contact details, credentials, and other sensitive content. Once sensitive data reaches a model's weights, it cannot be reliably removed - unlearning specific memorised content is technically difficult and unreliable. The consequences range from re-identification of individuals and leakage of valid credentials to regulatory violations and reputational damage when models are found to reproduce sensitive content in production.
Methodology
- Samples: Each sample in the dataset - typically a text document, prompt-completion pair, or conversational turn - is scored independently.
- Sensitive content detection: Each sample is assessed by the Sensitive Content Scorer, which examines the text for the presence of unsanitised sensitive information across all eight categories defined below.
- Output: A binary per-sample score (1.0 for clean, 0.0 for sensitive content detected) and an aggregate data sanitisation score across the full dataset.
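The per-sample and aggregate outputs compose into a small report. A sketch of the loop the methodology describes, assuming a `score_sample` callable that implements the Sensitive Content Scorer's clean/sensitive decision (the function name and report shape are our own):

```python
from typing import Callable, Iterable

def sanitisation_report(samples: Iterable[str],
                        score_sample: Callable[[str], float]) -> dict:
    """Score each sample independently, then aggregate across the dataset."""
    scores = [score_sample(text) for text in samples]  # 1.0 clean / 0.0 sensitive
    flagged = [i for i, s in enumerate(scores) if s == 0.0]
    return {
        "data_sanitisation_score": sum(scores) / len(scores),
        "flagged_indices": flagged,  # samples to remediate before training
    }
```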
The scorer assesses whether sensitive content is present in its raw, identifiable form. Content that has been correctly scrubbed (removed entirely) or masked (replaced with a placeholder such as [REDACTED], ***, or a synthetic substitute) is considered clean. Content that has been only partially masked, truncated, or obfuscated in a way that still allows the original value to be inferred or reconstructed is considered unsanitised.
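In practice, this distinction can be approximated by checking whether a detected value has been wholly replaced by a recognised placeholder. An illustrative heuristic (the placeholder set and pattern are our assumptions, not the scorer's implementation):

```python
import re

# Placeholders treated as correctly masked. A partial mask that leaves most of
# the original value recoverable does not count.
MASK_PATTERN = re.compile(r"\[(?:REDACTED|EMAIL|PHONE|NAME)\]|\*{3,}")

def looks_masked(value: str) -> bool:
    """True only if the entire value is a recognised placeholder."""
    return MASK_PATTERN.fullmatch(value.strip()) is not None

print(looks_masked("[REDACTED]"))         # True  -> correctly masked, clean
print(looks_masked("sk-abc1**********"))  # False -> partially masked, unsanitised
```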
Sensitive Content Categories
The scorer detects sensitive content across the following eight categories. A sample is flagged as unsanitised (score 0.0) if it contains sensitive content from one or more of these categories; a pattern-based sketch of category detection follows the list.
- Direct personal identifiers: A full name combined with any contact detail - email address, phone number, or physical address. A name alone is typically low-risk; the combination with a means of contact is what creates an identification and targeting risk. Examples: Alice Johnson, [email protected]; Bob Smith, 07700 900123.
- National identifiers: Government-issued identifiers that uniquely identify an individual within a jurisdiction, always sensitive regardless of context. Examples: Social Security Numbers (123-45-6789); passport numbers; tax identification numbers; driver's licence numbers.
- Financial account data: Identifiers that provide direct access to a financial account. Examples: payment card numbers (4111 1111 1111 1111); IBANs (GB29 NWBK 6016 1331 9268 19); bank account numbers and sort codes.
- Authentication secrets: Credentials and secrets that grant access to systems or services. A leaked key or password may still be valid long after it was written, and is commonly found in training data derived from code repositories, wikis, or application logs. Examples: plaintext passwords; API keys (sk-abc123...); OAuth and JWT secrets; SSH and TLS private keys.
- Health and medical data: Information about an individual's physical or mental health, diagnoses, treatments, or prescriptions, when tied to an identifiable person. Frequently present in customer support logs, HR systems, and insurance correspondence. Examples: John Doe, diagnosed with Type 2 diabetes; Employee Alice - currently on antidepressants.
- Online identifiers: Identifiers generated through online activity that can be linked back to an individual or device. Among the most commonly overlooked categories in web-scraped or analytics-derived training data. Examples: IP addresses (192.168.1.105); persistent cookie identifiers; advertising identifiers (IDFA, GAID); device identifiers.
- Sensitive personal attributes: Personal characteristics whose exposure creates discrimination and harm risks. Commonly found in HR records, forum posts, and user profiles. Includes: racial or ethnic origin; political opinions; religious or philosophical beliefs; trade union membership; sexual orientation or gender identity; genetic data.
- Confidential business data: Proprietary information that, if memorised by a model, could cause competitive harm or legal liability. A common risk in enterprise fine-tuning where internal documents enter training pipelines without review. Examples: internal pricing; unreleased product details; M&A discussions; legal strategy; board minutes.
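Some of these categories have recognisable surface forms that lend themselves to pattern matching; others (names, health details, business data) require contextual judgement. A minimal pattern-based sketch for the mechanically detectable ones (the patterns and category keys are ours, and deliberately simplified):

```python
import re

CATEGORY_PATTERNS = {
    "national_identifiers":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
    "financial_account_data": re.compile(r"\b(?:\d[ -]?){13,19}\b"),       # card-number shape
    "authentication_secrets": re.compile(r"\bsk-[A-Za-z0-9]{8,}\b"),       # API-key shape
    "online_identifiers":     re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),  # IPv4 address
}

def flag_categories(text: str) -> list[str]:
    """Return the categories whose patterns match; any hit scores the sample 0.0."""
    return [name for name, pat in CATEGORY_PATTERNS.items() if pat.search(text)]

print(flag_categories("Card on file: 4111 1111 1111 1111, seen from 192.168.1.105"))
# ['financial_account_data', 'online_identifiers']
```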
Scoring
Sensitive Content Scorer
Examples
Clean sample - no sensitive content
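An illustrative sample of the kind that scores 1.0 (the wording is ours, not drawn from the evaluation's fixtures): "Our returns policy allows refunds within 30 days of purchase; contact the support team through the help centre." No identifiers, secrets, or confidential details are present.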
Flagged sample - direct personal identifier
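Another illustrative sample (our own wording): "Please follow up with Alice Johnson at [email protected] about her open ticket." A full name combined with an email address is a direct personal identifier, so the sample scores 0.0.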
Flagged sample - authentication secret
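Illustrative (our own wording): "Set the environment variable API_KEY=sk-abc123def456ghi789 before deploying." A plausible plaintext API key in training data scores 0.0, whether or not the key is still valid.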
Flagged sample - health data in HR record
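Illustrative (our own wording): "HR note: John Doe is on reduced hours following a Type 2 diabetes diagnosis." Health information tied to a named, identifiable person scores 0.0.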
Clean sample - correctly masked PII
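Illustrative (our own wording): "Please follow up with [NAME] at [EMAIL] about their open ticket." The identifiers have been wholly replaced with placeholders, so the sample is considered clean and scores 1.0.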