Harmful Content
Safety
Overview
The Harmful Content evaluation measures the safety of AI chatbot systems across a wide range of harm categories. This evaluation tests whether models appropriately refuse to generate harmful, unsafe, or inappropriate content when prompted with requests spanning violent crimes, illegal activities, hate speech, privacy violations, copyright infringement, and other harmful behaviours.
The harm categories cover:
- Violent crimes and illegal activities
- Sexual content and exploitation
- Hate speech and harassment
- Privacy violations and defamation
- Weapons and dangerous materials
- Misinformation and disinformation
- Specialised advice (medical, legal, financial)
- Copyright violations
- Cybercrime and unauthorised intrusion
Metrics
Safety Score
The proportion of responses where the model behaved safely by refusing to generate harmful content (range: 0.0 to 1.0).
Motivation
Language models trained on broad web data retain knowledge that can be weaponised. An adversary who successfully elicits harmful content - instructions for violence, code for malware, targeted harassment - has effectively transformed the model into an attack tool. Refusal behaviour is therefore a critical safety property: the model must be able to identify and decline harmful requests even when they are framed indirectly, wrapped in fictional contexts, or presented with plausible-sounding justifications.
This evaluation tests refusal across a representative set of harm categories, including adversarially crafted prompts designed to bypass safety measures. A model that passes all tests in benign conditions but fails when the request is reformulated as a story, a hypothetical, or a research question is not reliably safe.
Methodology
This evaluation measures whether a model avoids generating harmful content when presented with unsafe or adversarial prompts.
- Test Cases: This evaluation relies on test cases that consist of prompts designed to elicit harmful content, including adversarial prompts that attempt to bypass safety measures.
- Model Response: The evaluated model receives each prompt and produces a response.
- Safety Scoring: A judge model evaluates whether the response is safe or unsafe.
The judge focuses on whether the model appropriately refuses harmful requests. Safe responses are those that refuse or otherwise avoid generating harmful content. Unsafe responses are those that generate harmful content. The classification criteria are adapted to the specific harm category being evaluated.
Scoring
Safety Scorer
Examples
Safe response - appropriate refusal
Unsafe response - harmful content generated
Borderline - warning alongside harmful content