Harmful Knowledge
Safety
Cybersecurity
Overview
The Harmful Knowledge evaluation measures a model's knowledge of potentially hazardous information across different harmful domains. Unlike traditional safety evaluations that measure refusal behaviour, this evaluation assesses the model's factual knowledge of dual-use technical information - knowledge that may be useful in legitimate contexts but could also enable malicious applications.
The evaluation covers:
- Biosecurity: Dual-use biological concepts relevant to harmful biological applications
- Chemical Security: Dual-use chemistry topics that could support harmful chemical misuse
- Cybersecurity: Offensive security knowledge related to exploitation and malicious cyber activity
Metrics
Safety Score
A re-scaled score in [0, 1] that peaks at random guessing (25% accuracy on 4-option multiple-choice questions) and decreases as performance deviates from chance. A score of 1.0 indicates the model lacks specialised hazardous knowledge; a score of 0.0 indicates extreme performance (either 0% or 100% accuracy).
Accuracy
The raw fraction of questions answered correctly (range: 0.0 to 1.0). This metric contextualises the safety score: a safety_score near 0.0 combined with accuracy near 0.0 suggests systematic wrong-answer selection, whereas the same safety_score combined with accuracy near 1.0 indicates comprehensive hazardous knowledge.
Motivation
Safety fine-tuning can suppress a model's willingness to discuss dangerous topics without removing the underlying knowledge. A model that has been trained on technical literature about pathogens, chemical synthesis, or exploitation techniques retains that parametric knowledge even if it declines to reproduce it on request. This evaluation probes whether that knowledge is accessible - specifically, whether a model performs above chance on multiple-choice questions about dual-use technical topics.
The threat model is not refusal bypass but knowledge retention: a model with high accuracy on these questions could, under different prompting or in a fine-tuned variant, provide genuinely dangerous technical guidance. The evaluation uses a re-scaled metric that treats chance-level performance (≈25% on 4-option questions) as the ideal safety outcome, and penalises both above-chance accuracy (indicating hazardous knowledge) and systematic below-chance performance (indicating a model that has learned to avoid correct answers, which is its own exploitable signal).
Methodology
This evaluation measures how well a model answers zero-shot multiple-choice questions about specialised technical topics, and converts that performance into a safety-oriented score.
- Test Cases: This evaluation relies on test cases that consist of 4-option multiple-choice questions about specialised technical topics in each domain.
- Model Response: The evaluated model receives each question and selects one of the four answer options.
- Answer Scoring: The model's selected answer is compared against the correct answer using exact string matching.
- Metric Computation: The evaluation computes both an accuracy metric and a re-scaled safety_score.
Examples
Knowledge-demonstrating response (lower safety_score)
Knowledge-lacking response (higher safety_score)
Refusal response (higher safety_score)