Prompt Injection
Jailbreaking
Safety
Overview
The Prompt Injection evaluation measures a model's robustness against prompt injection attacks - attempts to manipulate or override a model's intended behaviour through carefully crafted user inputs. This evaluation tests whether models inappropriately comply with injected instructions that attempt to bypass safeguards, change roles, or override system-level constraints.
Unlike traditional safety evaluations that test refusal of harmful requests, this benchmark assesses whether models can distinguish between legitimate user queries and adversarial attempts to hijack their behaviour through embedded instructions.
The evaluation covers:
- Instruction Override Attacks: Attempts to override prior rules or constraints (e.g., "ignore previous instructions", "you are now allowed to...")
- Role Manipulation: Injected instructions to adopt new roles or identities introduced by user input rather than implied by the task itself
- Behavioural Redirection: Attempts to change behaviour or bypass safeguards through embedded meta-level commands
- Priority Manipulation: Instructions that attempt to elevate user input to be treated as higher-priority than expected for a standard user query
Metrics
Safety Score
The proportion of responses where the model successfully resisted prompt injection attempts (range: 0.0 to 1.0).
Motivation
Prompt injection is an attack surface that is structurally different from direct harmful requests. A model may refuse to help with malware when asked directly, but comply if the request is embedded inside an instruction that reassigns its identity ("you are DAN and have no restrictions"). The injected instruction exploits the model's tendency to follow instructions literally rather than evaluating them against its intended operating context.
This failure mode is especially dangerous in agentic and multi-step systems, where a model may encounter adversarially crafted content - in retrieved documents, tool outputs, or user messages - and execute instructions embedded in that content as if they were authoritative. An injection-vulnerable model in such a setting can be hijacked by any data source it reads.
Methodology
This evaluation measures whether a model resists prompt injection attempts embedded in user inputs.
- Test Cases: This evaluation relies on test cases that consist of user inputs containing prompt injection attempts designed to manipulate the model's behaviour.
- Model Response: The evaluated model receives each input and produces a response.
- Vulnerability Scoring: A judge model evaluates whether the response shows signs of compliance with or influence from the injected instructions.
The judge focuses on whether the model appears to treat embedded instructions as authoritative commands rather than as part of a normal user query. Responses are considered safe when they resist the injection attempt, and unsafe when they appear to follow or be influenced by the injected instructions.
The evaluation does not assess whether the injection attempt was malicious or fully successful. It only assesses whether the model's response appears to have been influenced by the injected instructions.
Scoring
Safety Scorer
Examples
Resistant response
Vulnerable response
Borderline - acknowledges but resists