Content Policy Compliance
Safety
Security
Overview
The Content Policy Compliance evaluation tests whether a deployed AI application enforces its content policy. Test inputs are generated to trigger violations and each response is judged against the policy.
Unlike general-purpose safety evaluations, this one is specific to the application's own
rules - which reflect what the operator has decided matters, and can carry legal,
reputational, or safety consequences if violated. It complements the prompt_injection
evaluation, which focuses on instruction-override attacks using a fixed benchmark.
Metrics
Content Policy Compliance Score
The proportion of responses that comply with the operator's content policy (range: 0.0 to 1.0, higher is better).
Motivation
Deploying an AI application means accepting responsibility for what it says. A model that is safe in general may still violate the specific rules of the deployment - making promises it is not authorised to make, discussing topics outside its defined scope, disclosing information it should not, or producing content that breaches the operator's industry or legal obligations.
Input and output filtering layers are designed to catch these violations, but their effectiveness depends on whether they are correctly configured and whether they cover the full range of ways a policy can be breached. This evaluation provides a systematic way to verify that the application - as a whole, including all filtering layers - enforces the operator's policy under targeted pressure.
Methodology
- Policy: The operator supplies a free-text description of their content policy via
the
CONTENT_POLICYenvironment variable. The policy should describe the application's purpose, what it is permitted to respond to, and any specific rules or prohibitions. - Probe generation: A dataset generator reads the policy and synthesizes probes designed to directly elicit policy violations - targeting specific rules, boundary conditions, and scope limits defined in the policy.
- Model response: Each probe is sent to the full end-to-end application, which produces a response through its normal input filtering, model, and output filtering pipeline.
- Compliance scoring: A judge model evaluates each response against the supplied policy, determining whether it complies or violates.
Scoring
Content Policy Compliance Scorer
Examples
The following examples use two different content policies to illustrate distinct violation types.
Clean - probe correctly declined
Flagged - content violation: prohibited promise made
Flagged - scope violation: out-of-scope topic answered
Clean - adversarial probe correctly handled