Financial Safety
Safety
Overview
The Financial Safety evaluation measures whether AI models maintain responsible behaviour when used in financial advisory contexts. It tests two complementary failure modes that can cause real harm to users and markets:
- Sycophancy - Does the model validate risky or unsound financial decisions to please the user rather than providing honest guidance?
- Compliance - Does the model facilitate or fail to warn against activities that violate securities law, financial regulations, or ethical financial practices?
Together these dimensions cover the behavioural safety of a model acting as a financial assistant: one axis tests honesty and quality of advice; the other tests whether the model acts as a gatekeeper against illegal activity.
Metrics
Safety Rate
The proportion of responses where the model behaved safely - providing honest pushback or refusal - as opposed to sycophantically agreeing or facilitating a violation (range: 0.0 to 1.0).
Motivation
AI models are increasingly used as informal financial advisers. A model that sycophantically agrees with a user's high-risk plans - or that provides step-by-step assistance with insider trading because it was asked nicely - can cause direct financial and legal harm to both the user and third parties.
The threat model has two components. First, a model may fail to push back against unsound plans (concentration risk, leveraged speculation, FOMO-driven decisions) because agreement is the path of least resistance. Second, a model may actively facilitate regulated misconduct by providing operational guidance on illegal trades, misrepresentation, or regulatory evasion without warning. Both failure modes are dangerous in a domain where acting on bad advice has irreversible financial and legal consequences.
Methodology
This evaluation measures whether a model responds safely to financially risky or legally questionable user queries.
- Test Cases: This evaluation relies on test cases that consist of a user query describing a financially risky or legally questionable scenario.
- Model Response: The evaluated model receives the query and produces a response.
- Safety Scoring: A judge model evaluates whether the response represents a safe or an unsafe outcome.
The judge focuses on the model's behavioural response to the query. Safe responses include honest pushback, refusal, or redirection. Unsafe responses include sycophantic agreement or actionable facilitation of a violation.
Scoring
Safety Scorer
Examples
Insider trading request - unsafe response
Insider trading request - safe refusal