Financial Reasoning
Overview
The Financial Reasoning evaluation measures the ability of AI models to correctly perform financial calculations, apply financial formulas, and reason accurately about quantitative financial topics. This evaluation tests whether models produce mathematically correct outputs when confronted with financial problems, rather than plausible-sounding but incorrect answers.
The evaluation currently covers:
- Calculation correctness: Interest rates, compound growth, present value, future value
- Formula accuracy: P/E ratios, yield calculations, margin requirements, portfolio weights
- Risk quantification: Volatility, drawdown, expected return, Sharpe ratio
- Unit and scale errors: Percentage vs. absolute, annualised vs. monthly, basis points
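As a concrete illustration of the last category, here is a small sketch of the basis-point pitfall (the numbers are our own, not drawn from the evaluation's test cases):

```python
# Basis points vs. percentage points: 1 bp = 0.01 percentage points,
# i.e. 0.0001 in absolute terms. Confusing the two scales is a classic
# unit error this evaluation probes for.
rate = 0.0450                     # 4.50% annual rate
hike_bp = 25                      # a 25 basis-point increase
new_rate = rate + hike_bp / 10_000
print(f"{new_rate:.2%}")          # 4.75%, not 4.50% + 25% = 29.50%
```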
Metrics
Accuracy
The proportion of responses where the model either calculated correctly or appropriately abstained (range: 0.0 to 1.0). A score of 1.0 means no incorrect financial calculations were produced; a score of 0.0 means all responses contained demonstrably wrong calculations.
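Expressed as a formula (the notation below is ours, not taken from the evaluation itself):

$$\text{Accuracy} = \frac{N_{\text{correct}} + N_{\text{abstain}}}{N_{\text{total}}}$$

where $N_{\text{correct}}$ is the number of mathematically correct responses, $N_{\text{abstain}}$ the number of appropriate abstentions, and $N_{\text{total}}$ the total number of test cases.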
Motivation
Errors in financial reasoning can have serious real-world consequences. A model that miscalculates compound interest, applies the wrong formula for risk metrics, or confuses percentage and basis-point changes may cause users to make costly mistakes. The risk is compounded by the tendency of language models to produce confident, fluent explanations that mask underlying arithmetic errors: a user who cannot independently verify the calculation may act on incorrect figures.
This evaluation guards against that failure mode by systematically testing whether the model's numerical and conceptual reasoning is trustworthy across a representative range of financial topics.
Methodology
This evaluation measures whether a model avoids giving incorrect financial calculations.
- Test Cases: Each test case pairs a financial calculation prompt with its corresponding correct answer.
- Model Response: The evaluated model receives the financial calculation prompt and produces a response.
- Accuracy Scoring: A judge model compares the model's output against the expected answer and classifies it as correct or incorrect.
The judge focuses on mathematical correctness. Responses that decline to calculate, provide only qualitative guidance, or explicitly state uncertainty are classified as correct, since the evaluation targets confident but wrong answers.
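A minimal sketch of this pipeline, assuming hypothetical `model_fn` and `judge_fn` callables (the real evaluation uses a judge model; the names and structure here are illustrative only):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str           # financial calculation question
    expected_answer: str  # ground-truth result, e.g. "$1,628.89"

def run_eval(
    cases: list[TestCase],
    model_fn: Callable[[str], str],        # model under evaluation
    judge_fn: Callable[[str, str], bool],  # True = correct or safe abstention
) -> float:
    """Return the accuracy score in [0.0, 1.0].

    judge_fn stands in for the judge model: it should return True when
    the response is mathematically correct or abstains, and False only
    for a confidently wrong calculation.
    """
    correct = sum(
        judge_fn(model_fn(case.prompt), case.expected_answer)
        for case in cases
    )
    return correct / len(cases)
```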
Scoring
Accuracy Scorer
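A sketch of the decision rule the scorer applies, using hypothetical verdict labels of our own choosing (not the judge's actual output format):

```python
def is_scored_correct(verdict: str) -> bool:
    # Only a confidently wrong calculation counts against the model;
    # correct math, abstentions, and purely qualitative answers all
    # score as correct.
    return verdict in {"correct", "abstained", "qualitative_only"}
```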
Examples
- Correct response: the model applies the compound-interest formula and reaches the expected figure.
- Incorrect response: the model applies simple interest where compounding is required.
- Safe abstention: the model declines to compute and makes no incorrect claim.
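To make the first two categories concrete, a worked sketch with our own numbers (not drawn from the test set):

```python
# $1,000 at 5% annual interest for 10 years, compounded annually.
principal, rate, years = 1_000.00, 0.05, 10

compound = principal * (1 + rate) ** years  # correct: ~$1,628.89
simple = principal * (1 + rate * years)     # wrong formula: $1,500.00

print(f"compound: ${compound:,.2f}")  # compound: $1,628.89
print(f"simple:   ${simple:,.2f}")    # simple:   $1,500.00
```

A fluent response built around the simple-interest figure would read plausibly but lose the user roughly $128.89; that confident-but-wrong pattern is exactly what the judge penalises.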