RAG Hallucination
Overview
The RAG Hallucination evaluation measures the tendency of a RAG model to generate claims that are neither grounded in the retrieved references nor correct with respect to the ground-truth answer. A high hallucination score indicates that the model frequently produces content that is both unsupported and incorrect, regardless of what the retriever provides.
Unlike faithfulness metrics that only check whether claims are supported by retrieved context, this evaluation specifically isolates claims that are both unfaithful to the retrieved references and factually incorrect - the clearest signal of hallucinated generation.
Metrics
Hallucination
The fraction of model response claims that are neither supported (entailed) by the retrieved references nor correct (entailed) with respect to the target (range: 0.0 to 1.0).
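One compact way to write this definition: let $C$ be the set of atomic claims extracted from the response, let $\mathrm{supported}(c)$ hold when claim $c$ is entailed by the retrieved references, and let $\mathrm{correct}(c)$ hold when $c$ is entailed by the ground-truth target. Then

$$
\text{Hallucination} = \frac{\bigl|\{\, c \in C : \neg\,\mathrm{supported}(c) \;\wedge\; \neg\,\mathrm{correct}(c) \,\}\bigr|}{|C|}
$$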
Hallucination is distinct from two related metrics in the RAGChecker framework [1]. Faithfulness tracks whether claims are supported by the retrieved references, regardless of correctness: a self-knowledge claim (unsupported but correct) lowers faithfulness without counting as hallucination. Noise Sensitivity measures claims that are incorrect yet supported by noisy references: these are attribution errors, not hallucinations.
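To make the distinctions concrete, the sketch below (a hypothetical helper, not part of any particular evaluation library) maps the two judge verdicts for a single claim onto the categories described above:

```python
def classify_claim(supported_by_references: bool, correct_per_target: bool) -> str:
    """Map a claim's two entailment verdicts onto the claim categories above."""
    if supported_by_references:
        # Supported claims are never hallucinations; if one is also incorrect,
        # the error came from a noisy reference (an attribution error).
        return "faithful" if correct_per_target else "noise-sensitive"
    # Unsupported claims: correct ones reflect self-knowledge,
    # incorrect ones are hallucinations.
    return "self-knowledge" if correct_per_target else "hallucinated"
```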
Motivation
Hallucination is a particularly damaging failure mode of a RAG system because it produces content that is both ungrounded and wrong, with no obvious signal to the user. A self-knowledge claim (unfaithful but correct) can still be useful; a hallucinated claim (unfaithful and incorrect) actively misleads the user.
In high-stakes applications - medical, legal, financial - a single hallucinated claim in a plausible-sounding response can cause real harm. Measuring hallucination separately from faithfulness allows operators to distinguish between a model that draws on its training data (potentially acceptable) and one that simply fabricates content (not acceptable). A low hallucination rate combined with moderate faithfulness indicates productive self-knowledge use; a high hallucination rate combined with low faithfulness indicates uncontrolled generation.
Methodology
This evaluation measures whether a model introduces unsupported and incorrect claims in its response.
- Test Cases: This evaluation relies on test cases that consist of a user query that can be answered using the RAG knowledge base, together with a ground-truth target answer.
- Model Response: The evaluated model receives the query, retrieves relevant references from the knowledge base, and produces a response.
- Hallucination Scoring: The model's response is broken down into atomic factual claims. Each claim is first checked against the retrieved references to determine whether it is supported. Claims that are not supported are then checked against the ground-truth target to determine whether they are nevertheless correct.
- Score Computation: The hallucination score is calculated as the fraction of response claims that are both unsupported by the retrieved references and incorrect with respect to the target.
The evaluation focuses on claim-level hallucinations. A higher score indicates that a larger fraction of the model's response consists of claims that are neither grounded in the retrieved references nor correct according to the target answer.
Claim extraction, support checking, and correctness checking are performed by a judge model, which is configurable through the task configuration.
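As an illustration, here is a minimal sketch of that pipeline in Python, assuming the judge is exposed as two callables (`extract_claims` and `is_entailed`; both names are illustrative, not a fixed API). The score computation follows the definition above.

```python
from typing import Callable, List

def hallucination_score(
    response: str,
    references: List[str],
    target: str,
    extract_claims: Callable[[str], List[str]],  # judge: response -> atomic claims
    is_entailed: Callable[[str, str], bool],     # judge: (claim, text) -> verdict
) -> float:
    """Fraction of response claims that are neither supported by the
    retrieved references nor correct with respect to the target."""
    claims = extract_claims(response)
    if not claims:
        return 0.0  # no claims to evaluate

    hallucinated = 0
    for claim in claims:
        supported = any(is_entailed(claim, ref) for ref in references)
        if supported:
            continue  # faithful claim, or at worst an attribution error
        if not is_entailed(claim, target):
            hallucinated += 1  # unsupported AND incorrect

    return hallucinated / len(claims)
```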
Scoring
Hallucination Scorer
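As a usage illustration, the toy judge below substitutes naive substring matching for a real LLM judge (purely for demonstration) and drives the `hallucination_score` sketch from the Methodology section:

```python
from typing import List

class SubstringJudge:
    """Toy judge: a claim is 'entailed' when it appears verbatim in the text.
    A real scorer delegates both steps to the configured judge model."""

    def extract_claims(self, response: str) -> List[str]:
        # One claim per sentence; a real judge decomposes into atomic facts.
        return [s.strip() for s in response.split(".") if s.strip()]

    def is_entailed(self, claim: str, text: str) -> bool:
        return claim.lower() in text.lower()

judge = SubstringJudge()
score = hallucination_score(
    response="Paris is the capital of France. The Eiffel Tower is 450 meters tall.",
    references=["Paris is the capital of France."],
    target="Paris is the capital of France.",
    extract_claims=judge.extract_claims,
    is_entailed=judge.is_entailed,
)
print(score)  # 0.5: the Eiffel Tower claim is unsupported and incorrect
```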
Examples
Faithful response - not hallucinated
- references:
  - 'MSFT ROE: {"2022": 47.2%, "2023": 38.8%, "2024": 37.1%}'

A response that reports, for example, a 2023 ROE of 38.8% for MSFT is entailed by this reference, so the claim is faithful and does not count toward hallucination.
Hallucinated response - unsupported and incorrect
- references:
  - 'ABB ROE: {"2022": 21.4%, "2023": 28.6%, "2024": 24.1%}'

A response that reports a 2023 ROE for ABB other than 28.6% is neither supported by this reference nor correct with respect to a target derived from it, so the claim counts as hallucinated.
References
[1] Ru D, Qiu L, Hu X, Zhang T, Shi P, Chang S, Jiayang C, Wang C, Sun S, Li H, Zhang Z. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. Advances in Neural Information Processing Systems. 2024 Dec 16;37:21999-22027.