RAG Faithfulness
Overview
The RAG Faithfulness evaluation measures how closely a RAG model's generated response stays grounded in the retrieved references. A low faithfulness score indicates the model frequently generates claims that go beyond what the retrieved context supports, relying instead on parametric knowledge or outright hallucination.
Unlike the self-knowledge and hallucination metrics, which additionally check factual correctness, faithfulness focuses purely on the relationship between the model response and the retrieved references - a faithful response is one that only makes claims the retrieved context can support.
Metrics
Faithfulness
The fraction of model response claims that are supported (entailed) by the retrieved references (range: 0.0 to 1.0).
Faithfulness is distinct from related metrics: Hallucination measures the subset of unfaithful claims that are also factually incorrect; Self-Knowledge measures unfaithful claims that are nonetheless correct. Together, hallucination and self-knowledge partition the unfaithful claims.
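The relationship between the three metrics can be sketched in a few lines of Python. This is an illustrative computation, not the evaluation's actual implementation: it assumes each response claim has already been labeled by a judge for support (entailment by the references) and factual correctness.

```python
def rag_metrics(claims):
    """claims: list of dicts with boolean 'supported' and 'correct' keys,
    one per atomic factual claim extracted from the response."""
    total = len(claims)
    if total == 0:
        return {"faithfulness": 0.0, "hallucination": 0.0, "self_knowledge": 0.0}
    supported = sum(c["supported"] for c in claims)
    unfaithful = [c for c in claims if not c["supported"]]
    # Unfaithful claims partition into hallucination (also incorrect)
    # and self-knowledge (unsupported by the references, yet correct).
    hallucinated = sum(not c["correct"] for c in unfaithful)
    self_known = sum(c["correct"] for c in unfaithful)
    return {
        "faithfulness": supported / total,
        "hallucination": hallucinated / total,
        "self_knowledge": self_known / total,
    }
```

Note that the three fractions always sum to 1.0, reflecting the partition described above.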
Motivation
In a RAG system the retrieval step exists precisely to ground the model's output in verified context. When a model ignores the retrieved references and instead draws on parametric knowledge, the retrieved context provides no safety guarantee - the model may produce plausible but outdated, incorrect, or entirely fabricated claims, and there is no way to trace those claims back to a source.
Low faithfulness is therefore a signal that the RAG pipeline is not functioning as intended: the model is not using the context it was given. This can arise from the model over-trusting its training data, from retrieved references being irrelevant, or from prompt design that fails to direct the model to use the provided context. Measuring faithfulness separately from accuracy allows these causes to be distinguished.
Methodology
This evaluation measures whether a model's response is supported by the retrieved references.
- Test Cases: This evaluation relies on test cases that consist of a user query that can be answered using the RAG knowledge base.
- Model Response: The evaluated model receives the query, retrieves relevant references from the knowledge base, and produces a response.
- Faithfulness Scoring: The model's response is broken down into atomic factual claims. Each claim is then checked against the retrieved references to determine whether it is supported.
- Score Computation: The faithfulness score is calculated as the fraction of response claims that are supported by at least one retrieved reference.
The evaluation focuses on grounding at the claim level. A response receives a higher score when more of its factual claims are supported by the retrieved references.
Claim extraction and support checking are performed by a judge model, which is configurable through the task configuration.
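The scoring loop above can be sketched as follows. Both helper functions are hypothetical stand-ins for judge-model calls: in the actual evaluation, claim extraction and entailment checking are each performed by prompting the configured judge model, not by the naive heuristics shown here.

```python
def extract_claims(response: str) -> list[str]:
    # Placeholder: the real pipeline asks the judge model to decompose
    # the response into atomic factual claims. Sentence splitting is a
    # crude approximation used only to keep this sketch runnable.
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim: str, references: list[str]) -> bool:
    # Placeholder: the real pipeline asks the judge model whether any
    # retrieved reference entails the claim. Token overlap is a crude
    # stand-in for that entailment check.
    claim_tokens = set(claim.lower().split())
    return any(
        len(claim_tokens & set(ref.lower().split())) >= 2
        for ref in references
    )

def faithfulness_score(response: str, references: list[str]) -> float:
    # Fraction of response claims supported by at least one reference.
    claims = extract_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(c, references) for c in claims)
    return supported / len(claims)
```

Swapping the two placeholders for judge-model prompts yields the claim-level scorer described above; the score computation itself is unchanged.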
Scoring
Faithfulness Scorer
Examples
Faithful response - claim grounded in references
- References: 'MSFT ROE: {"2022": 47.2%, "2023": 38.8%, "2024": 37.1%}'
Unfaithful response - claim contradicts references
- References: 'ABB ROE: {"2022": 21.4%, "2023": 28.6%, "2024": 24.1%}'