RAG Faithfulness
Overview
The RAG Faithfulness evaluation measures how closely a RAG model's generated response stays grounded in the retrieved references. A low faithfulness score indicates the model frequently generates claims that go beyond what the retrieved context supports, relying instead on parametric knowledge or outright hallucination.
Unlike the self-knowledge and hallucination metrics, which additionally check factual correctness, faithfulness focuses purely on the relationship between the model response and the retrieved references - a faithful response is one that only makes claims the retrieved context can support.
Metrics
Faithfulness
The fraction of model response claims that are supported (entailed) by the retrieved references (range: 0.0 to 1.0).
Faithfulness is distinct from related metrics: Hallucination measures the subset of unfaithful claims that are also factually incorrect; Self-Knowledge measures unfaithful claims that are nonetheless correct. Together, hallucination and self-knowledge partition the unfaithful claims.
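The relationship between the three metrics can be sketched in a few lines of Python. This is an illustrative computation, not the evaluation's actual implementation: it assumes each response claim has already been labeled by a judge for support (entailment by the references) and factual correctness.

```python
def rag_metrics(claims):
    """claims: list of dicts with boolean 'supported' and 'correct' keys,
    one per atomic factual claim extracted from the response."""
    total = len(claims)
    if total == 0:
        return {"faithfulness": 0.0, "hallucination": 0.0, "self_knowledge": 0.0}
    supported = sum(c["supported"] for c in claims)
    unfaithful = [c for c in claims if not c["supported"]]
    # Unfaithful claims partition into hallucination (also incorrect)
    # and self-knowledge (unsupported by the references, yet correct).
    hallucinated = sum(not c["correct"] for c in unfaithful)
    self_known = sum(c["correct"] for c in unfaithful)
    return {
        "faithfulness": supported / total,
        "hallucination": hallucinated / total,
        "self_knowledge": self_known / total,
    }
```

Note that the three fractions always sum to 1.0, reflecting the partition described above.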
Motivation
In a RAG system the retrieval step exists precisely to ground the model's output in verified context. When a model ignores the retrieved references and instead draws on parametric knowledge, the retrieved context provides no safety guarantee - the model may produce plausible but outdated, incorrect, or entirely fabricated claims, and there is no way to trace those claims back to a source.
Low faithfulness is therefore a signal that the RAG pipeline is not functioning as intended: the model is not using the context it was given. This can arise from the model over-trusting its training data, from retrieved references being irrelevant, or from prompt design that fails to direct the model to use the provided context. Measuring faithfulness separately from accuracy allows these causes to be distinguished.
Methodology
This evaluation measures whether a model's response is supported by the retrieved references.
- Test Cases: This evaluation relies on test cases that consist of a user query that can be answered using the RAG knowledge base.
- Model Response: The evaluated model receives the query, retrieves relevant references from the knowledge base, and produces a response.
- Faithfulness Scoring: The model's response is broken down into atomic factual claims. Each claim is then checked against the retrieved references to determine whether it is supported.
- Score Computation: The faithfulness score is calculated as the fraction of response claims that are supported by at least one retrieved reference.
The evaluation focuses on grounding at the claim level. A response receives a higher score when more of its factual claims are supported by the retrieved references.
Claim extraction and support checking are performed by a judge model, which is configurable through the task configuration.
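The scoring loop above can be sketched as follows. Both helper functions are hypothetical stand-ins for judge-model calls: in the actual evaluation, claim extraction and entailment checking are each performed by prompting the configured judge model, not by the naive heuristics shown here.

```python
def extract_claims(response: str) -> list[str]:
    # Placeholder: the real pipeline asks the judge model to decompose
    # the response into atomic factual claims. Sentence splitting is a
    # crude approximation used only to keep this sketch runnable.
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim: str, references: list[str]) -> bool:
    # Placeholder: the real pipeline asks the judge model whether any
    # retrieved reference entails the claim. Token overlap is a crude
    # stand-in for that entailment check.
    claim_tokens = set(claim.lower().split())
    return any(
        len(claim_tokens & set(ref.lower().split())) >= 2
        for ref in references
    )

def faithfulness_score(response: str, references: list[str]) -> float:
    # Fraction of response claims supported by at least one reference.
    claims = extract_claims(response)
    if not claims:
        return 0.0
    supported = sum(is_supported(c, references) for c in claims)
    return supported / len(claims)
```

Swapping the two placeholders for judge-model prompts yields the claim-level scorer described above; the score computation itself is unchanged.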
Scoring
Faithfulness Scorer
Examples
Faithful response - claim grounded in references
- References: 'MSFT ROE: {"2022": 47.2%, "2023": 38.8%, "2024": 37.1%}'
Unfaithful response - claim contradicts references
- References: 'ABB ROE: {"2022": 21.4%, "2023": 28.6%, "2024": 24.1%}'