
RAG Recall

Measures the fraction of ground-truth answer claims that are correctly reproduced by the RAG model's response.
Tags: Performance

Overview

The RAG Recall evaluation measures how completely a RAG model's response covers the ground-truth answer. A low recall score indicates the model omits important information present in the target answer, resulting in incomplete responses from the user's perspective.

Unlike precision-oriented metrics that assess whether generated claims are correct, recall focuses on coverage - specifically, how many of the expected ground-truth claims the model actually reproduces.

Metrics

Recall

The fraction of ground-truth answer claims that are correctly answered by the model response (range: 0.0 to 1.0).
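Under the scoring scheme described below, recall can be computed as the mean of the per-claim scores - a minimal formulation, assuming all target claims are weighted equally and partial matches contribute their fractional score:

recall = (s_1 + s_2 + ... + s_n) / n, where s_i in {0.0, 0.5, 1.0} is the score of the i-th target claim and n is the number of target claims.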

Score   Explanation
0.0     No target claims are reproduced - critical failure.
0.5     Half of the target claims are reproduced - significant omissions.
0.7     70% of target claims are reproduced - moderately complete.
0.9     90% of target claims are reproduced - nearly complete.
1.0     All target claims are reproduced - complete response.

Motivation

A model that only answers part of a question may appear correct while leaving the user with an incomplete and potentially misleading picture. In RAG applications where users ask multi-part questions or expect comprehensive summaries, systematic under-coverage is a quality failure distinct from hallucination or unfaithfulness.

Recall is particularly important when the retrieval pipeline is evaluated end-to-end: low recall can indicate that the retriever failed to surface the relevant documents, that the model ignored the retrieved context, or that the model's summarisation strategy is too selective. Measuring recall separately from precision and hallucination makes it possible to attribute the failure to the right component.

Methodology

This evaluation measures how completely a model's response covers the ground-truth answer.

  1. Test Cases: Each test case consists of a user query that can be answered from the RAG knowledge base, together with a ground-truth target answer.
  2. Model Response: The evaluated model receives the query, retrieves relevant references from the knowledge base, and produces a response.
  3. Recall Scoring: The ground-truth target answer is broken down into atomic factual claims, and each target claim is checked against the model response to determine whether it is correctly answered.
  4. Score Computation: The recall score is the fraction of target claims that are correctly answered by the model response (see the sketch after this list).
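A minimal sketch of steps 3-4 in Python, assuming the judge has already assigned each target claim a verdict of 1.0 (entailed), 0.5 (partially entailed), or 0.0 (not entailed). The test-case structure and function names are illustrative, not the platform's actual API:

from dataclasses import dataclass

@dataclass
class ClaimVerdict:
    claim: str    # an atomic factual claim extracted from the target answer
    score: float  # judge verdict: 1.0 (entailed), 0.5 (partial), 0.0 (missing)

def recall_score(verdicts: list[ClaimVerdict]) -> float:
    """Mean per-claim score over all target claims (0.0 if there are none)."""
    if not verdicts:
        return 0.0
    return sum(v.score for v in verdicts) / len(verdicts)

# Worked example matching the low-recall example further below:
verdicts = [
    ClaimVerdict("Microsoft's 2023 ROE was 38.8%", 1.0),  # reproduced
    ClaimVerdict("ABB's 2023 ROE was 28.6%", 0.0),        # wrong value given
    ClaimVerdict("Microsoft has the higher ROE", 0.0),    # omitted
]
print(recall_score(verdicts))  # 0.333...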

The evaluation focuses on coverage at the claim level. A higher score indicates that a larger fraction of the expected answer is present in the model's response.

Claim extraction and entailment checks are performed by a judge model, which is configurable through the task configuration.
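To illustrate the entailment check, a judge model can be prompted once per target claim along the following lines. This is a hypothetical prompt sketch, not the platform's actual judge prompt:

def build_entailment_prompt(target_claim: str, response: str) -> str:
    # Hypothetical prompt; the actual judge model and prompt are set in
    # the task configuration.
    return (
        "Decide whether the target claim is fully entailed (1.0), partially "
        "entailed (0.5), or not entailed (0.0) by the model response.\n"
        f"Target claim: {target_claim}\n"
        f"Model response: {response}\n"
        "Answer with exactly one of: 1.0, 0.5, 0.0"
    )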

Scoring

Recall Scorer

Score value   Explanation
1.0           The target claim is reproduced (entailed) by the model response - the model covered this piece of the expected answer.
0.5           The target claim is partially reproduced - the model's response touches on this claim but does not fully entail it.
0.0           The target claim is not reproduced by the model response - the model omitted this piece of the expected answer entirely.

Examples

High recall - all target claims reproduced

User
What is the return on equity of Microsoft and ABB in 2023, and which is higher?
Assistant
In 2023, Microsoft's ROE was 38.8% and ABB's ROE was 28.6%, so Microsoft has the higher return on equity.
Recall: 1.0 - All three target claims are reproduced: Microsoft's ROE (38.8%), ABB's ROE (28.6%), and the comparative claim that Microsoft has the higher ROE.

Low recall - target claims missing or incorrect

User
What is the return on equity of Microsoft and ABB in 2023, and which is higher?
Assistant
In 2023, Microsoft's ROE was 38.8% and ABB's ROE was 21.4%.
Recall: 0.33 - Only one of three target claims is correctly reproduced (Microsoft's ROE), giving (1.0 + 0.0 + 0.0) / 3 ≈ 0.33. The ABB ROE claim is incorrect (21.4% vs 28.6%), and the comparative claim about Microsoft having the higher ROE is absent.

References

[1] Ru D, Qiu L, Hu X, Zhang T, Shi P, Chang S, Jiayang C, Wang C, Sun S, Li H, Zhang Z. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. Advances in Neural Information Processing Systems. 2024;37:21999-22027.

Run Evaluation in LatticeFlow AI Platform

Use the following CLI command to initialize and run the evaluation in LatticeFlow AI Platform.
Requires LatticeFlow AI Platform CLI
lf init --atlas rag_recall


Don't have the LatticeFlow AI Platform?

Contact us to see this evaluation in action.