RAG Recall
Performance
Overview
The RAG Recall evaluation measures how completely a RAG model's response covers the ground-truth answer. A low recall score indicates that the model omits important information present in the target answer, resulting in responses that are incomplete from the user's perspective.
Unlike precision-oriented metrics, which assess whether the claims a model generates are correct, recall focuses on coverage: how many of the expected ground-truth claims the model actually reproduces.
Metrics
Recall
The fraction of ground-truth answer claims that are correctly answered by the model response (range: 0.0 to 1.0).
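Concretely, if C denotes the set of atomic claims extracted from the ground-truth answer and C_answered the subset of those claims judged correctly answered by the model response (notation chosen here for exposition, not taken from the framework itself), the score is:

```latex
\mathrm{Recall} = \frac{|C_{\text{answered}}|}{|C|}
```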
Motivation
A model that only answers part of a question may appear correct while leaving the user with an incomplete and potentially misleading picture. In RAG applications where users ask multi-part questions or expect comprehensive summaries, systematic under-coverage is a quality failure distinct from hallucination or unfaithfulness.
Recall is particularly important when the retrieval pipeline is evaluated end-to-end: low recall can indicate that the retriever failed to surface the relevant documents, that the model ignored the retrieved context, or that the model's summarisation strategy is too selective. Measuring recall separately from precision and hallucination makes it possible to attribute the failure to the right component.
Methodology
The evaluation proceeds in four steps:
- Test Cases: This evaluation relies on test cases that consist of a user query that can be answered using the RAG knowledge base, together with a ground-truth target answer.
- Model Response: The evaluated model receives the query, retrieves relevant references from the knowledge base, and produces a response.
- Recall Scoring: Following the claim-level recall formulation of RAGChecker [1], the ground-truth target answer is broken down into atomic factual claims, and each target claim is checked against the model response to determine whether it is correctly answered.
- Score Computation: The recall score is calculated as the fraction of target claims that are correctly answered by the model response.
The evaluation focuses on coverage at the claim level. A higher score indicates that a larger fraction of the expected answer is present in the model's response.
Claim extraction and entailment checks are performed by a judge model, which is configurable through the task configuration.
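A minimal sketch of this pipeline is shown below. The function names and the judge callables are hypothetical stand-ins for the configurable judge model, not the framework's actual API, and the string-matching demo at the end is illustrative only:

```python
from typing import Callable

# Hypothetical judge interfaces. In the actual evaluation both steps are
# performed by the configurable judge model; anything with these shapes
# could slot in.
ExtractClaims = Callable[[str], list[str]]  # answer text -> atomic claims
IsAnswered = Callable[[str, str], bool]     # (claim, response) -> answered?

def recall_score(
    target_answer: str,
    model_response: str,
    extract_claims: ExtractClaims,
    is_answered: IsAnswered,
) -> float:
    """Fraction of ground-truth claims correctly answered by the response."""
    claims = extract_claims(target_answer)
    if not claims:
        return 1.0  # assumption: an empty target answer is trivially covered
    covered = sum(is_answered(claim, model_response) for claim in claims)
    return covered / len(claims)

# Toy demo with a naive string-matching "judge". A real judge model checks
# semantic entailment, not substrings.
def naive_extract(text: str) -> list[str]:
    return [c.strip() for c in text.split(".") if c.strip()]

def naive_answered(claim: str, response: str) -> bool:
    return claim.lower() in response.lower()

target = "Paris is the capital of France. The Seine flows through Paris."
response = "Paris is the capital of France."
print(recall_score(target, response, naive_extract, naive_answered))  # 0.5
```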
Scoring
Recall Scorer
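The scorer emits one recall value per test case, computed as described under Methodology. How per-case scores roll up into a dataset-level number is an assumption here; an unweighted mean is sketched below, though frameworks may instead report medians or claim-weighted averages:

```python
def aggregate_recall(per_case_scores: list[float]) -> float:
    # Dataset-level recall as the unweighted mean over test cases
    # (an assumed convention, not confirmed by this document).
    if not per_case_scores:
        raise ValueError("no test cases to aggregate")
    return sum(per_case_scores) / len(per_case_scores)
```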
Examples
High recall - all target claims reproduced
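A hypothetical illustration (not drawn from a real evaluation run): the query asks for the capital and the currency of France, and the target answer decomposes into two claims ("The capital of France is Paris"; "France's currency is the euro"). A response stating both facts is judged to answer 2 of 2 target claims, for a recall of 1.0.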
Low recall - target claims missing or incorrect
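Continuing the same hypothetical query: a response that names Paris but says nothing about the currency answers 1 of 2 target claims, for a recall of 0.5. A response that names Paris but claims the currency is the franc also scores 0.5, since an incorrectly answered claim does not count toward coverage.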
References
[1] Ru D, Qiu L, Hu X, Zhang T, Shi P, Chang S, Jiayang C, Wang C, Sun S, Li H, Zhang Z. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation. Advances in Neural Information Processing Systems. 2024;37:21999-22027.