Text Robustness
Overview
The Text Robustness evaluation measures a chatbot model's consistency under meaning-preserving input perturbations. This evaluation tests whether models produce semantically stable responses when presented with different text variations of the same question, such as those containing typos, formatting noise, or alternative phrasings.
Robust models should maintain semantic consistency despite surface-level variations in user input. Users expressing the same intent through different text variations should receive equivalent answers, regardless of typing errors, inconsistent capitalisation, copy-paste artefacts, or different word choices.
Metrics
Robustness Score
The proportion of perturbed responses that are semantically equivalent to the base response (range: 0.0 to 1.0).
Responses are considered semantically equivalent if they convey the same core facts, preserve all important constraints (numbers, units, yes/no decisions), do not contradict each other, and do not add or remove information that would materially change the answer. Acceptable differences include different wording, formatting, tone, or minor variations in detail that do not change the core answer.
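As a minimal sketch, the score reduces to a fraction over per-response equivalence verdicts. The function name and input shape here are illustrative, not part of the actual implementation:

```python
def robustness_score(equivalence_flags):
    """Fraction of perturbed responses judged semantically equivalent
    to the base response (range: 0.0 to 1.0).

    equivalence_flags: list of booleans, one per perturbed response,
    True if the judge deemed it equivalent to the base response.
    """
    if not equivalence_flags:
        raise ValueError("no perturbed responses to score")
    return sum(equivalence_flags) / len(equivalence_flags)

# Three of four perturbed responses judged equivalent:
print(robustness_score([True, True, False, True]))  # → 0.75
```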
Motivation
Users do not always write in perfect, standardised English. Typing mistakes, autocorrect errors, and natural paraphrase variation are ubiquitous in real-world deployments. A model that answers "How many teaspoons are in a tablespoon?" correctly but fails on "how many teaspons are in a tablspoon" is not reliably useful - the surface form of the input should not change the correctness or completeness of the response.
Beyond usability, inconsistent robustness is a reliability signal: if a model's output varies materially for semantically equivalent inputs, its responses cannot be trusted to reflect stable underlying knowledge. This is particularly important in high-stakes settings where the user may not be able to judge whether a differently phrased question received a different (and potentially wrong) answer.
Methodology
This evaluation measures whether a model produces consistent responses under meaning-preserving input perturbations.
- Test Cases: Each test case consists of a base question drawn from a QA dataset.
- Dataset Generation: Two perturbation datasets are automatically generated from the base QA dataset, where each base question is perturbed multiple times:
- Surface Noise Perturbations: Typos, casing changes, punctuation errors, and whitespace noise - generated deterministically from the source questions.
- Lexical & Paraphrase Perturbations: Synonym substitution, sentence restructuring, and alternative phrasings - generated by the judge model.
- Model Response: The evaluated model receives the unperturbed base question and produces a reference response. It then receives each perturbed variant and produces a response for each.
- Robustness Scoring: A judge model compares each perturbed response against the base response to determine whether they are semantically equivalent.
- Score Computation: The robustness score is calculated as the fraction of perturbed responses that are semantically equivalent to the base response.
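The surface-noise step above can be sketched as a seeded, deterministic perturbation generator. This is a hedged illustration of the idea (typos via adjacent-character swaps, casing changes, whitespace noise); the function name, parameters, and exact noise operations are assumptions, not the pipeline's actual code:

```python
import random

def surface_noise(question, seed=0, n_variants=3):
    """Generate deterministic surface-noise perturbations of a question.

    Illustrative only: applies an adjacent-character swap (typo),
    a random casing change on one word, and occasional extra
    whitespace. Seeding makes the output reproducible.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        chars = list(question)
        # Typo: swap two adjacent characters.
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        text = "".join(chars)
        # Casing noise: upper- or lower-case one word.
        words = text.split(" ")
        j = rng.randrange(len(words))
        words[j] = words[j].upper() if rng.random() < 0.5 else words[j].lower()
        # Whitespace noise: occasionally double the spaces.
        sep = "  " if rng.random() < 0.3 else " "
        variants.append(sep.join(words))
    return variants
```

Because the generator is seeded, the same base question always yields the same perturbed variants, which keeps evaluation runs reproducible.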
The evaluation focuses on semantic consistency at the response level. A higher score indicates that the model produces stable outputs regardless of surface-level input variation.
Semantic equivalence judgements are performed by a judge model, which is configurable through the task configuration.
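A minimal sketch of the judge comparison, assuming the equivalence rubric described under Metrics. The prompt wording and the `ask_judge` callable are hypothetical stand-ins for the configured judge model, not the evaluation's actual interface:

```python
JUDGE_PROMPT = """Compare the two answers below to the same question.
They are semantically equivalent if they convey the same core facts,
preserve all important constraints (numbers, units, yes/no decisions),
do not contradict each other, and do not add or remove information
that would materially change the answer. Differences in wording,
formatting, tone, or minor detail are acceptable.

Question: {question}
Answer A (base): {base}
Answer B (perturbed): {perturbed}

Reply with exactly EQUIVALENT or NOT_EQUIVALENT."""

def judge_equivalent(ask_judge, question, base, perturbed):
    """ask_judge: any callable that sends a prompt to the configured
    judge model and returns its text reply (hypothetical interface)."""
    verdict = ask_judge(JUDGE_PROMPT.format(
        question=question, base=base, perturbed=perturbed))
    # NOT_EQUIVALENT does not start with EQUIVALENT, so this is safe.
    return verdict.strip().upper().startswith("EQUIVALENT")
```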
Scoring
Robustness Scorer
Examples
Robust - typo perturbation
Base query
Perturbed query
Robust - paraphrase perturbation
Base query
Perturbed query
Fragile - typo perturbation
Base query
Perturbed query
Fragile - paraphrase perturbation
Base query
Perturbed query