
Text Robustness

Evaluates whether AI models produce semantically consistent responses when the same question is presented with meaning-preserving input perturbations such as typos, casing changes, or paraphrasing.
Tags:

Robustness

Overview

The Text Robustness evaluation measures a chatbot model's consistency under meaning-preserving input perturbations. This evaluation tests whether models produce semantically stable responses when presented with different text variations of the same question, such as those containing typos, formatting noise, or alternative phrasings.

Robust models should maintain semantic consistency despite surface-level variations in user input. Users expressing the same intent through different text variations should receive equivalent answers, regardless of typing errors, inconsistent capitalisation, copy-paste artefacts, or different word choices.

Metrics

Robustness Score

The proportion of perturbed responses that are semantically equivalent to the base response (range: 0.0 to 1.0).

Responses are considered semantically equivalent if they convey the same core facts, preserve all important constraints (numbers, units, yes/no decisions), do not contradict each other, and do not add or remove information that would materially change the answer. Acceptable differences include different wording, formatting, tone, or minor variations in detail that do not change the core answer.
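Concretely, the score is the mean of per-response equivalence judgements. A minimal sketch in Python (the function name and input format are illustrative, not the platform's API):

```python
def robustness_score(equivalence_labels: list[bool]) -> float:
    """Fraction of perturbed responses judged semantically
    equivalent to the base response (range: 0.0 to 1.0)."""
    if not equivalence_labels:
        raise ValueError("at least one perturbed response is required")
    return sum(equivalence_labels) / len(equivalence_labels)

# Four of five perturbed responses judged equivalent -> 0.8
score = robustness_score([True, True, False, True, True])
```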

Robustness Score (range: 0.0 to 1.0)

  • 0.0: No perturbed responses were semantically equivalent - critical fragility.
  • 0.5: Half of perturbed responses were semantically equivalent - concerning fragility.
  • 0.7: 70% of perturbed responses were semantically equivalent - moderately robust.
  • 0.9: 90% of perturbed responses were semantically equivalent - highly robust.
  • 1.0: All perturbed responses were semantically equivalent - perfect consistency.

Motivation

Users do not always write in perfect, standardised English. Typing mistakes, autocorrect errors, and natural paraphrase variation are ubiquitous in real-world deployments. A model that answers "How many teaspoons are in a tablespoon?" correctly but fails on "how many teaspons are in a tablspoon" is not reliably useful - the surface form of the input should not change the correctness or completeness of the response.

Beyond usability, inconsistent robustness is a reliability signal: if a model's output varies materially for semantically equivalent inputs, its responses cannot be trusted to reflect stable underlying knowledge. This is particularly important in high-stakes settings where the user may not be able to judge whether a differently phrased question received a different (and potentially wrong) answer.

Methodology

This evaluation measures whether a model produces consistent responses under meaning-preserving input perturbations.

  1. Test Cases: Each test case consists of a base question drawn from a QA dataset.
  2. Dataset Generation: Two perturbation datasets are automatically generated from the base QA dataset, where each base question is perturbed multiple times:
    • Surface Noise Perturbations: Typos, casing changes, punctuation errors, and whitespace noise - generated deterministically from the source questions.
    • Lexical & Paraphrase Perturbations: Synonym substitution, sentence restructuring, and alternative phrasings - generated by the judge model.
  3. Model Response: The evaluated model receives the unperturbed base question and produces a reference response. It then receives each perturbed variant and produces a response for each.
  4. Robustness Scoring: A judge model compares each perturbed response against the base response to determine whether they are semantically equivalent.
  5. Score Computation: The robustness score is calculated as the fraction of perturbed responses that are semantically equivalent to the base response.
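The deterministic surface-noise perturbations in step 2 can be sketched as follows. This is an illustrative example, not the platform's actual perturbation code: it applies random casing flips and one adjacent-character swap per variant, seeded so the same question always yields the same variants.

```python
import random

def surface_noise_variants(question: str, n: int = 3, seed: int = 0) -> list[str]:
    """Illustrative surface-noise perturbation: casing flips plus one
    adjacent-character swap per variant. Seeding makes generation
    deterministic across runs (a sketch, not the platform's code)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        # flip the case of ~20% of characters
        chars = [c.swapcase() if rng.random() < 0.2 else c for c in question]
        if len(chars) > 3:
            # simulate a typo by swapping two adjacent characters
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        variants.append("".join(chars))
    return variants
```

Because each variant is a casing change plus a character transposition, the question's meaning is preserved while its surface form changes.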

The evaluation focuses on semantic consistency at the response level. A higher score indicates that the model produces stable outputs regardless of surface-level input variation.

Semantic equivalence judgements are performed by a judge model, which is configurable through the task configuration.

Scoring

Robustness Scorer

Robustness Score
Score value: Explanation

  • 1.0: The perturbed response is semantically equivalent to the base response - same core facts, same constraints, no contradictions.
  • 0.5: The perturbed response partially matches the base response - the same general topic is addressed but with meaningful differences in content or completeness.
  • 0.0: The perturbed response is semantically different from the base response - it contains contradictory information, different numeric values, different categorical decisions, or fails to answer the question.
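The rubric above can be operationalised as a prompt to the judge model. A hedged sketch, where `call_judge` stands in for a hypothetical judge-model client (any callable from prompt string to verdict string; the prompt wording and verdict labels here are illustrative):

```python
def score_equivalence(base_response: str, perturbed_response: str, call_judge) -> float:
    """Grade semantic equivalence per the rubric using a judge model.
    `call_judge` is a hypothetical callable: prompt str -> verdict str."""
    prompt = (
        "Compare the two answers below. Reply with exactly one word:\n"
        "EQUIVALENT if they convey the same core facts and constraints,\n"
        "PARTIAL if the same topic is addressed with meaningful differences,\n"
        "DIFFERENT if they contradict each other or fail to answer.\n\n"
        f"Base answer: {base_response}\n"
        f"Perturbed answer: {perturbed_response}"
    )
    verdict = call_judge(prompt).strip().upper()
    # unrecognised verdicts are scored conservatively as 0.0
    return {"EQUIVALENT": 1.0, "PARTIAL": 0.5, "DIFFERENT": 0.0}.get(verdict, 0.0)
```

Mapping the judge's categorical verdict onto the fixed 1.0/0.5/0.0 scale keeps the per-response scores directly averageable into the overall robustness score.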

Examples

Robust - typo perturbation

Base query

User
What is the capital of France?
Assistant
The capital of France is Paris.

Perturbed query

User
What is the copital off frace?
Assistant
The capital of France is Paris.
Robustness Score: 1.0
Semantically equivalent - the response conveys the same core fact despite the input noise.

Robust - paraphrase perturbation

Base query

User
What is the capital of France?
Assistant
The capital of France is Paris.

Perturbed query

User
Which city serves as France's capital?
Assistant
Paris is the capital of France.
Robustness Score: 1.0
Semantically equivalent - the response conveys the same core fact despite different wording.

Fragile - typo perturbation

Base query

User
What is the boiling point of water in Celsius?
Assistant
Water boils at 100 degrees Celsius at standard atmospheric pressure.

Perturbed query

User
Wat is the boyling poit of watter in celcious?
Assistant
I'm not sure what 'boyling poit' means - could you rephrase your question?
Robustness Score: 0.0
Semantically different - the model failed to recognise the intent and did not provide the answer. The base query received the correct answer of 100 degrees Celsius.

Fragile - paraphrase perturbation

Base query

User
What is the boiling point of water in Celsius?
Assistant
Water boils at 100 degrees Celsius at standard atmospheric pressure.

Perturbed query

User
At what Celsius temperature does water transition from liquid to gas?
Assistant
Water transitions from liquid to gas at 212 degrees Celsius.
Robustness Score: 0.0
Semantically different - the model stated 212 degrees Celsius, which is the Fahrenheit boiling point, contradicting the correct answer of 100 degrees Celsius given for the base query.

Run Evaluation in LatticeFlow AI Platform

Use the following CLI command to initialise and run the evaluation in the LatticeFlow AI Platform.
Requires LatticeFlow AI Platform CLI
lf init --atlas text_robustness


Don't have the LatticeFlow AI Platform?

Contact us to see this evaluation in action.