data_representativeness

Data Representativeness

Measures the degree to which a reference dataset (training, validation, or test) covers the production distribution, detecting regions of input space where no sufficiently close reference sample exists.
Tags: Data Quality

Overview

The Data Representativeness evaluation measures the degree to which the distribution of key characteristics in a training, validation, or test dataset (the reference dataset) reflects the corresponding distribution in the production data. Each production sample is scored against the reference dataset, and the per-sample scores are aggregated into a single metric.

Metrics

Representativeness

The degree to which the dataset distribution matches the production distribution across key characteristics (range: 0.0 to 1.0).

Representativeness scale (0.0 to 1.0), with the interpretation of reference values:

0.0: The dataset distribution is entirely disjoint from the production distribution - the dataset is not representative of production at all.
0.5: Significant representativeness gaps - large regions of the production distribution are absent or severely underrepresented in the dataset.
0.8: Moderate representativeness gaps - most of the production distribution is covered but some subpopulations are underrepresented. Investigate before training or evaluation.
1.0: The dataset distribution fully reflects the production distribution - no representativeness gaps detected.

Motivation

A model trained, validated, or evaluated on data that does not reflect the production distribution will produce misleading results. Training on unrepresentative data causes the model to underperform on the parts of production it has not seen; evaluating on unrepresentative data produces metrics that do not reflect real-world performance.

Representativeness gaps arise naturally: production data evolves, edge cases emerge, and data collection never perfectly mirrors real-world usage. These gaps are invisible from aggregate dataset statistics and only surface as localised failures in production. This evaluation makes them explicit before deployment, so data collection can be targeted at the regions that matter rather than discovered through production failures.

Methodology

  1. Samples: Each production sample is scored independently against the reference dataset.

  2. Scoring: Each sample is assessed by the Representativeness Scorer, which estimates how well the local neighbourhood of the production sample is represented in the reference dataset - comparing the local reference density to the local production density. A score of 1.0 means the region is proportionately represented; lower scores indicate underrepresentation or absence.

    The representativeness metric is computed as the average per-sample score across the full production set.
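The two steps above can be illustrated with a minimal sketch. It is not the platform's implementation: the function names, the use of Euclidean distance over embedding vectors, and the fixed neighbourhood radius are all assumptions made for illustration. Local density is approximated by counting neighbours inside the same ball around each production sample, so the ball volume cancels in the ratio.

```python
import numpy as np


def representativeness_scores(production, reference, radius=0.15):
    """Per-sample scores in [0, 1] (hypothetical helper, not the platform API).

    production, reference: arrays of shape (n_samples, n_features),
    assumed to be embeddings in a space where Euclidean distance is meaningful.
    """
    production = np.asarray(production, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Pairwise distances: production-to-reference and production-to-production.
    d_ref = np.linalg.norm(production[:, None, :] - reference[None, :, :], axis=-1)
    d_prod = np.linalg.norm(production[:, None, :] - production[None, :, :], axis=-1)

    # Neighbour counts inside the same ball around each production sample.
    n_ref = (d_ref <= radius).sum(axis=1)
    n_prod = (d_prod <= radius).sum(axis=1)  # includes the sample itself, so >= 1

    # Density ratio, clamped to 1.0; it is 0.0 when no reference sample is nearby.
    return np.clip(n_ref / n_prod, 0.0, 1.0)


def representativeness(production, reference, radius=0.15):
    """Dataset-level metric: the mean per-sample score over the production set."""
    return float(representativeness_scores(production, reference, radius).mean())
```

Under these assumptions, a production sample with no reference neighbour inside the radius scores 0.0, a sample whose neighbourhood holds at least as many reference samples as production samples scores 1.0, and everything in between scores the count ratio. The brute-force distance matrices are only practical for small sets; a nearest-neighbour index would be the usual substitute at scale.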

Scoring

Representativeness Scorer

Representativeness
Score values and their explanations:

1.0: The local reference density matches or exceeds the production density in this region - the production sample is well-represented in the dataset.
0.5: The local reference density is lower than the production density - this region of the production distribution is underrepresented in the dataset.
0.0: No reference samples exist in the local neighbourhood - this region of the production distribution is entirely absent from the dataset.

Examples

Representative sample - production region well-covered by dataset (passing)

Sample
query: How do I cancel my subscription?
source: production
Representativeness Scorer
1.0: The local reference density (12 samples per unit volume) exceeds the production density (11 samples per unit volume) in this neighbourhood. Density ratio is 1.09, clamped to 1.0. This query type is fully represented in the dataset.

Unrepresented sample - production region absent from dataset (failing)

Sample
query: Can I transfer my subscription to a different country?
source: production
Representativeness Scorer
0.0: No reference sample exists in the local neighbourhood (closest distance 0.41, well above the coverage radius of 0.15). This query type has no representation in the dataset - the model will be forced to extrapolate for this class of production inputs.

Underrepresented sample - production region sparsely covered by dataset (failing)

Sample
query: I was charged twice for the same billing cycle.
source: production
Representativeness Scorer
0.31: The local reference density is 2 samples per unit volume while the production density in this neighbourhood is 6.5 samples per unit volume - a density ratio of 0.31. Duplicate billing queries appear frequently in production but are heavily underrepresented in the dataset. Model performance on this query type is likely to be degraded.
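The arithmetic behind the three examples can be reproduced in a few lines. This is a worked illustration only: the helper name is hypothetical, and averaging over just these three samples is an assumption made to show the aggregation step, whereas the real metric averages over the full production set.

```python
def clamped_density_ratio(reference_density, production_density):
    """Ratio of local reference to production density, capped at 1.0 (hypothetical helper)."""
    return min(reference_density / production_density, 1.0)


scores = [
    clamped_density_ratio(12, 11),   # "How do I cancel my subscription?": 1.09 clamped to 1.0
    0.0,                             # "Can I transfer ...": no reference sample within the radius
    clamped_density_ratio(2, 6.5),   # "I was charged twice ...": about 0.31
]

# Aggregate as the mean per-sample score over these three samples only.
metric = sum(scores) / len(scores)

print([round(s, 2) for s in scores])  # [1.0, 0.0, 0.31]
print(round(metric, 2))               # 0.44
```

If production consisted of only these three queries, the dataset-level score would be about 0.44, which falls in the "significant representativeness gaps" part of the scale.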

Run Evaluation in LatticeFlow AI Platform

Use the following CLI command to initialize and run the evaluation in LatticeFlow AI Platform.
Requires LatticeFlow AI Platform CLI
lf init --atlas data_representativeness

Don't have the LatticeFlow AI Platform?

Contact us to see this evaluation in action.