Data Representativeness
Data Quality
Overview
The Data Representativeness evaluation measures how closely the distribution of key characteristics in a training, validation, or test dataset (the reference dataset) matches the corresponding distribution in production data. Each production sample is scored against the reference dataset, and the per-sample scores are aggregated into a single metric.
Metrics
Representativeness
The degree to which the dataset distribution matches the production distribution across key characteristics (range: 0.0 to 1.0).
Motivation
A model trained, validated, or evaluated on data that does not reflect the production distribution will produce misleading results. Training on unrepresentative data causes the model to underperform on the parts of production it has not seen; evaluating on unrepresentative data produces metrics that do not reflect real-world performance.
Representativeness gaps arise naturally: production data evolves, edge cases emerge, and data collection never perfectly mirrors real-world usage. These gaps are not visible in aggregate dataset statistics and only surface as localised failures in production. This evaluation makes them explicit before deployment, so that data collection can be targeted at the regions that matter instead of the gaps being discovered through production failures.
Methodology
- Samples: Each production sample is scored independently against the reference dataset.
- Scoring: Each sample is assessed by the Representativeness Scorer, which estimates how well the local neighbourhood of the production sample is represented in the reference dataset by comparing the local reference density to the local production density. A score of 1.0 means the region is proportionately represented; lower scores indicate underrepresentation or absence.
The representativeness metric is computed as the average per-sample score across the full production set.
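The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Representativeness Scorer itself: the function names, the fixed-radius neighbourhood, the Euclidean distance, and the normalisation of counts by dataset size are all assumptions made for the example. Because both densities are measured over the same neighbourhood, its volume cancels in the ratio.

```python
import numpy as np


def representativeness_scores(production, reference, radius=0.15):
    """Hypothetical per-sample scorer: clamped local density ratio in [0, 1].

    Assumes samples are rows of a 2-D feature (or embedding) array and that a
    neighbourhood is a Euclidean ball of fixed radius.
    """
    production = np.asarray(production, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Pairwise distances: production-to-reference and production-to-production.
    d_ref = np.linalg.norm(production[:, None, :] - reference[None, :, :], axis=-1)
    d_prod = np.linalg.norm(production[:, None, :] - production[None, :, :], axis=-1)

    # Neighbour counts inside the radius. Each production sample counts itself,
    # so the production count is always at least 1.
    ref_counts = (d_ref <= radius).sum(axis=1)
    prod_counts = (d_prod <= radius).sum(axis=1)

    # Assumption: convert counts to fractions so datasets of different sizes
    # can be compared proportionately.
    ref_density = ref_counts / len(reference)
    prod_density = prod_counts / len(production)

    # Ratio of reference to production density, clamped to 1.0.
    return np.minimum(1.0, ref_density / prod_density)


def representativeness_metric(production, reference, radius=0.15):
    """Average of the per-sample scores over the full production set."""
    return float(representativeness_scores(production, reference, radius).mean())
```

The pairwise distance matrices make this quadratic in the number of samples, which is fine for illustration; a nearest-neighbour index would be the usual substitute for larger production sets.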
Scoring
Representativeness Scorer
Examples
Representative sample - production region well-covered by dataset (passing)
The local reference density (12 samples per unit volume) exceeds the production density (11 samples per unit volume) in this neighbourhood. The density ratio is 1.09, which is clamped to 1.0. This query type is fully represented in the dataset.
Unrepresented sample - production region absent from dataset (failing)
No reference sample exists in the local neighbourhood (the closest reference sample is at distance 0.41, well above the coverage radius of 0.15). This query type has no representation in the dataset, so the model will be forced to extrapolate for this class of production inputs.
Underrepresented sample - production region sparsely covered by dataset (failing)
The local reference density is 2 samples per unit volume, while the production density in this neighbourhood is 6.5 samples per unit volume, giving a density ratio of 0.31. Duplicate billing queries appear frequently in production but are heavily underrepresented in the dataset. Model performance on this query type is likely to be degraded.
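The arithmetic behind the three examples can be reproduced with a short Python sketch. The helper name `density_ratio_score` is hypothetical, and treating a neighbourhood with no reference sample as a score of 0.0 is an assumption: the examples state only that such a region has no representation.

```python
def density_ratio_score(reference_density, production_density):
    """Hypothetical helper: reference/production density ratio, clamped to 1.0."""
    if production_density <= 0:
        raise ValueError("production density must be positive")
    return min(1.0, reference_density / production_density)


# Representative sample: 12 / 11 is about 1.09, clamped to 1.0.
representative = density_ratio_score(12.0, 11.0)

# Underrepresented sample: 2 / 6.5 is about 0.31.
underrepresented = density_ratio_score(2.0, 6.5)

# Unrepresented sample: the closest reference sample (0.41) lies outside the
# coverage radius (0.15), so the local reference density is zero.
# Assumption: an empty neighbourhood scores 0.0.
nearest_distance, coverage_radius = 0.41, 0.15
unrepresented = 0.0 if nearest_distance > coverage_radius else None

print(round(representative, 2), round(underrepresented, 2), unrepresented)
# prints: 1.0 0.31 0.0
```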