Data Representativeness
Data Quality
Overview
The Data Representativeness evaluation measures the degree to which the distribution of key characteristics in a training, validation, or test dataset reflects the corresponding distribution in the production data. Each production sample is scored against the reference dataset, and the results are aggregated into a single metric.
Metrics
Representativeness
The degree to which the dataset distribution matches the production distribution across key characteristics (range: 0.0 to 1.0, where 1.0 indicates proportionate representation).
Motivation
A model trained, validated, or evaluated on data that does not reflect the production distribution will produce misleading results. Training on unrepresentative data causes the model to underperform on the parts of production it has not seen; evaluating on unrepresentative data produces metrics that do not reflect real-world performance.
Representativeness gaps arise naturally: production data evolves, edge cases emerge, and data collection never perfectly mirrors real-world usage. These gaps do not show up in aggregate dataset statistics and typically surface only as localised failures in production. This evaluation makes them explicit before deployment, so that data collection can target the underrepresented regions before they cause production failures.
Methodology
- Samples: Each production sample is scored independently against the reference dataset.
- Scoring: Each sample is assessed by the Representativeness Scorer, which estimates how well the local neighbourhood of the production sample is represented in the reference dataset by comparing the local reference density to the local production density. A score of 1.0 means the region is proportionately represented; lower scores indicate underrepresentation or absence.
The representativeness metric is computed as the average per-sample score across the full production set.
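The per-sample scoring and averaging described above can be illustrated with a small sketch. This page does not specify how the Representativeness Scorer estimates local density, so the code below assumes a k-nearest-neighbour estimate over numeric feature vectors. The function names, the parameter k, and the clipping of the density ratio to the range 0.0 to 1.0 are illustrative assumptions, not the scorer's actual implementation.

```python
import numpy as np


def representativeness_scores(reference, production, k=10):
    """Per-sample scores in [0, 1] from an ASSUMED kNN density-ratio estimate."""
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)
    n_ref, n_prod = len(reference), len(production)
    if n_prod <= k:
        raise ValueError("need more than k production samples")

    # Pairwise Euclidean distances: production-to-production and production-to-reference.
    d_pp = np.linalg.norm(production[:, None, :] - production[None, :, :], axis=-1)
    d_pr = np.linalg.norm(production[:, None, :] - reference[None, :, :], axis=-1)

    # Local neighbourhood radius: distance to the k-th nearest other production sample.
    # Index 0 of each sorted row is the sample itself (distance 0).
    radius = np.sort(d_pp, axis=1)[:, k]

    # Local density, expressed as the fraction of each dataset inside the neighbourhood.
    prod_frac = (d_pp <= radius[:, None]).sum(axis=1) / n_prod  # counts the sample itself
    ref_frac = (d_pr <= radius[:, None]).sum(axis=1) / n_ref

    # Ratio of reference density to production density, capped at 1.0 (assumption).
    return np.clip(ref_frac / prod_frac, 0.0, 1.0)


def representativeness(reference, production, k=10):
    """Dataset-level metric: the average per-sample score over the production set."""
    return float(np.mean(representativeness_scores(reference, production, k)))


# Synthetic data: the reference set covers only one of two production clusters.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 2))
production = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),   # region covered by the reference set
    rng.normal(10.0, 1.0, size=(100, 2)),  # region absent from the reference set
])
print(representativeness(reference, production))
```

In this synthetic setup, the production samples in the second cluster have no reference points in their neighbourhoods and score 0.0, so the overall metric falls to roughly half, even though the reference set covers the first cluster well.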
Scoring
Representativeness Scorer
Examples
Representative sample - production region well-covered by dataset (passing)
Unrepresented sample - production region absent from dataset (failing)
Underrepresented sample - production region sparsely covered by dataset (failing)