Structured Data Representativeness
Data Quality
Overview
The Structured Data Representativeness evaluation measures whether a structured dataset adequately reflects the population or environment in which an AI application is deployed. The distribution of each feature in the dataset is compared against the corresponding distribution in a reference population, and the results are aggregated into a single representativeness metric.
The evaluation covers:
- Distribution comparison: the distribution of key features in the evaluated dataset is compared against a reference distribution using statistical divergence measures.
- Skew detection: overrepresented or underrepresented subgroups, time periods, or conditions are identified and quantified.
- Representativeness scoring: the aggregate similarity between evaluated and reference distributions is expressed as a single score from 0.0 (maximal divergence) to 1.0 (perfect distributional match).
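As an illustration of the skew detection step, the sketch below computes a representation ratio for each subgroup, defined here as the subgroup's observed share divided by its reference share. It is a minimal sketch, not the evaluation's actual implementation: the function name `representation_ratios`, the age bands, and the reference shares are all hypothetical.

```python
from collections import Counter


def representation_ratios(values, reference_shares):
    """Return observed_share / reference_share for each reference category.

    A ratio above 1.0 means the category is overrepresented in the dataset,
    a ratio below 1.0 means it is underrepresented.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    ratios = {}
    for category, ref_share in reference_shares.items():
        if ref_share <= 0:
            raise ValueError(f"reference share for {category!r} must be positive")
        observed_share = counts.get(category, 0) / total
        ratios[category] = observed_share / ref_share
    return ratios


# Hypothetical age-band column and reference shares (illustrative only).
dataset_age_bands = ["18-34"] * 60 + ["35-54"] * 30 + ["55+"] * 10
reference_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

ratios = representation_ratios(dataset_age_bands, reference_shares)
for band, ratio in ratios.items():
    print(f"{band}: {ratio:.2f}x the reference share")
```

In this made-up column the youngest band appears at twice its reference share while the oldest band appears at roughly a third of it, which is the kind of skew the evaluation is meant to surface.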
Metrics
Representativeness
The aggregate similarity between the evaluated dataset's feature distributions and the reference population distributions (range: 0.0 to 1.0).
Motivation
A dataset that overrepresents certain subgroups, time periods, or conditions, or underrepresents others, produces models that perform well in some contexts and silently fail in others. The mismatch stays invisible unless the dataset is explicitly compared against a reference population: aggregate metrics can look fine while specific subgroups are systematically underserved.
This evaluation is particularly relevant for AI applications trained or evaluated on historical structured data (e.g. loan portfolios, client transaction histories, sensor readings), where the data distribution may have shifted or may never have been representative of the full deployment population. Detecting distributional skew before training allows targeted data collection or reweighting, rather than discovering the gap through degraded production metrics after deployment.
Methodology
- Samples: Each sample is one feature distribution slice of the dataset, scored independently against the corresponding distribution in the reference population.
- Scoring: Each sample is scored by the Representativeness Scorer, which computes the statistical divergence between the evaluated and reference distributions for that feature and expresses it as a similarity score from 0.0 to 1.0. Per-feature scores are averaged into the aggregate representativeness metric.
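The per-feature scoring and averaging described above can be sketched as follows. The source does not name the divergence measure used by the Representativeness Scorer, so this sketch assumes Jensen-Shannon divergence with base-2 logarithms, chosen because it is bounded between 0 and 1 and so maps directly onto the 0.0 to 1.0 score range. All function names, features, and distributions are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions given as dicts mapping category -> probability.
    The result lies in [0, 1]. ASSUMPTION: the real scorer may use
    a different divergence measure."""
    categories = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2.
    m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in categories}

    def kl(a):
        # KL(a || m), skipping zero-probability categories (0 * log 0 = 0).
        return sum(
            a[c] * math.log2(a[c] / m[c]) for c in a if a[c] > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


def feature_score(evaluated, reference):
    """Similarity score for one feature: 1.0 is a perfect match,
    0.0 is maximal divergence. Clamped to guard against float error."""
    return min(1.0, max(0.0, 1.0 - js_divergence(evaluated, reference)))


def representativeness(evaluated_by_feature, reference_by_feature):
    """Score each feature independently, then average the scores."""
    scores = {
        name: feature_score(evaluated_by_feature[name], reference)
        for name, reference in reference_by_feature.items()
    }
    return sum(scores.values()) / len(scores), scores


# Hypothetical feature distributions (illustrative only).
evaluated = {
    "age_band": {"18-34": 0.60, "35-54": 0.30, "55+": 0.10},
    "region": {"north": 0.50, "south": 0.50},
}
reference = {
    "age_band": {"18-34": 0.30, "35-54": 0.40, "55+": 0.30},
    "region": {"north": 0.50, "south": 0.50},
}

aggregate, per_feature = representativeness(evaluated, reference)
print(per_feature)
print(f"aggregate representativeness: {aggregate:.3f}")
```

Note that a plain average can hide a single badly skewed feature behind several well-matched ones, so the per-feature scores are worth inspecting alongside the aggregate.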
Scoring
Representativeness Scorer
Examples
Representative dataset slice - age distribution close to population (passing)
Unrepresentative dataset slice - strong age skew (failing)
Moderately representative dataset - skew in one segment (partial)