Data Uniqueness
Overview
The Data Uniqueness evaluation measures the degree to which each sample in a dataset is distinct from every other sample in the same dataset. Each sample is scored against the rest of the dataset, and the per-sample results are aggregated into a single dataset-level metric.
Metrics
Data Uniqueness
The average per-sample uniqueness score across the full dataset (range: 0.0 to 1.0).
Motivation
Duplicate records cause AI models to overfit to repeated examples, distort class distributions, and produce inflated evaluation metrics that do not reflect true generalisation performance. The effect is compounding: a sample that appears ten times in training receives ten times the gradient signal of a unique sample, silently skewing the learned decision boundaries.
Duplicates are not always obvious. Exact byte-for-byte copies are easy to catch, but transformation duplicates (differing only by casing, whitespace, or minor formatting) and semantic duplicates (representing the same real-world object in a slightly different surface form) require embedding-based similarity to detect. This evaluation surfaces all three categories through a unified similarity score.
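A rough sketch of how the three categories differ in practice follows. It is illustrative only: the hash-style equality check, the normalisation rule, the sentence-transformers library, the all-MiniLM-L6-v2 model, and the 0.95 threshold are all assumptions, not this evaluation's actual configuration.

```python
# Illustrative checks for the three duplicate categories. The embedding model
# and the similarity threshold are assumptions, not this evaluation's settings.
from sentence_transformers import SentenceTransformer, util

def is_exact_duplicate(a: str, b: str) -> bool:
    # Byte-for-byte copies: trivially caught by direct comparison.
    return a == b

def is_transformation_duplicate(a: str, b: str) -> bool:
    # Records differing only by casing, whitespace, or minor formatting.
    normalise = lambda s: " ".join(s.lower().split())
    return normalise(a) == normalise(b)

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def is_semantic_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    # Same real-world object in a different surface form: needs embeddings.
    emb = _model.encode([a, b], normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1])) >= threshold  # assumed threshold
```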
Methodology
- Samples: Each sample is scored independently against every other sample in the dataset.
- Scoring: Each sample is assessed by the Sample Uniqueness Scorer, which computes the similarity of the sample to its closest neighbour in the dataset using an embedding-based similarity model. A score of 1.0 means no similar record exists anywhere in the dataset; 0.0 means an exact semantic match is present.
The uniqueness metric is computed as the average per-sample score across the full dataset.
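A minimal sketch of the scoring and aggregation steps is below. It assumes per-sample uniqueness is one minus the cosine similarity to the nearest neighbour, which is consistent with the stated endpoints but is our reading, not a confirmed formula; the embedding model is again a placeholder choice.

```python
# Minimal sketch of the Data Uniqueness computation. Assumes uniqueness is
# 1 - cosine similarity to the nearest neighbour, matching the stated
# endpoints (1.0 = no similar record, 0.0 = exact semantic match).
import numpy as np
from sentence_transformers import SentenceTransformer

def uniqueness_scores(samples: list[str]) -> np.ndarray:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    emb = model.encode(samples, normalize_embeddings=True)  # unit vectors
    sim = emb @ emb.T                        # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self-matches
    return np.clip(1.0 - sim.max(axis=1), 0.0, 1.0)  # per-sample uniqueness

def data_uniqueness(samples: list[str]) -> float:
    # The dataset-level metric: the average per-sample score.
    return float(uniqueness_scores(samples).mean())
```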
Scoring
Sample Uniqueness Scorer
Examples
Unique sample - no duplicate detected (passing)
Exact duplicate - byte-for-byte identical record (failing)
Semantic duplicate - same car listing, slightly different timestamp (failing)
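For concreteness, here is a hypothetical run of the uniqueness_scores helper sketched under Methodology on invented car-listing records mirroring the cases above. The records and the expected score directions are illustrative, not output from the evaluation itself.

```python
# Invented records mirroring the example cases above; not real evaluation data.
records = [
    "2018 Honda Civic EX, 42,000 miles, listed 2024-01-05",   # original listing
    "2018 Honda Civic EX, 42,000 miles, listed 2024-01-05",   # exact duplicate
    "2018 Honda Civic EX, 42,000 miles, listed 2024-01-06",   # semantic duplicate
    "1999 Ford F-150 XLT, 180,000 miles, listed 2024-02-10",  # unique sample
]
for record, score in zip(records, uniqueness_scores(records)):
    # Expect the three Civic listings to score near 0.0 (failing) and the
    # F-150 listing to score much closer to 1.0 (passing).
    print(f"{score:.2f}  {record}")
```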