Data Uniqueness
Data Quality
Overview
The Data Uniqueness evaluation measures how distinct each sample in a dataset is from every other sample in the same dataset. Each sample is scored against the rest of the dataset, and the per-sample scores are aggregated into a single metric.
Metrics
Data Uniqueness
The average per-sample uniqueness score across the full dataset (range: 0.0 to 1.0; higher values indicate fewer duplicates).
Motivation
Duplicate records cause AI models to overfit to repeated examples, distort class distributions, and inflate evaluation metrics so that they no longer reflect true generalisation performance. The effect scales with repetition: a sample that appears ten times in training contributes roughly ten times the gradient signal of a unique sample, silently skewing the learned decision boundaries.
Duplicates are not always obvious. Exact byte-for-byte copies are easy to catch, but transformation duplicates (differing only by casing, whitespace, or minor formatting) and semantic duplicates (representing the same real-world object in a slightly different surface form) slip past exact matching; semantic duplicates in particular require embedding-based similarity to detect. This evaluation surfaces all three categories through a unified similarity score.
Methodology
- Samples: Each sample in the dataset is scored independently against all other samples in the dataset.
- Scoring: Each sample is assessed by the Sample Uniqueness Scorer, which computes the similarity of the sample to its closest neighbour in the dataset using an embedding-based similarity model and converts it into a uniqueness score, so that higher similarity yields a lower score. A score of 1.0 means no similar record exists anywhere in the dataset; 0.0 means an exact semantic match is present.
The uniqueness metric is computed as the average per-sample score across the full dataset.
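The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the scorer's actual implementation: it assumes each sample has already been embedded as a vector, uses cosine similarity as the similarity model, and maps similarity to uniqueness as 1 minus the nearest-neighbour similarity. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sample_uniqueness(index: int, embeddings: list[list[float]]) -> float:
    """Assumed mapping: 1 minus similarity to the closest other sample."""
    similarities = (
        cosine_similarity(embeddings[index], other)
        for j, other in enumerate(embeddings)
        if j != index
    )
    nearest = max(similarities, default=0.0)  # a lone sample has no neighbour
    nearest = min(1.0, max(0.0, nearest))     # clip so the score stays in [0, 1]
    return 1.0 - nearest


def data_uniqueness(embeddings: list[list[float]]) -> float:
    """Average per-sample uniqueness across the full dataset."""
    if not embeddings:
        raise ValueError("dataset is empty")
    scores = [sample_uniqueness(i, embeddings) for i in range(len(embeddings))]
    return sum(scores) / len(scores)


# Two identical vectors and one orthogonal vector: scores are 0.0, 0.0, 1.0.
print(data_uniqueness([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

The all-pairs comparison is quadratic in the number of samples, so a real scorer over a large dataset would likely rely on an approximate nearest-neighbour index instead.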
Scoring
Data Uniqueness Scorer
Examples
Unique sample - no duplicate detected (passing)
The closest record in the dataset (car_id V-0088) has different mileage, a different dealer, and a timestamp two months later. Similarity is 0.18, well below the duplicate threshold. This sample is unique.
Exact duplicate - byte-for-byte identical record (failing)
This record is byte-for-byte identical to car_id V-0041 already present in the dataset. Similarity is 1.0. It is a confirmed exact duplicate and should be removed.
Semantic duplicate - same car listing, slightly different timestamp (failing)
The closest record (car_id V-0041) has all features identical except the timestamp, which differs by 8 hours. Similarity is 0.96, above the semantic duplicate threshold. These two records likely represent the same physical car listing registered twice.
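The three outcomes above can be summarised as a small decision rule. The sketch below is hypothetical: the function name `classify` is invented for illustration, the threshold value is not stated in this document and must be supplied by the caller (0.9 is used purely as a placeholder), and the feature dictionaries are made-up stand-ins that exclude the record identifier.

```python
def classify(sample: dict, nearest: dict, similarity: float,
             duplicate_threshold: float) -> str:
    """Label a sample given its closest neighbour and their similarity."""
    if sample == nearest:
        return "exact duplicate"     # all compared fields are identical
    if similarity >= duplicate_threshold:
        return "semantic duplicate"  # near-identical, e.g. only the timestamp differs
    return "unique"


THRESHOLD = 0.9  # placeholder value, not taken from the evaluation

# Made-up feature values; only the similarity figures come from the examples above.
listing = {"mileage": 1, "timestamp": "t0"}
print(classify(listing, {"mileage": 2, "timestamp": "t9"}, 0.18, THRESHOLD))  # unique
print(classify(listing, dict(listing), 1.0, THRESHOLD))                       # exact duplicate
print(classify(listing, {"mileage": 1, "timestamp": "t1"}, 0.96, THRESHOLD))  # semantic duplicate
```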