
Data Uniqueness

Measures the degree to which each real-world object occurs only once in a dataset, detecting exact, transformation, and semantic duplicates at both the dataset and per-sample level.
Tags:

Data Quality

Overview

The Data Uniqueness evaluation measures the degree to which each sample in a dataset is unique from all other samples in the same dataset. Each sample is scored against the rest of the dataset, and the results are aggregated into a single metric.

Metrics

Data Uniqueness

The average per-sample uniqueness score across the full dataset (range: 0.0 to 1.0).

Data Uniqueness (scale: 0.0 to 1.0)

0.0: Every sample is a duplicate of another record - the dataset contains no unique entries.
0.8: Significant duplication - 20% or more of samples are duplicates. Deduplication is required before training or evaluation.
0.95: Minor duplication - around 5% of samples are duplicates. Impact depends on dataset size and whether duplicates cluster around specific classes or features.
1.0: No duplicates detected - every sample is unique.
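The reference points above can also be applied programmatically, for example when gating a data pipeline on the aggregate score. The following is a minimal illustrative sketch; the band labels come from the table above, but treating the reference points as hard cut-offs (and the function name itself) is an assumption:

```python
def interpret_uniqueness(score: float) -> str:
    """Map an aggregate Data Uniqueness score to the guidance bands
    described above. Treating the reference points as hard cut-offs
    is an assumption made for illustration."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score == 1.0:
        return "no duplicates detected"
    if score >= 0.95:
        return "minor duplication - assess impact"
    if score >= 0.8:
        return "significant duplication - deduplicate before training"
    return "severe duplication - deduplication required"

print(interpret_uniqueness(0.97))  # minor duplication - assess impact
```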

Motivation

Duplicate records cause AI models to overfit to repeated examples, distort class distributions, and produce inflated evaluation metrics that do not reflect true generalisation performance. The effect is compounding: a sample that appears ten times in training receives ten times the gradient signal of a unique sample, silently skewing the learned decision boundaries.

Duplicates are not always obvious. Exact byte-for-byte copies are easy to catch, but transformation duplicates (differing only by casing, whitespace, or minor formatting) and semantic duplicates (representing the same real-world object in a slightly different surface form) require embedding-based similarity to detect. This evaluation surfaces all three categories through a unified similarity score.
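The first two categories can often be caught without a model at all. As a minimal sketch (function names are illustrative, not part of the platform), exact and transformation duplicates can be grouped by hashing a normalised form of each record; semantic duplicates still require the embedding-based similarity described in this evaluation:

```python
import hashlib

def normalise(text: str) -> str:
    """Collapse casing and whitespace so transformation duplicates
    (same content, different surface formatting) hash to the same key."""
    return " ".join(text.lower().split())

def find_transformation_duplicates(samples):
    """Group indices of samples that are identical after normalisation."""
    buckets: dict[str, list[int]] = {}
    for i, text in enumerate(samples):
        key = hashlib.sha256(normalise(text).encode()).hexdigest()
        buckets.setdefault(key, []).append(i)
    return [idxs for idxs in buckets.values() if len(idxs) > 1]

samples = [
    "BMW X3, 2019, 54200 km",
    "bmw  X3, 2019, 54200 KM",   # transformation duplicate of the first
    "Audi A4, 2020, 31000 km",
]
dups = find_transformation_duplicates(samples)
# → [[0, 1]]
```

Note that this catches only the first two categories; two listings describing the same car in different words would hash to different keys.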

Methodology

  1. Samples: Each sample is scored independently against every other sample in the dataset.
  2. Scoring: Each sample is assessed by the Sample Uniqueness Scorer, which computes the similarity of the sample to its closest neighbour in the dataset using an embedding-based similarity model. A score of 1.0 means no similar record exists anywhere in the dataset; 0.0 means an exact semantic match is present.

The uniqueness metric is computed as the average per-sample score across the full dataset.

Scoring

Data Uniqueness Scorer

Data Uniqueness
Score value: Explanation

1.0: No similar record found anywhere in the dataset - this sample is unique.
0.5: A moderately similar record exists in the dataset - the sample may be a near-duplicate representing the same real-world object in a slightly different surface form.
0.0: An exact or near-exact match exists in the dataset - this sample is a duplicate and should be removed before training or evaluation.

Examples

Unique sample - no duplicate detected (passing)

Sample
  car_id: V-0041
  make: BMW
  model: X3
  year: 2019
  km_driven: 54200
  dealer_id: D-07
  timestamp: 2024-03-15T10:22:00Z
Sample Uniqueness Scorer
  0.82: The closest record in the dataset (car_id V-0088) has different mileage, a different dealer, and a timestamp two months later. Similarity is 0.18 - well below the duplicate threshold. This sample is unique.

Exact duplicate - byte-for-byte identical record (failing)

Sample
  car_id: V-0041
  make: BMW
  model: X3
  year: 2019
  km_driven: 54200
  dealer_id: D-07
  timestamp: 2024-03-15T10:22:00Z
Sample Uniqueness Scorer
  0.0: This record is byte-for-byte identical to car_id V-0041, already present in the dataset. Similarity is 1.0. It is a confirmed exact duplicate and should be removed.

Semantic duplicate - same car listing, slightly different timestamp (failing)

Sample
  car_id: V-0204
  make: BMW
  model: X3
  year: 2019
  km_driven: 54200
  dealer_id: D-07
  timestamp: 2024-03-15T18:45:00Z
Sample Uniqueness Scorer
  0.04: The closest record (car_id V-0041) has all features identical except the timestamp, which differs by about 8 hours. Similarity is 0.96 - above the semantic duplicate threshold. These two records likely represent the same physical car listing registered twice.

Run Evaluation in LatticeFlow AI Platform

Use the following CLI command to initialize and run the evaluation in LatticeFlow AI Platform.
Requires LatticeFlow AI Platform CLI
lf init --atlas data_uniqueness


Don't have the LatticeFlow AI Platform?

Contact us to see this evaluation in action: