Unstructured Data Accuracy
Overview
The Unstructured Data Accuracy evaluation measures how accurately the attributes in an unstructured dataset represent the true values of the intended concept or event. Each sample is inspected across three independent accuracy dimensions, and the results are aggregated into per-dimension and overall metrics.
The three accuracy dimensions are:
- Syntactic accuracy: whether attribute values conform to their domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields.
- Data type accuracy: whether attribute values are represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types.
- Semantic accuracy: whether attribute values correctly represent the true value of the intended concept - correct annotations, no missing or phantom labels.
Metrics
Accuracy
The aggregate accuracy score combining all three dimensions (range: 0.0 to 1.0).
Syntactic Accuracy
The fraction of attribute values that conform to the domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields (range: 0.0 to 1.0).
Data Type Accuracy
The fraction of attribute values that are represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types (range: 0.0 to 1.0).
Semantic Accuracy
The fraction of attribute values that correctly represent the true value of the intended concept - correct annotations, no missing or phantom labels (range: 0.0 to 1.0). Note that semantic accuracy is difficult to measure automatically and may require manual review or reproduction of the data collection process.
Motivation
Inaccurate data propagates silently through AI pipelines. Encoding corruption, type mismatches, and wrong ground-truth labels each produce incorrect model outputs - yet none of them raises an obvious error at ingestion time. A model trained or evaluated on inaccurate data learns the wrong patterns, produces wrong predictions, and reports metrics that do not reflect real-world performance.
Accuracy failures fall into three distinct categories, each requiring a different remediation strategy. Measuring them independently makes each failure mode visible, so remediation can target the specific failure instead of chasing a single opaque aggregate score.
Methodology
- Samples: Each sample in the dataset is scored independently across three accuracy dimensions.
- Scoring: Each sample is scored across three dimensions:
  - Syntactic Scorer: checks whether values conform to their domain definition (encoding, formatting, spelling).
  - Type Scorer: checks whether values use the correct type (numeric formats, null representations, sampling rates, image data types).
  - Semantic Scorer: checks whether values correctly represent the intended concept (annotations, labels).
Each scorer produces an independent score from 0.0 to 1.0. Per-dimension and aggregate metrics are computed by averaging these scores across all samples.
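The averaging step can be sketched as follows. This is a minimal illustration, not the evaluation's actual implementation: `SampleScores` and `aggregate` are hypothetical names, and the overall score is assumed to be the unweighted mean of the three per-dimension means, since the weighting is not specified here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SampleScores:
    """Per-sample scores, each in [0.0, 1.0] (hypothetical container)."""
    syntactic: float
    data_type: float
    semantic: float


def aggregate(samples: list[SampleScores]) -> dict[str, float]:
    """Average each dimension over all samples, then combine the dimensions.

    Assumption: the overall score is the unweighted mean of the three
    per-dimension means. The real evaluation may weight them differently.
    """
    if not samples:
        raise ValueError("cannot aggregate an empty dataset")
    metrics = {
        "syntactic_accuracy": mean(s.syntactic for s in samples),
        "data_type_accuracy": mean(s.data_type for s in samples),
        "semantic_accuracy": mean(s.semantic for s in samples),
    }
    # The right-hand side is evaluated before the key is added,
    # so only the three per-dimension values are averaged.
    metrics["accuracy"] = mean(metrics.values())
    return metrics


# Two samples: one fully accurate, one with a type failure and a partial
# syntactic score (values invented for illustration).
metrics = aggregate([
    SampleScores(syntactic=1.0, data_type=1.0, semantic=1.0),
    SampleScores(syntactic=0.5, data_type=0.0, semantic=1.0),
])
```

Keeping the per-dimension means alongside the overall score preserves the property described under Motivation: a low aggregate can be traced back to the dimension that caused it.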
Examples
Accurate sample - all three scorers pass (passing)
- Syntactic Scorer: vehicle_class is correctly encoded, including the accented character. No spelling or grammar errors detected.
- Type Scorer: price_chf is an integer as expected, in_stock is a boolean, and category_code is a true null rather than a string sentinel.
- Semantic Scorer: All values correctly represent the intended concepts. vehicle_class "Coupé" matches the physical sample; the price and stock status are accurate.
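A type check of the kind described for this sample might look like the sketch below. The schema, the sentinel list, the function names, and the record values are all assumptions made for illustration; only the field names and their expected types come from the example. The per-sample score is assumed to be the fraction of fields that pass.

```python
from typing import Any

# Hypothetical schema: field name -> (expected type, whether null is allowed).
EXPECTED_TYPES: dict[str, tuple[type, bool]] = {
    "price_chf": (int, False),
    "in_stock": (bool, False),
    "category_code": (str, True),
}

# Assumed list of strings that stand in for a missing value.
NULL_SENTINELS = {"", "null", "none", "n/a", "nan"}


def has_correct_type(field: str, value: Any) -> bool:
    """Return True if `value` has the type the schema expects for `field`."""
    expected, nullable = EXPECTED_TYPES[field]
    if value is None:
        return nullable
    if isinstance(value, str) and value.strip().lower() in NULL_SENTINELS:
        return False  # a string sentinel is not a true null
    if expected is int and isinstance(value, bool):
        return False  # bool is a subclass of int in Python, so reject it explicitly
    return isinstance(value, expected)


def type_score(record: dict[str, Any]) -> float:
    """Fraction of schema fields whose value has the expected type."""
    checks = [has_correct_type(name, record.get(name)) for name in EXPECTED_TYPES]
    return sum(checks) / len(checks)


# Record values are invented; the example above does not state them.
accurate = {"price_chf": 42000, "in_stock": True, "category_code": None}
boolean_price = {"price_chf": True, "in_stock": True, "category_code": None}
```

The explicit `bool` guard matters because `isinstance(True, int)` is true in Python; without it, a boolean stored in price_chf would pass the integer check unnoticed.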
Multiple errors - encoding corruption, wrong field type, and incorrect annotation (failing)
- Syntactic Scorer: vehicle_class contains an encoding artefact: the accented character 'é' was mangled and followed by spurious bytes, producing 'Coup ' instead of 'Coupé'. The remaining fields are syntactically correct.
- Type Scorer: price_chf is expected to be a numeric value but contains a boolean (true). A boolean has no sensible interpretation as a price, so the value is unrecoverable. The remaining fields have the correct type.
- Semantic Scorer: The 'dent' damage entry annotates a region of the vehicle that shows no visible damage in the source image. The 'scratch' annotation is correct. The other fields are semantically accurate.
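Encoding corruption of the kind flagged in this example can sometimes be detected heuristically. The sketch below covers one common mechanism only, UTF-8 bytes that were decoded as Latin-1, plus the Unicode replacement character; it is an assumption that this resembles the corruption in the example, whose exact bytes are not shown. `looks_like_mojibake` is a hypothetical helper, not part of the evaluation.

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristically flag text that was probably decoded with the wrong codec."""
    # U+FFFD marks bytes a decoder could not interpret.
    if "\ufffd" in text:
        return True
    try:
        # If the text re-encodes to Latin-1 and those bytes form valid UTF-8,
        # it was probably UTF-8 that had been read as Latin-1.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    # Pure ASCII survives the round trip unchanged and is not flagged.
    return repaired != text


clean = "Coup\u00e9"  # "Coupé", written with an escape to stay ASCII-safe
corrupted = clean.encode("utf-8").decode("latin-1")  # simulate the wrong decode

assert not looks_like_mojibake(clean)
assert looks_like_mojibake(corrupted)
```

A heuristic like this catches only the syntactic failure. The boolean price and the phantom 'dent' annotation need the type and semantic checks respectively, which is why the three dimensions are scored separately.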