Unstructured Data Accuracy
Data Quality
Overview
The Unstructured Data Accuracy evaluation measures how accurately the attributes in an unstructured dataset represent the true values of the intended concept or event. Each sample is inspected across three independent accuracy dimensions, and the results are aggregated into per-dimension and overall metrics.
The three accuracy dimensions are:
- Syntactic accuracy: whether attribute values conform to their domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields.
- Data type accuracy: whether attribute values are represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types.
- Semantic accuracy: whether attribute values correctly represent the true value of the intended concept - correct annotations, no missing or phantom labels.
Metrics
Accuracy
The aggregate accuracy score combining all three dimensions (range: 0.0 to 1.0).
Syntactic Accuracy
The fraction of attribute values that conform to the domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields (range: 0.0 to 1.0).
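As an illustration of what a syntactic check might look like, the sketch below scores the text fields of one sample for signs of encoding corruption. The function name `syntactic_score`, the list of mojibake hints, and the choice to score an empty sample as 1.0 are assumptions made for this example, not part of the evaluation itself. Spelling and grammar checks are deliberately left out.

```python
import unicodedata

# U+FFFD is inserted by decoders when bytes cannot be decoded.
REPLACEMENT_CHAR = "\ufffd"

# Hypothetical, non-exhaustive hints of UTF-8 text mis-decoded as Latin-1/cp1252.
MOJIBAKE_HINTS = (
    "\u00c3\u00a9",        # mis-decoded "e with acute accent"
    "\u00e2\u20ac\u2122",  # mis-decoded right single quotation mark
)


def syntactic_score(text_fields: dict[str, str]) -> float:
    """Return the fraction of text fields that show no sign of encoding corruption."""
    if not text_fields:
        return 1.0  # assumption: nothing to check counts as fully accurate
    clean = 0
    for value in text_fields.values():
        has_replacement = REPLACEMENT_CHAR in value
        has_mojibake = any(hint in value for hint in MOJIBAKE_HINTS)
        # Control characters other than common whitespace are treated as corruption.
        has_control = any(
            unicodedata.category(ch) == "Cc" and ch not in "\n\r\t" for ch in value
        )
        if not (has_replacement or has_mojibake or has_control):
            clean += 1
    return clean / len(text_fields)
```

A production check would draw its rules from the dataset's own domain definition rather than from a fixed list of hints.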
Data Type Accuracy
The fraction of attribute values that are represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types (range: 0.0 to 1.0).
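A type check can be sketched in a similar way. The schema `EXPECTED_TYPES`, its field names, the set of null-like strings, and the decision to accept `None` (including a missing field) as the correct null representation are all hypothetical choices for this example. A real dataset would define its own.

```python
import math

# Hypothetical schema: field name -> expected Python type.
EXPECTED_TYPES = {"sample_rate_hz": int, "duration_s": float, "label": str}

# Strings that often stand in for a missing value instead of a real null.
NULL_LIKE_STRINGS = {"", "null", "none", "nan", "n/a"}


def type_score(sample: dict, schema: dict = EXPECTED_TYPES) -> float:
    """Return the fraction of schema fields whose value has the expected type."""
    if not schema:
        return 1.0  # assumption: nothing to check counts as fully accurate
    correct = 0
    for field, expected in schema.items():
        value = sample.get(field)
        if value is None:
            correct += 1  # assumption: None is the accepted null representation
            continue
        if isinstance(value, bool) and expected is not bool:
            continue  # bool is a subclass of int, so reject it explicitly
        if isinstance(value, str) and value.strip().lower() in NULL_LIKE_STRINGS:
            continue  # a placeholder string used where a real null belongs
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN used as a null stand-in
        if isinstance(value, expected):
            correct += 1
    return correct / len(schema)
```

For example, a sample that stores the sampling rate as the string `"16000"` and the label as `"N/A"` would score 1/3 under this sketch, because only `duration_s` has the expected type.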
Semantic Accuracy
The fraction of attribute values that correctly represent the true value of the intended concept - correct annotations, no missing or phantom labels (range: 0.0 to 1.0). Note that semantic accuracy is difficult to measure automatically and may require manual review or reproduction of the data collection process.
Motivation
Inaccurate data propagates silently through AI pipelines. Encoding corruption, type mismatches, and wrong ground-truth labels each produce incorrect model outputs - yet none of them raises an obvious error at ingestion time. A model trained or evaluated on inaccurate data learns the wrong patterns, produces wrong predictions, and reports metrics that do not reflect real-world performance.
Accuracy failures fall into three distinct categories, each requiring a different remediation strategy. Measuring them independently makes each failure mode visible, so that remediation can target the specific failure instead of chasing a single opaque aggregate score.
Methodology
- Samples: Each sample in the dataset is scored independently across three accuracy dimensions.
- Scoring: Three scorers are applied to each sample, one per dimension:
  - Syntactic Scorer: checks whether values conform to their domain definition (encoding, formatting, spelling).
  - Type Scorer: checks whether values use the correct type (numeric formats, null representations, sampling rates, image data types).
  - Semantic Scorer: checks whether values correctly represent the intended concept (annotations, labels).
Each scorer produces an independent score from 0.0 to 1.0. Per-dimension and aggregate metrics are computed by averaging these scores across all samples.
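The averaging step can be sketched as follows. The per-dimension metrics are the mean of each scorer's output across samples, as described above. How the three dimensions are combined into the overall Accuracy metric is not specified here, so the unweighted mean used below is an assumption, as are the function name `aggregate` and the dictionary keys.

```python
from statistics import fmean

# Hypothetical keys for the three per-sample scores.
DIMENSIONS = ("syntactic", "data_type", "semantic")


def aggregate(sample_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average per-sample scores into per-dimension and overall metrics."""
    if not sample_scores:
        raise ValueError("at least one scored sample is required")
    # Per-dimension metric: mean of that scorer's output across all samples.
    metrics = {
        dim: fmean([scores[dim] for scores in sample_scores]) for dim in DIMENSIONS
    }
    # Assumption: overall accuracy is the unweighted mean of the three dimensions.
    metrics["accuracy"] = fmean([metrics[dim] for dim in DIMENSIONS])
    return metrics


if __name__ == "__main__":
    scored = [
        {"syntactic": 1.0, "data_type": 1.0, "semantic": 1.0},
        {"syntactic": 0.0, "data_type": 0.5, "semantic": 0.0},
    ]
    print(aggregate(scored))
```

Keeping the per-dimension values alongside the aggregate preserves the visibility into individual failure modes that motivates the evaluation.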
Scoring
Syntactic Accuracy Scorer
Data Type Accuracy Scorer
Semantic Accuracy Scorer
Examples
Accurate sample - all three scorers pass (passing)
Multiple errors - encoding corruption, wrong field type, and incorrect annotation (failing)