
Unstructured Data Accuracy

Measures the degree to which attributes in an unstructured dataset correctly represent the true value of the intended concept, across syntactic, type, and semantic accuracy dimensions.
Tags:

Data Quality

Overview

The Unstructured Data Accuracy evaluation measures how accurately the attributes in an unstructured dataset represent the true values of the intended concept or event. Each sample is inspected across three independent accuracy dimensions, and the results are aggregated into per-dimension and overall metrics.

The three accuracy dimensions are:

  • Syntactic accuracy: whether attribute values conform to their domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields.
  • Data type accuracy: whether attribute values are represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types.
  • Semantic accuracy: whether attribute values correctly represent the true value of the intended concept - correct annotations, no missing or phantom labels.
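As an illustration of how the three dimensions differ, consider one hypothetical failure of each kind (the field names and values below are invented for this sketch, not taken from the platform):

```python
# Hypothetical examples of the three failure modes (illustrative only).

# Syntactic failure: UTF-8 text decoded as Latin-1 produces mojibake.
raw = "Café".encode("utf-8").decode("latin-1")
assert raw == "CafÃ©"  # encoding corruption -> syntactic inaccuracy

# Data type failure: a numeric attribute stored as a string sentinel.
price = "48500"  # expected int, got str
assert not isinstance(price, int)

# Semantic failure: a value that is well-formed and correctly typed,
# but does not match the ground truth of the source data.
annotation = {"label": "dent"}       # annotator's claim
ground_truth = {"label": "scratch"}  # what the image actually shows
assert annotation["label"] != ground_truth["label"]
```

Note that the semantic failure passes both of the other checks: only a comparison against a trusted reference reveals it.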

Metrics

Accuracy

The aggregate accuracy score combining all three dimensions (range: 0.0 to 1.0).

  • 0.0: Dataset is entirely inaccurate across one or more dimensions - unusable for AI processing.
  • 0.5: Half of attribute values are inaccurate - significant data quality issues that will cause incorrect model outputs.
  • 0.9: 90% of attribute values are accurate - minor inaccuracies present. Acceptability depends on the sensitivity of the downstream model.
  • 1.0: Perfect accuracy - all attribute values are syntactically correct, correctly typed, and semantically accurate.

Syntactic Accuracy

The fraction of attribute values that conform to the domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields (range: 0.0 to 1.0).

  • 0.0: No values conform to the domain - systematic encoding corruption or formatting errors throughout the dataset.
  • 0.5: Half of values have syntactic issues - widespread encoding or formatting errors that require remediation before use.
  • 0.9: Minor syntactic issues - 10% of values are malformed. May be acceptable depending on whether affected fields are critical.
  • 1.0: All values conform to the domain definition - no encoding corruption or formatting errors.

Data Type Accuracy

The fraction of attribute values that are represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types (range: 0.0 to 1.0).

  • 0.0: No values use the correct type - systematic type mismatches that will cause silent data loss or misinterpretation in downstream processing.
  • 0.5: Half of values have type mismatches - significant schema violations that require a transformation or re-ingestion step.
  • 0.9: Minor type issues - 10% of values have type mismatches. Impact depends on whether affected fields are used by the model.
  • 1.0: All values are represented with the correct type - no type mismatches or non-standard null encodings.

Semantic Accuracy

The fraction of attribute values that correctly represent the true value of the intended concept - correct annotations, no missing or phantom labels (range: 0.0 to 1.0). Note that semantic accuracy is difficult to measure automatically and may require manual review or reproduction of the data collection process.

  • 0.0: No attribute values correctly represent the intended concept - entirely incorrect annotations or labels throughout the dataset.
  • 0.5: Half of values are semantically inaccurate - significant label noise that will directly degrade model training or inference quality.
  • 0.9: Minor semantic inaccuracies - 10% of values are incorrect. Manual spot-checking is recommended to confirm whether errors are systematic.
  • 1.0: All attribute values correctly represent the intended concept - no incorrect annotations or labels.

Motivation

Inaccurate data propagates silently through AI pipelines. Encoding corruption, type mismatches, and wrong ground-truth labels each produce incorrect model outputs - yet none of them raises an obvious error at ingestion time. A model trained or evaluated on inaccurate data learns the wrong patterns, produces wrong predictions, and reports metrics that do not reflect real-world performance.

Accuracy failures fall into three distinct categories, each requiring a different remediation strategy. Measuring them independently makes each failure mode visible, so remediation can target the specific problem rather than chase a single opaque aggregate score.

Methodology

  1. Samples: Each sample in the dataset is inspected independently; no sample's score depends on any other.

  2. Scoring: Each sample is scored across the three accuracy dimensions:

    • Syntactic Scorer: checks whether values conform to their domain definition (encoding, formatting, spelling).
    • Type Scorer: checks whether values use the correct type (numeric formats, null representations, sampling rates, image data types).
    • Semantic Scorer: checks whether values correctly represent the intended concept (annotations, labels).

    Each scorer produces an independent score from 0.0 to 1.0. Per-dimension and aggregate metrics are computed by averaging these scores across all samples.
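The aggregation step above can be sketched as a plain average over samples and dimensions. The scorer outputs below are hypothetical, and equal weighting of the three dimensions is an assumption; the platform's exact weighting is not specified here:

```python
from statistics import mean

# Hypothetical per-sample scores: one dict per sample, each holding the
# three independent dimension scores in [0.0, 1.0].
samples = [
    {"syntactic": 1.0, "type": 1.0, "semantic": 1.0},
    {"syntactic": 0.8, "type": 0.8, "semantic": 0.5},
]

# Per-dimension metric: average of that dimension across all samples.
per_dimension = {
    dim: mean(s[dim] for s in samples)
    for dim in ("syntactic", "type", "semantic")
}

# Aggregate accuracy: average of the per-dimension metrics
# (assuming equal weighting of the three dimensions).
accuracy = mean(per_dimension.values())
```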

Scoring

Syntactic Accuracy Scorer

  • 1.0: Every configured field in the sample conforms to the domain definition - no encoding corruption, character substitution, spelling or grammar errors detected.
  • 0.6: 60% of configured fields are syntactically correct. Some fields contain correctable encoding artefacts or minor spelling errors; the affected values are recoverable but require remediation.
  • 0.0: No configured field conforms to the domain - systematic encoding corruption or grammar errors throughout the sample render it unusable without full re-ingestion.
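A minimal sketch of a field-level syntactic check, assuming encoding corruption is the signal of interest (a real scorer would also cover spelling, grammar, and domain formats; the heuristics below are crude and would misfire on legitimate text containing 'Ã'):

```python
def syntactic_score(fields: dict[str, str]) -> float:
    """Fraction of text fields free of obvious encoding corruption."""
    def is_clean(value: str) -> bool:
        # U+FFFD is the replacement character inserted on decode errors;
        # 'Ã©'-style sequences suggest UTF-8 bytes read as Latin-1.
        return "\ufffd" not in value and "Ã" not in value

    if not fields:
        return 1.0
    ok = sum(is_clean(v) for v in fields.values())
    return ok / len(fields)

# One mojibake field out of two -> 0.5 (hypothetical field names).
score = syntactic_score({"vehicle_class": "CoupÃ©", "color": "red"})
```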

Data Type Accuracy Scorer

  • 1.0: Every configured field is represented with the correct type - correct numeric formats, correct null representations, correct sampling rates or image data types, no type sentinel mismatches.
  • 0.6: 60% of configured fields have the correct type. Some fields contain type mismatches (e.g. a numeric value stored as a string, or an incorrect sampling rate) that will cause silent errors or require casting before use.
  • 0.0: No configured field uses the correct type - systematic type mismatches that will cause silent data loss or misinterpretation in every downstream operation.
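A type check of this kind can be sketched as a comparison of runtime types against an expected schema. The schema and field names below are assumptions made for the sketch, not the platform's configuration format:

```python
def type_score(fields: dict, schema: dict) -> float:
    """Fraction of fields whose runtime type matches the expected schema."""
    def matches(value, expected) -> bool:
        if expected is type(None):
            return value is None
        # bool is a subclass of int in Python, so reject booleans when a
        # number is expected (a boolean price is not a number).
        if expected is int and isinstance(value, bool):
            return False
        return isinstance(value, expected)

    checked = [matches(fields.get(name), exp) for name, exp in schema.items()]
    return sum(checked) / len(checked) if checked else 1.0

# Hypothetical schema for a vehicle sample.
schema = {"image_id": str, "vehicle_class": str, "price_chf": int,
          "in_stock": bool, "category_code": type(None)}
sample = {"image_id": "img-0042", "vehicle_class": "Coup",
          "price_chf": True, "in_stock": True, "category_code": None}
score = type_score(sample, schema)  # 4 of 5 fields match -> 0.8
```

The explicit bool-vs-int guard matters: without it, a boolean stored in a numeric field would silently pass `isinstance(value, int)`.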

Semantic Accuracy Scorer

  • 1.0: Every configured field correctly represents the true value of the intended concept - all annotations are complete and accurate, all labels match the source data.
  • 0.6: 60% of configured fields are semantically correct. Some fields contain semantic errors such as plausible-but-incomplete annotations or labels that are approximately but not exactly correct.
  • 0.0: No configured field correctly represents the intended concept - all annotations or labels are factually incorrect or refer to objects and events that do not exist in the source data.
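Because semantic accuracy needs a trusted reference (manual review or a reproduced collection process, as noted above), it can only be automated once such a reference exists. A minimal sketch, assuming position-aligned reference labels are available:

```python
def semantic_score(annotations: list[dict], ground_truth: list[dict]) -> float:
    """Fraction of annotations that exactly match a trusted reference.

    Assumes annotations and ground truth are aligned by position; a real
    scorer would also need to match regions and handle missing or
    phantom labels that break the alignment.
    """
    if not annotations:
        return 1.0
    correct = sum(a == g for a, g in zip(annotations, ground_truth))
    return correct / len(annotations)

# One of two damage labels matches the reference -> 0.5 (hypothetical data).
score = semantic_score(
    [{"type": "scratch"}, {"type": "dent"}],
    [{"type": "scratch"}, {"type": "no_damage"}],
)
```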

Examples

Accurate sample - all three scorers pass (passing)

Sample
  vehicle_class: Coupé
  price_chf: 48500
  in_stock: true
  category_code: null
Syntactic Scorer
  1.0: vehicle_class is correctly encoded, including the accented character. No spelling or grammar errors detected.
Type Scorer
  1.0: price_chf is an integer as expected, in_stock is a boolean, and category_code is a true null rather than a string sentinel.
Semantic Scorer
  1.0: All values correctly represent the intended concepts. vehicle_class "Coupé" matches the physical sample; the price and stock status are accurate.

Multiple errors - encoding corruption, wrong field type, and incorrect annotation (failing)

Sample
  image_id: img-0042
  vehicle_class: Coup
  price_chf: true
  in_stock: true
  category_code: null
  damages: [
    { type: scratch, severity: minor, location: [120, 80, 300, 210] },
    { type: dent, severity: major, location: [420, 60, 480, 220] }
  ]
Syntactic Scorer
  0.8: vehicle_class contains an encoding artefact - the accented character 'é' was mangled and trailed with spurious bytes, producing 'Coup ' instead of 'Coupé'. The remaining fields are syntactically correct.
Type Scorer
  0.8: price_chf is expected to be a numeric value but contains a boolean (true). There is no sensible numeric interpretation of a boolean price, making the value unrecoverable. The remaining fields have the correct type.
Semantic Scorer
  0.5: The 'dent' damage entry annotates a region of the vehicle that shows no visible damage in the source image. The 'scratch' annotation is correct. The other fields are semantically accurate.

Run Evaluation in LatticeFlow AI Platform

Use the following CLI command to initialize and run the evaluation in LatticeFlow AI Platform.
Requires LatticeFlow AI Platform CLI
lf init --atlas unstructured_data_accuracy

