structured_data_accuracy

Structured Data Accuracy

Measures the degree to which attributes in a structured (tabular) dataset correctly represent the true value of the intended concept, across syntactic, type, and semantic accuracy dimensions.

Tags: Data Quality

Overview

The Structured Data Accuracy evaluation measures how accurately the attributes in a structured (tabular) dataset represent the true values of the intended concept or event. Each row is inspected across three independent accuracy dimensions, and the results are aggregated into per-dimension and overall metrics.

The three accuracy dimensions are:

Syntactic accuracy: whether attribute values conform to their domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields.
Data type accuracy: whether attribute values are represented with the correct type - correct numeric formats, correct date representations, correct null encodings.
Semantic accuracy: whether attribute values correctly represent the true value of the intended concept - correct field-level values, no transposed or placeholder entries.

Metrics

Accuracy

The aggregate accuracy score combining all three dimensions (range: 0.0 to 1.0).

Score interpretation:

0.0: Dataset is entirely inaccurate across one or more dimensions - unusable for AI processing.
0.5: Half of attribute values are inaccurate - significant data quality issues that will cause incorrect model outputs.
0.9: 90% of attribute values are accurate - minor inaccuracies present. Acceptability depends on the sensitivity of the downstream model.
1.0: Perfect accuracy - all attribute values are syntactically correct, correctly typed, and semantically accurate.

Syntactic Accuracy

The fraction of attribute values that conform to the domain definition - correct encoding, no character corruption, no spelling or grammar errors in text fields (range: 0.0 to 1.0).

Score interpretation:

0.0: No values conform to the domain - systematic encoding corruption or formatting errors throughout the dataset.
0.5: Half of values have syntactic issues - widespread encoding or formatting errors that require remediation before use.
0.9: Minor syntactic issues - 10% of values are malformed. May be acceptable depending on whether affected fields are critical.
1.0: All values conform to the domain definition - no encoding corruption or formatting errors.

Data Type Accuracy

The fraction of attribute values that are represented with the correct type - correct numeric formats, correct date representations, correct null encodings (range: 0.0 to 1.0).

Score interpretation:

0.0: No values use the correct type - systematic type mismatches that will cause silent data loss or misinterpretation in downstream processing.
0.5: Half of values have type mismatches - significant schema violations that require a transformation or re-ingestion step.
0.9: Minor type issues - 10% of values have type mismatches. Impact depends on whether affected fields are used by the model.
1.0: All values are represented with the correct type - no type mismatches or non-standard null encodings.

Semantic Accuracy

The fraction of attribute values that correctly represent the true value of the intended concept - correct field-level values, no transposed or placeholder entries (range: 0.0 to 1.0).

Score interpretation:

0.0: No attribute values correctly represent the intended concept - entirely incorrect or placeholder values throughout the dataset.
0.5: Half of values are semantically inaccurate - significant value-level errors that will directly degrade model training or inference quality.
0.9: Minor semantic inaccuracies - 10% of values are incorrect. Manual spot-checking is recommended to confirm whether errors are systematic.
1.0: All attribute values correctly represent the intended concept - no incorrect, transposed, or placeholder entries.

Motivation

Inaccurate data propagates silently through AI pipelines. Encoding corruption, type mismatches, and wrong field values each produce incorrect model outputs - yet none of them raises an obvious error at ingestion time. A model trained or evaluated on inaccurate tabular data learns the wrong patterns, produces wrong predictions, and reports metrics that do not reflect real-world performance.

Accuracy failures fall into three distinct categories, each requiring a different remediation strategy. Measuring them independently makes each failure mode visible so that remediation can be targeted rather than chasing a single opaque aggregate score.

Methodology

Samples: Each row in the dataset is one sample and is scored independently of the other rows.
Scoring: Three scorers are applied to every row, one per accuracy dimension:
- Syntactic Scorer: checks whether values conform to their domain definition (encoding, formatting, spelling).
- Type Scorer: checks whether values use the correct type (numeric formats, date representations, null encodings).
- Semantic Scorer: checks whether values correctly represent the intended concept (field-level correctness, no transposed or placeholder entries).
Each scorer produces an independent score from 0.0 to 1.0. Per-dimension and aggregate metrics are computed by averaging these scores across all rows.
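The averaging step described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the per-row scores are made up, and combining the three dimensions with an unweighted mean is an assumption, since the text does not state how the aggregate is weighted.

```python
from statistics import mean

DIMENSIONS = ("syntactic", "type", "semantic")

# Hypothetical per-row scores, one dict per row, as the three scorers might emit them.
row_scores = [
    {"syntactic": 1.0, "type": 1.0, "semantic": 1.0},
    {"syntactic": 0.5, "type": 1.0, "semantic": 1.0},
    {"syntactic": 1.0, "type": 0.5, "semantic": 0.0},
]


def aggregate(rows):
    """Average each dimension across rows, then combine the dimensions."""
    per_dimension = {d: mean(r[d] for r in rows) for d in DIMENSIONS}
    # Assumption: the overall score is the unweighted mean of the three dimensions.
    overall = mean(per_dimension.values())
    return per_dimension, overall


per_dimension, overall = aggregate(row_scores)
print(per_dimension, overall)
```

Keeping the per-dimension values alongside the overall score preserves the visibility into individual failure modes that the Motivation section argues for.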

Scoring

Syntactic Accuracy Scorer

Score values and their explanations:

1.0: Every configured field in the row conforms to the domain definition - no encoding corruption, character substitution, spelling or grammar errors detected.
0.6: 60% of configured fields are syntactically correct. Some fields contain correctable encoding artefacts or minor formatting errors; the affected values are recoverable but require remediation.
0.0: No configured field conforms to the domain - systematic encoding corruption or formatting errors throughout the row render it unusable without re-ingestion.
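As a rough illustration of a row-level syntactic score computed as the fraction of configured fields that pass, consider the sketch below. The field names, the validators, and the `SYNTACTIC_CHECKS` mapping are all hypothetical, and the sketch covers format checks only; spelling, grammar, and encoding-corruption checks are out of scope here.

```python
import re
from datetime import date


def is_swiss_uid(value: str) -> bool:
    # Layout check only (CHE-123.456.789); the check digit is not verified.
    return re.fullmatch(r"CHE-\d{3}\.\d{3}\.\d{3}", value) is not None


def is_iso_date(value: str) -> bool:
    # Require the YYYY-MM-DD layout, then confirm it is a real calendar date.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is None:
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_currency_code(value: str) -> bool:
    # Three uppercase letters; membership in a currency list is not checked.
    return re.fullmatch(r"[A-Z]{3}", value) is not None


# Hypothetical configuration: each configured field maps to its validator.
SYNTACTIC_CHECKS = {
    "client_id": is_swiss_uid,
    "transaction_date": is_iso_date,
    "currency": is_currency_code,
}


def syntactic_score(row: dict) -> float:
    """Fraction of configured fields whose value passes its validator."""
    passed = sum(1 for f, check in SYNTACTIC_CHECKS.items() if check(str(row[f])))
    return passed / len(SYNTACTIC_CHECKS)


row = {"client_id": "CHE-123.456.789", "transaction_date": "2024-03-15", "currency": "chf"}
score = syntactic_score(row)  # two of the three configured fields pass
```

Only fields listed in the configuration count toward the denominator, which mirrors the "configured fields" wording of the score explanations.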

Data Type Accuracy Scorer

Score values and their explanations:

1.0: Every configured field is represented with the correct type - correct numeric formats, correct date representations, correct null encodings.
0.6: 60% of configured fields have the correct type. Some fields contain type mismatches (e.g. a numeric value stored as a string, or a date stored as a free-text string) that will cause silent errors or require casting before use.
0.0: No configured field uses the correct type - systematic type mismatches that will cause silent data loss or misinterpretation in every downstream operation.
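A type check of this kind might look like the sketch below. The schema, the list of non-standard null sentinels, and the decision to treat `None` as the correct null encoding are all assumptions made for illustration; the real scorer's rules are not specified here.

```python
from datetime import date
from numbers import Real

# Hypothetical examples of non-standard null encodings.
NULL_SENTINELS = {"", "N/A", "null", "NULL", "-"}


def is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def is_iso_date_string(value) -> bool:
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_string(value) -> bool:
    return isinstance(value, str)


# Hypothetical schema: each configured field maps to a type predicate.
TYPE_CHECKS = {
    "amount_chf": is_number,
    "transaction_date": is_iso_date_string,
    "currency": is_string,
}


def type_ok(value, check) -> bool:
    if value is None:
        return True  # assumption: None is the standard null encoding
    if isinstance(value, str) and value.strip() in NULL_SENTINELS:
        return False  # a sentinel string standing in for a null
    return check(value)


def type_score(row: dict) -> float:
    """Fraction of configured fields whose value has the expected type."""
    passed = sum(1 for f, check in TYPE_CHECKS.items() if type_ok(row[f], check))
    return passed / len(TYPE_CHECKS)


row = {"amount_chf": "12500", "transaction_date": "2024-03-15", "currency": "CHF"}
score = type_score(row)  # amount_chf is a string, so two of three fields pass
```

Note that `"12500"` fails even though it could be cast to a number: the sketch scores the stored representation, not what a lenient parser could recover.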

Semantic Accuracy Scorer

Score values and their explanations:

1.0: Every configured field correctly represents the true value of the intended concept - all values match the source record with no transpositions or placeholders.
0.6: 60% of configured fields are semantically correct. Some fields contain value-level errors such as transposed amounts or approximate rather than exact values.
0.0: No configured field correctly represents the intended concept - all values are factually incorrect or are placeholder entries that do not correspond to the source record.
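Where a trusted source record is available, a semantic check can be approximated by comparing each configured field against it. The sketch below assumes exactly that; the placeholder list, field names, and exact-equality comparison are hypothetical simplifications, and a real scorer may need fuzzier matching or human judgment.

```python
# Hypothetical placeholder values that never count as semantically correct.
PLACEHOLDERS = {"", "TBD", "UNKNOWN", "N/A"}


def semantic_score(row: dict, source: dict, fields: list) -> float:
    """Fraction of configured fields whose value matches the source record."""
    correct = 0
    for field in fields:
        value = row[field]
        if isinstance(value, str) and value.strip().upper() in PLACEHOLDERS:
            continue  # placeholder entry, counted as incorrect
        if value == source[field]:
            correct += 1
    return correct / len(fields)


source = {"amount_chf": 12500, "asset_class": "Equity", "currency": "CHF"}
# amount_chf has two digits transposed and currency is a placeholder.
row = {"amount_chf": 15200, "asset_class": "Equity", "currency": "TBD"}

score = semantic_score(row, source, list(source))  # one of three fields matches
```

The transposed amount is well-formed and correctly typed, which is why it is caught only at this stage and not by the syntactic or type checks.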

Examples

Accurate row - all three scorers pass (passing)

Sample

client_id: CHE-123.456.789
transaction_date: 2024-03-15
amount_chf: 12500
asset_class: Equity
currency: CHF

Syntactic Scorer

1.0: All fields are correctly formatted. client_id follows the Swiss UID format, transaction_date is ISO 8601, and asset_class has no spelling errors.

Type Scorer

1.0: amount_chf is a float as expected, transaction_date is a valid date string, and currency is a three-letter string code.

Semantic Scorer

1.0: All values correctly represent the intended concepts. The transaction date, amount, and asset class match the source record.

Multiple errors - formatting issues and a type mismatch (failing)

Sample

client_id: CHE123456789
transaction_date: 15.03.2024
amount_chf: twelve thousand five hundred
asset_class: Equity
currency: CHF

Syntactic Scorer

0.8: client_id is missing the hyphen and dot separators required by the Swiss UID format ('CHE123456789' instead of 'CHE-123.456.789'). transaction_date uses a non-standard format ('15.03.2024' instead of ISO 8601). The remaining fields are syntactically correct.

Type Scorer

0.8: amount_chf is expected to be a numeric value but contains a natural-language string. There is no reliable programmatic conversion, making the value unrecoverable without manual intervention. The remaining fields have the correct type.

Semantic Scorer

1.0: Despite the formatting and type issues, the intended values are identifiable and match the source record. No semantic errors detected.

Run Evaluation in LatticeFlow AI Platform

Use the following CLI command to initialize and run the evaluation in LatticeFlow AI Platform.

Requires LatticeFlow AI Platform CLI

lf init --atlas structured_data_accuracy
