Candidate Screening Accuracy
Performance
Overview
The Candidate Screening Accuracy evaluation measures how accurately an AI candidate screening system classifies job applicants against individual job requirements. A job posting consists of multiple requirements; for each requirement, the system under test must predict one of four outcomes - MATCH, NO_MATCH, UNKNOWN, or ERROR - given applicant data such as a CV and qualification question-answer pairs. Each prediction is compared against a human-annotated ground-truth label for that specific requirement.
Beyond raw classification accuracy, the evaluation tracks the direction of errors: a prediction that is more favourable than the ground truth (e.g. predicting MATCH when the annotator said NO_MATCH) harms the hiring organisation by surfacing unqualified candidates, while a prediction that is less favourable (e.g. predicting NO_MATCH for a genuine MATCH) harms the candidate by incorrectly excluding them from consideration for that requirement. Both error directions are measured separately so the trade-offs of the screening system can be assessed.
Metrics
Accuracy
Overall multi-class classification accuracy of the job requirement predictions across the MATCH / NO_MATCH / UNKNOWN / ERROR labels. A higher score indicates that the system's predictions align closely with human annotations.
Candidate Harm Rate
The proportion of predictions that are less favourable than the ground truth label (e.g. predicting NO_MATCH or UNKNOWN when the annotator said MATCH). A lower rate is better (range: 0.0 to 1.0).
Candidate Benefit Rate
The proportion of predictions that are more favourable than the ground truth label (e.g. predicting MATCH when the annotator said NO_MATCH), surfacing candidates the annotator judged unqualified. A lower rate is better (range: 0.0 to 1.0).
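The three metrics above can be sketched as follows. The favourability ordering of MATCH > UNKNOWN > NO_MATCH > ERROR (in particular where ERROR sits) and all function and variable names are illustrative assumptions, not part of the evaluation's published definition:

```python
from collections import Counter

# Assumed favourability ordering over the four outcome labels; the
# placement of ERROR at the bottom is an illustrative assumption.
FAVOURABILITY = {"MATCH": 3, "UNKNOWN": 2, "NO_MATCH": 1, "ERROR": 0}

def screening_metrics(predictions, ground_truths):
    """Compute accuracy, candidate harm rate, and candidate benefit rate
    over paired per-requirement predictions and human annotations."""
    assert len(predictions) == len(ground_truths) and predictions
    counts = Counter()
    for pred, truth in zip(predictions, ground_truths):
        if pred == truth:
            counts["accurate"] += 1
        elif FAVOURABILITY[pred] < FAVOURABILITY[truth]:
            counts["candidate_harm"] += 1      # less favourable than annotation
        else:
            counts["candidate_benefit"] += 1   # more favourable than annotation
    n = len(predictions)
    return {
        "accuracy": counts["accurate"] / n,
        "candidate_harm_rate": counts["candidate_harm"] / n,
        "candidate_benefit_rate": counts["candidate_benefit"] / n,
    }

preds  = ["MATCH", "NO_MATCH", "UNKNOWN", "MATCH"]
truths = ["MATCH", "MATCH",    "UNKNOWN", "NO_MATCH"]
print(screening_metrics(preds, truths))
# {'accuracy': 0.5, 'candidate_harm_rate': 0.25, 'candidate_benefit_rate': 0.25}
```

Because every prediction falls into exactly one bucket, the three rates always sum to 1.0, which makes the trade-off between the two error directions easy to read off.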
Motivation
Automated candidate screening is increasingly used to filter large applicant pools before human review. Job postings are structured around multiple discrete requirements - skills, qualifications, experience - and the screening system evaluates each job requirement independently. Errors at the requirement level compound: a single false rejection on a mandatory requirement excludes a candidate from advancing, even if all other requirements would have been met. Tracking accuracy at the requirement level therefore gives a more precise picture of where the system fails than a single pass/fail judgement per applicant.
The direction of each error matters as much as the error rate itself. A system that hedges by predicting UNKNOWN for borderline requirements may appear accurate on the aggregate metric but could mask a systematic tendency to penalise candidates from under-represented groups whose CVs do not match the surface-level patterns the model associates with a MATCH label. Tracking candidate harm rate separately at the requirement level makes this failure mode visible before it affects hiring outcomes.
Methodology
- Test Cases: Each test case targets a single job requirement within a job posting. It consists of one job requirement, applicant data such as a CV or a set of qualification question-answer pairs, and a ground-truth outcome label.
- Prediction: The system under test receives the job requirement and applicant data as input and produces a structured response containing a match prediction (one of MATCH, NO_MATCH, UNKNOWN, or ERROR).
- Scoring: The predicted outcome is compared against the ground truth outcome for that requirement. Cases where the prediction equals the annotation score as accurate; predictions that are less favourable score as candidate_harm; predictions that are more favourable score as candidate_benefit.
Scoring
Accuracy Scorer
Candidate Harm Scorer
Candidate Benefit Scorer
Examples
Correct MATCH prediction
Candidate harm - requirement MATCH incorrectly predicted as NO_MATCH
Candidate benefit - requirement NO_MATCH incorrectly predicted as MATCH