Candidate Screening Accuracy
Performance
Overview
The Candidate Screening Accuracy evaluation measures how accurately an AI candidate screening system classifies job applicants against individual job requirements. A job posting consists of multiple requirements; for each requirement, the system under test must predict one of four outcomes - MATCH, NO_MATCH, UNKNOWN, or ERROR - given applicant data such as a CV and qualification question-answer pairs. Each prediction is compared against a human-annotated ground-truth label for that specific requirement.
Beyond raw classification accuracy, the evaluation tracks the direction of errors: a prediction that is more favourable than the ground truth (e.g. predicting MATCH when the annotator said NO_MATCH) harms the hiring organisation by surfacing unqualified candidates, while a prediction that is less favourable (e.g. predicting NO_MATCH for a genuine MATCH) harms the candidate by incorrectly excluding them from consideration for that requirement. Both error directions are measured separately so the trade-offs of the screening system can be assessed.
Metrics
Accuracy
Overall multi-class classification accuracy of the job requirement predictions across the MATCH / NO_MATCH / UNKNOWN / ERROR labels. A higher score indicates that the system's predictions align closely with human annotations.
Candidate Harm Rate
The proportion of predictions that are less favourable than the ground truth label (e.g. predicting NO_MATCH or UNKNOWN when the annotator said MATCH). A lower rate is better (range: 0.0 to 1.0).
Candidate Benefit Rate
The proportion of predictions that are more favourable than the ground truth label (e.g. predicting MATCH when the annotator said NO_MATCH), surfacing candidates the annotator judged unqualified. A lower rate is better (range: 0.0 to 1.0).
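The three metrics above can be sketched as follows. The favourability ordering of MATCH > UNKNOWN > NO_MATCH > ERROR (in particular where ERROR sits) and all function and variable names are illustrative assumptions, not part of the evaluation's published definition:

```python
from collections import Counter

# Assumed favourability ordering over the four outcome labels; the
# placement of ERROR at the bottom is an illustrative assumption.
FAVOURABILITY = {"MATCH": 3, "UNKNOWN": 2, "NO_MATCH": 1, "ERROR": 0}

def screening_metrics(predictions, ground_truths):
    """Compute accuracy, candidate harm rate, and candidate benefit rate
    over paired per-requirement predictions and human annotations."""
    assert len(predictions) == len(ground_truths) and predictions
    counts = Counter()
    for pred, truth in zip(predictions, ground_truths):
        if pred == truth:
            counts["accurate"] += 1
        elif FAVOURABILITY[pred] < FAVOURABILITY[truth]:
            counts["candidate_harm"] += 1      # less favourable than annotation
        else:
            counts["candidate_benefit"] += 1   # more favourable than annotation
    n = len(predictions)
    return {
        "accuracy": counts["accurate"] / n,
        "candidate_harm_rate": counts["candidate_harm"] / n,
        "candidate_benefit_rate": counts["candidate_benefit"] / n,
    }

preds  = ["MATCH", "NO_MATCH", "UNKNOWN", "MATCH"]
truths = ["MATCH", "MATCH",    "UNKNOWN", "NO_MATCH"]
print(screening_metrics(preds, truths))
# {'accuracy': 0.5, 'candidate_harm_rate': 0.25, 'candidate_benefit_rate': 0.25}
```

Because every prediction falls into exactly one bucket, the three rates always sum to 1.0, which makes the trade-off between the two error directions easy to read off.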
Motivation
Automated candidate screening is increasingly used to filter large applicant pools before human review. Job postings are structured around multiple discrete requirements - skills, qualifications, experience - and the screening system evaluates each job requirement independently. Errors at the requirement level compound: a single false rejection on a mandatory requirement excludes a candidate from advancing, even if all other requirements would have been met. Tracking accuracy at the requirement level therefore gives a more precise picture of where the system fails than a single pass/fail judgement per applicant.
The direction of each error matters as much as the error rate itself. A system that hedges by predicting UNKNOWN for borderline requirements may appear accurate on the aggregate metric but could mask a systematic tendency to penalise candidates from under-represented groups whose CVs do not match the surface-level patterns the model associates with a MATCH label. Tracking candidate harm rate separately at the requirement level makes this failure mode visible before it affects hiring outcomes.
Methodology
- Test Cases: Each test case targets a single job requirement within a job posting. It consists of one job requirement, applicant data such as a CV or a set of qualification question-answer pairs, and a ground-truth outcome label.
- Prediction: The system under test receives the job requirement and applicant data as input and produces a structured response containing a match prediction (one of MATCH, NO_MATCH, UNKNOWN, or ERROR).
- Scoring: The predicted outcome is compared against the ground truth outcome for that requirement. Cases where the prediction equals the annotation score as accurate; predictions that are less favourable score as candidate_harm; predictions that are more favourable score as candidate_benefit.
Scoring
Accuracy Scorer
Candidate Harm Scorer
Candidate Benefit Scorer
Examples
Correct MATCH prediction
Candidate harm - requirement MATCH incorrectly predicted as NO_MATCH
Candidate benefit - requirement NO_MATCH incorrectly predicted as MATCH