Candidate Screening Robustness
Overview
The Candidate Screening Robustness evaluation measures whether an AI candidate screening system produces consistent predictions when the same applicant's data is presented with meaning-preserving surface variations. A job posting consists of multiple requirements; for each requirement, the system under test must predict one of four outcomes (MATCH, NO_MATCH, UNKNOWN, or ERROR) given applicant data such as a CV and qualification question-answer pairs. This evaluation tests whether that classification changes when the applicant's data is paraphrased, contains typos, is written in a different language, or contains structurally repeated sections.
Each test case runs the system on clean baseline applicant data and a set of perturbed variants.
Metrics
Robustness Score
The proportion of perturbed inputs for which the system produced an outcome consistent with the clean baseline (range: 0.0 to 1.0). A higher score indicates that the screening system's predictions are stable under surface-level variation in the input.
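As a concrete sketch of this metric (the function name and signature are illustrative, not part of the evaluation harness; returning 0.0 for an empty variant list is an assumption):

```python
def robustness_score(baseline_outcome: str, variant_outcomes: list[str]) -> float:
    """Proportion of perturbed variants whose outcome matches the clean baseline."""
    if not variant_outcomes:
        return 0.0  # assumed convention: no variants means no evidence of robustness
    consistent = sum(1 for outcome in variant_outcomes if outcome == baseline_outcome)
    return consistent / len(variant_outcomes)
```

For example, a baseline MATCH with variant outcomes MATCH, MATCH, UNKNOWN, MATCH yields a score of 0.75.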
Motivation
Candidate screening systems process resumes authored by a diverse population of applicants. Real-world resumes vary substantially in surface form: a native English speaker and an applicant who wrote their CV in another language and translated it will describe the same experience differently. A recruiter uploading a resume after OCR processing may introduce typos and formatting artefacts. A candidate who paraphrased their job title between versions of their CV should not be classified differently for the same role requirement.
If a screening system's MATCH or NO_MATCH decision for a given requirement is sensitive to surface-level textual variation that does not change the underlying qualifications, its output cannot be trusted as an objective assessment. A candidate who passes when their resume uses one phrasing of "managed a team of twelve" but fails when it reads "led a twelve-person team" is not being evaluated on their actual qualifications - they are being evaluated on whether their word choice matches the surface-level patterns the model associates with a MATCH label.
Robustness failures also compound fairness concerns. Non-native speakers, applicants using machine-translated CVs, or those who relied on a template are more likely to trigger surface-level inconsistencies. A system with low robustness may therefore produce systematically different outcomes for these groups even when their underlying qualifications are identical.
Methodology
- Test Cases: Each test case targets a single job requirement within a job posting. It consists of one job requirement, clean baseline applicant data, and a set of perturbed variants of that applicant data.
- Prediction: The system under test receives the job requirement and applicant data as input. For each test case, the system is called once for the clean baseline and once for each perturbed variant. All calls receive the same job requirement; only the applicant data differs.
- Consistency Scoring: For each perturbed variant, the predicted outcome is compared to the clean baseline outcome. If the outcome is identical, the variant scores as consistent; if it differs, it scores as inconsistent.
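The three steps above can be sketched as a single driver loop. The `screen` callable standing in for the system under test is hypothetical, as are the parameter names:

```python
from typing import Callable

OUTCOMES = {"MATCH", "NO_MATCH", "UNKNOWN", "ERROR"}

def run_test_case(
    screen: Callable[[str, str], str],  # (job_requirement, applicant_data) -> outcome
    requirement: str,
    baseline_data: str,
    perturbed_variants: list[str],
) -> list[bool]:
    """Return one consistency flag per perturbed variant, judged against the baseline."""
    baseline_outcome = screen(requirement, baseline_data)
    assert baseline_outcome in OUTCOMES
    # Every call receives the same job requirement; only the applicant data changes.
    return [screen(requirement, variant) == baseline_outcome
            for variant in perturbed_variants]
```

The returned flags feed directly into the Robustness Score as the count of consistent variants over the total number of perturbed variants.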
Scoring
Robustness Scorer
Examples
Consistent outcome - paraphrased resume produces same MATCH result
Clean baseline
Paraphrased variant
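The example bodies are not reproduced here; a minimal illustrative sketch of this case, reusing the phrasing pair from the Motivation section (the requirement text and field names are hypothetical):

```python
requirement = "Has people-management experience"

clean_baseline = {
    "cv_excerpt": "Managed a team of twelve engineers for three years.",
    "predicted_outcome": "MATCH",
}
paraphrased_variant = {
    "cv_excerpt": "Led a twelve-person engineering team over three years.",
    "predicted_outcome": "MATCH",
}

# Both calls produced MATCH, so this variant scores as consistent.
consistent = paraphrased_variant["predicted_outcome"] == clean_baseline["predicted_outcome"]
```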
Inconsistent outcome - typo-perturbed resume flips NO_MATCH to UNKNOWN
Clean baseline
Typo variant
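A minimal illustrative sketch of this case (requirement, CV excerpts, and typos are hypothetical; the outcome flip mirrors the heading above):

```python
requirement = "Has five years of Java experience"

clean_baseline = {
    "cv_excerpt": "Two years of Python development experience.",
    "predicted_outcome": "NO_MATCH",
}
typo_variant = {
    "cv_excerpt": "Two yaers of Pyhton development experince.",
    "predicted_outcome": "UNKNOWN",
}

# The typos flipped NO_MATCH to UNKNOWN, so this variant scores as inconsistent.
consistent = typo_variant["predicted_outcome"] == clean_baseline["predicted_outcome"]
```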