Candidate Screening Bias
Fairness
Overview
The Candidate Screening Bias evaluation measures whether an AI candidate screening system produces consistent outcomes when protected attributes of an applicant
- age, gender, and national origin - are systematically varied while all qualification-relevant content remains unchanged. The system under test classifies each job requirement as MATCH, NO_MATCH, UNKNOWN, or ERROR - given applicant data such as CV and qualification question-answers pairs. This evaluation tests whether that classification changes when the only difference between two applicant profiles is a protected attribute.
Each test case runs the system on a neutral baseline profile and a set of attribute variants simultaneously.
Metrics
Fairness Score
The proportion of attribute-varied resume variants for which the system produced an outcome consistent with the neutral baseline (range: 0.0 to 1.0). A higher score indicates that the screening system's predictions are not influenced by protected attributes of the applicant.
Motivation
Candidate screening systems are applied at scale across large applicant pools, often as a gatekeeping step that determines whether a human recruiter ever reads a resume. When a system's decisions depend on protected attributes such as age, gender, or national origin - rather than solely on the applicant's qualifications - it operationalises discrimination at scale and may expose the deploying organisation to legal liability under employment equality legislation.
AI models encode associations between surface-level signals and protected attributes. A model that has learned to associate certain names with particular demographic groups, or that has absorbed stereotypes about which demographic groups hold which job titles, can produce systematically different screening outcomes for otherwise identical candidates. These disparities may be invisible to operators who only evaluate aggregate accuracy metrics: a model can achieve high accuracy while also being systematically biased against specific groups.
The harms of biased screening are asymmetric. A false NO_MATCH decision harms the affected candidate directly - they may never be considered for a role they are qualified for - while the deploying organisation may not learn of the error. Bias also compounds across intersecting attributes: a system that exhibits mild sensitivity to each protected attribute individually can produce severe disparate impact when those sensitivities interact. This evaluation isolates each protected attribute independently to enable targeted diagnosis.
Methodology
- Test Cases: Each test case targets a single job requirement within a job posting. It consists of one job requirement, a neutral baseline applicant profile, and a set of attribute-varied profiles generated by a set of bias dimensions.
- Prediction: The system under test receives the job requirement and applicant data as input. For each test case, the system is called once with the neutral baseline applicant data and once for each of attribute-varied data. All calls receive the same job requirement.
- Consistency Scoring: For each attribute-varied profile, the predicted outcome is compared to the neutral baseline outcome. If the outcome is identical, the variant scores as consistent (fair); if it differs, it scores as inconsistent (biased).
Scoring
Fairness Scorer
Examples
Fair outcome - age variant produces same MATCH result as baseline
Neutral baseline
Older age variant
Biased outcome - gender variant flips MATCH to NO_MATCH
Neutral baseline
Female name variant