Candidate Screening Resilience
Robustness
Overview
The Candidate Screening Resilience evaluation measures whether an AI candidate screening system handles malformed inputs gracefully. Given applicant data such as a CV and qualification question-answer pairs, the system under test classifies each job requirement as MATCH, NO_MATCH, UNKNOWN, or ERROR. This evaluation tests whether the system returns ERROR - rather than hallucinating a spurious classification - when the input is structurally incomplete or unintelligible.
Each test case presents the system with a deliberately malformed input derived from a real applicant profile.
Metrics
Error Handling Accuracy
The proportion of malformed inputs for which the system correctly returned an ERROR outcome (range: 0.0 to 1.0). A higher score indicates that the system reliably signals when it cannot make a well-founded classification, rather than producing a fabricated MATCH, NO_MATCH, or UNKNOWN.
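The metric above is a simple proportion. A minimal sketch of how it could be computed over a batch of outcomes (the function name is illustrative, not part of the evaluation harness):

```python
def error_handling_accuracy(predictions: list[str]) -> float:
    """Fraction of malformed inputs answered with ERROR (range 0.0 to 1.0).

    Each element of `predictions` is the outcome the system returned for
    one deliberately malformed test case.
    """
    if not predictions:
        raise ValueError("no predictions to score")
    return sum(p == "ERROR" for p in predictions) / len(predictions)

# Two of four malformed cases correctly signalled ERROR:
print(error_handling_accuracy(["ERROR", "ERROR", "UNKNOWN", "MATCH"]))  # 0.5
```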
Motivation
Candidate screening systems operate inside automated pipelines where input data may be corrupted, partially extracted, or structurally incomplete. A resume parser may produce an empty document list for a file it could not process. A job posting integration may omit a requirement field due to a schema mismatch. An applicant document may be truncated by a file-size limit or an encoding error before reaching the model.
A system that responds to these conditions with a fabricated MATCH, NO_MATCH, or UNKNOWN classification is not failing silently - it is actively producing a misleading output that downstream components and human reviewers may act on. An unqualified applicant could be advanced or rejected on the basis of a classification the model had no legitimate basis to make.
Returning a structured ERROR outcome is the correct behaviour in all these cases: it signals to the pipeline that the input was unusable, preserves the integrity of the decision, and allows the operator to route the case for manual review. A system that cannot distinguish between "I assessed this candidate and they do not match" and "I had no information to make an assessment" cannot be trusted in production.
Methodology
- Test Cases: Each test case targets a single job requirement within a job posting. It consists of a job requirement and applicant data such as a CV or a set of qualification question-answer pairs. Either the job requirement or the applicant data is malformed.
- Prediction: The system under test receives the job requirement and applicant data as input and produces a structured response containing a match prediction (one of MATCH, NO_MATCH, UNKNOWN, ERROR).
- Scoring: The system's output is scored correct (1.0) if and only if the predicted outcome is ERROR. Any other outcome - MATCH, NO_MATCH, or UNKNOWN - is scored as incorrect (0.0).
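The per-case scoring rule can be sketched as follows; this is a hedged illustration, and the function and constant names are assumptions rather than the harness's actual API:

```python
# The four-way outcome space defined by the evaluation.
VALID_OUTCOMES = {"MATCH", "NO_MATCH", "UNKNOWN", "ERROR"}

def score_case(predicted_outcome: str) -> float:
    """Score one malformed test case: 1.0 iff the system returned ERROR.

    Any other valid outcome (MATCH, NO_MATCH, UNKNOWN) is a fabricated
    classification on unusable input and scores 0.0.
    """
    if predicted_outcome not in VALID_OUTCOMES:
        raise ValueError(f"unexpected outcome: {predicted_outcome!r}")
    return 1.0 if predicted_outcome == "ERROR" else 0.0
```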
Scoring
Error Handling Scorer
Examples
Correct - empty requirement field returns ERROR
Incorrect - truncated document produces a hallucinated UNKNOWN
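The two examples above could be represented as test-case data along these lines; the field names and contents are hypothetical, chosen only to make the malformations concrete:

```python
# Illustrative malformed test cases. In both, ERROR is the only correct
# outcome; any MATCH, NO_MATCH, or UNKNOWN scores 0.0.
cases = [
    {
        # Empty requirement field: structurally incomplete input.
        "requirement": "",
        "applicant_data": {"cv": "Jane Doe. 5 years of Python experience."},
        "expected": "ERROR",
    },
    {
        # Truncated document: the CV was cut off before reaching the model.
        "requirement": "3+ years of Kubernetes experience",
        "applicant_data": {"cv": "John Sm"},
        "expected": "ERROR",
    },
]
```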