
Taking Each At Their Best

February 27, 2026

Whether AI “exceeds” human reliability in essay grading depends entirely on which humans you measure, and under what conditions. Under ideal conditions, with extensively trained expert raters, well-designed rubrics, and rigorous calibration, human inter-rater reliability can reach a quadratic weighted kappa (QWK) of 0.95 (Wendler, Glazer & Cline, 2019). On less controlled datasets, modern automated essay scoring (AES) systems now routinely exceed human agreement. The gap says less about AI capability than about the variance in human performance.
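QWK, the agreement statistic used throughout this literature, penalizes each disagreement by the squared distance between the two scores, so raters who land close but not identical still get credit. A minimal sketch of computing it with scikit-learn, using invented ratings:

```python
# Quadratic weighted kappa (QWK) between two raters scoring the same essays.
# Disagreements are weighted by squared distance: being off by two points
# costs four times as much as being off by one. Ratings here are invented
# for illustration, on a hypothetical 1-5 scale.
from sklearn.metrics import cohen_kappa_score

rater_a = [4, 3, 5, 2, 4, 3, 1, 4, 5, 2]
rater_b = [4, 3, 4, 2, 5, 3, 2, 4, 5, 3]

qwk = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```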

The gold standard comes from large-scale testing organizations like ETS. Their raters practice on benchmark essays, receive accuracy feedback, and must demonstrate alignment with principal examiners before scoring operational responses. Ricker-Pedley (2011) found that even a short calibration test of 10 responses correctly classified raters 87% of the time; Wendler, Glazer, and Cline (2019) showed that calibration effects persist through at least three scoring days. This investment in training and monitoring is what produces the 0.95 QWK figures testing organizations report.

The ASAP (Automated Student Assessment Prize) dataset tells a different story. This benchmark contains approximately 13,000 essays across eight prompts, each scored by two human raters (The Hewlett Foundation, 2012). Human inter-rater reliability hovers around 0.77 QWK – respectable, but far from 0.95. The ASAP raters, while trained, did not undergo ETS-level calibration and monitoring.

Modern AES systems have made strong progress on this benchmark. Neural Pairwise Contrastive Regression (NPCR) by Xie et al. (2022) combines contrastive learning with BERT embeddings and Siamese networks to predict score differences between essay pairs rather than absolute scores directly (the pairwise idea is sketched below). Jiao, Choi, and Hua (2025) achieved 0.870 QWK on ASAP essay set 6 by incorporating rationales from large language models. These scores exceed the 0.77 QWK human-human agreement on the same dataset. But context matters: a 2023 meta-analysis by Yoon found a mean correlation of 0.78 between automated and human scoring across multiple studies, with negligible effect size differences. Where AI appears to outperform humans, it typically reflects datasets where human agreement was already low, not superhuman accuracy. ETS uses its automated scoring engine, e-rater, as a quality control mechanism rather than a replacement, flagging discrepancies for additional human review (Attali & Burstein, 2006).
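To make the pairwise idea concrete, here is a minimal sketch in the spirit of NPCR: a shared (Siamese) encoder embeds two essays, a head regresses their score difference, and a new essay is scored by comparison against reference essays of known score. This is an illustration of the technique, not the authors' implementation; the placeholder projection, dimensions, and random embeddings are all assumptions standing in for BERT.

```python
# Sketch of pairwise score-difference regression in the spirit of NPCR
# (Xie et al., 2022). Training (not shown) would minimize the error
# between predicted and true score differences for essay pairs.
import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # In NPCR the encoder is BERT; here a placeholder projection.
        self.encoder = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU())
        self.diff_head = nn.Linear(256, 1)  # predicts score(a) - score(b)

    def forward(self, essay_a: torch.Tensor, essay_b: torch.Tensor):
        ha, hb = self.encoder(essay_a), self.encoder(essay_b)
        return self.diff_head(ha - hb).squeeze(-1)

# Inference: compare a new essay against references with known scores,
# add the predicted differences back, and average.
model = PairwiseScorer()
new_essay = torch.randn(1, 768)              # placeholder embedding
refs = torch.randn(5, 768)                   # 5 reference essays
ref_scores = torch.tensor([2., 3., 3., 4., 5.])
with torch.no_grad():
    diffs = model(new_essay.expand(5, -1), refs)
    print("predicted score:", (ref_scores + diffs).mean().item())
```

Averaging predicted differences over many references lets individual errors cancel, which is part of the motivation for the pairwise formulation.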

The distinction between consistency and accuracy explains why. AI systems apply the same criteria the same way every time, immune to fatigue, mood, or order effects. Ofqual (Holmes, Black & Morin, 2020) found that individual human examiners grading AS History achieved a Spearman's rho of only 0.47 against principal examiner judgments, improving to 0.62 when the marks of eight examiners were averaged. AI achieves this consistency in a single pass. But consistency is not accuracy: expert human raters recognize edge cases, creative responses, and nuanced arguments that do not fit neatly into rubric categories.
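A toy simulation makes the averaging effect visible: each simulated examiner sees true essay quality plus independent noise, and averaging several examiners cancels much of that noise. The noise level is an assumption tuned for illustration, and the toy gains exceed Ofqual's 0.47-to-0.62 improvement because real examiners share correlated biases that averaging cannot remove.

```python
# Toy simulation of the Ofqual pattern: one noisy examiner tracks latent
# quality weakly; averaging eight tracks it much better. The noise scale
# is an assumption chosen so a single examiner lands near rho = 0.47; it
# is not a parameter from the Ofqual study.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
true_quality = rng.normal(size=500)          # latent essay quality

def examiner_marks(n_examiners: int) -> np.ndarray:
    # Each examiner sees true quality plus independent judgment noise.
    noise = rng.normal(scale=1.9, size=(n_examiners, 500))
    return (true_quality + noise).mean(axis=0)

for k in (1, 8):
    rho, _ = spearmanr(true_quality, examiner_marks(k))
    print(f"{k} examiner(s): Spearman rho = {rho:.2f}")
```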

Large language models sharpen this picture. Zhang and Litman (2025) proposed a dual-process framework for human-AI collaborative essay scoring, finding that LLM-based systems can elevate novice evaluators to expert-level performance when used as support tools. Yet a separate 2025 study found that human-LLM agreement was “consistently low and non-significant,” with weak within-model reliability across replications and systematic biases such as inflating coherence scores. LLMs are powerful aids, but not yet reliable standalone graders for complex, open-ended assessments.
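The within-model reliability finding is straightforward to check for any LLM grader: score the same essays several times and compute QWK between runs. A sketch, with the replicated score vectors invented for illustration (in practice each row would come from one scoring run of the same model on the same essays):

```python
# Within-model reliability check for an LLM grader: pairwise QWK across
# repeated scoring runs. Low values indicate the model disagrees with
# itself across replications. Scores below are invented.
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

replications = np.array([
    [4, 3, 5, 2, 4, 3, 2, 4],   # run 1 (hypothetical)
    [4, 4, 5, 3, 4, 2, 2, 5],   # run 2
    [3, 3, 5, 2, 5, 3, 1, 4],   # run 3
])
pairwise = [cohen_kappa_score(a, b, weights="quadratic")
            for a, b in combinations(replications, 2)]
print(f"mean within-model QWK: {np.mean(pairwise):.2f}")
```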

The Ofqual research points toward the practical resolution: combining multiple human judgments through pairwise comparison improved reliability from 0.47 with a single examiner to 0.85 with twelve making comparative judgments. AI can perform these comparisons at scale, achieving the benefits of multiple human perspectives without the cost. In classroom contexts – where the 0.95 QWK of ETS-level calibration is unrealistic – AI may provide more reliable scoring than an individual teacher working alone, not because it is inherently superior, but because it approximates the consistency that would otherwise require multiple trained raters. The role of human judgment then shifts: not line-by-line scoring, but oversight for the edge cases and contextual nuance where expertise still matters most.
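Comparative judgment of this kind is typically analyzed with a Bradley-Terry-style model, which converts pairwise "which essay is better" decisions into a quality scale. A minimal fit using the classic minorization-maximization update, offered as an illustrative sketch rather than the Ofqual study's actual procedure, with invented win counts:

```python
# Minimal Bradley-Terry fit for comparative judgment: pairwise "essay i
# beat essay j" counts are turned into quality estimates. Illustrative
# sketch only; not the analysis used in the Ofqual study.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of judges who preferred essay i over essay j."""
    n = wins.shape[0]
    p = np.ones(n)                   # quality parameters
    total = wins + wins.T            # comparisons per pair
    for _ in range(iters):
        for i in range(n):
            denom = sum(total[i, j] / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()                 # fix the arbitrary scale
    return p

# Four essays; entry [i, j] counts judges preferring i over j (invented).
wins = np.array([[0, 3, 4, 5],
                 [1, 0, 3, 4],
                 [0, 1, 0, 3],
                 [0, 0, 1, 0]])
print("quality estimates:", np.round(bradley_terry(wins), 3))
```

The expensive part in practice is filling the wins matrix with judgments; that is the step an AI comparator can perform at scale.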

References
  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2.0. Journal of Technology, Learning, and Assessment, 4(3). https://www.ets.org/research/policy_research_reports/publications/report/2005/hyzh
  • Holmes, S., Black, B., & Morin, C. (2020). Marking reliability studies 2017: Rank ordering versus marking – which is more reliable? Ofqual. https://www.gov.uk/government/publications/marking-reliability-studies-2017
  • Jiao, H., Choi, H., & Hua, H. (2025). Exploring the utilities of the rationales from large language models to enhance automated essay scoring. arXiv. https://doi.org/10.48550/arXiv.2510.27131
  • Latifi, S., & Gierl, M. (2021). Automated scoring of junior and senior high essays using Coh-Metrix features: Implications for large-scale language testing. Language Testing, 38(1), 62–85. https://doi.org/10.1177/0265532220929918
  • Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55, 2495–2527. https://doi.org/10.1007/s10462-021-10068-2
  • Ricker-Pedley, K. (2011). An examination of the link between rater calibration performance and subsequent scoring accuracy in GRE writing. Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2011.tb02239.x
  • The Hewlett Foundation. (2012). Automated Essay Scoring [Competition page]. Kaggle. https://www.kaggle.com/competitions/asap-aes
  • Wendler, C., Glazer, N., & Cline, F. (2019). Examining the calibration process for raters of the GRE general test (GRE Board Research Report No. GRE-19-01). Educational Testing Service. https://doi.org/10.1002/ets2.12245
  • Xie, Y., Cai, Y., Huang, M., & Zhu, X. (2022). Automated essay scoring via pairwise contrastive regression. Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), 2724–2733. https://aclanthology.org/2022.coling-1.240/
  • Yoon, J. (2023). Meta-analysis of inter-rater agreement and discrepancy between human and automated English essay scoring. English Teaching, 78(2), 89–112. https://doi.org/10.15858/engtea.78.2.202306.89
  • Zhang, M., & Litman, D. (2025). Human-AI collaborative essay scoring: A dual-process framework with LLMs. Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK 2025). https://doi.org/10.1145/3706468.3706513