Our Research
Metrics We Use
We use four key metrics to evaluate the performance of our grading systems. The chart to the right shows preliminary results on essay set 1 of the Automated Student Assessment Prize (ASAP) dataset, a standard benchmark for automated grading systems, where our system matches the accuracy achieved by the two human graders. While the dataset provides over 1,700 essays for AI training, we achieved this while training on only 60, showing promise for small-scale learning.
Quadratic Weighted Kappa
QWK measures agreement between two graders while accounting for the magnitude of disagreements: being off by one point is penalised far less than being off by three. QWK also corrects for agreement that could occur by chance. If 70% of essays in a dataset are graded 4/6, a model that predicts 4/6 every time scores a QWK near zero, because its agreement with the human grader is no better than guessing.
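As a concrete reference, here is a minimal QWK implementation following the standard definition (not necessarily our exact evaluation code). The 0-6 grade scale and the toy scores are purely illustrative, and the result is cross-checked against scikit-learn's cohen_kappa_score.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(rater_a, rater_b, min_score=0, max_score=6):
    """Agreement between two raters, penalising large disagreements
    quadratically and correcting for agreement expected by chance."""
    labels = np.arange(min_score, max_score + 1)
    n = len(labels)

    # Observed joint distribution of (rater A score, rater B score).
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score, b - min_score] += 1
    observed /= observed.sum()

    # Distribution expected by chance: outer product of the two marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Quadratic penalty: zero on the diagonal, growing with the size of the disagreement.
    penalties = (labels[:, None] - labels[None, :]) ** 2 / (n - 1) ** 2

    return 1 - (penalties * observed).sum() / (penalties * expected).sum()

# A model that always predicts the most common grade scores zero:
# its agreement with the human grader is no better than chance.
human = [4, 4, 4, 4, 4, 4, 4, 3, 5, 2]
always_four = [4] * len(human)
print(quadratic_weighted_kappa(human, always_four))                 # 0.0
print(cohen_kappa_score(human, always_four, weights="quadratic"))   # 0.0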
Small-Scale Learning
Conventional AI systems require hundreds to thousands of datapoints to achieve reliability. However, such data volumes are rarely available at the class or school level.
To be viable in real educational settings, AI grading systems must adapt from as few as 5 samples and become reliable by 50.
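One way to make the "reliable by 50 samples" target concrete is a learning curve: train on 5, 10, 25, then 50 graded essays and measure QWK against held-out human grades each time. The harness below is a minimal sketch; `train_grader` and `grade` are hypothetical placeholders, not our actual interface.

```python
import random
from sklearn.metrics import cohen_kappa_score

def learning_curve(essays, human_scores, train_grader, grade,
                   sizes=(5, 10, 25, 50), seed=0):
    """Measure agreement with human grades as the training set grows.

    `train_grader(train_essays, train_scores)` and `grade(model, essay)` are
    hypothetical stand-ins for whatever grading system is under test.
    """
    rng = random.Random(seed)
    order = list(range(len(essays)))
    rng.shuffle(order)

    results = {}
    for n in sizes:
        train_idx, test_idx = order[:n], order[n:]
        model = train_grader([essays[i] for i in train_idx],
                             [human_scores[i] for i in train_idx])
        predicted = [grade(model, essays[i]) for i in test_idx]
        actual = [human_scores[i] for i in test_idx]
        results[n] = cohen_kappa_score(actual, predicted, weights="quadratic")
    return results
```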
Comparison is Key
Judging that ‘Essay A is better than Essay B’ is easier than assigning an exact grade, for humans and machines alike. Pairwise comparison unlocks higher reliability when training data is scarce.
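This page does not specify how pairwise judgements are converted into grades; one common choice is a Bradley-Terry-style fit, sketched below. The `compare(essay_a, essay_b)` judge is a hypothetical placeholder (a human or an LLM), and the fitted strengths only establish an ordering that would still need to be mapped onto a rubric scale.

```python
import itertools

def bradley_terry_scores(essays, compare, iterations=100):
    """Fit latent quality scores from pairwise 'which is better?' judgements
    using the Bradley-Terry model (simple MM / Zermelo updates).

    `compare(a, b)` is a hypothetical judge (human or LLM) returning 0 if the
    first essay is better and 1 if the second is better.
    """
    n = len(essays)
    wins = [[0] * n for _ in range(n)]  # wins[i][j]: how often essay i beat essay j
    for i, j in itertools.combinations(range(n), 2):
        winner, loser = (i, j) if compare(essays[i], essays[j]) == 0 else (j, i)
        wins[winner][loser] += 1

    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom if denom else strength[i])
        mean = sum(updated) / n          # renormalise so the scale stays stable
        strength = [s / mean for s in updated]

    return strength  # higher = judged better; map to a rubric scale separately
```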
Teaching AI Like We Teach Humans
Teachers learn to grade by discussing exemplars with colleagues, comparing submissions to only a handful of anchors, and updating their understanding of the rubric as they go. LLMs can do the same.
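As an illustration only (not our actual prompt or model), an LLM grader can be shown a handful of anchor essays with their agreed grades and asked to place a new submission relative to them. The prompt builder below is a minimal sketch with hypothetical wording and a 0-6 scale.

```python
def build_grading_prompt(rubric, anchors, submission):
    """Assemble a prompt that grades a new essay by comparison to a handful
    of anchor essays, the way a teacher compares against exemplars.

    `anchors` is a list of (essay_text, agreed_grade) pairs; `rubric` is the
    plain-text rubric. The wording here is illustrative only.
    """
    parts = [
        "You are grading student essays against the rubric below.",
        f"Rubric:\n{rubric}",
        "Here are anchor essays that teachers have already agreed on:",
    ]
    for i, (text, grade) in enumerate(anchors, start=1):
        parts.append(f"Anchor {i} (grade {grade}/6):\n{text}")
    parts.append(
        "Compare the new submission to the anchors. State which anchors it is "
        "better or worse than, then give a final grade from 0 to 6.\n"
        f"New submission:\n{submission}"
    )
    return "\n\n".join(parts)
```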
Taking Each At Their Best
Under ideal conditions, expert human raters can reach a QWK of 0.95; yet on other datasets, modern systems now exceed the inter-rater reliability of human graders.
Human Versus Machine
All machine learning systems require human data as the ‘ground truth’ for training and evaluation. But what happens when that ground truth is flawed, or there is no reliable ground truth?
Exploring the inter-rater reliability of both humans and machines uncovers fundamental questions about what ‘accuracy’ truly means in grading.
The Reliability of Human Judgement
When two trained raters disagree, even slightly, on 35% of essays, whose grades should the AI learn from? When the goal is to match a single teacher’s grades, how can we tell whether that teacher was consistent?