Our Research.

Metrics We Use

We use four key metrics to evaluate the performance of our grading systems. The chart to the right shows preliminary results on essay set 1 of the Automated Student Assessment Prize (ASAP) dataset, a standard benchmark for automated grading systems: our system reaches the same level of agreement as the dataset’s two human graders. While the dataset was designed for AI training on over 1,700 essays, we achieved this result while training on only 60, showing promise for small-scale learning.

Quadratic Weighted Kappa

QWK measures agreement between two graders while accounting for the magnitude of disagreements: being off by one point counts against the score far less than being off by three. QWK also corrects for agreement that could happen by chance. If 70% of essays in a dataset were graded 4/6, a model that predicts 4/6 every time agrees with the human grader 70% of the time, yet its QWK is near zero, because that is exactly the agreement blind guessing would produce.
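A minimal sketch of how QWK behaves in this scenario, using scikit-learn’s cohen_kappa_score with quadratic weights (the simulated grades and the two toy models are illustrative, not our data or our grader):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Simulated human grades on a 1-6 scale: 70% of essays receive a 4.
human = rng.choice([1, 2, 3, 4, 5, 6], size=1000,
                   p=[0.05, 0.05, 0.10, 0.70, 0.05, 0.05])

# A "model" that always predicts the most common grade.
always_four = np.full_like(human, 4)

# A model that is usually within one point of the human grade.
noise = rng.choice([-1, 0, 1], size=1000, p=[0.15, 0.70, 0.15])
close_model = np.clip(human + noise, 1, 6)

print(cohen_kappa_score(human, always_four, weights="quadratic"))  # ~0.0: chance-level agreement
print(cohen_kappa_score(human, close_model, weights="quadratic"))  # well above 0: genuine agreement
```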


Small-Scale Learning

Conventional AI systems require hundreds to thousands of datapoints to achieve reliability. However, such data volumes are rarely available at the class or school level.

To become viable in real educational settings, AI grading systems must begin adapting from as few as 5 samples and become reliable by 50.

November 2, 2025

Comparison is Key

Judging that ‘Essay A is better than Essay B’ is easier than assigning an exact grade, for both humans and machines. Leaning on comparison therefore unlocks higher reliability when training data is scarce, as the sketch below illustrates.
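One common way to turn such pairwise judgements into scores is a Bradley-Terry model fitted to noisy ‘A beats B’ outcomes. The sketch below is an illustration under that assumption, with hypothetical judgement data, not our exact method:

```python
import numpy as np

def bradley_terry(n_items, comparisons, n_iters=200):
    """Estimate a relative strength per essay from pairwise outcomes.

    comparisons: list of (winner_index, loser_index) pairs, e.g. from a
    teacher or an LLM judging which of two essays is better.
    """
    strengths = np.ones(n_items)
    for _ in range(n_iters):
        wins = np.zeros(n_items)
        denom = np.zeros(n_items)
        for winner, loser in comparisons:
            wins[winner] += 1
            # Standard MM update: every comparison involving an item adds
            # 1 / (s_winner + s_loser) to that item's denominator.
            pair_total = strengths[winner] + strengths[loser]
            denom[winner] += 1.0 / pair_total
            denom[loser] += 1.0 / pair_total
        strengths = wins / np.maximum(denom, 1e-12)
        strengths /= strengths.mean()  # the scale is arbitrary, so fix it
    return strengths

# Hypothetical judgements over five essays (indices 0-4): (better, worse).
judgements = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (4, 3), (4, 2), (1, 4)]
print(np.argsort(-bradley_terry(5, judgements)))  # essays ranked strongest first
```

One option for mapping the resulting ranking back onto a grade scale is to interpolate between a few anchor essays with known grades.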

Coming Soon

Teaching AI Like We Teach Humans

Teachers learn to grade by discussing exemplars with colleagues, comparing submissions to only a handful of anchors, and updating their understanding of the rubric as they go. LLMs can do the same.
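As a hedged sketch of what exemplar-anchored grading might look like (the prompt wording, the 1-6 grade scale, and the call_llm helper are hypothetical, not our pipeline):

```python
def build_grading_prompt(rubric, anchors, submission):
    """Assemble a prompt that grades a submission against a few scored exemplars.

    anchors: list of (essay_text, grade) pairs agreed on by teachers.
    """
    parts = [
        "You are grading essays on a 1-6 scale using the rubric below.",
        "Rubric:\n" + rubric,
        "Here are exemplar essays with their agreed grades:",
    ]
    for text, grade in anchors:
        parts.append(f"--- Exemplar (grade {grade}/6) ---\n{text}")
    parts.append("--- Essay to grade ---\n" + submission)
    parts.append("Compare the essay to the exemplars and reply with a single grade from 1 to 6.")
    return "\n\n".join(parts)

# Hypothetical usage; `call_llm` stands in for whichever model API is used.
# prompt = build_grading_prompt(rubric, anchors=[(essay_a, 5), (essay_b, 2)], submission=new_essay)
# grade = call_llm(prompt)
```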

Coming Soon

Taking Each At Their Best

Under ideal conditions, expert human raters can reach 0.95 QWK; yet on other datasets, modern automated systems now exceed the inter-rater reliability of human graders.

Human Versus Machine

All machine learning systems require human data as the ‘ground truth’ for training and evaluation. But what happens when that ground truth is flawed, or there is no reliable ground truth?

Exploring the inter-rater reliability of both humans and machines uncovers fundamental questions about what ‘accuracy’ truly means in grading.

November 11, 2025

The Reliability of Human Judgement

When two trained raters disagree, if only slightly, on 35% of essays, whose grades should an AI learn from? And when the goal is to match a single teacher’s grades, how can we tell whether that teacher is consistent with themselves?
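One simple probe of the second question, offered as an illustration rather than a description of our study design, is test-retest consistency: have the teacher blindly regrade a sample of their own essays later and measure self-agreement with QWK.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical data: a teacher's original grades on 20 essays and their
# blind regrades of the same essays several weeks later.
original = [4, 3, 5, 4, 2, 4, 6, 3, 4, 5, 4, 3, 4, 5, 2, 4, 4, 3, 5, 4]
regraded = [4, 3, 4, 4, 2, 5, 6, 3, 4, 4, 4, 2, 4, 5, 3, 4, 4, 3, 5, 4]

# Intra-rater reliability: how consistent is the teacher with themselves?
print(cohen_kappa_score(original, regraded, weights="quadratic"))
```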