This study compared the results from nine automated essay scoring engines on eight essay scoring prompts drawn from six states that annually administer high-stakes writing assessments.
Student essays from each state were randomly divided into three sets: a training set (used for modeling the essay prompt responses and consisting of text and ratings from two human raters along with a final or resolved score), a test set (consisting of text responses only) used for a blind test of the vendor-developed models, and a validation set that was not employed in this study.
The essays encompassed writing assessment items from three grade levels (7, 8, 10) and were evenly divided between source-based prompts (i.e., essay prompts developed on the basis of provided source material) and those drawn from traditional writing genres (i.e., narrative, descriptive, persuasive). The total sample size was N = 22,029. Six of the eight essay sets were transcribed from their original handwritten responses using two transcription vendors. Transcription accuracy rates were computed at 98.70% for 17,502 essays. The remaining essays were typed by students during the actual assessment and provided in ASCII form.
Seven of the eight essay sets were holistically scored and one employed score assignments for two traits. Scale ranges, rubrics, and scoring adjudications varied considerably across the essay sets. Results were presented on distributional properties of the data (mean and standard deviation) along with traditional measures used in automated essay scoring: exact agreement, exact+adjacent agreement, kappa, quadratic-weighted kappa, and Pearson r. The results demonstrated that overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genres.
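The agreement measures named above can be illustrated with a minimal sketch. It uses the standard textbook definitions of each statistic and is not the study's own scoring code. The function names, the integer score scale, and the example scores are all assumptions made for illustration. "Exact+adjacent agreement" is taken here to mean the proportion of score pairs that differ by at most one point.

```python
from collections import Counter
from math import sqrt


def exact_agreement(a, b):
    """Proportion of essays on which the two score lists agree exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def exact_adjacent_agreement(a, b):
    """Proportion of essays on which the scores differ by at most one point."""
    return sum(abs(x - y) <= 1 for x, y in zip(a, b)) / len(a)


def kappa(a, b, min_score, max_score, quadratic=False):
    """Cohen's kappa; with quadratic=True, quadratic-weighted kappa.

    Assumes integer scores on a scale from min_score to max_score,
    with max_score > min_score.
    """
    n = len(a)
    span = max_score - min_score
    observed = Counter(zip(a, b))   # joint counts of (score_a, score_b)
    hist_a, hist_b = Counter(a), Counter(b)
    obs_disagree = 0.0              # weighted observed disagreement
    exp_disagree = 0.0              # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            if quadratic:
                w = (i - j) ** 2 / span ** 2
            else:
                w = 0.0 if i == j else 1.0
            obs_disagree += w * observed[(i, j)] / n
            exp_disagree += w * hist_a[i] * hist_b[j] / (n * n)
    if exp_disagree == 0.0:
        # Both raters gave one identical constant score; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - obs_disagree / exp_disagree


def pearson_r(a, b):
    """Pearson product-moment correlation (undefined if either list is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical scores on a 1-4 scale (illustrative only, not data from the study).
human = [1, 2, 3, 4, 4, 2]
machine = [1, 2, 3, 3, 4, 1]

print("exact agreement:", exact_agreement(human, machine))
print("exact+adjacent agreement:", exact_adjacent_agreement(human, machine))
print("kappa:", kappa(human, machine, 1, 4))
print("quadratic-weighted kappa:", kappa(human, machine, 1, 4, quadratic=True))
print("Pearson r:", pearson_r(human, machine))
```

Quadratic weighting penalizes a disagreement by the squared distance between the two scores, so a two-point discrepancy costs four times as much as a one-point discrepancy. Unweighted kappa treats all disagreements alike.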