Small, Private Language Models as Teammates for Educational Assessment Design

Chris Davis Jaldi, Anmol Saini, Shan Zhang, Noah Schroeder, Cogan Shimizu, Eleni Ilkou

Published: May 14, 2026 — 16:15 UTC

Problem
This preprint examines how well Small Language Models (SLMs) perform in educational assessment design compared with Large Language Models (LLMs), a question that remains underexplored in the literature. LLMs have shown promise in generating assessment questions aligned with pedagogical frameworks, but the methods used to evaluate them are often subjective and limited. Deploying proprietary models also raises privacy concerns and runs into resource constraints. The work responds with a systematic investigation of SLM performance in generating assessment questions.

Method
The authors conducted a comparative analysis of LLMs and SLMs for generating assessment questions based on Bloom’s taxonomy. They employed a reproducible evaluation framework that includes pedagogically grounded metrics to assess the quality of generated questions across different cognitive levels. The study involved a systematic evaluation of generation quality, focusing on dimensions such as clarity, relevance, and alignment with educational standards. Additionally, the authors analyzed the reliability and agreement patterns between model-based evaluations and expert-informed assessments, highlighting the necessity of a Human-in-the-Loop approach in educational contexts.
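
The agreement analysis between model-based and expert ratings can be illustrated with a chance-corrected agreement statistic. The summary does not say which statistic the authors used, so the sketch below assumes Cohen's kappa purely for illustration; the function name `cohen_kappa`, the yes/no labels, and the ratings themselves are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items.

    Assumption: kappa is used here only as a familiar agreement
    statistic; the paper's actual metric is not stated in this summary.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: does each generated question match its target Bloom level?
model_judge = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no"]
expert = ["yes", "no", "no", "yes", "no", "yes", "no", "no"]
kappa = cohen_kappa(model_judge, expert)
```

With these made-up ratings the two raters agree on 6 of 8 items, yet kappa comes out near 0.53 because much of that agreement would be expected by chance. A gap of this kind between raw and chance-corrected agreement is one reason to keep expert review in the loop.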

Results
The findings indicate that SLMs perform competitively with LLMs across key quality dimensions, which makes them viable for local and privacy-sensitive deployment. The questions SLMs generated were comparable in quality to those produced by LLMs, and the reported effect sizes suggest that SLMs can meet pedagogical requirements. The study also found systematic inconsistencies and biases in model-based evaluations when compared with expert ratings, so while SLMs can serve as effective tools, they still require human oversight.
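
To make the notion of an effect size concrete, here is a minimal sketch that compares two sets of quality scores using Cohen's d with a pooled standard deviation. The summary does not specify which effect-size measure the authors report, so Cohen's d is an assumption, and the function name `cohens_d` and the 1-to-5 scores are hypothetical values chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two independent groups.

    Assumption: Cohen's d with a pooled standard deviation; the paper's
    actual effect-size measure is not stated in this summary.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    # Pool the sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores have no variance, so d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical 1-5 clarity ratings for questions from each model family.
slm_scores = [4, 3, 4, 5, 4]
llm_scores = [4, 4, 5, 5, 4]
d = cohens_d(slm_scores, llm_scores)
```

A negative d means the first group (here the hypothetical SLM scores) averages lower than the second. With samples this small the estimate is very noisy, so the sketch shows only how such a number is computed, not what the study found.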

Limitations
The authors acknowledge several limitations, including the potential biases inherent in the expert evaluation process and the need for further exploration of the deployment constraints of SLMs in diverse educational settings. They also note that the study’s findings may not generalize across all educational contexts or assessment types. Additionally, the reliance on specific pedagogical frameworks may limit the applicability of the results to broader educational paradigms.

Why it matters
By demonstrating that SLMs can serve as effective, privacy-preserving alternatives to LLMs, the study makes a case for further exploration of local model deployment in educational settings. The findings also underscore the importance of keeping human evaluators in the assessment workflow through a Human-in-the-Loop approach, which can improve the reliability and quality of generated content. The work advances automated educational question generation by providing a framework for evaluating model performance and weighing deployment-aware trade-offs.

Source: arXiv:2605.15015
URL: https://arxiv.org/abs/2605.15015v1

By Callan Zhang · May 14, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL