Notable evaluation benchmarks

FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August

Published: Jun 4, 2026 — 15:13 UTC

Problem
This work addresses the lack of systematic comparisons between writing feedback from large language models (LLMs) and from human experts in the context of revision. It focuses on three dimensions of feedback: goal-orientation, anchoring to specific sentences, and prioritization. The authors note that although LLMs are increasingly used to generate writing feedback, no comprehensive dataset or analysis shows where LLM feedback aligns with or diverges from expert feedback. The paper presents FOXGLOVE, a dataset intended to fill this gap and give a structured basis for comparing the two sources.

Method
The authors constructed the FOXGLOVE dataset, which comprises 696 feedback comments from trained writing instructors on 69 twelfth-grade argumentative essays, alongside 1,644 comments generated by four state-of-the-art LLMs, for a total of 2,340 comments. The feedback from both sources was generated under a shared protocol to ensure comparability. The authors also collected expert quality ratings on a subset of both instructor and LLM comments. The analysis examines how feedback is distributed across goals and essay positions, as well as the complexity of the comments and how often they are phrased as questions.
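To make the distribution comparison concrete, here is a minimal Python sketch of one way to compare how two sources spread their comments across feedback goals. The goal labels, the record layout, and the choice of total variation distance are all assumptions made for illustration; this summary does not state which labels or statistics the authors used, and the toy records are not FOXGLOVE data.

```python
from collections import Counter


def goal_distribution(comments, source):
    """Share of one source's comments that fall under each goal label."""
    counts = Counter(c["goal"] for c in comments if c["source"] == source)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {goal: n / total for goal, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means the distributions are identical; 1.0 means they share no mass.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy records invented for illustration only; labels are hypothetical.
comments = [
    {"source": "instructor", "goal": "thesis"},
    {"source": "instructor", "goal": "evidence"},
    {"source": "instructor", "goal": "evidence"},
    {"source": "instructor", "goal": "organization"},
    {"source": "llm", "goal": "thesis"},
    {"source": "llm", "goal": "evidence"},
    {"source": "llm", "goal": "evidence"},
    {"source": "llm", "goal": "style"},
]

instructor = goal_distribution(comments, "instructor")
llm = goal_distribution(comments, "llm")
distance = total_variation(instructor, llm)
print(f"Total variation distance: {distance:.2f}")
```

A small distance would correspond to the kind of similarity across goals that the summary reports; the same function could be applied to essay-position buckets instead of goal labels.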

Results
The findings reveal that both instructors and LLMs distribute feedback similarly across various goals and essay positions. However, significant divergence occurs in the specific sentences targeted for feedback, with LLMs tending to provide more complex feedback and using fewer questions compared to instructors. Notably, LLM feedback received higher quality ratings on most dimensions assessed by instructors, although this advantage is largely attributed to the length of the comments rather than their intrinsic quality. The results suggest that while LLMs can generate feedback that is perceived as high quality, the nature of the feedback differs in terms of specificity and engagement.
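The contrast between similar position distributions and different target sentences can be illustrated with a short Python sketch. It assumes each comment is anchored to a sentence index within an essay, buckets positions into thirds, and measures sentence-level agreement with Jaccard overlap. The indices, the bucketing scheme, and the overlap measure are hypothetical choices for illustration, not the authors' reported method or data.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of sentence indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def position_bucket(index, n_sentences, n_buckets=3):
    """Map a zero-based sentence index to a coarse essay position bucket."""
    return min(index * n_buckets // n_sentences, n_buckets - 1)


# Hypothetical anchors for one essay with 21 sentences (indices 0 to 20).
n_sentences = 21
instructor_anchors = [2, 5, 9, 14]
llm_anchors = [2, 6, 9, 15, 18]

overlap = jaccard(instructor_anchors, llm_anchors)
instructor_buckets = [position_bucket(i, n_sentences) for i in instructor_anchors]
llm_buckets = [position_bucket(i, n_sentences) for i in llm_anchors]

print(f"Sentence-level Jaccard overlap: {overlap:.2f}")
print("Instructor position buckets:", instructor_buckets)
print("LLM position buckets:", llm_buckets)
```

In this toy case both sources comment on the beginning, middle, and end of the essay, yet they share only two of seven distinct target sentences, which is the pattern the summary describes in aggregate.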

Limitations
The authors acknowledge that the dataset is limited to twelfth-grade argumentative essays, which may restrict the generalizability of the findings to other writing genres or educational levels. Additionally, the reliance on a specific set of LLMs may not capture the full spectrum of capabilities across different models. The authors also note that the quality ratings are subjective and may vary based on the evaluators’ perspectives. Furthermore, the study does not explore the impact of feedback on actual writing improvement, which could provide further insights into the effectiveness of LLM-generated feedback.

Why it matters
The research has implications for both educational practice and the development of LLMs for writing assistance. By systematically comparing expert and LLM feedback, the work shows where LLM feedback could be brought closer to the characteristics of expert feedback, potentially making it more useful in educational settings. The FOXGLOVE dataset is a resource for future research on refining LLM feedback and understanding writing revision. The paper was posted to arXiv under cs.CL and contributes to the ongoing discussion of the role of AI in education.

By Callan Zhang · Jun 4, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL