Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan

Published: May 20, 2026 — 17:09 UTC

Problem
This preprint addresses a gap in AI education: students are taught to use AI systems as productivity tools but rarely to engage with them critically. Current curricula often focus on tasks such as prompting and summarization and neglect AI’s limitations and the role of human judgment in evaluating machine-generated knowledge. The authors propose an educational framework in which students construct benchmarks as a way to learn about AI accountability and the nuances of knowledge work in the AI era.

Method
The core technical contribution is QuestBench, a benchmark of 256 expert-level questions across 14 domains in the humanities and social sciences. Students transform their disciplinary knowledge into verifiable questions, peer-review one another’s questions to identify ambiguities, and evaluate AI systems against the resulting tasks. The evaluation covers thirteen deep research systems, including GPT-5.5, which serves as a baseline for comparison. The authors do not disclose training compute or architectural details of the evaluated systems; the focus is the educational framework and the resulting benchmark.

Results
Across the thirteen AI systems, the mean question-level pass rate on QuestBench is only 16.85%. The highest-performing system, GPT-5.5, achieves a pass rate of 57.58%. These results indicate that current AI systems struggle with nuanced, expert-level queries, and that even advanced models can fall short of the accuracy and reliability expected in professional knowledge work. The findings also underscore the educational value of these failures, which prompt critical discussion of what makes AI output trustworthy.
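To make the two kinds of figure concrete, here is a minimal Python sketch of how a per-system pass rate and a mean question-level pass rate could be computed from a table of pass/fail outcomes. The system names, the toy outcomes, and the function names are all hypothetical, and the paper's actual scoring procedure may differ; this is an illustration of the arithmetic only.

```python
# Hypothetical toy outcomes (NOT data from the paper):
# results[system][i] is True if that system passed question i.
results = {
    "system_a": [True, False, True, False],
    "system_b": [False, False, True, False],
    "system_c": [False, False, False, False],
}


def system_pass_rate(outcomes):
    """Fraction of questions a single system passed."""
    return sum(outcomes) / len(outcomes)


def question_pass_rates(results):
    """For each question, the fraction of systems that passed it."""
    per_system = list(results.values())
    n_systems = len(per_system)
    n_questions = len(per_system[0])
    return [
        sum(outcomes[i] for outcomes in per_system) / n_systems
        for i in range(n_questions)
    ]


def mean_question_pass_rate(results):
    """Average of the per-question pass rates across all questions."""
    rates = question_pass_rates(results)
    return sum(rates) / len(rates)


def best_system(results):
    """Name of the system with the highest per-system pass rate."""
    return max(results, key=lambda name: system_pass_rate(results[name]))


if __name__ == "__main__":
    for name, outcomes in results.items():
        print(f"{name}: {system_pass_rate(outcomes):.2%}")
    print(f"mean question-level pass rate: {mean_question_pass_rate(results):.2%}")
    print(f"best system: {best_system(results)}")
```

When every system attempts every question, as assumed in this sketch, the mean question-level pass rate equals the average of the per-system pass rates, so a low mean can coexist with one system scoring well above the rest.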

Limitations
The authors acknowledge that the benchmark is limited to specific domains within the humanities and social sciences, which may not generalize to other fields. Additionally, the study relies on a relatively small sample of student contributors, which may not capture the full spectrum of educational experiences or insights. The authors do not address potential biases in the question design or the evaluation process, nor do they explore the long-term impact of this educational approach on students’ understanding of AI.

Why it matters
This work argues for a shift in AI education in which students are not just consumers of AI-generated content but also critical evaluators of it. By constructing benchmarks, students develop a deeper understanding of the complexities of knowledge work and of the importance of accountability in AI applications. QuestBench serves as both a practical educational tool and a research artifact, fostering the critical inquiry that becomes more important as AI spreads through learning and professional practice.

Authors: Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen et al.
Source: arXiv cs.AI
URL: https://arxiv.org/abs/2605.21413v1

By Callan Zhang · May 20, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.