Notable evaluation benchmarks

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang

Published: Jun 9, 2026 — 14:59 UTC

Problem
The paper addresses the gap in evaluating the capabilities of Large Language Models (LLMs) in navigating complex, professional-grade productivity software, specifically in the context of document automation. Despite the rapid deployment of LLM agents for automation tasks, their proficiency in handling intricate office applications has not been rigorously tested. This work introduces a standardized evaluation framework based on China’s National Computer Rank Examination (NCRE), which is crucial for assessing LLMs’ practical operation skills in real-world office environments. The research is presented as a preprint, indicating it has not yet undergone peer review.

Method
The authors propose a comprehensive evaluation framework consisting of 200 practical-operation tasks across Microsoft Word, Excel, and PowerPoint. Each task is assessed using a 100-point rubric based on 7,118 machine-gradable criteria, with the Score Rate (SR) representing the mean percentage of points earned. The study benchmarks seven frontier LLMs, including both single-turn models and a more advanced agentic system that incorporates execution feedback, iterative repair, and broader access to Office automation features. The training compute and specific architectures of the LLMs are not disclosed, but the evaluation emphasizes the need for multi-application integration and long-horizon planning.

Results
The results reveal stark limitations in the performance of the evaluated LLMs. Single-turn models achieved a maximum Score Rate of 36.6%, indicating significant deficiencies in their ability to perform complex office tasks. In contrast, the more advanced agentic system, which leverages feedback and iterative improvements, reached a Score Rate of 68.8%. However, this still falls short of the community-reference score of 95.5%, which serves as a benchmark for proficient document automation. These findings highlight the challenges faced by current LLMs in achieving reliable and fine-grained automation in office environments.

Limitations
The authors acknowledge that the performance of the evaluated LLMs is significantly below the community-reference score, indicating that even advanced systems struggle with the complexity of office tasks. They do not address potential biases in the task selection or the representativeness of the NCRE framework for broader office automation scenarios. Additionally, the lack of detailed information regarding the architectures and training methodologies of the LLMs limits the reproducibility of the results.

Why it matters
This research underscores the limitations of current LLMs in performing sophisticated document automation tasks, which is critical for their deployment in professional settings. The findings suggest that while LLMs have made strides in code generation, they still face substantial challenges in practical applications requiring nuanced reasoning and multi-application integration. This work lays the groundwork for future research aimed at enhancing LLM capabilities in office automation, as discussed in the context of ongoing advancements in AI and productivity tools, as published in arXiv.

By Callan Zhang · Jun 9, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL