Major evaluation benchmarks

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi

Published: Jun 1, 2026 — 16:44 UTC

Problem
The paper identifies a significant gap in the evaluation of large language models (LLMs) when applied to personalized social applications. Existing benchmarks primarily focus on generic information-seeking tasks and do not adequately assess the complexities involved in interacting with individual user accounts or local databases. This limitation is particularly pronounced in real-world applications where LLMs must navigate personalized contexts, making it difficult to gauge their effectiveness in practical scenarios. The authors present MCP-Persona as a solution to this gap, providing a dedicated benchmark for evaluating agent performance in these personalized environments. Notably, this work is a preprint and has not undergone peer review.

Method
MCP-Persona is constructed to evaluate LLM agents across a variety of widely-used personal applications, including social media platforms (e.g., Reddit, Xiaohongshu) and enterprise collaboration tools (e.g., Lark, Slack). The benchmark includes a diverse set of tasks that simulate real-world interactions with these applications, allowing for a comprehensive assessment of agent capabilities. The authors conducted extensive experiments using state-of-the-art (SOTA) LLM agents, although specific architectures, loss functions, and training compute details are not disclosed in the paper. The benchmark is publicly available, facilitating further research and evaluation in this domain.

Results
The experiments reveal that SOTA LLM agents exhibit significant difficulties when tasked with personalized tool use. For instance, agents struggled to effectively retrieve and utilize user-specific data, leading to suboptimal performance in tasks that require contextual understanding and personalized interaction. While specific performance metrics and comparisons to baseline models are not detailed in the abstract, the authors emphasize the pronounced challenges faced by these agents, underscoring the necessity of the MCP-Persona benchmark for identifying and addressing these limitations.

Limitations
The authors acknowledge that MCP-Persona is still in its early stages and may not encompass all possible personal applications or user scenarios. Additionally, the benchmark’s reliance on simulated environments may not fully capture the nuances of real-world interactions. The paper does not address potential biases in the selection of applications or the representativeness of the tasks included in the benchmark. Furthermore, the lack of detailed performance metrics limits the ability to quantitatively assess the improvements or shortcomings of the evaluated agents.

Why it matters
MCP-Persona represents a critical advancement in the evaluation of LLMs, particularly in the context of personalized applications. By providing a structured framework for assessing agent performance in real-world scenarios, this benchmark can guide future research in improving LLM capabilities for personalized interactions. The implications of this work extend to the development of more effective AI agents that can better serve individual user needs, thereby enhancing user experience across various applications. This is particularly relevant as the demand for personalized AI solutions continues to grow, as published in arXiv cs.AI.

By Callan Zhang · Jun 1, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI