Notable evaluation benchmarks

StakeBench: Evaluating Language Understanding Grounded in Market Commitment

Yunhua Pei, Jingyu Hu, Yiwei Shi, Hongnan Ma, Weiru Liu, John Cartlidge

Published: May 25, 2026 — 17:38 UTC

Problem
Existing financial NLP benchmarks predominantly rely on external observer labels, which assess language perception rather than the actual commitments made by market participants. This paper introduces StakeBench, a novel evaluation framework that addresses this gap by grounding language understanding in observable market commitments. The work is presented as a preprint and has not yet undergone peer review.

Method
StakeBench utilizes a dataset comprising 560,876 comments from 2,261 resolved markets, sourced from Polymarket and Manifold. The framework derives supervision from observable market behavior rather than human annotations, focusing on three key aspects: position sides, post-comment trading actions, and market-odds trajectories. The authors propose four diagnostic tasks: (1) detecting market commitment, (2) identifying the revealed side, (3) anticipating future actions, and (4) performing collective odds projection. To evaluate model performance, three commitment-aware metrics are introduced, which measure alignment with revealed preferences instead of perceived sentiment. The evaluation encompasses 15 large language models (LLMs) across 18 topics and platform settings, with a focus on the structural integrity of the models in relation to market commitments.

Results
The models exhibit partial recovery of position-side signals, achieving Directed Accuracy scores ranging from 0.506 to 0.599. However, significant structural failures are noted, particularly in future action anticipation, where ten out of fifteen models reduce their predictions to one or two action labels. In collective odds projection, no model consistently outperforms a naive odds-direction baseline. Notably, model scale does not correlate with performance improvements, and finance-domain tuning fails to enhance revealed-side identification. The results indicate that platform incentives significantly influence higher-order outcomes.

Limitations
The authors acknowledge several limitations, including the structural failures of models in later tasks and the lack of consistent performance improvements across different model scales and tuning strategies. They also note that the framework’s reliance on observable market behavior may not capture latent beliefs or causal impacts on market odds. Additionally, the evaluation is constrained to the specific datasets from Polymarket and Manifold, which may limit generalizability to other financial contexts.

Why it matters
StakeBench represents a significant advancement in financial NLP by providing a framework that aligns language understanding with actual market commitments, rather than subjective interpretations. This shift has implications for the development of more robust models that can better predict market behavior and inform trading strategies. The commitment-aware metrics and diagnostic tasks introduced in this work could serve as a foundation for future research aimed at enhancing the interpretability and reliability of financial NLP systems. By making the evaluation code and dataset publicly available under CC-BY 4.0, the authors facilitate further exploration and validation of their approach within the research community.

Authors: Yunhua Pei, Jingyu Hu, Yiwei Shi, Hongnan Ma, Weiru Liu, John Cartlidge
Source: arXiv:2605.26074
URL: https://arxiv.org/abs/2605.26074v1

By Callan Zhang · May 25, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI