Major alignment safety

Quantifying Faithful Confidence Expression in Large Reasoning Models

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

Published
Jun 2, 2026 — 17:53 UTC

Problem
The paper examines how large reasoning models (LRMs) express confidence, focusing on faithful calibration (FC): the alignment between a model’s intrinsic confidence and its expressed confidence. Although LRMs are increasingly deployed in high-stakes applications, existing methods for measuring FC do not adequately handle long chain-of-thought outputs, which often lack clear boundaries and have inconsistent structure. This work is a preprint and has not undergone peer review.

Method
The authors propose a framework to systematically quantify FC in LRMs. It evaluates linguistic decisiveness against three sources of internal uncertainty: token probabilities, hidden states, and sampled response consistency. To handle the conditional and structural variation in reasoning traces, they introduce a prefix-conditioned sampling approach, which allows a more controlled analysis of how these models express confidence over the course of their reasoning.
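As a rough illustration of the idea, and not the authors' implementation, the sketch below estimates intrinsic confidence from sampled response consistency, conditioned on a shared reasoning prefix, and compares it with a decisiveness score. The function names, the `sample_answer` callable, the [0, 1] decisiveness scale, and the mean-absolute-gap metric are all assumptions made for this example; the summary does not specify the paper's exact formulation.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_confidence(answers: Sequence[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def prefix_conditioned_confidence(
    prefix: str,
    sample_answer: Callable[[str], str],  # hypothetical: model call returning a final answer
    n_samples: int = 20,
) -> float:
    """Sample several continuations from the same reasoning prefix and score agreement."""
    answers = [sample_answer(prefix) for _ in range(n_samples)]
    return consistency_confidence(answers)


def faithfulness_gap(
    decisiveness: Sequence[float],  # assumed: expressed confidence, scaled to [0, 1]
    confidence: Sequence[float],    # assumed: intrinsic confidence, scaled to [0, 1]
) -> float:
    """Mean absolute gap between expressed decisiveness and intrinsic confidence."""
    if len(decisiveness) != len(confidence) or not decisiveness:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(d - c) for d, c in zip(decisiveness, confidence))
    return total / len(decisiveness)
```

In this sketch a gap of 0 would mean the model's wording tracks its intrinsic confidence exactly, and larger values would mean it sounds more or less certain than its sampled behavior suggests. Fixing the prefix is what keeps the samples comparable: every continuation starts from the same point in the reasoning trace.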

Results
The framework was applied to a diverse set of leading models, datasets, and prompts. The findings indicate that faithful confidence expression is a significant challenge for LRMs. Notably, the authors report that improvements in reasoning behaviors do not necessarily correlate with enhanced FC. Additionally, prompt interventions that are effective for non-reasoning models do not yield similar improvements in the reasoning context. The study reveals that different confidence estimators yield divergent assessments of the same reasoning traces, highlighting the fragility of existing evaluation methodologies. These results underscore the need for a distinct focus on FC as a reliability and alignment target for LRMs.
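One way to make the divergence between estimators concrete is to compare how two of them rank the same set of reasoning traces. The sketch below computes a Spearman rank correlation in plain Python; the choice of Spearman correlation and all function names are assumptions for illustration, and the summary does not say how the authors measured disagreement.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two estimators' scores on the same traces."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    return pearson(average_ranks(x), average_ranks(y))
```

Under these assumptions, a correlation near 1 would mean two estimators order the traces the same way, while a value near 0 or below would mean a faithfulness score depends heavily on which estimator is used.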

Limitations
The authors acknowledge that their framework may not capture all dimensions of uncertainty inherent in LRM outputs, particularly in highly complex reasoning tasks. They also note that the reliance on specific confidence estimators may introduce biases, as different estimators can lead to inconsistent evaluations of the same outputs. Furthermore, the study’s findings are based on a limited set of models and prompts, which may not generalize across all LRM architectures or applications.

Why it matters
This work has implications for deploying LRMs in critical applications, where reliable communication of uncertainty is essential for user trust and decision-making. By establishing FC as a distinct target for model evaluation, the authors open the way for future research on improving the reliability of LRM outputs. This is particularly relevant as the use of LRMs expands in areas such as healthcare, finance, and autonomous systems, where miscalibrated confidence can lead to severe consequences. The findings and methods in the paper, which is available as a preprint on arXiv, contribute to the ongoing discussion of model interpretability and reliability.

Turing Wire

By Turing Wire editorial staff · Jun 2, 2026

Source: arXiv cs.AI