Notable reasoning

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert

Published
May 18, 2026 — 15:29 UTC

Problem
This preprint addresses the gap in understanding the reliability of Chain of Thought (CoT) reasoning in Large Reasoning Models (LRMs) for safety monitoring. While CoT reasoning has been proposed as a method to enhance interpretability and reliability, it is often not aligned with the model’s final outputs, raising concerns about its effectiveness as a monitoring tool. The authors aim to explore the internal representations of LRMs to predict future behavior based on prompt and CoT representations, thereby enhancing the monitoring capabilities of these models.

Method
The authors introduce a novel approach called “probe trajectories,” which involves evaluating a probe at each generated token to capture the continuous evolution of a concept’s probability throughout the reasoning process. They extract signal-processing features from these trajectories, focusing on volatility, trend, and steady-state behavior, which significantly enhance the separation of future model states. The study also emphasizes two methodological insights: (1) using template-based training data can achieve performance comparable to dynamically generated model responses, thus reducing the need for extensive initial inference and labeling; (2) the choice of pooling operation is crucial, with max-pooling yielding superior performance (up to 95% AUROC) compared to average-pooling and last-token methods, which result in near-random performance.

Results
The authors validate their approach using four datasets and four reasoning models across safety and mathematics domains. They demonstrate that the trajectory features derived from probe trajectories effectively encode task-specific dynamics, leading to improved outcome separability. The reported performance metrics indicate that the max-pooling method achieves up to 95% AUROC, significantly outperforming baseline methods. The results suggest that analyzing the full trajectory of reasoning provides a more reliable indication of future model behavior than static predictions.

Limitations
The authors acknowledge that their findings are contingent on the specific datasets and reasoning models used in the study, which may limit generalizability. They do not address potential biases in the training data or the implications of using template-based data generation on model performance. Additionally, the reliance on max-pooling raises questions about the robustness of this method across different architectures and tasks, which warrants further investigation.

Why it matters
This work has significant implications for the development of safety monitoring tools for LRMs. By establishing probe trajectories as a viable framework for understanding and predicting model behavior, the authors provide a pathway for enhancing the interpretability and reliability of CoT reasoning. This could lead to more robust applications of LRMs in safety-critical domains, where understanding the reasoning process is essential for trust and accountability. The insights on pooling operations and data generation methods also contribute to the broader discourse on optimizing model training and evaluation strategies.

Authors: Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert
Source: arXiv:2605.18549
URL: https://arxiv.org/abs/2605.18549v1

Turing Wire

By Turing Wire editorial staff · May 18, 2026 · Editorial standards →

Source: arXiv cs.CL