Notable reasoning

Predicting Future Behaviors in Reasoning Models Enables Better Steering

Evgenii Kortukov, Piotr Komorowski, Florian Klein, Paula Engl, Gabriele Sarti, Seong Joon Oh

Published: Jun 9, 2026 — 17:49 UTC

Problem
Deployed large reasoning models (LRMs) can behave unpredictably, which complicates their use in critical tasks. Existing steering techniques typically manipulate hidden representations using features derived from already generated text, which can degrade output quality. The paper argues that these detection features are poor predictors of future behavior and are therefore inadequate for steering. To close this gap, the authors propose predicting future behaviors from intermediate reasoning steps.

Method
The authors introduce a framework called Future Probe Controlled Generation (FPCG). Its core technical contribution is a set of activation probes trained to predict the likelihood of future behaviors from intermediate activations during the reasoning process. The probes predict the most likely behavior with 64% to 91% accuracy. FPCG samples multiple candidate sentences and selects the one the probes score best, steering the model’s output without significant degradation in quality. The paper does not disclose the architecture of the LRM used, the loss functions, or the training compute, focusing instead on the effectiveness of the steering method.
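The two steps described above (train a probe on intermediate activations, then pick among sampled candidates by probe score) can be illustrated with a minimal sketch. This is not the authors' implementation: the logistic-regression probe, the function names `train_probe`, `probe_score`, and `select_candidate`, and the `activation_fn` callback are all assumptions made for illustration, and a real system would read activations from the model rather than from a toy lookup.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Probe = Tuple[List[float], float]  # (weights, bias); hypothetical probe format


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(X: List[Vector], y: List[int],
                lr: float = 0.5, epochs: int = 300) -> Probe:
    """Fit a logistic-regression probe with batch gradient descent.

    X holds activation vectors taken at intermediate reasoning steps.
    y[i] is 1 if the behavior of interest occurred LATER in that trace,
    so the probe learns to predict, not to detect.
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(probe: Probe, x: Vector) -> float:
    # Predicted probability that the behavior appears in the future.
    w, b = probe
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def select_candidate(candidates: List[str],
                     activation_fn: Callable[[str], Vector],
                     probe: Probe,
                     want_behavior: bool = True) -> str:
    """Return the candidate sentence whose activations the probe rates best.

    activation_fn is a hypothetical hook mapping a candidate sentence to the
    activation vector the model produces after generating it.
    """
    def score(candidate: str) -> float:
        p = probe_score(probe, activation_fn(candidate))
        return p if want_behavior else 1.0 - p

    return max(candidates, key=score)
```

In use, a caller would sample several candidate next sentences from the model, pass them to `select_candidate` together with a trained probe, append the winner to the reasoning trace, and repeat. Because the model's activations are never modified, the selected text is always something the model itself produced, which is consistent with the summary's claim that quality is preserved. The cost is one forward pass per sampled candidate.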

Results
The authors report that FPCG outperforms traditional activation steering methods across several evaluations, maintaining output quality while steering effectively in scenarios where previous methods fail. The abstract does not give specific steering metrics, so the size of the improvement cannot be quantified from it.

Limitations
The authors acknowledge that their approach relies on successfully training activation probes, whose effectiveness may vary with model architecture and task. They do not address how well the probes generalize across different LRM architectures or the computational overhead of sampling multiple candidates. The paper also does not examine how FPCG would behave in real-world deployments.

Why it matters
This work could make LRM applications more reliable and controllable, particularly in sensitive domains where output predictability is crucial. By distinguishing detection features from prediction features, the authors provide a framework for steering LRM behaviors that may lead to more robust AI systems. The findings contribute to ongoing work on LRM interpretability and control. This research is available on arXiv.

By Callan Zhang · Jun 9, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG