Notable reasoning

Reasoning Structure of Large Language Models

Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer

Published: Jun 2, 2026 — 16:49 UTC

Problem
Current evaluations of large reasoning models (LRMs) often rely on metrics like final-answer accuracy or token count, which can obscure the underlying reasoning processes. This paper addresses the gap in understanding the structural differences in reasoning among models that achieve similar performance metrics. The authors propose a novel benchmark for logic puzzles and a methodology to convert unstructured reasoning traces into verifiable reasoning graphs. This work is particularly relevant as it is presented as a preprint, indicating that it has not yet undergone peer review.

Method
The authors introduce a scalable benchmark consisting of logic puzzles designed to assess the reasoning capabilities of LRMs. They develop a pipeline that transforms unstructured reasoning traces into structured reasoning graphs, which represent claims and their dependencies. This transformation allows for the quantitative analysis of reasoning structures. A key contribution is the definition of a reasoning efficiency metric, which measures the concentration of logical flow within the model’s reasoning process. The authors utilize open-source reasoning models for their analysis, although specific architectures, loss functions, and training compute details are not disclosed.

Results
The analysis reveals that structural measurements can effectively differentiate between model behaviors that traditional metrics like token count and accuracy fail to capture. The authors demonstrate that their reasoning efficiency metric correlates with the complexity of the logic puzzles, providing a more nuanced understanding of model performance. While specific numerical results are not detailed in the abstract, the implication is that models with higher reasoning efficiency exhibit better performance on more complex puzzles compared to their less efficient counterparts.

Limitations
The authors acknowledge that their approach is limited by the scope of the logic puzzles included in the benchmark and the reliance on open-source models, which may not represent the full spectrum of LRM capabilities. Additionally, the reasoning graphs generated may not capture all aspects of reasoning, particularly in more nuanced or ambiguous scenarios. The paper does not address potential biases in the selection of puzzles or the generalizability of the findings across different types of reasoning tasks.

Why it matters
This work has significant implications for the evaluation and development of reasoning models in AI. By providing a structured approach to analyze reasoning processes, it enables researchers to diagnose failure modes more effectively and compare models on a more granular level. The introduction of a reasoning efficiency metric could lead to improved model architectures and training methodologies that prioritize logical coherence. This research contributes to the ongoing discourse on enhancing LRM capabilities and is relevant for future studies in the field, as published in arXiv cs.AI.

By Callan Zhang · Jun 2, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI