Major reasoning

A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

arXiv:2605.19944 Papers with Code →

Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen

Published: May 19, 2026 — 15:00 UTC

Problem
This preprint addresses the theoretical underpinnings of out-of-distribution (OOD) generalization in large language models (LLMs), a gap in the existing literature that primarily focuses on empirical scaling laws without a robust theoretical framework. The authors aim to formalize reasoning through optimal transport, providing a measure-theoretic perspective that quantifies domain shifts and elucidates the limitations of current architectural designs in preserving generalization capabilities.

Method
The core technical contribution involves the application of optimal transport theory, specifically utilizing the Wasserstein-1 distance to measure domain shifts. The authors leverage Kantorovich duality to establish bounds on OOD generalization, linking it to architectural properties such as Lipschitz continuity. They identify two main constraints:

Position-dependent attention mechanisms (e.g., Absolute Positional Encoding) are shown to yield an $Ω(1)$ Lipschitz constant, which leads to a significant expected risk in OOD scenarios.
Shift-invariant mechanisms (e.g., Rotary Embeddings) are proposed as superior alternatives, preserving equivariance and effectively bounding the error.

Additionally, the authors map sequential backtracking to a Dyck-$k$ language, establishing a strict lower bound on circuit depth for $\text{TC}^0$ Transformers. They argue that increasing physical layer depth is essential to prevent representation collapse, a limitation that cannot be mitigated merely by scaling representation width due to inherent approximation limits in Barron spaces. The evaluation is conducted across 54 Transformer configurations on combinatorial search tasks.

Results
The empirical evaluations demonstrate that generalization risk degrades monotonically with the Wasserstein domain shift across the tested Transformer configurations. The authors provide specific metrics indicating that models employing position-dependent attention exhibit significantly higher expected risk compared to those utilizing shift-invariant mechanisms. While exact numerical results are not disclosed in the abstract, the findings suggest a clear performance gap that underscores the importance of architectural choices in achieving robust OOD generalization.

Limitations
The authors acknowledge that their analysis is constrained by the assumptions inherent in the optimal transport framework and the specific architectural classes examined. They do not address potential implications of their findings on other model architectures outside the Transformer family. Additionally, the reliance on theoretical bounds may not fully capture the complexities of real-world data distributions, which could limit the applicability of their results.

Why it matters
This work has significant implications for the design of future LLM architectures, particularly in enhancing OOD generalization capabilities. By formalizing the relationship between architectural choices and generalization risk, it provides a theoretical foundation that can guide the development of more robust models. The insights regarding the limitations of position-dependent attention mechanisms and the necessity of depth scaling could influence both research directions and practical implementations in the field of deep learning.

Authors: Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen
Source: arXiv:2605.19944
URL: https://arxiv.org/abs/2605.19944v1

By Callan Zhang · May 19, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL