Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten
- Published
- May 13, 2026 — 13:08 UTC
Problem
This preprint addresses a significant gap in understanding how different data representations (bytes, characters, subword tokens) affect the performance of finite-context predictors in Transformers. While existing literature has explored various tokenization strategies, the authors highlight the lack of a theoretical framework that explains the intrinsic limitations imposed by representation choices on model performance, particularly in the context of Markov sources.
Method
The authors introduce two key phenomena: fragmentation and greedy tokenization. Fragmentation refers to the process of replacing a source symbol with multiple smaller units, which can lead to an increase in the optimal finite-context log-loss, even when the context window is expanded. This phenomenon is theoretically proven, establishing that the performance gap observed in models like ByT5 and CANINE (which utilize byte- and character-level representations) compared to subword-tokenized models is not merely an optimization issue but is rooted in the representation itself.
On the other hand, greedy tokenization methods (e.g., BPE, WordPiece) are analyzed for their ability to group source symbols into larger units. The authors provide a loss guarantee that describes conditions under which a short token window can effectively mimic a longer source-context window. This guarantee is contingent on the reliability of token windows to encompass the necessary source history and the compression rate of the tokenizer. The authors propose a diagnostic metric to evaluate real tokenizers based on the amount of source context a fixed token window can reliably capture.
Results
The paper does not provide specific numerical results or comparisons against named baselines on standard benchmarks, focusing instead on theoretical insights and guarantees. The findings suggest that fragmentation can lead to a strictly worse performance in terms of log-loss, while effective greedy tokenization can enhance the model’s ability to leverage context, although exact effect sizes are not quantified.
Limitations
The authors acknowledge that their theoretical framework may not account for all practical considerations in real-world applications, such as the computational overhead introduced by fragmentation or the potential inefficiencies of greedy tokenization in diverse datasets. Additionally, the lack of empirical validation on benchmark datasets limits the applicability of their theoretical findings. The paper does not explore the implications of these phenomena on model training dynamics or generalization performance.
Why it matters
This work has significant implications for the design and evaluation of tokenization strategies in Transformer models. By establishing a theoretical foundation for understanding the impact of representation choices, it encourages researchers to reconsider how they approach tokenization in their models. The insights on fragmentation and greedy tokenization can inform future research on optimizing model architectures and training procedures, potentially leading to more efficient and effective language models.
Authors: Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten
Source: arXiv:2605.13485
URL: https://arxiv.org/abs/2605.13485v1
By Turing Wire editorial staff · May 13, 2026 · Editorial standards →
Source: arXiv cs.CL