VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
Chunyu Liu, Zhengyang Fan, Kaisen Yang, Alex Lamb
- Published
- Jun 16, 2026 — 14:46 UTC
Problem
This work addresses the limitations of existing Masked Diffusion Language Models (MDLMs) that utilize the same token, [EOS], for both semantic termination and padding during instruction tuning. This dual role leads to inefficiencies, particularly under large-block decoding scenarios, resulting in [EOS] overflow. The authors highlight that this issue has not been adequately addressed in the literature, particularly in the context of instruction-tuned MDLMs. The paper is a preprint and has not undergone peer review.
Method
The core contribution of this paper is the introduction of VoidPadding, which employs a new token, [VOID], specifically for padding, thereby reserving [EOS] solely for semantic termination. This decoupling allows for more efficient response-length modeling. The authors implement a mechanism where the learned [EOS] signal facilitates early stopping during inference, while the [VOID] signal aids in adaptive response canvas expansion. The model is evaluated on the Dream-7B-Instruct dataset, focusing on tasks that include mathematical reasoning and code generation. The training process and architecture specifics are not disclosed in detail, but the authors report a significant reduction in decoding non-finite elements (NFE) by 55.7% on average.
Results
VoidPadding demonstrates substantial improvements over baseline models. Specifically, it achieves a block-size-averaged four-task mean score increase of +17.84 points compared to the original model and +6.95 points over RainbowPadding. These results indicate a marked enhancement in performance across the evaluated benchmarks, showcasing the effectiveness of the proposed padding strategy in improving model efficiency and output quality.
Limitations
The authors acknowledge that while VoidPadding improves performance and reduces decoding NFE, the approach may still be sensitive to the choice of tasks and the specific configurations of the MDLMs. Additionally, the paper does not explore the potential impacts of using [VOID] in other contexts or with different model architectures. The lack of extensive ablation studies to isolate the effects of [VOID] versus [EOS] in various scenarios is another limitation that could be addressed in future work.
Why it matters
The implications of this research are significant for the development of more efficient MDLMs, particularly in applications requiring high-quality text generation and reasoning capabilities. By decoupling the roles of padding and termination, VoidPadding paves the way for improved instruction tuning methodologies and could influence future designs of language models. This work contributes to the ongoing discourse on optimizing model architectures for better performance, as discussed in related literature, and is available on arXiv.
By Callan Zhang · Jun 16, 2026 · Editorial standards →
Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.
Source: arXiv cs.CL