Notable training methods

SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs

Xinrui Song, Zhuoran Wang, Mingju Gao, Hao Tang

Published: Jun 3, 2026 — 14:48 UTC

Problem
Current blockwise decoding methods for diffusion language models (DLMs) often utilize fixed block sizes or delimiter-based signals, which can misalign with semantic boundaries in text generation. This paper addresses the gap in the literature regarding the dynamic adaptation of block sizes based on semantic context, which is crucial for improving the coherence and relevance of generated text. The authors present SemBlock, a novel framework that predicts semantic boundaries to enhance the efficiency and effectiveness of DLMs. This work is a preprint and has not yet undergone peer review.

Method
SemBlock employs a semantic-boundary-driven approach to dynamic block construction. It formulates the task of determining block boundaries as a semantic boundary prediction problem. The framework utilizes lightweight predictors trained on frozen hidden states from the LLaDA model. To facilitate this, the authors introduce SemBound, a dataset specifically designed to provide supervision for boundary prediction. This dataset derives boundary labels from various discourse units, reasoning steps, and spans across different domains, including natural language, mathematics, and programming code. During inference, SemBlock leverages predicted boundary probabilities to dynamically select the end position of each block, allowing for more contextually appropriate token generation.

Results
The authors conducted experiments on several benchmarks, including GSM8K, IFEval, MATH, and HumanEval. SemBlock demonstrated consistent improvements over both fixed-block decoding methods and the AdaBlock approach. For instance, on the GSM8K benchmark, SemBlock achieved a performance increase of approximately 5% in accuracy compared to AdaBlock, while also reducing the average decoding time by 15%. These results indicate that SemBlock not only enhances the quality of generated text but also improves the efficiency of the decoding process.

Limitations
The authors acknowledge that the performance of SemBlock is contingent on the quality of the semantic boundary predictions. If the predictors are inaccurate, the benefits of dynamic block decoding may be diminished. Additionally, the reliance on frozen LLaDA hidden states may limit the adaptability of the framework to other architectures. The paper does not address the potential computational overhead introduced by the boundary prediction step, which could offset some of the efficiency gains in certain scenarios.

Why it matters
The introduction of SemBlock has significant implications for the development of more sophisticated DLMs that can generate coherent and contextually relevant text. By aligning token generation with semantic boundaries, this framework paves the way for improved performance in various natural language processing tasks. The work also highlights the importance of semantic understanding in the design of decoding strategies, which could influence future research in the field. This paper is available on arXiv and contributes to ongoing discussions about enhancing the capabilities of language models through innovative decoding techniques.

By Callan Zhang · Jun 3, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL