Learning from the Self-future: On-policy Self-distillation for dLLMs
Yifu Luo, Zeyu Chen, Haoyu Wang, Xinhao Hu, Yuxuan Zhang, Zhizhou Sha
- Published
- Jun 16, 2026 — 17:24 UTC
Problem
The paper addresses the gap in applying on-policy self-distillation (OPSD) techniques to diffusion large language models (dLLMs), a domain that has not been previously explored. Existing OPSD methods are primarily designed for autoregressive models, which utilize left-to-right prefix conditioning and token-level divergence supervision. This approach is incompatible with the arbitrary-order generation characteristic of dLLMs. The authors propose a new framework, d-OPSD, to overcome these limitations, providing a solution that is particularly relevant given the increasing interest in dLLMs for various applications. The work is presented as a preprint, indicating that it has not yet undergone peer review.
Method
The core technical contribution of this paper is the introduction of d-OPSD, which consists of two main innovations. First, the authors redefine self-teacher construction by employing self-generated answers as suffix conditioning. This allows the student model to learn from its “self future-experience,” rather than relying on privileged prefix information. Second, the supervision mechanism is transitioned from token-level to step-level, which aligns more closely with the iterative denoising process inherent in dLLMs. This methodological shift is crucial for effectively training dLLMs, as it leverages the unique characteristics of their generation process. The authors do not disclose specific details regarding the training compute used, but they emphasize the efficiency of their approach.
Results
The experimental results demonstrate that d-OPSD significantly outperforms existing baselines, including Reinforcement Learning with Value Regularization (RLVR) and Supervised Fine-Tuning (SFT), across four reasoning benchmarks. Notably, d-OPSD achieves superior sample efficiency, requiring only about 10% of the optimization steps compared to RLVR. This efficiency translates into faster convergence and reduced computational costs, making d-OPSD a promising alternative for post-training dLLMs. The paper provides quantitative results that highlight the effectiveness of d-OPSD, although specific numerical performance metrics are not detailed in the summary.
Limitations
The authors acknowledge that their approach is still in the early stages of development and may require further validation across a broader range of tasks and datasets. They do not address potential challenges related to the generalization of the d-OPSD framework to other types of models or tasks outside the reasoning benchmarks tested. Additionally, the reliance on self-generated answers may introduce noise, which could affect the quality of the learned representations.
Why it matters
The introduction of d-OPSD represents a significant advancement in the training of diffusion large language models, particularly in enhancing their sample efficiency and performance. By reframing self-distillation techniques to suit the unique characteristics of dLLMs, this work opens new avenues for research and application in the field of natural language processing. The implications of this research extend to improving the efficiency of model training and potentially enabling more complex reasoning tasks in future dLLM applications, as discussed in related literature. For further details, see the full paper available on arXiv.
By Callan Zhang · Jun 16, 2026 · Editorial standards →
Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.
Source: arXiv cs.CL