Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation
Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye
- Published
- Jun 11, 2026 — 17:54 UTC
Problem
This work addresses the gap in understanding how on-policy distillation (OPD) affects model parameters, particularly in the context of language and vision-language models. Despite OPD’s growing prominence as a post-training technique that leverages on-policy student trajectories and dense teacher supervision, the specific nature of the parameter updates it induces remains unclear. This paper is a preprint and has not undergone peer review.
Method
The authors conduct a comprehensive analysis of OPD across various language and vision-language model pairs. They investigate the sparsity of updates and their geometric properties. The study employs a coordinate-sparse update mechanism, revealing that OPD updates are small and predominantly concentrated in feedforward neural network (FFN) layers. The authors compare the performance of a sparsity-inducing stochastic gradient descent (SGD) optimizer against AdamW, finding that the latter outperforms the former. The analysis also includes a spectral examination of the updates, indicating that while the updates are numerically full-rank, they are concentrated away from the principal singular subspaces of the source weights. This suggests that the updates disproportionately affect coordinates where the source weights are near zero.
Results
The findings indicate that training only the identified subnetwork can achieve performance nearly equivalent to that of full OPD, demonstrating the operational utility of the sparse structure. The paper does not provide specific quantitative performance metrics against named baselines, but it emphasizes the effectiveness of the discovered subnetwork in maintaining model performance. The comparative analysis of optimizers shows that AdamW’s adaptive scaling is beneficial in the context of dense teacher supervision, which preserves heterogeneous coordinate-wise gradient scales.
Limitations
The authors acknowledge that their findings are based on specific model pairs and use cases, which may limit the generalizability of the results. They also note the underperformance of the sparsity-inducing SGD optimizer compared to AdamW, suggesting that further exploration of optimization strategies in the context of OPD is warranted. Additionally, the paper does not explore the implications of these findings on other types of models or tasks beyond those studied.
Why it matters
This research provides critical insights into the sparsity and geometric characteristics of parameter updates in OPD, which can inform future work on model distillation techniques. Understanding the operational benefits of sparse updates and the role of dense teacher supervision can lead to more efficient training methodologies in both language and vision-language domains. These findings are particularly relevant for researchers looking to optimize post-training strategies and improve model performance while minimizing computational costs, as discussed in the context of OPD in arXiv.
By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →
Source: arXiv cs.LG