Notable training methods

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu

Published: Jun 16, 2026 — 16:02 UTC

Problem
Graphical user interface (GUI) grounding necessitates precise identification of small elements in high-resolution screenshots, requiring vision-language models (VLMs) to predict accurate screen coordinates. The authors identify a gap in existing self-distillation methods, particularly on-policy self-distillation (OPSD), which is not well-suited for this task due to the degradation of teacher signal quality when the student-generated prefixes deviate from the target coordinates. This paper, being a preprint, addresses the need for a more reliable self-distillation approach that enhances the quality of teacher signals in GUI grounding.

Method
The authors propose a novel quality-aware self-distillation framework that incorporates two key mechanisms: soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate evaluates whether the teacher’s coordinate-token predictions can be completed into the ground-truth bounding box based on the student-generated prefix. If the predictions cannot be completed, the corresponding teacher signal is down-weighted, thereby reducing the influence of unreliable signals. Teacher-probability scaling further calibrates the strength of the remaining teacher signals based on the teacher’s confidence in its predictions. The combination of these two mechanisms is shown to be crucial, as neither component alone yields significant performance improvements.

Results
The proposed method was evaluated across six GUI grounding benchmarks, demonstrating consistent performance enhancements over a strong baseline model. The paper reports that the quality-aware self-distillation approach leads to a statistically significant improvement in accuracy, with specific metrics not disclosed in the abstract. The results indicate that the combination of correctness-aware gating and teacher-probability scaling effectively enhances the reliability of the teacher signals, leading to better overall model performance.

Limitations
The authors acknowledge that their approach may still be sensitive to the initial quality of the teacher model and that the effectiveness of the gating mechanism relies on the accuracy of the teacher’s predictions. Additionally, the paper does not explore the scalability of the method to other tasks beyond GUI grounding, nor does it address potential computational overhead introduced by the gating and scaling mechanisms. These limitations suggest areas for future research, particularly in optimizing the balance between performance gains and computational efficiency.

Why it matters
This work has significant implications for the development of more robust VLMs in GUI grounding tasks, particularly in applications requiring high precision in user interface interactions. The introduction of quality-aware self-distillation could pave the way for improved methodologies in other coordinate-sensitive tasks within computer vision and natural language processing. The findings contribute to the ongoing discourse on enhancing self-distillation techniques, as published in arXiv.

By Callan Zhang · Jun 16, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI