DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu
- Published: May 7, 2026, 17:19 UTC
- Summary length: 458 words
- Relevance score: 80%
Problem
This paper addresses the limitations of existing contrastive language-image pretraining methods, specifically CLIP, which suffers from two key structural weaknesses: (1) the symmetric InfoNCE loss fails to capture the relative ordering among unmatched in-batch pairs, and (2) global pooling reduces visual representations to a semantic bottleneck, neglecting fine-grained local structures. While RANKCLIP partially mitigates the first issue with a list-wise Plackett-Luce ranking-consistency loss, it does not address the second weakness. This work presents DINORANKCLIP, which aims to jointly resolve both issues. Notably, this is a preprint and has not yet undergone peer review.
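For context, the symmetric InfoNCE objective in question is the standard CLIP loss. The minimal PyTorch sketch below is not taken from the paper; it shows the standard formulation, in which every off-diagonal pair enters only as an undifferentiated negative, which is exactly the missing relative ordering the paper targets.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(image_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE (CLIP) loss over a batch of matched pairs.

    Every non-diagonal entry of the similarity matrix is treated as an
    interchangeable negative: no ordering is imposed among unmatched pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```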
Method
DINORANKCLIP introduces a pretraining framework that integrates a frozen DINOv3 teacher into the contrastive learning pipeline. The architecture pairs a lightweight dual-branch student with a multi-scale fusion module, built from channel-spatial attention, a self-attention refiner, and a conflict-aware gate, to preserve first-order cross-modal alignment. On the loss side, the authors propose a high-order Plackett-Luce ranking model that augments per-position utilities with attention-parameterized pairwise and tuple-wise transition terms; it subsumes CLIP and RANKCLIP as special cases, and an order sweep identifies R* = 3 as optimal. Training uses the Conceptual Captions dataset of 3 million images, and the entire empirical study, including the order sweep and various ablations, completes within 72 hours on a single eight-GPU H100 node.
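The attention-parameterized transition terms are the paper's contribution and are not specified in this summary. For reference, the first-order Plackett-Luce base case that RANKCLIP builds on assigns a ranking π over N candidates with utilities s the likelihood P(π | s) = ∏_k exp(s_{π(k)}) / Σ_{j≥k} exp(s_{π(j)}). The sketch below is a hypothetical helper (not the authors' code) computing its negative log-likelihood.

```python
import torch

def plackett_luce_nll(scores: torch.Tensor, ranking: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of `ranking` under the first-order
    Plackett-Luce model (illustrative helper, not from the paper).

    scores:  (N,) real-valued utilities, one per candidate.
    ranking: (N,) permutation of candidate indices, best candidate first.
    """
    s = scores[ranking]  # utilities in ranked order
    # At position k the PL denominator is log sum_{j >= k} exp(s_j);
    # a reversed logcumsumexp computes all suffix log-sum-exps stably.
    log_denom = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return -(s - log_denom).sum()
```

In a RANKCLIP-style setup, `scores` would plausibly be a row (or column) of the in-batch image-text similarity matrix and `ranking` the order induced by the target signal; higher-order variants would additionally condition each position's utility on previously ranked items.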
Results
DINORANKCLIP shows consistent improvements over the CLIP, CyCLIP, ALIP, and RANKCLIP baselines under matched computational resources. The largest relative gains appear in fine-grained and out-of-distribution evaluations, which stress local structural reasoning. The abstract does not report specific numbers, but the authors state that DINORANKCLIP outperforms these baselines across multiple benchmarks, particularly those that probe reasoning about local structure.
Limitations
The authors acknowledge that DINORANKCLIP inherits some limitations from its DINOv3 teacher, notably the reliance on a frozen model, which may restrict adaptability to new data distributions. The empirical results also rest solely on the Conceptual Captions dataset, which may limit generalizability to other datasets or tasks. The paper does not discuss potential computational inefficiencies or the scalability of the proposed architecture beyond the tested setup.
Why it matters
The development of DINORANKCLIP has significant implications for future work in vision-language pretraining. By addressing the shortcomings of existing methods, particularly in capturing local structural information and relative ordering, this framework could enhance the performance of models in tasks requiring fine-grained visual understanding and reasoning. The introduction of high-order ranking models may also inspire further research into more sophisticated loss functions that leverage relational information in multimodal learning.
Source: arXiv:2605.06592
URL: https://arxiv.org/abs/2605.06592v1