Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks
Pradnya Halady, Jiale Wei, Zdravko Marinov, Alexander Jaus, Simon Reiß
- Published
- Jun 9, 2026 — 15:08 UTC
Problem
The paper addresses the limited evaluation of visual in-context learning (VICL) models, which have primarily been tested in narrow setups that do not require real adaptation to new tasks or image domains. Existing benchmarks fail to capture the models’ performance in diverse scenarios, leading to an incomplete understanding of their adaptive capabilities. This work presents a unified benchmark, VIBE, to systematically assess VICL across a wide range of imaging domains and tasks, filling a significant gap in the literature. Notably, this is a preprint and has not undergone peer review.
Method
The authors construct the Visual In-Context Benchmark (VIBE), which encompasses 14 datasets and 12 tasks, resulting in 106 dataset-task combinations. They evaluate six different VICL models under a one-shot learning paradigm, employing a unified and reproducible evaluation protocol. The models tested include various architectures, although specific details on the architectures and training compute are not disclosed in the abstract. The evaluation focuses on the models’ ability to adapt to new image and task distributions, providing insights into their performance and limitations.
Results
The evaluation reveals critical insights into the adaptive capabilities of VICL models. The authors report that the models exhibit varying degrees of success across the diverse tasks and datasets, highlighting systematic failure modes. While specific performance metrics (e.g., accuracy, F1 scores) are not detailed in the abstract, the paper emphasizes the importance of these findings in understanding the limitations of current VICL approaches. The results indicate that existing models struggle with certain tasks, suggesting areas for improvement and further research.
Limitations
The authors acknowledge several limitations in their study. First, the one-shot setting may not fully capture the models’ capabilities in more complex scenarios requiring multiple examples. Additionally, the benchmark may not encompass all possible imaging domains and tasks, potentially limiting its generalizability. The authors also note that the models tested may not represent the full spectrum of VICL approaches, as they focus on a select few. Furthermore, the lack of peer review raises questions about the robustness of the findings.
Why it matters
This work is significant as it lays the groundwork for a more comprehensive evaluation of visual in-context learning models, which are increasingly relevant in dynamic and diverse real-world applications. By providing a structured benchmark, the authors facilitate future research aimed at improving VICL methodologies and understanding their limitations. The open release of the VIBE toolkit will enable the broader research community to replicate and extend these findings, fostering advancements in adaptive visual learning systems. This is crucial for developing models that can effectively generalize across varied tasks and domains, as discussed in the context of ongoing research in the field, available on arXiv.
By Turing Wire editorial staff · Jun 9, 2026 · Editorial standards →
Source: arXiv cs.CV