Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision
Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi
- Published
- Jun 8, 2026 — 15:52 UTC
Problem
Recent anomaly detection methods have demonstrated high accuracy on benchmark datasets like MVTec; however, they often fail under real-world conditions where foundational assumptions—such as consistent object scale, viewpoint, background, illumination, and centered placement—are violated. This paper addresses these limitations by proposing a new approach to anomaly detection that is robust to such variations. The work is presented as a preprint and has not yet undergone peer review.
Method
The authors introduce a three-pronged approach to improve anomaly detection performance. First, they develop a visual prompting pipeline that employs foreground-background masking to isolate objects, enhancing the model’s focus on relevant features. Second, they implement a dual-teacher supervision mechanism that allows for the unfreezing of the teacher model in student-teacher architectures, thereby improving domain adaptability. Third, they propose a data augmentation strategy that utilizes diffusion-generated synthetic images to enrich the training dataset, thereby enhancing the model’s robustness. The backbone architecture used for these contributions is the Masked Multiscale Reconstruction (MMR) model, which is specifically designed to handle the complexities of anomaly detection tasks.
Results
The proposed method achieves a notable improvement of 3.5 percentage points over the previous state-of-the-art on the AeBAD dataset, a benchmark known for its challenging conditions. This performance enhancement is quantified against existing methods, demonstrating the effectiveness of the visual prompting and dual-teacher supervision strategies. The authors provide detailed comparisons with baseline models, although specific baseline performance metrics are not disclosed in the summary.
Limitations
The authors acknowledge that while their approach improves adaptability and robustness, it may still struggle with extreme variations in object appearance or context that are not represented in the training data. Additionally, the reliance on synthetic data for augmentation could introduce biases if the generated images do not accurately reflect real-world anomalies. The paper does not address the computational overhead introduced by the dual-teacher supervision mechanism, which may impact scalability in resource-constrained environments.
Why it matters
This work has significant implications for the field of anomaly detection, particularly in applications where real-world conditions deviate from training scenarios. By enhancing the adaptability of models to various object appearances and contexts, the proposed framework could lead to more reliable anomaly detection systems in industrial and safety-critical applications. The integration of visual prompting and dual-teacher supervision may also inspire future research directions in model training and data augmentation strategies, as highlighted in similar works on domain adaptation and synthetic data generation, as published in arXiv.
By Turing Wire editorial staff · Jun 8, 2026 · Editorial standards →
Source: arXiv cs.AI