ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
Hui Wang, Hongze Li, Wei Chen, Xiaojin Zhang
Problem
This paper addresses the limitations of existing transformer-based architectures in small-object detection, particularly in spatially heterogeneous scenes such as dense imagery containing very small targets. The authors highlight that traditional methods impose a uniform global receptive field, leading to local feature degradation. This work is presented as a preprint and has not yet undergone peer review.
Method
The core contribution of this work is the ViCrop-Det framework, which requires no additional training. It introduces Spatial Attention Entropy (SAE), a measure computed from the detection decoder's cross-attention distribution that quantifies local spatial ambiguity, and uses it to guide dynamic spatial routing at inference time. By shrinking the spatial trust region, ViCrop-Det concentrates computation on areas that combine high target saliency with high cognitive uncertainty. Injecting high-frequency localized observations from these regions resolves spatial ambiguity and recovers fine-grained features. Because the method requires no architectural changes, it can be applied to existing detectors as-is.
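The entropy-guided selection step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`spatial_attention_entropy`, `select_crop`), the single-crop policy, the grid-to-pixel mapping, and the fixed crop fraction are all hypothetical choices for the sake of a runnable example.

```python
import numpy as np

def spatial_attention_entropy(attn, eps=1e-12):
    """Shannon entropy of one (H, W) cross-attention map.
    A diffuse (high-entropy) map signals spatial ambiguity;
    a peaked (low-entropy) map signals a confident localization."""
    p = attn / (attn.sum() + eps)          # normalize to a distribution
    return float(-(p * np.log(p + eps)).sum())

def select_crop(attn_maps, image_hw, crop_frac=0.4):
    """Pick one crop window for re-inspection at higher resolution.

    attn_maps: (Q, H, W) decoder cross-attention, one map per query.
    image_hw:  (H_img, W_img) size of the input image in pixels.
    Returns (y0, x0, y1, x1) in pixel coordinates.

    Policy (an illustrative assumption): route compute to the most
    ambiguous query, centering the crop on its attention peak.
    """
    entropies = [spatial_attention_entropy(a) for a in attn_maps]
    q = int(np.argmax(entropies))                  # most ambiguous query
    h, w = attn_maps[q].shape
    gy, gx = np.unravel_index(np.argmax(attn_maps[q]), (h, w))
    H, W = image_hw
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    cy, cx = int(gy / h * H), int(gx / w * W)      # grid cell -> pixels
    y0 = min(max(cy - ch // 2, 0), H - ch)         # clamp to image bounds
    x0 = min(max(cx - cw // 2, 0), W - cw)
    return (y0, x0, y0 + ch, x0 + cw)
```

In a full pipeline the selected crop would be re-encoded at native resolution and its detections merged back into the global prediction set; that merging step is omitted here.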
Results
ViCrop-Det was evaluated on the VisDrone and DOTA-v1.5 datasets, demonstrating significant performance improvements. Specifically, it achieved an increase of +1-3 mAP@50 over the RT-DETR-R50 and Deformable DETR baselines, with a latency overhead of only 20-23%. On the MS COCO dataset, the small object average precision ($AP_{S}$) improved, while the medium ($AP_{M}$) and large ($AP_{L}$) object precisions remained stable, indicating that the method effectively refines small-scale features without degrading overall performance. Under compute-matched conditions, ViCrop-Det’s adaptive routing strategy outperformed uniform slicing baselines, achieving a superior accuracy-speed trade-off.
Limitations
The authors acknowledge that the method’s performance is contingent on the quality of the attention mechanism used in the underlying detection models. Additionally, the reliance on a fixed computational budget may limit the framework’s applicability in scenarios with varying resource constraints. The paper does not address potential scalability issues when applied to larger datasets or more complex scenes beyond those tested.
Why it matters
ViCrop-Det presents a significant advancement in small-object detection by providing a training-free solution that enhances feature recovery in spatially heterogeneous environments. Its reliance on attention entropy for dynamic spatial routing could inspire further research into adaptive inference strategies in other domains of computer vision. The implications of this work extend to applications requiring precise detection in cluttered scenes, such as autonomous driving and surveillance, where small-object detection is critical.
Source: arXiv:2604.26806
URL: https://arxiv.org/abs/2604.26806v1