A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition
Edwin Arkel Rios, Augusto Christian Surya, Oswin Gosal, Fernando Mikael, Mary Madeline Nicole, Kisoon Jang
- Published
- May 18, 2026 — 17:36 UTC
Problem
This preprint addresses a significant gap in the literature regarding the trade-offs between accuracy and computational cost in fine-grained image recognition (FGIR). While prior research has focused on the selection of backbone architectures, it has largely overlooked the implications of various training and evaluation settings on these trade-offs. The authors conduct a comprehensive study to fill this gap, providing insights into how different configurations impact performance and resource utilization.
Method
The authors conducted over 2000 experiments across six distinct training and evaluation settings, utilizing nine pretrained backbone architectures and 17 datasets. A key technical contribution is the extension of Counterfactual Attention Learning (CAL), which incorporates cross-image discriminative region mixing augmentation alongside existing data-aware cropping and masking techniques. This extension aims to enhance the model’s ability to generalize from limited data while maintaining high accuracy. Additionally, the authors propose an evaluation-only variant of CAL that omits the forward pass on discriminative crops, thereby reducing inference costs without significantly compromising accuracy. The study emphasizes the importance of data augmentation strategies during training, demonstrating that effective augmentation can lead to high accuracy even in the absence of crop-based evaluations.
Results
The findings reveal that models trained with data-aware augmentations achieve competitive accuracy levels, with notable reductions in inference costs. Specifically, the proposed evaluation-only variant maintains accuracy comparable to traditional methods while significantly lowering computational overhead. The authors report that their approach can reduce inference costs by up to 50% compared to standard CAL implementations, while still achieving state-of-the-art performance on benchmark datasets. The results are quantitatively supported by comparisons against established baselines, although specific numerical performance metrics and baseline names are not detailed in the abstract.
Limitations
The authors acknowledge several limitations, including the potential for overfitting due to the extensive use of data augmentation and the reliance on specific backbone architectures that may not generalize across all FGIR tasks. They also note that while their evaluation-only variant reduces computational costs, it may not be suitable for all applications where high precision is critical. Additionally, the study does not explore the long-term implications of these trade-offs in real-world deployment scenarios, which could vary significantly from controlled experimental conditions.
Why it matters
This work has significant implications for the design and deployment of FGIR systems, particularly in resource-constrained environments. By elucidating the accuracy-cost trade-offs associated with different training and evaluation strategies, the findings can guide future research in optimizing model performance while minimizing computational demands. The insights gained from this study can inform the development of more efficient FGIR systems, potentially leading to broader applications in fields such as autonomous driving, medical imaging, and wildlife monitoring. The authors’ commitment to sharing their code and checkpoints further facilitates reproducibility and encourages further exploration in this domain.
Authors: Edwin Arkel Rios, Augusto Christian Surya, Oswin Gosal, Fernando Mikael, Mary Madeline Nicole, Kisoon Jang, Bo-Cheng Lai, Min-Chun Hu
Source: arXiv cs.CV
URL: https://arxiv.org/abs/2605.18700v1
By Turing Wire editorial staff · May 18, 2026 · Editorial standards →
Source: arXiv cs.CV