Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi
- Published
- May 19, 2026 — 13:44 UTC
Problem
This preprint addresses a significant gap in the literature regarding the adaptive invocation of tools in multimodal large language models (MLLMs). While prior research has primarily focused on enabling models to invoke tools, it has largely overlooked the critical aspect of determining when tool invocation is beneficial. The authors argue that unnecessary or inappropriate tool usage can lead to increased reasoning overhead and potentially erroneous predictions, highlighting the need for a more nuanced approach to tool invocation.
Method
The authors introduce AutoTool, a novel framework that employs reinforcement learning to adaptively decide whether to invoke tools based on the characteristics of each query. The core technical contribution is the implementation of a dual-mode reasoning strategy, which incorporates mode-specific reward functions to optimize the model’s decision-making process. This strategy allows AutoTool to balance between tool-assisted reasoning and text-centric reasoning throughout training. The model is designed to prevent premature bias towards a single reasoning mode by promoting exploration during the later stages of training. The architecture details, including specific neural network configurations or hyperparameters, are not disclosed in the abstract.
Results
AutoTool demonstrates significant performance improvements over baseline models. On the V* benchmark, it achieves a 21.8% increase in accuracy compared to the base model. Furthermore, on the POPE benchmark, AutoTool shows a remarkable 44.9% enhancement in efficiency relative to existing tool-augmented methods. These results indicate that AutoTool not only improves accuracy but also optimizes the computational resources required for reasoning tasks.
Limitations
The authors acknowledge that their approach may still be limited by the quality and diversity of the training data, which could affect the model’s ability to generalize across different types of queries. Additionally, the reliance on reinforcement learning introduces challenges related to sample efficiency and convergence, which are not fully addressed in the paper. The exploration-exploitation trade-off may also lead to suboptimal performance in certain scenarios if not managed effectively. The authors do not discuss potential biases introduced by the training data or the implications of tool selection on model interpretability.
Why it matters
The implications of this work are significant for the development of more efficient and effective MLLMs. By introducing a mechanism for adaptive tool invocation, AutoTool paves the way for models that can better discern when to leverage external tools, thereby reducing unnecessary computational overhead and improving prediction accuracy. This research could influence future work in multimodal reasoning, particularly in applications where the cost of tool invocation is high or where the appropriateness of tool usage is context-dependent. The findings encourage further exploration of adaptive reasoning strategies in AI systems, potentially leading to more robust and intelligent models.
Authors: Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi
Source: arXiv:2605.19852
URL: https://arxiv.org/abs/2605.19852v1
By Turing Wire editorial staff · May 19, 2026 · Editorial standards →
Source: arXiv cs.CL