Notable multimodal

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou

Published: Jun 1, 2026 — 17:59 UTC

Problem
This paper addresses the challenge of Multimodal Continual Instruction Tuning (MCIT) in Multimodal Large Language Models (MLLMs), particularly the issue of inter-task interference when tasks with distinct response structures share similar visual-linguistic semantics. Existing methods, such as Mixture of LoRA Experts, inadequately handle task assignment based solely on image-text similarity, leading to incorrect routing and gradient interference. The authors highlight that current literature lacks a robust mechanism for format-aware task assignment, which is critical for effective continual learning in MLLMs. This work is presented as a preprint and has not undergone peer review.

Method
ProtoAda introduces a novel framework that employs prototype-guided adaptive tuning. The core components include format-aware task prototypes that align task assignment with both semantic content and output structure. This is achieved through a two-step process: first, tasks are assigned to experts based on their prototypes, which encapsulate the semantic essence of the tasks; second, updates are consolidated in a geometry-aware manner, ensuring that only format-compatible updates are integrated into the shared parameters. The architecture leverages a sparse expert model, enhancing collaboration among experts while minimizing interference. The training compute details are not disclosed, but the framework is designed to progressively refine existing parameters through effective reuse.

Results
ProtoAda was evaluated across multiple benchmarks, demonstrating significant performance improvements over baseline methods. Notably, it outperformed Mixture of LoRA Experts by achieving a 5% increase in accuracy on the VQA task and a 7% improvement on grounding tasks, where traditional methods struggled due to format-blind task assignments. The results indicate that ProtoAda effectively mitigates the negative effects of sequential tuning, particularly in scenarios where answer structures are prone to corruption.

Limitations
The authors acknowledge that while ProtoAda improves task assignment and parameter consolidation, it may still face challenges in highly heterogeneous task environments where task semantics diverge significantly. Additionally, the reliance on prototype definitions may introduce biases if the prototypes do not accurately represent the task semantics. The paper does not address the computational overhead introduced by the prototype-guided mechanism, which could impact scalability in real-world applications.

Why it matters
The implications of ProtoAda extend to the broader field of continual learning in MLLMs, providing a framework that enhances the robustness of task assignment and parameter updates. This work lays the groundwork for future research into format-aware learning strategies, potentially influencing the design of more sophisticated MLLMs capable of handling diverse tasks without performance degradation. The findings contribute to the ongoing discourse on improving continual learning methodologies, as discussed in related works on task interference and expert collaboration, and are available on arXiv.

By Callan Zhang · Jun 1, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG