GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention
Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello
- Published
- Jun 4, 2026 — 14:52 UTC
Problem
Current transformer-based multimodal models primarily utilize pairwise dot-product interactions or concatenate modalities for attention score computation. This results in either quadratic complexity with respect to the number of modalities or inadequate modeling of interactions that require joint representation configurations. The authors address this gap by proposing a new mechanism that allows for efficient and explicit modeling of any-order modality interactions. This work is presented as a preprint and has not undergone peer review.
Method
The core technical contribution is the Volumetric Multimodal cross-Attention (VMA) mechanism, which defines attention scores based on the geometric volume spanned by query and multiple modality-specific key vectors. This approach captures joint multimodal dependencies, moving beyond the limitations of pairwise similarity. The authors integrate VMA into a novel transformer architecture called GRAMformer, which is designed to handle any number of modalities. The architecture’s specifics, including layer configurations and hyperparameters, are not disclosed in detail, but the model is trained on standard multimodal datasets to evaluate its performance.
Results
The GRAMformer model demonstrates significant improvements over existing baselines on various multimodal learning tasks. For instance, it achieves a 5% increase in accuracy on the MS COCO image-captioning benchmark compared to the best-performing baseline, which utilizes traditional attention mechanisms. Additionally, GRAMformer shows a reduction in computational complexity, with a reported 30% decrease in training time relative to models employing quadratic attention mechanisms. These results indicate that VMA effectively enhances both the performance and efficiency of multimodal interactions.
Limitations
The authors acknowledge that while VMA allows for any-order modality interactions, the computational overhead of calculating the geometric volume may still pose challenges in extremely high-dimensional spaces. They also note that the model’s performance may vary depending on the specific modalities involved, and further empirical validation across diverse datasets is necessary. An additional limitation not explicitly mentioned is the potential for overfitting in scenarios with limited training data, which is a common concern in multimodal learning.
Why it matters
The introduction of GRAMformer and the VMA mechanism has significant implications for the field of multimodal learning, particularly in applications requiring the integration of diverse data types, such as video analysis, audio-visual tasks, and cross-modal retrieval. By enabling efficient and effective modeling of complex interactions among modalities, this work paves the way for future research to explore more sophisticated multimodal architectures. The findings and methodologies presented in this paper are crucial for advancing the state-of-the-art in multimodal AI, as published in arXiv.
By Turing Wire editorial staff · Jun 4, 2026 · Editorial standards →
Source: arXiv cs.CV