Notable multimodal

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Wei Li, Zhen Huang, Xinmei Tian

Published
Jun 11, 2026 — 12:45 UTC

Problem
This work addresses the limitations of existing contrastively trained vision-language models (VLMs) like CLIP, which struggle with compositional understanding due to their “bag-of-words” behavior. Specifically, these models fail to effectively capture object relations, attribute-object bindings, and word order dependencies. The authors highlight that the reliance on global, single-vector representations for optimization and the inadequate modeling of compositional information in paired image-text data contribute to these shortcomings. This paper is a preprint and has not undergone peer review.

Method
The authors propose MACCO (MAsked Compositional Concept MOdeling), a framework that enhances compositionality by masking compositional concepts in one modality (either image or text) and reconstructing them using contextual information from the other modality. The architecture incorporates two auxiliary objectives designed to align and regularize masked features both inter-modally (between image and text) and intra-modally (within the same modality). The training process leverages a large dataset of paired image-text samples, although specific details regarding the dataset size and training compute are not disclosed.

Results
MACCO was evaluated on five compositional benchmarks, demonstrating significant improvements over baseline models. The authors report that MACCO achieves a 12% increase in compositional accuracy on the COCO dataset compared to CLIP, and a 15% improvement on the Visual Genome dataset. Additionally, the framework shows enhanced performance in capturing syntactic structures and linguistic information, which translates to better results in text-to-image generation tasks. The results indicate that the proposed method not only enhances compositionality but also positively impacts multimodal large language models.

Limitations
The authors acknowledge that while MACCO improves compositional understanding, it may still struggle with highly complex compositional tasks that require deeper reasoning. They also note that the framework’s performance is contingent on the quality of the paired image-text data used for training. An obvious limitation not discussed by the authors is the potential computational overhead introduced by the masking and reconstruction processes, which may affect scalability in real-world applications.

Why it matters
The implications of this work are significant for advancing the capabilities of vision-language models, particularly in applications requiring nuanced understanding of compositional structures. By addressing the limitations of existing models, MACCO paves the way for more sophisticated interactions between visual and linguistic modalities, which can enhance tasks such as image captioning, visual question answering, and multimodal generation. This research contributes to the ongoing discourse in the field, as published in arXiv cs.AI.

Turing Wire

By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →

Source: arXiv cs.AI