Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink
- Published: May 7, 2026, 17:51 UTC
- Summary length: 418 words
- Relevance score: 80%
Problem
This paper addresses the lack of standardized evaluation protocols in Multimodal Domain Generalization (MMDG), which has led to fragmented research and inconsistent performance comparisons across studies. The authors argue that reported improvements may not reflect genuine algorithmic advances but rather differences in experimental setups. Existing benchmarks focus primarily on action recognition and neglect critical aspects such as input corruptions, missing modalities, and model trustworthiness. This work aims to fill that gap with a comprehensive benchmarking framework.
Method
The authors introduce MMDG-Bench, a unified benchmark designed to standardize evaluation across six datasets and three tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench evaluates nine representative methods across six modality combinations and multiple experimental settings. The benchmark assesses not only standard accuracy but also robustness to input corruptions, generalization with missing modalities, misclassification detection, and out-of-distribution detection. A total of 7,402 neural networks were trained across 95 unique cross-domain tasks, providing a robust empirical basis for the evaluation.
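For intuition, the sketch below illustrates the kind of evaluation MMDG-Bench standardizes: a late-fusion ERM baseline scored on a held-out target domain, once with all modalities present and once with a modality dropped at test time. The architecture, modality names, feature dimensions, and zero-imputation strategy for missing modalities are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch) of a late-fusion ERM baseline evaluated on a target
# domain with and without a missing modality. All design choices here are
# assumptions for illustration, not MMDG-Bench's implementation.
import torch
import torch.nn as nn


class LateFusionERM(nn.Module):
    """Per-modality encoders followed by a shared classification head."""

    def __init__(self, dims: dict, hidden: int = 128, num_classes: int = 8):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for m, d in dims.items()}
        )
        self.head = nn.Linear(hidden * len(dims), num_classes)

    def forward(self, batch: dict, missing: frozenset = frozenset()):
        feats = []
        for m, enc in self.encoders.items():
            x = batch[m]
            if m in missing:                 # simulate a modality missing at test time
                x = torch.zeros_like(x)      # zero-imputation (assumed strategy)
            feats.append(enc(x))
        return self.head(torch.cat(feats, dim=-1))


@torch.no_grad()
def accuracy(model, loader, missing=frozenset()):
    """Top-1 accuracy on a target-domain loader yielding (batch_dict, labels)."""
    model.eval()
    correct = total = 0
    for batch, labels in loader:
        preds = model(batch, missing=missing).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)


# Usage (hypothetical feature sizes and loader):
# dims = {"video": 512, "audio": 128, "flow": 256}
# model = LateFusionERM(dims)
# for drop in [frozenset(), {"audio"}, {"video"}, {"flow"}]:
#     print(sorted(drop) or "none", accuracy(model, target_domain_loader, missing=drop))
```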
Results
The findings from MMDG-Bench reveal several critical insights: (1) Under fair comparisons, specialized MMDG methods show only marginal improvements over the Empirical Risk Minimization (ERM) baseline. (2) No single method consistently outperforms others across different datasets or modality combinations. (3) A significant performance gap remains to be addressed, indicating that MMDG is not yet a solved problem. (4) Trimodal fusion does not consistently yield better results than the strongest bimodal configurations. (5) All evaluated methods demonstrate considerable performance degradation when faced with corruptions and missing modalities, with some methods further undermining model trustworthiness.
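To make the trustworthiness findings concrete, the sketch below shows one common way misclassification detection and out-of-distribution detection are scored: maximum softmax probability as the confidence signal and AUROC as the metric. The summary does not specify which scores the benchmark uses, so these choices should be read as assumptions.

```python
# Minimal sketch of confidence-based trustworthiness scoring: maximum softmax
# probability (MSP) as the confidence score and AUROC as the metric. These are
# common defaults assumed here; the benchmark's exact metrics are not stated
# in this summary.
import numpy as np
from sklearn.metrics import roc_auc_score


def msp_confidence(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability per sample (logits: [N, num_classes])."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)


def misclassification_auroc(logits: np.ndarray, labels: np.ndarray) -> float:
    """AUROC for separating correct from incorrect predictions by confidence."""
    conf = msp_confidence(logits)
    correct = (logits.argmax(axis=1) == labels).astype(int)
    return roc_auc_score(correct, conf)


def ood_auroc(logits_in: np.ndarray, logits_out: np.ndarray) -> float:
    """AUROC for separating in-distribution from OOD samples by confidence."""
    conf = np.concatenate([msp_confidence(logits_in), msp_confidence(logits_out)])
    is_in = np.concatenate([np.ones(len(logits_in)), np.zeros(len(logits_out))])
    return roc_auc_score(is_in, conf)
```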
Limitations
The authors acknowledge that their benchmark may not encompass all possible real-world scenarios and that the evaluation settings, while comprehensive, could still miss certain edge cases. Additionally, the reliance on specific datasets may limit the generalizability of the findings. The paper does not address the computational costs associated with training the large number of neural networks, which could be a barrier for some researchers.
Why it matters
This work is significant as it provides a much-needed standardized framework for evaluating MMDG methods, facilitating more reliable comparisons and assessments of algorithmic progress in the field. By highlighting the limitations of current approaches and the persistent gaps in performance, it sets the stage for future research to focus on more robust and trustworthy multimodal models. The findings encourage the community to rethink existing methodologies and explore new avenues for improving model performance in real-world applications.
Authors: Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink
Source: arXiv:2605.06643
URL: https://arxiv.org/abs/2605.06643v1