Major evaluation benchmarks Hugging Face

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour

Published: Jun 10, 2026 — 14:56 UTC

Problem
This paper addresses the gap in the capability of large vision-language models (LVLMs) to perform high-stakes clinical reasoning that is grounded in visual evidence and clinical knowledge. Existing models often provide correct answers without demonstrating the reasoning process behind them. The authors present OpenMedReason, a large-scale, open multimodal medical reasoning corpus, to fill this gap. This work is a preprint and has not yet undergone peer review.

Method
OpenMedReason comprises approximately 450,000 image-question-answer instances, with reasoning traces derived from curated biomedical literature. The dataset encompasses various medical modalities, including radiological scans, microscopic images, and charts, providing high-fidelity supervision that surpasses synthetic reasoning chains. The authors also introduce OpenMedReason-Bench, a benchmark designed for fine-grained evaluation of LVLMs across three axes: perception, medical knowledge, and rationale. The models are trained using supervised fine-tuning (SFT) and reinforcement-based alignment techniques. The paper does not disclose specific training compute details.

Results
Training with OpenMedReason results in a 20% average improvement in visual question answering (VQA) accuracy compared to the base model. The performance of the model trained on OpenMedReason is within 4.2% of the best-performing comparable-scale medical LVLMs. Fine-grained analysis indicates that improvements are not limited to a single axis; rather, OpenMedReason enhances perception, medical knowledge, and rationale collectively. Additionally, the reasoning traces generated from OpenMedReason are preferred over those from the base model in 86.1% of pairwise comparisons.

Limitations
The authors acknowledge that while OpenMedReason significantly improves reasoning capabilities, it may still be limited by the quality and diversity of the biomedical literature from which the reasoning traces are derived. Furthermore, the dataset’s reliance on human-authored articles may introduce biases present in the source material. The paper does not address potential overfitting issues that could arise from the dataset’s size or the specific modalities included.

Why it matters
The introduction of OpenMedReason has significant implications for the development of LVLMs in clinical settings, as it provides a structured approach to evaluate and enhance reasoning capabilities grounded in visual evidence. This work enables more reliable and interpretable AI systems in healthcare, which is critical for high-stakes decision-making. The dataset and benchmark are expected to facilitate further research in medical AI, promoting advancements in multimodal reasoning. For more details, see the full paper available on arXiv.

By Callan Zhang · Jun 10, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL