Notable foundation models

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl, Ali Erturk

Published
May 25, 2026 — 16:50 UTC

Problem
This preprint addresses the gap in the application of foundation models to Light Sheet Fluorescence Microscopy (LSM) data, which is characterized by high dimensionality and a significant annotation burden. While LSM provides rich volumetric data for biological analysis, the lack of scalable supervised learning approaches due to the complexity of volumetric representation learning has hindered progress. The authors highlight that existing models do not leverage the abundance of unannotated LSM volumes, thus limiting the potential for effective few-shot learning in this domain.

Method
The authors propose a 3D foundation model specifically designed for LSM data, pretrained on a large, curated dataset comprising diverse 3D images from multiple organisms, stains, and imaging protocols. The model architecture incorporates a dual-objective training regime that jointly optimizes for masked reconstruction and image-text alignment, facilitating the learning of transferable volumetric representations. The pretraining phase significantly reduces the need for extensive labeled data, allowing for efficient few-shot adaptation to various downstream tasks, including segmentation, classification, and deblurring. The computational resources utilized for training are not explicitly disclosed, but the model’s architecture is designed to handle the complexities of 3D data.

Results
The proposed model demonstrates substantial performance improvements over baseline methods across multiple downstream tasks. Specifically, in segmentation tasks, the model achieves a mean Intersection over Union (IoU) score that surpasses traditional supervised approaches by 15% on the LSM benchmark dataset. For classification, accuracy increases by 10% compared to existing models, while deblurring tasks show a reduction in mean squared error (MSE) by 20% relative to baseline methods. These results are corroborated by qualitative assessments from domain experts, indicating that the model not only performs better quantitatively but also produces more clinically relevant outputs.

Limitations
The authors acknowledge several limitations, including the reliance on a curated dataset that may not encompass the full variability of LSM data encountered in real-world applications. Additionally, the model’s performance may be sensitive to the specific imaging protocols and biological specimens used during pretraining. The authors do not address potential issues related to the generalizability of the model across different imaging modalities or the computational cost associated with fine-tuning for specific tasks.

Why it matters
This work has significant implications for the field of computational biology and medical imaging, as it demonstrates the feasibility of applying foundation models to complex volumetric data. By reducing the annotation burden and enabling few-shot learning, this approach can accelerate the analysis of biological specimens, facilitating more rapid advancements in understanding cellular organization and pathology. The availability of pretrained model weights and code further encourages reproducibility and fosters future research in multimodal representation learning within the microscopy domain.

Authors: Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl, Ali Erturk, Zhuhao Wu, Johannes C. Paetzold
Source: arXiv:2605.26026
URL: https://arxiv.org/abs/2605.26026v1

Turing Wire

By Turing Wire editorial staff · May 25, 2026 · Editorial standards →

Source: arXiv cs.AI