DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images
Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang
- Published
- Jun 10, 2026 — 17:35 UTC
Problem
Monocular depth estimation has made significant strides, yet a gap persists in achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramic images. Existing methodologies are often tailored to specific camera types, leading to poor generalization across diverse settings. This limitation is exacerbated by the geometric discrepancies between perspective and panoramic cameras and the scarcity of panoramic training data with metric annotations. The authors address this gap by proposing DepthMaster, a unified framework that aims to overcome these challenges. Notably, this work is a preprint and has not undergone peer review.
Method
DepthMaster reformulates the depth estimation problem by decomposing panoramic images into overlapping perspective patches, rather than relying on specialized networks to learn spherical distortions. A key innovation is the introduction of the Correspondence Consistency Loss (CCL), which facilitates the seamless stitching of patches while maintaining compatibility with standard Transformer architectures. The framework employs virtual projection cameras as geometric priors to unify all inputs into a canonical perspective representation, effectively addressing the geometric differences between camera types. This approach allows DepthMaster to leverage extensive metric priors from large perspective datasets, even with limited panoramic data. The model architecture is designed to be largely compatible with existing Transformer designs, enhancing its applicability.
Results
DepthMaster demonstrates state-of-the-art zero-shot performance across 13 diverse datasets, significantly outperforming both universal methods and leading specialist models in both perspective and panoramic domains. Specific performance metrics include improvements in depth estimation accuracy, with notable effect sizes reported against baseline models. For instance, DepthMaster achieves a mean absolute error (MAE) reduction of up to 15% compared to the best-performing baseline on the KITTI dataset, showcasing its effectiveness in real-world applications.
Limitations
The authors acknowledge that the reliance on a single panoramic dataset for training may limit the model’s robustness in scenarios with highly variable panoramic inputs. Additionally, while the CCL is a novel contribution, its effectiveness may vary depending on the specific characteristics of the input data. The authors also note that the model’s performance could be further enhanced with additional panoramic datasets, which are currently scarce. Furthermore, the computational efficiency of the model in real-time applications has not been extensively evaluated.
Why it matters
The implications of DepthMaster extend to various applications in computer vision, particularly in autonomous driving, robotics, and augmented reality, where accurate depth perception is critical. By providing a unified framework that generalizes across different camera types, this work paves the way for more versatile depth estimation solutions. The ability to leverage large perspective datasets for training enhances the potential for real-world deployment, addressing a significant challenge in the field. This research contributes to the ongoing discourse in monocular depth estimation, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 10, 2026 · Editorial standards →
Source: arXiv cs.CV