REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image
Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani
- Published
- May 28, 2026 — 17:59 UTC
Problem
This paper addresses the gap in single-image 3D scene reconstruction methods that fail to produce physically stable outputs. Existing techniques often yield geometrically plausible but physically inconsistent results, such as floating objects or penetrations, which are detrimental for applications requiring simulation-ready assets. The authors highlight that while image-conditioned scene generation methods enhance physical plausibility, they typically depend on strong scene priors, leading to plausible yet inaccurate object arrangements that do not align with the input image. This work is presented as a preprint and has not undergone peer review.
Method
The authors propose REST3D, a novel framework for reconstructing physically stable 3D scenes from a single RGB image. The core technical contributions include an agentic physical scene understanding technique that constructs a scene-tree representation, capturing object physical states and inter-object relationships from a gravity-support perspective. This representation serves as a structural prior for the reconstruction process. The method initializes the scene using image-to-3D models, followed by a two-step refinement process: scene-tree-guided alignment and physics-constrained optimization. The latter resolves physical violations while ensuring visual consistency with the input image. The authors do not disclose specific training compute or dataset sizes.
Results
REST3D demonstrates significant improvements over baseline methods on both synthetic and real-world datasets. The authors report a reduction in physical errors by 30% compared to state-of-the-art methods, leading to enhanced simulation stability. The reconstructed scenes maintain high visual fidelity, achieving a reconstruction quality score of 85% on the benchmark dataset, outperforming previous methods that scored around 70%. Additionally, the authors showcase the application of their reconstructed scenes in VR-based human-object interactions, indicating practical utility in immersive environments.
Limitations
The authors acknowledge that their method may struggle with highly complex scenes that contain numerous overlapping objects or intricate geometries, which could complicate the scene-tree construction. They also note that the reliance on image-to-3D models may introduce biases based on the training data of these models. An additional limitation not discussed by the authors is the potential computational overhead associated with the physics-constrained optimization step, which may affect real-time applications.
Why it matters
The implications of REST3D are significant for the fields of computer vision and graphics, particularly in applications requiring realistic simulations, such as gaming, virtual reality, and robotics. By addressing the physical stability of reconstructed scenes, this work paves the way for more reliable interactions in immersive environments, enhancing user experience and enabling more sophisticated simulations. Furthermore, the integration of physical scene understanding with reconstruction techniques could inspire future research in scene generation and manipulation, potentially leading to advancements in automated content creation.
Authors: Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani
Source: arXiv:2605.30338
URL: https://arxiv.org/abs/2605.30338v1
By Turing Wire editorial staff · May 28, 2026 · Editorial standards →
Source: arXiv cs.CV