Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor
- Published
- Jun 1, 2026 — 17:59 UTC
Problem
The paper addresses the underexplored area of executable inverse graphics, which aims to reconstruct 3D scenes from 2D images. Traditional methods often rely on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision, which limits their applicability and flexibility. The authors propose a novel approach using pretrained vision-language models (VLMs) to perform this task without these constraints. This work is presented as a preprint, indicating it has not yet undergone peer review.
Method
The core contribution is the Staged Executable Inverse Graphics (SEIG) framework, which reconstructs a 3D scene from a single image by progressively refining various scene factors—geometry, materials, composition, and lighting—directly in the executable code space of Blender. The method leverages the capabilities of VLMs to interpret and manipulate scene elements, allowing for a more agentic approach to scene reconstruction. The authors detail the architecture of their framework, although specific training compute resources and dataset sizes are not disclosed. The staged approach emphasizes task decomposition, which enhances the fidelity of the reconstruction process.
Results
The SEIG framework was evaluated across diverse scenes using multiple reconstruction metrics, including pixel-level accuracy, perceptual quality, and semantic fidelity. The results indicate that staged reconstruction significantly improves fidelity compared to baseline methods, although specific numerical results and comparisons to named baselines are not provided in the abstract. The authors highlight the effectiveness of their approach in producing high-quality, editable Blender scenes, demonstrating the potential of VLMs in this domain.
Limitations
The authors acknowledge that their approach may be limited by the inherent biases and capabilities of the pretrained VLMs used, which could affect the quality of the reconstructed scenes. Additionally, the reliance on a single image for reconstruction may not capture all necessary scene details, potentially leading to incomplete or inaccurate representations. The paper does not discuss the computational efficiency of the SEIG framework or its scalability to more complex scenes.
Why it matters
This work has significant implications for the fields of computer vision and graphics, particularly in applications requiring scene manipulation and editing. By demonstrating that VLMs can effectively perform executable inverse graphics, the authors open avenues for further research into integrating language understanding with visual scene reconstruction. This could lead to advancements in interactive design tools, virtual reality environments, and automated content generation. The findings contribute to the ongoing discourse on the capabilities of VLMs in complex tasks, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 1, 2026 · Editorial standards →
Source: arXiv cs.CV