Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang
- Published
- Jun 11, 2026 — 17:54 UTC
Problem
This paper addresses the limitations of existing human-centric reconstruction methods that depend on explicit geometry priors such as skeletons, depth maps, or normals. The authors propose Flex4DHuman, a preprint work that leverages relative camera-pose conditioning to generate synchronized dense multi-view videos from monocular or sparse multi-view inputs. This approach aims to simplify the reconstruction pipeline by eliminating the need for explicit geometric information, which can be cumbersome and restrictive.
Method
Flex4DHuman is built upon the Wan 2.1 1.3B text-to-video model, maintaining its backbone architecture while introducing a novel five-axis positional encoding system. This system extends spatio-temporal relative positional encoding (RoPE) by incorporating view indices and continuous SE(3) relative camera geometry. The model is trained through a three-stage curriculum that includes: (1) pose following, (2) flexible reference-to-target view generation, and (3) temporal rollout. For temporal rollout, the model is trained with clean historical target-view tokens, enhancing its ability to generate coherent video sequences. Additionally, multi-view captions are integrated to facilitate test-time text control, allowing for more versatile output generation. The final output is processed through a 4D Gaussian Splatting stage, transforming the generated videos into dynamic 4D Gaussian splats suitable for various applications.
Results
Flex4DHuman demonstrates superior performance compared to prior state-of-the-art methods on the DNA-Rendering and ActorsHQ benchmarks. Specific quantitative results indicate that the model achieves a significant improvement in reconstruction quality, although exact numerical metrics are not disclosed in the abstract. The authors also report that the model generalizes effectively to animal categories when trained on a mixed dataset of human and animal videos, showcasing its versatility and robustness.
Limitations
The authors acknowledge that the model’s reliance on relative camera-pose conditioning may limit its performance in scenarios with highly dynamic or occluded subjects. Additionally, the training process requires a substantial amount of clean historical target-view tokens, which may not always be available in practical applications. The paper does not address potential computational costs associated with the three-stage training curriculum or the scalability of the model for real-time applications.
Why it matters
Flex4DHuman represents a significant advancement in the field of 4D human reconstruction, particularly for applications in simulation, gaming, augmented reality (AR), virtual reality (VR), and video re-shooting. By enabling the transformation of casual monocular videos into dynamic 4D representations without the need for explicit geometry, this work paves the way for more accessible and scalable content creation methods. The implications of this research are profound, as it could democratize high-quality 4D content generation, making it feasible for a broader range of users and applications, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →
Source: arXiv cs.CV