HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan C. Liu, Zhiwen Fan
- Published
- Jun 1, 2026 — 17:58 UTC
Problem
The paper addresses the challenge of generating photorealistic and universally applicable 3D human avatars from a single RGB image, a task complicated by the limited availability of diverse, high-quality 3D human datasets. The authors note that existing methods often struggle with generalization due to this data scarcity. This work is presented as a preprint, indicating it has not yet undergone peer review.
Method
HumanNOVA employs a two-pronged data generation strategy to enhance the training dataset. First, it utilizes existing rigged 3D assets, animating them with a wide range of poses derived from daily life scenarios. Second, it incorporates multi-camera captures of humans, applying fitting techniques to generate diverse views for training. This approach scales the dataset to approximately 100,000 assets, significantly improving both quantity and diversity. The architecture is a feed-forward, token-conditioned framework that allows for rapid inference (under one second) without requiring test-time optimization. The model processes an input image alongside a simplified human mesh (SMPL) to create compact token representations, which are then fused through cross-attention mechanisms to construct a triplane-based 3D avatar representation.
Results
HumanNOVA demonstrates superior performance across multiple benchmarks, achieving significant improvements over existing baselines. While specific numerical results are not detailed in the abstract, the authors claim both quantitative and qualitative superiority, indicating robust performance under various input conditions. The model’s ability to generate photorealistic avatars rapidly is a key highlight, suggesting a substantial advancement in the field of 3D human modeling.
Limitations
The authors acknowledge that while their approach significantly enhances data diversity and model performance, it may still be limited by the quality of the input images and the initial mesh estimation. Additionally, the reliance on existing rigged assets may restrict the model’s applicability to scenarios where such assets are available. The paper does not discuss potential biases in the training data or the generalizability of the model to non-standard human poses or appearances.
Why it matters
The implications of HumanNOVA are significant for applications in virtual reality, gaming, and digital content creation, where realistic human avatars are essential. The rapid generation capability opens avenues for real-time applications, enhancing user experiences in interactive environments. This work contributes to the ongoing discourse in 3D modeling and avatar generation, as highlighted in related literature, and is available on arXiv.
By Turing Wire editorial staff · Jun 1, 2026 · Editorial standards →
Source: arXiv cs.CV