RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna
- Published
- May 14, 2026 — 17:59 UTC
Problem
This paper addresses a significant gap in the capability of existing video generation models, particularly those based on latent diffusion frameworks. While these models utilize heavily conditioned denoising networks, their decoders often operate in an unconditional manner, leading to a loss of detail and structural inconsistency in generated videos. The authors propose that the decoder should also be conditioned to maintain fidelity to the input reference image. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors introduce RefDecoder, a reference-conditioned video Variational Autoencoder (VAE) decoder. The core innovation lies in the integration of a high-fidelity reference image into the decoding process through a mechanism termed reference attention. A lightweight image encoder is employed to transform the reference frame into detail-rich high-dimensional tokens. These tokens are then co-processed with the denoised video latent tokens at each up-sampling stage of the decoder. This architecture allows for enhanced conditioning during the decoding phase, which is critical for preserving structural integrity and detail in the generated video outputs. The authors evaluate RefDecoder across various decoder backbones, including Wan 2.1 and VideoVAE+, demonstrating its versatility and ease of integration into existing systems without necessitating additional fine-tuning.
Results
RefDecoder achieves notable performance improvements over unconditional baselines, reporting up to +2.1 dB PSNR on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. The authors also highlight improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. These results indicate that the reference conditioning significantly enhances the quality of generated videos, making them more coherent and visually appealing compared to traditional methods.
Limitations
The authors acknowledge that while RefDecoder improves upon existing methods, it may still be limited by the quality of the reference images used. Additionally, the paper does not explore the computational overhead introduced by the reference attention mechanism, which could impact real-time applications. The generalizability of the approach to other video generation tasks beyond those tested remains to be fully validated.
Why it matters
The introduction of RefDecoder has significant implications for the field of video generation and related applications. By demonstrating that conditioning the decoder can lead to substantial improvements in video quality, this work paves the way for more sophisticated models that can better leverage reference information. The ability to integrate RefDecoder into existing systems without fine-tuning also suggests a practical pathway for enhancing current video generation frameworks. Furthermore, the generalization of this approach to tasks such as style transfer and video editing refinement indicates its potential to influence a wide range of visual generation applications.
Authors: Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna
Source: arXiv:2605.15196
URL: https://arxiv.org/abs/2605.15196v1
By Turing Wire editorial staff · May 14, 2026 · Editorial standards →
Source: arXiv cs.CV