Notable efficiency inference

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li

Published
May 14, 2026 — 17:46 UTC

Problem
This paper addresses the limitations of existing autoregressive (AR) diffusion distillation methods in real-time interactive video generation, particularly focusing on the coarse response granularity and significant sampling latency associated with chunk-wise 4-step generation. The authors identify a critical gap in the initialization of few-step AR students, which is either misaligned with targets, incapable of generating in few steps, or too resource-intensive to scale effectively. This work is presented as a preprint and has not yet undergone peer review.

Method
The authors propose a novel framework called Causal Forcing++, which employs a technique termed causal consistency distillation (causal CD) to enhance the initialization of few-step AR models. The architecture leverages a causal flow map similar to that used in causal ODE distillation but innovatively derives supervision from a single online teacher ODE step between adjacent timesteps. This approach eliminates the need for precomputing and storing full probabilistic flow ODE trajectories, thereby improving both efficiency and optimization ease. The training process is designed to operate in a frame-wise autoregressive manner, specifically targeting 1-2 sampling steps, which is a departure from the traditional chunk-wise methods.

Results
Causal Forcing++ demonstrates significant performance improvements over the state-of-the-art (SOTA) 4-step chunk-wise Causal Forcing in the frame-wise 2-step setting. The reported gains include an increase of 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward metrics. Additionally, the method achieves a 50% reduction in first-frame latency and a substantial decrease in Stage 2 training costs by approximately 4 times. These results indicate a marked enhancement in both the quality of generated video and the efficiency of the generation process.

Limitations
The authors acknowledge that while Causal Forcing++ improves initialization efficiency and reduces latency, it may still face challenges in scalability when applied to more complex video generation tasks or larger datasets. Furthermore, the reliance on a single online teacher ODE step may introduce limitations in the diversity of generated outputs compared to methods that utilize more extensive trajectory data. The paper does not address potential issues related to the generalization of the model across different video domains or the robustness of the causal CD approach under varying conditions.

Why it matters
The implications of this work are significant for the field of interactive video generation, particularly in applications requiring real-time responsiveness and high-quality outputs. By reducing latency and training costs while improving generation quality, Causal Forcing++ paves the way for more scalable and efficient video generation systems. This advancement could facilitate the development of interactive applications in gaming, virtual reality, and other domains where real-time video synthesis is critical. The methodology also opens avenues for further research into causal distillation techniques and their applications in other generative tasks.

Authors: Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li et al.
Source: arXiv:2605.15141
URL: https://arxiv.org/abs/2605.15141v1

Turing Wire

By Turing Wire editorial staff · May 14, 2026 · Editorial standards →

Source: arXiv cs.CV