AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation
Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang
- Published
- Jun 2, 2026 — 17:55 UTC
Problem
This paper addresses the limitations of existing adversarial distillation methods in one-step autoregressive video generation, which often suffer from motion collapse and training instability, leading to the production of static videos. The authors highlight that current approaches do not effectively leverage the temporal dynamics necessary for generating coherent video sequences. This work is presented as a preprint, indicating that it has not yet undergone peer review.
Method
The core contribution of AAD-1 is the introduction of an Asymmetric Adversarial Distillation framework that modifies the traditional generator-discriminator architecture. The generator maintains a causal structure to ensure autoregressive sampling, while the discriminator employs a bidirectional attention mechanism over the entire spatiotemporal context of the video. This design allows the discriminator to generate a holistic realism score for the entire sequence, enabling it to identify global temporal inconsistencies and long-range drift that contribute to motion collapse. Additionally, the authors propose a phased training strategy that begins with distribution matching to stabilize the one-step generator before transitioning to adversarial distillation. This warm-up phase is crucial for aligning the student distribution with the teacher distribution, thereby enhancing training stability.
Results
AAD-1 demonstrates significant improvements over existing baselines on the VBench benchmark for one-step autoregressive video generation. The paper reports that AAD-1 achieves state-of-the-art performance, although specific quantitative results (e.g., FID scores, PSNR) are not detailed in the provided abstract. The authors emphasize that their method effectively mitigates motion collapse and improves the overall quality of generated videos compared to prior approaches.
Limitations
The authors acknowledge that while AAD-1 improves upon existing methods, it may still face challenges in generating highly complex motion patterns and may require extensive computational resources for training. They do not discuss potential limitations related to the generalizability of the model across diverse video datasets or the scalability of the architecture for longer video sequences.
Why it matters
The implications of AAD-1 are significant for advancing the field of video generation, particularly in applications requiring high-quality, temporally coherent outputs. By addressing the critical issues of motion collapse and training instability, this work paves the way for more robust autoregressive models in video synthesis. The findings contribute to the ongoing discourse in generative modeling, particularly in the context of adversarial training techniques. This research is relevant for future explorations in video generation methodologies, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 2, 2026 · Editorial standards →
Source: arXiv cs.CV