Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang
- Published
- Jun 4, 2026 — 16:30 UTC
Problem
The paper addresses the gap in physical consistency in image-to-video diffusion models, which often generate motion that violates physical laws. Despite advancements in visual fidelity, existing models struggle with maintaining realistic motion over extended denoising steps. The authors highlight that a 2-step generation can outperform a 50-step output in terms of physical consistency, a finding that has not been previously documented in the literature. This work is presented as a preprint, indicating that it has not yet undergone peer review.
Method
The core contribution is the PhaseLock framework, which operates without the need for extensive training. The authors utilize a two-step inference process to extract valid motion priors, which are then enforced during the denoising trajectory using Latent Delta Guidance. This method effectively mitigates phase degradation observed in traditional diffusion models, where phase information drops significantly (approximately 18%) from step 2 to step 50, while the magnitude remains stable. The framework is designed to maintain visual fidelity while improving physical consistency, with minimal computational overhead (1.06x time and 1.02x memory increase) compared to standard methods.
Results
PhaseLock demonstrates a significant improvement in physical consistency, achieving an average increase of 6.2 points across various models when compared to baseline outputs. The authors benchmark their results against standard diffusion models, although specific baseline models are not named in the summary. The method shows a substantial reduction in reliance on expensive external guidance methods, achieving approximately 5 times faster performance. These results indicate that PhaseLock not only enhances physical realism but also optimizes computational efficiency.
Limitations
The authors acknowledge that while PhaseLock improves physical consistency, it may not fully eliminate all forms of motion artifacts that can arise in complex scenarios. Additionally, the framework’s reliance on a two-step inference process may limit its applicability in contexts where longer inference times are necessary for other quality metrics. The paper does not address potential limitations related to the generalizability of the method across all types of diffusion models or its performance in real-world applications.
Why it matters
The implications of this work are significant for the development of more physically consistent generative models in computer vision and graphics. By demonstrating that a training-free approach can effectively preserve motion priors, this research opens avenues for further exploration into efficient generative techniques that prioritize realism. The findings encourage future work to investigate the balance between visual fidelity and physical accuracy in generative models, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 4, 2026 · Editorial standards →
Source: arXiv cs.CV