Notable multimodal

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan

Published
May 14, 2026 — 16:36 UTC

Problem
This paper addresses the challenge of long-horizon animated video generation, specifically the issues of low-level quality drift and high-level semantic drift that arise in chunk-based generation methods. The authors highlight that existing techniques struggle to maintain visual quality and character identity over extended durations, leading to inconsistencies in both static backgrounds and character attributes. This work is presented as a preprint, indicating it has not yet undergone peer review.

Method
The core contribution of EverAnimate is a post-training method that employs a persistent latent context memory to anchor the generation process. This method consists of two key mechanisms:

  1. Persistent Latent Propagation: This mechanism maintains a context memory across video chunks, allowing for the propagation of character identity and motion in latent space while reducing temporal forgetting.
  2. Restorative Flow Matching: This introduces an implicit restoration objective during the sampling process, which adjusts the velocity of generated frames to enhance fidelity within each chunk. The authors utilize lightweight LoRA (Low-Rank Adaptation) tuning to optimize the model, which allows for efficient adaptation without extensive retraining.

Results
EverAnimate demonstrates significant improvements over state-of-the-art long-animation methods on both short- and long-horizon benchmarks. Specifically, at 10 seconds, it achieves an 8% improvement in PSNR (Peak Signal-to-Noise Ratio) and a 7% improvement in SSIM (Structural Similarity Index), while reducing LPIPS (Learned Perceptual Image Patch Similarity) and FID (Fréchet Inception Distance) by 22% and 11%, respectively. For 90 seconds of animation, the improvements are even more pronounced, with gains of 15% in PSNR, 15% in SSIM, and reductions of 32% in LPIPS and 27% in FID. These results indicate a substantial enhancement in both visual quality and semantic consistency compared to existing methods.

Limitations
The authors acknowledge that while EverAnimate effectively mitigates drift, it may still be susceptible to limitations inherent in the underlying generative model. They do not discuss potential computational costs associated with the persistent context memory or the scalability of the method to even longer animations. Additionally, the reliance on LoRA tuning may limit the applicability of the approach to models that are amenable to such adaptations.

Why it matters
The implications of this work are significant for the field of animated video generation, particularly in applications requiring long-duration content, such as film and video game production. By addressing the critical issues of drift in character identity and visual quality, EverAnimate paves the way for more coherent and visually appealing long-form animations. This advancement could lead to broader adoption of AI-generated content in creative industries, enhancing the capabilities of animators and content creators.

Authors: Wuyang Li, Yang Gao, Mariam Hassan, Lan Feng, Wentao Pan, Po-Chien Luan, Alexandre Alahi
Source: arXiv:2605.15042
URL: https://arxiv.org/abs/2605.15042v1

Turing Wire

By Turing Wire editorial staff · May 14, 2026 · Editorial standards →

Source: arXiv cs.AI