Notable evaluation benchmarks

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Ziyun Zeng, Yiqi Lin, Guoqiang Liang, Mike Zheng Shou

Published: May 7, 2026, 16:35 UTC
Summary length: 423 words
Relevance score: 80%

Problem
This paper addresses the lack of high-quality datasets and benchmarks for video background replacement, a task central to film production and advertising. Existing datasets such as OpenVE-3M focus primarily on local editing and style transfer, so models trained on them tend to generate static, unnatural backgrounds. The authors attribute the weak performance of state-of-the-art models such as Kiwi-Edit to the absence of precise background guidance during data synthesis. This work is a preprint and has not yet undergone peer review.

Method
The authors propose a data-synthesis pipeline that decouples foreground and background guidance, enabling finer control over the quality of the synthesized backgrounds. The pipeline applies strict quality filters so that generated backgrounds remain temporally consistent and interact plausibly with the foreground elements. The resulting dataset, Sparkle, contains roughly 140,000 video pairs spanning five common background-change themes. The authors also introduce Sparkle-Bench, an evaluation benchmark designed specifically for background replacement tasks. The summary does not disclose the trained model's architecture or training compute; a rough sketch of what the pipeline's quality gates might look like follows.
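The summary does not specify how the quality filtering works. As a rough illustration only, the minimal NumPy sketch below shows one way such filters could be implemented: temporal consistency approximated as mean cosine similarity between consecutive frames, and foreground fidelity as mean pixel error inside a foreground mask. The function names, metrics, and thresholds are hypothetical assumptions, not taken from the paper.

```python
import numpy as np

def temporal_consistency(frames: np.ndarray) -> float:
    """Mean cosine similarity between consecutive frames of shape (T, H, W, C).

    Hypothetical stand-in for the paper's (unspecified) stability filter;
    values near 1.0 indicate a temporally stable background.
    """
    flat = frames.reshape(len(frames), -1).astype(np.float64)
    a, b = flat[:-1], flat[1:]
    sims = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return float(sims.mean())

def foreground_error(source: np.ndarray, edited: np.ndarray,
                     fg_mask: np.ndarray) -> float:
    """Mean absolute pixel error inside the foreground mask (lower is better);
    a large value suggests the edit corrupted the foreground subject."""
    diff = np.abs(source.astype(np.float64) - edited.astype(np.float64))
    return float(diff[fg_mask].mean())

def keep_pair(source: np.ndarray, edited: np.ndarray, fg_mask: np.ndarray,
              min_consistency: float = 0.95, max_fg_error: float = 8.0) -> bool:
    """Accept a (source, edited) video pair only if the new background is
    stable over time AND the foreground is preserved. Both thresholds are
    illustrative guesses, not values from the paper."""
    return (temporal_consistency(edited) >= min_consistency
            and foreground_error(source, edited, fg_mask) <= max_fg_error)

# Toy usage: a perfectly static video with an untouched foreground passes.
T, H, W = 16, 64, 64
frame = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)
video = np.repeat(frame[None], T, axis=0)   # identical frames
fg_mask = np.zeros((T, H, W), dtype=bool)
fg_mask[:, 16:48, 16:48] = True             # hypothetical subject box
print(keep_pair(video, video, fg_mask))     # True
```

In a real pipeline the consistency check would more likely use optical-flow warping error or a learned video-quality model rather than raw cosine similarity, but the gating structure, score each synthesized pair and discard those below threshold, would be the same.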

Results
Experiments show that the model trained on Sparkle significantly outperforms existing baselines on both OpenVE-Bench and the newly introduced Sparkle-Bench. The authors report substantial gains, though the abstract provides no headline numbers or effect sizes. The results suggest the approach remedies prior models' weakness in background synthesis by producing higher-quality, more natural backgrounds.

Limitations
The authors acknowledge that the quality of the generated backgrounds is contingent on the effectiveness of their decoupled guidance approach. They do not discuss potential limitations related to the generalizability of the dataset across diverse video genres or the computational resources required for training the proposed model. Additionally, the reliance on a specific set of background-change themes may limit the applicability of the dataset to broader contexts.

Why it matters
This work has significant implications for the field of video editing and synthesis, particularly in enhancing the capabilities of models for background replacement tasks. By providing a large-scale, high-quality dataset and a tailored benchmark, the authors facilitate further research and development in this area, potentially leading to advancements in creative applications that require realistic video manipulation. The open-sourcing of the dataset and benchmark encourages community engagement and iterative improvements, which could accelerate progress in the domain of instruction-guided video editing.

Source: arXiv:2605.06535
URL: https://arxiv.org/abs/2605.06535v1

Turing Wire
Author: Turing Wire editorial staff