Notable evaluation benchmarks

LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)

Wei Luo, Yiting Lu, Xin Li, Haoran Li, Fengbin Guan, Chen Gao

Published
May 6, 2026 — 17:52 UTC
Summary length
414 words
Relevance score
70%

Problem
This paper presents the LoViF 2026 PhyScore challenge, addressing a significant gap in the evaluation of world-model-generated videos. Current methodologies primarily focus on perceptual quality, neglecting critical aspects such as physical plausibility, temporal coherence, and alignment with input conditions. The challenge is particularly relevant because it seeks to establish a comprehensive metric that evaluates these dimensions jointly, an approach currently underexplored in the literature. This work is a preprint and has not undergone peer review.

Method
The challenge requires participants to develop a holistic quality assessment metric that predicts four key dimensions: Video Quality, Physical Realism, Condition-Video Alignment, and Temporal Consistency. Additionally, participants must localize the timestamps of physical anomalies for detailed diagnostics. The benchmark dataset comprises 1,554 videos generated by seven distinct world generative models, categorized into three tracks: text-to-2D, image-to-4D, and video-to-4D, covering 26 physics-relevant scenarios, including dynamics, optics, and thermodynamics. The dataset’s reliability is ensured through human annotation, supplemented by an automated quality-control process. Evaluation follows a composite protocol that combines TimeStamp_IOU for anomaly localization with the Spearman Rank Correlation Coefficient (SRCC) and Pearson Linear Correlation Coefficient (PLCC) for score prediction.
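The protocol's components can be sketched as follows. This is a generic illustration, not the challenge's official implementation: the function names, the (start, end) interval representation, and the score aggregation are assumptions, and the summary does not specify how the pieces are weighted into a final score.

```python
import numpy as np

def timestamp_iou(pred, gt):
    """Temporal IoU between predicted and ground-truth anomaly
    intervals, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def score_correlations(predicted, human):
    """Return (SRCC, PLCC) between predicted and human quality scores.

    SRCC is computed here as the Pearson correlation of ranks; ties are
    not rank-averaged (scipy.stats.spearmanr would handle ties properly).
    """
    x = np.asarray(predicted, dtype=float)
    y = np.asarray(human, dtype=float)
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of predictions
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of human scores
    srcc = np.corrcoef(rx, ry)[0, 1]
    plcc = np.corrcoef(x, y)[0, 1]
    return srcc, plcc

# Predicted anomaly at 2.0-4.0 s vs. annotated 3.0-5.0 s: overlap 1 s,
# union 3 s, so IoU = 1/3.
print(timestamp_iou((2.0, 4.0), (3.0, 5.0)))
```

SRCC rewards getting the ordering of videos right, PLCC rewards a linear fit to the human scores, and the interval IoU rewards localizing anomalies precisely; a composite of the three penalizes a metric that does well on only one of these axes.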

Results
While specific numerical results from the challenge submissions are not detailed in the abstract, the paper indicates that the evaluation framework allows for a robust comparison of methods across the defined dimensions. The challenge’s design aims to facilitate the development of metrics that can effectively capture the multifaceted nature of video quality in generative models, thus providing a benchmark for future research.

Limitations
The authors acknowledge that the challenge is still in its early stages, and the metrics developed may not fully capture the complexity of physical realism in generated videos. Additionally, the reliance on human annotation, while enhancing label reliability, may introduce subjective biases. The challenge also does not address how well the proposed metrics generalize to generative models beyond those included in the benchmark dataset.

Why it matters
The implications of the LoViF 2026 PhyScore challenge are significant for the field of generative modeling and video synthesis. By establishing a comprehensive evaluation framework that incorporates physical realism and temporal coherence, this work encourages the development of more sophisticated generative models that can produce videos with realistic dynamics. It sets a precedent for future research to prioritize holistic quality assessments, potentially leading to advancements in applications such as virtual reality, simulation, and automated content generation.

Authors: Wei Luo, Yiting Lu, Xin Li, Haoran Li, Fengbin Guan, Chen Gao, Xin Jin, Yong Li et al.
Source: arXiv:2605.05187
URL: https://arxiv.org/abs/2605.05187v1

Turing Wire
Author: Turing Wire editorial staff