StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset
Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge
- Published
- Jun 4, 2026 — 16:12 UTC
Problem
Existing Video Question Answering (VideoQA) systems excel at factoid questions but struggle with deep video understanding (DVU), which necessitates comprehension of complex narratives across long video content. The authors identify a significant gap in the literature regarding the scalability and diversity of DVU datasets, particularly for movies, which are often underrepresented in existing benchmarks. This work is a preprint and has not undergone peer review.
Method
The authors present StoryMindv2, an enhanced multi-agent collaboration framework designed to automatically generate high-quality DVU datasets. This framework incorporates a supervisor-guided generation mechanism and a refined multi-reviewer voting strategy to ensure the quality of question-answer pairs (QAs). The resulting dataset, StoryVideoQA, is the largest of its kind, comprising over 363,000 QAs derived from 393.2 hours of diverse story videos, including both TV series (average duration of 1,635 seconds) and movies (average duration of 7,878 seconds). To address the limitations of existing VideoQA methods, the authors propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, facilitating efficient storyline reasoning.
Results
The authors conduct comprehensive evaluations of 20 state-of-the-art VideoQA methods on the StoryVideoQA benchmark. The results indicate that these methods struggle to maintain long-range character associations and construct coherent understandings of complex storylines. While specific performance metrics are not disclosed in the abstract, the implication is that existing models exhibit significant deficiencies when applied to the newly introduced dataset, highlighting the need for improved methodologies in DVU.
Limitations
The authors acknowledge that while StoryMindv2 enhances the generation of QAs, the performance of existing VideoQA models remains inadequate for complex narratives. They do not address potential biases in the dataset generation process or the generalizability of the PlotTree agent across different genres or video formats. Additionally, the reliance on automated methods for dataset generation may introduce noise or inconsistencies in the QAs.
Why it matters
The introduction of StoryVideoQA and the PlotTree agent represents a significant advancement in the field of deep video understanding, providing a robust benchmark for future research. This work lays the groundwork for developing more sophisticated VideoQA systems capable of handling complex narratives, which is crucial for applications in content analysis, interactive storytelling, and AI-driven video summarization. The implications of this research extend to various domains, including entertainment, education, and information retrieval, as published in arXiv cs.CV.
By Turing Wire editorial staff · Jun 4, 2026 · Editorial standards →
Source: arXiv cs.CV