FGSVQA: Frequency-Guided Short-form Video Quality Assessment
Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull
- Published
- May 19, 2026 — 15:44 UTC
Problem
This paper addresses the gap in video quality assessment (VQA) methodologies specifically tailored for short-form user-generated content (UGC). Traditional VQA approaches often struggle with the unique challenges posed by short-form videos, including rapid content variation, complex generation pipelines, and mixed distortions. The authors present a preprint work that aims to fill this gap by proposing a novel framework that leverages frequency domain information to enhance quality assessment.
Method
The proposed framework, FGSVQA, is an end-to-end VQA system that utilizes a dense visual encoder based on the Contrastive Language-Image Pretraining (CLIP) architecture. The core technical contribution lies in the integration of frequency domain compression priors to create artifact- and structure-aware weight maps for feature aggregation. The architecture decomposes visual features into three distinct branches: artifact, structure, and original visual features. A learned gating module adaptively fuses these branches over time, allowing the model to focus on relevant features for quality prediction. The training process involves optimizing a loss function that balances the contributions of these branches, although specific details on the loss formulation and training compute are not disclosed.
Results
The FGSVQA framework demonstrates strong performance on short-form video datasets, achieving a Spearman Rank Correlation Coefficient (SRCC) of 0.736 and a Pearson Linear Correlation Coefficient (PLCC) of 0.787. These metrics indicate a significant improvement over existing baselines, although specific baseline models are not named in the summary. The authors also highlight the efficiency of their method, suggesting that it maintains a competitive inference runtime, which is critical for real-time applications in video quality assessment.
Limitations
The authors acknowledge that their approach may not generalize well to long-form videos or videos with extreme distortions, as the model is specifically designed for short-form content. Additionally, the reliance on frequency domain features may limit the model’s performance in scenarios where such features are less informative. The paper does not discuss the potential impact of varying video resolutions or frame rates on the model’s performance, which could be an important consideration for practical applications.
Why it matters
The FGSVQA framework has significant implications for the field of video quality assessment, particularly in the context of short-form UGC, which is increasingly prevalent on social media platforms. By addressing the unique challenges associated with this type of content, the proposed method could enhance user experience by enabling more accurate quality predictions. Furthermore, the integration of frequency domain information may inspire future research to explore similar approaches in other domains of multimedia quality assessment, potentially leading to more robust and adaptable VQA systems.
Authors: Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull
Source: arXiv:2605.20016
URL: https://arxiv.org/abs/2605.20016v1
By Turing Wire editorial staff · May 19, 2026 · Editorial standards →
Source: arXiv cs.CV