Notable evaluation benchmarks

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang

Published: Jun 1, 2026 — 17:40 UTC

Problem
The paper addresses a gap in how large-scale generative models are evaluated on low-level vision tasks, which demand pixel-wise precision. Although generative models have advanced in image generation and editing, their performance on low-level vision tasks remains underexplored. The authors present LL-Bench, a benchmark that provides a structured evaluation framework for these models. This work is a preprint and has not undergone peer review.

Method
LL-Bench consists of 2,469 real-world degraded images across 16 low-level degradation tasks, including noise reduction, deblurring, and super-resolution. The benchmark includes 28,919 restored images generated by 10 state-of-the-art large-scale generative models and 21 conventional restoration models. These images are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores, which together support evaluation of model performance against human judgment. The authors also introduce LL-Score, a multimodal large language model (MLLM)-based evaluator that assesses both restoration quality and the presence of hallucinations in generated images. LL-Score is trained to align more closely with human preferences, addressing the discrepancies observed in existing quality evaluation metrics.
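One simple way to turn pairwise preferences like those described above into a per-model ranking is to compute win rates. The sketch below is illustrative only: the summary does not say how LL-Bench aggregates its annotations, and the model names and the `win_rates` helper are hypothetical.

```python
from collections import defaultdict


def win_rates(preferences):
    """Compute each model's win rate from (winner, loser) pairs.

    `preferences` is an iterable of 2-tuples of model names, where the
    first name is the output the annotator preferred.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    # Fraction of comparisons each model won.
    return {model: wins[model] / comparisons[model] for model in comparisons}


# Hypothetical toy annotations, not data from the paper.
prefs = [
    ("gen_a", "conv_b"),
    ("gen_a", "conv_c"),
    ("conv_b", "conv_c"),
    ("gen_a", "conv_b"),
]

rates = win_rates(prefs)
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

Win rates ignore the strength of each opponent; a fitted model such as Bradley-Terry is a common alternative when the comparison graph is unbalanced.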

Results
The systematic evaluation using LL-Bench reveals that large-scale generative models exhibit distinct failure modes and performance boundaries compared with conventional restoration methods. The authors report that LL-Score correlates more closely with human ratings than existing image quality assessment metrics. Specific figures and comparisons to named baselines are not disclosed in the abstract. The authors suggest that LL-Score can serve as a reward model for training generative models on low-level vision tasks.
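Agreement between a quality metric and human ratings is commonly measured with Spearman rank correlation. The following is a minimal sketch of that calculation in plain Python; the summary does not state which correlation measure the authors use, and the scores below are invented for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five restored images, not data from the paper.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # same ordering as the human scores
metric_b = [0.2, 0.1, 0.4, 0.3, 0.5]  # two adjacent pairs swapped

rho_a = spearman(human, metric_a)
rho_b = spearman(human, metric_b)
print(rho_a, rho_b)
```

In this toy case `metric_a` orders the images exactly as the human scores do, while `metric_b` swaps two adjacent pairs and so scores lower.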

Limitations
The authors acknowledge that LL-Bench is limited to the specific low-level vision tasks included in the benchmark and may not generalize to all possible degradation scenarios. Additionally, while LL-Score shows promise, its effectiveness in broader contexts beyond the evaluated tasks remains to be established. The reliance on human annotations for quality assessment may also introduce subjectivity, which could affect the reproducibility of results.

Why it matters
LL-Bench and LL-Score give researchers a structured framework and a quality assessment tool intended to track human judgment more closely on low-level vision tasks. The work encourages further research into the capabilities and limitations of generative models in pixel-level tasks, and its findings point to the need for evaluation metrics that align with human perception. The full paper is available on arXiv.

By Callan Zhang · Jun 1, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV