Channel-wise Vector Quantization
Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Ming Li
- Published
- May 25, 2026 — 17:52 UTC
Problem
This paper addresses the limitations of conventional vector quantization (VQ) methods in image tokenization, specifically the reliance on patch-wise tokens that can lead to suboptimal representation of visual details. The authors propose Channel-wise Vector Quantization (CVQ) as a novel approach that quantizes each channel of the feature map instead of individual patches. This work is presented as a preprint and has not yet undergone peer review.
Method
The core technical contribution is the introduction of CVQ, which utilizes a 16K+ codebook size to achieve 100% codebook utilization. This method allows for a more nuanced representation of images by capturing discrete levels of visual detail across channels rather than relying on spatial patches. The authors also propose a new visual autoregressive framework called Channel-wise Autoregressive (CAR), which employs a “next-channel prediction” strategy. In this framework, the model predicts image channels sequentially, starting with a rough global structure and progressively refining fine-grained attributes. The training process and specific loss functions are not detailed in the abstract, but the architecture is designed to enhance the quality of image generation through this channel-wise approach.
Results
Empirical evaluations demonstrate that CVQ significantly improves reconstruction quality compared to traditional VQ methods. The CAR model achieves a DPG score of 86.7 and a GenEval score of 0.79, indicating strong performance in text-to-image generation tasks. These results suggest that the proposed methods outperform existing baselines, although specific baseline models are not named in the abstract.
Limitations
The authors acknowledge that their approach may have limitations in terms of computational efficiency and scalability, particularly when dealing with very high-resolution images or real-time applications. Additionally, the reliance on a large codebook size may introduce challenges in terms of memory usage and inference speed. The paper does not discuss potential issues related to the generalization of the model across diverse datasets or the robustness of the generated images under various conditions.
Why it matters
The introduction of CVQ and the CAR framework has significant implications for the field of image generation and tokenization. By moving away from patch-wise representations, this work opens avenues for more efficient and effective image synthesis methods that can better capture the intricacies of visual details. The sequential channel prediction approach mimics human artistic processes, potentially leading to more natural and coherent image generation. This research could influence future developments in generative models, particularly in applications requiring high fidelity and detail in visual outputs.
Authors: Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Ming Li, Jiaqi Wang, Kaicheng Yu
Source: arXiv:2605.26089
URL: https://arxiv.org/abs/2605.26089v1
By Turing Wire editorial staff · May 25, 2026 · Editorial standards →
Source: arXiv cs.AI