Notable other Hugging Face

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb

Published: May 26, 2026 — 15:09 UTC

Problem
This preprint addresses the gap in the integration of superpixel-based image classification with self-attentional models, specifically Vision Transformers (ViTs). While superpixel methods have traditionally utilized Graph Neural Networks (GNNs) for processing irregular image representations, the potential synergy between GNNs, superpixels, and transformers has not been fully explored. The authors aim to unify these paradigms to enhance image classification performance.

Method
The core technical contribution is the Superpixel Transformers (SPT) framework, which generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViTs. SPT supports arbitrary superpixel chunking strategies, connectivity graphs, and positional encodings. Key innovations include a multidimensional sine-cosine positional encoding that captures spatial relationships and an enriched patch data structure that incorporates superpixel shape and color information. The model is trained on datasets such as CIFAR10, FashionMNIST, and Imagenette, employing various superpixel generation techniques and graph connectivity strategies. The authors do not disclose specific training compute details.

Results
SPT demonstrates superior performance over previous superpixel-based GNN methods and remains competitive with state-of-the-art ViTs. For instance, on CIFAR10, SPT achieves an accuracy of 95.2%, outperforming SICGAT by 4.5% and matching the performance of leading ViT models. On FashionMNIST, SPT achieves 92.1% accuracy, surpassing SICGAT by 3.2%. The results indicate that constrained graph connectivity can enhance ViT performance, with effect sizes showing a consistent improvement across all tested datasets.

Limitations
The authors acknowledge that while SPT improves upon SICGAT, it may still be sensitive to the choice of superpixel generation methods and graph connectivity strategies. Additionally, the framework’s performance may vary with different datasets, and the computational efficiency of SPT compared to pure ViTs is not explicitly evaluated. The authors do not address potential scalability issues when applied to larger datasets or real-time applications.

Why it matters
This work has significant implications for the development of hybrid attentional frameworks in computer vision. By bridging the gap between superpixel-based methods and transformers, SPT opens avenues for cross-domain generalization, potentially enhancing performance in tasks requiring fine-grained image analysis. The integration of superpixel information could lead to more efficient models that leverage both local and global contextual information, paving the way for future innovations in image classification and related fields.

Authors: Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb
Source: arXiv:2605.27144
URL: https://arxiv.org/abs/2605.27144v1

By Callan Zhang · May 26, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG