Notable interpretability

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

Published: May 21, 2026 — 16:23 UTC

Problem
This paper addresses the lack of interpretability in vision-language models (VLMs), specifically the opaque internal semantics of their multimodal embeddings. Existing methods such as sparse autoencoders (SAEs) can extract interpretable features, but they typically require a higher-dimensional representation, which compromises the original geometric structure and introduces redundancy. The authors propose CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that disentangles embeddings without expanding their dimensionality. The work is a preprint and has not yet undergone peer review.

Method
CEDAR applies an invertible transformation with a top-$k$ sparsity bottleneck to produce disentangled embeddings. The transformation maps the original embedding space into a new basis in which semantic information is concentrated in axis-aligned coordinates, and it is designed to preserve the original geometry of the embeddings while improving interpretability. The method is applied to CLIP-like models, where individual coordinates can be mapped to textual concepts, and to generative models like BLIP, where the coordinates can be decoded into natural language descriptions. Training optimizes a reconstruction loss under sparsity constraints; training compute details are not disclosed.
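As a rough illustration of the idea, the sketch below pairs an invertible change of basis with a top-k bottleneck in NumPy. It assumes the transformation is a fixed random orthogonal matrix; CEDAR's actual learned transformation and training objective are not reproduced here, and the names `random_rotation`, `top_k`, and `encode_decode` are hypothetical.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Random orthogonal matrix (assumed stand-in for a learned transform)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs by the diagonal of R; q remains orthogonal.
    return q * np.sign(np.diag(r))


def top_k(z, k):
    """Zero all but the k largest-magnitude coordinates in each row."""
    idx = np.argpartition(np.abs(z), -k, axis=1)[:, -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out


def encode_decode(x, rotation, k):
    """Rotate, sparsify, and rotate back without changing dimensionality."""
    z = x @ rotation               # change of basis, same number of coordinates
    z_sparse = top_k(z, k)         # sparsity bottleneck
    x_hat = z_sparse @ rotation.T  # an orthogonal matrix is inverted by its transpose
    return z_sparse, x_hat


rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))   # toy "embeddings": 4 vectors of dimension 16
rot = random_rotation(16)
z_sparse, x_hat = encode_decode(x, rot, k=4)

rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print("nonzeros per row:", np.count_nonzero(z_sparse, axis=1))
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the matrix in this sketch is orthogonal, setting k equal to the embedding dimension reconstructs the input exactly; smaller values of k trade reconstruction fidelity for sparsity.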

Results
The authors report that CEDAR achieves a competitive reconstruction-sparsity trade-off against baseline methods and that its explanations align more closely with human perception than those of traditional methods. The abstract gives no quantitative results, so the size of any gain in reconstruction fidelity or interpretability cannot be judged from it.

Limitations
The authors acknowledge that CEDAR is a post-hoc method, so it applies only to embeddings from an already pretrained model. The top-$k$ sparsity bottleneck also requires choosing $k$, and the best value may differ across datasets and tasks, which could limit how well the method generalizes. The paper does not address the computational overhead of the invertible transformation, nor does it explore the scalability of CEDAR to larger models or more complex datasets.
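One common way to approach the choice of k is to sweep it and watch the reconstruction error. The sketch below does this on synthetic data, using an orthogonal basis taken from an SVD as an assumed stand-in for a learned transformation; the function `relative_error`, the candidate values of k, and the 0.5 tolerance are all hypothetical and not taken from the paper.

```python
import numpy as np


def relative_error(x, basis, k):
    """Relative reconstruction error after keeping the top-k coordinates per row."""
    z = x @ basis
    # Per-row threshold: the k-th largest magnitude in that row.
    thresh = np.sort(np.abs(z), axis=1)[:, -k][:, None]
    z_sparse = np.where(np.abs(z) >= thresh, z, 0.0)
    x_hat = z_sparse @ basis.T  # basis is orthogonal, so its transpose inverts it
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 32))  # synthetic data, not real VLM embeddings

# Orthogonal basis from the right singular vectors of the data.
_, _, vt = np.linalg.svd(x, full_matrices=False)
basis = vt.T

errors = {k: relative_error(x, basis, k) for k in (1, 2, 4, 8, 16, 32)}
for k, err in errors.items():
    print(f"k={k:2d}  relative error={err:.3f}")

# Smallest k in the sweep that meets an illustrative error tolerance.
tolerance = 0.5
smallest_k = min(k for k, err in errors.items() if err <= tolerance)
print("smallest k within tolerance:", smallest_k)
```

With an orthogonal basis the error can only shrink as k grows, so the sweep produces a monotone curve and the choice reduces to how much reconstruction error a given task can tolerate.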

Why it matters
If the reported results hold up, CEDAR offers a way to make VLM embeddings more interpretable without increasing their dimensionality, which matters for applications that require human oversight and trust. The approach could also inform future work on embedding interpretability and explainable models in areas such as image captioning and visual question answering.

Source: arXiv:2605.22679
URL: https://arxiv.org/abs/2605.22679v1

By Callan Zhang · May 21, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV