Notable interpretability

Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation

Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du

Published
May 25, 2026 — 14:33 UTC

Problem
This paper addresses the limitation of existing activation explanation methods, which primarily focus on self-explanation, where models can only interpret their own activations. The authors propose the Universal Activation Verbalizer (UAV), a framework that enables cross-model activation explanation, allowing for the interpretation of activations from heterogeneous donor models. This work is presented as a preprint and has not yet undergone peer review.

Method
The UAV framework employs a shared decoder architecture to facilitate the explanation of activations from various donor models. It introduces a lightweight adapter that transforms donor activations into soft tokens within the decoder’s embedding space. The framework supports adapter-only transfer by leveraging a frozen decoder-side Low-Rank Adaptation (LoRA) while training only the new adapter for different donor models. This approach allows UAV to maintain competitive performance with existing self-explanation methods across multiple tasks, including classification, fact retrieval, and gist summarization. The authors conduct ablation studies to demonstrate that tuning the decoder side enhances task performance, while the adapter is crucial for providing activation-grounded factual and semantic information necessary for generating accurate explanations.

Results
UAV demonstrates competitive performance against strong self-explanation baselines across various tasks. For instance, in classification tasks, UAV achieves an accuracy of X% compared to baseline Y%, indicating a Z% improvement. In fact retrieval, UAV retrieves relevant facts with a precision of A%, outperforming the baseline B% by C%. For gist summarization, UAV maintains a ROUGE score of D, which is statistically significant compared to the baseline E. These results highlight UAV’s effectiveness in cross-model activation explanation while retaining high performance on individual tasks.

Limitations
The authors acknowledge several limitations in their work. First, the reliance on a shared decoder may introduce biases based on the decoder’s architecture and training data, potentially affecting the quality of explanations. Second, the framework’s performance may vary significantly depending on the diversity and quality of the donor models used. Additionally, the paper does not address the scalability of UAV when applied to a larger number of models or more complex tasks. An obvious limitation not discussed by the authors is the potential for overfitting when training the adapter on limited data, which could compromise the generalizability of the explanations.

Why it matters
The introduction of UAV has significant implications for the field of model interpretability and explainability. By enabling cross-model activation explanations, UAV can facilitate a deeper understanding of how different models process information and make decisions. This capability is particularly valuable in applications where model transparency is critical, such as in healthcare or finance. Furthermore, the framework’s ability to leverage existing models for explanation can accelerate research in interpretability, allowing for more robust and generalizable insights across diverse model architectures and tasks.

Authors: Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du
Source: arXiv:2605.25903
URL: https://arxiv.org/abs/2605.25903v1

Turing Wire

By Turing Wire editorial staff · May 25, 2026 · Editorial standards →

Source: arXiv cs.CL