Notable interpretability

Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models

Yuetian Lu, Ali Modarressi, Yihong Liu, Hinrich Schütze

Published: Jun 2, 2026 — 15:35 UTC

Problem
The paper addresses the gap in understanding how factual recall operates within sparse mixture-of-experts (MoE) language models, a topic that has been primarily explored in dense transformer architectures. The authors highlight the need for a method to trace causal contributions of individual experts in MoE models, particularly when factual predictions are influenced by routed expert blocks. This work is presented as a preprint, indicating it has not yet undergone peer review.

Method
The authors propose a novel framework for expert-aware causal tracing in sparse MoE language models. They utilize the CounterFact dataset to introduce noise into subject-token embeddings, thereby corrupting the model’s factual preferences. The tracing process involves evaluating the restoration of true-vs-foil logit contrasts through clean outputs from MoE blocks and expert-level updates. The experiments focus on two models: Qwen3-30B-A3B-Base and Mixtral-8x7B-v0.1. For Qwen3, a layer sweep identifies layer 44 as significant, with expert-level tracing pinpointing L44E069 as a consistently selected expert whose contributions outperform other experts in the same layer. In contrast, for Mixtral, while a mid-layer signal is validated, it is not attributed to a single expert but rather to a coalition of routed multi-expert updates.

Results
The findings reveal that expert-aware causal tracing can effectively identify significant expert contributions in MoE models. For Qwen3-30B-A3B-Base, the expert L44E069 consistently enhances factual recall, outperforming other experts in the same layer. In the case of Mixtral-8x7B-v0.1, the results indicate that the mid-layer signal is not localized to a single expert, suggesting a more complex interaction among multiple experts. The paper does not provide specific quantitative metrics or effect sizes, focusing instead on the qualitative identification of expert contributions.

Limitations
The authors acknowledge that the expert-level localization of contributions is model- and protocol-dependent, indicating that findings may not generalize across different MoE architectures. Additionally, the reliance on noise corruption for tracing may introduce variability in results, and the lack of quantitative performance metrics limits the ability to assess the magnitude of improvements. The study also does not explore the implications of these findings on broader model interpretability or generalization.

Why it matters
This research contributes to the understanding of how sparse MoE architectures can be analyzed for causal contributions, which is crucial for improving model interpretability and performance in factual recall tasks. The expert-aware causal tracing framework has potential implications for future work in model design and evaluation, particularly in optimizing expert selection and routing strategies. This work lays the groundwork for further exploration of expert interactions in MoE models, as discussed in related literature on model interpretability and causal inference in machine learning, as published in arXiv.

By Callan Zhang · Jun 2, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL