Notable interpretability

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Sida Liu, Feijiang Han

Published
Jun 10, 2026 — 06:53 UTC

Problem
The paper addresses the challenge of interpreting language model representations, a critical aspect for understanding and controlling model behavior. Existing methods, particularly Sparse Autoencoders (SAEs), require extensive training and storage of large overcomplete dictionaries, which hinders rapid exploration of interpretable directions. This work is particularly relevant as it is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.

Method
The authors propose ICALens, a workflow that utilizes independent component analysis (ICA) to identify non-Gaussian directions in language model activations. The method integrates an optimized GPU-parallel FastICA implementation tailored for language models, along with stability enhancements and fitting diagnostics specific to LLMs. This approach allows for layer-wise analysis of models like GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base without the need for per-layer gradient-based dictionary training, thus streamlining the interpretability process.

Results
ICALens demonstrates competitive performance against existing SAEs on the SAEBench benchmark. Specifically, it achieves comparable results in sparse probing tasks and outperforms SAEs in targeted probe perturbation when operating under small-to-medium budget constraints. The authors report that ICA can recover compact, human-interpretable directions efficiently, suggesting that ICA should be reconsidered as a robust tool for language model interpretability rather than a mere baseline.

Limitations
The authors acknowledge that while ICA provides a powerful lens for interpretability, it may not capture all aspects of model behavior, particularly in highly complex or nuanced scenarios. Additionally, the reliance on non-Gaussianity as a criterion for interpretable directions may overlook other potentially informative structures. The paper does not address the scalability of ICALens to larger models or its performance across diverse datasets, which could be critical for broader applicability.

Why it matters
The introduction of ICALens has significant implications for the field of model interpretability, offering a more efficient and effective method for analyzing language models without the overhead of additional training. This work encourages researchers to reconsider the utility of ICA in the context of LLMs, potentially leading to new insights into model behavior and guiding future interpretability research. The findings are particularly relevant for ongoing efforts to enhance transparency in AI systems, as published in arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 10, 2026 · Editorial standards →

Source: arXiv cs.CL