Notable alignment safety

Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures

Xueping Gao

Published: Jun 15, 2026 — 16:07 UTC

Problem
This work addresses the gap in understanding how different large language model (LLM) architectures encode high-level concepts structurally. The authors highlight a geometric-functional universality dissociation, where moderate geometric convergence exists alongside near-perfect functional transfer. This study is particularly relevant as it is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.

Method
The core technical contribution is the development of contrastive-difference CKA (CKA_Delta), a diagnostic tool that computes kernel alignment based on per-sample contrastive differences. This method allows for the isolation of concept-specific convergence from generic similarity, achieving significant discrimination in scenarios where standard CKA fails. The authors apply CKA_Delta across six concept domains, including both instruction-based and non-instruction concepts (e.g., code vs. natural language, reasoning vs. recall). The analysis is conducted without system prompts, enhancing the robustness of the findings. The authors also note that a single pair of models (70B–70B) suggests that universality may improve with model scale, although further validation with additional models of 70B or larger is necessary.

Results
The results demonstrate that CKA_Delta achieves significant geometric discrimination across five of the six tested concept domains, with p-values ≤ 0.017. The safety domain shows a converging-functional trend with a p-value of 0.08. The method is positioned as a regime classifier and architectural outlier detector, achieving a Gemma score of d = 1.08 and an area under the curve (AUC) of 0.79. These metrics indicate that CKA_Delta can effectively differentiate between architectures based on their structural alignment of concepts.

Limitations
The authors acknowledge that their findings are based on a limited number of models, particularly the reliance on a single 70B–70B pair, which necessitates further replication with additional models of similar scale. They also note that while CKA_Delta serves as a practical diagnostic tool, it should not be viewed as an absolute predictor of transfer accuracy. The study’s preprint status means that the results are preliminary and subject to change upon peer review.

Why it matters
The implications of this research are significant for the field of machine learning, particularly in understanding the structural properties of LLMs and their ability to transfer knowledge across different architectures. By providing a training-free diagnostic for monitoring concept alignment, CKA_Delta can facilitate the development of more robust and interpretable models. This work contributes to the ongoing discourse on model interpretability and the structural underpinnings of language understanding, as published in arXiv.

By Callan Zhang · Jun 15, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL