Major alignment safety

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

Published: Jun 10, 2026 — 17:15 UTC

Problem
This paper addresses safety degradation in large language models (LLMs) that have been fine-tuned for specific domains, particularly in their responses to harmful prompts. Existing inference-time defenses based on logit mixing require the models involved to share a vocabulary, which rules out cross-family specialists, where safety is most compromised. The authors present ALIGNBEAM, an approach that requires no retraining and lifts the shared-vocabulary restriction, making it applicable to a broader range of model families. This work is a preprint and has not undergone peer review.

Method
At each decoding step, ALIGNBEAM translates the logits of a safe anchor model, token by token, into the vocabulary of the target model. The weights of both models are left unmodified, so each keeps its original performance characteristics. A small LLM judge evaluates K candidate continuations and selects the safest option based on the translated logits. The safety-utility trade-off can be tuned at deployment, which gives operators flexibility in real-world applications. Because the method is training-free, the authors do not disclose specific architectural details or training compute requirements.
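As a rough illustration of the decoding step described above, the Python sketch below translates anchor scores into a target vocabulary, mixes them with the target's own scores, and lets a judge choose among the top K candidates. It is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not specify the translation rule, the mixing formula, or the judge, so the prefix-matching rule, the log-linear mix controlled by `alpha`, the `FLOOR` value, the `toy_judge` stub, and every function name here are hypothetical. The candidates are also simplified to single tokens rather than multi-token continuations.

```python
import math

FLOOR = -1e9  # assumed score for target tokens with no anchor counterpart


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def translate_logits(anchor_logits, anchor_vocab, target_vocab):
    """Re-express anchor scores over the target vocabulary.

    Hypothetical rule: use the anchor token that exactly matches the target
    token, else the longest anchor token that is a prefix of it, else FLOOR.
    """
    anchor_index = {tok: i for i, tok in enumerate(anchor_vocab)}
    translated = []
    for tok in target_vocab:
        score = FLOOR
        for end in range(len(tok), 0, -1):  # longest prefix first
            i = anchor_index.get(tok[:end])
            if i is not None:
                score = anchor_logits[i]
                break
        translated.append(score)
    return translated


def mixed_scores(target_logits, translated_anchor_logits, alpha):
    """Assumed log-linear mix: alpha=0 is target only, alpha=1 is anchor only."""
    t = log_softmax(target_logits)
    a = log_softmax(translated_anchor_logits)
    return [(1 - alpha) * ti + alpha * ai for ti, ai in zip(t, a)]


def pick_token(scores, target_vocab, judge, k):
    """Take the k best-scoring tokens, then let the judge pick the safest.

    Ties in the judge's rating are broken by the mixed score.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    best = max(top, key=lambda i: (judge(target_vocab[i]), scores[i]))
    return target_vocab[best]


def toy_judge(text):
    """Stand-in for the small LLM judge: rates refusal-style openers as safe."""
    return 1.0 if text in {"Sorry", "I"} else 0.0


# Toy vocabularies from two different "families" (illustrative values only).
target_vocab = ["Sure", "Sorry", "Here", "I"]
anchor_vocab = ["Sor", "Sure", "I", "Here", "ry"]
target_logits = [3.0, 0.0, 2.0, 0.5]
anchor_logits = [3.0, -1.0, 2.0, -2.0, 0.0]

translated = translate_logits(anchor_logits, anchor_vocab, target_vocab)
for alpha in (0.0, 0.5, 1.0):
    scores = mixed_scores(target_logits, translated, alpha)
    print(alpha, pick_token(scores, target_vocab, toy_judge, k=2))
```

In this sketch, `alpha` and `k` play the role of the deployment-time knobs: a larger `alpha` pulls the output toward the anchor's preferences, and a larger `k` gives the judge more candidates to choose from. Neither model's weights are touched at any point.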

Results
The evaluation reports improved safety on both cross-vocabulary and same-vocabulary model pairs. Refusal rates on adversarial benchmarks rise substantially relative to baseline models, although the abstract gives no specific figures. Task accuracy remains consistent with existing models, and inference overhead is reported to be within practical limits for real-world deployment.

Limitations
The authors acknowledge that, because ALIGNBEAM involves no retraining, it may adapt poorly to rapidly evolving adversarial techniques. Its reliance on a small LLM judge is also a potential bottleneck, since the safety evaluation is only as effective as the judge’s capabilities allow. The paper does not explore how model size or architecture affects the performance of ALIGNBEAM, which could be a significant factor in diverse applications.

Why it matters
ALIGNBEAM matters for deploying LLMs in safety-critical applications because it enhances safety without retraining, which allows existing models to be used in new contexts. The work opens avenues for further research into inference-time safety mechanisms and cross-family model interactions, potentially leading to more robust AI systems. The findings are relevant to ongoing discussions about the safety of AI systems. The preprint appears in arXiv cs.AI.

By Callan Zhang · Jun 10, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI