Major alignment safety

Investigating and Alleviating Harm Amplification in LLM Interactions

Ruohao Guo, Wei Xu, Alan Ritter

Published: Jun 1, 2026 — 16:01 UTC

Problem
This work addresses the gap in understanding how large language models (LLMs) can amplify harm during multi-turn interactions, a concern that has been largely overlooked in existing literature. The authors highlight that LLMs can democratize domain expertise, enabling novices to generate harmful content, and can scale harmful operations beyond manual capabilities. The paper is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.

Method
The authors introduce HarmAmp, a benchmark designed to evaluate multi-turn harm amplification across twelve risk categories grounded in real-world threats. Each scenario in HarmAmp is constructed to meet specific criteria: substantive amplification, operational specificity, and the necessity of multi-turn interactions. To mitigate the identified risks, the authors propose TrajSafe, a proactive monitoring system that anticipates harmful trajectories in user interactions. TrajSafe employs techniques such as probing user intents and steering model outputs towards safer completions. The architecture details of TrajSafe are not disclosed, but the system is designed to operate alongside existing LLMs without significantly compromising their general capabilities.

Results
The experiments conducted demonstrate that TrajSafe effectively reduces harmful outputs in multi-turn interactions. While specific numerical results are not provided in the abstract, the authors claim a significant reduction in harmfulness compared to baseline models, alongside a low over-refusal rate. The preservation of the target model’s general capabilities suggests that TrajSafe can be integrated into existing systems with minimal performance degradation. The effectiveness of TrajSafe is evaluated against the HarmAmp benchmark, which serves as a novel metric for assessing harm amplification.

Limitations
The authors acknowledge that their approach may not cover all potential harm amplification scenarios, particularly those that are context-dependent or emergent in nature. Additionally, the reliance on user intent probing may introduce friction in user experience, potentially leading to user frustration. The paper does not address the computational overhead introduced by TrajSafe, which could impact deployment in resource-constrained environments. Furthermore, the benchmark scenarios may not encompass the full spectrum of real-world interactions, limiting the generalizability of the findings.

Why it matters
This research is significant as it provides a structured approach to understanding and mitigating harm amplification in LLM interactions, a critical issue as these models become more integrated into various applications. The introduction of HarmAmp as a benchmark allows for standardized evaluation of safety mechanisms in LLMs, paving the way for future research in this domain. The proactive nature of TrajSafe offers a promising strategy for enhancing the safety of LLMs, which is essential for their responsible deployment in sensitive contexts. This work contributes to the ongoing discourse on AI safety and ethics, as discussed in related literature, and is available on arXiv.

By Callan Zhang · Jun 1, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG