Characterizing the Consistency of the Emergent Misalignment Persona
Problem: This preprint addresses a gap in understanding how consistent emergent misalignment (EM) is in large language models (LLMs) fine-tuned on narrowly misaligned datasets. While previous studies have established a...