Notable alignment safety

Towards Emotion Consistency Analysis of Large Language Models in Emotional Conversational Contexts

Sneha Oram, Ojaswita Bhushan, Pushpak Bhattacharyya

Published: May 7, 2026, 16:01 UTC
Summary length: 442 words
Relevance score: 80%

Problem
This preprint examines the emotional consistency of Large Language Models (LLMs) when they generate responses in emotionally charged conversational contexts. While prior research has explored LLM performance across many tasks, there has been little investigation of how these models handle emotionally nuanced queries, particularly queries containing false presuppositions. The authors analyze how consistent LLM responses remain when the models are re-prompted with their own generated text, focusing on the impact of emotional intensity on model behavior.

Method
The authors employ an experimental framework in which text generated by an LLM is reintroduced as a query to the same model. The process uses three false-claim queries that vary in emotional intensity, spanning moderate and extreme emotion. The study evaluates two commercial models, Claude-3.5-haiku and GPT4o-mini, alongside a medium-sized open model, Mistral-7B. The analysis also includes an extended attention-score-based evaluation to assess whether the models prioritize evaluative or generative behavior in response to emotionally charged inputs. The training compute and specific loss functions of the models are not disclosed, which limits reproducibility.
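A minimal Python sketch of this generate-then-reintroduce loop is shown below to make the setup concrete. The query_model wrapper, the example false-presupposition queries, and the dummy model are hypothetical stand-ins; the paper's actual prompts, model interfaces, and consistency scoring are not given in this summary.

from typing import Callable, Tuple

# Hypothetical false-presupposition queries at two emotional intensities.
# (Illustrative only; the three queries used in the paper are not reproduced here.)
QUERIES = {
    "extreme": "I'm devastated that the advice you gave me ruined my interview!",
    "moderate": "I'm a bit upset that the advice you gave me didn't help my interview.",
}

def consistency_probe(query_model: Callable[[str], str], query: str) -> Tuple[str, str]:
    """Run one generate-then-reintroduce cycle on a single model.

    Step 1: ask the model an emotionally charged query containing a false claim.
    Step 2: feed the model's own reply back to it as a fresh query.
    Both responses are returned so they can be compared for emotional and
    factual consistency (the paper's scoring procedure is not specified here).
    """
    first_response = query_model(query)            # initial generation
    second_response = query_model(first_response)  # model's own text reintroduced
    return first_response, second_response

if __name__ == "__main__":
    # Stand-in model for illustration; a real run would call the relevant chat API.
    def dummy_model(prompt: str) -> str:
        return f"[model reply to: {prompt[:40]}...]"

    for intensity, query in QUERIES.items():
        first, second = consistency_probe(dummy_model, query)
        print(intensity, "->", first, "|", second)

In practice, the returned response pairs would be scored for whether the model's emotional stance and its treatment of the false presupposition stay consistent across the two turns.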

Results
The findings reveal that the LLMs exhibit below-average performance in maintaining emotional consistency, particularly when responding to moderate emotional content. The models show a significant vulnerability to false beliefs embedded in the queries, with performance metrics indicating a marked decline in response accuracy as emotional intensity increases. The authors do not provide specific numerical results or effect sizes against established baselines, which would have strengthened the empirical claims. However, the qualitative analysis suggests a notable shift in the models’ focus from evaluative responses to generative outputs, indicating a potential misalignment in handling emotionally sensitive queries.

Limitations
The authors acknowledge several limitations, including the restricted scope of emotional contexts examined and the reliance on a limited set of models. The lack of quantitative performance metrics and comparisons to established benchmarks is a significant oversight, as it hinders the ability to contextualize the findings within the broader landscape of LLM capabilities. Additionally, the study does not explore the implications of model size or architecture on emotional consistency, nor does it consider the potential influence of training data diversity on model responses.

Why it matters
This work has critical implications for the deployment of LLMs in high-stakes environments where emotional sensitivity is paramount, such as mental health applications or customer service interactions. The demonstrated susceptibility of LLMs to false presuppositions in emotionally charged contexts raises concerns about their reliability and trustworthiness. Future research should focus on enhancing the emotional robustness of LLMs, potentially through targeted training strategies or architectural modifications, to ensure their safe and effective use in sensitive applications.

Authors: Sneha Oram, Ojaswita Bhushan, Pushpak Bhattacharyya
Source: arXiv:2605.06476
URL: https://arxiv.org/abs/2605.06476v1

Turing Wire
Author: Turing Wire editorial staff